In-Context Reinforcement Learning for Tool Use in Large Language Models
ICRL is a novel, RL-only framework that teaches large language models (LLMs) to use external tools effectively, without requiring costly supervised fine-tuning (SFT). By leveraging few-shot in-context examples during initial RL rollouts and gradually reducing them to a zero-shot setting, ICRL enables LLMs to autonomously internalize tool-use strategies.
Executive Impact: Unleashing LLM Potential
This framework significantly outperforms traditional SFT+RL pipelines and other baselines on challenging QA and reasoning benchmarks, demonstrating state-of-the-art accuracy and superior data efficiency. ICRL scales effectively with model size and generalizes across diverse domains, including web search and code execution for mathematical problems, offering a scalable, supervision-light alternative.
Deep Analysis & Enterprise Applications
In-Context Reinforcement Learning (ICRL) is an innovative, RL-only framework designed to teach Large Language Models (LLMs) to effectively use external tools. Unlike traditional methods that depend on costly supervised fine-tuning (SFT) for initial tool-use instruction, ICRL integrates few-shot demonstrations directly into the reinforcement learning rollout process. This approach provides soft supervision, guiding the model toward successful tool invocation and structured reasoning from the outset. As training progresses, the explicit in-context examples are gradually phased out, enabling the LLM to transition from guided imitation to autonomous, zero-shot tool use. The framework is optimized using GRPO with loss masking, focusing learning on model-generated tokens for greater efficiency and stability.
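The loss-masking idea can be illustrated with a minimal sketch. This is a deliberate simplification (it omits the clipped importance ratio and KL penalty that GRPO normally includes); the function name and shapes are hypothetical, but it shows the two mechanics the paragraph describes: group-relative advantages and a token mask restricting the loss to model-generated tokens.

```python
import numpy as np

def grpo_masked_loss(logprobs, token_mask, rewards):
    """Simplified GRPO objective with loss masking (illustrative only).

    logprobs:   (G, T) per-token log-probs for a group of G rollouts
    token_mask: (G, T) 1.0 for model-generated tokens, 0.0 for prompt,
                in-context demonstration, and tool-output tokens
    rewards:    (G,) scalar reward per rollout
    """
    # Group-relative advantage: normalize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Mask out tokens the model did not generate, so gradients flow only
    # through the model's own actions.
    per_token = -(adv[:, None] * logprobs) * token_mask
    return per_token.sum() / token_mask.sum()
```

Because demonstration and tool-output tokens are masked, the few-shot examples shape the rollouts without ever being directly optimized against, which is what makes phasing them out later safe.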
A core component of ICRL is its multi-stage curriculum design, which systematically reduces the number of in-context demonstrations provided to the model during RL rollouts. Training begins with a small set of few-shot examples (e.g., 3-shot), guiding the model in structured tool invocation. As the model gains proficiency, these examples are progressively reduced (e.g., to 2-shot, then 0-shot). This gradual reduction fosters the internalization of tool-use strategies, allowing the model to independently reason and produce structured outputs without relying on prompt-based scaffolding. Ablation studies reveal that a simpler, less aggressive reduction schedule (e.g., 3→2→0 stages) leads to superior performance, enabling the model to explore longer reasoning paths and achieve higher accuracy compared to more rapid reductions.
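The staged 3→2→0 reduction can be sketched as a simple step-indexed schedule. The step boundaries below are illustrative placeholders, not values from the paper, and `build_rollout_prompt` is a hypothetical helper:

```python
def shots_for_step(step, schedule=((0, 3), (200, 2), (400, 0))):
    """Return how many in-context demonstrations to include at a given
    RL training step. `schedule` is a sequence of (start_step, n_shots)
    pairs mirroring the paper's 3 -> 2 -> 0 curriculum; the boundaries
    here are assumptions for illustration."""
    n = schedule[0][1]
    for start, shots in schedule:
        if step >= start:
            n = shots
    return n

def build_rollout_prompt(question, demos, step):
    """Assemble the rollout prompt with the scheduled number of demos."""
    k = shots_for_step(step)
    return "\n\n".join(demos[:k] + [f"Question: {question}"])
```

At step 0 the prompt carries three worked demonstrations; by the final stage the model sees only the question, matching the zero-shot inference setting.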
ICRL demonstrates state-of-the-art performance across a wide range of challenging QA and reasoning benchmarks, significantly outperforming strong SFT+RL baselines and direct prompting methods. For instance, on the Qwen2.5-7B model, ICRL achieves an average Exact Match (EM) accuracy of 49.12%, a substantial improvement of +7.34% over the strongest competitor. The framework exhibits particular strength in multi-hop reasoning tasks, where it secures double-digit improvements on datasets like TriviaQA, 2Wiki, and Musique. Furthermore, ICRL's effectiveness extends beyond web search, showing robust performance in math reasoning tasks by leveraging code execution as a tool, matching or exceeding SFT+RL baselines like ReTool without any supervised pretraining.
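For context on the reported numbers, Exact Match is conventionally scored with SQuAD-style answer normalization. The sketch below assumes that convention; the paper's exact normalization details may differ.

```python
import re
import string

def normalize(text):
    """SQuAD-style normalization: lowercase, strip punctuation and the
    articles a/an/the, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))
```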
A key advantage of ICRL is its remarkable data efficiency. By eliminating the need for supervised fine-tuning (SFT) and its associated large volumes of labeled data, ICRL offers a scalable and cost-effective alternative to traditional LLM training pipelines for tool use. It learns complex tool-use behaviors directly through in-context examples during RL rollouts, showcasing superior data efficiency compared to methods requiring cold-start SFT. The framework also demonstrates excellent scalability; when applied to larger models like Qwen2.5-14B-Instruct, ICRL continues to deliver significant performance gains across all QA datasets, indicating that it effectively leverages increased model capacity without additional supervision.
Enterprise Process Flow: ICRL Training Workflow
ICRL's training workflow involves a multi-stage curriculum that gradually reduces the number of in-context examples in the rollout template, enabling a transition from imitation to autonomous tool use.
| Feature | ICRL | O2-Searcher |
|---|---|---|
| SFT Required | No | Yes |
| TriviaQA EM | 72.6% | 59.7% |
| HotpotQA EM | 35.4% | 38.8% |
| 2Wiki EM | 39.2% | 37.4% |
| Musique EM | 20.0% | 16.0% |
| Bamboogle EM | 33.6% | 34.4% |
| Average EM | 40.16% | 37.26% |
ICRL achieves a higher average EM without any SFT or labeled cold-start data, whereas O2-Searcher requires a cold-start SFT phase before RL.
Case Study: Multi-Turn Reasoning with ICRL-Qwen2.5-7B
Context: ICRL-Qwen2.5-7B demonstrates the ability to decompose complex questions and retrieve relevant information across multiple turns, maintaining coherent reasoning without explicit intermediate supervision.
Problem: When did the president who set the precedent of a two term limit enter office?
Solution Steps:
- Initial Thought: "I need to know which president set this precedent. I also need to know when he entered office. Let me start by finding out which president it is."
- First Search: Invokes the search tool with the query: "president who set two term limit precedent"
- Information 1 (Summary): Documents confirm George Washington set the precedent.
- Second Thought: "It seems George Washington set the precedent. Now I need to find out when he entered office."
- Second Search: Invokes the search tool with the query: "when did George Washington enter office as president"
- Information 2 (Summary): Documents confirm inauguration on April 30, 1789.
- Final Answer: "April 30, 1789"
Key Learnings: This example illustrates ICRL's capabilities in:
- Autonomous decomposition of complex queries
- Multi-turn tool invocation for information retrieval
- Coherent reasoning process without explicit supervision
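The think → search → answer loop from the case study can be sketched as a small driver. Here `llm` and `search` are placeholder callables (not APIs from the paper): `llm(transcript)` is assumed to return either `("search", query)` or `("answer", text)`, and `search(query)` returns a retrieved summary.

```python
def multi_turn_answer(question, llm, search, max_turns=4):
    """Illustrative multi-turn tool-use loop: alternate between the
    model's next action and search-tool invocations until the model
    emits a final answer or the turn budget runs out."""
    transcript = [f"Question: {question}"]
    for _ in range(max_turns):
        action, content = llm("\n".join(transcript))
        if action == "answer":
            return content
        # Invoke the search tool and append its summary so the next
        # turn can reason over the retrieved evidence.
        transcript.append(f"Search: {content}")
        transcript.append(f"Information: {search(content)}")
    return None  # no answer within the turn budget
```

Run against the case study's trajectory, the loop issues two searches (identify the president, then his inauguration date) before returning the final answer.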
Your ICRL Implementation Roadmap
A structured approach to integrate In-Context Reinforcement Learning into your enterprise, maximizing efficiency and impact.
Phase 01: Initial Setup & Few-Shot Guidance
Deploy ICRL with a small set of few-shot in-context examples to bootstrap tool-use capabilities and establish structured reasoning patterns within your LLMs.
Phase 02: Iterative Curriculum Optimization
Progressively reduce in-context examples during RL training, enabling the model to internalize tool-use strategies and transition to fully autonomous, zero-shot operation.
Phase 03: Performance Benchmarking & Scaling
Evaluate model performance on diverse reasoning and QA benchmarks, and scale ICRL to larger foundation models to maximize impact and generalization across enterprise tasks.
Phase 04: Domain Adaptation & Deployment
Adapt the ICRL-trained LLM for specific enterprise applications, leveraging its data-efficient tool-use for enhanced problem-solving and automated workflows in production environments.
Ready to Transform Your LLM Capabilities?
ICRL offers a groundbreaking path to more autonomous, capable, and data-efficient LLMs. Discover how this innovation can drive significant value for your organization.