
Enterprise AI Analysis

Small Language Models for Efficient Agentic Tool Calling: Outperforming Large Models with Targeted Fine-tuning

Polaris Jhandi, Owais Kazi, Shreyas Subramanian, Neel Sendas

Amazon Web Services, Seattle, WA, USA

Executive Summary

As organizations scale adoption of generative AI, model cost optimization and operational efficiency have emerged as critical factors determining sustainability and accessibility. While Large Language Models (LLMs) demonstrate impressive capabilities across diverse tasks, their extensive computational requirements make them cost-prohibitive for routine enterprise use. This limitation motivates the exploration of Small Language Models (SLMs), which can deliver comparable performance in targeted applications while drastically reducing infrastructure overhead (Irugalbandara et al., 2023).

In this work, we investigate the feasibility of replacing LLM-driven workflows with optimized SLMs. We trained a domain-adapted SLM to execute representative tasks traditionally handled by LLMs, such as document summarization, query answering, and structured data interpretation. Specifically, we fine-tuned the facebook/opt-350m model for a single epoch using the Supervised Fine-Tuning (SFT) trainer from the Hugging Face TRL (Transformer Reinforcement Learning) library. The OPT-350M model was released by Meta AI in 2022 as part of the OPT (Open Pretrained Transformer) family of models. Similar studies demonstrate that even models at the 350M-parameter scale can meaningfully contribute to instruction-tuning pipelines (Mekala et al., 2024).

Experimental results show that our fine-tuned SLM achieves a 77.55% pass rate on the ToolBench evaluation, significantly outperforming all baseline models, including ChatGPT-CoT (26.00%), ToolLLaMA-DFS (30.18%), and ToolLLaMA-CoT (16.27%). These benchmarks, first introduced in ToolLLM (Qin et al., 2023) and later stabilized by follow-up efforts (Zhang et al., 2024), have become the standard for evaluating tool-augmented reasoning. Recent work has also extended ToolBench traces to preference-based optimization (Zeng et al., 2024) and designed alternative multi-API corpora for tool-use robustness (Liu et al., 2024). These findings emphasize that thoughtful design and targeted training of SLMs can significantly lower barriers to adoption, enabling cost-effective, large-scale integration of generative AI into production systems.

77.55% Overall Pass Rate
350M Model Parameters
74.82% Max Baseline Outperformance

Deep Analysis & Enterprise Applications


Unprecedented Overall Pass Rate

77.55% achieved by our SLM, vastly outperforming all large-model baselines.

Model Performance Comparison (Table 1)

Model          Params  Pass Rate  Gap
Our SLM        350M    77.55%     -
ToolLLaMA-DFS  7B      30.18%     -47.37%
ChatGPT-CoT    175B    26.00%     -51.55%
ToolLLaMA-CoT  7B      16.27%     -61.28%
Claude-CoT     52B     2.73%      -74.82%

Our 350M parameter model achieved a remarkable 77.55% overall pass rate, significantly outperforming all baseline models by margins ranging from 47% to 75%. This result fundamentally challenges the conventional wisdom that larger models are necessary for complex reasoning tasks. The performance gap is particularly striking when considering that ChatGPT-CoT (175B parameters) achieved only 26.00%, representing a 2.98x improvement with our dramatically smaller model.

Consistent Performance Across Diverse Tasks

Category  Ours  TLLM-D  GPT-C  TLLM-C  Claude
G1_instr  78.5  32.5    33.0   16.0    3.5
G1_cat    74.0  32.5    29.5   21.5    3.0
G1_tool   79.0  28.0    29.5   14.5    2.5
G2_cat    80.5  32.5    24.5   16.5    1.5
G2_instr  74.5  29.5    24.0   18.0    2.5
G3_instr  80.0  22.0    5.0    6.0     4.0
Avg       77.6  30.2    26.0   16.3    2.7

(TLLM-D = ToolLLaMA-DFS, GPT-C = ChatGPT-CoT, TLLM-C = ToolLLaMA-CoT, Claude = Claude-CoT.)
6.5% range: narrow performance variance across categories, demonstrating robustness.

Despite having 20-500x fewer parameters than baseline models, our SLM achieved superior performance across all evaluation categories. This breakthrough in parameter efficiency demonstrates that targeted fine-tuning can overcome traditional scaling limitations. The performance-per-parameter ratio represents a paradigm shift from brute-force scaling to intelligent optimization, proving that thoughtful training strategies can deliver exceptional results with minimal computational resources.

The model also maintained consistent performance across all test categories, with success rates ranging from 74% to 80.5%, demonstrating remarkable reliability across diverse tool manipulation scenarios. This consistency indicates that our approach successfully captures generalizable reasoning patterns rather than task-specific optimizations.

Our Targeted Fine-tuning Process

Enterprise Process Flow

facebook/opt-350m Model
ToolBench Dataset (16,000+ APIs)
SFT Fine-tuning (Amazon SageMaker)
Generate ToolBench Format Responses
Achieve 77.55% Pass Rate

Our approach centers on fine-tuning the facebook/opt-350m model using Supervised Fine-tuning (SFT) with the Hugging Face TRL library. The OPT-350M model, with its 350 million parameters, represents a strategic balance between capability and efficiency. We trained the model on the ToolBench dataset, which contains over 16,000 real-world APIs from RapidAPI Hub with corresponding instruction-solution pairs. The training process was conducted on Amazon SageMaker, leveraging its managed environment. Our SFT approach focused on teaching the model to generate responses in the proper ToolBench format, consisting of Thought-Action-Action Input patterns that enable systematic tool manipulation and reasoning.
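For illustration, the following is a minimal sketch of this SFT setup using the Hugging Face TRL library. The dataset file name, text column, batch size, and learning rate are illustrative assumptions rather than the study's exact configuration; only the single training epoch is taken from the text above.

```python
# Minimal sketch of the SFT setup, assuming ToolBench instruction-solution
# pairs flattened into a single "text" column in a JSONL file.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical file name; the study trained on ToolBench-derived pairs.
dataset = load_dataset("json", data_files="toolbench_sft.jsonl", split="train")

config = SFTConfig(
    output_dir="opt-350m-toolbench-sft",
    num_train_epochs=1,               # single epoch, as reported above
    per_device_train_batch_size=4,    # assumed; not reported
    learning_rate=2e-5,               # assumed; not reported
    dataset_text_field="text",        # column holding the concatenated sequence
)

trainer = SFTTrainer(
    model="facebook/opt-350m",        # TRL loads the model from the Hub id
    args=config,
    train_dataset=dataset,
)
trainer.train()
```

On Amazon SageMaker, a script along these lines can run as a standard Hugging Face training job; no TRL-specific infrastructure is required.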

ToolBench Evaluation Framework

ToolBench: The Standard for Tool-Augmented Reasoning

The ToolBench dataset is a large, multi-turn instruction dataset crucial for evaluating tool-augmented language models. For training, each record was transformed into a structured sequence by concatenating the system prompt, user query, and assistant responses with appropriate delimiters. ToolBench also serves as the primary evaluation framework, providing comprehensive assessment across diverse tool manipulation scenarios through two metrics: Pass Rate Evaluation (the proportion of successfully completed instructions within API call budgets) and Win Rate Assessment (a comparison of solution quality based on information richness, factual accuracy, and reasoning quality).
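As a concrete illustration, the transformation might look like the sketch below. The field names and delimiters here are hypothetical; the actual ToolBench schema differs in detail, but the Thought-Action-Action Input structure is the one the model is trained to emit.

```python
# Hedged sketch of flattening one ToolBench-style record into a single
# training string. Field names and delimiters are illustrative assumptions.
def format_toolbench_record(record: dict) -> str:
    parts = [f"System: {record['system_prompt']}"]
    parts.append(f"User: {record['query']}")
    for step in record["assistant_steps"]:
        parts.append(
            "Assistant:\n"
            f"Thought: {step['thought']}\n"
            f"Action: {step['action']}\n"
            f"Action Input: {step['action_input']}"
        )
    # Blank lines act as turn delimiters in this sketch.
    return "\n\n".join(parts)
```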

The benchmark consists of six test categories totaling 1,100 test queries: G1-instruction (single-tool, unseen instructions), G1-category (single-tool, unseen categories), G1-tool (single-tool, unseen tools), G2-instruction (multi-tool, intra-category), G2-category (multi-tool, across categories), and G3-instruction (multi-tool, intra-collection). Our fine-tuned OPT-350M model was evaluated alongside baseline models using identical inference parameters.
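The pass-rate metric itself is straightforward to compute, as the toy sketch below shows. The trace format and the API call budget value are assumptions for illustration, not ToolBench's actual evaluation harness.

```python
# Toy illustration of pass rate: the share of test queries solved within
# the per-query API call budget. Trace format and budget are assumed.
def pass_rate(traces: list[dict], budget: int = 200) -> float:
    passed = sum(1 for t in traces if t["solved"] and t["api_calls"] <= budget)
    return 100.0 * passed / len(traces)

# Example: 3 of 4 queries solved within budget -> 75.0
traces = [
    {"solved": True,  "api_calls": 12},
    {"solved": True,  "api_calls": 40},
    {"solved": False, "api_calls": 200},
    {"solved": True,  "api_calls": 7},
]
print(pass_rate(traces))  # 75.0
```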

Why SLMs Outperform LLMs for Specialized Tasks

Our findings reveal that task-specific optimization fundamentally outperforms scale-based approaches for tool-calling applications. The baseline large language models were trained on broad, general-purpose datasets that lack the specific tool-calling patterns and reasoning structures required for effective API manipulation. While these models excel at general language tasks, they struggle with the precise format requirements and multi-step reasoning chains essential for tool use. Our SLM concentrates all its capacity on tool-calling behaviors, resulting in more efficient parameter utilization where billions of parameters in baseline models become a liability rather than an asset.

Feature                Small Language Models (SLMs)            Large Language Models (LLMs)
Parameter Utilization  Efficient, targeted for tool calling    Diluted, optimized for general language understanding
Reasoning Style        Structured, precise API calls           Verbose, creative, prone to overgeneralization
Training Data Focus    Domain-adapted tool-calling patterns    Broad, general-purpose datasets
Cost/Compute           Lower infrastructure/operational costs  High computational requirements

Future Work & Limitations

Despite the promising results, several limitations must be acknowledged. Our model was specifically optimized for ToolBench and may not generalize to other tool-calling frameworks or real-world API ecosystems with different interaction patterns. The 350M parameter constraint, while well suited to tool calling, may limit the model's ability to understand subtle contextual nuances or handle ambiguous user requests. Scaling to ecosystems involving hundreds of interconnected tools with intricate dependencies may exceed our model's learned patterns. Finally, the model's performance is inherently limited by the quality and coverage of the ToolBench training data, and its specialized nature may require frequent retraining as APIs evolve.

These findings suggest that domain-specific optimization at moderate scale represents a viable alternative to the prevalent "scaling law" paradigm for specialized applications. Future research should investigate the generalization boundaries of specialized SLMs and develop hybrid approaches that combine the efficiency of targeted models with the adaptability of larger systems. The optimal parameter count likely varies across different specialized domains, warranting systematic investigation of task-complexity to model-capacity relationships.


Your AI Implementation Roadmap

A typical phased approach to integrate intelligent SLMs into your enterprise workflows.

Phase 1: Strategic Planning & Discovery

Identify high-impact use cases, define clear objectives, and assess existing infrastructure readiness. This phase involves stakeholder interviews, data audits, and a detailed feasibility study.

Phase 2: Data Preparation & Model Training

Collect and preprocess domain-specific data, curate high-quality instruction datasets, and fine-tune SLMs using techniques like Supervised Fine-tuning (SFT) for targeted performance.

Phase 3: Integration & Testing

Integrate trained SLMs with existing enterprise systems, develop API connectors, and conduct rigorous testing against real-world benchmarks like ToolBench to ensure robustness and accuracy.

Phase 4: Deployment & Monitoring

Deploy SLMs into production environments, establish monitoring systems for performance, cost, and safety, and set up feedback loops for continuous improvement.

Phase 5: Continuous Optimization

Regularly update models with new data, explore advanced fine-tuning techniques, and scale successful implementations across new departments and use cases for maximum ROI.

Ready to Transform Your Enterprise with AI?

Leverage the power of efficient Small Language Models. Book a personalized consultation to discuss how our solutions can drive significant cost savings and operational excellence for your organization.
