Enterprise AI Analysis
Error-Driven Prompt Optimization for Arithmetic Reasoning
Recent advancements in artificial intelligence have sparked interest in industrial agents capable of supporting analysts in regulated sectors, such as finance and healthcare, within tabular data workflows. This paper introduces an error-driven optimization framework for arithmetic reasoning that enhances a Code Generation Agent (CGA), applied specifically to on-premises small language models (SLMs). A systematic evaluation demonstrates a significant performance improvement, enabling the small model to surpass larger models while preserving data privacy.
Unlock Precision & Privacy for Your Data
Our research demonstrates how targeted prompt optimization can transform small language models into powerful, privacy-compliant analytical agents for complex tabular data tasks.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Recent advancements in artificial intelligence have sparked interest in industrial agents capable of supporting analysts in regulated sectors, such as finance and healthcare, within tabular data workflows. A key capability for such systems is performing accurate arithmetic operations on structured data while ensuring sensitive information never leaves secure, on-premises environments. Here, we introduce an error-driven optimization framework for arithmetic reasoning that enhances a Code Generation Agent (CGA), specifically applied to on-premises small language models (SLMs). Through a systematic evaluation of a leading SLM (Qwen3 4B), we find that while the base model exhibits fundamental limitations in arithmetic tasks, our proposed error-driven method, which clusters erroneous predictions to refine prompt rules iteratively, dramatically improves performance, elevating the model's accuracy to 70.8%. Our results suggest that developing reliable, interpretable, and industrially deployable AI assistants can be achieved not only through costly fine-tuning but also via systematic, error-driven prompt optimization, enabling small models to surpass larger language models (GPT-3.5 Turbo) in a privacy-compliant manner.
Tabular data remains a cornerstone of knowledge representation in domains such as finance, healthcare, and the natural sciences, yet its automated interpretation presents persistent challenges for language models [1] [2]. Early work revealed that large language models (LLMs) perform impressively in natural language understanding, but their arithmetic reasoning is fragile: they often produce inconsistent or outright wrong numerical outputs when faced with structured data. To overcome this limitation, our previous study (Arithmetic-aware question-answering on tabular data using a large language model-based code generation agent) [3] reframed question answering as a code-generation task in which the model produces executable programs that perform the required data selection and arithmetic operations. When coupled with table restructuring and domain-specific rule injection, this hybrid framework, the Code Generation Agent (CGA), achieved dramatic gains, raising exact-match accuracy on financial benchmarks from below 30% to nearly 80%.
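To make the reframing concrete, the sketch below shows the shape of such a pipeline with the model call stubbed out: the agent is asked for executable Python rather than a free-text answer, and the arithmetic is then carried out deterministically. The table, prompt handling, and helper names are illustrative assumptions, not the CGA's actual implementation.

```python
# Minimal sketch of the code-generation reframing; the generate_code() stub stands in
# for an on-premises SLM call, and the table and question are illustrative only.
import pandas as pd

table = pd.DataFrame({"year": [2021, 2022], "revenue": [1250.0, 1430.0]})
question = "What was the percentage change in revenue from 2021 to 2022?"

def generate_code(question: str, table: pd.DataFrame) -> str:
    """Stand-in for the model: a real agent would build a prompt from the question,
    a preview of the table, and the domain-specific rules, then request a program."""
    return (
        "old = df.loc[df['year'] == 2021, 'revenue'].item()\n"
        "new = df.loc[df['year'] == 2022, 'revenue'].item()\n"
        "answer = (new - old) / old * 100\n"
    )

scope = {"df": table}
exec(generate_code(question, table), scope)  # arithmetic happens in Python, not in the model
print(round(scope["answer"], 2))             # 14.4
```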
Despite these advances, the reliance on API-based large models posed a critical barrier for regulated sectors, where sensitive data cannot leave secure environments. A subsequent study [4] investigated small language models (SLMs) in the 1–7B parameter range, which can run entirely on-premises while requiring more modest computational resources. Strikingly, the combination of CGA, table restructuring, and prompt simplification enabled certain SLMs—notably Qwen3 4B [5]—to exceed GPT-3.5 Turbo in numerical accuracy, delivering both interpretability and privacy preservation. These results pointed to the feasibility of local, resource-efficient agents capable of robust tabular reasoning. However, key questions remained: how to extend such agents beyond hand-crafted rules, which often fail across model scales, and how to ground prompting strategies in a theoretical framework that balances rule number and type.
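One plausible form of the table-restructuring step is sketched below, assuming the goal is to flatten a report-style financial table into a uniform long format that generated code can query with simple filters; the paper's exact transformation is not reproduced here.

```python
# Hedged sketch of table restructuring: wide, report-style layout -> long layout.
import pandas as pd

# Report-style layout: one row per line item, one column per fiscal year.
report = pd.DataFrame({
    "line_item": ["revenue", "net income"],
    "2021": [1250.0, 210.0],
    "2022": [1430.0, 245.0],
})

# Long layout: one (line_item, year, value) record per cell, so generated code can
# select values with explicit filters instead of brittle positional indexing.
restructured = report.melt(id_vars="line_item", var_name="year", value_name="value")
restructured["year"] = restructured["year"].astype(int)
print(restructured)
```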
We present an error-driven framework that clusters mistaken predictions to derive domain-specific rules, iteratively refining the reasoning prompts so the agent "learns from its mistakes" without costly fine-tuning. We formalize an optimal rule set size (Kopt) that balances informativeness against cognitive load and explains the observed performance curve. This method boosts Qwen3 4B to 70.82% accuracy, surpassing GPT-3.5 Turbo while preserving full data sovereignty. In this Article, we integrate the methodological innovations of the CGA, the efficiency of small language models, and the systematic power of error-driven rule induction into a cohesive blueprint for interpretable, auditable, and industrially deployable agents. Section 2 describes the dataset, Section 3 the prompt-extension algorithm, Section 4 the implementation details, and Section 5 the experiments; Section 6 presents the results and theoretical insights, and Section 7 concludes with a best-practice foundation for trustworthy tabular reasoning AI.
This research extends the capabilities of the Code Generation Agent (CGA) by introducing a systematic error-driven optimization framework. The core methodology rests on three key components: table restructuring, the Code Generation Agent (CGA) itself, which generates executable Python code, and the application of domain-specific prompt rules. A novel aspect is the semi-automated formulation of these rules by clustering erroneous predictions and iteratively refining the prompts. This process allows the agent to "learn from its mistakes" without costly fine-tuning, significantly boosting the performance of on-premises Small Language Models (SLMs) on tabular arithmetic reasoning tasks.
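A minimal sketch of the error-clustering idea follows; the TF-IDF features and k-means grouping are assumptions chosen for illustration, not the paper's exact configuration, and the error descriptions are hypothetical.

```python
# Hedged sketch: group failed predictions so each cluster suggests one targeted rule.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Each record describes one failed benchmark item (hypothetical examples).
errors = [
    "asked for percentage change, model returned the absolute difference",
    "asked for percentage change, model forgot to multiply by 100",
    "asked for year average, model summed values without dividing by count",
    "asked for year average, model averaged the wrong column",
]

vectors = TfidfVectorizer().fit_transform(errors)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Inspect cluster members to phrase one rule per shared root cause.
for cluster in sorted(set(labels)):
    print(f"cluster {cluster}:")
    for text, label in zip(errors, labels):
        if label == cluster:
            print("  -", text)
```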
The method is formalized through a task decomposition model where semantic interpretation and arithmetic computation are separated into distinct steps, enhancing accuracy. Prompt directives are algorithmically defined to identify common root causes of failure and introduce targeted rules. The entire process is designed for interpretable, auditable, and industrially deployable AI.
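The snippet below illustrates how such rules might be injected into the reasoning prompt once they are derived; the template, wording, and rule texts are illustrative assumptions rather than the directives used in the study.

```python
# Hedged sketch of prompt assembly with injected, error-derived rules.
BASE_INSTRUCTIONS = (
    "You are given a table `df`. Write Python code that computes the answer and "
    "stores it in a variable named `answer`. Output only code."
)

# Hypothetical rules distilled from error clusters.
derived_rules = [
    "Percentage change is (new - old) / old * 100, not the absolute difference.",
    "A 'year average' divides the sum of the selected values by their count.",
]

def build_prompt(question: str, table_preview: str, rules: list[str]) -> str:
    rule_block = "\n".join(f"- {r}" for r in rules)
    return (
        f"{BASE_INSTRUCTIONS}\n\nRules:\n{rule_block}\n\n"
        f"Table preview:\n{table_preview}\n\nQuestion: {question}\n"
    )

print(build_prompt(
    "What was the percentage change in revenue from 2021 to 2022?",
    "year  revenue\n2021  1250.0\n2022  1430.0",
    derived_rules,
))
```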
The error-driven prompt optimization framework significantly boosts the performance of Small Language Models (SLMs) for arithmetic reasoning on tabular data. Specifically, the Qwen3 4B model, a top performer in the SLM category, achieved an exact match accuracy of 70.82% after applying the iterative rule refinement process. This outcome surpasses the performance of the larger, general-purpose GPT-3.5 Turbo, which recorded 66.27% accuracy in the same benchmark.
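For reference, the two scoring modes reported here can be read roughly as follows; treating exact match as normalized string equality and value match as numeric agreement within a tolerance is an assumption, since the benchmark's precise normalization is not reproduced on this page.

```python
# Hedged sketch of the two metrics implied by the results table.
def exact_match(pred: str, gold: str) -> bool:
    """Strict comparison after trivial normalization."""
    return pred.strip().lower() == gold.strip().lower()

def value_match(pred: str, gold: str, rel_tol: float = 1e-2) -> bool:
    """Credit answers whose numeric value agrees even if formatting differs."""
    try:
        p = float(pred.replace("%", "").replace(",", "").strip())
        g = float(gold.replace("%", "").replace(",", "").strip())
    except ValueError:
        return exact_match(pred, gold)
    return abs(p - g) <= rel_tol * max(abs(g), 1e-9)

print(exact_match("14.4%", "14.4"))  # False: formatting differs
print(value_match("14.4%", "14.4"))  # True: same underlying value
```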
The theoretical implication highlights the existence of an optimal number of prompt rules (Kopt), beyond which performance can stagnate or even degrade due to "cognitive overload" or rule conflicts. This demonstrates that SLM optimization is a distinct discipline requiring concise, data-driven rule sets, rather than simply maximizing rule count. The findings underscore the feasibility of deploying resource-efficient, privacy-preserving analytical agents entirely on-premises for sensitive domains.
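One simple way to estimate such an optimum empirically is sketched below, assuming access to a validation split and a scoring routine; the evaluate() callback and the toy accuracy values are placeholders, not measurements from the study.

```python
# Hedged sketch of choosing K_opt: grow the rule set and keep the prefix that
# maximizes validation accuracy instead of assuming more rules is always better.
from typing import Callable, List, Tuple

def select_k_opt(rules: List[str],
                 evaluate: Callable[[List[str]], float]) -> Tuple[int, float]:
    best_k, best_acc = 0, evaluate([])
    for k in range(1, len(rules) + 1):
        acc = evaluate(rules[:k])
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc

# Toy stand-in for validation accuracy: it rises, then degrades once the prompt
# becomes overloaded (hypothetical numbers for illustration only).
toy_scores = {0: 0.60, 1: 0.64, 2: 0.68, 3: 0.71, 4: 0.70, 5: 0.69}
rules = [f"rule {i}" for i in range(1, 6)]
k_opt, acc = select_k_opt(rules, lambda rs: toy_scores[len(rs)])
print(k_opt, acc)  # 3 0.71
```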
The perceived arithmetic limitations of small, locally deployable language models (SLMs) are not fundamental, but rather an artifact of the inference strategy. This research demonstrates that decomposing complex, multi-step tabular reasoning into deterministic, verifiable code generation fundamentally alters the performance landscape. However, task decomposition alone is insufficient. The critical breakthrough comes from introducing a systematic, error-driven optimization cycle: algorithmic clustering of the model's prediction errors allows for the precise identification of error root causes (such as the misunderstanding of "percentage change" or "year average") and their targeted remediation through domain-specific prompt rules.
The fusion of these two methods—code decomposition and iterative, error-driven refinement—is potent enough to elevate the accuracy of a compact 4B-parameter model (Qwen3 4B) from 59.96% to 70.82%, thereby surpassing the performance of the significantly larger, general-purpose GPT-3.5 Turbo (66.27%). This finding breaks the dependency on large, API-based models for highly sensitive, privacy-constrained domains such as financial analysis.
Our results also outline a new optimization paradigm: peak performance is achieved not by the accumulation of rules, but by finding an optimal, concise rule set (Kopt), beyond which cognitive overload degrades the results. This confirms that SLM optimization is a distinct discipline, not a simple downscaling of LLM strategies. The presented procedure serves as a blueprint for developing auditable, resource-efficient, and privacy-preserving analytical agents that operate entirely on-premises.
Enterprise Process Flow
Benchmark comparison (exact match vs. value match accuracy):

| Model | Exact Match | Value Match |
|---|---|---|
| GPT-3.5 Turbo | 66.27% | 78.92% |
| Qwen3 4B | 70.82% | 77.06% |
Real-World Impact: Secure & Scalable AI for Regulated Sectors
For regulated sectors like finance and healthcare, our error-driven prompt optimization framework enables the deployment of powerful AI agents directly on-premises. This ensures sensitive data never leaves secure environments, offering a crucial advantage over API-based LLMs while achieving superior accuracy in arithmetic reasoning tasks. This solution is designed to be simple to deploy, robust, and easily scalable, supporting interpretable and auditable AI operations.
Calculate Your Potential ROI
Estimate the efficiency gains and cost savings your enterprise could achieve by implementing optimized SLM agents.
Your Path to On-Premises AI Excellence
Our proven implementation roadmap ensures a smooth transition to secure, high-performing AI agents within your existing infrastructure.
Phase 1: Discovery & Strategy
We assess your current data workflows, identify key arithmetic reasoning challenges, and define clear objectives for AI integration, ensuring alignment with your enterprise goals.
Phase 2: Data & Model Preparation
Preparation of your tabular data for optimal SLM consumption and initial setup of the Code Generation Agent with base prompts on your secure, on-premises environment.
Phase 3: Iterative Optimization & Rule Generation
Deployment of our error-driven optimization framework to systematically analyze SLM prediction errors, cluster root causes, and iteratively refine domain-specific prompt rules for peak accuracy.
Phase 4: Validation & Deployment
Thorough validation of the optimized SLM agent against your enterprise benchmarks, followed by seamless integration into your production environment for secure, performant operations.
Ready to Transform Your Data Operations?
Book a complimentary consultation to explore how error-driven prompt optimization can deliver secure, accurate, and on-premises AI solutions for your specific enterprise needs.