Artificial Intelligence Analysis
SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints
Andrew Tremante, Yang He, Rocky Klopfenstein, Yuepeng Wang, Nina Narodytska, and Haoze Wu
We present SPOTIT+, an open-source tool for evaluating Text-to-SQL systems via bounded equivalence verification. Given a generated SQL query and the ground truth, SPOTIT+ actively searches for database instances that differentiate the two queries. To ensure that the generated counterexamples reflect practically relevant discrepancies, we introduce a constraint-mining pipeline that combines rule-based specification mining over example databases with LLM-based validation. Experimental results on the BIRD dataset show that the mined constraints enable SPOTIT+ to generate more realistic differentiating databases, while preserving its ability to efficiently uncover numerous discrepancies between generated and gold SQL queries that are missed by standard test-based evaluation.
Unlocking Deeper SQL Evaluation with SPOTIT+
SPOTIT+ revolutionizes Text-to-SQL evaluation by moving beyond simplistic test-based methods, providing a more rigorous and realistic assessment of query correctness.
Deep Analysis & Enterprise Applications
The sections below explore the specific findings from the research, framed as enterprise-focused modules.
Traditional test-based evaluation of Text-to-SQL models often overlooks logical non-equivalence between generated and gold SQL queries. This happens when two queries produce identical results on a fixed test database but diverge on other possible database instances. This can lead to an overly optimistic assessment of model performance.
SPOTIT+ addresses this by employing bounded equivalence verification, a technique that systematically searches for database instances (counterexamples) that differentiate two queries within a defined search space. This provides stronger correctness guarantees than simple test execution.
The underlying verification engine, VeriEQL, translates SQL queries and database constraints into Satisfiability Modulo Theories (SMT) problems, solvable by powerful solvers like Z3, allowing for a rigorous, formal check of equivalence.
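VeriEQL's full SMT encoding is beyond a short snippet, but the core idea of bounded equivalence verification can be illustrated with a stdlib-only sketch: treat each query's WHERE clause as a predicate over a column and enumerate a bounded domain of values, looking for a database instance on which the two queries disagree. The predicates below mirror the motivating example discussed later in this article; the brute-force search stands in for the SMT solver.

```python
# Minimal illustration of bounded equivalence checking (not VeriEQL itself):
# search a bounded space of column values for one that differentiates
# two WHERE-clause predicates.

def generated(a11):          # generated query: DISTRICT.A11 > 8000
    return a11 > 8000

def gold(a11):               # gold query: DISTRICT.A11 BETWEEN 8000 AND 9000
    return 8000 <= a11 <= 9000

def find_counterexample(lo, hi):
    """Return a value on which the two predicates disagree, or None."""
    for a11 in range(lo, hi + 1):
        if generated(a11) != gold(a11):
            return a11
    return None

print(find_counterexample(0, 20000))  # 8000: gold selects it, generated does not
```

A solver performs this search symbolically rather than by enumeration, which is why SPOTIT+ can handle full SQL queries over multi-table schemas in under a second on average.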
A critical innovation in SPOTIT+ is its constraint-extraction pipeline. This module mines practical domain-specific constraints from example databases, going beyond explicit schema integrity constraints.
Five types of constraints are extracted:
- Range Constraints: Restrict numeric columns to plausible intervals (e.g., patient age [0, 120]).
- Categorical Constraints: Limit column values to a finite set of discrete choices (e.g., 'OWNER', 'DISPONENT').
- NotNull Constraints: Ensure critical columns do not contain null values.
- Functional Dependencies: Specify when one set of columns uniquely determines another (e.g., country code determines country name).
- Ordering Dependencies: Enforce inequality relationships between numeric columns.
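As a hedged sketch of what rule-based mining over example rows might look like, the function below derives three of the five constraint types (Range, Categorical, NotNull) from sample data; `mine_constraints` and its output format are illustrative, not SPOTIT+'s actual implementation.

```python
# Illustrative rule-based constraint mining over example rows
# (a sketch, not SPOTIT+'s real pipeline).

def mine_constraints(rows, max_categories=10):
    """Mine Range, Categorical, and NotNull constraints from example rows."""
    constraints = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        non_null = [v for v in values if v is not None]
        found = []
        if len(non_null) == len(values):   # NotNull: no missing values observed
            found.append("NOT NULL")
        if non_null and all(isinstance(v, (int, float)) for v in non_null):
            found.append(f"RANGE [{min(non_null)}, {max(non_null)}]")  # Range
        distinct = set(non_null)
        if (non_null and all(isinstance(v, str) for v in non_null)
                and len(distinct) <= max_categories):
            found.append(f"IN {sorted(distinct)}")  # Categorical
        constraints[col] = found
    return constraints

rows = [
    {"age": 34, "type": "OWNER"},
    {"age": 51, "type": "DISPONENT"},
    {"age": 29, "type": "OWNER"},
]
print(mine_constraints(rows))
```

Note that ranges mined this way reflect only the sample data (here, ages [29, 51]), which is exactly why the LLM validation step described next is needed.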
To ensure realism and prevent overfitting to idiosyncratic test data, SPOTIT+ integrates a Large Language Model (LLM). The LLM validates and repairs mined constraints, relaxing overly restrictive ranges (e.g., [30, 60] to [0, 120]) and confirming genuine domain properties. These LLM-validated constraints are then encoded into the verification process, guiding the search for counterexamples towards more realistic scenarios.
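The validation step can be pictured as follows. Here `llm_judge` is a hard-coded stand-in for a real model call, and the verdict format is an assumption for illustration, not SPOTIT+'s API; its answer mirrors the paper's example of relaxing [30, 60] to [0, 120].

```python
def llm_judge(column, mined_range):
    # Stand-in for an LLM call answering a prompt such as:
    #   "Column 'age' was observed in range [30, 60]. Is that a genuine
    #    domain bound or a sampling artifact? If an artifact, repair it."
    if column == "age":
        return {"keep": False, "repaired": (0, 120)}   # relax to plausible ages
    return {"keep": True, "repaired": mined_range}     # constraint confirmed

def validate_constraint(column, mined_range):
    """Keep the mined range if the judge confirms it; otherwise use its repair."""
    verdict = llm_judge(column, mined_range)
    return mined_range if verdict["keep"] else verdict["repaired"]

print(validate_constraint("age", (30, 60)))        # relaxed range
print(validate_constraint("balance", (0, 50000)))  # confirmed as-is
```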
Experimental evaluations on the BIRD dataset demonstrate that SPOTIT+ significantly enhances the realism of counterexamples while maintaining strong discrepancy-detection capabilities.
Compared to traditional test-based methods, SPOTIT+ uncovers a substantial number of additional discrepancies, providing a more accurate measure of Text-to-SQL model correctness. For instance, it identifies 7.4% more discrepancies on average than EX-TEST, the standard execution-based baseline.
Crucially, the integration of LLM-validated constraints ensures that these detected discrepancies are relevant and not artifacts of pathological, unrealistic database states.
Furthermore, SPOTIT+ remains highly efficient, with an average counterexample generation time of just 0.9 seconds, making it practical for large-scale evaluations. The system successfully encodes 93-97% of SQL pairs, indicating high coverage for complex queries.
Enterprise Process Flow
| Feature | Test-Based Evaluation | SPOTIT (Vanilla Verification) | SPOTIT+ (Verification with LLM-Validated Constraints) |
|---|---|---|---|
| Discrepancy Detection | ❌ Limited (misses logical non-equivalence due to fixed test data) | ✅ High (finds many discrepancies using formal methods) | ✅ Enhanced (finds realistic discrepancies, filters out unrealistic ones found by vanilla SPOTIT) |
| Counterexample Realism | N/A (no explicit counterexamples) | ⚠️ Can generate unrealistic or pathological counterexamples | ✨ Significantly improved realism through LLM-validated domain constraints |
| Database Constraints Used | Implicitly uses test database's data | Only explicit schema integrity constraints (PK/FK) | Explicit + mined (Range, Categorical, NotNull, Functional, Ordering) + LLM-validated |
| Average Verification Time | Fast (simple execution on fixed data) | 1.7 seconds | 0.9 seconds (more efficient due to constrained search space) |
Motivating Example: Realistic Counterexamples with LLM Validation
Consider a scenario where a generated SQL query uses a filter like DISTRICT.A11 > 8000, while the gold query uses DISTRICT.A11 BETWEEN 8000 AND 9000. The two disagree at A11 = 8000 (included by BETWEEN but not by >) and for any A11 above 9000 (included by > but not by BETWEEN). On a standard test database with no rows in either region, they yield identical results, causing test-based evaluation to incorrectly label them as equivalent.
Vanilla SPOTIT would find a counterexample, but it might populate categorical columns (e.g., DISTRICT.A2, ACCOUNT.FREQUENCY) with arbitrary, unrealistic values like '2147483648'. This highlights a discrepancy, but the counterexample itself is not practically relevant.
SPOTIT+ addresses this by integrating LLM-validated constraints. It mines constraints like `DISTRICT.A2` values belonging to fixed choices (e.g., 'Prague') and passes them to the verifier. This guides the counterexample generation to produce a database instance that is not only differentiating but also qualitatively more realistic, using plausible values for categorical data. As demonstrated in Figure 1, the LLM-validated counterexample provides a more meaningful insight into the real-world implications of the query discrepancy, without being overly restrictive.
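The effect of the mined categorical constraint can be sketched in stdlib Python. The real tool encodes such constraints into the SMT query; in this sketch they simply restrict the enumeration, and the A2 value set is assumed for illustration.

```python
# Sketch: constraining counterexample search to mined categorical values,
# so the differentiating row uses plausible data (value set assumed).

A2_CHOICES = ["Prague", "Brno", "Ostrava"]   # mined categorical constraint

def generated(row):
    return row["A11"] > 8000                 # generated query's filter

def gold(row):
    return 8000 <= row["A11"] <= 9000        # gold query's filter

def find_realistic_counterexample(a11_lo, a11_hi):
    for a11 in range(a11_lo, a11_hi + 1):
        for a2 in A2_CHOICES:                # only plausible categorical values
            row = {"A11": a11, "A2": a2}
            if generated(row) != gold(row):
                return row
    return None

print(find_realistic_counterexample(0, 20000))
# a row like {'A11': 8000, 'A2': 'Prague'} instead of an arbitrary '2147483648'
```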
Advanced ROI Calculator
Estimate the potential annual time and cost savings by implementing advanced AI-driven Text-to-SQL verification in your operations.
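As a back-of-the-envelope illustration of the calculation, the sketch below uses placeholder inputs you would replace with your own figures; only the 7.4% discrepancy gain comes from the research.

```python
# Hedged ROI sketch: all figures below are illustrative placeholders, not
# measurements from the paper (except the 7.4% average discrepancy gain).

def annual_savings(queries_per_year, discrepancy_rate, manual_review_hours,
                   hourly_cost):
    """Cost of manually reviewing the discrepant queries that automated
    verification would surface without human effort."""
    flagged = queries_per_year * discrepancy_rate
    return flagged * manual_review_hours * hourly_cost

# Example: 50,000 generated queries/year, 7.4% additional discrepancies
# caught, 0.5 h of manual review per discrepancy, $80/h reviewer cost.
print(f"${annual_savings(50_000, 0.074, 0.5, 80):,.0f}")
```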
Our Implementation Roadmap
A structured approach to integrating advanced AI capabilities into your enterprise.
Phase 1: Discovery & Strategy
In-depth analysis of existing Text-to-SQL workflows, data schemas, and evaluation bottlenecks. Define specific KPIs and tailor a verification strategy aligned with business objectives.
Phase 2: SPOTIT+ Integration
Deployment and configuration of SPOTIT+ within your existing infrastructure. Establish secure data connections and configure constraint extraction pipelines for your databases.
Phase 3: LLM Constraint Validation
Fine-tune LLM integration for domain-specific constraint validation and repair. Develop custom prompts and validate constraint realism with subject matter experts.
Phase 4: Continuous Evaluation & Optimization
Automate continuous verification, monitor model performance, and refine constraint sets. Implement feedback loops for iterative improvement of Text-to-SQL systems.
Ready to Transform Your Data Operations?
Schedule a personalized consultation with our AI experts to explore how SPOTIT+ can elevate the reliability and accuracy of your Text-to-SQL systems.