Artificial Intelligence Analysis
SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints
Andrew Tremante, Yang He, Rocky Klopfenstein, Yuepeng Wang, Nina Narodytska, and Haoze Wu
We present SPOTIT+, an open-source tool for evaluating Text-to-SQL systems via bounded equivalence verification. Given a generated SQL query and the ground truth, SPOTIT+ actively searches for database instances that differentiate the two queries. To ensure that the generated counterexamples reflect practically relevant discrepancies, we introduce a constraint-mining pipeline that combines rule-based specification mining over example databases with LLM-based validation. Experimental results on the BIRD dataset show that the mined constraints enable SPOTIT+ to generate more realistic differentiating databases, while preserving its ability to efficiently uncover numerous discrepancies between generated and gold SQL queries that are missed by standard test-based evaluation.
Unlocking Deeper SQL Evaluation with SPOTIT+
SPOTIT+ revolutionizes Text-to-SQL evaluation by moving beyond simplistic test-based methods, providing a more rigorous and realistic assessment of query correctness.
Deep Analysis & Enterprise Applications
The sections below explore the specific findings from the research, framed as enterprise-focused modules.
Traditional test-based evaluation of Text-to-SQL models often overlooks logical non-equivalence between generated and gold SQL queries. This happens when two queries produce identical results on a fixed test database but diverge on other possible database instances. This can lead to an overly optimistic assessment of model performance.
SPOTIT+ addresses this by employing bounded equivalence verification, a technique that systematically searches for database instances (counterexamples) that differentiate two queries within a defined search space. This provides stronger correctness guarantees than simple test execution.
The underlying verification engine, VeriEQL, translates SQL queries and database constraints into Satisfiability Modulo Theories (SMT) problems, solvable by powerful solvers like Z3, allowing for a rigorous, formal check of equivalence.
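VeriEQL's full SMT encoding is beyond a short snippet, but the core idea of bounded equivalence verification can be illustrated with a stdlib-only sketch: treat each query's WHERE clause as a predicate over a column and enumerate a bounded domain of values, looking for a database instance on which the two queries disagree. The predicates below mirror the motivating example discussed later in this article; the brute-force search stands in for the SMT solver.

```python
# Minimal illustration of bounded equivalence checking (not VeriEQL itself):
# search a bounded space of column values for one that differentiates
# two WHERE-clause predicates.

def generated(a11):          # generated query: DISTRICT.A11 > 8000
    return a11 > 8000

def gold(a11):               # gold query: DISTRICT.A11 BETWEEN 8000 AND 9000
    return 8000 <= a11 <= 9000

def find_counterexample(lo, hi):
    """Return a value on which the two predicates disagree, or None."""
    for a11 in range(lo, hi + 1):
        if generated(a11) != gold(a11):
            return a11
    return None

print(find_counterexample(0, 20000))  # 8000: gold selects it, generated does not
```

A solver performs this search symbolically rather than by enumeration, which is why SPOTIT+ can handle full SQL queries over multi-table schemas in under a second on average.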
A critical innovation in SPOTIT+ is its constraint-extraction pipeline. This module mines practical domain-specific constraints from example databases, going beyond explicit schema integrity constraints.
Five types of constraints are extracted:
- Range Constraints: Restrict numeric columns to plausible intervals (e.g., patient age [0, 120]).
- Categorical Constraints: Limit column values to a finite set of discrete choices (e.g., 'OWNER', 'DISPONENT').
- NotNull Constraints: Ensure critical columns do not contain null values.
- Functional Dependencies: Specify when one set of columns uniquely determines another (e.g., country code determines country name).
- Ordering Dependencies: Enforce inequality relationships between numeric columns.
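As a hedged sketch of what rule-based mining over example rows might look like, the function below derives three of the five constraint types (Range, Categorical, NotNull) from sample data; `mine_constraints` and its output format are illustrative, not SPOTIT+'s actual implementation.

```python
# Illustrative rule-based constraint mining over example rows
# (a sketch, not SPOTIT+'s real pipeline).

def mine_constraints(rows, max_categories=10):
    """Mine Range, Categorical, and NotNull constraints from example rows."""
    constraints = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        non_null = [v for v in values if v is not None]
        found = []
        if len(non_null) == len(values):   # NotNull: no missing values observed
            found.append("NOT NULL")
        if non_null and all(isinstance(v, (int, float)) for v in non_null):
            found.append(f"RANGE [{min(non_null)}, {max(non_null)}]")  # Range
        distinct = set(non_null)
        if (non_null and all(isinstance(v, str) for v in non_null)
                and len(distinct) <= max_categories):
            found.append(f"IN {sorted(distinct)}")  # Categorical
        constraints[col] = found
    return constraints

rows = [
    {"age": 34, "type": "OWNER"},
    {"age": 51, "type": "DISPONENT"},
    {"age": 29, "type": "OWNER"},
]
print(mine_constraints(rows))
```

Note that ranges mined this way reflect only the sample data (here, ages [29, 51]), which is exactly why the LLM validation step described next is needed.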
To ensure realism and prevent overfitting to idiosyncratic test data, SPOTIT+ integrates a Large Language Model (LLM). The LLM validates and repairs mined constraints, relaxing overly restrictive ranges (e.g., [30, 60] to [0, 120]) and confirming genuine domain properties. These LLM-validated constraints are then encoded into the verification process, guiding the search for counterexamples towards more realistic scenarios.
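The validation step can be pictured as follows. Here `llm_judge` is a hard-coded stand-in for a real model call, and the verdict format is an assumption for illustration, not SPOTIT+'s API; its answer mirrors the paper's example of relaxing [30, 60] to [0, 120].

```python
def llm_judge(column, mined_range):
    # Stand-in for an LLM call answering a prompt such as:
    #   "Column 'age' was observed in range [30, 60]. Is that a genuine
    #    domain bound or a sampling artifact? If an artifact, repair it."
    if column == "age":
        return {"keep": False, "repaired": (0, 120)}   # relax to plausible ages
    return {"keep": True, "repaired": mined_range}     # constraint confirmed

def validate_constraint(column, mined_range):
    """Keep the mined range if the judge confirms it; otherwise use its repair."""
    verdict = llm_judge(column, mined_range)
    return mined_range if verdict["keep"] else verdict["repaired"]

print(validate_constraint("age", (30, 60)))        # relaxed range
print(validate_constraint("balance", (0, 50000)))  # confirmed as-is
```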
Experimental evaluations on the BIRD dataset demonstrate that SPOTIT+ significantly enhances the realism of counterexamples while maintaining strong discrepancy-detection capabilities.
Compared to traditional test-based methods, SPOTIT+ uncovers a substantial number of additional discrepancies, providing a more accurate measure of Text-to-SQL model correctness. For instance, it identifies 7.4% more discrepancies on average than EX-TEST, the standard execution-based baseline.
Crucially, the integration of LLM-validated constraints ensures that these detected discrepancies are relevant and not artifacts of pathological, unrealistic database states.
Furthermore, SPOTIT+ remains highly efficient, with an average counterexample generation time of just 0.9 seconds, making it practical for large-scale evaluations. The system successfully encodes 93-97% of SQL pairs, indicating high coverage for complex queries.
Enterprise Process Flow
| Feature | Test-Based Evaluation | SPOTIT (Vanilla Verification) | SPOTIT+ (Verification with LLM-Validated Constraints) |
|---|---|---|---|
| Discrepancy Detection | ❌ Limited (misses logical non-equivalence due to fixed test data) | ✅ High (finds many discrepancies using formal methods) | ✅ Enhanced (finds realistic discrepancies, filters out unrealistic ones found by vanilla SPOTIT) |
| Counterexample Realism | N/A (no explicit counterexamples) | ⚠️ Can generate unrealistic or pathological counterexamples | ✨ Significantly improved realism through LLM-validated domain constraints |
| Database Constraints Used | Implicitly uses test database's data | Only explicit schema integrity constraints (PK/FK) | Explicit + mined (Range, Categorical, NotNull, Functional, Ordering) + LLM-validated |
| Average Verification Time | Fast (simple execution on fixed data) | 1.7 seconds | 0.9 seconds (more efficient due to constrained search space) |
Motivating Example: Realistic Counterexamples with LLM Validation
Consider a scenario where a generated SQL query uses a filter like DISTRICT.A11 > 8000, while the gold query uses DISTRICT.A11 BETWEEN 8000 AND 9000. The two disagree at A11 = 8000 (included by BETWEEN but not by >) and for any A11 above 9000 (included by > but not by BETWEEN). On a standard test database with no rows in either region, they yield identical results, causing test-based evaluation to incorrectly label them as equivalent.
Vanilla SPOTIT would find a counterexample, but it might populate categorical columns (e.g., DISTRICT.A2, ACCOUNT.FREQUENCY) with arbitrary, unrealistic values like '2147483648'. This highlights a discrepancy, but the counterexample itself is not practically relevant.
SPOTIT+ addresses this by integrating LLM-validated constraints. It mines constraints like `DISTRICT.A2` values belonging to fixed choices (e.g., 'Prague') and passes them to the verifier. This guides the counterexample generation to produce a database instance that is not only differentiating but also qualitatively more realistic, using plausible values for categorical data. As demonstrated in Figure 1, the LLM-validated counterexample provides a more meaningful insight into the real-world implications of the query discrepancy, without being overly restrictive.
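The effect of the mined categorical constraint can be sketched in stdlib Python. The real tool encodes such constraints into the SMT query; in this sketch they simply restrict the enumeration, and the A2 value set is assumed for illustration.

```python
# Sketch: constraining counterexample search to mined categorical values,
# so the differentiating row uses plausible data (value set assumed).

A2_CHOICES = ["Prague", "Brno", "Ostrava"]   # mined categorical constraint

def generated(row):
    return row["A11"] > 8000                 # generated query's filter

def gold(row):
    return 8000 <= row["A11"] <= 9000        # gold query's filter

def find_realistic_counterexample(a11_lo, a11_hi):
    for a11 in range(a11_lo, a11_hi + 1):
        for a2 in A2_CHOICES:                # only plausible categorical values
            row = {"A11": a11, "A2": a2}
            if generated(row) != gold(row):
                return row
    return None

print(find_realistic_counterexample(0, 20000))
# a row like {'A11': 8000, 'A2': 'Prague'} instead of an arbitrary '2147483648'
```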
Advanced ROI Calculator
Estimate the potential annual time and cost savings by implementing advanced AI-driven Text-to-SQL verification in your operations.
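As a back-of-the-envelope illustration of the calculation, the sketch below uses placeholder inputs you would replace with your own figures; only the 7.4% discrepancy gain comes from the research.

```python
# Hedged ROI sketch: all figures below are illustrative placeholders, not
# measurements from the paper (except the 7.4% average discrepancy gain).

def annual_savings(queries_per_year, discrepancy_rate, manual_review_hours,
                   hourly_cost):
    """Cost of manually reviewing the discrepant queries that automated
    verification would surface without human effort."""
    flagged = queries_per_year * discrepancy_rate
    return flagged * manual_review_hours * hourly_cost

# Example: 50,000 generated queries/year, 7.4% additional discrepancies
# caught, 0.5 h of manual review per discrepancy, $80/h reviewer cost.
print(f"${annual_savings(50_000, 0.074, 0.5, 80):,.0f}")
```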
Our Implementation Roadmap
A structured approach to integrating advanced AI capabilities into your enterprise.
Phase 1: Discovery & Strategy
In-depth analysis of existing Text-to-SQL workflows, data schemas, and evaluation bottlenecks. Define specific KPIs and tailor a verification strategy aligned with business objectives.
Phase 2: SPOTIT+ Integration
Deployment and configuration of SPOTIT+ within your existing infrastructure. Establish secure data connections and configure constraint extraction pipelines for your databases.
Phase 3: LLM Constraint Validation
Fine-tune LLM integration for domain-specific constraint validation and repair. Develop custom prompts and validate constraint realism with subject matter experts.
Phase 4: Continuous Evaluation & Optimization
Automate continuous verification, monitor model performance, and refine constraint sets. Implement feedback loops for iterative improvement of Text-to-SQL systems.
Ready to Transform Your Data Operations?
Schedule a personalized consultation with our AI experts to explore how SPOTIT+ can elevate the reliability and accuracy of your Text-to-SQL systems.