STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving
Revolutionizing Formal Theorem Proving with AI Self-Improvement
This paper introduces the Self-play Theorem Prover (STP), an LLM-based system for formal theorem proving that mimics how mathematicians work: it iteratively conjectures and proves. STP addresses the data scarcity challenge in LLM-based theorem proving by training a 'conjecturer' and a 'prover' in a self-play loop. The conjecturer generates new, challenging problems that are 'barely provable' by the current prover, and the prover then attempts to solve them. Correct proofs and conjectures of appropriate difficulty provide the training signal. Evaluated with Lean and Isabelle, STP significantly outperforms previous methods: it more than doubles the best result on Lean Workbook (from 13.2% to 28.5%) and achieves state-of-the-art performance on miniF2F-test (65.0%), ProofNet-test (23.9%), and PutnamBench (8/644 problems solved). This approach lets the system keep improving without additional human-labeled data.
Quantifiable Impact & Breakthroughs
STP's self-play mechanism delivers substantial improvements, setting new benchmarks in automated theorem proving across various formal systems and difficulty levels.
Deep Analysis & Enterprise Applications
STP operates as a two-role system: a conjecturer and a prover. The conjecturer generates new, related conjectures given a seed theorem and its proof. The prover attempts to prove these conjectures and existing statements. Critical to its self-improvement is the feedback loop where conjectures are selected based on their 'barely provable' nature, ensuring a continuously challenging curriculum.
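The loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: the conjecturer and prover are random stand-ins rather than LLMs, `Statement.difficulty` and `skill` are hypothetical scalars, and the 'barely provable' cutoff (`max_rate=0.25`) is an assumed threshold chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    name: str
    difficulty: float  # toy stand-in for how hard the statement is, in [0, 1]


def toy_conjecture(seed: Statement, rng: random.Random, n: int = 4) -> list[Statement]:
    """Hypothetical conjecturer: emit n variations near the seed's difficulty."""
    out = []
    for i in range(n):
        d = min(1.0, max(0.0, seed.difficulty + rng.uniform(-0.2, 0.4)))
        out.append(Statement(f"{seed.name}_conj{i}", d))
    return out


def toy_pass_rate(stmt: Statement, skill: float, rng: random.Random,
                  attempts: int = 16) -> float:
    """Hypothetical prover + verifier: fraction of sampled proofs that check."""
    p = min(1.0, max(0.0, 0.5 + skill - stmt.difficulty))
    return sum(rng.random() < p for _ in range(attempts)) / attempts


def is_barely_provable(rate: float, max_rate: float = 0.25) -> bool:
    """Assumed criterion: proved at least once, but only rarely."""
    return 0.0 < rate <= max_rate


def self_play_iteration(seeds, skill, rng):
    """One round: conjecture from seeds, attempt proofs, collect training data."""
    prover_data, conjecturer_data = [], []
    for seed in seeds:
        for conj in toy_conjecture(seed, rng):
            rate = toy_pass_rate(conj, skill, rng)
            if rate > 0.0:
                # At least one verified proof exists: usable to train the prover.
                prover_data.append((conj, rate))
            if is_barely_provable(rate):
                # Appropriately hard conjecture: usable to train the conjecturer.
                conjecturer_data.append((conj, rate))
    return prover_data, conjecturer_data


rng = random.Random(0)
seeds = [Statement("seed_a", 0.5), Statement("seed_b", 0.8)]
prover_data, conjecturer_data = self_play_iteration(seeds, skill=0.3, rng=rng)
print(len(prover_data), "provable conjectures;",
      len(conjecturer_data), "barely provable")
```

In this sketch the two roles feed each other: every conjecture with at least one verified proof becomes prover training data, while only the rarely proved ones are kept as conjecturer training data, which is what keeps the curriculum near the edge of the prover's ability.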
On the Lean Workbook dataset, STP achieved a 28.5% pass rate, more than doubling the previous best result of 13.2% from expert iteration. This improvement reflects STP's better scaling with increasing compute and its ability to generate high-quality training signals through conjectures. With Lean, the model also achieved state-of-the-art results on miniF2F-test (65.0%) and ProofNet-test (23.9%).
STP was also evaluated with Isabelle, starting from a Llemma-7b base model. Across 58 iterations, STP consistently scaled better than expert iteration and parallel sampling on Lean Workbook (translated to Isabelle). This supports the approach's effectiveness across different formal verifiers and its ability to keep improving.
Ablation studies confirmed that generated conjectures provide a denser training signal than the unproved statements in the existing dataset. While expert iteration struggles with sparse rewards on unproved statements, STP's conjecturer creates 'approachable yet challenging' problems, leading to significantly higher proof success rates during training. Re-training with these conjectures also boosts downstream performance on benchmarks.
The generated conjectures showcased STP's ability to create variations, extensions, and generalizations of existing theorems. Examples include strengthening inequalities, rephrasing divisibility properties in terms of modulo, and generalizing specific numerical values to variables, demonstrating a sophisticated understanding of mathematical concepts.
| Feature | STP (Self-play) | Expert Iteration |
|---|---|---|
| Training Data Source | Self-generated 'barely provable' conjectures plus the existing dataset | Unproved statements from the existing dataset |
| Reward Signal | Dense, curriculum-adaptive | Sparse (only correct proofs) |
| Learning Curve | Continuous self-improvement as compute scales | Plateaus quickly due to data scarcity |
| Compute Efficiency | Higher return on compute for proof generation (47% pass rate on generated conjectures) | Most compute spent on incorrect proofs (1.5% pass rate on unproved statements) |
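To make the compute-efficiency row concrete, the short calculation below compares the two pass rates from the table. It is only a back-of-the-envelope sketch: it treats each rate as a per-sample success probability, and the sampling budget of 1,000 is a hypothetical number, not one from the paper.

```python
# Pass rates taken from the comparison table above.
CONJECTURE_PASS_RATE = 0.47   # STP, on generated conjectures
UNPROVED_PASS_RATE = 0.015    # expert iteration, on unproved statements


def expected_verified(samples: int, pass_rate: float) -> float:
    """Expected number of sampled proofs that pass the verifier."""
    return samples * pass_rate


def samples_per_verified(pass_rate: float) -> float:
    """Expected number of samples needed to obtain one verified proof."""
    return 1.0 / pass_rate


BUDGET = 1000  # hypothetical number of sampled proofs

for label, rate in [("generated conjectures", CONJECTURE_PASS_RATE),
                    ("unproved statements", UNPROVED_PASS_RATE)]:
    print(f"{label}: ~{expected_verified(BUDGET, rate):.0f} verified proofs "
          f"per {BUDGET} samples, "
          f"~{samples_per_verified(rate):.1f} samples per verified proof")

print(f"ratio of pass rates: ~{CONJECTURE_PASS_RATE / UNPROVED_PASS_RATE:.0f}x")
```

Under these assumptions, 1,000 samples yield about 470 verified proofs on generated conjectures versus about 15 on unproved statements, roughly a 31-fold difference in usable training examples per unit of sampling.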
Example Conjecture: Generalizing Inequalities
From Specific to General
STP generated the conjecture $(1 + x)^{2n} > 1 + x^n$ from the seed statement $1 + x^2 < (1 + x)^2$. This showcases its ability to generalize existing theorems, creating new, more challenging problems that still leverage similar proof techniques (binomial expansion) but require a deeper understanding of variable constraints and integer properties. This iterative generalization is a cornerstone of mathematical development.
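As a worked check of the binomial-expansion argument, assume the conjecture reads $(1 + x)^{2n} > 1 + x^n$ with the constraints $x > 0$ and integer $n \ge 1$ (these constraints are an assumption here; the text above does not state them). Every term of the expansion is then positive, and dropping all but two of them gives the inequality:

```latex
\begin{align*}
(1 + x)^{2n}
  &= \sum_{k=0}^{2n} \binom{2n}{k} x^{k} \\
  &= 1 + \binom{2n}{n} x^{n}
     + \sum_{\substack{1 \le k \le 2n \\ k \ne n}} \binom{2n}{k} x^{k} \\
  &> 1 + \binom{2n}{n} x^{n} \;\ge\; 1 + x^{n}.
\end{align*}
```

The strict step uses that the remaining sum is nonempty for $n \ge 1$ and positive for $x > 0$; the last step uses $\binom{2n}{n} \ge 1$. The seed statement follows from the same expansion with exponent 2: $(1 + x)^2 = 1 + 2x + x^2 > 1 + x^2$ for $x > 0$.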
A fundamental hurdle in training powerful LLM theorem provers is the scarcity of high-quality, diverse, and challenging formal proofs. Unlike domains with vast existing datasets, formal mathematics requires specialized expertise to write proofs, which makes scaling data collection prohibitively expensive. STP addresses this directly by generating its own curriculum as training proceeds, rather than relying on a static dataset.