STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving

Revolutionizing Formal Theorem Proving with AI Self-Improvement

This paper introduces the Self-play Theorem Prover (STP), an LLM-based system for formal theorem proving that, like a working mathematician, alternates between conjecturing and proving. STP addresses the scarcity of training data for LLM-based theorem proving by training a 'conjecturer' and a 'prover' in a self-play loop. The conjecturer generates new problems that are 'barely provable' by the current prover, and the prover then attempts to solve them. Verified proofs, together with conjectures of appropriate difficulty, provide the training signal. Evaluated with both Lean and Isabelle, STP doubles the previous best result on Lean Workbook (from 13.2% to 28.5%) and reports state-of-the-art results on miniF2F-test (65.0%), ProofNet-test (23.9%), and PutnamBench (8/644), all at pass@3200. Because the loop produces its own training data, the model can keep improving without additional human-labeled proofs.

Quantifiable Impact & Breakthroughs

STP's self-play loop improves on prior results with both Lean and Isabelle, on benchmarks that range from Lean Workbook to the competition-level PutnamBench.

28.5% Lean Workbook Pass Rate

65.0% miniF2F-test Pass@3200

23.9% ProofNet-test Pass@3200

8/644 PutnamBench Solved (Pass@3200)

Deep Analysis & Enterprise Applications

Methodology

STP trains two roles: a conjecturer and a prover. Given a seed theorem and its proof, the conjecturer generates new, related conjectures. The prover attempts to prove both these conjectures and the statements in the existing dataset. The self-improvement comes from the feedback loop: conjectures are selected for training when they are 'barely provable' by the current prover, so the curriculum stays challenging as the prover gets stronger.
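The 'barely provable' selection described above can be sketched as a filter on each conjecture's empirical pass rate. This is a minimal illustration, not the paper's implementation: the `ConjectureStats` record, the function names, and the `max_rate=0.25` threshold are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class ConjectureStats:
    """Verification outcomes for one generated conjecture (hypothetical record)."""
    statement: str
    num_samples: int    # proofs sampled from the prover
    num_verified: int   # proofs accepted by the formal verifier

    @property
    def pass_rate(self) -> float:
        # Empirical pass rate; 0.0 when nothing was sampled.
        return self.num_verified / self.num_samples if self.num_samples else 0.0


def is_barely_provable(stats: ConjectureStats, max_rate: float = 0.25) -> bool:
    """Keep conjectures the prover solves sometimes, but not often.

    max_rate is an illustrative threshold, not a value taken from the source.
    """
    return 0.0 < stats.pass_rate <= max_rate


def select_conjectures(batch, max_rate: float = 0.25):
    """Return the conjectures that are neither unproved nor too easy."""
    return [c for c in batch if is_barely_provable(c, max_rate)]


batch = [
    ConjectureStats("never proved", 16, 0),
    ConjectureStats("rarely proved", 16, 2),
    ConjectureStats("usually proved", 16, 12),
]
print([c.statement for c in select_conjectures(batch)])
```

Conjectures that are never proved carry no verified signal, and ones that are usually proved are too easy to teach much, so the filter keeps only the middle band.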

Lean Results

On the Lean Workbook dataset, STP proved 28.5% of the statements, roughly doubling the previous best result of 13.2%, which was obtained with expert iteration. The gain reflects better scaling with training compute and the additional training signal supplied by generated conjectures. With Lean, the model also reports state-of-the-art results on miniF2F-test (65.0%) and ProofNet-test (23.9%).
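Pass rates such as pass@3200 are commonly read as "a problem counts as solved if at least one of k sampled proofs is accepted by the verifier". The sketch below computes that quantity under this reading, which is an assumption here since the page does not spell out the evaluation protocol. The `results` layout and the function name are hypothetical.

```python
def pass_at_k(results: dict[str, list[bool]], k: int) -> float:
    """Fraction of problems with at least one verified proof among the first k samples.

    results maps a problem name to per-sample verifier outcomes (True = accepted).
    """
    if not results:
        return 0.0
    solved = sum(1 for outcomes in results.values() if any(outcomes[:k]))
    return solved / len(results)


results = {
    "p1": [False, True],
    "p2": [False, False],
    "p3": [True],
    "p4": [False, False, True],
}
print(pass_at_k(results, k=2))  # p1 and p3 are solved within two samples
print(pass_at_k(results, k=3))  # p4 is additionally solved on its third sample
```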

Isabelle Results

STP was also evaluated with Isabelle, starting from a Llemma-7b base model. Across 58 iterations, STP scaled better than both expert iteration and parallel sampling on Lean Workbook statements translated to Isabelle. This suggests the approach is not tied to a single formal verifier.

Ablation Studies

Ablation studies indicate that generated conjectures provide a denser training signal than the unproved statements that expert iteration relies on. Expert iteration receives sparse rewards because the prover rarely succeeds on the remaining unproved theorems, whereas STP's conjecturer produces 'approachable yet challenging' problems that the prover solves far more often during training. Re-training on these conjectures also improves downstream benchmark performance.

Examples

The generated conjectures include variations, extensions, and generalizations of existing theorems. Examples include strengthening inequalities, rephrasing divisibility properties in terms of modulo, and generalizing specific numerical values to variables.

Enterprise Process Flow

Step 1: Conjecturing
Step 2: Proving (sampling)
Step 3: Verifying proofs
Step 4: Preparing training data (assigning reward)
Step 5: LLM training
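The five steps can be sketched as one iteration of a loop. This is a simplified illustration under stated assumptions: the four callables (`conjecture`, `prove`, `verify`, `train`), the sample count, and the `max_rate` threshold are hypothetical stand-ins, and a real system would call LLMs and a formal verifier such as Lean or Isabelle in their place.

```python
import itertools
from typing import Callable


def self_play_iteration(
    seeds: list[str],
    conjecture: Callable[[str], list[str]],
    prove: Callable[[str], str],
    verify: Callable[[str, str], bool],
    train: Callable[[list, list], None],
    samples_per_statement: int = 4,   # illustrative value
    max_rate: float = 0.25,           # illustrative 'barely provable' threshold
) -> dict[str, float]:
    # Step 1: conjecturing, one batch of new statements per seed.
    conjectures = [c for seed in seeds for c in conjecture(seed)]

    prover_data, pass_rates = [], {}
    for stmt in conjectures + seeds:
        # Step 2: proving (sampling) several candidate proofs.
        proofs = [prove(stmt) for _ in range(samples_per_statement)]
        # Step 3: verifying proofs with the formal checker.
        verified = [p for p in proofs if verify(stmt, p)]
        pass_rates[stmt] = len(verified) / samples_per_statement
        prover_data.extend((stmt, p) for p in verified)

    # Step 4: preparing training data; reward conjectures that were
    # proved at least once but not too often.
    conjecturer_data = [c for c in conjectures if 0 < pass_rates[c] <= max_rate]

    # Step 5: LLM training on the collected examples.
    train(prover_data, conjecturer_data)
    return pass_rates


# Toy stand-ins so the sketch runs end to end.
_attempt = itertools.count()
trained_on = []

pass_rates = self_play_iteration(
    seeds=["A"],
    conjecture=lambda seed: [seed + "'"],
    prove=lambda stmt: f"attempt-{next(_attempt)}",
    verify=lambda stmt, proof: stmt.endswith("'") and proof.endswith("-0"),
    train=lambda prover_data, conjecturer_data: trained_on.append(
        (prover_data, conjecturer_data)),
)
print(pass_rates)
```

Passing the roles in as callables keeps the control flow visible without committing to any particular model or verifier API.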

2x improvement over the previous best result on Lean Workbook (13.2% to 28.5%)

Feature	STP (Self-play)	Expert Iteration
Training Data Source	Self-generated 'barely provable' conjectures plus the existing dataset	Unproved statements from the existing dataset
Reward Signal	Dense, adapts to the prover's current ability	Sparse (only correct proofs are rewarded)
Learning Curve	Keeps improving as training compute increases in the reported runs	Plateaus quickly due to data scarcity
Compute Efficiency	Higher return on proof-sampling compute (47% pass rate on conjectures)	Most proof-sampling compute yields incorrect proofs (1.5% pass rate on unproved statements)
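To make the compute-efficiency row concrete, the two pass rates can be turned into an expected number of samples per verified proof. The sketch assumes the reported rates can be read as independent per-sample success probabilities, which the page does not state; under that assumption the expectation is 1/p. The function name is hypothetical.

```python
def expected_samples_per_proof(pass_rate: float) -> float:
    """Expected samples until the first verified proof, assuming each sample
    succeeds independently with probability pass_rate (geometric model)."""
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    return 1.0 / pass_rate


for label, rate in [
    ("generated conjectures", 0.47),
    ("unproved dataset statements", 0.015),
]:
    cost = expected_samples_per_proof(rate)
    print(f"{label}: about {cost:.1f} samples per verified proof")
```

Under this reading, a verified proof costs about 2.1 samples on generated conjectures versus about 66.7 on unproved dataset statements, roughly a 31x difference.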

Example Conjecture: Generalizing Inequalities

From Specific to General

From the seed statement 1 + x² < (1 + x)², STP generated the conjecture (1 + x)²ⁿ > 1 + xⁿ. This illustrates how the conjecturer turns a fixed-exponent inequality into a family parameterized by n. The new problem can still be attacked with a similar technique (binomial expansion), but the proof must also handle the constraints on x and the integer exponent n. Building harder problems from statements the prover can already handle is what keeps the curriculum moving.
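A quick numerical sanity check of both inequalities is sketched below. The hypotheses x > 0 and n ≥ 1 are assumptions added here, since the page does not list the constraints of the formal statements. Exact rational arithmetic avoids floating-point artifacts. This is a spot check on a finite grid, not a proof.

```python
from fractions import Fraction


def seed_holds(x) -> bool:
    """Seed statement: 1 + x^2 < (1 + x)^2."""
    return 1 + x**2 < (1 + x) ** 2


def conjecture_holds(x, n: int) -> bool:
    """Generated conjecture: (1 + x)^(2n) > 1 + x^n."""
    return (1 + x) ** (2 * n) > 1 + x**n


# Positive rationals p/q with p in 1..20 and q in 1..5 (assumed domain x > 0).
xs = [Fraction(p, q) for p in range(1, 21) for q in range(1, 6)]

assert all(seed_holds(x) for x in xs)
assert all(conjecture_holds(x, n) for x in xs for n in range(1, 11))

# Without the positivity assumption both statements can fail.
assert not seed_holds(Fraction(-1, 2))
assert not conjecture_holds(Fraction(-1, 2), 1)
```

The failing cases at x = -1/2 show why the constraints on x matter when the statement is formalized.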

A fundamental hurdle in training powerful LLM theorem provers is the scarcity of high-quality, diverse, and challenging formal proofs. Unlike other domains with vast datasets, mathematical proofs require specialized domain expertise, making scaling data collection prohibitively expensive. STP directly addresses this by dynamically generating its own curriculum, pushing beyond the limits of static datasets.

65.0% State-of-the-Art on miniF2F-test (Pass@3200)

Your AI Implementation Roadmap

Our structured approach ensures a seamless integration of advanced AI capabilities into your existing workflows, maximizing impact and minimizing disruption.

Phase 1: Discovery & Strategy

In-depth analysis of current reasoning processes, identification of high-impact areas, and co-creation of a tailored AI strategy aligned with your business objectives.

Phase 2: Pilot & Proof of Concept

Develop and deploy a pilot AI solution on a defined scope, demonstrating tangible value and gathering critical feedback for refinement. This phase often involves custom model training and integration with your data.

Phase 3: Scaled Deployment & Integration

Full-scale deployment across relevant departments, comprehensive integration with enterprise systems, and development of robust monitoring and management tools.

Phase 4: Optimization & Continuous Improvement

Ongoing performance monitoring, iterative model fine-tuning, and exploration of new AI capabilities to ensure sustained competitive advantage and evolving ROI.

Ready to Transform Your Enterprise Reasoning?

Connect with our AI experts to explore how self-improving theorem provers and advanced AI can unlock new capabilities for your organization.
