
VeruSAGE: A Study of Agent-Based Verification for Rust Systems

Unlocking System Proofs: LLMs Revolutionize Rust Verification

Our study introduces VeruSAGE-Bench, an 849-task benchmark, and showcases how specialized LLM agents can achieve 80%+ success rates in real-world Rust system verification.

Executive Impact: Bridging AI and Formal Verification

Large language models (LLMs) demonstrate impressive capabilities in code understanding and development, but their ability to rigorously reason and prove code correctness remains a key challenge for system software.

This research introduces VeruSAGE-Bench, a comprehensive benchmark of 849 proof tasks extracted from eight open-source Verus-verified Rust systems. We explore different agent systems to optimize LLM performance for system verification.

Our findings show that the best LLM-agent combination (Sonnet 4.5 with a hands-off approach) successfully completes over 80% of VeruSAGE-Bench tasks. This includes an 83% success rate on tasks from Atmosphere, a project not in training data.

Remarkably, the LLMs even proved 33 tasks that human experts had not yet finished, highlighting the potential of LLM-assisted development to accelerate the creation of verified system software.

849 Total Proof Tasks Analyzed
81% Top LLM Success Rate
33 Unfinished Human Tasks Solved
7.2 min Avg. Time per Task

Deep Analysis & Enterprise Applications


System proofs, as characterized by VeruSAGE-Bench, are significantly more complex than those in previous benchmarks like VerusBench. They involve over 50x more lines of specification, extensive code dependencies, proof annotations, and helper lemmas. Key differences include:

  • Huge Specifications: Anvil Controller (AC) tasks average 2,037 lines of specification spanning 235 functions, making comprehensive understanding challenging.
  • Few Loop Invariants: Many system projects (AC, NO, NR) contain no loops, meaning loop invariants, a staple of smaller proofs, are rare. When loops exist, their invariants can be extremely complex, sometimes exceeding 100 lines of code.
  • Heavy Reliance on Lemmas: System proofs use helper lemmas far more frequently (2.4 vs. 0.07 per task) to decompose complex tasks into smaller, manageable pieces.
  • Diverse Proof Strategies: VeruSAGE-Bench tasks exhibit a wide range of proof strategies and features, including assert_forall, bit-vector provers, and non-linear arithmetic, which are rarely or never used in smaller benchmarks.

These characteristics underscore the fundamental differences and increased complexity when moving from small programming problems to real-world system verification tasks.
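A minimal Verus-style sketch illustrates the proof features the benchmark counts: helper lemmas, an assert-forall block, and the non-linear arithmetic prover. The lemma names and properties here are hypothetical, and the snippet requires the Verus toolchain rather than plain rustc; syntax details may vary across Verus versions.

```rust
use vstd::prelude::*;

verus! {

// Hypothetical helper lemma: discharges a non-linear fact once,
// so callers need not each invoke the non-linear solver.
proof fn lemma_mod_bound(a: nat, b: nat)
    requires b > 0,
    ensures a % b < b,
{
    assert(a % b < b) by (nonlinear_arith)
        requires b > 0;
}

// A larger proof decomposes into a helper-lemma call inside an
// assert-forall block, mirroring the styles counted in the benchmark.
proof fn lemma_all_slots_in_range(n: nat, b: nat)
    requires b > 0,
    ensures forall|i: nat| i < n ==> #[trigger] (i % b) < b,
{
    assert forall|i: nat| i < n implies (i % b) < b by {
        lemma_mod_bound(i, b);
    }
}

} // verus!
```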

To effectively tackle the complexity of real-world system verification, we designed two distinct agentic approaches:

  • Hands-Off Approach: This uses a generic coding agent (like GitHub Copilot CLI) with a simple prompt. It grants the LLM access to the Verus standard library and two tools: Verus itself and a cheat checker. Surprisingly, this approach proved highly effective for powerful models like Claude Sonnet 4/4.5.
  • Hands-On Approach (VeruSAGE): This significantly expands upon AutoVerus, providing LLMs with detailed domain knowledge, verification-error debugging strategies, and a guided proof development methodology. Key improvements include:
    • Expanded Action Agents: Many new agents for logical reasoning (case-analysis, induction), arithmetic & solvers (bit-vector, non-linear), proof context (reveal-opaque, use-lemma), and quantifiers (instantiate-forall, instantiate-exists).
    • Two-Phase Plan-Then-Act: LLMs first plan based on errors and context, then select an action agent, rather than simple error-driven dispatch.
    • Sophisticated Candidate Selector: More nuanced criteria for accepting intermediate proof candidates, tolerating a temporary increase in errors when the change moves the proof toward resolution.
    • Enhanced Context Management: Comprehensive history tracking and concise code diffs (search-and-replace blocks) to focus LLMs and avoid token bloat.

The Hands-On approach proved beneficial for smaller models (o4-mini, GPT-5) that struggle with syntax and hallucinations, providing the necessary scaffolding and guidance. For more capable models (Sonnet 4/4.5), the Hands-Off approach allowed for greater autonomy and often faster proof development by permitting larger changes.
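The candidate-selector idea can be sketched as a small acceptance rule. This is a hypothetical simplification of the paper's selector, not its actual implementation: the `VerifyOutcome` struct, the `slack` parameter, and the progress signal (`verified_fns`) are all illustrative.

```rust
/// Summary of one Verus run on a candidate proof (illustrative fields).
#[derive(Clone, Copy, Debug)]
pub struct VerifyOutcome {
    pub errors: usize,       // remaining verification errors
    pub verified_fns: usize, // functions that fully verify
}

/// Hypothetical acceptance rule: a candidate with strictly fewer errors is
/// always kept; a candidate with up to `slack` extra errors is still kept
/// if more functions now verify, i.e. the regression is temporary progress.
pub fn accept_candidate(best: VerifyOutcome, new: VerifyOutcome, slack: usize) -> bool {
    if new.errors < best.errors {
        return true;
    }
    new.errors <= best.errors + slack && new.verified_fns > best.verified_fns
}
```

A strict selector would use only the first branch; the second branch is what lets the agent escape local minima where every small edit briefly makes the error count worse.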

Our experiments reveal significant insights into LLM capabilities for system verification:

  • High Success Rates: The best LLM-agent combination (Sonnet 4.5 + Hands-Off) achieved an impressive 81% success rate across all 849 VeruSAGE-Bench tasks. This includes 100% success on AL, NO, and VE projects, and 83% on the unseen Atmosphere (OS) project.
  • Solving Unfinished Human Tasks: Sonnet 4.5 successfully completed 33 proof tasks from the Atmosphere project that human experts had not yet finished, even proposing beneficial specification adjustments.
  • LLM Proofs vs. Human Proofs: LLM-generated proofs are often longer (median 24 lines vs. a human median of 9) and sometimes include unnecessary annotations. They also favor different strategies, using proof by contradiction more, and non-linear arithmetic provers less, than human experts.
  • LLM Failure Modes: Common failures include struggles with inductive invariants (AC projects), inability to handle procedural macros and hidden definitions (ST, NR projects), and broad syntax/hallucination issues (o4-mini).
  • Time and Cost: The Hands-Off approach, particularly with Sonnet 4.5, is fastest, completing 60% of tasks under 6 minutes. Hands-On approaches tend to be more expensive due to strict step-by-step guidance. Average task cost for the best combination is $5.61, taking 7.2 minutes.
  • Tool Support is Crucial: Even Hands-Off LLMs benefit immensely from Verus feedback, a cheat checker (reducing cheating rate from 14% to <1.5%), and access to the Verus standard library (vstd), which improves success rates significantly.

These findings underscore the potent combination of advanced LLMs and specialized agentic designs for tackling complex system-level verification challenges.
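The cheat checker mentioned above can be approximated by a scan for Verus escape hatches that make verification "succeed" without an actual proof. This is a hypothetical textual sketch, not the paper's tool; the token list is illustrative, and a real checker would inspect the parsed program rather than raw source lines.

```rust
/// Flags occurrences of proof escape hatches in Verus source text.
/// Returns one "line N: token" entry per hit (illustrative output format).
pub fn find_cheats(source: &str) -> Vec<String> {
    // Illustrative escape hatches: `assume` injects an unproven fact,
    // `admit` abandons a proof, `external_body` hides a function body
    // from verification entirely.
    const CHEATS: &[&str] = &["assume(", "admit()", "#[verifier::external_body]"];
    let mut hits = Vec::new();
    for (lineno, line) in source.lines().enumerate() {
        for token in CHEATS {
            if line.contains(token) {
                hits.push(format!("line {}: {}", lineno + 1, token));
            }
        }
    }
    hits
}
```

Running such a check after each candidate, and rejecting any candidate with hits, matches the reported effect of cutting the cheating rate from 14% to under 1.5%.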

81% Top LLM Success Rate on VeruSAGE-Bench (Sonnet 4.5 + Hands-Off)

Enterprise Process Flow

Verus Reports Errors → Candidate Selector → Planning Agent → Select Action Agent → Propose Proof Candidate → (back to Verus)

System Proofs: More Complex, Less Loopy

Characteristic                 | VerusBench | VeruSAGE-Bench
Total LoC (incl. dependencies) | 32         | 947
Spec LoC                       | 8          | 496
Proof LoC                      | 10         | 50
Loop Invariant Proofs (avg.)   | 8          | 1
Avg. # of Helper Lemmas        | 0.07       | 2.4

LLMs Augment Human Expertise

Sonnet 4.5 successfully proved 33 tasks not yet completed by human experts in the Atmosphere project. In one notable instance, it even suggested a specification adjustment for seq_skip_lemma that was later confirmed by human experts. Furthermore, when starting from human-provided partial proofs, Sonnet 4.5 could resolve 16 out of 17 incomplete tasks, reducing average proof development time from 7.3 to 4.7 minutes.

7.2 min Average Time per Task (Best LLM-Agent Combo)

Advanced ROI Calculator

Estimate the potential return on investment for integrating AI-powered verification into your enterprise workflows.


Your AI Verification Roadmap

A structured approach to integrating advanced AI verification into your development lifecycle.

Phase 1: Discovery & Strategy

Assess current verification processes, identify key challenges, and define AI integration goals. Develop a tailored strategy for pilot projects.

Phase 2: Pilot & Proof-of-Concept

Implement AI-powered agents on a selected subset of system proof tasks. Measure initial success metrics and gather feedback for refinement.

Phase 3: Integration & Scaling

Integrate AI agents into existing CI/CD pipelines. Expand to broader codebase, provide training for developers, and establish continuous improvement loops.

Phase 4: Optimization & Advanced Features

Refine agent performance, explore custom model fine-tuning, and leverage advanced features like automatic inductive invariant inference for maximum impact.

Ready to Transform Your Verification?

Unlock the full potential of AI-assisted formal verification for your Rust systems. Schedule a consultation with our experts to discuss a tailored implementation plan.
