
ENTERPRISE AI ANALYSIS

ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents

This paper introduces ContractSkill, a novel framework designed to improve the reliability and reusability of self-generated skills for web agents through deterministic verification, fault localization, and minimal patch repair.

Executive Impact & Key Metrics

ContractSkill significantly boosts the performance and reliability of AI agents, transforming how enterprises approach web automation and task execution.

28.1% VWA Success Rate (GLM)
+18.7 pts VWA Improvement over Self-Generated Skill (GLM)
77.5% MiniWoB Success Rate (GLM)

Deep Analysis & Enterprise Applications


Long-horizon web tasks seem like exactly the setting where skills should help most. Yet recent evidence already points to a striking mismatch between that intuition and current practice. SkillsBench [14] reports that curated skills raise average pass rate by 16.2 percentage points, whereas self-generated skills provide no benefit on average. For us, this is the real motivating anomaly. It suggests that the problem is not whether external procedural knowledge can help agents, but whether model-written skills are stable enough to serve as executable objects in the first place.

Our diagnosis is object-centered rather than pipeline-centered. The key bottleneck is not only skill generation quality, but the fact that web skills remain implicit and therefore cannot be checked or locally repaired. Most draft skills are loose textual objects: they omit explicit preconditions, blur step boundaries, leave success evidence underspecified, and provide no principled recovery policy when the page state diverges. Once execution fails, the agent typically rewrites the whole skill or abandons it altogether.

We therefore propose ContractSkill, which converts a draft skill into an executable contract and improves it through deterministic verification, fault localization, and minimal patch repair. Starting from a model-generated draft, ContractSkill compiles the skill into an explicit artifact, executes it under a deterministic verifier, localizes failure to a specific step and error type, and edits only the implicated selector, condition, recovery rule, or action argument. Instead of treating failure as a reason to regenerate the whole skill, the framework treats it as a local repair problem over a persistent procedural object. The repaired artifact can then be reused by the source model or consumed by a target model without regenerating the skill.
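To make the idea of an "explicit artifact" concrete, here is a minimal sketch of what a compiled skill contract could look like. The schema is a hypothetical illustration: the paper does not publish this exact format, and all field names are assumptions. The point is that every piece the text names, preconditions, step boundaries, success evidence, and recovery rules, becomes an explicit, checkable field.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical schema for a compiled skill contract. Field names are
# illustrative assumptions, not the paper's published artifact format.

@dataclass
class Step:
    name: str                        # explicit step boundary
    selector: str                    # element the action targets
    precondition: str                # page-state check before acting
    action: str                      # e.g. "click", "type"
    argument: Optional[str]          # action argument, if any
    success_evidence: str            # what the page must show afterwards
    recovery: Optional[str] = None   # local recovery rule on divergence

@dataclass
class SkillContract:
    goal: str
    steps: list[Step] = field(default_factory=list)
```

Because each field is explicit, a verifier can check a single step in isolation and a repair pass can edit one field without touching the rest of the skill.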

We test this claim from three complementary angles. First, on VisualWebArena, ContractSkill improves execution in realistic multimodal web interaction where naive self-generated skills fail. Second, in MiniWoB, controlled comparisons show that the gain comes from deterministic verification, fault localization, and minimal repair rather than rewriting alone. Third, under matched transfer layers, repaired artifacts remain reusable across GLM-4.6V and Qwen3.5-Plus after removing the source model from the loop. This is a benchmark-specific test of portability rather than a claim of full-benchmark generalization.

In short, VisualWebArena demonstrates real-world effectiveness, MiniWoB validates the mechanism under controlled conditions, and matched transfer layers show that repaired artifacts remain reusable across models once the source model is removed from the loop. This is a benchmark-specific portability claim, not one of full-benchmark generalization.

On VisualWebArena, ContractSkill reaches 28.1% success, compared with 12.5% for No-Skill and 9.4% for Self-Generated Skill (GLM-4.6V). With Qwen3.5-Plus, aggregating the packaged, final-verified artifacts raises ContractSkill to 37.5%, exceeding both No-Skill (28.1%) and Self-Generated Skill (10.9%).

On MiniWoB, ContractSkill is the strongest method for both models, reaching 77.5% on GLM and 81.0% on Qwen. Relative to Self-Generated Skill, the improvement is 11.0 points for GLM and 20.5 points for Qwen.

ContractSkill Workflow Overview

The ContractSkill pipeline proceeds through five stages, transforming a draft skill into a reusable, repairable artifact. This structured approach enables deterministic fault localization and minimal patch repair, improving skill reliability.

Draft Generation
Skill Compilation
Verification-Guided Loop
Minimal Patch Repair
Downstream Reuse
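The heart of the pipeline, stages 3 and 4, is a verify-localize-patch loop. The sketch below shows that control flow under stated assumptions: `run_verifier`, `propose_patch`, the step dictionaries, and the error codes are all hypothetical stand-ins for the framework's deterministic verifier and patch generator, not the paper's implementation.

```python
# Minimal sketch of the verify -> localize -> patch loop.
# All names and error codes are illustrative assumptions.

def run_verifier(skill):
    """Execute each step in order; return (failing_index, error_code) or None."""
    for i, step in enumerate(skill):
        if not step.get("ok", True):          # stand-in for real page execution
            return i, step.get("error", "ASSERTION_FAILED")
    return None                               # every step verified

def propose_patch(step, error_code):
    """Edit only the implicated field; leave the rest of the step intact."""
    patched = dict(step)
    if error_code == "SELECTOR_NOT_FOUND":
        patched["selector"] = patched.get("fallback_selector", patched["selector"])
    patched["ok"] = True                      # assume the local fix succeeds
    return patched

def repair_loop(skill, max_rounds=3):
    """Repair the skill in place of regenerating it; stop once it verifies."""
    for _ in range(max_rounds):
        fault = run_verifier(skill)
        if fault is None:
            return skill                      # verified artifact, reusable as-is
        idx, code = fault
        skill[idx] = propose_patch(skill[idx], code)
    return skill
```

Note that failure never triggers a rewrite of the whole skill: only the step the verifier implicates is touched, which is what makes the artifact a persistent procedural object.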

Significant Performance Boost on VisualWebArena

ContractSkill achieves 28.1% success on VisualWebArena (GLM-4.6V), a challenging multimodal web benchmark, significantly outperforming No-Skill (12.5%) and Self-Generated Skill (9.4%). This highlights its effectiveness in realistic web environments where naive self-generated skills often fail.

28.1% VWA Success Rate with ContractSkill (GLM-4.6V)

ContractSkill vs. Naive Self-Generated Skills

This comparison highlights the fundamental advantages of ContractSkill's approach over traditional self-generated skills, particularly in terms of stability, repairability, and cross-model reusability.

| Feature | Self-Generated Skill (Naive) | ContractSkill |
| --- | --- | --- |
| Representational Stability | Implicit, unstable | Explicit, verifiable artifact |
| Repair Mechanism | Full rewriting/abandonment | Deterministic verification, local patch |
| Reusability | Limited, model-specific | Cross-model portability (matched transfer) |
| Performance on VWA (GLM) | 9.4% success | 28.1% success |
| Performance on MiniWoB (GLM) | 66.5% success | 77.5% success |

The Role of Deterministic Verification and Local Repair

Experiments on MiniWoB confirm that the performance gains of ContractSkill are attributable to its structured repair mechanism, including deterministic verification, fault localization, and minimal patch application, rather than simply regenerating skills.

MiniWoB Case: Verifier-Guided Repair in Action

On MiniWoB, controlled comparisons show that the gain comes specifically from verifier-guided local repair rather than rewriting alone. For instance, ContractSkill reached 77.5% success for GLM, an 11.0 point improvement over Self-Generated Skill. This demonstrates that explicit verification, fault localization, and minimal patch operators are key drivers for improved performance.

MiniWoB results confirm: structured repair, not mere regeneration, is the key.

Advanced ROI Calculator

Estimate the potential annual savings and reclaimed hours for your enterprise by implementing ContractSkill-powered AI agents.


Implementation Roadmap

A structured approach to integrating ContractSkill into your enterprise workflows for maximum impact and minimal disruption.

Phase 1: Skill Compilation

Convert draft skills into explicit, executable contracted artifacts with defined steps and assertions.

Phase 2: Verifier-Guided Execution

Execute skills under a deterministic verifier, identifying failures with precise error codes.

Phase 3: Fault Localization

Pinpoint failures to specific steps and error types using localized diagnosis.

Phase 4: Minimal Patch Repair

Apply small, targeted edits (selector, condition, recovery) to fix identified faults.
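The four edit types named in the paper (selector, condition, recovery rule, action argument) suggest a small, closed set of patch operators, each keyed to a fault class. The sketch below is one hypothetical way to organize them; the operator names, step fields, and error codes are assumptions for illustration.

```python
# Sketch of minimal patch operators, one per fault class named in the text.
# Each operator returns a new step with exactly one field changed.

def patch_selector(step, new_selector):
    return {**step, "selector": new_selector}

def patch_condition(step, new_condition):
    return {**step, "precondition": new_condition}

def patch_recovery(step, rule):
    return {**step, "recovery": rule}

def patch_argument(step, new_arg):
    return {**step, "argument": new_arg}

# Hypothetical mapping from verifier error codes to the single operator
# allowed to fire, which keeps every repair minimal by construction.
PATCH_TABLE = {
    "SELECTOR_NOT_FOUND":  patch_selector,
    "PRECONDITION_FAILED": patch_condition,
    "UNEXPECTED_STATE":    patch_recovery,
    "ACTION_REJECTED":     patch_argument,
}
```

Restricting each error code to one operator is a design choice worth noting: it bounds the edit surface, so a repaired skill differs from its predecessor by a single auditable field.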

Phase 5: Cross-Model Reuse

Ensure repaired artifacts are portable and reusable across different agent models without regeneration.
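Cross-model reuse presupposes that the verified artifact is stored in a model-agnostic form. A plausible minimal mechanism, assumed here rather than taken from the paper, is plain JSON serialization: the source model's repaired contract is written to disk, and a different model loads and executes it without any regeneration step.

```python
import json

# Sketch: persist a verified artifact as plain JSON so a different model
# can consume it without regenerating the skill. The envelope fields
# ("version", "steps") are illustrative assumptions.

def save_artifact(steps, path):
    with open(path, "w") as f:
        json.dump({"version": 1, "steps": steps}, f)

def load_artifact(path):
    with open(path) as f:
        data = json.load(f)
    return data["steps"]
```

Because the artifact carries its own preconditions and success evidence, the consuming model needs no knowledge of how the skill was drafted or repaired.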

Ready to Transform Your Enterprise with AI?

Book a free consultation to explore how ContractSkill can revolutionize your web automation and agent capabilities.
