Enterprise AI Analysis
SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills
OpenClaw's ClawHub marketplace hosts over 13,000 community-contributed agent skills, and recent audits found that between 13% and 26% of them contain security vulnerabilities. Regex scanners miss obfuscated payloads; formal static analyzers cannot read the natural-language instructions in SKILL.md files, where prompt injection and social-engineering attacks hide. Neither approach handles both modalities. SkillSieve is a three-layer detection framework that applies progressively deeper analysis only where needed. Layer 1 runs regex, AST, and metadata checks through an XGBoost-based feature scorer, filtering roughly 86% of benign skills in under 40 ms on average at zero API cost. Layer 2 sends suspicious skills to an LLM, but instead of asking one broad question, it splits the analysis into four parallel sub-tasks (intent alignment, permission justification, covert behavior detection, cross-file consistency), each with its own prompt and structured output. Layer 3 puts high-risk skills before a jury of three different LLMs that vote independently and, if they disagree, debate before reaching a verdict. We evaluate on 49,592 real ClawHub skills and adversarial samples across five evasion techniques, running the full pipeline on a $440 ARM single-board computer. On a 400-skill labeled benchmark, SkillSieve achieves 0.800 F1, outperforming ClawVet's 0.421, at an average cost of $0.006 per skill. Code, data, and benchmark are open-sourced.
Key Performance Indicators for AI Agent Security
SkillSieve significantly enhances the detection of malicious AI agent skills, offering a cost-effective and robust solution for enterprise-grade security.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
SkillSieve: A Three-Layer Triage Framework
SkillSieve processes each skill package through up to three layers of progressively deeper analysis. This tiered approach ensures that expensive LLM calls are made only when necessary, optimizing both cost and efficiency for large-scale deployments.
Enterprise Process Flow
The core idea of triage allows SkillSieve to prioritize analysis, focusing deep, expensive LLM scrutiny on only the most suspicious cases, while quickly clearing the majority of benign skills through fast, zero-cost static checks.
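The triage flow can be pictured as a simple dispatcher. The sketch below is illustrative only: the threshold values and helper names (`run_ssd`, `run_jury`) are hypothetical stand-ins, not SkillSieve's published API.

```python
# Minimal sketch of the three-layer triage dispatch.
# Thresholds and helper names are illustrative, not from the paper.

LAYER1_CLEAR = 0.2     # below this, the skill is cleared as benign
LAYER2_ESCALATE = 0.7  # above this, Layer 2 escalates to the jury

def run_ssd(skill):
    """Stub: stands in for the four parallel LLM sub-tasks of Layer 2."""
    return 0.5

def run_jury(skill):
    """Stub: stands in for the three-model vote of Layer 3."""
    return "malicious"

def triage(skill, layer1_score, layer2_score=None):
    """Route a skill through progressively deeper analysis."""
    if layer1_score < LAYER1_CLEAR:
        return "benign"              # ~86% of skills stop here, zero API cost
    if layer2_score is None:
        layer2_score = run_ssd(skill)  # first LLM spend happens only here
    if layer2_score < LAYER2_ESCALATE:
        return "benign"
    return run_jury(skill)           # deepest, most expensive layer
```

The key design point is that cost grows only with suspicion: a skill that fails no static check never triggers an API call.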
Layer 1: Static Triage - Fast & Cost-Effective Filtering
Layer 1 is designed for high recall at low cost. It aims to pass ≥98% of truly malicious skills to Layer 2, accepting a higher false positive rate that subsequent layers will resolve. This layer incorporates: Pattern Matching (regex rules), AST Feature Extraction (system calls, network ops, entropy), Metadata Reputation (typosquatting, sensitive permissions), and SKILL.md Surface Statistics (instruction length, URLs, urgency language).
This initial stage processes each skill in under 40ms on average, with zero API cost, filtering approximately 86% of the total volume. This drastically reduces the number of skills requiring more intensive LLM analysis.
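To make the Layer 1 signals concrete, the toy scorer below combines regex pattern hits, Shannon entropy of script content (a common proxy for encoded payloads), and a SKILL.md surface statistic. The rule names, patterns, and linear weights are assumptions for illustration; the paper feeds features like these into an XGBoost model rather than a hand-weighted sum.

```python
import math
import re

# Illustrative Layer 1 signals: pattern hits, string entropy, and
# SKILL.md surface statistics. Patterns and weights are assumptions.

URGENCY = re.compile(r"\b(urgent|immediately|do not tell|ignore previous)\b", re.I)
NETWORK = re.compile(r"\b(curl|wget|requests\.(get|post)|urllib)\b")

def shannon_entropy(s: str) -> float:
    """Bits per character; high values often indicate encoded payloads."""
    if not s:
        return 0.0
    n = len(s)
    counts = {c: s.count(c) for c in set(s)}
    return -sum(k / n * math.log2(k / n) for k in counts.values())

def layer1_score(skill_md: str, script: str) -> float:
    """Toy linear scorer standing in for the XGBoost feature scorer."""
    score = 0.0
    score += 0.3 * bool(URGENCY.search(skill_md))      # urgency language
    score += 0.2 * bool(NETWORK.search(script))        # network operations
    score += 0.3 * (shannon_entropy(script) > 5.0)     # base64-like blobs
    score += 0.2 * (skill_md.count("http") > 3)        # many embedded URLs
    return score
```

Because every check here is local string processing, this stage runs on-device with no API cost, which is what makes the reported sub-40 ms average plausible.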
Layer 2: Structured Semantic Decomposition (SSD)
Natural language instructions in SKILL.md are a primary attack surface for prompt injection and social engineering. Posing a monolithic "is this malicious?" question to an LLM yields unreliable results. SkillSieve addresses this by decomposing semantic analysis into four parallel sub-tasks:
- Intent Alignment: Does what the skill claims to do match its instructions?
- Permission Justification: Are requested permissions reasonable for the stated purpose?
- Covert Behavior Detection: Are there instructions to hide actions, suppress error reporting, or bypass safety?
- Cross-File Consistency: Does the code in scripts/ implement what SKILL.md describes, or perform undeclared actions?
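The decomposition above can be sketched as four parallel queries whose structured answers are merged. In this sketch the prompt text, the `ask_llm` client, and the any-flag aggregation rule are all illustrative assumptions, not SkillSieve's actual prompts.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative SSD runner: four independent sub-task prompts, each
# returning a structured verdict that a simple rule then aggregates.

SUB_TASKS = {
    "intent_alignment": "Does the skill's description match its instructions?",
    "permission_justification": "Are the requested permissions justified?",
    "covert_behavior": "Any instructions to hide actions or bypass safety?",
    "cross_file_consistency": "Does scripts/ only do what SKILL.md declares?",
}

def ask_llm(task: str, prompt: str, skill: dict) -> dict:
    """Stub: a real client would call an LLM and parse its JSON output."""
    flagged = task in skill.get("suspicious_dimensions", [])
    return {"task": task, "flagged": flagged, "evidence": ""}

def run_ssd(skill: dict) -> dict:
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(
            lambda kv: ask_llm(kv[0], kv[1], skill), SUB_TASKS.items()))
    flags = [r["task"] for r in results if r["flagged"]]
    # Aggregation rule (an assumption): any flagged dimension escalates.
    return {"malicious": bool(flags), "flagged_dimensions": flags}
```

Each sub-task asks a narrow, answerable question, so a skill that passes three dimensions but fails one (say, covert behavior) still gets escalated rather than averaged away.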
| Metric | Single-Prompt LLM (Kimi 2.5) | SSD (Ours) |
|---|---|---|
| F1 Score | 0.746 | 0.800 |
| Precision | 1.000 | 0.752 |
| Recall | 0.596 | 0.854 |
| Missed Malicious Skills | 36 | 13 |
The SSD approach significantly outperforms single-prompt LLM analysis by tackling each security dimension independently, leading to higher recall and more robust detection of sophisticated attacks that monolithic judgments might miss.
Layer 3: Multi-LLM Jury Protocol - Robust Decision Making
Individual LLMs can exhibit systematic biases. A single-model verdict lacks a mechanism for quantifying uncertainty or resolving ambiguous cases. SkillSieve's Layer 3 protocol addresses this with a two-round, multi-LLM jury:
- Round 1: Independent Voting. Three independent LLMs (Kimi 2.5, MiniMax M2.7, DeepSeek-V3) analyze the skill and provide a structured JSON verdict. If all agree, the verdict is final.
- Round 2: Structured Debate. If jurors disagree, they receive each other's reasoning and evidence, and must either maintain or change their verdict, explicitly addressing counter-arguments. A majority vote (≥2/3) determines the verdict; if no majority, the skill is flagged for human review.
Jury Dynamics in Action
In our evaluation of 20 borderline skills, the debate mechanism activated in 7 out of 18 jury sessions (38.9%). In 3 cases, dissenting jurors changed their verdict to reach unanimous consensus. In 2 cases, a 2-to-1 majority determined the verdict. For the remaining 2 genuinely ambiguous cases (e.g., "verified-agent-identity-5"), no majority emerged, and the skill was correctly flagged for human review—exactly the intended behavior for truly complex scenarios.
This protocol provides a robust mechanism for cross-validating high-risk verdicts and ensures explainable reports with evidence chains from all three layers, increasing trust and accountability in AI agent security assessments.
Comprehensive Evaluation & Real-World Performance
SkillSieve was evaluated on 49,592 real ClawHub skills and a 400-skill labeled benchmark, demonstrating superior performance and efficiency compared to existing methods.
| Method | Precision | Recall | F1 | Accuracy | FPR |
|---|---|---|---|---|---|
| ClawVet [9] | 0.329 | 0.584 | 0.421 | 0.642 | 0.341 |
| SkillSieve L1 | 0.583 | 0.989 | 0.733 | 0.840 | 0.203 |
| + Single prompt | 1.000 | 0.596 | 0.746 | 0.910 | 0.000 |
| + SSD (ours) | 0.752 | 0.854 | 0.800 | 0.905 | 0.080 |
SkillSieve significantly outperforms baselines, with the full pipeline achieving an F1 score of 0.800 at an average cost of $0.006 per skill.
SkillSieve also demonstrates strong adversarial robustness, intercepting all five tested bypass techniques: encoding obfuscation, cross-file logic splitting, conditional triggers, homoglyph substitution, and time-delayed payloads. Each was caught by Layer 1's static analysis, Layer 2's semantic decomposition, or both.
| Technique | L1 Score | Caught by | L1 Rule |
|---|---|---|---|
| Encoding | 0.35 | L1+L2 | obfuscation |
| Cross-file | 0.40 | L1+L2 | credential_theft |
| Conditional | 0.70 | L1 | conditional_trigger |
| Homoglyph | 0.80 | L1+L2 | prompt_injection |
| Time-delay | 0.70 | L1 | time_delay |
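As one concrete example of a rule in this family, homoglyph substitution (e.g. a Cyrillic 'а' hidden inside an otherwise Latin identifier) can be flagged by detecting mixed Unicode scripts within a single token. This is a simplified sketch of the general technique, not SkillSieve's actual `prompt_injection` rule.

```python
import re
import unicodedata

# Simplified homoglyph check: flag tokens that mix Unicode scripts,
# e.g. a Cyrillic 'a' (U+0430) hidden inside a Latin word.

def scripts_in(token: str) -> set:
    """Collect the script prefix of each letter's Unicode name."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            scripts.add(name.split()[0])  # e.g. 'LATIN', 'CYRILLIC'
    return scripts

def has_homoglyph_mix(text: str) -> bool:
    """True if any word-like token mixes two or more scripts."""
    return any(len(scripts_in(tok)) > 1 for tok in re.findall(r"\w+", text))
```

A check like this is cheap enough to live in Layer 1, which matches the table above: homoglyph samples already score 0.80 before any LLM is consulted.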
The efficiency of Layer 1, running entirely on-device at zero API cost, enables SkillSieve to be deployed in resource-constrained environments like a $440 ARM single-board computer, making it practical for self-hosted deployment in air-gapped networks and CI/CD pipelines.
Calculate Your Potential AI Security ROI
Estimate the security improvements and cost savings SkillSieve could bring to your organization. Input your parameters to see the impact.
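Using the figures reported above (roughly 86% of skills cleared on-device at zero API cost, and an average end-to-end cost of $0.006 per skill), a back-of-the-envelope estimate looks like the sketch below. The per-escalated-skill figure is derived, not reported, and the split of LLM spend between Layers 2 and 3 is not given here, so all escalated skills are treated as one bucket.

```python
# Back-of-the-envelope scan cost, using the article's reported figures.

LAYER1_CLEAR_RATE = 0.86      # cleared on-device at zero API cost
AVG_COST_PER_SKILL = 0.006    # USD, reported full-pipeline average

def scan_cost(num_skills: int) -> dict:
    """Estimate total spend and how it concentrates on escalated skills."""
    escalated = round(num_skills * (1 - LAYER1_CLEAR_RATE))
    total = num_skills * AVG_COST_PER_SKILL
    return {
        "escalated_to_llm": escalated,
        "total_usd": round(total, 2),
        "usd_per_escalated_skill": round(total / max(escalated, 1), 4),
    }
```

At the scale of the 49,592-skill corpus, this works out to roughly $300 total, concentrated on the ~14% of skills that ever reach an LLM.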
Your Path to Enhanced AI Security
A typical implementation roadmap for integrating SkillSieve into your enterprise security framework.
Phase 1: Initial Assessment & Strategy
Detailed analysis of existing AI agent usage, security posture, and custom requirements. Development of a tailored integration strategy for SkillSieve.
Phase 2: Pilot Deployment & Customization
Deployment of SkillSieve in a controlled environment, customization of rules and LLM prompts to fit specific enterprise policies and agent ecosystems.
Phase 3: Full-Scale Integration & Training
Seamless integration into CI/CD pipelines and agent marketplaces. Comprehensive training for security teams on monitoring, incident response, and continuous optimization.
Phase 4: Continuous Monitoring & Optimization
Ongoing performance monitoring, regular updates to detection models, and adaptive tuning to counter evolving adversarial techniques.
Ready to Secure Your AI Agents?
Book a personalized strategy session to see how SkillSieve can be integrated into your enterprise workflows, protecting your AI agents from sophisticated attacks.