
Enterprise AI Analysis

SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills

OpenClaw's ClawHub marketplace hosts over 13,000 community-contributed agent skills, and recent audits find that between 13% and 26% of them contain security vulnerabilities. Regex scanners miss obfuscated payloads; formal static analyzers cannot read the natural-language instructions in SKILL.md files, where prompt injection and social-engineering attacks hide. Neither approach handles both modalities. SkillSieve is a three-layer detection framework that applies progressively deeper analysis only where needed. Layer 1 runs regex, AST, and metadata checks through an XGBoost-based feature scorer, filtering roughly 86% of benign skills in under 40ms on average at zero API cost. Layer 2 sends suspicious skills to an LLM, but instead of posing one broad question, it splits the analysis into four parallel sub-tasks (intent alignment, permission justification, covert behavior detection, cross-file consistency), each with its own prompt and structured output. Layer 3 puts high-risk skills before a jury of three different LLMs that vote independently and, if they disagree, debate before reaching a verdict. We evaluate on 49,592 real ClawHub skills and adversarial samples spanning five evasion techniques, running the full pipeline on a $440 ARM single-board computer. On a 400-skill labeled benchmark, SkillSieve achieves 0.800 F1, outperforming ClawVet's 0.421, at an average cost of $0.006 per skill. Code, data, and benchmark are open-sourced.

Key Performance Indicators for AI Agent Security

SkillSieve significantly enhances the detection of malicious AI agent skills, offering a cost-effective and robust solution for enterprise-grade security.

0.800 Overall F1 Score
$0.006 Average Cost per Skill
86% Benign Skills Filtered by Layer 1
49,592 Total ClawHub Skills Scanned
0.0 Cost Saving vs. Single LLM
0.733 Layer 1 Triage F1 Score

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Framework Overview
Layer 1: Static Triage
Layer 2: SSD
Layer 3: Multi-LLM Jury
Evaluation & Results

SkillSieve: A Three-Layer Triage Framework

SkillSieve processes each skill package through up to three layers of progressively deeper analysis. This tiered approach ensures that expensive LLM calls are made only when necessary, optimizing both cost and efficiency for large-scale deployments.

Enterprise Process Flow

Layer 1: Static Triage
Layer 2: Structured Semantic Decomposition
Layer 3: Multi-LLM Jury Protocol
Final Verdict

The core idea of triage allows SkillSieve to prioritize analysis, focusing deep, expensive LLM scrutiny on only the most suspicious cases, while quickly clearing the majority of benign skills through fast, zero-cost static checks.
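The escalation logic above can be sketched as a simple routing function. This is a minimal illustration, not the published implementation: the threshold values and the rule that only high-risk Layer 2 scores reach the jury are assumptions for the sketch.

```python
# Hypothetical thresholds; the paper's actual cut-offs are not specified here.
L1_SUSPICIOUS = 0.5   # static score above which a skill escalates to Layer 2
L2_HIGH_RISK = 0.8    # semantic score above which a skill faces the jury

def triage(skill, layer1, layer2, layer3):
    """Route a skill through progressively deeper layers, stopping early.

    layer1/layer2 return risk scores in [0, 1]; layer3 returns a final
    verdict. All three are placeholders for the real scorers.
    """
    if layer1(skill) < L1_SUSPICIOUS:
        return "benign"               # most skills exit here at zero API cost
    s2 = layer2(skill)
    if s2 >= L2_HIGH_RISK:
        return layer3(skill)          # only high-risk cases reach the jury
    return "malicious" if s2 >= 0.5 else "benign"
```

Because the cheap static check gates the pipeline, the expensive layers only ever see the minority of skills that Layer 1 could not confidently clear.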

Layer 1: Static Triage - Fast & Cost-Effective Filtering

Layer 1 is designed for high recall at low cost: it aims to pass ≥98% of truly malicious skills to Layer 2, accepting a higher false-positive rate that subsequent layers resolve. This layer incorporates:

  • Pattern Matching: regex rules
  • AST Feature Extraction: system calls, network ops, entropy
  • Metadata Reputation: typosquatting, sensitive permissions
  • SKILL.md Surface Statistics: instruction length, URLs, urgency language

86% of benign skills filtered at Layer 1 with zero API cost

This initial stage processes each skill in under 40ms on average, with zero API cost, filtering approximately 86% of the total volume. This drastically reduces the number of skills requiring more intensive LLM analysis.
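A toy version of the static feature extraction feeding the scorer might look like the following. The signature list and feature names are illustrative; the real rule set and the XGBoost model's features are not published in this summary.

```python
import ast, math, re
from collections import Counter

# Illustrative signatures only; the production rule set is far larger.
SUSPICIOUS_PATTERNS = [
    r"base64\.b64decode", r"eval\s*\(", r"curl\s+.*\|\s*(sh|bash)",
]

def shannon_entropy(text):
    """High entropy in string literals often indicates encoded payloads."""
    if not text:
        return 0.0
    counts = Counter(text)
    return -sum(c / len(text) * math.log2(c / len(text)) for c in counts.values())

def static_features(source):
    """Extract a small feature vector from a skill's script for a triage classifier."""
    features = {
        "regex_hits": sum(bool(re.search(p, source)) for p in SUSPICIOUS_PATTERNS),
        "max_literal_entropy": 0.0,
        "calls_exec": 0,
    }
    try:
        tree = ast.parse(source)
    except SyntaxError:
        features["parse_error"] = 1
        return features
    for node in ast.walk(tree):
        # Track the most "random-looking" string constant in the file.
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            features["max_literal_entropy"] = max(
                features["max_literal_entropy"], shannon_entropy(node.value))
        # Count direct eval/exec calls, a classic obfuscation entry point.
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in {"exec", "eval"}):
            features["calls_exec"] += 1
    return features
```

A vector like this can be fed to any gradient-boosted classifier; the key property is that it is computed locally in milliseconds, with no model API involved.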

Layer 2: Structured Semantic Decomposition (SSD)

Natural language instructions in SKILL.md are a primary attack surface for prompt injection and social engineering. Posing a monolithic "is this malicious?" question to an LLM yields unreliable results. SkillSieve addresses this by decomposing semantic analysis into four parallel sub-tasks:

  • Intent Alignment: Does what the skill claims to do match its instructions?
  • Permission Justification: Are requested permissions reasonable for the stated purpose?
  • Covert Behavior Detection: Are there instructions to hide actions, suppress error reporting, or bypass safety?
  • Cross-File Consistency: Does the code in scripts/ implement what SKILL.md describes, or perform undeclared actions?
SSD vs Single-Prompt LLM Analysis
Metric Single-Prompt LLM (Kimi 2.5) SSD (Ours)
F1 Score 0.746 0.800
Precision 1.000 0.752
Recall 0.596 0.854
Missed Malicious Skills 36 13

The SSD approach significantly outperforms single-prompt LLM analysis by tackling each security dimension independently, leading to higher recall and more robust detection of sophisticated attacks that monolithic judgments might miss.

Layer 3: Multi-LLM Jury Protocol - Robust Decision Making

Individual LLMs can exhibit systematic biases. A single-model verdict lacks a mechanism for quantifying uncertainty or resolving ambiguous cases. SkillSieve's Layer 3 protocol addresses this with a two-round, multi-LLM jury:

  1. Round 1: Independent Voting. Three LLMs (Kimi 2.5, MiniMax M2.7, DeepSeek-V3) analyze the skill independently and each return a structured JSON verdict. If all three agree, the verdict is final.
  2. Round 2: Structured Debate. If jurors disagree, they receive each other's reasoning and evidence, and must either maintain or change their verdict, explicitly addressing counter-arguments. A majority vote (≥2/3) determines the verdict; if no majority, the skill is flagged for human review.
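The two-round protocol can be sketched as follows. The `jurors` and `debate_round` callables stand in for real model calls, and the label set is an assumption for the sketch.

```python
from collections import Counter

def jury_verdict(skill, jurors, debate_round):
    """Two-round jury: a unanimous Round 1 is final; otherwise debate, then majority.

    `jurors` is a list of callables returning a verdict label; `debate_round(skill,
    votes)` re-polls each juror after it has seen the others' reasoning.
    Both are placeholders for real model calls.
    """
    votes = [juror(skill) for juror in jurors]    # Round 1: independent voting
    if len(set(votes)) == 1:
        return votes[0]                           # unanimous verdict is final
    votes = debate_round(skill, votes)            # Round 2: structured debate
    winner, count = Counter(votes).most_common(1)[0]
    if count >= 2:                                # majority of the 3-model jury
        return winner
    return "human_review"                         # no majority: escalate
```

Routing deadlocked cases to `"human_review"` rather than forcing a verdict is what produces the intended behavior on genuinely ambiguous skills.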

Jury Dynamics in Action

In our evaluation of 20 borderline skills, the debate mechanism activated in 7 out of 18 jury sessions (38.9%). In 3 cases, dissenting jurors changed their verdict to reach unanimous consensus. In 2 cases, a 2-to-1 majority determined the verdict. For the remaining 2 genuinely ambiguous cases (e.g., "verified-agent-identity-5"), no majority emerged, and the skill was correctly flagged for human review—exactly the intended behavior for truly complex scenarios.

This protocol provides a robust mechanism for cross-validating high-risk verdicts and ensures explainable reports with evidence chains from all three layers, increasing trust and accountability in AI agent security assessments.

Comprehensive Evaluation & Real-World Performance

SkillSieve was evaluated on 49,592 real ClawHub skills and a 400-skill labeled benchmark, demonstrating superior performance and efficiency compared to existing methods.

End-to-End Detection Performance (400 Labeled Skills)
Method  Precision  Recall  F1  Accuracy  FPR
ClawVet [9] 0.329 0.584 0.421 0.642 0.341
SkillSieve L1 0.583 0.989 0.733 0.840 0.203
+ Single prompt 1.000 0.596 0.746 0.910 0.000
+ SSD (ours) 0.752 0.854 0.800 0.905 0.080

SkillSieve significantly outperforms baselines, with the full pipeline achieving an F1 score of 0.800 at an average cost of $0.006 per skill.

SkillSieve also demonstrates strong adversarial robustness, successfully intercepting all five tested bypass techniques: encoding obfuscation, cross-file logic splitting, conditional triggers, homoglyph substitution, and time-delayed payloads, often caught by Layer 1's static analysis or Layer 2's semantic decomposition.

Adversarial Robustness: Per-Layer Interception
Technique    L1 Score  Caught By  L1 Rule
Encoding     0.35      L1+L2      obfuscation
Cross-file   0.40      L1+L2      credential_theft
Conditional  0.70      L1         conditional_trigger
Homoglyph    0.80      L1+L2      prompt_injection
Time-delay   0.70      L1         time_delay
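As one concrete example of the static side of these defenses, a minimal homoglyph check can flag non-ASCII letters from scripts commonly abused to impersonate Latin ones. This is an illustrative sketch, not the paper's actual `prompt_injection` rule.

```python
import unicodedata

# Scripts commonly abused for homoglyph substitution; a minimal illustrative list.
CONFUSABLE_SCRIPTS = ("CYRILLIC", "GREEK")

def homoglyph_flags(text):
    """Flag letters from confusable scripts, e.g. Cyrillic 'а' (U+0430)
    standing in for Latin 'a' inside an otherwise ASCII identifier."""
    flags = []
    for ch in text:
        if ord(ch) < 128 or not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if any(script in name for script in CONFUSABLE_SCRIPTS):
            flags.append((ch, name))
    return flags
```

For instance, `homoglyph_flags("p\u0430ypal")` flags the Cyrillic "а" that a pure ASCII regex would never distinguish from its Latin look-alike.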

The efficiency of Layer 1, running entirely on-device at zero API cost, enables SkillSieve to be deployed in resource-constrained environments like a $440 ARM single-board computer, making it practical for self-hosted deployment in air-gapped networks and CI/CD pipelines.

Calculate Your Potential AI Security ROI

Estimate the security improvements and cost savings SkillSieve could bring to your organization. Input your parameters to see the impact.


Your Path to Enhanced AI Security

A typical implementation roadmap for integrating SkillSieve into your enterprise security framework.

Phase 1: Initial Assessment & Strategy

Detailed analysis of existing AI agent usage, security posture, and custom requirements. Development of a tailored integration strategy for SkillSieve.

Phase 2: Pilot Deployment & Customization

Deployment of SkillSieve in a controlled environment, customization of rules and LLM prompts to fit specific enterprise policies and agent ecosystems.

Phase 3: Full-Scale Integration & Training

Seamless integration into CI/CD pipelines and agent marketplaces. Comprehensive training for security teams on monitoring, incident response, and continuous optimization.

Phase 4: Continuous Monitoring & Optimization

Ongoing performance monitoring, regular updates to detection models, and adaptive tuning to counter evolving adversarial techniques.

Ready to Secure Your AI Agents?

Book a personalized strategy session to see how SkillSieve can be integrated into your enterprise workflows, protecting your AI agents from sophisticated attacks.
