Skip to main content
Enterprise AI Analysis: Linear probes rely on textual evidence: Results from leakage mitigation studies in language models

AI SAFETY RESEARCH

Linear probes rely on textual evidence: Results from leakage mitigation studies in language models

By Gerard Boxo, Aman Neelappa, Shivam Raval • March 9, 2026

White-box monitors are a popular technique for detecting potentially harmful behaviours in language models. While they perform well in general, their effectiveness in detecting text-ambiguous behaviour is disputed. In this work, we find evidence that removing textual evidence of a behaviour significantly decreases probe performance. The AUROC reduction ranges from 10- to 30-point depending on the setting. We evaluate probe monitors across three setups (Sandbagging, Sycophancy, and Bias), finding that when probes rely on textual evidence of the target behaviour (such as system prompts or CoT reasoning), performance degrades once these tokens are filtered. This filtering procedure is standard practice for output monitor evaluation. As further evidence of this phenomenon, we train Model Organisms which produce outputs without any behaviour verbalisations. We validate that probe performance on Model Organisms is substantially lower than unfiltered evaluations: 0.57 vs 0.74 AUROC for Bias, and 0.57 vs 0.94 AUROC for Sandbagging. Our findings suggest that linear probes may be brittle in scenarios where they must detect non-surface-level patterns.

Executive Impact

Recent AI safety research has focused on developing monitoring systems to detect potentially dangerous behaviors (Amodei et al., 2016; Bengio et al., 2025). We find evidence that removing textual evidence of a behaviour significantly decreases probe performance.

Max AUROC Reduction
MO Bias AUROC (Mitigated)
MO Sandbagging AUROC (Chance)

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Impact of Leakage on Probe Performance
Leakage Mitigation Strategies
Probe Performance Across Scenarios
Why Leakage Matters for AI Safety
10-30% AUROC Reduction with Leakage Mitigation

Enterprise Process Flow

Content Filtering (Elicitation Substitution)
Score Filtering (Private Token Filtering)
Finetuned Model Organisms (Prompt Distillation Finetuning)
ScenarioBaseline AUROCContent Filtering AUROCScore Filtering AUROCModel Organism AUROC
Sandbagging0.950.620.810.57
Sycophancy0.990.610.900.51
Bias0.740.540.460.57

Understanding Brittle Monitors

Description: Controlling for leakage is crucial because current models are often prompted to exhibit harmful behaviors, creating artifacts that monitors can exploit (elicitation leakage). Additionally, models might verbalize harmful intentions, making surface-level cues detectable instead of genuine intent (reasoning leakage).

Challenge: Develop robust white-box monitors that detect internal representations of behavior, not just surface-level textual evidence, especially as LLMs evolve to use latent reasoning or internal languages.

Solution: Evaluate linear probes with rigorous leakage mitigation strategies (Content Filtering, Score Filtering, Model Organisms) to quantify their reliance on textual evidence and identify scenarios where they are brittle.

Result: Probes relying on textual evidence show substantial performance degradation when leakage is removed, indicating brittleness in detecting non-surface-level patterns and highlighting the need for more sophisticated monitoring approaches.

Advanced ROI Calculator

Estimate your potential efficiency gains and cost savings by implementing advanced AI monitoring solutions.

Annual Savings $0
Annual Hours Reclaimed 0

Implementation Roadmap

A typical phased approach to integrating advanced AI monitoring within your enterprise.

Phase 1: Assessment & Strategy

Conduct a thorough review of existing AI systems, identify critical behaviors to monitor, and define success metrics. Develop a tailored strategy for probe deployment and leakage mitigation.

Phase 2: Pilot & Validation

Implement probes on a pilot model or specific scenario. Validate performance against baseline and mitigated evaluations using techniques like Content and Score Filtering, ensuring robustness against textual leakage.

Phase 3: Integration & Scaling

Scale probe deployment across relevant models and integrate with existing MLOps pipelines. Continuously monitor performance and retrain probes as model behaviors evolve, potentially leveraging Model Organisms for advanced testing.

Ready to Enhance Your AI Oversight?

Book a personalized consultation to explore how our advanced AI monitoring solutions can safeguard your enterprise AI.

Ready to Get Started?

Book Your Free Consultation.

Let's Discuss Your AI Strategy!

Lets Discuss Your Needs


AI Consultation Booking