AI SAFETY RESEARCH

Linear probes rely on textual evidence: Results from leakage mitigation studies in language models

By Gerard Boxo, Aman Neelappa, Shivam Raval • March 9, 2026

White-box monitors are a popular technique for detecting potentially harmful behaviours in language models. While they perform well in general, their effectiveness in detecting text-ambiguous behaviour is disputed. In this work, we find evidence that removing textual evidence of a behaviour significantly decreases probe performance. The AUROC reduction ranges from 10- to 30-point depending on the setting. We evaluate probe monitors across three setups (Sandbagging, Sycophancy, and Bias), finding that when probes rely on textual evidence of the target behaviour (such as system prompts or CoT reasoning), performance degrades once these tokens are filtered. This filtering procedure is standard practice for output monitor evaluation. As further evidence of this phenomenon, we train Model Organisms which produce outputs without any behaviour verbalisations. We validate that probe performance on Model Organisms is substantially lower than unfiltered evaluations: 0.57 vs 0.74 AUROC for Bias, and 0.57 vs 0.94 AUROC for Sandbagging. Our findings suggest that linear probes may be brittle in scenarios where they must detect non-surface-level patterns.

Schedule Your Strategy Session

Executive Impact

Recent AI safety research has focused on developing monitoring systems to detect potentially dangerous behaviors (Amodei et al., 2016; Bengio et al., 2025). We find evidence that removing textual evidence of a behaviour significantly decreases probe performance.

Max AUROC Reduction

MO Bias AUROC (Mitigated)

MO Sandbagging AUROC (Chance)

Discuss Your AI Strategy

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Impact of Leakage on Probe Performance

Leakage Mitigation Strategies

Probe Performance Across Scenarios

Why Leakage Matters for AI Safety

10-30% AUROC Reduction with Leakage Mitigation

Enterprise Process Flow

Content Filtering (Elicitation Substitution)

→

Score Filtering (Private Token Filtering)

→

Finetuned Model Organisms (Prompt Distillation Finetuning)

Scenario	Baseline AUROC	Content Filtering AUROC	Score Filtering AUROC	Model Organism AUROC
Sandbagging	0.95	0.62	0.81	0.57
Sycophancy	0.99	0.61	0.90	0.51
Bias	0.74	0.54	0.46	0.57

Understanding Brittle Monitors

Description: Controlling for leakage is crucial because current models are often prompted to exhibit harmful behaviors, creating artifacts that monitors can exploit (elicitation leakage). Additionally, models might verbalize harmful intentions, making surface-level cues detectable instead of genuine intent (reasoning leakage).

Challenge: Develop robust white-box monitors that detect internal representations of behavior, not just surface-level textual evidence, especially as LLMs evolve to use latent reasoning or internal languages.

Solution: Evaluate linear probes with rigorous leakage mitigation strategies (Content Filtering, Score Filtering, Model Organisms) to quantify their reliance on textual evidence and identify scenarios where they are brittle.

Result: Probes relying on textual evidence show substantial performance degradation when leakage is removed, indicating brittleness in detecting non-surface-level patterns and highlighting the need for more sophisticated monitoring approaches.

Advanced ROI Calculator

Estimate your potential efficiency gains and cost savings by implementing advanced AI monitoring solutions.

Your Industry

Number of Employees (impacted by manual monitoring)

Avg. Weekly Hours (spent on manual tasks)

Avg. Hourly Rate ($)

Annual Savings $0

Annual Hours Reclaimed 0

Calculate Your Specific ROI

Implementation Roadmap

A typical phased approach to integrating advanced AI monitoring within your enterprise.

Phase 1: Assessment & Strategy

Conduct a thorough review of existing AI systems, identify critical behaviors to monitor, and define success metrics. Develop a tailored strategy for probe deployment and leakage mitigation.

Phase 2: Pilot & Validation

Implement probes on a pilot model or specific scenario. Validate performance against baseline and mitigated evaluations using techniques like Content and Score Filtering, ensuring robustness against textual leakage.

Phase 3: Integration & Scaling

Scale probe deployment across relevant models and integrate with existing MLOps pipelines. Continuously monitor performance and retrain probes as model behaviors evolve, potentially leveraging Model Organisms for advanced testing.

Begin Your AI Monitoring Journey

Ready to Enhance Your AI Oversight?

Book a personalized consultation to explore how our advanced AI monitoring solutions can safeguard your enterprise AI.

Book Your Free Consultation

AI SAFETY RESEARCH

Linear probes rely on textual evidence: Results from leakage mitigation studies in language models

Executive Impact

Deep Analysis & Enterprise Applications

Enterprise Process Flow

Understanding Brittle Monitors

Advanced ROI Calculator

Implementation Roadmap

Phase 1: Assessment & Strategy

Phase 2: Pilot & Validation

Phase 3: Integration & Scaling

Ready to Enhance Your AI Oversight?

Ready to Get Started?

Book Your Free Consultation.

Let's Discuss Your AI Strategy!

Lets Discuss Your Needs

Select Time Zone

Big Competitive Advantage With Ai

Learn More

Our Demos

Research Center

Jobs

Contact Us

1 888 985 3025

Solutions@OwnYourAi.com

Get Your Ai