Enterprise AI Analysis
Great Models Think Alike and this Undermines AI Oversight
Authored by Shashwat Goel, Joschka Strüber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping
As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as "AI Oversight". We study how model similarity affects both aspects of AI oversight by proposing Chance Adjusted Probabilistic Agreement (CAPA): a metric for LM similarity based on overlap in model mistakes. Using CAPA, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from "weak-to-strong generalization". As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend: model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.
Executive Impact: Key Findings for Enterprise AI
This paper introduces Chance Adjusted Probabilistic Agreement (CAPA), a novel metric for measuring Language Model (LM) similarity based on overlapping mistakes and output probabilities. The research highlights critical implications for 'AI Oversight': LLM-as-a-judge systems exhibit an 'affinity bias' towards similar models, and 'weak-to-strong generalization' in training benefits from dissimilar supervisors. Crucially, as LM capabilities increase, their mistakes become more correlated, posing significant risks from common blind-spots and correlated failures in AI oversight. The study advocates for reporting and correcting for model similarity to ensure safer and more effective AI development.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Introducing CAPA: Chance Adjusted Probabilistic Agreement
CAPA is introduced as a novel probabilistic metric that extends error consistency by accounting for probabilistic outputs and distinguishing between different incorrect predictions. It adjusts for chance agreement due to accuracy, making it a more robust measure of functional similarity between LMs. A minimal computational sketch follows the comparison table below.
| Metric | Adjusts for Accuracy | Distinguishes different mistakes | Incorporates Probabilities |
|---|---|---|---|
| %Flips = 1 − c_obs (Dutta et al., 2024) | ✗ | ✗ | ✗ |
| Cohen's κ, Scott's π, Fleiss κ | ✗ | ✗ | ✗ |
| %Agreement (Zheng et al., 2023) | ✗ | ✗ | ✗ |
| Error Consistency (Geirhos et al., 2020) | ✓ | ✗ | ✗ |
| Pearson / Matthews Correlation of Errors | ✗ | ✓ | ✗ |
| Divergence metrics like KL, JSD | ✗ | ✓ | ✓ |
| CAPA (Ours) | ✓ | ✓ | ✓ |
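For intuition, here is a minimal computational sketch of a CAPA-style metric. This is an illustrative reading, not the paper's exact formulation: observed agreement is the probability that the two models select the same option (the dot product of their per-question option distributions), and the chance baseline assumes independent models whose agreement on wrong answers is spread uniformly over the incorrect options. The function name `capa_sketch` and that uniform-split assumption are ours.

```python
import numpy as np

def capa_sketch(p1, p2, correct_idx):
    """Illustrative chance-adjusted probabilistic agreement between two models.

    p1, p2      : (n_questions, n_options) arrays with each model's probability
                  distribution over the answer options for every question.
    correct_idx : (n_questions,) array with the index of the correct option.
    """
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    idx = np.asarray(correct_idx, int)
    n, k = p1.shape

    # Observed agreement: probability both models pick the same option,
    # averaged over questions.
    c_obs = np.mean(np.sum(p1 * p2, axis=1))

    # Chance baseline implied by accuracy: both correct, or both wrong and
    # (by assumption here) landing on the same wrong option uniformly at random.
    a1 = p1[np.arange(n), idx]
    a2 = p2[np.arange(n), idx]
    c_exp = np.mean(a1 * a2 + (1 - a1) * (1 - a2) / (k - 1))

    # Kappa-style normalization: 1 = identical predictions,
    # 0 = no more agreement than the accuracy-implied chance level.
    return (c_obs - c_exp) / (1 - c_exp)
```

Like Cohen's κ and error consistency, the score is 1 for identical predictions and 0 when agreement matches the chance baseline; the differences lie in how that baseline and the observed agreement are defined.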
LLM-as-a-Judge Affinity Bias
LLM-as-a-judge systems exhibit a significant 'affinity bias', assigning higher scores to models that are more functionally similar to themselves, even when controlling for model accuracy. This finding generalizes previous self-preference results and indicates that excluding the judge model from rankings is insufficient; model similarity must also be accounted for.
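One way to probe for this bias in an existing evaluation setup, in the spirit of the paper's controlled analysis, is to test whether judge scores remain correlated with judge-model similarity after accuracy is removed. The sketch below assumes you already have, for each evaluated model, a judge score, a ground-truth accuracy, and a CAPA-style similarity to the judge; the function and variable names are illustrative.

```python
import numpy as np

def affinity_bias_signal(judge_scores, similarity_to_judge, accuracy):
    """Correlation between judge scores and judge-model similarity,
    after linearly removing the effect of ground-truth accuracy."""
    scores = np.asarray(judge_scores, float)
    sim = np.asarray(similarity_to_judge, float)
    acc = np.asarray(accuracy, float)

    X = np.column_stack([np.ones_like(acc), acc])  # intercept + accuracy

    # Residualize both quantities on accuracy via least squares.
    resid_scores = scores - X @ np.linalg.lstsq(X, scores, rcond=None)[0]
    resid_sim = sim - X @ np.linalg.lstsq(X, sim, rcond=None)[0]

    # A clearly positive value suggests the judge rewards functional
    # similarity to itself beyond what accuracy alone explains.
    return np.corrcoef(resid_scores, resid_sim)[0, 1]
```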
Complementary Knowledge in Weak-to-Strong Generalization
Performance gains from 'weak-to-strong generalization' are larger when the weak supervisor and the strong student are less similar. This highlights the crucial role of complementary knowledge in raising the performance ceiling: dissimilar supervisors give the student more to learn, as illustrated in the selection sketch below.
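In practice this suggests a simple selection heuristic, not prescribed by the paper but consistent with its finding: among weak supervisors of comparable accuracy, prefer the one with the lowest similarity to the strong student. The model names and scores below are hypothetical.

```python
def pick_supervisor(student_id, candidates, similarity):
    """Among candidate weak supervisors, pick the one least similar to the
    strong student (lowest CAPA-style score, i.e. most complementary knowledge).

    similarity : dict mapping (student, candidate) -> similarity score.
    """
    return min(candidates, key=lambda c: similarity[(student_id, c)])


# Hypothetical example:
similarity = {("strong-student", "weak-a"): 0.41,
              ("strong-student", "weak-b"): 0.17}
print(pick_supervisor("strong-student", ["weak-a", "weak-b"], similarity))  # -> weak-b
```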
Increasing Error Correlation with LM Capabilities
As frontier LMs become more capable, their mistakes are becoming increasingly similar, as captured by CAPA. This trend indicates a rising risk of correlated failures and common blind-spots across advanced AI systems.
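A rough way to check for the same trend within your own model fleet, assuming pairwise CAPA-style scores and accuracies are already computed, is to correlate each pair's similarity with its capability. The function below is an illustrative diagnostic, not the paper's analysis.

```python
import numpy as np

def similarity_capability_trend(pair_similarity, pair_mean_accuracy):
    """Pearson correlation between pairwise similarity and pair capability
    (mean accuracy of the two models). A clearly positive value indicates
    that more capable pairs make more similar mistakes."""
    return np.corrcoef(np.asarray(pair_similarity, float),
                       np.asarray(pair_mean_accuracy, float))[0, 1]
```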
Scenario: Correlated AI Failures
Consider an oversight pipeline in which one frontier model generates outputs while others judge them or produce training annotations. If their capabilities, and with them their blind spots, overlap heavily, an error that slips past one model is likely to slip past the rest.
Implication for Your Business:
If model errors converge, AI oversight mechanisms that rely on diverse models to catch failures might become less effective. This poses significant safety risks, as a single vulnerability could affect multiple advanced AI systems simultaneously. It underscores the urgency of proactively reporting and correcting for model similarity to keep AI oversight robust.
Calculate Your Potential AI Efficiency Gains
Estimate how much time and cost your enterprise could save by optimizing AI oversight and training based on model similarity insights.
Your AI Oversight Implementation Roadmap
A phased approach to integrate model similarity insights for enhanced AI evaluation, training, and risk mitigation.
Phase 1: CAPA Integration & Baseline
Integrate CAPA into existing evaluation pipelines for a comprehensive understanding of model similarity. Establish baseline similarity metrics for all deployed and candidate LMs. This phase provides foundational data for identifying affinity biases and diversity gaps.
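A concrete starting point for this phase, assuming per-model option probabilities on a fixed evaluation set and a pairwise similarity function such as the `capa_sketch` above, is a fleet-wide similarity matrix. The helper below is a sketch, not a prescribed implementation.

```python
import itertools
import numpy as np

def pairwise_capa(model_probs, correct_idx, capa_fn):
    """Build a symmetric model-by-model similarity matrix.

    model_probs : dict mapping model name -> (n_questions, n_options) probability array.
    capa_fn     : a CAPA-style pairwise similarity function (e.g. capa_sketch above).
    """
    names = sorted(model_probs)
    matrix = np.eye(len(names))  # self-similarity on the diagonal
    for (i, a), (j, b) in itertools.combinations(enumerate(names), 2):
        matrix[i, j] = matrix[j, i] = capa_fn(model_probs[a], model_probs[b], correct_idx)
    return names, matrix
```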
Phase 2: Bias-Adjusted AI Oversight
Implement mechanisms to adjust LLM-as-a-judge scores for affinity bias using CAPA. Prioritize model diversity in evaluation ensembles. For training, strategically pair strong student models with weak supervisors exhibiting low CAPA scores to maximize 'weak-to-strong generalization' gains.
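One coarse way to operationalize the score adjustment, not prescribed by the paper, is to regress judge scores on judge-model similarity and rank models by the residual, so that similarity to the judge no longer confers an advantage. A minimal sketch with illustrative names:

```python
import numpy as np

def similarity_adjusted_scores(judge_scores, similarity_to_judge):
    """Remove the linear component of judge scores explained by similarity
    to the judge; rank models by what remains."""
    scores = np.asarray(judge_scores, float)
    sim = np.asarray(similarity_to_judge, float)

    X = np.column_stack([np.ones_like(sim), sim])  # intercept + similarity
    coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return scores - X @ coef  # residual scores: similarity advantage removed
```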
Phase 3: Continuous Monitoring & Diversification
Continuously monitor model similarity trends using CAPA as capabilities evolve. Actively seek and integrate models with low similarity to reduce correlated failure risks and enhance overall system robustness. Explore architectural and training interventions to promote diversity.
Ready to Future-Proof Your AI Strategy?
Understand and mitigate the risks of correlated AI failures and leverage model diversity for superior performance. Our experts can help you implement advanced oversight and training methodologies.