Enterprise AI Analysis
Great Models Think Alike and this Undermines AI Oversight
Authored by Shashwat Goel, Joschka Strüber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping
As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as "AI Oversight". We study how model similarity affects both aspects of AI oversight by proposing Chance Adjusted Probabilistic Agreement (CAPA): a metric for LM similarity based on overlap in model mistakes. Using CAPA, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from "weak-to-strong generalization". As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend: model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.
Executive Impact: Key Findings for Enterprise AI
This paper introduces Chance Adjusted Probabilistic Agreement (CAPA), a novel metric for measuring Language Model (LM) similarity based on overlapping mistakes and output probabilities. The research highlights critical implications for 'AI Oversight': LLM-as-a-judge systems exhibit an 'affinity bias' towards similar models, and 'weak-to-strong generalization' in training benefits from dissimilar supervisors. Crucially, as LM capabilities increase, their mistakes become more correlated, posing significant risks from common blind-spots and correlated failures in AI oversight. The study advocates for reporting and correcting for model similarity to ensure safer and more effective AI development.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Introducing CAPA: Chance Adjusted Probabilistic Agreement
CAPA is introduced as a novel probabilistic metric that extends error consistency by accounting for probabilistic outputs and distinguishing between different incorrect predictions. It adjusts for chance agreement due to accuracy, making it a more robust measure of functional similarity between LMs. A minimal computational sketch follows the comparison table below.
| Metric | Adjusts for Accuracy | Distinguishes different mistakes | Incorporates Probabilities |
|---|---|---|---|
| %Flips = 1 − c_obs (Dutta et al., 2024) | ✗ | ✗ | ✗ |
| Cohen's κ, Scott's π, Fleiss κ | ✗ | ✗ | ✗ |
| %Agreement (Zheng et al., 2023) | ✗ | ✗ | ✗ |
| Error Consistency (Geirhos et al., 2020) | ✓ | ✗ | ✗ |
| Pearson / Matthews Correlation of Errors | ✗ | ✓ | ✗ |
| Divergence metrics like KL, JSD | ✗ | ✓ | ✓ |
| CAPA (Ours) | ✓ | ✓ | ✓ |
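For intuition, here is a minimal computational sketch of a CAPA-style metric. This is an illustrative reading, not the paper's exact formulation: observed agreement is the probability that the two models select the same option (the dot product of their per-question option distributions), and the chance baseline assumes independent models whose agreement on wrong answers is spread uniformly over the incorrect options. The function name `capa_sketch` and that uniform-split assumption are ours.

```python
import numpy as np

def capa_sketch(p1, p2, correct_idx):
    """Illustrative chance-adjusted probabilistic agreement between two models.

    p1, p2      : (n_questions, n_options) arrays with each model's probability
                  distribution over the answer options for every question.
    correct_idx : (n_questions,) array with the index of the correct option.
    """
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    idx = np.asarray(correct_idx, int)
    n, k = p1.shape

    # Observed agreement: probability both models pick the same option,
    # averaged over questions.
    c_obs = np.mean(np.sum(p1 * p2, axis=1))

    # Chance baseline implied by accuracy: both correct, or both wrong and
    # (by assumption here) landing on the same wrong option uniformly at random.
    a1 = p1[np.arange(n), idx]
    a2 = p2[np.arange(n), idx]
    c_exp = np.mean(a1 * a2 + (1 - a1) * (1 - a2) / (k - 1))

    # Kappa-style normalization: 1 = identical predictions,
    # 0 = no more agreement than the accuracy-implied chance level.
    return (c_obs - c_exp) / (1 - c_exp)
```

Like Cohen's κ and error consistency, the score is 1 for identical predictions and 0 when agreement matches the chance baseline; the differences lie in how that baseline and the observed agreement are defined.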
LLM-as-a-Judge Affinity Bias
LLM-as-a-judge systems exhibit a significant 'affinity bias', assigning higher scores to models that are more functionally similar to themselves, even when controlling for model accuracy. This finding generalizes previous self-preference results and indicates that excluding the judge model from rankings is insufficient; model similarity must also be accounted for.
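One way to probe for this bias in an existing evaluation setup, in the spirit of the paper's controlled analysis, is to test whether judge scores remain correlated with judge-model similarity after accuracy is removed. The sketch below assumes you already have, for each evaluated model, a judge score, a ground-truth accuracy, and a CAPA-style similarity to the judge; the function and variable names are illustrative.

```python
import numpy as np

def affinity_bias_signal(judge_scores, similarity_to_judge, accuracy):
    """Correlation between judge scores and judge-model similarity,
    after linearly removing the effect of ground-truth accuracy."""
    scores = np.asarray(judge_scores, float)
    sim = np.asarray(similarity_to_judge, float)
    acc = np.asarray(accuracy, float)

    X = np.column_stack([np.ones_like(acc), acc])  # intercept + accuracy

    # Residualize both quantities on accuracy via least squares.
    resid_scores = scores - X @ np.linalg.lstsq(X, scores, rcond=None)[0]
    resid_sim = sim - X @ np.linalg.lstsq(X, sim, rcond=None)[0]

    # A clearly positive value suggests the judge rewards functional
    # similarity to itself beyond what accuracy alone explains.
    return np.corrcoef(resid_scores, resid_sim)[0, 1]
```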
Complementary Knowledge in Weak-to-Strong Generalization
Performance gains from 'weak-to-strong generalization' are larger when the weak supervisor and the strong student are less similar. This highlights the crucial role of complementary knowledge in raising the performance ceiling: dissimilar supervisors give the student more to learn, as illustrated in the selection sketch below.
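In practice this suggests a simple selection heuristic, not prescribed by the paper but consistent with its finding: among weak supervisors of comparable accuracy, prefer the one with the lowest similarity to the strong student. The model names and scores below are hypothetical.

```python
def pick_supervisor(student_id, candidates, similarity):
    """Among candidate weak supervisors, pick the one least similar to the
    strong student (lowest CAPA-style score, i.e. most complementary knowledge).

    similarity : dict mapping (student, candidate) -> similarity score.
    """
    return min(candidates, key=lambda c: similarity[(student_id, c)])


# Hypothetical example:
similarity = {("strong-student", "weak-a"): 0.41,
              ("strong-student", "weak-b"): 0.17}
print(pick_supervisor("strong-student", ["weak-a", "weak-b"], similarity))  # -> weak-b
```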
Increasing Error Correlation with LM Capabilities
As frontier LMs become more capable, their mistakes are becoming increasingly similar, as captured by CAPA. This trend indicates a rising risk of correlated failures and common blind-spots across advanced AI systems.
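A rough way to check for the same trend within your own model fleet, assuming pairwise CAPA-style scores and accuracies are already computed, is to correlate each pair's similarity with its capability. The function below is an illustrative diagnostic, not the paper's analysis.

```python
import numpy as np

def similarity_capability_trend(pair_similarity, pair_mean_accuracy):
    """Pearson correlation between pairwise similarity and pair capability
    (mean accuracy of the two models). A clearly positive value indicates
    that more capable pairs make more similar mistakes."""
    return np.corrcoef(np.asarray(pair_similarity, float),
                       np.asarray(pair_mean_accuracy, float))[0, 1]
```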
Scenario: Correlated AI Failures
Consider an oversight pipeline in which one frontier model generates outputs while others judge them or produce training annotations. If their capabilities, and with them their blind spots, overlap heavily, an error that slips past one model is likely to slip past the rest.
Implication for Your Business:
If model errors converge, AI oversight mechanisms that rely on diverse models to catch failures might become less effective. This poses significant safety risks, as a single vulnerability could affect multiple advanced AI systems simultaneously. It underscores the urgency of proactively reporting and correcting for model similarity to keep AI oversight robust.
Calculate Your Potential AI Efficiency Gains
Estimate how much time and cost your enterprise could save by optimizing AI oversight and training based on model similarity insights.
Your AI Oversight Implementation Roadmap
A phased approach to integrate model similarity insights for enhanced AI evaluation, training, and risk mitigation.
Phase 1: CAPA Integration & Baseline
Integrate CAPA into existing evaluation pipelines for a comprehensive understanding of model similarity. Establish baseline similarity metrics for all deployed and candidate LMs. This phase provides foundational data for identifying affinity biases and diversity gaps.
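A concrete starting point for this phase, assuming per-model option probabilities on a fixed evaluation set and a pairwise similarity function such as the `capa_sketch` above, is a fleet-wide similarity matrix. The helper below is a sketch, not a prescribed implementation.

```python
import itertools
import numpy as np

def pairwise_capa(model_probs, correct_idx, capa_fn):
    """Build a symmetric model-by-model similarity matrix.

    model_probs : dict mapping model name -> (n_questions, n_options) probability array.
    capa_fn     : a CAPA-style pairwise similarity function (e.g. capa_sketch above).
    """
    names = sorted(model_probs)
    matrix = np.eye(len(names))  # self-similarity on the diagonal
    for (i, a), (j, b) in itertools.combinations(enumerate(names), 2):
        matrix[i, j] = matrix[j, i] = capa_fn(model_probs[a], model_probs[b], correct_idx)
    return names, matrix
```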
Phase 2: Bias-Adjusted AI Oversight
Implement mechanisms to adjust LLM-as-a-judge scores for affinity bias using CAPA. Prioritize model diversity in evaluation ensembles. For training, strategically pair strong student models with weak supervisors exhibiting low CAPA scores to maximize 'weak-to-strong generalization' gains.
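One coarse way to operationalize the score adjustment, not prescribed by the paper, is to regress judge scores on judge-model similarity and rank models by the residual, so that similarity to the judge no longer confers an advantage. A minimal sketch with illustrative names:

```python
import numpy as np

def similarity_adjusted_scores(judge_scores, similarity_to_judge):
    """Remove the linear component of judge scores explained by similarity
    to the judge; rank models by what remains."""
    scores = np.asarray(judge_scores, float)
    sim = np.asarray(similarity_to_judge, float)

    X = np.column_stack([np.ones_like(sim), sim])  # intercept + similarity
    coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return scores - X @ coef  # residual scores: similarity advantage removed
```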
Phase 3: Continuous Monitoring & Diversification
Continuously monitor model similarity trends using CAPA as capabilities evolve. Actively seek and integrate models with low similarity to reduce correlated failure risks and enhance overall system robustness. Explore architectural and training interventions to promote diversity.
Ready to Future-Proof Your AI Strategy?
Understand and mitigate the risks of correlated AI failures and leverage model diversity for superior performance. Our experts can help you implement advanced oversight and training methodologies.