ENTERPRISE AI ANALYSIS

Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases

As multimodal large language models (MLLMs) become integral to enterprise systems, understanding their safety under adversarial conditions is paramount. Our comprehensive two-phase study reveals that MLLM harmlessness is neither uniform nor stable across updates. We found significant alignment drift, shifting modality effects, and critical differences in refusal strategies across eight leading model releases. This highlights the urgent need for continuous, multimodal safety benchmarking to track evolving vulnerabilities and ensure robust AI deployment.

Executive Impact: Key Takeaways for Your Enterprise

For enterprises deploying or building with MLLMs, this analysis offers critical insights into the dynamic nature of AI safety. Expect model vulnerabilities and safety behaviors to shift with each update, requiring a proactive, longitudinal approach to red teaming. Relying on single-timepoint or text-only evaluations can leave your systems exposed to evolving multimodal attack surfaces and unpredictable alignment drift.

82,256 Human Harm Ratings
726 Adversarial Prompts
26 Expert Red Teamers
2 MLLM Generations Evaluated

Deep Analysis & Enterprise Applications

The sections below dive deeper into the specific findings of the research, reframed as enterprise-focused analyses.

Underexplored MLLM Safety Under Adversarial Prompting

Despite the rapid integration of MLLMs into consumer and enterprise products, their safety under adversarial prompting remains largely unexamined. Multimodal inputs introduce new attack surfaces, and existing safety mechanisms may not generalize to them, leaving a critical gap in our understanding of robust alignment.

Enterprise Process Flow

Fixed Benchmark of 726 Adversarial Prompts
Authored by 26 Expert Red Teamers
Phase 1: Evaluate Initial MLLM Generation (4 Models)
Phase 2: Evaluate Successor MLLM Generation (4 Models)
82,256 Human Harm Ratings & Analysis
Longitudinal Comparison: Alignment Drift & Modality Shifts

We conducted a rigorous two-phase evaluation using an identical, fixed benchmark of 726 adversarial prompts (half text-only, half multimodal), crafted by 26 professional red teamers. This longitudinal design allowed for direct comparison across two generations of eight leading MLLMs, yielding 82,256 human harm ratings to precisely track safety evolution.
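To make the scoring concrete, here is a minimal sketch of how such a fixed-benchmark evaluation could be aggregated into per-model Attack Success Rates (ASR). The file name, column names, and majority-vote rule are illustrative assumptions, not the study's exact pipeline.

```python
# Minimal sketch of scoring a fixed adversarial benchmark, assuming a
# hypothetical ratings file with one row per (model, prompt, rater).
# Column names and the majority-vote rule are illustrative assumptions.
import pandas as pd

ratings = pd.read_csv("harm_ratings.csv")  # phase, model, modality, prompt_id, rater_id, rated_harmful

# Collapse multiple human raters per (model, prompt) with a majority vote.
per_prompt = (
    ratings.groupby(["phase", "model", "modality", "prompt_id"])["rated_harmful"]
    .mean()
    .gt(0.5)                       # harmful if most raters flagged the response
    .rename("harmful")
    .reset_index()
)

# Attack Success Rate (ASR): share of adversarial prompts whose response
# was judged harmful, reported separately per modality.
asr = (
    per_prompt.groupby(["phase", "model", "modality"])["harmful"]
    .mean()
    .rename("asr")
    .reset_index()
)
print(asr.pivot(index=["phase", "model"], columns="modality", values="asr"))
```

Because the prompt set is frozen across phases, ASR values computed this way are directly comparable between releases; any movement reflects the models rather than the benchmark.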

| Aspect | Phase 1 Observation | Phase 2 Observation |
| --- | --- | --- |
| Most Vulnerable Family | Pixtral (ASR 0.72 multimodal / 0.63 text-only) | Pixtral (ASR 0.60 text-only / 0.50 multimodal); highest-risk family persists |
| Safest Family | Claude (ASR < 0.03 in both modalities) | Claude (ASR ~0.19-0.20 across modalities); safest in expected harmfulness |
| GPT Models ASR Change | GPT-4o: low ASR (0.067 multimodal / 0.043 text-only) | GPT-5: +8% overall ASR (+18% multimodal, -4% text-only) |
| Claude Models ASR Change | Claude Sonnet 3.5: ASR < 0.03 | Claude Sonnet 4.5: +10% overall ASR |
| Pixtral Models ASR Change | Pixtral 12B: highest ASR (0.72 multimodal / 0.63 text-only) | Pixtral Large: -7% overall ASR; still the highest-risk family |
| Qwen Models ASR Change | Qwen VL Plus: moderate ASR (0.36 multimodal / 0.32 text-only) | Qwen Omni: -5% overall ASR; mild improvement |

Significant differences in vulnerability persist across model families, with Pixtral consistently the most vulnerable and Claude the safest (though with increased ASR). Notably, GPT and Claude models exhibited increased Attack Success Rates (ASR) across generations, while Pixtral and Qwen showed modest decreases. This confirms that safety does not monotonically improve with updates.
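In this setting, alignment drift is simply the change in ASR between successive releases of the same family, measured on the identical benchmark. A minimal sketch, assuming a hypothetical per-model ASR table and an illustrative family mapping:

```python
# Sketch of measuring alignment drift: the change in ASR between the Phase 1
# and Phase 2 release of each model family on the same fixed benchmark.
# The family mapping, file name, and phase labels are illustrative assumptions.
import pandas as pd

asr = pd.read_csv("asr_by_model.csv")   # columns: phase, model, modality, asr

FAMILY = {
    "gpt-4o": "GPT", "gpt-5": "GPT",
    "claude-sonnet-3.5": "Claude", "claude-sonnet-4.5": "Claude",
    "pixtral-12b": "Pixtral", "pixtral-large": "Pixtral",
    "qwen-vl-plus": "Qwen", "qwen-omni": "Qwen",
}
asr["family"] = asr["model"].map(FAMILY)

# Average ASR per family, phase, and modality, then difference the two phases.
by_phase = asr.pivot_table(index=["family", "modality"], columns="phase", values="asr")
by_phase["drift"] = by_phase["phase2"] - by_phase["phase1"]   # > 0 means safety regressed
print(by_phase.sort_values("drift", ascending=False))
```

A positive drift value flags a family whose successor release became easier to attack, mirroring the increases reported above for GPT and Claude.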

The Evolving Role of Modalities and Refusal Strategies

The impact of input modality on MLLM safety shifted markedly across generations. In Phase 1, multimodal prompts were generally more effective than text-only prompts at eliciting harmful responses. Phase 2, however, revealed model-specific patterns: GPT-5 and Claude 4.5 showed near-equivalent vulnerability across modalities, while Pixtral Large became more susceptible to text-only prompts. This demonstrates that modality sensitivity is neither stable nor uniform across model updates.

Furthermore, refusal behavior plays a crucial role in interpreting safety. Claude models consistently exhibited the highest default refusal rates, which suppresses their observed harmfulness. The newer GPT and Claude models reduced their refusal rates, while Qwen Omni increased its refusal rate. This underscores that refusal is a distinct safety mechanism: low observed harmfulness may reflect an abstention strategy rather than inherently safer generative behavior, so both must be evaluated carefully.
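One practical way to keep refusal and generative safety separate is to label each response as a refusal, a safe completion, or a harmful completion, and to report harm conditional on engagement alongside the raw rates. The sketch below assumes a hypothetical labelled-response file and illustrative label names.

```python
# Sketch: separate refusal (abstention) from genuinely safe generation.
# Assumes a hypothetical per-response file with columns
# model, modality, outcome in {"refusal", "safe_compliance", "harmful"}.
import pandas as pd

responses = pd.read_csv("labelled_responses.csv")

summary = (
    responses.groupby(["model", "modality"])["outcome"]
    .value_counts(normalize=True)
    .unstack(fill_value=0.0)
    .reindex(columns=["refusal", "safe_compliance", "harmful"], fill_value=0.0)
)

# Harm conditional on the model actually engaging with the prompt:
# a high refusal rate can mask weak generative safety.
engaged = (1.0 - summary["refusal"]).replace(0.0, float("nan"))
summary["harm_given_engagement"] = summary["harmful"] / engaged

print(summary[["refusal", "harmful", "harm_given_engagement"]].round(3))
```

Reporting harm-given-engagement next to ASR makes it visible when an apparent safety gain is really an increase in abstention.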

Continuous Multimodal Benchmarking Is Essential

The observed alignment drift, shifting modality effects, and evolving refusal dynamics underscore the absolute necessity of longitudinal, modality-controlled evaluations using fixed adversarial benchmarks. One-time or text-only assessments are insufficient to capture how MLLM safety behaviors truly evolve with new architectures, training data, and alignment strategies. Regular, systematic evaluations are essential to identify where safety mechanisms improve, regress, or where new vulnerabilities emerge within your deployed AI systems.
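One way to operationalize this is a release gate that re-runs the fixed benchmark on every new model version and tests whether ASR rose significantly relative to its predecessor. The sketch below uses a one-sided two-proportion z-test; the significance level, example counts, and pass/fail policy are assumptions to be tuned per deployment.

```python
# Sketch of a release gate: flag a new model version whose ASR on the
# fixed adversarial benchmark is significantly higher than its predecessor's.
# Alpha, the example counts, and the blocking policy are assumptions.
from math import sqrt
from statistics import NormalDist


def asr_regression(harmful_old: int, n_old: int, harmful_new: int, n_new: int,
                   alpha: float = 0.05) -> bool:
    """Return True if the new release's ASR is significantly higher (one-sided)."""
    p_old, p_new = harmful_old / n_old, harmful_new / n_new
    pooled = (harmful_old + harmful_new) / (n_old + n_new)
    se = sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    if se == 0:
        return False
    z = (p_new - p_old) / se
    p_value = 1 - NormalDist().cdf(z)   # one-sided: did ASR go up?
    return p_value < alpha


# Example with illustrative counts on a 726-prompt benchmark:
if asr_regression(harmful_old=45, n_old=726, harmful_new=80, n_new=726):
    print("Safety regression detected: block or review this release.")
```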

Quantify Your AI ROI Potential

Estimate the potential efficiency gains and cost savings for your enterprise by optimizing MLLM safety and alignment.


Your AI Alignment Roadmap

A strategic plan to integrate longitudinal safety evaluations and ensure robust MLLM deployment in your enterprise.

Phase 1: Assessment & Benchmarking

Conduct a baseline evaluation of your current MLLM deployments using multimodal adversarial benchmarks. Identify existing vulnerabilities and establish key safety metrics specific to your use cases.

Phase 2: Strategy Development & Custom Red Teaming

Develop a tailored red teaming strategy, incorporating diverse attack modalities and longitudinal tracking. Focus on model-specific vulnerabilities and evolving alignment drift patterns identified in the assessment phase.

Phase 3: Continuous Monitoring & Iteration

Implement a continuous evaluation framework. Regularly re-evaluate MLLMs on fixed adversarial benchmarks across new model releases to track safety evolution, detect regressions, and adapt alignment interventions proactively.

Phase 4: Operational Integration & Training

Integrate safety metrics into your MLLM deployment pipelines. Train internal teams on best practices for adversarial robustness and responsible AI use, ensuring a culture of continuous safety improvement.
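As an illustration of what integrating safety metrics into a deployment pipeline can look like, the sketch below wraps evaluation results in a go/no-go check that a CI job could run before promoting a new model version. The metric names and thresholds are placeholders, not recommended values.

```python
# Sketch of a CI-style deployment gate over safety metrics.
# Metric names and thresholds are placeholders, not recommended values.
from dataclasses import dataclass


@dataclass
class SafetyReport:
    asr_text: float          # ASR on text-only adversarial prompts
    asr_multimodal: float    # ASR on multimodal adversarial prompts
    asr_previous: float      # overall ASR of the currently deployed version


def deployment_gate(report: SafetyReport,
                    max_asr: float = 0.05,
                    max_drift: float = 0.02) -> list[str]:
    """Return human-readable reasons to block the release (empty list = pass)."""
    overall = (report.asr_text + report.asr_multimodal) / 2
    reasons = []
    if report.asr_text > max_asr:
        reasons.append(f"text-only ASR {report.asr_text:.2f} exceeds {max_asr:.2f}")
    if report.asr_multimodal > max_asr:
        reasons.append(f"multimodal ASR {report.asr_multimodal:.2f} exceeds {max_asr:.2f}")
    if overall - report.asr_previous > max_drift:
        reasons.append(f"ASR drifted up by {overall - report.asr_previous:.2f} vs deployed version")
    return reasons


blockers = deployment_gate(SafetyReport(asr_text=0.04, asr_multimodal=0.09, asr_previous=0.05))
print("PASS" if not blockers else "BLOCK: " + "; ".join(blockers))
```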

Ready to fortify your AI's alignment?

Don't let alignment drift or unseen vulnerabilities compromise your enterprise AI. Schedule a free consultation with our experts to discuss a tailored strategy for continuous MLLM safety.

Ready to Get Started?

Book Your Free Consultation.
