
Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving

Enterprise AI Analysis

The use of Vision-Language Models (VLMs) in automated driving applications is becoming increasingly common, with the aim of leveraging their reasoning and generalisation capabilities to handle long-tail scenarios. However, these models often fail on simple visual questions that are highly relevant to automated driving, and the reasons behind these failures remain poorly understood. In this work, we examine the intermediate activations of VLMs and assess the extent to which specific visual concepts are linearly encoded, with the goal of identifying bottlenecks in the flow of visual information. Specifically, we create counterfactual image sets that differ only in a targeted visual concept and then train linear probes to distinguish between them using the activations of four state-of-the-art (SOTA) VLMs. Our results show that concepts such as the presence of an object or agent in a scene are explicitly and linearly encoded, whereas other spatial visual concepts, such as the orientation of an object or agent, are only implicitly encoded by the spatial structure retained by the vision encoder. In parallel, we observe that in certain cases, even when a concept is linearly encoded in the model's activations, the model still fails to answer correctly. This leads us to identify two failure modes. The first is perceptual failure, where the visual information required to answer a question is not linearly encoded in the model's activations. The second is cognitive failure, where the visual information is present but the model fails to align it correctly with language semantics. Finally, we show that increasing the distance of the object in question quickly degrades the linear separability of the corresponding visual concept.
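For readers who want the mechanics, a minimal sketch of a single linear probe follows. It assumes activations have already been extracted and pooled into one fixed-size vector per image; the synthetic arrays and all sizes are illustrative stand-ins, not the paper's exact setup.

```python
# Minimal linear-probe sketch: given pooled activations from one VLM layer
# for a counterfactual image set (concept present vs. absent), fit a
# logistic-regression probe and report held-out accuracy. The random
# arrays stand in for real activations; sizes are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder activations: 200 images x 1024-dim pooled features per class.
# In practice these come from hooks on the vision encoder / LLM layers.
acts_present = rng.normal(0.5, 1.0, size=(200, 1024))
acts_absent = rng.normal(0.0, 1.0, size=(200, 1024))

X = np.concatenate([acts_present, acts_absent])
y = np.concatenate([np.ones(200), np.zeros(200)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
```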

Executive Impact

This analysis provides a clear, actionable roadmap for leveraging advanced AI capabilities to enhance autonomous driving systems, addressing critical limitations in VLM perception and reasoning.

Failure modes identified: 2
Visual concepts linearly encoded: High
Spatial concepts implicitly encoded: Partial
Distance degradation impact: High

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Enterprise Process Flow

1. Counterfactual Input Creation
2. VLM Activation Extraction
3. Linear Probe Training
4. Probe-Model Output Comparison
5. Failure Mode Identification
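The flow above can be read as one loop over model stages. The sketch below mirrors stages 1-4 with dummy placeholder helpers; every name and body is an assumption, not the authors' code, and only the control flow follows the diagram. Stage 5 is sketched after the failure-mode summary below.

```python
# Sketch of stages 1-4 of the process flow. All helper bodies are dummy
# placeholders (assumptions, not the authors' code).
import random

def render_counterfactual_pairs(concept):
    # Stage 1 placeholder: render image pairs differing only in `concept`.
    return [f"{concept}_pair_{i}" for i in range(100)]

def extract_activations(pairs, layer):
    # Stage 2 placeholder: hook the VLM at `layer`, return features + labels.
    feats = [[random.random() for _ in range(8)] for _ in pairs]
    labels = [i % 2 for i in range(len(pairs))]
    return feats, labels

def probe_accuracy(feats, labels):
    # Stage 3 placeholder: train a linear probe (see the sketch above).
    return random.uniform(0.5, 1.0)

def model_accuracy(pairs, concept):
    # Stage 4 placeholder: ask the VLM a yes/no question about each pair.
    return random.uniform(0.5, 1.0)

accuracies = {}
for layer in ["vision_encoder", "projector", "llm_middle", "llm_final"]:
    pairs = render_counterfactual_pairs("pedestrian_presence")
    feats, labels = extract_activations(pairs, layer)
    accuracies[layer] = (probe_accuracy(feats, labels),
                         model_accuracy(pairs, "pedestrian_presence"))
print(accuracies)
```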

Two Distinct VLM Failure Modes Identified

Our analysis reveals that VLMs fail in two primary ways when processing visual information crucial for automated driving:

Perceptual failure: the visual information needed to answer the question is not linearly encoded in the activations, so both probe accuracy and model accuracy are low.
Cognitive failure: the visual information is encoded, but the model fails to align it with language semantics, so probe accuracy is high while model accuracy is low.
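In code, the two modes reduce to a simple decision rule over the probe and model accuracies collected by the pipeline sketch above; the 0.75 threshold is an illustrative assumption, not a value from the paper.

```python
# Stage 5 of the pipeline: classify a (probe accuracy, model accuracy)
# pair into the two failure modes. The threshold is an illustrative
# assumption, not a value reported in the paper.
def identify_failure_mode(probe_acc: float, model_acc: float,
                          threshold: float = 0.75) -> str:
    if probe_acc < threshold:
        # Concept is not linearly recoverable from the activations.
        return "perceptual failure"
    if model_acc < threshold:
        # Concept is encoded, but not grounded in language semantics.
        return "cognitive failure"
    return "no failure"

assert identify_failure_mode(0.55, 0.50) == "perceptual failure"
assert identify_failure_mode(0.95, 0.55) == "cognitive failure"
assert identify_failure_mode(0.95, 0.92) == "no failure"
```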

Impact of Distance on Presence Encoding

The presence of an object is well encoded at short distances, but quality degrades significantly at longer ranges, primarily within the vision encoder.

Component | Short Range (5-20 m) | Long Range (30-50 m)
Vision Encoder | Near-perfect linear encoding | Significant degradation in linear encoding
LLM | Maintains the encoder's quality | Improves the representation further, but it remains below short-range quality
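The trend in the table can be illustrated with synthetic data: as the gap between the "present" and "absent" class means shrinks (a stand-in for increasing object distance), per-bin probe accuracy falls. Bin edges, feature dimension, and gap sizes below are assumptions.

```python
# Synthetic illustration of distance-binned probing: one presence probe per
# distance bin, with a shrinking mean gap standing in for the weaker
# encoding of distant objects. Bin edges, dimension, and gaps are assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
dim, n = 128, 200

for dist_m, gap in [(5, 6.0), (20, 3.0), (30, 1.5), (50, 0.5)]:
    present = rng.normal(0.0, 1.0, size=(n, dim))
    present[:, 0] += gap                      # signal lives in one direction
    absent = rng.normal(0.0, 1.0, size=(n, dim))
    X = np.concatenate([present, absent])
    y = np.concatenate([np.ones(n), np.zeros(n)])
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    print(f"~{dist_m} m: probe accuracy {acc:.2f}")
```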

Object Type & Frequency in Presence Detection

Larger, more frequent objects (pedestrians) are better encoded for presence than smaller, less frequent ones (traffic barrels), especially at greater distances.

Pedestrians: consistently higher representation quality and linear separability than traffic barrels (Section 5.1.1).

Count Concept Encoding Dynamics Across Models

The ability to count objects improves rapidly in early layers, plateaus, and then sees further improvement in the LLM. Traffic barrels are slightly easier to count due to consistent alignment.

Stage | Observation
Vision Encoder | Rapid rise to good linear encoding at short distances.
Projector | A small bottleneck for Ovis2.5 and InternVL3.5; the LLM recovers the lost information.
LLM | Uniform improvement across distances in the middle layers, aiding overall count accuracy.
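Unlike presence, counting is naturally probed as a multi-class problem, one class per object count. A minimal sketch follows, with synthetic activations whose mean shifts with the count; the shift size, feature dimension, and count range are assumptions.

```python
# Counting as a multi-class linear probe: one class per object count.
# Synthetic activations with a count-proportional mean shift stand in for
# real layer activations; shift, dimension, and counts are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
dim, per_class, counts = 128, 150, [0, 1, 2, 3]

X = np.concatenate([
    rng.normal(0.0, 1.0, size=(per_class, dim)) + 0.2 * c  # count-scaled shift
    for c in counts
])
y = np.repeat(counts, per_class)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print(f"count-probe accuracy: {probe.score(X_te, y_te):.2f}")
```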

Explicit vs. Implicit Spatial Encoding

Spatial concepts are not explicitly encoded in the vision encoder but implicitly preserved through spatial structure. The LLM then leverages this structure to infer relationships.

Encoding Type | Vision Encoder | LLM (Final Token)
Explicit (avg-pooled) | Close to zero for Spatial-1; moderate for Spatial-2 (Ovis2.5) | Sharp increase in the middle layers for Spatial-1; high for Spatial-2
Implicit (region-pooled) | Good preservation of spatial structure, especially for Spatial-2 at short distances | Infers the correct answer from the retained structure, yielding high accuracy
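The explicit/implicit distinction in the table comes down to how the patch-token grid is pooled before probing. The sketch below contrasts the two poolings on a dummy grid; the grid size, feature dimension, and the left/right region split are illustrative assumptions.

```python
# Explicit vs. implicit probing differ only in the pooling applied to the
# patch-token grid before the linear probe. Grid size, feature dimension,
# and the left/right region split are illustrative assumptions.
import numpy as np

tokens = np.random.default_rng(0).normal(size=(24, 24, 1024))  # H x W x d

# Explicit probing: global average pooling collapses the grid, so the
# probe can only read concepts that are encoded feature-wise.
explicit_feat = tokens.mean(axis=(0, 1))                 # shape (1024,)

# Implicit probing: pool per region and concatenate, so the relative
# spatial layout survives into the probe input.
left = tokens[:, :12].mean(axis=(0, 1))
right = tokens[:, 12:].mean(axis=(0, 1))
implicit_feat = np.concatenate([left, right])            # shape (2048,)

print(explicit_feat.shape, implicit_feat.shape)
```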

Pedestrian vs. Blinker: Spatial Task Performance

Pedestrians (Spatial-2) are easier to localize and track spatially than blinkers (Spatial-1) due to their larger size and higher frequency in training data, especially at longer distances.

Linear separability is significantly higher for pedestrians (Spatial-2) than for blinkers (Spatial-1).

Orientation Concept: A Persistent Challenge

Orientation is poorly encoded across the VLM architecture, both explicitly and implicitly, especially for smaller objects and longer distances, suggesting a major bottleneck.

Stage | Observation
Vision Encoder | Low explicit linear encoding; implicit encoding degrades rapidly for most models and distances.
Projector | A strong bottleneck for InternVL3.5; less pronounced in the other models.
LLM (Final Token) | Fails to explicitly encode orientation and to ground it in language semantics.

Object Size Impact on Orientation Encoding

The orientation of larger objects (bicycle for Orientation-2) shows slightly better linear separability at short ranges compared to smaller objects (pedestrian for Orientation-1).

Bicycle orientation shows slightly better linear separability at short ranges than pedestrian orientation.

Advanced ROI Calculator

Quantify the potential efficiency gains for your enterprise by understanding and mitigating VLM perception limitations. Our calculator estimates the operational hours and cost savings from accurately interpreting complex traffic scenes.


Your AI Implementation Roadmap

A phased approach to integrate and optimize VLM capabilities in your autonomous driving systems.

Phase 1: Diagnostic Probing & Bottleneck Identification

Deploy initial linear probes on your existing VLM architectures using targeted counterfactual datasets. Pinpoint layers and components exhibiting perceptual or cognitive failures for critical driving concepts (e.g., presence, orientation, distance).

Phase 2: Targeted Vision Encoder Optimization

Implement fine-tuning or architectural adjustments to the vision encoder, focusing on improving linear encoding of fine-grained spatial concepts and maintaining representation quality for distant objects. Explore techniques like enhanced data augmentation with diverse distances and orientations.

Phase 3: LLM Alignment & Reasoning Enhancement

Refine the projector and LLM training strategy to better align visual features with language semantics. Address cognitive failures by strengthening the model's ability to leverage encoded visual information for correct semantic interpretation, potentially through new prompting or attention mechanisms.

Phase 4: Continuous Validation & Real-World Deployment

Integrate enhanced VLMs into a continuous validation pipeline using real-world traffic data and new counterfactual scenarios. Monitor performance for long-tail events and adapt models to ensure robust, safe, and efficient autonomous driving operations.

Ready to Unlock Your Enterprise AI Potential?

Schedule a free consultation with our AI strategists to explore how these insights can be tailored to your specific business needs and drive tangible results.
