Enterprise AI Analysis: Predicting Sentence Acceptability Judgments in Multimodal Contexts


This paper investigates how deep neural networks (DNNs), particularly transformers, predict human sentence acceptability judgments in multimodal contexts, examining how visual images (visual context) affect these judgments for both humans and large language models (LLMs). The findings suggest that, unlike textual context, visual images have little to no impact on human acceptability ratings. LLMs, however, display a compression effect similar to the one seen with textual document contexts. While LLMs can predict human judgments accurately, their performance is generally better without visual contexts. The distribution of LLM judgments varies across models, with Qwen models showing patterns closest to human ones. Correlations between LLM rating predictions and their log probabilities decrease when visual contexts are added, indicating a larger gap between these internal measures. Overall, the study highlights both similarities and differences in how humans and LLMs process sentences in multimodal environments.

Executive Impact & Key Findings

Understanding the nuanced performance of AI models in complex linguistic tasks is critical for enterprise adoption. Our findings shed light on key metrics that drive strategic decisions.

  • 0.89 — GPT-4o correlation with human judgments (null context)
  • 0.85 — InternVL3-8B correlation (relevant visual context)
  • 0.79 — Qwen2.5-7B log-probability correlation (null context)

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Humans exhibit distinct processing of visual versus textual contexts. Visual context has minimal impact on acceptability judgments, showing only a slight 'raising effect' for bad sentences in relevant visual contexts, unlike the compression seen with textual contexts. This suggests different cognitive load mechanisms.

The study reveals that human acceptability judgments are largely unaffected by visual contexts. Unlike textual contexts, where a 'compression effect' (raising bad ratings, lowering good ratings) is observed due to cognitive load or discourse coherence, visual contexts only show a minor 'raising effect' for grammatically poor sentences in relevant visual conditions. This implies that humans might more easily 'ignore' irrelevant visual information compared to irrelevant textual information, suggesting different processing modes for these modalities.

  • 0.017 — Bonferroni-corrected significance threshold (α) for human rating comparisons
Context Type Impact on Acceptability
Textual Context (Lau et al., 2020)
  • Significant compression effect
  • Raising of bad sentences
  • Lowering of good sentences
  • Suggests cognitive load & discourse coherence
Visual Context (Current Study)
  • Minimal overall impact
  • Slight raising effect only for bad sentences in relevant contexts
  • No significant compression
  • Suggests easier disregard of irrelevant visual info
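The compression effect contrasted above reduces to a simple check: a context "compresses" judgments when it raises ratings for bad sentences while lowering them for good ones, pulling both toward the middle of the scale. A minimal sketch, using invented illustrative ratings rather than study data:

```python
import statistics

def mean_shift(null_ratings, context_ratings):
    """Mean rating change when a context is added (context minus null)."""
    return statistics.mean(context_ratings) - statistics.mean(null_ratings)

def compression_effect(bad_null, bad_ctx, good_null, good_ctx):
    """A compression effect raises bad-sentence ratings and lowers
    good-sentence ratings, clustering judgments toward mid-scale."""
    return mean_shift(bad_null, bad_ctx) > 0 and mean_shift(good_null, good_ctx) < 0

# Illustrative ratings on a 1-4 acceptability scale (not study data).
bad_null,  bad_ctx  = [1.2, 1.5, 1.8], [1.9, 2.1, 2.0]   # bad sentences rise
good_null, good_ctx = [3.8, 3.6, 3.9], [3.2, 3.3, 3.1]   # good sentences fall

print(compression_effect(bad_null, bad_ctx, good_null, good_ctx))  # → True
```

By this test, textual contexts in Lau et al. (2020) compress human judgments, while the visual contexts in the current study satisfy only the first condition (a slight raising of bad sentences), so no compression is detected.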

LLMs show high accuracy in predicting human judgments, often performing better without visual contexts. While some LLMs (like Qwen) align better with human judgment distributions, others exhibit distinct patterns, including a clear compression effect in visual contexts, unlike humans.

Our analysis demonstrates that LLMs achieve high accuracy in predicting human sentence acceptability judgments, with top models like GPT-4o showing correlations over 0.87. Interestingly, LLMs often perform better when visual contexts are removed. A key divergence from human behavior is the prominent 'compression effect' observed in several LLMs (e.g., GPT-4o, Qwen2.5-7B) when visual contexts are present, where ratings cluster towards the middle. This effect, while similar to human behavior in textual contexts, is not observed in humans with visual contexts. This suggests fundamental differences in how LLMs and humans process multimodal information.

  • 0.89 — highest LLM-human correlation (GPT-4o, null context)
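The LLM-human correlations reported here are standard Pearson correlations between paired ratings. A self-contained sketch of the computation, with invented ratings for illustration (only the 0.89 figure comes from the study):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two paired lists of ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative paired ratings (not study data):
# mean human rating vs. LLM rating for five sentences.
human = [1.4, 2.1, 3.0, 3.6, 3.9]
llm   = [1.6, 2.0, 2.8, 3.5, 4.0]

print(round(pearson(human, llm), 3))  # → 0.988
```

The same function applies to the log-probability correlations: replace the LLM ratings with per-sentence (length-normalized) log probabilities.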

LLM Multimodal Processing Flow

Input Sentence + Image
Internal Representation (Visual & Textual)
Contextual Integration
Generate Acceptability Rating (Prompting)
Calculate Log Probabilities (Internal)
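The last two steps of the flow above are the study's two measurement routes: a prompted acceptability rating and internally computed log probabilities. A minimal sketch of both; the `model` wrapper, its `generate`/`token_logprobs` methods, and the toy stand-in are hypothetical placeholders, not a real model API:

```python
def rate_acceptability(model, sentence, image=None):
    """Sketch of the two measurement routes: a prompted rating and a
    length-normalized mean log probability. `model` is any wrapper
    exposing generate() and token_logprobs(); both are placeholders."""
    prompt = (
        "Rate the acceptability of the following sentence on a 1-4 scale, "
        "where 1 = clearly unacceptable and 4 = fully acceptable.\n"
        f"Sentence: {sentence}\nRating:"
    )
    rating = int(model.generate(prompt, image=image))       # prompted judgment
    logprobs = model.token_logprobs(sentence, image=image)  # internal route
    mean_logprob = sum(logprobs) / len(logprobs)            # length-normalized
    return rating, mean_logprob

class ToyModel:
    """Toy stand-in so the sketch runs without a real vision-language model."""
    def generate(self, prompt, image=None):
        return "3"
    def token_logprobs(self, sentence, image=None):
        return [-2.0, -1.5, -3.0]

rating, mean_lp = rate_acceptability(ToyModel(), "The cat sat on the mat.")
print(rating, round(mean_lp, 3))  # → 3 -2.167
```

Correlating these two outputs per model is what reveals the gap discussed earlier: when an image is supplied, prompted ratings and log probabilities agree less, suggesting the visual input perturbs the internal representations the two routes draw on.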

Qwen2.5-7B: Human-like Distribution

Qwen2.5-7B exhibits judgment distributions most similar to human patterns among the tested LLMs, although with higher variance. This suggests it may capture certain nuances of human linguistic intuition better than other models, albeit not perfectly replicating human cognitive processes.

Implications: Qwen2.5-7B's unique distribution could offer insights into designing LLMs that more closely mimic human linguistic intuitions, particularly in how they spread judgments across the acceptability scale rather than clustering them at extremes or the center.
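One simple way to quantify "similar to human patterns" is a distance between rating distributions, such as total variation distance over the rating bins. The bin shares below are invented for illustration; only the qualitative pattern (Qwen spread across the scale, another model clustered mid-scale) follows the text:

```python
def tv_distance(p, q):
    """Total variation distance between two rating distributions,
    given as probability shares over the same rating bins (0 = identical)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Illustrative share of judgments per rating bin 1-4 (not study data).
human = [0.20, 0.25, 0.25, 0.30]   # spread across the scale
qwen  = [0.18, 0.27, 0.24, 0.31]   # similar spread, higher variance
other = [0.05, 0.45, 0.45, 0.05]   # ratings compressed toward the middle

print(tv_distance(human, qwen) < tv_distance(human, other))  # → True
```

A model whose distribution minimizes this distance to the human one spreads its judgments human-like across the scale rather than clustering them at the extremes or the center.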

The study highlights a divergence in multimodal processing: humans can easily discard irrelevant visual context, whereas LLMs integrate all modalities, leading to a cognitive load effect similar to human textual context processing.

A key finding is the distinct way humans and LLMs integrate multimodal information. Humans appear to suppress or discard irrelevant visual context when judging sentence acceptability, so it has minimal impact on their ratings. In contrast, LLMs, especially vision-and-language models, integrate all presented modalities into their internal representations, handling visual context with a processing load similar to textual context. This produces a compression effect in LLM judgments with visual contexts, mirroring a pattern that humans show only with textual contexts.

  • 0.16 — InternVL3-1B correlation (irrelevant visual context)
Feature Comparison: Human vs. LLM Processing

Irrelevant Visual Context
Human Processing
  • Easily discarded/suppressed
  • Minimal impact on judgments
  • No compression effect
LLM Processing
  • Integrated into internal representations
  • Can lead to a compression effect
  • Processing load similar to textual context

Cognitive Load
Human Processing
  • More affected by textual context
  • Less affected by visual context
LLM Processing
  • Affected similarly by both visual and textual contexts
  • Architectures designed to maximize use of context

Advanced ROI Calculator

Quantify the potential return on investment for integrating advanced AI into your operations. Adjust the parameters below to see your estimated annual savings and reclaimed hours.

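The arithmetic behind such a calculator is straightforward; a minimal sketch, where the parameter names and figures are illustrative assumptions, not values from this page:

```python
def roi_estimate(hours_per_task, tasks_per_year, hourly_cost, automation_share):
    """Annual savings and reclaimed hours under a simple automation ROI model.
    All parameters are illustrative assumptions."""
    hours_reclaimed = hours_per_task * tasks_per_year * automation_share
    annual_savings = hours_reclaimed * hourly_cost
    return annual_savings, hours_reclaimed

# Example: 500 two-hour tasks a year, $60/hour, 40% automatable.
savings, hours = roi_estimate(hours_per_task=2.0, tasks_per_year=500,
                              hourly_cost=60.0, automation_share=0.4)
print(f"${savings:,.0f}", f"{hours:,.0f} hours")  # → $24,000 400 hours
```

The `automation_share` parameter is the sensitive one: estimates of how much of a workflow AI can actually absorb vary widely, so it is worth testing the low end of your range before committing to a projection.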

Implementation Roadmap

Our phased approach ensures a smooth, effective, and tailored AI implementation, minimizing disruption and maximizing long-term value.

Phase 1: Discovery & Strategy

Comprehensive assessment, goal setting, and custom AI strategy development.

Phase 2: Pilot & Integration

Development of pilot AI solutions, seamless integration with existing systems, and initial testing.

Phase 3: Scaling & Optimization

Full-scale deployment, continuous monitoring, performance optimization, and ongoing support.

Ready to Transform Your Enterprise with AI?

Schedule a complimentary strategy session with our AI experts to explore how these insights can be leveraged for your business.
