
Enterprise AI Analysis

Evaluation of cross-ethnic emotion recognition capabilities in multimodal large language models using the Reading the Mind in the Eyes Test

This study evaluates the cross-ethnic emotion recognition capabilities of leading Multimodal Large Language Models (MLLMs), specifically ChatGPT-4, ChatGPT-4o, and Claude 3 Opus, using the 'Reading the Mind in the Eyes Test' (RMET). The research assessed their accuracy and consistency across diverse ethnic stimuli (White, Black, and Korean faces). ChatGPT-4o demonstrated performance significantly above human average and consistent across ethnic groups, placing it in the 85th-94th percentiles of human norms. In contrast, ChatGPT-4 performed near human average, and Claude 3 Opus performed near chance level. The findings highlight the rapid evolution of MLLMs and their potential for objective analysis in social cognition tasks, while also underscoring the need for continuous validation and ethical consideration of their limitations and potential biases in real-world applications.

Executive Impact: Key Findings at a Glance

Understanding the immediate implications of advanced MLLM performance on social cognition tasks for enterprise decision-making and operational efficiency.

87.9% ChatGPT-4o Average RMET Accuracy
26.3% Human Average Accuracy
61.6% Performance Gap (ChatGPT-4o vs. Human)
High Cross-Ethnic Consistency (ChatGPT-4o RMET)

Deep Analysis & Enterprise Applications

The modules below explore the specific findings from the research, reframed as enterprise-focused analyses.

This module examines the accuracy and capabilities of different MLLM versions, highlighting significant advancements and disparities between models.

87.9% ChatGPT-4o Average RMET Accuracy

Across White, Black, and Korean RMET versions, ChatGPT-4o achieved an average accuracy of 87.9%, significantly outperforming human norms (average 26.3%).

MLLM Performance Comparison on RMET
Model | White RMET | Black RMET | Korean RMET | Human Norm Percentile (Avg.)
ChatGPT-4o | 83.3% | 94.4% | 86.1% | 90th
ChatGPT-4 | 50.0% | 52.8% | 47.2% | 23.5th
Claude 3 Opus | 41.7% | 52.8% | 47.2% | 1.5th

ChatGPT-4o consistently demonstrated superior emotion recognition across all ethnic RMET versions, indicating robust performance and cross-ethnic generalization. Claude 3 Opus performed near chance level.

This module investigates whether MLLMs exhibit ethnic biases in emotion recognition, comparing their performance across different racial stimuli.

No Bias Detected in ChatGPT-4o's Cross-Ethnic Performance

ChatGPT-4o's high accuracy was consistent across White, Black, and Korean faces, suggesting an absence of ethnic bias in this visual recognition task. This contrasts with observed 'other-race effects' in human perception.

Cross-Ethnic Evaluation Process

Administer White RMET
Administer Black RMET
Administer Korean RMET
Compare Scores & Percentiles
Assess Cross-Ethnic Consistency

The systematic evaluation across diverse ethnic RMET versions provided a robust methodology to assess cross-ethnic consistency, revealing ChatGPT-4o's unbiased performance.
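As a minimal illustration of this workflow, the sketch below scores a model's responses for each ethnic RMET version and checks cross-ethnic consistency. The answer keys, model responses, and item counts are illustrative placeholders (the real test uses 36 eye-region photos per version), not the study's actual materials or pipeline.

```python
from statistics import mean

# Illustrative answer keys and model responses (4 items per version for brevity;
# the real RMET uses 36 items, each an eye-region photo with four emotion labels).
answer_keys = {
    "White":  ["jealous", "panicked", "arrogant", "hateful"],
    "Black":  ["upset", "terrified", "bored", "playful"],
    "Korean": ["worried", "friendly", "irritated", "thoughtful"],
}
model_responses = {
    "White":  ["jealous", "panicked", "arrogant", "upset"],
    "Black":  ["upset", "terrified", "bored", "playful"],
    "Korean": ["worried", "friendly", "bored", "thoughtful"],
}

def accuracy(key, responses):
    """Fraction of items where the model picked the target emotion label."""
    correct = sum(k == r for k, r in zip(key, responses))
    return correct / len(key)

scores = {v: accuracy(answer_keys[v], model_responses[v]) for v in answer_keys}
for version, acc in scores.items():
    print(f"{version} RMET accuracy: {acc:.1%}")

# Cross-ethnic consistency: spread between the best- and worst-scoring version.
spread = max(scores.values()) - min(scores.values())
print(f"Average accuracy: {mean(scores.values()):.1%}, cross-ethnic spread: {spread:.1%}")
```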

This module discusses the broader societal impact of high-performing AI, including potential dilemmas and the inherent limitations of current evaluation methods.

Ethical Dilemma: High-Performing AI in Clinical Settings

Problem: If an AI tool is demonstrably more accurate and less biased than a human in a specific diagnostic task, what are the ethical ramifications for professional standards? Should its use become mandatory as a secondary check, especially in complex cases?

Solution: Professional communities must define new standards of care that integrate advanced AI capabilities, balancing human judgment with AI insights. This requires addressing liability, expertise, and the potential for 'tech-wash' if deployed within biased systems.

Impact: The findings challenge traditional notions of clinical expertise and liability, pushing for robust implementation protocols and user training to ensure equitable outcomes and prevent AI from inadvertently reinforcing societal biases.

Study Limitation: Validity Concerns with the Reading the Mind in the Eyes Test (RMET)

The RMET, while widely used, has been critiqued for its structural properties and ecological validity. It may rely more on vocabulary and elimination processes than direct emotion perception, limiting the generalizability of findings to real-world 'mind-reading' or clinical empathy.

Estimate Your Enterprise AI ROI

Understand the potential efficiency gains and cost savings by integrating advanced AI capabilities into your operations.

Two headline outputs drive the estimate: annual hours reclaimed and estimated annual cost savings.
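As a rough sketch of how such an estimate can be computed, the snippet below derives both figures from task volume, time saved per task, and staffing cost. All input values are illustrative assumptions to be replaced with your own operational data; none come from the study.

```python
# Illustrative ROI estimate for automating a triage step with an MLLM.
# Every input below is an assumption, not a figure from the research.
tasks_per_year = 50_000          # e.g., items screened annually
minutes_saved_per_task = 3       # assumed human time saved per task
loaded_hourly_rate = 45.0        # assumed fully loaded cost per staff hour (USD)
annual_ai_cost = 20_000.0        # assumed licensing + integration cost per year

hours_reclaimed = tasks_per_year * minutes_saved_per_task / 60
gross_savings = hours_reclaimed * loaded_hourly_rate
net_savings = gross_savings - annual_ai_cost
roi = net_savings / annual_ai_cost

print(f"Annual hours reclaimed: {hours_reclaimed:,.0f}")
print(f"Estimated annual cost savings: ${net_savings:,.0f} (ROI {roi:.1f}x)")
```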

Your AI Implementation Roadmap

A phased approach to integrating advanced MLLMs into your enterprise, leveraging the insights from this research.

Phase 1: Strategic Assessment

Identify specific social cognition tasks and emotion recognition needs within your enterprise. Assess current human performance benchmarks and data readiness.

Phase 2: MLLM Selection & Pilot

Based on performance and bias analysis (like this study's findings), select the optimal MLLM (e.g., ChatGPT-4o). Conduct a controlled pilot in a low-risk environment.

Phase 3: Custom Alignment & Integration

Refine MLLM prompts and fine-tune for enterprise-specific contexts. Integrate into existing workflows with dynamic, 'pluggable' architecture for future upgrades.
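One way to keep the integration "pluggable" is to hide each vendor behind a common interface so the underlying model can be swapped as benchmarks evolve. The sketch below is an illustrative pattern only: the class and function names are hypothetical, and the vendor clients are stubs rather than real API bindings.

```python
from abc import ABC, abstractmethod

class EmotionRecognizer(ABC):
    """Common interface so the underlying MLLM can be swapped without touching callers."""

    @abstractmethod
    def classify(self, image_bytes: bytes, labels: list[str]) -> str:
        """Return the label the model judges most likely for the face image."""

class ChatGPT4oRecognizer(EmotionRecognizer):
    def classify(self, image_bytes: bytes, labels: list[str]) -> str:
        # Stub: a real integration would call the vendor's vision API here.
        raise NotImplementedError("wire up the OpenAI vision endpoint")

class ClaudeRecognizer(EmotionRecognizer):
    def classify(self, image_bytes: bytes, labels: list[str]) -> str:
        # Stub: a real integration would call Anthropic's API here.
        raise NotImplementedError("wire up the Anthropic endpoint")

def triage(recognizer: EmotionRecognizer, image_bytes: bytes) -> str:
    """Application code depends only on the interface, so model upgrades become a config change."""
    return recognizer.classify(image_bytes, ["happy", "worried", "angry", "neutral"])
```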

Phase 4: Ethical Deployment & Monitoring

Implement robust user training and accountability protocols. Continuously monitor for performance, consistency, and emergent biases in real-world usage. Establish new standards of care as needed.
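A lightweight way to watch for emergent bias is to track accuracy per demographic group on a labelled audit sample and alert when the spread between groups exceeds a threshold. The sketch below is illustrative: the audit records are made up and the 10-point threshold is an arbitrary example, not a recommended standard.

```python
from collections import defaultdict

# Illustrative audit records: (demographic_group, model_was_correct)
audit_log = [
    ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False),
]

def group_accuracy(records):
    """Per-group accuracy on the audited sample."""
    totals, correct = defaultdict(int), defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        correct[group] += ok
    return {g: correct[g] / totals[g] for g in totals}

acc = group_accuracy(audit_log)
spread = max(acc.values()) - min(acc.values())
ALERT_THRESHOLD = 0.10  # arbitrary example: flag a >10-point gap between groups
if spread > ALERT_THRESHOLD:
    print(f"Bias alert: per-group accuracy spread {spread:.0%} exceeds threshold", acc)
```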

Ready to Transform Your Enterprise with AI?

Leverage our expertise to integrate cutting-edge MLLM capabilities, mitigate biases, and achieve measurable impact.

Ready to Get Started?

Book Your Free Consultation.
