AI FOR AFFECTIVE COMPUTING
Emotion Recognition in Signers: Bridging Gaps with Multi-Modal AI
This paper introduces eJSL, a novel Japanese Sign Language dataset for emotion recognition, and empirically validates methods to overcome data scarcity and the overlap between grammatical and affective facial expressions. By leveraging textual emotion recognition for weak labeling, optimizing temporal segment selection, and incorporating hand gesture features, the research establishes stronger baselines for emotion recognition in signers than general-purpose vision-capable LLMs, offering critical insights for affect-aware assistive technologies.
Executive Impact & Strategic Value
Our analysis reveals how advanced multi-modal AI can significantly enhance emotion recognition in sign language, overcoming inherent data challenges and improving accuracy beyond traditional methods and even general-purpose large language models.
Deep Analysis & Enterprise Applications
Introducing eJSL: A New Benchmark for Signer Emotion
The emotional Japanese Sign Language (eJSL) dataset is a novel video corpus comprising 78 distinct utterances performed by two native JSL signers across seven emotional states (anger, disgust, fear, joy, sadness, surprise, neutral), yielding 1,092 video clips (78 utterances × 2 signers × 7 states). It provides a unique resource for paralinguistic emotion recognition in a cross-lingual context. Unlike prior datasets, eJSL specifically targets the nuances of emotion expression in sign language, serving as a crucial benchmark for future research and affect-aware assistive technologies.
To address the scarcity of annotated sign language data, we applied a Textual Emotion Recognition (TER) model to subtitles from the BOBSL dataset. This technique generates large-scale weak labels for training, effectively mitigating data limitations. Fine-tuning models with this TER-based data led to substantial performance improvements. For instance, weighted accuracy (wAcc) on BOBSL-M_C improved from 15.54% to 27.85%, representing a 79% gain. On eJSL, wAcc increased from 7.41% to 15.11%.
| Method | wAcc (%) (BOBSL-M_C) | Macro F1 (%) (BOBSL-M_C) | wAcc (%) (eJSL) | Macro F1 (%) (eJSL) |
|---|---|---|---|---|
| EAN w/ non-signers' data | 15.54 | 12.12 | 7.41 | 9.25 |
| EAN w/ BOBSL-A TER | 27.85 | 17.75 | 15.11 | 12.11 |
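For illustration, the sketch below shows one way the TER-based weak-labeling step could be implemented: an off-the-shelf textual emotion classifier is run over clip subtitles, and only confident predictions are kept as pseudo-labels. The checkpoint name, confidence threshold, and subtitle format are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: derive weak emotion labels for sign-language clips from their subtitles.
# The TER checkpoint and the 0.6 confidence threshold are illustrative choices only.
from transformers import pipeline

# Publicly available 7-class textual emotion classifier (anger, disgust, fear,
# joy, neutral, sadness, surprise); any comparable TER model could be substituted.
ter = pipeline("text-classification",
               model="j-hartmann/emotion-english-distilroberta-base")

def weak_label(subtitle: str, min_confidence: float = 0.6):
    """Return a pseudo-label for one subtitle, or None if the TER model is unsure."""
    top = ter(subtitle)[0]                 # top-1 prediction: {'label': ..., 'score': ...}
    return top["label"] if top["score"] >= min_confidence else None

# Hypothetical clip-id -> subtitle mapping (BOBSL-style captions would be read from disk).
subtitles = {
    "clip_0001": "I can't believe it, this is wonderful news!",
    "clip_0002": "Please close the door on your way out.",
}
weak_labels = {cid: weak_label(text) for cid, text in subtitles.items()}
print(weak_labels)   # e.g. {'clip_0001': 'joy', 'clip_0002': 'neutral'}
```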
Recognizing that grammatical facial expressions (GFEs) can obscure affective cues, we investigated the impact of selecting specific temporal segments. Our experiments demonstrated that using the last 2-second segment of a clip—often associated with post-signing emotional salience—significantly improves accuracy. On eJSL, this strategy boosted wAcc from 15.11% (full clip) to 23.17%, a 53% improvement, and Macro F1 from 12.11% to 19.26%.
| Temporal Segment (eJSL) | wAcc (%) | Macro F1 (%) |
|---|---|---|
| Full Clip Input | 15.11 | 12.11 |
| Random 2s Segment | 15.20 | 12.29 |
| Post-Signing 2s Segment | 23.17 | 19.26 |
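A minimal sketch of the three segment strategies compared above, assuming each clip is already decoded into a frame array and its frame rate is known; the exact windowing used in the paper may differ.

```python
import numpy as np

def select_segment(frames: np.ndarray, fps: float,
                   strategy: str = "post", seconds: float = 2.0) -> np.ndarray:
    """Pick a temporal window from a clip of shape (T, H, W, C).

    strategy: "full" = whole clip, "random" = random 2 s window,
              "post" = last 2 s of the clip (post-signing segment).
    """
    window = int(round(fps * seconds))
    if strategy == "full" or len(frames) <= window:
        return frames
    if strategy == "random":
        start = np.random.randint(0, len(frames) - window + 1)
        return frames[start:start + window]
    return frames[-window:]               # "post": keep only the final frames

# Example: a 5-second clip at 25 fps -> the post-signing window is the last 50 frames.
clip = np.zeros((125, 224, 224, 3), dtype=np.uint8)
print(select_segment(clip, fps=25, strategy="post").shape)   # (50, 224, 224, 3)
```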
Hand gestures are integral to sign language, conveying not just linguistic content but also subtle emotional cues. By incorporating hand motion features alongside facial expressions into our EANwH model, we significantly enhanced emotion recognition. This multi-modal approach captured crucial temporal alignments and low-level interactions. On eJSL, EANwH (full clip) achieved a wAcc of 24.63% and Macro F1 of 21.09%, outperforming EAN (full clip) by 63% in wAcc.
| Method | wAcc (%) (BOBSL-M_C) | Macro F1 (%) (BOBSL-M_C) | wAcc (%) (eJSL) | Macro F1 (%) (eJSL) |
|---|---|---|---|---|
| EAN (full clip) | 27.85 | 17.75 | 15.11 | 12.11 |
| EANwH (full clip) | 32.72 | 20.03 | 24.63 | 21.09 |
| EAN (post 2s) (eJSL only) | - | - | 23.17 | 19.26 |
EANwH Multi-modal Feature Fusion Architecture
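The sketch below shows one way an EANwH-style early-fusion network could be wired: per-frame face embeddings and hand-keypoint features are concatenated, projected, summarized by an LSTM, and classified into seven emotions. The dimensions (2048-d face embeddings, 126-d hand features from 2 hands × 21 landmarks × 3 coordinates) and layer choices are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class FusionEmotionNet(nn.Module):
    """Illustrative early fusion of face and hand features, followed by temporal modelling."""

    def __init__(self, face_dim: int = 2048, hand_dim: int = 126,
                 hidden: int = 256, num_classes: int = 7):
        super().__init__()
        # Early fusion: project the concatenated per-frame features into one space.
        self.fuse = nn.Sequential(nn.Linear(face_dim + hand_dim, hidden), nn.ReLU())
        # Temporal modelling over the fused frame sequence.
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, face_feats: torch.Tensor, hand_feats: torch.Tensor) -> torch.Tensor:
        # face_feats: (B, T, face_dim); hand_feats: (B, T, hand_dim)
        x = self.fuse(torch.cat([face_feats, hand_feats], dim=-1))
        _, (h_n, _) = self.lstm(x)            # final hidden state summarizes the clip
        return self.head(h_n[-1])             # (B, num_classes) emotion logits

model = FusionEmotionNet()
logits = model(torch.randn(4, 50, 2048), torch.randn(4, 50, 126))
print(logits.shape)                            # torch.Size([4, 7])
```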
EANwH Performance Against Vision-Capable LLMs
Our EANwH model established stronger baselines than leading vision-capable Large Language Models (LLMs), Qwen2.5 and GPT-4o, for emotion recognition in signers. This indicates that a specialized multi-modal approach with task-specific feature engineering is more effective than general-purpose LLMs, and it particularly excels at identifying the 'Neutral' state, which is crucial for real-world applications. EANwH achieved a Macro F1 of 16.88% on EmoSign compared to 14.65% for GPT-4o, and 21.09% on eJSL compared to 11.15% for GPT-4o.
Per-class F1 and Macro F1 (%) on EmoSign:
| Model | Joy | Sad. | Ang. | Dis. | Fear | Sur. | Neu. | Macro F1 |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5 | 39.18 | 4.26 | 28.57 | 0.00 | 0.00 | 17.65 | 10.17 | 14.26 |
| GPT-4o | 38.38 | 27.27 | 0.00 | 28.57 | 8.33 | 0.00 | 0.00 | 14.65 |
| EANwH | 30.99 | 16.67 | 26.67 | 8.33 | 10.53 | 0.00 | 25.00 | 16.88 |
Per-class F1 and Macro F1 (%) on eJSL:
| Model | Joy | Sad. | Ang. | Dis. | Fear | Sur. | Neu. | Macro F1 |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5 | 20.91 | 11.98 | 2.53 | 12.10 | 9.57 | 1.27 | 19.84 | 11.17 |
| GPT-4o | 7.38 | 4.64 | 15.93 | 23.79 | 8.61 | 11.00 | 6.67 | 11.15 |
| EANwH | 35.91 | 10.64 | 15.55 | 14.29 | 9.65 | 21.10 | 40.49 | 21.09 |
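For reference, the per-class scores in the tables above are F1 values and the final column is their unweighted mean (Macro F1). The sketch below computes both reported metrics with scikit-learn on toy predictions; interpreting wAcc as class-balanced accuracy is an assumption about the paper's definition.

```python
from sklearn.metrics import balanced_accuracy_score, f1_score

EMOTIONS = ["joy", "sadness", "anger", "disgust", "fear", "surprise", "neutral"]

# Toy predictions for illustration only; real evaluation uses the EmoSign / eJSL test sets.
y_true = ["joy", "neutral", "anger", "fear", "neutral", "surprise", "sadness", "joy"]
y_pred = ["joy", "neutral", "anger", "neutral", "neutral", "joy", "sadness", "joy"]

per_class_f1 = f1_score(y_true, y_pred, labels=EMOTIONS, average=None, zero_division=0)
macro_f1 = per_class_f1.mean()                       # unweighted mean over the 7 classes
w_acc = balanced_accuracy_score(y_true, y_pred)      # assumed reading of "wAcc"

print(dict(zip(EMOTIONS, per_class_f1.round(2))))
print(f"Macro F1: {macro_f1:.2%} | wAcc (balanced accuracy): {w_acc:.2%}")
```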
Addressing the Grammatical-Affective Facial Expression Overlap
A significant challenge in emotion recognition for signers stems from the inherent overlap between grammatical facial expressions (GFEs) and affective facial expressions (AFEs). Unlike spoken language, sign language uses non-manual markers, such as eyebrow movements that mark yes/no questions, which can simultaneously convey emotion (e.g., surprise). This ambiguity complicates model training and requires methods that disentangle the two signals for accurate affective understanding. Our research specifically investigates techniques, such as temporal segment selection, to address this nuanced challenge.
Your AI Implementation Roadmap
A typical journey to integrate advanced emotion recognition in sign language within your enterprise, leveraging key insights from this research.
Phase 1: Multi-modal Data Ingestion & Preprocessing
Establish robust pipelines for collecting and processing visual data (facial expressions, hand gestures) from sign language interactions. Implement advanced preprocessing for normalization and feature extraction, preparing data for multi-modal fusion.
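A minimal sketch of this ingestion step, assuming OpenCV decoding, a 224×224 target resolution, and simple [0, 1] scaling; the actual preprocessing pipeline behind the research may differ.

```python
import cv2
import numpy as np

def load_clip(path: str, size: int = 224, max_frames: int = 150) -> np.ndarray:
    """Decode a sign-language video into a normalized frame array of shape (T, H, W, C)."""
    cap = cv2.VideoCapture(path)
    frames = []
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:                                        # end of video or read error
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)    # OpenCV decodes frames as BGR
        frame = cv2.resize(frame, (size, size))
        frames.append(frame.astype(np.float32) / 255.0)   # scale pixel values to [0, 1]
    cap.release()
    return np.stack(frames) if frames else np.empty((0, size, size, 3), np.float32)
```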
Phase 2: TER-Based Weak Labeling & Model Fine-tuning
Utilize Textual Emotion Recognition (TER) models on associated captions to generate large-scale weak labels. Fine-tune foundational emotion recognition models with this weakly supervised data to adapt to the specific nuances of sign language.
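A sketch of what fine-tuning on TER-derived weak labels could look like; the classifier head, feature dimensionality, and training schedule are placeholders rather than the paper's setup, and random tensors stand in for precomputed clip features and pseudo-labels.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for precomputed clip features and their TER-derived pseudo-labels (7 classes).
features = torch.randn(256, 512)
weak_labels = torch.randint(0, 7, (256,))
loader = DataLoader(TensorDataset(features, weak_labels), batch_size=32, shuffle=True)

# Placeholder classifier head; in practice this would be the emotion model being adapted.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 7))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(3):                                   # short schedule for illustration
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)                    # supervised by the weak labels
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last batch loss {loss.item():.3f}")
```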
Phase 3: Facial & Hand Feature Extraction & Fusion
Integrate specialized feature extractors for facial appearance and hand keypoints (e.g., a ResNet-50 backbone for face embeddings, MediaPipe for hand landmarks). Develop early fusion strategies to combine these modalities effectively, capturing their synergistic emotional cues.
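A sketch of per-frame feature extraction with the tools named in this phase: a torchvision ResNet-50 with its classification layer removed produces a 2048-d face embedding, and MediaPipe Hands produces up to 2 hands × 21 landmarks × 3 coordinates = 126 values. The face-crop source and the exact fusion layout are assumptions.

```python
import mediapipe as mp
import numpy as np
import torch
import torch.nn as nn
from torchvision.models import ResNet50_Weights, resnet50

# Face branch: ResNet-50 backbone with the final classification layer removed.
weights = ResNet50_Weights.DEFAULT
face_encoder = nn.Sequential(*list(resnet50(weights=weights).children())[:-1]).eval()
preprocess = weights.transforms()

# Hand branch: MediaPipe Hands landmark detector.
hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2)

def frame_features(face_crop_rgb: np.ndarray, frame_rgb: np.ndarray) -> np.ndarray:
    """Fuse a 2048-d face embedding with a 126-d hand-landmark vector for one frame.

    Both inputs are uint8 RGB arrays (H, W, 3); the face crop comes from an upstream detector.
    """
    with torch.no_grad():
        face_tensor = preprocess(torch.from_numpy(face_crop_rgb).permute(2, 0, 1)).unsqueeze(0)
        face_vec = face_encoder(face_tensor).flatten().numpy()          # (2048,)
    hand_vec = np.zeros(2 * 21 * 3, dtype=np.float32)                   # (126,), zeros if no hands
    result = hands.process(frame_rgb)
    if result.multi_hand_landmarks:
        for i, hand in enumerate(result.multi_hand_landmarks[:2]):
            coords = [(lm.x, lm.y, lm.z) for lm in hand.landmark]
            hand_vec[i * 63:(i + 1) * 63] = np.asarray(coords, np.float32).ravel()
    return np.concatenate([face_vec, hand_vec])                          # early-fusion input
```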
Phase 4: Temporal Dynamics Optimization
Implement temporal modeling modules (e.g., LSTMs) to capture sequential dependencies in expressions. Explore and optimize temporal segment selection strategies, focusing on segments less affected by grammatical markers to isolate affective information more accurately.
Phase 5: Deployment & Continuous Affective Model Enhancement
Deploy the enhanced multi-modal emotion recognition system within target applications (e.g., assistive technologies, human-computer interaction). Establish feedback loops for continuous learning and adaptation, improving the model's robustness and accuracy over time with real-world data.