AI FOR AFFECTIVE COMPUTING
Emotion Recognition in Signers: Bridging Gaps with Multi-Modal AI
This paper introduces eJSL, a novel Japanese Sign Language dataset for emotion recognition, and empirically validates methods to overcome data scarcity and the overlap between grammatical and affective facial expressions. By leveraging textual emotion recognition for weak labeling, optimizing temporal segment selection, and incorporating hand gesture features, the research establishes stronger baselines for emotion recognition in signers than general-purpose vision-capable LLMs, offering critical insights for affect-aware assistive technologies.
Executive Impact & Strategic Value
Our analysis reveals how advanced multi-modal AI can significantly enhance emotion recognition in sign language, overcoming inherent data challenges and improving accuracy beyond traditional methods and even general-purpose large language models.
Deep Analysis & Enterprise Applications
Introducing eJSL: A New Benchmark for Signer Emotion
The emotional Japanese Sign Language (eJSL) dataset is a novel video corpus comprising 78 distinct utterances performed by two native JSL signers across seven emotional states (anger, disgust, fear, joy, sadness, surprise, neutral), yielding 1,092 video clips (78 utterances × 2 signers × 7 states). It provides a unique resource for paralinguistic emotion recognition in a cross-lingual context. Unlike prior datasets, eJSL specifically targets the nuances of emotion expression in sign language, serving as a crucial benchmark for future research and affect-aware assistive technologies.
To address the scarcity of annotated sign language data, we applied a Textual Emotion Recognition (TER) model to subtitles from the BOBSL dataset. This technique generates large-scale weak labels for training, effectively mitigating data limitations. Fine-tuning models with this TER-based data led to substantial performance improvements. For instance, weighted accuracy (wAcc) on BOBSL-M_C improved from 15.54% to 27.85%, representing a 79% gain. On eJSL, wAcc increased from 7.41% to 15.11%.
| Method | wAcc (%) (BOBSL-M_C) | Macro F1 (%) (BOBSL-M_C) | wAcc (%) (eJSL) | Macro F1 (%) (eJSL) |
|---|---|---|---|---|
| EAN w/ non-signers' data | 15.54 | 12.12 | 7.41 | 9.25 |
| EAN w/ BOBSL-A TER | 27.85 | 17.75 | 15.11 | 12.11 |
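For illustration, the sketch below shows one way the TER-based weak-labeling step could be implemented: an off-the-shelf textual emotion classifier is run over clip subtitles, and only confident predictions are kept as pseudo-labels. The checkpoint name, confidence threshold, and subtitle format are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: derive weak emotion labels for sign-language clips from their subtitles.
# The TER checkpoint and the 0.6 confidence threshold are illustrative choices only.
from transformers import pipeline

# Publicly available 7-class textual emotion classifier (anger, disgust, fear,
# joy, neutral, sadness, surprise); any comparable TER model could be substituted.
ter = pipeline("text-classification",
               model="j-hartmann/emotion-english-distilroberta-base")

def weak_label(subtitle: str, min_confidence: float = 0.6):
    """Return a pseudo-label for one subtitle, or None if the TER model is unsure."""
    top = ter(subtitle)[0]                 # top-1 prediction: {'label': ..., 'score': ...}
    return top["label"] if top["score"] >= min_confidence else None

# Hypothetical clip-id -> subtitle mapping (BOBSL-style captions would be read from disk).
subtitles = {
    "clip_0001": "I can't believe it, this is wonderful news!",
    "clip_0002": "Please close the door on your way out.",
}
weak_labels = {cid: weak_label(text) for cid, text in subtitles.items()}
print(weak_labels)   # e.g. {'clip_0001': 'joy', 'clip_0002': 'neutral'}
```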
Recognizing that grammatical facial expressions (GFEs) can obscure affective cues, we investigated the impact of selecting specific temporal segments. Our experiments demonstrated that using the last 2-second segment of a clip—often associated with post-signing emotional salience—significantly improves accuracy. On eJSL, this strategy boosted wAcc from 15.11% (full clip) to 23.17%, a 53% improvement, and Macro F1 from 12.11% to 19.26%.
| Temporal Segment (eJSL) | wAcc (%) | Macro F1 (%) |
|---|---|---|
| Full Clip Input | 15.11 | 12.11 |
| Random 2s Segment | 15.20 | 12.29 |
| Post-Signing 2s Segment | 23.17 | 19.26 |
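A minimal sketch of the three segment strategies compared above, assuming each clip is already decoded into a frame array and its frame rate is known; the exact windowing used in the paper may differ.

```python
import numpy as np

def select_segment(frames: np.ndarray, fps: float,
                   strategy: str = "post", seconds: float = 2.0) -> np.ndarray:
    """Pick a temporal window from a clip of shape (T, H, W, C).

    strategy: "full" = whole clip, "random" = random 2 s window,
              "post" = last 2 s of the clip (post-signing segment).
    """
    window = int(round(fps * seconds))
    if strategy == "full" or len(frames) <= window:
        return frames
    if strategy == "random":
        start = np.random.randint(0, len(frames) - window + 1)
        return frames[start:start + window]
    return frames[-window:]               # "post": keep only the final frames

# Example: a 5-second clip at 25 fps -> the post-signing window is the last 50 frames.
clip = np.zeros((125, 224, 224, 3), dtype=np.uint8)
print(select_segment(clip, fps=25, strategy="post").shape)   # (50, 224, 224, 3)
```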
Hand gestures are integral to sign language, conveying not just linguistic content but also subtle emotional cues. By incorporating hand motion features alongside facial expressions into our EANwH model, we significantly enhanced emotion recognition. This multi-modal approach captured crucial temporal alignments and low-level interactions. On eJSL, EANwH (full clip) achieved a wAcc of 24.63% and Macro F1 of 21.09%, outperforming EAN (full clip) by 63% in wAcc.
| Method | wAcc (%) (BOBSL-M_C) | Macro F1 (%) (BOBSL-M_C) | wAcc (%) (eJSL) | Macro F1 (%) (eJSL) |
|---|---|---|---|---|
| EAN (full clip) | 27.85 | 17.75 | 15.11 | 12.11 |
| EANwH (full clip) | 32.72 | 20.03 | 24.63 | 21.09 |
| EAN (post 2s) (eJSL only) | - | - | 23.17 | 19.26 |
EANwH Multi-modal Feature Fusion Architecture
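The sketch below shows one way an EANwH-style early-fusion network could be wired: per-frame face embeddings and hand-keypoint features are concatenated, projected, summarized by an LSTM, and classified into seven emotions. The dimensions (2048-d face embeddings, 126-d hand features from 2 hands × 21 landmarks × 3 coordinates) and layer choices are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class FusionEmotionNet(nn.Module):
    """Illustrative early fusion of face and hand features, followed by temporal modelling."""

    def __init__(self, face_dim: int = 2048, hand_dim: int = 126,
                 hidden: int = 256, num_classes: int = 7):
        super().__init__()
        # Early fusion: project the concatenated per-frame features into one space.
        self.fuse = nn.Sequential(nn.Linear(face_dim + hand_dim, hidden), nn.ReLU())
        # Temporal modelling over the fused frame sequence.
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, face_feats: torch.Tensor, hand_feats: torch.Tensor) -> torch.Tensor:
        # face_feats: (B, T, face_dim); hand_feats: (B, T, hand_dim)
        x = self.fuse(torch.cat([face_feats, hand_feats], dim=-1))
        _, (h_n, _) = self.lstm(x)            # final hidden state summarizes the clip
        return self.head(h_n[-1])             # (B, num_classes) emotion logits

model = FusionEmotionNet()
logits = model(torch.randn(4, 50, 2048), torch.randn(4, 50, 126))
print(logits.shape)                            # torch.Size([4, 7])
```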
EANwH Performance Against Vision-Capable LLMs
Our EANwH model established stronger baselines than leading vision-capable Large Language Models (LLMs), Qwen2.5 and GPT-4o, for emotion recognition in signers. This indicates that a specialized multi-modal approach with task-specific feature engineering is more effective than general-purpose LLMs, and it particularly excels at identifying the 'Neutral' state, which is crucial for real-world applications. EANwH achieved a Macro F1 of 16.88% on EmoSign compared to 14.65% for GPT-4o, and 21.09% on eJSL compared to 11.15% for GPT-4o.
Per-class F1 and Macro F1 (%) on EmoSign:
| Model | Joy | Sad. | Ang. | Dis. | Fear | Sur. | Neu. | Macro F1 |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5 | 39.18 | 4.26 | 28.57 | 0.00 | 0.00 | 17.65 | 10.17 | 14.26 |
| GPT-4o | 38.38 | 27.27 | 0.00 | 28.57 | 8.33 | 0.00 | 0.00 | 14.65 |
| EANwH | 30.99 | 16.67 | 26.67 | 8.33 | 10.53 | 0.00 | 25.00 | 16.88 |
Per-class F1 and Macro F1 (%) on eJSL:
| Model | Joy | Sad. | Ang. | Dis. | Fear | Sur. | Neu. | Macro F1 |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5 | 20.91 | 11.98 | 2.53 | 12.10 | 9.57 | 1.27 | 19.84 | 11.17 |
| GPT-4o | 7.38 | 4.64 | 15.93 | 23.79 | 8.61 | 11.00 | 6.67 | 11.15 |
| EANwH | 35.91 | 10.64 | 15.55 | 14.29 | 9.65 | 21.10 | 40.49 | 21.09 |
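For reference, the per-class scores in the tables above are F1 values and the final column is their unweighted mean (Macro F1). The sketch below computes both reported metrics with scikit-learn on toy predictions; interpreting wAcc as class-balanced accuracy is an assumption about the paper's definition.

```python
from sklearn.metrics import balanced_accuracy_score, f1_score

EMOTIONS = ["joy", "sadness", "anger", "disgust", "fear", "surprise", "neutral"]

# Toy predictions for illustration only; real evaluation uses the EmoSign / eJSL test sets.
y_true = ["joy", "neutral", "anger", "fear", "neutral", "surprise", "sadness", "joy"]
y_pred = ["joy", "neutral", "anger", "neutral", "neutral", "joy", "sadness", "joy"]

per_class_f1 = f1_score(y_true, y_pred, labels=EMOTIONS, average=None, zero_division=0)
macro_f1 = per_class_f1.mean()                       # unweighted mean over the 7 classes
w_acc = balanced_accuracy_score(y_true, y_pred)      # assumed reading of "wAcc"

print(dict(zip(EMOTIONS, per_class_f1.round(2))))
print(f"Macro F1: {macro_f1:.2%} | wAcc (balanced accuracy): {w_acc:.2%}")
```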
Addressing the Grammatical-Affective Facial Expression Overlap
A significant challenge in emotion recognition for signers stems from the inherent overlap between grammatical facial expressions (GFEs) and affective facial expressions (AFEs). Unlike spoken language, sign language uses non-manual markers, such as eyebrow movements that mark yes/no questions, which can simultaneously convey emotion (e.g., surprise). This ambiguity complicates model training and requires methods that disentangle the two signals for accurate affective understanding. Our research specifically investigates techniques, such as temporal segment selection, to address this nuanced challenge.
Your AI Implementation Roadmap
A typical journey to integrate advanced emotion recognition in sign language within your enterprise, leveraging key insights from this research.
Phase 1: Multi-modal Data Ingestion & Preprocessing
Establish robust pipelines for collecting and processing visual data (facial expressions, hand gestures) from sign language interactions. Implement advanced preprocessing for normalization and feature extraction, preparing data for multi-modal fusion.
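A minimal sketch of this ingestion step, assuming OpenCV decoding, a 224×224 target resolution, and simple [0, 1] scaling; the actual preprocessing pipeline behind the research may differ.

```python
import cv2
import numpy as np

def load_clip(path: str, size: int = 224, max_frames: int = 150) -> np.ndarray:
    """Decode a sign-language video into a normalized frame array of shape (T, H, W, C)."""
    cap = cv2.VideoCapture(path)
    frames = []
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:                                        # end of video or read error
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)    # OpenCV decodes frames as BGR
        frame = cv2.resize(frame, (size, size))
        frames.append(frame.astype(np.float32) / 255.0)   # scale pixel values to [0, 1]
    cap.release()
    return np.stack(frames) if frames else np.empty((0, size, size, 3), np.float32)
```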
Phase 2: TER-Based Weak Labeling & Model Fine-tuning
Utilize Textual Emotion Recognition (TER) models on associated captions to generate large-scale weak labels. Fine-tune foundational emotion recognition models with this weakly supervised data to adapt to the specific nuances of sign language.
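A sketch of what fine-tuning on TER-derived weak labels could look like; the classifier head, feature dimensionality, and training schedule are placeholders rather than the paper's setup, and random tensors stand in for precomputed clip features and pseudo-labels.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for precomputed clip features and their TER-derived pseudo-labels (7 classes).
features = torch.randn(256, 512)
weak_labels = torch.randint(0, 7, (256,))
loader = DataLoader(TensorDataset(features, weak_labels), batch_size=32, shuffle=True)

# Placeholder classifier head; in practice this would be the emotion model being adapted.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 7))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(3):                                   # short schedule for illustration
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)                    # supervised by the weak labels
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last batch loss {loss.item():.3f}")
```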
Phase 3: Facial & Hand Feature Extraction & Fusion
Integrate specialized feature extractors for facial appearance and hand keypoints (e.g., a ResNet-50 backbone for face embeddings, MediaPipe for hand landmarks). Develop early fusion strategies to combine these modalities effectively, capturing their synergistic emotional cues.
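A sketch of per-frame feature extraction with the tools named in this phase: a torchvision ResNet-50 with its classification layer removed produces a 2048-d face embedding, and MediaPipe Hands produces up to 2 hands × 21 landmarks × 3 coordinates = 126 values. The face-crop source and the exact fusion layout are assumptions.

```python
import mediapipe as mp
import numpy as np
import torch
import torch.nn as nn
from torchvision.models import ResNet50_Weights, resnet50

# Face branch: ResNet-50 backbone with the final classification layer removed.
weights = ResNet50_Weights.DEFAULT
face_encoder = nn.Sequential(*list(resnet50(weights=weights).children())[:-1]).eval()
preprocess = weights.transforms()

# Hand branch: MediaPipe Hands landmark detector.
hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2)

def frame_features(face_crop_rgb: np.ndarray, frame_rgb: np.ndarray) -> np.ndarray:
    """Fuse a 2048-d face embedding with a 126-d hand-landmark vector for one frame.

    Both inputs are uint8 RGB arrays (H, W, 3); the face crop comes from an upstream detector.
    """
    with torch.no_grad():
        face_tensor = preprocess(torch.from_numpy(face_crop_rgb).permute(2, 0, 1)).unsqueeze(0)
        face_vec = face_encoder(face_tensor).flatten().numpy()          # (2048,)
    hand_vec = np.zeros(2 * 21 * 3, dtype=np.float32)                   # (126,), zeros if no hands
    result = hands.process(frame_rgb)
    if result.multi_hand_landmarks:
        for i, hand in enumerate(result.multi_hand_landmarks[:2]):
            coords = [(lm.x, lm.y, lm.z) for lm in hand.landmark]
            hand_vec[i * 63:(i + 1) * 63] = np.asarray(coords, np.float32).ravel()
    return np.concatenate([face_vec, hand_vec])                          # early-fusion input
```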
Phase 4: Temporal Dynamics Optimization
Implement temporal modeling modules (e.g., LSTMs) to capture sequential dependencies in expressions. Explore and optimize temporal segment selection strategies, focusing on segments less affected by grammatical markers to isolate affective information more accurately.
Phase 5: Deployment & Continuous Affective Model Enhancement
Deploy the enhanced multi-modal emotion recognition system within target applications (e.g., assistive technologies, human-computer interaction). Establish feedback loops for continuous learning and adaptation, improving the model's robustness and accuracy over time with real-world data.