Enterprise AI Analysis: Event-based Lip Reading with Triplane Fusion Network

AI RESEARCH PAPER ANALYSIS

Event-based Lip Reading with Triplane Fusion Network

Published on: February 23, 2026 | Authors: Hao Ju, Zhedong Zheng, Xu Zheng, Wenyue Chen, Lin Wang, Dong Wang, Huchuan Lu, Xu Jia

Executive Impact: Enhanced Visual Speech Recognition

This research introduces the Triplane Fusion Network (TF-Net), a novel approach for event-based lip reading. By analyzing lip movements from three distinct, complementary views (XYT, XT, YT) and facilitating multi-directional information exchange, TF-Net significantly improves accuracy and efficiency in visual speech recognition, particularly in challenging conditions where traditional cameras fail.

82.182% Modality Dataset Accuracy
+2.3% Accuracy Improvement
128 Bins Optimal Temporal Resolution
16.235 GFLOPs Operational Efficiency

Deep Analysis & Enterprise Applications


The Challenge of Lip Reading

Automatic lip reading is a critical subtask of visual speech recognition, focusing on predicting spoken words from lip movements. Traditional methods, using frame-based cameras, often miss subtle, fast motions due to low frame rates and are highly susceptible to illumination changes and motion blur.

This paper highlights how event cameras overcome these limitations by capturing local brightness changes with microsecond latency and high temporal resolution, making them robust to harsh conditions and ideal for fine-grained lip motion analysis. However, existing event-based approaches often adapt conventional video recognition architectures, overlooking the unique advantages of event data distribution.

The proposed Triplane Fusion Network (TF-Net) addresses this by explicitly processing event streams from three distinct, complementary views, enhancing discriminative feature extraction for subtle lip movements.

Triplane Fusion Network Architecture

The core innovation is the Triplane Fusion Network (TF-Net), an event-specific framework that processes lip movements by analyzing event streams from three distinct spatio-temporal views: the standard XYT view, and two additional perspectives, XT (horizontal motion over time) and YT (vertical motion over time). These 2D projections are vital for decoupling and analyzing the dynamic scale and speed of lip movements.

TF-Net is structured as a Mixture-of-Experts (MoE) architecture, where each expert branch specializes in processing the projection from a distinct viewpoint. For the XYT view, 3D convolutions on Voxel Grid representations are used, while the XT and YT views use 2D convolutions on Event Profile representations, significantly reducing the computational burden. Each branch retains the temporal dimension without downsampling; Bi-directional Gated Recurrent Units (Bi-GRU) then aggregate temporal information, and fully connected layers produce the word prediction.
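As a rough illustration of the three views, the sketch below projects a raw (x, y, t, p) event array into an XYT voxel grid and XT/YT profiles with NumPy. The array layout, resolution, bin count, and polarity handling are illustrative assumptions, not the paper's exact Voxel Grid and Event Profile construction.

```python
import numpy as np

def triplane_views(events, H=96, W=96, T=128):
    """Project a raw event stream into TF-Net's three views (a sketch).

    `events` is an (N, 4) float array of (x, y, t, p) rows with polarity
    p in {-1, +1}.
    """
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]

    # Normalize timestamps into T temporal bins.
    tb = np.clip((t - t.min()) / max(np.ptp(t), 1e-9) * T, 0, T - 1).astype(int)

    # XYT view: a voxel grid accumulating signed polarity per (t, y, x) cell.
    xyt = np.zeros((T, H, W), dtype=np.float32)
    np.add.at(xyt, (tb, y, x), p)

    # XT / YT views: 2D event profiles, collapsing one spatial axis each.
    xt = np.zeros((T, W), dtype=np.float32)   # horizontal motion over time
    np.add.at(xt, (tb, x), np.abs(p))
    yt = np.zeros((T, H), dtype=np.float32)   # vertical motion over time
    np.add.at(yt, (tb, y), np.abs(p))
    return xyt, xt, yt
```

The 3D voxel grid feeds the XYT expert's 3D convolutions, while the two 2D profiles feed the cheaper 2D-convolution branches.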

Enterprise Process Flow

Raw Event Streams (X, Y, T)
Event Representation Conversions (XYT, XT, YT)
Triplane Fusion Network (Expert Blocks)
Mutual Information Exchange
Fused Prediction Head

Mutual Information Exchange Block (MIEB)

A key component of TF-Net is the Mutual Information Exchange Block (MIEB). This block facilitates multi-directional exchange of motion information among the different expert branches (XYT, XT, and YT views). Unlike previous methods that often overlook the specific event distribution along temporal axes, MIEB explicitly enables the complementary information exchange at the feature level.

The MIEB operates by tokenizing feature maps from expert blocks, transforming them into one-directional tokens (e.g., [T, W×C] for XT/YT, [T×H, W×C] and [T×W, H×C] for XYT). These tokens are then fed into a self-attention-based fusion mechanism. This multi-directional exchange ensures that each expert branch receives fused motion features, which are then recomposed to enhance motion modeling, especially for fine-grained lip movements, and optimize the learning process.
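The tokenization step can be sketched as follows, assuming feature maps of shape (T, H, W, C) with H = W so all token dimensions match. The attention here uses no learned projections, purely to illustrate the multi-directional exchange; the actual MIEB is a trained module.

```python
import numpy as np

def tokens_from_branches(f_xyt, f_xt, f_yt):
    """Tokenize expert feature maps as described for the MIEB (a sketch).

    Assumed shapes: f_xyt is (T, H, W, C), f_xt is (T, W, C),
    f_yt is (T, H, C), with H == W so all token dims agree.
    """
    T, H, W, C = f_xyt.shape
    tok_xt = f_xt.reshape(T, W * C)                                # [T, W*C]
    tok_yt = f_yt.reshape(T, H * C)                                # [T, H*C]
    tok_xyt_a = f_xyt.reshape(T * H, W * C)                        # [T*H, W*C]
    tok_xyt_b = f_xyt.transpose(0, 2, 1, 3).reshape(T * W, H * C)  # [T*W, H*C]
    return np.concatenate([tok_xt, tok_yt, tok_xyt_a, tok_xyt_b], axis=0)

def self_attention(tokens):
    """Single-head scaled dot-product self-attention, no learned weights."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ tokens                           # fused tokens, same shape
```

After fusion, each branch's tokens would be recomposed back into its feature-map shape before the next expert block.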

Benchmark Performance & Efficiency

TF-Net demonstrates superior performance compared to existing state-of-the-art event-based lip reading methods. On the synthetic Modality dataset, it achieves an accuracy of 82.182%, surpassing competitive methods by +2.3%. On the real-world DVS-Lip dataset, it delivers highly competitive results with 73.709% accuracy.

Crucially, TF-Net is also highly efficient, boasting the smallest GFLOPs (16.230/16.235) among compared event-based methods, confirming its efficacy without excessive computational demands. This efficiency is partly attributed to the strategic use of 2D convolution for the XT and YT branches, avoiding computationally expensive 3D operations where they are unnecessary.
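A back-of-the-envelope multiply-accumulate count shows why the 2D branches are so cheap. The layer sizes below are illustrative assumptions, not taken from the paper:

```python
def conv_macs(out_elems, kernel_elems, c_in, c_out):
    """Rough multiply-accumulate count for one convolution layer:
    output elements x kernel volume x C_in x C_out."""
    return out_elems * kernel_elems * c_in * c_out

# Illustrative layer sizes: 128 temporal bins, a 96x96 lip crop,
# 16 -> 32 channels (all hypothetical).
T, H, W, C_IN, C_OUT = 128, 96, 96, 16, 32
macs_3d = conv_macs(T * H * W, 3 * 3 * 3, C_IN, C_OUT)  # 3x3x3 conv, XYT voxel grid
macs_2d = conv_macs(T * W, 3 * 3, C_IN, C_OUT)          # 3x3 conv, one XT profile
ratio = macs_3d / macs_2d  # here the 2D branch is 288x cheaper per layer
```

The ratio scales with the collapsed spatial dimension times the extra kernel depth (H x 3 here), which is why replacing 3D convolutions with 2D ones on the profile views keeps total GFLOPs low.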

TF-Net vs. State-of-the-Art (Modality Dataset)

Method | Accuracy (%) | GFLOPs | Key Innovations
SyncVSR [3] | 60.684 | 10.170 | Data-Efficient Visual Speech Recognition
VTP† [44] | 75.276 | 82.353 | Sub-word Level Lip Reading with Visual Attention
TD3Net [30] | 78.352 | 38.379 | Temporal Densely Connected Multi-Dilated Conv. Network
MSTP (re-train) [54] | 79.834 | 18.345 | Multi-grained Spatio-Temporal Features Perceived Network
SpikeGRU2+ base [16] | 74.581 | 19.784 | Neuromorphic Lip Reading with Spiking Units
MTGA [60] | 77.348 | 92.528 | Multi-View Temporal Granularity Aligned Aggregation
TF-Net (Ours) | 82.182 | 16.235 | Triplane Fusion Network; Multi-directional Information Exchange; Event Profile (XT, YT)

Key Ablation Insights

Extensive ablation studies confirm the effectiveness of each TF-Net component:

  • Mixture of Branches: Combining XYT, XT, and YT views significantly outperforms single-branch or bi-branch architectures, highlighting the complementary nature of these perspectives.
  • Bi-GRU Module: Essential for learning temporal information, crucial for sequence-level predictions, showing a notable performance increase compared to variants without Bi-GRU.
  • Event Profile: The proposed 2D Event Profile representation (for XT and YT views) consistently outperforms Voxel Grid variants, demonstrating its ability to preserve fine-grained motion information and avoid counterbalancing positive/negative events.
  • Temporal Bins: An optimal resolution of 128 temporal bins (achieving 73.709% accuracy on DVS-Lip) is critical; coarser resolutions lead to significant performance drops.
  • MIEB Stages: Applying Mutual Information Exchange Blocks at all stages yields the best performance, with later-stage fusion proving particularly important for enhancing discriminative feature expression for word recognition. Multi-directional flow is superior to unilateral.
Highlight: 128 temporal bins is the optimal resolution for event data processing, maximizing detail capture without undue computational overhead.
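The counterbalancing issue noted in the Event Profile ablation is easy to see in a toy example: summing signed polarities in one cell can cancel genuine activity, while unsigned or per-polarity accumulation preserves it. This is only a minimal illustration of the failure mode, not the paper's representation.

```python
import numpy as np

# Two events landing in the same (t, x) cell with opposite polarity.
polarities = np.array([+1, -1])

signed_sum = polarities.sum()                 # 0 -> the motion cue cancels out
activity = np.abs(polarities).sum()           # 2 -> activity is preserved
per_polarity = ((polarities == 1).sum(),      # separate positive/negative
                (polarities == -1).sum())     # channels keep both counts
```

Accumulating positive and negative events separately (or by magnitude) is what lets a 2D profile retain fine-grained motion that a signed sum would erase.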

Conclusion & Future Directions

This work successfully introduces the Triplane Fusion Network (TF-Net), an event-specific framework designed for robust lip reading. By leveraging three complementary views (XYT, XT, YT) and integrating a Mutual Information Exchange Block (MIEB), TF-Net effectively captures and exchanges fine-grained motion information, leading to state-of-the-art performance on synthetic datasets and strong results on real-world data.

The research validates the importance of disentangling and recomposing views for event-specific distribution learning, offering a significant advancement in visual speech recognition, especially for scenarios requiring high temporal resolution and robustness to environmental variations. Future work could focus on addressing failure cases related to short, low-syllable words by improving event collection and motion pattern modeling precision.

Enterprise Application: Enhanced Accessibility and Security

Imagine a call center environment where agents assist customers, but a noisy background or a client with a speech impediment makes verbal communication challenging. Integrating TF-Net's advanced lip-reading capabilities, powered by event cameras, could provide real-time visual speech recognition. This would enable agents to understand customers more accurately, leading to improved service quality and reduced miscommunication.

Furthermore, in high-security environments, silent commands or biometric authentication based on lip movements could be made more robust. TF-Net's ability to discern subtle lip motions, even in low-light conditions, offers a significant advantage over traditional vision systems. This technology could enable more reliable access control or facilitate communication for individuals in situations where audio is compromised or unavailable, thereby enhancing both accessibility and security protocols.

Calculate Your Potential AI ROI

Estimate the efficiency gains and cost savings your enterprise could achieve by implementing advanced AI solutions like the Triplane Fusion Network.


Your AI Implementation Roadmap

A phased approach to integrate cutting-edge AI, ensuring minimal disruption and maximum impact. This general roadmap can be customized for your specific needs.

Phase 01: Discovery & Strategy

Initial consultation to understand your unique business challenges, existing infrastructure, and strategic goals. Define clear objectives for AI integration and identify key performance indicators (KPIs).

Phase 02: Data Preparation & Model Training

Collect, preprocess, and annotate relevant data. Adapt or fine-tune models like TF-Net for your specific use case, leveraging event data or converting existing video streams. Initial model validation and benchmarking.

Phase 03: Pilot Deployment & Testing

Deploy a pilot version of the AI solution in a controlled environment. Conduct rigorous testing, gather user feedback, and iterate on model performance and system integration. Optimize for efficiency and accuracy.

Phase 04: Full-Scale Integration & Monitoring

Integrate the refined AI solution across your enterprise. Establish continuous monitoring systems to track performance, identify anomalies, and ensure ongoing optimization. Provide training and support for your teams.

Ready to Transform Your Enterprise with AI?

Book a free 30-minute consultation with our AI strategists to explore how advanced solutions can drive innovation and efficiency in your business.
