Enterprise AI Analysis: CG-DMER: Hybrid Contrastive-Generative Framework for Disentangled Multimodal ECG Representation Learning


Revolutionizing ECG Diagnostics with Multimodal AI

This paper introduces CG-DMER, a novel AI framework that combines contrastive and generative learning to improve electrocardiogram (ECG) interpretation. By integrating ECG signals with clinical text reports and disentangling representations, CG-DMER enhances diagnostic accuracy and generalizability for cardiovascular diseases. It addresses key challenges in existing methods by capturing fine-grained spatial-temporal dependencies and mitigating modality-specific biases.

Executive Impact

Quantifiable advantages derived from CG-DMER's advanced AI capabilities.

  • Macro AUC on CSN (100% of labeled data): 93.55%
  • AUC improvement from spatial-temporal masked modeling: +1.96 / +1.51 (see ablations)
  • Average zero-shot AUC: see Table 2

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Introduction

Recent advances in deep learning have achieved remarkable success in leveraging electrocardiograms (ECGs) for the automated classification of cardiovascular diseases (CVD). However, most methods are supervised and depend on large, expertly annotated datasets, which are costly to obtain. To mitigate these limitations, self-supervised learning (SSL) has emerged as a compelling alternative, enabling models to learn generalizable representations directly from unlabeled data. These representations can then be fine-tuned for specific downstream tasks, reducing dependence on large labeled datasets. Building on these advancements, ECG self-supervised learning (eSSL) has become a promising paradigm for extracting robust features from large-scale unlabeled ECGs. Contrastive eSSL (C-eSSL) [1, 2, 3, 4, 5] leverages contrastive learning techniques to capture clinically relevant patterns without requiring labeled data, thereby facilitating the differentiation of various physiological states. Generative eSSL (G-eSSL) [6, 7, 8, 9] focuses on modeling the underlying distribution of ECG signals by learning to reconstruct, predict, or generate physiologically realistic waveforms from unlabeled data. By leveraging these techniques, eSSL enhances model generalization and improves performance across various downstream clinical applications.

Despite these advancements, existing eSSL approaches have predominantly focused on learning representations from raw ECG waveforms, often overlooking the rich diagnostic and contextual information embedded within clinical text reports. This oversight, together with the need for annotated samples in downstream tasks, limits the versatility of eSSL. Recently, multimodal approaches, including ETP [10], MERL [11], and C-MELT [12], have attempted to address these limitations. Despite their success, they still face two main challenges from a modality-wise perspective: i) Intra-modality: existing methods encode ECG signals in a lead-agnostic manner, overlooking the unique spatial and temporal characteristics of individual ECG leads, which limits their ability to capture fine-grained diagnostic information. ii) Inter-modality: recent methods directly align ECG signals with clinical reports, introducing unnecessary noise and modality-specific biases due to the free-text nature of the reports.

Driven by the above analysis, we propose CG-DMER, a simple yet effective Contrastive-Generative framework for Disentangled Multimodal ECG Representation Learning. The core of CG-DMER lies in capturing fine-grained input details and enhancing ECG-text feature discrimination. Specifically, we propose a spatial-temporal masked modeling scheme for ECG signals, which applies masking across both lead-wise and temporal dimensions to encourage the model to learn fine-grained temporal dynamics and inter-lead spatial dependencies. In parallel, we apply masked reconstruction to clinical text reports, facilitating the learning of rich semantic representations. To further mitigate unnecessary noise and modality-specific biases, we design modality-specific and modality-shared encoders to disentangle features, promoting a clearer separation between modality-invariant and modality-specific representations. Cross-modal contrastive learning is then applied to the shared ECG-text representations, enhancing feature discrimination and strengthening cross-modal alignment. Through this unified framework, our method facilitates the learning of generalizable and semantically rich multimodal representations.

Methodology

In multimodal ECG representation learning, we are given a training set X of N paired ECG records and text reports. Each pair is written as (E_i, R_i), where E_i ∈ R^(L×T) is a raw ECG record with L leads and T time samples, R_i ∈ V* is the associated free-text report over vocabulary V, and i = 1, 2, ..., N. Our CG-DMER framework learns general representations directly from the ECG signals and their corresponding text reports, as Figure 1 illustrates.
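To make the notation concrete, the sketch below builds a record E ∈ R^(L×T) (L = 12 leads, T samples) and masks it along both the lead and temporal axes, in the spirit of Figure 1(b). The patching scheme and masking ratios are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def spatial_temporal_mask(ecg: np.ndarray, n_patches: int = 10,
                          lead_ratio: float = 0.25, time_ratio: float = 0.5,
                          rng=None):
    """Mask an ECG record E in R^(L x T) along both axes (illustrative
    sketch): hide whole leads (spatial) and random temporal patches."""
    if rng is None:
        rng = np.random.default_rng(0)
    L, T = ecg.shape
    masked = ecg.copy()
    # Spatial masking: zero out a random subset of leads.
    hidden_leads = rng.choice(L, size=int(L * lead_ratio), replace=False)
    masked[hidden_leads] = 0.0
    # Temporal masking: split time into patches, zero out a random subset.
    patch = T // n_patches
    hidden_patches = rng.choice(n_patches, size=int(n_patches * time_ratio),
                                replace=False)
    for p in hidden_patches:
        masked[:, p * patch:(p + 1) * patch] = 0.0
    return masked, hidden_leads, hidden_patches
```

A reconstruction objective then trains the model to recover the hidden leads and patches from the visible context, which is what encourages inter-lead and temporal dependencies to be learned.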

Figure 1: A schematic view of the proposed method (a) and the spatial-temporal processing to capture spatial-temporal patterns (b).

Methodology Figure 1
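The disentangle-then-align step can be sketched as separate shared and specific projections per modality, with a symmetric InfoNCE loss applied only to the shared ECG-text pair. All dimensions, projection matrices, and features below are toy stand-ins, not the paper's encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cross_entropy_rows(logits):
    # Mean cross-entropy where row i's positive is column i.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def infonce(z_ecg, z_txt, tau=0.07):
    """Symmetric cross-modal InfoNCE on shared ECG/text features."""
    logits = l2norm(z_ecg) @ l2norm(z_txt).T / tau  # (B, B) similarities
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits.T))

# Disentanglement: each modality gets a shared and a specific projection.
d_in, d_out, batch = 64, 32, 8
W = {k: rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
     for k in ("ecg_shared", "ecg_specific", "txt_shared", "txt_specific")}
h_ecg = rng.standard_normal((batch, d_in))   # backbone ECG features (toy)
h_txt = rng.standard_normal((batch, d_in))   # backbone text features (toy)

z_ecg_shared = h_ecg @ W["ecg_shared"]       # aligned across modalities
z_txt_shared = h_txt @ W["txt_shared"]
loss = infonce(z_ecg_shared, z_txt_shared)   # contrast only the shared parts
```

Keeping the specific projections out of the contrastive loss is the point of the disentanglement: modality-private detail is preserved for reconstruction rather than being forced into cross-modal agreement.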

Experiments

We evaluate our framework on two downstream tasks: linear probing and zero-shot classification, using three widely adopted ECG datasets that together cover more than 100 cardiac conditions. The macro AUC is used as the evaluation metric for all tasks.
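As a reminder of the metric, macro AUC computes a per-class AUROC and averages the classes with equal weight, which suits the multi-label ECG setting. It can be reproduced with scikit-learn's roc_auc_score; the labels and scores below are toy values, not results from the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Multi-label toy example: 4 samples, 3 classes.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.7],
                    [0.1, 0.8, 0.3],
                    [0.4, 0.6, 0.2],
                    [0.5, 0.1, 0.9]])

# average="macro": AUROC per class, then an unweighted mean over classes.
macro_auc = roc_auc_score(y_true, y_score, average="macro")
```

Here classes 1 and 2 are ranked perfectly (AUC 1.0) while class 0 has one inverted pair (AUC 0.75), so the macro AUC is their unweighted mean, about 0.917.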

(1) PTB-XL [24] contains 21,837 ECG recordings from 18,885 patients. Each 12-lead ECG is sampled at 500 Hz for 10 seconds. Following the official annotation protocol, four subsets are provided for multi-label classification: 5 Superclasses, 23 Subclasses, 19 Form classes, and 12 Rhythm classes.

(2) CPSC2018 [25] includes 6,877 12-lead ECGs sampled at 500 Hz, with durations ranging from 6 to 60 seconds. Each record is annotated with one of 9 diagnostic categories.

(3) CSN [26, 27] consists of 45,152 12-lead ECGs sampled at 500 Hz for 10 seconds. After dropping records with 'unknown' annotations, the curated version contains 23,026 samples labeled with 38 categories.

Table 1: Linear probing performance comparison between CG-DMER and existing ECG representation learning methods. The best and second-best results are highlighted in red and blue, respectively. † denotes results reproduced by our implementation.

Experiments Table 1

Results

Table 1 reports the linear probing results of our proposed CG-DMER against existing eSSL and multimodal methods. Across six datasets and training data ratios ranging from 1% to 100%, CG-DMER consistently outperforms eSSL baselines. Remarkably, with only 10% of labeled data, CG-DMER surpasses all eSSL methods trained on 100% of the data. This demonstrates its strong generalization ability, driven by clinical text supervision that yields more discriminative and semantically enriched ECG representations. Beyond eSSL, CG-DMER also achieves superior results over multimodal counterparts across all datasets. These gains highlight the effectiveness of our hybrid contrastive-generative paradigm, which combines complementary strengths of both objectives. Together with the proposed feature disentanglement and alignment strategy, CG-DMER captures fine-grained physiological patterns and modality-specific cues, enabling robust generalization across diverse clinical tasks.
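The linear-probing protocol itself is simple to sketch: freeze the pretrained encoder and fit only a linear classifier on its features. Below, a fixed random projection stands in for the frozen CG-DMER encoder and the labels are synthetic, so this illustrates only the protocol, not the reported numbers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, L, T, d = 200, 12, 250, 64
# "Frozen encoder": weights are fixed and never updated during probing.
W_frozen = rng.standard_normal((L * T, d)) / np.sqrt(L * T)

def encode(batch):
    """Flatten each (L, T) record and project to d-dim features."""
    return batch.reshape(len(batch), -1) @ W_frozen

X = rng.standard_normal((n, L, T))     # toy ECG records
feats = encode(X)
y = (feats[:, 0] > 0).astype(int)      # toy labels, linear in the features

# Only the linear head is trained; the encoder stays untouched.
probe = LogisticRegression(max_iter=1000).fit(feats, y)
acc = probe.score(feats, y)
```

Because only the linear head is trained, probing accuracy directly reflects how linearly separable the frozen representations are, which is why it is a standard measure of representation quality.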

Table 2: Zero-shot performance comparison across multiple datasets. The best and second-best results are highlighted in red and blue, respectively. † denotes results reproduced by our implementation.

Results Table 2
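Zero-shot classification in this setting typically works by embedding a text prompt per cardiac condition with the frozen text encoder and scoring an ECG by cosine similarity in the shared space. The sketch below uses random vectors as stand-ins for both encoders; class names and dimensions are illustrative.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

classes = ["atrial fibrillation", "sinus rhythm", "myocardial infarction"]
rng = np.random.default_rng(0)

# Stand-in for text_encoder(prompt) applied to each class prompt.
prompt_emb = rng.standard_normal((len(classes), 128))
# Stand-in for an ECG embedding that lands near the "sinus rhythm" prompt.
ecg_emb = prompt_emb[1] + 0.1 * rng.standard_normal(128)

scores = cosine(ecg_emb[None, :], prompt_emb)[0]
pred = classes[int(np.argmax(scores))]
```

No labeled ECGs are needed at any point: the class "heads" are just text embeddings, which is what lets the model cover unseen conditions at inference time.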

Ablation Studies

We conduct ablation studies on the four core components of CG-DMER to evaluate their individual contributions (Table 3). Variant 1 employs only the contrastive objective as a multimodal baseline. Incorporating spatial-temporal masked ECG modeling (Variant 2) improves AUC by +1.96 and +1.51, highlighting its effectiveness in capturing lead correlations and temporal dynamics. Variant 3 further adds masked text reconstruction, enabling the model to recover clinical semantics. Variant 4 introduces feature disentanglement, which mitigates modality bias and noise, yielding additional improvements. The full CG-DMER achieves the highest performance, demonstrating that all components are complementary and jointly enhance representation learning.

Table 3: Effects of model components.

Ablation Studies Table 3

Figure 2: Results of CG-DMER with different patch numbers and masking ratios on linear probing and zero-shot classification. Figure 3: t-SNE visualization on the CSN test set.

Ablation Studies Figures 2 and 3

Conclusion

In this paper, we present CG-DMER, a contrastive-generative framework for multimodal ECG representation learning. Using spatial-temporal masked modeling and representation disentanglement, CG-DMER captures fine-grained ECG dynamics and reduces modality-related issues. It enhances the discriminability and robustness of ECG representations, improving performance on various downstream tasks. Experiments on multiple datasets show that CG-DMER outperforms existing methods and has strong potential for clinical diagnostics.

93.55% macro AUC achieved on CSN (100% of labeled data)

CG-DMER Multimodal Learning Process

ECG Signals & Clinical Reports → Spatial-Temporal Masked Modeling (ECG) → Masked Reconstruction (Text) → Disentanglement (Specific & Shared Encoders) → Cross-modal Contrastive Learning → Generalizable Multimodal Representations
| Feature | CG-DMER | Prior Multimodal |
| --- | --- | --- |
| Intra-modality (lead-wise) patterns | Spatial-temporal masking | Lead-agnostic |
| Inter-modality bias mitigation | Disentanglement strategy | Direct alignment |
| Hybrid learning (contrastive + generative) | Yes | Typically one or the other |
| State-of-the-art performance | Across diverse tasks | Limited improvements |

Clinical Impact: Enhanced Diagnostic Precision

CG-DMER's ability to capture fine-grained temporal dynamics and inter-lead spatial dependencies from ECG signals, combined with rich semantic context from clinical reports, leads to significantly enhanced diagnostic precision. In a practical clinical scenario, this means earlier and more accurate detection of complex cardiovascular conditions, reducing misdiagnosis rates and improving patient outcomes. The disentanglement strategy ensures that modality-specific nuances are preserved while leveraging shared semantic information, making the system robust to variations in data quality and reporting styles. This comprehensive understanding empowers clinicians with a powerful tool for more confident and reliable cardiovascular disease diagnosis.

Estimate Your AI Impact

Adjust the parameters to see the potential annual hours reclaimed and cost savings your organization could achieve with advanced AI solutions like CG-DMER.


Your AI Implementation Roadmap

A typical phased approach to integrate CG-DMER into your existing infrastructure.

Phase 1: Discovery & Strategy

Conduct a detailed assessment of existing ECG data infrastructure and clinical workflows. Define specific diagnostic goals and integration points.

Phase 2: Data Preparation & Model Customization

Curate and preprocess multimodal ECG-text datasets. Customize CG-DMER framework to specific enterprise data characteristics and integrate with existing EMR systems.

Phase 3: Validation & Deployment

Rigorously validate model performance against clinical benchmarks. Deploy the fine-tuned CG-DMER model into a secure, scalable production environment.

Phase 4: Monitoring & Iteration

Continuously monitor model performance, data drift, and clinical impact. Implement feedback loops for iterative improvements and model retraining.

Ready to Transform Your Diagnostic Capabilities?

Book a personalized strategy session with our AI experts to explore how CG-DMER can be tailored to your organization's unique needs.
