AI IN HEALTHCARE
How Well Do Multimodal Models Reason on ECG Signals?
This paper introduces ECG_ReasonEval, a novel and reproducible framework for rigorously evaluating the reasoning capabilities of multimodal large language models on ECG signals. By decomposing reasoning into verifiable Perception (signal grounding) and Deduction (clinical consensus), it addresses critical challenges in scalability and semantic correctness, providing a pathway to more trustworthy and auditable AI systems in high-stakes medical applications.
Driving Trust & Efficiency in Health AI
Our framework provides verifiable reasoning, enhancing trust and auditability for AI in critical medical applications.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Current Reasoning Evaluation Landscape
Existing methods for evaluating AI reasoning often face tradeoffs between scalability and depth. ECG_ReasonEval introduces a comprehensive approach to overcome these limitations.
| Method | Applicable to Health Data | Assesses Reasoning Text | Inspects Non-Text Modality | Allows Multiple Valid Reasonings | Reproducible |
|---|---|---|---|---|---|
| Step-wise Decomposition | ✓ | ✓ | | | |
| Question Answering | ✓ | ✓ | | | |
| Zero-shot Classification | ✓ | ✓ | | | |
| n-gram Match | ✓ | ✓ | ✓ | | |
| Expert Assessment | ✓ | ✓ | ✓ | ✓ | |
| ECG_ReasonEval (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ |
ECG_ReasonEval Methodology Overview
The framework decomposes reasoning into Perception (signal grounding) and Deduction (clinical consensus), using a dual-verification process to ensure both fidelity to input data and alignment with medical standards.
Enterprise Process Flow
Perception: Grounding Reasoning in Signal Features
The Perception phase validates whether a model's reasoning trace accurately describes observable patterns in the raw ECG signal. An agentic framework is employed: a Data Science Agent dynamically generates Python code, drawing on specialized tools (such as a state-of-the-art deep-learning segmentation model), to empirically verify claims such as "irregular RR intervals." The process has proven reliable, reaching 83% global accuracy in supporting assessments, and it has even audited the human annotations themselves, finding 17% of cardiologist notes to be incorrect (Figure 3). This ensures that a model's textual explanations are genuinely grounded in the input data.
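As an illustration of the kind of check such an agent might generate, here is a minimal sketch that flags irregular RR intervals from detected R-peak times. The peak positions, the 12% coefficient-of-variation cutoff, and the `rr_is_irregular` helper are all hypothetical assumptions for this example, not details from the paper.

```python
import numpy as np

def rr_is_irregular(r_peaks_s, cv_threshold=0.12):
    """Flag an irregular rhythm when the coefficient of variation of the
    RR intervals exceeds a threshold (hypothetical 12% cutoff)."""
    rr = np.diff(np.asarray(r_peaks_s, dtype=float))  # RR intervals in seconds
    if len(rr) < 2:
        raise ValueError("need at least three R peaks")
    cv = rr.std() / rr.mean()
    return cv > cv_threshold

# Regular sinus rhythm at ~75 bpm: evenly spaced R peaks.
regular = np.arange(0, 10, 0.8)
# Atrial-fibrillation-like trace: erratic RR spacing.
irregular = np.cumsum([0.6, 1.1, 0.5, 0.9, 1.3, 0.7, 0.4])

print(rr_is_irregular(regular))    # False
print(rr_is_irregular(irregular))  # True
```

In a full agentic pipeline, a claim extracted from the reasoning trace ("irregular RR intervals") would be matched to a check like this, run against R peaks produced by the segmentation model, and the boolean result used to support or refute the claim.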
Deduction: Aligning Reasoning with Clinical Consensus
The Deduction phase assesses whether the model's logic aligns with established clinical knowledge. A comprehensive database of diagnostic criteria is built from authoritative cardiology resources (LITFL, Wikipedia, ECGpedia, WikiEM), with a Text Cleaning Agent (using Claude 4.5 Opus and GPT 5.2 Pro) extracting and standardizing each criterion. The model's censored reasoning trace is then embedded and used to query this database, and Precision@k measures how well the retrieved criteria match the ground-truth diagnosis. The approach achieves ~0.8 Precision@1, demonstrating strong alignment with expert consensus and robustness to varied physician phrasing, thereby validating the semantic correctness of the reasoning against medical standards.
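A minimal sketch of this retrieval-and-scoring step follows. The toy 3-D "embeddings", criteria labels, and `precision_at_k` helper are illustrative assumptions; the paper's pipeline embeds real criteria text with a sentence-embedding model.

```python
import numpy as np

def precision_at_k(query_vec, db_vecs, db_labels, true_label, k):
    """Fraction of the top-k retrieved criteria that belong to the
    ground-truth diagnosis, using cosine similarity for retrieval."""
    q = query_vec / np.linalg.norm(query_vec)
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    top_k = np.argsort(db @ q)[::-1][:k]  # indices of the k nearest criteria
    return sum(db_labels[i] == true_label for i in top_k) / k

# Toy vectors standing in for embedded diagnostic criteria.
db_vecs = np.array([
    [0.9, 0.1, 0.0],   # e.g. "irregularly irregular rhythm" -> AFib
    [0.8, 0.2, 0.1],   # e.g. "absent P waves"               -> AFib
    [0.1, 0.9, 0.0],   # e.g. "sawtooth flutter waves"       -> AFlutter
    [0.0, 0.1, 0.9],   # e.g. "ST elevation in V1-V4"        -> STEMI
])
db_labels = ["AFib", "AFib", "AFlutter", "STEMI"]
query = np.array([0.85, 0.15, 0.05])  # embedded (censored) reasoning trace

print(precision_at_k(query, db_vecs, db_labels, "AFib", k=1))  # → 1.0
```

Precision@1 = 1.0 here because the single nearest criterion belongs to the ground-truth diagnosis; with k = 3 the flutter criterion also enters the top-k, and the score drops to 2/3.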
Reasoning vs. Predictive Accuracy
r = 0.18 / r = 0.70 — correlation with final classification accuracy (Perception / Deduction)

Crucially, the paper finds only a weak correlation between Perception scores and final classification accuracy (r = 0.18), but a strong correlation between Deduction scores and accuracy (r = 0.70). This highlights that models can achieve high predictive accuracy without truly "seeing" the signal, implying they may hallucinate justifications post-hoc. Trustworthy AI requires both accurate predictions and verifiable reasoning.
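The correlations above are standard Pearson coefficients between per-model reasoning scores and final accuracy. A small sketch of that computation, using made-up per-model scores (the numbers below are illustrative, not the paper's data):

```python
import numpy as np

# Hypothetical per-model scores: reasoning metrics vs. final accuracy.
perception = np.array([0.30, 0.10, 0.16, 0.25, 0.05])
deduction  = np.array([0.20, 0.75, 0.80, 0.35, 0.55])
accuracy   = np.array([0.40, 0.70, 0.85, 0.45, 0.60])

# Pearson r is the off-diagonal entry of the 2x2 correlation matrix.
r_perception = np.corrcoef(perception, accuracy)[0, 1]
r_deduction  = np.corrcoef(deduction, accuracy)[0, 1]
print(f"Perception vs accuracy: r={r_perception:.2f}")
print(f"Deduction  vs accuracy: r={r_deduction:.2f}")
```

With scores shaped like the paper's finding, `r_deduction` comes out much larger than `r_perception`: Deduction tracks final accuracy while Perception does not.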
Multimodal Model Performance Breakdown
A comparison of various multimodal models reveals distinct strengths and weaknesses in perception and deduction capabilities.
| Model | Perception (Acc@Thresh100%) | Deduction (Precision@1) | Key Characteristics |
|---|---|---|---|
| OpenTSLM / QoQ-Med (TSLMs) | ~25-30% | Low | Good Perception (accurate sensing of the signal), but lacks medical knowledge ("Dull Boy"). |
| Claude Opus 4.5 (Plot) | <10% | High | High Deduction but poor Perception; prone to "post-hoc reasoning" and hallucination. |
| Gemini 3.1 Pro (Plot) | ~15-16% | Highest (~0.48 on Rhythm P@5) | Best balanced performance; shows promise in bridging Perception and Deduction gap. |
| Physician | Upper Bound | Upper Bound | Gold standard for both grounding and consensus. |
Quantify Your AI Impact
Estimate the potential efficiency gains and cost savings for your enterprise with our AI solutions.
Your AI Implementation Roadmap
A structured approach to integrating advanced AI into your enterprise, ensuring a seamless transition and maximum impact.
Phase 1: Discovery & Strategy
Comprehensive analysis of current workflows, identification of high-impact AI opportunities, and development of a tailored implementation strategy.
Phase 2: Pilot Program & Validation
Deployment of a targeted AI solution in a controlled environment, rigorous testing, and validation against key performance indicators.
Phase 3: Scaled Deployment & Integration
Full-scale integration of the AI solution across relevant departments, comprehensive training, and continuous optimization for peak performance.
Phase 4: Ongoing Optimization & Support
Continuous monitoring, iterative improvements based on real-world data, and dedicated support to ensure sustained value and adaptability.
Ready to Transform Your Enterprise with AI?
Unlock the full potential of AI for your business. Schedule a personalized consultation with our experts today.