Enterprise AI Analysis
Feasibility of Using LLMs to Automate Analysis of AI/ML Medical Device Approvals
Harnessing Large Language Models for regulatory analysis in healthcare.
By Ying Wang, Diogo Monteiro do Amaral, Farah Magrabi
Published: HIKM '25: Health Informatics Knowledge Management Conference 2025, September 16-17, 2025, Online, Australia. 23 February 2026.
Executive Impact: Transforming Regulatory Compliance
The integration of machine learning (ML) algorithms into medical devices for clinical decision-making is rapidly expanding. However, the specific functionalities and clinical use of ML within these devices often remain unclear, posing potential safety concerns. While manual analysis of ML-driven devices approved by the FDA provides insights, it is inefficient. This study explores the feasibility of using large language models (LLMs) to automate such analyses. We evaluate LLMs on architecture, training strategy, parameter size, computational demands, and output quality for extracting general device characteristics, ML-specific details (e.g., functions, inputs/outputs), and clinical applications (e.g., users, conditions). Analyzing 108 ML device approvals, we found that decoder LLMs excel at extracting explicit information but are computationally intensive, whereas encoder models are more efficient and better at inferring clinical context. Despite these strengths, all LLMs require domain-specific optimization to effectively address ML-related details. These findings serve as a benchmark for LLMs in this domain.
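The three feature groups targeted for extraction (general device characteristics, ML-specific details, and clinical applications) can be pictured as a simple structured record. This is an illustrative sketch only; the field names below are hypothetical and not the study's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class DeviceRecord:
    """Illustrative extraction target for one FDA ML device approval."""
    # General device characteristics
    device_name: str = ""
    indications_for_use: str = ""
    # ML-specific details
    ml_methods: list[str] = field(default_factory=list)
    ml_inputs: list[str] = field(default_factory=list)
    ml_outputs: list[str] = field(default_factory=list)
    # Clinical application
    intended_users: list[str] = field(default_factory=list)
    health_conditions: list[str] = field(default_factory=list)

# Hypothetical example of a filled-in record
record = DeviceRecord(
    device_name="ExampleCAD",
    ml_methods=["convolutional neural network"],
    intended_users=["radiologists"],
)
print(record.device_name)
```

Structuring the output this way makes the downstream comparison against reference annotations (e.g., BLEU or fuzzy matching per field) straightforward.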
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Introduction to AI/ML in Medical Devices
Artificial intelligence (AI) systems are playing a significant role in healthcare delivery, especially in specialties such as Radiology and Cardiology [1; 2]. In particular, machine learning (ML) enabled AI systems are increasingly used to automate tasks, for example, detecting or quantifying disease in screening mammography, identifying cardiac arrhythmias, and recommending patient treatment [3]. Such ML tools used for the diagnosis, management, or prevention of disease are classed as medical devices and regulated in most countries. The US Food and Drug Administration (FDA) is the world's largest medical device regulator and as of September 2024 had authorized over 1,000 ML devices [4].
While ML medical devices offer significant potential benefits, they pose new risks, particularly due to their black-box nature and the frequent lack of interpretability in outputs [5]. Previous research on safety issues in ML-driven medical devices has revealed that many problems stem from the data acquisition process, leading to errors in ML outputs and contraindicated use. These findings highlight the need to understand the whole system, with particular attention to how ML is applied within devices and how it interacts with other components and users.
LLM Performance Comparison
| Feature | LLaMA2 | BERT | RoBERTa | BigBird | Longformer | Phi-2 |
|---|---|---|---|---|---|---|
| Device name (BLEU) | 0.69 | 0.69 | 0.49 | 0.07 | 0.15 | 0.01 |
| Indications for use (Fuzzy) | 0.87 | 0.62 | 0.60 | 0.45 | 0.68 | 0.76 |
| ML methods (Fuzzy) | 0.60 | 0.54 | 0.55 | 0.38 | 0.43 | 0.43 |
| Users (BLEU) | 0.06 | 0.36 | 0.33 | 0.00 | 0.04 | 0.00 |
| Health conditions (Fuzzy) | 0.76 | 0.33 | 0.74 | 0.33 | 0.49 | 0.43 |
LLaMA2 generally outperformed other LLMs on general device features, while BERT and RoBERTa showed better performance for clinical features like 'Users' and 'Health conditions', indicating their strength in summarization. All models struggled with nuanced ML-specific details, highlighting the need for domain-specific optimization.
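To make the two scoring styles in the table concrete, here is a minimal sketch, assuming BLEU is approximated by its clipped unigram-precision component and fuzzy matching by Python's stdlib `SequenceMatcher` (the study's exact metric implementations may differ):

```python
from collections import Counter
from difflib import SequenceMatcher

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision, the first component of BLEU."""
    cand = candidate.lower().split()
    ref = Counter(reference.lower().split())
    if not cand:
        return 0.0
    matched = sum(min(c, ref[w]) for w, c in Counter(cand).items())
    return matched / len(cand)

def fuzzy_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], similar to a fuzzy-match ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical extracted vs. reference strings
pred = "detects atrial fibrillation from single-lead ECG"
gold = "detection of atrial fibrillation using single-lead ECG data"
print(round(unigram_precision(pred, gold), 2))
print(round(fuzzy_ratio(pred, gold), 2))
```

Note how the word-overlap score penalizes the morphological variation ("detects" vs. "detection") even though the meaning is preserved, foreshadowing the evaluation limitation discussed below.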
Insights and Limitations of LLM Application
Our findings showed that decoder models like LLaMA2, though requiring significant computational resources, outperformed the other models in extracting device characteristics from approval documents, including long sections, thanks to their larger 4,096-token context window. However, LLaMA2 struggled with precise information representation, often introducing irrelevant content and producing overly creative responses, leading to low precision in BLEU scores and limiting its clarity and feasibility for extracting ML and clinical features. Encoder-based LLMs like BERT produce concise outputs but are constrained by smaller architectures, hindering performance on tasks requiring longer text processing, such as device descriptions. However, BERT performed relatively well at summarizing shorter text spans, such as health conditions and users [10].
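The gap between a decoder's 4,096-token window and a typical encoder's shorter limit is often bridged by processing long documents in overlapping windows. A minimal sketch (the window and stride values are illustrative, not the study's settings):

```python
def sliding_window_chunks(tokens: list[str], window: int = 512,
                          stride: int = 256) -> list[list[str]]:
    """Split a long token sequence into overlapping windows so that
    an encoder with a fixed input limit can see every span once."""
    if len(tokens) <= window:
        return [tokens]
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last window already reaches the end of the document
    return chunks

# A hypothetical 1,200-token approval document
doc = [f"tok{i}" for i in range(1200)]
chunks = sliding_window_chunks(doc)
print(len(chunks), len(chunks[0]))
```

The overlap (here, half a window) reduces the chance that a relevant span, such as a device description, is cut in two at a chunk boundary, at the cost of running the encoder on more text.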
Overall, all of the LLMs, primarily trained on general corpora such as Wikipedia, struggled to address the domain-specific features critical for analyzing ML applications in clinical settings [9]. In addition, the inconsistent structure of FDA documents, with irregular sections, non-standard layouts, and embedded figures and tables, poses challenges for designing effective prompt templates, limiting LLM performance. Further, the exclusive use of objective performance metrics, which focus on word overlap rather than semantic meaning, restricts evaluation; subjective measures would provide a better assessment [16].
Future Directions and Recommendations
This study assessed the feasibility of leveraging LLMs to automate the extraction of critical information from FDA-approved ML medical device documents. Decoder-based LLMs showed strong performance in retrieving explicit information such as device characteristics and ML functions, though at the cost of high computational demands. In contrast, encoder-based models (e.g., BERT and RoBERTa) were more efficient and better at inferring clinical context, including intended users and applications. However, all models struggled with extracting nuanced ML-specific details, highlighting the need for domain-specific optimization strategies such as fine-tuning. These findings establish a benchmark for applying LLMs in regulatory document analysis and suggest a promising role for tailored LLM pipelines in enhancing the transparency and oversight of ML-enabled medical devices.
This study highlights the importance of clear instructions and integrating domain knowledge to improve the performance of generative LLMs. Future work could explore approaches such as Retrieval-Augmented Generation, which allows the model to access precise and authoritative knowledge sources by integrating domain-specific data (e.g., annotated FDA approvals, device manuals, regulatory guidelines) into the response generation process [18]. Alternatively, fine-tuning LLMs with instructions or labels can help models better understand terminology, context, and relationships, bridging the gap between general training and domain-specific tasks to improve accuracy and reliability in extracting specific device features. However, these benefits must be weighed against the associated computational, environmental, and privacy costs [19].
Calculate Your Potential AI-Driven Efficiency Gains
Estimate your organization's potential annual savings and reclaimed employee hours by automating medical device approval analysis with enterprise AI solutions.
Your Enterprise AI Implementation Roadmap
Our structured approach ensures a seamless transition to AI-powered regulatory analysis, delivering measurable results.
Phase 1: Discovery & Strategy
Assess current manual processes, identify key data points for automation, define success metrics, and customize an LLM strategy.
Phase 2: LLM Integration & Optimization
Deploy and fine-tune selected LLMs with domain-specific data, integrate with existing regulatory systems, and establish robust validation protocols.
Phase 3: Pilot & Scaled Deployment
Conduct pilot programs on a subset of approvals, gather feedback, iterate on LLM performance, and scale the solution across the enterprise.
Ready to Automate Your Regulatory Analysis?
Connect with our experts to discover how AI can streamline your processes, reduce compliance risks, and unlock new efficiencies.