Enterprise AI Analysis
Feasibility of Using LLMs to Automate Analysis of AI/ML Medical Device Approvals
Harnessing Large Language Models for regulatory analysis in healthcare.
By Ying Wang, Diogo Monteiro do Amaral, Farah Magrabi
Published: HIKM '25: Health Informatics Knowledge Management Conference 2025, September 16-17, 2025, Online, Australia. 23 February 2026.
Executive Impact: Transforming Regulatory Compliance
The integration of machine learning (ML) algorithms into medical devices for clinical decision-making is rapidly expanding. However, the specific functionalities and clinical use of ML within these devices often remain unclear, posing potential safety concerns. While manual analysis of ML-driven devices approved by the FDA provides insights, it is inefficient. This study explores the feasibility of using large language models (LLMs) to automate such analyses. We evaluate LLMs on architecture, training strategy, parameter size, computational demands, and output quality for extracting general device characteristics, ML-specific details (e.g., functions, inputs/outputs), and clinical applications (e.g., users, conditions). Analyzing 108 ML device approvals, we found that decoder LLMs excel at extracting explicit information but are computationally intensive, whereas encoder models are more efficient and better at inferring clinical context. Despite these strengths, all LLMs require domain-specific optimization to effectively address ML-related details. These findings serve as a benchmark for LLMs in this domain.
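The three feature groups targeted for extraction (general device characteristics, ML-specific details, and clinical applications) can be pictured as a simple structured record. This is an illustrative sketch only; the field names below are hypothetical and not the study's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class DeviceRecord:
    """Illustrative extraction target for one FDA ML device approval."""
    # General device characteristics
    device_name: str = ""
    indications_for_use: str = ""
    # ML-specific details
    ml_methods: list[str] = field(default_factory=list)
    ml_inputs: list[str] = field(default_factory=list)
    ml_outputs: list[str] = field(default_factory=list)
    # Clinical application
    intended_users: list[str] = field(default_factory=list)
    health_conditions: list[str] = field(default_factory=list)

# Hypothetical example of a filled-in record
record = DeviceRecord(
    device_name="ExampleCAD",
    ml_methods=["convolutional neural network"],
    intended_users=["radiologists"],
)
print(record.device_name)
```

Structuring the output this way makes the downstream comparison against reference annotations (e.g., BLEU or fuzzy matching per field) straightforward.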
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Introduction to AI/ML in Medical Devices
Artificial intelligence (AI) systems are playing a significant role in healthcare delivery, especially in specialties such as Radiology and Cardiology [1; 2]. In particular, machine learning (ML) enabled AI systems are increasingly used to automate tasks, for example, detecting or quantifying disease in screening mammography, identifying cardiac arrhythmias, and recommending patient treatment [3]. Such ML tools used for the diagnosis, management, or prevention of disease are classed as medical devices and regulated in most countries. The US Food and Drug Administration (FDA) is the world's largest medical device regulator and as of September 2024 had authorized over 1,000 ML devices [4].
While ML medical devices offer significant potential benefits, they pose new risks, particularly due to their black-box nature and the frequent lack of interpretability in outputs [5]. Previous research on safety issues in ML-driven medical devices has revealed that many problems stem from the data acquisition process, leading to errors in ML outputs and contraindicated use. These findings highlight the need to understand the whole system, with particular attention to how ML is applied within devices and how it interacts with other components and users.
LLM Performance Comparison
| Feature | LLaMA2 | BERT | RoBERTa | BigBird | Longformer | Phi-2 |
|---|---|---|---|---|---|---|
| Device name (BLEU) | 0.69 | 0.69 | 0.49 | 0.07 | 0.15 | 0.01 |
| Indications for use (Fuzzy) | 0.87 | 0.62 | 0.60 | 0.45 | 0.68 | 0.76 |
| ML methods (Fuzzy) | 0.60 | 0.54 | 0.55 | 0.38 | 0.43 | 0.43 |
| Users (BLEU) | 0.06 | 0.36 | 0.33 | 0.00 | 0.04 | 0.00 |
| Health conditions (Fuzzy) | 0.76 | 0.33 | 0.74 | 0.33 | 0.49 | 0.43 |
LLaMA2 generally outperformed other LLMs on general device features, while BERT and RoBERTa showed better performance for clinical features like 'Users' and 'Health conditions', indicating their strength in summarization. All models struggled with nuanced ML-specific details, highlighting the need for domain-specific optimization.
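To make the two scoring styles in the table concrete, here is a minimal sketch, assuming BLEU is approximated by its clipped unigram-precision component and fuzzy matching by Python's stdlib `SequenceMatcher` (the study's exact metric implementations may differ):

```python
from collections import Counter
from difflib import SequenceMatcher

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision, the first component of BLEU."""
    cand = candidate.lower().split()
    ref = Counter(reference.lower().split())
    if not cand:
        return 0.0
    matched = sum(min(c, ref[w]) for w, c in Counter(cand).items())
    return matched / len(cand)

def fuzzy_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], similar to a fuzzy-match ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical extracted vs. reference strings
pred = "detects atrial fibrillation from single-lead ECG"
gold = "detection of atrial fibrillation using single-lead ECG data"
print(round(unigram_precision(pred, gold), 2))
print(round(fuzzy_ratio(pred, gold), 2))
```

Note how the word-overlap score penalizes the morphological variation ("detects" vs. "detection") even though the meaning is preserved, foreshadowing the evaluation limitation discussed below.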
Insights and Limitations of LLM Application
Our findings showed that decoder models like LLaMA2, though requiring significant computational resources, outperformed the other models in extracting device characteristics from approval documents, including long sections, thanks to their larger 4,096-token context window. However, LLaMA2 struggled with precise information representation, often introducing irrelevant content and producing overly creative responses, leading to low precision in BLEU scores and limiting its clarity and feasibility for extracting ML and clinical features. Encoder-based LLMs like BERT produce concise outputs but are constrained by smaller architectures, hindering performance on tasks requiring longer text processing, such as device descriptions. However, BERT performed relatively well at summarizing shorter text spans, such as health conditions and users [10].
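The gap between a decoder's 4,096-token window and a typical encoder's shorter limit is often bridged by processing long documents in overlapping windows. A minimal sketch (the window and stride values are illustrative, not the study's settings):

```python
def sliding_window_chunks(tokens: list[str], window: int = 512,
                          stride: int = 256) -> list[list[str]]:
    """Split a long token sequence into overlapping windows so that
    an encoder with a fixed input limit can see every span once."""
    if len(tokens) <= window:
        return [tokens]
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last window already reaches the end of the document
    return chunks

# A hypothetical 1,200-token approval document
doc = [f"tok{i}" for i in range(1200)]
chunks = sliding_window_chunks(doc)
print(len(chunks), len(chunks[0]))
```

The overlap (here, half a window) reduces the chance that a relevant span, such as a device description, is cut in two at a chunk boundary, at the cost of running the encoder on more text.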
Overall, all of the LLMs, primarily trained on general corpora such as Wikipedia, struggled to address the domain-specific features critical for analyzing ML applications in clinical settings [9]. In addition, the inconsistent structure of FDA documents, with irregular sections, non-standard layouts, and embedded figures and tables, poses challenges for designing effective prompt templates, limiting LLM performance. Further, the exclusive use of objective performance metrics, which focus on word overlap rather than semantic meaning, restricts evaluation; subjective measures would provide a better assessment [16].
Future Directions and Recommendations
This study assessed the feasibility of leveraging LLMs to automate the extraction of critical information from FDA-approved ML medical device documents. Decoder-based LLMs showed strong performance in retrieving explicit information such as device characteristics and ML functions, though at the cost of high computational demands. In contrast, encoder-based models (e.g., BERT and RoBERTa) were more efficient and better at inferring clinical context, including intended users and applications. However, all models struggled with extracting nuanced ML-specific details, highlighting the need for domain-specific optimization strategies such as fine-tuning. These findings establish a benchmark for applying LLMs in regulatory document analysis and suggest a promising role for tailored LLM pipelines in enhancing the transparency and oversight of ML-enabled medical devices.
This study highlights the importance of clear instructions and integrating domain knowledge to improve the performance of generative LLMs. Future work could explore approaches such as Retrieval-Augmented Generation, which allows the model to access precise and authoritative knowledge sources by integrating domain-specific data (e.g., annotated FDA approvals, device manuals, regulatory guidelines) into the response generation process [18]. Alternatively, fine-tuning LLMs with instructions or labels can help models better understand terminology, context, and relationships, bridging the gap between general training and domain-specific tasks to improve accuracy and reliability in extracting specific device features. However, these benefits must be weighed against the associated computational, environmental, and privacy costs [19].
Calculate Your Potential AI-Driven Efficiency Gains
Estimate your organization's potential annual savings and reclaimed employee hours by automating medical device approval analysis with enterprise AI solutions.
Your Enterprise AI Implementation Roadmap
Our structured approach ensures a seamless transition to AI-powered regulatory analysis, delivering measurable results.
Phase 1: Discovery & Strategy
Assess current manual processes, identify key data points for automation, define success metrics, and customize an LLM strategy.
Phase 2: LLM Integration & Optimization
Deploy and fine-tune selected LLMs with domain-specific data, integrate with existing regulatory systems, and establish robust validation protocols.
Phase 3: Pilot & Scaled Deployment
Conduct pilot programs on a subset of approvals, gather feedback, iterate on LLM performance, and scale the solution across the enterprise.
Ready to Automate Your Regulatory Analysis?
Connect with our experts to discover how AI can streamline your processes, reduce compliance risks, and unlock new efficiencies.