Enterprise AI Analysis: An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models

This report details RARE-PHENIX, an AI framework designed to streamline rare disease diagnosis by automating the extraction, standardization, and prioritization of phenotypic features from unstructured clinical notes. Combining Large Language Models (LLMs) with a supervised ranking model, RARE-PHENIX outperforms the PhenoBERT deep learning baseline (ontology-based similarity of roughly 0.70 versus 0.58), offering a clinically aligned, end-to-end solution for improved diagnostic concordance and efficiency in real-world settings.

Executive Impact: Key Performance Indicators

RARE-PHENIX demonstrates substantial improvements in critical metrics for rare disease phenotyping, validating its potential for enterprise-level clinical applications.

0.70 Ontology-based Similarity (vs. 0.58 baseline)
29% Relative Reduction in False Negatives
24% Relative Reduction in False Positives
16,357 Clinical Notes in External Validation

Deep Analysis & Enterprise Applications

RARE-PHENIX End-to-End Workflow

RARE-PHENIX models the real-world clinical workflow for rare disease phenotyping as three sequential modules: extraction, standardization, and prioritization. Each module feeds the next, so the final output is both comprehensive and diagnostically relevant.

Enterprise Process Flow

Extract Rare Disease Features from Clinical Notes (LLMs)
Standardize Features to Human Phenotype Ontology (HPO) Terms (RAG)
Prioritize Diagnostically Informative HPO Terms (Supervised Ranking)
Structured, Ranked Phenotypes for Diagnosis
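
The three-stage flow above can be sketched as a simple composition of functions. This is only an illustrative skeleton: `run_pipeline`, `RankedPhenotype`, and the `toy_*` stand-ins are hypothetical names, and the lookup table and scores are made-up placeholders rather than anything produced by RARE-PHENIX.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

HpoTerm = Tuple[str, str]  # (HPO identifier, label)


@dataclass
class RankedPhenotype:
    hpo_id: str
    label: str
    score: float


def run_pipeline(
    note: str,
    extract: Callable[[str], List[str]],
    standardize: Callable[[str], Optional[HpoTerm]],
    prioritize: Callable[[List[HpoTerm]], List[RankedPhenotype]],
) -> List[RankedPhenotype]:
    """Chain the three modules: extraction, standardization, prioritization."""
    mentions = extract(note)
    # Drop mentions that could not be mapped to any ontology term.
    terms = [t for t in (standardize(m) for m in mentions) if t is not None]
    return prioritize(terms)


# Toy stand-ins so the skeleton runs; none of these reflect the real models.
def toy_extract(note: str) -> List[str]:
    # Stand-in for the LLM: treat each comma-separated fragment as a mention.
    return [part.strip() for part in note.split(",") if part.strip()]


TOY_LOOKUP = {
    "seizures": ("HP:0001250", "Seizure"),
    "small head": ("HP:0000252", "Microcephaly"),
}
TOY_SCORES = {"HP:0000252": 0.9, "HP:0001250": 0.6}  # made-up relevance scores


def toy_standardize(mention: str) -> Optional[HpoTerm]:
    return TOY_LOOKUP.get(mention.lower())


def toy_prioritize(terms: List[HpoTerm]) -> List[RankedPhenotype]:
    ranked = [RankedPhenotype(h, label, TOY_SCORES.get(h, 0.0)) for h, label in terms]
    return sorted(ranked, key=lambda p: p.score, reverse=True)


result = run_pipeline(
    "seizures, small head, doing well", toy_extract, toy_standardize, toy_prioritize
)
for phenotype in result:
    print(phenotype)
```

Passing the modules in as callables keeps each stage swappable, which mirrors how the framework evaluates several LLMs behind the same extraction step.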

LLM-based Phenotype Extraction

Module 1 identifies and extracts rare disease phenotypes from unstructured clinical notes using two complementary LLM approaches: parameter-efficient instruction fine-tuning of open-source models (LLaMA family) and few-shot prompting of a closed-source model (ChatGPT-4o). Instruction fine-tuning was performed using PEFT on the RareDis Corpus (832 documents) and UDN Synthetic Clinical Narratives (2,671 patients).
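
As a rough illustration of the few-shot prompting side of this module, the sketch below assembles a prompt from worked examples and parses a bulleted completion. The prompt wording, the `build_few_shot_prompt` and `parse_phenotypes` helpers, and the example notes are all assumptions made for illustration; the prompts actually used in the study are not reproduced here, and no model API is called.

```python
from typing import List, Tuple


def build_few_shot_prompt(note: str, examples: List[Tuple[str, List[str]]]) -> str:
    """Assemble a few-shot extraction prompt from (note, phenotypes) examples."""
    parts = [
        "Extract the rare disease phenotypes mentioned in the clinical note.",
        "Return one phenotype per line, each starting with '- '.",
        "",
    ]
    for example_note, phenotypes in examples:
        parts.append(f"Note: {example_note}")
        parts.append("Phenotypes:")
        parts.extend(f"- {p}" for p in phenotypes)
        parts.append("")
    # The target note comes last, leaving the answer open for the model.
    parts.append(f"Note: {note}")
    parts.append("Phenotypes:")
    return "\n".join(parts)


def parse_phenotypes(completion: str) -> List[str]:
    """Read back a bulleted completion into a list of phenotype strings."""
    return [
        line[2:].strip()
        for line in completion.splitlines()
        if line.startswith("- ")
    ]


# Hypothetical examples, written for this sketch only.
EXAMPLES = [
    ("Patient has recurrent seizures and low muscle tone.", ["seizures", "low muscle tone"]),
    ("Growth below the 3rd percentile; otherwise well.", ["short stature"]),
]

prompt = build_few_shot_prompt("Infant with small head circumference.", EXAMPLES)
print(prompt)
```

The extracted strings remain free text at this point; mapping them to ontology terms is the job of the next module.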

Leveraging Large Language Models

The study fine-tuned 10 LLaMA models (7B to 70B parameters) and used ChatGPT-4o for extraction. This dual strategy allows evaluation under both deployable open-source and API-based settings, which supports adoption across different enterprise environments. Synthetic clinical text from 2,671 patients augmented the training data, improving the models' ability to capture nuanced rare disease phenotypes.

HPO Standardization via Retrieval-Augmented Generation (RAG)

Module 2 standardizes extracted phenotype strings to structured Human Phenotype Ontology (HPO) terms using Retrieval-Augmented Generation (RAG). This process grounds LLM outputs in external knowledge, reducing the risk of hallucination and producing interoperable phenotype terms that downstream diagnostic workup, such as genomic analysis, depends on. A vector database of HPO terms supports semantic retrieval of candidate terms, from which the most appropriate HPO term is selected.
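
The retrieval half of this step can be sketched with a tiny in-memory index. Everything here is a simplification: character-trigram counts stand in for learned embeddings, a four-entry dictionary stands in for the vector database, and the `embed`, `cosine`, and `retrieve` helpers are hypothetical names. The final choice among retrieved candidates, which the LLM would make in a RAG setup, is left out.

```python
import math
from collections import Counter
from typing import List, Tuple

# A handful of HPO terms for illustration; a real index would cover the full ontology.
HPO_TERMS = {
    "HP:0001250": "Seizure",
    "HP:0001263": "Global developmental delay",
    "HP:0000252": "Microcephaly",
    "HP:0004322": "Short stature",
}


def embed(text: str) -> Counter:
    """Toy embedding: counts of character trigrams (stand-in for a neural encoder)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(mention: str, k: int = 3) -> List[Tuple[float, str, str]]:
    """Return the k HPO terms most similar to a free-text mention."""
    query = embed(mention)
    scored = [
        (cosine(query, embed(label)), hpo_id, label)
        for hpo_id, label in HPO_TERMS.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


for similarity, hpo_id, label in retrieve("recurrent seizures"):
    print(f"{hpo_id}  {label}  {similarity:.2f}")
```

Surface-form similarity like this misses paraphrases such as "small head" for Microcephaly, which is one reason a semantic embedding model and an LLM selection step are used in practice.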

0.43 Precision after HPO Standardization (for LLaMA-2-70b, from 0.25 initially)

This substantial increase in precision demonstrates how mapping free-text phenotypes to HPO terms significantly reduces noise and yields more specific, diagnostically relevant phenotype representations.
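
For reference, precision, recall, and F1 against a clinician-curated term set can be computed as below. This is a generic set-based sketch with exact identifier matching and made-up inputs; the study's own matching criterion may differ (for example, by accounting for ontology relationships), and `precision_recall_f1` is a hypothetical helper name.

```python
from typing import Iterable, Tuple


def precision_recall_f1(predicted: Iterable[str], curated: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 using exact identifier matches."""
    predicted_set, curated_set = set(predicted), set(curated)
    true_positives = len(predicted_set & curated_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(curated_set) if curated_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: four predicted terms, three curated terms, two in common.
predicted = ["HP:0001250", "HP:0000252", "HP:0004322", "HP:0001263"]
curated = ["HP:0001250", "HP:0000252", "HP:0001252"]

precision, recall, f1 = precision_recall_f1(predicted, curated)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```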

Phenotype Prioritization for Diagnostic Utility

Module 3 operationalizes phenotype prioritization as a supervised learning-to-rank task using XGBoost. It learns to assign higher relevance scores to clinician-curated HPO terms than to non-curated ones, distinguishing common, non-specific phenotypes from diagnostically informative ones. This improves diagnostic utility most for short top-k lists.
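
The core idea of learning to rank curated terms above non-curated ones can be shown with a much simpler model. The sketch below is not XGBoost: it swaps in a linear pairwise logistic ranker written with the standard library so that it runs anywhere. The two features (a specificity score and a commonness score), the training data, and the function names are all hypothetical.

```python
import math
from typing import List, Tuple

Term = Tuple[List[float], int]  # (feature vector, 1 if clinician-curated else 0)


def train_pairwise_ranker(groups: List[List[Term]], epochs: int = 200, lr: float = 0.1) -> List[float]:
    """Fit linear weights so curated terms outscore non-curated terms within each patient."""
    n_features = len(groups[0][0][0])
    weights = [0.0] * n_features
    for _ in range(epochs):
        for terms in groups:
            positives = [x for x, y in terms if y == 1]
            negatives = [x for x, y in terms if y == 0]
            for x_pos in positives:
                for x_neg in negatives:
                    diff = [a - b for a, b in zip(x_pos, x_neg)]
                    margin = sum(w * d for w, d in zip(weights, diff))
                    # Derivative of log(1 + exp(-margin)) with respect to the margin.
                    grad = -1.0 / (1.0 + math.exp(margin))
                    weights = [w - lr * grad * d for w, d in zip(weights, diff)]
    return weights


def score(weights: List[float], features: List[float]) -> float:
    return sum(w * x for w, x in zip(weights, features))


# Hypothetical training data: features are [specificity, commonness], one list per patient.
TRAINING_GROUPS = [
    [([0.9, 0.1], 1), ([0.2, 0.8], 0), ([0.7, 0.2], 1), ([0.1, 0.9], 0)],
    [([0.8, 0.3], 1), ([0.3, 0.7], 0)],
]

weights = train_pairwise_ranker(TRAINING_GROUPS)
print("weights:", weights)
print("specific term score:", score(weights, [0.85, 0.15]))
print("common term score:", score(weights, [0.15, 0.85]))
```

Pairs are formed only within a patient, which matches the ranking framing: the model needs to order terms for one case, not compare scores across cases. A gradient-boosted ranker follows the same grouping idea while capturing non-linear feature interactions.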

+0.09 Gain in Ontology-based Similarity at top-10, due to prioritization

The largest gains from prioritization were observed at lower 'k' cutoffs, where clinical decision-making typically focuses on a limited number of highly informative phenotypes. This highlights the module's effectiveness in surfacing the most critical information upfront.
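
A small worked example shows why re-ranking matters most at low cutoffs. The term list, curated set, and relevance scores below are invented for illustration, and `precision_at_k` and `rerank` are hypothetical helpers rather than the study's evaluation code.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], curated: Set[str], k: int) -> float:
    """Fraction of the top-k terms that appear in the curated set."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for term in top if term in curated) / len(top)


def rerank(terms: List[str], scores: Dict[str, float]) -> List[str]:
    """Order terms by descending relevance score."""
    return sorted(terms, key=lambda term: scores[term], reverse=True)


# Made-up case: extraction order puts non-specific findings first.
extracted_order = ["Fever", "Fatigue", "Seizure", "Microcephaly"]
curated = {"Seizure", "Microcephaly"}
relevance = {"Fever": 0.1, "Fatigue": 0.2, "Seizure": 0.8, "Microcephaly": 0.9}

prioritized = rerank(extracted_order, relevance)

for k in (2, 4):
    before = precision_at_k(extracted_order, curated, k)
    after = precision_at_k(prioritized, curated, k)
    print(f"k={k}: before={before:.2f} after={after:.2f}")
```

With all four terms included, ordering makes no difference to precision; at k=2 it decides whether the list a clinician reads first contains any curated term at all.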

End-to-End Performance & Error Analysis

RARE-PHENIX consistently outperformed PhenoBERT across all metrics in end-to-end evaluation, showcasing its superior ability to generate clinically concordant phenotypes. Systematic error analysis revealed that most false negatives are due to linguistic variation or contextual description rather than true extraction failures, while false positives mainly arise from ontology granularity differences or non-specific symptoms.

Metric | RARE-PHENIX (best LLM) | PhenoBERT Baseline
Ontology-based Similarity (k=50) | ~0.70 | ~0.58
Mean FN Reduction | 29% relative reduction | Baseline
Mean FP Reduction | 24% relative reduction | Baseline
F1 Score (k=50) | ~0.50 | ~0.23
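
To read the figures above on a common scale, relative change against the baseline can be computed as (new - baseline) / baseline. The sketch below applies this to the approximate values reported in the comparison; the error counts in the last example are hypothetical, since absolute false negative counts are not given here, and `relative_change` is an illustrative helper.

```python
def relative_change(new: float, baseline: float) -> float:
    """Signed relative change of a metric with respect to a baseline value."""
    return (new - baseline) / baseline


# Approximate values from the comparison above.
similarity_gain = relative_change(0.70, 0.58)
f1_gain = relative_change(0.50, 0.23)

# Hypothetical counts: 100 baseline false negatives falling to 71
# corresponds to a 29% relative reduction.
fn_change = relative_change(71, 100)

print(f"ontology-based similarity: {similarity_gain:+.1%}")
print(f"F1 score: {f1_gain:+.1%}")
print(f"false negatives: {fn_change:+.1%}")
```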

Your AI Implementation Roadmap

A phased approach to integrating RARE-PHENIX into your existing clinical and research workflows for maximum impact.

Phase 1: Discovery & Strategy (1-2 Weeks)

Initial assessment of your current rare disease phenotyping processes, infrastructure, and specific diagnostic challenges. Define key objectives, success metrics, and a tailored integration strategy for RARE-PHENIX.

Phase 2: Data Integration & Customization (4-6 Weeks)

Secure integration with your EHR systems and clinical note repositories. Fine-tune the LLMs with institution-specific data (if applicable) and customize HPO standardization rules to align with local clinical practice.

Phase 3: Pilot Deployment & Validation (6-8 Weeks)

Roll out RARE-PHENIX in a pilot environment with a select group of clinicians or researchers. Collect feedback, perform internal validation against clinician-curated data, and iterate on model performance and usability.

Phase 4: Full-Scale Rollout & Ongoing Optimization (Ongoing)

Deploy RARE-PHENIX across your enterprise. Establish continuous monitoring for performance, identify new opportunities for feature extraction and prioritization, and provide ongoing training and support for users.

Ready to Transform Rare Disease Diagnosis?

Connect with our AI specialists to learn how RARE-PHENIX can accelerate insights and improve outcomes in your organization.
