ENTERPRISE AI ANALYSIS
Machine-Learned Codes from EHR Data Predict Hard Outcomes Better than Human-Assigned ICD Codes
Electronic health records (EHRs) are widely used in the United States; as of 2021, the adoption rate among non-federal acute care hospitals reached 96% [1]. These systems not only enable patients and physicians to access clinical information electronically and streamline medical billing and claims processes, but they have also been recognized as valuable resources for assessing study feasibility and facilitating patient recruitment in clinical research. Beyond their clinical utility, EHRs also support observational studies in epidemiology, risk factor analysis, genomic association studies, and other areas [2,3].

A key initial step in using EHR data for research is phenotyping, the identification of patients with specific traits or medical conditions [4]. The most common data source for phenotyping is International Classification of Diseases (ICD) codes. Developed by the World Health Organization, the ICD system was designed to standardize the classification and coding of diseases and has since been widely adopted by healthcare providers worldwide [5]. Accurate assignment of ICD codes is essential for precisely identifying and characterizing target populations in interventional and observational studies. Typically, ICD codes are assigned by clinicians as part of their clinical practice or by trained medical coders who extract relevant information from clinical notes. However, this manual process is laborious and prone to error due to incomplete and disorganized documentation, limited resources (e.g., staffing and budget), human mistakes, and changes in coding standards [6]. Prior research has highlighted considerable variability and inconsistency in the assignment of ICD codes over time and across clinical settings. For example, the transition from ICD-9 to ICD-10 can lead to substantial discrepancies in phenotype definitions [7–9]. Even within the same ICD version, inconsistencies in coding practices remain a significant challenge [10–12]. Such coding errors compromise subsequent clinical analyses because they do not accurately represent patients' underlying conditions.

The recent emergence of machine learning (ML) has done much to improve and automate disease phenotyping using EHR data [13,14]. Compared with human coders, ML-based phenotyping methods offer improved scalability for processing large volumes of data, greater consistency in applying phenotyping criteria, and the ability to detect complex patterns across multiple data types. However, most ML-based phenotyping methods require manual chart reviews by clinical experts to create "gold standards," which is time-consuming and tailored to specific research projects. On the other hand, although the initially assigned ICD codes may not be perfect, they are not random. Patients with specific diagnosis codes often share common clinical characteristics, such as demographics, symptoms, and treatments, embedded within the EHR as structured data (e.g., medications and procedures) or unstructured data (e.g., clinical notes). By treating the originally assigned ICD codes as a "silver standard," ML can extract and capture these characteristics, creating a unique "fingerprint" of clinical features for each patient and generating ML-derived phenotypes. This methodology can potentially provide a more robust and consistent representation of patient conditions.

ICD codes are organized hierarchically, starting with broad disease and condition categories (chapters). These chapters are further subdivided into blocks, which group related conditions more specifically. In this study, we trained an ML model to create coarse-grained phenotypes, referred to as ML-derived ICD blocks, corresponding to ICD code blocks. Our hypothesis was that ML-derived ICD blocks, being more closely aligned with actual patient conditions, would improve the performance of downstream tasks such as predicting critical outcomes (mortality or hospitalization). To evaluate this, we compared predictive performance using the original ICD blocks from the EHR with that achieved using ML-derived ICD blocks across a range of ML models. By using the notes, procedures performed, and medications given to develop the ML models, we hoped to achieve a more detailed picture of each patient's condition. We chose to use Department of Veterans Affairs (VA) data for two reasons: the patient population tends to be relatively stable, with fewer losses due to changes in providers or hospitals than in other datasets, and the data include all clinical notes, which are not generally available in other very large datasets. Although the VA population is predominantly male, our focus here was on testing whether ML-derived codes were more effective predictors than human-assigned codes, and we considered it unlikely that this population difference would affect our results.
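The core of this approach can be sketched briefly: a multi-label model is trained on other EHR signals (clinical note text, medications, procedures) using the originally assigned ICD blocks as noisy "silver standard" labels, and its predictions become the ML-derived blocks. The feature construction, model choice, and function names below are illustrative assumptions, not the study's actual pipeline.

```python
# Minimal sketch of "silver standard" phenotyping: a multi-label model predicts ICD code
# blocks from other EHR signals (notes, medications, procedures). All names and model
# choices here are illustrative assumptions, not the study's actual implementation.
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def fit_silver_standard_model(note_texts, med_proc_features, assigned_blocks):
    """note_texts: list of clinical-note strings, one per patient.
    med_proc_features: (n_patients, n_features) 0/1 array of medication/procedure indicators.
    assigned_blocks: (n_patients, n_blocks) 0/1 matrix of originally assigned ICD blocks,
    used as the noisy "silver standard" training labels."""
    vectorizer = TfidfVectorizer(max_features=20000)
    X = hstack([vectorizer.fit_transform(note_texts), csr_matrix(med_proc_features)])
    model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    model.fit(X, assigned_blocks)  # one binary classifier per ICD block
    return vectorizer, model

def ml_derived_blocks(vectorizer, model, note_texts, med_proc_features, threshold=0.5):
    """Return the ML-derived ICD blocks (predicted block membership) for each patient."""
    X = hstack([vectorizer.transform(note_texts), csr_matrix(med_proc_features)])
    return (model.predict_proba(X) >= threshold).astype(int)
```

The downstream experiments then swap these predicted blocks in for the originally assigned ones as features for outcome prediction.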
Executive Impact Summary
This study demonstrates that machine learning (ML)-derived International Classification of Diseases (ICD) code blocks consistently and significantly outperform human-assigned ICD codes in predicting critical patient outcomes like 90-day mortality or rehospitalization. By leveraging diverse EHR data, ML models provide a more accurate and consistent representation of patient conditions, addressing the inherent variability and potential errors in manual coding practices. This translates to improved predictive power and offers a robust foundation for enhanced clinical risk stratification and research within healthcare systems.
Deep Analysis & Enterprise Applications
Explore the topics below for the specific findings from the research, presented as enterprise-focused modules.
ML-derived code blocks consistently showed a much larger impact on outcome prediction. For instance, Acute kidney failure and chronic kidney disease (N17-N19) had an ML-derived impact factor of 9.452, compared with 0.754 for the corresponding human-assigned block.
Other high-impact ML-derived blocks include Aplastic and other anemias (D60-D64) with 7.529 and Mood [affective] disorders (F30-F39) with 5.890. This suggests ML captures more nuanced clinical features related to disease severity and comorbidity.
Human-assigned ICD code blocks generally showed only small deviations from an expected value of 0, indicating limited individual impact on predicting hard outcomes. Even conditions with some of the highest assigned impact factors, such as Mental and behavioral disorders due to psychoactive substance use (F10-F19) at 0.842, had substantially less predictive power than their ML-derived counterparts.
This lack of significant impact underscores the limitations of human-assigned codes for complex outcome prediction, likely due to inconsistencies, incompleteness, and a lack of granular detail in the coding process.
Chapter-Level Agreement: Original vs. ML-Derived Codes
| Code Chapter | Original Chapter Agreement | ML-Derived Chapter Agreement |
|---|---|---|
| Infectious and parasitic diseases | 42.00% | 58.00% |
| Diseases of the ear and mastoid process | 32.00% | 68.00% |
| Diseases of the respiratory system | 36.00% | 64.00% |
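The agreement split above could be tallied as in the sketch below, which assumes a reviewer adjudicates each sampled case and the two columns report how often the original versus the ML-derived chapter matched that judgment. The data layout and adjudication scheme are assumptions made for illustration, not the study's documented protocol.

```python
# Hypothetical tally of the chapter-agreement split: for each reviewed case, count whether
# the original or the ML-derived chapter matched the reviewer's judgment. The input format
# and review process are assumptions made for illustration.
from collections import Counter

def agreement_split(cases):
    """cases: iterable of (original_chapter, ml_chapter, reviewed_chapter) tuples
    for one ICD chapter's sample of reviewed records."""
    tally = Counter()
    for original, ml, reviewed in cases:
        tally["original"] += int(original == reviewed)
        tally["ml"] += int(ml == reviewed)
    total = sum(tally.values()) or 1
    return {side: 100.0 * n / total for side, n in tally.items()}  # percentages, as in the table
```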
| Classifier | Features | Recall | F1-Score | Precision | Accuracy | AUC |
|---|---|---|---|---|---|---|
| LR | Original Blocks | 0.687 | 0.315 | 0.205 | 0.663 | 0.722 |
| LR | ML-derived Blocks | 0.702 | 0.342 | 0.226 | 0.695 | 0.759 |
| SVM | Original Blocks | 0.664 | 0.317 | 0.208 | 0.677 | 0.723 |
| SVM | ML-derived Blocks | 0.701 | 0.344 | 0.228 | 0.698 | 0.751 |
| RF | Original Blocks | 0.651 | 0.312 | 0.205 | 0.675 | 0.707 |
| RF | ML-derived Blocks | 0.713 | 0.332 | 0.217 | 0.676 | 0.751 |
| NN | Original Blocks | 0.638 | 0.323 | 0.216 | 0.697 | 0.724 |
| NN | ML-derived Blocks | 0.732 | 0.363 | 0.241 | 0.709 | 0.783 |
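A minimal sketch of this downstream comparison is shown below: the same classifier is trained twice, once on the originally assigned ICD blocks and once on the ML-derived blocks, and both runs are scored on a held-out set with the metrics reported in the table. The neural-network configuration and variable names are illustrative assumptions rather than the study's exact setup.

```python
# Sketch of the downstream evaluation: train an outcome classifier on one feature set
# (original or ML-derived ICD blocks) and report the metrics used in the table above.
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import (recall_score, f1_score, precision_score,
                             accuracy_score, roc_auc_score)

def evaluate_feature_set(X_blocks, y_outcome, seed=0):
    """X_blocks: (n_patients, n_blocks) 0/1 matrix (original or ML-derived ICD blocks).
    y_outcome: 0/1 labels for the hard outcome (e.g., 90-day mortality or rehospitalization)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_blocks, y_outcome, test_size=0.2, stratify=y_outcome, random_state=seed)
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=seed)
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    prob = clf.predict_proba(X_te)[:, 1]
    return {"recall": recall_score(y_te, pred),
            "f1": f1_score(y_te, pred),
            "precision": precision_score(y_te, pred),
            "accuracy": accuracy_score(y_te, pred),
            "auc": roc_auc_score(y_te, prob)}

# Usage: run once per feature representation and compare the resulting metrics.
# results_original = evaluate_feature_set(original_blocks, outcome)
# results_ml       = evaluate_feature_set(ml_derived_blocks_matrix, outcome)
```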
Impact of ML-Derived Codes on Predictive Power
While the overall AUC improvement may seem modest, the analysis of impact factors reveals a significant shift. Human-assigned code blocks showed minimal individual impact on outcome prediction. In contrast, ML-derived codes, particularly for conditions like Acute kidney failure and chronic kidney disease, showed a significantly higher impact, demonstrating their greater sensitivity in capturing nuanced patient conditions relevant to hospitalization and mortality. This highlights the ability of ML to uncover deeper, more predictive patterns than traditional coding methods, leading to more robust risk stratification.
Focus Metric (N17-N19): 9.452 (ML-Derived) vs 0.754 (Assigned)
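The impact-factor computation itself is not detailed here; one plausible proxy is permutation importance, shuffling a single ICD block column and measuring the resulting drop in AUC. The sketch below assumes dense 0/1 NumPy feature matrices and a fitted probabilistic classifier, and is not necessarily the study's exact definition.

```python
# Illustrative per-block impact via permutation importance: break one block's association
# with the outcome and measure how much the test AUC drops. This is a stand-in for the
# study's impact factor, not its documented formula.
import numpy as np
from sklearn.metrics import roc_auc_score

def block_impact(model, X_test, y_test, block_index, n_repeats=10, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    base_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    drops = []
    for _ in range(n_repeats):
        X_perm = X_test.copy()
        order = rng.permutation(X_perm.shape[0])
        X_perm[:, block_index] = X_perm[order, block_index]  # shuffle only this block's column
        perm_auc = roc_auc_score(y_test, model.predict_proba(X_perm)[:, 1])
        drops.append(base_auc - perm_auc)
    return float(np.mean(drops))  # larger drop = larger impact on prediction
```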
Advanced ROI Calculator
Estimate the potential return on investment for implementing advanced AI solutions in your enterprise by adjusting key variables. This calculator provides a realistic projection based on industry averages and our proprietary efficiency models.
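As a back-of-the-envelope illustration only: a simple ROI estimate combines annual cost, an expected efficiency gain, and implementation cost. The variables and formula below are placeholders, not the calculator's proprietary efficiency model.

```python
# Placeholder ROI arithmetic for illustration; real estimates depend on organization-specific
# inputs and the proprietary models referenced above.
def simple_roi(annual_cost, efficiency_gain, implementation_cost, years=3):
    """efficiency_gain: fraction of annual cost saved (e.g., 0.15 for a 15% gain)."""
    total_savings = annual_cost * efficiency_gain * years
    return (total_savings - implementation_cost) / implementation_cost

# Example: $2M annual cost, 15% gain, $500k implementation, 3 years -> 0.8 (80% return).
print(simple_roi(2_000_000, 0.15, 500_000))
```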
Implementation Roadmap
A typical AI solution deployment follows a structured, iterative process to ensure seamless integration and maximum impact. Our proven methodology minimizes disruption and accelerates time-to-value.
Phase 01: Discovery & Strategy
Comprehensive assessment of existing workflows, data infrastructure, and business objectives. Define clear project scope, KPIs, and success metrics.
Phase 02: Data Preparation & Model Development
Collection, cleansing, and transformation of data. Iterative development and training of AI models, ensuring robust performance and bias mitigation.
Phase 03: Integration & Pilot Deployment
Seamless integration of AI models into existing enterprise systems. Pilot testing with a controlled user group to gather feedback and refine functionality.
Phase 04: Full-Scale Rollout & Optimization
Company-wide deployment with comprehensive training and support. Continuous monitoring, evaluation, and fine-tuning for sustained peak performance.
Ready to transform your operations?
Our experts are ready to help you navigate the complexities of AI adoption. Book a personalized strategy session to explore how machine learning can drive efficiency and innovation in your organization.