AI ANALYSIS: Study of comparative performance of general-purpose LLM-based systems in predicting IVF outcomes
Analysis Date: 09 January 2026
Executive Impact Summary
This study evaluates the out-of-the-box performance of three general-purpose large language models (ChatGPT, DeepSeek, and Gemini) in predicting In Vitro Fertilization (IVF) outcomes. Analyzing 1473 IVF/ICSI cycles, the research highlights variable and generally suboptimal performance across tasks, emphasizing their current unsuitability for standalone clinical decision support.
Key Takeaways for Enterprise AI Strategy
- General-purpose LLMs show variable and suboptimal performance for IVF outcome prediction.
- No single LLM consistently achieved high accuracy across all prediction tasks.
- Current LLMs are not suitable for standalone clinical decision support in IVF.
- AI-generated outputs in reproductive medicine need cautious interpretation.
- Future focus should be on task-specific, rigorously validated AI models.
Deep Analysis & Enterprise Applications
IVF Outcome Prediction
The study directly compared three general-purpose LLMs (ChatGPT, DeepSeek, and Gemini) in predicting IVF outcomes. Performance varied significantly across tasks, with no model consistently achieving the accuracy, calibration, or reliability required for independent clinical use.
While Gemini showed the highest accuracy for stimulation protocols (51.26%) and embryo counts (68.22%), and DeepSeek had the lowest numerical error for oocyte counts, all models exhibited limitations. Clinical pregnancy prediction was particularly challenging, with Gemini achieving the highest AUC (0.711) but still indicating only moderate discrimination. The R² values were generally low, suggesting limited explanatory power across continuous outcome predictions. These results underscore that general-purpose LLMs, without fine-tuning, are not yet suitable for complex medical predictions.
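The two continuous-outcome metrics mentioned above can be made concrete with a small sketch. This is an illustration only, not the study's analysis code: the function names `mae` and `r_squared` and the sample oocyte counts are invented for the example. It shows why an R² close to zero means a model explains little more than a constant guess at the cohort mean would.

```python
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error between observed and predicted counts."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical oocyte counts, invented for illustration (not study data).
observed = [4, 12, 7, 20, 9, 15]

# A baseline that ignores the patient and always answers the cohort mean.
# By definition its R² is 0, so a model scoring 0.01-0.02 is barely
# better than this baseline at explaining variance.
constant_guess = [mean(observed)] * len(observed)

print("baseline MAE:", mae(observed, constant_guess))
print("baseline R2: ", r_squared(observed, constant_guess))
```

A low MAE and a low R² can coexist: a model can be close on average simply by predicting near the mean, while still failing to track which cycles yield more or fewer oocytes.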
Model Performance Comparison
Gemini generally outperformed other models in categorical predictions like stimulation protocols and embryo counts, and also in clinical pregnancy AUC. DeepSeek showed an advantage in numerical oocyte count precision, while ChatGPT performed best for trigger type prediction. However, none reached thresholds for reliable clinical use.
Specifically, Gemini led in stimulation protocol accuracy (51.26%) and embryo count classification accuracy (68.22%). ChatGPT was superior for ovulation trigger type (39.31%). DeepSeek had the lowest mean absolute error (MAE) for total and M2 (metaphase II, mature) oocyte counts. Despite these relative strengths, R² scores for count predictions were consistently low (0.01-0.02), highlighting a lack of robust predictive power. Pairwise McNemar tests revealed significant differences in predictive behavior, indicating model-specific strengths and weaknesses rather than a general superiority of one model across all tasks.
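For readers unfamiliar with the McNemar test, the sketch below shows how a pairwise comparison of two models on the same cycles can work. It is a minimal illustration under stated assumptions: the source does not say which variant of the test was used, so the exact binomial form is chosen here for simplicity, and the function name `mcnemar_exact` and the correctness flags are hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired correctness flags.

    correct_a[i] and correct_b[i] say whether model A and model B
    got case i right. Only discordant pairs carry information.
    Returns (b, c, p_value).
    """
    # b: A right, B wrong; c: A wrong, B right.
    b = sum(1 for a, other in zip(correct_a, correct_b) if a and not other)
    c = sum(1 for a, other in zip(correct_a, correct_b) if not a and other)
    n = b + c
    if n == 0:
        # The models never disagree, so there is no evidence of a difference.
        return b, c, 1.0
    # Under the null hypothesis, discordant pairs split 50/50 (Binomial(n, 0.5)).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical flags for ten cycles where the two models disagree
# (invented for illustration, not study data).
model_a = [True] * 8 + [False] * 2
model_b = [False] * 8 + [True] * 2
b, c, p = mcnemar_exact(model_a, model_b)
print(f"A-only correct: {b}, B-only correct: {c}, p = {p:.4f}")
```

Because the test looks only at cases where the two models disagree, it can detect different predictive behavior even when overall accuracies look similar.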
Limitations & Future Directions
Key limitations include the single-center, retrospective design, imbalanced outcome data, and the 'out-of-the-box' application of LLMs without task-specific training. Future work must focus on developing and validating specialized AI models with broader datasets.
The study acknowledges that its findings are hypothesis-generating and cautionary, not definitive. Factors like unmeasured lifestyle/genetic variables and incomplete follow-up for live birth outcomes further limit generalizability. The proprietary and evolving nature of LLMs, coupled with the lack of full transparency in their training, makes direct replication and clinical integration challenging. The emphasis is on developing rigorously validated, task-specific prediction models, potentially integrating structured clinical variables, embryology parameters, and imaging data, and establishing standardized evaluation frameworks.
Despite being the highest, this accuracy is still insufficient for reliable standalone clinical application.
| Model | Protocol Accuracy | Embryo Count Accuracy (±1) | Clinical Pregnancy AUC |
|---|---|---|---|
| ChatGPT | 35.91% | 44.45% | 0.690 |
| DeepSeek | 39.92% | 61.76% | 0.676 |
| Gemini | 51.26% | 68.22% | 0.711 |

Notes: Performance varied significantly, highlighting model-specific strengths and the overall suboptimal nature for clinical decision-making.
Challenge of Clinical Pregnancy Prediction
Clinical pregnancy prediction proved to be the most challenging task for all LLMs. While Gemini achieved the highest AUC (0.711), overall predictive performance remained moderate. This is likely due to the inherent complexity and multi-factorial nature of implantation, as well as the imbalanced distribution of pregnancy outcomes in the dataset. Dedicated, task-specific models often achieve higher accuracy in this domain, underscoring the limitations of general-purpose LLMs for such nuanced medical predictions. Further research is required to incorporate more comprehensive prognostic factors for reliable clinical use.
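To make the AUC figures easier to interpret, the sketch below computes AUC from its rank-based definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. It is an illustration only. It assumes each model yields a numeric score per cycle, which the source does not describe, and the function name `auc`, the labels, and the scores are invented.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for clinical pregnancy, 0 otherwise.
    scores: higher means the model considers pregnancy more likely.
    Ties between a positive and a negative case count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes and model scores (invented, not study data).
labels = [1, 0, 1, 0]
scores = [0.9, 0.1, 0.4, 0.6]
print("AUC:", auc(labels, scores))
```

Read this way, an AUC of 0.711 means the model ranks a pregnancy cycle above a non-pregnancy cycle in roughly 71% of such pairs, against 50% for random ranking.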
The extremely low R² values for count predictions (0.01-0.02) indicate very limited capacity to explain variance in oocyte yield, reflecting both biological variability and model limitations.
Your AI Implementation Roadmap
A structured approach to integrating AI into your enterprise, ensuring maximum impact and smooth transition.
Phase 1: Discovery & Strategy
Understand your unique challenges, identify high-impact AI opportunities, and develop a tailored strategic roadmap.
Phase 2: Pilot & Proof-of-Concept
Implement AI solutions in a controlled environment, validate performance, and refine models based on real-world data.
Phase 3: Scaled Deployment
Integrate validated AI systems across your enterprise, ensuring robust infrastructure and comprehensive training.
Phase 4: Optimization & Future-Proofing
Continuously monitor AI performance, iterate on improvements, and explore new advancements to maintain competitive advantage.