Enterprise AI Analysis: Study of comparative performance of general-purpose LLM-based systems in predicting IVF outcomes


Analysis Date: 09 January 2026

Executive Impact Summary

This study evaluates the out-of-the-box performance of three general-purpose large language models (ChatGPT, DeepSeek, and Gemini) in predicting In Vitro Fertilization (IVF) outcomes. Drawing on 1473 IVF/ICSI cycles, the study found variable and generally suboptimal performance across tasks, indicating that these models are not yet suitable for standalone clinical decision support.

51.26% Gemini Protocol Prediction Accuracy
DeepSeek: Lowest Oocyte MAE
0.711 Gemini Highest Clinical Pregnancy AUC

Key Takeaways for Enterprise AI Strategy

  • General-purpose LLMs show variable and suboptimal performance for IVF outcome prediction.
  • No single LLM consistently achieved high accuracy across all prediction tasks.
  • Current LLMs are not suitable for stand-alone clinical decision support in IVF.
  • Need for cautious interpretation of AI-generated outputs in reproductive medicine.
  • Future focus should be on task-specific, rigorously validated AI models.

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

IVF Outcome Prediction

The study directly compared three general-purpose LLMs (ChatGPT, DeepSeek, and Gemini) in predicting IVF outcomes. Performance varied significantly across tasks, with no model consistently achieving the accuracy, calibration, or reliability required for independent clinical use.

While Gemini showed the highest accuracy for stimulation protocols (51.26%) and embryo counts (68.22%), and DeepSeek had the lowest numerical error for oocyte counts, all models exhibited limitations. Clinical pregnancy prediction was particularly challenging, with Gemini achieving the highest AUC (0.711) but still indicating only moderate discrimination. The R² values were generally low, suggesting limited explanatory power across continuous outcome predictions. These results underscore that general-purpose LLMs, without fine-tuning, are not yet suitable for complex medical predictions.

Model Performance Comparison

Gemini generally outperformed other models in categorical predictions like stimulation protocols and embryo counts, and also in clinical pregnancy AUC. DeepSeek showed an advantage in numerical oocyte count precision, while ChatGPT performed best for trigger type prediction. However, none reached thresholds for reliable clinical use.

Specifically, Gemini led in stimulation protocol accuracy (51.26%) and embryo count classification accuracy (68.22%). ChatGPT was superior for ovulation trigger type (39.31%). DeepSeek had the lowest MAE for total and M2 oocyte counts. Despite these relative strengths, R² scores for count predictions were consistently low (0.01-0.02), highlighting a lack of robust predictive power. Pairwise McNemar tests revealed significant differences in predictive behavior, indicating model-specific strengths and weaknesses rather than a general superiority of one model across all tasks.
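
The pairwise McNemar comparison mentioned above tests whether two models disagree systematically on the same cases. The sketch below shows the standard continuity-corrected form of that test; the discordant counts are illustrative, not taken from the study.

```python
# Sketch of a pairwise McNemar test on two models' per-cycle correctness.
# The counts b and c are illustrative placeholders, not the study's data.
from math import erf, sqrt

def mcnemar_chi2(b, c):
    """Continuity-corrected McNemar statistic from the discordant cells:
    b = cycles model A got right and model B got wrong,
    c = cycles model B got right and model A got wrong."""
    return (abs(b - c) - 1) ** 2 / (b + c)

def chi2_1df_pvalue(x2):
    # Chi-square with 1 df is a squared standard normal, so the survival
    # function is 2 * (1 - Phi(sqrt(x2))).
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(sqrt(x2) / sqrt(2.0))))

b, c = 210, 150  # illustrative discordant counts
x2 = mcnemar_chi2(b, c)
print(f"chi2 = {x2:.2f}, p = {chi2_1df_pvalue(x2):.4f}")
```

Only the discordant cells enter the statistic: cycles where both models were right (or both wrong) carry no information about which model is better.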

Limitations & Future Directions

Key limitations include the single-center, retrospective design, imbalanced outcome data, and the 'out-of-the-box' application of LLMs without task-specific training. Future work must focus on developing and validating specialized AI models with broader datasets.

The study acknowledges that its findings are hypothesis-generating and cautionary, not definitive. Factors like unmeasured lifestyle/genetic variables and incomplete follow-up for live birth outcomes further limit generalizability. The proprietary and evolving nature of LLMs, coupled with the lack of full transparency in their training, makes direct replication and clinical integration challenging. The emphasis is on developing rigorously validated, task-specific prediction models, potentially integrating structured clinical variables, embryology parameters, and imaging data, and establishing standardized evaluation frameworks.

Enterprise Process Flow

1. Data Collection (1473 IVF Cycles)
2. Vignette Generation (Standardized Narratives)
3. LLM Querying (ChatGPT, DeepSeek, Gemini)
4. Performance Evaluation (Accuracy, MAE, AUC)
51.26% Highest Protocol Prediction Accuracy (Gemini)

Despite being the highest, this accuracy is still insufficient for reliable standalone clinical application.

Comparative Performance Across LLMs (Selected Metrics)

Model      Protocol Accuracy   Embryo Count Accuracy (±1)   Clinical Pregnancy AUC
ChatGPT    35.91%              44.45%                       0.690
DeepSeek   39.92%              61.76%                       0.676
Gemini     51.26%              68.22%                       0.711
Notes: Performance varied significantly, highlighting model-specific strengths and the overall suboptimal nature for clinical decision-making.

Challenge of Clinical Pregnancy Prediction

Clinical pregnancy prediction proved to be the most challenging task for all LLMs. While Gemini achieved the highest AUC (0.711), overall predictive performance remained moderate. This is likely due to the inherent complexity and multi-factorial nature of implantation, as well as the imbalanced distribution of pregnancy outcomes in the dataset. Dedicated, task-specific models often achieve higher accuracy in this domain, underscoring the limitations of general-purpose LLMs for such nuanced medical predictions. Further research is required to incorporate more comprehensive prognostic factors for reliable clinical use.
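
One reason AUC is the metric of choice here is that, unlike raw accuracy, it is insensitive to the imbalanced pregnancy outcomes noted above: it measures only how well the model ranks positive cases over negative ones. A minimal rank-based (Mann-Whitney) AUC, with illustrative scores rather than the study's outputs:

```python
# Sketch: AUC as the probability that a randomly chosen pregnancy case is
# scored higher than a randomly chosen non-pregnancy case (ties count half).
# Scores and labels are illustrative, not reproduced from the study.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.7, 0.6, 0.4, 0.35, 0.2]  # model confidence per cycle
labels = [1,   0,   1,   0,   1,    0]    # 1 = clinical pregnancy
print(auc(scores, labels))
```

An AUC of 0.711 therefore means that in roughly 71% of pregnancy/non-pregnancy pairs the model ranks the pregnancy case higher; 0.5 would be chance, so this is moderate discrimination at best.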

0.01-0.02 Typical R² for Oocyte Count Predictions

These extremely low R² values indicate very limited capacity to explain variance in oocyte yield, reflecting the biological variability and model limitations.
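
To make the R² figure concrete: R² compares a model's squared error against the error of simply predicting the mean, so values near zero mean the model barely improves on that trivial baseline. A minimal sketch with illustrative numbers:

```python
# Sketch of the coefficient of determination behind the 0.01-0.02 figures.
# R² near 0 means the predictions explain almost no variance beyond
# predicting the mean oocyte count. Values below are illustrative.
from statistics import mean

def r_squared(y_true, y_pred):
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Predicting every cycle's oocyte yield as the dataset mean gives R² = 0, so an R² of 0.01-0.02 is practically indistinguishable from that baseline.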


Your AI Implementation Roadmap

A structured approach to integrating AI into your enterprise, ensuring maximum impact and smooth transition.

Phase 1: Discovery & Strategy

Understand your unique challenges, identify high-impact AI opportunities, and develop a tailored strategic roadmap.

Phase 2: Pilot & Proof-of-Concept

Implement AI solutions in a controlled environment, validate performance, and refine models based on real-world data.

Phase 3: Scaled Deployment

Integrate validated AI systems across your enterprise, ensuring robust infrastructure and comprehensive training.

Phase 4: Optimization & Future-Proofing

Continuously monitor AI performance, iterate on improvements, and explore new advancements to maintain competitive advantage.

Ready to Transform Your Operations with AI?

Leverage cutting-edge AI insights to drive efficiency, innovation, and growth. Our experts are here to guide your journey.

Ready to Get Started?

Book Your Free Consultation.
