Enterprise AI Analysis for Healthcare
Evaluation of AI language models in answering pregnancy-related questions assessed by obstetrics specialists
This analysis distills key findings from recent research on AI language models in obstetric patient education, providing strategic insights for enterprise AI adoption and highlighting their potential as clinical adjuncts under expert oversight.
Authors: Betül Keyif, Engin Yurtçu, Alper Başbuğ & Fikret Gokhan Goynumer
Publication: Scientific Reports, 16 February 2026
DOI: 10.1038/s41598-026-40609-0
Executive Impact & Key Insights
AI language models demonstrate significant potential in enhancing patient education within obstetrics. This study reveals critical performance differences and underscores the necessity of expert oversight for clinical application.
Abstract: This study aimed to compare the performance of three large language models—ChatGPT-3.5, Gemini, and ChatGPT-4.0—in generating responses to ten frequently asked pregnancy-related questions, as evaluated by obstetrics and gynecology specialists. Seventy-five specialists independently rated 30 anonymized AI-generated responses using a 5-point Likert scale across four domains: accuracy, reliability, patient-friendliness, and comprehensibility. All questions were standardized and presented verbatim to each model using identical zero-shot prompts. Data were analyzed using the Kruskal-Wallis test with Bonferroni-adjusted Mann-Whitney U post-hoc comparisons. Inter-rater consistency was assessed using Cronbach's alpha. Spearman correlation was used to examine associations between clinical experience and evaluation patterns. ChatGPT-4.0 demonstrated the highest overall performance, particularly in accuracy (median 4.35; mean ± SD: 4.30 ± 0.48) and patient-friendliness (4.40; 4.35 ± 0.47). Gemini performed comparably to ChatGPT-4.0 in comprehensibility (3.70; 3.68 ± 0.54), while ChatGPT-3.5 consistently received the lowest scores. Significant differences were observed among the three models for accuracy, reliability, and patient-friendliness (all p < 0.001), but not for comprehensibility (p = 0.521). A modest positive correlation was found between clinical experience and reliability ratings (r = 0.261, p = 0.0238). Among the evaluated models, ChatGPT-4.0 provided the most clinically aligned and patient-friendly responses to common pregnancy questions. While AI tools may offer valuable support for patient education, expert oversight remains essential to ensure accuracy and safety. Further research should explore their real-world impact on patient comprehension, behavior, and clinical outcomes.
Keywords: artificial intelligence; ChatGPT; Gemini; large language models; obstetrics; patient education; pregnancy
Deep Analysis & Enterprise Applications
Overall Model Performance Comparison
The study evaluated ChatGPT-3.5, Gemini, and ChatGPT-4.0 across four domains: accuracy, reliability, patient-friendliness, and comprehensibility. ChatGPT-4.0 consistently demonstrated the highest overall performance, particularly excelling in accuracy and patient-friendliness. Gemini showed an intermediate performance, with scores comparable to ChatGPT-4.0 in comprehensibility, while ChatGPT-3.5 received the lowest scores across most evaluation criteria.
| Domain | ChatGPT-3.5 (Median Score) | Gemini (Median Score) | ChatGPT-4.0 (Median Score) |
|---|---|---|---|
| Accuracy | 3.10 [IQR: 2.80-3.40] | 3.85 [IQR: 3.60-4.10] | 4.35 [IQR: 4.10-4.60] |
| Reliability | 3.05 [IQR: 2.70-3.30] | 3.55 [IQR: 3.25-3.85] | 3.95 [IQR: 3.70-4.20] |
| Patient-Friendliness | 3.20 [IQR: 2.90-3.50] | 3.90 [IQR: 3.60-4.20] | 4.40 [IQR: 4.10-4.70] |
| Comprehensibility | 3.40 [IQR: 3.10-3.70] | 3.70 [IQR: 3.40-4.00] | 3.65 [IQR: 3.40-3.90] |
Key Insight: ChatGPT-4.0 delivered the most clinically aligned and patient-friendly responses, with significant statistical differences observed for accuracy, reliability, and patient-friendliness compared to other models (all p < 0.001).
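The study's comparison protocol (a Kruskal-Wallis omnibus test followed by Bonferroni-adjusted pairwise Mann-Whitney U tests) can be sketched as follows. The per-rater scores are not published, so the ratings below are simulated for illustration only; the structure of the analysis, not the numbers, is the point.

```python
# Illustrative re-creation of the paper's statistical pipeline using
# simulated 1-5 Likert ratings (the real per-rater data are not public).
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(42)
# Hypothetical accuracy ratings from 75 raters for each model.
gpt35 = rng.integers(2, 5, 75)   # simulated to skew lower
gemini = rng.integers(3, 5, 75)
gpt4 = rng.integers(4, 6, 75)    # simulated to skew higher

# Omnibus test across the three models.
h_stat, p_omnibus = kruskal(gpt35, gemini, gpt4)

# Bonferroni-adjusted pairwise Mann-Whitney U post-hoc comparisons.
pairs = {
    "3.5 vs Gemini": (gpt35, gemini),
    "3.5 vs 4.0": (gpt35, gpt4),
    "Gemini vs 4.0": (gemini, gpt4),
}
adjusted = {}
for name, (a, b) in pairs.items():
    _, p = mannwhitneyu(a, b, alternative="two-sided")
    adjusted[name] = min(p * len(pairs), 1.0)  # Bonferroni correction

print(f"Kruskal-Wallis H={h_stat:.2f}, p={p_omnibus:.4g}")
for name, p in adjusted.items():
    print(f"{name}: adjusted p={p:.4g}")
```

Nonparametric tests are the appropriate choice here because Likert ratings are ordinal, so means and variances carry less meaning than rank comparisons.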
AI Response Generation Protocol
The study implemented a rigorous, standardized protocol for AI response generation to ensure high methodological quality and minimize external variables. Each AI model received identical zero-shot prompts for ten pregnancy-related questions, ensuring that baseline performance was evaluated without prior conversation history or iterative refinements.
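The protocol above can be sketched in a few lines. The `query_model` function below is a placeholder, not a real vendor API: the key point is that every model receives every question verbatim, once, in a fresh single-turn session.

```python
# Sketch of the standardized zero-shot protocol described above.
# `query_model` stands in for whichever API client each vendor provides.
QUESTIONS = [
    "What foods should I avoid during pregnancy?",
    "What does early-pregnancy bleeding mean?",
    "Can I continue exercising while pregnant?",
    # ... the remaining standardized questions from the study
]

MODELS = ["ChatGPT-3.5", "Gemini", "ChatGPT-4.0"]

def query_model(model: str, prompt: str) -> str:
    """Placeholder: send one zero-shot prompt to `model` in a fresh
    session, with no conversation history or iterative refinement."""
    return f"[{model} response to: {prompt}]"

def collect_responses() -> dict[str, list[str]]:
    # Identical prompts, presented verbatim and independently per model.
    return {m: [query_model(m, q) for q in QUESTIONS] for m in MODELS}

responses = collect_responses()
print(f"Collected {sum(len(v) for v in responses.values())} responses")
```

Keeping each query in an isolated session is what makes this a baseline measurement: no model benefits from earlier turns or prompt tuning.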
Inter-rater reliability was assessed using Cronbach's alpha, revealing high consistency among the 75 specialist evaluators. A modest positive correlation (Spearman r = 0.261, p = 0.0238) was found between clinical experience and reliability ratings, suggesting that more experienced clinicians tended to assign somewhat higher reliability ratings to AI-generated content.
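Both consistency measures are straightforward to compute. The sketch below uses simulated data (the per-rater matrix is not published): Cronbach's alpha is implemented from its standard formula, and the Spearman correlation uses `scipy.stats.spearmanr`.

```python
# Illustrative computation of the study's two consistency measures,
# on simulated data (per-rater scores are not published).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Hypothetical matrix: 75 raters x 30 rated responses, scores 1-5.
ratings = rng.integers(1, 6, size=(75, 30)).astype(float)

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha: columns are items, rows are raters."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

alpha = cronbach_alpha(ratings)

# Spearman correlation between years of experience and each rater's
# mean reliability rating (both simulated here).
experience = rng.integers(1, 31, 75)
mean_reliability = ratings.mean(axis=1)
rho, p = spearmanr(experience, mean_reliability)
print(f"alpha={alpha:.3f}, rho={rho:.3f}, p={p:.3f}")
```

With genuinely random ratings alpha will sit near zero; the study's high observed consistency is an empirical finding about its raters, not a property of the formula.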
Qualitative Analysis of AI Responses
Qualitative analysis, demonstrated through example question-response sets, revealed distinct patterns among the models. ChatGPT-4.0 provided detailed, clinically coherent, and patient-friendly explanations. Gemini offered balanced but sometimes less detailed responses, while ChatGPT-3.5 frequently produced shorter or incomplete answers. These examples contextualize the quantitative findings and highlight the nuanced differences in output quality.
Example Responses to Pregnancy Questions
Comparing AI model outputs for common pregnancy-related questions, showcasing their typical strengths and weaknesses.
Question 1: What foods should I avoid during pregnancy?
ChatGPT-3.5: "Avoid raw meats, raw eggs and unpasteurized milk. Some fish should not be eaten."
Gemini: "Pregnant women should avoid raw or undercooked meat, unpasteurized dairy, high-mercury fish and deli meats unless reheated."
ChatGPT-4.0: "You should avoid high-mercury fish (shark, swordfish), unpasteurized dairy, raw/undercooked eggs or meat, deli meats unless fully heated, and limit caffeine."
Question 2: What does early-pregnancy bleeding mean?
ChatGPT-3.5: "Bleeding can occur for many reasons including implantation or miscarriage."
Gemini: "Light spotting may be normal, but bleeding can also indicate miscarriage or ectopic pregnancy; medical evaluation is needed."
ChatGPT-4.0: "Early bleeding may be normal (implantation), but can also signal miscarriage, ectopic pregnancy, infection, or subchorionic hematoma; prompt assessment is recommended."
Question 3: Can I continue exercising while pregnant?
ChatGPT-3.5: "Light exercise is usually OK but ask your doctor."
Gemini: "Moderate exercise is generally safe unless you have high-risk conditions; avoid contact sports and activities with fall risk."
ChatGPT-4.0: "Regular moderate aerobic exercise is recommended unless contraindicated; avoid high-impact sports, dehydration, overheating, or activities with falling/trauma risk."
The study concludes that while AI tools like ChatGPT-4.0 show promising potential for supporting patient education, they are not substitutes for professional medical advice. Expert oversight remains crucial to ensure the accuracy, safety, and clinical appropriateness of AI-generated health information.
Projected ROI for AI Integration
Estimate the potential time and cost savings from implementing AI solutions for patient education and administrative support in your enterprise.
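A back-of-envelope version of that estimate is shown below. Every input is hypothetical; substitute your organization's own query volumes, staff costs, and platform pricing.

```python
# ROI sketch with entirely hypothetical inputs; substitute your
# organization's own figures before drawing any conclusions.
def projected_annual_savings(
    queries_per_month: int,
    minutes_saved_per_query: float,
    staff_cost_per_hour: float,
    ai_platform_cost_per_year: float,
) -> float:
    """Net annual savings: staff time recovered minus platform cost."""
    hours_saved = queries_per_month * 12 * minutes_saved_per_query / 60
    return hours_saved * staff_cost_per_hour - ai_platform_cost_per_year

savings = projected_annual_savings(
    queries_per_month=2000,
    minutes_saved_per_query=4.0,
    staff_cost_per_hour=60.0,
    ai_platform_cost_per_year=25000.0,
)
print(f"Projected annual net savings: ${savings:,.0f}")  # $71,000
```

This deliberately omits implementation, training, and oversight costs, which the roadmap below treats as first-class line items; a realistic model would include them.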
Your Enterprise AI Adoption Roadmap
A structured approach to integrating AI language models into your healthcare operations, ensuring patient safety and maximizing clinical value.
01. Needs Assessment & Pilot Study
Identify specific areas within obstetric patient education where AI tools can provide maximum value. Conduct a pilot with selected AI models, focusing on user acceptance and initial accuracy in real-world scenarios.
02. Data Integration & Model Fine-tuning
Integrate AI models with existing healthcare information systems. Fine-tune models with relevant clinical data and local guidelines to enhance accuracy and contextual understanding, especially for patient-specific inquiries.
03. Clinical Oversight & Training
Establish robust clinical oversight mechanisms for AI-generated responses, including human-in-the-loop validation. Train healthcare professionals on how to effectively utilize AI tools as adjuncts for patient education and decision support, emphasizing critical evaluation.
04. Scaled Deployment & Continuous Monitoring
Deploy AI solutions across relevant patient education platforms within your enterprise. Implement continuous monitoring of AI performance, patient comprehension, and clinical outcomes, iterating based on feedback, new medical evidence, and regulatory changes.
Ready to Transform Your Healthcare AI Strategy?
Leverage the power of advanced AI models like ChatGPT-4.0 to enhance patient education and streamline clinical support. Connect with our experts to explore how these insights can be tailored to build a robust, patient-centered AI solution for your organization.