Enterprise AI Analysis for Healthcare
Evaluation of AI language models in answering pregnancy-related questions assessed by obstetrics specialists
This analysis distills key findings from recent research on AI language models in obstetric patient education, providing strategic insights for enterprise AI adoption and highlighting their potential as clinical adjuncts under expert oversight.
Authors: Betül Keyif, Engin Yurtçu, Alper Başbuğ & Fikret Gokhan Goynumer
Publication: Scientific Reports, 16 February 2026
DOI: 10.1038/s41598-026-40609-0
Executive Impact & Key Insights
AI language models demonstrate significant potential in enhancing patient education within obstetrics. This study reveals critical performance differences and underscores the necessity of expert oversight for clinical application.
Abstract: This study aimed to compare the performance of three large language models—ChatGPT-3.5, Gemini, and ChatGPT-4.0—in generating responses to ten frequently asked pregnancy-related questions, as evaluated by obstetrics and gynecology specialists. Seventy-five specialists independently rated 30 anonymized AI-generated responses using a 5-point Likert scale across four domains: accuracy, reliability, patient-friendliness, and comprehensibility. All questions were standardized and presented verbatim to each model using identical zero-shot prompts. Data were analyzed using the Kruskal-Wallis test with Bonferroni-adjusted Mann-Whitney U post-hoc comparisons. Inter-rater consistency was assessed using Cronbach's alpha. Spearman correlation was used to examine associations between clinical experience and evaluation patterns. ChatGPT-4.0 demonstrated the highest overall performance, particularly in accuracy (median 4.35; mean ± SD: 4.30 ± 0.48) and patient-friendliness (4.40; 4.35 ± 0.47). Gemini performed comparably to ChatGPT-4.0 in comprehensibility (3.70; 3.68 ± 0.54), while ChatGPT-3.5 consistently received the lowest scores. Significant differences were observed among the three models for accuracy, reliability, and patient-friendliness (all p < 0.001), but not for comprehensibility (p = 0.521). A modest positive correlation was found between clinical experience and reliability ratings (r = 0.261, p = 0.0238). Among the evaluated models, ChatGPT-4.0 provided the most clinically aligned and patient-friendly responses to common pregnancy questions. While AI tools may offer valuable support for patient education, expert oversight remains essential to ensure accuracy and safety. Further research should explore their real-world impact on patient comprehension, behavior, and clinical outcomes.
Keywords: artificial intelligence; ChatGPT; Gemini; large language models; obstetrics; patient education; pregnancy
Deep Analysis & Enterprise Applications
Overall Model Performance Comparison
The study evaluated ChatGPT-3.5, Gemini, and ChatGPT-4.0 across four domains: accuracy, reliability, patient-friendliness, and comprehensibility. ChatGPT-4.0 consistently demonstrated the highest overall performance, particularly excelling in accuracy and patient-friendliness. Gemini showed an intermediate performance, with scores comparable to ChatGPT-4.0 in comprehensibility, while ChatGPT-3.5 received the lowest scores across most evaluation criteria.
| Domain | ChatGPT-3.5 (Median Score) | Gemini (Median Score) | ChatGPT-4.0 (Median Score) |
|---|---|---|---|
| Accuracy | 3.10 [IQR: 2.80-3.40] | 3.85 [IQR: 3.60-4.10] | 4.35 [IQR: 4.10-4.60] |
| Reliability | 3.05 [IQR: 2.70-3.30] | 3.55 [IQR: 3.25-3.85] | 3.95 [IQR: 3.70-4.20] |
| Patient-Friendliness | 3.20 [IQR: 2.90-3.50] | 3.90 [IQR: 3.60-4.20] | 4.40 [IQR: 4.10-4.70] |
| Comprehensibility | 3.40 [IQR: 3.10-3.70] | 3.70 [IQR: 3.40-4.00] | 3.65 [IQR: 3.40-3.90] |
Key Insight: ChatGPT-4.0 delivered the most clinically aligned and patient-friendly responses, with significant statistical differences observed for accuracy, reliability, and patient-friendliness compared to other models (all p < 0.001).
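The study's comparison protocol (a Kruskal-Wallis omnibus test followed by Bonferroni-adjusted pairwise Mann-Whitney U tests) can be sketched as follows. The per-rater scores are not published, so the ratings below are simulated for illustration only; the structure of the analysis, not the numbers, is the point.

```python
# Illustrative re-creation of the paper's statistical pipeline using
# simulated 1-5 Likert ratings (the real per-rater data are not public).
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(42)
# Hypothetical accuracy ratings from 75 raters for each model.
gpt35 = rng.integers(2, 5, 75)   # simulated to skew lower
gemini = rng.integers(3, 5, 75)
gpt4 = rng.integers(4, 6, 75)    # simulated to skew higher

# Omnibus test across the three models.
h_stat, p_omnibus = kruskal(gpt35, gemini, gpt4)

# Bonferroni-adjusted pairwise Mann-Whitney U post-hoc comparisons.
pairs = {
    "3.5 vs Gemini": (gpt35, gemini),
    "3.5 vs 4.0": (gpt35, gpt4),
    "Gemini vs 4.0": (gemini, gpt4),
}
adjusted = {}
for name, (a, b) in pairs.items():
    _, p = mannwhitneyu(a, b, alternative="two-sided")
    adjusted[name] = min(p * len(pairs), 1.0)  # Bonferroni correction

print(f"Kruskal-Wallis H={h_stat:.2f}, p={p_omnibus:.4g}")
for name, p in adjusted.items():
    print(f"{name}: adjusted p={p:.4g}")
```

Nonparametric tests are the appropriate choice here because Likert ratings are ordinal, so means and variances carry less meaning than rank comparisons.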
AI Response Generation Protocol
The study implemented a rigorous, standardized protocol for AI response generation to ensure high methodological quality and minimize external variables. Each AI model received identical zero-shot prompts for ten pregnancy-related questions, ensuring that baseline performance was evaluated without prior conversation history or iterative refinements.
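The protocol above can be sketched in a few lines. The `query_model` function below is a placeholder, not a real vendor API: the key point is that every model receives every question verbatim, once, in a fresh single-turn session.

```python
# Sketch of the standardized zero-shot protocol described above.
# `query_model` stands in for whichever API client each vendor provides.
QUESTIONS = [
    "What foods should I avoid during pregnancy?",
    "What does early-pregnancy bleeding mean?",
    "Can I continue exercising while pregnant?",
    # ... the remaining standardized questions from the study
]

MODELS = ["ChatGPT-3.5", "Gemini", "ChatGPT-4.0"]

def query_model(model: str, prompt: str) -> str:
    """Placeholder: send one zero-shot prompt to `model` in a fresh
    session, with no conversation history or iterative refinement."""
    return f"[{model} response to: {prompt}]"

def collect_responses() -> dict[str, list[str]]:
    # Identical prompts, presented verbatim and independently per model.
    return {m: [query_model(m, q) for q in QUESTIONS] for m in MODELS}

responses = collect_responses()
print(f"Collected {sum(len(v) for v in responses.values())} responses")
```

Keeping each query in an isolated session is what makes this a baseline measurement: no model benefits from earlier turns or prompt tuning.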
Inter-rater reliability was assessed using Cronbach's alpha, revealing high consistency among the 75 specialist evaluators. A modest positive correlation (Spearman r = 0.261, p = 0.0238) was found between clinical experience and reliability ratings, suggesting that more experienced clinicians tended to assign somewhat higher reliability ratings to AI-generated content.
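Both consistency measures are straightforward to compute. The sketch below uses simulated data (the per-rater matrix is not published): Cronbach's alpha is implemented from its standard formula, and the Spearman correlation uses `scipy.stats.spearmanr`.

```python
# Illustrative computation of the study's two consistency measures,
# on simulated data (per-rater scores are not published).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Hypothetical matrix: 75 raters x 30 rated responses, scores 1-5.
ratings = rng.integers(1, 6, size=(75, 30)).astype(float)

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha: columns are items, rows are raters."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

alpha = cronbach_alpha(ratings)

# Spearman correlation between years of experience and each rater's
# mean reliability rating (both simulated here).
experience = rng.integers(1, 31, 75)
mean_reliability = ratings.mean(axis=1)
rho, p = spearmanr(experience, mean_reliability)
print(f"alpha={alpha:.3f}, rho={rho:.3f}, p={p:.3f}")
```

With genuinely random ratings alpha will sit near zero; the study's high observed consistency is an empirical finding about its raters, not a property of the formula.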
Qualitative Analysis of AI Responses
Qualitative analysis, demonstrated through example question-response sets, revealed distinct patterns among the models. ChatGPT-4.0 provided detailed, clinically coherent, and patient-friendly explanations. Gemini offered balanced but sometimes less detailed responses, while ChatGPT-3.5 frequently produced shorter or incomplete answers. These examples contextualize the quantitative findings and highlight the nuanced differences in output quality.
Example Responses to Pregnancy Questions
Comparing AI model outputs for common pregnancy-related questions, showcasing their typical strengths and weaknesses.
Question 1: What foods should I avoid during pregnancy?
ChatGPT-3.5: "Avoid raw meats, raw eggs and unpasteurized milk. Some fish should not be eaten."
Gemini: "Pregnant women should avoid raw or undercooked meat, unpasteurized dairy, high-mercury fish and deli meats unless reheated."
ChatGPT-4.0: "You should avoid high-mercury fish (shark, swordfish), unpasteurized dairy, raw/undercooked eggs or meat, deli meats unless fully heated, and limit caffeine."
Question 2: What does early-pregnancy bleeding mean?
ChatGPT-3.5: "Bleeding can occur for many reasons including implantation or miscarriage."
Gemini: "Light spotting may be normal, but bleeding can also indicate miscarriage or ectopic pregnancy; medical evaluation is needed."
ChatGPT-4.0: "Early bleeding may be normal (implantation), but can also signal miscarriage, ectopic pregnancy, infection, or subchorionic hematoma; prompt assessment is recommended."
Question 3: Can I continue exercising while pregnant?
ChatGPT-3.5: "Light exercise is usually OK but ask your doctor."
Gemini: "Moderate exercise is generally safe unless you have high-risk conditions; avoid contact sports and activities with fall risk."
ChatGPT-4.0: "Regular moderate aerobic exercise is recommended unless contraindicated; avoid high-impact sports, dehydration, overheating, or activities with falling/trauma risk."
The study concludes that while AI tools like ChatGPT-4.0 show promising potential for supporting patient education, they are not substitutes for professional medical advice. Expert oversight remains crucial to ensure the accuracy, safety, and clinical appropriateness of AI-generated health information.
Projected ROI for AI Integration
Estimate the potential time and cost savings from implementing AI solutions for patient education and administrative support in your enterprise.
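A back-of-envelope version of that estimate is shown below. Every input is hypothetical; substitute your organization's own query volumes, staff costs, and platform pricing.

```python
# ROI sketch with entirely hypothetical inputs; substitute your
# organization's own figures before drawing any conclusions.
def projected_annual_savings(
    queries_per_month: int,
    minutes_saved_per_query: float,
    staff_cost_per_hour: float,
    ai_platform_cost_per_year: float,
) -> float:
    """Net annual savings: staff time recovered minus platform cost."""
    hours_saved = queries_per_month * 12 * minutes_saved_per_query / 60
    return hours_saved * staff_cost_per_hour - ai_platform_cost_per_year

savings = projected_annual_savings(
    queries_per_month=2000,
    minutes_saved_per_query=4.0,
    staff_cost_per_hour=60.0,
    ai_platform_cost_per_year=25000.0,
)
print(f"Projected annual net savings: ${savings:,.0f}")  # $71,000
```

This deliberately omits implementation, training, and oversight costs, which the roadmap below treats as first-class line items; a realistic model would include them.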
Your Enterprise AI Adoption Roadmap
A structured approach to integrating AI language models into your healthcare operations, ensuring patient safety and maximizing clinical value.
01. Needs Assessment & Pilot Study
Identify specific areas within obstetric patient education where AI tools can provide maximum value. Conduct a pilot with selected AI models, focusing on user acceptance and initial accuracy in real-world scenarios.
02. Data Integration & Model Fine-tuning
Integrate AI models with existing healthcare information systems. Fine-tune models with relevant clinical data and local guidelines to enhance accuracy and contextual understanding, especially for patient-specific inquiries.
03. Clinical Oversight & Training
Establish robust clinical oversight mechanisms for AI-generated responses, including human-in-the-loop validation. Train healthcare professionals on how to effectively utilize AI tools as adjuncts for patient education and decision support, emphasizing critical evaluation.
04. Scaled Deployment & Continuous Monitoring
Deploy AI solutions across relevant patient education platforms within your enterprise. Implement continuous monitoring of AI performance, patient comprehension, and clinical outcomes, iterating based on feedback, new medical evidence, and regulatory changes.
Ready to Transform Your Healthcare AI Strategy?
Leverage the power of advanced AI models like ChatGPT-4.0 to enhance patient education and streamline clinical support. Connect with our experts to explore how these insights can be tailored to build a robust, patient-centered AI solution for your organization.