Skip to main content
Enterprise AI Analysis: Impact of authoritative and subjective cues on large language model reliability for clinical inquiries: an experimental study

Enterprise AI Analysis

Impact of authoritative and subjective cues on large language model reliability for clinical inquiries: an experimental study

To determine how subjective or authoritative misinformation embedded in user prompts affects large language model (LLM) accuracy on a clinical question with a known gold-standard answer (the treatment line of aripiprazole). Five leading LLMs answered the clinical question under three prompt conditions: (1) neutral, (2) an incorrect “self-recalled” memory, and (3) an incorrect statement attributed to an authority. Each model-scenario pair was repeated ten times (250 total responses). Accuracy differences were tested with x² and Cramér's V, and score shifts were analyzed with van Elteren tests. All models were correct under the neutral prompt (100% accuracy). Accuracy dropped to 45% with self-recall prompts and to 1 % with authoritative prompts, indicating a strong prompt-accuracy association (Cramér's V = 0.75, P < .001). Efficacy and tolerability ratings fell in parallel, yet models' self-rated confidence stayed high and was statistically indistinguishable from baseline. LLMs are highly susceptible to misleading cues, especially those invoking authority, while remaining overconfident. These findings call for stronger validation standards, user education, and design safeguards before deploying LLMs in healthcare.

Executive Impact & Strategic Recommendations

This study unveils critical vulnerabilities in Large Language Models (LLMs) when faced with misleading cues, especially in high-stakes clinical inquiry contexts. Enterprise AI strategies must prioritize robust validation, user education, and safeguard integration to ensure reliable and safe deployment in healthcare.

0% Accuracy Drop (Authoritative Cue)
0% Accuracy Drop (Self-Recall Cue)
0% Baseline Accuracy (Neutral Prompt)
0 Confidence Stability (Despite Errors)

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Executive Summary
Methodology Deep Dive
Core Performance Metrics
Strategic Recommendations

Unpacking LLM Susceptibility to Misinformation

This study rigorously tested the reliability of five leading Large Language Models (LLMs) when presented with a clinical question, specifically the treatment line of aripiprazole for difficult-to-treat depression. While all LLMs achieved 100% accuracy under neutral conditions, their performance drastically deteriorated when user prompts contained misleading information. An incorrect "self-recalled" memory led to an accuracy drop to 45%, while an incorrect statement attributed to an "authority" caused accuracy to plummet to a mere 1%.

Crucially, despite these significant drops in accuracy, the LLMs' self-rated confidence remained high and statistically indistinguishable from baseline in the authoritative condition, highlighting a concerning overconfidence bias. These findings expose a critical vulnerability: LLMs are highly susceptible to external cues, particularly those perceived as authoritative, which can override their internal knowledge base. For enterprise AI deployments, especially in sensitive domains like healthcare, this susceptibility demands immediate attention to ensure trustworthy and safe operation.

Experimental Design & Prompt Engineering Insights

The study employed a controlled experimental design to assess LLM reliability under different information styles. Five state-of-the-art LLMs (OpenAI's GPT-40 and o3, Google AI's Gemini 2.5 Pro, and two variants of Gemini 2.5 Flash) were tasked with a clinical question based on the 2023 CANMAT guidelines for major depressive disorder. Each model was tested ten times under three primary prompt conditions:

Enterprise Process Flow

Neutral Prompt (Control)
Self-Recall Prompt (Incorrect Memory)
Authoritative Prompt (Incorrect Statement)

This structured approach allowed for systematic measurement of accuracy, efficacy, tolerability, and confidence scores, revealing how different input biases can significantly alter LLM outputs. The findings underscore the importance of careful prompt engineering and the need for systems to be robust against manipulative or misleading user inputs.

Quantitative Analysis of LLM Response Behavior

A deeper look into the experimental results reveals the stark reality of LLM vulnerability when confronted with misleading information. The impact of authoritative cues was particularly profound, nearly eradicating accuracy across all tested models.

1% LLM Accuracy with Authoritative Cues
45% LLM Accuracy with Self-Recall Cues

Furthermore, a critical finding was the disconnect between model accuracy and its self-reported confidence, a paradox that carries significant risks in real-world applications.

Prompt Type Accuracy Avg. Confidence (0-10)
Neutral 100% 8.84
Self-Recall 45% 8.46
Authoritative 1% 8.89

The data clearly demonstrates that LLMs maintained high confidence even when providing erroneous information, especially under the influence of authoritative cues, highlighting a major challenge for user trust and safety.

Actionable Insights for Safe AI Deployment

The findings of this study have profound implications for all stakeholders involved in the development and deployment of AI in healthcare and beyond. Proactive measures are essential to mitigate the identified risks.

Mitigating LLM Vulnerability in Clinical Settings

The study reveals critical vulnerabilities of LLMs, particularly their susceptibility to authoritative misinformation and their tendency to remain overconfident despite providing incorrect answers. This has profound implications for clinical use.

  • User Education: Healthcare professionals must cultivate AI literacy, employing 'prompt hygiene' to avoid embedding subjective or authoritative cues.
  • System Guardrails: Developers should implement detection mechanisms for misleading cues in prompts, providing immediate advisories to users.
  • Regulatory Frameworks: Policymakers need to establish robust evaluation criteria beyond static accuracy, focusing on LLM stability, interpretability, and resilience in dynamic, real-world interactions.

These recommendations are crucial for building trust, ensuring ethical use, and maximizing the beneficial impact of AI technologies in sensitive enterprise environments.

Quantify Your AI ROI

Use our interactive calculator to estimate the potential time and cost savings for your enterprise by implementing tailored AI solutions.

Estimated Annual Savings
Annual Hours Reclaimed

Your Enterprise AI Transformation Roadmap

(Typical 6-12 month engagement)

Phase 1: Discovery & Strategy

Comprehensive assessment of current operations, identification of AI opportunities, and development of a tailored strategy aligned with business objectives.

Phase 2: Data Engineering & Preparation

Building robust data pipelines, cleansing, transformation, and secure storage to ensure high-quality data for AI model training.

Phase 3: Model Development & Training

Iterative development of custom AI models, leveraging state-of-the-art algorithms and rigorous training with your enterprise data.

Phase 4: Integration & Deployment

Seamless integration of AI solutions into existing workflows and systems, with secure and scalable deployment strategies.

Phase 5: Monitoring, Optimization & Support

Continuous monitoring of AI performance, post-deployment optimization, and ongoing support to ensure sustained value and adaptability.

Ready to Own Your AI?

Book a complimentary strategy session with our AI experts to explore how these insights apply to your business and how we can tailor a solution for your success.

Ready to Get Started?

Book Your Free Consultation.

Let's Discuss Your AI Strategy!

Lets Discuss Your Needs


AI Consultation Booking