
Enterprise AI Analysis

When Stability Fails: Hidden Failure Modes of LLMs in Data-Constrained Scientific Decision-Making

This paper investigates the reliability of Large Language Models (LLMs) as decision-support tools in data-constrained scientific workflows, particularly gene prioritization. It introduces a behavioral evaluation framework assessing stability, correctness against statistical ground truth, prompt sensitivity, and output validity. Findings reveal that LLMs can exhibit high run-to-run stability while systematically diverging from statistical ground truth, that they are highly sensitive to minor changes in prompt wording, and that they can generate invalid outputs. Stability alone is therefore insufficient for reliable scientific decision-making; explicit ground-truth validation and output validity checks are needed.

Key Findings at a Glance

Headline metrics reported in the study: LLM stability (Jaccard), agreement with ground truth (Jaccard), and prompt sensitivity (Jaccard).

Deep Analysis & Enterprise Applications


LLM Reliability in Scientific Contexts

In data-constrained scientific workflows, LLM reliability is paramount. This research dissects four dimensions: stability, correctness, prompt sensitivity, and output validity. The results underscore that high stability across repeated runs does not equate to correctness when a statistical ground truth exists.

Diagnosing LLM Failure Modes

This study reveals four critical failure modes: systematic divergence from ground truth despite high stability, extreme sensitivity to minor prompt wording changes, over-selection under relaxed thresholds, and hallucination of syntactically plausible but invalid gene identifiers. These issues are particularly problematic in high-stakes scientific decision-making.

A New Evaluation Paradigm

The proposed behavioral evaluation framework explicitly separates the dimensions of stability, correctness, prompt sensitivity, and output validity. This comprehensive approach allows for a controlled diagnosis of LLM behavior, moving beyond simple reproducibility metrics to assess true scientific utility.
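To make the separation concrete, here is a minimal Python sketch of the set-based scoring, assuming each dimension is scored with the Jaccard index. The aggregation across repeated runs (mean pairwise Jaccard) is an assumption for illustration, not taken from the paper.

```python
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |A intersect B| / |A union B|; defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def run_stability(runs: list[set[str]]) -> float:
    """Run-to-run stability: mean Jaccard over all pairs of repeated runs (needs >= 2 runs)."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)

def correctness(runs: list[set[str]], reference: set[str]) -> float:
    """Correctness: mean Jaccard between each run's gene set and the statistical ground truth."""
    return sum(jaccard(r, reference) for r in runs) / len(runs)
```

Keeping the two scores separate is the point: a model can score 1.00 on stability while scoring 0.00 on correctness, which a reproducibility-only evaluation would never surface.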

74% drop in Jaccard similarity for ChatGPT between prompt variants P7a and P7b, which differ only in wording.

Enterprise Process Flow

Fixed DE table input → LLM query (repeated runs) → Output parsing → Compare to DESeq2 reference → Assess stability → Assess correctness → Assess sensitivity → Assess validity → Identify failure modes
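A skeletal harness for this flow; query_llm and parse_gene_set are hypothetical stand-ins for the model call and the output parser (neither is from the paper), and the scoring helpers are the ones sketched in the framework section above.

```python
from typing import Callable

def evaluate_regime(
    query_llm: Callable[[str], str],            # hypothetical: one LLM call, returns raw text
    parse_gene_set: Callable[[str], set[str]],  # hypothetical: extracts gene IDs from raw text
    prompt: str,
    reference: set[str],                        # deterministic DESeq2 gene set
    n_runs: int = 5,
) -> dict[str, float]:
    """Query the model n_runs times with an identical prompt, parse, and score."""
    runs = [parse_gene_set(query_llm(prompt)) for _ in range(n_runs)]
    return {
        "stability": run_stability(runs),             # from the earlier sketch
        "correctness": correctness(runs, reference),  # from the earlier sketch
    }
```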

Model Performance Across Key Dimensions (FDR < 0.05)

Model   | Run-to-Run Stability (Jaccard) | Agreement with Ground Truth (Jaccard) | Hallucinated IDs per Run
ChatGPT | 1.00                           | 1.00                                  | 0
Gemini  | 1.00                           | 1.00                                  | 0
Claude  | 1.00                           | 0.00                                  | 20

Case Study: Borderline Ranking Uncertainty (P6)

In the borderline Top-20 ranking scenario (P6), models were tasked with prioritizing genes under uncertainty. Gemini achieved perfect agreement (Jaccard = 1.00) with the deterministic reference, demonstrating its ability to handle nuanced statistical ranking. In contrast, ChatGPT diverged significantly (Jaccard = 0.14), and Claude again returned no true signal. This highlights varying capabilities in handling complex, data-constrained decision rules.
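For a P6-style reference list, one plausible construction ranks genes by adjusted p-value; the tie-breaking rule below (absolute log2 fold change) is an assumption for illustration, since the paper's exact rule is not reproduced here.

```python
import pandas as pd

def top_k_reference(de: pd.DataFrame, k: int = 20) -> set[str]:
    """Rank genes by adjusted p-value; ties broken by |log2FoldChange| (assumed rule)."""
    ranked = de.assign(abs_lfc=de["log2FoldChange"].abs())
    ranked = ranked.sort_values(["padj", "abs_lfc"], ascending=[True, False])
    return set(ranked.head(k)["gene"])
```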


Your Enterprise AI Roadmap

A structured approach to leveraging LLMs responsibly in your scientific workflows.

Phase 1: Controlled Experiment Design

Define the statistical gene prioritization task, fix the differential expression table, and design prompt regimes that vary thresholds and wording. Establish a deterministic ground-truth reference (e.g., DESeq2 at FDR < 0.05).
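A minimal sketch of this step, assuming a standard DESeq2 results export; the file name and column names are placeholders.

```python
import pandas as pd

# Placeholder file name; assumes a DESeq2 results table exported to CSV with
# at least 'gene', 'log2FoldChange', and 'padj' (BH-adjusted p-value) columns.
de = pd.read_csv("deseq2_results.csv")

# Deterministic ground truth for the strict regime: genes passing FDR < 0.05.
reference = set(de.loc[de["padj"] < 0.05, "gene"])
```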

Phase 2: LLM Query Execution & Data Collection

Query selected LLMs (ChatGPT, Gemini, Claude) multiple times per prompt regime with identical inputs. Collect raw outputs for analysis.

Phase 3: Output Parsing & Metric Calculation

Develop automated parsing for gene sets and compute evaluation metrics: Jaccard for stability/correctness, overlap coefficient for containment, and validity checks for hallucinated IDs.
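A sketch of the two remaining metrics, assuming gene sets have already been parsed; the universe is the set of gene IDs present in the fixed DE table.

```python
def overlap_coefficient(a: set[str], b: set[str]) -> float:
    """Szymkiewicz-Simpson overlap |A intersect B| / min(|A|, |B|); 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def hallucinated_ids(predicted: set[str], universe: set[str]) -> set[str]:
    """Validity check: IDs the model returned that never appear in the input DE table."""
    return predicted - universe
```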

Phase 4: Behavioral Analysis & Failure Mode Identification

Analyze LLM outputs across the four dimensions (stability, correctness, sensitivity, validity) to identify and categorize specific failure modes, such as systematic divergence or prompt sensitivity.
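One way to operationalize the categorization; the numeric cutoffs below are illustrative assumptions, chosen only to flag the "stable but wrong" pattern visible in the results table above, not values from the paper.

```python
def flag_failure_modes(stability: float, correctness: float,
                       n_hallucinated: int) -> list[str]:
    """Illustrative rules; the 0.9 / 0.5 cutoffs are assumptions, not from the paper."""
    flags = []
    if stability >= 0.9 and correctness <= 0.5:
        flags.append("stable but systematically divergent from ground truth")
    if n_hallucinated > 0:
        flags.append("invalid output: hallucinated gene IDs")
    return flags
```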

Phase 5: Reporting & Recommendations

Summarize findings, quantify performance, and formulate recommendations for responsible LLM deployment in data-constrained scientific workflows, emphasizing explicit validation.

Ready to Unlock Your Enterprise AI Potential?

Schedule a Consultation to Discuss LLM Integration
