Enterprise AI Analysis
When Stability Fails: Hidden Failure Modes of LLMs in Data-Constrained Scientific Decision-Making
This paper investigates the reliability of Large Language Models (LLMs) as decision-support tools in data-constrained scientific workflows, with a focus on gene prioritization. It introduces a behavioral evaluation framework that assesses four dimensions: stability, correctness against a statistical ground truth, prompt sensitivity, and output validity. The findings reveal that LLMs can exhibit high run-to-run stability while systematically diverging from the statistical ground truth, remaining highly sensitive to minor changes in prompt wording, and producing invalid outputs such as hallucinated identifiers. Stability alone is therefore insufficient for reliable scientific decision-making; explicit ground-truth validation and output validity checks are needed.
Key Findings at a Glance
Deep Analysis & Enterprise Applications
LLM Reliability in Scientific Contexts
In data-constrained scientific workflows, LLM reliability is paramount. This research dissects four dimensions of that reliability: stability, correctness, prompt sensitivity, and output validity. The results underscore that high stability across repeated runs does not equate to correctness when a statistical ground truth exists.
Diagnosing LLM Failure Modes
This study reveals four critical failure modes: systematic divergence from ground truth despite high stability, extreme prompt sensitivity to minor wording changes, over-selection under relaxed thresholds, and the hallucination of syntactically plausible but invalid data. These issues are particularly problematic in high-stakes scientific decision-making.
A New Evaluation Paradigm
The proposed behavioral evaluation framework explicitly separates the dimensions of stability, correctness, prompt sensitivity, and output validity. This comprehensive approach allows for a controlled diagnosis of LLM behavior, moving beyond simple reproducibility metrics to assess true scientific utility.
Headline Results: Stability vs. Ground-Truth Agreement
| Model | Run-to-Run Stability (Jaccard) | Agreement with Ground Truth (Jaccard) | Hallucinated IDs per run |
|---|---|---|---|
| ChatGPT | 1.00 | 1.00 | 0 |
| Gemini | 1.00 | 1.00 | 0 |
| Claude | 1.00 | 0.00 | 20 |
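The distinction the table draws can be made concrete with a small sketch. The gene sets below are hypothetical placeholders, not the study's data: two identical runs yield a perfect run-to-run Jaccard score while sharing nothing with the ground truth, mirroring the Claude row.

```python
# Illustrative only: why run-to-run stability does not imply
# agreement with a statistical ground truth.

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical gene sets: two identical runs of one model.
ground_truth = {"TP53", "BRCA1", "EGFR", "MYC"}
run_1 = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}  # stable but wrong
run_2 = set(run_1)

stability = jaccard(run_1, run_2)           # 1.0: perfectly repeatable
correctness = jaccard(run_1, ground_truth)  # 0.0: no overlap with truth

print(f"stability={stability:.2f}, correctness={correctness:.2f}")
```

Reporting the two scores separately is exactly what keeps a perfectly repeatable but wrong model from looking trustworthy.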
Case Study: Borderline Ranking Uncertainty (P6)
In the borderline Top-20 ranking scenario (P6), models were tasked with prioritizing genes under uncertainty. Gemini achieved perfect agreement (Jaccard = 1.00) with the deterministic reference, demonstrating its ability to handle nuanced statistical ranking. In contrast, ChatGPT diverged significantly (Jaccard = 0.14), and Claude again returned no true signal. This highlights varying capabilities in handling complex, data-constrained decision rules.
Calculate Your Potential AI Impact
Estimate the time and cost savings your organization could achieve by integrating intelligent automation, based on this research.
Your Enterprise AI Roadmap
A structured approach to leveraging LLMs responsibly in your scientific workflows.
Phase 1: Controlled Experiment Design
Define the statistical gene prioritization task, fix the differential expression table, and design prompt regimes that vary thresholds and wording. Ensure a deterministic ground-truth reference.
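As an illustration of what a deterministic reference might look like, the sketch below applies fixed significance cutoffs to a toy differential expression table. The thresholds (padj < 0.05, |log2FC| > 1.0), gene names, and values are all hypothetical stand-ins, not the paper's actual design.

```python
# Sketch of a deterministic ground-truth rule over a fixed differential
# expression table. All cutoffs and records here are hypothetical.

def select_significant(table, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Return gene IDs passing both cutoffs, sorted for determinism."""
    hits = [row["gene"] for row in table
            if row["padj"] < padj_cutoff and abs(row["log2fc"]) > lfc_cutoff]
    return sorted(hits)

de_table = [
    {"gene": "GENE_A", "padj": 0.001, "log2fc": 2.3},
    {"gene": "GENE_B", "padj": 0.20,  "log2fc": 3.1},   # fails padj
    {"gene": "GENE_C", "padj": 0.01,  "log2fc": -1.8},
    {"gene": "GENE_D", "padj": 0.04,  "log2fc": 0.4},   # fails |log2FC|
]

reference = select_significant(de_table)
print(reference)  # ['GENE_A', 'GENE_C']
```

Because the rule is a pure function of the fixed table, the reference set is identical on every run, which is what makes run-to-run and ground-truth comparisons meaningful downstream.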
Phase 2: LLM Query Execution & Data Collection
Query selected LLMs (ChatGPT, Gemini, Claude) multiple times per prompt regime with identical inputs. Collect raw outputs for analysis.
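The collection step might be sketched as below. `query_model` is a hypothetical stub standing in for a real vendor API call (the ChatGPT, Gemini, and Claude client libraries each differ), and the gene-symbol regex is a deliberately crude heuristic, not a validated parser.

```python
import re

# Sketch of the repeated-query loop. `query_model` is a placeholder for a
# real API call; it returns a canned response so the loop is runnable.

GENE_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]{1,9}\b")  # crude symbol heuristic

def query_model(model: str, prompt: str) -> str:
    return "Top genes: TP53, EGFR, MYC"  # hypothetical stub response

def collect_runs(model: str, prompt: str, n_runs: int = 5) -> list[set[str]]:
    """Issue the identical prompt n_runs times and parse each reply."""
    runs = []
    for _ in range(n_runs):
        raw = query_model(model, prompt)
        runs.append(set(GENE_PATTERN.findall(raw)))
    return runs

runs = collect_runs("chatgpt", "Rank the top genes ...", n_runs=3)
```

Keeping the raw replies alongside the parsed sets (rather than parsing in place and discarding) is what later allows validity failures to be traced back to the model rather than to the parser.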
Phase 3: Output Parsing & Metric Calculation
Develop automated parsing for gene sets and compute evaluation metrics: Jaccard for stability/correctness, overlap coefficient for containment, and validity checks for hallucinated IDs.
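A minimal sketch of these three metrics, assuming a toy valid-ID universe in place of a real gene annotation:

```python
# Evaluation metrics from Phase 3. The valid-ID universe and the
# predicted/truth sets below are tiny hypothetical examples.

def jaccard(a: set, b: set) -> float:
    """Jaccard index; defined as 1.0 when both sets are empty."""
    return 1.0 if not (a | b) else len(a & b) / len(a | b)

def overlap_coefficient(a: set, b: set) -> float:
    """|A ∩ B| / min(|A|, |B|): 1.0 when one set contains the other."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def hallucinated(predicted: set, valid_universe: set) -> set:
    """IDs the model emitted that do not exist in the reference universe."""
    return predicted - valid_universe

valid = {"TP53", "BRCA1", "EGFR", "MYC", "KRAS"}
pred = {"TP53", "EGFR", "FAKE1"}
truth = {"TP53", "EGFR", "MYC"}

print(jaccard(pred, truth))             # 0.5
print(hallucinated(pred, valid))        # {'FAKE1'}
```

The overlap coefficient complements Jaccard for the over-selection failure mode: a model that returns a superset of the truth scores 1.0 on containment while its Jaccard score drops, separating "too many genes" from "the wrong genes".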
Phase 4: Behavioral Analysis & Failure Mode Identification
Analyze LLM outputs across the four dimensions (stability, correctness, sensitivity, validity) to identify and categorize specific failure modes, such as systematic divergence or prompt sensitivity.
Phase 5: Reporting & Recommendations
Summarize findings, quantify performance, and formulate recommendations for responsible LLM deployment in data-constrained scientific workflows, emphasizing explicit validation.
Ready to Unlock Your Enterprise AI Potential?
Schedule a Consultation to Discuss LLM Integration