ENTERPRISE AI ANALYSIS
PsychiatryBench: a multi-task benchmark for LLMs in psychiatry
The paper introduces PsychiatryBench, a meticulously curated, multi-task benchmark for evaluating Large Language Models (LLMs) in psychiatry. Developed from authoritative psychiatric textbooks and casebooks, it features eleven distinct question-answering tasks totaling 5,188 expert-annotated items, covering diagnostic reasoning, treatment planning, and longitudinal follow-up. Evaluation of frontier LLMs (Google Gemini, DeepSeek, Sonnet 4.5, and GPT 5) alongside open-source medical models (MedGemma) reveals significant gaps in clinical consistency and safety, particularly in multi-turn follow-up and management tasks. The benchmark aims to provide a modular and extensible platform for improving LLM performance in mental health applications, addressing limitations of existing evaluation resources.
Deep Analysis & Enterprise Applications
Performance Landscape
The results demonstrate a clearly stratified performance landscape and a rapid, positive trajectory in LLM capabilities for diagnostic reasoning. GPT 5 Medium (T) stands out as the premier model with the highest average score of 84.5%, closely followed by Sonnet 4.5 (T) at 83.7%. These models consistently deliver state-of-the-art performance, securing the best or second-best scores across the majority of tasks, particularly those involving complex clinical judgment such as Diagnosis, Treatment, and Management Plan.
Task-Specific Challenges
A horizontal analysis of Table 2 reveals consistent patterns in task difficulty across models. Modern LLMs are remarkably proficient at tasks that require synthesizing contextual information and generating structured, long-form clinical reasoning. This is most evident in Sequential QA, where Sonnet 4.5 (T) achieved a near-perfect score of 96.2%. Similarly, strong performance in the Clinical Approach task, which topped out at 90.2% for Sonnet 4.5, shows that contemporary models are adept at constructing coherent diagnostic and management pathways from complex psychiatric vignettes. Conversely, two task categories remain persistent challenges that expose the limits of current LLM capabilities: Classification of Specific Disorders (an F1-score of 0.52 and subset accuracy of 45.0% for GPT 5 Medium (T)) and Extended Matching Items (EMI), which shows broad performance variability.
Model Specialization
A compelling narrative within our results is the performance of the domain-specialized model, MedGemma. Despite its moderate size, MedGemma achieves an impressive average score of 78.5%, placing it on par with large-scale generalist models such as Gemini 2.5 Pro (80.2%) and DeepSeek-R1 (80.4%). This performance is not uniform; rather, it is concentrated in areas that directly benefit from its specialized training on biomedical and clinical texts. The specialization comes with a trade-off, however: MedGemma was less competitive than top-tier generalist models in broader, open-ended reasoning tasks such as Management Plan (81.7%) and Sequential QA (87.7%).
Inference Strategies
Our study's inclusion of 'Thinking' variants for the Gemini 2.5 Flash and Sonnet 4.5 models reveals that the benefit of more deliberative, multi-step inference is highly architecture-dependent. For the Anthropic models, this strategy yielded a significant performance dividend. Sonnet 4.5 (T) not only surpassed its standard counterpart with an average score of 83.7% versus 81.5% but also achieved the highest performance on several of the most cognitively demanding tasks, including Diagnosis (89.9%), Treatment (88.4%), Treatment Follow-Up (90.4%), and Mental QA (92.5%). In contrast, this advantage was not evident for the Google Gemini models, where performance was mixed.
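To make the comparison concrete, here is a minimal sketch of how one might query the same vignette with and without extended thinking, assuming the Anthropic Python SDK; the model identifier, token budgets, and prompt are placeholders rather than the paper's actual configuration.

```python
# Minimal sketch: querying the same psychiatric vignette with and without
# extended thinking via the Anthropic Python SDK. The model ID, token
# budget, and prompt are illustrative placeholders, not the paper's setup.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

VIGNETTE = "A 34-year-old presents with ... What is the most likely diagnosis?"

def ask(enable_thinking: bool) -> str:
    kwargs = dict(
        model="claude-sonnet-4-5",  # placeholder ID for Sonnet 4.5
        max_tokens=2048,
        messages=[{"role": "user", "content": VIGNETTE}],
    )
    if enable_thinking:
        # Extended thinking allocates a separate budget for deliberation
        # before the final answer is produced.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 1024}
    response = client.messages.create(**kwargs)
    # Keep only the final text blocks; thinking blocks carry the reasoning trace.
    return "".join(b.text for b in response.content if b.type == "text")

standard_answer = ask(enable_thinking=False)
thinking_answer = ask(enable_thinking=True)
```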
Cross-Task Consistency
An inspection of the scores in Table 2 reveals substantial differences in cross-task stability among models. General-purpose frontier models such as Sonnet 4.5 (T) and GPT 5 Medium (T) demonstrate the highest degree of uniformity, maintaining strong performance across both highly structured tasks and open-ended reasoning tasks. In contrast, specialized medical models such as MedGemma and Med_Palmyra display a more uneven performance profile.
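As a rough way to operationalize 'cross-task stability', a minimal sketch that summarizes a model's per-task scores by their mean and standard deviation; the scores below are illustrative placeholders, not the paper's Table 2 values.

```python
# Minimal sketch: quantifying cross-task consistency as the spread of a
# model's per-task scores. Scores are illustrative placeholders only.
from statistics import mean, stdev

per_task_scores = {
    "Sonnet 4.5 (T)": [89.9, 88.4, 90.4, 92.5, 96.2, 83.0],
    "MedGemma":       [84.0, 79.0, 81.7, 87.7, 70.0, 68.0],
}

for model, scores in per_task_scores.items():
    # A lower standard deviation relative to the mean indicates a more
    # uniform profile across structured and open-ended tasks.
    print(f"{model}: mean={mean(scores):.1f}, stdev={stdev(scores):.1f}")
```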
Persistent Challenges: Classification of Specific Disorders
The Classification of Specific Disorders task proved to be the most difficult across the benchmark, with even top-tier models like GPT 5 Medium (T) achieving only an F1-score of 0.52 and a subset accuracy of 45.0%.
This reflects the inherent difficulty of multi-label classification in psychiatry, where overlapping symptoms and comorbidities blur categorical boundaries.
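To see why the two numbers diverge, here is a minimal sketch contrasting micro-F1 (partial credit per label) with subset accuracy (exact match of the full label set), using scikit-learn and invented disorder labels.

```python
# Minimal sketch: why F1-score and subset accuracy diverge on multi-label
# classification. Disorder labels and predictions are invented examples.
from sklearn.metrics import f1_score, accuracy_score
from sklearn.preprocessing import MultiLabelBinarizer

gold = [{"MDD", "GAD"}, {"PTSD"}, {"MDD"}]
pred = [{"MDD"},        {"PTSD"}, {"MDD", "GAD"}]

mlb = MultiLabelBinarizer().fit(gold + pred)
y_true, y_pred = mlb.transform(gold), mlb.transform(pred)

# Micro-F1 gives partial credit for each correctly predicted label...
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
# ...while subset accuracy requires the *entire* label set to match,
# which is why it can sit far below F1 when comorbidities are missed.
print("subset accuracy:", accuracy_score(y_true, y_pred))
```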
The Extended Matching Items (EMI) task, which demands discriminating among numerous clinically similar options, also showed broad performance variability. While GPT 5 Medium (T) scored 89.1%, Gemini 2.0 Flash recorded a much lower 75.5%.
| Feature | Specialized Models (e.g., MedGemma) | Generalist Models (e.g., Sonnet 4.5, GPT 5) |
|---|---|---|
| Average Score | 78.5% | 83.7% - 84.5% |
| Strengths | Knowledge-intensive tasks (Mental QA, Classification of Specific Disorders) where domain-specific pretraining pays off | Uniform performance across structured and open-ended tasks; complex clinical judgment (Diagnosis, Treatment, Management Plan) |
| Limitations | Less consistent on multi-step reasoning, contextual linking, and flexible narrative generation (Sequential QA, Clinical Approach) | Multi-label classification remains weak (F1 0.52, subset accuracy 45.0% for GPT 5 Medium (T)); variable on EMI |
Our analysis reveals distinct performance signatures across model families, reflecting different architectural priorities and the effectiveness of their training. The frontier generalist models from Anthropic and OpenAI set the benchmark, with Sonnet 4.5 (T) (83.7%) and GPT 5 Medium (T) (84.5%) demonstrating state-of-the-art capabilities driven by recent architectural innovations. In contrast, other families highlight key limitations. The Gemini models, while strong with Gemini 2.5 Pro at 80.2%, show that advanced inference modes are not universally beneficial; its 'Thinking' variants yielded inconsistent gains, unlike the significant boosts seen in the Sonnet series.
Instability in Specialized and Lower-Performing Models
Specialized medical models such as MedGemma and Med_Palmyra display a more uneven performance profile. Their strengths are concentrated in knowledge-intensive tasks such as Mental QA and Classification of Specific Disorders, where domain-specific pretraining yields clear advantages.
However, these models are noticeably less consistent on tasks requiring multi-step reasoning, contextual linking, or flexible narrative generation, such as Sequential QA and Clinical Approach.
Lower-performing models, exemplified by JSL_MedLlama, exhibit the widest instability across tasks. Large drops in performance between knowledge tasks and contextual reasoning tasks suggest that these models struggle to maintain coherent reasoning pipelines when task demands shift.
PsychiatryBench Development Pipeline
Implementation Roadmap
Our structured approach ensures a seamless integration of AI, maximizing benefits and minimizing disruption.
Phase 1: Dataset Curation & Task Design
The dataset is manually curated to address the unique challenges of psychiatric reasoning. Items are sourced from authoritative psychiatry textbooks and expert-validated clinical resources, with natural-language questions paired with expert-formulated answers.
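For illustration, a minimal sketch of how a single benchmark item might be represented; the field names and example values are assumptions, not the paper's published schema.

```python
# Minimal sketch: one way to represent a PsychiatryBench-style item.
# Field names and example values are illustrative assumptions, not the
# paper's actual schema.
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    task: str              # one of the eleven task types, e.g. "Diagnosis"
    source: str            # textbook or casebook the vignette was drawn from
    vignette: str          # the clinical case description
    question: str          # natural-language question posed to the model
    reference_answer: str  # expert-formulated gold answer

item = BenchmarkItem(
    task="Diagnosis",
    source="(casebook citation)",
    vignette="A 28-year-old reports three weeks of low mood and anhedonia...",
    question="What is the most likely diagnosis?",
    reference_answer="Major depressive disorder, single episode.",
)
```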
Phase 2: LLM Evaluation & Metric Definition
Each model is applied to the finalized PsychiatryBench dataset, prompted to generate answers across all eleven selected task types. LLMs are also used as evaluators or judges, scoring responses based on accuracy, completeness, and clinical relevance.
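A minimal sketch of such an LLM-as-judge loop, assuming a hypothetical `call_judge` helper and a 0-10 rubric; the prompt wording and scale are illustrative, not the paper's actual protocol.

```python
# Minimal sketch of an LLM-as-judge loop. `call_judge` is a hypothetical
# stand-in for a real judge-model API; the rubric, prompt wording, and
# 0-10 scale are illustrative assumptions.
import json

JUDGE_PROMPT = """You are grading a psychiatric QA response.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Score accuracy, completeness, and clinical relevance from 0 to 10 each.
Reply with JSON: {{"accuracy": int, "completeness": int, "relevance": int}}"""

def call_judge(prompt: str) -> str:
    """Stand-in for a real judge-model call; returns a canned verdict
    so the sketch runs end to end."""
    return '{"accuracy": 8, "completeness": 7, "relevance": 9}'

def score_response(question: str, reference: str, answer: str) -> dict:
    raw = call_judge(JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer))
    return json.loads(raw)

verdict = score_response(
    question="What is the most likely diagnosis?",
    reference="Major depressive disorder.",
    answer="The presentation is most consistent with major depressive disorder.",
)
print(verdict)
```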
Phase 3: Cross-Task Consistency & Model Comparison
Cross-task stability is evaluated across all models, revealing that general-purpose frontier models maintain high uniformity while specialized medical models exhibit more uneven performance profiles.
Ready to Transform Your Psychiatric Practice?
Leverage cutting-edge AI to enhance diagnostic precision, streamline workflows, and improve patient care.