ENTERPRISE AI ANALYSIS
PsychiatryBench: a multi-task benchmark for LLMs in psychiatry
The paper introduces PsychiatryBench, a meticulously curated, multi-task benchmark for evaluating Large Language Models (LLMs) in psychiatry. Developed from authoritative psychiatric textbooks and casebooks, it features eleven distinct question-answering tasks totaling 5,188 expert-annotated items, covering diagnostic reasoning, treatment planning, and longitudinal follow-up. Evaluation of frontier LLMs (Google Gemini, DeepSeek, Sonnet 4.5, and GPT 5) alongside open-source medical models (MedGemma) reveals significant gaps in clinical consistency and safety, particularly in multi-turn follow-up and management tasks. The benchmark aims to provide a modular and extensible platform for improving LLM performance in mental health applications, addressing limitations of existing evaluation resources.
Deep Analysis & Enterprise Applications
Performance Landscape
The results demonstrate a clearly stratified performance landscape and a rapid, positive trajectory in LLM capabilities for diagnostic reasoning. GPT 5 Medium (T) stands out as the premier model with the highest average score of 84.5%, closely followed by Sonnet 4.5 (T) at 83.7%. These models consistently deliver state-of-the-art performance, securing the best or second-best scores across the majority of tasks, particularly those involving complex clinical judgment such as Diagnosis, Treatment, and Management Plan.
Task-Specific Challenges
A horizontal analysis of Table 2 reveals consistent patterns in task difficulty across models. Modern LLMs are remarkably proficient at tasks that require synthesizing contextual information and generating structured, long-form clinical reasoning. This is most evident in Sequential QA, where Sonnet 4.5 (T) achieved a near-perfect score of 96.2%. Similarly, strong performance in the Clinical Approach task, which topped out at 90.2% for Sonnet 4.5, shows that contemporary models are adept at constructing coherent diagnostic and management pathways from complex psychiatric vignettes. Conversely, two task categories remain persistent challenges that expose the limits of current LLM capabilities: Classification of Specific Disorders (an F1-score of 0.52 and subset accuracy of 45.0% for GPT 5 Medium (T)) and Extended Matching Items (EMI), which shows broad performance variability.
Model Specialization
A compelling narrative within our results is the performance of the domain-specialized model, MedGemma. Despite its moderate size, MedGemma achieves an impressive average score of 78.5%, placing it on par with large-scale generalist models such as Gemini 2.5 Pro (80.2%) and DeepSeek-R1 (80.4%). This performance is not uniform; rather, it is concentrated in areas that directly benefit from its specialized training on biomedical and clinical texts. The specialization comes with a trade-off, however: MedGemma was less competitive than top-tier generalist models in broader, open-ended reasoning tasks such as Management Plan (81.7%) and Sequential QA (87.7%).
Inference Strategies
Our study's inclusion of 'Thinking' variants for the Gemini 2.5 Flash and Sonnet 4.5 models reveals that the benefit of more deliberative, multi-step inference is highly architecture-dependent. For the Anthropic models, this strategy yielded a significant performance dividend. Sonnet 4.5 (T) not only surpassed its standard counterpart with an average score of 83.7% versus 81.5% but also achieved the highest performance on several of the most cognitively demanding tasks, including Diagnosis (89.9%), Treatment (88.4%), Treatment Follow-Up (90.4%), and Mental QA (92.5%). In contrast, this advantage was not evident for the Google Gemini models, where performance was mixed.
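To make the comparison concrete, here is a minimal sketch of how one might query the same vignette with and without extended thinking, assuming the Anthropic Python SDK; the model identifier, token budgets, and prompt are placeholders rather than the paper's actual configuration.

```python
# Minimal sketch: querying the same psychiatric vignette with and without
# extended thinking via the Anthropic Python SDK. The model ID, token
# budget, and prompt are illustrative placeholders, not the paper's setup.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

VIGNETTE = "A 34-year-old presents with ... What is the most likely diagnosis?"

def ask(enable_thinking: bool) -> str:
    kwargs = dict(
        model="claude-sonnet-4-5",  # placeholder ID for Sonnet 4.5
        max_tokens=2048,
        messages=[{"role": "user", "content": VIGNETTE}],
    )
    if enable_thinking:
        # Extended thinking allocates a separate budget for deliberation
        # before the final answer is produced.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 1024}
    response = client.messages.create(**kwargs)
    # Keep only the final text blocks; thinking blocks carry the reasoning trace.
    return "".join(b.text for b in response.content if b.type == "text")

standard_answer = ask(enable_thinking=False)
thinking_answer = ask(enable_thinking=True)
```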
Cross-Task Consistency
An inspection of the scores in Table 2 reveals substantial differences in cross-task stability among models. General-purpose frontier models such as Sonnet 4.5 (T) and GPT 5 Medium (T) demonstrate the highest degree of uniformity, maintaining strong performance across both highly structured tasks and open-ended reasoning tasks. In contrast, specialized medical models such as MedGemma and Med_Palmyra display a more uneven performance profile.
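As a rough way to operationalize 'cross-task stability', a minimal sketch that summarizes a model's per-task scores by their mean and standard deviation; the scores below are illustrative placeholders, not the paper's Table 2 values.

```python
# Minimal sketch: quantifying cross-task consistency as the spread of a
# model's per-task scores. Scores are illustrative placeholders only.
from statistics import mean, stdev

per_task_scores = {
    "Sonnet 4.5 (T)": [89.9, 88.4, 90.4, 92.5, 96.2, 83.0],
    "MedGemma":       [84.0, 79.0, 81.7, 87.7, 70.0, 68.0],
}

for model, scores in per_task_scores.items():
    # A lower standard deviation relative to the mean indicates a more
    # uniform profile across structured and open-ended tasks.
    print(f"{model}: mean={mean(scores):.1f}, stdev={stdev(scores):.1f}")
```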
Persistent Challenges: Classification of Specific Disorders
The Classification of Specific Disorders task proved to be the most difficult across the benchmark, with even top-tier models like GPT 5 Medium (T) achieving only an F1-score of 0.52 and a subset accuracy of 45.0%.
This reflects the inherent difficulty of multi-label classification in psychiatry, where overlapping symptoms and comorbidities blur categorical boundaries.
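To see why the two numbers diverge, here is a minimal sketch contrasting micro-F1 (partial credit per label) with subset accuracy (exact match of the full label set), using scikit-learn and invented disorder labels.

```python
# Minimal sketch: why F1-score and subset accuracy diverge on multi-label
# classification. Disorder labels and predictions are invented examples.
from sklearn.metrics import f1_score, accuracy_score
from sklearn.preprocessing import MultiLabelBinarizer

gold = [{"MDD", "GAD"}, {"PTSD"}, {"MDD"}]
pred = [{"MDD"},        {"PTSD"}, {"MDD", "GAD"}]

mlb = MultiLabelBinarizer().fit(gold + pred)
y_true, y_pred = mlb.transform(gold), mlb.transform(pred)

# Micro-F1 gives partial credit for each correctly predicted label...
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
# ...while subset accuracy requires the *entire* label set to match,
# which is why it can sit far below F1 when comorbidities are missed.
print("subset accuracy:", accuracy_score(y_true, y_pred))
```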
The Extended Matching Items (EMI) task, which demands discriminating among numerous clinically similar options, also showed broad performance variability. While GPT 5 Medium (T) scored 89.1%, Gemini 2.0 Flash recorded a much lower 75.5%.
| Feature | Specialized Models (e.g., MedGemma) | Generalist Models (e.g., Sonnet 4.5, GPT 5) |
|---|---|---|
| Average Score | 78.5% | 83.7% - 84.5% |
| Strengths | Knowledge-intensive tasks (Mental QA, Classification of Specific Disorders) where domain-specific pretraining pays off | Uniform performance across structured and open-ended tasks; complex clinical judgment (Diagnosis, Treatment, Management Plan) |
| Limitations | Less consistent on multi-step reasoning, contextual linking, and flexible narrative generation (Sequential QA, Clinical Approach) | Multi-label classification remains weak (F1 0.52, subset accuracy 45.0% for GPT 5 Medium (T)); variable on EMI |
Our analysis reveals distinct performance signatures across model families, reflecting different architectural priorities and the effectiveness of their training. The frontier generalist models from Anthropic and OpenAI set the benchmark, with Sonnet 4.5 (T) (83.7%) and GPT 5 Medium (T) (84.5%) demonstrating state-of-the-art capabilities driven by recent architectural innovations. In contrast, other families highlight key limitations. The Gemini models, while strong with Gemini 2.5 Pro at 80.2%, show that advanced inference modes are not universally beneficial; its 'Thinking' variants yielded inconsistent gains, unlike the significant boosts seen in the Sonnet series.
Instability in Specialized and Lower-Performing Models
Specialized medical models such as MedGemma and Med_Palmyra display a more uneven performance profile. Their strengths are concentrated in knowledge-intensive tasks such as Mental QA and Classification of Specific Disorders, where domain-specific pretraining yields clear advantages.
However, these models are noticeably less consistent on tasks requiring multi-step reasoning, contextual linking, or flexible narrative generation, such as Sequential QA and Clinical Approach.
Lower-performing models, exemplified by JSL_MedLlama, exhibit the widest instability across tasks. Large drops in performance between knowledge tasks and contextual reasoning tasks suggest that these models struggle to maintain coherent reasoning pipelines when task demands shift.
PsychiatryBench Development Pipeline
Implementation Roadmap
Our structured approach ensures a seamless integration of AI, maximizing benefits and minimizing disruption.
Phase 1: Dataset Curation & Task Design
The dataset is manually curated to address the unique challenges of psychiatric reasoning. Items are sourced from authoritative psychiatry textbooks and expert-validated clinical resources, with natural-language questions paired with expert-formulated answers.
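For illustration, a minimal sketch of how a single benchmark item might be represented; the field names and example values are assumptions, not the paper's published schema.

```python
# Minimal sketch: one way to represent a PsychiatryBench-style item.
# Field names and example values are illustrative assumptions, not the
# paper's actual schema.
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    task: str              # one of the eleven task types, e.g. "Diagnosis"
    source: str            # textbook or casebook the vignette was drawn from
    vignette: str          # the clinical case description
    question: str          # natural-language question posed to the model
    reference_answer: str  # expert-formulated gold answer

item = BenchmarkItem(
    task="Diagnosis",
    source="(casebook citation)",
    vignette="A 28-year-old reports three weeks of low mood and anhedonia...",
    question="What is the most likely diagnosis?",
    reference_answer="Major depressive disorder, single episode.",
)
```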
Phase 2: LLM Evaluation & Metric Definition
Each model is applied to the finalized PsychiatryBench dataset, prompted to generate answers across all eleven selected task types. LLMs are also used as evaluators or judges, scoring responses based on accuracy, completeness, and clinical relevance.
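A minimal sketch of such an LLM-as-judge loop, assuming a hypothetical `call_judge` helper and a 0-10 rubric; the prompt wording and scale are illustrative, not the paper's actual protocol.

```python
# Minimal sketch of an LLM-as-judge loop. `call_judge` is a hypothetical
# stand-in for a real judge-model API; the rubric, prompt wording, and
# 0-10 scale are illustrative assumptions.
import json

JUDGE_PROMPT = """You are grading a psychiatric QA response.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Score accuracy, completeness, and clinical relevance from 0 to 10 each.
Reply with JSON: {{"accuracy": int, "completeness": int, "relevance": int}}"""

def call_judge(prompt: str) -> str:
    """Stand-in for a real judge-model call; returns a canned verdict
    so the sketch runs end to end."""
    return '{"accuracy": 8, "completeness": 7, "relevance": 9}'

def score_response(question: str, reference: str, answer: str) -> dict:
    raw = call_judge(JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer))
    return json.loads(raw)

verdict = score_response(
    question="What is the most likely diagnosis?",
    reference="Major depressive disorder.",
    answer="The presentation is most consistent with major depressive disorder.",
)
print(verdict)
```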
Phase 3: Cross-Task Consistency & Model Comparison
Cross-task stability is evaluated across all models, revealing that general-purpose frontier models maintain high uniformity while specialized medical models exhibit more uneven performance profiles.
Ready to Transform Your Psychiatric Practice?
Leverage cutting-edge AI to enhance diagnostic precision, streamline workflows, and improve patient care.