Enterprise AI Analysis
A large-scale benchmark for evaluating large language models on medical question answering in Romanian
This paper introduces MedQARo, the first large-scale medical QA benchmark in Romanian, featuring 105,880 QA pairs drawn from real-world clinical records of 1,242 oncology patients. The study evaluates state-of-the-art large language models (LLMs) in both zero-shot and supervised fine-tuning scenarios, highlighting the critical importance of domain-specific and language-specific adaptation for reliable clinical QA in low-resource languages. Fine-tuned models, particularly RoMistral-7B, significantly outperform zero-shot and API-based LLMs such as GPT-5.2 and Gemini 3 Flash, underscoring both the difficulty of the benchmark and the need for task-adapted models.
Executive Impact & Key Metrics
This research provides foundational insights for enterprises seeking to leverage AI in medical question answering, especially in specialized or underserved language domains.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
The paper introduces MedQARo, the first large-scale medical QA benchmark in Romanian. It comprises 105,880 QA pairs about cancer patients from clinical records. A comprehensive evaluation of state-of-the-art LLMs is performed, including open-source and API-based models, under zero-shot and supervised fine-tuning scenarios. Results emphasize the crucial role of domain-specific and language-specific fine-tuning for reliable clinical QA in low-resource languages, with fine-tuned models significantly outperforming zero-shot counterparts.
The development of robust Question Answering (QA) systems is critical for human-level AI. While Large Language Models (LLMs) have advanced, specialized domains like medical QA and low-resource languages like Romanian pose unique challenges. Medical QA requires deep understanding of clinical terminology and reasoning, with high accuracy being paramount due to serious consequences of mistakes. Existing multilingual QA datasets largely omit Romanian. MedQARo addresses this gap by providing the first large-scale medical QA benchmark in Romanian, grounded in oncology case summaries.
MedQARo data was manually extracted and annotated by seven physicians from clinical documents (epicrises) from two medical centers in Bucharest, Romania. The dataset covers 1,242 oncology patients (796 breast, 215 lung, 231 other cancers), yielding 105,880 high-quality QA pairs. Data splits are made at the patient level to prevent leakage, and include in-domain and cross-domain test collections for assessing generalization. Four open-source LLMs (RoLLaMA2-7B, RoMistral-7B, Phi-4-mini-instruct, LLaMA3-OpenBioLLM-8B) were evaluated both zero-shot and after LoRA-based supervised fine-tuning. GPT-5.2 and Gemini 3 Flash were evaluated zero-shot via their APIs. Evaluation metrics include F1 score, Exact Match (EM), BLEU, and METEOR. Ethical approval was obtained, and the data is anonymized and publicly available.
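The F1 and Exact Match metrics listed above can be sketched in plain Python. This is the standard SQuAD-style token-level formulation, offered as an illustrative sketch rather than the authors' exact scoring script:

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> int:
    """1 if the normalized prediction equals the reference, else 0."""
    return int(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Token-level F1 gives partial credit when a generated answer overlaps the reference, which is why it is the headline metric for free-form clinical answers, while EM rewards only verbatim agreement.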
Fine-tuned models significantly outperform zero-shot models and baselines on MedQARo. The Q+E+A prompt format yields higher performance than E+Q+A. RoMistral-7B (fine-tuned) achieves the best performance with an F1 score of 0.671 on the in-domain test set, demonstrating robustness. Phi-4-mini-instruct performs better when focusing on the first 2,048 tokens of an epicrisis, suggesting context trimming as a regularization technique. Cross-domain evaluation reveals a noticeable performance drop for all models, highlighting challenges in generalization. API-based LLMs (GPT-5.2, Gemini 3 Flash) perform worse than fine-tuned open-source LLMs, underscoring the importance of task-specific and domain-specific fine-tuning.
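The two prompt orderings compared above can be illustrated with a small builder. The section labels and template strings here are illustrative assumptions, not the exact prompts used in the paper:

```python
def build_prompt(question: str, epicrisis: str, order: str = "QEA") -> str:
    """Assemble a QA prompt with the question placed before (QEA)
    or after (EQA) the clinical note."""
    q = f"Question: {question}"
    e = f"Clinical record: {epicrisis}"
    if order == "QEA":
        parts = [q, e]
    elif order == "EQA":
        parts = [e, q]
    else:
        raise ValueError(f"unknown order: {order}")
    return "\n\n".join(parts) + "\n\nAnswer:"
```

Placing the question first (the better-performing ordering in the paper) lets the model know what to look for before reading a long epicrisis, which plausibly explains the observed gap.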
MedQARo is a challenging benchmark: even the best fine-tuned model reaches an F1 of only 0.671 in-domain and 0.445 cross-domain, leaving ample room for improvement. Key findings: 1) longer prompts are not always better, so choosing the context length carefully is crucial; 2) language-specific LLMs (RoMistral-7B, Phi-4-mini-instruct) are better starting points than medical domain-specific ones (OpenBioLLM-8B) for Romanian QA; 3) fine-tuning is critical, significantly outperforming zero-shot prompting. Future work will explore Retrieval-Augmented Generation (RAG) to improve generalization and handle long clinical narratives without resorting to context trimming.
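At its simplest, the RAG direction mentioned for future work would retrieve only the passages of an epicrisis most relevant to the question before prompting the model. The bag-of-words cosine retriever below is an illustrative sketch, not the authors' proposed pipeline:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q_vec = Counter(question.lower().split())
    scored = sorted(passages,
                    key=lambda p: cosine(q_vec, Counter(p.lower().split())),
                    reverse=True)
    return scored[:k]
```

A production system would replace the bag-of-words vectors with dense embeddings, but the principle is the same: feeding the model a few relevant passages sidesteps the context-length problems that trimming only partially solves.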
| Model Type | Key Advantages |
|---|---|
| Fine-Tuned LLMs (e.g., RoMistral-7B) | Adapted to Romanian clinical language and the task format; best in-domain performance on MedQARo (F1 of 0.671) |
| Zero-Shot LLMs (e.g., GPT-5.2, Gemini 3 Flash) | No task-specific training data or fine-tuning required and fast to deploy, but markedly lower accuracy on MedQARo |
Enterprise Process Flow
Impact of Prompt Length and Context Trimming
Analysis revealed that long prompts are not always helpful: trimming epicrises to fit the prompt length can act as a regularization technique, sometimes improving performance by reducing input clutter, while overly short contexts degrade performance.
Outcome: Models achieve superior performance with optimal input token counts (e.g., 2,048 tokens for RoLLaMA2-7B-Instruct) compared to both shorter and significantly longer sequences, indicating a sweet spot for context length in clinical QA.
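The trimming step described above can be sketched as a simple token-budget truncation. A whitespace tokenizer stands in for the model's real subword tokenizer here, which is an assumption made for brevity:

```python
def trim_epicrisis(text: str, max_tokens: int = 2048) -> str:
    """Keep only the first max_tokens whitespace tokens of a clinical note.

    The paper reports that truncating long epicrises can act as a
    regularizer by dropping low-signal tail content; a real system would
    count subword tokens with the model's own tokenizer instead.
    """
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    return " ".join(tokens[:max_tokens])
```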
Calculate Your Potential AI ROI
Estimate the efficiency gains and cost savings for your enterprise by integrating specialized LLM-powered medical QA.
Your AI Implementation Roadmap
A structured approach to integrating advanced LLMs for clinical question answering.
Phase 1: Data Preparation & Model Selection
Gather and preprocess your specific clinical data, select appropriate base LLMs for fine-tuning.
Phase 2: Custom Fine-Tuning & Adaptation
Apply LoRA-based fine-tuning on MedQARo and your custom data, optimizing for Romanian medical terminology.
Phase 3: Integration & Validation
Integrate the fine-tuned LLMs into your existing systems, perform rigorous validation against clinical benchmarks.
Phase 4: Deployment & Continuous Improvement
Deploy the specialized AI agents, monitor performance, and iterate based on real-world feedback.
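Phase 2's LoRA-based fine-tuning can be configured with the Hugging Face PEFT library. The base checkpoint name and hyperparameters below are illustrative assumptions, not the paper's exact training recipe:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical base checkpoint and LoRA hyperparameters -- adjust to your setup.
base = AutoModelForCausalLM.from_pretrained("OpenLLM-Ro/RoMistral-7b-Instruct")
lora = LoraConfig(
    r=16,                 # rank of the low-rank update matrices
    lora_alpha=32,        # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Because only the small adapter matrices are trained, this approach keeps fine-tuning affordable on a single GPU while leaving the base model's weights frozen.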
Ready to Transform Your Clinical AI Strategy?
Book a complimentary strategy session to explore how fine-tuned LLMs can revolutionize medical question answering in your enterprise.