Enterprise AI Analysis
A large-scale benchmark for evaluating large language models on medical question answering in Romanian
This paper introduces MedQARo, the first large-scale medical QA benchmark in Romanian, featuring 105,880 QA pairs drawn from real-world clinical records of 1,242 oncology patients. The study evaluates state-of-the-art large language models (LLMs) in both zero-shot and supervised fine-tuning scenarios, highlighting the critical importance of domain-specific and language-specific adaptation for reliable clinical QA in low-resource languages. Fine-tuned models, particularly RoMistral-7B, significantly outperform zero-shot and API-based LLMs such as GPT-5.2 and Gemini 3 Flash, underscoring both the difficulty of the benchmark and the need for task-adapted models.
Executive Impact & Key Metrics
This research provides foundational insights for enterprises seeking to leverage AI in medical question answering, especially in specialized or underserved language domains.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
The paper introduces MedQARo, the first large-scale medical QA benchmark in Romanian. It comprises 105,880 QA pairs about cancer patients from clinical records. A comprehensive evaluation of state-of-the-art LLMs is performed, including open-source and API-based models, under zero-shot and supervised fine-tuning scenarios. Results emphasize the crucial role of domain-specific and language-specific fine-tuning for reliable clinical QA in low-resource languages, with fine-tuned models significantly outperforming zero-shot counterparts.
The development of robust Question Answering (QA) systems is critical for human-level AI. While Large Language Models (LLMs) have advanced, specialized domains like medical QA and low-resource languages like Romanian pose unique challenges. Medical QA requires deep understanding of clinical terminology and reasoning, with high accuracy being paramount due to serious consequences of mistakes. Existing multilingual QA datasets largely omit Romanian. MedQARo addresses this gap by providing the first large-scale medical QA benchmark in Romanian, grounded in oncology case summaries.
MedQARo data was manually extracted and annotated by seven physicians from clinical documents (epicrises) from two medical centers in Bucharest, Romania. The dataset covers 1,242 oncology patients (796 breast, 215 lung, 231 other cancers), yielding 105,880 high-quality QA pairs. Data splits are made at the patient level to prevent leakage, and include in-domain and cross-domain test collections for assessing generalization. Four open-source LLMs (RoLLaMA2-7B, RoMistral-7B, Phi-4-mini-instruct, LLaMA3-OpenBioLLM-8B) were evaluated both zero-shot and after LoRA-based supervised fine-tuning. GPT-5.2 and Gemini 3 Flash were evaluated zero-shot via their APIs. Evaluation metrics include F1 score, Exact Match (EM), BLEU, and METEOR. Ethical approval was obtained, and the data is anonymized and publicly available.
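The F1 and Exact Match metrics listed above can be sketched in plain Python. This is the standard SQuAD-style token-level formulation, offered as an illustrative sketch rather than the authors' exact scoring script:

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> int:
    """1 if the normalized prediction equals the reference, else 0."""
    return int(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Token-level F1 gives partial credit when a generated answer overlaps the reference, which is why it is the headline metric for free-form clinical answers, while EM rewards only verbatim agreement.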
Fine-tuned models significantly outperform zero-shot models and baselines on MedQARo. The Q+E+A prompt format yields higher performance than E+Q+A. RoMistral-7B (fine-tuned) achieves the best performance with an F1 score of 0.671 on the in-domain test set, demonstrating robustness. Phi-4-mini-instruct performs better when focusing on the first 2,048 tokens of an epicrisis, suggesting context trimming as a regularization technique. Cross-domain evaluation reveals a noticeable performance drop for all models, highlighting challenges in generalization. API-based LLMs (GPT-5.2, Gemini 3 Flash) perform worse than fine-tuned open-source LLMs, underscoring the importance of task-specific and domain-specific fine-tuning.
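The two prompt orderings compared above can be illustrated with a small builder. The section labels and template strings here are illustrative assumptions, not the exact prompts used in the paper:

```python
def build_prompt(question: str, epicrisis: str, order: str = "QEA") -> str:
    """Assemble a QA prompt with the question placed before (QEA)
    or after (EQA) the clinical note."""
    q = f"Question: {question}"
    e = f"Clinical record: {epicrisis}"
    if order == "QEA":
        parts = [q, e]
    elif order == "EQA":
        parts = [e, q]
    else:
        raise ValueError(f"unknown order: {order}")
    return "\n\n".join(parts) + "\n\nAnswer:"
```

Placing the question first (the better-performing ordering in the paper) lets the model know what to look for before reading a long epicrisis, which plausibly explains the observed gap.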
MedQARo is a challenging benchmark: even the best fine-tuned model reaches an F1 of only 0.671 in-domain and 0.445 cross-domain, leaving ample room for improvement. Key findings: 1) longer prompts are not always better, so choosing the context length carefully is crucial; 2) language-specific LLMs (RoMistral-7B, Phi-4-mini-instruct) are better starting points than medical domain-specific ones (OpenBioLLM-8B) for Romanian QA; 3) fine-tuning is critical, significantly outperforming zero-shot prompting. Future work will explore Retrieval-Augmented Generation (RAG) to improve generalization and handle long clinical narratives without resorting to context trimming.
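At its simplest, the RAG direction mentioned for future work would retrieve only the passages of an epicrisis most relevant to the question before prompting the model. The bag-of-words cosine retriever below is an illustrative sketch, not the authors' proposed pipeline:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q_vec = Counter(question.lower().split())
    scored = sorted(passages,
                    key=lambda p: cosine(q_vec, Counter(p.lower().split())),
                    reverse=True)
    return scored[:k]
```

A production system would replace the bag-of-words vectors with dense embeddings, but the principle is the same: feeding the model a few relevant passages sidesteps the context-length problems that trimming only partially solves.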
| Model Type | Key Advantages |
|---|---|
| Fine-Tuned LLMs (e.g., RoMistral-7B) | Adapted to Romanian clinical language and the task format; best in-domain performance on MedQARo (F1 of 0.671) |
| Zero-Shot LLMs (e.g., GPT-5.2, Gemini 3 Flash) | No task-specific training data or fine-tuning required and fast to deploy, but markedly lower accuracy on MedQARo |
Enterprise Process Flow
Impact of Prompt Length and Context Trimming
Analysis revealed that long prompts are not always helpful: trimming epicrises to fit the prompt length can act as a regularization technique, sometimes improving performance by reducing input clutter, while overly short contexts degrade performance.
Outcome: Models achieve superior performance with optimal input token counts (e.g., 2,048 tokens for RoLLaMA2-7B-Instruct) compared to both shorter and significantly longer sequences, indicating a sweet spot for context length in clinical QA.
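The trimming step described above can be sketched as a simple token-budget truncation. A whitespace tokenizer stands in for the model's real subword tokenizer here, which is an assumption made for brevity:

```python
def trim_epicrisis(text: str, max_tokens: int = 2048) -> str:
    """Keep only the first max_tokens whitespace tokens of a clinical note.

    The paper reports that truncating long epicrises can act as a
    regularizer by dropping low-signal tail content; a real system would
    count subword tokens with the model's own tokenizer instead.
    """
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    return " ".join(tokens[:max_tokens])
```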
Calculate Your Potential AI ROI
Estimate the efficiency gains and cost savings for your enterprise by integrating specialized LLM-powered medical QA.
Your AI Implementation Roadmap
A structured approach to integrating advanced LLMs for clinical question answering.
Phase 1: Data Preparation & Model Selection
Gather and preprocess your specific clinical data, select appropriate base LLMs for fine-tuning.
Phase 2: Custom Fine-Tuning & Adaptation
Apply LoRA-based fine-tuning on MedQARo and your custom data, optimizing for Romanian medical terminology.
Phase 3: Integration & Validation
Integrate the fine-tuned LLMs into your existing systems, perform rigorous validation against clinical benchmarks.
Phase 4: Deployment & Continuous Improvement
Deploy the specialized AI agents, monitor performance, and iterate based on real-world feedback.
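Phase 2's LoRA-based fine-tuning can be configured with the Hugging Face PEFT library. The base checkpoint name and hyperparameters below are illustrative assumptions, not the paper's exact training recipe:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical base checkpoint and LoRA hyperparameters -- adjust to your setup.
base = AutoModelForCausalLM.from_pretrained("OpenLLM-Ro/RoMistral-7b-Instruct")
lora = LoraConfig(
    r=16,                 # rank of the low-rank update matrices
    lora_alpha=32,        # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Because only the small adapter matrices are trained, this approach keeps fine-tuning affordable on a single GPU while leaving the base model's weights frozen.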
Ready to Transform Your Clinical AI Strategy?
Book a complimentary strategy session to explore how fine-tuned LLMs can revolutionize medical question answering in your enterprise.