Enterprise AI Analysis
Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution
Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To perform biomedical evidence attribution efficiently, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5, while also producing high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind use case study that quantifies hallucinations in LLM-generated answers under different citation instructions. Results show that the citation format instruction strongly affects citation validity and hallucination, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. Additionally, we present a second use case showing that Med-V1 can automatically identify high-stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale. Overall, Med-V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical, real-world applications in biomedical evidence attribution and verification tasks. Med-V1 is available at https://github.com/ncbi-nlp/Med-V1.
Executive Impact Summary
Med-V1 represents a significant leap in biomedical evidence attribution, offering a family of small language models (3 billion parameters) that rival the performance of much larger, expensive frontier LLMs like GPT-5. By leveraging high-quality synthetic data (MedFact-Synth) for training, Med-V1 achieves accuracy improvements of 27.0% to 71.3% over its base models across five biomedical benchmarks. Its capabilities extend to crucial enterprise applications such as detecting LLM hallucinations in AI-generated content and identifying high-stakes evidence misattributions in clinical practice guidelines, providing both structured verdicts and natural-language explanations. This makes Med-V1 an efficient, scalable, and cost-effective solution for ensuring factual consistency and citation integrity in high-stakes biomedical domains.
Deep Analysis & Enterprise Applications
Med-V1 is built on 3-billion-parameter LLMs (Llama-3.2-3B-Instruct and Qwen2.5-3B-Instruct) and trained using a novel two-stage post-training procedure: supervised fine-tuning (SFT) followed by reinforcement learning (RL). A key innovation is the creation of MedFact-Synth, a large-scale synthetic dataset of 1.5 million claim-article pairs with 5-point Likert scale veracity labels and natural-language rationales, generated by a panel of frontier LLMs. This dataset ensures diverse and high-quality supervision, enabling Med-V1 to achieve strong verification capabilities despite its compact size.
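To make the MedFact-Synth supervision concrete, the sketch below shows what a claim-article training record with a 5-point Likert veracity label might look like. The field names and the Likert-to-verdict mapping are illustrative assumptions, not the released schema.

```python
# Illustrative MedFact-Synth-style record: a claim, its cited article,
# a 5-point Likert veracity label, and a frontier-LLM rationale.
# Field names and the label mapping below are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class ClaimArticlePair:
    claim: str          # biomedical assertion to verify
    article: str        # text of the cited source (e.g., abstract)
    likert_label: int   # 1 (contradicts) .. 5 (supports)
    rationale: str      # natural-language explanation of the label


def likert_to_verdict(label: int) -> str:
    """Collapse the 5-point veracity scale into a coarse verdict."""
    if label >= 4:
        return "supported"
    if label <= 2:
        return "contradicted"
    return "neutral"


example = ClaimArticlePair(
    claim="Drug X reduces systolic blood pressure in adults.",
    article="In a randomized trial, Drug X lowered systolic BP by 8 mmHg.",
    likert_label=5,
    rationale="The trial directly measured the claimed outcome and found a reduction.",
)
print(likert_to_verdict(example.likert_label))  # supported
```

Pairing a graded label with a rationale is what lets the SFT stage teach both the verdict and the step-by-step explanation.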
Evaluated on MedFact-Bench, a benchmark comprising five biomedical verification datasets (SciFact, HealthVer, MedAESQA, PubMedQA-Fact, BioASQ-Fact), Med-V1 demonstrates substantial performance improvements. It achieves +27.0% to +71.3% higher accuracy than its base models and performs comparably to frontier LLMs such as GPT-5, with average accuracies ranging from 0.728 to 0.732. Closing this performance gap shows that Med-V1 can provide frontier-level verification in a lightweight, efficient package, even in a strict zero-shot setting.
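The headline number is a macro-average over the five MedFact-Bench datasets; the sketch below shows that computation with placeholder per-dataset accuracies (not the paper's reported values).

```python
# Macro-average accuracy over the five MedFact-Bench datasets.
# The per-dataset numbers are placeholders, not reported results.
benchmarks = {
    "SciFact": 0.75,
    "HealthVer": 0.70,
    "MedAESQA": 0.72,
    "PubMedQA-Fact": 0.74,
    "BioASQ-Fact": 0.73,
}

# Each dataset contributes equally, regardless of its size.
average_accuracy = sum(benchmarks.values()) / len(benchmarks)
print(round(average_accuracy, 3))
```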
A detailed error analysis reveals that a significant portion of Med-V1's 'errors' (71% for the Llama-3.2-3B variant, 66% for the Qwen2.5-3B variant) stem from dataset quality issues, such as incorrect ground-truth labels or poorly formulated claims, rather than true model failures. This indicates Med-V1's robust reasoning capabilities. The model consistently produces high-quality, step-by-step natural-language explanations for its predictions, a critical feature for transparency and trust in high-stakes biomedical applications that distinguishes it from black-box classifiers.
Med-V1's practical utility is demonstrated through two key use cases. First, it quantifies hallucination rates in LLM-generated answers under various citation instructions, revealing how citation format affects validity and hallucination. Second, it identifies high-stakes misattributions in clinical practice guidelines at scale, flagging cases where cited sources do not support claims, particularly concerning treatment effectiveness and risk/etiology. These applications underscore Med-V1's value as reusable infrastructure for auditing AI outputs and critical medical documents.
Enterprise Process Flow
| Feature | Frontier LLMs (e.g., GPT-5) | Med-V1 (3B Parameters) |
|---|---|---|
| Parameter Count | Far larger (exact size undisclosed) | 3 billion |
| Average Accuracy (MedFact-Bench) | Comparable to Med-V1 | 0.728–0.732 |
| Cost/Scalability | Prohibitively expensive to deploy at scale | Lightweight and efficient at scale |
| Explanation Quality | High-quality explanations | High-quality, step-by-step natural-language explanations |
| Zero-shot Capability | Yes | Yes (strict zero-shot setting) |
Quantifying LLM Hallucinations in Citation Generation
Med-V1's analysis of GPT-4o and GPT-5 generated answers under various citation instructions reveals critical insights into LLM hallucination:
- Claim Volume: GPT-5 generates significantly more claims (18.6-36.3 per answer) compared to GPT-4o (5.1-7.4).
- Hallucination Rates: For standard citation formats (NLM, AMA, Vancouver, APA, MLA), both models show similar hallucination rates (42.8-55.8% for GPT-4o, 44.9-53.0% for GPT-5).
- Direct Identifiers: Direct PMID citations lead to extreme hallucinations (96.3% for GPT-4o, 85.7% for GPT-5).
- DOI Improvement: GPT-5 shows improved DOI memorization with lower hallucination (47.5%) compared to GPT-4o (>80%).
- Supported Claims: Both models generate fewer supported claims than human experts (0.2-3.2 for GPT-4o, 2.6-8.3 for GPT-5 vs. 10.3 for humans).
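The rates above reduce to a simple metric: the fraction of generated claims whose cited source does not actually support them. The sketch below is a minimal, runnable version; the verdict labels are illustrative, not Med-V1's exact output format.

```python
# Hallucination rate: unsupported claims / total claims in an answer.
# Verdict strings are illustrative stand-ins for Med-V1 outputs.
def hallucination_rate(verdicts: list) -> float:
    """verdicts: one entry per claim, e.g. 'supported' or 'unsupported'."""
    if not verdicts:
        return 0.0
    unsupported = sum(1 for v in verdicts if v != "supported")
    return unsupported / len(verdicts)


# e.g., 6 claims from one generated answer, 3 backed by their citations
verdicts = ["supported", "unsupported", "supported",
            "unsupported", "supported", "unsupported"]
print(hallucination_rate(verdicts))  # 0.5
```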
Identifying High-Stakes Misattributions in Clinical Practice Guidelines
Med-V1 applied to 57,000 statement-source pairs from clinical guidelines identified a non-trivial 5% (3% partial, 2% strong) contradiction rate. Manual validation of 100 flagged cases confirmed 28 genuine misattributions. The most critical domains affected were effectiveness of treatment (12 cases) and risk/etiology (7 cases). These misattributions, often involving incorrect statistical reporting or misrepresentation of associations, carry potential public health impacts by informing flawed medical decisions. Med-V1 provides an efficient way to audit such critical documents at scale.
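At this scale, the audit is essentially a batch verification loop that flags contradicted pairs for human review. The sketch below illustrates that loop; `verify` is a keyword-matching stub standing in for an actual Med-V1 call, and the pairs are invented examples.

```python
# Batch audit of guideline statement-source pairs. `verify` is a stub
# standing in for a Med-V1 inference call; a real pipeline would query
# the model and parse its structured verdict.
def verify(statement: str, source: str) -> str:
    # Placeholder heuristic, NOT the model's logic.
    return "contradicted" if "not" in source.lower() else "supported"


pairs = [
    ("Treatment A improves survival.",
     "Treatment A did not improve survival in the trial."),
    ("Vaccination reduces disease incidence.",
     "Vaccination reduced incidence by 40% in the cohort."),
]

# Only flagged pairs go on to manual validation, as in the study.
flagged = [(stmt, src) for stmt, src in pairs
           if verify(stmt, src) == "contradicted"]
print(len(flagged))  # pairs queued for expert review
```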
Advanced ROI Calculator
Understanding the return on investment for AI solutions is crucial. Med-V1 provides tangible benefits by automating complex verification tasks, freeing up valuable human resources, reducing the risk of costly errors, and improving accuracy in high-stakes biomedical contexts.
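A back-of-envelope version of that calculation is sketched below. Every input is a hypothetical placeholder to be replaced with your organization's own figures; nothing here comes from the paper.

```python
# Hypothetical ROI sketch: manual review cost avoided minus model cost.
# All numbers are placeholders, not measured values.
claims_per_month = 10_000           # claims needing verification
minutes_per_manual_check = 6        # expert time per claim
hourly_expert_cost = 90.0           # USD, hypothetical
monthly_model_cost = 500.0          # hosting a 3B model, hypothetical

manual_cost = (claims_per_month * minutes_per_manual_check / 60
               * hourly_expert_cost)
monthly_savings = manual_cost - monthly_model_cost
print(round(monthly_savings, 2))
```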
Your Implementation Roadmap
Our phased implementation approach ensures a smooth integration of Med-V1 into your existing enterprise workflows, maximizing impact while minimizing disruption and accelerating your journey to AI-driven factual verification.
Discovery & Integration Planning
Assess current systems, define verification needs, and plan API integration strategies to seamlessly embed Med-V1 into your existing infrastructure.
Data & Workflow Configuration
Adapt Med-V1 for specific data sources, fine-tune prompts for your unique requirements, and configure automated workflows for optimal performance.
Pilot Deployment & Validation
Deploy Med-V1 in a controlled pilot environment, rigorously validate results against human expert judgments, and gather feedback for iterative refinement.
Full-Scale Rollout & Optimization
Expand Med-V1 across your organization, continuously monitor performance, and implement further optimizations to ensure sustained accuracy and efficiency.
Ready to Transform Your Verification Process?
Unlock unparalleled accuracy and efficiency in biomedical evidence attribution. Let's discuss how Med-V1 can drive innovation in your enterprise.