Enterprise AI Analysis
Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution
Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To perform biomedical evidence attribution efficiently, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5, while also producing high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind use case study that quantifies hallucinations in LLM-generated answers under different citation instructions. Results show that the citation format instruction strongly affects citation validity and hallucination, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. Additionally, we present a second use case showing that Med-V1 can automatically identify high-stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale. Overall, Med-V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical, real-world applications in biomedical evidence attribution and verification tasks. Med-V1 is available at https://github.com/ncbi-nlp/Med-V1.
Executive Impact Summary
Med-V1 represents a significant leap in biomedical evidence attribution, offering a family of small language models (3 billion parameters) that rival the performance of much larger, expensive frontier LLMs like GPT-5. By leveraging high-quality synthetic data (MedFact-Synth) for training, Med-V1 achieves accuracy improvements of 27.0% to 71.3% over its base models across five biomedical benchmarks. Its capabilities extend to crucial enterprise applications such as detecting LLM hallucinations in AI-generated content and identifying high-stakes evidence misattributions in clinical practice guidelines, providing both structured verdicts and natural-language explanations. This makes Med-V1 an efficient, scalable, and cost-effective solution for ensuring factual consistency and citation integrity in high-stakes biomedical domains.
Deep Analysis & Enterprise Applications
Med-V1 is built on 3-billion-parameter LLMs (Llama-3.2-3B-Instruct and Qwen2.5-3B-Instruct) and trained using a novel two-stage post-training procedure: supervised fine-tuning (SFT) followed by reinforcement learning (RL). A key innovation is the creation of MedFact-Synth, a large-scale synthetic dataset of 1.5 million claim-article pairs with 5-point Likert scale veracity labels and natural-language rationales, generated by a panel of frontier LLMs. This dataset ensures diverse and high-quality supervision, enabling Med-V1 to achieve strong verification capabilities despite its compact size.
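To make the MedFact-Synth supervision concrete, the sketch below shows what a claim-article training record with a 5-point Likert veracity label might look like. The field names and the Likert-to-verdict mapping are illustrative assumptions, not the released schema.

```python
# Illustrative MedFact-Synth-style record: a claim, its cited article,
# a 5-point Likert veracity label, and a frontier-LLM rationale.
# Field names and the label mapping below are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class ClaimArticlePair:
    claim: str          # biomedical assertion to verify
    article: str        # text of the cited source (e.g., abstract)
    likert_label: int   # 1 (contradicts) .. 5 (supports)
    rationale: str      # natural-language explanation of the label


def likert_to_verdict(label: int) -> str:
    """Collapse the 5-point veracity scale into a coarse verdict."""
    if label >= 4:
        return "supported"
    if label <= 2:
        return "contradicted"
    return "neutral"


example = ClaimArticlePair(
    claim="Drug X reduces systolic blood pressure in adults.",
    article="In a randomized trial, Drug X lowered systolic BP by 8 mmHg.",
    likert_label=5,
    rationale="The trial directly measured the claimed outcome and found a reduction.",
)
print(likert_to_verdict(example.likert_label))  # supported
```

Pairing a graded label with a rationale is what lets the SFT stage teach both the verdict and the step-by-step explanation.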
Evaluated on MedFact-Bench, a benchmark comprising five biomedical verification datasets (SciFact, HealthVer, MedAESQA, PubMedQA-Fact, BioASQ-Fact), Med-V1 demonstrates substantial performance improvements. It achieves +27.0% to +71.3% higher accuracy than its base models and performs comparably to frontier LLMs such as GPT-5, with average accuracies ranging from 0.728 to 0.732. Closing this performance gap shows that Med-V1 can provide frontier-level verification in a lightweight, efficient package, even in a strict zero-shot setting.
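The headline number is a macro-average over the five MedFact-Bench datasets; the sketch below shows that computation with placeholder per-dataset accuracies (not the paper's reported values).

```python
# Macro-average accuracy over the five MedFact-Bench datasets.
# The per-dataset numbers are placeholders, not reported results.
benchmarks = {
    "SciFact": 0.75,
    "HealthVer": 0.70,
    "MedAESQA": 0.72,
    "PubMedQA-Fact": 0.74,
    "BioASQ-Fact": 0.73,
}

# Each dataset contributes equally, regardless of its size.
average_accuracy = sum(benchmarks.values()) / len(benchmarks)
print(round(average_accuracy, 3))
```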
A detailed error analysis reveals that a significant portion of Med-V1's 'errors' (71% for the Llama-3.2-3B variant, 66% for the Qwen2.5-3B variant) stem from dataset quality issues, such as incorrect ground-truth labels or poorly formulated claims, rather than true model failures. This indicates Med-V1's robust reasoning capabilities. The model consistently produces high-quality, step-by-step natural-language explanations for its predictions, a critical feature for transparency and trust in high-stakes biomedical applications that distinguishes it from black-box classifiers.
Med-V1's practical utility is demonstrated through two key use cases. First, it quantifies hallucination rates in LLM-generated answers under various citation instructions, revealing how citation format affects validity and hallucination. Second, it identifies high-stakes misattributions in clinical practice guidelines at scale, flagging cases where cited sources do not support claims, particularly concerning treatment effectiveness and risk/etiology. These applications underscore Med-V1's value as reusable infrastructure for auditing AI outputs and critical medical documents.
Enterprise Process Flow
| Feature | Frontier LLMs (e.g., GPT-5) | Med-V1 (3B Parameters) |
|---|---|---|
| Parameter Count | Far larger (exact size undisclosed) | 3 billion |
| Average Accuracy (MedFact-Bench) | Comparable to Med-V1 | 0.728–0.732 |
| Cost/Scalability | Prohibitively expensive to deploy at scale | Lightweight and efficient at scale |
| Explanation Quality | High-quality explanations | High-quality, step-by-step natural-language explanations |
| Zero-shot Capability | Yes | Yes (strict zero-shot setting) |
Quantifying LLM Hallucinations in Citation Generation
Med-V1's analysis of GPT-4o and GPT-5 generated answers under various citation instructions reveals critical insights into LLM hallucination:
- Claim Volume: GPT-5 generates significantly more claims (18.6-36.3 per answer) compared to GPT-4o (5.1-7.4).
- Hallucination Rates: For standard citation formats (NLM, AMA, Vancouver, APA, MLA), both models show similar hallucination rates (42.8-55.8% for GPT-4o, 44.9-53.0% for GPT-5).
- Direct Identifiers: Direct PMID citations lead to extreme hallucinations (96.3% for GPT-4o, 85.7% for GPT-5).
- DOI Improvement: GPT-5 shows improved DOI memorization with lower hallucination (47.5%) compared to GPT-4o (>80%).
- Supported Claims: Both models generate fewer supported claims than human experts (0.2-3.2 for GPT-4o, 2.6-8.3 for GPT-5 vs. 10.3 for humans).
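The rates above reduce to a simple metric: the fraction of generated claims whose cited source does not actually support them. The sketch below is a minimal, runnable version; the verdict labels are illustrative, not Med-V1's exact output format.

```python
# Hallucination rate: unsupported claims / total claims in an answer.
# Verdict strings are illustrative stand-ins for Med-V1 outputs.
def hallucination_rate(verdicts: list) -> float:
    """verdicts: one entry per claim, e.g. 'supported' or 'unsupported'."""
    if not verdicts:
        return 0.0
    unsupported = sum(1 for v in verdicts if v != "supported")
    return unsupported / len(verdicts)


# e.g., 6 claims from one generated answer, 3 backed by their citations
verdicts = ["supported", "unsupported", "supported",
            "unsupported", "supported", "unsupported"]
print(hallucination_rate(verdicts))  # 0.5
```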
Identifying High-Stakes Misattributions in Clinical Practice Guidelines
Med-V1 applied to 57,000 statement-source pairs from clinical guidelines identified a non-trivial 5% (3% partial, 2% strong) contradiction rate. Manual validation of 100 flagged cases confirmed 28 genuine misattributions. The most critical domains affected were effectiveness of treatment (12 cases) and risk/etiology (7 cases). These misattributions, often involving incorrect statistical reporting or misrepresentation of associations, carry potential public health impacts by informing flawed medical decisions. Med-V1 provides an efficient way to audit such critical documents at scale.
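At this scale, the audit is essentially a batch verification loop that flags contradicted pairs for human review. The sketch below illustrates that loop; `verify` is a keyword-matching stub standing in for an actual Med-V1 call, and the pairs are invented examples.

```python
# Batch audit of guideline statement-source pairs. `verify` is a stub
# standing in for a Med-V1 inference call; a real pipeline would query
# the model and parse its structured verdict.
def verify(statement: str, source: str) -> str:
    # Placeholder heuristic, NOT the model's logic.
    return "contradicted" if "not" in source.lower() else "supported"


pairs = [
    ("Treatment A improves survival.",
     "Treatment A did not improve survival in the trial."),
    ("Vaccination reduces disease incidence.",
     "Vaccination reduced incidence by 40% in the cohort."),
]

# Only flagged pairs go on to manual validation, as in the study.
flagged = [(stmt, src) for stmt, src in pairs
           if verify(stmt, src) == "contradicted"]
print(len(flagged))  # pairs queued for expert review
```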
Advanced ROI Calculator
Understanding the return on investment for AI solutions is crucial. Med-V1 provides tangible benefits by automating complex verification tasks, freeing up valuable human resources, reducing the risk of costly errors, and improving accuracy in high-stakes biomedical contexts.
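A back-of-envelope version of that calculation is sketched below. Every input is a hypothetical placeholder to be replaced with your organization's own figures; nothing here comes from the paper.

```python
# Hypothetical ROI sketch: manual review cost avoided minus model cost.
# All numbers are placeholders, not measured values.
claims_per_month = 10_000           # claims needing verification
minutes_per_manual_check = 6        # expert time per claim
hourly_expert_cost = 90.0           # USD, hypothetical
monthly_model_cost = 500.0          # hosting a 3B model, hypothetical

manual_cost = (claims_per_month * minutes_per_manual_check / 60
               * hourly_expert_cost)
monthly_savings = manual_cost - monthly_model_cost
print(round(monthly_savings, 2))
```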
Your Implementation Roadmap
Our phased implementation approach ensures a smooth integration of Med-V1 into your existing enterprise workflows, maximizing impact while minimizing disruption and accelerating your journey to AI-driven factual verification.
Discovery & Integration Planning
Assess current systems, define verification needs, and plan API integration strategies to seamlessly embed Med-V1 into your existing infrastructure.
Data & Workflow Configuration
Adapt Med-V1 for specific data sources, fine-tune prompts for your unique requirements, and configure automated workflows for optimal performance.
Pilot Deployment & Validation
Deploy Med-V1 in a controlled pilot environment, rigorously validate results against human expert judgments, and gather feedback for iterative refinement.
Full-Scale Rollout & Optimization
Expand Med-V1 across your organization, continuously monitor performance, and implement further optimizations to ensure sustained accuracy and efficiency.
Ready to Transform Your Verification Process?
Unlock unparalleled accuracy and efficiency in biomedical evidence attribution. Let's discuss how Med-V1 can drive innovation in your enterprise.