
Enterprise AI Analysis

From setting to vetting: using artificial intelligence for Single Best Answer questions review

Authors: Olivia Ng, Siew Ping Han, Magdalene Hui Min Lee, Dong Haur Phua

Abstract: Maintaining the quality of Single Best Answer (SBA) questions remains a challenge in medical education, especially as artificial intelligence (AI)-generated items become more common. While considerable attention has been paid to AI question generation, the vetting process is under-explored and difficult to scale. This study investigates the feasibility and reliability of using a large language model to support the vetting of SBA questions. An AI-based reviewer, QA-bot, was developed as a custom GPT and embedded with 25 criteria aligned with Bloom's taxonomy (Levels 1-3). QA-bot and two experienced educators independently evaluated 32 AI-generated SBA questions using a shared evaluation rubric. The rubric showed high internal consistency (Cronbach's alpha = 0.878) and strong inter-rater reliability between human reviewers (intraclass correlation coefficient [ICC] = 0.893). QA-bot demonstrated good alignment with human raters (ICC = 0.861 and 0.840). While the AI performed well on objective, rule-based criteria, it was less consistent in detecting irrelevant complexity and accurately judging difficulty. These findings suggest that AI can function as an efficient first-pass reviewer, improving consistency and reducing workload, with human oversight remaining essential for educational and clinical relevance.

Executive Impact

Streamlining Medical Assessment Vetting with AI

This study explores the potential of AI, specifically Large Language Models (LLMs), to enhance the review and validation of Single Best Answer (SBA) questions in medical education. It demonstrates that an AI-powered QA-bot can significantly improve efficiency and consistency in vetting by acting as a first-pass reviewer, while underscoring the continued necessity of human oversight for nuanced judgment and clinical relevance.

0.878 Rubric Internal Consistency (Cronbach's alpha)
0.893 Human Inter-Rater Reliability (ICC)
0.85 AI-Human Alignment (Avg ICC)
~88% Efficiency Gain (Time Reduction)

Deep Analysis & Enterprise Applications

The topics below walk through the specific findings from the research, rebuilt as enterprise-focused modules.

Introduction & Problem
Methodology
Key Findings
Discussion & Implications
Limitations & Future Work

Significant Challenge in SBA Quality

Maintaining the quality of Single Best Answer (SBA) questions remains a significant challenge in medical education, especially as AI-generated items become more common. The vetting process is under-explored and difficult to scale, potentially undermining assessment validity.

Enterprise Process Flow

1. SBA Question Creation (32 AI-generated questions)
2. Rubric Development (25 criteria, Bloom's Levels 1-3)
3. Human Reviewers (2 experts, reviewing independently)
4. AI QA-Bot Development (custom GPT-4o)
5. Independent Evaluation & Data Analysis

The study investigated the feasibility and reliability of using a large language model to support the vetting of SBA questions. An AI-based reviewer, QA-bot, was developed as a custom GPT (GPT-4o) and embedded with 25 criteria aligned with Bloom's taxonomy (Levels 1-3). The bot and two experienced medical educators independently evaluated 32 AI-generated SBA questions using a shared evaluation rubric.
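The paper does not publish QA-bot's implementation (it was built as a custom GPT rather than through API code). As a rough sketch of how a rubric-driven reviewer could be wired up, the Python below scores one SBA item against a handful of criteria via a chat-completion call; the criteria, prompt wording, and vet_sba helper are illustrative assumptions, not the authors' actual configuration.

```python
# Illustrative sketch only: the study's QA-bot was a custom GPT, not this
# API code. Criteria and prompt wording are assumptions for demonstration.
import json
from openai import OpenAI  # assumes the official openai client is installed

# A few example rubric criteria (the study used 25, aligned with Bloom's Levels 1-3).
CRITERIA = [
    "The stem asks a single, clearly focused question.",
    "Exactly one option is unambiguously the best answer.",
    "Distractors are plausible and homogeneous in length and structure.",
    "The item avoids negatively phrased stems (e.g., 'all EXCEPT').",
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def vet_sba(question: str) -> dict:
    """Ask the model to score one SBA item against each rubric criterion."""
    prompt = (
        "You are a medical-education item reviewer. For each criterion, "
        "return JSON: {\"results\": [{\"criterion\": ..., \"pass\": true/false, "
        "\"comment\": ...}]}.\n\nCriteria:\n"
        + "\n".join(f"- {c}" for c in CRITERIA)
        + f"\n\nQuestion:\n{question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force parseable output
    )
    return json.loads(response.choices[0].message.content)
```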

Performance Aspect | Human Reviewers | AI QA-Bot
Rubric Internal Consistency | High (Cronbach's alpha = 0.878) | N/A
Human Inter-rater Reliability | Strong (ICC = 0.893) | N/A
AI-Human Agreement | N/A | Good (ICC = 0.861 and 0.840)
Objective Criteria Vetting | Consistent | Performed well
Subjective Judgment (e.g., difficulty) | Consistent, contextual | Less consistent; misjudged difficulty
Evaluation Speed (32 questions) | 2-3 hours | <15 minutes

The evaluation rubric showed high internal consistency, and the human reviewers showed strong inter-rater reliability. QA-bot aligned well with human raters on objective criteria but was less consistent in detecting irrelevant complexity and judging difficulty. Crucially, QA-bot completed the evaluation in under 15 minutes versus 2-3 hours for the human reviewers, an approximate 88% time reduction for this task.
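The reported statistics are standard reliability measures. As a minimal sketch (the paper does not specify which ICC model it used; ICC(2,1) is shown here as one common choice), the code below computes Cronbach's alpha and an ICC on made-up placeholder ratings with the study's dimensions (32 questions, 25 criteria, 2 raters):

```python
# Minimal sketch of the reported reliability statistics, numpy only.
# The ratings below are made-up placeholders, not the study's data.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: shape (n_questions, n_criteria); standard alpha formula."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    ratings: shape (n_subjects, n_raters)."""
    n, k = ratings.shape
    grand = ratings.mean()
    ms_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    sse = ((ratings - ratings.mean(axis=1, keepdims=True)
            - ratings.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    ms_err = sse / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(32, 25))   # 32 questions x 25 criteria
print("alpha:", cronbach_alpha(scores.astype(float)))
raters = rng.integers(1, 6, size=(32, 2))    # 32 questions x 2 raters
print("ICC(2,1):", icc_2_1(raters.astype(float)))
```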

Complementary Roles of AI in Assessment Quality Assurance

This study highlights that while AI tools excel at objective, rule-based tasks (functioning as an efficient first-pass reviewer), human judgment remains essential for evaluating contextual relevance, pedagogical appropriateness, and clinical validity. The significant efficiency gains from AI can improve scalability and reduce resource demands, positioning AI as a valuable aid rather than a replacement in the assessment development process. Thoughtful integration will streamline item review and improve consistency, with human oversight ensuring high-quality assessment standards.

Context: AI can function as an efficient first-pass reviewer, improving consistency and reducing workload. However, human oversight remains essential for educational and clinical relevance, particularly for nuanced judgments like cognitive complexity and question relevance where AI showed limitations.

Aspect | Details & Impact
Evaluation Scope | Focused exclusively on AI-generated questions in selected clinical domains, which limits generalizability to other subject areas and question types.
Reviewer Pool | Two human reviewers with differing experience levels, which may have contributed to some variability in judgment; expanding the reviewer pool would provide a broader basis for comparison.
Rubric Depth | The rubric may not fully capture the subtleties of educational intent or interpretive reasoning; ongoing refinement of AI tools, improved rubric alignment, and domain-specific training could enhance handling of more complex evaluative tasks.

The study acknowledged limitations, including its focus on AI-generated questions in specific domains, the small size and varying experience of the human reviewer pool, and the possibility that the rubric did not fully capture all nuances of expert judgment. These areas suggest avenues for future research and refinement of AI tools.

ROI Calculator

Calculate Your Potential Savings

Estimate the time and cost savings your institution could achieve by integrating AI into your SBA question vetting process.

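The interactive calculator does not survive in text form, but its arithmetic is straightforward. The sketch below reproduces it using the study's timings (roughly 2-3 hours of human review versus under 15 minutes of AI review per batch of 32 questions); the annual question volume and hourly rate are hypothetical inputs you would replace with your own figures.

```python
# Back-of-envelope ROI sketch based on the study's timings: ~2-3 hours of
# human review vs. <15 minutes of AI review per batch of 32 questions.
# Question volume and hourly rate below are hypothetical inputs.

QUESTIONS_PER_YEAR = 1000           # hypothetical institutional volume
HUMAN_MIN_PER_QUESTION = 150 / 32   # midpoint of 2-3 h (150 min) per 32 items
AI_MIN_PER_QUESTION = 15 / 32       # <15 min per 32 items
REVIEWER_HOURLY_RATE = 80.0         # hypothetical fully loaded cost (USD)

human_hours = QUESTIONS_PER_YEAR * HUMAN_MIN_PER_QUESTION / 60
ai_hours = QUESTIONS_PER_YEAR * AI_MIN_PER_QUESTION / 60
hours_reclaimed = human_hours - ai_hours   # human oversight still applies
savings = hours_reclaimed * REVIEWER_HOURLY_RATE

print(f"Annual hours reclaimed: {hours_reclaimed:,.0f}")
print(f"Annual cost savings:    ${savings:,.0f}")
```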

Implementation Plan

AI Integration Roadmap for Assessment QA

A structured approach ensures successful adoption and maximum benefit from AI-powered assessment quality assurance.

Phase 1: Pilot & Rubric Alignment

Integrate a QA-bot with existing item-writing guidelines and pilot it on a subset of questions. Calibrate AI judgments against expert human reviewers to refine the rubric and AI's understanding.

Phase 2: Scaled First-Pass Review

Deploy AI as a first-pass reviewer for a larger volume of SBA questions, focusing on objective criteria like clarity, grammar, and adherence to structural standards, significantly reducing initial human workload.
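To make "objective criteria" concrete, the following is a hypothetical sketch of deterministic structural checks that could run before any LLM call; the specific rules (option count, length balance, negative stems) are illustrative and are not the study's 25-criterion rubric.

```python
# Hypothetical deterministic first-pass checks that could run before any
# LLM review. The specific rules are illustrative, not the study's rubric.
from dataclasses import dataclass, field

@dataclass
class SBAItem:
    stem: str
    options: list[str]
    answer_index: int
    flags: list[str] = field(default_factory=list)

def first_pass(item: SBAItem) -> list[str]:
    """Return structural flags for a human or LLM reviewer to follow up on."""
    if len(item.options) != 5:
        item.flags.append(f"expected 5 options, found {len(item.options)}")
    if not (0 <= item.answer_index < len(item.options)):
        item.flags.append("answer key points outside the option list")
    lengths = [len(o) for o in item.options]
    if lengths and max(lengths) > 2 * min(lengths):
        item.flags.append("option lengths unbalanced (longest > 2x shortest)")
    lowered = [o.lower() for o in item.options]
    if any("all of the above" in o or "none of the above" in o for o in lowered):
        item.flags.append("contains 'all/none of the above' option")
    if "except" in item.stem.lower():
        item.flags.append("possible negatively phrased stem ('except')")
    return item.flags
```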

Phase 3: Human-AI Collaborative Vetting

Establish a workflow where human experts focus on nuanced aspects such as contextual relevance, cognitive complexity, and clinical validity, leveraging AI to surface potential issues and inconsistencies for focused review.
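One plausible shape for that workflow is sketched below, with hypothetical routing rules: items the bot passes cleanly get light-touch sign-off, flagged items get targeted review, and anything touching the subjective criteria where the study found AI least reliable (difficulty, relevance) always goes to a human expert. The criterion names and thresholds are illustrative.

```python
# Hypothetical triage for human-AI collaborative vetting. Criterion names
# and thresholds are illustrative; per the study's findings, subjective
# judgments should always reach a human reviewer.

SUBJECTIVE = {"appropriate difficulty", "irrelevant complexity", "clinical relevance"}

def route(ai_results: list[dict]) -> str:
    """ai_results: one {'criterion': str, 'pass': bool} per rubric criterion,
    e.g. the output of the vet_sba() sketch above."""
    failures = [r["criterion"] for r in ai_results if not r["pass"]]
    if any(c in SUBJECTIVE for c in failures):
        return "full expert review"      # AI is least reliable on these
    if failures:
        return "targeted expert review"  # check only the flagged criteria
    return "light-touch sign-off"        # confirm the AI pass, minimal time
```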

Phase 4: Continuous Improvement & Domain Expansion

Regularly update AI models with new data and feedback, expanding its application to diverse subject areas and question types. Monitor performance and gather feedback for ongoing refinement of the AI-powered vetting process.

Ready to Transform Your Assessment Vetting Process?

Book Your Free Consultation.
