
Enterprise AI Analysis

From setting to vetting: using artificial intelligence for Single Best Answer questions review

Authors: Olivia Ng, Siew Ping Han, Magdalene Hui Min Lee, Dong Haur Phua

Abstract: Maintaining the quality of Single Best Answer (SBA) questions remains a challenge in medical education, especially as artificial intelligence (AI)-generated items become more common. While considerable attention has been paid to AI question generation, the vetting process is under-explored and difficult to scale. This study investigates the feasibility and reliability of using a large language model to support the vetting of SBA questions. An AI-based reviewer, QA-bot, was developed as a custom GPT and embedded with 25 criteria aligned with Bloom's taxonomy (Levels 1-3). QA-bot and two experienced educators independently evaluated 32 AI-generated SBA questions using a shared evaluation rubric. The rubric showed high internal consistency (Cronbach's alpha = 0.878) and strong inter-rater reliability between human reviewers (intraclass correlation coefficient [ICC] = 0.893). QA-bot demonstrated good alignment with human raters (ICC = 0.861 and 0.840). While the AI performed well on objective, rule-based criteria, it was less consistent in detecting irrelevant complexity and accurately judging difficulty. These findings suggest that AI can function as an efficient first-pass reviewer, improving consistency and reducing workload, with human oversight remaining essential for educational and clinical relevance.

Executive Impact

Streamlining Medical Assessment Vetting with AI

This study explores the potential of AI, specifically Large Language Models (LLMs), to enhance the review and validation of Single Best Answer (SBA) questions in medical education. It demonstrates that an AI-powered QA-bot can significantly improve efficiency and consistency in vetting by acting as a first-pass reviewer, while underscoring the continued necessity of human oversight for nuanced judgment and clinical relevance.

0.878 Rubric Internal Consistency (Cronbach's alpha)
0.893 Human Inter-Rater Reliability (ICC)
0.85 AI-Human Alignment (Avg ICC)
~88% Efficiency Gain (Time Reduction)

Deep Analysis & Enterprise Applications

The topics below walk through the specific findings from the research, rebuilt as enterprise-focused modules.

Introduction & Problem
Methodology
Key Findings
Discussion & Implications
Limitations & Future Work

Significant Challenge in SBA Quality

Maintaining the quality of Single Best Answer (SBA) questions remains a significant challenge in medical education, especially as AI-generated items become more common. The vetting process is under-explored and difficult to scale, potentially undermining assessment validity.

Enterprise Process Flow

1. SBA Question Creation (32 AI-generated questions)
2. Rubric Development (25 criteria, Bloom's Levels 1-3)
3. Human Reviewers (2 experts, reviewing independently)
4. AI QA-Bot Development (custom GPT-4o)
5. Independent Evaluation & Data Analysis

The study investigated the feasibility and reliability of using a large language model to support the vetting of SBA questions. An AI-based reviewer, QA-bot, was developed as a custom GPT (GPT-4o) and embedded with 25 criteria aligned with Bloom's taxonomy (Levels 1-3). The bot and two experienced medical educators independently evaluated 32 AI-generated SBA questions using a shared evaluation rubric.
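The paper does not publish QA-bot's implementation (it was built as a custom GPT rather than through API code). As a rough sketch of how a rubric-driven reviewer could be wired up, the Python below scores one SBA item against a handful of criteria via a chat-completion call; the criteria, prompt wording, and vet_sba helper are illustrative assumptions, not the authors' actual configuration.

```python
# Illustrative sketch only: the study's QA-bot was a custom GPT, not this
# API code. Criteria and prompt wording are assumptions for demonstration.
import json
from openai import OpenAI  # assumes the official openai client is installed

# A few example rubric criteria (the study used 25, aligned with Bloom's Levels 1-3).
CRITERIA = [
    "The stem asks a single, clearly focused question.",
    "Exactly one option is unambiguously the best answer.",
    "Distractors are plausible and homogeneous in length and structure.",
    "The item avoids negatively phrased stems (e.g., 'all EXCEPT').",
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def vet_sba(question: str) -> dict:
    """Ask the model to score one SBA item against each rubric criterion."""
    prompt = (
        "You are a medical-education item reviewer. For each criterion, "
        "return JSON: {\"results\": [{\"criterion\": ..., \"pass\": true/false, "
        "\"comment\": ...}]}.\n\nCriteria:\n"
        + "\n".join(f"- {c}" for c in CRITERIA)
        + f"\n\nQuestion:\n{question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force parseable output
    )
    return json.loads(response.choices[0].message.content)
```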

Performance Aspect | Human Reviewers | AI QA-Bot
Rubric Internal Consistency | High (Cronbach's alpha = 0.878) | N/A
Human Inter-rater Reliability | Strong (ICC = 0.893) | N/A
AI-Human Agreement | N/A | Good (ICC = 0.861 and 0.840)
Objective Criteria Vetting | Consistent | Performed well
Subjective Judgment (e.g., difficulty) | Consistent, contextual | Less consistent; misjudged difficulty
Evaluation Speed (32 questions) | 2-3 hours | <15 minutes

The evaluation rubric showed high internal consistency, and the human reviewers showed strong inter-rater reliability. QA-bot aligned well with human raters on objective criteria but was less consistent in detecting irrelevant complexity and judging difficulty. Crucially, QA-bot completed the evaluation in under 15 minutes versus 2-3 hours for the human reviewers, an approximate 88% time reduction for this task.
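The reported statistics are standard reliability measures. As a minimal sketch (the paper does not specify which ICC model it used; ICC(2,1) is shown here as one common choice), the code below computes Cronbach's alpha and an ICC on made-up placeholder ratings with the study's dimensions (32 questions, 25 criteria, 2 raters):

```python
# Minimal sketch of the reported reliability statistics, numpy only.
# The ratings below are made-up placeholders, not the study's data.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: shape (n_questions, n_criteria); standard alpha formula."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    ratings: shape (n_subjects, n_raters)."""
    n, k = ratings.shape
    grand = ratings.mean()
    ms_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    sse = ((ratings - ratings.mean(axis=1, keepdims=True)
            - ratings.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    ms_err = sse / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(32, 25))   # 32 questions x 25 criteria
print("alpha:", cronbach_alpha(scores.astype(float)))
raters = rng.integers(1, 6, size=(32, 2))    # 32 questions x 2 raters
print("ICC(2,1):", icc_2_1(raters.astype(float)))
```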

Complementary Roles of AI in Assessment Quality Assurance

This study highlights that while AI tools excel at objective, rule-based tasks (functioning as an efficient first-pass reviewer), human judgment remains essential for evaluating contextual relevance, pedagogical appropriateness, and clinical validity. The significant efficiency gains from AI can improve scalability and reduce resource demands, positioning AI as a valuable aid rather than a replacement in the assessment development process. Thoughtful integration will streamline item review and improve consistency, with human oversight ensuring high-quality assessment standards.

Context: AI can function as an efficient first-pass reviewer, improving consistency and reducing workload. However, human oversight remains essential for educational and clinical relevance, particularly for nuanced judgments like cognitive complexity and question relevance where AI showed limitations.

Aspect | Details & Impact
Evaluation Scope | Focused exclusively on AI-generated questions in selected clinical domains, which limits generalizability to other subject areas and question types.
Reviewer Pool | Two human reviewers with differing experience levels, which may have contributed to some variability in judgment; expanding the reviewer pool would provide a broader basis for comparison.
Rubric Depth | The rubric may not fully capture the subtleties of educational intent or interpretive reasoning; ongoing refinement of AI tools, improved rubric alignment, and domain-specific training could enhance handling of more complex evaluative tasks.

The study acknowledged limitations, including its focus on AI-generated questions in specific domains, the small size and varying experience of the human reviewer pool, and the possibility that the rubric did not fully capture all nuances of expert judgment. These areas suggest avenues for future research and refinement of AI tools.

ROI Calculator

Calculate Your Potential Savings

Estimate the time and cost savings your institution could achieve by integrating AI into your SBA question vetting process.

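The interactive calculator does not survive in text form, but its arithmetic is straightforward. The sketch below reproduces it using the study's timings (roughly 2-3 hours of human review versus under 15 minutes of AI review per batch of 32 questions); the annual question volume and hourly rate are hypothetical inputs you would replace with your own figures.

```python
# Back-of-envelope ROI sketch based on the study's timings: ~2-3 hours of
# human review vs. <15 minutes of AI review per batch of 32 questions.
# Question volume and hourly rate below are hypothetical inputs.

QUESTIONS_PER_YEAR = 1000           # hypothetical institutional volume
HUMAN_MIN_PER_QUESTION = 150 / 32   # midpoint of 2-3 h (150 min) per 32 items
AI_MIN_PER_QUESTION = 15 / 32       # <15 min per 32 items
REVIEWER_HOURLY_RATE = 80.0         # hypothetical fully loaded cost (USD)

human_hours = QUESTIONS_PER_YEAR * HUMAN_MIN_PER_QUESTION / 60
ai_hours = QUESTIONS_PER_YEAR * AI_MIN_PER_QUESTION / 60
hours_reclaimed = human_hours - ai_hours   # human oversight still applies
savings = hours_reclaimed * REVIEWER_HOURLY_RATE

print(f"Annual hours reclaimed: {hours_reclaimed:,.0f}")
print(f"Annual cost savings:    ${savings:,.0f}")
```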

Implementation Plan

AI Integration Roadmap for Assessment QA

A structured approach ensures successful adoption and maximum benefit from AI-powered assessment quality assurance.

Phase 1: Pilot & Rubric Alignment

Integrate a QA-bot with existing item-writing guidelines and pilot it on a subset of questions. Calibrate AI judgments against expert human reviewers to refine the rubric and AI's understanding.

Phase 2: Scaled First-Pass Review

Deploy AI as a first-pass reviewer for a larger volume of SBA questions, focusing on objective criteria like clarity, grammar, and adherence to structural standards, significantly reducing initial human workload.
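To make "objective criteria" concrete, the following is a hypothetical sketch of deterministic structural checks that could run before any LLM call; the specific rules (option count, length balance, negative stems) are illustrative and are not the study's 25-criterion rubric.

```python
# Hypothetical deterministic first-pass checks that could run before any
# LLM review. The specific rules are illustrative, not the study's rubric.
from dataclasses import dataclass, field

@dataclass
class SBAItem:
    stem: str
    options: list[str]
    answer_index: int
    flags: list[str] = field(default_factory=list)

def first_pass(item: SBAItem) -> list[str]:
    """Return structural flags for a human or LLM reviewer to follow up on."""
    if len(item.options) != 5:
        item.flags.append(f"expected 5 options, found {len(item.options)}")
    if not (0 <= item.answer_index < len(item.options)):
        item.flags.append("answer key points outside the option list")
    lengths = [len(o) for o in item.options]
    if lengths and max(lengths) > 2 * min(lengths):
        item.flags.append("option lengths unbalanced (longest > 2x shortest)")
    lowered = [o.lower() for o in item.options]
    if any("all of the above" in o or "none of the above" in o for o in lowered):
        item.flags.append("contains 'all/none of the above' option")
    if "except" in item.stem.lower():
        item.flags.append("possible negatively phrased stem ('except')")
    return item.flags
```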

Phase 3: Human-AI Collaborative Vetting

Establish a workflow where human experts focus on nuanced aspects such as contextual relevance, cognitive complexity, and clinical validity, leveraging AI to surface potential issues and inconsistencies for focused review.
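One plausible shape for that workflow is sketched below, with hypothetical routing rules: items the bot passes cleanly get light-touch sign-off, flagged items get targeted review, and anything touching the subjective criteria where the study found AI least reliable (difficulty, relevance) always goes to a human expert. The criterion names and thresholds are illustrative.

```python
# Hypothetical triage for human-AI collaborative vetting. Criterion names
# and thresholds are illustrative; per the study's findings, subjective
# judgments should always reach a human reviewer.

SUBJECTIVE = {"appropriate difficulty", "irrelevant complexity", "clinical relevance"}

def route(ai_results: list[dict]) -> str:
    """ai_results: one {'criterion': str, 'pass': bool} per rubric criterion,
    e.g. the output of the vet_sba() sketch above."""
    failures = [r["criterion"] for r in ai_results if not r["pass"]]
    if any(c in SUBJECTIVE for c in failures):
        return "full expert review"      # AI is least reliable on these
    if failures:
        return "targeted expert review"  # check only the flagged criteria
    return "light-touch sign-off"        # confirm the AI pass, minimal time
```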

Phase 4: Continuous Improvement & Domain Expansion

Regularly update AI models with new data and feedback, expanding its application to diverse subject areas and question types. Monitor performance and gather feedback for ongoing refinement of the AI-powered vetting process.

Ready to Transform Your Assessment Vetting Process?

Book Your Free Consultation.
