Automated Grading of Open-Ended Questions in Higher Education Using GenAI Models
Enterprise AI Analysis: Maximizing Educational Efficiency
This study investigates the potential of Generative AI models and sentence embedding models for the automated assessment of open-ended student responses in a higher education computer science course. From 110 university students enrolled in a software engineering course, 1,885 responses to 24 open-ended questions assessing knowledge of software engineering concepts were collected. Using precision, recall, F1-score, false positive and false negative rates, and inter-rater agreement metrics such as Fleiss' Kappa and Krippendorff's Alpha, we systematically analyzed the performance of eleven state-of-the-art models, including GPTo1, Claude3, PaLM2, and SBERT, against two human expert graders. The findings reveal that GPTo1 achieved the highest agreement with human evaluations, showing almost perfect agreement, low false positive and false negative rates, and strong performance across all grade categories. Models such as Claude3 and PaLM2 demonstrated substantial agreement, excelling in higher-grade assessments but falling short in identifying failing grades. Sentence embedding models, while moderately effective, struggled to capture the context and semantic nuances of diverse student expressions. The study also highlights the limitations of reference-based grading approaches, as shown by the Natural Language Inference analysis, which found that many student responses contradicted reference answers despite being semantically correct. This underscores the importance of context-sensitive models such as GPTo1, which evaluate diverse responses accurately and ensure fairer grading. While GPTo1 stands out as a candidate for independent deployment, the financial cost of such high-performing proprietary models raises concerns about scalability.
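For readers who want to see the evaluation logic in code, the sketch below computes per-grade precision, recall, F1-score, and false positive/negative rates with scikit-learn; the grade labels and sample data are illustrative placeholders, not the study's dataset.

```python
# Minimal sketch of the per-grade evaluation metrics, assuming grades are
# discrete labels; the sample data below is hypothetical, not the study's data.
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

human_grades = ["excellent", "pass", "fail", "good", "pass", "fail"]
model_grades = ["excellent", "pass", "pass", "good", "pass", "fail"]
labels = ["fail", "pass", "good", "excellent"]

precision, recall, f1, _ = precision_recall_fscore_support(
    human_grades, model_grades, labels=labels, zero_division=0
)

# Per-grade false positive / false negative rates from the confusion matrix.
cm = confusion_matrix(human_grades, model_grades, labels=labels)
for i, label in enumerate(labels):
    tp = cm[i, i]
    fn = cm[i, :].sum() - tp  # human assigned this grade, model did not
    fp = cm[:, i].sum() - tp  # model assigned this grade, human did not
    tn = cm.sum() - tp - fn - fp
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    print(f"{label:>9}: P={precision[i]:.2f} R={recall[i]:.2f} "
          f"F1={f1[i]:.2f} FPR={fpr:.2f} FNR={fnr:.2f}")
```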
Executive Impact
Automated grading with GenAI offers unprecedented accuracy and efficiency in educational assessment. Here’s a snapshot of the key performance indicators:
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Overall Performance
GPTo1 consistently outperformed other models with a Fleiss' Kappa of 0.82 and Krippendorff's Alpha exceeding 0.80, indicating almost perfect agreement with human graders. Claude3 and PaLM2 also showed strong agreement, particularly in higher-grade categories.
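A minimal sketch of how these two agreement metrics can be computed, assuming the `statsmodels` and `krippendorff` packages; the ratings below stand in for two human graders and one model and are hypothetical, not the study's data.

```python
# Inter-rater agreement sketch: Fleiss' Kappa and Krippendorff's Alpha over
# integer-coded grade categories (hypothetical ratings, not the study's data).
import numpy as np
import krippendorff
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = student responses, columns = raters (human 1, human 2, model).
ratings = np.array([
    [3, 3, 3],
    [1, 1, 2],
    [0, 0, 0],
    [2, 2, 2],
    [1, 2, 1],
])

# Fleiss' Kappa expects a subjects x categories count table.
table, _ = aggregate_raters(ratings)
print("Fleiss' Kappa:", fleiss_kappa(table, method="fleiss"))

# Krippendorff's Alpha expects raters x units, i.e. the transpose of `ratings`.
print("Krippendorff's Alpha:",
      krippendorff.alpha(reliability_data=ratings.T,
                         level_of_measurement="ordinal"))
```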
Grading Errors
GPTo1, Claude3, and PaLM2 exhibited the lowest False Positive (FP) and False Negative (FN) rates. Encoder-based models (BERT, RoBERTa, USE) showed higher FP rates due to reliance on surface-level similarity, misinterpreting partial overlaps as strong semantic agreement.
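The sketch below illustrates threshold-based embedding grading and why surface overlap inflates false positives; the sentence-transformers model name, threshold, and answers are assumptions for illustration, not the study's exact configuration.

```python
# Reference-based embedding grading sketch, assuming a sentence-transformers
# model and a fixed cosine-similarity threshold (both illustrative choices).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "Coupling measures how strongly one module depends on other modules."
student = "Coupling is when a module strongly depends on its own internal methods."

sim = util.cos_sim(model.encode(reference), model.encode(student)).item()
grade = "correct" if sim >= 0.75 else "incorrect"
print(f"cosine similarity = {sim:.2f} -> {grade}")
# High lexical overlap can push the similarity above the threshold even when
# the student misstates the concept, which inflates false positives.
```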
Contextual Understanding
GenAI models, especially GPTo1 and Claude3, excelled in contextual understanding, interpreting diverse student phrasing and semantic nuances beyond strict reference matching. This is crucial for accurately grading open-ended questions.
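A minimal sketch of a context-sensitive grading call, assuming access to the OpenAI Chat Completions API; the model name, grade scale, and prompt wording are illustrative paraphrases rather than the study's exact prompt.

```python
# Context-sensitive LLM grading sketch; model name and rubric are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def grade_response(question: str, reference: str, student_answer: str) -> str:
    prompt = (
        "You are grading an open-ended software engineering exam question.\n"
        f"Question: {question}\n"
        f"Reference answer (one acceptable formulation): {reference}\n"
        f"Student answer: {student_answer}\n"
        "Judge whether the student's answer is conceptually correct in its own "
        "wording, even if it does not match the reference phrasing. Reply with "
        "one grade (fail, pass, good, or excellent) and a one-sentence rationale."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the study evaluated GPTo1 and others
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```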
Reference-Based Limitations
Natural Language Inference (NLI) analysis revealed that many student responses, although semantically correct, contradicted reference answers due to varied phrasing. This highlights the limitations of rigid reference-based evaluation.
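A minimal sketch of the reference-versus-student NLI check, assuming the publicly available roberta-large-mnli checkpoint via Hugging Face Transformers; the example sentences are illustrative.

```python
# NLI sketch: treat the reference answer as premise and the student answer as
# hypothesis, then read off ENTAILMENT / NEUTRAL / CONTRADICTION.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

reference = "Black-box testing evaluates software behaviour without inspecting its internal code."
student = "In black-box testing we only check inputs and outputs, not the source code."

inputs = tokenizer(reference, student, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
label = model.config.id2label[logits.argmax(dim=-1).item()]
print(label)
# Correct but rephrased answers frequently land in NEUTRAL or CONTRADICTION,
# which is the reference-matching limitation highlighted above.
```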
Automated Grading Process Flow
| Model | Key Strengths & Limitations |
|---|---|
| GPTo1 | Almost perfect agreement with human graders (Fleiss' Kappa of 0.82); lowest false positive and false negative rates; strong contextual understanding across all grade categories. |
| Claude3 & PaLM2 | Substantial agreement with human graders and low error rates; excel in higher-grade assessments but fall short in identifying failing grades. |
| SBERT & USE | Moderately effective semantic similarity matching for close paraphrases; struggle to capture the context and nuance of varied student phrasing. |
| BERT & RoBERTa | Fast encoder-based similarity scoring; higher false positive rates from treating partial surface overlap as strong semantic agreement. |
Impact of Contextual Evaluation
Challenge with Reference-Based Grading
Traditional reference-based grading often penalizes student responses that deviate in phrasing or structure from predefined answers, even if semantically correct. This led to a significant portion of student answers being classified as 'Contradiction' by NLI models despite potential validity.
GPTo1's Context-Sensitive Approach
GPTo1, utilizing a context-focused evaluation prompt (Prompt 2/3), demonstrated superior ability to interpret diverse student responses. This flexibility reduced false negatives and improved alignment with human graders, ensuring fairer assessment for students using their own words.
The study highlights that context-sensitive models like GPTo1 are essential for accurate and fair automated grading of open-ended questions, moving beyond rigid reference matching.
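To make the contrast concrete, a hypothetical pair of rubric instructions is sketched below; the wording paraphrases the idea of reference-strict versus context-focused prompting and is not the study's actual Prompt 1 or Prompt 2/3.

```python
# Hypothetical rubric instructions illustrating the two prompting styles;
# the wording is an assumption, not the study's exact prompts.
REFERENCE_STRICT_PROMPT = (
    "Grade the student answer by how closely it matches the reference answer. "
    "Penalize wording, structure, or examples that differ from the reference."
)

CONTEXT_FOCUSED_PROMPT = (
    "Grade the student answer by whether it demonstrates the concept asked "
    "about in the question. Accept any correct formulation, even if the "
    "wording, structure, or examples differ from the reference answer."
)
```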
Unlock Your Institution's AI ROI
Estimate the potential savings and efficiency gains by implementing GenAI for automated grading.
Your AI Implementation Roadmap
A typical deployment process ensures seamless integration and maximum impact for your institution.
Discovery & Planning
Assess current grading workflows, identify key requirements, and define success metrics. Collect and anonymize data, and shortlist initial candidate models.
AI Model Development & Customization
Fine-tune selected GenAI models with institutional data, develop contextual prompts, and establish robust evaluation benchmarks with human graders.
Integration & Monitoring
Integrate the automated grading system into the existing LMS. Implement continuous monitoring, error analysis, and iterative refinement for optimal performance.
Ready to Transform Your Assessment?
Leverage cutting-edge GenAI to enhance grading accuracy, consistency, and efficiency in your higher education institution.