Automated Grading of Open-Ended Questions in Higher Education Using GenAI Models
Enterprise AI Analysis: Maximizing Educational Efficiency
This study investigates the potential of Generative AI models and sentence embedding models for the automated assessment of open-ended student responses in a higher education computer science course. From 110 university students enrolled in a software engineering course, 1,885 responses to 24 open-ended questions assessing knowledge of software engineering concepts were collected. Using precision, recall, F1-score, false positive and false negative rates, and inter-rater agreement metrics such as Fleiss' Kappa and Krippendorff's Alpha, we systematically analyzed the performance of eleven state-of-the-art models, including GPTo1, Claude3, PaLM2, and SBERT, against two human expert graders. The findings reveal that GPTo1 achieved the highest agreement with human evaluations, showing almost perfect agreement, low false positive and false negative rates, and strong performance across all grade categories. Models such as Claude3 and PaLM2 demonstrated substantial agreement, excelling in higher-grade assessments but falling short in identifying failing grades. Sentence embedding models, while moderately effective, struggled to capture the context and semantic nuances of diverse student expressions. The study also highlights the limitations of reference-based grading approaches, as shown by the Natural Language Inference analysis, which found that many student responses contradicted reference answers despite being semantically correct. This underscores the importance of context-sensitive models such as GPTo1, which evaluate diverse responses accurately and ensure fairer grading. While GPTo1 stands out as a candidate for independent deployment, the financial cost of such high-performing proprietary models raises concerns about scalability.
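For readers who want to see the evaluation logic in code, the sketch below computes per-grade precision, recall, F1-score, and false positive/negative rates with scikit-learn; the grade labels and sample data are illustrative placeholders, not the study's dataset.

```python
# Minimal sketch of the per-grade evaluation metrics, assuming grades are
# discrete labels; the sample data below is hypothetical, not the study's data.
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

human_grades = ["excellent", "pass", "fail", "good", "pass", "fail"]
model_grades = ["excellent", "pass", "pass", "good", "pass", "fail"]
labels = ["fail", "pass", "good", "excellent"]

precision, recall, f1, _ = precision_recall_fscore_support(
    human_grades, model_grades, labels=labels, zero_division=0
)

# Per-grade false positive / false negative rates from the confusion matrix.
cm = confusion_matrix(human_grades, model_grades, labels=labels)
for i, label in enumerate(labels):
    tp = cm[i, i]
    fn = cm[i, :].sum() - tp  # human assigned this grade, model did not
    fp = cm[:, i].sum() - tp  # model assigned this grade, human did not
    tn = cm.sum() - tp - fn - fp
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    print(f"{label:>9}: P={precision[i]:.2f} R={recall[i]:.2f} "
          f"F1={f1[i]:.2f} FPR={fpr:.2f} FNR={fnr:.2f}")
```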
Executive Impact
Automated grading with GenAI offers unprecedented accuracy and efficiency in educational assessment. Here’s a snapshot of the key performance indicators:
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Overall Performance
GPTo1 consistently outperformed other models with a Fleiss' Kappa of 0.82 and Krippendorff's Alpha exceeding 0.80, indicating almost perfect agreement with human graders. Claude3 and PaLM2 also showed strong agreement, particularly in higher-grade categories.
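A minimal sketch of how these two agreement metrics can be computed, assuming the `statsmodels` and `krippendorff` packages; the ratings below stand in for two human graders and one model and are hypothetical, not the study's data.

```python
# Inter-rater agreement sketch: Fleiss' Kappa and Krippendorff's Alpha over
# integer-coded grade categories (hypothetical ratings, not the study's data).
import numpy as np
import krippendorff
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = student responses, columns = raters (human 1, human 2, model).
ratings = np.array([
    [3, 3, 3],
    [1, 1, 2],
    [0, 0, 0],
    [2, 2, 2],
    [1, 2, 1],
])

# Fleiss' Kappa expects a subjects x categories count table.
table, _ = aggregate_raters(ratings)
print("Fleiss' Kappa:", fleiss_kappa(table, method="fleiss"))

# Krippendorff's Alpha expects raters x units, i.e. the transpose of `ratings`.
print("Krippendorff's Alpha:",
      krippendorff.alpha(reliability_data=ratings.T,
                         level_of_measurement="ordinal"))
```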
Grading Errors
GPTo1, Claude3, and PaLM2 exhibited the lowest False Positive (FP) and False Negative (FN) rates. Encoder-based models (BERT, RoBERTa, USE) showed higher FP rates due to reliance on surface-level similarity, misinterpreting partial overlaps as strong semantic agreement.
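The sketch below illustrates threshold-based embedding grading and why surface overlap inflates false positives; the sentence-transformers model name, threshold, and answers are assumptions for illustration, not the study's exact configuration.

```python
# Reference-based embedding grading sketch, assuming a sentence-transformers
# model and a fixed cosine-similarity threshold (both illustrative choices).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "Coupling measures how strongly one module depends on other modules."
student = "Coupling is when a module strongly depends on its own internal methods."

sim = util.cos_sim(model.encode(reference), model.encode(student)).item()
grade = "correct" if sim >= 0.75 else "incorrect"
print(f"cosine similarity = {sim:.2f} -> {grade}")
# High lexical overlap can push the similarity above the threshold even when
# the student misstates the concept, which inflates false positives.
```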
Contextual Understanding
GenAI models, especially GPTo1 and Claude3, excelled in contextual understanding, interpreting diverse student phrasing and semantic nuances beyond strict reference matching. This is crucial for accurately grading open-ended questions.
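A minimal sketch of a context-sensitive grading call, assuming access to the OpenAI Chat Completions API; the model name, grade scale, and prompt wording are illustrative paraphrases rather than the study's exact prompt.

```python
# Context-sensitive LLM grading sketch; model name and rubric are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def grade_response(question: str, reference: str, student_answer: str) -> str:
    prompt = (
        "You are grading an open-ended software engineering exam question.\n"
        f"Question: {question}\n"
        f"Reference answer (one acceptable formulation): {reference}\n"
        f"Student answer: {student_answer}\n"
        "Judge whether the student's answer is conceptually correct in its own "
        "wording, even if it does not match the reference phrasing. Reply with "
        "one grade (fail, pass, good, or excellent) and a one-sentence rationale."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the study evaluated GPTo1 and others
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```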
Reference-Based Limitations
Natural Language Inference (NLI) analysis revealed that many student responses, although semantically correct, contradicted reference answers due to varied phrasing. This highlights the limitations of rigid reference-based evaluation.
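A minimal sketch of the reference-versus-student NLI check, assuming the publicly available roberta-large-mnli checkpoint via Hugging Face Transformers; the example sentences are illustrative.

```python
# NLI sketch: treat the reference answer as premise and the student answer as
# hypothesis, then read off ENTAILMENT / NEUTRAL / CONTRADICTION.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

reference = "Black-box testing evaluates software behaviour without inspecting its internal code."
student = "In black-box testing we only check inputs and outputs, not the source code."

inputs = tokenizer(reference, student, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
label = model.config.id2label[logits.argmax(dim=-1).item()]
print(label)
# Correct but rephrased answers frequently land in NEUTRAL or CONTRADICTION,
# which is the reference-matching limitation highlighted above.
```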
Automated Grading Process Flow
| Model | Key Strengths & Limitations |
|---|---|
| GPTo1 | Almost perfect agreement with human graders (Fleiss' Kappa of 0.82); lowest false positive and false negative rates; strong contextual understanding across all grade categories. |
| Claude3 & PaLM2 | Substantial agreement with human graders and low error rates; excel in higher-grade assessments but fall short in identifying failing grades. |
| SBERT & USE | Moderately effective semantic similarity matching for close paraphrases; struggle to capture the context and nuance of varied student phrasing. |
| BERT & RoBERTa | Fast encoder-based similarity scoring; higher false positive rates from treating partial surface overlap as strong semantic agreement. |
Impact of Contextual Evaluation
Challenge with Reference-Based Grading
Traditional reference-based grading often penalizes student responses that deviate in phrasing or structure from predefined answers, even if semantically correct. This led to a significant portion of student answers being classified as 'Contradiction' by NLI models despite potential validity.
GPTo1's Context-Sensitive Approach
GPTo1, utilizing a context-focused evaluation prompt (Prompt 2/3), demonstrated superior ability to interpret diverse student responses. This flexibility reduced false negatives and improved alignment with human graders, ensuring fairer assessment for students using their own words.
The study highlights that context-sensitive models like GPTo1 are essential for accurate and fair automated grading of open-ended questions, moving beyond rigid reference matching.
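To make the contrast concrete, a hypothetical pair of rubric instructions is sketched below; the wording paraphrases the idea of reference-strict versus context-focused prompting and is not the study's actual Prompt 1 or Prompt 2/3.

```python
# Hypothetical rubric instructions illustrating the two prompting styles;
# the wording is an assumption, not the study's exact prompts.
REFERENCE_STRICT_PROMPT = (
    "Grade the student answer by how closely it matches the reference answer. "
    "Penalize wording, structure, or examples that differ from the reference."
)

CONTEXT_FOCUSED_PROMPT = (
    "Grade the student answer by whether it demonstrates the concept asked "
    "about in the question. Accept any correct formulation, even if the "
    "wording, structure, or examples differ from the reference answer."
)
```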
Unlock Your Institution's AI ROI
Estimate the potential savings and efficiency gains by implementing GenAI for automated grading.
Your AI Implementation Roadmap
A typical deployment process ensures seamless integration and maximum impact for your institution.
Discovery & Planning
Assess current grading workflows, identify key requirements, and define success metrics. Collect and anonymize data, and shortlist initial candidate models.
AI Model Development & Customization
Fine-tune selected GenAI models with institutional data, develop contextual prompts, and establish robust evaluation benchmarks with human graders.
Integration & Monitoring
Integrate the automated grading system into the existing LMS. Implement continuous monitoring, error analysis, and iterative refinement for optimal performance.
Ready to Transform Your Assessment?
Leverage cutting-edge GenAI to enhance grading accuracy, consistency, and efficiency in your higher education institution.