Enterprise AI Analysis: Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring

Enterprise AI Research Analysis

Evaluating LLMs for Automated A-Level German Essay Scoring

This paper investigates the application of state-of-the-art open-weight Large Language Models (LLMs) for grading Austrian A-level German texts, with a particular focus on rubric-based evaluation. It evaluates four LLMs on a dataset of 101 anonymised student exams across three text types, using different contexts and prompting strategies. The results indicate that while smaller models can use standardized rubrics, they are not accurate enough for real-world grading.

Key Findings at a Glance

Understand the critical performance metrics and implications of LLMs in automated essay scoring for Austrian A-levels.

40.6% Max Agreement with Human Rater (sub-dimensions)
Final Grade Match with Human Expert
750s Max Grading Time per Task-Pair (LLama3.3 large context)
101 Student Exams Evaluated

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Introduction & Related Work
Methodology
Results & Evaluation
Discussion & Future Work

Background on Automated Essay Scoring

Automated Essay Scoring (AES) has been a research focus for decades, aiming to reduce teacher workload and mitigate subjective biases. Early systems used hand-crafted features and statistical models, but recent advancements in Large Language Models (LLMs) have enabled unprecedented flexibility in evaluating student writing.

The paper highlights that applying LLMs to AES tasks has been explored before, with use cases ranging from singular grades to deep explanations. This study specifically addresses the variety of text types in Austrian A-level German exams, each with distinct grading criteria, requiring the AES system to adaptively identify and apply appropriate rubrics.

Methodology Overview

The study processed and evaluated a dataset of 101 anonymised student exams across three text types: Literary Interpretation, Letter to the Editor, and Commentary. Four open-weight LLMs (DeepSeek-R1 32b, Qwen3 30b, Mixtral 8x7b, and LLama3.3 70b) were tested with various contexts and prompting strategies. LLama3.3 70b was selected for deeper investigation due to its stability and diverse performance.
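Rubric-based evaluation of this kind means the grading prompt must be assembled per text type. A minimal sketch of such a prompt builder follows; the rubric dimensions and the 1-5 grade scale wording are illustrative placeholders, not the official Austrian rubrics:

```python
# Hypothetical rubric dimensions for illustration; the actual Austrian
# A-level rubrics differ per text type and are not reproduced here.
RUBRICS = {
    "Literary Interpretation": ["content", "structure", "style", "language correctness"],
    "Letter to the Editor": ["content", "structure", "style", "language correctness"],
    "Commentary": ["content", "structure", "style", "language correctness"],
}

def build_grading_prompt(text_type: str, essay: str) -> str:
    """Assemble a rubric-based grading prompt for one student essay."""
    dims = "\n".join(f"- {d}: grade 1 (best) to 5 (worst)" for d in RUBRICS[text_type])
    return (
        f"You are grading an Austrian A-level German essay of type '{text_type}'.\n"
        f"Score each rubric dimension, then give a final grade from 1 to 5.\n"
        f"Rubric dimensions:\n{dims}\n\n"
        f"Essay:\n{essay}\n"
    )
```

The returned string would then be sent to whichever of the four models is under test.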

LLM Performance Comparison (Baseline)

Model           | Accuracy (Final Grade) | QWK (Final Grade)
Qwen3 30b       | 23.3%                  | 0.233
LLama3.3 70b    | 23.8%                  | 0.238
DeepSeek-R1 32b | 12.1%                  | 0.121
Mixtral 8x7b    | 25.0%                  | 0.250

Initial baseline evaluation of different LLMs for overall grade accuracy and Quadratic Weighted Kappa (QWK). Mixtral showed the highest initial accuracy, but LLama3.3 showed better stability and overall performance in further tests.

Enterprise Process Flow (Few-Shot In-Context Learning Workflow)

Calibration text → Answer generation
Solution presentation → Self-calibration
Candidate text → Answer generation
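This workflow can be sketched as a short conversation loop. Here `llm` is a hypothetical callable that maps a message history to a response, and the prompt wording is illustrative rather than the paper's:

```python
def few_shot_self_calibration(llm, calibration_text, calibration_solution, candidate_text):
    """Sketch of the few-shot in-context workflow: the model first grades a
    calibration essay, is shown the expert solution, reflects on the
    difference, and only then grades the candidate essay."""
    history = []
    # 1. Answer generation on the calibration text
    history.append(("user", f"Grade this essay:\n{calibration_text}"))
    history.append(("assistant", llm(history)))
    # 2. Solution presentation
    history.append(("user", f"The expert grading was:\n{calibration_solution}\n"
                            "Compare it with your grading and note the differences."))
    # 3. Self-calibration
    history.append(("assistant", llm(history)))
    # 4. Answer generation on the candidate text
    history.append(("user", f"Now grade this essay:\n{candidate_text}"))
    return llm(history)
```

Keeping the calibration exchange in the history is what lets the final grading step condition on the model's own correction.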

Evaluation Results

The study employed Quadratic Weighted Kappa (QWK), Mean Absolute Error (MAE), the Pearson Correlation Coefficient (PCC), and accuracy to evaluate LLM performance. LLama3.3 70b generally outperformed the other models in stability and overall score across categories.

40.6% Maximum Agreement with Human Rater (sub-dimensions)

LLMs achieved a maximum of 40.6% agreement with human raters in rubric-provided sub-dimensions for Austrian A-level German essays.
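QWK, the headline metric here, can be computed directly from the two graders' score vectors. A minimal sketch, assuming grades on a 1-5 scale as in Austrian grading:

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_grades=5):
    """Quadratic Weighted Kappa between two graders on a 1..n_grades scale."""
    rater_a = np.asarray(rater_a) - 1  # shift grades to 0-based indices
    rater_b = np.asarray(rater_b) - 1
    # Observed agreement matrix
    O = np.zeros((n_grades, n_grades))
    for a, b in zip(rater_a, rater_b):
        O[a, b] += 1
    # Expected matrix from the two marginal grade histograms
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    # Quadratic disagreement weights: 0 on the diagonal, growing with distance
    i, j = np.indices((n_grades, n_grades))
    W = ((i - j) ** 2) / ((n_grades - 1) ** 2)
    return 1.0 - (W * O).sum() / (W * E).sum()
```

A QWK of 1.0 means perfect agreement, 0 means chance-level agreement, and the quadratic weights penalise a two-grade disagreement four times as heavily as a one-grade disagreement.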

RAG Context Strategies Impact on QWK

Strategy       | QWK (Final Grade) | Key Observation
Baseline       | 0.29              | No RAG context
RAG-1-best     | 0.48              | Single best reference text; high QWK but questionable validity
RAG-best-worst | 0.46              | Best-Average-Worst selection
Few-best-worst | 0.40              | Few-shot with Best-Average-Worst; higher grading variety
CoT-best-worst | 0.23              | CoT with Best-Average-Worst; lower QWK but higher accuracy for Task 2

Different RAG and prompting strategies yielded varied QWK scores. RAG-1-best showed the highest QWK for the final grade, but few-shot methods generally showed more stable performance across dimensions.
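The reference-selection step behind these strategies can be sketched as a simple ranking over an already-graded corpus; the function name and `(essay, grade)` tuple format are illustrative assumptions:

```python
def select_reference_essays(graded_corpus, strategy="RAG-best-worst"):
    """Select reference essays to place in the grading context.
    graded_corpus: list of (essay_text, final_grade), grade 1 (best) .. 5 (worst)."""
    ranked = sorted(graded_corpus, key=lambda pair: pair[1])
    if strategy == "RAG-1-best":
        # Single best-graded reference text
        return [ranked[0]]
    if strategy == "RAG-best-worst":
        # Best, (roughly) average, and worst reference texts
        return [ranked[0], ranked[len(ranked) // 2], ranked[-1]]
    raise ValueError(f"unknown strategy: {strategy}")
```

The selected references are then prepended to the grading prompt, which is where RAG-1-best's "questionable validity" arises: anchoring on a single top-graded text can inflate agreement without the model genuinely applying the rubric.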

Discussion & Future Work

The study concludes that current LLMs are not yet ready for fully autonomous grading but hold significant potential as supportive tools. Challenges include a lack of grading variety in the training data, computational constraints, and the need for multi-grader ground truths to reduce bias.

750s Max Grading Time per Task-Pair

LLama3.3 70b took up to 750 seconds for grading a pair of tasks with large context techniques, highlighting computational intensity and practical limitations for real-world application.

Challenges & Limitations of Current LLM AES

The study highlights several limitations for current LLM-based AES, including the dataset (limited text types, OCR artifacts, single grader bias), model capabilities (LLMs other than LLama3.3 70b showed significant flaws like grade bias and unreliability), and computational power (high demands for large models and context techniques). Fully automated grading for real-world scenarios is still a long way off.

Key Takeaway: Current LLMs, while promising, are not yet ready for fully autonomous grading of Austrian A-levels due to accuracy, reliability, and computational demands. Future work requires more diverse datasets, multiple human graders, and stronger compute resources.

Future work should focus on running larger models with multiple iterations, implementing confidence scores for output reliability, and ensuring fair, unbiased grading systems that account for diverse student needs and prevent malicious attacks.
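One simple way to realise the proposed confidence scores is self-consistency sampling: grade the same essay several times and report how often the majority grade recurs. A sketch, where `grade_fn` is a hypothetical stand-in for one sampled LLM grading call:

```python
from collections import Counter

def grade_with_confidence(grade_fn, essay, n_samples=5):
    """Sample the model's final grade several times and report the majority
    grade plus the fraction of samples that agree with it."""
    samples = [grade_fn(essay) for _ in range(n_samples)]
    grade, count = Counter(samples).most_common(1)[0]
    return grade, count / n_samples
```

A low agreement fraction would flag the essay for human review, which fits the paper's framing of LLMs as supportive rather than autonomous graders; note that repeated sampling multiplies the already high per-task grading time.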


Your AI Implementation Roadmap

Our proven methodology guides your enterprise through a seamless AI integration, from strategy to sustained impact.

Discovery & Strategy

In-depth analysis of current workflows, identification of AI opportunities, and tailored strategy development for maximum impact.

Pilot & Prototyping

Rapid development of AI prototypes for key use cases, ensuring alignment with business goals and quick validation.

Full-Scale Implementation

Seamless integration of AI solutions into your existing infrastructure, with robust testing and comprehensive training.

Optimization & Scaling

Continuous monitoring, performance tuning, and expansion of AI capabilities across your enterprise for long-term growth.

Ready to Transform Your Enterprise with AI?

Schedule a personalized consultation with our AI experts to discuss how these insights can be applied to your specific business challenges and opportunities.
