Enterprise AI Analysis: Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring

Enterprise AI Research Analysis

Evaluating LLMs for Automated A-Level German Essay Scoring

This paper investigates the application of state-of-the-art open-weight Large Language Models (LLMs) for grading Austrian A-level German texts, with a particular focus on rubric-based evaluation. It evaluates four LLMs on a dataset of 101 anonymised student exams across three text types, using different contexts and prompting strategies. The results indicate that while smaller models can use standardized rubrics, they are not accurate enough for real-world grading.

Key Findings at a Glance

Understand the critical performance metrics and implications of LLMs in automated essay scoring for Austrian A-levels.

40.6% Max Agreement with Human Rater (sub-dimensions)
Final Grade Match with Human Expert
750s Max Grading Time per Task-Pair (LLama3.3 large context)
101 Student Exams Evaluated

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Introduction & Related Work
Methodology
Results & Evaluation
Discussion & Future Work

Background on Automated Essay Scoring

Automated Essay Scoring (AES) has been a research focus for decades, aiming to reduce teacher workload and mitigate subjective biases. Early systems used hand-crafted features and statistical models, but recent advancements in Large Language Models (LLMs) have enabled unprecedented flexibility in evaluating student writing.

The paper highlights that applying LLMs to AES tasks has been explored before, with use cases ranging from singular grades to deep explanations. This study specifically addresses the variety of text types in Austrian A-level German exams, each with distinct grading criteria, requiring the AES system to adaptively identify and apply appropriate rubrics.

Methodology Overview

The study processed and evaluated a dataset of 101 anonymised student exams across three text types: Literary Interpretation, Letter to the Editor, and Commentary. Four open-weight LLMs (DeepSeek-R1 32b, Qwen3 30b, Mixtral 8x7b, and LLama3.3 70b) were tested with various contexts and prompting strategies. LLama3.3 70b was selected for deeper investigation due to its stability and diverse performance.
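Rubric-based evaluation of this kind means the grading prompt must be assembled per text type. A minimal sketch of such a prompt builder follows; the rubric dimensions and the 1-5 grade scale wording are illustrative placeholders, not the official Austrian rubrics:

```python
# Hypothetical rubric dimensions for illustration; the actual Austrian
# A-level rubrics differ per text type and are not reproduced here.
RUBRICS = {
    "Literary Interpretation": ["content", "structure", "style", "language correctness"],
    "Letter to the Editor": ["content", "structure", "style", "language correctness"],
    "Commentary": ["content", "structure", "style", "language correctness"],
}

def build_grading_prompt(text_type: str, essay: str) -> str:
    """Assemble a rubric-based grading prompt for one student essay."""
    dims = "\n".join(f"- {d}: grade 1 (best) to 5 (worst)" for d in RUBRICS[text_type])
    return (
        f"You are grading an Austrian A-level German essay of type '{text_type}'.\n"
        f"Score each rubric dimension, then give a final grade from 1 to 5.\n"
        f"Rubric dimensions:\n{dims}\n\n"
        f"Essay:\n{essay}\n"
    )
```

The returned string would then be sent to whichever of the four models is under test.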

LLM Performance Comparison (Baseline)

Model           | Accuracy (Final Grade) | QWK (Final Grade)
Qwen3 30b       | 23.3%                  | 0.233
LLama3.3 70b    | 23.8%                  | 0.238
DeepSeek-R1 32b | 12.1%                  | 0.121
Mixtral 8x7b    | 25.0%                  | 0.250

Initial baseline evaluation of different LLMs for overall grade accuracy and Quadratic Weighted Kappa (QWK). Mixtral showed the highest initial accuracy, but LLama3.3 showed better stability and overall performance in further tests.

Enterprise Process Flow (Few-Shot In-Context Learning Workflow)

Calibration text → Answer generation
Solution presentation → Self-calibration
Candidate text → Answer generation
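This workflow can be sketched as a short conversation loop. Here `llm` is a hypothetical callable that maps a message history to a response, and the prompt wording is illustrative rather than the paper's:

```python
def few_shot_self_calibration(llm, calibration_text, calibration_solution, candidate_text):
    """Sketch of the few-shot in-context workflow: the model first grades a
    calibration essay, is shown the expert solution, reflects on the
    difference, and only then grades the candidate essay."""
    history = []
    # 1. Answer generation on the calibration text
    history.append(("user", f"Grade this essay:\n{calibration_text}"))
    history.append(("assistant", llm(history)))
    # 2. Solution presentation
    history.append(("user", f"The expert grading was:\n{calibration_solution}\n"
                            "Compare it with your grading and note the differences."))
    # 3. Self-calibration
    history.append(("assistant", llm(history)))
    # 4. Answer generation on the candidate text
    history.append(("user", f"Now grade this essay:\n{candidate_text}"))
    return llm(history)
```

Keeping the calibration exchange in the history is what lets the final grading step condition on the model's own correction.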

Evaluation Results

The study employed Quadratic Weighted Kappa (QWK), Mean Absolute Error (MAE), the Pearson Correlation Coefficient (PCC), and accuracy to evaluate LLM performance. LLama3.3 70b generally outperformed the other models in stability and overall score across categories.

40.6% Maximum Agreement with Human Rater (sub-dimensions)

LLMs achieved a maximum of 40.6% agreement with human raters in rubric-provided sub-dimensions for Austrian A-level German essays.
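QWK, the headline metric here, can be computed directly from the two graders' score vectors. A minimal sketch, assuming grades on a 1-5 scale as in Austrian grading:

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_grades=5):
    """Quadratic Weighted Kappa between two graders on a 1..n_grades scale."""
    rater_a = np.asarray(rater_a) - 1  # shift grades to 0-based indices
    rater_b = np.asarray(rater_b) - 1
    # Observed agreement matrix
    O = np.zeros((n_grades, n_grades))
    for a, b in zip(rater_a, rater_b):
        O[a, b] += 1
    # Expected matrix from the two marginal grade histograms
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    # Quadratic disagreement weights: 0 on the diagonal, growing with distance
    i, j = np.indices((n_grades, n_grades))
    W = ((i - j) ** 2) / ((n_grades - 1) ** 2)
    return 1.0 - (W * O).sum() / (W * E).sum()
```

A QWK of 1.0 means perfect agreement, 0 means chance-level agreement, and the quadratic weights penalise a two-grade disagreement four times as heavily as a one-grade disagreement.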

RAG Context Strategies Impact on QWK

Strategy       | QWK (Final Grade) | Key Observation
Baseline       | 0.29              | No RAG context
RAG-1-best     | 0.48              | Single best reference text; high QWK but questionable validity
RAG-best-worst | 0.46              | Best-Average-Worst selection
Few-best-worst | 0.40              | Few-shot with Best-Average-Worst; higher grading variety
CoT-best-worst | 0.23              | CoT with Best-Average-Worst; lower QWK but higher accuracy for Task 2

Different RAG and prompting strategies yielded varied QWK scores. RAG-1-best showed the highest QWK for the final grade, but few-shot methods generally showed more stable performance across dimensions.
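The reference-selection step behind these strategies can be sketched as a simple ranking over an already-graded corpus; the function name and `(essay, grade)` tuple format are illustrative assumptions:

```python
def select_reference_essays(graded_corpus, strategy="RAG-best-worst"):
    """Select reference essays to place in the grading context.
    graded_corpus: list of (essay_text, final_grade), grade 1 (best) .. 5 (worst)."""
    ranked = sorted(graded_corpus, key=lambda pair: pair[1])
    if strategy == "RAG-1-best":
        # Single best-graded reference text
        return [ranked[0]]
    if strategy == "RAG-best-worst":
        # Best, (roughly) average, and worst reference texts
        return [ranked[0], ranked[len(ranked) // 2], ranked[-1]]
    raise ValueError(f"unknown strategy: {strategy}")
```

The selected references are then prepended to the grading prompt, which is where RAG-1-best's "questionable validity" arises: anchoring on a single top-graded text can inflate agreement without the model genuinely applying the rubric.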

Discussion & Future Work

The study concludes that current LLMs are not yet ready for fully autonomous grading but hold significant potential as supportive tools. Challenges include a lack of grading variety in the training data, computational constraints, and the need for multi-grader ground truths to reduce bias.

750s Max Grading Time per Task-Pair

LLama3.3 70b took up to 750 seconds for grading a pair of tasks with large context techniques, highlighting computational intensity and practical limitations for real-world application.

Challenges & Limitations of Current LLM AES

The study highlights several limitations for current LLM-based AES, including the dataset (limited text types, OCR artifacts, single grader bias), model capabilities (LLMs other than LLama3.3 70b showed significant flaws like grade bias and unreliability), and computational power (high demands for large models and context techniques). Fully automated grading for real-world scenarios is still a long way off.

Key Takeaway: Current LLMs, while promising, are not yet ready for fully autonomous grading of Austrian A-levels due to accuracy, reliability, and computational demands. Future work requires more diverse datasets, multiple human graders, and stronger compute resources.

Future work should focus on running larger models with multiple iterations, implementing confidence scores for output reliability, and ensuring fair, unbiased grading systems that account for diverse student needs and prevent malicious attacks.
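One simple way to realise the proposed confidence scores is self-consistency sampling: grade the same essay several times and report how often the majority grade recurs. A sketch, where `grade_fn` is a hypothetical stand-in for one sampled LLM grading call:

```python
from collections import Counter

def grade_with_confidence(grade_fn, essay, n_samples=5):
    """Sample the model's final grade several times and report the majority
    grade plus the fraction of samples that agree with it."""
    samples = [grade_fn(essay) for _ in range(n_samples)]
    grade, count = Counter(samples).most_common(1)[0]
    return grade, count / n_samples
```

A low agreement fraction would flag the essay for human review, which fits the paper's framing of LLMs as supportive rather than autonomous graders; note that repeated sampling multiplies the already high per-task grading time.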


Your AI Implementation Roadmap

Our proven methodology guides your enterprise through a seamless AI integration, from strategy to sustained impact.

Discovery & Strategy

In-depth analysis of current workflows, identification of AI opportunities, and tailored strategy development for maximum impact.

Pilot & Prototyping

Rapid development of AI prototypes for key use cases, ensuring alignment with business goals and quick validation.

Full-Scale Implementation

Seamless integration of AI solutions into your existing infrastructure, with robust testing and comprehensive training.

Optimization & Scaling

Continuous monitoring, performance tuning, and expansion of AI capabilities across your enterprise for long-term growth.

Ready to Transform Your Enterprise with AI?

Schedule a personalized consultation with our AI experts to discuss how these insights can be applied to your specific business challenges and opportunities.
