Natural Language Processing (NLP)
Automatic Essay Scoring and Feedback Generation in Basque Language Learning
This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level. It leverages open-source models like Latxa, fine-tuning them to achieve superior performance over closed-source systems (GPT-5, Claude Sonnet 4.5) in scoring consistency and feedback quality. The research also presents a novel evaluation methodology for feedback generation, identifying a wider range of error types relevant for low-resource languages.
Deep Analysis & Enterprise Applications
The core of this research is the introduction of the first publicly available, richly annotated dataset for Basque AES and feedback generation at the CEFR C1 level. This dataset comprises 3,200 essays, each annotated by expert evaluators with criterion-specific scores (correctness, richness, coherence, cohesion, and task alignment), detailed feedback, and error examples. This rich annotation enables the development of models capable of producing both reliable scores and interpretable, criterion-specific feedback, fostering deeper awareness and more targeted language development.
The essays average 334.29 words in length. Correctness has the highest number of annotated error examples, reflecting its more objective nature. The dataset is divided into training, validation, and test sets of 2,600, 300, and 300 essays, respectively.
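To make the annotation structure concrete, here is a minimal Python sketch of what one dataset record could look like; the field names and example values are illustrative assumptions, not the released dataset's actual schema:

```python
from dataclasses import dataclass

# The five assessment criteria described above.
CRITERIA = ["correctness", "richness", "coherence", "cohesion", "task_alignment"]

@dataclass
class EssayRecord:
    """One annotated essay: expert criterion scores, feedback, and error examples."""
    essay_text: str
    scores: dict           # criterion -> expert score
    feedback: dict         # criterion -> free-text expert feedback
    error_examples: list   # text excerpts illustrating concrete errors

# Hypothetical example record (not drawn from the actual dataset).
record = EssayRecord(
    essay_text="(C1-level Basque essay text ...)",
    scores={c: 3 for c in CRITERIA},
    feedback={c: "(criterion-specific comments)" for c in CRITERIA},
    error_examples=["(excerpt showing a concrete error)"],
)

# Split sizes reported above: 2,600 train / 300 validation / 300 test.
SPLITS = {"train": 2600, "validation": 300, "test": 300}
```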
Experiments demonstrate that while fine-tuned encoder models like RoBERTa-EusCrawl remain a strong baseline for criterion-based AES, supervised fine-tuning (SFT) of generative models on the new dataset yields significant performance gains. Specifically, the SFT Latxa 70B model surpassed both specialized encoder models and state-of-the-art proprietary models such as GPT-5 and Claude Sonnet 4.5 on the Correctness scoring criterion, highlighting the effectiveness of domain-specific fine-tuning.
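As a rough illustration of such a fine-tuning setup, the sketch below uses Hugging Face TRL's SFTTrainer. The model id, dataset path, data format, and hyperparameters are all assumptions for illustration, not the paper's actual configuration; the 70B model would in practice require multi-GPU training and likely parameter-efficient methods, so a smaller Latxa variant is shown here:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical JSONL file where each line carries a "text" field containing
# the prompt (essay + instructions) followed by the SFE-ordered target.
dataset = load_dataset("json", data_files={"train": "basque_aes_train.jsonl"})["train"]

trainer = SFTTrainer(
    model="HiTZ/latxa-7b-v1.2",  # assumed checkpoint id; substitute your Latxa variant
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="latxa-aes-sft",
        num_train_epochs=3,              # illustrative, not the paper's setting
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
    ),
)
trainer.train()
```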
Output configuration significantly affects model performance, with the SFE (Score, Feedback, Error-examples) ordering performing best. Generating the score first allows the model to condition the subsequent feedback and error examples on that assessment.
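A minimal sketch of how an SFE-ordered training target might be serialized; the field labels and formatting are hypothetical, as the paper's exact template is not reproduced here:

```python
def build_sfe_target(score: int, feedback: str, error_examples: list[str]) -> str:
    """Serialize the output fields in SFE order: Score, then Feedback, then
    Error-examples, so the model commits to a score before explaining it."""
    errors = "\n".join(f"- {e}" for e in error_examples)
    return (
        f"Score: {score}\n"
        f"Feedback: {feedback}\n"
        f"Error examples:\n{errors}"
    )

print(build_sfe_target(
    3,
    "Generally coherent, but verb agreement is inconsistent.",
    ["'dut' used where 'ditut' is required"],
))
```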
Our analysis of generated explanations revealed that SFT models show high consistency between generated feedback and assigned scores. The fine-tuned Latxa model also proved superior at identifying a more balanced and pedagogically relevant range of error types, whereas the closed-source models disproportionately focused on surface-level spelling and vocabulary errors, possibly due to OCR artifacts in the essays.
This resource and benchmark establish a foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages such as Basque, promoting open frameworks for accurate and pedagogically sound feedback that aligns with established language proficiency scales.
Model Performance Comparison (QWK)
| Model | QWK Score | Key Strengths |
|---|---|---|
| SFT Latxa 70B | 57.23 | Best overall; balanced, pedagogically relevant error identification |
| GPT-5 | 27.23 | State-of-the-art closed-source model; tends toward surface-level errors |
| Claude Sonnet 4.5 | 18.93 | State-of-the-art closed-source model; tends toward surface-level errors |
| RoBERTa-EusCrawl | 43.82 | Strong fine-tuned encoder baseline for criterion-based AES |
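For reference, QWK (Quadratic Weighted Kappa) measures agreement between model and expert scores, penalizing larger disagreements quadratically; the table reports it on a 0–100 scale. A minimal computation with scikit-learn, using made-up scores:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical expert vs. model scores on a 1-5 criterion scale.
expert = [3, 4, 2, 5, 3, 4, 1, 3]
model  = [3, 4, 3, 4, 3, 4, 2, 3]

# weights="quadratic" yields Quadratic Weighted Kappa (QWK).
qwk = cohen_kappa_score(expert, model, weights="quadratic")
print(f"QWK: {qwk * 100:.2f}")  # scaled to 0-100 to match the table above
```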
Impact in Low-Resource Language Education
The development of AES and feedback generation systems for Basque addresses a critical gap in educational technology for less-resourced languages.
Challenge: Traditional manual evaluation is time-consuming and costly, limiting scalability and consistent feedback for Basque learners.
Solution: By fine-tuning open-source LLMs like Latxa on a newly created, richly annotated Basque C1 dataset, the research enables automated, high-quality, and pedagogically relevant feedback.
Outcome: This leads to improved learning outcomes for Basque learners, fostering deeper language awareness and more targeted development, bridging the gap with more resourced languages.
AI Implementation Roadmap
A phased approach to integrating advanced essay scoring and feedback generation into your educational or enterprise systems.
Phase 1: Data Acquisition & Preprocessing
Gather and digitize existing essay data, perform OCR, and align with CEFR C1 criteria. Annotate with expert feedback and error examples to build a robust training dataset specific to your language context.
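If essays arrive as scans, the digitization step could look like the following pytesseract sketch; it assumes Tesseract and its Basque language pack (`eus`) are installed, and the directory layout is hypothetical:

```python
from pathlib import Path

import pytesseract
from PIL import Image

def digitize_essays(scan_dir: str) -> dict[str, str]:
    """OCR each scanned essay image into text; Tesseract's Basque model is 'eus'."""
    texts = {}
    for img_path in Path(scan_dir).glob("*.png"):
        texts[img_path.stem] = pytesseract.image_to_string(
            Image.open(img_path), lang="eus"
        )
    return texts

# Example: digitize_essays("scans/") -> {"essay_001": "Euskarazko testua ...", ...}
```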
Phase 2: Model Fine-tuning & Adaptation
Select and fine-tune open-source LLMs (e.g., Latxa) on your curated dataset. Optimize for criterion-specific scoring (correctness, coherence) and experiment with output-field orderings; in this research, the SFE ordering maximized performance and consistency.
Phase 3: Feedback System Development
Integrate the fine-tuned model into a feedback generation system. Develop modules for extracting specific error examples and categorizing error types, ensuring pedagogically relevant insights.
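A sketch of parsing an SFE-ordered generation back into structured fields, assuming the illustrative serialization shown earlier; a production parser would need to tolerate format drift:

```python
import re

def parse_sfe_output(text: str) -> dict:
    """Split a generated SFE response into score, feedback, and error examples."""
    score = re.search(r"Score:\s*(\d+)", text)
    feedback = re.search(r"Feedback:\s*(.+?)(?:\nError examples:|\Z)", text, re.DOTALL)
    errors = []
    if "Error examples:" in text:
        block = text.split("Error examples:", 1)[1]
        errors = [line.lstrip("- ").strip() for line in block.splitlines() if line.strip()]
    return {
        "score": int(score.group(1)) if score else None,
        "feedback": feedback.group(1).strip() if feedback else "",
        "error_examples": errors,
    }
```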
Phase 4: Evaluation & Validation
Implement a novel evaluation methodology combining automatic metrics (QWK, Weighted-F1, Consistency) with expert human validation for feedback quality and error identification. Iterate based on expert annotator feedback.
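A minimal harness for the score-level automatic metrics named above, via scikit-learn; the paper's Consistency metric compares generated feedback against assessed scores and is not reproduced here:

```python
from sklearn.metrics import cohen_kappa_score, f1_score

def evaluate_batch(expert_scores: list[int], predicted_scores: list[int]) -> dict:
    """Score-level automatic metrics over a batch of essays."""
    return {
        "qwk": cohen_kappa_score(expert_scores, predicted_scores, weights="quadratic"),
        "weighted_f1": f1_score(expert_scores, predicted_scores, average="weighted"),
    }

# Hypothetical predictions vs. expert labels on a 1-5 scale.
print(evaluate_batch([3, 4, 2, 5, 3], [3, 4, 3, 4, 3]))
```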
Phase 5: Deployment & Integration
Deploy the validated AES and feedback generation system into your existing language learning platform or educational infrastructure. Provide training and support for educators and learners to maximize adoption and impact.
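As one possible integration point, a minimal FastAPI endpoint sketch; the route, request schema, and model stub are all hypothetical:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class EssayRequest(BaseModel):
    essay_text: str

def run_model(essay_text: str) -> str:
    """Stub standing in for the fine-tuned model; in production this would call
    an inference server (e.g., vLLM or TGI) instead."""
    return "Score: 3\nFeedback: Placeholder feedback.\nError examples:\n- placeholder"

@app.post("/score")
def score_essay(req: EssayRequest) -> dict:
    # Generate the SFE-ordered output; a real system would parse it into
    # structured fields as in the earlier parsing sketch.
    return {"raw_output": run_model(req.essay_text)}
```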
Ready to Transform Your Language Learning Program?
Unlock the full potential of AI for automated essay scoring and personalized feedback. Schedule a consultation to explore how our tailored solutions can enhance your educational initiatives.