Natural Language Processing (NLP)
Automatic Essay Scoring and Feedback Generation in Basque Language Learning
This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level. It leverages open-source models like Latxa, fine-tuning them to achieve superior performance over closed-source systems (GPT-5, Claude Sonnet 4.5) in scoring consistency and feedback quality. The research also presents a novel evaluation methodology for feedback generation, identifying a wider range of error types relevant for low-resource languages.
Deep Analysis & Enterprise Applications
The core of this research is the introduction of the first publicly available, richly annotated dataset for Basque AES and feedback generation at the CEFR C1 level. This dataset comprises 3,200 essays, each annotated by expert evaluators with criterion-specific scores (correctness, richness, coherence, cohesion, and task alignment), detailed feedback, and error examples. This rich annotation enables the development of models capable of producing both reliable scores and interpretable, criterion-specific feedback, fostering deeper awareness and more targeted language development.
The essays average 334.29 words in length. Correctness has the highest number of annotated error examples, reflecting its more objective nature. The dataset is divided into training, validation, and test sets of 2,600, 300, and 300 essays, respectively.
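To make the annotation structure concrete, here is a minimal Python sketch of what one dataset record could look like; the field names and example values are illustrative assumptions, not the released dataset's actual schema:

```python
from dataclasses import dataclass

# The five assessment criteria described above.
CRITERIA = ["correctness", "richness", "coherence", "cohesion", "task_alignment"]

@dataclass
class EssayRecord:
    """One annotated essay: expert criterion scores, feedback, and error examples."""
    essay_text: str
    scores: dict           # criterion -> expert score
    feedback: dict         # criterion -> free-text expert feedback
    error_examples: list   # text excerpts illustrating concrete errors

# Hypothetical example record (not drawn from the actual dataset).
record = EssayRecord(
    essay_text="(C1-level Basque essay text ...)",
    scores={c: 3 for c in CRITERIA},
    feedback={c: "(criterion-specific comments)" for c in CRITERIA},
    error_examples=["(excerpt showing a concrete error)"],
)

# Split sizes reported above: 2,600 train / 300 validation / 300 test.
SPLITS = {"train": 2600, "validation": 300, "test": 300}
```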
Experiments demonstrate that while fine-tuned encoder models like RoBERTa-EusCrawl remain a strong baseline for criterion-based AES, supervised fine-tuning (SFT) of generative models on the new dataset yields significant performance gains. Specifically, the SFT Latxa 70B model surpassed both specialized encoder models and state-of-the-art proprietary models such as GPT-5 and Claude Sonnet 4.5 on the Correctness scoring criterion, highlighting the effectiveness of domain-specific fine-tuning.
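As a rough illustration of such a fine-tuning setup, the sketch below uses Hugging Face TRL's SFTTrainer. The model id, dataset path, data format, and hyperparameters are all assumptions for illustration, not the paper's actual configuration; the 70B model would in practice require multi-GPU training and likely parameter-efficient methods, so a smaller Latxa variant is shown here:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical JSONL file where each line carries a "text" field containing
# the prompt (essay + instructions) followed by the SFE-ordered target.
dataset = load_dataset("json", data_files={"train": "basque_aes_train.jsonl"})["train"]

trainer = SFTTrainer(
    model="HiTZ/latxa-7b-v1.2",  # assumed checkpoint id; substitute your Latxa variant
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="latxa-aes-sft",
        num_train_epochs=3,              # illustrative, not the paper's setting
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
    ),
)
trainer.train()
```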
Output configuration significantly affects model performance, with the SFE (Score, Feedback, Error-examples) ordering performing best. Generating the score first allows the model to condition the subsequent feedback and error examples on that assessment.
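A minimal sketch of how an SFE-ordered training target might be serialized; the field labels and formatting are hypothetical, as the paper's exact template is not reproduced here:

```python
def build_sfe_target(score: int, feedback: str, error_examples: list[str]) -> str:
    """Serialize the output fields in SFE order: Score, then Feedback, then
    Error-examples, so the model commits to a score before explaining it."""
    errors = "\n".join(f"- {e}" for e in error_examples)
    return (
        f"Score: {score}\n"
        f"Feedback: {feedback}\n"
        f"Error examples:\n{errors}"
    )

print(build_sfe_target(
    3,
    "Generally coherent, but verb agreement is inconsistent.",
    ["'dut' used where 'ditut' is required"],
))
```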
Our analysis of generated explanations revealed that SFT models show high consistency between generated feedback and assigned scores. The fine-tuned Latxa model also proved superior at identifying a more balanced and pedagogically relevant range of error types, whereas the closed-source models disproportionately focused on surface-level spelling and vocabulary errors, possibly due to OCR artifacts in the essays.
This resource and benchmark establish a foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages such as Basque, promoting open frameworks for accurate and pedagogically sound feedback that aligns with established language proficiency scales.
Model Performance Comparison (QWK)
| Model | QWK Score | Key Strengths |
|---|---|---|
| SFT Latxa 70B | 57.23 | Best overall; balanced, pedagogically relevant error identification |
| GPT-5 | 27.23 | State-of-the-art closed-source model; tends toward surface-level errors |
| Claude Sonnet 4.5 | 18.93 | State-of-the-art closed-source model; tends toward surface-level errors |
| RoBERTa-EusCrawl | 43.82 | Strong fine-tuned encoder baseline for criterion-based AES |
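For reference, QWK (Quadratic Weighted Kappa) measures agreement between model and expert scores, penalizing larger disagreements quadratically; the table reports it on a 0–100 scale. A minimal computation with scikit-learn, using made-up scores:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical expert vs. model scores on a 1-5 criterion scale.
expert = [3, 4, 2, 5, 3, 4, 1, 3]
model  = [3, 4, 3, 4, 3, 4, 2, 3]

# weights="quadratic" yields Quadratic Weighted Kappa (QWK).
qwk = cohen_kappa_score(expert, model, weights="quadratic")
print(f"QWK: {qwk * 100:.2f}")  # scaled to 0-100 to match the table above
```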
Impact in Low-Resource Language Education
The development of AES and feedback generation systems for Basque addresses a critical gap in educational technology for less-resourced languages.
Challenge: Traditional manual evaluation is time-consuming and costly, limiting scalability and consistent feedback for Basque learners.
Solution: By fine-tuning open-source LLMs like Latxa on a newly created, richly annotated Basque C1 dataset, the research enables automated, high-quality, and pedagogically relevant feedback.
Outcome: This leads to improved learning outcomes for Basque learners, fostering deeper language awareness and more targeted development, bridging the gap with more resourced languages.
AI Implementation Roadmap
A phased approach to integrating advanced essay scoring and feedback generation into your educational or enterprise systems.
Phase 1: Data Acquisition & Preprocessing
Gather and digitize existing essay data, perform OCR, and align with CEFR C1 criteria. Annotate with expert feedback and error examples to build a robust training dataset specific to your language context.
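If essays arrive as scans, the digitization step could look like the following pytesseract sketch; it assumes Tesseract and its Basque language pack (`eus`) are installed, and the directory layout is hypothetical:

```python
from pathlib import Path

import pytesseract
from PIL import Image

def digitize_essays(scan_dir: str) -> dict[str, str]:
    """OCR each scanned essay image into text; Tesseract's Basque model is 'eus'."""
    texts = {}
    for img_path in Path(scan_dir).glob("*.png"):
        texts[img_path.stem] = pytesseract.image_to_string(
            Image.open(img_path), lang="eus"
        )
    return texts

# Example: digitize_essays("scans/") -> {"essay_001": "Euskarazko testua ...", ...}
```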
Phase 2: Model Fine-tuning & Adaptation
Select and fine-tune open-source LLMs (e.g., Latxa) on your curated dataset. Optimize for criterion-specific scoring (correctness, coherence) and experiment with output-field orderings; in this research, the SFE ordering maximized performance and consistency.
Phase 3: Feedback System Development
Integrate the fine-tuned model into a feedback generation system. Develop modules for extracting specific error examples and categorizing error types, ensuring pedagogically relevant insights.
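A sketch of parsing an SFE-ordered generation back into structured fields, assuming the illustrative serialization shown earlier; a production parser would need to tolerate format drift:

```python
import re

def parse_sfe_output(text: str) -> dict:
    """Split a generated SFE response into score, feedback, and error examples."""
    score = re.search(r"Score:\s*(\d+)", text)
    feedback = re.search(r"Feedback:\s*(.+?)(?:\nError examples:|\Z)", text, re.DOTALL)
    errors = []
    if "Error examples:" in text:
        block = text.split("Error examples:", 1)[1]
        errors = [line.lstrip("- ").strip() for line in block.splitlines() if line.strip()]
    return {
        "score": int(score.group(1)) if score else None,
        "feedback": feedback.group(1).strip() if feedback else "",
        "error_examples": errors,
    }
```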
Phase 4: Evaluation & Validation
Implement a novel evaluation methodology combining automatic metrics (QWK, Weighted-F1, Consistency) with expert human validation for feedback quality and error identification. Iterate based on expert annotator feedback.
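A minimal harness for the score-level automatic metrics named above, via scikit-learn; the paper's Consistency metric compares generated feedback against assessed scores and is not reproduced here:

```python
from sklearn.metrics import cohen_kappa_score, f1_score

def evaluate_batch(expert_scores: list[int], predicted_scores: list[int]) -> dict:
    """Score-level automatic metrics over a batch of essays."""
    return {
        "qwk": cohen_kappa_score(expert_scores, predicted_scores, weights="quadratic"),
        "weighted_f1": f1_score(expert_scores, predicted_scores, average="weighted"),
    }

# Hypothetical predictions vs. expert labels on a 1-5 scale.
print(evaluate_batch([3, 4, 2, 5, 3], [3, 4, 3, 4, 3]))
```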
Phase 5: Deployment & Integration
Deploy the validated AES and feedback generation system into your existing language learning platform or educational infrastructure. Provide training and support for educators and learners to maximize adoption and impact.
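As one possible integration point, a minimal FastAPI endpoint sketch; the route, request schema, and model stub are all hypothetical:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class EssayRequest(BaseModel):
    essay_text: str

def run_model(essay_text: str) -> str:
    """Stub standing in for the fine-tuned model; in production this would call
    an inference server (e.g., vLLM or TGI) instead."""
    return "Score: 3\nFeedback: Placeholder feedback.\nError examples:\n- placeholder"

@app.post("/score")
def score_essay(req: EssayRequest) -> dict:
    # Generate the SFE-ordered output; a real system would parse it into
    # structured fields as in the earlier parsing sketch.
    return {"raw_output": run_model(req.essay_text)}
```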
Ready to Transform Your Language Learning Program?
Unlock the full potential of AI for automated essay scoring and personalized feedback. Schedule a consultation to explore how our tailored solutions can enhance your educational initiatives.