Research Analysis: NLP & Low-Resource Languages

Hazawi+: A Structured Corpus in Kuwaiti Arabic Stories

Authors: FATEMAH HUSAIN, MOHAMMAD ALENEZI, NIZAR HABASH

Published: 06 March 2026

The availability of high-quality corpora is foundational for advancements in Natural Language Processing (NLP), enabling the training and rigorous evaluation of computational models. While rich textual resources exist for high-resource languages, a significant scarcity persists for many natural languages, particularly understudied Arabic dialects such as Kuwaiti Arabic (KA). This paper introduces Hazawi+, a multi-domain textual corpus comprising over 7 million tokens of KA dialectal stories and novels. Unlike social network texts, Hazawi+ is specifically designed to capture the rich linguistic features inherent in narrative texts, including morphological complexity, informal syntax, and pragmatic nuances, making it an invaluable resource for developing NLP models in low-resource settings. The entire corpus underwent automatic morphological annotation using CAMeL Tools specialized for Gulf Arabic, with annotation quality subsequently validated through a rigorous manual review of a 105,770-token sample by two language experts. To demonstrate Hazawi+'s immediate usability, we present a complementary empirical study involving the programmatic generation of a synthetic dataset of KA stories, which is then utilized in a downstream task to train a classifier capable of distinguishing between human-written and bot-generated narratives. This experiment serves as a crucial proof-of-concept, underscoring Hazawi+'s potential to provide researchers with deep insights into dialectal linguistic patterns and to significantly enhance the precision of various language processing tasks for the Kuwaiti Arabic dialect.

Executive Impact & Key Findings

The Hazawi+ corpus provides critical foundational resources for advancing Natural Language Processing in Kuwaiti Arabic. Here's how this research empowers enterprise AI initiatives:

7.1M Tokens Total Corpus Size

105,770 Manually Annotated Tokens

1.00 Bot Detection F1 Score

Deep Analysis & Enterprise Applications

Corpus Size

7.1M Tokens in Hazawi+

Hazawi+ comprises over 7 million tokens of Kuwaiti Arabic dialectal stories and novels, making it a significant resource for low-resource language NLP. This scale addresses the scarcity of resources for understudied Arabic dialects.

Enterprise Process Flow

Collect Stories & Novels (Survey, Gumar) → Initial Data Cleaning & Filtering → Text Pre-processing (Punctuation Retained) → Automatic Morphological Annotation (CAMeL) → Manual Annotation & Validation (Experts) → Error Analysis & Refinement

The development of Hazawi+ followed a structured process, from initial data collection and cleaning through automatic annotation to manual validation and error analysis, to produce high-quality linguistic resources for Kuwaiti Arabic. Narrative-critical elements such as punctuation were deliberately retained during pre-processing.
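The cleaning and pre-processing steps can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names `clean_text` and `tokenize_keep_punct` are hypothetical, and the only rules shown (whitespace collapsing, punctuation kept as separate tokens) are assumptions chosen to illustrate the "punctuation retained" step.

```python
import re

# Hypothetical helpers; the rules here are illustrative assumptions,
# not the cleaning rules actually applied to Hazawi+.
_WHITESPACE = re.compile(r"\s+")
# A token is a run of word characters (plus the Arabic diacritics
# U+064B..U+0652, which Python's \w does not cover) or one punctuation mark.
_TOKEN = re.compile(r"[\w\u064B-\u0652]+|[^\w\s]")


def clean_text(raw):
    """Collapse runs of whitespace and trim both ends."""
    return _WHITESPACE.sub(" ", raw).strip()


def tokenize_keep_punct(text):
    """Split into tokens, keeping each punctuation mark as its own token."""
    return _TOKEN.findall(text)


# Toy sentence for illustration only, not taken from the corpus.
sample = "شلونك؟   أنا  بخير، الحمدلله!"
tokens = tokenize_keep_punct(clean_text(sample))
print(len(tokens), tokens)
```

Keeping punctuation as tokens preserves sentence boundaries and dialogue cues, which matter more in narrative text than in short social media posts.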

Feature	Hazawi+	Other KA Corpora (e.g., X posts)
Content Type	Narrative stories & novels (coherent, context-rich)	Fragmented social media posts (short, less context)
Data Quality	Curated, minimally edited, high authenticity	Noisy, spam, emojis, character repetition, less curated
Linguistic Features	Morphological complexity, informal syntax, pragmatic nuances (narrative)	Shorter texts, less dependency, lack context
Ethical Concerns	Informed consent via surveys, privacy-focused	Privacy issues with scraping social media
Resource Size	7.1M tokens	Smaller scale (e.g., 340K posts, 16.6K posts)
Annotation Depth	Automatic & Manual POS/CODA, morphological features	Sentiment labels, some morphological (often automatic)

Hazawi+ distinguishes itself from other Kuwaiti Arabic corpora by focusing on rich, narrative-based content collected ethically, offering deeper linguistic insights and higher data quality compared to typical social media datasets.

Manual Annotation Coverage

105,770 Manually Annotated Tokens (1.5% of corpus)

A subset of the corpus (105,770 tokens, about 1.5% of the total) was manually annotated by two language experts for POS tags and CODA rules, validating the automatic tagging process and enabling detailed error analysis.
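The 1.5% coverage figure can be checked with a short calculation. The total below uses the rounded "7.1M" figure, since the exact token count is not given here, so the result is approximate.

```python
manual_tokens = 105_770
total_tokens = 7_100_000  # rounded "7.1M" figure; exact count is an assumption

coverage = manual_tokens / total_tokens
print(f"{coverage:.2%}")  # about 1.5% once rounded to one decimal place
```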

Feature	Kuwaiti Arabic (KA)	Modern Standard Arabic (MSA)
Orthography	No standard rules, inconsistent spelling (phonology-based)	Strict prescriptive rules, standardized spelling
Vocabulary	Includes many unique words, idioms, cultural expressions	Formal vocabulary, lacks dialect-specific terms
Diacritics	Lacks diacritics in online texts, context-dependent meaning	Relies on diacritics for disambiguation
Tool Availability	Limited specialized NLP tools and resources	Rich textual resources, extensive NLP tools
Text Source	User-generated content (social media, stories), informal, narrative	Formal media, literature, official communications
Ambiguity	High due to lack of diacritics, context crucial for POS	Reduced by diacritics and grammatical rules

Processing Kuwaiti Arabic presents unique challenges compared to Modern Standard Arabic due to significant differences in phonology, morphology, lexicon, and the absence of prescriptive rules. Hazawi+ directly addresses these specific needs.

Demonstrating Usability: Human vs. Bot Narrative Classification

Scenario: To showcase Hazawi+'s immediate usability, a synthetic dataset of KA stories was programmatically generated using derived frequency tables. A logistic regression classifier was then trained to differentiate between human-written and bot-generated narratives.
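One way such a generator might look is sketched below. The exact procedure used in the study is not specified here, so everything in the block is an assumption: the function names `build_frequency_table` and `generate_bot_story` are hypothetical, the input stories are toy English placeholders, and words are drawn independently in proportion to their frequency.

```python
import random
from collections import Counter


def build_frequency_table(stories):
    """Count whitespace-separated tokens across all human-written stories."""
    counts = Counter()
    for story in stories:
        counts.update(story.split())
    return counts


def generate_bot_story(freq, n_words, rng):
    """Sample words independently, weighted by their corpus frequency."""
    words = list(freq)
    weights = [freq[w] for w in words]
    return " ".join(rng.choices(words, weights=weights, k=n_words))


# Toy English placeholders standing in for human-written KA stories.
human_stories = [
    "the old fisherman told the children a story about the sea",
    "the children listened and asked the fisherman for another story",
]
freq = build_frequency_table(human_stories)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
bot_story = generate_bot_story(freq, n_words=8, rng=rng)
print(bot_story)
```

Because each word is sampled without regard to its neighbors, the output matches the source vocabulary but has no coherent syntax, which is the kind of contrast a classifier can exploit easily.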

Findings:

The classifier achieved perfect scores (Accuracy: 1.00, Precision: 1.00, Recall: 1.00, F1-Score: 1.00) in distinguishing between human and bot-generated stories.
This outcome is attributed to the distinct lexical characteristics, sentence structures, and linguistic patterns of the simple bot-generated stories (randomly assembled from word pools) compared to the coherent human-written data from Hazawi+.
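For reference, the four reported scores can be computed from predictions as in the sketch below. The helper `binary_metrics` and the toy labels are hypothetical; treating "bot" as the positive class is an assumption, since the positive class is not stated here.

```python
def binary_metrics(y_true, y_pred, positive="bot"):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("need at least one prediction")
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels for illustration only.
y_true = ["human", "human", "bot", "bot"]
perfect = binary_metrics(y_true, ["human", "human", "bot", "bot"])
one_miss = binary_metrics(y_true, ["human", "human", "bot", "human"])
print(perfect)
print(one_miss)
```

All four scores equal 1.00 only when every story is classified correctly; a single missed bot story in this toy set already lowers recall to 0.5.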

Implications: This experiment serves as a crucial proof-of-concept, underscoring Hazawi+'s potential to provide deep insights into dialectal linguistic patterns and significantly enhance various language processing tasks for Kuwaiti Arabic, particularly for evaluating generative models. Future work will involve more sophisticated bot generation.

An empirical study demonstrated Hazawi+'s immediate usability by training a classifier to distinguish human-written from programmatically generated Kuwaiti Arabic narratives. The perfect scores reflect how sharply the randomly assembled synthetic stories differ from coherent human-written text.

Inter-Annotator Agreement (POS)

91% POS Tagging Agreement Rate

Inter-annotator agreement (IAA) for part-of-speech (POS) tagging between the two expert annotators reached 91%, indicating strong consistency, although multi-part tokens and contextual ambiguity in dialectal text remain sources of disagreement.
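An agreement rate of this kind can be computed as the share of tokens on which both annotators chose the same tag. The sketch below assumes simple percent agreement; the agreement measure actually used is not stated here, and the function name and toy tags are hypothetical.

```python
def percent_agreement(tags_a, tags_b):
    """Fraction of positions where two annotators assigned the same tag."""
    if len(tags_a) != len(tags_b):
        raise ValueError("annotations must cover the same tokens")
    if not tags_a:
        raise ValueError("need at least one annotated token")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


# Toy tags for five tokens; the annotators disagree on the third one.
annotator_a = ["NOUN", "VERB", "ADJ", "NOUN", "PREP"]
annotator_b = ["NOUN", "VERB", "NOUN", "NOUN", "PREP"]
agreement = percent_agreement(annotator_a, annotator_b)
print(f"{agreement:.0%}")
```

Percent agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's kappa would typically give a lower value on the same data.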

Inter-Annotator Agreement (CODA)

96% CODA Annotation Agreement Rate

IAA for Conventional Orthography for Dialectal Arabic (CODA) annotation reached 96%, a very high level of agreement. This consistency is aided by CODA's focus on standardized writing without diacritics, despite dialectal variations.

Your AI Implementation Roadmap

A structured approach to integrating Kuwaiti Arabic NLP into your enterprise, leveraging cutting-edge research and tailored solutions.

Phase 1: Discovery & Strategy

Comprehensive assessment of current processes, data infrastructure, and business objectives. Develop a tailored AI strategy and roadmap aligned with enterprise goals, identifying key use cases for Kuwaiti Arabic NLP leveraging Hazawi+.

Phase 2: Data Preparation & Model Training

Utilize Hazawi+ for advanced linguistic feature extraction, pre-training, and fine-tuning of models for Kuwaiti Arabic. Incorporate specialized datasets and augment with real-world KA data where necessary for tasks like sentiment analysis, entity recognition, or narrative understanding.

Phase 3: Custom Model Development & Integration

Develop and train custom AI models specific to identified use cases, such as automated story analysis, dialect classification, or content generation in KA. Integrate these models with existing enterprise systems and workflows, ensuring seamless operation and scalability.

Phase 4: Deployment & Optimization

Deploy AI solutions into a production environment. Monitor performance, collect feedback, and continuously refine models for improved accuracy, efficiency, and cultural relevance within the Kuwaiti context. Implement A/B testing and iterative improvements.

Phase 5: Performance Monitoring & Iteration

Establish robust monitoring systems for AI model performance and business impact. Conduct regular reviews to identify new opportunities for AI integration, ensuring sustained value and adaptation to evolving language patterns and business needs.
