Enterprise AI Analysis: Hazawi+: A Structured Corpus in Kuwaiti Arabic Stories

Research Analysis: NLP & Low-Resource Languages

Hazawi+: A Structured Corpus in Kuwaiti Arabic Stories

Authors: FATEMAH HUSAIN, MOHAMMAD ALENEZI, NIZAR HABASH

Published: 06 March 2026

The availability of high-quality corpora is foundational for advancements in Natural Language Processing (NLP), enabling the training and rigorous evaluation of computational models. While rich textual resources exist for high-resource languages, a significant scarcity persists for many natural languages, particularly understudied Arabic dialects such as Kuwaiti Arabic (KA). This paper introduces Hazawi+, a multi-domain textual corpus comprising over 7 million tokens of KA dialectal stories and novels. Unlike social network texts, Hazawi+ is specifically designed to capture the rich linguistic features inherent in narrative texts, including morphological complexity, informal syntax, and pragmatic nuances, making it an invaluable resource for developing NLP models in low-resource settings. The entire corpus underwent automatic morphological annotation using CAMeL Tools specialized for Gulf Arabic, with annotation quality subsequently validated through a rigorous manual review of a 105,770-token sample by two language experts. To demonstrate Hazawi+'s immediate usability, we present a complementary empirical study involving the programmatic generation of a synthetic dataset of KA stories, which is then utilized in a downstream task to train a classifier capable of distinguishing between human-written and bot-generated narratives. This experiment serves as a crucial proof-of-concept, underscoring Hazawi+'s potential to provide researchers with deep insights into dialectal linguistic patterns and to significantly enhance the precision of various language processing tasks for the Kuwaiti Arabic dialect.

Executive Impact & Key Findings

The Hazawi+ corpus provides critical foundational resources for advancing Natural Language Processing in Kuwaiti Arabic. Here's how this research empowers enterprise AI initiatives:

7.1M Total Corpus Size (tokens)
105,770 Manually Annotated Tokens
1.00 Bot Detection F1 Score

Deep Analysis & Enterprise Applications


Corpus Size

7.1M Tokens in Hazawi+

Hazawi+ comprises over 7 million tokens of Kuwaiti Arabic dialectal stories and novels, making it a significant resource for low-resource language NLP. This scale addresses the scarcity of resources for understudied Arabic dialects.

Enterprise Process Flow

Collect Stories & Novels (Survey, Gumar)
Initial Data Cleaning & Filtering
Text Pre-processing (Punctuation Retained)
Automatic Morphological Annotation (CAMeL)
Manual Annotation & Validation (Experts)
Error Analysis & Refinement

The development of Hazawi+ followed a structured process, from initial data collection and cleaning to automatic and manual annotation, ensuring high-quality linguistic resources for Kuwaiti Arabic. Pre-processing deliberately retained punctuation, which carries narrative-critical information.
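The pre-processing step in the flow above can be illustrated with a minimal sketch. It is not the authors' script: the function name `preprocess`, the regular expression, and the sample sentence are all assumptions made for illustration. The only point it demonstrates is that punctuation is split into its own tokens rather than deleted.

```python
import re

# Hypothetical tokenizer, assuming undiacritized input text.
# A token is either a run of word characters or one punctuation mark.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def preprocess(text: str) -> list[str]:
    """Collapse whitespace and keep punctuation as separate tokens."""
    text = re.sub(r"\s+", " ", text).strip()
    return _TOKEN_RE.findall(text)

# Illustrative sample sentence (not taken from the corpus).
tokens = preprocess("شلونك؟  زين،  الحمدلله.")
print(tokens)
```

Keeping the Arabic question mark and comma as tokens means that sentence boundaries and dialogue cues remain available to later annotation steps.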

Feature comparison: Hazawi+ vs. other KA corpora (e.g., X posts)
Content Type
  • Hazawi+: Narrative stories & novels (coherent, context-rich)
  • Other KA corpora: Fragmented social media posts (short, less context)
Data Quality
  • Hazawi+: Curated, minimally edited, high authenticity
  • Other KA corpora: Noisy, spam, emojis, character repetition, less curated
Linguistic Features
  • Hazawi+: Morphological complexity, informal syntax, pragmatic nuances (narrative)
  • Other KA corpora: Shorter texts, less dependency, lack of context
Ethical Concerns
  • Hazawi+: Informed consent via surveys, privacy-focused
  • Other KA corpora: Privacy issues with scraping social media
Resource Size
  • Hazawi+: 7.1M tokens
  • Other KA corpora: Smaller scale (e.g., 340K posts, 16.6K posts)
Annotation Depth
  • Hazawi+: Automatic & manual POS/CODA, morphological features
  • Other KA corpora: Sentiment labels, some morphological (often automatic)

Hazawi+ distinguishes itself from other Kuwaiti Arabic corpora by focusing on rich, narrative-based content collected ethically, offering deeper linguistic insights and higher data quality compared to typical social media datasets.

Manual Annotation Coverage

105,770 Manually Annotated Tokens (1.5% of corpus)

A subset of the corpus (105,770 tokens, about 1.5% of the total) was manually annotated by two language experts for POS tags and CODA rules, validating the automatic tagging process and enabling detailed error analysis.
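The 1.5% figure can be checked with a short calculation. This is only a sanity check under one assumption: the corpus total is taken as the rounded 7.1M tokens reported on this page, since the exact token count is not given here.

```python
manual_tokens = 105_770
corpus_tokens = 7_100_000  # assumption: rounded total; exact count not stated here

share = manual_tokens / corpus_tokens
print(f"{share:.2%}")  # 1.49%, which rounds to the reported 1.5%
```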

Feature comparison: Kuwaiti Arabic (KA) vs. Modern Standard Arabic (MSA)
Orthography
  • KA: No standard rules, inconsistent spelling (phonology-based)
  • MSA: Strict prescriptive rules, standardized spelling
Vocabulary
  • KA: Includes many unique words, idioms, cultural expressions
  • MSA: Formal vocabulary, lacks dialect-specific terms
Diacritics
  • KA: Lacks diacritics in online texts, context-dependent meaning
  • MSA: Relies on diacritics for disambiguation
Tool Availability
  • KA: Limited specialized NLP tools and resources
  • MSA: Rich textual resources, extensive NLP tools
Text Source
  • KA: User-generated content (social media, stories), informal, narrative
  • MSA: Formal media, literature, official communications
Ambiguity
  • KA: High due to lack of diacritics, context crucial for POS
  • MSA: Reduced by diacritics and grammatical rules

Processing Kuwaiti Arabic presents unique challenges compared to Modern Standard Arabic due to significant differences in phonology, morphology, lexicon, and the absence of prescriptive rules. Hazawi+ directly addresses these specific needs.
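The ambiguity caused by missing diacritics can be shown with a small sketch. The helper `strip_diacritics` is hypothetical, and the two example words are standard textbook forms chosen for illustration, not items from Hazawi+. Once the short-vowel marks are removed, a verb and a noun collapse into the same written string, so only context can tell them apart.

```python
import re

# Arabic marks from fathatan (U+064B) through sukun (U+0652).
_DIACRITICS = re.compile("[\u064B-\u0652]")

def strip_diacritics(word: str) -> str:
    """Remove short-vowel and related marks, as in typical online text."""
    return _DIACRITICS.sub("", word)

kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "he wrote" (verb)
kutub = "\u0643\u064F\u062A\u064F\u0628"         # "books" (noun)

# Both forms reduce to the same three letters once diacritics are dropped.
print(strip_diacritics(kataba) == strip_diacritics(kutub))  # True
```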

Demonstrating Usability: Human vs. Bot Narrative Classification

Scenario: To showcase Hazawi+'s immediate usability, a synthetic dataset of KA stories was programmatically generated using derived frequency tables. A logistic regression classifier was then trained to differentiate between human-written and bot-generated narratives.
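One plausible reading of "programmatically generated using derived frequency tables" is sketched below. The function names, the whitespace tokenization, and the placeholder tokens (`w1`, `w2`, ...) are assumptions; the study's actual generation procedure may differ. The sketch counts word frequencies in human-written stories and then samples words independently in proportion to those counts.

```python
import random
from collections import Counter

def build_frequency_table(stories):
    """Count how often each whitespace-separated token occurs."""
    counts = Counter()
    for story in stories:
        counts.update(story.split())
    return counts

def generate_bot_story(freq, length, rng):
    """Sample `length` words independently, weighted by frequency."""
    words = list(freq.keys())
    weights = list(freq.values())
    return " ".join(rng.choices(words, weights=weights, k=length))

# Placeholder tokens stand in for real corpus text.
table = build_frequency_table(["w1 w2 w2", "w2 w3"])
print(generate_bot_story(table, 6, random.Random(0)))
```

Because each word is drawn independently, the output reproduces vocabulary statistics but not word order or coherence, which is the kind of gap a classifier can exploit.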

Findings:

  • The classifier achieved perfect scores (Accuracy: 1.00, Precision: 1.00, Recall: 1.00, F1-Score: 1.00) in distinguishing between human and bot-generated stories.
  • This outcome is attributed to the distinct lexical characteristics, sentence structures, and linguistic patterns of the simple bot-generated stories (randomly assembled from word pools) compared to the coherent human-written data from Hazawi+.

Implications: This experiment serves as a crucial proof-of-concept, underscoring Hazawi+'s potential to provide deep insights into dialectal linguistic patterns and significantly enhance various language processing tasks for Kuwaiti Arabic, particularly for evaluating generative models. Future work will involve more sophisticated bot generation.

An empirical study demonstrated Hazawi+'s immediate usability by training a classifier to distinguish between human-written and programmatically generated Kuwaiti Arabic narratives. The perfect scores reflect how sharply the simple generated stories differ from coherent human-written text.
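The classification step can be sketched as a small logistic regression written from scratch. Everything specific here is an assumption: the bag-of-words features, the learning rate, the number of epochs, and the placeholder tokens are illustrative, and the study's own feature set and training setup are not described on this page. Label 1 stands for bot-generated and 0 for human-written.

```python
import math
from collections import Counter

def featurize(text):
    """Bag-of-words counts (assumed feature set, for illustration only)."""
    return Counter(text.split())

def predict_proba(weights, bias, feats):
    """Probability that a story is bot-generated."""
    z = bias + sum(weights.get(tok, 0.0) * cnt for tok, cnt in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(texts, labels, epochs=200, lr=0.5):
    """Fit weights with plain stochastic gradient descent on log loss."""
    weights, bias = {}, 0.0
    data = [(featurize(t), y) for t, y in zip(texts, labels)]
    for _ in range(epochs):
        for feats, y in data:
            err = predict_proba(weights, bias, feats) - y
            bias -= lr * err
            for tok, cnt in feats.items():
                weights[tok] = weights.get(tok, 0.0) - lr * err * cnt
    return weights, bias

# Toy, trivially separable placeholder data; not corpus text.
texts = ["h1 h2 h3", "h2 h3 h4", "b1 b2 b1", "b2 b3 b3"]
labels = [0, 0, 1, 1]
weights, bias = train(texts, labels)
print([round(predict_proba(weights, bias, featurize(t))) for t in texts])
```

On data this easy to separate, any linear classifier fits the training set exactly, which mirrors why simple randomly assembled stories are easy to detect and why more sophisticated generation is the harder test.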

Inter-Annotator Agreement (POS)

91% POS Tagging Agreement Rate

The Inter-Annotator Agreement (IAA) for Part-Of-Speech (POS) tagging between two expert annotators achieved 91%, indicating strong consistency, though challenges exist with multi-part tokens and contextual ambiguity in dialectal text.
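How an agreement rate of this kind is computed can be sketched as follows. The page does not say whether the reported 91% is raw percent agreement or a chance-corrected statistic, so the sketch shows both: observed agreement and Cohen's kappa. The function names and the toy tag sequences are assumptions for illustration.

```python
from collections import Counter

def observed_agreement(a, b):
    """Fraction of tokens on which two annotators chose the same tag."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    p_o = observed_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[tag] * count_b[tag] for tag in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical tag throughout
    return (p_o - p_e) / (1.0 - p_e)

# Toy POS tags for four tokens (not from the annotated sample).
ann1 = ["NOUN", "VERB", "NOUN", "ADJ"]
ann2 = ["NOUN", "VERB", "ADJ", "ADJ"]
print(observed_agreement(ann1, ann2))  # 0.75
print(round(cohen_kappa(ann1, ann2), 3))  # 0.636
```

Kappa is lower than raw agreement because some matches would occur even if both annotators tagged at random with the same tag frequencies.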

Inter-Annotator Agreement (CODA)

96% CODA Annotation Agreement Rate

The IAA for Conventional Orthography for Dialectal Arabic (CODA) tagging achieved 96%, showing very high agreement. This consistency is aided by CODA's focus on standardized writing without diacritics, despite dialectal variations.


Your AI Implementation Roadmap

A structured approach to integrating Kuwaiti Arabic NLP into your enterprise, leveraging cutting-edge research and tailored solutions.

Phase 1: Discovery & Strategy

Comprehensive assessment of current processes, data infrastructure, and business objectives. Develop a tailored AI strategy and roadmap aligned with enterprise goals, identifying key use cases for Kuwaiti Arabic NLP leveraging Hazawi+.

Phase 2: Data Preparation & Model Training

Utilize Hazawi+ for advanced linguistic feature extraction, pre-training, and fine-tuning of models for Kuwaiti Arabic. Incorporate specialized datasets and augment with real-world KA data where necessary for tasks like sentiment analysis, entity recognition, or narrative understanding.

Phase 3: Custom Model Development & Integration

Develop and train custom AI models specific to identified use cases, such as automated story analysis, dialect classification, or content generation in KA. Integrate these models with existing enterprise systems and workflows, ensuring seamless operation and scalability.

Phase 4: Deployment & Optimization

Deploy AI solutions into a production environment. Monitor performance, collect feedback, and continuously refine models for improved accuracy, efficiency, and cultural relevance within the Kuwaiti context. Implement A/B testing and iterative improvements.

Phase 5: Performance Monitoring & Iteration

Establish robust monitoring systems for AI model performance and business impact. Conduct regular reviews to identify new opportunities for AI integration, ensuring sustained value and adaptation to evolving language patterns and business needs.
