Enterprise AI Analysis: High-Risk Memories? Comparative audit of the representation of Second World War atrocities in Ukraine by generative AI applications


This paper investigates how generative AI (genAI) applications represent and potentially misrepresent high-risk memories, specifically Second World War atrocities in Ukraine. It audits three common genAI applications for historical misrepresentation, including hallucinations and inconsistent moralization, across different languages and atrocity types. The findings highlight significant inaccuracies and ethical concerns, especially for less-known memories and lower-resource languages.

Executive Impact: Key Findings for Enterprise Leaders

Generative AI models demonstrate limited and inconsistent accuracy in representing high-risk historical memories, particularly WWII atrocities in Ukraine. A significant portion of responses contains factual inaccuracies and hallucinations, especially in lower-resource languages. While moralizing statements are present, their inclusion is inconsistent across applications, languages, and specific historical episodes, undermining genAI's perceived authority and creating a skewed moral hierarchy. This poses substantial risks for historical misrepresentation and instrumentalization.

~50% Approximate Accuracy Rate for GenAI on Historical Facts
~30% Accuracy Rate for Lower-Resource Languages (Bard Russian, Bing Chat Ukrainian)
50%+ Bard Outputs Containing Hallucinations (Overall)
50%+ ChatGPT Responses with Moralizing Statements (English/Russian)

Deep Analysis & Enterprise Applications


Generative AI and the Distortion of High-Risk Memories

GenAI accelerates content production but risks historical misrepresentation, from distorting facts and depicting groups inaccurately to subtle selective moralization. This is critical for high-risk memories, like WWII atrocities, which carry strong emotional loads and are often instrumentalized politically. Misrepresentation is heightened by genAI's probabilistic 'memory,' which reiterates contradictory interpretations from its training data, especially for contested high-risk memories without external safeguards or ethical guidelines. Hallucinations and moralizing statements, though sometimes intended for safety, can misleadingly attribute moral authority to AI and create skewed historical narratives.

50% Approximate accuracy rate for genAI responses on historical facts compared to human baseline.

Enterprise Process Flow

GenAI Content Production → Information Discovery Alteration → Historical Misrepresentation Risk → Distortion/Denial Amplification → Ethical Obligations Challenged
Aspect | Human Memory (Contrast) | GenAI (Risk)
Nature of Memory | Cognitive (encoding, storage, retrieval) | Probabilistic (next-token prediction)
Understanding Context | Entangled with social practices, negotiated truth | Limited understanding of historical accuracy/ethics
Misrepresentation Source | Intentional manipulation, selective recall | Training-data deficiencies, unintentional reiteration of conflicts, hallucinations
Ethical Framework | Societally negotiated ethical obligations | Relies on explicit developer specification; default is probabilistic output

Auditing GenAI: Accuracy, Hallucinations, and Language Variation

Empirical audits reveal that genAI applications struggle with high-risk memories. Only about 50% of responses align with human baselines on specific historical facts, a rate that significantly decreases for lower-resource languages like Ukrainian and Russian (often 30% or less). Hallucinations are common, with Bard producing them in over half its outputs, particularly for Ukrainian prompts. This highlights how inadequate knowledge bases in certain languages exacerbate misleading claims, even without adversarial intent. Such variations mean users in specific linguistic contexts are far more likely to receive inaccurate information.

30% Accuracy rate for Bard in Russian and Bing Chat in Ukrainian on historical facts, indicating significant language dependency.
Application | English Accuracy (Approx.) | Ukrainian/Russian Accuracy (Approx.) | Hallucination Tendency
Bard | 70% (Holocaust general) | 30% (Russian), 50% (Ukrainian) | High (50%+ overall)
ChatGPT | 50-70% (stable) | 40-50% (less dramatic drop) | Moderate (partially correct responses common)
Bing Chat | 50-60% (Holocaust general) | 30% (Ukrainian), 50% (Russian) | Moderate (irrelevant responses, fewer hallucinations)
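The per-language accuracy comparison behind these figures can be reproduced with a simple scoring harness. A minimal sketch in Python, assuming each audited response has already been judged against a human expert baseline; the record format and sample data below are illustrative, not the paper's actual dataset:

```python
from collections import defaultdict

def accuracy_by_app_and_language(records):
    """records: iterable of dicts with keys 'app', 'lang', 'verdict',
    where 'verdict' is 'correct', 'partial', or 'incorrect' as judged
    against a human expert baseline."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        key = (r["app"], r["lang"])
        totals[key] += 1
        if r["verdict"] == "correct":
            correct[key] += 1
    # Accuracy per (application, language) pair
    return {key: correct[key] / totals[key] for key in totals}

# Illustrative records (not real audit data):
records = [
    {"app": "Bard", "lang": "uk", "verdict": "incorrect"},
    {"app": "Bard", "lang": "uk", "verdict": "correct"},
    {"app": "Bard", "lang": "en", "verdict": "correct"},
    {"app": "Bard", "lang": "en", "verdict": "correct"},
]
rates = accuracy_by_app_and_language(records)
print(rates[("Bard", "uk")])  # 0.5
```

Grouping verdicts by (application, language) is what exposes the language dependency the audit reports: the same prompt set yields markedly different accuracy keys for English versus Ukrainian or Russian.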

Case Study: Stepan Bandera and Lviv Pogrom

Bard, when prompted in Ukrainian, produced multiple misleading claims about Stepan Bandera, an anti-Soviet resistance leader. It incorrectly stated that he rejected Nazi ideas and was arrested in July 1941 for refusing to fight the Soviet Union. Similarly, it claimed the Lviv pogrom in 1941 was the 'only time' Ukrainians participated in killing Jews. These examples illustrate how limited knowledge bases in lower-resource languages lead to significant historical distortions and invented narratives, including non-existent testimonies.

The Selective Moral Authority of AI in Historical Narratives

GenAI applications, particularly ChatGPT, frequently include moralizing statements in responses about mass atrocities. While sometimes reinforcing ethical lessons, this moralization is often inconsistent across different applications, languages, and specific atrocity instances. For example, some atrocities are labeled 'horrific' with explicit calls to remember, while others (or the same event in a different language) receive no such moral framing. This inconsistency can lead to a skewed moral hierarchy of memories and mislead users into perceiving AI as a moral authority it does not possess, enabling the selective enforcement of standardized representation patterns often associated with the Global North.

80% Highest frequency of moralizing statements in Bard (English prompts, Polish atrocities).
Application | Moralization Frequency | Consistency Across Languages/Topics | Example Framing
ChatGPT | High (50%+ for English/Russian) | More stable (40-50%) but internal variation | Emphasizes historical records, avoids blame for entire groups
Bard | Moderate (esp. Ukrainian prompts) | Highly inconsistent (e.g., 80% English Polish atrocities vs. 20% Russian) | Uses 'dark chapters,' 'horrible tragedies,' emphasizes 'tolerance and understanding'
Bing Chat | Low (20-25%) | Inconsistent (e.g., 60%+ Ukrainian anti-Ukrainian atrocities vs. 0% English) | Uses 'tragic,' 'brutal,' occasionally 'most horrible crimes'
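Inconsistency of this kind can be surfaced automatically. A minimal sketch, assuming moralizing framing is detectable via surface markers; the marker list and the gap metric are illustrative assumptions, not the paper's coding scheme:

```python
# Flag moralizing framing by surface keyword matching (illustrative markers).
MORAL_MARKERS = ("horrific", "tragic", "dark chapter", "never forget",
                 "tolerance and understanding")

def has_moralizing_statement(text: str) -> bool:
    lower = text.lower()
    return any(marker in lower for marker in MORAL_MARKERS)

def moralization_rate(responses: list[str]) -> float:
    """Share of responses containing at least one moralizing marker."""
    if not responses:
        return 0.0
    return sum(has_moralizing_statement(t) for t in responses) / len(responses)

def consistency_gap(rates_by_lang: dict[str, float]) -> float:
    """Spread between the most and least moralized language for one topic."""
    return max(rates_by_lang.values()) - min(rates_by_lang.values())

# Example: 80% moralization in English vs. 20% in Russian yields a 0.6 gap.
print(round(consistency_gap({"en": 0.8, "ru": 0.2}), 2))  # 0.6
```

A large gap for the same atrocity across languages is exactly the skewed moral hierarchy described above: the event is framed as a moral lesson for one audience and as a bare fact for another.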

Enterprise Process Flow

GenAI Produces Moralizing Statements → Statements Reinforce Normative Interpretations → Inconsistent Moralization Across Contexts → Skewed Moral Hierarchy Created → User Perception of AI Moral Authority Distorted

Calculate Your Potential AI Impact

Estimate the efficiency gains and cost savings for your enterprise by integrating responsible AI solutions, considering the nuanced ethical implications highlighted in this research.


Your AI Implementation Roadmap

Navigate the complexities of AI integration with a clear, phase-by-phase approach, focusing on ethical considerations and robust performance in sensitive domains.

Phase 1: Ethical Assessment & Data Audit

Conduct a comprehensive audit of existing data and systems for potential biases and misrepresentation risks, especially for sensitive historical or social data. Establish a clear 'North Star' for AI ethical behavior and memory representation.

Phase 2: Custom Model Development & Refusal Mechanisms

Develop or fine-tune AI models with specialized training data and implement refusal mechanisms for queries lacking sufficient information or posing high misrepresentation risks, aligning with the "North Star" vision.
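One way to realize such a refusal mechanism is a gate in front of generation that checks whether enough vetted evidence supports an answer. A minimal sketch, assuming a retrieval step supplies the supporting sources; all names, topics, and thresholds here are hypothetical:

```python
# Hypothetical refusal gate for high-risk historical queries.
MIN_SOURCES = 2  # illustrative evidence threshold
HIGH_RISK_TOPICS = {"wwii atrocities", "pogroms", "holocaust"}

REFUSAL = ("There is not enough verified information to answer this reliably. "
           "Please consult archival sources or historical experts.")

def answer_or_refuse(query: str, topic: str, sources: list[str], generate) -> str:
    """Refuse high-risk queries that lack sufficient vetted sources;
    otherwise delegate to the underlying model via `generate`."""
    if topic.lower() in HIGH_RISK_TOPICS and len(sources) < MIN_SOURCES:
        return REFUSAL
    return generate(query, sources)

# Usage with a stand-in generator:
reply = answer_or_refuse("Who led the 1941 Lviv pogrom?", "WWII atrocities",
                         sources=[], generate=lambda q, s: "model answer")
print(reply == REFUSAL)  # True
```

The design choice is that refusal is topic-conditional: routine queries pass through unchanged, while contested high-risk memories require evidence before the model is allowed to speak at all.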

Phase 3: Consistency & Moralization Standardization

Implement a framework to ensure consistent moralization and normative framing across different languages and contexts, preventing skewed moral hierarchies and selective enforcement of historical narratives.

Phase 4: Continuous Monitoring & Expert Oversight

Establish ongoing monitoring processes and integrate human expert oversight to detect and correct emerging misrepresentations, hallucinations, or inconsistencies in AI outputs, particularly in high-risk memory domains.
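A monitoring loop of this kind reduces to a simple escalation rule: sample outputs, measure the share flagged as inaccurate or hallucinated, and route batches above a tolerance to human experts. A minimal sketch, where the tolerance value and the upstream flagging step are assumptions for illustration:

```python
# Illustrative monitoring rule: escalate when the flagged-output rate
# in a sampled batch exceeds tolerance.
def flagged_rate(flags: list[bool]) -> float:
    """Fraction of sampled outputs flagged as problematic."""
    return sum(flags) / len(flags) if flags else 0.0

def needs_expert_review(flags: list[bool], tolerance: float = 0.10) -> bool:
    """True when the share of flagged outputs exceeds the tolerance."""
    return flagged_rate(flags) > tolerance

# A batch where 3 of 20 sampled outputs were flagged (15%) is escalated.
print(needs_expert_review([True] * 3 + [False] * 17))  # True
```

The tolerance should be set far lower for high-risk memory domains than for routine content, since even a small rate of hallucinated atrocity narratives carries outsized harm.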

Ready to Build Responsible AI for Your Enterprise?

Leverage our expertise to develop AI solutions that are accurate, ethical, and aligned with your organizational values, mitigating the risks of misrepresentation and fostering trust.

Ready to Get Started?

Book Your Free Consultation.
