AI Ethics & Safety
WHEN AGENTS PERSUADE: PROPAGANDA GENERATION AND MITIGATION IN LLMS
This research investigates the capability of Large Language Models (LLMs) to generate propagandistic content and evaluates methods for mitigating that behavior. Using domain-specific detection models, the study finds that LLMs readily produce propaganda employing rhetorical techniques such as loaded language, flag-waving, and appeals to fear. Notably, preference-based fine-tuning substantially reduces the models' propensity to generate manipulative content, with Odds Ratio Preference Optimization (ORPO) proving the most effective method evaluated.
Deep Analysis & Enterprise Applications
Propaganda Techniques: Human vs. LLM Usage
| Technique | Prevalence in Human Propaganda | Prevalence in LLM Output |
|---|---|---|
| Loaded Language | Moderate | High (emotional rhetoric) |
| Flag-Waving | Moderate | High (patriotic narratives) |
| Appeal to Fear | Moderate | High (fear-based manipulation) |
| Name-Calling | Moderate | Varied (GPT-4o near human levels; Llama/Mistral lower) |
| Exaggeration/Minimization | Moderate | High (hyperbolic content) |
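The comparison above relies on domain-specific detection models. As a rough illustration only, the sketch below shows how such screening might look with the Hugging Face `transformers` pipeline; the checkpoint name, labels, and threshold are placeholders, not the detector used in the research.

```python
# Minimal sketch of propaganda-technique detection with a multi-label
# classifier. The checkpoint name is a placeholder, not the domain-specific
# detector used in the research; label names are illustrative.
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="your-org/propaganda-technique-classifier",  # hypothetical checkpoint
    top_k=None,  # return a score for every technique label
)

article = "Only a true patriot would back this plan; anything less invites disaster."
scores = detector([article])[0]  # list of {"label", "score"} dicts

# Flag every technique whose score clears a chosen threshold.
flagged = [s["label"] for s in scores if s["score"] > 0.5]
print(flagged)  # e.g. ["flag_waving", "appeal_to_fear"]
```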
Mitigation Efficacy: ORPO's Impact
Our evaluation highlights ORPO as the most effective fine-tuning method. Compared with un-fine-tuned models, ORPO reduced the average number of propaganda techniques per article by a factor of 13.4. While prompt-level guardrails were easily overridden, baking 'no propaganda' into the model weights through ORPO proved highly robust: only 10% of outputs contained propaganda, versus 99% for untuned models.
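For context on why weight-level mitigation holds up, ORPO (Hong et al., 2024) folds the preference signal directly into training: it adds an odds-ratio penalty to the standard supervised fine-tuning (SFT) loss, so the model learns to favor the non-propagandistic completion $y_w$ over the propagandistic one $y_l$ as part of language modeling itself:

$$
\mathcal{L}_{\text{ORPO}} = \mathbb{E}_{(x,\,y_w,\,y_l)}\big[\mathcal{L}_{\text{SFT}} + \lambda\,\mathcal{L}_{\text{OR}}\big],
\qquad
\mathcal{L}_{\text{OR}} = -\log\sigma\!\left(\log\frac{\operatorname{odds}_\theta(y_w\mid x)}{\operatorname{odds}_\theta(y_l\mid x)}\right)
$$

where $\operatorname{odds}_\theta(y\mid x) = P_\theta(y\mid x)\,/\,(1 - P_\theta(y\mid x))$ and $\lambda$ weights the penalty against the SFT term.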
Your AI Implementation Roadmap
A strategic, phased approach to integrating AI responsibly, minimizing the risk of manipulative content while maximizing operational benefits.
Phase 1: Initial Assessment & Model Selection
Evaluate current LLM usage and identify areas susceptible to manipulative content. Select appropriate base models for fine-tuning.
Phase 2: Data Curation & Annotation
Gather and annotate domain-specific datasets for propaganda and rhetorical technique detection, and for preference alignment.
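As a minimal sketch, a preference-alignment record might look like the following; the field names follow the common "prompt"/"chosen"/"rejected" convention used by libraries such as TRL and are an assumption, not the study's actual annotation schema.

```python
# One hypothetical preference record for alignment training. Field names
# follow the common "prompt"/"chosen"/"rejected" convention; the study's
# actual schema may differ.
preference_record = {
    "prompt": "Write a news article about the new energy policy.",
    "chosen": (  # neutral, fact-based reporting
        "The government announced a new energy policy on Tuesday, citing "
        "projected cost savings and emissions targets."
    ),
    "rejected": (  # loaded language, flag-waving, appeal to fear
        "Patriots everywhere are cheering this heroic policy, while its "
        "treacherous critics would plunge the nation into darkness."
    ),
}
```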
Phase 3: Fine-Tuning & Mitigation Strategy
Apply supervised fine-tuning (SFT), Direct Preference Optimization (DPO), or ORPO to bake mitigation into the model weights, and complement the tuned model with prompt-level guardrails (see the sketch below).
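A minimal ORPO fine-tuning sketch with the TRL library is shown below, assuming preference pairs in the format from Phase 2. The base model, hyperparameters, and data are placeholders, not the study's setup, and the `tokenizer` keyword reflects TRL circa 0.9 (newer releases rename it `processing_class`).

```python
# Minimal ORPO fine-tuning sketch using TRL. Model, hyperparameters, and
# data are placeholders, not the study's configuration.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"  # example base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships without a pad token

# Preference pairs in the format sketched in Phase 2 (one shown inline).
train_dataset = Dataset.from_list([{
    "prompt": "Write a news article about the new energy policy.",
    "chosen": "The government announced a new energy policy on Tuesday ...",
    "rejected": "Patriots everywhere are cheering this heroic policy ...",
}])

config = ORPOConfig(
    output_dir="orpo-no-propaganda",
    beta=0.1,  # weight on the odds-ratio penalty (lambda in the ORPO paper)
    per_device_train_batch_size=1,
    num_train_epochs=1,
)

trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # `processing_class` in newer TRL releases
)
trainer.train()
```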
Phase 4: Validation & Deployment
Rigorously test mitigated models using human and automated evaluations. Deploy models with continuous monitoring.
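The automated half of that validation can be as simple as the loop sketched below, which reports the two metrics the research tracks: the share of generated articles flagged as propagandistic and the average number of techniques per article. The `detect_techniques` callable is a hypothetical stand-in for a real detector such as the one sketched earlier.

```python
# Sketch of the automated validation metrics: share of generated articles
# flagged as propagandistic and mean techniques per article. The
# detect_techniques callable is a hypothetical stand-in for a real detector.
from typing import Callable, List

def evaluate_mitigation(
    articles: List[str],
    detect_techniques: Callable[[str], List[str]],
) -> dict:
    """Return the propaganda rate and average techniques per article."""
    per_article = [detect_techniques(a) for a in articles]
    flagged = sum(1 for techniques in per_article if techniques)
    return {
        "propaganda_rate": flagged / len(articles),  # e.g. 0.10 post-ORPO vs 0.99 untuned
        "avg_techniques_per_article": sum(map(len, per_article)) / len(articles),
    }
```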