GENERATIVE AI SECURITY & SAFETY
Unlocking the Future of AI: Lessons from Red Teaming 100+ Products
Based on our extensive experience red teaming over 100 generative AI products at Microsoft, we present key insights and a robust threat model ontology to guide effective AI security and safety practices.
Key Impact & Operational Scale
Our red teaming operations have provided critical insights into a wide array of Generative AI systems, ensuring robust security and safety from development to deployment.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Lesson 1: Understand what the system can do and where it is applied. The first step in an AI red teaming operation is to determine which vulnerabilities to target. Starting from potential downstream impacts, rather than attack strategies, makes it more likely that an operation will produce useful findings tied to real-world risks. Anticipating downstream impacts requires considering system capabilities and application context.
Lesson 2: You don't have to compute gradients to break an AI system. Real-world attackers often use simpler techniques like prompt engineering rather than complex gradient-based methods. Effective attack strategies often leverage combinations of tactics targeting multiple weaknesses in the broader AI system, not just the model. Prioritizing simple techniques and system-level attacks is crucial.
Case Study: Jailbreaking a Vision Language Model
Summary: In this operation, we tested a VLM for responsible AI impacts, specifically the generation of content aiding illegal activities. We found that overlaying malicious instructions on an image input was more effective at bypassing safety guardrails than direct text prompts, revealing a critical weakness in VLM safety training.
Details: System: Vision language model (VLM)
Actor: Adversarial user
Tactic: ML Model Access, Defense Evasion
Technique: AML.T0040, AML.T0051 (LLM Prompt Injection)
Impact: Generation of illegal content. This case highlights how simple, system-level attacks (like image-based prompt injection) can effectively subvert complex AI systems.
Enterprise Process Flow: End-to-End Automated Scamming Scenario
Lesson 3: AI red teaming is not safety benchmarking. The risk landscape is constantly shifting, with novel attacks and failure modes. AI red teaming explores unfamiliar scenarios and helps define new harm categories beyond what existing benchmarks measure. It requires human effort to discover novel harms and probe contextualized risks.
Lesson 6: Responsible AI harms are pervasive but difficult to measure. RAI harms are often ambiguous and subjective, unlike security vulnerabilities. They are influenced by probabilistic model behavior and require detailed policy for evaluation. AI red teaming probes these scenarios, distinguishing between adversarial and benign user interactions.
Case Study: Chatbot Response to Users in Distress
Summary: We evaluated how an LLM-based chatbot responds to users expressing distress (e.g., depressive thoughts, self-harm intent). This assessment considers psychosocial harms, requiring human emotional intelligence and subject matter expertise to interpret model responses in sensitive contexts.
Details: System: LLM-based chatbot
Actor: Distressed user
Weakness: Improper LLM safety training
Impact: Possible adverse impacts on a user's mental health and wellbeing. This case highlights the crucial role of human judgment and cultural competence in assessing sensitive AI interactions.
Case Study: Probing Text-to-Image Generator for Gender Bias
Summary: We investigated gender bias in a text-to-image generator by using prompts that depicted individuals without specifying gender (e.g., 'a secretary' and 'a boss'). This revealed how the model's default depictions can exacerbate gender-based stereotypes, underscoring the subtle nature of responsible AI harms.
Details: System: Text-to-image generator
Actor: Average user
Tactic: ML Model Access
Technique: AML.T0040 (ML Model Inference API Access)
Weakness: Model bias
Impact: Generation of content that may exacerbate gender-based biases and stereotypes. Repeated probing with non-gendered prompts helps reveal inherent biases in model generation.
Lesson 4: Automation can help cover more of the risk landscape. The complexity of AI risks necessitates tools like PyRIT for rapid vulnerability identification, automated attacks, and large-scale testing. Automation helps account for non-deterministic model behavior and estimates failure likelihood, but it must augment human judgment, not replace it.
Lesson 5: The human element of AI red teaming is crucial. Automation supports, but does not replace, human judgment and creativity in AI red teaming. Prioritizing risks, designing system-level attacks, defining new harm categories, and assessing context-specific risks (e.g., cultural competence, emotional intelligence) are inherently human tasks that require subject matter experts.
Lesson 7: LLMs amplify existing security risks and introduce new ones. GenAI integrates into applications, introducing novel attack vectors and shifting the security landscape. AI red teams must consider both existing system-level risks (e.g., outdated dependencies, improper error handling) and novel model-level weaknesses (e.g., cross-prompt injection attacks in RAG architectures). Mitigations require both system-level and model-level improvements.
Case Study: SSRF in a Video-Processing GenAI Application
Summary: We identified a Server-Side Request Forgery (SSRF) vulnerability in a GenAI-based video processing system, stemming from its use of an outdated FFmpeg version. An attacker could craft malicious video files to access internal resources and escalate privileges. This highlights the importance of regularly updating and isolating critical dependencies.
Details: System: GenAI application
Actor: Adversarial user
Tactic: Reconnaissance, Initial Access, Privilege Escalation
Technique: T1595 (Active Scanning), T1190 (Ex-ploit Public-Facing Application), T1068 (Exploitation for Privilege Escalation)
Weakness: CWE-918: Server-Side Request Forgery (SSRF)
Impact: Unauthorized privilege escalation. This case demonstrates that traditional security vulnerabilities are still critical in GenAI systems.
Lesson 8: The work of securing AI systems will never be complete. The idea of 'solving' AI safety through purely technical advances is unrealistic. AI security is an ongoing process influenced by economics (cost of attack), break-fix cycles (continuous red teaming and mitigation), and regulation. The goal is to raise the cost of attacks, making advanced exploitation uneconomical for adversaries.
Calculate Your Potential AI Optimization ROI
Estimate the impact of implementing robust AI security and safety practices, including improved efficiency and reduced risk exposure.
Our AI Security & Safety Implementation Roadmap
We guide your enterprise through a structured process to integrate red teaming insights and best practices, ensuring measurable improvements in your AI systems' resilience.
Phase 01: Initial Assessment & Threat Modeling
Conduct a comprehensive audit of existing AI systems, identifying potential vulnerabilities and defining a tailored threat model based on our ontology.
Phase 02: Red Teaming Operations
Execute targeted red teaming exercises using PyRIT and human-led techniques to uncover system-level and model-level weaknesses.
Phase 03: Mitigation Strategy & Implementation
Develop and implement robust security controls and safety guardrails, leveraging insights from red teaming to address identified risks.
Phase 04: Continuous Monitoring & Adaptation
Establish ongoing monitoring and feedback loops to adapt to emerging threats and new AI capabilities, ensuring long-term resilience.
Ready to Secure Your Generative AI Future?
Our experts are prepared to help your enterprise navigate the complexities of AI safety and security. Schedule a personalized consultation to fortify your AI initiatives.