LLM Security & Safety
Cross-Lingual Jailbreak Detection via Semantic Codebooks: A Robust Approach for Multilingual LLMs
A novel training-free framework uses language-agnostic semantic similarity to detect cross-lingual jailbreak attempts in Large Language Models (LLMs). By comparing multilingual query embeddings against a fixed English codebook of jailbreak prompts, this system acts as an external guardrail without requiring retraining or language-specific adaptation. Evaluated across four languages, two translation pipelines, four safety benchmarks, three embedding models, and three target LLMs (Qwen, Llama, GPT-3.5), the approach shows near-perfect separability (AUC up to 0.99) on curated benchmarks, significantly reducing attack success rates under strict false-positive constraints. However, performance degrades on behaviorally diverse unsafe benchmarks.
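At its core, the guardrail reduces to a training-free nearest-neighbor check: embed the incoming query with a multilingual encoder and compare it against precomputed embeddings of the English jailbreak codebook. A minimal sketch of that scoring rule, where the embedding model, codebook contents, and the 0.8 threshold are illustrative assumptions rather than values from the paper:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def max_codebook_similarity(query_emb, codebook_embs):
    """Score a query by its nearest neighbor in the English codebook."""
    return max(cosine(query_emb, c) for c in codebook_embs)

def is_jailbreak(query_emb, codebook_embs, threshold=0.8):
    """Flag the query if it lies close to any known jailbreak prompt.
    The threshold is a placeholder; in practice it is calibrated on
    benign traffic under a false-positive budget."""
    return max_codebook_similarity(query_emb, codebook_embs) >= threshold
```

Because the comparison happens in a language-agnostic embedding space, the same English codebook can score queries in Russian, Chinese, or Arabic without retraining, which is what makes the approach attractive as an external bolt-on filter.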
Executive Impact: Key Findings for Enterprise AI
Our systematic evaluation reveals that while semantic similarity effectively blocks canonical jailbreak patterns across languages, its efficacy drops sharply against heterogeneous attack types. This underscores the need for a multi-layered security approach in multilingual LLM deployments.
Deep Analysis & Enterprise Applications
Our Training-Free Cross-Lingual Guardrail
Detection separability (AUC) by benchmark and language:

| Benchmark | English (native) | Russian (m2m) | Chinese (m2m) | Arabic (m2m) |
|---|---|---|---|---|
| Benchmark 1 | 0.829 | 0.785 | 0.781 | 0.765 |
| Benchmark 2 | 0.993 | 0.854 | 0.855 | 0.855 |
| Benchmark 3 | 0.618 | 0.675 | 0.694 | 0.660 |
| Benchmark 4 | 0.615 | 0.614 | 0.593 | 0.605 |
True-positive rate (TPR) at FPR < 1%, by benchmark and language:

| Benchmark | English (native) | Russian (m2m) | Chinese (m2m) | Arabic (m2m) |
|---|---|---|---|---|
| Benchmark 1 | 25.6% | 22.2% | 21.6% | 20.1% |
| Benchmark 2 | 91.9% | 82.9% | 80.1% | 78.5% |
| Benchmark 3 | 23.0% | 14.0% | 17.0% | 17.0% |
| Benchmark 4 | 3.3% | 5.2% | 4.5% | 6.1% |
Low-FPR Regime Challenges
Deployment-grade safety systems demand strict false-positive constraints (FPR < 1%). Under this budget the filter remains effective on canonical jailbreak patterns (Benchmark 2: 78.5-91.9% TPR), but recall collapses to single digits (3.3-6.1%) on heterogeneous unsafe benchmarks (Benchmark 4). This exposes the limits of similarity-only filtering for diverse attacks under strict FPR conditions.
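Operating at a fixed FPR budget means the detection threshold is calibrated on benign traffic first, and recall on attacks is whatever that threshold yields. A small sketch of that calibration (the helper names and scores are illustrative, not from the paper):

```python
def threshold_at_fpr(benign_scores, max_fpr=0.01):
    """Pick the smallest threshold whose false-positive rate on benign
    traffic stays within max_fpr (queries scoring above it are flagged)."""
    s = sorted(benign_scores, reverse=True)
    allowed = int(max_fpr * len(s))  # benign flags we can tolerate
    # Everything strictly above this score gets flagged.
    return s[allowed] if allowed < len(s) else s[-1]

def tpr_at_threshold(attack_scores, threshold):
    """Recall on attack prompts at the calibrated threshold."""
    return sum(a > threshold for a in attack_scores) / len(attack_scores)
```

The single-digit recall on heterogeneous benchmarks falls directly out of this procedure: when attack scores overlap heavily with benign scores, almost no threshold satisfies the 1% budget while still catching attacks.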
Mean reduction in attack success rate after filtering, aggregated across target LLMs and languages:

| Benchmark | Mean Reduction (%) | Std. Dev. |
|---|---|---|
| Benchmark 1 | 96.2 | ± 2.6 |
| Benchmark 2 | 50.0 | ± 17.4 |
| Benchmark 3 | 43.7 | ± 21.6 |
| Benchmark 4 | 18.6 | ± 13.8 |
Impact on Target LLMs & Translation Pipelines
The semantic filter effectively reduces successful jailbreaks on canonical benchmarks. For instance, Benchmark 1 sees a 96.2% mean reduction. However, on heterogeneous benchmarks (e.g., Benchmark 4), the mean reduction drops to 18.6%, with increasing instability across models and languages (higher standard deviation). This confirms that a fixed English codebook struggles with diverse, un-templated attack patterns.
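The per-benchmark figures above follow the standard relative-reduction formula, aggregated over (model, language) configurations. A small sketch with illustrative numbers, assuming attack success rate (ASR) is measured before and after the filter is applied:

```python
import statistics

def asr_reduction(asr_before, asr_after):
    """Percent reduction in attack success rate after filtering."""
    return 100.0 * (asr_before - asr_after) / asr_before

def summarize(reductions):
    """Aggregate per-configuration reductions as mean and std. dev.,
    matching the table format above."""
    return statistics.mean(reductions), statistics.stdev(reductions)
```

A high standard deviation, as seen on the heterogeneous benchmarks, signals that the filter's benefit is unstable: it depends strongly on which model and language combination is being attacked.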
Your Enterprise AI Safety Roadmap
A strategic phased approach for integrating robust cross-lingual jailbreak detection.
Phase 1: Pilot Program & Baseline Assessment
Deploy the semantic codebook filter on a small, controlled multilingual LLM environment. Establish baseline attack success rates and monitor false positive rates in target languages (e.g., Russian, Chinese, Arabic) using an initial fixed English codebook.
Phase 2: Language-Specific Codebook Augmentation
Based on pilot results, curate native multilingual codebooks or augment the English codebook with carefully translated and semantically deduplicated examples. Implement continuous update pipelines to capture emerging attack patterns and control concept drift.
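Semantic deduplication of an augmented codebook can be done greedily: keep an entry only if it is not too similar to anything already kept. A minimal sketch, where the 0.9 similarity cutoff is an illustrative assumption:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def deduplicate(embeddings, sim_threshold=0.9):
    """Greedy semantic dedup: retain an entry only if no already-kept
    entry is within sim_threshold of it. Returns retained indices."""
    kept = []
    for i, e in enumerate(embeddings):
        if all(cosine(e, embeddings[k]) < sim_threshold for k in kept):
            kept.append(i)
    return kept
```

Pruning near-duplicate translations keeps the codebook compact, which both speeds up the nearest-neighbor check and avoids over-weighting one attack template in the similarity distribution.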
Phase 3: Hybrid Architecture Integration
Integrate similarity-based filtering with complementary detection methods like perplexity-based anomaly detection and syntactic pattern analysis. Develop cascaded pipelines for enhanced robustness under distribution shift, maintaining strict FPR constraints.
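A cascaded pipeline of this kind can be expressed as an ordered list of detectors, blocking on the first stage that fires. The stage predicates below are placeholders standing in for the similarity filter, perplexity-based anomaly detection, and syntactic analysis; none of them come from the paper:

```python
def cascaded_filter(query, stages):
    """Run detector stages in order; block on the first one that fires.
    Each stage is a (name, predicate) pair. Returns (blocked, reason)."""
    for name, detect in stages:
        if detect(query):
            return True, name
    return False, None

# Illustrative stages only: real deployments would plug in the
# calibrated similarity score, a perplexity model, and a syntax check.
example_stages = [
    ("similarity", lambda q: "ignore previous" in q.lower()),
    ("length_anomaly", lambda q: len(q) > 2000),
]
```

Ordering cheap, high-precision stages first keeps latency low, and each stage can be calibrated to its own slice of the overall FPR budget so the cascade as a whole stays within the strict constraint.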
Phase 4: Longitudinal Deployment & Adaptation
Conduct real-world assessment on live multilingual traffic, quantifying resilience to evolving attack distributions and language drift. Continuously refine detection strategies based on performance monitoring and LLM adjudication, establishing deployment guidelines for semantic guardrails.
Ready to Fortify Your Multilingual LLMs?
Don't let cross-lingual vulnerabilities compromise your enterprise AI. Our experts are ready to help you implement state-of-the-art detection and mitigation strategies.