Enterprise AI Analysis: Test-Time Safety Alignment


Achieving Zero Safety-Flagged Responses with Test-Time Alignment

Recent work has shown that a model's input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. New research introduces an innovative, black-box approach to neutralize harmful LLM outputs by optimizing input embeddings at a sub-lexical level without modifying model weights.

Baturay Saglam, Dionysis Kalogerias
Department of Electrical and Computer Engineering, Yale University

Executive Impact: Unprecedented LLM Safety in Real-Time

Our Test-Time Safety Alignment (TSA) method provides a robust, real-time defense against adversarial prompts, dramatically improving LLM safety without modifying model weights. It leverages a unique embedding-level optimization to steer models towards harmless outputs.

100% Safety-Flagged Responses Eliminated (Most Models)
1.0 Avg. Optimization Steps per Prompt (Llama 3.1-8B-Instruct)
Seconds-Scale Avg. Per-Prompt Latency (Llama 3.1-8B-Instruct)

Deep Analysis & Enterprise Applications

Explore the specific findings from the research below, rebuilt as enterprise-focused analysis modules.

Large language models (LLMs) often undergo safety training (e.g., RLHF) but remain vulnerable to adversarial prompts that bypass these mechanisms, eliciting harmful content. The challenge is that alignment is learned from finite examples, while the input space is unbounded. Existing test-time defenses have limitations, such as undirected perturbations or reliance on auxiliary modules. Our objective is to minimize the semantic harmfulness of LLM outputs at test time, treating the LLM and the moderation oracle (OpenAI Moderation API) as black boxes, without altering human-readable input or forcing refusals.
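One plausible formalization of this objective (the notation here is ours, not taken from the paper): let $e_0$ be the prompt's original input embeddings, $f$ the frozen LLM, and $h$ the black-box moderation oracle mapping a completion to a harmfulness score. TSA then solves

\[
\min_{e} \;\; \mathbb{E}\big[\, h(f(e)) \,\big]
\qquad \text{subject to} \qquad
\cos(e, e_0) \ge \tau,
\]

where the expectation is over the model's sampling randomness and $\tau$ is a cosine-similarity threshold keeping the optimized embeddings faithful to the original prompt direction.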

Our approach, Test-Time Safety Alignment (TSA), applies zeroth-order gradient estimation to approximate the gradient of a black-box text-moderation oracle with respect to the prompt embeddings. We perturb the input embeddings with i.i.d. Gaussian noise, average the resulting moderation scores into a gradient estimate, and perform gradient descent to minimize harmfulness. Key components include Gaussian smoothing, gradient normalization to prevent off-manifold drift, and a cosine-similarity constraint that preserves fidelity to the original prompt direction. The method operates at the sub-lexical level and typically converges within a few gradient steps.
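A minimal sketch of this loop in Python. The callables `generate` (the frozen LLM) and `moderate` (the moderation API) are hypothetical placeholders, the hyperparameter values are illustrative rather than the paper's, and the baseline subtraction is a standard variance-reduction trick for zeroth-order estimators that the paper may or may not use:

    import numpy as np

    def harmfulness(e, generate, moderate):
        # Black-box objective: generate a completion from embeddings e,
        # then score it with the moderation oracle (higher = more harmful).
        return moderate(generate(e))

    def tsa_step(e, e0, generate, moderate,
                 sigma=0.01, n_samples=8, lr=0.1, tau=0.98):
        # Zeroth-order (Gaussian-smoothing) gradient estimate:
        #   g ~= (1/(n*sigma)) * sum_i (h(e + sigma*u_i) - h(e)) * u_i,
        #   with u_i ~ N(0, I).
        base = harmfulness(e, generate, moderate)
        grad = np.zeros_like(e)
        for _ in range(n_samples):
            u = np.random.randn(*e.shape)
            grad += (harmfulness(e + sigma * u, generate, moderate) - base) * u
        grad /= n_samples * sigma

        # Normalize the gradient to limit off-manifold drift in embedding space.
        grad /= np.linalg.norm(grad) + 1e-8

        # Enforce the cosine-similarity constraint to the original embeddings
        # by backtracking the step size (a simple heuristic enforcement).
        step = lr
        while True:
            e_new = e - step * grad
            cos = float(np.dot(e_new.ravel(), e0.ravel())) / (
                np.linalg.norm(e_new) * np.linalg.norm(e0) + 1e-8)
            if cos >= tau or step < 1e-4:
                return e_new
            step *= 0.5

    def tsa(e0, generate, moderate, max_steps=5, flag_threshold=0.5):
        # Iterate until the completion is no longer flagged; the paper reports
        # convergence in 1-2 steps for most models.
        e = e0.copy()
        for _ in range(max_steps):
            if harmfulness(e, generate, moderate) < flag_threshold:
                break
            e = tsa_step(e, e0, generate, moderate)
        return e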

Experiments were conducted on five instruction-tuned models (Gemma 3-1B, Phi-3.5-mini, Llama 3.1-8B, Qwen3-14B, GPT-OSS-20B) using two red-teaming benchmarks: WildJailbreak (adversarial harmful/benign) and HarmBench (direct harmful). TSA nearly eliminates safety-flagged completions for most models, often within 1-2 gradient steps. On benign prompts, responses remain substantively intact, confirming no blanket refusal bias. Optimized embeddings decode back to the original tokens, indicating that sub-lexical perturbations suffice to steer behavior towards safety. Baseline defenses (SmoothLLM, AdaSteer, RESTA) were less effective, and on HarmBench in particular produced more flagged completions than the undefended model.

The effectiveness of sub-lexical intervention stems from the geometry of the embedding space and from gradient normalization, which together allow precise steering within a token's Voronoi cell. Notably, the optimization cannot be reversed to create jailbreaks, a property attributed to the refusal basin shaped by safety training. TSA is a drop-in safety layer requiring only input embeddings and a black-box moderation API. Limitations include per-prompt latency on the order of seconds, the need for direct access to input embeddings (unavailable through API-gated services), and effectiveness that scales with the quality of the underlying model's existing safety training, as observed with Gemma 3-1B.
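The claim that optimized embeddings decode back to the original tokens can be sanity-checked with a nearest-neighbor decode against the model's input embedding table. A minimal sketch, where `emb_matrix` is a hypothetical (vocab_size, dim) array:

    import numpy as np

    def nearest_tokens(e, emb_matrix):
        # Map each embedding row back to its nearest vocabulary token.
        # If a perturbed embedding stayed inside its token's (Euclidean)
        # Voronoi cell, this recovers the original token ID exactly.
        d = (np.sum(e**2, axis=1, keepdims=True)
             - 2.0 * e @ emb_matrix.T
             + np.sum(emb_matrix**2, axis=1))
        return np.argmin(d, axis=1)  # one token ID per prompt position

If the optimized and original embeddings map to identical token IDs, the human-readable prompt is unchanged, which is the sub-lexical property reported above.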

Test-Time Safety Alignment Process Flow

1. User Prompt Input
2. Tokenization & Embedding
3. Embedding Perturbation (i.i.d. Gaussian Noise)
4. LLM Response Generation
5. Black-box Moderation API Scoring
6. Zeroth-Order Gradient Estimation
7. Gradient Descent on Embeddings
8. Safe Output Generation
100% of Safety-Flagged Responses Eliminated on Llama 3.1, GPT-OSS, Qwen3, and Phi-3.5 (Adversarial Harmful)

Comparison of Test-Time Defense Methods (Llama 3.1-8B-Instruct)

| Feature | TSA (Our Method) | SmoothLLM | AdaSteer | RESTA |
| --- | --- | --- | --- | --- |
| Control Mechanism | Continuous embedding optimization (sub-lexical) | Discrete token perturbation (character swaps) | Activation steering (hidden states) | Embedding noise injection (token-level) |
| Optimization Objective | Directly minimizes semantic harmfulness (black-box API) | Majority vote over perturbed outputs (jailbreak judge) | Steers towards safety directions (fixed calibration) | Majority vote over noisy embeddings (smoothed prefix) |
| Retraining/Calibration | None (fully test-time) | None | Requires offline calibration set | None |
| Auxiliary Modules | None | Jailbreak judge | Steering vectors | None |
| Effectiveness (Flagged on Adv. Harmful) | 0/2000 (100% reduction) | 122/2000 (43.1% reduction) | 85/2000 (61.0% reduction) | 199/2000 (8.7% reduction) |

Case Study: Llama 3.1-8B-Instruct (Adversarial Harmful)

On the challenging WildJailbreak (adversarial harmful) benchmark, Llama 3.1-8B-Instruct initially produced 218 flagged responses out of 2,000. Applying Test-Time Safety Alignment (TSA) reduced this to 0 flagged responses, achieving a 100% elimination of harmful content. This was accomplished with an average of just 1.0 optimization step per prompt, demonstrating exceptional efficiency and effectiveness. The method ensures that outputs remain coherent and substantive, without resorting to blanket refusals, by subtly shifting the model's continuous embedding space at a sub-lexical level.

100% of Flagged Responses Eliminated


Your AI Implementation Roadmap

A structured approach ensures seamless integration and maximum impact. Here’s a typical journey with our enterprise AI experts.

Phase 1: Discovery & Strategy

Deep dive into your current workflows, identify key pain points, and define strategic AI opportunities tailored to your enterprise goals.

Phase 2: Pilot & Validation

Implement a targeted AI pilot program to test hypotheses, validate ROI, and gather critical feedback in a controlled environment.

Phase 3: Scaled Deployment

Roll out validated AI solutions across relevant departments, ensuring robust infrastructure, security, and user adoption.

Phase 4: Optimization & Future-Proofing

Continuous monitoring, performance tuning, and integration of new AI advancements to maintain a competitive edge and long-term value.

Ready to Transform Your Enterprise with AI?

Our experts are ready to discuss how Test-Time Safety Alignment and other cutting-edge AI solutions can benefit your organization.

Ready to Get Started?

Book Your Free Consultation.


