2026-02-25
Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
This paper examines a critical trade-off in Large Language Model (LLM) post-training: optimizing for pass@k (success within multiple attempts) can significantly boost multi-attempt performance, but often degrades pass@1 (single-shot accuracy). The trade-off is particularly challenging for real-world deployments constrained by latency, cost, and the need for a reliable single-shot fallback. We provide a theoretical framework that explains this phenomenon, identifying "prompt interference" as the core mechanism.
Quantifying the Operational Impact
Our results quantify the often-overlooked consequences of pass@k optimization, showing how apparent multi-attempt performance gains can mask critical regressions in single-shot accuracy.
Understanding the Pass@k vs. Pass@1 Trade-off
Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning. It counts a prompt as solved if any of k independently sampled responses passes a verifier. This multi-sample inference metric has motivated inference-aware fine-tuning methods that directly optimize pass@k. However, prior work reports a recurring trade-off: such methods improve pass@k while degrading pass@1.
This trade-off is practically important because pass@1 often remains a hard operational constraint due to latency and cost budgets, imperfect verifier coverage, and the need for a reliable single-shot fallback. We study the origin of this trade-off and provide a theoretical characterization of when pass@k policy optimization can reduce pass@1 through gradient conflict induced by prompt interference.
Figure 1a interpretation: This figure empirically illustrates the trade-off. After pass@k optimization, pass@k increases significantly relative to its pre-optimization value, while pass@1 decreases. This visual evidence underscores the central problem addressed by the paper.
Foundations of Pass@k Optimization
In many verifiable tasks, such as code generation and short-answer math, a system can afford multiple response attempts for the same prompt and check the correctness of each response attempt with an automatic verifier... The corresponding metric, pass@k, measures the probability that at least one of k i.i.d. samples solves the prompt.
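As a concrete reference point, pass@k is usually estimated from n ≥ k samples per prompt with the standard unbiased estimator; the sketch below uses our own function name:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n sampled responses, c of them correct.

    Equals 1 - C(n - c, k) / C(n, k): the probability that a uniformly random
    size-k subset of the n samples contains at least one correct response.
    """
    if n - c < k:  # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 2 samples of which c = 1 is correct, a single draw (k = 1) succeeds with probability 0.5.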
The pass@k objective is defined as the probability that at least one response is correct among k responses, i.e., Jk(θ) := E_{x~D}[1 − (1 − pθ(x))^k]. This can be written using the nonlinear transformation fk(p) := 1 − (1 − p)^k. Pass@k optimization can be performed using pass@k policy gradients, where ∇Jk(θ) = E_{x~D}[wk(pθ(x)) ∇pθ(x)], and wk(p) := fk′(p) = k(1 − p)^(k−1).
The per-prompt success probability for any x ∈ X and any policy parameter θ ∈ R^d is pθ(x) := E_{y~πθ(·|x)}[r(x, y)], with gradient ∇pθ(x) = E_{y~πθ(·|x)}[r(x, y) sθ(x, y)], where sθ(x, y) := ∇ log πθ(y|x) is the score function. The weights wk(pθ(x)) emphasize prompts with a low probability of success.
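This weighting is easy to visualize: wk(p) grows steeply as p → 0, so rarely-solved prompts dominate the pass@k gradient. A minimal sketch (the function name is ours):

```python
def w_k(p: float, k: int) -> float:
    """Pass@k gradient weight: w_k(p) = k * (1 - p)**(k - 1)."""
    return k * (1 - p) ** (k - 1)

# Illustrative values for k = 8: a hard prompt (p = 0.05) receives a weight
# millions of times larger than an easy prompt (p = 0.9).
hard, easy = w_k(0.05, 8), w_k(0.9, 8)
print(hard / easy)
```

Note that w_1(p) = 1 for all p, recovering the uniform weighting of the pass@1 objective.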
Introducing Prompt Interference
We introduce the concept of prompt interference. We say that two prompts are positively (resp. negatively) interfering if a policy parameter update that increases the probability of a correct response for one prompt tends to increase (resp. decrease) the probability of success on the other. To capture the similarity between prompts in terms of their pass@1 gradient representation, we introduce a similarity kernel κθ(x, x') := ⟨∇pθ(x), ∇pθ(x')⟩. This kernel indicates whether improving pass@1 on one prompt tends to also improve pass@1 on another.
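Given estimated per-prompt pass@1 gradients, the kernel and a cosine-normalized variant can be sketched as follows (a toy implementation under our own naming):

```python
import numpy as np

def interference_kernel(grads) -> np.ndarray:
    """kappa(x, x') = <grad p(x), grad p(x')> for row-stacked gradients."""
    G = np.asarray(grads, dtype=float)
    return G @ G.T

def cosine_kernel(grads) -> np.ndarray:
    """Cosine-normalized kernel; negative entries flag negative interference."""
    G = np.asarray(grads, dtype=float)
    G = G / np.linalg.norm(G, axis=1, keepdims=True)
    return G @ G.T

# Two prompts with nearly opposite gradients interfere negatively.
K = cosine_kernel([[1.0, 0.1], [-1.0, 0.0]])
print(K[0, 1])  # strongly negative, close to -1
```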
Illustration of negative prompt interference: It follows that the per-prompt success probability defined in (1) is given by pθ(x) = πθ(y*(x)|x) for any prompt x ∈ X... In this setting, two prompts with similar representations but different correct answers will have nearly opposite per-prompt pass@1 gradients and will hence be negatively interfering.
Figure 2 interpretation: The cosine kernel heatmap visually demonstrates prompt interference. Blue regions correspond to negative prompt interference, where improving one prompt's success actively harms another due to shared policy parameters. This phenomenon is central to understanding the pass@k/pass@1 trade-off.
The Core of Gradient Conflict
We show that pass@k and pass@1 gradients can be conflicting in the sense that they can form an obtuse angle. This implies that a policy update following the pass@k policy gradient tends to increase pass@k while decreasing pass@1. We characterize this gradient conflict by establishing an interpretable expression for the inner product between the pass@k and pass@1 gradients.
Our key insight: Compared to pass@1, optimizing the pass@k objective induces an implicit prompt reweighting toward prompts with lower success probability (i.e., prompts the current policy rarely solves). When these prompts contribute gradients that conflict with the population pass@1 gradient, upweighting them increases their influence on the pass@k policy gradient update. Consequently, the pass@k gradient can conflict with the pass@1 gradient direction.
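A toy numerical sketch of this mechanism (all numbers are illustrative, not from the paper): nine mutually agreeing easy prompts and one negatively interfering hard prompt. Unweighted, the average (pass@1) gradient aligns with the easy prompts; with pass@k weights for k = 8, the hard prompt dominates and the inner product between the two population gradients turns negative.

```python
import numpy as np

# Hypothetical per-prompt pass@1 gradients and success probabilities.
grads = [np.array([1.0, 0.2])] * 9 + [np.array([-1.0, 0.1])]
probs = [0.9] * 9 + [0.05]
k = 8
weights = [k * (1 - p) ** (k - 1) for p in probs]

g1 = np.mean(grads, axis=0)                                    # pass@1 gradient
gk = np.mean([w * g for w, g in zip(weights, grads)], axis=0)  # pass@k gradient

print(float(g1 @ gk))  # negative: the two gradients conflict
```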
Sufficient conditions and influence of k: Using our gradient conflict characterization, we provide sufficient conditions under which gradient conflict occurs. We further study the influence of the parameter k and show that increasing k encourages gradient conflict under certain conditions on the relative success probabilities of negatively versus positively interfering prompts.
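The role of k can be probed in the same spirit with a self-contained toy sweep (our own illustrative construction): one negatively interfering hard prompt among easy ones. At k = 1 the pass@k and pass@1 objectives coincide and the inner product of their gradients is positive; as k grows, the hard prompt's weight swells and the inner product turns negative.

```python
import numpy as np

# Illustrative per-prompt pass@1 gradients and success probabilities.
grads = [np.array([1.0, 0.2])] * 9 + [np.array([-1.0, 0.1])]
probs = [0.9] * 9 + [0.05]

def grad_inner(k: int) -> float:
    """<pass@k gradient, pass@1 gradient> for the toy prompt set."""
    w = [k * (1 - p) ** (k - 1) for p in probs]
    g1 = np.mean(grads, axis=0)
    gk = np.mean([wi * g for wi, g in zip(w, grads)], axis=0)
    return float(g1 @ gk)

for k in (1, 2, 4, 8, 16):
    print(k, grad_inner(k))  # sign flips from positive to negative as k grows
```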
Pass@1 degradation under pass@k updates: We prove that pass@1 decreases while pass@k increases (simultaneously) under one-step pass@k policy updates satisfying an explicit stepsize condition.
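This one-step effect can be reproduced in a minimal 1-D toy model (our own construction, not the paper's): one hard and one easy prompt whose success probabilities move in opposite directions as the scalar parameter grows, mimicking negative interference. A small ascent step along the pass@k gradient raises pass@k while lowering pass@1.

```python
import math

def sigma(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Toy 1-D policy: the hard prompt gains and the easy prompt loses
# success probability as theta increases (negative interference).
def p_hard(theta): return sigma(theta - 3.0)
def p_easy(theta): return sigma(2.5 - theta)

def pass1(theta):
    return 0.5 * (p_hard(theta) + p_easy(theta))

def passk(theta, k=8):
    f = lambda p: 1.0 - (1.0 - p) ** k
    return 0.5 * (f(p_hard(theta)) + f(p_easy(theta)))

# One ascent step along the pass@k gradient (central finite difference).
theta, eta, h = 0.0, 0.5, 1e-6
gk = (passk(theta + h) - passk(theta - h)) / (2 * h)
theta_new = theta + eta * gk

print(pass1(theta_new) < pass1(theta))  # True: pass@1 dropped
print(passk(theta_new) > passk(theta))  # True: pass@k rose
```

The pass@k gradient is dominated by the heavily weighted hard prompt, so the step moves theta in a direction that helps the hard prompt slightly but hurts the easy prompt more, lowering average pass@1.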
Figure 3 interpretation: This contour plot illustrates the pass@1 and pass@k objectives in parameter space. The gradients are conflicting in the gray area, meaning updates for pass@k can move the policy in a direction that lowers pass@1.
Experimental Evidence & Real-world Models
We empirically test whether the pass@k objective can induce gradient conflict with pass@1 on math reasoning, as predicted by Proposition 4.1. We use the MATH dataset (Hendrycks et al., 2021) and run experiments with two reasoning models: DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Qwen-7B.
Our experiments validate the theoretical predictions in several ways. First, the agreement scores show a clear separation between hard prompts (agreement scores clustered below zero) and easy prompts (clustered above zero), confirming that prompt interference exists in practice.
Second, the pass@k weights reveal extreme disparity in how pass@k values different prompts. Hard prompts (low pass@1 values) receive weights orders of magnitude higher than easy prompts, demonstrating the extreme reweighting mechanism our theory identifies.
Third, the weighted agreement scores demonstrate the consequence of this reweighting. For Llama-8B, the gradient alignment flips from positive to negative, resulting in an inner product of −0.613. For Qwen-7B, despite only 29 hard prompts versus 627 easy ones, the extreme weight disparity causes an even more dramatic shift (∆ = −3.04 × 10^−1), yielding a strongly negative inner product of −181.
Figure 6 interpretation: These panels vividly show how pass@k reweighting causes gradient conflict. Column A shows prompt interference. Column B highlights the extreme reweighting of hard prompts by pass@k. Column C demonstrates the resulting downward shift in weighted agreement, leading to a negative inner product between pass@k and pass@1 gradients, confirming the conflict.
Causal Chain of Pass@1 Degradation
| Aspect | Per-Prompt View | Population View |
|---|---|---|
| ∇Jk & ∇J1 Collinearity | The per-prompt pass@k gradient wk(pθ(x))∇pθ(x) is a positive multiple of the per-prompt pass@1 gradient, so the two are collinear. | Averaged over prompts, the pass@k and pass@1 gradients need not be collinear and can form an obtuse angle. |
| Prompt Weighting | Pass@1 weighs every prompt equally. | Pass@k implicitly reweights toward low-success prompts via wk(p) = k(1 − p)^(k−1). |
| Risk to Pass@1 | None: each per-prompt pass@k update also increases that prompt's success probability. | Under negative interference, upweighted hard prompts can pull the update against the population pass@1 gradient, decreasing pass@1. |
Real-world LLM Behavior: DeepSeek Models
Our experiments with DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Qwen-7B models validate the theoretical predictions. For Llama-8B, the gradient alignment flips from positive to negative (a shift of ∆ = −3.92 × 10^−3), resulting in an inner product of −0.613. For Qwen-7B, despite a significant imbalance (21.6:1 ratio of easy to hard prompts), the extreme weight disparity caused by pass@k optimization leads to an even stronger negative inner product of −181. This empirically confirms that pass@k's reweighting mechanism systematically amplifies negatively interfering hard prompts, causing substantial pass@1 degradation.
Optimizing pass@k without considering prompt interference can inadvertently harm pass@1 performance, which is crucial for many practical applications. Future work should focus on designing methods that mitigate this gradient conflict, enabling LLMs to achieve multi-attempt gains while preserving strong single-shot accuracy. The similarity kernel we introduce offers a valuable tool for gradient surgery to achieve this balance.