2026-02-25
Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
This paper examines a critical trade-off in Large Language Model (LLM) post-training: optimizing for pass@k (success within multiple attempts) can significantly boost multi-attempt performance, but often degrades pass@1 (single-shot accuracy). The trade-off is particularly challenging for real-world deployments constrained by latency, cost, and the need for a reliable single-shot fallback. We provide a theoretical framework that explains this phenomenon, identifying "prompt interference" as the core mechanism.
Quantifying the Operational Impact
Our results quantify the often-overlooked consequences of pass@k optimization, showing how apparent multi-attempt performance gains can mask critical regressions in single-shot accuracy.
Understanding the Pass@k vs. Pass@1 Trade-off
Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning. It counts a prompt as solved if any of k independently sampled responses passes a verifier. This multi-sample inference metric has motivated inference-aware fine-tuning methods that directly optimize pass@k. However, prior work reports a recurring trade-off: such methods improve pass@k while degrading pass@1.
This trade-off is practically important because pass@1 often remains a hard operational constraint due to latency and cost budgets, imperfect verifier coverage, and the need for a reliable single-shot fallback. We study the origin of this trade-off and provide a theoretical characterization of when pass@k policy optimization can reduce pass@1 through gradient conflict induced by prompt interference.
Figure 1a interpretation: This figure empirically illustrates the trade-off. After pass@k optimization, pass@k increases significantly relative to its pre-optimization value, while pass@1 decreases. This visual evidence underscores the central problem addressed by the paper.
Foundations of Pass@k Optimization
In many verifiable tasks, such as code generation and short-answer math, a system can afford multiple response attempts for the same prompt and check the correctness of each response attempt with an automatic verifier... The corresponding metric, pass@k, measures the probability that at least one of k i.i.d. samples solves the prompt.
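As a concrete reference point, pass@k is usually estimated from n ≥ k samples per prompt with the standard unbiased estimator; the sketch below uses our own function name:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n sampled responses, c of them correct.

    Equals 1 - C(n - c, k) / C(n, k): the probability that a uniformly random
    size-k subset of the n samples contains at least one correct response.
    """
    if n - c < k:  # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 2 samples of which c = 1 is correct, a single draw (k = 1) succeeds with probability 0.5.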
The pass@k objective is defined as the probability that at least one response is correct among k responses, i.e., Jk(θ) := E_{x~D}[1 − (1 − pθ(x))^k]. This can be written using the nonlinear transformation fk(p) := 1 − (1 − p)^k. Pass@k optimization can be performed using pass@k policy gradients, where ∇Jk(θ) = E_{x~D}[wk(pθ(x)) ∇pθ(x)], and wk(p) := fk′(p) = k(1 − p)^(k−1).
The per-prompt success probability for any x ∈ X and any policy parameter θ ∈ R^d is pθ(x) := E_{y~πθ(·|x)}[r(x, y)], with gradient ∇pθ(x) = E_{y~πθ(·|x)}[r(x, y) sθ(x, y)], where sθ(x, y) := ∇ log πθ(y|x) is the score function. The weights wk(pθ(x)) emphasize prompts with a low probability of success.
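This weighting is easy to visualize: wk(p) grows steeply as p → 0, so rarely-solved prompts dominate the pass@k gradient. A minimal sketch (the function name is ours):

```python
def w_k(p: float, k: int) -> float:
    """Pass@k gradient weight: w_k(p) = k * (1 - p)**(k - 1)."""
    return k * (1 - p) ** (k - 1)

# Illustrative values for k = 8: a hard prompt (p = 0.05) receives a weight
# millions of times larger than an easy prompt (p = 0.9).
hard, easy = w_k(0.05, 8), w_k(0.9, 8)
print(hard / easy)
```

Note that w_1(p) = 1 for all p, recovering the uniform weighting of the pass@1 objective.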
Introducing Prompt Interference
We introduce the concept of prompt interference. We say that two prompts are positively (resp. negatively) interfering if a policy parameter update that increases the probability of a correct response for one prompt tends to increase (resp. decrease) the probability of success on the other. To capture the similarity between prompts in terms of their pass@1 gradient representation, we introduce a similarity kernel κθ(x, x') := ⟨∇pθ(x), ∇pθ(x')⟩. This kernel indicates whether improving pass@1 on one prompt tends to also improve pass@1 on another.
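Given estimated per-prompt pass@1 gradients, the kernel and a cosine-normalized variant can be sketched as follows (a toy implementation under our own naming):

```python
import numpy as np

def interference_kernel(grads) -> np.ndarray:
    """kappa(x, x') = <grad p(x), grad p(x')> for row-stacked gradients."""
    G = np.asarray(grads, dtype=float)
    return G @ G.T

def cosine_kernel(grads) -> np.ndarray:
    """Cosine-normalized kernel; negative entries flag negative interference."""
    G = np.asarray(grads, dtype=float)
    G = G / np.linalg.norm(G, axis=1, keepdims=True)
    return G @ G.T

# Two prompts with nearly opposite gradients interfere negatively.
K = cosine_kernel([[1.0, 0.1], [-1.0, 0.0]])
print(K[0, 1])  # strongly negative, close to -1
```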
Illustration of negative prompt interference: It follows that the per-prompt success probability defined in (1) is given by pθ(x) = πθ(y*(x)|x) for any prompt x ∈ X... In this setting, two prompts with similar representations but different correct answers will have nearly opposite per-prompt pass@1 gradients and will hence be negatively interfering.
Figure 2 interpretation: The cosine kernel heatmap visually demonstrates prompt interference. Blue regions correspond to negative prompt interference, where improving one prompt's success actively harms another due to shared policy parameters. This phenomenon is central to understanding the pass@k/pass@1 trade-off.
The Core of Gradient Conflict
We show that pass@k and pass@1 gradients can be conflicting in the sense that they can form an obtuse angle. This implies that a policy update following the pass@k policy gradient tends to increase pass@k while decreasing pass@1. We characterize this gradient conflict by establishing an interpretable expression for the inner product between the pass@k and pass@1 gradients.
Our key insight: Compared to pass@1, optimizing the pass@k objective induces an implicit prompt reweighting toward prompts with lower success probability (i.e., prompts the current policy rarely solves). When these prompts contribute gradients that conflict with the population pass@1 gradient, upweighting them increases their influence on the pass@k policy gradient update. Consequently, the pass@k gradient can conflict with the pass@1 gradient direction.
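A toy numerical sketch of this mechanism (all numbers are illustrative, not from the paper): nine mutually agreeing easy prompts and one negatively interfering hard prompt. Unweighted, the average (pass@1) gradient aligns with the easy prompts; with pass@k weights for k = 8, the hard prompt dominates and the inner product between the two population gradients turns negative.

```python
import numpy as np

# Hypothetical per-prompt pass@1 gradients and success probabilities.
grads = [np.array([1.0, 0.2])] * 9 + [np.array([-1.0, 0.1])]
probs = [0.9] * 9 + [0.05]
k = 8
weights = [k * (1 - p) ** (k - 1) for p in probs]

g1 = np.mean(grads, axis=0)                                    # pass@1 gradient
gk = np.mean([w * g for w, g in zip(weights, grads)], axis=0)  # pass@k gradient

print(float(g1 @ gk))  # negative: the two gradients conflict
```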
Sufficient conditions and influence of k: Using our gradient conflict characterization, we provide sufficient conditions under which gradient conflict occurs. We further study the influence of the parameter k and show that increasing k encourages gradient conflict under certain conditions on the relative success probabilities of negatively versus positively interfering prompts.
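The role of k can be probed in the same spirit with a self-contained toy sweep (our own illustrative construction): one negatively interfering hard prompt among easy ones. At k = 1 the pass@k and pass@1 objectives coincide and the inner product of their gradients is positive; as k grows, the hard prompt's weight swells and the inner product turns negative.

```python
import numpy as np

# Illustrative per-prompt pass@1 gradients and success probabilities.
grads = [np.array([1.0, 0.2])] * 9 + [np.array([-1.0, 0.1])]
probs = [0.9] * 9 + [0.05]

def grad_inner(k: int) -> float:
    """<pass@k gradient, pass@1 gradient> for the toy prompt set."""
    w = [k * (1 - p) ** (k - 1) for p in probs]
    g1 = np.mean(grads, axis=0)
    gk = np.mean([wi * g for wi, g in zip(w, grads)], axis=0)
    return float(g1 @ gk)

for k in (1, 2, 4, 8, 16):
    print(k, grad_inner(k))  # sign flips from positive to negative as k grows
```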
Pass@1 degradation under pass@k updates: We prove that pass@1 decreases while pass@k increases (simultaneously) under one-step pass@k policy updates satisfying an explicit stepsize condition.
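This one-step effect can be reproduced in a minimal 1-D toy model (our own construction, not the paper's): one hard and one easy prompt whose success probabilities move in opposite directions as the scalar parameter grows, mimicking negative interference. A small ascent step along the pass@k gradient raises pass@k while lowering pass@1.

```python
import math

def sigma(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Toy 1-D policy: the hard prompt gains and the easy prompt loses
# success probability as theta increases (negative interference).
def p_hard(theta): return sigma(theta - 3.0)
def p_easy(theta): return sigma(2.5 - theta)

def pass1(theta):
    return 0.5 * (p_hard(theta) + p_easy(theta))

def passk(theta, k=8):
    f = lambda p: 1.0 - (1.0 - p) ** k
    return 0.5 * (f(p_hard(theta)) + f(p_easy(theta)))

# One ascent step along the pass@k gradient (central finite difference).
theta, eta, h = 0.0, 0.5, 1e-6
gk = (passk(theta + h) - passk(theta - h)) / (2 * h)
theta_new = theta + eta * gk

print(pass1(theta_new) < pass1(theta))  # True: pass@1 dropped
print(passk(theta_new) > passk(theta))  # True: pass@k rose
```

The pass@k gradient is dominated by the heavily weighted hard prompt, so the step moves theta in a direction that helps the hard prompt slightly but hurts the easy prompt more, lowering average pass@1.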
Figure 3 interpretation: This contour plot illustrates the pass@1 and pass@k objectives in parameter space. The gradients are conflicting in the gray area, meaning updates for pass@k can move the policy in a direction that lowers pass@1.
Experimental Evidence & Real-world Models
We empirically test whether the pass@k objective can induce gradient conflict with pass@1 on math reasoning, as predicted by Proposition 4.1. We use the MATH dataset (Hendrycks et al., 2021) and run experiments with two reasoning models: DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Qwen-7B.
Our experiments validate the theoretical predictions in several ways. First, the agreement scores show a clear separation between hard prompts (agreement scores clustered below zero) and easy prompts (clustered above zero), confirming that prompt interference exists in practice.
Second, the pass@k weights reveal extreme disparity in how pass@k values different prompts. Hard prompts (low pass@1 values) receive weights orders of magnitude higher than easy prompts, demonstrating the extreme reweighting mechanism our theory identifies.
Third, the weighted agreement scores demonstrate the consequence of this reweighting. For Llama-8B, the gradient alignment flips from positive to negative, resulting in an inner product of −0.613. For Qwen-7B, despite only 29 hard prompts versus 627 easy ones, the extreme weight disparity causes an even more dramatic shift (∆ = −3.04 × 10^−1), yielding a strongly negative inner product of −181.
Figure 6 interpretation: These panels vividly show how pass@k reweighting causes gradient conflict. Column A shows prompt interference. Column B highlights the extreme reweighting of hard prompts by pass@k. Column C demonstrates the resulting downward shift in weighted agreement, leading to a negative inner product between pass@k and pass@1 gradients, confirming the conflict.
Causal Chain of Pass@1 Degradation
| Aspect | Per-Prompt View | Population View |
|---|---|---|
| ∇Jk & ∇J1 Collinearity | The per-prompt pass@k gradient wk(pθ(x))∇pθ(x) is a positive multiple of the per-prompt pass@1 gradient, so the two are collinear. | Averaged over prompts, the pass@k and pass@1 gradients need not be collinear and can form an obtuse angle. |
| Prompt Weighting | Pass@1 weighs every prompt equally. | Pass@k implicitly reweights toward low-success prompts via wk(p) = k(1 − p)^(k−1). |
| Risk to Pass@1 | None: each per-prompt pass@k update also increases that prompt's success probability. | Under negative interference, upweighted hard prompts can pull the update against the population pass@1 gradient, decreasing pass@1. |
Real-world LLM Behavior: DeepSeek Models
Our experiments with DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Qwen-7B models validate the theoretical predictions. For Llama-8B, the gradient alignment flips from positive to negative (a shift of ∆ = −3.92 × 10^−3), resulting in an inner product of −0.613. For Qwen-7B, despite a significant imbalance (21.6:1 ratio of easy to hard prompts), the extreme weight disparity caused by pass@k optimization leads to an even stronger negative inner product of −181. This empirically confirms that pass@k's reweighting mechanism systematically amplifies negatively interfering hard prompts, causing substantial pass@1 degradation.
Optimizing pass@k without considering prompt interference can inadvertently harm pass@1 performance, which is crucial for many practical applications. Future work should focus on designing methods that mitigate this gradient conflict, enabling LLMs to achieve multi-attempt gains while preserving strong single-shot accuracy. The similarity kernel we introduce offers a valuable tool for gradient surgery to achieve this balance.