Enterprise AI Analysis
Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation
Large language models (LLMs), despite their impressive performance across a wide range of tasks, often struggle to balance two competing objectives in open-ended text generation: fostering diversity and creativity while preserving logical coherence. Existing truncated sampling techniques, including temperature scaling, top-p (nucleus) sampling, and min-p sampling, aim to manage this trade-off. However, they exhibit limitations, particularly in how effectively they incorporate the model's confidence into the sampling strategy. For example, min-p sampling relies on a single top token as a heuristic for confidence, thereby underutilizing the information in the probability distribution. To incorporate model confidence more effectively, in this paper we present top-H decoding. We first establish the theoretical foundation of the interplay between creativity and coherence in truncated sampling by formulating an entropy-constrained minimum divergence problem. We then prove this minimization problem to be equivalent to an entropy-constrained mass maximization (ECMM) problem, which is NP-hard. Finally, we present top-H decoding, a computationally efficient greedy algorithm to solve the ECMM problem. Extensive empirical evaluations demonstrate that top-H outperforms the state-of-the-art (SoTA) alternative of min-p sampling by up to 25.63% on creative writing benchmarks, while maintaining robustness on question-answering datasets such as GPQA, GSM8K, and MT-Bench. Additionally, an LLM-as-judge evaluation confirms that top-H indeed produces coherent outputs even at higher temperatures, where creativity is especially critical. In summary, top-H advances SoTA in open-ended text generation and can be easily integrated into creative writing applications. The code is available at https://github.com/ErfanBaghaei/Top-H-Decoding.
Executive Impact & Key Metrics
Top-H Decoding introduces a novel approach to text generation, addressing the critical balance between creativity and coherence in large language models (LLMs). Unlike existing methods such as min-p sampling, which rely on single-token heuristics, Top-H dynamically selects tokens by solving an entropy-constrained minimum divergence (ECMD) objective, shown to be equivalent to an NP-hard entropy-constrained mass maximization (ECMM) problem. A greedy algorithm efficiently approximates the solution, bounding the uncertainty of the sampled distribution while maximizing the probability mass it retains. Empirical evaluations demonstrate that Top-H significantly outperforms existing samplers, achieving up to 25.63% higher accuracy on creative writing benchmarks and maintaining robustness across varying temperatures on reasoning and QA tasks. LLM-as-a-judge evaluations further validate its ability to produce more coherent and creative outputs, making Top-H a state-of-the-art method for open-ended text generation.
Deep Analysis & Enterprise Applications
Entropy-Constrained Minimum Divergence (ECMD)
Top-H's theoretical foundation lies in the Entropy-Constrained Minimum Divergence (ECMD) problem. This formulation explicitly balances creativity and coherence by minimizing the Jensen-Shannon divergence between the chosen token distribution (q) and the original model distribution (p), subject to an upper bound on the entropy of q controlled by a parameter α (H(q) ≤ α·H(p)). This is equivalent to the Entropy-Constrained Mass Maximization (ECMM) problem, which seeks to maximize the sum of probabilities of the selected tokens while adhering to the entropy constraint. This innovative approach ensures dynamic adaptation to the model's confidence, fostering exploration in uncertain contexts and coherence in certain ones.
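For reference, the two problems can be written compactly as below. This is a sketch based on the description above, where q_S denotes p restricted to a token set S and renormalized; the notation is introduced here for clarity and may differ from the paper's exact statement.

```latex
\begin{aligned}
\textbf{(ECMD)}\quad & \min_{q}\; \mathrm{JSD}\left(q \,\|\, p\right)
  \quad \text{s.t.}\quad H(q) \le \alpha\, H(p) \\
\textbf{(ECMM)}\quad & \max_{S \subseteq V}\; \Gamma_S = \sum_{i \in S} p_i
  \quad \text{s.t.}\quad H(q_S) \le \alpha\, H(p),
  \qquad q_S(i) = \frac{p_i}{\sum_{j \in S} p_j}\ \text{for } i \in S
\end{aligned}
```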
The core Entropy-Constrained Mass Maximization (ECMM) problem, which Top-H aims to solve, has been proven to be NP-hard. This finding, established through a polynomial-time reduction from the Cardinality-Constrained Subset Sum problem, underscores the inherent computational complexity of precisely balancing creativity and coherence. While finding an optimal solution is intractable for general cases, Top-H employs an efficient greedy approximation algorithm to achieve practical and competitive results.
Top-H Greedy Algorithm
To address the NP-hard nature of ECMM, Top-H implements a computationally efficient greedy algorithm. It iteratively selects tokens in descending order of their probabilities, adding them to a sampling set S as long as the entropy of the resulting distribution q remains below the dynamic threshold α · H(p). This ensures that the algorithm maximizes the total probability mass of selected tokens while keeping the randomness (uncertainty) within a controlled, adaptive bound. The algorithm guarantees termination before all tokens are selected, ensuring efficiency and focused sampling.
Algorithm 1 Top-H: proposed greedy token selection algorithm
Require: Probability mass function p = (p1, p2, ..., pn), entropy threshold coefficient α ∈ (0, 1)
Ensure: Selected token set S
1: Sort tokens in descending order of probability: p1 ≥ p2 ≥ ... ≥ pn
2: Initialize S ← Ø, H(q) ← 0
3: for each token i in sorted order do
4:     Add token i to S
5:     Compute updated distribution q over S
6:     Compute entropy H(q)
7:     if H(q) > α · H(p) then
8:         Remove token i from S
9:         break
10:     end if
11: end for
12: return S
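For concreteness, here is a minimal NumPy sketch of Algorithm 1. The function name top_h_select and its exact interface are illustrative, not taken from the paper's released code.

```python
import numpy as np

def top_h_select(p: np.ndarray, alpha: float = 0.4) -> np.ndarray:
    """Greedily select a token set whose renormalized entropy stays below alpha * H(p)."""
    # Entropy of the full next-token distribution p.
    nz = p[p > 0]
    h_p = -np.sum(nz * np.log(nz))
    # Walk tokens in descending order of probability, ignoring zero-probability tokens.
    order = np.argsort(p)[::-1]
    order = order[p[order] > 0]
    selected = []
    for idx in order:
        candidate = selected + [int(idx)]
        q = p[candidate] / p[candidate].sum()   # distribution q renormalized over the candidate set
        h_q = -np.sum(q * np.log(q))            # entropy H(q) of the truncated distribution
        if h_q > alpha * h_p:                   # adding this token would violate the bound:
            break                               # drop it and stop (steps 7-9 of Algorithm 1)
        selected = candidate
    return np.array(selected)
```

Sampling then proceeds from p restricted to the returned indices and renormalized; the default alpha = 0.4 mirrors the value discussed in the tuning section below.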
Creative Writing Performance
On creative writing benchmarks like Alpaca-Eval and MT-Bench, Top-H significantly outperforms existing sampling methods, especially at higher temperatures where creativity is crucial. For instance, Top-H achieves substantial improvements in win-rate and judge scores.
| Model (Temperature) | Min-p Win Rate (%) | Top-H Win Rate (%) | Win Rate Delta (Top-H − Min-p) |
|---|---|---|---|
| LLaMA3.1-8B-Instruct (T=1.5) | 31.16 | 36.52 | +5.36 |
| Qwen2.5-3B (T=2.0) | 20.18 | 27.53 | +7.35 |
| Phi-3-Mini (T=2.0) | 25.68 | 32.82 | +7.14 |

Data from Figure 2 (Alpaca-Eval win rates). Top-H demonstrates up to a 17.11% win-rate improvement over min-p across settings.
Reasoning & CoT Task Accuracy
Top-H demonstrates competitive performance on reasoning tasks (GSM8K, GPQA), consistently outperforming or matching min-p and top-p, especially as temperature increases, where other methods tend to degrade significantly.
| Model (Dataset, Temp) | Min-p Accuracy (%) | Top-p Accuracy (%) | Top-H Accuracy (%) |
|---|---|---|---|
| LLaMA3.1-8B-Instruct (GSM8K, T=2.0) | 13.72 | 2.65 | 39.35 |
| Qwen2.5-3B (GPQA, T=2.0) | 25.00 | 22.32 | 28.12 |
| Phi-3-Mini (GSM8K, T=2.0) | 60.88 | 7.73 | 60.20 |

Data from Table 1 (GSM8K) and Table 2 (GPQA). Top-H shows up to a 25.63-percentage-point accuracy improvement over min-p on GSM8K (LLaMA3.1-8B-Instruct, T=2.0).
LLM-as-a-Judge: Creativity and Coherence
Evaluations using GPT-4o as an LLM-as-a-judge confirm Top-H's superior balance of creativity and coherence. At lower temperatures, Top-H already leads, and this advantage becomes more pronounced at higher temperatures, where other methods often produce fragmented or incoherent text.
| Sampling Method | LLaMA3.1-8B-Instruct (T=1.0 Avg Score) | LLaMA3.1-8B-Instruct (T=2.0 Avg Score) |
|---|---|---|
| Top-p | 7.42 | 6.28 |
| Min-p | 7.38 | 7.20 |
| Top-H | 8.25 | 8.50 |

Scores are averages of M1-M5 across three prompts, derived from Table 3 (LLaMA3.1-8B-Instruct).
Coherence Robustness Across Temperatures
Traditional sampling methods like min-p and top-p show a sharp decline in generated text coherence (measured by total log-probability) as temperature increases. This indicates their sensitivity and tendency to produce fragmented text at higher temperatures, where diversity is sought. In contrast, Top-H adaptively adjusts its entropy constraint to the next token's probability distribution, allowing it to maintain significantly more consistent and coherent outputs, even in high-temperature settings (see Figure 3 in the paper).
Tuning the Entropy Threshold Coefficient (α)
The parameter α in Top-H directly controls the maximum allowable entropy of the sampled distribution (H(q) ≤ α·H(p)), effectively governing the balance between creativity and coherence. Through extensive empirical evaluation using LLM-as-a-judge on development samples, it was found that an α value of 0.4 yields the highest average across creativity and coherence scores, striking an optimal balance. This allows Top-H to modulate randomness precisely according to the model's uncertainty (Figure 4 in the paper).
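As a quick illustration of the trade-off (reusing the hypothetical top_h_select sketch from the algorithm section, not the paper's code), a larger α admits more tokens into the sampling set:

```python
import numpy as np

# Toy 6-token vocabulary; softmax of hand-picked logits.
logits = np.array([3.0, 2.5, 2.0, 1.0, 0.5, 0.0])
p = np.exp(logits - logits.max())
p /= p.sum()

for alpha in (0.2, 0.4, 0.8):
    # A larger entropy budget alpha * H(p) lets more tokens survive truncation.
    print(alpha, top_h_select(p, alpha=alpha))
```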
Despite the NP-hard nature of the ECMM problem, empirical evaluations demonstrate that Top-H's greedy approximation algorithm performs remarkably well. The ratio of Top-H's generated probability mass to that of the theoretically optimal ECMM solution (Γ_S / Γ_S*) consistently hovers around 1.0 across various generation steps and prompts (Figure 5 in the paper). This strong empirical optimality confirms Top-H's practical effectiveness in approximating the ideal balance of creativity and coherence.
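Written out explicitly (an editorial restatement of the ratio above, not a formula copied from the paper):

```latex
\frac{\Gamma_{S}}{\Gamma_{S^{*}}}
  = \frac{\sum_{i \in S} p_i}{\sum_{i \in S^{*}} p_i} \;\le\; 1,
\qquad S = \text{top-H's selected set},\quad S^{*} = \text{an optimal ECMM set}
```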
Computational Overhead
Top-H is designed for efficiency and easy integration into existing LLM generation pipelines. Comparative runtime analysis shows that Top-H introduces only a negligible increase in processing time per token compared to top-p and min-p sampling methods, with an average overhead as low as 0.8% across diverse settings.
| Model | Method | T=1.0 (ms/token) | T=2.0 (ms/token) | Relative Overhead (Top-H vs Min-p, T=1.0) |
|---|---|---|---|---|
| LLaMA3.1-8B-Instruct | top-p | 27.4275 | 27.4389 | - |
| LLaMA3.1-8B-Instruct | min-p | 27.3396 | 27.3840 | 0% |
| LLaMA3.1-8B-Instruct | top-H | 28.3951 | 28.4671 | +3.86% |
| Phi-3-Mini-4K-Instruct | top-p | 23.7809 | 23.5844 | - |
| Phi-3-Mini-4K-Instruct | min-p | 23.6499 | 23.9397 | 0% |
| Phi-3-Mini-4K-Instruct | top-H | 24.3847 | 24.5929 | +3.11% |

Data from Table 5. Overall average overhead is as low as 0.8% (C.2 Computational overhead and timing comparisons).
Hugging Face Top-H LogitsProcessor
Top-H decoding can be easily integrated into existing Hugging Face-based LLM generation pipelines as a LogitsProcessor. This snippet illustrates the core implementation logic, demonstrating its simplicity and plug-and-play compatibility. Developers can instantiate this class and pass it to their model's generate function, enabling Top-H's dynamic creativity-coherence balancing with minimal effort.
```python
from transformers import LogitsProcessor
import torch
import torch.nn.functional as F
import numpy as np


class EntropyFilteringLogitsProcessor(LogitsProcessor):
    """Top-H style processor: keep the most probable tokens until the entropy
    of the renormalized selection exceeds alpha * H(p)."""

    def __init__(self, top_n=100, temperature=1.0, alpha=0.4):
        super().__init__()
        self.top_n = top_n              # only the top_n tokens are ever considered
        self.temperature = temperature  # temperature used when estimating probabilities
        self.alpha = alpha              # entropy threshold coefficient

    @staticmethod
    def calculate_entropy(probs):
        probs = probs[probs > 0]
        probs = probs / np.sum(probs)
        return -np.sum(probs * np.log(probs))

    def __call__(self, input_ids, scores):
        batch_size, vocab_size = scores.shape
        assert batch_size == 1, "this snippet assumes unbatched generation"

        # Next-token probabilities under the temperature-scaled model distribution.
        probs = F.softmax(scores / self.temperature, dim=-1)
        probs_np = probs[0].cpu().numpy()

        # Consider tokens in descending order of probability, capped at top_n.
        sorted_indices = np.argsort(probs_np)[::-1]
        top_n_indices = sorted_indices[:self.top_n]
        top_n_probs = probs_np[top_n_indices]

        # Entropy budget alpha * H(p), with H(p) approximated over the top_n tokens.
        threshold = self.alpha * self.calculate_entropy(top_n_probs)

        # Greedily grow the selected set while the renormalized entropy stays within budget.
        valid_indices = []
        for ind, idx in enumerate(top_n_indices, start=1):
            valid_indices.append(int(idx))
            if self.calculate_entropy(top_n_probs[:ind + 1]) > threshold:
                break

        # Mask out every token that was not selected; the original logits are
        # returned (masked), so any downstream sampling settings still apply.
        keep_mask = torch.zeros(vocab_size, dtype=torch.bool, device=scores.device)
        keep_mask[valid_indices] = True
        updated_scores = scores.clone()
        updated_scores[:, ~keep_mask] = float('-inf')
        return updated_scores
```
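A minimal usage sketch follows; the model checkpoint, prompt, and generation settings are placeholders, not values from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessorList

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Write a short story about a lighthouse keeper.", return_tensors="pt")
processor = EntropyFilteringLogitsProcessor(top_n=100, temperature=1.5, alpha=0.4)

# Pass the processor to generate(); sampling then only sees the top-H token set.
output_ids = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.5,
    max_new_tokens=256,
    logits_processor=LogitsProcessorList([processor]),
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```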
Your Journey to Advanced AI Generation
A typical implementation roadmap for integrating Top-H Decoding and similar advanced AI strategies into your enterprise.
Discovery & Strategy
Initial assessment of current LLM usage, identifying key pain points in creativity, coherence, and scalability. Define clear objectives and success metrics for enhanced text generation.
Proof-of-Concept & Customization
Pilot Top-H Decoding on a specific use case. Tailor the entropy threshold (α) and other parameters to optimize performance for your unique data and desired output characteristics. This phase includes initial integration with existing LLM infrastructure.
Integration & Testing
Full integration of Top-H Decoding into your production LLM pipelines. Comprehensive testing across various models, tasks, and temperatures to ensure robustness, performance, and adherence to quality standards. This includes A/B testing against current methods.
Deployment & Monitoring
Rollout of Top-H enhanced generation across relevant enterprise applications. Establish continuous monitoring for performance, coherence, and creativity metrics. Implement feedback loops for ongoing optimization and fine-tuning.
Scaling & Advanced Applications
Expand Top-H Decoding to broader use cases and larger language models. Explore advanced applications such as automated creative content generation, sophisticated dialogue systems, and personalized communication at scale.
Ready to Elevate Your LLM Capabilities?
Book a personalized consultation to explore how Top-H Decoding can transform your enterprise's AI text generation, balancing unparalleled creativity with unwavering coherence.