Enterprise AI Analysis
Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation
Large language models (LLMs), despite their impressive performance across a wide range of tasks, often struggle to balance two competing objectives in open-ended text generation: fostering diversity and creativity while preserving logical coherence. Existing truncated sampling techniques, including temperature scaling, top-p (nucleus) sampling, and min-p sampling, aim to manage this trade-off. However, they exhibit limitations, particularly in how effectively they incorporate the model's confidence into the sampling strategy. For example, min-p sampling relies on a single top token as a heuristic for confidence, thereby underutilizing the information in the probability distribution. To incorporate model confidence more effectively, in this paper we present top-H decoding. We first establish the theoretical foundation of the interplay between creativity and coherence in truncated sampling by formulating an entropy-constrained minimum divergence problem. We then prove this minimization problem to be equivalent to an entropy-constrained mass maximization (ECMM) problem, which is NP-hard. Finally, we present top-H decoding, a computationally efficient greedy algorithm to solve the ECMM problem. Extensive empirical evaluations demonstrate that top-H outperforms the state-of-the-art (SoTA) alternative of min-p sampling by up to 25.63% on creative writing benchmarks, while maintaining robustness on question-answering datasets such as GPQA, GSM8K, and MT-Bench. Additionally, an LLM-as-judge evaluation confirms that top-H indeed produces coherent outputs even at higher temperatures, where creativity is especially critical. In summary, top-H advances SoTA in open-ended text generation and can be easily integrated into creative writing applications. The code is available at https://github.com/ErfanBaghaei/Top-H-Decoding.
Executive Impact & Key Metrics
Top-H Decoding introduces a novel approach to text generation, addressing the critical balance between creativity and coherence in large language models (LLMs). Unlike existing methods such as min-p sampling, which rely on single-token heuristics, Top-H dynamically selects tokens by solving an entropy-constrained minimum divergence (ECMD) objective, shown to be equivalent to an NP-hard entropy-constrained mass maximization (ECMM) problem. A greedy algorithm efficiently approximates the solution, bounding the uncertainty of the sampled distribution while maximizing the probability mass it retains. Empirical evaluations demonstrate that Top-H significantly outperforms existing samplers, achieving up to 25.63% higher accuracy on creative writing benchmarks and maintaining robustness across varying temperatures on reasoning and QA tasks. LLM-as-a-judge evaluations further validate its ability to produce more coherent and creative outputs, making Top-H a state-of-the-art method for open-ended text generation.
Deep Analysis & Enterprise Applications
Entropy-Constrained Minimum Divergence (ECMD)
Top-H's theoretical foundation lies in the Entropy-Constrained Minimum Divergence (ECMD) problem. This formulation explicitly balances creativity and coherence by minimizing the Jensen-Shannon divergence between the chosen token distribution (q) and the original model distribution (p), subject to an upper bound on the entropy of q controlled by a parameter α (H(q) ≤ α·H(p)). This is equivalent to the Entropy-Constrained Mass Maximization (ECMM) problem, which seeks to maximize the sum of probabilities of the selected tokens while adhering to the entropy constraint. This innovative approach ensures dynamic adaptation to the model's confidence, fostering exploration in uncertain contexts and coherence in certain ones.
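For reference, the two problems can be written compactly as below. This is a sketch based on the description above, where q_S denotes p restricted to a token set S and renormalized; the notation is introduced here for clarity and may differ from the paper's exact statement.

```latex
\begin{aligned}
\textbf{(ECMD)}\quad & \min_{q}\; \mathrm{JSD}\left(q \,\|\, p\right)
  \quad \text{s.t.}\quad H(q) \le \alpha\, H(p) \\
\textbf{(ECMM)}\quad & \max_{S \subseteq V}\; \Gamma_S = \sum_{i \in S} p_i
  \quad \text{s.t.}\quad H(q_S) \le \alpha\, H(p),
  \qquad q_S(i) = \frac{p_i}{\sum_{j \in S} p_j}\ \text{for } i \in S
\end{aligned}
```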
The core Entropy-Constrained Mass Maximization (ECMM) problem, which Top-H aims to solve, has been proven to be NP-hard. This finding, established through a polynomial-time reduction from the Cardinality-Constrained Subset Sum problem, underscores the inherent computational complexity of precisely balancing creativity and coherence. While finding an optimal solution is intractable for general cases, Top-H employs an efficient greedy approximation algorithm to achieve practical and competitive results.
Top-H Greedy Algorithm
To address the NP-hard nature of ECMM, Top-H implements a computationally efficient greedy algorithm. It iteratively selects tokens in descending order of their probabilities, adding them to a sampling set S as long as the entropy of the resulting distribution q remains below the dynamic threshold α · H(p). This ensures that the algorithm maximizes the total probability mass of selected tokens while keeping the randomness (uncertainty) within a controlled, adaptive bound. The algorithm guarantees termination before all tokens are selected, ensuring efficiency and focused sampling.
Algorithm 1 Top-H: proposed greedy token selection algorithm
Require: Probability mass function p = (p1, p2, ..., pn), entropy threshold coefficient α ∈ (0, 1)
Ensure: Selected token set S
1: Sort tokens in descending order of probability: p1 ≥ p2 ≥ ... ≥ pn
2: Initialize S ← Ø, H(q) ← 0
3: for each token i in sorted order do
4:     Add token i to S
5:     Compute updated distribution q over S
6:     Compute entropy H(q)
7:     if H(q) > α · H(p) then
8:         Remove token i from S
9:         break
10:     end if
11: end for
12: return S
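For concreteness, here is a minimal NumPy sketch of Algorithm 1. The function name top_h_select and its exact interface are illustrative, not taken from the paper's released code.

```python
import numpy as np

def top_h_select(p: np.ndarray, alpha: float = 0.4) -> np.ndarray:
    """Greedily select a token set whose renormalized entropy stays below alpha * H(p)."""
    # Entropy of the full next-token distribution p.
    nz = p[p > 0]
    h_p = -np.sum(nz * np.log(nz))
    # Walk tokens in descending order of probability, ignoring zero-probability tokens.
    order = np.argsort(p)[::-1]
    order = order[p[order] > 0]
    selected = []
    for idx in order:
        candidate = selected + [int(idx)]
        q = p[candidate] / p[candidate].sum()   # distribution q renormalized over the candidate set
        h_q = -np.sum(q * np.log(q))            # entropy H(q) of the truncated distribution
        if h_q > alpha * h_p:                   # adding this token would violate the bound:
            break                               # drop it and stop (steps 7-9 of Algorithm 1)
        selected = candidate
    return np.array(selected)
```

Sampling then proceeds from p restricted to the returned indices and renormalized; the default alpha = 0.4 mirrors the value discussed in the tuning section below.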
Creative Writing Performance
On creative writing benchmarks like Alpaca-Eval and MT-Bench, Top-H significantly outperforms existing sampling methods, especially at higher temperatures where creativity is crucial. For instance, Top-H achieves substantial improvements in win-rate and judge scores.
| Model (Temperature) | Min-p Win Rate (%) | Top-H Win Rate (%) | Win Rate Delta (Top-H − Min-p) |
|---|---|---|---|
| LLaMA3.1-8B-Instruct (T=1.5) | 31.16 | 36.52 | +5.36 |
| Qwen2.5-3B (T=2.0) | 20.18 | 27.53 | +7.35 |
| Phi-3-Mini (T=2.0) | 25.68 | 32.82 | +7.14 |

Data from Figure 2 (Alpaca-Eval win rates). Top-H demonstrates up to a 17.11% win-rate improvement over min-p across settings.
Reasoning & CoT Task Accuracy
Top-H demonstrates competitive performance on reasoning tasks (GSM8K, GPQA), consistently outperforming or matching min-p and top-p, especially as temperature increases, where other methods tend to degrade significantly.
| Model (Dataset, Temp) | Min-p Accuracy (%) | Top-p Accuracy (%) | Top-H Accuracy (%) |
|---|---|---|---|
| LLaMA3.1-8B-Instruct (GSM8K, T=2.0) | 13.72 | 2.65 | 39.35 |
| Qwen2.5-3B (GPQA, T=2.0) | 25.00 | 22.32 | 28.12 |
| Phi-3-Mini (GSM8K, T=2.0) | 60.88 | 7.73 | 60.20 |

Data from Table 1 (GSM8K) and Table 2 (GPQA). Top-H shows up to a 25.63-percentage-point accuracy improvement over min-p on GSM8K (LLaMA3.1-8B-Instruct, T=2.0).
LLM-as-a-Judge: Creativity and Coherence
Evaluations using GPT-4o as an LLM-as-a-judge confirm Top-H's superior balance of creativity and coherence. At lower temperatures, Top-H already leads, and this advantage becomes more pronounced at higher temperatures, where other methods often produce fragmented or incoherent text.
| Sampling Method | LLaMA3.1-8B-Instruct (T=1.0 Avg Score) | LLaMA3.1-8B-Instruct (T=2.0 Avg Score) |
|---|---|---|
| Top-p | 7.42 | 6.28 |
| Min-p | 7.38 | 7.20 |
| Top-H | 8.25 | 8.50 |

Scores are averages of M1-M5 across three prompts, derived from Table 3 (LLaMA3.1-8B-Instruct).
Coherence Robustness Across Temperatures
Traditional sampling methods like min-p and top-p show a sharp decline in generated text coherence (measured by total log-probability) as temperature increases. This indicates their sensitivity and tendency to produce fragmented text at higher temperatures, where diversity is sought. In contrast, Top-H adaptively adjusts its entropy constraint to the next token's probability distribution, allowing it to maintain significantly more consistent and coherent outputs, even in high-temperature settings (see Figure 3 in the paper).
Tuning the Entropy Threshold Coefficient (α)
The parameter α in Top-H directly controls the maximum allowable entropy of the sampled distribution (H(q) ≤ α·H(p)), effectively governing the balance between creativity and coherence. Through extensive empirical evaluation using LLM-as-a-judge on development samples, it was found that an α value of 0.4 yields the highest average across creativity and coherence scores, striking an optimal balance. This allows Top-H to modulate randomness precisely according to the model's uncertainty (Figure 4 in the paper).
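As a quick illustration of the trade-off (reusing the hypothetical top_h_select sketch from the algorithm section, not the paper's code), a larger α admits more tokens into the sampling set:

```python
import numpy as np

# Toy 6-token vocabulary; softmax of hand-picked logits.
logits = np.array([3.0, 2.5, 2.0, 1.0, 0.5, 0.0])
p = np.exp(logits - logits.max())
p /= p.sum()

for alpha in (0.2, 0.4, 0.8):
    # A larger entropy budget alpha * H(p) lets more tokens survive truncation.
    print(alpha, top_h_select(p, alpha=alpha))
```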
Despite the NP-hard nature of the ECMM problem, empirical evaluations demonstrate that Top-H's greedy approximation algorithm performs remarkably well. The ratio of Top-H's generated probability mass to that of the theoretically optimal ECMM solution (Γ_S / Γ_S*) consistently hovers around 1.0 across various generation steps and prompts (Figure 5 in the paper). This strong empirical optimality confirms Top-H's practical effectiveness in approximating the ideal balance of creativity and coherence.
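Written out explicitly (an editorial restatement of the ratio above, not a formula copied from the paper):

```latex
\frac{\Gamma_{S}}{\Gamma_{S^{*}}}
  = \frac{\sum_{i \in S} p_i}{\sum_{i \in S^{*}} p_i} \;\le\; 1,
\qquad S = \text{top-H's selected set},\quad S^{*} = \text{an optimal ECMM set}
```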
Computational Overhead
Top-H is designed for efficiency and easy integration into existing LLM generation pipelines. Comparative runtime analysis shows that Top-H introduces only a negligible increase in processing time per token compared to top-p and min-p sampling methods, with an average overhead as low as 0.8% across diverse settings.
| Model | Method | T=1.0 (ms/token) | T=2.0 (ms/token) | Relative Overhead (Top-H vs Min-p, T=1.0) |
|---|---|---|---|---|
| LLaMA3.1-8B-Instruct | top-p | 27.4275 | 27.4389 | - |
| LLaMA3.1-8B-Instruct | min-p | 27.3396 | 27.3840 | 0% |
| LLaMA3.1-8B-Instruct | top-H | 28.3951 | 28.4671 | +3.86% |
| Phi-3-Mini-4K-Instruct | top-p | 23.7809 | 23.5844 | - |
| Phi-3-Mini-4K-Instruct | min-p | 23.6499 | 23.9397 | 0% |
| Phi-3-Mini-4K-Instruct | top-H | 24.3847 | 24.5929 | +3.11% |

Data from Table 5. Overall average overhead is as low as 0.8% (C.2 Computational overhead and timing comparisons).
Hugging Face Top-H LogitsProcessor
Top-H decoding can be easily integrated into existing Hugging Face-based LLM generation pipelines as a LogitsProcessor. This snippet illustrates the core implementation logic, demonstrating its simplicity and plug-and-play compatibility. Developers can instantiate this class and pass it to their model's generate function, enabling Top-H's dynamic creativity-coherence balancing with minimal effort.
```python
from transformers import LogitsProcessor
import torch
import torch.nn.functional as F
import numpy as np


class EntropyFilteringLogitsProcessor(LogitsProcessor):
    """Top-H style processor: keep the most probable tokens until the entropy
    of the renormalized selection exceeds alpha * H(p)."""

    def __init__(self, top_n=100, temperature=1.0, alpha=0.4):
        super().__init__()
        self.top_n = top_n              # only the top_n tokens are ever considered
        self.temperature = temperature  # temperature used when estimating probabilities
        self.alpha = alpha              # entropy threshold coefficient

    @staticmethod
    def calculate_entropy(probs):
        probs = probs[probs > 0]
        probs = probs / np.sum(probs)
        return -np.sum(probs * np.log(probs))

    def __call__(self, input_ids, scores):
        batch_size, vocab_size = scores.shape
        assert batch_size == 1, "this snippet assumes unbatched generation"

        # Next-token probabilities under the temperature-scaled model distribution.
        probs = F.softmax(scores / self.temperature, dim=-1)
        probs_np = probs[0].cpu().numpy()

        # Consider tokens in descending order of probability, capped at top_n.
        sorted_indices = np.argsort(probs_np)[::-1]
        top_n_indices = sorted_indices[:self.top_n]
        top_n_probs = probs_np[top_n_indices]

        # Entropy budget alpha * H(p), with H(p) approximated over the top_n tokens.
        threshold = self.alpha * self.calculate_entropy(top_n_probs)

        # Greedily grow the selected set while the renormalized entropy stays within budget.
        valid_indices = []
        for ind, idx in enumerate(top_n_indices, start=1):
            valid_indices.append(int(idx))
            if self.calculate_entropy(top_n_probs[:ind + 1]) > threshold:
                break

        # Mask out every token that was not selected; the original logits are
        # returned (masked), so any downstream sampling settings still apply.
        keep_mask = torch.zeros(vocab_size, dtype=torch.bool, device=scores.device)
        keep_mask[valid_indices] = True
        updated_scores = scores.clone()
        updated_scores[:, ~keep_mask] = float('-inf')
        return updated_scores
```
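A minimal usage sketch follows; the model checkpoint, prompt, and generation settings are placeholders, not values from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessorList

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Write a short story about a lighthouse keeper.", return_tensors="pt")
processor = EntropyFilteringLogitsProcessor(top_n=100, temperature=1.5, alpha=0.4)

# Pass the processor to generate(); sampling then only sees the top-H token set.
output_ids = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.5,
    max_new_tokens=256,
    logits_processor=LogitsProcessorList([processor]),
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```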
Your Journey to Advanced AI Generation
A typical implementation roadmap for integrating Top-H Decoding and similar advanced AI strategies into your enterprise.
Discovery & Strategy
Initial assessment of current LLM usage, identifying key pain points in creativity, coherence, and scalability. Define clear objectives and success metrics for enhanced text generation.
Proof-of-Concept & Customization
Pilot Top-H Decoding on a specific use case. Tailor the entropy threshold (α) and other parameters to optimize performance for your unique data and desired output characteristics. This phase includes initial integration with existing LLM infrastructure.
Integration & Testing
Full integration of Top-H Decoding into your production LLM pipelines. Comprehensive testing across various models, tasks, and temperatures to ensure robustness, performance, and adherence to quality standards. This includes A/B testing against current methods.
Deployment & Monitoring
Rollout of Top-H enhanced generation across relevant enterprise applications. Establish continuous monitoring for performance, coherence, and creativity metrics. Implement feedback loops for ongoing optimization and fine-tuning.
Scaling & Advanced Applications
Expand Top-H Decoding to broader use cases and larger language models. Explore advanced applications such as automated creative content generation, sophisticated dialogue systems, and personalized communication at scale.
Ready to Elevate Your LLM Capabilities?
Book a personalized consultation to explore how Top-H Decoding can transform your enterprise's AI text generation, balancing unparalleled creativity with unwavering coherence.