
Enterprise AI Research Analysis

UnMaskFork: Test-Time Scaling for Masked Diffusion via Deterministic Action Branching

This report provides a comprehensive analysis of "UnMaskFork", a novel test-time scaling framework for Masked Diffusion Language Models (MDLMs). It highlights its innovative use of deterministic action branching via Monte Carlo Tree Search (MCTS) to enhance performance on complex coding and mathematical reasoning tasks, outperforming existing stochastic scaling methods.

Executive Impact & Key Metrics

UnMaskFork significantly advances the capabilities of Masked Diffusion Language Models, offering enhanced performance and reliability in high-stakes generative AI applications, particularly for code and mathematical reasoning.

88% Pass@1 on HumanEval+ (UMF, NFE = 12288)
30.0% Pass@1 on LiveCodeBench (UMF, NFE = 24576)
+11.43 points Pass@1 on MATH (UMF, NFE = 12288)
~50% Cache Hit Rate (NFE = 12288)

Deep Analysis & Enterprise Applications


Abstract

Test-time scaling strategies have effectively leveraged inference-time compute to enhance the reasoning abilities of Autoregressive Large Language Models. In this work, we demonstrate that Masked Diffusion Language Models (MDLMs) are inherently amenable to advanced search strategies, owing to their iterative and non-autoregressive generation process. To leverage this, we propose UnMaskFork (UMF), a framework that formulates the unmasking trajectory as a search tree and employs Monte Carlo Tree Search to optimize the generation path. In contrast to standard scaling methods relying on stochastic sampling, UMF explores the search space through deterministic partial unmasking actions performed by multiple MDLMs. Our empirical evaluation demonstrates that UMF consistently outperforms existing test-time scaling baselines on complex coding benchmarks, while also exhibiting strong scalability on mathematical reasoning tasks.

Introduction

The scaling laws of Large Language Models (LLMs) have recently expanded beyond pre-training parameters to inference-time compute. Specifically, test-time scaling (TTS) (Wang et al., 2023; Brown et al., 2024; Snell et al., 2025) demonstrates that allocating additional compute budget during inference, typically via Best-of-N or tree search, can significantly enhance reasoning capabilities in Autoregressive Models (AR-LLMs).

Parallel to this, Masked Diffusion Language Models (MDLMs) (Austin et al., 2021; Hoogeboom et al., 2021; Shi et al., 2024; Lou et al., 2024; Sahoo et al., 2024), which model generation as an iterative transition from a fully masked token sequence to clean text, have emerged as a compelling non-autoregressive alternative to AR-LLMs. Through large-scale pretraining, modern MDLMs (Nie et al., 2025; Ye et al., 2025; Gong et al., 2025; Xie et al., 2025) are approaching the performance of AR-LLMs of similar model size. However, compared to TTS research in AR-LLMs, unlocking the full reasoning potential of MDLMs through test-time scaling remains relatively unexplored.

In this work, we empirically demonstrate that applying standard AR-based scaling strategies, such as Best-of-N with elevated temperature to encourage diversity, is ineffective for MDLMs. Our extensive experiments on coding benchmarks reveal that increasing the temperature across the entire unmasking schedule degrades generation quality, hindering performance improvements despite the increased diversity. We hypothesize that while stochastic sampling benefits autoregressive models, MDLMs rely on global iterative refinement: early errors introduced by high-temperature sampling can propagate through subsequent denoising steps, disrupting global consistency and leading to irreversible structural defects. These observations underscore the need for a scaling paradigm that derives diversity from structural variations rather than stochastic noise injection.

To address this need, we propose UnMaskFork (UMF), a sample-efficient test-time scaling framework tailored for masked diffusion. Instead of relying on stochastic sampling within a single model, UMF achieves exploration through deterministic action branching. We formulate the unmasking trajectory as a search tree where branches represent distinct unmasking decisions made by different pretrained MDLMs or deterministic strategies (e.g., varying inference parameters), and empirically verify that using multiple MDLMs leads to improvements. This approach replaces stochastic noise with deterministic actions that yield high-quality, distinct trajectories. Furthermore, this determinism allows for efficient node caching: UMF caches and reuses partially unmasked states, further optimizing the compute budget (measured in Number of Function Evaluations, NFE) during Monte Carlo Tree Search (MCTS) and consistently outperforming other TTS baselines.

Contributions. ① We empirically show that temperature-based stochastic scaling in MDLMs is often less sample-efficient across the budget regimes we tested, and that aggressive stochasticity can degrade generation quality. ② We propose UMF, an inference method that scales masked diffusion inference by exploring a tree of unmasking trajectories. By treating distinct MDLMs or inference parameters as discrete actions in MCTS, we achieve diverse exploration without sacrificing generation quality. ③ We demonstrate that UMF consistently outperforms existing baselines, such as Best-of-N and DTS*, on complex coding benchmarks including LiveCodeBench, HumanEval+, and MBPP+, and we also show that UMF is effective on mathematical reasoning tasks.

2. Related Work

Inference-time scaling and alignment for diffusion models. For diffusion models, inference-time scaling has been explored through sampler modifications and longer or structured sampling procedures. In discrete masked diffusion, ReMDM (Wang et al., 2025) introduces principled remasking to allow tokens to be revisited and corrected, enabling improved quality as the number of sampling steps increases. Another line of work treats inference-time alignment to reward functions as a sampling/search problem: PG-DLM (Dang et al., 2025), for example, applies particle Gibbs and conditional SMC kernels to resample entire denoising trajectories, enabling trajectory-level refinement under reward guidance without retraining. Finally, the broader "diffusion + MCTS" trend also appears in planning, where diffusion-based trajectory generators are combined with MCTS-style search for improved test-time planning scalability (Yoon et al., 2025).

Tree search for diffusion language model inference. Recent work has started to apply explicit tree search to diffusion LM decoding itself, motivated by the combinatorial nature of choosing unmasking positions and committing tokens. Diffusion Tree Sampling (DTS) (Jain et al., 2025b) constructs a tree over the unmasking process and propagates terminal rewards to reuse past computation for scalable inference-time alignment. MEDAL (Huang et al., 2025) uses MCTS at the initialization stage to explore high-confidence unmasking trajectories and provide a stronger starting point for subsequent refinement. TREASURE (Yu et al., 2025) proposes a test-time alignment method tailored to masked diffusion, introducing branching and low-variance scoring mechanisms to address correlated branches and high-variance reward estimates when applying tree search to parallel unmasking. Our work complements these efforts by exploiting a distinct axis of exploration: Instead of relying primarily on stochastic branching within a single model, we define search actions at the level of selecting among multiple pretrained MDLMs (and deterministic inference configurations), enabling diverse yet high-quality partial unmasking decisions and efficient reuse of deterministic rollouts via caching. Structurally, we note that while DTS employs stochastic rollouts and incorporates all intermediate nodes along the path into the search tree, UMF utilizes deterministic rollouts that are cached for efficient reuse rather than being explicitly added to the search tree.

3. Preliminaries

In this section, we define the necessary notation and MDLM unmasking process to formulate the tree search in UMF.

3.1. Partially-masked state and mask ratio

Let V be the vocabulary (i.e., the set of tokens), m be the mask token, and U := V ∪ {m} be the extended vocabulary. Conditioned on a prompt token sequence x^prompt ∈ V^np of length np, the model generates ng tokens. Let the total length be n := np + ng, and denote the "partially-masked state" during inference as z ∈ U^n. The prompt is fixed such that z0:np−1 = x^prompt.

We define the set of mask positions in the generation segment (index set Ig := {np, ..., n − 1}) as M(z) := {i ∈ Ig | zi = m}. We also define the "residual mask ratio" of the generation segment as p(z) := |M(z)|/ng. The initial state is z = (x^prompt, m, ..., m) (where the generation segment is fully masked), and the terminal state z0 satisfies p(z0) = 0 (i.e., fully unmasked).

3.2. MDLM prediction and unmask transition

The MDLM predicts, for each position i, a categorical distribution pθ,i(· | z) over tokens conditioned on the current state z. Specifically, the model produces logits ℓθ,i(· | z), which are converted into a tempered distribution by scaling and normalizing: pθ,T,i(x | z) := softmax(ℓθ,i(x | z)/T). Here, T = 1 recovers the original model distribution, T < 1 sharpens it, and T > 1 flattens it. In the limit T → 0, the distribution pθ,T,i concentrates its probability mass on the token with the highest logit, corresponding to greedy selection (xi := arg max_x ℓθ,i(x | z)). Conversely, for T > 0, we select tokens via stochastic sampling: xi ∼ pθ,T,i(· | z).

In a single unmasking step, given the current state zt, we (1) obtain candidate tokens via model prediction and (2) select a subset of positions St ⊆ M(zt) to commit (unmask and fix). Specifically, for each i ∈ St, we sample x̂i ∼ pθ,T,i(· | zt) (or take the argmax) to construct the next state zt−1: zt−1,i = 1[i ∈ St] x̂i + 1[i ∉ St] zt,i. Positions not in St remain unchanged. We assume monotonic unmasking, meaning positions are never re-masked once committed: M(zt−1) ⊆ M(zt).
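To make the transition concrete, here is a minimal NumPy sketch of a single step under the definitions above; the MASK id, the commit-set size k, and the greedy tie-breaking are illustrative assumptions, not details from the paper:

```python
import numpy as np

MASK = -1  # illustrative mask-token id (not the models' actual id)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def unmask_step(z, logits, k, T=0.0, rng=None):
    """One unmasking step: commit the k most confident masked positions.

    z:      (n,) int array with MASK at still-masked positions
    logits: (n, V) per-position token logits from one MDLM forward pass
    k:      number of positions to commit (unmask and fix) this step
    T:      sampling temperature; T == 0 means greedy selection
    """
    masked = np.where(z == MASK)[0]
    if T == 0.0:
        cand = logits[masked].argmax(axis=-1)          # greedy candidate tokens
        conf = logits[masked].max(axis=-1)             # confidence = top logit
    else:
        probs = softmax(logits[masked] / T)
        rng = rng or np.random.default_rng()
        cand = np.array([rng.choice(probs.shape[1], p=p) for p in probs])
        conf = probs[np.arange(len(masked)), cand]     # prob of the sampled token
    order = np.argsort(-conf)[:k]                      # low-confidence positions stay masked
    z_next = z.copy()
    z_next[masked[order]] = cand[order]
    return z_next
```

Positions outside the committed set keep their value, so repeated application is monotonic in the sense above.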

3.3. Action as an inference configuration

In UMF, we formulate the search space not as a probabilistic branching over tokens, but as a discrete selection of inference configurations. We define an action a ∈ A as a tuple a := (θa, Ta, ga). Here, θa specifies the model parameters (i.e., selecting one of multiple pre-trained MDLMs), Ta is the sampling temperature, and ga is the remasking strategy (e.g., entropy-based or low-confidence) that determines the commit set St from zt. Given a state zt, an action a induces a transition Fa: zt ↦ zt−1. Crucially, when using a low temperature Ta ≈ 0 (greedy decoding) combined with a deterministic strategy ga, the transition Fa becomes fully deterministic, assuming fixed tie-breaking rules. This determinism is key to UMF, as it enables efficient node caching by avoiding redundant computations for identical state-action pairs.
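A minimal sketch of such an action as a data structure (field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)   # frozen => hashable, so (state, action) pairs can key a cache
class Action:
    model_id: str         # theta_a: which pretrained MDLM performs the step
    temperature: float    # T_a: ~0 selects greedy decoding
    remask: str           # g_a: e.g. "entropy" or "low_confidence"
```

With Ta ≈ 0 and a deterministic ga, applying the same Action to the same state always yields the same successor, which is what makes the caching described next sound.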

3.4. Inference budget (NFE)

We measure the computational cost using the Number of Function Evaluations (NFE). Formally, NFE counts the total number of MDLM forward passes required to compute the distribution pθ,i(· | z). A single unmasking step typically consumes 1 NFE. However, a key advantage of UMF is that if the transition or rollout result for a pair (z, a) is retrieved from the cache, the NFE cost is zero.
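The zero-cost retrieval can be sketched as simple bookkeeping; `transition` stands in for one deterministic MDLM step and is an assumption of this sketch:

```python
nfe = 0       # running count of MDLM forward passes
cache = {}    # (state, action) -> next state

def cached_step(z, action, transition):
    """Apply one deterministic transition, paying 1 NFE only on a cache miss."""
    global nfe
    key = (z, action)                        # assumes z is a hashable tuple of token ids
    if key not in cache:
        cache[key] = transition(z, action)   # one MDLM forward pass
        nfe += 1
    return cache[key]                        # cache hit: zero NFE
```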

3.5. Remasking Strategies (ga)

The remasking strategy ga determines the subset of positions to be re-masked (or kept masked) based on the current state. Existing literature proposes both deterministic and stochastic approaches. Deterministic strategies typically target tokens with the lowest model confidence. Examples include the Entropy-based strategy (for Dream models (Ye et al., 2025)), which masks positions with the highest predictive entropy, and the Low-confidence strategy (for LLaDA models (Nie et al., 2025)), which masks positions where the model assigns the lowest probability to the selected token. Conversely, stochastic strategies, such as independent sampling ("origin") in Dream or random masking in LLaDA, introduce randomness into the selection process. We provide further details on these strategies in Appendix B.
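A sketch of the two deterministic criteria, given per-position predictive distributions `probs` of shape (positions, vocabulary); the function names are ours, not the papers':

```python
import numpy as np

def entropy_scores(probs):
    """Dream-style criterion: high predictive entropy = low confidence."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

def low_confidence_scores(probs):
    """LLaDA-style criterion: low probability of the chosen token = low confidence."""
    return -probs.max(axis=-1)        # negate so that larger score = less confident

def keep_masked(probs, num_keep, rule="entropy"):
    """Return indices of the num_keep least-confident positions (kept masked)."""
    score = entropy_scores(probs) if rule == "entropy" else low_confidence_scores(probs)
    return np.argsort(-score)[:num_keep]
```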

4. Methods

4.1. Motivation

In this section, we propose UnMaskFork (UMF). Our method reformulates the inference process as a tree search where specific partial unmasking configurations serve as discrete actions within MCTS. UMF treats a high-performing inference configuration (e.g., a specific MDLM) as an atomic action. While random remasking induces diversity, it often degrades performance compared to adaptive strategies (such as low-confidence masking) that utilize inference-time information (Nie et al., 2025; Zhu et al., 2025; Kim et al., 2025). Therefore, we adopt confidence-based deterministic remasking strategies with temperature T ≈ 0 as our primary actions. This approach offers two major advantages: (1) it avoids performance degradation caused by suboptimal stochastic perturbations, as we empirically demonstrate in Section 6; and (2) it enables efficient node caching by leveraging the deterministic nature of low-temperature rollouts, sharply distinguishing UMF from stochastic baselines.

4.2. UnMaskFork

UMF formulates the unmasking trajectory as a search tree where nodes represent partially masked states at specific masking ratios, and branches correspond to selected inference actions. We employ MCTS to optimize the generation path. To strictly adhere to a fixed compute budget during Test-Time Scaling (TTS), we follow existing methodologies (Jain et al., 2025b) and measure the Number of Function Evaluations (NFE), terminating the search once the budget is exhausted. NFE corresponds to the number of forward passes of the model pθ. We expand nodes according to a discrete schedule of mask ratios p. In this work, we adopt the schedule [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.2], which samples more frequently in the early stages to ensure the diversity of the trajectory.

Algorithm 1 outlines the UMF procedure. One iteration of MCTS consists of three steps: Select, Expand, and Backup.

Select. In the Select step, we traverse the tree from the root to a node where unexplored actions remain. UMF selects a child node using the standard UCT score (Kocsis & Szepesvári, 2006): UCT(N) = (Σi ri)/ntgt + Cexp √(log npar / ntgt), where the ri are the rewards accumulated by backups, npar and ntgt are the visit counts of the parent node and the target node, respectively, and Cexp is the exploration coefficient. We set Cexp = 1 in our experiments.
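As a sketch, the score is a one-liner under these definitions:

```python
import math

def uct(total_reward, n_tgt, n_par, c_exp=1.0):
    """Mean backed-up reward plus an exploration bonus (standard UCT)."""
    return total_reward / n_tgt + c_exp * math.sqrt(math.log(n_par) / n_tgt)
```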

Expand. In the Expand step, we select an unexplored action a from the chosen node N. In UMF, expanding a node entails advancing the state to the next predetermined residual mask ratio p. We achieve this via a procedure UnMaskToNextRatio, which iteratively applies atomic MDLM transitions (each consuming 1 NFE) while holding the action a fixed until the target ratio is reached. Upon generating the new child node Nnew, we immediately continue with a deterministic rollout until the sequence is fully unmasked to obtain a reward r. Crucially, throughout this process, we cache all intermediate nodes and the final reward r. Consequently, if a subsequent expansion visits a state-action pair (N, a) that is already cached, we retrieve the trajectory and reward with zero effective NFE cost.

Backup. Once the reward r is obtained, it is backed up to Nnew and all its ancestors. This value updates the node statistics used in the UCT score defined above for subsequent iterations.
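Putting the three steps together, one iteration might look like the following sketch, which uses the uct() helper above; the `Node` layout and the `expand_fn`/`rollout_fn` callables are illustrative assumptions, and Algorithm 1 in the paper is the authoritative version:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    state: tuple                     # partially-masked token sequence
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)
    untried_actions: list = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0

def umf_iteration(root, expand_fn, rollout_fn):
    """One Select / Expand / Backup cycle of UMF's MCTS (illustrative sketch)."""
    # Select: descend via UCT while the current node is fully expanded.
    node = root
    while not node.untried_actions and node.children:
        node = max(node.children,
                   key=lambda c: uct(c.total_reward, c.visits, node.visits))
    # Expand: apply an unexplored action down to the next scheduled mask ratio.
    if node.untried_actions:
        a = node.untried_actions.pop()
        child = Node(state=expand_fn(node.state, a), parent=node)
        node.children.append(child)
        node = child
    # Deterministic rollout to a fully unmasked state, scored by the reward signal;
    # with caching, a repeated (state, action) rollout costs zero NFE.
    r = rollout_fn(node.state)
    # Backup: propagate the reward from the new node up to the root.
    while node is not None:
        node.visits += 1
        node.total_reward += r
        node = node.parent
```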

4.3. Design of the Action Set

The design of the action set A is critical for performance. Extensive experiments in Section 6 confirmed that solely increasing temperature or randomizing mask order often hurts performance in MDLMs. Therefore, we always include the highest-performing deterministic configuration (low temperature with greedy remasking) as one of the actions. Since the root node's action is selected early in the MCTS process, this design ensures that the strong unmasking trajectory is explored first, guaranteeing performance in low-budget regimes. Regarding other actions, while we demonstrate in Section 6.3.2 that switching between different pre-trained MDLMs yields the most significant gains, actions with different inference parameters (e.g., temperatures) can also improve performance over other TTS baselines.

Handling Heterogeneous Tokenizers. When using multiple MDLMs with different tokenizers (e.g., Dream (Ye et al., 2025) and LLaDA (Nie et al., 2025)), direct token transfer is infeasible. To address this, we explicitly map special tokens (MASK, EOS, Pad) directly between models to ensure control compatibility. For the non-special tokens, we employ a text-based mapping strategy: segments are decoded into text via the source tokenizer and re-encoded by the target tokenizer. Empirically, we observed no structural degradation in the generated code resulting from this conversion. Furthermore, we emphasize that model switching does not occur at every denoising step, but only at the expansion nodes of the search tree. Given our maximum search depth of 7, the tokenizer is swapped at most 6 times throughout the entire 768-step generation process. Consequently, this infrequent switching introduces negligible instability compared to the benefit of utilizing diverse model priors.
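The text round-trip for ordinary tokens can be sketched as follows, assuming Hugging Face-style `decode`/`encode` methods; `special_map` is a hypothetical id-to-id table for MASK/EOS/pad:

```python
def transfer_state(z_src, src_tok, tgt_tok, special_map):
    """Re-express a partially unmasked state in the target model's vocabulary.

    Runs of ordinary tokens are decoded to text and re-encoded; special tokens
    (MASK, EOS, pad) are remapped id-to-id via the explicit special_map table.
    """
    out, run = [], []
    def flush():
        if run:
            text = src_tok.decode(run)
            out.extend(tgt_tok.encode(text, add_special_tokens=False))
            run.clear()
    for tok_id in z_src:
        if tok_id in special_map:     # special token: direct id mapping
            flush()
            out.append(special_map[tok_id])
        else:
            run.append(tok_id)        # ordinary token: accumulate for text round-trip
    flush()
    return out
```

Note that re-encoding can change segment lengths; re-anchoring mask positions after such a change is omitted from this sketch.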

5. Motivation and Analysis

In this section, we analyze the design choices of UMF. Specifically, we address two key design choices: (1) Why does forking the unmasking trajectory with different actions improve performance? (2) Why does UMF prioritize deterministic actions (e.g., switching models or heuristics) over standard stochastic sampling (e.g., increasing temperature), despite both offering diversity? We formulate these choices through the lens of diffusion kernel selection and sample efficiency under a fixed compute budget.

5.1. Inference as Adaptive Kernel Selection

The generation process of an MDLM can be viewed as a sequence of transitions using a reverse kernel pθ(zs | zt). Following the formulation by Sahoo et al. (2024), the negative Evidence Lower Bound (ELBO) for the trajectory decomposes into a sum of KL divergences between the true posterior q(zs | zt, x) and the model kernel pθ.

In the context of UMF, we define an action a ∈ A as a specific inference configuration, such as the choice of model parameters θa (e.g., distinct pre-trained models) or hyperparameters (e.g., temperature, remasking strategy). Selecting an action a at step t corresponds to choosing a specific kernel Ka(zt) from a family of available kernels.

We can view the tree search as optimizing a state-dependent switching policy πt(a | zt). Let εa(zt) be the expected KL divergence error for action a at state zt. A switching policy that dynamically selects the best action can strictly outperform any single static model. Formally, relying on the inequality between the sum of minimums and the minimum of sums, we have:

Σ_t min_{a∈A} ε_a(z_t) ≤ min_{a∈A} Σ_t ε_a(z_t).   (1)

Eq. (1) implies that even if no single model is superior across all steps, a trajectory formed by interleaving the "best local kernels" achieves a lower accumulated error. This provides the theoretical motivation for exploring diverse actions. Importantly, this logic holds for any set of diverse kernels, whether they arise from stochastic perturbations or structural differences (e.g., multi-model).
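A toy numeric check of Eq. (1), with arbitrary illustrative per-step errors:

```python
# Per-step KL errors for two hypothetical kernels A and B over four steps.
errs = {"A": [0.3, 0.1, 0.4, 0.2],
        "B": [0.1, 0.3, 0.2, 0.4]}

best_single = min(sum(v) for v in errs.values())        # min of sums = 1.0
switching = sum(min(e) for e in zip(*errs.values()))    # sum of mins = 0.6
assert switching <= best_single                         # Eq. (1) holds
```

Here neither kernel dominates step-by-step, yet the switching trajectory accumulates strictly less error than either kernel alone.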

5.2. Budget Efficiency: Deterministic vs. Stochastic Diversity

While both stochastic sampling (high temperature) and deterministic switching (multi-model) provide distinct unmasking trajectories, they differ fundamentally in their sample efficiency during tree search. This distinction is critical when operating under a fixed budget of Number of Function Evaluations (NFE).

In MCTS, we estimate the value Q(z, a) = E[R(τ) | z, a] of a node by rolling out trajectories τ and observing the terminal reward R. The reliability of this estimation is governed by the variance of the rollout. Introducing stochasticity (e.g., T > 0) to generate diverse actions makes R a random variable with variance Var[R] > 0. From standard Monte Carlo convergence rates, to estimate the value within an error margin ε with high confidence, the required number of rollouts m scales linearly with the variance: m ∝ Var[R]/ε². Consequently, stochastic actions necessitate repeated sampling (large m) to distinguish high-quality nodes from noise, consuming significant NFE for value estimation rather than exploration.

In contrast, multi-model UMF employs deterministic actions (T ≈ 0) using heterogeneous models. In this setting, the transition is deterministic given the action, implying Var[R] ≈ 0. Thus, a single rollout (m = 1) is sufficient to obtain the exact value of the branch. This design choice yields two practical advantages: (1) Exploration Width: By eliminating the need for repeated averaging, UMF can allocate its NFE budget to expand the search tree. (2) Effective Caching: Deterministic trajectories allow for aggressive node caching (Algorithm 1). If a state-action pair is revisited, the computation can be skipped entirely. This effectively reduces the marginal cost of exploring known regions of the search tree to zero.
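The budget arithmetic can be made concrete with a small simulation (the reward distribution and target error below are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.05                    # target error for the value estimate Q(z, a)

# Stochastic branch: terminal rewards are noisy, so Var[R] > 0.
rewards = 0.6 + 0.2 * rng.standard_normal(100_000)      # simulated rollout rewards
m_stochastic = int(np.ceil(rewards.var() / eps**2))     # m ~ Var[R] / eps^2  -> ~16

# Deterministic branch: the rollout is a function of (z, a), so Var[R] ~ 0.
m_deterministic = 1                                     # a single rollout suffices

print(m_stochastic, m_deterministic)
```

Under these numbers, a stochastic branch consumes roughly 16 rollouts' worth of NFE just to estimate one node's value, while a deterministic branch spends that budget on exploring new branches instead.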

Our empirical results in Table 4 support this analysis. While increasing temperature improves performance (confirming the benefit of kernel diversity from Sec. 5.1), it is consistently outperformed by the multi-model approach. We attribute this to the fact that the multi-model strategy achieves exploration of distinct modes without introducing the variance penalty, maximizing the efficiency of inference-time compute.

6. Experiments

6.1. Experimental Setup

6.1.1. BENCHMARKS

We evaluated our approach on coding tasks using 100 samples each from LiveCodeBench (Jain et al., 2025a), HumanEval+, and MBPP+ (via EvalPlus (Liu et al., 2023)). To compute the reward signal during the search, we utilized the public test cases for LiveCodeBench and the standard test cases for EvalPlus, defining the reward as the proportion of passed tests. Candidates with the highest reward were selected for final evaluation. We report the Pass@1 score computed on the private test set (LiveCodeBench) and the extended test set (EvalPlus).
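A minimal sketch of this reward, with a hypothetical test representation and no sandboxing:

```python
def reward(candidate_fn, tests):
    """Reward = fraction of (args, expected) test cases the candidate passes."""
    passed = 0
    for args, expected in tests:
        try:
            if candidate_fn(*args) == expected:
                passed += 1
        except Exception:
            pass                      # runtime errors count as failures
    return passed / len(tests)
```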

Beyond coding, we also evaluated UMF on the MATH dataset (Hendrycks et al., 2021). We sampled 15 problems from each of the 7 categories, totaling 105 problems. We employed Qwen2.5-Math-PRM-7B (Zhang et al., 2025) to calculate the reward, and for Pass@1 evaluation, we compared the generated answer with the ground truth.

6.1.2. MODELS

For coding tasks, we employed Dream-Coder-v0-Instruct-7B (Xie et al., 2025) and LLaDA-8B-Instruct (Nie et al., 2025). For math tasks, we used LLaDA-8B-Instruct and Dream-v0-Instruct-7B (Ye et al., 2025). For the Dream model, we set the temperature T = 0.1 and utilized the entropy-based remasking strategy. For LLaDA, we set T = 0 and used the low-confidence strategy. We did not employ block diffusion (Arriola et al., 2025), as its semi-autoregressive nature restricts the diversity of unmasking trajectories targeted by our method.

6.1.3. BASELINES

To empirically demonstrate the efficacy of UMF, we compare it against representative inference-time scaling methods ranging from standard Best-of-N (BoN) to advanced tree search algorithms, specifically diffusion-specific tree search (DTS* (Jain et al., 2025b)) and generic adaptive branching MCTS (AB-MCTS-M (Inoue et al., 2025)). This enables a controlled comparison under a matched NFE budget. We evaluated two primary configurations: (1) varying temperature (T ∈ {0.1, 0.5, 1.0}) with the deterministic remasking strategy (entropy and low-confidence); and (2) a randomized remasking strategy at low temperature (T ≈ 0). Under these configurations, we tested (A) Best-of-N and (B) DTS* using both Dream and LLaDA models. Additionally, to assess the benefits of multi-model budget allocation, we included a (C) "Pair" baseline that selects the higher-reward solution from independent Dream and LLaDA generations. We also compared against AB-MCTS with two distinct action spaces: (i) a multi-model setting utilizing both Dream-Coder and LLaDA unmasking actions, and (ii) a single-model setting using only Dream-Coder actions. Both configurations were evaluated with temperatures T ∈ {0.1, 0.5, 1.0}. In total, these 16 (= (3+1) × 2 × 2; temperature/remasking × algorithm × MDLM) + 6 (Pair) + 6 (AB-MCTS) baselines were evaluated across the three coding benchmarks.

For all experiments, we fixed the generation length to 768 tokens and adopted a schedule where one token is unmasked per function evaluation. This mitigates performance degradation from reducing generation length or unmasking multiple tokens simultaneously. Furthermore, to prevent premature padding generation at high temperatures, we applied an EoS padding penalty of 1e-12 for Dream models (T = 1) (Xie et al., 2025), and set the confidence of EoS tokens to 0 for LLaDA (Nie et al., 2025), as recommended.

6.2. Results

6.2.1. RESULTS AT FIXED NFE

Table 1 presents the performance comparison on coding tasks at a fixed budget of NFE = 12288. For non-UMF baselines, we report the best result among temperatures T ∈ {0.1, 0.5, 1.0}. As observed, UMF consistently outperforms all other tree-search and scaling baselines. Notably, UMF significantly outperforms the "Pair" baselines, which simply split the budget between Dream-Coder and LLaDA. This result confirms that mere access to multiple models is insufficient; the structured interaction provided by the tree search is essential for unlocking their combined potential. Crucially, the results show that strategies relying on random remasking degrade performance. Table 2 further shows the results for UMF on the MATH dataset. UMF maintains high performance at low budgets and also achieves an 11.43-point improvement at NFE = 12288. This suggests that UMF generalizes to reasoning tasks beyond coding, provided a valid reward signal exists.

6.2.2. PERFORMANCE SCALING WITH TEMPERATURE AND NFE

To analyze scaling behavior, Figure 2 illustrates performance curves for representative methods (full results for all the baselines in Appendix A). The plots reveal that UMF leverages the stability of low-temperature generation in low-budget regimes, while consistently improving performance as the budget increases. To test for saturation, we extended the LiveCodeBench evaluation to NFE = 24576, achieving a Pass@1 score of 30.0% (+2.0 points). This continued improvement validates that UMF effectively utilizes the compute budget through mechanisms such as caching. Notably, because UMF relies on nearly deterministic unmasking and UCT-based search, its performance scales stably without the fluctuations observed in stochastic baselines.

In contrast, while higher temperatures in baselines enhance diversity, they fail to surpass the T = 0.1 setting at NFE = 12288, suggesting that the quality loss from stochasticity outweighs the diversity benefits. The only baseline outperforming the deterministic Best-of-N (T = 0.1) was AB-MCTS (T = 0.1) using multiple models. This implies that exploring distinct modes via model diversity is superior to stochastic perturbation. UMF improves upon AB-MCTS by restricting branching to available actions and optimizing NFE usage.

6.3. Ablation Study

6.3.1. EFFECTIVENESS OF CACHING IN UMF

To quantify the benefit of caching, we evaluated UMF on LiveCodeBench with and without the caching mechanism. Table 3 shows the results and the cache hit rate, defined as the number of rollouts with cache hits divided by the total number of rollouts. We observe that for NFE > 3072, the method maintains a high cache hit rate around 50%, and the caching improves the performance for fixed NFE. This demonstrates that caching enables deeper search within the same NFE budget, leading to consistent performance improvements.

6.3.2. TYPES OF ACTIONS

We investigated the impact of different action definitions by comparing the multi-model approach against: (1) Temperature scaling (T = 0.1, 0.5), (2) Temperature scaling (T = 0.1, 1.0), and (3) Remasking strategies (entropy vs. origin). We used Dream-Coder for those baselines. Table 4 presents the results at NFE = 12288. While all UMF variants surpass other TTS baselines with the same budget, the multi-model configuration yields the most significant gains, confirming the superiority of structural diversity over stochastic perturbation.

6.3.3. EFFECT OF MULTIPLE MODELS

We aimed to decouple the benefit of multiple models contributing to one trajectory from the benefit of simply allocating budget across two independent models. We compared multi-model UMF against a “Pair” baseline where Dream-Coder and LLaDA solve each problem independently using single-model UMF (NFE = 6144 each, total 12288), and the best answer is selected. For the single-model UMF, we used the best-performing UMF configurations from Table 4. Table 5 demonstrates that multi-model UMF consistently outperforms the independent pair baseline. This result highlights the critical importance of interleaving model capabilities within a single unmasking trajectory.

7. Conclusion

In this work, we presented UnMaskFork (UMF), a principled test-time scaling framework tailored for Masked Diffusion Language Models. We identified that standard stochastic scaling methods, such as temperature sampling, often degrade the generation quality of MDLMs by disrupting the iterative unmasking trajectory. To address this, UMF replaces stochastic perturbations with deterministic actions that branch the search space using heterogeneous models or distinct inference heuristics. By strictly adhering to deterministic transitions, our method ensures that every explored path retains the high generation quality inherent to the model's optimal decoding process, while deriving diversity from distinct inference configurations. Crucially, this deterministic nature facilitates aggressive node caching, which improves sample efficiency under fixed compute budgets. Extensive evaluations on coding tasks demonstrate that UMF consistently outperforms existing baselines, including Best-of-N and Diffusion Tree Sampling. Additionally, results on MATH indicate that UMF remains effective for reasoning tasks other than coding.

Our findings suggest that maintaining high-quality deterministic backbones and leveraging diverse model priors through interleaved search is more effective than independent ensembling or introducing random noise for inference scaling in non-autoregressive models. Future work includes training a policy network or value function to guide the tree search, moving beyond heuristic-based UCT for even greater sample efficiency. Furthermore, exploring dynamic action spaces, where the set of candidate models or unmasking schedules adapts to the instance difficulty, could further optimize the trade-off between computational cost and reasoning depth.

88% Pass@1 on HumanEval+ with UMF at NFE=12288

Enterprise Process Flow

Select Node (UCT score)
Expand (Unexplored Action)
Deterministic Rollout
Cache Intermediate States
Backup Reward
Iterate (NFE budget)
Feature | UMF (Multi-model, Deterministic) | DTS* (Stochastic, Single-model) | Best-of-N (Stochastic, Single-model)
Diversity Source | Structural variations (models/heuristics) | Stochastic noise (temperature) | Stochastic noise (temperature)
Sampling Nature | Deterministic | Stochastic | Stochastic
Cache Efficiency | High (node caching) | Low (repeated rollouts) | N/A
Pass@1 on HumanEval+ (NFE = 12288) | 88% | 72% | 65%

UMF's Collaborative Code Generation

In a LiveCodeBench problem solved uniquely by UMF at higher NFE, the framework demonstrated a powerful collaborative generation approach. Initially, Dream-Coder was leveraged to outline the high-level implementation steps of the solution. As the unmasking trajectory progressed, LLaDA seamlessly integrated to fill in specific functional requirements at the end of the code. Subsequently, LLaDA continued to build out the core implementation, which was then refined and completed by Dream-Coder.

This dynamic interplay highlights UMF's ability to interleave the distinct capabilities of heterogeneous models—using each for what it excels at—to construct a comprehensive and correct solution that neither model could achieve independently. This strategic blending of models minimizes individual weaknesses and maximizes collective strengths, leading to superior problem-solving.


Your AI Implementation Roadmap

A typical journey to integrate advanced AI capabilities, from initial strategy to full-scale deployment and optimization.

Discovery & Strategy

Assess current workflows, identify AI opportunities, and define clear objectives and success metrics. Develop a tailored strategy aligned with your business goals.

Pilot & Prototyping

Implement a proof-of-concept on a focused use case. Evaluate performance, gather feedback, and iterate on the AI model and integration for optimal results.

Integration & Deployment

Seamlessly integrate the AI solution into your existing systems and infrastructure. Conduct rigorous testing and phased rollouts to ensure stability and performance.

Monitoring & Optimization

Continuously monitor AI performance, fine-tune models, and optimize workflows based on real-world data. Scale the solution across departments as benefits are realized.

Ready to Unmask Your Enterprise AI Potential?

Connect with our experts to explore how UnMaskFork's innovations can be tailored to drive unprecedented efficiency and reasoning capabilities within your organization.
