
Controlling Multimodal LLMs via Reward-guided Decoding

Mastering MLLM Outputs: Precision, Recall, and Efficiency

Our multimodal reward-guided decoding (MRGD) strategy gives enterprises unprecedented control over multimodal large language models, mitigating hallucinations and optimizing for specific business outcomes.

Executive Impact: Key Metrics

As MLLMs become critical for enterprise operations, controlling their outputs for accuracy and relevance is paramount. This analysis highlights how MRGD significantly reduces object hallucinations, improves visual grounding, and offers dynamic control over the precision-recall trade-off, leading to more reliable and efficient AI deployments.

~70% Reduction in Object Hallucinations (COCO)
Reward Model Validation Accuracy
More Sample-Efficient Than Rejection Sampling

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Multimodal Large Language Models (MLLMs) have shown great potential to solve a wide range of visiolinguistic tasks, while offering a general language interface to users. As the adoption of MLLMs increases, the demand to easily control their behavior to satisfy diverse user needs is emerging. Two needs, in particular, arise among the most important for users of MLLMs: a) control over the precision and thoroughness of their output (e.g., object recall), and b) control over the amount of compute spent to generate those outputs. For instance, a user with visual impairment using the system to understand their surroundings may want the MLLM to respond with highly precise outputs (as hallucinations might be highly undesirable), while avoiding overly high latency on limited compute (e.g., on a smartphone); instead, a user leveraging the MLLM to generate synthetic captions to train downstream models may prioritize more diverse and detailed outputs (even if it means tolerating lower precision) while having the flexibility to spend more compute.

In this paper, we tackle this problem and propose a method for inference-time alignment of MLLMs. Our method, called multimodal reward-guided decoding (MRGD), employs two reward functions, one tailored for hallucination reduction and one tailored for improving object recall. Using these reward functions as criteria for searching for better outputs, our method gives control over the two axes mentioned above: by giving the option to set a relative weight for each reward, it allows smooth control of the trade-off between object precision and recall in the MLLM's outputs; by varying the breadth of the search, we can control the trade-off between the amount of test-time compute and the degree of visual grounding (which encompasses both object precision and recall).

We propose a multimodal reward-guided decoding strategy to improve the controllability of MLLMs at inference time. We first build small yet effective multimodal reward models to evaluate different aspects of visual grounding, and later combine them for search-based guided decoding.

Building Multimodal Reward Models: The effectiveness of our guided decoding strategy hinges on the existence of a reward function capable of successfully evaluating how well a response satisfies a certain objective. We build two reward models (RMs) to incentivize precision and recall respectively: (1) an object hallucination reward model rhal, trained from preference data, and (2) a recall reward model rrec, obtained by combining pre-trained modules (object detector, word embedding, NLP tools).
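To make the modular recall reward concrete, the sketch below shows one way such a reward could be assembled from off-the-shelf components. The callables `detect_objects`, `extract_object_nouns`, and `embed`, as well as the similarity threshold, are stand-ins for whichever detector, NLP pipeline, and word-embedding model are available; they are illustrative assumptions, not the exact components used in the research.

```python
import numpy as np

def recall_reward(image, caption, detect_objects, extract_object_nouns, embed,
                  sim_threshold=0.6):
    """Fraction of detected objects that the caption actually mentions.

    detect_objects(image)      -> list of object labels from a pre-trained detector
    extract_object_nouns(text) -> list of noun phrases from an NLP pipeline
    embed(word)                -> unit-norm word embedding vector (numpy array)
    All three callables are hypothetical stand-ins for available modules.
    """
    detected = detect_objects(image)           # e.g. ["dog", "frisbee", "grass"]
    mentioned = extract_object_nouns(caption)  # e.g. ["dog", "park"]
    if not detected:
        return 1.0  # nothing in the image to recall

    covered = 0
    for obj in detected:
        # an object counts as covered if any mentioned noun is close in embedding space
        sims = [float(np.dot(embed(obj), embed(noun))) for noun in mentioned]
        if sims and max(sims) >= sim_threshold:
            covered += 1
    return covered / len(detected)
```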

Multimodal Reward-Guided Decoding: Our goal is to guide the generation process of an MLLM so that the generated response is modulated by the two reward functions. Given an image and a visual instruction, an MLLM generates a text response autoregressively, token by token. To give a user the possibility of choosing the relative strength of each reward model on the fly, we define a score s as the linear combination of the rewards for object hallucination rhal and object recall rrec: s(xv, xq, y) = w · rhal(xv, xq, y) + (1 − w) · rrec(xv, xq, y), where w ∈ [0, 1] is a guidance strength hyperparameter chosen at inference time.
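In code, the combined score is a one-liner. The sketch below assumes both reward models expose a shared callable taking the image, the instruction, and the (partial) response; that common interface is an implementation choice made here for illustration.

```python
def combined_score(image, question, response, r_hal, r_rec, w=0.5):
    """s = w * r_hal + (1 - w) * r_rec, with w chosen at inference time.

    w -> 1.0 favors precision (fewer hallucinated objects);
    w -> 0.0 favors recall (more objects mentioned).
    """
    assert 0.0 <= w <= 1.0
    return (w * r_hal(image, question, response)
            + (1.0 - w) * r_rec(image, question, response))
```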

Our method consistently outperforms existing hallucination mitigation approaches, while allowing test-time controllability of an MLLM's outputs. For instance, on the COCO benchmark, CHAIR is reduced by ~70% (from 15.05% with greedy decoding to 4.53% with MRGD) while recall is only reduced by 6.5%. By combining both reward models with w=0.5, recall is substantially increased w.r.t. w=1.0 (2.6% on COCO and 8.4% on AMBER), without overly increasing the hallucination rate (0.8% on COCO and 1% on AMBER). When w=0, MRGD achieves state-of-the-art results on object recall/coverage at the cost of a higher hallucination rate. We also observe that the optimal operating point w* (mitigating object hallucinations without losing recall) varies by benchmark, with w* ≈ 0.25 for COCO and w* = 1.0 for AMBER. Compared to prior visual hallucination mitigation methods, MRGD consistently surpasses the performance of methods which fine-tune the base MLLM, while offering greater flexibility and more granular control over the MLLM's behavior.

70% Reduction in Object Hallucinations (CHAIR metric)

Enterprise Process Flow

Input Image & Query
Sample k Candidate Completions
Evaluate with Reward Models (rhal, rrec)
Combine Rewards (weighted)
Select Best Completion
Add to Context & Repeat
Final MLLM Output
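Read end to end, this flow is a best-of-k search applied segment by segment. The sketch below is a minimal, hedged rendering of that loop: the `generate_segment` and `is_finished` methods on the MLLM wrapper, the fixed per-segment token budget, and the reuse of the `combined_score` helper sketched above are illustrative assumptions, not the paper's exact implementation.

```python
def reward_guided_decode(mllm, image, query, r_hal, r_rec,
                         w=0.5, k=8, segment_tokens=16, max_segments=32):
    """Segment-level best-of-k decoding guided by the combined reward.

    At each step: sample k candidate continuations, score each extended
    partial response with w * r_hal + (1 - w) * r_rec, keep the best,
    append it to the context, and repeat until the response is finished.
    """
    response = ""
    for _ in range(max_segments):
        # k candidate continuations of the current partial response
        candidates = [mllm.generate_segment(image, query, response,
                                            max_new_tokens=segment_tokens,
                                            do_sample=True)
                      for _ in range(k)]
        # score each candidate appended to the context built so far
        scored = [(combined_score(image, query, response + c, r_hal, r_rec, w), c)
                  for c in candidates]
        best_score, best = max(scored, key=lambda t: t[0])
        response += best
        if mllm.is_finished(response):  # e.g. an EOS token in the best segment
            break
    return response
```

Increasing k widens the search (better visual grounding at higher test-time compute), while w moves the output along the precision-recall axis, matching the two control knobs described above.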

MRGD vs. Existing Hallucination Mitigation Methods (LLaVA-1.5 7B on COCO)

Method Object Hallucination (CHAIRᵢ ↓, %) Object Recall (Rec. ↑, %) Controllability
Greedy Decoding 15.05 81.30 None
Prompting 13.50 80.38 Coarse (prompt engineering)
SFT / RLHF 5.4-16.09 79.2-81.34 None at inference time
MRGD (w=1.0, precision focus) 4.53 76.04 Fine-grained, dynamic
MRGD (w=0.5, balanced) 5.34 78.63 Fine-grained, dynamic
MRGD (w=0.0, recall focus) 24.20 85.23 Fine-grained, dynamic

Enterprise Application: Automated Content Generation with Controlled Grounding

A leading e-commerce platform utilizes MRGD to generate product descriptions and marketing copy. By dynamically adjusting the w parameter, they can choose between highly precise, factual descriptions (higher w) to minimize inaccuracies, or more creative, detailed descriptions (lower w) that explore a wider range of product features, even if it introduces slightly more 'fluff'. This control ensures compliance for critical content while allowing creative freedom for marketing materials, significantly reducing manual oversight.

Outcome: 55% reduction in post-generation edits and 20% faster content approval cycles.

Advanced ROI Calculator

Estimate the potential financial impact of AI integration based on your enterprise profile.


Phased Implementation Roadmap

Our structured approach ensures a smooth, effective, and tailored AI integration process for your enterprise.

Phase 1: Foundation & Reward Model Deployment

Establish secure data pipelines and deploy pre-trained or custom reward models (rhal, rrec) within your infrastructure. This includes integration with existing MLLM services.

Phase 2: Strategy Definition & Parameter Tuning

Collaborate with our AI strategists to define optimal guidance strength (w) and search breadth (k, T) parameters tailored to specific use cases, balancing precision, recall, and computational budget.
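In practice, the tuning surface in this phase reduces to a handful of knobs. The configuration sketch below is illustrative only: the field names, defaults, and the mapping of the roadmap's (k, T) onto search breadth and per-segment token budget are assumptions made for this example, not values prescribed by the research.

```python
from dataclasses import dataclass

@dataclass
class MRGDConfig:
    w: float = 0.5            # guidance strength: 1.0 = precision focus, 0.0 = recall focus
    k: int = 8                # search breadth: candidate segments sampled per step
    segment_tokens: int = 16  # tokens generated per candidate segment (the T budget)
    max_segments: int = 32    # cap on decoding steps, i.e. the overall compute budget

# Hypothetical per-use-case presets
COMPLIANCE_COPY = MRGDConfig(w=0.9, k=4)    # precise, low latency
MARKETING_COPY  = MRGDConfig(w=0.25, k=16)  # detailed, more compute
```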

Phase 3: Integration & Iterative Optimization

Integrate MRGD into your MLLM inference stack. Conduct iterative A/B testing and performance monitoring to fine-tune parameters and continuously improve visual grounding and output quality.

Phase 4: Scalable Deployment & Training

Scale MRGD across your enterprise applications. Implement robust monitoring and feedback loops for ongoing model improvement and adaptation to evolving business needs.

Ready to Transform Your MLLM Applications?

Unlock precise, controllable, and efficient AI outputs with our expert guidance. Let's discuss how MRGD can redefine your enterprise AI strategy.

Ready to Get Started?

Book Your Free Consultation.
