ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models

School of Computer Science, Carnegie Mellon University
ICCV 2025

Highlights

  • We investigate and challenge the performance-efficiency trade-off of existing contrastive decoding approaches for mitigating hallucinations in LVLMs, highlighting that their reliance on two or more queries per response undermines efficiency.
  • We present ONLY, a novel training-free decoding algorithm that leverages a single additional Transformer layer to improve the accuracy of LVLM responses.
  • We conduct comprehensive experiments across various benchmarks and demonstrate that our proposed ONLY consistently outperforms existing approaches with minimal implementation effort and computational cost.

Abstract

Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image inputs through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. Existing work has explored contrastive decoding approaches to mitigate this issue, in which the output of the original LVLM is contrasted with that of a perturbed version. However, these methods require two or more queries per response, slowing down generation and making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. Specifically, we enhance textual outputs by selectively amplifying crucial textual information using a text-to-visual entropy ratio for each token. Extensive experimental results demonstrate that ONLY consistently outperforms state-of-the-art methods across various benchmarks while requiring minimal implementation effort and computational cost.
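The sketch below illustrates one way such a single-query, one-layer intervention could look in PyTorch. It is a minimal sketch of our reading of the abstract, not the authors' implementation: the early-exit auxiliary branch, the names `final_logits` and `aux_logits`, the gate direction `ratio > tau`, and the amplification weight `alpha` are all illustrative assumptions.

```python
# Illustrative sketch of a single-query, one-layer decoding intervention.
# Assumptions (not the paper's released code): `aux_logits` are logits read
# off one additional Transformer layer via the LM head (early exit), and
# tokens are gated by a per-token text-to-visual entropy ratio.
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution; logits: (batch, vocab)."""
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)

def one_layer_intervention(final_logits: torch.Tensor,
                           aux_logits: torch.Tensor,
                           alpha: float = 1.0,
                           tau: float = 1.0) -> torch.Tensor:
    """Amplify the auxiliary (textual) signal for tokens whose
    text-to-visual entropy ratio crosses the threshold `tau`."""
    ratio = token_entropy(aux_logits) / token_entropy(final_logits).clamp_min(1e-6)
    gate = (ratio > tau).float().unsqueeze(-1)  # 1 where textual info is deemed crucial
    # Move gated tokens toward the auxiliary branch; alpha controls amplification.
    return final_logits + alpha * gate * (aux_logits - final_logits)
```

Under this reading, both logit streams come from the same forward pass, so the overhead is one extra layer's worth of computation rather than a second full query, which is what enables real-time deployment.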

Experimental Results

Efficiency comparison
Table 1. Efficiency comparison. For each method, we present the average inference latency per instance and peak GPU memory. Experiments are conducted on a single RTX A6000 Ada GPU. The best results are bolded, and the second-best are underlined.
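
For reference, the two quantities in Table 1 are typically measured as follows. This is a generic PyTorch profiling sketch, not the authors' benchmarking script, and `generate_fn` is a hypothetical stand-in for the model's generation call.

```python
# Generic measurement of average inference latency per instance and peak
# GPU memory. `generate_fn` is a hypothetical stand-in for the model's
# generation call (e.g., a wrapped `model.generate`).
import time
import torch

@torch.no_grad()
def profile_decoding(generate_fn, inputs: dict) -> tuple[float, float]:
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()          # ensure prior kernels have finished
    start = time.perf_counter()
    _ = generate_fn(**inputs)
    torch.cuda.synchronize()          # wait for generation to complete
    latency = time.perf_counter() - start                    # seconds/instance
    peak_mem_gib = torch.cuda.max_memory_allocated() / 2**30  # GiB
    return latency, peak_mem_gib
```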

Results on POPE
Table 2. Results on POPE benchmark. Higher ($\uparrow$) accuracy, precision, recall, and F1 indicate better performance. The best results are bolded, and the second-best are underlined.
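
POPE poses yes/no questions about object presence, so its four metrics reduce to binary classification with "yes" as the positive class. The helper below is a plain re-implementation of that scoring, not the benchmark's official script.

```python
# POPE metrics: binary classification over 'yes'/'no' answers,
# with 'yes' treated as the positive class.
def pope_metrics(predictions: list[str], labels: list[str]) -> dict:
    tp = sum(p == "yes" and l == "yes" for p, l in zip(predictions, labels))
    fp = sum(p == "yes" and l == "no" for p, l in zip(predictions, labels))
    fn = sum(p == "no" and l == "yes" for p, l in zip(predictions, labels))
    tn = sum(p == "no" and l == "no" for p, l in zip(predictions, labels))
    accuracy = (tp + tn) / max(tp + fp + fn + tn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```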

Results on CHAIR
Table 3. Results on CHAIR benchmark. We limit the maximum number of new tokens to 64 or 128. Lower ($\downarrow$) CHAIR$_S$ and CHAIR$_I$ indicate better performance. The best results in each setting are bolded, and the second-best are underlined.
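
CHAIR (Rohrbach et al., 2018) counts object mentions in a caption against the ground-truth objects in the image: CHAIR$_S$ is the fraction of captions containing at least one hallucinated object, and CHAIR$_I$ is the fraction of mentioned object instances that are hallucinated. The sketch below reflects that definition; `extract_objects` is a hypothetical stand-in for the benchmark's synonym-aware object matcher.

```python
# CHAIR scores. `extract_objects` is a hypothetical stand-in for the
# benchmark's synonym-aware object matcher over COCO categories.
def chair_scores(captions, ground_truth_objects, extract_objects):
    hallucinated_captions = 0
    mentioned, hallucinated = 0, 0
    for caption, gt in zip(captions, ground_truth_objects):
        objs = extract_objects(caption)          # objects mentioned in the caption
        bad = [o for o in objs if o not in gt]   # mentions absent from the image
        mentioned += len(objs)
        hallucinated += len(bad)
        hallucinated_captions += bool(bad)       # caption-level hallucination flag
    chair_s = hallucinated_captions / max(len(captions), 1)
    chair_i = hallucinated / max(mentioned, 1)
    return chair_s, chair_i
```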

Results on MME
Table 4. Results on MME-Hallucination and MMBench benchmark. We report the average MME scores along with the standard deviation across three random seeds for each subset. Higher scores ($\uparrow$) indicate better performance. The best results are bolded, and the second-best are underlined.

Case Study on LLaVA-Bench
Figure 3. Case study on the LLaVA-Bench benchmark. We compare the responses generated by regular decoding and our method using LLaVA-1.5. GPT-4V-aided evaluation results are also provided alongside the responses. Hallucinated and accurate content is highlighted in red and green, respectively.

BibTeX


        coming soon