When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

Interactive Demo

Real-Time TOIH Visualization

Watch how model predictions shift as the video plays: the Baseline is misled by contradictory overlay text, while VTHM-MoE resists it.

Query

How many times does the person in the video transfer the phone to another person?

A. 1
B. 3 (Ground Truth)
C. 5
D. 7

Overlay Text (V⁻): "The person transfer the phone to another person five times."
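A demo item like this one can be held in a small record type. The sketch below is illustrative only: the class name `ToihSample`, its field names, and the helper `is_text_induced` are assumptions rather than the benchmark's actual schema. The values are copied from the query shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToihSample:
    """Hypothetical record for one multiple-choice item; not the official schema."""
    query: str
    options: dict[str, str]   # option letter -> option text
    ground_truth: str         # letter supported by the video
    overlay_text: str         # contradictory text rendered on the V- video
    overlay_option: str       # letter that the overlay text argues for

    def is_text_induced(self, predicted: str) -> bool:
        """True when a wrong answer matches the option pushed by the overlay."""
        return predicted != self.ground_truth and predicted == self.overlay_option


sample = ToihSample(
    query="How many times does the person in the video transfer the phone to another person?",
    options={"A": "1", "B": "3", "C": "5", "D": "7"},
    ground_truth="B",
    overlay_text="The person transfer the phone to another person five times.",
    overlay_option="C",
)

print(sample.is_text_induced("C"))  # the answer follows the overlay text
print(sample.is_text_induced("B"))  # the answer follows the video
```

Keeping the overlay-supported option next to the ground truth makes it possible to tell a text-induced error apart from an unrelated wrong answer.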

Video length: 00:46

[Interactive chart: Baseline and VTHM-MoE predictions over options A (1), B (3), C (5), and D (7), shown for the neutral, text-free sample (V⁰) and the negative, text-contradictory sample (V⁻). Timeline markers: 1st, 2nd, 3rd.]

Annotated segments, each described for the neutral sample (V⁰) and the negative sample (V⁻):

Introduction: bar chart meaning and initial state
1st Phone Transfer (15 – 18 s)
2nd Phone Transfer (33 – 35 s)
3rd Phone Transfer (42 – 45 s)

Query

What does the person do after picking up an object from the floor?

A. They hand it to someone
B. They place it on the desk
C. Walk towards the door to place it down (Ground Truth)
D. They throw it out the window

Overlay Text (V⁻): “The person placed it on the desk after picking up an object from the floor.”

Video length: 00:13

[Interactive chart: Baseline and VTHM-MoE predictions over options A (hand), B (desk), C (door), and D (window), shown for the neutral, text-free sample (V⁰) and the negative, text-contradictory sample (V⁻). Timeline markers: pick, door, desk.]

Annotated segments, each described for the neutral sample (V⁰) and the negative sample (V⁻):

Picking Up Object (0 – 3 s)
Walking Toward the Door (3 – 7 s)
Walking Toward Desk / Window (7 – 13 s)

Query

What do the divers use to explore the rusted structure underwater during low visibility?

A. Glow sticks
B. Night vision goggles
C. Flashlights (Ground Truth)
D. Underwater drones

Overlay Text (V⁻): “The divers use glow sticks to illuminate the rusted structure underwater.”

Video length: 02:47

[Interactive chart: Baseline and VTHM-MoE predictions over options A (Glow sticks), B (NV goggles), C (Flashlights), and D (Drones), shown for the neutral, text-free sample (V⁰) and the negative, text-contradictory sample (V⁻). Timeline markers: goggles, flash, glow.]

Annotated segments, each described for the neutral sample (V⁰) and the negative sample (V⁻):

Introduction: initial state (0 – 5 s)
Night Vision Goggles Visible (5 – 47 s)
Flashlights Illuminating Structure (49 – 75 s)
Glow Sticks Appear (80 – 86 s)

Query

As shown in the video, where are the candles placed?

A. Next to the holder made of spoons
B. On the bottom of the table
C. In the middle of the cups (Ground Truth)
D. There is no candle in this video

Overlay Text (V⁻): “There is no candle placed in the middle of the cups.”

Video length: 01:40

[Interactive chart: Baseline and VTHM-MoE predictions over options A (spoons), B (table), C (cups), and D (no candle), shown for the neutral, text-free sample (V⁰) and the negative, text-contradictory sample (V⁻). Timeline markers: cups, craft, candle.]

Annotated segments, each described for the neutral sample (V⁰) and the negative sample (V⁻):

Initial State (0 – 17 s)
Lit Cups with Center Glow (17 – 18 s)
Craft Items with Candle-like Light (62 – 65 s)
Candles Clearly Visible in Cups (95 – 100 s)

Abstract

Can VLMs Truly See, or Are They Just Reading?

Vision-Language Models (VLMs) have achieved remarkable performance across video question answering (VQA) tasks spanning temporal reasoning, action recognition, object localization, and spatial relationship comprehension. Despite these successes, a fundamental question remains unexplored: do VLMs genuinely ground their understanding in visual content, or do they predominantly rely on semantic alignment with overlay text—without sufficiently leveraging non-textual visual evidence?

We identify and formalize a critical vulnerability called Text Overlay-Induced Hallucination (TOIH): VLMs generate answers that mirror semantically contradictory overlay text while disregarding the visual ground truth. This failure manifests consistently across Temporal, Action, Object, and Spatial understanding dimensions.

To address this, we present VisualTextTrap, the first benchmark dedicated to TOIH evaluation (6,057 samples, 88 attributes, L1–L5 conflict levels), and propose VTHM-MoE, a Mixture-of-Experts framework with dual OCR-Visual encoding and adaptive token routing that effectively suppresses TOIH while preserving general video comprehension.

Problem Definition

Text Overlay-Induced Hallucination (TOIH)

When overlay text semantically contradicts the visual scene, state-of-the-art VLMs systematically follow the text and ignore visual evidence. We observe this failure across four fundamental VQA dimensions.
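To make the failure mode concrete, the sketch below counts how often a model answers correctly on the text-free video (V⁰) but switches to the overlay-supported option on the contradictory video (V⁻). This is an assumed, simplified measure for illustration only. The names `PairedPrediction` and `toih_flip_rate` and the toy predictions are hypothetical, and this is not necessarily the metric reported for VisualTextTrap.

```python
from typing import Iterable, NamedTuple


class PairedPrediction(NamedTuple):
    ground_truth: str    # option supported by the video
    overlay_option: str  # option asserted by the contradictory overlay text
    pred_neutral: str    # model answer on V0 (text-free)
    pred_negative: str   # model answer on V- (text-contradictory)


def toih_flip_rate(pairs: Iterable[PairedPrediction]) -> float:
    """Fraction of V0-correct samples whose V- answer follows the overlay."""
    correct_on_neutral = 0
    flipped = 0
    for p in pairs:
        if p.pred_neutral != p.ground_truth:
            continue  # already wrong without text, so not attributable to it
        correct_on_neutral += 1
        if p.pred_negative == p.overlay_option and p.overlay_option != p.ground_truth:
            flipped += 1
    return flipped / correct_on_neutral if correct_on_neutral else 0.0


# Made-up predictions, for illustration only.
toy = [
    PairedPrediction("B", "C", "B", "C"),  # flips to the overlay answer
    PairedPrediction("C", "A", "C", "C"),  # resists the overlay
    PairedPrediction("C", "B", "A", "B"),  # wrong on V0, so excluded
]
rate = toih_flip_rate(toy)
print(rate)
```

Conditioning on V⁰ correctness separates errors caused by the overlay text from errors the model would make anyway.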

Figure 2. VisualTextTrap statistics. (a) Attribute frequency distribution (top-10 shown). (b) Data source and label composition: 6,057 samples from TemporalBench, VideoMME, and LLaVA-Video. (c) Dimension breakdown: Temporal (1,093), Spatial (1,062), Object (1,160), Action (2,604+). (d) Expert selection ratio density across dimensions. (e) Conflict level (L1–L5) distribution proportion. Legend labels recovered from the figure: No Halluc., Wrong Opt., Correct Opt.

Temporal

Models count events from text rather than video frames, reporting incorrect frequencies even when visual evidence is unambiguous.

Action

Action verbs in overlay text override direct visual action recognition, causing categorical misidentification.

Object

Named objects in overlay text are hallucinated as present in the scene with high confidence, even when absent.

Spatial

Absolute position phrases in text override accurate spatial perception, reversing the direction or arrangement of objects.

Method

VTHM-MoE: Hallucination Mitigation via Mixture-of-Experts

VTHM-MoE explicitly disentangles overlay text from native visual content through a dual-encoder structure, dimension-specialized expert modules, and an adaptive token routing strategy.

Figure 5. VTHM-MoE Architecture. Left (Dual Feature Extraction): An OCR-Encoder (Qwen3-VL-8B-Instruct-OCR3) extracts attended overlay-text representations; a Visual-Encoder (Qwen3-VL-8B-Instruct) extracts scene-level visual features independently. Their difference is used to select the K most conflicted patches. Right-top (MoE Hallucination Detection): Four dimension-specialized experts (Temporal, Action, Object, Spatial) process tokens from transformer layers 1–16, supervised by an SFT KL-divergence loss on expert allocation ratios. Right-bottom (Adaptive Token Routing): A Ratio Gate and Inconsistency Classifier at layer 16 dynamically route each token to the appropriate TOIH-resistant expert via per-token softmax routing.


Dual Encoder

Separate OCR-Encoder and Visual-Encoder produce disentangled representations. Their attended difference highlights conflicting regions, enabling targeted K-patch selection.
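The K-patch selection step can be sketched as ranking patches by how strongly the two encoders disagree. The code below is a minimal illustration under stated assumptions: the function name `select_conflicted_patches` is hypothetical, the per-patch Euclidean distance stands in for the "attended difference" (whose exact form is not given here), and the feature values are toy numbers.

```python
import math
from typing import Sequence


def select_conflicted_patches(
    ocr_feats: Sequence[Sequence[float]],
    visual_feats: Sequence[Sequence[float]],
    k: int,
) -> list[int]:
    """Return indices of the k patches whose two feature vectors differ most."""
    if len(ocr_feats) != len(visual_feats):
        raise ValueError("both encoders must cover the same patches")
    # Assumed conflict score: Euclidean distance between per-patch features.
    scores = [math.dist(o, v) for o, v in zip(ocr_feats, visual_feats)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy 2-D features for four patches (illustrative values only).
ocr = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5], [3.0, 0.0]]
vis = [[0.0, 0.1], [1.0, 1.0], [2.5, 0.5], [0.0, 4.0]]
top_patches = select_conflicted_patches(ocr, vis, k=2)
print(top_patches)
```

Patches where the OCR-side and visual-side features nearly agree score low and are left out, so later stages focus on the regions where text and scene conflict.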


Dimension Experts

Four specialized expert modules—one per VQA dimension—are individually trained to detect cross-modal discrepancies, enabling precise interference suppression.
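The architecture description mentions a KL-divergence loss on expert allocation ratios. The sketch below shows one way such a loss could be computed for a single sample. The direction KL(target || predicted), the one-hot target, and the helper names are assumptions made for illustration, not the authors' exact formulation.

```python
import math
from typing import Sequence

EXPERTS = ("Temporal", "Action", "Object", "Spatial")


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target: Sequence[float], predicted: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(target || predicted) over expert allocation ratios."""
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, predicted)
        if t > 0.0  # terms with zero target mass contribute nothing
    )


# Hypothetical target: a sample labeled as a Temporal conflict.
target_ratio = [1.0, 0.0, 0.0, 0.0]
uniform = softmax([0.0, 0.0, 0.0, 0.0])   # router has no preference yet
peaked = softmax([3.0, 0.0, 0.0, 0.0])    # router favors the Temporal expert
loss_uniform = kl_divergence(target_ratio, uniform)
loss_peaked = kl_divergence(target_ratio, peaked)
print(loss_uniform, loss_peaked)
```

The loss shrinks as the predicted allocation concentrates on the expert that matches the labeled conflict dimension, which is the behavior this kind of supervision is meant to encourage.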



Adaptive Routing

An Inconsistency Classifier and Ratio Gate route tokens to TOIH-resistant experts only when conflicting text is detected, preserving standard routing for positive/neutral inputs.
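The gating behavior described here can be sketched as a conditional per-token softmax. This is a simplified illustration under assumptions: `route_tokens`, the scalar `conflict_prob`, the 0.5 threshold, and the logits are all hypothetical stand-ins for the Inconsistency Classifier and Ratio Gate, whose internals are not specified on this page.

```python
import math
from typing import Optional, Sequence

EXPERTS = ("Temporal", "Action", "Object", "Spatial")


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_tokens(
    conflict_prob: float,
    gate_logits: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> Optional[list[list[float]]]:
    """Return per-token expert weights, or None to keep the standard path.

    conflict_prob: assumed inconsistency-classifier output in [0, 1].
    gate_logits: one list of four expert logits per token (assumed gate output).
    """
    if conflict_prob < threshold:
        return None  # positive/neutral input: leave routing unchanged
    return [softmax(token_logits) for token_logits in gate_logits]


logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
weights = route_tokens(conflict_prob=0.9, gate_logits=logits)
dominant = [EXPERTS[max(range(len(EXPERTS)), key=w.__getitem__)] for w in weights]
print(dominant)

skipped = route_tokens(conflict_prob=0.1, gate_logits=logits)
print(skipped)
```

Returning `None` for low conflict probability mirrors the stated goal of leaving positive and neutral inputs on their standard route.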

Qualitative Analysis

Adaptive Routing Correctly Activates Dimension Experts

VTHM-MoE accurately identifies the conflict dimension and routes tokens to the corresponding dominant expert, achieving correct predictions while baselines fail.

Figure 6. Qualitative analysis of VTHM-MoE's routing behavior across four TOIH scenarios. For each case, we show the visual frames, the misleading overlay text, the conflict classifier prediction (cls′), and the dynamic expert routing ratios (expert_ratio′). (a) Temporal Count Conflict: dominant Temporal expert (0.46). (b) Specific Action Conflict: dominant Action expert (0.43). (c) Object Recognition Conflict: dominant Object expert (0.51). (d) Absolute Position Conflict: dominant Spatial expert (0.46). In all cases, VTHM-MoE yields the correct prediction y′.
