VisualTextTrap: Benchmarking and Mitigating
Text Overlay-Induced Hallucination in Vision Language Models

Anonymous Authors

Anonymous Institution  ·  Anonymous Institution


Real-Time TOIH Visualization

Watch how model predictions shift as the video plays: the Baseline is misled by the contradictory overlay text, while VTHM-MoE resists it.

Query

How many times does the person in the video transfer the phone to another person?

Options: A) 1 · B) 3 (Ground Truth) · C) 5 · D) 7

Overlay Text (V⁻): "The person transfer the phone to another person five times."

[Interactive player (00:46): bar charts over options A (1), B (3), C (5), D (7) compare Baseline vs. VTHM-MoE predictions on the Neutral sample (V⁰, text-free) and the Negative sample (V⁻, text-contradictory) at the 1st (15–18 s), 2nd (33–35 s), and 3rd (42–45 s) phone transfers.]
Query

What does the person do after picking up an object from the floor?

Options: A) They hand it to someone · B) They place it on the desk · C) Walk towards the door to place it down (Ground Truth) · D) They throw it out the window

Overlay Text (V⁻): "The person placed it on the desk after picking up an object from the floor."

[Interactive player (00:13): bar charts over options A (hand), B (desk), C (door), D (window) compare Baseline vs. VTHM-MoE predictions on the Neutral (V⁰) and Negative (V⁻) samples across three segments: picking up the object (0–3 s), walking toward the door (3–7 s), and walking toward the desk/window (7–13 s).]
Query

What do the divers use to explore the rusted structure underwater during low visibility?

Options: A) Glow sticks · B) Night vision goggles · C) Flashlights (Ground Truth) · D) Underwater drones

Overlay Text (V⁻): "The divers use glow sticks to illuminate the rusted structure underwater."

[Interactive player (02:47): bar charts over options A (Glow sticks), B (NV goggles), C (Flashlights), D (Drones) compare Baseline vs. VTHM-MoE predictions on the Neutral (V⁰) and Negative (V⁻) samples across four segments: initial state (0–5 s), night vision goggles visible (5–47 s), flashlights illuminating the structure (49–75 s), and glow sticks appearing (80–86 s).]
Query

As shown in the video, where are the candles placed?

Options: A) Next to the holder made of spoons · B) On the bottom of the table · C) In the middle of the cups (Ground Truth) · D) There is no candle in this video

Overlay Text (V⁻): "There is no candle placed in the middle of the cups."

[Interactive player (01:40): bar charts over options A (spoons), B (table), C (cups), D (no candle) compare Baseline vs. VTHM-MoE predictions on the Neutral (V⁰) and Negative (V⁻) samples across four segments: initial state (0–17 s), lit cups with center glow (17–18 s), craft items with candle-like light (62–65 s), and candles clearly visible in the cups (95–100 s).]

TOIH across four VQA dimensions
Figure 1. TOIH across four dimensions. Temporal: the model reports "twice" because the overlay says so, despite visual evidence of multiple repetitions. Action: a person securing wooden beams is misidentified as "cutting metal" due to overlay text. Object: a non-existent whisk is falsely confirmed based on text alone. Spatial: entry from the driver's side is overridden by text asserting "passenger side." VTHM-MoE (bottom bar) correctly resists all four traps.

Abstract

Can VLMs Truly See, or Are They Just Reading?

Vision-Language Models (VLMs) have achieved remarkable performance across video question answering (VQA) tasks spanning temporal reasoning, action recognition, object localization, and spatial relationship comprehension. Despite these successes, a fundamental question remains unexplored: do VLMs genuinely ground their understanding in visual content, or do they predominantly rely on semantic alignment with overlay text—without sufficiently leveraging non-textual visual evidence?

We identify and formalize a critical vulnerability called Text Overlay-Induced Hallucination (TOIH): VLMs generate answers that mirror semantically contradictory overlay text while disregarding the visual ground truth. This failure manifests consistently across Temporal, Action, Object, and Spatial understanding dimensions.

To address this, we present VisualTextTrap, the first benchmark dedicated to TOIH evaluation (6,057 samples, 88 attributes, L1–L5 conflict levels), and propose VTHM-MoE, a Mixture-of-Experts framework with dual OCR-Visual encoding and adaptive token routing that effectively suppresses TOIH while preserving general video comprehension.

Text Overlay-Induced Hallucination (TOIH)

When overlay text semantically contradicts the visual scene, state-of-the-art VLMs systematically follow the text and ignore visual evidence. We observe this failure across four fundamental VQA dimensions.

Figure 2. VisualTextTrap statistics. (a) Attribute frequency distribution (top-10 shown). (b) Data source and label composition: 6,057 samples from TemporalBench, VideoMME, and LLaVA-Video. (c) Dimension breakdown: Temporal (1,093), Spatial (1,062), Object (1,160), Action (2,604+). (d) Expert selection ratio density across dimensions. (e) Conflict level (L1–L5) distribution proportion.
🕐

Temporal

Models count events from text rather than video frames, reporting incorrect frequencies even when visual evidence is unambiguous.

🏃

Action

Action verbs in overlay text override direct visual action recognition, causing categorical misidentification.

📦

Object

Named objects in overlay text are hallucinated as present in the scene with high confidence, even when absent.

🗺️

Spatial

Absolute position phrases in text override accurate spatial perception, reversing the direction or arrangement of objects.


All State-of-the-Art VLMs Are Susceptible

We evaluated leading models under Neutral (no overlay) and Negative (contradictory overlay) conditions. Every model shows a dramatic accuracy drop when conflicting text is introduced.

The full annotation pipeline of VisualTextTrap
Figure 3. The full annotation pipeline of VisualTextTrap. (a) Data sources. (b) Task dimension, expert ratio, and cognitive complexity annotation. (c) Manual quality check. (d) MLLM-assisted hallucination text generation across all four dimensions. (e) Five-level conflict intensity scoring. (f) Overlay text embedding with diverse visual features.

Benchmark

VisualTextTrap: The First TOIH Benchmark

VisualTextTrap is constructed via a hybrid pipeline combining MLLM-assisted hallucination text generation with multi-round human verification, sourcing videos from LLaVA-Video, VideoMME, and TemporalBench.

6,057
Total Samples
88
Fine-grained Attributes
L1–L5
Conflict Levels
14
Evaluation Metrics
4
VQA Dimensions
3
Text Conditions

Five-Level Conflict Intensity (L1–L5)

L1: Irrelevant
L2: Entity Mismatch
L3: Attribute Conflict
L4: Semantic Opposition
L5: Polarity Reversal

Positive (Congruent)

Overlay text agrees with the visual scene. Tests standard comprehension.

Negative (Contradictory)

Overlay text contradicts the visual scene. Probes TOIH susceptibility.

Neutral (Text-free)

No overlay text. Establishes the visual-only baseline for each sample.
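The three text conditions and five conflict levels can be pictured as fields on a per-sample record. The sketch below is purely illustrative; the field names (`condition`, `conflict_level`, `overlay_text`, etc.) are hypothetical, not the benchmark's actual schema.

```python
# Hypothetical sketch of a VisualTextTrap sample record.
# All field names are illustrative assumptions, not the released schema.
from dataclasses import dataclass, field

@dataclass
class TOIHSample:
    video_id: str
    dimension: str        # "Temporal" | "Action" | "Object" | "Spatial"
    condition: str        # "positive" | "negative" | "neutral"
    conflict_level: int   # 1 (irrelevant) .. 5 (polarity reversal); 0 for neutral
    overlay_text: str     # empty string for the neutral (text-free) condition
    question: str
    options: list = field(default_factory=list)
    answer: str = ""      # ground-truth option letter

# A Negative (contradictory) sample at the strongest conflict level:
sample = TOIHSample(
    video_id="llava_video_0001",
    dimension="Temporal",
    condition="negative",
    conflict_level=5,
    overlay_text="The person transfers the phone to another person five times.",
    question="How many times does the person transfer the phone?",
    options=["1", "3", "5", "7"],
    answer="B",
)
```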

Figure 4. Accuracy (%) under Neutral (filled circle) vs. Negative (hollow circle) conditions across LLaVA-Video, VideoMME, and TemporalBench. Even the strongest models suffer significant accuracy drops when misleading overlay text is present.

Hybrid Annotation Pipeline

1
Video Sourcing

Collect from LLaVA-Video, VideoMME, TemporalBench

2
Dim. Classification

MLLM assigns Temporal / Action / Object / Spatial dimension

3
Halluc. Text Gen.

Claude-Sonnet-4.6 generates contradictory overlay text per level

4
Manual Verification

Multi-round human check for quality and conflict accuracy

5
Text Embedding

Overlay text inserted into video frames with varied position/font
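Step 5 can be approximated with a few lines of Pillow. This is a minimal sketch of the idea (randomized text placement on a frame), not the paper's actual embedding code; the function name and placement policy are assumptions.

```python
# Illustrative sketch of overlay-text embedding with varied position.
# Not the paper's implementation; placement policy is an assumption.
import random
from PIL import Image, ImageDraw

def embed_overlay(frame, text, seed=0):
    """Draw overlay text onto a copy of `frame` at a varied position."""
    rng = random.Random(seed)
    out = frame.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    x = rng.randint(0, w // 2)                       # vary horizontal offset
    y = rng.choice([int(h * 0.05), int(h * 0.85)])   # top or bottom band
    draw.text((x, y), text, fill=(255, 255, 0))      # high-contrast yellow
    return out

frame = Image.new("RGB", (640, 360), (20, 20, 20))
stamped = embed_overlay(frame, "The divers use glow sticks.")
```

A real pipeline would also vary font family and size per sample, per the caption of Figure 3(f).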


VTHM-MoE: Hallucination Mitigation via Mixture-of-Experts

VTHM-MoE explicitly disentangles overlay text from native visual content through a dual-encoder structure, dimension-specialized expert modules, and an adaptive token routing strategy.

VTHM-MoE architecture diagram
Figure 5. VTHM-MoE Architecture. Left (Dual Feature Extraction): An OCR-Encoder (Qwen3-VL-8B-Instruct-OCR3) extracts attended overlay-text representations; a Visual-Encoder (Qwen3-VL-8B-Instruct) extracts scene-level visual features independently. Their difference is used to select the K most conflicted patches. Right-top (MoE Hallucination Detection): Four dimension-specialized experts (Temporal, Action, Object, Spatial) process tokens from transformer layers 1–16, supervised by an SFT KL-divergence loss on expert allocation ratios. Right-bottom (Adaptive Token Routing): A Ratio Gate and Inconsistency Classifier at layer 16 dynamically route each token to the appropriate TOIH-resistant expert via per-token softmax routing.
🔀

Dual Encoder

Separate OCR-Encoder and Visual-Encoder produce disentangled representations. Their attended difference highlights conflicting regions, enabling targeted K-patch selection.
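The K-patch selection step reduces to ranking patches by how much the two encoders disagree. A minimal NumPy sketch, assuming per-patch feature matrices and an L2 disagreement score (both assumptions; the paper's attended difference is more involved):

```python
# Toy sketch of conflicted-patch selection from dual-encoder features.
# Shapes and the L2 disagreement score are simplifying assumptions.
import numpy as np

def select_conflicted_patches(ocr_feats, vis_feats, k=8):
    """ocr_feats, vis_feats: (num_patches, dim) per-patch features from
    the OCR and visual encoders. Returns indices of the K patches where
    the two representations disagree most."""
    diff = np.linalg.norm(ocr_feats - vis_feats, axis=-1)  # (num_patches,)
    k = min(k, diff.size)
    return np.argsort(diff)[::-1][:k]                      # top-K indices

rng = np.random.default_rng(0)
ocr = rng.normal(size=(196, 64))     # e.g. a 14x14 patch grid
vis = ocr.copy()
vis[[3, 50, 120]] += 5.0             # inject disagreement at three patches
idx = select_conflicted_patches(ocr, vis, k=3)
```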

🧠

Dimension Experts

Four specialized expert modules—one per VQA dimension—are individually trained to detect cross-modal discrepancies, enabling precise interference suppression.

Temporal Action Object Spatial
⚙️

Adaptive Routing

An Inconsistency Classifier and Ratio Gate route tokens to TOIH-resistant experts only when conflicting text is detected, preserving standard routing for positive/neutral inputs.
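The gating logic can be sketched as follows: only when the inconsistency classifier fires does the suspected conflict dimension bias the per-token softmax over the four experts. Everything here (threshold, boost term, function names) is an illustrative assumption, not the paper's formulation.

```python
# Toy sketch of adaptive token routing over the four dimension experts.
# Threshold, boost, and names are assumptions for illustration only.
import numpy as np

EXPERTS = ["Temporal", "Action", "Object", "Spatial"]

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def route_tokens(gate_logits, conflict_prob, dim_logits, boost=2.0):
    """gate_logits: (tokens, 4) standard routing logits.
    conflict_prob: scalar in [0, 1] from the inconsistency classifier.
    dim_logits: (4,) logits over the suspected conflict dimension."""
    if conflict_prob > 0.5:                          # conflict detected:
        gate_logits = gate_logits + boost * softmax(dim_logits)
    return softmax(gate_logits)                      # per-token expert ratios

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 4))                     # 5 tokens
ratios = route_tokens(logits, conflict_prob=0.9,
                      dim_logits=np.array([4.0, 0.0, 0.0, 0.0]))
```

For positive/neutral inputs (`conflict_prob` below the threshold) the gate is untouched, preserving the standard routing path described above.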


Experiments

VTHM-MoE Consistently Outperforms All Baselines

We evaluate on three video VQA benchmarks using 14 TOIH-specific metrics covering hallucination rate, semantic consistency, and model robustness under conflicting text.
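One natural metric in this family is a text-following rate: among samples a model answers correctly with no overlay, the fraction it flips to the overlay-implied option once contradictory text is added. The sketch below is an illustrative definition under assumed record keys, not one of the paper's 14 metrics.

```python
# Illustrative text-following rate; keys and the definition itself are
# assumptions, not the paper's exact metric.

def text_following_rate(records):
    """records: dicts with 'neutral_pred', 'negative_pred', 'gt', and
    'overlay_opt' (the option the misleading overlay text implies)."""
    base = [r for r in records if r["neutral_pred"] == r["gt"]]
    if not base:
        return 0.0
    flipped = sum(r["negative_pred"] == r["overlay_opt"] for r in base)
    return flipped / len(base)

records = [
    {"neutral_pred": "B", "negative_pred": "C", "gt": "B", "overlay_opt": "C"},
    {"neutral_pred": "B", "negative_pred": "B", "gt": "B", "overlay_opt": "C"},
    {"neutral_pred": "A", "negative_pred": "C", "gt": "B", "overlay_opt": "C"},
]
rate = text_following_rate(records)  # one of two neutral-correct samples flips
```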

Main results on VisualTextTrap
Table 1. Comprehensive evaluation results across 14 TOIH-specific metrics on three benchmarks (LLaVA-Video, VideoMME, TemporalBench), covering both open-source models (Qwen3-VL-30B, Qwen3-VL-235B, InternVL3.5-241B) and closed-source models (Gemini-3.1-Pro), along with Baseline variants (CoT, SFT, CoT+SFT) and our VTHM-MoE.
77.7
LLaVA-Video Overall
61.4
VideoMME Overall
53.1
TemporalBench Overall
+8.0
Over Best Baseline (LLaVA)

Adaptive Routing Correctly Activates Dimension Experts

VTHM-MoE accurately identifies the conflict dimension and routes tokens to the corresponding dominant expert, achieving correct predictions while baselines fail.

Qualitative routing analysis
Figure 6. Qualitative analysis of VTHM-MoE's routing behavior across four TOIH scenarios. For each case, we show the visual frames, the misleading overlay text, the conflict classifier prediction (cls′), and the dynamic expert routing ratios (expert_ratio′). (a) Temporal Count Conflict: dominant Temporal expert (0.46). (b) Specific Action Conflict: dominant Action expert (0.43). (c) Object Recognition Conflict: dominant Object expert (0.51). (d) Absolute Position Conflict: dominant Spatial expert (0.46). In all cases, VTHM-MoE yields the correct prediction y′.

Contributions

Summary of Contributions

📌

Novel Problem

First formal definition and characterization of TOIH across Temporal, Action, Object, and Spatial understanding dimensions.

📊

VisualTextTrap Benchmark

6,057 samples · 88 attributes · L1–L5 conflict levels · 14 evaluation metrics — the first comprehensive TOIH benchmark.

🏗️

VTHM-MoE Framework

Dual OCR-Visual encoding + 4 dimension experts + Adaptive Token Routing for effective TOIH mitigation without degrading general VQA.

🔬

Empirical Analysis

Comprehensive evaluation across model families, video types, and conflict levels; VTHM-MoE sets new state-of-the-art on all sub-benchmarks.


BibTeX

@inproceedings{visualtexttrap2025,
  title = {VisualTextTrap: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Video VLMs},
  author = {Anonymous Authors},
  booktitle = {Under Review},
  year = {2025},
}