VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

1 University of Science and Technology of China 2 Xiaohongshu Inc. 3 East China Normal University 4 Xi'an Jiaotong University
*Equal Contribution   #Corresponding Author

Abstract

Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and RL training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks.

Key Contributions

Novel Agentic Paradigm

We propose VideoSeeker, an agentic instance-level video understanding paradigm that breaks through the limitations of text queries and achieves more precise spatial and temporal references through visual prompts.

Automated Data Pipeline

We construct a four-stage instance-level video question answering data synthesis pipeline that efficiently generates large-scale, high-quality instance-level video data (34.2k SFT samples, 4.1k RL samples).

Superior Performance

VideoSeeker achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing GPT-4o and Gemini-2.5-Pro while also showing effective transferability to general video understanding scenarios.

Method Overview

Figure: Overview of VideoSeeker. (A) Instance-level video understanding tasks require models to accurately locate and reason about specific instances in videos guided by visual prompts. (B) Pipeline overview: four-stage automated data synthesis pipeline + two-stage training strategy.

Task Formulation

Given a query Q, a visual prompt frame F_vp, and a video V, the goal of instance-level video understanding is to accurately answer Q with respect to the specific instance indicated by F_vp.

Environmental Interaction

The policy model interacts with the video environment through multi-turn active perception control, equipped with a perception tool set:

  • view_visual_prompt: Continuously provides visual prompt frames, keeping a cognitive anchor on the target instance's appearance
  • crop_video: Gives the model fine-grained local observation capability for filtering keyframes
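The interaction loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: only the tool names view_visual_prompt and crop_video come from the text, while the VideoEnv class, the tool arguments (start_s, end_s), the action dictionary format, and the scripted policy are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Frame = str  # stand-in for an image; a real system would pass pixel arrays


@dataclass
class VideoEnv:
    """Toy video environment exposing the two perception tools (hypothetical API)."""
    frames: List[Frame]    # sampled frames of the video V
    fps: float             # sampling rate, used to map seconds to frame indices
    visual_prompt: Frame   # F_vp, the frame that carries the visual prompt

    def view_visual_prompt(self) -> List[Frame]:
        # Re-show the prompt frame so the target instance stays in context.
        return [self.visual_prompt]

    def crop_video(self, start_s: float, end_s: float) -> List[Frame]:
        # Return only the frames inside [start_s, end_s) for closer inspection.
        lo = max(0, int(start_s * self.fps))
        hi = min(len(self.frames), int(end_s * self.fps))
        return self.frames[lo:hi]


History = List[Tuple[str, object]]


def run_episode(env: VideoEnv, policy: Callable[[History], Dict],
                query: str, max_turns: int = 4) -> Tuple[Optional[str], History]:
    """Alternate policy actions and tool observations until an answer is given."""
    tools = {"view_visual_prompt": env.view_visual_prompt,
             "crop_video": env.crop_video}
    history: History = [("user", query)]
    for _ in range(max_turns):
        action = policy(history)
        if action["type"] == "answer":
            return action["text"], history
        observation = tools[action["name"]](**action["args"])
        history.append(("tool", (action["name"], observation)))
    return None, history  # turn budget exhausted without an answer


def scripted_policy(history: History) -> Dict:
    """Stand-in for the model: look at the prompt, crop once, then answer."""
    n_calls = sum(1 for role, _ in history if role == "tool")
    if n_calls == 0:
        return {"type": "tool", "name": "view_visual_prompt", "args": {}}
    if n_calls == 1:
        return {"type": "tool", "name": "crop_video",
                "args": {"start_s": 2.0, "end_s": 4.0}}
    return {"type": "answer", "text": "done"}


env = VideoEnv(frames=[f"f{i}" for i in range(10)], fps=1.0, visual_prompt="vp")
answer, history = run_episode(env, scripted_policy, "What does the marked person do?")
```

In a real system the scripted policy would be replaced by the LVLM, and each tool observation would be appended to the model's context as image tokens rather than strings.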

Data Construction

Figure: Four-stage Data Pipeline. (1) Low-cost Text Filtering, (2) Video-level Verification, (3) Pixel-level Mask Generation, (4) Visual Prompt Rendering.
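One way to read the four stages is as a funnel in which cheap checks run first and expensive steps only see the survivors. The sketch below illustrates that control flow only. The stage names follow the figure caption, but every stage body is a placeholder, and the sample fields (caption, verified, mask, visual_prompt) are hypothetical, since the page does not describe what each stage does internally.

```python
from typing import Callable, Dict, Iterable, List, Optional

Sample = Dict[str, object]
Stage = Callable[[Sample], Optional[Sample]]  # None means "discard this sample"


def text_filter(sample: Sample) -> Optional[Sample]:
    # Stage 1 placeholder: a cheap text-only check (here: caption must be non-empty).
    return sample if sample.get("caption") else None


def video_verify(sample: Sample) -> Optional[Sample]:
    # Stage 2 placeholder: keep only samples flagged as verified against the video.
    return sample if sample.get("verified") else None


def generate_mask(sample: Sample) -> Optional[Sample]:
    # Stage 3 placeholder: attach a mask; a fixed box stands in for a real mask.
    return {**sample, "mask": (0, 0, 10, 10)}


def render_prompt(sample: Sample) -> Optional[Sample]:
    # Stage 4 placeholder: turn the mask into a visual prompt description.
    return {**sample, "visual_prompt": f"box{sample['mask']}"}


def run_pipeline(samples: Iterable[Sample], stages: List[Stage]) -> List[Sample]:
    """Pass each sample through the stages in order, dropping it on the first None."""
    kept: List[Sample] = []
    for sample in samples:
        current: Optional[Sample] = sample
        for stage in stages:
            current = stage(current)
            if current is None:
                break
        else:
            kept.append(current)
    return kept


raw = [
    {"caption": "", "verified": True},                 # dropped at stage 1
    {"caption": "a dog runs", "verified": False},      # dropped at stage 2
    {"caption": "a man in red jumps", "verified": True},
]
synthesized = run_pipeline(raw, [text_filter, video_verify, generate_mask, render_prompt])
```

Ordering the stages from cheapest to most expensive keeps pixel-level work limited to samples that already passed the text and video checks.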

Training Strategy

  1. Cold-start SFT: 34.2k high-quality trajectories for foundational tool-calling behaviors
  2. Agentic RL (GRPO): 4.1k curated samples with a three-component reward: accuracy, format compliance, and parsimony
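To make the RL stage concrete, the sketch below combines the three named reward components into a scalar and then applies the group-relative normalization used by GRPO, where each rollout's advantage is its reward minus the group mean, divided by the group standard deviation. The weights, the linear parsimony term, the max_calls cap, and the use of the population standard deviation are assumptions for illustration; the page does not give the actual reward definition.

```python
import statistics
from typing import List


def trajectory_reward(correct: bool, well_formed: bool, n_tool_calls: int,
                      w_acc: float = 1.0, w_fmt: float = 0.5, w_par: float = 0.1,
                      max_calls: int = 4) -> float:
    """Weighted sum of accuracy, format compliance, and parsimony (assumed form)."""
    accuracy = 1.0 if correct else 0.0
    fmt = 1.0 if well_formed else 0.0
    # Parsimony: fewer tool calls score higher, clipped so it never goes negative.
    parsimony = max(0.0, 1.0 - n_tool_calls / max_calls)
    return w_acc * accuracy + w_fmt * fmt + w_par * parsimony


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps guards against std == 0
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for the same question: (correct, well_formed, n_tool_calls)
rollouts = [(True, True, 1), (False, True, 3), (True, True, 4), (False, False, 0)]
rewards = [trajectory_reward(*r) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

With these assumed weights, a correct answer reached with one tool call is preferred over a correct answer that used four, which is the behavior a parsimony term is meant to encourage.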

Main Results

Figure: Performance comparison on V2P-Bench and general video understanding benchmarks.

Key Findings

Generalization to General Video Understanding

Despite being trained exclusively on instance-level video understanding tasks, VideoSeeker demonstrates strong cross-task generalization on general video benchmarks, achieving +3.2% and +3.3% improvements. This reveals that core capabilities learned from instance-level tasks, such as long-range visual reasoning and multi-turn reasoning, transfer compositionally to broader video understanding scenarios.

The Heterogeneous Distillation Paradox

We discover that the raw capability of a teacher model does not proportionally transfer to distillation performance. In homogeneous distillation, teachers and students share similar patterns, enabling efficient knowledge transfer; in heterogeneous distillation, pattern divergence is significant, causing stronger teachers' knowledge to be less effectively absorbed.

Reward Hacking on Multiple-Choice Data

RL training on multiple-choice (MC) data leads to a significant performance drop (43.8%) as models exploit random guessing. In contrast, open-ended (OE) training with LLM judges achieves 74.5%, demonstrating that OE provides a more robust strategy for RL training.
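A small worked example shows why guessing can be rewarded under MC supervision. With k answer options, a uniform random guess earns the accuracy reward with probability 1/k, so some rollouts in every sampled group are reinforced without any visual evidence behind them. The option count and group size below are assumed values for illustration, not figures from the paper.

```python
from fractions import Fraction


def expected_guess_reward(num_options: int) -> Fraction:
    """Expected accuracy reward of a uniform random guess on one MC question."""
    return Fraction(1, num_options)


def expected_lucky_rollouts(group_size: int, num_options: int) -> Fraction:
    """Expected number of rollouts per group that are rewarded by pure guessing."""
    return group_size * expected_guess_reward(num_options)


# Assumed setting: 4 answer options, 8 rollouts sampled per question.
p_guess = expected_guess_reward(4)
lucky = expected_lucky_rollouts(8, 4)
```

For free-form answers scored by a judge, the chance that an unsupported guess matches the reference is typically far lower, which is consistent with the open-ended setting being harder to exploit.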

BibTeX

@article{zhao2026videoseeker,
  title={VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation},
  author={Zhao, Yiming and Zeng, Yu and Huang, Wenxuan and Fang, Zhen and Miao, Qing and Su, Qisheng and Zhao, Jiawei and Cai, Jiayin and Chen, Lin and Chen, Zehui and Qi, Yukun and Hu, Yao and Jiang, Xiaolong and Zhao, Feng},
  institution={{University of Science and Technology of China, Xiaohongshu Inc., East China Normal University, Xi'an Jiaotong University}},
  journal={arXiv preprint arXiv:2605.16079},
  year={2026},
  url={https://arxiv.org/abs/2605.16079}
}