
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

Homepage arXiv Video

Ziyang Wang1,2*, Honglu Zhou1, Shijie Wang1, Junnan Li1, Caiming Xiong1, Silvio Savarese1, Mohit Bansal2, Michael S. Ryoo1, Juan Carlos Niebles1

1 Salesforce AI Research
2 UNC Chapel Hill
* Work done during internship at Salesforce



Table of Contents

  • Highlights
  • Setup
  • Evaluation
  • Citation

Highlights

Active Video Perception (AVP) is an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels.

Key ideas:

  • Treat long videos as interactive environments
  • Iteratively plan → observe → reflect to seek evidence
  • Allocate computation adaptively to informative regions
  • Improve grounding, efficiency, and reasoning faithfulness

AVP consistently improves over strong MLLM backbones and prior agentic frameworks across multiple long video understanding benchmarks.




Setup

1. Create Conda Environment

Create and activate a fresh conda environment with the required Python version:

conda create -n avp python=3.10 -y
conda activate avp

2. Install System Dependencies

conda install -c conda-forge ffmpeg
ffmpeg -version
pip install -r requirements.txt

3. Setup Annotation Files and Config Files

Before running evaluation, download the videos from each benchmark's original Hugging Face release, update the video paths in the annotation files under avp/eval_anno/, and fill in the Gemini API information in avp/config.example.json.
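If the annotation files store absolute video paths, one quick way to repoint them at your local download is a plain path substitution; both roots below are placeholders and the exact layout of the annotation JSON is an assumption, so inspect the files first:

# Hypothetical paths: replace the placeholder root with your local video directory
sed -i 's|/path/to/original/videos|/your/local/videos|g' avp/eval_anno/*.json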

For API keys (a minimal config sketch follows the two options below):

Vertex AI (default): set the project and location fields in the config to use GCP Vertex AI.

API key (Google AI Studio): set the GEMINI_API_KEY environment variable (or the optional api_key field in the config).
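A minimal sketch of what the Gemini section of avp/config.example.json might look like; the flat layout and the placeholder values are assumptions (only project, location, and api_key are named above), so treat the shipped example file as the source of truth:

{
  "project": "your-gcp-project-id",
  "location": "us-central1",
  "api_key": ""
}

If you use Google AI Studio instead of Vertex AI, export the key before launching, e.g. export GEMINI_API_KEY=your-api-key.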


Evaluation

Set the following in avp/parrelel_run.sh before running (an example of these settings is shown after the lists below):

  • ANNOTATION_FILE – Path to your annotation JSON
  • OUTPUT_DIR – Directory where results will be written
  • CONFIG_FILE – Path to your config JSON (e.g. config.example.json)

Optional (with defaults):

  • LIMIT – Max number of samples (omit for no limit)
  • MAX_TURNS – Max plan–execute cycles per sample (default: 3)
  • NUM_WORKERS – Number of parallel workers (default: 4)
  • TIMEOUT – Timeout per sample in seconds; guards against stalled API calls (omit for no timeout; setting one is recommended)
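For reference, the variables at the top of avp/parrelel_run.sh might be set along these lines; every value below is a placeholder chosen for illustration, not a path or default shipped with the repo:

ANNOTATION_FILE=avp/eval_anno/your_benchmark.json   # annotation JSON for the benchmark
OUTPUT_DIR=outputs/your_run                         # results are written here
CONFIG_FILE=avp/config.example.json                 # Gemini / Vertex AI settings
MAX_TURNS=3                                         # plan–execute cycles per sample
NUM_WORKERS=4                                       # parallel workers
TIMEOUT=600                                         # per-sample timeout in seconds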

Example Script:

bash avp/parrelel_run.sh

Citation

If you find our work useful, please cite:

@misc{wang2025activevideoperceptioniterative,
      title={Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding}, 
      author={Ziyang Wang and Honglu Zhou and Shijie Wang and Junnan Li and Caiming Xiong and Silvio Savarese and Mohit Bansal and Michael S. Ryoo and Juan Carlos Niebles},
      year={2025},
      eprint={2512.05774},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.05774}, 
}
