We propose a novel task set comprising 4 categories and 16 tasks to evaluate how well Video-LLMs perform recall, perception, reasoning, and navigation from embodied videos.
We then select an MCQ example from each task to give an intuitive sense of the benchmark.
a) The pipeline includes four steps: video curation, MCQ generation, blind filtering, and human refinement. b) Histogram of video frame counts. c) Histogram of path lengths. d) Histogram of question word counts. e) Violin plot of word counts across question categories. f) Word cloud generated from questions and choices.
The dataset includes real-world video data from Shenzhen and Zhaoqing, complemented by simulator benchmarks (EmbodiedCity and AerialVLN) for realistic modeling, aerial agent support, and reference routes.
MCQs are generated with a chain-of-thought (CoT) prompting method that combines narration, structured extraction, role-playing, and templates, supported by Gemini-1.5-Flash.
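A minimal sketch of this kind of CoT generation step is shown below, using the `google-generativeai` Python SDK; the role-play persona, template wording, and output fields are illustrative assumptions, not the exact prompts used in the benchmark.

```python
# Sketch of CoT-style MCQ generation with Gemini-1.5-Flash.
# The persona, template wording, and output fields are illustrative assumptions.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

MCQ_TEMPLATE = """You are a UAV pilot flying through a city (role-play).
Step 1 (narration): describe what happens in the video, in order.
Step 2 (structured extraction): list landmarks, motions, and spatial relations as JSON.
Step 3 (question writing): using the extraction, write one multiple-choice question
for the task "{task}" with four options (A-D) and mark the correct answer.
Return JSON with fields: question, options, answer, rationale."""

def generate_mcq(video_path: str, task: str) -> str:
    """Upload a video clip and ask the model to draft one MCQ for the given task."""
    video = genai.upload_file(video_path)  # Files API upload
    response = model.generate_content([video, MCQ_TEMPLATE.format(task=task)])
    return response.text  # raw JSON string, to be parsed and filtered downstream

# Example: draft a "Landmark Position" question for one clip.
# print(generate_mcq("clips/uav_0001.mp4", "Landmark Position"))
```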
The blind filtering step tests multiple Video-LLMs on each MCQ without the video input and removes questions that can be answered with common sense alone, improving dataset quality.
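A minimal sketch of how such a blind filter can be implemented: each candidate MCQ is answered by several models from its text alone, and questions that most blind models still answer correctly are discarded. The `models` callables and the majority threshold are assumptions for illustration.

```python
# Sketch of blind filtering: drop MCQs that models can answer without the video.
# Each entry in `models` is a hypothetical wrapper that takes a text-only prompt
# (question + options, no frames) and returns the model's answer string.
from typing import Callable

def blind_filter(mcqs: list[dict],
                 models: list[Callable[[str], str]],
                 max_correct_ratio: float = 0.5) -> list[dict]:
    """Keep only MCQs that fewer than `max_correct_ratio` of blind models solve."""
    kept = []
    for mcq in mcqs:
        prompt = mcq["question"] + "\n" + "\n".join(mcq["options"])  # no video input
        n_correct = sum(model(prompt).strip().upper().startswith(mcq["answer"])
                        for model in models)
        if n_correct / len(models) < max_correct_ratio:
            kept.append(mcq)  # not solvable by common sense alone -> keep
    return kept
```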
Issues like ambiguous navigation targets, hallucinated elements, imprecise directions, and incorrect options in MCQs required extensive human refinement, totaling 800+ hours.
The dataset includes 1,547 video clips with diverse resolutions and durations, covering varied UAV trajectories and scenarios, and over 5.2K MCQs for comprehensive evaluation.
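For concreteness, one benchmark record of this kind might look like the sketch below; the field names and values are illustrative assumptions, not the released schema.

```python
# Illustrative example of what a single MCQ record in such a benchmark might contain.
# Field names and values are assumptions, not the released schema.
example_mcq = {
    "video_id": "uav_0001",               # one of the 1,547 video clips
    "source": "real-world (Shenzhen)",     # or "EmbodiedCity" / "AerialVLN"
    "category": "Perception",
    "task": "Landmark Position",
    "question": "Relative to the UAV's final heading, where is the glass tower?",
    "options": ["A. Front-left", "B. Front-right", "C. Behind", "D. Directly below"],
    "answer": "B",
}
```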
We initially evaluate the performance of 17 popular Video-LLMs on various tasks related to embodied cognition in motion. Subsequently, we conduct detailed analyses focusing on the models, tasks, and video data sources. Finally, we summarize and categorize the reasons for failures across different tasks.
From the results in the table below, we can draw the following conclusions:
Task columns are grouped by ability: Recall (Trajectory Captioning to Scene Recall), Perception (Start/End Position to Cognitive Map), Reasoning (Causal to Association), and Navigation (Progress Evaluation to Action Generation).

| Method | Rank | Avg. | Trajectory Captioning | Sequence Recall | Object Recall | Scene Recall | Start/End Position | Proximity | Duration | Landmark Position | Goal Detection | Cognitive Map | Causal | Counterfactual | Association | Progress Evaluation | High-level Planning | Action Generation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Baseline** | | | | | | | | | | | | | | | | | | |
| Random | - | 19.7 | 18.5 | 17.0 | 20.8 | 13.5 | 21.8 | 37.8 | 35.6 | 19.7 | 18.0 | 21.9 | 18.2 | 25.0 | 18.3 | 21.8 | 15.9 | 16.4 |
| **Proprietary Models (API)** | | | | | | | | | | | | | | | | | | |
| Gemini-1.5-Flash [1 fps] | 4 | 40.5 | 39.7 | 51.8 | 61.7 | 79.3 | 61.3 | 47.1 | 59.8 | 37.8 | 28.7 | 47.9 | 60.0 | 42.4 | 20.0 | 43.3 | 32.6 | 34.4 |
| Gemini-1.5-Pro [1 fps] | 3 | 42.5 | 58.6 | 61.6 | 65.0 | 72.1 | 66.2 | 66.4 | 63.6 | 37.4 | 33.8 | 46.0 | 63.6 | 46.2 | 23.0 | 38.8 | 43.8 | 31.9 |
| Gemini-2.0-Flash [1 fps] | 5 | 38.3 | 47.9 | 58.9 | 63.3 | 75.7 | 57.0 | 66.4 | 47.7 | 27.9 | 27.8 | 45.3 | 62.7 | 24.2 | 17.8 | 39.2 | 48.4 | 30.5 |
| GPT-4o-mini [32f] | 6 | 36.5 | 33.0 | 53.6 | 48.3 | 59.5 | 56.3 | 69.7 | 51.5 | 33.3 | 31.3 | 42.4 | 65.5 | 47.7 | 22.9 | 30.8 | 57.5 | 25.4 |
| GPT-4o [32f] | 2 | 43.6 | 47.6 | 58.9 | 65.0 | 67.6 | 61.3 | 63.0 | 47.7 | 36.8 | 42.4 | 52.8 | 66.4 | 44.7 | 45.8 | 34.2 | 67.8 | 33.8 |
| Qwen-VL-Max-latest [32f] | 1 | 45.5 | 44.9 | 70.5 | 64.2 | 75.7 | 73.9 | 78.2 | 43.9 | 44.8 | 44.7 | 61.1 | 77.3 | 49.2 | 23.9 | 38.8 | 70.0 | 29.6 |
| **Open-source Models** | | | | | | | | | | | | | | | | | | |
| LLaVA-NeXt-Video-7B-hf [32f] | 3 | 38.6 | 55.7 | 39.3 | 43.3 | 61.3 | 40.8 | 58.8 | 52.3 | 49.5 | 16.7 | 26.8 | 44.5 | 20.5 | 58.7 | 36.6 | 52.3 | 19.2 |
| Phi-3.5-vision-instruct [32f] | 2 | 38.7 | 67.0 | 57.1 | 57.5 | 64.9 | 45.1 | 48.7 | 45.5 | 49.2 | 17.0 | 52.1 | 51.8 | 34.8 | 13.9 | 33.2 | 59.7 | 15.6 |
| Kangaroo [64f] | 1 | 39.2 | 27.0 | 66.1 | 60.8 | 69.4 | 53.5 | 75.6 | 57.6 | 35.5 | 37.2 | 60.0 | 64.5 | 42.4 | 19.1 | 32.5 | 41.9 | 32.4 |
| Qwen2-VL-2B-Instruct [0.5 fps] | 5 | 31.9 | 29.9 | 54.5 | 30.8 | 57.7 | 24.6 | 69.7 | 47.7 | 22.0 | 22.1 | 64.2 | 46.4 | 35.6 | 13.5 | 28.8 | 44.2 | 27.3 |
| Qwen2-VL-7B-Instruct [0.25 fps] | 4 | 36.2 | 36.5 | 50.9 | 47.5 | 65.8 | 47.2 | 52.1 | 48.5 | 25.1 | 28.4 | 55.8 | 55.5 | 29.5 | 11.7 | 33.9 | 59.3 | 32.7 |
| InternVL2-2B [32f] | 11 | 27.6 | 19.2 | 29.5 | 37.5 | 55.9 | 22.5 | 57.1 | 37.9 | 19.3 | 24.6 | 39.2 | 33.6 | 45.5 | 33.5 | 29.2 | 37.6 | 20.9 |
| InternVL2-4B [32f] | 10 | 28.1 | 19.2 | 37.5 | 33.3 | 62.2 | 24.6 | 66.4 | 42.4 | 23.2 | 26.5 | 32.8 | 36.4 | 35.6 | 24.8 | 29.5 | 32.2 | 22.1 |
| InternVL2-8B [32f] | 9 | 28.1 | 23.4 | 23.2 | 35.0 | 52.3 | 22.5 | 58.0 | 44.7 | 23.1 | 27.4 | 28.3 | 33.6 | 45.5 | 27.0 | 31.5 | 35.7 | 21.4 |
| InternVL2-26B [32f] | 8 | 28.3 | 24.3 | 36.6 | 35.0 | 61.3 | 26.8 | 51.2 | 40.2 | 19.9 | 28.1 | 32.4 | 32.7 | 44.7 | 26.5 | 28.9 | 37.6 | 22.8 |
| InternVL2-40B [32f] | 7 | 28.4 | 22.2 | 19.6 | 30.8 | 54.1 | 21.1 | 61.3 | 50.0 | 23.2 | 26.5 | 34.7 | 27.3 | 41.7 | 25.7 | 32.4 | 34.9 | 22.3 |
| InternVL2-Llama3-76B [32f] | 6 | 28.9 | 19.5 | 38.4 | 37.5 | 54.1 | 18.3 | 65.5 | 48.5 | 22.9 | 28.1 | 33.6 | 30.9 | 43.2 | 27.4 | 31.3 | 34.5 | 23.2 |
| **Fine-Tuning (Test set)** | | | | | | | | | | | | | | | | | | |
| InternVL2-4B (before) [32f] | 3 | 28.3 | 21.3 | 45.2 | 31.8 | 63.0 | 20.4 | 66.7 | 43.8 | 27.1 | 27.9 | 28.5 | 34.9 | 39.6 | 24.1 | 27.4 | 29.7 | 23.0 |
| InternVL2-4B (after) [32f] | 2 | 31.5 | 25.5 | 38.1 | 34.1 | 60.9 | 20.4 | 66.7 | 37.5 | 22.1 | 38.8 | 33.1 | 32.6 | 50.0 | 31.4 | 28.1 | 39.9 | 28.9 |
| InternVL2-8B (before) [32f] | 4 | 26.5 | 24.5 | 19.0 | 34.1 | 50.0 | 22.4 | 55.6 | 37.5 | 22.8 | 24.5 | 26.2 | 25.6 | 54.2 | 22.6 | 24.3 | 37.0 | 22.1 |
| InternVL2-8B (after) [32f] | 1 | 31.7 | 25.5 | 35.7 | 34.1 | 60.9 | 18.4 | 66.7 | 39.6 | 23.8 | 37.4 | 31.5 | 34.9 | 50.0 | 32.8 | 27.7 | 39.1 | 29.4 |
Pairwise correlations between task accuracies reveal insights into the underlying cognitive abilities (a minimal sketch of how such correlations can be computed follows this list):
- Causal reasoning correlates highly with most other tasks, highlighting its foundational role and suggesting it underpins the emergence of embodied cognitive abilities in motion.
- Recall tasks are strongly interrelated, emphasizing memory as a shared requirement and a central component of these cognitive processes.
- Navigation tasks correlate strongly with Recall and Perception tasks, confirming that action and planning rely on memory and perceptual abilities.
- Counterfactual and Association reasoning show low correlations with other tasks, suggesting they rely on distinct, independent cognitive processes.
- These findings suggest that such high-level reasoning tasks require targeted training, as they operate independently of general embodied cognitive abilities.
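As a rough illustration of how these pairwise task correlations can be computed from per-model results like the table above, the sketch below correlates per-task accuracy vectors across models with pandas; the input file and column names are assumptions.

```python
# Sketch: pairwise Pearson correlations between tasks, computed across models.
# Assumes a CSV with one row per model and one accuracy column per task
# (e.g. exported from the results table above); file/column names are illustrative.
import pandas as pd

acc = pd.read_csv("per_task_accuracy.csv", index_col="Method")  # models x 16 tasks
task_corr = acc.corr(method="pearson")                          # 16 x 16 matrix

# Tasks most correlated with Causal reasoning, excluding itself.
print(task_corr["Causal"].drop("Causal").sort_values(ascending=False).head())
```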
We evaluate Video-LLMs across the four cognitive abilities using three data sources. To address the lack of real-world data in embodied research, we train models on simulator data (EmbodiedCity and AerialVLN) and then test them on real-world videos. Fine-tuning InternVL2-4B and InternVL2-8B with LoRA improves Sim2Real transfer, with mean accuracy gains of 3.2% and 5.2%, respectively.
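A minimal sketch of attaching LoRA adapters for this kind of Sim2Real fine-tuning, using the Hugging Face `peft` library; the base checkpoint, target modules, and hyperparameters are assumptions rather than the exact training recipe.

```python
# Sketch: wrap InternVL2-4B with LoRA adapters via peft for Sim2Real fine-tuning.
# Hyperparameters and target_modules are illustrative assumptions.
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("OpenGVLab/InternVL2-4B", trust_remote_code=True)

lora_cfg = LoraConfig(
    r=16,                      # low-rank dimension
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable

# Train on simulator MCQs (EmbodiedCity / AerialVLN), then evaluate on real-world clips.
```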
Three common error types in the reasoning process of Video-LLMs are identified: urban element/scene understanding errors, motion understanding errors, and egocentric thinking errors.
@misc{zhao2025urbanvideobench,
title={UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces},
author={Baining Zhao and Jianjie Fang and Zichao Dai and Ziyou Wang and Jirong Zha and Weichen Zhang and Chen Gao and Yue Wang and Jinqiang Cui and Xinlei Chen and Yong Li},
year={2025},
eprint={2503.06157},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.06157},
}