
UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces

Urban Embodied Video Characteristics: Our dataset reflects two embodied characteristics of the city: complex urban scenes with dynamic/static elements and unique aerial navigation.
Task Set and MCQ Examples: We propose a novel task set comprising 4 categories and 16 tasks to evaluate how Video-LLMs recall, perceive, reason, and navigate from embodied videos.
Dataset Generation Pipeline and Statistics: Our dataset comprises about 1.5k videos and over 5.2k MCQs, built through video collection, CoT-based generation, blind filtering, and human refinement.
Experiments: We evaluate the performance of 17 popular Video-LLMs on tasks related to embodied cognition in motion, and further conduct correlation analysis, Sim-to-Real fine-tuning experiments, and error analysis.

1. Urban Embodied Video Characteristics

Complex Scene and Rich Semantic Information: Urban areas are vast, containing diverse elements like skyscrapers, bridges, and tunnels that provide rich semantic information and pose comprehension and navigation challenges, while dynamic elements like pedestrians and vehicles require real-time adaptation.

Real City

Complex city street view, buildings, cars, electric vehicles and so on

EmbodiedCity Simulator

Tall buildings, trees, bustling city street scenes and so on

AerialVLN Simulator

Streets, lakes, city streetscapes, etc

Unique Aerial Motion: Aerial navigation involves vertical mobility and a first-person perspective, adding complexity by requiring enhanced embodied cognition for processing diverse motion and observation angles, necessitating advanced spatial awareness and decision-making.

Real City

Gimbal angle upward

EmbodiedCity Simulator

Fly downwards

AerialVLN Simulator

Horizontal rotation


2. Task Set and MCQ Examples

We propose a novel task set comprising 4 categories and 16 tasks to evaluate how Video-LLMs recall, perceive, reason, and navigate from embodied videos.

We have selected one MCQ example, paired with its video clip, from each task to give a feel for the benchmark. The 16 tasks are: Trajectory Captioning, Sequence Recall, Object Recall, Scene Recall, Start/End Position, Proximity, Duration, Landmark Position, Goal Detection, Cognitive Map, Causal, Counterfactual, Association, Progress Evaluation, High-level Planning, and Action Generation.


3. Dataset Generation Pipeline and Statistics

Figure 1. a) The pipeline includes four steps: video curation, MCQ generation, blind filtering, and human refinement. b) Histogram of video frame counts. c) Histogram of path lengths. d) Histogram of question word counts. e) Violin plot of question word counts by category. f) Word cloud generated from questions and choices.


Video Curation

The dataset includes real-world video data from Shenzhen and Zhaoqing, complemented by simulator benchmarks (EmbodiedCity and AerialVLN) for realistic modeling, aerial agent support, and reference routes.

MCQ Generation

A CoT prompting method with narration, structured extraction, role-playing, and templates is used to generate MCQs, with support from Gemini-1.5-Flash.
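The sketch below illustrates how such a CoT generation pass could look in code. The prompt wording, the JSON output fields, and the use of the public google-generativeai client are our own assumptions for illustration, not the authors' exact pipeline.

```python
"""Sketch of a CoT-style MCQ generation pass (prompts and JSON schema are illustrative)."""
import json
import os
import time

import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

# Hypothetical prompt stages mirroring the description above:
# narration -> structured extraction -> role-playing with an MCQ template.
PROMPT = """You are shown an egocentric aerial video recorded in a city.
Step 1 (narration): describe the flight and the scene chronologically.
Step 2 (structured extraction): list landmarks, motions, and their order.
Step 3 (role-play): acting as an exam writer, turn the extraction into one
multiple-choice question for the task "{task}".
Output only the final JSON object with keys "question", "choices"
(4 strings), and "answer" (the correct letter)."""


def generate_mcq(video_path: str, task: str) -> dict:
    """Upload one clip and ask Gemini for a single candidate MCQ for the given task."""
    video = genai.upload_file(video_path)
    while video.state.name == "PROCESSING":  # wait until the uploaded file is ready
        time.sleep(2)
        video = genai.get_file(video.name)
    response = model.generate_content([video, PROMPT.format(task=task)])
    text = response.text
    # Grab the outermost JSON object; a real pipeline needs more robust parsing.
    return json.loads(text[text.index("{"): text.rindex("}") + 1])


if __name__ == "__main__":
    print(generate_mcq("clip_0001.mp4", "Landmark Position"))
```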

Blind Filtering

The blind-filtering step presents each MCQ to multiple Video-LLMs without the video input and eliminates questions that can be answered from common sense alone, improving dataset quality.
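A minimal sketch of the blind-filtering idea; the MCQ dictionary layout, the `answer_blind` wrapper, and the `max_blind_correct` threshold are assumptions, since the exact rejection rule is not spelled out here.

```python
"""Blind filtering sketch: drop MCQs that several models answer correctly without video."""
from typing import Callable, Dict, List

MCQ = Dict[str, object]                     # {"question": str, "choices": [...], "answer": "B", ...}
BlindAnswerFn = Callable[[str, MCQ], str]   # (model_name, mcq) -> predicted letter, no frames given


def blind_filter(
    mcqs: List[MCQ],
    models: List[str],
    answer_blind: BlindAnswerFn,
    max_blind_correct: int = 1,
) -> List[MCQ]:
    """Keep an MCQ only if at most `max_blind_correct` of the models answer it
    correctly when shown the question and choices but no video."""
    kept = []
    for mcq in mcqs:
        blind_correct = sum(answer_blind(m, mcq) == mcq["answer"] for m in models)
        if blind_correct <= max_blind_correct:   # not solvable from common sense alone
            kept.append(mcq)
    return kept
```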

Human Refinement

Issues like ambiguous navigation targets, hallucinated elements, imprecise directions, and incorrect options in MCQs required extensive human refinement, totaling 800+ hours.

Dataset Statistics

The dataset includes 1,547 video clips with diverse resolutions and durations, covering varied UAV trajectories and scenarios, and over 5.2K MCQs for comprehensive evaluation.
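The panels in Figure 1 b)-d) are simple per-sample statistics; a small sketch of how they could be reproduced from a metadata table is shown below (the CSV file and its column names `frame_count`, `path_length_m`, and `question` are hypothetical).

```python
"""Sketch: histogram-style dataset statistics from a per-sample metadata table."""
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("urbanvideo_bench_metadata.csv")   # hypothetical export of the benchmark metadata

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(df["frame_count"], bins=40)                       # Figure 1 b): frames per video
axes[0].set_xlabel("frames per video")
axes[1].hist(df["path_length_m"], bins=40)                     # Figure 1 c): UAV path length
axes[1].set_xlabel("path length (m)")
axes[2].hist(df["question"].str.split().str.len(), bins=40)    # Figure 1 d): question word count
axes[2].set_xlabel("words per question")
plt.tight_layout()
plt.savefig("dataset_statistics.png")
```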


4. Experiments

4.1. Quantitative results

We initially evaluate the performance of 17 popular Video-LLMs on various tasks related to embodied cognition in motion. Subsequently, we conduct detailed analyses focusing on the models, tasks, and video data sources. Finally, we summarize and categorize the reasons for failures across different tasks.
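A minimal sketch of this evaluation protocol, assuming a hypothetical `model.answer(frames, prompt)` wrapper around any Video-LLM and a `load_frames` helper for uniform frame sampling (e.g. the 32-frame or fixed-fps settings listed in the results table below):

```python
"""Sketch of per-task MCQ accuracy evaluation for one Video-LLM (wrappers are hypothetical)."""
import re
from collections import defaultdict


def build_prompt(mcq: dict) -> str:
    choices = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", mcq["choices"]))
    return f"{mcq['question']}\n{choices}\nAnswer with a single letter."


def parse_choice(reply: str) -> str:
    """Take the first standalone A-D letter in the model's reply."""
    match = re.search(r"\b([A-D])\b", reply.upper())
    return match.group(1) if match else ""


def evaluate(model, mcqs, load_frames, num_frames: int = 32) -> dict:
    """Return per-task accuracy; `model.answer` and `load_frames` are assumed wrappers."""
    correct, total = defaultdict(int), defaultdict(int)
    for mcq in mcqs:
        frames = load_frames(mcq["video_path"], num_frames)    # e.g. uniform 32-frame sampling
        reply = model.answer(frames, build_prompt(mcq))
        total[mcq["task"]] += 1
        correct[mcq["task"]] += int(parse_choice(reply) == mcq["answer"])
    return {task: correct[task] / total[task] for task in total}
```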


We can draw the following conclusions:

  • Both proprietary and open-source models exhibit relatively poor embodied cognitive abilities when navigating urban open-ended spaces. The best-performing model, Qwen-VL-Max, achieves an average accuracy of only 45.5%. This underscores the value of the benchmark, highlighting that embodied cognition in urban three-dimensional spaces has not been adequately addressed.
  • Some open-source Video-LLMs outperform some proprietary models. In particular, models optimized for video data perform better than LMMs that focus on images.
  • Models with fewer parameters appear less stable: for two models with comparable average accuracy, the smaller open-source model tends to have a lower minimum per-task accuracy than the larger proprietary model.
Accuracy (%) of each model on the 16 tasks. The task columns are grouped, from left to right, into the Recall, Perception, Reasoning, and Navigation categories; Rank is computed within each model group.

| Method | Rank | Avg. | Trajectory Captioning | Sequence Recall | Object Recall | Scene Recall | Start/End Position | Proximity | Duration | Landmark Position | Goal Detection | Cognitive Map | Causal | Counterfactual | Association | Progress Evaluation | High-level Planning | Action Generation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Random | - | 19.7 | 18.5 | 17.0 | 20.8 | 13.5 | 21.8 | 37.8 | 35.6 | 19.7 | 18.0 | 21.9 | 18.2 | 25.0 | 18.3 | 21.8 | 15.9 | 16.4 |
| Proprietary Models (API) |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Gemini-1.5-Flash [1 fps] | 4 | 40.5 | 39.7 | 51.8 | 61.7 | 79.3 | 61.3 | 47.1 | 59.8 | 37.8 | 28.7 | 47.9 | 60.0 | 42.4 | 20.0 | 43.3 | 32.6 | 34.4 |
| Gemini-1.5-Pro [1 fps] | 3 | 42.5 | 58.6 | 61.6 | 65.0 | 72.1 | 66.2 | 66.4 | 63.6 | 37.4 | 33.8 | 46.0 | 63.6 | 46.2 | 23.0 | 38.8 | 43.8 | 31.9 |
| Gemini-2.0-Flash [1 fps] | 5 | 38.3 | 47.9 | 58.9 | 63.3 | 75.7 | 57.0 | 66.4 | 47.7 | 27.9 | 27.8 | 45.3 | 62.7 | 24.2 | 17.8 | 39.2 | 48.4 | 30.5 |
| GPT-4o-mini [32f] | 6 | 36.5 | 33.0 | 53.6 | 48.3 | 59.5 | 56.3 | 69.7 | 51.5 | 33.3 | 31.3 | 42.4 | 65.5 | 47.7 | 22.9 | 30.8 | 57.5 | 25.4 |
| GPT-4o [32f] | 2 | 43.6 | 47.6 | 58.9 | 65.0 | 67.6 | 61.3 | 63.0 | 47.7 | 36.8 | 42.4 | 52.8 | 66.4 | 44.7 | 45.8 | 34.2 | 67.8 | 33.8 |
| Qwen-VL-Max-latest [32f] | 1 | 45.5 | 44.9 | 70.5 | 64.2 | 75.7 | 73.9 | 78.2 | 43.9 | 44.8 | 44.7 | 61.1 | 77.3 | 49.2 | 23.9 | 38.8 | 70.0 | 29.6 |
| Open-source Models |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| LLaVA-NeXt-Video-7B-hf [32f] | 3 | 38.6 | 55.7 | 39.3 | 43.3 | 61.3 | 40.8 | 58.8 | 52.3 | 49.5 | 16.7 | 26.8 | 44.5 | 20.5 | 58.7 | 36.6 | 52.3 | 19.2 |
| Phi-3.5-vision-instruct [32f] | 2 | 38.7 | 67.0 | 57.1 | 57.5 | 64.9 | 45.1 | 48.7 | 45.5 | 49.2 | 17.0 | 52.1 | 51.8 | 34.8 | 13.9 | 33.2 | 59.7 | 15.6 |
| Kangaroo [64f] | 1 | 39.2 | 27.0 | 66.1 | 60.8 | 69.4 | 53.5 | 75.6 | 57.6 | 35.5 | 37.2 | 60.0 | 64.5 | 42.4 | 19.1 | 32.5 | 41.9 | 32.4 |
| Qwen2-VL-2B-Instruct [0.5 fps] | 5 | 31.9 | 29.9 | 54.5 | 30.8 | 57.7 | 24.6 | 69.7 | 47.7 | 22.0 | 22.1 | 64.2 | 46.4 | 35.6 | 13.5 | 28.8 | 44.2 | 27.3 |
| Qwen2-VL-7B-Instruct [0.25 fps] | 4 | 36.2 | 36.5 | 50.9 | 47.5 | 65.8 | 47.2 | 52.1 | 48.5 | 25.1 | 28.4 | 55.8 | 55.5 | 29.5 | 11.7 | 33.9 | 59.3 | 32.7 |
| InternVL2-2B [32f] | 11 | 27.6 | 19.2 | 29.5 | 37.5 | 55.9 | 22.5 | 57.1 | 37.9 | 19.3 | 24.6 | 39.2 | 33.6 | 45.5 | 33.5 | 29.2 | 37.6 | 20.9 |
| InternVL2-4B [32f] | 10 | 28.1 | 19.2 | 37.5 | 33.3 | 62.2 | 24.6 | 66.4 | 42.4 | 23.2 | 26.5 | 32.8 | 36.4 | 35.6 | 24.8 | 29.5 | 32.2 | 22.1 |
| InternVL2-8B [32f] | 9 | 28.1 | 23.4 | 23.2 | 35.0 | 52.3 | 22.5 | 58.0 | 44.7 | 23.1 | 27.4 | 28.3 | 33.6 | 45.5 | 27.0 | 31.5 | 35.7 | 21.4 |
| InternVL2-26B [32f] | 8 | 28.3 | 24.3 | 36.6 | 35.0 | 61.3 | 26.8 | 51.2 | 40.2 | 19.9 | 28.1 | 32.4 | 32.7 | 44.7 | 26.5 | 28.9 | 37.6 | 22.8 |
| InternVL2-40B [32f] | 7 | 28.4 | 22.2 | 19.6 | 30.8 | 54.1 | 21.1 | 61.3 | 50.0 | 23.2 | 26.5 | 34.7 | 27.3 | 41.7 | 25.7 | 32.4 | 34.9 | 22.3 |
| InternVL2-Llama3-76B [32f] | 6 | 28.9 | 19.5 | 38.4 | 37.5 | 54.1 | 18.3 | 65.5 | 48.5 | 22.9 | 28.1 | 33.6 | 30.9 | 43.2 | 27.4 | 31.3 | 34.5 | 23.2 |
| Fine-Tuning (Test set) |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| InternVL2-4B (before) [32f] | 3 | 28.3 | 21.3 | 45.2 | 31.8 | 63.0 | 20.4 | 66.7 | 43.8 | 27.1 | 27.9 | 28.5 | 34.9 | 39.6 | 24.1 | 27.4 | 29.7 | 23.0 |
| InternVL2-4B (after) [32f] | 2 | 31.5 | 25.5 | 38.1 | 34.1 | 60.9 | 20.4 | 66.7 | 37.5 | 22.1 | 38.8 | 33.1 | 32.6 | 50.0 | 31.4 | 28.1 | 39.9 | 28.9 |
| InternVL2-8B (before) [32f] | 4 | 26.5 | 24.5 | 19.0 | 34.1 | 50.0 | 22.4 | 55.6 | 37.5 | 22.8 | 24.5 | 26.2 | 25.6 | 54.2 | 22.6 | 24.3 | 37.0 | 22.1 |
| InternVL2-8B (after) [32f] | 1 | 31.7 | 25.5 | 35.7 | 34.1 | 60.9 | 18.4 | 66.7 | 39.6 | 23.8 | 37.4 | 31.5 | 34.9 | 50.0 | 32.8 | 27.7 | 39.1 | 29.4 |

4.2. Correlation of Cognitive Abilities

Pairwise correlations between tasks reveal insights into underlying cognitive abilities, with causal reasoning showing high correlation with most tasks, highlighting its foundational role.


Causal reasoning emerges as a key factor in a wide range of cognitive processes, suggesting its potential role in the emergence of embodied cognitive abilities in motion.

Recall tasks are strongly interrelated, emphasizing memory as a shared requirement and central component in cognitive processes.

Navigation tasks correlate strongly with Recall and Perception tasks, confirming the reliance of action and planning on memory and perceptual abilities.

Counterfactual and Association reasoning tasks show low correlations with others, suggesting they rely on distinct, independent cognitive processes.


These findings suggest that high-level reasoning tasks require targeted training, as they operate independently from general embodied cognitive abilities.
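A minimal sketch of the pairwise analysis behind these observations, treating each task as a vector of per-model accuracies. Two columns from the Section 4.1 table are shown; whether Pearson or Spearman correlation is used is not specified here, so the choice of `method` is an assumption.

```python
"""Sketch: pairwise correlation between tasks across models' per-task accuracies."""
import pandas as pd

# Rows = models, columns = tasks, values = accuracy (%).
# A real analysis would load all 16 task columns from the Section 4.1 table.
scores = pd.DataFrame(
    {
        "Causal":            [60.0, 63.6, 62.7, 65.5, 66.4, 77.3],
        "Action Generation": [34.4, 31.9, 30.5, 25.4, 33.8, 29.6],
    },
    index=["Gemini-1.5-Flash", "Gemini-1.5-Pro", "Gemini-2.0-Flash",
           "GPT-4o-mini", "GPT-4o", "Qwen-VL-Max-latest"],
)

corr = scores.corr(method="spearman")   # rank correlation is robust to scale differences
print(corr.round(2))
```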


4.3. Sim-to-Real

We evaluate Video-LLMs across the four cognitive abilities using the three data sources. To address the lack of real-world data in embodied research, we fine-tune models on simulator data from EmbodiedCity and AerialVLN and then test them on real-world data. Fine-tuning InternVL2-4B and InternVL2-8B with LoRA improves Sim-to-Real transfer, with mean accuracy gains of 3.2 and 5.2 percentage points, respectively.
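A minimal sketch of attaching LoRA adapters for such a Sim-to-Real fine-tune with the Hugging Face peft library; the rank, dropout, and `target_modules` choice are illustrative assumptions rather than the exact InternVL2 recipe.

```python
"""Sketch: LoRA adapter setup before Sim-to-Real fine-tuning (hyperparameters assumed)."""
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

# InternVL2 checkpoints ship custom modeling code, hence trust_remote_code=True;
# the actual InternVL2 training entry points differ from this generic sketch.
model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL2-4B", torch_dtype=torch.bfloat16, trust_remote_code=True
)

lora_config = LoraConfig(
    r=16,                         # adapter rank (assumed, not the paper's exact value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",  # attach adapters to all linear layers; module names vary per model
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable

# Training then runs on simulator MCQs (EmbodiedCity, AerialVLN), and the adapted
# model is evaluated on the real-world split, as described in this subsection.
```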

4.4. Error Analysis

Three common error types are identified in the reasoning process of Video-LLMs: errors in understanding urban elements and scenes, errors in understanding motion, and egocentric-thinking errors.


BibTeX

@misc{zhao2025urbanvideobench,
      title={UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces},
      author={Baining Zhao and Jianjie Fang and Zichao Dai and Ziyou Wang and Jirong Zha and Weichen Zhang and Chen Gao and Yue Wang and Jinqiang Cui and Xinlei Chen and Yong Li},
      year={2025},
      eprint={2503.06157},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.06157},
}