
UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces

Urban Embodied Video Characteristics: Our dataset reflects two embodied characteristics of the city: complex urban scenes with dynamic/static elements and unique aerial navigation.
Task Set and MCQ Examples: We propose a novel task set comprising 4 categories and 16 tasks to evaluate how Video-LLMs recall, perceive, reason, and navigate from embodied videos.
Dataset Generation Pipeline and Statistics: Our dataset comprises about 1.5k videos and over 5.2k MCQs, built through video collection, CoT-based generation, blind filtering, and human refinement.
Experiments: We evaluate the performance of 17 popular Video-LLMs on tasks related to embodied cognition in motion, and further conduct correlation analysis, Sim-to-Real fine-tuning experiments, and error analysis.

1. Urban Embodied Video Characteristics

Complex Scene and Rich Semantic Information: Urban areas are vast, containing diverse elements like skyscrapers, bridges, and tunnels that provide rich semantic information and pose comprehension and navigation challenges, while dynamic elements like pedestrians and vehicles require real-time adaptation.

Real City

Complex city street view, buildings, cars, electric vehicles and so on

EmbodiedCity Simulator

Tall buildings, trees, bustling city street scenes and so on

AerialVLN Simulator

Streets, lakes, city streetscapes, etc

Unique Aerial Motion: Aerial navigation involves vertical mobility and a first-person perspective, adding complexity by requiring enhanced embodied cognition for processing diverse motion and observation angles, necessitating advanced spatial awareness and decision-making.

Real City

Gimbal angle upward

EmbodiedCity Simulator

Fly downwards

AerialVLN Simulator

Horizontal rotation


2. Task Set and MCQ Examples

We propose a novel task set comprising 4 categories and 16 tasks to evaluate how Video-LLMs recall, perceive, reason, and navigate from embodied videos.

We have selected one MCQ example, paired with its video clip, from each task to give a feel for the benchmark. The 16 tasks are: Trajectory Captioning, Sequence Recall, Object Recall, Scene Recall, Start/End Position, Proximity, Duration, Landmark Position, Goal Detection, Cognitive Map, Causal, Counterfactual, Association, Progress Evaluation, High-level Planning, and Action Generation.


3. Dataset Generation Pipeline and Statistics

Figure 1. a) The pipeline includes four steps: video curation, MCQ generation, blind filtering, and human refinement. b) Histogram of video frame counts. c) Histogram of path lengths. d) Histogram of question word counts. e) Violin plot of question word counts by category. f) Word cloud generated from questions and choices.


Video Curation

The dataset includes real-world video data from Shenzhen and Zhaoqing, complemented by simulator benchmarks (EmbodiedCity and AerialVLN) for realistic modeling, aerial agent support, and reference routes.

MCQ Generation

A CoT prompting method with narration, structured extraction, role-playing, and templates is used to generate MCQs, with support from Gemini-1.5-Flash.
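The sketch below illustrates how such a CoT generation pass could look in code. The prompt wording, the JSON output fields, and the use of the public google-generativeai client are our own assumptions for illustration, not the authors' exact pipeline.

```python
"""Sketch of a CoT-style MCQ generation pass (prompts and JSON schema are illustrative)."""
import json
import os
import time

import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

# Hypothetical prompt stages mirroring the description above:
# narration -> structured extraction -> role-playing with an MCQ template.
PROMPT = """You are shown an egocentric aerial video recorded in a city.
Step 1 (narration): describe the flight and the scene chronologically.
Step 2 (structured extraction): list landmarks, motions, and their order.
Step 3 (role-play): acting as an exam writer, turn the extraction into one
multiple-choice question for the task "{task}".
Output only the final JSON object with keys "question", "choices"
(4 strings), and "answer" (the correct letter)."""


def generate_mcq(video_path: str, task: str) -> dict:
    """Upload one clip and ask Gemini for a single candidate MCQ for the given task."""
    video = genai.upload_file(video_path)
    while video.state.name == "PROCESSING":  # wait until the uploaded file is ready
        time.sleep(2)
        video = genai.get_file(video.name)
    response = model.generate_content([video, PROMPT.format(task=task)])
    text = response.text
    # Grab the outermost JSON object; a real pipeline needs more robust parsing.
    return json.loads(text[text.index("{"): text.rindex("}") + 1])


if __name__ == "__main__":
    print(generate_mcq("clip_0001.mp4", "Landmark Position"))
```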

Blind Filtering

The blind-filtering step presents each MCQ to multiple Video-LLMs without the video input and eliminates questions that can be answered from common sense alone, improving dataset quality.
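A minimal sketch of the blind-filtering idea; the MCQ dictionary layout, the `answer_blind` wrapper, and the `max_blind_correct` threshold are assumptions, since the exact rejection rule is not spelled out here.

```python
"""Blind filtering sketch: drop MCQs that several models answer correctly without video."""
from typing import Callable, Dict, List

MCQ = Dict[str, object]                     # {"question": str, "choices": [...], "answer": "B", ...}
BlindAnswerFn = Callable[[str, MCQ], str]   # (model_name, mcq) -> predicted letter, no frames given


def blind_filter(
    mcqs: List[MCQ],
    models: List[str],
    answer_blind: BlindAnswerFn,
    max_blind_correct: int = 1,
) -> List[MCQ]:
    """Keep an MCQ only if at most `max_blind_correct` of the models answer it
    correctly when shown the question and choices but no video."""
    kept = []
    for mcq in mcqs:
        blind_correct = sum(answer_blind(m, mcq) == mcq["answer"] for m in models)
        if blind_correct <= max_blind_correct:   # not solvable from common sense alone
            kept.append(mcq)
    return kept
```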

Human Refinement

Issues like ambiguous navigation targets, hallucinated elements, imprecise directions, and incorrect options in MCQs required extensive human refinement, totaling 800+ hours.

Dataset Statistics

The dataset includes 1,547 video clips with diverse resolutions and durations, covering varied UAV trajectories and scenarios, and over 5.2K MCQs for comprehensive evaluation.
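The panels in Figure 1 b)-d) are simple per-sample statistics; a small sketch of how they could be reproduced from a metadata table is shown below (the CSV file and its column names `frame_count`, `path_length_m`, and `question` are hypothetical).

```python
"""Sketch: histogram-style dataset statistics from a per-sample metadata table."""
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("urbanvideo_bench_metadata.csv")   # hypothetical export of the benchmark metadata

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(df["frame_count"], bins=40)                       # Figure 1 b): frames per video
axes[0].set_xlabel("frames per video")
axes[1].hist(df["path_length_m"], bins=40)                     # Figure 1 c): UAV path length
axes[1].set_xlabel("path length (m)")
axes[2].hist(df["question"].str.split().str.len(), bins=40)    # Figure 1 d): question word count
axes[2].set_xlabel("words per question")
plt.tight_layout()
plt.savefig("dataset_statistics.png")
```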


4. Experiments

4.1. Quantitative results

We initially evaluate the performance of 17 popular Video-LLMs on various tasks related to embodied cognition in motion. Subsequently, we conduct detailed analyses focusing on the models, tasks, and video data sources. Finally, we summarize and categorize the reasons for failures across different tasks.
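A minimal sketch of this evaluation protocol, assuming a hypothetical `model.answer(frames, prompt)` wrapper around any Video-LLM and a `load_frames` helper for uniform frame sampling (e.g. the 32-frame or fixed-fps settings listed in the results table below):

```python
"""Sketch of per-task MCQ accuracy evaluation for one Video-LLM (wrappers are hypothetical)."""
import re
from collections import defaultdict


def build_prompt(mcq: dict) -> str:
    choices = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", mcq["choices"]))
    return f"{mcq['question']}\n{choices}\nAnswer with a single letter."


def parse_choice(reply: str) -> str:
    """Take the first standalone A-D letter in the model's reply."""
    match = re.search(r"\b([A-D])\b", reply.upper())
    return match.group(1) if match else ""


def evaluate(model, mcqs, load_frames, num_frames: int = 32) -> dict:
    """Return per-task accuracy; `model.answer` and `load_frames` are assumed wrappers."""
    correct, total = defaultdict(int), defaultdict(int)
    for mcq in mcqs:
        frames = load_frames(mcq["video_path"], num_frames)    # e.g. uniform 32-frame sampling
        reply = model.answer(frames, build_prompt(mcq))
        total[mcq["task"]] += 1
        correct[mcq["task"]] += int(parse_choice(reply) == mcq["answer"])
    return {task: correct[task] / total[task] for task in total}
```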


We can draw the following conclusions:

  • Both proprietary and open-source models exhibit relatively poor embodied cognitive abilities when navigating urban open-ended spaces. The best-performing model, Qwen-VL-Max, achieves an average accuracy of only 45.5%. This underscores the value of the benchmark, highlighting that embodied cognition in urban three-dimensional spaces has not been adequately addressed.
  • Some open-source Video-LLMs outperform some proprietary models. In particular, models optimized for video data perform better than LMMs that focus on images.
  • Models with fewer parameters appear less stable: for two models with comparable average accuracy, the smaller open-source model tends to have a lower minimum per-task accuracy than the larger proprietary model.
Accuracy (%) of each model on the 16 tasks. The task columns are grouped, from left to right, into the Recall, Perception, Reasoning, and Navigation categories; Rank is computed within each model group.

| Method | Rank | Avg. | Trajectory Captioning | Sequence Recall | Object Recall | Scene Recall | Start/End Position | Proximity | Duration | Landmark Position | Goal Detection | Cognitive Map | Causal | Counterfactual | Association | Progress Evaluation | High-level Planning | Action Generation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Random | - | 19.7 | 18.5 | 17.0 | 20.8 | 13.5 | 21.8 | 37.8 | 35.6 | 19.7 | 18.0 | 21.9 | 18.2 | 25.0 | 18.3 | 21.8 | 15.9 | 16.4 |
| Proprietary Models (API) |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Gemini-1.5-Flash [1 fps] | 4 | 40.5 | 39.7 | 51.8 | 61.7 | 79.3 | 61.3 | 47.1 | 59.8 | 37.8 | 28.7 | 47.9 | 60.0 | 42.4 | 20.0 | 43.3 | 32.6 | 34.4 |
| Gemini-1.5-Pro [1 fps] | 3 | 42.5 | 58.6 | 61.6 | 65.0 | 72.1 | 66.2 | 66.4 | 63.6 | 37.4 | 33.8 | 46.0 | 63.6 | 46.2 | 23.0 | 38.8 | 43.8 | 31.9 |
| Gemini-2.0-Flash [1 fps] | 5 | 38.3 | 47.9 | 58.9 | 63.3 | 75.7 | 57.0 | 66.4 | 47.7 | 27.9 | 27.8 | 45.3 | 62.7 | 24.2 | 17.8 | 39.2 | 48.4 | 30.5 |
| GPT-4o-mini [32f] | 6 | 36.5 | 33.0 | 53.6 | 48.3 | 59.5 | 56.3 | 69.7 | 51.5 | 33.3 | 31.3 | 42.4 | 65.5 | 47.7 | 22.9 | 30.8 | 57.5 | 25.4 |
| GPT-4o [32f] | 2 | 43.6 | 47.6 | 58.9 | 65.0 | 67.6 | 61.3 | 63.0 | 47.7 | 36.8 | 42.4 | 52.8 | 66.4 | 44.7 | 45.8 | 34.2 | 67.8 | 33.8 |
| Qwen-VL-Max-latest [32f] | 1 | 45.5 | 44.9 | 70.5 | 64.2 | 75.7 | 73.9 | 78.2 | 43.9 | 44.8 | 44.7 | 61.1 | 77.3 | 49.2 | 23.9 | 38.8 | 70.0 | 29.6 |
| Open-source Models |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| LLaVA-NeXt-Video-7B-hf [32f] | 3 | 38.6 | 55.7 | 39.3 | 43.3 | 61.3 | 40.8 | 58.8 | 52.3 | 49.5 | 16.7 | 26.8 | 44.5 | 20.5 | 58.7 | 36.6 | 52.3 | 19.2 |
| Phi-3.5-vision-instruct [32f] | 2 | 38.7 | 67.0 | 57.1 | 57.5 | 64.9 | 45.1 | 48.7 | 45.5 | 49.2 | 17.0 | 52.1 | 51.8 | 34.8 | 13.9 | 33.2 | 59.7 | 15.6 |
| Kangaroo [64f] | 1 | 39.2 | 27.0 | 66.1 | 60.8 | 69.4 | 53.5 | 75.6 | 57.6 | 35.5 | 37.2 | 60.0 | 64.5 | 42.4 | 19.1 | 32.5 | 41.9 | 32.4 |
| Qwen2-VL-2B-Instruct [0.5 fps] | 5 | 31.9 | 29.9 | 54.5 | 30.8 | 57.7 | 24.6 | 69.7 | 47.7 | 22.0 | 22.1 | 64.2 | 46.4 | 35.6 | 13.5 | 28.8 | 44.2 | 27.3 |
| Qwen2-VL-7B-Instruct [0.25 fps] | 4 | 36.2 | 36.5 | 50.9 | 47.5 | 65.8 | 47.2 | 52.1 | 48.5 | 25.1 | 28.4 | 55.8 | 55.5 | 29.5 | 11.7 | 33.9 | 59.3 | 32.7 |
| InternVL2-2B [32f] | 11 | 27.6 | 19.2 | 29.5 | 37.5 | 55.9 | 22.5 | 57.1 | 37.9 | 19.3 | 24.6 | 39.2 | 33.6 | 45.5 | 33.5 | 29.2 | 37.6 | 20.9 |
| InternVL2-4B [32f] | 10 | 28.1 | 19.2 | 37.5 | 33.3 | 62.2 | 24.6 | 66.4 | 42.4 | 23.2 | 26.5 | 32.8 | 36.4 | 35.6 | 24.8 | 29.5 | 32.2 | 22.1 |
| InternVL2-8B [32f] | 9 | 28.1 | 23.4 | 23.2 | 35.0 | 52.3 | 22.5 | 58.0 | 44.7 | 23.1 | 27.4 | 28.3 | 33.6 | 45.5 | 27.0 | 31.5 | 35.7 | 21.4 |
| InternVL2-26B [32f] | 8 | 28.3 | 24.3 | 36.6 | 35.0 | 61.3 | 26.8 | 51.2 | 40.2 | 19.9 | 28.1 | 32.4 | 32.7 | 44.7 | 26.5 | 28.9 | 37.6 | 22.8 |
| InternVL2-40B [32f] | 7 | 28.4 | 22.2 | 19.6 | 30.8 | 54.1 | 21.1 | 61.3 | 50.0 | 23.2 | 26.5 | 34.7 | 27.3 | 41.7 | 25.7 | 32.4 | 34.9 | 22.3 |
| InternVL2-Llama3-76B [32f] | 6 | 28.9 | 19.5 | 38.4 | 37.5 | 54.1 | 18.3 | 65.5 | 48.5 | 22.9 | 28.1 | 33.6 | 30.9 | 43.2 | 27.4 | 31.3 | 34.5 | 23.2 |
| Fine-Tuning (Test set) |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| InternVL2-4B (before) [32f] | 3 | 28.3 | 21.3 | 45.2 | 31.8 | 63.0 | 20.4 | 66.7 | 43.8 | 27.1 | 27.9 | 28.5 | 34.9 | 39.6 | 24.1 | 27.4 | 29.7 | 23.0 |
| InternVL2-4B (after) [32f] | 2 | 31.5 | 25.5 | 38.1 | 34.1 | 60.9 | 20.4 | 66.7 | 37.5 | 22.1 | 38.8 | 33.1 | 32.6 | 50.0 | 31.4 | 28.1 | 39.9 | 28.9 |
| InternVL2-8B (before) [32f] | 4 | 26.5 | 24.5 | 19.0 | 34.1 | 50.0 | 22.4 | 55.6 | 37.5 | 22.8 | 24.5 | 26.2 | 25.6 | 54.2 | 22.6 | 24.3 | 37.0 | 22.1 |
| InternVL2-8B (after) [32f] | 1 | 31.7 | 25.5 | 35.7 | 34.1 | 60.9 | 18.4 | 66.7 | 39.6 | 23.8 | 37.4 | 31.5 | 34.9 | 50.0 | 32.8 | 27.7 | 39.1 | 29.4 |

4.2. Correlation of Cognitive Abilities

Pairwise correlations between tasks reveal insights into underlying cognitive abilities, with causal reasoning showing high correlation with most tasks, highlighting its foundational role.


Causal reasoning emerges as a key factor in a wide range of cognitive processes, suggesting its potential role in the emergence of embodied cognitive abilities in motion.

Recall tasks are strongly interrelated, emphasizing memory as a shared requirement and central component in cognitive processes.

Navigation tasks correlate strongly with Recall and Perception tasks, confirming the reliance of action and planning on memory and perceptual abilities.

Counterfactual and Association reasoning tasks show low correlations with others, suggesting they rely on distinct, independent cognitive processes.


These findings suggest that high-level reasoning tasks require targeted training, as they operate independently from general embodied cognitive abilities.
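A minimal sketch of the pairwise analysis behind these observations, treating each task as a vector of per-model accuracies. Two columns from the Section 4.1 table are shown; whether Pearson or Spearman correlation is used is not specified here, so the choice of `method` is an assumption.

```python
"""Sketch: pairwise correlation between tasks across models' per-task accuracies."""
import pandas as pd

# Rows = models, columns = tasks, values = accuracy (%).
# A real analysis would load all 16 task columns from the Section 4.1 table.
scores = pd.DataFrame(
    {
        "Causal":            [60.0, 63.6, 62.7, 65.5, 66.4, 77.3],
        "Action Generation": [34.4, 31.9, 30.5, 25.4, 33.8, 29.6],
    },
    index=["Gemini-1.5-Flash", "Gemini-1.5-Pro", "Gemini-2.0-Flash",
           "GPT-4o-mini", "GPT-4o", "Qwen-VL-Max-latest"],
)

corr = scores.corr(method="spearman")   # rank correlation is robust to scale differences
print(corr.round(2))
```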


4.3. Sim-to-Real

We evaluate Video-LLMs across the four cognitive abilities using the three data sources. To address the lack of real-world data in embodied research, we fine-tune models on simulator data from EmbodiedCity and AerialVLN and then test them on real-world data. Fine-tuning InternVL2-4B and InternVL2-8B with LoRA improves Sim-to-Real transfer, with mean accuracy gains of 3.2 and 5.2 percentage points, respectively.
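A minimal sketch of attaching LoRA adapters for such a Sim-to-Real fine-tune with the Hugging Face peft library; the rank, dropout, and `target_modules` choice are illustrative assumptions rather than the exact InternVL2 recipe.

```python
"""Sketch: LoRA adapter setup before Sim-to-Real fine-tuning (hyperparameters assumed)."""
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

# InternVL2 checkpoints ship custom modeling code, hence trust_remote_code=True;
# the actual InternVL2 training entry points differ from this generic sketch.
model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL2-4B", torch_dtype=torch.bfloat16, trust_remote_code=True
)

lora_config = LoraConfig(
    r=16,                         # adapter rank (assumed, not the paper's exact value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",  # attach adapters to all linear layers; module names vary per model
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable

# Training then runs on simulator MCQs (EmbodiedCity, AerialVLN), and the adapted
# model is evaluated on the real-world split, as described in this subsection.
```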

4.4. Error Analysis

Three common error types are identified in the reasoning process of Video-LLMs: errors in understanding urban elements and scenes, errors in understanding motion, and egocentric-thinking errors.


BibTeX

@misc{zhao2025urbanvideobench,
      title={UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces},
      author={Baining Zhao and Jianjie Fang and Zichao Dai and Ziyou Wang and Jirong Zha and Weichen Zhang and Chen Gao and Yue Wang and Jinqiang Cui and Xinlei Chen and Yong Li},
      year={2025},
      eprint={2503.06157},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.06157},
}