Aerial Vision-and-Language Navigation
with Hierarchical Semantic Planning and Global Memory
LLM-Powered Agent: Zero-shot navigation with hierarchical semantic planning for urban aerial VLN
3D Semantic Mapping: Open-vocabulary perception module for complex urban environment understanding
Hierarchical Planning: Multi-level decomposition from landmarks to motions with global memory
Superior Performance: State-of-the-art results on AirVLN benchmarks, with clear gains over learning-based and zero-shot LLM baselines
Aerial vision-and-language navigation (VLN) requires drones to interpret natural language instructions and navigate complex urban environments. This task emerges as a critical embodied AI challenge that bridges human-robot interaction, 3D spatial reasoning, and real-world deployment. The main challenges include:
1. Complex scene understanding in urban environments: Urban environments exhibit considerably higher object variety than indoor scenes, incorporating extensive infrastructural elements, architectural structures, and natural landscapes. Moreover, the semantic density of urban scenes varies sharply with altitude: semantics are dense near ground level but markedly sparse at higher altitudes.
2. Exponential complexity in long-horizon motion planning: Aerial VLN can be modeled as a Partially Observable Markov Decision Process (POMDP), in which the agent predicts the next action from its current state and environmental context. However, long-horizon navigation requires predicting longer action sequences, and the number of possible sequences grows exponentially (m^n for m candidate actions over n steps).
3. Absence of predefined navigation graphs: Unlike ground VLN that relies on pre-defined topological graphs, aerial VLN operates in continuous 3D space without predetermined waypoints, making navigation planning significantly more challenging in the exponentially expanding action space.
We propose CityNavAgent, an LLM-empowered agent that significantly reduces navigation complexity for urban aerial VLN through hierarchical semantic planning and global memory. The framework comprises three key modules working together to enable zero-shot aerial navigation.
To accurately understand complex semantics in urban environments, we leverage a powerful open-vocabulary captioning and grounding model to extract rich scene semantic features. Given a set of panoramic images at each timestep, an open-vocabulary image captioner (CAP), based on GPT-4V, generates descriptive object captions. These captions are then grounded in the visual input using GroundingDINO to yield bounding boxes for each identified object. A visual tokenizer (VT) further processes the bounding boxes, which are then passed through a segmentation model (SAM) to obtain fine-grained semantic masks. Considering the limitations of egocentric views in capturing true 3D spatial relationships, a geometric projector (GP) uses the RGB-D sensor's depth maps and the agent's pose to project segmented 2D pixels into a 3D metric space. This process constructs a local 3D semantic point cloud, mapping each object phrase to its corresponding 3D location. The resulting point cloud provides both semantic and spatial awareness, serving as a foundational input for downstream planning modules.
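The geometric projection step can be illustrated with a short sketch. The code below is a minimal, assumption-laden example: it presumes a pinhole camera model, a known intrinsic matrix, and a camera-to-world pose, and the function names and detection-dictionary format are illustrative rather than the paper's implementation.

```python
import numpy as np

def project_mask_to_world(mask, depth, K, T_world_cam):
    """Back-project depth pixels under a 2D segmentation mask into 3D world coordinates.

    mask:        (H, W) boolean mask from the segmentation model
    depth:       (H, W) metric depth map from the RGB-D sensor
    K:           (3, 3) camera intrinsic matrix
    T_world_cam: (4, 4) camera-to-world transform from the agent's pose
    returns:     (N, 3) world-frame points belonging to the masked object
    """
    v, u = np.nonzero(mask)                       # pixel coordinates of the object
    z = depth[v, u]
    valid = z > 0                                 # drop pixels with no depth return
    u, v, z = u[valid], v[valid], z[valid]

    # Pinhole back-projection into the camera frame.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)])    # (4, N) homogeneous points

    # Transform into the world frame using the agent's pose.
    return (T_world_cam @ pts_cam)[:3].T

def build_local_semantic_cloud(detections, depth, K, T_world_cam):
    """Map each object phrase to its 3D points, forming the local semantic point cloud."""
    return {phrase: project_mask_to_world(mask, depth, K, T_world_cam)
            for phrase, mask in detections.items()}
```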
HSPM decomposes the long-horizon navigation task into manageable sub-goals at different semantic levels, progressively reducing the action space complexity (a minimal sketch of one planning step follows the list):
1. Landmark-level Planning: Extracts a sequence of landmark phrases from free-form instructions using LLM prompt engineering, creating navigation sub-goals that guide the overall trajectory.
2. Object-level Planning: Leverages LLM commonsense reasoning to identify the most relevant objects or regions of interest (OROI) in the current view, i.e., those most likely to lead toward sub-goals that are not yet visible. For example, reasoning that a road leads to a traffic light.
3. Motion-level Planning: Translates high-level planning outputs into executable waypoints and action sequences. When the agent reaches locations in the memory graph, it directly uses graph search for efficient navigation.
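A minimal sketch of one planning step under these three levels is given below. The prompt wording, the llm() interface, and the data structures (a semantic_cloud mapping object phrases to 3D points, a networkx memory_graph whose nodes store positions and observations) are assumptions made for illustration, not the paper's exact prompts or code.

```python
# Hypothetical sketch of one HSPM planning step; prompts and helpers are illustrative.
LANDMARK_PROMPT = (
    "Extract the ordered list of landmarks the drone must pass from this navigation "
    "instruction. Return one landmark phrase per line.\nInstruction: {instruction}"
)
OBJECT_PROMPT = (
    "The drone is looking for the landmark '{landmark}', which is not yet visible. "
    "Given the visible objects {objects}, which one is most likely to lead toward "
    "the landmark? Answer with exactly one phrase from the list."
)

def hierarchical_plan_step(llm, instruction, semantic_cloud, memory_graph):
    # 1. Landmark-level planning: decompose the instruction into ordered sub-goals.
    lines = llm(LANDMARK_PROMPT.format(instruction=instruction)).splitlines()
    landmarks = [l.strip() for l in lines if l.strip()]
    next_landmark = landmarks[0]

    # 2. If the landmark is already visible, its centroid becomes the next waypoint.
    if next_landmark in semantic_cloud:
        return semantic_cloud[next_landmark].mean(axis=0)

    # 3. Object-level planning: pick the most relevant visible object or region (OROI).
    oroi = llm(OBJECT_PROMPT.format(landmark=next_landmark,
                                    objects=list(semantic_cloud.keys()))).strip()

    # 4. Motion-level planning: reuse a remembered waypoint whose observation matches
    #    the OROI, otherwise fly toward the OROI's centroid in the local semantic cloud.
    for node, data in memory_graph.nodes(data=True):
        if oroi in data.get("observation", ""):
            return data["position"]
    target = semantic_cloud.get(oroi, next(iter(semantic_cloud.values())))  # fallback if unmatched
    return target.mean(axis=0)
```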
The global memory module stores historical trajectories in a 3D topological graph, enabling efficient navigation to previously visited targets. Each node contains waypoint coordinates and visual observations, while edges are weighted by distance. The memory graph is progressively updated by merging successful trajectory graphs and supports efficient subgraph extraction for navigation planning.
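The memory graph itself can be sketched with networkx as follows. The coordinate-rounding rule used here to merge nearby waypoints is an illustrative assumption rather than the paper's exact merging criterion.

```python
import networkx as nx
import numpy as np

def add_trajectory(memory, waypoints, observations):
    """Merge a successful trajectory into the global memory graph.

    memory:       nx.Graph whose nodes carry waypoint coordinates and observations
    waypoints:    list of (x, y, z) coordinates visited along the trajectory
    observations: per-waypoint visual observations (e.g., panoramic captions)
    """
    prev = None
    for pos, obs in zip(waypoints, observations):
        node = tuple(np.round(pos, 1))        # coarse rounding merges revisited waypoints
        if node not in memory:
            memory.add_node(node, position=np.asarray(pos, dtype=float), observation=obs)
        if prev is not None:
            dist = float(np.linalg.norm(memory.nodes[node]["position"]
                                        - memory.nodes[prev]["position"]))
            memory.add_edge(prev, node, weight=dist)   # edges weighted by metric distance
        prev = node
    return memory
```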
Memory Graph Search: When the agent reaches a waypoint in the memory graph, it uses a modified Dijkstra algorithm to find optimal paths that maximize the probability of traversing remaining landmarks in the correct order.
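As a simplified stand-in for this search, the sketch below chains standard Dijkstra shortest paths through memory-graph nodes matched to the remaining landmarks in instruction order; the paper's modified variant instead scores candidate paths by the probability of covering those landmarks correctly.

```python
import networkx as nx

def route_through_landmarks(memory, start, landmark_nodes):
    """Return a node sequence from start through the matched landmark nodes, in order.

    memory:         distance-weighted memory graph (see add_trajectory above)
    start:          node id of the agent's current waypoint
    landmark_nodes: memory-graph nodes matched to the remaining landmarks, in instruction order
    """
    path = [start]
    for goal in landmark_nodes:
        # Shortest leg to the next landmark over distance-weighted edges.
        leg = nx.dijkstra_path(memory, path[-1], goal, weight="weight")
        path.extend(leg[1:])
    return path
```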
We evaluate CityNavAgent on both AirVLN-S and AirVLN-Enriched datasets, comparing against statistical-based methods (random sampling, action sampling), learning-based methods (Seq2Seq, CMA, LingUNet), and zero-shot LLM-based methods (NavGPT, MapGPT, VELMA, LM-Nav, STMR). We use standard VLN metrics: Success Rate (SR), Oracle Success Rate (OSR), Navigation Error (NE), and path-following metrics (SDTW, SPL).
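These per-episode metrics follow their standard VLN definitions and can be computed directly from the flown trajectory. The sketch below assumes a 20 m success threshold and simple Euclidean geometry; the actual threshold and metric definitions should be taken from the benchmark's official evaluation code.

```python
import numpy as np

def vln_metrics(trajectory, goal, shortest_len, success_thresh=20.0):
    """Per-episode VLN metrics (the threshold in meters is an assumption, not the official value).

    trajectory:   (T, 3) array of agent positions
    goal:         (3,) goal position
    shortest_len: length of the ground-truth shortest path to the goal
    """
    traj = np.asarray(trajectory, dtype=float)
    dists = np.linalg.norm(traj - np.asarray(goal, dtype=float), axis=1)

    ne = float(dists[-1])                        # Navigation Error: final distance to goal
    sr = float(ne <= success_thresh)             # Success Rate (0/1 per episode)
    osr = float(dists.min() <= success_thresh)   # Oracle SR: closest point along the path
    path_len = float(np.linalg.norm(np.diff(traj, axis=0), axis=1).sum())
    spl = sr * shortest_len / max(path_len, shortest_len)   # Success weighted by Path Length
    return {"NE": ne, "SR": sr, "OSR": osr, "SPL": spl}
```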
CityNavAgent significantly outperforms all baseline methods across key metrics. On AirVLN-S validation unseen split, CityNavAgent achieves 11.7% SR compared to the best baseline LM-Nav's 10.4% SR. More notably, on the fine-grained AirVLN-Enriched dataset, CityNavAgent achieves 28.3% SR, substantially outperforming LM-Nav's 23.6% SR.
1. Hierarchical planning effectiveness: The HSPM successfully decomposes complex long-range navigation tasks into manageable sub-tasks, enabling better instruction following and reduced planning difficulty.
2. Memory module importance: Ablation studies show that removing the global memory module results in 16.6% and 14.4% decreases in SR and SPL respectively, demonstrating its critical role in preventing dead ends and blind exploration.
3. LLM reasoning quality: Using GPT-4V instead of GPT-3.5 for object-level planning yields 5.0% and 7.4% improvements in SR and SPL, attributed to lower hallucination rates and stronger reasoning capabilities.
We conduct comprehensive ablation studies on three key aspects of our navigation agent:
Effect of semantic map-based exploration: To evaluate the effectiveness of semantic map-based waypoint prediction, we substitute this module with a random walk strategy. As shown in the ablation table, the agent without the semantic map suffers a 4.7% and 4.3% drop in SR and SPL, respectively, and a 25.6% increase in NE. This result shows that the semantic map provides structured environmental information that supports the LLM's commonsense reasoning, steering the agent toward regions or objects more relevant to the navigation task and thus improving both the accuracy and efficiency of navigation.
Effect of memory graph-based exploitation: We omit the global memory module and replace the graph search algorithm with a random walk. As presented in the first row of the ablation table, the absence of the memory graph results in a 16.6% and 14.4% decrease in SR and SPL, respectively, and a 116.7% increase in NE. Compared with the semantic map, the memory graph has a more pronounced impact on navigation performance, demonstrating its effectiveness in preventing the agent from falling into dead ends or blind exploration, particularly in long-distance outdoor scenarios, and thereby ensuring navigation stability.
Effect of different LLMs: We assess various LLMs for commonsense reasoning in object-level planning. The agent with LLaVA-7B performs the worst due to perception hallucination and unstructured output. While GPT-3.5 yields competitive results, GPT-4V further improves performance, with 5.0% and 7.4% increases in SR and SPL, respectively. This is attributed to GPT-4V’s lower hallucination rate and stronger reasoning abilities, generating more contextually appropriate responses from the semantic map and navigation instruction.
To evaluate the CWP method for waypoint prediction, we compare predicted waypoints with target waypoints in both indoor and outdoor settings, using metrics such as |Δ|, d_rel, Chamfer distance (d_C), and Hausdorff distance (d_H). CWP performs well indoors (d_C = 1.04, d_H = 2.01), but in outdoor scenarios these values degrade to d_C = 6.4 and d_H = 5.15. This performance gap, also illustrated in Figure 1(a), stems from the scale and dimensional differences between indoor and outdoor environments: CWP operates in 2D space and lacks the generalizability needed for open urban navigation.
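For reference, the set-to-set distances used here can be computed as below. This is a minimal sketch assuming the symmetric, mean-based Chamfer variant; the exact formulation used in the evaluation may differ.

```python
import numpy as np

def chamfer_and_hausdorff(pred, target):
    """Chamfer (d_C) and Hausdorff (d_H) distances between predicted and target waypoint sets.

    pred, target: (N, D) and (M, D) arrays of waypoint coordinates
    """
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    # Pairwise Euclidean distances between the two waypoint sets.
    d = np.linalg.norm(pred[:, None, :] - target[None, :, :], axis=-1)
    d_c = 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())   # mean nearest-neighbor distance, symmetrized
    d_h = max(d.min(axis=1).max(), d.min(axis=0).max())         # worst-case nearest-neighbor distance
    return d_c, d_h
```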
We categorize failure cases across datasets into three types; visual examples of these failures are provided in Figure 2.
@misc{zhang2025citynavagentaerialvisionandlanguagenavigation,
  title={CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory},
  author={Weichen Zhang and Chen Gao and Shiquan Yu and Ruiying Peng and Baining Zhao and Qian Zhang and Jinqiang Cui and Xinlei Chen and Yong Li},
  year={2025},
  eprint={2505.05622},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2505.05622},
}