Collaborative Framework for Activating Embodied Spatial
Reasoning in Foundation Models via Reinforcement Learning
Collaborative Framework: Combining large-scale VLMs for perception and small-scale LMs for reasoning
Novel Reward System: Think-answer logical consistency for enhanced slow-thinking capabilities
SOTA Performance: Comparable to o1/Gemini-2.5-Pro in embodied spatial reasoning tasks
Emergent Thinking: Exhibits systematic analysis and contextual integration patterns
We focus on embodied spatial reasoning tasks, which require models to understand and reason about spatial relationships from sequential visual observations. We use two main datasets: VSI-Bench (indoor scenarios) and UrbanVideo-Bench (outdoor scenarios). These tasks present the following challenges:
1. Integration of Perception and Reasoning: Reasoning is built upon perception. For the studied problem, continuous visual observations place high demands on the perception stage, and reasoning cannot succeed on top of faulty perception or hallucinations. When the videos are already hard to perceive, reasoning becomes even harder.
2. Complex Spatio-temporal Relationships: Video data naturally involves complex spatio-temporal relationships, requiring the discovery of object associations across frames and the extraction of semantics relevant to the reasoning task. For instance, to navigate to a destination outside the current field of view, the agent must infer its location from historical visual observations, build a mental map of the environment, develop a high-level plan to determine the direction, and finally decide on specific actions to execute.
3. Distinct Characteristics of Embodied Observations: First, egocentric videos focus on understanding the relationship between the observer and the surrounding environment, often from a constrained first-person perspective. Second, embodied continuous visual observations are generated over time, indicating that embodied perception should rely on sequential inputs rather than aggregating all visual observations for a single input after a prolonged period. Finally, embodied visual observations also exhibit spatial continuity, meaning there is significant redundancy and repetition between frames.
Question: Navigation Instruction given at initial position: [Observe around the square area, then fly towards the highway, then turn left and land on the roof of the building on the left]. You move according to a series of movement instructions. What are you doing now?
Question: What will be the first-time appearance order of the following categories in the video: table, backpack, trash bin, lamp?
Our proposed Embodied-R is a collaborative framework combining large-scale Vision-Language Models (VLMs) for perception and small-scale Language Models (LMs) for reasoning. By decoupling perception and reasoning, we can leverage the perceptual capabilities of large-scale VLMs while training resource-efficient small-scale LMs to activate embodied reasoning through reinforcement learning.
Key-Frame Extractor: As the agent moves continuously in space, high sampling frequencies result in significant overlap between consecutive frames. On one hand, the VLM relies on changes in static objects within the environment across frames to infer the agent's pose variation. On the other hand, excessive overlap between frames leads to increased inference costs for both the VLM and LLM. To address this, we designed a key-frame extractor tailored to the characteristics of embodied videos, selecting key frames that retain overlap while ensuring sufficient information gain between them.
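To make the selection criterion concrete, below is a minimal sketch of a greedy key-frame extractor driven by a frame-similarity threshold. The similarity measure and the threshold value are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def extract_key_frames(frames, sim_threshold=0.90):
    """Greedy key-frame selection for embodied video (illustrative sketch).

    Keep a frame once its similarity to the last kept frame drops below
    `sim_threshold`: consecutive key frames then differ enough to carry new
    information, while a sufficiently high threshold keeps them overlapping
    enough for the VLM to infer the agent's pose change between them.
    """
    def similarity(a, b):
        # Zero-mean normalized correlation of grayscale intensities.
        a = a.mean(axis=-1).ravel().astype(np.float64)
        b = b.mean(axis=-1).ravel().astype(np.float64)
        a -= a.mean()
        b -= b.mean()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    key_frames = [frames[0]]
    for frame in frames[1:]:
        if similarity(key_frames[-1], frame) < sim_threshold:
            key_frames.append(frame)
    return key_frames
```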
Embodied Semantic Representation: Since perceptual capability is positively correlated with model size, we employ a large-scale VLM to process visual inputs to ensure high-quality perception. The differential information of each key frame is described sequentially. This approach provides two key benefits: 1) The sequential and dynamic processing aligns better with the characteristics of embodied scenarios, where visual observations are continuously generated over time. 2) It facilitates the handling of long videos by avoiding the input token limitations that arise when all frames are processed simultaneously by the VLM.
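The sketch below shows how such a sequential, differential description could be produced. `vlm_describe` is a hypothetical wrapper around the large-scale VLM (e.g. Qwen2.5-VL-72B-Instruct); it is not an actual API from the paper or from Qwen.

```python
def describe_video(key_frames, vlm_describe):
    """Build an embodied semantic representation frame by frame (sketch).

    `vlm_describe(prev_caption, frame)` stands in for a VLM call that returns
    a short text describing what has CHANGED relative to the previous key
    frame, rather than re-describing the whole scene each time.
    """
    captions = []
    prev_caption = "This is the first observation; describe the scene briefly."
    for i, frame in enumerate(key_frames):
        caption = vlm_describe(prev_caption, frame)
        captions.append(f"[Frame {i}] {caption}")
        prev_caption = caption
    # The joined captions form the textual perception handed to the small LM.
    return "\n".join(captions)
```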
Given the semantic perception, the answer can be inferred by an LM, which can be trained with limited computational resources to boost spatial reasoning ability.
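For illustration, one possible prompt template handed to the small LM is sketched below; the exact wording is not reproduced from the paper, only the think-answer structure that the reward design (next subsection) encourages.

```python
def build_reasoning_prompt(perception_text, question, options):
    """Compose the input to the small LM (illustrative template).

    The LM is expected to reason inside <think></think> tags and give its
    final choice inside <answer></answer> tags.
    """
    option_str = "\n".join(f"{k}. {v}" for k, v in options.items())
    return (
        "You are an embodied agent. Below is a sequential description of your "
        "visual observations while moving through the environment.\n\n"
        f"{perception_text}\n\n"
        f"Question: {question}\n{option_str}\n\n"
        "First reason step by step inside <think></think>, then output only "
        "the option letter inside <answer></answer>."
    )
```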
Group Relative Policy Optimization (GRPO): We adopt a computationally efficient RL training strategy, GRPO. For a given query and its semantic annotation, GRPO samples a group of outputs from the old policy, then updates the policy model by optimizing the GRPO objective, using group-normalized rewards as advantages so that no separate critic network is required.
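As a concrete reference, here is a minimal sketch of GRPO's group-relative advantage computation, the part that replaces a learned critic; the clipping and KL terms of the full objective are omitted.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO advantages for one query (sketch).

    GRPO samples a group of G responses for the same query, scores each with
    the reward functions, and uses the group-normalized reward as the
    advantage of each response.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: rewards of 4 sampled responses for one query
print(group_relative_advantages([1.0, 0.2, 0.2, 0.0]))
```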
Reward Modeling: We propose three types of rewards: format reward, accuracy reward, and logical consistency reward. These are designed to respectively guide the model to learn the "think-answer" reasoning pattern, accurate embodied spatial reasoning, and logical consistency between reasoning and the answer. Notably, we introduce a novel logical consistency reward to address reward hacking behaviors observed during training, ensuring alignment between the reasoning process and the final answer.
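A hedged sketch of how the three rewards could be combined is given below; the regular expressions, the consistency check, and the weight values are placeholders for illustration rather than the paper's exact implementation.

```python
import re

def format_reward(response):
    """1.0 if the response follows <think>...</think><answer>...</answer>."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, response.strip(), re.DOTALL) else 0.0

def accuracy_reward(response, ground_truth):
    """1.0 if the extracted option letter matches the ground truth."""
    m = re.search(r"<answer>\s*([A-D])", response)
    return 1.0 if m and m.group(1) == ground_truth else 0.0

def consistency_reward(is_consistent):
    """1.0 only when the reasoning in <think> actually supports the answer;
    `is_consistent` stands in for the paper's consistency check."""
    return 1.0 if is_consistent else 0.0

def total_reward(response, ground_truth, is_consistent, w=(0.2, 0.6, 0.2)):
    # Weighted sum; the weights change across the three training stages.
    return (w[0] * format_reward(response)
            + w[1] * accuracy_reward(response, ground_truth)
            + w[2] * consistency_reward(is_consistent))
```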
Three-Stage Training Schedule: We design a three-stage training schedule to achieve a smooth improvement in training performance. The primary distinction between stages lies in the different weight ratios assigned to the three types of rewards: Stage 1 (epochs 1-2) focuses on format rewards; Stage 2 (epochs 3-4) shifts to improving the accuracy of model responses; Stage 3 (epochs 5-12) aims to enhance accuracy while simultaneously improving the quality of the "thinking" process.
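The stage-wise emphasis can be expressed as a small configuration; the specific weight values below are hypothetical placeholders, since the paper describes which reward each stage emphasizes but the exact numbers are not reproduced here.

```python
# Hypothetical per-stage weights (format, accuracy, consistency).
REWARD_SCHEDULE = {
    "stage1": {"epochs": range(1, 3),  "weights": (1.0, 0.0, 0.0)},  # learn the think-answer format
    "stage2": {"epochs": range(3, 5),  "weights": (0.2, 0.8, 0.0)},  # shift toward answer accuracy
    "stage3": {"epochs": range(5, 13), "weights": (0.1, 0.6, 0.3)},  # accuracy plus thinking quality
}

def weights_for_epoch(epoch):
    for stage in REWARD_SCHEDULE.values():
        if epoch in stage["epochs"]:
            return stage["weights"]
    raise ValueError(f"epoch {epoch} outside the 12-epoch schedule")
```

These weights could then be passed as the `w` argument of the `total_reward` sketch above.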
We primarily focus on spatial reasoning problems during motion within three-dimensional physical space. We selected two embodied video datasets as the main training and testing sets: VSI-Bench (indoor first-person navigation data) and UrbanVideo-Bench (outdoor embodied data captured by drones navigating through aerial spaces). These datasets provide diversity in scenarios by incorporating both indoor and outdoor video data. We specifically selected four distinct types of tasks from each dataset, characterized by long spatial reasoning chains and low accuracy for existing models.
In our experiments, Embodied-R adopts the following base models: VLM: Qwen2.5-VL-72B-Instruct; LM: Qwen2.5-3B-Instruct. We conducted five repeated experiments: the dataset was randomly divided into five equal parts and 5-fold cross-validation was adopted. The final testing results are averaged across the five runs.
Experimental results demonstrate that our proposed reasoning-enhanced model outperforms proprietary models by more than 10% and SFT-trained models by more than 5%.
Beyond the quantitative results, we examine whether the spatial reasoning exhibited in Embodied-R's outputs is improved. After RL training, Embodied-R demonstrates the following human-like reasoning patterns:
1. Spatial Relationship Reasoning: Accurately inferring the relative spatial relationship between itself and the surrounding environment.
2. Systematic Analysis: Breaking down problems into components, presenting answers with a "part-to-whole" structure, and maintaining clear logical organization.
3. Contextual Integration: Integrating semantic information across different frames to perform comprehensive analysis.
4. Think-Answer Format: Strictly adhering to a structured process of reasoning before outputting the final answer.
Ablation of Key-Frame Extractor: The key-frame extractor reduces inference and training time by retaining essential frames and removing redundant ones while maintaining perceptual quality. With negligible differences in accuracy, training time is reduced by 8.7% and single-inference time by approximately one-third.
Ablation of Collaboration: The collaborative framework enables improved reasoning capabilities under limited training compute: the large-scale pretrained VLM is kept training-free, so only the small-scale LM needs to be trained to achieve enhanced reasoning performance. With identical key-frame inputs and the same VLM, Qwen2.5-VL-72B-Instruct, the overall accuracy of collaborative inference is 1.5 times that of the standalone VLM.
Ablation of RL Training: RL is central to the LM training in this paper. Without RL training, directly applying the original LM-3B model for reasoning leads to poor performance, as the LM has limited exposure to embodied spatial reasoning data during pretraining. After RL training, the LM achieves significant improvements, with a 27.9% increase on the UrbanVideo-Bench and a 20.6% increase on the VSI-Bench benchmarks.
The GRPO training process involves tracking the validation set's accuracy reward, format reward, ratio of the logical consistency reward to the accuracy reward, and the response length. Notably, existing pure-text reproductions of DeepSeek-R1-Zero identify inference length and the "aha moment" as key indicators of emergent reasoning capabilities. However, such phenomena are rarely observed in other multimodal reasoning tasks, such as image-based reasoning.
This leads us to hypothesize that response length is strongly influenced by the nature of the question itself. For instance, mathematical problems often require multi-step calculations, where increased reasoning length tends to correlate positively with reasoning ability. In contrast, for multimodal reasoning tasks like embodied spatial reasoning, the LM training process converges toward an optimal range of output lengths, and concise reasoning patterns may actually facilitate embodied spatial reasoning. This highlights the versatility of RL-based post-training, which can benefit a wide range of reasoning tasks.
We previously attempted direct RL training on the Qwen2.5-VL-3B-Instruct model. Under similar training parameters and time, the performance of the VLM was notably inferior to that of the LM. Upon convergence, the VLM achieved an accuracy of 43.8% on the test set, significantly lower than the LM. The limited perceptual capability of the small VLM restricts its potential for reasoning improvements. Therefore, under resource-constrained conditions, collaborative inference that integrates models of different scales presents a promising solution.
According to DeepSeek-R1-Zero, accuracy and format rewards appear to be sufficient to guide the model toward correct reasoning. However, during training on our problem, we observed instances of reward hacking, where the model optimizes for the correct answer while producing a reasoning process that is inconsistent with that answer. We aim to ensure alignment between the model's reasoning process and its answer, both to enhance generalization and to improve the interpretability of the reasoning process.
We employ GPT-4o to evaluate the proportion of logically consistent outputs on the test set before and after incorporating a logical consistency reward. This proportion increased from 46.01% to 99.43% after the reward was added, demonstrating the value of this approach in addressing embodied spatial multiple-choice reasoning tasks. Moreover, this reward mechanism could potentially be extended to other reasoning tasks prone to answer accuracy hacking during training.
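A minimal sketch of such an LLM-as-judge consistency check, using the OpenAI Python client, is shown below; the judging prompt is an assumption for illustration, not the one used in the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "Below is a model's reasoning (inside <think>) and its final answer "
    "(inside <answer>). Reply with exactly 'CONSISTENT' if the reasoning "
    "logically supports the chosen answer, otherwise 'INCONSISTENT'.\n\n{output}"
)

def is_logically_consistent(model_output: str) -> bool:
    """Judge think/answer consistency with GPT-4o (sketch)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(output=model_output)}],
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip().startswith("CONSISTENT")
```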
For small-scale LMs, we aim to explore how their generalization compares when trained with SFT versus RL. To evaluate this, we introduced two OOD datasets: EgoSchema and the egocentric task in MVBench. These two OOD datasets differ significantly from the training set in both task content and scene characteristics.
RL-trained models demonstrate generalization ability across both datasets. On the EgoSchema dataset, the RL-trained language model under the Embodied-R framework even achieves performance comparable to the state-of-the-art multimodal reasoning model, Gemini-2.5-Pro. SFT-trained models showed improvement on EgoSchema but a decline on MVBench. This suggests that slow reasoning, as employed in RL models, could be a promising approach to improve the generalization capabilities even for small-scale models.
@misc{zhao2025embodiedr,
  title={Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning},
  author={Baining Zhao and Ziyou Wang and Jianjie Fang and Chen Gao and Fanhang Man and Jinqiang Cui and Xin Wang and Xinlei Chen and Yong Li and Wenwu Zhu},
  year={2025},
  eprint={2504.12680},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2504.12680},
}