How to Enable LLM with 3D Capacity? A Survey of Spatial Reasoning in LLM

  • Structured Taxonomy: We provide a novel perspective on LLM applications in 3D-related tasks through a structured taxonomy categorizing research into three primary groups.
  • Comprehensive Review: Building on the proposed taxonomy, we systematically review current research progress on LLMs for spatial reasoning tasks.
  • Future Directions: We highlight remaining limitations of existing works and suggest potential directions for future research.

Abstract

3D spatial understanding is essential in real-world applications such as robotics, autonomous vehicles, virtual reality, and medical imaging. Recently, Large Language Models (LLMs), having demonstrated remarkable success across various domains, have been leveraged to enhance 3D understanding tasks, showing potential to surpass traditional computer vision methods. In this survey, we present a comprehensive review of methods integrating LLMs with 3D spatial understanding. We propose a taxonomy that categorizes existing methods into three branches: image-based methods deriving 3D understanding from 2D visual data, point cloud-based methods working directly with 3D representations, and hybrid modality-based methods combining multiple data streams. We systematically review representative methods along these categories, covering data representations, architectural modifications, and training strategies that bridge textual and 3D modalities. Finally, we discuss current limitations, including dataset scarcity and computational challenges, while highlighting promising research directions in spatial perception, multi-modal fusion, and real-world applications.

Overview

Large Language Models can acquire 3D spatial reasoning capabilities from a variety of input sources, including multi-view images, RGB-D images, point clouds, and hybrid modalities, enabling them to process and understand three-dimensional information.

Taxonomy of 3D-LLM Methods

Our proposed taxonomy classifies 3D-LLM research into three main categories: Image-based spatial reasoning, Point cloud-based spatial reasoning, and Hybrid modality-based spatial reasoning.

Methodology Overview

The methodology framework illustrates how LLMs are integrated with 3D spatial understanding across the different input modalities and the alignment strategies used to bridge them.

Image-based Spatial Reasoning

Image-based spatial reasoning methods are categorized by input modality: multi-view images, monocular images, RGB-D images, and 3D medical images. Each modality offers distinct advantages for enhancing 3D understanding in Large Language Models.
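
As a concrete illustration, the minimal PyTorch sketch below shows one common way multi-view images can be made consumable by an LLM: per-view features from a frozen 2D encoder are tagged with a learnable view embedding, projected into the LLM's token space, and flattened into a single prefix sequence. The class name, dimensions, and fusion details are illustrative assumptions, not the design of any particular surveyed model.

import torch
import torch.nn as nn

class MultiViewProjector(nn.Module):
    """Toy sketch: turn per-view image features into LLM-compatible tokens.

    All names and dimensions here are illustrative assumptions.
    """
    def __init__(self, vis_dim=768, llm_dim=4096, num_views=4):
        super().__init__()
        # Learnable embedding that tags each view so the LLM can tell views apart.
        self.view_embed = nn.Embedding(num_views, vis_dim)
        # Linear projector mapping vision features into the LLM embedding space.
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, view_feats):
        # view_feats: (batch, num_views, num_patches, vis_dim) from a frozen 2D encoder.
        b, v, p, _ = view_feats.shape
        view_ids = torch.arange(v, device=view_feats.device)
        feats = view_feats + self.view_embed(view_ids)[None, :, None, :]
        tokens = self.proj(feats)              # (b, v, p, llm_dim)
        return tokens.reshape(b, v * p, -1)    # flatten views into one token sequence

# Example: 2 scenes, 4 views, 16 patches per view, ViT-Base-sized features.
tokens = MultiViewProjector()(torch.randn(2, 4, 16, 768))
print(tokens.shape)  # torch.Size([2, 64, 4096])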

Point Cloud-based Spatial Reasoning

Point cloud-based spatial reasoning integrates point cloud data with language models through three main alignment strategies: Direct Alignment, Step-by-step Alignment, and Task-specific Alignment.
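
The sketch below illustrates the Direct Alignment idea in its simplest form: features from a frozen point cloud encoder are mapped into the LLM token space with a small projection network and prepended to the text embeddings. Module names, layer choices, and dimensions are assumptions made for illustration, not a specific model's recipe.

import torch
import torch.nn as nn

class DirectPointAlignment(nn.Module):
    """Minimal sketch of Direct Alignment: point features go straight into the
    LLM embedding space through a single projection network (illustrative only)."""
    def __init__(self, point_dim=384, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(point_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, point_feats, text_embeds):
        # point_feats: (batch, num_groups, point_dim) from a frozen point encoder.
        # text_embeds: (batch, seq_len, llm_dim) from the LLM's own embedding table.
        point_tokens = self.proj(point_feats)
        # Prepend the projected point tokens so the LLM attends to them as a prefix.
        return torch.cat([point_tokens, text_embeds], dim=1)

fused = DirectPointAlignment()(torch.randn(1, 128, 384), torch.randn(1, 32, 4096))
print(fused.shape)  # torch.Size([1, 160, 4096])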

Hybrid Modality-based Spatial Reasoning

Hybrid modality-based spatial reasoning integrates point clouds, images, and LLMs through Tightly Coupled and Loosely Coupled approaches, which trade off depth of cross-modal integration against modularity.
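
The contrast can be made concrete with the toy sketch below: in the loosely coupled variant, each modality is encoded and projected independently and the token sequences are simply concatenated, while in the tightly coupled variant, point tokens attend to image tokens via cross-attention before projection, so geometry and appearance mix earlier. Class names, dimensions, and the specific fusion operator are assumptions for illustration only.

import torch
import torch.nn as nn

class LooselyCoupledFusion(nn.Module):
    """Loose coupling (illustrative): modalities stay separate until their
    projected tokens are concatenated for the LLM."""
    def __init__(self, img_dim=768, pc_dim=384, llm_dim=4096):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, llm_dim)
        self.pc_proj = nn.Linear(pc_dim, llm_dim)

    def forward(self, img_tokens, pc_tokens):
        return torch.cat([self.img_proj(img_tokens), self.pc_proj(pc_tokens)], dim=1)

class TightlyCoupledFusion(nn.Module):
    """Tight coupling (illustrative): point tokens cross-attend to image tokens
    before projection, mixing geometric and visual cues early."""
    def __init__(self, img_dim=768, pc_dim=384, llm_dim=4096, heads=8):
        super().__init__()
        self.img_to_pc = nn.Linear(img_dim, pc_dim)
        self.cross_attn = nn.MultiheadAttention(pc_dim, heads, batch_first=True)
        self.proj = nn.Linear(pc_dim, llm_dim)

    def forward(self, img_tokens, pc_tokens):
        img_kv = self.img_to_pc(img_tokens)
        attended, _ = self.cross_attn(query=pc_tokens, key=img_kv, value=img_kv)
        return self.proj(pc_tokens + attended)

img, pc = torch.randn(1, 64, 768), torch.randn(1, 128, 384)
print(LooselyCoupledFusion()(img, pc).shape)   # torch.Size([1, 192, 4096])
print(TightlyCoupledFusion()(img, pc).shape)   # torch.Size([1, 128, 4096])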

Summary of Models

This table summarizes representative models across different categories of our taxonomy, highlighting their input modalities, key features, and applications in 3D spatial reasoning tasks.

Challenges and Future Directions

Current Challenges

  • Weak Spatial Reasoning: Limited acuity in 3D spatial understanding and fine-grained relationships, struggling with front/back distinctions and occluded object localization.
  • Data Scarcity: Lack of high-quality 3D-text paired datasets compared to abundant 2D resources, hindering robust model training.
  • Multimodal Integration: Challenges in fusing 3D data with other modalities due to structural differences and potential information loss.
  • Complex Task Definition: Need for frameworks supporting nuanced language-context inference in dynamic environments.

Future Directions

  • Enhanced 3D Perception: Development of richer 3D-text datasets and improved model architectures for better geometric relationship encoding.
  • Multi-Modal Fusion: Tighter integration through unified latent spaces and attention mechanisms to preserve geometric and semantic details.
  • Cross-Scene Generalization: Open-vocabulary 3D understanding with large-scale pretraining and transfer learning paradigms.
  • Autonomous Systems: Applications in robotics, medical imaging, architectural design, and interactive education with environmental constraints.

BibTeX

@article{zha2025llm3d,
  title={How to Enable LLM with 3D Capacity? A Survey of Spatial Reasoning in LLM},
  author={Jirong Zha and Yuxuan Fan and Xiao Yang and Chen Gao and Xinlei Chen},
  journal={IJCAI},
  year={2025}
}