- Weak Spatial Reasoning: Limited acuity in 3D spatial understanding and fine-grained relationships, struggling with front/back distinctions and occluded object localization.
- Data Scarcity: Lack of high-quality 3D-text paired datasets compared to abundant 2D resources, hindering robust model training.
- Multimodal Integration: Challenges in fusing 3D data with other modalities due to structural differences and potential information loss.
- Complex Task Definition: Need for frameworks supporting nuanced language-context inference in dynamic environments.