EmbodiedScene: Towards Automated Generation of Diverse and Realistic Scenes for Embodied AI

Examples generated by EmbodiedScene. From a single simple prompt, EmbodiedScene generates scenes with diverse styles and layouts that are physically plausible and approach human-level design quality.

Abstract

Collecting high-quality data within simulation environments has proven to be an effective strategy for addressing data challenges in embodied AI. As a critical step in data collection, the construction of simulation scenes heavily relies on expert knowledge, while automated approaches often struggle to ensure sufficient diversity and realism. To address these limitations, we present EmbodiedScene, a hierarchical framework that leverages large language models (LLMs) to automate the generation of diverse and realistic simulation scenes, with a particular focus on multi-room indoor scene synthesis. To encourage diversity, we introduce a unified representation that encodes spatial configurations and layout semantics. This representation is populated with detailed content by LLMs and further diversified using evolutionary algorithms. It then serves as the foundation for downstream scene synthesis, guiding the generation of precise absolute parameters through a three-stage coarse-to-fine process: floor plan, region plan, and layout plan. To ensure reliability, we incorporate a vision-language model (VLM) as a stage-wise scene critic. The VLM provides feedback by comparing the intended design objectives with the generated outputs, and guides iterative refinement at each stage. Experimental results demonstrate that EmbodiedScene generates scenes with significantly greater realism and diversity than strong baselines. We further show that scenes generated by EmbodiedScene improve performance on downstream embodied AI tasks.
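To make the pipeline described in the abstract concrete, the sketch below traces its control flow: an LLM populates the unified representation, an evolutionary step diversifies it, and three coarse-to-fine synthesis stages (floor plan, region plan, layout plan) are each refined under stage-wise VLM critique. This is a minimal illustrative sketch, not the paper's implementation; every name here (SceneRep, llm_populate, evolutionary_diversify, synthesize, vlm_critic, generate_scene) is a hypothetical placeholder, and the stubs stand in for real LLM, evolutionary-search, and VLM calls.

```python
# Hypothetical sketch of the EmbodiedScene pipeline described in the abstract.
# All names are illustrative stubs, not the authors' actual API.

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SceneRep:
    """Unified representation: spatial configuration plus layout semantics."""
    spatial_config: dict = field(default_factory=dict)
    layout_semantics: dict = field(default_factory=dict)


def llm_populate(rep: SceneRep, prompt: str) -> SceneRep:
    # Stub: an LLM would fill the representation from the user prompt.
    rep.layout_semantics["prompt"] = prompt
    return rep


def evolutionary_diversify(rep: SceneRep) -> SceneRep:
    # Stub: mutation/crossover over a population of representations.
    return rep


def synthesize(stage: str, rep: SceneRep, scene: dict,
               feedback: Optional[str] = None) -> dict:
    # Stub: an LLM would emit precise absolute parameters for this stage,
    # conditioned on earlier stages and on any critic feedback.
    return {"stage": stage, "feedback_applied": feedback is not None}


def vlm_critic(rep: SceneRep, output: dict) -> Optional[str]:
    # Stub: a VLM would compare the intended design objectives with the
    # generated output and return feedback (None means "accepted").
    return None


def generate_scene(prompt: str, max_refinements: int = 3) -> dict:
    rep = evolutionary_diversify(llm_populate(SceneRep(), prompt))
    scene = {}
    # Coarse-to-fine: floor plan -> region plan -> layout plan, each
    # iteratively refined under stage-wise VLM critique.
    for stage in ("floor_plan", "region_plan", "layout_plan"):
        output = synthesize(stage, rep, scene)
        for _ in range(max_refinements):
            feedback = vlm_critic(rep, output)
            if feedback is None:
                break
            output = synthesize(stage, rep, scene, feedback=feedback)
        scene[stage] = output
    return scene


if __name__ == "__main__":
    print(generate_scene("a cozy two-bedroom apartment"))
```

Under these assumptions, the key design choice is that critique happens per stage rather than once at the end, so errors in the floor plan are corrected before region and layout planning build on them.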

