We present Habitat, a platform for research in embodied artificial intelligence (AI).
Habitat enables training embodied agents (virtual robots) in highly efficient photorealistic 3D simulation. Specifically, Habitat consists of: (1) Habitat-Sim: a flexible, high-performance 3D simulator with configurable agents, sensors, and generic 3D dataset handling. Habitat-Sim is fast—when rendering a scene from Matterport3D, it achieves several thousand frames per second (fps) running single-threaded, and can reach over 10,000 fps multi-process on a single GPU. (2) Habitat-API: a modular high-level library for end-to-end development of embodied AI algorithms—defining tasks (e.g., navigation, instruction following, question answering), configuring, training, and benchmarking embodied agents.
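The pattern Habitat-API organizes is the standard embodied sense-act loop: reset an episode, observe, act, repeat until the episode ends. A minimal sketch of that loop follows, using a hypothetical `ToyEnv` stub in place of the real simulator (the stub, its 1-D corridor, and the `run_episode` helper are illustrative assumptions, not Habitat's actual API):

```python
class ToyEnv:
    """Hypothetical stub standing in for a simulator: a 1-D corridor
    with a point goal at position 5. NOT the real Habitat-Sim."""

    def __init__(self):
        self.position = 0
        self.episode_over = False

    def reset(self):
        """Start a new episode and return the initial observation dict."""
        self.position = 0
        self.episode_over = False
        return {"gps": self.position}

    def step(self, action):
        """Advance the environment by one action and return the observation."""
        if action == "forward":
            self.position += 1
        elif action == "stop":
            self.episode_over = True
        return {"gps": self.position}


def run_episode(env, policy):
    """Roll out one episode: observe, act, repeat until the agent stops."""
    obs = env.reset()
    while not env.episode_over:
        obs = env.step(policy(obs))
    return obs


# A trivial policy: walk forward until the goal coordinate, then stop.
final = run_episode(ToyEnv(), lambda o: "stop" if o["gps"] >= 5 else "forward")
print(final["gps"])  # 5
```

The real library wires photorealistic rendered observations, sensor suites, and task logic (success criteria, reward) into this same loop, which is what makes tasks and agents swappable.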
These large-scale engineering contributions enable us to answer scientific questions requiring experiments that were till now impracticable or “merely” impractical. Specifically, in the context of point-goal navigation: (1) we revisit the comparison between learning and SLAM approaches from two recent works and find evidence for the opposite conclusion—that learning outperforms SLAM if scaled to an order of magnitude more experience than previous investigations [cf. DD-PPO], and (2) we conduct the first cross-dataset generalization experiments {train, test} × {Matterport3D, Gibson} for multiple sensors {blind, RGB, RGBD, D} and find that only agents with depth (D) sensors generalize across datasets. We hope that our open-source platform and these findings will advance research in embodied AI.
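The cross-dataset grid in (2) can be enumerated mechanically; the sketch below lists the experimental conditions only (the dataset and sensor labels come from the text above, and no results are encoded):

```python
from itertools import product

# Enumerate the cross-dataset generalization grid:
# {train, test} datasets x sensor configurations.
datasets = ["Matterport3D", "Gibson"]
sensors = ["Blind", "RGB", "RGBD", "D"]

conditions = [
    {"train": tr, "test": te, "sensor": s}
    for tr, te, s in product(datasets, datasets, sensors)
]
print(len(conditions))  # 2 train x 2 test x 4 sensors = 16 conditions
```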
…4. PointGoal Navigation at Scale: …For the experiments reported here, we train until 75 million agent steps are accumulated across all worker threads. This is 15× larger than the experience used in previous investigations [20, 16]. Training agents to 75 million steps took (in sum over all 3 datasets): 320 GPU-hours for Blind, 566 GPU-hours for RGB, 475 GPU-hours for Depth, and 906 GPU-hours for RGBD (overall 2,267 GPU-hours).
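The per-sensor training costs quoted above can be checked by simple addition (numbers taken directly from the text):

```python
# GPU-hours to reach 75M steps, summed over the 3 datasets (from the text).
gpu_hours = {"Blind": 320, "RGB": 566, "Depth": 475, "RGBD": 906}

total = sum(gpu_hours.values())
print(total)  # 2267, matching the reported overall figure

# 75M steps is 15x the 5M-step budget of the earlier investigations.
print(75_000_000 // 15)  # 5000000
```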
Figure 3: Average SPL of agents on the val set over the course of training. Previous work [20, 16] has analyzed performance at 5–10 million steps. Interesting trends emerge with more experience: (1) Blind agents initially outperform RGB & RGBD but saturate quickly; (2) Learning-based Depth agents outperform classic SLAM. The shaded areas around curves show the standard error of SPL over 5 seeds.
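SPL (Success weighted by Path Length) is the metric plotted in Figure 3: episode i contributes S_i · l_i / max(p_i, l_i), where S_i is the binary success indicator, l_i the shortest-path distance to the goal, and p_i the length of the agent's actual path. A sketch of the metric and of the standard error that forms the shaded bands follows; the function names and the toy episode numbers are illustrative, not the paper's data:

```python
import math

def spl(successes, shortest, taken):
    """Average SPL over episodes: success weighted by
    (shortest-path length / max(actual path, shortest path))."""
    per_episode = [s * l / max(p, l) for s, l, p in zip(successes, shortest, taken)]
    return sum(per_episode) / len(per_episode)

def std_error(values):
    """Standard error of the mean (e.g., of SPL over 5 seeds),
    using the sample standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return math.sqrt(var / len(values))

# Toy example: 3 episodes; the second fails, the third takes a 2x detour.
print(spl([1, 0, 1], [10.0, 8.0, 5.0], [10.0, 8.0, 10.0]))  # (1.0 + 0 + 0.5)/3 = 0.5
```

Note that a failed episode contributes 0 regardless of path length, and a successful episode contributes at most 1 (achieved only by following the shortest path exactly).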
…Learning vs SLAM: To answer the first question we plot agent performance (SPL) on validation (i.e., unseen) episodes over the course of training in Figure 3 (top: Gibson, bottom: Matterport3D). SLAM does not require training and thus has constant performance (0.59 on Gibson, 0.42 on Matterport3D). All RL (PPO) agents start out with far worse SPL, but RL (PPO) Depth, in particular, improves dramatically and matches the classic baseline at ~10M frames (Gibson) or ~30M frames (Matterport3D) of experience, continuing to improve thereafter. Notice that if we had terminated the experiment at 5M frames, as in [20], we would also have concluded that SLAM dominates. Interestingly, RGB agents do not outperform Blind agents; we hypothesize that this is because both are equipped with GPS sensors. Indeed, qualitative results (Figure 4 and video in supplement) suggest that Blind agents ‘hug’ walls and implement ‘wall following’ heuristics. In contrast, RGB sensors provide a high-dimensional complex signal that may be prone to overfitting to train environments due to the variety across scenes (even within the same dataset). We also notice in Figure 3 that all methods perform better on Gibson than Matterport3D. This is consistent with our previous analysis that Gibson contains smaller scenes and shorter episodes.