Robot Navigation¶
Navigation is the foundational capability for any mobile robot — the ability to move from one location to another while avoiding obstacles and efficiently reaching goals. Modern navigation research goes far beyond simple path planning, encompassing semantic understanding, language grounding, and social awareness.
1. Point-Goal Navigation (PointNav)¶
Task Definition¶
Given a target coordinate \((x, y, z)\) relative to the agent's starting position, navigate to that location. No semantic understanding is required — the agent knows only "go 5 meters forward and 3 meters left."
Formal Specification¶
Input: Agent's current pose (position + orientation)
Goal: relative coordinates (Δx, Δy, Δz)
Output: Sequence of actions (move forward, turn left, turn right, stop)
Metric: Success Rate (SR), Success weighted by Path Length (SPL)
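To make the specification concrete, here is a minimal geometry sketch in plain NumPy (not tied to any simulator API): it converts the relative goal into the distance and heading a policy typically consumes, and checks the stop-within-radius success test. The 0.2 m success radius follows the Habitat PointNav setting and should otherwise be treated as an assumption.

```python
# Hedged sketch (NumPy only): egocentric goal geometry and the PointNav
# success test. Ground-plane (x, z) coordinates and yaw conventions are
# illustrative assumptions; SUCCESS_RADIUS is benchmark-specific.
import numpy as np

SUCCESS_RADIUS = 0.2  # metres (Habitat PointNav setting; assumption)

def goal_in_agent_frame(agent_pos, agent_yaw, goal_pos):
    """Return (distance, heading) to the goal in the agent's frame.

    agent_pos, goal_pos: (x, z) ground-plane coordinates in metres.
    agent_yaw: heading in radians (world frame).
    """
    delta = np.asarray(goal_pos, dtype=float) - np.asarray(agent_pos, dtype=float)
    distance = np.linalg.norm(delta)
    heading = np.arctan2(delta[1], delta[0]) - agent_yaw
    # Wrap heading into (-pi, pi] so "turn left" vs "turn right" is well defined.
    heading = (heading + np.pi) % (2 * np.pi) - np.pi
    return distance, heading

def is_success(agent_pos, goal_pos, called_stop):
    """The episode succeeds only if the agent calls STOP within the radius."""
    dist = np.linalg.norm(np.asarray(goal_pos, dtype=float) - np.asarray(agent_pos, dtype=float))
    return called_stop and dist <= SUCCESS_RADIUS
```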
Why It Matters¶
PointNav is the simplest navigation task, but it tests fundamental capabilities:
- Spatial reasoning: Understanding 3D space from egocentric observations
- Obstacle avoidance: Detecting and navigating around barriers
- Path efficiency: Finding short paths, not just any path
- Generalization: Working in unseen environments
Key Benchmarks¶
| Benchmark | Year | Environment | Key Feature |
|---|---|---|---|
| Habitat PointNav Challenge | 2019–present | Habitat (HM3D, MP3D) | Annual competition, photorealistic |
| Gibson PointNav | 2018 | Gibson | Real-world scanned environments |
| RoboTHOR PointNav | 2020 | AI2-THOR | Sim-to-real transfer |
Datasets¶
- Matterport3D: 10,800 panoramic RGB-D images, 90 buildings, 40 semantic categories
- HM3D: 1,000 building-scale 3D scenes (largest dataset for PointNav)
- Gibson: 572 real-world scanned buildings
State-of-the-Art Methods¶
The building blocks are broadly shared across methods; below is a minimal, runnable PyTorch sketch, in which a small CNN and a GRU memory stand in for a ResNet encoder and an explicit metric map.

```python
# Conceptual PointNav agent architecture: a runnable PyTorch sketch. The
# small CNN stands in for a ResNet encoder, and the GRU memory stands in
# for an explicit 2D/3D map module (e.g. the one in Active Neural SLAM).
import torch
import torch.nn as nn

class PointNavAgent(nn.Module):
    """
    Typical PointNav agent components:
      1. Visual encoder (ResNet / ViT)            -> extract visual features
      2. Map module (metric or implicit map)      -> build spatial memory
      3. Policy network (GRU / Transformer)       -> decide actions
    """
    def __init__(self, feature_dim=256, hidden_dim=512, num_actions=4):
        super().__init__()
        # Visual encoder: processes 4-channel RGB-D input.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feature_dim),
        )
        # Spatial memory: recurrent state over visual features + 3-D goal vector.
        self.memory = nn.GRUCell(feature_dim + 3, hidden_dim)
        # Policy head: scores for the 4 discrete actions.
        self.policy = nn.Linear(hidden_dim, num_actions)
        self.hidden = None

    def act(self, rgb, depth, goal):
        # 1. Encode the RGB-D observation: (B, 4, H, W) -> (B, feature_dim).
        features = self.visual_encoder(torch.cat([rgb, depth], dim=1))
        # 2. Update the spatial memory with features + goal direction.
        if self.hidden is None:
            self.hidden = torch.zeros(features.size(0), self.policy.in_features)
        self.hidden = self.memory(torch.cat([features, goal], dim=1), self.hidden)
        # 3. Select an action: 0=forward, 1=turn left, 2=turn right, 3=stop.
        return self.policy(self.hidden).argmax(dim=1)
```
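A quick smoke test of the sketch above with random tensors (image size, batch shape, and the goal-vector convention are illustrative only):

```python
agent = PointNavAgent()
rgb = torch.rand(1, 3, 128, 128)           # RGB frame (batch of 1)
depth = torch.rand(1, 1, 128, 128)         # aligned depth frame
goal = torch.tensor([[5.0, 0.0, -3.0]])    # relative goal vector (axis convention illustrative)
action = agent.act(rgb, depth, goal)       # tensor([k]) with k in {0, 1, 2, 3}
```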
Where It Applies¶
- Warehouse robots navigating to shelf locations
- Delivery robots moving to drop-off points
- Drone navigation to GPS coordinates
Frontier-Based Exploration Demo¶

2. Object-Goal Navigation (ObjectNav)¶
Task Definition¶
Given a semantic category (e.g., "find the refrigerator"), navigate to an instance of that object without being given coordinates. The agent must understand what objects look like and where they are typically found.
Formal Specification¶
Input: Agent's current pose + object category (e.g., "chair")
Output: Sequence of actions leading to an instance of that category
Metric: Success Rate (SR), SPL, Soft SPL
Why It's Harder Than PointNav¶
| PointNav | ObjectNav |
|---|---|
| Knows exact goal location | Must search for goal |
| No semantic understanding needed | Must recognize object categories |
| Can plan path directly | Must explore + recognize |
| Pure spatial reasoning | Spatial + semantic reasoning |
Key Benchmarks¶
| Benchmark | Year | Scenes | Objects | Key Feature |
|---|---|---|---|---|
| Habitat ObjectNav Challenge | 2021–present | HM3D (1000) | 6 categories | Annual competition |
| RoboTHOR ObjectNav | 2020 | 75 rooms | 19 categories | Sim-to-real |
| ProcTHOR ObjectNav | 2022 | 10,000 houses | 50+ categories | Procedurally generated |
Datasets¶
- HM3D-Semantics: 1,000 scenes with semantic annotations (furniture, appliances, etc.)
- Gibson: 572 buildings with object labels
- ProcTHOR: 10,000 procedurally generated houses (scalable training data)
Object Categories (Habitat Challenge)¶
Standard 6 categories:
1. chair — Seating furniture
2. bed — Sleeping furniture
3. plant — Indoor vegetation
4. toilet — Bathroom fixture
5. tv_monitor — Screens and displays
6. sofa — Couches and settees
Typical Pipeline¶
Observation (RGB-D)
    → Object Detector (YOLO / SAM)
    → Semantic Map (2D grid)
    → Frontier Explorer (move toward unexplored frontiers)
    → Policy (DRL)
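As one concrete piece of this pipeline, here is a hedged sketch of the semantic-map step: a detection's centre pixel is back-projected using the depth value and a pinhole camera model, transformed by the agent's planar pose, and written into a top-down grid. The intrinsics, cell size, and ROS-style axis convention (x forward, y left, yaw counter-clockwise) are assumptions.

```python
# Hedged sketch of the "Semantic Map" step: project one detection into a
# top-down 2-D grid. Intrinsics, cell size, and coordinate conventions are
# illustrative assumptions, not any particular system's API.
import numpy as np

CELL_SIZE = 0.05  # metres per grid cell (assumption)

def project_detection(u, depth_m, fx, cx, agent_xy, agent_yaw, grid_origin_xy):
    """Map a detection's centre column `u` (pixels) to a (row, col) map cell."""
    # 1. Back-project the pixel into the agent frame (x forward, y left).
    forward = depth_m
    left = -(u - cx) * depth_m / fx
    # 2. Rotate/translate into world ground-plane coordinates using the agent pose.
    wx = agent_xy[0] + forward * np.cos(agent_yaw) - left * np.sin(agent_yaw)
    wy = agent_xy[1] + forward * np.sin(agent_yaw) + left * np.cos(agent_yaw)
    # 3. Discretise to grid indices relative to the map origin.
    col = int((wx - grid_origin_xy[0]) / CELL_SIZE)
    row = int((wy - grid_origin_xy[1]) / CELL_SIZE)
    return row, col

# The semantic map is then just {category: set of cells}; the frontier
# explorer (Section 4) takes over whenever the goal category is still unmapped.
semantic_map = {}
cell = project_detection(u=410, depth_m=2.3, fx=575.0, cx=320.0,
                         agent_xy=(1.0, 0.5), agent_yaw=0.0,
                         grid_origin_xy=(-10.0, -10.0))
semantic_map.setdefault("chair", set()).add(cell)
```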
Where It Applies¶
- Household robots: "Bring me the mug from the kitchen"
- Service robots: "Find the nearest exit"
- Search and rescue: "Locate injured persons"
3. Vision-Language Navigation (VLN)¶
Task Definition¶
Follow natural language instructions to navigate through environments. Unlike PointNav/ObjectNav, the goal is specified in human language, requiring the agent to ground language in visual observations.
Example Instructions¶
R2R dataset:
"Walk past the piano and turn right. Go down the hallway and
enter the second door on your left."
REVERIE dataset:
"Go to the mug on the table in the kitchen."
RxR dataset (multilingual: English, Hindi, and Telugu):
Instructions follow the same style as R2R but are longer, are written independently in all three languages, and are time-aligned with the annotator's pose trace (dense grounding).
Key Datasets¶
| Dataset | Year | Instructions | Scenes | Language | Key Feature |
|---|---|---|---|---|---|
| R2R | 2018 | 21,567 | 90 (MP3D) | English | First VLN benchmark |
| RxR | 2020 | 126,000+ | 90 (MP3D) | EN/HI/TE | Multilingual, dense grounding |
| REVERIE | 2020 | 21,702 | 90 (MP3D) | English | Remote referring expressions |
| SOON | 2021 | 4,000+ | 80 | English | Goal described by the target object and its surroundings; agent may start anywhere |
| CVDN | 2020 | 7,441 dialogs | 30+ | English | Dialog-based navigation |
| ALFRED | 2020 | 25,743 | 120 (THOR) | English | Navigation + manipulation |
Evaluation Metrics¶
| Metric | Definition |
|---|---|
| SR (Success Rate) | % of episodes where agent stops within 3m of goal |
| SPL (Success weighted by Path Length) | Per-episode success × shortest-path length ÷ max(shortest path, actual path), averaged over episodes |
| NE (Navigation Error) | Average distance from agent to goal at episode end |
| OSR (Oracle Success Rate) | % of episodes where agent was within 3m at any point |
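For reference, the standard definition from Anderson et al. (2018), where \(N\) is the number of episodes, \(S_i\) indicates success on episode \(i\), \(\ell_i\) is the shortest-path distance from start to goal, and \(p_i\) is the length of the path the agent actually took:

\[
\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i,\ \ell_i)}
\]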
State-of-the-Art Approaches¶
Evolution of VLN methods:
2018: Seq2Seq — LSTM encoder-decoder baseline released with R2R
2018: Speaker-Follower — data augmentation with an instruction-generating speaker model
2019: PRESS — pretrained language model (BERT) for instruction encoding
2019: EnvDrop — environment dropout + back-translation for better generalization
2021: VLN-BERT — recurrent transformer with cross-modal attention
2021: HAMT — History-Aware Multimodal Transformer over the full observation history
2023: NavGPT — large language models prompted as zero-shot navigation reasoners
2024: NaVid and other LLM/VLM-based agents — foundation models as navigation planners
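As a flavour of the transformer-era methods, here is a minimal cross-modal decision step in the spirit of VLN-BERT / HAMT (not a reproduction of either): the pooled instruction-plus-history state attends over the features of the navigable candidate viewpoints, and the best-scoring candidate becomes the next move. Dimensions and the dot-product scoring are illustrative assumptions.

```python
# Hedged sketch of one cross-modal decision step in a transformer-style VLN agent.
import torch
import torch.nn as nn

class CrossModalStep(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Query: instruction/history state. Keys/values: candidate view features.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, state, candidate_feats):
        """
        state:           (B, 1, dim) pooled instruction + history embedding
        candidate_feats: (B, K, dim) visual features of K navigable candidates
        returns:         (B,) index of the chosen candidate viewpoint
        """
        # Cross-modal attention: language/history queries the visual candidates.
        fused, _ = self.attn(query=state, key=candidate_feats, value=candidate_feats)
        # Score each candidate against the fused state and pick the best one.
        logits = (candidate_feats * fused).sum(dim=-1)   # (B, K) dot-product scores
        return logits.argmax(dim=-1)

# Usage with dummy tensors:
step = CrossModalStep()
state = torch.rand(2, 1, 256)
candidates = torch.rand(2, 5, 256)    # 5 navigable viewpoints
next_view = step(state, candidates)   # tensor of shape (2,)
```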
Where It Applies¶
- Household assistants: "Go to the bedroom and bring my glasses"
- Service robots in hotels: "Take this to room 305"
- Museum guides: "Take visitors to the Monet exhibition"
4. Exploration / Active Mapping¶
Task Definition¶
Autonomously explore an unknown environment to build a complete map or maximize coverage. Unlike goal-directed navigation, there is no specific target — the objective is to understand the environment as thoroughly and efficiently as possible.
Formal Specification¶
Input: Agent's current pose + observations from unknown environment
Output: Exploration policy (where to look next)
Metric: Coverage (% of environment mapped), efficiency (steps to full map)
Key Methods¶
| Method | Year | Approach | Reference |
|---|---|---|---|
| Active Neural SLAM | 2020 | Learnable SLAM + exploration policy | Chaplot et al., ICLR 2020 |
| FAI (Flood Fill Exploration) | 2023 | Frontier-based with learned value | Chi et al. |
| BEV Exploration | 2024 | Bird's-eye view representation | Various |
Exploration Strategies¶
1. Frontier-Based Exploration (classical)
┌─────────────────────────┐
│ known ░░░░░░░░░░░░      │
│ space ░░ ? ? ? ░░       │   ← frontier (░): the boundary between
│       ░░ ? ? ? ░░       │     known (mapped) and unknown (?) space
│ known ░░░░░░░░░░░░      │
└─────────────────────────┘
Move to nearest frontier → observe → update map → repeat
2. Learned Exploration (modern)
Use RL to learn a policy that maximizes coverage
Input: current map + visited areas
Output: next waypoint to visit
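A minimal sketch of classical frontier selection on a 2-D occupancy grid (the cell encoding and the nearest-frontier heuristic are simplifying assumptions; practical systems cluster frontiers and plan paths through the map rather than using straight-line distance):

```python
# Minimal frontier-based exploration sketch on a 2-D occupancy grid.
# Cell states: -1 = unknown, 0 = known free, 1 = known occupied.
import numpy as np

UNKNOWN, FREE, OCCUPIED = -1, 0, 1

def frontier_cells(grid):
    """Free cells that have at least one unknown 4-neighbour."""
    frontiers = []
    rows, cols = grid.shape
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != FREE:
                continue
            neighbours = [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))]
            if any(0 <= nr < rows and 0 <= nc < cols and grid[nr, nc] == UNKNOWN
                   for nr, nc in neighbours):
                frontiers.append((r, c))
    return frontiers

def next_waypoint(grid, agent_cell):
    """Nearest frontier cell, or None once the map is fully explored."""
    frontiers = frontier_cells(grid)
    if not frontiers:
        return None
    return min(frontiers, key=lambda f: np.hypot(f[0] - agent_cell[0], f[1] - agent_cell[1]))

# Toy map: a 5x5 area where only the left two columns have been observed so far.
grid = np.full((5, 5), UNKNOWN, dtype=np.int8)
grid[:, :2] = FREE
print(next_waypoint(grid, agent_cell=(2, 0)))   # -> (2, 1), a cell on the frontier column
```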
Where It Applies¶
- Search and rescue: Explore collapsed buildings
- Space exploration: Map unknown planetary surfaces
- Home robots: Map a new apartment on first deployment
5. Social Navigation¶
Task Definition¶
Navigate in environments with other agents or humans, respecting social norms such as personal space, collision avoidance, yielding right-of-way, and maintaining comfortable interactions.
Key Challenges¶
Social Navigation must handle:
├── Proxemics — Respect personal space (Hall, 1966)
├── Collision avoidance — Dynamic obstacles (people move)
├── Yielding — Give way in narrow passages
├── Group awareness — Navigate around groups, not through them
├── Prediction — Anticipate human motion
└── Communication — Signal intent through motion
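One common way such constraints enter a planner is as additional cost terms layered on the collision costmap. Below is a minimal sketch of a proxemics-style penalty: each predicted person contributes a Gaussian personal-space cost, and candidate waypoints are scored by progress toward the goal plus social cost. The sigma and weight values are assumptions, loosely inspired by Hall's proxemic zones.

```python
# Hedged sketch of a proxemics-aware waypoint cost. Parameter values are
# illustrative assumptions, not tuned constants from any published system.
import numpy as np

PERSONAL_SPACE_SIGMA = 1.2   # metres (assumption, roughly Hall's personal zone)
SOCIAL_WEIGHT = 5.0          # trade-off between progress and comfort (assumption)

def social_cost(point, predicted_people):
    """Sum of Gaussian personal-space penalties at `point` (x, y)."""
    point = np.asarray(point, dtype=float)
    cost = 0.0
    for person in predicted_people:                       # predicted (x, y) positions
        d = np.linalg.norm(point - np.asarray(person, dtype=float))
        cost += np.exp(-0.5 * (d / PERSONAL_SPACE_SIGMA) ** 2)
    return cost

def choose_waypoint(candidates, goal, predicted_people):
    """Pick the candidate balancing progress toward the goal and social comfort."""
    goal = np.asarray(goal, dtype=float)
    return min(candidates,
               key=lambda c: np.linalg.norm(np.asarray(c, dtype=float) - goal)
                             + SOCIAL_WEIGHT * social_cost(c, predicted_people))

# Example: the robot detours around a person rather than cutting past them.
people = [(2.0, 0.5)]
print(choose_waypoint([(2.0, 0.0), (2.0, -1.5)], goal=(4.0, 0.0), predicted_people=people))
# -> (2.0, -1.5)
```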
Key Datasets¶
| Dataset | Year | Setting | Key Feature |
|---|---|---|---|
| SCAND | 2022 | Indoor + outdoor, real-world | Socially compliant navigation demonstrations |
| SDD (Stanford Drone Dataset) | 2016 | Outdoor, aerial | Pedestrian trajectories |
| ETH/UCY | 2009 | Outdoor, indoor | Human trajectory prediction |
| Habitat 3.0 | 2024 | Simulation | Human-in-the-loop social navigation |
Where It Applies¶
- Hospital robots: Navigate crowded corridors
- Airport guide robots: Move through busy terminals
- Restaurant delivery robots: Navigate between tables
6. Other Navigation Tasks¶
Embodied Question Answering (EQA)¶
Given a question like "What color is the couch in the living room?", the agent must navigate to the living room, observe the couch, and answer the question.
- Dataset: EQA (Das et al., CVPR 2018)
- Extension: Habitat-EQA with multi-room questions
Audio-Visual Navigation¶
Navigate toward a sound source (e.g., "find the ringing phone").
- Dataset: SoundSpaces (Chen et al., 2020)
- Key feature: Requires audio-visual fusion
Rearrangement¶
Navigate to objects, pick them up, and place them in correct locations.
- Challenge: Habitat 2023 Rearrangement track
- Dataset: ReplicaCAD Rearrangement
Navigation Task Comparison¶
| Task | Goal Specification | Key Challenge | Top Method Type | Typical SR |
|---|---|---|---|---|
| PointNav | Relative coordinates | Spatial reasoning | RL + map | ~95% |
| ObjectNav | Object category | Semantic search | RL + detection | ~55% |
| VLN | Language instruction | Language grounding | Transformer | ~60% |
| Exploration | None (maximize coverage) | Efficient coverage | Frontier + RL | ~90% coverage |
| Social Nav | Goal + social norms | Human prediction | Predictive planning | N/A (subjective) |
| EQA | Question | Navigate + reason | Multi-modal | ~45% |
References¶
- Anderson et al. (2018). "On Evaluation of Embodied Navigation Agents." arXiv:1807.06757
- Batra et al. (2020). "Exploring Visual Navigation using Habitat." arXiv:2004.01261
- Anderson et al. (2018). "Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments." CVPR 2018
- Chaplot et al. (2020). "Learning to Explore using Active Neural SLAM." ICLR 2020
- Mavrogiannis et al. (2023). "Core Challenges of Social Robot Navigation: A Survey." ACM Transactions on Human-Robot Interaction
- Karnan et al. (2022). "Socially CompliAnt Navigation Dataset (SCAND): A Large-Scale Dataset of Demonstrations for Social Navigation." IEEE Robotics and Automation Letters
- Guan et al. (2022). "A Survey on Vision-Language Navigation." arXiv:2211.11697
- Xia et al. (2024). "Navigation in the Era of Foundation Models: A Survey." arXiv:2402.19300