Robot Navigation

Navigation is the foundational capability for any mobile robot — the ability to move from one location to another while avoiding obstacles and efficiently reaching goals. Modern navigation research goes far beyond simple path planning, encompassing semantic understanding, language grounding, and social awareness.

1. Point-Goal Navigation (PointNav)

Task Definition

Given a target coordinate \((x, y, z)\) relative to the agent's starting position, navigate to that location. No semantic understanding is required — the agent knows only "go 5 meters forward and 3 meters left."

Formal Specification

Input:  Agent's current pose (position + orientation)
        Goal: relative coordinates (Δx, Δy, Δz)
Output: Sequence of actions (move forward, turn left, turn right, stop)
Metric: Success Rate (SR), Success weighted by Path Length (SPL)
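
As a point of reference, here is a minimal sketch of how SR and SPL are typically computed from logged episodes (the field names used here are illustrative, not a specific benchmark API):

# Minimal SR / SPL computation over logged episodes.
# Field names (success, shortest_path, agent_path) are illustrative.
def success_rate(episodes):
    return sum(ep["success"] for ep in episodes) / len(episodes)

def spl(episodes):
    # Per episode: S_i * l_i / max(p_i, l_i), where l_i is the shortest-path
    # length to the goal and p_i is the length of the path actually taken.
    total = 0.0
    for ep in episodes:
        l, p = ep["shortest_path"], ep["agent_path"]
        total += ep["success"] * l / max(p, l)
    return total / len(episodes)

episodes = [
    {"success": 1, "shortest_path": 8.0, "agent_path": 10.0},
    {"success": 0, "shortest_path": 5.0, "agent_path": 12.0},
]
print(success_rate(episodes), spl(episodes))  # 0.5 0.4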

Why It Matters

PointNav is the simplest navigation task, but it tests fundamental capabilities:

  • Spatial reasoning: Understanding 3D space from egocentric observations
  • Obstacle avoidance: Detecting and navigating around barriers
  • Path efficiency: Finding short paths, not just any path
  • Generalization: Working in unseen environments

Key Benchmarks

Benchmark | Year | Environment | Key Feature
Habitat PointNav Challenge | 2019–present | Habitat (HM3D, MP3D) | Annual competition, photorealistic scenes
Gibson PointNav | 2018 | Gibson | Real-world scanned environments
RoboTHOR PointNav | 2020 | AI2-THOR | Sim-to-real transfer

Datasets

  • Matterport3D: 10,800 panoramic viewpoints (RGB-D) across 90 building-scale scenes, with 40 object categories
  • HM3D: 1,000 building-scale 3D scenes (largest dataset for PointNav)
  • Gibson: 572 real-world scanned buildings

State-of-the-Art Methods

# Conceptual PointNav agent architecture.
# A minimal runnable sketch: the concrete choices below (ResNet-18 encoder,
# GRU policy, implicit map held in the recurrent state) are illustrative
# stand-ins for the components listed in the docstring.
import torch
import torch.nn as nn
import torchvision.models as models

class PointNavAgent(nn.Module):
    """
    Typical PointNav agent components:
    1. Visual encoder (ResNet / ViT) → extract visual features
    2. Map module (spatial map or implicit map) → build spatial memory
    3. Policy network (GRU / Transformer) → decide actions
    """

    def __init__(self, feature_dim=256, hidden_dim=512, num_actions=4):
        super().__init__()
        # Visual encoder: ResNet-18 over a 4-channel RGB-D input
        backbone = models.resnet18(weights=None)
        backbone.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, feature_dim)
        self.visual_encoder = backbone
        # Recurrent policy over [visual features, goal vector (Δx, Δy, Δz)];
        # its hidden state serves as an implicit spatial memory
        self.policy = nn.GRU(feature_dim + 3, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, num_actions)
        self.hidden = None

    def act(self, rgb, depth, goal):
        # 1. Encode the visual observation (RGB and depth stacked channel-wise)
        obs = torch.cat([rgb, depth], dim=1)                # (B, 4, H, W)
        features = self.visual_encoder(obs)                 # (B, feature_dim)

        # 2. Update spatial memory (implicit map held in the GRU hidden state)
        x = torch.cat([features, goal], dim=1).unsqueeze(1)
        out, self.hidden = self.policy(x, self.hidden)

        # 3. Select action based on features + goal direction
        logits = self.action_head(out[:, -1])
        return logits.argmax(dim=1)  # 0=forward, 1=left, 2=right, 3=stop
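
A hypothetical usage example with dummy tensors (input sizes here are arbitrary):

# Example call with dummy inputs (shapes are illustrative)
agent = PointNavAgent()
rgb = torch.zeros(1, 3, 128, 128)        # RGB image
depth = torch.zeros(1, 1, 128, 128)      # depth image
goal = torch.tensor([[5.0, 3.0, 0.0]])   # relative goal (Δx, Δy, Δz)
action = agent.act(rgb, depth, goal)     # tensor with a value in {0, 1, 2, 3}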

Where It Applies

  • Warehouse robots navigating to shelf locations
  • Delivery robots moving to drop-off points
  • Drone navigation to GPS coordinates

Frontier-Based Exploration Demo

[Animation: frontier-based exploration]


2. Object-Goal Navigation (ObjectNav)

Task Definition

Given a semantic category (e.g., "find the refrigerator"), navigate to an instance of that object without being given coordinates. The agent must understand what objects look like and where they are typically found.

Formal Specification

Input:  Agent's current pose + object category (e.g., "chair")
Output: Sequence of actions leading to an instance of that category
Metric: Success Rate (SR), SPL, Soft SPL

Why It's Harder Than PointNav

PointNav | ObjectNav
Knows the exact goal location | Must search for the goal
No semantic understanding needed | Must recognize object categories
Can plan a path directly | Must explore and recognize
Pure spatial reasoning | Spatial + semantic reasoning

Key Benchmarks

Benchmark | Year | Scenes | Objects | Key Feature
Habitat ObjectNav Challenge | 2021–present | HM3D (1,000 scenes) | 6 categories | Annual competition
RoboTHOR ObjectNav | 2020 | 75 rooms | 19 categories | Sim-to-real
ProcTHOR ObjectNav | 2022 | 10,000 rooms | 50+ categories | Procedurally generated

Datasets

  • HM3D-Semantics: 1,000 scenes with semantic annotations (furniture, appliances, etc.)
  • Gibson: 572 buildings with object labels
  • ProcTHOR: 10,000 procedurally generated rooms (scalable training data)

Object Categories (Habitat Challenge)

Standard 6 categories:
1. chair       — Seating furniture
2. bed         — Sleeping furniture
3. plant       — Indoor vegetation
4. toilet      — Bathroom fixture
5. tv_monitor  — Screens and displays
6. sofa        — Couches and settees

Typical Pipeline

Observation → Object Detector → Semantic Map → Frontier Explorer → Policy
  (RGB-D)     (YOLO / SAM)      (2D grid)     (explore unknown     (DRL)
                                                  frontiers)
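
In code, this pipeline amounts to a perceive-map-plan loop. A hedged sketch follows; detector, semantic_map, plan_toward, and the env interface are hypothetical placeholders, not the API of any particular framework:

# Hypothetical ObjectNav control loop following the pipeline above.
# detector, semantic_map, plan_toward, and env are placeholders standing in
# for real perception, mapping, and planning components.
def object_nav_episode(env, target_category, detector, semantic_map,
                       plan_toward, max_steps=500):
    obs = env.reset()
    for _ in range(max_steps):
        # 1. Detect objects in the current RGB frame
        detections = detector(obs["rgb"])
        # 2. Project detections and free space into a top-down semantic map
        semantic_map.update(obs["depth"], obs["pose"], detections)
        # 3. If the target has been seen, head to it; otherwise go to the
        #    nearest unexplored frontier
        goal = semantic_map.location_of(target_category)
        if goal is None:
            goal = semantic_map.nearest_frontier(obs["pose"])
        action = plan_toward(obs["pose"], goal, semantic_map.occupancy())
        if action == "stop":
            break
        obs = env.step(action)
    return semantic_map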

Where It Applies

  • Household robots: "Bring me the mug from the kitchen"
  • Service robots: "Find the nearest exit"
  • Search and rescue: "Locate injured persons"

3. Vision-Language Navigation (VLN)

Task Definition

Follow natural language instructions to navigate through environments. Unlike PointNav/ObjectNav, the goal is specified in human language, requiring the agent to ground language in visual observations.

Example Instructions

R2R dataset:
"Walk past the piano and turn right. Go down the hallway and 
 enter the second door on your left."

REVERIE dataset:
"Go to the mug on the table in the kitchen."

RxR dataset (multilingual: English, Hindi, Telugu):
"从钢琴旁边走过，右转。沿着走廊走，进入左边第二个门。"
 ("Walk past the piano and turn right. Go down the hallway and enter the
  second door on your left.")

Key Datasets

Dataset | Year | Instructions | Scenes | Language | Key Feature
R2R | 2018 | 21,567 | 90 (MP3D) | English | First VLN benchmark
RxR | 2020 | 126,000+ | 90 (MP3D) | EN/HI/TE | Multilingual, dense grounding
REVERIE | 2020 | 21,702 | 90 (MP3D) | English | Remote referring expressions
SOON | 2021 | 4,000+ | 80 | English | Goal-oriented instructions describing the target object and its surroundings
CVDN | 2019 | 7,441 dialogs | 83 (MP3D) | English | Dialog-based navigation
ALFRED | 2020 | 25,743 | 120 (AI2-THOR) | English | Navigation + manipulation

Evaluation Metrics

Metric | Definition
SR (Success Rate) | % of episodes in which the agent stops within 3 m of the goal
SPL (Success weighted by Path Length) | Success weighted by the ratio of shortest-path length to the path actually traveled, averaged over episodes
NE (Navigation Error) | Average distance from the agent to the goal at episode end
OSR (Oracle Success Rate) | % of episodes in which the agent was within 3 m of the goal at any point along its trajectory
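
Concretely, SPL (Anderson et al., 2018) is computed over \(N\) episodes as

\[
\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i, \ell_i)},
\]

where \(S_i \in \{0, 1\}\) indicates success, \(\ell_i\) is the shortest-path distance from start to goal, and \(p_i\) is the length of the path the agent actually took.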

State-of-the-Art Approaches

Evolution of VLN methods:

2018: Seq2Seq — LSTM encoder-decoder baseline from the R2R paper
2018: Speaker-Follower — data augmentation with an instruction-generating speaker model
2019: PRESS — pretrained language model (BERT) for instruction encoding
2019: EnvDrop — environment dropout for better generalization
2021: VLN-BERT — transformer with cross-modal attention
2021: HAMT — History Aware Multimodal Transformer
2023: LLM-based agents — LLMs (e.g., GPT-4) used as zero-shot navigation planners
2024: NaVid — video-based vision-language model for instruction following

Where It Applies

  • Household assistants: "Go to the bedroom and bring my glasses"
  • Service robots in hotels: "Take this to room 305"
  • Museum guides: "Take visitors to the Monet exhibition"

4. Exploration / Active Mapping

Task Definition

Autonomously explore an unknown environment to build a complete map or maximize coverage. Unlike goal-directed navigation, there is no specific target — the objective is to understand the environment as thoroughly and efficiently as possible.

Formal Specification

Input:  Agent's current pose + observations from unknown environment
Output: Exploration policy (where to look next)
Metric: Coverage (% of environment mapped), efficiency (steps to full map)
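
Coverage can be read directly off the map being built. A minimal sketch on an occupancy-grid map (cell encoding assumed here: -1 = unobserved, 0 = free, 1 = occupied):

import numpy as np

def coverage(grid):
    """Fraction of map cells observed so far (encoding assumed above)."""
    return float((grid != -1).sum()) / grid.size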

Key Methods

Method | Year | Approach | Reference
Active Neural SLAM | 2020 | Learnable SLAM + exploration policy | Chaplot et al., ICLR 2020
FAI (Flood Fill Exploration) | 2023 | Frontier-based with learned value | Chi et al.
BEV Exploration | 2024 | Bird's-eye-view representation | Various

Exploration Strategies

1. Frontier-Based Exploration (classical)
   ┌─────────────────────────┐
   │  Known    ░░░░░░░░░░░░  │
   │  space    ░░ ? ? ? ░░░  │  ← Frontier: boundary between
   │           ░░ ? ? ? ░░░  │     known and unknown space
   │  Known    ░░░░░░░░░░░░  │
   └─────────────────────────┘
   Move to the nearest frontier → observe → update the map → repeat

2. Learned Exploration (modern)
   Use RL to learn a policy that maximizes coverage
   Input: current map + visited areas
   Output: next waypoint to visit
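
The frontier step itself can be written directly against an occupancy grid. A minimal sketch (grid encoding assumed, as in the coverage example above: 0 = free, 1 = occupied, -1 = unknown):

# Frontier detection on a 2D occupancy grid
# (assumed encoding: 0 = known free, 1 = known occupied, -1 = unknown).
import numpy as np

def find_frontiers(grid):
    """Return (row, col) cells that are free and 4-adjacent to unknown space."""
    free = (grid == 0)
    unknown = (grid == -1)
    neighbour_unknown = np.zeros_like(unknown)
    neighbour_unknown[1:, :] |= unknown[:-1, :]    # unknown cell above
    neighbour_unknown[:-1, :] |= unknown[1:, :]    # unknown cell below
    neighbour_unknown[:, 1:] |= unknown[:, :-1]    # unknown cell to the left
    neighbour_unknown[:, :-1] |= unknown[:, 1:]    # unknown cell to the right
    return np.argwhere(free & neighbour_unknown)

grid = np.array([
    [ 0,  0, -1],
    [ 0,  1, -1],
    [ 0,  0,  0],
])
print(find_frontiers(grid))  # [[0 1], [2 2]]: free cells bordering unknown space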

Where It Applies

  • Search and rescue: Explore collapsed buildings
  • Space exploration: Map unknown planetary surfaces
  • Home robots: Map a new apartment on first deployment

5. Social Navigation

Task Definition

Navigate in environments with other agents or humans, respecting social norms such as personal space, collision avoidance, yielding right-of-way, and maintaining comfortable interactions.

Key Challenges

Social Navigation must handle:
├── Proxemics — Respect personal space (Hall, 1966)
├── Collision avoidance — Dynamic obstacles (people move)
├── Yielding — Give way in narrow passages
├── Group awareness — Navigate around groups, not through them
├── Prediction — Anticipate human motion
└── Communication — Signal intent through motion
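
One common way to encode proxemics in a planner is to add a cost term that grows near predicted person positions, for example a Gaussian penalty around each person. A minimal sketch; the Gaussian form and the 1.2 m scale are illustrative choices, not a standard from any particular planner:

# Proxemics-aware cost: penalize waypoints that intrude on personal space.
# The Gaussian shape and the 1.2 m scale below are illustrative assumptions.
import math

def social_cost(waypoint, people, sigma=1.2, weight=10.0):
    """waypoint: (x, y); people: list of predicted (x, y) person positions."""
    cost = 0.0
    for px, py in people:
        d2 = (waypoint[0] - px) ** 2 + (waypoint[1] - py) ** 2
        cost += weight * math.exp(-d2 / (2 * sigma ** 2))
    return cost

# A planner adds this term to its usual path cost, so trajectories bend
# around people while still making progress toward the goal.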

Key Datasets

Dataset | Year | Setting | Key Feature
SCAND | 2022 | Indoor and outdoor, real-world | Socially compliant navigation demonstrations
SDD (Stanford Drone Dataset) | 2016 | Outdoor, aerial | Pedestrian trajectories
ETH/UCY | 2009 | Outdoor, indoor | Human trajectory prediction
Habitat 3.0 | 2023 | Simulation | Human-in-the-loop social navigation

Where It Applies

  • Hospital robots: Navigate crowded corridors
  • Airport guide robots: Move through busy terminals
  • Restaurant delivery robots: Navigate between tables

6. Other Navigation Tasks

Embodied Question Answering (EQA)

Given a question like "What color is the couch in the living room?", the agent must navigate to the living room, observe the couch, and answer the question.

  • Dataset: EQA (Das et al., CVPR 2018)
  • Extension: Habitat-EQA with multi-room questions

Audio-Visual Navigation

Navigate toward a sound source (e.g., "find the ringing phone").

  • Dataset: SoundSpaces (Chen et al., 2020)
  • Key feature: Requires audio-visual fusion

Rearrangement

Navigate to objects, pick them up, and place them in correct locations.

  • Challenge: Habitat 2023 Rearrangement track
  • Dataset: ReplicaCAD Rearrangement

Summary of Navigation Tasks

Task | Goal Specification | Key Challenge | Top Method Type | Typical SR
PointNav | Relative coordinates | Spatial reasoning | RL + mapping | ~95%
ObjectNav | Object category | Semantic search | RL + detection | ~55%
VLN | Language instruction | Language grounding | Transformer | ~60%
Exploration | None (maximize coverage) | Efficient coverage | Frontier + RL | ~90% coverage
Social Nav | Goal + social norms | Human motion prediction | Predictive planning | N/A (subjective)
EQA | Question | Navigate + reason | Multi-modal | ~45%

References

  • Anderson et al. (2018). "On Evaluation of Embodied Navigation Agents." arXiv:1807.06757
  • Batra et al. (2020). "Exploring Visual Navigation using Habitat." arXiv:2004.01261
  • Anderson et al. (2018). "Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments." CVPR 2018
  • Chaplot et al. (2020). "Learning to Explore using Active Neural SLAM." ICLR 2020
  • Mavrogiannis et al. (2022). "Core Challenges of Social Robot Navigation: A Survey." ACM Computing Surveys
  • Karnan et al. (2022). "Socially CompliAnt Navigation Dataset (SCAND): A Large-Scale Dataset of Demonstrations for Social Navigation." IEEE Robotics and Automation Letters
  • Guan et al. (2022). "A Survey on Vision-Language Navigation." arXiv:2211.11697
  • Xia et al. (2024). "Navigation in the Era of Foundation Models: A Survey." arXiv:2402.19300