Robot Navigation¶
Navigation is the foundational capability for any mobile robot — the ability to move from one location to another while avoiding obstacles and efficiently reaching goals. Modern navigation research goes far beyond simple path planning, encompassing semantic understanding, language grounding, and social awareness.
1. Point-Goal Navigation (PointNav)¶
Task Definition¶
Given a target coordinate \((x, y, z)\) relative to the agent's starting position, navigate to that location. No semantic understanding is required — the agent knows only "go 5 meters forward and 3 meters left."
Formal Specification¶
Input: Agent's current pose (position + orientation)
Goal: relative coordinates (Δx, Δy, Δz)
Output: Sequence of actions (move forward, turn left, turn right, stop)
Metric: Success Rate (SR), Success weighted by Path Length (SPL)
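To make the specification concrete, here is a minimal geometry sketch in plain NumPy (not tied to any simulator API): it converts the relative goal into the distance and heading a policy typically consumes, and checks the stop-within-radius success test. The 0.2 m success radius follows the Habitat PointNav setting and should otherwise be treated as an assumption.

```python
# Hedged sketch (NumPy only): egocentric goal geometry and the PointNav
# success test. Ground-plane (x, z) coordinates and yaw conventions are
# illustrative assumptions; SUCCESS_RADIUS is benchmark-specific.
import numpy as np

SUCCESS_RADIUS = 0.2  # metres (Habitat PointNav setting; assumption)

def goal_in_agent_frame(agent_pos, agent_yaw, goal_pos):
    """Return (distance, heading) to the goal in the agent's frame.

    agent_pos, goal_pos: (x, z) ground-plane coordinates in metres.
    agent_yaw: heading in radians (world frame).
    """
    delta = np.asarray(goal_pos, dtype=float) - np.asarray(agent_pos, dtype=float)
    distance = np.linalg.norm(delta)
    heading = np.arctan2(delta[1], delta[0]) - agent_yaw
    # Wrap heading into (-pi, pi] so "turn left" vs "turn right" is well defined.
    heading = (heading + np.pi) % (2 * np.pi) - np.pi
    return distance, heading

def is_success(agent_pos, goal_pos, called_stop):
    """The episode succeeds only if the agent calls STOP within the radius."""
    dist = np.linalg.norm(np.asarray(goal_pos, dtype=float) - np.asarray(agent_pos, dtype=float))
    return called_stop and dist <= SUCCESS_RADIUS
```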
Why It Matters¶
PointNav is the simplest navigation task, but it tests fundamental capabilities:
- Spatial reasoning: Understanding 3D space from egocentric observations
- Obstacle avoidance: Detecting and navigating around barriers
- Path efficiency: Finding short paths, not just any path
- Generalization: Working in unseen environments
Key Benchmarks¶
| Benchmark | Year | Environment | Key Feature |
|---|---|---|---|
| Habitat PointNav Challenge | 2019–present | Habitat (HM3D, MP3D) | Annual competition, photorealistic |
| Gibson PointNav | 2018 | Gibson | Real-world scanned environments |
| RoboTHOR PointNav | 2020 | AI2-THOR | Sim-to-real transfer |
Datasets¶
- Matterport3D: 10,800 panoramic RGB-D images, 90 buildings, 40 semantic categories
- HM3D: 1,000 building-scale 3D scenes (largest dataset for PointNav)
- Gibson: 572 real-world scanned buildings
State-of-the-Art Methods¶
The building blocks are broadly shared across methods; below is a minimal, runnable PyTorch sketch, in which a small CNN and a GRU memory stand in for a ResNet encoder and an explicit metric map.

```python
# Conceptual PointNav agent architecture: a runnable PyTorch sketch. The
# small CNN stands in for a ResNet encoder, and the GRU memory stands in
# for an explicit 2D/3D map module (e.g. the one in Active Neural SLAM).
import torch
import torch.nn as nn

class PointNavAgent(nn.Module):
    """
    Typical PointNav agent components:
      1. Visual encoder (ResNet / ViT)            -> extract visual features
      2. Map module (metric or implicit map)      -> build spatial memory
      3. Policy network (GRU / Transformer)       -> decide actions
    """
    def __init__(self, feature_dim=256, hidden_dim=512, num_actions=4):
        super().__init__()
        # Visual encoder: processes 4-channel RGB-D input.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feature_dim),
        )
        # Spatial memory: recurrent state over visual features + 3-D goal vector.
        self.memory = nn.GRUCell(feature_dim + 3, hidden_dim)
        # Policy head: scores for the 4 discrete actions.
        self.policy = nn.Linear(hidden_dim, num_actions)
        self.hidden = None

    def act(self, rgb, depth, goal):
        # 1. Encode the RGB-D observation: (B, 4, H, W) -> (B, feature_dim).
        features = self.visual_encoder(torch.cat([rgb, depth], dim=1))
        # 2. Update the spatial memory with features + goal direction.
        if self.hidden is None:
            self.hidden = torch.zeros(features.size(0), self.policy.in_features)
        self.hidden = self.memory(torch.cat([features, goal], dim=1), self.hidden)
        # 3. Select an action: 0=forward, 1=turn left, 2=turn right, 3=stop.
        return self.policy(self.hidden).argmax(dim=1)
```
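A quick smoke test of the sketch above with random tensors (image size, batch shape, and the goal-vector convention are illustrative only):

```python
agent = PointNavAgent()
rgb = torch.rand(1, 3, 128, 128)           # RGB frame (batch of 1)
depth = torch.rand(1, 1, 128, 128)         # aligned depth frame
goal = torch.tensor([[5.0, 0.0, -3.0]])    # relative goal vector (axis convention illustrative)
action = agent.act(rgb, depth, goal)       # tensor([k]) with k in {0, 1, 2, 3}
```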
Where It Applies¶
- Warehouse robots navigating to shelf locations
- Delivery robots moving to drop-off points
- Drone navigation to GPS coordinates
Frontier-Based Exploration Demo¶

2. Object-Goal Navigation (ObjectNav)¶
Task Definition¶
Given a semantic category (e.g., "find the refrigerator"), navigate to an instance of that object without being given coordinates. The agent must understand what objects look like and where they are typically found.
Formal Specification¶
Input: Agent's current pose + object category (e.g., "chair")
Output: Sequence of actions leading to an instance of that category
Metric: Success Rate (SR), SPL, Soft SPL
Why It's Harder Than PointNav¶
| PointNav | ObjectNav |
|---|---|
| Knows exact goal location | Must search for goal |
| No semantic understanding needed | Must recognize object categories |
| Can plan path directly | Must explore + recognize |
| Pure spatial reasoning | Spatial + semantic reasoning |
Key Benchmarks¶
| Benchmark | Year | Scenes | Objects | Key Feature |
|---|---|---|---|---|
| Habitat ObjectNav Challenge | 2021–present | HM3D (1000) | 6 categories | Annual competition |
| RoboTHOR ObjectNav | 2020 | 75 rooms | 19 categories | Sim-to-real |
| ProcTHOR ObjectNav | 2022 | 10,000 houses | 50+ categories | Procedurally generated |
Datasets¶
- HM3D-Semantics: 1,000 scenes with semantic annotations (furniture, appliances, etc.)
- Gibson: 572 buildings with object labels
- ProcTHOR: 10,000 procedurally generated houses (scalable training data)
Object Categories (Habitat Challenge)¶
Standard 6 categories:
1. chair — Seating furniture
2. bed — Sleeping furniture
3. plant — Indoor vegetation
4. toilet — Bathroom fixture
5. tv_monitor — Screens and displays
6. sofa — Couches and settees
Typical Pipeline¶
Observation (RGB-D)
    → Object Detector (YOLO / SAM)
    → Semantic Map (2D grid)
    → Frontier Explorer (move toward unexplored frontiers)
    → Policy (DRL)
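As one concrete piece of this pipeline, here is a hedged sketch of the semantic-map step: a detection's centre pixel is back-projected using the depth value and a pinhole camera model, transformed by the agent's planar pose, and written into a top-down grid. The intrinsics, cell size, and ROS-style axis convention (x forward, y left, yaw counter-clockwise) are assumptions.

```python
# Hedged sketch of the "Semantic Map" step: project one detection into a
# top-down 2-D grid. Intrinsics, cell size, and coordinate conventions are
# illustrative assumptions, not any particular system's API.
import numpy as np

CELL_SIZE = 0.05  # metres per grid cell (assumption)

def project_detection(u, depth_m, fx, cx, agent_xy, agent_yaw, grid_origin_xy):
    """Map a detection's centre column `u` (pixels) to a (row, col) map cell."""
    # 1. Back-project the pixel into the agent frame (x forward, y left).
    forward = depth_m
    left = -(u - cx) * depth_m / fx
    # 2. Rotate/translate into world ground-plane coordinates using the agent pose.
    wx = agent_xy[0] + forward * np.cos(agent_yaw) - left * np.sin(agent_yaw)
    wy = agent_xy[1] + forward * np.sin(agent_yaw) + left * np.cos(agent_yaw)
    # 3. Discretise to grid indices relative to the map origin.
    col = int((wx - grid_origin_xy[0]) / CELL_SIZE)
    row = int((wy - grid_origin_xy[1]) / CELL_SIZE)
    return row, col

# The semantic map is then just {category: set of cells}; the frontier
# explorer (Section 4) takes over whenever the goal category is still unmapped.
semantic_map = {}
cell = project_detection(u=410, depth_m=2.3, fx=575.0, cx=320.0,
                         agent_xy=(1.0, 0.5), agent_yaw=0.0,
                         grid_origin_xy=(-10.0, -10.0))
semantic_map.setdefault("chair", set()).add(cell)
```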
Where It Applies¶
- Household robots: "Bring me the mug from the kitchen"
- Service robots: "Find the nearest exit"
- Search and rescue: "Locate injured persons"
3. Vision-Language Navigation (VLN)¶
Task Definition¶
Follow natural language instructions to navigate through environments. Unlike PointNav/ObjectNav, the goal is specified in human language, requiring the agent to ground language in visual observations.
Example Instructions¶
R2R dataset:
"Walk past the piano and turn right. Go down the hallway and
enter the second door on your left."
REVERIE dataset:
"Go to the mug on the table in the kitchen."
RxR dataset (multilingual: English, Hindi, and Telugu):
Instructions follow the same style as R2R but are longer, are written independently in all three languages, and are time-aligned with the annotator's pose trace (dense grounding).
Key Datasets¶
| Dataset | Year | Instructions | Scenes | Language | Key Feature |
|---|---|---|---|---|---|
| R2R | 2018 | 21,567 | 90 (MP3D) | English | First VLN benchmark |
| RxR | 2020 | 126,000+ | 90 (MP3D) | EN/HI/TE | Multilingual, dense grounding |
| REVERIE | 2020 | 21,702 | 90 (MP3D) | English | Remote referring expressions |
| SOON | 2021 | 4,000+ | 80 | English | Goal described by the target object and its surroundings; agent may start anywhere |
| CVDN | 2020 | 7,441 dialogs | 30+ | English | Dialog-based navigation |
| ALFRED | 2020 | 25,743 | 120 (THOR) | English | Navigation + manipulation |
Evaluation Metrics¶
| Metric | Definition |
|---|---|
| SR (Success Rate) | % of episodes where agent stops within 3m of goal |
| SPL (Success weighted by Path Length) | Per-episode success × shortest-path length ÷ max(shortest path, actual path), averaged over episodes |
| NE (Navigation Error) | Average distance from agent to goal at episode end |
| OSR (Oracle Success Rate) | % of episodes where agent was within 3m at any point |
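For reference, the standard definition from Anderson et al. (2018), where \(N\) is the number of episodes, \(S_i\) indicates success on episode \(i\), \(\ell_i\) is the shortest-path distance from start to goal, and \(p_i\) is the length of the path the agent actually took:

\[
\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i,\ \ell_i)}
\]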
State-of-the-Art Approaches¶
Evolution of VLN methods:
2018: Seq2Seq — LSTM encoder-decoder baseline released with R2R
2018: Speaker-Follower — data augmentation with an instruction-generating speaker model
2019: PRESS — pretrained language model (BERT) for instruction encoding
2019: EnvDrop — environment dropout + back-translation for better generalization
2021: VLN-BERT — recurrent transformer with cross-modal attention
2021: HAMT — History-Aware Multimodal Transformer over the full observation history
2023: NavGPT — large language models prompted as zero-shot navigation reasoners
2024: NaVid and other LLM/VLM-based agents — foundation models as navigation planners
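As a flavour of the transformer-era methods, here is a minimal cross-modal decision step in the spirit of VLN-BERT / HAMT (not a reproduction of either): the pooled instruction-plus-history state attends over the features of the navigable candidate viewpoints, and the best-scoring candidate becomes the next move. Dimensions and the dot-product scoring are illustrative assumptions.

```python
# Hedged sketch of one cross-modal decision step in a transformer-style VLN agent.
import torch
import torch.nn as nn

class CrossModalStep(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Query: instruction/history state. Keys/values: candidate view features.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, state, candidate_feats):
        """
        state:           (B, 1, dim) pooled instruction + history embedding
        candidate_feats: (B, K, dim) visual features of K navigable candidates
        returns:         (B,) index of the chosen candidate viewpoint
        """
        # Cross-modal attention: language/history queries the visual candidates.
        fused, _ = self.attn(query=state, key=candidate_feats, value=candidate_feats)
        # Score each candidate against the fused state and pick the best one.
        logits = (candidate_feats * fused).sum(dim=-1)   # (B, K) dot-product scores
        return logits.argmax(dim=-1)

# Usage with dummy tensors:
step = CrossModalStep()
state = torch.rand(2, 1, 256)
candidates = torch.rand(2, 5, 256)    # 5 navigable viewpoints
next_view = step(state, candidates)   # tensor of shape (2,)
```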
Where It Applies¶
- Household assistants: "Go to the bedroom and bring my glasses"
- Service robots in hotels: "Take this to room 305"
- Museum guides: "Take visitors to the Monet exhibition"
4. Exploration / Active Mapping¶
Task Definition¶
Autonomously explore an unknown environment to build a complete map or maximize coverage. Unlike goal-directed navigation, there is no specific target — the objective is to understand the environment as thoroughly and efficiently as possible.
Formal Specification¶
Input: Agent's current pose + observations from unknown environment
Output: Exploration policy (where to look next)
Metric: Coverage (% of environment mapped), efficiency (steps to full map)
Key Methods¶
| Method | Year | Approach | Reference |
|---|---|---|---|
| Active Neural SLAM | 2020 | Learnable SLAM + exploration policy | Chaplot et al., ICLR 2020 |
| FAI (Flood Fill Exploration) | 2023 | Frontier-based with learned value | Chi et al. |
| BEV Exploration | 2024 | Bird's-eye view representation | Various |
Exploration Strategies¶
1. Frontier-Based Exploration (classical)
┌─────────────────────────┐
│ known ░░░░░░░░░░░░      │
│ space ░░ ? ? ? ░░       │   ← frontier (░): the boundary between
│       ░░ ? ? ? ░░       │     known (mapped) and unknown (?) space
│ known ░░░░░░░░░░░░      │
└─────────────────────────┘
Move to nearest frontier → observe → update map → repeat
2. Learned Exploration (modern)
Use RL to learn a policy that maximizes coverage
Input: current map + visited areas
Output: next waypoint to visit
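A minimal sketch of classical frontier selection on a 2-D occupancy grid (the cell encoding and the nearest-frontier heuristic are simplifying assumptions; practical systems cluster frontiers and plan paths through the map rather than using straight-line distance):

```python
# Minimal frontier-based exploration sketch on a 2-D occupancy grid.
# Cell states: -1 = unknown, 0 = known free, 1 = known occupied.
import numpy as np

UNKNOWN, FREE, OCCUPIED = -1, 0, 1

def frontier_cells(grid):
    """Free cells that have at least one unknown 4-neighbour."""
    frontiers = []
    rows, cols = grid.shape
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != FREE:
                continue
            neighbours = [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))]
            if any(0 <= nr < rows and 0 <= nc < cols and grid[nr, nc] == UNKNOWN
                   for nr, nc in neighbours):
                frontiers.append((r, c))
    return frontiers

def next_waypoint(grid, agent_cell):
    """Nearest frontier cell, or None once the map is fully explored."""
    frontiers = frontier_cells(grid)
    if not frontiers:
        return None
    return min(frontiers, key=lambda f: np.hypot(f[0] - agent_cell[0], f[1] - agent_cell[1]))

# Toy map: a 5x5 area where only the left two columns have been observed so far.
grid = np.full((5, 5), UNKNOWN, dtype=np.int8)
grid[:, :2] = FREE
print(next_waypoint(grid, agent_cell=(2, 0)))   # -> (2, 1), a cell on the frontier column
```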
Where It Applies¶
- Search and rescue: Explore collapsed buildings
- Space exploration: Map unknown planetary surfaces
- Home robots: Map a new apartment on first deployment
5. Social Navigation¶
Task Definition¶
Navigate in environments with other agents or humans, respecting social norms such as personal space, collision avoidance, yielding right-of-way, and maintaining comfortable interactions.
Key Challenges¶
Social Navigation must handle:
├── Proxemics — Respect personal space (Hall, 1966)
├── Collision avoidance — Dynamic obstacles (people move)
├── Yielding — Give way in narrow passages
├── Group awareness — Navigate around groups, not through them
├── Prediction — Anticipate human motion
└── Communication — Signal intent through motion
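One common way such constraints enter a planner is as additional cost terms layered on the collision costmap. Below is a minimal sketch of a proxemics-style penalty: each predicted person contributes a Gaussian personal-space cost, and candidate waypoints are scored by progress toward the goal plus social cost. The sigma and weight values are assumptions, loosely inspired by Hall's proxemic zones.

```python
# Hedged sketch of a proxemics-aware waypoint cost. Parameter values are
# illustrative assumptions, not tuned constants from any published system.
import numpy as np

PERSONAL_SPACE_SIGMA = 1.2   # metres (assumption, roughly Hall's personal zone)
SOCIAL_WEIGHT = 5.0          # trade-off between progress and comfort (assumption)

def social_cost(point, predicted_people):
    """Sum of Gaussian personal-space penalties at `point` (x, y)."""
    point = np.asarray(point, dtype=float)
    cost = 0.0
    for person in predicted_people:                       # predicted (x, y) positions
        d = np.linalg.norm(point - np.asarray(person, dtype=float))
        cost += np.exp(-0.5 * (d / PERSONAL_SPACE_SIGMA) ** 2)
    return cost

def choose_waypoint(candidates, goal, predicted_people):
    """Pick the candidate balancing progress toward the goal and social comfort."""
    goal = np.asarray(goal, dtype=float)
    return min(candidates,
               key=lambda c: np.linalg.norm(np.asarray(c, dtype=float) - goal)
                             + SOCIAL_WEIGHT * social_cost(c, predicted_people))

# Example: the robot detours around a person rather than cutting past them.
people = [(2.0, 0.5)]
print(choose_waypoint([(2.0, 0.0), (2.0, -1.5)], goal=(4.0, 0.0), predicted_people=people))
# -> (2.0, -1.5)
```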
Key Datasets¶
| Dataset | Year | Setting | Key Feature |
|---|---|---|---|
| SCAND | 2022 | Indoor + outdoor, real-world | Socially compliant navigation demonstrations |
| SDD (Stanford Drone Dataset) | 2016 | Outdoor, aerial | Pedestrian trajectories |
| ETH/UCY | 2009 | Outdoor, indoor | Human trajectory prediction |
| Habitat 3.0 | 2024 | Simulation | Human-in-the-loop social navigation |
Where It Applies¶
- Hospital robots: Navigate crowded corridors
- Airport guide robots: Move through busy terminals
- Restaurant delivery robots: Navigate between tables
6. Other Navigation Tasks¶
Embodied Question Answering (EQA)¶
Given a question like "What color is the couch in the living room?", the agent must navigate to the living room, observe the couch, and answer the question.
- Dataset: EQA (Das et al., CVPR 2018)
- Extension: Habitat-EQA with multi-room questions
Audio-Visual Navigation¶
Navigate toward a sound source (e.g., "find the ringing phone").
- Dataset: SoundSpaces (Chen et al., 2020)
- Key feature: Requires audio-visual fusion
Rearrangement¶
Navigate to objects, pick them up, and place them in correct locations.
- Challenge: Habitat 2023 Rearrangement track
- Dataset: ReplicaCAD Rearrangement
Navigation Task Comparison¶
| Task | Goal Specification | Key Challenge | Top Method Type | Typical SR |
|---|---|---|---|---|
| PointNav | Relative coordinates | Spatial reasoning | RL + map | ~95% |
| ObjectNav | Object category | Semantic search | RL + detection | ~55% |
| VLN | Language instruction | Language grounding | Transformer | ~60% |
| Exploration | None (maximize coverage) | Efficient coverage | Frontier + RL | ~90% coverage |
| Social Nav | Goal + social norms | Human prediction | Predictive planning | N/A (subjective) |
| EQA | Question | Navigate + reason | Multi-modal | ~45% |
References¶
- Anderson et al. (2018). "On Evaluation of Embodied Navigation Agents." arXiv:1807.06757
- Batra et al. (2020). "Exploring Visual Navigation using Habitat." arXiv:2004.01261
- Anderson et al. (2018). "Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments." CVPR 2018
- Chaplot et al. (2020). "Learning to Explore using Active Neural SLAM." ICLR 2020
- Mavrogiannis et al. (2023). "Core Challenges of Social Robot Navigation: A Survey." ACM Transactions on Human-Robot Interaction
- Karnan et al. (2022). "Socially CompliAnt Navigation Dataset (SCAND): A Large-Scale Dataset of Demonstrations for Social Navigation." IEEE Robotics and Automation Letters
- Guan et al. (2022). "A Survey on Vision-Language Navigation." arXiv:2211.11697
- Xia et al. (2024). "Navigation in the Era of Foundation Models: A Survey." arXiv:2402.19300