SLAM: Simultaneous Localization and Mapping

SLAM is the problem of simultaneously building a map of an unknown environment and localizing the robot within that map. It is one of the most fundamental problems in robotics — nearly every autonomous robot needs to know "where am I?" and "what does the world look like?"

For ROS implementation details, see ROS SLAM Tutorial.

The SLAM Problem

Formal Definition

Given:
  - Robot observations z_{1:t} (camera images, LiDAR scans, IMU readings)
  - Robot controls u_{1:t} (odometry, wheel encoders)

Estimate:
  - Robot trajectory x_{1:t} (where has the robot been?)
  - Map m (what does the environment look like?)

Jointly:
  p(x_{1:t}, m | z_{1:t}, u_{1:t})
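
Under the usual Markov assumptions, this posterior factors recursively into a measurement model, a motion model, and the posterior from the previous step, which is the starting point for both filtering-based and graph-based SLAM:

\[ p(x_{1:t}, m \mid z_{1:t}, u_{1:t}) \propto p(z_t \mid x_t, m) \, p(x_t \mid x_{t-1}, u_t) \, p(x_{1:t-1}, m \mid z_{1:t-1}, u_{1:t-1}) \]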

Why It's Hard

SLAM challenges:
├── Chicken-and-egg — Need location to build map, need map to localize
├── Data association — Is this the same place I visited before? (loop closure; see the sketch after this list)
├── Uncertainty — Sensors are noisy, odometry drifts
├── Scalability — Maps grow with exploration time
├── Dynamic objects — People, cars, doors change the environment
└── Multi-modal — Different sensors have different strengths
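
As a toy illustration of the data-association challenge, the sketch below flags loop-closure candidates by comparing a global descriptor of the current frame against stored keyframe descriptors with cosine similarity. The descriptors themselves (e.g. bag-of-visual-words or learned place-recognition vectors) are assumed to be computed elsewhere; real systems additionally verify candidates geometrically before closing a loop.

import numpy as np

def detect_loop_candidates(current_desc, keyframe_descs, threshold=0.85):
    """Naive place recognition via cosine similarity of global descriptors.

    current_desc: (D,) descriptor of the current frame (assumed precomputed).
    keyframe_descs: (N, D) descriptors of past keyframes.
    Returns indices of keyframes that look like the same place.
    """
    cur = current_desc / np.linalg.norm(current_desc)
    kfs = keyframe_descs / np.linalg.norm(keyframe_descs, axis=1, keepdims=True)
    similarities = kfs @ cur          # cosine similarity to every keyframe
    return np.flatnonzero(similarities > threshold)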

SLAM with Loop Closure Demo

[Animation: SLAM mapping with loop closure]

SLAM Variants

1. Visual SLAM (vSLAM)

Uses cameras as the primary sensor. The most popular approach due to low cost and rich information.

| Method | Year | Type | Key Feature | Reference |
| --- | --- | --- | --- | --- |
| ORB-SLAM3 | 2021 | Feature-based | Monocular, stereo, RGB-D, IMU | Campos et al. |
| LSD-SLAM | 2014 | Direct (dense) | Semi-dense maps from monocular | Engel et al. |
| DSO | 2017 | Direct (sparse) | Photometric bundle adjustment | Engel et al. |
| VINS-Mono | 2018 | Feature-based + IMU | Tightly-coupled VIO | Qin et al. |
| OpenVSLAM | 2019 | Feature-based | Modular architecture | Sumikura et al. |
| DROID-SLAM | 2021 | Deep learning | Learned SLAM, high accuracy | Teed et al. |
| SplaTAM | 2024 | Gaussian Splatting | 3DGS-based SLAM | Keetha et al. |

Visual SLAM Pipeline

Camera Image
       │
       ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Feature    │────▶│   Feature    │────▶│   Motion     │
│  Extraction  │     │   Matching   │     │  Estimation  │
│  (ORB, SIFT, │     │  (BFMatcher, │     │  (PnP, ICP,  │
│  SuperPoint) │     │  LightGlue)  │     │   BA)        │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
                                                 ▼
                                          ┌──────────────┐
                                          │     Map      │
                                          │    Update    │
                                          │  (keyframes, │
                                          │  landmarks)  │
                                          └──────┬───────┘
                                                 │
                                                 ▼
                                          ┌──────────────┐
                                          │     Loop     │
                                          │   Closure    │
                                          │   (detect    │
                                          │   revisits)  │
                                          └──────────────┘
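
A minimal sketch of the front half of this pipeline (extraction, matching, motion estimation) for a single pair of monocular frames, using OpenCV. The intrinsics matrix K and the two grayscale images are assumed given; keyframe selection, map update, and loop closure are omitted, and the recovered translation is only defined up to scale for a monocular camera.

import cv2
import numpy as np

def two_view_motion(img_prev, img_curr, K):
    """Estimate the relative camera motion between two frames (up to scale)."""
    # Feature extraction: ORB keypoints + binary descriptors
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)

    # Feature matching: brute force with Hamming distance (suitable for ORB)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Motion estimation: essential matrix with RANSAC, then pose recovery
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t   # rotation matrix and unit-norm translation direction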

Feature-based vs Direct Methods

| Aspect | Feature-based | Direct |
| --- | --- | --- |
| How it works | Extract keypoints, match between frames | Use pixel intensities directly |
| Examples | ORB-SLAM3, VINS-Mono | LSD-SLAM, DSO |
| Robustness | High (invariant to lighting) | Lower (sensitive to exposure changes) |
| Map density | Sparse (point cloud) | Dense / semi-dense |
| Accuracy | Good | Often better in texture-rich scenes |
| Speed | Fast | Slower (pixel-level optimization) |
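
To make "use pixel intensities directly" concrete, the sketch below evaluates the photometric residual of a single reference pixel under a candidate relative pose: the pixel is back-projected with its depth, transformed into the target frame, re-projected, and the intensity difference is returned. Direct methods minimize the sum of many such residuals; the sub-pixel interpolation and robust weighting used by real systems are omitted here.

import numpy as np

def photometric_residual(I_ref, I_tgt, u, v, depth, K, R, t):
    """Photometric error of one reference pixel (u, v) with known depth.

    I_ref, I_tgt: grayscale images as float arrays.
    K: 3x3 intrinsics; (R, t): candidate reference-to-target pose.
    """
    # Back-project the reference pixel to a 3D point in the reference frame
    p_ref = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    # Transform into the target frame and project back to pixel coordinates
    p_tgt = K @ (R @ p_ref + t)
    u2, v2 = p_tgt[0] / p_tgt[2], p_tgt[1] / p_tgt[2]
    # Nearest-neighbour lookup (real systems interpolate sub-pixel values)
    return float(I_ref[int(v), int(u)] - I_tgt[int(round(v2)), int(round(u2))])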

2. LiDAR SLAM

Uses laser range finders for precise 3D mapping. Typically more accurate than visual SLAM, but the sensors are considerably more expensive.

| Method | Year | Type | Key Feature |
| --- | --- | --- | --- |
| LOAM | 2014 | Feature-based | LiDAR odometry + mapping |
| LeGO-LOAM | 2018 | Feature-based | Lightweight, ground optimization |
| LIO-SAM | 2020 | Tightly-coupled | LiDAR + IMU factor graph |
| FAST-LIO2 | 2021 | Iterated EKF | Fast, lightweight |
| CT-ICP | 2021 | Point-to-point | Continuous-time ICP |
| KISS-ICP | 2023 | Simple ICP | "Keep It Simple and Scalable" |
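
Most of these systems are built around some variant of ICP scan registration. A minimal point-to-point ICP (nearest-neighbour association plus closed-form SVD alignment) is sketched below under simple assumptions; production pipelines such as KISS-ICP add voxel downsampling, adaptive correspondence thresholds, robust kernels, and motion undistortion.

import numpy as np
from scipy.spatial import cKDTree

def icp_point_to_point(source, target, iters=20):
    """Align an (N, 3) source scan to an (M, 3) target scan.

    Returns a 4x4 transform T that roughly overlays `source` onto `target`.
    """
    T = np.eye(4)
    src = source.copy()
    tree = cKDTree(target)
    for _ in range(iters):
        # Data association: nearest target point for every source point
        _, idx = tree.query(src)
        corr = target[idx]
        # Closed-form rigid alignment of the matched sets (Kabsch / SVD)
        mu_s, mu_t = src.mean(axis=0), corr.mean(axis=0)
        H = (src - mu_s).T @ (corr - mu_t)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:          # guard against reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_t - R @ mu_s
        # Apply the incremental update and accumulate the total transform
        src = src @ R.T + t
        T_step = np.eye(4)
        T_step[:3, :3], T_step[:3, 3] = R, t
        T = T_step @ T
    return T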

3. RGB-D SLAM

Uses depth cameras (RealSense, Kinect) for dense 3D reconstruction.

| Method | Year | Key Feature |
| --- | --- | --- |
| RTAB-Map | 2014 | Multi-session, graph-based |
| ElasticFusion | 2015 | Real-time dense SLAM |
| BundleFusion | 2017 | Global bundle adjustment |
| NICE-SLAM | 2022 | Neural implicit SLAM |
| SplaTAM | 2024 | 3D Gaussian Splatting SLAM |
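
A basic building block shared by these systems is turning each depth image into a 3D point cloud with the pinhole camera model. A minimal back-projection sketch is shown below; fx, fy, cx, cy are the (assumed known) camera intrinsics, and depth is in meters.

import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) metric depth image into an (N, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]       # drop pixels with invalid (zero) depth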

4. Learning-Based SLAM

Recent trend: replace hand-crafted components with learned ones.

| Approach | Example | Year | Learning Target |
| --- | --- | --- | --- |
| Learned features | SuperPoint + SuperGlue | 2018, 2020 | Keypoint detection + matching |
| Learned SLAM | DROID-SLAM | 2021 | End-to-end visual odometry |
| Neural implicit | iMAP, NICE-SLAM | 2021, 2022 | Neural radiance field as map |
| Gaussian Splatting | SplaTAM, MonoGS | 2024 | 3DGS as map representation |

SLAM Datasets

Indoor Datasets

| Dataset | Year | Sensor | Environment | Key Feature |
| --- | --- | --- | --- | --- |
| TUM RGB-D | 2012 | Kinect | Office rooms | 39 sequences, ground truth |
| ICL-NUIM | 2014 | Synthetic | Living room/office | Perfect ground truth |
| EuRoC MAV | 2016 | Stereo + IMU | Machine hall, room | Micro aerial vehicle |
| TartanAir | 2020 | Stereo | Various (sim) | Diverse environments, HD |
| Replica | 2019 | Synthetic | Indoor rooms | High-fidelity 3D reconstructions |
| ScanNet | 2017 | RGB-D | 1513 scenes | Semantic labels |

Outdoor Datasets

| Dataset | Year | Sensor | Environment | Key Feature |
| --- | --- | --- | --- | --- |
| KITTI | 2012 | Stereo + LiDAR | Urban driving | Standard benchmark |
| nuScenes | 2019 | LiDAR + cameras | Urban, Boston/Singapore | 1000 scenes, 3D annotations |
| Waymo Open | 2019 | LiDAR + cameras | Urban/suburban | 1150 scenes |
| MulRan | 2020 | LiDAR | Urban, multi-session | Long-term relocalization |
| Oxford RobotCar | 2016 | Multi-sensor | Urban Oxford | 1000+ km, multi-weather |
| Hilti SLAM Challenge | 2022 | Multi-sensor | Construction sites | Multi-floor SLAM |

Dataset Details

TUM RGB-D (The Standard Indoor Benchmark)

TUM RGB-D Dataset:
├── 39 sequences across several scenarios, for example:
│   ├── fr1_xyz        — Slow, structured motion
│   ├── fr1_desk       — Desktop objects
│   ├── fr2_xyz        — Larger workspace
│   ├── fr3_office     — Full office
│   └── fr1_room       — Complete room traversal
├── Sensor: Microsoft Kinect v1
├── Resolution: 640×480 @ 30Hz
├── Ground truth: Motion capture system
└── Evaluation: ATE (Absolute Trajectory Error)
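
TUM ground-truth and estimated trajectories are plain text files with one pose per line in the format "timestamp tx ty tz qx qy qz qw". A small loader sketch is shown below (it pairs with the ATE/RPE code later on this page); timestamp association between the two trajectories is omitted.

import numpy as np
from scipy.spatial.transform import Rotation

def load_tum_trajectory(path):
    """Load a TUM-format trajectory file as a list of (timestamp, 4x4 pose)."""
    poses = []
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue                      # skip comments and blank lines
            vals = [float(x) for x in line.split()]
            ts, t, q = vals[0], vals[1:4], vals[4:8]
            T = np.eye(4)
            T[:3, :3] = Rotation.from_quat(q).as_matrix()  # q = (qx, qy, qz, qw)
            T[:3, 3] = t
            poses.append((ts, T))
    return poses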

KITTI (The Standard Outdoor Benchmark)

KITTI Dataset:
├── Stereo + Velodyne LiDAR + GPS/IMU
├── Urban, suburban, highway scenarios
├── Sequences: 22 total (00-10 with public ground truth, 11-21 held out for the test leaderboard)
├── Ground truth: GPS/RTK (cm-level)
├── Evaluation:
│   ├── t_err — Translational error (%)
│   └── r_err — Rotational error (deg/100m)
└── Odometry benchmark leaderboard
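
KITTI odometry ground truth is distributed as one text file per sequence, with 12 numbers per line: the top three rows of a 4x4 camera-to-world transform in row-major order. A small loader sketch:

import numpy as np

def load_kitti_poses(path):
    """Load a KITTI odometry poses file as a list of 4x4 camera-to-world matrices."""
    poses = []
    with open(path) as f:
        for line in f:
            vals = np.array(line.split(), dtype=float)
            T = np.eye(4)
            T[:3, :4] = vals.reshape(3, 4)    # first 3 rows of the transform
            poses.append(T)
    return poses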

Evaluation Metrics

Absolute Trajectory Error (ATE)

Measures the global consistency of the estimated trajectory.

\[ \text{ATE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \| \hat{t}_i - t_i \|^2} \]

Where \(\hat{t}_i\) is the estimated position and \(t_i\) is the ground truth position. In practice, the estimated trajectory is first rigidly aligned to the ground truth (e.g. with Horn's method) before the error is computed.

Relative Pose Error (RPE)

Measures local accuracy over fixed time intervals.

\[ E_i = \left( T_i^{-1}\, T_{i+\Delta} \right)^{-1} \left( \hat{T}_i^{-1}\, \hat{T}_{i+\Delta} \right), \qquad \text{RPE}_{\text{trans}} = \sqrt{\frac{1}{N-\Delta} \sum_{i=1}^{N-\Delta} \left\| \text{trans}(E_i) \right\|^2} \]

where \(\text{trans}(\cdot)\) extracts the translational part of the relative error \(E_i\); the rotational RPE is defined analogously from the rotation angle of \(E_i\).

Comparison Table

| Metric | Measures | Sensitive to | Use Case |
| --- | --- | --- | --- |
| ATE | Global consistency | Scale, rotation, translation | Loop closure quality |
| RPE | Local accuracy | Drift over short intervals | Odometry quality |
| Map quality | 3D reconstruction | Completeness, accuracy | Mapping applications |

Evaluation Code

import numpy as np
from scipy.spatial.transform import Rotation

def compute_ate(estimated_poses, ground_truth_poses):
    """
    Compute Absolute Trajectory Error (ATE).

    Args:
        estimated_poses: List of 4x4 SE(3) matrices
        ground_truth_poses: List of 4x4 SE(3) matrices

    Returns:
        ate: Root mean squared error (meters). Note: this sketch assumes both
            trajectories are already expressed in a common frame (no alignment step).
    """
    errors = []
    for T_est, T_gt in zip(estimated_poses, ground_truth_poses):
        # Translation error
        t_est = T_est[:3, 3]
        t_gt = T_gt[:3, 3]
        error = np.linalg.norm(t_est - t_gt)
        errors.append(error)

    ate = np.sqrt(np.mean(np.array(errors)**2))
    return ate

def compute_rpe(estimated_poses, ground_truth_poses, delta=1):
    """
    Compute Relative Pose Error (RPE).

    Args:
        estimated_poses: List of 4x4 SE(3) matrices
        ground_truth_poses: List of 4x4 SE(3) matrices
        delta: Frame interval for comparison

    Returns:
        trans_err: Mean translational error (meters)
        rot_err: Mean rotational error (degrees)
    """
    trans_errors = []
    rot_errors = []

    for i in range(len(estimated_poses) - delta):
        # Relative poses
        T_est_rel = np.linalg.inv(estimated_poses[i]) @ estimated_poses[i + delta]
        T_gt_rel = np.linalg.inv(ground_truth_poses[i]) @ ground_truth_poses[i + delta]

        # Error
        T_err = np.linalg.inv(T_gt_rel) @ T_est_rel

        # Translation error
        trans_errors.append(np.linalg.norm(T_err[:3, 3]))

        # Rotation error (rotation angle = norm of the rotation vector)
        r = Rotation.from_matrix(T_err[:3, :3])
        rot_errors.append(np.linalg.norm(r.as_rotvec(degrees=True)))

    return np.mean(trans_errors), np.mean(rot_errors)
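
A quick synthetic sanity check for the two functions above: a straight-line ground-truth trajectory with a constant 5 cm lateral offset injected into the estimate. The offset shows up in ATE but cancels in RPE, since the relative motions are identical.

if __name__ == "__main__":
    gt, est = [], []
    for i in range(100):
        T = np.eye(4)
        T[0, 3] = 0.1 * i            # ground truth moves along x in 10 cm steps
        gt.append(T)
        T_noisy = T.copy()
        T_noisy[1, 3] = 0.05         # estimate is offset by 5 cm in y
        est.append(T_noisy)

    print("ATE: %.3f m" % compute_ate(est, gt))            # ~0.050
    print("RPE: %.3f m, %.3f deg" % compute_rpe(est, gt))  # ~0.000, ~0.000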

Neural Implicit SLAM

Replace traditional maps with neural radiance fields (NeRF) or 3D Gaussian Splatting (3DGS).

Traditional SLAM:     Map = sparse points + keyframes
Neural SLAM:          Map = neural network (implicit function)
Gaussian Splatting:   Map = 3D Gaussian primitives

Advantages:
- Dense, photorealistic maps
- Novel view synthesis
- Compact representation

Challenges:
- Computationally expensive
- Real-time performance is hard
- Loop closure in neural maps
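
For concreteness, each map element in a 3DGS-based system stores roughly the following parameters (a generic sketch of the common 3D Gaussian Splatting parameterization, not the data structure of any specific SLAM system):

from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    """One primitive in a 3D Gaussian Splatting map (sketch).

    The covariance is parameterized as R(quat) @ diag(scale)**2 @ R(quat).T,
    which keeps it positive semi-definite during optimization.
    """
    mean: np.ndarray      # (3,) position in world coordinates
    quat: np.ndarray      # (4,) rotation of the principal axes
    scale: np.ndarray     # (3,) per-axis standard deviations
    opacity: float        #      alpha used during rasterization
    color: np.ndarray     # (3,) RGB (full systems often use SH coefficients)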

Foundation Models for SLAM

| Approach | Example | How It Helps |
| --- | --- | --- |
| Learned features | SuperPoint, LightGlue | Better matching in challenging conditions |
| Semantic SLAM | ConceptFusion | Build semantic maps |
| Language-grounded SLAM | NLMap | "Find objects near the red couch" |
| Depth estimation | DPT, Depth Anything | Monocular depth for SLAM |

References

  • Cadena et al. (2016). "Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age." IEEE T-RO
  • Campos et al. (2021). "ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM." IEEE T-RO
  • Qin et al. (2018). "VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator." IEEE T-RO
  • Shan et al. (2020). "LIO-SAM: Tightly-coupled Lidar Inertial Odometry via Smoothing and Mapping." IROS 2020
  • Teed & Deng (2021). "DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras." NeurIPS 2021
  • Keetha et al. (2024). "SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM." CVPR 2024
  • Sturm et al. (2012). "A Benchmark for the Evaluation of RGB-D SLAM Systems." IROS 2012
  • Geiger et al. (2012). "Are we ready for autonomous driving? The KITTI vision benchmark suite." CVPR 2012