SLAM: Simultaneous Localization and Mapping
SLAM is the problem of simultaneously building a map of an unknown environment and localizing the robot within that map. It is one of the most fundamental problems in robotics — nearly every autonomous robot needs to know "where am I?" and "what does the world look like?"
For ROS implementation details, see ROS SLAM Tutorial.
The SLAM Problem
Formal Definition
Given:
- Robot observations \(z_{1:t}\) (camera images, LiDAR scans, IMU readings)
- Robot controls \(u_{1:t}\) (odometry, wheel encoders)
Estimate:
- Robot trajectory \(x_{1:t}\) (where has the robot been?)
- Map \(m\) (what does the environment look like?)
Jointly, as the posterior:
\[
p(x_{1:t}, m \mid z_{1:t}, u_{1:t})
\]
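Under the usual Markov assumptions, this posterior factors into a motion model \(p(x_k \mid x_{k-1}, u_k)\) and an observation model \(p(z_k \mid x_k, m)\), which are the two terms that filtering and graph-based back ends actually work with. A sketch of the factorization (uniform priors assumed, \(x_0\) a fixed initial pose):
\[
p(x_{1:t}, m \mid z_{1:t}, u_{1:t}) \;\propto\; \prod_{k=1}^{t} p(x_k \mid x_{k-1}, u_k)\, p(z_k \mid x_k, m)
\]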
Why It's Hard
SLAM challenges:
├── Chicken-and-egg — Need location to build map, need map to localize
├── Data association — Is this the same place I visited before? (loop closure)
├── Uncertainty — Sensors are noisy, odometry drifts
├── Scalability — Maps grow with exploration time
├── Dynamic objects — People, cars, doors change the environment
└── Multi-modal — Different sensors have different strengths
SLAM Variants
1. Visual SLAM (vSLAM)
Uses cameras as the primary sensor. The most popular approach due to low cost and rich information.
| Method | Year | Type | Key Feature | Reference |
|---|---|---|---|---|
| ORB-SLAM3 | 2021 | Feature-based | Monocular, stereo, RGB-D, IMU | Campos et al. |
| LSD-SLAM | 2014 | Direct (dense) | Semi-dense maps from monocular | Engel et al. |
| DSO | 2017 | Direct (sparse) | Photometric bundle adjustment | Engel et al. |
| VINS-Mono | 2018 | Feature-based + IMU | Tightly-coupled VIO | Qin et al. |
| OpenVSLAM | 2019 | Feature-based | Modular architecture | Sumikura et al. |
| DROID-SLAM | 2021 | Deep learning | Learned SLAM, high accuracy | Teed & Deng |
| SplaTAM | 2024 | Gaussian Splatting | 3DGS-based SLAM | Keetha et al. |
Visual SLAM Pipeline
Camera Image
      │
      ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Feature    │────▶│   Feature    │────▶│    Motion    │
│  Extraction  │     │   Matching   │     │  Estimation  │
│ (ORB, SIFT,  │     │ (BFMatcher,  │     │  (PnP, ICP,  │
│  SuperPoint) │     │  LightGlue)  │     │     BA)      │
└──────────────┘     └──────────────┘     └──────────────┘
                                                 │
                                                 ▼
                                          ┌──────────────┐
                                          │     Map      │
                                          │    Update    │
                                          │ (keyframes,  │
                                          │  landmarks)  │
                                          └──────────────┘
                                                 │
                                                 ▼
                                          ┌──────────────┐
                                          │     Loop     │
                                          │   Closure    │
                                          │   (detect    │
                                          │   revisits)  │
                                          └──────────────┘
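The front end of this pipeline can be prototyped in a few lines with OpenCV. Below is a minimal two-frame sketch, assuming grayscale images `img1` and `img2` and a camera intrinsic matrix `K` (all three are hypothetical inputs, not defined in this document):

```python
import cv2
import numpy as np

def two_frame_motion(img1, img2, K):
    """Estimate the relative camera motion between two grayscale frames."""
    # Feature extraction: ORB keypoints with binary descriptors
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # Feature matching: brute-force Hamming distance with cross-check
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Motion estimation: essential matrix with RANSAC, then pose recovery
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t  # t has unit scale: monocular VO cannot recover metric scale
```

A full system adds keyframe selection, local bundle adjustment, and loop closure on top of this two-frame core.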
Feature-based vs Direct Methods
| Aspect | Feature-based | Direct |
|---|---|---|
| How it works | Extract keypoints, match between frames | Use pixel intensities directly |
| Examples | ORB-SLAM3, VINS-Mono | LSD-SLAM, DSO |
| Robustness | High (invariant to lighting) | Lower (sensitive to exposure) |
| Map density | Sparse (point cloud) | Dense / semi-dense |
| Accuracy | Good | Often better in texture-rich scenes |
| Speed | Fast | Slower (pixel-level optimization) |
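The two columns correspond to two different objective functions: feature-based methods minimize geometric reprojection error over matched keypoints, while direct methods minimize photometric error over raw pixel intensities. A sketch of both objectives (notation assumed here: \(T\) the relative pose, \(\pi\) the camera projection, \(p_i\) a 3D landmark matched to keypoint \(u_i\), \(I_1, I_2\) the two images, \(d_u\) the depth at pixel \(u\)):
\[
E_{\text{feature}}(T) = \sum_i \left\| u_i - \pi(T p_i) \right\|^2
\qquad
E_{\text{direct}}(T) = \sum_{u} \left( I_2\big(\pi(T\, \pi^{-1}(u, d_u))\big) - I_1(u) \right)^2
\]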
2. LiDAR SLAM
Uses laser range finders for precise 3D mapping. Typically more accurate than visual SLAM and robust to lighting changes, but the sensors are considerably more expensive.
| Method | Year | Type | Key Feature |
|---|---|---|---|
| LOAM | 2014 | Feature-based | LiDAR odometry + mapping |
| LeGO-LOAM | 2018 | Feature-based | Lightweight, ground optimization |
| LIO-SAM | 2020 | Tightly-coupled | LiDAR + IMU factor graph |
| FAST-LIO2 | 2021 | Iterated EKF | Fast, lightweight |
| CT-ICP | 2021 | Point-to-point | Continuous-time ICP |
| KISS-ICP | 2023 | Simple ICP | "Keep It Simple and Scalable" |
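Several of the methods above (CT-ICP, KISS-ICP) are built directly on ICP (Iterative Closest Point) scan registration. Below is a minimal point-to-point ICP sketch in the "keep it simple" spirit of KISS-ICP; it is illustrative only, not the actual library, and real systems add voxel downsampling, robust kernels, and motion compensation:

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_point_to_point(source, target, iters=20):
    """Align source points (N,3) to target points (M,3); return a 4x4 SE(3)."""
    T = np.eye(4)
    src = source.copy()
    tree = cKDTree(target)  # build once; the target cloud is fixed
    for _ in range(iters):
        # 1. Data association: nearest neighbor in the target cloud
        _, idx = tree.query(src)
        corr = target[idx]
        # 2. Closed-form alignment of the correspondences (Kabsch/Umeyama)
        mu_s, mu_t = src.mean(axis=0), corr.mean(axis=0)
        H = (src - mu_s).T @ (corr - mu_t)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:  # guard against a reflection solution
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_t - R @ mu_s
        # 3. Apply the increment and accumulate the total transform
        src = src @ R.T + t
        dT = np.eye(4)
        dT[:3, :3], dT[:3, 3] = R, t
        T = dT @ T
    return T
```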
3. RGB-D SLAM
Uses depth cameras (RealSense, Kinect) for dense 3D reconstruction.
| Method | Year | Key Feature |
|---|---|---|
| RTAB-Map | 2014 | Multi-session, graph-based |
| ElasticFusion | 2015 | Real-time dense SLAM |
| BundleFusion | 2017 | Global bundle adjustment |
| NICE-SLAM | 2022 | Neural implicit SLAM |
| SplaTAM | 2024 | 3D Gaussian Splatting SLAM |
4. Learning-Based SLAM
Recent trend: replace hand-crafted components with learned ones.
| Approach | Example | Year | Learning Target |
|---|---|---|---|
| Learned features | SuperPoint + SuperGlue | 2018, 2020 | Keypoint detection + matching |
| Learned SLAM | DROID-SLAM | 2021 | End-to-end visual odometry |
| Neural implicit | iMAP, NICE-SLAM | 2021, 2022 | Neural implicit field as map |
| Gaussian Splatting | SplaTAM, MonoGS | 2024 | 3DGS as map representation |
SLAM Datasets
Indoor Datasets
| Dataset | Year | Sensor | Environment | Key Feature |
|---|---|---|---|---|
| TUM RGB-D | 2012 | Kinect | Office rooms | 39 sequences, ground truth |
| ICL-NUIM | 2014 | Synthetic | Living room/office | Perfect ground truth |
| EuRoC MAV | 2016 | Stereo + IMU | Machine hall, room | Micro aerial vehicle |
| TartanAir | 2020 | Stereo | Various (sim) | Diverse environments, HD |
| Replica | 2019 | Synthetic | Indoor rooms | High-fidelity 3D reconstructions |
| ScanNet | 2017 | RGB-D | 1513 scenes | Semantic labels |
Outdoor Datasets
| Dataset | Year | Sensor | Environment | Key Feature |
|---|---|---|---|---|
| KITTI | 2012 | Stereo + LiDAR | Urban driving | Standard benchmark |
| nuScenes | 2019 | LiDAR + cameras | Urban, Boston/Singapore | 1000 scenes, 3D annotations |
| Waymo Open | 2019 | LiDAR + cameras | Urban/suburban | 1150 scenes |
| MulRan | 2020 | LiDAR | Urban, multi-session | Long-term relocalization |
| Oxford RobotCar | 2016 | Multi-sensor | Urban Oxford | 1000+ km, multi-weather |
| Hilti SLAM Challenge | 2022 | Multi-sensor | Construction sites | Multi-floor SLAM |
Dataset Details
TUM RGB-D (The Standard Indoor Benchmark)
TUM RGB-D Dataset:
├── 39 sequences; representative examples:
│   ├── fr1_xyz — Slow, structured motion
│   ├── fr1_desk — Desktop objects
│   ├── fr2_xyz — Larger workspace
│   ├── fr3_office — Full office
│   └── fr1_room — Complete room traversal
├── Sensor: Microsoft Kinect v1
├── Resolution: 640×480 @ 30 Hz
├── Ground truth: Motion capture system
└── Evaluation: ATE (Absolute Trajectory Error)
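Trajectories and ground truth in this benchmark are plain-text files with one pose per line in the format `timestamp tx ty tz qx qy qz qw`. A small loader sketch (timestamp association between estimate and ground truth, which the official tools handle, is omitted here):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def load_tum_trajectory(path):
    """Read a TUM-format trajectory file into {timestamp: 4x4 SE(3) matrix}."""
    poses = {}
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # skip comments and blank lines
            ts, tx, ty, tz, qx, qy, qz, qw = map(float, line.split())
            T = np.eye(4)
            # scipy expects scalar-last (x, y, z, w), matching the file format
            T[:3, :3] = Rotation.from_quat([qx, qy, qz, qw]).as_matrix()
            T[:3, 3] = [tx, ty, tz]
            poses[ts] = T
    return poses
```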
KITTI (The Standard Outdoor Benchmark)
KITTI Dataset:
├── Stereo + Velodyne LiDAR + GPS/IMU
├── Urban, suburban, highway scenarios
├── Sequences: 22 total (11 with public ground truth, 00–10; 11 held-out test, 11–21)
├── Ground truth: GPS/RTK (cm-level)
├── Evaluation:
│ ├── t_err — Translational error (%)
│ └── r_err — Rotational error (deg/100m)
└── Odometry benchmark leaderboard
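Ground-truth poses for the training sequences are text files with twelve floats per line: the row-major 3×4 matrix \([R \mid t]\) of the camera pose. A loader sketch:

```python
import numpy as np

def load_kitti_poses(path):
    """Read a KITTI odometry pose file into a list of 4x4 SE(3) matrices."""
    poses = []
    with open(path) as f:
        for line in f:
            vals = np.array(line.split(), dtype=float)
            T = np.eye(4)
            T[:3, :4] = vals.reshape(3, 4)  # row-major [R | t]
            poses.append(T)
    return poses
```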
Evaluation Metrics
Absolute Trajectory Error (ATE)
Measures the global consistency of the estimated trajectory as the RMSE of the position error, typically after rigidly aligning the estimate to the ground truth:
\[
\text{ATE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\| \hat{t}_i - t_i \right\|^2}
\]
where \(\hat{t}_i\) is the estimated position and \(t_i\) is the ground truth.
Relative Pose Error (RPE)
Measures local accuracy (drift) over a fixed frame interval \(\Delta\) by comparing relative motions rather than absolute poses. For frame \(i\), the error transform is
\[
E_i = \left( \left(T^{\mathrm{gt}}_i\right)^{-1} T^{\mathrm{gt}}_{i+\Delta} \right)^{-1} \left( \hat{T}_i^{-1} \hat{T}_{i+\Delta} \right)
\]
and the translational and rotational parts of \(E_i\) are averaged over all frames (this is exactly what the code below computes).
Comparison Table
| Metric | Measures | Sensitive to | Use Case |
|---|---|---|---|
| ATE | Global consistency | Scale, rotation, translation | Loop closure quality |
| RPE | Local accuracy | Drift over short intervals | Odometry quality |
| Map quality | 3D reconstruction | Completeness, accuracy | Mapping applications |
Evaluation Code
```python
import numpy as np
from scipy.spatial.transform import Rotation


def compute_ate(estimated_poses, ground_truth_poses):
    """
    Compute Absolute Trajectory Error (ATE).

    Assumes the trajectories are time-associated and already expressed in a
    common frame (no Horn/Umeyama alignment is performed here).

    Args:
        estimated_poses: List of 4x4 SE(3) matrices
        ground_truth_poses: List of 4x4 SE(3) matrices

    Returns:
        ate: Root mean squared translation error (meters)
    """
    errors = []
    for T_est, T_gt in zip(estimated_poses, ground_truth_poses):
        # Translation error between corresponding poses
        t_est = T_est[:3, 3]
        t_gt = T_gt[:3, 3]
        errors.append(np.linalg.norm(t_est - t_gt))
    return np.sqrt(np.mean(np.array(errors) ** 2))


def compute_rpe(estimated_poses, ground_truth_poses, delta=1):
    """
    Compute Relative Pose Error (RPE).

    Args:
        estimated_poses: List of 4x4 SE(3) matrices
        ground_truth_poses: List of 4x4 SE(3) matrices
        delta: Frame interval for comparison

    Returns:
        trans_err: Mean translational error (meters)
        rot_err: Mean rotational error (degrees)
    """
    trans_errors = []
    rot_errors = []
    for i in range(len(estimated_poses) - delta):
        # Relative motion over the interval, for estimate and ground truth
        T_est_rel = np.linalg.inv(estimated_poses[i]) @ estimated_poses[i + delta]
        T_gt_rel = np.linalg.inv(ground_truth_poses[i]) @ ground_truth_poses[i + delta]

        # Error transform: identity if the two relative motions agree
        T_err = np.linalg.inv(T_gt_rel) @ T_est_rel

        # Translation error
        trans_errors.append(np.linalg.norm(T_err[:3, 3]))

        # Rotation error: the rotation angle is the *norm* of the rotation vector
        r = Rotation.from_matrix(T_err[:3, :3])
        rot_errors.append(np.linalg.norm(r.as_rotvec(degrees=True)))

    return np.mean(trans_errors), np.mean(rot_errors)
```
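A quick sanity check on synthetic data, using the two functions above. The estimate below is the ground truth shifted by a constant 10 cm, so ATE should be exactly 0.1 m while RPE should be zero (a constant offset produces no drift):

```python
# Straight-line ground-truth trajectory, 10 cm forward per frame
gt = []
for k in range(100):
    T = np.eye(4)
    T[0, 3] = 0.1 * k
    gt.append(T)

# Estimate: same trajectory with a constant 10 cm lateral offset
est = [T.copy() for T in gt]
for T in est:
    T[1, 3] += 0.1

print(compute_ate(est, gt))      # 0.1
print(compute_rpe(est, gt, 1))   # (0.0, 0.0)
```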
Modern Trends (2023–2025)
Neural Implicit SLAM
Replace traditional maps with neural radiance fields (NeRF) or 3D Gaussian Splatting (3DGS).
Traditional SLAM: Map = sparse points + keyframes
Neural SLAM: Map = neural network (implicit function)
Gaussian Splatting: Map = 3D Gaussian primitives
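To make "map = neural network" concrete: in iMAP/NICE-SLAM-style systems the map is a learned function \(f_\theta\) from a 3D point to color and density (or occupancy), and tracking/mapping jointly optimize camera poses and \(\theta\) so that volume-rendered pixels match the observed images. A sketch of the rendered color along a camera ray \(r(s) = o + s d\):
\[
f_\theta : \mathbb{R}^3 \to (c, \sigma), \qquad \hat{C}(r) = \int_{s_n}^{s_f} T(s)\, \sigma(r(s))\, c(r(s))\, ds, \qquad T(s) = \exp\left( -\int_{s_n}^{s} \sigma(r(u))\, du \right)
\]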
Advantages:
- Dense, photorealistic maps
- Novel view synthesis
- Compact representation
Challenges:
- Computationally expensive
- Real-time performance is hard
- Loop closure in neural maps remains an open problem
Foundation Models for SLAM
| Approach | Example | How It Helps |
|---|---|---|
| Learned features | SuperPoint, LightGlue | Better matching in challenging conditions |
| Semantic SLAM | ConceptFusion | Build semantic maps |
| Language-grounded SLAM | NLMap | "Find objects near the red couch" |
| Depth estimation | DPT, Depth Anything | Monocular depth for SLAM |
References
- Cadena et al. (2016). "Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age." IEEE T-RO
- Campos et al. (2021). "ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM." IEEE T-RO
- Qin et al. (2018). "VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator." IEEE T-RO
- Shan et al. (2020). "LIO-SAM: Tightly-coupled Lidar Inertial Odometry via Smoothing and Mapping." IROS 2020
- Teed & Deng (2021). "DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras." NeurIPS 2021
- Keetha et al. (2024). "SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM." CVPR 2024
- Sturm et al. (2012). "A Benchmark for the Evaluation of RGB-D SLAM Systems." IROS 2012
- Geiger et al. (2012). "Are we ready for autonomous driving? The KITTI vision benchmark suite." CVPR 2012