SLAM: Simultaneous Localization and Mapping
SLAM is the problem of simultaneously building a map of an unknown environment and localizing the robot within that map. It is one of the most fundamental problems in robotics — nearly every autonomous robot needs to know "where am I?" and "what does the world look like?"
For ROS implementation details, see ROS SLAM Tutorial.
The SLAM Problem
Formal Definition
Given:
- Robot observations \(z_{1:t}\) (camera images, LiDAR scans, IMU readings)
- Robot controls \(u_{1:t}\) (odometry, wheel encoders)
Estimate:
- Robot trajectory \(x_{1:t}\) (where has the robot been?)
- Map \(m\) (what does the environment look like?)
Jointly, as the posterior:
\[
p(x_{1:t}, m \mid z_{1:t}, u_{1:t})
\]
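Under the usual Markov assumptions, this posterior factors into a motion model \(p(x_k \mid x_{k-1}, u_k)\) and an observation model \(p(z_k \mid x_k, m)\), which are the two terms that filtering and graph-based back ends actually work with. A sketch of the factorization (uniform priors assumed, \(x_0\) a fixed initial pose):
\[
p(x_{1:t}, m \mid z_{1:t}, u_{1:t}) \;\propto\; \prod_{k=1}^{t} p(x_k \mid x_{k-1}, u_k)\, p(z_k \mid x_k, m)
\]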
Why It's Hard
SLAM challenges:
├── Chicken-and-egg — Need location to build map, need map to localize
├── Data association — Is this the same place I visited before? (loop closure)
├── Uncertainty — Sensors are noisy, odometry drifts
├── Scalability — Maps grow with exploration time
├── Dynamic objects — People, cars, doors change the environment
└── Multi-modal — Different sensors have different strengths
SLAM Variants
1. Visual SLAM (vSLAM)
Uses cameras as the primary sensor. The most popular approach due to low cost and rich information.
| Method | Year | Type | Key Feature | Reference |
|---|---|---|---|---|
| ORB-SLAM3 | 2021 | Feature-based | Monocular, stereo, RGB-D, IMU | Campos et al. |
| LSD-SLAM | 2014 | Direct (dense) | Semi-dense maps from monocular | Engel et al. |
| DSO | 2017 | Direct (sparse) | Photometric bundle adjustment | Engel et al. |
| VINS-Mono | 2018 | Feature-based + IMU | Tightly-coupled VIO | Qin et al. |
| OpenVSLAM | 2019 | Feature-based | Modular architecture | Sumikura et al. |
| DROID-SLAM | 2021 | Deep learning | Learned SLAM, high accuracy | Teed & Deng |
| SplaTAM | 2024 | Gaussian Splatting | 3DGS-based SLAM | Keetha et al. |
Visual SLAM Pipeline
Camera Image
      │
      ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Feature    │────▶│   Feature    │────▶│    Motion    │
│  Extraction  │     │   Matching   │     │  Estimation  │
│ (ORB, SIFT,  │     │ (BFMatcher,  │     │  (PnP, ICP,  │
│  SuperPoint) │     │  LightGlue)  │     │     BA)      │
└──────────────┘     └──────────────┘     └──────────────┘
                                                 │
                                                 ▼
                                          ┌──────────────┐
                                          │     Map      │
                                          │    Update    │
                                          │ (keyframes,  │
                                          │  landmarks)  │
                                          └──────────────┘
                                                 │
                                                 ▼
                                          ┌──────────────┐
                                          │     Loop     │
                                          │   Closure    │
                                          │   (detect    │
                                          │   revisits)  │
                                          └──────────────┘
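The front end of this pipeline can be prototyped in a few lines with OpenCV. Below is a minimal two-frame sketch, assuming grayscale images `img1` and `img2` and a camera intrinsic matrix `K` (all three are hypothetical inputs, not defined in this document):

```python
import cv2
import numpy as np

def two_frame_motion(img1, img2, K):
    """Estimate the relative camera motion between two grayscale frames."""
    # Feature extraction: ORB keypoints with binary descriptors
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # Feature matching: brute-force Hamming distance with cross-check
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Motion estimation: essential matrix with RANSAC, then pose recovery
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t  # t has unit scale: monocular VO cannot recover metric scale
```

A full system adds keyframe selection, local bundle adjustment, and loop closure on top of this two-frame core.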
Feature-based vs Direct Methods
| Aspect | Feature-based | Direct |
|---|---|---|
| How it works | Extract keypoints, match between frames | Use pixel intensities directly |
| Examples | ORB-SLAM3, VINS-Mono | LSD-SLAM, DSO |
| Robustness | High (invariant to lighting) | Lower (sensitive to exposure) |
| Map density | Sparse (point cloud) | Dense / semi-dense |
| Accuracy | Good | Often better in texture-rich scenes |
| Speed | Fast | Slower (pixel-level optimization) |
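The two columns correspond to two different objective functions: feature-based methods minimize geometric reprojection error over matched keypoints, while direct methods minimize photometric error over raw pixel intensities. A sketch of both objectives (notation assumed here: \(T\) the relative pose, \(\pi\) the camera projection, \(p_i\) a 3D landmark matched to keypoint \(u_i\), \(I_1, I_2\) the two images, \(d_u\) the depth at pixel \(u\)):
\[
E_{\text{feature}}(T) = \sum_i \left\| u_i - \pi(T p_i) \right\|^2
\qquad
E_{\text{direct}}(T) = \sum_{u} \left( I_2\big(\pi(T\, \pi^{-1}(u, d_u))\big) - I_1(u) \right)^2
\]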
2. LiDAR SLAM
Uses laser range finders for precise 3D mapping. Typically more accurate than visual SLAM and robust to lighting changes, but the sensors are considerably more expensive.
| Method | Year | Type | Key Feature |
|---|---|---|---|
| LOAM | 2014 | Feature-based | LiDAR odometry + mapping |
| LeGO-LOAM | 2018 | Feature-based | Lightweight, ground optimization |
| LIO-SAM | 2020 | Tightly-coupled | LiDAR + IMU factor graph |
| FAST-LIO2 | 2021 | Iterated EKF | Fast, lightweight |
| CT-ICP | 2021 | Point-to-point | Continuous-time ICP |
| KISS-ICP | 2023 | Simple ICP | "Keep It Simple and Scalable" |
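Several of the methods above (CT-ICP, KISS-ICP) are built directly on ICP (Iterative Closest Point) scan registration. Below is a minimal point-to-point ICP sketch in the "keep it simple" spirit of KISS-ICP; it is illustrative only, not the actual library, and real systems add voxel downsampling, robust kernels, and motion compensation:

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_point_to_point(source, target, iters=20):
    """Align source points (N,3) to target points (M,3); return a 4x4 SE(3)."""
    T = np.eye(4)
    src = source.copy()
    tree = cKDTree(target)  # build once; the target cloud is fixed
    for _ in range(iters):
        # 1. Data association: nearest neighbor in the target cloud
        _, idx = tree.query(src)
        corr = target[idx]
        # 2. Closed-form alignment of the correspondences (Kabsch/Umeyama)
        mu_s, mu_t = src.mean(axis=0), corr.mean(axis=0)
        H = (src - mu_s).T @ (corr - mu_t)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:  # guard against a reflection solution
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_t - R @ mu_s
        # 3. Apply the increment and accumulate the total transform
        src = src @ R.T + t
        dT = np.eye(4)
        dT[:3, :3], dT[:3, 3] = R, t
        T = dT @ T
    return T
```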
3. RGB-D SLAM
Uses depth cameras (RealSense, Kinect) for dense 3D reconstruction.
| Method | Year | Key Feature |
|---|---|---|
| RTAB-Map | 2014 | Multi-session, graph-based |
| ElasticFusion | 2015 | Real-time dense SLAM |
| BundleFusion | 2017 | Global bundle adjustment |
| NICE-SLAM | 2022 | Neural implicit SLAM |
| SplaTAM | 2024 | 3D Gaussian Splatting SLAM |
4. Learning-Based SLAM
Recent trend: replace hand-crafted components with learned ones.
| Approach | Example | Year | Learning Target |
|---|---|---|---|
| Learned features | SuperPoint + SuperGlue | 2018, 2020 | Keypoint detection + matching |
| Learned SLAM | DROID-SLAM | 2021 | End-to-end visual odometry |
| Neural implicit | iMAP, NICE-SLAM | 2021, 2022 | Neural implicit field as map |
| Gaussian Splatting | SplaTAM, MonoGS | 2024 | 3DGS as map representation |
SLAM Datasets
Indoor Datasets
| Dataset | Year | Sensor | Environment | Key Feature |
|---|---|---|---|---|
| TUM RGB-D | 2012 | Kinect | Office rooms | 39 sequences, ground truth |
| ICL-NUIM | 2014 | Synthetic | Living room/office | Perfect ground truth |
| EuRoC MAV | 2016 | Stereo + IMU | Machine hall, room | Micro aerial vehicle |
| TartanAir | 2020 | Stereo | Various (sim) | Diverse environments, HD |
| Replica | 2019 | Synthetic | Indoor rooms | High-fidelity 3D reconstructions |
| ScanNet | 2017 | RGB-D | 1513 scenes | Semantic labels |
Outdoor Datasets
| Dataset | Year | Sensor | Environment | Key Feature |
|---|---|---|---|---|
| KITTI | 2012 | Stereo + LiDAR | Urban driving | Standard benchmark |
| nuScenes | 2019 | LiDAR + cameras | Urban, Boston/Singapore | 1000 scenes, 3D annotations |
| Waymo Open | 2019 | LiDAR + cameras | Urban/suburban | 1150 scenes |
| MulRan | 2020 | LiDAR | Urban, multi-session | Long-term relocalization |
| Oxford RobotCar | 2016 | Multi-sensor | Urban Oxford | 1000+ km, multi-weather |
| Hilti SLAM Challenge | 2022 | Multi-sensor | Construction sites | Multi-floor SLAM |
Dataset Details
TUM RGB-D (The Standard Indoor Benchmark)
TUM RGB-D Dataset:
├── 39 sequences; representative examples:
│   ├── fr1_xyz — Slow, structured motion
│   ├── fr1_desk — Desktop objects
│   ├── fr2_xyz — Larger workspace
│   ├── fr3_office — Full office
│   └── fr1_room — Complete room traversal
├── Sensor: Microsoft Kinect v1
├── Resolution: 640×480 @ 30 Hz
├── Ground truth: Motion capture system
└── Evaluation: ATE (Absolute Trajectory Error)
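Trajectories and ground truth in this benchmark are plain-text files with one pose per line in the format `timestamp tx ty tz qx qy qz qw`. A small loader sketch (timestamp association between estimate and ground truth, which the official tools handle, is omitted here):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def load_tum_trajectory(path):
    """Read a TUM-format trajectory file into {timestamp: 4x4 SE(3) matrix}."""
    poses = {}
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # skip comments and blank lines
            ts, tx, ty, tz, qx, qy, qz, qw = map(float, line.split())
            T = np.eye(4)
            # scipy expects scalar-last (x, y, z, w), matching the file format
            T[:3, :3] = Rotation.from_quat([qx, qy, qz, qw]).as_matrix()
            T[:3, 3] = [tx, ty, tz]
            poses[ts] = T
    return poses
```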
KITTI (The Standard Outdoor Benchmark)
KITTI Dataset:
├── Stereo + Velodyne LiDAR + GPS/IMU
├── Urban, suburban, highway scenarios
├── Sequences: 22 total (11 with public ground truth, 00–10; 11 held-out test, 11–21)
├── Ground truth: GPS/RTK (cm-level)
├── Evaluation:
│ ├── t_err — Translational error (%)
│ └── r_err — Rotational error (deg/100m)
└── Odometry benchmark leaderboard
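Ground-truth poses for the training sequences are text files with twelve floats per line: the row-major 3×4 matrix \([R \mid t]\) of the camera pose. A loader sketch:

```python
import numpy as np

def load_kitti_poses(path):
    """Read a KITTI odometry pose file into a list of 4x4 SE(3) matrices."""
    poses = []
    with open(path) as f:
        for line in f:
            vals = np.array(line.split(), dtype=float)
            T = np.eye(4)
            T[:3, :4] = vals.reshape(3, 4)  # row-major [R | t]
            poses.append(T)
    return poses
```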
Evaluation Metrics
Absolute Trajectory Error (ATE)
Measures the global consistency of the estimated trajectory as the RMSE of the position error, typically after rigidly aligning the estimate to the ground truth:
\[
\text{ATE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\| \hat{t}_i - t_i \right\|^2}
\]
where \(\hat{t}_i\) is the estimated position and \(t_i\) is the ground truth.
Relative Pose Error (RPE)
Measures local accuracy (drift) over a fixed frame interval \(\Delta\) by comparing relative motions rather than absolute poses. For frame \(i\), the error transform is
\[
E_i = \left( \left(T^{\mathrm{gt}}_i\right)^{-1} T^{\mathrm{gt}}_{i+\Delta} \right)^{-1} \left( \hat{T}_i^{-1} \hat{T}_{i+\Delta} \right)
\]
and the translational and rotational parts of \(E_i\) are averaged over all frames (this is exactly what the code below computes).
Comparison Table
| Metric | Measures | Sensitive to | Use Case |
|---|---|---|---|
| ATE | Global consistency | Scale, rotation, translation | Loop closure quality |
| RPE | Local accuracy | Drift over short intervals | Odometry quality |
| Map quality | 3D reconstruction | Completeness, accuracy | Mapping applications |
Evaluation Code
```python
import numpy as np
from scipy.spatial.transform import Rotation


def compute_ate(estimated_poses, ground_truth_poses):
    """
    Compute Absolute Trajectory Error (ATE).

    Assumes the trajectories are time-associated and already expressed in a
    common frame (no Horn/Umeyama alignment is performed here).

    Args:
        estimated_poses: List of 4x4 SE(3) matrices
        ground_truth_poses: List of 4x4 SE(3) matrices

    Returns:
        ate: Root mean squared translation error (meters)
    """
    errors = []
    for T_est, T_gt in zip(estimated_poses, ground_truth_poses):
        # Translation error between corresponding poses
        t_est = T_est[:3, 3]
        t_gt = T_gt[:3, 3]
        errors.append(np.linalg.norm(t_est - t_gt))
    return np.sqrt(np.mean(np.array(errors) ** 2))


def compute_rpe(estimated_poses, ground_truth_poses, delta=1):
    """
    Compute Relative Pose Error (RPE).

    Args:
        estimated_poses: List of 4x4 SE(3) matrices
        ground_truth_poses: List of 4x4 SE(3) matrices
        delta: Frame interval for comparison

    Returns:
        trans_err: Mean translational error (meters)
        rot_err: Mean rotational error (degrees)
    """
    trans_errors = []
    rot_errors = []
    for i in range(len(estimated_poses) - delta):
        # Relative motion over the interval, for estimate and ground truth
        T_est_rel = np.linalg.inv(estimated_poses[i]) @ estimated_poses[i + delta]
        T_gt_rel = np.linalg.inv(ground_truth_poses[i]) @ ground_truth_poses[i + delta]

        # Error transform: identity if the two relative motions agree
        T_err = np.linalg.inv(T_gt_rel) @ T_est_rel

        # Translation error
        trans_errors.append(np.linalg.norm(T_err[:3, 3]))

        # Rotation error: the rotation angle is the *norm* of the rotation vector
        r = Rotation.from_matrix(T_err[:3, :3])
        rot_errors.append(np.linalg.norm(r.as_rotvec(degrees=True)))

    return np.mean(trans_errors), np.mean(rot_errors)
```
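A quick sanity check on synthetic data, using the two functions above. The estimate below is the ground truth shifted by a constant 10 cm, so ATE should be exactly 0.1 m while RPE should be zero (a constant offset produces no drift):

```python
# Straight-line ground-truth trajectory, 10 cm forward per frame
gt = []
for k in range(100):
    T = np.eye(4)
    T[0, 3] = 0.1 * k
    gt.append(T)

# Estimate: same trajectory with a constant 10 cm lateral offset
est = [T.copy() for T in gt]
for T in est:
    T[1, 3] += 0.1

print(compute_ate(est, gt))      # 0.1
print(compute_rpe(est, gt, 1))   # (0.0, 0.0)
```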
Modern Trends (2023–2025)
Neural Implicit SLAM
Replace traditional maps with neural radiance fields (NeRF) or 3D Gaussian Splatting (3DGS).
Traditional SLAM: Map = sparse points + keyframes
Neural SLAM: Map = neural network (implicit function)
Gaussian Splatting: Map = 3D Gaussian primitives
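To make "map = neural network" concrete: in iMAP/NICE-SLAM-style systems the map is a learned function \(f_\theta\) from a 3D point to color and density (or occupancy), and tracking/mapping jointly optimize camera poses and \(\theta\) so that volume-rendered pixels match the observed images. A sketch of the rendered color along a camera ray \(r(s) = o + s d\):
\[
f_\theta : \mathbb{R}^3 \to (c, \sigma), \qquad \hat{C}(r) = \int_{s_n}^{s_f} T(s)\, \sigma(r(s))\, c(r(s))\, ds, \qquad T(s) = \exp\left( -\int_{s_n}^{s} \sigma(r(u))\, du \right)
\]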
Advantages:
- Dense, photorealistic maps
- Novel view synthesis
- Compact representation
Challenges:
- Computationally expensive
- Real-time performance is hard
- Loop closure in neural maps remains an open problem
Foundation Models for SLAM
| Approach | Example | How It Helps |
|---|---|---|
| Learned features | SuperPoint, LightGlue | Better matching in challenging conditions |
| Semantic SLAM | ConceptFusion | Build semantic maps |
| Language-grounded SLAM | NLMap | "Find objects near the red couch" |
| Depth estimation | DPT, Depth Anything | Monocular depth for SLAM |
References
- Cadena et al. (2016). "Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age." IEEE T-RO
- Campos et al. (2021). "ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM." IEEE T-RO
- Qin et al. (2018). "VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator." IEEE T-RO
- Shan et al. (2020). "LIO-SAM: Tightly-coupled Lidar Inertial Odometry via Smoothing and Mapping." IROS 2020
- Teed & Deng (2021). "DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras." NeurIPS 2021
- Keetha et al. (2024). "SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM." CVPR 2024
- Sturm et al. (2012). "A Benchmark for the Evaluation of RGB-D SLAM Systems." IROS 2012
- Geiger et al. (2012). "Are we ready for autonomous driving? The KITTI vision benchmark suite." CVPR 2012