Depth Estimation¶
Depth perception — the ability to measure the distance from the camera to every point in a scene — is one of the most fundamental capabilities a robot needs. Without depth, a robot sees the world as a flat photograph and cannot reason about the 3D structure of its environment. With depth, a robot can navigate around obstacles, plan manipulation trajectories, and build maps of its surroundings.
This tutorial covers the four major families of depth sensing: stereo vision, structured light (RGB-D cameras), Time-of-Flight (ToF), and monocular depth estimation. We provide working Python code, mathematical foundations, and a comparison to help you choose the right sensor for your robot.
Learning Objectives¶
After completing this tutorial, you will be able to:
- Understand the different depth sensing modalities and how each works
- Compare stereo, structured light, ToF, and monocular approaches in terms of range, accuracy, cost, and operating conditions
- Implement stereo depth estimation using OpenCV (block matching and SGM)
- Generate point clouds from depth images using camera intrinsics
- Use modern monocular depth estimation models (MiDaS, Depth Anything)
- Integrate depth cameras with ROS 2
1. Why Depth Matters in Robotics¶
A standard RGB camera captures a 2D projection of the 3D world — all depth information is lost. Yet almost every robotic task requires some form of depth reasoning:
| Task | Why Depth Is Needed |
|---|---|
| Navigation | Detect and avoid obstacles at varying distances |
| Manipulation | Compute 3D positions of objects for grasp planning |
| Mapping | Build 3D occupancy maps or point cloud maps |
| Inspection | Measure dimensions, detect surface defects |
| Human-Robot Interaction | Track body pose in 3D for safe collaboration |
2D vs 3D Perception¶
2D Image (RGB) 3D Perception (RGB-D)
+------------------+ +------------------+
| . . . . . | | . . . . . |
| . BOX . . | | . BOX(1.2m) . |
| . . . . . | vs | . . . . . |
| ROBOT . . | | ROBOT . . |
| . . . . . | | . . . . . |
+------------------+ +------------------+
"There is a box." "The box is at (0.3, 0, 1.2) m"
Depth turns perception into measurement.
2. Stereo Vision¶
Stereo vision uses two cameras separated by a known baseline \(B\) to estimate depth through triangulation. This is inspired by human binocular vision.
2.1 Epipolar Geometry¶
Given two images of the same scene taken from different viewpoints, the geometric relationship between them is captured by two matrices.
Fundamental Matrix \(F\): relates corresponding points \(\mathbf{x}\) and \(\mathbf{x'}\) in pixel coordinates:
\[
\mathbf{x'}^\top F \mathbf{x} = 0
\]
where \(F\) is a \(3 \times 3\) matrix of rank 2 encoding the epipolar geometry.
Essential Matrix \(E\): expresses the same relationship in calibrated (normalized) coordinates \(\hat{\mathbf{x}}, \hat{\mathbf{x}}'\):
\[
\hat{\mathbf{x}}'^\top E \hat{\mathbf{x}} = 0
\]
The essential matrix encodes the rotation \(R\) and translation \(t\) between the two cameras:
\[
E = [t]_\times R
\]
where \([t]_\times\) is the skew-symmetric matrix of the translation vector:
\[
[t]_\times = \begin{bmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{bmatrix}
\]
The relationship between \(F\) and \(E\) via the camera intrinsics \(K\) is:
\[
E = K^\top F K
\]
Epipolar Constraint: For any point in the left image, its corresponding point in the right image must lie along a specific line — the epipolar line. This reduces the search for correspondences from 2D to 1D.
Left Image Right Image
+--------+ +--------+
| x | | |
| | -----> |epiline |
| | | x' |
+--------+ +--------+
Point x maps to x' must lie on
some point the epipolar line
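These epipolar relations are what OpenCV's two-view functions compute in practice. The following is a minimal sketch (separate from the calibrated stereo pipeline in the next sections) of estimating the essential matrix from matched features and decomposing it into \(R\) and \(t\); the image file names and intrinsics are placeholder assumptions.
import cv2
import numpy as np
# Assumed inputs: two grayscale views of the same scene and the intrinsics K
img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)
K = np.array([[718.856, 0.0, 607.19],
              [0.0, 718.856, 185.22],
              [0.0, 0.0, 1.0]])  # example KITTI-like intrinsics
# Detect and match ORB features between the two views
orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
# Estimate the essential matrix with RANSAC and recover R, t (t is only up to scale)
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
print("R =\n", R)
print("t (unit norm) =", t.ravel())
Because a single essential matrix only constrains translation up to scale, metric depth still requires a known baseline, which is exactly what a calibrated stereo rig provides.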
2.2 Stereo Rectification¶
Before computing disparity, we rectify both images so that epipolar lines become horizontal scanlines. This simplifies the correspondence search to a 1D horizontal search.
Given the intrinsics and the relative pose between the two cameras, cv2.stereoRectify computes rotation matrices \(R_1, R_2\) and new projection matrices \(P_1, P_2\) that align both images.
Before Rectification After Rectification
+--------+ +--------+ +--------+ +--------+
| / | | \ | | -------| |--------|
| x | | x' | --> | x | | x' |
| / | | \ | | -------| |--------|
+--------+ +--------+ +--------+ +--------+
Epipolar lines are Epipolar lines are
slanted in both images horizontal scanlines
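A minimal sketch of this step, assuming the calibration results (intrinsics, distortion, and the left-to-right pose) already exist, e.g. from cv2.stereoCalibrate; the numeric values and file names below are placeholders, not a real calibration.
import cv2
import numpy as np
# Placeholder calibration values (normally obtained from cv2.stereoCalibrate)
image_size = (640, 480)                       # (width, height)
K1 = K2 = np.array([[525.0, 0.0, 319.5],
                    [0.0, 525.0, 239.5],
                    [0.0, 0.0, 1.0]])
D1 = D2 = np.zeros(5)                         # distortion coefficients
R = np.eye(3)                                 # rotation between the cameras
T = np.array([[-0.06], [0.0], [0.0]])         # 6 cm baseline along x
# Compute rectification rotations R1, R2 and new projection matrices P1, P2
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(
    K1, D1, K2, D2, image_size, R, T, alpha=0
)
# Build remap tables and warp both images onto the rectified grid
map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, image_size, cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, image_size, cv2.CV_32FC1)
img_left = cv2.imread("left_raw.png", cv2.IMREAD_GRAYSCALE)
img_right = cv2.imread("right_raw.png", cv2.IMREAD_GRAYSCALE)
rect_left = cv2.remap(img_left, map1x, map1y, cv2.INTER_LINEAR)
rect_right = cv2.remap(img_right, map2x, map2y, cv2.INTER_LINEAR)
The rectified pair rect_left / rect_right is what the block matching and SGM code in the next section expects as input.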
2.3 Block Matching & SGM¶
Block Matching (BM) finds correspondences by comparing small windows (blocks) around each pixel. It is fast but can produce noisy results.
Semi-Global Matching (SGM) improves on BM by enforcing a smoothness constraint along multiple scanline directions. It produces much cleaner disparity maps at the cost of more computation.
import cv2
import numpy as np
# Load rectified stereo pair (grayscale)
img_left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
img_right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)
# --- Block Matching ---
stereo_bm = cv2.StereoBM_create(
numDisparities=128, # Must be divisible by 16
blockSize=15 # Odd number, typically 5-21
)
disparity_bm = stereo_bm.compute(img_left, img_right)
# --- Semi-Global Matching (SGM) ---
stereo_sgbm = cv2.StereoSGBM_create(
minDisparity=0,
numDisparities=128, # Must be divisible by 16
blockSize=5,
P1=8 * 3 * 5 ** 2, # Penalty for disparity change of 1
P2=32 * 3 * 5 ** 2, # Penalty for disparity change > 1
disp12MaxDiff=1,
uniquenessRatio=10,
speckleWindowSize=100,
speckleRange=32,
preFilterCap=63,
mode=cv2.STEREO_SGBM_MODE_SGBM_3WAY
)
disparity_sgbm = stereo_sgbm.compute(img_left, img_right)
# Normalize for visualization (16-bit signed -> 8-bit unsigned)
disp_vis = cv2.normalize(
disparity_sgbm, None, alpha=0, beta=255,
norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8U
)
cv2.imwrite("disparity.png", disp_vis)
print("Disparity map saved. Shape:", disparity_sgbm.shape)
2.4 Depth from Disparity¶
Once we have the disparity map, computing depth is straightforward:
\[
Z = \frac{f B}{d}
\]
where:
- \(Z\) is the depth (distance) in meters
- \(f\) is the focal length in pixels
- \(B\) is the baseline (distance between the two cameras) in meters
- \(d\) is the disparity (horizontal pixel shift) in pixels
import cv2
import numpy as np
def compute_depth_map(left_path, right_path, focal_length, baseline,
num_disparities=128, block_size=5):
"""
Compute a depth map from a rectified stereo pair.
Parameters
----------
left_path : str Path to left rectified image
right_path : str Path to right rectified image
focal_length : float Focal length in pixels
baseline : float Stereo baseline in meters
Returns
-------
depth_map : ndarray Depth in meters (float32)
"""
# Load images
img_left = cv2.imread(left_path, cv2.IMREAD_GRAYSCALE)
img_right = cv2.imread(right_path, cv2.IMREAD_GRAYSCALE)
# Compute disparity using SGM
stereo = cv2.StereoSGBM_create(
minDisparity=0,
numDisparities=num_disparities,
blockSize=block_size,
P1=8 * 3 * block_size ** 2,
P2=32 * 3 * block_size ** 2,
disp12MaxDiff=1,
uniquenessRatio=10,
speckleWindowSize=100,
speckleRange=32,
mode=cv2.STEREO_SGBM_MODE_SGBM_3WAY
)
disparity = stereo.compute(img_left, img_right).astype(np.float32) / 16.0
# Avoid division by zero
disparity[disparity <= 0] = 0.1
# Z = f * B / d
depth_map = (focal_length * baseline) / disparity
# Cap maximum depth to 10 meters
depth_map[depth_map > 10.0] = 10.0
return depth_map
# Example usage
if __name__ == "__main__":
focal_length = 718.856 # Example: KITTI dataset
baseline = 0.54 # meters
depth = compute_depth_map("left.png", "right.png", focal_length, baseline)
print(f"Depth map shape: {depth.shape}")
print(f"Depth range: {depth.min():.2f} - {depth.max():.2f} meters")
# Visualize
depth_vis = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX)
cv2.imwrite("depth_map.png", depth_vis.astype(np.uint8))
3. Structured Light (RGB-D Cameras)¶
Structured light cameras project a known pattern (infrared dots or stripes) onto the scene and observe how the pattern deforms on surfaces. By analyzing the deformation, the camera computes depth at each pixel.
3.1 Kinect / RealSense / Azure Kinect¶
The most popular structured light and RGB-D sensors for robotics:
| Sensor | Technology | Resolution (Depth) | Range | FPS | Price (USD) | Notes |
|---|---|---|---|---|---|---|
| Kinect v1 | Structured Light (IR) | 640×480 | 0.5–4.5 m | 30 | Discontinued | Classic, large community |
| Kinect v2 | ToF | 512×424 | 0.5–4.5 m | 30 | Discontinued | Better accuracy |
| Intel RealSense D435i | Active IR Stereo | 1280×720 | 0.2–10 m | 90 | ~$250 | IMU built-in, ROS 2 support |
| Intel RealSense D455 | Active IR Stereo | 1280×720 | 0.4–6 m | 90 | ~$350 | Larger baseline, better accuracy |
| Azure Kinect DK | ToF (iTOF) | 1024×1024 | 0.5–5.46 m | 30 | ~$400 | Body tracking SDK |
| Orbbec Femto Mega | ToF | 1024×1024 | 0.2–5 m | 30 | ~$350 | OpenNI compatible |
Best Choice for ROS 2
For ROS 2-based robotics projects, the Intel RealSense D435i is the most popular choice due to its native realsense2_camera ROS 2 driver, built-in IMU, and good balance of range and price.
3.2 Depth Image Processing¶
Intel provides the pyrealsense2 library for direct access to RealSense cameras:
import pyrealsense2 as rs
import numpy as np
import cv2
def stream_realsense():
"""Stream and display aligned color + depth from a RealSense camera."""
# Configure streams
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
# Start streaming
profile = pipeline.start(config)
# Get depth scale (meters per unit)
depth_sensor = profile.get_device().first_depth_sensor()
depth_scale = depth_sensor.get_depth_scale()
print(f"Depth Scale: {depth_scale} meters/unit")
# Align depth to color
align_to = rs.stream.color
align = rs.align(align_to)
try:
while True:
# Wait for frames
frames = pipeline.wait_for_frames()
# Align depth frame to color frame
aligned_frames = align.process(frames)
depth_frame = aligned_frames.get_depth_frame()
color_frame = aligned_frames.get_color_frame()
if not depth_frame or not color_frame:
continue
# Convert to numpy arrays
depth_image = np.asanyarray(depth_frame.get_data())
color_image = np.asanyarray(color_frame.get_data())
# Depth in meters
depth_meters = depth_image * depth_scale
# Apply colormap to depth for visualization
depth_colormap = cv2.applyColorMap(
cv2.convertScaleAbs(depth_image, alpha=0.03),
cv2.COLORMAP_JET
)
# Stack color and depth side by side
images = np.hstack((color_image, depth_colormap))
cv2.imshow("Color + Depth", images)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
finally:
pipeline.stop()
cv2.destroyAllWindows()
if __name__ == "__main__":
stream_realsense()
4. Time-of-Flight (ToF)¶
Time-of-Flight cameras actively illuminate the scene with modulated infrared light and measure the round-trip time (or phase shift) of the light to compute depth.
Principle¶
There are two main ToF approaches:
- Direct ToF (dToF): Measures the actual time it takes for a light pulse to travel to the surface and back:
\[
Z = \frac{c \, \Delta t}{2}
\]
where \(c\) is the speed of light and \(\Delta t\) is the round-trip time.
- Indirect ToF (iToF): Emits continuously modulated light and measures the phase shift \(\phi\) of the returned signal:
\[
Z = \frac{c \, \phi}{4 \pi f_{mod}}
\]
where \(f_{mod}\) is the modulation frequency.
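As a quick numeric illustration of both formulas (a minimal sketch; the 20 ns round-trip time and 20 MHz modulation frequency are illustrative values, not the specs of any particular sensor):
import numpy as np
C = 299_792_458.0  # speed of light (m/s)
def dtof_depth(delta_t):
    """Direct ToF: Z = c * dt / 2."""
    return C * delta_t / 2.0
def itof_depth(phase, f_mod):
    """Indirect ToF: Z = c * phi / (4 * pi * f_mod)."""
    return C * phase / (4.0 * np.pi * f_mod)
# A pulse returning after 20 ns corresponds to roughly 3 m
print(f"dToF depth: {dtof_depth(20e-9):.2f} m")
# A phase shift of pi/2 at 20 MHz modulation corresponds to roughly 1.9 m;
# the unambiguous range at this frequency is c / (2 * f_mod), about 7.5 m
print(f"iToF depth: {itof_depth(np.pi / 2, 20e6):.2f} m")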
Advantages and Limitations¶
| Aspect | Details |
|---|---|
| ✅ Advantages | Dense depth at full frame rate; no baseline needed; works in low texture |
| ✅ Compact | Single sensor, small form factor |
| ❌ Multi-path interference | Reflective/translucent surfaces cause errors |
| ❌ Limited range | Typically 0.5–5 m for consumer sensors |
| ❌ Outdoor sensitivity | Strong sunlight saturates the IR sensor |
| ❌ Low resolution | Typically lower than stereo or structured light |
5. Monocular Depth Estimation¶
Recent deep learning methods can estimate depth from a single RGB image. While historically these methods produced only relative (scale-invariant) depth, modern models now achieve metric depth estimation.
5.1 MiDaS / Depth Anything¶
MiDaS (Ranftl et al., 2020) and Depth Anything (Yang et al., 2024) are state-of-the-art models for monocular depth estimation. They are available via torch.hub for easy use.
import torch
import cv2
import numpy as np
import matplotlib.pyplot as plt
def estimate_depth_midas(image_path):
"""
Estimate depth from a single image using MiDaS v3.1 (DPT-Large).
Parameters
----------
image_path : str Path to the input RGB image
Returns
-------
depth : ndarray Relative depth map (float32, higher = closer)
"""
# Load MiDaS model from torch.hub
model_type = "DPT_Large" # Options: DPT_Large, DPT_Hybrid, MiDaS_small
midas = torch.hub.load("intel-isl/MiDaS", model_type)
# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
midas.to(device)
midas.eval()
# Load transforms
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
if model_type in ["DPT_Large", "DPT_Hybrid"]:
transform = midas_transforms.dpt_transform
else:
transform = midas_transforms.small_transform
# Load and preprocess image
img = cv2.imread(image_path)
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
input_batch = transform(img_rgb).to(device)
# Inference
with torch.no_grad():
prediction = midas(input_batch)
# Resize to original resolution
prediction = torch.nn.functional.interpolate(
prediction.unsqueeze(1),
size=img_rgb.shape[:2],
mode="bicubic",
align_corners=False,
).squeeze()
depth = prediction.cpu().numpy()
return depth
def estimate_depth_anything(image_path):
"""
Estimate depth using Depth Anything V2 (small model).
Parameters
----------
image_path : str Path to the input RGB image
Returns
-------
depth : ndarray Relative depth map (float32)
"""
# Load Depth Anything V2 via torch.hub.
# Note: this assumes the repository exposes a torch.hub entry point named
# "DepthAnythingV2"; if your version does not, install the model code from the
# Depth-Anything-V2 repository (or use the Hugging Face depth-estimation
# pipeline) and construct the model directly.
model = torch.hub.load("depth-anything/Depth-Anything-V2", "DepthAnythingV2",
encoder="vits", device="cpu")
model.eval()
# Load image
img = cv2.imread(image_path)
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
# Inference (the model handles preprocessing internally)
with torch.no_grad():
depth = model.infer_image(img_rgb)
return depth
if __name__ == "__main__":
image_path = "test_image.jpg"
# MiDaS
depth_midas = estimate_depth_midas(image_path)
print(f"MiDaS depth shape: {depth_midas.shape}")
print(f"MiDaS depth range: {depth_midas.min():.2f} - {depth_midas.max():.2f}")
# Visualize
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.imshow(cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB))
plt.title("Input Image")
plt.axis("off")
plt.subplot(1, 2, 2)
plt.imshow(depth_midas, cmap="plasma")
plt.title("MiDaS Depth (relative)")
plt.colorbar(fraction=0.046)
plt.axis("off")
plt.tight_layout()
plt.savefig("depth_estimation_result.png", dpi=150)
plt.show()
5.2 Metric Depth Estimation¶
MiDaS and Depth Anything produce relative depth — the values are not in meters. For robotics, we often need metric depth (in meters). Modern approaches include:
| Model | Metric? | Range | Notes |
|---|---|---|---|
| MiDaS v3.1 | Relative | N/A | Scale-invariant |
| Depth Anything V2 | Relative | N/A | Use with metric head |
| ZoeDepth | Metric (indoor) | 0–10 m | Combines MiDaS with metric bins |
| UniDepth | Metric | 0.1–100 m | Camera-agnostic metric depth |
| Depth Pro (Apple) | Metric | 0.1–250 m | Sharp boundaries, zero-shot |
| Metric3D v2 | Metric | 0–100 m | Canonical camera transform |
To convert relative depth to metric depth when you have even a few ground-truth depth values, you can fit a simple affine model:
import numpy as np
def relative_to_metric(rel_depth, gt_depth, mask=None):
"""
Convert relative depth to metric depth using affine alignment.
Parameters
----------
rel_depth : ndarray Relative depth map (H, W)
gt_depth : ndarray Ground-truth metric depth (H, W), 0 = invalid
mask : ndarray Boolean mask of valid ground-truth pixels
Returns
-------
metric_depth : ndarray Metric depth in meters (H, W)
"""
if mask is None:
mask = gt_depth > 0
# Flatten valid pixels
rel_vals = rel_depth[mask].reshape(-1, 1)
gt_vals = gt_depth[mask].reshape(-1, 1)
# Solve: gt = a * rel + b (least squares)
A = np.hstack([rel_vals, np.ones_like(rel_vals)])
result = np.linalg.lstsq(A, gt_vals, rcond=None)
a, b = result[0].flatten()
# Apply affine transform
metric_depth = a * rel_depth + b
metric_depth = np.clip(metric_depth, 0, None)
print(f"Affine fit: scale={a:.4f}, shift={b:.4f}")
return metric_depth
6. Depth Sensor Comparison¶
| Criterion | Stereo | Structured Light | ToF | Monocular (DL) |
|---|---|---|---|---|
| Range | 0.5 m – ∞ | 0.2 – 10 m | 0.5 – 5 m | 0.1 – 250 m* |
| Accuracy | Medium | High | Medium–High | Low–Medium |
| Resolution | High (matches camera) | 640×480 typical | 320×240 – 1024×1024 | Matches input image |
| Frame Rate | 30–60 fps | 30 fps | 30–60 fps | 1–30 fps (GPU dependent) |
| Cost | Low (two cameras) | Medium ($200–400) | Medium–High ($300–500) | Free (software only) |
| Outdoor | ✅ Works well | ❌ IR pattern overwhelmed | ❌ Sunlight saturates IR | ✅ Works well |
| Texture Dependence | ❌ Needs texture | ✅ Active pattern | ✅ Active illumination | ✅ Learns from data |
| Power | Low | Medium (IR projector) | Medium | High (GPU inference) |
| Latency | ~30 ms | ~30 ms | ~30 ms | 50–500 ms |
| Compact | ❌ Needs baseline | ✅ Small | ✅ Small | ✅ Single camera |
*With metric depth models like Depth Pro.
7. Point Cloud Generation from Depth¶
A depth image plus camera intrinsics lets you generate a 3D point cloud. Each pixel \((u, v)\) with depth \(Z\) maps to a 3D point:
\[
X = \frac{(u - c_x) Z}{f_x}, \qquad Y = \frac{(v - c_y) Z}{f_y}
\]
where \(f_x, f_y\) are the focal lengths and \((c_x, c_y)\) is the principal point.
import numpy as np
import cv2
def depth_to_point_cloud(depth_image, intrinsic_matrix, color_image=None,
depth_scale=1.0, stride=1):
"""
Convert a depth image to a 3D point cloud.
Parameters
----------
depth_image : ndarray Depth image (H, W), uint16 or float32
intrinsic_matrix : ndarray 3x3 camera intrinsic matrix K
color_image : ndarray Optional RGB image (H, W, 3)
depth_scale : float Scale factor to convert depth to meters
stride : int Downsample factor (1 = every pixel)
Returns
-------
points : ndarray (N, 3) XYZ coordinates in meters
colors : ndarray (N, 3) RGB values (0-255), or None
"""
fx = intrinsic_matrix[0, 0]
fy = intrinsic_matrix[1, 1]
cx = intrinsic_matrix[0, 2]
cy = intrinsic_matrix[1, 2]
# Create pixel coordinate grid
h, w = depth_image.shape
u, v = np.meshgrid(np.arange(0, w, stride),
np.arange(0, h, stride))
# Get depth values at grid points
depth = depth_image[::stride, ::stride].astype(np.float32) * depth_scale
# Mask out invalid depth (0 or negative)
valid = depth > 0
z = depth[valid]
x = (u[valid] - cx) * z / fx
y = (v[valid] - cy) * z / fy
points = np.stack([x, y, z], axis=-1)
# Extract colors if provided
colors = None
if color_image is not None:
colors = color_image[::stride, ::stride][valid]
return points, colors
def save_point_cloud_ply(filename, points, colors=None):
"""
Save point cloud to PLY file format.
Parameters
----------
filename : str Output PLY file path
points : ndarray (N, 3) XYZ coordinates
colors : ndarray (N, 3) RGB values (optional)
"""
n_points = points.shape[0]
has_color = colors is not None
with open(filename, 'w') as f:
# PLY header
f.write("ply\n")
f.write("format ascii 1.0\n")
f.write(f"element vertex {n_points}\n")
f.write("property float x\n")
f.write("property float y\n")
f.write("property float z\n")
if has_color:
f.write("property uchar red\n")
f.write("property uchar green\n")
f.write("property uchar blue\n")
f.write("end_header\n")
# Write points
for i in range(n_points):
line = f"{points[i, 0]:.4f} {points[i, 1]:.4f} {points[i, 2]:.4f}"
if has_color:
line += f" {int(colors[i, 0])} {int(colors[i, 1])} {int(colors[i, 2])}"
f.write(line + "\n")
print(f"Saved {n_points} points to {filename}")
if __name__ == "__main__":
# Example: synthetic depth image with known intrinsics
H, W = 480, 640
depth_img = np.random.randint(500, 5000, (H, W)).astype(np.uint16)
K = np.array([
[525.0, 0.0, 319.5],
[ 0.0, 525.0, 239.5],
[ 0.0, 0.0, 1.0]
])
points, colors = depth_to_point_cloud(
depth_img, K, depth_scale=0.001, stride=4
)
print(f"Generated {points.shape[0]} 3D points")
print(f"X range: [{points[:, 0].min():.2f}, {points[:, 0].max():.2f}] m")
print(f"Y range: [{points[:, 1].min():.2f}, {points[:, 1].max():.2f}] m")
print(f"Z range: [{points[:, 2].min():.2f}, {points[:, 2].max():.2f}] m")
save_point_cloud_ply("output.ply", points)
8. ROS 2 Integration¶
Depth cameras in ROS 2 publish data on standardized topics using standard message types.
Common Topics¶
| Topic | Message Type | Description |
|---|---|---|
| /camera/color/image_raw | sensor_msgs/Image | RGB image |
| /camera/depth/image_rect_raw | sensor_msgs/Image | Depth image (16-bit, mm or raw) |
| /camera/depth/color/points | sensor_msgs/PointCloud2 | Colored point cloud |
| /camera/aligned_depth_to_color/image_raw | sensor_msgs/Image | Depth aligned to color frame |
| /camera/color/camera_info | sensor_msgs/CameraInfo | Camera intrinsics (K, distortion) |
| /camera/imu | sensor_msgs/Imu | IMU data (for D435i) |
Launch RealSense in ROS 2¶
# Install the RealSense ROS 2 wrapper
sudo apt install ros-${ROS_DISTRO}-realsense2-camera
# Launch with default settings
ros2 launch realsense2_camera rs_launch.py
# Launch with specific parameters
ros2 launch realsense2_camera rs_launch.py \
depth_module.depth_profile:=640x480x30 \
rgb_camera.color_profile:=640x480x30 \
align_depth.enable:=true \
pointcloud.enable:=true
Subscribe to Depth in Python (ROS 2)¶
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image, CameraInfo
from cv_bridge import CvBridge
import numpy as np
class DepthSubscriber(Node):
"""Subscribe to depth image and camera info from a RealSense camera."""
def __init__(self):
super().__init__("depth_subscriber")
self.bridge = CvBridge()
self.K = None # Camera intrinsic matrix
# Subscribe to camera info (handle kept so we can unsubscribe after the first message)
self.info_sub = self.create_subscription(
CameraInfo,
"/camera/aligned_depth_to_color/camera_info",
self.camera_info_callback,
10
)
# Subscribe to aligned depth image
self.create_subscription(
Image,
"/camera/aligned_depth_to_color/image_raw",
self.depth_callback,
10
)
def camera_info_callback(self, msg):
"""Extract camera intrinsic matrix from CameraInfo."""
self.K = np.array(msg.k).reshape(3, 3)
self.get_logger().info(f"Camera intrinsics:\n{self.K}")
# Unsubscribe after receiving once
self.destroy_subscription(self.info_sub)
def depth_callback(self, msg):
"""Process incoming depth image."""
# Convert ROS Image to OpenCV format
depth_image = self.bridge.imgmsg_to_cv2(msg, desired_encoding="passthrough")
# Depth values are in millimeters (uint16) — convert to meters
depth_meters = depth_image.astype(np.float32) / 1000.0
# Print some statistics
valid = depth_meters > 0
if valid.any():
self.get_logger().info(
f"Depth — min: {depth_meters[valid].min():.2f} m, "
f"max: {depth_meters[valid].max():.2f} m, "
f"median: {np.median(depth_meters[valid]):.2f} m"
)
def main():
rclpy.init()
node = DepthSubscriber()
rclpy.spin(node)
node.destroy_node()
rclpy.shutdown()
if __name__ == "__main__":
main()
Useful ROS 2 Commands for Depth Cameras¶
# List all active topics
ros2 topic list | grep camera
# View depth image in RViz2
rviz2
# Add -> By topic -> /camera/depth/image_rect_raw
# Echo camera info
ros2 topic echo /camera/color/camera_info
# Check frame rate
ros2 topic hz /camera/aligned_depth_to_color/image_raw
# Record a rosbag with depth data
ros2 bag record /camera/color/image_raw \
/camera/aligned_depth_to_color/image_raw \
/camera/color/camera_info
Exercises¶
Exercise 1: Stereo Depth on KITTI¶
Download a stereo pair from the KITTI dataset. Use cv2.StereoSGBM to compute the disparity map and convert it to depth. Visualize the result as a colormap.
Hint
Use focal_length = 718.856 and baseline = 0.54 m for KITTI. Load images with cv2.imread in grayscale mode.
Exercise 2: RealSense Point Cloud¶
Connect an Intel RealSense D435i and use pyrealsense2 to capture a depth frame. Generate a colored point cloud and visualize it with Open3D.
import open3d as o3d
# After generating points and colors from Exercise concept above:
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points)
pcd.colors = o3d.utility.Vector3dVector(colors / 255.0)
o3d.visualization.draw_geometries([pcd])
Exercise 3: Monocular Depth Comparison¶
Download a single image and run both MiDaS and Depth Anything on it. Compare the depth maps visually. Are there regions where one model outperforms the other? Discuss.
Exercise 4: Depth-Based Obstacle Detection¶
Write a ROS 2 node that subscribes to a depth camera topic, segments pixels closer than 0.5 meters, and publishes a warning message when an obstacle is detected. Test with a bag file.
References¶
- Multiple View Geometry in Computer Vision, Hartley & Zisserman — The definitive textbook on epipolar geometry and stereo vision
- OpenCV Stereo Vision Documentation — Tutorial on StereoBM and StereoSGBM
- MiDaS: Robust Monocular Depth Estimation, Ranftl et al. 2020 — Monocular depth estimation model
- Depth Anything V2, Yang et al. 2024 — State-of-the-art monocular depth
- Intel RealSense SDK 2.0 Documentation — Official RealSense programming guide
- realsense-ros (ROS 2 wrapper) — ROS 2 driver for RealSense cameras
- ZoeDepth: Zero-shot Metric Depth, Bhat et al. 2023 — Metric depth from a single image
- Depth Pro: Sharp Monocular Metric Depth, Apple 2024 — High-quality metric depth
- KITTI Stereo Dataset — Benchmark stereo dataset with ground truth
- ROS 2 sensor_msgs — Image — Standard image message type