Depth Estimation

Depth perception — the ability to measure the distance from the camera to every point in a scene — is one of the most fundamental capabilities a robot needs. Without depth, a robot sees the world as a flat photograph and cannot reason about the 3D structure of its environment. With depth, a robot can navigate around obstacles, plan manipulation trajectories, and build maps of its surroundings.

This tutorial covers the four major families of depth sensing: stereo vision, structured light (RGB-D cameras), Time-of-Flight (ToF), and monocular depth estimation. We provide working Python code, mathematical foundations, and a comparison to help you choose the right sensor for your robot.

Learning Objectives

After completing this tutorial, you will be able to:

  • Understand the different depth sensing modalities and how each works
  • Compare stereo, structured light, ToF, and monocular approaches in terms of range, accuracy, cost, and operating conditions
  • Implement stereo depth estimation using OpenCV (block matching and SGM)
  • Generate point clouds from depth images using camera intrinsics
  • Use modern monocular depth estimation models (MiDaS, Depth Anything)
  • Integrate depth cameras with ROS 2

1. Why Depth Matters in Robotics

A standard RGB camera captures a 2D projection of the 3D world — all depth information is lost. Yet almost every robotic task requires some form of depth reasoning:

| Task | Why Depth Is Needed |
| --- | --- |
| Navigation | Detect and avoid obstacles at varying distances |
| Manipulation | Compute 3D positions of objects for grasp planning |
| Mapping | Build 3D occupancy maps or point cloud maps |
| Inspection | Measure dimensions, detect surface defects |
| Human-Robot Interaction | Track body pose in 3D for safe collaboration |

2D vs 3D Perception

   2D Image (RGB)                  3D Perception (RGB-D)
  +------------------+            +------------------+
  |  .  .  .  .  .   |            |  .  .  .  .  .   |
  |  .  BOX  .  .    |            |  .  BOX(1.2m) .  |
  |  .  .  .  .  .   |     vs     |  .  .  .  .  .   |
  |  ROBOT  .  .     |            |  ROBOT  .  .     |
  |  .  .  .  .  .   |            |  .  .  .  .  .   |
  +------------------+            +------------------+
  "There is a box."               "The box is at (0.3, 0, 1.2) m"

Depth turns perception into measurement.


2. Stereo Vision

Stereo vision uses two cameras separated by a known baseline \(B\) to estimate depth through triangulation. This is inspired by human binocular vision.

2.1 Epipolar Geometry

Given two images of the same scene taken from different viewpoints, the geometric relationship between them is captured by two matrices.

Fundamental Matrix \(F\): relates corresponding points \(\mathbf{x}\) and \(\mathbf{x'}\) in pixel coordinates:

\[\mathbf{x'}^T F \mathbf{x} = 0\]

where \(F\) is a \(3 \times 3\) matrix of rank 2 encoding the epipolar geometry.

Essential Matrix \(E\): same relationship but in calibrated (normalized) coordinates:

\[\mathbf{x'}_n^T E \mathbf{x}_n = 0\]

The essential matrix encodes the rotation \(R\) and translation \(t\) between the two cameras:

\[E = [t]_\times R\]

where \([t]_\times\) is the skew-symmetric matrix of the translation vector:

\[[t]_\times = \begin{pmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{pmatrix}\]

The relationship between \(F\) and \(E\) via the camera intrinsics \(K\) is:

\[E = K'^T F K\]

Epipolar Constraint: For any point in the left image, its corresponding point in the right image must lie along a specific line — the epipolar line. This reduces the search for correspondences from 2D to 1D.

  Left Image          Right Image
  +--------+          +--------+
  |   x    |          |        |
  |        |  ----->  |epiline |
  |        |          |  x'    |
  +--------+          +--------+
  Point x maps to     x' must lie on
  some point          the epipolar line
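
As a minimal sketch, both matrices can be estimated from point correspondences with OpenCV. Here the 3D points, camera pose, and intrinsics are synthetic placeholders so the example runs standalone; in practice pts_left and pts_right would come from a feature matcher (e.g., ORB + BFMatcher).

import cv2
import numpy as np

def project(P, X):
    """Project Nx3 world points through a 3x4 projection matrix P."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = (P @ Xh.T).T
    return x[:, :2] / x[:, 2:3]

# Synthetic scene: random 3D points seen from two poses (illustrative values)
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
rng = np.random.default_rng(0)
pts3d = rng.uniform([-2, -2, 4], [2, 2, 10], size=(100, 3))
R_true = cv2.Rodrigues(np.array([0.0, 0.05, 0.0]))[0]  # small yaw
t_true = np.array([[-0.5], [0.0], [0.0]])              # 0.5 m baseline

pts_left = project(K @ np.hstack([np.eye(3), np.zeros((3, 1))]), pts3d)
pts_right = project(K @ np.hstack([R_true, t_true]), pts3d)

# Fundamental matrix from pixel correspondences (RANSAC rejects outliers)
F, _ = cv2.findFundamentalMat(pts_left, pts_right, cv2.FM_RANSAC)

# Essential matrix using the intrinsics, then decompose into R and t
E, _ = cv2.findEssentialMat(pts_left, pts_right, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, pts_left, pts_right, K)
print("Recovered R:\n", R)
print("Recovered t (up to scale):", t.ravel())

Note that cv2.recoverPose returns the translation only up to scale; the metric baseline must come from calibration.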

2.2 Stereo Rectification

Before computing disparity, we rectify both images so that epipolar lines become horizontal scanlines. This simplifies the correspondence search to a 1D horizontal search.

Given the intrinsics and the relative pose between the two cameras, cv2.stereoRectify computes rotation matrices \(R_1, R_2\) and new projection matrices \(P_1, P_2\) that align both images.

  Before Rectification          After Rectification
  +--------+  +--------+       +--------+  +--------+
  | /      |  |  \     |       | -------|  |--------|
  |   x    |  |     x' |  -->  |   x    |  |   x'   |
  |  /     |  |   \    |       | -------|  |--------|
  +--------+  +--------+       +--------+  +--------+
  Epipolar lines are           Epipolar lines are
  slanted in both images       horizontal scanlines
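
A minimal sketch of this step, assuming calibration results are already available (intrinsics K1, K2, distortion coefficients D1, D2, and the relative pose R, T, e.g. from cv2.stereoCalibrate); the numeric values and file names below are placeholders.

import cv2
import numpy as np

# Placeholder calibration (replace with your cv2.stereoCalibrate outputs)
K1 = K2 = np.array([[700.0, 0.0, 320.0],
                    [0.0, 700.0, 240.0],
                    [0.0, 0.0, 1.0]])
D1 = D2 = np.zeros(5)
R = np.eye(3)                          # relative rotation
T = np.array([[-0.1], [0.0], [0.0]])   # 10 cm baseline along x
image_size = (640, 480)

# Rotations R1, R2 and projections P1, P2 that align both image planes
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(
    K1, D1, K2, D2, image_size, R, T, alpha=0)

# Precompute remap tables, then warp each raw image onto the rectified plane
map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, image_size, cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, image_size, cv2.CV_32FC1)
rect_left = cv2.remap(cv2.imread("left_raw.png"), map1x, map1y, cv2.INTER_LINEAR)
rect_right = cv2.remap(cv2.imread("right_raw.png"), map2x, map2y, cv2.INTER_LINEAR)

The returned Q matrix can later be passed to cv2.reprojectImageTo3D to turn a disparity map directly into 3D points.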

2.3 Block Matching & SGM

Block Matching (BM) finds correspondences by comparing small windows (blocks) around each pixel. It is fast but can produce noisy results.

Semi-Global Matching (SGM) improves on BM by enforcing a smoothness constraint along multiple scanline directions. It produces much cleaner disparity maps at the cost of more computation.

import cv2
import numpy as np

# Load rectified stereo pair (grayscale)
img_left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
img_right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# --- Block Matching ---
stereo_bm = cv2.StereoBM_create(
    numDisparities=128,   # Must be divisible by 16
    blockSize=15          # Odd number, typically 5-21
)
disparity_bm = stereo_bm.compute(img_left, img_right)

# --- Semi-Global Matching (SGM) ---
stereo_sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,       # Must be divisible by 16
    blockSize=5,
    P1=8 * 3 * 5 ** 2,       # Penalty for disparity change of 1
    P2=32 * 3 * 5 ** 2,      # Penalty for disparity change > 1
    disp12MaxDiff=1,
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=32,
    preFilterCap=63,
    mode=cv2.STEREO_SGBM_MODE_SGBM_3WAY
)
disparity_sgbm = stereo_sgbm.compute(img_left, img_right)
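# Note: compute() returns fixed-point disparity values scaled by 16;
# divide by 16.0 to recover true sub-pixel disparities.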

# Normalize for visualization (16-bit signed -> 8-bit unsigned)
disp_vis = cv2.normalize(
    disparity_sgbm, None, alpha=0, beta=255,
    norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8U
)
cv2.imwrite("disparity.png", disp_vis)
print("Disparity map saved. Shape:", disparity_sgbm.shape)

2.4 Depth from Disparity

Once we have the disparity map, computing depth is straightforward using the formula:

\[Z = \frac{f \cdot B}{d}\]

where:

  • \(Z\) is the depth (distance) in meters
  • \(f\) is the focal length in pixels
  • \(B\) is the baseline (distance between the two cameras) in meters
  • \(d\) is the disparity (horizontal pixel shift) in pixels
import cv2
import numpy as np

def compute_depth_map(left_path, right_path, focal_length, baseline,
                      num_disparities=128, block_size=5):
    """
    Compute a depth map from a rectified stereo pair.

    Parameters
    ----------
    left_path : str       Path to left rectified image
    right_path : str      Path to right rectified image
    focal_length : float  Focal length in pixels
    baseline : float      Stereo baseline in meters

    Returns
    -------
    depth_map : ndarray   Depth in meters (float32)
    """
    # Load images
    img_left = cv2.imread(left_path, cv2.IMREAD_GRAYSCALE)
    img_right = cv2.imread(right_path, cv2.IMREAD_GRAYSCALE)

    # Compute disparity using SGM
    stereo = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=num_disparities,
        blockSize=block_size,
        P1=8 * 3 * block_size ** 2,
        P2=32 * 3 * block_size ** 2,
        disp12MaxDiff=1,
        uniquenessRatio=10,
        speckleWindowSize=100,
        speckleRange=32,
        mode=cv2.STEREO_SGBM_MODE_SGBM_3WAY
    )
    disparity = stereo.compute(img_left, img_right).astype(np.float32) / 16.0

    # Avoid division by zero
    disparity[disparity <= 0] = 0.1

    # Z = f * B / d
    depth_map = (focal_length * baseline) / disparity

    # Cap maximum depth to 10 meters
    depth_map[depth_map > 10.0] = 10.0

    return depth_map


# Example usage
if __name__ == "__main__":
    focal_length = 718.856   # Example: KITTI dataset
    baseline = 0.54          # meters

    depth = compute_depth_map("left.png", "right.png", focal_length, baseline)
    print(f"Depth map shape: {depth.shape}")
    print(f"Depth range: {depth.min():.2f} - {depth.max():.2f} meters")

    # Visualize
    depth_vis = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX)
    cv2.imwrite("depth_map.png", depth_vis.astype(np.uint8))

3. Structured Light (RGB-D Cameras)

Structured light cameras project a known pattern (infrared dots or stripes) onto the scene and observe how the pattern deforms on surfaces. By analyzing the deformation, the camera computes depth at each pixel.

3.1 Kinect / RealSense / Azure Kinect

The most popular structured light and RGB-D sensors for robotics:

| Sensor | Technology | Resolution (Depth) | Range | FPS | Price (USD) | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Kinect v1 | Structured light (IR) | 640×480 | 0.5–4.5 m | 30 | Discontinued | Classic, large community |
| Kinect v2 | ToF | 512×424 | 0.5–4.5 m | 30 | Discontinued | Better accuracy |
| Intel RealSense D435i | Active IR stereo | 1280×720 | 0.2–10 m | 90 | ~$250 | IMU built-in, ROS 2 support |
| Intel RealSense D455 | Active IR stereo | 1280×720 | 0.4–6 m | 90 | ~$350 | Larger baseline, better accuracy |
| Azure Kinect DK | ToF (iToF) | 1024×1024 | 0.5–5.46 m | 30 | ~$400 | Body tracking SDK |
| Orbbec Femto Mega | ToF | 1024×1024 | 0.2–5 m | 30 | ~$350 | OpenNI compatible |

Best Choice for ROS 2

For ROS 2-based robotics projects, the Intel RealSense D435i is the most popular choice due to its native realsense2_camera ROS 2 driver, built-in IMU, and good balance of range and price.

3.2 Depth Image Processing

Intel provides the pyrealsense2 library for direct access to RealSense cameras:

import pyrealsense2 as rs
import numpy as np
import cv2

def stream_realsense():
    """Stream and display aligned color + depth from a RealSense camera."""

    # Configure streams
    pipeline = rs.pipeline()
    config = rs.config()
    config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
    config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)

    # Start streaming
    profile = pipeline.start(config)

    # Get depth scale (meters per unit)
    depth_sensor = profile.get_device().first_depth_sensor()
    depth_scale = depth_sensor.get_depth_scale()
    print(f"Depth Scale: {depth_scale} meters/unit")

    # Align depth to color
    align_to = rs.stream.color
    align = rs.align(align_to)

    try:
        while True:
            # Wait for frames
            frames = pipeline.wait_for_frames()

            # Align depth frame to color frame
            aligned_frames = align.process(frames)
            depth_frame = aligned_frames.get_depth_frame()
            color_frame = aligned_frames.get_color_frame()

            if not depth_frame or not color_frame:
                continue

            # Convert to numpy arrays
            depth_image = np.asanyarray(depth_frame.get_data())
            color_image = np.asanyarray(color_frame.get_data())

            # Depth in meters
            depth_meters = depth_image * depth_scale

            # Apply colormap to depth for visualization
            depth_colormap = cv2.applyColorMap(
                cv2.convertScaleAbs(depth_image, alpha=0.03),
                cv2.COLORMAP_JET
            )

            # Stack color and depth side by side
            images = np.hstack((color_image, depth_colormap))

            cv2.imshow("Color + Depth", images)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

    finally:
        pipeline.stop()
        cv2.destroyAllWindows()


if __name__ == "__main__":
    stream_realsense()

4. Time-of-Flight (ToF)

Time-of-Flight cameras actively illuminate the scene with modulated infrared light and measure the round-trip time (or phase shift) of the light to compute depth.

Principle

There are two main ToF approaches:

  1. Direct ToF (dToF): Measures the actual time it takes for a light pulse to travel to the surface and back.
\[d = \frac{c \cdot \Delta t}{2}\]

where \(c\) is the speed of light and \(\Delta t\) is the round-trip time.

  2. Indirect ToF (iToF): Emits continuously modulated light and measures the phase shift \(\phi\) of the returned signal:
\[d = \frac{c \cdot \phi}{4 \pi f_{mod}}\]

where \(f_{mod}\) is the modulation frequency.
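
Because the measured phase wraps at \(2\pi\), an iToF sensor has a maximum unambiguous range of \(d_{max} = c / (2 f_{mod})\). A quick numeric sketch (the 30 MHz modulation frequency is an illustrative, typical value):

import math

c = 3.0e8      # speed of light, m/s
f_mod = 30e6   # modulation frequency: 30 MHz (illustrative)

# Phase wraps at 2*pi, so the unambiguous range is c / (2 * f_mod)
d_max = c / (2 * f_mod)
print(f"Unambiguous range at 30 MHz: {d_max:.2f} m")    # 5.00 m

# Depth for a measured phase shift of pi/2: d = c * phi / (4 * pi * f_mod)
phi = math.pi / 2
d = c * phi / (4 * math.pi * f_mod)
print(f"Depth at phase pi/2: {d:.2f} m")                # 1.25 m

Raising \(f_{mod}\) improves depth precision but shrinks the unambiguous range, which is one reason consumer iToF sensors are limited to a few meters.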

Advantages and Limitations

Advantages:

  • Dense depth at full frame rate; no baseline needed; works on low-texture surfaces
  • Compact: single sensor, small form factor

Limitations:

  • Multi-path interference: reflective or translucent surfaces cause errors
  • Limited range: typically 0.5–5 m for consumer sensors
  • Outdoor sensitivity: strong sunlight saturates the IR sensor
  • Low resolution: typically lower than stereo or structured light

5. Monocular Depth Estimation

Recent deep learning methods can estimate depth from a single RGB image. While historically these methods produced only relative (scale-invariant) depth, modern models now achieve metric depth estimation.

5.1 MiDaS / Depth Anything

MiDaS (Ranftl et al., 2020) and Depth Anything (Yang et al., 2024) are two widely used families of monocular depth estimation models. MiDaS is available via torch.hub; Depth Anything V2 checkpoints are published on the Hugging Face Hub.

import torch
import cv2
import numpy as np
import matplotlib.pyplot as plt


def estimate_depth_midas(image_path):
    """
    Estimate depth from a single image using MiDaS v3.1 (DPT-Large).

    Parameters
    ----------
    image_path : str   Path to the input RGB image

    Returns
    -------
    depth : ndarray    Relative depth map (float32, higher = closer)
    """
    # Load MiDaS model from torch.hub
    model_type = "DPT_Large"   # Options: DPT_Large, DPT_Hybrid, MiDaS_small
    midas = torch.hub.load("intel-isl/MiDaS", model_type)

    # Move to GPU if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    midas.to(device)
    midas.eval()

    # Load transforms
    midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
    if model_type in ["DPT_Large", "DPT_Hybrid"]:
        transform = midas_transforms.dpt_transform
    else:
        transform = midas_transforms.small_transform

    # Load and preprocess image
    img = cv2.imread(image_path)
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    input_batch = transform(img_rgb).to(device)

    # Inference
    with torch.no_grad():
        prediction = midas(input_batch)

        # Resize to original resolution
        prediction = torch.nn.functional.interpolate(
            prediction.unsqueeze(1),
            size=img_rgb.shape[:2],
            mode="bicubic",
            align_corners=False,
        ).squeeze()

    depth = prediction.cpu().numpy()
    return depth


def estimate_depth_anything(image_path):
    """
    Estimate depth using Depth Anything V2 (small model).

    Loaded here through the Hugging Face transformers pipeline,
    which ships ready-made Depth Anything V2 checkpoints.

    Parameters
    ----------
    image_path : str   Path to the input RGB image

    Returns
    -------
    depth : ndarray    Relative depth map (float32)
    """
    from PIL import Image
    from transformers import pipeline

    pipe = pipeline(
        task="depth-estimation",
        model="depth-anything/Depth-Anything-V2-Small-hf"
    )

    # The pipeline handles preprocessing and postprocessing internally
    result = pipe(Image.open(image_path))
    depth = np.array(result["depth"], dtype=np.float32)
    return depth


if __name__ == "__main__":
    image_path = "test_image.jpg"

    # MiDaS
    depth_midas = estimate_depth_midas(image_path)
    print(f"MiDaS depth shape: {depth_midas.shape}")
    print(f"MiDaS depth range: {depth_midas.min():.2f} - {depth_midas.max():.2f}")

    # Visualize
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.imshow(cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB))
    plt.title("Input Image")
    plt.axis("off")

    plt.subplot(1, 2, 2)
    plt.imshow(depth_midas, cmap="plasma")
    plt.title("MiDaS Depth (relative)")
    plt.colorbar(fraction=0.046)
    plt.axis("off")

    plt.tight_layout()
    plt.savefig("depth_estimation_result.png", dpi=150)
    plt.show()

5.2 Metric Depth Estimation

MiDaS and Depth Anything produce relative depth — the values are not in meters. For robotics, we often need metric depth (in meters). Modern approaches include:

| Model | Metric? | Range | Notes |
| --- | --- | --- | --- |
| MiDaS v3.1 | Relative | N/A | Scale-invariant |
| Depth Anything V2 | Relative | N/A | Use with metric head |
| ZoeDepth | Metric (indoor) | 0–10 m | Combines MiDaS with metric bins |
| UniDepth | Metric | 0.1–100 m | Camera-agnostic metric depth |
| Depth Pro (Apple) | Metric | 0.1–250 m | Sharp boundaries, zero-shot |
| Metric3D v2 | Metric | 0–100 m | Canonical camera transform |

To convert relative depth to metric depth when you have even a few ground-truth depth values, you can fit a simple affine model:

import numpy as np

def relative_to_metric(rel_depth, gt_depth, mask=None):
    """
    Convert relative depth to metric depth using affine alignment.

    Parameters
    ----------
    rel_depth : ndarray   Relative depth map (H, W)
    gt_depth  : ndarray   Ground-truth metric depth (H, W), 0 = invalid
    mask      : ndarray   Boolean mask of valid ground-truth pixels

    Returns
    -------
    metric_depth : ndarray   Metric depth in meters (H, W)
    """
    if mask is None:
        mask = gt_depth > 0

    # Flatten valid pixels
    rel_vals = rel_depth[mask].reshape(-1, 1)
    gt_vals = gt_depth[mask].reshape(-1, 1)

    # Solve: gt = a * rel + b  (least squares)
    A = np.hstack([rel_vals, np.ones_like(rel_vals)])
    result = np.linalg.lstsq(A, gt_vals, rcond=None)
    a, b = result[0].flatten()

    # Apply affine transform
    metric_depth = a * rel_depth + b
    metric_depth = np.clip(metric_depth, 0, None)

    print(f"Affine fit: scale={a:.4f}, shift={b:.4f}")
    return metric_depth

6. Depth Sensor Comparison

| Criterion | Stereo | Structured Light | ToF | Monocular (DL) |
| --- | --- | --- | --- | --- |
| Range | 0.5 m – ∞ | 0.2–10 m | 0.5–5 m | 0.1–250 m* |
| Accuracy | Medium | High | Medium–High | Low–Medium |
| Resolution | High (matches camera) | 640×480 typical | 320×240 – 1024×1024 | Matches input image |
| Frame Rate | 30–60 fps | 30 fps | 30–60 fps | 1–30 fps (GPU dependent) |
| Cost | Low (two cameras) | Medium ($200–400) | Medium–High ($300–500) | Free (software only) |
| Outdoor | ✅ Works well | ❌ IR pattern overwhelmed | ❌ Sunlight saturates IR | ✅ Works well |
| Texture Dependence | ❌ Needs texture | ✅ Active pattern | ✅ Active illumination | ✅ Learns from data |
| Power | Low | Medium (IR projector) | Medium | High (GPU inference) |
| Latency | ~30 ms | ~30 ms | ~30 ms | 50–500 ms |
| Compact | ❌ Needs baseline | ✅ Small | ✅ Small | ✅ Single camera |

*With metric depth models like Depth Pro.


7. Point Cloud Generation from Depth

A depth image plus camera intrinsics lets you generate a 3D point cloud. Each pixel \((u, v)\) with depth \(Z\) maps to a 3D point:

\[X = \frac{(u - c_x) \cdot Z}{f_x}, \quad Y = \frac{(v - c_y) \cdot Z}{f_y}\]

where \(f_x, f_y\) are focal lengths and \(c_x, c_y\) is the principal point.

import numpy as np

def depth_to_point_cloud(depth_image, intrinsic_matrix, color_image=None,
                         depth_scale=1.0, stride=1):
    """
    Convert a depth image to a 3D point cloud.

    Parameters
    ----------
    depth_image : ndarray      Depth image (H, W), uint16 or float32
    intrinsic_matrix : ndarray  3x3 camera intrinsic matrix K
    color_image : ndarray       Optional RGB image (H, W, 3)
    depth_scale : float         Scale factor to convert depth to meters
    stride : int                Downsample factor (1 = every pixel)

    Returns
    -------
    points : ndarray    (N, 3) XYZ coordinates in meters
    colors : ndarray    (N, 3) RGB values (0-255), or None
    """
    fx = intrinsic_matrix[0, 0]
    fy = intrinsic_matrix[1, 1]
    cx = intrinsic_matrix[0, 2]
    cy = intrinsic_matrix[1, 2]

    # Create pixel coordinate grid
    h, w = depth_image.shape
    u, v = np.meshgrid(np.arange(0, w, stride),
                       np.arange(0, h, stride))

    # Get depth values at grid points
    depth = depth_image[::stride, ::stride].astype(np.float32) * depth_scale

    # Mask out invalid depth (0 or negative)
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy

    points = np.stack([x, y, z], axis=-1)

    # Extract colors if provided
    colors = None
    if color_image is not None:
        colors = color_image[::stride, ::stride][valid]

    return points, colors


def save_point_cloud_ply(filename, points, colors=None):
    """
    Save point cloud to PLY file format.

    Parameters
    ----------
    filename : str      Output PLY file path
    points : ndarray    (N, 3) XYZ coordinates
    colors : ndarray    (N, 3) RGB values (optional)
    """
    n_points = points.shape[0]
    has_color = colors is not None

    with open(filename, 'w') as f:
        # PLY header
        f.write("ply\n")
        f.write("format ascii 1.0\n")
        f.write(f"element vertex {n_points}\n")
        f.write("property float x\n")
        f.write("property float y\n")
        f.write("property float z\n")
        if has_color:
            f.write("property uchar red\n")
            f.write("property uchar green\n")
            f.write("property uchar blue\n")
        f.write("end_header\n")

        # Write points
        for i in range(n_points):
            line = f"{points[i, 0]:.4f} {points[i, 1]:.4f} {points[i, 2]:.4f}"
            if has_color:
                line += f" {int(colors[i, 0])} {int(colors[i, 1])} {int(colors[i, 2])}"
            f.write(line + "\n")

    print(f"Saved {n_points} points to {filename}")


if __name__ == "__main__":
    # Example: synthetic depth image with known intrinsics
    H, W = 480, 640
    depth_img = np.random.randint(500, 5000, (H, W)).astype(np.uint16)

    K = np.array([
        [525.0,   0.0, 319.5],
        [  0.0, 525.0, 239.5],
        [  0.0,   0.0,   1.0]
    ])

    points, colors = depth_to_point_cloud(
        depth_img, K, depth_scale=0.001, stride=4
    )
    print(f"Generated {points.shape[0]} 3D points")
    print(f"X range: [{points[:, 0].min():.2f}, {points[:, 0].max():.2f}] m")
    print(f"Y range: [{points[:, 1].min():.2f}, {points[:, 1].max():.2f}] m")
    print(f"Z range: [{points[:, 2].min():.2f}, {points[:, 2].max():.2f}] m")

    save_point_cloud_ply("output.ply", points)

8. ROS 2 Integration

Depth cameras in ROS 2 publish data on standardized topics using standard message types.

Common Topics

| Topic | Message Type | Description |
| --- | --- | --- |
| /camera/color/image_raw | sensor_msgs/Image | RGB image |
| /camera/depth/image_rect_raw | sensor_msgs/Image | Depth image (16-bit, mm or raw) |
| /camera/depth/color/points | sensor_msgs/PointCloud2 | Colored point cloud |
| /camera/aligned_depth_to_color/image_raw | sensor_msgs/Image | Depth aligned to color frame |
| /camera/color/camera_info | sensor_msgs/CameraInfo | Camera intrinsics (K, distortion) |
| /camera/imu | sensor_msgs/Imu | IMU data (for D435i) |

Launch RealSense in ROS 2

# Install the RealSense ROS 2 wrapper
sudo apt install ros-${ROS_DISTRO}-realsense2-camera

# Launch with default settings
ros2 launch realsense2_camera rs_launch.py

# Launch with specific parameters
ros2 launch realsense2_camera rs_launch.py \
    depth_module.depth_profile:=640x480x30 \
    rgb_camera.color_profile:=640x480x30 \
    align_depth.enable:=true \
    pointcloud.enable:=true

Subscribe to Depth in Python (ROS 2)

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image, CameraInfo
from cv_bridge import CvBridge
import numpy as np


class DepthSubscriber(Node):
    """Subscribe to depth image and camera info from a RealSense camera."""

    def __init__(self):
        super().__init__("depth_subscriber")
        self.bridge = CvBridge()
        self.K = None  # Camera intrinsic matrix

        # Subscribe to camera info (once)
        self.info_sub = self.create_subscription(
            CameraInfo,
            "/camera/aligned_depth_to_color/camera_info",
            self.camera_info_callback,
            10
        )

        # Subscribe to aligned depth image
        self.create_subscription(
            Image,
            "/camera/aligned_depth_to_color/image_raw",
            self.depth_callback,
            10
        )

    def camera_info_callback(self, msg):
        """Extract camera intrinsic matrix from CameraInfo."""
        self.K = np.array(msg.k).reshape(3, 3)
        self.get_logger().info(f"Camera intrinsics:\n{self.K}")
        # Unsubscribe after receiving the intrinsics once
        self.destroy_subscription(self.info_sub)

    def depth_callback(self, msg):
        """Process incoming depth image."""
        # Convert ROS Image to OpenCV format
        depth_image = self.bridge.imgmsg_to_cv2(msg, desired_encoding="passthrough")

        # Depth values are in millimeters (uint16) — convert to meters
        depth_meters = depth_image.astype(np.float32) / 1000.0

        # Print some statistics
        valid = depth_meters > 0
        if valid.any():
            self.get_logger().info(
                f"Depth — min: {depth_meters[valid].min():.2f} m, "
                f"max: {depth_meters[valid].max():.2f} m, "
                f"median: {np.median(depth_meters[valid]):.2f} m"
            )


def main():
    rclpy.init()
    node = DepthSubscriber()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()


if __name__ == "__main__":
    main()

Useful ROS 2 Commands for Depth Cameras

# List all active topics
ros2 topic list | grep camera

# View depth image in RViz2
rviz2
# Add -> By topic -> /camera/depth/image_rect_raw

# Echo camera info
ros2 topic echo /camera/color/camera_info

# Check frame rate
ros2 topic hz /camera/aligned_depth_to_color/image_raw

# Record a rosbag with depth data
ros2 bag record /camera/color/image_raw \
                /camera/aligned_depth_to_color/image_raw \
                /camera/color/camera_info

Exercises

Exercise 1: Stereo Depth on KITTI

Download a stereo pair from the KITTI dataset. Use cv2.StereoSGBM_create to compute the disparity map and convert it to depth. Visualize the result as a colormap.

Hint

Use focal_length = 718.856 and baseline = 0.54 m for KITTI. Load images with cv2.imread in grayscale mode.

Exercise 2: RealSense Point Cloud

Connect an Intel RealSense D435i and use pyrealsense2 to capture a depth frame. Generate a colored point cloud and visualize it with Open3D.

import open3d as o3d

# After generating points and colors from Exercise concept above:
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points)
pcd.colors = o3d.utility.Vector3dVector(colors / 255.0)
o3d.visualization.draw_geometries([pcd])

Exercise 3: Monocular Depth Comparison

Download a single image and run both MiDaS and Depth Anything on it. Compare the depth maps visually. Are there regions where one model outperforms the other? Discuss.

Exercise 4: Depth-Based Obstacle Detection

Write a ROS 2 node that subscribes to a depth camera topic, segments pixels closer than 0.5 meters, and publishes a warning message when an obstacle is detected. Test with a bag file.


References