Depth Estimation

Depth perception — the ability to measure the distance from the camera to every point in a scene — is one of the most fundamental capabilities a robot needs. Without depth, a robot sees the world as a flat photograph and cannot reason about the 3D structure of its environment. With depth, a robot can navigate around obstacles, plan manipulation trajectories, and build maps of its surroundings.

This tutorial covers the four major families of depth sensing: stereo vision, structured light (RGB-D cameras), Time-of-Flight (ToF), and monocular depth estimation. We provide working Python code, mathematical foundations, and a comparison to help you choose the right sensor for your robot.

Learning Objectives

After completing this tutorial, you will be able to:

  • Understand the different depth sensing modalities and how each works
  • Compare stereo, structured light, ToF, and monocular approaches in terms of range, accuracy, cost, and operating conditions
  • Implement stereo depth estimation using OpenCV (block matching and SGM)
  • Generate point clouds from depth images using camera intrinsics
  • Use modern monocular depth estimation models (MiDaS, Depth Anything)
  • Integrate depth cameras with ROS 2

1. Why Depth Matters in Robotics

A standard RGB camera captures a 2D projection of the 3D world — all depth information is lost. Yet almost every robotic task requires some form of depth reasoning:

| Task | Why Depth Is Needed |
| --- | --- |
| Navigation | Detect and avoid obstacles at varying distances |
| Manipulation | Compute 3D positions of objects for grasp planning |
| Mapping | Build 3D occupancy maps or point cloud maps |
| Inspection | Measure dimensions, detect surface defects |
| Human-Robot Interaction | Track body pose in 3D for safe collaboration |

2D vs 3D Perception

   2D Image (RGB)                  3D Perception (RGB-D)
  +------------------+            +------------------+
  |  .  .  .  .  .   |            |  .  .  .  .  .   |
  |  .  BOX  .  .    |            |  .  BOX(1.2m) .  |
  |  .  .  .  .  .   |     vs     |  .  .  .  .  .   |
  |  ROBOT  .  .     |            |  ROBOT  .  .     |
  |  .  .  .  .  .   |            |  .  .  .  .  .   |
  +------------------+            +------------------+
  "There is a box."               "The box is at (0.3, 0, 1.2) m"

Depth turns perception into measurement.


2. Stereo Vision

Stereo vision uses two cameras separated by a known baseline \(B\) to estimate depth through triangulation. This is inspired by human binocular vision.

2.1 Epipolar Geometry

Given two images of the same scene taken from different viewpoints, the geometric relationship between them is captured by two matrices.

Fundamental Matrix \(F\): relates corresponding points \(\mathbf{x}\) and \(\mathbf{x'}\) in pixel coordinates:

\[\mathbf{x'}^T F \mathbf{x} = 0\]

where \(F\) is a \(3 \times 3\) matrix of rank 2 encoding the epipolar geometry.

Essential Matrix \(E\): same relationship but in calibrated (normalized) coordinates:

\[\mathbf{x'}_n^T E \mathbf{x}_n = 0\]

The essential matrix encodes the rotation \(R\) and translation \(t\) between the two cameras:

\[E = [t]_\times R\]

where \([t]_\times\) is the skew-symmetric matrix of the translation vector:

\[[t]_\times = \begin{pmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{pmatrix}\]

The relationship between \(F\) and \(E\) via the camera intrinsics \(K\) is:

\[E = K'^T F K\]

Epipolar Constraint: For any point in the left image, its corresponding point in the right image must lie along a specific line — the epipolar line. This reduces the search for correspondences from 2D to 1D.

  Left Image          Right Image
  +--------+          +--------+
  |   x    |          |        |
  |        |  ----->  |epiline |
  |        |          |  x'    |
  +--------+          +--------+
  Point x maps to     x' must lie on
  some point          the epipolar line
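
As a minimal sketch, both matrices can be estimated from point correspondences with OpenCV. Here the 3D points, camera pose, and intrinsics are synthetic placeholders so the example runs standalone; in practice pts_left and pts_right would come from a feature matcher (e.g., ORB + BFMatcher).

import cv2
import numpy as np

def project(P, X):
    """Project Nx3 world points through a 3x4 projection matrix P."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = (P @ Xh.T).T
    return x[:, :2] / x[:, 2:3]

# Synthetic scene: random 3D points seen from two poses (illustrative values)
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
rng = np.random.default_rng(0)
pts3d = rng.uniform([-2, -2, 4], [2, 2, 10], size=(100, 3))
R_true = cv2.Rodrigues(np.array([0.0, 0.05, 0.0]))[0]  # small yaw
t_true = np.array([[-0.5], [0.0], [0.0]])              # 0.5 m baseline

pts_left = project(K @ np.hstack([np.eye(3), np.zeros((3, 1))]), pts3d)
pts_right = project(K @ np.hstack([R_true, t_true]), pts3d)

# Fundamental matrix from pixel correspondences (RANSAC rejects outliers)
F, _ = cv2.findFundamentalMat(pts_left, pts_right, cv2.FM_RANSAC)

# Essential matrix using the intrinsics, then decompose into R and t
E, _ = cv2.findEssentialMat(pts_left, pts_right, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, pts_left, pts_right, K)
print("Recovered R:\n", R)
print("Recovered t (up to scale):", t.ravel())

Note that cv2.recoverPose returns the translation only up to scale; the metric baseline must come from calibration.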

2.2 Stereo Rectification

Before computing disparity, we rectify both images so that epipolar lines become horizontal scanlines. This simplifies the correspondence search to a 1D horizontal search.

Given the intrinsics and the relative pose between the two cameras, cv2.stereoRectify computes rotation matrices \(R_1, R_2\) and new projection matrices \(P_1, P_2\) that align both images.

  Before Rectification          After Rectification
  +--------+  +--------+       +--------+  +--------+
  | /      |  |  \     |       | -------|  |--------|
  |   x    |  |     x' |  -->  |   x    |  |   x'   |
  |  /     |  |   \    |       | -------|  |--------|
  +--------+  +--------+       +--------+  +--------+
  Epipolar lines are           Epipolar lines are
  slanted in both images       horizontal scanlines
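
A minimal sketch of this step, assuming calibration results are already available (intrinsics K1, K2, distortion coefficients D1, D2, and the relative pose R, T, e.g. from cv2.stereoCalibrate); the numeric values and file names below are placeholders.

import cv2
import numpy as np

# Placeholder calibration (replace with your cv2.stereoCalibrate outputs)
K1 = K2 = np.array([[700.0, 0.0, 320.0],
                    [0.0, 700.0, 240.0],
                    [0.0, 0.0, 1.0]])
D1 = D2 = np.zeros(5)
R = np.eye(3)                          # relative rotation
T = np.array([[-0.1], [0.0], [0.0]])   # 10 cm baseline along x
image_size = (640, 480)

# Rotations R1, R2 and projections P1, P2 that align both image planes
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(
    K1, D1, K2, D2, image_size, R, T, alpha=0)

# Precompute remap tables, then warp each raw image onto the rectified plane
map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, image_size, cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, image_size, cv2.CV_32FC1)
rect_left = cv2.remap(cv2.imread("left_raw.png"), map1x, map1y, cv2.INTER_LINEAR)
rect_right = cv2.remap(cv2.imread("right_raw.png"), map2x, map2y, cv2.INTER_LINEAR)

The returned Q matrix can later be passed to cv2.reprojectImageTo3D to turn a disparity map directly into 3D points.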

2.3 Block Matching & SGM

Block Matching (BM) finds correspondences by comparing small windows (blocks) around each pixel. It is fast but can produce noisy results.

Semi-Global Matching (SGM) improves on BM by enforcing a smoothness constraint along multiple scanline directions. It produces much cleaner disparity maps at the cost of more computation.

import cv2
import numpy as np

# Load rectified stereo pair (grayscale)
img_left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
img_right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# --- Block Matching ---
stereo_bm = cv2.StereoBM_create(
    numDisparities=128,   # Must be divisible by 16
    blockSize=15          # Odd number, typically 5-21
)
disparity_bm = stereo_bm.compute(img_left, img_right)

# --- Semi-Global Matching (SGM) ---
stereo_sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,       # Must be divisible by 16
    blockSize=5,
    P1=8 * 3 * 5 ** 2,       # Penalty for disparity change of 1
    P2=32 * 3 * 5 ** 2,      # Penalty for disparity change > 1
    disp12MaxDiff=1,
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=32,
    preFilterCap=63,
    mode=cv2.STEREO_SGBM_MODE_SGBM_3WAY
)
disparity_sgbm = stereo_sgbm.compute(img_left, img_right)
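# Note: compute() returns fixed-point disparity values scaled by 16;
# divide by 16.0 to recover true sub-pixel disparities.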

# Normalize for visualization (16-bit signed -> 8-bit unsigned)
disp_vis = cv2.normalize(
    disparity_sgbm, None, alpha=0, beta=255,
    norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8U
)
cv2.imwrite("disparity.png", disp_vis)
print("Disparity map saved. Shape:", disparity_sgbm.shape)

2.4 Depth from Disparity

Once we have the disparity map, computing depth is straightforward using the formula:

\[Z = \frac{f \cdot B}{d}\]

where:

  • \(Z\) is the depth (distance) in meters
  • \(f\) is the focal length in pixels
  • \(B\) is the baseline (distance between the two cameras) in meters
  • \(d\) is the disparity (horizontal pixel shift) in pixels
import cv2
import numpy as np

def compute_depth_map(left_path, right_path, focal_length, baseline,
                      num_disparities=128, block_size=5):
    """
    Compute a depth map from a rectified stereo pair.

    Parameters
    ----------
    left_path : str       Path to left rectified image
    right_path : str      Path to right rectified image
    focal_length : float  Focal length in pixels
    baseline : float      Stereo baseline in meters

    Returns
    -------
    depth_map : ndarray   Depth in meters (float32)
    """
    # Load images
    img_left = cv2.imread(left_path, cv2.IMREAD_GRAYSCALE)
    img_right = cv2.imread(right_path, cv2.IMREAD_GRAYSCALE)

    # Compute disparity using SGM
    stereo = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=num_disparities,
        blockSize=block_size,
        P1=8 * 3 * block_size ** 2,
        P2=32 * 3 * block_size ** 2,
        disp12MaxDiff=1,
        uniquenessRatio=10,
        speckleWindowSize=100,
        speckleRange=32,
        mode=cv2.STEREO_SGBM_MODE_SGBM_3WAY
    )
    disparity = stereo.compute(img_left, img_right).astype(np.float32) / 16.0

    # Avoid division by zero
    disparity[disparity <= 0] = 0.1

    # Z = f * B / d
    depth_map = (focal_length * baseline) / disparity

    # Cap maximum depth to 10 meters
    depth_map[depth_map > 10.0] = 10.0

    return depth_map


# Example usage
if __name__ == "__main__":
    focal_length = 718.856   # Example: KITTI dataset
    baseline = 0.54          # meters

    depth = compute_depth_map("left.png", "right.png", focal_length, baseline)
    print(f"Depth map shape: {depth.shape}")
    print(f"Depth range: {depth.min():.2f} - {depth.max():.2f} meters")

    # Visualize
    depth_vis = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX)
    cv2.imwrite("depth_map.png", depth_vis.astype(np.uint8))

3. Structured Light (RGB-D Cameras)

Structured light cameras project a known pattern (infrared dots or stripes) onto the scene and observe how the pattern deforms on surfaces. By analyzing the deformation, the camera computes depth at each pixel.

3.1 Kinect / RealSense / Azure Kinect

The most popular structured light and RGB-D sensors for robotics:

| Sensor | Technology | Resolution (Depth) | Range | FPS | Price (USD) | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Kinect v1 | Structured light (IR) | 640×480 | 0.5–4.5 m | 30 | Discontinued | Classic, large community |
| Kinect v2 | ToF | 512×424 | 0.5–4.5 m | 30 | Discontinued | Better accuracy |
| Intel RealSense D435i | Active IR stereo | 1280×720 | 0.2–10 m | 90 | ~$250 | IMU built-in, ROS 2 support |
| Intel RealSense D455 | Active IR stereo | 1280×720 | 0.4–6 m | 90 | ~$350 | Larger baseline, better accuracy |
| Azure Kinect DK | ToF (iToF) | 1024×1024 | 0.5–5.46 m | 30 | ~$400 | Body tracking SDK |
| Orbbec Femto Mega | ToF | 1024×1024 | 0.2–5 m | 30 | ~$350 | OpenNI compatible |

Best Choice for ROS 2

For ROS 2-based robotics projects, the Intel RealSense D435i is the most popular choice due to its native realsense2_camera ROS 2 driver, built-in IMU, and good balance of range and price.

3.2 Depth Image Processing

Intel provides the pyrealsense2 library for direct access to RealSense cameras:

import pyrealsense2 as rs
import numpy as np
import cv2

def stream_realsense():
    """Stream and display aligned color + depth from a RealSense camera."""

    # Configure streams
    pipeline = rs.pipeline()
    config = rs.config()
    config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
    config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)

    # Start streaming
    profile = pipeline.start(config)

    # Get depth scale (meters per unit)
    depth_sensor = profile.get_device().first_depth_sensor()
    depth_scale = depth_sensor.get_depth_scale()
    print(f"Depth Scale: {depth_scale} meters/unit")

    # Align depth to color
    align_to = rs.stream.color
    align = rs.align(align_to)

    try:
        while True:
            # Wait for frames
            frames = pipeline.wait_for_frames()

            # Align depth frame to color frame
            aligned_frames = align.process(frames)
            depth_frame = aligned_frames.get_depth_frame()
            color_frame = aligned_frames.get_color_frame()

            if not depth_frame or not color_frame:
                continue

            # Convert to numpy arrays
            depth_image = np.asanyarray(depth_frame.get_data())
            color_image = np.asanyarray(color_frame.get_data())

            # Depth in meters
            depth_meters = depth_image * depth_scale

            # Apply colormap to depth for visualization
            depth_colormap = cv2.applyColorMap(
                cv2.convertScaleAbs(depth_image, alpha=0.03),
                cv2.COLORMAP_JET
            )

            # Stack color and depth side by side
            images = np.hstack((color_image, depth_colormap))

            cv2.imshow("Color + Depth", images)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

    finally:
        pipeline.stop()
        cv2.destroyAllWindows()


if __name__ == "__main__":
    stream_realsense()

4. Time-of-Flight (ToF)

Time-of-Flight cameras actively illuminate the scene with modulated infrared light and measure the round-trip time (or phase shift) of the light to compute depth.

Principle

There are two main ToF approaches:

  1. Direct ToF (dToF): Measures the actual time it takes for a light pulse to travel to the surface and back.
\[d = \frac{c \cdot \Delta t}{2}\]

where \(c\) is the speed of light and \(\Delta t\) is the round-trip time.

  2. Indirect ToF (iToF): Emits continuously modulated light and measures the phase shift \(\phi\) of the returned signal:
\[d = \frac{c \cdot \phi}{4 \pi f_{mod}}\]

where \(f_{mod}\) is the modulation frequency.
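
Because the measured phase wraps at \(2\pi\), an iToF sensor has a maximum unambiguous range of \(d_{max} = c / (2 f_{mod})\). A quick numeric sketch (the 30 MHz modulation frequency is an illustrative, typical value):

import math

c = 3.0e8      # speed of light, m/s
f_mod = 30e6   # modulation frequency: 30 MHz (illustrative)

# Phase wraps at 2*pi, so the unambiguous range is c / (2 * f_mod)
d_max = c / (2 * f_mod)
print(f"Unambiguous range at 30 MHz: {d_max:.2f} m")    # 5.00 m

# Depth for a measured phase shift of pi/2: d = c * phi / (4 * pi * f_mod)
phi = math.pi / 2
d = c * phi / (4 * math.pi * f_mod)
print(f"Depth at phase pi/2: {d:.2f} m")                # 1.25 m

Raising \(f_{mod}\) improves depth precision but shrinks the unambiguous range, which is one reason consumer iToF sensors are limited to a few meters.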

Advantages and Limitations

Advantages:

  • Dense depth at full frame rate; no baseline needed; works on low-texture surfaces
  • Compact: single sensor, small form factor

Limitations:

  • Multi-path interference: reflective or translucent surfaces cause errors
  • Limited range: typically 0.5–5 m for consumer sensors
  • Outdoor sensitivity: strong sunlight saturates the IR sensor
  • Low resolution: typically lower than stereo or structured light

5. Monocular Depth Estimation

Recent deep learning methods can estimate depth from a single RGB image. While historically these methods produced only relative (scale-invariant) depth, modern models now achieve metric depth estimation.

5.1 MiDaS / Depth Anything

MiDaS (Ranftl et al., 2020) and Depth Anything (Yang et al., 2024) are two widely used families of monocular depth estimation models. MiDaS is available via torch.hub; Depth Anything V2 checkpoints are published on the Hugging Face Hub.

import torch
import cv2
import numpy as np
import matplotlib.pyplot as plt


def estimate_depth_midas(image_path):
    """
    Estimate depth from a single image using MiDaS v3.1 (DPT-Large).

    Parameters
    ----------
    image_path : str   Path to the input RGB image

    Returns
    -------
    depth : ndarray    Relative depth map (float32, higher = closer)
    """
    # Load MiDaS model from torch.hub
    model_type = "DPT_Large"   # Options: DPT_Large, DPT_Hybrid, MiDaS_small
    midas = torch.hub.load("intel-isl/MiDaS", model_type)

    # Move to GPU if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    midas.to(device)
    midas.eval()

    # Load transforms
    midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
    if model_type in ["DPT_Large", "DPT_Hybrid"]:
        transform = midas_transforms.dpt_transform
    else:
        transform = midas_transforms.small_transform

    # Load and preprocess image
    img = cv2.imread(image_path)
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    input_batch = transform(img_rgb).to(device)

    # Inference
    with torch.no_grad():
        prediction = midas(input_batch)

        # Resize to original resolution
        prediction = torch.nn.functional.interpolate(
            prediction.unsqueeze(1),
            size=img_rgb.shape[:2],
            mode="bicubic",
            align_corners=False,
        ).squeeze()

    depth = prediction.cpu().numpy()
    return depth


def estimate_depth_anything(image_path):
    """
    Estimate depth using Depth Anything V2 (small model).

    Loaded here through the Hugging Face transformers pipeline,
    which ships ready-made Depth Anything V2 checkpoints.

    Parameters
    ----------
    image_path : str   Path to the input RGB image

    Returns
    -------
    depth : ndarray    Relative depth map (float32)
    """
    from PIL import Image
    from transformers import pipeline

    pipe = pipeline(
        task="depth-estimation",
        model="depth-anything/Depth-Anything-V2-Small-hf"
    )

    # The pipeline handles preprocessing and postprocessing internally
    result = pipe(Image.open(image_path))
    depth = np.array(result["depth"], dtype=np.float32)
    return depth


if __name__ == "__main__":
    image_path = "test_image.jpg"

    # MiDaS
    depth_midas = estimate_depth_midas(image_path)
    print(f"MiDaS depth shape: {depth_midas.shape}")
    print(f"MiDaS depth range: {depth_midas.min():.2f} - {depth_midas.max():.2f}")

    # Visualize
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.imshow(cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB))
    plt.title("Input Image")
    plt.axis("off")

    plt.subplot(1, 2, 2)
    plt.imshow(depth_midas, cmap="plasma")
    plt.title("MiDaS Depth (relative)")
    plt.colorbar(fraction=0.046)
    plt.axis("off")

    plt.tight_layout()
    plt.savefig("depth_estimation_result.png", dpi=150)
    plt.show()

5.2 Metric Depth Estimation

MiDaS and Depth Anything produce relative depth — the values are not in meters. For robotics, we often need metric depth (in meters). Modern approaches include:

| Model | Metric? | Range | Notes |
| --- | --- | --- | --- |
| MiDaS v3.1 | Relative | N/A | Scale-invariant |
| Depth Anything V2 | Relative | N/A | Use with metric head |
| ZoeDepth | Metric (indoor) | 0–10 m | Combines MiDaS with metric bins |
| UniDepth | Metric | 0.1–100 m | Camera-agnostic metric depth |
| Depth Pro (Apple) | Metric | 0.1–250 m | Sharp boundaries, zero-shot |
| Metric3D v2 | Metric | 0–100 m | Canonical camera transform |

To convert relative depth to metric depth when you have even a few ground-truth depth values, you can fit a simple affine model:

import numpy as np

def relative_to_metric(rel_depth, gt_depth, mask=None):
    """
    Convert relative depth to metric depth using affine alignment.

    Parameters
    ----------
    rel_depth : ndarray   Relative depth map (H, W)
    gt_depth  : ndarray   Ground-truth metric depth (H, W), 0 = invalid
    mask      : ndarray   Boolean mask of valid ground-truth pixels

    Returns
    -------
    metric_depth : ndarray   Metric depth in meters (H, W)
    """
    if mask is None:
        mask = gt_depth > 0

    # Flatten valid pixels
    rel_vals = rel_depth[mask].reshape(-1, 1)
    gt_vals = gt_depth[mask].reshape(-1, 1)

    # Solve: gt = a * rel + b  (least squares)
    A = np.hstack([rel_vals, np.ones_like(rel_vals)])
    result = np.linalg.lstsq(A, gt_vals, rcond=None)
    a, b = result[0].flatten()

    # Apply affine transform
    metric_depth = a * rel_depth + b
    metric_depth = np.clip(metric_depth, 0, None)

    print(f"Affine fit: scale={a:.4f}, shift={b:.4f}")
    return metric_depth

6. Depth Sensor Comparison

| Criterion | Stereo | Structured Light | ToF | Monocular (DL) |
| --- | --- | --- | --- | --- |
| Range | 0.5 m – ∞ | 0.2–10 m | 0.5–5 m | 0.1–250 m* |
| Accuracy | Medium | High | Medium–High | Low–Medium |
| Resolution | High (matches camera) | 640×480 typical | 320×240 – 1024×1024 | Matches input image |
| Frame Rate | 30–60 fps | 30 fps | 30–60 fps | 1–30 fps (GPU dependent) |
| Cost | Low (two cameras) | Medium ($200–400) | Medium–High ($300–500) | Free (software only) |
| Outdoor | ✅ Works well | ❌ IR pattern overwhelmed | ❌ Sunlight saturates IR | ✅ Works well |
| Texture Dependence | ❌ Needs texture | ✅ Active pattern | ✅ Active illumination | ✅ Learns from data |
| Power | Low | Medium (IR projector) | Medium | High (GPU inference) |
| Latency | ~30 ms | ~30 ms | ~30 ms | 50–500 ms |
| Compact | ❌ Needs baseline | ✅ Small | ✅ Small | ✅ Single camera |

*With metric depth models like Depth Pro.


7. Point Cloud Generation from Depth

A depth image plus camera intrinsics lets you generate a 3D point cloud. Each pixel \((u, v)\) with depth \(Z\) maps to a 3D point:

\[X = \frac{(u - c_x) \cdot Z}{f_x}, \quad Y = \frac{(v - c_y) \cdot Z}{f_y}\]

where \(f_x, f_y\) are focal lengths and \(c_x, c_y\) is the principal point.

import numpy as np

def depth_to_point_cloud(depth_image, intrinsic_matrix, color_image=None,
                         depth_scale=1.0, stride=1):
    """
    Convert a depth image to a 3D point cloud.

    Parameters
    ----------
    depth_image : ndarray      Depth image (H, W), uint16 or float32
    intrinsic_matrix : ndarray  3x3 camera intrinsic matrix K
    color_image : ndarray       Optional RGB image (H, W, 3)
    depth_scale : float         Scale factor to convert depth to meters
    stride : int                Downsample factor (1 = every pixel)

    Returns
    -------
    points : ndarray    (N, 3) XYZ coordinates in meters
    colors : ndarray    (N, 3) RGB values (0-255), or None
    """
    fx = intrinsic_matrix[0, 0]
    fy = intrinsic_matrix[1, 1]
    cx = intrinsic_matrix[0, 2]
    cy = intrinsic_matrix[1, 2]

    # Create pixel coordinate grid
    h, w = depth_image.shape
    u, v = np.meshgrid(np.arange(0, w, stride),
                       np.arange(0, h, stride))

    # Get depth values at grid points
    depth = depth_image[::stride, ::stride].astype(np.float32) * depth_scale

    # Mask out invalid depth (0 or negative)
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy

    points = np.stack([x, y, z], axis=-1)

    # Extract colors if provided
    colors = None
    if color_image is not None:
        colors = color_image[::stride, ::stride][valid]

    return points, colors


def save_point_cloud_ply(filename, points, colors=None):
    """
    Save point cloud to PLY file format.

    Parameters
    ----------
    filename : str      Output PLY file path
    points : ndarray    (N, 3) XYZ coordinates
    colors : ndarray    (N, 3) RGB values (optional)
    """
    n_points = points.shape[0]
    has_color = colors is not None

    with open(filename, 'w') as f:
        # PLY header
        f.write("ply\n")
        f.write("format ascii 1.0\n")
        f.write(f"element vertex {n_points}\n")
        f.write("property float x\n")
        f.write("property float y\n")
        f.write("property float z\n")
        if has_color:
            f.write("property uchar red\n")
            f.write("property uchar green\n")
            f.write("property uchar blue\n")
        f.write("end_header\n")

        # Write points
        for i in range(n_points):
            line = f"{points[i, 0]:.4f} {points[i, 1]:.4f} {points[i, 2]:.4f}"
            if has_color:
                line += f" {int(colors[i, 0])} {int(colors[i, 1])} {int(colors[i, 2])}"
            f.write(line + "\n")

    print(f"Saved {n_points} points to {filename}")


if __name__ == "__main__":
    # Example: synthetic depth image with known intrinsics
    H, W = 480, 640
    depth_img = np.random.randint(500, 5000, (H, W)).astype(np.uint16)

    K = np.array([
        [525.0,   0.0, 319.5],
        [  0.0, 525.0, 239.5],
        [  0.0,   0.0,   1.0]
    ])

    points, colors = depth_to_point_cloud(
        depth_img, K, depth_scale=0.001, stride=4
    )
    print(f"Generated {points.shape[0]} 3D points")
    print(f"X range: [{points[:, 0].min():.2f}, {points[:, 0].max():.2f}] m")
    print(f"Y range: [{points[:, 1].min():.2f}, {points[:, 1].max():.2f}] m")
    print(f"Z range: [{points[:, 2].min():.2f}, {points[:, 2].max():.2f}] m")

    save_point_cloud_ply("output.ply", points)

8. ROS 2 Integration

Depth cameras in ROS 2 publish data on standardized topics using standard message types.

Common Topics

| Topic | Message Type | Description |
| --- | --- | --- |
| /camera/color/image_raw | sensor_msgs/Image | RGB image |
| /camera/depth/image_rect_raw | sensor_msgs/Image | Depth image (16-bit, mm or raw) |
| /camera/depth/color/points | sensor_msgs/PointCloud2 | Colored point cloud |
| /camera/aligned_depth_to_color/image_raw | sensor_msgs/Image | Depth aligned to color frame |
| /camera/color/camera_info | sensor_msgs/CameraInfo | Camera intrinsics (K, distortion) |
| /camera/imu | sensor_msgs/Imu | IMU data (for D435i) |

Launch RealSense in ROS 2

# Install the RealSense ROS 2 wrapper
sudo apt install ros-${ROS_DISTRO}-realsense2-camera

# Launch with default settings
ros2 launch realsense2_camera rs_launch.py

# Launch with specific parameters
ros2 launch realsense2_camera rs_launch.py \
    depth_module.depth_profile:=640x480x30 \
    rgb_camera.color_profile:=640x480x30 \
    align_depth.enable:=true \
    pointcloud.enable:=true

Subscribe to Depth in Python (ROS 2)

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image, CameraInfo
from cv_bridge import CvBridge
import numpy as np


class DepthSubscriber(Node):
    """Subscribe to depth image and camera info from a RealSense camera."""

    def __init__(self):
        super().__init__("depth_subscriber")
        self.bridge = CvBridge()
        self.K = None  # Camera intrinsic matrix

        # Subscribe to camera info (once)
        self.info_sub = self.create_subscription(
            CameraInfo,
            "/camera/aligned_depth_to_color/camera_info",
            self.camera_info_callback,
            10
        )

        # Subscribe to aligned depth image
        self.create_subscription(
            Image,
            "/camera/aligned_depth_to_color/image_raw",
            self.depth_callback,
            10
        )

    def camera_info_callback(self, msg):
        """Extract camera intrinsic matrix from CameraInfo."""
        self.K = np.array(msg.k).reshape(3, 3)
        self.get_logger().info(f"Camera intrinsics:\n{self.K}")
        # Unsubscribe after receiving the intrinsics once
        self.destroy_subscription(self.info_sub)

    def depth_callback(self, msg):
        """Process incoming depth image."""
        # Convert ROS Image to OpenCV format
        depth_image = self.bridge.imgmsg_to_cv2(msg, desired_encoding="passthrough")

        # Depth values are in millimeters (uint16) — convert to meters
        depth_meters = depth_image.astype(np.float32) / 1000.0

        # Print some statistics
        valid = depth_meters > 0
        if valid.any():
            self.get_logger().info(
                f"Depth — min: {depth_meters[valid].min():.2f} m, "
                f"max: {depth_meters[valid].max():.2f} m, "
                f"median: {np.median(depth_meters[valid]):.2f} m"
            )


def main():
    rclpy.init()
    node = DepthSubscriber()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()


if __name__ == "__main__":
    main()

Useful ROS 2 Commands for Depth Cameras

# List all active topics
ros2 topic list | grep camera

# View depth image in RViz2
rviz2
# Add -> By topic -> /camera/depth/image_rect_raw

# Echo camera info
ros2 topic echo /camera/color/camera_info

# Check frame rate
ros2 topic hz /camera/aligned_depth_to_color/image_raw

# Record a rosbag with depth data
ros2 bag record /camera/color/image_raw \
                /camera/aligned_depth_to_color/image_raw \
                /camera/color/camera_info

Exercises

Exercise 1: Stereo Depth on KITTI

Download a stereo pair from the KITTI dataset. Use cv2.StereoSGBM_create to compute the disparity map and convert it to depth. Visualize the result as a colormap.

Hint

Use focal_length = 718.856 and baseline = 0.54 m for KITTI. Load images with cv2.imread in grayscale mode.

Exercise 2: RealSense Point Cloud

Connect an Intel RealSense D435i and use pyrealsense2 to capture a depth frame. Generate a colored point cloud and visualize it with Open3D.

import open3d as o3d

# After generating points and colors from Exercise concept above:
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points)
pcd.colors = o3d.utility.Vector3dVector(colors / 255.0)
o3d.visualization.draw_geometries([pcd])

Exercise 3: Monocular Depth Comparison

Download a single image and run both MiDaS and Depth Anything on it. Compare the depth maps visually. Are there regions where one model outperforms the other? Discuss.

Exercise 4: Depth-Based Obstacle Detection

Write a ROS 2 node that subscribes to a depth camera topic, segments pixels closer than 0.5 meters, and publishes a warning message when an obstacle is detected. Test with a bag file.


References