Object Tracking

Object tracking is one of the fundamental perception capabilities required for autonomous robots. While an object detector can tell you what is in a frame and where, a tracker tells you which object is which across time — maintaining consistent identities even through occlusions, appearance changes, and missed detections.

For a mobile robot navigating a crowded hallway, a drone following a target, or a robotic arm tracking a moving conveyor belt, reliable multi-object tracking (MOT) is essential for safe and effective operation.

  Frame t          Frame t+1        Frame t+2
 ┌──────────┐    ┌──────────┐    ┌──────────┐
 │  ┌───┐   │    │   ┌───┐  │    │    ┌───┐ │
 │  │ A │   │    │   │ A │  │    │    │ A │ │
 │  └───┘   │    │   └───┘  │    │    └───┘ │
 │          │    │          │    │          │
 │   ┌───┐  │    │  ┌───┐   │    │ ┌───┐   │
 │   │ B │  │    │  │ B │   │    │ │ B │   │
 │   └───┘  │    │  └───┘   │    │ └───┘   │
 └──────────┘    └──────────┘    └──────────┘
  Detection only:  "two boxes"  "two boxes"  "two boxes"
  With tracking:   A=1, B=2     A=1, B=2     A=1, B=2

Learning Objectives

By the end of this chapter you will be able to:

  1. Explain the difference between object detection and object tracking and when each is appropriate.
  2. Implement sparse and dense optical flow using OpenCV.
  3. Build a Kalman filter for single-object state estimation and tracking.
  4. Describe the SORT and DeepSORT tracking pipelines.
  5. Understand transformer-based tracking architectures.
  6. Evaluate trackers using standard MOT metrics (MOTA, MOTP, HOTA, IDF1).
  7. Integrate tracking into a ROS 2 robotics system.

1. Tracking vs. Detection

| Aspect | Detection | Tracking |
|---|---|---|
| Output | Bounding boxes + class labels per frame | Consistent IDs across frames |
| Temporal info | None (per-frame) | Maintains identity over time |
| Handles occlusion | Object disappears, reappears as "new" | Keeps ID through brief occlusions |
| Speed | Slower (full forward pass) | Faster (prediction without detection) |
| Failure mode | False positives / negatives | ID switches, track fragmentation |

When to use detection only

  • Static scenes (e.g., counting objects on a shelf once).
  • When you only need the current state, not a history.

When to use tracking

  • Any task requiring temporal consistency: following a person, counting objects entering/leaving a zone, trajectory prediction.
  • When you need to smooth noisy detections (Kalman filter predictions bridge gaps).
  • When downstream planning depends on object identity (e.g., "avoid the red car specifically").

A common paradigm is tracking-by-detection (detect, then track): run a detector on each frame and use a tracker to associate the resulting detections over time, as sketched below.
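
A minimal sketch of that loop. Here detector and tracker are placeholder interfaces (any detector returning boxes and any tracker exposing an update() method would fit), not a specific library:

import cv2

def detect_then_track(video_path, detector, tracker):
    """Run a detector every frame; let the tracker assign stable IDs."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        detections = detector(frame)         # assumed: list of (x, y, w, h, score)
        tracks = tracker.update(detections)  # assumed: list of (track_id, x, y, w, h)
        for track_id, x, y, w, h in tracks:
            cv2.rectangle(frame, (int(x), int(y)),
                          (int(x + w), int(y + h)), (0, 255, 0), 2)
            cv2.putText(frame, f"ID {track_id}", (int(x), int(y) - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
        cv2.imshow('detect-then-track', frame)
        if cv2.waitKey(30) & 0xFF == 27:  # ESC to quit
            break
    cap.release()
    cv2.destroyAllWindows()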


2. Optical Flow

Optical flow estimates the apparent motion of pixels (or features) between two consecutive frames. It is a low-level building block used inside many trackers.

2.1 Lucas-Kanade (Sparse Optical Flow)

The Lucas-Kanade method tracks a sparse set of feature points. Given a small window around a point, it assumes brightness constancy and solves for the displacement \((u, v)\) that minimizes:

\[ \sum_{(x,y) \in W} \left[ I(x+u, y+v, t+1) - I(x, y, t) \right]^2 \]

Using a first-order Taylor expansion this becomes the linear system:

\[ \begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = - \begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix} \]
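To make the math concrete, here is a didactic NumPy sketch that builds and solves this 2×2 system for a single window; the pyramidal OpenCV implementation below is what you would use in practice:

import numpy as np

def lk_single_window(I0, I1, x, y, half=7):
    """Solve the Lucas-Kanade 2x2 system for one window centered at (x, y).

    I0, I1: consecutive grayscale frames as 2-D arrays.
    Returns the displacement (u, v) of the window from I0 to I1.
    """
    I0 = I0.astype(np.float64)
    I1 = I1.astype(np.float64)

    # Spatial gradients of the first frame and the temporal difference
    Iy_full, Ix_full = np.gradient(I0)  # np.gradient returns (d/dy, d/dx)
    It_full = I1 - I0

    win = np.s_[y - half:y + half + 1, x - half:x + half + 1]
    Ix = Ix_full[win].ravel()
    Iy = Iy_full[win].ravel()
    It = It_full[win].ravel()

    # Normal equations from the first-order Taylor expansion above
    A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])

    # Solvable when the window contains corner-like texture (A well-conditioned)
    u, v = np.linalg.solve(A, b)
    return u, v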

Code: Sparse Optical Flow with Lucas-Kanade

import cv2
import numpy as np

def lucas_kanade_demo(video_path=0):
    """Track sparse features using Lucas-Kanade optical flow."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print(f"Cannot open video: {video_path}")
        return

    # Parameters for Shi-Tomasi corner detection
    feature_params = dict(
        maxCorners=200,
        qualityLevel=0.01,
        minDistance=10,
        blockSize=7
    )

    # Lucas-Kanade parameters
    lk_params = dict(
        winSize=(15, 15),
        maxLevel=2,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 0.03)
    )

    # Random colors for visualization
    np.random.seed(42)
    colors = np.random.randint(0, 255, (500, 3), dtype=np.uint8)

    ret, old_frame = cap.read()
    if not ret:
        return
    old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY)
    p0 = cv2.goodFeaturesToTrack(old_gray, mask=None, **feature_params)

    # Create a mask image for drawing tracks
    mask = np.zeros_like(old_frame)

    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # If no features survive (or detection failed), re-detect and retry
        if p0 is None or len(p0) == 0:
            p0 = cv2.goodFeaturesToTrack(frame_gray, mask=None, **feature_params)
            old_gray = frame_gray.copy()
            continue

        # Calculate optical flow
        p1, st, err = cv2.calcOpticalFlowPyrLK(
            old_gray, frame_gray, p0, None, **lk_params
        )

        good_new = np.empty((0, 2), dtype=np.float32)
        if p1 is not None:
            # Keep only points that were successfully tracked (status == 1)
            good_new = p1[st.flatten() == 1]
            good_old = p0[st.flatten() == 1]

            # Draw tracks
            for i, (new, old) in enumerate(zip(good_new, good_old)):
                a, b = new.ravel().astype(int)
                c, d = old.ravel().astype(int)
                mask = cv2.line(mask, (a, b), (c, d), colors[i % 500].tolist(), 2)
                frame = cv2.circle(frame, (a, b), 5, colors[i % 500].tolist(), -1)

        img = cv2.add(frame, mask)
        cv2.imshow('Lucas-Kanade Optical Flow', img)

        if cv2.waitKey(30) & 0xFF == 27:  # ESC to quit
            break

        # Update previous frame and points
        old_gray = frame_gray.copy()

        # Re-detect features periodically
        if len(good_new) < 50:
            p0 = cv2.goodFeaturesToTrack(old_gray, mask=None, **feature_params)
        else:
            p0 = good_new.reshape(-1, 1, 2)

    cap.release()
    cv2.destroyAllWindows()

if __name__ == '__main__':
    lucas_kanade_demo(0)  # Use 0 for webcam, or provide a video file path

Key parameters:

| Parameter | Description |
|---|---|
| maxCorners | Maximum number of features to track |
| qualityLevel | Minimum accepted quality of corners (0–1) |
| minDistance | Minimum Euclidean distance between features |
| winSize | Size of the search window for LK |
| maxLevel | Number of pyramid levels (handles larger motions) |

2.2 Farneback Dense Optical Flow

Unlike sparse methods, dense optical flow computes a velocity vector for every pixel. Farneback's algorithm models the signal in a neighborhood with a polynomial expansion and estimates displacement.

import cv2
import numpy as np

def farneback_dense_flow_demo(video_path=0):
    """Compute and visualize dense optical flow using Farneback method."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print(f"Cannot open video: {video_path}")
        return

    ret, old_frame = cap.read()
    if not ret:
        return
    old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY)

    hsv_mask = np.zeros_like(old_frame)
    hsv_mask[:, :, 1] = 255  # Set saturation to maximum

    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # Compute dense optical flow
        flow = cv2.calcOpticalFlowFarneback(
            prev=old_gray,
            next=frame_gray,
            flow=None,
            pyr_scale=0.5,  # Pyramid scale
            levels=3,        # Number of pyramid levels
            winsize=15,      # Window size
            iterations=3,    # Iterations per pyramid level
            poly_n=5,        # Polynomial neighborhood size
            poly_sigma=1.2,  # Gaussian std for polynomial smoothing
            flags=0
        )

        # Convert flow to polar coordinates (magnitude, angle)
        magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])

        # Use angle for hue, magnitude for value
        hsv_mask[:, :, 0] = angle * 180 / np.pi / 2
        hsv_mask[:, :, 2] = cv2.normalize(
            magnitude, None, 0, 255, cv2.NORM_MINMAX
        )

        # Convert HSV to BGR for display
        rgb_flow = cv2.cvtColor(hsv_mask, cv2.COLOR_HSV2BGR)

        # Overlay on original frame
        overlay = cv2.addWeighted(frame, 0.6, rgb_flow, 0.4, 0)

        cv2.imshow('Dense Optical Flow (Farneback)', overlay)
        cv2.imshow('Flow Field', rgb_flow)

        if cv2.waitKey(30) & 0xFF == 27:
            break

        old_gray = frame_gray

    cap.release()
    cv2.destroyAllWindows()

if __name__ == '__main__':
    farneback_dense_flow_demo(0)

Interpreting the HSV color wheel visualization:

         0° (Red)    → motion right
        90° (Green)   → motion down
       180° (Cyan)    → motion left
       270° (Blue)    → motion up
   Brightness = speed

3. Kalman Filter for Tracking

The Kalman filter is the workhorse of object tracking. For a linear system with Gaussian process and measurement noise, it provides the optimal (minimum-variance) estimate of the state given noisy observations. Even when the true dynamics are mildly nonlinear (as in most tracking scenarios), it works remarkably well.

3.1 State Prediction

We model the object state as a vector. For tracking a bounding box center with constant velocity:

\[ \mathbf{x} = \begin{bmatrix} x \\ y \\ \dot{x} \\ \dot{y} \end{bmatrix} \]

The state transition (predict) step:

\[ \mathbf{x}_{k|k-1} = \mathbf{F} \, \mathbf{x}_{k-1|k-1} + \mathbf{B} \, \mathbf{u}_k \]
\[ \mathbf{P}_{k|k-1} = \mathbf{F} \, \mathbf{P}_{k-1|k-1} \, \mathbf{F}^\top + \mathbf{Q} \]

where \(\mathbf{F}\) is the state transition matrix:

\[ \mathbf{F} = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \]

and \(\mathbf{Q}\) is the process noise covariance (models uncertainty in the motion model).

3.2 Measurement Update

When a measurement \(\mathbf{z}_k\) arrives (the detected bounding box center):

\[ \mathbf{y}_k = \mathbf{z}_k - \mathbf{H} \, \mathbf{x}_{k|k-1} \quad \text{(innovation)} \]
\[ \mathbf{S}_k = \mathbf{H} \, \mathbf{P}_{k|k-1} \, \mathbf{H}^\top + \mathbf{R} \quad \text{(innovation covariance)} \]
\[ \mathbf{K}_k = \mathbf{P}_{k|k-1} \, \mathbf{H}^\top \, \mathbf{S}_k^{-1} \quad \text{(Kalman gain)} \]
\[ \mathbf{x}_{k|k} = \mathbf{x}_{k|k-1} + \mathbf{K}_k \, \mathbf{y}_k \]
\[ \mathbf{P}_{k|k} = (\mathbf{I} - \mathbf{K}_k \, \mathbf{H}) \, \mathbf{P}_{k|k-1} \]

where \(\mathbf{H}\) is the observation matrix (maps state to measurement space) and \(\mathbf{R}\) is the measurement noise covariance.
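
These equations translate almost line-for-line into NumPy. Here is a minimal sketch for the constant-velocity model above; OpenCV's cv2.KalmanFilter, used in the next subsection, implements the same logic:

import numpy as np

def kalman_predict(x, P, F, Q):
    """Predict: x = F x,  P = F P Fᵀ + Q."""
    return F @ x, F @ P @ F.T + Q

def kalman_update(x_pred, P_pred, z, H, R):
    """Update: innovation, gain, corrected state and covariance."""
    y = z - H @ x_pred                   # innovation
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new

# Constant-velocity model for state [x, y, vx, vy], dt = 1
dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2  # process noise
R = np.eye(2) * 1e-1  # measurement noise

x, P = np.zeros(4), np.eye(4)
x, P = kalman_predict(x, P, F, Q)
x, P = kalman_update(x, P, z=np.array([1.0, 0.5]), H=H, R=R)
print(x)  # state pulled toward the measurement [1.0, 0.5]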

3.3 Implementation — Tracking a Ball with cv2.KalmanFilter

import cv2
import numpy as np

def kalman_ball_tracker():
    """Track a colored ball using Kalman filter and HSV color segmentation."""
    cap = cv2.VideoCapture(0)
    if not cap.isOpened():
        print("Cannot open camera")
        return

    # --- Kalman Filter Setup ---
    # State: [x, y, vx, vy]  Measurement: [x, y]
    kf = cv2.KalmanFilter(4, 2)
    dt = 1.0  # Assume 1 time unit per frame

    # State transition matrix F
    kf.transitionMatrix = np.array([
        [1, 0, dt, 0],
        [0, 1, 0, dt],
        [0, 0, 1,  0],
        [0, 0, 0,  1]
    ], dtype=np.float32)

    # Measurement matrix H
    kf.measurementMatrix = np.array([
        [1, 0, 0, 0],
        [0, 1, 0, 0]
    ], dtype=np.float32)

    # Process noise covariance Q
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2

    # Measurement noise covariance R
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1

    # Initial error covariance P
    kf.errorCovPost = np.eye(4, dtype=np.float32)

    initialized = False

    # HSV range for a yellow tennis ball (adjust for your object)
    lower_color = np.array([29, 86, 6])
    upper_color = np.array([64, 255, 255])

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, lower_color, upper_color)
        mask = cv2.erode(mask, None, iterations=2)
        mask = cv2.dilate(mask, None, iterations=2)

        # Find contours
        contours, _ = cv2.findContours(
            mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
        )

        measurement = None
        if contours:
            c = max(contours, key=cv2.contourArea)
            ((mx, my), radius) = cv2.minEnclosingCircle(c)
            if radius > 10:
                measurement = np.array([[np.float32(mx)], [np.float32(my)]])
                # Draw the detection
                cv2.circle(frame, (int(mx), int(my)), int(radius), (0, 255, 0), 2)

        # --- Kalman Filter ---
        if not initialized and measurement is not None:
            # Initialize state with first measurement
            kf.statePost = np.array([
                [measurement[0, 0]],
                [measurement[1, 0]],
                [0], [0]
            ], dtype=np.float32)
            initialized = True

        if initialized:
            # Predict
            prediction = kf.predict()
            px, py = int(prediction[0, 0]), int(prediction[1, 0])

            # Update (only if we have a detection)
            if measurement is not None:
                kf.correct(measurement)

            # Draw predicted position
            cv2.circle(frame, (px, py), 10, (0, 0, 255), 2)
            cv2.putText(frame, f"Predicted ({px},{py})", (px + 15, py),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)

            # Draw corrected position
            cx = int(kf.statePost[0, 0])
            cy = int(kf.statePost[1, 0])
            cv2.circle(frame, (cx, cy), 6, (255, 0, 0), -1)
            cv2.putText(frame, f"Corrected ({cx},{cy})", (cx + 15, cy + 20),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1)

        cv2.imshow('Kalman Ball Tracker', frame)
        if cv2.waitKey(30) & 0xFF == 27:
            break

    cap.release()
    cv2.destroyAllWindows()

if __name__ == '__main__':
    kalman_ball_tracker()

Tip: The Kalman filter continues to predict even when the detector misses the object (measurement is None). This "bridging" of gaps is one of its greatest strengths for tracking.


4. SORT — Simple Online Realtime Tracking

SORT (Bewley et al., 2016) is a pragmatic tracker that combines:

  1. External detections from any detector (e.g., Faster R-CNN, YOLO).
  2. Kalman filters — one per tracked object — to predict the next state.
  3. Hungarian algorithm — to associate predicted tracks with new detections.

┌──────────────────────────────────────────────────────────┐
│                      SORT Pipeline                       │
│                                                          │
│  Frame t ──► Detector ──► Detections D_t                 │
│                              │                           │
│  Existing tracks ──► Kalman predict ──► Predictions P_t  │
│                              │                           │
│                    ┌─────────┴──────────┐                │
│                    │  Association       │                │
│                    │  (Hungarian algo   │                │
│                    │   on IoU matrix)   │                │
│                    └─────────┬──────────┘                │
│                  ┌───────────┼───────────┐               │
│              Matched    Unmatched D   Unmatched T        │
│              (update)   (create new)  (increment age)    │
│                                            │             │
│                  Tracks unmatched for more than          │
│                  max_age frames ──► delete               │
└──────────────────────────────────────────────────────────┘

Key design choices

| Component | Choice |
|---|---|
| Motion model | Kalman filter, constant velocity, state [x, y, s, r, ẋ, ẏ, ṡ] where s = box area and r = aspect ratio |
| Association metric | IoU (Intersection over Union) between predicted boxes and detections |
| Assignment | Hungarian algorithm (scipy.optimize.linear_sum_assignment) |
| New tracks | Created from unmatched detections; reported only after min_hits consecutive matches |
| Track deletion | After max_age consecutive frames without a matched detection |

Strengths: Simple, fast (~260 FPS), works with any detector.
Weaknesses: ID switches during occlusions (only motion, no appearance).
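
The association step is compact enough to sketch directly. Below is a simplified version of SORT's matching, assuming boxes as [x1, y1, x2, y2] lists (the real implementation works in the [x, y, s, r] parameterization above):

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted, detections, iou_threshold=0.3):
    """Hungarian matching of Kalman-predicted boxes to detections."""
    if not predicted or not detections:
        return [], list(range(len(detections))), list(range(len(predicted)))

    # Cost = 1 - IoU, so high overlap means low cost
    cost = np.array([[1.0 - iou(p, d) for d in detections] for p in predicted])
    rows, cols = linear_sum_assignment(cost)

    matches = []
    unmatched_d = set(range(len(detections)))
    unmatched_t = set(range(len(predicted)))
    for t, d in zip(rows, cols):
        if 1.0 - cost[t, d] >= iou_threshold:  # reject low-overlap pairs
            matches.append((t, d))
            unmatched_t.discard(t)
            unmatched_d.discard(d)
    return matches, sorted(unmatched_d), sorted(unmatched_t)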


5. DeepSORT

DeepSORT (Wojke et al., 2017) extends SORT by adding an appearance descriptor — a deep Re-ID (re-identification) embedding that helps distinguish objects even when they overlap or are occluded.

5.1 Deep Appearance Features

A small CNN (typically trained on a person Re-ID dataset) extracts a 128-dimensional embedding vector from each detected crop. Each track maintains a gallery of the last \(N\) embeddings.

  Detection crop (128×64) ──► CNN (ResNet-based) ──► 128-D embedding
                                           Stored in track's appearance
                                           gallery (ring buffer of last 100)

The appearance cost between a track and a detection is:

\[ d_{\text{app}}^{(i,j)} = \min \{ 1 - e_j^\top e_k^{(i)} \mid e_k^{(i)} \in \text{gallery}_i \} \]
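
With L2-normalized embeddings, cosine similarity reduces to a dot product, so this cost takes only a few lines (a sketch; gallery holds a track's recent embeddings):

import numpy as np

def appearance_cost(detection_embedding, gallery):
    """Smallest cosine distance between one detection and a track's gallery.

    Assumes every embedding is L2-normalized (as Re-ID networks typically
    output), so cosine similarity is a plain dot product.
    """
    gallery = np.stack(gallery)                   # (N, 128)
    similarities = gallery @ detection_embedding  # (N,)
    return float(1.0 - similarities.max())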

5.2 Matching Cascade

DeepSORT replaces the single global Hungarian assignment with a cascade that gives priority to tracks that have been seen recently:

  For age = 1, 2, 3, ... :
    ┌─────────────────────────────────────────┐
    │  Cost matrix = λ·d_motion + (1−λ)·d_app │
    │  (Mahalanobis motion + appearance)      │
    │  Run Hungarian on remaining tracks      │
    │  of this age and unmatched detections   │
    └─────────────────────────────────────────┘

Here d_motion is a Mahalanobis distance computed from the Kalman prediction and its covariance, and d_app is the appearance cost above; a final IoU-based stage then matches unconfirmed and age-one tracks. The cascade prevents a long-occluded track, whose predicted position has grown uncertain, from "stealing" a detection that belongs to a recently seen track.
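
A simplified sketch of the cascade loop (assuming track objects with a time_since_update counter, and a match() helper that runs the Hungarian assignment on the combined cost matrix):

def matching_cascade(tracks, detections, match, max_age=30):
    """Match recently seen tracks first, as in DeepSORT.

    match(candidate_tracks, detections) is assumed to return
    (matches, still_unmatched_detections).
    """
    matches = []
    unmatched = list(detections)
    for age in range(1, max_age + 1):
        if not unmatched:
            break
        # Only tracks last updated exactly `age` frames ago compete here,
        # so fresher tracks get first pick of the detections.
        candidates = [t for t in tracks if t.time_since_update == age]
        if not candidates:
            continue
        new_matches, unmatched = match(candidates, unmatched)
        matches.extend(new_matches)
    return matches, unmatched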

Architecture diagram:

┌────────────────────────────────────────────────────────┐
│                        DeepSORT                        │
│                                                        │
│  ┌──────────┐    ┌──────────┐    ┌────────────────┐    │
│  │ Detector │───►│Detection │───►│  Re-ID CNN     │    │
│  │ (YOLO,   │    │  Crops   │    │  (128-D embed) │    │
│  │  etc.)   │    └──────────┘    └───────┬────────┘    │
│  └──────────┘                            │             │
│                  ┌───────────────────────┘             │
│                  ▼                                     │
│  ┌──────────────────────────────────────────────┐      │
│  │              Cascade Matching                │      │
│  │  ┌────────────┐     ┌─────────────────┐      │      │
│  │  │ Motion cost│     │ Appearance cost │      │      │
│  │  │ matrix     │     │ matrix          │      │      │
│  │  └─────┬──────┘     └────────┬────────┘      │      │
│  │        └──────────┬──────────┘               │      │
│  │         Combined cost matrix                 │      │
│  │                   │                          │      │
│  │    ┌──────────────┼──────────────┐           │      │
│  │  Matched     Unmatched D    Unmatched T      │      │
│  └──────────────────────────────────────────────┘      │
│                      │                                 │
│       Kalman filter update / new track / delete        │
└────────────────────────────────────────────────────────┘

6. Transformer-Based Trackers

Recent MOT methods leverage transformers to jointly perform detection and tracking in a single architecture.

MOTRv2 (2023)

MOTRv2 builds on MOTR, a DETR-style model that performs detection and tracking jointly, and bootstraps it with proposals from a pretrained detector (YOLOX). Key ideas:

  • Track queries: Each existing track has a learnable query embedding that is propagated across frames. The decoder predicts the box for that track in the current frame.
  • Detection queries: New objects are detected by "newborn" queries that are not associated with any track.
  • During training, Hungarian matching assigns ground-truth objects only to the newborn detection queries; track queries keep their identities across frames, so detection and association are handled jointly by the decoder, as sketched below.

  Frame t ──► CNN backbone ──► features ─────┐
                                             ▼
  Track queries (t-1) ──────►  Transformer Decoder
                             ┌────────────────────────┐
                             │ Track queries t        │──► existing tracks
                             │ New detection queries  │──► new tracks
                             └────────────────────────┘

TrackFormer (2022)

  • Also DETR-based; introduces track embeddings that attend to the current frame's features.
  • Uses attention to propagate identity across frames without explicit motion models.
  • No need for a separate Kalman filter or Re-ID network — the transformer learns to track implicitly.

Comparison of tracking paradigms:

| Method | Motion Model | Appearance | Speed | Complexity |
|---|---|---|---|---|
| SORT | Kalman filter | None | ~260 FPS | Low |
| DeepSORT | Kalman filter | Re-ID CNN | ~40 FPS | Medium |
| TrackFormer | Learned attention | Learned | ~15 FPS | High |
| MOTRv2 | Learned attention | Learned | ~20 FPS | High |

7. Evaluation Metrics

Evaluating multi-object trackers requires metrics that capture different aspects of performance:

| Metric | Full Name | Measures | Range | Notes |
|---|---|---|---|---|
| MOTA | Multi-Object Tracking Accuracy | 1 − (FN + FP + IDSW) / GT | (−∞, 1] | Most common; penalizes misses, false alarms, and ID switches equally |
| MOTP | Multi-Object Tracking Precision | Average IoU of matched detections | [0, 1] | Measures localization quality, not association |
| HOTA | Higher Order Tracking Accuracy | Geometric mean of detection and association quality | [0, 1] | Balances DetA and AssA; the preferred modern metric |
| IDF1 | ID F1 Score | Ratio of correctly identified detections over time | [0, 1] | Focuses on identity preservation |

HOTA decomposes into:

\[ \text{HOTA}_\alpha = \sqrt{\text{DetA}_\alpha \cdot \text{AssA}_\alpha} \]

where \(\text{DetA}\) is detection accuracy and \(\text{AssA}\) is association accuracy at a given IoU threshold \(\alpha\).
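
Given per-sequence counts, these definitions are easy to compute directly (a sketch; real evaluations use a toolkit such as TrackEval, which implements the official HOTA protocol):

import math

def mota(num_fn, num_fp, num_idsw, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT. Can go negative."""
    return 1.0 - (num_fn + num_fp + num_idsw) / num_gt

def hota_at_alpha(det_a, ass_a):
    """HOTA_alpha = sqrt(DetA_alpha * AssA_alpha)."""
    return math.sqrt(det_a * ass_a)

# Example: 120 misses, 80 false positives, 15 ID switches over 2000 GT boxes
print(f"MOTA: {mota(120, 80, 15, 2000):.4f}")    # 0.8925
print(f"HOTA: {hota_at_alpha(0.62, 0.58):.4f}")  # 0.5997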

Common benchmark datasets:

| Dataset | Domain | Sequences | Key Challenge |
|---|---|---|---|
| MOT17/20 | Pedestrian | 16 / 4 | Crowded scenes |
| KITTI | Driving | 21 | 3D tracking, varying scale |
| DanceTrack | Dance | 100+ | Fast motion, similar appearance |
| BDD100K | Driving | 1,000 | Diverse weather/conditions |

8. Multi-Object Tracking in ROS 2

In a ROS 2 robotics stack, tracking data flows as messages on topics.

Standard message types

  Perception Pipeline in ROS 2:

  Camera/RGB ──► /image_raw (sensor_msgs/Image)
  Detector Node ──► /detections (vision_msgs/Detection2DArray)
  Tracker Node ──► /tracked_objects (vision_msgs/Detection2DArray)
       │            with unique IDs in detection results
  Planning/Control Node

Key message types:

| Message | Package | Purpose |
|---|---|---|
| vision_msgs/Detection2DArray | vision_msgs | Array of 2D detections with bounding boxes |
| vision_msgs/Detection3DArray | vision_msgs | 3D detections (for LiDAR or depth cameras) |
| vision_msgs/ObjectHypothesisWithPose | vision_msgs | Class + confidence + pose for each hypothesis |
| geometry_msgs/PoseStamped | geometry_msgs | Single tracked object pose |
| custom_msg/TrackedObjectArray | custom | Often created for richer tracking info (ID, velocity, etc.) |

Example: Tracker node skeleton

import rclpy
from rclpy.node import Node
from vision_msgs.msg import Detection2DArray, Detection2D, ObjectHypothesisWithPose
import numpy as np
from scipy.optimize import linear_sum_assignment

class SimpleTrackerNode(Node):
    def __init__(self):
        super().__init__('simple_tracker')
        self.sub = self.create_subscription(
            Detection2DArray, '/detections', self.detection_callback, 10
        )
        self.pub = self.create_publisher(
            Detection2DArray, '/tracked_objects', 10
        )
        self.tracks = {}  # id -> state [cx, cy, w, h]; never pruned in this skeleton (see Exercise 5)
        self.next_id = 0
        self.get_logger().info('SimpleTracker started.')

    def detection_callback(self, msg: Detection2DArray):
        detections = []
        for det in msg.detections:
            cx = det.bbox.center.position.x
            cy = det.bbox.center.position.y
            w = det.bbox.size_x
            h = det.bbox.size_y
            detections.append([cx, cy, w, h])

        if not detections:
            return

        det_array = np.array(detections)
        track_ids = list(self.tracks.keys())
        track_states = np.array([self.tracks[tid] for tid in track_ids]) if track_ids else np.empty((0, 4))

        if len(track_states) > 0:
            # Simple Euclidean distance cost
            cost = np.linalg.norm(
                track_states[:, None, :2] - det_array[None, :, :2], axis=2
            )
            row_idx, col_idx = linear_sum_assignment(cost)

            matched_tracks = set()
            matched_dets = set()
            for r, c in zip(row_idx, col_idx):
                if cost[r, c] < 100:  # threshold
                    self.tracks[track_ids[r]] = det_array[c]
                    matched_tracks.add(r)
                    matched_dets.add(c)

            # Create new tracks for unmatched detections
            for i in range(len(det_array)):
                if i not in matched_dets:
                    self.tracks[self.next_id] = det_array[i]
                    self.next_id += 1
        else:
            for det in det_array:
                self.tracks[self.next_id] = det
                self.next_id += 1

        # Publish tracked objects
        out_msg = Detection2DArray()
        out_msg.header = msg.header
        for tid, state in self.tracks.items():
            d = Detection2D()
            d.bbox.center.position.x = float(state[0])
            d.bbox.center.position.y = float(state[1])
            d.bbox.size_x = float(state[2])
            d.bbox.size_y = float(state[3])
            hyp = ObjectHypothesisWithPose()
            hyp.hypothesis.class_id = str(tid)
            hyp.hypothesis.score = 1.0
            d.results.append(hyp)
            out_msg.detections.append(d)
        self.pub.publish(out_msg)

def main():
    rclpy.init()
    node = SimpleTrackerNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()

Popular ROS 2 tracking packages:

| Package | Description |
|---|---|
| ros2_deepsort | DeepSORT wrapper for ROS 2 |
| yolov8_ros | YOLOv8 with built-in tracking (BoT-SORT, ByteTrack) |
| open_track | Open-source multi-object tracker with ROS 2 interface |

9. Exercises

Exercise 1: Optical Flow Visualization

Run the Farneback dense optical flow demo on a recorded video. Modify the HSV visualization to highlight only fast-moving objects (threshold on magnitude). What happens when you increase levels and winsize?

Exercise 2: Kalman Filter Tuning

Using the ball-tracking Kalman filter code:

  1. Increase processNoiseCov by 10x. What happens to the predictions?
  2. Increase measurementNoiseCov by 10x. How does the filter respond?
  3. Add width and height to the state vector so the filter also predicts the bounding box size.

Exercise 3: Implement SORT from Scratch

Using the Kalman filter code from Section 3 as a starting point:

  1. Create a Track class that wraps cv2.KalmanFilter.
  2. Implement IoU computation between two bounding boxes.
  3. Use scipy.optimize.linear_sum_assignment for association.
  4. Test on the MOT17 validation set.

Exercise 4: DeepSORT Appearance Features

Replace the IoU cost in Exercise 3 with a combined motion + appearance cost:

  1. Load a pretrained Re-ID model (e.g., the torchreid library).
  2. Extract embeddings from detection crops.
  3. Implement the matching cascade from Section 5.2.

Exercise 5: ROS 2 Tracker Node

Extend the ROS 2 tracker node skeleton from Section 8 to:

  1. Use a Kalman filter per track instead of simple position matching.
  2. Add track lifecycle management (creation after N consecutive detections, deletion after M missed frames).
  3. Publish velocity estimates alongside position.


References

  1. Lucas, B. D., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. IJCAI.
  2. Farnebäck, G. (2003). Two-frame motion estimation based on polynomial expansion. SCIA.
  3. Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. ASME Journal of Basic Engineering.
  4. Bewley, A., Ge, Z., Ott, L., Ramos, F., & Upcroft, B. (2016). Simple online and realtime tracking. ICIP.
  5. Wojke, N., Bewley, A., & Paulus, D. (2017). Simple online and realtime tracking with a deep association metric. ICIP.
  6. Zhang, Y., Sun, P., Jiang, Y., et al. (2022). ByteTrack: Multi-object tracking by associating every detection box. ECCV.
  7. Meinhardt, T., Kirillov, A., Leal-Taixé, L., & Feichtenhofer, C. (2022). TrackFormer: Multi-object tracking with transformers. CVPR.
  8. Zhang, Y., Wang, T., & Zhang, X. (2023). MOTRv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors. CVPR.
  9. Luiten, J., Osep, A., Dendorfer, P., et al. (2021). HOTA: A higher order metric for evaluating multi-object tracking. IJCV.
  10. OpenCV Documentation. Optical Flow — https://docs.opencv.org/4.x/d7/d8b/tutorial_py_lucas_kanade.html