Object Tracking

Object tracking is one of the fundamental perception capabilities required for autonomous robots. While an object detector can tell you what is in a frame and where, a tracker tells you which object is which across time — maintaining consistent identities even through occlusions, appearance changes, and missed detections.

For a mobile robot navigating a crowded hallway, a drone following a target, or a robotic arm tracking a moving conveyor belt, reliable multi-object tracking (MOT) is essential for safe and effective operation.

  Frame t          Frame t+1        Frame t+2
 ┌──────────┐    ┌──────────┐    ┌──────────┐
 │  ┌───┐   │    │   ┌───┐  │    │    ┌───┐ │
 │  │ A │   │    │   │ A │  │    │    │ A │ │
 │  └───┘   │    │   └───┘  │    │    └───┘ │
 │          │    │          │    │          │
 │   ┌───┐  │    │  ┌───┐   │    │ ┌───┐   │
 │   │ B │  │    │  │ B │   │    │ │ B │   │
 │   └───┘  │    │  └───┘   │    │ └───┘   │
 └──────────┘    └──────────┘    └──────────┘
  Detection only:  "two boxes"  "two boxes"  "two boxes"
  With tracking:   A=1, B=2     A=1, B=2     A=1, B=2

Learning Objectives

By the end of this chapter you will be able to:

  1. Explain the difference between object detection and object tracking and when each is appropriate.
  2. Implement sparse and dense optical flow using OpenCV.
  3. Build a Kalman filter for single-object state estimation and tracking.
  4. Describe the SORT and DeepSORT tracking pipelines.
  5. Understand transformer-based tracking architectures.
  6. Evaluate trackers using standard MOT metrics (MOTA, MOTP, HOTA, IDF1).
  7. Integrate tracking into a ROS 2 robotics system.

1. Tracking vs. Detection

| Aspect | Detection | Tracking |
|---|---|---|
| Output | Bounding boxes + class labels per frame | Consistent IDs across frames |
| Temporal info | None (per-frame) | Maintains identity over time |
| Handles occlusion | Object disappears, reappears as "new" | Keeps ID through brief occlusions |
| Speed | Slower (full forward pass) | Faster (prediction without detection) |
| Failure mode | False positives / negatives | ID switches, track fragmentation |

When to use detection only

  • Static scenes (e.g., counting objects on a shelf once).
  • When you only need the current state, not a history.

When to use tracking

  • Any task requiring temporal consistency: following a person, counting objects entering/leaving a zone, trajectory prediction.
  • When you need to smooth noisy detections (Kalman filter predictions bridge gaps).
  • When downstream planning depends on object identity (e.g., "avoid the red car specifically").

A common paradigm is tracking-by-detection (detect, then track): run a detector on each frame and use a tracker to associate the resulting detections over time, as sketched below.
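
A minimal sketch of that loop. Here detector and tracker are placeholder interfaces (any detector returning boxes and any tracker exposing an update() method would fit), not a specific library:

import cv2

def detect_then_track(video_path, detector, tracker):
    """Run a detector every frame; let the tracker assign stable IDs."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        detections = detector(frame)         # assumed: list of (x, y, w, h, score)
        tracks = tracker.update(detections)  # assumed: list of (track_id, x, y, w, h)
        for track_id, x, y, w, h in tracks:
            cv2.rectangle(frame, (int(x), int(y)),
                          (int(x + w), int(y + h)), (0, 255, 0), 2)
            cv2.putText(frame, f"ID {track_id}", (int(x), int(y) - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
        cv2.imshow('detect-then-track', frame)
        if cv2.waitKey(30) & 0xFF == 27:  # ESC to quit
            break
    cap.release()
    cv2.destroyAllWindows()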


2. Optical Flow

Optical flow estimates the apparent motion of pixels (or features) between two consecutive frames. It is a low-level building block used inside many trackers.

2.1 Lucas-Kanade (Sparse Optical Flow)

The Lucas-Kanade method tracks a sparse set of feature points. Given a small window around a point, it assumes brightness constancy and solves for the displacement \((u, v)\) that minimizes:

\[ \sum_{(x,y) \in W} \left[ I(x+u, y+v, t+1) - I(x, y, t) \right]^2 \]

Using a first-order Taylor expansion this becomes the linear system:

\[ \begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = - \begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix} \]
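To make the math concrete, here is a didactic NumPy sketch that builds and solves this 2×2 system for a single window; the pyramidal OpenCV implementation below is what you would use in practice:

import numpy as np

def lk_single_window(I0, I1, x, y, half=7):
    """Solve the Lucas-Kanade 2x2 system for one window centered at (x, y).

    I0, I1: consecutive grayscale frames as 2-D arrays.
    Returns the displacement (u, v) of the window from I0 to I1.
    """
    I0 = I0.astype(np.float64)
    I1 = I1.astype(np.float64)

    # Spatial gradients of the first frame and the temporal difference
    Iy_full, Ix_full = np.gradient(I0)  # np.gradient returns (d/dy, d/dx)
    It_full = I1 - I0

    win = np.s_[y - half:y + half + 1, x - half:x + half + 1]
    Ix = Ix_full[win].ravel()
    Iy = Iy_full[win].ravel()
    It = It_full[win].ravel()

    # Normal equations from the first-order Taylor expansion above
    A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])

    # Solvable when the window contains corner-like texture (A well-conditioned)
    u, v = np.linalg.solve(A, b)
    return u, v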

Code: Sparse Optical Flow with Lucas-Kanade

import cv2
import numpy as np

def lucas_kanade_demo(video_path=0):
    """Track sparse features using Lucas-Kanade optical flow."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print(f"Cannot open video: {video_path}")
        return

    # Parameters for Shi-Tomasi corner detection
    feature_params = dict(
        maxCorners=200,
        qualityLevel=0.01,
        minDistance=10,
        blockSize=7
    )

    # Lucas-Kanade parameters
    lk_params = dict(
        winSize=(15, 15),
        maxLevel=2,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 0.03)
    )

    # Random colors for visualization
    np.random.seed(42)
    colors = np.random.randint(0, 255, (500, 3), dtype=np.uint8)

    ret, old_frame = cap.read()
    if not ret:
        return
    old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY)
    p0 = cv2.goodFeaturesToTrack(old_gray, mask=None, **feature_params)

    # Create a mask image for drawing tracks
    mask = np.zeros_like(old_frame)

    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # If no features survive (or detection failed), re-detect and retry
        if p0 is None or len(p0) == 0:
            p0 = cv2.goodFeaturesToTrack(frame_gray, mask=None, **feature_params)
            old_gray = frame_gray.copy()
            continue

        # Calculate optical flow
        p1, st, err = cv2.calcOpticalFlowPyrLK(
            old_gray, frame_gray, p0, None, **lk_params
        )

        good_new = np.empty((0, 2), dtype=np.float32)
        if p1 is not None:
            # Keep only points that were successfully tracked (status == 1)
            good_new = p1[st.flatten() == 1]
            good_old = p0[st.flatten() == 1]

            # Draw tracks
            for i, (new, old) in enumerate(zip(good_new, good_old)):
                a, b = new.ravel().astype(int)
                c, d = old.ravel().astype(int)
                mask = cv2.line(mask, (a, b), (c, d), colors[i % 500].tolist(), 2)
                frame = cv2.circle(frame, (a, b), 5, colors[i % 500].tolist(), -1)

        img = cv2.add(frame, mask)
        cv2.imshow('Lucas-Kanade Optical Flow', img)

        if cv2.waitKey(30) & 0xFF == 27:  # ESC to quit
            break

        # Update previous frame and points
        old_gray = frame_gray.copy()

        # Re-detect features periodically
        if len(good_new) < 50:
            p0 = cv2.goodFeaturesToTrack(old_gray, mask=None, **feature_params)
        else:
            p0 = good_new.reshape(-1, 1, 2)

    cap.release()
    cv2.destroyAllWindows()

if __name__ == '__main__':
    lucas_kanade_demo(0)  # Use 0 for webcam, or provide a video file path

Key parameters:

| Parameter | Description |
|---|---|
| maxCorners | Maximum number of features to track |
| qualityLevel | Minimum accepted quality of corners (0–1) |
| minDistance | Minimum Euclidean distance between features |
| winSize | Size of the search window for LK |
| maxLevel | Number of pyramid levels (handles larger motions) |

2.2 Farneback Dense Optical Flow

Unlike sparse methods, dense optical flow computes a velocity vector for every pixel. Farneback's algorithm models the signal in a neighborhood with a polynomial expansion and estimates displacement.

import cv2
import numpy as np

def farneback_dense_flow_demo(video_path=0):
    """Compute and visualize dense optical flow using Farneback method."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print(f"Cannot open video: {video_path}")
        return

    ret, old_frame = cap.read()
    if not ret:
        return
    old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY)

    hsv_mask = np.zeros_like(old_frame)
    hsv_mask[:, :, 1] = 255  # Set saturation to maximum

    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # Compute dense optical flow
        flow = cv2.calcOpticalFlowFarneback(
            prev=old_gray,
            next=frame_gray,
            flow=None,
            pyr_scale=0.5,  # Pyramid scale
            levels=3,        # Number of pyramid levels
            winsize=15,      # Window size
            iterations=3,    # Iterations per pyramid level
            poly_n=5,        # Polynomial neighborhood size
            poly_sigma=1.2,  # Gaussian std for polynomial smoothing
            flags=0
        )

        # Convert flow to polar coordinates (magnitude, angle)
        magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])

        # Use angle for hue, magnitude for value
        hsv_mask[:, :, 0] = angle * 180 / np.pi / 2
        hsv_mask[:, :, 2] = cv2.normalize(
            magnitude, None, 0, 255, cv2.NORM_MINMAX
        )

        # Convert HSV to BGR for display
        rgb_flow = cv2.cvtColor(hsv_mask, cv2.COLOR_HSV2BGR)

        # Overlay on original frame
        overlay = cv2.addWeighted(frame, 0.6, rgb_flow, 0.4, 0)

        cv2.imshow('Dense Optical Flow (Farneback)', overlay)
        cv2.imshow('Flow Field', rgb_flow)

        if cv2.waitKey(30) & 0xFF == 27:
            break

        old_gray = frame_gray

    cap.release()
    cv2.destroyAllWindows()

if __name__ == '__main__':
    farneback_dense_flow_demo(0)

Interpreting the HSV color wheel visualization:

         0° (Red)    → motion right
        90° (Green)   → motion down
       180° (Cyan)    → motion left
       270° (Blue)    → motion up
   Brightness = speed

3. Kalman Filter for Tracking

The Kalman filter is the workhorse of object tracking. For a linear system with Gaussian process and measurement noise, it provides the optimal (minimum-variance) estimate of the state given noisy observations. Even when the true dynamics are mildly nonlinear (as in most tracking scenarios), it works remarkably well.

3.1 State Prediction

We model the object state as a vector. For tracking a bounding box center with constant velocity:

\[ \mathbf{x} = \begin{bmatrix} x \\ y \\ \dot{x} \\ \dot{y} \end{bmatrix} \]

The state transition (predict) step:

\[ \mathbf{x}_{k|k-1} = \mathbf{F} \, \mathbf{x}_{k-1|k-1} + \mathbf{B} \, \mathbf{u}_k \]
\[ \mathbf{P}_{k|k-1} = \mathbf{F} \, \mathbf{P}_{k-1|k-1} \, \mathbf{F}^\top + \mathbf{Q} \]

where \(\mathbf{F}\) is the state transition matrix:

\[ \mathbf{F} = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \]

and \(\mathbf{Q}\) is the process noise covariance (models uncertainty in the motion model).

3.2 Measurement Update

When a measurement \(\mathbf{z}_k\) arrives (the detected bounding box center):

\[ \mathbf{y}_k = \mathbf{z}_k - \mathbf{H} \, \mathbf{x}_{k|k-1} \quad \text{(innovation)} \]
\[ \mathbf{S}_k = \mathbf{H} \, \mathbf{P}_{k|k-1} \, \mathbf{H}^\top + \mathbf{R} \quad \text{(innovation covariance)} \]
\[ \mathbf{K}_k = \mathbf{P}_{k|k-1} \, \mathbf{H}^\top \, \mathbf{S}_k^{-1} \quad \text{(Kalman gain)} \]
\[ \mathbf{x}_{k|k} = \mathbf{x}_{k|k-1} + \mathbf{K}_k \, \mathbf{y}_k \]
\[ \mathbf{P}_{k|k} = (\mathbf{I} - \mathbf{K}_k \, \mathbf{H}) \, \mathbf{P}_{k|k-1} \]

where \(\mathbf{H}\) is the observation matrix (maps state to measurement space) and \(\mathbf{R}\) is the measurement noise covariance.
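
These equations translate almost line-for-line into NumPy. Here is a minimal sketch for the constant-velocity model above; OpenCV's cv2.KalmanFilter, used in the next subsection, implements the same logic:

import numpy as np

def kalman_predict(x, P, F, Q):
    """Predict: x = F x,  P = F P Fᵀ + Q."""
    return F @ x, F @ P @ F.T + Q

def kalman_update(x_pred, P_pred, z, H, R):
    """Update: innovation, gain, corrected state and covariance."""
    y = z - H @ x_pred                   # innovation
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new

# Constant-velocity model for state [x, y, vx, vy], dt = 1
dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2  # process noise
R = np.eye(2) * 1e-1  # measurement noise

x, P = np.zeros(4), np.eye(4)
x, P = kalman_predict(x, P, F, Q)
x, P = kalman_update(x, P, z=np.array([1.0, 0.5]), H=H, R=R)
print(x)  # state pulled toward the measurement [1.0, 0.5]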

3.3 Implementation — Tracking a Ball with cv2.KalmanFilter

import cv2
import numpy as np

def kalman_ball_tracker():
    """Track a colored ball using Kalman filter and HSV color segmentation."""
    cap = cv2.VideoCapture(0)
    if not cap.isOpened():
        print("Cannot open camera")
        return

    # --- Kalman Filter Setup ---
    # State: [x, y, vx, vy]  Measurement: [x, y]
    kf = cv2.KalmanFilter(4, 2)
    dt = 1.0  # Assume 1 time unit per frame

    # State transition matrix F
    kf.transitionMatrix = np.array([
        [1, 0, dt, 0],
        [0, 1, 0, dt],
        [0, 0, 1,  0],
        [0, 0, 0,  1]
    ], dtype=np.float32)

    # Measurement matrix H
    kf.measurementMatrix = np.array([
        [1, 0, 0, 0],
        [0, 1, 0, 0]
    ], dtype=np.float32)

    # Process noise covariance Q
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2

    # Measurement noise covariance R
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1

    # Initial error covariance P
    kf.errorCovPost = np.eye(4, dtype=np.float32)

    initialized = False

    # HSV range for a yellow tennis ball (adjust for your object)
    lower_color = np.array([29, 86, 6])
    upper_color = np.array([64, 255, 255])

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, lower_color, upper_color)
        mask = cv2.erode(mask, None, iterations=2)
        mask = cv2.dilate(mask, None, iterations=2)

        # Find contours
        contours, _ = cv2.findContours(
            mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
        )

        measurement = None
        if contours:
            c = max(contours, key=cv2.contourArea)
            ((mx, my), radius) = cv2.minEnclosingCircle(c)
            if radius > 10:
                measurement = np.array([[np.float32(mx)], [np.float32(my)]])
                # Draw the detection
                cv2.circle(frame, (int(mx), int(my)), int(radius), (0, 255, 0), 2)

        # --- Kalman Filter ---
        if not initialized and measurement is not None:
            # Initialize state with first measurement
            kf.statePost = np.array([
                [measurement[0, 0]],
                [measurement[1, 0]],
                [0], [0]
            ], dtype=np.float32)
            initialized = True

        if initialized:
            # Predict
            prediction = kf.predict()
            px, py = int(prediction[0, 0]), int(prediction[1, 0])

            # Update (only if we have a detection)
            if measurement is not None:
                kf.correct(measurement)

            # Draw predicted position
            cv2.circle(frame, (px, py), 10, (0, 0, 255), 2)
            cv2.putText(frame, f"Predicted ({px},{py})", (px + 15, py),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)

            # Draw corrected position
            cx = int(kf.statePost[0, 0])
            cy = int(kf.statePost[1, 0])
            cv2.circle(frame, (cx, cy), 6, (255, 0, 0), -1)
            cv2.putText(frame, f"Corrected ({cx},{cy})", (cx + 15, cy + 20),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1)

        cv2.imshow('Kalman Ball Tracker', frame)
        if cv2.waitKey(30) & 0xFF == 27:
            break

    cap.release()
    cv2.destroyAllWindows()

if __name__ == '__main__':
    kalman_ball_tracker()

Tip: The Kalman filter continues to predict even when the detector misses the object (measurement is None). This "bridging" of gaps is one of its greatest strengths for tracking.


4. SORT — Simple Online Realtime Tracking

SORT (Bewley et al., 2016) is a pragmatic tracker that combines:

  1. External detections from any detector (e.g., Faster R-CNN, YOLO).
  2. Kalman filters — one per tracked object — to predict the next state.
  3. Hungarian algorithm — to associate predicted tracks with new detections.

┌──────────────────────────────────────────────────────────┐
│                      SORT Pipeline                       │
│                                                          │
│  Frame t ──► Detector ──► Detections D_t                 │
│                              │                           │
│  Existing tracks ──► Kalman predict ──► Predictions P_t  │
│                              │                           │
│                    ┌─────────┴──────────┐                │
│                    │  Association       │                │
│                    │  (Hungarian algo   │                │
│                    │   on IoU matrix)   │                │
│                    └─────────┬──────────┘                │
│                  ┌───────────┼───────────┐               │
│              Matched    Unmatched D   Unmatched T        │
│              (update)   (create new)  (increment age)    │
│                                            │             │
│                  Tracks unmatched for more than          │
│                  max_age frames ──► delete               │
└──────────────────────────────────────────────────────────┘

Key design choices

| Component | Choice |
|---|---|
| Motion model | Kalman filter, constant velocity, state [x, y, s, r, ẋ, ẏ, ṡ] where s = box area and r = aspect ratio |
| Association metric | IoU (Intersection over Union) between predicted boxes and detections |
| Assignment | Hungarian algorithm (scipy.optimize.linear_sum_assignment) |
| New tracks | Created from unmatched detections; reported only after min_hits consecutive matches |
| Track deletion | After max_age consecutive frames without a matched detection |

Strengths: Simple, fast (~260 FPS), works with any detector.
Weaknesses: ID switches during occlusions (only motion, no appearance).
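
The association step is compact enough to sketch directly. Below is a simplified version of SORT's matching, assuming boxes as [x1, y1, x2, y2] lists (the real implementation works in the [x, y, s, r] parameterization above):

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted, detections, iou_threshold=0.3):
    """Hungarian matching of Kalman-predicted boxes to detections."""
    if not predicted or not detections:
        return [], list(range(len(detections))), list(range(len(predicted)))

    # Cost = 1 - IoU, so high overlap means low cost
    cost = np.array([[1.0 - iou(p, d) for d in detections] for p in predicted])
    rows, cols = linear_sum_assignment(cost)

    matches = []
    unmatched_d = set(range(len(detections)))
    unmatched_t = set(range(len(predicted)))
    for t, d in zip(rows, cols):
        if 1.0 - cost[t, d] >= iou_threshold:  # reject low-overlap pairs
            matches.append((t, d))
            unmatched_t.discard(t)
            unmatched_d.discard(d)
    return matches, sorted(unmatched_d), sorted(unmatched_t)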


5. DeepSORT

DeepSORT (Wojke et al., 2017) extends SORT by adding an appearance descriptor — a deep Re-ID (re-identification) embedding that helps distinguish objects even when they overlap or are occluded.

5.1 Deep Appearance Features

A small CNN (typically trained on a person Re-ID dataset) extracts a 128-dimensional embedding vector from each detected crop. Each track maintains a gallery of the last \(N\) embeddings.

  Detection crop (128×64) ──► CNN (ResNet-based) ──► 128-D embedding
                                           Stored in track's appearance
                                           gallery (ring buffer of last 100)

The appearance cost between a track and a detection is:

\[ d_{\text{app}}^{(i,j)} = \min \{ 1 - e_j^\top e_k^{(i)} \mid e_k^{(i)} \in \text{gallery}_i \} \]
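
With L2-normalized embeddings, cosine similarity reduces to a dot product, so this cost takes only a few lines (a sketch; gallery holds a track's recent embeddings):

import numpy as np

def appearance_cost(detection_embedding, gallery):
    """Smallest cosine distance between one detection and a track's gallery.

    Assumes every embedding is L2-normalized (as Re-ID networks typically
    output), so cosine similarity is a plain dot product.
    """
    gallery = np.stack(gallery)                   # (N, 128)
    similarities = gallery @ detection_embedding  # (N,)
    return float(1.0 - similarities.max())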

5.2 Matching Cascade

DeepSORT replaces the single global Hungarian assignment with a cascade that gives priority to tracks that have been seen recently:

  For age = 1, 2, 3, ... :
    ┌─────────────────────────────────────────┐
    │  Cost matrix = λ·d_motion + (1−λ)·d_app │
    │  (Mahalanobis motion + appearance)      │
    │  Run Hungarian on remaining tracks      │
    │  of this age and unmatched detections   │
    └─────────────────────────────────────────┘

Here d_motion is a Mahalanobis distance computed from the Kalman prediction and its covariance, and d_app is the appearance cost above; a final IoU-based stage then matches unconfirmed and age-one tracks. The cascade prevents a long-occluded track, whose predicted position has grown uncertain, from "stealing" a detection that belongs to a recently seen track.
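
A simplified sketch of the cascade loop (assuming track objects with a time_since_update counter, and a match() helper that runs the Hungarian assignment on the combined cost matrix):

def matching_cascade(tracks, detections, match, max_age=30):
    """Match recently seen tracks first, as in DeepSORT.

    match(candidate_tracks, detections) is assumed to return
    (matches, still_unmatched_detections).
    """
    matches = []
    unmatched = list(detections)
    for age in range(1, max_age + 1):
        if not unmatched:
            break
        # Only tracks last updated exactly `age` frames ago compete here,
        # so fresher tracks get first pick of the detections.
        candidates = [t for t in tracks if t.time_since_update == age]
        if not candidates:
            continue
        new_matches, unmatched = match(candidates, unmatched)
        matches.extend(new_matches)
    return matches, unmatched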

Architecture diagram:

┌────────────────────────────────────────────────────────┐
│                        DeepSORT                        │
│                                                        │
│  ┌──────────┐    ┌──────────┐    ┌────────────────┐    │
│  │ Detector │───►│Detection │───►│  Re-ID CNN     │    │
│  │ (YOLO,   │    │  Crops   │    │  (128-D embed) │    │
│  │  etc.)   │    └──────────┘    └───────┬────────┘    │
│  └──────────┘                            │             │
│                  ┌───────────────────────┘             │
│                  ▼                                     │
│  ┌──────────────────────────────────────────────┐      │
│  │              Cascade Matching                │      │
│  │  ┌────────────┐     ┌─────────────────┐      │      │
│  │  │ Motion cost│     │ Appearance cost │      │      │
│  │  │ matrix     │     │ matrix          │      │      │
│  │  └─────┬──────┘     └────────┬────────┘      │      │
│  │        └──────────┬──────────┘               │      │
│  │         Combined cost matrix                 │      │
│  │                   │                          │      │
│  │    ┌──────────────┼──────────────┐           │      │
│  │  Matched     Unmatched D    Unmatched T      │      │
│  └──────────────────────────────────────────────┘      │
│                      │                                 │
│       Kalman filter update / new track / delete        │
└────────────────────────────────────────────────────────┘

6. Transformer-Based Trackers

Recent MOT methods leverage transformers to jointly perform detection and tracking in a single architecture.

MOTRv2 (2023)

MOTRv2 builds on MOTR, a DETR-style model that performs detection and tracking jointly, and bootstraps it with proposals from a pretrained detector (YOLOX). Key ideas:

  • Track queries: Each existing track has a learnable query embedding that is propagated across frames. The decoder predicts the box for that track in the current frame.
  • Detection queries: New objects are detected by "newborn" queries that are not associated with any track.
  • During training, Hungarian matching assigns ground-truth objects only to the newborn detection queries; track queries keep their identities across frames, so detection and association are handled jointly by the decoder, as sketched below.

  Frame t ──► CNN backbone ──► features ─────┐
                                             ▼
  Track queries (t-1) ──────►  Transformer Decoder
                             ┌────────────────────────┐
                             │ Track queries t        │──► existing tracks
                             │ New detection queries  │──► new tracks
                             └────────────────────────┘

TrackFormer (2022)

  • Also DETR-based; introduces track embeddings that attend to the current frame's features.
  • Uses attention to propagate identity across frames without explicit motion models.
  • No need for a separate Kalman filter or Re-ID network — the transformer learns to track implicitly.

Comparison of tracking paradigms:

| Method | Motion Model | Appearance | Speed | Complexity |
|---|---|---|---|---|
| SORT | Kalman filter | None | ~260 FPS | Low |
| DeepSORT | Kalman filter | Re-ID CNN | ~40 FPS | Medium |
| TrackFormer | Learned attention | Learned | ~15 FPS | High |
| MOTRv2 | Learned attention | Learned | ~20 FPS | High |

7. Evaluation Metrics

Evaluating multi-object trackers requires metrics that capture different aspects of performance:

| Metric | Full Name | Measures | Range | Notes |
|---|---|---|---|---|
| MOTA | Multi-Object Tracking Accuracy | 1 − (FN + FP + IDSW) / GT | (−∞, 1] | Most common; penalizes misses, false alarms, and ID switches equally |
| MOTP | Multi-Object Tracking Precision | Average IoU of matched detections | [0, 1] | Measures localization quality, not association |
| HOTA | Higher Order Tracking Accuracy | Geometric mean of detection and association quality | [0, 1] | Balances DetA and AssA; the preferred modern metric |
| IDF1 | ID F1 Score | Ratio of correctly identified detections over time | [0, 1] | Focuses on identity preservation |

HOTA decomposes into:

\[ \text{HOTA}_\alpha = \sqrt{\text{DetA}_\alpha \cdot \text{AssA}_\alpha} \]

where \(\text{DetA}\) is detection accuracy and \(\text{AssA}\) is association accuracy at a given IoU threshold \(\alpha\).
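
Given per-sequence counts, these definitions are easy to compute directly (a sketch; real evaluations use a toolkit such as TrackEval, which implements the official HOTA protocol):

import math

def mota(num_fn, num_fp, num_idsw, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT. Can go negative."""
    return 1.0 - (num_fn + num_fp + num_idsw) / num_gt

def hota_at_alpha(det_a, ass_a):
    """HOTA_alpha = sqrt(DetA_alpha * AssA_alpha)."""
    return math.sqrt(det_a * ass_a)

# Example: 120 misses, 80 false positives, 15 ID switches over 2000 GT boxes
print(f"MOTA: {mota(120, 80, 15, 2000):.4f}")    # 0.8925
print(f"HOTA: {hota_at_alpha(0.62, 0.58):.4f}")  # 0.5997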

Common benchmark datasets:

| Dataset | Domain | Sequences | Key Challenge |
|---|---|---|---|
| MOT17/20 | Pedestrian | 16 / 4 | Crowded scenes |
| KITTI | Driving | 21 | 3D tracking, varying scale |
| DanceTrack | Dance | 100+ | Fast motion, similar appearance |
| BDD100K | Driving | 1,000 | Diverse weather/conditions |

8. Multi-Object Tracking in ROS 2

In a ROS 2 robotics stack, tracking data flows as messages on topics.

Standard message types

  Perception Pipeline in ROS 2:

  Camera/RGB ──► /image_raw (sensor_msgs/Image)
  Detector Node ──► /detections (vision_msgs/Detection2DArray)
  Tracker Node ──► /tracked_objects (vision_msgs/Detection2DArray)
       │            with unique IDs in detection results
  Planning/Control Node

Key message types:

| Message | Package | Purpose |
|---|---|---|
| vision_msgs/Detection2DArray | vision_msgs | Array of 2D detections with bounding boxes |
| vision_msgs/Detection3DArray | vision_msgs | 3D detections (for LiDAR or depth cameras) |
| vision_msgs/ObjectHypothesisWithPose | vision_msgs | Class + confidence + pose for each hypothesis |
| geometry_msgs/PoseStamped | geometry_msgs | Single tracked object pose |
| custom_msg/TrackedObjectArray | custom | Often created for richer tracking info (ID, velocity, etc.) |

Example: Tracker node skeleton

import rclpy
from rclpy.node import Node
from vision_msgs.msg import Detection2DArray, Detection2D, ObjectHypothesisWithPose
import numpy as np
from scipy.optimize import linear_sum_assignment

class SimpleTrackerNode(Node):
    def __init__(self):
        super().__init__('simple_tracker')
        self.sub = self.create_subscription(
            Detection2DArray, '/detections', self.detection_callback, 10
        )
        self.pub = self.create_publisher(
            Detection2DArray, '/tracked_objects', 10
        )
        self.tracks = {}  # id -> state [cx, cy, w, h]; never pruned in this skeleton (see Exercise 5)
        self.next_id = 0
        self.get_logger().info('SimpleTracker started.')

    def detection_callback(self, msg: Detection2DArray):
        detections = []
        for det in msg.detections:
            cx = det.bbox.center.position.x
            cy = det.bbox.center.position.y
            w = det.bbox.size_x
            h = det.bbox.size_y
            detections.append([cx, cy, w, h])

        if not detections:
            return

        det_array = np.array(detections)
        track_ids = list(self.tracks.keys())
        track_states = np.array([self.tracks[tid] for tid in track_ids]) if track_ids else np.empty((0, 4))

        if len(track_states) > 0:
            # Simple Euclidean distance cost
            cost = np.linalg.norm(
                track_states[:, None, :2] - det_array[None, :, :2], axis=2
            )
            row_idx, col_idx = linear_sum_assignment(cost)

            matched_tracks = set()
            matched_dets = set()
            for r, c in zip(row_idx, col_idx):
                if cost[r, c] < 100:  # threshold
                    self.tracks[track_ids[r]] = det_array[c]
                    matched_tracks.add(r)
                    matched_dets.add(c)

            # Create new tracks for unmatched detections
            for i in range(len(det_array)):
                if i not in matched_dets:
                    self.tracks[self.next_id] = det_array[i]
                    self.next_id += 1
        else:
            for det in det_array:
                self.tracks[self.next_id] = det
                self.next_id += 1

        # Publish tracked objects
        out_msg = Detection2DArray()
        out_msg.header = msg.header
        for tid, state in self.tracks.items():
            d = Detection2D()
            d.bbox.center.position.x = float(state[0])
            d.bbox.center.position.y = float(state[1])
            d.bbox.size_x = float(state[2])
            d.bbox.size_y = float(state[3])
            hyp = ObjectHypothesisWithPose()
            hyp.hypothesis.class_id = str(tid)
            hyp.hypothesis.score = 1.0
            d.results.append(hyp)
            out_msg.detections.append(d)
        self.pub.publish(out_msg)

def main():
    rclpy.init()
    node = SimpleTrackerNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()

Popular ROS 2 tracking packages:

| Package | Description |
|---|---|
| ros2_deepsort | DeepSORT wrapper for ROS 2 |
| yolov8_ros | YOLOv8 with built-in tracking (BoT-SORT, ByteTrack) |
| open_track | Open-source multi-object tracker with ROS 2 interface |

9. Exercises

Exercise 1: Optical Flow Visualization

Run the Farneback dense optical flow demo on a recorded video. Modify the HSV visualization to highlight only fast-moving objects (threshold on magnitude). What happens when you increase levels and winsize?

Exercise 2: Kalman Filter Tuning

Using the ball-tracking Kalman filter code:

  1. Increase processNoiseCov by 10x. What happens to the predictions?
  2. Increase measurementNoiseCov by 10x. How does the filter respond?
  3. Add width and height to the state vector so the filter also predicts the bounding box size.

Exercise 3: Implement SORT from Scratch

Using the Kalman filter code from Section 3 as a starting point:

  1. Create a Track class that wraps cv2.KalmanFilter.
  2. Implement IoU computation between two bounding boxes.
  3. Use scipy.optimize.linear_sum_assignment for association.
  4. Test on the MOT17 validation set.

Exercise 4: DeepSORT Appearance Features

Replace the IoU cost in Exercise 3 with a combined motion + appearance cost:

  1. Load a pretrained Re-ID model (e.g., the torchreid library).
  2. Extract embeddings from detection crops.
  3. Implement the matching cascade from Section 5.2.

Exercise 5: ROS 2 Tracker Node

Extend the ROS 2 tracker node skeleton from Section 8 to:

  1. Use a Kalman filter per track instead of simple position matching.
  2. Add track lifecycle management (creation after N consecutive detections, deletion after M missed frames).
  3. Publish velocity estimates alongside position.


References

  1. Lucas, B. D., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. IJCAI.
  2. Farnebäck, G. (2003). Two-frame motion estimation based on polynomial expansion. SCIA.
  3. Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. ASME Journal of Basic Engineering.
  4. Bewley, A., Ge, Z., Ott, L., Ramos, F., & Upcroft, B. (2016). Simple online and realtime tracking. ICIP.
  5. Wojke, N., Bewley, A., & Paulus, D. (2017). Simple online and realtime tracking with a deep association metric. ICIP.
  6. Zhang, Y., Sun, P., Jiang, Y., et al. (2022). ByteTrack: Multi-object tracking by associating every detection box. ECCV.
  7. Meinhardt, T., Kirillov, A., Leal-Taixé, L., & Feichtenhofer, C. (2022). TrackFormer: Multi-object tracking with transformers. CVPR.
  8. Zhang, Y., Wang, T., & Zhang, X. (2023). MOTRv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors. CVPR.
  9. Luiten, J., Osep, A., Dendorfer, P., et al. (2021). HOTA: A higher order metric for evaluating multi-object tracking. IJCV.
  10. OpenCV Documentation. Optical Flow — https://docs.opencv.org/4.x/d7/d8b/tutorial_py_lucas_kanade.html