
Visual Object Tracking (视觉目标跟踪)

Project Type: Perception | Difficulty: ★★☆☆☆ to ★★★★★ | Estimated Time: 1–3 weekends


1. Project Overview

Visual Object Tracking (VOT) is the task of estimating the trajectory of a target object across video frames, given only its initial bounding box in the first frame. Unlike object detection, tracking must maintain object identity over time and handle occlusion, deformation, and motion blur.

  ┌────────────────────────────────────────────────────────────┐
  │                   Tracking Pipeline                         │
  │                                                             │
  │   Frame t-1        Frame t         Frame t+1                 │
  │   ┌────────┐       ┌────────┐      ┌────────┐               │
  │   │ ▓▓▓▓▓ │──────►│ ▓▓▓▓▓ │─────►│ ▓▓▓▓▓ │               │
  │   │ ▓▓▓▓▓ │       │ ▓▓▓▓▓ │      │ ▓▓▓▓▓ │               │
  │   │ ID: 1  │       │ ID: 1  │      │ ID: 1  │               │
  │   └────────┘       └────────┘      └────────┘               │
  │        │               │               │                     │
  │        └───────────────┴───────────────┘                     │
  │                    Track ID: 1                                │
  │   ┌──────────────────────────────────────────────┐            │
  │   │  Detect → Predict → Associate → Update       │            │
  │   └──────────────────────────────────────────────┘            │
  └────────────────────────────────────────────────────────────┘

In this project you will implement three progressively more sophisticated tiers:

| Tier | Approach | Core Method | Speed | Accuracy |
|---|---|---|---|---|
| Traditional | KCF Tracker | Kernelized Correlation Filter + HOG | ~300 FPS | ★★★☆☆ |
| Intermediate | SORT / DeepSORT | Detection + Kalman Filter + Hungarian | ~100 FPS | ★★★★☆ |
| Modern | Transformer Tracker | Attention + Template Matching | ~30 FPS | ★★★★★ |

2. Hardware & Software Requirements

Hardware

| Component | Specification | Notes |
|---|---|---|
| Camera | USB webcam or Pi Camera | 640×480 minimum, 30 FPS |
| (Optional) GPU | NVIDIA GPU with ≥ 4 GB VRAM | For DeepSORT and transformer trackers |
| Robot / Test scene | Any moving object or person | For real-world testing |

Software

| Package | Version | Purpose |
|---|---|---|
| Python | ≥ 3.8 | Core language |
| OpenCV | ≥ 4.5 | Image processing & tracking |
| NumPy | ≥ 1.20 | Numerical computation |
| filterpy | ≥ 1.4 | Kalman filter implementation |
| scipy | ≥ 1.7 | Linear algebra, Hungarian algorithm |
| torch | ≥ 1.10 | Deep learning (DeepSORT, transformers) |
| torchvision | ≥ 0.11 | Pre-trained CNN models |
| ultralytics | ≥ 8.0 | YOLO detection for SORT/DeepSORT |

Install everything with:

pip install opencv-python numpy filterpy scipy torch torchvision ultralytics

3. Tier 1 — Traditional: KCF Tracker

3.1 Concept

The Kernelized Correlation Filter (KCF) tracker frames object tracking as a ridge regression problem in the Fourier domain. By exploiting the circulant structure of image patches, KCF achieves exceptional speed while maintaining competitive accuracy.

Key ideas:

  1. Circulant matrices: Cyclic shifts (translations) of an image patch generate a circulant matrix. Every circulant matrix is diagonalized by the Discrete Fourier Transform (DFT), so its eigenvalues are simply the DFT of its generating vector (see the numerical check after this list).

  2. Kernel trick: Map features to a high-dimensional space using a kernel function (e.g., Gaussian RBF), enabling non-linear boundary estimation without explicitly computing the high-dimensional representation.

  3. Fast detection: In the Fourier domain, solving the kernel ridge regression reduces to element-wise division — making detection extremely fast.
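
A quick numerical check of idea 1, as a minimal sketch (NumPy and SciPy are both in the requirements): scipy.linalg.circulant builds a circulant matrix from its first column, and multiplying it against the DFT basis vectors confirms that its eigenvalues are exactly the DFT of that vector.

import numpy as np
from scipy.linalg import circulant

n = 8
c = np.random.randn(n)
C = circulant(c)                      # circulant matrix generated by c (its first column)
lam = np.fft.fft(c)                   # claimed eigenvalues: the DFT of c

# DFT basis vectors v_k[j] = exp(+2j*pi*j*k/n), stacked as columns of V
j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
V = np.exp(2j * np.pi * j * k / n)

# Each column of V is an eigenvector of C with eigenvalue lam[k]
print(np.allclose(C @ V, V * lam))    # True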

3.2 Mathematical Formulation

Training: Given a set of image patches \(x_i\) extracted from training frames, solve ridge regression:

\[ \min_{\mathbf{w}} \sum_i \left( f(\mathbf{x}_i) - y_i \right)^2 + \lambda \|\mathbf{w}\|^2 \]

where \(y_i\) is the Gaussian-shaped regression target (a 2D Gaussian centered at the target).

Solution in Fourier domain (with circulant matrix \(\mathbf{C}(\mathbf{x})\)):

\[ \hat{\mathbf{w}} = \frac{\hat{\mathbf{x}}^* \odot \hat{\mathbf{y}}}{\hat{\mathbf{x}}^* \odot \hat{\mathbf{x}} + \lambda} \]

where \(\hat{}\) denotes the DFT, \(\odot\) is element-wise multiplication, and \(^*\) is complex conjugate.

Kernel mapping: Using a kernel \(\kappa(\mathbf{x}, \mathbf{x}')\), the classifier becomes:

\[ f(\mathbf{z}) = \mathbf{w}^T \phi(\mathbf{z}) = \sum_i \alpha_i \kappa(\mathbf{x}_i, \mathbf{z}) \]

where \(\alpha_i\) are the learned weights.

Detection: For a new patch \(\mathbf{z}\), compute the correlation response:

\[ \hat{\mathbf{f}}(\mathbf{z}) = \hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}} \]

The peak of the response map indicates the new target center.

3.3 Complete Python Code — KCF Tracker

"""
Tier 1: Kernelized Correlation Filter (KCF) Tracker
====================================================
A simplified from-scratch implementation: a HOG feature extractor (for study) plus a
single-channel correlation filter trained by ridge regression in the Fourier domain
(the linear-kernel special case of KCF, i.e. MOSSE-style).
For production use, consider OpenCV's cv2.TrackerKCF_create().

Features:
- HOG (Histogram of Oriented Gradients) feature extraction
- Gaussian kernel correlation via FFT (provided for extension to the fully kernelized form)
- Circulant matrix theory for fast training/detection via FFT
"""

import numpy as np
import cv2
import time


# ─── HOG Feature Extractor ────────────────────────────────────────

def compute_hog_features(gray: np.ndarray, cell_size: int = 4, num_bins: int = 9) -> np.ndarray:
    """
    Compute HOG features from a grayscale image.

    Parameters
    ----------
    gray : 2D array (H x W)
        Grayscale image (already cropped to the patch region)
    cell_size : int
        Size of each HOG cell in pixels
    num_bins : int
        Number of orientation bins

    Returns
    -------
    features : 1D array
        Flattened HOG feature vector
    """
    h, w = gray.shape
    n_cells_x = w // cell_size
    n_cells_y = h // cell_size

    # 1. Compute image gradients
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=1)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=1)
    magnitude = np.sqrt(gx**2 + gy**2)
    orientation = np.arctan2(gy, gx) * (180 / np.pi) % 180  # [0, 180)

    # 2. Build histogram per cell
    cell_histograms = np.zeros((n_cells_y, n_cells_x, num_bins), dtype=np.float32)
    bin_width = 180 / num_bins

    for cy in range(n_cells_y):
        for cx in range(n_cells_x):
            y0, y1 = cy * cell_size, (cy + 1) * cell_size
            x0, x1 = cx * cell_size, (cx + 1) * cell_size

            mag_patch = magnitude[y0:y1, x0:x1].flatten()
            ori_patch = orientation[y0:y1, x0:x1].flatten()

            hist = np.zeros(num_bins, dtype=np.float32)
            for m, o in zip(mag_patch, ori_patch):
                bin_idx = int(o / bin_width) % num_bins
                hist[bin_idx] += m

            cell_histograms[cy, cx] = hist

    # 3. Block normalization (L2-norm)
    eps = 1e-5
    features = []
    for by in range(n_cells_y - 1):
        for bx in range(n_cells_x - 1):
            block = cell_histograms[by:by+2, bx:bx+2].flatten()
            norm = np.sqrt(np.sum(block**2) + eps)
            features.extend(block / norm)

    return np.array(features, dtype=np.float32)


def gaussian_response_map(size: tuple, sigma: float = 2.0) -> np.ndarray:
    """
    Generate a 2D Gaussian response map as regression target.

    Parameters
    ----------
    size : (H, W)
        Size of the response map
    sigma : float
        Standard deviation of the Gaussian

    Returns
    -------
    response : 2D array
        Gaussian-shaped target map
    """
    h, w = size
    cy, cx = h // 2, w // 2
    y, x = np.ogrid[:h, :w]
    response = np.exp(-((x - cx)**2 + (y - cy)**2) / (2 * sigma**2))
    return response.astype(np.float32)


def gaussian_kernel(x1: np.ndarray, x2: np.ndarray, sigma: float = 0.5) -> np.ndarray:
    """
    Gaussian kernel correlation k^{x1 x2} between two 2D feature maps,
    evaluated for all cyclic shifts at once via the FFT (Henriques et al.).

    Parameters
    ----------
    x1, x2 : 2D arrays (H x W)
        Feature maps (e.g., windowed grayscale patches)
    sigma : float
        Kernel bandwidth

    Returns
    -------
    k : 2D array (H x W)
        k[dy, dx] = kappa(x1, x2 cyclically shifted by (dy, dx))

    Note: the simplified tracker below uses the linear-kernel special case,
    so this function is provided for study / extension to the full kernelized form.
    """
    # Cross-correlation of x1 with every cyclic shift of x2, computed via the FFT
    cross = np.real(np.fft.ifft2(np.fft.fft2(x1) * np.conj(np.fft.fft2(x2))))
    # Squared Euclidean distance between x1 and each shifted x2, normalized by size
    dist = np.maximum(np.sum(x1**2) + np.sum(x2**2) - 2.0 * cross, 0) / x1.size
    return np.exp(-dist / (sigma**2)).astype(np.float32)


class KCFTracker:
    """
    Simplified correlation filter tracker (the linear-kernel special case of KCF,
    i.e. MOSSE-style) operating on a cosine-windowed grayscale patch.
    """

    def __init__(self, patch_size: int = 64, lambda_reg: float = 1e-4,
                 sigma: float = 0.5, interp_factor: float = 0.075):
        self.patch_size = patch_size          # Size of the search / training patch
        self.lambda_reg = lambda_reg          # Regularization parameter
        self.sigma = sigma                    # Gaussian kernel bandwidth
        self.interp_factor = interp_factor   # Model update rate

        self.model_alphas = None  # Correlation filter in the Fourier domain
        self.model_x = None       # Feature map of the training patch (windowed grayscale)
        self.window = None        # Cosine (Hann) window applied to every patch
        self.target_pos = None    # (x, y) target center in image coordinates
        self.target_sz = None     # (width, height)

    def init(self, frame: np.ndarray, bbox: tuple) -> None:
        """
        Initialize tracker with first frame and bounding box.

        Parameters
        ----------
        frame : BGR image
        bbox : (x, y, w, h) — top-left corner + dimensions
        """
        x, y, w, h = bbox
        cx, cy = x + w / 2, y + h / 2
        self.target_pos = (float(cx), float(cy))
        self.target_sz = (float(w), float(h))

        # Extract a padded patch around the target and resize it to the model size
        padded_sz = (self.target_sz[0] * 1.5, self.target_sz[1] * 1.5)
        patch = self._extract_patch(frame, self.target_pos, padded_sz, self.patch_size)
        gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0

        # Feature map: cosine-windowed grayscale patch.  (Full KCF uses multi-channel
        # HOG maps; the single-channel case keeps the FFT bookkeeping simple.)
        self.window = np.outer(np.hanning(self.patch_size),
                               np.hanning(self.patch_size)).astype(np.float32)
        self.model_x = gray * self.window

        # Regression target: 2D Gaussian label centered on the target
        y_target = gaussian_response_map(self.model_x.shape, sigma=2.0)

        # Train the filter in the Fourier domain (linear-kernel / MOSSE form):
        #   filter = (y_hat * conj(x_hat)) / (x_hat * conj(x_hat) + lambda)
        x_fft = np.fft.fft2(self.model_x)
        self.model_alphas = (np.fft.fft2(y_target) * np.conj(x_fft)) / (np.abs(x_fft)**2 + self.lambda_reg)

    def update(self, frame: np.ndarray) -> tuple:
        """
        Track the object in the current frame.

        Returns
        -------
        bbox : (x, y, w, h)
        """
        if self.target_pos is None:
            raise ValueError("Tracker not initialized. Call init() first.")

        # Extract the search patch (same 1.5x padding as the training patch)
        search_sz = (self.target_sz[0] * 1.5, self.target_sz[1] * 1.5)
        patch = self._extract_patch(frame, self.target_pos, search_sz, self.patch_size)
        gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0

        # Feature map of the search patch (same windowing as training)
        feat = gray * self.window

        # Compute response map via FFT: response = ifft(z_hat * filter)
        feat_fft = np.fft.fft2(feat)
        response = np.real(np.fft.ifft2(feat_fft * self.model_alphas))

        # Find peak (maximum response); displacement is measured from the patch center
        peak_idx = np.unravel_index(np.argmax(response), response.shape)
        dy = peak_idx[0] - response.shape[0] // 2
        dx = peak_idx[1] - response.shape[1] // 2

        # Scale displacement from patch coordinates to image coordinates
        scale_x = search_sz[0] / self.patch_size
        scale_y = search_sz[1] / self.patch_size
        dx_px = dx * scale_x
        dy_px = dy * scale_y

        # Update target position
        self.target_pos = (
            self.target_pos[0] + dx_px,
            self.target_pos[1] + dy_px
        )

        # Online model update (linear interpolation of feature map and filter)
        patch_new = self._extract_patch(frame, self.target_pos, search_sz, self.patch_size)
        gray_new = cv2.cvtColor(patch_new, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
        feat_new = gray_new * self.window

        self.model_x = (1 - self.interp_factor) * self.model_x + self.interp_factor * feat_new
        x_fft_new = np.fft.fft2(self.model_x)
        y_fft = np.fft.fft2(gaussian_response_map(self.model_x.shape, sigma=2.0))
        alpha_new = (y_fft * np.conj(x_fft_new)) / (np.abs(x_fft_new)**2 + self.lambda_reg)
        self.model_alphas = (1 - self.interp_factor) * self.model_alphas + self.interp_factor * alpha_new

        return self._bbox_from_center(self.target_pos, self.target_sz)

    def _extract_patch(self, frame: np.ndarray, center: tuple,
                       size: tuple, patch_size: int) -> np.ndarray:
        """Extract and resize a patch centered at `center` with size `size`."""
        x, y = int(center[0] - size[0] / 2), int(center[1] - size[1] / 2)
        patch = frame[y:y+int(size[1]), x:x+int(size[0])]
        if patch.size == 0:
            return np.zeros((patch_size, patch_size, 3), dtype=np.uint8)
        return cv2.resize(patch, (patch_size, patch_size))

    def _bbox_from_center(self, center: tuple, size: tuple) -> tuple:
        """Convert (cx, cy, w, h) center format to (x, y, w, h) top-left format."""
        return (int(center[0] - size[0] / 2),
                int(center[1] - size[1] / 2),
                int(size[0]),
                int(size[1]))


def demo_kcf_tracker(video_source: int = 0, initial_bbox: tuple = None):
    """
    Demo the KCF tracker on a live camera feed.
    Draw a bounding box with your mouse to select the object.
    """
    cap = cv2.VideoCapture(video_source)
    if not cap.isOpened():
        print("[ERROR] Cannot open camera")
        return

    # Collect first frame and let user select bbox
    ret, frame = cap.read()
    if not ret:
        print("[ERROR] Cannot read first frame")
        return

    if initial_bbox is None:
        print("[INFO] Draw a bounding box on the image, then press ENTER or SPACE")
        bbox = cv2.selectROI("Select Object", frame, fromCenter=False, showCrosshair=True)
        cv2.destroyWindow("Select Object")
    else:
        bbox = initial_bbox

    # Initialize tracker
    tracker = KCFTracker()
    tracker.init(frame, bbox)

    print("[INFO] KCF Tracking started. Press 'q' to quit.")
    fps_counter, last_time = 0, time.time()

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        t0 = time.time()
        tracked_bbox = tracker.update(frame)
        elapsed = time.time() - t0

        fps = 1.0 / elapsed if elapsed > 0 else 0
        fps_counter += 1

        # Draw bounding box
        x, y, w, h = tracked_bbox
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, f"KCF  FPS: {fps:.1f}", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 255), 2)

        cv2.imshow("KCF Tracker", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()
    print(f"[INFO] Average FPS: {fps_counter / max(1, time.time() - last_time):.1f}")


if __name__ == "__main__":
    # Example: track in first detected motion or manually specify bbox
    demo_kcf_tracker(video_source=0, initial_bbox=None)

3.4 OpenCV Built-in KCF

OpenCV provides a ready-made KCF implementation:

import cv2

cap = cv2.VideoCapture(0)

# KCF requires the contrib build: pip install opencv-contrib-python
tracker = cv2.TrackerKCF_create()
# Or for an even faster (but less accurate) baseline (~1000 FPS);
# in OpenCV >= 4.5 MOSSE lives in the legacy module:
# tracker = cv2.legacy.TrackerMOSSE_create()

ret, frame = cap.read()
bbox = cv2.selectROI("Select", frame, fromCenter=False)
tracker.init(frame, bbox)

while True:
    ret, frame = cap.read()
    if not ret:
        break
    success, bbox = tracker.update(frame)
    if success:
        x, y, w, h = [int(v) for v in bbox]
        cv2.rectangle(frame, (x, y), (x+w, y+h), (0, 255, 0), 2)
    cv2.imshow("KCF", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

4. Tier 2 — Intermediate: SORT & DeepSORT

4.1 Concept

SORT (Simple Online and Realtime Tracking) and DeepSORT (SORT with a deep association metric) are detection-based multi-object trackers. Instead of searching for the object directly, they:

  1. Detect objects in every frame using an object detector (e.g., YOLO)
  2. Predict where each track is expected to be in the current frame (using a Kalman filter)
  3. Associate detections to tracks using an assignment algorithm (Hungarian algorithm)
  4. Update the Kalman filter with matched detections
  ┌─────────────────────────────────────────────────────────────┐
  │                   SORT / DeepSORT Pipeline                   │
  │                                                              │
  │   Frame t-1 ──► YOLO Detection ──► ┌─────────────┐           │
  │                                    │ Kalman      │           │
  │   Frame t ────► YOLO Detection ──►│ Filter      │           │
  │                                    │ Prediction  │           │
  │   Frame t+1 ──► YOLO Detection ──► │             │           │
  │                                    │ Hungarian   │           │
  │                                    │ Assignment  │           │
  │                                    └──────┬──────┘           │
  │   Tracks (ID, bbox, age) ◄──────────────┘                    │
  └─────────────────────────────────────────────────────────────┘

4.2 Mathematical Foundation

4.2.1 Kalman Filter

For 2D tracking, we model each object as having position \((x, y)\), size \((w, h)\), and velocity \((\dot{x}, \dot{y})\). The state vector is:

\[ \mathbf{x} = [x, y, w, h, \dot{x}, \dot{y}]^T \]

State transition model (constant velocity):

\[ \mathbf{x}_t = \mathbf{F} \cdot \mathbf{x}_{t-1} + \mathbf{w}_t \]

where:

\[ \mathbf{F} = \begin{bmatrix} 1 & 0 & 0 & 0 & dt & 0 \\ 0 & 1 & 0 & 0 & 0 & dt \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \]

Measurement model (we only observe position and size):

\[ \mathbf{z}_t = \mathbf{H} \cdot \mathbf{x}_t + \mathbf{v}_t \]

where \(\mathbf{H} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{bmatrix}\).
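
As a concrete instance of these equations, the sketch below sets up the same constant-velocity model with filterpy (listed in the software requirements). The noise magnitudes and the synthetic measurements are illustrative placeholders, not tuned values.

import numpy as np
from filterpy.kalman import KalmanFilter

dt = 1.0
kf = KalmanFilter(dim_x=6, dim_z=4)    # state [x, y, w, h, vx, vy], measurement [x, y, w, h]

kf.F = np.eye(6)
kf.F[0, 4] = dt                        # x += vx * dt
kf.F[1, 5] = dt                        # y += vy * dt

kf.H = np.zeros((4, 6))
kf.H[0, 0] = kf.H[1, 1] = kf.H[2, 2] = kf.H[3, 3] = 1.0   # observe x, y, w, h only

kf.Q *= 0.01                           # process noise (placeholder value)
kf.R *= 0.1                            # measurement noise (placeholder value)
kf.P *= 10.0                           # initial uncertainty

kf.x[:4, 0] = [100.0, 50.0, 40.0, 80.0]   # first detection

# Feed detections of a box drifting right at 5 px/frame
for t in range(1, 16):
    kf.predict()
    kf.update(np.array([100.0 + 5.0 * t, 50.0, 40.0, 80.0]))

print(kf.x[0, 0], kf.x[4, 0])          # x estimate near 175, vx estimate near 5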

4.2.2 Hungarian Algorithm

The Hungarian algorithm (Kuhn-Munkres) solves the minimum-cost assignment problem between \(n\) detections and \(m\) tracks:

\[ \min_{\mathbf{A}} \sum_{i=1}^{n} \sum_{j=1}^{m} A_{ij} \cdot c_{ij} \]

subject to: each detection assigned to at most one track and vice versa.

IoU-based cost matrix for SORT:

\[ \text{IoU}(d, t) = \frac{\text{area}(d \cap t)}{\text{area}(d \cup t)} \]

DeepSORT adds an appearance descriptor (cosine distance) to the cost matrix:

\[ c_{ij} = \lambda \cdot (1 - \text{IoU}_{ij}) + (1 - \lambda) \cdot (1 - s_{ij}) \]

where \(s_{ij}\) is the cosine similarity between deep appearance features of detection \(d_i\) and track \(t_j\).
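
A tiny worked example of the assignment step, assuming scipy: an invented 2-detection by 3-track IoU matrix, the same 1 − IoU cost used by SORT, and gating by the IoU threshold.

import numpy as np
from scipy.optimize import linear_sum_assignment

iou = np.array([
    [0.72, 0.05, 0.00],   # detection 0 vs tracks 0, 1, 2 (made-up values)
    [0.10, 0.64, 0.02],   # detection 1 vs tracks 0, 1, 2
])
cost = 1.0 - iou

det_idx, trk_idx = linear_sum_assignment(cost)   # minimize total cost
iou_threshold = 0.3

for d, t in zip(det_idx, trk_idx):
    if iou[d, t] >= iou_threshold:               # reject weak matches
        print(f"detection {d} -> track {t} (IoU = {iou[d, t]:.2f})")
# detection 0 -> track 0 (IoU = 0.72)
# detection 1 -> track 1 (IoU = 0.64)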

4.3 Complete Python Code — SORT Tracker

"""
Tier 2: SORT Tracker (Simple Online and Realtime Tracking)
==========================================================
Detection-based multi-object tracking with Kalman filter + Hungarian assignment.
Uses YOLO for detection (via ultralytics) and IoU for association.

Modules:
1. KalmanFilter — constant-velocity model in 2D image space
2. Track — manages individual track state and history
3. Sort — manages all tracks and performs assignment
"""

import numpy as np
import cv2
from scipy.optimize import linear_sum_assignment
from collections import deque


# ─── Kalman Filter ────────────────────────────────────────────────

class KalmanFilter:
    """
    Constant-velocity Kalman filter for 2D bounding box tracking.

    State: [x, y, w, h, vx, vy]
    Measurement: [x, y, w, h]
    """

    def __init__(self, dt: float = 1.0):
        # State transition matrix (constant velocity model)
        self.F = np.array([
            [1, 0, 0, 0, dt, 0],
            [0, 1, 0, 0, 0,  dt],
            [0, 0, 1, 0, 0,  0],
            [0, 0, 0, 1, 0,  0],
            [0, 0, 0, 0, 1,  0],
            [0, 0, 0, 0, 0,  1],
        ], dtype=np.float32)

        # Measurement matrix (observe x, y, w, h)
        self.H = np.array([
            [1, 0, 0, 0, 0, 0],
            [0, 1, 0, 0, 0, 0],
            [0, 0, 1, 0, 0, 0],
            [0, 0, 0, 1, 0, 0],
        ], dtype=np.float32)

        # Process noise covariance
        self.Q = np.eye(6, dtype=np.float32) * 0.01

        # Measurement noise covariance
        self.R = np.eye(4, dtype=np.float32) * 0.1

        # State and covariance
        self.x = np.zeros((6, 1), dtype=np.float32)  # State estimate
        self.P = np.eye(6, dtype=np.float32)          # Error covariance

    def init(self, measurement: np.ndarray) -> None:
        """Initialize filter with first measurement [x, y, w, h]."""
        self.x[:4] = measurement.reshape(4, 1)
        self.x[4:] = 0.0  # velocity = 0

    def predict(self) -> tuple:
        """Predict next state and return projected measurement."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q

        # Project to measurement space
        z_pred = self.H @ self.x
        return z_pred.flatten()

    def update(self, measurement: np.ndarray) -> tuple:
        """Update state with a new measurement and return projected measurement."""
        z = measurement.reshape(4, 1).astype(np.float32)
        z_pred = self.H @ self.x

        # Innovation (measurement residual)
        y = z - z_pred

        # Kalman gain
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)

        # Update state and covariance
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P

        # Return the corrected (posterior) state projected into measurement space
        return (self.H @ self.x).flatten()


# ─── Track ────────────────────────────────────────────────────────

class Track:
    """A single track (object trajectory)."""

    _next_id = 1

    def __init__(self, bbox: np.ndarray, frame_id: int):
        self.id = Track._next_id
        Track._next_id += 1

        self.bbox = bbox          # [x, y, w, h]
        self.age = 1              # Frames since first detection
        self.hits = 1             # Total matched detections
        self.time_since_update = 0

        # Kalman filter for prediction
        self.kf = KalmanFilter()
        self.kf.init(bbox)

        # Track history (for visualization)
        self.trace = deque(maxlen=30)

    def predict(self) -> np.ndarray:
        """Predict next bounding box using Kalman filter."""
        self.bbox = self.kf.predict()
        self.time_since_update += 1
        return self.bbox

    def update(self, bbox: np.ndarray) -> None:
        """Update track with new detection."""
        self.bbox = self.kf.update(bbox)
        self.hits += 1
        self.time_since_update = 0

    def state(self) -> str:
        """Return track state: 'confirmed' if hits >= 3, else 'tentative'."""
        return 'confirmed' if self.hits >= 3 else 'tentative'


# ─── IoU Cost Matrix ──────────────────────────────────────────────

def compute_iou(boxA: np.ndarray, boxB: np.ndarray) -> float:
    """
    Compute Intersection over Union between two bounding boxes.

    Parameters
    ----------
    boxA, boxB : [x, y, w, h]

    Returns
    -------
    iou : float in [0, 1]
    """
    xA, yA, wA, hA = boxA
    xB, yB, wB, hB = boxB

    xi1 = max(xA, xB)
    yi1 = max(yA, yB)
    xi2 = min(xA + wA, xB + wB)
    yi2 = min(yA + hA, yB + hB)

    inter_area = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    union_area = wA * hA + wB * hB - inter_area

    return inter_area / (union_area + 1e-6)


def compute_iou_matrix(detections: list, tracks: list) -> np.ndarray:
    """Build IoU cost matrix: rows=detections, cols=tracks."""
    n, m = len(detections), len(tracks)
    matrix = np.zeros((n, m))
    for i, det in enumerate(detections):
        for j, trk in enumerate(tracks):
            matrix[i, j] = 1.0 - compute_iou(det, trk)  # Cost = 1 - IoU
    return matrix


# ─── SORT Tracker ─────────────────────────────────────────────────

class SORT:
    """
    SORT multi-object tracker.

    Parameters
    ----------
    max_age : int
        Maximum frames without update before track is removed
    min_hits : int
        Minimum detections before track is confirmed
    iou_threshold : float
        Minimum IoU for a match (0.3 typical)
    """

    def __init__(self, max_age: int = 30, min_hits: int = 3, iou_threshold: float = 0.3):
        self.max_age = max_age
        self.min_hits = min_hits
        self.iou_threshold = iou_threshold
        self.tracks: list[Track] = []
        self.frame_count = 0

    def update(self, detections: np.ndarray, frame_id: int) -> np.ndarray:
        """
        Update tracker with detections from the current frame.

        Parameters
        ----------
        detections : (N, 4) array — [x, y, w, h] per detection

        Returns
        -------
        tracked_bboxes : (M, 5) array — [x, y, w, h, track_id] for confirmed tracks
        """
        self.frame_count += 1

        # ── 1. Predict step: advance all tracks ──────────────────
        for track in self.tracks:
            track.predict()

        # ── 2. Hungarian assignment ──────────────────────────────
        if len(detections) > 0:
            matched, unmatched_dets, unmatched_trks = self._associate(detections)

            # Update matched tracks
            for d_idx, t_idx in matched:
                det = detections[d_idx]
                self.tracks[t_idx].update(det)

            # Create new tracks for unmatched detections
            for d_idx in unmatched_dets:
                det = detections[d_idx]
                self.tracks.append(Track(det, frame_id))

        # ── 3. Remove stale tracks (not updated for more than max_age frames).
        #    Note: min_hits only gates the output below; using it here would
        #    delete brand-new tracks before they can be confirmed.
        self.tracks = [
            t for t in self.tracks
            if t.time_since_update <= self.max_age
        ]

        # ── 4. Output confirmed tracks ────────────────────────────
        result = []
        for track in self.tracks:
            if track.state() == 'confirmed':
                x, y, w, h = [int(v) for v in track.bbox]
                result.append([x, y, w, h, track.id])
                track.trace.append((x + w//2, y + h//2))

        return np.array(result, dtype=np.int32) if result else np.empty((0, 5))

    def _associate(self, detections: np.ndarray) -> tuple:
        """Match detections to tracks using Hungarian algorithm on IoU cost."""
        if len(self.tracks) == 0:
            return [], list(range(len(detections))), []

        cost_matrix = compute_iou_matrix(detections, [t.bbox for t in self.tracks])

        # Hungarian algorithm (minimize cost)
        row_idx, col_idx = linear_sum_assignment(cost_matrix)

        matched = []
        unmatched_dets = list(range(len(detections)))
        unmatched_trks = list(range(len(self.tracks)))

        for r, c in zip(row_idx, col_idx):
            if cost_matrix[r, c] < (1.0 - self.iou_threshold):
                matched.append((r, c))
                if r in unmatched_dets:
                    unmatched_dets.remove(r)
                if c in unmatched_trks:
                    unmatched_trks.remove(c)

        return matched, unmatched_dets, unmatched_trks


# ─── Complete SORT Demo with YOLO ────────────────────────────────

def run_sort_with_yolo(video_source: int = 0, confidence: float = 0.3):
    """
    Run SORT multi-object tracker with YOLO detection.

    Requires: pip install ultralytics
    """
    try:
        from ultralytics import YOLO
    except ImportError:
        print("[ERROR] ultralytics not installed. Run: pip install ultralytics")
        return

    # Load YOLO model (YOLOv8n = nano, fastest)
    print("[INFO] Loading YOLOv8n...")
    model = YOLO("yolov8n.pt")

    cap = cv2.VideoCapture(video_source)
    if not cap.isOpened():
        print("[ERROR] Cannot open camera")
        return

    tracker = SORT(max_age=30, min_hits=3, iou_threshold=0.3)
    frame_id = 0

    # COCO class names for person/vehicle tracking
    CLASSES = ['person', 'bicycle', 'car', 'motorcycle', 'bus', 'truck']

    print("[INFO] SORT+DeepSORT tracking started. Press 'q' to quit.")

    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frame_id += 1

        # ── Detect ──
        results = model(frame, verbose=False)[0]
        detections = []
        for box in results.boxes:
            cls_id = int(box.cls[0])
            conf = float(box.conf[0])
            if conf < confidence:
                continue
            if results.names[cls_id] not in CLASSES:
                continue
            x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
            x, y = int(x1), int(y1)
            w, h = int(x2 - x1), int(y2 - y1)
            detections.append(np.array([x, y, w, h], dtype=np.float32))

        detections = np.array(detections) if detections else np.empty((0, 4))

        # ── Track ──
        tracked = tracker.update(detections, frame_id)

        # ── Visualize ──
        for x, y, w, h, track_id in tracked:
            cv2.rectangle(frame, (x, y), (x+w, y+h), (0, 255, 0), 2)
            cv2.putText(frame, f"ID:{track_id}", (x, y - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

        cv2.putText(frame, f"Frame: {frame_id}  Tracks: {len(tracked)}",
                    (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 255), 2)
        cv2.imshow("SORT + YOLO", frame)

        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()


if __name__ == "__main__":
    run_sort_with_yolo(video_source=0)

4.4 DeepSORT — Appearance Descriptor Extension

DeepSORT extends SORT by adding a deep appearance descriptor (ReID network) to handle occlusions where IoU matching fails:

"""
DeepSORT Extension — Appearance Feature Extraction
=================================================
Uses a pre-trained ReID network to extract appearance descriptors
for cosine-distance matching alongside IoU.
"""

import torch
import torch.nn as nn
import numpy as np


class EmbeddingNetwork(nn.Module):
    """
    Simple CNN embedding network for ReID.
    Outputs a 128-dim unit-normalized feature vector.
    For production, use OSNet or ResNet-50 pretrained on Market-1501.
    """

    def __init__(self, input_dim: int = 512, embedding_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        self.fc = nn.Linear(128, embedding_dim)
        self.bn = nn.BatchNorm1d(embedding_dim)

    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        x = self.bn(x)
        # L2 normalize for cosine similarity
        x = nn.functional.normalize(x, p=2, dim=1)
        return x


def cosine_distance(feat1: np.ndarray, feat2: np.ndarray) -> float:
    """Cosine distance between two feature vectors."""
    return 1.0 - np.dot(feat1, feat2) / (np.linalg.norm(feat1) * np.linalg.norm(feat2) + 1e-6)


# DeepSORT uses the cosine distance in addition to IoU:
# cost = lambda * (1 - IoU) + (1 - lambda) * (1 - cosine_sim)
# With this weighting, a small lambda (e.g. 0.02) lets appearance dominate;
# lambda = 1 reduces to plain IoU matching as in SORT.
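
As a sketch of how the combined cost matrix could be assembled for the assignment step, reusing compute_iou from Section 4.3 and cosine_distance above (the function name and the default lam value are illustrative, not from DeepSORT itself):

def combined_cost_matrix(det_boxes, trk_boxes, det_feats, trk_feats, lam=0.5):
    """Illustrative DeepSORT-style cost: lam * (1 - IoU) + (1 - lam) * cosine distance."""
    n, m = len(det_boxes), len(trk_boxes)
    cost = np.zeros((n, m), dtype=np.float32)
    for i in range(n):
        for j in range(m):
            iou_term = 1.0 - compute_iou(det_boxes[i], trk_boxes[j])   # overlap / motion term
            app_term = cosine_distance(det_feats[i], trk_feats[j])     # appearance term
            cost[i, j] = lam * iou_term + (1.0 - lam) * app_term
    return cost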

5. Tier 3 — Modern: Transformer-Based Tracking

5.1 Concept

Transformer trackers replace hand-crafted features and correlation filters with attention mechanisms that learn global context. The dominant paradigm is template-matching: a template (initial target) and a search region (current frame) are tokenized and processed by a transformer encoder-decoder.

┌─────────────────────────────────────────────────────────────────┐
│              Transformer Tracker Architecture                    │
│                                                                  │
│   Frame t-1 (Template)         Frame t (Search)                  │
│   ┌─────────────────┐          ┌──────────────────┐              │
│   │  BBox: (x,y,w,h)│          │   Full Frame     │              │
│   └────────┬────────┘          └────────┬─────────┘              │
│            │                             │                         │
│            ▼                             ▼                         │
│   ┌─────────────────┐          ┌──────────────────┐              │
│   │ Template Patch   │          │ Search Region    │              │
│   │ (cropped + resize)│          │ (3x template sz) │              │
│   └────────┬────────┘          └────────┬─────────┘              │
│            │                             │                         │
│            ▼                             ▼                         │
│   ┌──────────────────────────────────────────────┐               │
│   │              Token Embedding                  │               │
│   │   [CLS] t1 t2 ... tk [SEP] s1 s2 ... sm      │               │
│   └────────┬────────────────────────────────────┘               │
│            │                                                     │
│            ▼                                                     │
│   ┌──────────────────────────────────────────────┐               │
│   │          Transformer Encoder (SA + CA)        │               │
│   │  Self-attention: template-template            │               │
│   │  Cross-attention: template ↔ search           │               │
│   └────────┬────────────────────────────────────┘               │
│            │                                                     │
│            ▼                                                     │
│   ┌──────────────────────────────────────────────┐               │
│   │          Transformer Decoder                  │               │
│   │  Query = learnable [QUERY] token              │               │
│   │  Output = bounding box regression             │               │
│   └──────────────────────────────────────────────┘               │
│            │                                                     │
│            ▼                                                     │
│        BBox: (x, y, w, h) + Confidence                             │
└─────────────────────────────────────────────────────────────────┘

Key innovations:

  1. Cross-attention: The template tokens attend to search tokens, learning which search features correspond to the template. This enables handling appearance changes (deformation, occlusion) that correlation filters cannot handle.

  2. Global context: Unlike KCF's local correlation, transformers capture long-range dependencies between all pixels in the search region.

  3. Template updating: Most modern trackers (OSTrack, MixFormer) update the template online using features from recent frames, enabling long-term tracking.

5.2 Attention Mechanism

Scaled Dot-Product Attention:

\[ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left( \frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}} \right) \mathbf{V} \]

where:

  • \(\mathbf{Q}\) (Query): what we are looking for
  • \(\mathbf{K}\) (Key): what the image contains
  • \(\mathbf{V}\) (Value): the actual feature values
  • \(d_k\): the key dimension (used for scaling)

Multi-Head Attention:

\[ \text{MHA}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \mathbf{W}^O \]

where each \(\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)\).

5.3 Complete Python Code — Attention-Based Tracker

"""
Tier 3: Transformer-Based Visual Object Tracker
================================================
A simplified transformer tracker demonstrating:
1. Token embedding (patch + position)
2. Self-attention and cross-attention
3. Bounding box regression from [CLS] token

For production use, consider OSTrack, MixFormer, or TransT.
"""

import numpy as np
import cv2
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


# ─── Attention Layer ──────────────────────────────────────────────

class ScaledDotProductAttention(nn.Module):
    """Scaled dot-product attention: Q, K, V → attention scores."""

    def __init__(self, d_k: int):
        super().__init__()
        self.d_k = d_k

    def forward(self, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor,
                mask: torch.Tensor = None) -> tuple:
        """
        Parameters
        ----------
        Q : (B, num_heads, seq_len, d_k)
        K : (B, num_heads, seq_len, d_k)
        V : (B, num_heads, seq_len, d_v)
        mask : optional, (B, 1, seq_len, seq_len)

        Returns
        -------
        output : (B, num_heads, seq_len, d_v)
        attn_weights : (B, num_heads, seq_len, seq_len)
        """
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        attn_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attn_weights, V)
        return output, attn_weights


class MultiHeadAttention(nn.Module):
    """Multi-head attention with learned projections."""

    def __init__(self, d_model: int = 256, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.d_model = d_model

        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)

        self.attention = ScaledDotProductAttention(self.d_k)

    def forward(self, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor,
                mask: torch.Tensor = None) -> tuple:
        B, seq_len, d_model = Q.shape

        # Project and reshape to (B, num_heads, seq_len, d_k)
        Q = self.W_Q(Q).view(B, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_K(K).view(B, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_V(V).view(B, -1, self.num_heads, self.d_k).transpose(1, 2)

        if mask is not None:
            mask = mask.unsqueeze(1)  # Broadcast over heads

        out, attn = self.attention(Q, K, V, mask)

        # Concatenate heads and project
        out = out.transpose(1, 2).contiguous().view(B, -1, self.d_model)
        out = self.W_O(out)
        return out, attn


class FeedForward(nn.Module):
    """Position-wise FFN: linear → GELU → linear."""

    def __init__(self, d_model: int = 256, d_ff: int = 1024):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.gelu(self.fc1(x)))


class TransformerEncoderLayer(nn.Module):
    """Single transformer encoder layer: self-attention + FFN."""

    def __init__(self, d_model: int = 256, num_heads: int = 8, d_ff: int = 1024):
        super().__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        # Self-attention with residual
        attn_out, _ = self.mha(x, x, x, mask)
        x = self.norm1(x + attn_out)

        # FFN with residual
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)
        return x


class CrossAttentionLayer(nn.Module):
    """Cross-attention: query from one modality, key/value from another."""

    def __init__(self, d_model: int = 256, num_heads: int = 8, d_ff: int = 1024):
        super().__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, query: torch.Tensor, key_value: torch.Tensor,
                mask: torch.Tensor = None) -> torch.Tensor:
        """
        Parameters
        ----------
        query : (B, len_q, d_model)
        key_value : (B, len_kv, d_model)
        """
        attn_out, _ = self.mha(query, key_value, key_value, mask)
        x = self.norm1(query + attn_out)
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)
        return x


# ─── Patch Embedding ──────────────────────────────────────────────

class PatchEmbedding(nn.Module):
    """Convert an image patch into a sequence of tokens via convolution."""

    def __init__(self, patch_size: int = 16, embed_dim: int = 256, in_channels: int = 3):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size,
                               stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        x : (B, C, H, W)
        Returns: (B, num_patches, embed_dim)
        """
        # (B, C, H, W) → (B, embed_dim, H/patch, W/patch)
        x = self.proj(x)
        # Flatten: (B, embed_dim, num_patches_h, num_patches_w) → (B, num_patches, embed_dim)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)
        return x


class PositionalEmbedding(nn.Module):
    """Learnable positional embeddings for patches."""

    def __init__(self, num_patches: int, embed_dim: int):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pos_embed[:, :x.size(1), :]


# ─── Simplified Transformer Tracker ─────────────────────────────

class SimpleTransformerTracker(nn.Module):
    """
    Simplified transformer tracker with template + search tokenization.
    """

    def __init__(self, embed_dim: int = 256, num_heads: int = 8,
                 num_layers: int = 3, patch_size: int = 16):
        super().__init__()
        self.embed_dim = embed_dim
        self.patch_size = patch_size

        # Feature extractor for template and search
        self.patch_embed = PatchEmbedding(patch_size, embed_dim, in_channels=3)
        num_patches = (128 // patch_size) ** 2  # 128x128 search region
        self.pos_embed = PositionalEmbedding(num_patches * 2 + 2, embed_dim)

        # Learnable [CLS] and [SEP] tokens
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.template_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.search_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

        # Transformer layers
        self.encoder_layers = nn.ModuleList([
            TransformerEncoderLayer(embed_dim, num_heads) for _ in range(num_layers)
        ])
        self.cross_layers = nn.ModuleList([
            CrossAttentionLayer(embed_dim, num_heads) for _ in range(num_layers)
        ])

        # BBox regression head: from [CLS] token
        self.bbox_head = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 4),  # [dx, dy, dw, dh] relative offsets
            nn.Sigmoid()
        )

        # Initialize weights
        nn.init.xavier_uniform_(self.cls_token)
        nn.init.xavier_uniform_(self.template_token)
        nn.init.xavier_uniform_(self.search_token)

    def tokenize(self, template_img: torch.Tensor, search_img: torch.Tensor) -> torch.Tensor:
        """
        Combine template and search patches into a single sequence with tokens.

        Returns: (B, seq_len, embed_dim)
        """
        # Embed patches
        template_patches = self.patch_embed(template_img)   # (B, T, D)
        search_patches = self.patch_embed(search_img)       # (B, S, D)

        # Add [TPL] and [SRC] tokens
        B = template_patches.size(0)
        tpl_tokens = self.template_token.expand(B, -1, -1)
        src_tokens = self.search_token.expand(B, -1, -1)

        # Concatenate: [CLS], [TPL], template_patches, [SEP], search_patches
        tokens = torch.cat([
            self.cls_token.expand(B, -1, -1),
            tpl_tokens,
            template_patches,
            src_tokens,
            search_patches,
        ], dim=1)

        return self.pos_embed(tokens)

    def forward(self, template_img: torch.Tensor, search_img: torch.Tensor):
        """
        Forward pass.

        Parameters
        ----------
        template_img : (B, 3, 64, 64) — cropped and resized template
        search_img : (B, 3, 128, 128) — search region

        Returns
        -------
        bbox : (B, 4) — [cx_rel, cy_rel, w_rel, h_rel] in [0, 1]
        """
        tokens = self.tokenize(template_img, search_img)

        # Self-attention across all tokens
        for layer in self.encoder_layers:
            tokens = layer(tokens)

        # Cross-attention: the [CLS] token attends to all tokens and is refined layer by layer
        cls_out = tokens[:, :1]
        for layer in self.cross_layers:
            cls_out = layer(cls_out, tokens)

        # Use the refined [CLS] token for bbox regression
        cls_features = cls_out[:, 0]  # (B, D)
        bbox = self.bbox_head(cls_features)

        return bbox


# ─── Minimal Demo (Simulated) ────────────────────────────────────

def run_transformer_tracker_demo(video_source: int = 0):
    """
    Demo using a simplified transformer tracker on a live camera.
    This uses the PyTorch model above with simulated tracking (no real training).
    For real use, load OSTrack weights from the official repository.
    """
    print("[INFO] Transformer Tracker Demo")
    print("[INFO] For production, use OSTrack: pip install osrtrack")
    print("[INFO] This demo shows the architecture and pipeline.")

    cap = cv2.VideoCapture(video_source)
    if not cap.isOpened():
        print("[ERROR] Cannot open camera")
        return

    # Load model (simplified, not pre-trained)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = SimpleTransformerTracker(embed_dim=128, num_heads=4, num_layers=2)
    model = model.to(device)
    model.eval()

    # Get first frame and select bbox
    ret, frame = cap.read()
    if not ret:
        print("[ERROR] Cannot read first frame")
        return

    print("[INFO] Draw a bounding box, then press ENTER")
    bbox = cv2.selectROI("Select Object", frame, fromCenter=False, showCrosshair=True)
    cv2.destroyWindow("Select Object")

    x, y, w, h = bbox
    template_size = 64
    search_size = 128

    def crop_patch(frame, cx, cy, sz):
        """Crop and resize a square patch."""
        half = sz // 2
        x1, y1 = max(0, int(cx) - half), max(0, int(cy) - half)
        x2, y2 = min(frame.shape[1], int(cx) + half), min(frame.shape[0], int(cy) + half)
        patch = frame[y1:y2, x1:x2]
        if patch.size == 0:
            return np.zeros((sz, sz, 3), dtype=np.uint8)
        return cv2.resize(patch, (sz, sz))

    template = crop_patch(frame, x + w/2, y + h/2, template_size)
    search_region = crop_patch(frame, x + w/2, y + h/2, search_size)

    print("[INFO] Tracking started. Press 'q' to quit.")

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        tpl_tensor = torch.from_numpy(template).permute(2, 0, 1).unsqueeze(0).float().to(device) / 255.0
        src_tensor = torch.from_numpy(search_region).permute(2, 0, 1).unsqueeze(0).float().to(device) / 255.0

        with torch.no_grad():
            output = model(tpl_tensor, src_tensor)

        # Interpret the (untrained) network output as offsets within the search region
        dx, dy, dw, dh = output[0].cpu().numpy()
        cx, cy = x + w/2 + (dx - 0.5) * search_size, y + h/2 + (dy - 0.5) * search_size
        new_w, new_h = w * (1 + dw), h * (1 + dh)

        # Carry the position estimate forward so the next crops are centered on it
        # (the box size is kept fixed in this untrained demo)
        x, y = cx - w / 2, cy - h / 2

        # Update template and search region (online update)
        template = crop_patch(frame, cx, cy, template_size)
        search_region = crop_patch(frame, cx, cy, search_size)

        # Draw result
        ix, iy, iw, ih = int(cx - new_w/2), int(cy - new_h/2), int(new_w), int(new_h)
        cv2.rectangle(frame, (ix, iy), (ix+iw, iy+ih), (255, 0, 0), 2)
        cv2.putText(frame, "Transformer Tracker", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 0, 0), 2)
        cv2.imshow("Transformer Tracker", frame)

        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()


if __name__ == "__main__":
    run_transformer_tracker_demo(video_source=0)

6. Evaluation Metrics

6.1 MOTA & MOTP

MOTA (Multiple Object Tracking Accuracy):

\[ \text{MOTA} = 1 - \frac{\sum_t (m_t + fp_t + mme_t)}{\sum_t g_t} \in (-\infty, 1] \]

where:

  • \(m_t\): misses (false negatives) at frame \(t\)
  • \(fp_t\): false positives at frame \(t\)
  • \(mme_t\): mismatches (identity switches) at frame \(t\)
  • \(g_t\): number of ground-truth objects at frame \(t\)

MOTP (Multiple Object Tracking Precision):

\[ \text{MOTP} = \frac{\sum_t \sum_i d_t^i}{\sum_t c_t} \]

where \(d_t^i\) is the bounding box overlap (IoU) of the \(i\)-th matched detection at frame \(t\), and \(c_t\) is the number of matches.

6.2 ID Switches & Fragmentation

| Metric | Description |
|---|---|
| ID Switches | Number of times a track changes its assigned ID |
| Fragments | Number of times a track is interrupted and then resumes |
| MT (Mostly Tracked) | Tracks with ≥ 80% coverage of ground truth |
| ML (Mostly Lost) | Tracks with ≤ 20% coverage of ground truth |
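
The implementation in Section 6.3 below leaves the ID-switch count at zero; a minimal sketch of how ID switches could be counted, assuming a list of per-frame (gt_id, track_id) match pairs has already been produced:

def count_id_switches(per_frame_matches):
    """per_frame_matches: list over frames of lists of (gt_id, track_id) pairs."""
    last_assigned = {}           # gt_id -> track ID it was last matched to
    switches = 0
    for matches in per_frame_matches:
        for gt_id, track_id in matches:
            if gt_id in last_assigned and last_assigned[gt_id] != track_id:
                switches += 1    # the same object is now carried by a different track ID
            last_assigned[gt_id] = track_id
    return switches

# Example: GT object 7 is tracked as ID 3, lost, then re-acquired as ID 5 -> 1 switch
print(count_id_switches([[(7, 3)], [(7, 3)], [], [(7, 5)]]))   # 1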

6.3 Implementation

"""
MOT Evaluation Utilities
========================
Compute MOTA, MOTP, and related metrics from tracked vs ground truth data.
"""

import numpy as np


def compute_iou(boxA: np.ndarray, boxB: np.ndarray) -> float:
    """IoU between two [x, y, w, h] boxes."""
    xA, yA, wA, hA = boxA
    xB, yB, wB, hB = boxB
    xi1, yi1 = max(xA, xB), max(yA, yB)
    xi2, yi2 = min(xA + wA, xB + wB), min(yA + hA, yB + hB)
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    union = wA * hA + wB * hB - inter
    return inter / (union + 1e-6)


def evaluate_tracking(tracked: list, ground_truth: list) -> dict:
    """
    Compute MOT metrics.

    Parameters
    ----------
    tracked : list of dicts per frame: [{"id": int, "bbox": [x,y,w,h]}, ...]
    ground_truth : same format

    Returns
    -------
    metrics : dict with MOTA, MOTP, ID switches, misses, false positives, and total GT
    """
    total_misses = 0
    total_false_positives = 0
    total_switches = 0
    total_gt = 0
    total_overlap = 0.0
    total_matched = 0

    for frame_tracked, frame_gt in zip(tracked, ground_truth):
        gt_ids = {d["id"] for d in frame_gt}
        trk_ids = {d["id"] for d in frame_tracked}

        # Match by ID (simplified — assumes IDs already aligned)
        for gt_d in frame_gt:
            best_iou, best_trk = 0.0, None
            for trk_d in frame_tracked:
                if trk_d["id"] != gt_d["id"]:
                    continue
                iou = compute_iou(gt_d["bbox"], trk_d["bbox"])
                if iou > best_iou:
                    best_iou, best_trk = iou, trk_d

            if best_trk is not None:
                total_overlap += best_iou
                total_matched += 1
            else:
                total_misses += 1

        for trk_d in frame_tracked:
            if trk_d["id"] not in {d["id"] for d in frame_gt}:
                total_false_positives += 1

        total_gt += len(frame_gt)

    mota = 1.0 - (total_misses + total_false_positives + total_switches) / max(total_gt, 1)
    motp = total_overlap / max(total_matched, 1)

    return {
        "MOTA": mota,
        "MOTP": motp,
        "ID Sw": total_switches,
        "Total GT": total_gt,
        "Missed": total_misses,
        "False Positives": total_false_positives,
    }

7. Step-by-Step Implementation Guide

Phase 1 — Tier 1: KCF Tracker (Week 1)

  1. Start with the OpenCV built-in cv2.TrackerKCF_create() — verify it runs at 200+ FPS
  2. Implement the HOG feature extractor from scratch (Section 3.3)
  3. Implement the circulant matrix training step with FFT
  4. Test on a video sequence (OTB-50 dataset is commonly used)
  5. Compare with OpenCV's built-in tracker for correctness

Phase 2 — Tier 2: SORT Tracker (Week 2)

  1. Implement the Kalman filter (Section 4.3) and verify it on a simple constant-velocity example (see the sketch after this list)
  2. Implement the Hungarian algorithm via scipy.optimize.linear_sum_assignment
  3. Integrate YOLO detection with ultralytics
  4. Run SORT on a video — count ID switches as you tune iou_threshold
  5. Extend to DeepSORT: add appearance feature extraction (cosine distance matching)
  6. Evaluate with MOTA/MOTP on MOT17 dataset
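
For step 1, a minimal sanity check of the KalmanFilter class from Section 4.3 on a synthetic constant-velocity sequence (the motion and measurement values are illustrative):

import numpy as np

kf = KalmanFilter(dt=1.0)                       # class defined in Section 4.3
kf.init(np.array([0.0, 100.0, 40.0, 80.0]))     # [x, y, w, h] at frame 0

# Feed detections of a box sliding right at 5 px/frame
for t in range(1, 20):
    kf.predict()
    kf.update(np.array([5.0 * t, 100.0, 40.0, 80.0]))

print(kf.x[0, 0], kf.x[4, 0])                   # x estimate near 95, vx estimate near 5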

Phase 3 — Tier 3: Transformer Tracker (Week 3)

  1. Study the attention mechanism: implement scaled dot-product attention from scratch
  2. Run OSTrack or MixFormer from the official repositories (pretrained weights available)
  3. Analyze the cross-attention maps to understand what the model focuses on
  4. Experiment with template update strategies (fixed template vs. online updating)
  5. Benchmark on VOT2020/OTB100 and compare to Tier 1 and Tier 2

8. Comparison of Approaches

| Criteria | KCF Tracker | SORT | DeepSORT | Transformer Tracker |
|---|---|---|---|---|
| Architecture | Correlation Filter | Detection + KF + Hungarian | Detection + KF + Appearance + Hungarian | Attention + Template Matching |
| Single/Multi-object | Single | Multi | Multi | Single / Multi |
| Speed (FPS) | ~300 | ~100 | ~30-50 | ~30 |
| Accuracy | ★★★☆☆ | ★★★☆☆ | ★★★★☆ | ★★★★★ |
| Occlusion handling | Poor | Fair (IoU only) | Good (appearance) | Excellent |
| Scale variation | Fair | Poor | Poor | Excellent |
| GPU required | No | No | Recommended | Yes |
| Training needed | No (online) | No | Feature extractor | Yes (full model) |
| ID persistence | Low | Medium | High | Very high |
| Best for | Simple scenes, high FPS | Real-time multi-object | Person/vehicle tracking | Complex benchmarks |
| Complexity | ⭐⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |

9. Extensions and Variations

9.1 Siamese Networks

Siamese trackers (SiamFC, SiamRPN, SiamMask) use a two-branch network:

  • Template branch: encodes the initial target
  • Search branch: encodes the current frame
  • Similarity map: computed by cross-correlation in feature space

SiamMask extends this to semantic segmentation of the target for better precision.
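
The similarity map can be sketched in a few lines of PyTorch (tensor shapes here are illustrative): the embedded template acts as a correlation kernel that is slid over the embedded search region.

import torch
import torch.nn.functional as F

C = 256                                        # feature channels (illustrative)
template_feat = torch.randn(1, C, 6, 6)        # embedded template patch
search_feat = torch.randn(1, C, 22, 22)        # embedded search region

# conv2d with the template as kernel = dense sliding-window cross-correlation
score_map = F.conv2d(search_feat, template_feat)   # -> (1, 1, 17, 17)
peak = score_map.flatten().argmax()
print(score_map.shape, peak)                   # the peak locates the target in the search region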

9.2 Fully Convolutional Tracker (FCNT)

FCNT analyzes the activation maps of a CNN (VGG-16) and selects the most discriminative feature layer for tracking, avoiding distraction from background clutter.

9.3 Long-Term Tracking

For long-term tracking with target re-detection after complete occlusion:

  • ECO (Efficient Convolution Operators): uses factorized convolution and training sample selection to reduce drift over time
  • ATOM (Accurate Tracking by Overlap Maximization): learns to estimate bounding box overlap directly using a classification + bounding box refinement pipeline
  • LaSOT dataset: a long-term tracking benchmark with 1,400+ videos

9.4 Real-World Applications

  • Autonomous driving: track pedestrians, vehicles, cyclists across frames
  • Surveillance: multi-camera person re-identification
  • Sports analytics: player and ball tracking in soccer/basketball
  • Medical imaging: cell and organ tracking in microscopy video
  • Robotics: visual servoing with tracked reference points

10. References

  1. Henriques, J.F., et al. (2015). "High-Speed Tracking with Kernelized Correlation Filters." IEEE TPAMI, 37(3), 583–596. — KCF algorithm paper
  2. Bolme, D.S., et al. (2010). "Visual Object Tracking using Adaptive Correlation Filters." CVPR. — MOSSE (predecessor to KCF)
  3. Bewley, A., et al. (2016). "Simple Online and Realtime Tracking." ICIP. — SORT algorithm
  4. Wojke, N., et al. (2017). "Simple Online and Realtime Tracking with a Deep Association Metric." ICIP. — DeepSORT extension
  5. Vaswani, A., et al. (2017). "Attention Is All You Need." NeurIPS. — Transformer architecture
  6. Ye, B., et al. (2022). "Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework." ECCV. — OSTrack
  7. Chen, X., et al. (2021). "Transformer Tracking." CVPR. — TransT architecture
  8. Cui, Y., et al. (2022). "MixFormer: End-to-End Tracking with Iterative Mixed Attention." CVPR. — MixFormer
  9. Bertinetto, L., et al. (2016). "Fully-Convolutional Siamese Networks for Object Tracking." ECCV Workshops. — SiamFC, Siamese tracking
  10. Kristan, M., et al. (2020). "The Eighth Visual Object Tracking VOT2020 Challenge Results." — VOT benchmark
  11. Liang, C., et al. (2015). "Particle Filter-based Visual Tracking: A Survey." — Comprehensive survey
  12. OpenCV Tracking API — docs.opencv.org
  13. SORT Implementation — GitHub: abewley/sort
  14. DeepSORT Implementation — GitHub: nwojke/deep_sort
  15. OSTrack — GitHub: botaoye/OSTrack
  16. MOT17 / MOT20 Benchmarks — motchallenge.net