Visual Object Tracking (视觉目标跟踪)¶
Project Type: Perception | Difficulty: ★★☆☆☆ to ★★★★★ | Estimated Time: 1–3 weekends
1. Project Overview¶
Visual Object Tracking (VOT) is the task of estimating the trajectory of a target object across video frames, given only its initial bounding box in the first frame. Unlike object detection, tracking must maintain object identity over time and handle occlusion, deformation, and motion blur.
┌────────────────────────────────────────────────────────────┐
│ Tracking Pipeline │
│ │
│ Frame t-1 Frame t Frame t+1 │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ ▓▓▓▓▓ │──────►│ ▓▓▓▓▓ │─────►│ ▓▓▓▓▓ │ │
│ │ ▓▓▓▓▓ │ │ ▓▓▓▓▓ │ │ ▓▓▓▓▓ │ │
│ │ ID: 1 │ │ ID: 1 │ │ ID: 1 │ │
│ └────────┘ └────────┘ └────────┘ │
│ │ │ │ │
│ └───────────────┴───────────────┘ │
│ Track ID: 1 │
│ ┌──────────────────────────────────────────────┐ │
│ │ Detect → Predict → Associate → Update │ │
│ └──────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────┘
In this project you will implement three progressively more sophisticated tiers:
| Tier | Approach | Core Method | Speed | Accuracy |
|---|---|---|---|---|
| Traditional | KCF Tracker | Kernelized Correlation Filter + HOG | ~300 FPS | ★★★☆☆ |
| Intermediate | SORT / DeepSORT | Detection + Kalman Filter + Hungarian | ~100 FPS | ★★★★☆ |
| Modern | Transformer Tracker | Attention + Template Matching | ~30 FPS | ★★★★★ |
2. Hardware & Software Requirements¶
Hardware¶
| Component | Specification | Notes |
|---|---|---|
| Camera | USB webcam or Pi Camera | 640×480 minimum, 30 FPS |
| (Optional) GPU | NVIDIA GPU with ≥ 4 GB VRAM | For DeepSORT and transformer trackers |
| Robot / Test scene | Any moving object or person | For real-world testing |
Software¶
| Package | Version | Purpose |
|---|---|---|
| Python | ≥ 3.8 | Core language |
| OpenCV | ≥ 4.5 | Image processing & tracking |
| NumPy | ≥ 1.20 | Numerical computation |
| filterpy | ≥ 1.4 | Kalman filter implementation |
| scipy | ≥ 1.7 | Linear algebra, Hungarian algorithm |
| torch | ≥ 1.10 | Deep learning (DeepSORT, transformers) |
| torchvision | ≥ 0.11 | Pre-trained CNN models |
| ultralytics | ≥ 8.0 | YOLO detection for SORT/DeepSORT |
3. Tier 1 — Traditional: KCF Tracker¶
3.1 Concept¶
The Kernelized Correlation Filter (KCF) tracker frames object tracking as a ridge regression problem in the Fourier domain. By exploiting the circulant structure of image patches, KCF achieves exceptional speed while maintaining competitive accuracy.
Key ideas:

- Circulant matrices: Translations of an image patch generate a circulant matrix, and every circulant matrix is diagonalized by the Discrete Fourier Transform: its eigenvalues are the DFT of its first row (the circulant diagonalization theorem).
- Kernel trick: Map features to a high-dimensional space using a kernel function (e.g., Gaussian RBF), enabling non-linear boundary estimation without explicitly computing the high-dimensional representation.
- Fast detection: In the Fourier domain, solving the kernel ridge regression reduces to element-wise division, making detection extremely fast.
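The circulant property above can be verified numerically. A minimal NumPy/SciPy check (illustrative, not part of the tracker) confirms that a Fourier basis vector is an eigenvector of a circulant matrix, with the corresponding DFT coefficient of the generating vector as its eigenvalue:

```python
import numpy as np
from scipy.linalg import circulant

c = np.array([1.0, 2.0, 3.0, 4.0])
C = circulant(c)                      # 4x4 matrix; each column is a cyclic shift of c
n = len(c)
m = 1                                 # pick one Fourier frequency
v = np.exp(2j * np.pi * m * np.arange(n) / n)   # m-th DFT basis vector
lam = np.fft.fft(c)[m]                          # m-th DFT coefficient of c
print(np.allclose(C @ v, lam * v))    # True: v is an eigenvector with eigenvalue lam
```

This is why KCF never materializes the circulant data matrix: every operation on it can be done on the DFT of a single row.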
3.2 Mathematical Formulation¶
Training: Given the base patch \(\mathbf{x}\) and its cyclic shifts \(\mathbf{x}_i\), solve the ridge regression

\[
\min_{\mathbf{w}} \sum_i \left( f(\mathbf{x}_i) - y_i \right)^2 + \lambda \|\mathbf{w}\|^2
\]

where \(y_i\) is the Gaussian-shaped regression target (a 2D Gaussian centered at the target).

Solution in the Fourier domain (using the circulant structure of the data matrix \(\mathbf{C}(\mathbf{x})\)):

\[
\hat{\mathbf{w}} = \frac{\hat{\mathbf{x}}^* \odot \hat{\mathbf{y}}}{\hat{\mathbf{x}}^* \odot \hat{\mathbf{x}} + \lambda}
\]

where \(\hat{}\) denotes the DFT, \(\odot\) is element-wise multiplication (the division is also element-wise), and \(^*\) is the complex conjugate.

Kernel mapping: Using a kernel \(\kappa(\mathbf{x}, \mathbf{x}')\), the classifier becomes

\[
f(\mathbf{z}) = \sum_i \alpha_i \, \kappa(\mathbf{z}, \mathbf{x}_i), \qquad \hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{x}\mathbf{x}} + \lambda}
\]

where \(\alpha_i\) are the learned dual weights and \(\mathbf{k}^{\mathbf{x}\mathbf{x}}\) is the kernel autocorrelation of \(\mathbf{x}\).

Detection: For a new patch \(\mathbf{z}\), compute the correlation response

\[
\hat{\mathbf{f}}(\mathbf{z}) = \hat{\mathbf{k}}^{\mathbf{x}\mathbf{z}} \odot \hat{\boldsymbol{\alpha}}
\]

The peak of the (inverse-transformed) response map indicates the new target center.
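As a sanity check of the formulation above, a 1D sketch (illustrative values) trains a correlation filter with the linear kernel and then detects a cyclically shifted copy of the training signal; the response peak moves by exactly the applied shift:

```python
import numpy as np

# Train a linear correlation filter on a 1D signal, then detect a shifted copy.
rng = np.random.default_rng(0)
n = 64
x = rng.standard_normal(n)                             # "training patch"
y = np.exp(-0.5 * ((np.arange(n) - n // 2) / 2.0)**2)  # Gaussian target, peak at 32
lam = 1e-4

x_hat = np.fft.fft(x)
# Dual solution in the Fourier domain (linear kernel)
alpha_hat = np.fft.fft(y) * np.conj(x_hat) / (np.abs(x_hat)**2 + lam)

z = np.roll(x, 5)                                      # target shifted by 5 samples
response = np.real(np.fft.ifft(np.fft.fft(z) * alpha_hat))
print(int(np.argmax(response)))                        # peak moves from 32 to 37
```

The same mechanics carry over to 2D with `np.fft.fft2`, which is what the tracker below does.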
3.3 Complete Python Code — KCF Tracker¶
"""
Tier 1: Kernelized Correlation Filter (KCF) Tracker
====================================================
A from-scratch implementation using HOG features + Gaussian kernel ridge regression.
For production use, consider OpenCV's cv2.TrackerKCF_create().
Features:
- HOG (Histogram of Oriented Gradients) feature extraction
- Gaussian kernel in HOG feature space
- Circulant matrix theory for fast training/detection via FFT
"""
import numpy as np
import cv2
import time
# ─── HOG Feature Extractor ────────────────────────────────────────
def compute_hog_features(gray: np.ndarray, cell_size: int = 4, num_bins: int = 9) -> np.ndarray:
"""
Compute HOG features from a grayscale image.
Parameters
----------
gray : 2D array (H x W)
Grayscale image (already cropped to the patch region)
cell_size : int
Size of each HOG cell in pixels
num_bins : int
Number of orientation bins
Returns
-------
features : 1D array
Flattened HOG feature vector
"""
h, w = gray.shape
n_cells_x = w // cell_size
n_cells_y = h // cell_size
# 1. Compute image gradients
gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=1)
gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=1)
magnitude = np.sqrt(gx**2 + gy**2)
orientation = np.arctan2(gy, gx) * (180 / np.pi) % 180 # [0, 180)
# 2. Build histogram per cell
cell_histograms = np.zeros((n_cells_y, n_cells_x, num_bins), dtype=np.float32)
bin_width = 180 / num_bins
for cy in range(n_cells_y):
for cx in range(n_cells_x):
y0, y1 = cy * cell_size, (cy + 1) * cell_size
x0, x1 = cx * cell_size, (cx + 1) * cell_size
mag_patch = magnitude[y0:y1, x0:x1].flatten()
ori_patch = orientation[y0:y1, x0:x1].flatten()
hist = np.zeros(num_bins, dtype=np.float32)
for m, o in zip(mag_patch, ori_patch):
bin_idx = int(o / bin_width) % num_bins
hist[bin_idx] += m
cell_histograms[cy, cx] = hist
# 3. Block normalization (L2-norm)
eps = 1e-5
features = []
for by in range(n_cells_y - 1):
for bx in range(n_cells_x - 1):
block = cell_histograms[by:by+2, bx:bx+2].flatten()
norm = np.sqrt(np.sum(block**2) + eps)
features.extend(block / norm)
return np.array(features, dtype=np.float32)
def gaussian_response_map(size: tuple, sigma: float = 2.0) -> np.ndarray:
"""
Generate a 2D Gaussian response map as regression target.
Parameters
----------
size : (H, W)
Size of the response map
sigma : float
Standard deviation of the Gaussian
Returns
-------
response : 2D array
Gaussian-shaped target map
"""
h, w = size
cy, cx = h // 2, w // 2
y, x = np.ogrid[:h, :w]
response = np.exp(-((x - cx)**2 + (y - cy)**2) / (2 * sigma**2))
return response.astype(np.float32)
def gaussian_correlation(x1: np.ndarray, x2: np.ndarray, sigma: float = 0.5) -> np.ndarray:
    """
    Gaussian kernel correlation of all cyclic shifts of two feature maps,
    computed in the Fourier domain (Henriques et al., "High-Speed Tracking
    with Kernelized Correlation Filters"). This is the k^{xx'} map needed
    for the full kernelized formulation; the simplified tracker below uses
    linear correlation instead.

    Parameters
    ----------
    x1, x2 : 2D arrays (H x W)
        Feature maps of identical size
    sigma : float
        Kernel bandwidth

    Returns
    -------
    k : 2D array (H x W)
        Kernel correlation map
    """
    # Cross-correlation of all cyclic shifts in one FFT pass
    c = np.real(np.fft.ifft2(np.fft.fft2(x1) * np.conj(np.fft.fft2(x2))))
    # ||x1 - shifted(x2)||^2 = ||x1||^2 + ||x2||^2 - 2 * correlation
    d = np.sum(x1**2) + np.sum(x2**2) - 2 * c
    return np.exp(-np.maximum(d, 0) / (sigma**2 * x1.size))
class KCFTracker:
    """
    Correlation filter tracker on windowed grayscale patches.

    Note: this simplified implementation uses a MOSSE-style *linear*
    correlation filter. The HOG vector produced by compute_hog_features()
    has a different length than the patch area, so it cannot be reshaped
    into a 2D map for the FFT; extending this class to the full kernelized
    (KCF) formulation requires a per-cell 2D feature map plus a kernel
    correlation such as the Gaussian kernel helper above.
    """
    def __init__(self, patch_size: int = 64, lambda_reg: float = 1e-4,
                 sigma: float = 0.5, interp_factor: float = 0.075):
        self.patch_size = patch_size        # Size of the search / training patch
        self.lambda_reg = lambda_reg        # Regularization parameter
        self.sigma = sigma                  # Kernel bandwidth (unused in the linear variant)
        self.interp_factor = interp_factor  # Model update rate
        self.model_alphas = None            # FFT of the correlation filter
        self.model_x = None                 # Windowed training patch
        self.window = None                  # Hann window (suppresses FFT boundary effects)
        self.target_pos = None              # (x, y) in image coordinates
        self.target_sz = None               # (width, height)

    def init(self, frame: np.ndarray, bbox: tuple) -> None:
        """
        Initialize tracker with first frame and bounding box.

        Parameters
        ----------
        frame : BGR image
        bbox : (x, y, w, h) — top-left corner + dimensions
        """
        x, y, w, h = bbox
        cx, cy = x + w / 2, y + h / 2
        self.target_pos = (float(cx), float(cy))
        self.target_sz = (float(w), float(h))
        # Extract and resize patch
        patch = self._extract_patch(frame, self.target_pos, self.target_sz, self.patch_size)
        gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
        # Hann window reduces the periodic-boundary artifacts of the FFT
        self.window = np.outer(np.hanning(gray.shape[0]),
                               np.hanning(gray.shape[1])).astype(np.float32)
        self.model_x = gray * self.window
        # Generate regression target (Gaussian label)
        y_target = gaussian_response_map(gray.shape, sigma=2.0)
        # Train the filter in the Fourier domain:
        # alpha_hat = (y_hat . conj(x_hat)) / (x_hat . conj(x_hat) + lambda)
        x_fft = np.fft.fft2(self.model_x)
        x_power = np.abs(x_fft)**2
        self.model_alphas = np.fft.fft2(y_target) * np.conj(x_fft) / (x_power + self.lambda_reg)

    def update(self, frame: np.ndarray) -> tuple:
        """
        Track the object in the current frame.

        Returns
        -------
        bbox : (x, y, w, h)
        """
        if self.target_pos is None:
            raise ValueError("Tracker not initialized. Call init() first.")
        # Extract search patch (slightly larger than target)
        search_sz = (self.target_sz[0] * 1.5, self.target_sz[1] * 1.5)
        patch = self._extract_patch(frame, self.target_pos, search_sz, self.patch_size)
        gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
        feat = gray * self.window
        # Response map via FFT: response = IFFT(z_hat . alpha_hat)
        feat_fft = np.fft.fft2(feat)
        response = np.real(np.fft.ifft2(feat_fft * self.model_alphas))
        # Find peak (maximum response)
        peak_idx = np.unravel_index(np.argmax(response), response.shape)
        dy = peak_idx[0] - response.shape[0] // 2
        dx = peak_idx[1] - response.shape[1] // 2
        # Scale displacement back to image coordinates
        scale_x = search_sz[0] / self.patch_size
        scale_y = search_sz[1] / self.patch_size
        # Update target position
        self.target_pos = (self.target_pos[0] + dx * scale_x,
                           self.target_pos[1] + dy * scale_y)
        # Online model update (linear interpolation)
        patch_new = self._extract_patch(frame, self.target_pos, self.target_sz, self.patch_size)
        gray_new = cv2.cvtColor(patch_new, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
        feat_new = gray_new * self.window
        self.model_x = (1 - self.interp_factor) * self.model_x + self.interp_factor * feat_new
        x_fft_new = np.fft.fft2(self.model_x)
        x_power_new = np.abs(x_fft_new)**2
        alpha_new = (np.fft.fft2(gaussian_response_map(gray_new.shape)) * np.conj(x_fft_new)
                     / (x_power_new + self.lambda_reg))
        self.model_alphas = (1 - self.interp_factor) * self.model_alphas + self.interp_factor * alpha_new
        return self._bbox_from_center(self.target_pos, self.target_sz)

    def _extract_patch(self, frame: np.ndarray, center: tuple,
                       size: tuple, patch_size: int) -> np.ndarray:
        """Extract and resize a patch centered at `center`, clamped to the frame."""
        x0 = max(0, int(center[0] - size[0] / 2))
        y0 = max(0, int(center[1] - size[1] / 2))
        x1 = min(frame.shape[1], x0 + int(size[0]))
        y1 = min(frame.shape[0], y0 + int(size[1]))
        patch = frame[y0:y1, x0:x1]
        if patch.size == 0:
            return np.zeros((patch_size, patch_size, 3), dtype=np.uint8)
        return cv2.resize(patch, (patch_size, patch_size))

    def _bbox_from_center(self, center: tuple, size: tuple) -> tuple:
        """Convert center (cx, cy) + size (w, h) to (x, y, w, h) top-left format."""
        return (int(center[0] - size[0] / 2),
                int(center[1] - size[1] / 2),
                int(size[0]),
                int(size[1]))

KCFTacker = KCFTracker  # Alias so existing code using the original (misspelled) name keeps working
def demo_kcf_tracker(video_source: int = 0, initial_bbox: tuple = None):
"""
Demo the KCF tracker on a live camera feed.
Draw a bounding box with your mouse to select the object.
"""
cap = cv2.VideoCapture(video_source)
if not cap.isOpened():
print("[ERROR] Cannot open camera")
return
# Collect first frame and let user select bbox
ret, frame = cap.read()
if not ret:
print("[ERROR] Cannot read first frame")
return
if initial_bbox is None:
print("[INFO] Draw a bounding box on the image, then press ENTER or SPACE")
bbox = cv2.selectROI("Select Object", frame, fromCenter=False, showCrosshair=True)
cv2.destroyWindow("Select Object")
else:
bbox = initial_bbox
# Initialize tracker
tracker = KCFTacker()
tracker.init(frame, bbox)
print("[INFO] KCF Tracking started. Press 'q' to quit.")
fps_counter, last_time = 0, time.time()
while True:
ret, frame = cap.read()
if not ret:
break
t0 = time.time()
tracked_bbox = tracker.update(frame)
elapsed = time.time() - t0
fps = 1.0 / elapsed if elapsed > 0 else 0
fps_counter += 1
# Draw bounding box
x, y, w, h = tracked_bbox
cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.putText(frame, f"KCF FPS: {fps:.1f}", (10, 30),
cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 255), 2)
cv2.imshow("KCF Tracker", frame)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
cap.release()
cv2.destroyAllWindows()
print(f"[INFO] Average FPS: {fps_counter / max(1, time.time() - last_time):.1f}")
if __name__ == "__main__":
# Example: track in first detected motion or manually specify bbox
demo_kcf_tracker(video_source=0, initial_bbox=None)
3.4 OpenCV Built-in KCF¶
OpenCV ships a ready-made KCF implementation in the contrib tracking module (install opencv-contrib-python if cv2.TrackerKCF_create is missing):

import cv2

tracker = cv2.TrackerKCF_create()
# Or for simpler testing (MOSSE; moved to the legacy namespace in OpenCV >= 4.5.1):
# tracker = cv2.legacy.TrackerMOSSE_create()  # Even faster (~1000 FPS)

cap = cv2.VideoCapture(0)
ret, frame = cap.read()
bbox = cv2.selectROI("Select", frame, fromCenter=False)
tracker.init(frame, bbox)
while True:
ret, frame = cap.read()
if not ret:
break
success, bbox = tracker.update(frame)
if success:
x, y, w, h = [int(v) for v in bbox]
cv2.rectangle(frame, (x, y), (x+w, y+h), (0, 255, 0), 2)
cv2.imshow("KCF", frame)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
4. Tier 2 — Intermediate: SORT & DeepSORT¶
4.1 Concept¶
SORT (Simple Online and Realtime Tracking) and DeepSORT (SORT with a deep association metric) are detection-based multi-object trackers. Instead of searching for the object directly, they:
- Detect objects in every frame using an object detector (e.g., YOLO)
- Predict where each track is expected to be in the current frame (using a Kalman filter)
- Associate detections to tracks using an assignment algorithm (Hungarian algorithm)
- Update the Kalman filter with matched detections
┌─────────────────────────────────────────────────────────────┐
│ SORT / DeepSORT Pipeline │
│ │
│ Frame t-1 ──► YOLO Detection ──► ┌─────────────┐ │
│ │ Kalman │ │
│ Frame t ────► YOLO Detection ──►│ Filter │ │
│ │ Prediction │ │
│ Frame t+1 ──► YOLO Detection ──► │ │ │
│ │ Hungarian │ │
│ │ Assignment │ │
│ └──────┬──────┘ │
│ Tracks (ID, bbox, age) ◄──────────────┘ │
└─────────────────────────────────────────────────────────────┘
4.2 Mathematical Foundation¶
4.2.1 Kalman Filter¶
For 2D tracking, we model each object as having position \((x, y)\), size \((w, h)\), and velocity \((\dot{x}, \dot{y})\). The state vector is:

\[
\mathbf{x} = \begin{bmatrix} x & y & w & h & \dot{x} & \dot{y} \end{bmatrix}^\top
\]

State transition model (constant velocity):

\[
\mathbf{x}_t = \mathbf{F} \mathbf{x}_{t-1} + \mathbf{w}_t, \qquad \mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{Q})
\]

where:

\[
\mathbf{F} = \begin{bmatrix}
1 & 0 & 0 & 0 & \Delta t & 0 \\
0 & 1 & 0 & 0 & 0 & \Delta t \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\]

Measurement model (we only observe position and size):

\[
\mathbf{z}_t = \mathbf{H} \mathbf{x}_t + \mathbf{v}_t, \qquad \mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{R})
\]

where \(\mathbf{H} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{bmatrix}\).
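The predict/update cycle can be demonstrated in a few lines of NumPy before wrapping it in a class. This sketch tracks a box sliding at +2 px/frame; the noise covariances mirror the illustrative values used in the SORT code later in this section:

```python
import numpy as np

# Constant-velocity Kalman filter demo: a 20x40 box slides +2 px/frame in x.
dt = 1.0
F = np.eye(6); F[0, 4] = dt; F[1, 5] = dt      # state transition (constant velocity)
H = np.zeros((4, 6)); H[:, :4] = np.eye(4)     # observe [x, y, w, h] only
Q = np.eye(6) * 0.01                           # process noise
R = np.eye(4) * 0.1                            # measurement noise
x = np.array([0.0, 50.0, 20.0, 40.0, 0.0, 0.0]).reshape(6, 1)
P = np.eye(6)

for t in range(1, 50):
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the (noiseless) measurement of the moving box
    z = np.array([2.0 * t, 50.0, 20.0, 40.0]).reshape(4, 1)
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(6) - K @ H) @ P

print(round(float(x[4, 0]), 2))  # estimated x-velocity, converges toward 2.0
```

Note that velocity is never measured directly; the filter infers it from the sequence of position measurements.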
4.2.2 Hungarian Algorithm¶
The Hungarian algorithm (Kuhn-Munkres) solves the minimum-cost assignment problem between \(n\) detections and \(m\) tracks:

\[
\min_{a} \sum_{i=1}^{n} \sum_{j=1}^{m} c_{ij} \, a_{ij}, \qquad a_{ij} \in \{0, 1\}
\]

subject to: each detection assigned to at most one track and vice versa.

IoU-based cost matrix for SORT:

\[
c_{ij} = 1 - \text{IoU}(d_i, t_j)
\]

DeepSORT adds an appearance descriptor (cosine distance) to the cost matrix:

\[
c_{ij} = \lambda \left( 1 - \text{IoU}(d_i, t_j) \right) + (1 - \lambda) \left( 1 - s_{ij} \right)
\]

where \(s_{ij}\) is the cosine similarity between deep appearance features of detection \(d_i\) and track \(t_j\), and \(\lambda\) weights the two terms.
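The assignment step itself is a single call to scipy.optimize.linear_sum_assignment. A toy example with two detections and two tracks shows the combined cost in action (the \(\lambda\) value and matrices here are illustrative, not tuned):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

iou = np.array([[0.8, 0.1],
                [0.2, 0.7]])       # IoU(detection i, track j)
cos_sim = np.array([[0.9, 0.3],
                    [0.2, 0.95]])  # appearance cosine similarity
lam = 0.5
cost = lam * (1 - iou) + (1 - lam) * (1 - cos_sim)
rows, cols = linear_sum_assignment(cost)   # minimize total assignment cost
print(list(zip(rows.tolist(), cols.tolist())))  # [(0, 0), (1, 1)]
```

Matches whose cost exceeds a gate (e.g., IoU below `iou_threshold`) are discarded afterwards, exactly as the `_associate` method below does.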
4.3 Complete Python Code — SORT Tracker¶
"""
Tier 2: SORT Tracker (Simple Online and Realtime Tracking)
==========================================================
Detection-based multi-object tracking with Kalman filter + Hungarian assignment.
Uses YOLO for detection (via ultralytics) and IoU for association.
Modules:
1. KalmanFilter — constant-velocity model in 2D image space
2. Track — manages individual track state and history
3. Sort — manages all tracks and performs assignment
"""
import numpy as np
import cv2
from scipy.optimize import linear_sum_assignment
from collections import deque
# ─── Kalman Filter ────────────────────────────────────────────────
class KalmanFilter:
"""
Constant-velocity Kalman filter for 2D bounding box tracking.
State: [x, y, w, h, vx, vy]
Measurement: [x, y, w, h]
"""
def __init__(self, dt: float = 1.0):
# State transition matrix (constant velocity model)
self.F = np.array([
[1, 0, 0, 0, dt, 0],
[0, 1, 0, 0, 0, dt],
[0, 0, 1, 0, 0, 0],
[0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 1],
], dtype=np.float32)
# Measurement matrix (observe x, y, w, h)
self.H = np.array([
[1, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0],
[0, 0, 0, 1, 0, 0],
], dtype=np.float32)
# Process noise covariance
self.Q = np.eye(6, dtype=np.float32) * 0.01
# Measurement noise covariance
self.R = np.eye(4, dtype=np.float32) * 0.1
# State and covariance
self.x = np.zeros((6, 1), dtype=np.float32) # State estimate
self.P = np.eye(6, dtype=np.float32) # Error covariance
def init(self, measurement: np.ndarray) -> None:
"""Initialize filter with first measurement [x, y, w, h]."""
self.x[:4] = measurement.reshape(4, 1)
self.x[4:] = 0.0 # velocity = 0
    def predict(self) -> np.ndarray:
"""Predict next state and return projected measurement."""
self.x = self.F @ self.x
self.P = self.F @ self.P @ self.F.T + self.Q
# Project to measurement space
z_pred = self.H @ self.x
return z_pred.flatten()
    def update(self, measurement: np.ndarray) -> np.ndarray:
        """Update state with a new measurement and return the corrected [x, y, w, h]."""
        z = measurement.reshape(4, 1).astype(np.float32)
        # Innovation (measurement residual)
        y = z - self.H @ self.x
        # Kalman gain
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        # Update state and covariance
        self.x = self.x + K @ y
        self.P = (np.eye(6, dtype=np.float32) - K @ self.H) @ self.P
        # Return the *corrected* state projected to measurement space
        return (self.H @ self.x).flatten()
# ─── Track ────────────────────────────────────────────────────────
class Track:
"""A single track (object trajectory)."""
_next_id = 1
def __init__(self, bbox: np.ndarray, frame_id: int):
self.id = Track._next_id
Track._next_id += 1
self.bbox = bbox # [x, y, w, h]
self.age = 1 # Frames since first detection
self.hits = 1 # Total matched detections
self.time_since_update = 0
# Kalman filter for prediction
self.kf = KalmanFilter()
self.kf.init(bbox)
# Track history (for visualization)
self.trace = deque(maxlen=30)
def predict(self) -> np.ndarray:
"""Predict next bounding box using Kalman filter."""
self.bbox = self.kf.predict()
self.time_since_update += 1
return self.bbox
def update(self, bbox: np.ndarray) -> None:
"""Update track with new detection."""
self.bbox = self.kf.update(bbox)
self.hits += 1
self.time_since_update = 0
def state(self) -> str:
"""Return track state: 'confirmed' if hits >= 3, else 'tentative'."""
return 'confirmed' if self.hits >= 3 else 'tentative'
# ─── IoU Cost Matrix ──────────────────────────────────────────────
def compute_iou(boxA: np.ndarray, boxB: np.ndarray) -> float:
"""
Compute Intersection over Union between two bounding boxes.
Parameters
----------
boxA, boxB : [x, y, w, h]
Returns
-------
iou : float in [0, 1]
"""
xA, yA, wA, hA = boxA
xB, yB, wB, hB = boxB
xi1 = max(xA, xB)
yi1 = max(yA, yB)
xi2 = min(xA + wA, xB + wB)
yi2 = min(yA + hA, yB + hB)
inter_area = max(0, xi2 - xi1) * max(0, yi2 - yi1)
union_area = wA * hA + wB * hB - inter_area
return inter_area / (union_area + 1e-6)
def compute_iou_matrix(detections: list, tracks: list) -> np.ndarray:
"""Build IoU cost matrix: rows=detections, cols=tracks."""
n, m = len(detections), len(tracks)
matrix = np.zeros((n, m))
for i, det in enumerate(detections):
for j, trk in enumerate(tracks):
matrix[i, j] = 1.0 - compute_iou(det, trk) # Cost = 1 - IoU
return matrix
# ─── SORT Tracker ─────────────────────────────────────────────────
class SORT:
"""
SORT multi-object tracker.
Parameters
----------
max_age : int
Maximum frames without update before track is removed
min_hits : int
Minimum detections before track is confirmed
iou_threshold : float
Minimum IoU for a match (0.3 typical)
"""
def __init__(self, max_age: int = 30, min_hits: int = 3, iou_threshold: float = 0.3):
self.max_age = max_age
self.min_hits = min_hits
self.iou_threshold = iou_threshold
self.tracks: list[Track] = []
self.frame_count = 0
def update(self, detections: np.ndarray, frame_id: int) -> np.ndarray:
"""
Update tracker with detections from the current frame.
Parameters
----------
detections : (N, 4) array — [x, y, w, h] per detection
Returns
-------
tracked_bboxes : (M, 5) array — [x, y, w, h, track_id] for confirmed tracks
"""
self.frame_count += 1
# ── 1. Predict step: advance all tracks ──────────────────
for track in self.tracks:
track.predict()
        # ── 2. Hungarian assignment ──────────────────────────────
        if len(detections) == 0:
            matched, unmatched_dets = [], []
        else:
            matched, unmatched_dets, _ = self._associate(detections)
        # Update matched tracks
        for d_idx, t_idx in matched:
            self.tracks[t_idx].update(detections[d_idx])
        # Create new tracks for unmatched detections
        for d_idx in unmatched_dets:
            self.tracks.append(Track(detections[d_idx], frame_id))
        # ── 3. Remove stale tracks ───────────────────────────────
        # Note: min_hits gates *output* (via Track.state), not survival;
        # pruning on hits here would delete every new track immediately.
        self.tracks = [t for t in self.tracks if t.time_since_update <= self.max_age]
# ── 4. Output confirmed tracks ────────────────────────────
result = []
for track in self.tracks:
if track.state() == 'confirmed':
x, y, w, h = [int(v) for v in track.bbox]
result.append([x, y, w, h, track.id])
track.trace.append((x + w//2, y + h//2))
return np.array(result, dtype=np.int32) if result else np.empty((0, 5))
def _associate(self, detections: np.ndarray) -> tuple:
"""Match detections to tracks using Hungarian algorithm on IoU cost."""
if len(self.tracks) == 0:
return [], list(range(len(detections))), []
cost_matrix = compute_iou_matrix(detections, [t.bbox for t in self.tracks])
# Hungarian algorithm (minimize cost)
row_idx, col_idx = linear_sum_assignment(cost_matrix)
matched = []
unmatched_dets = list(range(len(detections)))
unmatched_trks = list(range(len(self.tracks)))
for r, c in zip(row_idx, col_idx):
if cost_matrix[r, c] < (1.0 - self.iou_threshold):
matched.append((r, c))
if r in unmatched_dets:
unmatched_dets.remove(r)
if c in unmatched_trks:
unmatched_trks.remove(c)
return matched, unmatched_dets, unmatched_trks
# ─── Complete SORT Demo with YOLO ────────────────────────────────
def run_sort_with_yolo(video_source: int = 0, confidence: float = 0.3):
"""
Run SORT multi-object tracker with YOLO detection.
Requires: pip install ultralytics
"""
try:
from ultralytics import YOLO
except ImportError:
print("[ERROR] ultralytics not installed. Run: pip install ultralytics")
return
# Load YOLO model (YOLOv8n = nano, fastest)
print("[INFO] Loading YOLOv8n...")
model = YOLO("yolov8n.pt")
cap = cv2.VideoCapture(video_source)
if not cap.isOpened():
print("[ERROR] Cannot open camera")
return
tracker = SORT(max_age=30, min_hits=3, iou_threshold=0.3)
frame_id = 0
# COCO class names for person/vehicle tracking
CLASSES = ['person', 'bicycle', 'car', 'motorcycle', 'bus', 'truck']
print("[INFO] SORT+DeepSORT tracking started. Press 'q' to quit.")
while True:
ret, frame = cap.read()
if not ret:
break
frame_id += 1
# ── Detect ──
results = model(frame, verbose=False)[0]
detections = []
for box in results.boxes:
cls_id = int(box.cls[0])
conf = float(box.conf[0])
if conf < confidence:
continue
if results.names[cls_id] not in CLASSES:
continue
x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
x, y = int(x1), int(y1)
w, h = int(x2 - x1), int(y2 - y1)
detections.append(np.array([x, y, w, h], dtype=np.float32))
detections = np.array(detections) if detections else np.empty((0, 4))
# ── Track ──
tracked = tracker.update(detections, frame_id)
# ── Visualize ──
for x, y, w, h, track_id in tracked:
cv2.rectangle(frame, (x, y), (x+w, y+h), (0, 255, 0), 2)
cv2.putText(frame, f"ID:{track_id}", (x, y - 10),
cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
cv2.putText(frame, f"Frame: {frame_id} Tracks: {len(tracked)}",
(10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 255), 2)
cv2.imshow("SORT + YOLO", frame)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
cap.release()
cv2.destroyAllWindows()
if __name__ == "__main__":
run_sort_with_yolo(video_source=0)
4.4 DeepSORT — Appearance Descriptor Extension¶
DeepSORT extends SORT by adding a deep appearance descriptor (ReID network) to handle occlusions where IoU matching fails:
"""
DeepSORT Extension — Appearance Feature Extraction
=================================================
Uses a pre-trained ReID network to extract appearance descriptors
for cosine-distance matching alongside IoU.
"""
import torch
import torch.nn as nn
import numpy as np
class EmbeddingNetwork(nn.Module):
"""
Simple CNN embedding network for ReID.
Outputs a 128-dim unit-normalized feature vector.
For production, use OSNet or ResNet-50 pretrained on Market-1501.
"""
    def __init__(self, embedding_dim: int = 128):  # input size is handled by adaptive pooling
super().__init__()
self.conv = nn.Sequential(
nn.Conv2d(3, 32, 3, stride=2, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.Conv2d(32, 64, 3, stride=2, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.Conv2d(64, 128, 3, stride=2, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(),
nn.AdaptiveAvgPool2d((1, 1)),
)
self.fc = nn.Linear(128, embedding_dim)
self.bn = nn.BatchNorm1d(embedding_dim)
def forward(self, x):
x = self.conv(x)
x = x.view(x.size(0), -1)
x = self.fc(x)
x = self.bn(x)
# L2 normalize for cosine similarity
x = nn.functional.normalize(x, p=2, dim=1)
return x
def cosine_distance(feat1: np.ndarray, feat2: np.ndarray) -> float:
"""Cosine distance between two feature vectors."""
return 1.0 - np.dot(feat1, feat2) / (np.linalg.norm(feat1) * np.linalg.norm(feat2) + 1e-6)
# DeepSORT combines the appearance term with the IoU/motion term:
#     cost = lambda * (1 - IoU) + (1 - lambda) * (1 - cosine_sim)
# In this convention a *small* lambda favors appearance; the DeepSORT paper
# reports that the appearance term can dominate (lambda near 0) when the
# camera itself moves.
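How such descriptors are used in practice: each track keeps a small gallery of recent descriptors, and a detection is appearance-matched if its minimum cosine distance to the gallery falls below a threshold. The sketch below illustrates this gating with random unit vectors (gallery size, noise level, and thresholds are illustrative assumptions, not DeepSORT's exact values):

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(v: np.ndarray) -> np.ndarray:
    return v / (np.linalg.norm(v) + 1e-6)

# Track gallery: 5 past descriptors of one object
gallery = [normalize(rng.standard_normal(128)) for _ in range(5)]
same = normalize(gallery[0] + 0.02 * rng.standard_normal(128))  # same object, slight change
other = normalize(rng.standard_normal(128))                     # unrelated object

def min_cosine_distance(feat, gallery):
    return min(1.0 - float(np.dot(feat, g)) for g in gallery)

print(min_cosine_distance(same, gallery) < 0.2)   # True: close to a gallery entry
print(min_cosine_distance(other, gallery) > 0.5)  # True: random 128-d vectors are near-orthogonal
```

The gallery is what lets DeepSORT re-identify an object after an occlusion, when the Kalman prediction has drifted and IoU matching alone fails.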
5. Tier 3 — Modern: Transformer-Based Tracking¶
5.1 Concept¶
Transformer trackers replace hand-crafted features and correlation filters with attention mechanisms that learn global context. The dominant paradigm is template-matching: a template (initial target) and a search region (current frame) are tokenized and processed by a transformer encoder-decoder.
┌─────────────────────────────────────────────────────────────────┐
│ Transformer Tracker Architecture │
│ │
│ Frame t-1 (Template) Frame t (Search) │
│ ┌─────────────────┐ ┌──────────────────┐ │
│ │ BBox: (x,y,w,h)│ │ Full Frame │ │
│ └────────┬────────┘ └────────┬─────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌──────────────────┐ │
│ │ Template Patch │ │ Search Region │ │
│ │ (cropped + resize)│ │ (3x template sz) │ │
│ └────────┬────────┘ └────────┬─────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Token Embedding │ │
│ │ [CLS] t1 t2 ... tk [SEP] s1 s2 ... sm │ │
│ └────────┬────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Transformer Encoder (SA + CA) │ │
│ │ Self-attention: template-template │ │
│ │ Cross-attention: template ↔ search │ │
│ └────────┬────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Transformer Decoder │ │
│ │ Query = learnable [QUERY] token │ │
│ │ Output = bounding box regression │ │
│ └──────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ BBox: (x, y, w, h) + Confidence │
└─────────────────────────────────────────────────────────────────┘
Key innovations:

- Cross-attention: The template tokens attend to search tokens, learning which search features correspond to the template. This enables handling appearance changes (deformation, occlusion) that correlation filters cannot handle.
- Global context: Unlike KCF's local correlation, transformers capture long-range dependencies between all pixels in the search region.
- Template updating: Most modern trackers (OSTrack, MixFormer) update the template online using features from recent frames, enabling long-term tracking.
5.2 Attention Mechanism¶
Scaled Dot-Product Attention:

\[
\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left( \frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}} \right) \mathbf{V}
\]

where:

- \(\mathbf{Q}\) (Query) — what we're looking for
- \(\mathbf{K}\) (Key) — what the image contains
- \(\mathbf{V}\) (Value) — actual feature values
- \(d_k\) — key dimension (used for scaling)

Multi-Head Attention:

\[
\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, \mathbf{W}^O
\]

where each \(\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)\).
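The attention formula can be checked numerically with a single head and hand-picked toy tokens: the query aligns with key 0, so key 0 receives the largest weight and the output is pulled toward value 0:

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_k = 4
Q = np.array([[1.0, 0.0, 0.0, 0.0]])        # one query token
K = np.array([[1.0, 0.0, 0.0, 0.0],         # key 0 matches the query
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
V = np.array([[10.0, 0.0],
              [0.0, 10.0],
              [5.0, 5.0]])

weights = softmax(Q @ K.T / np.sqrt(d_k))   # (1, 3) attention distribution
out = weights @ V                            # (1, 2) weighted value
print(np.round(weights, 3))                 # key 0 gets the largest weight
print(np.round(out, 2))
```

In a tracker, the query comes from the search region and the keys/values from the template (or vice versa), so these weights encode "which template part does this search location look like".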
5.3 Complete Python Code — Attention-Based Tracker¶
"""
Tier 3: Transformer-Based Visual Object Tracker
================================================
A simplified transformer tracker demonstrating:
1. Token embedding (patch + position)
2. Self-attention and cross-attention
3. Bounding box regression from [CLS] token
For production use, consider OSTrack, MixFormer, or TransT.
"""
import numpy as np
import cv2
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
# ─── Attention Layer ──────────────────────────────────────────────
class ScaledDotProductAttention(nn.Module):
"""Scaled dot-product attention: Q, K, V → attention scores."""
def __init__(self, d_k: int):
super().__init__()
self.d_k = d_k
def forward(self, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor,
mask: torch.Tensor = None) -> tuple:
"""
Parameters
----------
Q : (B, num_heads, seq_len, d_k)
K : (B, num_heads, seq_len, d_k)
V : (B, num_heads, seq_len, d_v)
mask : optional, (B, 1, seq_len, seq_len)
Returns
-------
output : (B, num_heads, seq_len, d_v)
attn_weights : (B, num_heads, seq_len, seq_len)
"""
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
if mask is not None:
scores = scores.masked_fill(mask == 0, -1e9)
attn_weights = F.softmax(scores, dim=-1)
output = torch.matmul(attn_weights, V)
return output, attn_weights
class MultiHeadAttention(nn.Module):
"""Multi-head attention with learned projections."""
def __init__(self, d_model: int = 256, num_heads: int = 8):
super().__init__()
assert d_model % num_heads == 0
self.d_k = d_model // num_heads
self.num_heads = num_heads
self.d_model = d_model
self.W_Q = nn.Linear(d_model, d_model)
self.W_K = nn.Linear(d_model, d_model)
self.W_V = nn.Linear(d_model, d_model)
self.W_O = nn.Linear(d_model, d_model)
self.attention = ScaledDotProductAttention(self.d_k)
def forward(self, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor,
mask: torch.Tensor = None) -> tuple:
B, seq_len, d_model = Q.shape
# Project and reshape to (B, num_heads, seq_len, d_k)
Q = self.W_Q(Q).view(B, -1, self.num_heads, self.d_k).transpose(1, 2)
K = self.W_K(K).view(B, -1, self.num_heads, self.d_k).transpose(1, 2)
V = self.W_V(V).view(B, -1, self.num_heads, self.d_k).transpose(1, 2)
if mask is not None:
mask = mask.unsqueeze(1) # Broadcast over heads
out, attn = self.attention(Q, K, V, mask)
# Concatenate heads and project
out = out.transpose(1, 2).contiguous().view(B, -1, self.d_model)
out = self.W_O(out)
return out, attn
class FeedForward(nn.Module):
"""Position-wise FFN: linear → GELU → linear."""
def __init__(self, d_model: int = 256, d_ff: int = 1024):
super().__init__()
self.fc1 = nn.Linear(d_model, d_ff)
self.fc2 = nn.Linear(d_ff, d_model)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.fc2(F.gelu(self.fc1(x)))
class TransformerEncoderLayer(nn.Module):
"""Single transformer encoder layer: self-attention + FFN."""
def __init__(self, d_model: int = 256, num_heads: int = 8, d_ff: int = 1024):
super().__init__()
self.mha = MultiHeadAttention(d_model, num_heads)
self.ffn = FeedForward(d_model, d_ff)
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
# Self-attention with residual
attn_out, _ = self.mha(x, x, x, mask)
x = self.norm1(x + attn_out)
# FFN with residual
ffn_out = self.ffn(x)
x = self.norm2(x + ffn_out)
return x
class CrossAttentionLayer(nn.Module):
"""Cross-attention: query from one modality, key/value from another."""
def __init__(self, d_model: int = 256, num_heads: int = 8, d_ff: int = 1024):
super().__init__()
self.mha = MultiHeadAttention(d_model, num_heads)
self.ffn = FeedForward(d_model, d_ff)
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
def forward(self, query: torch.Tensor, key_value: torch.Tensor,
mask: torch.Tensor = None) -> torch.Tensor:
"""
Parameters
----------
query : (B, len_q, d_model)
key_value : (B, len_kv, d_model)
"""
attn_out, _ = self.mha(query, key_value, key_value, mask)
x = self.norm1(query + attn_out)
ffn_out = self.ffn(x)
x = self.norm2(x + ffn_out)
return x
# ─── Patch Embedding ──────────────────────────────────────────────
class PatchEmbedding(nn.Module):
"""Convert an image patch into a sequence of tokens via convolution."""
def __init__(self, patch_size: int = 16, embed_dim: int = 256, in_channels: int = 3):
super().__init__()
self.patch_size = patch_size
self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size,
stride=patch_size)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
x : (B, C, H, W)
Returns: (B, num_patches, embed_dim)
"""
# (B, C, H, W) → (B, embed_dim, H/patch, W/patch)
x = self.proj(x)
# Flatten: (B, embed_dim, num_patches_h, num_patches_w) → (B, num_patches, embed_dim)
B, C, H, W = x.shape
x = x.flatten(2).transpose(1, 2)
return x
class PositionalEmbedding(nn.Module):
"""Learnable positional embeddings for patches."""
def __init__(self, num_patches: int, embed_dim: int):
super().__init__()
self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
def forward(self, x: torch.Tensor) -> torch.Tensor:
return x + self.pos_embed[:, :x.size(1), :]
# ─── Simplified Transformer Tracker ─────────────────────────────
class SimpleTransformerTracker(nn.Module):
"""
Simplified transformer tracker with template + search tokenization.
"""
def __init__(self, embed_dim: int = 256, num_heads: int = 8,
num_layers: int = 3, patch_size: int = 16):
super().__init__()
self.embed_dim = embed_dim
self.patch_size = patch_size
# Feature extractor for template and search
self.patch_embed = PatchEmbedding(patch_size, embed_dim, in_channels=3)
num_patches = (128 // patch_size) ** 2 # 128x128 search region
        self.pos_embed = PositionalEmbedding(num_patches * 2 + 2, embed_dim)  # generous upper bound; sliced to the actual sequence length in forward
# Learnable [CLS] and [SEP] tokens
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
self.template_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
self.search_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
# Transformer layers
self.encoder_layers = nn.ModuleList([
TransformerEncoderLayer(embed_dim, num_heads) for _ in range(num_layers)
])
self.cross_layers = nn.ModuleList([
CrossAttentionLayer(embed_dim, num_heads) for _ in range(num_layers)
])
# BBox regression head: from [CLS] token
self.bbox_head = nn.Sequential(
nn.Linear(embed_dim, 256),
nn.ReLU(),
nn.Linear(256, 4), # [dx, dy, dw, dh] relative offsets
nn.Sigmoid()
)
# Initialize weights
nn.init.xavier_uniform_(self.cls_token)
nn.init.xavier_uniform_(self.template_token)
nn.init.xavier_uniform_(self.search_token)
def tokenize(self, template_img: torch.Tensor, search_img: torch.Tensor) -> torch.Tensor:
"""
Combine template and search patches into a single sequence with tokens.
Returns: (B, seq_len, embed_dim)
"""
# Embed patches
template_patches = self.patch_embed(template_img) # (B, T, D)
search_patches = self.patch_embed(search_img) # (B, S, D)
# Add [TPL] and [SRC] tokens
B = template_patches.size(0)
tpl_tokens = self.template_token.expand(B, -1, -1)
src_tokens = self.search_token.expand(B, -1, -1)
# Concatenate: [CLS], [TPL], template_patches, [SEP], search_patches
tokens = torch.cat([
self.cls_token.expand(B, -1, -1),
tpl_tokens,
template_patches,
src_tokens,
search_patches,
], dim=1)
return self.pos_embed(tokens)
def forward(self, template_img: torch.Tensor, search_img: torch.Tensor):
"""
Forward pass.
Parameters
----------
template_img : (B, 3, 64, 64) — cropped and resized template
search_img : (B, 3, 128, 128) — search region
Returns
-------
bbox : (B, 4) — [cx_rel, cy_rel, w_rel, h_rel] in [0, 1]
"""
tokens = self.tokenize(template_img, search_img)
# Self-attention across all tokens
for layer in self.encoder_layers:
tokens = layer(tokens)
        # Cross-attention: the [CLS] query iteratively attends to all tokens
        cls = tokens[:, :1]
        for layer in self.cross_layers:
            cls = layer(cls, tokens)
        # Use the refined [CLS] token for bbox regression
        cls_features = cls[:, 0]  # (B, D)
        bbox = self.bbox_head(cls_features)
return bbox
# ─── Minimal Demo (Simulated) ────────────────────────────────────
def run_transformer_tracker_demo(video_source: int = 0):
"""
Demo using a simplified transformer tracker on a live camera.
This uses the PyTorch model above with simulated tracking (no real training).
For real use, load OSTrack weights from the official repository.
"""
print("[INFO] Transformer Tracker Demo")
    print("[INFO] For production, use a pretrained tracker such as OSTrack (official GitHub repo)")
print("[INFO] This demo shows the architecture and pipeline.")
cap = cv2.VideoCapture(video_source)
if not cap.isOpened():
print("[ERROR] Cannot open camera")
return
# Load model (simplified, not pre-trained)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleTransformerTracker(embed_dim=128, num_heads=4, num_layers=2)
model = model.to(device)
model.eval()
# Get first frame and select bbox
ret, frame = cap.read()
if not ret:
print("[ERROR] Cannot read first frame")
return
print("[INFO] Draw a bounding box, then press ENTER")
bbox = cv2.selectROI("Select Object", frame, fromCenter=False, showCrosshair=True)
cv2.destroyWindow("Select Object")
x, y, w, h = bbox
template_size = 64
search_size = 128
def crop_patch(frame, cx, cy, sz):
"""Crop and resize a square patch."""
half = sz // 2
x1, y1 = max(0, int(cx) - half), max(0, int(cy) - half)
x2, y2 = min(frame.shape[1], int(cx) + half), min(frame.shape[0], int(cy) + half)
patch = frame[y1:y2, x1:x2]
if patch.size == 0:
return np.zeros((sz, sz, 3), dtype=np.uint8)
return cv2.resize(patch, (sz, sz))
template = crop_patch(frame, x + w/2, y + h/2, template_size)
search_region = crop_patch(frame, x + w/2, y + h/2, search_size)
print("[INFO] Tracking started. Press 'q' to quit.")
while True:
ret, frame = cap.read()
if not ret:
break
tpl_tensor = torch.from_numpy(template).permute(2, 0, 1).unsqueeze(0).float().to(device) / 255.0
src_tensor = torch.from_numpy(search_region).permute(2, 0, 1).unsqueeze(0).float().to(device) / 255.0
with torch.no_grad():
output = model(tpl_tensor, src_tensor)
        dx, dy, dw, dh = output[0].cpu().numpy()
        # Map sigmoid outputs to a position offset within the search region
        cx = x + w/2 + (dx - 0.5) * search_size
        cy = y + h/2 + (dy - 0.5) * search_size
        # Heuristic scale mapping: sigmoid output 0.5 leaves the size unchanged
        new_w, new_h = w * (0.5 + dw), h * (0.5 + dh)
        # Carry the state forward so the next search is centred on the new estimate
        x, y, w, h = cx - new_w/2, cy - new_h/2, new_w, new_h
        # Update template and search region (online update)
        template = crop_patch(frame, cx, cy, template_size)
        search_region = crop_patch(frame, cx, cy, search_size)
# Draw result
ix, iy, iw, ih = int(cx - new_w/2), int(cy - new_h/2), int(new_w), int(new_h)
cv2.rectangle(frame, (ix, iy), (ix+iw, iy+ih), (255, 0, 0), 2)
cv2.putText(frame, "Transformer Tracker", (10, 30),
cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 0, 0), 2)
cv2.imshow("Transformer Tracker", frame)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
cap.release()
cv2.destroyAllWindows()
if __name__ == "__main__":
run_transformer_tracker_demo(video_source=0)
6. Evaluation Metrics¶
6.1 MOTA & MOTP¶
MOTA (Multiple Object Tracking Accuracy):
\[ \text{MOTA} = 1 - \frac{\sum_t \left( m_t + f_t + mme_t \right)}{\sum_t g_t} \]
where:
- \(m_t\) — misses (false negatives) at frame \(t\)
- \(f_t\) — false positives at frame \(t\)
- \(mme_t\) — mismatches (ID switches) at frame \(t\)
- \(g_t\) — total ground-truth objects at frame \(t\)
MOTP (Multiple Object Tracking Precision):
\[ \text{MOTP} = \frac{\sum_{i,t} d_t^i}{\sum_t c_t} \]
where \(d_t^i\) is the bounding-box overlap (IoU) of the \(i\)-th matched detection at frame \(t\), and \(c_t\) is the number of matches at frame \(t\).
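As a quick sanity check (MOTA also penalizes ID switches), suppose a sequence contains 1000 ground-truth boxes and the tracker produces 80 misses, 50 false positives, and 10 ID switches:

```latex
\text{MOTA} = 1 - \frac{80 + 50 + 10}{1000} = 1 - 0.14 = 0.86
```

Note that MOTA can go negative when the total error count exceeds the number of ground-truth objects.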
6.2 ID Switches & Fragmentation¶
| Metric | Description |
|---|---|
| ID Switches | Number of times a track changes its assigned ID |
| Fragments | Number of times a track is interrupted then resumes |
| MT (Mostly Tracked) | Tracks with ≥ 80% coverage of ground truth |
| ML (Mostly Lost) | Tracks with ≤ 20% coverage of ground truth |
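The MT/ML classification above is easy to compute once you have per-track coverage. A minimal sketch, assuming a hypothetical `coverage` dict mapping each ground-truth track ID to the fraction of its frames that were successfully tracked:

```python
def classify_tracks(coverage: dict) -> dict:
    """Classify tracks as Mostly Tracked (>= 80% coverage),
    Mostly Lost (<= 20%), or Partially Tracked (in between)."""
    mt = sum(1 for c in coverage.values() if c >= 0.8)
    ml = sum(1 for c in coverage.values() if c <= 0.2)
    pt = len(coverage) - mt - ml
    return {"MT": mt, "PT": pt, "ML": ml}

print(classify_tracks({1: 0.95, 2: 0.5, 3: 0.1}))  # {'MT': 1, 'PT': 1, 'ML': 1}
```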
6.3 Implementation¶
"""
MOT Evaluation Utilities
========================
Compute MOTA, MOTP, and related metrics from tracked vs ground truth data.
"""
import numpy as np
def compute_iou(boxA: np.ndarray, boxB: np.ndarray) -> float:
"""IoU between two [x, y, w, h] boxes."""
xA, yA, wA, hA = boxA
xB, yB, wB, hB = boxB
xi1, yi1 = max(xA, xB), max(yA, yB)
xi2, yi2 = min(xA + wA, xB + wB), min(yA + hA, yB + hB)
inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
union = wA * hA + wB * hB - inter
return inter / (union + 1e-6)
def evaluate_tracking(tracked: list, ground_truth: list) -> dict:
"""
Compute MOT metrics.
Parameters
----------
tracked : list of dicts per frame: [{"id": int, "bbox": [x,y,w,h]}, ...]
ground_truth : same format
Returns
-------
    metrics : dict with MOTA, MOTP, ID switches, and error counts
"""
total_misses = 0
total_false_positives = 0
total_switches = 0
total_gt = 0
total_overlap = 0.0
total_matched = 0
    for frame_tracked, frame_gt in zip(tracked, ground_truth):
        gt_ids = {d["id"] for d in frame_gt}
        # Match by ID (simplified — assumes track IDs are already aligned with
        # ground-truth IDs, so ID switches cannot occur and stay at 0 here)
        for gt_d in frame_gt:
            best_iou, best_trk = 0.0, None
            for trk_d in frame_tracked:
                if trk_d["id"] != gt_d["id"]:
                    continue
                iou = compute_iou(gt_d["bbox"], trk_d["bbox"])
                if iou > best_iou:
                    best_iou, best_trk = iou, trk_d
            if best_trk is not None:
                total_overlap += best_iou
                total_matched += 1
            else:
                total_misses += 1
        # Any track whose ID has no ground-truth counterpart is a false positive
        for trk_d in frame_tracked:
            if trk_d["id"] not in gt_ids:
                total_false_positives += 1
        total_gt += len(frame_gt)
mota = 1.0 - (total_misses + total_false_positives + total_switches) / max(total_gt, 1)
motp = total_overlap / max(total_matched, 1)
return {
"MOTA": mota,
"MOTP": motp,
"ID Sw": total_switches,
"Total GT": total_gt,
"Missed": total_misses,
"False Positives": total_false_positives,
}
7. Step-by-Step Implementation Guide¶
Phase 1 — Tier 1: KCF Tracker (Week 1)¶
- Start with the OpenCV built-in `cv2.TrackerKCF_create()` — verify it runs at 200+ FPS
- Implement the HOG feature extractor from scratch (Section 3.3)
- Implement the circulant matrix training step with FFT
- Test on a video sequence (OTB-50 dataset is commonly used)
- Compare with OpenCV's built-in tracker for correctness
Phase 2 — Tier 2: SORT Tracker (Week 2)¶
- Implement the Kalman filter (Section 4.3) — verify with a simple 1D tracking example
- Implement the Hungarian algorithm via `scipy.optimize.linear_sum_assignment`
- Integrate YOLO detection with `ultralytics`
- Run SORT on a video — count ID switches as you tune `iou_threshold`
- Extend to DeepSORT: add appearance feature extraction (cosine distance matching)
- Evaluate with MOTA/MOTP on MOT17 dataset
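The association step in Phase 2 can be sketched as an IoU cost matrix solved with `scipy.optimize.linear_sum_assignment`; the boxes below are made-up `[x, y, w, h]` values for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two [x, y, w, h] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter + 1e-6)

def associate(tracks, detections, iou_threshold=0.3):
    """Return (track_idx, det_idx) matches via the Hungarian algorithm."""
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)  # minimizes total cost
    # Reject assignments whose overlap is below the gating threshold
    return [(r, c) for r, c in zip(rows, cols)
            if 1.0 - cost[r, c] >= iou_threshold]

tracks = [[10, 10, 50, 50], [200, 200, 40, 40]]
detections = [[205, 198, 42, 41], [12, 11, 48, 52]]
print(associate(tracks, detections))  # [(0, 1), (1, 0)]
```

Lowering `iou_threshold` recovers more matches under fast motion but invites ID switches, which is exactly the trade-off the tuning step above asks you to measure.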
Phase 3 — Tier 3: Transformer Tracker (Week 3)¶
- Study the attention mechanism: implement scaled dot-product attention from scratch
- Run OSTrack or MixFormer from the official repositories (pretrained weights available)
- Analyze the cross-attention maps to understand what the model focuses on
- Experiment with template update strategies (fixed template vs. online updating)
- Benchmark on VOT2020/OTB100 and compare to Tier 1 and Tier 2
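The first bullet — scaled dot-product attention from scratch — is only a few lines; this NumPy sketch mirrors the PyTorch version used earlier, with made-up shapes for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q: (len_q, d_k), K: (len_kv, d_k), V: (len_kv, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (len_q, len_kv)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (4, 8) (4, 6)
```

Each row of `attn` sums to 1 — those rows are the attention maps you will visualize in the analysis step.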
8. Comparison of Approaches¶
| Criteria | KCF Tracker | SORT | DeepSORT | Transformer Tracker |
|---|---|---|---|---|
| Architecture | Correlation Filter | Detection + KF + Hungarian | Detection + KF + Appearance + Hungarian | Attention + Template Matching |
| Single/Multi-object | Single | Multi | Multi | Single / Multi |
| Speed (FPS) | ~300 | ~100 | ~30-50 | ~30 |
| Accuracy | ★★★☆☆ | ★★★☆☆ | ★★★★☆ | ★★★★★ |
| Occlusion handling | Poor | Fair (IoU only) | Good (appearance) | Excellent |
| Scale variation | Fair | Poor | Poor | Excellent |
| GPU required | No | No | Recommended | Yes |
| Training needed | No (online) | No | Feature extractor | Yes (full model) |
| ID persistence | Low | Medium | High | Very high |
| Best for | Simple scenes, high FPS | Real-time multi-object | Person/vehicle tracking | Complex benchmarks |
| Complexity | ⭐⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
9. Extensions and Variations¶
9.1 Siamese Networks¶
Siamese trackers (SiamFC, SiamRPN, SiamMask) use a two-branch network:
- Template branch: encodes the initial target
- Search branch: encodes the current frame
- Similarity map: computed by cross-correlation in feature space
SiamMask extends this to semantic segmentation of the target for better precision.
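The similarity map at the heart of these trackers is, at its core, a cross-correlation of the template's features slid over the search region's features. A minimal single-channel NumPy sketch (real trackers correlate deep multi-channel feature maps, typically via `F.conv2d`):

```python
import numpy as np

def xcorr2d(search: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Slide `template` over `search`, recording the correlation
    score at each valid position (no padding)."""
    sh, sw = search.shape
    th, tw = template.shape
    out = np.zeros((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(search[i:i+th, j:j+tw] * template)
    return out

search = np.zeros((8, 8))
search[3:5, 4:6] = 1.0         # target placed at rows 3-4, cols 4-5
template = np.ones((2, 2))
score = xcorr2d(search, template)
peak = np.unravel_index(score.argmax(), score.shape)
print(peak)  # (3, 4) — top-left corner of the best match
```

The peak of the score map is the predicted target location; SiamRPN replaces this single peak with anchor-based box regression on the same map.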
9.2 Fully Convolutional Tracker (FCNT)¶
FCNT analyzes the activation maps of a CNN (VGG-16) and selects the most discriminative feature layer for tracking, avoiding distraction from background clutter.
9.3 Long-Term Tracking¶
For long-term tracking with target re-detection after complete occlusion:
- ECO (Efficient Convolution Operators): uses factorized convolution and training sample selection to reduce drift over time
- ATOM (Accurate Tracking by Overlap Maximization): learns to estimate bounding box overlap directly using a classification + bounding box refinement pipeline
- LaSOT dataset: a long-term tracking benchmark with 1,400+ videos
9.4 Real-World Applications¶
- Autonomous driving: track pedestrians, vehicles, cyclists across frames
- Surveillance: multi-camera person re-identification
- Sports analytics: player and ball tracking in soccer/basketball
- Medical imaging: cell and organ tracking in microscopy video
- Robotics: visual servoing with tracked reference points
10. References¶
- Henriques, J.F., et al. (2015). "High-Speed Tracking with Kernelized Correlation Filters." IEEE TPAMI, 37(3), 583–596. — KCF algorithm paper
- Bolme, D.S., et al. (2010). "Visual Object Tracking using Adaptive Correlation Filters." CVPR. — MOSSE (predecessor to KCF)
- Bewley, A., et al. (2016). "Simple Online and Realtime Tracking." ICIP. — SORT algorithm
- Wojke, N., et al. (2017). "Simple Online and Realtime Tracking with a Deep Association Metric." ICIP. — DeepSORT extension
- Vaswani, A., et al. (2017). "Attention Is All You Need." NeurIPS. — Transformer architecture
- Ye, B., et al. (2022). "Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework." ECCV. — OSTrack
- Chen, X., et al. (2021). "Transformer Tracking." CVPR. — TransT architecture
- Cui, Y., et al. (2022). "MixFormer: End-to-End Tracking with Iterative Mixed Attention." CVPR. — MixFormer
- Bertinetto, L., et al. (2016). "Fully-Convolutional Siamese Networks for Object Tracking." ECCV Workshops. — SiamFC (Siamese tracking)
- Kristan, M., et al. (2020). "The Eighth Visual Object Tracking VOT2020 Challenge Results." — VOT benchmark
- Liang, C., et al. (2015). "Particle Filter-based Visual Tracking: A Survey." — Comprehensive survey
- OpenCV Tracking API — docs.opencv.org
- SORT Implementation — GitHub: abewley/sort
- DeepSORT Implementation — GitHub: nwojke/deep_sort
- OSTrack — GitHub: botaoye/OSTrack
- MOT17 / MOT20 Benchmarks — motchallenge.net