YOLO Object Detection¶

YOLO (You Only Look Once) is the most widely used real-time object detection framework in robotics. It predicts bounding boxes and class probabilities in a single forward pass, making it fast enough for live video processing on embedded devices. This tutorial covers YOLO from first principles through edge deployment on robots.

Learning Objectives¶

Understand the core ideas behind single-stage detectors and why they dominate real-time robotics
Trace the evolution from YOLOv1 (2016) to YOLOv12 (2025) and pick the right version for your task
Train, evaluate, and deploy custom YOLO models using the Ultralytics ecosystem
Apply YOLO to pick-and-place, tracking, pose estimation, and instance segmentation

1. What is YOLO¶

1.1 Brief History¶

Object detection (locating and classifying objects in images) is one of the oldest and most practical problems in computer vision. Before YOLO, the state of the art was dominated by two-stage detectors like R-CNN (2014) and Faster R-CNN (2015). These systems first propose regions that might contain objects, then classify each region. They achieved high accuracy but were slow: R-CNN needed tens of seconds per image, and even Faster R-CNN ran at only about 5–7 frames per second on a GPU with its standard VGG-16 backbone, far too slow for real-time robotics.

In 2016, Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi introduced YOLOv1 with a radical insight: treat detection as a single regression problem. Instead of a multi-stage pipeline, predict all bounding boxes and class probabilities directly from the full image in one pass of the network. YOLOv1 ran at 45 FPS — fast enough for video — and instantly reshaped the field.

Since then, the YOLO lineage has produced over a dozen major versions. The community contributions have been enormous: anchor boxes (YOLOv2), feature pyramid networks (YOLOv3), CSPNet and mosaic augmentation (YOLOv4), anchor-free detection (YOLOv8), and attention-centric designs (YOLOv12). Ultralytics unified many of these ideas into a single Python package that is now the de facto standard for YOLO deployment.

1.2 Why YOLO Matters for Robotics¶

Robots need to perceive their environment in real time. Consider a robotic arm picking parts off a conveyor belt: it must detect each part, estimate its position, and plan a grasp — all within the cycle time of the belt (often < 100 ms). YOLO provides:

Speed: 30–300+ FPS depending on model size, enabling real-time control loops
Accuracy: Modern YOLO models achieve mAP > 50 on COCO, competitive with much slower detectors
Versatility: A single framework handles detection, segmentation, pose estimation, and tracking
Edge deployment: Export to TensorRT, ONNX, OpenVINO for Jetson, Intel, and ARM devices
Community: Tens of thousands of pre-trained weights, datasets, and deployment examples

For robotics, YOLO is not just a detector — it is the perception backbone for manipulation, navigation, inspection, and human-robot interaction.

2. How YOLO Works¶

2.1 The Core Insight¶

Traditional two-stage detectors work like this:

Input Image
    │
    ▼
┌──────────────────┐
│ Region Proposal   │  ← "Where might objects be?" (Selective Search, RPN)
│ Network           │
└────────┬─────────┘
         │  ~2000 candidate regions
         ▼
┌──────────────────┐
│ Classification    │  ← "What is in each region?"
│ + Bounding Box    │
│   Refinement      │
└──────────────────┘

Two-stage detectors are accurate but slow because they process each candidate region separately.

YOLO collapses this into a single step:

Input Image
    │
    ▼
┌──────────────────┐
│                   │
│  CNN Backbone     │  ← Extract features from the entire image
│  + Neck + Head    │
│                   │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  S × S Grid       │
│  × (B × 5 + C)    │  ← Predict ALL boxes + classes in ONE pass
│  values per cell  │
└──────────────────┘

2.2 Grid-Based Prediction (YOLOv1 Detail)¶

YOLOv1 divides the input image into an S × S grid. Each grid cell predicts:

B bounding boxes (each box has: x, y, w, h, confidence)
C class probabilities (one score per class)

The total output is a tensor of shape S × S × (B × 5 + C).

┌─────────────────────────────────────────────────┐
│                 Input Image                      │
│                                                  │
│    ┌─────┬─────┬─────┬─────┬─────┐              │
│    │ Cell│ Cell│ Cell│ Cell│ Cell│  ← 7×7 grid   │
│    ├─────┼─────┼─────┼─────┤     │              │
│    │ ... │ ... │ ... │ ... │     │              │
│    └─────┴─────┴─────┴─────┴─────┘              │
│                                                  │
│    Each cell predicts:                           │
│    - 2 bounding boxes (x, y, w, h, conf)        │
│    - 20 class probabilities (Pascal VOC)         │
│                                                  │
│    Output tensor: 7 × 7 × 30                    │
└─────────────────────────────────────────────────┘
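
As a quick check of the tensor shape above, the following minimal sketch computes the output size for the Pascal VOC configuration (S = 7, B = 2, C = 20). The function name `yolo_v1_output_shape` is illustrative only and not part of any library.

```python
def yolo_v1_output_shape(S: int, B: int, C: int) -> tuple:
    """Shape of the YOLOv1 prediction tensor.

    Each of the S x S cells holds B boxes (5 values each: x, y, w, h, conf)
    plus one shared set of C class probabilities.
    """
    return (S, S, B * 5 + C)


shape = yolo_v1_output_shape(S=7, B=2, C=20)
num_values = shape[0] * shape[1] * shape[2]
print(shape, num_values)  # (7, 7, 30) 1470
```

Note that the class probabilities are shared by all boxes in a cell, which is why YOLOv1 can assign only one class per cell.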

2.3 Loss Function¶

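YOLOv1 optimizes a multi-part sum-squared-error loss combining:

Localization loss: Squared error on box centers (x, y) and on the square roots of width and height (√w, √h), so that an error on a small box counts for more than the same error on a large box
Confidence loss: Squared error on the objectness score (does this box contain an object, and how well does it fit?)
Classification loss: Squared error on class probabilities

Only cells containing objects contribute to the localization and classification losses. The confidence loss is computed for every predicted box, but boxes with no object are down-weighted because they vastly outnumber the rest. The weights balance these terms (λ_coord = 5 and λ_noobj = 0.5 in the original paper). Later YOLO versions replace these terms with IoU-based box losses and cross-entropy classification losses.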
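
For reference, the YOLOv1 loss can be written out as below. This follows the form given in the original paper, using its notation: the indicator 1_ij^obj is 1 when box j of cell i is responsible for an object, C is the confidence score, and p(c) is the probability of class c. The index ranges are written 1-based here for readability.

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  & + \lambda_{\text{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
           + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
  & + \sum_{i=1}^{S^2} \sum_{j=1}^{B}
      \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
  & + \lambda_{\text{noobj}} \sum_{i=1}^{S^2} \sum_{j=1}^{B}
      \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
  & + \sum_{i=1}^{S^2} \mathbb{1}_{i}^{\text{obj}}
      \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The first two lines are the localization loss, the next two are the confidence loss (with and without an object), and the last line is the classification loss.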

2.4 From YOLOv1 to Modern Architectures¶

Modern YOLO models (v5, v8, v11) follow a three-part architecture:

┌─────────────────────────────────────────────────────────────────┐
│                    YOLO Architecture (v5/v8/v11)                 │
│                                                                  │
│  ┌─────────────┐                                                │
│  │             │                                                │
│  │   Backbone   │  Extracts multi-scale features                 │
│  │  (CSPNet /   │  e.g., CSPDarknet, C2f blocks                 │
│  │   C2f)       │                                                │
│  │             │                                                │
│  └──────┬──────┘                                                │
│         │  P3, P4, P5 (1/8, 1/16, 1/32 resolution)             │
│         ▼                                                       │
│  ┌─────────────┐                                                │
│  │             │                                                │
│  │    Neck      │  Fuses multi-scale features                    │
│  │  (PANet /    │  Top-down + bottom-up path aggregation         │
│  │   SPPF)      │                                                │
│  │             │                                                │
│  └──────┬──────┘                                                │
│         │  F3, F4, F5 (enriched features)                       │
│         ▼                                                       │
│  ┌─────────────┐                                                │
│  │             │                                                │
│  │    Head      │  Predicts boxes, scores, classes               │
│  │  (Anchor /   │  One prediction per grid cell                  │
│  │  Anchor-free)│                                                │
│  │             │                                                │
│  └─────────────┘                                                │
│         │                                                        │
│         ▼                                                        │
│  Bounding boxes + Class scores + Confidence                     │
└─────────────────────────────────────────────────────────────────┘

Backbone: A deep CNN (e.g., CSPDarknet53, EfficientNet) that extracts hierarchical features. Larger models (L, X) use deeper backbones with more channels.

Neck: Feature Pyramid Network (FPN) + Path Aggregation Network (PANet). The FPN top-down pathway merges high-level semantic features with low-level spatial features. The PANet bottom-up pathway then adds a second pass that gives fine-grained, high-resolution features a short route to the deeper prediction layers, which helps localization.

Head: The detection head produces predictions at three scales:

P3 (1/8): Small objects (e.g., screws, small parts)
P4 (1/16): Medium objects (e.g., cups, tools)
P5 (1/32): Large objects (e.g., boxes, people)
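
To make the three scales concrete, this small sketch computes the grid size at each stride for a square input. It assumes the 640×640 default input used elsewhere in this tutorial and strides of 8, 16, and 32; `detection_grids` is an illustrative helper, not a library function.

```python
def detection_grids(imgsz: int = 640, strides=(8, 16, 32)) -> dict:
    """Return {stride: (grid_h, grid_w)} for a square input.

    Assumes imgsz is divisible by every stride.
    """
    return {s: (imgsz // s, imgsz // s) for s in strides}


grids = detection_grids(640)
total_cells = sum(h * w for h, w in grids.values())
print(grids)        # {8: (80, 80), 16: (40, 40), 32: (20, 20)}
print(total_cells)  # 8400
```

With one prediction per cell, an anchor-free head therefore emits 8400 candidate boxes per 640×640 image before post-processing.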

2.5 Anchor-Free vs Anchor-Based¶

Anchor-based (YOLOv2–v5, v7): The model predicts offsets from predefined anchor boxes (prior shapes computed via k-means on the training set). Each anchor yields a bounding box by shifting its center and scaling its width and height.

Anchor-free (YOLOv8+): The model predicts each box directly relative to a grid-cell center (in YOLOv8, as distances to the four box sides), with no predefined shapes. A task-aligned assigner matches predictions to ground truth during training. This removes the anchor hyperparameters and tends to generalize better to unusual object shapes.

Anchor-based:  prediction = anchor + offset
Anchor-free:   prediction = cell center + (left, top, right, bottom) distances
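
The two decoding schemes can be sketched side by side. This is a simplified illustration, not the exact code of any release: the anchor-based branch follows the YOLOv2/v3 formulation (sigmoid offsets for the center, exponential scaling of the anchor size), and the anchor-free branch assumes distances to the four box sides measured in grid units from the cell center. All function names are hypothetical.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def decode_anchor_based(t, cell, anchor_wh, stride):
    """YOLOv2/v3-style decoding of raw outputs t = (tx, ty, tw, th)."""
    tx, ty, tw, th = t
    cx = (sigmoid(tx) + cell[0]) * stride   # box center x in pixels
    cy = (sigmoid(ty) + cell[1]) * stride   # box center y in pixels
    w = anchor_wh[0] * math.exp(tw)         # anchor width scaled by exp(tw)
    h = anchor_wh[1] * math.exp(th)         # anchor height scaled by exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def decode_anchor_free(ltrb, cell, stride):
    """Anchor-free decoding from (left, top, right, bottom) distances in grid units."""
    left, top, right, bottom = ltrb
    px, py = cell[0] + 0.5, cell[1] + 0.5   # reference point = cell center
    return ((px - left) * stride, (py - top) * stride,
            (px + right) * stride, (py + bottom) * stride)


box_a = decode_anchor_based((0.0, 0.0, 0.0, 0.0), cell=(10, 10),
                            anchor_wh=(64, 48), stride=16)
box_f = decode_anchor_free((2.0, 1.5, 2.0, 1.5), cell=(10, 10), stride=16)
print(box_a)  # (136.0, 144.0, 200.0, 192.0)
print(box_f)  # (136.0, 144.0, 200.0, 192.0)
```

Both calls describe the same 64×48 box; the anchor-based version needs a suitable prior shape to get there, while the anchor-free version needs only the cell location.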

3. YOLO Versions¶

3.1 Evolution Table¶

Version	Year	Key Innovation	Backbone	Neck	Head	Accuracy (metric)	FPS (approx.)	Notable
YOLOv1	2016	Single-pass detection, grid prediction	Custom 24-layer CNN (GoogLeNet-inspired)	None	FC layers	63.4 (VOC mAP)	45	First real-time detector
YOLOv2	2017	Batch norm, anchor boxes, multi-scale training	Darknet-19	Passthrough	Conv	78.6 (VOC mAP)	40	Trained on 9000+ classes (YOLO9000)
YOLOv3	2018	Darknet-53, FPN multi-scale detection	Darknet-53	FPN	3-scale	57.9 (mAP50)	20	Best balance at the time
YOLOv4	2020	CSPDarknet, PANet, mosaic augmentation	CSPDarknet53	SPP + PANet	Anchor (YOLOv3 head)	65.7 (mAP50)	~65	Bag of freebies + specials
YOLOv5	2020	PyTorch native, auto-anchor, easy deployment	CSPDarknet53	PANet + SPPF	Anchor	68.9 (mAP50)	140	Most deployed in production
YOLOv7	2022	E-ELAN, re-parameterization, model scaling	E-ELAN	ELAN-PAN	Anchor	69.7 (mAP50)	161	Fastest at its release
YOLOv8	2023	Anchor-free, decoupled head, Ultralytics API	CSPNet (C2f)	PANet + SPPF	Anchor-free	53.9 (mAP50-95)	280	Unified detect/seg/pose/classify
YOLOv9	2024	GELAN, PGI (Programmable Gradient Information)	GELAN	GELAN	Anchor-free	55.6 (mAP50-95)	300+	Targets the information bottleneck
YOLOv10	2024	NMS-free, dual label assignment	CSPNet	PANet	Anchor-free	54.4 (mAP50-95)	350+	Eliminates NMS post-processing
YOLO11	2024	C3k2 blocks, improved efficiency	C3k2-CSPNet	PANet + SPPF + C2PSA	Anchor-free	54.7 (mAP50-95)	320+	Most parameter-efficient
YOLOv12	2025	Area attention, R-ELAN, FlashAttention	R-ELAN + area attention	PAN-style (R-ELAN)	Anchor-free	55.2 (mAP50-95)	340+	Combines CNN speed with transformer accuracy

Accuracy values are approximate and not directly comparable across rows: YOLOv1–v2 report mAP on Pascal VOC 2007, YOLOv3–v7 report mAP50 on COCO, and YOLOv8 onward report COCO mAP50-95 for the largest variant at 640×640. FPS figures come from differently configured benchmarks (Titan X-class GPUs for the earliest versions, V100/A100-class GPUs later, often for smaller variants than the accuracy column), so treat them as rough guides rather than a ranking.

3.2 Key Innovations by Version¶

YOLOv1 (2016) — The Original¶

Divides image into S×S grid; each cell predicts B boxes + C classes
End-to-end differentiable — no region proposals
Limitation: struggles with small objects, many instances of the same class

YOLOv2 / YOLO9000 (2017) — Faster and More Classes¶

Added batch normalization after every convolutional layer (+2% mAP)
Introduced anchor boxes via k-means clustering on training data
Multi-scale training: randomly resize input during training (320–608 pixels)
Joint training on detection + classification (9418 classes from WordTree)

YOLOv3 (2018) — Multi-Scale Detection¶

Darknet-53 backbone: 53 convolutional layers with residual connections
Feature Pyramid Network (FPN): predictions at 3 scales (13×13, 26×26, 52×52 for a 416×416 input)
Binary cross-entropy for class prediction (handles multi-label)
The go-to version for years in production robotics

YOLOv4 (2020) — Bag of Freebies¶

CSPDarknet53: Cross-Stage Partial connections reduce computation
SPP + PANet neck for richer feature aggregation
Mosaic augmentation: 4-image collage that teaches the model about context
Mish activation: smooth non-linearity replacing Leaky ReLU
Dozens of "free" training tricks: CutMix, DropBlock, label smoothing

YOLOv5 (2020) — The Deployment King¶

Written in PyTorch from the start (YOLOv1–v4 were built on Darknet, a C/CUDA framework)
Auto-anchor: Automatically learns anchor box sizes for your dataset
Integrated export to ONNX, TensorRT, CoreML, TFLite
Variants: n (nano), s (small), m (medium), l (large), x (extra-large)
Most widely deployed YOLO version in industrial robotics

YOLOv7 (2022) — Efficiency Champion¶

E-ELAN (Extended Efficient Layer Aggregation Network): optimized feature fusion
Re-parameterization: Train with complex architecture, deploy with simpler one
Auxiliary head during training for better gradient flow
State-of-the-art speed-accuracy tradeoff at release

YOLOv8 (2023) — The Modern Standard¶

Anchor-free detection: eliminates anchor box hyperparameters
Decoupled head: separate branches for classification and regression
Task-Aligned Assigner: dynamic positive sample assignment
Unified API: ultralytics package supports detect, segment, pose, classify
Variants: n, s, m, l, x — choose by speed/accuracy budget

YOLOv9 (2024) — Information Bottleneck Solved¶

GELAN (Generalized Efficient Layer Aggregation Network): new macro-architecture
PGI (Programmable Gradient Information): prevents information loss in deep networks
Shows that an auxiliary reversible branch used only during training can improve accuracy at no inference cost
Smallest model (YOLOv9-t) achieves about 38% mAP with only 2M parameters

YOLOv10 (2024) — NMS-Free¶

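Consistent dual assignments: trains a one-to-many head and a one-to-one head together, then keeps only the one-to-one head for inference
Holistic efficiency-accuracy driven design: lighter classification head, decoupled downsampling, and partial self-attention
Eliminates the NMS post-processing step, removing a stage whose latency grows with the number of candidate boxes
Lightweight architectures: n (2.3M params), s (7.2M), m (15.4M)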
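
For context, the NMS step that YOLOv10 removes can be sketched as follows. This is a plain greedy, class-agnostic version written for clarity; production implementations are vectorized and usually run per class. The helper names `iou` and `nms` are illustrative.

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thres=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Retain only boxes that do not overlap the kept box above the threshold
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thres]
    return keep


boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 260, 260)]
scores = [0.90, 0.80, 0.75]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.92 and is suppressed. Because the loop length depends on how many candidates survive, NMS latency varies with scene content, which is the cost an NMS-free head avoids.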

YOLO11 (2024) — Efficient Next-Gen¶

C3k2 blocks: smaller, more efficient cross-stage connections
YOLO11m reaches higher COCO mAP than YOLOv8m with about 22% fewer parameters
Improved feature extraction at all scales
Available in n, s, m, l, x variants

YOLOv12 (2025) — Attention-Centric¶

Area attention (A2): replaces some convolution blocks with attention computed over large feature-map regions, which keeps the cost low
FlashAttention for memory-efficient self-attention
R-ELAN (Residual Efficient Layer Aggregation Network): stabilizes training of the larger, attention-heavy models
Aims to combine the speed of CNNs with the accuracy of vision transformers
Reported speed figures assume a GPU that supports FlashAttention

3.3 How to Choose a Version¶

Decision guide:

  Need fastest inference? ──────── YOLOv8-n or YOLOv10-n
  Need best accuracy? ──────────── YOLOv12 or YOLOv8-x  
  Need NMS-free (edge)? ────────── YOLOv10
  Need most deployment options? ── YOLOv5 (widest support)
  Need segmentation + pose? ────── YOLOv8 (unified API)
  Need minimal compute? ─────────── YOLOv9-t or YOLO11-n
  Production/industrial? ────────── YOLOv5 or YOLOv8 (most battle-tested)

4. Quick Start with Ultralytics¶

4.1 Installation¶

# Create a virtual environment (recommended)
python -m venv yolo_env
source yolo_env/bin/activate    # Linux/Mac
# yolo_env\Scripts\activate     # Windows

# Install Ultralytics (includes YOLOv5, v8, v11)
pip install ultralytics

# Verify installation
python -c "import ultralytics; print(ultralytics.__version__)"
# Output: 8.x.x

4.2 Inference on an Image¶

from ultralytics import YOLO

# Load a pre-trained model (downloads weights automatically)
model = YOLO("yolov8n.pt")  # nano — fastest, smallest

# Run inference on an image
results = model("https://ultralytics.com/images/zidane.jpg")

# Process results
for result in results:
    # Bounding boxes: xyxy format [x1, y1, x2, y2]
    boxes = result.boxes
    print(f"Found {len(boxes)} objects")

    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # coordinates
        confidence = box.conf[0].item()           # confidence score
        class_id = int(box.cls[0].item())         # class index
        class_name = result.names[class_id]        # class name

        print(f"  {class_name}: {confidence:.2f} at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}]")

# Save annotated image
results[0].save("output.jpg")

4.3 Inference on Video / Webcam¶

from ultralytics import YOLO
import cv2

model = YOLO("yolov8s.pt")  # small model — good speed/accuracy balance

# --- Option 1: Process a video file ---
results = model.predict(
    source="input_video.mp4",
    save=True,              # save annotated video
    conf=0.25,              # confidence threshold
    imgsz=640,              # inference size
    classes=[0, 1],         # only detect classes 0 (person) and 1 (bicycle)
    stream=True,            # memory-efficient streaming
)

for r in results:
    # r.boxes contains detections for this frame
    pass

# --- Option 2: Process webcam (real-time) ---
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # YOLO inference directly on numpy array
    results = model(frame, verbose=False)

    # Display results
    annotated = results[0].plot()  # draw boxes on frame
    cv2.imshow("YOLO Detection", annotated)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

4.4 Export to ONNX / TensorRT¶

from ultralytics import YOLO

model = YOLO("yolov8s.pt")

# Export to ONNX (cross-platform deployment)
model.export(format="onnx", imgsz=640, simplify=True)
# Creates: yolov8s.onnx

# Export to TensorRT (NVIDIA GPU optimized)
model.export(format="engine", imgsz=640, half=True)  # half = FP16
# Creates: yolov8s.engine

# Export to OpenVINO (Intel CPU/GPU/VPU)
model.export(format="openvino", imgsz=640)
# Creates: yolov8s_openvino_model/

# Export to CoreML (Apple Silicon)
model.export(format="coreml", imgsz=640)

# Export to TFLite (mobile/edge)
model.export(format="tflite", imgsz=320)  # smaller input for edge

4.5 Running Exported Models¶

from ultralytics import YOLO

# Load an ONNX model
model = YOLO("yolov8s.onnx")
results = model("test.jpg")

# Load a TensorRT engine
model = YOLO("yolov8s.engine")
results = model("test.jpg")

# Both produce the same output format as the PyTorch model

5. YOLO for Robotics Applications¶

5.1 Object Detection for Pick-and-Place¶

The most common robotics use of YOLO: detecting objects on a table or conveyor belt and providing their positions for a manipulator.

from ultralytics import YOLO
import numpy as np

# Load model trained on your object classes
model = YOLO("runs/detect/my_objects/weights/best.pt")

def detect_objects(image):
    """Detect objects and return their center positions in the image."""
    results = model(image, verbose=False)[0]

    objects = []
    for box in results.boxes:
        x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
        conf = box.conf[0].item()
        cls = int(box.cls[0].item())

        # Compute center point (pixel coordinates)
        cx = (x1 + x2) / 2
        cy = (y1 + y2) / 2
        w = x2 - x1
        h = y2 - y1

        objects.append({
            "class": results.names[cls],
            "confidence": conf,
            "center_px": (cx, cy),
            "bbox": (x1, y1, x2, y2),
            "size_px": (w, h),
        })

    return objects

# --- Integration with robot coordinate transform ---
def pixel_to_robot(cx, cy, camera_matrix, extrinsic_matrix, depth_m=0.5):
    """Convert pixel coordinates to the robot base frame.

    Assumes an undistorted (rectified) image and a known depth along the
    camera's optical axis, e.g. a fixed camera looking straight down at a table.
    """
    # Camera intrinsics: focal lengths and principal point (in pixels)
    fx, fy = camera_matrix[0, 0], camera_matrix[1, 1]
    cx_cam, cy_cam = camera_matrix[0, 2], camera_matrix[1, 2]

    # Back-project the pixel to a 3D point in the camera frame (pinhole model)
    Z = depth_m  # depth from camera to table surface in meters
    X = (cx - cx_cam) * Z / fx
    Y = (cy - cy_cam) * Z / fy

    # Transform to robot base frame (4x4 homogeneous camera-to-base transform)
    point_cam = np.array([X, Y, Z, 1.0])
    point_robot = extrinsic_matrix @ point_cam

    return point_robot[:3]  # x, y, z in robot frame

# --- Example usage ---
# detect -> transform -> plan grasp -> execute
# `camera`, `K`, and `T_base_cam` are placeholders for your camera driver,
# 3x3 intrinsic matrix, and 4x4 hand-eye calibration result.
image = camera.capture()
objects = detect_objects(image)

for obj in objects:
    robot_pos = pixel_to_robot(*obj["center_px"], K, T_base_cam)
    print(f"Object: {obj['class']}, Robot position: {robot_pos}")
    # robot_arm.move_to(robot_pos)
    # robot_arm.grasp()

5.2 Real-Time Tracking with DeepSORT / ByteTrack¶

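Object detection gives you boxes in each frame, but for robotics you often need tracking: consistent object IDs across frames. Ultralytics ships two built-in trackers, BoT-SORT (the default) and ByteTrack; DeepSORT, an older appearance-based tracker, requires a separate package. ByteTrack is a fast, modern choice.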

from ultralytics import YOLO

# ByteTrack is built into Ultralytics: selecting it via the tracker config is enough
model = YOLO("yolov8s.pt")

# Enable tracking on a video source
results = model.track(
    source="conveyor_video.mp4",
    tracker="bytetrack.yaml",  # ByteTrack config
    persist=True,              # keep track IDs when calling track() frame by frame
    stream=True,               # yield results one frame at a time (memory-efficient)
    save=True,
    conf=0.3,
)

# Process tracked results
for r in results:
    for box in r.boxes:
        track_id = box.id.item() if box.id is not None else None
        cls = r.names[int(box.cls[0].item())]
        xyxy = box.xyxy[0].tolist()

        if track_id is not None:
            print(f"Track {int(track_id)}: {cls} at {xyxy}")
            # Use track_id to maintain state: count, history, prediction
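
One way to use track IDs to maintain state is to count objects as they pass a fixed line on the conveyor. The sketch below is a minimal, framework-independent illustration: it assumes each frame has already been reduced to a list of `(track_id, (x1, y1, x2, y2))` tuples and that objects move left to right. The function name and the sample data are hypothetical.

```python
def count_line_crossings(frames, line_x):
    """Count tracks whose box center moves from left of line_x to on or past it.

    frames: iterable of per-frame lists of (track_id, (x1, y1, x2, y2)).
    """
    last_cx = {}     # track_id -> center x in the last frame where it was seen
    counted = set()  # track IDs already counted, so each object counts once
    for detections in frames:
        for track_id, (x1, _y1, x2, _y2) in detections:
            cx = (x1 + x2) / 2
            prev = last_cx.get(track_id)
            if prev is not None and prev < line_x <= cx:
                counted.add(track_id)
            last_cx[track_id] = cx
    return len(counted)


frames = [
    [(1, (80, 40, 120, 80)), (2, (10, 40, 50, 80))],
    [(1, (180, 40, 220, 80)), (2, (60, 40, 100, 80))],
    [(1, (280, 40, 320, 80)), (2, (160, 40, 200, 80))],
]
print(count_line_crossings(frames, line_x=150))  # 2
```

Without stable IDs the same part would be counted once per frame; the `counted` set is what makes the tally per object rather than per detection.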

5.3 Pose Estimation (YOLOv8-Pose)¶

YOLOv8-Pose detects human keypoints (or custom keypoints) alongside bounding boxes. Useful for human-robot collaboration, gesture recognition, and ergonomic monitoring.

from ultralytics import YOLO
import numpy as np

# Load pose model (trained on COCO keypoints: 17 keypoints)
model = YOLO("yolov8s-pose.pt")

results = model("person_image.jpg")

for r in results:
    # r.keypoints.data has shape (num_persons, num_keypoints, 3): x, y, confidence
    keypoints = r.keypoints

    # Skip frames where no person was detected
    if keypoints is not None and len(keypoints.xy) > 0:
        kpts = keypoints.xy[0].cpu().numpy()  # first person, shape (17, 2)

        # COCO keypoint indices:
        # 0: nose, 1: left_eye, 2: right_eye, 3: left_ear, 4: right_ear
        # 5: left_shoulder, 6: right_shoulder, 7: left_elbow, 8: right_elbow
        # 9: left_wrist, 10: right_wrist, 11: left_hip, 12: right_hip
        # 13: left_knee, 14: right_knee, 15: left_ankle, 16: right_ankle

        # Compute arm angle for robot interaction
        shoulder = kpts[5]
        elbow = kpts[7]
        wrist = kpts[9]

        # Angle at elbow
        v1 = shoulder - elbow
        v2 = wrist - elbow
        angle = np.degrees(np.arccos(
            np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-6)
        ))
        print(f"Elbow angle: {angle:.1f}°")

5.4 Instance Segmentation (YOLOv8-Seg)¶

Segmentation provides pixel-level masks instead of just bounding boxes. Useful for grasping irregular objects, bin picking, and scene understanding.

from ultralytics import YOLO
import numpy as np

# Load segmentation model
model = YOLO("yolov8s-seg.pt")

results = model("cluttered_table.jpg")

for r in results:
    # Masks: shape (num_detections, H, W) — binary masks
    masks = r.masks

    if masks is not None:
        for i, mask in enumerate(masks):
            binary_mask = mask.data[0].cpu().numpy()  # (H, W)
            cls = r.names[int(r.boxes[i].cls[0].item())]
            conf = r.boxes[i].conf[0].item()

            # Compute mask properties
            pixels = np.sum(binary_mask > 0.5)
            print(f"{cls}: {conf:.2f}, area: {pixels} px")

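            # Compute centroid for grasp planning. Coordinates are in mask space,
            # which by default is the model's input resolution; use masks.xy for
            # polygons in original-image pixels.
            ys, xs = np.where(binary_mask > 0.5)
            if xs.size == 0:
                continue  # empty mask: no centroid to report
            centroid_x = np.mean(xs)
            centroid_y = np.mean(ys)
            print(f"  Centroid: ({centroid_x:.0f}, {centroid_y:.0f})")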

6. Custom Training Pipeline¶

6.1 Dataset Preparation¶

Before training, you need images with bounding box annotations.

Annotation Tools:

Tool	Type	Best For	Export Format
Roboflow	Cloud	End-to-end (annotate + augment + deploy)	YOLO txt, COCO JSON, VOC XML
CVAT	Cloud/Self-hosted	Large team projects, video annotation	YOLO txt, COCO, VOC
LabelImg	Desktop	Quick single-class annotation	YOLO txt, VOC XML
Label Studio	Cloud/Self-hosted	Multi-modal (image + text + audio)	Multiple formats

YOLO Label Format: one .txt file per image, with the same base name as the image, stored under the parallel labels/ directory shown below:

# Each line: class_id center_x center_y width height (normalized 0-1)
# Example: image.jpg has a "cup" (class 0) and a "plate" (class 1)
0 0.456 0.321 0.128 0.095
1 0.672 0.543 0.256 0.182
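
If your annotations are pixel-coordinate corner boxes, converting them to this format is a matter of normalizing by the image size. The sketch below shows one way to do it; `to_yolo_label` is an illustrative helper, and the six-decimal formatting is an arbitrary choice.

```python
def to_yolo_label(class_id, box_xyxy, img_w, img_h):
    """Convert a pixel-space (x1, y1, x2, y2) box to a YOLO label line."""
    x1, y1, x2, y2 = box_xyxy
    cx = (x1 + x2) / 2 / img_w   # normalized center x
    cy = (y1 + y2) / 2 / img_h   # normalized center y
    w = (x2 - x1) / img_w        # normalized width
    h = (y2 - y1) / img_h        # normalized height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


line = to_yolo_label(0, (160, 120, 480, 360), img_w=640, img_h=480)
print(line)  # 0 0.500000 0.500000 0.500000 0.500000
```

Real label files contain only such data lines; the `#` comment lines in the example above are there for explanation.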

Directory Structure:

my_dataset/
├── images/
│   ├── train/          # ~80% of images
│   │   ├── img001.jpg
│   │   ├── img002.jpg
│   │   └── ...
│   └── val/            # ~20% of images
│       ├── img050.jpg
│       └── ...
├── labels/
│   ├── train/
│   │   ├── img001.txt
│   │   ├── img002.txt
│   │   └── ...
│   └── val/
│       ├── img050.txt
│       └── ...
└── data.yaml           # dataset config (see below)
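
The 80/20 split can be produced reproducibly with a seeded shuffle. This sketch only decides which image names go where; copying the image and label files into the train/ and val/ folders is left out. `split_dataset` is a hypothetical helper, and the file stems are made up for the example.

```python
import random


def split_dataset(stems, val_fraction=0.2, seed=0):
    """Split image stems (file names without extension) into train and val lists."""
    stems = sorted(stems)                # fixed starting order for reproducibility
    random.Random(seed).shuffle(stems)   # seeded shuffle: same split every run
    n_val = max(1, round(len(stems) * val_fraction))
    return stems[n_val:], stems[:n_val]


train, val = split_dataset([f"img{i:03d}" for i in range(1, 51)])
print(len(train), len(val))  # 40 10
```

For video-derived datasets, consider splitting by recording session instead of by frame, so near-duplicate frames do not end up on both sides of the split.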

6.2 YAML Configuration File¶

# data.yaml: dataset configuration for Ultralytics YOLO

# Dataset root (an absolute path is the least error-prone); train/val are relative to it
path: /path/to/my_dataset
train: images/train     # Training images
val: images/val         # Validation images
# test: images/test     # Optional test set

# Number of classes
nc: 3

# Class names (IDs must match the class IDs used in the label files)
names:
  0: gripper
  1: screw
  2: bearing

6.3 Data Augmentation Strategies¶

Ultralytics applies augmentation automatically during training. Key strategies:

# Augmentation parameters (set in training args or YAML)
augmentation = {
    # Color (HSV)
    "hsv_h": 0.015,       # Hue augmentation (±1.5%)
    "hsv_s": 0.7,         # Saturation augmentation (±70%)
    "hsv_v": 0.4,         # Value/brightness augmentation (±40%)

    # Geometric
    "degrees": 10.0,      # Rotation (±10°)
    "translate": 0.1,     # Translation (±10% of image)
    "scale": 0.5,         # Scale augmentation (±50%)
    "shear": 2.0,         # Shear (±2°)
    "perspective": 0.0,   # Perspective transform
    "flipud": 0.0,        # Vertical flip probability
    "fliplr": 0.5,        # Horizontal flip probability

    # Mosaic (combines 4 images into one — key for small objects)
    "mosaic": 1.0,        # Mosaic probability (1.0 = always)

    # Mixup (blends two images)
    "mixup": 0.0,         # Mixup probability

    # Copy-Paste (copy objects between images)
    "copy_paste": 0.0,    # Copy-paste probability
}

6.4 Training Script with Full Options¶

from ultralytics import YOLO

# --- Option 1: Python API (recommended for scripts) ---
model = YOLO("yolov8s.pt")  # Start from pre-trained weights

results = model.train(
    data="data.yaml",          # Dataset config
    epochs=100,                # Number of training epochs
    imgsz=640,                 # Input image size

    # Batch size and optimization
    batch=16,                  # Batch size (use -1 for auto)
    optimizer="SGD",           # Set explicitly: the default "auto" ignores lr0 and momentum
    lr0=0.01,                  # Initial learning rate
    lrf=0.01,                  # Final learning rate factor (final LR = lr0 * lrf)
    momentum=0.937,            # SGD momentum
    weight_decay=0.0005,       # L2 regularization
    warmup_epochs=3.0,         # Warmup epochs
    warmup_momentum=0.8,       # Warmup momentum

    # Augmentation
    hsv_h=0.015,
    hsv_s=0.7,
    hsv_v=0.4,
    degrees=10.0,
    translate=0.1,
    scale=0.5,
    fliplr=0.5,
    mosaic=1.0,
    mixup=0.0,

    # Run naming and output
    name="my_custom_model",    # Experiment name
    project="runs/detect",     # Output directory
    exist_ok=False,            # False: never overwrite; a numbered run directory is created instead

    # Hardware
    device="0",                # GPU device ("cpu", "0", "0,1", etc.)
    workers=8,                 # Data loading workers

    # Checkpointing
    save_period=10,            # Save checkpoint every N epochs
    patience=50,               # Early stopping: stop after N epochs without improvement

    # Validation and data loading
    val=True,                  # Validate during training
    cache=True,                # Cache images in RAM (fast but uses memory)
)

# --- Option 2: CLI ---
# yolo detect train data=data.yaml model=yolov8s.pt epochs=100 imgsz=640 batch=16 device=0
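
The `lr0` and `lrf` arguments together define where the learning rate starts and ends. The sketch below assumes the default linear decay (that is, `cos_lr` left off) and ignores the warmup epochs; `linear_lr` is an illustrative function written for this tutorial, not an Ultralytics API.

```python
def linear_lr(epoch, epochs, lr0=0.01, lrf=0.01):
    """Learning rate at a given epoch under linear decay from lr0 to lr0 * lrf."""
    factor = max(1 - epoch / epochs, 0) * (1.0 - lrf) + lrf
    return lr0 * factor


for epoch in (0, 50, 100):
    print(epoch, round(linear_lr(epoch, epochs=100), 6))
# 0 0.01
# 50 0.00505
# 100 0.0001
```

With the values used above, training therefore ends at one hundredth of the initial learning rate.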

6.5 Evaluation Metrics¶

After training, evaluate your model on the validation set:

from ultralytics import YOLO

model = YOLO("runs/detect/my_custom_model/weights/best.pt")

# Run validation
metrics = model.val(
    data="data.yaml",
    imgsz=640,
    batch=16,
    conf=0.001,                # keep low for mAP: a high threshold truncates the PR curve
    iou=0.6,                   # IoU threshold for NMS
    max_det=300,               # Max detections per image
)

# Key metrics
print(f"mAP50:      {metrics.box.map50:.4f}")      # mAP at IoU=0.50
print(f"mAP50-95:   {metrics.box.map:.4f}")          # mAP at IoU=0.50:0.95
print(f"Precision:  {metrics.box.mp:.4f}")            # Mean precision
print(f"Recall:     {metrics.box.mr:.4f}")            # Mean recall

# Per-class results
for i, (p, r, ap50, ap) in enumerate(
    zip(metrics.box.p, metrics.box.r, metrics.box.ap50, metrics.box.ap)
):
    print(f"  Class {i}: P={p:.3f}, R={r:.3f}, mAP50={ap50:.3f}, mAP={ap:.3f}")

Metrics explained:

Precision: Of all predicted boxes, what fraction are correct? (TP / (TP + FP))
Recall: Of all ground-truth boxes, what fraction were detected? (TP / (TP + FN))
mAP50: Mean Average Precision at IoU threshold 0.5 — lenient metric
mAP50-95: Mean AP averaged over IoU thresholds 0.50 to 0.95 — strict metric (COCO standard)
IoU: Intersection over Union — overlap between predicted and ground-truth boxes
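
To show how precision, recall, and average precision relate, here is a minimal sketch that computes AP for one class from a ranked list of detections. It assumes the detections are already sorted by descending confidence and already flagged as true or false positives at the chosen IoU threshold. It uses all-point interpolation rather than COCO's 101-point sampling, so values can differ slightly from what validation tools report. `average_precision` is an illustrative name.

```python
def average_precision(is_tp, num_gt):
    """Area under the precision-recall curve for one class.

    is_tp: list of booleans, one per detection, sorted by descending confidence.
    num_gt: number of ground-truth boxes for this class.
    """
    tp = fp = 0
    recalls, precisions = [0.0], [1.0]
    for hit in is_tp:
        tp += hit
        fp += not hit
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Precision envelope: each point takes the best precision at any higher recall
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall increases
    return sum((recalls[i] - recalls[i - 1]) * precisions[i]
               for i in range(1, len(recalls)))


flags = [True, True, False, True, False]  # 3 TP, 2 FP, ranked by confidence
ap = average_precision(flags, num_gt=4)
print(round(ap, 4))  # 0.6875
```

In this example the final precision is 3/5 and the final recall is 3/4. mAP50 averages this AP over classes at IoU 0.5, and mAP50-95 additionally averages over ten IoU thresholds.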

6.6 TensorBoard Monitoring¶

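# Depending on the Ultralytics version, TensorBoard logging may be off by default.
# If no event files appear in the run directory, enable it once with:
yolo settings tensorboard=True

# Training then writes TensorBoard logs into each run directory.
# Start TensorBoard from the project directory:
tensorboard --logdir runs/detect

# Or with a specific port
tensorboard --logdir runs/detect --port 6006

# Open http://localhost:6006 in your browser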

The dashboard shows:

Training/Box loss: Localization loss (how well boxes fit objects)
Training/Class loss: Classification loss (how well classes are predicted)
Training/DFL loss: Distribution Focal Loss (for anchor-free models)
metrics/precision and metrics/recall: Per-epoch evaluation
metrics/mAP50 and metrics/mAP50-95: Overall accuracy

7. Edge Deployment¶

7.1 NVIDIA Jetson (TensorRT)¶

The Jetson Nano, Xavier, and Orin are the most common edge platforms for robotics. TensorRT engines are tied to the GPU architecture and TensorRT version they were built with, so build the engine on the Jetson itself rather than on the training machine.

# Step 1: Copy the trained PyTorch weights to your Jetson
scp yolov8s.pt jetson@192.168.1.100:~/

# Step 2: On the Jetson, build the TensorRT engine (uses the TensorRT bundled with JetPack)
python -c "
from ultralytics import YOLO
model = YOLO('yolov8s.pt')
model.export(format='engine', imgsz=640, half=True, device=0)
"

# Step 3: Run the engine on the Jetson
python -c "
from ultralytics import YOLO
model = YOLO('yolov8s.engine')
results = model('test.jpg')
results[0].save('output.jpg')
"

Jetson Optimization Tips:

Use FP16 (half=True) for up to roughly 2× speed with minimal accuracy loss
For Jetson Nano (4GB), use YOLOv8-nano or YOLO11-nano
Use imgsz=320 for maximum speed on constrained devices
Enable DLA (Deep Learning Accelerator) on Xavier and Orin modules that include one for extra throughput

7.2 OpenVINO for Intel¶

from ultralytics import YOLO

# Export to OpenVINO format
model = YOLO("yolov8s.pt")
model.export(format="openvino", imgsz=640)

# Run inference with OpenVINO
ov_model = YOLO("yolov8s_openvino_model/")
results = ov_model("test.jpg")

OpenVINO is ideal for:

Intel NUC and NUC-based robots
Intel RealSense companion computers
CPU-only inference (no GPU required)
Intel Movidius VPUs (Myriad X), on OpenVINO releases that still support them

7.3 ONNX Runtime¶

from ultralytics import YOLO

# Export
model = YOLO("yolov8s.pt")
model.export(format="onnx", imgsz=640, simplify=True)

# Run with ONNX Runtime (cross-platform, any hardware)
onnx_model = YOLO("yolov8s.onnx")
results = onnx_model("test.jpg")

ONNX Runtime provides hardware acceleration on:

NVIDIA GPU (CUDA execution provider)
Intel CPU (OpenVINO execution provider)
ARM CPU (NNAPI on Android, Core ML on iOS)
AMD GPU (ROCm execution provider)

7.4 Benchmark Table: Latency vs Accuracy¶

Benchmarks on a single image (640×640), NVIDIA Jetson Orin NX 16GB:

Model	Format	Latency (ms)	FPS	mAP50-95	Size (MB)
YOLO11-n	TensorRT FP16	3.2	312	39.5	2.6
YOLOv8-n	TensorRT FP16	3.5	286	37.3	3.2
YOLO11-s	TensorRT FP16	4.8	208	47.0	9.4
YOLOv8-s	TensorRT FP16	5.2	192	44.9	11.2
YOLO11-m	TensorRT FP16	8.1	123	51.5	20.1
YOLOv8-m	TensorRT FP16	9.3	108	50.2	25.9
YOLO11-l	TensorRT FP16	12.5	80	53.4	25.3
YOLOv8-l	TensorRT FP16	14.2	70	52.9	43.7
YOLOv10-n	ONNX	4.1	244	39.5	4.7
YOLOv12-n	TensorRT FP16	3.8	263	40.2	3.1

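Latency includes pre-processing, inference, and post-processing (NMS where the model needs it). Treat these figures as indicative: actual numbers vary with hardware, power mode, software versions, and image content, so measure on your own device before committing to a model.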
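
To reproduce this kind of measurement on your own hardware, a simple timing loop is enough. The sketch below times any zero-argument callable, discarding a few warmup calls first. The `benchmark` helper is illustrative, and the sleep-based workload is a stand-in: replace it with your own inference call, for example `lambda: model(frame, verbose=False)`.

```python
import time


def benchmark(fn, warmup=3, runs=20):
    """Return (mean_latency_ms, fps) for a zero-argument callable."""
    for _ in range(warmup):          # warmup calls are not timed
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / runs * 1000.0
    return latency_ms, 1000.0 / latency_ms


# Stand-in workload of roughly 5 ms per call
latency_ms, fps = benchmark(lambda: time.sleep(0.005))
print(f"{latency_ms:.1f} ms per call, {fps:.0f} FPS")
```

Warmup matters on GPUs and TensorRT in particular, where the first few calls include one-time initialization and are much slower than steady state.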

Model Selection Guide for Robotics:

┌────────────────────────────────────────────────────────────────┐
│                   Model Selection Flowchart                    │
│                                                                │
│  What's your FPS target?                                       │
│  │                                                             │
│  ├─ > 200 FPS → YOLO11-n or YOLOv10-n (TensorRT FP16)          │
│  ├─ 100-200 FPS → YOLOv8-s or YOLO11-s                         │
│  ├─ 50-100 FPS → YOLOv8-m or YOLO11-m                          │
│  └─ < 50 FPS → YOLOv8-l/x or YOLOv12-l (max accuracy)          │
│                                                                │
│  Edge device RAM < 4GB? → Use nano variants + imgsz=320        │
│  Need segmentation? → YOLOv8s-seg or YOLO11s-seg               │
│  Need pose? → YOLOv8s-pose                                     │
└────────────────────────────────────────────────────────────────┘

8. Comparison with Other Detectors¶

8.1 YOLO vs Alternatives¶

Feature	YOLOv8/v11	Faster R-CNN	DETR	EfficientDet
Type	Single-stage	Two-stage	Transformer	Single-stage
Speed (FPS)	80–300+	10–30	20–60	30–100
mAP50-95 (COCO)	37–55	42–45	42–50	40–55
Small objects	Good (FPN)	Good (RPN)	Moderate	Good (BiFPN)
NMS required	Yes (YOLOv10 is NMS-free)	Yes	No	Yes
Edge friendly	✅ Excellent	❌ Slow	⚠️ Moderate	✅ Good
Custom training	✅ Easy (Ultralytics)	⚠️ Complex	⚠️ Needs tuning	⚠️ Moderate
Real-time robotics	✅ First choice	❌ Too slow	⚠️ Possible	✅ Good alternative
Active development	✅ Very active	❌ Mature	✅ Active	⚠️ Slowing
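
For readers unfamiliar with the "NMS required" row: non-maximum suppression removes duplicate boxes that cover the same object. The following is a minimal pure-Python sketch of greedy, class-agnostic NMS for illustration only. Production frameworks use vectorized, per-class versions, and the default thresholds here are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thres=0.5, max_det=300):
    """Return indices of boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_thres for j in keep):
            keep.append(i)
            if len(keep) == max_det:
                break
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

Lowering `iou_thres` suppresses more aggressively, which can merge genuinely adjacent objects. Raising it lets more duplicates through.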

8.2 When to Use What¶

YOLO (v8/v11/v12): Default choice for real-time robotics. Fastest iteration cycle, best deployment ecosystem, widest hardware support.
DETR / RT-DETR: When you need end-to-end detection without NMS, or when dealing with complex scenes with many overlapping objects. RT-DETR (Real-Time DETR) is competitive with YOLO: the RT-DETR paper reports 53.0 AP on COCO at 114 FPS for RT-DETR-L on a T4 GPU.
Faster R-CNN: When speed is not critical (e.g., offline processing of recorded video) and you want a mature, well-understood two-stage baseline. Still widely used in academic benchmarks.
EfficientDet: When you need a good balance on CPU-only devices. BiFPN architecture is efficient but less flexible for deployment than YOLO.

9. Best Practices for Robotics¶

9.1 Data Collection Tips¶

Collect diverse images: Vary lighting, backgrounds, object orientations, and camera angles. A model trained under a single lighting condition often fails under another.
Capture in the deployment environment: Train on images from the actual robot workspace, not stock photos. Include the robot arm, conveyor belt, bins, and clutter.
Balance your classes: If you have 1000 images of "screw" and only 50 of "bearing", the model will be biased. Use augmentation or collect more data for rare classes.
Include edge cases: Add images with partial occlusions, blurry frames, extreme angles, and multiple overlapping objects.
Resolution matters: Capture at the resolution your robot camera will use. Training on 4K images but deploying at 640×640 is wasteful — downscale first.
Annotate carefully: A small set of consistently annotated images beats a large set of loosely annotated ones. Draw tight boxes around the visible portion of each object.
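
To check class balance before training, you can count instances per class directly from the label files. This sketch assumes labels are in the YOLO text format (one object per line, class index first) and that all label files sit in one directory. The function names and the example path are hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_instances(label_dir):
    """Count labeled objects per class index across all .txt files in label_dir."""
    counts = Counter()
    for txt in Path(label_dir).glob("*.txt"):
        for line in txt.read_text().splitlines():
            parts = line.split()
            if parts:  # skip blank lines
                counts[int(parts[0])] += 1
    return counts


def imbalance_ratio(counts):
    """Ratio of the most common class to the rarest class (1.0 means balanced)."""
    return max(counts.values()) / min(counts.values()) if counts else 0.0


if __name__ == "__main__":
    counts = count_instances("dataset/labels/train")  # hypothetical path
    print(dict(counts), imbalance_ratio(counts))
```

This counts object instances rather than images, which is usually the more relevant number for detection. A class that never appears in any label file is absent from the result, so compare the keys against your class list as well.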

9.2 Common Pitfalls¶

Pitfall	Symptom	Fix
Overfitting	High train mAP, low val mAP	More data, stronger augmentation, smaller model
Underfitting	Low mAP everywhere	Larger model, more epochs, check labels
Class imbalance	Model ignores rare classes	Oversample, use class weights, augment rare classes
Domain gap	Works in lab, fails in field	Collect field data, use domain randomization
Too many false positives	Detecting things that aren't there	Increase confidence threshold, retrain with negatives
Small object misses	Misses small screws/parts	Use higher resolution (imgsz=1280), add more small-object annotations
Slow inference	FPS too low for control loop	Use nano/small model, TensorRT, reduce input size
Blinking detections	Objects appear/disappear across frames	Use tracking (ByteTrack), temporal smoothing
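
For the "blinking detections" row, one simple form of temporal smoothing is a k-of-n vote over recent frames. The sketch below is an illustration, not a replacement for a tracker such as ByteTrack. The class name `DetectionDebouncer` and the defaults `n=5`, `k=3` are assumptions.

```python
from collections import deque


class DetectionDebouncer:
    """Report an object as present only if it was seen in at least k of the last n frames."""

    def __init__(self, n=5, k=3):
        if not 1 <= k <= n:
            raise ValueError("require 1 <= k <= n")
        self.history = deque(maxlen=n)  # oldest frame is dropped automatically
        self.k = k

    def update(self, detected):
        """Record this frame's raw result and return the smoothed decision."""
        self.history.append(bool(detected))
        return sum(self.history) >= self.k


debouncer = DetectionDebouncer(n=5, k=3)
raw = [True, True, False, True, False, False, False]
print([debouncer.update(frame) for frame in raw])
```

The trade-off is latency: a new object is reported only after it has been seen in k frames, so keep n small when the control loop is fast.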

9.3 Model Selection Guide¶

Application: Pick-and-place (tabletop)
├── Objects: 3-10 classes, well-separated
├── Speed: 30+ FPS required
├── Recommended: YOLOv8s or YOLO11s
├── Input size: 640×640
└── Export: TensorRT FP16

Application: Conveyor belt inspection
├── Objects: Small parts (screws, connectors)
├── Speed: 60+ FPS (fast belt)
├── Recommended: YOLOv8n or YOLO11n
├── Input size: 640×640 or 320×320
└── Export: TensorRT FP16

Application: Human-robot collaboration
├── Objects: People, hands, gestures
├── Speed: 30+ FPS
├── Recommended: YOLOv8s-pose (for pose)
├── Input size: 640×640
└── Export: TensorRT FP16

Application: Bin picking (cluttered)
├── Objects: Many overlapping instances
├── Speed: 10+ FPS (the pick cycle is slower)
├── Recommended: YOLOv8m-seg + point cloud from a depth camera
├── Input size: 640×640 or 1280×1280
└── Export: TensorRT FP16 + depth integration

Application: Drone / outdoor navigation
├── Objects: Vehicles, people, obstacles
├── Speed: 30+ FPS
├── Recommended: YOLOv8m or YOLO11m
├── Input size: 640×640
└── Export: TensorRT or ONNX

9.4 Production Checklist¶

Before deploying a YOLO model on a real robot:

[ ] Model validated on held-out test set (not seen during training)
[ ] Tested with real lighting conditions at deployment site
[ ] Confidence threshold tuned (too low = false positives, too high = misses)
[ ] Latency measured on target hardware (not just training GPU)
[ ] NMS parameters tuned (IoU threshold, max detections)
[ ] Failure mode tested: what happens with blank/empty scenes?
[ ] Recovery behavior defined: what does the robot do when detection fails?
[ ] Logging enabled: save detections for offline analysis and retraining
[ ] Model versioning: track which weights are deployed on which robot
[ ] Update pipeline: plan for periodic retraining with new data
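
As one way to approach the confidence-threshold item, you can sweep candidate thresholds over validation detections and compare precision and recall at each. This sketch assumes each detection has already been matched against ground truth and marked as a true or false positive. Picking the threshold by F1 is an assumption, and a robot where a missed detection costs more than a false alarm would weight recall more heavily.

```python
def precision_recall(detections, num_gt, conf_thres):
    """detections: list of (confidence, is_true_positive); num_gt: ground-truth object count."""
    kept = [is_tp for conf, is_tp in detections if conf >= conf_thres]
    tp = sum(kept)
    fp = len(kept) - tp
    precision = tp / (tp + fp) if kept else 1.0  # no predictions means no false positives
    recall = tp / num_gt if num_gt else 0.0
    return precision, recall


def best_threshold(detections, num_gt, thresholds):
    """Return the candidate threshold with the highest F1 score."""
    def f1(thres):
        p, r = precision_recall(detections, num_gt, thres)
        return 2 * p * r / (p + r) if p + r else 0.0

    return max(thresholds, key=f1)


# Toy validation results: (confidence, matched a ground-truth box?)
dets = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.3, False)]
print(best_threshold(dets, num_gt=4, thresholds=[0.25, 0.35, 0.5, 0.85]))
```

Run the sweep on images from the deployment site, since the best threshold usually shifts with lighting and background.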

10. References¶

Papers¶

YOLOv1 — You Only Look Once (Redmon et al., 2016) — Introduced single-stage detection
YOLOv2 — YOLO9000 (Redmon & Farhadi, 2017) — Batch norm, anchor boxes, multi-scale training
YOLOv3 (Redmon & Farhadi, 2018) — Darknet-53, FPN multi-scale detection
YOLOv4 (Bochkovskiy et al., 2020) — CSPDarknet, mosaic augmentation, bag of freebies
YOLOv7 (Wang et al., 2022) — E-ELAN, re-parameterization, auxiliary training
YOLOv8 (Jocher et al., 2023) — Anchor-free, decoupled head, Ultralytics ecosystem
YOLOv9 (Wang et al., 2024) — GELAN, Programmable Gradient Information
YOLOv10 (Wang et al., 2024) — NMS-free, dual label assignment
YOLO11 (Jocher et al., 2024) — C3k2 blocks, parameter efficiency
YOLOv12 (Tian et al., 2025) — Attention-centric design, flash attention
RT-DETR (Zhao et al., 2023) — Real-time end-to-end detector (DETR family)

Official Documentation¶

Ultralytics Documentation — Complete YOLOv5/v8/v11 reference
Ultralytics YOLO GitHub — Source code and issues
Ultralytics YOLOv5 GitHub — Legacy YOLOv5 repo

Deployment Guides¶

NVIDIA Jetson AI Lab — Jetson deployment tutorials
TensorRT Documentation — GPU optimization
OpenVINO Documentation — Intel deployment
ONNX Runtime — Cross-platform inference

Tutorials and Courses¶

Roboflow YOLO Tutorial — Step-by-step custom training
Ultralytics YOLO Course — Official examples
Papers With Code — Object Detection — Benchmarks and leaderboards

Last updated: 2025