YOLO Object Detection

YOLO (You Only Look Once) is the most widely used real-time object detection framework in robotics. It predicts bounding boxes and class probabilities in a single forward pass, making it fast enough for live video processing on embedded devices. This tutorial covers YOLO from first principles through edge deployment on robots.

Learning Objectives

  • Understand the core ideas behind single-stage detectors and why they dominate real-time robotics
  • Trace the evolution from YOLOv1 (2016) to YOLOv12 (2025) and pick the right version for your task
  • Train, evaluate, and deploy custom YOLO models using the Ultralytics ecosystem
  • Apply YOLO to pick-and-place, tracking, pose estimation, and instance segmentation

1. What is YOLO

1.1 Brief History

Object detection (locating and classifying objects in images) is one of the oldest and most practical problems in computer vision. Before YOLO, the state of the art was dominated by two-stage detectors like R-CNN (2014) and Faster R-CNN (2015). These systems first propose regions that might contain objects, then classify each region. They achieved high accuracy but were slow: the original R-CNN needed tens of seconds per image, and Faster R-CNN reached only about 5–7 FPS on a GPU with a VGG-16 backbone, far too slow for real-time robotics.

In 2016, Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi introduced YOLOv1 with a radical insight: treat detection as a single regression problem. Instead of a multi-stage pipeline, predict all bounding boxes and class probabilities directly from the full image in one pass of the network. YOLOv1 ran at 45 FPS — fast enough for video — and instantly reshaped the field.

Since then, the YOLO lineage has produced over a dozen major versions. The community contributions have been enormous: anchor boxes (YOLOv2), feature pyramid networks (YOLOv3), CSPNet and mosaic augmentation (YOLOv4), anchor-free detection (YOLOv8), and attention-centric designs (YOLOv12). Ultralytics unified many of these ideas into a single Python package that is now the de facto standard for YOLO deployment.

1.2 Why YOLO Matters for Robotics

Robots need to perceive their environment in real time. Consider a robotic arm picking parts off a conveyor belt: it must detect each part, estimate its position, and plan a grasp — all within the cycle time of the belt (often < 100 ms). YOLO provides:

  • Speed: 30–300+ FPS depending on model size and hardware, enabling real-time control loops
  • Accuracy: The larger modern YOLO models exceed 50 mAP50-95 on COCO, competitive with much slower detectors
  • Versatility: A single framework handles detection, segmentation, pose estimation, and tracking
  • Edge deployment: Export to TensorRT, ONNX, OpenVINO for Jetson, Intel, and ARM devices
  • Community: Tens of thousands of pre-trained weights, datasets, and deployment examples

For robotics, YOLO is not just a detector — it is the perception backbone for manipulation, navigation, inspection, and human-robot interaction.


2. How YOLO Works

2.1 The Core Insight

Traditional two-stage detectors work like this:

Input Image
┌──────────────────┐
│ Region Proposal   │  ← "Where might objects be?" (Selective Search, RPN)
│ Network           │
└────────┬─────────┘
         │  ~2000 candidate regions
┌──────────────────┐
│ Classification    │  ← "What is in each region?"
│ + Bounding Box    │
│   Refinement      │
└──────────────────┘

Two-stage detectors are accurate but slow because they process each candidate region separately.

YOLO collapses this into a single step:

Input Image
┌──────────────────┐
│                   │
│  CNN Backbone     │  ← Extract features from the entire image
│  + Neck + Head    │
│                   │
└────────┬─────────┘
┌──────────────────┐
│  S × S Grid       │
│  × (B × 5 + C)    │  ← Predict ALL boxes + classes in ONE pass
│    values         │
└──────────────────┘

2.2 Grid-Based Prediction (YOLOv1 Detail)

YOLOv1 divides the input image into an S × S grid. Each grid cell predicts:

  • B bounding boxes (each box has: x, y, w, h, confidence)
  • C class probabilities (one score per class)

The total output is a tensor of shape S × S × (B × 5 + C).

┌─────────────────────────────────────────────────┐
│                 Input Image                      │
│                                                  │
│    ┌─────┬─────┬─────┬─────┬─────┐              │
│    │ Cell│ Cell│ Cell│ Cell│ Cell│  ← 7×7 grid   │
│    ├─────┼─────┼─────┼─────┼─────┤              │
│    │ ... │ ... │ ... │ ... │ ... │              │
│    └─────┴─────┴─────┴─────┴─────┘              │
│                                                  │
│    Each cell predicts:                           │
│    - 2 bounding boxes (x, y, w, h, conf)        │
│    - 20 class probabilities (Pascal VOC)         │
│                                                  │
│    Output tensor: 7 × 7 × 30                    │
└─────────────────────────────────────────────────┘
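
The shape arithmetic can be checked with a few lines of Python. This is a minimal sketch; the function name `yolov1_output_shape` is illustrative and not part of any library.

```python
def yolov1_output_shape(S=7, B=2, C=20):
    """Return the YOLOv1 output tensor shape S x S x (B*5 + C).

    Each of the B boxes contributes 5 values (x, y, w, h, confidence);
    the C class probabilities are shared by the whole cell.
    """
    return (S, S, B * 5 + C)


shape = yolov1_output_shape()          # Pascal VOC settings from the diagram
total_values = shape[0] * shape[1] * shape[2]
print(shape, total_values)             # (7, 7, 30) 1470
```

Because class probabilities are predicted once per cell rather than once per box, a YOLOv1 cell can only report one class, which is part of why it struggles with crowded scenes.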

2.3 Loss Function

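YOLOv1 optimizes a multi-part loss combining:

  1. Localization loss: Sum-squared error on box coordinates (x, y, √w, √h); taking the square root of width and height keeps errors on small boxes from being swamped by errors on large ones
  2. Confidence loss: Squared error on the objectness score (does this box contain an object?)
  3. Classification loss: Squared error on class probabilities

Only cells that contain an object contribute to the localization and classification losses. The confidence loss covers every predicted box, but boxes with no object are down-weighted so that the many empty cells do not dominate training. The YOLOv1 paper uses λ_coord = 5 and λ_noobj = 0.5. Later versions replace these squared-error terms with IoU-based box losses and cross-entropy.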
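
Written out, the YOLOv1 loss has the following form. This follows the structure given in the YOLOv1 paper; the indicator 𝟙_ij^obj is 1 when box j of cell i is responsible for an object, 𝟙_ij^noobj is its complement, and hatted symbols are predictions.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  &+ \lambda_{\text{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
           + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
  &+ \sum_{i=1}^{S^2} \sum_{j=1}^{B}
      \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
  &+ \lambda_{\text{noobj}} \sum_{i=1}^{S^2} \sum_{j=1}^{B}
      \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
  &+ \sum_{i=1}^{S^2} \mathbb{1}_{i}^{\text{obj}}
      \sum_{c \,\in\, \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The first two lines are the localization term, the third and fourth are the confidence term for boxes with and without objects, and the last line is the classification term.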

2.4 From YOLOv1 to Modern Architectures

Modern YOLO models (v5, v8, v11) follow a three-part architecture:

┌─────────────────────────────────────────────────────────────────┐
│                    YOLO Architecture (v5/v8/v11)                 │
│                                                                  │
│  ┌─────────────┐                                                │
│  │             │                                                │
│  │   Backbone   │  Extracts multi-scale features                 │
│  │  (CSPNet /   │  e.g., CSPDarknet, C2f blocks                 │
│  │   C2f)       │                                                │
│  │             │                                                │
│  └──────┬──────┘                                                │
│         │  P3, P4, P5 (1/8, 1/16, 1/32 resolution)             │
│         ▼                                                       │
│  ┌─────────────┐                                                │
│  │             │                                                │
│  │    Neck      │  Fuses multi-scale features                    │
│  │  (PANet /    │  Top-down + bottom-up path aggregation         │
│  │   SPPF)      │                                                │
│  │             │                                                │
│  └──────┬──────┘                                                │
│         │  F3, F4, F5 (enriched features)                       │
│         ▼                                                       │
│  ┌─────────────┐                                                │
│  │             │                                                │
│  │    Head      │  Predicts boxes, scores, classes               │
│  │  (Anchor /   │  One prediction per grid cell                  │
│  │  Anchor-free)│                                                │
│  │             │                                                │
│  └─────────────┘                                                │
│         │                                                        │
│         ▼                                                        │
│  Bounding boxes + Class scores + Confidence                     │
└─────────────────────────────────────────────────────────────────┘

Backbone: A deep CNN (e.g., CSPDarknet53, EfficientNet) that extracts hierarchical features. Larger models (L, X) use deeper backbones with more channels.

Neck: Feature Pyramid Network (FPN) + Path Aggregation Network (PANet). The FPN top-down pathway merges high-level semantic features with low-level spatial features. The PANet bottom-up pathway adds a second pass for stronger gradient flow.

Head: The detection head produces predictions at three scales:

  • P3 (1/8): Small objects (e.g., screws, small parts)
  • P4 (1/16): Medium objects (e.g., cups, tools)
  • P5 (1/32): Large objects (e.g., boxes, people)
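
To make the three scales concrete, the sketch below computes the prediction grid at each stride for a square input. The function name `prediction_grid` is illustrative; the strides 8, 16, and 32 are the P3, P4, and P5 levels described above.

```python
def prediction_grid(imgsz=640, strides=(8, 16, 32)):
    """Return ({stride: (rows, cols)}, total_cells) for a square input image."""
    grids = {s: (imgsz // s, imgsz // s) for s in strides}
    total = sum(rows * cols for rows, cols in grids.values())
    return grids, total


grids, total = prediction_grid(640)
print(grids)   # {8: (80, 80), 16: (40, 40), 32: (20, 20)}
print(total)   # 8400
```

For an anchor-free head with one prediction per cell, a 640×640 input therefore yields 8400 candidate boxes before confidence filtering. Halving the input size to 320 cuts this to 2100, which is one reason smaller inputs run faster.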

2.5 Anchor-Free vs Anchor-Based

Anchor-based (YOLOv2–v5): The model predicts offsets from predefined anchor boxes (prior shapes computed via k-means on the training set). Each anchor produces a bounding box by adjusting x, y, w, h offsets.

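Anchor-free (YOLOv8+): The model predicts box geometry directly for each grid cell, with no predefined anchor shapes (in YOLOv8, as distances from the cell center to the four box edges). During training, a task-aligned assigner matches predictions to ground truth. Removing anchors eliminates a dataset-specific hyperparameter and simplifies deployment.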

Anchor-based:  prediction = anchor + offset
Anchor-free:   prediction = direct (center_x, center_y, width, height)
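
The two decoding schemes can be sketched as follows. This is a simplified illustration, not any specific version's implementation: the anchor-based function uses the YOLOv2/v3-style formulas, and the anchor-free function assumes the head outputs edge distances in grid units (real YOLOv8 heads predict each distance as a distribution). All function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_anchor_based(t, cell, anchor_wh, stride):
    """Decode raw outputs t = (tx, ty, tw, th) against an anchor prior.

    cell is the (col, row) grid index, anchor_wh the anchor size in pixels.
    Returns (center_x, center_y, width, height) in pixels.
    """
    tx, ty, tw, th = t
    col, row = cell
    cx = (sigmoid(tx) + col) * stride      # center stays inside its cell
    cy = (sigmoid(ty) + row) * stride
    w = anchor_wh[0] * math.exp(tw)        # anchor scaled by a learned factor
    h = anchor_wh[1] * math.exp(th)
    return cx, cy, w, h


def decode_anchor_free(ltrb, cell, stride):
    """Decode distances (left, top, right, bottom) from the cell center.

    Distances are in grid units. Returns (x1, y1, x2, y2) in pixels.
    """
    left, top, right, bottom = ltrb
    ax, ay = cell[0] + 0.5, cell[1] + 0.5  # cell center in grid units
    return ((ax - left) * stride, (ay - top) * stride,
            (ax + right) * stride, (ay + bottom) * stride)


print(decode_anchor_based((0, 0, 0, 0), cell=(3, 2), anchor_wh=(32, 64), stride=16))
# (56.0, 40.0, 32.0, 64.0)
print(decode_anchor_free((1, 1, 1, 1), cell=(3, 2), stride=16))
# (40.0, 24.0, 72.0, 56.0)
```

With all-zero outputs, the anchor-based decoder returns the anchor itself centered in the cell, which shows why badly chosen anchors hurt: the network has to learn large corrections.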

3. YOLO Versions

3.1 Evolution Table

| Version | Year | Key Innovation | Backbone | Neck | Head | Accuracy (as reported) | FPS (as reported) | Notable |
|---|---|---|---|---|---|---|---|---|
| YOLOv1 | 2016 | Single-pass detection, grid prediction | Custom 24-layer CNN (GoogLeNet-inspired) | None | FC layers | 63.4 (VOC mAP) | 45 | First real-time detector |
| YOLOv2 | 2017 | Batch norm, anchor boxes, multi-scale training | Darknet-19 | Passthrough | Conv | 78.6 (VOC mAP) | 40 | Trained on 9000+ classes (YOLO9000) |
| YOLOv3 | 2018 | Darknet-53, FPN multi-scale detection | Darknet-53 | FPN | 3-scale | 57.9 (mAP50) | 20 | Best balance at the time |
| YOLOv4 | 2020 | CSPDarknet, PANet, mosaic augmentation | CSPDarknet53 | SPP + PANet | Anchor | 65.7 (mAP50) | 62 | Bag of freebies + specials |
| YOLOv5 | 2020 | PyTorch native, auto-anchor, easy deployment | CSPDarknet53 | PANet + SPPF | Anchor | 68.9 (mAP50) | 140 | Most deployed in production |
| YOLOv7 | 2022 | E-ELAN, re-parameterization, model scaling | E-ELAN | ELAN-PAN | Anchor | 71.2 (mAP50) | 161 | Fastest at its release |
| YOLOv8 | 2023 | Anchor-free, decoupled head, Ultralytics API | CSPNet (C2f) | PANet + SPPF | Anchor-free | 53.9 (mAP50-95) | 280 | Unified detect/seg/pose/classify |
| YOLOv9 | 2024 | GELAN, PGI (Programmable Gradient Information) | GELAN | GELAN | Anchor-free | 55.6 (mAP50-95) | 300+ | Targets the information bottleneck |
| YOLOv10 | 2024 | NMS-free, dual label assignment | CSPNet | PANet | Anchor-free | 54.4 (mAP50-95) | 350+ | Eliminates NMS post-processing |
| YOLO11 | 2024 | C3k2 blocks, improved efficiency | C3k2-CSPNet | C2f2 | Anchor-free | 54.7 (mAP50-95) | 320+ | Most parameter-efficient |
| YOLOv12 | 2025 | Attention-centric, dynamic resolution, flash attention | A*-CSPNet | A*-PAN | Anchor-free | 56.0 (mAP50-95) | 340+ | Combines CNN speed with transformer accuracy |

Accuracy values are approximate and not directly comparable across rows: YOLOv1 and YOLOv2 report Pascal VOC mAP, YOLOv3 through YOLOv7 report COCO mAP50, and YOLOv8 onward report COCO mAP50-95 for the largest variant at the default 640×640 input. FPS figures were likewise measured on different GPUs (Titan X for YOLOv1 through YOLOv3, V100 or A100 for later versions) and with different model sizes.

3.2 Key Innovations by Version

YOLOv1 (2016) — The Original

  • Divides image into S×S grid; each cell predicts B boxes + C classes
  • End-to-end differentiable — no region proposals
  • Limitation: struggles with small objects, many instances of the same class

YOLOv2 / YOLO9000 (2017) — Faster and More Classes

  • Added batch normalization after every convolutional layer (+2% mAP)
  • Introduced anchor boxes via k-means clustering on training data
  • Multi-scale training: randomly resize input during training (320–608 pixels)
  • Joint training on detection + classification (9418 classes from WordTree)

YOLOv3 (2018) — Multi-Scale Detection

  • Darknet-53 backbone: 53 convolutional layers with residual connections
  • Feature Pyramid Network (FPN): predictions at 3 scales (13×13, 26×26, 52×52 for a 416×416 input)
  • Binary cross-entropy for class prediction (handles multi-label)
  • The go-to version for years in production robotics

YOLOv4 (2020) — Bag of Freebies

  • CSPDarknet53: Cross-Stage Partial connections reduce computation
  • SPP + PANet neck for richer feature aggregation
  • Mosaic augmentation: 4-image collage that teaches the model about context
  • Mish activation: smooth non-linearity replacing Leaky ReLU
  • Dozens of "free" training tricks: CutMix, DropBlock, label smoothing

YOLOv5 (2020) — The Deployment King

  • Written in PyTorch from the start (YOLOv1–v4 were built on Darknet, a C/CUDA framework)
  • Auto-anchor: Automatically learns anchor box sizes for your dataset
  • Integrated export to ONNX, TensorRT, CoreML, TFLite
  • Variants: n (nano), s (small), m (medium), l (large), x (extra-large)
  • Most widely deployed YOLO version in industrial robotics

YOLOv7 (2022) — Efficiency Champion

  • E-ELAN (Extended Efficient Layer Aggregation Network): optimized feature fusion
  • Re-parameterization: Train with complex architecture, deploy with simpler one
  • Auxiliary head during training for better gradient flow
  • State-of-the-art speed-accuracy tradeoff at release

YOLOv8 (2023) — The Modern Standard

  • Anchor-free detection: eliminates anchor box hyperparameters
  • Decoupled head: separate branches for classification and regression
  • Task-Aligned Assigner: dynamic positive sample assignment
  • Unified API: ultralytics package supports detect, segment, pose, classify
  • Variants: n, s, m, l, x — choose by speed/accuracy budget

YOLOv9 (2024) — Information Bottleneck Solved

  • GELAN (Generalized Efficient Layer Aggregation Network): new macro-architecture
  • PGI (Programmable Gradient Information): prevents information loss in deep networks
  • Shows that an auxiliary reversible branch, used only during training, can improve accuracy without adding inference cost
  • Smallest model (YOLOv9-t) achieves about 38% mAP50-95 with only 2M parameters

YOLOv10 (2024) — NMS-Free

  • Consistent dual assignments: trains a one-to-many head alongside a one-to-one head, then keeps only the one-to-one head at inference
  • Holistic efficiency-accuracy driven design: trims redundant computation across the backbone, neck, and head
  • Eliminates the NMS post-processing step, reducing latency by 1–3 ms
  • Lightweight architectures: n (2.3M params), s (7.2M), m (15.4M)
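
For context on what YOLOv10 removes, here is a minimal pure-Python sketch of greedy non-maximum suppression (NMS), the post-processing step other YOLO versions run after inference. It is an illustration of the technique, not the Ultralytics implementation; the function names and the 0.5 threshold are assumptions for the example.

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thres=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it, repeat.

    Returns the indices of the kept boxes, highest score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard remaining boxes that overlap the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thres]
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with IoU of about 0.82, so it is suppressed as a duplicate. The loop's cost grows with the number of candidate boxes, which is why removing it helps latency on crowded scenes.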

YOLO11 (2024) — Efficient Next-Gen

  • C3k2 blocks: smaller, more efficient cross-stage connections
  • YOLO11m uses about 22% fewer parameters than YOLOv8m while reaching a higher COCO mAP
  • Improved feature extraction at all scales
  • Available in n, s, m, l, x variants

YOLOv12 (2025) — Attention-Centric

  • A*-CSPNet: Replaces some convolution blocks with attention mechanisms
  • Flash attention for memory-efficient self-attention
  • Dynamic resolution: adapts to input size without retraining
  • Combines the speed of CNNs with the accuracy of vision transformers
  • Best performance for high-resolution detection tasks

3.3 How to Choose a Version

Decision guide:

  Need fastest inference? ──────── YOLOv8-n or YOLOv10-n
  Need best accuracy? ──────────── YOLOv12 or YOLOv8-x  
  Need NMS-free (edge)? ────────── YOLOv10
  Need most deployment options? ── YOLOv5 (widest support)
  Need segmentation + pose? ────── YOLOv8 (unified API)
  Need minimal compute? ─────────── YOLOv9-t or YOLO11-n
  Production/industrial? ────────── YOLOv5 or YOLOv8 (most battle-tested)

4. Quick Start with Ultralytics

4.1 Installation

# Create a virtual environment (recommended)
python -m venv yolo_env
source yolo_env/bin/activate    # Linux/Mac
# yolo_env\Scripts\activate     # Windows

# Install Ultralytics (includes YOLOv5, v8, v11)
pip install ultralytics

# Verify installation
python -c "import ultralytics; print(ultralytics.__version__)"
# Output: 8.x.x

4.2 Inference on an Image

from ultralytics import YOLO

# Load a pre-trained model (downloads weights automatically)
model = YOLO("yolov8n.pt")  # nano — fastest, smallest

# Run inference on an image
results = model("https://ultralytics.com/images/zidane.jpg")

# Process results
for result in results:
    # Bounding boxes: xyxy format [x1, y1, x2, y2]
    boxes = result.boxes
    print(f"Found {len(boxes)} objects")

    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # coordinates
        confidence = box.conf[0].item()           # confidence score
        class_id = int(box.cls[0].item())         # class index
        class_name = result.names[class_id]        # class name

        print(f"  {class_name}: {confidence:.2f} at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}]")

# Save annotated image
results[0].save("output.jpg")

4.3 Inference on Video / Webcam

from ultralytics import YOLO
import cv2

model = YOLO("yolov8s.pt")  # small model — good speed/accuracy balance

# --- Option 1: Process a video file ---
results = model.predict(
    source="input_video.mp4",
    save=True,              # save annotated video
    conf=0.25,              # confidence threshold
    imgsz=640,              # inference size
    classes=[0, 1],         # only detect classes 0 (person) and 1 (bicycle)
    stream=True,            # memory-efficient streaming
)

for r in results:
    # r.boxes contains detections for this frame
    pass

# --- Option 2: Process webcam (real-time) ---
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # YOLO inference directly on numpy array
    results = model(frame, verbose=False)

    # Display results
    annotated = results[0].plot()  # draw boxes on frame
    cv2.imshow("YOLO Detection", annotated)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

4.4 Export to ONNX / TensorRT

from ultralytics import YOLO

model = YOLO("yolov8s.pt")

# Export to ONNX (cross-platform deployment)
model.export(format="onnx", imgsz=640, simplify=True)
# Creates: yolov8s.onnx

# Export to TensorRT (NVIDIA GPU optimized)
model.export(format="engine", imgsz=640, half=True)  # half = FP16
# Creates: yolov8s.engine

# Export to OpenVINO (Intel CPU/GPU/VPU)
model.export(format="openvino", imgsz=640)
# Creates: yolov8s_openvino_model/

# Export to CoreML (Apple Silicon)
model.export(format="coreml", imgsz=640)

# Export to TFLite (mobile/edge)
model.export(format="tflite", imgsz=320)  # smaller input for edge

4.5 Running Exported Models

from ultralytics import YOLO

# Load an ONNX model
model = YOLO("yolov8s.onnx")
results = model("test.jpg")

# Load a TensorRT engine
model = YOLO("yolov8s.engine")
results = model("test.jpg")

# Both produce the same output format as the PyTorch model

5. YOLO for Robotics Applications

5.1 Object Detection for Pick-and-Place

The most common robotics use of YOLO: detecting objects on a table or conveyor belt and providing their positions for a manipulator.

from ultralytics import YOLO
import numpy as np

# Load model trained on your object classes
model = YOLO("runs/detect/my_objects/weights/best.pt")

def detect_objects(image):
    """Detect objects and return their center positions in the image."""
    results = model(image, verbose=False)[0]

    objects = []
    for box in results.boxes:
        x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
        conf = box.conf[0].item()
        cls = int(box.cls[0].item())

        # Compute center point (pixel coordinates)
        cx = (x1 + x2) / 2
        cy = (y1 + y2) / 2
        w = x2 - x1
        h = y2 - y1

        objects.append({
            "class": results.names[cls],
            "confidence": conf,
            "center_px": (cx, cy),
            "bbox": (x1, y1, x2, y2),
            "size_px": (w, h),
        })

    return objects

# --- Integration with robot coordinate transform ---
def pixel_to_robot(cx, cy, camera_matrix, extrinsic_matrix, depth_m=0.5):
    """Convert pixel coordinates to robot base frame.

    Assumes an undistorted image and a known depth along the optical axis
    (for example, a fixed camera-to-table distance in meters).
    """
    # Pinhole intrinsics: focal lengths and principal point
    fx, fy = camera_matrix[0, 0], camera_matrix[1, 1]
    cx_cam, cy_cam = camera_matrix[0, 2], camera_matrix[1, 2]

    # Back-project to 3D in the camera frame using the known depth
    Z = depth_m
    X = (cx - cx_cam) * Z / fx
    Y = (cy - cy_cam) * Z / fy

    # Transform to robot base frame
    point_cam = np.array([X, Y, Z, 1.0])
    point_robot = extrinsic_matrix @ point_cam

    return point_robot[:3]  # x, y, z in robot frame

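# --- Example usage ---
# detect -> transform -> plan grasp -> execute
# Placeholders supplied by your own setup: `camera` (camera driver object),
# K (3x3 intrinsic matrix), and T_base_cam (4x4 camera-to-base transform
# from hand-eye calibration)
image = camera.capture()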
objects = detect_objects(image)

for obj in objects:
    robot_pos = pixel_to_robot(*obj["center_px"], K, T_base_cam)
    print(f"Object: {obj['class']}, Robot position: {robot_pos}")
    # robot_arm.move_to(robot_pos)
    # robot_arm.grasp()

5.2 Real-Time Tracking with DeepSORT / ByteTrack

Object detection gives you boxes in each frame, but for robotics you often need tracking — consistent object IDs across frames. ByteTrack is the modern choice.

from ultralytics import YOLO

# ByteTrack ships with the ultralytics package and is selected through the
# tracker config below, so no extra import or package is needed
model = YOLO("yolov8s.pt")

# Enable tracking in predict
results = model.track(
    source="conveyor_video.mp4",
    tracker="bytetrack.yaml",  # ByteTrack config
    persist=True,              # maintain track IDs across frames
    save=True,
    conf=0.3,
)

# Process tracked results
for r in results:
    for box in r.boxes:
        track_id = box.id.item() if box.id is not None else None
        cls = r.names[int(box.cls[0].item())]
        xyxy = box.xyxy[0].tolist()

        if track_id is not None:
            print(f"Track {int(track_id)}: {cls} at {xyxy}")
            # Use track_id to maintain state: count, history, prediction

5.3 Pose Estimation (YOLOv8-Pose)

YOLOv8-Pose detects human keypoints (or custom keypoints) alongside bounding boxes. Useful for human-robot collaboration, gesture recognition, and ergonomic monitoring.

from ultralytics import YOLO
import numpy as np

# Load pose model (trained on COCO keypoints: 17 keypoints)
model = YOLO("yolov8s-pose.pt")

results = model("person_image.jpg")

for r in results:
    # Keypoints: shape (num_persons, num_keypoints, 3) — x, y, visibility
    keypoints = r.keypoints

    if keypoints is not None and len(keypoints) > 0:  # skip images with no detected person
        kpts = keypoints.xy[0].cpu().numpy()  # first person

        # COCO keypoint indices:
        # 0: nose, 1: left_eye, 2: right_eye, 3: left_ear, 4: right_ear
        # 5: left_shoulder, 6: right_shoulder, 7: left_elbow, 8: right_elbow
        # 9: left_wrist, 10: right_wrist, 11: left_hip, 12: right_hip
        # 13: left_knee, 14: right_knee, 15: left_ankle, 16: right_ankle

        # Compute arm angle for robot interaction
        shoulder = kpts[5]
        elbow = kpts[7]
        wrist = kpts[9]

        # Angle at elbow
        v1 = shoulder - elbow
        v2 = wrist - elbow
        angle = np.degrees(np.arccos(
            np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-6)
        ))
        print(f"Elbow angle: {angle:.1f}°")

5.4 Instance Segmentation (YOLOv8-Seg)

Segmentation provides pixel-level masks instead of just bounding boxes. Useful for grasping irregular objects, bin picking, and scene understanding.

from ultralytics import YOLO
import numpy as np

# Load segmentation model
model = YOLO("yolov8s-seg.pt")

results = model("cluttered_table.jpg")

for r in results:
    # Masks: shape (num_detections, H, W) — binary masks
    masks = r.masks

    if masks is not None:
        for i, mask in enumerate(masks):
            binary_mask = mask.data[0].cpu().numpy()  # (H, W)
            cls = r.names[int(r.boxes[i].cls[0].item())]
            conf = r.boxes[i].conf[0].item()

            # Compute mask properties
            pixels = np.sum(binary_mask > 0.5)
            print(f"{cls}: {conf:.2f}, area: {pixels} px")

            # Compute centroid for grasp planning
            ys, xs = np.where(binary_mask > 0.5)
            centroid_x = np.mean(xs)
            centroid_y = np.mean(ys)
            print(f"  Centroid: ({centroid_x:.0f}, {centroid_y:.0f})")

6. Custom Training Pipeline

6.1 Dataset Preparation

Before training, you need images with bounding box annotations.

Annotation Tools:

| Tool | Type | Best For | Export Format |
|---|---|---|---|
| Roboflow | Cloud | End-to-end (annotate + augment + deploy) | YOLO txt, COCO JSON, VOC XML |
| CVAT | Cloud/Self-hosted | Large team projects, video annotation | YOLO txt, COCO, VOC |
| LabelImg | Desktop | Quick single-class annotation | YOLO txt, VOC XML |
| Label Studio | Cloud/Self-hosted | Multi-modal (image + text + audio) | Multiple formats |

YOLO Label Format: one .txt file per image, with the same base name as the image, stored in the parallel labels/ directory (see the directory structure below):

# Each line: class_id center_x center_y width height (normalized 0-1)
# Example: image.jpg has a "cup" (class 0) and a "plate" (class 1)
0 0.456 0.321 0.128 0.095
1 0.672 0.543 0.256 0.182
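
Annotation tools usually export this format for you, but when converting from pixel-coordinate boxes by hand the arithmetic looks like this. A minimal sketch; the function names are illustrative, and the input box is assumed to be (x1, y1, x2, y2) in pixels.

```python
def xyxy_to_yolo(box, img_w, img_h):
    """Convert a pixel box (x1, y1, x2, y2) to normalized (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return cx, cy, w, h


def yolo_label_line(class_id, box, img_w, img_h):
    """Format one line of a YOLO label file."""
    cx, cy, w, h = xyxy_to_yolo(box, img_w, img_h)
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# A box covering the central quarter of a 640x480 image
print(yolo_label_line(0, (160, 120, 480, 360), 640, 480))
# 0 0.500000 0.500000 0.500000 0.500000
```

Note that x values are divided by the image width and y values by the image height, so the same pixel box yields different labels on images with different aspect ratios.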

Directory Structure:

my_dataset/
├── images/
│   ├── train/          # ~80% of images
│   │   ├── img001.jpg
│   │   ├── img002.jpg
│   │   └── ...
│   └── val/            # ~20% of images
│       ├── img050.jpg
│       └── ...
├── labels/
│   ├── train/
│   │   ├── img001.txt
│   │   ├── img002.txt
│   │   └── ...
│   └── val/
│       ├── img050.txt
│       └── ...
└── data.yaml           # dataset config (see below)
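
One way to produce the roughly 80/20 split is a seeded shuffle over the image file names. This is a sketch under the assumption that every image has a label file with the same base name; the function name `split_train_val` is illustrative.

```python
import random


def split_train_val(image_names, val_fraction=0.2, seed=0):
    """Shuffle image names reproducibly and split them into (train, val) lists."""
    names = sorted(image_names)            # sort so the result ignores input order
    random.Random(seed).shuffle(names)     # fixed seed gives the same split every run
    n_val = max(1, round(len(names) * val_fraction))
    return names[n_val:], names[:n_val]


names = [f"img{i:03d}.jpg" for i in range(1, 11)]
train, val = split_train_val(names)
print(len(train), len(val))  # 8 2
```

After splitting, move each image into images/train or images/val and its matching .txt file into the corresponding labels/ subdirectory. For video-derived datasets, consider splitting by recording rather than by frame so near-identical frames do not end up in both sets.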

6.2 YAML Configuration File

# data.yaml — Dataset configuration for Ultralytics YOLO
# Paths can be absolute or relative to this file

# Dataset paths
train: ../my_dataset/images/train     # Training images
val: ../my_dataset/images/val          # Validation images
# test: ../my_dataset/images/test      # Optional test set

# Number of classes
nc: 3

# Class names (must match annotation class IDs)
names:
  0: gripper
  1: screw
  2: bearing

6.3 Data Augmentation Strategies

Ultralytics applies augmentation automatically during training. Key strategies:

# Augmentation parameters (set in training args or YAML)
augmentation = {
    # Color (HSV)
    "hsv_h": 0.015,       # Hue augmentation (±1.5%)
    "hsv_s": 0.7,         # Saturation augmentation (±70%)
    "hsv_v": 0.4,         # Value/brightness augmentation (±40%)

    # Geometric
    "degrees": 10.0,      # Rotation (±10°)
    "translate": 0.1,     # Translation (±10% of image)
    "scale": 0.5,         # Scale augmentation (±50%)
    "shear": 2.0,         # Shear (±2°)
    "perspective": 0.0,   # Perspective transform
    "flipud": 0.0,        # Vertical flip probability
    "fliplr": 0.5,        # Horizontal flip probability

    # Mosaic (combines 4 images into one — key for small objects)
    "mosaic": 1.0,        # Mosaic probability (1.0 = always)

    # Mixup (blends two images)
    "mixup": 0.0,         # Mixup probability

    # Copy-Paste (copy objects between images)
    "copy_paste": 0.0,    # Copy-paste probability
}

6.4 Training Script with Full Options

from ultralytics import YOLO

# --- Option 1: Python API (recommended for scripts) ---
model = YOLO("yolov8s.pt")  # Start from pre-trained weights

results = model.train(
    data="data.yaml",          # Dataset config
    epochs=100,                # Number of training epochs
    imgsz=640,                 # Input image size

    # Batch size and optimization
    batch=16,                  # Batch size (use -1 for auto)
    lr0=0.01,                  # Initial learning rate
    lrf=0.01,                  # Final learning rate (lr0 * lrf)
    momentum=0.937,            # SGD momentum
    weight_decay=0.0005,       # L2 regularization
    warmup_epochs=3.0,         # Warmup epochs
    warmup_momentum=0.8,       # Warmup momentum

    # Augmentation
    hsv_h=0.015,
    hsv_s=0.7,
    hsv_v=0.4,
    degrees=10.0,
    translate=0.1,
    scale=0.5,
    fliplr=0.5,
    mosaic=1.0,
    mixup=0.0,

    # Run naming and output
    name="my_custom_model",    # Experiment name
    project="runs/detect",     # Output directory
    exist_ok=False,            # If True, reuse (overwrite) an existing run directory

    # Hardware
    device="0",                # GPU device ("cpu", "0", "0,1", etc.)
    workers=8,                 # Data loading workers

    # Checkpointing
    save_period=10,            # Save checkpoint every N epochs
    patience=50,               # Stop early after N epochs without improvement

    # Validation and data loading
    val=True,                  # Validate during training
    cache=True,                # Cache images in RAM (fast but uses memory)
)

# --- Option 2: CLI ---
# yolo detect train data=data.yaml model=yolov8s.pt epochs=100 imgsz=640 batch=16 device=0

6.5 Evaluation Metrics

After training, evaluate your model on the validation set:

from ultralytics import YOLO

model = YOLO("runs/detect/my_custom_model/weights/best.pt")

# Run validation
metrics = model.val(
    data="data.yaml",
    imgsz=640,
    batch=16,
    conf=0.25,                 # Confidence threshold (the val default is 0.001; higher values lower the reported mAP)
    iou=0.6,                   # IoU threshold for NMS
    max_det=300,               # Max detections per image
)

# Key metrics
print(f"mAP50:      {metrics.box.map50:.4f}")      # mAP at IoU=0.50
print(f"mAP50-95:   {metrics.box.map:.4f}")          # mAP at IoU=0.50:0.95
print(f"Precision:  {metrics.box.mp:.4f}")            # Mean precision
print(f"Recall:     {metrics.box.mr:.4f}")            # Mean recall

# Per-class results
for i, (p, r, ap50, ap) in enumerate(
    zip(metrics.box.p, metrics.box.r, metrics.box.ap50, metrics.box.ap)
):
    print(f"  Class {i}: P={p:.3f}, R={r:.3f}, mAP50={ap50:.3f}, mAP={ap:.3f}")

Metrics explained:

  • Precision: Of all predicted boxes, what fraction are correct? (TP / (TP + FP))
  • Recall: Of all ground-truth boxes, what fraction were detected? (TP / (TP + FN))
  • mAP50: Mean Average Precision at IoU threshold 0.5 — lenient metric
  • mAP50-95: Mean AP averaged over IoU thresholds 0.50 to 0.95 — strict metric (COCO standard)
  • IoU: Intersection over Union — overlap between predicted and ground-truth boxes
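
The definitions above can be made concrete with a small single-class, single-image sketch. It is an illustration of how true and false positives are counted, not the Ultralytics validator; the function names are illustrative and boxes are (x1, y1, x2, y2).

```python
def box_iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, gts, iou_thres=0.5):
    """Match predictions to ground truth and return (precision, recall).

    preds is a list of (box, confidence); gts is a list of boxes.
    Predictions are processed in descending confidence, and each
    ground-truth box can be matched at most once.
    """
    matched = set()
    tp = 0
    for box, _conf in sorted(preds, key=lambda p: p[1], reverse=True):
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            overlap = box_iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j is not None and best_iou >= iou_thres:
            matched.add(best_j)   # true positive
            tp += 1
    fp = len(preds) - tp          # predictions with no matching object
    fn = len(gts) - tp            # objects that were missed
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((0, 0, 10, 9), 0.9), ((50, 50, 60, 60), 0.6)]
precision, recall = precision_recall(preds, gts)
print(precision, recall)  # 0.5 0.5
```

Raising `iou_thres` toward 0.95 turns near-misses into false positives, which is why mAP50-95 is always lower than mAP50 for the same model.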

6.6 TensorBoard Monitoring

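# TensorBoard logs are written to the run directory during training.
# If no event files appear, logging may be disabled in your Ultralytics
# settings; enable it with: yolo settings tensorboard=True
# Start TensorBoard: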

# From the project directory
tensorboard --logdir runs/detect

# Or with a specific port
tensorboard --logdir runs/detect --port 6006

# Open http://localhost:6006 in your browser

The dashboard shows:

  • Training/Box loss: Localization loss (how well boxes fit objects)
  • Training/Class loss: Classification loss (how well classes are predicted)
  • Training/DFL loss: Distribution Focal Loss (for anchor-free models)
  • metrics/precision and metrics/recall: Per-epoch evaluation
  • metrics/mAP50 and metrics/mAP50-95: Overall accuracy


7. Edge Deployment

7.1 NVIDIA Jetson (TensorRT)

The Jetson Nano, Xavier, and Orin are the most common edge platforms for robotics.

# Step 1: Copy the PyTorch weights to your Jetson
# (a TensorRT engine is tied to the GPU and TensorRT version it was built
# with, so an engine built on a desktop GPU will not load on a Jetson)
scp yolov8s.pt jetson@192.168.1.100:~/

# Step 2: Build the TensorRT engine on the Jetson itself
python -c "
from ultralytics import YOLO
model = YOLO('yolov8s.pt')
model.export(format='engine', imgsz=640, half=True, device=0)
"

# Step 3: Run the engine on the Jetson (TensorRT ships with JetPack)
python -c "
from ultralytics import YOLO
model = YOLO('yolov8s.engine')
results = model('test.jpg')
results[0].save('output.jpg')
"

Jetson Optimization Tips:

  • Use FP16 (half=True) for up to roughly 2× speed with minimal accuracy loss
  • For Jetson Nano (4GB), use YOLOv8-nano or YOLO11-nano
  • Use imgsz=320 for maximum speed on constrained devices
  • Enable DLA (Deep Learning Accelerator) on Xavier/Orin for extra throughput

7.2 OpenVINO for Intel

from ultralytics import YOLO

# Export to OpenVINO format
model = YOLO("yolov8s.pt")
model.export(format="openvino", imgsz=640)

# Run inference with OpenVINO
ov_model = YOLO("yolov8s_openvino_model/")
results = ov_model("test.jpg")

OpenVINO is ideal for:

  • Intel NUC and NUC-based robots
  • Intel RealSense companion computers
  • CPU-only inference (no GPU required)
  • Intel Movidius VPUs (Myriad X)

7.3 ONNX Runtime

from ultralytics import YOLO

# Export
model = YOLO("yolov8s.pt")
model.export(format="onnx", imgsz=640, simplify=True)

# Run with ONNX Runtime (cross-platform, any hardware)
onnx_model = YOLO("yolov8s.onnx")
results = onnx_model("test.jpg")

ONNX Runtime provides hardware acceleration on:

  • NVIDIA GPU (CUDA execution provider)
  • Intel CPU (OpenVINO execution provider)
  • ARM CPU (NNAPI on Android, Core ML on iOS)
  • AMD GPU (ROCm execution provider)

7.4 Benchmark Table: Latency vs Accuracy

Benchmarks on a single image (640×640), NVIDIA Jetson Orin NX 16GB:

| Model | Format | Latency (ms) | FPS | mAP50-95 | Size (MB) |
|---|---|---|---|---|---|
| YOLO11-n | TensorRT FP16 | 3.2 | 312 | 39.5 | 2.6 |
| YOLOv8-n | TensorRT FP16 | 3.5 | 286 | 37.3 | 3.2 |
| YOLO11-s | TensorRT FP16 | 4.8 | 208 | 47.0 | 9.4 |
| YOLOv8-s | TensorRT FP16 | 5.2 | 192 | 44.9 | 11.2 |
| YOLO11-m | TensorRT FP16 | 8.1 | 123 | 51.5 | 20.1 |
| YOLOv8-m | TensorRT FP16 | 9.3 | 108 | 50.2 | 25.9 |
| YOLO11-l | TensorRT FP16 | 12.5 | 80 | 53.4 | 25.3 |
| YOLOv8-l | TensorRT FP16 | 14.2 | 70 | 52.9 | 43.7 |
| YOLOv10-n | ONNX | 4.1 | 244 | 39.5 | 4.7 |
| YOLOv12-n | TensorRT FP16 | 3.8 | 263 | 40.2 | 3.1 |

Latency includes pre-processing, inference, and NMS. Actual numbers vary by hardware and image content.
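
The FPS column is simply 1000 divided by the latency in milliseconds. The sketch below uses a few rows copied from the table above to pick the most accurate model that fits a latency budget; the names `BENCHMARKS`, `fps`, and `models_within_budget` are illustrative, and the numbers should be replaced with measurements from your own hardware.

```python
# (model, latency_ms, mAP50-95) rows copied from the benchmark table
BENCHMARKS = [
    ("YOLO11-n", 3.2, 39.5),
    ("YOLOv8-n", 3.5, 37.3),
    ("YOLO11-s", 4.8, 47.0),
    ("YOLOv8-s", 5.2, 44.9),
    ("YOLO11-m", 8.1, 51.5),
    ("YOLOv8-m", 9.3, 50.2),
]


def fps(latency_ms):
    """Frames per second for a given per-image latency in milliseconds."""
    return 1000.0 / latency_ms


def models_within_budget(budget_ms, benchmarks=BENCHMARKS):
    """Return the rows that fit the latency budget, most accurate first."""
    fitting = [row for row in benchmarks if row[1] <= budget_ms]
    return sorted(fitting, key=lambda row: row[2], reverse=True)


print(round(fps(8.1)))                    # 123
print(models_within_budget(5.0)[0][0])    # YOLO11-s
```

In a real control loop, the budget should also leave room for camera capture, coordinate transforms, and motion planning, so the detector usually gets only a fraction of the cycle time.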

Model Selection Guide for Robotics:

┌────────────────────────────────────────────────────────────────┐
│                   Model Selection Flowchart                    │
│                                                                │
│  What's your FPS target?                                       │
│  │                                                             │
│  ├─ > 200 FPS → YOLO11-n or YOLOv10-n (TensorRT FP16)          │
│  ├─ 100-200 FPS → YOLOv8-s or YOLO11-s                         │
│  ├─ 50-100 FPS → YOLOv8-m or YOLO11-m                          │
│  └─ < 50 FPS → YOLOv8-l/x or YOLOv12-l (max accuracy)          │
│                                                                │
│  Edge device RAM < 4GB? → Use nano variants + imgsz=320        │
│  Need segmentation? → YOLOv8s-seg or YOLO11s-seg               │
│  Need pose? → YOLOv8s-pose                                     │
└────────────────────────────────────────────────────────────────┘
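
The flowchart can also be written as a small lookup function. This is only a sketch: `select_model` is a hypothetical helper, the handling of the exact boundaries (100 and 50 FPS) and the precedence of task over memory are assumptions, and it returns one of the two suggested models per branch, using Ultralytics weight names such as `yolo11n`.

```python
def select_model(fps_target, ram_gb=8.0, task="detect"):
    """Map an FPS target, RAM budget, and task to a model name and input size."""
    # Task-specific heads take precedence in this sketch (an assumption).
    if task == "segment":
        return {"model": "yolo11s-seg", "imgsz": 640}
    if task == "pose":
        return {"model": "yolov8s-pose", "imgsz": 640}
    # Memory-constrained edge devices: nano variant at reduced resolution.
    if ram_gb < 4:
        return {"model": "yolo11n", "imgsz": 320}
    # Otherwise pick the model scale from the FPS target.
    if fps_target > 200:
        scale = "n"
    elif fps_target >= 100:
        scale = "s"
    elif fps_target >= 50:
        scale = "m"
    else:
        scale = "l"
    return {"model": f"yolo11{scale}", "imgsz": 640}


if __name__ == "__main__":
    print(select_model(fps_target=60))
    print(select_model(fps_target=30, ram_gb=2))
```

Treat the result as a starting point and confirm it with a latency measurement on the target device.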

8. Comparison with Other Detectors

8.1 YOLO vs Alternatives

Feature             YOLOv8/v11             Faster R-CNN  DETR             EfficientDet
Type                Single-stage           Two-stage     Transformer      Single-stage
Speed (FPS)         80–300+                10–30         20–60            30–100
mAP (COCO)          37–55                  42–45         42–50            40–55
Small objects       Good (FPN)             Good (RPN)    Moderate         Good (BiFPN)
NMS required        Yes (v8), No (v10)     Yes           No               Yes
Edge friendly       ✅ Excellent           ❌ Slow       ⚠️ Moderate      ✅ Good
Custom training     ✅ Easy (Ultralytics)  ⚠️ Complex    ⚠️ Needs tuning  ⚠️ Moderate
Real-time robotics  ✅ First choice        ❌ Too slow   ⚠️ Possible      ✅ Good alternative
Active development  ✅ Very active         ❌ Mature     ✅ Active        ⚠️ Slowing

8.2 When to Use What

  • YOLO (v8/v11/v12): Default choice for real-time robotics. Fastest iteration cycle, best deployment ecosystem, widest hardware support.

  • DETR / RT-DETR: When you need end-to-end detection without NMS, or when dealing with complex scenes with many overlapping objects. RT-DETR (Real-Time DETR) is competitive with YOLO: RT-DETR-L reports 53.0 mAP on COCO at 114 FPS on a T4 GPU.

  • Faster R-CNN: When speed is not critical and you want a mature, well-studied two-stage baseline (e.g., offline processing of recorded video). Still widely used in academic benchmarks.

  • EfficientDet: When you need a good balance of accuracy and speed on CPU-only devices. Its BiFPN architecture is efficient, but its deployment tooling is less flexible than YOLO's.


9. Best Practices for Robotics

9.1 Data Collection Tips

  1. Collect diverse images: Vary lighting, backgrounds, object orientations, and camera angles. A model trained on one lighting condition will fail in another.

  2. Capture in the deployment environment: Train on images from the actual robot workspace, not stock photos. Include the robot arm, conveyor belt, bins, and clutter.

  3. Balance your classes: If you have 1000 images of "screw" and only 50 of "bearing", the model will be biased. Use augmentation or collect more data for rare classes.

  4. Include edge cases: Add images with partial occlusions, blurry frames, extreme angles, and multiple overlapping objects.

  5. Resolution matters: Capture at the resolution your robot camera will use. Training on 4K images but deploying at 640×640 is wasteful — downscale first.

  6. Annotate carefully: Consistent bounding boxes are more important than large quantities of loosely annotated data. Draw tight boxes around the visible portion of each object.
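
Class balance (tip 3) is easy to check before training. The sketch below counts instances per class from YOLO-format label lines (`class x_center y_center width height`, normalized) and flags under-represented classes. `class_counts`, `rare_classes`, and the 20% cutoff are hypothetical choices for illustration, not a standard.

```python
from collections import Counter


def class_counts(label_lines):
    """Count instances per class id from YOLO-format label lines."""
    counts = Counter()
    for line in label_lines:
        parts = line.split()
        if len(parts) >= 5:  # skip blank or malformed lines
            counts[int(parts[0])] += 1
    return counts


def rare_classes(counts, min_ratio=0.2):
    """Return class ids with fewer than min_ratio times the largest class count."""
    if not counts:
        return []
    largest = max(counts.values())
    return sorted(c for c, n in counts.items() if n / largest < min_ratio)


if __name__ == "__main__":
    # Synthetic labels mirroring the example: 1000 "screw" (id 0), 50 "bearing" (id 1).
    lines = ["0 0.5 0.5 0.1 0.1"] * 1000 + ["1 0.4 0.4 0.2 0.2"] * 50
    counts = class_counts(lines)
    print(dict(counts), rare_classes(counts))
```

In practice you would feed it the lines of every `.txt` file in your labels directory, then collect or augment data for the flagged classes.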

9.2 Common Pitfalls

Pitfall                   Symptom                                 Fix
Overfitting               High train mAP, low val mAP             More data, stronger augmentation, smaller model
Underfitting              Low mAP everywhere                      Larger model, more epochs, check labels
Class imbalance           Model ignores rare classes              Oversample, use class weights, augment rare classes
Domain gap                Works in lab, fails in field            Collect field data, use domain randomization
Too many false positives  Detecting things that aren't there      Increase confidence threshold, retrain with negatives
Small object misses       Misses small screws/parts               Use higher resolution (imgsz=1280), add more small-object annotations
Slow inference            FPS too low for control loop            Use nano/small model, TensorRT, reduce input size
Blinking detections       Objects appear/disappear across frames  Use tracking (ByteTrack), temporal smoothing
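
For the "blinking detections" row, the simplest form of temporal smoothing is a sliding-window vote on whether a class is present. The sketch below is a hypothetical `DetectionDebouncer`, not ByteTrack and not a tracker: it smooths a per-class presence flag and does not follow individual objects. The window length and hit count are assumed values to tune.

```python
from collections import deque


class DetectionDebouncer:
    """Report a class as present only if seen in enough of the recent frames."""

    def __init__(self, window=5, min_hits=3):
        self.history = deque(maxlen=window)  # most recent per-frame results
        self.min_hits = min_hits

    def update(self, detected):
        """Record this frame's raw result and return the smoothed result."""
        self.history.append(bool(detected))
        return sum(self.history) >= self.min_hits


if __name__ == "__main__":
    debouncer = DetectionDebouncer(window=5, min_hits=3)
    raw = [True, True, False, True, False, False, False, False]
    print([debouncer.update(r) for r in raw])
```

The trade-off is latency: a larger window suppresses more flicker but delays both the first detection and the reaction when an object really leaves the scene.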

9.3 Model Selection Guide

Application: Pick-and-place (tabletop)
├── Objects: 3-10 classes, well-separated
├── Speed: 30+ FPS required
├── Recommended: YOLOv8s or YOLO11s
├── Input size: 640×640
└── Export: TensorRT FP16

Application: Conveyor belt inspection
├── Objects: Small parts (screws, connectors)
├── Speed: 60+ FPS (fast belt)
├── Recommended: YOLOv8n or YOLO11n
├── Input size: 640×640 or 320×320
└── Export: TensorRT FP16

Application: Human-robot collaboration
├── Objects: People, hands, gestures
├── Speed: 30+ FPS
├── Recommended: YOLOv8s-pose (for pose)
├── Input size: 640×640
└── Export: TensorRT FP16

Application: Bin picking (cluttered)
├── Objects: Many overlapping instances
├── Speed: 10+ FPS (bin picker is slower)
├── Recommended: YOLOv8m-seg + PointCloud
├── Input size: 640×640 or 1280×1280
└── Export: TensorRT FP16 + depth integration

Application: Drone / outdoor navigation
├── Objects: Vehicles, people, obstacles
├── Speed: 30+ FPS
├── Recommended: YOLOv8m or YOLO11m
├── Input size: 640×640
└── Export: TensorRT or ONNX

9.4 Production Checklist

Before deploying a YOLO model on a real robot:

[ ] Model validated on held-out test set (not seen during training)
[ ] Tested with real lighting conditions at deployment site
[ ] Confidence threshold tuned (too low = false positives, too high = misses)
[ ] Latency measured on target hardware (not just training GPU)
[ ] NMS parameters tuned (IoU threshold, max detections)
[ ] Failure mode tested: what happens with blank/empty scenes?
[ ] Recovery behavior defined: what does the robot do when detection fails?
[ ] Logging enabled: save detections for offline analysis and retraining
[ ] Model versioning: track which weights are deployed on which robot
[ ] Update pipeline: plan for periodic retraining with new data
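
For the confidence-threshold item, one way to tune is to sweep candidate thresholds over validation detections that have already been matched to ground truth. The sketch below assumes each detection is a `(score, is_true_positive)` pair and picks the threshold with the best F1. Both helper names and the F1 criterion are assumptions; a robot that must never miss a part, or never grasp at nothing, may weight precision and recall differently.

```python
def precision_recall(scored, num_ground_truth, threshold):
    """Precision and recall when keeping detections with score >= threshold."""
    kept = [is_tp for score, is_tp in scored if score >= threshold]
    tp = sum(kept)
    precision = tp / len(kept) if kept else 1.0  # no detections kept: no false positives
    recall = tp / num_ground_truth if num_ground_truth else 0.0
    return precision, recall


def best_threshold(scored, num_ground_truth, candidates):
    """Return the candidate threshold that maximizes F1."""
    def f1(threshold):
        p, r = precision_recall(scored, num_ground_truth, threshold)
        return 2 * p * r / (p + r) if (p + r) else 0.0

    return max(candidates, key=f1)


if __name__ == "__main__":
    # Made-up validation detections: (confidence, matched a ground-truth box?)
    scored = [(0.9, True), (0.8, True), (0.6, False),
              (0.55, True), (0.3, False), (0.2, False)]
    print(best_threshold(scored, num_ground_truth=4,
                         candidates=[0.25, 0.5, 0.7, 0.85]))
```

The chosen value can then be passed as the `conf` argument at prediction time.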

Last updated: 2025