YOLO Object Detection¶
YOLO (You Only Look Once) is one of the most widely used real-time object detection frameworks in robotics. It predicts bounding boxes and class probabilities in a single forward pass, making it fast enough for live video processing on embedded devices. This tutorial covers YOLO from first principles through edge deployment on robots.
Learning Objectives¶
- Understand the core ideas behind single-stage detectors and why they dominate real-time robotics
- Trace the evolution from YOLOv1 (2016) to YOLOv12 (2025) and pick the right version for your task
- Train, evaluate, and deploy custom YOLO models using the Ultralytics ecosystem
- Apply YOLO to pick-and-place, tracking, pose estimation, and instance segmentation
1. What is YOLO¶
1.1 Brief History¶
Object detection — locating and classifying objects in images — is one of the oldest and most practical problems in computer vision. Before YOLO, the state of the art was dominated by two-stage detectors like R-CNN (2014) and Faster R-CNN (2015). These systems first propose regions that might contain objects, then classify each region. They achieved high accuracy but were slow: 2–5 frames per second on a GPU, far too slow for real-time robotics.
In 2016, Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi introduced YOLOv1 with a radical insight: treat detection as a single regression problem. Instead of a multi-stage pipeline, predict all bounding boxes and class probabilities directly from the full image in one pass of the network. YOLOv1 ran at 45 FPS — fast enough for video — and instantly reshaped the field.
Since then, the YOLO lineage has produced over a dozen major versions. The community contributions have been enormous: anchor boxes (YOLOv2), feature pyramid networks (YOLOv3), CSPNet and mosaic augmentation (YOLOv4), anchor-free detection (YOLOv8), and attention-centric designs (YOLOv12). Ultralytics unified many of these ideas into a single Python package that is now the de facto standard for YOLO deployment.
1.2 Why YOLO Matters for Robotics¶
Robots need to perceive their environment in real time. Consider a robotic arm picking parts off a conveyor belt: it must detect each part, estimate its position, and plan a grasp — all within the cycle time of the belt (often < 100 ms). YOLO provides:
- Speed: 30–300+ FPS depending on model size, enabling real-time control loops
- Accuracy: Modern YOLO models achieve mAP > 50 on COCO, competitive with much slower detectors
- Versatility: A single framework handles detection, segmentation, pose estimation, and tracking
- Edge deployment: Export to TensorRT, ONNX, OpenVINO for Jetson, Intel, and ARM devices
- Community: Tens of thousands of pre-trained weights, datasets, and deployment examples
For robotics, YOLO is not just a detector — it is the perception backbone for manipulation, navigation, inspection, and human-robot interaction.
2. How YOLO Works¶
2.1 The Core Insight¶
Traditional two-stage detectors work like this:
Input Image
│
▼
┌──────────────────┐
│ Region Proposal │ ← "Where might objects be?" (Selective Search, RPN)
│ Network │
└────────┬─────────┘
│ ~2000 candidate regions
▼
┌──────────────────┐
│ Classification │ ← "What is in each region?"
│ + Bounding Box │
│ Refinement │
└──────────────────┘
Two-stage detectors are accurate but slow because they process each candidate region separately.
YOLO collapses this into a single step:
Input Image
│
▼
┌──────────────────┐
│ │
│ CNN Backbone │ ← Extract features from the entire image
│ + Neck + Head │
│ │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ S × S Grid │
│ each cell: │ ← Predict ALL boxes + classes in ONE pass
│ B × 5 + C values │
└──────────────────┘
2.2 Grid-Based Prediction (YOLOv1 Detail)¶
YOLOv1 divides the input image into an S × S grid. Each grid cell predicts:
- B bounding boxes (each box has: x, y, w, h, confidence)
- C class probabilities (one score per class)
The total output is a tensor of shape S × S × (B × 5 + C).
┌─────────────────────────────────────────────────┐
│ Input Image │
│ │
│ ┌─────┬─────┬─────┬─────┬─────┐ │
│ │ Cell│ Cell│ Cell│ Cell│ Cell│ ← 7×7 grid │
│ ├─────┼─────┼─────┼─────┤ │ │
│ │ ... │ ... │ ... │ ... │ │ │
│ └─────┴─────┴─────┴─────┴─────┘ │
│ │
│ Each cell predicts: │
│ - 2 bounding boxes (x, y, w, h, conf) │
│ - 20 class probabilities (Pascal VOC) │
│ │
│ Output tensor: 7 × 7 × 30 │
└─────────────────────────────────────────────────┘
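The shape arithmetic above can be checked in a few lines. This is a minimal sketch rather than code from the YOLOv1 release; the helper names `yolov1_output_shape` and `responsible_cell` are made up for illustration. The second helper shows the rule that the grid cell containing an object's center is the one responsible for predicting it.

```python
def yolov1_output_shape(S=7, B=2, C=20):
    """Return the YOLOv1 output tensor shape S x S x (B*5 + C)."""
    return (S, S, B * 5 + C)


def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains an object's center.

    cx, cy are the box center in pixels.
    """
    col = min(int(cx / img_w * S), S - 1)  # clamp centers on the right edge
    row = min(int(cy / img_h * S), S - 1)  # clamp centers on the bottom edge
    return row, col


print(yolov1_output_shape())                 # (7, 7, 30)
print(responsible_cell(224, 100, 448, 448))  # (1, 3)
```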
2.3 Loss Function¶
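YOLOv1 optimizes a multi-part loss combining:
- Localization loss: Sum-squared error on bounding box coordinates (x, y, √w, √h); the square roots keep errors on small boxes from being swamped by errors on large ones
- Confidence loss: Sum-squared error on the objectness score (does this box contain an object?)
- Classification loss: Sum-squared error on class probabilities
Only cells containing objects contribute to the localization and classification losses. The confidence loss applies to every cell, but cells without objects are down-weighted so that the many empty cells do not dominate training. The weights balance these terms (λ_coord = 5, λ_noobj = 0.5 in the YOLOv1 paper).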
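To make the weighting concrete, here is a simplified sketch of the loss for a single grid cell. It assumes one predicted box per cell (B = 1) and non-negative widths and heights, and it includes the down-weighted confidence term for empty cells that λ_noobj exists for. The function name and the dictionary layout are illustrative choices, not part of any YOLO codebase.

```python
import math

LAMBDA_COORD = 5.0  # weight on localization error
LAMBDA_NOOBJ = 0.5  # weight on confidence error in empty cells


def yolov1_cell_loss(pred, target):
    """Sum-squared-error loss for one grid cell with one predicted box.

    pred:   {"box": (x, y, w, h), "conf": float, "classes": [float, ...]}
    target: same keys plus "has_object": bool
    """
    conf_err = (pred["conf"] - target["conf"]) ** 2
    if not target["has_object"]:
        # Empty cell: only the down-weighted confidence term applies.
        return LAMBDA_NOOBJ * conf_err

    px, py, pw, ph = pred["box"]
    tx, ty, tw, th = target["box"]
    # Square roots on w and h make a given absolute error count
    # more for small boxes than for large ones.
    loc_err = (
        (px - tx) ** 2
        + (py - ty) ** 2
        + (math.sqrt(pw) - math.sqrt(tw)) ** 2
        + (math.sqrt(ph) - math.sqrt(th)) ** 2
    )
    cls_err = sum((p - t) ** 2 for p, t in zip(pred["classes"], target["classes"]))
    return LAMBDA_COORD * loc_err + conf_err + cls_err


pred = {"box": (0.5, 0.5, 0.2, 0.2), "conf": 0.2, "classes": [0.5, 0.5]}
empty = {"box": (0, 0, 0, 0), "conf": 0.0, "classes": [0.0, 0.0], "has_object": False}
print(yolov1_cell_loss(pred, empty))  # 0.5 * 0.2**2, about 0.02
```

A full implementation would sum this over all S × S cells and pick, for each object, the predicted box with the highest IoU as the responsible one.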
2.4 From YOLOv1 to Modern Architectures¶
Modern YOLO models (v5, v8, v11) follow a three-part architecture:
┌─────────────────────────────────────────────────────────────────┐
│ YOLO Architecture (v5/v8/v11) │
│ │
│ ┌─────────────┐ │
│ │ │ │
│ │ Backbone │ Extracts multi-scale features │
│ │ (CSPNet / │ e.g., CSPDarknet, C2f blocks │
│ │ C2f) │ │
│ │ │ │
│ └──────┬──────┘ │
│ │ P3, P4, P5 (1/8, 1/16, 1/32 resolution) │
│ ▼ │
│ ┌─────────────┐ │
│ │ │ │
│ │ Neck │ Fuses multi-scale features │
│ │ (PANet / │ Top-down + bottom-up path aggregation │
│ │ SPPF) │ │
│ │ │ │
│ └──────┬──────┘ │
│ │ F3, F4, F5 (enriched features) │
│ ▼ │
│ ┌─────────────┐ │
│ │ │ │
│ │ Head │ Predicts boxes, scores, classes │
│ │ (Anchor / │ One prediction per grid cell │
│ │ Anchor-free)│ │
│ │ │ │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ Bounding boxes + Class scores + Confidence │
└─────────────────────────────────────────────────────────────────┘
Backbone: A deep CNN (e.g., CSPDarknet53, EfficientNet) that extracts hierarchical features. Larger models (L, X) use deeper backbones with more channels.
Neck: Feature Pyramid Network (FPN) + Path Aggregation Network (PANet). The FPN top-down pathway merges high-level semantic features with low-level spatial features. The PANet bottom-up pathway then carries fine spatial detail back up to the coarser levels, which helps localization at every scale.
Head: The detection head produces predictions at three scales:
- P3 (1/8): Small objects (e.g., screws, small parts)
- P4 (1/16): Medium objects (e.g., cups, tools)
- P5 (1/32): Large objects (e.g., boxes, people)
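The three scales translate directly into grid sizes. The sketch below assumes a square input and strides of 8, 16, and 32 as listed above; the function names are illustrative only.

```python
STRIDES = {"P3": 8, "P4": 16, "P5": 32}


def grid_sizes(imgsz=640):
    """Map each head level to its (rows, cols) grid for a square input."""
    return {name: (imgsz // s, imgsz // s) for name, s in STRIDES.items()}


def total_predictions(imgsz=640):
    """Count candidate boxes, assuming one prediction per grid cell.

    This is the anchor-free case; an anchor-based head multiplies the
    count by the number of anchors per cell.
    """
    return sum(rows * cols for rows, cols in grid_sizes(imgsz).values())


print(grid_sizes(640))         # {'P3': (80, 80), 'P4': (40, 40), 'P5': (20, 20)}
print(total_predictions(640))  # 8400
```

Halving the input size to 320 cuts the number of candidates by a factor of four, which is one reason small inputs run so much faster on edge devices.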
2.5 Anchor-Free vs Anchor-Based¶
Anchor-based (YOLOv2–v5): The model predicts offsets from predefined anchor boxes (prior shapes computed via k-means on the training set). Each anchor produces a bounding box by adjusting x, y, w, h offsets.
Anchor-free (YOLOv8+): Each grid point predicts its box directly, as distances to the four box sides, and a task-aligned assigner matches predictions to ground truth during training. With no predefined anchors there are no anchor hyperparameters to tune per dataset, and the model often generalizes better to unusual object shapes.
Anchor-based: prediction = anchor + offset
Anchor-free: prediction = direct (center_x, center_y, width, height)
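The difference is easiest to see in the decoding step. The sketch below shows a YOLOv2/v3-style anchor-based decode next to a distance-based anchor-free decode. It is an illustration under simplifying assumptions, not the code of any specific release: YOLOv5 uses a slightly different anchor formula, and the real YOLOv8 head predicts each distance as a discrete distribution, which is skipped here. All function names are made up for this example.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_anchor_based(t, cell, anchor_wh, stride):
    """Decode raw outputs relative to a grid cell and an anchor box.

    t = (tx, ty, tw, th) raw network outputs, cell = (col, row),
    anchor_wh = anchor (width, height) in pixels.
    Returns (cx, cy, w, h) in pixels.
    """
    tx, ty, tw, th = t
    col, row = cell
    cx = (col + sigmoid(tx)) * stride   # center stays inside its cell
    cy = (row + sigmoid(ty)) * stride
    w = anchor_wh[0] * math.exp(tw)     # anchor scaled by a learned factor
    h = anchor_wh[1] * math.exp(th)
    return cx, cy, w, h


def decode_anchor_free(ltrb, point, stride):
    """Decode distances from a grid point to the four box sides.

    ltrb = (left, top, right, bottom) distances in grid units,
    point = (x, y) grid-cell center in grid units.
    Returns (x1, y1, x2, y2) in pixels.
    """
    l, t, r, b = ltrb
    px, py = point
    return ((px - l) * stride, (py - t) * stride,
            (px + r) * stride, (py + b) * stride)


print(decode_anchor_based((0, 0, 0, 0), (3, 2), (32, 64), 16))  # (56.0, 40.0, 32.0, 64.0)
print(decode_anchor_free((1, 1, 1, 1), (10.5, 10.5), 8))        # (76.0, 76.0, 92.0, 92.0)
```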
3. YOLO Versions¶
3.1 Evolution Table¶
| Version | Year | Key Innovation | Backbone | Neck | Head | Reported accuracy (mAP) | FPS (approx.) | Notable |
|---|---|---|---|---|---|---|---|---|
| YOLOv1 | 2016 | Single-pass detection, grid prediction | Custom 24-layer CNN (GoogLeNet-inspired) | None | FC layers | 63.4 (VOC) | 45 | First real-time detector |
| YOLOv2 | 2017 | Batch norm, anchor boxes, multi-scale training | Darknet-19 | Passthrough | Conv | 78.6 (VOC) | 40 | Trained on 9000+ classes (YOLO9000) |
| YOLOv3 | 2018 | Darknet-53, FPN multi-scale detection | Darknet-53 | FPN | 3-scale | 57.9 (mAP50) | 20 | Best balance at the time |
| YOLOv4 | 2020 | CSPDarknet, PANet, mosaic augmentation | CSPDarknet53 | PANet | SPP | 65.7 (mAP50) | 62 | Bag of freebies + specials |
| YOLOv5 | 2020 | PyTorch native, auto-anchor, easy deployment | CSPDarknet53 | PANet + SPPF | Anchor | 68.9 (mAP50) | 140 | Most deployed in production |
| YOLOv7 | 2022 | E-ELAN, re-parameterization, model scaling | E-ELAN | ELAN-PAN | Anchor | 71.2 (mAP50) | 161 | Fastest at its release |
| YOLOv8 | 2023 | Anchor-free, decoupled head, Ultralytics API | CSPNet (C2f) | PANet + SPPF | Anchor-free | 53.9 (mAP) | 280 | Unified detect/seg/pose/classify |
| YOLOv9 | 2024 | GELAN, PGI (Programmable Gradient Information) | GELAN | GELAN | Anchor-free | 55.6 (mAP) | 300+ | Solves information bottleneck |
| YOLOv10 | 2024 | NMS-free, dual label assignment | CSPNet | PANet | Anchor-free | 54.4 (mAP) | 350+ | Eliminates NMS post-processing |
| YOLO11 | 2024 | C3k2 blocks, improved efficiency | C3k2-CSPNet | C2f2 | Anchor-free | 54.7 (mAP) | 320+ | Most parameter-efficient |
| YOLOv12 | 2025 | Attention-centric, dynamic resolution, flash attention | A*-CSPNet | A*-PAN | Anchor-free | 56.0 (mAP) | 340+ | Combines CNN speed with transformer accuracy |
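Accuracy figures are approximate and not directly comparable across rows: YOLOv1 and YOLOv2 report Pascal VOC mAP, YOLOv3 through YOLOv7 report COCO mAP50, and YOLOv8 onward report COCO mAP50-95 at the default 640×640 input. FPS figures were measured on different GPUs (V100 or A100 for recent versions; the earliest versions predate both), so treat them as indicative only.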
3.2 Key Innovations by Version¶
YOLOv1 (2016) — The Original¶
- Divides image into S×S grid; each cell predicts B boxes + C classes
- End-to-end differentiable — no region proposals
- Limitation: struggles with small objects and with tightly grouped objects, since each grid cell predicts only B boxes and one set of class probabilities
YOLOv2 / YOLO9000 (2017) — Faster and More Classes¶
- Added batch normalization after every convolutional layer (+2% mAP)
- Introduced anchor boxes via k-means clustering on training data
- Multi-scale training: randomly resize input during training (320–608 pixels)
- Joint training on detection + classification data (over 9,000 classes organized with WordTree)
YOLOv3 (2018) — Multi-Scale Detection¶
- Darknet-53 backbone: 53 convolutional layers with residual connections
- Feature Pyramid Network (FPN)-style head: predictions at 3 scales (13×13, 26×26, 52×52 for a 416×416 input)
- Binary cross-entropy for class prediction (handles multi-label)
- The go-to version for years in production robotics
YOLOv4 (2020) — Bag of Freebies¶
- CSPDarknet53: Cross-Stage Partial connections reduce computation
- SPP + PANet neck for richer feature aggregation
- Mosaic augmentation: 4-image collage that teaches the model about context
- Mish activation: smooth non-linearity replacing Leaky ReLU
- Dozens of "free" training tricks: CutMix, DropBlock, label smoothing
YOLOv5 (2020) — The Deployment King¶
- Written in PyTorch from the start (YOLOv1–v4 were built on Darknet, a C/CUDA framework)
- Auto-anchor: Automatically learns anchor box sizes for your dataset
- Integrated export to ONNX, TensorRT, CoreML, TFLite
- Variants: n (nano), s (small), m (medium), l (large), x (extra-large)
- Most widely deployed YOLO version in industrial robotics
YOLOv7 (2022) — Efficiency Champion¶
- E-ELAN (Extended Efficient Layer Aggregation Network): optimized feature fusion
- Re-parameterization: Train with complex architecture, deploy with simpler one
- Auxiliary head during training for better gradient flow
- State-of-the-art speed-accuracy tradeoff at release
YOLOv8 (2023) — The Modern Standard¶
- Anchor-free detection: eliminates anchor box hyperparameters
- Decoupled head: separate branches for classification and regression
- Task-Aligned Assigner: dynamic positive sample assignment
- Unified API: the ultralytics package supports detect, segment, pose, classify
- Variants: n, s, m, l, x (choose by speed/accuracy budget)
YOLOv9 (2024) — Information Bottleneck Solved¶
- GELAN (Generalized Efficient Layer Aggregation Network): new macro-architecture
- PGI (Programmable Gradient Information): prevents information loss in deep networks
- Shows that auxiliary heads and reversible branches can improve a range of architectures
- Smallest model (YOLOv9-t) reaches about 38% mAP50-95 on COCO with only about 2M parameters
YOLOv10 (2024) — NMS-Free¶
- Consistent dual assignments: trains a one-to-many head and a one-to-one head together so that inference needs no NMS
- Holistic efficiency-accuracy driven model design: trims redundant computation across the architecture
- Eliminates the NMS post-processing step, reducing latency by 1–3 ms
- Lightweight architectures: n (2.3M params), s (7.2M), m (15.4M)
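For context, this is the post-processing step that YOLOv10 removes. The following is a minimal pure-Python sketch of greedy NMS for a single class; production code uses vectorized implementations, and the names `iou` and `nms` are chosen here for illustration.

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it, repeat.

    Returns the indices of the kept boxes, highest score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The loop's cost grows with the number of overlapping candidates, so its runtime depends on scene content; removing it makes latency more predictable.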
YOLO11 (2024) — Efficient Next-Gen¶
- C3k2 blocks: smaller, more efficient cross-stage connections
- About 22% fewer parameters than YOLOv8 at comparable or better accuracy (Ultralytics reports this figure for the m variant)
- Improved feature extraction at all scales
- Available in n, s, m, l, x variants
YOLOv12 (2025) — Attention-Centric¶
- A*-CSPNet: Replaces some convolution blocks with attention mechanisms
- Flash attention for memory-efficient self-attention
- Dynamic resolution: adapts to input size without retraining
- Combines the speed of CNNs with the accuracy of vision transformers
- Best performance for high-resolution detection tasks
3.3 How to Choose a Version¶
Decision guide:
Need fastest inference? ──────── YOLOv8-n or YOLOv10-n
Need best accuracy? ──────────── YOLOv12 or YOLOv8-x
Need NMS-free (edge)? ────────── YOLOv10
Need most deployment options? ── YOLOv5 (widest support)
Need segmentation + pose? ────── YOLOv8 (unified API)
Need minimal compute? ─────────── YOLOv9-t or YOLO11-n
Production/industrial? ────────── YOLOv5 or YOLOv8 (most battle-tested)
4. Quick Start with Ultralytics¶
4.1 Installation¶
# Create a virtual environment (recommended)
python -m venv yolo_env
source yolo_env/bin/activate # Linux/Mac
# yolo_env\Scripts\activate # Windows
# Install Ultralytics (includes YOLOv5, v8, v11)
pip install ultralytics
# Verify installation
python -c "import ultralytics; print(ultralytics.__version__)"
# Output: 8.x.x
4.2 Inference on an Image¶
from ultralytics import YOLO
# Load a pre-trained model (downloads weights automatically)
model = YOLO("yolov8n.pt") # nano — fastest, smallest
# Run inference on an image
results = model("https://ultralytics.com/images/zidane.jpg")
# Process results
for result in results:
    # Bounding boxes: xyxy format [x1, y1, x2, y2]
    boxes = result.boxes
    print(f"Found {len(boxes)} objects")
    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # coordinates
        confidence = box.conf[0].item()  # confidence score
        class_id = int(box.cls[0].item())  # class index
        class_name = result.names[class_id]  # class name
        print(f"  {class_name}: {confidence:.2f} at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}]")
# Save annotated image
results[0].save("output.jpg")
4.3 Inference on Video / Webcam¶
from ultralytics import YOLO
import cv2
model = YOLO("yolov8s.pt") # small model — good speed/accuracy balance
# --- Option 1: Process a video file ---
results = model.predict(
source="input_video.mp4",
save=True, # save annotated video
conf=0.25, # confidence threshold
imgsz=640, # inference size
classes=[0, 1], # only detect classes 0 (person) and 1 (bicycle)
stream=True, # memory-efficient streaming
)
for r in results:
    # r.boxes contains detections for this frame
    pass
# --- Option 2: Process webcam (real-time) ---
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    # YOLO inference directly on numpy array
    results = model(frame, verbose=False)
    # Display results
    annotated = results[0].plot()  # draw boxes on frame
    cv2.imshow("YOLO Detection", annotated)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
4.4 Export to ONNX / TensorRT¶
from ultralytics import YOLO
model = YOLO("yolov8s.pt")
# Export to ONNX (cross-platform deployment)
model.export(format="onnx", imgsz=640, simplify=True)
# Creates: yolov8s.onnx
# Export to TensorRT (NVIDIA GPU optimized)
model.export(format="engine", imgsz=640, half=True) # half = FP16
# Creates: yolov8s.engine
# Export to OpenVINO (Intel CPU/GPU/VPU)
model.export(format="openvino", imgsz=640)
# Creates: yolov8s_openvino_model/
# Export to CoreML (Apple Silicon)
model.export(format="coreml", imgsz=640)
# Export to TFLite (mobile/edge)
model.export(format="tflite", imgsz=320) # smaller input for edge
4.5 Running Exported Models¶
from ultralytics import YOLO
# Load an ONNX model
model = YOLO("yolov8s.onnx")
results = model("test.jpg")
# Load a TensorRT engine
model = YOLO("yolov8s.engine")
results = model("test.jpg")
# Both produce the same output format as the PyTorch model
5. YOLO for Robotics Applications¶
5.1 Object Detection for Pick-and-Place¶
The most common robotics use of YOLO: detecting objects on a table or conveyor belt and providing their positions for a manipulator.
from ultralytics import YOLO
import numpy as np
# Load model trained on your object classes
model = YOLO("runs/detect/my_objects/weights/best.pt")
def detect_objects(image):
    """Detect objects and return their center positions in the image."""
    results = model(image, verbose=False)[0]
    objects = []
    for box in results.boxes:
        x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
        conf = box.conf[0].item()
        cls = int(box.cls[0].item())
        # Compute center point (pixel coordinates)
        cx = (x1 + x2) / 2
        cy = (y1 + y2) / 2
        w = x2 - x1
        h = y2 - y1
        objects.append({
            "class": results.names[cls],
            "confidence": conf,
            "center_px": (cx, cy),
            "bbox": (x1, y1, x2, y2),
            "size_px": (w, h),
        })
    return objects
# --- Integration with robot coordinate transform ---
def pixel_to_robot(cx, cy, camera_matrix, extrinsic_matrix, depth_m=0.5):
    """Convert pixel coordinates to robot base frame.

    Assumes an undistorted (rectified) image and a known depth along the
    camera's optical axis, e.g. a fixed camera-to-table distance.
    """
    # Camera intrinsics: focal lengths and principal point
    fx, fy = camera_matrix[0, 0], camera_matrix[1, 1]
    cx_cam, cy_cam = camera_matrix[0, 2], camera_matrix[1, 2]
    # Back-project to 3D in the camera frame (pinhole model, known depth Z)
    Z = depth_m  # depth from camera to table surface in meters
    X = (cx - cx_cam) * Z / fx
    Y = (cy - cy_cam) * Z / fy
    # Transform to robot base frame (extrinsic_matrix is the 4x4 camera-to-base transform)
    point_cam = np.array([X, Y, Z, 1.0])
    point_robot = extrinsic_matrix @ point_cam
    return point_robot[:3]  # x, y, z in robot frame
# --- Example usage ---
# detect -> transform -> plan grasp -> execute
# camera, K (3x3 intrinsics), and T_base_cam (4x4 camera-to-base transform)
# are placeholders for your own camera driver and calibration results.
image = camera.capture()
objects = detect_objects(image)
for obj in objects:
    robot_pos = pixel_to_robot(*obj["center_px"], K, T_base_cam)
    print(f"Object: {obj['class']}, Robot position: {robot_pos}")
    # robot_arm.move_to(robot_pos)
    # robot_arm.grasp()
5.2 Real-Time Tracking with DeepSORT / ByteTrack¶
Object detection gives you boxes in each frame, but for robotics you often need tracking — consistent object IDs across frames. ByteTrack is the modern choice.
from ultralytics import YOLO
# ByteTrack is integrated into Ultralytics: no extra import or package is needed
model = YOLO("yolov8s.pt")
# Enable tracking in predict
results = model.track(
source="conveyor_video.mp4",
tracker="bytetrack.yaml", # ByteTrack config
persist=True, # maintain track IDs across frames
save=True,
conf=0.3,
)
# Process tracked results
for r in results:
    for box in r.boxes:
        track_id = box.id.item() if box.id is not None else None
        cls = r.names[int(box.cls[0].item())]
        xyxy = box.xyxy[0].tolist()
        if track_id is not None:
            print(f"Track {int(track_id)}: {cls} at {xyxy}")
            # Use track_id to maintain state: count, history, prediction
5.3 Pose Estimation (YOLOv8-Pose)¶
YOLOv8-Pose detects human keypoints (or custom keypoints) alongside bounding boxes. Useful for human-robot collaboration, gesture recognition, and ergonomic monitoring.
from ultralytics import YOLO
import numpy as np
# Load pose model (trained on COCO keypoints: 17 keypoints)
model = YOLO("yolov8s-pose.pt")
results = model("person_image.jpg")
for r in results:
    # Keypoints: shape (num_persons, num_keypoints, 3): x, y, visibility
    keypoints = r.keypoints
    # Skip frames with no detected person (indexing [0] would fail)
    if keypoints is not None and len(r.boxes) > 0:
        kpts = keypoints.xy[0].cpu().numpy()  # first person, shape (17, 2)
        # COCO keypoint indices:
        # 0: nose, 1: left_eye, 2: right_eye, 3: left_ear, 4: right_ear
        # 5: left_shoulder, 6: right_shoulder, 7: left_elbow, 8: right_elbow
        # 9: left_wrist, 10: right_wrist, 11: left_hip, 12: right_hip
        # 13: left_knee, 14: right_knee, 15: left_ankle, 16: right_ankle
        # Compute arm angle for robot interaction
        shoulder = kpts[5]
        elbow = kpts[7]
        wrist = kpts[9]
        # Angle at elbow
        v1 = shoulder - elbow
        v2 = wrist - elbow
        cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-6)
        # Clip guards against rounding just outside [-1, 1]
        angle = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
        print(f"Elbow angle: {angle:.1f}°")
5.4 Instance Segmentation (YOLOv8-Seg)¶
Segmentation provides pixel-level masks instead of just bounding boxes. Useful for grasping irregular objects, bin picking, and scene understanding.
from ultralytics import YOLO
import numpy as np
# Load segmentation model
model = YOLO("yolov8s-seg.pt")
results = model("cluttered_table.jpg")
for r in results:
    # Masks: shape (num_detections, H, W), binary masks
    masks = r.masks
    if masks is not None:
        for i, mask in enumerate(masks):
            binary_mask = mask.data[0].cpu().numpy()  # (H, W)
            cls = r.names[int(r.boxes[i].cls[0].item())]
            conf = r.boxes[i].conf[0].item()
            # Compute mask properties
            pixels = np.sum(binary_mask > 0.5)
            print(f"{cls}: {conf:.2f}, area: {pixels} px")
            # Compute centroid for grasp planning
            ys, xs = np.where(binary_mask > 0.5)
            centroid_x = np.mean(xs)
            centroid_y = np.mean(ys)
            print(f"  Centroid: ({centroid_x:.0f}, {centroid_y:.0f})")
6. Custom Training Pipeline¶
6.1 Dataset Preparation¶
Before training, you need images with bounding box annotations.
Annotation Tools:
| Tool | Type | Best For | Export Format |
|---|---|---|---|
| Roboflow | Cloud | End-to-end (annotate + augment + deploy) | YOLO txt, COCO JSON, VOC XML |
| CVAT | Cloud/Self-hosted | Large team projects, video annotation | YOLO txt, COCO, VOC |
| LabelImg | Desktop | Quick single-class annotation | YOLO txt, VOC XML |
| Label Studio | Cloud/Self-hosted | Multi-modal (image + text + audio) | Multiple formats |
YOLO Label Format: one .txt file per image, with the same base name as the image, stored in the matching labels/ directory (see the directory structure below):
# Each line: class_id center_x center_y width height (normalized 0-1)
# Example: image.jpg has a "cup" (class 0) and a "plate" (class 1)
0 0.456 0.321 0.128 0.095
1 0.672 0.543 0.256 0.182
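If your annotation tool exports pixel-space corner boxes, the conversion to this format is short. The sketch below assumes boxes in (x1, y1, x2, y2) pixel coordinates; both function names are hypothetical helpers written for this example.

```python
def xyxy_to_yolo(class_id, box, img_w, img_h):
    """Convert a pixel-space (x1, y1, x2, y2) box to one YOLO label line."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w   # normalized box center
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w        # normalized box size
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def yolo_to_xyxy(line, img_w, img_h):
    """Convert one YOLO label line back to (class_id, (x1, y1, x2, y2)) in pixels."""
    parts = line.split()
    class_id = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:5])
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return class_id, (x1, y1, x2, y2)


line = xyxy_to_yolo(0, (100, 50, 300, 250), 640, 480)
print(line)  # 0 0.312500 0.312500 0.312500 0.416667
```

Round-tripping a few labels through both functions is a cheap sanity check before a long training run, since swapped width/height or unnormalized values are common labeling bugs.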
Directory Structure:
my_dataset/
├── images/
│ ├── train/ # ~80% of images
│ │ ├── img001.jpg
│ │ ├── img002.jpg
│ │ └── ...
│ └── val/ # ~20% of images
│ ├── img050.jpg
│ └── ...
├── labels/
│ ├── train/
│ │ ├── img001.txt
│ │ ├── img002.txt
│ │ └── ...
│ └── val/
│ ├── img050.txt
│ └── ...
└── data.yaml # dataset config (see below)
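The 80/20 split can be produced reproducibly with a seeded shuffle. This is a small sketch that only decides which image goes where; it assumes each image is identified by its file stem (for example img001), and the function name is made up here. Copying the image and label files into the train/ and val/ folders is left to your own script.

```python
import random


def split_train_val(stems, val_fraction=0.2, seed=0):
    """Shuffle image stems reproducibly and split them into train/val lists."""
    stems = sorted(stems)               # start from a deterministic order
    random.Random(seed).shuffle(stems)  # seeded, so the split is repeatable
    n_val = max(1, round(len(stems) * val_fraction))
    return stems[n_val:], stems[:n_val]


stems = [f"img{i:03d}" for i in range(1, 101)]
train, val = split_train_val(stems)
print(len(train), len(val))  # 80 20
```

For video-derived datasets, consider splitting by recording session instead of by frame, since near-identical neighboring frames in both sets inflate validation scores.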
6.2 YAML Configuration File¶
# data.yaml — Dataset configuration for Ultralytics YOLO
# Dataset root; an absolute path is the least ambiguous
path: /path/to/my_dataset # placeholder: replace with your dataset location
# Image directories, relative to path
train: images/train # Training images
val: images/val # Validation images
# test: images/test # Optional test set
# Number of classes
nc: 3
# Class names (must match annotation class IDs)
names:
0: gripper
1: screw
2: bearing
6.3 Data Augmentation Strategies¶
Ultralytics applies augmentation automatically during training. Key strategies:
# Augmentation parameters (set in training args or YAML)
augmentation = {
# Color (HSV)
"hsv_h": 0.015, # Hue augmentation (±1.5%)
"hsv_s": 0.7, # Saturation augmentation (±70%)
"hsv_v": 0.4, # Value/brightness augmentation (±40%)
# Geometric
"degrees": 10.0, # Rotation (±10°)
"translate": 0.1, # Translation (±10% of image)
"scale": 0.5, # Scale augmentation (±50%)
"shear": 2.0, # Shear (±2°)
"perspective": 0.0, # Perspective transform
"flipud": 0.0, # Vertical flip probability
"fliplr": 0.5, # Horizontal flip probability
# Mosaic (combines 4 images into one — key for small objects)
"mosaic": 1.0, # Mosaic probability (1.0 = always)
# Mixup (blends two images)
"mixup": 0.0, # Mixup probability
# Copy-Paste (copy objects between images)
"copy_paste": 0.0, # Copy-paste probability
}
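Geometric augmentations must transform the labels together with the pixels. Ultralytics does this internally, so the sketch below is purely illustrative: it shows what a horizontal flip (the fliplr setting) does to YOLO-format labels. The function name is hypothetical.

```python
def flip_labels_horizontal(labels):
    """Mirror YOLO labels left-to-right.

    labels: list of (class_id, cx, cy, w, h) with normalized coordinates.
    Only the x-center changes; y-center, width, and height are unaffected.
    """
    return [(c, 1.0 - cx, cy, w, h) for c, cx, cy, w, h in labels]


labels = [(0, 0.25, 0.5, 0.1, 0.2)]
print(flip_labels_horizontal(labels))  # [(0, 0.75, 0.5, 0.1, 0.2)]
```

This also explains why flips should be disabled when orientation carries meaning, for example when "left bracket" and "right bracket" are separate classes: the pixels get mirrored but the class ID does not change.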
6.4 Training Script with Full Options¶
from ultralytics import YOLO
# --- Option 1: Python API (recommended for scripts) ---
model = YOLO("yolov8s.pt") # Start from pre-trained weights
results = model.train(
data="data.yaml", # Dataset config
epochs=100, # Number of training epochs
imgsz=640, # Input image size
# Batch size and optimization
batch=16, # Batch size (use -1 for auto)
lr0=0.01, # Initial learning rate
lrf=0.01, # Final learning rate (lr0 * lrf)
momentum=0.937, # SGD momentum
weight_decay=0.0005, # L2 regularization
warmup_epochs=3.0, # Warmup epochs
warmup_momentum=0.8, # Warmup momentum
# Augmentation
hsv_h=0.015,
hsv_s=0.7,
hsv_v=0.4,
degrees=10.0,
translate=0.1,
scale=0.5,
fliplr=0.5,
mosaic=1.0,
mixup=0.0,
# Model settings
name="my_custom_model", # Experiment name
project="runs/detect", # Output directory
exist_ok=False, # If True, reuse and overwrite an existing run directory
# Hardware
device="0", # GPU device (cpu, 0, 0,1, etc.)
workers=8, # Data loading workers
# Checkpointing
save_period=10, # Save checkpoint every N epochs
patience=50, # Early stopping patience
# Validation
val=True, # Validate during training
cache=True, # Cache images in RAM (fast but uses memory)
)
# --- Option 2: CLI ---
# yolo detect train data=data.yaml model=yolov8s.pt epochs=100 imgsz=640 batch=16 device=0
6.5 Evaluation Metrics¶
After training, evaluate your model on the validation set:
from ultralytics import YOLO
model = YOLO("runs/detect/my_custom_model/weights/best.pt")
# Run validation
metrics = model.val(
data="data.yaml",
imgsz=640,
batch=16,
conf=0.001, # keep low for mAP; a high threshold truncates the precision-recall curve
iou=0.6, # IoU threshold for NMS
max_det=300, # Max detections per image
)
# Key metrics
print(f"mAP50: {metrics.box.map50:.4f}") # mAP at IoU=0.50
print(f"mAP50-95: {metrics.box.map:.4f}") # mAP at IoU=0.50:0.95
print(f"Precision: {metrics.box.mp:.4f}") # Mean precision
print(f"Recall: {metrics.box.mr:.4f}") # Mean recall
# Per-class results
for i, (p, r, ap50, ap) in enumerate(
    zip(metrics.box.p, metrics.box.r, metrics.box.ap50, metrics.box.ap)
):
    print(f"  Class {i}: P={p:.3f}, R={r:.3f}, mAP50={ap50:.3f}, mAP={ap:.3f}")
Metrics explained:
- Precision: Of all predicted boxes, what fraction are correct? (TP / (TP + FP))
- Recall: Of all ground-truth boxes, what fraction were detected? (TP / (TP + FN))
- mAP50: Mean Average Precision at IoU threshold 0.5 — lenient metric
- mAP50-95: Mean AP averaged over IoU thresholds 0.50 to 0.95 — strict metric (COCO standard)
- IoU: Intersection over Union — overlap between predicted and ground-truth boxes
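These definitions can be tied together in a small worked example. The sketch below computes precision and recall for one image and one class by matching predictions to ground truth at an IoU threshold. It is a simplification of what a full evaluator does (no per-class handling, no averaging over thresholds), and the function names are illustrative.

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, gts, iou_thresh=0.5):
    """preds: list of (score, box); gts: list of boxes.

    Predictions are matched in descending score order, and each
    ground-truth box can be matched at most once.
    """
    matched = set()
    tp = 0
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_j, best_iou = None, iou_thresh
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp   # predictions with no matching ground truth
    fn = len(gts) - tp     # ground-truth boxes that were missed
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (1, 1, 10, 10)), (0.6, (50, 50, 60, 60))]
print(precision_recall(preds, gts))  # (0.5, 0.5)
```

Here one prediction matches a ground-truth box (IoU 0.81), one is a false positive, and one ground-truth box is missed, giving precision and recall of 0.5 each.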
6.6 TensorBoard Monitoring¶
# During training, TensorBoard logs are saved automatically
# Start TensorBoard:
# From the project directory
tensorboard --logdir runs/detect
# Or with a specific port
tensorboard --logdir runs/detect --port 6006
# Open http://localhost:6006 in your browser
The dashboard shows:
- Training/Box loss: Localization loss (how well boxes fit objects)
- Training/Class loss: Classification loss (how well classes are predicted)
- Training/DFL loss: Distribution Focal Loss (for anchor-free models)
- metrics/precision and metrics/recall: Per-epoch evaluation
- metrics/mAP50 and metrics/mAP50-95: Overall accuracy
7. Edge Deployment¶
7.1 NVIDIA Jetson (TensorRT)¶
The Jetson Nano, Xavier, and Orin are the most common edge platforms for robotics.
# Step 1: Copy the trained PyTorch weights to your Jetson
scp yolov8s.pt jetson@192.168.1.100:~/
# Step 2: Build the TensorRT engine on the Jetson itself
# (an engine is tied to the GPU architecture and TensorRT version it was
# built with, so one built on the training machine will generally not load)
python -c "
from ultralytics import YOLO
model = YOLO('yolov8s.pt')
model.export(format='engine', imgsz=640, half=True, device=0)
"
# Step 3: Run the engine on the Jetson (TensorRT ships with JetPack)
python -c "
from ultralytics import YOLO
model = YOLO('yolov8s.engine')
results = model('test.jpg')
results[0].save('output.jpg')
"
Jetson Optimization Tips:
- Use FP16 (half=True) for 2× speed with minimal accuracy loss
- For Jetson Nano (4GB), use YOLOv8-nano or YOLO11-nano
- Use imgsz=320 for maximum speed on constrained devices
- Enable DLA (Deep Learning Accelerator) on Xavier/Orin for extra throughput
7.2 OpenVINO for Intel¶
from ultralytics import YOLO
# Export to OpenVINO format
model = YOLO("yolov8s.pt")
model.export(format="openvino", imgsz=640)
# Run inference with OpenVINO
ov_model = YOLO("yolov8s_openvino_model/")
results = ov_model("test.jpg")
OpenVINO is ideal for:
- Intel NUC and NUC-based robots
- Intel RealSense companion computers
- CPU-only inference (no GPU required)
- Intel Movidius VPUs (Myriad X)
7.3 ONNX Runtime¶
from ultralytics import YOLO
# Export
model = YOLO("yolov8s.pt")
model.export(format="onnx", imgsz=640, simplify=True)
# Run with ONNX Runtime (cross-platform, any hardware)
onnx_model = YOLO("yolov8s.onnx")
results = onnx_model("test.jpg")
ONNX Runtime provides hardware acceleration on:
- NVIDIA GPU (CUDA execution provider)
- Intel CPU (OpenVINO execution provider)
- ARM CPU (NNAPI on Android, Core ML on iOS)
- AMD GPU (ROCm execution provider)
7.4 Benchmark Table: Latency vs Accuracy¶
Benchmarks on a single image (640×640), NVIDIA Jetson Orin NX 16GB:
| Model | Format | Latency (ms) | FPS | mAP50-95 | Size (MB) |
|---|---|---|---|---|---|
| YOLO11-n | TensorRT FP16 | 3.2 | 312 | 39.5 | 2.6 |
| YOLOv8-n | TensorRT FP16 | 3.5 | 286 | 37.3 | 3.2 |
| YOLO11-s | TensorRT FP16 | 4.8 | 208 | 47.0 | 9.4 |
| YOLOv8-s | TensorRT FP16 | 5.2 | 192 | 44.9 | 11.2 |
| YOLO11-m | TensorRT FP16 | 8.1 | 123 | 51.5 | 20.1 |
| YOLOv8-m | TensorRT FP16 | 9.3 | 108 | 50.2 | 25.9 |
| YOLO11-l | TensorRT FP16 | 12.5 | 80 | 53.4 | 25.3 |
| YOLOv8-l | TensorRT FP16 | 14.2 | 70 | 52.9 | 43.7 |
| YOLOv10-n | ONNX | 4.1 | 244 | 39.5 | 4.7 |
| YOLOv12-n | TensorRT FP16 | 3.8 | 263 | 40.2 | 3.1 |
Latency includes pre-processing, inference, and post-processing (NMS for the models that need it; YOLOv10 does not). Actual numbers vary by hardware and image content.
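The FPS column is just the reciprocal of the latency column, and what matters on a robot is whether detection fits into the control cycle together with everything else. A minimal sketch of that check follows; `other_ms` is a hypothetical allowance for planning, communication, and actuation overhead, and both function names are illustrative.

```python
def fps_from_latency(latency_ms):
    """Convert per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


def fits_cycle(latency_ms, cycle_ms, other_ms=0.0):
    """True if detection plus other per-cycle work fits within the cycle time."""
    return latency_ms + other_ms <= cycle_ms


print(round(fps_from_latency(5.2)))            # 192
print(fits_cycle(14.2, 100.0, other_ms=60.0))  # True
```

With a 100 ms cycle and 60 ms reserved for other work, even the 14.2 ms model in the table fits, so the larger model's extra accuracy comes at no cost to the control loop in that scenario.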
Model Selection Guide for Robotics:
┌────────────────────────────────────────────────────────────────┐
│ Model Selection Flowchart │
│ │
│ What's your FPS target? │
│ │ │
│ ├─ > 200 FPS → YOLO11-n or YOLOv10-n (TensorRT FP16) │
│ ├─ 100-200 FPS → YOLOv8-s or YOLO11-s │
│ ├─ 50-100 FPS → YOLOv8-m or YOLO11-m │
│ └─ < 50 FPS → YOLOv8-l/x or YOLOv12-l (max accuracy) │
│ │
│ Edge device RAM < 4GB? → Use nano variants + imgsz=320 │
│ Need segmentation? → YOLOv8s-seg or YOLO11s-seg │
│ Need pose? → YOLOv8s-pose │
└────────────────────────────────────────────────────────────────┘
8. Comparison with Other Detectors¶
8.1 YOLO vs Alternatives¶
| Feature | YOLOv8/v11 | Faster R-CNN | DETR | EfficientDet |
|---|---|---|---|---|
| Type | Single-stage | Two-stage | Transformer | Single-stage |
| Speed (FPS) | 80–300+ | 10–30 | 20–60 | 30–100 |
| mAP (COCO) | 37–55 (n to x) | 42–45 | 42–50 | 40–55 |
| Small objects | Good (FPN) | Good (RPN) | Moderate | Good (BiFPN) |
| NMS required | Yes (v8), No (v10) | Yes | No | Yes |
| Edge friendly | ✅ Excellent | ❌ Slow | ⚠️ Moderate | ✅ Good |
| Custom training | ✅ Easy (Ultralytics) | ⚠️ Complex | ⚠️ Needs tuning | ⚠️ Moderate |
| Real-time robotics | ✅ First choice | ❌ Too slow | ⚠️ Possible | ✅ Good alternative |
| Active development | ✅ Very active | ❌ Mature | ✅ Active | ⚠️ Slowing |
8.2 When to Use What¶
- YOLO (v8/v11/v12): Default choice for real-time robotics. Fastest iteration cycle, best deployment ecosystem, widest hardware support.
- DETR / RT-DETR: When you need end-to-end detection without NMS, or when dealing with complex scenes with many overlapping objects. RT-DETR (Real-Time DETR) is competitive with YOLO: the RT-DETR-L model reports 53.0 mAP at 114 FPS on a T4 GPU.
- Faster R-CNN: When accuracy is paramount and speed is not critical (e.g., offline processing of recorded video). Still widely used in academic benchmarks.
- EfficientDet: When you need a good balance on CPU-only devices. BiFPN architecture is efficient but less flexible for deployment than YOLO.
9. Best Practices for Robotics¶
9.1 Data Collection Tips¶
- Collect diverse images: Vary lighting, backgrounds, object orientations, and camera angles. A model trained on one lighting condition will fail in another.
- Capture in the deployment environment: Train on images from the actual robot workspace, not stock photos. Include the robot arm, conveyor belt, bins, and clutter.
- Balance your classes: If you have 1000 images of "screw" and only 50 of "bearing", the model will be biased. Use augmentation or collect more data for rare classes.
- Include edge cases: Add images with partial occlusions, blurry frames, extreme angles, and multiple overlapping objects.
- Resolution matters: Capture at the resolution your robot camera will use. Training on 4K images but deploying at 640×640 is wasteful, so downscale first.
- Annotate carefully: Consistent bounding boxes are more important than large quantities of loosely annotated data. Draw tight boxes around the visible portions of objects.
9.2 Common Pitfalls¶
| Pitfall | Symptom | Fix |
|---|---|---|
| Overfitting | High train mAP, low val mAP | More data, stronger augmentation, smaller model |
| Underfitting | Low mAP everywhere | Larger model, more epochs, check labels |
| Class imbalance | Model ignores rare classes | Oversample, use class weights, augment rare classes |
| Domain gap | Works in lab, fails in field | Collect field data, use domain randomization |
| Too many false positives | Detecting things that aren't there | Increase confidence threshold, retrain with negatives |
| Small object misses | Misses small screws/parts | Use higher resolution (imgsz=1280), add more small-object annotations |
| Slow inference | FPS too low for control loop | Use nano/small model, TensorRT, reduce input size |
| Blinking detections | Objects appear/disappear across frames | Use tracking (ByteTrack), temporal smoothing |
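
For the "blinking detections" row, temporal smoothing can be as simple as a sliding-window vote over recent frames. The following is a minimal sketch, assuming you already have a per-frame boolean for whether a given object (or track ID) was detected. The class name `DetectionDebouncer` and its parameters are hypothetical and not part of any library.

```python
from collections import deque
from typing import Deque


class DetectionDebouncer:
    """Report an object as present only if it was detected in at least
    `min_hits` of the last `window` frames."""

    def __init__(self, window: int = 5, min_hits: int = 3) -> None:
        self.history: Deque[bool] = deque(maxlen=window)
        self.min_hits = min_hits

    def update(self, detected: bool) -> bool:
        """Add the newest frame's result and return the smoothed state."""
        self.history.append(detected)
        return sum(self.history) >= self.min_hits


# Made-up raw per-frame results with a one-frame dropout at index 3.
raw = [True, True, True, False, True, True, False, False, False, False]
debouncer = DetectionDebouncer(window=5, min_hits=3)
smoothed = [debouncer.update(d) for d in raw]
print(smoothed)
```

The smoothed signal bridges the single missed frame, at the cost of a few frames of delay when an object first appears or finally leaves. A tracker such as ByteTrack addresses the same problem per object and is usually the better fit when several instances are in view.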
9.3 Model Selection Guide¶
Application: Pick-and-place (tabletop)
├── Objects: 3-10 classes, well-separated
├── Speed: 30+ FPS required
├── Recommended: YOLOv8s or YOLO11s
├── Input size: 640×640
└── Export: TensorRT FP16

Application: Conveyor belt inspection
├── Objects: Small parts (screws, connectors)
├── Speed: 60+ FPS (fast belt)
├── Recommended: YOLOv8n or YOLO11n
├── Input size: 640×640 or 320×320
└── Export: TensorRT FP16

Application: Human-robot collaboration
├── Objects: People, hands, gestures
├── Speed: 30+ FPS
├── Recommended: YOLOv8s-pose (for pose)
├── Input size: 640×640
└── Export: TensorRT FP16

Application: Bin picking (cluttered)
├── Objects: Many overlapping instances
├── Speed: 10+ FPS (bin picker is slower)
├── Recommended: YOLOv8m-seg + PointCloud
├── Input size: 640×640 or 1280×1280
└── Export: TensorRT FP16 + depth integration

Application: Drone / outdoor navigation
├── Objects: Vehicles, people, obstacles
├── Speed: 30+ FPS
├── Recommended: YOLOv8m or YOLO11m
├── Input size: 640×640
└── Export: TensorRT or ONNX
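The FPS targets in this guide translate directly into a per-frame time budget, which is a useful check when comparing candidate models. Below is a rough sketch of that check. `fake_infer` is a stand-in for a real model call on the target hardware, and the function names, run counts, and percentile choices are assumptions for illustration.

```python
import time
from typing import Callable, Dict


def frame_budget_ms(target_fps: float) -> float:
    """Total time available per frame at the target rate, in milliseconds."""
    return 1000.0 / target_fps


def measure_latency_ms(infer: Callable[[], object], runs: int = 50,
                       warmup: int = 5) -> Dict[str, float]:
    """Time repeated calls to `infer` and return approximate p50/p95 in ms."""
    for _ in range(warmup):  # warm-up calls are excluded from the statistics
        infer()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "p50": samples[len(samples) // 2],
        "p95": samples[min(len(samples) - 1, int(0.95 * len(samples)))],
    }


def fake_infer() -> None:
    """Placeholder for a real model call; sleeps for about 2 ms."""
    time.sleep(0.002)


stats = measure_latency_ms(fake_infer, runs=20)
budget = frame_budget_ms(30)  # 30 FPS target, as in the pick-and-place entry
print(f"p50={stats['p50']:.1f} ms  p95={stats['p95']:.1f} ms  budget={budget:.1f} ms")
print("meets budget:", stats["p95"] <= budget)
```

Comparing the 95th percentile rather than the mean against the budget is a deliberate choice: a control loop is hurt by occasional slow frames, not by the average. Remember that camera capture, pre-processing, and post-processing also consume part of the same budget.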
9.4 Production Checklist¶
Before deploying a YOLO model on a real robot:
[ ] Model validated on held-out test set (not seen during training)
[ ] Tested with real lighting conditions at deployment site
[ ] Confidence threshold tuned (too low = false positives, too high = misses)
[ ] Latency measured on target hardware (not just training GPU)
[ ] NMS parameters tuned (IoU threshold, max detections)
[ ] Failure mode tested: what happens with blank/empty scenes?
[ ] Recovery behavior defined: what does the robot do when detection fails?
[ ] Logging enabled: save detections for offline analysis and retraining
[ ] Model versioning: track which weights are deployed on which robot
[ ] Update pipeline: plan for periodic retraining with new data
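For the confidence-threshold item, one simple approach is to sweep candidate thresholds over validation detections and compare precision and recall. This sketch assumes each detection has already been matched to ground truth (for example, by IoU) and labeled as a true or false positive. The data and the function name are made up for illustration.

```python
from typing import List, Tuple


def precision_recall(dets: List[Tuple[float, bool]], num_gt: int,
                     conf_thresh: float) -> Tuple[float, float]:
    """Precision and recall when keeping detections with conf >= conf_thresh.

    dets holds (confidence, is_true_positive) pairs from a validation run.
    """
    kept = [is_tp for conf, is_tp in dets if conf >= conf_thresh]
    tp = sum(kept)
    # Convention: with nothing kept there are no false positives, so precision = 1.
    precision = tp / len(kept) if kept else 1.0
    recall = tp / num_gt if num_gt else 0.0
    return precision, recall


# Hypothetical validation detections, already matched to ground truth.
dets = [(0.95, True), (0.90, True), (0.80, True), (0.60, False),
        (0.55, True), (0.40, False), (0.30, True), (0.20, False)]
num_gt = 6  # assumed number of ground-truth objects in the validation set

for t in (0.25, 0.50, 0.75):
    p, r = precision_recall(dets, num_gt, t)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    print(f"conf>={t:.2f}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Which operating point is right depends on the robot's failure costs: a gripper that must never grab the wrong part favors precision, while a safety monitor that must never miss a person favors recall.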
10. References¶
Papers¶
- YOLOv1 — You Only Look Once (Redmon et al., 2016) — Introduced single-stage detection
- YOLOv2 — YOLO9000 (Redmon & Farhadi, 2017) — Batch norm, anchor boxes, multi-scale training
- YOLOv3 (Redmon & Farhadi, 2018) — Darknet-53, FPN multi-scale detection
- YOLOv4 (Bochkovskiy et al., 2020) — CSPDarknet, mosaic augmentation, bag of freebies
- YOLOv7 (Wang et al., 2022) — E-ELAN, re-parameterization, auxiliary training
- YOLOv8 (Jocher et al., 2023) — Anchor-free, decoupled head, Ultralytics ecosystem
- YOLOv9 (Wang et al., 2024) — GELAN, Programmable Gradient Information
- YOLOv10 (Wang et al., 2024) — NMS-free, dual label assignment
- YOLO11 (Jocher et al., 2024) — C3k2 blocks, parameter efficiency
- YOLOv12 (Tian et al., 2025) — Attention-centric design, flash attention
- RT-DETR (Zhao et al., 2023) — Real-time end-to-end detector (DETR family)
Official Documentation¶
- Ultralytics Documentation — Complete YOLOv5/v8/v11 reference
- Ultralytics YOLO GitHub — Source code and issues
- Ultralytics YOLOv5 GitHub — Legacy YOLOv5 repo
Deployment Guides¶
- NVIDIA Jetson AI Lab — Jetson deployment tutorials
- TensorRT Documentation — GPU optimization
- OpenVINO Documentation — Intel deployment
- ONNX Runtime — Cross-platform inference
Tutorials and Courses¶
- Roboflow YOLO Tutorial — Step-by-step custom training
- Ultralytics YOLO Course — Official examples
- Papers With Code — Object Detection — Benchmarks and leaderboards
Last updated: 2025