Training Pipeline for Robotics Perception¶

A comprehensive guide to collecting data, training models, evaluating performance, and deploying optimized perception models for robotics applications.

Overview¶

Robotics perception systems must reliably detect, segment, and estimate depth for objects in unstructured, dynamic environments. Off-the-shelf models trained on general benchmarks (COCO, ImageNet) often fail when deployed on real robots because:

Domain shift: The robot's camera, mounting angle, and operating environment differ from training data.
Specific objects: Robots must recognize task-relevant objects (tools, gripper targets, obstacles) not present in public datasets.
Real-time constraints: Embedded compute on the robot requires optimized models (INT8, TensorRT).
Robustness: Variations in lighting, motion blur, partial occlusion, and sensor noise demand tailored training.

This guide walks through the full lifecycle—from raw images to a deployed model—covering detection (YOLO), segmentation (SAM, Mask R-CNN), depth estimation, evaluation, optimization, and MLOps practices.

Prerequisites: Python 3.8+, PyTorch 2.0+, CUDA 11.8+, basic familiarity with computer vision and neural networks.

Learning Objectives¶

Collect and annotate high-quality datasets for robotics tasks
Train and fine-tune detection, segmentation, and depth models
Evaluate models with standard metrics and analyze failure modes
Optimize models for real-time inference on edge hardware
Set up experiment tracking and CI/CD pipelines for model iteration

1. Data Collection¶

1.1 Manual Annotation Tools¶

Tool	Format Support	Strengths	Best For
LabelImg	YOLO, Pascal VOC	Lightweight, simple UI	Quick bounding box annotation
CVAT	COCO, YOLO, Pascal VOC	Multi-user, video support, AI-assisted	Team annotation projects
Roboflow	All major formats	Auto-augment, export, hosting	End-to-end pipeline
Label Studio	JSON, COCO, VOC	Multi-modal (text, image, audio), ML backend	Complex annotation tasks

LabelImg (quick start for bounding boxes):

pip install labelImg
labelImg  # Opens GUI, select folder and format

CVAT (team annotation with Docker):

# Run from a clone of the CVAT repository, which provides docker-compose.yml
docker compose -f docker-compose.yml up -d
# Access at http://localhost:8080
# Create project, define labels, invite annotators

Label Studio (with ML backend for active learning):

pip install label-studio label-studio-ml
label-studio start &
# Configure ML backend to pre-annotate with a pre-trained model

1.2 Automatic Annotation with Pre-trained Models¶

Manual annotation is slow. Use a pre-trained model to generate initial annotations, then have human annotators correct the mistakes (human-in-the-loop):

from ultralytics import YOLO

# Load a pre-trained model (or your own model from a previous iteration)
model = YOLO("yolov8x.pt")

# Auto-annotate a folder of raw images
results = model.predict(
    source="raw_images/",
    save_txt=True,          # Save YOLO-format labels
    conf=0.5,               # Confidence threshold
    imgsz=640,
    save=True               # Save annotated images for visual review
)
# Output: labels are written to the run directory (e.g. runs/detect/predict/labels/), one .txt per image

SAM-assisted labeling (Segment Anything for masks):

from segment_anything import sam_model_registry, SamPredictor
import cv2
import numpy as np

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")
predictor = SamPredictor(sam)

# For each image, provide point prompts or box prompts
image = cv2.imread("robot_scene.jpg")
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
predictor.set_image(image_rgb)

# Prompt with a bounding box (e.g., from YOLO detection)
box = np.array([100, 200, 400, 500])  # x1, y1, x2, y2
masks, scores, _ = predictor.predict(
    box=box,
    multimask_output=True
)
# Select the best mask
best_mask = masks[np.argmax(scores)]

For large-scale auto-annotation, consider Autodistill, which chains foundation models to generate annotations in YOLO format automatically.

1.3 Data Collection Best Practices for Robotics¶

Principle	Why It Matters	How to Achieve It
Varying lighting	Robots operate under different conditions	Capture in sunlight, shadows, indoors, artificial light
Multiple camera angles	Robot cameras may be mounted differently	Mount camera at different heights and angles
Diverse backgrounds	Clutter confuses models	Collect in different rooms, workspaces, outdoor areas
Include edge cases	Models fail on unusual configurations	Add partially occluded, distant, or motion-blurred objects
Represent target domain	Domain gap causes failures	Use the actual robot camera and mounting position
Sufficient quantity	More data generally improves generalization	Aim for 500+ images per class minimum; 2000+ ideal
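The quantity guideline in the table can be checked mechanically once labels exist. The sketch below counts labeled boxes per class in YOLO-format label files (one `class_id cx cy w h` line per box) and flags classes below a threshold. It counts instances rather than images, which is a simplifying assumption; the helper names and the directory path are hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_from_lines(lines):
    """Count boxes per class id from an iterable of YOLO label lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if parts:  # skip blank lines
            counts[int(parts[0])] += 1
    return counts


def count_instances(label_dir):
    """Count boxes per class id across every .txt file in label_dir."""
    counts = Counter()
    for txt in sorted(Path(label_dir).glob("*.txt")):
        counts.update(count_from_lines(txt.read_text().splitlines()))
    return counts


def underrepresented(counts, class_names, minimum=500):
    """Return the names of classes with fewer than `minimum` boxes."""
    return [name for i, name in enumerate(class_names)
            if counts.get(i, 0) < minimum]


# Example (placeholder path and class list):
# counts = count_instances("collected_data/labels")
# print(underrepresented(counts, ["bottle", "cup", "tool"]))
```

Running this after each collection session shows which classes need more captures before training starts.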

Data collection script example (preview a webcam feed and save a frame on each key press):

import cv2
import os

cap = cv2.VideoCapture(0)
output_dir = "collected_data"
os.makedirs(output_dir, exist_ok=True)

frame_count = 0

while True:
    ret, frame = cap.read()
    if not ret:
        break
    cv2.imshow("Preview (press 's' to save, 'q' to quit)", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("s"):
        filename = f"{output_dir}/img_{frame_count:06d}.jpg"
        cv2.imwrite(filename, frame)
        print(f"Saved: {filename}")
        frame_count += 1
    elif key == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
print(f"Total images saved: {frame_count}")

1.4 Augmentation Strategies¶

Augmentation artificially increases dataset diversity and improves generalization.

Photometric augmentations (color/lighting changes):

import albumentations as A

photometric_transform = A.Compose([
    A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=0.7),
    A.HueSaturationValue(hue_shift_limit=20, sat_shift_limit=30, val_shift_limit=20, p=0.5),
    A.CLAHE(clip_limit=4.0, p=0.3),
    A.RandomGamma(gamma_limit=(80, 120), p=0.3),
    A.GaussNoise(var_limit=(10.0, 50.0), p=0.3),
])

Geometric augmentations (spatial transforms):

import cv2  # needed for border_mode below

geometric_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    A.ShiftScaleRotate(
        shift_limit=0.1,
        scale_limit=0.2,
        rotate_limit=15,
        border_mode=cv2.BORDER_CONSTANT,
        p=0.7
    ),
    A.Perspective(scale=(0.05, 0.1), p=0.3),
], bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]))

Advanced augmentations:

# Cutout / Coarse Dropout - randomly mask patches
cutout_transform = A.CoarseDropout(
    max_holes=8, max_height=32, max_width=32,
    min_holes=1, min_height=8, min_width=8,
    fill_value=0, p=0.5
)

# Mixup - blend two images and their labels
# (typically done at training time, see YOLO's mixup parameter)

# Mosaic - combine 4 images into one (YOLO's default augmentation)
# Controlled by the 'mosaic' parameter in YOLO training

Applying augmentations to a labeled dataset:

import os

import cv2

def augment_dataset(image_dir, label_dir, output_dir, transform, num_augments=3):
    """Apply augmentation to an entire dataset."""
    os.makedirs(f"{output_dir}/images", exist_ok=True)
    os.makedirs(f"{output_dir}/labels", exist_ok=True)

    for img_name in os.listdir(image_dir):
        if not img_name.endswith((".jpg", ".png")):
            continue
        image = cv2.imread(f"{image_dir}/{img_name}")
        # Load YOLO-format labels
        label_path = f"{label_dir}/{os.path.splitext(img_name)[0]}.txt"
        bboxes, class_labels = [], []
        if os.path.exists(label_path):
            with open(label_path) as f:
                for line in f.readlines():
                    parts = line.strip().split()
                    cls = int(parts[0])
                    x, y, w, h = map(float, parts[1:5])
                    # Convert YOLO to Pascal VOC (x1, y1, x2, y2)
                    h_img, w_img = image.shape[:2]
                    x1 = (x - w / 2) * w_img
                    y1 = (y - h / 2) * h_img
                    x2 = (x + w / 2) * w_img
                    y2 = (y + h / 2) * h_img
                    bboxes.append([x1, y1, x2, y2])
                    class_labels.append(cls)

        # Save original
        cv2.imwrite(f"{output_dir}/images/{img_name}", image)

        # Create augmented versions
        for i in range(num_augments):
            transformed = transform(
                image=image,
                bboxes=bboxes,
                class_labels=class_labels
            )
            aug_name = f"{os.path.splitext(img_name)[0]}_aug{i}.jpg"
            cv2.imwrite(f"{output_dir}/images/{aug_name}", transformed["image"])

            # Convert back to YOLO format and save
            h_img, w_img = transformed["image"].shape[:2]
            with open(f"{output_dir}/labels/{os.path.splitext(aug_name)[0]}.txt", "w") as f:
                for bbox, cls in zip(transformed["bboxes"], transformed["class_labels"]):
                    x1, y1, x2, y2 = bbox
                    cx = ((x1 + x2) / 2) / w_img
                    cy = ((y1 + y2) / 2) / h_img
                    bw = (x2 - x1) / w_img
                    bh = (y2 - y1) / h_img
                    f.write(f"{int(cls)} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}\n")

augment_dataset("raw_images", "raw_labels", "augmented_dataset", 
                transform=geometric_transform, num_augments=3)

2. Dataset Formats¶

2.1 YOLO Format¶

Each image has a corresponding .txt file. Each line: class_id center_x center_y width height (normalized 0-1).

project/
├── images/
│   ├── train/
│   │   ├── img001.jpg
│   │   └── img002.jpg
│   └── val/
│       ├── img010.jpg
│       └── img011.jpg
├── labels/
│   ├── train/
│   │   ├── img001.txt
│   │   └── img002.txt
│   └── val/
│       ├── img010.txt
│       └── img011.txt
└── data.yaml

Example label file (img001.txt):

0 0.512500 0.483203 0.235000 0.462500
1 0.213750 0.651042 0.120000 0.287500
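To make the normalization concrete, here is a minimal sketch that converts a label line to pixel corners and back. The 640x480 image size is an assumption for illustration, and `yolo_to_pixels` and `pixels_to_yolo` are hypothetical helper names.

```python
def yolo_to_pixels(line, img_w, img_h):
    """Parse one YOLO label line into (class_id, (x1, y1, x2, y2)) in pixels."""
    cls, cx, cy, w, h = line.split()
    cx, w = float(cx) * img_w, float(w) * img_w
    cy, h = float(cy) * img_h, float(h) * img_h
    return int(cls), (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def pixels_to_yolo(cls, box, img_w, img_h):
    """Format pixel corners (x1, y1, x2, y2) as a normalized YOLO label line."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    bw = (x2 - x1) / img_w
    bh = (y2 - y1) / img_h
    return f"{cls} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}"


line = "0 0.512500 0.483203 0.235000 0.462500"
cls, box = yolo_to_pixels(line, img_w=640, img_h=480)
round_trip = pixels_to_yolo(cls, box, img_w=640, img_h=480)
print(box)
print(round_trip)
```

Because the stored values are fractions of the image size, the same label file stays valid if the image is resized without cropping.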

Example data.yaml:

path: /absolute/path/to/project
train: images/train
val: images/val
test: images/test  # optional

nc: 3  # number of classes
names: ["bottle", "cup", "tool"]
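The train/val layout above implies a split step. A minimal sketch of a reproducible split over image stems (file names without extensions) follows; `split_stems` is a hypothetical helper and the 80/20 ratio is an assumption. For robot footage, frames from the same recording are near-duplicates, so consider splitting by recording rather than by frame.

```python
import random


def split_stems(stems, val_fraction=0.2, seed=0):
    """Shuffle stems with a fixed seed and return (train, val) lists."""
    ordered = sorted(stems)              # fixed starting order
    random.Random(seed).shuffle(ordered)  # seeded, so the split is repeatable
    n_val = max(1, round(len(ordered) * val_fraction))
    return ordered[n_val:], ordered[:n_val]


stems = [f"img{i:03d}" for i in range(10)]
train, val = split_stems(stems, val_fraction=0.2, seed=0)
print(len(train), len(val))
```

Each stem in `train` or `val` then maps to one file under `images/<split>/` and one under `labels/<split>/`.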

2.2 COCO Format¶

JSON-based, single annotation file for the entire dataset.

{
  "images": [
    {"id": 1, "file_name": "img001.jpg", "width": 640, "height": 480}
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 0,
      "bbox": [100, 50, 150, 220],
      "area": 33000,
      "segmentation": [[100, 50, 250, 50, 250, 270, 100, 270]],
      "iscrowd": 0
    }
  ],
  "categories": [
    {"id": 0, "name": "bottle"},
    {"id": 1, "name": "cup"}
  ]
}

Note: COCO bbox format is [x_top_left, y_top_left, width, height] in pixels.
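As a quick check of the convention, the sketch below converts the example annotation's bbox to corner coordinates and recomputes its area. `coco_to_corners` and `coco_area` are hypothetical helper names.

```python
def coco_to_corners(bbox):
    """Convert COCO [x, y, width, height] to (x1, y1, x2, y2)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def coco_area(bbox):
    """Area of a COCO box in square pixels."""
    return bbox[2] * bbox[3]


bbox = [100, 50, 150, 220]
corners = coco_to_corners(bbox)
area = coco_area(bbox)
print(corners, area)
```

The corners match the rectangle traced by the example `segmentation` polygon, and the area matches the `area` field.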

2.3 Pascal VOC Format¶

XML-based, one file per image.

<annotation>
  <filename>img001.jpg</filename>
  <size>
    <width>640</width>
    <height>480</height>
  </size>
  <object>
    <name>bottle</name>
    <bndbox>
      <xmin>100</xmin>
      <ymin>50</ymin>
      <xmax>250</xmax>
      <ymax>270</ymax>
    </bndbox>
  </object>
</annotation>

2.4 Format Conversion Scripts¶

COCO to YOLO:

import json
import os

def coco_to_yolo(coco_json_path, output_dir):
    """Convert COCO annotations to YOLO format."""
    os.makedirs(output_dir, exist_ok=True)

    with open(coco_json_path) as f:
        coco = json.load(f)

    # Build lookup tables
    img_lookup = {img["id"]: img for img in coco["images"]}

    # Group annotations by image
    ann_by_image = {}
    for ann in coco["annotations"]:
        img_id = ann["image_id"]
        if img_id not in ann_by_image:
            ann_by_image[img_id] = []
        ann_by_image[img_id].append(ann)

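    # YOLO class ids must be contiguous and 0-based; COCO category ids need not be
    categories = sorted(coco["categories"], key=lambda c: c["id"])
    cat_to_idx = {cat["id"]: i for i, cat in enumerate(categories)}

    for img_id, annotations in ann_by_image.items():
        img = img_lookup[img_id]
        w, h = img["width"], img["height"]
        txt_path = os.path.join(output_dir, 
                                os.path.splitext(img["file_name"])[0] + ".txt")

        with open(txt_path, "w") as f:
            for ann in annotations:
                x, y, bw, bh = ann["bbox"]  # COCO: x, y, w, h (pixels)
                cx = (x + bw / 2) / w
                cy = (y + bh / 2) / h
                nw = bw / w
                nh = bh / h
                cls_id = cat_to_idx[ann["category_id"]]
                f.write(f"{cls_id} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}\n")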

    print(f"Converted {len(ann_by_image)} images to YOLO format in {output_dir}")

coco_to_yolo("annotations.json", "labels_yolo/")

Pascal VOC to YOLO:

import xml.etree.ElementTree as ET
import os

def voc_to_yolo(voc_dir, output_dir):
    """Convert Pascal VOC XML annotations to YOLO format."""
    os.makedirs(output_dir, exist_ok=True)
    class_names = []  # collect unique classes

    # Sort so class ids are assigned in a reproducible order
    for xml_file in sorted(os.listdir(voc_dir)):
        if not xml_file.endswith(".xml"):
            continue
        tree = ET.parse(os.path.join(voc_dir, xml_file))
        root = tree.getroot()

        size = root.find("size")
        w = int(size.find("width").text)
        h = int(size.find("height").text)

        txt_name = os.path.splitext(xml_file)[0] + ".txt"
        with open(os.path.join(output_dir, txt_name), "w") as f:
            for obj in root.findall("object"):
                name = obj.find("name").text
                if name not in class_names:
                    class_names.append(name)
                cls_id = class_names.index(name)

                bbox = obj.find("bndbox")
                xmin = float(bbox.find("xmin").text)
                ymin = float(bbox.find("ymin").text)
                xmax = float(bbox.find("xmax").text)
                ymax = float(bbox.find("ymax").text)

                cx = ((xmin + xmax) / 2) / w
                cy = ((ymin + ymax) / 2) / h
                bw = (xmax - xmin) / w
                bh = (ymax - ymin) / h

                f.write(f"{cls_id} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}\n")

    print(f"Classes: {class_names}")

voc_to_yolo("Annotations/", "labels_yolo/")

Recommended tools for batch conversion: Roboflow and FiftyOne both offer programmatic and UI-based format conversion.
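After any conversion it is worth validating the output before training. The sketch below checks a single YOLO label line against the format rules from section 2.1; `validate_yolo_line` is a hypothetical helper, and the set of checks is a reasonable minimum rather than an exhaustive list.

```python
def validate_yolo_line(line, num_classes):
    """Return a list of problems found in one label line (empty if valid)."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        cls = int(parts[0])
        cx, cy, w, h = (float(p) for p in parts[1:])
    except ValueError:
        return ["non-numeric field"]

    problems = []
    if not 0 <= cls < num_classes:
        problems.append(f"class id {cls} outside 0..{num_classes - 1}")
    if any(not 0.0 <= v <= 1.0 for v in (cx, cy, w, h)):
        problems.append("value outside [0, 1]")
    if w <= 0 or h <= 0:
        problems.append("non-positive width or height")
    return problems


print(validate_yolo_line("0 0.512500 0.483203 0.235000 0.462500", num_classes=3))
print(validate_yolo_line("3 0.5 0.5 0.1 0.1", num_classes=3))
```

A common failure this catches is pixel coordinates written without normalization, which show up as values far above 1.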

3. Training Detection Models (YOLO)¶

3.1 Full Ultralytics Training Pipeline¶

from ultralytics import YOLO

# Option A: Train from scratch with a YAML config
model = YOLO("yolov8n.yaml")  # Nano model (3.2M params)
results = model.train(
    data="path/to/data.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    name="robot_detection_v1",
    device="0",               # GPU index, or "cpu"
    patience=20,              # Early stopping patience
    save=True,
    save_period=10,           # Save checkpoint every N epochs
    val=True,
    plots=True,
)

# Option B: Fine-tune from pre-trained weights (recommended)
model = YOLO("yolov8m.pt")  # Medium model, pre-trained on COCO
results = model.train(
    data="path/to/data.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    name="robot_detection_finetune",
    device="0",
    freeze=10,                # Freeze first 10 layers (transfer learning)
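    lr0=0.001,                # Lower learning rate for fine-tuning (with the default optimizer="auto", Ultralytics chooses lr0 itself; set optimizer explicitly to apply this value)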
)

# Option C: Train from CLI
# yolo detect train data=data.yaml model=yolov8m.pt epochs=100 imgsz=640

Model size comparison:

Model	Params	mAP50-95 (COCO val)	Speed (ms, A100 TensorRT)	Best For
YOLOv8n	3.2M	37.3	0.99	Edge devices, Jetson Nano
YOLOv8s	11.2M	44.9	1.20	Jetson Orin Nano
YOLOv8m	25.9M	50.2	1.83	Desktop GPU
YOLOv8l	43.7M	52.9	2.39	Desktop GPU
YOLOv8x	68.2M	53.9	3.53	Server GPU, highest accuracy

3.2 Hyperparameter Tuning¶

Key hyperparameters and their effects:

model = YOLO("yolov8m.pt")

# Hyperparameter search with the built-in tuner (pass use_ray=True to use Ray Tune instead)
results = model.tune(
    data="data.yaml",
    epochs=30,              # Fewer epochs per trial
    iterations=100,         # Number of tuning trials
    optimizer="AdamW",
    plots=True,
    save=True,
    device="0",
)
# Saves the best hyperparameters to best_hyperparameters.yaml in the tune run directory (e.g. runs/detect/tune/)

Manual hyperparameter guide:

Parameter	Default	Tuning Range	When to Adjust
`lr0`	0.01	0.0001 - 0.01	Overfitting → lower; underfitting → higher
`lrf`	0.01	0.001 - 0.1	Final LR ratio (lr0 × lrf)
`momentum`	0.937	0.8 - 0.99	SGD momentum
`weight_decay`	0.0005	0.0001 - 0.01	Regularization strength
`warmup_epochs`	3.0	1 - 5	More warmup for small datasets
`warmup_momentum`	0.8	0.5 - 0.95	Initial momentum during warmup
`box`	7.5	1 - 20	Box loss gain
`cls`	0.5	0.1 - 5	Classification loss gain
`dfl`	1.5	0.5 - 5	Distribution focal loss gain
`mosaic`	1.0	0 - 1	Disable (0) for small objects
`mixup`	0.0	0 - 0.5	Add for regularization
`copy_paste`	0.0	0 - 1	Copy-paste augmentation
`degrees`	0.0	0 - 45	Rotation augmentation (degrees)
`scale`	0.5	0 - 0.9	Scale augmentation range
`fliplr`	0.5	0 - 1	Horizontal flip probability
`hsv_h`	0.015	0 - 0.1	Hue augmentation
`hsv_s`	0.7	0 - 1	Saturation augmentation
`hsv_v`	0.4	0 - 1	Value augmentation
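The `lr0` and `lrf` rows interact: the final learning rate is lr0 × lrf. The sketch below shows a linear decay between those two endpoints. The linear shape is an assumption for illustration (trainers may use a cosine curve instead), and `linear_lr` is a hypothetical helper, not a library function.

```python
def linear_lr(epoch, epochs, lr0=0.01, lrf=0.01):
    """Learning rate at `epoch`, decaying linearly from lr0 to lr0 * lrf."""
    remaining = 1.0 - epoch / epochs            # 1.0 at the start, 0.0 at the end
    return lr0 * (remaining * (1.0 - lrf) + lrf)


for epoch in (0, 50, 100):
    print(epoch, f"{linear_lr(epoch, epochs=100):.6f}")
```

With the defaults, the rate falls by a factor of 100 over training, so lowering `lr0` for fine-tuning also lowers the final rate unless `lrf` is raised to compensate.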

3.3 Transfer Learning Strategy¶

from ultralytics import YOLO

# Strategy 1: Freeze backbone, train head only (first 5-10 epochs)
model = YOLO("yolov8m.pt")
model.train(
    data="data.yaml",
    epochs=5,
    freeze=10,         # Freeze layers 0-9 (backbone)
    lr0=0.01,
    name="phase1_head_only"
)

# Strategy 2: Unfreeze and fine-tune everything (next 50+ epochs)
model = YOLO("runs/detect/phase1_head_only/weights/best.pt")
model.train(
    data="data.yaml",
    epochs=50,
    freeze=0,          # Unfreeze all layers
    lr0=0.001,         # Lower LR for full fine-tuning
    name="phase2_full_finetune"
)

# Strategy 3: Progressive unfreezing (most thorough)
model = YOLO("yolov8m.pt")
for phase, (freeze_layers, epochs, lr) in enumerate([
    (15, 5, 0.01),    # Train head + neck
    (10, 10, 0.005),  # Unfreeze more layers
    (5, 15, 0.001),   # Unfreeze even more
    (0, 30, 0.0005),  # Fine-tune everything
], 1):
    model.train(              # train() returns metrics, not a model
        data="data.yaml",
        epochs=epochs,
        freeze=freeze_layers,
        lr0=lr,
        name=f"phase{phase}"
    )
    # Reload best weights from this phase
    model = YOLO(f"runs/detect/phase{phase}/weights/best.pt")

3.4 Multi-GPU Training¶

# PyTorch DDP (Distributed Data Parallel) - recommended
yolo detect train data=data.yaml model=yolov8m.pt epochs=100 batch=32 device=0,1

# For 4 GPUs:
yolo detect train data=data.yaml model=yolov8x.pt epochs=100 batch=64 device=0,1,2,3

# In Ultralytics, `batch` is the total batch size and is split evenly across the listed GPUs
# Rule of thumb: scale batch size linearly with GPU count, adjust LR with sqrt

# Python API
from ultralytics import YOLO

model = YOLO("yolov8m.pt")
model.train(
    data="data.yaml",
    epochs=100,
    batch=16,         # Total batch, split across the GPUs (8 per GPU here)
    device="0,1",     # Use 2 GPUs
    imgsz=640,
    workers=8,        # Data loader workers per GPU
    name="multi_gpu_train"
)
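The batch and learning-rate rule of thumb can be written out as arithmetic. This is a sketch only: the helper names are hypothetical, and whether a trainer's `batch` argument means the total or the per-device size should be confirmed for the version in use.

```python
import math


def per_gpu_batch(total_batch, num_gpus):
    """Split a total batch evenly across GPUs."""
    if total_batch % num_gpus:
        raise ValueError("total batch must be divisible by the GPU count")
    return total_batch // num_gpus


def scaled_lr(base_lr, base_batch, new_batch, rule="sqrt"):
    """Scale a learning rate when the total batch size changes."""
    ratio = new_batch / base_batch
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    if rule == "linear":
        return base_lr * ratio
    raise ValueError(f"unknown rule: {rule}")


print(per_gpu_batch(32, 2))
print(scaled_lr(0.01, base_batch=16, new_batch=64, rule="sqrt"))
print(scaled_lr(0.01, base_batch=16, new_batch=64, rule="linear"))
```

Square-root scaling is the more conservative choice; linear scaling usually needs a longer warmup to stay stable.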

3.5 YOLO Validation and Export¶

from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")

# Validate on test set
metrics = model.val(
    data="data.yaml",
    split="test",
    imgsz=640,
    batch=16,
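    conf=0.25,                # Drops detections below 0.25; the default (0.001) is the usual setting for reporting mAP
    iou=0.6,                  # NMS IoU threshold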
    device="0",
)
print(f"mAP50: {metrics.box.map50:.4f}")
print(f"mAP50-95: {metrics.box.map:.4f}")

# Export to various formats
model.export(format="onnx", imgsz=640, simplify=True)          # ONNX
model.export(format="engine", imgsz=640, half=True)             # TensorRT FP16 (GPU)
model.export(format="engine", imgsz=640, int8=True, data="data.yaml", device=0)  # TensorRT INT8 (data is used for calibration)
model.export(format="tflite", imgsz=640)                        # TFLite (mobile)
model.export(format="coreml", imgsz=640)                        # CoreML (Apple)

4. Training Segmentation Models (SAM)¶

4.1 Fine-tuning SAM on Custom Data¶

The Segment Anything Model (SAM) supports prompt-based segmentation. Fine-tuning adapts it to your specific domain:

import os

import cv2
import torch
from segment_anything import sam_model_registry, SamPredictor

# Load pre-trained SAM
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.to("cuda")

# For custom fine-tuning, you need to modify the decoder
# The key is training the mask decoder with your domain-specific prompts

# Custom dataset loader for SAM fine-tuning
from torch.utils.data import Dataset, DataLoader

class SAMPromptDataset(Dataset):
    def __init__(self, images_dir, annotations_dir, transform=None):
        self.images = sorted([f for f in os.listdir(images_dir) 
                              if f.endswith(('.jpg', '.png'))])
        self.images_dir = images_dir
        self.annotations_dir = annotations_dir
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_name = self.images[idx]
        image = cv2.imread(os.path.join(self.images_dir, img_name))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        # Load ground truth mask
        mask = cv2.imread(
            os.path.join(self.annotations_dir, 
                         os.path.splitext(img_name)[0] + ".png"),
            cv2.IMREAD_GRAYSCALE
        )

        if self.transform:
            augmented = self.transform(image=image, mask=mask)
            image = augmented["image"]
            mask = augmented["mask"]

        return {
            "image": torch.tensor(image).permute(2, 0, 1).float() / 255.0,
            "mask": torch.tensor(mask).long(),
            "point_coords": torch.tensor([[128, 128]]),  # Example prompt
            "point_labels": torch.tensor([1]),             # 1 = foreground
        }

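# Training loop (simplified)
# Assumes every image is already 1024x1024 and normalized the way SAM expects
# (see sam.preprocess); the dataset above does not do this for you.
dataset = SAMPromptDataset("images/", "masks/")
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# Only the mask decoder is optimized; the encoders are left unchanged
optimizer = torch.optim.AdamW(sam.mask_decoder.parameters(), lr=1e-4)
loss_fn = torch.nn.BCEWithLogitsLoss()  # Binary mask: logits vs. {0, 1} targets

sam.train()
for epoch in range(50):
    for batch in loader:
        images = batch["image"].to("cuda")
        masks_gt = (batch["mask"].to("cuda") > 0).float()   # (B, H, W)
        coords = batch["point_coords"].float().to("cuda")   # (B, N, 2)
        labels = batch["point_labels"].to("cuda")           # (B, N)

        # Get image embeddings from the encoder
        with torch.no_grad():
            image_embeddings = sam.image_encoder(images)

        # The prompt encoder takes points as a (coords, labels) tuple
        sparse_embeddings, dense_embeddings = sam.prompt_encoder(
            points=(coords, labels),
            boxes=None,
            masks=None,
        )

        # Decode one image at a time: the stock decoder expects a single
        # image embedding and repeats it for each prompt
        low_res_masks = []
        for i in range(images.shape[0]):
            mask_i, _ = sam.mask_decoder(
                image_embeddings=image_embeddings[i:i + 1],
                image_pe=sam.prompt_encoder.get_dense_pe(),
                sparse_prompt_embeddings=sparse_embeddings[i:i + 1],
                dense_prompt_embeddings=dense_embeddings[i:i + 1],
                multimask_output=False,
            )
            low_res_masks.append(mask_i)
        low_res_masks = torch.cat(low_res_masks, dim=0)     # (B, 1, h, w)

        # Upsample to ground-truth resolution and compute loss
        pred_masks = torch.nn.functional.interpolate(
            low_res_masks, size=masks_gt.shape[-2:], mode="bilinear",
            align_corners=False,
        )
        loss = loss_fn(pred_masks.squeeze(1), masks_gt)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch+1}/50, Loss: {loss.item():.4f}")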

4.2 LoRA Fine-tuning for SAM¶

LoRA (Low-Rank Adaptation) is a parameter-efficient method that freezes the pre-trained weights and trains only a small number of additional low-rank parameters:

import torch
from peft import LoraConfig, get_peft_model
from segment_anything import sam_model_registry

# Load SAM
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# Apply LoRA to the image encoder attention layers.
# No task_type is set: the task-specific PEFT wrappers assume a text-model
# forward signature, while the generic wrapper passes inputs straight through.
lora_config = LoraConfig(
    r=8,                          # Rank of adaptation matrices
    lora_alpha=32,                # Scaling factor
    lora_dropout=0.1,
    target_modules=["qkv", "proj"],  # Attention layers to adapt
    bias="none",
)

# Wrap the image encoder with LoRA
sam.image_encoder = get_peft_model(sam.image_encoder, lora_config)

# Print trainable parameter count
trainable = sum(p.numel() for p in sam.parameters() if p.requires_grad)
total = sum(p.numel() for p in sam.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")

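# Now train with the standard loop as above, with two changes:
#   - do not wrap the image encoder in torch.no_grad(), or no gradient reaches LoRA
#   - pass the trainable encoder parameters to the optimizer as well
# The encoder's original weights are frozen and only its LoRA matrices update;
# the prompt encoder and mask decoder stay trainable unless you freeze them too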
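The saving comes from simple arithmetic: for a linear layer with weight shape `out_features × in_features`, LoRA adds two matrices of shapes `r × in_features` and `out_features × r`. The sketch below works this out for one fused qkv projection, assuming an embedding width of 768 (the ViT-B size); `lora_extra_params` is a hypothetical helper.

```python
def lora_extra_params(in_features, out_features, rank):
    """Number of parameters LoRA adds to one linear layer."""
    return rank * (in_features + out_features)


embed_dim = 768                      # assumed ViT-B width
qkv_out = 3 * embed_dim              # fused query/key/value projection
full = embed_dim * qkv_out           # weights in the frozen layer (bias ignored)
extra = lora_extra_params(embed_dim, qkv_out, rank=8)
print(f"frozen: {full:,}  LoRA: {extra:,}  ({100 * extra / full:.2f}%)")
```

Doubling the rank doubles the added parameters, so `r` is the main knob for trading capacity against memory.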

4.3 Fine-tuning Mask R-CNN on Custom Data¶

import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn_v2
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def get_maskrcnn_model(num_classes):
    """Create a Mask R-CNN model for custom segmentation."""
    model = maskrcnn_resnet50_fpn_v2(weights="DEFAULT")

    # Replace the classifier head
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # Replace the mask predictor
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    model.roi_heads.mask_predictor = MaskRCNNPredictor(
        in_features_mask, hidden_layer, num_classes
    )
    return model

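# Custom dataset for Mask R-CNN (COCO-style)
import numpy as np
from torchvision.datasets import CocoDetection
from torchvision.transforms.functional import to_tensor

class CocoSegDataset(CocoDetection):
    """Wraps CocoDetection to return boxes, labels, and binary masks."""
    def __getitem__(self, idx):
        img, anns = super().__getitem__(idx)

        # Process annotations for Mask R-CNN
        boxes, labels, masks = [], [], []
        for ann in anns:
            x, y, w, h = ann["bbox"]                 # COCO: [x, y, w, h]
            if w <= 0 or h <= 0:
                continue                             # skip degenerate boxes
            boxes.append([x, y, x + w, y + h])       # torchvision: [x1, y1, x2, y2]
            labels.append(ann["category_id"])        # 0 is reserved for background
            masks.append(self.coco.annToMask(ann))   # (H, W) binary mask

        # Images without annotations should be filtered out before training
        target = {
            "boxes": torch.tensor(boxes, dtype=torch.float32).reshape(-1, 4),
            "labels": torch.tensor(labels, dtype=torch.int64),
            "masks": torch.as_tensor(np.array(masks), dtype=torch.uint8),
        }
        return to_tensor(img), target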

# Training loop (paths are placeholders)
dataset = CocoSegDataset("images/train", "annotations.json")
dataloader = torch.utils.data.DataLoader(
    dataset, batch_size=2, shuffle=True,
    collate_fn=lambda batch: tuple(zip(*batch)),  # keep images and targets as per-image sequences
)

model = get_maskrcnn_model(num_classes=4)  # 3 classes + background
model.to("cuda")
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

model.train()
for epoch in range(20):
    for images, targets in dataloader:
        images = [img.to("cuda") for img in images]
        targets = [{k: v.to("cuda") for k, v in t.items()} for t in targets]

        # In training mode the model returns a dict of scalar loss tensors
        loss_dict = model(images, targets)
        total_loss = sum(loss_dict.values())

        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

    lr_scheduler.step()
    print(f"Epoch {epoch+1}: loss = {total_loss.item():.4f}")

5. Training Depth Estimation¶

5.1 Fine-tuning MiDaS/DPT on Custom Stereo Pairs¶

import os

import numpy as np
import torch
import torch.nn as nn
from PIL import Image
from transformers import DPTForDepthEstimation, DPTImageProcessor

# Load pre-trained DPT model
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")
processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
model.to("cuda")

# Custom dataset for depth fine-tuning
class StereoDepthDataset(torch.utils.data.Dataset):
    """
    Expects directory structure:
    dataset/
    ├── left/          # Left stereo images
    ├── right/         # Right stereo images
    └── depth/         # Ground truth depth maps (numpy .npy)
    """
    def __init__(self, root_dir, transform=None):
        self.left_dir = os.path.join(root_dir, "left")
        self.depth_dir = os.path.join(root_dir, "depth")
        self.images = sorted(os.listdir(self.left_dir))
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = Image.open(os.path.join(self.left_dir, self.images[idx]))
        depth = np.load(os.path.join(self.depth_dir,
                        os.path.splitext(self.images[idx])[0] + ".npy"))

        if self.transform:
            img = self.transform(img)

        depth = torch.tensor(depth, dtype=torch.float32).unsqueeze(0)
        return {"pixel_values": img, "depth": depth}

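# Fine-tuning loop
# Convert each PIL image to a model-ready tensor with the DPT processor
def to_pixels(img):
    return processor(images=img.convert("RGB"), return_tensors="pt")["pixel_values"][0]

dataloader = torch.utils.data.DataLoader(
    StereoDepthDataset("dataset/", transform=to_pixels), batch_size=4, shuffle=True
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# Smooth L1 is a robust regression loss, not a scale-invariant one. This DPT
# checkpoint predicts relative inverse depth, so put the ground truth in a
# comparable space or swap in a scale-and-shift-invariant loss.
loss_fn = nn.SmoothL1Loss()

model.train()
for epoch in range(30):
    for batch in dataloader:
        pixel_values = batch["pixel_values"].to("cuda")
        gt_depth = batch["depth"].to("cuda")              # (B, 1, H, W)

        outputs = model(pixel_values=pixel_values)
        pred_depth = outputs.predicted_depth.unsqueeze(1)  # (B, h, w) -> (B, 1, h, w)

        # Interpolate to ground truth size
        pred_depth = nn.functional.interpolate(
            pred_depth, size=gt_depth.shape[-2:], mode="bicubic", align_corners=False
        )

        loss = loss_fn(pred_depth, gt_depth)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch+1}: depth_loss = {loss.item():.4f}")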

5.2 Self-supervised Depth Estimation Training¶

Train depth from stereo pairs without ground truth using photometric consistency:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MonoDepthLoss(nn.Module):
    """
    Self-supervised monocular depth estimation loss
    (based on Monodepth2: Godard et al., 2019)
    """
    def __init__(self, alpha_ssim=0.85, alpha_l1=0.15):
        super().__init__()
        self.alpha_ssim = alpha_ssim
        self.alpha_l1 = alpha_l1

    def ssim_loss(self, pred, target, window_size=11):
        """Structural similarity loss."""
        C1 = 0.01 ** 2
        C2 = 0.03 ** 2

        mu_pred = F.avg_pool2d(pred, window_size, stride=1, padding=window_size // 2)
        mu_target = F.avg_pool2d(target, window_size, stride=1, padding=window_size // 2)

        mu_pred_sq = mu_pred ** 2
        mu_target_sq = mu_target ** 2
        mu_cross = mu_pred * mu_target

        sigma_pred_sq = F.avg_pool2d(pred ** 2, window_size, 1, window_size // 2) - mu_pred_sq
        sigma_target_sq = F.avg_pool2d(target ** 2, window_size, 1, window_size // 2) - mu_target_sq
        sigma_cross = F.avg_pool2d(pred * target, window_size, 1, window_size // 2) - mu_cross

        ssim_map = ((2 * mu_cross + C1) * (2 * sigma_cross + C2)) / \
                   ((mu_pred_sq + mu_target_sq + C1) * (sigma_pred_sq + sigma_target_sq + C2))
        return torch.clamp((1 - ssim_map) / 2, 0, 1).mean()

    def photometric_loss(self, pred_image, target_image):
        """Combined L1 + SSIM photometric loss."""
        l1 = (pred_image - target_image).abs().mean()
        ssim = self.ssim_loss(pred_image, target_image)
        return self.alpha_ssim * ssim + self.alpha_l1 * l1

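    def forward(self, pred_depth, pred_image_left, target_image_left,
                K=None, T_cam_to_right=None, smoothness_weight=1e-3):
        """
        pred_depth: predicted depth map (B, 1, H, W)
        pred_image_left: left image reconstructed by warping the right image
        target_image_left: original left image (B, 3, H, W)
        K: camera intrinsic matrix (B, 3, 3)
        T_cam_to_right: extrinsic transform from left to right camera (B, 4, 4)
        K and T_cam_to_right are unused here: the warp is computed before this call.
        smoothness_weight: assumed default, tune for your data
        """
        # Photometric loss: the reconstruction should match the real left image
        photo_loss = self.photometric_loss(pred_image_left, target_image_left)

        # Edge-aware smoothness: penalize depth gradients, less so at image edges
        depth_grad_x = (pred_depth[:, :, :, :-1] - pred_depth[:, :, :, 1:]).abs()
        depth_grad_y = (pred_depth[:, :, :-1, :] - pred_depth[:, :, 1:, :]).abs()
        img_grad_x = (target_image_left[:, :, :, :-1]
                      - target_image_left[:, :, :, 1:]).abs().mean(1, keepdim=True)
        img_grad_y = (target_image_left[:, :, :-1, :]
                      - target_image_left[:, :, 1:, :]).abs().mean(1, keepdim=True)
        smooth_loss = (depth_grad_x * torch.exp(-img_grad_x)).mean() + \
                      (depth_grad_y * torch.exp(-img_grad_y)).mean()

        return {
            "photometric_loss": photo_loss,
            "smoothness_loss": smooth_loss,
            "total_loss": photo_loss + smoothness_weight * smooth_loss,
        }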

# Usage in training
loss_fn = MonoDepthLoss()
# ... training loop with stereo pairs, warp the right image using predicted depth

Recommended datasets for depth training:

Dataset	Type	Size	Use Case
KITTI Depth	Real stereo + LiDAR	86K images	Autonomous driving
Make3D	Real outdoor	534 images	Outdoor depth
NYU Depth V2	Indoor RGB-D	1,449 labeled images (464 scenes)	Indoor robotics
ScanNet	Indoor RGB-D	1,513 scans	Indoor 3D understanding

6. Evaluation & Validation¶

6.1 Standard Metrics¶

mAP (mean Average Precision): Primary metric for object detection.

from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")
metrics = model.val(data="data.yaml", split="test")

# Key metrics
print(f"mAP50:    {metrics.box.map50:.4f}")     # AP at IoU=0.50
print(f"mAP50-95: {metrics.box.map:.4f}")        # AP averaged over IoU 0.50:0.95
print(f"Precision: {metrics.box.mp:.4f}")        # Mean precision
print(f"Recall:    {metrics.box.mr:.4f}")        # Mean recall

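# Per-class metrics
# The per-class arrays only cover classes that appear in the evaluated split,
# so map each position back to its class id before looking up the name.
names = metrics.names
for i, class_id in enumerate(metrics.box.ap_class_index):
    p, r = metrics.box.p[i], metrics.box.r[i]
    ap50, ap = metrics.box.ap50[i], metrics.box.ap[i]
    print(f"  {names[class_id]:15s} P={p:.3f}  R={r:.3f}  AP50={ap50:.3f}  AP50-95={ap:.3f}")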

IoU (Intersection over Union): Measures overlap between predicted and ground truth boxes.

IoU = Area of Intersection / Area of Union

   ┌───────────────┐
   │ GT            │
   │      ┌────────┼──────┐
   │      │ overlap│      │
   └──────┼────────┘      │
          │          Pred │
          └───────────────┘

IoU > 0.5  → typically considered a "correct" detection
IoU > 0.75 → stricter threshold for precise localization
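
The formula above can be sketched in a few lines of plain Python. This assumes axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates; the function name `box_iou` is illustrative.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union = sum of both areas minus the part counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

ground_truth = (0, 0, 10, 10)
prediction = (5, 5, 15, 15)
print(f"IoU = {box_iou(ground_truth, prediction):.3f}")  # 25 / 175
```

Here the two 10x10 boxes share a 5x5 region, so the prediction would not count as correct at the 0.5 threshold.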

6.2 Confusion Matrix Analysis¶

from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")
results = model.val(data="data.yaml", split="test", plots=True)
# Confusion matrix saved to runs/detect/val/confusion_matrix.png

Key patterns to look for in confusion matrices:

                    Predicted
                 cat    dog    background
True  cat    [  85      5      10  ]   → 10 missed cats
      dog    [   3     88       9  ]   → 9 missed dogs
      bg     [   2      1     97  ]    → 3 false positives

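With true classes on rows and predictions on columns, as in the sketch above:

Diagonal dominance = good
Off-diagonal clusters = systematic misclassification
High background row = false positives
High background column = false negatives (missed objects)

Check the axis labels on the saved plot before reading it: Ultralytics typically draws True on the x-axis and Predicted on the y-axis, so rows and columns are swapped relative to this sketch.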
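
Per-class precision and recall can be read directly off such a matrix. The sketch below uses the illustrative counts from the example, with true classes on rows and predictions on columns; the function name `per_class_pr` is hypothetical.

```python
labels = ["cat", "dog", "background"]
# Rows are true classes, columns are predicted classes.
matrix = [
    [85, 5, 10],
    [3, 88, 9],
    [2, 1, 97],
]

def per_class_pr(matrix, labels):
    """Return {label: (precision, recall)} from a square confusion matrix."""
    stats = {}
    for i, name in enumerate(labels):
        tp = matrix[i][i]
        actual = sum(matrix[i])                    # row sum: all true instances
        predicted = sum(row[i] for row in matrix)  # column sum: all predictions
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        stats[name] = (precision, recall)
    return stats

for name, (precision, recall) in per_class_pr(matrix, labels).items():
    if name != "background":
        print(f"{name}: precision={precision:.3f} recall={recall:.3f}")
```

For the cat class this gives recall 85/100 and precision 85/90, which makes the cost of the 10 missed cats and the 5 spurious cat predictions explicit.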

6.3 Precision-Recall Curves¶

from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")
results = model.val(data="data.yaml", split="test", plots=True)

# Plots are auto-saved to the run directory: PR_curve.png, F1_curve.png, P_curve.png, R_curve.png
# (file names may carry a prefix such as "Box" depending on the Ultralytics version)

Interpreting the curves:

PR curve: Area under curve = AP. Curves closer to top-right = better.
F1 curve: Peak of the F1 curve suggests optimal confidence threshold.
Precision curve: Generally rises as the confidence threshold increases, though not strictly monotonically.
Recall curve: Decreases as confidence threshold increases.
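
To make the relationship between these curves concrete, the sketch below builds a PR curve from scored detections, integrates it into AP, and picks the confidence with the highest F1. It assumes each detection has already been matched to ground truth and is given as a `(confidence, is_true_positive)` pair; all function names are illustrative. It uses all-point interpolation, whereas library implementations usually interpolate at fixed recall points, so values can differ slightly.

```python
def pr_points(detections, n_ground_truth):
    """Return (confidence, precision, recall) after each detection, highest confidence first."""
    tp = fp = 0
    points = []
    for conf, is_tp in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((conf, tp / (tp + fp), tp / n_ground_truth))
    return points

def average_precision(points):
    """Area under the PR curve using the monotone precision envelope."""
    precisions = [p for _, p, _ in points]
    recalls = [0.0] + [r for _, _, r in points]
    # Make precision non-increasing from left to right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    return sum((recalls[i + 1] - recalls[i]) * p for i, p in enumerate(precisions))

def best_f1_confidence(points):
    """Confidence threshold at which F1 peaks."""
    def f1(point):
        _, p, r = point
        return 2 * p * r / (p + r) if (p + r) else 0.0
    return max(points, key=f1)[0]

detections = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False)]
points = pr_points(detections, n_ground_truth=4)
print(f"AP = {average_precision(points):.4f}")
print(f"Best F1 at conf = {best_f1_confidence(points)}")
```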

6.4 Test-Time Augmentation (TTA)¶

TTA applies multiple augmentations at inference and aggregates predictions:

from ultralytics import YOLO

model = YOLO("best.pt")

# Basic inference
results = model.predict("test_image.jpg", conf=0.25)

# With TTA (slower but more accurate)
results_tta = model.predict(
    "test_image.jpg",
    conf=0.25,
    augment=True,   # Enable TTA
)
# TTA applies horizontal flip and multiple scales, then NMS merges results
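
The flip-and-merge idea can be illustrated without any model. The sketch below assumes boxes in `(x1, y1, x2, y2, conf)` pixel format and a simple greedy NMS; the helper names are hypothetical, and the merging that Ultralytics performs internally may differ in detail.

```python
def unflip_boxes(boxes, image_width):
    """Map boxes predicted on a horizontally flipped image back to the original frame."""
    return [(image_width - x2, y1, image_width - x1, y2, conf)
            for x1, y1, x2, y2, conf in boxes]

def iou(a, b):
    """IoU of two boxes; only the first four fields (corners) are used."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, iou_thr=0.5):
    """Greedy NMS: keep the highest-confidence box, drop overlapping lower-confidence ones."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= iou_thr for k in kept):
            kept.append(box)
    return kept

original_preds = [(68, 20, 90, 41, 0.80)]   # prediction on the original image
flipped_preds = [(10, 20, 30, 40, 0.90)]    # prediction on the mirrored image
merged = nms(original_preds + unflip_boxes(flipped_preds, image_width=100))
print(merged)
```

Both passes found the same object, so after un-flipping the two boxes overlap heavily and NMS keeps only the more confident one.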

7. Model Optimization for Deployment¶

7.1 Quantization¶

Reduce model precision from FP32 to INT8 or FP16 for faster inference:

from ultralytics import YOLO

model = YOLO("best.pt")

# FP16 export (2x memory reduction, minimal accuracy loss)
model.export(format="engine", half=True)

# INT8 quantization (4x reduction, slight accuracy trade-off)
# Requires a calibration dataset
model.export(
    format="engine",
    int8=True,
    data="calibration_data.yaml",  # Small representative dataset
)

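# ONNX export, then post-training quantization with onnxruntime
model.export(format="onnx", simplify=True)  # writes best.onnx next to best.pt

import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

# Post-training dynamic quantization (weights stored as INT8).
# For convolution-heavy models, the ONNX Runtime docs recommend static
# quantization (quantize_static with calibration data) instead.
quantize_dynamic(
    model_input="best.onnx",
    model_output="best_int8.onnx",
    weight_type=QuantType.QInt8,
)

# Benchmark
session = ort.InferenceSession("best_int8.onnx", providers=["CPUExecutionProvider"])
# Compare inference time and accuracy of best.onnx (FP32) vs best_int8.onnx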
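
To see where the "slight accuracy trade-off" comes from, it helps to look at the arithmetic behind INT8 quantization. The sketch below assumes the simplest scheme, symmetric per-tensor quantization with a single scale; TensorRT and ONNX Runtime choose their own schemes (per-channel scales, zero points, calibration), so treat this as an illustration only.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: x is approximated by q * scale."""
    max_abs = float(np.abs(x).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an FP32 approximation of the original values."""
    return q.astype(np.float32) * scale

weights = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(weights)
max_error = float(np.abs(dequantize(q, scale) - weights).max())

print(f"scale={scale:.5f}  max abs error={max_error:.5f}")
print(f"bytes: FP32={weights.nbytes}  INT8={q.nbytes}")
```

The rounding error per weight is bounded by half the scale, and the scale grows with the largest magnitude in the tensor. This is why outliers hurt INT8 accuracy and why a representative calibration set matters.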

7.2 Pruning¶

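Remove redundant weights to reduce model size. Unstructured pruning, as shown here, sets weights to zero but leaves tensor shapes unchanged, so file size and latency only improve with sparse storage or a sparsity-aware runtime:

import torch
import torch.nn.utils.prune as prune
from ultralytics import YOLO

model = YOLO("best.pt").model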

# Global magnitude pruning - remove 30% of smallest weights
parameters_to_prune = []
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        parameters_to_prune.append((module, "weight"))

prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.3,  # Remove 30% of weights
)

# Make pruning permanent
for module, param_name in parameters_to_prune:
    prune.remove(module, param_name)

# Count sparsity
total = 0
pruned = 0
for module, _ in parameters_to_prune:
    total += module.weight.nelement()
    pruned += torch.sum(module.weight == 0).item()
print(f"Sparsity: {100. * pruned / total:.1f}%")
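
One property of global pruning is easy to miss: the threshold is shared across layers, so layers with small weights can be pruned far more heavily than the overall percentage suggests. The NumPy sketch below illustrates this with two synthetic "layers"; it is a simplified stand-in for the idea, not a reimplementation of `prune.global_unstructured`, and the function name is hypothetical.

```python
import numpy as np

def global_magnitude_prune(layers, amount):
    """Zero the `amount` fraction of weights with the smallest magnitude across all layers."""
    magnitudes = np.concatenate([np.abs(w).ravel() for w in layers])
    k = int(round(amount * magnitudes.size))
    if k == 0:
        return [w.copy() for w in layers]
    threshold = np.sort(magnitudes)[k - 1]  # k-th smallest magnitude overall
    return [np.where(np.abs(w) <= threshold, 0.0, w) for w in layers]

rng = np.random.default_rng(0)
layers = [
    rng.normal(scale=1.0, size=(8, 8)),   # layer with large weights
    rng.normal(scale=0.1, size=(4, 4)),   # layer with small weights
]
pruned = global_magnitude_prune(layers, amount=0.3)

for i, w in enumerate(pruned):
    print(f"layer {i}: {100.0 * np.mean(w == 0):.1f}% zeros")
```

If a layer ends up almost entirely zeroed, consider pruning per layer or excluding sensitive layers such as the detection head, then fine-tune and re-validate.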

7.3 Knowledge Distillation¶

Train a smaller student model to mimic a larger teacher:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    def __init__(self, temperature=4.0, alpha=0.7):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha

    def forward(self, student_logits, teacher_logits, gt_labels):
        # Soft targets (knowledge from teacher)
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=1),
            F.softmax(teacher_logits / self.temperature, dim=1),
            reduction="batchmean"
        ) * (self.temperature ** 2)

        # Hard targets (ground truth)
        hard_loss = F.cross_entropy(student_logits, gt_labels)

        return self.alpha * soft_loss + (1 - self.alpha) * hard_loss

# Distillation pipeline
from ultralytics import YOLO

teacher = YOLO("yolov8x.pt")  # Large teacher
student = YOLO("yolov8n.pt")  # Small student

# Train student to mimic teacher's soft predictions
# Apply the distillation loss to the class logits, in combination with the standard detection loss
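
The role of the temperature in `DistillationLoss` is easiest to see on a single set of logits. This small standard-library sketch, with made-up teacher logits, shows how a higher temperature spreads probability mass onto the non-top classes, which is the extra signal the student learns from.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [6.0, 2.0, 1.0]  # illustrative values
hard = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

print("T=1:", [round(p, 3) for p in hard])
print("T=4:", [round(p, 3) for p in soft])
```

At T=1 almost all mass sits on the top class; at T=4 the relative ranking of the other classes becomes visible. The `temperature ** 2` factor in the loss compensates for the smaller gradients that softened targets produce.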

7.4 ONNX Export and Optimization¶

from ultralytics import YOLO

model = YOLO("best.pt")

# Export to ONNX
model.export(
    format="onnx",
    imgsz=640,
    simplify=True,    # Apply ONNX simplifier
    dynamic=False,     # Fixed input size (faster)
    opset=17,         # ONNX opset version
)

# Optimize with onnxruntime
import onnxruntime as ort

# GPU inference
session_gpu = ort.InferenceSession(
    "best.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# CPU inference (optimized)
# Set options before creating the session; changing them afterwards has no effect
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session_cpu = ort.InferenceSession(
    "best.onnx",
    sess_options=sess_options,
    providers=["CPUExecutionProvider"],
)

# Benchmark
import time
import numpy as np

dummy_input = np.random.randn(1, 3, 640, 640).astype(np.float32)
input_name = session_gpu.get_inputs()[0].name

# Warmup
for _ in range(10):
    session_gpu.run(None, {input_name: dummy_input})

# Benchmark
times = []
for _ in range(100):
    start = time.perf_counter()  # monotonic, high-resolution timer
    session_gpu.run(None, {input_name: dummy_input})
    times.append(time.perf_counter() - start)

print(f"ONNX GPU: {np.mean(times)*1000:.1f} ms/img")
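
For real-time robotics, tail latency usually matters more than the mean. The helper below is a small standard-library sketch that wraps the warmup-then-measure pattern for any callable and also reports the median and 95th percentile; the name `benchmark` and the returned keys are illustrative.

```python
import statistics
import time

def benchmark(fn, warmup=10, runs=100):
    """Time fn() and return mean, median, and p95 latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches, lazy initialisation, GPU kernels
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
    }

stats = benchmark(lambda: sum(range(10_000)), warmup=3, runs=50)
print({k: round(v, 4) for k, v in stats.items()})
```

With an ONNX Runtime session, the same helper could be called as `benchmark(lambda: session_gpu.run(None, {input_name: dummy_input}))`, and again for `session_cpu` to compare providers.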

TensorRT optimization (fastest NVIDIA GPU inference):

# Export directly to TensorRT engine
model.export(format="engine", imgsz=640, half=True)

# Or convert from ONNX
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

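with open("best.onnx", "rb") as f:
    if not parser.parse(f.read()):
        # Surface parser diagnostics instead of building from a broken network
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse best.onnx")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB
config.set_flag(trt.BuilderFlag.FP16)  # Enable FP16

engine = builder.build_serialized_network(network, config)
if engine is None:
    raise RuntimeError("TensorRT engine build failed")
with open("best.trt", "wb") as f:
    f.write(engine)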

8. MLOps for Robotics¶

8.1 Experiment Tracking¶

Weights & Biases:

import wandb

wandb.init(
    project="robotics-perception",
    name="yolov8m_finetune_v3",
    config={
        "model": "yolov8m",
        "epochs": 100,
        "lr": 0.001,
        "batch_size": 16,
        "imgsz": 640,
        "augmentations": ["mosaic", "mixup", "hsv"],
    }
)

from ultralytics import YOLO
model = YOLO("yolov8m.pt")
results = model.train(
    data="data.yaml",
    epochs=100,
    lr0=0.001,      # Keep in sync with the config logged above
    batch=16,
    imgsz=640,
    project="robotics-perception",
    name="yolov8m_finetune_v3",
)

# Log metrics manually if needed
wandb.log({"mAP50": results.results_dict.get("metrics/mAP50(B)", 0)})
wandb.finish()

MLflow:

import mlflow
from ultralytics import YOLO

mlflow.set_experiment("robotics-perception")

with mlflow.start_run(run_name="yolov8m_finetune_v3"):
    # Log parameters
    mlflow.log_param("model", "yolov8m")
    mlflow.log_param("epochs", 100)
    mlflow.log_param("learning_rate", 0.001)

    # Train (pass the logged values so the record matches the run)
    model = YOLO("yolov8m.pt")
    results = model.train(data="data.yaml", epochs=100, lr0=0.001)

    # Log metrics
    mlflow.log_metric("mAP50", results.results_dict["metrics/mAP50(B)"])
    mlflow.log_metric("mAP50-95", results.results_dict["metrics/mAP50-95(B)"])

    # Log model artifact (the run directory is not always "train"; it can be train2, train3, ...)
    mlflow.log_artifact(str(model.trainer.best))

8.2 Model Versioning and Registry¶

model_registry/
├── models/
│   ├── detection/
│   │   ├── v1.0/
│   │   │   ├── model.pt
│   │   │   ├── metadata.json
│   │   │   └── eval_report.html
│   │   ├── v1.1/
│   │   └── v2.0/
│   ├── segmentation/
│   │   └── v1.0/
│   └── depth/
│       └── v1.0/
└── metadata.json  # Global registry

Example metadata.json per model version:

{
  "model_name": "robot_detection",
  "version": "2.0",
  "framework": "ultralytics_yolov8",
  "base_model": "yolov8m",
  "training_data": "robot_dataset_v3",
  "training_date": "2026-04-15",
  "metrics": {
    "mAP50": 0.87,
    "mAP50-95": 0.62,
    "inference_ms": 8.3,
    "model_size_mb": 52.4
  },
  "target_hardware": "Jetson Orin Nano",
  "quantization": "FP16",
  "approved_by": "team_lead",
  "status": "production"
}
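
A deployment script can use this layout to pick the model to ship. The sketch below assumes the directory structure and `metadata.json` fields shown above; the function names, the set of required keys, and any status value other than "production" are hypothetical. Versions are compared numerically, so "1.10" sorts after "1.9".

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"model_name", "version", "metrics", "status"}

def parse_version(version):
    """Turn '1.10' into (1, 10) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))

def load_entries(task_dir):
    """Read every v*/metadata.json under e.g. model_registry/models/detection."""
    entries = []
    for meta_path in sorted(Path(task_dir).glob("v*/metadata.json")):
        entry = json.loads(meta_path.read_text())
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"{meta_path}: missing keys {sorted(missing)}")
        entries.append(entry)
    return entries

def latest_production(entries):
    """Return the highest-versioned entry whose status is 'production', or None."""
    candidates = [e for e in entries if e["status"] == "production"]
    if not candidates:
        return None
    return max(candidates, key=lambda e: parse_version(e["version"]))
```

Typical usage would be `latest_production(load_entries("model_registry/models/detection"))`, followed by a check that `target_hardware` matches the robot being updated.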

8.3 CI/CD for Model Retraining¶

# .github/workflows/retrain.yml
name: Model Retraining Pipeline

on:
  push:
    paths:
      - 'training_data/**'
  workflow_dispatch:
    inputs:
      model_type:
        description: 'Model to retrain'
        required: true
        type: choice
        options:
          - detection
          - segmentation
          - depth

jobs:
  train:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install ultralytics wandb opencv-python

      - name: Train model
        run: |
          python train.py \
            --model-type ${{ inputs.model_type || 'detection' }} \
            --epochs 100 \
            --data training_data/data.yaml
        env:
          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}

      - name: Run evaluation
        # Must exit non-zero when mAP50 is below the threshold so deployment is skipped
        run: |
          python evaluate.py \
            --model runs/detect/train/weights/best.pt \
            --data training_data/data.yaml \
            --threshold 0.8

      - name: Deploy if metrics pass
        if: success()
        run: |
          python deploy.py \
            --model runs/detect/train/weights/best.pt \
            --format onnx \
            --target jetson
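
The workflow only gates deployment if `evaluate.py` exits with a non-zero status when the metric is too low. The source does not show that script, so the following is a hypothetical sketch of what it could look like. The argument names mirror the workflow; `passes_gate`, `evaluate_with_ultralytics`, and `main` are illustrative names.

```python
import argparse

def passes_gate(metrics, threshold, key="mAP50"):
    """True if the chosen metric meets the minimum required value."""
    return metrics.get(key, 0.0) >= threshold

def evaluate_with_ultralytics(model_path, data_path):
    """Run validation on the test split and return the metrics the gate needs."""
    from ultralytics import YOLO  # imported lazily so the gate logic runs without it
    metrics = YOLO(model_path).val(data=data_path, split="test")
    return {"mAP50": float(metrics.box.map50)}

def main(argv=None, evaluate=evaluate_with_ultralytics):
    parser = argparse.ArgumentParser(description="Fail the pipeline if mAP50 is too low")
    parser.add_argument("--model", required=True)
    parser.add_argument("--data", required=True)
    parser.add_argument("--threshold", type=float, default=0.8)
    args = parser.parse_args(argv)

    metrics = evaluate(args.model, args.data)
    ok = passes_gate(metrics, args.threshold)
    verdict = "PASS" if ok else "FAIL"
    print(f"{verdict}: mAP50={metrics['mAP50']:.4f} (threshold {args.threshold})")
    return 0 if ok else 1  # the exit code is what the CI step reacts to
```

In the actual script, the last line would be `raise SystemExit(main())` under an `if __name__ == "__main__":` guard, so a failing gate stops the job before the deploy step.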

9. Common Pitfalls¶

9.1 Overfitting¶

Symptoms: Training loss decreases, validation loss increases or plateaus.

Solutions:

# 1. Add regularization
model.train(
    data="data.yaml",
    weight_decay=0.01,       # L2 regularization
    dropout=0.2,             # Only applies to classification training in Ultralytics
    mixup=0.15,              # Mixup augmentation
    mosaic=1.0,              # Mosaic augmentation
    degrees=10,              # Rotation augmentation
    scale=0.5,               # Scale augmentation
)

# 2. Reduce model complexity
model = YOLO("yolov8s.pt")  # Use smaller model

# 3. Early stopping
model.train(
    data="data.yaml",
    patience=15,  # Stop if no improvement for 15 epochs
)

# 4. More data or better augmentation
# 5. Reduce number of epochs
# 6. Add dropout layers to custom heads
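
For custom training loops that do not go through `model.train`, the patience rule can be written by hand. The sketch below is one possible formulation, not the Ultralytics implementation: it tracks the best validation score seen so far and signals a stop after `patience` epochs without improvement. The `min_delta` parameter is an added assumption to ignore negligible gains.

```python
class EarlyStopping:
    """Stop training when the validation score has not improved for `patience` epochs."""

    def __init__(self, patience=15, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, score):
        """Record this epoch's score (higher is better); return True if training should stop."""
        if score > self.best_score + self.min_delta:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Illustrative validation mAP values per epoch.
history = [0.50, 0.60, 0.62, 0.61, 0.62, 0.60, 0.59]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_map in enumerate(history):
    if stopper.step(epoch, val_map):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best epoch {stopper.best_epoch}")
```

The checkpoint to keep is the one from `best_epoch`, not the last epoch trained.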

9.2 Class Imbalance¶

Symptoms: Model biased toward majority class, low recall on rare classes.