
Training Pipeline for Robotics Perception

A comprehensive guide to collecting data, training models, evaluating performance, and deploying optimized perception models for robotics applications.


Overview

Robotics perception systems must reliably detect, segment, and estimate depth for objects in unstructured, dynamic environments. Off-the-shelf models trained on general benchmarks (COCO, ImageNet) often fail when deployed on real robots because:

  • Domain shift: The robot's camera, mounting angle, and operating environment differ from training data.
  • Specific objects: Robots must recognize task-relevant objects (tools, gripper targets, obstacles) not present in public datasets.
  • Real-time constraints: Embedded compute on the robot requires optimized models (INT8, TensorRT).
  • Robustness: Variations in lighting, motion blur, partial occlusion, and sensor noise demand tailored training.

This guide walks through the full lifecycle—from raw images to a deployed model—covering detection (YOLO), segmentation (SAM, Mask R-CNN), depth estimation, evaluation, optimization, and MLOps practices.

Prerequisites: Python 3.8+, PyTorch 2.0+, CUDA 11.8+, basic familiarity with computer vision and neural networks.


Learning Objectives

  • Collect and annotate high-quality datasets for robotics tasks
  • Train and fine-tune detection, segmentation, and depth models
  • Evaluate models with standard metrics and analyze failure modes
  • Optimize models for real-time inference on edge hardware
  • Set up experiment tracking and CI/CD pipelines for model iteration

1. Data Collection

1.1 Manual Annotation Tools

| Tool | Format Support | Strengths | Best For |
|---|---|---|---|
| LabelImg | YOLO, Pascal VOC | Lightweight, simple UI | Quick bounding box annotation |
| CVAT | COCO, YOLO, Pascal VOC | Multi-user, video support, AI-assisted | Team annotation projects |
| Roboflow | All major formats | Auto-augment, export, hosting | End-to-end pipeline |
| Label Studio | JSON, COCO, VOC | Multi-modal (text, image, audio), ML backend | Complex annotation tasks |

LabelImg (quick start for bounding boxes):

pip install labelImg
labelImg  # Opens GUI, select folder and format

CVAT (team annotation with Docker):

docker compose -f docker-compose.yml up -d
# Access at http://localhost:8080
# Create project, define labels, invite annotators

Label Studio (with ML backend for active learning):

pip install label-studio label-studio-ml
label-studio start &
# Configure ML backend to pre-annotate with a pre-trained model

1.2 Automatic Annotation with Pre-trained Models

Manual annotation is slow. A faster workflow is to let a pre-trained model generate initial annotations and have human annotators correct its mistakes (human-in-the-loop):

from ultralytics import YOLO

# Load a pre-trained model (or your own model from a previous iteration)
model = YOLO("yolov8x.pt")

# Auto-annotate a folder of raw images
results = model.predict(
    source="raw_images/",
    save_txt=True,          # Save YOLO-format labels
    conf=0.5,               # Confidence threshold
    imgsz=640,
    save=True               # Save annotated images for visual review
)
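# Output: labels are written to runs/detect/predict/labels/ (one .txt per image),
# not next to the source images; the annotated previews go to runs/detect/predict/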
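
The human-in-the-loop step can be organised by routing predictions according to their confidence. The sketch below is one possible triage, not part of any library: `triage_detections` and both thresholds are hypothetical, and it assumes you have each detection's class ID and confidence (for example by running prediction with a lower `conf` and saving confidences alongside the labels).

```python
def triage_detections(detections, accept_conf=0.8, reject_conf=0.3):
    """Split detections into auto-accepted, needs-review, and discarded lists.

    `detections` is a list of (class_id, confidence) pairs.
    The two thresholds are illustrative and should be tuned per project.
    """
    accepted, review, discarded = [], [], []
    for det in detections:
        conf = det[1]
        if conf >= accept_conf:
            accepted.append(det)      # confident enough to keep as-is
        elif conf >= reject_conf:
            review.append(det)        # send to a human annotator
        else:
            discarded.append(det)     # most likely noise
    return accepted, review, discarded


dets = [(0, 0.93), (1, 0.55), (0, 0.12)]
accepted, review, discarded = triage_detections(dets)
print(len(accepted), len(review), len(discarded))
```

Annotators then only open the images that contain at least one "review" detection, which is usually a small fraction of the dataset.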

SAM-assisted labeling (Segment Anything for masks):

from segment_anything import sam_model_registry, SamPredictor
import cv2
import numpy as np

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")
predictor = SamPredictor(sam)

# For each image, provide point prompts or box prompts
image = cv2.imread("robot_scene.jpg")
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
predictor.set_image(image_rgb)

# Prompt with a bounding box (e.g., from YOLO detection)
box = np.array([100, 200, 400, 500])  # x1, y1, x2, y2
masks, scores, _ = predictor.predict(
    box=box,
    multimask_output=True
)
# Select the best mask
best_mask = masks[np.argmax(scores)]

For large-scale auto-annotation, consider Autodistill, which chains foundation models to generate annotations in YOLO format automatically.

1.3 Data Collection Best Practices for Robotics

| Principle | Why It Matters | How to Achieve It |
|---|---|---|
| Varying lighting | Robots operate under different conditions | Capture in sunlight, shadows, indoors, artificial light |
| Multiple camera angles | Robot cameras may be mounted differently | Mount camera at different heights and angles |
| Diverse backgrounds | Clutter confuses models | Collect in different rooms, workspaces, outdoor areas |
| Include edge cases | Models fail on unusual configurations | Add partially occluded, distant, or motion-blurred objects |
| Represent target domain | Domain gap causes failures | Use the actual robot camera and mounting position |
| Sufficient quantity | More data generally improves generalization | Aim for 500+ images per class minimum; 2000+ ideal |
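
To check the "sufficient quantity" guideline against a labeled dataset, it helps to count how many images contain each class. The sketch below works on YOLO label lines that have already been read into memory; `images_per_class` and `underrepresented` are hypothetical helper names, and the threshold of 500 is simply the minimum from the table.

```python
from collections import Counter


def images_per_class(labels_by_image):
    """Count how many images contain each class.

    `labels_by_image` maps an image name to its YOLO label lines
    ("class_id cx cy w h"). A class is counted once per image.
    """
    counts = Counter()
    for lines in labels_by_image.values():
        classes = {int(line.split()[0]) for line in lines if line.strip()}
        counts.update(classes)
    return counts


def underrepresented(counts, class_ids, minimum=500):
    """Return the class IDs that appear in fewer than `minimum` images."""
    return [c for c in class_ids if counts.get(c, 0) < minimum]


labels = {
    "img001": ["0 0.5 0.5 0.2 0.2", "0 0.3 0.3 0.1 0.1", "1 0.7 0.7 0.1 0.1"],
    "img002": ["1 0.4 0.4 0.2 0.2"],
    "img003": [],  # background-only image
}
counts = images_per_class(labels)
print(dict(counts))
print(underrepresented(counts, class_ids=[0, 1, 2], minimum=2))
```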

Data collection script example (save a frame from a webcam at a fixed interval):

import cv2
import os
import time

cap = cv2.VideoCapture(0)
output_dir = "collected_data"
os.makedirs(output_dir, exist_ok=True)

frame_count = 0
interval = 0.5     # seconds between captures
last_saved = 0.0   # monotonic timestamp of the last saved frame

while True:
    ret, frame = cap.read()
    if not ret:
        break
    cv2.imshow("Preview (press 'q' to quit)", frame)

    # Save a frame once `interval` seconds have passed since the last save
    now = time.monotonic()
    if now - last_saved >= interval:
        filename = os.path.join(output_dir, f"img_{frame_count:06d}.jpg")
        cv2.imwrite(filename, frame)
        print(f"Saved: {filename}")
        frame_count += 1
        last_saved = now

    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
print(f"Total images saved: {frame_count}")

1.4 Augmentation Strategies

Augmentation artificially increases dataset diversity and improves generalization.

Photometric augmentations (color/lighting changes):

import albumentations as A

photometric_transform = A.Compose([
    A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=0.7),
    A.HueSaturationValue(hue_shift_limit=20, sat_shift_limit=30, val_shift_limit=20, p=0.5),
    A.CLAHE(clip_limit=4.0, p=0.3),
    A.RandomGamma(gamma_limit=(80, 120), p=0.3),
    A.GaussNoise(var_limit=(10.0, 50.0), p=0.3),
])

Geometric augmentations (spatial transforms):

import cv2
import albumentations as A

geometric_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    A.ShiftScaleRotate(
        shift_limit=0.1,
        scale_limit=0.2,
        rotate_limit=15,
        border_mode=cv2.BORDER_CONSTANT,
        p=0.7
    ),
    A.Perspective(scale=(0.05, 0.1), p=0.3),
], bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]))

Advanced augmentations:

# Cutout / Coarse Dropout - randomly mask patches
cutout_transform = A.CoarseDropout(
    max_holes=8, max_height=32, max_width=32,
    min_holes=1, min_height=8, min_width=8,
    fill_value=0, p=0.5
)

# Mixup - blend two images and their labels
# (typically done at training time, see YOLO's mixup parameter)

# Mosaic - combine 4 images into one (YOLO's default augmentation)
# Controlled by the 'mosaic' parameter in YOLO training
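
Mixup is only described in a comment above, so here is a minimal NumPy sketch of the idea for detection data: blend two same-sized images with a Beta-distributed weight and keep the boxes of both. The function name `mixup` and the default `alpha=0.2` are assumptions for illustration; training frameworks apply their own variant internally.

```python
import numpy as np


def mixup(image_a, boxes_a, image_b, boxes_b, alpha=0.2, rng=None):
    """Blend two images of identical shape and merge their box lists.

    Returns the blended image, the combined boxes, and the mixing weight.
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))  # mixing weight in [0, 1]
    mixed = lam * image_a.astype(np.float32) + (1.0 - lam) * image_b.astype(np.float32)
    # For detection, both sets of objects remain visible, so keep all boxes
    return mixed.astype(image_a.dtype), boxes_a + boxes_b, lam


img_a = np.zeros((4, 4, 3), dtype=np.uint8)
img_b = np.full((4, 4, 3), 200, dtype=np.uint8)
mixed, boxes, lam = mixup(img_a, [[0, 0, 2, 2]], img_b, [[1, 1, 3, 3]],
                          rng=np.random.default_rng(0))
print(mixed.shape, len(boxes), round(lam, 3))
```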

Applying augmentations to a labeled dataset:

import os

import cv2

def augment_dataset(image_dir, label_dir, output_dir, transform, num_augments=3):
    """Apply augmentation to an entire dataset."""
    os.makedirs(f"{output_dir}/images", exist_ok=True)
    os.makedirs(f"{output_dir}/labels", exist_ok=True)

    for img_name in os.listdir(image_dir):
        if not img_name.endswith((".jpg", ".png")):
            continue
        image = cv2.imread(f"{image_dir}/{img_name}")
        # Load YOLO-format labels
        label_path = f"{label_dir}/{os.path.splitext(img_name)[0]}.txt"
        bboxes, class_labels = [], []
        if os.path.exists(label_path):
            with open(label_path) as f:
                for line in f.readlines():
                    parts = line.strip().split()
                    cls = int(parts[0])
                    x, y, w, h = map(float, parts[1:5])
                    # Convert YOLO to Pascal VOC (x1, y1, x2, y2)
                    h_img, w_img = image.shape[:2]
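                    # Clip to the image so rounding in the label file
                    # cannot produce an out-of-bounds box
                    x1 = max(0.0, (x - w / 2) * w_img)
                    y1 = max(0.0, (y - h / 2) * h_img)
                    x2 = min(float(w_img), (x + w / 2) * w_img)
                    y2 = min(float(h_img), (y + h / 2) * h_img)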
                    bboxes.append([x1, y1, x2, y2])
                    class_labels.append(cls)

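        # Save the original image together with its label file (if it has one)
        cv2.imwrite(f"{output_dir}/images/{img_name}", image)
        if os.path.exists(label_path):
            orig_label = f"{output_dir}/labels/{os.path.splitext(img_name)[0]}.txt"
            with open(label_path) as src, open(orig_label, "w") as dst:
                dst.write(src.read())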

        # Create augmented versions
        for i in range(num_augments):
            transformed = transform(
                image=image,
                bboxes=bboxes,
                class_labels=class_labels
            )
            aug_name = f"{os.path.splitext(img_name)[0]}_aug{i}.jpg"
            cv2.imwrite(f"{output_dir}/images/{aug_name}", transformed["image"])

            # Convert back to YOLO format and save
            h_img, w_img = transformed["image"].shape[:2]
            with open(f"{output_dir}/labels/{os.path.splitext(aug_name)[0]}.txt", "w") as f:
                for bbox, cls in zip(transformed["bboxes"], transformed["class_labels"]):
                    x1, y1, x2, y2 = bbox
                    cx = ((x1 + x2) / 2) / w_img
                    cy = ((y1 + y2) / 2) / h_img
                    bw = (x2 - x1) / w_img
                    bh = (y2 - y1) / h_img
                    f.write(f"{int(cls)} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}\n")

augment_dataset("raw_images", "raw_labels", "augmented_dataset", 
                transform=geometric_transform, num_augments=3)

2. Dataset Formats

2.1 YOLO Format

Each image has a corresponding .txt file. Each line: class_id center_x center_y width height (normalized 0-1).

project/
├── images/
│   ├── train/
│   │   ├── img001.jpg
│   │   └── img002.jpg
│   └── val/
│       ├── img010.jpg
│       └── img011.jpg
├── labels/
│   ├── train/
│   │   ├── img001.txt
│   │   └── img002.txt
│   └── val/
│       ├── img010.txt
│       └── img011.txt
└── data.yaml

Example label file (img001.txt):

0 0.512500 0.483203 0.235000 0.462500
1 0.213750 0.651042 0.120000 0.287500
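
To make the normalized coordinates concrete, the following sketch converts one YOLO label line back to pixel corner coordinates. `yolo_line_to_pixels` is a hypothetical helper, and the 640×480 image size is an assumption chosen for the example.

```python
def yolo_line_to_pixels(line, img_w, img_h):
    """Convert "class cx cy w h" (normalized) to (class, (x1, y1, x2, y2)) in pixels."""
    cls, cx, cy, w, h = line.split()
    cx, w = float(cx) * img_w, float(w) * img_w   # scale horizontal values by width
    cy, h = float(cy) * img_h, float(h) * img_h   # scale vertical values by height
    return int(cls), (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# A box centered in a 640x480 image, a quarter as wide and half as tall
print(yolo_line_to_pixels("0 0.5 0.5 0.25 0.5", 640, 480))
# → (0, (240.0, 120.0, 400.0, 360.0))
```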

Example data.yaml:

path: /absolute/path/to/project
train: images/train
val: images/val
test: images/test  # optional

nc: 3  # number of classes
names: ["bottle", "cup", "tool"]
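
Label files and data.yaml drift apart easily, for example when a class is removed but old labels remain. A small sanity check such as the sketch below can catch this before training. `validate_label_line` is a hypothetical helper for detection labels only (segmentation labels have more than five fields).

```python
def validate_label_line(line, num_classes):
    """Return a list of problems found in one YOLO detection label line."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        cls = int(parts[0])
        coords = [float(v) for v in parts[1:]]
    except ValueError:
        return ["non-numeric field"]

    problems = []
    if not 0 <= cls < num_classes:
        problems.append(f"class id {cls} outside 0..{num_classes - 1}")
    if any(not 0.0 <= v <= 1.0 for v in coords):
        problems.append("coordinate outside [0, 1]")
    if coords[2] <= 0 or coords[3] <= 0:
        problems.append("non-positive width or height")
    return problems


# nc is 3 in the data.yaml above, so valid class IDs are 0, 1, 2
print(validate_label_line("0 0.512500 0.483203 0.235000 0.462500", num_classes=3))
print(validate_label_line("3 0.5 0.5 0.1 0.1", num_classes=3))
```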

2.2 COCO Format

JSON-based, single annotation file for the entire dataset.

{
  "images": [
    {"id": 1, "file_name": "img001.jpg", "width": 640, "height": 480}
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 0,
      "bbox": [100, 50, 150, 220],
      "area": 33000,
      "segmentation": [[100, 50, 250, 50, 250, 270, 100, 270]],
      "iscrowd": 0
    }
  ],
  "categories": [
    {"id": 0, "name": "bottle"},
    {"id": 1, "name": "cup"}
  ]
}

Note: COCO bbox format is [x_top_left, y_top_left, width, height] in pixels.
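
As a quick check of that convention, the sketch below converts the example bbox to corner coordinates and recomputes its area. The two helper names are hypothetical.

```python
def coco_bbox_to_corners(bbox):
    """[x, y, width, height] -> [x1, y1, x2, y2]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def coco_bbox_area(bbox):
    """Area of a COCO box in square pixels."""
    return bbox[2] * bbox[3]


bbox = [100, 50, 150, 220]          # the bbox from the JSON example above
print(coco_bbox_to_corners(bbox))   # → [100, 50, 250, 270]
print(coco_bbox_area(bbox))         # → 33000
```

The corners match the polygon in the "segmentation" field, and the area matches the "area" field.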

2.3 Pascal VOC Format

XML-based, one file per image.

<annotation>
  <filename>img001.jpg</filename>
  <size>
    <width>640</width>
    <height>480</height>
  </size>
  <object>
    <name>bottle</name>
    <bndbox>
      <xmin>100</xmin>
      <ymin>50</ymin>
      <xmax>250</xmax>
      <ymax>270</ymax>
    </bndbox>
  </object>
</annotation>

2.4 Format Conversion Scripts

COCO to YOLO:

import json
import os

def coco_to_yolo(coco_json_path, output_dir):
    """Convert COCO annotations to YOLO format."""
    os.makedirs(output_dir, exist_ok=True)

    with open(coco_json_path) as f:
        coco = json.load(f)

    # Build lookup tables
    img_lookup = {img["id"]: img for img in coco["images"]}

    # Group annotations by image
    ann_by_image = {}
    for ann in coco["annotations"]:
        img_id = ann["image_id"]
        if img_id not in ann_by_image:
            ann_by_image[img_id] = []
        ann_by_image[img_id].append(ann)

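    # YOLO class IDs must be contiguous and start at 0, while COCO category IDs
    # are arbitrary (often 1-based, sometimes with gaps), so remap them in sorted order
    cat_to_yolo = {
        cat_id: idx
        for idx, cat_id in enumerate(sorted(c["id"] for c in coco["categories"]))
    }

    for img_id, annotations in ann_by_image.items():
        img = img_lookup[img_id]
        w, h = img["width"], img["height"]
        txt_path = os.path.join(output_dir, 
                                os.path.splitext(img["file_name"])[0] + ".txt")

        with open(txt_path, "w") as f:
            for ann in annotations:
                x, y, bw, bh = ann["bbox"]  # COCO: x, y, w, h (pixels)
                cx = (x + bw / 2) / w
                cy = (y + bh / 2) / h
                nw = bw / w
                nh = bh / h
                cls_id = cat_to_yolo[ann["category_id"]]
                f.write(f"{cls_id} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}\n")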

    print(f"Converted {len(ann_by_image)} images to YOLO format in {output_dir}")

coco_to_yolo("annotations.json", "labels_yolo/")

Pascal VOC to YOLO:

import xml.etree.ElementTree as ET
import os

def voc_to_yolo(voc_dir, output_dir):
    """Convert Pascal VOC XML annotations to YOLO format."""
    os.makedirs(output_dir, exist_ok=True)
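    # Class IDs are assigned in first-seen order; pre-fill this list with your
    # class names if the IDs must match an existing data.yaml
    class_names = []

    # Sort the file list so the class-to-ID mapping is reproducible between runs
    for xml_file in sorted(os.listdir(voc_dir)):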
        if not xml_file.endswith(".xml"):
            continue
        tree = ET.parse(os.path.join(voc_dir, xml_file))
        root = tree.getroot()

        size = root.find("size")
        w = int(size.find("width").text)
        h = int(size.find("height").text)

        txt_name = os.path.splitext(xml_file)[0] + ".txt"
        with open(os.path.join(output_dir, txt_name), "w") as f:
            for obj in root.findall("object"):
                name = obj.find("name").text
                if name not in class_names:
                    class_names.append(name)
                cls_id = class_names.index(name)

                bbox = obj.find("bndbox")
                xmin = float(bbox.find("xmin").text)
                ymin = float(bbox.find("ymin").text)
                xmax = float(bbox.find("xmax").text)
                ymax = float(bbox.find("ymax").text)

                cx = ((xmin + xmax) / 2) / w
                cy = ((ymin + ymax) / 2) / h
                bw = (xmax - xmin) / w
                bh = (ymax - ymin) / h

                f.write(f"{cls_id} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}\n")

    print(f"Classes: {class_names}")

voc_to_yolo("Annotations/", "labels_yolo/")

Recommended tools for batch conversion: Roboflow and FiftyOne both offer programmatic and UI-based format conversion.


3. Training Detection Models (YOLO)

3.1 Full Ultralytics Training Pipeline

from ultralytics import YOLO

# Option A: Train from scratch with a YAML config (random initial weights)
model = YOLO("yolov8n.yaml")  # Nano model (3.2M params)
results = model.train(
    data="path/to/data.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    name="robot_detection_v1",
    device="0",               # GPU index, or "cpu"
    patience=20,              # Early stopping patience
    save=True,
    save_period=10,           # Save checkpoint every N epochs
    val=True,
    plots=True,
)

# Option B: Fine-tune from pre-trained weights (recommended)
model = YOLO("yolov8m.pt")  # Medium model, pre-trained on COCO
results = model.train(
    data="path/to/data.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    name="robot_detection_finetune",
    device="0",
    freeze=10,                # Freeze first 10 layers (transfer learning)
    optimizer="SGD",          # Set explicitly: with the default optimizer="auto", lr0 is ignored
    lr0=0.001,                # Lower learning rate for fine-tuning
)

# Option C: Train from CLI
# yolo detect train data=data.yaml model=yolov8m.pt epochs=100 imgsz=640

Model size comparison:

| Model | Params | mAP50-95 (COCO val) | Speed, A100 TensorRT (ms) | Best For |
|---|---|---|---|---|
| YOLOv8n | 3.2M | 37.3 | 0.99 | Edge devices, Jetson Nano |
| YOLOv8s | 11.2M | 44.9 | 1.20 | Jetson Orin Nano |
| YOLOv8m | 25.9M | 50.2 | 1.83 | Desktop GPU |
| YOLOv8l | 43.7M | 52.9 | 2.39 | Desktop GPU |
| YOLOv8x | 68.2M | 53.9 | 3.53 | Server GPU, highest accuracy |

3.2 Hyperparameter Tuning

Ultralytics can search the hyperparameter space automatically with model.tune():

from ultralytics import YOLO

model = YOLO("yolov8m.pt")

# Tuning sweep with the built-in genetic-algorithm tuner
# (pass use_ray=True to run the search with Ray Tune instead)
results = model.tune(
    data="data.yaml",
    epochs=30,              # Fewer epochs per trial than a full training run
    iterations=100,         # Number of tuning trials
    optimizer="AdamW",
    plots=True,
    save=True,
    device="0",
)
# Saves the best hyperparameters to runs/detect/tune/best_hyperparameters.yaml

Manual hyperparameter guide:

| Parameter | Default | Tuning Range | When to Adjust |
|---|---|---|---|
| lr0 | 0.01 | 0.0001 - 0.01 | Overfitting → lower; underfitting → higher |
| lrf | 0.01 | 0.001 - 0.1 | Final LR ratio (final LR = lr0 × lrf) |
| momentum | 0.937 | 0.8 - 0.99 | SGD momentum |
| weight_decay | 0.0005 | 0.0001 - 0.01 | Regularization strength |
| warmup_epochs | 3.0 | 1 - 5 | More warmup for small datasets |
| warmup_momentum | 0.8 | 0.5 - 0.95 | Initial momentum during warmup |
| box | 7.5 | 1 - 20 | Box loss gain |
| cls | 0.5 | 0.1 - 5 | Classification loss gain |
| dfl | 1.5 | 0.5 - 5 | Distribution focal loss gain |
| mosaic | 1.0 | 0 - 1 | Disable (0) for small objects |
| mixup | 0.0 | 0 - 0.5 | Add for regularization |
| copy_paste | 0.0 | 0 - 1 | Copy-paste augmentation |
| degrees | 0.0 | 0 - 45 | Rotation augmentation (degrees) |
| scale | 0.5 | 0 - 0.9 | Scale augmentation range |
| fliplr | 0.5 | 0 - 1 | Horizontal flip probability |
| hsv_h | 0.015 | 0 - 0.1 | Hue augmentation |
| hsv_s | 0.7 | 0 - 1 | Saturation augmentation |
| hsv_v | 0.4 | 0 - 1 | Value augmentation |
3.3 Transfer Learning Strategy

from ultralytics import YOLO

# Note: lr0 only takes effect when the optimizer is set explicitly;
# with the default optimizer="auto", Ultralytics picks its own learning rate.

# Strategy 1, phase 1: Freeze backbone, train the rest (first 5-10 epochs)
model = YOLO("yolov8m.pt")
model.train(
    data="data.yaml",
    epochs=5,
    freeze=10,         # Freeze layers 0-9 (backbone)
    optimizer="SGD",
    lr0=0.01,
    name="phase1_head_only"
)

# Strategy 1, phase 2: Unfreeze and fine-tune everything (next 50+ epochs)
model = YOLO("runs/detect/phase1_head_only/weights/best.pt")
model.train(
    data="data.yaml",
    epochs=50,
    freeze=0,          # Unfreeze all layers
    optimizer="SGD",
    lr0=0.001,         # Lower LR for full fine-tuning
    name="phase2_full_finetune"
)

# Strategy 2: Progressive unfreezing (most thorough)
model = YOLO("yolov8m.pt")
for phase, (freeze_layers, epochs, lr) in enumerate([
    (15, 5, 0.01),    # Freeze backbone and part of the neck
    (10, 10, 0.005),  # Freeze backbone only
    (5, 15, 0.001),   # Unfreeze the later backbone layers
    (0, 30, 0.0005),  # Fine-tune everything
], 1):
    # train() returns metrics, not a model, so do not reassign `model` here
    model.train(
        data="data.yaml",
        epochs=epochs,
        freeze=freeze_layers,
        optimizer="SGD",
        lr0=lr,
        name=f"phase{phase}"
    )
    # Reload best weights from this phase as the starting point for the next
    model = YOLO(f"runs/detect/phase{phase}/weights/best.pt")

3.4 Multi-GPU Training

# PyTorch DDP (Distributed Data Parallel) - recommended
yolo detect train data=data.yaml model=yolov8m.pt epochs=100 batch=32 device=0,1

# For 4 GPUs:
yolo detect train data=data.yaml model=yolov8x.pt epochs=100 batch=64 device=0,1,2,3

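# In Ultralytics, batch is the TOTAL batch size and is split evenly across the
# listed GPUs (batch=32 on 2 GPUs means 16 images per GPU)
# Rule of thumb: scale the total batch size linearly with GPU count and scale
# the learning rate by the square root of the same factor

# Python API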
from ultralytics import YOLO

model = YOLO("yolov8m.pt")
model.train(
    data="data.yaml",
    epochs=100,
    batch=32,         # Total batch size (16 per GPU on 2 GPUs)
    device=[0, 1],    # Use 2 GPUs
    imgsz=640,
    workers=8,        # Data loader workers per GPU
    name="multi_gpu_train"
)
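
The batch-size and learning-rate rule of thumb mentioned in the comments above can be written out as a small calculation. `scale_lr` is a hypothetical helper; both scaling rules are heuristics, so treat the result as a starting point for tuning rather than a guaranteed optimum.

```python
import math


def scale_lr(base_lr, base_batch, new_batch, rule="sqrt"):
    """Scale a learning rate when the total batch size changes."""
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule}")


# Going from 1 GPU (batch 16) to 4 GPUs (batch 64)
print(scale_lr(0.01, 16, 64, rule="sqrt"))    # → 0.02
print(scale_lr(0.01, 16, 64, rule="linear"))  # → 0.04
```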

3.5 YOLO Validation and Export

from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")

# Validate on test set
metrics = model.val(
    data="data.yaml",
    split="test",
    imgsz=640,
    batch=16,
    conf=0.001,       # Keep this low for mAP; a cutoff such as 0.25 truncates the PR curve
    iou=0.6,
    device="0",
)
print(f"mAP50: {metrics.box.map50:.4f}")
print(f"mAP50-95: {metrics.box.map:.4f}")

# Export to various formats
model.export(format="onnx", imgsz=640, simplify=True)          # ONNX
model.export(format="engine", imgsz=640, half=True, device=0)   # TensorRT FP16 (GPU)
model.export(format="engine", imgsz=640, int8=True, data="data.yaml", device=0)  # TensorRT INT8 (data is used for calibration)
model.export(format="tflite", imgsz=640)                        # TFLite (mobile)
model.export(format="coreml", imgsz=640)                        # CoreML (Apple)

4. Training Segmentation Models (SAM)

4.1 Fine-tuning SAM on Custom Data

The Segment Anything Model (SAM) supports prompt-based segmentation. Fine-tuning adapts it to your specific domain:

import os

import cv2
import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from segment_anything import sam_model_registry

# Load pre-trained SAM
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.to("cuda")

# Train only the mask decoder; keep the image and prompt encoders frozen
for p in sam.image_encoder.parameters():
    p.requires_grad = False
for p in sam.prompt_encoder.parameters():
    p.requires_grad = False

IMG_SIZE = sam.image_encoder.img_size  # Square input size the encoder expects

# Custom dataset loader for SAM fine-tuning
class SAMPromptDataset(Dataset):
    """Image/mask pairs resized to SAM's input size, with one point prompt each."""
    def __init__(self, images_dir, annotations_dir, transform=None):
        self.images = sorted([f for f in os.listdir(images_dir) 
                              if f.endswith(('.jpg', '.png'))])
        self.images_dir = images_dir
        self.annotations_dir = annotations_dir
        self.transform = transform  # Optional albumentations transform (NumPy in, NumPy out)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_name = self.images[idx]
        image = cv2.imread(os.path.join(self.images_dir, img_name))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        # Load ground truth mask
        mask = cv2.imread(
            os.path.join(self.annotations_dir, 
                         os.path.splitext(img_name)[0] + ".png"),
            cv2.IMREAD_GRAYSCALE
        )

        if self.transform:
            augmented = self.transform(image=image, mask=mask)
            image = augmented["image"]
            mask = augmented["mask"]

        # Simplification: resize straight to a square (aspect ratio is not preserved)
        image = cv2.resize(image, (IMG_SIZE, IMG_SIZE))
        mask = cv2.resize(mask, (IMG_SIZE, IMG_SIZE), interpolation=cv2.INTER_NEAREST)
        mask = (mask > 0).astype(np.float32)  # Binarize to {0, 1}

        # Use a random pixel inside the object as the foreground point prompt
        ys, xs = np.nonzero(mask)
        if len(xs) > 0:
            i = np.random.randint(len(xs))
            point = [float(xs[i]), float(ys[i])]
        else:
            point = [IMG_SIZE / 2, IMG_SIZE / 2]  # Empty mask: fall back to the center

        return {
            "image": torch.from_numpy(image).permute(2, 0, 1).float(),   # RGB, 0-255
            "mask": torch.from_numpy(mask).unsqueeze(0),                 # (1, H, W)
            "point_coords": torch.tensor([point], dtype=torch.float32),  # (1, 2) as (x, y)
            "point_labels": torch.tensor([1], dtype=torch.int64),        # 1 = foreground
        }

# Training loop (simplified)
dataset = SAMPromptDataset("images/", "masks/")
loader = DataLoader(dataset, batch_size=4, shuffle=True)

optimizer = torch.optim.AdamW(sam.mask_decoder.parameters(), lr=1e-4)
loss_fn = torch.nn.BCEWithLogitsLoss()  # Binary mask: logits vs. {0, 1} targets

sam.mask_decoder.train()
for epoch in range(50):
    for batch in loader:
        images = batch["image"].to("cuda")
        masks_gt = batch["mask"].to("cuda")
        coords = batch["point_coords"].to("cuda")
        labels = batch["point_labels"].to("cuda")

        # Normalize with SAM's pixel mean/std, then encode with the frozen encoder
        with torch.no_grad():
            image_embeddings = sam.image_encoder(sam.preprocess(images))

        # The mask decoder handles one image at a time, so loop over the batch
        logits = []
        for i in range(images.shape[0]):
            sparse_embeddings, dense_embeddings = sam.prompt_encoder(
                points=(coords[i:i + 1], labels[i:i + 1]),
                boxes=None,
                masks=None,
            )
            low_res, _ = sam.mask_decoder(
                image_embeddings=image_embeddings[i:i + 1],
                image_pe=sam.prompt_encoder.get_dense_pe(),
                sparse_prompt_embeddings=sparse_embeddings,
                dense_prompt_embeddings=dense_embeddings,
                multimask_output=False,
            )
            logits.append(low_res)
        low_res_masks = torch.cat(logits, dim=0)  # (B, 1, h, w) low-resolution logits

        # Upsample to the ground-truth resolution and compute loss
        pred_masks = F.interpolate(
            low_res_masks, size=masks_gt.shape[-2:], mode="bilinear", align_corners=False
        )
        loss = loss_fn(pred_masks, masks_gt)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch+1}/50, Loss: {loss.item():.4f}")

4.2 LoRA Fine-tuning for SAM

LoRA (Low-Rank Adaptation) is parameter-efficient—only trains a small number of additional parameters:

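import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model
from segment_anything import sam_model_registry

# Load SAM
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# Apply LoRA to the image encoder attention layers.
# No task_type is set: the task-specific PEFT wrappers assume a Hugging Face
# text-model forward signature, which SAM's image encoder does not have.
lora_config = LoraConfig(
    r=8,                          # Rank of adaptation matrices
    lora_alpha=32,                # Scaling factor
    lora_dropout=0.1,
    target_modules=["qkv", "proj"],  # Attention projections (the "proj" suffix also matches the patch-embedding conv)
    bias="none",
)

# Wrap the image encoder with LoRA (its original weights are frozen)
sam.image_encoder = get_peft_model(sam.image_encoder, lora_config)

# Print trainable parameter count. This includes the prompt encoder and mask
# decoder, which stay trainable unless you freeze them yourself.
trainable = sum(p.numel() for p in sam.parameters() if p.requires_grad)
total = sum(p.numel() for p in sam.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")

# Now train with the standard loop as above, but run the image encoder outside
# torch.no_grad() so gradients can reach the LoRA weights, and pass the LoRA
# parameters to the optimizer.
# Inside the image encoder only LoRA parameters are updated; its base weights stay frozen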
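
To see why LoRA is parameter-efficient, it helps to write the adapted layer out by hand. The NumPy sketch below is an illustration of the idea, not the PEFT implementation: `LoRALinear` and `lora_param_count` are hypothetical names, and the 768 → 2304 layer size is chosen only as an example of a fused query/key/value projection.

```python
import numpy as np


def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by a rank-r adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


class LoRALinear:
    """y = x W^T + (alpha / r) * x A^T B^T, with W frozen and only A, B trainable."""

    def __init__(self, weight, r=8, alpha=32, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                           # frozen pre-trained weight
        self.A = rng.normal(0.0, 0.01, size=(r, d_in)) # small random init
        self.B = np.zeros((d_out, r))                  # zero init: adapter starts as a no-op
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T


full = 768 * 2304
added = lora_param_count(768, 2304, r=8)
print(f"full layer: {full:,}  LoRA adapter: {added:,}  ({100 * added / full:.2f}%)")

W = np.random.default_rng(1).normal(size=(6, 4))
x = np.ones((3, 4))
layer = LoRALinear(W, r=2, alpha=4)
print(np.allclose(layer(x), x @ W.T))  # adapter contributes nothing until B is trained
```

Because B starts at zero, the adapted model initially reproduces the pre-trained output exactly, and training only has to learn the low-rank correction.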

4.3 Fine-tuning Mask R-CNN on Custom Data

import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn_v2
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def get_maskrcnn_model(num_classes):
    """Create a Mask R-CNN model for custom segmentation."""
    model = maskrcnn_resnet50_fpn_v2(weights="DEFAULT")

    # Replace the classifier head
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # Replace the mask predictor
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    model.roi_heads.mask_predictor = MaskRCNNPredictor(
        in_features_mask, hidden_layer, num_classes
    )
    return model

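# Custom dataset for Mask R-CNN (COCO-style)
import numpy as np
from torch.utils.data import DataLoader
from torchvision.datasets import CocoDetection
from torchvision.transforms.functional import to_tensor

class CocoSegDataset(CocoDetection):
    """Wraps CocoDetection to return boxes, labels, and binary masks."""
    def __getitem__(self, idx):
        img, anns = super().__getitem__(idx)

        # Process targets for Mask R-CNN
        boxes, labels, masks = [], [], []
        for ann in anns:
            x, y, w, h = ann["bbox"]                 # COCO: [x, y, w, h]
            if w <= 0 or h <= 0:
                continue                             # Skip degenerate boxes
            boxes.append([x, y, x + w, y + h])       # torchvision expects [x1, y1, x2, y2]
            labels.append(ann["category_id"])        # Must be >= 1; 0 is reserved for background
            masks.append(self.coco.annToMask(ann))   # Polygon/RLE -> (H, W) binary mask

        if masks:
            mask_tensor = torch.from_numpy(np.stack(masks))
        else:
            mask_tensor = torch.zeros((0, img.height, img.width), dtype=torch.uint8)

        target = {
            "boxes": torch.tensor(boxes, dtype=torch.float32).reshape(-1, 4),
            "labels": torch.tensor(labels, dtype=torch.int64),
            "masks": mask_tensor,
        }
        return to_tensor(img), target

# Paths are placeholders; point them at your own images and COCO JSON file
dataset = CocoSegDataset(root="images/train", annFile="annotations.json")
dataloader = DataLoader(
    dataset, batch_size=2, shuffle=True,
    collate_fn=lambda batch: tuple(zip(*batch)),  # Images differ in size, so keep them as tuples
)

# Training loop
model = get_maskrcnn_model(num_classes=4)  # 3 classes + background
model.to("cuda")
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

model.train()
for epoch in range(20):
    for images, targets in dataloader:
        images = [img.to("cuda") for img in images]
        targets = [{k: v.to("cuda") for k, v in t.items()} for t in targets]

        # In train mode the model returns a dict of scalar losses
        loss_dict = model(images, targets)
        total_loss = sum(loss for loss in loss_dict.values())

        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

    lr_scheduler.step()
    print(f"Epoch {epoch+1}: loss = {total_loss.item():.4f}")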

5. Training Depth Estimation

5.1 Fine-tuning MiDaS/DPT on Custom Stereo Pairs

import os

import numpy as np
import torch
import torch.nn as nn
from PIL import Image
from torch.utils.data import DataLoader
from transformers import DPTForDepthEstimation, DPTImageProcessor

# Load pre-trained DPT model
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")
processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
model.to("cuda")

# Custom dataset for depth fine-tuning
class StereoDepthDataset(torch.utils.data.Dataset):
    """
    Expects directory structure:
    dataset/
    ├── left/          # Left stereo images
    ├── right/         # Right stereo images (not used for supervised fine-tuning)
    └── depth/         # Ground truth depth maps (numpy .npy), aligned to the left image
    """
    def __init__(self, root_dir, processor):
        self.left_dir = os.path.join(root_dir, "left")
        self.depth_dir = os.path.join(root_dir, "depth")
        self.images = sorted(os.listdir(self.left_dir))
        self.processor = processor

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = Image.open(os.path.join(self.left_dir, self.images[idx])).convert("RGB")
        depth = np.load(os.path.join(self.depth_dir,
                        os.path.splitext(self.images[idx])[0] + ".npy"))

        # The processor resizes and normalizes the image the way DPT expects
        pixel_values = self.processor(images=img, return_tensors="pt")["pixel_values"][0]

        depth = torch.tensor(depth, dtype=torch.float32).unsqueeze(0)  # (1, H, W)
        return {"pixel_values": pixel_values, "depth": depth}

# All depth maps must share one resolution for default batching
dataloader = DataLoader(StereoDepthDataset("dataset/", processor),
                        batch_size=4, shuffle=True)

# Fine-tuning loop
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# Robust regression loss. It is not scale-invariant, so predictions and
# ground truth must be expressed in the same units.
loss_fn = nn.SmoothL1Loss()

model.train()
for epoch in range(30):
    for batch in dataloader:
        pixel_values = batch["pixel_values"].to("cuda")
        gt_depth = batch["depth"].to("cuda")

        outputs = model(pixel_values=pixel_values)
        pred_depth = outputs.predicted_depth.unsqueeze(1)  # (B, H, W) -> (B, 1, H, W)

        # Interpolate to ground truth size
        pred_depth = nn.functional.interpolate(
            pred_depth, size=gt_depth.shape[-2:], mode="bicubic", align_corners=False
        )

        loss = loss_fn(pred_depth, gt_depth)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch+1}: depth_loss = {loss.item():.4f}")

5.2 Self-supervised Depth Estimation Training

Train depth from stereo pairs without ground truth using photometric consistency:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MonoDepthLoss(nn.Module):
    """
    Self-supervised monocular depth estimation loss
    (based on Monodepth2: Godard et al., 2019)
    """
    def __init__(self, alpha_ssim=0.85, alpha_l1=0.15):
        super().__init__()
        self.alpha_ssim = alpha_ssim
        self.alpha_l1 = alpha_l1

    def ssim_loss(self, pred, target, window_size=11):
        """Structural similarity loss."""
        C1 = 0.01 ** 2
        C2 = 0.03 ** 2

        mu_pred = F.avg_pool2d(pred, window_size, stride=1, padding=window_size // 2)
        mu_target = F.avg_pool2d(target, window_size, stride=1, padding=window_size // 2)

        mu_pred_sq = mu_pred ** 2
        mu_target_sq = mu_target ** 2
        mu_cross = mu_pred * mu_target

        sigma_pred_sq = F.avg_pool2d(pred ** 2, window_size, 1, window_size // 2) - mu_pred_sq
        sigma_target_sq = F.avg_pool2d(target ** 2, window_size, 1, window_size // 2) - mu_target_sq
        sigma_cross = F.avg_pool2d(pred * target, window_size, 1, window_size // 2) - mu_cross

        ssim_map = ((2 * mu_cross + C1) * (2 * sigma_cross + C2)) / \
                   ((mu_pred_sq + mu_target_sq + C1) * (sigma_pred_sq + sigma_target_sq + C2))
        return torch.clamp((1 - ssim_map) / 2, 0, 1).mean()

    def photometric_loss(self, pred_image, target_image):
        """Combined L1 + SSIM photometric loss."""
        l1 = (pred_image - target_image).abs().mean()
        ssim = self.ssim_loss(pred_image, target_image)
        return self.alpha_ssim * ssim + self.alpha_l1 * l1

    def forward(self, pred_depth, recon_left, target_left, smoothness_weight=1e-3):
        """
        pred_depth: predicted depth map for the left image (B, 1, H, W)
        recon_left: left image reconstructed by warping the right image with
                    pred_depth, the intrinsics K, and the left-to-right transform
        target_left: original left image (B, 3, H, W)
        smoothness_weight: weight of the edge-aware smoothness term
        """
        # Photometric loss: the reconstruction should match the real left image
        photo_loss = self.photometric_loss(recon_left, target_left)

        # Edge-aware smoothness: penalize depth gradients, except at image edges
        depth_grad_x = (pred_depth[:, :, :, :-1] - pred_depth[:, :, :, 1:]).abs()
        depth_grad_y = (pred_depth[:, :, :-1, :] - pred_depth[:, :, 1:, :]).abs()
        img_grad_x = (target_left[:, :, :, :-1] - target_left[:, :, :, 1:]).abs().mean(1, keepdim=True)
        img_grad_y = (target_left[:, :, :-1, :] - target_left[:, :, 1:, :]).abs().mean(1, keepdim=True)
        smooth_loss = (depth_grad_x * torch.exp(-img_grad_x)).mean() + \
                      (depth_grad_y * torch.exp(-img_grad_y)).mean()

        return {
            "photometric_loss": photo_loss,
            "smoothness_loss": smooth_loss,
            "total_loss": photo_loss + smoothness_weight * smooth_loss,
        }

# Usage in training
loss_fn = MonoDepthLoss()
# ... training loop with stereo pairs, warp the right image using predicted depth
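
Stereo supervision works because, for a rectified pair, depth and disparity are tied together by depth = focal_length × baseline / disparity. The NumPy sketch below shows that conversion; `disparity_to_depth` is a hypothetical helper, and the focal length (700 px) and baseline (0.125 m) are made-up calibration values for illustration.

```python
import numpy as np


def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert disparity in pixels to depth in metres for a rectified stereo pair.

    Pixels with (near-)zero disparity have no valid depth and are set to infinity.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > eps
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth


disp = np.array([35.0, 7.0, 0.0])
print(disparity_to_depth(disp, focal_px=700.0, baseline_m=0.125))
# → [ 2.5 12.5  inf]
```

Large disparities correspond to nearby objects, which is why depth error grows quickly for distant objects where disparity is only a few pixels.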

Recommended datasets for depth training:

| Dataset | Type | Size | Use Case |
|---|---|---|---|
| KITTI Depth | Real stereo + LiDAR | 86K images | Autonomous driving |
| Make3D | Real outdoor | 534 images | Outdoor depth |
| NYU Depth V2 | Indoor RGB-D | 1,449 labeled RGB-D pairs (464 scenes) | Indoor robotics |
| ScanNet | Indoor RGB-D | 1,513 scans | Indoor 3D understanding |

6. Evaluation & Validation

6.1 Standard Metrics

mAP (mean Average Precision): Primary metric for object detection.

from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")
metrics = model.val(data="data.yaml", split="test")

# Key metrics
print(f"mAP50:    {metrics.box.map50:.4f}")     # AP at IoU=0.50
print(f"mAP50-95: {metrics.box.map:.4f}")        # AP averaged over IoU 0.50:0.95
print(f"Precision: {metrics.box.mp:.4f}")        # Mean precision
print(f"Recall:    {metrics.box.mr:.4f}")        # Mean recall

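# Per-class metrics. The per-class arrays only cover classes that appear in the
# evaluated split, so map each row through ap_class_index instead of enumerate().
names = metrics.names
for cls_idx, p, r, ap50, ap in zip(
    metrics.box.ap_class_index, metrics.box.p, metrics.box.r,
    metrics.box.ap50, metrics.box.ap
):
    print(f"  {names[int(cls_idx)]:15s} P={p:.3f}  R={r:.3f}  AP50={ap50:.3f}  AP50-95={ap:.3f}")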

IoU (Intersection over Union): Measures overlap between predicted and ground truth boxes.

IoU = Area of Intersection / Area of Union

   ┌───────────────┐
   │ GT            │
   │      ┌────────┼──────┐
   │      │ overlap│      │
   └──────┼────────┘      │
          │          Pred │
          └───────────────┘

IoU > 0.5  → typically considered a "correct" detection
IoU > 0.75 → stricter threshold for precise localization
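
The ratio above can be computed directly from corner coordinates. The following is a minimal sketch, assuming boxes are given as (x1, y1, x2, y2) in pixels; the name `box_iou` is illustrative and not part of Ultralytics.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height clamp to zero when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


gt = (0, 0, 10, 10)
pred = (5, 5, 15, 15)
iou = box_iou(gt, pred)  # intersection area 25, union area 175
print(f"IoU = {iou:.3f}")
```

With these two boxes the overlap is well below 0.5, so the prediction would not count as a correct detection at either threshold listed above.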

6.2 Confusion Matrix Analysis

from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")
results = model.val(data="data.yaml", split="test", plots=True)
# Confusion matrix saved to runs/detect/val/confusion_matrix.png

Key patterns to look for in confusion matrices:

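                    Predicted
                 cat    dog    background
True  cat    [   85      5      10  ]   → 10 missed cats
      dog    [    3     88       9  ]   → 9 missed dogs
      bg     [    2      1      97  ]   → 3 false positives
  • Diagonal dominance = good
  • Off-diagonal clusters = systematic misclassification
  • High counts in the background row (outside the bg cell) = false positives
  • High counts in the background column (outside the bg cell) = false negatives (missed objects)
  • This sketch puts true classes on rows; the plot saved by Ultralytics is transposed (predicted on rows, true on columns), so swap "row" and "column" when reading it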
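
Per-class precision and recall follow from the column and row sums of such a matrix. Here is a small sketch using the example counts above, assuming rows are true classes and columns are predictions; the helper name is hypothetical.

```python
def per_class_precision_recall(matrix, labels, background="background"):
    """matrix[i][j] counts samples of true class labels[i] predicted as labels[j]."""
    stats = {}
    for k, name in enumerate(labels):
        if name == background:
            continue
        tp = matrix[k][k]
        predicted_as_k = sum(row[k] for row in matrix)  # column sum: TP + FP
        truly_k = sum(matrix[k])                        # row sum: TP + FN
        precision = tp / predicted_as_k if predicted_as_k else 0.0
        recall = tp / truly_k if truly_k else 0.0
        stats[name] = (precision, recall)
    return stats


labels = ["cat", "dog", "background"]
matrix = [
    [85, 5, 10],
    [3, 88, 9],
    [2, 1, 97],
]
stats = per_class_precision_recall(matrix, labels)
for name, (precision, recall) in stats.items():
    print(f"{name:5s} precision={precision:.3f} recall={recall:.3f}")
```

A class with high precision but low recall points at missed objects (background column), while the reverse points at false positives (background row).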

6.3 Precision-Recall Curves

from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")
results = model.val(data="data.yaml", plots=True)

# Plots are auto-saved to the validation run directory (e.g. runs/detect/val/):
# PR_curve.png, F1_curve.png, P_curve.png, R_curve.png

Interpreting the curves:

  • PR curve: Area under the curve = AP. Curves closer to the top-right corner are better.
  • F1 curve: The peak marks the confidence threshold that best balances precision and recall.
  • Precision curve: Generally rises as the confidence threshold increases.
  • Recall curve: Falls (never rises) as the confidence threshold increases.
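
To make the relationship between these curves concrete, the sketch below builds a PR curve from a handful of scored detections, integrates it into AP, and picks the confidence with the highest F1. It assumes each detection has already been matched to ground truth (true or false positive) and uses all-point interpolation, which is one common convention; a library's own AP computation may differ in interpolation details. All names and numbers are illustrative.

```python
def pr_curve(detections, num_gt):
    """detections: (confidence, is_true_positive) pairs.
    Returns (confidence, precision, recall) rows, highest confidence first."""
    rows, tp, fp = [], 0, 0
    for conf, is_tp in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        rows.append((conf, tp / (tp + fp), tp / num_gt))
    return rows


def average_precision(rows):
    """Area under the PR curve using the monotone precision envelope."""
    precisions = [p for _, p, _ in rows]
    recalls = [r for _, _, r in rows]
    # Replace each precision by the best precision at any higher recall
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def best_f1(rows):
    """Return (confidence, f1) at the F1 peak."""
    def f1(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0
    conf, p, r = max(rows, key=lambda row: f1(row[1], row[2]))
    return conf, f1(p, r)


detections = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.3, False)]
rows = pr_curve(detections, num_gt=4)
ap = average_precision(rows)
best_conf, best_score = best_f1(rows)
print(f"AP = {ap:.4f}, best F1 = {best_score:.2f} at conf >= {best_conf}")
```

The confidence returned by `best_f1` is the kind of value one would read off the F1 curve when choosing a deployment threshold.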

6.4 Test-Time Augmentation (TTA)

TTA applies multiple augmentations at inference and aggregates predictions:

from ultralytics import YOLO

model = YOLO("best.pt")

# Basic inference
results = model.predict("test_image.jpg", conf=0.25)

# With TTA (slower but more accurate)
results_tta = model.predict(
    "test_image.jpg",
    conf=0.25,
    augment=True,   # Enable TTA
)
# TTA applies horizontal flip and multiple scales, then NMS merges results
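
Predictions made on a flipped copy of the image live in the flipped coordinate frame, so they have to be mirrored back before they can be merged with the direct predictions. A minimal sketch of that step, assuming (x1, y1, x2, y2) pixel boxes; `hflip_boxes` is a hypothetical helper, not an Ultralytics function.

```python
def hflip_boxes(boxes, image_width):
    """Mirror (x1, y1, x2, y2) boxes horizontally inside an image of the given width."""
    # The right edge becomes the new left edge, so x1 and x2 swap roles
    return [(image_width - x2, y1, image_width - x1, y2) for x1, y1, x2, y2 in boxes]


# Boxes predicted on the mirrored image (illustrative values, 100 px wide image)
boxes_on_flipped = [(10, 20, 30, 40)]
boxes_in_original = hflip_boxes(boxes_on_flipped, image_width=100)
print(boxes_in_original)
```

The mirrored-back boxes and the direct predictions would then go through NMS together, which is where the aggregation happens.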

7. Model Optimization for Deployment

7.1 Quantization

Reduce model precision from FP32 to INT8 or FP16 for faster inference:

from ultralytics import YOLO

model = YOLO("best.pt")

# FP16 export (2x memory reduction, minimal accuracy loss)
model.export(format="engine", half=True)

# INT8 quantization (4x reduction, slight accuracy trade-off)
# Requires a calibration dataset
model.export(
    format="engine",
    int8=True,
    data="calibration_data.yaml",  # Small representative dataset
)

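# ONNX with quantization
model.export(format="onnx", simplify=True)  # writes best.onnx next to best.pt
# Then quantize with onnxruntime:
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

# Post-training dynamic quantization (weights stored as INT8, activations
# quantized at runtime). For convolution-heavy detectors, static quantization
# with calibration data is usually the better fit; treat this as a baseline.
quantize_dynamic(
    model_input="best.onnx",
    model_output="best_int8.onnx",
    weight_type=QuantType.QInt8,
)

# Benchmark
session = ort.InferenceSession("best_int8.onnx", providers=["CPUExecutionProvider"])
# Compare inference time with FP32 vs INT8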
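
To show what INT8 quantization does to the numbers themselves, here is a small pure-Python sketch of affine (asymmetric) quantization and the round trip back to floats. It illustrates the general scheme only; the function names are illustrative, and real toolchains choose ranges per tensor or per channel from calibration data.

```python
def quantize_int8(values):
    """Affine quantization of floats to int8. Returns (q, scale, zero_point)."""
    lo, hi = min(values), max(values)
    # Extend the range to include 0 so that 0.0 is exactly representable
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, 0.0, 0.5, 2.0]
q, scale, zero_point = quantize_int8(weights)
recovered = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, recovered))
print(f"codes={q} scale={scale:.5f} zero_point={zero_point} max_error={max_error:.5f}")
```

The round-trip error is bounded by roughly one quantization step (`scale`), which is why tensors with a wide value range lose more precision than narrow ones.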

7.2 Pruning

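Zero out low-magnitude weights. Unstructured pruning leaves tensor shapes unchanged, so file size and latency only improve with sparse-aware storage or runtimes:

import torch
import torch.nn.utils.prune as prune
from ultralytics import YOLO

model = YOLO("best.pt").model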

# Global magnitude pruning - remove 30% of smallest weights
parameters_to_prune = []
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        parameters_to_prune.append((module, "weight"))

prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.3,  # Remove 30% of weights
)

# Make pruning permanent
for module, param_name in parameters_to_prune:
    prune.remove(module, param_name)

# Count sparsity
total = 0
pruned = 0
for module, _ in parameters_to_prune:
    total += module.weight.nelement()
    pruned += torch.sum(module.weight == 0).item()
print(f"Sparsity: {100. * pruned / total:.1f}%")

7.3 Knowledge Distillation

Train a smaller student model to mimic a larger teacher:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    def __init__(self, temperature=4.0, alpha=0.7):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha

    def forward(self, student_logits, teacher_logits, gt_labels):
        # Soft targets (knowledge from teacher)
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=1),
            F.softmax(teacher_logits / self.temperature, dim=1),
            reduction="batchmean"
        ) * (self.temperature ** 2)

        # Hard targets (ground truth)
        hard_loss = F.cross_entropy(student_logits, gt_labels)

        return self.alpha * soft_loss + (1 - self.alpha) * hard_loss

# Distillation pipeline
from ultralytics import YOLO

teacher = YOLO("yolov8x.pt")  # Large teacher
student = YOLO("yolov8n.pt")  # Small student

# Train student to mimic teacher's soft predictions.
# DistillationLoss expects (batch, num_classes) logits; for detection, apply it
# to matched class predictions and combine it with the standard detection loss.
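
The effect of the temperature is easiest to see on a single prediction. The sketch below recomputes the soft-target term of the loss above in pure Python for one sample, with made-up logits, to show how a higher temperature flattens the teacher distribution before the KL divergence is taken.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher_logits = [8.0, 2.0, 1.0]  # illustrative values
student_logits = [5.0, 3.0, 1.0]
temperature = 4.0

sharp = softmax(teacher_logits)                  # T = 1
soft = softmax(teacher_logits, temperature)      # T = 4
soft_loss = kl_divergence(
    soft, softmax(student_logits, temperature)
) * temperature ** 2

print(f"teacher top prob: T=1 {sharp[0]:.3f}, T=4 {soft[0]:.3f}")
print(f"soft loss: {soft_loss:.4f}")
```

At T = 1 almost all probability mass sits on the top class; at T = 4 the relative ranking of the other classes becomes visible to the student, which is the information distillation is meant to transfer.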

7.4 ONNX Export and Optimization

from ultralytics import YOLO

model = YOLO("best.pt")

# Export to ONNX
model.export(
    format="onnx",
    imgsz=640,
    simplify=True,    # Apply ONNX simplifier
    dynamic=False,     # Fixed input size (faster)
    opset=17,         # ONNX opset version
)

# Optimize with onnxruntime
import onnxruntime as ort

# GPU inference
session_gpu = ort.InferenceSession(
    "best.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# CPU inference (optimized): options must be set before the session is created
cpu_options = ort.SessionOptions()
cpu_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session_cpu = ort.InferenceSession(
    "best.onnx",
    sess_options=cpu_options,
    providers=["CPUExecutionProvider"],
)

# Benchmark
import time
import numpy as np

dummy_input = np.random.randn(1, 3, 640, 640).astype(np.float32)
input_name = session_gpu.get_inputs()[0].name

# Warmup
for _ in range(10):
    session_gpu.run(None, {input_name: dummy_input})

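# Benchmark (perf_counter is a monotonic, high-resolution timer)
times = []
for _ in range(100):
    start = time.perf_counter()
    session_gpu.run(None, {input_name: dummy_input})
    times.append(time.perf_counter() - start)

# The session falls back to CPU if the CUDA provider is unavailable, so confirm it
print(f"Active providers: {session_gpu.get_providers()}")
print(f"ONNX GPU: {np.mean(times)*1000:.1f} ms/img")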

TensorRT optimization (fastest NVIDIA GPU inference):

# Export directly to TensorRT engine
model.export(format="engine", imgsz=640, half=True)

# Or convert from ONNX
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

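with open("best.onnx", "rb") as f:
    if not parser.parse(f.read()):
        # Surface parser errors instead of building from a partial network
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse best.onnx")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB
config.set_flag(trt.BuilderFlag.FP16)  # Enable FP16

engine = builder.build_serialized_network(network, config)
if engine is None:
    raise RuntimeError("TensorRT engine build failed")
with open("best.trt", "wb") as f:
    f.write(engine)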

8. MLOps for Robotics

8.1 Experiment Tracking

Weights & Biases:

import wandb

wandb.init(
    project="robotics-perception",
    name="yolov8m_finetune_v3",
    config={
        "model": "yolov8m",
        "epochs": 100,
        "lr": 0.001,
        "batch_size": 16,
        "imgsz": 640,
        "augmentations": ["mosaic", "mixup", "hsv"],
    }
)

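from ultralytics import YOLO
model = YOLO("yolov8m.pt")
results = model.train(
    data="data.yaml",
    epochs=100,
    lr0=0.001,   # Keep training arguments in sync with the logged config
    batch=16,
    imgsz=640,
    project="robotics-perception",
    name="yolov8m_finetune_v3",
)

# Log metrics manually if needed. If the Ultralytics W&B callback is enabled it
# may already have logged these and closed the run, so only log while a run is active.
if wandb.run is not None:
    wandb.log({"mAP50": results.results_dict.get("metrics/mAP50(B)", 0)})
wandb.finish()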

MLflow:

import mlflow
from ultralytics import YOLO

mlflow.set_experiment("robotics-perception")

with mlflow.start_run(run_name="yolov8m_finetune_v3"):
    # Log parameters
    mlflow.log_param("model", "yolov8m")
    mlflow.log_param("epochs", 100)
    mlflow.log_param("learning_rate", 0.001)

    # Train (arguments match the logged parameters)
    model = YOLO("yolov8m.pt")
    results = model.train(data="data.yaml", epochs=100, lr0=0.001)

    # Log metrics
    mlflow.log_metric("mAP50", results.results_dict["metrics/mAP50(B)"])
    mlflow.log_metric("mAP50-95", results.results_dict["metrics/mAP50-95(B)"])

    # Log model artifact. Repeat runs are saved to train2, train3, ..., so take
    # the path from the trainer instead of hard-coding runs/detect/train.
    mlflow.log_artifact(str(model.trainer.best))

8.2 Model Versioning and Registry

model_registry/
├── models/
│   ├── detection/
│   │   ├── v1.0/
│   │   │   ├── model.pt
│   │   │   ├── metadata.json
│   │   │   └── eval_report.html
│   │   ├── v1.1/
│   │   └── v2.0/
│   ├── segmentation/
│   │   └── v1.0/
│   └── depth/
│       └── v1.0/
└── metadata.json  # Global registry

Example metadata.json per model version:

{
  "model_name": "robot_detection",
  "version": "2.0",
  "framework": "ultralytics_yolov8",
  "base_model": "yolov8m",
  "training_data": "robot_dataset_v3",
  "training_date": "2026-04-15",
  "metrics": {
    "mAP50": 0.87,
    "mAP50-95": 0.62,
    "inference_ms": 8.3,
    "model_size_mb": 52.4
  },
  "target_hardware": "Jetson Orin Nano",
  "quantization": "FP16",
  "approved_by": "team_lead",
  "status": "production"
}
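
A registry like this benefits from a sanity check before a version is promoted. The sketch below validates a metadata record and orders version strings numerically; the required keys and the set of allowed status values are assumptions made for illustration, not a fixed schema.

```python
import json

REQUIRED_KEYS = {"model_name", "version", "metrics", "target_hardware", "status"}
ALLOWED_STATUS = {"staging", "production", "archived"}  # assumed lifecycle states


def validate_metadata(meta):
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(meta))]
    if meta.get("status") not in ALLOWED_STATUS:
        problems.append(f"unknown status: {meta.get('status')!r}")
    for name in ("mAP50", "mAP50-95"):
        value = meta.get("metrics", {}).get(name)
        if value is None or not 0.0 <= value <= 1.0:
            problems.append(f"metric {name} missing or outside [0, 1]")
    return problems


def version_key(version):
    """Sort key so that '1.10' orders after '1.9'."""
    return tuple(int(part) for part in version.split("."))


record = json.loads("""
{
  "model_name": "robot_detection",
  "version": "2.0",
  "metrics": {"mAP50": 0.87, "mAP50-95": 0.62},
  "target_hardware": "Jetson Orin Nano",
  "status": "production"
}
""")
problems = validate_metadata(record)
latest = max(["1.0", "1.1", "2.0"], key=version_key)
print(problems, latest)
```

Comparing versions as integer tuples avoids the string-ordering trap where "1.10" would sort before "1.9".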

8.3 CI/CD for Model Retraining

# .github/workflows/retrain.yml
name: Model Retraining Pipeline

on:
  push:
    paths:
      - 'training_data/**'
  workflow_dispatch:
    inputs:
      model_type:
        description: 'Model to retrain'
        required: true
        type: choice
        options:
          - detection
          - segmentation
          - depth

jobs:
  train:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install ultralytics wandb opencv-python

      - name: Train model
        # Push-triggered runs have no inputs, so fall back to detection
        run: |
          python train.py \
            --model-type ${{ inputs.model_type || 'detection' }} \
            --epochs 100 \
            --data training_data/data.yaml
        env:
          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}

      - name: Run evaluation
        run: |
          python evaluate.py \
            --model runs/detect/train/weights/best.pt \
            --data training_data/data.yaml \
            --threshold 0.8  # Minimum mAP50 to pass

      - name: Deploy if metrics pass
        if: success()
        run: |
          python deploy.py \
            --model runs/detect/train/weights/best.pt \
            --format onnx \
            --target jetson
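
The workflow relies on `evaluate.py` failing the step when the model is not good enough, since the deploy step only runs after earlier steps succeed. The script itself is not shown in this guide; one possible shape, assuming it gates on mAP50 and signals the result through its exit code, is sketched below. The function names are hypothetical.

```python
import argparse


def parse_args(argv):
    """Parse the flags used by the workflow's evaluation step."""
    parser = argparse.ArgumentParser(description="Gate deployment on mAP50")
    parser.add_argument("--model", required=True)
    parser.add_argument("--data", required=True)
    parser.add_argument("--threshold", type=float, default=0.8)
    return parser.parse_args(argv)


def gate_exit_code(map50, threshold):
    """0 lets the workflow continue to deployment; 1 fails the step."""
    return 0 if map50 >= threshold else 1


def main(argv):
    args = parse_args(argv)
    # Imported lazily so the gate logic can be tested without Ultralytics installed
    from ultralytics import YOLO

    metrics = YOLO(args.model).val(data=args.data, split="test")
    map50 = float(metrics.box.map50)
    print(f"mAP50 = {map50:.4f} (threshold {args.threshold})")
    return gate_exit_code(map50, args.threshold)
```

In the actual script, the last line would be `sys.exit(main(sys.argv[1:]))` so that a non-zero return value fails the CI step.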

9. Common Pitfalls

9.1 Overfitting

Symptoms: Training loss decreases, validation loss increases or plateaus.

Solutions:

# 1. Add regularization
from ultralytics import YOLO

model = YOLO("yolov8m.pt")
model.train(
    data="data.yaml",
    weight_decay=0.01,       # L2 regularization (Ultralytics default: 0.0005)
    dropout=0.2,             # Only applies to classification training in Ultralytics
    mixup=0.15,              # Mixup augmentation
    mosaic=1.0,              # Mosaic augmentation
    degrees=10,              # Rotation augmentation
    scale=0.5,               # Scale augmentation
)

# 2. Reduce model complexity
model = YOLO("yolov8s.pt")  # Use smaller model

# 3. Early stopping
model.train(
    data="data.yaml",
    patience=15,  # Stop if no improvement for 15 epochs
)

# 4. More data or better augmentation
# 5. Reduce number of epochs
# 6. Add dropout layers to custom heads
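
For custom training loops that do not go through `model.train()`, the patience rule from option 3 is straightforward to implement by hand. A minimal sketch, assuming a validation metric where higher is better (such as mAP); the class is illustrative and not the Ultralytics implementation.

```python
class EarlyStopping:
    """Signal a stop once the metric has not improved for `patience` epochs."""

    def __init__(self, patience=15, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = None
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's metric; return True when training should stop."""
        if self.best is None or metric > self.best + self.min_delta:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
history = [0.50, 0.60, 0.59, 0.60, 0.58]  # illustrative validation mAP per epoch
stopped_at = next(i for i, m in enumerate(history) if stopper.step(m))
print(f"stopped at epoch index {stopped_at}, best metric {stopper.best}")
```

Matching the previous best does not count as an improvement here, so a plateau triggers the stop just as a decline does.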

9.2 Class Imbalance

Symptoms: Model biased toward majority class, low recall on rare classes.

Solutions:

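# 1. Loss weighting
# NOTE: model.train() in Ultralytics does not accept a per-class weight list;
# unknown keywords such as class_weights are rejected. Per-class weighting needs
# a custom loss. The built-in `cls` gain scales the classification loss for all classes.
from ultralytics import YOLO

model = YOLO("yolov8m.pt")
model.train(
    data="data.yaml",
    cls=1.0,  # Classification loss gain (default 0.5); not class-specific
)

# 2. Oversampling rare classes
# 3. Focal loss (not the YOLOv8 default, which uses BCE for classification; requires a custom loss)

# 4. Synthetic data generation for rare classes
# 5. Data-level balancing (remove excess majority samples)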
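
Both loss weighting and oversampling start from the same input: how many labeled instances each class has. The sketch below derives inverse-frequency weights and oversampling repeat factors from such counts. The class names and counts are made up, and every class is assumed to have at least one instance.

```python
def inverse_frequency_weights(class_counts):
    """Weight = total / (num_classes * count), so a balanced dataset gives 1.0 everywhere."""
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {name: total / (num_classes * count) for name, count in class_counts.items()}


def oversample_factors(class_counts):
    """How many times to repeat each class's samples to roughly match the largest class."""
    largest = max(class_counts.values())
    return {name: round(largest / count) for name, count in class_counts.items()}


counts = {"box": 600, "tool": 300, "gripper": 100}  # illustrative instance counts
weights = inverse_frequency_weights(counts)
factors = oversample_factors(counts)
for name in counts:
    print(f"{name:8s} weight={weights[name]:.2f} repeat x{factors[name]}")
```

The weights would feed a custom loss, while the repeat factors can be applied when building the training file list. Oversampled images should still pass through augmentation so the model does not see exact copies.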

9.3 Data Leakage

Symptoms: Metrics seem too good; model fails on truly new data.

Common causes and fixes:

# 1. Split BEFORE augmentation
# WRONG: augment all data, then split → augmented versions of same image 
#         appear in both train and val
# RIGHT: split raw data first, then augment train set only

# 2. Check for temporal leakage
# WRONG: random split over frames of the same scene or recording → near-identical
#        views appear in both train and val
# RIGHT: split by recording session; hold out different days, different lighting,
#        different cameras for val

# 3. Ensure no duplicate images across splits
import hashlib
import os

def find_duplicates(image_dir):
    """Find byte-identical images by content hash and return (duplicate, original) pairs."""
    hashes = {}
    duplicates = []
    for root, _, files in os.walk(image_dir):
        for f in files:
            if f.lower().endswith(('.jpg', '.jpeg', '.png')):
                path = os.path.join(root, f)
                with open(path, 'rb') as fp:
                    h = hashlib.md5(fp.read()).hexdigest()
                if h in hashes:
                    print(f"DUPLICATE: {path} == {hashes[h]}")
                    duplicates.append((path, hashes[h]))
                else:
                    hashes[h] = path
    return duplicates

find_duplicates("dataset/images/")
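
Hashing only catches byte-identical files. Near-duplicate frames from the same recording are better handled by splitting at the session level, so that no session contributes to both train and val. A minimal sketch, assuming file names encode the session as `<session>_<frame>.jpg`; that naming scheme and the helper name are hypothetical.

```python
import random


def split_by_group(items, group_of, val_fraction=0.2, seed=0):
    """Assign whole groups (e.g. recording sessions) to either train or val."""
    groups = sorted({group_of(item) for item in items})
    random.Random(seed).shuffle(groups)
    num_val = max(1, round(len(groups) * val_fraction))
    val_groups = set(groups[:num_val])
    train = [item for item in items if group_of(item) not in val_groups]
    val = [item for item in items if group_of(item) in val_groups]
    return train, val


def session_of(name):
    """Extract the session prefix from '<session>_<frame>.jpg'."""
    return name.split("_")[0]


files = [f"session{s}_{i:04d}.jpg" for s in range(5) for i in range(3)]
train, val = split_by_group(files, group_of=session_of)
print(f"{len(train)} train images, {len(val)} val images")
```

Fixing the seed keeps the split reproducible across retraining runs, which matters when comparing metrics between model versions.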

9.4 Annotation Errors

Common issues:

  1. Inconsistent bounding boxes: Tight vs. loose boxes for the same class
  2. Missing annotations: Objects present but not labeled
  3. Wrong class labels: Misclassifications in annotations
  4. Off-by-one class IDs: Background counted as a class or vice versa, shifting every label index by one

Detection tools:

from ultralytics import YOLO

# Train a model, then analyze low-confidence predictions
# (load the trained weights, not the pretrained checkpoint)
model = YOLO("runs/detect/train/weights/best.pt")
results = model.predict("val_images/", conf=0.3, save=True)

# Look for patterns:
# - Same location always has low confidence → likely annotation error
# - Consistent class confusion → label swap
# - Objects never detected → missing annotations

# Programmatic check for annotation quality
import os

import cv2

def audit_annotations(labels_dir, images_dir):
    """Check YOLO-format labels for common annotation issues."""
    issues = []
    for label_file in sorted(os.listdir(labels_dir)):
        if not label_file.endswith(".txt"):
            continue
        img_name = os.path.splitext(label_file)[0]
        img_path = os.path.join(images_dir, img_name + ".jpg")  # assumes .jpg images

        if not os.path.exists(img_path):
            issues.append(f"Orphan label: {label_file} (no matching image)")
            continue

        # Labels are normalized, so the image is only opened to confirm it is readable
        if cv2.imread(img_path) is None:
            issues.append(f"Unreadable image: {img_path}")
            continue

        with open(os.path.join(labels_dir, label_file)) as f:
            for i, line in enumerate(f, start=1):
                parts = line.strip().split()
                if not parts:
                    continue  # skip blank lines
                if len(parts) != 5:
                    issues.append(f"{label_file} line {i}: wrong format")
                    continue
                try:
                    cls = int(parts[0])
                    cx, cy, bw, bh = map(float, parts[1:])
                except ValueError:
                    issues.append(f"{label_file} line {i}: non-numeric value")
                    continue
                # Check bounds (normalized center x/y, width, height)
                if cls < 0 or not (0 <= cx <= 1 and 0 <= cy <= 1 and 0 < bw <= 1 and 0 < bh <= 1):
                    issues.append(f"{label_file} line {i}: out-of-bounds bbox")
                if bw < 0.01 or bh < 0.01:
                    issues.append(f"{label_file} line {i}: suspiciously small bbox")

    print(f"Found {len(issues)} issues:")
    for issue in issues:
        print(f"  - {issue}")
    return issues

audit_annotations("labels/", "images/")

9.5 Quick Diagnostic Checklist

| Problem | Symptom | Fix |
|---|---|---|
| Overfitting | Train loss ↓, val loss ↑ | More data, augmentation, regularization, smaller model |
| Underfitting | Both losses high | Larger model, more epochs, higher LR |
| Class imbalance | High accuracy, low recall on rare classes | Class weights, oversampling, focal loss |
| Domain gap | Great val metrics, poor real-world performance | Collect real-world data, domain adaptation |
| Small object failures | Poor AP for small objects | Higher input resolution, adjust anchor scales (anchor-based models) |
| Occlusion failures | Missed detections when objects overlap | NMS tuning (raise the IoU threshold so overlapping true boxes survive), training with occlusion augmentation |
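
The NMS entry in the table is easier to reason about with the algorithm in view. The sketch below is a plain greedy NMS over (box, score) pairs showing how the IoU threshold decides whether two overlapping objects both survive. It is illustrative only; in Ultralytics the corresponding setting is the `iou` argument of `predict` and `val`.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold):
    """Greedy NMS. detections: list of (box, score). Returns kept indices, best score first."""
    order = sorted(range(len(detections)), key=lambda i: detections[i][1], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too strongly
        if all(iou(detections[i][0], detections[k][0]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


# Two real objects standing close together (IoU between them is about 0.43)
detections = [((0, 0, 10, 10), 0.9), ((4, 0, 14, 10), 0.8)]
strict = nms(detections, iou_threshold=0.3)
lenient = nms(detections, iou_threshold=0.5)
print(f"threshold 0.3 keeps {strict}, threshold 0.5 keeps {lenient}")
```

A low threshold suppresses the second object as if it were a duplicate; raising it keeps both, at the cost of letting more true duplicates through, so the value is worth tuning on scenes that contain occlusion.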

10. References

Datasets

  • COCO — 80-class detection/segmentation benchmark
  • KITTI — Autonomous driving with depth and LiDAR
  • Open Images — Google's large-scale detection dataset
  • Roboflow Universe — Community shared robotics datasets
  • NVIDIA Isaac Sim — Synthetic data generation for robotics