Training Pipeline for Robotics Perception¶
A comprehensive guide to collecting data, training models, evaluating performance, and deploying optimized perception models for robotics applications.
Overview¶
Robotics perception systems must reliably detect, segment, and estimate depth for objects in unstructured, dynamic environments. Off-the-shelf models trained on general benchmarks (COCO, ImageNet) often fail when deployed on real robots because:
- Domain shift: The robot's camera, mounting angle, and operating environment differ from training data.
- Specific objects: Robots must recognize task-relevant objects (tools, gripper targets, obstacles) not present in public datasets.
- Real-time constraints: Embedded compute on the robot requires optimized models (INT8, TensorRT).
- Robustness: Variations in lighting, motion blur, partial occlusion, and sensor noise demand tailored training.
This guide walks through the full lifecycle—from raw images to a deployed model—covering detection (YOLO), segmentation (SAM, Mask R-CNN), depth estimation, evaluation, optimization, and MLOps practices.
Prerequisites: Python 3.8+, PyTorch 2.0+, CUDA 11.8+, basic familiarity with computer vision and neural networks.
Learning Objectives¶
- Collect and annotate high-quality datasets for robotics tasks
- Train and fine-tune detection, segmentation, and depth models
- Evaluate models with standard metrics and analyze failure modes
- Optimize models for real-time inference on edge hardware
- Set up experiment tracking and CI/CD pipelines for model iteration
1. Data Collection¶
1.1 Manual Annotation Tools¶
| Tool | Format Support | Strengths | Best For |
|---|---|---|---|
| LabelImg | YOLO, Pascal VOC | Lightweight, simple UI | Quick bounding box annotation |
| CVAT | COCO, YOLO, Pascal VOC | Multi-user, video support, AI-assisted | Team annotation projects |
| Roboflow | All major formats | Auto-augment, export, hosting | End-to-end pipeline |
| Label Studio | JSON, COCO, VOC | Multi-modal (text, image, audio), ML backend | Complex annotation tasks |
LabelImg (quick start for bounding boxes):
pip install labelImg
labelImg # Launches the annotation GUI
CVAT (team annotation with Docker):
git clone https://github.com/cvat-ai/cvat && cd cvat
docker compose -f docker-compose.yml up -d
# Access at http://localhost:8080
# Create project, define labels, invite annotators
Label Studio (with ML backend for active learning):
pip install label-studio label-studio-ml
label-studio start &
# Configure ML backend to pre-annotate with a pre-trained model
1.2 Automatic Annotation with Pre-trained Models¶
Manual annotation is slow. Use a pre-trained model to generate initial annotations, then have human annotators correct the mistakes (human-in-the-loop):
from ultralytics import YOLO
# Load a pre-trained model (or your own model from a previous iteration)
model = YOLO("yolov8x.pt")
# Auto-annotate a folder of raw images
results = model.predict(
source="raw_images/",
save_txt=True, # Save YOLO-format labels
conf=0.5, # Confidence threshold
imgsz=640,
save=True # Save annotated images for visual review
)
# Output: YOLO-format .txt labels are written under the run directory
# (by default runs/detect/predict/labels/), not next to the source images
SAM-assisted labeling (Segment Anything for masks):
from segment_anything import sam_model_registry, SamPredictor
import cv2
import numpy as np
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")
predictor = SamPredictor(sam)
# For each image, provide point prompts or box prompts
image = cv2.imread("robot_scene.jpg")
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
predictor.set_image(image_rgb)
# Prompt with a bounding box (e.g., from YOLO detection)
box = np.array([100, 200, 400, 500]) # x1, y1, x2, y2
masks, scores, _ = predictor.predict(
box=box,
multimask_output=True
)
# Select the best mask
best_mask = masks[np.argmax(scores)]
For large-scale auto-annotation, consider Autodistill, which chains foundation models to generate annotations in YOLO format automatically.
1.3 Data Collection Best Practices for Robotics¶
| Principle | Why It Matters | How to Achieve It |
|---|---|---|
| Varying lighting | Robots operate under different conditions | Capture in sunlight, shadows, indoors, artificial light |
| Multiple camera angles | Robot cameras may be mounted differently | Mount camera at different heights and angles |
| Diverse backgrounds | Clutter confuses models | Collect in different rooms, workspaces, outdoor areas |
| Include edge cases | Models fail on unusual configurations | Add partially occluded, distant, or motion-blurred objects |
| Represent target domain | Domain gap causes failures | Use the actual robot camera and mounting position |
| Sufficient quantity | More data generally improves generalization | Aim for 500+ images per class minimum; 2000+ ideal |
Data collection script example (save webcam frames at a fixed interval, or on demand with a keypress):
import os
import time

import cv2

cap = cv2.VideoCapture(0)
output_dir = "collected_data"
os.makedirs(output_dir, exist_ok=True)

frame_count = 0
interval = 0.5  # seconds between automatic captures
last_saved = 0.0

while True:
    ret, frame = cap.read()
    if not ret:
        break

    cv2.imshow("Preview ('s' = save now, 'q' = quit)", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("q"):
        break

    # Save when the interval has elapsed, or immediately when 's' is pressed
    now = time.time()
    if key == ord("s") or now - last_saved >= interval:
        filename = os.path.join(output_dir, f"img_{frame_count:06d}.jpg")
        cv2.imwrite(filename, frame)
        print(f"Saved: {filename}")
        frame_count += 1
        last_saved = now

cap.release()
cv2.destroyAllWindows()
print(f"Total images saved: {frame_count}")
1.4 Augmentation Strategies¶
Augmentation artificially increases dataset diversity and improves generalization.
Photometric augmentations (color/lighting changes):
import albumentations as A
photometric_transform = A.Compose([
A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=0.7),
A.HueSaturationValue(hue_shift_limit=20, sat_shift_limit=30, val_shift_limit=20, p=0.5),
A.CLAHE(clip_limit=4.0, p=0.3),
A.RandomGamma(gamma_limit=(80, 120), p=0.3),
A.GaussNoise(var_limit=(10.0, 50.0), p=0.3),
])
Geometric augmentations (spatial transforms):
geometric_transform = A.Compose([
A.HorizontalFlip(p=0.5),
A.RandomRotate90(p=0.5),
A.ShiftScaleRotate(
shift_limit=0.1,
scale_limit=0.2,
rotate_limit=15,
border_mode=cv2.BORDER_CONSTANT,
p=0.7
),
A.Perspective(scale=(0.05, 0.1), p=0.3),
], bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]))
Advanced augmentations:
# Cutout / Coarse Dropout - randomly mask patches
cutout_transform = A.CoarseDropout(
max_holes=8, max_height=32, max_width=32,
min_holes=1, min_height=8, min_width=8,
fill_value=0, p=0.5
)
# Mixup - blend two images and their labels
# (typically done at training time, see YOLO's mixup parameter)
# Mosaic - combine 4 images into one (YOLO's default augmentation)
# Controlled by the 'mosaic' parameter in YOLO training
Applying augmentations to a labeled dataset:
import os
import shutil

import cv2

def augment_dataset(image_dir, label_dir, output_dir, transform, num_augments=3):
    """Apply a bbox-aware augmentation to every image in a YOLO-format dataset."""
    os.makedirs(f"{output_dir}/images", exist_ok=True)
    os.makedirs(f"{output_dir}/labels", exist_ok=True)

    for img_name in sorted(os.listdir(image_dir)):
        if not img_name.lower().endswith((".jpg", ".png")):
            continue
        image = cv2.imread(f"{image_dir}/{img_name}")
        if image is None:  # Unreadable or corrupt file
            continue
        stem = os.path.splitext(img_name)[0]
        h_img, w_img = image.shape[:2]

        # Load YOLO-format labels and convert to Pascal VOC pixels (x1, y1, x2, y2)
        label_path = f"{label_dir}/{stem}.txt"
        bboxes, class_labels = [], []
        if os.path.exists(label_path):
            with open(label_path) as f:
                for line in f:
                    parts = line.split()
                    if len(parts) < 5:
                        continue
                    x, y, w, h = map(float, parts[1:5])
                    # Clip to the image so label rounding cannot push a box out of bounds
                    x1 = max((x - w / 2) * w_img, 0.0)
                    y1 = max((y - h / 2) * h_img, 0.0)
                    x2 = min((x + w / 2) * w_img, float(w_img))
                    y2 = min((y + h / 2) * h_img, float(h_img))
                    bboxes.append([x1, y1, x2, y2])
                    class_labels.append(int(parts[0]))
            # Keep the original labels alongside the original image
            shutil.copy(label_path, f"{output_dir}/labels/{stem}.txt")

        # Save original
        cv2.imwrite(f"{output_dir}/images/{img_name}", image)

        # Create augmented versions
        for i in range(num_augments):
            transformed = transform(image=image, bboxes=bboxes, class_labels=class_labels)
            aug_stem = f"{stem}_aug{i}"
            cv2.imwrite(f"{output_dir}/images/{aug_stem}.jpg", transformed["image"])

            # Convert back to YOLO format and save
            h_aug, w_aug = transformed["image"].shape[:2]
            with open(f"{output_dir}/labels/{aug_stem}.txt", "w") as f:
                for (x1, y1, x2, y2), cls in zip(transformed["bboxes"],
                                                 transformed["class_labels"]):
                    cx = ((x1 + x2) / 2) / w_aug
                    cy = ((y1 + y2) / 2) / h_aug
                    bw = (x2 - x1) / w_aug
                    bh = (y2 - y1) / h_aug
                    f.write(f"{int(cls)} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}\n")

augment_dataset("raw_images", "raw_labels", "augmented_dataset",
                transform=geometric_transform, num_augments=3)
2. Dataset Formats¶
2.1 YOLO Format¶
Each image has a corresponding .txt file. Each line: class_id center_x center_y width height (normalized 0-1).
project/
├── images/
│ ├── train/
│ │ ├── img001.jpg
│ │ └── img002.jpg
│ └── val/
│ ├── img010.jpg
│ └── img011.jpg
├── labels/
│ ├── train/
│ │ ├── img001.txt
│ │ └── img002.txt
│ └── val/
│ ├── img010.txt
│ └── img011.txt
└── data.yaml
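Example label file (img001.txt), one line per object (values are illustrative):
0 0.273438 0.333333 0.234375 0.458333
1 0.612500 0.541667 0.125000 0.208333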
Example data.yaml:
path: /absolute/path/to/project
train: images/train
val: images/val
test: images/test # optional
nc: 3 # number of classes
names: ["bottle", "cup", "tool"]
2.2 COCO Format¶
JSON-based, single annotation file for the entire dataset.
{
"images": [
{"id": 1, "file_name": "img001.jpg", "width": 640, "height": 480}
],
"annotations": [
{
"id": 1,
"image_id": 1,
"category_id": 0,
"bbox": [100, 50, 150, 220],
"area": 33000,
"segmentation": [[100, 50, 250, 50, 250, 270, 100, 270]],
"iscrowd": 0
}
],
"categories": [
{"id": 0, "name": "bottle"},
{"id": 1, "name": "cup"}
]
}
Note: COCO bbox format is [x_top_left, y_top_left, width, height] in pixels.
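The three formats describe the same box in different coordinates. The sketch below converts the COCO example above into Pascal VOC corners and YOLO normalized values; the helper names `coco_to_corners` and `coco_to_yolo_box` are illustrative, not part of any library.

```python
def coco_to_corners(bbox):
    """COCO [x, y, w, h] in pixels -> Pascal VOC (xmin, ymin, xmax, ymax)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def coco_to_yolo_box(bbox, img_w, img_h):
    """COCO [x, y, w, h] in pixels -> YOLO (cx, cy, w, h), normalized to 0-1."""
    x, y, w, h = bbox
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)


coco_bbox = [100, 50, 150, 220]  # the bottle annotation from the COCO example
corners = coco_to_corners(coco_bbox)
yolo_box = coco_to_yolo_box(coco_bbox, img_w=640, img_h=480)

print("VOC corners:", corners)
print("YOLO box:   ", " ".join(f"{v:.6f}" for v in yolo_box))
```

The corners match the `<bndbox>` values used for the same object in the Pascal VOC example.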
2.3 Pascal VOC Format¶
XML-based, one file per image.
<annotation>
<filename>img001.jpg</filename>
<size>
<width>640</width>
<height>480</height>
</size>
<object>
<name>bottle</name>
<bndbox>
<xmin>100</xmin>
<ymin>50</ymin>
<xmax>250</xmax>
<ymax>270</ymax>
</bndbox>
</object>
</annotation>
2.4 Format Conversion Scripts¶
COCO to YOLO:
import json
import os
from collections import defaultdict

def coco_to_yolo(coco_json_path, output_dir):
    """Convert COCO annotations to YOLO format."""
    os.makedirs(output_dir, exist_ok=True)
    with open(coco_json_path) as f:
        coco = json.load(f)

    # Build lookup tables
    img_lookup = {img["id"]: img for img in coco["images"]}
    # YOLO needs contiguous 0-based class ids; COCO category ids may start at 1 or have gaps
    cat_ids = sorted(cat["id"] for cat in coco["categories"])
    cat_to_yolo = {cat_id: i for i, cat_id in enumerate(cat_ids)}

    # Group annotations by image
    ann_by_image = defaultdict(list)
    for ann in coco["annotations"]:
        ann_by_image[ann["image_id"]].append(ann)

    for img_id, annotations in ann_by_image.items():
        img = img_lookup[img_id]
        w, h = img["width"], img["height"]
        txt_path = os.path.join(output_dir,
                                os.path.splitext(img["file_name"])[0] + ".txt")
        with open(txt_path, "w") as f:
            for ann in annotations:
                x, y, bw, bh = ann["bbox"]  # COCO: x, y, w, h (pixels)
                cx = (x + bw / 2) / w
                cy = (y + bh / 2) / h
                nw = bw / w
                nh = bh / h
                cls_id = cat_to_yolo[ann["category_id"]]
                f.write(f"{cls_id} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}\n")

    print(f"Converted {len(ann_by_image)} images to YOLO format in {output_dir}")

coco_to_yolo("annotations.json", "labels_yolo/")
Pascal VOC to YOLO:
import os
import xml.etree.ElementTree as ET

def voc_to_yolo(voc_dir, output_dir, class_names=None):
    """Convert Pascal VOC XML annotations to YOLO format.

    Pass class_names to pin the class order; otherwise ids are assigned in
    order of first appearance across the sorted XML files.
    """
    os.makedirs(output_dir, exist_ok=True)
    class_names = list(class_names) if class_names else []

    for xml_file in sorted(os.listdir(voc_dir)):
        if not xml_file.endswith(".xml"):
            continue
        root = ET.parse(os.path.join(voc_dir, xml_file)).getroot()
        size = root.find("size")
        w = int(size.find("width").text)
        h = int(size.find("height").text)

        txt_name = os.path.splitext(xml_file)[0] + ".txt"
        with open(os.path.join(output_dir, txt_name), "w") as f:
            for obj in root.findall("object"):
                name = obj.find("name").text
                if name not in class_names:
                    class_names.append(name)
                cls_id = class_names.index(name)

                bbox = obj.find("bndbox")
                xmin = float(bbox.find("xmin").text)
                ymin = float(bbox.find("ymin").text)
                xmax = float(bbox.find("xmax").text)
                ymax = float(bbox.find("ymax").text)

                cx = ((xmin + xmax) / 2) / w
                cy = ((ymin + ymax) / 2) / h
                bw = (xmax - xmin) / w
                bh = (ymax - ymin) / h
                f.write(f"{cls_id} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}\n")

    print(f"Classes: {class_names}")
    return class_names

voc_to_yolo("Annotations/", "labels_yolo/")
Recommended tools for batch conversion: Roboflow and FiftyOne both offer programmatic and UI-based format conversion.
3. Training Detection Models (YOLO)¶
3.1 Full Ultralytics Training Pipeline¶
from ultralytics import YOLO
# Option A: Train from scratch with a YAML config
model = YOLO("yolov8n.yaml") # Nano model (3.2M params), randomly initialized
results = model.train(
data="path/to/data.yaml",
epochs=100,
imgsz=640,
batch=16,
name="robot_detection_v1",
device="0", # GPU index, or "cpu"
patience=20, # Early stopping patience
save=True,
save_period=10, # Save checkpoint every N epochs
val=True,
plots=True,
)
# Option B: Fine-tune from pre-trained weights (recommended)
model = YOLO("yolov8m.pt") # Medium model, pre-trained on COCO
results = model.train(
data="path/to/data.yaml",
epochs=100,
imgsz=640,
batch=16,
name="robot_detection_finetune",
device="0",
freeze=10, # Freeze first 10 layers (transfer learning)
optimizer="AdamW", # Set explicitly: the default optimizer="auto" picks its own LR and ignores lr0
lr0=0.001, # Lower learning rate for fine-tuning
)
# Option C: Train from CLI
# yolo detect train data=data.yaml model=yolov8m.pt epochs=100 imgsz=640
Model size comparison:
| Model | Params | mAP50-95 (COCO val) | Speed (ms, A100 TensorRT) | Best For |
|---|---|---|---|---|
| YOLOv8n | 3.2M | 37.3 | 0.99 | Edge devices, Jetson Nano |
| YOLOv8s | 11.2M | 44.9 | 1.20 | Jetson Orin Nano |
| YOLOv8m | 25.9M | 50.2 | 1.83 | Desktop GPU |
| YOLOv8l | 43.7M | 52.9 | 2.39 | Desktop GPU |
| YOLOv8x | 68.2M | 53.9 | 3.53 | Server GPU, highest accuracy |
3.2 Hyperparameter Tuning¶
Automated hyperparameter search with model.tune():
model = YOLO("yolov8m.pt")
# model.tune() uses Ultralytics' built-in mutation-based tuner by default;
# pass use_ray=True to run the sweep with Ray Tune instead
results = model.tune(
    data="data.yaml",
    epochs=30,        # Fewer epochs per trial
    iterations=100,   # Number of tuning trials
    optimizer="AdamW",
    plots=True,
    save=True,
    device="0",
)
# The best hyperparameters are written to best_hyperparameters.yaml
# in the tune run directory (e.g. runs/detect/tune/)
Manual hyperparameter guide:
| Parameter | Default | Tuning Range | When to Adjust |
|---|---|---|---|
| lr0 | 0.01 | 0.0001 - 0.01 | Overfitting → lower; underfitting → higher |
| lrf | 0.01 | 0.001 - 0.1 | Final LR ratio (final LR = lr0 × lrf) |
| momentum | 0.937 | 0.8 - 0.99 | SGD momentum |
| weight_decay | 0.0005 | 0.0001 - 0.01 | Regularization strength |
| warmup_epochs | 3.0 | 1 - 5 | More warmup for small datasets |
| warmup_momentum | 0.8 | 0.5 - 0.95 | Initial momentum during warmup |
| box | 7.5 | 1 - 20 | Box loss gain |
| cls | 0.5 | 0.1 - 5 | Classification loss gain |
| dfl | 1.5 | 0.5 - 5 | Distribution focal loss gain |
| mosaic | 1.0 | 0 - 1 | Disable (0) for small objects |
| mixup | 0.0 | 0 - 0.5 | Add for regularization |
| copy_paste | 0.0 | 0 - 1 | Copy-paste augmentation |
| degrees | 0.0 | 0 - 45 | Rotation augmentation (degrees) |
| scale | 0.5 | 0 - 0.9 | Scale augmentation range |
| fliplr | 0.5 | 0 - 1 | Horizontal flip probability |
| hsv_h | 0.015 | 0 - 0.1 | Hue augmentation |
| hsv_s | 0.7 | 0 - 1 | Saturation augmentation |
| hsv_v | 0.4 | 0 - 1 | Value augmentation |
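To make the lr0 and lrf entries concrete, here is a minimal sketch of how the two combine, assuming a linear decay from lr0 at the first epoch to lr0 × lrf at the last. The function name `linear_lr` is illustrative, and other schedules (for example cosine) follow a different curve between the same endpoints.

```python
def linear_lr(epoch, epochs, lr0=0.01, lrf=0.01):
    """Learning rate at `epoch`, decaying linearly from lr0 to lr0 * lrf."""
    factor = max(1 - epoch / epochs, 0) * (1.0 - lrf) + lrf
    return lr0 * factor


epochs = 100
schedule = [linear_lr(e, epochs) for e in range(epochs + 1)]
for e in (0, 25, 50, 75, 100):
    print(f"epoch {e:3d}: lr = {schedule[e]:.6f}")
```

With the defaults, the rate starts at 0.01 and ends at 0.0001, which is why lowering lrf mainly affects the last part of training.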
3.3 Transfer Learning Strategy¶
from ultralytics import YOLO

# Note: with the default optimizer="auto", Ultralytics chooses the learning rate
# itself and ignores lr0, so the optimizer is set explicitly below.

# Strategy 1, phase 1: freeze the backbone, train neck + head only (first 5-10 epochs)
model = YOLO("yolov8m.pt")
model.train(
    data="data.yaml",
    epochs=5,
    freeze=10,          # Freeze layers 0-9 (backbone)
    optimizer="SGD",
    lr0=0.01,
    name="phase1_head_only",
)

# Strategy 1, phase 2: unfreeze and fine-tune everything (next 50+ epochs)
model = YOLO("runs/detect/phase1_head_only/weights/best.pt")
model.train(
    data="data.yaml",
    epochs=50,
    freeze=0,           # Unfreeze all layers
    optimizer="SGD",
    lr0=0.001,          # Lower LR for full fine-tuning
    name="phase2_full_finetune",
)

# Strategy 2: progressive unfreezing (most thorough)
model = YOLO("yolov8m.pt")
for phase, (freeze_layers, epochs, lr) in enumerate([
    (15, 5, 0.01),      # Train the last layers only
    (10, 10, 0.005),    # Unfreeze more layers
    (5, 15, 0.001),     # Unfreeze even more
    (0, 30, 0.0005),    # Fine-tune everything
], 1):
    # train() returns metrics, not a model, so do not reassign `model` to its result
    model.train(
        data="data.yaml",
        epochs=epochs,
        freeze=freeze_layers,
        optimizer="SGD",
        lr0=lr,
        name=f"phase{phase}",
    )
    # Reload best weights from this phase (assumes the run name was not
    # auto-incremented because of an existing runs/detect/phase{N} directory)
    model = YOLO(f"runs/detect/phase{phase}/weights/best.pt")
3.4 Multi-GPU Training¶
# PyTorch DDP (Distributed Data Parallel) - recommended
yolo detect train data=data.yaml model=yolov8m.pt epochs=100 batch=32 device=0,1
# For 4 GPUs:
yolo detect train data=data.yaml model=yolov8x.pt epochs=100 batch=64 device=0,1,2,3
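# In Ultralytics, batch is the total batch size and is split evenly across the listed GPUs
# (batch=32 on 2 GPUs gives 16 images per GPU)
# Rule of thumb: scale the total batch linearly with GPU count and scale the LR by the square root of that factor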
# Python API
from ultralytics import YOLO
model = YOLO("yolov8m.pt")
model.train(
data="data.yaml",
epochs=100,
batch=32, # Total across both GPUs (16 per GPU)
device="0,1", # Use 2 GPUs
imgsz=640,
workers=8, # Data loader workers per GPU
name="multi_gpu_train"
)
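The scaling rule of thumb mentioned above can be written out as a small calculation. This is a sketch of the heuristic only: `scale_for_gpus` is a hypothetical helper, and the square-root rule is a starting point to validate empirically rather than a guarantee.

```python
import math


def scale_for_gpus(single_gpu_batch, single_gpu_lr, num_gpus):
    """Scale a single-GPU recipe: linear batch growth, square-root LR growth."""
    total_batch = single_gpu_batch * num_gpus
    per_gpu_batch = total_batch // num_gpus
    lr = single_gpu_lr * math.sqrt(num_gpus)
    return total_batch, per_gpu_batch, lr


for gpus in (1, 2, 4):
    total, per_gpu, lr = scale_for_gpus(16, 0.01, gpus)
    print(f"{gpus} GPU(s): total batch={total}, per GPU={per_gpu}, lr={lr:.4f}")
```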
3.5 YOLO Validation and Export¶
from ultralytics import YOLO
model = YOLO("runs/detect/train/weights/best.pt")
# Validate on test set
metrics = model.val(
data="data.yaml",
split="test",
imgsz=640,
batch=16,
conf=0.001, # Low threshold (the val default) so the full precision-recall curve is scored
iou=0.6,
device="0",
)
print(f"mAP50: {metrics.box.map50:.4f}")
print(f"mAP50-95: {metrics.box.map:.4f}")
# Export to various formats
model.export(format="onnx", imgsz=640, simplify=True) # ONNX
model.export(format="engine", imgsz=640, half=True, device=0) # TensorRT FP16
model.export(format="engine", imgsz=640, int8=True, data="data.yaml", device=0) # TensorRT INT8 (calibrates on images from data.yaml)
model.export(format="tflite", imgsz=640) # TFLite (mobile)
model.export(format="coreml", imgsz=640) # CoreML (Apple)
4. Training Segmentation Models (SAM)¶
4.1 Fine-tuning SAM on Custom Data¶
The Segment Anything Model (SAM) supports prompt-based segmentation. Fine-tuning adapts it to your specific domain:
import os

import cv2
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from segment_anything import sam_model_registry
from segment_anything.utils.transforms import ResizeLongestSide

# Load pre-trained SAM
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.to("cuda")

# Train only the mask decoder; freeze the image encoder and prompt encoder
for module in (sam.image_encoder, sam.prompt_encoder):
    for p in module.parameters():
        p.requires_grad = False

# Custom dataset loader for SAM fine-tuning
class SAMPromptDataset(Dataset):
    """Yields one image, its binary mask, and a single foreground point prompt."""
    def __init__(self, images_dir, annotations_dir):
        self.images = sorted(f for f in os.listdir(images_dir)
                             if f.endswith((".jpg", ".png")))
        self.images_dir = images_dir
        self.annotations_dir = annotations_dir
        self.resize = ResizeLongestSide(1024)  # SAM's input resolution

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_name = self.images[idx]
        image = cv2.imread(os.path.join(self.images_dir, img_name))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        # Load ground truth mask and binarize it
        mask = cv2.imread(
            os.path.join(self.annotations_dir, os.path.splitext(img_name)[0] + ".png"),
            cv2.IMREAD_GRAYSCALE,
        )
        mask = (mask > 0).astype(np.float32)

        # Example prompt: one pixel taken from the object (assumes a non-empty mask)
        ys, xs = np.nonzero(mask)
        i = len(xs) // 2
        point = np.array([[xs[i], ys[i]]], dtype=np.float32)

        # Resize the image and the prompt into SAM's input frame
        resized = self.resize.apply_image(image)
        point = self.resize.apply_coords(point, image.shape[:2])
        return {
            # Keep the 0-255 range: sam.preprocess applies SAM's own normalization
            "image": torch.from_numpy(resized).permute(2, 0, 1).float(),
            "mask": torch.from_numpy(mask).unsqueeze(0),       # (1, H, W)
            "point_coords": torch.from_numpy(point).float(),   # (1, 2)
            "point_labels": torch.tensor([1]),                  # 1 = foreground
        }

# Training loop (simplified)
# batch_size=1: the mask decoder pairs one image embedding with that image's
# prompts, and images may differ in size
dataset = SAMPromptDataset("images/", "masks/")
loader = DataLoader(dataset, batch_size=1, shuffle=True)
optimizer = torch.optim.AdamW(sam.mask_decoder.parameters(), lr=1e-4)
loss_fn = torch.nn.BCEWithLogitsLoss()  # Binary mask logits vs. 0/1 ground truth

sam.mask_decoder.train()
for epoch in range(50):
    epoch_loss = 0.0
    for batch in loader:
        image = batch["image"].to("cuda")      # (1, 3, h, w), longest side 1024
        mask_gt = batch["mask"].to("cuda")     # (1, 1, H, W), original resolution
        input_size = tuple(image.shape[-2:])
        original_size = tuple(mask_gt.shape[-2:])

        # Frozen parts: image embeddings and prompt embeddings
        with torch.no_grad():
            image_embeddings = sam.image_encoder(sam.preprocess(image))
            sparse_embeddings, dense_embeddings = sam.prompt_encoder(
                points=(batch["point_coords"].to("cuda"),
                        batch["point_labels"].to("cuda")),
                boxes=None,
                masks=None,
            )

        # Decode masks
        low_res_masks, _ = sam.mask_decoder(
            image_embeddings=image_embeddings,
            image_pe=sam.prompt_encoder.get_dense_pe(),
            sparse_prompt_embeddings=sparse_embeddings,
            dense_prompt_embeddings=dense_embeddings,
            multimask_output=False,
        )

        # Upsample to original resolution and compute loss
        pred_masks = sam.postprocess_masks(low_res_masks, input_size, original_size)
        loss = loss_fn(pred_masks, mask_gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f"Epoch {epoch+1}/50, Loss: {epoch_loss / len(loader):.4f}")
4.2 LoRA Fine-tuning for SAM¶
LoRA (Low-Rank Adaptation) is parameter-efficient because it trains only a small set of added low-rank matrices while the original weights stay frozen:
from peft import LoraConfig, get_peft_model
from segment_anything import sam_model_registry
# Load SAM
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
# Apply LoRA to the image encoder attention layers.
# No task_type is set: the task-specific PEFT wrappers assume a text-model
# forward signature (input_ids, ...), which SAM's image encoder does not have.
lora_config = LoraConfig(
    r=8,              # Rank of adaptation matrices
    lora_alpha=32,    # Scaling factor
    lora_dropout=0.1,
    target_modules=["attn.qkv", "attn.proj"],  # Attention projections in each ViT block
    bias="none",
)
# Wrap the image encoder with LoRA
sam.image_encoder = get_peft_model(sam.image_encoder, lora_config)
# Print trainable parameter count. This includes the prompt encoder and mask
# decoder, which get_peft_model does not touch; freeze them separately if desired.
trainable = sum(p.numel() for p in sam.parameters() if p.requires_grad)
total = sum(p.numel() for p in sam.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")
# Now train with the loop from section 4.1, running the image encoder with
# gradients enabled and passing the LoRA parameters to the optimizer.
# Inside the image encoder, only the LoRA matrices are updated.
4.3 Fine-tuning Mask R-CNN on Custom Data¶
import numpy as np
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import CocoDetection
from torchvision.models.detection import maskrcnn_resnet50_fpn_v2
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor
from torchvision.transforms.functional import to_tensor

def get_maskrcnn_model(num_classes):
    """Create a COCO-pretrained Mask R-CNN with heads resized for custom classes."""
    model = maskrcnn_resnet50_fpn_v2(weights="DEFAULT")
    # Replace the classifier head
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # Replace the mask predictor
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    model.roi_heads.mask_predictor = MaskRCNNPredictor(
        in_features_mask, hidden_layer, num_classes
    )
    return model

# Custom dataset for Mask R-CNN (COCO-style)
class CocoSegDataset(CocoDetection):
    """Wraps CocoDetection to return the boxes, labels, and masks Mask R-CNN expects."""
    def __getitem__(self, idx):
        img, anns = super().__getitem__(idx)
        boxes, labels, masks = [], [], []
        for ann in anns:
            x, y, w, h = ann["bbox"]  # COCO: [x, y, w, h]
            if w <= 0 or h <= 0:      # Skip degenerate boxes
                continue
            boxes.append([x, y, x + w, y + h])      # torchvision: [x1, y1, x2, y2]
            labels.append(ann["category_id"])       # Must be 1..num_classes-1 (0 = background)
            masks.append(self.coco.annToMask(ann))  # Polygon/RLE -> binary (H, W) array
        width, height = img.size
        target = {
            "boxes": torch.tensor(boxes, dtype=torch.float32).reshape(-1, 4),
            "labels": torch.tensor(labels, dtype=torch.int64),
            "masks": torch.tensor(np.array(masks), dtype=torch.uint8).reshape(-1, height, width),
        }
        return to_tensor(img), target

# Paths are placeholders for your image folder and COCO annotation file
dataset = CocoSegDataset("images/train", "annotations.json")
dataloader = DataLoader(dataset, batch_size=2, shuffle=True,
                        collate_fn=lambda batch: tuple(zip(*batch)))

# Training loop
model = get_maskrcnn_model(num_classes=4)  # 3 classes + background
model.to("cuda")
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

model.train()
for epoch in range(20):
    for images, targets in dataloader:
        images = [img.to("cuda") for img in images]
        targets = [{k: v.to("cuda") for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)   # Dict of named loss tensors
        total_loss = sum(loss_dict.values())
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
    lr_scheduler.step()
    print(f"Epoch {epoch+1}: loss = {total_loss.item():.4f}")
5. Training Depth Estimation¶
5.1 Fine-tuning MiDaS/DPT on Custom Stereo Pairs¶
import os

import numpy as np
import torch
import torch.nn as nn
from PIL import Image
from torch.utils.data import DataLoader
from transformers import DPTForDepthEstimation, DPTImageProcessor

# Load pre-trained DPT model
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")
processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
model.to("cuda")

# Custom dataset for depth fine-tuning
class StereoDepthDataset(torch.utils.data.Dataset):
    """
    Expects directory structure:
    dataset/
    ├── left/   # Left stereo images
    ├── right/  # Right stereo images (not used by this supervised loop)
    └── depth/  # Ground truth depth maps (numpy .npy), all the same resolution
    """
    def __init__(self, root_dir, processor):
        self.left_dir = os.path.join(root_dir, "left")
        self.depth_dir = os.path.join(root_dir, "depth")
        self.images = sorted(os.listdir(self.left_dir))
        self.processor = processor

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = Image.open(os.path.join(self.left_dir, self.images[idx])).convert("RGB")
        depth = np.load(os.path.join(self.depth_dir,
                                     os.path.splitext(self.images[idx])[0] + ".npy"))
        # The processor resizes and normalizes the image the way DPT expects
        pixel_values = self.processor(images=img, return_tensors="pt")["pixel_values"].squeeze(0)
        depth = torch.tensor(depth, dtype=torch.float32).unsqueeze(0)  # (1, H, W)
        return {"pixel_values": pixel_values, "depth": depth}

dataloader = DataLoader(StereoDepthDataset("dataset/", processor), batch_size=4, shuffle=True)

# Fine-tuning loop
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# Robust regression loss. It is not scale-invariant: predictions and labels must
# share a representation. Intel/dpt-large predicts relative inverse depth, so
# convert the ground truth accordingly or expect the output scale to be relearned.
loss_fn = nn.SmoothL1Loss()

model.train()
for epoch in range(30):
    for batch in dataloader:
        pixel_values = batch["pixel_values"].to("cuda")
        gt_depth = batch["depth"].to("cuda")            # (B, 1, H, W)
        outputs = model(pixel_values=pixel_values)
        pred_depth = outputs.predicted_depth.unsqueeze(1)  # (B, h, w) -> (B, 1, h, w)
        # Interpolate to ground truth size
        pred_depth = nn.functional.interpolate(
            pred_depth, size=gt_depth.shape[-2:], mode="bicubic", align_corners=False
        )
        loss = loss_fn(pred_depth, gt_depth)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}: depth_loss = {loss.item():.4f}")
5.2 Self-supervised Depth Estimation Training¶
Train depth from stereo pairs without ground truth using photometric consistency:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonoDepthLoss(nn.Module):
    """
    Self-supervised depth estimation loss for stereo pairs
    (based on Monodepth2: Godard et al., 2019)
    """
    def __init__(self, alpha_ssim=0.85, alpha_l1=0.15, smooth_weight=1e-3):
        super().__init__()
        self.alpha_ssim = alpha_ssim
        self.alpha_l1 = alpha_l1
        self.smooth_weight = smooth_weight  # Assumed weighting; tune for your data

    def ssim_loss(self, pred, target, window_size=11):
        """Structural similarity loss."""
        C1 = 0.01 ** 2
        C2 = 0.03 ** 2
        pad = window_size // 2
        mu_pred = F.avg_pool2d(pred, window_size, stride=1, padding=pad)
        mu_target = F.avg_pool2d(target, window_size, stride=1, padding=pad)
        mu_pred_sq = mu_pred ** 2
        mu_target_sq = mu_target ** 2
        mu_cross = mu_pred * mu_target
        sigma_pred_sq = F.avg_pool2d(pred ** 2, window_size, 1, pad) - mu_pred_sq
        sigma_target_sq = F.avg_pool2d(target ** 2, window_size, 1, pad) - mu_target_sq
        sigma_cross = F.avg_pool2d(pred * target, window_size, 1, pad) - mu_cross
        ssim_map = ((2 * mu_cross + C1) * (2 * sigma_cross + C2)) / \
                   ((mu_pred_sq + mu_target_sq + C1) * (sigma_pred_sq + sigma_target_sq + C2))
        return torch.clamp((1 - ssim_map) / 2, 0, 1).mean()

    def photometric_loss(self, pred_image, target_image):
        """Combined L1 + SSIM photometric loss."""
        l1 = (pred_image - target_image).abs().mean()
        ssim = self.ssim_loss(pred_image, target_image)
        return self.alpha_ssim * ssim + self.alpha_l1 * l1

    def smoothness_loss(self, depth, image):
        """Edge-aware smoothness: penalize depth gradients except at image edges."""
        depth_grad_x = (depth[:, :, :, :-1] - depth[:, :, :, 1:]).abs()
        depth_grad_y = (depth[:, :, :-1, :] - depth[:, :, 1:, :]).abs()
        image_grad_x = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
        image_grad_y = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)
        return (depth_grad_x * torch.exp(-image_grad_x)).mean() + \
               (depth_grad_y * torch.exp(-image_grad_y)).mean()

    def forward(self, pred_depth, recon_left, target_left):
        """
        pred_depth:  predicted depth map for the left view (B, 1, H, W)
        recon_left:  left view reconstructed by warping the right image
                     with pred_depth (B, 3, H, W)
        target_left: original left image (B, 3, H, W)
        The warp itself needs the intrinsics K and the left-to-right transform
        and is done by the caller before this loss is evaluated.
        """
        # Photometric consistency between the reconstruction and the real left image
        photo_loss = self.photometric_loss(recon_left, target_left)
        # Smoothness loss (encourage depth gradients to align with image gradients)
        smooth_loss = self.smoothness_loss(pred_depth, target_left)
        return {
            "photometric_loss": photo_loss,
            "smoothness_loss": smooth_loss,
            "total_loss": photo_loss + self.smooth_weight * smooth_loss,
        }

# Usage in training
loss_fn = MonoDepthLoss()
# ... training loop with stereo pairs, warp the right image using predicted depth
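For a rectified stereo pair, the link between the predicted depth and the horizontal shift used for warping is the standard relation depth = focal_length × baseline / disparity. The sketch below works through that arithmetic; the function names and the focal length and baseline values are hypothetical, chosen only for illustration.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth in meters from disparity in pixels (rectified stereo assumed)."""
    if disparity_px <= 0:
        return float("inf")  # Zero disparity corresponds to a point at infinity
    return focal_px * baseline_m / disparity_px


def depth_to_disparity(depth_m, focal_px, baseline_m):
    """Inverse mapping: the pixel shift to apply when warping right -> left."""
    return focal_px * baseline_m / depth_m


focal_px = 720.0     # Hypothetical focal length in pixels
baseline_m = 0.12    # Hypothetical distance between the two cameras in meters

depth = disparity_to_depth(36.0, focal_px, baseline_m)
print(f"36 px disparity -> {depth:.2f} m")
print(f"{depth:.2f} m -> {depth_to_disparity(depth, focal_px, baseline_m):.1f} px")
```

Because depth is inversely proportional to disparity, a one-pixel disparity error matters far more for distant objects than for near ones.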
Recommended datasets for depth training:
| Dataset | Type | Size | Use Case |
|---|---|---|---|
| KITTI Depth | Real stereo + LiDAR | 86K images | Autonomous driving |
| Make3D | Real outdoor | 534 images | Outdoor depth |
| NYU Depth V2 | Indoor RGB-D | 1,449 labeled RGB-D pairs (464 scenes) | Indoor robotics |
| ScanNet | Indoor RGB-D | 1,513 scans | Indoor 3D understanding |
6. Evaluation & Validation¶
6.1 Standard Metrics¶
mAP (mean Average Precision): Primary metric for object detection.
from ultralytics import YOLO
model = YOLO("runs/detect/train/weights/best.pt")
metrics = model.val(data="data.yaml", split="test")
# Key metrics
print(f"mAP50: {metrics.box.map50:.4f}") # AP at IoU=0.50
print(f"mAP50-95: {metrics.box.map:.4f}") # AP averaged over IoU 0.50:0.95
print(f"Precision: {metrics.box.mp:.4f}") # Mean precision
print(f"Recall: {metrics.box.mr:.4f}") # Mean recall
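# Per-class metrics. The arrays are ordered by ap_class_index (classes that
# actually appear in the split), which can differ from the raw class ids.
names = metrics.names
for i, cls_id in enumerate(metrics.box.ap_class_index):
    p, r = metrics.box.p[i], metrics.box.r[i]
    ap50, ap = metrics.box.ap50[i], metrics.box.ap[i]
    print(f" {names[cls_id]:15s} P={p:.3f} R={r:.3f} AP50={ap50:.3f} AP50-95={ap:.3f}")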
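To show what a single AP value summarizes, here is a minimal sketch that integrates a precision-recall curve using all-point interpolation (the Pascal VOC 2010+ convention). Libraries differ in interpolation details, so this is not necessarily the exact computation behind the numbers printed above; `average_precision` is an illustrative name.

```python
def average_precision(recalls, precisions):
    """Area under the precision envelope; `recalls` must be sorted ascending."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas wherever recall increases
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


# Precision/recall measured at two operating points for each class (toy values)
ap_per_class = {
    "bottle": average_precision([0.5, 1.0], [1.0, 0.5]),
    "cup": average_precision([1.0], [1.0]),
}
mean_ap = sum(ap_per_class.values()) / len(ap_per_class)
print(ap_per_class, f"mAP={mean_ap:.3f}")
```

mAP is then the mean of the per-class AP values, and mAP50-95 repeats this at IoU thresholds from 0.50 to 0.95 and averages the results.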
IoU (Intersection over Union): Measures overlap between predicted and ground truth boxes.
IoU = Area of Intersection / Area of Union
┌──────────────┐
│ ┌───GT │
│ │ ╱ │
│ └─╱────────│
│ ╱ │ │
│ ╱ Pred │
└──────────────┘
IoU > 0.5 → typically considered a "correct" detection
IoU > 0.75 → stricter threshold for precise localization
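The formula above translates directly into code. This is a minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) pixel format; `box_iou` is an illustrative name, not a library function.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)  # Shifted right by half the box width
print(f"IoU = {box_iou(ground_truth, prediction):.3f}")
```

A prediction shifted by half the box width already falls below the 0.5 threshold, which illustrates how strict IoU-based matching is.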
6.2 Confusion Matrix Analysis¶
from ultralytics import YOLO
model = YOLO("runs/detect/train/weights/best.pt")
results = model.val(data="data.yaml", split="test", plots=True)
# Confusion matrix saved to runs/detect/val/confusion_matrix.png
Key patterns to look for in confusion matrices:
Predicted
cat dog background
True cat [ 85 5 10 ] → 10 missed cats
dog [ 3 88 9 ] → 9 missed dogs
bg [ 2 1 97 ] → 3 false positives
- Diagonal dominance = good
- Off-diagonal clusters = systematic misclassification
- High background row = false positives
- High background column = false negatives (missed objects)
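The patterns above can be quantified by reading per-class precision and recall off the matrix. A minimal sketch using the example matrix (rows are true classes, columns are predictions); `precision_recall` is an illustrative helper name.

```python
labels = ["cat", "dog", "background"]
confusion = [
    [85, 5, 10],   # true cat
    [3, 88, 9],    # true dog
    [2, 1, 97],    # true background
]


def precision_recall(cm, idx):
    """Precision = TP / column sum (all predictions of the class);
    recall = TP / row sum (all true instances of the class)."""
    tp = cm[idx][idx]
    predicted = sum(row[idx] for row in cm)
    actual = sum(cm[idx])
    return tp / predicted, tp / actual


for i, name in enumerate(labels[:2]):  # Background is not an object class
    p, r = precision_recall(confusion, i)
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
```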
6.3 Precision-Recall Curves¶
import matplotlib.pyplot as plt
from ultralytics import YOLO
model = YOLO("runs/detect/train/weights/best.pt")
results = model.val(data="data.yaml", plots=True)
# Plots are auto-saved: PR_curve.png, F1_curve.png, P_curve.png, R_curve.png
Interpreting the curves:
- PR curve: Area under curve = AP. Curves closer to top-right = better.
- F1 curve: Peak of the F1 curve suggests optimal confidence threshold.
- Precision curve: Increases as confidence threshold increases.
- Recall curve: Decreases as confidence threshold increases.
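As a sketch of how the F1 peak is located, the code below sweeps a few confidence thresholds over a toy set of detections that have already been matched to ground truth. The data and the helper name `sweep_thresholds` are made up for illustration.

```python
def sweep_thresholds(scores, is_tp, num_gt, thresholds):
    """Return (threshold, precision, recall, f1) for each confidence threshold."""
    rows = []
    for t in thresholds:
        kept = [tp for s, tp in zip(scores, is_tp) if s >= t]
        tp_count = sum(kept)
        precision = tp_count / len(kept) if kept else 0.0
        recall = tp_count / num_gt
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall > 0 else 0.0)
        rows.append((t, precision, recall, f1))
    return rows


# Four detections: confidence score and whether each matched a ground-truth box
scores = [0.9, 0.8, 0.6, 0.3]
is_tp = [True, True, True, False]
rows = sweep_thresholds(scores, is_tp, num_gt=4, thresholds=[0.25, 0.5, 0.7])

for t, p, r, f1 in rows:
    print(f"conf>={t:.2f}: P={p:.2f} R={r:.2f} F1={f1:.3f}")
best = max(rows, key=lambda row: row[3])
print(f"Best threshold by F1: {best[0]}")
```

In this toy example precision rises and recall falls as the threshold increases, and F1 peaks at the middle threshold where the single false positive has been filtered out but no true positives have been lost.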
6.4 Test-Time Augmentation (TTA)¶
TTA applies multiple augmentations at inference and aggregates predictions:
from ultralytics import YOLO
model = YOLO("best.pt")
# Basic inference
results = model.predict("test_image.jpg", conf=0.25)
# With TTA (slower but more accurate)
results_tta = model.predict(
"test_image.jpg",
conf=0.25,
augment=True, # Enable TTA
)
# TTA applies horizontal flip and multiple scales, then NMS merges results
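Before predictions from a flipped copy of the image can be merged, they have to be mapped back into the original image frame. The following is a minimal sketch of that step for horizontal flips, not Ultralytics' internal implementation; `unflip_box` is a hypothetical helper.

```python
def unflip_box(box, image_width):
    """Map an (x1, y1, x2, y2) box from a horizontally flipped image
    back to the original image's coordinates."""
    x1, y1, x2, y2 = box
    # Mirroring swaps left and right edges, so the new x1 comes from the old x2
    return (image_width - x2, y1, image_width - x1, y2)


image_width = 640
box_in_flipped = (100, 50, 250, 270)
box_in_original = unflip_box(box_in_flipped, image_width)
print(box_in_original)
```

After this mapping, boxes from all augmented passes share one coordinate frame and can be deduplicated together with NMS.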
7. Model Optimization for Deployment¶
7.1 Quantization¶
Reduce model precision from FP32 to INT8 or FP16 for faster inference:
from ultralytics import YOLO
model = YOLO("best.pt")
# FP16 export (2x memory reduction, minimal accuracy loss)
model.export(format="engine", half=True)
# INT8 quantization (4x reduction, slight accuracy trade-off)
# Requires a calibration dataset
model.export(
format="engine",
int8=True,
data="calibration_data.yaml", # Small representative dataset
)
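# ONNX export, then post-training quantization with ONNX Runtime
model.export(format="onnx", simplify=True) # Writes best.onnx next to best.pt
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType
# Post-training dynamic quantization (weights only).
# For convolution-heavy detectors, static quantization with a calibration set
# (onnxruntime.quantization.quantize_static) is usually the better fit.
quantize_dynamic(
    model_input="best.onnx",
    model_output="best_int8.onnx",
    weight_type=QuantType.QUInt8, # Unsigned weights; signed int8 conv kernels may be unavailable on CPU
)
# Benchmark
session = ort.InferenceSession("best_int8.onnx", providers=["CPUExecutionProvider"])
# Compare inference time with FP32 vs INT8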
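To make the precision trade-off concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization on a handful of weights. It illustrates the arithmetic only; real toolchains such as TensorRT or ONNX Runtime add calibration, per-channel scales, and zero points, and the function names here are illustrative.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [-1.0, 0.0, 0.25, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print("int8 values:", q)
print(f"scale: {scale:.6f}, max error: {max(errors):.6f}")
```

The rounding error per weight is bounded by half the scale, so tensors with a few large outliers (and therefore a large scale) lose the most precision.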
7.2 Pruning¶
Zero out low-magnitude weights. Unstructured pruning by itself neither shrinks the dense checkpoint nor speeds up inference; those gains need sparse-aware storage and kernels or structured (channel) pruning, and a short fine-tune afterwards usually recovers lost accuracy:
import torch
import torch.nn.utils.prune as prune
from ultralytics import YOLO

model = YOLO("best.pt").model

# Global magnitude pruning - remove the 30% of conv weights with smallest |w|
parameters_to_prune = []
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        parameters_to_prune.append((module, "weight"))

prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.3,  # Remove 30% of weights
)

# Make pruning permanent (folds the mask into the weight tensor)
for module, param_name in parameters_to_prune:
    prune.remove(module, param_name)

# Count sparsity
total = 0
pruned = 0
for module, _ in parameters_to_prune:
    total += module.weight.nelement()
    pruned += torch.sum(module.weight == 0).item()
print(f"Sparsity: {100. * pruned / total:.1f}%")
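What "global" means in global magnitude pruning can be shown without PyTorch. The sketch below is a simplified stand-in, assuming weights are plain Python lists keyed by made-up layer names; it picks one magnitude threshold across all layers, so layers end up with different sparsity levels.

```python
def global_magnitude_prune(layers, amount):
    """Zero the `amount` fraction of weights with the smallest |w| across
    all layers. `layers` maps name -> list of floats; returns a new dict.
    Ties at the threshold are all pruned, so slightly more may be removed."""
    magnitudes = sorted(abs(w) for ws in layers.values() for w in ws)
    n_prune = int(amount * len(magnitudes))
    if n_prune == 0:
        return {name: list(ws) for name, ws in layers.items()}
    threshold = magnitudes[n_prune - 1]
    return {
        name: [0.0 if abs(w) <= threshold else w for w in ws]
        for name, ws in layers.items()
    }


def sparsity(ws):
    """Fraction of exactly-zero weights in a list."""
    return sum(1 for w in ws if w == 0.0) / len(ws)


# Hypothetical layer names and weights, for illustration only
layers = {
    "conv1": [0.9, -0.05, 0.3, 0.01],
    "conv2": [-0.02, 0.6, -0.7, 0.04, 0.2, -0.1],
}
pruned_layers = global_magnitude_prune(layers, amount=0.3)
for name, ws in pruned_layers.items():
    print(f"{name}: {ws} (sparsity {sparsity(ws):.0%})")
```

Here three of ten weights are removed overall, but `conv1` loses one of four and `conv2` two of six. Uneven per-layer sparsity is expected with a global threshold and is worth checking on sensitive layers such as the detection head.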
7.3 Knowledge Distillation¶
Train a smaller student model to mimic a larger teacher:
import torch
import torch.nn as nn
import torch.nn.functional as F
from ultralytics import YOLO


class DistillationLoss(nn.Module):
    """Soft-target (KL) plus hard-target (cross-entropy) loss for class logits of shape (N, C)."""

    def __init__(self, temperature=4.0, alpha=0.7):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha

    def forward(self, student_logits, teacher_logits, gt_labels):
        # Soft targets (knowledge from teacher)
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=1),
            F.softmax(teacher_logits / self.temperature, dim=1),
            reduction="batchmean",
        ) * (self.temperature ** 2)
        # Hard targets (ground truth)
        hard_loss = F.cross_entropy(student_logits, gt_labels)
        return self.alpha * soft_loss + (1 - self.alpha) * hard_loss


# Distillation pipeline
teacher = YOLO("yolov8x.pt")  # Large teacher
student = YOLO("yolov8n.pt")  # Small student
# Train student to mimic teacher's soft predictions
# Use the distillation loss in combination with standard detection loss
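The role of the temperature is easier to see with numbers. The following sketch uses only the standard library and made-up logits: it softens teacher and student distributions at two temperatures and prints the KL divergence that the soft-target term is built on.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Illustrative logits for a 3-class problem
teacher_logits = [6.0, 2.0, 1.0]
student_logits = [4.0, 3.0, 0.5]

for temperature in (1.0, 4.0):
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = kl_divergence(p_teacher, p_student)
    print(f"T={temperature}: teacher={[round(p, 3) for p in p_teacher]} KL={kl:.4f}")
```

At `T=1` the teacher puts almost all probability on one class, so little is conveyed about how the other classes relate. At `T=4` the distribution is flatter and those relative similarities become visible to the student.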
7.4 ONNX Export and Optimization¶
from ultralytics import YOLO

model = YOLO("best.pt")

# Export to ONNX
model.export(
    format="onnx",
    imgsz=640,
    simplify=True,  # Apply ONNX simplifier
    dynamic=False,  # Fixed input size (faster)
    opset=17,       # ONNX opset version
)

# Run with onnxruntime
import onnxruntime as ort

# GPU inference (falls back to CPU if the CUDA provider is unavailable)
session_gpu = ort.InferenceSession(
    "best.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# CPU inference (optimized): options must be set BEFORE the session is created;
# changing them on an existing session has no effect
cpu_options = ort.SessionOptions()
cpu_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session_cpu = ort.InferenceSession(
    "best.onnx",
    sess_options=cpu_options,
    providers=["CPUExecutionProvider"],
)

# Benchmark
import time
import numpy as np

dummy_input = np.random.randn(1, 3, 640, 640).astype(np.float32)
input_name = session_gpu.get_inputs()[0].name

# Warmup
for _ in range(10):
    session_gpu.run(None, {input_name: dummy_input})

# Timed runs (perf_counter is a monotonic, high-resolution clock)
times = []
for _ in range(100):
    start = time.perf_counter()
    session_gpu.run(None, {input_name: dummy_input})
    times.append(time.perf_counter() - start)
print(f"ONNX GPU: {np.mean(times)*1000:.1f} ms/img")
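A mean latency alone can hide occasional slow frames, which matter for a control loop with a fixed deadline. The helper below is a small sketch of how a list of per-inference timings (in seconds, like `times` in the benchmark) might be summarized; `summarize_latency` and the sample timings are illustrative assumptions, not measurements.

```python
import statistics


def summarize_latency(times_s):
    """Summarize per-inference latencies given in seconds."""
    ms = sorted(t * 1000.0 for t in times_s)
    # Nearest-rank style index for the 95th percentile
    p95_index = min(len(ms) - 1, int(round(0.95 * (len(ms) - 1))))
    mean_ms = statistics.fmean(ms)
    return {
        "mean_ms": mean_ms,
        "median_ms": statistics.median(ms),
        "p95_ms": ms[p95_index],
        "fps": 1000.0 / mean_ms,
    }


# Made-up timings: 90 fast inferences and 10 slow outliers
sample_times = [0.008] * 90 + [0.030] * 10
stats = summarize_latency(sample_times)
for key, value in stats.items():
    print(f"{key}: {value:.1f}")
```

In this made-up sample the median is about 8 ms while the 95th percentile is about 30 ms, so a 10 ms perception budget would be missed on roughly one frame in ten even though the mean looks acceptable.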
TensorRT optimization (typically the fastest inference path on NVIDIA GPUs):
# Export directly to TensorRT engine
model.export(format="engine", imgsz=640, half=True)

# Or convert from ONNX
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("best.onnx", "rb") as f:
    # parse() returns False on failure instead of raising
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse best.onnx")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB
config.set_flag(trt.BuilderFlag.FP16)  # Enable FP16

engine = builder.build_serialized_network(network, config)
if engine is None:
    raise RuntimeError("TensorRT engine build failed")
with open("best.trt", "wb") as f:
    f.write(engine)
8. MLOps for Robotics¶
8.1 Experiment Tracking¶
Weights & Biases:
import wandb
from ultralytics import YOLO

wandb.init(
    project="robotics-perception",
    name="yolov8m_finetune_v3",
    config={
        "model": "yolov8m",
        "epochs": 100,
        "lr": 0.001,
        "batch_size": 16,
        "imgsz": 640,
        "augmentations": ["mosaic", "mixup", "hsv"],
    },
)

model = YOLO("yolov8m.pt")
results = model.train(
    data="data.yaml",
    epochs=100,
    project="robotics-perception",
    name="yolov8m_finetune_v3",
)

# Log metrics manually if needed
wandb.log({"mAP50": results.results_dict.get("metrics/mAP50(B)", 0)})
wandb.finish()
MLflow:
import mlflow
from ultralytics import YOLO

mlflow.set_experiment("robotics-perception")
with mlflow.start_run(run_name="yolov8m_finetune_v3"):
    # Log parameters
    mlflow.log_param("model", "yolov8m")
    mlflow.log_param("epochs", 100)
    mlflow.log_param("learning_rate", 0.001)

    # Train
    model = YOLO("yolov8m.pt")
    results = model.train(data="data.yaml", epochs=100)

    # Log metrics
    mlflow.log_metric("mAP50", results.results_dict["metrics/mAP50(B)"])
    mlflow.log_metric("mAP50-95", results.results_dict["metrics/mAP50-95(B)"])

    # Log model artifact (adjust the path if Ultralytics created train2, train3, ...)
    mlflow.log_artifact("runs/detect/train/weights/best.pt")
8.2 Model Versioning and Registry¶
model_registry/
├── models/
│ ├── detection/
│ │ ├── v1.0/
│ │ │ ├── model.pt
│ │ │ ├── metadata.json
│ │ │ └── eval_report.html
│ │ ├── v1.1/
│ │ └── v2.0/
│ ├── segmentation/
│ │ └── v1.0/
│ └── depth/
│ └── v1.0/
└── metadata.json # Global registry
Example metadata.json per model version:
{
  "model_name": "robot_detection",
  "version": "2.0",
  "framework": "ultralytics_yolov8",
  "base_model": "yolov8m",
  "training_data": "robot_dataset_v3",
  "training_date": "2026-04-15",
  "metrics": {
    "mAP50": 0.87,
    "mAP50-95": 0.62,
    "inference_ms": 8.3,
    "model_size_mb": 52.4
  },
  "target_hardware": "Jetson Orin Nano",
  "quantization": "FP16",
  "approved_by": "team_lead",
  "status": "production"
}
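A deployment script could use these metadata files to pick which model version to ship. The sketch below is one possible approach, assuming each version's `metadata.json` has already been loaded into a dict; `latest_production` and the `"staging"` status value are hypothetical and not part of any registry tool.

```python
import json

REQUIRED_KEYS = {"model_name", "version", "metrics", "status"}


def parse_version(version):
    """Turn '1.10' into (1, 10) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def latest_production(entries):
    """Return the highest-version entry with status 'production', or None."""
    valid = [
        e for e in entries
        if REQUIRED_KEYS.issubset(e) and e["status"] == "production"
    ]
    return max(valid, key=lambda e: parse_version(e["version"]), default=None)


# Minimal illustrative entries (real files would carry the full metadata)
raw_entries = [
    '{"model_name": "robot_detection", "version": "1.9", "metrics": {"mAP50": 0.81}, "status": "production"}',
    '{"model_name": "robot_detection", "version": "1.10", "metrics": {"mAP50": 0.84}, "status": "production"}',
    '{"model_name": "robot_detection", "version": "2.0", "metrics": {"mAP50": 0.87}, "status": "staging"}',
    '{"model_name": "robot_detection", "version": "0.9"}',
]
entries = [json.loads(raw) for raw in raw_entries]

chosen = latest_production(entries)
print(chosen["version"], chosen["metrics"]["mAP50"])
```

Comparing versions as integer tuples avoids the string-ordering trap where `"1.9"` sorts after `"1.10"`, and entries missing required keys are skipped rather than deployed by accident.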
8.3 CI/CD for Model Retraining¶
# .github/workflows/retrain.yml
name: Model Retraining Pipeline
on:
  push:
    paths:
      - 'training_data/**'
  workflow_dispatch:
    inputs:
      model_type:
        description: 'Model to retrain'
        required: true
        type: choice
        options:
          - detection
          - segmentation
          - depth
jobs:
  train:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install ultralytics wandb opencv-python
      - name: Train model
        # inputs.model_type is empty on push events, so fall back to detection
        run: |
          python train.py \
            --model-type ${{ inputs.model_type || 'detection' }} \
            --epochs 100 \
            --data training_data/data.yaml
        env:
          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
      - name: Run evaluation
        # --threshold is the minimum mAP50 to pass; evaluate.py must exit non-zero below it
        run: |
          python evaluate.py \
            --model runs/detect/train/weights/best.pt \
            --data training_data/data.yaml \
            --threshold 0.8
      - name: Deploy if metrics pass
        if: success()
        run: |
          python deploy.py \
            --model runs/detect/train/weights/best.pt \
            --format onnx \
            --target jetson
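The workflow relies on `evaluate.py` failing the job when the metric is too low, which in CI terms means returning a non-zero exit code. The guide does not show that script, so the following is only a sketch of the gating logic under that assumption; the `evaluate` callable stands in for whatever produces mAP50, for example a wrapper around Ultralytics validation.

```python
import argparse


def gate(map50, threshold):
    """Exit code for CI: 0 if the metric meets the threshold, 1 otherwise."""
    return 0 if map50 >= threshold else 1


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Fail the CI job when mAP50 is too low")
    parser.add_argument("--model", required=True)
    parser.add_argument("--data", required=True)
    parser.add_argument("--threshold", type=float, default=0.8)
    return parser.parse_args(argv)


def main(evaluate, argv=None):
    """`evaluate(model_path, data_yaml)` is a hypothetical callable returning mAP50."""
    args = parse_args(argv)
    map50 = evaluate(args.model, args.data)
    status = "PASS" if map50 >= args.threshold else "FAIL"
    print(f"mAP50={map50:.3f} threshold={args.threshold:.2f} -> {status}")
    return gate(map50, args.threshold)


# Demonstration with a stub evaluator; a real script would end with
# sys.exit(main(real_evaluate)) so the exit code reaches the CI runner.
demo_args = ["--model", "best.pt", "--data", "data.yaml", "--threshold", "0.8"]
exit_code = main(lambda model, data: 0.75, demo_args)
print("exit code:", exit_code)
```

Because GitHub Actions stops a job at the first failing step, a non-zero exit code here is what prevents the deploy step from running.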
9. Common Pitfalls¶
9.1 Overfitting¶
Symptoms: Training loss decreases, validation loss increases or plateaus.
Solutions:
from ultralytics import YOLO

model = YOLO("yolov8m.pt")

# 1. Add regularization and stronger augmentation
model.train(
    data="data.yaml",
    weight_decay=0.01,  # L2 regularization
    dropout=0.2,        # Only used by Ultralytics classification models
    mixup=0.15,         # Mixup augmentation
    mosaic=1.0,         # Mosaic augmentation
    degrees=10,         # Rotation augmentation
    scale=0.5,          # Scale augmentation
)

# 2. Reduce model complexity
model = YOLO("yolov8s.pt")  # Use smaller model

# 3. Early stopping
model.train(
    data="data.yaml",
    patience=15,  # Stop if no improvement for 15 epochs
)

# 4. More data or better augmentation
# 5. Reduce number of epochs
# 6. Add dropout layers to custom heads
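The `patience` idea can be illustrated independently of any framework. This is a simplified sketch that watches a validation loss; Ultralytics applies its own stopping criterion internally, so the function below only demonstrates the counting logic and its name is made up.

```python
def early_stop_epoch(val_losses, patience):
    """Return the epoch index at which training would stop, or None if
    the patience counter never runs out."""
    best = float("inf")
    epochs_since_best = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                return epoch
    return None


# Illustrative validation losses: the best value occurs at epoch 2
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.74]
print(early_stop_epoch(val_losses, patience=3))
print(early_stop_epoch(val_losses, patience=10))
```

With a patience of 3 the run stops three epochs after the best value, and the checkpoint from the best epoch (not the last one) is the one to keep.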
9.2 Class Imbalance¶
Symptoms: Model biased toward majority class, low recall on rare classes.
Solutions:
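# 1. Loss weighting
from ultralytics import YOLO

model = YOLO("yolov8m.pt")
# Ultralytics train() exposes no per-class weight argument, and unknown keys
# such as class_weights are rejected. `cls` scales the classification loss as
# a whole; true per-class weights need a custom loss or trainer.
model.train(
    data="data.yaml",
    cls=1.0,  # Classification loss gain (default 0.5)
)
# 2. Oversampling rare classes
# 3. Focal loss (not the YOLOv8 default, which uses BCE for classification; needs a custom loss)
# 4. Synthetic data generation for rare classes
# 5. Data-level balancing (remove excess majority samples)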
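Whichever remedy is used, it helps to quantify the imbalance first. The sketch below counts instances per class from YOLO-format label lines and derives inverse-frequency weights, which can guide oversampling factors or a custom weighted loss; the helper names and the synthetic label lines are assumptions for illustration.

```python
from collections import Counter


def class_counts(label_lines):
    """Count instances per class ID from YOLO-format label lines
    ('class cx cy w h'); blank lines are ignored."""
    return Counter(int(line.split()[0]) for line in label_lines if line.strip())


def inverse_frequency_weights(counts):
    """Weights proportional to 1/frequency, scaled so that the most
    common class has weight 1.0."""
    most_common = max(counts.values())
    return {cls: most_common / n for cls, n in sorted(counts.items())}


# Synthetic labels: class 0 is common, classes 1 and 2 are rare
label_lines = (
    ["0 0.5 0.5 0.2 0.2"] * 60
    + ["1 0.3 0.3 0.1 0.1"] * 20
    + ["2 0.7 0.7 0.1 0.1"] * 12
)
counts = class_counts(label_lines)
weights = inverse_frequency_weights(counts)
print(dict(counts))
print(weights)
```

Read as oversampling factors, these weights say that images containing class 2 would need to appear about five times as often to match class 0. Very large factors risk overfitting to a few rare examples, so capping them is a common precaution.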
9.3 Data Leakage¶
Symptoms: Metrics seem too good; model fails on truly new data.
Common causes and fixes:
# 1. Split BEFORE augmentation
# WRONG: augment all data, then split → augmented versions of the same image
#        appear in both train and val
# RIGHT: split raw data first, then augment the train set only

# 2. Check for temporal leakage
# WRONG: randomly split frames from one recording, or validate on a second
#        recording of the same scene → near-identical images in train and val
# RIGHT: split by recording session; use different days, lighting, and cameras for val

# 3. Ensure no duplicate images across splits
import hashlib
import os


def find_duplicates(image_dir):
    """Find byte-identical images by content hash (resized or re-encoded copies are not caught)."""
    hashes = {}
    for root, _, files in os.walk(image_dir):
        for f in files:
            if f.lower().endswith(('.jpg', '.jpeg', '.png')):
                path = os.path.join(root, f)
                with open(path, 'rb') as fp:
                    h = hashlib.md5(fp.read()).hexdigest()
                if h in hashes:
                    print(f"DUPLICATE: {path} == {hashes[h]}")
                else:
                    hashes[h] = path


find_duplicates("dataset/images/")
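Splitting by recording session rather than by individual frame is the usual guard against temporal leakage. The following sketch assumes a hypothetical filename convention of `sessionNN_frameNNNN.jpg`; the grouping function would need to match however sessions are actually encoded in a given dataset.

```python
import random
from collections import defaultdict


def split_by_group(filenames, group_of, val_fraction=0.2, seed=0):
    """Split files into train/val so that no group (e.g. a recording
    session) contributes files to both splits."""
    groups = defaultdict(list)
    for name in filenames:
        groups[group_of(name)].append(name)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)  # deterministic for a given seed
    n_val = max(1, round(val_fraction * len(keys)))
    val_keys = set(keys[:n_val])
    train = [f for k in keys if k not in val_keys for f in groups[k]]
    val = [f for k in keys if k in val_keys for f in groups[k]]
    return train, val


def session_of(name):
    """Hypothetical convention: 'session03_frame0012.jpg' -> 'session03'."""
    return name.split("_")[0]


filenames = [
    f"session{s:02d}_frame{i:04d}.jpg" for s in range(5) for i in range(4)
]
train_files, val_files = split_by_group(filenames, session_of, val_fraction=0.2)
train_sessions = {session_of(f) for f in train_files}
val_sessions = {session_of(f) for f in val_files}
print(len(train_files), len(val_files), train_sessions & val_sessions)
```

The printed intersection is empty, which confirms that no session appears on both sides of the split. Note that the fraction applies to groups rather than files, so split sizes can be uneven when sessions differ in length.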
9.4 Annotation Errors¶
Common issues:
- Inconsistent bounding boxes: Tight vs. loose boxes for the same class
- Missing annotations: Objects present but not labeled
- Wrong class labels: Misclassifications in annotations
- Off-by-one class IDs: Labels indexed from 1, or a background class included, when the format expects 0-based object classes (or vice versa)
Detection tools:
import os

import cv2
from ultralytics import YOLO

# Load the model you trained, then analyze low-confidence predictions
model = YOLO("best.pt")
results = model.predict("val_images/", conf=0.3, save=True)
# Look for patterns:
# - Same location always has low confidence → likely annotation error
# - Consistent class confusion → label swap
# - Objects never detected → missing annotations


# Programmatic check for annotation quality
def audit_annotations(labels_dir, images_dir):
    """Check YOLO-format labels for common annotation issues."""
    issues = []
    for label_file in sorted(os.listdir(labels_dir)):
        if not label_file.endswith(".txt"):
            continue
        img_name = os.path.splitext(label_file)[0]
        img_path = os.path.join(images_dir, img_name + ".jpg")  # assumes .jpg images
        if not os.path.exists(img_path):
            issues.append(f"Orphan label: {label_file} (no matching image)")
            continue
        if cv2.imread(img_path) is None:  # imread returns None on unreadable files
            issues.append(f"Unreadable image: {img_path}")
            continue
        with open(os.path.join(labels_dir, label_file)) as f:
            for i, line in enumerate(f, start=1):
                parts = line.split()
                if not parts:
                    continue  # skip blank lines
                if len(parts) != 5:
                    issues.append(f"{label_file} line {i}: wrong format")
                    continue
                try:
                    cls = int(parts[0])
                    cx, cy, bw, bh = map(float, parts[1:])
                except ValueError:
                    issues.append(f"{label_file} line {i}: non-numeric value")
                    continue
                # Check bounds (coordinates are normalized to [0, 1])
                if cls < 0 or not (0 <= cx <= 1 and 0 <= cy <= 1 and 0 < bw <= 1 and 0 < bh <= 1):
                    issues.append(f"{label_file} line {i}: out-of-bounds bbox")
                if bw < 0.01 or bh < 0.01:
                    issues.append(f"{label_file} line {i}: suspiciously small bbox")
    print(f"Found {len(issues)} issues:")
    for issue in issues:
        print(f"  - {issue}")


audit_annotations("labels/", "images/")
9.5 Quick Diagnostic Checklist¶
| Problem | Symptom | Fix |
|---|---|---|
| Overfitting | Train loss ↓, val loss ↑ | More data, augmentation, regularization, smaller model |
| Underfitting | Both losses high | Larger model, more epochs, higher LR |
| Class imbalance | High accuracy, low recall on rare classes | Class weights, oversampling, focal loss |
| Domain gap | Great val metrics, poor real-world performance | Collect real-world data, domain adaptation |
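| Small object failures | Poor AP for small objects | Higher input resolution, tiled inference (YOLOv8 is anchor-free, so there are no anchor scales to tune) |
| Occlusion failures | Missed detections when objects overlap | NMS tuning (raise the IoU threshold, `iou` in Ultralytics, so overlapping boxes survive), training with occlusion augmentation |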
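How the NMS IoU threshold interacts with overlapping objects can be checked with a small greedy NMS sketch. The boxes and scores below are invented: two real objects standing side by side plus one duplicate detection of the first. This is a plain-Python illustration, not the implementation used by any particular detector.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Greedy NMS: keep a box unless it overlaps an already kept,
    higher-scoring box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [
    (0.0, 0.0, 100.0, 100.0),   # object A
    (40.0, 0.0, 140.0, 100.0),  # object B, partly occluded by A (IoU about 0.43)
    (2.0, 2.0, 102.0, 102.0),   # duplicate detection of A (IoU about 0.92)
]
scores = [0.9, 0.8, 0.6]

print(nms(boxes, scores, iou_threshold=0.3))  # strict: B is suppressed
print(nms(boxes, scores, iou_threshold=0.5))  # looser: A and B both kept
```

With the threshold at 0.3 the second object is discarded as if it were a duplicate, while at 0.5 both objects survive and the true duplicate is still removed. Raising the threshold too far would start letting duplicates through, so it is worth tuning on scenes with realistic occlusion.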
10. References¶
Papers¶
- YOLOv8 (Ultralytics, 2023) — Ultralytics software release without an accompanying paper; anchor-free real-time detector
- Segment Anything (Kirillov et al., 2023) — Foundation model for promptable segmentation
- SAM 2 (Ravi et al., 2024) — Segment Anything for video
- Monodepth2 (Godard et al., 2019) — Self-supervised monocular depth estimation
- LoRA: Low-Rank Adaptation (Hu et al., 2021) — Parameter-efficient fine-tuning
- Knowledge Distillation (Hinton et al., 2015) — Compressing neural networks
- Focal Loss (Lin et al., 2017) — Addressing class imbalance
- YOLOv4 (Bochkovskiy et al., 2020) — Introduced Mosaic augmentation
- DPT: Dense Prediction Transformer (Ranftl et al., 2021) — Vision transformer for depth estimation
Tools and Frameworks¶
- Ultralytics YOLOv8 — Training, validation, export for detection/segmentation
- Segment Anything (SAM) — Meta's foundation segmentation model
- Roboflow — Dataset management and model training platform
- CVAT — Open-source annotation tool
- Label Studio — Multi-modal annotation platform
- Albumentations — Fast image augmentation library
- ONNX Runtime — Cross-platform model inference
- TensorRT — NVIDIA GPU optimization
- Weights & Biases — Experiment tracking
- MLflow — Open-source ML lifecycle management
- FiftyOne — Dataset quality analysis and visualization
Datasets¶
- COCO — 80-class detection/segmentation benchmark
- KITTI — Autonomous driving with depth and LiDAR
- Open Images — Google's large-scale detection dataset
- Roboflow Universe — Community shared robotics datasets
- NVIDIA Isaac Sim — Synthetic data generation for robotics