Feature Detection and Matching for Robotics¶
Feature detection and matching is one of the most fundamental building blocks in computer vision for robotics. Every time a robot looks at a scene and asks "have I been here before?" or "how much did the camera move?", it is relying on features — salient points in images that can be reliably detected and matched across frames. This tutorial covers the full pipeline: from classical corner detectors to modern deep-learned features, from brute-force matching to robust geometric verification.
Learning Objectives¶
- Understand what makes a good image feature and why it matters for robotics
- Implement detection and matching with classical algorithms (ORB, SIFT, etc.)
- Apply Lowe's ratio test and cross-check for reliable matches
- Estimate camera motion using homography and essential matrices
- Know when to use classical vs. learned features in real robotic systems
1. Overview¶
1.1 Why Feature Matching Matters¶
Raw images are high-dimensional data: a single 640x480 RGB image contains over 900,000 values (307,200 pixels x 3 channels). Directly comparing pixel values across frames is brittle: any change in lighting, viewpoint, or small camera rotation breaks pixel-level correspondence. Features solve this by identifying a sparse set of salient points that are:
- Repeatable — detected in multiple views of the same scene
- Discriminative — distinguishable from nearby points
- Invariant — stable under rotation, scale, and illumination changes
These sparse keypoints reduce the matching problem from comparing ~300K pixels to comparing a few hundred or thousand descriptors, making real-time operation feasible on embedded hardware.
1.2 The Feature Matching Pipeline¶
Image 1                         Image 2
   |                               |
[Detection]                  [Detection]        Find keypoints
   |                               |
[Description]                [Description]      Compute descriptors
    \                             /
     +------> [Matching] <-------+              Find correspondences
                   |
        [Geometric Verification]                RANSAC, pose estimation
                   |
         Relative Pose (R, t)
1.3 Role in Key Robotics Systems¶
Visual SLAM (Simultaneous Localization and Mapping): Feature matching is the backbone of visual SLAM systems like ORB-SLAM3 and VINS-Mono. The robot detects features in each frame, matches them to previous frames, estimates its own motion, and builds a sparse 3D map of landmark points. Loop closure detection — recognizing a previously visited location — relies on matching a bag of visual features against a database.
Visual Odometry (VO): Simpler than full SLAM, VO estimates the camera's trajectory by tracking features across consecutive frames. This is used in GPS-denied environments (underground, underwater, inside buildings). Stereo visual odometry uses matched features from two cameras to triangulate 3D positions, giving metric scale without IMU fusion.
Structure from Motion (SfM): SfM reconstructs a 3D scene from a collection of unordered images by matching features across many views. Tools like COLMAP use SIFT or SuperPoint features with incremental reconstruction to build dense 3D models.
Augmented Reality: AR systems track camera pose relative to a known marker or scene by matching features in the live camera feed to features on the target. This enables virtual objects to be rendered with correct perspective.
2. Classical Feature Detectors¶
2.1 Harris Corner Detector¶
The Harris corner detector (1988) computes the second-moment matrix (structure tensor) of local image gradients:

$$M = \sum_{(x,y) \in W} w(x,y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$$

The corner response is:

$$R = \det(M) - k \,\operatorname{trace}(M)^2$$

where k is a sensitivity parameter (typically 0.04-0.06). Corners are points where R is large and positive. Flat regions give R near zero; edges give negative R.
import cv2
import numpy as np
img = cv2.imread('scene.png', cv2.IMREAD_GRAYSCALE)
# Harris corner detection
harris_response = cv2.cornerHarris(img, blockSize=2, ksize=3, k=0.04)
# Threshold to get corner points
threshold = 0.01 * harris_response.max()
corner_mask = harris_response > threshold
corner_points = np.argwhere(corner_mask) # (row, col) pairs
print(f"Detected {len(corner_points)} Harris corners")
Strengths: Simple, well-understood, rotation-invariant. Weaknesses: Not scale-invariant, sensitive to parameter choices, produces thick clusters of corners instead of single points.
2.2 Shi-Tomasi Corner Detector¶
Shi and Tomasi (1994) improved Harris by using the minimum eigenvalue of M as the corner response:

$$R = \min(\lambda_1, \lambda_2)$$

A point is a corner if both eigenvalues are large, meaning there is significant intensity variation in all directions. This gives more stable corners.
# Shi-Tomasi (goodFeaturesToTrack)
corners = cv2.goodFeaturesToTrack(
    img, maxCorners=300, qualityLevel=0.01, minDistance=10
)
# corners is an Nx1x2 array of (x, y) coordinates, or None if nothing was found
img_display = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
if corners is not None:
    for corner in corners:
        x, y = corner.ravel()
        cv2.circle(img_display, (int(x), int(y)), 3, (0, 255, 0), -1)
Key advantage: The minDistance parameter prevents multiple detections in the same region, giving well-separated corners.
2.3 FAST Corner Detector¶
FAST (Features from Accelerated Segment Test, Rosten 2006) is designed for speed. It examines a Bresenham circle of 16 pixels around a candidate point. If N consecutive pixels on the circle are all brighter or all darker than the center pixel (by threshold t), it is a corner.
         16  1  2
      15         3
    14             4
    13      p      5
    12             6
      11         7
         10  9  8
If N contiguous pixels on the circle (N = 12 in the original test, N = 9 in the common FAST-9 variant) are all brighter than I(p) + t or all darker than I(p) - t, p is classified as a corner.
# FAST detector
fast = cv2.FastFeatureDetector_create(threshold=20, nonmaxSuppression=True)
keypoints = fast.detect(img, None)
print(f"FAST detected {len(keypoints)} keypoints")
FAST is extremely fast (suitable for real-time on embedded GPUs) but does not produce descriptors. It is typically combined with BRIEF or ORB descriptors.
2.4 SIFT (Scale-Invariant Feature Transform)¶
SIFT (Lowe, 2004) is the gold standard for invariant feature detection:
- Scale-space extrema: Build a Difference-of-Gaussians (DoG) pyramid and find local maxima/minima across scale and space.
- Keypoint refinement: Sub-pixel localization and elimination of low-contrast points and edge responses.
- Orientation assignment: Compute dominant gradient orientation in a local region; descriptor is made rotation-invariant.
- Descriptor: 128-dimensional vector computed from 4x4 gradient histograms, each with 8 orientation bins.
sift = cv2.SIFT_create(nfeatures=2000)
keypoints, descriptors = sift.detectAndCompute(img, None)
print(f"SIFT: {len(keypoints)} keypoints, descriptor shape: {descriptors.shape}")
# Output: SIFT: 1847 keypoints, descriptor shape: (1847, 128)
Strengths: Excellent invariance to scale, rotation, and partial illumination changes. Highly repeatable. 128-D descriptors are very discriminative. Weaknesses: Slow (~100ms per frame on CPU). Was patent-encumbered (patent expired 2020). Not ideal for real-time robotics.
2.5 SURF (Speeded-Up Robust Features)¶
SURF (Bay et al., 2008) accelerates SIFT using integral images for box-filter approximations of Gaussian derivatives:
- Uses box filters instead of Gaussian smoothing → integral image lookups
- Haar-wavelet descriptors (64-D, faster than SIFT's 128-D)
- Hessian-based detector instead of DoG
# SURF requires opencv-contrib-python
# surf = cv2.xfeatures2d.SURF_create(400)
# keypoints, descriptors = surf.detectAndCompute(img, None)
print("SURF is faster than SIFT but patent-encumbered")
Strengths: 3-5x faster than SIFT. Good invariance. Weaknesses: Still patent-encumbered (as of 2024). Slower than ORB. Largely superseded by ORB for robotics.
2.6 ORB (Oriented FAST and Rotated BRIEF)¶
ORB (Rublee et al., 2011) is the go-to detector for real-time robotics. It combines:
- Oriented FAST: FAST detector with Harris response for corner selection, plus intensity centroid method for rotation estimation.
- Rotated BRIEF: Binary descriptor computed from intensity comparisons in a patch, rotated according to the keypoint orientation.
orb = cv2.ORB_create(nfeatures=2000, scaleFactor=1.2, nlevels=8)
keypoints, descriptors = orb.detectAndCompute(img, None)
print(f"ORB: {len(keypoints)} keypoints, descriptor shape: {descriptors.shape}")
# Output: ORB: 1923 keypoints, descriptor shape: (1923, 32)
- 32-byte binary descriptors → fast Hamming distance matching
- Typically <10ms per frame on CPU
- Used in ORB-SLAM, the most widely deployed visual SLAM system
Strengths: Very fast. Free (no patents). Rotation-invariant. Binary descriptors enable fast matching with Hamming distance. Weaknesses: Less scale-invariant than SIFT. Less discriminative than SIFT for large viewpoint changes.
2.7 Detector Comparison¶
| Detector | Speed | Scale | Rotation | Descriptor | Dim | Real-time |
|---|---|---|---|---|---|---|
| Harris | Fast | ❌ | ✅ | N/A | N/A | ✅ |
| Shi-Tomasi | Fast | ❌ | ✅ | N/A | N/A | ✅ |
| FAST | V. Fast | ❌ | ❌ | BRIEF/ORB | 32B | ✅ |
| SIFT | Slow | ✅ | ✅ | SIFT | 128F | ❌ (CPU) |
| SURF | Medium | ✅ | ✅ | SURF | 64F | ⚠️ |
| ORB | V. Fast | ⚠️ | ✅ | rBRIEF | 32B | ✅ |
For robotics, ORB is the default choice unless you need scale invariance (SIFT/SuperPoint) or your application runs offline (SfM).
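The speed column above is easy to verify on your own hardware. A minimal timing sketch, reusing the scene.png image from the earlier examples:

import time
import cv2

img = cv2.imread('scene.png', cv2.IMREAD_GRAYSCALE)

detectors = {
    'ORB': cv2.ORB_create(nfeatures=2000),
    'SIFT': cv2.SIFT_create(nfeatures=2000),
}

for name, det in detectors.items():
    start = time.perf_counter()
    kp, desc = det.detectAndCompute(img, None)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{name}: {len(kp)} keypoints in {elapsed_ms:.1f} ms")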
3. Feature Matching Algorithms¶
Once descriptors are extracted from two images, we need to find correspondences. The quality of matching directly affects downstream tasks like pose estimation.
3.1 Brute-Force Matcher (BFMatcher)¶
The simplest approach: for each descriptor in image 1, compute distance to every descriptor in image 2, then take the nearest neighbor.
import cv2
# For binary descriptors (ORB, BRIEF) → use Hamming distance
bf_orb = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=False)
# For float descriptors (SIFT, SURF) → use L2 distance
bf_sift = cv2.BFMatcher(cv2.NORM_L2, crossCheck=False)
# Match descriptors
matches = bf_orb.match(desc1, desc2)
# Sort by distance (lower = better)
matches = sorted(matches, key=lambda m: m.distance)
# Display top 20 matches
for m in matches[:20]:
print(f"Match: queryIdx={m.queryIdx}, trainIdx={m.trainIdx}, "
f"distance={m.distance:.1f}")
Complexity: O(N*M) where N, M are the number of descriptors. Fine for hundreds of features, slow for thousands.
3.2 FLANN-Based Matcher¶
FLANN (Fast Library for Approximate Nearest Neighbors) uses KD-trees or hierarchical k-means for sub-linear search. Much faster for large descriptor sets.
# FLANN parameters for ORB (binary descriptors)
FLANN_INDEX_LSH = 6
flann_params = dict(
algorithm=FLANN_INDEX_LSH,
table_number=6,
key_size=12,
multi_probe_level=1
)
flann_orb = cv2.FlannBasedMatcher(flann_params, {})
# FLANN parameters for SIFT (float descriptors)
FLANN_INDEX_KDTREE = 1
flann_params_sift = dict(algorithm=FLANN_INDEX_KDTREE, trees=5)
flann_sift = cv2.FlannBasedMatcher(flann_params_sift, {})
matches = flann_orb.knnMatch(desc1, desc2, k=2)
print(f"FLANN found {len(matches)} candidate matches")
3.3 Lowe's Ratio Test¶
Simply taking the nearest neighbor is unreliable — many false matches occur when a descriptor is ambiguous. Lowe (2004) proposed the ratio test:
For each descriptor, find the two nearest neighbors (k=2). If the ratio of the best distance to the second-best distance is below a threshold (typically 0.7-0.8), the match is accepted.
ratio = distance_to_best_neighbor / distance_to_second_best_neighbor < threshold
Intuition: a true match has one clearly best match, while a false match tends to have similar distances to multiple descriptors.
# Using knnMatch with ratio test
bf = cv2.BFMatcher(cv2.NORM_HAMMING)
matches_knn = bf.knnMatch(desc1, desc2, k=2)
# Apply ratio test
ratio_threshold = 0.75
good_matches = []
for m, n in matches_knn:
if m.distance < ratio_threshold * n.distance:
good_matches.append(m)
print(f"Before ratio test: {len(matches_knn)} matches")
print(f"After ratio test: {len(good_matches)} matches")
# Typical output: Before: 2000, After: 800 (50-70% rejected)
Recommended thresholds:
- 0.7: strict, fewer but more reliable matches
- 0.75: a common balanced default
- 0.8: Lowe's original paper value; more matches but more outliers
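To see how the threshold trades off quantity against reliability on your own data, a small sweep is instructive (reusing desc1 and desc2 from the snippets above):

# Sweep the ratio threshold and count surviving matches
bf = cv2.BFMatcher(cv2.NORM_HAMMING)
matches_knn = bf.knnMatch(desc1, desc2, k=2)

for thresh in (0.6, 0.7, 0.75, 0.8, 0.9):
    kept = sum(1 for pair in matches_knn
               if len(pair) == 2 and pair[0].distance < thresh * pair[1].distance)
    print(f"ratio < {thresh}: {kept}/{len(matches_knn)} matches kept")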
3.4 Cross-Check Matching¶
Cross-check ensures bidirectional consistency: match from A→B AND B→A, then keep only pairs where both agree.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(desc1, desc2)
# matches is already sorted by distance
# Cross-check ensures: best match of desc1[i] is desc2[j]
# AND best match of desc2[j] is desc1[i]
print(f"Cross-checked matches: {len(matches)}")
Cross-check is a simpler alternative to the ratio test. OpenCV's crossCheck=True is designed for the plain match() call and does not combine directly with knnMatch, so using both filters together means implementing the mutual check manually, as sketched below.
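A minimal sketch of that manual combination (mutual nearest neighbors plus ratio test), assuming binary descriptors as with ORB:

def mutual_ratio_matches(desc1, desc2, norm=cv2.NORM_HAMMING, ratio=0.75):
    """Ratio test in both directions, keeping only mutually consistent matches."""
    bf = cv2.BFMatcher(norm)

    def ratio_filtered(d_query, d_train):
        out = {}
        for pair in bf.knnMatch(d_query, d_train, k=2):
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
                out[pair[0].queryIdx] = pair[0]
        return out

    fwd = ratio_filtered(desc1, desc2)   # image 1 -> image 2
    bwd = ratio_filtered(desc2, desc1)   # image 2 -> image 1

    # Keep a forward match only if the backward match points back to the same keypoint
    return [m for i, m in fwd.items()
            if m.trainIdx in bwd and bwd[m.trainIdx].trainIdx == i]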
3.5 Complete Matching Pipeline Example¶
import cv2
import numpy as np
def match_features(img1, img2, method='ORB', ratio_thresh=0.75):
"""Complete feature matching pipeline."""
# 1. Create detector
if method == 'ORB':
detector = cv2.ORB_create(nfeatures=3000)
norm = cv2.NORM_HAMMING
elif method == 'SIFT':
detector = cv2.SIFT_create(nfeatures=3000)
norm = cv2.NORM_L2
else:
raise ValueError(f"Unknown method: {method}")
# 2. Detect and describe
kp1, desc1 = detector.detectAndCompute(img1, None)
kp2, desc2 = detector.detectAndCompute(img2, None)
if desc1 is None or desc2 is None or len(kp1) < 2 or len(kp2) < 2:
return [], kp1, kp2
# 3. Match with ratio test
bf = cv2.BFMatcher(norm)
matches_knn = bf.knnMatch(desc1, desc2, k=2)
good = []
for pair in matches_knn:
if len(pair) == 2:
m, n = pair
if m.distance < ratio_thresh * n.distance:
good.append(m)
# 4. Sort by quality
good = sorted(good, key=lambda x: x.distance)
return good, kp1, kp2
# Usage
img1 = cv2.imread('frame_001.png', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('frame_002.png', cv2.IMREAD_GRAYSCALE)
matches, kp1, kp2 = match_features(img1, img2, method='ORB')
print(f"Final matches: {len(matches)}")
# Visualize
result = cv2.drawMatches(img1, kp1, img2, kp2, matches[:50], None,
flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS)
cv2.imwrite('matches_visualization.png', result)
4. Modern Learned Features¶
Since 2018, deep learning has produced feature detectors and matchers that rival or exceed classical methods, especially under large viewpoint and illumination changes.
4.1 SuperPoint (DeTone et al., 2018)¶
SuperPoint is a self-supervised feature detector and descriptor:
- Architecture: Encoder (VGG-style) + Interest Point Decoder + Descriptor Decoder
- Training: First trained on a synthetic shapes dataset, then adapted to real images via homographic adaptation (self-supervised labeling of interest points)
- Output: Keypoint locations + 256-D descriptors in a single forward pass
- Speed: ~30ms on GPU, ~100ms on CPU
import torch
import cv2
import numpy as np
# Using the SuperPoint implementation from lightglue or colmap
# pip install lightglue (or another standalone SuperPoint implementation)
def detect_superpoint(img_gray, max_keypoints=2000):
"""Detect SuperPoint keypoints (conceptual example)."""
# Normalize image
img = torch.from_numpy(img_gray).float() / 255.0
img = img.unsqueeze(0).unsqueeze(0) # (1, 1, H, W)
# Run SuperPoint
with torch.no_grad():
features = superpoint_model(img)
keypoints = features['keypoints'][0].numpy() # (N, 2) x,y
descriptors = features['descriptors'][0].numpy() # (N, 256)
scores = features['keypoint_scores'][0].numpy() # (N,)
# Keep top-k by score
top_k = np.argsort(scores)[-max_keypoints:]
return keypoints[top_k], descriptors[top_k], scores[top_k]
Why SuperPoint matters for robotics: It detects repeatable keypoints in texture-less regions where FAST/ORB fail. The learned descriptors are more robust to day-night and seasonal changes.
4.2 SuperGlue (Sarlin et al., 2020)¶
SuperGlue is a graph neural network that matches feature descriptors by reasoning about their spatial context:
- Constructs a graph where nodes are keypoints
- Uses attention to propagate information between keypoints
- Assigns matches via the Sinkhorn algorithm (differentiable)
- Outperforms ratio test + nearest neighbor by a large margin on challenging benchmarks
# Conceptual SuperGlue usage (pseudocode). The reference implementation lives in
# magicleap/SuperGluePretrainedNetwork; its exact API differs from this sketch.
extractor = SuperPoint(max_num_keypoints=2000)
matcher = SuperGlue(weights='outdoor')  # pretrained 'indoor' or 'outdoor' weights
# Extract features
feats1 = extractor.extract(img1_tensor)
feats2 = extractor.extract(img2_tensor)
# Match
matches = matcher({'kpts0': feats1['keypoints'],
'kpts1': feats2['keypoints'],
'descriptors0': feats1['descriptors'],
'descriptors1': feats2['descriptors']})
# matches['matches0'] contains indices: matches0[i] = j means kp1[i] ↔ kp2[j]
# matches['matching_scores0'] contains confidence
valid = matches['matches0'][0] != -1
matched_kp1 = feats1['keypoints'][0][valid]
matched_kp2 = feats2['keypoints'][0][matches['matches0'][0][valid]]
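The Sinkhorn step mentioned above can be illustrated in isolation: alternating row and column normalizations of a score matrix, done in log-space for numerical stability, yield a soft assignment between the two keypoint sets. A toy sketch (omitting SuperGlue's extra dustbin row/column for unmatched points):

import torch

def sinkhorn(scores, num_iters=20):
    """Toy Sinkhorn normalization: turn an (N, M) score matrix into a soft assignment."""
    log_p = scores
    for _ in range(num_iters):
        # Normalize rows, then columns, in log-space
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)
    return log_p

# Example: random descriptor similarities between 5 and 6 keypoints
scores = torch.randn(5, 6)
assignment = sinkhorn(scores).exp()
print(assignment.sum(dim=0))  # each column sums to 1; rows approximately, after enough iterations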
4.3 LightGlue (Lindenberger et al., 2023)¶
LightGlue is a lightweight and efficient version of SuperGlue:
- Adaptive depth: Self-attention layers are skipped when confident matches are found early, saving computation
- FlashAttention: Memory-efficient attention for faster inference
- 60% faster than SuperGlue with comparable or better accuracy
- Runs in real-time on GPU
from lightglue import LightGlue, SuperPoint
extractor = SuperPoint(max_num_keypoints=2048)
matcher = LightGlue(features='superpoint')
feats0 = extractor.extract(img0_tensor)
feats1 = extractor.extract(img1_tensor)
matches01 = matcher({'image0': feats0, 'image1': feats1})
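To read out the actual correspondences, the LightGlue README (at the time of writing) removes the batch dimension with a small helper and indexes the keypoints with the returned match indices; key names may differ between versions:

from lightglue.utils import rbd  # "remove batch dimension" helper

feats0, feats1, matches01 = [rbd(x) for x in [feats0, feats1, matches01]]
matches = matches01['matches']                   # (K, 2) index pairs
points0 = feats0['keypoints'][matches[..., 0]]   # matched keypoints in image 0
points1 = feats1['keypoints'][matches[..., 1]]   # matched keypoints in image 1
print(f"LightGlue: {len(matches)} matches")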
4.4 DISK (Tyszkiewicz et al., 2020)¶
DISK (Learning Local Features with Policy Gradient) trains detection and description end-to-end:
- Produces dense score and descriptor maps; keypoints are sampled from the score map with 128-D descriptors
- The non-differentiable keypoint selection is handled with a policy-gradient (reinforcement-learning) objective, so the whole pipeline is trained end-to-end
- Strong results on image-matching benchmarks such as the Image Matching Challenge and MegaDepth-style data
- Commonly paired with nearest-neighbor matching or LightGlue in SfM pipelines
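DISK also ships as an extractor in the lightglue package, so it can be swapped in for SuperPoint with the same matching code. A sketch reusing the image tensors from the previous section (assuming your lightglue version provides the DISK extractor and 'disk' weights):

from lightglue import DISK, LightGlue

extractor = DISK(max_num_keypoints=2048)
matcher = LightGlue(features='disk')   # matcher weights trained for DISK descriptors

feats0 = extractor.extract(img0_tensor)
feats1 = extractor.extract(img1_tensor)
matches01 = matcher({'image0': feats0, 'image1': feats1})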
4.5 Classical vs. Learned Features: When to Use Which?¶
| Criterion | Classical (ORB/SIFT) | Learned (SuperPoint/Glue) |
|---|---|---|
| Hardware | CPU, no GPU needed | GPU required |
| Latency | <10ms (ORB), 50-100ms (SIFT) | 30-100ms on GPU |
| Memory | Minimal | 100MB-1GB model weights |
| Viewpoint change | Limited | Large viewpoint robust |
| Illumination change | Moderate | Excellent |
| Texture-less regions | Fails | Better (but still limited) |
| Deployment simplicity | pip install opencv | Requires PyTorch, model mgmt |
| Loop closure (outdoor) | Works well | Better for long-term SLAM |
| Visual odometry (CPU) | ✅ Best choice | ⚠️ Too slow |
| SfM reconstruction | COLMAP + SIFT works great | DISK/SuperPoint gaining |
Rule of thumb: Use ORB for real-time visual odometry on embedded hardware. Use SuperPoint + LightGlue for challenging environments or offline processing.
5. Homography and Essential Matrix¶
Feature matches give us correspondences, but we need geometry to estimate camera motion. The fundamental tools are the homography and essential matrix.
5.1 Homography Estimation with RANSAC¶
A homography H maps points from one image plane to another when the scene is planar (or the camera undergoes pure rotation):

$$s \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad H \in \mathbb{R}^{3 \times 3}$$
RANSAC (Random Sample Consensus) robustly estimates H in the presence of outliers:
import cv2
import numpy as np
def estimate_homography_ransac(pts1, pts2, ransac_thresh=5.0):
"""
Estimate homography using RANSAC.
Args:
pts1: Nx2 array of points in image 1
pts2: Nx2 array of points in image 2
ransac_thresh: reprojection threshold in pixels
Returns:
H: 3x3 homography matrix, or None if failed
mask: Nx1 inlier mask
"""
if len(pts1) < 4:
return None, None
H, mask = cv2.findHomography(
pts1, pts2,
cv2.RANSAC,
ransac_thresh
)
    if H is None:
        return None, None
    inliers = int(mask.ravel().sum())
print(f"Homography: {inliers}/{len(pts1)} inliers "
f"({100*inliers/len(pts1):.1f}%)")
return H, mask
# Extract matched point coordinates
pts1 = np.float32([kp1[m.queryIdx].pt for m in good_matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good_matches])
H, mask = estimate_homography_ransac(pts1, pts2)
if H is not None:
print("Homography matrix:")
print(H)
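cv2.findHomography hides the RANSAC loop. A stripped-down sketch of what it does internally (sample a minimal set, fit, count inliers, keep the best hypothesis) makes the idea concrete; in practice, use the OpenCV call above:

def ransac_homography(pts1, pts2, n_iters=1000, thresh=5.0):
    """Minimal RANSAC sketch for illustration (degenerate samples are not handled)."""
    best_H, best_inliers = None, 0
    n = len(pts1)
    for _ in range(n_iters):
        # 1. Sample a minimal set: 4 correspondences define a homography
        idx = np.random.choice(n, 4, replace=False)
        H = cv2.getPerspectiveTransform(pts1[idx], pts2[idx])
        # 2. Score the hypothesis: reproject all points and count inliers
        proj = cv2.perspectiveTransform(pts1.reshape(-1, 1, 2), H).reshape(-1, 2)
        errors = np.linalg.norm(proj - pts2, axis=1)
        inliers = int((errors < thresh).sum())
        # 3. Keep the best hypothesis seen so far
        if inliers > best_inliers:
            best_H, best_inliers = H, inliers
    return best_H, best_inliers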
5.2 Essential Matrix for Calibrated Cameras¶
When the camera is calibrated (intrinsics K are known), the essential matrix E encodes the relative pose between two views:

$$E = [t]_\times R = K^\top F K$$

where F is the fundamental matrix and $[t]_\times$ is the skew-symmetric matrix of the translation. The essential matrix satisfies the epipolar constraint

$$\hat{x}'^\top E \,\hat{x} = 0$$

for corresponding normalized coordinates $\hat{x}$, $\hat{x}'$.
def estimate_essential_matrix(pts1, pts2, K, ransac_thresh=1.0):
"""
Estimate essential matrix from calibrated camera correspondences.
Args:
pts1, pts2: Nx2 pixel coordinates
K: 3x3 camera intrinsic matrix
ransac_thresh: threshold in pixels
Returns:
E: 3x3 essential matrix
mask: inlier mask
"""
E, mask = cv2.findEssentialMat(
pts1, pts2, K,
method=cv2.RANSAC,
threshold=ransac_thresh,
prob=0.999
)
inliers = mask.ravel().sum()
print(f"Essential matrix: {inliers}/{len(pts1)} inliers")
return E, mask
# Example camera intrinsics (typical webcam)
K = np.array([
[525.0, 0.0, 319.5],
[ 0.0, 525.0, 239.5],
[ 0.0, 0.0, 1.0]
])
pts1 = np.float32([kp1[m.queryIdx].pt for m in good_matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good_matches])
E, mask = estimate_essential_matrix(pts1, pts2, K)
5.3 Recovering Relative Pose (R, t)¶
From the essential matrix, we can recover the rotation R and translation direction t (up to scale):
def recover_pose_from_essential(pts1, pts2, K, E):
"""
Recover relative rotation and translation from essential matrix.
Returns:
R: 3x3 rotation matrix
t: 3x1 translation vector (unit length, scale unknown)
mask: inlier mask
"""
_, R, t, mask = cv2.recoverPose(E, pts1, pts2, K)
print(f"Rotation:\n{R}")
print(f"Translation direction: {t.ravel()}")
print(f"Inliers: {mask.ravel().sum()}/{len(pts1)}")
return R, t, mask
R, t, mask = recover_pose_from_essential(pts1, pts2, K, E)
Important note: The translation t is only recoverable up to scale. To get metric scale, you need one of:
- Stereo camera (known baseline)
- IMU fusion (VINS-Mono approach)
- Known object size in the scene
- Ground-truth scale from other sensors
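As a concrete illustration: if an external source (wheel odometry, a known stereo baseline, IMU preintegration) reports how far the camera actually moved between the two frames, the unit-norm translation can simply be rescaled. A sketch, where distance_traveled_m is a hypothetical external measurement:

# t from recoverPose is a unit vector: only the direction is known.
# If another sensor reports the distance traveled between the two frames,
# the metric translation is the direction scaled by that distance.
distance_traveled_m = 0.12           # hypothetical value from wheel odometry
t_metric = t * distance_traveled_m   # 3x1 metric translation
print(f"Metric translation: {t_metric.ravel()} (meters)")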
5.4 Complete Pose Estimation Example¶
import cv2
import numpy as np
def estimate_camera_pose(img1, img2, K, method='ORB'):
"""
Full pipeline: detect → match → RANSAC → pose estimation.
Args:
img1, img2: grayscale images
K: 3x3 camera intrinsic matrix
method: 'ORB' or 'SIFT'
Returns:
R, t: relative pose
num_inliers: number of inliers
"""
# 1. Detect and match
if method == 'ORB':
detector = cv2.ORB_create(nfeatures=2000)
norm = cv2.NORM_HAMMING
else:
detector = cv2.SIFT_create(nfeatures=2000)
norm = cv2.NORM_L2
kp1, desc1 = detector.detectAndCompute(img1, None)
kp2, desc2 = detector.detectAndCompute(img2, None)
bf = cv2.BFMatcher(norm)
matches = bf.knnMatch(desc1, desc2, k=2)
# Ratio test
good = [m for m, n in matches
if m.distance < 0.75 * n.distance]
if len(good) < 8:
print("Not enough matches for pose estimation")
return None, None, 0
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
# 2. Estimate essential matrix
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
if E is None:
return None, None, 0
# 3. Recover pose
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
num_inliers = mask.ravel().sum()
print(f"Matches: {len(good)}, Inliers: {num_inliers}")
print(f"R:\n{R}")
print(f"t: {t.ravel()}")
return R, t, num_inliers
# Example usage with a stereo pair or consecutive frames
img1 = cv2.imread('frame_001.png', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('frame_002.png', cv2.IMREAD_GRAYSCALE)
# Example intrinsics for a 640x480 camera
K = np.array([
[525.0, 0.0, 319.5],
[ 0.0, 525.0, 239.5],
[ 0.0, 0.0, 1.0]
])
R, t, inliers = estimate_camera_pose(img1, img2, K)
6. Applications in Robotics¶
6.1 Visual Odometry Pipeline¶
Visual odometry tracks the camera's motion by matching features across consecutive frames and estimating the pose change:
class VisualOdometry:
"""Simple monocular visual odometry using ORB features."""
def __init__(self, K):
self.K = K
self.detector = cv2.ORB_create(nfeatures=2000)
self.prev_frame = None
self.prev_kp = None
self.prev_desc = None
self.pose = np.eye(4) # 4x4 transformation matrix
def process_frame(self, frame):
"""Process a new frame and return estimated pose."""
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) \
if len(frame.shape) == 3 else frame
kp, desc = self.detector.detectAndCompute(gray, None)
if self.prev_desc is None or desc is None:
self.prev_frame = gray
self.prev_kp = kp
self.prev_desc = desc
return self.pose
# Match
bf = cv2.BFMatcher(cv2.NORM_HAMMING)
matches = bf.knnMatch(self.prev_desc, desc, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
if len(good) < 10:
print("Too few matches, skipping frame")
self.prev_frame = gray
self.prev_kp = kp
self.prev_desc = desc
return self.pose
pts1 = np.float32([self.prev_kp[m.queryIdx].pt for m in good])
pts2 = np.float32([kp[m.trainIdx].pt for m in good])
# Estimate relative pose
        E, mask = cv2.findEssentialMat(pts1, pts2, self.K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
if E is None:
return self.pose
        _, R, t, _ = cv2.recoverPose(E, pts1, pts2, self.K, mask=mask)
# Accumulate pose
T = np.eye(4)
T[:3, :3] = R
T[:3, 3] = t.ravel()
self.pose = self.pose @ np.linalg.inv(T) # or T depending on convention
# Update
self.prev_frame = gray
self.prev_kp = kp
self.prev_desc = desc
return self.pose
# Usage
vo = VisualOdometry(K=np.array([[525,0,319.5],[0,525,239.5],[0,0,1]]))
import glob
for frame_path in sorted(glob.glob('frames/*.png')):
frame = cv2.imread(frame_path)
pose = vo.process_frame(frame)
position = pose[:3, 3]
print(f"Frame {frame_path}: position = {position}")
6.2 Loop Closure Detection¶
Loop closure detects when a robot revisits a previously explored area. This corrects accumulated drift in visual odometry. The approach uses a bag-of-words (BoW) representation built from visual features:
Detect ORB features → Quantize into visual words → Build bag-of-words vector
→ Compare with database → If similarity > threshold → Loop candidate
→ Geometric verification with feature matching → Confirm loop
# Conceptual loop closure detection
from collections import defaultdict
import numpy as np
class SimpleBoWLoopDetector:
"""Simplified bag-of-words loop closure detector."""
def __init__(self, vocabulary_size=1000, similarity_threshold=0.7):
self.vocabulary = None # k-means centroids (loaded from file)
self.vocabulary_size = vocabulary_size
self.threshold = similarity_threshold
self.database = [] # list of BoW vectors
def compute_bow(self, descriptors):
"""Convert descriptors to bag-of-words histogram."""
if self.vocabulary is None or descriptors is None:
return None
        # Quantize each descriptor to its nearest visual word.
        # Cast to float first: ORB descriptors are uint8 and would wrap around on subtraction.
        desc = descriptors.astype(np.float32)
        distances = np.linalg.norm(
            self.vocabulary[:, np.newaxis, :] - desc[np.newaxis, :, :],
            axis=2
        )
word_ids = np.argmin(distances, axis=0)
# Build histogram
hist = np.zeros(self.vocabulary_size)
for w in word_ids:
hist[w] += 1
# L1 normalize
if hist.sum() > 0:
hist /= hist.sum()
return hist
def query(self, bow_vector):
"""Check if current frame matches any in database."""
if len(self.database) < 5: # Wait for enough frames
self.database.append(bow_vector)
return None, 0.0
best_idx = -1
best_score = 0.0
# Compare with previous frames (skip recent ones to avoid matching adjacent frames)
for i, db_bow in enumerate(self.database[:-5]):
            # Similarity score: dot product of the L1-normalized histograms
score = np.dot(bow_vector, db_bow)
if score > best_score:
best_score = score
best_idx = i
if best_score > self.threshold:
self.database.append(bow_vector)
return best_idx, best_score
self.database.append(bow_vector)
return None, best_score
In production systems (ORB-SLAM, DBoW2), the vocabulary is pre-trained on thousands of images and stored as a KD-tree for fast lookup.
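The vocabulary itself is just a set of cluster centers over a large collection of training descriptors. A minimal offline sketch using OpenCV's k-means (real systems build hierarchical vocabularies as in DBoW2; descriptors are cast to float for clustering):

def build_vocabulary(all_descriptors, vocabulary_size=1000):
    """Cluster a large descriptor collection into visual words (offline step)."""
    data = np.vstack(all_descriptors).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 50, 1e-3)
    # cv2.kmeans(data, K, bestLabels, criteria, attempts, flags)
    _, _, centers = cv2.kmeans(data, vocabulary_size, None, criteria, 3,
                               cv2.KMEANS_PP_CENTERS)
    return centers  # (vocabulary_size, descriptor_dim) array of visual words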
6.3 Augmented Reality Marker Tracking¶
AR systems track a known planar marker (e.g., ArUco, AprilTag) by:
1. Detecting the marker boundary in the image
2. Matching features on the marker surface to a reference template
3. Estimating the homography or pose
4. Rendering virtual content aligned with the marker
import cv2
# ArUco marker detection (built into OpenCV)
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_6X6_250)
parameters = cv2.aruco.DetectorParameters()
def detect_aruco_marker(frame):
"""Detect ArUco markers and return pose estimates."""
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Note: OpenCV >= 4.7 prefers the object-based API:
    #   detector = cv2.aruco.ArucoDetector(dictionary, parameters)
    #   corners, ids, rejected = detector.detectMarkers(gray)
    corners, ids, rejected = cv2.aruco.detectMarkers(gray, dictionary, parameters=parameters)
    if ids is not None:
        # Estimate pose for each detected marker.
        # marker_length in meters (e.g., 0.05 for a 5 cm marker).
        # K and dist_coeffs are the calibrated intrinsics and distortion coefficients
        # (assumed defined). estimatePoseSingleMarkers is deprecated in recent OpenCV
        # releases; newer code can use cv2.solvePnP with the marker corners instead.
        rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
            corners, 0.05, K, dist_coeffs
        )
for i, marker_id in enumerate(ids):
print(f"Marker {marker_id[0]}: "
f"tvec={tvecs[i].ravel()}, "
f"rvec={rvecs[i].ravel()}")
# Draw axes on frame
cv2.drawFrameAxes(frame, K, dist_coeffs, rvecs[i], tvecs[i], 0.03)
cv2.aruco.drawDetectedMarkers(frame, corners, ids)
return corners, ids
6.4 Object Recognition by Features¶
Features can be used for instance-level object recognition by matching a database of known object features against the current view:
class FeatureBasedRecognizer:
"""Recognize objects by matching feature descriptors."""
def __init__(self):
self.detector = cv2.SIFT_create(nfeatures=1000)
self.matcher = cv2.BFMatcher(cv2.NORM_L2)
self.object_database = {} # name -> list of (keypoints, descriptors)
def register_object(self, name, images):
"""Register an object from multiple viewpoints."""
all_kp, all_desc = [], []
for img in images:
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) \
if len(img.shape) == 3 else img
kp, desc = self.detector.detectAndCompute(gray, None)
if desc is not None:
all_kp.extend(kp)
all_desc.extend(desc)
self.object_database[name] = (all_kp, np.array(all_desc))
print(f"Registered '{name}' with {len(all_desc)} features")
def recognize(self, scene_img, min_matches=15):
"""Recognize objects in a scene image."""
gray = cv2.cvtColor(scene_img, cv2.COLOR_BGR2GRAY) \
if len(scene_img.shape) == 3 else scene_img
        scene_kp, scene_desc = self.detector.detectAndCompute(gray, None)
        if scene_desc is None:
            return []
        results = []
for name, (obj_kp, obj_desc) in self.object_database.items():
matches = self.matcher.knnMatch(obj_desc, scene_desc, k=2)
# Ratio test
good = [m for m, n in matches if m.distance < 0.7 * n.distance]
if len(good) >= min_matches:
results.append((name, len(good), good))
print(f"Recognized '{name}' with {len(good)} matches")
return sorted(results, key=lambda x: x[1], reverse=True)
7. Performance Optimization¶
Real-time robotics demands careful optimization. A 30 FPS camera means you have ~33ms per frame for detection + matching + pose estimation.
7.1 GPU-Accelerated Matching¶
Modern OpenCV has CUDA modules for GPU-accelerated feature detection and matching:
import cv2
# GPU-accelerated ORB (requires an OpenCV build with CUDA support; the factory
# name can vary between builds, e.g. cv2.cuda_ORB.create or cv2.cuda.ORB.create)
gpu_orb = cv2.cuda_ORB.create(nfeatures=3000)
def gpu_feature_matching(img1_gpu, img2_gpu):
    """GPU-accelerated ORB detection and brute-force Hamming matching."""
    # Detect and describe on the GPU; keypoints and descriptors stay in GPU memory.
    # Use gpu_orb.convert(kp1_gpu) if cv2.KeyPoint objects are needed on the CPU.
    kp1_gpu, desc1 = gpu_orb.detectAndComputeAsync(img1_gpu, None)
    kp2_gpu, desc2 = gpu_orb.detectAndComputeAsync(img2_gpu, None)
    if desc1.empty() or desc2.empty():
        return []
    # GPU brute-force matcher with Hamming distance for binary descriptors
    matcher = cv2.cuda.DescriptorMatcher_createBFMatcher(cv2.NORM_HAMMING)
    matches = matcher.match(desc1, desc2)
    return matches
# Upload images to GPU
img1 = cv2.imread('frame1.png', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('frame2.png', cv2.IMREAD_GRAYSCALE)
img1_gpu = cv2.cuda_GpuMat()
img2_gpu = cv2.cuda_GpuMat()
img1_gpu.upload(img1)
img2_gpu.upload(img2)
matches = gpu_feature_matching(img1_gpu, img2_gpu)
For learned features, PyTorch and TensorRT can accelerate inference:
import torch
# SuperPoint on GPU (superpoint_model: a loaded SuperPoint network, as assumed in Section 4.1)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = superpoint_model.to(device)
def gpu_superpoint_detect(img_gray):
"""Detect features with SuperPoint on GPU."""
tensor = torch.from_numpy(img_gray).float().to(device) / 255.0
tensor = tensor.unsqueeze(0).unsqueeze(0)
with torch.no_grad(), torch.cuda.amp.autocast():
features = model(tensor)
return {k: v.cpu() for k, v in features.items()}
7.2 Region of Interest (ROI) Matching¶
Don't match features across the entire image when you only care about part of the scene:
def roi_matching(img1, img2, roi, detector_type='ORB'):
"""
Match features only within a region of interest.
Args:
roi: (x, y, w, h) rectangle defining the ROI
"""
x, y, w, h = roi
# Create ROI mask
mask = np.zeros(img1.shape[:2], dtype=np.uint8)
mask[y:y+h, x:x+w] = 255
detector = cv2.ORB_create(nfeatures=1000)
# Detect with mask (only search within ROI in image 1)
    kp1 = detector.detect(img1, mask)
    kp1, desc1 = detector.compute(img1, kp1)  # compute() returns (keypoints, descriptors)
# Detect in full image 2 (target might be anywhere)
kp2, desc2 = detector.detectAndCompute(img2, None)
if desc1 is None or desc2 is None:
return []
bf = cv2.BFMatcher(cv2.NORM_HAMMING)
matches = bf.knnMatch(desc1, desc2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
return good
7.3 Multi-Scale Matching¶
For large viewpoint changes, build an image pyramid and match at multiple scales:
def multiscale_detection(img, detector, n_levels=4, scale_factor=1.2):
"""
Detect features at multiple scales and merge results.
"""
all_keypoints = []
all_descriptors = []
pyramid = [img]
for i in range(1, n_levels):
scaled = cv2.resize(pyramid[-1], None, fx=1/scale_factor, fy=1/scale_factor)
pyramid.append(scaled)
for level, scaled_img in enumerate(pyramid):
kp, desc = detector.detectAndCompute(scaled_img, None)
if desc is None:
continue
# Scale keypoint coordinates back to original image space
scale = scale_factor ** level
for k in kp:
k.pt = (k.pt[0] * scale, k.pt[1] * scale)
k.size *= scale
all_keypoints.extend(kp)
all_descriptors.append(desc)
if all_descriptors:
all_descriptors = np.vstack(all_descriptors)
return all_keypoints, all_descriptors
# Usage
detector = cv2.ORB_create(nfeatures=500)
kp, desc = multiscale_detection(img, detector)
print(f"Multi-scale: {len(kp)} keypoints across pyramid levels")
7.4 Additional Optimization Tips¶
Limit feature count: Use the nfeatures parameter to cap detection at 1000-2000 features. More features don't always help and slow down matching.
Use grid-based detection: Divide the image into a grid and detect features in each cell to ensure spatial distribution:
def grid_based_detection(img, detector, grid_size=(4, 4), features_per_cell=50):
"""Ensure features are spatially distributed across the image."""
h, w = img.shape[:2]
cell_h, cell_w = h // grid_size[0], w // grid_size[1]
all_kp, all_desc = [], []
for row in range(grid_size[0]):
for col in range(grid_size[1]):
y1, y2 = row * cell_h, (row + 1) * cell_h
x1, x2 = col * cell_w, (col + 1) * cell_w
cell_img = img[y1:y2, x1:x2]
# Temporary detector with limited features per cell
temp_detector = cv2.ORB_create(nfeatures=features_per_cell)
kp, desc = temp_detector.detectAndCompute(cell_img, None)
if desc is not None:
# Offset keypoint coordinates to full image space
for k in kp:
k.pt = (k.pt[0] + x1, k.pt[1] + y1)
all_kp.extend(kp)
all_desc.append(desc)
all_desc = np.vstack(all_desc) if all_desc else None
return all_kp, all_desc
Parallel processing: Use threading or multiprocessing for multi-camera robots. Each camera's feature pipeline can run independently.
Skip frames: In high-frequency pipelines (60+ Hz), process features every 2nd or 3rd frame and interpolate pose in between.
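A sketch of that interpolation step, using SciPy's Slerp for the rotation and linear interpolation for the translation (assumes SciPy is available; poses are 4x4 matrices as in the VisualOdometry class above):

import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_pose(T0, T1, alpha):
    """Interpolate between two 4x4 poses: SLERP on rotation, lerp on translation."""
    rots = Rotation.from_matrix(np.stack([T0[:3, :3], T1[:3, :3]]))
    slerp = Slerp([0.0, 1.0], rots)
    T = np.eye(4)
    T[:3, :3] = slerp([alpha]).as_matrix()[0]
    T[:3, 3] = (1 - alpha) * T0[:3, 3] + alpha * T1[:3, 3]
    return T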
8. References¶
Foundational Papers¶
- Harris & Stephens, 1988 — A Combined Corner and Edge Detector
- Shi & Tomasi, 1994 — Good Features to Track
- Rosten & Drummond, 2006 — Machine Learning for High-Speed Corner Detection (FAST)
- Lowe, 2004 — Distinctive Image Features from Scale-Invariant Keypoints (SIFT)
- Bay et al., 2008 — SURF: Speeded Up Robust Features
- Rublee et al., 2011 — ORB: An Efficient Alternative to SIFT or SURF
Modern Learned Features¶
- DeTone et al., 2018 — SuperPoint: Self-Supervised Interest Point Detection and Description
- Sarlin et al., 2020 — SuperGlue: Learning Feature Matching with Graph Neural Networks
- Lindenberger et al., 2023 — LightGlue: Local Feature Matching at Light Speed
- Tyszkiewicz et al., 2020 — DISK: Learning Local Features with Policy Gradient
Visual SLAM and Odometry¶
- Mur-Artal & Tardós, 2017 — ORB-SLAM2: An Open-Source SLAM System
- Campos et al., 2021 — ORB-SLAM3: An Accurate Open-Source Library for Visual SLAM
- Qin et al., 2018 — VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator
- Schönberger & Frahm, 2016 — Structure-from-Motion Revisited (COLMAP)
Libraries and Tools¶
- OpenCV Feature Detection — Official OpenCV tutorials on feature detection and matching
- LightGlue GitHub — SuperPoint + LightGlue implementation
- COLMAP — Structure-from-Motion and Multi-View Stereo pipeline
- DBoW2 — Bag-of-Words library for loop closure
Tutorials and Courses¶
- Visual SLAM from A to Z — YouTube series on visual SLAM fundamentals
- Cyrill Stachniss — Visual SLAM — University course on SLAM
- Multiple View Geometry in Computer Vision — Hartley & Zisserman, the definitive reference on geometric vision