# Robotic Arm Grasping
Project Type: Manipulation | Difficulty: ★★☆☆☆ to ★★★★★ (tier-dependent) | Estimated Time: 2–4 weekends
## 1. Project Overview
Grasping is a fundamental manipulation skill: the robot must position its end-effector (gripper) at a specific 6-DoF pose, close the fingers, and lift the object without dropping or colliding with the environment.
This project covers three algorithm tiers:
| Tier | Approach | Input | Key Algorithm |
|---|---|---|---|
| Tier 1 — Traditional | Analytical + Numerical IK | Known target pose | Geometric IK, Jacobian pseudo-inverse, DLS |
| Tier 2 — Intermediate | GraspNet-style DL | RGB-D / Point Cloud | 6-DoF grasp candidate sampling + CNN quality scoring |
| Tier 3 — Modern | RL-based Grasping | Raw sensor | PPO/SAC in Isaac Gym or PyBullet |

```
┌──────────────────────────────────────────────────────────────┐
│ Robotic Arm Grasping Pipeline │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Perceive │──▶│ Grasp │──▶│ Execute │ │
│ │ (Camera / │ │ Planning │ │ (Move to │ │
│ │ Depth) │ │ (IK / DL / │ │ pose, close │ │
│ │ │ │ RL) │ │ gripper) │ │
│ └──────────────┘ └──────────────┘ └───────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ Point Cloud Grasp Pose Success? │
│ / RGB-D (x,y,z,r,p,y) Feedback → │
│ + gripper width restart loop │
└──────────────────────────────────────────────────────────────┘
```
## 2. Hardware & Software Requirements
### Hardware
| Component | Specification | Notes |
|---|---|---|
| Robotic Arm | UR5e / Panda 7-DoF | 6-DoF minimum; URDF required |
| Gripper | Parallel-jaw (2-finger) | e.g., Robotiq 2F-85, or custom |
| Camera | Intel RealSense D435 / Azure Kinect | RGB-D required for Tier 2+ |
| Computer | Ubuntu 20.04, GPU ≥ 6 GB VRAM | For DL/RL training |
| Object Set | YCB / GraspNet-1B objects | Various shapes and sizes |
| (Optional) | ATI Force/Torque sensor | For grasp quality evaluation |
### Software
| Package | Version | Purpose |
|---|---|---|
| Python | ≥ 3.8 | Core language |
| NumPy / SciPy | ≥ 1.20 | Numerical IK |
| MoveIt | ROS Noetic | Motion planning + IK |
| PyBullet | ≥ 3.0 | Physics simulation (Tier 3) |
| PyTorch | ≥ 1.13 | Deep learning (Tier 2) |
| Stable-Baselines3 | ≥ 1.7 | RL algorithms |
| Gymnasium | ≥ 0.28 | RL environment interface |
| open3d | ≥ 0.16 | Point cloud processing |
## 3. Tier 1 — Traditional: Inverse Kinematics
### 3.1 Forward Kinematics Review

Given joint angles \(\mathbf{q} = [q_1, q_2, \ldots, q_n]^T\), forward kinematics (FK) computes the end-effector pose:

\[
T(\mathbf{q}) = \begin{bmatrix} \mathbf{R}(\mathbf{q}) & \mathbf{p}(\mathbf{q}) \\ \mathbf{0}^T & 1 \end{bmatrix} \in SE(3)
\]

where \(\mathbf{R} \in SO(3)\) is the rotation matrix and \(\mathbf{p} \in \mathbb{R}^3\) is the position.
Denavit-Hartenberg (DH) parameters for a 6-DoF manipulator:
| Joint | \(a_i\) (mm) | \(\alpha_i\) (rad) | \(d_i\) (mm) | \(\theta_i\) |
|---|---|---|---|---|
| 1 | 0 | \(-\pi/2\) | \(d_1\) | \(q_1\) |
| 2 | \(a_2\) | 0 | 0 | \(q_2 - \pi/2\) |
| 3 | \(a_3\) | 0 | 0 | \(q_3\) |
| 4 | 0 | \(-\pi/2\) | \(d_4\) | \(q_4\) |
| 5 | 0 | \(\pi/2\) | 0 | \(q_5\) |
| 6 | 0 | 0 | \(d_6\) | \(q_6\) |
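Each row of the table maps to one standard-convention link transform \(A_i\); composing all six gives the FK chain implemented by the code in Section 3.4:

\[
A_i =
\begin{bmatrix}
\cos\theta_i & -\sin\theta_i \cos\alpha_i & \sin\theta_i \sin\alpha_i & a_i \cos\theta_i \\
\sin\theta_i & \cos\theta_i \cos\alpha_i & -\cos\theta_i \sin\alpha_i & a_i \sin\theta_i \\
0 & \sin\alpha_i & \cos\alpha_i & d_i \\
0 & 0 & 0 & 1
\end{bmatrix},
\qquad
T_0^6(\mathbf{q}) = A_1 A_2 \cdots A_6
\]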
### 3.2 Analytical IK — Geometric Approach
For a 6-DoF articulated arm (e.g., UR5), we decompose the problem:
- Wrist center \(\mathbf{p}_c = \mathbf{p}_e - d_6 \cdot \mathbf{n}\) (tool offset)
- Shoulder — solve for \(q_1\) (azimuth) and \(q_2, q_3\) (elbow configuration)
- Wrist orientation — solve for \(q_4, q_5, q_6\) from remaining rotation
2-Link planar IK (shoulder + elbow):

Given wrist center \(\mathbf{p}_c = (x, y, z)\), define the radial reach \(r = \sqrt{x^2 + y^2}\) and the height above the shoulder \(s = z - d_1\).

Law of cosines for elbow angle \(\theta_2\):

\[
\cos\theta_2 = \frac{r^2 + s^2 - a_2^2 - a_3^2}{2\, a_2 a_3},
\qquad
\theta_2 = \pm \arccos(\cos\theta_2)
\]

(Use − for "elbow-up" configuration, + for "elbow-down")

Shoulder angle \(\theta_1\):

\[
\theta_1 = \operatorname{atan2}(s, r) - \operatorname{atan2}\big(a_3 \sin\theta_2,\; a_2 + a_3 \cos\theta_2\big)
\]
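A minimal sketch of these formulas, assuming the simplified geometry of the DH table above (the default link values are the UR5-style numbers used in Section 3.4; the helper name is ours):

```python
import numpy as np

def planar_2link_ik(x, y, z, d1=0.089, a2=0.425, a3=0.392, elbow_up=True):
    """Solve the base azimuth q1 and the shoulder/elbow angles for a wrist centre."""
    q1 = np.arctan2(y, x)          # base azimuth
    r = np.hypot(x, y)             # radial reach in the base plane
    s = z - d1                     # height above the shoulder
    c2 = (r**2 + s**2 - a2**2 - a3**2) / (2 * a2 * a3)
    if abs(c2) > 1.0:
        raise ValueError("wrist centre out of reach")
    theta2 = -np.arccos(c2) if elbow_up else np.arccos(c2)
    theta1 = np.arctan2(s, r) - np.arctan2(a3 * np.sin(theta2),
                                           a2 + a3 * np.cos(theta2))
    return q1, theta1, theta2
```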
### 3.3 Numerical IK — Jacobian Methods

When analytical solutions are unavailable, numerical IK iteratively refines \(\mathbf{q}\):

\[
\mathbf{q}_{k+1} = \mathbf{q}_k + \Delta\mathbf{q}
\]

where \(\Delta\mathbf{q}\) is computed from the 6-D pose error \(\Delta\mathbf{x}\) and the geometric Jacobian \(\mathbf{J}\).

Jacobian pseudo-inverse:

\[
\Delta\mathbf{q} = \mathbf{J}^{+} \Delta\mathbf{x} = \mathbf{J}^T \big(\mathbf{J}\mathbf{J}^T\big)^{-1} \Delta\mathbf{x}
\]

Damped Least Squares (DLS) — the standard approach that avoids singularities:

\[
\Delta\mathbf{q} = \mathbf{J}^T \big(\mathbf{J}\mathbf{J}^T + \lambda^2 \mathbf{I}\big)^{-1} \Delta\mathbf{x}
\]

where \(\lambda\) is the damping factor (typical: \(\lambda \in [0.01, 0.5]\)).

Singular Value Decomposition (SVD) form:

\[
\mathbf{J}^{+}_{\lambda} = \sum_{i} \frac{\sigma_i}{\sigma_i^2 + \lambda^2}\, \mathbf{v}_i \mathbf{u}_i^T
\]
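For reference, the SVD form written out with NumPy — a minimal sketch, with `J` and `x_err` as in the Section 3.4 code:

```python
import numpy as np

def dls_step_svd(J, x_err, lam=0.1):
    """One DLS update step computed via the damped singular values."""
    U, S, Vt = np.linalg.svd(J, full_matrices=False)
    S_damped = S / (S**2 + lam**2)      # sigma_i / (sigma_i^2 + lambda^2)
    return Vt.T @ (S_damped * (U.T @ x_err))
```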
### 3.4 Complete Python Code — Numerical IK

```python
"""
Tier 1: Numerical IK — Damped Least Squares (DLS)
=================================================
Applies to any serial manipulator with known DH parameters or FK function.
"""
import numpy as np
def skew(v: np.ndarray) -> np.ndarray:
"""Skew-symmetric matrix from 3-vector."""
return np.array([
[0, -v[2], v[1]],
[v[2], 0, -v[0]],
[-v[1], v[0], 0]
])
def rot_x(angle: float) -> np.ndarray:
"""Rotation about X-axis by `angle` radians."""
c, s = np.cos(angle), np.sin(angle)
return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
def rot_z(angle: float) -> np.ndarray:
"""Rotation about Z-axis by `angle` radians."""
c, s = np.cos(angle), np.sin(angle)
return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
def joint_to_transform(q: np.ndarray) -> np.ndarray:
"""
    FK for a 6R serial arm (simplified UR5-like geometry).
Returns 4x4 homogeneous transform of end-effector.
Parameters
----------
q : array of shape (6,)
Joint angles in radians [q1..q6]
Returns
-------
T : (4, 4) homogeneous transform
"""
# Link lengths (UR5-style, in metres)
L1, L2, L3, L4, L5, L6 = 0.089, 0.425, 0.392, 0.109, 0.093, 0.082
    # Compose individual DH transforms along the chain
d1 = L1
a2, a3 = L2, L3
    # Simplified DH frames (standard convention)
    # Joint i: rotate θ_i about Z, translate d_i along Z,
    # translate a_i along X, rotate α_i about X
    # (uses the module-level dh_transform helper defined below)
# DH parameters (simplified UR5)
# Joint 1: shoulder yaw
T01 = dh_transform(0, -np.pi/2, d1, q[0])
# Joint 2: shoulder pitch
T12 = dh_transform(a2, 0, 0, q[1] - np.pi/2)
# Joint 3: elbow pitch
T23 = dh_transform(a3, 0, 0, q[2])
# Joint 4: wrist 1 roll
T34 = dh_transform(0, -np.pi/2, L4, q[3])
# Joint 5: wrist 2 pitch
T45 = dh_transform(0, np.pi/2, 0, q[4])
# Joint 6: wrist 3 roll
T56 = dh_transform(0, 0, L5 + L6, q[5])
T = T01 @ T12 @ T23 @ T34 @ T45 @ T56
return T
def extract_jacobian_column(T_prev: np.ndarray, joint_axis: np.ndarray,
joint_pos: np.ndarray, ee_pos: np.ndarray
) -> np.ndarray:
"""Extract one column of the geometric Jacobian."""
return np.cross(joint_axis, ee_pos - joint_pos)
def geometric_jacobian(q: np.ndarray) -> np.ndarray:
"""
Compute the 6×n geometric Jacobian J(q).
Columns: [Jv_1..Jv_n; Jw_1..Jw_n]
"""
J = np.zeros((6, len(q)))
T = np.eye(4)
    # Walk through joints, accumulating transforms to get each
    # joint axis and origin in the world frame
for i in range(len(q)):
# Z-axis of joint frame i (in world frame)
axis_world = T[:3, 2].copy()
ee_pos = joint_to_transform(q)[:3, 3]
J[:3, i] = np.cross(axis_world, ee_pos - T[:3, 3])
J[3:, i] = axis_world
        # Advance T by this joint's DH transform (must match joint_to_transform)
        if i == 0:
            T_i = dh_transform(0, -np.pi/2, 0.089, q[0])
        elif i == 1:
            T_i = dh_transform(0.425, 0, 0, q[1] - np.pi/2)
        elif i == 2:
            T_i = dh_transform(0.392, 0, 0, q[2])
        elif i == 3:
            T_i = dh_transform(0, -np.pi/2, 0.109, q[3])
        elif i == 4:
            T_i = dh_transform(0, np.pi/2, 0, q[4])
        elif i == 5:
            T_i = dh_transform(0, 0, 0.175, q[5])
T = T @ T_i
return J
def dh_transform(a, alpha, d, theta):
    """Standard DH link transform (shared by FK and the Jacobian)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st*ca, st*sa, a*ct],
        [st, ct*ca, -ct*sa, a*st],
        [0, sa, ca, d],
        [0, 0, 0, 1]
    ])
def numerical_ik(target_pose: np.ndarray, q_init: np.ndarray,
max_iter: int = 200, tol: float = 1e-4,
lambda_damp: float = 0.1) -> np.ndarray:
"""
Damped Least Squares IK for 6R manipulator.
Parameters
----------
target_pose : (4, 4) homogeneous transform
q_init : initial joint angles
max_iter : max iterations
tol : convergence threshold (position error in metres)
lambda_damp : damping factor λ
Returns
-------
q : (6,) joint angles (radians)
"""
q = q_init.copy()
target_pos = target_pose[:3, 3]
target_rot = target_pose[:3, :3]
for iteration in range(max_iter):
T = joint_to_transform(q)
pos_err = target_pos - T[:3, 3]
# Orientation error via axis-angle of R_error = R_desired @ R_current.T
R_err = target_rot @ T[:3, :3].T
trace_R = np.trace(R_err)
trace_R = np.clip(trace_R, -1.0, 3.0)
theta_err = np.arccos((trace_R - 1) / 2)
if theta_err > 1e-6:
axis = (1 / (2 * np.sin(theta_err))) * np.array([
R_err[2, 1] - R_err[1, 2],
R_err[0, 2] - R_err[2, 0],
R_err[1, 0] - R_err[0, 1]
])
else:
axis = np.zeros(3)
ori_err = theta_err * axis
x_err = np.concatenate([pos_err, ori_err])
if np.linalg.norm(x_err) < tol:
print(f"Converged in {iteration} iterations")
break
# Geometric Jacobian
J = geometric_jacobian(q)
# DLS update
lambda_sq = lambda_damp ** 2
delta_q = J.T @ np.linalg.inv(J @ J.T + lambda_sq * np.eye(6)) @ x_err
q = q + delta_q
        # Clamp to example joint limits
q = np.clip(q, np.radians([-180, -128, -128, -266, -266, -266]),
np.radians([ 180, 128, 128, 266, 266, 266]))
return q
def demo():
"""Test IK: move from home to a target pose."""
# Target: position (0.4, 0.2, 0.3), rotation = identity
target = np.eye(4)
target[:3, 3] = [0.4, 0.2, 0.3]
q_init = np.zeros(6) # start from home position
q_solution = numerical_ik(target, q_init, lambda_damp=0.15)
# Verify
T_verify = joint_to_transform(q_solution)
print(f"Target pos : {target[:3, 3]}")
print(f"Achieved pos: {T_verify[:3, 3]}")
print(f"Solution q : {np.degrees(q_solution)} deg")
return q_solution
if __name__ == "__main__":
demo()
```

## 4. Tier 2 — Intermediate: GraspNet-Style Deep Learning
### 4.1 Overview
GraspNet (GraspNet-1B: 6-DoF Grasp Detection) predicts 6-DoF grasp poses (position + orientation + gripper width) from a single RGB-D image / point cloud. The pipeline:

```
RGB-D Image / Point Cloud
│
▼
Sample Grasp Candidates (tens of thousands)
[x, y, z, roll, pitch, yaw, gripper_width]
│
▼
Grasp Quality CNN Scorer
[Q(success) ∈ [0, 1] for each candidate]
│
▼
Non-Maximum Suppression → Top-K Grasps
│
▼
      Execute (IK → trajectory → gripper close)
```
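The NMS stage can be as simple as greedy pose-similarity suppression. A minimal sketch, assuming the `Grasp` class and the descending-sorted `(Grasp, score)` list produced by `score_grasps()` in Section 4.4; the thresholds are illustrative:

```python
import numpy as np

def grasp_nms(scored, pos_thresh=0.02, angle_thresh=np.radians(30), top_k=10):
    """Keep the highest-scoring grasp in each pose neighbourhood.

    scored: list of (Grasp, score), sorted descending by score.
    """
    kept = []
    for g, s in scored:
        near_duplicate = any(
            np.linalg.norm(g.position - k.position) < pos_thresh
            and float(np.dot(g.approach, k.approach)) > np.cos(angle_thresh)
            for k, _ in kept)
        if not near_duplicate:
            kept.append((g, s))
        if len(kept) == top_k:
            break
    return kept
```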
### 4.2 Grasp Representation — 6-DoF Pose
A grasp is fully specified by:
| Symbol | Meaning | Range |
|---|---|---|
| \(\mathbf{p} \in \mathbb{R}^3\) | Grasp position (contact midpoint) | World coords |
| \(\mathbf{R} \in SO(3)\) | Grasp orientation (gripper frame) | 3-DoF rotation |
| \(w\) | Gripper width (fingers open) | 0 – 0.08 m |
Gripper frame convention: \(x\)-axis = approach direction, \(z\)-axis = finger closing direction.
### 4.3 Grasp Quality Metrics

Before learning, we can evaluate grasps analytically:

Force Closure: A grasp achieves force closure if any external wrench can be counteracted by contact forces. Condition:

\[
\forall\, \mathbf{w}_{\text{ext}} \;\; \exists\, \mathbf{f} \in \mathcal{F} : \quad \mathbf{G}\mathbf{f} = -\mathbf{w}_{\text{ext}}
\]

where \(\mathbf{G}\) is the Grasp Matrix (see Section 8.4), \(\mathbf{f}\) are contact forces, and \(\mathcal{F}\) is the set of forces satisfying the friction-cone constraints.

Epsilon Quality \(\epsilon \in [0, 1]\): Largest wrench that can be resisted in any direction while keeping unit contact forces — equivalently, the radius of the largest origin-centered ball contained in the grasp wrench space \(\mathcal{W}\):

\[
\epsilon = \min_{\mathbf{w} \in \partial \mathcal{W}} \|\mathbf{w}\|
\]

Higher \(\epsilon\) = more robust grasp.

Wrench Space Volume: Volume of the grasp wrench space (GWS), approximated as the volume of the convex hull of the primitive contact wrenches \(\mathbf{w}_1, \ldots, \mathbf{w}_k\):

\[
Q_{\text{vol}} = \operatorname{Vol}\big(\operatorname{ConvexHull}(\{\mathbf{w}_1, \ldots, \mathbf{w}_k\})\big)
\]
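A minimal numeric sketch of the \(\epsilon\) metric, assuming hard point contacts with Coulomb friction, unit force bounds, and an 8-edge linearized friction cone (the function and parameter names are ours; degenerate contact sets can make the 6-D hull computation fail):

```python
import numpy as np
from scipy.spatial import ConvexHull

def epsilon_quality(contacts, normals, mu=0.5, num_edges=8):
    """Radius of the largest origin-centred ball inside the GWS
    (0 if the origin lies outside, i.e. no force closure).

    contacts: (k, 3) positions relative to the object frame origin.
    normals : (k, 3) inward surface normals at those contacts.
    """
    wrenches = []
    for pt, n in zip(contacts, normals):
        n = n / np.linalg.norm(n)
        t1 = np.cross(n, [1.0, 0.0, 0.0])
        if np.linalg.norm(t1) < 1e-6:          # n parallel to the x-axis
            t1 = np.cross(n, [0.0, 1.0, 0.0])
        t1 /= np.linalg.norm(t1)
        t2 = np.cross(n, t1)
        for i in range(num_edges):             # friction-cone edge forces
            phi = 2 * np.pi * i / num_edges
            f = n + mu * (np.cos(phi) * t1 + np.sin(phi) * t2)
            f /= np.linalg.norm(f)
            wrenches.append(np.concatenate([f, np.cross(pt, f)]))
    hull = ConvexHull(np.array(wrenches))      # GWS ≈ hull of the wrenches
    # hull.equations rows are [unit normal | offset] with n·x + offset <= 0
    # inside the hull, so -offset is the origin-to-facet distance.
    return max(float(np.min(-hull.equations[:, -1])), 0.0)
```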
### 4.4 Complete Python Code — Grasp Candidate Sampling + Quality CNN

```python
"""
Tier 2: GraspNet-Style Grasp Pose Prediction
============================================
1. Sample grasp candidates from point cloud
2. Score them with a CNN quality estimator
3. Output top-K ranked grasps
"""
import numpy as np
import torch
import torch.nn as nn
# ─── Grasp Representation ────────────────────────────────────────
class Grasp:
"""
A 6-DoF grasp pose.
frame: SE(3) transform where:
- origin = grasp contact midpoint
- x-axis = approach direction (toward object)
- z-axis = finger closing direction
width: gripper opening (metres)
"""
def __init__(self, frame: np.ndarray, width: float):
# frame: (4, 4) homogeneous transform
self.frame = frame
self.width = width
@property
def position(self) -> np.ndarray:
return self.frame[:3, 3]
@property
def approach(self) -> np.ndarray:
"""Unit vector along gripper x-axis (approach direction)."""
return self.frame[:3, 0]
    def to_matrix(self) -> np.ndarray:
        """Flatten grasp to a 13-D feature: position (3) + rotation matrix (9) + width (1)."""
        return np.concatenate([self.position, self.frame[:3, :3].flatten(), [self.width]])
# ─── Grasp Candidate Sampling ────────────────────────────────────
def sample_antipodal_points(points: np.ndarray, normals: np.ndarray,
num_samples: int = 5000) -> list:
"""
Sample grasp candidates by selecting antipodal point pairs.
An antipodal pair has normals pointing roughly toward each other.
The grasp center is midway between the two points.
Parameters
----------
points : (N, 3) point cloud
normals : (N, 3) surface normals
num_samples : max number of candidates to generate
Returns
-------
candidates : list of Grasp objects
"""
    candidates = []
# Filter points with valid normals
valid = np.where(np.linalg.norm(normals, axis=1) > 0.1)[0]
points = points[valid]
normals = normals[valid]
for _ in range(num_samples):
i = np.random.randint(0, len(points))
j = np.random.randint(0, len(points))
if i == j:
continue
p1, n1 = points[i], normals[i]
p2, n2 = points[j], normals[j]
# Distance between points
dist = np.linalg.norm(p1 - p2)
if dist < 0.01 or dist > 0.10: # reject too close / too far
continue
        # Antipodal test: normals should be roughly opposing
dot_normals = np.dot(n1, n2)
if dot_normals > -0.5: # not antipodal enough
continue
# Grasp frame: center at midpoint, approach = toward object
center = (p1 + p2) / 2
approach = -(n1 + n2)
approach /= (np.linalg.norm(approach) + 1e-8)
# Build orthonormal frame
# binormal = approach × down (down = -gravity)
down = np.array([0, 0, -1])
binormal = np.cross(approach, down)
if np.linalg.norm(binormal) < 1e-4:
binormal = np.cross(approach, np.array([0, 1, 0]))
binormal /= np.linalg.norm(binormal)
        closing = np.cross(approach, binormal)  # finger closing direction
        frame = np.eye(4)
        frame[:3, 0] = approach   # x-axis: approach direction
        frame[:3, 1] = binormal   # y-axis
        frame[:3, 2] = closing    # z-axis: finger closing direction
        frame[:3, 3] = center
candidates.append(Grasp(frame, width=dist))
return candidates
# ─── Grasp Quality CNN ──────────────────────────────────────────
class GraspQualityCNN(nn.Module):
"""
Predict grasp success probability from:
- Point cloud crop around grasp (N × 3)
    - Grasp pose features (13-D: position + rotation matrix + width)
"""
    def __init__(self, num_points: int = 256, pose_dim: int = 13,
                 hidden_dim: int = 256):
super().__init__()
self.num_points = num_points
# Point-wise MLP (shared weights per point)
self.point_encoder = nn.Sequential(
nn.Conv1d(3, 64, 1), nn.ReLU(),
nn.Conv1d(64, 128, 1), nn.ReLU(),
nn.Conv1d(128, 256, 1), nn.ReLU(),
)
# Pose encoder
self.pose_encoder = nn.Sequential(
nn.Linear(pose_dim, 128), nn.ReLU(),
nn.Linear(128, 256), nn.ReLU(),
)
# Fusion + score head
self.fusion = nn.Sequential(
nn.Linear(256 + 256, hidden_dim), nn.ReLU(),
nn.Linear(hidden_dim, 128), nn.ReLU(),
nn.Linear(128, 1), nn.Sigmoid(), # score ∈ [0, 1]
)
def forward(self, points: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
"""
Parameters
----------
points : (B, N, 3) point cloud crop
        pose : (B, 13) grasp pose features
Returns
-------
score : (B, 1) success probability
"""
B = points.size(0)
# Encode point cloud
pc_feat = self.point_encoder(points.permute(0, 2, 1)) # (B, 256, N)
pc_feat = pc_feat.max(dim=2)[0] # (B, 256) — max pooling
# Encode pose
pose_feat = self.pose_encoder(pose) # (B, 256)
# Fusion
combined = torch.cat([pc_feat, pose_feat], dim=1) # (B, 512)
score = self.fusion(combined) # (B, 1)
return score
def score_grasps(candidates: list, pc_crop: np.ndarray,
                 model: GraspQualityCNN, device: str = "cpu") -> list:
"""
Score all grasp candidates with the CNN.
Parameters
----------
candidates : list of Grasp objects
pc_crop : (M, 3) point cloud of the scene (crop around candidates)
model : trained GraspQualityCNN
Returns
-------
scored_grasps : list of (Grasp, score) sorted descending by score
"""
model.eval()
batch_size = 64
scored = []
with torch.no_grad():
for i in range(0, len(candidates), batch_size):
batch = candidates[i:i + batch_size]
# Build pose feature tensor
pose_batch = np.array([g.to_matrix() for g in batch]).astype(np.float32)
pose_tensor = torch.from_numpy(pose_batch).to(device)
# Subsample point cloud to fixed size
pc_resampled = resample_pc(pc_crop, num_points=model.num_points)
pc_tensor = torch.from_numpy(pc_resampled).unsqueeze(0).expand(len(batch), -1, -1)
pc_tensor = pc_tensor.to(device)
# Score
scores = model(pc_tensor.float(), pose_tensor.float())
for g, s in zip(batch, scores.cpu().numpy().flatten()):
scored.append((g, float(s)))
scored.sort(key=lambda x: x[1], reverse=True)
return scored
def resample_pc(pc: np.ndarray, num_points: int) -> np.ndarray:
"""Randomly resample (or pad) a point cloud to fixed size."""
N = len(pc)
if N >= num_points:
idx = np.random.choice(N, num_points, replace=False)
else:
idx = np.random.choice(N, num_points, replace=True)
return pc[idx]
def demo_graspnet():
"""Simulated demo — in practice, load RealSense + trained model."""
np.random.seed(42)
    # Simulated point cloud: a rough cylindrical shell of points
N = 1000
angles = np.random.uniform(0, 2*np.pi, N)
heights = np.random.uniform(-0.05, 0.05, N)
radii = np.random.uniform(0.03, 0.05, N)
points = np.stack([radii * np.cos(angles),
radii * np.sin(angles),
heights], axis=1)
normals = points / np.linalg.norm(points, axis=1, keepdims=True)
# Sample candidates
candidates = sample_antipodal_points(points, normals, num_samples=200)
print(f"Sampled {len(candidates)} grasp candidates")
# Score with a random (untrained) CNN
device = "cpu"
model = GraspQualityCNN().to(device)
scored = score_grasps(candidates, points, model, device)
print("Top 5 grasps:")
for g, score in scored[:5]:
print(f" score={score:.3f} pos={g.position} width={g.width:.4f}m")
return scored
if __name__ == "__main__":
demo_graspnet()
```

## 5. Tier 3 — Modern: RL-Based Grasping
### 5.1 Concept
Instead of separately planning a grasp pose, RL-based grasping directly learns a policy \(\pi_\theta(\mathbf{a} \mid \mathbf{o})\) that maps observations to gripper actions. The robot learns through trial and error which strategies lead to successful grasps.

```
┌──────────────────────────────────────────────────────────┐
│ RL Grasping Loop │
│ │
│ Observation o_t │
│ [rgb, depth, proprioceptive state] │
│ │ │
│ ▼ │
│ Policy π_θ (neural network) │
│ │ │
│ ▼ │
│ Action a_t = [Δx, Δy, Δz, Δθ, open/close gripper] │
│ │ │
│ ▼ │
│ PyBullet / Isaac Gym physics simulation │
│ │ │
│ ▼ │
│ Reward r_t = +1 (success) / -0.01 (step penalty) │
│ │ │
│ ▼ │
│ Policy update (PPO / SAC) │
└──────────────────────────────────────────────────────────┘
```
### 5.2 RL Formulation

State \(\mathbf{s}_t\):

- End-effector pose \((x, y, z, \phi, \theta, \psi)\)
- Gripper width \(w\)
- Object pose(s) from simulation

Action \(\mathbf{a}_t\):
- \(\Delta x, \Delta y, \Delta z\) — position delta (metres)
- \(\Delta \phi\) — wrist roll delta (radians)
- gripper — continuous in \([-1, 1]\) (\(> 0\) = open, \(< 0\) = close)
Reward shaping (transcribed as code after the table):
| Event | Reward |
|---|---|
| Grasp success (object lifted) | \(r_{\text{success}} = +10\) |
| Object dropped after lift | \(r_{\text{drop}} = -5\) |
| Collision with table | \(r_{\text{coll}} = -2\) |
| Step penalty | \(r_{\text{step}} = -0.01\) |
| Approach bonus (close to object) | \(r_{\text{approach}} = +0.1 \cdot \Delta d\) |
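Transcribed directly from the table (the boolean event flags are assumed to come from the environment; `delta_d` is this step's reduction in EE–object distance):

```python
def shaped_reward(success, dropped, collision, delta_d):
    """Sum the shaping terms from the table above."""
    r = -0.01              # step penalty
    r += 0.1 * delta_d     # approach bonus
    if success:
        r += 10.0
    if dropped:
        r -= 5.0
    if collision:
        r -= 2.0
    return r
```

(The Section 5.3 environment uses slightly different constants, but the structure is the same.)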
### 5.3 Complete Python Code — PyBullet Grasping Environment + PPO

```python
"""
Tier 3: RL-Based Grasping with PyBullet + PPO
=============================================
Custom Gymnasium environment simulating a 6-DoF arm grasping task.
Trains a PPO agent to approach, grasp, and lift a cube.
"""
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO
import pybullet as p
import pybullet_data
import time
class GraspingEnv(gym.Env):
"""
PyBullet-based grasping environment.
Scene:
- Plane (table surface)
- 1 cube object (0.05 m side)
- 6-DoF robot arm (simplified kinematic model)
Observation: (12,)
- EE position (3) + EE velocity (3)
- Gripper width (1) + gripper velocity (1)
- Object position relative to EE (3) + object velocity (1)
Action: (5,)
- dx, dy, dz (position delta in EE frame)
- dtheta (wrist rotation)
    - gripper: > 0 = open, < 0 = close
"""
metadata = {"render_modes": ["human"], "render_fps": 30}
def __init__(self, render_mode=None, gui=True):
super().__init__()
self.render_mode = render_mode
self.gui = gui
# Observation: 12-dim
self.observation_space = spaces.Box(
low=-np.inf, high=np.inf, shape=(12,), dtype=np.float32)
# Action: 5-dim
self.action_space = spaces.Box(
low=-1.0, high=1.0, shape=(5,), dtype=np.float32)
# Physics client
self.client_id = p.connect(p.GUI if gui else p.DIRECT)
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81, physicsClientId=self.client_id)
# Simulation parameters
self.dt = 0.02 # 50 Hz
self.max_steps = 200 # 4-second episodes
self.reach_threshold = 0.04 # grasp success distance (m)
self.lift_height = 0.10 # object must rise this high
self.robot_id = None
self.object_id = None
self.table_id = None
self.step_count = 0
# ─── Helpers ──────────────────────────────────────────────
def _setup_scene(self):
"""Create table, object, and arm."""
# Table
self.table_id = p.loadURDF("plane.urdf", physicsClientId=self.client_id)
table_visual = p.createVisualShape(
p.GEOM_BOX, halfExtents=[0.3, 0.3, 0.01],
rgbaColor=[0.6, 0.5, 0.4, 1], physicsClientId=self.client_id)
p.createMultiBody(
baseMass=0, baseVisualShapeIndex=table_visual,
basePosition=[0.4, 0, 0.005], physicsClientId=self.client_id)
# Cube object
obj_collision = p.createCollisionShape(
p.GEOM_BOX, halfExtents=[0.025]*3)
obj_visual = p.createVisualShape(
p.GEOM_BOX, halfExtents=[0.025]*3,
rgbaColor=[0.8, 0.2, 0.2, 1])
self.object_id = p.createMultiBody(
baseMass=0.01, baseCollisionShapeIndex=obj_collision,
baseVisualShapeIndex=obj_visual,
basePosition=[0.4, 0.0, 0.03],
physicsClientId=self.client_id)
        # Simplified arm: the end-effector is modelled as a free point that
        # the policy moves directly; no joint chain is simulated here.
        # (In a full implementation: load a UR5/Panda URDF and drive its
        #  joints with p.setJointMotorControl2.)
ee_pos = np.array([0.3, 0.0, 0.35])
self.ee_pos = ee_pos.copy()
        self.ee_quat = [0, 0, 0, 1]  # xyzw (identity orientation)
self.gripper_open = 1.0
def _reset_robot(self):
"""Randomize object and reset arm to home."""
# Randomize object position on table
obj_x = np.random.uniform(0.30, 0.50)
obj_y = np.random.uniform(-0.15, 0.15)
        p.resetBasePositionAndOrientation(
            self.object_id, [obj_x, obj_y, 0.03], [0, 0, 0, 1],
            physicsClientId=self.client_id)
        p.resetBaseVelocity(self.object_id, [0, 0, 0], [0, 0, 0],
                            physicsClientId=self.client_id)
# Reset arm to home
self.ee_pos = np.array([0.3, 0.0, 0.35])
self.gripper_open = 1.0
# ─── Gymnasium Interface ───────────────────────────────────
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        if self.object_id is None:
            # Build the scene once; later resets only reposition bodies
            self._setup_scene()
        self._reset_robot()
        self.step_count = 0
        return self._get_obs(), {}
def step(self, action):
self.step_count += 1
# ── Parse action ──
dx = float(action[0]) * 0.05 # ±5 cm per step
dy = float(action[1]) * 0.05
dz = float(action[2]) * 0.03 # ±3 cm
        dtheta = float(action[3]) * 0.2  # ±0.2 rad (unused by this simplified EE model)
        gripper_cmd = float(action[4])   # > 0 → open, < 0 → close
        # Update gripper opening
        self.gripper_open = np.clip(self.gripper_open + 0.3 * gripper_cmd,
                                    0.0, 1.0)
        # ── Move EE (simplified kinematic model) ──
        # Approximation: treat the action delta as a world-frame delta,
        # i.e. assume the EE stays upright (quaternion ≈ [0, 0, 0, 1])
        self.ee_pos += np.array([dx, dy, dz])
# Clamp to workspace
self.ee_pos[0] = np.clip(self.ee_pos[0], 0.15, 0.65)
self.ee_pos[1] = np.clip(self.ee_pos[1], -0.35, 0.35)
self.ee_pos[2] = np.clip(self.ee_pos[2], 0.01, 0.60)
        # ── Kinematic grasp: while the gripper is closed around the object,
        #    carry it with the EE (stands in for real finger contact, which
        #    this simplified arm model does not simulate) ──
        obj_pos, obj_quat = p.getBasePositionAndOrientation(
            self.object_id, physicsClientId=self.client_id)
        if (np.linalg.norm(self.ee_pos - np.array(obj_pos)) < self.reach_threshold
                and self.gripper_open < 0.3):
            p.resetBasePositionAndOrientation(
                self.object_id, self.ee_pos.tolist(), obj_quat,
                physicsClientId=self.client_id)
        # ── Step physics ──
        p.stepSimulation(physicsClientId=self.client_id)
        if self.render_mode == "human":
            time.sleep(self.dt)
# ── Compute reward ──
obj_pos, obj_quat = p.getBasePositionAndOrientation(
self.object_id, physicsClientId=self.client_id)
obj_pos = np.array(obj_pos)
dist_ee_obj = np.linalg.norm(self.ee_pos - obj_pos)
obj_lifted = obj_pos[2] > self.lift_height
# Grasp: EE close to object AND gripper closed
is_grasping = (dist_ee_obj < self.reach_threshold) and (self.gripper_open < 0.3)
# Reward components
reward = -0.01 # step penalty
reward += 0.05 * max(0, 0.10 - dist_ee_obj) # approach bonus
if is_grasping and obj_lifted:
reward += 10.0 # SUCCESS
elif is_grasping:
reward += 1.0 # firm grasp
if obj_pos[2] < 0.01:
reward -= 2.0 # object fell off table
# Termination
terminated = bool(obj_lifted and is_grasping)
truncated = self.step_count >= self.max_steps
return self._get_obs(), reward, terminated, truncated, {}
def _get_obs(self) -> np.ndarray:
obj_pos, _ = p.getBasePositionAndOrientation(
self.object_id, physicsClientId=self.client_id)
obj_pos = np.array(obj_pos)
obj_vel = np.array(p.getBaseVelocity(
self.object_id, physicsClientId=self.client_id)[0])
rel_pos = obj_pos - self.ee_pos
obs = np.concatenate([
self.ee_pos, # 3
np.zeros(3), # EE velocity (unknown → zero)
np.array([self.gripper_open, 0.0]), # gripper state
rel_pos, # 3
[obj_vel[2]], # object upward velocity
], dtype=np.float32)
return np.clip(obs, -10, 10)
def render(self):
pass # handled by PyBullet GUI
# ─── Training ─────────────────────────────────────────────────
def train_ppo(total_timesteps: int = 200_000, save_path: str = "ppo_grasping"):
"""Train a PPO agent on the grasping environment."""
env = GraspingEnv(gui=False)
model = PPO(
policy="MlpPolicy",
env=env,
learning_rate=3e-4,
n_steps=2048,
batch_size=64,
n_epochs=10,
gamma=0.99,
gae_lambda=0.95,
clip_range=0.2,
ent_coef=0.01, # encourage exploration
verbose=1,
tensorboard_log="./tb_logs_grasping/",
)
print(f"[INFO] Training PPO for {total_timesteps} timesteps...")
model.learn(total_timesteps=total_timesteps, progress_bar=True)
model.save(save_path)
print(f"[INFO] Model saved to {save_path}")
return model
def evaluate(model_path: str = "ppo_grasping", episodes: int = 20):
"""Evaluate a trained model."""
env = GraspingEnv(gui=True)
model = PPO.load(model_path)
successes = 0
for ep in range(episodes):
obs, _ = env.reset()
done = False
steps = 0
while not done:
action, _ = model.predict(obs, deterministic=True)
obs, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
steps += 1
success = terminated # episode ends on success
successes += int(success)
print(f" Episode {ep+1}: {'✓ SUCCESS' if success else '✗ FAIL'} in {steps} steps")
print(f"\nSuccess rate: {successes}/{episodes} = {successes/episodes*100:.1f}%")
if __name__ == "__main__":
import sys
if len(sys.argv) > 1 and sys.argv[1] == "eval":
evaluate()
else:
train_ppo()
```

## 6. Step-by-Step Implementation Guide
### Phase 1 — Robot Setup (All Tiers)
- Connect the arm: Flash firmware on UR5/Panda, verify `ros_control` connectivity
- Load URDF: `rosparam load robot_description` → verify in RViz
- Home position: Define a safe home configuration for all joints
- Workspace calibration: Identify reachable workspace bounds
### Phase 2 — Tier 1: Numerical IK
- Define DH parameters or load the URDF with `kdl_kinematics_plugin`
- Implement `geometric_jacobian()` using the URDF chain
- Run `numerical_ik()` from home position → random reachable targets
- Validate by comparing IK solution → FK forward pose
- Integrate with MoveIt: `move_group.set_joint_value_target(q_solution)`

```python
# Use MoveIt's built-in IK (IKFast or KDL)
from moveit_commander import MoveGroupCommander
mgc = MoveGroupCommander("manipulator")
mgc.set_pose_target(pose_target) # geometry_msgs/Pose
plan = mgc.plan()
mgc.execute(plan, wait=True)
```
### Phase 3 — Tier 2: GraspNet Pipeline
- Point cloud capture: Subscribe to `/camera/depth/points` (ROS) or use the RealSense SDK
- Downsample: VoxelGrid filter (leaf size 2–5 mm)
- Pass through: Crop to workspace volume
- Estimate normals: Open3D `estimate_normals()` (see the sketch after this list)
- Sample candidates: Run `sample_antipodal_points()`
- Score: Run CNN inference on the batch
- Execute: Take the top-1 grasp → run IK → move arm → close gripper
- Loop: If failed, retry with the next-ranked grasp candidate
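A minimal Open3D sketch of steps 2–4 above, assuming a captured cloud saved as `scene.pcd` (the filename and parameter values are illustrative):

```python
import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("scene.pcd")       # hypothetical capture file
pcd = pcd.voxel_down_sample(voxel_size=0.003)    # 3 mm voxel grid
workspace = o3d.geometry.AxisAlignedBoundingBox( # crop to workspace volume
    min_bound=(0.15, -0.35, 0.0), max_bound=(0.65, 0.35, 0.40))
pcd = pcd.crop(workspace)
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.01, max_nn=30))
pcd.orient_normals_towards_camera_location(np.zeros(3))
points = np.asarray(pcd.points)
normals = np.asarray(pcd.normals)
# candidates = sample_antipodal_points(points, normals)  # Tier 2 sampler
```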
### Phase 4 — Tier 3: RL Training

- Start PyBullet simulation: Verify object spawning and arm dynamics
- Baseline RL: Train PPO for 200k timesteps (~10 min on CPU)
- Hyperparameter sweep: Tune `ent_coef`, `gamma`, `n_steps`
- Domain randomization: Vary object size, mass, friction, camera noise (see the sketch below)
- Sim-to-real: Transfer policy to real robot; fine-tune with online data
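A minimal domain-randomization hook for the `GraspingEnv` of Section 5.3, meant to be called at the start of each episode; the ranges are illustrative:

```python
import numpy as np
import pybullet as p

def randomize_object(env):
    """Randomize mass and friction of the GraspingEnv's cube."""
    p.changeDynamics(
        env.object_id, -1,                               # -1 = base link
        mass=float(np.random.uniform(0.005, 0.05)),      # 5-50 g
        lateralFriction=float(np.random.uniform(0.3, 1.0)),
        physicsClientId=env.client_id)
```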
## 7. Comparison of Approaches
| Criteria | Numerical IK (Tier 1) | GraspNet DL (Tier 2) | RL Grasping (Tier 3) |
|---|---|---|---|
| Grasp pose | Given (user specifies) | Predicted from PC | Learned end-to-end |
| Object perception | Known / pre-scene | RGB-D / point cloud | Raw sensors |
| Training data | None | 1B synthetic grasps | Trial-and-error |
| Generalization | None (target pose must be given) | Strong (CNN features) | Needs sim-to-real |
| Computation | ~100 Hz | ~10 Hz (GPU batch) | ~50 Hz (inference) |
| Unknown objects | ❌ No | ✅ Yes | ✅ Yes (after training) |
| Occlusion handling | ❌ No | ⚠️ Limited | ⚠️ Limited |
| Setup complexity | Low | Medium (GPU + dataset) | High (sim + training) |
| Best for | Pick-and-place | Cluttered scenes | Novel objects, complex dynamics |
| Difficulty | ⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
## 8. Extensions and Variations
### 8.1 Multi-Object Grasping
- Extend GraspNet to output grasp sequences (sorted by reachability)
- Use a scene graph to reason about object relationships
- Combine with task planning for "pick X and place in Y"
### 8.2 Impedance Control
- After closing the gripper, switch to impedance mode to adapt to contact forces
- Track desired wrench: \(\tau = \mathbf{J}^T \mathbf{F}_d + \mathbf{K}_p \mathbf{e} + \mathbf{K}_d \dot{\mathbf{e}}\)
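A minimal sketch of this torque law, assuming `J`, the desired wrench `F_d`, and the tracking errors are already computed by the controller; the scalar gains are illustrative:

```python
import numpy as np

def impedance_torque(J, F_d, e, e_dot, Kp=50.0, Kd=5.0):
    """tau = J^T F_d + Kp e + Kd e_dot, with scalar joint-space gains."""
    return J.T @ np.asarray(F_d) + Kp * np.asarray(e) + Kd * np.asarray(e_dot)
```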
### 8.3 Dexterous Manipulation
- Replace parallel-jaw with Allegro / Shadow Hand
- Use a contact model (Coulomb friction cones) for grasp stability analysis
- Extend to in-hand manipulation: reorient object without re-grasping
### 8.4 Contact Model

Coulomb friction (planar contact):

\[
|F_t| \le \mu F_n
\]

where \(F_t\) is tangential force, \(F_n\) is normal force, \(\mu\) is the friction coefficient.

Grasp matrix for a grasp with \(k\) contact points (e.g., \(k = 2\) for a parallel-jaw gripper):

\[
\mathbf{G} = \big[\, \mathbf{G}_1 \;\; \cdots \;\; \mathbf{G}_k \,\big],
\qquad
\mathbf{G}_i = \begin{bmatrix} \mathbf{I}_3 \\ [\mathbf{p}_i]_\times \end{bmatrix}
\]

where \([\mathbf{p}_i]_\times\) is the skew-symmetric matrix of contact position \(\mathbf{p}_i\); each \(\mathbf{G}_i\) maps the contact force at \(\mathbf{p}_i\) to an object wrench (force + torque).
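A quick numeric check of the Coulomb condition for a single contact (the helper is ours):

```python
import numpy as np

def in_friction_cone(f, n, mu=0.5):
    """True if force f at a contact with unit inward normal n
    satisfies |F_t| <= mu * F_n."""
    f, n = np.asarray(f, float), np.asarray(n, float)
    f_n = float(np.dot(f, n))            # normal component F_n
    f_t = np.linalg.norm(f - f_n * n)    # tangential magnitude |F_t|
    return f_n > 0 and f_t <= mu * f_n
```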
### 8.5 Sim-to-Real Transfer
| Issue | Sim Solution | Real Adjustment |
|---|---|---|
| Camera noise | Add Gaussian noise to depth | Retune with real data |
| Dynamics mismatch | Match URDF inertials | System identification |
| Gripper offset | Calibrate with checkerboard | Online calibration |
| Object properties | Randomize friction, mass | Estimate from F/T sensor |
## 9. References
- Siciliano et al., 2008 — Robotics: Modelling, Planning and Control — Comprehensive robotics textbook covering IK, dynamics, and control
- Spong et al., 2006 — Robot Modeling and Control — DH parameters, Jacobians, and numerical IK
- Fang et al., 2020 — GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping (CVPR 2020) — 1 billion grasp poses; 6-DoF grasp detection benchmark and dataset
- Fang et al., 2020 — Closing the Loop for Robotic Grasping: A Large Scale Learning Based Approach — End-to-end grasp scoring CNN
- Kalakrishnan et al., 2011 — Learning Force Control Policies for Compliant Manipulation — RL for force-based manipulation
- Schulman et al., 2017 — Proximal Policy Optimization Algorithms — PPO algorithm
- Coumans & Bai, 2016–2023 — PyBullet Physics Engine — Robot simulation
- Sucan & Chitta, 2012 — MoveIt! ROS Integration for Planning Groups — Motion planning framework
- Choi & Christensen, 2020 — Real-time Grasp Planning Using Multi-View SilhouetteNet — Multi-view grasp planning
- Open3D Documentation — Point cloud processing library
- YCB Object Set — Benchmark objects for manipulation research