
Robotic Arm Grasping

Project Type: Manipulation | Difficulty: ★★☆☆☆ to ★★★★★ (tier-dependent) | Estimated Time: 2–4 weekends


1. Project Overview

Grasping is a fundamental manipulation skill: the robot must position its end-effector (gripper) at a specific 6-DoF pose, close the fingers, and lift the object without dropping or colliding with the environment.

This project covers three algorithm tiers:

| Tier | Approach | Input | Key Algorithm |
|---|---|---|---|
| Tier 1 — Traditional | Analytical + numerical IK | Known target pose | Geometric IK, Jacobian pseudo-inverse, DLS |
| Tier 2 — Intermediate | GraspNet-style deep learning | RGB-D / point cloud | 6-DoF grasp candidate sampling + CNN quality scoring |
| Tier 3 — Modern | RL-based grasping | Raw sensor observations | PPO/SAC in Isaac Gym or PyBullet |
  ┌──────────────────────────────────────────────────────────────┐
  │                 Robotic Arm Grasping Pipeline                │
  │                                                              │
  │   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐    │
  │   │  Perceive   │────▶│  Grasp      │────▶│  Execute    │    │
  │   │  (camera /  │     │  planning   │     │  (move to   │    │
  │   │   depth)    │     │  (IK / DL / │     │   pose,     │    │
  │   │             │     │   RL)       │     │   close     │    │
  │   │             │     │             │     │   gripper)  │    │
  │   └─────────────┘     └─────────────┘     └─────────────┘    │
  │         │                    │                   │           │
  │         ▼                    ▼                   ▼           │
  │   point cloud /        grasp pose          success?          │
  │   RGB-D                (x, y, z, r, p, y)  feedback →        │
  │                        + gripper width     restart loop      │
  └──────────────────────────────────────────────────────────────┘

2. Hardware & Software Requirements

Hardware

| Component | Specification | Notes |
|---|---|---|
| Robotic arm | UR5e / Panda (7-DoF) | 6-DoF minimum; URDF required |
| Gripper | Parallel-jaw (2-finger) | e.g., Robotiq 2F-85, or custom |
| Camera | Intel RealSense D435 / Azure Kinect | RGB-D required for Tier 2+ |
| Computer | Ubuntu 20.04, GPU ≥ 6 GB VRAM | For DL/RL training |
| Object set | YCB / GraspNet-1B objects | Various shapes and sizes |
| F/T sensor (optional) | ATI force/torque sensor | For grasp quality evaluation |

Software

| Package | Version | Purpose |
|---|---|---|
| Python | ≥ 3.8 | Core language |
| NumPy / SciPy | ≥ 1.20 | Numerical IK |
| MoveIt | ROS Noetic | Motion planning + IK |
| PyBullet | ≥ 3.0 | Physics simulation (Tier 3) |
| PyTorch | ≥ 1.13 | Deep learning (Tier 2) |
| Stable-Baselines3 | ≥ 1.7 | RL algorithms |
| Gymnasium | ≥ 0.28 | RL environment interface |
| Open3D | ≥ 0.16 | Point cloud processing |

pip install numpy scipy pybullet torch stable-baselines3 gymnasium open3d opencv-python

3. Tier 1 — Traditional: Inverse Kinematics

3.1 Forward Kinematics Review

Given joint angles \(\mathbf{q} = [q_1, q_2, \ldots, q_n]^T\), forward kinematics (FK) computes the end-effector pose:

\[ \mathbf{T}(\mathbf{q}) = \begin{bmatrix} \mathbf{R}(\mathbf{q}) & \mathbf{p}(\mathbf{q}) \\ \mathbf{0}_{1\times3} & 1 \end{bmatrix} \in SE(3) \]

where \(\mathbf{R} \in SO(3)\) is the rotation matrix and \(\mathbf{p} \in \mathbb{R}^3\) is the position.

Denavit-Hartenberg (DH) parameters for a 6-DoF manipulator:

| Joint | \(a_i\) (mm) | \(\alpha_i\) (rad) | \(d_i\) (mm) | \(\theta_i\) |
|---|---|---|---|---|
| 1 | 0 | \(-\pi/2\) | \(d_1\) | \(q_1\) |
| 2 | \(a_2\) | 0 | 0 | \(q_2 - \pi/2\) |
| 3 | \(a_3\) | 0 | 0 | \(q_3\) |
| 4 | 0 | \(-\pi/2\) | \(d_4\) | \(q_4\) |
| 5 | 0 | \(\pi/2\) | 0 | \(q_5\) |
| 6 | 0 | 0 | \(d_6\) | \(q_6\) |

3.2 Analytical IK — Geometric Approach

For a 6-DoF articulated arm (e.g., UR5), we decompose the problem:

  1. Wrist center \(\mathbf{p}_c = \mathbf{p}_e - d_6 \cdot \mathbf{n}\) (tool offset)
  2. Shoulder — solve for \(q_1\) (azimuth) and \(q_2, q_3\) (elbow configuration)
  3. Wrist orientation — solve for \(q_4, q_5, q_6\) from remaining rotation

2-Link planar IK (shoulder + elbow):

      y_p ▲
          │        O  elbow (θ2)
          │       / \
          │  L1  /   \  L2
          │     /     \
          │    /       X  wrist center (x_p, y_p)
          │   / θ1
          └──O─────────────────▶ x_p
          shoulder

Given the wrist center \(\mathbf{p}_c = (x, y, z)\), first set \(q_1 = \operatorname{atan2}(y, x)\) so the arm swings into the vertical plane containing the target. The remaining 2-link problem lives in that plane, with planar coordinates \((x_p, y_p) = \left(\sqrt{x^2 + y^2},\; z - d_1\right)\) and reach

\[ r = \sqrt{x_p^2 + y_p^2} \]

Law of cosines for elbow angle \(\theta_2\):

\[ \cos \theta_2 = \frac{r^2 - L_1^2 - L_2^2}{2 L_1 L_2} \quad \Rightarrow \quad \theta_2 = \pm \arccos(\cdots) \]

(Use − for the "elbow-up" configuration, + for "elbow-down".)

Shoulder angle \(\theta_1\) (measured in the planar frame):

\[ \theta_1 = \operatorname{atan2}(y_p, x_p) - \operatorname{atan2}\!\left(L_2 \sin\theta_2,\; L_1 + L_2 \cos\theta_2\right) \]
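A minimal sketch of these two equations in Python (the link lengths and target below are illustrative):

import numpy as np


def two_link_ik(x_p: float, y_p: float, L1: float, L2: float,
                elbow_up: bool = True):
    """Planar 2-link IK for a target (x_p, y_p) in the arm's plane.
    Returns (theta1, theta2) in radians, or None if unreachable."""
    c2 = (x_p**2 + y_p**2 - L1**2 - L2**2) / (2 * L1 * L2)
    if abs(c2) > 1.0:
        return None                      # outside the annular workspace
    theta2 = -np.arccos(c2) if elbow_up else np.arccos(c2)
    theta1 = np.arctan2(y_p, x_p) - np.arctan2(L2 * np.sin(theta2),
                                               L1 + L2 * np.cos(theta2))
    return theta1, theta2


# Example: reach (0.5, 0.2) with two 0.4 m links, elbow-up
print(two_link_ik(0.5, 0.2, 0.4, 0.4))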

3.3 Numerical IK — Jacobian Methods

When analytical solutions are unavailable, numerical IK iteratively refines \(\mathbf{q}\):

Jacobian pseudo-inverse:

\[ \dot{\mathbf{q}} = \mathbf{J}^{\dagger} \dot{\mathbf{x}}, \quad \mathbf{J}^{\dagger} = \mathbf{J}^T \left(\mathbf{J} \mathbf{J}^T\right)^{-1} \]

Damped Least Squares (DLS) — the standard remedy, which keeps the update bounded near singularities:

\[ \dot{\mathbf{q}} = \mathbf{J}^T \left(\mathbf{J} \mathbf{J}^T + \lambda^2 \mathbf{I}\right)^{-1} \dot{\mathbf{x}} \]

where \(\lambda\) is the damping factor (typical: \(\lambda \in [0.01, 0.5]\)).

Singular Value Decomposition (SVD) form — writing \(\mathbf{J} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T\), the DLS solution filters each singular value \(\sigma_i\):

\[ \dot{\mathbf{q}} = \sum_i \frac{\sigma_i}{\sigma_i^2 + \lambda^2}\, \mathbf{v}_i \mathbf{u}_i^T \dot{\mathbf{x}} \;+\; \left(\mathbf{I} - \mathbf{J}^{\dagger}\mathbf{J}\right) \dot{\mathbf{q}}_{\text{null}} \]

where the second, optional null-space term injects a secondary objective \(\dot{\mathbf{q}}_{\text{null}}\) without disturbing the end-effector motion.
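The filtered form is numerically equivalent to the matrix-inverse expression above; a short sketch:

import numpy as np


def dls_step(J: np.ndarray, x_err: np.ndarray, lam: float = 0.1) -> np.ndarray:
    """One DLS update via SVD: each singular value sigma is filtered to
    sigma / (sigma^2 + lambda^2), so near-singular directions are
    attenuated instead of amplified."""
    U, S, Vt = np.linalg.svd(J, full_matrices=False)
    return Vt.T @ ((S / (S**2 + lam**2)) * (U.T @ x_err))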

3.4 Complete Python Code — Numerical IK

"""
Tier 1: Numerical IK — Damped Least Squares (DLS)
=================================================
Applies to any serial manipulator with known DH parameters or FK function.
"""

import numpy as np


def skew(v: np.ndarray) -> np.ndarray:
    """Skew-symmetric matrix from 3-vector."""
    return np.array([
        [0,    -v[2],  v[1]],
        [v[2],   0,   -v[0]],
        [-v[1], v[0],   0]
    ])


def rot_x(angle: float) -> np.ndarray:
    """Rotation about X-axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])


def rot_z(angle: float) -> np.ndarray:
    """Rotation about Z-axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])


def dh_transform(a: float, alpha: float, d: float, theta: float) -> np.ndarray:
    """Standard DH transform: rotate theta about Z, translate d along Z,
    translate a along X, rotate alpha about X."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st*ca,  st*sa, a*ct],
        [st,  ct*ca, -ct*sa, a*st],
        [0,   sa,     ca,    d],
        [0,   0,      0,     1]
    ])


# Simplified UR5-style DH parameters: (a, alpha, d) per joint, in metres.
# Joint 2 carries a -pi/2 offset on theta; the last d lumps the two wrist
# link lengths (0.093 + 0.082).
DH_PARAMS = [
    (0,     -np.pi/2, 0.089),   # joint 1: shoulder yaw
    (0.425,  0,       0),       # joint 2: shoulder pitch
    (0.392,  0,       0),       # joint 3: elbow pitch
    (0,     -np.pi/2, 0.109),   # joint 4: wrist 1 roll
    (0,      np.pi/2, 0),       # joint 5: wrist 2 pitch
    (0,      0,       0.175),   # joint 6: wrist 3 roll
]
THETA_OFFSETS = [0, -np.pi/2, 0, 0, 0, 0]


def joint_to_transform(q: np.ndarray) -> np.ndarray:
    """
    FK for a 6R arm (simplified UR5-like geometry).

    Parameters
    ----------
    q : array of shape (6,)
        Joint angles in radians [q1..q6]

    Returns
    -------
    T : (4, 4) homogeneous transform of the end-effector
    """
    T = np.eye(4)
    for (a, alpha, d), offset, qi in zip(DH_PARAMS, THETA_OFFSETS, q):
        T = T @ dh_transform(a, alpha, d, qi + offset)
    return T


def geometric_jacobian(q: np.ndarray) -> np.ndarray:
    """
    Compute the 6×n geometric Jacobian J(q).
    Rows 0–2: linear velocity part; rows 3–5: angular velocity part.
    """
    n = len(q)
    J = np.zeros((6, n))
    ee_pos = joint_to_transform(q)[:3, 3]

    # Walk through the joints, accumulating transforms so that each joint's
    # rotation axis and origin are expressed in the world frame
    T = np.eye(4)
    for i in range(n):
        axis_world = T[:3, 2]                        # Z-axis of frame i-1
        J[:3, i] = np.cross(axis_world, ee_pos - T[:3, 3])
        J[3:, i] = axis_world

        a, alpha, d = DH_PARAMS[i]
        T = T @ dh_transform(a, alpha, d, q[i] + THETA_OFFSETS[i])
    return J


def numerical_ik(target_pose: np.ndarray, q_init: np.ndarray,
                  max_iter: int = 200, tol: float = 1e-4,
                  lambda_damp: float = 0.1) -> np.ndarray:
    """
    Damped Least Squares IK for 6R manipulator.

    Parameters
    ----------
    target_pose : (4, 4) homogeneous transform
    q_init      : initial joint angles
    max_iter    : max iterations
    tol         : convergence threshold (position error in metres)
    lambda_damp : damping factor λ

    Returns
    -------
    q : (6,) joint angles (radians)
    """
    q = q_init.copy()
    target_pos = target_pose[:3, 3]
    target_rot = target_pose[:3, :3]

    for iteration in range(max_iter):
        T = joint_to_transform(q)
        pos_err = target_pos - T[:3, 3]

        # Orientation error via axis-angle of R_error = R_desired @ R_current.T
        R_err = target_rot @ T[:3, :3].T
        trace_R = np.trace(R_err)
        trace_R = np.clip(trace_R, -1.0, 3.0)
        theta_err = np.arccos((trace_R - 1) / 2)
        if theta_err > 1e-6:
            axis = (1 / (2 * np.sin(theta_err))) * np.array([
                R_err[2, 1] - R_err[1, 2],
                R_err[0, 2] - R_err[2, 0],
                R_err[1, 0] - R_err[0, 1]
            ])
        else:
            axis = np.zeros(3)
        ori_err = theta_err * axis

        x_err = np.concatenate([pos_err, ori_err])

        if np.linalg.norm(x_err) < tol:
            print(f"Converged in {iteration} iterations")
            break

        # Geometric Jacobian
        J = geometric_jacobian(q)

        # DLS update
        lambda_sq = lambda_damp ** 2
        delta_q = J.T @ np.linalg.solve(J @ J.T + lambda_sq * np.eye(6), x_err)
        q = q + delta_q

        # Joint limits (UR5)
        q = np.clip(q, np.radians([-180, -128, -128, -266, -266, -266]),
                         np.radians([ 180,  128,  128,  266,  266,  266]))

    return q


def demo():
    """Test IK: move from home to a target pose."""
    # Target: position (0.4, 0.2, 0.3), rotation = identity
    target = np.eye(4)
    target[:3, 3] = [0.4, 0.2, 0.3]

    q_init = np.zeros(6)  # start from home position
    q_solution = numerical_ik(target, q_init, lambda_damp=0.15)

    # Verify
    T_verify = joint_to_transform(q_solution)
    print(f"Target pos : {target[:3, 3]}")
    print(f"Achieved pos: {T_verify[:3, 3]}")
    print(f"Solution q  : {np.degrees(q_solution)} deg")
    return q_solution


if __name__ == "__main__":
    demo()

4. Tier 2 — Intermediate: GraspNet-Style Deep Learning

4.1 Overview

GraspNet-style methods (e.g., GraspNet-1Billion) predict 6-DoF grasp poses (position + orientation + gripper width) from a single RGB-D image or point cloud. The pipeline:

  RGB-D image / point cloud
        ↓
  sample grasp candidates (tens of thousands)
  [x, y, z, roll, pitch, yaw, gripper_width]
        ↓
  grasp quality CNN scorer
  [Q(success) ∈ [0, 1] for each candidate]
        ↓
  non-maximum suppression → top-K grasps
        ↓
  execute (IK → trajectory → gripper close)
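The NMS step is not part of the code listing in Section 4.4; a minimal sketch, assuming the candidates arrive pre-sorted by descending score (the thresholds are illustrative):

import numpy as np


def grasp_nms(frames: list, trans_thresh: float = 0.02,
              angle_thresh: float = 0.5, top_k: int = 10) -> list:
    """Greedy NMS over 6-DoF grasps given as (4, 4) frames sorted by
    descending score. A candidate is suppressed if it is both close in
    position and similar in approach direction to an already-kept grasp."""
    kept = []
    for G in frames:
        pos, approach = G[:3, 3], G[:3, 0]
        duplicate = any(
            np.linalg.norm(pos - K[:3, 3]) < trans_thresh
            and np.arccos(np.clip(approach @ K[:3, 0], -1.0, 1.0)) < angle_thresh
            for K in kept)
        if not duplicate:
            kept.append(G)
        if len(kept) >= top_k:
            break
    return kept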

4.2 Grasp Representation — 6-DoF Pose

A grasp is fully specified by:

\[ \mathcal{G} = (\mathbf{p}, \mathbf{R}, w) \]

| Symbol | Meaning | Range |
|---|---|---|
| \(\mathbf{p} \in \mathbb{R}^3\) | Grasp position (contact midpoint) | World coordinates |
| \(\mathbf{R} \in SO(3)\) | Grasp orientation (gripper frame) | 3-DoF rotation |
| \(w\) | Gripper width (finger opening) | 0 – 0.08 m |

Gripper frame convention: \(x\)-axis = approach direction, \(z\)-axis = finger closing direction.

4.3 Grasp Quality Metrics

Before learning, we can evaluate grasps analytically:

Force Closure: A grasp achieves force closure if any external wrench can be counteracted by contact forces. Condition for 2-finger planar grasp:

\[ \mathbf{G} \mathbf{f} = -\mathbf{w}_{ext} \quad \text{(wrench balance)} \]

where \(\mathbf{G}\) is the Grasp Matrix and \(\mathbf{f}\) are contact forces.

Epsilon quality \(\epsilon \in [0, 1]\) (the Ferrari–Canny metric): the radius of the largest wrench-space ball, centered at the origin, that fits inside the grasp wrench space \(\mathcal{W}\) generated by unit contact forces:

\[ \epsilon = \min_{\|\mathbf{w}\| = 1} \; \max \left\{ k \geq 0 : k\,\mathbf{w} \in \mathcal{W} \right\}, \qquad \mathcal{W} = \left\{ \mathbf{G}\mathbf{f} : \|\mathbf{f}\| \leq 1 \right\} \]

Higher \(\epsilon\) = more robust grasp.

Wrench Space Volume: the volume of the grasp wrench space (GWS), approximated by the ellipsoid-volume proxy

\[ V_{GWS} \propto \sqrt{\det\left(\mathbf{G} \mathbf{G}^T\right)} \]
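A small numerical check of these quantities, using the 2-contact grasp matrix form from Section 8.4 (the contact locations are illustrative):

import numpy as np


def skew(v: np.ndarray) -> np.ndarray:
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])


def grasp_matrix(contacts: list) -> np.ndarray:
    """Grasp matrix for point contacts transmitting pure forces:
    each contact at position r contributes a 6x3 block [I; [r]x]."""
    return np.hstack([np.vstack([np.eye(3), skew(r)]) for r in contacts])


# Two antipodal contacts on a 5 cm cube, about the object centre
G = grasp_matrix([np.array([0.025, 0, 0]), np.array([-0.025, 0, 0])])
vol = np.sqrt(max(np.linalg.det(G @ G.T), 0.0))
print(G.shape, np.linalg.matrix_rank(G), f"{vol:.4f}")
# rank 5, volume 0: frictionless point contacts cannot resist torque
# about the line joining the contacts — friction cones fill that gap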

4.4 Complete Python Code — Grasp Candidate Sampling + Quality CNN

"""
Tier 2: GraspNet-Style Grasp Pose Prediction
============================================
1. Sample grasp candidates from point cloud
2. Score them with a CNN quality estimator
3. Output top-K ranked grasps
"""

import numpy as np
import torch
import torch.nn as nn


# ─── Grasp Representation ────────────────────────────────────────

class Grasp:
    """
    A 6-DoF grasp pose.

    frame: SE(3) transform where:
      - origin  = grasp contact midpoint
      - x-axis  = approach direction (toward object)
      - z-axis  = finger closing direction
    width: gripper opening (metres)
    """

    def __init__(self, frame: np.ndarray, width: float):
        # frame: (4, 4) homogeneous transform
        self.frame = frame
        self.width = width

    @property
    def position(self) -> np.ndarray:
        return self.frame[:3, 3]

    @property
    def approach(self) -> np.ndarray:
        """Unit vector along gripper x-axis (approach direction)."""
        return self.frame[:3, 0]

    def to_matrix(self) -> np.ndarray:
        """Flatten grasp to a 13-D feature: [x, y, z, r11..r33, width]."""
        return np.concatenate([self.position, self.frame[:3, :3].flatten(), [self.width]])


# ─── Grasp Candidate Sampling ────────────────────────────────────

def sample_antipodal_points(points: np.ndarray, normals: np.ndarray,
                             num_samples: int = 5000) -> list:
    """
    Sample grasp candidates by selecting antipodal point pairs.

    An antipodal pair has normals pointing roughly toward each other.
    The grasp center is midway between the two points.

    Parameters
    ----------
    points   : (N, 3) point cloud
    normals  : (N, 3) surface normals
    num_samples : max number of candidates to generate

    Returns
    -------
    candidates : list of Grasp objects
    """
    candidates = []

    # Keep only points with valid (non-degenerate) normals
    valid = np.where(np.linalg.norm(normals, axis=1) > 0.1)[0]
    points = points[valid]
    normals = normals[valid]

    for _ in range(num_samples):
        i = np.random.randint(0, len(points))
        j = np.random.randint(0, len(points))
        if i == j:
            continue

        p1, n1 = points[i], normals[i]
        p2, n2 = points[j], normals[j]

        # Distance between points
        dist = np.linalg.norm(p1 - p2)
        if dist < 0.01 or dist > 0.10:   # reject too close / too far
            continue

        # Antipodal check: normals should be roughly opposing
        dot_normals = np.dot(n1, n2)
        if dot_normals > -0.5:            # not antipodal enough
            continue

        # Grasp frame: centred at the midpoint; the fingers close along
        # the line joining the contact points
        center = (p1 + p2) / 2
        closing = (p2 - p1) / dist                # z-axis = finger closing

        # Approach direction: prefer top-down, projected orthogonal to the
        # closing axis. (n1 + n2 is near zero for antipodal pairs, so it
        # cannot define a stable approach direction.)
        down = np.array([0.0, 0.0, -1.0])
        approach = down - np.dot(down, closing) * closing
        if np.linalg.norm(approach) < 1e-4:       # closing axis ≈ vertical
            approach = np.array([1.0, 0.0, 0.0])
            approach -= np.dot(approach, closing) * closing
        approach /= np.linalg.norm(approach)
        binormal = np.cross(closing, approach)    # y = z × x (right-handed)

        frame = np.eye(4)
        frame[:3, 0] = approach    # x-axis: approach
        frame[:3, 1] = binormal    # y-axis: binormal
        frame[:3, 2] = closing     # z-axis: finger closing
        frame[:3, 3] = center

        candidates.append(Grasp(frame, width=dist))

    return candidates


# ─── Grasp Quality CNN ──────────────────────────────────────────

class GraspQualityCNN(nn.Module):
    """
    Predict grasp success probability from:
    - Point cloud crop around grasp (N × 3)
    - Grasp pose features (9-DoF)
    """

    def __init__(self, num_points: int = 256, pose_dim: int = 9,
                 hidden_dim: int = 256):
        super().__init__()
        self.num_points = num_points

        # Point-wise MLP (shared weights per point)
        self.point_encoder = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.ReLU(),
        )

        # Pose encoder
        self.pose_encoder = nn.Sequential(
            nn.Linear(pose_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
        )

        # Fusion + score head
        self.fusion = nn.Sequential(
            nn.Linear(256 + 256, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),   # score ∈ [0, 1]
        )

    def forward(self, points: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        """
        Parameters
        ----------
        points : (B, N, 3) point cloud crop
        pose   : (B, 13) grasp pose features

        Returns
        -------
        score  : (B, 1) success probability
        """
        B = points.size(0)

        # Encode point cloud
        pc_feat = self.point_encoder(points.permute(0, 2, 1))  # (B, 256, N)
        pc_feat = pc_feat.max(dim=2)[0]                        # (B, 256) — max pooling

        # Encode pose
        pose_feat = self.pose_encoder(pose)                     # (B, 256)

        # Fusion
        combined = torch.cat([pc_feat, pose_feat], dim=1)       # (B, 512)
        score = self.fusion(combined)                          # (B, 1)
        return score


def score_grasps(candidates: list, pc_crop: np.ndarray,
                 model: GraspQualityCNN, device: str = "cpu") -> list:
    """
    Score all grasp candidates with the CNN.

    Parameters
    ----------
    candidates : list of Grasp objects
    pc_crop    : (M, 3) point cloud of the scene (crop around candidates)
    model      : trained GraspQualityCNN

    Returns
    -------
    scored_grasps : list of (Grasp, score) sorted descending by score
    """
    model.eval()
    batch_size = 64
    scored = []

    with torch.no_grad():
        for i in range(0, len(candidates), batch_size):
            batch = candidates[i:i + batch_size]

            # Build pose feature tensor
            pose_batch = np.array([g.to_matrix() for g in batch]).astype(np.float32)
            pose_tensor = torch.from_numpy(pose_batch).to(device)

            # Subsample point cloud to fixed size
            pc_resampled = resample_pc(pc_crop, num_points=model.num_points)
            pc_tensor = torch.from_numpy(pc_resampled).unsqueeze(0).expand(len(batch), -1, -1)
            pc_tensor = pc_tensor.to(device)

            # Score
            scores = model(pc_tensor.float(), pose_tensor.float())
            for g, s in zip(batch, scores.cpu().numpy().flatten()):
                scored.append((g, float(s)))

    scored.sort(key=lambda x: x[1], reverse=True)
    return scored


def resample_pc(pc: np.ndarray, num_points: int) -> np.ndarray:
    """Randomly resample (or pad) a point cloud to fixed size."""
    N = len(pc)
    if N >= num_points:
        idx = np.random.choice(N, num_points, replace=False)
    else:
        idx = np.random.choice(N, num_points, replace=True)
    return pc[idx]


def demo_graspnet():
    """Simulated demo — in practice, load RealSense + trained model."""
    np.random.seed(42)

    # Simulated point cloud: a rough cylindrical shell (toy object)
    N = 1000
    angles = np.random.uniform(0, 2*np.pi, N)
    heights = np.random.uniform(-0.05, 0.05, N)
    radii = np.random.uniform(0.03, 0.05, N)
    points = np.stack([radii * np.cos(angles),
                        radii * np.sin(angles),
                        heights], axis=1)
    normals = points / np.linalg.norm(points, axis=1, keepdims=True)

    # Sample candidates
    candidates = sample_antipodal_points(points, normals, num_samples=200)
    print(f"Sampled {len(candidates)} grasp candidates")

    # Score with a random (untrained) CNN
    device = "cpu"
    model = GraspQualityCNN().to(device)
    scored = score_grasps(candidates, points, model, device)

    print("Top 5 grasps:")
    for g, score in scored[:5]:
        print(f"  score={score:.3f}  pos={g.position}  width={g.width:.4f}m")

    return scored


if __name__ == "__main__":
    demo_graspnet()
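
The demo above scores with an untrained network. A minimal training sketch for GraspQualityCNN, assuming you can label candidate grasps as success/failure (e.g., from simulation rollouts) — random tensors stand in for real data here:

import torch
import torch.nn as nn


def train_quality_cnn(model, loader, epochs=10, lr=1e-3, device="cpu"):
    """Binary cross-entropy training loop.
    loader yields (points (B, N, 3), pose (B, 13), label (B, 1)) batches."""
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCELoss()                     # the model already ends in Sigmoid
    for epoch in range(epochs):
        total = 0.0
        for points, pose, label in loader:
            points, pose, label = points.to(device), pose.to(device), label.to(device)
            opt.zero_grad()
            loss = bce(model(points, pose), label)
            loss.backward()
            opt.step()
            total += loss.item()
        print(f"epoch {epoch}: mean loss {total / len(loader):.4f}")


# Placeholder dataset — replace with labeled grasps
data = torch.utils.data.TensorDataset(
    torch.randn(512, 256, 3), torch.randn(512, 13),
    torch.randint(0, 2, (512, 1)).float())
train_quality_cnn(GraspQualityCNN(),
                  torch.utils.data.DataLoader(data, batch_size=64, shuffle=True))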

5. Tier 3 — Modern: RL-Based Grasping

5.1 Concept

Instead of separately planning a grasp pose, RL-based grasping directly learns a policy \(\pi_\theta(\mathbf{a} \mid \mathbf{o})\) that maps observations to gripper actions. The robot learns through trial and error which strategies lead to successful grasps.

  ┌──────────────────────────────────────────────────────────┐
  │                    RL Grasping Loop                      │
  │                                                          │
  │   Observation o_t                                        │
  │   [rgb, depth, proprioceptive state]                     │
  │         │                                                │
  │         ▼                                                │
  │   Policy π_θ (neural network)                            │
  │         │                                                │
  │         ▼                                                │
  │   Action a_t = [Δx, Δy, Δz, Δθ, open/close gripper]      │
  │         │                                                │
  │         ▼                                                │
  │   PyBullet / Isaac Gym physics simulation                │
  │         │                                                │
  │         ▼                                                │
  │   Reward r_t = +1 (success) / −0.01 (step penalty)       │
  │         │                                                │
  │         ▼                                                │
  │   Policy update (PPO / SAC)                              │
  └──────────────────────────────────────────────────────────┘

5.2 RL Formulation

State \(\mathbf{s}_t\):

  • End-effector pose \((x, y, z, \phi, \theta, \psi)\)
  • Gripper width \(w\)
  • Object pose(s) from simulation

Action \(\mathbf{a}_t\):

  • \(\Delta x, \Delta y, \Delta z\) — position delta (metres)
  • \(\Delta \phi\) — wrist roll delta (radians)
  • gripper_open — binary (0 = close, 1 = open)

Reward shaping:

| Event | Reward |
|---|---|
| Grasp success (object lifted) | \(r_{\text{success}} = +10\) |
| Object dropped after lift | \(r_{\text{drop}} = -5\) |
| Collision with table | \(r_{\text{coll}} = -2\) |
| Step penalty | \(r_{\text{step}} = -0.01\) |
| Approach bonus (close to object) | \(r_{\text{approach}} = +0.1 \cdot \Delta d\) |

5.3 Complete Python Code — PyBullet Grasping Environment + PPO

"""
Tier 3: RL-Based Grasping with PyBullet + PPO
=============================================
Custom Gymnasium environment simulating a 6-DoF arm grasping task.
Trains a PPO agent to approach, grasp, and lift a cube.
"""

import time

import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO
import pybullet as p
import pybullet_data


class GraspingEnv(gym.Env):
    """
    PyBullet-based grasping environment.

    Scene:
      - Plane (table surface)
      - 1 cube object (0.05 m side)
      - 6-DoF robot arm (simplified kinematic model)

    Observation: (12,)
      - EE position (3) + EE velocity (3)
      - Gripper width (1) + gripper velocity (1)
      - Object position relative to EE (3) + object velocity (1)

    Action: (5,)
      - dx, dy, dz (position delta in EE frame)
      - dtheta (wrist rotation)
      - gripper: 0=close, 1=open
    """

    metadata = {"render_modes": ["human"], "render_fps": 30}

    def __init__(self, render_mode=None, gui=True):
        super().__init__()
        self.render_mode = render_mode
        self.gui = gui

        # Observation: 12-dim
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(12,), dtype=np.float32)
        # Action: 5-dim
        self.action_space = spaces.Box(
            low=-1.0, high=1.0, shape=(5,), dtype=np.float32)

        # Physics client
        self.client_id = p.connect(p.GUI if gui else p.DIRECT)
        p.setAdditionalSearchPath(pybullet_data.getDataPath())
        p.setGravity(0, 0, -9.81, physicsClientId=self.client_id)

        # Simulation parameters
        self.dt = 0.02          # 50 Hz
        self.max_steps = 200    # 4-second episodes
        self.reach_threshold = 0.04   # grasp success distance (m)
        self.lift_height = 0.10       # object must rise this high

        self.robot_id = None
        self.object_id = None
        self.table_id = None
        self.step_count = 0

    # ─── Helpers ──────────────────────────────────────────────

    def _setup_scene(self):
        """Create table, object, and arm."""
        # Table
        self.table_id = p.loadURDF("plane.urdf", physicsClientId=self.client_id)
        table_visual = p.createVisualShape(
            p.GEOM_BOX, halfExtents=[0.3, 0.3, 0.01],
            rgbaColor=[0.6, 0.5, 0.4, 1], physicsClientId=self.client_id)
        p.createMultiBody(
            baseMass=0, baseVisualShapeIndex=table_visual,
            basePosition=[0.4, 0, 0.005], physicsClientId=self.client_id)

        # Cube object
        obj_collision = p.createCollisionShape(
            p.GEOM_BOX, halfExtents=[0.025]*3)
        obj_visual = p.createVisualShape(
            p.GEOM_BOX, halfExtents=[0.025]*3,
            rgbaColor=[0.8, 0.2, 0.2, 1])
        self.object_id = p.createMultiBody(
            baseMass=0.01, baseCollisionShapeIndex=obj_collision,
            baseVisualShapeIndex=obj_visual,
            basePosition=[0.4, 0.0, 0.03],
            physicsClientId=self.client_id)

        # Simplified arm: the end-effector is modelled kinematically
        # (no URDF); a full implementation would load a UR5/Panda URDF
        # here and command its joints instead
        self.ee_pos = np.array([0.3, 0.0, 0.35])
        self.ee_quat = [0, 0, 0, 1]  # xyzw (PyBullet convention): identity
        self.gripper_open = 1.0

    def _reset_robot(self):
        """Randomize object and reset arm to home."""
        # Randomize object position on table
        obj_x = np.random.uniform(0.30, 0.50)
        obj_y = np.random.uniform(-0.15, 0.15)
        p.resetBasePositionAndOrientation(
            self.object_id, [obj_x, obj_y, 0.03], [0, 0, 0, 1],
            physicsClientId=self.client_id)

        # Reset arm to home
        self.ee_pos = np.array([0.3, 0.0, 0.35])
        self.gripper_open = 1.0

    # ─── Gymnasium Interface ───────────────────────────────────

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        if self.object_id is None:   # build the scene once, then reuse it
            self._setup_scene()
        self._reset_robot()
        self.step_count = 0
        return self._get_obs(), {}

    def step(self, action):
        self.step_count += 1

        # ── Parse action ──
        dx = float(action[0]) * 0.05    # ±5 cm per step
        dy = float(action[1]) * 0.05
        dz = float(action[2]) * 0.03    # ±3 cm
        dtheta = float(action[3]) * 0.2  # ±0.2 rad
        gripper_cmd = float(action[4])   # 0=close, 1=open

        # Update gripper
        self.gripper_open = np.clip(self.gripper_open + (gripper_cmd - 0.5) * 0.3,
                                     0.0, 1.0)

        # ── Move EE (simplified kinematic model) ──
        # Approximation: the EE stays upright (quaternion ≈ [0, 0, 0, 1]),
        # so the EE-frame delta equals the world-frame delta
        self.ee_pos += np.array([dx, dy, dz])
        # Clamp to workspace
        self.ee_pos[0] = np.clip(self.ee_pos[0], 0.15, 0.65)
        self.ee_pos[1] = np.clip(self.ee_pos[1], -0.35, 0.35)
        self.ee_pos[2] = np.clip(self.ee_pos[2], 0.01, 0.60)

        # ── Simplified grasp model ──
        # If the gripper is closed near the object, carry the object
        # kinematically with the EE (a stand-in for simulated finger
        # contact; without this the object could never be lifted)
        obj_pos = np.array(p.getBasePositionAndOrientation(
            self.object_id, physicsClientId=self.client_id)[0])
        dist_ee_obj = np.linalg.norm(self.ee_pos - obj_pos)
        is_grasping = (dist_ee_obj < self.reach_threshold) and (self.gripper_open < 0.3)
        if is_grasping:
            p.resetBasePositionAndOrientation(
                self.object_id, self.ee_pos, [0, 0, 0, 1],
                physicsClientId=self.client_id)

        # ── Step physics ──
        p.stepSimulation(physicsClientId=self.client_id)
        if self.render_mode == "human":
            time.sleep(self.dt)

        # ── Compute reward ──
        obj_pos = np.array(p.getBasePositionAndOrientation(
            self.object_id, physicsClientId=self.client_id)[0])
        dist_ee_obj = np.linalg.norm(self.ee_pos - obj_pos)
        obj_lifted = obj_pos[2] > self.lift_height

        # Reward components
        reward = -0.01                                # step penalty
        reward += 0.05 * max(0, 0.10 - dist_ee_obj)   # approach bonus
        if is_grasping and obj_lifted:
            reward += 10.0                            # SUCCESS
        elif is_grasping:
            reward += 1.0                             # firm grasp
        if obj_pos[2] < 0.01:
            reward -= 2.0                             # object fell off table

        # Termination
        terminated = bool(obj_lifted and is_grasping)
        truncated = self.step_count >= self.max_steps

        return self._get_obs(), reward, terminated, truncated, {}

    def _get_obs(self) -> np.ndarray:
        obj_pos, _ = p.getBasePositionAndOrientation(
            self.object_id, physicsClientId=self.client_id)
        obj_pos = np.array(obj_pos)
        obj_vel = np.array(p.getBaseVelocity(
            self.object_id, physicsClientId=self.client_id)[0])

        rel_pos = obj_pos - self.ee_pos
        obs = np.concatenate([
            self.ee_pos,           # 3
            np.zeros(3),           # EE velocity (unknown → zero)
            np.array([self.gripper_open, 0.0]),  # gripper state
            rel_pos,               # 3
            [obj_vel[2]],          # object upward velocity
        ], dtype=np.float32)
        return np.clip(obs, -10, 10)

    def render(self):
        pass  # handled by PyBullet GUI


# ─── Training ─────────────────────────────────────────────────

def train_ppo(total_timesteps: int = 200_000, save_path: str = "ppo_grasping"):
    """Train a PPO agent on the grasping environment."""
    env = GraspingEnv(gui=False)

    model = PPO(
        policy="MlpPolicy",
        env=env,
        learning_rate=3e-4,
        n_steps=2048,
        batch_size=64,
        n_epochs=10,
        gamma=0.99,
        gae_lambda=0.95,
        clip_range=0.2,
        ent_coef=0.01,          # encourage exploration
        verbose=1,
        tensorboard_log="./tb_logs_grasping/",
    )

    print(f"[INFO] Training PPO for {total_timesteps} timesteps...")
    model.learn(total_timesteps=total_timesteps, progress_bar=True)
    model.save(save_path)
    print(f"[INFO] Model saved to {save_path}")
    return model


def evaluate(model_path: str = "ppo_grasping", episodes: int = 20):
    """Evaluate a trained model."""
    env = GraspingEnv(gui=True)
    model = PPO.load(model_path)

    successes = 0
    for ep in range(episodes):
        obs, _ = env.reset()
        done = False
        steps = 0

        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            steps += 1

        success = terminated   # episode ends on success
        successes += int(success)
        print(f"  Episode {ep+1}: {'✓ SUCCESS' if success else '✗ FAIL'} in {steps} steps")

    print(f"\nSuccess rate: {successes}/{episodes} = {successes/episodes*100:.1f}%")


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1 and sys.argv[1] == "eval":
        evaluate()
    else:
        train_ppo()

6. Step-by-Step Implementation Guide

Phase 1 — Robot Setup (All Tiers)

  1. Connect the arm: Flash firmware on the UR5/Panda, verify ros_control connectivity
  2. Load URDF: rosparam load robot_description → verify in RViz
  3. Home position: Define a safe home configuration for all joints
  4. Workspace calibration: Identify reachable workspace bounds
# Launch the UR5 MoveIt demo (RViz + move_group)
roslaunch ur5_moveit_config demo.launch

Phase 2 — Tier 1: Numerical IK

  1. Define DH parameters or load the URDF with kdl_kinematics_plugin
  2. Implement geometric_jacobian() using the URDF chain
  3. Run numerical_ik() from home position → random reachable targets
  4. Validate by comparing IK solution → FK forward pose
  5. Integrate with MoveIt: move_group.set_joint_value_target(q_solution)
# Use MoveIt's built-in IK (IKFast or KDL)
import sys
import moveit_commander
from moveit_commander import MoveGroupCommander

moveit_commander.roscpp_initialize(sys.argv)
mgc = MoveGroupCommander("manipulator")
mgc.set_pose_target(pose_target)   # geometry_msgs/Pose
# On Noetic, plan() returns (success, trajectory, planning_time, error_code)
success, plan, _, _ = mgc.plan()
if success:
    mgc.execute(plan, wait=True)

Phase 3 — Tier 2: GraspNet Pipeline

  1. Point cloud capture: Subscribe to /camera/depth/points (ROS) or use the RealSense SDK
  2. Downsample: VoxelGrid filter (leaf size 2–5 mm) — steps 2–4 are sketched below
  3. Pass-through filter: Crop to the workspace volume
  4. Estimate normals: Open3D's PointCloud.estimate_normals()
  5. Sample candidates: Run sample_antipodal_points()
  6. Score: Run CNN inference on batches of candidates
  7. Execute: Take the top-1 grasp → run IK → move the arm → close the gripper
  8. Loop: If the grasp fails, retry with the next-ranked candidate
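
A preprocessing sketch for steps 2–4 with Open3D (the file name and workspace bounds are placeholders):

import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("scene.pcd")            # or from RealSense

# 2. Voxel downsample (3 mm leaf)
pcd = pcd.voxel_down_sample(voxel_size=0.003)

# 3. Crop to the workspace volume in front of the robot
bounds = o3d.geometry.AxisAlignedBoundingBox(
    min_bound=[0.2, -0.3, 0.0], max_bound=[0.7, 0.3, 0.3])
pcd = pcd.crop(bounds)

# 4. Estimate normals (1 cm neighbourhood), oriented toward the camera
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.01, max_nn=30))
pcd.orient_normals_towards_camera_location(np.zeros(3))

points = np.asarray(pcd.points)
normals = np.asarray(pcd.normals)    # feed into sample_antipodal_points()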

Phase 4 — Tier 3: RL Training

  1. Start PyBullet simulation: Verify object spawning and arm dynamics
  2. Baseline RL: Train PPO for 200k timesteps (roughly 10–30 min on a desktop CPU)
  3. Hyperparameter sweep: Tune ent_coef, gamma, n_steps
  4. Domain randomization: Vary object size, mass, friction, and camera noise — see the sketch below
  5. Sim-to-real: Transfer policy to real robot; fine-tune with online data
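
A randomization sketch for GraspingEnv (the ranges are illustrative; call randomize_object() from _reset_robot()):

import numpy as np
import pybullet as p


def randomize_object(env):
    """Randomize the cube's dynamics between episodes. (Size changes
    would also require recreating the collision shape.)"""
    p.changeDynamics(env.object_id, -1,
                     mass=np.random.uniform(0.005, 0.05),        # 5–50 g
                     lateralFriction=np.random.uniform(0.3, 1.2),
                     spinningFriction=np.random.uniform(0.001, 0.01),
                     physicsClientId=env.client_id)


def noisy_obs(obs, sigma=0.005):
    """Gaussian observation noise, a crude stand-in for sensor error."""
    return (obs + np.random.normal(0.0, sigma, size=obs.shape)).astype(obs.dtype)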

7. Comparison of Approaches

| Criterion | Numerical IK (Tier 1) | GraspNet DL (Tier 2) | RL Grasping (Tier 3) |
|---|---|---|---|
| Grasp pose | Given (user specifies) | Predicted from point cloud | Learned end-to-end |
| Object perception | Known / pre-scanned scene | RGB-D / point cloud | Raw sensors |
| Training data | None | ~1B synthetic grasps | Trial and error |
| Generalization | Deterministic, no generalization | Strong (CNN features) | Needs sim-to-real transfer |
| Computation | ~100 Hz | ~10 Hz (GPU batch) | ~50 Hz (inference) |
| Unknown objects | ❌ No | ✅ Yes | ✅ Yes (after training) |
| Occlusion handling | ❌ No | ⚠️ Limited | ⚠️ Limited |
| Setup complexity | Low | Medium (GPU + dataset) | High (sim + training) |
| Best for | Pick-and-place of known objects | Cluttered scenes | Novel objects, complex dynamics |
| Difficulty | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |

8. Extensions and Variations

8.1 Multi-Object Grasping

  • Extend GraspNet to output grasp sequences (sorted by reachability)
  • Use a scene graph to reason about object relationships
  • Combine with task planning for "pick X and place in Y"

8.2 Impedance Control

  • After closing the gripper, switch to impedance mode to adapt to contact forces
  • Track a desired wrench with a PD correction: \(\tau = \mathbf{J}^T \mathbf{F}_d + \mathbf{K}_p \mathbf{e} + \mathbf{K}_d \dot{\mathbf{e}}\) — a sketch follows
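
A minimal sketch of that control law (the gains are illustrative; a real arm would add gravity compensation \(g(\mathbf{q})\)):

import numpy as np


def impedance_torque(J, F_d, e, e_dot, Kp, Kd):
    """Joint torques: feed-forward desired wrench plus PD correction."""
    return J.T @ F_d + Kp @ e + Kd @ e_dot


Kp = np.diag([50.0] * 6)   # stiffness gains
Kd = np.diag([5.0] * 6)    # damping gains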

8.3 Dexterous Manipulation

  • Replace parallel-jaw with Allegro / Shadow Hand
  • Use a contact model (Coulomb friction cones) for grasp stability analysis
  • Extend to in-hand manipulation: reorient object without re-grasping

8.4 Contact Model

Coulomb friction (point contact):

\[ F_t \leq \mu F_n \]

where \(F_t\) is tangential force, \(F_n\) is normal force, \(\mu\) is friction coefficient.

Grasp matrix for \(k\) point contacts at positions \(\mathbf{r}_i\) (shown for a 2-finger grasp, \(k = 2\)):

\[ \mathbf{G} = \begin{bmatrix} \mathbf{I}_3 & \mathbf{I}_3 \\ [\mathbf{r}_1]_\times & [\mathbf{r}_2]_\times \end{bmatrix} \in \mathbb{R}^{6 \times 3k} \]

8.5 Sim-to-Real Transfer

| Issue | Sim Solution | Real Adjustment |
|---|---|---|
| Camera noise | Add Gaussian noise to depth | Retune with real data |
| Dynamics mismatch | Match URDF inertials | System identification |
| Gripper offset | Calibrate with a checkerboard | Online calibration |
| Object properties | Randomize friction, mass | Estimate from F/T sensor |

9. References