Mobile Manipulation¶
Project Type: Manipulation + Navigation | Difficulty: ★★★☆☆ to ★★★★★ (approach-dependent) | Estimated Time: 2–4 weekends
1. Project Overview¶
Mobile manipulation combines mobility (navigating a mobile base) with manipulation (controlling a robotic arm) to enable a robot to approach a scene, locate an object, and act on it — picking it up, placing it, opening a drawer, or pressing a button. This is one of the canonical problems in service robotics.
┌──────────────────────────────────────────────────────────────────┐
│ Mobile Manipulation Pipeline (Top View) │
│ │
│ ┌──────────┐ │
│ │ Scene / │ │
│ │ Objects │ │
│ └────┬─────┘ │
│ │ │
│ ▼ (perceive) │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ Grounding │──►│ Reachability │ │
│ │ (detect / │ │ Analysis │ │
│ │ localize) │ │ (which approach │ │
│ └──────────────┘ │ pose is valid?) │ │
│ └────────┬─────────┘ │
│ │ │
│ ▼ (plan) │
│ ┌───────────────────────────────────┐ │
│ │ Whole-Body Planner │ │
│ │ (base pose + arm joint targets) │ │
│ └──────────────┬────────────────────┘ │
│ │ │
│ ┌──────────────┴────────────────────┐ │
│ │ │ │
│ ▼ (execute) ▼ (execute) │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Mobile Base │ │ Arm + Gripper│ │
│ │ Controller │ │ Controller │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ▼ ▼ │
│ Move to pose Grasp / Manipulate │
│ │
│ Robot = [Mobile Base] + [Arm] + [End-Effector / Gripper] │
└──────────────────────────────────────────────────────────────────┘
In this project you will explore three progressively more sophisticated approaches:
| Tier | Approach | Key Technique | ROS Package |
|---|---|---|---|
| Tier 1 | Decoupled | Navigate to approach pose, then IK arm | move_base + moveit_ros |
| Tier 2 | Whole-Body | Joint optimization of base + arm | moveit_ros (whole-body) + OMPL |
| Tier 3 | Foundation Models | VLA policy (RT-2 / Octo / π0) | ros2_manipulation + inference server |
2. Learning Objectives¶
- Understand the mobile manipulator kinematics model (base + arm)
- Implement decoupled navigation + inverse kinematics (IK) control
- Perform whole-body manipulability analysis and joint optimization
- Configure MoveIt 2 for a mobile manipulator in ROS 2
- Integrate object detection with grasp planning
- Understand Vision-Language-Action (VLA) foundation models for manipulation
3. Hardware & Software Requirements¶
Hardware¶
| Component | Specification | Notes |
|---|---|---|
| Mobile manipulator | Stretch RE-1, Fetch, or custom (e.g., LoCoArm + TurtleBot) | Any mobile base + arm combination |
| Depth camera | Intel RealSense D435i / Azure Kinect | For object detection & point clouds |
| LiDAR (optional) | 2D LiDAR for navigation (e.g., RPLIDAR A1) | Only needed for Tier 1–2 |
| Onboard computer | NVIDIA Jetson Orin / x86 with ROS 2 | GPU recommended for Tier 3 |
| Gripper | Parallel-jaw (Robotiq 2F-85) or custom | Actuated gripper |
| Battery | Sufficient for 2+ hours of operation | |
Software¶
| Package | Version | Purpose |
|---|---|---|
| ROS 2 | Humble / Iron | Core framework |
| MoveIt 2 | Humble | Whole-body motion planning |
| Navigation 2 | Humble | Mobile base navigation |
| OpenCV | ≥ 4.5 | Image processing |
| PyTorch | ≥ 1.13 | Neural network inference (Tier 3) |
| Grasp Pose Detection | GraspNet / Contact-GraspNet | Grasp planning |
| OMPL | (via MoveIt) | Sampling-based motion planning |
| Isaac Sim / Gazebo | Latest | Simulation |
# Install core dependencies (ROS 2 Humble)
sudo apt install ros-humble-moveit ros-humble-navigation2 ros-humble-ros2-control
pip install opencv-python numpy torch torchvision open3d
4. Tier 1 — Traditional: Decoupled Navigation + IK Control¶
4.1 Concept¶
The simplest approach is to decouple the mobile base and the arm:
- Navigate the mobile base to an approach pose near the target object
- Lock the base, then solve inverse kinematics (IK) for the arm to reach the object
- Execute the grasp
This approach is easy to implement and debug, but it cannot adapt the base pose to improve arm reachability — the base might arrive at a location where the arm cannot comfortably reach the target.
Step 1: Navigate Step 2: Lock base, solve IK
┌──────────────────┐ ┌────────────────────────┐
│ │ │ │
│ [BASE]──► │ │ [BASE] [ARM reaches] │
│ ↓ │ │ ↖ ───► [OBJ] │
│ approach pose │ │ IK solution │
└──────────────────┘ └────────────────────────┘
4.2 Kinematics Model¶
For a decoupled system, the forward kinematics of the arm maps joint positions to the end-effector pose:

\[
\mathbf{x}_{ee} = f(\mathbf{q}_{arm})
\]

where \(\mathbf{q}_{arm} \in \mathbb{R}^n\) are the arm joint positions and \(\mathbf{x}_{ee} = [p_x, p_y, p_z, \phi, \theta, \psi]\) is the end-effector pose (position + orientation).
The inverse kinematics solves the reverse problem:

\[
\mathbf{q}_{arm} = f^{-1}(\mathbf{x}_{ee}^{des})
\]

For a 6-DOF arm, we typically use a numerical IK solver (e.g., TRAC-IK or KDL):
def compute_ik(robot_model, q_init, target_pose, timeout=1.0):
    """
    Solve IK for the arm to reach a desired end-effector pose.

    Parameters
    ----------
    robot_model : robot_model_loader.RobotModelLoader
        Loaded robot model (URDF + SRDF)
    q_init : list[float]
        Initial joint configuration guess (seed for the IK service;
        unused in this MoveGroup-based sketch)
    target_pose : geometry_msgs/Pose
        Desired end-effector pose in the base frame
    timeout : float
        Solver timeout in seconds

    Returns
    -------
    list[float] or None
        Solved joint configuration, or None if no solution was found
    """
    # Assumes a Python MoveGroupInterface binding backed by a TRAC-IK or
    # KDL kinematics plugin (selected in kinematics.yaml)
    group_name = "arm_group"
    move_group = MoveGroupInterface(group_name, "robot_description")
    move_group.setPoseTarget(target_pose)
    move_group.setPlanningTime(timeout)
    move_group.setMaxVelocityScalingFactor(0.5)
    move_group.setMaxAccelerationScalingFactor(0.5)
    plan = move_group.plan()
    if plan[0]:  # success flag
        return plan[1].joint_trajectory.points[-1].positions
    return None
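To build intuition for what such a numerical solver does internally, here is a library-free damped-least-squares IK loop for a hypothetical planar 2-link arm (the link lengths are made up for illustration; this is the core iteration, not the TRAC-IK or KDL implementation):

```python
import numpy as np

def fk_2link(q, l1=0.4, l2=0.3):
    """Forward kinematics of a planar 2-link arm: joint angles -> (x, y)."""
    return np.array([l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
                     l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])])

def jac_2link(q, l1=0.4, l2=0.3):
    """Analytic 2x2 Jacobian of the planar arm."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def ik_dls(target, q_init, damping=0.05, max_iters=200, tol=1e-4):
    """Damped least squares: dq = J^T (J J^T + lambda^2 I)^-1 * error."""
    q = np.array(q_init, dtype=float)
    for _ in range(max_iters):
        err = target - fk_2link(q)
        if np.linalg.norm(err) < tol:
            return q
        J = jac_2link(q)
        q = q + J.T @ np.linalg.solve(J @ J.T + damping ** 2 * np.eye(2), err)
    return None  # no solution within the iteration budget

q_sol = ik_dls(np.array([0.5, 0.2]), q_init=[0.3, 0.3])
```

The damping term keeps the update well-conditioned near singularities, which is why solvers of this family behave more gracefully than a plain pseudoinverse.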
4.3 Approach Pose Selection¶
Before navigating, we must compute a valid approach pose for the base:
"""
Tier 1: Decoupled mobile manipulation
======================================
Navigate base to approach pose, then use IK to reach the object.
"""
import numpy as np
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import PoseStamped, Pose
from nav2_msgs.action import NavigateToPose
# NOTE: assumes a Python MoveGroupInterface / PlanningSceneInterface
# binding for MoveIt 2 (e.g., moveit_py or a project-specific wrapper)
from moveitInterfaces import MoveGroupInterface, PlanningSceneInterface
from rclpy.action import ActionClient
class DecoupledMobileManipulator(Node):
"""
Decoupled approach: navigate base to approach pose,
then use IK to control the arm.
"""
def __init__(self):
super().__init__('decoupled_mobile_manipulator')
# ── MoveIt interfaces ──────────────────────────────────────
self.arm_group = MoveGroupInterface("arm_group", "robot_description")
self.gripper_group = MoveGroupInterface("gripper_group", "robot_description")
self.scene = PlanningSceneInterface()
# ── Navigation action client ────────────────────────────────
self.nav_client = ActionClient(self, NavigateToPose, 'navigate_to_pose')
self.get_logger().info("Decoupled Mobile Manipulator ready")
def navigate_to(self, x: float, y: float, yaw: float, timeout: float = 30.0):
"""
Send the mobile base to (x, y, yaw) using Navigation 2.
Parameters
----------
x, y : float — Goal position in map frame
yaw : float — Goal orientation in radians
timeout : float — Maximum wait time in seconds
"""
from tf_transformations import quaternion_from_euler
goal = NavigateToPose.Goal()
goal.pose.header.frame_id = "map"
goal.pose.header.stamp = self.get_clock().now().to_msg()
goal.pose.pose.position.x = x
goal.pose.pose.position.y = y
goal.pose.pose.position.z = 0.0
q = quaternion_from_euler(0, 0, yaw)
goal.pose.pose.orientation.x = q[0]
goal.pose.pose.orientation.y = q[1]
goal.pose.pose.orientation.z = q[2]
goal.pose.pose.orientation.w = q[3]
self.get_logger().info(f"Navigating to ({x:.2f}, {y:.2f})...")
        self.nav_client.wait_for_server()
        send_goal = self.nav_client.send_goal_async(goal)
        rclpy.spin_until_future_complete(self, send_goal)
        goal_handle = send_goal.result()
        if goal_handle is None or not goal_handle.accepted:
            self.get_logger().warn("Navigation goal rejected")
            return False
        # Wait for the navigation result, up to `timeout` seconds
        result_future = goal_handle.get_result_async()
        rclpy.spin_until_future_complete(self, result_future, timeout_sec=timeout)
        if result_future.done():
            self.get_logger().info("Navigation succeeded!")
            return True
        self.get_logger().warn("Navigation failed or timed out")
        return False
def pick_object(self, object_pose: PoseStamped, approach_height: float = 0.15):
"""
Pick an object using IK-based arm control.
The sequence: pre-grasp → grasp → lift.
Parameters
----------
object_pose : PoseStamped
Position of the object in the base frame
approach_height : float
How high above the object to approach before descending
"""
# ── 1. Compute approach pose (above the object) ─────────────
        pregrasp_pose = Pose()
        pregrasp_pose.position.x = object_pose.pose.position.x
        pregrasp_pose.position.y = object_pose.pose.position.y
        pregrasp_pose.position.z = object_pose.pose.position.z + approach_height  # go up first
        pregrasp_pose.orientation = object_pose.pose.orientation
# ── 2. Pre-grasp: move arm to approach pose via IK ──────────
self.arm_group.setPoseTarget(pregrasp_pose)
success = self.arm_group.go(wait=True)
self.arm_group.stop()
self.arm_group.clearPoseTargets()
if not success:
self.get_logger().error("Pre-grasp IK failed")
return False
# ── 3. Descend to grasp pose ─────────────────────────────────
        grasp_pose = Pose()
        grasp_pose.position.x = object_pose.pose.position.x
        grasp_pose.position.y = object_pose.pose.position.y
        grasp_pose.position.z = object_pose.pose.position.z
        grasp_pose.orientation = object_pose.pose.orientation
        # Linear interpolation descent (10 steps)
        for t in np.linspace(0.0, 1.0, 10):
            intermediate = Pose()
            intermediate.position.x = grasp_pose.position.x
            intermediate.position.y = grasp_pose.position.y
            intermediate.position.z = grasp_pose.position.z + (1 - t) * approach_height
            intermediate.orientation = grasp_pose.orientation
            self.arm_group.setPoseTarget(intermediate)
            self.arm_group.go(wait=True)
        self.arm_group.stop()
        self.arm_group.clearPoseTargets()
# ── 4. Close gripper ─────────────────────────────────────────
self.gripper_group.setNamedTarget("close")
self.gripper_group.go(wait=True)
self.gripper_group.stop()
        # ── 5. Lift ──────────────────────────────────────────────────
        lift_pose = Pose()  # build a fresh pose; don't mutate grasp_pose
        lift_pose.position.x = grasp_pose.position.x
        lift_pose.position.y = grasp_pose.position.y
        lift_pose.position.z = grasp_pose.position.z + 0.2
        lift_pose.orientation = grasp_pose.orientation
        self.arm_group.setPoseTarget(lift_pose)
        self.arm_group.go(wait=True)
        self.arm_group.stop()
        self.arm_group.clearPoseTargets()
self.get_logger().info("Pick completed!")
return True
def place_object(self, place_pose: PoseStamped):
"""
Place the object at the desired location.
"""
        # Descend to the place pose in small linear steps
        for t in np.linspace(0.0, 1.0, 10):
            intermediate = Pose()
            intermediate.position.x = place_pose.pose.position.x
            intermediate.position.y = place_pose.pose.position.y
            intermediate.position.z = place_pose.pose.position.z + (1 - t) * 0.2
            intermediate.orientation = place_pose.pose.orientation
            self.arm_group.setPoseTarget(intermediate)
            self.arm_group.go(wait=True)
        self.arm_group.stop()
        self.arm_group.clearPoseTargets()
# Open gripper
self.gripper_group.setNamedTarget("open")
self.gripper_group.go(wait=True)
self.gripper_group.stop()
# Retract
        retreat_pose = Pose()
        retreat_pose.position.x = place_pose.pose.position.x
        retreat_pose.position.y = place_pose.pose.position.y
        retreat_pose.position.z = place_pose.pose.position.z + 0.3
        retreat_pose.orientation = place_pose.pose.orientation
self.arm_group.setPoseTarget(retreat_pose)
self.arm_group.go(wait=True)
self.arm_group.stop()
self.arm_group.clearPoseTargets()
def main():
rclpy.init()
node = DecoupledMobileManipulator()
# ── Demo sequence ────────────────────────────────────────────────
# Step 1: Navigate to a pre-defined approach pose
success = node.navigate_to(x=2.0, y=1.5, yaw=np.pi / 2)
if not success:
node.get_logger().error("Could not reach approach pose!")
rclpy.shutdown()
return
# Step 2: Pick object (object pose from perception system)
object_pose = PoseStamped()
object_pose.header.frame_id = "base_link"
object_pose.pose.position.x = 0.4
object_pose.pose.position.y = 0.0
object_pose.pose.position.z = 0.6
object_pose.pose.orientation.w = 1.0
node.pick_object(object_pose)
# Step 3: Navigate to place location
node.navigate_to(x=3.0, y=1.5, yaw=np.pi / 2)
node.place_object(object_pose)
node.get_logger().info("Demo complete!")
rclpy.shutdown()
if __name__ == "__main__":
main()
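The demo above hard-codes its approach pose. A simple way to derive one (a hypothetical helper, not part of Nav2) is to stop a fixed standoff distance short of the object along the line from the robot's current position, facing the object; the standoff value must be tuned to your arm's reach:

```python
import numpy as np

def approach_pose(base_xy, obj_xy, standoff=0.6):
    """
    Compute a base goal (x, y, yaw): `standoff` meters short of the
    object along the current base->object line, oriented to face it.
    """
    base_xy = np.asarray(base_xy, dtype=float)
    obj_xy = np.asarray(obj_xy, dtype=float)
    d = obj_xy - base_xy
    dist = np.linalg.norm(d)
    if dist < standoff:
        raise ValueError("base is already within the standoff distance")
    u = d / dist                   # unit vector toward the object
    goal = obj_xy - standoff * u   # stop short of the object
    yaw = np.arctan2(u[1], u[0])   # face the object
    return goal[0], goal[1], yaw

x, y, yaw = approach_pose([0.0, 0.0], [2.0, 0.0], standoff=0.5)
```

The returned tuple can be fed directly to navigate_to(x, y, yaw). Approach-direction selection (e.g., avoiding approaching across a table edge) is the part this sketch leaves out.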
4.4 Limitations of the Decoupled Approach¶
| Issue | Impact |
|---|---|
| Fixed base pose | Arm may reach the object but in a singularity or near joint limit |
| No base pose optimization | Suboptimal grasp direction (wrist may be inverted) |
| No base-to-environment collision reasoning | Base might stop where it blocks the arm's workspace |
| Sequential failures | A failure in either stage wastes the work completed in the other |
5. Tier 2 — Intermediate: Whole-Body Planning¶
5.1 Concept¶
Whole-body planning jointly optimizes the mobile base pose and the arm joint configuration to simultaneously satisfy:
- Reachability — the end-effector must reach the object
- Manipulability — the arm should be in a dexterous configuration (not near singularities)
- Collision-free — the entire robot body avoids obstacles
- Smooth motion — minimize joint effort / travel distance
Decoupled: Whole-Body:
┌──────────────┐ ┌──────────────────────────┐
│ Base → pose │ │ Optimize base + arm │
│ Arm → IK │ │ jointly: │
│ (two steps) │ │ │
└──────────────┘ │ min ||q - q_init||² │
│ s.t. ||x_ee - x_obj|| < ε │
│ J(q) q̇ = ẋ │
└──────────────────────────┘
5.2 Whole-Body Kinematics¶
The combined robot state is:

\[
\mathbf{q} = \begin{bmatrix} \mathbf{q}_{base} \\ \mathbf{q}_{arm} \end{bmatrix} \in \mathbb{R}^{3+n}
\]

where \(\mathbf{q}_{base} = [x, y, \theta]\) for a differential-drive base and \(\mathbf{q}_{arm} \in \mathbb{R}^n\) for the arm joints.
The whole-body Jacobian maps joint velocities to end-effector velocity:

\[
\dot{\mathbf{x}}_{ee} = J_{wb}(\mathbf{q})\,\dot{\mathbf{q}}
\]

where \(J_{wb} \in \mathbb{R}^{6 \times (3+n)}\) is the whole-body Jacobian, formed by stacking the base and arm contributions side by side:

\[
J_{wb} = \begin{bmatrix} J_{base} & J_{arm} \end{bmatrix}
\]

The manipulability measure (Yoshikawa) for the arm:

\[
w(\mathbf{q}_{arm}) = \sqrt{\det\left( J_{arm} J_{arm}^{T} \right)}
\]

Maximizing \(w\) keeps the arm away from singularities.
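To make the whole-body idea concrete, here is a toy numpy-only sketch: a planar base \((x, y, \theta)\) plus a 2-link arm (made-up link lengths), a finite-difference whole-body Jacobian, and a damped-least-squares step that moves base and arm together toward the target; the Yoshikawa measure is evaluated on the same Jacobian. This illustrates the math only, not the MoveIt/OMPL planner:

```python
import numpy as np

L1, L2 = 0.4, 0.3  # hypothetical link lengths

def wb_fk(q):
    """Planar whole-body FK: q = [x, y, theta, q1, q2] -> EE position."""
    x, y, th, q1, q2 = q
    a1, a2 = th + q1, th + q1 + q2
    return np.array([x + L1 * np.cos(a1) + L2 * np.cos(a2),
                     y + L1 * np.sin(a1) + L2 * np.sin(a2)])

def wb_jacobian(q, eps=1e-6):
    """2x5 whole-body Jacobian d(EE)/dq by central finite differences."""
    J = np.zeros((2, 5))
    for i in range(5):
        dq = np.zeros(5)
        dq[i] = eps
        J[:, i] = (wb_fk(q + dq) - wb_fk(q - dq)) / (2 * eps)
    return J

def manipulability(J):
    """Yoshikawa measure w = sqrt(det(J J^T))."""
    return float(np.sqrt(max(np.linalg.det(J @ J.T), 0.0)))

def wb_ik_step(q, target, damping=0.05, max_step=0.1):
    """One damped-least-squares step: dq = J^T (J J^T + lambda^2 I)^-1 e."""
    err = np.clip(target - wb_fk(q), -max_step, max_step)  # limit step size
    J = wb_jacobian(q)
    JJt = J @ J.T + damping ** 2 * np.eye(2)
    return q + J.T @ np.linalg.solve(JJt, err)

q = np.array([0.0, 0.0, 0.0, 0.4, 0.6])   # base at origin, arm bent
target = np.array([1.2, 0.5])
for _ in range(100):
    q = wb_ik_step(q, target)
```

Because the base translation columns of \(J_{wb}\) never vanish, the combined system stays well-conditioned even when the arm alone would be near a singularity, which is exactly the redundancy whole-body planners exploit.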
5.3 MoveIt 2 Whole-Body Configuration¶
<!-- launch/whole_body_planning.launch.xml (ROS 2 XML launch) -->
<launch>
  <!-- Load robot description -->
  <node pkg="robot_state_publisher" exec="robot_state_publisher"
        name="robot_state_publisher" output="screen">
    <param name="robot_description"
           value="$(command 'cat $(find-pkg-share my_mobile_manipulator)/urdf/robot.urdf')"/>
  </node>
  <!-- MoveIt configuration -->
  <node pkg="moveit_ros_move_group" exec="move_group"
        name="move_group" output="screen">
    <!-- Load SRDF with kinematic chains -->
    <param name="robot_description"
           value="$(command 'cat $(find-pkg-share my_mobile_manipulator)/urdf/robot.urdf')"/>
    <param from="$(find-pkg-share my_mobile_manipulator)/config/ompl_planning.yaml"/>
    <param from="$(find-pkg-share my_mobile_manipulator)/config/kinematics.yaml"/>
    <param name="robot_description_semantic"
           value="$(command 'cat $(find-pkg-share my_mobile_manipulator)/config/robot.srdf')"/>
  </node>
</launch>
SRDF configuration — define the whole-body planning group:
<!-- config/robot.srdf -->
<robot name="mobile_manipulator">
<!-- Whole-body group: base + arm -->
<group name="whole_body">
<chain base_link="base_footprint" tip_link="gripper_link"/>
</group>
  <!-- Sub-groups for specialized control -->
  <group name="arm_group">
    <chain base_link="shoulder_link" tip_link="gripper_link"/>
  </group>
<group name="base_group">
<joint name="x_joint"/>
<joint name="y_joint"/>
<joint name="theta_joint"/>
</group>
<!-- Named home configuration -->
<group_state name="home" group="whole_body">
<joint name="x_joint" value="0"/>
<joint name="y_joint" value="0"/>
<joint name="theta_joint" value="0"/>
<joint name="shoulder_pan_joint" value="0"/>
<joint name="shoulder_lift_joint" value="-1.57"/>
<joint name="elbow_joint" value="1.57"/>
<joint name="wrist_1_joint" value="0"/>
<joint name="wrist_2_joint" value="-1.57"/>
<joint name="wrist_3_joint" value="0"/>
</group_state>
</robot>
5.4 Whole-Body Python Planning¶
"""
Tier 2: Whole-Body Mobile Manipulation with MoveIt 2
======================================================
Jointly optimize base pose + arm configuration using
the MoveGroupInterface with a whole-body group.
"""
import numpy as np
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Pose, PoseStamped
# NOTE: assumes a Python MoveGroupInterface / PlanningSceneInterface
# binding for MoveIt 2 (e.g., moveit_py or a project-specific wrapper)
from moveitInterfaces import MoveGroupInterface, PlanningSceneInterface
from moveit_msgs.msg import Constraints, JointConstraint, PositionConstraint
class WholeBodyMobileManipulator(Node):
"""
Whole-body manipulation using MoveIt 2.
The key advantage over Tier 1: the planner considers both the
base pose and the arm joints together, optimizing manipulability
and avoiding collisions across the entire robot body.
"""
def __init__(self):
super().__init__('whole_body_mobile_manipulator')
# ── Whole-body planning group ─────────────────────────────────
# Includes: base (x, y, theta) + arm joints
self.whole_body_group = MoveGroupInterface(
"whole_body", "robot_description"
)
self.arm_group = MoveGroupInterface("arm_group", "robot_description")
self.scene = PlanningSceneInterface()
# ── Configure planner ─────────────────────────────────────────
self.whole_body_group.setNumPlanningAttempts(10)
self.whole_body_group.setPlanningTime(5.0)
self.whole_body_group.setMaxVelocityScalingFactor(0.6)
self.whole_body_group.setMaxAccelerationScalingFactor(0.6)
self.whole_body_group.setPlannerId("RRTstar")
# ── Add table as collision object ────────────────────────────
self._add_collision_objects()
self.get_logger().info("Whole-body Mobile Manipulator ready")
def _add_collision_objects(self):
"""Add environment obstacles to the planning scene."""
from shape_msgs.msg import SolidPrimitive
# Table
table_pose = PoseStamped()
table_pose.header.frame_id = "base_link"
table_pose.pose.position.x = 0.8
table_pose.pose.position.y = 0.0
table_pose.pose.position.z = -0.3
self.scene.addBox("table", 0.6, 0.6, 0.74, table_pose)
# Shelf
shelf_pose = PoseStamped()
shelf_pose.header.frame_id = "base_link"
shelf_pose.pose.position.x = 0.8
shelf_pose.pose.position.y = 0.3
shelf_pose.pose.position.z = 0.3
self.scene.addBox("shelf", 0.4, 0.3, 0.05, shelf_pose)
self.get_logger().info("Collision objects added")
def whole_body_pick(self, object_position: np.ndarray,
approach_direction: np.ndarray = np.array([0, 0, -1]),
approach_height: float = 0.15):
"""
Pick an object using whole-body planning.
Unlike Tier 1, this plans the base pose AND arm joints together,
considering manipulability and avoiding arm/base self-collisions.
Parameters
----------
object_position : ndarray [3]
Object position in world frame (x, y, z)
approach_direction : ndarray [3]
Unit vector for approach direction (default: top-down)
approach_height : float
Approach distance above the object
"""
# ── 1. Compute approach pose ──────────────────────────────────
approach_pose = PoseStamped()
approach_pose.header.frame_id = "base_link"
approach_pose.header.stamp = self.get_clock().now().to_msg()
# Start above the object
approach_pose.pose.position.x = object_position[0]
approach_pose.pose.position.y = object_position[1]
approach_pose.pose.position.z = object_position[2] + approach_height
# Approach direction determines gripper orientation
        from tf_transformations import quaternion_from_matrix
        # Build a rotation matrix whose Z-axis (gripper approach axis)
        # is aligned with the approach direction
        z = approach_direction / np.linalg.norm(approach_direction)
        x_temp = np.array([1, 0, 0]) if abs(z[2]) < 0.99 else np.array([0, 1, 0])
        x = x_temp - np.dot(x_temp, z) * z
        x = x / np.linalg.norm(x)
        y = np.cross(z, x)
        R = np.column_stack([x, y, z])
        # quaternion_from_matrix expects a 4x4 homogeneous transform
        T = np.eye(4)
        T[:3, :3] = R
        q = quaternion_from_matrix(T)
approach_pose.pose.orientation.x = q[0]
approach_pose.pose.orientation.y = q[1]
approach_pose.pose.orientation.z = q[2]
approach_pose.pose.orientation.w = q[3]
# ── 2. Set whole-body pose target ─────────────────────────────
# This jointly optimizes base (x, y, theta) and arm joints
self.whole_body_group.setPoseTarget(approach_pose)
# ── 3. Plan with constraints (collision + manipulability) ───────
constraints = Constraints()
# Add joint limits constraints to keep arm away from limits
constraints.joint_constraints = self._make_joint_constraints()
plan = self.whole_body_group.plan()
if not plan[0]:
self.get_logger().warn("Whole-body plan to approach pose failed")
return False
# Execute
self.whole_body_group.execute(plan[1], wait=True)
self.whole_body_group.stop()
self.whole_body_group.clearPoseTargets()
# ── 4. Descend linearly to grasp pose ──────────────────────────
grasp_pose = PoseStamped()
grasp_pose.header.frame_id = "base_link"
grasp_pose.pose.position.x = object_position[0]
grasp_pose.pose.position.y = object_position[1]
grasp_pose.pose.position.z = object_position[2]
        grasp_pose.pose.orientation = approach_pose.pose.orientation
# Use Cartesian path for linear descent
waypoints = [approach_pose.pose, grasp_pose.pose]
        (plan, fraction) = self.arm_group.computeCartesianPath(
            waypoints,
            eef_step=0.01,       # 1 cm interpolation steps
            jump_threshold=0.0,  # 0.0 disables joint-space jump checking
        )
if fraction < 0.95:
self.get_logger().warn(f"Cartesian path only {fraction*100:.1f}% complete")
self.arm_group.execute(plan, wait=True)
        # ── 5. Close gripper ────────────────────────────────────────────
        self.arm_group.stop()
        # Assumes a "gripper_close" named state is defined in the SRDF
        self.arm_group.setNamedTarget("gripper_close")
        self.arm_group.go(wait=True)
        self.arm_group.stop()
        # ── 6. Lift ────────────────────────────────────────────────────
        lift_pose = PoseStamped()  # fresh pose; don't mutate grasp_pose
        lift_pose.header.frame_id = grasp_pose.header.frame_id
        lift_pose.pose.position.x = grasp_pose.pose.position.x
        lift_pose.pose.position.y = grasp_pose.pose.position.y
        lift_pose.pose.position.z = grasp_pose.pose.position.z + 0.2
        lift_pose.pose.orientation = grasp_pose.pose.orientation
        self.whole_body_group.setPoseTarget(lift_pose)
        plan = self.whole_body_group.plan()
        if plan[0]:
            self.whole_body_group.execute(plan[1], wait=True)
        self.whole_body_group.stop()
        self.whole_body_group.clearPoseTargets()
self.get_logger().info("Whole-body pick completed!")
return True
def _make_joint_constraints(self) -> list:
"""Create soft joint constraints to keep arm away from limits."""
constraints = []
# Avoid shoulder joint limits by adding a 0.1 rad margin
margin = 0.1
for joint_name, lower, upper in [
("shoulder_pan_joint", -2.5, 2.5),
("shoulder_lift_joint", -2.0, 2.0),
("elbow_joint", -2.8, 2.8),
("wrist_1_joint", -2.0, 2.0),
("wrist_2_joint", -2.0, 2.0),
("wrist_3_joint", -2.0, 2.0),
]:
jc = JointConstraint()
jc.joint_name = joint_name
jc.position = 0.0
jc.tolerance_above = upper - margin
jc.tolerance_below = lower + margin
jc.weight = 0.5
constraints.append(jc)
return constraints
    def compute_manipulability(self, q: np.ndarray) -> float:
        """
        Compute the manipulability measure for the current arm configuration.
        Uses the Yoshikawa formula: w = sqrt(det(J @ J.T))
        Returns the manipulability value (0 at singularity, larger is better).
        """
        # In practice, obtain the 6 x n arm Jacobian J from the FK solver
        # at configuration q, then return np.sqrt(np.linalg.det(J @ J.T)).
        return 1.0  # Placeholder
def main():
rclpy.init()
node = WholeBodyMobileManipulator()
# ── Demo: pick object at known position ───────────────────────────
# Object at (0.65, 0.1, 0.75) in base frame (on a table)
obj_pos = np.array([0.65, 0.1, 0.75])
node.whole_body_pick(obj_pos)
node.get_logger().info("Demo complete!")
rclpy.shutdown()
if __name__ == "__main__":
main()
5.5 Manipulability Optimization¶
The arm's manipulability ellipsoid reveals how well the arm can move in different directions. At a singularity, the ellipsoid collapses — the arm loses a degree of freedom.
def compute_manipulability_optimized_base_pose(
object_pos: np.ndarray,
robot_model,
n_candidates: int = 100,
) -> tuple:
"""
Find the best base pose that maximizes arm manipulability
for reaching the object.
This searches over a grid of base poses and selects the one
with the highest manipulability at the IK solution.
"""
import random
best_pose = None
best_manipulability = -1.0
# Search over candidate base poses
for _ in range(n_candidates):
# Random base pose in reachable area (x: 0.3-1.0m, y: -0.3-0.3m, theta: -pi to pi)
q_base = np.array([
random.uniform(0.3, 1.0),
random.uniform(-0.3, 0.3),
random.uniform(-np.pi, np.pi),
])
# Compute IK for the arm to reach object_pos from this base
q_arm = compute_ik_for_base(object_pos, q_base, robot_model)
if q_arm is None:
continue # IK failed for this base pose
# Compute manipulability
q_full = np.concatenate([q_base, q_arm])
w = compute_manipulability_from_jacobian(q_full, robot_model)
if w > best_manipulability:
best_manipulability = w
best_pose = q_base
return best_pose, best_manipulability
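To visualize what this search is optimizing, the SVD of the arm Jacobian gives the manipulability ellipsoid directly: the singular values are its semi-axis lengths, and their product equals the Yoshikawa measure. A sketch for a planar 2-link arm with hypothetical link lengths:

```python
import numpy as np

def jac_2link(q1, q2, l1=0.4, l2=0.3):
    """Analytic 2x2 Jacobian of a planar 2-link arm."""
    s1, c1 = np.sin(q1), np.cos(q1)
    s12, c12 = np.sin(q1 + q2), np.cos(q1 + q2)
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def ellipsoid_axes(J):
    """
    SVD of the Jacobian: singular values are the ellipsoid's semi-axis
    lengths, the columns of U their directions in task space.
    """
    U, s, _ = np.linalg.svd(J)
    return s, U

s_good, _ = ellipsoid_axes(jac_2link(0.3, 1.2))   # bent elbow: well-conditioned
s_bad, _ = ellipsoid_axes(jac_2link(0.3, 0.01))   # nearly straight: near-singular
w_good, w_bad = np.prod(s_good), np.prod(s_bad)   # Yoshikawa measures
```

For this arm the measure reduces analytically to \(w = l_1 l_2 |\sin q_2|\), so a nearly straight elbow collapses the ellipsoid, which is exactly the configuration the base-pose search above tries to avoid.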
6. Tier 3 — Modern: Foundation Model-Based Manipulation¶
6.1 Concept¶
Foundation models (large neural networks pre-trained on massive data) have revolutionized manipulation. Vision-Language-Action (VLA) models like RT-2, Octo, and π0 learn a single policy that:
- Takes RGB images (and optionally language instructions) as input
- Outputs continuous robot actions (end-effector delta pose + gripper command)
This is fundamentally different from Tier 1–2 because the same model can handle novel objects, scenes, and tasks without explicit geometry or IK.
┌──────────────────────────────────────────────────────────────────┐
│ Vision-Language-Action (VLA) Model │
│ │
│ [Image] ──┐ │
│ │ │
│ [Lang] ──┼──► ┌──────────────┐ ──► Action Token Stream │
│ │ │ Vision │ [Δx, Δy, Δz, Δroll, │
│ │ │ Encoder │ Δpitch, Δyaw, grip] │
│ │ │ (ViT/Sam) │ │
│ │ └──────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ └───►│ LLM Core │◄── (frozen or fine-tuned) │
│ │ (PaLM-E / │ │
│ │ LLaMA / │ │
│ │ RT-2 Core) │ │
│ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Action Head │ │
│ │ (token → │ │
│ │ continuous)│ │
│ └──────────────┘ │
└──────────────────────────────────────────────────────────────────┘
6.2 Action Tokenization¶
RT-2-style models discretize continuous actions into tokens (inspired by language tokens):
1. Discretize each action dimension into \(B\) bins (e.g., \(B = 256\)):

\[
\text{action\_id} = \text{Clip}\left(\left\lfloor \frac{a - a_{\min}}{a_{\max} - a_{\min}} \cdot B \right\rfloor,\ 0,\ B-1\right)
\]

2. Treat each action_id as a vocabulary token, concatenated with the language tokens
3. Train with the standard language-modeling loss (next-token prediction)
4. At inference, sample action tokens and map them back to continuous values
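The binning scheme can be sketched in a few lines (a generic illustration of the idea, not the exact RT-2 tokenizer):

```python
import numpy as np

def actions_to_tokens(a, a_min=-1.0, a_max=1.0, n_bins=256):
    """Discretize continuous actions into integer bin ids (RT-2-style)."""
    a = np.asarray(a, dtype=float)
    ids = np.floor((a - a_min) / (a_max - a_min) * n_bins).astype(int)
    return np.clip(ids, 0, n_bins - 1)

def tokens_to_actions(ids, a_min=-1.0, a_max=1.0, n_bins=256):
    """Map bin ids back to continuous values (bin centers)."""
    return a_min + (np.asarray(ids) + 0.5) / n_bins * (a_max - a_min)

# 7-dim action: [dx, dy, dz, droll, dpitch, dyaw, gripper]
a = np.array([0.02, -0.01, 0.0, 0.0, 0.0, 0.1, 1.0])
ids = actions_to_tokens(a)
recovered = tokens_to_actions(ids)
# round-trip error is bounded by half a bin: (a_max - a_min) / (2 * n_bins)
```

The quantization error shrinks as \(1/B\), which is why a few hundred bins per dimension suffice for centimeter-scale end-effector deltas.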
6.3 ROS 2 VLA Integration¶
"""
Tier 3: Foundation Model Mobile Manipulation
===============================================
Integrate a VLA model (e.g., Octo) with ROS 2 for
generalized mobile manipulation.
"""
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image, JointState
from geometry_msgs.msg import Twist
from cv_bridge import CvBridge
import numpy as np
import torch
try:
    from octo.model.octo_model import OctoModel
except ImportError:
    print("[WARN] Octo not installed; see the Octo repository for setup instructions")
    OctoModel = None
class VLAMobileManipulator(Node):
"""
VLA-based mobile manipulation.
The VLA model takes a camera image and a language instruction,
then outputs continuous actions that control both the mobile base
and the arm. This is a single end-to-end policy — no separate
navigation or IK module needed.
"""
def __init__(self):
super().__init__('vla_mobile_manipulator')
self.bridge = CvBridge()
        # ── Load VLA model ────────────────────────────────────────────
        # NOTE: the checkpoint name and inference calls below sketch the
        # Octo API (a JAX model); adapt them to your installed version.
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.get_logger().info(f"Using device: {self.device}")
        if OctoModel is not None:
            # Load Octo-small (pre-trained on the Open X-Embodiment dataset)
            self.model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small-1.5")
            self.get_logger().info("Octo model loaded")
        else:
            self.model = None
            self.get_logger().warn("VLA model not available — running in simulation mode")
# ── ROS interfaces ────────────────────────────────────────────
self.image_sub = self.create_subscription(
Image,
"/camera/color/image_raw",
self.image_callback,
10
)
self.joint_sub = self.create_subscription(
JointState,
"/joint_states",
self.joint_callback,
10
)
# Base + arm velocity command publisher
self.base_cmd_pub = self.create_publisher(Twist, "/cmd_vel", 10)
self.arm_cmd_pub = self.create_publisher(
JointState, "/arm_controller/joint_commands", 10
)
        self.current_joints = np.zeros(6)  # 6-DOF arm joint states
self.current_image = None
self.get_logger().info("VLA Mobile Manipulator initialized")
def image_callback(self, msg: Image):
"""Receive camera image."""
try:
self.current_image = self.bridge.imgmsg_to_cv2(msg, desired_encoding="rgb8")
except Exception as e:
self.get_logger().error(f"Image conversion failed: {e}")
def joint_callback(self, msg: JointState):
"""Receive current joint states."""
        self.current_joints = np.array(msg.position[:6])
def compute_vla_action(self, instruction: str,
image: np.ndarray,
arm_joints: np.ndarray) -> np.ndarray:
"""
Query the VLA model for an action given the current observation.
Parameters
----------
instruction : str
Language instruction, e.g., "pick up the red cup"
image : ndarray
Current RGB camera image (H, W, 3)
arm_joints : ndarray
Current arm joint positions
Returns
-------
action : ndarray
Predicted action: [base_vx, base_vy, base_omega,
arm_delta_q1, ..., arm_delta_qn, gripper]
"""
if self.model is None:
# Simulation mode: return zero action
return np.zeros(10)
# Preprocess image
image_tensor = torch.from_numpy(image).permute(2, 0, 1) # CHW
image_tensor = torch.nn.functional.interpolate(
image_tensor.unsqueeze(0).float(), size=(256, 256)
).squeeze(0)
        # The instruction is tokenized inside the model via the `text` argument
# Model inference
with torch.no_grad():
# Octo outputs actions in gripper frame
# Shape: (batch=1, pred_horizon=1, action_dim)
action = self.model.predict_action(
image=image_tensor.to(self.device).unsqueeze(0),
text=[instruction],
)
return action[0, 0].cpu().numpy()
def execute_action(self, action: np.ndarray):
"""
Send action to robot hardware.
Action format: [base_vx, base_vy, base_omega,
delta_q1, ..., delta_q6, gripper]
"""
# ── Mobile base velocity ───────────────────────────────────────
base_cmd = Twist()
base_cmd.linear.x = float(action[0])
base_cmd.linear.y = float(action[1])
base_cmd.angular.z = float(action[2])
self.base_cmd_pub.publish(base_cmd)
# ── Arm joint delta commands ──────────────────────────────────
n_arm_joints = len(self.current_joints)
arm_delta = action[3:3 + n_arm_joints]
new_joints = self.current_joints + arm_delta
arm_cmd = JointState()
arm_cmd.header.stamp = self.get_clock().now().to_msg()
arm_cmd.name = [
"shoulder_pan_joint", "shoulder_lift_joint", "elbow_joint",
"wrist_1_joint", "wrist_2_joint", "wrist_3_joint"
]
arm_cmd.position = new_joints.tolist()
self.arm_cmd_pub.publish(arm_cmd)
# ── Gripper ────────────────────────────────────────────────────
gripper_cmd = JointState()
gripper_cmd.name = ["gripper_joint"]
gripper_cmd.position = [float(action[-1])]
# Publish to gripper controller...
def run_policy(self, instruction: str, max_steps: int = 200):
"""
Run the VLA policy for a given instruction.
Parameters
----------
instruction : str
Language command to execute
max_steps : int
Maximum number of control steps
"""
self.get_logger().info(f"Executing: '{instruction}'")
        # NOTE: rate.sleep() only wakes if this node is spun by an
        # executor in a separate thread
        rate = self.create_rate(10)  # 10 Hz control loop
for step in range(max_steps):
if self.current_image is None:
rate.sleep()
continue
# Query model
action = self.compute_vla_action(
instruction, self.current_image, self.current_joints
)
# Execute on robot
self.execute_action(action)
# Check for termination (if model outputs a "done" signal)
if np.abs(action).max() < 1e-3:
self.get_logger().info("Action converged — stopping")
break
rate.sleep()
# Stop robot
self.execute_action(np.zeros_like(action))
self.get_logger().info("Policy execution complete")
def main():
rclpy.init()
node = VLAMobileManipulator()
# ── Example instructions ──────────────────────────────────────────
instructions = [
"pick up the cup on the table",
"place the cup on the shelf",
"push the button on the wall",
"open the drawer",
]
for instruction in instructions:
node.run_policy(instruction, max_steps=200)
rclpy.shutdown()
if __name__ == "__main__":
main()
6.4 Sim-to-Real Transfer¶
Training VLA models involves significant domain gap between simulation and reality:
| Challenge | Description | Mitigation |
|---|---|---|
| Visual domain gap | Different textures, lighting, camera noise | Domain randomization (DR), neural rendering (NeRF, 3DGS) |
| Physics gap | Contact dynamics, friction, compliance | RL fine-tuning on real hardware, system identification |
| Action space mismatch | Different robot models | Action normalization, cross-embodiment training |
| Data efficiency | VLAs need millions of demos | Pre-training on large datasets (OXE, BridgeData) |
| Latency | Model inference is slow | ONNX optimization, GPU batching, action chunking |
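Action chunking, the latency mitigation listed above, is easy to sketch: the policy is queried once for a chunk of k future actions, which are then replayed at the control rate, so the expensive model runs only every k steps. In this generic illustration, `fake_policy` is a stand-in for the VLA inference call:

```python
import numpy as np
from collections import deque

class ActionChunkExecutor:
    """Replay a chunk of k predicted actions per (slow) model call."""

    def __init__(self, policy_fn, chunk_size=8):
        self.policy_fn = policy_fn    # obs -> (k, action_dim) array
        self.chunk_size = chunk_size
        self.buffer = deque()

    def next_action(self, obs):
        """Return the next action, querying the model only when empty."""
        if not self.buffer:
            chunk = self.policy_fn(obs)           # one expensive inference
            for a in chunk[: self.chunk_size]:
                self.buffer.append(a)
        return self.buffer.popleft()

# Stand-in policy: counts its calls, returns 8 identical dummy actions
calls = {"n": 0}
def fake_policy(obs):
    calls["n"] += 1
    return np.tile(obs, (8, 1))

executor = ActionChunkExecutor(fake_policy, chunk_size=8)
for _ in range(16):                  # 16 control steps
    action = executor.next_action(np.zeros(10))
```

With a chunk size of 8, sixteen control steps trigger only two inference calls; the trade-off is that within a chunk the policy cannot react to new observations.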
7. Comparison of Approaches¶
| Criterion | Tier 1: Decoupled | Tier 2: Whole-Body | Tier 3: VLA Foundation |
|---|---|---|---|
| Base + Arm coordination | ❌ Independent | ✅ Joint | ✅ Implicit (learned) |
| Novel object handling | ❌ Needs model | ❌ Needs model | ✅ Zero-shot |
| Generalization | Task-specific | Task-specific | Task-general |
| Real-time performance | ✅ Fast | ✅ ~1-5 Hz | ⚠️ Depends on model size |
| Hardware requirements | Low | Medium | High (GPU required) |
| Setup complexity | Low | Medium | High |
| Collision awareness | ❌ Arm only | ✅ Full body | ⚠️ Learned |
| Singularity handling | ❌ Manual | ✅ Jacobian-based | ⚠️ Learned |
| Best for | Structured environments | Semi-structured | Open-world manipulation |
| Code complexity | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Reliability | High (predictable) | High (verifiable) | Medium (stochastic) |
8. Step-by-Step Implementation Guide¶
Phase 1 — Robot Setup (All Tiers)¶
- URDF / SDF: Define the complete robot — base + arm + sensors — in URDF
- Kinematics: Load the URDF into MoveIt and configure kinematic chains
- Calibration: Calibrate the base-to-camera transform (`tf` tree)
- Gazebo simulation: Verify the robot loads and moves in simulation
Phase 2 — Tier 1: Decoupled (Week 1)¶
- Launch the Navigation 2 stack with SLAM (see SLAM project)
- Define a set of approach poses for known pick locations
- Implement `compute_ik()` using MoveIt's IK service
- Run the pick-and-place demo with known object poses
- Test: navigate to pose → IK → grasp → lift
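MoveIt's IK service does the numerical work for you, but it helps to see what such a solver does internally. Here is a self-contained damped least-squares IK sketch for a planar 2-link arm — link lengths, seed, and tolerances are illustrative, not your robot's:

```python
import numpy as np

L1, L2 = 0.4, 0.3  # link lengths in meters (illustrative)

def fk(q):
    """Forward kinematics: end-effector (x, y) of a planar 2-link arm."""
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def jacobian(q):
    """Analytic Jacobian of fk with respect to the two joint angles."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def compute_ik(target, q0, iters=100, damping=1e-2):
    """Damped least-squares IK: q += J^T (J J^T + λ² I)^-1 e."""
    q = np.array(q0, dtype=float)
    for _ in range(iters):
        e = target - fk(q)
        if np.linalg.norm(e) < 1e-5:
            break
        J = jacobian(q)
        q += J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(2), e)
    return q

q = compute_ik(np.array([0.5, 0.2]), q0=[0.1, 0.5])
```

The damping term is what keeps the solve stable near singularities — the same idea MoveIt's KDL plugin and Tier 2's Jacobian-based handling rely on.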
Phase 3 — Tier 2: Whole-Body (Week 2)¶
- Define the `whole_body` planning group in SRDF (base + arm chain)
- Configure OMPL with `RRTstar` or `PRMstar` for whole-body planning
- Add collision objects (tables, shelves) to the planning scene
- Replace the Tier 1 decoupled `navigate_to()` + `compute_ik()` with `whole_body_pick()`
- Implement manipulability optimization for base pose selection
- Test: single whole-body plan for pick, verify collision avoidance
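The manipulability optimization step typically uses Yoshikawa's measure w = sqrt(det(J Jᵀ)): among candidate base poses with a valid IK solution, prefer the one whose resulting arm configuration maximizes w (i.e., stays far from singularities). A minimal sketch with an illustrative planar 2-link Jacobian:

```python
import numpy as np

def manipulability(J):
    """Yoshikawa manipulability w = sqrt(det(J J^T)); 0 at a singularity."""
    return np.sqrt(max(np.linalg.det(J @ J.T), 0.0))  # clamp tiny negatives

def planar_jacobian(q, l1=0.4, l2=0.3):
    """Jacobian of a planar 2-link arm (illustrative stand-in for your URDF)."""
    s1, s12 = np.sin(q[0]), np.sin(q[0] + q[1])
    c1, c12 = np.cos(q[0]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

# Elbow fully extended -> singular, w ≈ 0; elbow bent -> w > 0.
# A base-pose selector would loop over candidate poses, run IK for each,
# and keep the pose whose IK solution scores highest.
w_singular = manipulability(planar_jacobian([0.3, 0.0]))
w_bent     = manipulability(planar_jacobian([0.3, 1.2]))
```

For this arm, w reduces to l1·l2·|sin q2|, which makes the preference explicit: bend the elbow, don't reach at full extension.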
Phase 4 — Tier 3: VLA Integration (Week 3)¶
- Install Octo or RT-2 (requires Docker + GPU environment)
- Run inference on a pre-trained model with the real camera feed
- Add action normalization (map model output to your robot's action limits)
- Implement a safety layer: clamp velocities, check joint limits before sending commands
- Fine-tune on domain-specific data if performance is insufficient
- Test: give language commands like "pick up the blue box"
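The safety layer in step 4 can be a single function sitting between the policy and the hardware driver. A minimal sketch, assuming joint-position commands at a fixed control rate (all limit values are illustrative):

```python
import numpy as np

def safety_clamp(q_cmd, q_now, q_min, q_max, v_max, dt):
    """Clamp a commanded joint target to position and velocity limits.

    Whatever the model outputs, the robot never receives a command
    outside its joint limits or faster than v_max per control step.
    """
    q_cmd = np.clip(q_cmd, q_min, q_max)                   # position limits
    dq = np.clip(q_cmd - q_now, -v_max * dt, v_max * dt)   # velocity limit
    return q_now + dq

# Illustrative 3-DOF example at a 10 Hz control loop (dt = 0.1 s)
q_safe = safety_clamp(
    q_cmd=np.array([2.0, -3.5, 0.4]),   # raw (possibly unsafe) policy output
    q_now=np.array([0.0,  0.0, 0.3]),
    q_min=np.array([-1.5, -1.5, -1.5]),
    q_max=np.array([ 1.5,  1.5,  1.5]),
    v_max=np.array([ 1.0,  1.0,  1.0]),  # rad/s
    dt=0.1,
)
```

Because VLA outputs are stochastic, this layer should be non-negotiable: the fine-tuning in step 5 improves behavior, but the clamp is what bounds the worst case.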
9. Extensions and Variations¶
9.1 Mobile Base Navigation with Arm Constraints¶
Use Task Space Region (TSR) constraints in MoveIt to ensure the arm maintains a specific orientation or stays within a workspace while the base navigates:
# Keep arm pointing downward while base navigates
from moveit_msgs.msg import Constraints, OrientationConstraint
orientation_constraint = OrientationConstraint()
orientation_constraint.header.frame_id = "base_link"
orientation_constraint.link_name = "gripper_link"
orientation_constraint.orientation.w = 1.0
orientation_constraint.absolute_x_axis_tolerance = 0.1
orientation_constraint.absolute_y_axis_tolerance = 0.1
orientation_constraint.absolute_z_axis_tolerance = 3.14 # free rotation around Z
orientation_constraint.weight = 1.0
constraints = Constraints()
constraints.orientation_constraints.append(orientation_constraint)
move_group.set_path_constraints(constraints)  # moveit_commander Python API
9.2 Grasp Pose Detection¶
Replace hard-coded grasp poses with learning-based grasp detection:
| Method | Description | ROS Package |
|---|---|---|
| GraspNet | 6-DOF grasp pose prediction from point clouds | graspnet_ros |
| Contact-GraspNet | Contact-level grasp synthesis from partial views | contact_graspnet |
| GPD | Geometric approach to grasp detection | gpd |
| 6-DOF GraspNet (Isaac) | NVIDIA's optimized grasp detection | Isaac SDK |
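Whichever detector you use, its raw candidates still need filtering before execution: reject low-confidence grasps and grasps outside the arm's reachable workspace, then take the best survivor. A detector-agnostic sketch — the `(position, score)` format and both thresholds are assumptions; real detector outputs vary:

```python
import numpy as np

def select_grasp(grasps, min_score=0.5, workspace_radius=0.8):
    """Pick the best grasp candidate from a detector's scored output.

    grasps : list of (position_xyz, score) pairs in the arm's base frame.
    Returns the highest-scoring valid grasp, or None if none qualify.
    """
    valid = [
        (p, s) for p, s in grasps
        if s >= min_score and np.linalg.norm(p) <= workspace_radius
    ]
    if not valid:
        return None  # caller should reposition the base and re-detect
    return max(valid, key=lambda g: g[1])

candidates = [
    (np.array([0.5, 0.1, 0.4]), 0.90),  # good score, reachable
    (np.array([1.5, 0.0, 0.4]), 0.95),  # best score, but out of reach
    (np.array([0.4, 0.0, 0.3]), 0.30),  # reachable, score too low
]
best = select_grasp(candidates)
```

The `None` branch is where mobile manipulation differs from fixed-arm grasping: an unreachable grasp is not a failure, just a trigger to move the base.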
9.3 Dual-Arm Mobile Manipulation¶
Extend to two arms for bimanual tasks (e.g., holding an object with one arm while moving it with the base, or cooperative carrying):
# Dual-arm whole-body group
<group name="dual_arm_whole_body">
<chain base_link="base_footprint" tip_link="left_gripper_link"/>
<chain base_link="base_footprint" tip_link="right_gripper_link"/>
</group>
9.4 Hierarchical Planning¶
Combine Tier 2 and Tier 3 in a hierarchical architecture:
Task Planner (LLM/VLA) ──► High-level goal: "place cup on shelf"
│
▼
Motion Planner (Tier 2) ──► Whole-body trajectory for base + arm
│
▼
Controller (PID/impedance) ──► Execute trajectory on hardware
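The glue between these layers can be very thin. A stub sketch of the dispatch loop, where `plan_task` and `execute_subtask` are hypothetical placeholders for the LLM/VLA planner and the Tier 2 whole-body planner:

```python
def plan_task(instruction):
    """Stub task planner: decompose a goal into (skill, target) subtasks.

    A real system would query an LLM/VLA here; this rule-based stub only
    illustrates the interface between the task and motion layers.
    """
    if "place" in instruction:
        return [("pick", "cup"), ("navigate", "shelf"), ("place", "shelf")]
    return [("pick", instruction.split()[-1])]

def execute_subtask(skill, target):
    """Stub motion layer: a real system calls the whole-body planner here."""
    return f"{skill}({target}): ok"

# Dispatch loop: each high-level subtask becomes one motion-planning call
log = [execute_subtask(s, t) for s, t in plan_task("place cup on shelf")]
```

Keeping the interface to plain (skill, target) tuples means the task planner can be swapped (rules, LLM, VLA) without touching the verified motion layer underneath.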
10. References¶
- Siciliano, Sciavicco, Villani & Oriolo (2009). Robotics: Modelling, Planning and Control — Robot kinematics and control textbook
- Yoshikawa, T. (1985). "Manipulability of Robotic Mechanisms." IJRR — Manipulability measure
- Koenig & Howard (2004). "Design and Use Paradigms for Gazebo." IROS — Robot simulation
- Coleman et al. (2014). "Reducing the Barrier to Entry of Complex Robotic Software." ICRA — MoveIt introduction
- Sucan et al. (2012). "The Open Motion Planning Library." ICRA — OMPL
- Brohan et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." arXiv:2307.15218 — RT-2 paper
- Octo Model Team (2024). "Octo: An Open-Source Generalist Robot Policy." GitHub — Octo model
- Black et al. (2024). "π0: A Vision-Language-Action Model for Generalist Robot Control." Physical Intelligence — π0 paper
- Open X-Embodiment Collaboration (2023). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." arXiv:2310.08864 — OXE dataset (training foundation models)
- Sundermeyer et al. (2021). "Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes." ICRA — Contact-GraspNet
- Sundaresan et al. (2020). "Rethinking Grasp: A Survey of Deep Learning Approaches for Robotic Grasping." arXiv — Grasp detection survey
- ROS 2 MoveIt 2 Documentation — Official MoveIt 2 tutorials
- Stretch RE-1 Documentation — Hello Robot's mobile manipulator
- Fetch Robotics Mobile Manipulator — Fetch hardware + ROS integration