
Mobile Manipulation

Project Type: Manipulation + Navigation | Difficulty: ★★★☆☆ to ★★★★★ (approach-dependent) | Estimated Time: 2–4 weekends


1. Project Overview

Mobile manipulation combines mobility (navigating a mobile base) with manipulation (controlling a robotic arm) to enable a robot to approach a scene, locate an object, and act on it — picking it up, placing it, opening a drawer, or pressing a button. This is one of the canonical problems in service robotics.

  ┌──────────────────────────────────────────────────────────────────┐
  │              Mobile Manipulation Pipeline (Top View)              │
  │                                                                   │
  │    ┌──────────┐                                                  │
  │    │ Scene /  │                                                  │
  │    │ Objects  │                                                  │
  │    └────┬─────┘                                                  │
  │         │                                                         │
  │         ▼ (perceive)                                              │
  │    ┌──────────────┐   ┌─────────────────┐                       │
  │    │ Grounding    │──►│ Reachability     │                       │
  │    │ (detect /    │   │ Analysis         │                       │
  │    │  localize)   │   │ (which approach   │                       │
  │    └──────────────┘   │  pose is valid?) │                       │
  │                       └────────┬─────────┘                       │
  │                                │                                   │
  │                                ▼ (plan)                            │
  │         ┌───────────────────────────────────┐                    │
  │         │      Whole-Body Planner            │                    │
  │         │  (base pose + arm joint targets)   │                    │
  │         └──────────────┬────────────────────┘                    │
  │                        │                                          │
  │         ┌──────────────┴────────────────────┐                    │
  │         │                                     │                   │
  │         ▼ (execute)                           ▼ (execute)         │
  │  ┌─────────────┐                      ┌─────────────┐           │
  │  │ Mobile Base │                      │ Arm + Gripper│           │
  │  │ Controller  │                      │ Controller   │           │
  │  └──────┬──────┘                      └──────┬──────┘           │
  │         │                                     │                   │
  │         ▼                                     ▼                   │
  │   Move to pose                           Grasp / Manipulate      │
  │                                                                   │
  │         Robot = [Mobile Base] + [Arm] + [End-Effector / Gripper] │
  └──────────────────────────────────────────────────────────────────┘

In this project you will explore three progressively more sophisticated approaches:

| Tier | Approach | Key Technique | ROS Package |
|------|----------|---------------|-------------|
| Tier 1 | Decoupled | Navigate to approach pose, then arm IK | Nav2 + moveit_ros |
| Tier 2 | Whole-Body | Joint optimization of base + arm | moveit_ros (whole-body) + OMPL |
| Tier 3 | Foundation Models | VLA policy (RT-2 / Octo / π0) | ros2_manipulation + inference server |

2. Learning Objectives

  • Understand the mobile manipulator kinematics model (base + arm)
  • Implement decoupled navigation + inverse kinematics (IK) control
  • Perform whole-body manipulability analysis and joint optimization
  • Configure MoveIt 2 for a mobile manipulator in ROS 2
  • Integrate object detection with grasp planning
  • Understand Vision-Language-Action (VLA) foundation models for manipulation

3. Hardware & Software Requirements

Hardware

| Component | Specification | Notes |
|-----------|---------------|-------|
| Mobile manipulator | Stretch RE-1, Fetch, or custom (e.g., LoCoArm + TurtleBot) | Any mobile base + arm combination |
| Depth camera | Intel RealSense D435i / Azure Kinect | For object detection & point clouds |
| LiDAR (optional) | 2D LiDAR for navigation (e.g., RPLIDAR A1) | Only needed for Tier 1–2 |
| Onboard computer | NVIDIA Jetson Orin / x86 with ROS 2 | GPU recommended for Tier 3 |
| Gripper | Parallel-jaw (Robotiq 2F-85) or custom | Actuated gripper |
| Battery | Sufficient for 2+ hours of operation | |

Software

| Package | Version | Purpose |
|---------|---------|---------|
| ROS 2 | Humble / Iron | Core framework |
| MoveIt 2 | Humble | Whole-body motion planning |
| Navigation 2 | Humble | Mobile base navigation |
| OpenCV | ≥ 4.5 | Image processing |
| PyTorch | ≥ 1.13 | Neural network inference (Tier 3) |
| Grasp pose detection | GraspNet / Contact-GraspNet | Grasp planning |
| OMPL | via MoveIt | Sampling-based motion planning |
| Isaac Sim / Gazebo | Latest | Simulation |
# Install core dependencies (ROS 2 Humble)
sudo apt install ros-humble-moveit ros-humble-navigation2 ros-humble-ros2-control
pip install opencv-python numpy torch torchvision open3d

4. Tier 1 — Traditional: Decoupled Navigation + IK Control

4.1 Concept

The simplest approach is to decouple the mobile base and the arm:

  1. Navigate the mobile base to an approach pose near the target object
  2. Lock the base, then solve inverse kinematics (IK) for the arm to reach the object
  3. Execute the grasp

This approach is easy to implement and debug, but it cannot adapt the base pose to improve arm reachability — the base might arrive at a location where the arm cannot comfortably reach the target.

  Step 1: Navigate                    Step 2: Lock base, solve IK
  ┌──────────────────┐              ┌────────────────────────┐
  │                  │              │                         │
  │     [BASE]──►    │              │   [BASE]  [ARM reaches]  │
  │        ↓         │              │     ↖     ───► [OBJ]    │
  │   approach pose  │              │   IK solution           │
  └──────────────────┘              └────────────────────────┘

4.2 Kinematics Model

For a decoupled system, the forward kinematics of the arm is:

\[ \mathbf{x}_{ee} = f(\mathbf{q}_{arm}) \in \mathbb{R}^6 \]

where \(\mathbf{q}_{arm} \in \mathbb{R}^n\) are the arm joint positions and \(\mathbf{x}_{ee} = [p_x, p_y, p_z, \phi, \theta, \psi]\) is the end-effector pose (position + orientation).

The inverse kinematics solves:

\[ \mathbf{q}_{arm} = f^{-1}(\mathbf{x}_{des}) \]

For a 6-DOF arm, we typically use a numerical IK solver (e.g., TRAC-IK or KDL):

from moveitInterfaces import MoveGroupInterface  # MoveGroup-style helper used throughout this project


def compute_ik(target_pose, group_name="arm_group", planning_time=1.0):
    """
    Solve for an arm joint configuration that reaches a desired
    end-effector pose. The MoveGroup interface delegates to the
    configured IK plugin (e.g., TRAC-IK or KDL).

    Parameters
    ----------
    target_pose : geometry_msgs/Pose
        Desired end-effector pose in the planning frame
    group_name : str
        Planning group of the arm
    planning_time : float
        Maximum planning time in seconds

    Returns
    -------
    list[float] or None
        Joint configuration that reaches the pose, or None if no
        solution / plan was found
    """
    move_group = MoveGroupInterface(group_name, "robot_description")

    move_group.setPlanningTime(planning_time)
    move_group.setMaxVelocityScalingFactor(0.5)
    move_group.setMaxAccelerationScalingFactor(0.5)
    move_group.setPoseTarget(target_pose)

    plan = move_group.plan()
    if plan[0]:  # planning succeeded
        # The last trajectory point holds the IK solution for the target pose
        return list(plan[1].joint_trajectory.points[-1].positions)
    return None

4.3 Approach Pose Selection

Before navigating, the base needs a valid approach pose near the target. Here we use a pre-defined pose (Section 5.5 shows how to optimize it); the full Tier 1 node below navigates to that pose and then runs the IK-based pick-and-place sequence:

"""
Tier 1: Decoupled mobile manipulation
======================================
Navigate base to approach pose, then use IK to reach the object.
"""

import copy
import numpy as np
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import PoseStamped, Pose
from nav2_msgs.action import NavigateToPose
from moveitInterfaces import MoveGroupInterface, PlanningSceneInterface
from rclpy.action import ActionClient


class DecoupledMobileManipulator(Node):
    """
    Decoupled approach: navigate base to approach pose,
    then use IK to control the arm.
    """

    def __init__(self):
        super().__init__('decoupled_mobile_manipulator')

        # ── MoveIt interfaces ──────────────────────────────────────
        self.arm_group = MoveGroupInterface("arm_group", "robot_description")
        self.gripper_group = MoveGroupInterface("gripper_group", "robot_description")
        self.scene = PlanningSceneInterface()

        # ── Navigation action client ────────────────────────────────
        self.nav_client = ActionClient(self, NavigateToPose, 'navigate_to_pose')

        self.get_logger().info("Decoupled Mobile Manipulator ready")

    def navigate_to(self, x: float, y: float, yaw: float, timeout: float = 30.0):
        """
        Send the mobile base to (x, y, yaw) using Navigation 2.

        Parameters
        ----------
        x, y : float      — Goal position in map frame
        yaw  : float      — Goal orientation in radians
        timeout : float   — Maximum wait time in seconds
        """
        from tf_transformations import quaternion_from_euler

        goal = NavigateToPose.Goal()
        goal.pose.header.frame_id = "map"
        goal.pose.header.stamp = self.get_clock().now().to_msg()
        goal.pose.pose.position.x = x
        goal.pose.pose.position.y = y
        goal.pose.pose.position.z = 0.0

        q = quaternion_from_euler(0, 0, yaw)
        goal.pose.pose.orientation.x = q[0]
        goal.pose.pose.orientation.y = q[1]
        goal.pose.pose.orientation.z = q[2]
        goal.pose.pose.orientation.w = q[3]

        self.get_logger().info(f"Navigating to ({x:.2f}, {y:.2f})...")
        self.nav_client.wait_for_server()
        send_goal = self.nav_client.send_goal_async(goal)
        rclpy.spin_until_future_complete(self, send_goal)

        goal_handle = send_goal.result()
        if goal_handle is None or not goal_handle.accepted:
            self.get_logger().warn("Navigation goal was rejected")
            return False

        # Wait for the actual navigation result (not just goal acceptance)
        from action_msgs.msg import GoalStatus
        result_future = goal_handle.get_result_async()
        rclpy.spin_until_future_complete(self, result_future, timeout_sec=timeout)

        result = result_future.result()
        if result is not None and result.status == GoalStatus.STATUS_SUCCEEDED:
            self.get_logger().info("Navigation succeeded!")
            return True
        self.get_logger().warn("Navigation failed or timed out")
        return False

    def pick_object(self, object_pose: PoseStamped, approach_height: float = 0.15):
        """
        Pick an object using IK-based arm control.

        The sequence: pre-grasp → grasp → lift.

        Parameters
        ----------
        object_pose : PoseStamped
            Position of the object in the base frame
        approach_height : float
            How high above the object to approach before descending
        """
        # ── 1. Compute approach pose (above the object) ─────────────
        pregrasp_pose = Pose()
        pregrasp_pose.position = copy.deepcopy(object_pose.pose.position)
        pregrasp_pose.position.z += approach_height  # go up first
        pregrasp_pose.orientation = copy.deepcopy(object_pose.pose.orientation)

        # ── 2. Pre-grasp: move arm to approach pose via IK ──────────
        self.arm_group.setPoseTarget(pregrasp_pose)
        success = self.arm_group.go(wait=True)
        self.arm_group.stop()
        self.arm_group.clearPoseTargets()

        if not success:
            self.get_logger().error("Pre-grasp IK failed")
            return False

        # ── 3. Descend to grasp pose ─────────────────────────────────
        grasp_pose = Pose()
        grasp_pose.position = copy.deepcopy(object_pose.pose.position)
        grasp_pose.orientation = copy.deepcopy(object_pose.pose.orientation)

        # Linear interpolation descent (10 steps)
        for t in np.linspace(0, 1, 10):
            intermediate = Pose()
            intermediate.position.z = grasp_pose.position.z + (1 - t) * approach_height
            intermediate.position.x = grasp_pose.position.x
            intermediate.position.y = grasp_pose.position.y
            intermediate.orientation = copy.deepcopy(grasp_pose.orientation)

            self.arm_group.setPoseTarget(intermediate)
            self.arm_group.go(wait=True)
            self.arm_group.stop()
            self.arm_group.clearPoseTargets()

        # ── 4. Close gripper ─────────────────────────────────────────
        self.gripper_group.setNamedTarget("close")
        self.gripper_group.go(wait=True)
        self.gripper_group.stop()

        # ── 5. Lift ──────────────────────────────────────────────────
        lift_pose = copy.deepcopy(grasp_pose)   # avoid mutating grasp_pose
        lift_pose.position.z += 0.2
        self.arm_group.setPoseTarget(lift_pose)
        self.arm_group.go(wait=True)
        self.arm_group.stop()
        self.arm_group.clearPoseTargets()

        self.get_logger().info("Pick completed!")
        return True

    def place_object(self, place_pose: PoseStamped):
        """
        Place the object at the desired location.
        """
        # Descend to place
        for t in np.linspace(0, 1, 10):
            intermediate = Pose()
            intermediate.position.z = place_pose.pose.position.z + (1 - t) * 0.2
            intermediate.position.x = place_pose.pose.position.x
            intermediate.position.y = place_pose.pose.position.y
            intermediate.orientation = copy.deepcopy(place_pose.pose.orientation)

            self.arm_group.setPoseTarget(intermediate)
            self.arm_group.go(wait=True)
            self.arm_group.stop()
            self.arm_group.clearPoseTargets()

        # Open gripper
        self.gripper_group.setNamedTarget("open")
        self.gripper_group.go(wait=True)
        self.gripper_group.stop()

        # Retract
        retreat_pose = Pose()
        retreat_pose.position = copy.deepcopy(place_pose.pose.position)
        retreat_pose.position.z += 0.3
        retreat_pose.orientation = copy.deepcopy(place_pose.pose.orientation)
        self.arm_group.setPoseTarget(retreat_pose)
        self.arm_group.go(wait=True)
        self.arm_group.stop()
        self.arm_group.clearPoseTargets()


def main():
    rclpy.init()
    node = DecoupledMobileManipulator()

    # ── Demo sequence ────────────────────────────────────────────────
    # Step 1: Navigate to a pre-defined approach pose
    success = node.navigate_to(x=2.0, y=1.5, yaw=np.pi / 2)
    if not success:
        node.get_logger().error("Could not reach approach pose!")
        rclpy.shutdown()
        return

    # Step 2: Pick object (object pose from perception system)
    object_pose = PoseStamped()
    object_pose.header.frame_id = "base_link"
    object_pose.pose.position.x = 0.4
    object_pose.pose.position.y = 0.0
    object_pose.pose.position.z = 0.6
    object_pose.pose.orientation.w = 1.0

    node.pick_object(object_pose)

    # Step 3: Navigate to place location
    node.navigate_to(x=3.0, y=1.5, yaw=np.pi / 2)
    node.place_object(object_pose)

    node.get_logger().info("Demo complete!")
    rclpy.shutdown()


if __name__ == "__main__":
    main()

4.4 Limitations of the Decoupled Approach

| Issue | Impact |
|-------|--------|
| Fixed base pose | Arm may reach the object but near a singularity or a joint limit |
| No base pose optimization | Suboptimal grasp direction (wrist may be inverted) |
| No base–environment collision checking during arm motion | Base might block the arm's workspace |
| Sequential failures | If navigation fails, the arm work is wasted |

5. Tier 2 — Intermediate: Whole-Body Planning

5.1 Concept

Whole-body planning jointly optimizes the mobile base pose and the arm joint configuration to simultaneously satisfy:

  1. Reachability — the end-effector must reach the object
  2. Manipulability — the arm should be in a dexterous configuration (not near singularities)
  3. Collision-free — the entire robot body avoids obstacles
  4. Smooth motion — minimize joint effort / travel distance
  Decoupled:                        Whole-Body:
  ┌──────────────┐                  ┌──────────────────────────┐
  │ Base → pose  │                  │ Optimize base + arm      │
  │ Arm → IK     │                  │ jointly:                 │
  │ (two steps)  │                  │                          │
  └──────────────┘                  │ min  ||q - q_init||²      │
                                    │ s.t. ||x_ee - x_obj|| < ε │
                                    │      J(q) q̇ = ẋ          │
                                    └──────────────────────────┘
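To make this formulation concrete, here is a minimal sketch of the optimization for a toy planar model (base x, y, θ plus a 2-link arm) using scipy.optimize, which is not in the dependency list above and would need to be installed separately. The link lengths, initial guess, and target position are illustrative values, not a real robot.

# Toy whole-body IK as a joint optimization over base + arm (illustrative only).
import numpy as np
from scipy.optimize import minimize

L1, L2 = 0.4, 0.3  # toy arm link lengths [m]


def fk_planar(q):
    """End-effector (x, y) for q = [x_base, y_base, theta_base, q1, q2]."""
    xb, yb, th, q1, q2 = q
    a1 = th + q1
    a2 = a1 + q2
    return np.array([
        xb + L1 * np.cos(a1) + L2 * np.cos(a2),
        yb + L1 * np.sin(a1) + L2 * np.sin(a2),
    ])


def whole_body_ik(x_obj, q_init, eps=1e-3):
    """min ||q - q_init||^2  s.t.  ||fk(q) - x_obj|| <= eps."""
    cost = lambda q: np.sum((q - q_init) ** 2)
    reach = {"type": "ineq",
             "fun": lambda q: eps - np.linalg.norm(fk_planar(q) - x_obj)}
    res = minimize(cost, q_init, constraints=[reach], method="SLSQP")
    return res.x if res.success else None


q_init = np.zeros(5)
q_star = whole_body_ik(np.array([1.2, 0.5]), q_init)
print("base + arm solution:", q_star)

A real whole-body planner (Section 5.3) solves the same kind of problem with full 3D kinematics, collision checking, and sampling-based search instead of a local optimizer.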

5.2 Whole-Body Kinematics

The combined robot state is:

\[ \mathbf{q} = \begin{bmatrix} \mathbf{q}_{base} \\ \mathbf{q}_{arm} \end{bmatrix} \in \mathbb{R}^{(3+n)} \]

where \(\mathbf{q}_{base} = [x, y, \theta]\) for a differential-drive base and \(\mathbf{q}_{arm} \in \mathbb{R}^n\) for the arm joints.

The whole-body Jacobian maps joint velocities to end-effector velocity:

\[ \dot{\mathbf{x}}_{ee} = J_{wb}(\mathbf{q}) \, \dot{\mathbf{q}} \]

where \(J_{wb} \in \mathbb{R}^{6 \times (3+n)}\) is the whole-body Jacobian:

\[ J_{wb} = \begin{bmatrix} J_{base} & J_{arm} \end{bmatrix} \]

The manipulability measure (Yoshikawa) for the arm:

\[ w = \sqrt{\det(J_{arm} J_{arm}^\top)} \]

Maximizing \(w\) keeps the arm away from singularities.
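The sketch below, using the same toy planar model as above, shows how the whole-body Jacobian is assembled (base and arm blocks side by side) and how the Yoshikawa measure is evaluated. On a real robot, the arm Jacobian would come from the kinematics solver (KDL / MoveIt) rather than hand-written derivatives.

# Whole-body Jacobian and Yoshikawa manipulability for the toy planar model.
import numpy as np

L1, L2 = 0.4, 0.3  # toy arm link lengths [m]


def jacobians_planar(q):
    """Return (J_base, J_arm) for q = [x, y, theta, q1, q2]; task space = (x_ee, y_ee)."""
    xb, yb, th, q1, q2 = q
    a1, a2 = th + q1, th + q1 + q2
    s1, c1 = np.sin(a1), np.cos(a1)
    s2, c2 = np.sin(a2), np.cos(a2)
    # Arm columns: d x_ee / d q1, d x_ee / d q2
    J_arm = np.array([[-L1 * s1 - L2 * s2, -L2 * s2],
                      [ L1 * c1 + L2 * c2,  L2 * c2]])
    # Base columns: d x_ee / d (x, y, theta); rotating the base also sweeps the arm
    J_base = np.array([[1.0, 0.0, -L1 * s1 - L2 * s2],
                       [0.0, 1.0,  L1 * c1 + L2 * c2]])
    return J_base, J_arm


def yoshikawa(J):
    """w = sqrt(det(J J^T)); zero at a singularity, larger is better."""
    return np.sqrt(max(np.linalg.det(J @ J.T), 0.0))


q = np.array([0.0, 0.0, 0.0, 0.5, -1.0])
J_base, J_arm = jacobians_planar(q)
J_wb = np.hstack([J_base, J_arm])          # 2 x (3 + 2) whole-body Jacobian
print("arm manipulability:       ", yoshikawa(J_arm))
print("whole-body manipulability:", yoshikawa(J_wb))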

5.3 MoveIt 2 Whole-Body Configuration

# launch/whole_body_planning.launch.py
import os

from ament_index_python.packages import get_package_share_directory
from launch import LaunchDescription
from launch_ros.actions import Node


def generate_launch_description():
    pkg = get_package_share_directory("my_mobile_manipulator")

    # Load robot description (URDF) and semantic description (SRDF)
    with open(os.path.join(pkg, "urdf", "robot.urdf")) as f:
        robot_description = f.read()
    with open(os.path.join(pkg, "config", "robot.srdf")) as f:
        robot_description_semantic = f.read()

    return LaunchDescription([
        # Publish TF from the robot description
        Node(
            package="robot_state_publisher",
            executable="robot_state_publisher",
            output="screen",
            parameters=[{"robot_description": robot_description}],
        ),
        # MoveIt move_group node with OMPL and kinematics configuration
        Node(
            package="moveit_ros_move_group",
            executable="move_group",
            output="screen",
            parameters=[
                {"robot_description": robot_description,
                 "robot_description_semantic": robot_description_semantic},
                os.path.join(pkg, "config", "ompl_planning.yaml"),
                os.path.join(pkg, "config", "kinematics.yaml"),
            ],
        ),
    ])

SRDF configuration — define the whole-body planning group:

<!-- config/robot.srdf -->
<robot name="mobile_manipulator">
  <!-- Whole-body group: base + arm -->
  <group name="whole_body">
    <chain base_link="base_footprint" tip_link="gripper_link"/>
  </group>

  <!-- Sub-groups for specialized control -->
  <group name="arm_group">
    <link name="shoulder_link"/>
    <link name="gripper_link"/>
  </group>

  <group name="base_group">
    <joint name="x_joint"/>
    <joint name="y_joint"/>
    <joint name="theta_joint"/>
  </group>

  <!-- Named home configuration -->
  <group_state name="home" group="whole_body">
    <joint name="x_joint" value="0"/>
    <joint name="y_joint" value="0"/>
    <joint name="theta_joint" value="0"/>
    <joint name="shoulder_pan_joint"  value="0"/>
    <joint name="shoulder_lift_joint" value="-1.57"/>
    <joint name="elbow_joint"         value="1.57"/>
    <joint name="wrist_1_joint"       value="0"/>
    <joint name="wrist_2_joint"       value="-1.57"/>
    <joint name="wrist_3_joint"       value="0"/>
  </group_state>
</robot>

5.4 Whole-Body Python Planning

"""
Tier 2: Whole-Body Mobile Manipulation with MoveIt 2
======================================================
Jointly optimize base pose + arm configuration using
the MoveGroupInterface with a whole-body group.
"""

import copy
import numpy as np
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Pose, PoseStamped
from moveitInterfaces import MoveGroupInterface, PlanningSceneInterface
from moveit_msgs.msg import Constraints, JointConstraint, PositionConstraint


class WholeBodyMobileManipulator(Node):
    """
    Whole-body manipulation using MoveIt 2.

    The key advantage over Tier 1: the planner considers both the
    base pose and the arm joints together, optimizing manipulability
    and avoiding collisions across the entire robot body.
    """

    def __init__(self):
        super().__init__('whole_body_mobile_manipulator')

        # ── Whole-body planning group ─────────────────────────────────
        # Includes: base (x, y, theta) + arm joints
        self.whole_body_group = MoveGroupInterface(
            "whole_body", "robot_description"
        )
        self.arm_group = MoveGroupInterface("arm_group", "robot_description")
        self.gripper_group = MoveGroupInterface("gripper_group", "robot_description")
        self.scene = PlanningSceneInterface()

        # ── Configure planner ─────────────────────────────────────────
        self.whole_body_group.setNumPlanningAttempts(10)
        self.whole_body_group.setPlanningTime(5.0)
        self.whole_body_group.setMaxVelocityScalingFactor(0.6)
        self.whole_body_group.setMaxAccelerationScalingFactor(0.6)
        self.whole_body_group.setPlannerId("RRTstar")

        # ── Add table as collision object ────────────────────────────
        self._add_collision_objects()

        self.get_logger().info("Whole-body Mobile Manipulator ready")

    def _add_collision_objects(self):
        """Add environment obstacles to the planning scene."""
        from shape_msgs.msg import SolidPrimitive

        # Table
        table_pose = PoseStamped()
        table_pose.header.frame_id = "base_link"
        table_pose.pose.position.x = 0.8
        table_pose.pose.position.y = 0.0
        table_pose.pose.position.z = -0.3
        self.scene.addBox("table", 0.6, 0.6, 0.74, table_pose)

        # Shelf
        shelf_pose = PoseStamped()
        shelf_pose.header.frame_id = "base_link"
        shelf_pose.pose.position.x = 0.8
        shelf_pose.pose.position.y = 0.3
        shelf_pose.pose.position.z = 0.3
        self.scene.addBox("shelf", 0.4, 0.3, 0.05, shelf_pose)

        self.get_logger().info("Collision objects added")

    def whole_body_pick(self, object_position: np.ndarray,
                        approach_direction: np.ndarray = np.array([0, 0, -1]),
                        approach_height: float = 0.15):
        """
        Pick an object using whole-body planning.

        Unlike Tier 1, this plans the base pose AND arm joints together,
        considering manipulability and avoiding arm/base self-collisions.

        Parameters
        ----------
        object_position : ndarray [3]
            Object position (x, y, z) in the planning frame (base_link here)
        approach_direction : ndarray [3]
            Unit vector for approach direction (default: top-down)
        approach_height : float
            Approach distance above the object
        """
        # ── 1. Compute approach pose ──────────────────────────────────
        approach_pose = PoseStamped()
        approach_pose.header.frame_id = "base_link"
        approach_pose.header.stamp = self.get_clock().now().to_msg()

        # Start above the object
        approach_pose.pose.position.x = object_position[0]
        approach_pose.pose.position.y = object_position[1]
        approach_pose.pose.position.z = object_position[2] + approach_height

        # Approach direction determines gripper orientation
        from tf_transformations import quaternion_from_matrix

        # Build rotation matrix aligned with approach direction
        # (gripper Z-axis aligned with approach direction)
        z = approach_direction / np.linalg.norm(approach_direction)
        x_temp = np.array([1, 0, 0]) if abs(z[2]) < 0.99 else np.array([0, 1, 0])
        x = x_temp - np.dot(x_temp, z) * z
        x = x / np.linalg.norm(x)
        y = np.cross(z, x)

        R = np.column_stack([x, y, z])
        T = np.eye(4)            # homogeneous transform; quaternion_from_matrix expects 4x4
        T[:3, :3] = R
        q = quaternion_from_matrix(T)

        approach_pose.pose.orientation.x = q[0]
        approach_pose.pose.orientation.y = q[1]
        approach_pose.pose.orientation.z = q[2]
        approach_pose.pose.orientation.w = q[3]

        # ── 2. Set whole-body pose target ─────────────────────────────
        # This jointly optimizes base (x, y, theta) and arm joints
        self.whole_body_group.setPoseTarget(approach_pose)

        # ── 3. Plan with path constraints (keep arm joints away from limits;
        #      collisions are handled via the planning scene) ─────────────────
        constraints = Constraints()
        constraints.joint_constraints = self._make_joint_constraints()
        self.whole_body_group.setPathConstraints(constraints)

        plan = self.whole_body_group.plan()
        if not plan[0]:
            self.get_logger().warn("Whole-body plan to approach pose failed")
            return False

        # Execute
        self.whole_body_group.execute(plan[1], wait=True)
        self.whole_body_group.stop()
        self.whole_body_group.clearPoseTargets()

        # ── 4. Descend linearly to grasp pose ──────────────────────────
        grasp_pose = PoseStamped()
        grasp_pose.header.frame_id = "base_link"
        grasp_pose.pose.position.x = object_position[0]
        grasp_pose.pose.position.y = object_position[1]
        grasp_pose.pose.position.z = object_position[2]
        grasp_pose.pose.orientation = copy.deepcopy(approach_pose.pose.orientation)

        # Use Cartesian path for linear descent
        waypoints = [approach_pose.pose, grasp_pose.pose]
        (plan, fraction) = self.arm_group.computeCartesianPath(
            waypoints,
            eef_step=0.01,       # 1 cm interpolation steps
            jump_threshold=0.0,  # 0.0 disables joint-space jump checking
        )
        if fraction < 0.95:
            self.get_logger().warn(f"Cartesian path only {fraction*100:.1f}% complete")
        self.arm_group.execute(plan, wait=True)

        # ── 5. Close gripper (use the gripper group, not the arm group) ─────
        self.arm_group.stop()
        self.gripper_group.setNamedTarget("close")
        self.gripper_group.go(wait=True)
        self.gripper_group.stop()

        # ── 6. Lift ────────────────────────────────────────────────────
        lift_pose = copy.deepcopy(grasp_pose)   # avoid mutating grasp_pose
        lift_pose.pose.position.z += 0.2
        self.whole_body_group.setPoseTarget(lift_pose)
        plan = self.whole_body_group.plan()
        if plan[0]:
            self.whole_body_group.execute(plan[1], wait=True)
        self.whole_body_group.stop()
        self.whole_body_group.clearPoseTargets()

        self.get_logger().info("Whole-body pick completed!")
        return True

    def _make_joint_constraints(self) -> list:
        """Create soft joint constraints to keep arm away from limits."""
        constraints = []

        # Avoid shoulder joint limits by adding a 0.1 rad margin
        margin = 0.1
        for joint_name, lower, upper in [
            ("shoulder_pan_joint",  -2.5, 2.5),
            ("shoulder_lift_joint", -2.0, 2.0),
            ("elbow_joint",         -2.8, 2.8),
            ("wrist_1_joint",       -2.0, 2.0),
            ("wrist_2_joint",       -2.0, 2.0),
            ("wrist_3_joint",       -2.0, 2.0),
        ]:
            jc = JointConstraint()
            jc.joint_name = joint_name
            jc.position = 0.0
            jc.tolerance_above = upper - margin
            jc.tolerance_below = lower + margin
            jc.weight = 0.5
            constraints.append(jc)

        return constraints

    def compute_manipulability(self, q: np.ndarray) -> float:
        """
        Compute the manipulability measure for the current arm configuration.

        Uses the Yoshikawa formula: w = sqrt(det(J * J^T))

        Returns the manipulability value (0 at singularity, larger is better).
        """
        # In practice, obtain the 6 x n arm Jacobian J from the kinematics
        # solver (KDL / MoveIt) and compute w = np.sqrt(np.linalg.det(J @ J.T)).
        # See the Jacobian sketch in Section 5.2 for a toy example.
        return 1.0  # Placeholder


def main():
    rclpy.init()
    node = WholeBodyMobileManipulator()

    # ── Demo: pick object at known position ───────────────────────────
    # Object at (0.65, 0.1, 0.75) in base frame (on a table)
    obj_pos = np.array([0.65, 0.1, 0.75])
    node.whole_body_pick(obj_pos)

    node.get_logger().info("Demo complete!")
    rclpy.shutdown()


if __name__ == "__main__":
    main()

5.5 Manipulability Optimization

The arm's manipulability ellipsoid reveals how well the arm can move in different directions. At a singularity, the ellipsoid collapses — the arm loses a degree of freedom.

def compute_manipulability_optimized_base_pose(
    object_pos: np.ndarray,
    robot_model,
    n_candidates: int = 100,
) -> tuple:
    """
    Find the best base pose that maximizes arm manipulability
    for reaching the object.

    This searches over a grid of base poses and selects the one
    with the highest manipulability at the IK solution.
    """
    import random

    best_pose = None
    best_manipulability = -1.0

    # Search over candidate base poses
    for _ in range(n_candidates):
        # Random base pose in reachable area (x: 0.3-1.0m, y: -0.3-0.3m, theta: -pi to pi)
        q_base = np.array([
            random.uniform(0.3, 1.0),
            random.uniform(-0.3, 0.3),
            random.uniform(-np.pi, np.pi),
        ])

        # Compute IK for the arm to reach object_pos from this base
        q_arm = compute_ik_for_base(object_pos, q_base, robot_model)
        if q_arm is None:
            continue  # IK failed for this base pose

        # Compute manipulability
        q_full = np.concatenate([q_base, q_arm])
        w = compute_manipulability_from_jacobian(q_full, robot_model)

        if w > best_manipulability:
            best_manipulability = w
            best_pose = q_base

    return best_pose, best_manipulability
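A hedged usage sketch tying this back to the Tier 1 pipeline: it assumes the candidate sampling ranges inside compute_manipulability_optimized_base_pose have been adapted to your map frame, that `robot_model` is whatever object your IK and Jacobian helpers expect, and that `node` is a DecoupledMobileManipulator instance.

# Usage sketch: feed the manipulability-optimized base pose into Tier 1 navigation.
object_in_map = np.array([2.4, 1.6, 0.75])   # object position in map frame (example values)

base_pose, w = compute_manipulability_optimized_base_pose(object_in_map, robot_model)
if base_pose is None:
    node.get_logger().error("No reachable base pose found for this object")
else:
    node.get_logger().info(f"Best base pose {base_pose}, manipulability {w:.3f}")
    node.navigate_to(x=float(base_pose[0]), y=float(base_pose[1]), yaw=float(base_pose[2]))
    # After arriving, express the object pose in base_link and run the pick sequence.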

6. Tier 3 — Modern: Foundation Model-Based Manipulation

6.1 Concept

Foundation models (large neural networks pre-trained on massive data) have revolutionized manipulation. Vision-Language-Action (VLA) models like RT-2, Octo, and π0 learn a single policy that:

  1. Takes RGB images (and optionally language instructions) as input
  2. Outputs continuous robot actions (end-effector delta pose + gripper command)

This is fundamentally different from Tier 1–2 because the same model can handle novel objects, scenes, and tasks without explicit geometry or IK.

  ┌──────────────────────────────────────────────────────────────────┐
  │              Vision-Language-Action (VLA) Model                   │
  │                                                                  │
  │   [Image] ──┐                                                    │
  │             │                                                    │
  │   [Lang]  ──┼──► ┌──────────────┐ ──► Action Token Stream        │
  │             │    │  Vision      │       [Δx, Δy, Δz, Δroll,      │
  │             │    │  Encoder     │        Δpitch, Δyaw, grip]     │
  │             │    │  (ViT/Sam)   │                                 │
  │             │    └──────────────┘                                 │
  │             │           │                                         │
  │             │           ▼                                          │
  │             │    ┌──────────────┐                                 │
  │             └───►│  LLM Core    │◄── (frozen or fine-tuned)      │
  │                  │  (PaLM-E /   │                                 │
  │                  │   LLaMA /    │                                 │
  │                  │   RT-2 Core)  │                                 │
  │                  └──────────────┘                                 │
  │                           │                                       │
  │                           ▼                                       │
  │                  ┌──────────────┐                                 │
  │                  │  Action Head │                                 │
  │                  │  (token →    │                                 │
  │                  │   continuous)│                                 │
  │                  └──────────────┘                                 │
  └──────────────────────────────────────────────────────────────────┘

6.2 Action Tokenization

RT-2-style models discretize continuous actions into tokens (inspired by language tokens):

  1. Discretize each action dimension into \(B\) bins (e.g., \(B = 256\)), as sketched after this list:
     \[ \text{action\_id} = \mathrm{clip}\left( \left\lfloor \frac{a - a_{\min}}{a_{\max} - a_{\min}} \cdot B \right\rfloor,\; 0,\; B-1 \right) \]

  2. Treat each action_id as a vocabulary token, concatenated with language tokens

  3. Train with standard language modeling loss (next-token prediction)
  4. At inference, sample action tokens and map back to continuous values
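A small sketch of this tokenization scheme; the bin count and action ranges below are illustrative, not taken from any specific model.

# Discretize continuous actions into token ids and decode them back (illustrative).
import numpy as np


def actions_to_tokens(a, a_min, a_max, B=256):
    """Map continuous actions to integer token ids in [0, B-1]."""
    ids = np.floor((a - a_min) / (a_max - a_min) * B).astype(int)
    return np.clip(ids, 0, B - 1)


def tokens_to_actions(ids, a_min, a_max, B=256):
    """Map token ids back to bin-center continuous values."""
    return a_min + (ids + 0.5) / B * (a_max - a_min)


# 7-D action: [dx, dy, dz, droll, dpitch, dyaw, gripper] (example ranges)
a_min = np.array([-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0])
a_max = np.array([ 0.05,  0.05,  0.05,  0.25,  0.25,  0.25, 1.0])

a = np.array([0.02, -0.01, 0.0, 0.0, 0.1, 0.0, 1.0])
ids = actions_to_tokens(a, a_min, a_max)          # appended to the language token stream
a_rec = tokens_to_actions(ids, a_min, a_max)      # decoded back at inference time
print(ids, np.round(a_rec, 3))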

6.3 ROS 2 VLA Integration

"""
Tier 3: Foundation Model Mobile Manipulation
===============================================
Integrate a VLA model (e.g., Octo) with ROS 2 for
generalized mobile manipulation.
"""

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image, JointState
from geometry_msgs.msg import Twist
from cv_bridge import CvBridge
import numpy as np
import torch

try:
    from octo.model import OctoModel  # VLA policy (see the Octo project for installation)
except ImportError:
    print("[WARN] Octo not installed — running without a VLA model")
    OctoModel = None


class VLAMobileManipulator(Node):
    """
    VLA-based mobile manipulation.

    The VLA model takes a camera image and a language instruction,
    then outputs continuous actions that control both the mobile base
    and the arm. This is a single end-to-end policy — no separate
    navigation or IK module needed.
    """

    def __init__(self):
        super().__init__('vla_mobile_manipulator')
        self.bridge = CvBridge()

        # ── Load VLA model ────────────────────────────────────────────
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.get_logger().info(f"Using device: {self.device}")

        if OctoModel is not None:
            # Load Octo-small (pre-trained on the OXE dataset).
            # NOTE: the checkpoint name and inference calls in this class are
            # schematic — adapt them to the exact API of the model release used.
            self.model = OctoModel.load_pretrained("octo-small-1")
            self.model.to(self.device)
            self.get_logger().info("Octo model loaded")
        else:
            self.model = None
            self.get_logger().warn("VLA model not available — running in simulation mode")

        # ── ROS interfaces ────────────────────────────────────────────
        self.image_sub = self.create_subscription(
            Image,
            "/camera/color/image_raw",
            self.image_callback,
            10
        )
        self.joint_sub = self.create_subscription(
            JointState,
            "/joint_states",
            self.joint_callback,
            10
        )

        # Base + arm velocity command publisher
        self.base_cmd_pub = self.create_publisher(Twist, "/cmd_vel", 10)
        self.arm_cmd_pub = self.create_publisher(
            JointState, "/arm_controller/joint_commands", 10
        )

        self.current_joints = np.zeros(6)  # arm joint states (6-DOF arm)
        self.current_image = None
        self.get_logger().info("VLA Mobile Manipulator initialized")

    def image_callback(self, msg: Image):
        """Receive camera image."""
        try:
            self.current_image = self.bridge.imgmsg_to_cv2(msg, desired_encoding="rgb8")
        except Exception as e:
            self.get_logger().error(f"Image conversion failed: {e}")

    def joint_callback(self, msg: JointState):
        """Receive current joint states."""
        self.current_joints = np.array(msg.position[:6])

    def compute_vla_action(self, instruction: str,
                             image: np.ndarray,
                             arm_joints: np.ndarray) -> np.ndarray:
        """
        Query the VLA model for an action given the current observation.

        Parameters
        ----------
        instruction : str
            Language instruction, e.g., "pick up the red cup"
        image : ndarray
            Current RGB camera image (H, W, 3)
        arm_joints : ndarray
            Current arm joint positions

        Returns
        -------
        action : ndarray
            Predicted action: [base_vx, base_vy, base_omega,
                              arm_delta_q1, ..., arm_delta_qn, gripper]
        """
        if self.model is None:
            # Simulation mode: return zero action
            return np.zeros(10)

        # Preprocess image
        image_tensor = torch.from_numpy(image).permute(2, 0, 1)  # CHW
        image_tensor = torch.nn.functional.interpolate(
            image_tensor.unsqueeze(0).float(), size=(256, 256)
        ).squeeze(0)

        # Model inference (schematic call — the instruction is passed as raw
        # text and tokenized by the model; adapt to your model's actual API)
        with torch.no_grad():
            # Expected output shape: (batch=1, pred_horizon=1, action_dim)
            action = self.model.predict_action(
                image=image_tensor.to(self.device).unsqueeze(0),
                text=[instruction],
            )

        return action[0, 0].cpu().numpy()

    def execute_action(self, action: np.ndarray):
        """
        Send action to robot hardware.

        Action format: [base_vx, base_vy, base_omega,
                        delta_q1, ..., delta_q6, gripper]
        """
        # ── Mobile base velocity ───────────────────────────────────────
        base_cmd = Twist()
        base_cmd.linear.x = float(action[0])
        base_cmd.linear.y = float(action[1])
        base_cmd.angular.z = float(action[2])
        self.base_cmd_pub.publish(base_cmd)

        # ── Arm joint delta commands ──────────────────────────────────
        n_arm_joints = len(self.current_joints)
        arm_delta = action[3:3 + n_arm_joints]
        new_joints = self.current_joints + arm_delta

        arm_cmd = JointState()
        arm_cmd.header.stamp = self.get_clock().now().to_msg()
        arm_cmd.name = [
            "shoulder_pan_joint", "shoulder_lift_joint", "elbow_joint",
            "wrist_1_joint", "wrist_2_joint", "wrist_3_joint"
        ]
        arm_cmd.position = new_joints.tolist()
        self.arm_cmd_pub.publish(arm_cmd)

        # ── Gripper ────────────────────────────────────────────────────
        gripper_cmd = JointState()
        gripper_cmd.name = ["gripper_joint"]
        gripper_cmd.position = [float(action[-1])]
        # Publish to gripper controller...

    def run_policy(self, instruction: str, max_steps: int = 200):
        """
        Run the VLA policy for a given instruction.

        Parameters
        ----------
        instruction : str
            Language command to execute
        max_steps : int
            Maximum number of control steps
        """
        self.get_logger().info(f"Executing: '{instruction}'")

        period = 0.1  # ~10 Hz control loop
        for step in range(max_steps):
            # Process pending image / joint-state callbacks, then pace the loop
            rclpy.spin_once(self, timeout_sec=period)
            if self.current_image is None:
                continue

            # Query model
            action = self.compute_vla_action(
                instruction, self.current_image, self.current_joints
            )

            # Execute on robot
            self.execute_action(action)

            # Termination heuristic: stop once the predicted action is near zero
            if np.abs(action).max() < 1e-3:
                self.get_logger().info("Action converged — stopping")
                break

        # Stop robot (zero base velocity, zero arm delta, gripper unchanged)
        self.execute_action(np.zeros(10))
        self.get_logger().info("Policy execution complete")


def main():
    rclpy.init()
    node = VLAMobileManipulator()

    # ── Example instructions ──────────────────────────────────────────
    instructions = [
        "pick up the cup on the table",
        "place the cup on the shelf",
        "push the button on the wall",
        "open the drawer",
    ]

    for instruction in instructions:
        node.run_policy(instruction, max_steps=200)

    rclpy.shutdown()


if __name__ == "__main__":
    main()

6.4 Sim-to-Real Transfer

Training VLA models involves significant domain gap between simulation and reality:

| Challenge | Description | Mitigation |
|-----------|-------------|------------|
| Visual domain gap | Different textures, lighting, camera noise | Domain randomization (DR), neural rendering (NeRF, 3DGS) |
| Physics gap | Contact dynamics, friction, compliance | RL fine-tuning on hardware, system identification |
| Action space mismatch | Different robot models | Action normalization, cross-embodiment training |
| Data efficiency | VLAs need millions of demos | Pre-training on large datasets (OXE, BridgeData) |
| Latency | Model inference is slow | ONNX optimization, GPU batching, action chunking |
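As a sketch of the action-chunking mitigation from the last row, the loop below queries the model once per chunk and executes the predicted actions at the control rate. `predict_action_chunk` is a hypothetical helper standing in for a model that predicts a short action horizon per call; the rest reuses the VLAMobileManipulator node from Section 6.3.

# Action chunking sketch: one model query per chunk, executed open-loop.
import numpy as np
import rclpy


def run_policy_chunked(node, instruction, chunk_len=4, max_chunks=50, period=0.1):
    """Execute chunk_len actions per model query instead of one (node: VLAMobileManipulator)."""
    for _ in range(max_chunks):
        rclpy.spin_once(node, timeout_sec=period)
        if node.current_image is None:
            continue
        # Hypothetical chunked prediction returning shape (chunk_len, action_dim)
        chunk = predict_action_chunk(node.model, node.current_image,
                                     instruction, horizon=chunk_len)
        for action in chunk:
            node.execute_action(action)          # run the whole chunk between queries
            rclpy.spin_once(node, timeout_sec=period)
    node.execute_action(np.zeros(10))            # stop the robot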

7. Comparison of Approaches

| Criteria | Tier 1: Decoupled | Tier 2: Whole-Body | Tier 3: VLA Foundation |
|----------|-------------------|--------------------|------------------------|
| Base + arm coordination | ❌ Independent | ✅ Joint | ✅ Implicit (learned) |
| Novel object handling | ❌ Needs model | ❌ Needs model | ✅ Zero-shot |
| Generalization | Task-specific | Task-specific | Task-general |
| Real-time performance | ✅ Fast | ✅ ~1–5 Hz | ⚠️ Depends on model size |
| Hardware requirements | Low | Medium | High (GPU required) |
| Setup complexity | Low | Medium | High |
| Collision awareness | ❌ Arm only | ✅ Full body | ⚠️ Learned |
| Singularity handling | ❌ Manual | ✅ Jacobian-based | ⚠️ Learned |
| Best for | Structured environments | Semi-structured | Open-world manipulation |
| Code complexity | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Reliability | High (predictable) | High (verifiable) | Medium (stochastic) |

8. Step-by-Step Implementation Guide

Phase 1 — Robot Setup (All Tiers)

  1. URDF / SDF: Define the complete robot — base + arm + sensors — in URDF
  2. Kinematics: Load the URDF into MoveIt and configure kinematic chains
  3. Calibration: Calibrate base-to-camera transform (tf tree)
  4. Gazebo simulation: Verify the robot loads and moves in simulation

Phase 2 — Tier 1: Decoupled (Week 1)

  1. Launch the Navigation 2 stack with SLAM (see SLAM project)
  2. Define a set of approach poses for known pick locations
  3. Implement compute_ik() using MoveIt's IK service
  4. Run the pick-and-place demo with known object poses
  5. Test: navigate to pose → IK → grasp → lift

Phase 3 — Tier 2: Whole-Body (Week 2)

  1. Define the whole_body planning group in SRDF (base + arm chain)
  2. Configure OMPL with RRTstar or PRMstar for whole-body planning
  3. Add collision objects (tables, shelves) to the planning scene
  4. Replace the Tier 1 decoupled navigate_to() + compute_ik() with whole_body_pick()
  5. Implement manipulability optimization for base pose selection
  6. Test: single whole-body plan for pick, verify collision avoidance

Phase 4 — Tier 3: VLA Integration (Week 3)

  1. Install Octo or RT-2 (requires Docker + GPU environment)
  2. Run inference on a pre-trained model with the real camera feed
  3. Add action normalization (map model output to your robot's action limits)
  4. Implement a safety layer: clamp velocities, check joint limits before sending commands (see the sketch after this list)
  5. Fine-tune on domain-specific data if performance is insufficient
  6. Test: give language commands like "pick up the blue box"
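A minimal sketch of the safety layer from step 4, assuming the 10-D action layout used in Section 6.3 (3 base velocities, 6 arm joint deltas, 1 gripper command). All limit values are placeholders to be replaced with your robot's actual limits.

# Safety layer sketch: clamp base velocities and arm joint targets before sending.
import numpy as np

BASE_V_MAX, BASE_W_MAX = 0.3, 0.6                 # m/s, rad/s (placeholders)
JOINT_LOWER = np.array([-2.5, -2.0, -2.8, -2.0, -2.0, -2.0])
JOINT_UPPER = np.array([ 2.5,  2.0,  2.8,  2.0,  2.0,  2.0])
MAX_JOINT_DELTA = 0.05                            # rad per control step


def clamp_action(action, current_joints):
    """Return a safety-clamped copy of a VLA action (see execute_action format)."""
    a = np.array(action, dtype=float).copy()
    a[0:2] = np.clip(a[0:2], -BASE_V_MAX, BASE_V_MAX)      # base vx, vy
    a[2] = np.clip(a[2], -BASE_W_MAX, BASE_W_MAX)          # base omega
    n = len(current_joints)
    delta = np.clip(a[3:3 + n], -MAX_JOINT_DELTA, MAX_JOINT_DELTA)
    target = np.clip(current_joints + delta, JOINT_LOWER[:n], JOINT_UPPER[:n])
    a[3:3 + n] = target - current_joints                    # re-derive the safe delta
    a[-1] = np.clip(a[-1], 0.0, 1.0)                        # gripper command
    return a

Call clamp_action() on every model output inside execute_action() (or just before it) so that no raw network prediction ever reaches the hardware.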

9. Extensions and Variations

9.1 Mobile Base Navigation with Arm Constraints

Use Task Space Region (TSR) constraints in MoveIt to ensure the arm maintains a specific orientation or stays within a workspace while the base navigates:

# Keep arm pointing downward while base navigates
from moveit_msgs.msg import Constraints, OrientationConstraint

orientation_constraint = OrientationConstraint()
orientation_constraint.header.frame_id = "base_link"
orientation_constraint.link_name = "gripper_link"
orientation_constraint.orientation.w = 1.0
orientation_constraint.absolute_x_axis_tolerance = 0.1
orientation_constraint.absolute_y_axis_tolerance = 0.1
orientation_constraint.absolute_z_axis_tolerance = 3.14  # free rotation around Z
orientation_constraint.weight = 1.0

constraints = Constraints()
constraints.orientation_constraints.append(orientation_constraint)
move_group.setPathConstraints(constraints)

9.2 Grasp Pose Detection

Replace hard-coded grasp poses with learning-based grasp detection:

| Method | Description | ROS Package |
|--------|-------------|-------------|
| GraspNet | 6-DOF grasp pose prediction from point clouds | graspnet_ros |
| Contact-GraspNet | Contact-level grasp synthesis from partial views | contact_graspnet |
| GPD | Geometric approach to grasp detection | gpd |
| 6-DOF GraspNet (Isaac) | NVIDIA's optimized grasp detection | Isaac SDK |
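These detectors typically return grasps as 4×4 homogeneous transforms in the camera frame. A small conversion sketch follows; the frame name is an example, and in practice the result should be transformed into the planning frame with tf2 before handing it to the pick routines above.

# Convert a detected 6-DOF grasp (4x4 transform) into a PoseStamped target.
import numpy as np
from geometry_msgs.msg import PoseStamped
from tf_transformations import quaternion_from_matrix


def grasp_to_pose_stamped(T_grasp: np.ndarray,
                          frame_id: str = "camera_color_optical_frame") -> PoseStamped:
    """T_grasp: 4x4 homogeneous transform of the gripper pose in frame_id."""
    pose = PoseStamped()
    pose.header.frame_id = frame_id
    pose.pose.position.x = float(T_grasp[0, 3])
    pose.pose.position.y = float(T_grasp[1, 3])
    pose.pose.position.z = float(T_grasp[2, 3])
    q = quaternion_from_matrix(T_grasp)      # expects a full 4x4 matrix
    pose.pose.orientation.x = float(q[0])
    pose.pose.orientation.y = float(q[1])
    pose.pose.orientation.z = float(q[2])
    pose.pose.orientation.w = float(q[3])
    return pose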

9.3 Dual-Arm Mobile Manipulation

Extend to two arms for bimanual tasks (e.g., holding an object with one arm while moving it with the base, or cooperative carrying):

<!-- Dual-arm whole-body group -->
<group name="dual_arm_whole_body">
  <chain base_link="base_footprint" tip_link="left_gripper_link"/>
  <chain base_link="base_footprint" tip_link="right_gripper_link"/>
</group>

9.4 Hierarchical Planning

Combine Tier 2 and Tier 3 in a hierarchical architecture:

  Task Planner (LLM/VLA) ──► High-level goal: "place cup on shelf"
  Motion Planner (Tier 2) ──► Whole-body trajectory for base + arm
  Controller (PID / impedance) ──► Execute trajectory on hardware
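A minimal sketch of this hierarchy, with a hard-coded stand-in for the LLM task planner and the Tier 2 node from Section 5.4 doing the motion-level work. Skill names and object positions are illustrative.

# Hierarchical planning sketch: symbolic task steps grounded by the Tier 2 planner.
import numpy as np


def task_planner(goal: str) -> list:
    """Stand-in for an LLM/VLA task planner returning (skill, argument) steps."""
    if goal == "place cup on shelf":
        return [("pick", "cup"), ("place", "shelf")]
    return []


def execute_task(node, goal: str, known_poses: dict) -> bool:
    """node is a WholeBodyMobileManipulator; known_poses maps object names to xyz."""
    for skill, target in task_planner(goal):
        if skill == "pick":
            ok = node.whole_body_pick(np.array(known_poses[target]))
        else:
            node.get_logger().warn(f"Skill '{skill}' not implemented in this sketch")
            ok = True
        if not ok:
            node.get_logger().error(f"Step ({skill}, {target}) failed — aborting task")
            return False
    return True


known_poses = {"cup": [0.65, 0.1, 0.75], "shelf": [0.8, 0.3, 0.35]}  # example values
# execute_task(node, "place cup on shelf", known_poses)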

10. References