Mobile Manipulation¶
Project Type: Manipulation + Navigation | Difficulty: ★★★☆☆ to ★★★★★ (approach-dependent) | Estimated Time: 2–4 weekends
1. Project Overview¶
Mobile manipulation combines mobility (navigating a mobile base) with manipulation (controlling a robotic arm) to enable a robot to approach a scene, locate an object, and act on it — picking it up, placing it, opening a drawer, or pressing a button. This is one of the canonical problems in service robotics.
┌──────────────────────────────────────────────────────────────────┐
│ Mobile Manipulation Pipeline (Top View) │
│ │
│ ┌──────────┐ │
│ │ Scene / │ │
│ │ Objects │ │
│ └────┬─────┘ │
│ │ │
│ ▼ (perceive) │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ Grounding │──►│ Reachability │ │
│ │ (detect / │ │ Analysis │ │
│ │ localize) │ │ (which approach │ │
│ └──────────────┘ │ pose is valid?) │ │
│ └────────┬─────────┘ │
│ │ │
│ ▼ (plan) │
│ ┌───────────────────────────────────┐ │
│ │ Whole-Body Planner │ │
│ │ (base pose + arm joint targets) │ │
│ └──────────────┬────────────────────┘ │
│ │ │
│ ┌──────────────┴────────────────────┐ │
│ │ │ │
│ ▼ (execute) ▼ (execute) │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Mobile Base │ │ Arm + Gripper│ │
│ │ Controller │ │ Controller │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ▼ ▼ │
│ Move to pose Grasp / Manipulate │
│ │
│ Robot = [Mobile Base] + [Arm] + [End-Effector / Gripper] │
└──────────────────────────────────────────────────────────────────┘
In this project you will explore three progressively more sophisticated approaches:
| Tier | Approach | Key Technique | ROS Package |
|---|---|---|---|
| Tier 1 | Decoupled | Navigate to approach pose, then IK arm | move_base + moveit_ros |
| Tier 2 | Whole-Body | Joint optimization of base + arm | moveit_ros (whole-body) + OMPL |
| Tier 3 | Foundation Models | VLA policy (RT-2 / Octo / π0) | ros2_manipulation + inference server |
2. Learning Objectives¶
- Understand the mobile manipulator kinematics model (base + arm)
- Implement decoupled navigation + inverse kinematics (IK) control
- Perform whole-body manipulability analysis and joint optimization
- Configure MoveIt 2 for a mobile manipulator in ROS 2
- Integrate object detection with grasp planning
- Understand Vision-Language-Action (VLA) foundation models for manipulation
3. Hardware & Software Requirements¶
Hardware¶
| Component | Specification | Notes |
|---|---|---|
| Mobile manipulator | Stretch RE-1, Fetch, or custom (e.g., LoCoArm + TurtleBot) | Any mobile base + arm combination |
| Depth camera | Intel RealSense D435i / Azure Kinect | For object detection & point clouds |
| LiDAR (optional) | 2D LiDAR for navigation (e.g., RPLIDAR A1) | Only needed for Tier 1–2 |
| Onboard computer | NVIDIA Jetson Orin / x86 with ROS 2 | GPU recommended for Tier 3 |
| Gripper | Parallel-jaw (Robotiq 2F-85) or custom | Actuated gripper |
| Battery | Sufficient for 2+ hours of operation | |
Software¶
| Package | Version | Purpose |
|---|---|---|
| ROS 2 | Humble / Iron | Core framework |
| MoveIt 2 | Humble | Whole-body motion planning |
| Navigation 2 | Humble | Mobile base navigation |
| OpenCV | ≥ 4.5 | Image processing |
| PyTorch | ≥ 1.13 | Neural network inference (Tier 3) |
| Grasp Pose Detection | GraspNet / Contact-GraspNet | Grasp planning |
| OMPL | (via MoveIt) | Sampling-based motion planning |
| Isaac Sim / Gazebo | Latest | Simulation |
# Install core dependencies (ROS 2 Humble)
sudo apt install ros-humble-moveit ros-humble-navigation2 ros-humble-ros2-control
pip install opencv-python numpy torch torchvision open3d
4. Tier 1 — Traditional: Decoupled Navigation + IK Control¶
4.1 Concept¶
The simplest approach is to decouple the mobile base and the arm:
- Navigate the mobile base to an approach pose near the target object
- Lock the base, then solve inverse kinematics (IK) for the arm to reach the object
- Execute the grasp
This approach is easy to implement and debug, but it cannot adapt the base pose to improve arm reachability — the base might arrive at a location where the arm cannot comfortably reach the target.
Step 1: Navigate Step 2: Lock base, solve IK
┌──────────────────┐ ┌────────────────────────┐
│ │ │ │
│ [BASE]──► │ │ [BASE] [ARM reaches] │
│ ↓ │ │ ↖ ───► [OBJ] │
│ approach pose │ │ IK solution │
└──────────────────┘ └────────────────────────┘
4.2 Kinematics Model¶
For a decoupled system, the forward kinematics of the arm maps joint positions to the end-effector pose:

\[
\mathbf{x}_{ee} = f(\mathbf{q}_{arm})
\]

where \(\mathbf{q}_{arm} \in \mathbb{R}^n\) are the arm joint positions and \(\mathbf{x}_{ee} = [p_x, p_y, p_z, \phi, \theta, \psi]\) is the end-effector pose (position + orientation).
The inverse kinematics solves the reverse problem:

\[
\mathbf{q}_{arm} = f^{-1}(\mathbf{x}_{ee}^{des})
\]

For a 6-DOF arm, we typically use a numerical IK solver (e.g., TRAC-IK or KDL):
def compute_ik(robot_model, q_init, target_pose, timeout=1.0):
    """
    Solve IK for the arm to reach a desired end-effector pose.

    Parameters
    ----------
    robot_model : robot_model_loader.RobotModelLoader
        Loaded robot model (URDF + SRDF)
    q_init : list[float]
        Initial joint configuration guess (seed for the IK service;
        unused in this MoveGroup-based sketch)
    target_pose : geometry_msgs/Pose
        Desired end-effector pose in the base frame
    timeout : float
        Solver timeout in seconds

    Returns
    -------
    list[float] or None
        Solved joint configuration, or None if no solution was found
    """
    # Assumes a Python MoveGroupInterface binding backed by a TRAC-IK or
    # KDL kinematics plugin (selected in kinematics.yaml)
    group_name = "arm_group"
    move_group = MoveGroupInterface(group_name, "robot_description")
    move_group.setPoseTarget(target_pose)
    move_group.setPlanningTime(timeout)
    move_group.setMaxVelocityScalingFactor(0.5)
    move_group.setMaxAccelerationScalingFactor(0.5)
    plan = move_group.plan()
    if plan[0]:  # success flag
        return plan[1].joint_trajectory.points[-1].positions
    return None
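To build intuition for what such a numerical solver does internally, here is a library-free damped-least-squares IK loop for a hypothetical planar 2-link arm (the link lengths are made up for illustration; this is the core iteration, not the TRAC-IK or KDL implementation):

```python
import numpy as np

def fk_2link(q, l1=0.4, l2=0.3):
    """Forward kinematics of a planar 2-link arm: joint angles -> (x, y)."""
    return np.array([l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
                     l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])])

def jac_2link(q, l1=0.4, l2=0.3):
    """Analytic 2x2 Jacobian of the planar arm."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def ik_dls(target, q_init, damping=0.05, max_iters=200, tol=1e-4):
    """Damped least squares: dq = J^T (J J^T + lambda^2 I)^-1 * error."""
    q = np.array(q_init, dtype=float)
    for _ in range(max_iters):
        err = target - fk_2link(q)
        if np.linalg.norm(err) < tol:
            return q
        J = jac_2link(q)
        q = q + J.T @ np.linalg.solve(J @ J.T + damping ** 2 * np.eye(2), err)
    return None  # no solution within the iteration budget

q_sol = ik_dls(np.array([0.5, 0.2]), q_init=[0.3, 0.3])
```

The damping term keeps the update well-conditioned near singularities, which is why solvers of this family behave more gracefully than a plain pseudoinverse.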
4.3 Approach Pose Selection¶
Before navigating, we must compute a valid approach pose for the base:
"""
Tier 1: Decoupled mobile manipulation
======================================
Navigate base to approach pose, then use IK to reach the object.
"""
import numpy as np
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import PoseStamped, Pose
from nav2_msgs.action import NavigateToPose
# NOTE: assumes a Python MoveGroupInterface / PlanningSceneInterface
# binding for MoveIt 2 (e.g., moveit_py or a project-specific wrapper)
from moveitInterfaces import MoveGroupInterface, PlanningSceneInterface
from rclpy.action import ActionClient
class DecoupledMobileManipulator(Node):
"""
Decoupled approach: navigate base to approach pose,
then use IK to control the arm.
"""
def __init__(self):
super().__init__('decoupled_mobile_manipulator')
# ── MoveIt interfaces ──────────────────────────────────────
self.arm_group = MoveGroupInterface("arm_group", "robot_description")
self.gripper_group = MoveGroupInterface("gripper_group", "robot_description")
self.scene = PlanningSceneInterface()
# ── Navigation action client ────────────────────────────────
self.nav_client = ActionClient(self, NavigateToPose, 'navigate_to_pose')
self.get_logger().info("Decoupled Mobile Manipulator ready")
def navigate_to(self, x: float, y: float, yaw: float, timeout: float = 30.0):
"""
Send the mobile base to (x, y, yaw) using Navigation 2.
Parameters
----------
x, y : float — Goal position in map frame
yaw : float — Goal orientation in radians
timeout : float — Maximum wait time in seconds
"""
from tf_transformations import quaternion_from_euler
goal = NavigateToPose.Goal()
goal.pose.header.frame_id = "map"
goal.pose.header.stamp = self.get_clock().now().to_msg()
goal.pose.pose.position.x = x
goal.pose.pose.position.y = y
goal.pose.pose.position.z = 0.0
q = quaternion_from_euler(0, 0, yaw)
goal.pose.pose.orientation.x = q[0]
goal.pose.pose.orientation.y = q[1]
goal.pose.pose.orientation.z = q[2]
goal.pose.pose.orientation.w = q[3]
self.get_logger().info(f"Navigating to ({x:.2f}, {y:.2f})...")
        self.nav_client.wait_for_server()
        send_goal = self.nav_client.send_goal_async(goal)
        rclpy.spin_until_future_complete(self, send_goal)
        goal_handle = send_goal.result()
        if goal_handle is None or not goal_handle.accepted:
            self.get_logger().warn("Navigation goal rejected")
            return False
        # Wait for the navigation result, up to `timeout` seconds
        result_future = goal_handle.get_result_async()
        rclpy.spin_until_future_complete(self, result_future, timeout_sec=timeout)
        if result_future.done():
            self.get_logger().info("Navigation succeeded!")
            return True
        self.get_logger().warn("Navigation failed or timed out")
        return False
def pick_object(self, object_pose: PoseStamped, approach_height: float = 0.15):
"""
Pick an object using IK-based arm control.
The sequence: pre-grasp → grasp → lift.
Parameters
----------
object_pose : PoseStamped
Position of the object in the base frame
approach_height : float
How high above the object to approach before descending
"""
# ── 1. Compute approach pose (above the object) ─────────────
        pregrasp_pose = Pose()
        pregrasp_pose.position.x = object_pose.pose.position.x
        pregrasp_pose.position.y = object_pose.pose.position.y
        pregrasp_pose.position.z = object_pose.pose.position.z + approach_height  # go up first
        pregrasp_pose.orientation = object_pose.pose.orientation
# ── 2. Pre-grasp: move arm to approach pose via IK ──────────
self.arm_group.setPoseTarget(pregrasp_pose)
success = self.arm_group.go(wait=True)
self.arm_group.stop()
self.arm_group.clearPoseTargets()
if not success:
self.get_logger().error("Pre-grasp IK failed")
return False
# ── 3. Descend to grasp pose ─────────────────────────────────
        grasp_pose = Pose()
        grasp_pose.position.x = object_pose.pose.position.x
        grasp_pose.position.y = object_pose.pose.position.y
        grasp_pose.position.z = object_pose.pose.position.z
        grasp_pose.orientation = object_pose.pose.orientation
        # Linear interpolation descent (10 steps)
        for t in np.linspace(0.0, 1.0, 10):
            intermediate = Pose()
            intermediate.position.x = grasp_pose.position.x
            intermediate.position.y = grasp_pose.position.y
            intermediate.position.z = grasp_pose.position.z + (1 - t) * approach_height
            intermediate.orientation = grasp_pose.orientation
            self.arm_group.setPoseTarget(intermediate)
            self.arm_group.go(wait=True)
        self.arm_group.stop()
        self.arm_group.clearPoseTargets()
# ── 4. Close gripper ─────────────────────────────────────────
self.gripper_group.setNamedTarget("close")
self.gripper_group.go(wait=True)
self.gripper_group.stop()
        # ── 5. Lift ──────────────────────────────────────────────────
        lift_pose = Pose()  # build a fresh pose; don't mutate grasp_pose
        lift_pose.position.x = grasp_pose.position.x
        lift_pose.position.y = grasp_pose.position.y
        lift_pose.position.z = grasp_pose.position.z + 0.2
        lift_pose.orientation = grasp_pose.orientation
        self.arm_group.setPoseTarget(lift_pose)
        self.arm_group.go(wait=True)
        self.arm_group.stop()
        self.arm_group.clearPoseTargets()
self.get_logger().info("Pick completed!")
return True
def place_object(self, place_pose: PoseStamped):
"""
Place the object at the desired location.
"""
        # Descend to the place pose in small linear steps
        for t in np.linspace(0.0, 1.0, 10):
            intermediate = Pose()
            intermediate.position.x = place_pose.pose.position.x
            intermediate.position.y = place_pose.pose.position.y
            intermediate.position.z = place_pose.pose.position.z + (1 - t) * 0.2
            intermediate.orientation = place_pose.pose.orientation
            self.arm_group.setPoseTarget(intermediate)
            self.arm_group.go(wait=True)
        self.arm_group.stop()
        self.arm_group.clearPoseTargets()
# Open gripper
self.gripper_group.setNamedTarget("open")
self.gripper_group.go(wait=True)
self.gripper_group.stop()
# Retract
        retreat_pose = Pose()
        retreat_pose.position.x = place_pose.pose.position.x
        retreat_pose.position.y = place_pose.pose.position.y
        retreat_pose.position.z = place_pose.pose.position.z + 0.3
        retreat_pose.orientation = place_pose.pose.orientation
self.arm_group.setPoseTarget(retreat_pose)
self.arm_group.go(wait=True)
self.arm_group.stop()
self.arm_group.clearPoseTargets()
def main():
rclpy.init()
node = DecoupledMobileManipulator()
# ── Demo sequence ────────────────────────────────────────────────
# Step 1: Navigate to a pre-defined approach pose
success = node.navigate_to(x=2.0, y=1.5, yaw=np.pi / 2)
if not success:
node.get_logger().error("Could not reach approach pose!")
rclpy.shutdown()
return
# Step 2: Pick object (object pose from perception system)
object_pose = PoseStamped()
object_pose.header.frame_id = "base_link"
object_pose.pose.position.x = 0.4
object_pose.pose.position.y = 0.0
object_pose.pose.position.z = 0.6
object_pose.pose.orientation.w = 1.0
node.pick_object(object_pose)
# Step 3: Navigate to place location
node.navigate_to(x=3.0, y=1.5, yaw=np.pi / 2)
node.place_object(object_pose)
node.get_logger().info("Demo complete!")
rclpy.shutdown()
if __name__ == "__main__":
main()
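The demo above hard-codes its approach pose. A simple way to derive one (a hypothetical helper, not part of Nav2) is to stop a fixed standoff distance short of the object along the line from the robot's current position, facing the object; the standoff value must be tuned to your arm's reach:

```python
import numpy as np

def approach_pose(base_xy, obj_xy, standoff=0.6):
    """
    Compute a base goal (x, y, yaw): `standoff` meters short of the
    object along the current base->object line, oriented to face it.
    """
    base_xy = np.asarray(base_xy, dtype=float)
    obj_xy = np.asarray(obj_xy, dtype=float)
    d = obj_xy - base_xy
    dist = np.linalg.norm(d)
    if dist < standoff:
        raise ValueError("base is already within the standoff distance")
    u = d / dist                   # unit vector toward the object
    goal = obj_xy - standoff * u   # stop short of the object
    yaw = np.arctan2(u[1], u[0])   # face the object
    return goal[0], goal[1], yaw

x, y, yaw = approach_pose([0.0, 0.0], [2.0, 0.0], standoff=0.5)
```

The returned tuple can be fed directly to navigate_to(x, y, yaw). Approach-direction selection (e.g., avoiding approaching across a table edge) is the part this sketch leaves out.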
4.4 Limitations of the Decoupled Approach¶
| Issue | Impact |
|---|---|
| Fixed base pose | Arm may reach the object but in a singularity or near joint limit |
| No base pose optimization | Suboptimal grasp direction (wrist may be inverted) |
| No base-to-environment collision reasoning | Base might stop where it blocks the arm's workspace |
| Sequential failures | A failure in either stage wastes the work completed in the other |
5. Tier 2 — Intermediate: Whole-Body Planning¶
5.1 Concept¶
Whole-body planning jointly optimizes the mobile base pose and the arm joint configuration to simultaneously satisfy:
- Reachability — the end-effector must reach the object
- Manipulability — the arm should be in a dexterous configuration (not near singularities)
- Collision-free — the entire robot body avoids obstacles
- Smooth motion — minimize joint effort / travel distance
Decoupled: Whole-Body:
┌──────────────┐ ┌──────────────────────────┐
│ Base → pose │ │ Optimize base + arm │
│ Arm → IK │ │ jointly: │
│ (two steps) │ │ │
└──────────────┘ │ min ||q - q_init||² │
│ s.t. ||x_ee - x_obj|| < ε │
│ J(q) q̇ = ẋ │
└──────────────────────────┘
5.2 Whole-Body Kinematics¶
The combined robot state is:

\[
\mathbf{q} = \begin{bmatrix} \mathbf{q}_{base} \\ \mathbf{q}_{arm} \end{bmatrix} \in \mathbb{R}^{3+n}
\]

where \(\mathbf{q}_{base} = [x, y, \theta]\) for a differential-drive base and \(\mathbf{q}_{arm} \in \mathbb{R}^n\) for the arm joints.
The whole-body Jacobian maps joint velocities to end-effector velocity:

\[
\dot{\mathbf{x}}_{ee} = J_{wb}(\mathbf{q})\,\dot{\mathbf{q}}
\]

where \(J_{wb} \in \mathbb{R}^{6 \times (3+n)}\) is the whole-body Jacobian, formed by stacking the base and arm contributions side by side:

\[
J_{wb} = \begin{bmatrix} J_{base} & J_{arm} \end{bmatrix}
\]

The manipulability measure (Yoshikawa) for the arm:

\[
w(\mathbf{q}_{arm}) = \sqrt{\det\left( J_{arm} J_{arm}^{T} \right)}
\]

Maximizing \(w\) keeps the arm away from singularities.
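To make the whole-body idea concrete, here is a toy numpy-only sketch: a planar base \((x, y, \theta)\) plus a 2-link arm (made-up link lengths), a finite-difference whole-body Jacobian, and a damped-least-squares step that moves base and arm together toward the target; the Yoshikawa measure is evaluated on the same Jacobian. This illustrates the math only, not the MoveIt/OMPL planner:

```python
import numpy as np

L1, L2 = 0.4, 0.3  # hypothetical link lengths

def wb_fk(q):
    """Planar whole-body FK: q = [x, y, theta, q1, q2] -> EE position."""
    x, y, th, q1, q2 = q
    a1, a2 = th + q1, th + q1 + q2
    return np.array([x + L1 * np.cos(a1) + L2 * np.cos(a2),
                     y + L1 * np.sin(a1) + L2 * np.sin(a2)])

def wb_jacobian(q, eps=1e-6):
    """2x5 whole-body Jacobian d(EE)/dq by central finite differences."""
    J = np.zeros((2, 5))
    for i in range(5):
        dq = np.zeros(5)
        dq[i] = eps
        J[:, i] = (wb_fk(q + dq) - wb_fk(q - dq)) / (2 * eps)
    return J

def manipulability(J):
    """Yoshikawa measure w = sqrt(det(J J^T))."""
    return float(np.sqrt(max(np.linalg.det(J @ J.T), 0.0)))

def wb_ik_step(q, target, damping=0.05, max_step=0.1):
    """One damped-least-squares step: dq = J^T (J J^T + lambda^2 I)^-1 e."""
    err = np.clip(target - wb_fk(q), -max_step, max_step)  # limit step size
    J = wb_jacobian(q)
    JJt = J @ J.T + damping ** 2 * np.eye(2)
    return q + J.T @ np.linalg.solve(JJt, err)

q = np.array([0.0, 0.0, 0.0, 0.4, 0.6])   # base at origin, arm bent
target = np.array([1.2, 0.5])
for _ in range(100):
    q = wb_ik_step(q, target)
```

Because the base translation columns of \(J_{wb}\) never vanish, the combined system stays well-conditioned even when the arm alone would be near a singularity, which is exactly the redundancy whole-body planners exploit.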
5.3 MoveIt 2 Whole-Body Configuration¶
<!-- launch/whole_body_planning.launch.xml (ROS 2 XML launch) -->
<launch>
  <!-- Load robot description -->
  <node pkg="robot_state_publisher" exec="robot_state_publisher"
        name="robot_state_publisher" output="screen">
    <param name="robot_description"
           value="$(command 'cat $(find-pkg-share my_mobile_manipulator)/urdf/robot.urdf')"/>
  </node>
  <!-- MoveIt configuration -->
  <node pkg="moveit_ros_move_group" exec="move_group"
        name="move_group" output="screen">
    <!-- Load SRDF with kinematic chains -->
    <param name="robot_description"
           value="$(command 'cat $(find-pkg-share my_mobile_manipulator)/urdf/robot.urdf')"/>
    <param from="$(find-pkg-share my_mobile_manipulator)/config/ompl_planning.yaml"/>
    <param from="$(find-pkg-share my_mobile_manipulator)/config/kinematics.yaml"/>
    <param name="robot_description_semantic"
           value="$(command 'cat $(find-pkg-share my_mobile_manipulator)/config/robot.srdf')"/>
  </node>
</launch>
SRDF configuration — define the whole-body planning group:
<!-- config/robot.srdf -->
<robot name="mobile_manipulator">
<!-- Whole-body group: base + arm -->
<group name="whole_body">
<chain base_link="base_footprint" tip_link="gripper_link"/>
</group>
  <!-- Sub-groups for specialized control -->
  <group name="arm_group">
    <chain base_link="shoulder_link" tip_link="gripper_link"/>
  </group>
<group name="base_group">
<joint name="x_joint"/>
<joint name="y_joint"/>
<joint name="theta_joint"/>
</group>
<!-- Named home configuration -->
<group_state name="home" group="whole_body">
<joint name="x_joint" value="0"/>
<joint name="y_joint" value="0"/>
<joint name="theta_joint" value="0"/>
<joint name="shoulder_pan_joint" value="0"/>
<joint name="shoulder_lift_joint" value="-1.57"/>
<joint name="elbow_joint" value="1.57"/>
<joint name="wrist_1_joint" value="0"/>
<joint name="wrist_2_joint" value="-1.57"/>
<joint name="wrist_3_joint" value="0"/>
</group_state>
</robot>
5.4 Whole-Body Python Planning¶
"""
Tier 2: Whole-Body Mobile Manipulation with MoveIt 2
======================================================
Jointly optimize base pose + arm configuration using
the MoveGroupInterface with a whole-body group.
"""
import numpy as np
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Pose, PoseStamped
# NOTE: assumes a Python MoveGroupInterface / PlanningSceneInterface
# binding for MoveIt 2 (e.g., moveit_py or a project-specific wrapper)
from moveitInterfaces import MoveGroupInterface, PlanningSceneInterface
from moveit_msgs.msg import Constraints, JointConstraint, PositionConstraint
class WholeBodyMobileManipulator(Node):
"""
Whole-body manipulation using MoveIt 2.
The key advantage over Tier 1: the planner considers both the
base pose and the arm joints together, optimizing manipulability
and avoiding collisions across the entire robot body.
"""
def __init__(self):
super().__init__('whole_body_mobile_manipulator')
# ── Whole-body planning group ─────────────────────────────────
# Includes: base (x, y, theta) + arm joints
self.whole_body_group = MoveGroupInterface(
"whole_body", "robot_description"
)
self.arm_group = MoveGroupInterface("arm_group", "robot_description")
self.scene = PlanningSceneInterface()
# ── Configure planner ─────────────────────────────────────────
self.whole_body_group.setNumPlanningAttempts(10)
self.whole_body_group.setPlanningTime(5.0)
self.whole_body_group.setMaxVelocityScalingFactor(0.6)
self.whole_body_group.setMaxAccelerationScalingFactor(0.6)
self.whole_body_group.setPlannerId("RRTstar")
# ── Add table as collision object ────────────────────────────
self._add_collision_objects()
self.get_logger().info("Whole-body Mobile Manipulator ready")
def _add_collision_objects(self):
"""Add environment obstacles to the planning scene."""
from shape_msgs.msg import SolidPrimitive
# Table
table_pose = PoseStamped()
table_pose.header.frame_id = "base_link"
table_pose.pose.position.x = 0.8
table_pose.pose.position.y = 0.0
table_pose.pose.position.z = -0.3
self.scene.addBox("table", 0.6, 0.6, 0.74, table_pose)
# Shelf
shelf_pose = PoseStamped()
shelf_pose.header.frame_id = "base_link"
shelf_pose.pose.position.x = 0.8
shelf_pose.pose.position.y = 0.3
shelf_pose.pose.position.z = 0.3
self.scene.addBox("shelf", 0.4, 0.3, 0.05, shelf_pose)
self.get_logger().info("Collision objects added")
def whole_body_pick(self, object_position: np.ndarray,
approach_direction: np.ndarray = np.array([0, 0, -1]),
approach_height: float = 0.15):
"""
Pick an object using whole-body planning.
Unlike Tier 1, this plans the base pose AND arm joints together,
considering manipulability and avoiding arm/base self-collisions.
Parameters
----------
object_position : ndarray [3]
Object position in world frame (x, y, z)
approach_direction : ndarray [3]
Unit vector for approach direction (default: top-down)
approach_height : float
Approach distance above the object
"""
# ── 1. Compute approach pose ──────────────────────────────────
approach_pose = PoseStamped()
approach_pose.header.frame_id = "base_link"
approach_pose.header.stamp = self.get_clock().now().to_msg()
# Start above the object
approach_pose.pose.position.x = object_position[0]
approach_pose.pose.position.y = object_position[1]
approach_pose.pose.position.z = object_position[2] + approach_height
# Approach direction determines gripper orientation
        from tf_transformations import quaternion_from_matrix
        # Build a rotation matrix whose Z-axis (gripper approach axis)
        # is aligned with the approach direction
        z = approach_direction / np.linalg.norm(approach_direction)
        x_temp = np.array([1, 0, 0]) if abs(z[2]) < 0.99 else np.array([0, 1, 0])
        x = x_temp - np.dot(x_temp, z) * z
        x = x / np.linalg.norm(x)
        y = np.cross(z, x)
        R = np.column_stack([x, y, z])
        # quaternion_from_matrix expects a 4x4 homogeneous transform
        T = np.eye(4)
        T[:3, :3] = R
        q = quaternion_from_matrix(T)
approach_pose.pose.orientation.x = q[0]
approach_pose.pose.orientation.y = q[1]
approach_pose.pose.orientation.z = q[2]
approach_pose.pose.orientation.w = q[3]
# ── 2. Set whole-body pose target ─────────────────────────────
# This jointly optimizes base (x, y, theta) and arm joints
self.whole_body_group.setPoseTarget(approach_pose)
# ── 3. Plan with constraints (collision + manipulability) ───────
constraints = Constraints()
# Add joint limits constraints to keep arm away from limits
constraints.joint_constraints = self._make_joint_constraints()
plan = self.whole_body_group.plan()
if not plan[0]:
self.get_logger().warn("Whole-body plan to approach pose failed")
return False
# Execute
self.whole_body_group.execute(plan[1], wait=True)
self.whole_body_group.stop()
self.whole_body_group.clearPoseTargets()
# ── 4. Descend linearly to grasp pose ──────────────────────────
grasp_pose = PoseStamped()
grasp_pose.header.frame_id = "base_link"
grasp_pose.pose.position.x = object_position[0]
grasp_pose.pose.position.y = object_position[1]
grasp_pose.pose.position.z = object_position[2]
        grasp_pose.pose.orientation = approach_pose.pose.orientation
# Use Cartesian path for linear descent
waypoints = [approach_pose.pose, grasp_pose.pose]
        (plan, fraction) = self.arm_group.computeCartesianPath(
            waypoints,
            eef_step=0.01,       # 1 cm interpolation steps
            jump_threshold=0.0,  # 0.0 disables joint-space jump checking
        )
if fraction < 0.95:
self.get_logger().warn(f"Cartesian path only {fraction*100:.1f}% complete")
self.arm_group.execute(plan, wait=True)
        # ── 5. Close gripper ────────────────────────────────────────────
        self.arm_group.stop()
        # Assumes a "gripper_close" named state is defined in the SRDF
        self.arm_group.setNamedTarget("gripper_close")
        self.arm_group.go(wait=True)
        self.arm_group.stop()
        # ── 6. Lift ────────────────────────────────────────────────────
        lift_pose = PoseStamped()  # fresh pose; don't mutate grasp_pose
        lift_pose.header.frame_id = grasp_pose.header.frame_id
        lift_pose.pose.position.x = grasp_pose.pose.position.x
        lift_pose.pose.position.y = grasp_pose.pose.position.y
        lift_pose.pose.position.z = grasp_pose.pose.position.z + 0.2
        lift_pose.pose.orientation = grasp_pose.pose.orientation
        self.whole_body_group.setPoseTarget(lift_pose)
        plan = self.whole_body_group.plan()
        if plan[0]:
            self.whole_body_group.execute(plan[1], wait=True)
        self.whole_body_group.stop()
        self.whole_body_group.clearPoseTargets()
self.get_logger().info("Whole-body pick completed!")
return True
def _make_joint_constraints(self) -> list:
"""Create soft joint constraints to keep arm away from limits."""
constraints = []
# Avoid shoulder joint limits by adding a 0.1 rad margin
margin = 0.1
for joint_name, lower, upper in [
("shoulder_pan_joint", -2.5, 2.5),
("shoulder_lift_joint", -2.0, 2.0),
("elbow_joint", -2.8, 2.8),
("wrist_1_joint", -2.0, 2.0),
("wrist_2_joint", -2.0, 2.0),
("wrist_3_joint", -2.0, 2.0),
]:
jc = JointConstraint()
jc.joint_name = joint_name
jc.position = 0.0
jc.tolerance_above = upper - margin
jc.tolerance_below = lower + margin
jc.weight = 0.5
constraints.append(jc)
return constraints
    def compute_manipulability(self, q: np.ndarray) -> float:
        """
        Compute the manipulability measure for the current arm configuration.
        Uses the Yoshikawa formula: w = sqrt(det(J @ J.T))
        Returns the manipulability value (0 at singularity, larger is better).
        """
        # In practice, obtain the 6 x n arm Jacobian J from the FK solver
        # at configuration q, then return np.sqrt(np.linalg.det(J @ J.T)).
        return 1.0  # Placeholder
def main():
rclpy.init()
node = WholeBodyMobileManipulator()
# ── Demo: pick object at known position ───────────────────────────
# Object at (0.65, 0.1, 0.75) in base frame (on a table)
obj_pos = np.array([0.65, 0.1, 0.75])
node.whole_body_pick(obj_pos)
node.get_logger().info("Demo complete!")
rclpy.shutdown()
if __name__ == "__main__":
main()
5.5 Manipulability Optimization¶
The arm's manipulability ellipsoid reveals how well the arm can move in different directions. At a singularity, the ellipsoid collapses — the arm loses a degree of freedom.
def compute_manipulability_optimized_base_pose(
object_pos: np.ndarray,
robot_model,
n_candidates: int = 100,
) -> tuple:
"""
Find the best base pose that maximizes arm manipulability
for reaching the object.
This searches over a grid of base poses and selects the one
with the highest manipulability at the IK solution.
"""
import random
best_pose = None
best_manipulability = -1.0
# Search over candidate base poses
for _ in range(n_candidates):
# Random base pose in reachable area (x: 0.3-1.0m, y: -0.3-0.3m, theta: -pi to pi)
q_base = np.array([
random.uniform(0.3, 1.0),
random.uniform(-0.3, 0.3),
random.uniform(-np.pi, np.pi),
])
# Compute IK for the arm to reach object_pos from this base
q_arm = compute_ik_for_base(object_pos, q_base, robot_model)
if q_arm is None:
continue # IK failed for this base pose
# Compute manipulability
q_full = np.concatenate([q_base, q_arm])
w = compute_manipulability_from_jacobian(q_full, robot_model)
if w > best_manipulability:
best_manipulability = w
best_pose = q_base
return best_pose, best_manipulability
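To visualize what this search is optimizing, the SVD of the arm Jacobian gives the manipulability ellipsoid directly: the singular values are its semi-axis lengths, and their product equals the Yoshikawa measure. A sketch for a planar 2-link arm with hypothetical link lengths:

```python
import numpy as np

def jac_2link(q1, q2, l1=0.4, l2=0.3):
    """Analytic 2x2 Jacobian of a planar 2-link arm."""
    s1, c1 = np.sin(q1), np.cos(q1)
    s12, c12 = np.sin(q1 + q2), np.cos(q1 + q2)
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def ellipsoid_axes(J):
    """
    SVD of the Jacobian: singular values are the ellipsoid's semi-axis
    lengths, the columns of U their directions in task space.
    """
    U, s, _ = np.linalg.svd(J)
    return s, U

s_good, _ = ellipsoid_axes(jac_2link(0.3, 1.2))   # bent elbow: well-conditioned
s_bad, _ = ellipsoid_axes(jac_2link(0.3, 0.01))   # nearly straight: near-singular
w_good, w_bad = np.prod(s_good), np.prod(s_bad)   # Yoshikawa measures
```

For this arm the measure reduces analytically to \(w = l_1 l_2 |\sin q_2|\), so a nearly straight elbow collapses the ellipsoid, which is exactly the configuration the base-pose search above tries to avoid.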
6. Tier 3 — Modern: Foundation Model-Based Manipulation¶
6.1 Concept¶
Foundation models (large neural networks pre-trained on massive data) have revolutionized manipulation. Vision-Language-Action (VLA) models like RT-2, Octo, and π0 learn a single policy that:
- Takes RGB images (and optionally language instructions) as input
- Outputs continuous robot actions (end-effector delta pose + gripper command)
This is fundamentally different from Tier 1–2 because the same model can handle novel objects, scenes, and tasks without explicit geometry or IK.
┌──────────────────────────────────────────────────────────────────┐
│ Vision-Language-Action (VLA) Model │
│ │
│ [Image] ──┐ │
│ │ │
│ [Lang] ──┼──► ┌──────────────┐ ──► Action Token Stream │
│ │ │ Vision │ [Δx, Δy, Δz, Δroll, │
│ │ │ Encoder │ Δpitch, Δyaw, grip] │
│ │ │ (ViT/Sam) │ │
│ │ └──────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ └───►│ LLM Core │◄── (frozen or fine-tuned) │
│ │ (PaLM-E / │ │
│ │ LLaMA / │ │
│ │ RT-2 Core) │ │
│ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Action Head │ │
│ │ (token → │ │
│ │ continuous)│ │
│ └──────────────┘ │
└──────────────────────────────────────────────────────────────────┘
6.2 Action Tokenization¶
RT-2-style models discretize continuous actions into tokens (inspired by language tokens):
1. Discretize each action dimension into \(B\) bins (e.g., \(B = 256\)):

\[
\text{action\_id} = \text{Clip}\left(\left\lfloor \frac{a - a_{\min}}{a_{\max} - a_{\min}} \cdot B \right\rfloor,\ 0,\ B-1\right)
\]

2. Treat each action_id as a vocabulary token, concatenated with the language tokens
3. Train with the standard language-modeling loss (next-token prediction)
4. At inference, sample action tokens and map them back to continuous values
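The binning scheme can be sketched in a few lines (a generic illustration of the idea, not the exact RT-2 tokenizer):

```python
import numpy as np

def actions_to_tokens(a, a_min=-1.0, a_max=1.0, n_bins=256):
    """Discretize continuous actions into integer bin ids (RT-2-style)."""
    a = np.asarray(a, dtype=float)
    ids = np.floor((a - a_min) / (a_max - a_min) * n_bins).astype(int)
    return np.clip(ids, 0, n_bins - 1)

def tokens_to_actions(ids, a_min=-1.0, a_max=1.0, n_bins=256):
    """Map bin ids back to continuous values (bin centers)."""
    return a_min + (np.asarray(ids) + 0.5) / n_bins * (a_max - a_min)

# 7-dim action: [dx, dy, dz, droll, dpitch, dyaw, gripper]
a = np.array([0.02, -0.01, 0.0, 0.0, 0.0, 0.1, 1.0])
ids = actions_to_tokens(a)
recovered = tokens_to_actions(ids)
# round-trip error is bounded by half a bin: (a_max - a_min) / (2 * n_bins)
```

The quantization error shrinks as \(1/B\), which is why a few hundred bins per dimension suffice for centimeter-scale end-effector deltas.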
6.3 ROS 2 VLA Integration¶
"""
Tier 3: Foundation Model Mobile Manipulation
===============================================
Integrate a VLA model (e.g., Octo) with ROS 2 for
generalized mobile manipulation.
"""
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image, JointState
from geometry_msgs.msg import Twist
from cv_bridge import CvBridge
import numpy as np
import torch
try:
    from octo.model.octo_model import OctoModel
except ImportError:
    print("[WARN] Octo not installed; see the Octo repository for setup instructions")
    OctoModel = None
class VLAMobileManipulator(Node):
"""
VLA-based mobile manipulation.
The VLA model takes a camera image and a language instruction,
then outputs continuous actions that control both the mobile base
and the arm. This is a single end-to-end policy — no separate
navigation or IK module needed.
"""
def __init__(self):
super().__init__('vla_mobile_manipulator')
self.bridge = CvBridge()
        # ── Load VLA model ────────────────────────────────────────────
        # NOTE: the checkpoint name and inference calls below sketch the
        # Octo API (a JAX model); adapt them to your installed version.
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.get_logger().info(f"Using device: {self.device}")
        if OctoModel is not None:
            # Load Octo-small (pre-trained on the Open X-Embodiment dataset)
            self.model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small-1.5")
            self.get_logger().info("Octo model loaded")
        else:
            self.model = None
            self.get_logger().warn("VLA model not available — running in simulation mode")
# ── ROS interfaces ────────────────────────────────────────────
self.image_sub = self.create_subscription(
Image,
"/camera/color/image_raw",
self.image_callback,
10
)
self.joint_sub = self.create_subscription(
JointState,
"/joint_states",
self.joint_callback,
10
)
# Base + arm velocity command publisher
self.base_cmd_pub = self.create_publisher(Twist, "/cmd_vel", 10)
self.arm_cmd_pub = self.create_publisher(
JointState, "/arm_controller/joint_commands", 10
)
        self.current_joints = np.zeros(6)  # 6-DOF arm joint states
self.current_image = None
self.get_logger().info("VLA Mobile Manipulator initialized")
def image_callback(self, msg: Image):
"""Receive camera image."""
try:
self.current_image = self.bridge.imgmsg_to_cv2(msg, desired_encoding="rgb8")
except Exception as e:
self.get_logger().error(f"Image conversion failed: {e}")
def joint_callback(self, msg: JointState):
"""Receive current joint states."""
        self.current_joints = np.array(msg.position[:6])
def compute_vla_action(self, instruction: str,
image: np.ndarray,
arm_joints: np.ndarray) -> np.ndarray:
"""
Query the VLA model for an action given the current observation.
Parameters
----------
instruction : str
Language instruction, e.g., "pick up the red cup"
image : ndarray
Current RGB camera image (H, W, 3)
arm_joints : ndarray
Current arm joint positions
Returns
-------
action : ndarray
Predicted action: [base_vx, base_vy, base_omega,
arm_delta_q1, ..., arm_delta_qn, gripper]
"""
if self.model is None:
# Simulation mode: return zero action
return np.zeros(10)
# Preprocess image
image_tensor = torch.from_numpy(image).permute(2, 0, 1) # CHW
image_tensor = torch.nn.functional.interpolate(
image_tensor.unsqueeze(0).float(), size=(256, 256)
).squeeze(0)
        # The instruction is tokenized inside the model via the `text` argument
# Model inference
with torch.no_grad():
# Octo outputs actions in gripper frame
# Shape: (batch=1, pred_horizon=1, action_dim)
action = self.model.predict_action(
image=image_tensor.to(self.device).unsqueeze(0),
text=[instruction],
)
return action[0, 0].cpu().numpy()
def execute_action(self, action: np.ndarray):
"""
Send action to robot hardware.
Action format: [base_vx, base_vy, base_omega,
delta_q1, ..., delta_q6, gripper]
"""
# ── Mobile base velocity ───────────────────────────────────────
base_cmd = Twist()
base_cmd.linear.x = float(action[0])
base_cmd.linear.y = float(action[1])
base_cmd.angular.z = float(action[2])
self.base_cmd_pub.publish(base_cmd)
# ── Arm joint delta commands ──────────────────────────────────
n_arm_joints = len(self.current_joints)
arm_delta = action[3:3 + n_arm_joints]
new_joints = self.current_joints + arm_delta
arm_cmd = JointState()
arm_cmd.header.stamp = self.get_clock().now().to_msg()
arm_cmd.name = [
"shoulder_pan_joint", "shoulder_lift_joint", "elbow_joint",
"wrist_1_joint", "wrist_2_joint", "wrist_3_joint"
]
arm_cmd.position = new_joints.tolist()
self.arm_cmd_pub.publish(arm_cmd)
# ── Gripper ────────────────────────────────────────────────────
gripper_cmd = JointState()
gripper_cmd.name = ["gripper_joint"]
gripper_cmd.position = [float(action[-1])]
# Publish to gripper controller...
def run_policy(self, instruction: str, max_steps: int = 200):
"""
Run the VLA policy for a given instruction.
Parameters
----------
instruction : str
Language command to execute
max_steps : int
Maximum number of control steps
"""
self.get_logger().info(f"Executing: '{instruction}'")
        # NOTE: rate.sleep() only wakes if this node is spun by an
        # executor in a separate thread
        rate = self.create_rate(10)  # 10 Hz control loop
for step in range(max_steps):
if self.current_image is None:
rate.sleep()
continue
# Query model
action = self.compute_vla_action(
instruction, self.current_image, self.current_joints
)
# Execute on robot
self.execute_action(action)
# Check for termination (if model outputs a "done" signal)
if np.abs(action).max() < 1e-3:
self.get_logger().info("Action converged — stopping")
break
rate.sleep()
# Stop robot
self.execute_action(np.zeros_like(action))
self.get_logger().info("Policy execution complete")
def main():
rclpy.init()
node = VLAMobileManipulator()
# ── Example instructions ──────────────────────────────────────────
instructions = [
"pick up the cup on the table",
"place the cup on the shelf",
"push the button on the wall",
"open the drawer",
]
for instruction in instructions:
node.run_policy(instruction, max_steps=200)
rclpy.shutdown()
if __name__ == "__main__":
main()
6.4 Sim-to-Real Transfer¶
Training VLA models involves significant domain gap between simulation and reality:
| Challenge | Description | Mitigation |
|---|---|---|
| Visual domain gap | Different textures, lighting, camera noise | Domain randomization (DR), neural rendering (NeRF, 3DGS) |
| Physics gap | Contact dynamics, friction, compliance | RL fine-tuning on real hardware, system identification |
| Action space mismatch | Different robot models | Action normalization, cross-embodiment training |
| Data efficiency | VLAs need millions of demos | Pre-training on large datasets (OXE, BridgeData) |
| Latency | Model inference is slow | ONNX optimization, GPU batching, action chunking |
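Action chunking, the latency mitigation listed above, is easy to sketch: the policy is queried once for a chunk of k future actions, which are then replayed at the control rate, so the expensive model runs only every k steps. In this generic illustration, `fake_policy` is a stand-in for the VLA inference call:

```python
import numpy as np
from collections import deque

class ActionChunkExecutor:
    """Replay a chunk of k predicted actions per (slow) model call."""

    def __init__(self, policy_fn, chunk_size=8):
        self.policy_fn = policy_fn    # obs -> (k, action_dim) array
        self.chunk_size = chunk_size
        self.buffer = deque()

    def next_action(self, obs):
        """Return the next action, querying the model only when empty."""
        if not self.buffer:
            chunk = self.policy_fn(obs)           # one expensive inference
            for a in chunk[: self.chunk_size]:
                self.buffer.append(a)
        return self.buffer.popleft()

# Stand-in policy: counts its calls, returns 8 identical dummy actions
calls = {"n": 0}
def fake_policy(obs):
    calls["n"] += 1
    return np.tile(obs, (8, 1))

executor = ActionChunkExecutor(fake_policy, chunk_size=8)
for _ in range(16):                  # 16 control steps
    action = executor.next_action(np.zeros(10))
```

With a chunk size of 8, sixteen control steps trigger only two inference calls; the trade-off is that within a chunk the policy cannot react to new observations.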
7. Comparison of Approaches¶
| Criterion | Tier 1: Decoupled | Tier 2: Whole-Body | Tier 3: VLA Foundation |
|---|---|---|---|
| Base + Arm coordination | ❌ Independent | ✅ Joint | ✅ Implicit (learned) |
| Novel object handling | ❌ Needs model | ❌ Needs model | ✅ Zero-shot |
| Generalization | Task-specific | Task-specific | Task-general |
| Real-time performance | ✅ Fast | ✅ ~1-5 Hz | ⚠️ Depends on model size |
| Hardware requirements | Low | Medium | High (GPU required) |
| Setup complexity | Low | Medium | High |
| Collision awareness | ❌ Arm only | ✅ Full body | ⚠️ Learned |
| Singularity handling | ❌ Manual | ✅ Jacobian-based | ⚠️ Learned |
| Best for | Structured environments | Semi-structured | Open-world manipulation |
| Code complexity | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Reliability | High (predictable) | High (verifiable) | Medium (stochastic) |
8. Step-by-Step Implementation Guide¶
Phase 1 — Robot Setup (All Tiers)¶
- URDF / SDF: Define the complete robot — base + arm + sensors — in URDF
- Kinematics: Load the URDF into MoveIt and configure kinematic chains
- Calibration: Calibrate the base-to-camera transform (`tf` tree)
- Gazebo simulation: Verify the robot loads and moves in simulation
Phase 2 — Tier 1: Decoupled (Week 1)¶
- Launch the Navigation 2 stack with SLAM (see SLAM project)
- Define a set of approach poses for known pick locations
- Implement `compute_ik()` using MoveIt's IK service
- Run the pick-and-place demo with known object poses
- Test: navigate to pose → IK → grasp → lift
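MoveIt's IK service does the numerical work for you, but it helps to see what such a solver does internally. Here is a self-contained damped least-squares IK sketch for a planar 2-link arm — link lengths, seed, and tolerances are illustrative, not your robot's:

```python
import numpy as np

L1, L2 = 0.4, 0.3  # link lengths in meters (illustrative)

def fk(q):
    """Forward kinematics: end-effector (x, y) of a planar 2-link arm."""
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def jacobian(q):
    """Analytic Jacobian of fk with respect to the two joint angles."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def compute_ik(target, q0, iters=100, damping=1e-2):
    """Damped least-squares IK: q += J^T (J J^T + λ² I)^-1 e."""
    q = np.array(q0, dtype=float)
    for _ in range(iters):
        e = target - fk(q)
        if np.linalg.norm(e) < 1e-5:
            break
        J = jacobian(q)
        q += J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(2), e)
    return q

q = compute_ik(np.array([0.5, 0.2]), q0=[0.1, 0.5])
```

The damping term is what keeps the solve stable near singularities — the same idea MoveIt's KDL plugin and Tier 2's Jacobian-based handling rely on.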
Phase 3 — Tier 2: Whole-Body (Week 2)¶
- Define the `whole_body` planning group in SRDF (base + arm chain)
- Configure OMPL with `RRTstar` or `PRMstar` for whole-body planning
- Add collision objects (tables, shelves) to the planning scene
- Replace the Tier 1 decoupled `navigate_to()` + `compute_ik()` with `whole_body_pick()`
- Implement manipulability optimization for base pose selection
- Test: single whole-body plan for pick, verify collision avoidance
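The manipulability optimization step typically uses Yoshikawa's measure w = sqrt(det(J Jᵀ)): among candidate base poses with a valid IK solution, prefer the one whose resulting arm configuration maximizes w (i.e., stays far from singularities). A minimal sketch with an illustrative planar 2-link Jacobian:

```python
import numpy as np

def manipulability(J):
    """Yoshikawa manipulability w = sqrt(det(J J^T)); 0 at a singularity."""
    return np.sqrt(max(np.linalg.det(J @ J.T), 0.0))  # clamp tiny negatives

def planar_jacobian(q, l1=0.4, l2=0.3):
    """Jacobian of a planar 2-link arm (illustrative stand-in for your URDF)."""
    s1, s12 = np.sin(q[0]), np.sin(q[0] + q[1])
    c1, c12 = np.cos(q[0]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

# Elbow fully extended -> singular, w ≈ 0; elbow bent -> w > 0.
# A base-pose selector would loop over candidate poses, run IK for each,
# and keep the pose whose IK solution scores highest.
w_singular = manipulability(planar_jacobian([0.3, 0.0]))
w_bent     = manipulability(planar_jacobian([0.3, 1.2]))
```

For this arm, w reduces to l1·l2·|sin q2|, which makes the preference explicit: bend the elbow, don't reach at full extension.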
Phase 4 — Tier 3: VLA Integration (Week 3)¶
- Install Octo or RT-2 (requires Docker + GPU environment)
- Run inference on a pre-trained model with the real camera feed
- Add action normalization (map model output to your robot's action limits)
- Implement a safety layer: clamp velocities, check joint limits before sending commands
- Fine-tune on domain-specific data if performance is insufficient
- Test: give language commands like "pick up the blue box"
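The safety layer in step 4 can be a single function sitting between the policy and the hardware driver. A minimal sketch, assuming joint-position commands at a fixed control rate (all limit values are illustrative):

```python
import numpy as np

def safety_clamp(q_cmd, q_now, q_min, q_max, v_max, dt):
    """Clamp a commanded joint target to position and velocity limits.

    Whatever the model outputs, the robot never receives a command
    outside its joint limits or faster than v_max per control step.
    """
    q_cmd = np.clip(q_cmd, q_min, q_max)                   # position limits
    dq = np.clip(q_cmd - q_now, -v_max * dt, v_max * dt)   # velocity limit
    return q_now + dq

# Illustrative 3-DOF example at a 10 Hz control loop (dt = 0.1 s)
q_safe = safety_clamp(
    q_cmd=np.array([2.0, -3.5, 0.4]),   # raw (possibly unsafe) policy output
    q_now=np.array([0.0,  0.0, 0.3]),
    q_min=np.array([-1.5, -1.5, -1.5]),
    q_max=np.array([ 1.5,  1.5,  1.5]),
    v_max=np.array([ 1.0,  1.0,  1.0]),  # rad/s
    dt=0.1,
)
```

Because VLA outputs are stochastic, this layer should be non-negotiable: the fine-tuning in step 5 improves behavior, but the clamp is what bounds the worst case.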
9. Extensions and Variations¶
9.1 Mobile Base Navigation with Arm Constraints¶
Use Task Space Region (TSR) constraints in MoveIt to ensure the arm maintains a specific orientation or stays within a workspace while the base navigates:
# Keep arm pointing downward while base navigates
from moveit_msgs.msg import Constraints, OrientationConstraint
orientation_constraint = OrientationConstraint()
orientation_constraint.header.frame_id = "base_link"
orientation_constraint.link_name = "gripper_link"
orientation_constraint.orientation.w = 1.0
orientation_constraint.absolute_x_axis_tolerance = 0.1
orientation_constraint.absolute_y_axis_tolerance = 0.1
orientation_constraint.absolute_z_axis_tolerance = 3.14 # free rotation around Z
orientation_constraint.weight = 1.0
constraints = Constraints()
constraints.orientation_constraints.append(orientation_constraint)
move_group.set_path_constraints(constraints)  # moveit_commander Python API
9.2 Grasp Pose Detection¶
Replace hard-coded grasp poses with learning-based grasp detection:
| Method | Description | ROS Package |
|---|---|---|
| GraspNet | 6-DOF grasp pose prediction from point clouds | graspnet_ros |
| Contact-GraspNet | Contact-level grasp synthesis from partial views | contact_graspnet |
| GPD | Geometric approach to grasp detection | gpd |
| 6-DOF GraspNet (Isaac) | NVIDIA's optimized grasp detection | Isaac SDK |
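Whichever detector you use, its raw candidates still need filtering before execution: reject low-confidence grasps and grasps outside the arm's reachable workspace, then take the best survivor. A detector-agnostic sketch — the `(position, score)` format and both thresholds are assumptions; real detector outputs vary:

```python
import numpy as np

def select_grasp(grasps, min_score=0.5, workspace_radius=0.8):
    """Pick the best grasp candidate from a detector's scored output.

    grasps : list of (position_xyz, score) pairs in the arm's base frame.
    Returns the highest-scoring valid grasp, or None if none qualify.
    """
    valid = [
        (p, s) for p, s in grasps
        if s >= min_score and np.linalg.norm(p) <= workspace_radius
    ]
    if not valid:
        return None  # caller should reposition the base and re-detect
    return max(valid, key=lambda g: g[1])

candidates = [
    (np.array([0.5, 0.1, 0.4]), 0.90),  # good score, reachable
    (np.array([1.5, 0.0, 0.4]), 0.95),  # best score, but out of reach
    (np.array([0.4, 0.0, 0.3]), 0.30),  # reachable, score too low
]
best = select_grasp(candidates)
```

The `None` branch is where mobile manipulation differs from fixed-arm grasping: an unreachable grasp is not a failure, just a trigger to move the base.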
9.3 Dual-Arm Mobile Manipulation¶
Extend to two arms for bimanual tasks (e.g., holding an object with one arm while moving it with the base, or cooperative carrying):
# Dual-arm whole-body group
<group name="dual_arm_whole_body">
<chain base_link="base_footprint" tip_link="left_gripper_link"/>
<chain base_link="base_footprint" tip_link="right_gripper_link"/>
</group>
9.4 Hierarchical Planning¶
Combine Tier 2 and Tier 3 in a hierarchical architecture:
Task Planner (LLM/VLA) ──► High-level goal: "place cup on shelf"
│
▼
Motion Planner (Tier 2) ──► Whole-body trajectory for base + arm
│
▼
Controller (PID/impedance) ──► Execute trajectory on hardware
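The glue between these layers can be very thin. A stub sketch of the dispatch loop, where `plan_task` and `execute_subtask` are hypothetical placeholders for the LLM/VLA planner and the Tier 2 whole-body planner:

```python
def plan_task(instruction):
    """Stub task planner: decompose a goal into (skill, target) subtasks.

    A real system would query an LLM/VLA here; this rule-based stub only
    illustrates the interface between the task and motion layers.
    """
    if "place" in instruction:
        return [("pick", "cup"), ("navigate", "shelf"), ("place", "shelf")]
    return [("pick", instruction.split()[-1])]

def execute_subtask(skill, target):
    """Stub motion layer: a real system calls the whole-body planner here."""
    return f"{skill}({target}): ok"

# Dispatch loop: each high-level subtask becomes one motion-planning call
log = [execute_subtask(s, t) for s, t in plan_task("place cup on shelf")]
```

Keeping the interface to plain (skill, target) tuples means the task planner can be swapped (rules, LLM, VLA) without touching the verified motion layer underneath.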
10. References¶
- Siciliano, Sciavicco, Villani & Oriolo (2009). Robotics: Modelling, Planning and Control — Robot kinematics and control textbook
- Yoshikawa, T. (1985). "Manipulability of Robotic Mechanisms." IJRR — Manipulability measure
- Koenig & Howard (2004). "Design and Use Paradigms for Gazebo." IROS — Robot simulation
- Coleman et al. (2014). "Reducing the Barrier to Entry of Complex Robotic Software." ICRA — MoveIt introduction
- Sucan et al. (2012). "The Open Motion Planning Library." ICRA — OMPL
- Brohan et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." arXiv:2307.15218 — RT-2 paper
- Octo Model Team (2024). "Octo: An Open-Source Generalist Robot Policy." GitHub — Octo model
- Black et al. (2024). "π0: A Vision-Language-Action Model for Generalist Robot Control." Physical Intelligence — π0 paper
- Open X-Embodiment Collaboration (2023). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." arXiv:2310.08864 — OXE dataset (training foundation models)
- Sundermeyer et al. (2021). "Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes." ICRA — Contact-GraspNet
- Sundaresan et al. (2020). "Rethinking Grasp: A Survey of Deep Learning Approaches for Robotic Grasping." arXiv — Grasp detection survey
- ROS 2 MoveIt 2 Documentation — Official MoveIt 2 tutorials
- Stretch RE-1 Documentation — Hello Robot's mobile manipulator
- Fetch Robotics Mobile Manipulator — Fetch hardware + ROS integration