
Vision-Language Navigation

Project Type: Embodied AI | Difficulty: ★★★☆☆ to ★★★★★ | Estimated Time: 2–4 weekends


1. Project Overview

Vision-Language Navigation (VLN) is the task of enabling a robot to reach a goal location in an environment by following natural language instructions, such as "Turn left at the blue chair, go past the kitchen, and stop at the dining table." The robot must ground language references to visual observations and plan a navigation path accordingly.

┌─────────────────────────────────────────────────────────────────────┐
│                    Vision-Language Navigation Pipeline                │
│                                                                     │
│   ┌──────────┐    ┌──────────────────┐    ┌────────────────────┐    │
│   │ Language │───▶│ Instruction     │───▶│ Cross-Modal       │    │
│   │ "Go past │    │ Parser (NLP)    │    │ Grounding         │    │
│   │ the red  │    │                 │    │                    │    │
│   │ door..." │    │ • Entity extract │    │ • Vision-language  │    │
│   └──────────┘    │ • Action parse   │    │   alignment        │    │
│                   │ • Waypoint seq   │    │ • Spatial reasoning│    │
│                   └──────────────────┘    └─────────┬──────────┘    │
│                                                      │               │
│   ┌──────────┐    ┌──────────────────┐                │               │
│   │ RGB/     │───▶│ Visual Feature  │───────────────┘               │
│   │ Depth    │    │ Extractor       │                                │
│   │ Camera   │    │ (ResNet/ViT/CLIP)                              │
│   └──────────┘    └──────────────────┘                                │
│                              │                                        │
│                              ▼                                        │
│                   ┌──────────────────┐    ┌────────────────────┐    │
│                   │ Path Planner     │───▶│ Action Commands    │    │
│                   │ (A* / RL Policy) │    │ (vel, turn angles) │    │
│                   └──────────────────┘    └────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘

In this project you will implement three progressively more sophisticated approaches:

| Tier | Approach | Key Technique | Dataset |
|------|----------|---------------|---------|
| 1 — Traditional | Modular pipeline | NLP parsing + landmark detection + A* | Custom synthetic |
| 2 — Intermediate | Seq2Seq + attention | Cross-modal attention, teacher forcing | R2R (Room-to-Room) |
| 3 — Modern | Foundation models | CLIP visual features + LLM instruction following | Zero-shot |

2. Hardware & Software Requirements

Hardware

| Component | Specification | Notes |
|-----------|---------------|-------|
| Robot platform | TurtleBot3 / custom wheeled robot | Differential drive |
| RGB-D camera | RealSense D435 / Azure Kinect | Required for depth |
| Onboard PC | Jetson Nano / Raspberry Pi 4 / Laptop | For real-world deployment |
| (Optional) LiDAR | RPLIDAR A1 / L515 | For modular pipeline |
| Simulation PC | Desktop with dedicated GPU | For Habitat / AI2-THOR |

Software

| Package | Version | Purpose |
|---------|---------|---------|
| Python | ≥ 3.8 | Core language |
| PyTorch | ≥ 1.13 | Neural network training |
| Transformers | ≥ 4.30 | CLIP, LLaVA, GPT models |
| OpenCV | ≥ 4.5 | Image preprocessing |
| NumPy | ≥ 1.20 | Numerical computation |
| Habitat-Sim | ≥ 0.2 | 3D embodied AI simulator |
| AI2-THOR | ≥ 4.0 | Household navigation |
| spaCy / NLTK | latest | NLP instruction parsing |
| scikit-image | ≥ 0.19 | Image feature extraction |
| matplotlib | ≥ 3.5 | Visualization |

pip install torch torchvision transformers opencv-python numpy
pip install habitat-sim ai2-thor  # simulation backends
pip install spacy nltk scikit-image matplotlib
python -m spacy download en_core_web_sm

3. Tier 1 — Traditional: Modular Pipeline

3.1 Concept

The modular pipeline decomposes VLN into three independent stages: (1) NLP parsing to extract entities and actions from the instruction, (2) visual landmark detection to locate referenced objects in the image, and (3) path planning to navigate toward the detected landmarks.

This approach is transparent and debuggable — each stage is a white box with interpretable outputs.

3.2 Key Components

3.2.1 Instruction Parser (NLP)

We use spaCy noun-chunk extraction, named entity recognition (NER), and dependency parsing to extract:

  • Landmarks: objects referenced in the instruction ("blue chair", "kitchen table")
  • Actions: navigation verbs ("turn", "go", "stop", "pass")
  • Directions: spatial relations ("left", "right", "straight", "behind")

import spacy

nlp = spacy.load("en_core_web_sm")

def parse_instruction(instruction: str) -> dict:
    """
    Parse a navigation instruction into structured components.

    Returns: {
        'entities': [{'text': 'blue chair', 'label': 'LANDMARK'}, ...],
        'actions':  [{'verb': 'turn', 'direction': 'left'}, ...],
        'route':    ['turn left', 'go straight', 'stop']
    }
    """
    doc = nlp(instruction)
    entities = []
    actions = []

    # Landmark extraction: noun chunks cover object references ("blue chair",
    # "kitchen table") that standard NER labels rarely capture
    for chunk in doc.noun_chunks:
        entities.append({'text': chunk.text, 'label': 'LANDMARK'})
    for ent in doc.ents:
        entities.append({'text': ent.text, 'label': ent.label_})

    # Verb + direction extraction via dependency parsing
    for token in doc:
        if token.pos_ == "VERB":
            direction = None
            for child in token.children:
                # "turn left" -> adverbial modifier / particle; "go to the kitchen" -> prepositional phrase
                if child.dep_ in ("advmod", "prt"):
                    direction = child.text
                elif child.dep_ == "prep":
                    direction = " ".join(t.text for t in child.subtree)
            actions.append({'verb': token.lemma_, 'direction': direction})

    # Build route sequence
    route = []
    for token in doc:
        if token.dep_ == "ROOT" and token.pos_ == "VERB":
            route.append(token.lemma_)
        if token.dep_ in ("prep", "pcomp"):
            route.append(token.head.text + " " + token.text)

    return {'entities': entities, 'actions': actions, 'route': route}

3.2.2 Visual Landmark Detector

We use a pretrained ResNet-50 to extract spatial feature maps (which can be matched against landmark text embeddings), complemented by simple HSV color thresholding to localize colored landmarks in the demo below.

import torch
import torchvision.models as models
import torchvision.transforms as T
import cv2
import numpy as np

# Load pretrained ResNet-50 as a frozen feature extractor
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet = torch.nn.Sequential(*list(resnet.children())[:-2])  # Drop avgpool + FC, keep feature maps
resnet.eval()

transform = T.Compose([
    T.ToTensor(),
    T.Resize((224, 224)),  # ResNet expects roughly ImageNet-sized inputs
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# CLIP-style landmark vocabulary (simplified)
LANDMARK_VOCAB = [
    "chair", "table", "door", "window", "bed", "sofa",
    "kitchen", "bathroom", "hallway", "stairs",
    "blue", "red", "green", "white", "black"
]

def detect_landmarks(frame: np.ndarray, target_landmarks: list) -> list:
    """
    Detect landmark locations in the image frame.

    Parameters
    ----------
    frame : BGR image (HxWx3)
    target_landmarks : list of landmark names from instruction parser

    Returns
    -------
    detections : list of dicts with 'label', 'bbox', 'confidence'
    """
    # Convert to RGB and preprocess (the transform resizes to 224x224)
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    input_tensor = transform(rgb).unsqueeze(0)

    # Extract feature map (1x2048x7x7 for a 224x224 input)
    with torch.no_grad():
        features = resnet(input_tensor)  # [1, 2048, H', W']

    B, C, H, W = features.shape
    # Per-region deep features, available for matching against landmark embeddings
    # (the demo below only uses the color-based detector)
    feature_map = features.squeeze(0).reshape(C, H * W).T  # (H*W) x 2048

    # Simple color-based landmark detection (complement to deep features)
    detections = []
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

    color_map = {
        'blue': ([100, 150, 0], [140, 255, 255]),
        'red':  ([0, 100, 100], [10, 255, 255]),
        'green':([40, 50, 50],  [80, 255, 255]),
    }

    for landmark in target_landmarks:
        color_key = landmark.lower()
        if color_key in color_map:
            lower, upper = color_map[color_key]
            mask = cv2.inRange(hsv, np.array(lower), np.array(upper))
            # Find largest contour
            contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
            if contours:
                largest = max(contours, key=cv2.contourArea)
                x, y, w, h = cv2.boundingRect(largest)
                detections.append({
                    'label': landmark,
                    'bbox': (x, y, x+w, y+h),
                    'confidence': float(cv2.contourArea(largest) / (frame.shape[0] * frame.shape[1]))
                })

    return detections

3.2.3 A* Path Planner

Given a top-down map (built from depth sensor or SLAM), we use A* to plan a path that passes through the detected landmark locations.

import heapq

def astar_path(grid: np.ndarray, start: tuple, goal: tuple,
               landmarks: list = None) -> list:
    """
    A* path planning on a 2D occupancy grid.

    Parameters
    ----------
    grid : 2D numpy array (0 = free, 1 = obstacle)
    start : (row, col) starting position
    goal : (row, col) goal position
    landmarks : list of (row, col) waypoints to visit in order

    Returns
    -------
    path : list of (row, col) positions from start to goal
    """
    if landmarks is None:
        landmarks = []

    def heuristic(a, b):
        # Chebyshev distance: admissible for 8-connected moves with unit step cost
        return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

    def neighbors(pos):
        for dr, dc in [(-1,0),(1,0),(0,-1),(0,1),(-1,-1),(-1,1),(1,-1),(1,1)]:
            nr, nc = pos[0] + dr, pos[1] + dc
            if 0 <= nr < grid.shape[0] and 0 <= nc < grid.shape[1]:
                if grid[nr, nc] == 0:  # free cell
                    yield (nr, nc)

    # Insert intermediate goals for landmarks
    waypoints = [start] + landmarks + [goal]

    full_path = []
    for i in range(len(waypoints) - 1):
        sub_path = _astar_single(grid, waypoints[i], waypoints[i+1], heuristic, neighbors)
        if sub_path is None:
            return None  # No path found
        full_path.extend(sub_path[:-1])

    full_path.append(goal)
    return full_path


def _astar_single(grid, start, goal, heuristic, neighbors):
    """Single-shot A* between two waypoints."""
    open_set = [(heuristic(start, goal), start)]
    came_from = {}
    g_score = {start: 0}

    while open_set:
        _, current = heapq.heappop(open_set)

        if current == goal:
            # Reconstruct path
            path = []
            while current in came_from:
                path.append(current)
                current = came_from[current]
            path.append(start)
            return path[::-1]

        for neighbor in neighbors(current):
            tentative_g = g_score[current] + 1
            if neighbor not in g_score or tentative_g < g_score[neighbor]:
                came_from[neighbor] = current
                g_score[neighbor] = tentative_g
                f_score = tentative_g + heuristic(neighbor, goal)
                heapq.heappush(open_set, (f_score, neighbor))

    return None  # No path found

3.3 Putting It All Together

def modular_vln_loop(instruction: str, rgb_frame: np.ndarray,
                    depth_frame: np.ndarray, grid_map: np.ndarray,
                    robot_pos: tuple) -> tuple:
    """
    Complete modular VLN pipeline.

    Returns: (next_action, detected_landmarks, planned_path)
    """
    # Step 1: Parse instruction
    parsed = parse_instruction(instruction)
    target_landmarks = [e['text'] for e in parsed['entities']]

    # Step 2: Detect landmarks in current view
    detections = detect_landmarks(rgb_frame, target_landmarks)

    # Step 3: Project detections to world coordinates (simplified)
    # In practice: use depth + camera intrinsics + robot pose
    landmark_waypoints = [d['bbox'][:2] for d in detections]  # placeholder

    # Step 4: A* path planning
    # Goal is the last detected landmark or end of instruction
    goal = landmark_waypoints[-1] if landmark_waypoints else (15, 15)
    path = astar_path(grid_map, robot_pos, goal, landmark_waypoints[:-1])

    # Step 5: Compute next action from path
    if path and len(path) > 1:
        next_pos = path[1]
        action = {'type': 'move_to', 'target': next_pos}
    else:
        action = {'type': 'stop'}

    return action, detections, path

3.4 Limitations

| Aspect | Issue |
|--------|-------|
| NLP | Struggles with complex referring expressions ("the second door on your left") |
| Vision | Color-based detection is brittle; needs robust landmark recognition |
| Planning | A* on a 2D grid ignores 3D geometry and doorways |
| Generalization | Each component must be retrained independently for new environments |

4. Tier 2 — Intermediate: Seq2Seq with Attention

4.1 Concept

Seq2Seq VLN uses an encoder-decoder architecture where:

  • Encoder: processes both the instruction (as a sequence of word embeddings) and the visual observation (as a sequence of spatial features).
  • Decoder: generates a sequence of navigation actions, attending to relevant parts of the instruction and visual features at each step.

The key innovation is cross-modal attention, which allows the model to align language tokens with visual regions.

4.2 Model Architecture

The Speaker-Follower model (Fried et al., 2018) and the CMU How-to-nav model (Wang et al., 2018) both use attention-based seq2seq:

┌─────────────────────────────────────────────────────────────────────┐
│                 Seq2Seq VLN Model (Speaker-Follower)                 │
│                                                                     │
│  Instruction: "Turn left at the blue chair"                         │
│                                                                     │
│  ┌─────────┐                                                        │
│  │ Encoder │                                                        │
│  │         │                                                        │
│  │ ┌─────┐ │    ┌──────────────────────────────────────────────┐   │
│  │ │ w₁  │─┼───▶│                                              │   │
│  │ ├─────┤ │    │            Cross-Modal Attention              │   │
│  │ │ w₂  │─┼───▶│                                              │   │
│  │ ├─────┤ │    │  α_i = softmax(vᵀ·tanh(W₁h_i + W₂v_j))      │   │
│  │ │ w₃  │─┼───▶│                                              │   │
│  │ └─────┘ │    └──────────────────────────────────────────────┘   │
│  └─────────┘                         │                             │
│       │                             │                             │
│       │ h_i (language hidden)        │ c_t (context vector)       │
│       │                             ▼                             │
│       │                      ┌──────────────┐                     │
│  ┌────▼────┐                 │   Decoder    │                     │
│  │  LSTM   │◀────────────────│              │                     │
│  │         │                 │  a_t = argmax│                     │
│  │ h_t     │────────────────▶│  P(a|context)│                     │
│  └─────────┘                 └──────────────┘                     │
│                                                                     │
│  Action space: {Forward, Left, Right, <Stop>}                      │
└─────────────────────────────────────────────────────────────────────┘

4.3 Cross-Modal Attention

The attention mechanism computes a weighted sum of visual features based on the current decoder state:

\[ c_t = \sum_{j=1}^{N} \alpha_{tj} \, v_j \]

where the attention weights are:

\[ \alpha_{tj} = \frac{\exp\big(e_{tj}\big)}{\sum_{k=1}^{N} \exp\big(e_{tk}\big)}, \quad e_{tj} = v^\top \tanh\big(W_1 h_{t-1} + W_2 v_j\big) \]
  • \(h_{t-1}\) — previous decoder hidden state
  • \(v_j\) — visual feature at region \(j\)
  • \(W_1, W_2, v\) — learnable attention parameters

4.4 Complete PyTorch Implementation

"""
Tier 2: Seq2Seq VLN with Cross-Modal Attention
==============================================
Implements a simplified Speaker-Follower style model.
Trainable on R2R (Room-to-Room) dataset.
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

# ─── Model ──────────────────────────────────────────────────────────────────

class Seq2SeqVLN(nn.Module):
    """
    Seq2Seq VLN with cross-modal attention.

    Args:
        vocab_size: size of instruction vocabulary
        embed_dim: word embedding dimension
        hidden_dim: LSTM hidden dimension
        visual_dim: dimension of visual features (e.g., ResNet-2048)
        num_actions: number of navigation actions
        dropout: dropout probability
    """

    def __init__(self, vocab_size: int, embed_dim: int = 256,
                 hidden_dim: int = 512, visual_dim: int = 2048,
                 num_actions: int = 4, dropout: float = 0.3):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_actions = num_actions

        # Language embedding
        self.word_embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lang_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.lang_proj = nn.Linear(hidden_dim * 2, hidden_dim)

        # Visual projection
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)

        # Cross-modal attention
        self.attn_W1 = nn.Linear(hidden_dim, hidden_dim)
        self.attn_W2 = nn.Linear(hidden_dim, hidden_dim)
        self.attn_v = nn.Linear(hidden_dim, 1)

        # Previous-action embedding (hidden_dim so the decoder input below is 3 * hidden_dim)
        self.action_embed = nn.Embedding(num_actions, hidden_dim)

        # Decoder LSTM: input = [attended visual context, previous action, language summary]
        self.decoder_lstm = nn.LSTM(hidden_dim * 3, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, num_actions)

        self.dropout = nn.Dropout(dropout)

    def forward(self, instruction_tokens: torch.Tensor,
                visual_features: torch.Tensor,
                action_history: torch.Tensor) -> torch.Tensor:
        """
        Forward pass.

        Parameters
        ----------
        instruction_tokens : (B, L) — token IDs
        visual_features : (B, N, D_v) — N spatial regions, D_v dims
        action_history : (B, T) — previous actions (for teacher forcing)

        Returns
        -------
        logits : (B, num_actions) — action probability logits
        """
        B = instruction_tokens.size(0)

        # ── Language encoder ──────────────────────────────────────────────
        lang_embed = self.word_embed(instruction_tokens)  # (B, L, D_e)
        lang_h, (h_lang, _) = self.lang_lstm(lang_embed)  # (B, L, 2D), (2, B, D)
        # Combine bidirectional states
        h_lang = torch.cat([h_lang[0], h_lang[1]], dim=-1)  # (B, 2D)
        h_lang = self.dropout(torch.tanh(self.lang_proj(h_lang)))  # (B, D)

        # ── Visual projection ──────────────────────────────────────────────
        V = self.visual_proj(visual_features)  # (B, N, D)

        # ── Decoder LSTM ──────────────────────────────────────────────────
        # Initialize LSTM with language context
        decoder_state = (h_lang.unsqueeze(0), torch.zeros_like(h_lang.unsqueeze(0)))

        # Embed the most recent action (zeros at the first step)
        if action_history.size(1) > 0:
            prev_action = self.action_embed(action_history[:, -1])  # (B, D)
        else:
            prev_action = torch.zeros(B, self.hidden_dim, device=instruction_tokens.device)

        # Context from attention
        context = self._cross_attention(h_lang, V)  # (B, D)

        # Decoder input: attended visual context, previous action, language summary
        lstm_input = torch.cat([context, prev_action, h_lang], dim=-1)  # (B, 3D)
        lstm_out, decoder_state = self.decoder_lstm(lstm_input.unsqueeze(1), decoder_state)
        lstm_out = lstm_out.squeeze(1)  # (B, D)

        # ── Action prediction ──────────────────────────────────────────────
        logits = self.action_head(self.dropout(lstm_out))  # (B, num_actions)
        return logits

    def _cross_attention(self, query: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        """
        Cross-modal attention: attend to visual features given a language query.

        Parameters
        ----------
        query : (B, D) — language hidden state
        visual : (B, N, D) — visual feature grid

        Returns
        -------
        context : (B, D) — attended visual features
        """
        # Expand query for batch processing
        q = query.unsqueeze(1).expand_as(visual)  # (B, N, D)
        # Attention energy
        energy = torch.tanh(self.attn_W1(q) + self.attn_W2(visual))  # (B, N, D)
        energy = self.attn_v(energy).squeeze(-1)  # (B, N)
        alpha = F.softmax(energy, dim=-1)  # (B, N), normalized over regions
        context = torch.bmm(alpha.unsqueeze(1), visual).squeeze(1)  # (B, D)
        return context


def train_seq2seq_vln(model: Seq2SeqVLN, train_loader, optimizer, num_epochs: int = 20):
    """Train the Seq2Seq VLN model."""
    criterion = nn.CrossEntropyLoss(ignore_index=-1)

    for epoch in range(num_epochs):
        model.train()
        total_loss = 0.0

        for batch in train_loader:
            instruction, visual, actions_gt, lengths = batch

            optimizer.zero_grad()

            # Forward pass
            logits = model(instruction, visual, actions_gt[:, :-1])

            # Simplified teacher forcing: predict the final action given the ground-truth history
            loss = criterion(logits, actions_gt[:, -1])

            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)
        print(f"[Epoch {epoch+1}/{num_epochs}] Loss: {avg_loss:.4f}")


# ─── R2R Dataset (simplified) ────────────────────────────────────────────────

class R2RDataset(torch.utils.data.Dataset):
    """
    Simplified Room-to-Room dataset.

    Each sample contains:
    - instruction: natural language description
    - path: sequence of (x, y, heading) viewpoints
    - action_seq: corresponding action sequence
    """

    def __init__(self, split='train'):
        self.split = split
        # In practice, load from Matterport3D R2R dataset
        # Here we show synthetic data for illustration
        self.data = [
            {
                'instruction': "Turn left and go past the chair",
                'path': [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, np.pi/2)],
                'actions': [0, 0, 1, 3],  # Forward, Forward, Left, Stop (0=Forward, 1=Left, 2=Right, 3=Stop)
            },
        ]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        return {
            'instruction': sample['instruction'],
            'path': np.array(sample['path']),
            'actions': np.array(sample['actions'], dtype=np.int64),
        }


# ─── Evaluation Metrics ───────────────────────────────────────────────────────

def evaluate_vln(model, dataset, device='cuda'):
    """
    Evaluate VLN model using standard metrics.

    Metrics:
    - Success Rate (SR): fraction of episodes where agent stops within 3m of goal
    - SPL: Success weighted by Path Length = SR * optimal_length / max(agent_length, optimal_length)
    - nDTW: normalized Dynamic Time Warping (sequence similarity)
    """
    model.eval()
    sr_total, spl_total, n_samples = 0, 0, 0

    with torch.no_grad():
        for sample in dataset:
            instruction = sample['instruction']
            gt_path = sample['path']
            gt_actions = sample['actions']

            # Simulate inference (a real evaluation rolls the agent out in Habitat-Sim);
            # assumes the sample also carries precomputed 'inst_encoded' / 'visual' tensors
            logits = model(
                sample['inst_encoded'].unsqueeze(0).to(device),
                sample['visual'].unsqueeze(0).to(device),
                torch.zeros(1, 0, dtype=torch.long).to(device),
            )
            pred_action = torch.argmax(logits, dim=-1).item()

            # Compute SR (simplified: did the model choose the correct final action?)
            sr = float(pred_action == gt_actions[-1])
            sr_total += sr
            n_samples += 1

    return {
        'Success Rate': sr_total / n_samples,
        'SPL': spl_total / n_samples if n_samples > 0 else 0,
    }

4.5 R2R Dataset & Evaluation

The Room-to-Room (R2R) dataset is the standard benchmark for VLN:

| Metric | Definition | Ideal |
|--------|------------|-------|
| Success Rate (SR) | Fraction of episodes where the agent stops within 3 m of the goal | 1.0 |
| SPL | SR × optimal length / max(path length, optimal length) | 1.0 |
| nDTW | Normalized Dynamic Time Warping similarity to the reference path | 1.0 |
| CLS | Coverage weighted by Length Score | 1.0 |
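
For reference, here is a minimal sketch of computing SR and SPL for a single episode from the agent's (x, y) trajectory; the helper names and the 3 m success threshold default are illustrative, and SPL uses the standard max(path length, optimal length) denominator.

import numpy as np

def path_length(path):
    # Sum of Euclidean distances between consecutive (x, y) waypoints
    return float(sum(np.linalg.norm(np.subtract(b, a)) for a, b in zip(path[:-1], path[1:])))

def success_and_spl(agent_path, goal, optimal_length, threshold=3.0):
    # SR: 1.0 if the agent's final position is within `threshold` metres of the goal
    final_dist = float(np.linalg.norm(np.subtract(agent_path[-1], goal)))
    success = float(final_dist <= threshold)
    # SPL: success weighted by optimal length / max(agent length, optimal length)
    agent_length = path_length(agent_path)
    spl = success * optimal_length / max(agent_length, optimal_length)
    return success, spl

# Example: a 6 m trajectory against a 5 m shortest path that ends 1 m from the goal
sr, spl = success_and_spl([(0, 0), (2, 0), (4, 0), (4, 2)], goal=(5, 2), optimal_length=5.0)
print(sr, spl)  # 1.0, 5/6 ≈ 0.83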

4.6 Comparison — Modular vs Seq2Seq

| Aspect | Modular Pipeline | Seq2Seq + Attention |
|--------|------------------|---------------------|
| Interpretability | High (white-box stages) | Low (neural black box) |
| Training | Supervised per component | End-to-end differentiable |
| Generalization | Poor (rule-based) | Better (learned representations) |
| Data efficiency | High (no training needed) | Low (needs large VLN datasets) |
| Compositionality | Limited (hand-coded rules) | Strong (learned generalization) |
| Training complexity | – | ⭐⭐⭐ |

5. Tier 3 — Modern: Foundation Models for VLN

5.1 Concept

Foundation models (large pretrained vision-language models) enable zero-shot VLN — navigating in novel environments without task-specific training. CLIP provides aligned visual-text features; LLaVA / GPT-4V enable instruction following.

Key advantage: no VLN-specific training data required.

5.2 CLIP Visual Features for Navigation

CLIP encodes images and text into a shared embedding space. We use CLIP's visual encoder to represent each candidate navigation viewpoint, then score them based on instruction similarity.

"""
Tier 3: Zero-Shot VLN using CLIP
================================
CLIP-based viewpoint scoring for instruction-guided navigation.
No VLN-specific training required.
"""

import torch
import clip
from PIL import Image
import numpy as np
import cv2

# Load CLIP model (ViT-B/32)
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


def encode_text_instruction(instruction: str) -> torch.Tensor:
    """Encode a natural language instruction into CLIP text features."""
    text = clip.tokenize([instruction]).to(device)
    with torch.no_grad():
        text_features = model.encode_text(text)
        text_features /= text_features.norm(dim=-1, keepdim=True)
    return text_features


def encode_viewpoint(frame: np.ndarray) -> torch.Tensor:
    """Encode an RGB frame into CLIP visual features."""
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    image_input = preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image_input)
        image_features /= image_features.norm(dim=-1, keepdim=True)
    return image_features


def score_viewpoints(current_frame: np.ndarray,
                     candidate_frames: list,
                     instruction: str) -> list:
    """
    Score candidate navigation viewpoints based on instruction relevance.

    Parameters
    ----------
    current_frame : current RGB observation
    candidate_frames : list of RGB frames from candidate viewpoints
    instruction : natural language navigation instruction

    Returns
    -------
    scores : list of (viewpoint_id, similarity_score) sorted descending
    """
    # Encode instruction once
    text_features = encode_text_instruction(instruction)

    results = []
    for idx, frame in enumerate(candidate_frames):
        vis_features = encode_viewpoint(frame)
        # Cosine similarity in CLIP embedding space
        similarity = (vis_features @ text_features.T).item()
        results.append((idx, similarity))

    # Sort by descending similarity
    results.sort(key=lambda x: x[1], reverse=True)
    return results


def zero_shot_vln_step(instruction: str, current_frame: np.ndarray,
                       candidate_poses: list, robot_state: dict) -> dict:
    """
    Single step of zero-shot VLN using CLIP.

    Parameters
    ----------
    instruction : full navigation instruction
    current_frame : current RGB observation
    candidate_poses : list of candidate robot poses to evaluate
    robot_state : {'x', 'y', 'heading', 'map'}

    Returns
    -------
    next_action : {'type': 'move', 'target_pose': pose}
    """
    # In practice: render candidate viewpoints from Habitat-sim
    # Here: simulate with current frame (would be different viewpoints)
    candidate_frames = [current_frame] * len(candidate_poses)

    # Score each candidate
    scores = score_viewpoints(current_frame, candidate_frames, instruction)

    # Pick highest-scoring viewpoint
    best_idx, best_score = scores[0]

    if best_score > 0.25:  # Threshold for accepting a viewpoint
        action = {
            'type': 'move',
            'target_pose': candidate_poses[best_idx],
            'confidence': best_score
        }
    else:
        # Fallback: random exploration or stop
        action = {'type': 'explore', 'confidence': best_score}

    return action

5.3 LLaVA for Instruction Following

LLaVA (Large Language and Vision Assistant) can reason about navigation instructions and visual context:

"""
LLaVA-based navigation instruction generation.
Given a panoramic view, LLaVA generates a sub-instruction for the next step.
"""

from transformers import AutoProcessor, LlavaForConditionalGeneration
import torch
from PIL import Image

# Load LLaVA-1.5-7B
model_name = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_name)
model = LlavaForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)


def generate_sub_instruction(panorama_image: Image.Image,
                              full_instruction: str,
                              history: list) -> str:
    """
    Use LLaVA to generate the next sub-instruction.

    Parameters
    ----------
    panorama_image : 360-degree panoramic RGB image (PIL)
    full_instruction : original navigation instruction
    history : list of already-executed sub-instructions

    Returns
    -------
    next_step : natural language sub-instruction
    """
    # LLaVA-1.5 chat format: the prompt must include the <image> placeholder token
    prompt = (
        "USER: <image>\n"
        "You are a navigation assistant. Given this panoramic view and a full "
        "navigation instruction, identify the most important landmark or direction "
        "for the NEXT step. Output ONLY a short instruction (max 10 words).\n"
        f"Full instruction: '{full_instruction}'\n"
        f"Previous steps completed: {', '.join(history) if history else 'None'}\n"
        "What is the next step? ASSISTANT:"
    )

    inputs = processor(text=prompt, images=panorama_image, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=20,
            do_sample=False,  # greedy decoding -> deterministic output
        )

    # Decode only the newly generated tokens, skipping the echoed prompt
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    next_step = processor.decode(new_tokens, skip_special_tokens=True)
    return next_step.strip()


# ─── Habitat-Sim Integration ──────────────────────────────────────────────────

def habitat_vln_episode(instruction: str, start_pose: dict,
                         goal_id: str, num_steps: int = 100):
    """
    Run a VLN episode in Habitat-Sim.

    Parameters
    ----------
    instruction : navigation instruction
    start_pose : {'x', 'y', 'z', 'yaw'}
    goal_id : target object instance ID
    num_steps : max steps before timeout

    Returns
    -------
    result : {'success': bool, 'steps': int, 'path_length': float}
    """
    import numpy as np
    import habitat_sim
    from habitat_sim.utils.common import quat_from_angle_axis

    # Backend configuration (the scene path is an example placeholder — point it at your own .glb)
    backend_cfg = habitat_sim.SimulatorConfiguration()
    backend_cfg.scene_id = "data/scene_datasets/example_scene.glb"

    # RGB sensor mounted at roughly eye height
    rgb_sensor = habitat_sim.CameraSensorSpec()
    rgb_sensor.uuid = "color_sensor"
    rgb_sensor.sensor_type = habitat_sim.SensorType.COLOR
    rgb_sensor.resolution = [480, 640]
    rgb_sensor.position = [0.0, 1.5, 0.0]

    # Discrete action space used for fallback exploration
    agent_cfg = habitat_sim.agent.AgentConfiguration()
    agent_cfg.sensor_specifications = [rgb_sensor]
    agent_cfg.action_space = {
        "move_forward": habitat_sim.agent.ActionSpec(
            "move_forward", habitat_sim.agent.ActuationSpec(amount=0.25)),
        "turn_left": habitat_sim.agent.ActionSpec(
            "turn_left", habitat_sim.agent.ActuationSpec(amount=15.0)),
    }

    sim = habitat_sim.Simulator(habitat_sim.Configuration(backend_cfg, [agent_cfg]))

    # Set agent state
    agent = sim.get_agent(0)
    agent_state = habitat_sim.AgentState()
    agent_state.position = np.array([start_pose['x'], start_pose['y'], start_pose['z']])
    agent_state.rotation = quat_from_angle_axis(start_pose['yaw'], np.array([0.0, 1.0, 0.0]))
    agent.set_state(agent_state)

    done = False
    steps = 0

    for step in range(num_steps):
        # Get RGB observation (RGBA from the color sensor, drop the alpha channel)
        obs = sim.get_sensor_observations()
        rgb = obs['color_sensor'][..., :3]

        # CLIP-based viewpoint scoring (candidate pose generation omitted here)
        action = zero_shot_vln_step(instruction, rgb, candidate_poses=[...], robot_state={})
        steps += 1

        # Execute action: move_and_look is a project-specific helper that places the
        # agent at a target pose; otherwise fall back to simple exploration primitives
        if action['type'] == 'move':
            move_and_look(sim, action['target_pose'])
        else:
            sim.step("move_forward")
            sim.step("turn_left")

        # Check goal (check_goal_reached is a project-specific helper)
        if check_goal_reached(sim, goal_id):
            done = True
            break

    sim.close()
    return {'success': done, 'steps': steps, 'path_length': steps * 0.25}

5.4 Comparison Table — All Three Tiers

| Aspect | Tier 1: Modular | Tier 2: Seq2Seq | Tier 3: Foundation Models |
|--------|-----------------|-----------------|---------------------------|
| Training required | No (rule-based) | Yes (large VLN dataset) | No (zero-shot) |
| Generalization | Poor | Moderate | High |
| Data dependency | None | R2R / RxR (100k samples) | CLIP / LLaVA pretrained |
| Compute cost | Low | Medium | High (large models) |
| Interpretability | High | Medium | Low |
| Success rate (R2R) | ~20–30% | ~50–70% | ~40–60% (zero-shot) |
| Setup complexity | – | ⭐⭐⭐ | ⭐⭐ |
| Best for | Debugging, simple scenes | Research, benchmark SOTA | Novel environments, rapid deployment |

6. Step-by-Step Implementation Guide

Phase 1 — Tier 1: Modular Pipeline (Week 1)

  1. Set up NLP parsing
    • Install spaCy and the English model:
      pip install spacy
      python -m spacy download en_core_web_sm
    • Implement parse_instruction() with spaCy noun chunks, NER, and dependency parsing
    • Test on sample instructions like "Turn left at the blue chair"

  2. Implement landmark detection
    • Use ResNet-50 pretrained features
    • Add color-based detection for common landmarks
    • Visualize detected bounding boxes on camera frames

  3. Implement the A* path planner
    • Create a grid map from the depth sensor or a pre-built map (see the sketch after this list)
    • Integrate landmarks as waypoints
    • Test navigation in a simple environment (Gazebo or Habitat)

  4. Integrate and test
    • Run the pipeline end-to-end in simulation
    • Measure success rate in a known environment
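
For the grid-map step above, here is a minimal sketch of projecting a single depth image into a 2D occupancy grid; the intrinsics, cell size, and the assumption that the camera looks straight ahead are simplifications you would replace with your own calibration and SLAM output.

import numpy as np

def depth_to_occupancy(depth, fx, fy, cx, cy, cam_height=0.3,
                       cell_size=0.05, grid_dim=200, min_h=0.1, max_h=1.5):
    """Project a metric depth image (HxW, metres) into a robot-centred occupancy grid."""
    grid = np.zeros((grid_dim, grid_dim), dtype=np.uint8)
    vs, us = np.nonzero(depth > 0)
    z = depth[vs, us]                            # forward distance (m)
    x = (us - cx) * z / fx                       # lateral offset (m), right is positive
    h = cam_height - (vs - cy) * z / fy          # point height above the floor (m)
    keep = (h > min_h) & (h < max_h)             # drop floor and ceiling points
    rows = (grid_dim // 2 - z[keep] / cell_size).astype(int)   # forward maps to smaller row index
    cols = (grid_dim // 2 + x[keep] / cell_size).astype(int)
    valid = (rows >= 0) & (rows < grid_dim) & (cols >= 0) & (cols < grid_dim)
    grid[rows[valid], cols[valid]] = 1           # mark obstacle cells (0 = free, 1 = obstacle)
    return grid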

Phase 2 — Tier 2: Seq2Seq (Week 2)

  1. Download the R2R dataset
    • Clone the simulator repository and follow its instructions to fetch the R2R data:
      git clone https://github.com/peteanderson80/Matterport3DSimulator.git
    • Preprocess instructions and action sequences
    • Pre-extract ResNet visual features for each viewpoint (see the sketch after this list)

  2. Implement the model
    • Build the Seq2SeqVLN class with cross-modal attention
    • Train on the R2R train split
    • Monitor loss and validation SR

  3. Evaluate on R2R
    • Test on the val_seen and val_unseen splits
    • Compare SR and SPL against published baselines
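
For the feature pre-extraction step above, the sketch below reuses the frozen resnet and transform defined in Section 3.2.2 and caches one pooled feature vector per viewpoint image; the one-.npy-per-image layout is just a convenient convention.

import os
import numpy as np
import torch
from PIL import Image

def precompute_viewpoint_features(image_paths, out_dir):
    # Cache a pooled 2048-d ResNet-50 feature per viewpoint so training never touches raw pixels
    os.makedirs(out_dir, exist_ok=True)
    for path in image_paths:
        img = np.array(Image.open(path).convert("RGB"))
        with torch.no_grad():
            fmap = resnet(transform(img).unsqueeze(0))        # (1, 2048, H', W')
            feat = fmap.mean(dim=(2, 3)).squeeze(0).numpy()   # global average pool -> (2048,)
        np.save(os.path.join(out_dir, os.path.basename(path) + ".npy"), feat)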

Phase 3 — Tier 3: Foundation Models (Week 3)

  1. Set up CLIP
    • Install the OpenAI CLIP package:
      pip install git+https://github.com/openai/CLIP.git
    • Implement viewpoint scoring
    • Integrate with Habitat-Sim for candidate viewpoint rendering (see the sketch after this list)

  2. Set up LLaVA (optional)
    • Requires a GPU with ≥ 16 GB VRAM
    • Fine-tune or use zero-shot prompting

  3. Evaluate zero-shot performance
    • Compare against the Tier 2 trained models
    • Measure generalization to unseen environments
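
For the candidate-viewpoint step above, one simple zero-shot strategy is to spin the agent through a set of discrete headings and score each captured frame with score_viewpoints(). The sketch below assumes a Habitat-Sim instance configured like the episode runner in Section 5.3, with its 15-degree turn_left action (so 24 turns cover a full rotation).

def render_candidate_headings(sim, num_headings=24):
    # Rotate in place, grabbing one RGB frame per heading; these frames become the
    # candidate_frames argument of score_viewpoints()
    agent = sim.get_agent(0)
    start_state = agent.get_state()
    frames = []
    for _ in range(num_headings):
        obs = sim.get_sensor_observations()
        frames.append(obs["color_sensor"][..., :3])   # drop the alpha channel
        sim.step("turn_left")                          # assumes a 15-degree turn_left action is registered
    agent.set_state(start_state)                       # restore the original heading
    return frames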

7. Extensions and Variations

7.1 AuxRN — Auxiliary Reasoning

Add auxiliary reasoning tasks during training:

  • CVAE (Contrastive VLN): contrastive learning between positive and negative instruction-path pairs
  • Perception loss: reconstruction loss on visual features
  • Self-monitoring: track whether the agent is making progress toward the goal
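
As a concrete illustration of the self-monitoring idea, here is a minimal sketch of a progress-estimation head trained jointly with the action loss; the class name, dimensions, and loss weight are illustrative and not taken from the AuxRN paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressMonitor(nn.Module):
    # Auxiliary head: regress the fraction of the path already completed from the decoder state
    def __init__(self, hidden_dim=512):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, decoder_state):
        return torch.sigmoid(self.head(decoder_state)).squeeze(-1)  # predicted progress in [0, 1]

def vln_loss(action_logits, action_gt, progress_pred, progress_gt, aux_weight=0.5):
    # Main action cross-entropy plus a weighted auxiliary progress-regression term
    return F.cross_entropy(action_logits, action_gt) + aux_weight * F.mse_loss(progress_pred, progress_gt)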

7.2 Vision-Language Pre-training (VLN-BERT)

Pretrain a joint vision-language model on large-scale image-caption datasets before fine-tuning on VLN:

Pretraining: COCO captions, Visual Genome → Learn aligned vision-language representations
Fine-tuning: R2R dataset → Learn navigation-specific grounding

7.3 Multi-Modal Fusion

Instead of late fusion (CLIP-style), use early fusion:

  • Concatenate visual and language features at every layer
  • Cross-attention layers (as in Flamingo, GPT-4V) for deeper grounding
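
A minimal sketch of one such early-fusion block, in which visual tokens attend to language tokens through nn.MultiheadAttention; stacking several of these blocks (optionally with the symmetric language-to-vision direction) gives the deeper grounding described above. Dimensions and names are illustrative.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    # One early-fusion layer: visual tokens (queries) attend to language tokens (keys/values)
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, visual_tokens, lang_tokens):
        # visual_tokens: (B, N_v, D), lang_tokens: (B, N_l, D)
        attended, _ = self.cross_attn(query=visual_tokens, key=lang_tokens, value=lang_tokens)
        x = self.norm1(visual_tokens + attended)   # residual + norm
        return self.norm2(x + self.ffn(x))         # feed-forward with residual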

7.4 Active Learning

  • Collect human feedback on failed episodes
  • Retrain on corrected demonstrations (DAgger-style)
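
A minimal sketch of the DAgger-style loop: roll out the current policy, label every visited state with the corrective action, and retrain on the aggregated dataset. The policy, expert, and env interfaces here are placeholders for whatever your stack provides.

def dagger_iteration(policy, expert, env, dataset, num_episodes=10):
    # Aggregate (observation, expert_action) pairs from states the *policy* actually visits
    for _ in range(num_episodes):
        obs, done = env.reset(), False
        while not done:
            exec_action = policy.act(obs)            # the learner's action drives the rollout
            dataset.append((obs, expert.act(obs)))   # but the expert labels the visited state
            obs, done = env.step(exec_action)
    policy.fit(dataset)                              # supervised retraining on the grown dataset
    return policy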

8. References

  1. Anderson et al., 2018 — Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments — R2R dataset and VLN task definition
  2. Fried et al., 2018 — Speaker-Follower Models for Vision-and-Language Navigation — Seq2Seq VLN with attention
  3. Wang et al., 2018 — Look Before You Leap: Bridging Model-Free and Model-Based RL for Vision-Based Navigation — CMU How-to-nav
  4. Huang et al., 2019 — Transferable Representation Learning in Vision-and-Language Navigation — AuxRN auxiliary tasks
  5. Li et al., 2022 — VLN-BERT: A Pretrained Language Model for Vision-and-Language Navigation — Pretraining for VLN
  6. Radford et al., 2021 — Learning Transferable Visual Models From Natural Language Supervision (CLIP) — CLIP foundation model
  7. Liu et al., 2023 — Visual Instruction Tuning (LLaVA) — LLaVA architecture
  8. Habitat-Sim GitHub — Embodied AI simulator
  9. AI2-THOR GitHub — Interactive household environment
  10. R2R Dataset — Room-to-Room navigation benchmark
  11. Ma et al., 2019 — Rethinking the Performance of Navigation with Language — VLN benchmarking analysis