
Vision-Language Navigation

Project Type: Embodied AI | Difficulty: ★★★☆☆ to ★★★★★ | Estimated Time: 2–4 weekends


1. Project Overview

Vision-Language Navigation (VLN) is the task of enabling a robot to reach a goal location in an environment by following natural language instructions, such as "Turn left at the blue chair, go past the kitchen, and stop at the dining table." The robot must ground language references to visual observations and plan a navigation path accordingly.

┌─────────────────────────────────────────────────────────────────────┐
│                    Vision-Language Navigation Pipeline                │
│                                                                     │
│   ┌──────────┐    ┌──────────────────┐    ┌────────────────────┐    │
│   │ Language │───▶│ Instruction     │───▶│ Cross-Modal       │    │
│   │ "Go past │    │ Parser (NLP)    │    │ Grounding         │    │
│   │ the red  │    │                 │    │                    │    │
│   │ door..." │    │ • Entity extract │    │ • Vision-language  │    │
│   └──────────┘    │ • Action parse   │    │   alignment        │    │
│                   │ • Waypoint seq   │    │ • Spatial reasoning│    │
│                   └──────────────────┘    └─────────┬──────────┘    │
│                                                      │               │
│   ┌──────────┐    ┌──────────────────┐                │               │
│   │ RGB/     │───▶│ Visual Feature  │───────────────┘               │
│   │ Depth    │    │ Extractor       │                                │
│   │ Camera   │    │ (ResNet/ViT/CLIP)                              │
│   └──────────┘    └──────────────────┘                                │
│                              │                                        │
│                              ▼                                        │
│                   ┌──────────────────┐    ┌────────────────────┐    │
│                   │ Path Planner     │───▶│ Action Commands    │    │
│                   │ (A* / RL Policy) │    │ (vel, turn angles) │    │
│                   └──────────────────┘    └────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘

In this project you will implement three progressively more sophisticated approaches:

| Tier | Approach | Key Technique | Dataset |
|------|----------|---------------|---------|
| 1 — Traditional | Modular pipeline | NLP parsing + landmark detection + A* | Custom synthetic |
| 2 — Intermediate | Seq2Seq + attention | Cross-modal attention, teacher forcing | R2R (Room-to-Room) |
| 3 — Modern | Foundation models | CLIP visual features + LLM instruction following | Zero-shot |

2. Hardware & Software Requirements

Hardware

| Component | Specification | Notes |
|-----------|---------------|-------|
| Robot platform | TurtleBot3 / custom wheeled robot | Differential drive |
| RGB-D camera | RealSense D435 / Azure Kinect | Required for depth |
| Onboard PC | Jetson Nano / Raspberry Pi 4 / Laptop | For real-world deployment |
| (Optional) LiDAR | RPLIDAR A1 / L515 | For modular pipeline |
| Simulation PC | Desktop with dedicated GPU | For Habitat / AI2-THOR |

Software

| Package | Version | Purpose |
|---------|---------|---------|
| Python | ≥ 3.8 | Core language |
| PyTorch | ≥ 1.13 | Neural network training |
| Transformers | ≥ 4.30 | CLIP, LLaVA, GPT models |
| OpenCV | ≥ 4.5 | Image preprocessing |
| NumPy | ≥ 1.20 | Numerical computation |
| Habitat-Sim | ≥ 0.2 | 3D embodied AI simulator |
| AI2-THOR | ≥ 4.0 | Household navigation |
| spaCy / NLTK | latest | NLP instruction parsing |
| scikit-image | ≥ 0.19 | Image feature extraction |
| matplotlib | ≥ 3.5 | Visualization |

pip install torch torchvision transformers opencv-python numpy
pip install habitat-sim ai2-thor  # simulation backends
pip install spacy nltk scikit-image matplotlib
python -m spacy download en_core_web_sm

3. Tier 1 — Traditional: Modular Pipeline

3.1 Concept

The modular pipeline decomposes VLN into three independent stages: (1) NLP parsing to extract entities and actions from the instruction, (2) visual landmark detection to locate referenced objects in the image, and (3) path planning to navigate toward the detected landmarks.

This approach is transparent and debuggable — each stage is a white box with interpretable outputs.

3.2 Key Components

3.2.1 Instruction Parser (NLP)

We use spaCy noun-chunk extraction, named entity recognition (NER), and dependency parsing to extract:

  • Landmarks: objects referenced in the instruction ("blue chair", "kitchen table")
  • Actions: navigation verbs ("turn", "go", "stop", "pass")
  • Directions: spatial relations ("left", "right", "straight", "behind")

import spacy

nlp = spacy.load("en_core_web_sm")

def parse_instruction(instruction: str) -> dict:
    """
    Parse a navigation instruction into structured components.

    Returns: {
        'entities': [{'text': 'blue chair', 'label': 'LANDMARK'}, ...],
        'actions':  [{'verb': 'turn', 'direction': 'left'}, ...],
        'route':    ['turn left', 'go straight', 'stop']
    }
    """
    doc = nlp(instruction)
    entities = []
    actions = []

    # Landmark extraction: noun chunks cover object references ("blue chair",
    # "kitchen table") that standard NER labels rarely capture
    for chunk in doc.noun_chunks:
        entities.append({'text': chunk.text, 'label': 'LANDMARK'})
    for ent in doc.ents:
        entities.append({'text': ent.text, 'label': ent.label_})

    # Verb + direction extraction via dependency parsing
    for token in doc:
        if token.pos_ == "VERB":
            direction = None
            for child in token.children:
                # "turn left" -> adverbial modifier / particle; "go to the kitchen" -> prepositional phrase
                if child.dep_ in ("advmod", "prt"):
                    direction = child.text
                elif child.dep_ == "prep":
                    direction = " ".join(t.text for t in child.subtree)
            actions.append({'verb': token.lemma_, 'direction': direction})

    # Build route sequence
    route = []
    for token in doc:
        if token.dep_ == "ROOT" and token.pos_ == "VERB":
            route.append(token.lemma_)
        if token.dep_ in ("prep", "pcomp"):
            route.append(token.head.text + " " + token.text)

    return {'entities': entities, 'actions': actions, 'route': route}

3.2.2 Visual Landmark Detector

We use a pretrained ResNet-50 to extract spatial feature maps (which can be matched against landmark text embeddings), complemented by simple HSV color thresholding to localize colored landmarks in the demo below.

import torch
import torchvision.models as models
import torchvision.transforms as T
import cv2
import numpy as np

# Load pretrained ResNet-50 as a frozen feature extractor
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet = torch.nn.Sequential(*list(resnet.children())[:-2])  # Drop avgpool + FC, keep feature maps
resnet.eval()

transform = T.Compose([
    T.ToTensor(),
    T.Resize((224, 224)),  # ResNet expects roughly ImageNet-sized inputs
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# CLIP-style landmark vocabulary (simplified)
LANDMARK_VOCAB = [
    "chair", "table", "door", "window", "bed", "sofa",
    "kitchen", "bathroom", "hallway", "stairs",
    "blue", "red", "green", "white", "black"
]

def detect_landmarks(frame: np.ndarray, target_landmarks: list) -> list:
    """
    Detect landmark locations in the image frame.

    Parameters
    ----------
    frame : BGR image (HxWx3)
    target_landmarks : list of landmark names from instruction parser

    Returns
    -------
    detections : list of dicts with 'label', 'bbox', 'confidence'
    """
    # Convert to RGB and preprocess (the transform resizes to 224x224)
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    input_tensor = transform(rgb).unsqueeze(0)

    # Extract feature map (1x2048x7x7 for a 224x224 input)
    with torch.no_grad():
        features = resnet(input_tensor)  # [1, 2048, H', W']

    B, C, H, W = features.shape
    # Per-region deep features, available for matching against landmark embeddings
    # (the demo below only uses the color-based detector)
    feature_map = features.squeeze(0).reshape(C, H * W).T  # (H*W) x 2048

    # Simple color-based landmark detection (complement to deep features)
    detections = []
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

    color_map = {
        'blue': ([100, 150, 0], [140, 255, 255]),
        'red':  ([0, 100, 100], [10, 255, 255]),
        'green':([40, 50, 50],  [80, 255, 255]),
    }

    for landmark in target_landmarks:
        color_key = landmark.lower()
        if color_key in color_map:
            lower, upper = color_map[color_key]
            mask = cv2.inRange(hsv, np.array(lower), np.array(upper))
            # Find largest contour
            contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
            if contours:
                largest = max(contours, key=cv2.contourArea)
                x, y, w, h = cv2.boundingRect(largest)
                detections.append({
                    'label': landmark,
                    'bbox': (x, y, x+w, y+h),
                    'confidence': float(cv2.contourArea(largest) / (frame.shape[0] * frame.shape[1]))
                })

    return detections

3.2.3 A* Path Planner

Given a top-down map (built from depth sensor or SLAM), we use A* to plan a path that passes through the detected landmark locations.

import heapq

def astar_path(grid: np.ndarray, start: tuple, goal: tuple,
               landmarks: list = None) -> list:
    """
    A* path planning on a 2D occupancy grid.

    Parameters
    ----------
    grid : 2D numpy array (0 = free, 1 = obstacle)
    start : (row, col) starting position
    goal : (row, col) goal position
    landmarks : list of (row, col) waypoints to visit in order

    Returns
    -------
    path : list of (row, col) positions from start to goal
    """
    if landmarks is None:
        landmarks = []

    def heuristic(a, b):
        # Chebyshev distance: admissible for 8-connected moves with unit step cost
        return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

    def neighbors(pos):
        for dr, dc in [(-1,0),(1,0),(0,-1),(0,1),(-1,-1),(-1,1),(1,-1),(1,1)]:
            nr, nc = pos[0] + dr, pos[1] + dc
            if 0 <= nr < grid.shape[0] and 0 <= nc < grid.shape[1]:
                if grid[nr, nc] == 0:  # free cell
                    yield (nr, nc)

    # Insert intermediate goals for landmarks
    waypoints = [start] + landmarks + [goal]

    full_path = []
    for i in range(len(waypoints) - 1):
        sub_path = _astar_single(grid, waypoints[i], waypoints[i+1], heuristic, neighbors)
        if sub_path is None:
            return None  # No path found
        full_path.extend(sub_path[:-1])

    full_path.append(goal)
    return full_path


def _astar_single(grid, start, goal, heuristic, neighbors):
    """Single-shot A* between two waypoints."""
    open_set = [(heuristic(start, goal), start)]
    came_from = {}
    g_score = {start: 0}

    while open_set:
        _, current = heapq.heappop(open_set)

        if current == goal:
            # Reconstruct path
            path = []
            while current in came_from:
                path.append(current)
                current = came_from[current]
            path.append(start)
            return path[::-1]

        for neighbor in neighbors(current):
            tentative_g = g_score[current] + 1
            if neighbor not in g_score or tentative_g < g_score[neighbor]:
                came_from[neighbor] = current
                g_score[neighbor] = tentative_g
                f_score = tentative_g + heuristic(neighbor, goal)
                heapq.heappush(open_set, (f_score, neighbor))

    return None  # No path found

3.3 Putting It All Together

def modular_vln_loop(instruction: str, rgb_frame: np.ndarray,
                    depth_frame: np.ndarray, grid_map: np.ndarray,
                    robot_pos: tuple) -> tuple:
    """
    Complete modular VLN pipeline.

    Returns: (next_action, detected_landmarks, planned_path)
    """
    # Step 1: Parse instruction
    parsed = parse_instruction(instruction)
    target_landmarks = [e['text'] for e in parsed['entities']]

    # Step 2: Detect landmarks in current view
    detections = detect_landmarks(rgb_frame, target_landmarks)

    # Step 3: Project detections to world coordinates (simplified)
    # In practice: use depth + camera intrinsics + robot pose
    landmark_waypoints = [d['bbox'][:2] for d in detections]  # placeholder

    # Step 4: A* path planning
    # Goal is the last detected landmark or end of instruction
    goal = landmark_waypoints[-1] if landmark_waypoints else (15, 15)
    path = astar_path(grid_map, robot_pos, goal, landmark_waypoints[:-1])

    # Step 5: Compute next action from path
    if path and len(path) > 1:
        next_pos = path[1]
        action = {'type': 'move_to', 'target': next_pos}
    else:
        action = {'type': 'stop'}

    return action, detections, path

3.4 Limitations

| Aspect | Issue |
|--------|-------|
| NLP | Struggles with complex referring expressions ("the second door on your left") |
| Vision | Color-based detection is brittle; needs robust landmark recognition |
| Planning | A* on a 2D grid ignores 3D geometry and doorways |
| Generalization | Each component must be retrained independently for new environments |

4. Tier 2 — Intermediate: Seq2Seq with Attention

4.1 Concept

Seq2Seq VLN uses an encoder-decoder architecture where:

  • Encoder: processes both the instruction (as a sequence of word embeddings) and the visual observation (as a sequence of spatial features).
  • Decoder: generates a sequence of navigation actions, attending to relevant parts of the instruction and visual features at each step.

The key innovation is cross-modal attention, which allows the model to align language tokens with visual regions.

4.2 Model Architecture

The Speaker-Follower model (Fried et al., 2018) and the CMU How-to-nav model (Wang et al., 2018) both use attention-based seq2seq:

┌─────────────────────────────────────────────────────────────────────┐
│                 Seq2Seq VLN Model (Speaker-Follower)                 │
│                                                                     │
│  Instruction: "Turn left at the blue chair"                         │
│                                                                     │
│  ┌─────────┐                                                        │
│  │ Encoder │                                                        │
│  │         │                                                        │
│  │ ┌─────┐ │    ┌──────────────────────────────────────────────┐   │
│  │ │ w₁  │─┼───▶│                                              │   │
│  │ ├─────┤ │    │            Cross-Modal Attention              │   │
│  │ │ w₂  │─┼───▶│                                              │   │
│  │ ├─────┤ │    │  α_i = softmax(vᵀ·tanh(W₁h_i + W₂v_j))      │   │
│  │ │ w₃  │─┼───▶│                                              │   │
│  │ └─────┘ │    └──────────────────────────────────────────────┘   │
│  └─────────┘                         │                             │
│       │                             │                             │
│       │ h_i (language hidden)        │ c_t (context vector)       │
│       │                             ▼                             │
│       │                      ┌──────────────┐                     │
│  ┌────▼────┐                 │   Decoder    │                     │
│  │  LSTM   │◀────────────────│              │                     │
│  │         │                 │  a_t = argmax│                     │
│  │ h_t     │────────────────▶│  P(a|context)│                     │
│  └─────────┘                 └──────────────┘                     │
│                                                                     │
│  Action space: {Forward, Left, Right, <Stop>}                      │
└─────────────────────────────────────────────────────────────────────┘

4.3 Cross-Modal Attention

The attention mechanism computes a weighted sum of visual features based on the current decoder state:

\[ c_t = \sum_{j=1}^{N} \alpha_{tj} \, v_j \]

where the attention weights are:

\[ \alpha_{tj} = \frac{\exp\big(e_{tj}\big)}{\sum_{k=1}^{N} \exp\big(e_{tk}\big)}, \quad e_{tj} = v^\top \tanh\big(W_1 h_{t-1} + W_2 v_j\big) \]
  • \(h_{t-1}\) — previous decoder hidden state
  • \(v_j\) — visual feature at region \(j\)
  • \(W_1, W_2, v\) — learnable attention parameters

4.4 Complete PyTorch Implementation

"""
Tier 2: Seq2Seq VLN with Cross-Modal Attention
==============================================
Implements a simplified Speaker-Follower style model.
Trainable on R2R (Room-to-Room) dataset.
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

# ─── Model ──────────────────────────────────────────────────────────────────

class Seq2SeqVLN(nn.Module):
    """
    Seq2Seq VLN with cross-modal attention.

    Args:
        vocab_size: size of instruction vocabulary
        embed_dim: word embedding dimension
        hidden_dim: LSTM hidden dimension
        visual_dim: dimension of visual features (e.g., ResNet-2048)
        num_actions: number of navigation actions
        dropout: dropout probability
    """

    def __init__(self, vocab_size: int, embed_dim: int = 256,
                 hidden_dim: int = 512, visual_dim: int = 2048,
                 num_actions: int = 4, dropout: float = 0.3):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_actions = num_actions

        # Language embedding
        self.word_embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lang_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.lang_proj = nn.Linear(hidden_dim * 2, hidden_dim)

        # Visual projection
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)

        # Cross-modal attention
        self.attn_W1 = nn.Linear(hidden_dim, hidden_dim)
        self.attn_W2 = nn.Linear(hidden_dim, hidden_dim)
        self.attn_v = nn.Linear(hidden_dim, 1)

        # Previous-action embedding (hidden_dim so the decoder input below is 3 * hidden_dim)
        self.action_embed = nn.Embedding(num_actions, hidden_dim)

        # Decoder LSTM: input = [attended visual context, previous action, language summary]
        self.decoder_lstm = nn.LSTM(hidden_dim * 3, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, num_actions)

        self.dropout = nn.Dropout(dropout)

    def forward(self, instruction_tokens: torch.Tensor,
                visual_features: torch.Tensor,
                action_history: torch.Tensor) -> torch.Tensor:
        """
        Forward pass.

        Parameters
        ----------
        instruction_tokens : (B, L) — token IDs
        visual_features : (B, N, D_v) — N spatial regions, D_v dims
        action_history : (B, T) — previous actions (for teacher forcing)

        Returns
        -------
        logits : (B, num_actions) — action probability logits
        """
        B = instruction_tokens.size(0)

        # ── Language encoder ──────────────────────────────────────────────
        lang_embed = self.word_embed(instruction_tokens)  # (B, L, D_e)
        lang_h, (h_lang, _) = self.lang_lstm(lang_embed)  # (B, L, 2D), (2, B, D)
        # Combine bidirectional states
        h_lang = torch.cat([h_lang[0], h_lang[1]], dim=-1)  # (B, 2D)
        h_lang = self.dropout(torch.tanh(self.lang_proj(h_lang)))  # (B, D)

        # ── Visual projection ──────────────────────────────────────────────
        V = self.visual_proj(visual_features)  # (B, N, D)

        # ── Decoder LSTM ──────────────────────────────────────────────────
        # Initialize LSTM with language context
        decoder_state = (h_lang.unsqueeze(0), torch.zeros_like(h_lang.unsqueeze(0)))

        # Embed the most recent action (zeros at the first step)
        if action_history.size(1) > 0:
            prev_action = self.action_embed(action_history[:, -1])  # (B, D)
        else:
            prev_action = torch.zeros(B, self.hidden_dim, device=instruction_tokens.device)

        # Context from attention
        context = self._cross_attention(h_lang, V)  # (B, D)

        # Decoder input: attended visual context, previous action, language summary
        lstm_input = torch.cat([context, prev_action, h_lang], dim=-1)  # (B, 3D)
        lstm_out, decoder_state = self.decoder_lstm(lstm_input.unsqueeze(1), decoder_state)
        lstm_out = lstm_out.squeeze(1)  # (B, D)

        # ── Action prediction ──────────────────────────────────────────────
        logits = self.action_head(self.dropout(lstm_out))  # (B, num_actions)
        return logits

    def _cross_attention(self, query: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        """
        Cross-modal attention: attend to visual features given a language query.

        Parameters
        ----------
        query : (B, D) — language hidden state
        visual : (B, N, D) — visual feature grid

        Returns
        -------
        context : (B, D) — attended visual features
        """
        # Expand query for batch processing
        q = query.unsqueeze(1).expand_as(visual)  # (B, N, D)
        # Attention energy
        energy = torch.tanh(self.attn_W1(q) + self.attn_W2(visual))  # (B, N, D)
        energy = self.attn_v(energy).squeeze(-1)  # (B, N)
        alpha = F.softmax(energy, dim=-1)  # (B, N), normalized over regions
        context = torch.bmm(alpha.unsqueeze(1), visual).squeeze(1)  # (B, D)
        return context


def train_seq2seq_vln(model: Seq2SeqVLN, train_loader, optimizer, num_epochs: int = 20):
    """Train the Seq2Seq VLN model."""
    criterion = nn.CrossEntropyLoss(ignore_index=-1)

    for epoch in range(num_epochs):
        model.train()
        total_loss = 0.0

        for batch in train_loader:
            instruction, visual, actions_gt, lengths = batch

            optimizer.zero_grad()

            # Forward pass
            logits = model(instruction, visual, actions_gt[:, :-1])

            # Simplified teacher forcing: predict the final action given the ground-truth history
            loss = criterion(logits, actions_gt[:, -1])

            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)
        print(f"[Epoch {epoch+1}/{num_epochs}] Loss: {avg_loss:.4f}")


# ─── R2R Dataset (simplified) ────────────────────────────────────────────────

class R2RDataset(torch.utils.data.Dataset):
    """
    Simplified Room-to-Room dataset.

    Each sample contains:
    - instruction: natural language description
    - path: sequence of (x, y, heading) viewpoints
    - action_seq: corresponding action sequence
    """

    def __init__(self, split='train'):
        self.split = split
        # In practice, load from Matterport3D R2R dataset
        # Here we show synthetic data for illustration
        self.data = [
            {
                'instruction': "Turn left and go past the chair",
                'path': [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, np.pi/2)],
                'actions': [0, 0, 1, 3],  # Forward, Forward, Left, Stop (0=Forward, 1=Left, 2=Right, 3=Stop)
            },
        ]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        return {
            'instruction': sample['instruction'],
            'path': np.array(sample['path']),
            'actions': np.array(sample['actions'], dtype=np.int64),
        }


# ─── Evaluation Metrics ───────────────────────────────────────────────────────

def evaluate_vln(model, dataset, device='cuda'):
    """
    Evaluate VLN model using standard metrics.

    Metrics:
    - Success Rate (SR): fraction of episodes where agent stops within 3m of goal
    - SPL: Success weighted by Path Length = SR * optimal_length / max(agent_length, optimal_length)
    - nDTW: normalized Dynamic Time Warping (sequence similarity)
    """
    model.eval()
    sr_total, spl_total, n_samples = 0, 0, 0

    with torch.no_grad():
        for sample in dataset:
            instruction = sample['instruction']
            gt_path = sample['path']
            gt_actions = sample['actions']

            # Simulate inference (a real evaluation rolls the agent out in Habitat-Sim);
            # assumes the sample also carries precomputed 'inst_encoded' / 'visual' tensors
            logits = model(
                sample['inst_encoded'].unsqueeze(0).to(device),
                sample['visual'].unsqueeze(0).to(device),
                torch.zeros(1, 0, dtype=torch.long).to(device),
            )
            pred_action = torch.argmax(logits, dim=-1).item()

            # Compute SR (simplified: did the model choose the correct final action?)
            sr = float(pred_action == gt_actions[-1])
            sr_total += sr
            n_samples += 1

    return {
        'Success Rate': sr_total / n_samples,
        'SPL': spl_total / n_samples if n_samples > 0 else 0,
    }

4.5 R2R Dataset & Evaluation

The Room-to-Room (R2R) dataset is the standard benchmark for VLN:

| Metric | Definition | Ideal |
|--------|------------|-------|
| Success Rate (SR) | Fraction of episodes where the agent stops within 3 m of the goal | 1.0 |
| SPL | SR × optimal length / max(path length, optimal length) | 1.0 |
| nDTW | Normalized Dynamic Time Warping similarity to the reference path | 1.0 |
| CLS | Coverage weighted by Length Score | 1.0 |
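
For reference, here is a minimal sketch of computing SR and SPL for a single episode from the agent's (x, y) trajectory; the helper names and the 3 m success threshold default are illustrative, and SPL uses the standard max(path length, optimal length) denominator.

import numpy as np

def path_length(path):
    # Sum of Euclidean distances between consecutive (x, y) waypoints
    return float(sum(np.linalg.norm(np.subtract(b, a)) for a, b in zip(path[:-1], path[1:])))

def success_and_spl(agent_path, goal, optimal_length, threshold=3.0):
    # SR: 1.0 if the agent's final position is within `threshold` metres of the goal
    final_dist = float(np.linalg.norm(np.subtract(agent_path[-1], goal)))
    success = float(final_dist <= threshold)
    # SPL: success weighted by optimal length / max(agent length, optimal length)
    agent_length = path_length(agent_path)
    spl = success * optimal_length / max(agent_length, optimal_length)
    return success, spl

# Example: a 6 m trajectory against a 5 m shortest path that ends 1 m from the goal
sr, spl = success_and_spl([(0, 0), (2, 0), (4, 0), (4, 2)], goal=(5, 2), optimal_length=5.0)
print(sr, spl)  # 1.0, 5/6 ≈ 0.83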

4.6 Comparison — Modular vs Seq2Seq

| Aspect | Modular Pipeline | Seq2Seq + Attention |
|--------|------------------|---------------------|
| Interpretability | High (white-box stages) | Low (neural black box) |
| Training | Supervised per component | End-to-end differentiable |
| Generalization | Poor (rule-based) | Better (learned representations) |
| Data efficiency | High (no training needed) | Low (needs large VLN datasets) |
| Compositionality | Limited (hand-coded rules) | Strong (learned generalization) |
| Training complexity | – | ⭐⭐⭐ |

5. Tier 3 — Modern: Foundation Models for VLN

5.1 Concept

Foundation models (large pretrained vision-language models) enable zero-shot VLN — navigating in novel environments without task-specific training. CLIP provides aligned visual-text features; LLaVA / GPT-4V enable instruction following.

Key advantage: no VLN-specific training data required.

5.2 CLIP Visual Features for Navigation

CLIP encodes images and text into a shared embedding space. We use CLIP's visual encoder to represent each candidate navigation viewpoint, then score them based on instruction similarity.

"""
Tier 3: Zero-Shot VLN using CLIP
================================
CLIP-based viewpoint scoring for instruction-guided navigation.
No VLN-specific training required.
"""

import torch
import clip
from PIL import Image
import numpy as np
import cv2

# Load CLIP model (ViT-B/32)
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


def encode_text_instruction(instruction: str) -> torch.Tensor:
    """Encode a natural language instruction into CLIP text features."""
    text = clip.tokenize([instruction]).to(device)
    with torch.no_grad():
        text_features = model.encode_text(text)
        text_features /= text_features.norm(dim=-1, keepdim=True)
    return text_features


def encode_viewpoint(frame: np.ndarray) -> torch.Tensor:
    """Encode an RGB frame into CLIP visual features."""
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    image_input = preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image_input)
        image_features /= image_features.norm(dim=-1, keepdim=True)
    return image_features


def score_viewpoints(current_frame: np.ndarray,
                     candidate_frames: list,
                     instruction: str) -> list:
    """
    Score candidate navigation viewpoints based on instruction relevance.

    Parameters
    ----------
    current_frame : current RGB observation
    candidate_frames : list of RGB frames from candidate viewpoints
    instruction : natural language navigation instruction

    Returns
    -------
    scores : list of (viewpoint_id, similarity_score) sorted descending
    """
    # Encode instruction once
    text_features = encode_text_instruction(instruction)

    results = []
    for idx, frame in enumerate(candidate_frames):
        vis_features = encode_viewpoint(frame)
        # Cosine similarity in CLIP embedding space
        similarity = (vis_features @ text_features.T).item()
        results.append((idx, similarity))

    # Sort by descending similarity
    results.sort(key=lambda x: x[1], reverse=True)
    return results


def zero_shot_vln_step(instruction: str, current_frame: np.ndarray,
                       candidate_poses: list, robot_state: dict) -> dict:
    """
    Single step of zero-shot VLN using CLIP.

    Parameters
    ----------
    instruction : full navigation instruction
    current_frame : current RGB observation
    candidate_poses : list of candidate robot poses to evaluate
    robot_state : {'x', 'y', 'heading', 'map'}

    Returns
    -------
    next_action : {'type': 'move', 'target_pose': pose}
    """
    # In practice: render candidate viewpoints from Habitat-sim
    # Here: simulate with current frame (would be different viewpoints)
    candidate_frames = [current_frame] * len(candidate_poses)

    # Score each candidate
    scores = score_viewpoints(current_frame, candidate_frames, instruction)

    # Pick highest-scoring viewpoint
    best_idx, best_score = scores[0]

    if best_score > 0.25:  # Threshold for accepting a viewpoint
        action = {
            'type': 'move',
            'target_pose': candidate_poses[best_idx],
            'confidence': best_score
        }
    else:
        # Fallback: random exploration or stop
        action = {'type': 'explore', 'confidence': best_score}

    return action

5.3 LLaVA for Instruction Following

LLaVA (Large Language and Vision Assistant) can reason about navigation instructions and visual context:

"""
LLaVA-based navigation instruction generation.
Given a panoramic view, LLaVA generates a sub-instruction for the next step.
"""

from transformers import AutoProcessor, LlavaForConditionalGeneration
import torch
from PIL import Image

# Load LLaVA-1.5-7B
model_name = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_name)
model = LlavaForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)


def generate_sub_instruction(panorama_image: Image.Image,
                              full_instruction: str,
                              history: list) -> str:
    """
    Use LLaVA to generate the next sub-instruction.

    Parameters
    ----------
    panorama_image : 360-degree panoramic RGB image (PIL)
    full_instruction : original navigation instruction
    history : list of already-executed sub-instructions

    Returns
    -------
    next_step : natural language sub-instruction
    """
    # LLaVA-1.5 chat format: the prompt must include the <image> placeholder token
    prompt = (
        "USER: <image>\n"
        "You are a navigation assistant. Given this panoramic view and a full "
        "navigation instruction, identify the most important landmark or direction "
        "for the NEXT step. Output ONLY a short instruction (max 10 words).\n"
        f"Full instruction: '{full_instruction}'\n"
        f"Previous steps completed: {', '.join(history) if history else 'None'}\n"
        "What is the next step? ASSISTANT:"
    )

    inputs = processor(text=prompt, images=panorama_image, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=20,
            do_sample=False,  # greedy decoding -> deterministic output
        )

    # Decode only the newly generated tokens, skipping the echoed prompt
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    next_step = processor.decode(new_tokens, skip_special_tokens=True)
    return next_step.strip()


# ─── Habitat-Sim Integration ──────────────────────────────────────────────────

def habitat_vln_episode(instruction: str, start_pose: dict,
                         goal_id: str, num_steps: int = 100):
    """
    Run a VLN episode in Habitat-Sim.

    Parameters
    ----------
    instruction : navigation instruction
    start_pose : {'x', 'y', 'z', 'yaw'}
    goal_id : target object instance ID
    num_steps : max steps before timeout

    Returns
    -------
    result : {'success': bool, 'steps': int, 'path_length': float}
    """
    import numpy as np
    import habitat_sim
    from habitat_sim.utils.common import quat_from_angle_axis

    # Backend configuration (the scene path is an example placeholder — point it at your own .glb)
    backend_cfg = habitat_sim.SimulatorConfiguration()
    backend_cfg.scene_id = "data/scene_datasets/example_scene.glb"

    # RGB sensor mounted at roughly eye height
    rgb_sensor = habitat_sim.CameraSensorSpec()
    rgb_sensor.uuid = "color_sensor"
    rgb_sensor.sensor_type = habitat_sim.SensorType.COLOR
    rgb_sensor.resolution = [480, 640]
    rgb_sensor.position = [0.0, 1.5, 0.0]

    # Discrete action space used for fallback exploration
    agent_cfg = habitat_sim.agent.AgentConfiguration()
    agent_cfg.sensor_specifications = [rgb_sensor]
    agent_cfg.action_space = {
        "move_forward": habitat_sim.agent.ActionSpec(
            "move_forward", habitat_sim.agent.ActuationSpec(amount=0.25)),
        "turn_left": habitat_sim.agent.ActionSpec(
            "turn_left", habitat_sim.agent.ActuationSpec(amount=15.0)),
    }

    sim = habitat_sim.Simulator(habitat_sim.Configuration(backend_cfg, [agent_cfg]))

    # Set agent state
    agent = sim.get_agent(0)
    agent_state = habitat_sim.AgentState()
    agent_state.position = np.array([start_pose['x'], start_pose['y'], start_pose['z']])
    agent_state.rotation = quat_from_angle_axis(start_pose['yaw'], np.array([0.0, 1.0, 0.0]))
    agent.set_state(agent_state)

    done = False
    steps = 0

    for step in range(num_steps):
        # Get RGB observation (RGBA from the color sensor, drop the alpha channel)
        obs = sim.get_sensor_observations()
        rgb = obs['color_sensor'][..., :3]

        # CLIP-based viewpoint scoring (candidate pose generation omitted here)
        action = zero_shot_vln_step(instruction, rgb, candidate_poses=[...], robot_state={})
        steps += 1

        # Execute action: move_and_look is a project-specific helper that places the
        # agent at a target pose; otherwise fall back to simple exploration primitives
        if action['type'] == 'move':
            move_and_look(sim, action['target_pose'])
        else:
            sim.step("move_forward")
            sim.step("turn_left")

        # Check goal (check_goal_reached is a project-specific helper)
        if check_goal_reached(sim, goal_id):
            done = True
            break

    sim.close()
    return {'success': done, 'steps': steps, 'path_length': steps * 0.25}

5.4 Comparison Table — All Three Tiers

| Aspect | Tier 1: Modular | Tier 2: Seq2Seq | Tier 3: Foundation Models |
|--------|-----------------|-----------------|---------------------------|
| Training required | No (rule-based) | Yes (large VLN dataset) | No (zero-shot) |
| Generalization | Poor | Moderate | High |
| Data dependency | None | R2R / RxR (100k samples) | CLIP / LLaVA pretrained |
| Compute cost | Low | Medium | High (large models) |
| Interpretability | High | Medium | Low |
| Success rate (R2R) | ~20–30% | ~50–70% | ~40–60% (zero-shot) |
| Setup complexity | – | ⭐⭐⭐ | ⭐⭐ |
| Best for | Debugging, simple scenes | Research, benchmark SOTA | Novel environments, rapid deployment |

6. Step-by-Step Implementation Guide

Phase 1 — Tier 1: Modular Pipeline (Week 1)

  1. Set up NLP parsing
    • Install spaCy and the English model:
      pip install spacy
      python -m spacy download en_core_web_sm
    • Implement parse_instruction() with spaCy noun chunks, NER, and dependency parsing
    • Test on sample instructions like "Turn left at the blue chair"

  2. Implement landmark detection
    • Use ResNet-50 pretrained features
    • Add color-based detection for common landmarks
    • Visualize detected bounding boxes on camera frames

  3. Implement the A* path planner
    • Create a grid map from the depth sensor or a pre-built map (see the sketch after this list)
    • Integrate landmarks as waypoints
    • Test navigation in a simple environment (Gazebo or Habitat)

  4. Integrate and test
    • Run the pipeline end-to-end in simulation
    • Measure success rate in a known environment
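
For the grid-map step above, here is a minimal sketch of projecting a single depth image into a 2D occupancy grid; the intrinsics, cell size, and the assumption that the camera looks straight ahead are simplifications you would replace with your own calibration and SLAM output.

import numpy as np

def depth_to_occupancy(depth, fx, fy, cx, cy, cam_height=0.3,
                       cell_size=0.05, grid_dim=200, min_h=0.1, max_h=1.5):
    """Project a metric depth image (HxW, metres) into a robot-centred occupancy grid."""
    grid = np.zeros((grid_dim, grid_dim), dtype=np.uint8)
    vs, us = np.nonzero(depth > 0)
    z = depth[vs, us]                            # forward distance (m)
    x = (us - cx) * z / fx                       # lateral offset (m), right is positive
    h = cam_height - (vs - cy) * z / fy          # point height above the floor (m)
    keep = (h > min_h) & (h < max_h)             # drop floor and ceiling points
    rows = (grid_dim // 2 - z[keep] / cell_size).astype(int)   # forward maps to smaller row index
    cols = (grid_dim // 2 + x[keep] / cell_size).astype(int)
    valid = (rows >= 0) & (rows < grid_dim) & (cols >= 0) & (cols < grid_dim)
    grid[rows[valid], cols[valid]] = 1           # mark obstacle cells (0 = free, 1 = obstacle)
    return grid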

Phase 2 — Tier 2: Seq2Seq (Week 2)

  1. Download the R2R dataset
    • Clone the simulator repository and follow its instructions to fetch the R2R data:
      git clone https://github.com/peteanderson80/Matterport3DSimulator.git
    • Preprocess instructions and action sequences
    • Pre-extract ResNet visual features for each viewpoint (see the sketch after this list)

  2. Implement the model
    • Build the Seq2SeqVLN class with cross-modal attention
    • Train on the R2R train split
    • Monitor loss and validation SR

  3. Evaluate on R2R
    • Test on the val_seen and val_unseen splits
    • Compare SR and SPL against published baselines
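
For the feature pre-extraction step above, the sketch below reuses the frozen resnet and transform defined in Section 3.2.2 and caches one pooled feature vector per viewpoint image; the one-.npy-per-image layout is just a convenient convention.

import os
import numpy as np
import torch
from PIL import Image

def precompute_viewpoint_features(image_paths, out_dir):
    # Cache a pooled 2048-d ResNet-50 feature per viewpoint so training never touches raw pixels
    os.makedirs(out_dir, exist_ok=True)
    for path in image_paths:
        img = np.array(Image.open(path).convert("RGB"))
        with torch.no_grad():
            fmap = resnet(transform(img).unsqueeze(0))        # (1, 2048, H', W')
            feat = fmap.mean(dim=(2, 3)).squeeze(0).numpy()   # global average pool -> (2048,)
        np.save(os.path.join(out_dir, os.path.basename(path) + ".npy"), feat)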

Phase 3 — Tier 3: Foundation Models (Week 3)

  1. Set up CLIP
    • Install the OpenAI CLIP package:
      pip install git+https://github.com/openai/CLIP.git
    • Implement viewpoint scoring
    • Integrate with Habitat-Sim for candidate viewpoint rendering (see the sketch after this list)

  2. Set up LLaVA (optional)
    • Requires a GPU with ≥ 16 GB VRAM
    • Fine-tune or use zero-shot prompting

  3. Evaluate zero-shot performance
    • Compare against the Tier 2 trained models
    • Measure generalization to unseen environments
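
For the candidate-viewpoint step above, one simple zero-shot strategy is to spin the agent through a set of discrete headings and score each captured frame with score_viewpoints(). The sketch below assumes a Habitat-Sim instance configured like the episode runner in Section 5.3, with its 15-degree turn_left action (so 24 turns cover a full rotation).

def render_candidate_headings(sim, num_headings=24):
    # Rotate in place, grabbing one RGB frame per heading; these frames become the
    # candidate_frames argument of score_viewpoints()
    agent = sim.get_agent(0)
    start_state = agent.get_state()
    frames = []
    for _ in range(num_headings):
        obs = sim.get_sensor_observations()
        frames.append(obs["color_sensor"][..., :3])   # drop the alpha channel
        sim.step("turn_left")                          # assumes a 15-degree turn_left action is registered
    agent.set_state(start_state)                       # restore the original heading
    return frames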

7. Extensions and Variations

7.1 AuxRN — Auxiliary Reasoning

Add auxiliary reasoning tasks during training:

  • CVAE (Contrastive VLN): contrastive learning between positive and negative instruction-path pairs
  • Perception loss: reconstruction loss on visual features
  • Self-monitoring: track whether the agent is making progress toward the goal
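
As a concrete illustration of the self-monitoring idea, here is a minimal sketch of a progress-estimation head trained jointly with the action loss; the class name, dimensions, and loss weight are illustrative and not taken from the AuxRN paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressMonitor(nn.Module):
    # Auxiliary head: regress the fraction of the path already completed from the decoder state
    def __init__(self, hidden_dim=512):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, decoder_state):
        return torch.sigmoid(self.head(decoder_state)).squeeze(-1)  # predicted progress in [0, 1]

def vln_loss(action_logits, action_gt, progress_pred, progress_gt, aux_weight=0.5):
    # Main action cross-entropy plus a weighted auxiliary progress-regression term
    return F.cross_entropy(action_logits, action_gt) + aux_weight * F.mse_loss(progress_pred, progress_gt)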

7.2 Vision-Language Pre-training (VLN-BERT)

Pretrain a joint vision-language model on large-scale image-caption datasets before fine-tuning on VLN:

Pretraining: COCO captions, Visual Genome → Learn aligned vision-language representations
Fine-tuning: R2R dataset → Learn navigation-specific grounding

7.3 Multi-Modal Fusion

Instead of late fusion (CLIP-style), use early fusion:

  • Concatenate visual and language features at every layer
  • Cross-attention layers (as in Flamingo, GPT-4V) for deeper grounding
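
A minimal sketch of one such early-fusion block, in which visual tokens attend to language tokens through nn.MultiheadAttention; stacking several of these blocks (optionally with the symmetric language-to-vision direction) gives the deeper grounding described above. Dimensions and names are illustrative.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    # One early-fusion layer: visual tokens (queries) attend to language tokens (keys/values)
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, visual_tokens, lang_tokens):
        # visual_tokens: (B, N_v, D), lang_tokens: (B, N_l, D)
        attended, _ = self.cross_attn(query=visual_tokens, key=lang_tokens, value=lang_tokens)
        x = self.norm1(visual_tokens + attended)   # residual + norm
        return self.norm2(x + self.ffn(x))         # feed-forward with residual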

7.4 Active Learning

  • Collect human feedback on failed episodes
  • Retrain on corrected demonstrations (DAgger-style)
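
A minimal sketch of the DAgger-style loop: roll out the current policy, label every visited state with the corrective action, and retrain on the aggregated dataset. The policy, expert, and env interfaces here are placeholders for whatever your stack provides.

def dagger_iteration(policy, expert, env, dataset, num_episodes=10):
    # Aggregate (observation, expert_action) pairs from states the *policy* actually visits
    for _ in range(num_episodes):
        obs, done = env.reset(), False
        while not done:
            exec_action = policy.act(obs)            # the learner's action drives the rollout
            dataset.append((obs, expert.act(obs)))   # but the expert labels the visited state
            obs, done = env.step(exec_action)
    policy.fit(dataset)                              # supervised retraining on the grown dataset
    return policy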

8. References

  1. Anderson et al., 2018 — Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments — R2R dataset and VLN task definition
  2. Fried et al., 2018 — Speaker-Follower Models for Vision-and-Language Navigation — Seq2Seq VLN with attention
  3. Wang et al., 2018 — Look Before You Leap: Bridging Model-Free and Model-Based RL for Vision-Based Navigation — CMU How-to-nav
  4. Huang et al., 2019 — Transferable Representation Learning in Vision-and-Language Navigation — AuxRN auxiliary tasks
  5. Li et al., 2022 — VLN-BERT: A Pretrained Language Model for Vision-and-Language Navigation — Pretraining for VLN
  6. Radford et al., 2021 — Learning Transferable Visual Models From Natural Language Supervision (CLIP) — CLIP foundation model
  7. Liu et al., 2023 — Visual Instruction Tuning (LLaVA) — LLaVA architecture
  8. Habitat-Sim GitHub — Embodied AI simulator
  9. AI2-THOR GitHub — Interactive household environment
  10. R2R Dataset — Room-to-Room navigation benchmark
  11. Ma et al., 2019 — Rethinking the Performance of Navigation with Language — VLN benchmarking analysis