Vision-Language Navigation¶
Project Type: Embodied AI | Difficulty: ★★★☆☆ to ★★★★★ | Estimated Time: 2–4 weekends
1. Project Overview¶
Vision-Language Navigation (VLN) is the task of enabling a robot to reach a goal location in an environment by following natural language instructions, such as "Turn left at the blue chair, go past the kitchen, and stop at the dining table." The robot must ground language references to visual observations and plan a navigation path accordingly.
┌─────────────────────────────────────────────────────────────────────┐
│ Vision-Language Navigation Pipeline │
│ │
│ ┌──────────┐ ┌──────────────────┐ ┌────────────────────┐ │
│ │ Language │───▶│ Instruction │───▶│ Cross-Modal │ │
│ │ "Go past │ │ Parser (NLP) │ │ Grounding │ │
│ │ the red │ │ │ │ │ │
│ │ door..." │ │ • Entity extract │ │ • Vision-language │ │
│ └──────────┘ │ • Action parse │ │ alignment │ │
│ │ • Waypoint seq │ │ • Spatial reasoning│ │
│ └──────────────────┘ └─────────┬──────────┘ │
│ │ │
│ ┌──────────┐ ┌──────────────────┐ │ │
│ │ RGB/ │───▶│ Visual Feature │───────────────┘ │
│ │ Depth │ │ Extractor │ │
│ │ Camera │ │ (ResNet/ViT/CLIP) │
│ └──────────┘ └──────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ ┌────────────────────┐ │
│ │ Path Planner │───▶│ Action Commands │ │
│ │ (A* / RL Policy) │ │ (vel, turn angles) │ │
│ └──────────────────┘ └────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
In this project you will implement three progressively more sophisticated approaches:
| Tier | Approach | Key Technique | Dataset |
|---|---|---|---|
| 1 — Traditional | Modular pipeline | NLP parsing + landmark detection + A* | Custom synthetic |
| 2 — Intermediate | Seq2Seq + attention | Cross-modal attention, teacher forcing | R2R (Room-to-Room) |
| 3 — Modern | Foundation models | CLIP visual features + LLM instruction following | Zero-shot |
2. Hardware & Software Requirements¶
Hardware¶
| Component | Specification | Notes |
|---|---|---|
| Robot platform | TurtleBot3 / custom wheeled robot | Differential drive |
| RGB-D Camera | RealSense D435 / Azure Kinect | Required for depth |
| Onboard PC | Jetson Nano / Raspberry Pi 4 / Laptop | For real-world deployment |
| (Optional) LiDAR | RPLIDAR A1 / L515 | For modular pipeline |
| Simulation PC | Desktop with dedicated GPU | For Habitat / AI2-THOR |
Software¶
| Package | Version | Purpose |
|---|---|---|
| Python | ≥ 3.8 | Core language |
| PyTorch | ≥ 1.13 | Neural network training |
| Transformers | ≥ 4.30 | CLIP, LLaVA, GPT models |
| OpenCV | ≥ 4.5 | Image preprocessing |
| NumPy | ≥ 1.20 | Numerical computation |
| Habitat-Sim | ≥ 0.2 | 3D embodied AI simulator |
| AI2-THOR | ≥ 4.0 | Household navigation |
| spaCy / NLTK | latest | NLP instruction parsing |
| scikit-image | ≥ 0.19 | Image feature extraction |
| matplotlib | ≥ 3.5 | Visualization |
pip install torch torchvision transformers opencv-python numpy
pip install ai2thor                                      # AI2-THOR simulator
conda install habitat-sim -c conda-forge -c aihabitat    # Habitat-Sim is distributed via conda, not pip
pip install spacy nltk scikit-image matplotlib
python -m spacy download en_core_web_sm
3. Tier 1 — Traditional: Modular Pipeline¶
3.1 Concept¶
The modular pipeline decomposes VLN into three independent stages: (1) NLP parsing to extract entities and actions from the instruction, (2) visual landmark detection to locate referenced objects in the image, and (3) path planning to navigate toward the detected landmarks.
This approach is transparent and debuggable — each stage is a white box with interpretable outputs.
3.2 Key Components¶
3.2.1 Instruction Parser (NLP)¶
We use spaCy for named entity recognition (NER) and dependency parsing to extract:
- Landmarks: objects referenced in the instruction ("blue chair", "kitchen table")
- Actions: navigation verbs ("turn", "go", "stop", "pass")
- Directions: spatial relations ("left", "right", "straight", "behind")
import spacy
nlp = spacy.load("en_core_web_sm")
def parse_instruction(instruction: str) -> dict:
    """
    Parse a navigation instruction into structured components.
    Returns: {
        'entities': [{'text': 'blue chair', 'label': 'LANDMARK'}, ...],
        'actions': [{'verb': 'turn', 'direction': 'left'}, ...],
        'route': ['turn', 'turn at', 'go past', ...]
    }
    """
    doc = nlp(instruction)
    entities = []
    actions = []
    direction_words = {"left", "right", "straight", "forward", "back", "around",
                       "ahead", "behind", "past", "through", "toward", "towards", "into"}
    # Landmark extraction: noun chunks capture referring expressions such as
    # "the blue chair"; off-the-shelf NER rarely labels household objects
    for chunk in doc.noun_chunks:
        entities.append({'text': chunk.text, 'label': 'LANDMARK'})
    # Verb + direction extraction via dependency parsing
    for token in doc:
        if token.pos_ == "VERB":
            direction = None
            for child in token.children:
                # "turn left" -> advmod, "turn around" -> prt, "go past ..." -> prep
                if child.dep_ in ("advmod", "prt", "prep") and child.text.lower() in direction_words:
                    direction = child.text.lower()
            actions.append({'verb': token.lemma_, 'direction': direction})
    # Build route sequence
    route = []
    for token in doc:
        if token.dep_ == "ROOT" and token.pos_ == "VERB":
            route.append(token.lemma_)
        if token.dep_ in ("prep", "pcomp") and token.head.pos_ == "VERB":
            route.append(token.head.lemma_ + " " + token.text)
    return {'entities': entities, 'actions': actions, 'route': route}
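A quick sanity check of the parser (exact output depends on the spaCy model version; the comments show typical results):

```python
result = parse_instruction("Turn left at the blue chair and go past the kitchen")
print(result['entities'])  # e.g. [{'text': 'the blue chair', 'label': 'LANDMARK'}, {'text': 'the kitchen', ...}]
print(result['actions'])   # e.g. [{'verb': 'turn', 'direction': 'left'}, {'verb': 'go', 'direction': 'past'}]
print(result['route'])     # e.g. ['turn', 'turn at', 'go past']
```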
3.2.2 Visual Landmark Detector¶
We use a pretrained ResNet-50 to extract a dense feature map (kept as a hook for learned landmark matching, e.g. cosine similarity against text embeddings) and, as a simple baseline, HSV color thresholding to localize color-named landmarks from the instruction.
import torch
import torchvision.models as models
import torchvision.transforms as T
import cv2
import numpy as np
# Load pretrained ResNet-50 as a feature extractor
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + FC, keep feature maps
resnet.eval()
transform = T.Compose([
    T.ToTensor(),
    T.Resize((224, 224), antialias=True),  # match the ResNet-50 pretraining resolution
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# CLIP-style landmark vocabulary (simplified)
LANDMARK_VOCAB = [
"chair", "table", "door", "window", "bed", "sofa",
"kitchen", "bathroom", "hallway", "stairs",
"blue", "red", "green", "white", "black"
]
def detect_landmarks(frame: np.ndarray, target_landmarks: list) -> list:
"""
Detect landmark locations in the image frame.
Parameters
----------
frame : BGR image (HxWx3)
target_landmarks : list of landmark names from instruction parser
Returns
-------
detections : list of (label, bbox, confidence) tuples
"""
# Resize and preprocess
rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
input_tensor = transform(rgb).unsqueeze(0)
# Extract feature map (1x2048x7x7 for ResNet-50)
with torch.no_grad():
features = resnet(input_tensor) # [1, 2048, H', W']
B, C, H, W = features.shape
feature_map = features.squeeze(0).reshape(C, H * W).T # (H*W) x 2048
# Simple color-based landmark detection (complement to deep features)
detections = []
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
color_map = {
'blue': ([100, 150, 0], [140, 255, 255]),
'red': ([0, 100, 100], [10, 255, 255]),
'green':([40, 50, 50], [80, 255, 255]),
}
    for landmark in target_landmarks:
        # Match any color word contained in the landmark phrase (e.g. "the blue chair")
        matched = [c for c in color_map if c in landmark.lower()]
        if matched:
            lower, upper = color_map[matched[0]]
            mask = cv2.inRange(hsv, np.array(lower), np.array(upper))
# Find largest contour
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if contours:
largest = max(contours, key=cv2.contourArea)
x, y, w, h = cv2.boundingRect(largest)
detections.append({
'label': landmark,
'bbox': (x, y, x+w, y+h),
'confidence': float(cv2.contourArea(largest) / (frame.shape[0] * frame.shape[1]))
})
return detections
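A minimal smoke test for the color path, using a synthetic frame (a blue rectangle on a gray background) rather than a real camera image:

```python
import numpy as np
import cv2

# 200x200 gray frame with a solid blue rectangle (BGR order: blue = (255, 0, 0))
test_frame = np.full((200, 200, 3), 128, dtype=np.uint8)
cv2.rectangle(test_frame, (60, 60), (140, 140), (255, 0, 0), thickness=-1)

for det in detect_landmarks(test_frame, ["the blue chair"]):
    print(det['label'], det['bbox'], round(det['confidence'], 3))
# Expected: one detection with a bbox of roughly (60, 60, 141, 141)
```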
3.2.3 A* Path Planner¶
Given a top-down map (built from depth sensor or SLAM), we use A* to plan a path that passes through the detected landmark locations.
import heapq
def astar_path(grid: np.ndarray, start: tuple, goal: tuple,
landmarks: list = None) -> list:
"""
A* path planning on a 2D occupancy grid.
Parameters
----------
grid : 2D numpy array (0 = free, 1 = obstacle)
start : (row, col) starting position
goal : (row, col) goal position
landmarks : list of (row, col) waypoints to visit in order
Returns
-------
path : list of (row, col) positions from start to goal
"""
if landmarks is None:
landmarks = []
    def heuristic(a, b):
        # Chebyshev distance: admissible for an 8-connected grid with unit step cost
        return max(abs(a[0] - b[0]), abs(a[1] - b[1]))
def neighbors(pos):
for dr, dc in [(-1,0),(1,0),(0,-1),(0,1),(-1,-1),(-1,1),(1,-1),(1,1)]:
nr, nc = pos[0] + dr, pos[1] + dc
if 0 <= nr < grid.shape[0] and 0 <= nc < grid.shape[1]:
if grid[nr, nc] == 0: # free cell
yield (nr, nc)
# Insert intermediate goals for landmarks
waypoints = [start] + landmarks + [goal]
full_path = []
for i in range(len(waypoints) - 1):
sub_path = _astar_single(grid, waypoints[i], waypoints[i+1], heuristic, neighbors)
if sub_path is None:
return None # No path found
full_path.extend(sub_path[:-1])
full_path.append(goal)
return full_path
def _astar_single(grid, start, goal, heuristic, neighbors):
"""Single-shot A* between two waypoints."""
open_set = [(heuristic(start, goal), start)]
came_from = {}
g_score = {start: 0}
while open_set:
_, current = heapq.heappop(open_set)
if current == goal:
# Reconstruct path
path = []
while current in came_from:
path.append(current)
current = came_from[current]
path.append(start)
return path[::-1]
for neighbor in neighbors(current):
tentative_g = g_score[current] + 1
if neighbor not in g_score or tentative_g < g_score[neighbor]:
came_from[neighbor] = current
g_score[neighbor] = tentative_g
f_score = tentative_g + heuristic(neighbor, goal)
heapq.heappush(open_set, (f_score, neighbor))
return None # No path found
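A small worked example on a 10×10 grid with one wall, assuming the functions above; the landmark waypoint forces the route to pass a detour point before the goal:

```python
import numpy as np

grid = np.zeros((10, 10), dtype=np.uint8)
grid[5, 1:9] = 1   # horizontal wall, passable only at columns 0 and 9

path = astar_path(grid, start=(0, 0), goal=(9, 9), landmarks=[(3, 8)])
print(path)        # visits (3, 8) first, then squeezes through a wall gap to reach (9, 9)
```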
3.3 Putting It All Together¶
def modular_vln_loop(instruction: str, rgb_frame: np.ndarray,
depth_frame: np.ndarray, grid_map: np.ndarray,
robot_pos: tuple) -> tuple:
"""
Complete modular VLN pipeline.
Returns: (next_action, detected_landmarks, planned_path)
"""
# Step 1: Parse instruction
parsed = parse_instruction(instruction)
target_landmarks = [e['text'] for e in parsed['entities']]
# Step 2: Detect landmarks in current view
detections = detect_landmarks(rgb_frame, target_landmarks)
# Step 3: Project detections to world coordinates (simplified)
# In practice: use depth + camera intrinsics + robot pose
landmark_waypoints = [d['bbox'][:2] for d in detections] # placeholder
# Step 4: A* path planning
# Goal is the last detected landmark or end of instruction
goal = landmark_waypoints[-1] if landmark_waypoints else (15, 15)
path = astar_path(grid_map, robot_pos, goal, landmark_waypoints[:-1])
# Step 5: Compute next action from path
if path and len(path) > 1:
next_pos = path[1]
direction = (next_pos[0] - robot_pos[0], next_pos[1] - robot_pos[1])
action = {'type': 'move_to', 'target': next_pos}
else:
action = {'type': 'stop'}
return action, detections, path
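An end-to-end smoke test in a toy world, assuming the functions defined above. The 32×32 frame's pixel coordinates are reused directly as grid cells, which only works because the toy map is small; a real system would project detections through depth and camera intrinsics:

```python
import numpy as np
import cv2

grid_map = np.zeros((20, 20), dtype=np.uint8)            # empty 20x20 occupancy grid
frame = np.full((32, 32, 3), 100, dtype=np.uint8)        # toy camera frame
cv2.rectangle(frame, (5, 5), (12, 12), (255, 0, 0), -1)  # blue "box" in view (BGR)
depth = np.ones((32, 32), dtype=np.float32)              # placeholder depth image

action, dets, path = modular_vln_loop(
    "Go to the blue box and stop", frame, depth, grid_map, robot_pos=(1, 1)
)
print(action)     # e.g. {'type': 'move_to', 'target': (2, 2)}
print(dets)       # the blue landmark detection
print(path)       # planned grid path from (1, 1) to the detected landmark
```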
3.4 Limitations¶
| Aspect | Issue |
|---|---|
| NLP | Struggles with complex referring expressions ("the second door on your left") |
| Vision | Color-based detection is brittle; needs robust landmark recognition |
| Planning | A* on a 2D grid ignores 3D geometry and doorways |
| Generalization | Each component must be retrained independently for new environments |
4. Tier 2 — Intermediate: Seq2Seq with Attention¶
4.1 Concept¶
Seq2Seq VLN uses an encoder-decoder architecture where:
- Encoder: processes both the instruction (as a sequence of word embeddings) and the visual observation (as a sequence of spatial features).
- Decoder: generates a sequence of navigation actions, attending to relevant parts of the instruction and visual features at each step.
The key innovation is cross-modal attention, which allows the model to align language tokens with visual regions.
4.2 Model Architecture¶
The Speaker-Follower model (Fried et al., 2018) and the CMU How-to-nav model (Wang et al., 2018) both use attention-based seq2seq:
┌─────────────────────────────────────────────────────────────────────┐
│ Seq2Seq VLN Model (Speaker-Follower) │
│ │
│ Instruction: "Turn left at the blue chair" │
│ │
│ ┌─────────┐ │
│ │ Encoder │ │
│ │ │ │
│ │ ┌─────┐ │ ┌──────────────────────────────────────────────┐ │
│ │ │ w₁ │─┼───▶│ │ │
│ │ ├─────┤ │ │ Cross-Modal Attention │ │
│ │ │ w₂ │─┼───▶│ │ │
│ │ ├─────┤ │ │ α_i = softmax(vᵀ·tanh(W₁h_i + W₂v_j)) │ │
│ │ │ w₃ │─┼───▶│ │ │
│ │ └─────┘ │ └──────────────────────────────────────────────┘ │
│ └─────────┘ │ │
│ │ │ │
│ │ h_i (language hidden) │ c_t (context vector) │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ ┌────▼────┐ │ Decoder │ │
│ │ LSTM │◀────────────────│ │ │
│ │ │ │ a_t = argmax│ │
│ │ h_t │────────────────▶│ P(a|context)│ │
│ └─────────┘ └──────────────┘ │
│ │
│ Action space: {Forward, Left, Right, <Stop>} │
└─────────────────────────────────────────────────────────────────────┘
4.3 Cross-Modal Attention¶
The attention mechanism computes a weighted sum of visual features based on the current decoder state:

\[
c_t = \sum_{j} \alpha_{t,j} \, v_j
\]

where the attention weights are:

\[
\alpha_{t,j} = \operatorname{softmax}_j\!\left( v^{\top} \tanh\!\left( W_1 h_{t-1} + W_2 v_j \right) \right)
\]

- \(h_{t-1}\) — previous decoder hidden state
- \(v_j\) — visual feature at region \(j\)
- \(c_t\) — attended context vector passed to the decoder
- \(W_1, W_2, v\) — learnable attention parameters
4.4 Complete PyTorch Implementation¶
"""
Tier 2: Seq2Seq VLN with Cross-Modal Attention
==============================================
Implements a simplified Speaker-Follower style model.
Trainable on R2R (Room-to-Room) dataset.
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
# ─── Model ──────────────────────────────────────────────────────────────────
class Seq2SeqVLN(nn.Module):
"""
Seq2Seq VLN with cross-modal attention.
Args:
vocab_size: size of instruction vocabulary
embed_dim: word embedding dimension
hidden_dim: LSTM hidden dimension
visual_dim: dimension of visual features (e.g., ResNet-2048)
num_actions: number of navigation actions
dropout: dropout probability
"""
def __init__(self, vocab_size: int, embed_dim: int = 256,
hidden_dim: int = 512, visual_dim: int = 2048,
num_actions: int = 4, dropout: float = 0.3):
super().__init__()
self.hidden_dim = hidden_dim
self.num_actions = num_actions
# Language embedding
self.word_embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
self.lang_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
self.lang_proj = nn.Linear(hidden_dim * 2, hidden_dim)
# Visual projection
self.visual_proj = nn.Linear(visual_dim, hidden_dim)
# Cross-modal attention
self.attn_W1 = nn.Linear(hidden_dim, hidden_dim)
self.attn_W2 = nn.Linear(hidden_dim, hidden_dim)
self.attn_v = nn.Linear(hidden_dim, 1)
        # Action embedding (previous action fed back into the decoder)
        self.action_embed = nn.Embedding(num_actions, embed_dim)
        # Decoder LSTM: input = [attended visual context (D) ; previous-action embedding (D_e)]
        self.decoder_lstm = nn.LSTM(hidden_dim + embed_dim, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, num_actions)
        self.dropout = nn.Dropout(dropout)
def forward(self, instruction_tokens: torch.Tensor,
visual_features: torch.Tensor,
action_history: torch.Tensor) -> torch.Tensor:
"""
Forward pass.
Parameters
----------
instruction_tokens : (B, L) — token IDs
visual_features : (B, N, D_v) — N spatial regions, D_v dims
action_history : (B, T) — previous actions (for teacher forcing)
Returns
-------
logits : (B, num_actions) — action probability logits
"""
B = instruction_tokens.size(0)
# ── Language encoder ──────────────────────────────────────────────
lang_embed = self.word_embed(instruction_tokens) # (B, L, D_e)
lang_h, (h_lang, _) = self.lang_lstm(lang_embed) # (B, L, 2D), (2, B, D)
# Combine bidirectional states
h_lang = torch.cat([h_lang[0], h_lang[1]], dim=-1) # (B, 2D)
h_lang = self.dropout(torch.tanh(self.lang_proj(h_lang))) # (B, D)
# ── Visual projection ──────────────────────────────────────────────
V = self.visual_proj(visual_features) # (B, N, D)
# ── Decoder LSTM ──────────────────────────────────────────────────
# Initialize LSTM with language context
decoder_state = (h_lang.unsqueeze(0), torch.zeros_like(h_lang.unsqueeze(0)))
        # Previous-action embedding (teacher forcing feeds back the last executed action)
        if action_history.size(1) > 0:
            prev_action = self.action_embed(action_history[:, -1])            # (B, D_e)
        else:
            prev_action = torch.zeros(B, self.action_embed.embedding_dim,
                                      device=instruction_tokens.device)
        # Context from attention
        context = self._cross_attention(h_lang, V)                            # (B, D)
        lstm_input = torch.cat([context, prev_action], dim=-1)                # (B, D + D_e)
        lstm_out, decoder_state = self.decoder_lstm(lstm_input.unsqueeze(1), decoder_state)
        lstm_out = lstm_out.squeeze(1)                                        # (B, D)
# ── Action prediction ──────────────────────────────────────────────
logits = self.action_head(self.dropout(lstm_out)) # (B, num_actions)
return logits
def _cross_attention(self, query: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
"""
Cross-modal attention: attend to visual features given a language query.
Parameters
----------
query : (B, D) — language hidden state
visual : (B, N, D) — visual feature grid
Returns
-------
context : (B, D) — attended visual features
"""
# Expand query for batch processing
q = query.unsqueeze(1).expand_as(visual) # (B, N, D)
# Attention energy
energy = torch.tanh(self.attn_W1(q) + self.attn_W2(visual)) # (B, N, D)
energy = self.attn_v(energy).squeeze(-1) # (B, N)
alpha = F.softmax(energy, dim=-1) # (B, N), normalized over regions
context = torch.bmm(alpha.unsqueeze(1), visual).squeeze(1) # (B, D)
return context
def train_seq2seq_vln(model: Seq2SeqVLN, train_loader, optimizer, num_epochs: int = 20):
"""Train the Seq2Seq VLN model."""
criterion = nn.CrossEntropyLoss(ignore_index=-1)
for epoch in range(num_epochs):
model.train()
total_loss = 0.0
for batch in train_loader:
instruction, visual, actions_gt, lengths = batch
optimizer.zero_grad()
# Forward pass
logits = model(instruction, visual, actions_gt[:, :-1])
# Teacher forcing: predict next action
loss = criterion(logits, actions_gt[:, -1])
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
total_loss += loss.item()
avg_loss = total_loss / len(train_loader)
print(f"[Epoch {epoch+1}/{num_epochs}] Loss: {avg_loss:.4f}")
# ─── R2R Dataset (simplified) ────────────────────────────────────────────────
class R2RDataset(torch.utils.data.Dataset):
"""
Simplified Room-to-Room dataset.
Each sample contains:
- instruction: natural language description
- path: sequence of (x, y, heading) viewpoints
- action_seq: corresponding action sequence
"""
def __init__(self, split='train'):
self.split = split
# In practice, load from Matterport3D R2R dataset
# Here we show synthetic data for illustration
self.data = [
{
'instruction': "Turn left and go past the chair",
'path': [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, np.pi/2)],
                'actions': [0, 0, 1, 3],  # Forward, Forward, Left, Stop
},
]
def __len__(self):
return len(self.data)
def __getitem__(self, idx):
sample = self.data[idx]
return {
'instruction': sample['instruction'],
'path': np.array(sample['path']),
'actions': np.array(sample['actions'], dtype=np.int64),
}
# ─── Evaluation Metrics ───────────────────────────────────────────────────────
def evaluate_vln(model, dataset, device='cuda'):
"""
Evaluate VLN model using standard metrics.
Metrics:
- Success Rate (SR): fraction of episodes where agent stops within 3m of goal
- SPL: Success weighted by Path Length = SR * (optimal_length / agent_length)
- nDTW: normalized Dynamic Time Warping (sequence similarity)
"""
model.eval()
sr_total, spl_total, n_samples = 0, 0, 0
with torch.no_grad():
for sample in dataset:
instruction = sample['instruction']
gt_path = sample['path']
gt_actions = sample['actions']
            # Simulate inference (would use Habitat-Sim in practice); the sample is
            # assumed to carry pre-encoded instruction and visual feature tensors
            logits = model(
                sample['inst_encoded'].unsqueeze(0).to(device),
                sample['visual'].unsqueeze(0).to(device),
                torch.zeros(1, 0, dtype=torch.long).to(device)
            )
            pred_action = torch.argmax(logits, dim=-1).item()
            # Compute SR (simplified: does the predicted action match the final GT action?)
            sr = float(pred_action == int(gt_actions[-1]))
sr_total += sr
n_samples += 1
return {
'Success Rate': sr_total / n_samples,
'SPL': spl_total / n_samples if n_samples > 0 else 0,
}
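A quick shape check of the forward pass with random tensors (no dataset or simulator needed):

```python
import torch

model = Seq2SeqVLN(vocab_size=1000)
tokens  = torch.randint(1, 1000, (2, 12))        # (B=2, L=12) instruction token IDs
visual  = torch.randn(2, 36, 2048)               # 36 visual regions per panorama
history = torch.zeros(2, 0, dtype=torch.long)    # no previous actions yet
logits  = model(tokens, visual, history)
print(logits.shape)                              # torch.Size([2, 4]) — one logit per action
```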
4.5 R2R Dataset & Evaluation¶
The Room-to-Room (R2R) dataset is the standard benchmark for VLN:
| Metric | Definition | Ideal |
|---|---|---|
| Success Rate (SR) | % of episodes where agent stops within 3m of goal | 1.0 |
| SPL | SR × (optimal length / path length) | 1.0 |
| nDTW | Normalized Dynamic Time Warping similarity | 1.0 |
| CLS | Coverage weighted by Length Score | 1.0 |
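SR and SPL can be computed directly from per-episode statistics. Below is a minimal helper using the standard SPL definition, SPL = mean(S_i · l_i / max(p_i, l_i)); the field names are illustrative:

```python
import numpy as np

def compute_sr_spl(episodes, success_radius: float = 3.0) -> dict:
    """episodes: list of dicts with 'dist_to_goal', 'agent_length', 'optimal_length' (meters)."""
    sr_terms, spl_terms = [], []
    for ep in episodes:
        success = float(ep['dist_to_goal'] <= success_radius)
        sr_terms.append(success)
        spl_terms.append(success * ep['optimal_length'] /
                         max(ep['agent_length'], ep['optimal_length']))
    return {'SR': float(np.mean(sr_terms)), 'SPL': float(np.mean(spl_terms))}

# Two episodes: one success with a detour, one failure
print(compute_sr_spl([
    {'dist_to_goal': 1.2, 'agent_length': 12.0, 'optimal_length': 9.0},
    {'dist_to_goal': 6.5, 'agent_length': 10.0, 'optimal_length': 8.0},
]))  # {'SR': 0.5, 'SPL': 0.375}
```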
4.6 Comparison — Modular vs Seq2Seq¶
| Aspect | Modular Pipeline | Seq2Seq + Attention |
|---|---|---|
| Interpretability | High (white-box stages) | Low (neural black box) |
| Training | Supervised per component | End-to-end differentiable |
| Generalization | Poor (rule-based) | Better (learned representations) |
| Data efficiency | High (no training needed) | Low (needs large VLN datasets) |
| Compositionality | Limited (hand-coded rules) | Strong (learned generalization) |
| Training complexity | ⭐ | ⭐⭐⭐ |
5. Tier 3 — Modern: Foundation Models for VLN¶
5.1 Concept¶
Foundation models (large pretrained vision-language models) enable zero-shot VLN — navigating to novel environments without task-specific training. CLIP provides aligned visual-text features; LLaVA / GPT-4V enable instruction following.
Key advantage: no VLN-specific training data required.
5.2 CLIP Visual Features for Navigation¶
CLIP encodes images and text into a shared embedding space. We use CLIP's visual encoder to represent each candidate navigation viewpoint, then score them based on instruction similarity.
"""
Tier 3: Zero-Shot VLN using CLIP
================================
CLIP-based viewpoint scoring for instruction-guided navigation.
No VLN-specific training required.
"""
import torch
import clip  # OpenAI CLIP package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image
import numpy as np
import cv2
# Load CLIP model (ViT-B/32)
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
def encode_text_instruction(instruction: str) -> torch.Tensor:
"""Encode a natural language instruction into CLIP text features."""
text = clip.tokenize([instruction]).to(device)
with torch.no_grad():
text_features = model.encode_text(text)
text_features /= text_features.norm(dim=-1, keepdim=True)
return text_features
def encode_viewpoint(frame: np.ndarray) -> torch.Tensor:
"""Encode an RGB frame into CLIP visual features."""
image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
image_input = preprocess(image).unsqueeze(0).to(device)
with torch.no_grad():
image_features = model.encode_image(image_input)
image_features /= image_features.norm(dim=-1, keepdim=True)
return image_features
def score_viewpoints(current_frame: np.ndarray,
candidate_frames: list,
instruction: str) -> list:
"""
Score candidate navigation viewpoints based on instruction relevance.
Parameters
----------
current_frame : current RGB observation
candidate_frames : list of RGB frames from candidate viewpoints
instruction : natural language navigation instruction
Returns
-------
scores : list of (viewpoint_id, similarity_score) sorted descending
"""
# Encode instruction once
text_features = encode_text_instruction(instruction)
results = []
for idx, frame in enumerate(candidate_frames):
vis_features = encode_viewpoint(frame)
# Cosine similarity in CLIP embedding space
similarity = (vis_features @ text_features.T).item()
results.append((idx, similarity))
# Sort by descending similarity
results.sort(key=lambda x: x[1], reverse=True)
return results
def zero_shot_vln_step(instruction: str, current_frame: np.ndarray,
candidate_poses: list, robot_state: dict) -> dict:
"""
Single step of zero-shot VLN using CLIP.
Parameters
----------
instruction : full navigation instruction
current_frame : current RGB observation
candidate_poses : list of candidate robot poses to evaluate
robot_state : {'x', 'y', 'heading', 'map'}
Returns
-------
next_action : {'type': 'move', 'target_pose': pose}
"""
# In practice: render candidate viewpoints from Habitat-sim
# Here: simulate with current frame (would be different viewpoints)
candidate_frames = [current_frame] * len(candidate_poses)
# Score each candidate
scores = score_viewpoints(current_frame, candidate_frames, instruction)
# Pick highest-scoring viewpoint
best_idx, best_score = scores[0]
if best_score > 0.25: # Threshold for accepting a viewpoint
action = {
'type': 'move',
'target_pose': candidate_poses[best_idx],
'confidence': best_score
}
else:
# Fallback: random exploration or stop
action = {'type': 'explore', 'confidence': best_score}
return action
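A quick check of the scoring function with random frames (the similarities are meaningless on noise; this only verifies shapes, ordering, and that CLIP loads):

```python
import numpy as np

dummy_frames = [np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8) for _ in range(3)]
ranked = score_viewpoints(dummy_frames[0], dummy_frames,
                          "go down the hallway toward the kitchen")
print(ranked)   # [(viewpoint_id, similarity), ...] sorted by descending similarity
```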
5.3 LLaVA for Instruction Following¶
LLaVA (Large Language and Vision Assistant) can reason about navigation instructions and visual context:
"""
LLaVA-based navigation instruction generation.
Given a panoramic view, LLaVA generates a sub-instruction for the next step.
"""
from transformers import AutoProcessor, LlavaForConditionalGeneration
import torch
from PIL import Image
# Load LLaVA-1.5-7B
model_name = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_name)
model = LlavaForConditionalGeneration.from_pretrained(
model_name, torch_dtype=torch.float16, device_map="auto"
)
def generate_sub_instruction(panorama_image: Image.Image,
full_instruction: str,
history: list) -> str:
"""
Use LLaVA to generate the next sub-instruction.
Parameters
----------
panorama_image : 360-degree panoramic RGB image
full_instruction : original navigation instruction
history : list of already-executed sub-instructions
Returns
-------
next_step : natural language sub-instruction
"""
# Remaining instruction
remaining = full_instruction
for past in history:
remaining = remaining.replace(past, "").strip()
    prompt = (
        "USER: <image>\n"
        "You are a navigation assistant. Given a panoramic view of an environment "
        "and a full navigation instruction, identify the most important landmark "
        "or direction for the NEXT step. Output ONLY a short instruction (max 10 words).\n"
        f"Full instruction: '{full_instruction}'\n"
        f"Remaining instruction (approximate): '{remaining}'\n"
        f"Previous steps completed: {', '.join(history) if history else 'None'}\n"
        "What is the next step?\n"
        "ASSISTANT:"
    )
    inputs = processor(text=prompt, images=panorama_image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=20,
            do_sample=False,   # greedy decoding (deterministic)
        )
    # Decode only the newly generated tokens, not the echoed prompt
    generated = output_ids[0][inputs["input_ids"].shape[1]:]
    next_step = processor.decode(generated, skip_special_tokens=True)
    return next_step.strip()
# ─── Habitat-Sim Integration ──────────────────────────────────────────────────
def habitat_vln_episode(instruction: str, start_pose: dict,
goal_id: str, num_steps: int = 100):
"""
Run a VLN episode in Habitat-Sim.
Parameters
----------
instruction : navigation instruction
start_pose : {'x', 'y', 'z', 'yaw'}
goal_id : target object instance ID
num_steps : max steps before timeout
Returns
-------
result : {'success': bool, 'steps': int, 'path_length': float}
"""
    import habitat_sim
    from habitat_sim.utils.common import quat_from_angle_axis

    # Simulator configuration: scene + one RGB camera at head height
    sim_cfg = habitat_sim.SimulatorConfiguration()
    sim_cfg.scene_id = "path/to/scene.glb"   # set to your Matterport3D / HM3D scene file
    rgb_spec = habitat_sim.CameraSensorSpec()
    rgb_spec.uuid = "color_sensor"
    rgb_spec.sensor_type = habitat_sim.SensorType.COLOR
    rgb_spec.resolution = [480, 640]
    rgb_spec.position = [0.0, 1.5, 0.0]      # sensor height
    agent_cfg = habitat_sim.agent.AgentConfiguration()
    agent_cfg.sensor_specifications = [rgb_spec]
    sim = habitat_sim.Simulator(habitat_sim.Configuration(sim_cfg, [agent_cfg]))

    # Set agent state
    agent_state = habitat_sim.AgentState()
    agent_state.position = [start_pose['x'], start_pose['y'], start_pose['z']]
    agent_state.rotation = quat_from_angle_axis(start_pose['yaw'], habitat_sim.geo.GRAVITY)
    sim.agents[0].set_state(agent_state)

    done = False
    steps = 0
    for _ in range(num_steps):
        # Get RGB observation (drop the alpha channel; Habitat returns RGB, not OpenCV BGR)
        obs = sim.get_sensor_observations()
        rgb = obs['color_sensor'][:, :, :3]
        # CLIP-based viewpoint scoring
        action = zero_shot_vln_step(instruction, rgb, candidate_poses=[...], robot_state={})
        # Execute action
        if action['type'] == 'move':
            # Move to the selected pose (user-provided helper, e.g. a greedy path follower)
            move_and_look(sim, action['target_pose'])
        else:
            # Fallback exploration using the default discrete action space
            sim.step("move_forward")   # 0.25 m per step by default
            sim.step("turn_left")      # 10 degrees per step by default
        steps += 1
        # Check goal (user-provided helper, e.g. geodesic distance to the goal object)
        if check_goal_reached(sim, goal_id):
            done = True
            break
    sim.close()
    return {'success': done, 'steps': steps, 'path_length': steps * 0.25}
5.4 Comparison Table — All Three Tiers¶
| Aspect | Tier 1: Modular | Tier 2: Seq2Seq | Tier 3: Foundation Models |
|---|---|---|---|
| Training required | No (rule-based) | Yes (large VLN dataset) | No (zero-shot) |
| Generalization | Poor | Moderate | High |
| Data dependency | None | R2R / RxR (100k samples) | CLIP / LLaVA pretrained |
| Compute cost | Low | Medium | High (large models) |
| Interpretability | High | Medium | Low |
| Success rate (R2R) | ~20–30% | ~50–70% | ~40–60% (zero-shot) |
| Setup complexity | ⭐ | ⭐⭐⭐ | ⭐⭐ |
| Best for | Debugging, simple scenes | Research, benchmark SOTA | Novel environments, rapid deployment |
6. Step-by-Step Implementation Guide¶
Phase 1 — Tier 1: Modular Pipeline (Week 1)¶
- Set up NLP parsing
  - Implement parse_instruction() with spaCy NER and dependency parsing
  - Test on sample instructions like "Turn left at the blue chair"
- Implement landmark detection
  - Use ResNet-50 pretrained features
  - Add color-based detection for common landmarks
  - Visualize detected bounding boxes on camera frames
- Implement A* path planner
  - Create a grid map from depth sensor or pre-built map
  - Integrate landmarks as waypoints
  - Test navigation in a simple environment (Gazebo or Habitat)
- Integrate and test
  - Run end-to-end in simulation
  - Measure success rate in a known environment
Phase 2 — Tier 2: Seq2Seq (Week 2)¶
- Download R2R dataset
  - Preprocess instructions and action sequences
  - Pre-extract ResNet visual features for each viewpoint
- Implement model
  - Build Seq2SeqVLN class with cross-modal attention
  - Train on R2R train split
  - Monitor loss and validation SR
- Evaluate on R2R
  - Test on val_seen and val_unseen splits
  - Compare SR and SPL against published baselines
Phase 3 — Tier 3: Foundation Models (Week 3)¶
- Set up CLIP
  - Implement viewpoint scoring
  - Integrate with Habitat-Sim for candidate viewpoint rendering
- Set up LLaVA (optional)
  - Requires GPU with ≥16GB VRAM
  - Fine-tune or use zero-shot prompting
- Evaluate zero-shot performance
  - Compare against Tier 2 trained models
  - Measure generalization to unseen environments
7. Extensions and Variations¶
7.1 AuxRN — Auxiliary Reasoning¶
Add auxiliary reasoning tasks during training:

- CVAE (Contrastive VLN): contrastive learning between positive and negative instruction-path pairs
- Perception loss: reconstruction loss on visual features
- Self-monitoring: track whether the agent is making progress toward the goal (sketched below)
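A sketch of the self-monitoring idea bolted onto the Tier 2 model: a small head predicts normalized progress toward the goal and its loss is added to the action loss. The head, loss weight, and supervision signal here are illustrative, not the exact AuxRN recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressMonitor(nn.Module):
    """Predicts normalized progress toward the goal (0 = start, 1 = goal)."""
    def __init__(self, hidden_dim: int = 512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(hidden_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, decoder_hidden: torch.Tensor) -> torch.Tensor:   # (B, D) -> (B,)
        return torch.sigmoid(self.head(decoder_hidden)).squeeze(-1)

def combined_loss(action_logits, gt_actions, decoder_hidden, gt_progress,
                  progress_head: ProgressMonitor, aux_weight: float = 0.5):
    """Main action loss plus a weighted auxiliary progress-estimation loss."""
    action_loss = F.cross_entropy(action_logits, gt_actions)
    progress_loss = F.mse_loss(progress_head(decoder_hidden), gt_progress)
    return action_loss + aux_weight * progress_loss
```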
7.2 Vision-Language Pre-training (VLN-BERT)¶
Pretrain a joint vision-language model on large-scale image-caption datasets before fine-tuning on VLN:
Pretraining: COCO captions, Visual Genome → Learn aligned vision-language representations
Fine-tuning: R2R dataset → Learn navigation-specific grounding
7.3 Multi-Modal Fusion¶
Instead of late fusion (CLIP-style), use early fusion:

- Concatenate visual and language features at every layer
- Cross-attention layers (as in Flamingo, GPT-4V) for deeper grounding (see the sketch below)
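An illustrative early-fusion block built on nn.MultiheadAttention, where language tokens attend to visual regions inside the encoder rather than only at the output (a sketch, not the Flamingo architecture):

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """One early-fusion block: language tokens attend to visual region features."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, lang_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # lang_tokens: (B, L, dim), visual_tokens: (B, N, dim)
        attended, _ = self.cross_attn(query=lang_tokens, key=visual_tokens, value=visual_tokens)
        x = self.norm1(lang_tokens + attended)      # residual + norm
        return self.norm2(x + self.ffn(x))          # stack several blocks for deeper grounding

# Example: fuse 12 language tokens with 36 visual regions
fused = CrossAttentionFusion()(torch.randn(2, 12, 512), torch.randn(2, 36, 512))
print(fused.shape)   # torch.Size([2, 12, 512])
```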
7.4 Active Learning¶
- Collect human feedback on failed episodes
- Retrain on corrected demonstrations (DAgger-style; see the sketch below)
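A high-level sketch of that DAgger-style loop; policy, envs, and expert_label are placeholders for your learner, simulator episodes, and a human or oracle planner:

```python
def dagger_round(policy, envs, expert_label, max_steps: int = 80):
    """One DAgger iteration: roll out the learner, relabel visited states with the expert."""
    aggregated = []
    for env in envs:
        obs = env.reset()
        for _ in range(max_steps):
            action = policy.act(obs)                       # act with the *learner*, not the expert
            aggregated.append((obs, expert_label(obs)))    # expert says what should have been done
            obs, done = env.step(action)
            if done:
                break
    return aggregated

# Outer loop (illustrative): dataset += dagger_round(policy, envs, human_or_oracle)
#                            policy = retrain(policy, dataset)
```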
8. References¶
- Anderson et al., 2018 — Vision-and-Language Navigation: Interpreting Grounded Language Instructions in Photo-Realistic Environments — R2R dataset and VLN task definition
- Fried et al., 2018 — Speaker-Follower Models for Vision-and-Language Navigation — Seq2Seq VLN with attention
- Wang et al., 2018 — Look Before You Leap: Bridging Model-Free and Model-Based RL for Vision-Based Navigation — CMU How-to-nav
- Huang et al., 2019 — Transferable Representation Learning in Vision-and-Language Navigation — AuxRN auxiliary tasks
- Li et al., 2022 — VLN-BERT: A Pretrained Language Model for Vision-and-Language Navigation — Pretraining for VLN
- Radford et al., 2021 — Learning Transferable Visual Models From Natural Language Supervision (CLIP) — CLIP foundation model
- Liu et al., 2024 — LLaVA: Large Language and Vision Assistant — LLaVA architecture
- Habitat-Sim GitHub — Embodied AI simulator
- AI2-THOR GitHub — Interactive household environment
- R2R Dataset — Room-to-Room navigation benchmark
- Ma et al., 2019 — Rethinking the Performance of Navigation with Language — VLN benchmarking analysis