
Multi-Agent Systems

While a single agent can handle many tasks, complex real-world problems benefit from multiple specialized agents working together. Multi-agent systems give each agent its own role, capabilities, and communication protocols, enabling forms of collaboration that no single agent could achieve alone.

Why Multi-Agent?

| Scenario | Single Agent | Multi-Agent |
|---|---|---|
| Research + write report | Overloaded, quality drops | Researcher + Writer |
| Code review | Limited perspective | Coder + Reviewer + Security analyst |
| Customer service | One-size-fits-all | Triage + Product + Escalation |
| Simulate a society | Impossible | Each person = one agent |

Key benefits: specialization, parallelism, division of labor, and realism.


1. OpenAI Swarm

Type: Lightweight multi-agent orchestration framework
Repo: openai/swarm
Released: October 2024
Focus: Handoffs between agents with minimal overhead

OpenAI Swarm is an experimental framework for multi-agent orchestration — managing handoffs and context transfer between agents, rather than relying on fixed pipelines.

Core Concepts

Agent — A unit with instructions (system prompt) and tools:

from swarm import Swarm, Agent

client = Swarm()

sales_agent = Agent(
    name="Sales Agent",
    instructions="You are a friendly sales assistant. Be concise and helpful.",
    functions=[lookup_product, check_inventory],
)

support_agent = Agent(
    name="Support Agent",
    instructions="You are a technical support specialist. Be thorough and accurate.",
    functions=[diagnose_issue, escalate_ticket],
)

Handoff — Transfer conversation to another agent with updated context:

def transfer_to_sales():
    """Handoff to sales agent"""
    return sales_agent

def transfer_to_support():
    """Handoff to support agent"""
    return support_agent

sales_agent.functions.append(transfer_to_support)
support_agent.functions.append(transfer_to_sales)

Key insight: Swarm uses two primitive operations — handoff and function calling. No central coordinator. The LLM decides when to hand off based on function return values.
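
The mechanics behind this can be sketched in a few lines of plain Python: a tool that returns an Agent swaps the active agent while the message history is kept. Everything below (the Agent dataclass, fake_llm, the run loop) is a simplified stand-in for what Swarm does internally, not its actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    instructions: str
    functions: list = field(default_factory=list)

def fake_llm(agent, messages):
    """Stand-in for the model: always calls the agent's first function."""
    return agent.functions[0] if agent.functions else None

def run(agent, messages, max_turns=5):
    for _ in range(max_turns):
        chosen = fake_llm(agent, messages)
        if chosen is None:          # no function call: agent answers directly
            break
        result = chosen()
        if isinstance(result, Agent):
            agent = result          # handoff: swap the active agent, keep history
        else:                       # ordinary tool: record its output
            messages.append({"role": "tool", "content": str(result)})
    return agent, messages

support_agent = Agent("Support Agent", "Help with technical issues.")

def transfer_to_support():
    return support_agent

triage_agent = Agent("Triage Agent", "Route the user.",
                     functions=[transfer_to_support])

final_agent, history = run(triage_agent,
                           [{"role": "user", "content": "My arm is broken"}])
print(final_agent.name)  # Support Agent
```

Because a handoff is just a return value, routing stays fully inspectable: you can log or override every transfer.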

Full Example: Customer Service Pipeline

from swarm import Swarm, Agent

client = Swarm()

# Triage agent routes to the right specialist
triage_agent = Agent(
    name="Triage Agent",
    instructions="""You are a customer service triage agent.
    Route users to the appropriate agent:
    - Purchasing, billing, product info → transfer_to_sales
    - Technical issues, bugs, errors → transfer_to_support
    """,
)

sales_agent = Agent(
    name="Sales Agent",
    instructions="You help with purchases, billing, and product information.",
    functions=[lookup_product, process_order],
)

support_agent = Agent(
    name="Support Agent",
    instructions="You help with technical issues. Be thorough.",
    functions=[diagnose_issue, create_ticket],
)

# Named handoff functions — the LLM routes by calling them by name
def transfer_to_sales():
    """Handoff to sales agent"""
    return sales_agent

def transfer_to_support():
    """Handoff to support agent"""
    return support_agent

triage_agent.functions = [transfer_to_sales, transfer_to_support]

# Run
response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "My robot arm is making weird noises"}]
)
print(response.messages[-1]["content"])

Design Philosophy

OpenAI deliberately avoids a hidden "intelligent orchestration" layer: burying flow decisions inside an opaque planner would make debugging a "black box." Instead, Swarm exposes the low-level primitives (handoff + function calling), so the LLM still chooses when to hand off, but it does so through explicit function calls that developers can see and intervene on at every step.

"The difference between Swarm and writing if-else yourself is that in three years, you'll want to know why you didn't write it like Swarm." — GitHub comment

Strengths and Limitations

| ✅ Strengths | ❌ Limitations |
|---|---|
| Extremely lightweight (~800 lines) | No built-in persistence |
| Agent-as-tool flexibility | No native multi-agent memory sharing |
| Easy to understand and prototype | Experimental (not production-ready) |
| Native OpenAI API integration | Limited error handling and recovery |

2. Microsoft AutoGen → Microsoft Agent Framework (MAF)

Type: Multi-agent conversational framework
Papers: AutoGen (2023) | MAF (2025)
Evolution: AutoGen v0.4 (2025) → merged with Semantic Kernel → Microsoft Agent Framework (MAF) (October 2025)

AutoGen pioneered the idea that agents are conversation participants. By 2025, it had evolved into Microsoft Agent Framework (MAF), combining AutoGen's multi-agent patterns with Semantic Kernel's enterprise-grade reliability.

Core Concept: ConversableAgent

from autogen import ConversableAgent

assistant = ConversableAgent(
    name="assistant",
    system_message="You are a helpful Python coding assistant.",
    llm_config={"model": "gpt-4o"},
)

user_proxy = ConversableAgent(
    name="user",
    human_input_mode="NEVER",  # NEVER / ALWAYS / TERMINATE
    max_consecutive_auto_reply=10,
    code_execution_config={"work_dir": "coding", "use_docker": False},
    llm_config=False,  # executor only: replies come from code results, not an LLM
)

user_proxy.initiate_chat(
    assistant,
    message="Write a Python function to compute matrix inverse.",
)

Group Chat: Multiple Agents Discussing

from autogen import GroupChat, GroupChatManager

group_chat = GroupChat(
    agents=[user_proxy, researcher, critic, writer],
    messages=[],
    max_round=12,
    speaker_selection_method="auto",  # or "round_robin"
)

manager = GroupChatManager(
    groupchat=group_chat,
    llm_config={"model": "gpt-4o"},
)

user_proxy.initiate_chat(
    manager,
    message="Write an 800-word article about SMR nuclear progress in 2026.",
)
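
The round_robin policy is simple enough to sketch: it cycles through the agent list in order, whereas "auto" asks the manager's LLM to pick the next speaker based on the conversation. The snippet below illustrates only the cycling policy; it is not AutoGen's code:

```python
from itertools import cycle

agents = ["user_proxy", "researcher", "critic", "writer"]

# round_robin: deterministic cycling through the agent list
order = cycle(agents)
speakers = [next(order) for _ in range(6)]
print(speakers)
# ['user_proxy', 'researcher', 'critic', 'writer', 'user_proxy', 'researcher']
```

"auto" trades this determinism for flexibility: the manager can skip agents that have nothing to contribute, at the cost of an extra LLM call per turn.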

MAF: Production-Grade Multi-Agent

# Microsoft Agent Framework (2025+)
from agent_framework import GroupChat, GroupChatManager, AssistantAgent

# Multi-agent group chat (AutoGen style, with persistence)
group = GroupChat(
    agents=[user_proxy, researcher, critic, writer],
    max_rounds=15,
    # New in MAF: persistent session id, checkpointing
)

manager = GroupChatManager(group=group)

await user_proxy.initiate_chat(
    manager,
    message="Research & write 600-word post on AI agent developments in 2026"
)

MAF adds deterministic DAG-based workflows alongside group chat:

# DAG workflow for order processing (deterministic, not emergent)
# Nodes: Agent | Function | Condition | Loop
# Edges: define deterministic execution paths

AutoGen vs. MAF vs. Swarm

| Feature | AutoGen | MAF | Swarm |
|---|---|---|---|
| Communication | Conversational (message passing) | Conversational + DAG | Handoff-based |
| Code execution | Native (code_execution_config) | Native | Via tools |
| Group chat | Built-in GroupChat | Built-in + persistence | Manual |
| Production readiness | v0.4 mature | RC stage (2025) | Experimental |
| Enterprise features | Limited | Built-in (checkpointing, OpenTelemetry) | None |
| Language | Python | Python + .NET | Python |

3. Stanford Generative Agents (Smallville)

Type: Simulation / research framework
Paper: Generative Agents (2023)
Demo: Stanford Smallville
Startup: Simile — raised $100M (Index Ventures, Andreessen, Lee, Karpathy)

This is a research prototype demonstrating how believable human behavior emerges from LLM-powered agents without explicit programming.

The Insight

Give each "person" in a virtual world:

  1. A name, occupation, and personality
  2. Memory streams (accumulated experiences)
  3. The ability to reflect and plan

→ Agents spontaneously form relationships, coordinate activities, and exhibit emergent social behavior.

Architecture: Memory-Stream, Reflection, Planning

┌─────────────────────────────────────────────────────────────┐
│              Memory-Stream → Reflection → Planning           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │            Memory Stream (chronological)             │   │
│  │  [observation] → [observation] → [reflection] → ...  │   │
│  │  "Isabella is at the coffee shop"                    │   │
│  │  "Tom invited Isabella to a party"                  │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                   │
│                          ▼                                   │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Reflection: synthesize observations into insights   │   │
│  │  "Isabella seems like someone who enjoys organizing  │   │
│  │   social events and bringing people together"         │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                   │
│                          ▼                                   │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Planning: daily plan based on current state         │   │
│  │  08:00 - Wake up and have breakfast                  │   │
│  │  09:00 - Open the coffee shop                       │   │
│  │  13:00 - Have lunch at Hobbs Cafe                   │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Key Mechanisms

1. Perception — Agents observe the world and other agents:

[Isabella Rodriguez] observed [Tom Moreno] is currently at [The Willow Market and Deli]

2. Memory Retrieval — Retrieve relevant memories given current situation:

def retrieve_relevant(memory_stream, current_situation, k=5):
    scored = []
    for memory in memory_stream:
        relevance = score_relevance(memory.content, current_situation)
        recency = score_recency(memory.timestamp)
        scored.append((alpha * relevance + beta * recency, memory))
    return top_k(scored, k=k)

3. Reflection — Periodically synthesize observations into high-level insights:

# Observations: "person is tired", "person is coughing", ...
# Reflection: "person might be getting sick"

4. Planning — Create a daily plan from current state and goals.
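
A runnable sketch ties these mechanisms together. The paper scores memories by recency (exponential decay), importance (an LLM-assigned 1–10 rating), and relevance (embedding similarity); here relevance is approximated by word overlap so the example is self-contained, and the decay rate and weighting are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Memory:
    content: str
    importance: float   # 1-10, rated by the LLM when the memory is stored
    age_hours: float    # hours since the memory was recorded

def score_relevance(content: str, query: str) -> float:
    """Stand-in relevance: word overlap (the paper uses embedding similarity)."""
    a, b = set(content.lower().split()), set(query.lower().split())
    return len(a & b) / max(len(b), 1)

def retrieve(memories, query, k=2, decay=0.995):
    def score(m: Memory) -> float:
        recency = decay ** (m.age_hours * 60)   # exponential decay per minute
        return recency + m.importance / 10 + score_relevance(m.content, query)
    return sorted(memories, key=score, reverse=True)[:k]

stream = [
    Memory("Isabella is planning a Valentine's Day party", 8, 2.0),
    Memory("Tom bought groceries at the market", 3, 1.0),
    Memory("Isabella invited Tom to the party", 7, 0.5),
]
top = retrieve(stream, "party at Hobbs Cafe", k=2)
print([m.content for m in top])
```

The recent, important, party-related memories win out over the mundane grocery trip, which is exactly the behavior the planner relies on.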

The Valentine's Day Party (Emergent Behavior)

One of the most famous demonstrations:

  1. Seed: Isabella gets the instruction "You want to throw a Valentine's Day party."
  2. Propagation: She spreads the word organically — other agents decide whether to attend based on their personalities.
  3. Coordination: Some agents offer to help decorate; others discuss what to wear.
  4. Result: 5 agents attend the party at the exact planned time.

No hard-coded rules. Everything emerged from the agents' memory, reflection, and planning.

Simile: From 25 to 1,000+ Agents

The original authors (Joon Sung Park et al.) founded Simile in 2026, scaling from 25 to 1,000+ agents simulating real human populations. Used for predicting customer behavior, brand messaging impact, and policy decisions. Wealthfront reported 15x expansion in user research scope using Simile.

Relevance to Robotics

  • Robots in human environments need to model other agents (humans, other robots)
  • Emergent behavior means the robot doesn't need explicit rules for every situation
  • Memory + reflection enables long-horizon task planning

4. CrewAI

Type: Role-based multi-agent framework
Repo: crewaiinc/crewAI
Installation: pip install crewai crewai-tools

CrewAI takes a role-oriented approach. Each agent has a defined role, goal, and backstory — mimicking a real team.

Core Concepts

Agent — A role with tools and a specific goal:

from crewai import Agent
from crewai_tools import SerperDevTool, FileReadTool

researcher = Agent(
    role="Senior Research Analyst",
    goal="Discover cutting-edge developments in robot manipulation",
    backstory=(
        "You are a PhD-level robotics researcher with 10 years of experience "
        "monitoring the latest papers, patents, and industry developments."
    ),
    tools=[SerperDevTool(), FileReadTool()],
    verbose=True,
)

writer = Agent(
    role="Tech Writer",
    goal="Write compelling technical content about robotics",
    backstory=(
        "You are a skilled technical writer who translates complex research "
        "into accessible, well-structured articles for engineers."
    ),
    tools=[FileReadTool()],
    verbose=True,
)

Task — A unit of work assigned to an agent:

from crewai import Task

research_task = Task(
    description="Research the latest advances in dexterous manipulation",
    agent=researcher,
    expected_output="A summary of 5 key papers with their contributions",
)

write_task = Task(
    description="Write a blog post about the research findings",
    agent=writer,
    expected_output="A 1000-word blog post with technical accuracy",
    context=[research_task],  # Writer sees researcher's output
)

Crew — A team executing tasks:

from crewai import Crew, Process

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    process=Process.sequential,  # or Process.hierarchical
)

result = crew.kickoff()
print(result)

Hierarchical Process (Manager Agent)

from langchain_openai import ChatOpenAI

crew = Crew(
    agents=[researcher, writer, reviewer],
    tasks=[task1, task2, task3],
    process=Process.hierarchical,
    manager_llm=ChatOpenAI(model="gpt-4o"),
)

Custom Tools

from crewai.tools import BaseTool

class MyCustomTool(BaseTool):
    name: str = "my_custom_tool"
    description: str = "Describe what the tool does; the LLM reads this."

    def _run(self, tool_input: str) -> str:
        # Implementation
        return f"Result: {tool_input}"

agent = Agent(
    role="Tool Demonstrator",
    goal="Show how a custom tool is attached",
    backstory="An agent used to demonstrate custom tools.",  # required fields
    tools=[MyCustomTool()],
)

CrewAI vs. AutoGen vs. Swarm

| Feature | CrewAI | AutoGen | Swarm |
|---|---|---|---|
| Approach | Role-based team | Conversational | Handoff-based |
| Best for | Structured pipelines | Complex negotiation | Lightweight routing |
| Code execution | Via tools | Native | Via tools |
| Memory | Built-in | Via custom | None |
| Complexity | Medium | High | Low |
| Production | ✅ Growing | ✅ Mature | ❌ Experimental |

5. LangGraph

Type: Graph-based agent workflow framework
Repo: langchain-ai/langgraph
Installation: pip install langgraph langchain

LangGraph models agent workflows as state graphs. Each node is a step (an agent, tool, or plain function), and edges, including conditional branches and cycles, define the flow.

Core Concepts

State — A shared TypedDict flowing through the graph:

from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]  # Append-only for messages
    search_results: list[str]
    retry_count: int
    final_answer: str
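
How updates are applied deserves a note: a node returns a partial dict, and LangGraph merges it into the state, combining keys annotated with a reducer (like add_messages, which appends) and overwriting the rest. A simplified illustration of that merge rule, not LangGraph's code:

```python
def merge_state(state, update, reducers):
    """Apply a node's partial update: reduced keys are combined, others replaced."""
    merged = dict(state)
    for key, value in update.items():
        if key in reducers:
            merged[key] = reducers[key](merged.get(key, []), value)
        else:
            merged[key] = value
    return merged

# Append-style reducer, analogous to add_messages
reducers = {"messages": lambda old, new: old + new}

state = {"messages": [{"role": "user", "content": "hi"}], "retry_count": 0}
update = {"messages": [{"role": "assistant", "content": "hello"}], "retry_count": 1}
state = merge_state(state, update, reducers)

print(len(state["messages"]), state["retry_count"])  # 2 1
```

This is why nodes below return only the keys they change: the runtime handles the merge.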

Nodes — Functions that transform state:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")

def planner(state: AgentState) -> dict:
    """Planning node"""
    response = llm.invoke([
        {"role": "system", "content": "Create a research plan."},
        *state["messages"]
    ])
    return {"messages": [response]}

def searcher(state: AgentState) -> dict:
    """Search node; also counts attempts so routing can stop retrying"""
    last_msg = state["messages"][-1].content
    results = web_search(last_msg)
    return {"search_results": results,
            "retry_count": state.get("retry_count", 0) + 1}

def writer(state: AgentState) -> dict:
    """Report generation node"""
    context = "\n".join(state["search_results"])
    response = llm.invoke([
        {"role": "system", "content": f"Write a report based on:\n{context}"},
        *state["messages"]
    ])
    return {"messages": [response], "final_answer": response.content}

Conditional Edges — Route based on state:

def should_continue(state: AgentState) -> str:
    if state.get("final_answer"):
        return "end"
    elif state.get("retry_count", 0) < 2:
        return "searcher"  # run another search round
    else:
        return "writer"

Build and Compile the Graph

from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

graph = StateGraph(AgentState)

graph.add_node("planner", planner)
graph.add_node("searcher", searcher)
graph.add_node("writer", writer)

graph.add_edge(START, "planner")
graph.add_edge("planner", "searcher")

graph.add_conditional_edges(
    "searcher",
    should_continue,
    {"writer": "writer", "searcher": "searcher", "end": END}
)

graph.add_edge("writer", END)

# Add checkpointing for persistence
memory = MemorySaver()
app = graph.compile(checkpointer=memory)

# Run
config = {"configurable": {"thread_id": "research-001"}}
result = app.invoke(
    {"messages": [{"role": "user", "content": "Analyze AI agent trends in 2026"}]},
    config=config,
)

Human-in-the-Loop: Interrupt and Approve

# Interrupt before the "writer" node for human review
app = graph.compile(
    checkpointer=memory,
    interrupt_before=["writer"],  # Pause before writing
)

# First run: executes up to "writer" and pauses
result = app.invoke(input, config=config)

# Human reviews search results
print("Search results:", result["search_results"])

# Human approves or modifies
app.update_state(config, {"search_results": result["search_results"] + ["extra"]})

# Resume execution
final = app.invoke(None, config=config)  # None = resume from interrupt

Multi-Agent: Supervisor Architecture

from langgraph.prebuilt import create_react_agent
from langgraph_supervisor import create_supervisor

research_agent = create_react_agent(
    llm, tools=[web_search, arxiv_search], name="researcher",
    prompt="You are a research expert."
)
coding_agent = create_react_agent(
    llm, tools=[python_repl, code_sandbox], name="coder",
    prompt="You are a coding expert."
)
writing_agent = create_react_agent(
    llm, tools=[], name="writer",
    prompt="You are a writing expert."
)

supervisor = create_supervisor(
    agents=[research_agent, coding_agent, writing_agent],
    model=llm,
    prompt="You are a project manager. Delegate tasks appropriately."
)

multi_agent = supervisor.compile(checkpointer=MemorySaver())

LangGraph for Robotics: Task Planning Graph

class RobotState(TypedDict):
    command: str
    plan: list[str]
    current_step: int
    observation: str | None
    approved: bool

def parse_command(state: RobotState) -> dict:
    response = llm.invoke(f"Decompose into steps, one per line: {state['command']}")
    return {"plan": response.content.splitlines(), "current_step": 0}

def execute_step(state: RobotState) -> dict:
    step = state["plan"][state["current_step"]]
    obs = robot.execute(step)
    return {"observation": obs, "current_step": state["current_step"] + 1}

def should_continue(state: RobotState) -> str:
    if not state.get("approved"):
        return "interrupt"
    return "end" if state["current_step"] >= len(state["plan"]) else "execute_step"
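
This control flow can be exercised without a graph runtime: a plain loop that applies the same partial-state updates in order. fake_llm_decompose and fake_robot_execute are stand-ins (assumptions) for the real llm.invoke and robot.execute calls:

```python
def fake_llm_decompose(command):
    """Stand-in for the LLM: decompose a command into three steps."""
    return [f"locate {command}", f"grasp {command}", f"place {command}"]

def fake_robot_execute(step):
    """Stand-in for the robot controller."""
    return f"done: {step}"

def run_plan(command, approved=True):
    state = {"command": command, "plan": fake_llm_decompose(command),
             "current_step": 0, "observation": None, "approved": approved}
    while True:
        if not state["approved"]:
            return "interrupt", state        # pause for human approval
        if state["current_step"] >= len(state["plan"]):
            return "end", state
        step = state["plan"][state["current_step"]]
        state["observation"] = fake_robot_execute(step)   # execute_step
        state["current_step"] += 1

status, state = run_plan("pick up the cup")
print(status, state["current_step"])        # end 3
```

The graph version adds what the loop lacks: checkpointing of every intermediate state and the ability to resume after the approval interrupt.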

LangGraph vs. CrewAI vs. AutoGen

| Feature | LangGraph | CrewAI | AutoGen |
|---|---|---|---|
| Model | DAG / state graph | Role-based | Conversational |
| Flexibility | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Learning curve | Medium | Low | High |
| Human-in-the-loop | Native (interrupt) | Limited | Supported |
| Persistence | PostgreSQL / Redis | Limited | Via custom |
| Visualization | Built-in (Mermaid, LangSmith) | Limited | — |
| Production maturity | High | Growing | Mature |

6. Architecture Patterns Summary

┌─────────────────────────────────────────────────────────────┐
│            Multi-Agent Architecture Patterns                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Pattern 1: Sequential Pipeline                             │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐               │
│  │Researcher│───→│Writer  │───→│Editor   │               │
│  └─────────┘    └─────────┘    └─────────┘               │
│  (CrewAI sequential, LangGraph linear)                       │
│                                                             │
│  Pattern 2: Handoff / Router                               │
│  ┌──────────┐     ┌─────────┐                             │
│  │  Triage  │────→│ Sales   │                             │
│  │  Agent   │────→│ Support │                             │
│  │  (root)  │────→│ Billing │                             │
│  └──────────┘     └─────────┘                             │
│  (OpenAI Swarm, Hermes delegation)                        │
│                                                             │
│  Pattern 3: Group Chat / Round Table                     │
│       ┌──────────────────────────────────────┐              │
│       │  ┌──────┐  ┌──────┐  ┌──────┐       │              │
│       │  │Agent1│  │Agent2│  │Agent3│       │              │
│       │  └──┬───┘  └──┬───┘  └──┬───┘       │              │
│       │     └─────────┼─────────┘            │              │
│       │               ▼                      │              │
│       │         Group Chat                    │              │
│       └──────────────────────────────────────┘              │
│  (AutoGen GroupChat, MAF)                                    │
│                                                             │
│  Pattern 4: Hierarchical / Manager                         │
│       ┌────────────┐                                        │
│       │  Manager   │                                        │
│       └──┬──────┬──┘                                        │
│       ┌───┴───┐  ┌───┴───┐                                  │
│       │Worker1 │  │Worker2 │                                 │
│       └───────┘  └───────┘                                  │
│  (CrewAI hierarchical, MAF supervisor)                       │
│                                                             │
│  Pattern 5: Simulation / Emergent                         │
│       ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐                      │
│       │ A │ │ B │ │ C │ │ D │ │ E │                      │
│       └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘                        │
│         └─────┼─────┼─────┼─────┘                         │
│               ▼     ▼     ▼                                 │
│         Shared World / Memory Streams                      │
│  (Stanford Generative Agents → Simile)                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
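
Pattern 1 is worth demystifying: stripped of framework machinery, a sequential pipeline is just function composition, with each agent's output feeding the next. The three stub agents below are purely illustrative:

```python
def researcher(topic: str) -> str:
    return f"notes on {topic}"

def writer(notes: str) -> str:
    return f"draft based on {notes}"

def editor(draft: str) -> str:
    return f"polished {draft}"

def pipeline(topic: str) -> str:
    # Sequential pipeline: Researcher -> Writer -> Editor
    return editor(writer(researcher(topic)))

print(pipeline("robot manipulation"))
# polished draft based on notes on robot manipulation
```

Frameworks earn their keep in the other patterns, where routing is dynamic, agents converse in loops, or behavior emerges from shared state.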

7. Benchmark Performance

Multi-agent systems consistently outperform single agents:

| Benchmark | Single Agent | Multi-Agent | Improvement |
|---|---|---|---|
| GAIA (general AI assistant) | 40–60% | 70–85% | +25–40% |
| SWE-bench Verified (software engineering) | baseline | +25–40% over baseline | significant |
| Real-world projects (Novo Nordisk) | — | — | ~25% reduction in iteration cycles |
