LLM Basics: APIs, Models, Local Deployment & Coding Assistants

This chapter covers how to work with Large Language Models — from calling cloud APIs to running models locally and using AI-powered coding assistants.


Section 1: How to Request LLMs (API-Based)

1.1 OpenAI API

Models: GPT-4o, GPT-4-turbo, o1, GPT-4o-mini
Endpoint: https://api.openai.com/v1/chat/completions
Auth: Bearer token via OPENAI_API_KEY environment variable
Pricing (per 1M tokens): GPT-4o: $2.50 input / $10 output; GPT-4o-mini: $0.15 / $0.60

pip install openai
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain reinforcement learning in 3 sentences."},
    ],
    temperature=0.7,
    max_tokens=200,
)

print(response.choices[0].message.content)
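
For interactive use you will usually want streaming, so tokens print as they are generated. A minimal sketch reusing the same client (the prompt is illustrative):

# Stream tokens as they arrive instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Name three uses of quaternions."}],
    stream=True,  # yields chunks as the model generates
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (e.g. the final one) carry no text
        print(delta, end="", flush=True)
print()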

The o1 reasoning models do not support system messages or the temperature parameter, and use max_completion_tokens instead of max_tokens:

response = client.chat.completions.create(
    model="o1",
    messages=[
        {"role": "user", "content": "Solve: What is the integral of x^2 * e^x?"},
    ],
)
print(response.choices[0].message.content)

1.2 Anthropic API

Models: Claude 3.5 Sonnet, Claude 3 Opus, Claude 3.5 Haiku
Endpoint: https://api.anthropic.com/v1/messages
Auth: ANTHROPIC_API_KEY header
Pricing (per 1M tokens): Claude 3.5 Sonnet: $3 input / $15 output

pip install anthropic
import os
import anthropic

client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a robotics expert.",
    messages=[
        {"role": "user", "content": "Explain PID control in simple terms."},
    ],
)

print(message.content[0].text)

1.3 Google Gemini API

Models: Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 2.0 Flash
Endpoint: https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent
Auth: GOOGLE_API_KEY or service account
Pricing (per 1M tokens): Gemini 1.5 Pro: $1.25 input / $5 output; Flash: $0.075 / $0.30

pip install google-genai
import os
from google import genai

client = genai.Client(api_key=os.environ.get("GOOGLE_API_KEY"))

response = client.models.generate_content(
    model="gemini-1.5-pro",
    contents="Describe the difference between SLAM and visual odometry.",
)

print(response.text)

1.4 DeepSeek API

Models: DeepSeek-V3, DeepSeek-R1
Endpoint: https://api.deepseek.com/v1/chat/completions (OpenAI-compatible)
Auth: DEEPSEEK_API_KEY
Pricing (per 1M tokens): DeepSeek-V3: $0.27 input / $1.10 output (very cost-effective)

pip install openai
import os
from openai import OpenAI

# DeepSeek uses an OpenAI-compatible API
client = OpenAI(
    api_key=os.environ.get("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-chat",  # DeepSeek-V3
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "用中文解释什么是机器人的运动学正解。"},
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)
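
DeepSeek-R1 is served under the model name deepseek-reasoner and, per DeepSeek's documentation, returns its chain of thought in a separate reasoning_content field alongside the final answer. A minimal sketch reusing the client above:

# DeepSeek-R1: the reasoning trace comes back separately from the answer.
response = client.chat.completions.create(
    model="deepseek-reasoner",  # DeepSeek-R1
    messages=[
        {"role": "user", "content": "Is 2^31 - 1 prime? Answer briefly."},
    ],
)
msg = response.choices[0].message
print(msg.reasoning_content)  # the model's reasoning trace
print(msg.content)            # the final answer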

1.5 Qwen (通义千问) API — Alibaba Cloud

Models: Qwen 2.5 (72B, 32B, 14B, 7B), Qwen-Max, Qwen-Plus
Endpoint: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions (OpenAI-compatible)
Auth: DASHSCOPE_API_KEY
Pricing: Qwen-Max: ¥0.02 / 1K tokens input; open-weight models are free to self-host

pip install openai
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen-max",
    messages=[
        {"role": "system", "content": "你是一个机器人学专家。"},
        {"role": "user", "content": "简要解释SLAM(同时定位与地图构建)。"},
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)

1.6 OpenRouter — Unified Gateway

OpenRouter provides a single API endpoint to access 100+ models from different providers. Useful for switching between models without changing code.

Endpoint: https://openrouter.ai/api/v1/chat/completions
Auth: OPENROUTER_API_KEY
Pricing: Varies per model; typically provider price + small markup

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENROUTER_API_KEY"),
    base_url="https://openrouter.ai/api/v1",
)

# Switch models by changing the model string
response = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",  # or "openai/gpt-4o", "deepseek/deepseek-chat"
    messages=[
        {"role": "user", "content": "What is a Jacobian matrix in robotics?"},
    ],
)

print(response.choices[0].message.content)

Section 2: Major LLMs Compared

| Model | Provider | Context Window | Pricing (per 1M tokens, input/output) | Strengths |
|---|---|---|---|---|
| GPT-4o | OpenAI | 128K | $2.50 / $10 | Fast, multimodal, strong all-round |
| Claude 3.5 Sonnet | Anthropic | 200K | $3 / $15 | Best coding, long context, careful reasoning |
| Gemini 1.5 Pro | Google | 1M (2M experimental) | $1.25 / $5 | Largest context, strong multimodal |
| DeepSeek-V3 | DeepSeek | 128K | $0.27 / $1.10 | Excellent value, strong reasoning |
| Qwen 2.5 (72B) | Alibaba | 128K | Free (open weights) | Top open Chinese LLM, good multilingual |
| Llama 3.1 (405B) | Meta | 128K | Free (open weights) | Best open-weight model, large community |
| Mistral Large | Mistral | 128K | $2 / $6 | Strong European model, efficient |

Notes:
- Pricing shown is for API access (cloud). Open-weight models (Qwen, Llama, Mistral) are free to download but require your own compute.
- Context window = maximum combined input + output tokens per request.
- All prices are approximate and may change.
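
To make the per-token pricing concrete, here is a minimal cost estimator using the approximate rates from the table above (the hard-coded rates are assumptions; check each provider's current price list before relying on them):

# Approximate cost of one API request, given token counts.
PRICES_PER_1M = {  # (input $/1M tokens, output $/1M tokens) -- approximate
    "gpt-4o": (2.50, 10.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "deepseek-v3": (0.27, 1.10),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the approximate dollar cost of a single request."""
    in_rate, out_rate = PRICES_PER_1M[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Example: a 2,000-token prompt with an 800-token reply on GPT-4o
print(f"${request_cost('gpt-4o', 2000, 800):.4f}")  # ≈ $0.0130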


Section 3: Local Deployment

Running models locally gives you privacy, no API costs, and offline capability. The trade-off is needing sufficient hardware (GPU with enough VRAM).
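
As a rough rule of thumb, the weights alone need about (parameters × bits per parameter / 8) bytes of VRAM, plus overhead for the KV cache and activations. A back-of-the-envelope sketch (the 20% overhead factor is an assumption; real usage depends on context length, batch size, and the serving engine):

# Rough VRAM estimate for serving a model at a given quantization level.
def estimate_vram_gb(params_billion: float, bits_per_param: float,
                     overhead: float = 1.2) -> float:
    """Weights footprint plus an assumed 20% overhead for cache/activations."""
    weights_gb = params_billion * bits_per_param / 8  # GB for weights alone
    return weights_gb * overhead

print(estimate_vram_gb(7, 4))    # 7B model, 4-bit quant  -> ~4.2 GB
print(estimate_vram_gb(7, 16))   # 7B model, FP16         -> ~16.8 GB
print(estimate_vram_gb(70, 4))   # 70B model, 4-bit quant -> ~42 GB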

3.1 llama.cpp (GGUF Format)

The reference implementation for running GGUF-quantized models on CPU and GPU.

# Install
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON  # enable CUDA for NVIDIA GPU
cmake --build build --config Release -j$(nproc)

# Download a GGUF model (example: Qwen 2.5 7B Q4)
# Get GGUF files from https://huggingface.co/models?search=gguf

# Run inference (interactive)
./build/bin/llama-cli -m /path/to/qwen2.5-7b-q4_k_m.gguf \
    -p "Explain forward kinematics:" -n 200

# Start an OpenAI-compatible server
./build/bin/llama-server -m /path/to/qwen2.5-7b-q4_k_m.gguf \
    --host 0.0.0.0 --port 8080

Then call it like any OpenAI API:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen2.5",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

3.2 Ollama

The simplest way to run LLMs locally. One command to install and run.

# Install (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama run qwen2.5:7b          # ~4.7 GB download
ollama run llama3.1:8b          # ~4.7 GB
ollama run deepseek-coder-v2:16b # ~9 GB

Use it from Python (pip install ollama):
import ollama

response = ollama.chat(
    model="qwen2.5:7b",
    messages=[
        {"role": "user", "content": "What is a transformation matrix in robotics?"},
    ],
)
print(response["message"]["content"])

Ollama also exposes an OpenAI-compatible API at http://localhost:11434/v1.
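
That means the client code from the earlier sections works unchanged against a local Ollama instance, e.g.:

from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "What is inverse kinematics?"}],
)
print(response.choices[0].message.content)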

3.3 vLLM

High-throughput serving engine, ideal for production and multi-user scenarios.

pip install vllm

# Start server with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 32768

Then query it with the OpenAI client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain PID tuning."}],
)
print(response.choices[0].message.content)

3.4 Text Generation Inference (TGI)

Hugging Face's production serving solution. Docker-based deployment.

# Using Docker (recommended)
docker run --gpus all -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id Qwen/Qwen2.5-7B-Instruct \
    --max-input-tokens 4096 \
    --max-total-tokens 8192

Then query it with the OpenAI client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Local Deployment Quick Comparison

| Tool | Best For | GPU Required? | Ease of Setup |
|---|---|---|---|
| llama.cpp | CPU inference, edge devices | Optional | Moderate |
| Ollama | Beginners, quick experiments | Recommended | Very Easy |
| vLLM | Production serving, high throughput | Yes | Easy |
| TGI | HuggingFace ecosystem, Docker setups | Yes | Easy (Docker) |
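
Since every tool above exposes an OpenAI-compatible endpoint, a single helper can target any of them by switching the base URL. A sketch using the default ports from the examples in this section (adjust them to however you started each server; the TGI example above also maps to port 8080, so run one server per port):

from openai import OpenAI

# Base URLs matching the example commands in this section (assumptions:
# change host/port to match your own setup).
LOCAL_BACKENDS = {
    "llama.cpp": "http://localhost:8080/v1",
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
}

def local_chat(backend: str, model: str, prompt: str) -> str:
    """Send one chat turn to a locally served model and return the reply."""
    client = OpenAI(base_url=LOCAL_BACKENDS[backend], api_key="not-needed")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(local_chat("ollama", "qwen2.5:7b", "Define a homogeneous transform."))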

Section 4: Coding Assistant Frameworks

4.1 Trae (ByteDance)

Trae is ByteDance's AI-powered IDE and coding assistant. It provides:
- AI chat integrated into the editor (VS Code-like interface)
- Code completion and generation powered by various LLMs
- Support for both Chinese and English prompts
- Built-in terminal integration for running generated code
- A standalone desktop application (macOS, Windows)
- A free tier; uses ByteDance's internal models and supports third-party models

How it works: Trae connects to LLM backends (ByteDance's Doubao/豆包 models and others), providing inline suggestions, chat-based code generation, and multi-file editing capabilities.

Website: https://www.trae.ai

4.2 OpenAI Codex CLI

OpenAI's terminal-based coding agent. Runs in your terminal and can read/write files, execute commands, and iterate on code.

# Install
npm install -g @openai/codex

# Run
export OPENAI_API_KEY="sk-..."
codex "create a Python script that computes forward kinematics for a 3-DOF arm"

Features:
- Runs locally in your terminal
- Can read and modify your codebase
- Executes shell commands with your approval
- Sandboxed execution modes (suggest, auto-edit, full-auto)
- Uses GPT-4o or o-series models

4.3 Claude Code (Anthropic)

Anthropic's agentic coding tool that runs in the terminal.

# Install
npm install -g @anthropic-ai/claude-code

# Run
claude "refactor this function to use async/await"

Features:
- Terminal-native; works in your existing shell
- Reads entire project context
- Can run tests, fix bugs, and create PRs
- Supports multi-file refactoring
- Uses Claude 3.5 Sonnet / Claude 4

4.4 Cursor

An AI-native code editor (VS Code fork) with deep LLM integration.

Features:
- Built on VS Code; familiar interface and extensions
- Cmd+K inline editing: describe changes in natural language
- Chat panel with codebase-aware context
- Multi-file editing with "Composer" mode
- Supports GPT-4o, Claude 3.5, and other models
- Tab completion powered by custom models
- Available at https://cursor.com

4.5 GitHub Copilot

GitHub's AI pair programmer, integrated into VS Code, JetBrains, and Neovim.

Features:
- Inline code suggestions as you type
- Chat panel (Copilot Chat) for Q&A and code generation
- Workspace-aware context
- Supports multiple models (GPT-4o, Claude 3.5 Sonnet)
- $10/month for individuals; free for students and open-source maintainers
- Integrated into GitHub.com for PR summaries and code review

Coding Assistants Comparison

| Feature | Trae | Codex CLI | Claude Code | Cursor | GitHub Copilot |
|---|---|---|---|---|---|
| Type | IDE | Terminal agent | Terminal agent | Editor (IDE) | Editor plugin |
| Platform | Desktop app | Terminal | Terminal | Desktop (VS Code fork) | VS Code, JetBrains, etc. |
| Default Model | Doubao / configurable | GPT-4o / o1 | Claude Sonnet | GPT-4o / Claude | GPT-4o / Claude |
| Codebase Awareness | Yes | Yes | Yes | Yes (strong) | Yes |
| Multi-file Edit | Yes | Yes | Yes | Yes (Composer) | Limited |
| Free Tier | Yes | ChatGPT Plus req. | API key needed | Limited free tier | Free for students |
| Best For | Chinese dev ecosystem | CLI workflows | CLI, deep refactoring | Full IDE experience | Broad compatibility |
| Language | EN / CN | EN | EN | EN | EN |
