LLM Basics: APIs, Models, Local Deployment & Coding Assistants¶
This chapter covers how to work with Large Language Models — from calling cloud APIs to running models locally and using AI-powered coding assistants.
Section 1: How to Request LLMs (API-Based)¶
1.1 OpenAI API¶
Models: GPT-4o, GPT-4-turbo, o1, GPT-4o-mini
Endpoint: https://api.openai.com/v1/chat/completions
Auth: Bearer token via OPENAI_API_KEY environment variable
Pricing (per 1M tokens): GPT-4o: $2.50 input / $10 output; GPT-4o-mini: $0.15 / $0.60
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain reinforcement learning in 3 sentences."},
],
temperature=0.7,
max_tokens=200,
)
print(response.choices[0].message.content)
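The SDK is a thin wrapper around HTTP: it POSTs JSON to the endpoint above with the key in an Authorization: Bearer header. A minimal sketch of the same call over raw HTTP, using the requests library (assumes pip install requests):
import os
import requests
# Same request as the SDK call above, sent directly with Bearer auth.
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 50,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])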
o1 models (reasoning models) do not support system messages or temperature:
response = client.chat.completions.create(
model="o1",
messages=[
{"role": "user", "content": "Solve: What is the integral of x^2 * e^x?"},
],
)
print(response.choices[0].message.content)
1.2 Anthropic API¶
Models: Claude 3.5 Sonnet, Claude 3 Opus, Claude 3.5 Haiku
Endpoint: https://api.anthropic.com/v1/messages
Auth: ANTHROPIC_API_KEY header
Pricing (per 1M tokens): Claude 3.5 Sonnet: $3 input / $15 output
import os
import anthropic
client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
message = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system="You are a robotics expert.",
messages=[
{"role": "user", "content": "Explain PID control in simple terms."},
],
)
print(message.content[0].text)
1.3 Google Gemini API¶
Models: Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 2.0 Flash
Endpoint: https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent
Auth: GOOGLE_API_KEY or service account
Pricing (per 1M tokens): Gemini 1.5 Pro: $1.25 input / $5 output; Flash: $0.075 / $0.30
import os
from google import genai
client = genai.Client(api_key=os.environ.get("GOOGLE_API_KEY"))
response = client.models.generate_content(
model="gemini-1.5-pro",
contents="Describe the difference between SLAM and visual odometry.",
)
print(response.text)
1.4 DeepSeek API¶
Models: DeepSeek-V3, DeepSeek-R1
Endpoint: https://api.deepseek.com/v1/chat/completions (OpenAI-compatible)
Auth: DEEPSEEK_API_KEY
Pricing (per 1M tokens): DeepSeek-V3: $0.27 input / $1.10 output (very cost-effective)
import os
from openai import OpenAI
# DeepSeek uses an OpenAI-compatible API
client = OpenAI(
api_key=os.environ.get("DEEPSEEK_API_KEY"),
base_url="https://api.deepseek.com/v1",
)
response = client.chat.completions.create(
model="deepseek-chat", # DeepSeek-V3
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "用中文解释什么是机器人的运动学正解。"},
],
temperature=0.7,
)
print(response.choices[0].message.content)
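DeepSeek-R1 is served through the same endpoint under the model name deepseek-reasoner, and per DeepSeek's API docs the response message carries the model's chain of thought in a separate reasoning_content field. A minimal sketch reusing the client above:
# DeepSeek-R1 via the same OpenAI-compatible endpoint.
# Per DeepSeek's docs, sampling parameters such as temperature are not supported here.
response = client.chat.completions.create(
    model="deepseek-reasoner",  # DeepSeek-R1
    messages=[
        {"role": "user", "content": "A robot arm has links of 3 cm and 4 cm. What is its maximum reach?"},
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer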
1.5 Qwen (通义千问) API — Alibaba Cloud¶
Models: Qwen 2.5 (72B, 32B, 14B, 7B), Qwen-Max, Qwen-Plus
Endpoint: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions (OpenAI-compatible)
Auth: DASHSCOPE_API_KEY
Pricing: Qwen-Max: ¥0.02 per 1K input tokens (≈ ¥20 per 1M); open-weight Qwen models are free to self-host
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("DASHSCOPE_API_KEY"),
base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
response = client.chat.completions.create(
model="qwen-max",
messages=[
{"role": "system", "content": "你是一个机器人学专家。"},
{"role": "user", "content": "简要解释SLAM(同时定位与地图构建)。"},
],
temperature=0.7,
)
print(response.choices[0].message.content)
1.6 OpenRouter — Unified Gateway¶
OpenRouter provides a single API endpoint for accessing 100+ models from different providers, making it easy to switch between models without changing code.
Endpoint: https://openrouter.ai/api/v1/chat/completions
Auth: OPENROUTER_API_KEY
Pricing: Varies per model; typically provider price + small markup
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("OPENROUTER_API_KEY"),
base_url="https://openrouter.ai/api/v1",
)
# Switch models by changing the model string
response = client.chat.completions.create(
model="anthropic/claude-3.5-sonnet", # or "openai/gpt-4o", "deepseek/deepseek-chat"
messages=[
{"role": "user", "content": "What is a Jacobian matrix in robotics?"},
],
)
print(response.choices[0].message.content)
Section 2: Major LLMs Compared¶
| Model | Provider | Context Window | Pricing (per 1M tokens, input/output) | Strengths |
|---|---|---|---|---|
| GPT-4o | OpenAI | 128K | $2.50 / $10 | Fast, multimodal, strong all-round |
| Claude 3.5 Sonnet | Anthropic | 200K | $3 / $15 | Best coding, long context, careful reasoning |
| Gemini 1.5 Pro | Google | 1M (2M exp.) | $1.25 / $5 | Largest context, strong multimodal |
| DeepSeek-V3 | DeepSeek | 128K | $0.27 / $1.10 | Excellent value, strong reasoning |
| Qwen 2.5 (72B) | Alibaba | 128K | Free (open weights) | Top open Chinese LLM, good multilingual |
| Llama 3.1 (405B) | Meta | 128K | Free (open weights) | Best open-weight model, large community |
| Mistral Large | Mistral | 128K | $2 / $6 | Strong European model, efficient |
Notes:
- Pricing shown is for API access (cloud); a rough cost-estimation sketch follows this list. Open-weight models (Qwen, Llama, Mistral) are free to download but require your own compute.
- Context window = the maximum combined input + output tokens per request.
- All prices are approximate and may change.
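Per-token pricing makes cost estimation simple arithmetic: count input tokens, add expected output tokens, multiply by the per-1M rates. A rough sketch using OpenAI's tiktoken tokenizer (pip install tiktoken; the rates are the GPT-4o figures from the table and will drift):
import tiktoken
# GPT-4o rates from the table above, USD per 1M tokens (subject to change).
INPUT_RATE, OUTPUT_RATE = 2.50, 10.00
def estimate_cost(prompt: str, expected_output_tokens: int) -> float:
    enc = tiktoken.encoding_for_model("gpt-4o")
    input_tokens = len(enc.encode(prompt))
    return (input_tokens * INPUT_RATE + expected_output_tokens * OUTPUT_RATE) / 1e6
# Example: a long prompt expecting a 500-token answer.
print(f"${estimate_cost('Explain SLAM. ' * 250, 500):.4f}")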
Section 3: Local Deployment¶
Running models locally gives you privacy, no API costs, and offline capability. The trade-off is needing sufficient hardware (GPU with enough VRAM).
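How much VRAM is enough? A common rule of thumb: weight memory ≈ parameter count × bytes per weight, plus overhead for the KV cache and activations. The sketch below uses an assumed 1.2× overhead factor; actual usage varies with context length:
# Back-of-envelope VRAM estimate for loading a model at a given quantization.
# The 1.2 overhead factor is an assumption; real usage depends on context length.
def vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    return params_billion * (bits_per_weight / 8) * overhead
print(f"7B  @ FP16: {vram_gb(7, 16):.1f} GB")  # ~16.8 GB
print(f"7B  @ Q4:   {vram_gb(7, 4):.1f} GB")   # ~4.2 GB
print(f"70B @ Q4:   {vram_gb(70, 4):.1f} GB")  # ~42.0 GB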
3.1 llama.cpp (GGUF Format)¶
The reference implementation for running GGUF-quantized models on CPU and GPU.
# Install
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON # enable CUDA for NVIDIA GPU
cmake --build build --config Release -j$(nproc)
# Download a GGUF model (example: Qwen 2.5 7B Q4)
# Get GGUF files from https://huggingface.co/models?search=gguf
# Run inference (interactive)
./build/bin/llama-cli -m /path/to/qwen2.5-7b-q4_k_m.gguf \
-p "Explain forward kinematics:" -n 200
# Start an OpenAI-compatible server
./build/bin/llama-server -m /path/to/qwen2.5-7b-q4_k_m.gguf \
--host 0.0.0.0 --port 8080
Then call it like any OpenAI API:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
model="qwen2.5",
messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
3.2 Ollama¶
The simplest way to run LLMs locally. One command to install and run.
# Install (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run a model
ollama run qwen2.5:7b # ~4.7 GB download
ollama run llama3.1:8b # ~4.7 GB
ollama run deepseek-coder-v2:16b # ~9 GB
Use it from Python via the official client (pip install ollama):
import ollama
response = ollama.chat(
model="qwen2.5:7b",
messages=[
{"role": "user", "content": "What is a transformation matrix in robotics?"},
],
)
print(response["message"]["content"])
Ollama also exposes an OpenAI-compatible API at http://localhost:11434/v1.
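That means the standard OpenAI client works here too, just as with llama.cpp's server above (Ollama requires an api_key argument but ignores its value):
from openai import OpenAI
# Point the OpenAI client at Ollama's local OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "What is inverse kinematics?"}],
)
print(resp.choices[0].message.content)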
3.3 vLLM¶
High-throughput serving engine, ideal for production and multi-user scenarios.
pip install vllm
# Start server with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct \
--host 0.0.0.0 --port 8000 \
--max-model-len 32768
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[{"role": "user", "content": "Explain PID tuning."}],
)
print(response.choices[0].message.content)
3.4 Text Generation Inference (TGI)¶
Hugging Face's production serving solution. Docker-based deployment.
# Using Docker (recommended)
docker run --gpus all -p 8080:80 \
-v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id Qwen/Qwen2.5-7B-Instruct \
--max-input-tokens 4096 \
--max-total-tokens 8192
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Local Deployment Quick Comparison¶
| Tool | Best For | GPU Required? | Ease of Setup |
|---|---|---|---|
| llama.cpp | CPU inference, edge devices | Optional | Moderate |
| Ollama | Beginners, quick experiments | Recommended | Very Easy |
| vLLM | Production serving, high throughput | Yes | Easy |
| TGI | HuggingFace ecosystem, Docker setups | Yes | Easy (Docker) |
Section 4: Coding Assistant Frameworks¶
4.1 Trae (ByteDance)¶
Trae is ByteDance's AI-powered IDE and coding assistant. It provides:
- AI chat integrated into the editor (VS Code-like interface)
- Code completion and generation powered by various LLMs
- Support for both Chinese and English prompts
- Built-in terminal integration for running generated code
- A standalone desktop application (macOS, Windows)
- A free tier; uses ByteDance's internal models and supports third-party models
How it works: Trae connects to LLM backends (ByteDance's Doubao/豆包 models and others), providing inline suggestions, chat-based code generation, and multi-file editing capabilities.
Website: https://www.trae.ai
4.2 OpenAI Codex CLI¶
OpenAI's terminal-based coding agent. It can read and write files, execute commands, and iterate on code without leaving your shell.
# Install
npm install -g @openai/codex
# Run
export OPENAI_API_KEY="sk-..."
codex "create a Python script that computes forward kinematics for a 3-DOF arm"
Features:
- Runs locally in your terminal
- Can read and modify your codebase
- Executes shell commands with your approval
- Sandboxed execution modes (suggest, auto-edit, full-auto)
- Uses GPT-4o or o-series models
4.3 Claude Code (Anthropic)¶
Anthropic's agentic coding tool that runs in the terminal.
# Install
npm install -g @anthropic-ai/claude-code
# Run
claude "refactor this function to use async/await"
Features:
- Terminal-native, works in your existing shell
- Reads entire project context
- Can run tests, fix bugs, and create PRs
- Supports multi-file refactoring
- Uses Claude 3.5 Sonnet / Claude 4
4.4 Cursor¶
An AI-native code editor (VS Code fork) with deep LLM integration.
Features:
- Built on VS Code; familiar interface and extensions
- Cmd+K inline editing: describe changes in natural language
- Chat panel with codebase-aware context
- Multi-file editing with "Composer" mode
- Supports GPT-4o, Claude 3.5, and other models
- Tab completion powered by custom models
- Available at https://cursor.com
4.5 GitHub Copilot¶
GitHub's AI pair programmer, integrated into VS Code, JetBrains, and Neovim.
Features:
- Inline code suggestions as you type
- Chat panel (Copilot Chat) for Q&A and code generation
- Workspace-aware context
- Supports multiple models (GPT-4o, Claude 3.5 Sonnet)
- $10/month for individuals; free for students and open-source maintainers
- Integrated into GitHub.com for PR summaries and code review
Coding Assistants Comparison¶
| Feature | Trae | Codex CLI | Claude Code | Cursor | GitHub Copilot |
|---|---|---|---|---|---|
| Type | IDE | Terminal agent | Terminal agent | Editor (IDE) | Editor plugin |
| Platform | Desktop app | Terminal | Terminal | Desktop (VS Code fork) | VS Code, JetBrains, etc. |
| Default Model | Doubao / configurable | GPT-4o / o1 | Claude Sonnet | GPT-4o / Claude | GPT-4o / Claude |
| Codebase Awareness | Yes | Yes | Yes | Yes (strong) | Yes |
| Multi-file Edit | Yes | Yes | Yes | Yes (Composer) | Limited |
| Free Tier | Yes | ChatGPT Plus req. | API key needed | Limited free tier | Free for students |
| Best For | Chinese dev ecosystem | CLI workflows | CLI, deep refactoring | Full IDE experience | Broad compatibility |
| Language | EN / CN | EN | EN | EN | EN |
References¶
API Documentation¶
- OpenAI API: https://platform.openai.com/docs
- Anthropic API: https://docs.anthropic.com/en/docs
- Google Gemini API: https://ai.google.dev/docs
- DeepSeek API: https://platform.deepseek.com/api-docs
- Qwen / DashScope: https://help.aliyun.com/zh/dashscope/
- OpenRouter: https://openrouter.ai/docs
Local Deployment¶
- llama.cpp: https://github.com/ggerganov/llama.cpp
- Ollama: https://ollama.com
- vLLM: https://docs.vllm.ai
- TGI: https://huggingface.co/docs/text-generation-inference
- GGUF Models: https://huggingface.co/models?search=gguf
Coding Assistants¶
- Trae: https://www.trae.ai
- OpenAI Codex CLI: https://github.com/openai/codex
- Claude Code: https://docs.anthropic.com/en/docs/claude-code
- Cursor: https://cursor.com
- GitHub Copilot: https://github.com/features/copilot