LLM Basics: APIs, Models, Local Deployment & Coding Assistants

This chapter covers how to work with Large Language Models — from calling cloud APIs to running models locally and using AI-powered coding assistants.


Section 1: How to Request LLMs (API-Based)

1.1 OpenAI API

Models: GPT-4o, GPT-4-turbo, o1, GPT-4o-mini
Endpoint: https://api.openai.com/v1/chat/completions
Auth: Bearer token via OPENAI_API_KEY environment variable
Pricing (per 1M tokens): GPT-4o: $2.50 input / $10 output; GPT-4o-mini: $0.15 / $0.60

pip install openai
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain reinforcement learning in 3 sentences."},
    ],
    temperature=0.7,
    max_tokens=200,
)

print(response.choices[0].message.content)
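
For interactive use you will usually want streaming, so tokens print as they are generated. A minimal sketch reusing the same client (the prompt is illustrative):

# Stream tokens as they arrive instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Name three uses of quaternions."}],
    stream=True,  # yields chunks as the model generates
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (e.g. the final one) carry no text
        print(delta, end="", flush=True)
print()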

The o1 reasoning models do not support system messages or the temperature parameter, and use max_completion_tokens instead of max_tokens:

response = client.chat.completions.create(
    model="o1",
    messages=[
        {"role": "user", "content": "Solve: What is the integral of x^2 * e^x?"},
    ],
)
print(response.choices[0].message.content)

1.2 Anthropic API

Models: Claude 3.5 Sonnet, Claude 3 Opus, Claude 3.5 Haiku
Endpoint: https://api.anthropic.com/v1/messages
Auth: ANTHROPIC_API_KEY header
Pricing (per 1M tokens): Claude 3.5 Sonnet: $3 input / $15 output

pip install anthropic
import os
import anthropic

client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a robotics expert.",
    messages=[
        {"role": "user", "content": "Explain PID control in simple terms."},
    ],
)

print(message.content[0].text)

1.3 Google Gemini API

Models: Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 2.0 Flash
Endpoint: https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent
Auth: GOOGLE_API_KEY or service account
Pricing (per 1M tokens): Gemini 1.5 Pro: $1.25 input / $5 output; Flash: $0.075 / $0.30

pip install google-genai
import os
from google import genai

client = genai.Client(api_key=os.environ.get("GOOGLE_API_KEY"))

response = client.models.generate_content(
    model="gemini-1.5-pro",
    contents="Describe the difference between SLAM and visual odometry.",
)

print(response.text)

1.4 DeepSeek API

Models: DeepSeek-V3, DeepSeek-R1
Endpoint: https://api.deepseek.com/v1/chat/completions (OpenAI-compatible)
Auth: DEEPSEEK_API_KEY
Pricing (per 1M tokens): DeepSeek-V3: $0.27 input / $1.10 output (very cost-effective)

pip install openai
import os
from openai import OpenAI

# DeepSeek uses an OpenAI-compatible API
client = OpenAI(
    api_key=os.environ.get("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-chat",  # DeepSeek-V3
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "用中文解释什么是机器人的运动学正解。"},
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)
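
DeepSeek-R1 is served under the model name deepseek-reasoner and, per DeepSeek's documentation, returns its chain of thought in a separate reasoning_content field alongside the final answer. A minimal sketch reusing the client above:

# DeepSeek-R1: the reasoning trace comes back separately from the answer.
response = client.chat.completions.create(
    model="deepseek-reasoner",  # DeepSeek-R1
    messages=[
        {"role": "user", "content": "Is 2^31 - 1 prime? Answer briefly."},
    ],
)
msg = response.choices[0].message
print(msg.reasoning_content)  # the model's reasoning trace
print(msg.content)            # the final answer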

1.5 Qwen (通义千问) API — Alibaba Cloud

Models: Qwen 2.5 (72B, 32B, 14B, 7B), Qwen-Max, Qwen-Plus
Endpoint: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions (OpenAI-compatible)
Auth: DASHSCOPE_API_KEY
Pricing: Qwen-Max: ¥0.02 / 1K tokens input; open-weight models are free to self-host

pip install openai
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen-max",
    messages=[
        {"role": "system", "content": "你是一个机器人学专家。"},
        {"role": "user", "content": "简要解释SLAM(同时定位与地图构建)。"},
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)

1.6 OpenRouter — Unified Gateway

OpenRouter provides a single API endpoint to access 100+ models from different providers. Useful for switching between models without changing code.

Endpoint: https://openrouter.ai/api/v1/chat/completions
Auth: OPENROUTER_API_KEY
Pricing: Varies per model; typically provider price + small markup

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENROUTER_API_KEY"),
    base_url="https://openrouter.ai/api/v1",
)

# Switch models by changing the model string
response = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",  # or "openai/gpt-4o", "deepseek/deepseek-chat"
    messages=[
        {"role": "user", "content": "What is a Jacobian matrix in robotics?"},
    ],
)

print(response.choices[0].message.content)

Section 2: Major LLMs Compared

| Model | Provider | Context Window | Pricing (per 1M tokens, input/output) | Strengths |
|---|---|---|---|---|
| GPT-4o | OpenAI | 128K | $2.50 / $10 | Fast, multimodal, strong all-round |
| Claude 3.5 Sonnet | Anthropic | 200K | $3 / $15 | Best coding, long context, careful reasoning |
| Gemini 1.5 Pro | Google | 1M (2M experimental) | $1.25 / $5 | Largest context, strong multimodal |
| DeepSeek-V3 | DeepSeek | 128K | $0.27 / $1.10 | Excellent value, strong reasoning |
| Qwen 2.5 (72B) | Alibaba | 128K | Free (open weights) | Top open Chinese LLM, good multilingual |
| Llama 3.1 (405B) | Meta | 128K | Free (open weights) | Best open-weight model, large community |
| Mistral Large | Mistral | 128K | $2 / $6 | Strong European model, efficient |

Notes:
- Pricing shown is for API access (cloud). Open-weight models (Qwen, Llama, Mistral) are free to download but require your own compute.
- Context window = maximum combined input + output tokens per request.
- All prices are approximate and may change.
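
To make the per-token pricing concrete, here is a minimal cost estimator using the approximate rates from the table above (the hard-coded rates are assumptions; check each provider's current price list before relying on them):

# Approximate cost of one API request, given token counts.
PRICES_PER_1M = {  # (input $/1M tokens, output $/1M tokens) -- approximate
    "gpt-4o": (2.50, 10.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "deepseek-v3": (0.27, 1.10),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the approximate dollar cost of a single request."""
    in_rate, out_rate = PRICES_PER_1M[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Example: a 2,000-token prompt with an 800-token reply on GPT-4o
print(f"${request_cost('gpt-4o', 2000, 800):.4f}")  # ≈ $0.0130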


Section 3: Local Deployment

Running models locally gives you privacy, no API costs, and offline capability. The trade-off is needing sufficient hardware (GPU with enough VRAM).
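
As a rough rule of thumb, the weights alone need about (parameters × bits per parameter / 8) bytes of VRAM, plus overhead for the KV cache and activations. A back-of-the-envelope sketch (the 20% overhead factor is an assumption; real usage depends on context length, batch size, and the serving engine):

# Rough VRAM estimate for serving a model at a given quantization level.
def estimate_vram_gb(params_billion: float, bits_per_param: float,
                     overhead: float = 1.2) -> float:
    """Weights footprint plus an assumed 20% overhead for cache/activations."""
    weights_gb = params_billion * bits_per_param / 8  # GB for weights alone
    return weights_gb * overhead

print(estimate_vram_gb(7, 4))    # 7B model, 4-bit quant  -> ~4.2 GB
print(estimate_vram_gb(7, 16))   # 7B model, FP16         -> ~16.8 GB
print(estimate_vram_gb(70, 4))   # 70B model, 4-bit quant -> ~42 GB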

3.1 llama.cpp (GGUF Format)

The reference implementation for running GGUF-quantized models on CPU and GPU.

# Install
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON  # enable CUDA for NVIDIA GPU
cmake --build build --config Release -j$(nproc)

# Download a GGUF model (example: Qwen 2.5 7B Q4)
# Get GGUF files from https://huggingface.co/models?search=gguf

# Run inference (interactive)
./build/bin/llama-cli -m /path/to/qwen2.5-7b-q4_k_m.gguf \
    -p "Explain forward kinematics:" -n 200

# Start an OpenAI-compatible server
./build/bin/llama-server -m /path/to/qwen2.5-7b-q4_k_m.gguf \
    --host 0.0.0.0 --port 8080

Then call it like any OpenAI API:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen2.5",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

3.2 Ollama

The simplest way to run LLMs locally. One command to install and run.

# Install (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama run qwen2.5:7b          # ~4.7 GB download
ollama run llama3.1:8b          # ~4.7 GB
ollama run deepseek-coder-v2:16b # ~9 GB

Use it from Python (pip install ollama):
import ollama

response = ollama.chat(
    model="qwen2.5:7b",
    messages=[
        {"role": "user", "content": "What is a transformation matrix in robotics?"},
    ],
)
print(response["message"]["content"])

Ollama also exposes an OpenAI-compatible API at http://localhost:11434/v1.
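
That means the client code from the earlier sections works unchanged against a local Ollama instance, e.g.:

from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "What is inverse kinematics?"}],
)
print(response.choices[0].message.content)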

3.3 vLLM

High-throughput serving engine, ideal for production and multi-user scenarios.

pip install vllm

# Start server with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 32768

Then query it with the OpenAI client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain PID tuning."}],
)
print(response.choices[0].message.content)

3.4 Text Generation Inference (TGI)

Hugging Face's production serving solution. Docker-based deployment.

# Using Docker (recommended)
docker run --gpus all -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id Qwen/Qwen2.5-7B-Instruct \
    --max-input-tokens 4096 \
    --max-total-tokens 8192

Then query it with the OpenAI client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Local Deployment Quick Comparison

| Tool | Best For | GPU Required? | Ease of Setup |
|---|---|---|---|
| llama.cpp | CPU inference, edge devices | Optional | Moderate |
| Ollama | Beginners, quick experiments | Recommended | Very Easy |
| vLLM | Production serving, high throughput | Yes | Easy |
| TGI | HuggingFace ecosystem, Docker setups | Yes | Easy (Docker) |
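
Since every tool above exposes an OpenAI-compatible endpoint, a single helper can target any of them by switching the base URL. A sketch using the default ports from the examples in this section (adjust them to however you started each server; the TGI example above also maps to port 8080, so run one server per port):

from openai import OpenAI

# Base URLs matching the example commands in this section (assumptions:
# change host/port to match your own setup).
LOCAL_BACKENDS = {
    "llama.cpp": "http://localhost:8080/v1",
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
}

def local_chat(backend: str, model: str, prompt: str) -> str:
    """Send one chat turn to a locally served model and return the reply."""
    client = OpenAI(base_url=LOCAL_BACKENDS[backend], api_key="not-needed")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(local_chat("ollama", "qwen2.5:7b", "Define a homogeneous transform."))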

Section 4: Coding Assistant Frameworks

4.1 Trae (ByteDance)

Trae is ByteDance's AI-powered IDE and coding assistant. It provides:
- AI chat integrated into the editor (VS Code-like interface)
- Code completion and generation powered by various LLMs
- Support for both Chinese and English prompts
- Built-in terminal integration for running generated code
- A standalone desktop application (macOS, Windows)
- A free tier; uses ByteDance's internal models and supports third-party models

How it works: Trae connects to LLM backends (ByteDance's Doubao/豆包 models and others), providing inline suggestions, chat-based code generation, and multi-file editing capabilities.

Website: https://www.trae.ai

4.2 OpenAI Codex CLI

OpenAI's terminal-based coding agent. Runs in your terminal and can read/write files, execute commands, and iterate on code.

# Install
npm install -g @openai/codex

# Run
export OPENAI_API_KEY="sk-..."
codex "create a Python script that computes forward kinematics for a 3-DOF arm"

Features:
- Runs locally in your terminal
- Can read and modify your codebase
- Executes shell commands with your approval
- Sandboxed execution modes (suggest, auto-edit, full-auto)
- Uses GPT-4o or o-series models

4.3 Claude Code (Anthropic)

Anthropic's agentic coding tool that runs in the terminal.

# Install
npm install -g @anthropic-ai/claude-code

# Run
claude "refactor this function to use async/await"

Features:
- Terminal-native; works in your existing shell
- Reads entire project context
- Can run tests, fix bugs, and create PRs
- Supports multi-file refactoring
- Uses Claude 3.5 Sonnet / Claude 4

4.4 Cursor

An AI-native code editor (VS Code fork) with deep LLM integration.

Features:
- Built on VS Code; familiar interface and extensions
- Cmd+K inline editing: describe changes in natural language
- Chat panel with codebase-aware context
- Multi-file editing with "Composer" mode
- Supports GPT-4o, Claude 3.5, and other models
- Tab completion powered by custom models
- Available at https://cursor.com

4.5 GitHub Copilot

GitHub's AI pair programmer, integrated into VS Code, JetBrains, and Neovim.

Features:
- Inline code suggestions as you type
- Chat panel (Copilot Chat) for Q&A and code generation
- Workspace-aware context
- Supports multiple models (GPT-4o, Claude 3.5 Sonnet)
- $10/month for individuals; free for students and open-source maintainers
- Integrated into GitHub.com for PR summaries and code review

Coding Assistants Comparison

| Feature | Trae | Codex CLI | Claude Code | Cursor | GitHub Copilot |
|---|---|---|---|---|---|
| Type | IDE | Terminal agent | Terminal agent | Editor (IDE) | Editor plugin |
| Platform | Desktop app | Terminal | Terminal | Desktop (VS Code fork) | VS Code, JetBrains, etc. |
| Default Model | Doubao / configurable | GPT-4o / o1 | Claude Sonnet | GPT-4o / Claude | GPT-4o / Claude |
| Codebase Awareness | Yes | Yes | Yes | Yes (strong) | Yes |
| Multi-file Edit | Yes | Yes | Yes | Yes (Composer) | Limited |
| Free Tier | Yes | ChatGPT Plus req. | API key needed | Limited free tier | Free for students |
| Best For | Chinese dev ecosystem | CLI workflows | CLI, deep refactoring | Full IDE experience | Broad compatibility |
| Language | EN / CN | EN | EN | EN | EN |
