<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://andyluo7.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://andyluo7.github.io/" rel="alternate" type="text/html" /><updated>2026-05-15T00:22:28+00:00</updated><id>https://andyluo7.github.io/feed.xml</id><title type="html">Andy Luo — Notes &amp;amp; Projects</title><subtitle>Writing about software engineering, side projects, and lessons learned along the way.</subtitle><entry><title type="html">CPU-GPU Co-Design for Agentic LLM Inference</title><link href="https://andyluo7.github.io/llm/amd/mi300x/vllm/lmcache/performance/2026/05/14/cpu-gpu-codesign-agentic-inference-mi300x/" rel="alternate" type="text/html" title="CPU-GPU Co-Design for Agentic LLM Inference" /><published>2026-05-14T00:00:00+00:00</published><updated>2026-05-14T00:00:00+00:00</updated><id>https://andyluo7.github.io/llm/amd/mi300x/vllm/lmcache/performance/2026/05/14/cpu-gpu-codesign-agentic-inference-mi300x</id><content type="html" xml:base="https://andyluo7.github.io/llm/amd/mi300x/vllm/lmcache/performance/2026/05/14/cpu-gpu-codesign-agentic-inference-mi300x/"><![CDATA[<p><em>Quantifying where time actually goes — and why your CPU might be stealing 15% or more of your GPU throughput.</em></p>

<hr />

<h2 id="key-summary">Key Summary</h2>

<p>We instrumented the full request lifecycle of agentic LLM inference on AMD MI300X to answer a simple question: <strong>how much of end-to-end latency is CPU work vs GPU work?</strong></p>

<p>Using MiniMax-M2.5 (230 GB FP8 MoE) on 2× MI300X with vLLM 0.19.0, we decomposed every request into serialization, HTTP overhead (tokenization + scheduling + queue wait), GPU prefill, and GPU decode across 8 scenarios spanning concurrency 1–32 and context 1k–100k tokens.</p>

<p><strong>Headline findings:</strong></p>
<ul>
  <li><strong>At low concurrency, CPU overhead is negligible</strong> — 0.4–0.6% of E2E time for single requests at any context length</li>
  <li><strong>At high concurrency, CPU overhead becomes material</strong> — 11–15% of E2E time at 32 concurrent users</li>
  <li><strong>The bottleneck is not tokenization or JSON parsing</strong> — it’s <strong>scheduling + queue wait</strong>, which scales superlinearly with concurrency</li>
  <li><strong>Tokenization at 100k tokens costs only 220ms</strong> (~500k tok/s on a single CPU core), tiny compared to GPU prefill (2–4 seconds)</li>
  <li><strong>LMCache adds minimal CPU overhead</strong> vs HBM prefix cache — the CPU% split is nearly identical between the two strategies</li>
  <li><strong>The real CPU-GPU co-design opportunity</strong> is not in making CPU faster, but in <strong>overlapping CPU work with GPU work</strong> and reducing scheduling contention at high concurrency</li>
</ul>

<hr />

<h2 id="1-motivation-the-hidden-cpu-tax-in-agentic-inference">1. Motivation: The Hidden CPU Tax in Agentic Inference</h2>

<p>Our previous work benchmarked <a href="https://github.com/andyluo7/openclaw-workspace/tree/main/multiturn-agentic-bench">LMCache for multi-turn agentic workloads on MI300X</a>, comparing KV-cache strategies. We measured TTFT, throughput, and cache hit rates. But we treated the inference server as a black box — we never asked <em>where inside the server</em> the time goes.</p>

<p>Agentic AI workloads are not just GPU workloads. Every request passes through a CPU pipeline before and after GPU execution:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Client                          Server (vLLM)                      GPU
──────                          ─────────────                      ───
 │                                    │                              │
 │─── serialize request ──────────────│                              │
 │    (JSON, 0.04-1.3ms)              │                              │
 │                                    │                              │
 │                          ┌─────────┴──────────┐                   │
 │                          │ HTTP parse         │                   │
 │                          │ Tokenize input     │                   │
 │                          │ Schedule request   │  "HTTP Overhead"  │
 │                          │ KV cache lookup    │  (7-3900ms)       │
 │                          │ Queue wait         │                   │
 │                          └─────────┬──────────┘                   │
 │                                    │                              │
 │                                    │──── GPU prefill ─────────────│
 │                                    │     (41-28537ms)             │
 │                                    │                              │
 │                                    │──── GPU decode (streaming) ──│
 │                                    │     (1780-20792ms)           │
 │                                    │                              │
 │◄── parse SSE response ─────────────│                              │
 │    (1.9µs per chunk)               │                              │
</code></pre></div></div>

<p>The question: at scale (32 concurrent users, 100k token contexts), does the CPU pipeline become a bottleneck?</p>

<hr />

<h2 id="2-methodology">2. Methodology</h2>

<h3 id="21-hardware--software">2.1 Hardware &amp; Software</h3>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Specification</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GPU</td>
      <td>2× AMD Instinct MI300X (192 GB HBM3 each), gfx942</td>
    </tr>
    <tr>
      <td>CPU</td>
      <td>AMD EPYC (ENC1-CLS01-SVR08)</td>
    </tr>
    <tr>
      <td>Model</td>
      <td>MiniMaxAI/MiniMax-M2.5 FP8, TP=2</td>
    </tr>
    <tr>
      <td>Framework</td>
      <td>vLLM 0.19.0 (ROCm)</td>
    </tr>
    <tr>
      <td>KV Cache</td>
      <td>HBM prefix cache / LMCache CPU DRAM</td>
    </tr>
    <tr>
      <td>Workload</td>
      <td>739 anonymized Claude Code agentic conversations</td>
    </tr>
  </tbody>
</table>

<h3 id="22-what-we-measured">2.2 What We Measured</h3>

<p>We decomposed each request into five time components:</p>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Where</th>
      <th>What It Captures</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>t_serialize</strong></td>
      <td>Client CPU</td>
      <td>JSON serialization of the request payload</td>
    </tr>
    <tr>
      <td><strong>t_http_overhead</strong></td>
      <td>Server CPU</td>
      <td>HTTP parsing + tokenization + scheduling + queue wait + KV cache lookup</td>
    </tr>
    <tr>
      <td><strong>t_server_prefill</strong></td>
      <td>Server GPU</td>
      <td>Attention computation over all input tokens</td>
    </tr>
    <tr>
      <td><strong>t_decode</strong></td>
      <td>Server GPU (mostly)</td>
      <td>Autoregressive token generation + streaming</td>
    </tr>
    <tr>
      <td><strong>t_response_parse</strong></td>
      <td>Client CPU</td>
      <td>SSE chunk parsing + tool call extraction</td>
    </tr>
  </tbody>
</table>

<p>We classify <code class="language-plaintext highlighter-rouge">t_serialize + t_http_overhead + t_response_parse</code> as <strong>CPU time</strong> and <code class="language-plaintext highlighter-rouge">t_server_prefill + t_decode</code> as <strong>GPU time</strong>.</p>

<p><strong>Note:</strong> <code class="language-plaintext highlighter-rouge">t_http_overhead</code> is measured as the gap between client sending the HTTP request and receiving the first byte back. This includes tokenization, scheduling, queue wait time, and KV cache management — all CPU-side work that happens before the GPU begins prefill. At low concurrency this is mostly tokenization + scheduling. At high concurrency, queue wait dominates.</p>
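
<p>For concreteness, here is a minimal sketch of the client-side decomposition, assuming a streaming OpenAI-compatible endpoint. The function and payload shape are illustrative, and subtracting a server-reported prefill time to isolate <code class="language-plaintext highlighter-rouge">t_http_overhead</code> is an assumption about the harness, not its exact implementation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json, time
import requests  # assumed HTTP client; the real harness may differ

def timed_request(url, payload, t_server_prefill_s=0.0):
    """Sketch: split one streaming request into the timing components above
    (response-parse timing omitted for brevity). t_server_prefill_s would
    come from server-side metrics and is a placeholder here."""
    t0 = time.perf_counter()
    body = json.dumps(payload)                        # t_serialize (client CPU)
    t1 = time.perf_counter()
    resp = requests.post(url, data=body, stream=True,
                         headers={"Content-Type": "application/json"})
    t_first = None
    for _ in resp.iter_lines():                       # SSE chunks
        if t_first is None:
            t_first = time.perf_counter()             # first byte back
    t_end = time.perf_counter()
    return {
        "t_serialize": t1 - t0,
        "t_http_overhead": (t_first - t1) - t_server_prefill_s,
        "t_server_prefill": t_server_prefill_s,
        "t_decode": t_end - t_first,
    }
</code></pre></div></div>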

<h3 id="23-test-matrix">2.3 Test Matrix</h3>

<table>
  <thead>
    <tr>
      <th>Scenario</th>
      <th>Concurrency</th>
      <th>Context</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>single_1k</td>
      <td>1</td>
      <td>1,000</td>
      <td>Baseline: pure overhead</td>
    </tr>
    <tr>
      <td>single_8k</td>
      <td>1</td>
      <td>8,000</td>
      <td>Typical agent turn</td>
    </tr>
    <tr>
      <td>single_32k</td>
      <td>1</td>
      <td>32,000</td>
      <td>Large agent context</td>
    </tr>
    <tr>
      <td>single_100k</td>
      <td>1</td>
      <td>100,000</td>
      <td>Maximum agent context</td>
    </tr>
    <tr>
      <td>conc4_8k</td>
      <td>4</td>
      <td>8,000</td>
      <td>Light multi-tenant</td>
    </tr>
    <tr>
      <td>conc16_32k</td>
      <td>16</td>
      <td>32,000</td>
      <td>Medium load</td>
    </tr>
    <tr>
      <td>conc32_32k</td>
      <td>32</td>
      <td>32,000</td>
      <td>High load, moderate context</td>
    </tr>
    <tr>
      <td>conc32_100k</td>
      <td>32</td>
      <td>100,000</td>
      <td>Stress: high load + large context</td>
    </tr>
  </tbody>
</table>

<p>Each scenario was run with 3–5 batches of requests, with results aggregated.</p>

<hr />

<h2 id="3-results">3. Results</h2>

<h3 id="31-the-cpu-gpu-split-its-all-about-concurrency">3.1 The CPU-GPU Split: It’s All About Concurrency</h3>

<p><strong>HBM Prefix Cache Configuration:</strong></p>

<table>
  <thead>
    <tr>
      <th>Scenario</th>
      <th>Conc</th>
      <th>Ctx</th>
      <th>HTTP OH (ms)</th>
      <th>Prefill (ms)</th>
      <th>Decode (ms)</th>
      <th>Total (ms)</th>
      <th><strong>CPU%</strong></th>
      <th><strong>GPU%</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>single_1k</td>
      <td>1</td>
      <td>1K</td>
      <td>7</td>
      <td>41</td>
      <td>1,780</td>
      <td>1,828</td>
      <td><strong>0.4%</strong></td>
      <td>99.6%</td>
    </tr>
    <tr>
      <td>single_8k</td>
      <td>1</td>
      <td>8K</td>
      <td>15</td>
      <td>124</td>
      <td>3,142</td>
      <td>3,282</td>
      <td><strong>0.5%</strong></td>
      <td>99.5%</td>
    </tr>
    <tr>
      <td>single_32k</td>
      <td>1</td>
      <td>32K</td>
      <td>47</td>
      <td>682</td>
      <td>7,736</td>
      <td>8,465</td>
      <td><strong>0.6%</strong></td>
      <td>99.4%</td>
    </tr>
    <tr>
      <td>single_100k</td>
      <td>1</td>
      <td>100K</td>
      <td>131</td>
      <td>3,555</td>
      <td>20,792</td>
      <td>24,479</td>
      <td><strong>0.6%</strong></td>
      <td>99.4%</td>
    </tr>
    <tr>
      <td>conc4_8k</td>
      <td>4</td>
      <td>8K</td>
      <td>53</td>
      <td>137</td>
      <td>3,101</td>
      <td>3,291</td>
      <td><strong>1.6%</strong></td>
      <td>98.4%</td>
    </tr>
    <tr>
      <td>conc16_32k</td>
      <td>16</td>
      <td>32K</td>
      <td>555</td>
      <td>498</td>
      <td>7,832</td>
      <td>8,885</td>
      <td><strong>6.2%</strong></td>
      <td>93.8%</td>
    </tr>
    <tr>
      <td>conc32_32k</td>
      <td>32</td>
      <td>32K</td>
      <td>1,130</td>
      <td>636</td>
      <td>7,873</td>
      <td>9,639</td>
      <td><strong>11.6%</strong></td>
      <td>88.4%</td>
    </tr>
    <tr>
      <td>conc32_100k</td>
      <td>32</td>
      <td>100K</td>
      <td>3,885</td>
      <td>2,479</td>
      <td>19,591</td>
      <td>25,957</td>
      <td><strong>14.9%</strong></td>
      <td>85.1%</td>
    </tr>
  </tbody>
</table>

<p>The pattern is clear: <strong>CPU overhead scales with concurrency, not context length.</strong></p>

<ul>
  <li>Single-request: CPU% is flat at ~0.5% regardless of whether context is 1k or 100k</li>
  <li>At concurrency 32: CPU% jumps to 11–15%</li>
  <li>The dominant CPU cost is <code class="language-plaintext highlighter-rouge">t_http_overhead</code> (scheduling + queue wait), not tokenization</li>
</ul>

<h3 id="32-lmcache-vs-hbm-prefix-cache-cpu-overhead-comparison">3.2 LMCache vs HBM Prefix Cache: CPU Overhead Comparison</h3>

<p><strong>LMCache DRAM Configuration (gpu-mem-util=0.78):</strong></p>

<table>
  <thead>
    <tr>
      <th>Scenario</th>
      <th>Conc</th>
      <th>Ctx</th>
      <th>HTTP OH (ms)</th>
      <th>Prefill (ms)</th>
      <th>Decode (ms)</th>
      <th>Total (ms)</th>
      <th><strong>CPU%</strong></th>
      <th><strong>GPU%</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>single_1k</td>
      <td>1</td>
      <td>1K</td>
      <td>7</td>
      <td>44</td>
      <td>2,653</td>
      <td>2,704</td>
      <td><strong>0.3%</strong></td>
      <td>99.7%</td>
    </tr>
    <tr>
      <td>single_8k</td>
      <td>1</td>
      <td>8K</td>
      <td>15</td>
      <td>178</td>
      <td>3,376</td>
      <td>3,569</td>
      <td><strong>0.4%</strong></td>
      <td>99.6%</td>
    </tr>
    <tr>
      <td>conc4_8k</td>
      <td>4</td>
      <td>8K</td>
      <td>50</td>
      <td>121</td>
      <td>3,455</td>
      <td>3,627</td>
      <td><strong>1.4%</strong></td>
      <td>98.6%</td>
    </tr>
    <tr>
      <td>conc16_32k</td>
      <td>16</td>
      <td>32K</td>
      <td>515</td>
      <td>1,655</td>
      <td>8,063</td>
      <td>10,233</td>
      <td><strong>5.1%</strong></td>
      <td>94.9%</td>
    </tr>
    <tr>
      <td>conc32_32k</td>
      <td>32</td>
      <td>32K</td>
      <td>1,135</td>
      <td>722</td>
      <td>8,386</td>
      <td>10,243</td>
      <td><strong>11.0%</strong></td>
      <td>89.0%</td>
    </tr>
    <tr>
      <td>conc32_100k</td>
      <td>32</td>
      <td>100K</td>
      <td>3,937</td>
      <td>28,537</td>
      <td>20,769</td>
      <td>53,244</td>
      <td><strong>9.8%</strong></td>
      <td>90.2%</td>
    </tr>
  </tbody>
</table>

<p><strong>Key comparison — CPU overhead is nearly identical:</strong></p>

<table>
  <thead>
    <tr>
      <th>Scenario</th>
      <th>HBM-PC CPU%</th>
      <th>LMCache CPU%</th>
      <th>Delta</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>single_1k</td>
      <td>0.4%</td>
      <td>0.3%</td>
      <td>−0.1%</td>
    </tr>
    <tr>
      <td>conc4_8k</td>
      <td>1.6%</td>
      <td>1.4%</td>
      <td>−0.2%</td>
    </tr>
    <tr>
      <td>conc16_32k</td>
      <td>6.2%</td>
      <td>5.1%</td>
      <td>−1.1%</td>
    </tr>
    <tr>
      <td>conc32_32k</td>
      <td>11.6%</td>
      <td>11.0%</td>
      <td>−0.6%</td>
    </tr>
    <tr>
      <td>conc32_100k</td>
      <td>14.9%</td>
      <td>9.8%</td>
      <td>−5.1%</td>
    </tr>
  </tbody>
</table>

<p><strong>LMCache does NOT add measurable CPU overhead.</strong> In fact, CPU% is slightly <em>lower</em> with LMCache at high concurrency because LMCache’s CPU DRAM cache reduces HBM pressure, meaning less time in KV block eviction decisions on the CPU side.</p>

<p>The <code class="language-plaintext highlighter-rouge">t_http_overhead</code> is nearly identical between the two configs (~1,130–1,135ms at conc32_32k), confirming that the LMCache connector’s CPU-side work (hash computation, cache lookup, DMA scheduling) is negligible.</p>

<h3 id="33-where-does-cpu-time-actually-go">3.3 Where Does CPU Time Actually Go?</h3>

<p>We ran standalone micro-benchmarks to isolate each CPU component:</p>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Time at 100K tokens</th>
      <th>% of HTTP Overhead (conc=32)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Tokenization (encode)</td>
      <td>220 ms</td>
      <td>~5.7%</td>
    </tr>
    <tr>
      <td>JSON serialization (request build)</td>
      <td>0.82 ms</td>
      <td>&lt;0.1%</td>
    </tr>
    <tr>
      <td>SHA256 hash (cache key)</td>
      <td>0.62 ms</td>
      <td>&lt;0.1%</td>
    </tr>
    <tr>
      <td>SSE chunk parse (per token)</td>
      <td>1.9 µs</td>
      <td>&lt;0.1%</td>
    </tr>
    <tr>
      <td>Detokenization (128 tokens)</td>
      <td>0.27 ms</td>
      <td>&lt;0.1%</td>
    </tr>
    <tr>
      <td><strong>Scheduling + queue wait</strong></td>
      <td><strong>~3,660 ms</strong></td>
      <td><strong>~94%</strong></td>
    </tr>
  </tbody>
</table>

<p>The smoking gun: <strong>scheduling + queue wait accounts for ~94% of CPU overhead</strong> at high concurrency. Tokenization, hashing, and serialization are negligible.</p>

<p>This makes sense: at 32 concurrent requests, the vLLM scheduler must:</p>
<ol>
  <li>Decide which requests to batch together</li>
  <li>Walk the prefix cache tree to find matching blocks</li>
  <li>Allocate KV blocks for new tokens</li>
  <li>Manage the preemption queue when HBM is under pressure</li>
  <li>Coordinate across TP workers</li>
</ol>

<p>Each of these is O(n) or worse in the number of concurrent requests, and they all happen on a single Python thread (GIL-bound).</p>
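
<p>To make the prefix-cache walk concrete, here is a toy sketch of the per-request lookup performed each scheduling step. The data structure and names are illustrative and are not vLLM’s actual block-table API.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def longest_cached_prefix(block_hashes, cached_blocks):
    """Toy version of the prefix-cache lookup: walk the request's block
    hashes in order and stop at the first miss. `cached_blocks` stands in
    for vLLM's block table; this is not its real API."""
    hits = 0
    for h in block_hashes:
        if h not in cached_blocks:
            break
        hits += 1
    return hits  # leading blocks whose KV can be reused

# At 100k-token contexts each request contributes thousands of block hashes,
# and at concurrency 32 this O(n) walk (plus block allocation and preemption
# bookkeeping) runs on one GIL-bound Python thread every scheduling step.
</code></pre></div></div>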

<h3 id="34-tokenization-deep-dive-linear-but-fast">3.4 Tokenization Deep-Dive: Linear but Fast</h3>

<table>
  <thead>
    <tr>
      <th>Tokens</th>
      <th>Encode (ms)</th>
      <th>Throughput (tok/s)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>679</td>
      <td>1.18</td>
      <td>576,506</td>
    </tr>
    <tr>
      <td>2,711</td>
      <td>5.09</td>
      <td>532,379</td>
    </tr>
    <tr>
      <td>5,423</td>
      <td>10.35</td>
      <td>523,861</td>
    </tr>
    <tr>
      <td>10,840</td>
      <td>20.46</td>
      <td>529,718</td>
    </tr>
    <tr>
      <td>21,679</td>
      <td>42.72</td>
      <td>507,414</td>
    </tr>
    <tr>
      <td>43,359</td>
      <td>87.85</td>
      <td>493,582</td>
    </tr>
    <tr>
      <td>67,745</td>
      <td>134.90</td>
      <td>502,188</td>
    </tr>
    <tr>
      <td>101,615</td>
      <td>220.38</td>
      <td>461,085</td>
    </tr>
  </tbody>
</table>

<p>Tokenization scales linearly with input length at ~500k tok/s. Even at 100k tokens (the largest agentic context we tested), tokenization takes only <strong>220ms</strong> — under 1% of E2E time for any scenario.</p>
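
<p>The encode numbers in the table can be reproduced with a few lines against the HuggingFace tokenizer; a sketch, assuming <code class="language-plaintext highlighter-rouge">transformers</code> is installed and the model directory from the appendix is available locally.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/work/models/MiniMax-M2.5")

def bench_encode(text, repeats=5):
    tok.encode(text)                                  # warm-up call
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        ids = tok.encode(text)
        best = min(best, time.perf_counter() - t0)
    return len(ids), best * 1e3                       # (tokens, ms)

prompt = "def handler(event):\n    return event\n" * 64
for _ in range(8):                                    # roughly double each step
    n_tok, ms = bench_encode(prompt)
    print(f"{n_tok:8d} tok  {ms:8.2f} ms  {n_tok / (ms / 1e3):10,.0f} tok/s")
    prompt *= 2
</code></pre></div></div>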

<p>The HuggingFace <code class="language-plaintext highlighter-rouge">tokenizers</code> library (Rust-based BPE) is already highly optimized. Switching to a C++ tokenizer would save ~50–100ms at 100k tokens — not enough to matter.</p>

<p><strong>Detokenization</strong> (streaming output) is even faster: 0.27ms for 128 output tokens. Per-token streaming overhead is not a concern.</p>

<hr />

<h2 id="4-analysis-the-scheduling-wall">4. Analysis: The Scheduling Wall</h2>

<h3 id="41-why-scheduling-dominates-at-high-concurrency">4.1 Why Scheduling Dominates at High Concurrency</h3>

<p>The <code class="language-plaintext highlighter-rouge">t_http_overhead</code> captures everything from HTTP request receipt to first GPU kernel launch. At concurrency 1, it’s dominated by tokenization (~220ms for 100k). At concurrency 32, it balloons to <strong>3,885ms</strong> — a 30× increase.</p>

<p>The growth is <strong>superlinear</strong> with concurrency:</p>

<table>
  <thead>
    <tr>
      <th>Concurrency</th>
      <th>HTTP Overhead (32K ctx)</th>
      <th>Growth Factor</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>47 ms</td>
      <td>1.0×</td>
    </tr>
    <tr>
      <td>4</td>
      <td>53 ms</td>
      <td>1.1×</td>
    </tr>
    <tr>
      <td>16</td>
      <td>555 ms</td>
      <td>11.8×</td>
    </tr>
    <tr>
      <td>32</td>
      <td>1,130 ms</td>
      <td>24.0×</td>
    </tr>
  </tbody>
</table>

<p>This superlinear scaling points to <strong>contention</strong> in the scheduling path:</p>

<ol>
  <li>
    <p><strong>Python GIL:</strong> vLLM’s scheduler runs in the main asyncio event loop. At 32 concurrent requests, the GIL serializes scheduling decisions, tokenization, and HTTP handling.</p>
  </li>
  <li>
    <p><strong>Prefix cache tree walks:</strong> With prefix caching enabled, every scheduling decision walks the block hash tree. At high concurrency with diverse prompts, the tree grows and walks become expensive.</p>
  </li>
  <li>
    <p><strong>Block allocation contention:</strong> The KV block allocator must coordinate free/used block tables across TP workers.</p>
  </li>
  <li>
    <p><strong>Queue wait:</strong> When the GPU is saturated, requests queue in the scheduler waiting for slots.</p>
  </li>
</ol>

<h3 id="42-the-15-rule">4.2 The 15% Rule</h3>

<p>Our data suggests a practical rule of thumb:</p>

<blockquote>
  <p><strong>At production-level concurrency (16–32 users), CPU overhead consumes 10–15% of E2E latency on MI300X.</strong></p>
</blockquote>

<p>This means that even with an infinitely fast GPU, end-to-end latency would shrink by only 85–90%. The remaining 10–15% is CPU-bound and does not go away.</p>

<p>For a concrete example: at conc32_100k with HBM prefix cache, total E2E is 25,957ms. GPU time is 22,070ms (prefill + decode). Even if GPU time went to zero, the CPU overhead of 3,887ms would remain — setting a hard floor on latency.</p>
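
<p>A quick way to see this floor is to apply Amdahl’s law to the split above; a short calculation using the conc32_100k row from the HBM table:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># conc32_100k, HBM prefix cache (Section 3.1)
e2e_ms = 25_957
gpu_ms = 2_479 + 19_591            # prefill + decode = 22,070 ms
cpu_ms = e2e_ms - gpu_ms           # ~3,887 ms of CPU-side overhead

for gpu_speedup in (1, 2, 4, 10):
    new_e2e = cpu_ms + gpu_ms / gpu_speedup
    print(f"GPU {gpu_speedup:2d}x faster: E2E {new_e2e:8.0f} ms "
          f"({e2e_ms / new_e2e:4.2f}x overall speedup)")
# Even with an infinitely fast GPU, E2E never drops below cpu_ms (~3.9 s).
</code></pre></div></div>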

<hr />

<h2 id="5-optimization-recommendations">5. Optimization Recommendations</h2>

<h3 id="tier-1-high-impact-framework-level">Tier 1: High Impact, Framework-Level</h3>

<table>
  <thead>
    <tr>
      <th>Optimization</th>
      <th>Expected Impact</th>
      <th>Effort</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Pipeline scheduling with GPU execution</strong></td>
      <td>5–10% E2E at high concurrency</td>
      <td>Medium</td>
    </tr>
    <tr>
      <td><strong>Move tokenization off main event loop</strong></td>
      <td>2–3% at high concurrency</td>
      <td>Low</td>
    </tr>
    <tr>
      <td><strong>Batch scheduling decisions</strong></td>
      <td>3–5% at high concurrency</td>
      <td>Medium</td>
    </tr>
    <tr>
      <td><strong>Pre-allocate KV blocks speculatively</strong></td>
      <td>2–3% at high concurrency</td>
      <td>Medium</td>
    </tr>
  </tbody>
</table>

<h3 id="tier-2-system-level-tuning">Tier 2: System-Level Tuning</h3>

<table>
  <thead>
    <tr>
      <th>Optimization</th>
      <th>Expected Impact</th>
      <th>Effort</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>NUMA affinity</strong> (pin workers to GPU-local node)</td>
      <td>1–2%</td>
      <td>Low</td>
    </tr>
    <tr>
      <td><strong>CPU frequency governor</strong> (<code class="language-plaintext highlighter-rouge">performance</code> mode)</td>
      <td>0.5–1%</td>
      <td>Trivial</td>
    </tr>
    <tr>
      <td><strong>Dedicated CPU cores for scheduler</strong> (isolcpus)</td>
      <td>1–2%</td>
      <td>Low</td>
    </tr>
  </tbody>
</table>

<h3 id="tier-3-not-worth-optimizing">Tier 3: Not Worth Optimizing</h3>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Why Not</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Tokenizer speed</td>
      <td>Already 500k tok/s, &lt;1% of E2E</td>
    </tr>
    <tr>
      <td>JSON serialization</td>
      <td>&lt;1ms even at 100k tokens</td>
    </tr>
    <tr>
      <td>SSE parsing</td>
      <td>1.9µs per chunk — effectively zero</td>
    </tr>
    <tr>
      <td>LMCache hash/lookup</td>
      <td>&lt;1ms even at 100k tokens</td>
    </tr>
    <tr>
      <td>Detokenization</td>
      <td>0.27ms for 128 output tokens</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="6-key-takeaways">6. Key Takeaways</h2>

<h3 id="for-inference-platform-teams">For inference platform teams:</h3>

<ol>
  <li>
    <p><strong>CPU overhead is real but bounded.</strong> At 32 concurrent users, 10–15% of E2E latency is CPU. This sets a floor on achievable latency regardless of GPU speed.</p>
  </li>
  <li>
    <p><strong>Scheduling is the bottleneck, not tokenization.</strong> Don’t waste time optimizing the tokenizer — optimize the scheduler and its interaction with the KV cache manager.</p>
  </li>
  <li>
    <p><strong>LMCache adds zero measurable CPU overhead.</strong> The cache connector’s hash/lookup/DMA scheduling cost is lost in the noise. If you’re avoiding LMCache because of CPU concerns, don’t.</p>
  </li>
  <li>
    <p><strong>The GIL is the elephant in the room.</strong> At 32+ concurrent requests, the Python GIL serializes scheduling, tokenization, and HTTP handling. Multi-process architectures (like vLLM V1’s separated EngineCore) are the right direction.</p>
  </li>
</ol>

<h3 id="for-hardware-architects">For hardware architects:</h3>

<ol>
  <li>
    <p><strong>CPU performance matters for inference at scale.</strong> A faster CPU won’t help a single request, but it directly impacts latency at 16+ concurrent users.</p>
  </li>
  <li>
    <p><strong>PCIe/Infinity Fabric bandwidth is not the CPU bottleneck.</strong> The CPU overhead is all compute (scheduling, hash computation, Python interpretation), not data transfer.</p>
  </li>
  <li>
    <p><strong>NUMA topology matters.</strong> Ensuring scheduler threads run on CPU cores local to the GPU’s NUMA node reduces memory access latency for KV block table management.</p>
  </li>
</ol>

<h3 id="for-the-agentic-ai-community">For the agentic AI community:</h3>

<ol>
  <li>
    <p><strong>The CPU-GPU co-design question is a scheduling problem</strong>, not a compute problem. The path forward is better overlap between CPU scheduling and GPU execution.</p>
  </li>
  <li>
    <p><strong>Context length matters less than concurrency.</strong> A single 100k-token request has 0.6% CPU overhead. Thirty-two concurrent 32k-token requests have 11%+ CPU overhead. If you’re scaling to many concurrent agent sessions, CPU efficiency of the scheduler is critical.</p>
  </li>
</ol>

<hr />

<h2 id="7-open-questions--future-directions">7. Open Questions &amp; Future Directions</h2>

<h3 id="71-can-we-eliminate-the-cpu-bottleneck-rust-no-gil-and-beyond">7.1 Can We Eliminate the CPU Bottleneck? Rust, No-GIL, and Beyond</h3>

<p>Our data shows that <strong>94% of CPU overhead is scheduling + queue wait</strong>, not tokenization or serialization. This has direct implications for optimization strategies:</p>

<p><strong>Rewriting the scheduler in Rust or C++:</strong></p>

<p>The vLLM scheduler today is pure Python — prefix tree walks, block allocation, preemption logic, all running under the GIL. Rewriting the hot path in Rust (via PyO3) or C++ (via pybind11) could yield significant gains:</p>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Current (Python)</th>
      <th>Estimated (Rust)</th>
      <th>Speedup</th>
      <th>Impact on E2E</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Prefix tree walk</td>
      <td>O(n) per request, GIL-held</td>
      <td>O(n) but no GIL, SIMD-friendly</td>
      <td>5–10×</td>
      <td>2–4% at conc=32</td>
    </tr>
    <tr>
      <td>Block allocation</td>
      <td>Dict lookups + list ops</td>
      <td>Lock-free concurrent allocator</td>
      <td>10–20×</td>
      <td>1–2% at conc=32</td>
    </tr>
    <tr>
      <td>Hash computation</td>
      <td>Python <code class="language-plaintext highlighter-rouge">hash()</code></td>
      <td>Rust <code class="language-plaintext highlighter-rouge">xxhash</code> / <code class="language-plaintext highlighter-rouge">blake3</code></td>
      <td>3–5×</td>
      <td>&lt;0.5% (already fast)</td>
    </tr>
    <tr>
      <td>Request batching</td>
      <td>Python list sorting</td>
      <td>Rust <code class="language-plaintext highlighter-rouge">rayon</code> parallel sort</td>
      <td>5–10×</td>
      <td>1–2% at conc=32</td>
    </tr>
  </tbody>
</table>

<p>Total estimated E2E improvement: <strong>4–8% at conc=32</strong> from a Rust scheduler rewrite. This is meaningful but not transformative — the real win is eliminating GIL contention, not raw speed.</p>

<p><strong>Removing the Python GIL:</strong></p>

<p>Python 3.13+ introduced experimental free-threaded mode (<code class="language-plaintext highlighter-rouge">--disable-gil</code>). For vLLM, this could be transformative:</p>

<ul>
  <li>Currently: tokenization, scheduling, HTTP handling, and detokenization all serialize through the GIL</li>
  <li>Without GIL: these can truly parallelize across CPU cores</li>
  <li>The <code class="language-plaintext highlighter-rouge">t_http_overhead</code> at conc=32 (1,130ms for 32K context) includes substantial GIL contention — multiple requests competing for the same Python thread</li>
  <li><strong>Estimated impact: 20–40% reduction in <code class="language-plaintext highlighter-rouge">t_http_overhead</code> at high concurrency</strong>, translating to 3–6% E2E improvement</li>
</ul>

<p>However, GIL removal has risks:</p>
<ul>
  <li>vLLM’s internal data structures (block tables, prefix cache tree) would need thread-safe redesign</li>
  <li>Many Python C extensions assume GIL protection</li>
  <li>The <code class="language-plaintext highlighter-rouge">torch</code> runtime itself has GIL interactions during tensor operations</li>
</ul>

<p><strong>The pragmatic path — vLLM V1’s multi-process architecture:</strong></p>

<p>vLLM V1 already separates the EngineCore (scheduler) from the APIServer (HTTP handling) into different processes. This is effectively a GIL bypass:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>APIServer (Process 1)     EngineCore (Process 2)     Workers (Process 3+)
├── HTTP parsing          ├── Scheduling             ├── GPU prefill
├── Tokenization          ├── Block allocation       ├── GPU decode
├── Request routing       ├── Cache management       ├── KV transfers
└── SSE streaming         └── Preemption logic       └── Sampling
         │                         │                        │
         └── IPC (shared mem) ─────┘                        │
                                   └── IPC (shared mem) ────┘
</code></pre></div></div>

<p>This architecture already eliminates most GIL contention. Our measurements show that vLLM 0.19.0 (which uses V1) achieves reasonable scaling — the 15% CPU overhead at conc=32 is <em>after</em> the multi-process split. Without it, we’d likely see 25–30%.</p>

<p><strong>Recommendation:</strong> The highest-ROI optimization is <strong>pipelining scheduling with GPU execution</strong> — start scheduling the next batch while the current batch is still executing on GPU. This doesn’t require any language change, just better overlap in the EngineCore.</p>
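
<p>A minimal sketch of that overlap using asyncio; <code class="language-plaintext highlighter-rouge">schedule_next_batch</code> and <code class="language-plaintext highlighter-rouge">run_batch_on_gpu</code> are placeholders for the scheduling step and the GPU step, not actual vLLM APIs.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import asyncio

async def pipelined_engine_loop(schedule_next_batch, run_batch_on_gpu):
    """While batch N executes on the GPU, the CPU already assembles batch
    N+1 (prefix-tree walk, block allocation). Both callables are placeholders."""
    batch = await schedule_next_batch()
    while batch is not None:
        gpu_step = asyncio.create_task(run_batch_on_gpu(batch))
        # CPU-side scheduling of the next batch overlaps the GPU step
        batch, _ = await asyncio.gather(schedule_next_batch(), gpu_step)
</code></pre></div></div>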

<h3 id="72-sub-agent-explosion-what-happens-at-12-concurrency">7.2 Sub-Agent Explosion: What Happens at 12× Concurrency?</h3>

<p>Modern agentic frameworks (Claude Code, OpenHands, SWE-Agent) routinely spawn sub-agents. A single user session might fork into 4–12 parallel sub-agents for tasks like:</p>
<ul>
  <li>Searching multiple codebases simultaneously</li>
  <li>Running parallel tool calls (web search + file read + code execution)</li>
  <li>Exploring multiple solution paths (tree-of-thought)</li>
</ul>

<p><strong>The math gets scary fast:</strong></p>

<p>If 4 users each spawn 3 sub-agents, you have 4 × (1 + 3) = 16 effective concurrent sessions. If each spawns 12 sub-agents: 4 × (1 + 12) = <strong>52 concurrent sessions.</strong></p>

<p>Extrapolating from our data:</p>

<table>
  <thead>
    <tr>
      <th>Users</th>
      <th>Sub-agents/user</th>
      <th>Effective Conc</th>
      <th>Est. CPU%</th>
      <th>Est. HTTP OH (32K)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>4</td>
      <td>0</td>
      <td>4</td>
      <td>1.6%</td>
      <td>53 ms</td>
    </tr>
    <tr>
      <td>4</td>
      <td>3</td>
      <td>16</td>
      <td>6.2%</td>
      <td>555 ms</td>
    </tr>
    <tr>
      <td>4</td>
      <td>12</td>
      <td>52</td>
      <td><strong>20–25%</strong></td>
      <td><strong>~3,000 ms</strong></td>
    </tr>
    <tr>
      <td>8</td>
      <td>12</td>
      <td>104</td>
      <td><strong>30–40%</strong></td>
      <td><strong>~8,000+ ms</strong></td>
    </tr>
  </tbody>
</table>

<p>At 52 effective concurrent requests, our superlinear scaling model predicts:</p>
<ul>
  <li>HTTP overhead would reach ~3,000ms (vs 1,130ms at conc=32) — that’s 3 seconds of pure CPU wait before a single GPU kernel fires</li>
  <li>CPU% of E2E could hit 20–25%, meaning <strong>one quarter of your GPU investment is wasted on CPU scheduling</strong></li>
  <li>The prefix cache tree would become deep and wide (52 diverse conversation prefixes), making tree walks even more expensive</li>
</ul>
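
<p>One way to reproduce the ballpark figures above is to extrapolate the measured conc=32 point with a quadratic-in-concurrency model; the exponent is an assumption on our part, not a constant fitted to the full dataset.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Measured: 1,130 ms HTTP overhead at concurrency 32 (32K context)
base_conc, base_overhead_ms = 32, 1_130

def extrapolate(conc, exponent=2.0):
    # exponent=2.0 is an assumed superlinear scaling law, not a measured fit
    return base_overhead_ms * (conc / base_conc) ** exponent

for conc in (32, 52, 104):
    print(f"conc={conc:3d}  est. HTTP overhead {extrapolate(conc):8,.0f} ms")
# conc=52 comes out around 3,000 ms; conc=104 lands well above 8,000 ms
</code></pre></div></div>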

<p><strong>Sub-agent-specific challenges:</strong></p>

<ol>
  <li>
    <p><strong>Prefix divergence:</strong> Sub-agents share a common parent prefix but diverge quickly (different tool calls, different search results). This creates a bushy prefix tree that’s expensive to walk but has high reuse potential — exactly the regime where LMCache’s L2 tier pays off.</p>
  </li>
  <li>
    <p><strong>Bursty arrival patterns:</strong> Sub-agents don’t arrive at a steady rate — they burst (parent spawns 12 children simultaneously). The scheduler must absorb this burst, and queue wait time spikes.</p>
  </li>
  <li>
    <p><strong>Priority inversion:</strong> The parent agent is blocked waiting for sub-agent results. If sub-agents are queued behind other users’ requests, the parent’s end-to-end latency multiplies.</p>
  </li>
</ol>

<p><strong>Co-design implications:</strong></p>

<ul>
  <li><strong>Request routing becomes critical:</strong> With 52+ concurrent sessions, a single vLLM instance may not be enough. Disaggregated serving (separate prefill and decode nodes) or multi-instance routing could reduce per-instance scheduling pressure.</li>
  <li><strong>Sub-agent-aware scheduling:</strong> A scheduler that understands parent-child relationships could prioritize sub-agents of the same parent to complete a “generation” faster, rather than round-robin across all requests.</li>
  <li><strong>Shared prefix optimization:</strong> Sub-agents from the same parent share ~80–90% of their prefix. A scheduler that detects this and batches sibling sub-agents together for prefill could dramatically reduce redundant computation.</li>
</ul>

<h3 id="73-hybrid-workloads-database-queries-rag-and-tool-execution">7.3 Hybrid Workloads: Database Queries, RAG, and Tool Execution</h3>

<p>Real agentic workloads don’t just call the LLM — they interleave LLM inference with CPU/IO-bound operations:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Turn 1: LLM generates SQL query            (GPU: 2-5s)
 Turn 2: Execute SQL against database       (CPU/IO: 50-500ms)
 Turn 3: LLM analyzes results               (GPU: 3-8s)
 Turn 4: Retrieve documents from vector DB  (CPU/IO: 20-200ms)
 Turn 5: LLM synthesizes final answer       (GPU: 5-15s)
</code></pre></div></div>

<p><strong>The inter-turn gap is a new CPU cost we didn’t measure:</strong></p>

<p>Our benchmark focused on the <em>intra-request</em> CPU-GPU split (what happens inside a single LLM call). But agentic workloads have a second CPU cost: the <strong>inter-turn gap</strong> — the time between the LLM finishing one turn and the next turn’s prompt being ready.</p>

<p>This gap includes:</p>

<table>
  <thead>
    <tr>
      <th>Operation</th>
      <th>Typical Latency</th>
      <th>Where</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Tool call parsing</td>
      <td>0.1–1 ms</td>
      <td>Client CPU</td>
    </tr>
    <tr>
      <td>Database query (PostgreSQL)</td>
      <td>5–500 ms</td>
      <td>External service</td>
    </tr>
    <tr>
      <td>Vector DB retrieval (FAISS/pgvector)</td>
      <td>10–200 ms</td>
      <td>CPU + sometimes GPU</td>
    </tr>
    <tr>
      <td>Web API call (search, code execution)</td>
      <td>100–2,000 ms</td>
      <td>Network + external</td>
    </tr>
    <tr>
      <td>Result formatting + context assembly</td>
      <td>1–10 ms</td>
      <td>Client CPU</td>
    </tr>
    <tr>
      <td>Re-tokenization of updated context</td>
      <td>50–220 ms</td>
      <td>Server CPU</td>
    </tr>
  </tbody>
</table>
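
<p>Surfacing this gap takes only a couple of timestamps in the agent loop; a sketch, where <code class="language-plaintext highlighter-rouge">llm_call</code> and the tool registry are placeholders rather than any particular framework’s API.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time

def run_agent(llm_call, tools, history):
    """Illustrative instrumentation of the inter-turn gap: the wall-clock
    time between one LLM turn finishing and the next prompt being ready."""
    while True:
        turn_start = time.perf_counter()
        reply = llm_call(history)                      # GPU-bound LLM turn
        turn_end = time.perf_counter()
        call = reply.get("tool_call")
        if call is None:
            return reply
        result = tools[call["name"]](**call["args"])   # CPU/IO-bound tool work
        history.append({"role": "tool", "content": str(result)})
        gap = time.perf_counter() - turn_end           # inter-turn gap
        print(f"LLM turn {turn_end - turn_start:6.2f} s, "
              f"inter-turn gap {gap:6.2f} s (GPU idle for this session)")
</code></pre></div></div>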

<p><strong>Performance implications:</strong></p>

<ol>
  <li>
    <p><strong>GPU idle time:</strong> During tool execution, the GPU allocated to this user’s session sits idle. At 100k context, the KV cache for one session holds ~12 GB of HBM. If tool execution takes 500ms, that’s 12 GB of HBM stranded for 500ms that could have served other requests.</p>
  </li>
  <li>
    <p><strong>The KV cache cold-start problem:</strong> If the scheduler evicts this session’s KV blocks during tool execution (to serve other requests), the next turn must re-prefill the entire context. This is exactly the scenario where LMCache’s CPU DRAM tier shines — it preserves KV state across tool-execution gaps at negligible cost.</p>
  </li>
  <li>
    <p><strong>CPU contention between tool execution and scheduling:</strong> If tool execution (database queries, vector search) runs on the same CPU cores as the vLLM scheduler, it competes for CPU resources. At high concurrency + frequent tool calls, this could push CPU overhead well beyond the 15% we measured for pure LLM inference.</p>
  </li>
</ol>

<p><strong>Estimated E2E impact of hybrid workloads:</strong></p>

<table>
  <thead>
    <tr>
      <th>Workload Type</th>
      <th>LLM Time</th>
      <th>Tool Time</th>
      <th>Inter-turn OH</th>
      <th>GPU Idle %</th>
      <th>Effective CPU%</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Pure chat</td>
      <td>100%</td>
      <td>0%</td>
      <td>~0%</td>
      <td>~0%</td>
      <td>10–15%</td>
    </tr>
    <tr>
      <td>Light tools (search)</td>
      <td>70%</td>
      <td>20%</td>
      <td>10%</td>
      <td>15–20%</td>
      <td>20–25%</td>
    </tr>
    <tr>
      <td>Heavy tools (DB + RAG)</td>
      <td>50%</td>
      <td>35%</td>
      <td>15%</td>
      <td>25–35%</td>
      <td>25–35%</td>
    </tr>
    <tr>
      <td>Code execution agents</td>
      <td>40%</td>
      <td>45%</td>
      <td>15%</td>
      <td>35–45%</td>
      <td>30–40%</td>
    </tr>
  </tbody>
</table>

<p>For code execution agents (the Claude Code use case our traces come from), <strong>CPU and IO operations may consume 40–50% of wall-clock time</strong>, with GPU active only 50–60% of the time. This fundamentally changes the co-design equation:</p>

<ul>
  <li><strong>For pure LLM serving:</strong> Buy the best GPU, CPU barely matters</li>
  <li><strong>For agentic serving:</strong> CPU, memory bandwidth, and IO become co-equal with GPU. System balance matters more than peak GPU FLOPS.</li>
</ul>

<p><strong>Optimization strategies for hybrid workloads:</strong></p>

<ol>
  <li>
    <p><strong>Speculative prefetching:</strong> While the LLM generates a tool call, pre-warm likely next-turn prefixes based on the tool type. For example, if the model calls <code class="language-plaintext highlighter-rouge">search()</code>, pre-tokenize a template like <code class="language-plaintext highlighter-rouge">"Search results: {placeholder}"</code> to have partial KV cache ready.</p>
  </li>
  <li>
    <p><strong>KV cache reservation:</strong> Reserve a “parking” slot in CPU DRAM for active sessions during tool execution, preventing eviction. LMCache already enables this — the question is whether to make it tool-call-aware.</p>
  </li>
  <li>
    <p><strong>Separate CPU pools:</strong> Dedicate specific CPU cores to vLLM scheduling and others to tool execution. NUMA-aware pinning becomes critical: vLLM scheduler threads on cores near the GPU, tool execution threads on cores near the NIC (for database queries) or NVMe (for document retrieval).</p>
  </li>
  <li>
    <p><strong>Async tool execution with GPU overlap:</strong> Execute tool calls concurrently with other users’ LLM inference, then “re-inject” the results when ready. This requires the scheduler to support interruptible sessions — start other requests during the tool gap, then preempt them when the tool-calling session is ready to continue.</p>
  </li>
</ol>

<hr />

<h2 id="appendix-reproduction">Appendix: Reproduction</h2>

<h3 id="environment">Environment</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Container</span>
docker run <span class="nt">-d</span> <span class="nt">--name</span> lmcache-bench <span class="nt">--entrypoint</span> /bin/bash <span class="se">\</span>
  <span class="nt">--device</span><span class="o">=</span>/dev/kfd <span class="nt">--device</span><span class="o">=</span>/dev/dri <span class="nt">--network</span><span class="o">=</span>host <span class="nt">--ipc</span><span class="o">=</span>host <span class="se">\</span>
  <span class="nt">--group-add</span> video <span class="nt">--cap-add</span> SYS_PTRACE <span class="se">\</span>
  <span class="nt">-v</span> /mnt/nvme3n1p1/models:/work/models <span class="se">\</span>
  vllm/vllm-openai-rocm:v0.19.0 <span class="nt">-c</span> <span class="s2">"sleep infinity"</span>

<span class="c"># LMCache (source build for ROCm)</span>
docker <span class="nb">exec </span>lmcache-bench bash <span class="nt">-c</span> <span class="s2">"
  git clone --depth 1 https://github.com/LMCache/LMCache.git /work/LMCache
  cd /work/LMCache &amp;&amp; BUILD_WITH_HIP=1 pip install -e . --no-build-isolation
  pip uninstall -y nixl nixl-cu12 cupy-cuda12x cufile-python cuda-pathfinder
"</span>
</code></pre></div></div>

<h3 id="server-configs">Server Configs</h3>

<p><strong>HBM Prefix Cache:</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">VLLM_FLOAT32_MATMUL_PRECISION</span><span class="o">=</span>high <span class="se">\</span>
vllm serve /work/models/MiniMax-M2.5 <span class="se">\</span>
  <span class="nt">--tensor-parallel-size</span> 2 <span class="nt">--enable-prefix-caching</span> <span class="se">\</span>
  <span class="nt">--gpu-memory-utilization</span> 0.85 <span class="nt">--host</span> 0.0.0.0 <span class="nt">--port</span> 8000
</code></pre></div></div>

<p><strong>LMCache DRAM:</strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">PYTHONHASHSEED</span><span class="o">=</span>0 <span class="nv">VLLM_FLOAT32_MATMUL_PRECISION</span><span class="o">=</span>high <span class="se">\</span>
<span class="nv">LMCACHE_LOCAL_CPU</span><span class="o">=</span><span class="nb">true </span><span class="nv">LMCACHE_CHUNK_SIZE</span><span class="o">=</span>256 <span class="se">\</span>
vllm serve /work/models/MiniMax-M2.5 <span class="se">\</span>
  <span class="nt">--tensor-parallel-size</span> 2 <span class="nt">--enable-prefix-caching</span> <span class="se">\</span>
  <span class="nt">--kv-transfer-config</span> <span class="s1">'{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'</span> <span class="se">\</span>
  <span class="nt">--gpu-memory-utilization</span> 0.78 <span class="nt">--host</span> 0.0.0.0 <span class="nt">--port</span> 8000
</code></pre></div></div>

<h3 id="benchmark-scripts">Benchmark Scripts</h3>

<p>All scripts and raw data are available at <a href="https://github.com/andyluo7/cpu-gpu-codesign-agentic-inference">github.com/andyluo7/cpu-gpu-codesign-agentic-inference</a>.</p>

<hr />

<p><em>This analysis accompanies our LMCache multi-turn agentic benchmark and uses the same hardware, model, and workload traces.</em></p>]]></content><author><name></name></author><category term="LLM" /><category term="AMD" /><category term="MI300X" /><category term="vLLM" /><category term="LMCache" /><category term="Performance" /><summary type="html"><![CDATA[Quantifying where time actually goes — and why your CPU might be stealing 15% or more of your GPU throughput.]]></summary></entry><entry><title type="html">Benchmarking LMCache for Multi-Turn Agentic Workloads on AMD MI300X</title><link href="https://andyluo7.github.io/llm/amd/mi300x/vllm/lmcache/performance/2026/04/20/benchmarking-lmcache-multi-turn-agentic-mi300x/" rel="alternate" type="text/html" title="Benchmarking LMCache for Multi-Turn Agentic Workloads on AMD MI300X" /><published>2026-04-20T00:00:00+00:00</published><updated>2026-04-20T00:00:00+00:00</updated><id>https://andyluo7.github.io/llm/amd/mi300x/vllm/lmcache/performance/2026/04/20/benchmarking-lmcache-multi-turn-agentic-mi300x</id><content type="html" xml:base="https://andyluo7.github.io/llm/amd/mi300x/vllm/lmcache/performance/2026/04/20/benchmarking-lmcache-multi-turn-agentic-mi300x/"><![CDATA[<p><em>A practitioner’s guide to KV-cache tiering on ROCm — what works, what doesn’t, and the regime where it actually matters.</em></p>

<hr />

<h2 id="key-summary">Key Summary</h2>

<p>We benchmarked multi-turn agentic workloads using 739 anonymized Claude Code conversation traces from <a href="https://github.com/callanjfox/kv-cache-tester">kv-cache-tester</a> against MiniMax-M2.5 (230 GB FP8 MoE) on 2× AMD MI300X with vLLM 0.19.0 + LMCache (built from source for ROCm). Three KV-cache strategies were compared head-to-head: no cache, vLLM’s HBM prefix cache, and LMCache CPU-DRAM offload.</p>

<p><strong>Headline findings:</strong></p>
<ul>
  <li><strong>LMCache works on AMD MI300X today</strong> — first known working stack with <code class="language-plaintext highlighter-rouge">BUILD_WITH_HIP=1</code></li>
  <li><strong>Regime matters more than the strategy.</strong> HBM prefix cache wins at low load; LMCache wins decisively under stress</li>
  <li><strong>Under stress (32 users / 100k context / agentic traces):</strong> LMCache delivers <strong>3.0× lower TTFT avg, 2.1× lower p95, 2.6× lower max, 2.3× more requests</strong> vs HBM-only</li>
  <li><strong><code class="language-plaintext highlighter-rouge">PYTHONHASHSEED=0</code> is mandatory</strong> for LMCache cache-key consistency — missing this gives 0% cache hits even on bit-identical prompts</li>
  <li><strong>Synthetic cache-rate benchmarks understate LMCache’s value</strong> by ~10-17% because they don’t pressure HBM enough; use real agentic traces for honest comparisons</li>
</ul>

<p><img src="/assets/images/lmcache-bench/regime_crossover.png" alt="Regime crossover" /></p>

<hr />

<h2 id="1-introduction">1. Introduction</h2>

<h3 id="why-agentic-workloads-are-different">Why agentic workloads are different</h3>

<p>Modern coding assistants like Claude Code, Cursor, and Devin do not behave like chatbots. A typical agentic conversation:</p>
<ul>
  <li>Ships <strong>20-150k tokens of input on every turn</strong> (file contents, tool outputs, conversation history)</li>
  <li><strong>Reuses ~93-97% of its prefix across turns</strong> — only the latest tool call or response changes</li>
  <li>Lasts <strong>hours</strong>, not seconds (median 60 minutes, P75 163 minutes)</li>
  <li>Spawns <strong>sub-agents</strong> that recursively grow the context tree</li>
  <li>Heavily depends on <strong>shared system prompt + tool definitions</strong> (~12-25k tokens) cached across all conversations</li>
</ul>

<p>If you re-prefill the entire 100k-token context every turn, you waste 95% of GPU compute. The whole serving stack — caching strategy, batching, scheduling, routing — has to be designed around prefix reuse.</p>

<h3 id="whats-a-kv-cache-briefly">What’s a KV cache, briefly</h3>

<p>LLMs decode autoregressively: each new token attends back over every previous token’s K/V tensors. Storing these K/V tensors lets you skip recomputation on the next turn. A 100k-token MiniMax-M2.5 KV cache uses about 12 GB of HBM. Multiply by N concurrent users and you quickly run out of GPU memory.</p>
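
<p>A back-of-the-envelope sketch of where that figure comes from; the layer, head, and dimension values below are illustrative placeholders chosen to land near 12 GB, not MiniMax-M2.5’s actual architecture.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each store n_layers * n_kv_heads * head_dim elements per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Placeholder dimensions (not the real model config); FP8 KV cache = 1 byte/elem
per_tok = kv_bytes_per_token(n_layers=60, n_kv_heads=8, head_dim=128, bytes_per_elem=1)
print(f"{per_tok / 1024:.0f} KB per token")                   # ~120 KB
print(f"{per_tok * 100_000 / 2**30:.1f} GB for 100k tokens")  # ~11 GB
</code></pre></div></div>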

<p><strong>The hierarchy:</strong></p>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Where</th>
      <th>Latency</th>
      <th>Capacity per node</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>L0</td>
      <td>GPU registers/L1</td>
      <td>ns</td>
      <td>KB</td>
    </tr>
    <tr>
      <td>L1</td>
      <td>GPU HBM</td>
      <td>μs</td>
      <td>hundreds of GB</td>
    </tr>
    <tr>
      <td><strong>L2</strong></td>
      <td><strong>CPU DRAM</strong></td>
      <td><strong>~100 μs</strong></td>
      <td><strong>TB</strong></td>
    </tr>
    <tr>
      <td>L3</td>
      <td>Local NVMe</td>
      <td>ms</td>
      <td>tens of TB</td>
    </tr>
    <tr>
      <td>L4</td>
      <td>Remote object store</td>
      <td>10s ms</td>
      <td>unbounded</td>
    </tr>
  </tbody>
</table>

<p>Production stacks tier the KV cache across L1-L3. <strong>LMCache, NVIDIA Dynamo, and SGLang HiCache are all implementations of this idea.</strong></p>

<h3 id="what-we-wanted-to-find-out">What we wanted to find out</h3>

<ol>
  <li>Can LMCache run on AMD MI300X at all? (PyPI ships CUDA-only wheels)</li>
  <li>Does it help on real agentic workloads, or only in synthetic benchmarks?</li>
  <li>Where’s the regime crossover where the L2 tier starts paying off vs HBM-only?</li>
  <li>What configuration knobs actually matter in practice?</li>
</ol>

<hr />

<h2 id="2-architecture">2. Architecture</h2>

<h3 id="the-serving-stack">The serving stack</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                ┌────────────────────────────────────┐
                │  trace_replay_tester.py (client)   │
                │  • 739 anonymized Claude Code      │
                │    agentic conversation traces     │
                │  • Cooldown-gated user ramp        │
                │  • Working-set + period budgets    │
                └─────────────┬──────────────────────┘
                              │ OpenAI HTTP /v1/chat/completions
                              ▼
                ┌────────────────────────────────────┐
                │       vLLM 0.19.0 ROCm             │
                │  ─────────────────────────────     │
                │  Scheduler → Prefix-cache (HBM)    │
                │  ──────────│──────────────         │
                │            │ KV connector V1 hook  │
                │            ▼                       │
                │  ┌──────────────────────┐          │
                │  │ LMCacheConnectorV1   │          │
                │  │ (BUILD_WITH_HIP=1)   │          │
                │  └─────────┬────────────┘          │
                │            │                       │
                │      ┌─────┴───────┐               │
                │      │             │               │
                │      ▼             ▼               │
                │  GPU (HBM)    CPU DRAM             │
                │  L1 cache     L2 cache (64 GB)     │
                └────────────────────────────────────┘
                              │
                              ▼
                ┌────────────────────────────────────┐
                │  MiniMax-M2.5 (230 GB FP8 MoE)     │
                │  TP=2 across 2× MI300X (192 GB)    │
                └────────────────────────────────────┘
</code></pre></div></div>

<h3 id="three-test-configurations">Three test configurations</h3>

<p>We ran the same workload three times, swapping only the KV strategy:</p>

<table>
  <thead>
    <tr>
      <th>Configuration</th>
      <th>Server flags</th>
      <th>What’s cached</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>A: Vanilla (no cache)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">--no-enable-prefix-caching</code></td>
      <td>Nothing — every prefill from scratch</td>
    </tr>
    <tr>
      <td><strong>B: HBM prefix cache</strong></td>
      <td><code class="language-plaintext highlighter-rouge">--enable-prefix-caching</code></td>
      <td>KV blocks in HBM, LRU evicted when full</td>
    </tr>
    <tr>
      <td><strong>C: LMCache DRAM</strong></td>
      <td><code class="language-plaintext highlighter-rouge">--enable-prefix-caching</code> + <code class="language-plaintext highlighter-rouge">--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'</code></td>
      <td>HBM L1 + 64 GB CPU DRAM L2 (LRU across both)</td>
    </tr>
  </tbody>
</table>

<h3 id="what-the-trace-replay-tester-does">What the trace replay tester does</h3>

<p><a href="https://github.com/callanjfox/kv-cache-tester"><code class="language-plaintext highlighter-rouge">trace_replay_tester.py</code></a> (callanjfox/WEKA) replays 739 anonymized Claude Code conversations. Each trace contains:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"id"</span><span class="p">:</span><span class="s2">"trace_0001"</span><span class="p">,</span><span class="w">
 </span><span class="nl">"tool_tokens"</span><span class="p">:</span><span class="mi">12974</span><span class="p">,</span><span class="w"> </span><span class="nl">"system_tokens"</span><span class="p">:</span><span class="mi">4243</span><span class="p">,</span><span class="w">
 </span><span class="nl">"block_size"</span><span class="p">:</span><span class="mi">64</span><span class="p">,</span><span class="w"> </span><span class="nl">"hash_id_scope"</span><span class="p">:</span><span class="s2">"local"</span><span class="p">,</span><span class="w">
 </span><span class="nl">"requests"</span><span class="p">:[</span><span class="w">
   </span><span class="p">{</span><span class="nl">"t"</span><span class="p">:</span><span class="mf">0.0</span><span class="p">,</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="nl">"in"</span><span class="p">:</span><span class="mi">71175</span><span class="p">,</span><span class="w"> </span><span class="nl">"out"</span><span class="p">:</span><span class="mi">169</span><span class="p">,</span><span class="w">
    </span><span class="nl">"hash_ids"</span><span class="p">:[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="err">...</span><span class="p">,</span><span class="mi">1112</span><span class="p">]},</span><span class="w">   </span><span class="err">//</span><span class="w"> </span><span class="err">block</span><span class="w"> </span><span class="err">hashes</span><span class="w"> </span><span class="err">—</span><span class="w"> </span><span class="err">drives</span><span class="w"> </span><span class="err">cache</span><span class="w"> </span><span class="err">match</span><span class="w"> </span><span class="err">calc</span><span class="w">
   </span><span class="err">...</span><span class="p">]}</span><span class="w">
</span></code></pre></div></div>

<p>Per-trace stats (median across 739 traces):</p>
<ul>
  <li>Starting input: <strong>20,160 tokens</strong></li>
  <li>Ending input: <strong>115,008 tokens</strong></li>
  <li>Cache hit rate per conversation: <strong>96.9%</strong> (theoretical, with infinite cache)</li>
  <li>Conversation duration: <strong>60 min</strong></li>
</ul>

<p>The tester:</p>
<ol>
  <li><strong>Generates synthetic content</strong> to hit each trace’s specified <code class="language-plaintext highlighter-rouge">input_tokens</code> while preserving real assistant responses (so the model actually decodes meaningfully)</li>
  <li><strong>Pre-warms a canonical prefix</strong> (<code class="language-plaintext highlighter-rouge">--warm-prefix-pct 0.5</code>): ~12k tokens of shared tool/system content, mirroring how Claude Code keeps tool definitions cached across conversations</li>
  <li><strong>Adaptively scales concurrent users</strong> based on observed p95 TTFT vs <code class="language-plaintext highlighter-rouge">--max-ttft</code> SLO — same control loop production load balancers use</li>
  <li><strong>Recycles users</strong> (<code class="language-plaintext highlighter-rouge">--recycle</code>): when one conversation completes, replace it with a fresh trace</li>
</ol>

<p>This gives you a controlled approximation of agentic production traffic without sending real Claude Code data anywhere.</p>
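
<p>The adaptive ramp in step 3 is easy to picture as a small control loop; a sketch of the idea, not the tester’s actual code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def adjust_users(current_users, p95_ttft_s, max_ttft_s, step=1, floor=1):
    """Cooldown-gated ramp sketch: add users while the observed p95 TTFT has
    headroom against the SLO, shed users once the SLO is violated."""
    if p95_ttft_s &lt; 0.8 * max_ttft_s:      # comfortable headroom: ramp up
        return current_users + step
    if p95_ttft_s &gt; max_ttft_s:            # SLO violated: back off
        return max(floor, current_users - step)
    return current_users                    # near the SLO: hold steady
</code></pre></div></div>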

<hr />

<h2 id="3-implementation-getting-lmcache-running-on-mi300x">3. Implementation: getting LMCache running on MI300X</h2>

<p>This part has more sharp edges than you’d expect; we document them here so you don’t have to rediscover them.</p>

<h3 id="step-1-container">Step 1: Container</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run <span class="nt">-d</span> <span class="nt">--name</span> lmcache-bench <span class="nt">--entrypoint</span> /bin/bash <span class="se">\</span>
  <span class="nt">--device</span><span class="o">=</span>/dev/kfd <span class="nt">--device</span><span class="o">=</span>/dev/dri <span class="nt">--network</span><span class="o">=</span>host <span class="nt">--ipc</span><span class="o">=</span>host <span class="se">\</span>
  <span class="nt">--group-add</span> video <span class="nt">--cap-add</span> SYS_PTRACE <span class="se">\</span>
  <span class="nt">-v</span> /mnt/nvme/models:/work/models <span class="se">\</span>
  vllm/vllm-openai-rocm:v0.19.0 <span class="se">\</span>
  <span class="nt">-c</span> <span class="s2">"sleep infinity"</span>
</code></pre></div></div>

<h3 id="step-2-build-lmcache-from-source-pypi-wheel-is-cuda-only">Step 2: Build LMCache from source (PyPI wheel is CUDA-only)</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker <span class="nb">exec </span>lmcache-bench bash <span class="nt">-c</span> <span class="s2">"
  git clone --depth 1 https://github.com/LMCache/LMCache.git /work/LMCache
  cd /work/LMCache &amp;&amp; BUILD_WITH_HIP=1 pip install -e . --no-build-isolation
"</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">pip install lmcache</code> ships a CUDA-linked <code class="language-plaintext highlighter-rouge">c_ops.so</code> that fails with <code class="language-plaintext highlighter-rouge">libcudart.so.12: cannot open shared object file</code>. The source build with <code class="language-plaintext highlighter-rouge">BUILD_WITH_HIP=1</code> emits HIP bytecode that loads cleanly.</p>

<h3 id="step-3-uninstall-transitive-cuda-only-deps">Step 3: Uninstall transitive CUDA-only deps</h3>

<p>When you <code class="language-plaintext highlighter-rouge">pip install lmcache==0.4.3</code>, it pulls in <code class="language-plaintext highlighter-rouge">nixl-cu12</code>, <code class="language-plaintext highlighter-rouge">nixl_ep</code>, <code class="language-plaintext highlighter-rouge">cupy-cuda12x</code>. vLLM 0.19’s quark quantization config imports <code class="language-plaintext highlighter-rouge">nixl_ep</code> unconditionally → <code class="language-plaintext highlighter-rouge">libcuda.so.1</code> ImportError before the model even loads.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip uninstall <span class="nt">-y</span> nixl nixl-cu12 cupy-cuda12x cufile-python cuda-pathfinder
</code></pre></div></div>

<h3 id="step-4-launch-with-the-right-flags">Step 4: Launch with the right flags</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">VLLM_FLOAT32_MATMUL_PRECISION</span><span class="o">=</span>high <span class="se">\</span>
<span class="nv">PYTHONHASHSEED</span><span class="o">=</span>0 <span class="se">\</span>
<span class="nv">LMCACHE_LOCAL_CPU</span><span class="o">=</span><span class="nb">true</span> <span class="se">\</span>
<span class="nv">LMCACHE_CHUNK_SIZE</span><span class="o">=</span>256 <span class="se">\</span>
<span class="nv">LMCACHE_MAX_LOCAL_CPU_SIZE</span><span class="o">=</span>64 <span class="se">\</span>
vllm serve /work/models/MiniMax-M2.5 <span class="se">\</span>
  <span class="nt">--tensor-parallel-size</span> 2 <span class="nt">--gpu-memory-utilization</span> 0.85 <span class="se">\</span>
  <span class="nt">--tool-call-parser</span> minimax_m2 <span class="nt">--reasoning-parser</span> minimax_m2 <span class="se">\</span>
  <span class="nt">--enable-auto-tool-choice</span> <span class="nt">--trust-remote-code</span> <span class="se">\</span>
  <span class="nt">--enable-prefix-caching</span> <span class="se">\</span>
  <span class="nt">--kv-transfer-config</span> <span class="s1">'{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'</span> <span class="se">\</span>
  <span class="nt">--host</span> 0.0.0.0 <span class="nt">--port</span> 8000
</code></pre></div></div>

<h3 id="the-three-configuration-mistakes-that-cost-the-most-time">The three configuration mistakes that cost the most time</h3>

<ol>
  <li>
    <p><strong><code class="language-plaintext highlighter-rouge">PYTHONHASHSEED=0</code> is non-negotiable.</strong> Python’s <code class="language-plaintext highlighter-rouge">hash()</code> is randomized per-process. Without a fixed seed, TP worker 0 hashes a prompt to one cache key and TP worker 1 hashes the same prompt to a different key. Even sending the same request twice from the same client misses every time. Symptom: server log shows <code class="language-plaintext highlighter-rouge">LMCache hit tokens: 0, need to load: 0</code> on bit-identical prompts.</p>
  </li>
  <li>
    <p><strong>You need <code class="language-plaintext highlighter-rouge">--enable-prefix-caching</code> (not <code class="language-plaintext highlighter-rouge">--no-enable-prefix-caching</code>)</strong> even when running LMCache. LMCache borrows vLLM’s prefix-cache hash function for cache-key derivation. Without it, you get <code class="language-plaintext highlighter-rouge">LMCache WARNING: Could not load 'builtin' from vLLM. Using builtin hash.</code> and inconsistent behavior.</p>
  </li>
  <li>
    <p><strong>Do NOT set <code class="language-plaintext highlighter-rouge">LMCACHE_SAVE_DECODE_CACHE=true</code>.</strong> It synchronously offloads every decode step to CPU, which can serialize the GPU pipeline. We saw 100-250s stalls on otherwise simple requests. Decode-cache reuse is rare in practice (each decode produces a unique tail) so the offload cost is pure overhead.</p>
  </li>
</ol>

<h3 id="recipe-specific-gotchas">Recipe-specific gotchas</h3>

<p>For MiniMax-M2 series specifically, the <a href="https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html">official vLLM recipe</a> includes <code class="language-plaintext highlighter-rouge">--compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}'</code>. This pass was added after vLLM 0.19.0 — drop it from the launch command if you’re pinned to that version.</p>

<h3 id="sanity-check">Sanity check</h3>

<p>Before running benchmarks, confirm the cache path actually fires:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl -s http://127.0.0.1:8000/v1/chat/completions ...   # send prompt twice
# server log:
LMCache: Reqid=...80e (1030 tok, 1st pass): hit tokens: 0     ← cold (correct)
LMCache: Reqid=...8cf (1030 tok, 2nd pass): hit tokens: 1024  ← warm hit ✅
</code></pre></div></div>

<p>If the second pass shows <code class="language-plaintext highlighter-rouge">hit tokens: 0</code>, fix <code class="language-plaintext highlighter-rouge">PYTHONHASHSEED</code> before going further.</p>
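
<p>If you would rather script this check, a minimal cold-vs-warm probe against the same endpoint looks like the sketch below (the prompt and token counts are illustrative, and the model name is the served path from the launch command above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Cold-vs-warm TTFT probe: send the same long prompt twice and time the first streamed chunk.
import time, requests

URL = "http://127.0.0.1:8000/v1/chat/completions"
payload = {
    "model": "/work/models/MiniMax-M2.5",
    "messages": [{"role": "user", "content": "repeat this context back to me " * 300}],
    "max_tokens": 16,
    "stream": True,
}

def ttft_seconds():
    t0 = time.time()
    with requests.post(URL, json=payload, stream=True, timeout=600) as r:
        for line in r.iter_lines():
            if line.startswith(b"data:"):      # first streamed token chunk
                return time.time() - t0

print(f"cold TTFT: {ttft_seconds():.2f}s")
print(f"warm TTFT: {ttft_seconds():.2f}s   # should drop sharply if the reuse path fires")
</code></pre></div></div>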

<hr />

<h2 id="4-benchmarks-methodology">4. Benchmarks: methodology</h2>

<p>We ran four phases, each isolating a different question:</p>

<table>
  <thead>
    <tr>
      <th>Phase</th>
      <th>Tester</th>
      <th>Question</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>Smoke test (curl)</td>
      <td>Does the server respond coherently with LMCache?</td>
    </tr>
    <tr>
      <td>2</td>
      <td><code class="language-plaintext highlighter-rouge">single_prompt_tester.py</code></td>
      <td>Does LMCache actually skip prefill on cache hits?</td>
    </tr>
    <tr>
      <td>3 base</td>
      <td><code class="language-plaintext highlighter-rouge">trace_replay_tester.py</code> low load</td>
      <td>What happens with realistic agentic traffic?</td>
    </tr>
    <tr>
      <td><strong>3 stress</strong></td>
      <td><strong><code class="language-plaintext highlighter-rouge">trace_replay_tester.py</code> high load</strong></td>
      <td><strong>Where does LMCache pay off vs HBM-only?</strong></td>
    </tr>
    <tr>
      <td>4</td>
      <td><code class="language-plaintext highlighter-rouge">cache_rate_tester.py</code> + <code class="language-plaintext highlighter-rouge">working_set_tester.py</code></td>
      <td>Synthetic sweeps for controlled comparison</td>
    </tr>
  </tbody>
</table>

<h3 id="common-settings">Common settings</h3>

<ul>
  <li>Hardware: 2× AMD MI300X (192 GB HBM each), gfx942</li>
  <li>Software: vLLM 0.19.0 + LMCache main (HIP-built) + transformers 4.57.1</li>
  <li>Model: MiniMaxAI/MiniMax-M2.5 FP8, TP=2, <code class="language-plaintext highlighter-rouge">--gpu-memory-utilization 0.78</code> (stress) or <code class="language-plaintext highlighter-rouge">0.85</code> (others)</li>
  <li>Tester: 0.5 warm-prefix, <code class="language-plaintext highlighter-rouge">think-only</code> timing, max-context 32k (base) or 100k (stress)</li>
  <li>60s <code class="language-plaintext highlighter-rouge">--max-ttft</code> SLO (stress) or 30s (base)</li>
</ul>

<hr />

<h2 id="5-results">5. Results</h2>

<h3 id="51-phase-2--lmcache-reuse-path-validated">5.1 Phase 2 — LMCache reuse path validated</h3>

<p>Single-prompt cold-vs-warm sweep at increasing context sizes. Each request was sent twice; second iteration should hit cache and skip prefill.</p>

<p><img src="/assets/images/lmcache-bench/phase2_cold_vs_warm.png" alt="Phase 2 cold vs warm" /></p>

<table>
  <thead>
    <tr>
      <th>Context</th>
      <th>Cold (s)</th>
      <th>Warm (s)</th>
      <th>Speedup</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1k</td>
      <td>6.42</td>
      <td>3.22</td>
      <td><strong>2.0×</strong></td>
    </tr>
    <tr>
      <td>2k</td>
      <td>40.4</td>
      <td>3.76</td>
      <td><strong>10.7×</strong></td>
    </tr>
    <tr>
      <td>8k</td>
      <td>8.92</td>
      <td>8.06</td>
      <td>1.1×</td>
    </tr>
    <tr>
      <td>16k</td>
      <td>15.21</td>
      <td>13.46</td>
      <td>1.13×</td>
    </tr>
  </tbody>
</table>

<p>Server logs confirmed real cache hits: <code class="language-plaintext highlighter-rouge">LMCache hit tokens: 1024 / 1792 / 3840</code> on second iterations. The reuse path works; <code class="language-plaintext highlighter-rouge">PYTHONHASHSEED=0</code> was the unlock.</p>

<h3 id="52-phase-3-base-load--hbm-prefix-cache-wins">5.2 Phase 3 base load — HBM prefix cache wins</h3>

<p>8 max users, 32k context, 10 min. Working set fits comfortably in HBM at TP=2.</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Vanilla</th>
      <th>HBM-PC</th>
      <th>LMCache</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Reqs completed</td>
      <td>9</td>
      <td><strong>52</strong></td>
      <td>25</td>
    </tr>
    <tr>
      <td>Peak users</td>
      <td>2</td>
      <td><strong>8</strong></td>
      <td>3</td>
    </tr>
    <tr>
      <td>TTFT avg (s)</td>
      <td>30.05</td>
      <td><strong>16.66</strong></td>
      <td>24.29</td>
    </tr>
    <tr>
      <td>TTFT p50 (s)</td>
      <td>25.99</td>
      <td><strong>0.00</strong></td>
      <td>32.30</td>
    </tr>
    <tr>
      <td>TTFT p95 (s)</td>
      <td>54.11</td>
      <td>65.08</td>
      <td><strong>48.08</strong></td>
    </tr>
    <tr>
      <td>Workload cache hit rate</td>
      <td>63.4%</td>
      <td>55.5%</td>
      <td><strong>84.0%</strong></td>
    </tr>
  </tbody>
</table>

<p><strong>HBM prefix cache won decisively</strong> at this load — 5.8× more requests, 2× lower TTFT vs vanilla, sustained 8 users vs 2 for vanilla. LMCache added overhead without unlocking the L2 tier (working set fit in L1).</p>

<h3 id="53-phase-3-stress--lmcache-wins-decisively">5.3 Phase 3 STRESS — LMCache wins decisively</h3>

<p>32 max users, 100k context, 20 min, GPU memory util reduced to 0.78 to force HBM pressure.</p>

<p><img src="/assets/images/lmcache-bench/phase3_stress_ttft.png" alt="Phase 3 stress TTFT" /></p>

<p><img src="/assets/images/lmcache-bench/phase3_stress_throughput.png" alt="Phase 3 stress throughput" /></p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Vanilla</th>
      <th>HBM-PC</th>
      <th>LMCache</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Reqs completed</td>
      <td>18</td>
      <td>12</td>
      <td><strong>28</strong></td>
    </tr>
    <tr>
      <td>TTFT avg (s)</td>
      <td>150.84</td>
      <td>102.17</td>
      <td><strong>34.59</strong></td>
    </tr>
    <tr>
      <td>TTFT p50 (s)</td>
      <td>0.00</td>
      <td>117.15</td>
      <td>29.86</td>
    </tr>
    <tr>
      <td>TTFT p95 (s)</td>
      <td>826.69</td>
      <td>240.87</td>
      <td><strong>112.78</strong></td>
    </tr>
    <tr>
      <td>TTFT max (s)</td>
      <td>950.96</td>
      <td>301.72</td>
      <td><strong>117.38</strong></td>
    </tr>
    <tr>
      <td>Input throughput (tok/s)</td>
      <td>591</td>
      <td>471</td>
      <td><strong>933</strong></td>
    </tr>
    <tr>
      <td>Working set held</td>
      <td>191k tok</td>
      <td>230k tok</td>
      <td><strong>312k</strong> (+36%)</td>
    </tr>
    <tr>
      <td>Workload cache hit rate</td>
      <td>69.2%</td>
      <td>64.4%</td>
      <td><strong>72.4%</strong></td>
    </tr>
  </tbody>
</table>

<p><strong>LMCache wins:</strong></p>
<ul>
  <li>vs Vanilla: 4.4× lower TTFT avg, 7.3× lower p95, 8.1× lower max, 1.6× more reqs</li>
  <li>vs HBM-PC: <strong>3.0× lower TTFT avg, 2.1× lower p95, 2.6× lower max, 2.3× more reqs</strong></li>
  <li>Holds 36% more working set with the same HBM budget</li>
</ul>

<h3 id="54-phase-4-synthetic-sweeps--surprising-negative">5.4 Phase 4 synthetic sweeps — surprising negative</h3>

<p>Same 3-configuration comparison but with <code class="language-plaintext highlighter-rouge">cache_rate_tester.py</code> (controlled 0/25/50/75/100% hit rates) and 1M token working set.</p>

<p><img src="/assets/images/lmcache-bench/phase4_cache_rate_16k.png" alt="Phase 4 cache_rate at 16k context" /></p>

<table>
  <thead>
    <tr>
      <th>16k context</th>
      <th>Hit%</th>
      <th>Vanilla-NEP</th>
      <th>Vanilla-PC</th>
      <th>LMCache</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>(tok/s)</td>
      <td>0</td>
      <td>2,383</td>
      <td>2,416</td>
      <td>1,867</td>
    </tr>
    <tr>
      <td> </td>
      <td>25</td>
      <td>2,387</td>
      <td>2,457</td>
      <td>1,867</td>
    </tr>
    <tr>
      <td> </td>
      <td>50</td>
      <td>2,395</td>
      <td>2,323</td>
      <td>2,044</td>
    </tr>
    <tr>
      <td> </td>
      <td>75</td>
      <td>2,369</td>
      <td><strong>3,061</strong></td>
      <td>1,956</td>
    </tr>
    <tr>
      <td> </td>
      <td>100</td>
      <td>2,356</td>
      <td><strong>3,044</strong></td>
      <td>1,956</td>
    </tr>
  </tbody>
</table>

<p><strong>LMCache underperforms by 10-17%</strong> in this synthetic test. Why? The 1M nominal working set still fits in HBM at TP=2. The DRAM tier is unused but the connector overhead (key hashing, lookups, no-op transfers) is paid on every request.</p>

<p>This is a <strong>critical lesson</strong>: synthetic benchmarks with controlled hit rates can give misleading negative results for L2 caches. They don’t generate enough working-set pressure to expose where the L2 tier actually pays off.</p>

<hr />

<h2 id="6-key-findings">6. Key Findings</h2>

<h3 id="finding-1-regime-crossover-is-the-central-question">Finding 1: Regime crossover is the central question</h3>

<p>There is no universal “always enable LMCache” answer. The break-even is <strong>working set vs HBM efficient capacity</strong>. For our setup (MiniMax-M2.5 FP8 TP=2 on 2× MI300X), the crossover sits around <strong>250-300k token sustained working set</strong>. Below that, HBM prefix cache is sufficient. Above that, LMCache pays off non-linearly.</p>

<table>
  <thead>
    <tr>
      <th>Working set</th>
      <th>Recommended strategy</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>&lt; 100k tokens</td>
      <td>HBM prefix cache (vanilla-PC)</td>
    </tr>
    <tr>
      <td>100-250k tokens</td>
      <td>HBM prefix cache, monitor for eviction</td>
    </tr>
    <tr>
      <td>250-500k tokens</td>
      <td><strong>LMCache DRAM</strong></td>
    </tr>
    <tr>
      <td>&gt; 500k tokens</td>
      <td>LMCache DRAM, consider NVMe L3 tier</td>
    </tr>
  </tbody>
</table>
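
<p>To estimate which side of the crossover a deployment sits on, a back-of-envelope sizing sketch is often enough. The layer, head, and dimension values below are placeholders; read the real ones from your model’s <code class="language-plaintext highlighter-rouge">config.json</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Rough regime check: does the sustained KV working set fit in the HBM left after weights?
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem   # K and V planes

hbm_total_gb = 2 * 192        # 2x MI300X
weights_gb   = 230            # FP8 MoE checkpoint
gpu_mem_util = 0.78           # --gpu-memory-utilization in the stress run
kv_budget_gb = hbm_total_gb * gpu_mem_util - weights_gb

per_tok = kv_bytes_per_token(num_layers=62, num_kv_heads=8, head_dim=128)  # hypothetical config values
capacity_tokens = kv_budget_gb * 1e9 / per_tok
print(f"HBM KV budget: {kv_budget_gb:.0f} GB, roughly {capacity_tokens / 1e3:.0f}k tokens of BF16 KV")
</code></pre></div></div>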

<h3 id="finding-2-pythonhashseed-is-the-silent-killer">Finding 2: PYTHONHASHSEED is the silent killer</h3>

<p>We’d guess that most LMCache deployment failures are caused by a missing <code class="language-plaintext highlighter-rouge">PYTHONHASHSEED=0</code>. Symptom: 0% cache hit rate even on bit-identical prompts; LMCache logs show <code class="language-plaintext highlighter-rouge">Could not load 'builtin' from vLLM. Using builtin hash. ... You MUST set PYTHONHASHSEED to ensure consistent hashing.</code></p>

<p>This is in the LMCache config docs but easy to miss. <strong>Treat it as mandatory.</strong></p>
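
<p>A minimal demonstration of the underlying issue (no server needed):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Python salts str hashes per process, so two TP workers derive different cache keys
# for the identical prompt unless the seed is pinned.
import os, subprocess, sys

snippet = 'print(hash("identical 1030-token prompt prefix"))'

def hash_in_new_process(seed=None):
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = seed
    out = subprocess.run([sys.executable, "-c", snippet], env=env,
                         capture_output=True, text=True)
    return out.stdout.strip()

print("unset :", hash_in_new_process(), hash_in_new_process())        # almost always differ
print("seed=0:", hash_in_new_process("0"), hash_in_new_process("0"))  # always identical
</code></pre></div></div>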

<h3 id="finding-3-decode-is-the-bottleneck-not-prefill">Finding 3: Decode is the bottleneck, not prefill</h3>

<p>Across all our runs, output throughput was <strong>1-8 tok/s aggregate</strong>. MiniMax-M2.5 + TP=2 + AITER on MI300X is decode-bound at the concurrencies that fit in TTFT SLO. KV caching only attacks the prefill side.</p>

<p>For a real production deployment, the next dollar should go to:</p>
<ul>
  <li><strong>FP8 KV cache</strong> (we ran BF16 KV) — 2× capacity at &lt;0.5% quality loss</li>
  <li><strong>Speculative decoding</strong> (Eagle-2/Medusa) — 2-3× decode speedup</li>
  <li><strong>PD disaggregation</strong> at &gt;2-node scale — solves prefill blocking decode</li>
</ul>

<p>KV caching is necessary but not sufficient.</p>

<h3 id="finding-4-tp2--lmcacheconnectorv1-has-a-deadlock-under-sustained-load">Finding 4: TP=2 + LMCacheConnectorV1 has a deadlock under sustained load</h3>

<p>We hit a <code class="language-plaintext highlighter-rouge">shm_broadcast: No available shared memory broadcast block found in 60 seconds</code> deadlock during one of our Phase 3 runs. Both TP workers alive, no preemptions, no waiting requests, but no progress for 6+ minutes. Reproduced once, didn’t reproduce on retry with different settings. Worth filing upstream against vLLM and/or LMCache.</p>

<h3 id="finding-5-synthetic-benchmarks-lie-about-l2-cache-value">Finding 5: Synthetic benchmarks lie about L2 cache value</h3>

<p><code class="language-plaintext highlighter-rouge">cache_rate_tester</code> with controlled hit rates <strong>didn’t generate enough working-set pressure</strong> to make the L2 tier useful. LMCache showed -10 to -17% throughput in those tests. The agentic trace replay (Phase 3 stress) — same model, same hardware — showed <strong>+200% throughput</strong>. The difference: realistic working-set distributions and concurrent-user pressure.</p>

<p><strong>Always benchmark caching strategies on representative workloads, not synthetic mixtures.</strong></p>

<h3 id="finding-6-ttft-gated-ramp-control-is-the-right-way-to-think-about-concurrency">Finding 6: TTFT-gated ramp control is the right way to think about concurrency</h3>

<p>Across every test, peak concurrent users plateaued at 4-8 — not because of HBM limits but because the ramp controller refused to add more users while p95 TTFT exceeded the SLO threshold. This mirrors how production load balancers throttle. The “throughput numbers” you see in our results aren’t peak GPU utilization — they’re <strong>steady-state throughput within an SLO</strong>, which is what actually matters.</p>
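
<p>For intuition, here is a minimal sketch of such a ramp controller. The thresholds are hypothetical; the real logic lives in <code class="language-plaintext highlighter-rouge">trace_replay_tester.py</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># TTFT-gated ramp control: add users while p95 TTFT has headroom, shed them when the SLO is violated.
def adjust_users(current, p95_ttft_s, slo_s=60.0, min_users=1, max_users=32):
    if p95_ttft_s &gt; slo_s:                  # SLO violated: back off
        return max(min_users, current - 1)
    if p95_ttft_s &lt; 0.7 * slo_s:            # comfortable headroom: ramp up
        return min(max_users, current + 1)
    return current                          # near the edge: hold

users = 4
for p95 in [12.0, 25.0, 48.0, 63.0, 71.0, 55.0, 39.0]:   # observed rolling p95 per window
    users = adjust_users(users, p95)
    print(f"p95={p95:5.1f}s  users={users}")
</code></pre></div></div>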

<hr />

<h2 id="7-best-practices">7. Best Practices</h2>

<h3 id="for-evaluating-cache-strategies">For evaluating cache strategies</h3>

<ol>
  <li><strong>Use real workload traces, not synthetic mixes.</strong> The <a href="https://github.com/callanjfox/kv-cache-tester">kv-cache-tester</a> dataset provides 739 anonymized Claude Code traces. There’s no excuse to evaluate L2 caching with toy benchmarks.</li>
  <li><strong>Test under stress, not just nominal load.</strong> Cache strategies look identical at low load. The whole point of L2 caching is the long tail.</li>
  <li><strong>Keep <code class="language-plaintext highlighter-rouge">--max-ttft</code> realistic</strong> (5-30s for chat, 30-120s for agentic) — too high and you’re measuring queue depth, too low and you cripple ramp.</li>
  <li><strong>Three configurations minimum</strong>: no-cache (lower bound), HBM-only (cheap baseline), L2-cache (your proposal). Anything less hides the regime story.</li>
</ol>

<h3 id="for-lmcache-deployment-on-mi300x">For LMCache deployment on MI300X</h3>

<ol>
  <li><strong>Build from source</strong> with <code class="language-plaintext highlighter-rouge">BUILD_WITH_HIP=1</code>, do not use the PyPI wheel</li>
  <li><strong>Set <code class="language-plaintext highlighter-rouge">PYTHONHASHSEED=0</code></strong> in the server’s env</li>
  <li><strong>Enable vLLM’s prefix cache</strong> (<code class="language-plaintext highlighter-rouge">--enable-prefix-caching</code>) so LMCache can reuse its hash function</li>
  <li><strong>Don’t enable <code class="language-plaintext highlighter-rouge">LMCACHE_SAVE_DECODE_CACHE</code></strong> — it stalls the decode pipeline</li>
  <li><strong>Size the L2 pool generously</strong> (<code class="language-plaintext highlighter-rouge">LMCACHE_MAX_LOCAL_CPU_SIZE=64</code> GB+) — DRAM is cheap, evictions hurt</li>
  <li><strong>Use FP8 weights and FP8 KV cache</strong> to maximize HBM L1 capacity before pushing to L2</li>
  <li><strong>Monitor <code class="language-plaintext highlighter-rouge">LMCache hit tokens: N</code> in server logs</strong> to verify the cache path is firing in production (a log-scraping sketch follows this list)</li>
</ol>
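
<p>For point 7, a rough log-scraping sketch is below. It assumes the <code class="language-plaintext highlighter-rouge">hit tokens: N, need to load: M</code> line format we saw in our logs, which may differ across LMCache versions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Aggregate LMCache hit/load token counts from a vLLM server log (rough proxy for hit ratio).
import re, sys

pat = re.compile(r"hit tokens:\s*(\d+).*?need to load:\s*(\d+)")
hit = load = 0
for line in open(sys.argv[1]):
    m = pat.search(line)
    if m:
        hit  += int(m.group(1))
        load += int(m.group(2))
total = hit + load
print(f"hit tokens={hit}  loaded tokens={load}  hit ratio={hit / total:.1%}" if total else "no LMCache lines found")
</code></pre></div></div>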

<h3 id="for-agentic-serving-in-general">For agentic serving in general</h3>

<ol>
  <li><strong>Sticky session routing</strong> is non-negotiable — without it, conversation N+1 lands on a fresh replica and gets zero cache reuse</li>
  <li><strong>Cache-control markers in your prompts</strong> (Anthropic-style <code class="language-plaintext highlighter-rouge">cache_control: {"type": "ephemeral"}</code>) make explicit what the server should keep warm</li>
  <li><strong>Byte-identical message serialization across turns</strong> — JSON key reordering, whitespace changes, timestamp diffs all silently destroy cache hits (see the serialization sketch after this list)</li>
  <li><strong>PD disaggregation at &gt;2-node scale</strong> — runs prefill on burst-capacity replicas, decode on KV-cache-resident replicas. LMCache and PD are complementary; production stacks like Mooncake combine both.</li>
  <li><strong>Speculative decoding</strong> — Eagle-2/Medusa give 2-3× decode speedup. Bigger throughput win than any cache layer for decode-bound workloads.</li>
</ol>
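
<p>To illustrate point 3, the same logical message serialized two ways produces different bytes, and therefore different cache keys (a minimal sketch):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Key order and whitespace change the bytes the server tokenizes, which changes the prefix hash.
import json

msg = {"role": "user", "content": "run the tests", "tool_choice": None}

a = json.dumps(msg)                                          # default key order and spacing
b = json.dumps(msg, sort_keys=True, separators=(",", ":"))   # one canonical form
print(a)
print(b)
print("byte-identical:", a == b)   # False; pick one form and keep it for every turn
</code></pre></div></div>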

<h3 id="when-not-to-deploy-lmcache">When NOT to deploy LMCache</h3>

<ul>
  <li>Working set comfortably fits HBM (most chat workloads)</li>
  <li>Decode-bound serving where prefill cost is already small relative to decode</li>
  <li>Single-node deployments where you don’t have spare DRAM bandwidth</li>
  <li>TP &gt; 4 with vLLM 0.19.x (KV connector deadlock risk; needs investigation)</li>
</ul>

<hr />

<h2 id="8-reproduce">8. Reproduce</h2>

<p>To reproduce a single configuration:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># 1. Container + LMCache build (one time)</span>
docker run <span class="nt">-d</span> <span class="nt">--name</span> lmcache-bench <span class="nt">--entrypoint</span> /bin/bash <span class="se">\</span>
  <span class="nt">--device</span><span class="o">=</span>/dev/kfd <span class="nt">--device</span><span class="o">=</span>/dev/dri <span class="nt">--network</span><span class="o">=</span>host <span class="nt">--ipc</span><span class="o">=</span>host <span class="se">\</span>
  <span class="nt">--group-add</span> video <span class="nt">--cap-add</span> SYS_PTRACE <span class="se">\</span>
  <span class="nt">-v</span> /your/models:/work/models <span class="se">\</span>
  vllm/vllm-openai-rocm:v0.19.0 <span class="nt">-c</span> <span class="s2">"sleep infinity"</span>

docker <span class="nb">exec </span>lmcache-bench bash <span class="nt">-c</span> <span class="s2">"
  pip uninstall -y nixl nixl-cu12 cupy-cuda12x cufile-python cuda-pathfinder
  git clone --depth 1 https://github.com/LMCache/LMCache.git /work/LMCache
  cd /work/LMCache &amp;&amp; BUILD_WITH_HIP=1 pip install -e . --no-build-isolation
"</span>

<span class="c"># 2. Server (LMCache stress configuration)</span>
docker <span class="nb">exec</span> <span class="nt">-d</span> lmcache-bench bash <span class="nt">-c</span> <span class="s2">"
  VLLM_FLOAT32_MATMUL_PRECISION=high PYTHONHASHSEED=0 </span><span class="se">\</span><span class="s2">
  LMCACHE_LOCAL_CPU=true LMCACHE_CHUNK_SIZE=256 LMCACHE_MAX_LOCAL_CPU_SIZE=64 </span><span class="se">\</span><span class="s2">
  vllm serve /work/models/MiniMax-M2.5 </span><span class="se">\</span><span class="s2">
    --tensor-parallel-size 2 --gpu-memory-utilization 0.78 </span><span class="se">\</span><span class="s2">
    --enable-prefix-caching </span><span class="se">\</span><span class="s2">
    --kv-transfer-config '{</span><span class="se">\"</span><span class="s2">kv_connector</span><span class="se">\"</span><span class="s2">:</span><span class="se">\"</span><span class="s2">LMCacheConnectorV1</span><span class="se">\"</span><span class="s2">,</span><span class="se">\"</span><span class="s2">kv_role</span><span class="se">\"</span><span class="s2">:</span><span class="se">\"</span><span class="s2">kv_both</span><span class="se">\"</span><span class="s2">}' </span><span class="se">\</span><span class="s2">
    --tool-call-parser minimax_m2 --reasoning-parser minimax_m2 </span><span class="se">\</span><span class="s2">
    --enable-auto-tool-choice --trust-remote-code </span><span class="se">\</span><span class="s2">
    --host 0.0.0.0 --port 8000
"</span>

<span class="c"># 3. Trace replay client</span>
git clone https://github.com/callanjfox/kv-cache-tester.git
<span class="nb">cd </span>kv-cache-tester
python3 trace_replay_tester.py <span class="se">\</span>
  <span class="nt">--api-endpoint</span> http://127.0.0.1:8000 <span class="se">\</span>
  <span class="nt">--trace-directory</span> traces <span class="se">\</span>
  <span class="nt">--start-users</span> 4 <span class="nt">--max-users</span> 32 <span class="se">\</span>
  <span class="nt">--max-ttft</span> 60.0 <span class="nt">--test-duration</span> 1200 <span class="se">\</span>
  <span class="nt">--max-context</span> 100000 <span class="nt">--warm-prefix-pct</span> 0.5 <span class="se">\</span>
  <span class="nt">--timing-strategy</span> think-only <span class="nt">--recycle</span> <span class="se">\</span>
  <span class="nt">--output-dir</span> ./results
</code></pre></div></div>

<hr />

<h2 id="9-acknowledgments">9. Acknowledgments</h2>

<ul>
  <li><strong>callanjfox / WEKA</strong> for the <a href="https://github.com/callanjfox/kv-cache-tester">kv-cache-tester</a> toolkit and the 739 anonymized Claude Code agentic traces</li>
  <li><strong>LMCache team</strong> for the connector and the source-friendly build system</li>
  <li><strong>Hot Aisle</strong> for the MI300X access</li>
</ul>

<hr />

<p><em>Bench environment: ENC1-CLS01-SVR08, 2× AMD MI300X (gfx942, 192 GB HBM each), ROCm 7.0.0, vLLM 0.19.0, LMCache main (commit ~2026-04). All raw CSVs and run logs in the linked repository.</em></p>]]></content><author><name></name></author><category term="LLM" /><category term="AMD" /><category term="MI300X" /><category term="vLLM" /><category term="LMCache" /><category term="Performance" /><summary type="html"><![CDATA[A practitioner’s guide to KV-cache tiering on ROCm — what works, what doesn’t, and the regime where it actually matters.]]></summary></entry><entry><title type="html">Run GLM-4.6V on AMD MI300X GPU with vLLM</title><link href="https://andyluo7.github.io/llm/amd/mi300x/vllm/2025/12/09/run-glm-4-6v-on-amd-mi300x-with-vllm/" rel="alternate" type="text/html" title="Run GLM-4.6V on AMD MI300X GPU with vLLM" /><published>2025-12-09T00:00:00+00:00</published><updated>2025-12-09T00:00:00+00:00</updated><id>https://andyluo7.github.io/llm/amd/mi300x/vllm/2025/12/09/run-glm-4-6v-on-amd-mi300x-with-vllm</id><content type="html" xml:base="https://andyluo7.github.io/llm/amd/mi300x/vllm/2025/12/09/run-glm-4-6v-on-amd-mi300x-with-vllm/"><![CDATA[<p><a href="https://huggingface.co/zai-org/GLM-4.6V">GLM-4.6V</a> is the latest multimodal model from Z.AI, designed to bridge the gap between visual perception and executable action. In this post, we’ll explore what makes GLM-4.6V special and how you can run it on AMD’s powerful MI300X GPUs using vLLM.</p>

<h2 id="1-overview-about-glm-46v">1. Overview about GLM-4.6V</h2>

<p>GLM-4.6V is a 106B parameter foundation model that achieves State-of-the-Art (SoTA) performance in visual understanding, comparable to other leading models like GPT-4V. It introduces several groundbreaking capabilities:</p>

<ul>
  <li><strong>Native Multimodal Function Calling:</strong> Unlike previous models that required converting visual inputs to text descriptions, GLM-4.6V can directly process images, screenshots, and documents as tool inputs. It can also generate visual outputs like charts and rendered pages, integrating them into its reasoning chain.</li>
  <li><strong>Interleaved Image-Text Content Generation:</strong> The model can synthesize coherent content that mixes text and images, ideal for generating rich reports or articles.</li>
  <li><strong>Multimodal Document Understanding:</strong> With a context window of up to 128k tokens, it can process and understand long documents, charts, and complex layouts without OCR pre-processing.</li>
  <li><strong>Frontend Replication &amp; Visual Editing:</strong> It can reconstruct HTML/CSS from screenshots and support natural language-driven edits.</li>
</ul>

<p>For those with more constrained resources, a lightweight version, <strong>GLM-4.6V-Flash (9B)</strong>, is also available for local deployment.</p>

<h2 id="2-how-to-run-on-amd-mi300x-gpu">2. How to run on AMD MI300X GPU</h2>

<p>Running GLM-4.6V on AMD MI300X is straightforward thanks to vLLM support. Ensure you have a working ROCm environment set up for your MI300X.</p>

<h3 id="prerequisites--installation">Prerequisites &amp; Installation</h3>

<p>Try it by launching the vLLM container:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run <span class="nt">-it</span> <span class="se">\</span>
 <span class="nt">--privileged</span> <span class="se">\</span>
 <span class="nt">--network</span><span class="o">=</span>host <span class="se">\</span>
 <span class="nt">--group-add</span><span class="o">=</span>video <span class="se">\</span>
 <span class="nt">--ipc</span><span class="o">=</span>host <span class="se">\</span>
 <span class="nt">--cap-add</span><span class="o">=</span>SYS_PTRACE <span class="se">\</span>
 <span class="nt">--security-opt</span> <span class="nv">seccomp</span><span class="o">=</span>unconfined <span class="se">\</span>
 <span class="nt">--device</span> /dev/kfd <span class="se">\</span>
 <span class="nt">--device</span> /dev/dri <span class="se">\</span>
 <span class="nt">--name</span> vllm-omni <span class="se">\</span>
 rocm/vllm-dev:nightly
</code></pre></div></div>

<p>You also need a recent <code class="language-plaintext highlighter-rouge">transformers</code> build that includes GLM-4.6V support; installing it from source is the safest option:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://github.com/huggingface/transformers.git
pip <span class="nb">install</span> <span class="s1">'.[torch]'</span>
</code></pre></div></div>

<h3 id="running-inference">Running Inference</h3>

<p>Launch vLLM server inside the container:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vllm serve zai-org/GLM-4.6V <span class="se">\</span>
     <span class="nt">--tensor-parallel-size</span> 4 <span class="se">\</span>
     <span class="nt">--tool-call-parser</span> glm45 <span class="se">\</span>
     <span class="nt">--reasoning-parser</span> glm45 <span class="se">\</span>
     <span class="nt">--enable-auto-tool-choice</span> <span class="se">\</span>
     <span class="nt">--enable-expert-parallel</span> <span class="se">\</span>
     <span class="nt">--allowed-local-media-path</span> / <span class="se">\</span>
     <span class="nt">--mm-encoder-tp-mode</span> data <span class="se">\</span>
     <span class="nt">--mm_processor_cache_type</span> shm
</code></pre></div></div>
<p>You can also set <code class="language-plaintext highlighter-rouge">--tensor-parallel-size</code> to 2 or 8 to run on 2 or 8 MI300X GPUs.
The same command can be used to run <code class="language-plaintext highlighter-rouge">zai-org/GLM-4.6V-FP8</code> on 1, 2, 4, or 8 MI300X GPUs.</p>

<p>Once the vLLM server is launched, here are a few quick examples demonstrating the capabilities of GLM-4.6V.</p>

<h4 id="example-1-visual-grounding">Example 1: Visual Grounding</h4>

<p><img src="https://cloudcovert-1305175928.cos.ap-guangzhou.myqcloud.com/%E5%9B%BE%E7%89%87grounding.PNG" alt="Visual Grounding Example" /></p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -X POST \
    http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "zai-org/GLM-4.6V",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cloudcovert-1305175928.cos.ap-guangzhou.myqcloud.com/%E5%9B%BE%E7%89%87grounding.PNG"
                        }
                    },
                    {
                        "type": "text",
                        "text": "Where is the second bottle of beer from the right on the table?  Provide coordinates in [[xmin,ymin,xmax,ymax]] format"
                    }
                ]
            }
        ],
        "thinking": {
            "type":"enabled"
        }
    }'
</code></pre></div></div>

<p>The output:</p>

<pre style="white-space: pre-wrap;">
{
  "id": "chatcmpl-afb2ac2dce2bd986",
  "object": "chat.completion",
  "created": 1765416718,
  "model": "zai-org/GLM-4.6V",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "\nThe coordinates of the second bottle of beer from the right on the table are &lt;|begin_of_box|&gt;[[94,598,177,991]]&lt;|end_of_box|&gt;.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": "The image shows an outdoor table setting with various items on it, including bottles of beer. The question asks for the coordinates of the second bottle of beer from the right on the table. By visually inspecting the table, we identify the bottles of beer and count from the right - hand side to find the second one. Then, we determine the bounding box coordinates of that specific bottle.",
        "reasoning_content": "The image shows an outdoor table setting with various items on it, including bottles of beer. The question asks for the coordinates of the second bottle of beer from the right on the table. By visually inspecting the table, we identify the bottles of beer and count from the right - hand side to find the second one. Then, we determine the bounding box coordinates of that specific bottle."
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": 151336,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 696,
    "total_tokens": 807,
    "completion_tokens": 111,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
</pre>

<p>You can see that it successfully identifies the second bottle of beer from the right on the table and provides the coordinates [94,598,177,991]. It also shows the reasoning process in the “reasoning_content” field.</p>
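
<p>The same request can also be issued from Python through the OpenAI-compatible client (a sketch; it assumes <code class="language-plaintext highlighter-rouge">pip install openai</code> and the vLLM server launched above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Python equivalent of the curl call in Example 1, using the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="zai-org/GLM-4.6V",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://cloudcovert-1305175928.cos.ap-guangzhou.myqcloud.com/%E5%9B%BE%E7%89%87grounding.PNG"}},
            {"type": "text",
             "text": "Where is the second bottle of beer from the right on the table? Provide coordinates in [[xmin,ymin,xmax,ymax]] format"},
        ],
    }],
    extra_body={"thinking": {"type": "enabled"}},   # same flag the curl example passes
)
print(resp.choices[0].message.content)
</code></pre></div></div>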

<h4 id="example-2-visual-understanding">Example 2: Visual Understanding</h4>

<p><img src="https://cdn.bigmodel.cn/markdown/1765174983998image.png" alt="Visual Grounding Example" /></p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -X POST \
  http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.6V",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.bigmodel.cn/markdown/1765174983998image.png"
            }
          },
          {
            "type": "text",
            "text": "Identify the breeds of all cats in the image. Return the results in valid JSON format. The result should be a list, where each element in the list corresponds to a dictionary of target detection results. The dictionary keys are label and bbox_2d, with values being the detected cat breed and the result bounding box coordinates respectively. For example: [{\"label\": \"Golden Shorthair-1\", \"bbox_2d\": [1,2,3,4]}, {\"label\": \"Golden Shorthair-2\", \"bbox_2d\": [4,5,6,7]}]"
          }
        ]
      }
    ],
    "thinking": {
      "type": "enabled"
    }
  }'
</code></pre></div></div>

<p>The output:</p>

<pre style="white-space: pre-wrap;">
{
  "id": "chatcmpl-ad870121ef1f16e5",
  "object": "chat.completion",
  "created": 1765417439,
  "model": "zai-org/GLM-4.6V",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "\nThe list of cat breeds and their bounding box coordinates in the required JSON format is &lt;|begin_of_box|&gt;[{\"label\": \"American Shorthair-1\", \"bbox_2d\": [109, 152, 193, 822]}, {\"label\": \"American Shorthair-2\", \"bbox_2d\": [191, 331, 311, 852]}, {\"label\": \"American Shorthair-3\", \"bbox_2d\": [299, 347, 434, 899]}, {\"label\": \"Domestic Shorthair-1\", \"bbox_2d\": [422, 523, 516, 913]}, {\"label\": \"American Shorthair-4\", \"bbox_2d\": [505, 257, 609, 852]}, {\"label\": \"American Shorthair-5\", \"bbox_2d\": [606, 445, 710, 855]}, {\"label\": \"Maine Coon-1\", \"bbox_2d\": [696, 92, 819, 822]}, {\"label\": \"American Shorthair-6\", \"bbox_2d\": [808, 473, 886, 825]}]&lt;|end_of_box|&gt;.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": "The image shows a group of cats of various breeds and sizes standing against a white background. The task is to identify the breed of each cat and provide the bounding box coordinates in a specific JSON - format. To do this, I need to visually analyze each cat in the image, determine its breed based on physical characteristics such as fur pattern, color, and body shape, and then estimate the bounding box coordinates for each cat. I will go through each cat one by one, starting from the left - most cat and moving to the right, and create a dictionary for each with the 'label' key for the breed and 'bbox_2d' key for the coordinates.",
        "reasoning_content": "The image shows a group of cats of various breeds and sizes standing against a white background. The task is to identify the breed of each cat and provide the bounding box coordinates in a specific JSON - format. To do this, I need to visually analyze each cat in the image, determine its breed based on physical characteristics such as fur pattern, color, and body shape, and then estimate the bounding box coordinates for each cat. I will go through each cat one by one, starting from the left - most cat and moving to the right, and create a dictionary for each with the 'label' key for the breed and 'bbox_2d' key for the coordinates."
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": 151336,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 635,
    "total_tokens": 1058,
    "completion_tokens": 423,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
</pre>

<p>You can see that it successfully identifies the breeds of all cats in the image and provides the bounding box coordinates in the required JSON format.</p>

<h4 id="example-3-video-understanding">Example 3: Video Understanding</h4>

<p>Here is an example of using a video as input. We will use a short sample clip of a paper airplane flying through a city skyline.</p>

<p>Video URL: <a href="https://andyluo7.github.io/assets/hunyuan_video_1.5_00001_.mp4">Sample Video</a></p>

<video width="640" height="360" controls="">
  <source src="https://andyluo7.github.io/assets/hunyuan_video_1.5_00001_.mp4" type="video/mp4" />
  Your browser does not support the video tag.
</video>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -X POST \
  http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.6V",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "video_url",
            "video_url": {
              "url": "https://andyluo7.github.io/assets/hunyuan_video_1.5_00001_.mp4"
            }
          },
          {
            "type": "text",
            "text": "Describe the video in detail."
          }
        ]
      }
    ]
  }'
</code></pre></div></div>

<p>The output:</p>

<pre style="white-space: pre-wrap;">
{
  "id": "chatcmpl-video-example",
  "object": "chat.completion",
  "created": 1765418000,
  "model": "zai-org/GLM-4.6V",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The video presents a dynamic and immersive sequence of a white paper airplane flying through a dense urban cityscape. The perspective is from a high altitude, looking down upon a sprawling metropolis of tall skyscrapers, likely a city like New York. The scene is bathed in a warm, golden light, suggesting the time is either sunrise or sunset.\n\nThe video begins with a wide shot, establishing the vastness of the city. A single, simple white paper airplane is seen soaring between the towering buildings. As the video progresses, the camera initiates a rapid, forward-moving dolly shot, descending towards the plane. This creates a thrilling sense of motion, as if the viewer is flying alongside the paper airplane. The buildings appear to rush past, and the perspective shifts from a distant overview to a much closer, more intimate view of the plane navigating the narrow streets and canyons between the skyscrapers.\n\nThe camera continues to track the plane's flight path, maintaining a close distance as it weaves through the city. The motion is smooth and continuous, emphasizing the speed and freedom of the paper airplane's journey. Towards the end, the camera pulls back slightly, offering a broader view of the city skyline as the plane flies towards the distant horizon, leaving the viewer with a sense of adventure and possibility. The overall effect is a visually striking and imaginative portrayal of a simple object undertaking a grand journey through a complex urban environment.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": "This is a detailed description of the video based on the provided seconds.\n\n### Thinking Process\n\n1.  **Initial Observation:** I will examine the sequence of seconds to understand the overall narrative and visual style. The seconds show a consistent scene with a clear subject and a dynamic camera movement.\n2.  **Identify the Subject:** The central object is a white paper airplane. It's the focal point of the entire sequence.\n3.  **Analyze the Environment:** The background is a dense, urban cityscape, viewed from a high vantage point. The buildings are tall skyscrapers, suggesting a major city like New York. The lighting suggests it's either early morning or late afternoon, with a warm, golden hue.\n4.  **Describe the Camera Movement:** The camera is not static. It appears to be moving forward and downward, tracking the paper airplane. This creates a sense of motion and immersion, as if the viewer is flying alongside the plane. The perspective shifts from a high, distant view to a much closer, more intimate one.\n5.  **Sequence the Events:** I will describe the video chronologically.\n    *   **Beginning:** The video starts with a wide, high-angle shot of the city. The paper airplane is seen flying between the skyscrapers.\n    *   **Middle:** The camera rapidly moves forward and descends, getting closer to the plane. The buildings seem to rush past, creating a sense of speed. The plane navigates through the narrow canyons formed by the tall buildings.\n    *   **End:** The camera pulls back slightly, offering a wider view of the city skyline as the plane continues its flight towards the horizon.\n6.  **Note Visual Details:** I'll mention the warm color grading, the motion blur that emphasizes speed, and the contrast between the simple, white paper airplane and the complex, massive city below.\n7.  **Synthesize into a Coherent Description:** I will combine these observations into a detailed, flowing paragraph that captures the essence of the video.\n\n***",
        "reasoning_content": "This is a detailed description of the video based on the provided seconds.\n\n### Thinking Process\n\n1.  **Initial Observation:** I will examine the sequence of seconds to understand the overall narrative and visual style. The seconds show a consistent scene with a clear subject and a dynamic camera movement.\n2.  **Identify the Subject:** The central object is a white paper airplane. It's the focal point of the entire sequence.\n3.  **Analyze the Environment:** The background is a dense, urban cityscape, viewed from a high vantage point. The buildings are tall skyscrapers, suggesting a major city like New York. The lighting suggests it's either early morning or late afternoon, with a warm, golden hue.\n4.  **Describe the Camera Movement:** The camera is not static. It appears to be moving forward and downward, tracking the paper airplane. This creates a sense of motion and immersion, as if the viewer is flying alongside the plane. The perspective shifts from a high, distant view to a much closer, more intimate one.\n5.  **Sequence the Events:** I will describe the video chronologically.\n    *   **Beginning:** The video starts with a wide, high-angle shot of the city. The paper airplane is seen flying between the skyscrapers.\n    *   **Middle:** The camera rapidly moves forward and descends, getting closer to the plane. The buildings seem to rush past, creating a sense of speed. The plane navigates through the narrow canyons formed by the tall buildings.\n    *   **End:** The camera pulls back slightly, offering a wider view of the city skyline as the plane continues its flight towards the horizon.\n6.  **Note Visual Details:** I'll mention the warm color grading, the motion blur that emphasizes speed, and the contrast between the simple, white paper airplane and the complex, massive city below.\n7.  **Synthesize into a Coherent Description:** I will combine these observations into a detailed, flowing paragraph that captures the essence of the video.\n\n***",
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": :151336,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 19246,
    "total_tokens": 19951,
    "completion_tokens": 705,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
</pre>

<p>You can see that GLM-4.6V can effectively process video content and output a coherent description of the scene.</p>

<h2 id="3-summary">3. Summary</h2>

<p>GLM-4.6V represents a significant leap forward in open multimodal AI, bringing native visual tool use and long-context understanding to the forefront. When paired with the high-bandwidth memory and compute power of AMD MI300X GPUs, it becomes a formidable tool for enterprise-grade multimodal applications.</p>

<p>We encourage you to try running GLM-4.6V on your AMD infrastructure today! Check out the <a href="https://docs.z.ai/guides/vlm/glm-4.6v">official documentation</a> and the <a href="https://huggingface.co/zai-org/GLM-4.6V">Hugging Face model card</a> for more deep dives.</p>]]></content><author><name></name></author><category term="LLM" /><category term="AMD" /><category term="MI300X" /><category term="vLLM" /><summary type="html"><![CDATA[GLM-4.6V is the latest multimodal model from Z.AI, designed to bridge the gap between visual perception and executable action. In this post, we’ll explore what makes GLM-4.6V special and how you can run it on AMD’s powerful MI300X GPUs using vLLM.]]></summary></entry><entry><title type="html">Running FLUX.2, HunyuanVideo-1.5, and Z-Image-Turbo on AMD MI300X</title><link href="https://andyluo7.github.io/ai/2025/11/27/mi300x-image-video-models/" rel="alternate" type="text/html" title="Running FLUX.2, HunyuanVideo-1.5, and Z-Image-Turbo on AMD MI300X" /><published>2025-11-27T17:00:00+00:00</published><updated>2025-11-27T17:00:00+00:00</updated><id>https://andyluo7.github.io/ai/2025/11/27/mi300x-image-video-models</id><content type="html" xml:base="https://andyluo7.github.io/ai/2025/11/27/mi300x-image-video-models/"><![CDATA[<p>I spent some time bringing a few trending open image and video generation models to the AMD MI300X GPU and wanted to jot down a repeatable path. The focus here is to get first frames/images out easily and quickly with a simple pip install; performance is less of a concern.</p>

<ul>
  <li><strong>FLUX.2-dev</strong>: Black Forest Labs’s new text-to-image generation model with improved realism, text adherence, and image editing capabilities.</li>
  <li><strong>HunyuanVideo-1.5</strong>: Tencent’s latest video generation model that delivers top-tier quality with only 8.3B parameters.</li>
  <li><strong>Z-Image-Turbo</strong>: An efficient image generation model with Single-Stream Diffusion Transformer.</li>
</ul>

<p>The prerequisite is access to an AMD MI300X GPU, which is available on various CSPs including <a href="https://devcloud.amd.com/">AMD Developer Cloud</a> with free developer credits.</p>

<h3 id="1-base-setup">1) Base setup</h3>

<ul>
  <li>OS: recent Ubuntu (22.04 or similar) with kernel that ships ROCm 6.x/7.x drivers.</li>
  <li>GPU runtime: ROCm 6.x/7.x with <code class="language-plaintext highlighter-rouge">rocminfo</code> and <code class="language-plaintext highlighter-rouge">rocm-smi</code> working.</li>
</ul>

<p>Quick sanity:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rocm-smi
</code></pre></div></div>
<p>You should see something like this, showing 8 MI300X GPUs in one node:</p>

<p><img src="/assets/mi300x-rocm-smi.png" alt="workflow" /></p>

<p>You will see one GPU listed if you are using a single-GPU snapshot from <a href="https://devcloud.amd.com/">AMD Developer Cloud</a>.</p>

<p>A single MI300X GPU is sufficient to run all three models.</p>

<h3 id="2-get-started">2) Get Started</h3>

<p>Install uv if not installed yet</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-LsSf</span> https://astral.sh/uv/install.sh | sh
<span class="nb">source</span> <span class="nv">$HOME</span>/.local/bin/env
</code></pre></div></div>

<p>Install Pytorch, Diffusers, Transformers etc.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uv pip <span class="nb">install</span> <span class="nt">--pre</span> torch torchvision torchaudio <span class="nt">--index-url</span> https://download.pytorch.org/whl/nightly/rocm7.1
uv pip <span class="nb">install</span> <span class="s2">"git+https://github.com/huggingface/diffusers.git"</span>
uv pip <span class="nb">install</span> <span class="s2">"transformers&gt;=4.45.0"</span> huggingface_hub requests safetensors accelerate
</code></pre></div></div>
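
<p>A quick sanity check that the ROCm wheel actually sees the GPU before downloading large checkpoints (a minimal sketch):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Confirm the nightly ROCm build of PyTorch can see the MI300X.
import torch
print(torch.__version__)              # should end in +rocm7.1 for this install
print(torch.cuda.is_available())      # True; "cuda" is the HIP device on ROCm builds
print(torch.cuda.get_device_name(0))  # e.g. "AMD Instinct MI300X"
</code></pre></div></div>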

<p>Install ComfyUI</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/comfyanonymous/ComfyUI.git
<span class="nb">cd</span> <span class="nv">$HOME</span>/ComfyUI
uv pip <span class="nb">install</span> <span class="nt">-r</span> requirements.txt
</code></pre></div></div>

<h3 id="3-flux2-dev-image">3) FLUX.2-dev (image)</h3>

<p>There are 2 ways to run FLUX.2-dev, with diffusers or ComfyUI.</p>

<h4 id="31-diffusers">3.1) diffusers</h4>

<p>Minimal script (assumes HF auth token in <code class="language-plaintext highlighter-rouge">HF_TOKEN</code> if the model is gated):</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python - &lt;&lt;'PY'
import torch
from diffusers import Flux2Pipeline

# Full FLUX.2 [dev] open-weights checkpoint (no bitsandbytes)
repo_id = "black-forest-labs/FLUX.2-dev"
device = "cuda"          # on ROCm builds, "cuda" aliases to AMD GPUs
torch_dtype = torch.bfloat16

# Load full Flux2 pipeline (text encoder + DiT + VAE) in bf16
pipe = Flux2Pipeline.from_pretrained(
    repo_id,
    torch_dtype=torch_dtype,
)

# Move everything to MI300X
pipe.to(device)

prompt = (
    "Realistic macro photograph of a hermit crab using a soda can as its shell, "
    "partially emerging from the can, captured with sharp detail and natural colors, "
    "on a sunlit beach with soft shadows and a shallow depth of field, with blurred "
    "ocean waves in the background. The can has the text `BFL Diffusers` on it and "
    "it has a color gradient that start with #FF5733 at the top and transitions to "
    "#33FF57 at the bottom."
)

# Reproducible generator tied to the GPU
generator = torch.Generator(device=device).manual_seed(42)

image = pipe(
    prompt=prompt,
    generator=generator,
    num_inference_steps=50,  # 28 is a good trade-off if you want faster
    guidance_scale=4.0,
    height=1024,
    width=1024,
).images[0]
image.save("flux2_output.png")
print("Saved flux2_output.png")
PY
</code></pre></div></div>

<p>The image is generated in around 12 seconds. Here is the one I got:</p>

<p><img src="/assets/flux2_output.png" alt="FLUX.2 sample output — hermit crab in a soda can on the beach" /></p>

<h4 id="32-comfyui">3.2) ComfyUI</h4>

<p>Download the model files and put them in the right places in ComfyUI:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>huggingface-cli download Comfy-Org/flux2-dev <span class="nt">--local-dir</span> <span class="nv">$HOME</span>/Comfy-Org-flux2-dev
<span class="nb">cp</span> <span class="nv">$HOME</span>/Comfy-Org-flux2-dev/split_files/vae/flux2-vae.safetensors <span class="nv">$HOME</span>/ComfyUI/models/vae
<span class="nb">cp</span> <span class="nv">$HOME</span>/Comfy-Org-flux2-dev/split_files/text_encoders/mistral_3_small_flux2_fp8.safetensors <span class="nv">$HOME</span>/ComfyUI/models/text_encoders/
<span class="nb">cp</span> <span class="nv">$HOME</span>/Comfy-Org-flux2-dev/split_files/diffusion_models/flux2_dev_fp8mixed.safetensors <span class="nv">$HOME</span>/ComfyUI/models/diffusion_models/
</code></pre></div></div>

<p>Run ComfyUI</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL</span><span class="o">=</span>1 python main.py <span class="nt">--use-pytorch-cross-attention</span>
</code></pre></div></div>

<p>You should see something like this,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Checkpoint files will always be loaded safely.
Total VRAM 196592 MB, total RAM 2321759 MB
pytorch version: 2.10.0.dev20251123+rocm7.1

...

Starting server

To see the GUI go to: http://127.0.0.1:8188
</code></pre></div></div>

<p>I used a remote MI300X server with IP address 64.139.222.215. To use ComfyUI in a web browser on my MacBook, I forward the port to localhost by running the following in a terminal on the MacBook:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh <span class="nt">-L</span> 8188:127.0.0.1:8188 amd@64.139.222.215
</code></pre></div></div>

<p>Change amd@64.139.222.215 to your own account and the IP address of your MI300X server.
Keep the terminal that runs the port forwarding open while you use ComfyUI.</p>

<p>Next, launch a web browser on your host computer and visit http://localhost:8188/. You should see the ComfyUI interface up and running.</p>

<p>Then go to <a href="https://comfyanonymous.github.io/ComfyUI_examples/flux2/#basic-example-workflow">https://comfyanonymous.github.io/ComfyUI_examples/flux2/#basic-example-workflow</a> and drag the example image into ComfyUI in the web browser to load the workflow.</p>

<p>Download <code class="language-plaintext highlighter-rouge">sunset.png</code> and <code class="language-plaintext highlighter-rouge">fennec_girl_sing.png</code> from <a href="https://github.com/andyluo7/andyluo7.github.io/tree/main/assets">https://github.com/andyluo7/andyluo7.github.io/tree/main/assets</a> and put them into <code class="language-plaintext highlighter-rouge">$HOME/ComfyUI/input</code>.</p>

<p>You can see the workflow in ComfyUI as follows; click the blue “Run” button at the top right corner to generate the image.</p>

<p><img src="/assets/comfyui-flux2.png" alt="workflow" /></p>

<p>The prompt is “cute anime girl with gigantic fennec ears and a big fluffy fox tail with long wavy blonde hair and large blue eyes blonde colored eyelashes wearing a pink sweater a large oversized gold trimmed black winter coat and a long blue maxi skirt and a red scarf, she is happy while singing on stage like an idol while holding a microphone, there are colorful lights, it is a postcard held by a hand in front of a beautiful city at sunset and there is cursive writing that says ‘Flux 2, Now in ComfyUI’”.</p>

<p>It takes around 15 s to generate the 1024x1024 image in 20 steps, shown below. It consumes about 27% of the VRAM of a single MI300X GPU.</p>

<video controls="" width="640" poster="/assets/flux2_example.png">
  <source src="/assets/1128.mp4" type="video/mp4" />
  Your browser does not support the video tag.
</video>
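
<p>The VRAM figure above comes from watching the GPU while the workflow runs. To reproduce it, one option is a minimal sketch using PyTorch’s device-wide memory query, run in a separate Python shell on the MI300X server while ComfyUI is generating; the device index 0 is an assumption.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

# On AMD builds of PyTorch the torch.cuda namespace is backed by ROCm/HIP.
# mem_get_info() reports free and total bytes for the whole device, so it
# reflects ComfyUI's usage even though this runs in a different process.
free, total = torch.cuda.mem_get_info(0)
used_pct = 100.0 * (total - free) / total
print(f"VRAM used: {used_pct:.0f}% of {total / 2**30:.0f} GiB")
</code></pre></div></div>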

<h3 id="4-hunyuanvideo-15-video">4) HunyuanVideo-1.5 (video)</h3>

<p>We will use ComfyUI to run Tencent’s HunyuanVideo-1.5 video generation model, the same way we ran FLUX.2-dev above.</p>

<p>Download the model files and put them into the right places in ComfyUI:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>huggingface-cli download Comfy-Org/HunyuanVideo_1.5_repackaged <span class="nt">--local-dir</span> <span class="nv">$HOME</span>/HunyuanVideo_1.5_repackaged
<span class="nb">cp</span> <span class="nv">$HOME</span>/HunyuanVideo_1.5_repackaged/split_files/text_encoders/<span class="k">*</span>.<span class="k">*</span> <span class="nv">$HOME</span>/ComfyUI/models/text_encoders
<span class="nb">cp</span> <span class="nv">$HOME</span>/HunyuanVideo_1.5_repackaged/split_files/vae/<span class="k">*</span>.<span class="k">*</span> <span class="nv">$HOME</span>/ComfyUI/models/vae
<span class="nb">cp</span> <span class="nv">$HOME</span>/HunyuanVideo_1.5_repackaged/split_files/diffusion_models/<span class="k">*</span>.<span class="k">*</span> <span class="nv">$HOME</span>/ComfyUI/models/diffusion_models
<span class="nb">cp</span> <span class="nv">$HOME</span>/HunyuanVideo_1.5_repackaged/split_files/latent_upscale_models/<span class="k">*</span>.<span class="k">*</span> <span class="nv">$HOME</span>/ComfyUI/models/latent_upscale_models
<span class="nb">cp</span> <span class="nv">$HOME</span>/HunyuanVideo_1.5_repackaged/split_files/clip_vision/<span class="k">*</span>.<span class="k">*</span> <span class="nv">$HOME</span>/ComfyUI/models/clip_vision
<span class="nb">cp</span> <span class="nv">$HOME</span>/HunyuanVideo_1.5_repackaged/split_files/loras/<span class="k">*</span>.<span class="k">*</span> <span class="nv">$HOME</span>/ComfyUI/models/loras
</code></pre></div></div>

<p>Run ComfyUI</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL</span><span class="o">=</span>1 python main.py <span class="nt">--use-pytorch-cross-attention</span>
</code></pre></div></div>

<p>Open Workflow</p>

<p>We will use the 720p Text-to-Video workflow. Please download <a href="https://github.com/Comfy-Org/workflow_templates/blob/main/templates/video_hunyuan_video_1.5_720p_t2v.json">https://github.com/Comfy-Org/workflow_templates/blob/main/templates/video_hunyuan_video_1.5_720p_t2v.json</a> and
drag it onto ComfyUI in the web browser to open it. You will see something like this; click the blue “Run” button at the top right corner to generate the video.</p>

<p><img src="/assets/comfyui-hunyuanvideo-1.5.png" alt="workflow" /></p>

<p>The prompt is “A paper airplane released from the top of a skyscraper, gliding through urban canyons, crossing traffic, flying over streets, spiraling upward between buildings. The camera follows the paper airplane’s perspective, shooting cityscape in first-person POV, finally flying toward the sunset, disappearing in golden light. Creative camera movement, free perspective, dreamlike colors.”.</p>

<p>It will take more than 10 minutes to generate a 5-second 720p video, shown below, in 20 steps. It consumes about 18% of the VRAM of a single MI300X GPU during execution.</p>

<video controls="" width="640" poster="/assets/hunyuan_video_1.5_00001_preview.png">
  <source src="/assets/hunyuan_video_1.5_00001_.mp4" type="video/mp4" />
  Your browser does not support the video tag.
</video>

<h3 id="5-z-image-turbo-image">5) Z-Image-Turbo (image)</h3>

<p>This model emphasizes speed while keeping high quality. It can be run with diffusers using the following Python code:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python - &lt;&lt;'PY'
import torch
from diffusers import ZImagePipeline

# 1. Load the pipeline
# Use bfloat16 for optimal performance on supported GPUs
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

prompt = "Young Chinese woman in red Hanfu, intricate embroidery. Impeccable makeup, red floral forehead pattern. Elaborate high bun, golden phoenix headdress, red flowers, beads. Holds round folding fan with lady, trees, bird. Neon lightning-bolt lamp (⚡️), bright yellow glow, above extended left palm. Soft-lit outdoor night background, silhouetted tiered pagoda (西安大雁塔), blurred colorful distant lights."

# 2. Generate Image
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,  # This actually results in 8 DiT forwards
    guidance_scale=0.0,     # Guidance should be 0 for the Turbo models
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]

image.save("example.png")
PY
</code></pre></div></div>
<p>It runs blazingly fast and generates the image almost instantly. Here is the generated image:</p>

<p><img src="/assets/z-image-turbo-example.png" alt="example-image" /></p>

<h3 id="6-next-step">6) Next step</h3>

<p>This blog focuses on the out-of-the-box experience of running these brand-new models on a single AMD MI300X GPU.</p>

<p>For better performance, we can use the aiter backend, which includes Flash Attention, with diffusers. We can also try cached inference to speed up HunyuanVideo-1.5.</p>

<p>We can also use multiple MI300X GPUs to reduce latency for a single request and to increase throughput for batched requests.</p>

<p>We can also use a Radeon GPU or an AI PC like Strix-Halo to build interesting applications with these powerful image and video generation models.</p>]]></content><author><name></name></author><category term="ai" /><summary type="html"><![CDATA[I spent some time bringing a few trending open image and video generation models to the AMD MI300X GPU and wanted to jot down a repeatable path. The focus here is to get first frames/images out easily and quickly. Simple pip install only. Performance is less of a concern.]]></summary></entry><entry><title type="html">Kicking Off</title><link href="https://andyluo7.github.io/updates/2025/11/27/fresh-start/" rel="alternate" type="text/html" title="Kicking Off" /><published>2025-11-27T14:00:00+00:00</published><updated>2025-11-27T14:00:00+00:00</updated><id>https://andyluo7.github.io/updates/2025/11/27/fresh-start</id><content type="html" xml:base="https://andyluo7.github.io/updates/2025/11/27/fresh-start/"><![CDATA[<p>Thanks for dropping by. This site will collect build notes, experiments, and write-ups on what I’m learning. Expect posts on:</p>

<ul>
  <li>Tools and workflows I rely on.</li>
  <li>Project retrospectives—what worked and what didn’t.</li>
  <li>Short notes that future me (and maybe you) will want within reach.</li>
</ul>

<p>If you’d like to follow along, add the RSS feed in your reader or check back periodically. Here we go.</p>]]></content><author><name></name></author><category term="updates" /><summary type="html"><![CDATA[Thanks for dropping by. This site will collect build notes, experiments, and write-ups on what I’m learning. Expect posts on:]]></summary></entry></feed>