CPU-GPU Co-Design for Agentic LLM Inference
Quantifying where time actually goes — and why your CPU might be stealing 15% or more of your GPU throughput.
Key Summary
We instrumented the full request lifecycle of agentic LLM inference on AMD MI300X to answer a simple question: how much of end-to-end latency is CPU work vs GPU work?
Using MiniMax-M2.5 (230 GB FP8 MoE) on 2× MI300X with vLLM 0.19.0, we decomposed every request into serialization, HTTP overhead (tokenization + scheduling + queue wait), GPU prefill, GPU decode, and response parsing across 8 scenarios spanning concurrency 1–32 and context 1k–100k tokens.
Headline findings:
- At low concurrency, CPU overhead is negligible — 0.4–0.6% of E2E time for single requests at any context length
- At high concurrency, CPU overhead becomes material — 11–15% of E2E time at 32 concurrent users
- The bottleneck is not tokenization or JSON parsing — it’s scheduling + queue wait, which scales superlinearly with concurrency
- Tokenization at 100k tokens costs only 220ms (~500k tok/s on a single CPU core), tiny compared to GPU prefill (2–4 seconds)
- LMCache adds minimal CPU overhead vs HBM prefix cache — the CPU% split is nearly identical between the two strategies
- The real CPU-GPU co-design opportunity is not in making CPU faster, but in overlapping CPU work with GPU work and reducing scheduling contention at high concurrency
1. Motivation: The Hidden CPU Tax in Agentic Inference
Our previous work benchmarked LMCache for multi-turn agentic workloads on MI300X, comparing KV-cache strategies. We measured TTFT, throughput, and cache hit rates. But we treated the inference server as a black box — we never asked where inside the server the time goes.
Agentic AI workloads are not just GPU workloads. Every request passes through a CPU pipeline before and after GPU execution:
Client Server (vLLM) GPU
────── ───────────── ───
│ │ │
│─── serialize request ──────────────│ │
│ (JSON, 0.04-1.3ms) │ │
│ │ │
│ ┌─────────┴──────────┐ │
│ │ HTTP parse │ │
│ │ Tokenize input │ │
│ │ Schedule request │ "HTTP Overhead" │
│ │ KV cache lookup │ (7-3900ms) │
│ │ Queue wait │ │
│ └─────────┬──────────┘ │
│ │ │
│ │──── GPU prefill ─────────────│
│ │ (41-28537ms) │
│ │ │
│ │──── GPU decode (streaming) ──│
│ │ (1780-20792ms) │
│ │ │
│◄── parse SSE response ─────────────│ │
│ (1.9µs per chunk) │ │
The question: at scale (32 concurrent users, 100k token contexts), does the CPU pipeline become a bottleneck?
2. Methodology
2.1 Hardware & Software
| Component | Specification |
|---|---|
| GPU | 2× AMD Instinct MI300X (192 GB HBM3 each), gfx942 |
| CPU | AMD EPYC (ENC1-CLS01-SVR08) |
| Model | MiniMaxAI/MiniMax-M2.5 FP8, TP=2 |
| Framework | vLLM 0.19.0 (ROCm) |
| KV Cache | HBM prefix cache / LMCache CPU DRAM |
| Workload | 739 anonymized Claude Code agentic conversations |
2.2 What We Measured
We decomposed each request into five time components:
| Component | Where | What It Captures |
|---|---|---|
| t_serialize | Client CPU | JSON serialization of the request payload |
| t_http_overhead | Server CPU | HTTP parsing + tokenization + scheduling + queue wait + KV cache lookup |
| t_server_prefill | Server GPU | Attention computation over all input tokens |
| t_decode | Server GPU (mostly) | Autoregressive token generation + streaming |
| t_response_parse | Client CPU | SSE chunk parsing + tool call extraction |
We classify t_serialize + t_http_overhead + t_response_parse as CPU time and t_server_prefill + t_decode as GPU time.
Note: t_http_overhead is measured as the gap between client sending the HTTP request and receiving the first byte back. This includes tokenization, scheduling, queue wait time, and KV cache management — all CPU-side work that happens before the GPU begins prefill. At low concurrency this is mostly tokenization + scheduling. At high concurrency, queue wait dominates.
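For reference, a minimal client-side sketch of how this decomposition can be captured is shown below. It assumes an OpenAI-compatible vLLM streaming endpoint; the URL, model name, and result field names are illustrative, not our exact harness, and the time-to-first-byte is used as the proxy described in the note above.

```python
# Minimal client-side timing sketch (illustrative; not our exact harness).
# Assumes an OpenAI-compatible vLLM endpoint; names and URL are placeholders.
import json
import time

import requests

def time_request(messages, url="http://localhost:8000/v1/chat/completions"):
    payload = {"model": "MiniMax-M2.5", "messages": messages,
               "max_tokens": 128, "stream": True}

    t0 = time.perf_counter()
    body = json.dumps(payload)                      # t_serialize
    t1 = time.perf_counter()

    resp = requests.post(url, data=body, stream=True,
                         headers={"Content-Type": "application/json"})
    t_first_byte = None
    parse_s, n_chunks = 0.0, 0
    for line in resp.iter_lines():                  # SSE stream
        now = time.perf_counter()
        if t_first_byte is None:
            t_first_byte = now                      # first byte back from the server
        if line.startswith(b"data: ") and line != b"data: [DONE]":
            p0 = time.perf_counter()
            json.loads(line[len(b"data: "):])       # t_response_parse, per chunk
            parse_s += time.perf_counter() - p0
            n_chunks += 1
    t_end = time.perf_counter()

    return {
        "t_serialize_ms": (t1 - t0) * 1e3,
        "t_http_overhead_ms": (t_first_byte - t1) * 1e3,  # gap to first byte, as defined above
        "t_stream_ms": (t_end - t_first_byte) * 1e3,       # prefill + decode seen from the client
        "t_response_parse_ms": parse_s * 1e3,
        "chunks": n_chunks,
    }
```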
2.3 Test Matrix
| Scenario | Concurrency | Context | Purpose |
|---|---|---|---|
| single_1k | 1 | 1,000 | Baseline: pure overhead |
| single_8k | 1 | 8,000 | Typical agent turn |
| single_32k | 1 | 32,000 | Large agent context |
| single_100k | 1 | 100,000 | Maximum agent context |
| conc4_8k | 4 | 8,000 | Light multi-tenant |
| conc16_32k | 16 | 32,000 | Medium load |
| conc32_32k | 32 | 32,000 | High load, moderate context |
| conc32_100k | 32 | 100,000 | Stress: high load + large context |
Each scenario was run with 3–5 batches of requests, with results aggregated.
3. Results
3.1 The CPU-GPU Split: It’s All About Concurrency
HBM Prefix Cache Configuration:
| Scenario | Conc | Ctx | HTTP OH (ms) | Prefill (ms) | Decode (ms) | Total (ms) | CPU% | GPU% |
|---|---|---|---|---|---|---|---|---|
| single_1k | 1 | 1K | 7 | 41 | 1,780 | 1,828 | 0.4% | 99.6% |
| single_8k | 1 | 8K | 15 | 124 | 3,142 | 3,282 | 0.5% | 99.5% |
| single_32k | 1 | 32K | 47 | 682 | 7,736 | 8,465 | 0.6% | 99.4% |
| single_100k | 1 | 100K | 131 | 3,555 | 20,792 | 24,479 | 0.6% | 99.4% |
| conc4_8k | 4 | 8K | 53 | 137 | 3,101 | 3,291 | 1.6% | 98.4% |
| conc16_32k | 16 | 32K | 555 | 498 | 7,832 | 8,885 | 6.2% | 93.8% |
| conc32_32k | 32 | 32K | 1,130 | 636 | 7,873 | 9,639 | 11.6% | 88.4% |
| conc32_100k | 32 | 100K | 3,885 | 2,479 | 19,591 | 25,957 | 14.9% | 85.1% |
The pattern is clear: CPU overhead scales with concurrency, not context length.
- Single-request: CPU% is flat at ~0.5% regardless of whether context is 1k or 100k
- At concurrency 32: CPU% jumps to 11–15%
- The dominant CPU cost is t_http_overhead (scheduling + queue wait), not tokenization
3.2 LMCache vs HBM Prefix Cache: CPU Overhead Comparison
LMCache DRAM Configuration (gpu-mem-util=0.78):
| Scenario | Conc | Ctx | HTTP OH (ms) | Prefill (ms) | Decode (ms) | Total (ms) | CPU% | GPU% |
|---|---|---|---|---|---|---|---|---|
| single_1k | 1 | 1K | 7 | 44 | 2,653 | 2,704 | 0.3% | 99.7% |
| single_8k | 1 | 8K | 15 | 178 | 3,376 | 3,569 | 0.4% | 99.6% |
| conc4_8k | 4 | 8K | 50 | 121 | 3,455 | 3,627 | 1.4% | 98.6% |
| conc16_32k | 16 | 32K | 515 | 1,655 | 8,063 | 10,233 | 5.1% | 94.9% |
| conc32_32k | 32 | 32K | 1,135 | 722 | 8,386 | 10,243 | 11.0% | 89.0% |
| conc32_100k | 32 | 100K | 3,937 | 28,537 | 20,769 | 53,244 | 9.8% | 90.2% |
Key comparison — CPU overhead is nearly identical:
| Scenario | HBM-PC CPU% | LMCache CPU% | Delta |
|---|---|---|---|
| single_1k | 0.4% | 0.3% | −0.1% |
| conc4_8k | 1.6% | 1.4% | −0.2% |
| conc16_32k | 6.2% | 5.1% | −1.1% |
| conc32_32k | 11.6% | 11.0% | −0.6% |
| conc32_100k | 14.9% | 9.8% | −5.1% |
LMCache does NOT add measurable CPU overhead. In fact, CPU% is slightly lower with LMCache at high concurrency because LMCache’s CPU DRAM cache reduces HBM pressure, meaning less time in KV block eviction decisions on the CPU side.
The t_http_overhead is nearly identical between the two configs (~1,130–1,135ms at conc32_32k), confirming that the LMCache connector’s CPU-side work (hash computation, cache lookup, DMA scheduling) is negligible.
3.3 Where Does CPU Time Actually Go?
We ran standalone micro-benchmarks to isolate each CPU component:
| Component | Time at 100K tokens | % of HTTP Overhead (conc=32) |
|---|---|---|
| Tokenization (encode) | 220 ms | ~5.7% |
| JSON serialization (request build) | 0.82 ms | <0.1% |
| SHA256 hash (cache key) | 0.62 ms | <0.1% |
| SSE chunk parse (per token) | 1.9 µs | <0.1% |
| Detokenization (128 tokens) | 0.27 ms | <0.1% |
| Scheduling + queue wait | ~3,660 ms | ~94% |
The smoking gun: scheduling + queue wait accounts for ~94% of CPU overhead at high concurrency. Tokenization, hashing, and serialization are negligible.
This makes sense: at 32 concurrent requests, the vLLM scheduler must:
- Decide which requests to batch together
- Walk the prefix cache tree to find matching blocks
- Allocate KV blocks for new tokens
- Manage the preemption queue when HBM is under pressure
- Coordinate across TP workers
Each of these is O(n) or worse in the number of concurrent requests, and they all happen on a single Python thread (GIL-bound).
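To make the per-request cost concrete, here is a toy sketch of block-hash prefix matching — not vLLM's actual implementation, just an illustration of why each scheduling pass does work proportional to prompt length and must be repeated for every queued request.

```python
# Toy sketch of block-hash prefix matching (illustrative only; vLLM's real
# scheduler and KV block manager are more involved).
import hashlib

BLOCK_SIZE = 16  # tokens per KV block

def block_hashes(token_ids):
    """Chain a hash over fixed-size token blocks, like a prefix-cache key."""
    hashes, parent = [], b""
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        h = hashlib.sha256(parent + str(block).encode()).digest()
        hashes.append(h)
        parent = h
    return hashes

def match_prefix(token_ids, cached_blocks):
    """Count how many leading blocks are already cached (the 'tree walk')."""
    hit = 0
    for h in block_hashes(token_ids):
        if h not in cached_blocks:
            break
        hit += 1
    return hit * BLOCK_SIZE  # cached prefix length in tokens

# Every scheduling step repeats this walk for each queued request, so the CPU
# cost scales with (number of queued requests) x (prompt length / BLOCK_SIZE).
```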
3.4 Tokenization Deep-Dive: Linear but Fast
| Tokens | Encode (ms) | Throughput (tok/s) |
|---|---|---|
| 679 | 1.18 | 576,506 |
| 2,711 | 5.09 | 532,379 |
| 5,423 | 10.35 | 523,861 |
| 10,840 | 20.46 | 529,718 |
| 21,679 | 42.72 | 507,414 |
| 43,359 | 87.85 | 493,582 |
| 67,745 | 134.90 | 502,188 |
| 101,615 | 220.38 | 461,085 |
Tokenization scales linearly with input length at ~500k tok/s. Even at 100k tokens (the largest agentic context we tested), tokenization takes only 220ms — under 1% of E2E time for any scenario.
The HuggingFace tokenizers library (Rust-based BPE) is already highly optimized. Switching to a C++ tokenizer would save ~50–100ms at 100k tokens — not enough to matter.
Detokenization (streaming output) is even faster: 0.27ms for 128 output tokens. Per-token streaming overhead is not a concern.
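The numbers above come from standalone timing of the model's tokenizer. A sketch of that kind of micro-benchmark, assuming the HuggingFace AutoTokenizer and a placeholder text file, looks like this:

```python
# Tokenizer throughput micro-benchmark sketch (model path and corpus are placeholders).
import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/work/models/MiniMax-M2.5")
text = open("sample_agent_context.txt").read()  # e.g. a long agentic transcript

for factor in (1, 4, 16, 64):
    chunk = text * factor
    t0 = time.perf_counter()
    ids = tokenizer.encode(chunk)
    dt = time.perf_counter() - t0
    print(f"{len(ids):>8} tokens  {dt*1e3:8.2f} ms  {len(ids)/dt:,.0f} tok/s")
```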
4. Analysis: The Scheduling Wall
4.1 Why Scheduling Dominates at High Concurrency
The t_http_overhead captures everything from HTTP request receipt to first GPU kernel launch. At concurrency 1, it’s dominated by tokenization (~220ms for 100k). At concurrency 32, it balloons to 3,885ms — a 30× increase.
The growth is superlinear with concurrency:
| Concurrency | HTTP Overhead (32K ctx) | Growth Factor |
|---|---|---|
| 1 | 47 ms | 1.0× |
| 4 | 53 ms | 1.1× |
| 16 | 555 ms | 11.8× |
| 32 | 1,130 ms | 24.0× |
This superlinear scaling points to contention in the scheduling path:
- Python GIL: vLLM’s scheduler runs in the main asyncio event loop. At 32 concurrent requests, the GIL serializes scheduling decisions, tokenization, and HTTP handling.
- Prefix cache tree walks: With prefix caching enabled, every scheduling decision walks the block hash tree. At high concurrency with diverse prompts, the tree grows and walks become expensive.
- Block allocation contention: The KV block allocator must coordinate free/used block tables across TP workers.
- Queue wait: When the GPU is saturated, requests queue in the scheduler waiting for slots.
4.2 The 15% Rule
Our data suggests a practical rule of thumb:
At production-level concurrency (16–32 users), CPU overhead consumes 10–15% of E2E latency on MI300X.
This means that even with an infinitely fast GPU, you would only recover 85–90% of theoretical speedup. The remaining 10–15% is CPU-bound.
For a concrete example: at conc32_100k with HBM prefix cache, total E2E is 25,957ms. GPU time is 22,070ms (prefill + decode). Even if GPU time went to zero, the CPU overhead of 3,887ms would remain — setting a hard floor on latency.
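The same bound expressed as a quick calculation (figures from the conc32_100k row above):

```python
# Latency floor from the conc32_100k (HBM prefix cache) measurement.
e2e_ms = 25_957
gpu_ms = 2_479 + 19_591          # prefill + decode = 22,070 ms
cpu_ms = e2e_ms - gpu_ms         # 3,887 ms of CPU-side overhead

cpu_fraction = cpu_ms / e2e_ms   # ~0.15
max_speedup = e2e_ms / cpu_ms    # ~6.7x ceiling, even with an infinitely fast GPU
print(f"CPU fraction: {cpu_fraction:.1%}, max end-to-end speedup: {max_speedup:.1f}x")
```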
5. Optimization Recommendations
Tier 1: High Impact, Framework-Level
| Optimization | Expected Impact | Effort |
|---|---|---|
| Pipeline scheduling with GPU execution | 5–10% E2E at high concurrency | Medium |
| Move tokenization off main event loop | 2–3% at high concurrency | Low |
| Batch scheduling decisions | 3–5% at high concurrency | Medium |
| Pre-allocate KV blocks speculatively | 2–3% at high concurrency | Medium |
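To illustrate the second Tier-1 item ("move tokenization off the main event loop"), here is a hedged sketch of pushing tokenization into a worker pool so it never blocks the serving event loop. It shows the overlap pattern only — this is not vLLM's internal code, and the model path is a placeholder.

```python
# Sketch: keep tokenization off the asyncio event loop so it cannot block
# scheduling or HTTP handling (pattern illustration, not vLLM internals).
import asyncio
from concurrent.futures import ProcessPoolExecutor
from transformers import AutoTokenizer

_tokenizer = None

def _encode(prompt: str):
    global _tokenizer
    if _tokenizer is None:   # one tokenizer instance per worker process
        _tokenizer = AutoTokenizer.from_pretrained("/work/models/MiniMax-M2.5")
    return _tokenizer.encode(prompt)

pool = ProcessPoolExecutor(max_workers=4)

async def handle_request(prompt: str):
    loop = asyncio.get_running_loop()
    # The event loop stays free to accept and schedule other requests
    # while this prompt is being tokenized in a worker process.
    token_ids = await loop.run_in_executor(pool, _encode, prompt)
    return token_ids
```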
Tier 2: System-Level Tuning
| Optimization | Expected Impact | Effort |
|---|---|---|
| NUMA affinity (pin workers to GPU-local node) | 1–2% | Low |
| CPU frequency governor (performance mode) | 0.5–1% | Trivial |
| Dedicated CPU cores for scheduler (isolcpus) | 1–2% | Low |
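The NUMA affinity idea, sketched in Python with illustrative core IDs (the GPU-local NUMA node can be found with `numactl --hardware`; which cores belong to it is system-specific):

```python
# Pin the current process (e.g., a scheduler or worker process) to CPU cores
# on the GPU-local NUMA node. Core IDs below are illustrative assumptions.
import os

GPU_LOCAL_CORES = set(range(0, 32))        # assumption: NUMA node 0 is GPU-local

os.sched_setaffinity(0, GPU_LOCAL_CORES)   # 0 = current process
print("pinned to cores:", sorted(os.sched_getaffinity(0)))
```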
Tier 3: Not Worth Optimizing
| Component | Why Not |
|---|---|
| Tokenizer speed | Already 500k tok/s, <1% of E2E |
| JSON serialization | <1ms even at 100k tokens |
| SSE parsing | 1.9µs per chunk — effectively zero |
| LMCache hash/lookup | <1ms even at 100k tokens |
| Detokenization | 0.27ms for 128 output tokens |
6. Key Takeaways
For inference platform teams:
- CPU overhead is real but bounded. At 32 concurrent users, 10–15% of E2E latency is CPU. This sets a floor on achievable latency regardless of GPU speed.
- Scheduling is the bottleneck, not tokenization. Don’t waste time optimizing the tokenizer — optimize the scheduler and its interaction with the KV cache manager.
- LMCache adds zero measurable CPU overhead. The cache connector’s hash/lookup/DMA scheduling cost is lost in the noise. If you’re avoiding LMCache because of CPU concerns, don’t.
- The GIL is the elephant in the room. At 32+ concurrent requests, the Python GIL serializes scheduling, tokenization, and HTTP handling. Multi-process architectures (like vLLM V1’s separated EngineCore) are the right direction.
For hardware architects:
- CPU performance matters for inference at scale. A faster CPU won’t help a single request, but it directly impacts latency at 16+ concurrent users.
- PCIe/Infinity Fabric bandwidth is not the CPU bottleneck. The CPU overhead is all compute (scheduling, hash computation, Python interpretation), not data transfer.
- NUMA topology matters. Ensuring scheduler threads run on CPU cores local to the GPU’s NUMA node reduces memory access latency for KV block table management.
For the agentic AI community:
- The CPU-GPU co-design question is a scheduling problem, not a compute problem. The path forward is better overlap between CPU scheduling and GPU execution.
- Context length matters less than concurrency. A single 100k-token request has 0.6% CPU overhead. Thirty-two 1k-token requests have 11%+ CPU overhead. If you’re scaling to many concurrent agent sessions, CPU efficiency of the scheduler is critical.
7. Open Questions & Future Directions
7.1 Can We Eliminate the CPU Bottleneck? Rust, No-GIL, and Beyond
Our data shows that 94% of CPU overhead is scheduling + queue wait, not tokenization or serialization. This has direct implications for optimization strategies:
Rewriting the scheduler in Rust or C++:
The vLLM scheduler today is pure Python — prefix tree walks, block allocation, preemption logic, all running under the GIL. Rewriting the hot path in Rust (via PyO3) or C++ (via pybind11) could yield significant gains:
| Component | Current (Python) | Estimated (Rust) | Speedup | Impact on E2E |
|---|---|---|---|---|
| Prefix tree walk | O(n) per request, GIL-held | O(n) but no GIL, SIMD-friendly | 5–10× | 2–4% at conc=32 |
| Block allocation | Dict lookups + list ops | Lock-free concurrent allocator | 10–20× | 1–2% at conc=32 |
| Hash computation | Python hash() | Rust xxhash / blake3 | 3–5× | <0.5% (already fast) |
| Request batching | Python list sorting | Rust rayon parallel sort | 5–10× | 1–2% at conc=32 |
Total estimated E2E improvement: 4–8% at conc=32 from a Rust scheduler rewrite. This is meaningful but not transformative — the real win is eliminating GIL contention, not raw speed.
Removing the Python GIL:
Python 3.13+ introduced experimental free-threaded mode (--disable-gil). For vLLM, this could be transformative:
- Currently: tokenization, scheduling, HTTP handling, and detokenization all serialize through the GIL
- Without GIL: these can truly parallelize across CPU cores
- The t_http_overhead at conc=32 (1,130ms for 32K context) includes substantial GIL contention — multiple requests competing for the same Python thread
- Estimated impact: 20–40% reduction in t_http_overhead at high concurrency, translating to 3–6% E2E improvement
However, GIL removal has risks:
- vLLM’s internal data structures (block tables, prefix cache tree) would need thread-safe redesign
- Many Python C extensions assume GIL protection
- The torch runtime itself has GIL interactions during tensor operations
The pragmatic path — vLLM V1’s multi-process architecture:
vLLM V1 already separates the EngineCore (scheduler) from the APIServer (HTTP handling) into different processes. This is effectively a GIL bypass:
APIServer (Process 1) EngineCore (Process 2) Workers (Process 3+)
├── HTTP parsing ├── Scheduling ├── GPU prefill
├── Tokenization ├── Block allocation ├── GPU decode
├── Request routing ├── Cache management ├── KV transfers
└── SSE streaming └── Preemption logic └── Sampling
│ │ │
└── IPC (shared mem) ─────┘ │
└── IPC (shared mem) ────┘
This architecture already eliminates most GIL contention. Our measurements show that vLLM 0.19.0 (which uses V1) achieves reasonable scaling — the 15% CPU overhead at conc=32 is after the multi-process split. Without it, we’d likely see 25–30%.
Recommendation: The highest-ROI optimization is pipelining scheduling with GPU execution — start scheduling the next batch while the current batch is still executing on GPU. This doesn’t require any language change, just better overlap in the EngineCore.
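A toy illustration of that overlap follows. The two work functions are stubs that just sleep — nothing here is a vLLM API — but the structure shows the idea: the CPU prepares batch N+1 while batch N is still "executing" on the GPU.

```python
# Toy pipeline: overlap CPU "scheduling" of batch N+1 with "GPU execution" of
# batch N. The work functions are sleeping stubs; the point is the overlap.
import queue
import threading
import time

def schedule_batch(pending):
    time.sleep(0.035)          # stand-in for prefix matching + block allocation
    return [pending.pop() for _ in range(min(8, len(pending)))]

def gpu_execute(batch):
    time.sleep(0.100)          # stand-in for a prefill/decode step on the GPU

ready = queue.Queue(maxsize=1)  # at most one pre-scheduled batch in flight

def scheduler_loop(pending):
    while pending:
        ready.put(schedule_batch(pending))  # runs while the GPU works on batch N
    ready.put(None)

pending = list(range(64))
threading.Thread(target=scheduler_loop, args=(pending,), daemon=True).start()
t0 = time.perf_counter()
while (batch := ready.get()) is not None:
    gpu_execute(batch)
print(f"pipelined wall time: {time.perf_counter() - t0:.2f}s")  # vs ~1.08s serial
```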
7.2 Sub-Agent Explosion: What Happens at 12× Concurrency?
Modern agentic frameworks (Claude Code, OpenHands, SWE-Agent) routinely spawn sub-agents. A single user session might fork into 4–12 parallel sub-agents for tasks like:
- Searching multiple codebases simultaneously
- Running parallel tool calls (web search + file read + code execution)
- Exploring multiple solution paths (tree-of-thought)
The math gets scary fast:
If 4 users each spawn 3 sub-agents, you have 4 × (1 + 3) = 16 effective concurrent sessions. If each spawns 12 sub-agents: 4 × (1 + 12) = 52 concurrent sessions.
Extrapolating from our data:
| Users | Sub-agents/user | Effective Conc | Est. CPU% | Est. HTTP OH (32K) |
|---|---|---|---|---|
| 4 | 0 | 4 | 1.6% | 53 ms |
| 4 | 3 | 16 | 6.2% | 555 ms |
| 4 | 12 | 52 | 20–25% | ~3,000 ms |
| 8 | 12 | 104 | 30–40% | ~8,000+ ms |
At 52 effective concurrent requests, our superlinear scaling model predicts:
- HTTP overhead would reach ~3,000ms (vs 1,130ms at conc=32) — that’s 3 seconds of pure CPU wait before a single GPU kernel fires
- CPU% of E2E could hit 20–25%, meaning one quarter of your GPU investment is wasted on CPU scheduling
- The prefix cache tree would become deep and wide (52 diverse conversation prefixes), making tree walks even more expensive
Sub-agent-specific challenges:
- Prefix divergence: Sub-agents share a common parent prefix but diverge quickly (different tool calls, different search results). This creates a bushy prefix tree that’s expensive to walk but has high reuse potential — exactly the regime where LMCache’s L2 tier pays off.
- Bursty arrival patterns: Sub-agents don’t arrive at a steady rate — they burst (a parent spawns 12 children simultaneously). The scheduler must absorb this burst, and queue wait time spikes.
- Priority inversion: The parent agent is blocked waiting for sub-agent results. If sub-agents are queued behind other users’ requests, the parent’s end-to-end latency multiplies.
Co-design implications:
- Request routing becomes critical: With 52+ concurrent sessions, a single vLLM instance may not be enough. Disaggregated serving (separate prefill and decode nodes) or multi-instance routing could reduce per-instance scheduling pressure.
- Sub-agent-aware scheduling: A scheduler that understands parent-child relationships could prioritize sub-agents of the same parent to complete a “generation” faster, rather than round-robin across all requests.
- Shared prefix optimization: Sub-agents from the same parent share ~80–90% of their prefix. A scheduler that detects this and batches sibling sub-agents together for prefill could dramatically reduce redundant computation.
7.3 Hybrid Workloads: Database Queries, RAG, and Tool Execution
Real agentic workloads don’t just call the LLM — they interleave LLM inference with CPU/IO-bound operations:
Turn 1: LLM generates SQL query (GPU: 2-5s)
Turn 2: Execute SQL against database (CPU/IO: 50-500ms)
Turn 3: LLM analyzes results (GPU: 3-8s)
Turn 4: Retrieve documents from vector DB (CPU/IO: 20-200ms)
Turn 5: LLM synthesizes final answer (GPU: 5-15s)
The inter-turn gap is a new CPU cost we didn’t measure:
Our benchmark focused on the intra-request CPU-GPU split (what happens inside a single LLM call). But agentic workloads have a second CPU cost: the inter-turn gap — the time between the LLM finishing one turn and the next turn’s prompt being ready.
This gap includes:
| Operation | Typical Latency | Where |
|---|---|---|
| Tool call parsing | 0.1–1 ms | Client CPU |
| Database query (PostgreSQL) | 5–500 ms | External service |
| Vector DB retrieval (FAISS/pgvector) | 10–200 ms | CPU + sometimes GPU |
| Web API call (search, code execution) | 100–2,000 ms | Network + external |
| Result formatting + context assembly | 1–10 ms | Client CPU |
| Re-tokenization of updated context | 50–220 ms | Server CPU |
Performance implications:
- GPU idle time: During tool execution, the GPU allocated to this user’s session sits idle. At 100k context, the KV cache for one session holds ~12 GB of HBM. If tool execution takes 500ms, that is 12 GB of HBM stranded for 500ms — memory that could have served other requests.
- The KV cache cold-start problem: If the scheduler evicts this session’s KV blocks during tool execution (to serve other requests), the next turn must re-prefill the entire context. This is exactly the scenario where LMCache’s CPU DRAM tier shines — it preserves KV state across tool-execution gaps at negligible cost.
- CPU contention between tool execution and scheduling: If tool execution (database queries, vector search) runs on the same CPU cores as the vLLM scheduler, it competes for CPU resources. At high concurrency + frequent tool calls, this could push CPU overhead well beyond the 15% we measured for pure LLM inference.
Estimated E2E impact of hybrid workloads:
| Workload Type | LLM Time | Tool Time | Inter-turn OH | GPU Idle % | Effective CPU% |
|---|---|---|---|---|---|
| Pure chat | 100% | 0% | ~0% | ~0% | 10–15% |
| Light tools (search) | 70% | 20% | 10% | 15–20% | 20–25% |
| Heavy tools (DB + RAG) | 50% | 35% | 15% | 25–35% | 25–35% |
| Code execution agents | 40% | 45% | 15% | 35–45% | 30–40% |
For code execution agents (the Claude Code use case our traces come from), CPU and IO operations may consume 40–50% of wall-clock time, with GPU active only 50–60% of the time. This fundamentally changes the co-design equation:
- For pure LLM serving: Buy the best GPU, CPU barely matters
- For agentic serving: CPU, memory bandwidth, and IO become co-equal with GPU. System balance matters more than peak GPU FLOPS.
Optimization strategies for hybrid workloads:
- Speculative prefetching: While the LLM generates a tool call, pre-warm likely next-turn prefixes based on the tool type. For example, if the model calls search(), pre-tokenize a template like "Search results: {placeholder}" to have partial KV cache ready.
- KV cache reservation: Reserve a “parking” slot in CPU DRAM for active sessions during tool execution, preventing eviction. LMCache already enables this — the question is whether to make it tool-call-aware.
- Separate CPU pools: Dedicate specific CPU cores to vLLM scheduling and others to tool execution. NUMA-aware pinning becomes critical: vLLM scheduler threads on cores near the GPU, tool execution threads on cores near the NIC (for database queries) or NVMe (for document retrieval).
- Async tool execution with GPU overlap: Execute tool calls concurrently with other users’ LLM inference, then “re-inject” the results when ready. This requires the scheduler to support interruptible sessions — start other requests during the tool gap, then preempt them when the tool-calling session is ready to continue. A sketch of this pattern follows the list.
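A minimal asyncio sketch of that last pattern, with stub coroutines standing in for the LLM call and the tool invocation. The only point being made is that one session's tool-execution gap does not block the other sessions.

```python
# Sketch of async tool execution overlapping with other sessions' inference.
# llm_call() and run_tool() are sleeping stubs standing in for a streamed LLM
# request and a database/RAG/tool invocation; names are placeholders.
import asyncio

async def llm_call(session, prompt):
    await asyncio.sleep(0.5)          # stand-in for a streamed LLM response
    return f"{session}: tool_call(search)"

async def run_tool(call):
    await asyncio.sleep(0.3)          # stand-in for DB query / vector search / API
    return "tool results"

async def agent_session(session):
    first = await llm_call(session, "user request")
    results = await run_tool(first)   # while awaiting, other sessions keep running
    return await llm_call(session, f"analyze: {results}")

async def main():
    # Four sessions interleave; wall time ~1.3s instead of 4 x 1.3s serially.
    print(await asyncio.gather(*(agent_session(f"s{i}") for i in range(4))))

asyncio.run(main())
```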
Appendix: Reproduction
Environment
# Container
docker run -d --name lmcache-bench --entrypoint /bin/bash \
--device=/dev/kfd --device=/dev/dri --network=host --ipc=host \
--group-add video --cap-add SYS_PTRACE \
-v /mnt/nvme3n1p1/models:/work/models \
vllm/vllm-openai-rocm:v0.19.0 -c "sleep infinity"
# LMCache (source build for ROCm)
docker exec lmcache-bench bash -c "
git clone --depth 1 https://github.com/LMCache/LMCache.git /work/LMCache
cd /work/LMCache && BUILD_WITH_HIP=1 pip install -e . --no-build-isolation
pip uninstall -y nixl nixl-cu12 cupy-cuda12x cufile-python cuda-pathfinder
"
Server Configs
HBM Prefix Cache:
VLLM_FLOAT32_MATMUL_PRECISION=high \
vllm serve /work/models/MiniMax-M2.5 \
--tensor-parallel-size 2 --enable-prefix-caching \
--gpu-memory-utilization 0.85 --host 0.0.0.0 --port 8000
LMCache DRAM:
PYTHONHASHSEED=0 VLLM_FLOAT32_MATMUL_PRECISION=high \
LMCACHE_LOCAL_CPU=true LMCACHE_CHUNK_SIZE=256 \
vllm serve /work/models/MiniMax-M2.5 \
--tensor-parallel-size 2 --enable-prefix-caching \
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}' \
--gpu-memory-utilization 0.78 --host 0.0.0.0 --port 8000
Benchmark Scripts
All scripts and raw data are available at github.com/andyluo7/cpu-gpu-codesign-agentic-inference.
This analysis accompanies our LMCache multi-turn agentic benchmark and uses the same hardware, model, and workload traces.