ATOM + LMCache: KV Cache Offloading with AMD's Optimized Plugin for vLLM
When the baseline is already optimized, does KV cache offloading still help? We tested ATOM — AMD’s high-performance inference engine that integrates with vLLM as an out-of-tree plugin — with LMCache on MI300X. The answer: yes, and more than you’d expect.
Key Summary
This is a sequel to our previous benchmark of LMCache on MI300X. That blog tested vanilla vLLM + LMCache. This one tests ATOM (AiTer Optimized Model) — AMD’s high-performance inference engine built on AITER kernels that integrates with vLLM as an out-of-tree plugin — combined with LMCache, and adds an NVMe L3 tier.
We ran 739 Claude Code agentic traces against MiniMax-M2.5 (456B MoE, FP8) on 2× MI300X under three configurations: ATOM HBM-only, ATOM + LMCache CPU, and ATOM + LMCache CPU+NVMe.
Headline findings:
- ATOM + LMCache CPU delivers 2.4× lower median TTFT and 59% more requests vs ATOM HBM-only under stress (32 users, 100K context)
- Adding NVMe as an L3 tier cuts p95 TTFT by 41% on top of the CPU-only LMCache result — the long tail compresses dramatically
- ATOM’s FP8 KV cache halves KV memory vs our previous BF16 runs — more room for HBM prefix cache, less L2 pressure, but L2 still pays off decisively under load
- LMCache backend choice is critical: the default PyPI wheel uses a Python fallback on ROCm that made LMCache 1.7× slower than baseline. Source-building with BUILD_WITH_HIP=1 is mandatory.
- CUDA graphs must be explicitly enabled: ATOM sets enforce_eager=True by default. Without the override, 3-5× throughput loss.
1. Introduction
From vanilla vLLM to ATOM
Our previous blog established that LMCache works on MI300X and wins decisively under stress — 3× lower TTFT, 2.3× more requests when HBM prefix cache gets overwhelmed.
But that test used stock vLLM. AMD ships ATOM (AiTer Optimized Model), a high-performance inference engine purpose-built for AMD Instinct GPUs. As described in the vLLM-ATOM blog, ATOM integrates with vLLM as an out-of-tree plugin — not a fork — preserving vLLM’s existing APIs and batching paths while delivering AMD-native attention, model execution, and optimized MoE routing via AITER kernels. It adds FP8/MXFP4 quantized KV cache support, fused QK-norm/RoPE/cache operations, piecewise torch.compile with CUDA graph capture, and INT4 quick-reduce for tensor-parallel all-reduce. ATOM registers itself via vLLM’s platform plugin entry point, so it’s a drop-in — no vLLM source patches needed.
The natural question: does LMCache still help when the serving baseline is already optimized? If ATOM’s FP8 KV cache doubles effective HBM capacity, maybe there’s enough room for the prefix cache and L2 offloading becomes unnecessary?
Short answer: no. Under real agentic workloads, HBM still fills up, and the L2 tier still pays off. The crossover just shifts.
What changed from the previous blog
| Component | Previous blog | This blog |
|---|---|---|
| vLLM backend | Stock vLLM 0.19.0 | ATOM v0.1.3.dev203 (vLLM plugin) |
| vLLM version | 0.19.0 | 0.19.1 |
| Attention | Default ROCm FA | AITER (via ATOM) |
| KV cache dtype | BF16 | FP8 |
| CUDA graphs | Default | Explicit: FULL_AND_PIECEWISE |
| TP all-reduce | Default | INT4 quick-reduce (AITER) |
| Container | vllm/vllm-openai-rocm:v0.19.0 | rocm/atom-dev:vllm-latest |
| LMCache tiers | HBM + CPU DRAM | HBM + CPU DRAM + NVMe |
| PyTorch | Default | 2.10.0+rocm7.2.3 |
| ROCm | 7.0.0 | 7.2.3 |
2. Architecture
The ATOM + LMCache stack
┌──────────────────────────────────────────┐
│ trace_replay_tester.py (client) │
│ • 739 anonymized Claude Code traces │
│ • Cooldown-gated user ramp │
│ • Working-set + period budgets │
└──────────────┬───────────────────────────┘
│ OpenAI HTTP /v1/chat/completions
▼
┌──────────────────────────────────────────┐
│ vLLM 0.19.1 + ATOM plugin │
│ ──────────────────────────────────── │
│ ATOM platform plugin (auto-registered) │
│ • AITER attention kernels │
│ • FP8 KV cache quantization │
│ • INT4 quick-reduce (TP all-reduce) │
│ • Fused QK-norm/RoPE/cache quant │
│ ──────────────────────────────────── │
│ Scheduler → Prefix-cache (HBM, FP8) │
│ ──────────│───────────────── │
│ │ KV connector V1 hook │
│ ▼ │
│ ┌────────────────────────┐ │
│ │ LMCacheConnectorV1 │ │
│ │ (BUILD_WITH_HIP=1) │ │
│ └──────┬─────────────────┘ │
│ │ │
│ ┌────┴────┬────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ GPU HBM CPU DRAM NVMe SSD │
│ L1 (FP8) L2 (64 GB) L3 (optional) │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ MiniMax-M2.5 (456B MoE, 256 experts) │
│ FP8, ~230 GB, TP=2 across 2× MI300X │
└──────────────────────────────────────────┘
What ATOM changes under the hood
ATOM is a vLLM out-of-tree plugin — as detailed in the vLLM-ATOM blog, it doesn’t fork vLLM but registers itself at startup and replaces key compute paths:
- Attention: AITER flash attention replaces the default ROCm FA backend. Tuned for gfx942 wavefronts.
- KV cache quantization: --kv-cache-dtype fp8 halves per-token KV memory. A 100K-token MiniMax-M2.5 KV cache drops from ~12 GB (BF16) to ~6 GB (FP8). That’s real HBM headroom.
- TP all-reduce: AITER_QUICK_REDUCE_QUANTIZATION=INT4 quantizes the all-reduce payload to INT4, cutting cross-GPU bandwidth by 4× at negligible quality cost.
- Fused ops: ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1 fuses QK-normalization, rotary position embedding, and cache quantization into a single kernel, for fewer HBM round-trips.
- CUDA graphs: Must be explicitly enabled via --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' because ATOM’s default sets enforce_eager=True.
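The FP8 halving can be sanity-checked with back-of-the-envelope arithmetic. The sketch below uses only the approximate figure quoted above (~12 GB of BF16 KV for a 100K-token context); the function names are illustrative, and the calculation ignores any per-block scale metadata that an FP8 cache may carry.

```python
# Back-of-the-envelope KV sizing from the approximate figures in this post.
# Assumption: ~12 GB of BF16 KV for a 100K-token context (decimal GB).

BF16_BYTES = 2  # bytes per element
FP8_BYTES = 1


def kv_bytes_per_token(total_gb: float, tokens: int) -> float:
    """Average KV bytes per token implied by a total cache size."""
    return total_gb * 1e9 / tokens


def rescale_for_dtype(bytes_per_token: float, old_width: int, new_width: int) -> float:
    """KV size scales linearly with element width if the layout is unchanged."""
    return bytes_per_token * new_width / old_width


bf16_per_token = kv_bytes_per_token(12.0, 100_000)
fp8_per_token = rescale_for_dtype(bf16_per_token, BF16_BYTES, FP8_BYTES)
fp8_context_gb = fp8_per_token * 100_000 / 1e9

print(f"BF16: {bf16_per_token / 1e3:.0f} KB/token")
print(f"FP8:  {fp8_per_token / 1e3:.0f} KB/token, {fp8_context_gb:.1f} GB per 100K tokens")
```

At roughly 60 KB per token in FP8, every additional 100K-token conversation still costs about 6 GB, which is why HBM fills up under the stress load despite the halving.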
Three test arms
| Arm | LMCache config | What’s cached |
|---|---|---|
| A: ATOM HBM-only | None | FP8 KV blocks in HBM, LRU evicted when full |
| B: ATOM + LMCache CPU | LMCACHE_LOCAL_CPU=true, 64 GB | HBM L1 (FP8) + 64 GB CPU DRAM L2 |
| C: ATOM + LMCache CPU+NVMe | CPU + LMCACHE_LOCAL_DISK=/nvme/lmcache | HBM L1 (FP8) + CPU DRAM L2 + NVMe L3 |
The NVMe L3 tier is new in this blog. When the CPU DRAM L2 fills up, LMCache spills evicted KV blocks to local NVMe. Retrieval latency goes from ~100μs (DRAM) to ~1ms (NVMe), but capacity jumps from 64 GB to terabytes. For long-running agentic sessions where KV states accumulate over hours, the L3 tier prevents permanent eviction.
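The spill behaviour can be illustrated with a toy model. This is a minimal sketch of a cascading LRU hierarchy, not LMCache's actual implementation: the Tier class, the block-count capacities, and the promote-on-hit policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class Tier:
    """One cache tier with LRU eviction, sized in blocks (toy model)."""

    def __init__(self, name, capacity, lower=None):
        self.name = name
        self.capacity = capacity
        self.lower = lower            # next tier down, or None
        self.blocks = OrderedDict()   # key -> payload, least recent first

    def put(self, key, payload):
        self.blocks[key] = payload
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.capacity:
            old_key, old_payload = self.blocks.popitem(last=False)
            if self.lower is not None:
                self.lower.put(old_key, old_payload)  # spill instead of dropping
            # With no lower tier, the evicted block is permanently lost.

    def take(self, key):
        """Remove and return (payload, tier_name) from this tier or any below."""
        if key in self.blocks:
            return self.blocks.pop(key), self.name
        if self.lower is not None:
            return self.lower.take(key)
        return None, None

    def get(self, key):
        """Look up a block; a hit in any tier is promoted back to this tier."""
        payload, where = self.take(key)
        if payload is not None:
            self.put(key, payload)
        return payload, where


def replay(with_nvme):
    """Insert four conversations, then ask for the oldest one again."""
    nvme = Tier("L3-nvme", capacity=4) if with_nvme else None
    dram = Tier("L2-dram", capacity=2, lower=nvme)
    hbm = Tier("L1-hbm", capacity=1, lower=dram)
    for conv in ("a", "b", "c", "d"):
        hbm.put(conv, f"kv-{conv}")
    return hbm.get("a")


print(replay(with_nvme=True))   # ('kv-a', 'L3-nvme')
print(replay(with_nvme=False))  # (None, None)
```

In the two-tier run, conversation "a" has been pushed out of DRAM and must be recomputed; in the three-tier run it is still retrievable from the bottom tier.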
3. Implementation
Step 1: ATOM container
ATOM ships as a pre-built container with vLLM, AITER, and ROCm 7.2.3:
docker run -d --name atom-lmcache --entrypoint /bin/bash \
--device=/dev/kfd --device=/dev/dri --network=host --ipc=host \
--group-add video --cap-add SYS_PTRACE \
-v /mnt/nvme/models:/work/models \
-v /mnt/nvme/lmcache:/nvme/lmcache \
rocm/atom-dev:vllm-latest \
-c "sleep infinity"
The -v /mnt/nvme/lmcache:/nvme/lmcache mount is for the NVMe L3 tier. Skip it if you’re only testing HBM + CPU DRAM.
Step 2: Build LMCache from source with HIP
This is the same as the previous blog, but the mistake is easier to make now:
docker exec atom-lmcache bash -c "
git clone --depth 1 https://github.com/LMCache/LMCache.git /work/LMCache
cd /work/LMCache && BUILD_WITH_HIP=1 pip install -e . --no-build-isolation
pip install aiofile # Required for LMCache GDS/NVMe backend
"
Why aiofile? LMCache’s disk backend uses aiofile for async NVMe I/O. Without it, enabling the disk path either raises an ImportError or falls back to synchronous writes that stall the event loop.
Step 3: The critical backend check
pip install lmcache from PyPI gives you a CUDA-linked c_ops.so. On ROCm, this doesn’t crash — it silently falls back to Python-native non_cuda_equivalents. In our first attempt, this fallback made LMCache 1.7× slower than having no cache at all. The overhead of Python-side KV block serialization exceeded the prefill compute it was saving.
Verify you have the HIP backend:
python3 -c "from lmcache.storage_backend.connector import c_ops; print(c_ops.__file__)"
# Should show a .so built from HIP sources, NOT a Python .py file
If you see a .py path or non_cuda_equivalents, rebuild from source.
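The same check can be scripted so it runs as part of a deployment. The sketch below is generic: it classifies any module by the file it would load from, using only the standard library. The helper names are hypothetical, and the module path to pass for LMCache's c_ops depends on your LMCache version, so it is demonstrated here on stdlib modules only.

```python
import importlib.util

NATIVE_SUFFIXES = (".so", ".pyd", ".dylib")


def classify_origin(origin):
    """Classify a module's file origin as 'native', 'python', or 'unknown'."""
    if not origin:
        return "unknown"
    if origin.endswith(NATIVE_SUFFIXES):
        return "native"
    if origin.endswith((".py", ".pyc")):
        return "python"
    return "unknown"  # e.g. 'built-in' or 'frozen'


def backend_kind(module_name):
    """Locate a module (without executing the leaf module) and classify it."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        return "missing"  # a parent package could not be imported
    if spec is None:
        return "missing"
    return classify_origin(spec.origin)


print(backend_kind("json"))                       # pure-Python stdlib package
print(backend_kind("definitely_not_installed_x"))  # not installed
```

A deployment script could fail fast when the module that provides c_ops does not classify as "native".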
Step 4: ATOM recipe config
ATOM’s performance knobs, set as environment variables and server flags:
# ATOM environment
export AITER_QUICK_REDUCE_QUANTIZATION=INT4
export ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1
# LMCache environment
export PYTHONHASHSEED=0
export LMCACHE_LOCAL_CPU=true
export LMCACHE_CHUNK_SIZE=256
export LMCACHE_MAX_LOCAL_CPU_SIZE=64
export LMCACHE_LOCAL_DISK=/nvme/lmcache # omit for CPU-only arm
Step 5: Launch
docker exec -d atom-lmcache bash -c "
AITER_QUICK_REDUCE_QUANTIZATION=INT4 \
ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1 \
VLLM_FLOAT32_MATMUL_PRECISION=high \
PYTHONHASHSEED=0 \
LMCACHE_LOCAL_CPU=true \
LMCACHE_CHUNK_SIZE=256 \
LMCACHE_MAX_LOCAL_CPU_SIZE=64 \
LMCACHE_LOCAL_DISK=/nvme/lmcache \
vllm serve /work/models/MiniMax-M2.5 \
--tensor-parallel-size 2 --gpu-memory-utilization 0.78 \
--kv-cache-dtype fp8 \
--async-scheduling \
--enable-prefix-caching \
--compilation-config '{\"cudagraph_mode\": \"FULL_AND_PIECEWISE\"}' \
--max-num-batched-tokens 16384 \
--kv-transfer-config '{\"kv_connector\":\"LMCacheConnectorV1\",\"kv_role\":\"kv_both\"}' \
--tool-call-parser minimax_m2 --reasoning-parser minimax_m2 \
--enable-auto-tool-choice --trust-remote-code \
--host 0.0.0.0 --port 8000
"
For the HBM-only arm (Arm A), remove --kv-transfer-config and the LMCACHE_* env vars.
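Since the three arms differ only in a handful of settings, it can help to generate the launch configuration from one place. The following is a sketch under the assumption that you launch vllm serve from a Python wrapper; build_arm is a hypothetical helper, and every environment variable and flag in it is copied from the launch command above.

```python
import json

# Settings shared by all three arms (values copied from the launch command).
BASE_ENV = {
    "AITER_QUICK_REDUCE_QUANTIZATION": "INT4",
    "ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION": "1",
    "VLLM_FLOAT32_MATMUL_PRECISION": "high",
    "PYTHONHASHSEED": "0",
}

LMCACHE_CPU_ENV = {
    "LMCACHE_LOCAL_CPU": "true",
    "LMCACHE_CHUNK_SIZE": "256",
    "LMCACHE_MAX_LOCAL_CPU_SIZE": "64",
}


def build_arm(arm, model="/work/models/MiniMax-M2.5", gmu="0.78"):
    """Return (env, argv) for arm 'A' (HBM-only), 'B' (CPU) or 'C' (CPU+NVMe)."""
    if arm not in ("A", "B", "C"):
        raise ValueError(f"unknown arm: {arm!r}")
    env = dict(BASE_ENV)
    argv = [
        "vllm", "serve", model,
        "--tensor-parallel-size", "2",
        "--gpu-memory-utilization", gmu,
        "--kv-cache-dtype", "fp8",
        "--async-scheduling",
        "--enable-prefix-caching",
        "--compilation-config", json.dumps({"cudagraph_mode": "FULL_AND_PIECEWISE"}),
        "--max-num-batched-tokens", "16384",
        "--tool-call-parser", "minimax_m2",
        "--reasoning-parser", "minimax_m2",
        "--enable-auto-tool-choice", "--trust-remote-code",
        "--host", "0.0.0.0", "--port", "8000",
    ]
    if arm in ("B", "C"):
        env.update(LMCACHE_CPU_ENV)
        connector = {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}
        argv += ["--kv-transfer-config", json.dumps(connector)]
    if arm == "C":
        env["LMCACHE_LOCAL_DISK"] = "/nvme/lmcache"
    return env, argv


env_a, argv_a = build_arm("A")
env_c, argv_c = build_arm("C")
print(sorted(set(env_c) - set(env_a)))
```

Passing argv as a list (for example to subprocess.Popen) also sidesteps the nested quote escaping that the docker exec form needs for the JSON arguments.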
The four configuration mistakes that cost the most time
- CUDA graphs disabled by default. ATOM sets enforce_eager=True in its platform registration. Without explicitly overriding with --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}', every batch goes through eager execution. We measured 3-5× throughput loss without CUDA graphs. This was our single biggest performance surprise.
- LMCache Python fallback. Already covered above. The PyPI wheel’s silent fallback to non_cuda_equivalents on ROCm turns LMCache from a 2.4× speedup into a 1.7× slowdown. BUILD_WITH_HIP=1 is non-negotiable.
- PYTHONHASHSEED=0 is still mandatory. Same as the previous blog: Python’s hash randomization breaks LMCache cache-key consistency across TP workers. This hasn’t changed.
- Missing aiofile for NVMe tier. LMCache’s disk backend imports aiofile for async I/O. If it’s missing, the disk path either raises an ImportError or falls back to sync writes that block the event loop. pip install aiofile before enabling LMCACHE_LOCAL_DISK.
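Three of these four mistakes can be caught before the server starts. The sketch below is a hypothetical preflight helper (not part of ATOM, vLLM, or LMCache) that inspects an environment dict and an argument list; the HIP backend check is separate because it needs the module inspection shown in Step 3.

```python
import importlib.util
import json


def preflight(env, argv):
    """Return a list of human-readable problems found in a launch config."""
    problems = []
    uses_lmcache = "--kv-transfer-config" in argv

    # CUDA graphs: the mode must be requested explicitly.
    mode = None
    if "--compilation-config" in argv:
        idx = argv.index("--compilation-config")
        if idx + 1 < len(argv):
            try:
                mode = json.loads(argv[idx + 1]).get("cudagraph_mode")
            except (json.JSONDecodeError, AttributeError):
                mode = None  # malformed or non-object JSON
    if mode is None:
        problems.append("cudagraph_mode not set via --compilation-config")

    # Hash seed: must be pinned when LMCache is in use.
    if uses_lmcache and env.get("PYTHONHASHSEED") != "0":
        problems.append("PYTHONHASHSEED=0 missing while LMCache is enabled")

    # Disk tier: aiofile must be importable when a disk path is configured.
    if env.get("LMCACHE_LOCAL_DISK") and importlib.util.find_spec("aiofile") is None:
        problems.append("LMCACHE_LOCAL_DISK set but aiofile is not installed")

    return problems


bad = preflight({}, ["vllm", "serve", "m", "--kv-transfer-config", "{}"])
good = preflight(
    {"PYTHONHASHSEED": "0"},
    ["vllm", "serve", "m",
     "--compilation-config", '{"cudagraph_mode": "FULL_AND_PIECEWISE"}',
     "--kv-transfer-config", "{}"],
)
print(bad)
print(good)
```

Returning a list rather than raising on the first problem lets a launcher report every misconfiguration in one pass.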
4. Benchmarks: methodology
Same tester as the previous blog: trace_replay_tester.py from callanjfox/kv-cache-tester, replaying 739 anonymized Claude Code conversation traces.
Test matrix
| Phase | Users | Context | Duration | GMU (--gpu-memory-utilization) | Question |
|---|---|---|---|---|---|
| Base | 8 | 32K | 10 min | 0.85 | Does ATOM+LMCache help at low load? |
| Stress | 32 | 100K | 20 min | 0.78 | Where does L2/L3 pay off under pressure? |
Common settings
- Hardware: 2× AMD MI300X (192 GB HBM each), gfx942
- Software: ATOM v0.1.3.dev203 + vLLM 0.19.1 + LMCache (HIP-built) + PyTorch 2.10.0+rocm7.2.3
- Model: MiniMaxAI/MiniMax-M2.5 FP8, 456B MoE (256 experts), ~230 GB, TP=2
- Container: rocm/atom-dev:vllm-latest
- Tester: --warm-prefix-pct 0.5, --timing-strategy think-only, --recycle, --seed 42
- Identical trace assignment across all arms (seed=42)
Input/output token distributions
The Claude Code traces exhibit the classic agentic pattern — massive input that accumulates over a conversation, short output per turn:
| Statistic | Input tokens | Output tokens |
|---|---|---|
| Mean | ~42-45K | ~450-500 |
| Median | ~34-38K | ~240 |
| Min | ~9.8K | 1 |
| Max | ~98.9K | ~6K |
Most requests ship 30-50K tokens of context (file contents, tool outputs, prior conversation) and get back a few hundred tokens (a tool call or short response). This is why prefix caching matters so much — 93-97% of each request’s input is identical to the previous turn.
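The reuse figure is a statement about shared prefixes between consecutive turns. A minimal sketch of how that fraction can be measured follows; the token IDs are synthetic placeholders (plain integers), not real trace data, and the function names are illustrative.

```python
def shared_prefix_len(prev_tokens, cur_tokens):
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for a, b in zip(prev_tokens, cur_tokens):
        if a != b:
            break
        n += 1
    return n


def reusable_fraction(prev_tokens, cur_tokens):
    """Share of the current request that a prefix cache could serve."""
    if not cur_tokens:
        return 0.0
    return shared_prefix_len(prev_tokens, cur_tokens) / len(cur_tokens)


# Synthetic example: a 38K-token history plus 2K tokens of new tool output.
prev_turn = list(range(38_000))
cur_turn = prev_turn + list(range(100_000, 102_000))

frac = reusable_fraction(prev_turn, cur_turn)
print(f"reusable prefix: {frac:.0%}, new tokens to prefill: {len(cur_turn) - len(prev_turn)}")
```

With a warm cache only the new suffix needs prefill; when the prefix has been evicted, the whole request does.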
5. Results
5.1 Base load — HBM prefix cache is sufficient
8 max users, 32K context, 10 minutes, GMU=0.85. Working set fits comfortably in HBM.
| Metric | ATOM HBM-only | ATOM + LMCache CPU |
|---|---|---|
| Requests completed | 80 | 92 (+15%) |
| TTFT avg (s) | 0.80 | 0.93 (+16%) |
| TTFT p95 (s) | 4.10 | 3.35 (-18%) |
| Goodput | 91.9% | 80.3% |
At low load, the picture is mixed. LMCache completes 15% more requests and trims the p95 tail by 18%, but adds 16% overhead to average TTFT and reduces goodput. The working set fits in HBM; the L2 tier is mostly unused but still incurs connector overhead on every request.
Verdict at base load: ATOM’s FP8 KV cache gives enough HBM headroom. LMCache is unnecessary here.
This is consistent with our previous blog’s finding — the crossover is about working-set pressure, not raw hit rate.
5.2 Stress load — LMCache wins decisively
32 max users, 100K context, 20 minutes, GMU=0.78. This is where HBM runs out.
| Metric | ATOM HBM-only | ATOM + LMCache CPU | ATOM + LMCache CPU+NVMe |
|---|---|---|---|
| Requests completed | 208 | 331 (+59%) | 374 (+80%) |
| TTFT avg (s) | 84.0 | 47.1 (-44%) | 38.7 (-54%) |
| TTFT median (s) | 80.1 | 32.9 (-59%) | 34.6 (-57%) |
| TTFT p95 (s) | 207.3 | 150.2 (-28%) | 88.3 (-57%) |
| TTFT max (s) | 234.2 | 181.4 (-23%) | 97.9 (-58%) |
| Input tok/s | 3,370 | 6,440 (+91%) | 5,953 (+77%) |
| Output tok/s | 91 | 127 (+39%) | 145 (+59%) |
| Cache hit rate | 86.8% | 89.6% | 83.4% |
| Working set peak | 1.91M tokens | 2.15M tokens | 2.28M tokens |
Under stress, the story is unambiguous:
ATOM + LMCache CPU vs ATOM HBM-only:
- 59% more requests completed
- 2.4× lower median TTFT (80.1s → 32.9s)
- 44% lower average TTFT
- 28% lower p95 TTFT
- 91% higher input throughput
- 12% larger working set sustained in memory
ATOM + LMCache CPU+NVMe vs ATOM HBM-only:
- 80% more requests completed
- 57% lower median TTFT
- 54% lower average TTFT
- 57% lower p95 TTFT (207.3s → 88.3s)
- 58% lower max TTFT (234.2s → 97.9s)
- 59% higher output throughput
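The percentages above follow directly from the stress table. A short sketch that recomputes them from the table's values (helper names are illustrative):

```python
def pct_change(baseline, value):
    """Signed percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


def speedup(baseline, value):
    """How many times smaller value is than baseline (for latencies)."""
    return baseline / value


# Values copied from the stress-load table.
hbm_only = {"requests": 208, "ttft_avg": 84.0, "ttft_median": 80.1,
            "ttft_p95": 207.3, "input_tok_s": 3370}
cpu = {"requests": 331, "ttft_avg": 47.1, "ttft_median": 32.9,
       "ttft_p95": 150.2, "input_tok_s": 6440}
cpu_nvme = {"requests": 374, "ttft_avg": 38.7, "ttft_median": 34.6,
            "ttft_p95": 88.3, "input_tok_s": 5953}

for name, arm in (("CPU", cpu), ("CPU+NVMe", cpu_nvme)):
    for metric, base in hbm_only.items():
        print(f"{name:9s} {metric:12s} {pct_change(base, arm[metric]):+6.1f}%")

print(f"median TTFT speedup (CPU): {speedup(80.1, 32.9):.1f}x")
```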
5.3 The NVMe L3 tier — compressing the tail
The most interesting result is the gap between CPU-only and CPU+NVMe:
| Metric | LMCache CPU | LMCache CPU+NVMe | Delta |
|---|---|---|---|
| Requests completed | 331 | 374 | +13% |
| TTFT avg (s) | 47.1 | 38.7 | -18% |
| TTFT p95 (s) | 150.2 | 88.3 | -41% |
| TTFT max (s) | 181.4 | 97.9 | -46% |
| Output tok/s | 127 | 145 | +14% |
The NVMe tier’s impact is concentrated in the tail latencies: p95 drops by 41%, max drops by 46%. This makes sense: the L3 tier catches KV blocks that would have been permanently evicted from the 64 GB CPU DRAM L2. Without NVMe, those evicted states force a full prefill recomputation of 50-100K tokens from scratch. With NVMe, they are read back from disk at ~1ms access latency instead of costing ~50-100s of prefill.
The cache hit rate actually drops slightly with NVMe (89.6% → 83.4%). This is an artifact of how LMCache counts: L2 evictions to disk are not counted as “hits” until they’re retrieved back. The effective reuse rate is higher.
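To put the retrieval-versus-recompute trade-off in rough numbers, the sketch below estimates how long it takes to stream a conversation's KV state back from disk. The ~6 GB per 100K tokens figure comes from this post; the NVMe read bandwidth is a hypothetical placeholder that should be replaced with a measurement from your own drive.

```python
def transfer_seconds(size_gb, bandwidth_gb_s, latency_s=0.0):
    """Access latency plus streaming time for a transfer of size_gb."""
    return latency_s + size_gb / bandwidth_gb_s


KV_GB_PER_100K_FP8 = 6.0   # approximate figure quoted in this post
ASSUMED_NVME_GB_S = 5.0    # ASSUMPTION: hypothetical sequential read bandwidth
NVME_LATENCY_S = 1e-3      # ~1ms access latency quoted in this post

for tokens in (50_000, 100_000):
    size_gb = KV_GB_PER_100K_FP8 * tokens / 100_000
    t = transfer_seconds(size_gb, ASSUMED_NVME_GB_S, NVME_LATENCY_S)
    print(f"{tokens:>7} tokens: {size_gb:.1f} GB, about {t:.2f} s to read back")
```

Under that assumed bandwidth the read-back takes on the order of a second, which is still small next to the ~50-100s of prefill it replaces.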
5.4 TTFT behavior under pressure — the pressure relief valve
One of the most revealing patterns is how TTFT evolves over the 20-minute stress run:
ATOM HBM-only: TTFT degrades monotonically. Starts around 20s, climbs steadily to 230s, and never recovers. As more users join and the working set grows, HBM prefix cache entries get evicted faster than they can be reused. Each eviction forces a full prefill, which takes longer because the scheduler is already saturated. It’s a death spiral.
ATOM + LMCache CPU: TTFT oscillates. It spikes when a burst of new users arrives (cache-cold), then recovers to 23-46s as cached KV states hit from CPU DRAM. The CPU L2 tier acts as a pressure relief valve: when HBM fills up and starts evicting, the evicted blocks land in DRAM instead of being lost forever. On the next request from the same conversation, the KV state is retrieved from DRAM (~100μs) instead of recomputed from scratch (~50s+).
This oscillating-vs-monotonic pattern is the clearest behavioral evidence that L2 caching works. It’s not just faster on average — it’s self-healing under load.
6. Comparison: ATOM vs Vanilla vLLM (previous blog)
How does the ATOM-optimized stack compare to the vanilla vLLM stack from our previous blog?
Direct comparison is approximate: the previous blog used a different vLLM version (0.19.0 vs 0.19.1), BF16 KV (vs FP8), and different GMU settings. But the order-of-magnitude story is clear:
| Stress metric | Vanilla vLLM + LMCache (prev blog) | ATOM + LMCache CPU (this blog) |
|---|---|---|
| Requests completed | 28 | 331 |
| TTFT avg (s) | 34.6 | 47.1 |
| Input tok/s | 933 | 6,440 |
ATOM’s optimizations (AITER kernels, FP8 KV, INT4 all-reduce, CUDA graphs) deliver a fundamentally different throughput regime. The previous blog’s 28 requests in 20 minutes reflects that vanilla vLLM on ROCm was severely bottlenecked. ATOM removes those bottlenecks.
But LMCache’s relative value holds up on the optimized baseline: under stress on ATOM, offloading adds 59-80% more requests and cuts average TTFT by 44-54%, against 2.3× more requests and 3× lower TTFT on the vanilla stack in the previous blog.
7. Key Findings
Finding 1: L2 caching helps even when the baseline is optimized
The worry going in was that ATOM’s FP8 KV cache (halving per-token KV memory) would give enough HBM headroom to make L2 offloading unnecessary. It doesn’t. Under real agentic workloads at 32 concurrent users with 100K context, HBM still fills up and LMCache still delivers 59% more requests.
FP8 KV does shift the crossover point: ATOM HBM-only completes 208 requests under stress, more than the previous blog’s vanilla prefix-cache-only arm at similar load levels. But it doesn’t eliminate the need for L2.
Finding 2: NVMe L3 is a tail-latency story
The NVMe tier doesn’t help average performance much (-18%) but demolishes the tail: p95 drops 41%, max drops 46%. If your SLO is p95 or p99, NVMe is the cheapest intervention available. NVMe is already in the box — you’re just telling LMCache to use it.
Finding 3: LMCache backend choice is make-or-break
This is the single most impactful finding for practitioners. The default pip install lmcache on ROCm gives you a Python fallback that turns LMCache into a net negative. Our first attempt showed LMCache 1.7× slower than baseline — we spent days debugging before discovering the backend issue.
The fix is simple: BUILD_WITH_HIP=1 pip install -e . from source. But the failure mode is silent — no errors, no warnings in the default log level. You just get slow.
Finding 4: CUDA graphs are not optional with ATOM
ATOM’s vLLM plugin sets enforce_eager=True by default. This is presumably for development/debugging convenience, but it means every batch goes through Python-side eager dispatch instead of a pre-compiled CUDA graph. The performance impact is 3-5× throughput loss.
Override with:
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'
We filed ROCm/ATOM #890 suggesting the default should be flipped.
Finding 5: FP8 KV cache is a free lunch (mostly)
ATOM’s --kv-cache-dtype fp8 halves KV cache memory with negligible quality impact on MiniMax-M2.5. This doubles the effective HBM capacity for prefix caching. Combined with LMCache, the L2 tier gets less pressure (fewer evictions from L1), which means better hit rates and less DRAM bandwidth consumed.
Our previous blog used BF16 KV. The FP8 upgrade is one reason ATOM HBM-only handles 208 requests, compared with the previous blog’s HBM prefix-cache-only numbers: you get more cache capacity before needing L2.
Finding 6: The oscillation pattern confirms L2 value
ATOM HBM-only shows monotonically degrading TTFT — a death spiral where eviction pressure compounds. ATOM + LMCache shows oscillating TTFT that recovers after spikes. This behavioral difference is the most intuitive proof that L2 caching works: the system heals itself when cached states are retrievable from DRAM instead of permanently lost.
8. Lessons Learned
Lesson 1: LMCache backend matters enormously
| Backend | Source | Performance |
|---|---|---|
| PyPI wheel (pip install lmcache) | CUDA c_ops.so → Python non_cuda_equivalents fallback on ROCm | 1.7× slower than no cache |
| Source build (BUILD_WITH_HIP=1) | HIP c_ops.so with native kernels | 2.4× faster median TTFT |
The difference between “cache makes things worse” and “cache gives 59% more completed requests” is a single build flag. Check your backend.
Lesson 2: CUDA graphs must be enabled with ATOM
# Without (ATOM default): enforce_eager=True → 3-5× slower
# With: proper graph capture
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'
Lesson 3: PYTHONHASHSEED=0 is still mandatory
Same as the previous blog. Python’s per-process hash randomization breaks LMCache cache-key consistency across TP workers. Symptom: 0% cache hit rate on bit-identical prompts.
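The underlying Python behaviour is easy to reproduce. The sketch below does not use LMCache at all; it only shows that hash() of the same string differs between fresh interpreter processes unless PYTHONHASHSEED is pinned, which is the property the lesson above depends on. The helper name is illustrative.

```python
import os
import subprocess
import sys


def hash_in_fresh_process(text, seed=None):
    """Evaluate hash(text) in a new interpreter, optionally pinning the seed."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)      # start from an unpinned environment
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    out = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())


pinned = {hash_in_fresh_process("shared-prefix", seed=0) for _ in range(3)}
unpinned = {hash_in_fresh_process("shared-prefix") for _ in range(3)}

print("distinct hashes with PYTHONHASHSEED=0:", len(pinned))
print("distinct hashes without it:           ", len(unpinned))
```

With the seed pinned, every process agrees on the value; without it, each process almost certainly produces a different one, so any key derived from hash() cannot match across workers.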
Lesson 4: FP8 KV cache changes the economics
With BF16 KV (previous blog), a 100K-token KV cache for MiniMax-M2.5 uses ~12 GB HBM. With FP8 KV (this blog), it’s ~6 GB. That means:
- More conversations fit in HBM before L2 eviction starts
- The crossover point where LMCache pays off shifts to higher concurrency
- But at real production loads (32+ concurrent agentic users), L2 still pays off
9. Best Practices
For ATOM + LMCache deployment on MI300X
- Build LMCache from source with BUILD_WITH_HIP=1. Do not use the PyPI wheel on ROCm. Verify with python3 -c "from lmcache.storage_backend.connector import c_ops; print(c_ops.__file__)"; it should show a .so, not a .py.
- Enable CUDA graphs explicitly: --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'
- Set PYTHONHASHSEED=0 in the server’s environment.
- Use FP8 KV cache (--kv-cache-dtype fp8): it’s a free lunch on MiniMax-M2.5 and likely on most modern MoE models.
- Enable ATOM’s fusions: AITER_QUICK_REDUCE_QUANTIZATION=INT4 and ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1
- Size the CPU L2 generously (LMCACHE_MAX_LOCAL_CPU_SIZE=64GB+). DRAM is cheap; evictions to disk or permanent loss are expensive.
- Add NVMe L3 if you care about tail latency. Set LMCACHE_LOCAL_DISK=/path/to/nvme and pip install aiofile. The p95 improvement (41% in our test) is significant for SLO-bound deployments.
- Enable async scheduling (--async-scheduling), which lets vLLM overlap scheduling with GPU execution. Free throughput.
- Monitor cache hits in production:
LMCache: Reqid=...80e (1030 tok): hit tokens: 1024 ← working ✅
LMCache: Reqid=...8cf (1030 tok): hit tokens: 0 ← broken ❌
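Log lines like these can be summarised automatically. The following sketch parses the "hit tokens" lines with a regular expression; the pattern is inferred from the two sample lines above and is an assumption, since the exact log format may differ between LMCache versions.

```python
import re

# Pattern inferred from the sample lines above; adjust for your LMCache version.
HIT_RE = re.compile(r"\((\d+) tok\): hit tokens: (\d+)")


def hit_stats(lines):
    """Return (requests, zero_hit_requests, token_hit_ratio) for matching lines."""
    requests = zero_hits = total_tokens = hit_tokens = 0
    for line in lines:
        match = HIT_RE.search(line)
        if match is None:
            continue  # unrelated log line
        tokens, hits = int(match.group(1)), int(match.group(2))
        requests += 1
        total_tokens += tokens
        hit_tokens += hits
        if hits == 0:
            zero_hits += 1
    ratio = hit_tokens / total_tokens if total_tokens else 0.0
    return requests, zero_hits, ratio


sample = [
    "LMCache: Reqid=...80e (1030 tok): hit tokens: 1024",
    "LMCache: Reqid=...8cf (1030 tok): hit tokens: 0",
]
stats = hit_stats(sample)
print(f"requests={stats[0]} zero-hit={stats[1]} token hit ratio={stats[2]:.1%}")
```

A sustained run of zero-hit requests on prompts that should share a prefix is the signal to recheck PYTHONHASHSEED and the backend build.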
When to use which configuration
| Scenario | Recommended config |
|---|---|
| Low concurrency, short context (<32K) | ATOM HBM-only (FP8 KV) |
| Moderate concurrency, mixed context | ATOM + LMCache CPU |
| High concurrency, long context (100K+), SLO on p95 | ATOM + LMCache CPU+NVMe |
| Decode-bound workloads (short input, long output) | ATOM HBM-only — cache won’t help the bottleneck |
When NOT to deploy LMCache with ATOM
- Working set fits comfortably in HBM (most chat workloads with FP8 KV)
- Decode-bound serving where prefill is already fast relative to generation
- You haven’t verified BUILD_WITH_HIP=1: the Python fallback will make things worse, not better
10. Reproduce This
Container setup
# 1. Start ATOM container
docker run -d --name atom-lmcache --entrypoint /bin/bash \
--device=/dev/kfd --device=/dev/dri --network=host --ipc=host \
--group-add video --cap-add SYS_PTRACE \
-v /your/models:/work/models \
-v /your/nvme:/nvme/lmcache \
rocm/atom-dev:vllm-latest -c "sleep infinity"
# 2. Build LMCache with HIP backend
docker exec atom-lmcache bash -c "
git clone --depth 1 https://github.com/LMCache/LMCache.git /work/LMCache
cd /work/LMCache && BUILD_WITH_HIP=1 pip install -e . --no-build-isolation
pip install aiofile
"
Server (ATOM + LMCache CPU+NVMe arm)
docker exec -d atom-lmcache bash -c "
AITER_QUICK_REDUCE_QUANTIZATION=INT4 \
ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1 \
VLLM_FLOAT32_MATMUL_PRECISION=high \
PYTHONHASHSEED=0 \
LMCACHE_LOCAL_CPU=true \
LMCACHE_CHUNK_SIZE=256 \
LMCACHE_MAX_LOCAL_CPU_SIZE=64 \
LMCACHE_LOCAL_DISK=/nvme/lmcache \
vllm serve /work/models/MiniMax-M2.5 \
--tensor-parallel-size 2 --gpu-memory-utilization 0.78 \
--kv-cache-dtype fp8 \
--async-scheduling \
--enable-prefix-caching \
--compilation-config '{\"cudagraph_mode\": \"FULL_AND_PIECEWISE\"}' \
--max-num-batched-tokens 16384 \
--kv-transfer-config '{\"kv_connector\":\"LMCacheConnectorV1\",\"kv_role\":\"kv_both\"}' \
--tool-call-parser minimax_m2 --reasoning-parser minimax_m2 \
--enable-auto-tool-choice --trust-remote-code \
--host 0.0.0.0 --port 8000
"
Trace replay client
git clone --recursive https://github.com/callanjfox/kv-cache-tester.git
cd kv-cache-tester
# Stress run
python3 trace_replay_tester.py \
--api-endpoint http://127.0.0.1:8000 \
--trace-directory traces \
--start-users 4 --max-users 32 \
--max-ttft 60.0 --test-duration 1200 \
--max-context 100000 --warm-prefix-pct 0.5 \
--timing-strategy think-only --recycle \
--seed 42 \
--output-dir ./results
11. What’s Next
This blog tested two tiers (CPU + NVMe) on top of ATOM’s HBM cache. Several directions remain:
- PD disaggregation: Separate prefill and decode workers. LMCache and PD are complementary: prefill workers compute KV and store it to shared L2/L3; decode workers retrieve it. This is the architecture Mooncake, DeepSeek, and NVIDIA Dynamo use in production.
- Speculative decoding: Our output throughput topped out at 145 tok/s aggregate. The decode side is the bottleneck, and KV caching doesn’t help decode. Eagle-2 or Medusa speculative decoding could give 2-3× on the decode path.
- Multi-node L2 with nixl/distributed cache: LMCache supports distributed backends (Redis, remote object stores). In a multi-node TP>2 setup, shared KV cache across replicas would enable cross-replica prefix reuse.
- Quantized L2 storage: Storing FP8 KV in the L2/L3 tiers (instead of converting back to full precision) would halve the DRAM and NVMe bandwidth required for cache transfers.
12. Acknowledgments
- AMD ROCm team for ATOM and the rocm/atom-dev container
- LMCache team for the HIP-compatible build system and the KV connector
- callanjfox / WEKA for the kv-cache-tester toolkit and 739 anonymized Claude Code traces
- Hotaisle for MI300X access
Bench environment: ENC1-CLS01-SVR08, 2× AMD MI300X (gfx942, 192 GB HBM each), ROCm 7.2.3, ATOM v0.1.3.dev203, vLLM 0.19.1, LMCache (HIP-built, commit ~2026-05), PyTorch 2.10.0+rocm7.2.3. Container: rocm/atom-dev:vllm-latest.