ATOM + LMCache: KV Cache Offloading with AMD’s Optimized Plugin for vLLM

When the baseline is already optimized, does KV cache offloading still help? We tested ATOM — AMD’s high-performance inference engine that integrates with vLLM as an out-of-tree plugin — with LMCache on MI300X. The answer: yes, and more than you’d expect.

Key Summary

This is a sequel to our previous benchmark of LMCache on MI300X. That blog tested vanilla vLLM + LMCache. This one tests ATOM (AiTer Optimized Model) — AMD’s high-performance inference engine built on AITER kernels that integrates with vLLM as an out-of-tree plugin — combined with LMCache, and adds an NVMe L3 tier.

We ran 739 Claude Code agentic traces against MiniMax-M2.5 (456B MoE, FP8) on 2× MI300X under three configurations: ATOM HBM-only, ATOM + LMCache CPU, and ATOM + LMCache CPU+NVMe.

Headline findings:

ATOM + LMCache CPU delivers 2.4× lower median TTFT and 59% more requests vs ATOM HBM-only under stress (32 users, 100K context)
Adding NVMe as an L3 tier cuts p95 TTFT by 41% on top of the CPU-only LMCache result — the long tail compresses dramatically
ATOM’s FP8 KV cache halves KV memory vs our previous BF16 runs — more room for HBM prefix cache, less L2 pressure, but L2 still pays off decisively under load
LMCache backend choice is critical: the default PyPI wheel uses a Python fallback on ROCm that made LMCache 1.7× slower than baseline. Source-building with BUILD_WITH_HIP=1 is mandatory.
CUDA graphs must be explicitly enabled: ATOM sets enforce_eager=True by default. Without the override, we measured a 3-5× throughput loss.

1. Introduction

From vanilla vLLM to ATOM

Our previous blog established that LMCache works on MI300X and wins decisively under stress — 3× lower TTFT, 2.3× more requests when HBM prefix cache gets overwhelmed.

But that test used stock vLLM. AMD ships ATOM (AiTer Optimized Model), a high-performance inference engine purpose-built for AMD Instinct GPUs. As described in the vLLM-ATOM blog, ATOM integrates with vLLM as an out-of-tree plugin — not a fork — preserving vLLM’s existing APIs and batching paths while delivering AMD-native attention, model execution, and optimized MoE routing via AITER kernels. It adds FP8/MXFP4 quantized KV cache support, fused QK-norm/RoPE/cache operations, piecewise torch.compile with CUDA graph capture, and INT4 quick-reduce for tensor-parallel all-reduce. ATOM registers itself via vLLM’s platform plugin entry point, so it’s a drop-in — no vLLM source patches needed.

The natural question: does LMCache still help when the serving baseline is already optimized? If ATOM’s FP8 KV cache doubles effective HBM capacity, maybe there’s enough room for the prefix cache and L2 offloading becomes unnecessary?

Short answer: LMCache still helps. Under real agentic workloads, HBM still fills up and the L2 tier still pays off. The crossover just shifts to higher load.

What changed from the previous blog

	Previous blog	This blog
vLLM backend	Stock vLLM 0.19.0	ATOM v0.1.3.dev203 (vLLM plugin)
vLLM version	0.19.0	0.19.1
Attention	Default ROCm FA	AITER (via ATOM)
KV cache dtype	BF16	FP8
CUDA graphs	Default	Explicit: FULL_AND_PIECEWISE
TP all-reduce	Default	INT4 quick-reduce (AITER)
Container	`vllm/vllm-openai-rocm:v0.19.0`	`rocm/atom-dev:vllm-latest`
LMCache tiers	HBM + CPU DRAM	HBM + CPU DRAM + NVMe
PyTorch	Default	2.10.0+rocm7.2.3
ROCm	7.0.0	7.2.3

2. Architecture

The ATOM + LMCache stack

              ┌──────────────────────────────────────────┐
              │   trace_replay_tester.py (client)        │
              │   • 739 anonymized Claude Code traces    │
              │   • Cooldown-gated user ramp             │
              │   • Working-set + period budgets          │
              └──────────────┬───────────────────────────┘
                             │ OpenAI HTTP /v1/chat/completions
                             ▼
              ┌──────────────────────────────────────────┐
              │         vLLM 0.19.1 + ATOM plugin        │
              │  ────────────────────────────────────    │
              │  ATOM platform plugin (auto-registered)  │
              │    • AITER attention kernels              │
              │    • FP8 KV cache quantization            │
              │    • INT4 quick-reduce (TP all-reduce)   │
              │    • Fused QK-norm/RoPE/cache quant      │
              │  ────────────────────────────────────    │
              │  Scheduler → Prefix-cache (HBM, FP8)     │
              │  ──────────│─────────────────            │
              │            │ KV connector V1 hook        │
              │            ▼                             │
              │  ┌────────────────────────┐              │
              │  │  LMCacheConnectorV1    │              │
              │  │  (BUILD_WITH_HIP=1)    │              │
              │  └──────┬─────────────────┘              │
              │         │                                │
              │    ┌────┴────┬────────────┐              │
              │    │         │            │              │
              │    ▼         ▼            ▼              │
              │  GPU HBM   CPU DRAM    NVMe SSD         │
              │  L1 (FP8)  L2 (64 GB)  L3 (optional)   │
              └──────────────────────────────────────────┘
                             │
                             ▼
              ┌──────────────────────────────────────────┐
              │   MiniMax-M2.5 (456B MoE, 256 experts)   │
              │   FP8, ~230 GB, TP=2 across 2× MI300X    │
              └──────────────────────────────────────────┘

What ATOM changes under the hood

ATOM is a vLLM out-of-tree plugin — as detailed in the vLLM-ATOM blog, it doesn’t fork vLLM but registers itself at startup and replaces key compute paths:

Attention: AITER flash attention replaces the default ROCm FA backend. Tuned for gfx942 wavefronts.
KV cache quantization: --kv-cache-dtype fp8 halves per-token KV memory. A 100K-token MiniMax-M2.5 KV cache drops from ~12 GB (BF16) to ~6 GB (FP8). That’s real HBM headroom.
TP all-reduce: AITER_QUICK_REDUCE_QUANTIZATION=INT4 quantizes the all-reduce payload to INT4, cutting cross-GPU traffic by up to ~4× relative to a 16-bit payload, at negligible quality cost.
Fused ops: ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1 fuses QK-normalization, rotary position embedding, and cache quantization into a single kernel — fewer HBM round-trips.
CUDA graphs: Must be explicitly enabled via --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' because ATOM’s default sets enforce_eager=True.
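
The FP8 figures above can be turned into a quick sizing check. The sketch below is a back-of-the-envelope calculation that takes the ~12 GB and ~6 GB per 100K tokens quoted above at face value. The 60 GB budget is a hypothetical placeholder, not a measured KV pool size, and the helper names are ours.

```python
# Rough KV-cache sizing from the figures quoted above (decimal GB).
GB = 1e9


def bytes_per_token(total_gb, tokens):
    """Average KV bytes per token implied by a total cache size."""
    return total_gb * GB / tokens


def contexts_that_fit(budget_gb, context_gb):
    """Whole contexts of a given size that fit in a memory budget."""
    return int(budget_gb // context_gb)


bf16_per_tok = bytes_per_token(12, 100_000)  # ~12 GB per 100K tokens (from the text)
fp8_per_tok = bytes_per_token(6, 100_000)    # ~6 GB per 100K tokens (from the text)

# HYPOTHETICAL budget: the real KV pool depends on weights, activations and graph memory.
kv_budget_gb = 60

print(f"BF16: {bf16_per_tok / 1e3:.0f} KB/token, "
      f"{contexts_that_fit(kv_budget_gb, 12)} x 100K contexts in {kv_budget_gb} GB")
print(f"FP8:  {fp8_per_tok / 1e3:.0f} KB/token, "
      f"{contexts_that_fit(kv_budget_gb, 6)} x 100K contexts in {kv_budget_gb} GB")
```

Swap in your own budget to see how many long contexts stay resident before eviction starts.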

Three test arms

Arm	LMCache config	What’s cached
A: ATOM HBM-only	None	FP8 KV blocks in HBM, LRU evicted when full
B: ATOM + LMCache CPU	`LMCACHE_LOCAL_CPU=true`, 64 GB	HBM L1 (FP8) + 64 GB CPU DRAM L2
C: ATOM + LMCache CPU+NVMe	CPU + `LMCACHE_LOCAL_DISK=/nvme/lmcache`	HBM L1 (FP8) + CPU DRAM L2 + NVMe L3

The NVMe L3 tier is new in this blog. When the CPU DRAM L2 fills up, LMCache spills evicted KV blocks to local NVMe. Access latency per read goes from ~100μs (DRAM) to ~1ms (NVMe), but capacity jumps from 64 GB to terabytes. For long-running agentic sessions where KV states accumulate over hours, the L3 tier prevents permanent eviction.
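
To make the spill path concrete, here is a toy model of a tiered LRU hierarchy. It is only a sketch of the general idea (evictions cascade to the next tier, hits are promoted back to the top). It is not LMCache's implementation, and the class name and tier capacities are made up for illustration.

```python
from collections import OrderedDict


class TieredCache:
    """Toy LRU hierarchy: each tier spills its least-recently-used block to the next."""

    def __init__(self, capacities):
        # capacities: blocks per tier, fastest first, e.g. [HBM, DRAM, NVMe]
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]

    def put(self, key, value, level=0):
        if level >= len(self.tiers):
            return  # fell off the last tier: permanently evicted
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)  # mark as most recently used
        if len(tier) > self.capacities[level]:
            old_key, old_value = tier.popitem(last=False)  # evict the LRU block
            self.put(old_key, old_value, level + 1)        # spill it one tier down

    def get(self, key):
        """Return (value, tier_index) and promote to tier 0, or (None, None) on a miss."""
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier.pop(key)
                self.put(key, value, 0)
                return value, level
        return None, None


def fill(cache, keys):
    for k in keys:
        cache.put(k, f"kv-{k}")


two_tier = TieredCache([2, 2])       # HBM + DRAM
three_tier = TieredCache([2, 2, 4])  # HBM + DRAM + NVMe
fill(two_tier, "abcdef")
fill(three_tier, "abcdef")

print(two_tier.get("a"))    # (None, None): the oldest block fell off the DRAM tier
print(three_tier.get("a"))  # ('kv-a', 2): the same block is still held by the third tier
```

With two tiers the oldest block is gone and would have to be recomputed. With a third tier it is still retrievable, which is the behavior the L3 arm is meant to test.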

3. Implementation

Step 1: ATOM container

ATOM ships as a pre-built container with vLLM, AITER, and ROCm 7.2.3:

docker run -d --name atom-lmcache --entrypoint /bin/bash \
  --device=/dev/kfd --device=/dev/dri --network=host --ipc=host \
  --group-add video --cap-add SYS_PTRACE \
  -v /mnt/nvme/models:/work/models \
  -v /mnt/nvme/lmcache:/nvme/lmcache \
  rocm/atom-dev:vllm-latest \
  -c "sleep infinity"

The -v /mnt/nvme/lmcache:/nvme/lmcache mount is for the NVMe L3 tier. Skip it if you’re only testing HBM + CPU DRAM.

Step 2: Build LMCache from source with HIP

This step is the same as in the previous blog. The mistake to avoid is installing the PyPI wheel instead (see Step 3):

docker exec atom-lmcache bash -c "
  git clone --depth 1 https://github.com/LMCache/LMCache.git /work/LMCache
  cd /work/LMCache && BUILD_WITH_HIP=1 pip install -e . --no-build-isolation
  pip install aiofile  # Required for LMCache GDS/NVMe backend
"

Why aiofile? LMCache’s disk backend uses aiofile for async NVMe I/O. Without it, enabling the disk path either raises an ImportError or falls back to synchronous writes that stall the event loop.
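
A quick pre-flight check avoids finding this out at runtime. The sketch below uses only the standard library to test whether a module is importable in the current environment. The helper name is ours, and it says nothing about whether LMCache will actually use the module.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


for mod in ("aiofile",):
    status = "found" if module_available(mod) else "MISSING (pip install aiofile)"
    print(f"{mod}: {status}")
```

Run it inside the container, with the same Python interpreter that launches the server.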

Step 3: The critical backend check

pip install lmcache from PyPI gives you a CUDA-linked c_ops.so. On ROCm, this doesn’t crash — it silently falls back to Python-native non_cuda_equivalents. In our first attempt, this fallback made LMCache 1.7× slower than having no cache at all. The overhead of Python-side KV block serialization exceeded the prefill compute it was saving.

Verify you have the HIP backend:

python3 -c "from lmcache.storage_backend.connector import c_ops; print(c_ops.__file__)"
# Should show a .so built from HIP sources, NOT a Python .py file

If you see a .py path or non_cuda_equivalents, rebuild from source.
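
The same check can be scripted. The helper below is a hypothetical convenience that applies the heuristic described above to a module path. Note its limit: a native extension suffix does not prove the library was built from HIP sources, so treat a "native" result as necessary rather than sufficient. The example paths are illustrative only.

```python
from pathlib import Path


def classify_backend(module_file):
    """Classify a module path as 'native', 'python-fallback', or 'unknown'."""
    path = Path(module_file)
    if "non_cuda_equivalents" in path.name or path.suffix == ".py":
        return "python-fallback"
    if path.suffix in {".so", ".pyd"}:
        return "native"
    return "unknown"


# Illustrative paths only; pass the real c_ops.__file__ value in practice.
examples = [
    "/work/LMCache/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so",
    "/usr/lib/python3/site-packages/lmcache/non_cuda_equivalents.py",
]
for example in examples:
    print(f"{classify_backend(example):16s} {example}")
```

Feed it the path printed by the one-liner above and fail your deployment script on anything other than "native".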

Step 4: ATOM recipe config

ATOM’s performance knobs, set as environment variables and server flags:

# ATOM environment
export AITER_QUICK_REDUCE_QUANTIZATION=INT4
export ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1

# LMCache environment
export PYTHONHASHSEED=0
export LMCACHE_LOCAL_CPU=true
export LMCACHE_CHUNK_SIZE=256
export LMCACHE_MAX_LOCAL_CPU_SIZE=64
export LMCACHE_LOCAL_DISK=/nvme/lmcache  # omit for CPU-only arm
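
One way to keep the three arms consistent is to generate the environment from a single place. This is a minimal sketch using only the variables listed above. The function name and the use of A/B/C as arguments are our own convention, not part of ATOM or LMCache.

```python
# Variables kept in every arm (PYTHONHASHSEED is harmless without LMCache).
BASE_ENV = {
    "AITER_QUICK_REDUCE_QUANTIZATION": "INT4",
    "ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION": "1",
    "PYTHONHASHSEED": "0",
}
# CPU DRAM tier (arms B and C).
LMCACHE_CPU_ENV = {
    "LMCACHE_LOCAL_CPU": "true",
    "LMCACHE_CHUNK_SIZE": "256",
    "LMCACHE_MAX_LOCAL_CPU_SIZE": "64",
}
# NVMe tier (arm C only).
LMCACHE_DISK_ENV = {"LMCACHE_LOCAL_DISK": "/nvme/lmcache"}


def env_for_arm(arm):
    """Return the environment variables for test arm 'A', 'B', or 'C'."""
    if arm not in ("A", "B", "C"):
        raise ValueError(f"unknown arm: {arm!r}")
    env = dict(BASE_ENV)
    if arm in ("B", "C"):
        env.update(LMCACHE_CPU_ENV)
    if arm == "C":
        env.update(LMCACHE_DISK_ENV)
    return env


for key, value in env_for_arm("C").items():
    print(f"export {key}={value}")
```

Generating the exports this way makes it harder to leave a stray LMCACHE_* variable set when switching between arms.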

Step 5: Launch

docker exec -d atom-lmcache bash -c "
  AITER_QUICK_REDUCE_QUANTIZATION=INT4 \
  ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1 \
  VLLM_FLOAT32_MATMUL_PRECISION=high \
  PYTHONHASHSEED=0 \
  LMCACHE_LOCAL_CPU=true \
  LMCACHE_CHUNK_SIZE=256 \
  LMCACHE_MAX_LOCAL_CPU_SIZE=64 \
  LMCACHE_LOCAL_DISK=/nvme/lmcache \
  vllm serve /work/models/MiniMax-M2.5 \
    --tensor-parallel-size 2 --gpu-memory-utilization 0.78 \
    --kv-cache-dtype fp8 \
    --async-scheduling \
    --enable-prefix-caching \
    --compilation-config '{\"cudagraph_mode\": \"FULL_AND_PIECEWISE\"}' \
    --max-num-batched-tokens 16384 \
    --kv-transfer-config '{\"kv_connector\":\"LMCacheConnectorV1\",\"kv_role\":\"kv_both\"}' \
    --tool-call-parser minimax_m2 --reasoning-parser minimax_m2 \
    --enable-auto-tool-choice --trust-remote-code \
    --host 0.0.0.0 --port 8000
"

For the HBM-only arm (Arm A), remove --kv-transfer-config and the LMCACHE_* env vars. For the CPU-only arm (Arm B), keep everything else and drop only LMCACHE_LOCAL_DISK.
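
The escaped JSON in the command above is easy to get wrong. As a hedged alternative, the sketch below builds the same argument list in Python and lets json.dumps and shlex.join handle the quoting. It reuses only the flags shown above; the function name is hypothetical.

```python
import json
import shlex


def serve_args(model_path, with_lmcache):
    """Build the vllm serve argument list, with or without the LMCache connector."""
    args = [
        "vllm", "serve", model_path,
        "--tensor-parallel-size", "2",
        "--gpu-memory-utilization", "0.78",
        "--kv-cache-dtype", "fp8",
        "--async-scheduling",
        "--enable-prefix-caching",
        "--compilation-config", json.dumps({"cudagraph_mode": "FULL_AND_PIECEWISE"}),
        "--max-num-batched-tokens", "16384",
    ]
    if with_lmcache:
        connector = {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}
        args += ["--kv-transfer-config", json.dumps(connector)]
    args += [
        "--tool-call-parser", "minimax_m2",
        "--reasoning-parser", "minimax_m2",
        "--enable-auto-tool-choice", "--trust-remote-code",
        "--host", "0.0.0.0", "--port", "8000",
    ]
    return args


# Print a shell-safe command line for the LMCache arms.
print(shlex.join(serve_args("/work/models/MiniMax-M2.5", with_lmcache=True)))
```

Passing the list straight to subprocess.run would skip shell quoting entirely. Printing it with shlex.join is useful when the command still has to go through docker exec.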

The four configuration mistakes that cost the most time

CUDA graphs disabled by default. ATOM sets enforce_eager=True in its platform registration. Without explicitly overriding with --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}', every batch goes through eager execution. We measured 3-5× throughput loss without CUDA graphs. This was our single biggest performance surprise.
LMCache Python fallback. Already covered above. The PyPI wheel’s silent fallback to non_cuda_equivalents on ROCm turns LMCache from a 2.4× speedup into a 1.7× slowdown. BUILD_WITH_HIP=1 is non-negotiable.
PYTHONHASHSEED=0 is still mandatory. Same as the previous blog — Python’s hash randomization breaks LMCache cache-key consistency across TP workers. This hasn’t changed.
Missing aiofile for NVMe tier. LMCache’s disk backend imports aiofile for async I/O. If it’s missing, the disk path either raises an ImportError or falls back to sync writes that block the event loop. pip install aiofile before enabling LMCACHE_LOCAL_DISK.
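
The hash-seed issue is easy to reproduce outside LMCache. The sketch below only demonstrates Python's per-process string hash randomization, which is the mechanism named above. It does not exercise LMCache's key derivation, and the helper name is hypothetical.

```python
import os
import subprocess
import sys


def child_hash(text, hashseed=None):
    """Return hash(text) as computed by a fresh Python process."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # start from the default (randomized) behavior
    if hashseed is not None:
        env["PYTHONHASHSEED"] = hashseed
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)


prompt = "You are a coding assistant."
seeded = [child_hash(prompt, "0") for _ in range(2)]
unseeded = [child_hash(prompt) for _ in range(2)]

print("PYTHONHASHSEED=0 :", seeded, "-> identical:", seeded[0] == seeded[1])
print("unset            :", unseeded, "-> identical:", unseeded[0] == unseeded[1])
```

Each child process stands in for one TP worker: with the seed pinned, both compute the same value for the same prompt. Without it, the values will almost always differ.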

4. Benchmarks: methodology

Same tester as the previous blog: trace_replay_tester.py from callanjfox/kv-cache-tester, replaying 739 anonymized Claude Code conversation traces.

Test matrix

Phase	Users	Context	Duration	GMU (--gpu-memory-utilization)	Question
Base	8	32K	10 min	0.85	Does ATOM+LMCache help at low load?
Stress	32	100K	20 min	0.78	Where does L2/L3 pay off under pressure?

Common settings

Hardware: 2× AMD MI300X (192 GB HBM each), gfx942
Software: ATOM v0.1.3.dev203 + vLLM 0.19.1 + LMCache (HIP-built) + PyTorch 2.10.0+rocm7.2.3
Model: MiniMaxAI/MiniMax-M2.5 FP8, 456B MoE (256 experts), ~230 GB, TP=2
Container: rocm/atom-dev:vllm-latest
Tester: --warm-prefix-pct 0.5, --timing-strategy think-only, --recycle, --seed 42
Identical trace assignment across all arms (seed=42)

Input/output token distributions

The Claude Code traces exhibit the classic agentic pattern — massive input that accumulates over a conversation, short output per turn:

Statistic	Input tokens	Output tokens
Mean	~42-45K	~450-500
Median	~34-38K	~240
Min	~9.8K	1
Max	~98.9K	~6K

Most requests ship 30-50K tokens of context (file contents, tool outputs, prior conversation) and get back a few hundred tokens (a tool call or short response). This is why prefix caching matters so much — 93-97% of each request’s input is identical to the previous turn.
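
The reuse fraction is simple to compute for any pair of consecutive turns. The sketch below uses synthetic token IDs, not the benchmark traces. It also rounds the shared prefix down to whole 256-token chunks, on the assumption that a chunked cache (LMCACHE_CHUNK_SIZE=256 in this setup) can only reuse complete chunks. That chunk-granularity behavior is our assumption for illustration.

```python
def common_prefix_len(prev_tokens, cur_tokens):
    """Number of leading tokens shared by two token sequences."""
    shared = 0
    for a, b in zip(prev_tokens, cur_tokens):
        if a != b:
            break
        shared += 1
    return shared


def chunk_aligned(n_tokens, chunk_size=256):
    """Round a token count down to a whole number of chunks."""
    return (n_tokens // chunk_size) * chunk_size


# SYNTHETIC example: a 40K-token turn followed by the same context plus 2K new tokens.
prev_turn = list(range(40_000))
cur_turn = prev_turn + list(range(100_000, 102_000))

shared = common_prefix_len(prev_turn, cur_turn)
reusable = chunk_aligned(shared)
print(f"shared prefix: {shared} tokens ({shared / len(cur_turn):.1%} of the request)")
print(f"reusable at 256-token chunks: {reusable} tokens")
```

The few tokens lost to chunk rounding are small next to a 30-50K prompt, so chunk size matters much less than whether the prefix is still cached at all.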

5. Results

5.1 Base load — HBM prefix cache is sufficient

8 max users, 32K context, 10 minutes, GMU=0.85. Working set fits comfortably in HBM.

Metric	ATOM HBM-only	ATOM + LMCache CPU
Requests completed	80	92 (+15%)
TTFT avg (s)	0.80	0.93 (+16%)
TTFT p95 (s)	4.10	3.35 (-18%)
Goodput	91.9%	80.3%

At low load, the picture is mixed. LMCache completes 15% more requests and trims the p95 tail by 18%, but adds 16% overhead to average TTFT and reduces goodput. The working set fits in HBM; the L2 tier is mostly unused but still incurs connector overhead on every request.

Verdict at base load: ATOM’s FP8 KV cache gives enough HBM headroom. LMCache is unnecessary here.

This is consistent with our previous blog’s finding — the crossover is about working-set pressure, not raw hit rate.

5.2 Stress load — LMCache wins decisively

32 max users, 100K context, 20 minutes, GMU=0.78. This is where HBM runs out.

Metric	ATOM HBM-only	ATOM + LMCache CPU	ATOM + LMCache CPU+NVMe
Requests completed	208	331 (+59%)	374 (+80%)
TTFT avg (s)	84.0	47.1 (-44%)	38.7 (-54%)
TTFT median (s)	80.1	32.9 (-59%)	34.6 (-57%)
TTFT p95 (s)	207.3	150.2 (-28%)	88.3 (-57%)
TTFT max (s)	234.2	181.4 (-23%)	97.9 (-58%)
Input tok/s	3,370	6,440 (+91%)	5,953 (+77%)
Output tok/s	91	127 (+39%)	145 (+59%)
Cache hit rate	86.8%	89.6%	83.4%
Working set peak	1.91M tokens	2.15M tokens	2.28M tokens

Under stress, the story is unambiguous:

ATOM + LMCache CPU vs ATOM HBM-only:

59% more requests completed
2.4× lower median TTFT (80.1s → 32.9s)
44% lower average TTFT
28% lower p95 TTFT
91% higher input throughput
12% larger working set sustained in memory

ATOM + LMCache CPU+NVMe vs ATOM HBM-only:

80% more requests completed
57% lower median TTFT
54% lower average TTFT
57% lower p95 TTFT (207.3s → 88.3s)
58% lower max TTFT (234.2s → 97.9s)
59% higher output throughput
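
The relative figures above follow from the table with two formulas: percentage change against the HBM-only baseline, and the baseline-to-value ratio for "N× lower" claims. A short sketch, using values copied from the stress table:

```python
def pct_change(baseline, value):
    """Relative change of value versus baseline, in percent."""
    return (value - baseline) / baseline * 100


def times_lower(baseline, value):
    """How many times smaller value is than baseline."""
    return baseline / value


# (HBM-only, LMCache CPU, LMCache CPU+NVMe), copied from the stress table.
stress = {
    "requests": (208, 331, 374),
    "ttft_median_s": (80.1, 32.9, 34.6),
    "ttft_p95_s": (207.3, 150.2, 88.3),
}

for metric, (hbm, cpu, nvme) in stress.items():
    print(f"{metric}: CPU {pct_change(hbm, cpu):+.0f}%, CPU+NVMe {pct_change(hbm, nvme):+.0f}%")

hbm_median, cpu_median, _ = stress["ttft_median_s"]
print(f"median TTFT: {times_lower(hbm_median, cpu_median):.1f}x lower with LMCache CPU")
```

Keeping the two formulas apart avoids mixing "59% lower" and "2.4× lower", which describe the same median TTFT result.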

5.3 The NVMe L3 tier — compressing the tail

The most interesting result is the gap between CPU-only and CPU+NVMe:

Metric	LMCache CPU	LMCache CPU+NVMe	Delta
Requests completed	331	374	+13%
TTFT avg (s)	47.1	38.7	-18%
TTFT p95 (s)	150.2	88.3	-41%
TTFT max (s)	181.4	97.9	-46%
Output tok/s	127	145	+14%

The NVMe tier’s impact is concentrated in the tail latencies: p95 drops by 41% and max drops by 46%. This makes sense: the L3 tier catches KV blocks that would otherwise have been permanently evicted from the 64 GB CPU DRAM L2. Without NVMe, those evicted states force a full prefill recomputation of 50-100K tokens from scratch. With NVMe, they are read back from disk at ~1ms access latency per read instead of costing ~50-100s of prefill.

The cache hit rate actually drops slightly with NVMe (89.6% → 83.4%). This is an artifact of how LMCache counts: L2 evictions to disk are not counted as “hits” until they’re retrieved back. The effective reuse rate is higher.
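
The retrieve-versus-recompute trade-off described above can be estimated with a one-line cost model: access latency plus size divided by bandwidth. The 5 GB/s read bandwidth below is a hypothetical placeholder, not a measurement from this benchmark, and the 6 GB size reuses the FP8 figure for a 100K-token context quoted earlier.

```python
def retrieval_seconds(size_gb, bandwidth_gb_per_s, access_latency_s=0.001):
    """Time to read a cached KV state: fixed access latency plus transfer time."""
    return access_latency_s + size_gb / bandwidth_gb_per_s


kv_size_gb = 6.0            # ~100K tokens of FP8 KV, per the figure quoted earlier
assumed_nvme_gb_s = 5.0     # HYPOTHETICAL sequential read bandwidth
prefill_s = (50.0, 100.0)   # recompute cost range quoted above

read_s = retrieval_seconds(kv_size_gb, assumed_nvme_gb_s)
print(f"NVMe read: ~{read_s:.1f}s vs prefill: {prefill_s[0]:.0f}-{prefill_s[1]:.0f}s")
print(f"advantage: {prefill_s[0] / read_s:.0f}x to {prefill_s[1] / read_s:.0f}x")
```

For a context this large, transfer time dominates the access latency. Under this assumption the read still stays far below the prefill cost.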

5.4 TTFT behavior under pressure — the pressure relief valve

One of the most revealing patterns is how TTFT evolves over the 20-minute stress run:

ATOM HBM-only: TTFT degrades monotonically. Starts around 20s, climbs steadily to 230s, and never recovers. As more users join and the working set grows, HBM prefix cache entries get evicted faster than they can be reused. Each eviction forces a full prefill, which takes longer because the scheduler is already saturated. It’s a death spiral.

ATOM + LMCache CPU: TTFT oscillates. It spikes when a burst of new users arrives (cache-cold), then recovers to 23-46s as cached KV states hit from CPU DRAM. The CPU L2 tier acts as a pressure relief valve: when HBM fills up and starts evicting, the evicted blocks land in DRAM instead of being lost forever. On the next request from the same conversation, the KV state is retrieved from DRAM (~100μs access latency) instead of recomputed from scratch (~50s+).

This oscillating-vs-monotonic pattern is the clearest behavioral evidence that L2 caching works. It’s not just faster on average — it’s self-healing under load.
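
If you log TTFT per request, the two patterns can be told apart with a single number: the largest drop from a running peak. The sketch below is a hypothetical metric of our own, and the two series are synthetic values shaped like the descriptions above, not measured data.

```python
def largest_recovery(ttft_series):
    """Largest drop from a running peak, as a fraction of that peak (0.0 = never recovers)."""
    peak = float("-inf")
    best = 0.0
    for value in ttft_series:
        peak = max(peak, value)
        if peak > 0:
            best = max(best, (peak - value) / peak)
    return best


# SYNTHETIC series for illustration only.
hbm_only = [20, 45, 80, 120, 170, 230]      # climbs and never comes back down
with_l2 = [20, 90, 35, 110, 46, 95, 23]     # spikes, then recovers

print(f"HBM-only recovery: {largest_recovery(hbm_only):.0%}")
print(f"With L2 recovery:  {largest_recovery(with_l2):.0%}")
```

A value near zero over a long run is the death-spiral signature. A large value means the system is getting back below its earlier peaks.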

6. Comparison: ATOM vs Vanilla vLLM (previous blog)

How does the ATOM-optimized stack compare to the vanilla vLLM stack from our previous blog?

Direct comparison is approximate: the previous blog used a different vLLM version (0.19.0 vs 0.19.1), BF16 KV (vs FP8), and different GMU settings. But the order-of-magnitude story is clear:

Stress metric	Vanilla vLLM + LMCache (prev blog)	ATOM + LMCache CPU (this blog)
Requests completed	28	331
TTFT avg (s)	34.6	47.1
Input tok/s	933	6,440

ATOM’s optimizations (AITER kernels, FP8 KV, INT4 all-reduce, CUDA graphs) deliver a fundamentally different throughput regime. The previous blog’s 28 requests in 20 minutes reflects that vanilla vLLM on ROCm was severely bottlenecked. ATOM removes those bottlenecks.

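But LMCache pays off on both baselines. On ATOM, L2/L3 caching adds 59-80% more requests and cuts average TTFT by 44-54% under stress. On vanilla vLLM, the previous blog measured 2.3× more requests and 3× lower TTFT. The relative gain is smaller on the optimized baseline, but it is still large.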

7. Key Findings

Finding 1: L2 caching helps even when the baseline is optimized

The worry going in was that ATOM’s FP8 KV cache (halving per-token KV memory) would give enough HBM headroom to make L2 offloading unnecessary. It doesn’t. Under real agentic workloads at 32 concurrent users with 100K context, HBM still fills up and LMCache still delivers 59% more requests.

FP8 KV does shift the crossover point: ATOM HBM-only completes 208 requests, far more than the previous blog’s vanilla prefix-cache baseline at similar load levels. But it doesn’t eliminate the need for L2.

Finding 2: NVMe L3 is a tail-latency story

The NVMe tier helps average performance only modestly (average TTFT drops 18%) but demolishes the tail: p95 drops 41%, max drops 46%. If your SLO is p95 or p99, NVMe is the cheapest intervention available: the drive is already in the box, and you’re just telling LMCache to use it.

Finding 3: LMCache backend choice is make-or-break

This is the single most impactful finding for practitioners. The default pip install lmcache on ROCm gives you a Python fallback that turns LMCache into a net negative. Our first attempt showed LMCache 1.7× slower than baseline — we spent days debugging before discovering the backend issue.

The fix is simple: BUILD_WITH_HIP=1 pip install -e . from source. But the failure mode is silent — no errors, no warnings in the default log level. You just get slow.

Finding 4: CUDA graphs are not optional with ATOM

ATOM’s vLLM plugin sets enforce_eager=True by default. This is presumably for development/debugging convenience, but it means every batch goes through Python-side eager dispatch instead of a pre-compiled CUDA graph. We measured a 3-5× throughput loss as a result.

Override with:

--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'

We filed ROCm/ATOM #890 suggesting the default should be flipped.

Finding 5: FP8 KV cache is a free lunch (mostly)

ATOM’s --kv-cache-dtype fp8 halves KV cache memory with negligible quality impact on MiniMax-M2.5. This doubles the effective HBM capacity for prefix caching. Combined with LMCache, the L2 tier gets less pressure (fewer evictions from L1), which means better hit rates and less DRAM bandwidth consumed.

Our previous blog used BF16 KV. The FP8 upgrade is one reason ATOM HBM-only completes 208 requests, well above the previous blog’s HBM prefix-cache numbers: you get more cache capacity before needing L2.

Finding 6: The oscillation pattern confirms L2 value

ATOM HBM-only shows monotonically degrading TTFT — a death spiral where eviction pressure compounds. ATOM + LMCache shows oscillating TTFT that recovers after spikes. This behavioral difference is the most intuitive proof that L2 caching works: the system heals itself when cached states are retrievable from DRAM instead of permanently lost.

8. Lessons Learned

Lesson 1: LMCache backend matters enormously

Backend	Source	Performance
PyPI wheel (`pip install lmcache`)	CUDA `c_ops.so` → Python `non_cuda_equivalents` fallback on ROCm	1.7× slower than no cache
Source build (`BUILD_WITH_HIP=1`)	HIP `c_ops.so` with native kernels	2.4× faster median TTFT

The difference between “cache makes things worse” and “cache gives 59% more completed requests” is a single build flag. Check your backend.

Lesson 2: CUDA graphs must be enabled with ATOM

# Without (ATOM default): enforce_eager=True → 3-5× slower
# With: proper graph capture
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'

Lesson 3: PYTHONHASHSEED=0 is still mandatory

Same as the previous blog. Python’s per-process hash randomization breaks LMCache cache-key consistency across TP workers. Symptom: 0% cache hit rate on bit-identical prompts.

Lesson 4: FP8 KV cache changes the economics

With BF16 KV (previous blog), a 100K-token KV cache for MiniMax-M2.5 uses ~12 GB HBM. With FP8 KV (this blog), it’s ~6 GB. That means:

More conversations fit in HBM before L2 eviction starts
The crossover point where LMCache pays off shifts to higher concurrency
But at real production loads (32+ concurrent agentic users), L2 still pays off
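
A rough sanity check of why L2 still pays off, using only numbers reported in this post. It assumes the KV pool is at most GPU count × HBM × gpu-memory-utilization minus model weights, an upper bound that ignores activations and graph memory. It also assumes the tester's working-set peak maps directly to KV tokens. Both are simplifying assumptions.

```python
def kv_pool_upper_bound_gb(gpus, hbm_gb_each, gmu, weights_gb):
    """Upper bound on HBM left for KV blocks (ignores activations, graphs, fragmentation)."""
    return gpus * hbm_gb_each * gmu - weights_gb


def working_set_gb(tokens, gb_per_100k_tokens):
    """KV footprint of a working set, given a per-100K-token cost."""
    return tokens / 100_000 * gb_per_100k_tokens


# Stress-run settings from this post: 2x 192 GB, GMU 0.78, ~230 GB of weights.
pool_gb = kv_pool_upper_bound_gb(gpus=2, hbm_gb_each=192, gmu=0.78, weights_gb=230)

# Working-set peaks from the stress table, in tokens.
peaks = {"HBM-only": 1.91e6, "LMCache CPU": 2.15e6, "LMCache CPU+NVMe": 2.28e6}

print(f"KV pool upper bound: {pool_gb:.1f} GB")
for arm, tokens in peaks.items():
    fp8 = working_set_gb(tokens, 6)     # ~6 GB per 100K tokens (FP8)
    bf16 = working_set_gb(tokens, 12)   # ~12 GB per 100K tokens (BF16)
    print(f"{arm}: {fp8:.0f} GB at FP8, {bf16:.0f} GB at BF16")
```

Under these assumptions the peak FP8 working set (roughly 115 GB for the HBM-only arm) exceeds the ~70 GB upper bound on the KV pool, which is consistent with HBM evicting under stress even after the FP8 halving.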

9. Best Practices

For ATOM + LMCache deployment on MI300X

Build LMCache from source with BUILD_WITH_HIP=1. Do not use the PyPI wheel on ROCm. Verify with python3 -c "from lmcache.storage_backend.connector import c_ops; print(c_ops.__file__)" — it should show a .so, not a .py.

Enable CUDA graphs explicitly:

--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'

Set PYTHONHASHSEED=0 in the server’s environment.
Use FP8 KV cache (--kv-cache-dtype fp8) — it’s a free lunch on MiniMax-M2.5 and likely on most modern MoE models.

Enable ATOM’s fusions:

AITER_QUICK_REDUCE_QUANTIZATION=INT4
ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1

Size the CPU L2 generously (LMCACHE_MAX_LOCAL_CPU_SIZE=64 or higher; the value is in GB). DRAM is cheap; evictions to disk or permanent loss are expensive.
Add NVMe L3 if you care about tail latency. Set LMCACHE_LOCAL_DISK=/path/to/nvme and pip install aiofile. The p95 improvement (41% in our test) is significant for SLO-bound deployments.
Enable async scheduling (--async-scheduling) — lets vLLM overlap scheduling with GPU execution. Free throughput.

Monitor cache hits in production:

LMCache: Reqid=...80e (1030 tok): hit tokens: 1024  ← working ✅
LMCache: Reqid=...8cf (1030 tok): hit tokens: 0     ← broken ❌
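
A small parser makes this check scriptable. It assumes log lines shaped exactly like the two samples above. The format may differ across LMCache versions, so treat the regular expression as a starting point rather than a stable interface.

```python
import re

# Pattern derived from the two sample lines above (an assumption about the log format).
HIT_RE = re.compile(r"Reqid=(?P<req>\S+) \((?P<total>\d+) tok\): hit tokens: (?P<hit>\d+)")


def parse_hits(lines):
    """Yield (request_id, total_tokens, hit_tokens) for each matching log line."""
    for line in lines:
        match = HIT_RE.search(line)
        if match:
            yield match["req"], int(match["total"]), int(match["hit"])


sample_log = [
    "LMCache: Reqid=...80e (1030 tok): hit tokens: 1024",
    "LMCache: Reqid=...8cf (1030 tok): hit tokens: 0",
    "some unrelated log line",
]

records = list(parse_hits(sample_log))
for req, total, hit in records:
    flag = "ok" if hit > 0 else "NO HITS"
    print(f"{req}: {hit}/{total} tokens from cache ({hit / total:.0%}) {flag}")

total_tokens = sum(total for _, total, _ in records)
hit_tokens = sum(hit for _, _, hit in records)
print(f"overall token hit ratio: {hit_tokens / total_tokens:.1%}")
```

A sustained run of zero-hit lines on repeated prompts is the symptom to alert on. It usually points back to the hash-seed or backend issues covered earlier.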

When to use which configuration

Scenario	Recommended config
Low concurrency, short context (≤32K)	ATOM HBM-only (FP8 KV)
Moderate concurrency, mixed context	ATOM + LMCache CPU
High concurrency, long context (100K+), SLO on p95	ATOM + LMCache CPU+NVMe
Decode-bound workloads (short input, long output)	ATOM HBM-only — cache won’t help the bottleneck

When NOT to deploy LMCache with ATOM

Working set fits comfortably in HBM (most chat workloads with FP8 KV)
Decode-bound serving where prefill is already fast relative to generation
You haven’t verified BUILD_WITH_HIP=1 — the Python fallback will make things worse, not better

10. Reproduce This

Container setup

# 1. Start ATOM container
docker run -d --name atom-lmcache --entrypoint /bin/bash \
  --device=/dev/kfd --device=/dev/dri --network=host --ipc=host \
  --group-add video --cap-add SYS_PTRACE \
  -v /your/models:/work/models \
  -v /your/nvme:/nvme/lmcache \
  rocm/atom-dev:vllm-latest -c "sleep infinity"

# 2. Build LMCache with HIP backend
docker exec atom-lmcache bash -c "
  git clone --depth 1 https://github.com/LMCache/LMCache.git /work/LMCache
  cd /work/LMCache && BUILD_WITH_HIP=1 pip install -e . --no-build-isolation
  pip install aiofile
"

Server (ATOM + LMCache CPU+NVMe arm)

docker exec -d atom-lmcache bash -c "
  AITER_QUICK_REDUCE_QUANTIZATION=INT4 \
  ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1 \
  VLLM_FLOAT32_MATMUL_PRECISION=high \
  PYTHONHASHSEED=0 \
  LMCACHE_LOCAL_CPU=true \
  LMCACHE_CHUNK_SIZE=256 \
  LMCACHE_MAX_LOCAL_CPU_SIZE=64 \
  LMCACHE_LOCAL_DISK=/nvme/lmcache \
  vllm serve /work/models/MiniMax-M2.5 \
    --tensor-parallel-size 2 --gpu-memory-utilization 0.78 \
    --kv-cache-dtype fp8 \
    --async-scheduling \
    --enable-prefix-caching \
    --compilation-config '{\"cudagraph_mode\": \"FULL_AND_PIECEWISE\"}' \
    --max-num-batched-tokens 16384 \
    --kv-transfer-config '{\"kv_connector\":\"LMCacheConnectorV1\",\"kv_role\":\"kv_both\"}' \
    --tool-call-parser minimax_m2 --reasoning-parser minimax_m2 \
    --enable-auto-tool-choice --trust-remote-code \
    --host 0.0.0.0 --port 8000
"

Trace replay client

git clone --recursive https://github.com/callanjfox/kv-cache-tester.git
cd kv-cache-tester

# Stress run
python3 trace_replay_tester.py \
  --api-endpoint http://127.0.0.1:8000 \
  --trace-directory traces \
  --start-users 4 --max-users 32 \
  --max-ttft 60.0 --test-duration 1200 \
  --max-context 100000 --warm-prefix-pct 0.5 \
  --timing-strategy think-only --recycle \
  --seed 42 \
  --output-dir ./results

11. What’s Next

This blog tested two tiers (CPU + NVMe) on top of ATOM’s HBM cache. Several directions remain:

PD disaggregation: Separate prefill and decode workers. LMCache and PD are complementary — prefill workers compute KV, store to shared L2/L3; decode workers retrieve. This is the architecture Mooncake, DeepSeek, and NVIDIA Dynamo use in production.
Speculative decoding: Our output throughput topped out at 145 tok/s aggregate. The decode side is the bottleneck, and KV caching doesn’t help decode. Eagle-2 or Medusa speculative decoding could give 2-3× on the decode path.
Multi-node L2 with nixl/distributed cache: LMCache supports distributed backends (Redis, remote object stores). In a multi-node TP>2 setup, shared KV cache across replicas would enable cross-replica prefix reuse.
Quantized L2 storage: Storing FP8 KV in the L2/L3 tiers (instead of converting back to full precision) would halve the DRAM and NVMe bandwidth required for cache transfers.

12. Acknowledgments

AMD ROCm team for ATOM and the rocm/atom-dev container
LMCache team for the HIP-compatible build system and the KV connector
callanjfox / WEKA for the kv-cache-tester toolkit and 739 anonymized Claude Code traces
Hotaisle for MI300X access

Bench environment: ENC1-CLS01-SVR08, 2× AMD MI300X (gfx942, 192 GB HBM each), ROCm 7.2.3, ATOM v0.1.3.dev203, vLLM 0.19.1, LMCache (HIP-built, commit ~2026-05), PyTorch 2.10.0+rocm7.2.3. Container: rocm/atom-dev:vllm-latest.