rocm3d: From One-Off 3D Ports to Reusable ROCm Compatibility Recipes

By David Li and Andy Luo

Turning recurring CUDA-ecosystem dependencies in 3D / VLA / world-model migration into reusable ROCm recipes a Code Agent can apply and verify on MI300X.

As spatial intelligence and embodied AI become more concrete application directions, workloads such as 3D generation, 3D reconstruction, world models, vision-language-action (VLA) models, and grasping are moving from research demos toward more systematic engineering validation. Over the past few months, more of these open-source repos have been tested on AMD MI300X. At first, the work looked like a set of unrelated porting tasks: one repo needed Gaussian splatting, another needed sparse convolution, another broke on flash attention, and another hid a small CUDA extension under models/ops.

After enough attempts, the pattern became clearer: many failures were not model-specific. They were dependency-pattern failures. Under different model names, the same CUDA-default packages, native extension builds, backend fallbacks, and “pip pulled the CUDA wheel back in” problems kept reappearing.

That is why rocm3d was organized. It is not a model zoo, and it is not an attempt to maintain a fork for every repo. Its core asset is a Cursor Agent Skill: rocm-lib-compat. The skill turns recurring CUDA-ecosystem dependencies in 3D / VLA / world-model migration into ROCm base-image choices, dependency cleanup rules, library replacements, backend sanity checks, and smoke-test recipes. The Code Agent handles execution, logs, command edits, and retries; the skill tells the agent what to do when it sees each CUDA-default dependency.

TL;DR

The early approach was to hand-write ROCm migration scripts for a batch of 3D generation and reconstruction models, and later to try an autorun framework that could generate, execute, and repair scripts. After running through dozens of repos, the conclusion became simpler: the reusable asset is not the framework. It is an accurate ROCm library compatibility skill.

rocm3d collects recurring CUDA-default dependency patterns such as xformers, gsplat, pytorch3d, flash-attn, spconv, tiny-cuda-nn, and repo-local native extensions into reusable ROCm recipes, then lets a Code Agent apply and verify them on MI300X.

Why This Exists

3D reconstruction / generation, video world models, VLA / robotics, and grasping are becoming real AI workloads rather than just research demos. They are different from LLM serving: many repos do not depend only on PyTorch and one attention kernel. They often bring rendering, point clouds, mesh processing, sparse convolution, differentiable rasterization, and robotics postprocessing into the same environment.

Most of these components are CUDA-first by default. README install commands often assume NVIDIA GPUs. requirements.txt may pin CUDA wheels. pyproject.toml may default to xformers or gsplat. Submodules may contain handwritten CUDA extensions. To a user, this looks like a repo-specific install issue; for ROCm adoption, it is a repeated systems problem.

One recurring observation is that the blocker is often not the model architecture itself. A 3D Gaussian repo, a point-cloud backbone, a video world model, and a grasping pipeline look very different at the model level, but their install failures often collapse into the same small set of dependency families. Writing one-off install notes per repo does not scale well.

rocm3d is meant to address this repetition. The goal is not to prove that one specific 3D model can run, but to help the next similar repo avoid rediscovering the same ROCm migration rules.

What Repeated Ports Taught Us

The first lesson is that failures are rarely unique.

Early on, it is easy to treat every repo as an isolated port. This repo is missing gsplat; that repo needs spconv; another repo cannot install flash_attn. That classification is too close to the model name and too far from the real issue. The durable unit is the dependency family: what is the CUDA-default path, whether a ROCm wheel exists, whether the import name stays the same, how the native extension should be built, and whether validation should look at import, compiled backend, or final artifact.

The second lesson is that execution itself is not the hard part.

Remote login, Docker startup, command execution, log reading, script edits, and retries are things modern Code Agents are already good at. The repeated waste is in dependency decisions: ROCm 6.4 or ROCm 7.x? Which CUDA pins need to be removed? After installing amd_gsplat, is the import name still gsplat? Did flash-attn run through PyTorch SDPA, FA2 Triton, AITER Triton, or AITER CK? Can a native extension be built with hipcc directly, or does it need a replacement library?

That knowledge should not live only in chat logs and throwaway scripts. It belongs in a skill.

The converged judgment is:

The durable asset is not the automation wrapper. It is the compatibility knowledge that tells an agent what to do when it sees each CUDA-default dependency.

In other words, the thing worth maintaining is not “how to execute commands.” It is “when this CUDA-default dependency appears, which ROCm base should be chosen, which pins should be removed, which wheel should be installed, and how should the backend be checked for fallback?”
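
One way to picture what a recipe has to carry is a small record per dependency family. The sketch below is illustrative only: the skill itself is Markdown, and the `Recipe` class and its field names are hypothetical rather than the skill's actual schema. The example values are taken from the gsplat case described later in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """Hypothetical record for one CUDA-default dependency (not the skill's real schema)."""

    cuda_default: str      # package name the upstream repo asks for
    base_image: str        # ROCm base image to start from
    pins_to_remove: tuple  # requirement names to strip before installing
    replacement: str       # ROCm package or source path installed instead
    import_name: str       # module name the repo code actually imports
    evidence: tuple = ()   # checks to run, weakest to strongest


GSPLAT = Recipe(
    cuda_default="gsplat",
    base_image="rocm/pytorch:rocm6.4.3_ubuntu24.04_py3.12_pytorch_release_2.6.0",
    pins_to_remove=("torch", "torchvision", "torchaudio", "xformers", "gsplat"),
    replacement="amd_gsplat",
    import_name="gsplat",
    evidence=("import", "compiled backend", "Gaussian count", "PLY export", "render output"),
)


def summarize(recipe: Recipe) -> str:
    """Render a recipe as the one-line decision an agent would log."""
    return (
        f"{recipe.cuda_default}: install {recipe.replacement}, "
        f"import as {recipe.import_name}, verify {' -> '.join(recipe.evidence)}"
    )


print(summarize(GSPLAT))
```

Keeping the replacement name and the import name as separate fields is deliberate: for gsplat they differ, and that difference is exactly what an agent tends to get wrong.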

The Core Asset: rocm-lib-compat

rocm-lib-compat is a Cursor Agent Skill. It is Markdown, but the Markdown encodes migration rules.

It covers:

ROCm version strategy
recommended base images
CUDA-to-ROCm replacement table
dependency cleanup patterns
native extension build patterns
backend verification checks
known pitfalls

Version strategy matters. It is natural to ask whether the newest ROCm version is always best. In real migrations, the answer depends on the dependency family.

For many 3D repos, ROCm 6.4 remains a practical default because the PyTorch Docker image and several 3D-related wheels are more mature there. For attention-heavy repos, ROCm 7.x may be better for validating AITER CK paths. For pure PyTorch / SDPA repos, the most important thing is often not chasing the newest stack, but preserving ROCm PyTorch and preventing pip from replacing it with CUDA wheels.

A condensed family map looks like this:

Family	Recipe focus	Example anchor
Dependency cleanup / package replacement	Remove CUDA-default pins, choose ROCm wheel/source path, keep package/import names straight	Depth-Anything-3
Gaussian / raster package path	`amd_gsplat`, compiled extension sanity check, render smoke test	classic 3DGS, ZipSplat
Sparse / point / graph	`spconv_rocm`, PyG ROCm wheels, `torch_scatter` / `torch_sparse` / `torch_cluster` sanity	PointTransformerV3
Attention backend sanity	PyTorch SDPA, ROCm Triton, FA2 Triton, AITER Triton / CK; verify actual runtime backend	Matrix-Game
Repo-local native op	`PYTORCH_ROCM_ARCH`, hipcc, custom op registration, correctness smoke test	GroundingDINO
Recipe composition	Multiple families in one repo: attention, mesh/raster, point-cloud ops, postprocess	TRELLIS.2, GraspGen
Neural rendering / neural field extensions	diff-rasterizers, `simple-knn`, `tiny-rocm-nn`, `custom_rasterizer`, `pytorch3d`, NVIDIA-origin rasterization components	TriSplat

This table matters more than a support list. A support list says which repos have run. A dependency-family map says where the next repo is likely to fail and what should be tried first.
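
To make the "where will the next repo fail" idea concrete, the family map can be read as a lookup from package name to family. The sketch below is a minimal illustration: `FAMILY_BY_PACKAGE` and `families_for` are hypothetical names, and the table is a hand-condensed subset of the map above, not an exhaustive list.

```python
import re

# Hypothetical lookup condensed from the family map above; extend per project.
FAMILY_BY_PACKAGE = {
    "xformers": "dependency cleanup",
    "gsplat": "gaussian / raster",
    "spconv": "sparse / point / graph",
    "torch-scatter": "sparse / point / graph",
    "torch-sparse": "sparse / point / graph",
    "torch-cluster": "sparse / point / graph",
    "flash-attn": "attention backend",
    "diff-gaussian-rasterization": "neural rendering extension",
    "simple-knn": "neural rendering extension",
    "pytorch3d": "neural rendering extension",
}


def normalize(name: str) -> str:
    """Fold case and treat '-', '_' and '.' alike, as pip does for package names."""
    return re.sub(r"[-_.]+", "-", name).lower()


def families_for(package_names):
    """Return the sorted dependency families touched by a list of package names."""
    hits = {FAMILY_BY_PACKAGE.get(normalize(name)) for name in package_names}
    hits.discard(None)  # packages outside the map carry no family
    return sorted(hits)


print(families_for(["flash_attn", "torch_cluster", "numpy"]))
```

A repo that returns more than one family is a composition case, and each family's recipe should be checked on its own before the full pipeline is trusted.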

A Few Build-Up Examples

The following are not full experiment reports. They are compressed build-up examples, each mapped to a recurring dependency pattern.

Dependency Cleanup: Depth-Anything-3

The most basic and frequent value is preventing the environment from silently reinstalling CUDA packages.

Many repo blockers are not in model code. They are in the default package path inside requirements.txt or pyproject.toml. Repos like Depth-Anything-3 may ask for CUDA-default dependencies such as xformers and gsplat. The first useful thing rocm-lib-compat does is not kernel work. It turns those implicit defaults into explicit decisions.

# Dependency-cleanup example: Depth-Anything-3
# Goal: keep ROCm PyTorch from the base image and remove CUDA-default packages
# before they overwrite the working runtime.

# -it keeps an interactive shell open so the following commands run inside the container.
docker run -it --device=/dev/kfd --device=/dev/dri --ipc=host \
  -e HIP_VISIBLE_DEVICES=0 \
  rocm/pytorch:rocm6.4.3_ubuntu24.04_py3.12_pytorch_release_2.6.0

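# DA3 upstream install asks for xformers and nerfstudio gsplat.
# rocm-lib-compat turns that into explicit package decisions.
# Note: this is a prefix match, so it also drops names such as torchmetrics
# or torch_scatter; reinstall any of those that the repo really needs.
grep -vEi '^(torch|torchvision|torchaudio|xformers|gsplat|cupy|bpy|flash.attn|triton)' \
  requirements.txt | pip install -r /dev/stdin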

# If the repo needs Gaussian output, use the ROCm gsplat path instead of
# silently pulling a CUDA wheel.
pip install amd_gsplat --no-deps \
  --extra-index-url=https://pypi.amd.com/rocm-6.4.3/simple/

The same cleanup pattern appears in Hunyuan3D-style repos: skip CUDA-only optional dependencies, keep ROCm torch, then verify a bounded artifact. For example, on MI300X with ROCm 6.4, the Hunyuan3D-2.1 shape-generation path can produce a 344K-vertex GLB using PyTorch’s built-in AOTriton attention, and custom_rasterizer can be built as a PyTorch C++ extension.

The point is that install path is part of compatibility. Before profiling starts, performance and correctness may already have been decided by the wrong wheel.
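
The grep filter in the Depth-Anything-3 example matches by prefix. Where a repo needs finer control, the same decision can be sketched in Python with exact package-name matching. This is a minimal sketch rather than part of rocm-lib-compat: `split_requirements` and the `CUDA_DEFAULT` set are hypothetical names, and the blocked list simply mirrors the grep pattern above.

```python
import re

# Mirrors the grep pattern in the Depth-Anything-3 example; adjust per repo.
CUDA_DEFAULT = {
    "torch", "torchvision", "torchaudio", "xformers",
    "gsplat", "cupy", "bpy", "flash-attn", "triton",
}

_NAME = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)")


def normalize(name: str) -> str:
    """Fold case and treat '-', '_' and '.' alike, as pip does for package names."""
    return re.sub(r"[-_.]+", "-", name).lower()


def split_requirements(lines, blocked=CUDA_DEFAULT):
    """Split requirement lines into (kept, dropped) by exact package name."""
    kept, dropped = [], []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        match = _NAME.match(line)
        name = normalize(match.group(1)) if match else ""
        # cupy ships CUDA-specific variants such as cupy-cuda12x.
        is_blocked = name in blocked or name.startswith("cupy-cuda")
        (dropped if is_blocked else kept).append(line)
    return kept, dropped


kept, dropped = split_requirements(
    ["torch==2.6.0", "torchmetrics>=1.0", "flash_attn", "einops", "cupy-cuda12x"]
)
print("install:", kept)
print("removed:", dropped)
```

Returning the dropped lines as well keeps the decision auditable: the agent can log exactly which pins it removed. Lines that are pip options (for example `-e` or `--extra-index-url`) do not match the name pattern and are passed through unchanged.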

Gaussian / Raster Packages: classic 3DGS + ZipSplat

Gaussian splatting repos commonly follow two paths.

The first path is classic 3DGS, which depends on source-built rasterizer / KNN extensions such as diff-gaussian-rasterization and simple-knn. The second path is more package-oriented. ZipSplat depends on gsplat.rendering.rasterization; on ROCm, amd_gsplat can replace it while keeping the import name as gsplat.

For this family, import gsplat is not enough. Better evidence is the compiled backend, Gaussian count, PLY export, and render output.

# Two Gaussian paths show up repeatedly.

# 1. Classic 3DGS: source-built rasterizer/KNN extensions.
git clone --recursive https://github.com/graphdeco-inria/gaussian-splatting
cd gaussian-splatting
PYTORCH_ROCM_ARCH=gfx942 pip install submodules/diff-gaussian-rasterization --no-build-isolation
PYTORCH_ROCM_ARCH=gfx942 pip install submodules/simple-knn --no-build-isolation
python render.py -m <trained-or-pretrained-model>

# 2. ZipSplat: package-path replacement where the import name remains `gsplat`.
pip install amd_gsplat --no-deps \
  --extra-index-url=https://pypi.amd.com/rocm-6.4.3/simple/
python run_zipsplat_rocm_smoke.py \
  --input assets/examples/office \
  --num-frames 5 \
  --compression 1.0 \
  --render-size 512

In the local ZipSplat validation, amd_gsplat loaded under the gsplat import name and reported version 1.5.3. The official office input produced 51,840 Gaussians with render_status=pass. That is more meaningful than an import check because it covers model forward, Gaussian export, and the render path in one bounded smoke test.
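
The Gaussian count can also be cross-checked against the exported file. The sketch below reads the vertex count from a PLY header, assuming the export follows the common 3DGS convention of one PLY vertex per Gaussian; whether a given repo's exporter does so should be confirmed. `ply_vertex_count` is a hypothetical helper, not part of gsplat or ZipSplat, and the sample header is synthetic.

```python
def ply_vertex_count(data: bytes) -> int:
    """Return N from the 'element vertex N' line of a PLY header."""
    header, terminator, _ = data.partition(b"end_header")
    if not data.startswith(b"ply") or not terminator:
        raise ValueError("not a PLY file, or the header is truncated")
    for line in header.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == b"element" and parts[1] == b"vertex":
            return int(parts[2])
    raise ValueError("no 'element vertex' line in the PLY header")


# Synthetic header built for illustration only.
sample = (
    b"ply\nformat binary_little_endian 1.0\n"
    b"element vertex 51840\nproperty float x\nend_header\n"
)
print(ply_vertex_count(sample))
```

For a real file, reading the first few kilobytes (`open(path, "rb").read(65536)`) is normally enough, since the header precedes the binary payload.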

Sparse / Point Cloud: PointTransformerV3

Point-cloud / sparse 3D / grasping repos often share a lower-level dependency set: spconv, PyG wheels, torch_scatter, torch_cluster, and attention backends. The tricky part is that wheel name, package name, import name, and actual backend are often not the same thing.

PointTransformerV3 is a good minimal example. Its standalone model path needs spconv, optional flash attention, and torch_scatter’s segment_csr behavior, but it does not necessarily require the full Pointcept framework or its CUDA pointops.

# Sparse / point-cloud example: PTv3 standalone on MI300X / ROCm 7.2
docker run -it --device=/dev/kfd --device=/dev/dri --ipc=host \
  -e HIP_VISIBLE_DEVICES=0 \
  rocm/pytorch:rocm7.2_ubuntu24.04_py3.13_pytorch_release_2.10.0

git clone -b rocm https://github.com/ZJLi2013/spconv_rocm.git
pip install -e spconv_rocm/
FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE pip install flash-attn --no-build-isolation
pip install addict einops scipy timm

# torch_scatter did not have a workable wheel path in this run, so the minimal
# PTv3 path used a pure PyTorch segment_csr shim for inference.
python ptv3_modelnet40_inference.py --diverse --num-classes 40 --enable-flash

The local record shows spconv_rocm, FA2 Triton, and a torch_scatter shim supporting PTv3 standalone inference. ModelNet40 passed across all 40 classes, with output shape [10000, 64].
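
For readers who have not met the op, `segment_csr` reduces contiguous slices of an input that are delimited by an index-pointer array. The pure-Python reference below spells out the sum reduction. It is not the PyTorch shim used in the run above, only a small oracle that such a shim could be tested against, and it assumes that empty segments reduce to zero.

```python
def segment_csr_sum(values, indptr):
    """Reference semantics: out[i] = sum(values[indptr[i]:indptr[i + 1]])."""
    if len(indptr) < 1 or indptr[0] < 0 or indptr[-1] > len(values):
        raise ValueError("indptr must stay within the bounds of values")
    if any(start > end for start, end in zip(indptr, indptr[1:])):
        raise ValueError("indptr must be non-decreasing")
    # An empty slice sums to 0, which is the assumed behavior for empty segments.
    return [sum(values[start:end]) for start, end in zip(indptr, indptr[1:])]


# Three segments: values[0:2], values[2:2] (empty), values[2:5].
print(segment_csr_sum([1, 2, 3, 4, 5], [0, 2, 2, 5]))
```

Comparing a shim against a reference like this on a few small inputs, including an empty segment, is a cheap check before running the full point-cloud workload.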

The value of this example is not just that “PTv3 ran.” It separates the sparse / point-cloud family into backend install, sparse-conv kernel, scatter/gather dataflow, and a real point-cloud workload. The next GraspGen, GGPT, or point-cloud repo does not need to start from scratch.

Attention Backend Sanity: Matrix-Game

For attention-heavy repos, a successful install does not mean the right backend ran.

Repos such as Matrix-Game-3.0 may have multiple attention paths: PyTorch SDPA, FA2 Triton, AITER Triton, and AITER CK. Different ROCm versions, PyTorch versions, and install paths can send the same model code to different backends. The important part is not to recommend one backend universally, but to verify which backend the model actually used and validate it with an artifact.

# Attention backend example: Matrix-Game-3.0
# Start from a safe install that does not overwrite ROCm PyTorch.
grep -v -E '^(torch|torchvision|torchaudio|flash.attn)' requirements.txt \
  | pip install -r /dev/stdin
pip install opencv-python-headless

# Baseline fallback path: patch hard FlashAttention assert to SDPA.
python patch_attention.py --backend sdpa
# The size is quoted so the shell does not treat * as a glob.
torchrun --nproc_per_node=1 generate.py \
  --size '704*1280' \
  --ckpt_dir Matrix-Game-3.0 \
  --num_iterations 2 \
  --num_inference_steps 3 \
  --save_name test_amd_sdpa

# Better ROCm paths tested later:
FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE pip install flash-attn --no-build-isolation
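# AITER Triton / CK depending on ROCm image and install source. Confirm with
# `pip show aiter` that the installed package is AMD's AITER (github.com/ROCm/aiter).
pip install aiter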

In local validation, the SDPA fallback generated a 1.9MB mp4. FA2 Triton generated a 1.8MB mp4. AITER CK in a ROCm 7.2 Docker image reduced steady-state iteration time from about 14.6s to about 11s. The evidence is not just that a package installed. It is the combination of video artifact, runtime backend, and iteration time.
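
Iteration times are only comparable across backends if warm-up iterations are excluded. A minimal sketch of that bookkeeping follows; the helper names and the choice of the median are assumptions, not the method used for the numbers above.

```python
from statistics import median


def steady_state_seconds(iteration_times, warmup=1):
    """Median per-iteration time after discarding the first `warmup` iterations."""
    steady = list(iteration_times)[warmup:]
    if not steady:
        raise ValueError("need at least one iteration after warm-up")
    return median(steady)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate time must be positive")
    return baseline_seconds / candidate_seconds


# Figures reported above: about 14.6s per iteration versus about 11s.
print(round(speedup(14.6, 11.0), 2))
```

With the figures reported above, the ratio comes to roughly 1.33. Recording the backend name next to each timing keeps the comparison tied to what actually ran.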

Repo-Local Native Op: GroundingDINO

Some blockers are not library wheels. They are repo-local ops.

GroundingDINO’s MsDeformAttn is a typical example. A PyTorch fallback can unblock inference, but fallback is only a compatibility baseline. For inference, the better path is a forward-only HIP extension registered as a compiler-friendly custom op.

# Repo-local op example: GroundingDINO MsDeformAttn on ROCm
docker run -it --device=/dev/kfd --device=/dev/dri --ipc=host \
  -e HIP_VISIBLE_DEVICES=0 \
  rocm/pytorch:rocm6.4.3_ubuntu24.04_py3.12_pytorch_release_2.6.0

git clone -b rocm_supported https://github.com/ZJLi2013/GroundingDINO.git
cd GroundingDINO
pip install -r requirements.txt
pip install -e . --no-build-isolation

python tools/benchmark_inference.py \
  --config groundingdino/config/GroundingDINO_SwinT_OGC.py \
  --checkpoint /models/groundingdino_swint_ogc.pth \
  --image /data/truck.jpg \
  --text-prompt "truck"

python tools/benchmark_msda_op.py

The local experiments show that the PyTorch fallback unblocked inference and produced one truck box. The HIP forward path reduced end-to-end latency from 0.1523s to 0.0983s (about 1.5x faster). At the op level, HIP forward took 0.144ms versus 1.943ms for the fallback (about 13.5x faster), with a max diff of 2.16e-6.

This illustrates a broader pattern: repo-local native ops need a minimal workload, a correctness check, and custom-op registration. Otherwise, it is easy to stop at “it runs but is slow” or “it imports but cannot be optimized.”
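
The correctness half of that pattern can be as small as a maximum absolute difference against the fallback output. The sketch below works on flat Python lists so it runs anywhere; `max_abs_diff`, `matches_reference`, and the `1e-5` tolerance are illustrative assumptions, not the contents of `tools/benchmark_msda_op.py`.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def matches_reference(reference, candidate, atol=1e-5):
    """True if the candidate op stays within `atol` of the reference everywhere."""
    return max_abs_diff(reference, candidate) <= atol


fallback_out = [0.25, -1.5, 3.0]         # stand-in for the PyTorch fallback output
native_out = [0.25, -1.5, 3.0 + 2e-6]    # stand-in for the native op output
print(max_abs_diff(fallback_out, native_out), matches_reference(fallback_out, native_out))
```

With PyTorch tensors the same quantity is `(reference - candidate).abs().max()`. The tolerance should be chosen per op and dtype rather than copied from this sketch.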

Composition Stress Test: TRELLIS.2 + GraspGen

A simple repo tests one replacement rule. A useful stress case tests whether several rules compose.

TRELLIS.2 is an image-to-3D composition example: attention, flex_gemm, cumesh, mesh / raster dependencies, and the image-to-3D runtime all appear together. GraspGen is a grasping pipeline example: PointNet2-style point-cloud ops, torch-cluster-like dependencies, sampling, scoring, and postprocessing all appear in one workflow.

# Composition example A: TRELLIS.2 image-to-3D
pip install flash-attn --index-url=https://pypi.amd.com/simple \
  --extra-index-url=https://pypi.org/simple
GPU_ARCHS=gfx942 pip install git+https://github.com/ZJLi2013/CuMesh.git@rocm --no-build-isolation
pip install git+https://github.com/ZJLi2013/FlexGEMM.git@rocm --no-build-isolation
pip install ./o-voxel --no-build-isolation
# nvdiffrast / nvdiffrec are NVIDIA-origin components. They appear here only as
# compatibility-pattern examples, not as redistributed or relicensed code.
python run_trellis2_image_to_3d.py --model microsoft/TRELLIS.2-4B --image input.png

# Composition example B: GraspGen
PYTORCH_ROCM_ARCH=gfx942 python pointnet2_ops/setup.py build_ext --inplace
pip install torch-cluster -f https://github.com/Looong01/pyg-rocm-build --no-build-isolation
python demo_object_pc.py --checkpoint graspgen_robotiq_2f_140 --output_file demo_output_sample.npz

In local validation, TRELLIS.2 ran the full pipeline on MI300X and produced about 5.99M vertices and 12.2M faces in about 308s. GraspGen exported Top-100 SE(3) grasps for three object point clouds, with inference time between 0.40s and 1.88s, and recorded score ranges.

These examples also show the boundary for public write-ups. Some components and checkpoints come from NVIDIA-origin repos or have explicit license / model terms. They should be described as local compatibility-pattern validation, not redistribution, relicensing, or “an AMD version” of an upstream NVIDIA-origin component.

Neural Rendering / Neural Field Extensions

Neural rendering repos are another frequent case. ROCm compatibility is often not a single wheel install, but a set of native-extension recipes: set the right GPU arch, build with hipcc, confirm the extension import name, run a tiny render or KNN workload, and only then trust the higher-level demo.
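
A quick way to confirm that an extension import resolves to a compiled binary rather than a Python fallback is to inspect where the module would be loaded from. The helper below is a sketch using only the standard library; `is_compiled_extension` is a hypothetical name, and the module names to probe differ per repo.

```python
import importlib.util
from importlib.machinery import EXTENSION_SUFFIXES


def is_compiled_extension(module_name: str) -> bool:
    """True if the module resolves to a compiled shared object, not Python source."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        return False  # parent package missing or module in an odd state
    origin = getattr(spec, "origin", None)
    return bool(origin) and origin.endswith(tuple(EXTENSION_SUFFIXES))


# A pure-Python standard-library package is not a compiled extension.
print(is_compiled_extension("json"))
```

Resolving a dotted name such as a package's compiled submodule imports the parent package first, so this check is best run inside the same environment and container as the smoke test. It shows that a binary exists, not that it was built for the right GPU architecture.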

TriSplat is a representative example. It touches diff-gaussian-rasterization-w-pose, diff-triangle-rasterization, simple-knn, and CroCo’s curope extension.

# Neural rendering extension example: TriSplat
# This is different from wheel replacement: the work is mostly
# source-built extensions and submodule hygiene.
export PYTORCH_ROCM_ARCH=gfx942
export CUDA_HOME=${CONDA_PREFIX}
export CUDA_PATH=${CUDA_HOME}
pip install --no-build-isolation /tmp/diff-gaussian-rasterization-w-pose-main
pip install --no-build-isolation submodules/diff-triangle-rasterization
pip install --no-build-isolation submodules/simple-knn
pip install --no-build-isolation src/model/encoder/backbone/croco/curope

This family also includes tiny-cuda-nn / tiny-rocm-nn, custom_rasterizer, pytorch3d mesh / raster ops, and NVIDIA-origin differentiable rendering components such as nvdiffrast / nvdiffrec.

The wording matters. A safer description is:

Some differentiable rendering repos depend on NVIDIA-origin rasterization components such as nvdiffrast / nvdiffrec. In rocm3d, these should be framed as compatibility-pattern examples rather than redistribution targets: the useful lesson is how to recognize a native rasterization extension, choose an appropriate ROCm-compatible source path, build it with the ROCm toolchain, and verify rasterize / interpolate / antialias / texture smoke tests before running the full model.

Avoid phrases such as “AMD version of nvdiffrast” or “we distribute nvdiffrast for ROCm.” More accurate wording is “NVIDIA-origin rasterization components,” “compatibility-pattern example,” “ROCm-compatible source path,” “technical validation,” and “no redistribution or relicensing of upstream components.”

What the Validation Matrix Taught Us

The support matrix is not a trophy list. It is a dependency map.

After more repos are tested, the most useful signal is not the count. It is which core libraries repeat, and which libraries unlock multiple model categories once they work. gsplat / amd_gsplat affects feed-forward 3DGS and Gaussian export. spconv_rocm and PyG ROCm wheels affect point-cloud backbones and grasping. Attention backends affect video and world models. Differentiable rasterization affects neural rendering and mesh texture paths.

Another lesson is that install path is part of performance.

Many performance issues are already decided before profiling starts: wrong wheel, incomplete native extension build, Python fallback, pip resolver pulling CUDA packages back in, or the wrong Docker / Python combination. Profiling after the model runs is important, but only after the model is actually running on the intended backend.
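
A cheap pre-profiling check is to scan the installed package list for signs that CUDA wheels came back. The heuristic below flags `nvidia-*` packages and `+cuNNN` local version tags in `pip freeze` output. It is a sketch with a hypothetical function name, and a clean result does not prove the ROCm backend is in use.

```python
import re


def cuda_residue(freeze_lines):
    """Return `pip freeze` lines that suggest CUDA wheels are installed."""
    suspicious = []
    for raw in freeze_lines:
        entry = raw.strip().lower()
        if not entry:
            continue
        name = re.split(r"[=<>!~@ ]", entry, maxsplit=1)[0]
        # nvidia-* runtime packages and +cuNNN version tags both point at CUDA builds.
        if name.startswith("nvidia-") or re.search(r"\+cu\d+", entry):
            suspicious.append(raw.strip())
    return suspicious


example = [
    "torch==2.6.0+rocm6.4",
    "nvidia-cudnn-cu12==9.1.0.70",
    "torchvision==0.21.0+cu124",
    "numpy==2.1.0",
]
print(cuda_residue(example))
```

The input can come from `python -m pip freeze` run inside the container after every install step, so a resolver that pulled CUDA packages back in is caught at the step that caused it.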

The default path is also part of the ecosystem.

The NVIDIA ecosystem is strong in default wheels, default docs, upstream README paths, and default backends. ROCm does not need one handmade workaround per repo; it needs clearer wheel sources, smoke tests, fallback detection, and upstream PR / issue tracking. If the default path is unclear, users naturally return to the CUDA package path.

Finally, “it runs” needs evidence.

A safer pattern is to leave bounded evidence for each repo:

repo/env manifest -> bounded workload -> backend sanity -> artifact -> optional profile/handoff

At minimum, that means environment, input workload, install path, runtime artifact, and backend sanity check. Without those, “it runs” is too easy to misread: it may mean import success, CPU fallback, Python fallback, or an output file generated by a backend different from the one expected.
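
That minimum can be captured as a small record that refuses to call a run complete while any field is empty. The `Evidence` class below is a hypothetical sketch, not a format rocm3d defines; the example values are taken from the GroundingDINO section.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Evidence:
    """Hypothetical per-repo evidence record (field names are illustrative)."""

    repo: str
    environment: str   # base image or env manifest
    workload: str      # bounded input that was run
    install_path: str  # how the key dependency was installed
    backend: str       # backend observed at runtime, not just installed
    artifact: str      # output produced by the run


def missing_fields(evidence: Evidence) -> list:
    """Names of fields that are still empty, in declaration order."""
    return [name for name, value in asdict(evidence).items() if not str(value).strip()]


record = Evidence(
    repo="GroundingDINO",
    environment="rocm/pytorch:rocm6.4.3_ubuntu24.04_py3.12_pytorch_release_2.6.0",
    workload="truck.jpg with text prompt 'truck'",
    install_path="pip install -e . --no-build-isolation",
    backend="",  # left empty on purpose: not yet verified
    artifact="one truck box",
)

print(json.dumps(asdict(record), indent=2))
print("incomplete:", missing_fields(record))
```

Treating an empty `backend` field as a failure is the design choice that matters here: it stops an import success or a fallback run from being recorded as "it runs".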

Using rocm3d

In practice, rocm3d is more like a migration manual for an agent than a script users execute line by line. Given a target repo, target machine, and rough workload, the Code Agent uses rocm-lib-compat to inspect the dependency structure: which packages need to be moved off CUDA-default paths, which can rely on ROCm PyTorch, and which need ROCm wheels, source builds, or repo-local patches.

git clone https://github.com/ZJLi2013/rocm3d.git

The goal should not be defined as “the install command exits with code 0.” A more useful goal is to leave an auditable migration path: which ROCm base image was chosen, which CUDA pins were removed, which libraries were replaced, which runtime backend actually loaded, and which bounded workload produced which artifact.

That is why the skill matters more than a temporary script. A temporary script solves one run; the skill carries forward why the install was done that way, how to check whether it was done correctly, and where to look first when it fails. For a new repo, the ideal result is not a polished generic install note, but a mapping back to existing dependency families: is this a gsplat problem, an attention-backend problem, a sparse point-cloud problem, a repo-local op problem, or a composition of several?

What Is Next

The next step is not simply adapting more repos. It is building the asset layers around the compatibility knowledge.

Future work can move along a few lines:

Layer	Next step
Productization	Core lib smoke suite for `gsplat`, `pytorch3d`, PyG ROCm, `spconv_rocm`, differentiable rendering packages
Performance	Build on the `kernels-for-3d` work in the `rocm3d` dev branch: attention, GEMM, norm, RoPE, convolution, sparse conv, raster / render kernels
Upstream	PR / issue tracking for repeated dependency fixes

rocm3d is not trying to become a model zoo. It is a compact compatibility layer that helps each new 3D / VLA / world-model repo avoid rediscovering the same ROCm migration rules.