RoboSmith: A Synthetic-Data Pipeline for Embodied Interactive Manipulation (ROCm-native)

By David Li and Andy Luo

The bottleneck for embodied AI is not pixels but interaction: contact-rich, multi-step, articulated behavior data, which is exactly what visual-quality-first synthetic pipelines struggle to cover. RoboSmith stands on the shoulders of the open-source community (Genesis-World for simulation, LeRobot for the dataset format, Articraft for articulated assets, PyRoki for kinematics, and more) and wires these pieces together on AMD hardware. On a single MI300, it turns one real object into long-horizon, interactive manipulation data. The pipeline is ROCm-native and open-source apart from one isolated grasping dependency (see §4).

1. Why interaction data

Physical AI (agents that can interact with the physical world) has a very concrete bottleneck: data. Synthetic data spent a long time focused on visual quality: ray-traced rendering, 3D asset generation, static scene reconstruction, 3D Gaussian Splatting, and so on. But as vision-language-action (VLA) models and Physical-AI foundation / world models mature, the truly scarce ingredient has become data about interacting with the physical environment, capturing physical properties such as contact and motion.

There are already many routes for collecting such behavior data: teleoperation, UMI, ego-centric capture, simulation synthesis, and more. RoboSmith takes the synthetic data generation (SDG) route, aiming to produce interaction-rich, long-horizon behavior data within the ROCm ecosystem:

Simple grasping (pick up a block) is cheap to collect but low-information — single-step, rigid-body only, the robot barely touches its environment.
Long-horizon interaction (open a drawer → put something inside → close it) is the scarce, genuinely valuable tier — multi-step, articulated, collision avoidance required. Collecting this on real robots is slow, expensive, and unsafe, and you can’t scrape “1,000 demos of opening a drawer and placing an object” off the internet.

The upside of SDG is direct: demos are manufactured in simulation, where physics, randomization, and labels are fully controllable and the labels come at no extra annotation cost. The additional gap we want to fill lies within the ROCm ecosystem: today’s mature SDG / Eval toolchains are mostly built around specific hardware and rendering stacks, and open, cross-hardware options are still few. We want to run this whole chain, end to end, with open-source components on AMD ROCm.

In one sentence: take a real or generated object, make it simulatable, declare a long-horizon articulated task in a few lines of code, and produce interactive training data — all on a single AMD Instinct MI300.

🎬 First look. Two real outputs: same engine, differing by roughly a single task declaration, both requiring collision-free planning.

Left: drawer · long-horizon articulated open→pick→place→close (one episode = one video). Right: two-tier supporter · side-insertion with avoidance (the die is taken from the upper shelf and inserted horizontally into the lower shelf; a vertical top-down descent would be blocked by the upper board).

2. What RoboSmith is

In one line: RoboSmith is embodied data infrastructure (Data Infra), not a VLA training framework. Given “assets + task definition”, it produces verified LeRobot datasets. It is the SDG pillar inside a larger AMD/ROCm-native robotics platform: it consumes rocRecon’s assets, relies on rocRobo for motion, and is designed to feed downstream evaluation and sim2real.

2.1 Framework overview

flowchart TB
  T[Text]; I[Image]; RO[Real Object]
  subgraph L1["① Asset layer"]
    OBJ["Objaverse built-in
(current default assets)"]
    REC["rocRecon
real2sim reconstruction"]
    ART["Articraft
image + text → articulated assets"]
    ASSET["mesh · URDF · joints · metadata · affordance"]
    OBJ --> ASSET
    REC --> ASSET
    ART --> ASSET
  end
  RO --> REC
  I --> REC & ART
  T --> REC & ART
  subgraph L2["② Task authoring · Code-as-Policy"]
    CAP["actions: open() · pick() · place() · close()
success predicates: object_in_container() · joint_closed() …"]
  end
  ASSET --> CAP
  subgraph L3["③ Execution engine · RoboSmith"]
    G["GraspGen
learned grasping"]
    GEN["Genesis
physics · rendering · sensors"]
    RR["rocRobo
collision-free IK · TrajOpt · segment routing"]
  end
  CAP --> G & GEN & RR
  subgraph L4["④ Rollout & verification"]
    RV["expert rollout · predicate checking · episode recording"]
  end
  G & GEN & RR --> RV
  subgraph L5["⑤ LeRobot dataset export"]
    EX["observation · action · video · success · metadata"]
  end
  RV --> EX
  EX --> FOOT["end-to-end on AMD ROCm Runtime · Genesis / rocRobo / GraspGen"]

Asset sources are stackable: Objaverse built-in library (current default), rocRecon real reconstruction (text / image / real object), and Articraft-generated articulated assets (image + text).

The base components all run on ROCm: Genesis (physics + rendering + sensors), rocRobo (open-source, PyRoki-based JAX kinematics: collision-free IK, avoidance trajectory optimization, and a JSON serve), and LeRobot (a VLA-consumable dataset format).

2.2 Capability Matrix A — the three landed pillars

Calling RoboSmith “a data-generation tool” undersells the design intent: it is one of three already-working pillars in a larger AMD/ROCm platform. Here we list only landed capabilities (the longer-horizon pillars are in the §4 roadmap):

| Capability pillar | Scope | Engine | Status |
| --- | --- | --- | --- |
| Motion / dynamics | collision-free IK + avoidance trajopt + segment routing | rocRobo | ✅ landed |
| SDG synthetic data | assets → scene → expert trajectories → LeRobot export | RoboSmith | ✅ landed |
| real2sim | real/generated → simulatable asset (asset-level) | rocRecon | ✅ landed |

The pipeline in this post strings these three pillars — real2sim, SDG, motion — into a single end-to-end flow; the first-look clips above highlight the SDG + motion stages.

2.3 Capability Matrix B — asset × action × predicate (coverage view)

As a synthetic-data generator, RoboSmith essentially works as follows: given an asset’s capabilities, it automatically enumerates the (task, action, predicate) combinations that can be generated. The three form an affordance-binding chain rather than independent dimensions:

flowchart LR
  AF["asset affordance
(what manipulable structure it offers)"] -->|gate unlock| ACT["semantic action / skill
(human/VLA-readable verb)"]
  ACT -->|verify pairing| PRED["success predicate
(world-state fluent)"]

Current coverage (registry is authoritative), two classes of manipulable assets:

| Asset class | Affordance (unlock condition) | Semantic action → underlying primitive | Success predicate |
| --- | --- | --- | --- |
| rigid | graspable surface (mesh/bbox + upright + metric_scale) | `pick` / `place` → learned grasp (GraspGen) | `object_in_container` / `object_above` / `stacked` / `objects_aligned` |
| articulated | `task_joints` + `handles` (thin semantic annotation) | `open` / `close` → `drag_handle` (prismatic straight line, coaxial reversal) | `joint_opened` / `joint_closed` |

(Support bodies like table/plane only bear load, have no affordance, and are not listed; deformable / cable assets are on the “asset physics” axis of the §4 roadmap.)

Extension principle: adding something touches only one segment of the chain. A new rigid body needs mesh + upright, and pick/place are reused directly; a new articulated asset needs task_joints/handles, and open/close are reused directly; a new action (e.g. opening a revolute door) reuses the semantic verb and adds one branch in the underlying primitive (today drag_handle only implements prismatic straight lines; revolute arcs are pending). This principle is also how the §4 roadmap unfolds.
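To make the affordance-binding chain concrete, here is a minimal sketch of how a registry could gate actions and predicates on asset metadata. The `REGISTRY` layout, the `unlocked` helper, the example metadata values, and the `drawer_handle` name are assumptions for illustration, not RoboSmith's actual registry; only the affordance keys, verbs, and predicate names come from the coverage table above. The "mesh/bbox" alternative is simplified to "mesh".

```python
# Hypothetical registry: asset class -> required metadata, unlocked verbs, paired predicates.
REGISTRY = {
    "rigid": {
        "requires": ("mesh", "upright", "metric_scale"),
        "actions": ("pick", "place"),
        "predicates": ("object_in_container", "object_above", "stacked", "objects_aligned"),
    },
    "articulated": {
        "requires": ("task_joints", "handles"),
        "actions": ("open", "close"),
        "predicates": ("joint_opened", "joint_closed"),
    },
}


def unlocked(metadata):
    """Return, per asset class, the actions and predicates this metadata unlocks."""
    result = {}
    for asset_class, entry in REGISTRY.items():
        # Gate: every required affordance key must be present in the metadata.
        if all(key in metadata for key in entry["requires"]):
            result[asset_class] = {
                "actions": entry["actions"],
                "predicates": entry["predicates"],
            }
    return result


die = {"mesh": "die.obj", "upright": (0, 0, 1), "metric_scale": 1.0}
drawer = {"task_joints": ["drawer_slide"], "handles": ["drawer_handle"]}
table = {"mesh": "table.obj"}  # support body: incomplete affordance, unlocks nothing

for name, meta in [("die", die), ("drawer", drawer), ("table", table)]:
    print(name, "->", sorted(unlocked(meta)))
```

In this layout, adding a new rigid asset only adds metadata, and adding a new asset class only adds one registry entry, which is the "touches only one segment" property described above.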

3. How it works

No single component is novel on its own; the real value is that they compose end-to-end on ROCm. Below we walk the flow first, then take the components apart one by one.

3.1 The end-to-end pipeline

Step 1 — real2sim (rocRecon). A real or generated object starts as just pixels or a rough mesh. rocRecon turns it into a simulatable asset: reconstructed with modern 3D generation/reconstruction models (TRELLIS, Hunyuan3D, or multi-view / 3DGS routes), producing a sim-ready package — watertight/convex-decomposed collision proxies, URDF, PBR materials — plus metadata (scale, canonical pose, physical properties). “sim-ready” is still not “interaction-ready”.

Step 2 — admission and scene parsing (RoboSmith). The asset is admitted into a scene with task affordances: where it sits, what supports it, which robot it shares space with. A static mesh goes from “decoration” to “participant in the task”.

Step 3 — declare the task (Code-as-Policy). The author writes a small declarative spec — scene, success conditions, and an ordered sequence of intents. No waypoints, no gripper timing, no grasp poses (see §3.2).

Step 4 — resolve grasps + segment-level motion. Each pick lowers into a learned 6-DoF grasp candidate (GraspGen), filtered by robot/table feasibility; each motion segment is routed by rocRobo at the segment level (free-space avoidance / contact-controlled, see §3.3); articulated steps directly drive the object’s joints. One key fix made “grasping objects on a shelf” possible: the grasp’s “no-go plane below” was changed from the global table to the object’s own bottom face (support_z), otherwise shelved objects would be grasped into empty space.

Step 5 — rollout → data. RoboSmith executes the scripted expert inside Genesis, records episodes, judges success by the declared criteria, and exports an interactive LeRobot dataset — observations, actions, success/failure metadata, video.
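The `support_z` fix in Step 4 can be illustrated with a small sketch. It assumes each grasp candidate exposes the lowest height its fingertips reach; the `GraspCandidate` record, the `filter_by_floor` helper, the `clearance` parameter, and all numbers are hypothetical, and the source does not show how RoboSmith actually represents or filters grasps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraspCandidate:
    """Hypothetical minimal grasp record."""
    name: str
    fingertip_z: float  # lowest fingertip height in the world frame, in meters


def filter_by_floor(candidates, floor_z, clearance=0.005):
    """Keep grasps whose fingertips stay at least `clearance` above the no-go plane."""
    return [g for g in candidates if g.fingertip_z >= floor_z + clearance]


TABLE_Z = 0.0     # global table plane (the old no-go plane)
SUPPORT_Z = 0.40  # bottom face of a die resting on a shelf board (the new no-go plane)

candidates = [
    GraspCandidate("top_down", fingertip_z=0.41),
    GraspCandidate("from_below", fingertip_z=0.35),  # would reach through the shelf board
]

old = filter_by_floor(candidates, TABLE_Z)    # global table: both candidates pass
new = filter_by_floor(candidates, SUPPORT_Z)  # object's own bottom face: only top_down passes

print([g.name for g in old])
print([g.name for g in new])
```

With the global table as the plane, a shelved object sits far above the threshold, so the filter rejects nothing; anchoring the plane to the object's own bottom face restores the check.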

The output isn’t a video, it’s a reusable recipe: change the object in Step 1, or one line of the task in Step 3, and the same pipeline produces a new dataset — that’s the difference between “getting a demo to run” and “producing data continuously”.

3.2 Component ①: Code-as-Policy

RoboSmith’s most important design decision is the author interface, because the interface decides whether data generation can scale. Code-as-Policy (CAP) decouples task intent from execution: the author states “what to do”, the engine handles “how”. Here is the declarative core of the long-horizon drawer task:

# scenarios/pick_place_into_drawer.py
scenario_intents = intent_sequence([
    open_("drawer"),
    pick("die"),
    place("drawer_open_slot", place_z=0.25),
    close_("drawer"),
])

scenario_task = task(
    "pick_place_into_drawer",
    scene=scenario_scene,
    instruction="Open the drawer, put the die inside, then close the drawer.",
    success=all_of(
        object_in_container("die", "drawer", xy_threshold=0.20, z_margin=0.0),
        joint_closed("drawer", "drawer_slide", closed_position=0.02),
    ),
)

That is the entire “policy”. Two points worth expanding:

Meta-actions: a small set of composable verbs. CAP exposes four task-level meta-actions, each lowering into an existing expert primitive that the author never sees: pick/place are rigid-body verbs (they go through learned grasp planning), open_/close_ are articulated verbs (they go through a joint-driven primitive, not grasp planning). A long-horizon task is just an ordered composition of these verbs.

Success = a typed predicate tree. A task doesn’t end when the script finishes; it completes only when the world state satisfies the conditions. Leaves are world fluents (object_in_container / joint_closed / object_lifted…), composed with all_of / any_of / not; the validator catches typos before a single simulation frame runs. The drawer task’s success is the conjunction of “the die is inside the drawer” and “the drawer is closed” — the run where the die rolls out, or the drawer springs open, is correctly labeled a failure, which is exactly the signal long-horizon data needs.
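As a rough illustration of the typed predicate tree, the sketch below builds a tree from leaves and combinators, validates leaf names before anything runs, and evaluates the tree against a toy world state. The classes (`Leaf`, `AllOf`, `AnyOf`, `Not`), the `FLUENTS` table, and the dictionary-based world are assumptions made for this example; RoboSmith's real types and its world-state interface are not shown in this post. The 0.02 threshold mirrors `closed_position=0.02` from the drawer task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    fluent: str
    args: tuple = ()


@dataclass(frozen=True)
class AllOf:
    children: tuple


@dataclass(frozen=True)
class AnyOf:
    children: tuple


@dataclass(frozen=True)
class Not:
    child: object


# Hypothetical fluent implementations over a toy world-state dictionary.
FLUENTS = {
    "object_in_container": lambda w, obj, box: (obj, box) in w["inside"],
    "joint_closed": lambda w, asset, joint: w["joint_pos"][(asset, joint)] <= 0.02,
}


def validate(node):
    """Reject unknown predicate names before any simulation step runs."""
    if isinstance(node, Leaf):
        if node.fluent not in FLUENTS:
            raise ValueError(f"unknown predicate: {node.fluent!r}")
    elif isinstance(node, (AllOf, AnyOf)):
        for child in node.children:
            validate(child)
    elif isinstance(node, Not):
        validate(node.child)
    else:
        raise TypeError(f"not a predicate node: {node!r}")


def evaluate(node, world):
    """Evaluate the tree against the current world state."""
    if isinstance(node, Leaf):
        return bool(FLUENTS[node.fluent](world, *node.args))
    if isinstance(node, AllOf):
        return all(evaluate(c, world) for c in node.children)
    if isinstance(node, AnyOf):
        return any(evaluate(c, world) for c in node.children)
    if isinstance(node, Not):
        return not evaluate(node.child, world)
    raise TypeError(f"not a predicate node: {node!r}")


success = AllOf((
    Leaf("object_in_container", ("die", "drawer")),
    Leaf("joint_closed", ("drawer", "drawer_slide")),
))
validate(success)

good_world = {"inside": {("die", "drawer")}, "joint_pos": {("drawer", "drawer_slide"): 0.01}}
rolled_out = {"inside": set(), "joint_pos": {("drawer", "drawer_slide"): 0.01}}

print(evaluate(success, good_world))  # die inside and drawer closed
print(evaluate(success, rolled_out))  # drawer closed, but the die rolled out
```

Because success is judged on the final world state rather than on script completion, the second world is labeled a failure even though every scripted step ran.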

3.3 Component ②: rocRobo (motion)

The clearest illustration of “why collision-free planning is mandatory” is the two-tier supporter side-insertion: the die is taken from the upper shelf and inserted into the lower shelf. The two tiers are stacked with a small gap, so the arm cannot descend vertically into the lower tier and must reach in horizontally from the open side; and a naive Cartesian straight line flips the wrist near a singularity and flings the object away — so the insertion segment must switch to a collision-aware planned motion, with the pre-insertion pose realigned along the insertion axis (approach axis). (The drawer open→pick→place→close, meanwhile, tests stateful articulation + step-ordering dependency; the two first-look videos are in §1.)

rocRobo’s value isn’t “it can plan a trajectory”, it’s divide-and-conquer: whether a motion should avoid obstacles or hug a contact surface depends on whether it’s a contact segment. RoboSmith routes by segment:

| Segment type | Example | Motion mode | Avoidance |
| --- | --- | --- | --- |
| Free space | approach, transport | `plan_motion` (rocRobo collision-free planning) | ✅ |
| Contact-controlled | drag handle, descend/insert | `cartesian` (axis-controlled straight line) | ❌ (contact, correct) |
| Grasp/release | grasp, release | finger-only (gripper in place) | n/a |
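The routing table above maps naturally onto a lookup from segment kind to motion mode. The following is a minimal sketch under that assumption: the `MotionMode` enum, the `SEGMENT_MODES` dictionary, and the `route` function are illustrative names, not rocRobo's or RoboSmith's API. The side-insertion case, where the insertion segment switches to planned motion, is not modeled here.

```python
from enum import Enum


class MotionMode(Enum):
    PLAN_MOTION = "plan_motion"  # collision-free planning for free-space segments
    CARTESIAN = "cartesian"      # axis-controlled straight line for contact segments
    FINGER_ONLY = "finger_only"  # gripper opens or closes in place


# Segment kinds follow the table; the dictionary layout is an assumption.
SEGMENT_MODES = {
    "approach": MotionMode.PLAN_MOTION,
    "transport": MotionMode.PLAN_MOTION,
    "drag_handle": MotionMode.CARTESIAN,
    "descend": MotionMode.CARTESIAN,
    "insert": MotionMode.CARTESIAN,
    "grasp": MotionMode.FINGER_ONLY,
    "release": MotionMode.FINGER_ONLY,
}


def route(segments):
    """Pair each segment with its motion mode and whether obstacle avoidance is active."""
    plan = []
    for kind in segments:
        if kind not in SEGMENT_MODES:
            raise KeyError(f"no routing rule for segment kind {kind!r}")
        mode = SEGMENT_MODES[kind]
        plan.append((kind, mode, mode is MotionMode.PLAN_MOTION))
    return plan


routed = route(["approach", "grasp", "transport", "descend", "release"])
for kind, mode, avoids in routed:
    print(f"{kind:10s} -> {mode.value:12s} avoidance={avoids}")
```

Keeping the rule in one table means that a contact segment never accidentally gets avoidance turned on, which would otherwise push the gripper away from the surface it is supposed to touch.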

place itself is split into four phases — transport (free) → insert/descend (contact or side-insert) → release (open fingers, settle) → extract (retreat) — where the insertion pose is computed on the fly along the approach axis (top-down is fully equivalent to the old vertical pose; side-insert automatically becomes horizontal), and the carried object is folded into collision via an attached collision object (ACO). Long-horizon tasks also re-anchor between segments using the live pose of objects/joints, reducing accumulated error.
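The claim that one computation covers both top-down and side insertion can be checked with a position-only sketch: back off from the target along the approach axis by a standoff distance. The function name `pre_insertion_position`, the `standoff` parameter, and the coordinates are assumptions for illustration; orientation handling and the attached collision object are omitted.

```python
import math


def pre_insertion_position(target, approach_axis, standoff):
    """Back off from `target` by `standoff` meters against the approach direction."""
    norm = math.sqrt(sum(c * c for c in approach_axis))
    if norm == 0.0:
        raise ValueError("approach axis must be non-zero")
    unit = tuple(c / norm for c in approach_axis)
    return tuple(t - standoff * u for t, u in zip(target, unit))


target = (0.5, 0.0, 0.2)

# Top-down: approaching along -z puts the pre-insertion pose directly above the target.
top_down = pre_insertion_position(target, (0.0, 0.0, -1.0), standoff=0.1)

# Side insert: approaching along +x puts the pre-insertion pose at the same height,
# offset horizontally toward the open side of the shelf.
side = pre_insertion_position(target, (1.0, 0.0, 0.0), standoff=0.1)

print(top_down)
print(side)
```

With the axis pointing straight down, the result is the familiar vertical pre-place pose; changing only the axis yields the horizontal variant without a separate code path.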

🎬 rocRobo motion demo — collision-free planning on free segments, axis-controlled straight lines on contact segments.

3.4 Component ③: Genesis-World & asset production

The reason the declarative and motion layers can run end-to-end rests on two often-overlooked foundations — also entirely on ROCm:

Genesis-World · simulation engine: rigid / MPM / SPH / FEM / PBD physics + rasterized/ray-traced rendering + sensors + massively parallel environments. Rollouts execute here and are recorded into episodes with video.
Asset production · Objaverse / rocRecon / Articraft: where the pipeline’s input comes from — the current default directly uses Objaverse built-in assets; rocRecon turns text / image / a real object into a sim-ready URDF via 3D generation/reconstruction (real2sim); the agentic Articraft uses image + text with LLM code generation to produce articulated 3D assets. All three sources yield metadata-bearing simulatable assets, feeding Step 1 of §3.1.

3.5 Making it run on AMD

The whole stack runs on ROCm on CDNA3 (gfx942, MI300/MI325) — GraspGen’s sparse-convolution backbone, Genesis simulation, rocRobo planning — all built from public upstream source into self-contained images (no vendoring, no local paths). Here we discuss just one reality that truly shaped the deployment.

Two runtimes, one host: the sim × motion bridge. In our configuration, JAX and PyTorch do not coexist cleanly in a single process under their respective ROCm runtimes (they contend over the HIP runtime and device allocator) — yet rocRobo’s motion planning is JAX and RoboSmith’s simulation and grasping are PyTorch. The pragmatic solution is an architectural seam: the two run as two separate containers on the same host, and the notebook drives the motion planner across the boundary via docker exec.

flowchart LR
  subgraph HOST[Same AMD MI300 host]
    subgraph C1[Container A · RoboSmith]
      SIM[Genesis sim + GraspGen
PyTorch / ROCm]
    end
    subgraph C2[Container B · rocRobo]
      MOT[collision-free motion planning
JAX / ROCm serve]
    end
  end
  SIM --> |request: solve_ik / plan_motion| MOT
  MOT --> |response: trajectory JSON| SIM

This sidecar is a stopgap; the platform roadmap is to converge it into a first-class API (plan → execute → verify). It has two main costs today. First, each cold start of the rocRobo serve (JAX/XLA recompile + warmup) is currently the largest wall-clock cost of an end-to-end run. Second, the docker exec + JSONL control plane is fragile: dropping one in-flight response degrades to a fallback. The next step, “persistent serve + stable IPC”, addresses both at once.
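For readers unfamiliar with this kind of bridge, here is a minimal sketch of a JSONL request/response loop over a child process, with a straight-line fallback when a response is missing or malformed. Everything in it is an assumption for illustration: the `PlannerClient` class, the message fields (`id`, `method`, `params`, `trajectory`), and the stand-in server are invented, and the real rocRobo serve protocol and docker exec command line are not shown in this post. A Python child process stands in for the planner container, and timeouts are not handled.

```python
import json
import subprocess
import sys

# Stand-in planner: answers plan_motion requests and emits an empty line otherwise,
# which simulates a lost or garbled response.
FAKE_SERVER = """
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    if req["method"] != "plan_motion":
        print("", flush=True)
        continue
    p = req["params"]
    print(json.dumps({"id": req["id"], "trajectory": [p["start"], p["goal"]]}), flush=True)
"""


class PlannerClient:
    """One long-lived child process, one JSON line per request and per response."""

    def __init__(self, command):
        self.proc = subprocess.Popen(
            command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
        )
        self.next_id = 0

    def call(self, method, **params):
        self.next_id += 1
        request = {"id": self.next_id, "method": method, "params": params}
        self.proc.stdin.write(json.dumps(request) + "\n")
        self.proc.stdin.flush()
        line = self.proc.stdout.readline()
        try:
            response = json.loads(line)
        except json.JSONDecodeError:
            return None  # missing or malformed response
        if response.get("id") != self.next_id:
            return None  # response does not match this request
        return response

    def close(self):
        self.proc.stdin.close()
        self.proc.stdout.close()
        self.proc.wait()


def plan_or_fallback(client, start, goal, method="plan_motion"):
    """Use the planner's trajectory if available, else a two-point straight line."""
    response = client.call(method, start=start, goal=goal)
    if response is None:
        return {"source": "fallback", "trajectory": [start, goal]}
    return {"source": "planner", "trajectory": response["trajectory"]}


client = PlannerClient([sys.executable, "-c", FAKE_SERVER])
good = plan_or_fallback(client, [0.0], [1.0])
bad = plan_or_fallback(client, [0.0], [1.0], method="not_a_method")
client.close()

print(good["source"], bad["source"])
```

Recording the `source` field alongside each segment would make fallback episodes easy to filter out later, and reusing a single child process across calls avoids paying the startup cost per request.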

4. Roadmap and limitations

The landed slice is real, but the interesting frontier is what it sets up. The platform roadmap is four vertical capability axes + one horizontal platform axis:

Morphology & embodiment: Franka single-arm → dual-arm → humanoid (including mobile manipulation).
Asset physics: rigid → articulated (where we are now) → deformable (garments) → cable.
Long-horizon orchestration: grow CAP from “intent sequence + terminal-state success” into multi-skill composition, explicit constraints and safety specs, and multi-arm concurrency.
Closed-loop evaluation: the highest-priority pillar yet to build — benchmark suite, regression matrix, perturbation taxonomy, sim↔real correlation.
Horizontal foundation: converge the sim×motion bridge and the rollout / plan→execute / render→obs pipelines into stable APIs.

Further out are longer-horizon pillars, all built on an open-source, ROCm-first foundation: scene-level real2sim, RL baselines using Genesis’s thousand-scale parallelism, sim2real (domain randomization + co-training + a real-robot recapture loop), and policy / world models (VLA policies, world models generating data).

Current limitations:

Every pick depends on GraspGen (NVIDIA research/eval license, non-commercial), which is pulled from official upstream at build time and baked into a local image, not redistributed. Everything else in the stack (RoboSmith, rocRecon, rocRobo, spconv_rocm, Genesis, LeRobot export) is open-source and runs on ROCm. Swapping in an open-source / commercially usable grasper is the next step toward a fully unrestricted pipeline; architecturally the grasp model is already isolated behind a single wrapper, so replacement should be clean.
pick is currently collision-blind: pick on an empty table is stable, but once there’s a static obstacle beside the object (partition/cabinet/container), the free segment cuts straight through — making the free segment of pick also avoid obstacles is the highest-priority hole to fill on the roadmap.
Side-insertion under real configurations is not yet fully reliable: it succeeds cleanly in isolated scenes, but during a real upper-shelf pick → lower-shelf place, the insertion segment can still fail to plan and degrade to a straight-line fallback (succeeding by luck) — this step is not yet fully closed.

5. Summary

The real takeaway is not “we opened a drawer in simulation”, it’s that the entire chain runs end-to-end on a single AMD GPU: one real object → real2sim → declarative task → learned grasping → collision-free motion → a labeled, interactive dataset, with the sole non-open-source dependency cleanly isolated.

Back to the bottleneck we opened with — embodied AI is short on interaction data — three things worth remembering:

The interface decides scalability. Code-as-Policy decouples “intent” from “execution”: the author declares what to do, and the engine resolves grasping/motion/articulation and judges success against a typed world-state predicate tree. Changing one line of declaration is a new dataset — that’s the key to turning “getting a demo to run” into “producing continuously”.
General grasping is a building block for long-horizon tasks, not a separate product line. Every pick inside a long-horizon articulated task (drawer / supporter) is the same learned general grasp — growing assets and growing task complexity both happen on the same engine, same authoring layer, same export path.
An open-source ROCm stack for Physical AI is viable. Physics simulation, transformer-based learned grasping, and collision-free motion all run on CDNA3 — provided you respect the runtime boundaries (torch/jax process isolation, segment routing that separates contact from free space). Evaluation, sim2real, and world models are the next steps on the roadmap, not a wish list.

References

RoboSmith (SDG engine) — https://github.com/ZJLi2013/RoboSmith
rocRecon (real2sim) — https://github.com/ZJLi2013/RocRecon
rocRobo (motion) — https://github.com/ZJLi2013/rocRobo
spconv_rocm — https://github.com/ZJLi2013/spconv_rocm
Genesis-World (simulation) — https://github.com/Genesis-Embodied-AI/genesis-world
LeRobot (dataset format) — https://github.com/huggingface/lerobot
PyRoki (kinematics) — https://github.com/chungmin99/pyroki
Articraft (articulated assets) — https://github.com/mattzh72/articraft
Objaverse (asset library) — https://github.com/allenai/objaverse-xl
GraspGen (NVlabs · research/eval license) — https://github.com/NVlabs/GraspGen