Reflex
Take a robot policy off the training cluster, onto a robot.
Reflex is the deployment layer for vision-language-action models. Take any pi0, pi0.5, SmolVLA, or GR00T checkpoint and run it on a Jetson Orin or a desktop GPU in one command, verified at machine-precision parity with PyTorch. Or just tell the chat agent to do it for you.
How it works
From a HuggingFace model to a robot, in four steps.
reflex go --model <hf_id> runs all four. Each step writes a verifiable artifact and refuses to ship if its check fails — bad exports never reach a robot.
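The gate pattern is simple enough to sketch. A minimal illustration of the idea in Python, with hypothetical helper names (`run`, `check`, and `run_gated` are stand-ins, not Reflex's internals):

```python
from typing import Callable

# Each step is (name, run, check): run() writes an artifact,
# check() verifies it before the pipeline is allowed to continue.
Step = tuple[str, Callable[[], str], Callable[[str], bool]]

def run_gated(steps: list[Step]) -> None:
    for name, run, check in steps:
        artifact = run()          # e.g. a pulled checkpoint or an ONNX export
        if not check(artifact):   # e.g. a checksum or parity verification
            raise SystemExit(f"{name}: check failed, refusing to ship {artifact}")
```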
What it looks like
Talk to your robot fleet in plain English.
reflex chat wraps the entire CLI surface in a natural-language agent. 100 calls/day free, no signup, no API key.
$ reflex chat
connected: chat.fastcrest.com (model=gpt-5-mini)
you › deploy SmolVLA to my desktop GPU and start serving
→ list_targets({})
→ pull_model({"model": "smolvla-base"}) ↓ 900 MB from HuggingFace
→ export_model({"model": "smolvla-base", "target": "desktop"})
→ serve_model({"export_dir": "./reflex_export"})
SmolVLA is exported and serving at http://localhost:8000.
Latency: ~12 ms/call on your GPU. Try:
curl -X POST http://localhost:8000/act ...
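If you'd rather call the endpoint from Python than curl, a hedged equivalent looks like this. The payload shape (the `observation` and `instruction` keys) is an assumption for illustration, not the documented /act schema:

```python
import requests

# POST an observation to the serving endpoint and read back an action.
# The JSON fields below are illustrative guesses, not the real schema.
resp = requests.post(
    "http://localhost:8000/act",
    json={
        "observation": {"image": "<base64-encoded camera frame>"},
        "instruction": "pick up the red block",
    },
    timeout=5,
)
resp.raise_for_status()
print(resp.json())  # the policy's predicted action (or action chunk)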
Composable wedges
Every flag is opt-in. Compose only what you need.
14 runtime wedges layer on top of reflex serve — safety, observability, optimization, transport. Every response surfaces telemetry from the wedges you've enabled.
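As a rough mental model, a wedge behaves like opt-in middleware on the serving app. A minimal sketch in plain FastAPI, assuming a latency-telemetry wedge (the wedge name and response header are invented for illustration, not Reflex's interface):

```python
import time
from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def latency_wedge(request: Request, call_next):
    # Time the request and surface the measurement on the response,
    # the way an enabled wedge surfaces its telemetry.
    start = time.perf_counter()
    response = await call_next(request)
    response.headers["x-wedge-latency-ms"] = f"{(time.perf_counter() - start) * 1e3:.2f}"
    return response
```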
Multi-embodiment
pi0, pi0.5, SmolVLA, GR00T: all four major open VLA families. ONNX export verified at cosine similarity +1.000000 against PyTorch.
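For intuition, this is the shape of such a parity check: run the same input through the PyTorch module and its ONNX export, then compare outputs by cosine similarity. A minimal sketch with generic torch/onnxruntime APIs, assuming a single input and output tensor (this is not Reflex's verification code):

```python
import numpy as np
import onnxruntime as ort
import torch

def cosine_parity(model: torch.nn.Module, onnx_path: str, example: torch.Tensor) -> float:
    # Reference output from the original PyTorch module.
    model.eval()
    with torch.no_grad():
        ref = model(example).flatten().numpy()
    # Same input through the ONNX export.
    sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    name = sess.get_inputs()[0].name
    out = sess.run(None, {name: example.numpy()})[0].flatten()
    # Cosine similarity; machine-precision parity should print +1.000000.
    return float(np.dot(ref, out) / (np.linalg.norm(ref) * np.linalg.norm(out)))
```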
Edge-first
Jetson Orin Nano (8 GB) → AGX Orin → Thor → desktop NVIDIA GPUs. A hardware probe picks the right model variant; export targets the right precision.
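A minimal sketch of what such a probe can look like, using standard torch CUDA queries (the 16 GB threshold and the `pick_precision` name are assumptions, not Reflex's actual heuristic):

```python
import torch

def pick_precision() -> str:
    # CPU-only hosts fall back to the full-precision CPU path.
    if not torch.cuda.is_available():
        return "fp32"
    props = torch.cuda.get_device_properties(0)
    # Small-memory Jetson-class boards get fp16; larger GPUs can afford fp32.
    return "fp16" if props.total_memory < 16 * 1024**3 else "fp32"
```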
SnapFlow distillation
First open-source SnapFlow reproduction. Distill any pi0 / pi0.5 to a 1-step student that beats its 10-step teacher (64% vs 56% on libero_object).
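The general recipe behind this kind of distillation fits in a few lines: roll the 10-step teacher out from noise to an action, then train a student to jump to that endpoint in one step. This is only the generic one-step flow-distillation idea, not SnapFlow's actual objective; `teacher_velocity` and `student` are stand-in callables:

```python
import torch

def teacher_rollout(teacher_velocity, noise, steps: int = 10):
    # Euler-integrate the teacher's learned velocity field from noise to action.
    x, dt = noise, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * teacher_velocity(x, t)
    return x

def distill_step(student, teacher_velocity, noise, opt):
    with torch.no_grad():
        target = teacher_rollout(teacher_velocity, noise)  # 10-step teacher action
    pred = student(noise)                                  # 1-step student action
    loss = torch.nn.functional.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```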
Production runtime
CUDA graphs, cost-weighted batching, A2C2 correction, record-replay traces, real-time chunking — composable wedges on a single FastAPI server.
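Of those, CUDA graphs are the most mechanical to picture: capture one fixed-shape forward pass, then replay it with near-zero kernel-launch overhead. A minimal sketch using the standard torch.cuda.CUDAGraph API, assuming static input shapes (not Reflex's runtime code):

```python
import torch

def capture_policy(policy, example: torch.Tensor):
    static_in = example.clone()
    # Warm up on a side stream before capture, as the CUDA graphs docs advise.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            policy(static_in)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one forward pass into a replayable graph.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = policy(static_in)

    def run(x: torch.Tensor) -> torch.Tensor:
        static_in.copy_(x)  # replay reuses captured buffers; only data changes
        g.replay()
        return static_out

    return run
```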
Numbers
Verifiable claims, not vibes.
All three headline numbers are reproducible with one command on Modal. See the parity ledger and changelog for full provenance.
vs other tools
Where Reflex fits.
Reflex is deliberately narrow. Here's the honest read.
| | Reflex | Triton | HF Endpoints | Raw ONNX |
|---|---|---|---|---|
| Edge GPU deployment | design center | cloud-first | cloud-only | DIY |
| VLA-specific export (pi0 / pi0.5 / GR00T) | built-in, validated | no | no | manual, error-prone |
| Verified machine-precision parity | automatic | DIY | DIY | DIY |
| Decomposed pi0.5 (9× speedup) | one flag | no | no | ~weeks of work |
| Setup time | 30 seconds | days | minutes | 1–3 weeks |
| Multi-tenant cloud serving at scale | not the design | battle-tested | managed | DIY |
Honest details: vs other tools →
Common questions
FAQ
Why not just use Triton?
Run reflex export and drop the result into Triton if you want both. Full comparison →
Does it work without a GPU?
Yes: pip install 'reflex-vla[serve,onnx]'. The CPU path runs all four supported VLAs at machine-precision parity with PyTorch; SmolVLA is the only one fast enough for real-time control on CPU. And reflex chat needs no GPU at all once the package is installed.
What does BSL 1.1 mean for me?
Does it work on RTX 5090 / Blackwell?
Not yet: reflex go segfaults at startup on Blackwell. Workaround: reflex chat (no GPU needed), reflex doctor, and reflex models list all work; /act needs a non-Blackwell GPU for now (Modal A10G/A100 or an RTX 4090). We're tracking the fix upstream in ONNX Runtime (ORT).