Roadmap to v1.0.0
Narsil's v1.0.0 goal: a stable, production-grade Rust serving engine with a frozen backend trait and request schema, multiple transports, and both compute families (Burn-native and native-torch) hardened on a real recommender workload.
This roadmap is thematic, not dated. Items are ordered by dependency, not calendar.
Where we are — v0.2.x
- ✅ tonic gRPC serving shell with three execution modes (inline / worker pool / batch collector).
- ✅ Continuous batching that coalesces concurrent requests into one fused forward.
- ✅ Burn CUDA DLRM backend (FP32/FP16) + opt-in cuTile fused interaction.
- ✅ Route A: torch-dlrm backend serving TorchRec artifacts via tch-rs (libtorch 2.11 + FBGEMM 1.6.0); matches the C++ baseline's latency, exceeds its throughput.
- ✅ DLRM compare harness + documented benchmarks.
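
As a rough illustration of the batch-collector mode, the sketch below coalesces queued requests into one batch, flushing when either a size cap or a wait budget is hit. It uses only the standard library; `collect_batch`, `max_batch`, and `max_wait` are hypothetical names chosen for this example, and Narsil's actual collector may be structured quite differently.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Collect up to `max_batch` items, waiting at most `max_wait` after the
/// first item arrives. Returns `None` once the channel is closed and empty.
/// (Hypothetical helper for illustration; not Narsil's API.)
pub fn collect_batch<T>(
    rx: &Receiver<T>,
    max_batch: usize,
    max_wait: Duration,
) -> Option<Vec<T>> {
    // Block for the first request so an idle collector does not spin.
    let first = rx.recv().ok()?;
    let mut batch = vec![first];
    let deadline = Instant::now() + max_wait;

    while batch.len() < max_batch {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            // Wait budget spent, or all senders gone: flush what we have.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}
```

The returned `Vec` would then feed a single fused forward pass. The trade-off sits in the two parameters: a larger `max_batch` favors throughput, a smaller `max_wait` favors tail latency.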
v0.3 — Route A hardening
Make the native-torch path production-shaped.
- Fresh-weights artifact generated from the parked third_party/torchrec submodule (today's benchmark uses the existing random-weight artifact).
- Link-time libtorch_cuda so the RTLD_GLOBAL dlopen workaround becomes optional; ship a clear runtime preload config either way.
- Numerical parity check against a trained-weight artifact. (The sparse-layout correctness pass is done: the backend reorders batch-major wire ids into TorchRec's key-major KeyedJaggedTensor layout for both single and coalesced inference, covered by unit tests.)
- INT8 / FP16 parity across both backends; document precision knobs.
- CI that builds --features torch against a pinned libtorch and runs the smoke example.
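
The batch-major to key-major reorder mentioned above can be sketched as follows. This is a minimal illustration, assuming the wire format delivers ids grouped per sample and then per key; the `KeyMajor` struct and `to_key_major` function are hypothetical names, not the backend's actual types. The output mirrors the usual KeyedJaggedTensor convention: values concatenated key by key, with one length entry per (key, sample) pair.

```rust
/// Key-major jagged layout: `values` concatenates ids key by key, and sample
/// by sample within each key; `lengths[k * batch + b]` is the id count for
/// key `k`, sample `b`. (Hypothetical type for illustration.)
#[derive(Debug, PartialEq)]
pub struct KeyMajor {
    pub values: Vec<i64>,
    pub lengths: Vec<i32>,
}

/// `batch_major[b][k]` holds the ids for sample `b`, key `k`
/// (an assumed wire shape, not taken from the Narsil schema).
pub fn to_key_major(
    batch_major: &[Vec<Vec<i64>>],
    num_keys: usize,
) -> Result<KeyMajor, String> {
    let batch = batch_major.len();
    // Every sample must carry exactly one (possibly empty) id list per key.
    for (b, sample) in batch_major.iter().enumerate() {
        if sample.len() != num_keys {
            return Err(format!(
                "sample {} has {} keys, expected {}",
                b,
                sample.len(),
                num_keys
            ));
        }
    }

    let total: usize = batch_major.iter().flatten().map(Vec::len).sum();
    let mut values = Vec::with_capacity(total);
    let mut lengths = Vec::with_capacity(num_keys * batch);

    // Outer loop over keys, inner loop over samples: this is the reorder.
    for k in 0..num_keys {
        for sample in batch_major {
            let ids = &sample[k];
            values.extend_from_slice(ids);
            lengths.push(ids.len() as i32);
        }
    }
    Ok(KeyMajor { values, lengths })
}
```

The same function covers coalesced inference if the batches are concatenated along the sample axis before the call, since the key-major loop order does not depend on where the samples came from.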
v0.4 — Burn-native parity
Close the latency gap without libtorch, for the pure-Rust story.
- Replace the 26 per-table embedding gathers with a table-batched gather.
- Replace the 351 pairwise ops with a single batched-matmul (bmm) interaction + triangular gather (one kernel instead of ~700).
- Quantized (INT8/FP16) embeddings in Burn.
- Re-benchmark Burn vs route A vs the C++ baseline.
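
To make the bmm-plus-triangular-gather idea concrete, here is a CPU reference sketch for one sample: build the Gram matrix of the feature vectors, then keep only the strict lower triangle. The function name and the flat row-major layout are assumptions for illustration; a real implementation would run this as a batched GPU kernel in Burn. The figure of 351 pairs equals 27 × 26 / 2, which would be consistent with 26 embedding vectors plus one dense vector, though that breakdown is an inference from the numbers rather than something the roadmap states.

```rust
/// Pairwise dot-product interaction for a single sample.
/// `features` is row-major `[num_features x dim]`.
/// Returns the strict lower triangle of X * X^T, row by row.
/// (Hypothetical reference function, not the Burn kernel.)
pub fn interaction(features: &[f32], num_features: usize, dim: usize) -> Vec<f32> {
    assert_eq!(features.len(), num_features * dim);

    // Step 1: Gram matrix Z = X * X^T. Across a batch this is one bmm.
    let mut gram = vec![0.0f32; num_features * num_features];
    for i in 0..num_features {
        let a = &features[i * dim..(i + 1) * dim];
        for j in 0..num_features {
            let b = &features[j * dim..(j + 1) * dim];
            gram[i * num_features + j] =
                a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>();
        }
    }

    // Step 2: gather entries with i > j, giving F * (F - 1) / 2 outputs.
    let pairs = num_features * num_features.saturating_sub(1) / 2;
    let mut out = Vec::with_capacity(pairs);
    for i in 1..num_features {
        for j in 0..i {
            out.push(gram[i * num_features + j]);
        }
    }
    out
}
```

Computing the full Gram matrix does roughly twice the arithmetic of the pairwise form, but it replaces many small launches with one dense operation, which is usually the better trade on a GPU.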
v0.5 — Transports
Deliver on "many transports externally."
- HTTP/JSON ingress alongside gRPC (shared executor core).
- QUIC transport.
- First-class observability: Prometheus metrics, structured tracing, health endpoints (graceful drain already exists).
v0.6 — Multi-GPU & collectives
- NCCL and UCCL integration for multi-GPU and multi-node.
- Sharded embeddings for tables that exceed one device.
v0.7 — Model coverage
- Generic TorchScript / ONNX loading beyond DLRM, behind the same backend trait.
- A lightweight model registry (load/swap artifacts without redeploying).
v0.8 — Serving ergonomics
- Adaptive batch parameters (auto-tune batch_size/timeout to latency SLOs).
- Admission control and backpressure under overload.
- Dashboards and load-shedding policies.
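
One simple form admission control could take is a cap on in-flight requests, rejecting new work once the cap is reached. The sketch below is only an illustration of that idea using standard-library atomics; `Admission`, `Permit`, and `try_admit` are hypothetical names, and the roadmap does not commit to this particular policy.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Caps the number of concurrently admitted requests.
/// (Hypothetical type for illustration.)
pub struct Admission {
    in_flight: AtomicUsize,
    limit: usize,
}

/// RAII guard: dropping it releases the slot.
pub struct Permit {
    gate: Arc<Admission>,
}

impl Admission {
    pub fn new(limit: usize) -> Arc<Self> {
        Arc::new(Self { in_flight: AtomicUsize::new(0), limit })
    }

    /// Returns a permit, or `None` when the server is at capacity and the
    /// caller should shed the request (for example with an "unavailable" status).
    pub fn try_admit(self: &Arc<Self>) -> Option<Permit> {
        let mut current = self.in_flight.load(Ordering::Relaxed);
        loop {
            if current >= self.limit {
                return None;
            }
            match self.in_flight.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Some(Permit { gate: Arc::clone(self) }),
                // Another thread raced us; retry with the observed value.
                Err(observed) => current = observed,
            }
        }
    }

    pub fn in_flight(&self) -> usize {
        self.in_flight.load(Ordering::Relaxed)
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.gate.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}
```

Because the permit releases its slot on drop, a handler that holds it for the lifetime of a request frees capacity on every exit path, including early returns and panics that unwind.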
v0.9 — Hardening
- Soak and chaos tests; schema fuzzing.
- Performance regression gates in CI (latency/throughput budgets).
- Security review of the request path and artifact loading.
v1.0.0 — Stability
- Frozen InferenceBackend trait and protobuf schema with semver guarantees.
- Supported transport + backend matrix documented.
- Complete docs, examples, and an upgrade guide.
Principles
These hold across every milestone:
- The serving shell is the product; backends are pluggable. New compute must not require touching the transport, scheduler, or batcher.
- Measure before optimizing. Every perf change ships with a benchmark delta (see the Decision Log).
- Pin the ABI. The fbgemm↔torch pairing and tch↔libtorch version are exact: treat them as a locked unit, not a range.
- Both compute families stay first-class. Burn-native and native-torch are co-equal paths, not a primary and a fallback.
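
The first principle, a shell that never changes when compute changes, can be illustrated with a trait object boundary. Everything in this sketch is hypothetical: `SketchBackend`, `SketchRequest`, and the two toy backends are stand-ins, and the real InferenceBackend trait is not shown in this roadmap and may look quite different.

```rust
/// Hypothetical request type; the real schema is protobuf-defined.
pub struct SketchRequest {
    pub dense: Vec<f32>,
}

/// Illustrative stand-in for a pluggable backend boundary.
pub trait SketchBackend: Send + Sync {
    fn name(&self) -> &'static str;
    /// One fused forward over a coalesced batch; one score per request.
    fn infer_batch(&self, batch: &[SketchRequest]) -> Vec<f32>;
}

/// Toy backend: scores a request by summing its dense features.
pub struct SumBackend;
/// Toy backend: scores a request by its largest dense feature.
pub struct MaxBackend;

impl SketchBackend for SumBackend {
    fn name(&self) -> &'static str {
        "sum"
    }
    fn infer_batch(&self, batch: &[SketchRequest]) -> Vec<f32> {
        batch.iter().map(|r| r.dense.iter().sum::<f32>()).collect()
    }
}

impl SketchBackend for MaxBackend {
    fn name(&self) -> &'static str {
        "max"
    }
    fn infer_batch(&self, batch: &[SketchRequest]) -> Vec<f32> {
        batch
            .iter()
            .map(|r| r.dense.iter().copied().fold(f32::NEG_INFINITY, f32::max))
            .collect()
    }
}

/// Shell-side code depends only on the trait, never on a concrete backend.
pub fn run_batch(
    backend: &dyn SketchBackend,
    batch: &[SketchRequest],
) -> (String, Vec<f32>) {
    (backend.name().to_string(), backend.infer_batch(batch))
}
```

Here `run_batch` plays the role of the transport, scheduler, and batcher: swapping `SumBackend` for `MaxBackend` changes the scores but requires no edit to the calling code.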