Roadmap to v1.0.0

Narsil's v1.0.0 goal: a stable, production-grade Rust serving engine with a frozen backend trait and request schema, multiple transports, and both compute families (Burn-native and native-torch) hardened on a real recommender workload.

This roadmap is thematic, not dated. Items are ordered by dependency, not calendar.

Where we are — v0.2.x

  • ✅ tonic gRPC serving shell with three execution modes (inline / worker pool / batch collector).
  • ✅ Continuous batching that coalesces concurrent requests into one fused forward.
  • ✅ Burn CUDA DLRM backend (FP32/FP16) + opt-in cuTile fused interaction.
  • ✅ Route A: torch-dlrm backend serving TorchRec artifacts via tch-rs (libtorch 2.11 + FBGEMM 1.6.0); matches the C++ baseline's latency and exceeds its throughput.
  • ✅ DLRM compare harness + documented benchmarks.
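
The batch-collector idea above (coalescing concurrent requests into one fused forward) can be illustrated with a small synchronous sketch. This is not Narsil's implementation: the engine runs under tonic and is presumably async, while this example uses a standard-library channel for simplicity. The names `collect_batch`, `max_batch`, and `window` are hypothetical.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Blocks for the first request, then keeps collecting until either
/// `max_batch` items are gathered or `window` has elapsed.
/// Returns `None` once the channel is closed and drained.
/// A `max_batch` of 0 is treated like 1 (the first item is always returned).
pub fn collect_batch<T>(
    rx: &Receiver<T>,
    max_batch: usize,
    window: Duration,
) -> Option<Vec<T>> {
    // Wait indefinitely for the request that opens the batch.
    let first = rx.recv().ok()?;
    let mut batch = vec![first];
    let deadline = Instant::now() + window;

    while batch.len() < max_batch {
        // Time left in the collection window (zero once the deadline passes).
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            // Window expired or all senders are gone: flush what we have.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}
```

A serving loop would call `collect_batch` repeatedly and hand each returned `Vec` to the backend as one forward pass. The window bounds the extra latency a request can pay while waiting for batch-mates.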

v0.3 — Route A hardening

Make the native-torch path production-shaped.

  • Fresh-weights artifact generated from the parked third_party/torchrec submodule (today's benchmark uses the existing random-weight artifact).
  • Link-time libtorch_cuda so the RTLD_GLOBAL dlopen workaround becomes optional; ship a clear runtime preload config either way.
  • Numerical parity check against a trained-weight artifact. (The sparse-layout correctness pass is done: the backend reorders batch-major wire ids into TorchRec's key-major KeyedJaggedTensor layout for both single and coalesced inference, covered by unit tests.)
  • INT8 / FP16 parity across both backends; document precision knobs.
  • CI that builds --features torch against a pinned libtorch and runs the smoke example.
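
The sparse-layout reorder mentioned in the list above can be sketched as follows. This is an illustration under stated assumptions, not the backend's code: `KeyMajor`, `to_key_major`, and the nested `batch[b][k]` wire shape are hypothetical. It shows ids that arrive sample by sample being regrouped so that all ids for key 0 come first, then key 1, and so on, with one length per (key, sample) pair. That is the values/lengths ordering a KeyedJaggedTensor uses.

```rust
/// Flattened sparse features in key-major order, mirroring the
/// values/lengths pair of a KeyedJaggedTensor (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct KeyMajor {
    pub values: Vec<i64>,
    pub lengths: Vec<i64>,
}

/// `batch[b][k]` holds the ids that sample `b` supplies for key `k`
/// (batch-major). The result lists every sample's ids for key 0,
/// then every sample's ids for key 1, and so on (key-major).
pub fn to_key_major(batch: &[Vec<Vec<i64>>], num_keys: usize) -> Result<KeyMajor, String> {
    // Reject ragged input up front so the indexing below cannot panic.
    for (b, sample) in batch.iter().enumerate() {
        if sample.len() != num_keys {
            return Err(format!(
                "sample {b} has {} keys, expected {num_keys}",
                sample.len()
            ));
        }
    }

    let mut values = Vec::new();
    let mut lengths = Vec::with_capacity(num_keys * batch.len());
    for k in 0..num_keys {
        for sample in batch {
            let ids = &sample[k];
            lengths.push(ids.len() as i64);
            values.extend_from_slice(ids);
        }
    }
    Ok(KeyMajor { values, lengths })
}
```

For two samples and two keys, with sample 0 sending `[10, 11]` and `[20]` and sample 1 sending `[12]` and `[21, 22]`, the key-major result has values `[10, 11, 12, 20, 21, 22]` and lengths `[2, 1, 1, 2]`. Coalesced inference uses the same routine once the per-request samples are concatenated into one batch.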

v0.4 — Burn-native parity

Close the latency gap without libtorch, for the pure-Rust story.

  • Replace the 26 per-table embedding gathers with a table-batched gather.
  • Replace the 351 pairwise ops with a single batched-matmul (bmm) interaction + triangular gather (one kernel instead of ~700).
  • Quantized (INT8/FP16) embeddings in Burn.
  • Re-benchmark Burn vs Route A vs the C++ baseline.
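
The 351 figure is consistent with 27 feature vectors per sample (the 26 embedding outputs plus, presumably, one dense vector), since 27 × 26 / 2 = 351 unordered pairs. The sketch below is a CPU reference for a single sample, not Burn code: it computes all dot products at once as a Gram matrix and then gathers the strictly lower triangle. The function names are hypothetical, and a GPU version would run the same computation as a batched matmul over the whole batch.

```rust
/// Number of unordered feature pairs: n choose 2.
pub fn pair_count(n: usize) -> usize {
    n * n.saturating_sub(1) / 2
}

/// Row-major Gram matrix F * F^T for `n` feature vectors of width `d`.
/// Entry (i, j) is the dot product of feature i with feature j.
pub fn gram(features: &[f32], n: usize, d: usize) -> Vec<f32> {
    assert_eq!(features.len(), n * d, "features must hold n * d values");
    let mut g = vec![0.0f32; n * n];
    for i in 0..n {
        for j in 0..n {
            g[i * n + j] = (0..d)
                .map(|c| features[i * d + c] * features[j * d + c])
                .sum::<f32>();
        }
    }
    g
}

/// Gathers the strictly lower-triangular entries (i > j), row by row,
/// yielding exactly `pair_count(n)` interaction terms.
pub fn tril_gather(g: &[f32], n: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(pair_count(n));
    for i in 1..n {
        for j in 0..i {
            out.push(g[i * n + j]);
        }
    }
    out
}
```

The matmul computes the full n × n matrix, including the diagonal and the redundant upper triangle, so it does roughly twice the arithmetic of the pairwise version. The trade is extra arithmetic for far fewer kernel launches.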

v0.5 — Transports

Deliver on "many transports externally."

  • HTTP/JSON ingress alongside gRPC (shared executor core).
  • QUIC transport.
  • First-class observability: Prometheus metrics, structured tracing, health endpoints (graceful drain already exists).

v0.6 — Multi-GPU & collectives

  • NCCL and UCCL integration for multi-GPU and multi-node.
  • Sharded embeddings for tables that exceed one device.

v0.7 — Model coverage

  • Generic TorchScript / ONNX loading beyond DLRM, behind the same backend trait.
  • A lightweight model registry (load/swap artifacts without redeploying).

v0.8 — Serving ergonomics

  • Adaptive batch parameters (auto-tune batch_size / timeout to latency SLOs).
  • Admission control and backpressure under overload.
  • Dashboards and load-shedding policies.

v0.9 — Hardening

  • Soak and chaos tests; schema fuzzing.
  • Performance regression gates in CI (latency/throughput budgets).
  • Security review of the request path and artifact loading.

v1.0.0 — Stability

  • Frozen InferenceBackend trait and protobuf schema with semver guarantees.
  • Supported transport + backend matrix documented.
  • Complete docs, examples, and an upgrade guide.

Principles

These hold across every milestone:

  • The serving shell is the product; backends are pluggable. New compute must not require touching the transport, scheduler, or batcher.
  • Measure before optimizing. Every perf change ships with a benchmark delta (see the Decision Log).
  • Pin the ABI. The fbgemm↔torch and tch↔libtorch version pairings are exact; treat each as a locked unit, not a range.
  • Both compute families stay first-class. Burn-native and native-torch are co-equal paths, not a primary and a fallback.

Apache-2.0 licensed.