Roadmap to v1.0.0

Narsil's v1.0.0 goal: a stable, production-grade Rust serving engine with a frozen backend trait and request schema, multiple transports, and both compute families (Burn-native and native-torch) hardened on a real recommender workload.

This roadmap is thematic, not dated. Items are ordered by dependency, not calendar.

Shipped — v0.4.0 (current release)

✅ tonic gRPC serving shell with three execution modes (inline / worker pool / batch collector).
✅ Continuous batching that coalesces concurrent requests into one fused forward.
✅ Burn CUDA DLRM backend (FP32/FP16) with table-batched gather + batched-matmul interaction — conc-1 p50 13.04 ms → 1.17 ms (~11×), single-lane throughput 7.9k → 125.6k samples/s — plus opt-in cuTile fused interaction.
✅ Route A: torch-dlrm backend serving TorchRec artifacts via tch-rs (libtorch 2.11 + FBGEMM 1.6.0) — matches the C++ baseline's latency, exceeds its throughput (~2× via batching).
✅ Reproducible Route A INT8 artifact generator from the parked third_party/torchrec submodule.
✅ Link-time libtorch_cuda retention for CUDA libtorch builds; fbgemm plugins remain runtime dlopens for op registration.
✅ HTTP/JSON gateway alongside gRPC (shared executor core), runtime-gated by NARSIL_HTTP_ADDR, with body-size caps, request timeouts, and per-item batch errors.
✅ HTTP/3 over QUIC (quinn/h3, feature-gated quic) with configurable TLS.
✅ Observability: Prometheus metrics + HTTP liveness/readiness endpoints, with readiness wired to real backend/collector health (graceful drain already existed).
✅ CI builds --features torch (CPU libtorch) and --features quic; DLRM compare harness + documented benchmarks.
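
As a rough illustration of the continuous-batching item above, the sketch below shows one way a batch collector can coalesce concurrent requests: block for the first request, then keep draining the queue until either a size cap or a wait deadline is hit. This is not Narsil's collector; `collect_batch`, `max_batch`, and `max_wait` are hypothetical names, and the sketch uses only the standard library's `mpsc` channel rather than an async runtime.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical helper (not Narsil's API): gather up to `max_batch` requests,
/// waiting at most `max_wait` after the first one arrives.
/// Returns an empty Vec once every sender is gone and the queue is empty.
pub fn collect_batch<T>(rx: &Receiver<T>, max_batch: usize, max_wait: Duration) -> Vec<T> {
    let mut batch = Vec::with_capacity(max_batch);

    // Block for the first request so an idle collector does not spin.
    match rx.recv() {
        Ok(first) => batch.push(first),
        Err(_) => return batch, // all senders dropped
    }

    // Keep coalescing until the batch is full or the deadline passes.
    let deadline = Instant::now() + max_wait;
    while batch.len() < max_batch {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    batch
}
```

The returned batch would then be handed to a single fused forward call. The two parameters trade latency against throughput: a longer wait or larger cap yields fuller batches at the cost of time spent queued.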

Next — v0.5: Burn-native hardening & parity gates

Close the remaining Burn-native gaps without libtorch, and lock the wins in.

Quantized (INT8/FP16) embeddings in Burn.
Tail-latency work for Burn batch mode after the table-batched gather + bmm rewrite.
Numerical parity against a trained-weight artifact (the sparse-layout correctness pass is done and unit-tested), plus INT8 / FP16 parity across both backends; document the precision knobs.
Performance regression gates in CI for Burn vs Route A vs the documented C++ baseline.
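
A regression gate of this kind could reduce to two steps: summarize a run's latency samples into a percentile, then compare it to a recorded baseline with some tolerance. The sketch below is a minimal version under those assumptions; `percentile`, `within_budget`, and the tolerance parameter are hypothetical and do not describe the planned CI implementation.

```rust
/// Nearest-rank percentile over latency samples (milliseconds).
/// Returns None for an empty slice or a `pct` outside 0..=100.
pub fn percentile(samples: &[f64], pct: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&pct) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest-rank: the smallest value with at least pct% of samples at or below it.
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// Hypothetical gate: pass if the measurement is no worse than the baseline
/// by more than `tolerance` (0.05 means 5% slack). The tolerance is an assumption.
pub fn within_budget(measured_ms: f64, baseline_ms: f64, tolerance: f64) -> bool {
    measured_ms <= baseline_ms * (1.0 + tolerance)
}
```

For example, against the 1.17 ms conc-1 p50 listed under v0.4.0 and an assumed 5% tolerance, a measured p50 of 1.20 ms would pass and 1.30 ms would fail.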

v0.6 — Multi-GPU & collectives

NCCL and UCCL integration for multi-GPU and multi-node.
Sharded embeddings for tables that exceed one device.

v0.7 — Model coverage

Generic TorchScript / ONNX loading beyond DLRM, behind the same backend trait.
A lightweight model registry (load/swap artifacts without redeploying).

v0.8 — Serving ergonomics

Adaptive batch parameters (auto-tune batch_size / timeout to latency SLOs).
Admission control and backpressure under overload.
Dashboards and load-shedding policies.
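
One possible shape for the adaptive batch parameters item is an additive-increase / multiplicative-decrease loop: grow the batch size slowly while tail latency stays under the SLO, and halve it when the SLO is breached. The sketch below assumes that policy purely for illustration; `BatchTuner` and its fields are hypothetical, and the roadmap does not commit to this algorithm.

```rust
/// Hypothetical AIMD-style tuner; names and policy are illustrative only.
#[derive(Debug)]
pub struct BatchTuner {
    batch_size: usize,
    min_batch: usize,
    max_batch: usize,
    slo_ms: f64,
}

impl BatchTuner {
    pub fn new(initial: usize, min_batch: usize, max_batch: usize, slo_ms: f64) -> Self {
        // Normalize bounds so that 1 <= min_batch <= max_batch.
        let min_batch = min_batch.max(1);
        let max_batch = max_batch.max(min_batch);
        Self {
            batch_size: initial.clamp(min_batch, max_batch),
            min_batch,
            max_batch,
            slo_ms,
        }
    }

    /// Feed one tail-latency observation (ms) and get the next batch size.
    pub fn observe(&mut self, tail_latency_ms: f64) -> usize {
        if tail_latency_ms > self.slo_ms {
            // SLO breached: back off quickly.
            self.batch_size = (self.batch_size / 2).max(self.min_batch);
        } else {
            // Under the SLO: probe for more throughput slowly.
            self.batch_size = (self.batch_size + 1).min(self.max_batch);
        }
        self.batch_size
    }
}
```

The asymmetric step sizes are the design choice here: backing off fast limits how long an SLO breach lasts, while growing slowly avoids oscillating around the limit.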

v0.9 — Hardening

Soak and chaos tests; schema fuzzing.
Performance regression gates in CI (latency/throughput budgets).
Security review of the request path and artifact loading.

v1.0.0 — Stability

Frozen InferenceBackend trait and protobuf schema with semver guarantees.
Supported transport + backend matrix documented.
Complete docs, examples, and an upgrade guide.

Principles

These hold across every milestone:

The serving shell is the product; backends are pluggable. New compute must not require touching the transport, scheduler, or batcher.
Measure before optimizing. Every perf change ships with a benchmark delta (see the Decision Log).
Pin the ABI. The fbgemm↔torch and tch↔libtorch pairings are exact-version matches: treat each pairing as a locked unit, not a version range.
Both compute families stay first-class. Burn-native and native-torch are co-equal paths, not a primary and a fallback.
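
The first principle, that backends are pluggable and the shell never changes for new compute, can be pictured with a small trait-object sketch. The trait name `InferenceBackend` comes from this roadmap, but the methods shown are assumptions for illustration and not Narsil's actual signature; `SumBackend` and `serve_batch` are hypothetical stand-ins for a compute backend and the serving shell.

```rust
/// Trait name taken from the roadmap; the methods are assumed, not the real API.
pub trait InferenceBackend {
    fn name(&self) -> &str;
    /// Run one fused forward over a batch of dense feature rows.
    fn forward(&self, batch: &[Vec<f32>]) -> Result<Vec<f32>, String>;
}

/// Toy backend standing in for a real compute family: scores a row by summing it.
pub struct SumBackend;

impl InferenceBackend for SumBackend {
    fn name(&self) -> &str {
        "sum"
    }

    fn forward(&self, batch: &[Vec<f32>]) -> Result<Vec<f32>, String> {
        Ok(batch.iter().map(|row| row.iter().sum::<f32>()).collect())
    }
}

/// Shell-side code depends only on the trait, so adding a backend
/// means adding an `impl`, not editing this function.
pub fn serve_batch(
    backend: &dyn InferenceBackend,
    batch: &[Vec<f32>],
) -> Result<Vec<f32>, String> {
    if batch.is_empty() {
        return Err("empty batch".to_string());
    }
    backend.forward(batch)
}
```

Under this shape, a second backend is another `impl InferenceBackend` passed to the same `serve_batch`, which is the property the principle asks the transport, scheduler, and batcher to preserve.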
