Decision Log
How Narsil's DLRM serving got from "27× behind the C++ baseline" to "matches it." Each entry is a decision with its context and consequence.
D1 — Diagnose the gap before chasing it
Context. The first benchmark had Narsil's Burn CUDA DLRM at ~79 req/s vs the TorchRec C++ gRPC server at ~2148 req/s, and 13.0 ms vs 1.0 ms p50 at concurrency 1 — even at matched dtype.
Investigation. The gap was two independent problems, neither of them the transport:
- Serialization. The benchmark ran one CUDA worker, so 16 concurrent gRPC streams queued behind a single ~13 ms service time.
- Kernel-launch-bound forward. The Burn DLRM issued ~750 tiny kernels per forward: 26 separate embedding gathers and ~700 elementwise/reduction ops for the 351 pairwise interactions, plus a per-request host sync. At ~17 µs per launch, ~750 launches account for essentially all of the 13 ms.
FP16 vs FP32 made no difference, which confirmed the limiter was launch count, not arithmetic width.
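The numbers in this diagnosis can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: the function names are hypothetical, and reading the 351 pairs as 27 feature vectors (26 embeddings plus one dense vector) is an assumption that happens to match the count quoted above.

```rust
/// Number of unordered pairs among `n` feature vectors: n * (n - 1) / 2.
fn pairwise_interactions(n: u64) -> u64 {
    n * (n - 1) / 2
}

/// Forward time in milliseconds if every kernel pays a fixed launch cost.
fn launch_bound_ms(kernels: u64, launch_us: f64) -> f64 {
    kernels as f64 * launch_us / 1000.0
}

/// Throughput ceiling in req/s when each worker serves one request at a time.
fn throughput_ceiling(workers: u64, service_ms: f64) -> f64 {
    workers as f64 * 1000.0 / service_ms
}

fn main() {
    // Assumption: 26 embedding outputs plus one dense vector give 27 features.
    let pairs = pairwise_interactions(27);
    // ~750 launches at ~17 µs each, as measured above.
    let forward_ms = launch_bound_ms(750, 17.0);
    // One CUDA worker, so concurrency cannot raise throughput past this.
    let ceiling = throughput_ceiling(1, forward_ms);

    println!("pairs = {pairs}");
    println!("launch-bound forward = {forward_ms:.2} ms");
    println!("single-worker ceiling = {ceiling:.1} req/s");
}
```

Under these assumptions the launch-bound estimate lands at 12.75 ms, and a single worker at that service time cannot exceed roughly 78 req/s. That is consistent with the ~79 req/s observed regardless of how many gRPC streams were open.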
Consequence. The fix had to be the operator graph and execution model, not a tonic→C++ port. This reframed everything downstream.
Takeaway
"Same dtype" was a red herring. Measure where the time actually goes before optimizing.
D2 — Why there are no "FBGEMM Rust bindings"
Context. TorchRec's speed comes from FBGEMM_GPU: table-batched quantized embeddings and a fused interaction. The natural question was whether to bind FBGEMM from Rust.
Finding. FBGEMM_GPU is not a library with a stable C ABI — it is a set of PyTorch custom operators registered into libtorch's dispatcher (fbgemm::*), taking at::Tensor. There is nothing to FFI to without the PyTorch C++ runtime. So "Rust FBGEMM bindings" necessarily means Rust → libtorch dispatcher → fbgemm op — which is exactly what tch-rs provides.
Consequence. Two routes were viable: (A) reach the ops through libtorch via tch-rs, or (C) reimplement the few inference-critical kernels in Burn (batched-gather embedding + a bmm interaction). A third option, (B) cuTile, covered only part of the problem and was blocked on this host (it needs CUDA 13.2 / newer toolkit support).
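To make the "bmm interaction" in Route C concrete: the pairwise dot products for one sample are the strict upper triangle of the Gram matrix X·Xᵀ, so a single matrix multiply (batched over samples on the GPU) can replace hundreds of small ops. The following is a plain CPU sketch of that idea for one sample. It is not Burn code, and the function name is hypothetical.

```rust
/// Pairwise dot-product interaction for one sample.
/// `features` holds F vectors of equal dimension D; the result has
/// F * (F - 1) / 2 entries, ordered (0,1), (0,2), ..., (F-2,F-1).
fn interaction(features: &[Vec<f32>]) -> Vec<f32> {
    let f = features.len();

    // Gram matrix G = X * X^T, stored row-major as F x F.
    // On a GPU this whole loop nest is one (batched) matmul.
    let mut gram = vec![0.0f32; f * f];
    for i in 0..f {
        for j in 0..f {
            gram[i * f + j] = features[i]
                .iter()
                .zip(&features[j])
                .map(|(a, b)| a * b)
                .sum();
        }
    }

    // Keep only the strict upper triangle: each unordered pair once.
    let mut out = Vec::with_capacity(f * f.saturating_sub(1) / 2);
    for i in 0..f {
        for j in (i + 1)..f {
            out.push(gram[i * f + j]);
        }
    }
    out
}

fn main() {
    // 27 feature vectors of dimension 4 (sizes chosen for illustration).
    let feats: Vec<Vec<f32>> = (0..27).map(|i| vec![i as f32; 4]).collect();
    println!("interaction terms = {}", interaction(&feats).len());
}
```

The Gram matrix computes each pair twice, which is wasteful on a CPU; on a launch-bound GPU path the trade is usually worthwhile because it collapses the work into one kernel.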
D3 — Choose Route A (native torch via tch-rs)
Context. Narsil's thesis: a Rust-native serving engine that is internally native torch and custom Rust/cuTile kernels, and externally many transports. That makes tch-rs + libtorch + FBGEMM on-thesis, not a compromise.
Decision. Pursue Route A: load the existing TorchRec TorchScript artifact through tch-rs and run the real FBGEMM kernels behind Narsil's serving shell. Keep the Burn path for the pure-Rust story.
De-risking (before any code). Verified on the target box that:
- fbgemm_gpu's inference op registers via plain torch.ops.load_library; no torchrec Python needed.
- The TorchScript artifact loads in a torch-only process (exactly tch-rs's context).
- The I/O contract is a Dict[str, Tensor] in / "default" tensor out.
- fbgemm supports the L4's sm_89; unlike cuTile, this path was not hardware-blocked.
D4 — Pin the toolchain to the one ABI-compatible pair
Context. The intended stack was torch 2.11 + fbgemm 1.6.0 + torchrec 1.6.0. Reality intervened.
What we hit.
- tch ↔ libtorch is exact. tch 0.24 targets libtorch 2.11.0 precisely; tch 0.23 → 2.10.
- torchrec 1.6.0 isn't on the cu128 index (only 1.5.0). So we tried downgrading to torch 2.10 + fbgemm 1.5.0.
- That combination is ABI-broken. fbgemm 1.5.0 needs c10::MessageLogger(SourceLocation, int, bool), but torch 2.10's libc10 exports MessageLogger(char const*, int, int, bool) → undefined symbol at import. No amount of reinstalling fixes a symbol mismatch.
- torchrec is pure Python (py3-none-any). It is not a runtime dependency of the engine, only a build-time tool for generating artifacts. So its version is flexible; the only hard constraint is fbgemm ↔ torch.
- Installs also surfaced pip dependency-confusion hash errors (mixing --extra-index-url) and transient corruption on large wheels.
Decision. Settle on the only validated binary pair: torch 2.11.0 + fbgemm 1.6.0 (tch 0.24). Install torch + fbgemm from the cu128 index only. Park torchrec as a git submodule (third_party/torchrec) for artifact generation when fresh weights are needed.
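One way to keep this pin from drifting is a startup guard that refuses to run against any other pair. The sketch below is hypothetical: how the version strings are obtained is left out, and stripping a local suffix such as "+cu128" is an assumption about how the wheels report their versions.

```rust
/// The only binary pair this log reports as validated.
const PINNED_TORCH: &str = "2.11.0";
const PINNED_FBGEMM: &str = "1.6.0";

#[derive(Debug, PartialEq)]
enum PinError {
    Torch { found: String },
    Fbgemm { found: String },
}

/// Drop a local version suffix such as "+cu128" before comparing.
fn base_version(v: &str) -> &str {
    v.split('+').next().unwrap_or(v).trim()
}

/// Accept only the pinned torch + fbgemm pair; report which side is off.
fn check_pin(torch: &str, fbgemm: &str) -> Result<(), PinError> {
    if base_version(torch) != PINNED_TORCH {
        return Err(PinError::Torch { found: torch.to_string() });
    }
    if base_version(fbgemm) != PINNED_FBGEMM {
        return Err(PinError::Fbgemm { found: fbgemm.to_string() });
    }
    Ok(())
}

fn main() {
    // The validated pair passes.
    println!("{:?}", check_pin("2.11.0+cu128", "1.6.0+cu128"));
    // The attempted downgrade is rejected before any symbol lookup can fail.
    println!("{:?}", check_pin("2.10.0+cu128", "1.5.0+cu128"));
}
```

An exact-match check is deliberately strict. Because the failure mode described above is an undefined symbol at import, a range check would give false confidence.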
Hard-won rules
- The fbgemm↔torch ABI is the real constraint; torchrec is pure Python and flexible.
- Install torch + fbgemm from the single cu128 index; never mix --extra-index-url for them.
- The Rust binary must dlopen(libtorch_cuda.so, RTLD_GLOBAL) at startup or ATen reports no CUDA.
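The last rule can be sketched without extra crates by declaring dlopen directly. This is a minimal illustration, not Narsil's actual startup code: the helper name is hypothetical, the flag values are the Linux ones from dlfcn.h, and the bare library name assumes libtorch's lib directory is on the loader search path.

```rust
use std::ffi::{c_char, c_int, c_void, CStr, CString};

// Linux values from <dlfcn.h>; other platforms use different constants.
const RTLD_NOW: c_int = 0x2;
const RTLD_GLOBAL: c_int = 0x100;

unsafe extern "C" {
    fn dlopen(filename: *const c_char, flags: c_int) -> *mut c_void;
    fn dlerror() -> *mut c_char;
}

/// Load a shared library so that its symbols are visible process-wide.
fn load_global(path: &str) -> Result<*mut c_void, String> {
    let c_path = CString::new(path).map_err(|e| e.to_string())?;
    // SAFETY: c_path is a valid NUL-terminated string that outlives the call.
    let handle = unsafe { dlopen(c_path.as_ptr(), RTLD_NOW | RTLD_GLOBAL) };
    if handle.is_null() {
        // SAFETY: dlerror returns NULL or a NUL-terminated string owned by libc.
        let msg = unsafe {
            let err = dlerror();
            if err.is_null() {
                "unknown dlopen error".to_string()
            } else {
                CStr::from_ptr(err).to_string_lossy().into_owned()
            }
        };
        return Err(msg);
    }
    Ok(handle)
}

fn main() {
    // Run this before the first ATen call that touches CUDA.
    match load_global("libtorch_cuda.so") {
        Ok(_) => println!("libtorch_cuda.so loaded with RTLD_GLOBAL"),
        Err(e) => eprintln!("could not load libtorch_cuda.so: {e}"),
    }
}
```

The handle is intentionally never closed: the library has to stay resident for the lifetime of the process.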
D5 — Reuse the existing artifact; defer regeneration
Context. The benchmark artifact (model_default.pt) was built on torch 2.5.1 / fbgemm 1.0.0.
Finding. It loads and runs cleanly on torch 2.11 / fbgemm 1.6.0 — TorchScript + fbgemm forward-compat held across the jump.
Consequence. No torchrec regeneration was needed to build, validate, or benchmark the backend. Regeneration (via the parked submodule) is only required for fresh, trained weights — tracked as a roadmap item.
D6 — Validate the result
Serving the same INT8 artifact on an L4, Route A reaches ~1.02 ms p50 at concurrency 1. That is on par with the C++ baseline's 1.033 ms and ~13× faster than Burn's 13.04 ms. With the batch collector it reaches ~440k samples/s at concurrency 16, ~2× the C++ baseline, via request coalescing. This closed the loop on D1: the gap was the kernel path, and the Rust serving shell adds negligible overhead.
See Benchmarks for full tables and caveats.
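The request coalescing credited in D6 can be illustrated with a small collector that drains a channel into batches. This is a simplified sketch under stated assumptions, not Narsil's batch collector: the function name, the max_batch and max_wait parameters, and the use of std's mpsc channel are all illustrative choices.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Collect up to `max_batch` items, waiting at most `max_wait` after the
/// first one arrives. Returns None once the channel is closed and drained.
fn collect_batch<T>(rx: &Receiver<T>, max_batch: usize, max_wait: Duration) -> Option<Vec<T>> {
    // Block until at least one request is available.
    let first = rx.recv().ok()?;
    let mut batch = vec![first];
    let deadline = Instant::now() + max_wait;

    while batch.len() < max_batch {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            // Either the window closed or all senders are gone: ship what we have.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}

fn main() {
    let (tx, rx) = mpsc::channel();
    for request_id in 0..5u32 {
        tx.send(request_id).unwrap();
    }
    drop(tx); // no more requests

    // Each batch would be concatenated and run through one forward pass.
    while let Some(batch) = collect_batch(&rx, 4, Duration::from_millis(2)) {
        println!("forward on batch of {}: {:?}", batch.len(), batch);
    }
}
```

In a real server each queued item would also carry a reply channel so results can be scattered back to their callers. The wait window trades a small amount of latency at low concurrency for fewer, larger forwards at high concurrency.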