Decision Log

How Narsil's DLRM serving got from "27× behind the C++ baseline" to "matches it." Each entry is a decision with its context and consequence.

D1 — Diagnose the gap before chasing it

Context. The first benchmark had Narsil's Burn CUDA DLRM at ~79 req/s vs the TorchRec C++ gRPC server at ~2148 req/s, and 13.0 ms vs 1.0 ms p50 at concurrency 1 — even at matched dtype.

Investigation. The gap was two independent problems, neither of them the transport:

Serialization. The benchmark ran one CUDA worker, so 16 concurrent gRPC streams queued behind a single ~13 ms service time.
Kernel-launch-bound forward. The Burn DLRM issued ~750 tiny kernels per forward (26 separate embedding gathers and ~700 elementwise/reduction ops for the 351 pairwise interactions), plus a per-request host sync. At ~17 µs per launch, ~750 launches account for the ~13 ms.

FP16 vs FP32 made no difference, which confirmed the limiter was launch count, not arithmetic width.
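
Both problems can be sanity-checked with back-of-envelope arithmetic. The sketch below is only a model, not measurement code: the function names are hypothetical, and it assumes the forward is purely launch-bound and that a single worker serves requests strictly one at a time. With the figures above it gives about 12.75 ms per forward and a ceiling of about 77 req/s, close to the ~79 req/s that was measured.

```rust
/// Forward latency in ms if the pass is purely launch-bound:
/// number of kernel launches times the per-launch cost.
fn launch_bound_ms(launches: u32, us_per_launch: f64) -> f64 {
    f64::from(launches) * us_per_launch / 1000.0
}

/// Throughput ceiling (req/s) of one worker that serves requests
/// strictly one at a time with the given service time.
fn serialized_req_per_s(service_ms: f64) -> f64 {
    1000.0 / service_ms
}

/// Completion time in ms for the last of `streams` requests that all
/// arrive together and queue behind a single worker.
fn last_in_queue_ms(streams: u32, service_ms: f64) -> f64 {
    f64::from(streams) * service_ms
}
```

Calling `last_in_queue_ms(16, 13.0)` shows why concurrency did not help: the sixteenth stream waits for fifteen full forwards before its own starts.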

Consequence. The fix had to be the operator graph and execution model, not a tonic→C++ port. This reframed everything downstream.

Takeaway

"Same dtype" was a red herring. Measure where the time actually goes before optimizing.

D2 — Why there are no "FBGEMM Rust bindings"

Context. TorchRec's speed comes from FBGEMM_GPU: table-batched quantized embeddings and a fused interaction. The natural question was whether to bind FBGEMM from Rust.

Finding. FBGEMM_GPU is not a library with a stable C ABI — it is a set of PyTorch custom operators registered into libtorch's dispatcher (fbgemm::*), taking at::Tensor. There is nothing to FFI to without the PyTorch C++ runtime. So "Rust FBGEMM bindings" necessarily means Rust → libtorch dispatcher → fbgemm op — which is exactly what tch-rs provides.

Consequence. Two real routes: (A) reach the ops through libtorch via tch-rs, or (C) reimplement the few inference-critical kernels in Burn (batched-gather embedding + a bmm interaction). cuTile was a partial option (B), but it is blocked on this host (needs CUDA 13.2 / newer toolkit support).

D3 — Choose Route A (native torch via tch-rs)

Context. Narsil's thesis is a Rust-native serving engine that runs native torch and custom Rust/cuTile kernels internally and exposes many transports externally. That makes tch-rs + libtorch + FBGEMM on-thesis, not a compromise.

Decision. Pursue Route A: load the existing TorchRec TorchScript artifact through tch-rs and run the real FBGEMM kernels behind Narsil's serving shell. Keep the Burn path for the pure-Rust story.

De-risking (before any code). Verified on the target box that:

fbgemm_gpu's inference op registers via plain torch.ops.load_library — no torchrec Python needed.
The TorchScript artifact loads in a torch-only process (exactly tch-rs's context).
The I/O contract is a Dict[str, Tensor] in / "default" tensor out.
fbgemm supports the L4's sm_89 — unlike cuTile, this path was not hardware-blocked.

D4 — Pin the toolchain to the one ABI-compatible pair

Context. The intended stack was torch 2.11 + fbgemm 1.6.0 + torchrec 1.6.0. Reality intervened.

What we hit.

tch ↔ libtorch is exact. tch 0.24 targets libtorch 2.11.0 precisely; tch 0.23 → 2.10.
torchrec 1.6.0 isn't on the cu128 index (only 1.5.0). So we tried downgrading to torch 2.10 + fbgemm 1.5.0.
That combination is ABI-broken. fbgemm 1.5.0 needs c10::MessageLogger(SourceLocation, int, bool), but torch 2.10's libc10 exports MessageLogger(char const*, int, int, bool) → undefined symbol at import. No amount of reinstalling fixes a symbol mismatch.
torchrec is pure Python (py3-none-any). It is not a runtime dependency of the engine — only a build-time tool for generating artifacts. So its version is flexible; the only hard constraint is fbgemm ↔ torch.
Installs also surfaced pip dependency-confusion hash errors (mixing --extra-index-url) and transient corruption on large wheels.

Decision. Settle on the only validated binary pair: torch 2.11.0 + fbgemm 1.6.0 (tch 0.24). Install torch + fbgemm from the cu128 index only. Park torchrec as a git submodule (third_party/torchrec) for artifact generation when fresh weights are needed.

Hard-won rules

The fbgemm↔torch ABI is the real constraint; torchrec is pure Python and flexible.
Install torch + fbgemm from the single cu128 index — never mix --extra-index-url for them.
CUDA libtorch builds retain libtorch_cuda.so at link time; startup dlopen is only for the fbgemm plugin libraries that register fbgemm::* ops.
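
One way to keep the first rule from being forgotten is a small guard that refuses any version pair other than the validated one. This is a hypothetical sketch, not code from Narsil: the names `VALIDATED_PAIR`, `base_version`, `is_validated_pair` and `libtorch_for_tch` are assumptions, and the version strings are simply the ones recorded in this entry.

```rust
/// The only (torch, fbgemm) binary pair this log records as validated.
const VALIDATED_PAIR: (&str, &str) = ("2.11.0", "1.6.0");

/// Drop a local build suffix such as "+cu128" from a version string.
fn base_version(v: &str) -> &str {
    v.split('+').next().unwrap_or(v)
}

/// True only for the validated torch/fbgemm pair, ignoring local suffixes.
fn is_validated_pair(torch: &str, fbgemm: &str) -> bool {
    (base_version(torch), base_version(fbgemm)) == VALIDATED_PAIR
}

/// libtorch version targeted by a tch release, as recorded in this entry.
/// Anything not listed here is treated as unknown.
fn libtorch_for_tch(tch: &str) -> Option<&'static str> {
    match tch {
        "0.24" => Some("2.11.0"),
        "0.23" => Some("2.10"),
        _ => None,
    }
}
```

A check like `is_validated_pair("2.10.0", "1.5.0")` returns false, which matches the ABI break described above: failing early with a clear message is cheaper than an undefined-symbol error at import.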

D5 — Generate a current-stack Route A artifact

Context. The original benchmark artifact (model_default.pt) was built on torch 2.5.1 / fbgemm 1.0.0 and was useful for serving latency/throughput validation, but it was no longer a reproducible current-stack artifact.

Decision. Route A now uses scripts/create_torchrec_dlrm_artifact.py to generate target/torchrec/dlrm_int8_seed0_torch211_fbgemm16.pt from the parked third_party/torchrec submodule, pinned at commit bb49de0 (authored 2026-05-22, carried by the weekly nightly tag v2026.05.25.00), on torch 2.11.0+cu128 and fbgemm-gpu 1.6.0+cu128.

Consequence. Artifact generation is deterministic for model weights/buffers (recorded by the sidecar state_dict_sha256) and validates the serving contract with a smoke forward. Determinism here covers the weights (what state_dict_sha256 records), not necessarily bit-identical forward outputs: with torch.use_deterministic_algorithms(..., warn_only=True), ops that lack a deterministic implementation run with a warning instead of raising an error. Fully trained Criteo weights still require real training and are out of scope for this serving-path hardening.

D6 — Validate the result

Serving the same INT8 artifact on an L4, Route A reaches ~1.02 ms p50 at concurrency 1 (on par with the C++ baseline's 1.033 ms, ~13× faster than Burn's 13.04 ms) and, with the batch collector, ~440k samples/s at concurrency 16 (~2× the C++ baseline via request coalescing). This closed the loop on D1: the gap was the kernel path, and the Rust serving shell adds negligible overhead.
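
The batch collector is only named here, not specified. As an illustration of request coalescing in general, the sketch below blocks for the first request and then keeps collecting until a batch is full or a short window has passed. It is not Narsil's implementation: the function name, the `max_batch` and `window` parameters, and the use of a std `mpsc` channel are all assumptions.

```rust
use std::sync::mpsc::Receiver;
use std::time::{Duration, Instant};

/// Block for the first request, then keep collecting until either
/// `max_batch` requests are gathered or `window` has elapsed since the
/// first one arrived. Returns an empty Vec once the channel is closed
/// and drained. `max_batch` is assumed to be at least 1.
fn collect_batch<T>(rx: &Receiver<T>, max_batch: usize, window: Duration) -> Vec<T> {
    let mut batch = Vec::with_capacity(max_batch);
    match rx.recv() {
        Ok(first) => batch.push(first),
        Err(_) => return batch, // all senders dropped, nothing left
    }
    let deadline = Instant::now() + window;
    while batch.len() < max_batch {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(req) => batch.push(req),
            Err(_) => break, // window elapsed or all senders dropped
        }
    }
    batch
}
```

The trade-off is the usual one: a longer window produces larger batches and fewer forwards, at the cost of added latency for the first request in each batch.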

See Benchmarks for full tables and caveats.

D7 — Make the Burn path launch-efficient too

Context. Route A proved the serving shell was not the bottleneck, but the pure-Rust path still needed to stand on its own. The old Burn DLRM emitted 26 per-table gathers and 351 pairwise multiply/reduce pairs per forward.

Decision. Keep the same Burn model contract and weights, but reshape the operator graph:

stack all embedding tables into one table and run one gather into [B, 26, D];
concatenate dense projection + sparse embeddings into [B, 27, D];
compute the DLRM interaction with one batched Gram matmul and one strict-triangle gather, preserving the previous pair order before the over-arch MLP.
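
The three steps can be sketched on plain vectors for a single sample. This is a CPU illustration of the indexing only, not the Burn code: all function names are hypothetical, and the pair order shown (row-major over the strict upper triangle) is an assumption, since the log does not state which order the original model used.

```rust
/// Row offsets of each table inside the stacked table: offsets[t] is the
/// first row of table t, so a global row id is offsets[t] + local_id.
fn table_offsets(rows_per_table: &[usize]) -> Vec<usize> {
    let mut offsets = Vec::with_capacity(rows_per_table.len());
    let mut next = 0;
    for &rows in rows_per_table {
        offsets.push(next);
        next += rows;
    }
    offsets
}

/// One gather for one sample: `stacked` is [total_rows][D] and the result
/// is [num_tables][D], one embedding row per table.
fn gather_stacked(stacked: &[Vec<f32>], offsets: &[usize], local_ids: &[usize]) -> Vec<Vec<f32>> {
    offsets
        .iter()
        .zip(local_ids)
        .map(|(&off, &id)| stacked[off + id].clone())
        .collect()
}

/// Flat indices into an n x n Gram matrix for the strict upper triangle,
/// in row-major pair order (0,1), (0,2), ..., (n-2,n-1). Assumed order.
fn strict_triangle_indices(n: usize) -> Vec<usize> {
    let mut idx = Vec::with_capacity(n * n.saturating_sub(1) / 2);
    for i in 0..n {
        for j in (i + 1)..n {
            idx.push(i * n + j);
        }
    }
    idx
}

/// Interaction for one sample: Gram matrix F * F^T (one matmul on a GPU),
/// followed by a single gather of the strict-triangle entries.
fn interact(features: &[Vec<f32>]) -> Vec<f32> {
    let n = features.len();
    let mut gram = vec![0.0f32; n * n];
    for i in 0..n {
        for j in 0..n {
            gram[i * n + j] = features[i]
                .iter()
                .zip(&features[j])
                .map(|(a, b)| a * b)
                .sum::<f32>();
        }
    }
    strict_triangle_indices(n).into_iter().map(|k| gram[k]).collect()
}
```

For the 27 feature vectors in this model, `strict_triangle_indices(27)` yields 351 indices, one per pairwise interaction, so the over-arch MLP sees the same number of interaction features as before.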

Consequence. The DLRM-specific launch-producing ops drop from roughly 728 (26 gathers plus 2 × 351 multiply/reduce ops) to 3. On the L4 benchmark, Burn FP32 p50 at concurrency 1 fell from 13.04 ms to 1.17 ms, and single-lane concurrency-16 throughput rose from 7.9k to 125.6k samples/s. Burn batch mode now has a Route-A-like steady p50, but still needs tail-latency work before its throughput matches Route A's coalesced path.
