Benchmarks

DLRM serving on an NVIDIA L4 (sm_89), batch size 100/request, 26 sparse features × 100k rows × dim 64, dense arch 512→256→64, over arch 512→512→256→1. Full methodology and raw numbers live in REPORTS.md; this page is the summary.
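As a rough orientation for the memory column below, the stated configuration fixes the size of the embedding weights. The sketch below is plain arithmetic on the numbers above; it counts embedding weights only and assumes one stored value per element, so it ignores MLP weights, activations, allocator and CUDA-context overhead, and any per-row scale/bias that a quantized format may add. The names are illustrative and not taken from the Narsil code.

```python
# Embedding-weight footprint implied by the benchmark configuration.
# Weights only: measured peak GPU memory also includes everything else
# the process allocates, so these figures are a lower bound, not a prediction.
NUM_TABLES = 26          # sparse features
ROWS_PER_TABLE = 100_000
EMBEDDING_DIM = 64

BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "int8": 1}


def embedding_mib(precision: str) -> float:
    """Return the embedding-table size in MiB for the given precision."""
    elements = NUM_TABLES * ROWS_PER_TABLE * EMBEDDING_DIM
    return elements * BYTES_PER_ELEMENT[precision] / 2**20


for precision in BYTES_PER_ELEMENT:
    print(f"{precision}: {embedding_mib(precision):.1f} MiB")
# fp32: 634.8 MiB
# fp16: 317.4 MiB
# int8: 158.7 MiB
```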

Main results (concurrency 16)

Server	Req/s	Samples/s	p50 ms	p99 ms	Peak GPU MiB
TorchRec C++ gRPC + TorchScript INT8 (documented baseline)	2148	214,821	7.23	13.27	780
Narsil route A (libtorch/FBGEMM), batch mode	4665	466,540	3.27	4.88	478
Narsil route A (libtorch/FBGEMM), worker mode (1 lane)	1490	149,006	10.69	11.16	454
Narsil Burn CUDA DLRM FP32, batched ops, worker mode	1256	125,563	12.62	14.50	1658
Narsil Burn CUDA DLRM FP16, batched ops, worker mode	1181	118,105	13.64	14.52	602
Narsil Burn CUDA DLRM FP32, batched ops, batch mode	296	29,620	3.34	5.08	1754

Low-concurrency latency (concurrency 1)

Server	Req/s	Samples/s	p50 ms
TorchRec C++ gRPC + TorchScript INT8 (documented baseline)	967	96,717	1.033
Narsil route A (libtorch/FBGEMM), worker mode	763	76,261	1.329
Narsil Burn CUDA DLRM FP32, batched ops	839	83,933	1.173
Narsil Burn CUDA DLRM FP16, batched ops	805	80,470	1.268
Narsil Burn CUDA DLRM FP32 (before batched ops)	77	7,655	13.04
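Both tables can be sanity-checked with Little's law: in a closed-loop load test with a fixed number of in-flight requests, throughput is roughly concurrency divided by mean latency. The sketch below uses p50 as a stand-in for the mean, which is an assumption; the estimate drifts from the measured value when the latency distribution has a heavy tail. The rows are copied from the tables above.

```python
def estimated_req_per_s(concurrency: int, latency_ms: float) -> float:
    """Little's law for a closed loop: throughput = in-flight / latency."""
    return concurrency / (latency_ms / 1000.0)


# (label, concurrency, p50 ms, measured req/s) copied from the tables above
rows = [
    ("TorchRec baseline, c=16", 16, 7.23, 2148),
    ("Route A batch mode, c=16", 16, 3.27, 4665),
    ("Route A worker mode, c=16", 16, 10.69, 1490),
    ("Burn FP32 batch mode, c=16", 16, 3.34, 296),
    ("TorchRec baseline, c=1", 1, 1.033, 967),
    ("Burn FP32 batched ops, c=1", 1, 1.173, 839),
]

for label, concurrency, p50_ms, measured in rows:
    estimate = estimated_req_per_s(concurrency, p50_ms)
    ratio = measured / estimate
    print(f"{label}: estimated {estimate:.0f} req/s, measured {measured} ({ratio:.0%})")
```

Most rows land close to the estimate. The Burn FP32 batch-mode row is the exception: its p50 implies several thousand requests per second, far above the measured 296, which suggests that the median hides a small number of very slow requests.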

Reading the numbers

Burn-native single-request latency is no longer a problem. Replacing 26 per-table embedding gathers and 351 pairwise multiply/reduce operations with one table-batched gather plus one bmm interaction dropped Burn FP32 p50 at concurrency 1 from 13.04 ms to 1.17 ms, roughly an 11× improvement.
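To illustrate why a single bmm can replace the per-pair work, the sketch below computes the interactions for one sample both ways in plain Python and checks that they agree. It is not the Burn implementation. The count 351 equals C(27, 2), which is consistent with 26 embeddings plus the 64-wide dense-arch output; that reading is an assumption, as are the function names.

```python
import random
from itertools import combinations

NUM_FEATURES = 27  # assumed: 26 sparse embeddings + 1 dense-arch output
DIM = 64


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def pairwise_interactions(x):
    """One multiply/reduce per feature pair (the per-pair formulation)."""
    return [dot(x[i], x[j]) for i, j in combinations(range(len(x)), 2)]


def gram_interactions(x):
    """Full Gram matrix X @ X^T, then keep the strict upper triangle.

    On a GPU the Gram matrix for the whole batch is one batched matmul.
    """
    n = len(x)
    gram = [[dot(x[i], x[j]) for j in range(n)] for i in range(n)]
    return [gram[i][j] for i in range(n) for j in range(i + 1, n)]


random.seed(0)
x = [[random.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(NUM_FEATURES)]

pairs = pairwise_interactions(x)
print(len(pairs))                      # 351
print(pairs == gram_interactions(x))   # True
```

The Gram-matrix form does about twice the arithmetic, because it also computes the lower triangle and the diagonal, but it is a single launch instead of hundreds of small ones, which is what matters for single-request latency.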
Continuous batching wins on throughput. In batch mode Narsil coalesces the 16 in-flight requests into one fused forward of about 1600 rows. Route A reaches ~466k samples/s; Burn batch mode has a similar steady-state p50 (3.34 ms versus 3.27 ms) but much lower throughput (~30k samples/s) because of scheduling outliers in the extreme tail.
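The coalescing step can be pictured with a minimal collector, sketched here in Python. It is not Narsil's collector: the function names and the exact timeout semantics (the timer starts at the first request) are assumptions. The two parameters correspond to the batch size of 16 and the 5 ms timeout used in the Reproduce commands.

```python
import queue
import time


def collect_batch(requests: queue.Queue, max_requests: int, timeout_ms: float) -> list:
    """Block for the first request, then gather more until the batch is
    full or the timeout (measured from the first request) expires."""
    batch = [requests.get()]
    deadline = time.monotonic() + timeout_ms / 1000.0
    while len(batch) < max_requests:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


def coalesce(batch):
    """Concatenate per-request rows into one forward input and record
    the offsets needed to split the outputs back per request."""
    rows, offsets = [], [0]
    for request_rows in batch:
        rows.extend(request_rows)
        offsets.append(len(rows))
    return rows, offsets


pending = queue.Queue()
for request_id in range(16):
    # Each request carries 100 samples, as in the benchmark.
    pending.put([(request_id, sample) for sample in range(100)])

batch = collect_batch(pending, max_requests=16, timeout_ms=5)
rows, offsets = coalesce(batch)
print(len(batch), len(rows))
```

When all 16 requests are already queued, the batch fills without waiting and the fused input has 16 × 100 = 1600 rows. With fewer requests in flight the collector returns a partial batch once the timeout expires, which bounds the latency added by batching.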
Even a single Burn lane is competitive now. Burn FP32 worker mode at concurrency 16 reaches 125k samples/s, up from 7.9k before the batched-ops rewrite and about 84% of this session's route-A worker throughput.
FP16 mainly reduces Burn memory (602 MiB in worker mode versus FP32's 1658 MiB), but it is slightly slower on this shape.

Caveats

The C++ TorchRec baseline was not re-run in the latest pass: its server links against libtorch 2.5.1, whose runtime was removed during the toolchain migration, and the existing binary is ABI-incompatible with libtorch 2.11. Baseline rows are the documented 2026-05-27 measurements on the same L4. Route A reproduction now uses the current-stack INT8 artifact generated by scripts/create_torchrec_dlrm_artifact.py.
The generated Route A artifact has deterministically initialized weights and the official TorchRec INT8 inference shape. It is self-consistent for serving throughput and latency; trained Criteo quality remains out of scope. The Burn FP32 row uses a different artifact (narsil_dlrm_default.bin, FP32) and is shown for historical context only.
The wire sends sparse ids in sample-major layout; the backend reorders them into TorchRec's key-major KeyedJaggedTensor layout before the forward, so the gathered rows are correct (accuracy-faithful) for both single and coalesced inference. This reorder is a cheap host-side step and does not affect throughput/latency.
The batch-mode throughput edge is a serving-architecture difference (continuous batching) and should be attributed to Narsil's collector rather than to the kernels.
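The sample-major to key-major reorder described in the wire-layout caveat above can be sketched as follows. A KeyedJaggedTensor stores a flat values array and a lengths array, both ordered by key first and by sample second. The input structure and function name here are hypothetical and do not reflect Narsil's actual wire format; the example builds plain Python lists and does not call TorchRec.

```python
def sample_major_to_key_major(samples, keys):
    """Reorder per-sample sparse ids into key-major values and lengths.

    samples: list of dicts, one per sample, mapping key -> list of ids
             (hypothetical decoded form of a sample-major request).
    keys:    feature names in the order the model expects.
    """
    values, lengths = [], []
    for key in keys:              # outer loop over features (key-major)
        for sample in samples:    # inner loop over samples in the batch
            ids = sample[key]
            values.extend(ids)
            lengths.append(len(ids))
    return values, lengths


keys = ["f0", "f1", "f2"]
samples = [
    {"f0": [10], "f1": [20, 21], "f2": [30]},
    {"f0": [11], "f1": [], "f2": [31, 32]},
]

values, lengths = sample_major_to_key_major(samples, keys)
print(values)   # [10, 11, 20, 21, 30, 31, 32]
print(lengths)  # [1, 1, 2, 0, 1, 2]
```

For coalesced inference the same function applies after the samples of all requests in the batch have been concatenated in request order, so each request's rows stay contiguous within every key.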

Reproduce

# Build the Burn CUDA engine
cargo build --release --features cuda

# Generate the Route A artifact under target/torchrec/
python scripts/create_torchrec_dlrm_artifact.py

# Burn worker-mode latency/throughput
python benchmarks/torchrec_dlrm_compare.py \
  --narsil-backend burn --narsil-precision fp32 --skip-torchrec \
  --narsil-execution-mode worker \
  --requests 1000 --warmup 50 --concurrency 16 --batch-size 100

# Build the torch-enabled engine against an installed libtorch 2.11
LIBTORCH_USE_PYTORCH=1 cargo build --release --features "cuda torch"

# Route-A batch mode (main run, concurrency 16)
python benchmarks/torchrec_dlrm_compare.py \
  --narsil-backend torch --skip-torchrec \
  --narsil-execution-mode batch --narsil-batch-size 16 --narsil-batch-timeout-ms 5 \
  --requests 1000 --warmup 50 --concurrency 16 --batch-size 100

# Low-concurrency latency (concurrency 1, worker mode)
python benchmarks/torchrec_dlrm_compare.py \
  --narsil-backend torch --skip-torchrec \
  --narsil-execution-mode worker --requests 200 --warmup 20 --concurrency 1 --batch-size 100
