Benchmarks
DLRM serving on an NVIDIA L4 (sm_89), batch size 100/request, 26 sparse features × 100k rows × dim 64, dense arch 512→256→64, over arch 512→512→256→1. Full methodology and raw numbers live in REPORTS.md; this page is the summary.
Main results (concurrency 16)
| Server | Req/s | Samples/s | p50 ms | p99 ms | Peak GPU MiB |
|---|---|---|---|---|---|
| TorchRec C++ gRPC + TorchScript INT8 (documented baseline) | 2148 | 214,821 | 7.23 | 13.27 | 780 |
| Narsil route A (libtorch/FBGEMM), batch mode | 4377 | 437,736 | 3.34 | 14.12 | 478 |
| Narsil route A (libtorch/FBGEMM), worker mode (1 lane) | 1412 | 141,160 | 11.40 | 11.83 | 454 |
| Narsil Burn CUDA DLRM FP32 (documented) | 79 | 7,935 | 201.3 | 211.0 | 954 |
Low-concurrency latency (concurrency 1)
| Server | Req/s | Samples/s | p50 ms |
|---|---|---|---|
| TorchRec C++ gRPC + TorchScript INT8 (documented) | 967 | 96,717 | 1.033 |
| Narsil route A (libtorch/FBGEMM), worker mode | 975 | 97,475 | 1.024 |
| Narsil Burn CUDA DLRM FP32 (documented) | 77 | 7,655 | 13.04 |
Reading the numbers
- Single-request latency is solved. Route A's ~1.02 ms p50 is on par with the C++ baseline (1.033 ms) and ~13× faster than the Burn path (13.04 ms). Same FBGEMM kernels → same latency; the Rust tonic shell adds negligible overhead. This confirms the original diagnosis that the gap was the kernel path, not the transport. (Runs repeated on 2026-05-29 and 2026-05-30 show this latency is stable.)
- Continuous batching wins on throughput. In batch mode Narsil coalesces the 16 in-flight requests (100 rows each) into one ~1600-row fused forward, reaching ~2× the documented C++ baseline's throughput. The C++ sync server processes requests independently and does not coalesce. (Batch throughput is load-dependent: ~434k–515k samples/s across sessions, consistently 2× the baseline or more.)
- Even a single lane is competitive. Worker mode at concurrency 16 (no coalescing) reaches ~66% of the documented C++ baseline's throughput: a single Rust execution lane holds up once the kernels are FBGEMM.
- Lower peak GPU footprint (454–478 MiB) than both documented baselines (780 MiB for TorchRec C++, 954 MiB for Burn FP32).
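The ratios quoted in the list above can be rechecked directly from the two tables. The following is a minimal sketch that only re-derives them from the published figures; the dictionary keys are illustrative labels, not identifiers from the benchmark script.

```python
# Figures copied from the two tables on this page (samples/s at concurrency 16,
# p50 latency in ms at concurrency 1). Key names are illustrative only.
C16_SAMPLES_PER_S = {
    "torchrec_cpp": 214_821,
    "route_a_batch": 437_736,
    "route_a_worker": 141_160,
}
C1_P50_MS = {"route_a_worker": 1.024, "burn_fp32": 13.04}

baseline = C16_SAMPLES_PER_S["torchrec_cpp"]

# Batch mode vs. the documented C++ baseline (the "~2x" claim).
batch_vs_baseline = C16_SAMPLES_PER_S["route_a_batch"] / baseline

# Worker mode (single lane) vs. the same baseline (the "~66%" claim).
worker_vs_baseline = C16_SAMPLES_PER_S["route_a_worker"] / baseline

# Burn FP32 p50 vs. route A p50 at concurrency 1 (the "~13x" claim).
burn_vs_route_a = C1_P50_MS["burn_fp32"] / C1_P50_MS["route_a_worker"]

# Rows in one fully coalesced forward: 16 in-flight requests x 100 rows each.
coalesced_rows = 16 * 100

print(f"batch mode vs baseline throughput: {batch_vs_baseline:.2f}x")
print(f"worker mode vs baseline throughput: {worker_vs_baseline:.1%}")
print(f"Burn p50 vs route A p50: {burn_vs_route_a:.1f}x")
print(f"rows per coalesced forward: {coalesced_rows}")
```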
Caveats
- The C++ TorchRec baseline was not re-run in the latest pass: its server links libtorch 2.5.1, whose runtime was removed during the toolchain migration, and is ABI-incompatible with libtorch 2.11. Baseline rows are the documented 2026-05-27 measurements on the same L4 and the same model_default.pt; route A rows were measured 2026-05-30 on that machine.
- The route A and TorchRec C++ rows serve the same INT8 artifact (model_default.pt) with the same random, untrained weights, so that pair compares serving throughput/latency for an identical model, not accuracy. The Burn FP32 row is a different artifact (narsil_dlrm_default.bin, FP32), shown for historical context only.
- The wire sends sparse ids in sample-major layout; the backend reorders them into TorchRec's key-major KeyedJaggedTensor layout before the forward, so the gathered rows are correct (accuracy-faithful) for both single and coalesced inference. This reorder is a cheap host-side step and does not affect throughput/latency.
- The batch-mode throughput edge is a serving-architecture difference (continuous batching), fairly attributed to Narsil's collector rather than to the kernels.
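The sample-major to key-major reorder mentioned above can be illustrated with plain Python lists. This is a sketch under assumptions: the nested-list wire shape (`batch[s][k]` holds the ids of sample `s` for feature `k`) and the function name are hypothetical, and the real backend performs the reorder in Rust. The returned `values` and `lengths` follow the ordering that a key-major KeyedJaggedTensor expects: all samples of feature 0, then all samples of feature 1, and so on.

```python
from typing import List, Tuple


def sample_major_to_key_major(
    batch: List[List[List[int]]],
) -> Tuple[List[int], List[int]]:
    """Reorder sparse ids from sample-major to key-major.

    batch[s][k] is the id list for sample s and feature k (assumed wire shape).
    Returns (values, lengths), both ordered feature by feature, and within
    each feature sample by sample.
    """
    if not batch:
        return [], []
    num_keys = len(batch[0])
    values: List[int] = []
    lengths: List[int] = []
    for k in range(num_keys):      # outer loop over features (keys)
        for sample in batch:       # inner loop over samples
            ids = sample[k]
            values.extend(ids)
            lengths.append(len(ids))
    return values, lengths


# Two samples, two features; feature 1 is jagged (2 ids, then 1 id).
request = [
    [[10], [20, 21]],  # sample 0
    [[11], [22]],      # sample 1
]
values, lengths = sample_major_to_key_major(request)
print(values)   # ids grouped by feature
print(lengths)  # one length per (feature, sample) pair
```

Coalescing fits the same shape: concatenating the sample lists of several requests before calling the function yields one key-major batch covering all of them.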
Reproduce
```bash
# Build the torch-enabled engine against an installed libtorch 2.11
LIBTORCH_USE_PYTORCH=1 cargo build --release --features torch

# Main run (concurrency 16, batch mode)
python benchmarks/torchrec_dlrm_compare.py \
  --narsil-backend torch --skip-torchrec \
  --narsil-torch-model /path/to/model_default.pt \
  --narsil-execution-mode batch --narsil-batch-size 16 --narsil-batch-timeout-ms 5 \
  --requests 1000 --warmup 50 --concurrency 16 --batch-size 100

# Low-concurrency latency (concurrency 1, worker mode)
python benchmarks/torchrec_dlrm_compare.py \
  --narsil-backend torch --skip-torchrec \
  --narsil-torch-model /path/to/model_default.pt \
  --narsil-execution-mode worker --requests 200 --warmup 20 --concurrency 1 --batch-size 100
```
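The batch-mode flags in the main run (`--narsil-batch-size 16`, `--narsil-batch-timeout-ms 5`) suggest a size-or-timeout collection policy. The sketch below shows that general technique in Python; it is not Narsil's collector (which is written in Rust), and the function names and the reading of the two flags as "at most 16 requests, at most a 5 ms wait" are assumptions made for illustration.

```python
import queue
import time
from typing import List, Tuple

Request = List[List[float]]  # one request = a list of rows (hypothetical shape)


def collect_batch(q: "queue.Queue[Request]", max_requests: int,
                  timeout_ms: float) -> List[Request]:
    """Wait for one request, then gather more until the cap or the deadline."""
    batch = [q.get()]  # block until there is at least one request
    deadline = time.monotonic() + timeout_ms / 1000.0
    while len(batch) < max_requests:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # collection window elapsed
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived inside the window
    return batch


def coalesce(batch: List[Request]) -> Tuple[Request, List[int]]:
    """Concatenate rows for one fused forward; keep sizes to split the output."""
    rows = [row for request in batch for row in request]
    sizes = [len(request) for request in batch]
    return rows, sizes


# Demo: 16 queued requests of 100 rows each, as in the main benchmark run.
q: "queue.Queue[Request]" = queue.Queue()
for _ in range(16):
    q.put([[0.0] for _ in range(100)])

rows, sizes = coalesce(collect_batch(q, max_requests=16, timeout_ms=5))
print(len(sizes), "requests coalesced into", len(rows), "rows")
```

Keeping the per-request sizes alongside the fused rows is what lets the server slice the model output back into individual responses after the forward pass.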