
Benchmarks

DLRM serving on an NVIDIA L4 (sm_89), batch size 100/request, 26 sparse features × 100k rows × dim 64, dense arch 512→256→64, over arch 512→512→256→1. Full methodology and raw numbers live in REPORTS.md; this page is the summary.

Main results (concurrency 16)

| Server | Req/s | Samples/s | p50 ms | p99 ms | Peak GPU MiB |
| --- | --- | --- | --- | --- | --- |
| TorchRec C++ gRPC + TorchScript INT8 (documented baseline) | 2148 | 214,821 | 7.23 | 13.27 | 780 |
| Narsil route A (libtorch/FBGEMM), batch mode | 4377 | 437,736 | 3.34 | 14.12 | 478 |
| Narsil route A (libtorch/FBGEMM), worker mode (1 lane) | 1412 | 141,160 | 11.40 | 11.83 | 454 |
| Narsil Burn CUDA DLRM FP32 (documented) | 79 | 7,935 | 201.3 | 211.0 | 954 |

Low-concurrency latency (concurrency 1)

| Server | Req/s | Samples/s | p50 ms |
| --- | --- | --- | --- |
| TorchRec C++ gRPC + TorchScript INT8 (documented) | 967 | 96,717 | 1.033 |
| Narsil route A (libtorch/FBGEMM), worker mode | 975 | 97,475 | 1.024 |
| Narsil Burn CUDA DLRM FP32 (documented) | 77 | 7,655 | 13.04 |
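
The ratios quoted under "Reading the numbers" can be recomputed from the two tables. The short check below uses only values copied from those tables; the variable names are illustrative and do not come from the benchmark script.

```python
# Requests/s at concurrency 16 and p50 latency (ms) at concurrency 1,
# copied from the two tables on this page.
baseline_rps_c16 = 2148         # TorchRec C++ baseline
route_a_batch_rps_c16 = 4377    # Narsil route A, batch mode
route_a_worker_rps_c16 = 1412   # Narsil route A, worker mode (1 lane)
route_a_p50_ms_c1 = 1.024       # Narsil route A, worker mode
burn_p50_ms_c1 = 13.04          # Narsil Burn CUDA FP32

batch_vs_baseline = route_a_batch_rps_c16 / baseline_rps_c16
worker_vs_baseline = route_a_worker_rps_c16 / baseline_rps_c16
burn_vs_route_a = burn_p50_ms_c1 / route_a_p50_ms_c1

print(f"batch mode vs baseline throughput:  {batch_vs_baseline:.2f}x")
print(f"worker mode vs baseline throughput: {worker_vs_baseline:.0%}")
print(f"Burn p50 vs route A p50:            {burn_vs_route_a:.1f}x")
```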

Reading the numbers

  • Single-request latency is at parity. Route A's ~1.02 ms p50 matches the C++ baseline (~1.03 ms) and is ~13× faster than the Burn path (13.04 ms). Same FBGEMM kernels → same latency; the Rust tonic shell adds negligible overhead. This confirms the original diagnosis that the gap was the kernel path, not the transport. (Repeated on 2026-05-29 and 2026-05-30; the latency is stable across both runs.)
  • Continuous batching wins on throughput. In batch mode Narsil coalesces the 16 in-flight requests into one ~1600-row fused forward pass, reaching ~2× the throughput of the documented C++ baseline. The C++ sync server processes requests independently and does not coalesce. (Batch throughput is load-dependent: ~434k–515k samples/s across sessions, consistently at least ~2× the baseline.)
  • Even a single lane is competitive. Worker mode at concurrency 16 (no coalescing) reaches ~66% of the C++ throughput: a single Rust execution lane holds up once the kernels are FBGEMM.
  • Lower GPU footprint (454–478 MiB) than both documented baselines.
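
The coalescing step described above can be sketched as follows. This is a minimal illustration in Python, not Narsil's collector (which is part of the Rust engine and is not shown on this page). The function names are hypothetical, and the defaults mirror `--narsil-batch-size 16` and `--narsil-batch-timeout-ms 5` from the Reproduce commands, assuming the batch size counts requests rather than rows.

```python
import queue
import time

def collect_batch(requests, max_requests=16, timeout_s=0.005):
    """Take up to max_requests items, waiting at most timeout_s after the first."""
    batch = [requests.get()]  # block until at least one request is available
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_requests:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

def coalesce(batch):
    """Concatenate per-request rows into one input and record each request's size."""
    rows, sizes = [], []
    for request_rows in batch:
        rows.extend(request_rows)
        sizes.append(len(request_rows))
    return rows, sizes

def split_outputs(outputs, sizes):
    """Slice the fused output back into one chunk per original request."""
    chunks, start = [], 0
    for n in sizes:
        chunks.append(outputs[start:start + n])
        start += n
    return chunks

# Demo: 16 in-flight requests of 100 rows each become one 1600-row input.
pending = queue.Queue()
for request_id in range(16):
    pending.put([(request_id, row) for row in range(100)])

batch = collect_batch(pending)
rows, sizes = coalesce(batch)
fake_outputs = list(range(len(rows)))  # stand-in for the model's per-row scores
per_request = split_outputs(fake_outputs, sizes)
print(len(rows), len(per_request))
```

The timeout bounds how long the first request in a batch waits for company, which is the usual trade-off between throughput and tail latency in this kind of collector.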

Caveats

  • The C++ TorchRec baseline was not re-run in the latest pass. Its server links against libtorch 2.5.1, whose runtime was removed during the toolchain migration, and that build is ABI-incompatible with libtorch 2.11. Baseline rows are the documented 2026-05-27 measurements on the same L4 and the same model_default.pt; route A rows were measured 2026-05-30 on that machine.
  • The route A and TorchRec C++ rows serve the same INT8 artifact (model_default.pt) with the same random, untrained weights, so that pair compares serving throughput and latency for an identical model, not accuracy. The Burn FP32 row is a different artifact (narsil_dlrm_default.bin, FP32), shown for historical context only.
  • The wire sends sparse ids in sample-major layout; the backend reorders them into TorchRec's key-major KeyedJaggedTensor layout before the forward, so the gathered rows are correct (accuracy-faithful) for both single and coalesced inference. This reorder is a cheap host-side step and does not affect throughput/latency.
  • The batch-mode throughput edge is a serving-architecture difference (continuous batching) and should be attributed to Narsil's collector rather than to the kernels.
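
The sample-major to key-major reorder mentioned in the caveats can be illustrated with plain Python lists. This is a sketch under assumptions: the wire format is modelled here as `wire[sample][feature] -> list of ids`, which may not match the actual message layout, and the function name is hypothetical. The output follows the key-major convention of TorchRec's KeyedJaggedTensor, where values and lengths are grouped by feature key first and by sample second.

```python
def sample_major_to_key_major(wire, num_features):
    """Reorder wire[sample][feature] id lists into key-major values and lengths."""
    values, lengths = [], []
    for feature in range(num_features):   # outer loop: feature key
        for sample in wire:               # inner loop: sample within the batch
            ids = sample[feature]
            values.extend(ids)
            lengths.append(len(ids))
    return values, lengths

# Two samples, three sparse features; feature 1 is empty for sample 1.
wire = [
    [[10], [20, 21], [30]],      # sample 0
    [[11], [],       [31, 32]],  # sample 1
]
values, lengths = sample_major_to_key_major(wire, num_features=3)
print(values)   # [10, 11, 20, 21, 30, 31, 32]
print(lengths)  # [1, 1, 2, 0, 1, 2]
```

Coalesced inference fits the same scheme: concatenating the sample lists of several requests before the reorder yields one key-major batch whose rows stay in request order within each key.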

Reproduce

```bash
# Build the torch-enabled engine against an installed libtorch 2.11
LIBTORCH_USE_PYTORCH=1 cargo build --release --features torch

# Main run (concurrency 16, batch mode)
python benchmarks/torchrec_dlrm_compare.py \
  --narsil-backend torch --skip-torchrec \
  --narsil-torch-model /path/to/model_default.pt \
  --narsil-execution-mode batch --narsil-batch-size 16 --narsil-batch-timeout-ms 5 \
  --requests 1000 --warmup 50 --concurrency 16 --batch-size 100

# Low-concurrency latency (concurrency 1, worker mode)
python benchmarks/torchrec_dlrm_compare.py \
  --narsil-backend torch --skip-torchrec \
  --narsil-torch-model /path/to/model_default.pt \
  --narsil-execution-mode worker --requests 200 --warmup 20 --concurrency 1 --batch-size 100
```

Apache-2.0 licensed.