Backend-agnostic core
A small InferenceBackend trait sits behind tonic gRPC, a worker pool, and a continuous-batching collector. Backends plug in without touching the serving shell.
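As a rough illustration of that layering, the sketch below puts a toy backend behind a trait and drives it from a small worker pool using only the standard library. The trait methods, `SumBackend`, and `run_pool` are hypothetical names invented for this example. Narsil's actual `InferenceBackend` signature is not shown in this document and may differ.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical shape of a backend trait (an assumption, not Narsil's real API).
pub trait InferenceBackend: Send + Sync {
    fn name(&self) -> &str;
    /// Scores a batch of samples and returns one output per input row.
    fn infer(&self, batch: &[Vec<f32>]) -> Result<Vec<f32>, String>;
}

/// Toy backend standing in for a real compute path: sums each row's features.
pub struct SumBackend;

impl InferenceBackend for SumBackend {
    fn name(&self) -> &str {
        "sum"
    }
    fn infer(&self, batch: &[Vec<f32>]) -> Result<Vec<f32>, String> {
        Ok(batch.iter().map(|row| row.iter().sum()).collect())
    }
}

/// Runs every job on a fixed pool of worker threads and returns results in job order.
/// The pool only sees the trait object, so swapping backends needs no changes here.
pub fn run_pool(
    backend: Arc<dyn InferenceBackend>,
    jobs: Vec<Vec<Vec<f32>>>,
    workers: usize,
) -> Vec<Result<Vec<f32>, String>> {
    let n = jobs.len();
    let (job_tx, job_rx) = mpsc::channel::<(usize, Vec<Vec<f32>>)>();
    let job_rx = Arc::new(Mutex::new(job_rx));
    let (res_tx, res_rx) = mpsc::channel();

    let mut handles = Vec::new();
    for _ in 0..workers.max(1) {
        let backend = Arc::clone(&backend);
        let job_rx = Arc::clone(&job_rx);
        let res_tx = res_tx.clone();
        handles.push(thread::spawn(move || loop {
            // Hold the lock only while receiving, not while running inference.
            let job = job_rx.lock().unwrap().recv();
            match job {
                Ok((idx, batch)) => {
                    let _ = res_tx.send((idx, backend.infer(&batch)));
                }
                Err(_) => break, // job channel closed: no more work
            }
        }));
    }
    drop(res_tx); // only the workers hold result senders now

    for (idx, batch) in jobs.into_iter().enumerate() {
        job_tx.send((idx, batch)).unwrap();
    }
    drop(job_tx); // signals the workers to exit once the queue drains

    let mut results: Vec<Option<Result<Vec<f32>, String>>> = (0..n).map(|_| None).collect();
    for (idx, out) in res_rx {
        results[idx] = Some(out);
    }
    for h in handles {
        h.join().unwrap();
    }
    results.into_iter().map(|r| r.unwrap()).collect()
}
```

Calling `run_pool(Arc::new(SumBackend), jobs, 2)` with any other type that implements the trait works the same way, which is the property the serving shell relies on.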
One Rust serving shell — Burn-native kernels and native torch (libtorch + FBGEMM) underneath, many transports on top.
Narsil is a Rust inference serving engine. Its thesis is to be the best-in-class serving shell — owning the transport, scheduling, and batching in safe Rust — while staying pluggable about the compute underneath:
native torch (tch-rs → libtorch + FBGEMM) and custom Rust/cuTile kernels (Burn). The current reference workload is DLRM (Deep Learning Recommendation Model), benchmarked against the official TorchRec C++ gRPC server.
Serving the same INT8 DLRM artifact on an NVIDIA L4:
| Path | p50 latency (concurrency 1) | Throughput (concurrency 16, samples/s) |
|---|---|---|
| Narsil route A (libtorch/FBGEMM) | 1.02 ms | ~438,000 |
| TorchRec C++ gRPC (documented baseline) | 1.033 ms | ~215,000 |
| Narsil Burn CUDA DLRM (FP32) | 13.04 ms | ~7,900 |
The concurrency-1 p50 is measured in worker mode and the concurrency-16 throughput in batch mode. Route A matches the C++ baseline's single-request latency (1.02 ms vs 1.033 ms) and roughly doubles its throughput (~438,000 vs ~215,000 samples/s) through continuous batching. See Benchmarks for the per-mode tables and caveats, and the Decision Log for how we got here.
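To make the batching idea concrete, here is a minimal sketch of a collector that gathers concurrent requests into one backend call. It is not Narsil's collector: `Request`, `collect_batch`, `run_batch`, and the `max_batch` / `max_wait` knobs are hypothetical names chosen for illustration, and the real scheduling policy may differ. The sketch assumes a simple "fill up to a size cap or until a short deadline" rule.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical request type: one sample's features plus a channel for the reply.
pub struct Request {
    pub features: Vec<f32>,
    pub reply: mpsc::Sender<f32>,
}

/// Gathers up to `max_batch` requests, waiting at most `max_wait` after the first one.
/// Returns an empty batch only when every sender has been dropped.
pub fn collect_batch(rx: &Receiver<Request>, max_batch: usize, max_wait: Duration) -> Vec<Request> {
    let mut batch = Vec::with_capacity(max_batch);
    // Block for the first request so an idle collector does not spin.
    match rx.recv() {
        Ok(req) => batch.push(req),
        Err(_) => return batch,
    }
    let deadline = Instant::now() + max_wait;
    while batch.len() < max_batch {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(req) => batch.push(req),
            // Deadline hit or senders gone: ship what we have.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    batch
}

/// Runs one model call for the whole batch and fans each output back to its caller.
/// Returns the number of samples in the batch.
pub fn run_batch<F>(batch: Vec<Request>, model: F) -> usize
where
    F: Fn(&[Vec<f32>]) -> Vec<f32>,
{
    let (inputs, replies): (Vec<Vec<f32>>, Vec<mpsc::Sender<f32>>) =
        batch.into_iter().map(|r| (r.features, r.reply)).unzip();
    let outputs = model(&inputs);
    for (reply, out) in replies.iter().zip(outputs) {
        let _ = reply.send(out); // the caller may have gone away; ignore that
    }
    inputs.len()
}
```

A serving loop would alternate `collect_batch` and `run_batch`. Under load each model call then covers many requests, which is where the throughput gain comes from, while `max_wait` bounds the latency added to a lone request.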