Narsil

Backend-agnostic core

A small InferenceBackend trait sits behind tonic gRPC, a worker pool, and a continuous-batching collector. Backends plug in without touching the serving shell.
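As a rough illustration of what "plugging in without touching the serving shell" can look like, here is a minimal sketch. The page does not show the actual `InferenceBackend` definition, so the method names (`name`, `infer_batch`), the `Request`, `Prediction`, and `BackendError` types, and the toy `MeanBackend` are all hypothetical stand-ins.

```rust
use std::fmt;

/// Hypothetical request type: one sample's dense features.
#[derive(Debug, Clone, PartialEq)]
pub struct Request {
    pub dense: Vec<f32>,
}

/// Hypothetical response type: one score per sample.
#[derive(Debug, Clone, PartialEq)]
pub struct Prediction {
    pub score: f32,
}

/// Hypothetical error type shared by all backends.
#[derive(Debug)]
pub struct BackendError(pub String);

impl fmt::Display for BackendError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "backend error: {}", self.0)
    }
}

impl std::error::Error for BackendError {}

/// Assumed shape of the trait: a backend exposes one batched forward pass.
/// `Send + Sync` lets a worker pool share it across threads.
pub trait InferenceBackend: Send + Sync {
    fn name(&self) -> &str;
    fn infer_batch(&self, batch: &[Request]) -> Result<Vec<Prediction>, BackendError>;
}

/// Toy backend for illustration only: the score is the mean of the features.
pub struct MeanBackend;

impl InferenceBackend for MeanBackend {
    fn name(&self) -> &str {
        "mean"
    }

    fn infer_batch(&self, batch: &[Request]) -> Result<Vec<Prediction>, BackendError> {
        batch
            .iter()
            .map(|r| {
                if r.dense.is_empty() {
                    return Err(BackendError("empty feature vector".into()));
                }
                let sum: f32 = r.dense.iter().sum();
                Ok(Prediction { score: sum / r.dense.len() as f32 })
            })
            .collect()
    }
}

/// The serving shell only sees the trait object, never the concrete backend.
pub fn serve_once(
    backend: &dyn InferenceBackend,
    batch: &[Request],
) -> Result<Vec<Prediction>, BackendError> {
    backend.infer_batch(batch)
}
```

Under these assumptions, swapping in a different backend means implementing the trait for a new type; `serve_once` and anything layered above it stay unchanged.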

Native torch inference (route A)

The torch-dlrm backend loads a TorchRec TorchScript artifact through tch-rs and runs the real FBGEMM table-batched embedding and interaction kernels, landing in the same single-request latency band as the C++ baseline (1.33 ms vs 1.033 ms p50 at concurrency 1).

Burn-native path

A pure-Rust Burn CUDA DLRM backend using table-batched embedding gather and batched-matmul interaction, plus an opt-in cuTile fused-interaction kernel for hardware that supports it.

Many transports

gRPC is the default API, with opt-in HTTP/JSON and feature-gated HTTP/3 over QUIC sharing the same executor and backend.

Production observability

Prometheus metrics plus HTTP liveness/readiness endpoints sit beside tonic-health, with labels for transport, endpoint, status, and batch size.
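To make the label set concrete, the sketch below renders a single counter sample in the Prometheus text exposition format using only the standard library. The metric name `narsil_requests_total` and the label values are hypothetical; only the label keys (transport, endpoint, status, batch size) come from the description above, and a real deployment would normally use a metrics crate rather than hand-formatting.

```rust
/// Escapes a label value as the Prometheus text format requires:
/// backslash, double quote, and newline are backslash-escaped.
pub fn escape_label(v: &str) -> String {
    v.replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}

/// Renders one counter sample as `name{k1="v1",k2="v2"} value`.
pub fn render_counter(name: &str, labels: &[(&str, &str)], value: u64) -> String {
    let body: Vec<String> = labels
        .iter()
        .map(|(k, v)| format!("{}=\"{}\"", k, escape_label(v)))
        .collect();
    format!("{}{{{}}} {}", name, body.join(","), value)
}

/// Example sample using the label keys named above (values are made up).
pub fn example_line() -> String {
    render_counter(
        "narsil_requests_total", // hypothetical metric name
        &[
            ("transport", "grpc"),
            ("endpoint", "predict"),
            ("status", "ok"),
            ("batch_size", "16"),
        ],
        42,
    )
}
```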

What is Narsil?

Narsil is a Rust inference serving engine. It aims to be a best-in-class serving shell that owns transport, scheduling, and batching in safe Rust while staying pluggable about the compute underneath:

Internally: native torch inference (tch-rs → libtorch + FBGEMM) and custom Rust/cuTile kernels (Burn).

Externally: many transports — gRPC, HTTP, and QUIC today, with NCCL and UCCL on the roadmap.

The current reference workload is DLRM (Deep Learning Recommendation Model), benchmarked against the official TorchRec C++ gRPC server.

The headline result

On the same NVIDIA L4 DLRM serving shape:

Path	p50 latency, concurrency 1	Throughput, concurrency 16 (samples/s)
Narsil route A (libtorch/FBGEMM)	1.33 ms	~466,000
TorchRec C++ gRPC (documented baseline)	1.033 ms	~215,000
Narsil Burn CUDA DLRM (FP32, batched ops)	1.17 ms	~126,000
Narsil Burn CUDA DLRM (FP32, before batched ops)	13.04 ms	~7,900

The concurrency-1 p50 is measured in worker mode and the concurrency-16 throughput in batch mode; Benchmarks has the per-mode tables and the caveats. Route A reaches roughly 2.2x the C++ baseline's throughput (~466,000 vs ~215,000 samples/s) through continuous batching, although its single-request p50 is somewhat higher (1.33 ms vs 1.033 ms). The Burn-native path now lands in the same single-request latency band (1.17 ms, down from 13.04 ms) after the table-batched gather and bmm interaction rewrite. The Decision Log covers how we got here.
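The throughput gain from batching comes from grouping concurrent requests into one backend call. The sketch below shows a simple size-or-deadline collector over a standard-library channel. It is a common simplification of the idea and not Narsil's actual collector; the function name `collect_batch` and its parameters are hypothetical.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Collects up to `max_batch` items from `rx` (assumes `max_batch >= 1`).
/// Blocks until the first item arrives, then waits at most `max_wait`
/// for more. Returns `None` once the channel is closed and drained.
pub fn collect_batch<T>(
    rx: &Receiver<T>,
    max_batch: usize,
    max_wait: Duration,
) -> Option<Vec<T>> {
    // Wait for at least one request; an error means all senders are gone.
    let first = rx.recv().ok()?;
    let mut batch = Vec::with_capacity(max_batch);
    batch.push(first);

    // The deadline starts when the first item arrives, which bounds the
    // extra latency any single request pays for being batched.
    let deadline = Instant::now() + max_wait;
    while batch.len() < max_batch {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            // Flush what we have on timeout or when the senders hang up.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}
```

A worker loop would call `collect_batch` repeatedly and hand each batch to the backend. A larger `max_batch` favours throughput, while a smaller `max_wait` favours single-request latency.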
