Architecture
Narsil separates serving (transport, scheduling, batching) from compute (the model backend). The serving shell is fixed and reusable; backends plug in behind one small trait.
Request flow
```
client ──gRPC──▶ tonic service ──▶ InferenceExecutor ──▶ InferenceBackend ──▶ response
                 (schema decode)   (inline /             (Burn | torch)
                                    worker pool /
                                    batch collector)
```

- Transport: tonic gRPC server (src/server.rs). A PredictRequest is decoded into the internal InferenceRequest (src/schema.rs): a list of named tensors with dtype + shape + raw little-endian bytes. The protobuf schema is the only external contract.
- Executor: the decoded request is handed to one of three executors (below).
- Backend: implements the InferenceBackend trait and produces an InferenceResponse.
- Response: re-encoded to protobuf and returned.
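
The "dtype + shape + raw little-endian bytes" representation can be illustrated with a small decoding sketch. The NamedTensor and DType types below are hypothetical stand-ins, not the definitions in src/schema.rs; the point is only that the byte length is checked against the shape before the payload is reinterpreted.

```rust
/// Hypothetical stand-in for one named tensor in a decoded request.
/// Field names are illustrative; the real type lives in src/schema.rs.
#[derive(Debug, Clone, PartialEq)]
pub struct NamedTensor {
    pub name: String,
    pub dtype: DType,
    pub shape: Vec<usize>,
    pub data: Vec<u8>, // raw little-endian bytes
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DType {
    F32,
    I32,
}

impl DType {
    /// Size of one element in bytes.
    pub fn size(self) -> usize {
        match self {
            DType::F32 | DType::I32 => 4,
        }
    }
}

impl NamedTensor {
    /// Decode the payload as f32, checking dtype and byte length against the shape.
    pub fn to_f32(&self) -> Result<Vec<f32>, String> {
        if self.dtype != DType::F32 {
            return Err(format!("tensor '{}' is not f32", self.name));
        }
        let expected = self.shape.iter().product::<usize>() * self.dtype.size();
        if self.data.len() != expected {
            return Err(format!(
                "tensor '{}': expected {} bytes, got {}",
                self.name,
                expected,
                self.data.len()
            ));
        }
        Ok(self
            .data
            .chunks_exact(4)
            .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
            .collect())
    }
}
```

Rejecting a size mismatch at decode time keeps malformed requests from reaching the backend.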
The backend trait
Everything compute-side reduces to one trait (src/backend/mod.rs):
```rust
pub trait InferenceBackend: Send + Sync + 'static {
    fn metadata(&self) -> BackendMetadata;
    fn infer(&self, request: InferenceRequest) -> Result<InferenceResponse>;
    /// Override to amortise compute across a coalesced batch.
    fn infer_batch(&self, requests: Vec<InferenceRequest>) -> Vec<Result<InferenceResponse>> {
        requests.into_iter().map(|r| self.infer(r)).collect()
    }
}
```

infer_batch is the hook that makes continuous batching pay off: a backend that can run one fused forward over many requests overrides it; everything else gets the one-by-one default.
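
As a rough illustration of such an override, the sketch below uses simplified stand-ins for the request, response, and result types, and omits metadata so that it compiles on its own. AffineBackend and its fields are hypothetical and do not describe the crate's burn-affine backend. It flattens all valid requests into one buffer, runs a single forward pass, and returns results in request order.

```rust
// Simplified stand-ins so the sketch is self-contained (assumptions, not the crate's types).
pub type InferResult<T> = std::result::Result<T, String>;

#[derive(Debug, Clone, PartialEq)]
pub struct InferenceRequest {
    pub features: Vec<f32>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct InferenceResponse {
    pub score: f32,
}

pub trait InferenceBackend: Send + Sync + 'static {
    fn infer(&self, request: InferenceRequest) -> InferResult<InferenceResponse>;
    fn infer_batch(&self, requests: Vec<InferenceRequest>) -> Vec<InferResult<InferenceResponse>> {
        requests.into_iter().map(|r| self.infer(r)).collect()
    }
}

/// Hypothetical backend: score = dot(weights, features) + bias.
/// Assumes `weights` is non-empty.
pub struct AffineBackend {
    pub weights: Vec<f32>,
    pub bias: f32,
}

impl AffineBackend {
    /// One "fused forward": a single pass over a flattened [rows, dim] buffer.
    fn forward(&self, flat: &[f32]) -> Vec<f32> {
        flat.chunks_exact(self.weights.len())
            .map(|row| row.iter().zip(&self.weights).map(|(x, w)| x * w).sum::<f32>() + self.bias)
            .collect()
    }
}

impl InferenceBackend for AffineBackend {
    fn infer(&self, request: InferenceRequest) -> InferResult<InferenceResponse> {
        // A single request is just a batch of one.
        self.infer_batch(vec![request]).pop().expect("one result per request")
    }

    fn infer_batch(&self, requests: Vec<InferenceRequest>) -> Vec<InferResult<InferenceResponse>> {
        let dim = self.weights.len();
        // Validate per request so one bad input does not fail the whole batch.
        let valid: Vec<bool> = requests.iter().map(|r| r.features.len() == dim).collect();
        // Flatten the valid rows into one contiguous buffer and run forward once.
        let flat: Vec<f32> = requests
            .iter()
            .zip(&valid)
            .filter(|(_, ok)| **ok)
            .flat_map(|(r, _)| r.features.iter().copied())
            .collect();
        let mut scores = self.forward(&flat).into_iter();
        // Hand results back in request order.
        valid
            .into_iter()
            .map(|ok| {
                if ok {
                    Ok(InferenceResponse { score: scores.next().expect("one score per valid row") })
                } else {
                    Err(format!("expected {dim} features"))
                }
            })
            .collect()
    }
}
```

Returning one Result per request, rather than one Result for the whole batch, lets a malformed request fail alone while the rest of the coalesced batch still succeeds.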
Execution modes
Selected by NARSIL_EXECUTION_MODE and wired in src/main.rs:
| Mode | Behaviour | Use |
|---|---|---|
| inline | Run infer directly on the tonic task. | Lowest overhead, no isolation. |
| worker | A fixed pool of OS threads, each owning the backend; requests dispatched over a channel. | CPU backends, or a single CUDA lane. |
| batch | A collector accumulates concurrent requests up to NARSIL_BATCH_SIZE / NARSIL_BATCH_TIMEOUT_MS, then calls infer_batch once and splits the results. | GPU backends: coalescing amortises kernel-launch overhead. |
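
The worker mode in the table can be pictured with a standard-library sketch. This is not the code in src/main.rs: the Job type, the spawn_pool name, and modelling a backend as a plain closure are all assumptions made for illustration. Each thread builds its own backend instance and pulls jobs from one shared channel.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

/// A job pairs an input with a reply channel for its result (hypothetical shape).
pub type Job = (Vec<f32>, Sender<f32>);

/// Spawn `workers` OS threads that pull jobs from one shared queue.
/// `make_backend` runs once per thread, so each worker owns its own instance.
pub fn spawn_pool<F, B>(workers: usize, make_backend: F) -> Sender<Job>
where
    F: Fn() -> B + Send + Sync + 'static,
    B: Fn(&[f32]) -> f32 + 'static,
{
    let (tx, rx) = channel::<Job>();
    let rx = Arc::new(Mutex::new(rx));
    let make_backend = Arc::new(make_backend);
    for _ in 0..workers {
        let rx = Arc::clone(&rx);
        let make_backend = Arc::clone(&make_backend);
        thread::spawn(move || {
            let backend = (*make_backend)(); // owned by this thread only
            loop {
                // The lock guard is dropped at the end of this statement,
                // so it is held while dequeuing but not while computing.
                let job = rx.lock().unwrap().recv();
                match job {
                    Ok((input, reply)) => {
                        // The caller may have gone away; ignore a failed reply.
                        let _ = reply.send(backend(&input));
                    }
                    Err(_) => break, // all senders dropped: shut down
                }
            }
        });
    }
    tx
}
```

Setting workers to 1 gives the single execution lane mentioned for CUDA; dropping the returned Sender (and any clones of it) lets the threads exit.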
For a single GPU the winning configuration is one execution lane + batch mode: you want one large fused forward, not many small ones contending for the SMs.
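
The size-or-timeout rule behind batch mode can be sketched with a blocking standard-library channel. The collect_batch function is hypothetical, and the actual collector may well be async; max_size and timeout play the roles of NARSIL_BATCH_SIZE and NARSIL_BATCH_TIMEOUT_MS.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Block for the first item, then keep collecting until the batch holds
/// `max_size` items or `timeout` has elapsed since that first item.
/// Returns None once the channel is closed and drained.
/// A `max_size` of 0 still yields single-item batches.
pub fn collect_batch<T>(rx: &Receiver<T>, max_size: usize, timeout: Duration) -> Option<Vec<T>> {
    let first = rx.recv().ok()?;
    let deadline = Instant::now() + timeout;
    let mut batch = vec![first];
    while batch.len() < max_size {
        // Time left in this batch window (zero once the deadline has passed).
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}
```

A single lane would call collect_batch in a loop and pass each returned batch to infer_batch. Starting the timer at the first item bounds the extra latency any request pays for coalescing to roughly the timeout.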
Backends
| Backend (NARSIL_BACKEND) | Feature | Compute |
|---|---|---|
| burn-affine, burn-mlp, burn-dlrm | default | Burn CPU/Flex. |
| burn-cuda-mlp, burn-cuda-dlrm | cuda | Burn CUDA (CubeCL); FP32/FP16. |
| burn-cuda-dlrm + NARSIL_DLRM_INTERACTION=cutile | cuda-cutile | Burn + an opt-in fused cuTile interaction kernel. |
| torch-dlrm | torch | Route A: libtorch + FBGEMM via tch-rs. |
The two families embody the thesis: Burn is the pure-Rust / custom-kernel path; torch is the native-torch path. Both sit behind the identical serving shell.
Route A: the torch backend
TorchDlrmBackend (src/backend/torch_dlrm.rs) serves a TorchRec TorchScript artifact directly:
- Startup: dlopens (with RTLD_GLOBAL) libtorch_cuda.so first, then the FBGEMM inference .so set. The first is required because the Rust binary only links libtorch_cpu/c10; without it ATen reports no CUDA. The rest register the fbgemm::* ops the artifact references. It then loads the artifact onto the CUDA device with CModule::load.
- Per request: parses the Narsil request tensors, builds the CUDA Dict<str, Tensor> the model expects, runs forward_is, and decodes the "default" output tensor.
- Batching: infer_batch concatenates dense + sparse inputs across requests into one fused forward over the summed batch, then slices the output back per request.
Because CModule is Send + Sync, the backend holds an Arc<CModule>; concurrency is bounded by the chosen executor rather than a lock.
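
The concatenate-then-slice bookkeeping can be sketched on plain vectors, without tch. The functions fuse_dense and split_output are hypothetical and cover only the dense input and the output; sparse inputs are left out because how they concatenate depends on the layout the artifact expects.

```rust
/// Width of the dense feature row, per the I/O contract.
pub const DENSE_DIM: usize = 13;

/// Concatenate per-request dense features ([b_i, 13] each, row-major) into one
/// [sum(b_i), 13] buffer, returning the row count of each request.
pub fn fuse_dense(requests: &[Vec<f32>]) -> Result<(Vec<f32>, Vec<usize>), String> {
    let mut fused = Vec::new();
    let mut rows = Vec::with_capacity(requests.len());
    for (i, dense) in requests.iter().enumerate() {
        if dense.len() % DENSE_DIM != 0 {
            return Err(format!(
                "request {}: dense length {} is not a multiple of {}",
                i,
                dense.len(),
                DENSE_DIM
            ));
        }
        rows.push(dense.len() / DENSE_DIM);
        fused.extend_from_slice(dense);
    }
    Ok((fused, rows))
}

/// Slice a fused [sum(b_i)] output back into one Vec per request.
pub fn split_output(output: &[f32], rows: &[usize]) -> Result<Vec<Vec<f32>>, String> {
    let total: usize = rows.iter().sum();
    if output.len() != total {
        return Err(format!("expected {} outputs, got {}", total, output.len()));
    }
    let mut offset = 0;
    Ok(rows
        .iter()
        .map(|&n| {
            let chunk = output[offset..offset + n].to_vec();
            offset += n;
            chunk
        })
        .collect())
}
```

The per-request row counts are the only state that must survive the forward pass. They fix where each response starts in the fused output.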
The DLRM I/O contract
Mirrors TorchRec's dlrm_predict.py:
```
forward(self, batch: Dict[str, Tensor]) -> Dict[str, Tensor]
  in : float_features            f32 [B, 13]
       id_list_features.lengths  i32 [B * 26]   (all ones: one id per feature)
       id_list_features.values   i32 [B * 26]
  out: "default"                 f32 [B]        (on CPU)
```

The same Narsil protobuf request (dense, lengths, values tensors) drives both the Burn and torch backends, with no schema change to switch compute paths.
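
A shape check against this contract might look like the sketch below. The validate_dlrm_inputs function is hypothetical; it infers B from the dense tensor and applies the general jagged rule that the number of values equals the sum of lengths, which reduces to B * 26 when every length is one.

```rust
pub const DENSE_FEATURES: usize = 13;
pub const SPARSE_FEATURES: usize = 26;

/// Check element counts against the contract and return the batch size B.
pub fn validate_dlrm_inputs(
    dense: &[f32],
    lengths: &[i32],
    values: &[i32],
) -> Result<usize, String> {
    if dense.is_empty() || dense.len() % DENSE_FEATURES != 0 {
        return Err(format!(
            "float_features: {} elements is not a positive multiple of {}",
            dense.len(),
            DENSE_FEATURES
        ));
    }
    let batch = dense.len() / DENSE_FEATURES;
    if lengths.len() != batch * SPARSE_FEATURES {
        return Err(format!(
            "lengths: expected {} elements, got {}",
            batch * SPARSE_FEATURES,
            lengths.len()
        ));
    }
    if lengths.iter().any(|&l| l < 0) {
        return Err("lengths: negative entry".to_string());
    }
    // Sum in i64 so a large batch cannot overflow i32.
    let total: i64 = lengths.iter().map(|&l| i64::from(l)).sum();
    if values.len() as i64 != total {
        return Err(format!("values: expected {} elements, got {}", total, values.len()));
    }
    Ok(batch)
}
```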
Build features
```sh
cargo build                          # default: Burn CPU backends
cargo build --features cuda          # + Burn CUDA backends
cargo build --features cuda-cutile   # + opt-in fused cuTile DLRM interaction
cargo build --features torch         # + route A (libtorch/FBGEMM via tch-rs)
```

The torch feature links libtorch 2.11 (via tch 0.24); the recommended build incantation points tch at an existing torch install: LIBTORCH_USE_PYTORCH=1 cargo build --features torch.
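
One way a binary could report which backends a given build contains is to test Cargo features with the cfg! macro. The sketch below is an assumption about how such gating might be written, not a description of the crate; the backend and feature names come from the tables above, while compiled_backends and check_backend are hypothetical.

```rust
/// Backend names compiled into this binary, derived from Cargo feature flags.
pub fn compiled_backends() -> Vec<&'static str> {
    // Default build: Burn CPU backends.
    let mut names = vec!["burn-affine", "burn-mlp", "burn-dlrm"];
    if cfg!(feature = "cuda") {
        names.extend(["burn-cuda-mlp", "burn-cuda-dlrm"]);
    }
    if cfg!(feature = "torch") {
        names.push("torch-dlrm");
    }
    names
}

/// Reject a requested backend that this build does not contain.
pub fn check_backend(requested: &str) -> Result<(), String> {
    if compiled_backends().iter().any(|name| *name == requested) {
        Ok(())
    } else {
        Err(format!(
            "backend '{requested}' is not compiled in; rebuild with the matching --features flag"
        ))
    }
}
```

Checking the NARSIL_BACKEND value this way at startup would turn a missing feature into an immediate, readable error.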