
Architecture

Narsil separates serving (transport, scheduling, batching) from compute (the model backend). The serving shell is fixed and reusable; backends plug in behind one small trait.

Request flow

 client ──gRPC──▶ tonic service ──▶ InferenceExecutor ──▶ InferenceBackend ──▶ response
                  (schema decode)     (inline /             (Burn | torch)
                                       worker pool /
                                       batch collector)
  1. Transport: tonic gRPC server (src/server.rs). A PredictRequest is decoded into the internal InferenceRequest (src/schema.rs): a list of named tensors, each with a dtype, a shape, and raw little-endian bytes. The protobuf schema is the only external contract.
  2. Executor: the decoded request is handed to one of three executors (below).
  3. Backend: implements the InferenceBackend trait and produces an InferenceResponse.
  4. Response: re-encoded to protobuf and returned.
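
To make the "dtype + shape + raw little-endian bytes" representation concrete, here is a minimal sketch of encoding and decoding one named tensor. The `NamedTensor` and `DType` types are hypothetical stand-ins invented for illustration; the real definitions live in src/schema.rs and may differ.

```rust
/// Hypothetical, simplified stand-ins for the types in src/schema.rs.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DType {
    F32,
    I32,
}

#[derive(Debug, Clone)]
pub struct NamedTensor {
    pub name: String,
    pub dtype: DType,
    pub shape: Vec<usize>,
    pub data: Vec<u8>, // raw little-endian bytes
}

impl NamedTensor {
    /// Build an f32 tensor, encoding each value as 4 little-endian bytes.
    pub fn from_f32(name: &str, shape: Vec<usize>, values: &[f32]) -> Self {
        let data = values.iter().flat_map(|v| v.to_le_bytes()).collect();
        NamedTensor { name: name.to_string(), dtype: DType::F32, shape, data }
    }

    /// Decode to f32 values, checking the dtype and that the byte length
    /// matches the element count implied by the shape.
    pub fn to_f32(&self) -> Result<Vec<f32>, String> {
        if self.dtype != DType::F32 {
            return Err(format!("tensor '{}' is not f32", self.name));
        }
        let elems: usize = self.shape.iter().product();
        if self.data.len() != elems * 4 {
            return Err(format!(
                "tensor '{}': expected {} bytes, got {}",
                self.name,
                elems * 4,
                self.data.len()
            ));
        }
        Ok(self
            .data
            .chunks_exact(4)
            .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
            .collect())
    }
}
```

Validating the byte length against the shape at decode time keeps malformed requests from reaching the backend.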

The backend trait

Everything compute-side reduces to one trait (src/backend/mod.rs):

```rust
pub trait InferenceBackend: Send + Sync + 'static {
    fn metadata(&self) -> BackendMetadata;
    fn infer(&self, request: InferenceRequest) -> Result<InferenceResponse>;

    /// Override to amortise compute across a coalesced batch.
    fn infer_batch(&self, requests: Vec<InferenceRequest>) -> Vec<Result<InferenceResponse>> {
        requests.into_iter().map(|r| self.infer(r)).collect()
    }
}
```

infer_batch is the hook that makes continuous batching pay off: a backend that can run one fused forward over many requests overrides it; everything else gets the one-by-one default.
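
As an illustration of what overriding infer_batch can look like, the sketch below uses deliberately simplified, hypothetical types: a request is a flat feature vector, a response is a single score, `Result` is aliased to a `String` error, and `metadata()` is omitted. `DotBackend` is a toy invented for this example, not one of Narsil's backends. It packs all well-formed requests into one matrix, runs a single forward pass, and returns one result per request in the original order.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Simplified, hypothetical stand-ins for the real types in src/schema.rs.
pub struct InferenceRequest {
    pub features: Vec<f32>,
}
#[derive(Debug, PartialEq)]
pub struct InferenceResponse {
    pub score: f32,
}
pub type Result<T> = std::result::Result<T, String>;

// Same shape as the trait above, minus `metadata()` for brevity.
pub trait InferenceBackend: Send + Sync + 'static {
    fn infer(&self, request: InferenceRequest) -> Result<InferenceResponse>;
    fn infer_batch(&self, requests: Vec<InferenceRequest>) -> Vec<Result<InferenceResponse>> {
        requests.into_iter().map(|r| self.infer(r)).collect()
    }
}

/// Toy backend: score = dot(weights, features).
pub struct DotBackend {
    pub weights: Vec<f32>,
    pub forward_calls: AtomicUsize, // counts fused forward passes
}

impl DotBackend {
    /// One "forward" over a row-major [rows, dim] matrix.
    fn forward(&self, flat: &[f32], rows: usize) -> Vec<f32> {
        self.forward_calls.fetch_add(1, Ordering::Relaxed);
        let dim = self.weights.len();
        (0..rows)
            .map(|i| {
                flat[i * dim..(i + 1) * dim]
                    .iter()
                    .zip(&self.weights)
                    .map(|(x, w)| x * w)
                    .sum::<f32>()
            })
            .collect()
    }
}

impl InferenceBackend for DotBackend {
    fn infer(&self, request: InferenceRequest) -> Result<InferenceResponse> {
        self.infer_batch(vec![request]).pop().expect("one result per request")
    }

    fn infer_batch(&self, requests: Vec<InferenceRequest>) -> Vec<Result<InferenceResponse>> {
        let dim = self.weights.len();
        // Pack only well-formed requests into one row-major matrix.
        let mut flat = Vec::new();
        let mut valid = Vec::with_capacity(requests.len());
        for r in &requests {
            let ok = r.features.len() == dim;
            if ok {
                flat.extend_from_slice(&r.features);
            }
            valid.push(ok);
        }
        let rows = valid.iter().filter(|v| **v).count();
        let mut scores = self.forward(&flat, rows).into_iter();
        // Re-align scores with the original request order.
        valid
            .into_iter()
            .map(|ok| {
                if ok {
                    Ok(InferenceResponse { score: scores.next().expect("one score per valid row") })
                } else {
                    Err("feature length mismatch".to_string())
                }
            })
            .collect()
    }
}
```

Returning `Vec<Result<_>>` rather than `Result<Vec<_>>` is what lets one malformed request fail on its own without failing the rest of the coalesced batch.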

Execution modes

Selected by NARSIL_EXECUTION_MODE and wired in src/main.rs:

| Mode | Behaviour | Use |
|---|---|---|
| inline | Run infer directly on the tonic task. | Lowest overhead, no isolation. |
| worker | A fixed pool of OS threads, each owning the backend; requests dispatched over a channel. | CPU backends, or a single CUDA lane. |
| batch | A collector accumulates concurrent requests up to NARSIL_BATCH_SIZE / NARSIL_BATCH_TIMEOUT_MS, then calls infer_batch once and splits the results. | GPU backends; coalescing amortises kernel-launch overhead. |
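
The batch mode's accumulate-then-flush behaviour can be sketched with the standard library alone. This is not Narsil's collector: `collect_batch` and `run_collector` are hypothetical names, and the sketch uses std::sync::mpsc where the real executor may use async channels. `max_batch` and `timeout` play the roles of NARSIL_BATCH_SIZE and NARSIL_BATCH_TIMEOUT_MS.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError, Sender};
use std::time::{Duration, Instant};

/// Collect up to `max_batch` items, waiting at most `timeout` after the first
/// item arrives. Returns None once the channel is closed and drained.
pub fn collect_batch<T>(rx: &Receiver<T>, max_batch: usize, timeout: Duration) -> Option<Vec<T>> {
    // Block for the first item so an idle collector does not spin.
    let first = rx.recv().ok()?;
    let mut batch = vec![first];
    let deadline = Instant::now() + timeout;
    while batch.len() < max_batch {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            // Timeout: flush what we have. Disconnected: flush the tail.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}

/// Drive the collector: each job carries its own reply channel, so the
/// results of one `infer_batch` call can be split back out per request.
pub fn run_collector<Req, Resp>(
    rx: Receiver<(Req, Sender<Resp>)>,
    max_batch: usize,
    timeout: Duration,
    infer_batch: impl Fn(Vec<Req>) -> Vec<Resp>,
) {
    while let Some(jobs) = collect_batch(&rx, max_batch, timeout) {
        let (requests, replies): (Vec<Req>, Vec<Sender<Resp>>) = jobs.into_iter().unzip();
        for (response, reply) in infer_batch(requests).into_iter().zip(replies) {
            // The caller may have gone away; a failed send is not an error here.
            let _ = reply.send(response);
        }
    }
}
```

Starting the timeout when the first request arrives, rather than on a fixed tick, bounds the extra latency any single request pays for batching to roughly `timeout`.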

For a single GPU the winning configuration is one execution lane + batch mode: you want one large fused forward, not many small ones contending for the SMs.

Backends

| Backend (NARSIL_BACKEND) | Feature | Compute |
|---|---|---|
| burn-affine, burn-mlp, burn-dlrm | default | Burn CPU/Flex. |
| burn-cuda-mlp, burn-cuda-dlrm | cuda | Burn CUDA (CubeCL); FP32/FP16. |
| burn-cuda-dlrm + NARSIL_DLRM_INTERACTION=cutile | cuda-cutile | Burn + an opt-in fused cuTile interaction kernel. |
| torch-dlrm | torch | Route A: libtorch + FBGEMM via tch-rs. |

The two families embody the thesis: Burn is the pure-Rust / custom-kernel path; torch is the native-torch path. Both sit behind the identical serving shell.
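
The backend table can be read as a lookup from backend name to the Cargo feature that must be compiled in. The helper below is a hypothetical illustration of that mapping only; `required_feature` is not a function in the codebase, and how src/main.rs actually resolves NARSIL_BACKEND is not shown here.

```rust
/// Cargo feature needed for a given NARSIL_BACKEND value, per the table above.
/// `interaction` mirrors NARSIL_DLRM_INTERACTION. Returns None for names the
/// table does not list. (Hypothetical helper, for illustration only.)
pub fn required_feature(backend: &str, interaction: Option<&str>) -> Option<&'static str> {
    match (backend, interaction) {
        // The cuTile interaction kernel is opt-in and has its own feature.
        ("burn-cuda-dlrm", Some("cutile")) => Some("cuda-cutile"),
        ("burn-affine" | "burn-mlp" | "burn-dlrm", _) => Some("default"),
        ("burn-cuda-mlp" | "burn-cuda-dlrm", _) => Some("cuda"),
        ("torch-dlrm", _) => Some("torch"),
        _ => None,
    }
}
```

A check like this at startup would let a misconfigured NARSIL_BACKEND fail with a message naming the missing feature instead of an opaque "unknown backend" error.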

Route A: the torch backend

TorchDlrmBackend (src/backend/torch_dlrm.rs) serves a TorchRec TorchScript artifact directly:

  • Startup: dlopens (with RTLD_GLOBAL) libtorch_cuda.so first, then the FBGEMM inference .so set. The first is required because the Rust binary links only libtorch_cpu/c10; without it ATen reports no CUDA. The FBGEMM libraries register the fbgemm::* ops the artifact references. The backend then CModule::loads the artifact onto the CUDA device.
  • Per request: parses the Narsil request tensors, builds the CUDA Dict<str, Tensor> the model expects, runs forward_is, and decodes the "default" output tensor.
  • Batching: infer_batch concatenates dense + sparse inputs across requests into one fused forward over the summed batch, then slices the output back per request.
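
The concatenate, forward, slice pattern in the batching step can be shown without libtorch. The sketch below works on plain `Vec<f32>` dense inputs only; `Dense` and `fused_forward` are hypothetical names, and the `forward` closure stands in for the single TorchScript call. The real backend does the equivalent on CUDA tensors and also concatenates the sparse inputs.

```rust
/// One request's dense input: a row-major [rows, width] matrix.
pub struct Dense {
    pub rows: usize,
    pub data: Vec<f32>,
}

/// Concatenate along the batch dimension, run `forward` once over the summed
/// batch, then slice the [total_rows] output back into per-request pieces.
pub fn fused_forward(
    requests: &[Dense],
    width: usize,
    forward: impl Fn(&[f32], usize) -> Vec<f32>,
) -> Result<Vec<Vec<f32>>, String> {
    let mut flat = Vec::new();
    let mut total = 0;
    for (i, r) in requests.iter().enumerate() {
        if r.data.len() != r.rows * width {
            return Err(format!("request {}: data does not match [{}, {}]", i, r.rows, width));
        }
        flat.extend_from_slice(&r.data);
        total += r.rows;
    }

    // One forward pass over the summed batch.
    let out = forward(&flat, total);
    if out.len() != total {
        return Err(format!("forward returned {} rows, expected {}", out.len(), total));
    }

    // Slice the output back using each request's row count as the stride.
    let mut offset = 0;
    Ok(requests
        .iter()
        .map(|r| {
            let piece = out[offset..offset + r.rows].to_vec();
            offset += r.rows;
            piece
        })
        .collect())
}
```

Because each request keeps its own row count, requests with different batch sizes can share one forward pass and still get back exactly their own rows.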

Because CModule is Send + Sync, the backend holds an Arc<CModule>; concurrency is bounded by the chosen executor rather than a lock.

The DLRM I/O contract

Mirrors TorchRec's dlrm_predict.py:

forward(self, batch: Dict[str, Tensor]) -> Dict[str, Tensor]
  in : float_features              f32 [B, 13]
       id_list_features.lengths    i32 [B * 26]   (all ones: one id per feature)
       id_list_features.values     i32 [B * 26]
  out: "default"                   f32 [B]        (on CPU)

The same Narsil protobuf request (dense, lengths, values tensors) drives both the Burn and torch backends — no schema change to switch compute paths.
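
A client can check a request against this contract before sending it. The validator below is a sketch under the shapes listed above (13 dense features, 26 sparse features, one id per feature); `validate_dlrm` and the two constants are hypothetical names, and the sketch says nothing about how ids are ordered within `values`.

```rust
pub const DENSE_FEATURES: usize = 13;
pub const SPARSE_FEATURES: usize = 26;

/// Check the three input tensors against the DLRM contract and return the
/// batch size B. (Hypothetical helper, for illustration only.)
pub fn validate_dlrm(
    float_features: &[f32],
    lengths: &[i32],
    values: &[i32],
) -> Result<usize, String> {
    // float_features is [B, 13], so its length must be a multiple of 13.
    if float_features.len() % DENSE_FEATURES != 0 {
        return Err(format!(
            "float_features has {} elements, not a multiple of {}",
            float_features.len(),
            DENSE_FEATURES
        ));
    }
    let b = float_features.len() / DENSE_FEATURES;
    let expected = b * SPARSE_FEATURES;

    // lengths is [B * 26] and all ones: exactly one id per sparse feature.
    if lengths.len() != expected {
        return Err(format!("lengths has {} elements, expected {}", lengths.len(), expected));
    }
    if lengths.iter().any(|&l| l != 1) {
        return Err("lengths must be all ones (one id per feature)".to_string());
    }

    // With one id per feature, values has the same length as lengths.
    if values.len() != expected {
        return Err(format!("values has {} elements, expected {}", values.len(), expected));
    }
    Ok(b)
}
```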

Build features

```bash
cargo build                          # default: Burn CPU backends
cargo build --features cuda          # + Burn CUDA backends
cargo build --features cuda-cutile   # + opt-in fused cuTile DLRM interaction
cargo build --features torch         # + route A (libtorch/FBGEMM via tch-rs)
```

The torch feature links libtorch 2.11 (via tch 0.24); the recommended build incantation points tch at an existing torch install: LIBTORCH_USE_PYTORCH=1 cargo build --features torch.

Apache-2.0 licensed.