Architecture

Narsil separates serving (transport, scheduling, batching) from compute (the model backend). The serving shell is fixed and reusable; backends plug in behind one small trait.

Request flow

 client ──gRPC────────▶ tonic service ─┐
 client ──HTTP/JSON───▶ axum gateway ──┤
 client ──HTTP/3/QUIC─▶ h3 gateway ────┴─▶ InferenceExecutor ──▶ InferenceBackend ──▶ response
                      (schema decode)       (inline /             (Burn | torch)
                                             worker pool /
                                             batch collector)

1. Transport: a tonic gRPC server, an opt-in axum HTTP/JSON gateway, and a feature-gated HTTP/3 over QUIC gateway (src/server.rs). A request is decoded into the internal InferenceRequest (src/schema.rs): a list of named tensors, each with a dtype, a shape, and raw little-endian bytes. The protobuf tensor schema remains the canonical contract; JSON carries data as base64-encoded bytes and dtype as one of F32, F64, I64, I32, U8, or Bool.
2. Executor: the decoded request is handed to one of three executors (see Execution modes).
3. Backend: implements the InferenceBackend trait and produces an InferenceResponse.
4. Response: re-encoded to protobuf and returned.
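
To make the tensor layout concrete, here is a minimal sketch of a named tensor with a dtype, a shape, and raw little-endian bytes, plus a decoder for the F32 case. The `Tensor` and `DType` types and their methods are hypothetical stand-ins written for this example; the real definitions in src/schema.rs may differ.

```rust
/// Hypothetical stand-in for the dtype enum; the variants mirror the names listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DType {
    F32,
    F64,
    I64,
    I32,
    U8,
    Bool,
}

impl DType {
    /// Size of one element in bytes (assumes Bool is stored as one byte).
    pub fn size_of(self) -> usize {
        match self {
            DType::F64 | DType::I64 => 8,
            DType::F32 | DType::I32 => 4,
            DType::U8 | DType::Bool => 1,
        }
    }
}

/// Hypothetical named tensor: dtype + shape + raw little-endian bytes.
#[derive(Debug, Clone)]
pub struct Tensor {
    pub name: String,
    pub dtype: DType,
    pub shape: Vec<usize>,
    pub data: Vec<u8>,
}

impl Tensor {
    /// Checks that the byte length equals element size times the product of the shape.
    pub fn validate(&self) -> Result<(), String> {
        let elems: usize = self.shape.iter().product();
        let expected = elems * self.dtype.size_of();
        if self.data.len() == expected {
            Ok(())
        } else {
            Err(format!(
                "tensor {}: expected {} bytes, got {}",
                self.name,
                expected,
                self.data.len()
            ))
        }
    }

    /// Decodes an F32 tensor from its little-endian byte payload.
    pub fn to_f32(&self) -> Result<Vec<f32>, String> {
        if self.dtype != DType::F32 {
            return Err(format!("tensor {}: dtype is not F32", self.name));
        }
        self.validate()?;
        Ok(self
            .data
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect())
    }
}
```

Validating the byte length against the shape before decoding keeps a malformed request from reaching the backend as a silently truncated tensor.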

The backend trait

Everything compute-side reduces to one trait (src/backend/mod.rs):

```rust
pub trait InferenceBackend: Send + Sync + 'static {
    fn metadata(&self) -> BackendMetadata;
    fn infer(&self, request: InferenceRequest) -> Result<InferenceResponse>;

    /// Override to amortise compute across a coalesced batch.
    fn infer_batch(&self, requests: Vec<InferenceRequest>) -> Vec<Result<InferenceResponse>> {
        requests.into_iter().map(|r| self.infer(r)).collect()
    }
}
```

infer_batch is the hook that makes continuous batching pay off: a backend that can run one fused forward over many requests overrides it; everything else gets the one-by-one default.
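
As an illustration of overriding the hook, the sketch below implements the trait for a toy element-wise backend that fuses all requests into one buffer, runs a single pass, and slices the result back per request. The request, response, and metadata types here are simplified placeholders, `BackendResult` stands in for the crate's own `Result` alias, and `ToyAffineBackend` is hypothetical; none of them are the real Narsil definitions.

```rust
/// Placeholder for the crate's Result alias used by the real trait.
pub type BackendResult<T> = std::result::Result<T, String>;

/// Simplified placeholder types: a flat f32 payload instead of named tensors.
#[derive(Debug, Clone, PartialEq)]
pub struct InferenceRequest {
    pub values: Vec<f32>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct InferenceResponse {
    pub values: Vec<f32>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct BackendMetadata {
    pub name: String,
}

pub trait InferenceBackend: Send + Sync + 'static {
    fn metadata(&self) -> BackendMetadata;
    fn infer(&self, request: InferenceRequest) -> BackendResult<InferenceResponse>;

    /// Default: one-by-one.
    fn infer_batch(&self, requests: Vec<InferenceRequest>) -> Vec<BackendResult<InferenceResponse>> {
        requests.into_iter().map(|r| self.infer(r)).collect()
    }
}

/// Hypothetical toy backend computing y = scale * x + bias element-wise.
pub struct ToyAffineBackend {
    pub scale: f32,
    pub bias: f32,
}

impl InferenceBackend for ToyAffineBackend {
    fn metadata(&self) -> BackendMetadata {
        BackendMetadata { name: "toy-affine".to_string() }
    }

    fn infer(&self, request: InferenceRequest) -> BackendResult<InferenceResponse> {
        // A single request is just a batch of one.
        self.infer_batch(vec![request])
            .pop()
            .expect("infer_batch returns one result per request")
    }

    fn infer_batch(&self, requests: Vec<InferenceRequest>) -> Vec<BackendResult<InferenceResponse>> {
        // 1. Remember each request's length, then concatenate all inputs.
        let lens: Vec<usize> = requests.iter().map(|r| r.values.len()).collect();
        let fused: Vec<f32> = requests.into_iter().flat_map(|r| r.values).collect();

        // 2. One pass over the fused buffer (stands in for a single fused forward).
        let out: Vec<f32> = fused.iter().map(|x| self.scale * x + self.bias).collect();

        // 3. Slice the fused output back into per-request responses, in order.
        let mut offset = 0;
        lens.into_iter()
            .map(|n| {
                let values = out[offset..offset + n].to_vec();
                offset += n;
                Ok(InferenceResponse { values })
            })
            .collect()
    }
}
```

The important property is that the output vector has exactly one entry per input request, in the same order, so the caller can route each result back to its originating client.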

Execution modes

Selected by NARSIL_EXECUTION_MODE and wired in src/main.rs:

Mode	Behaviour	Use
`inline`	Run `infer` directly on the tonic task.	Lowest overhead, no isolation.
`worker`	A fixed pool of OS threads, each owning the backend; requests dispatched over a channel.	CPU backends, or a single CUDA lane.
`batch`	A collector accumulates concurrent requests up to `NARSIL_BATCH_SIZE` / `NARSIL_BATCH_TIMEOUT_MS`, then calls `infer_batch` once and splits the results.	GPU backends — coalescing amortises kernel-launch overhead.

For a single GPU the winning configuration is one execution lane + batch mode: you want one large fused forward, not many small ones contending for the SMs.
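
The size-or-timeout rule behind batch mode can be sketched with the standard library alone. The function below is an assumption-laden illustration, not Narsil's collector: `collect_batch` is a hypothetical name, `max_batch` and `timeout` play the roles of `NARSIL_BATCH_SIZE` and `NARSIL_BATCH_TIMEOUT_MS`, and it assumes the timeout window starts when the first request of a batch arrives.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Blocks until one request arrives, then keeps collecting until either
/// `max_batch` requests are queued or `timeout` has elapsed since the first one.
/// Returns an empty Vec only when the channel is closed and drained.
pub fn collect_batch<T>(rx: &Receiver<T>, max_batch: usize, timeout: Duration) -> Vec<T> {
    let mut batch = Vec::with_capacity(max_batch);

    // Wait indefinitely for the first request; an idle server forms no batch.
    match rx.recv() {
        Ok(first) => batch.push(first),
        Err(_) => return batch, // all senders dropped
    }

    // The window is measured from the arrival of the first request.
    let deadline = Instant::now() + timeout;
    while batch.len() < max_batch {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    batch
}
```

A collector loop would call this repeatedly, pass each returned batch to `infer_batch` once, and send each result back to the caller that submitted the matching request. A larger timeout trades latency for fuller batches.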

HTTP/JSON gateway

gRPC is still the default transport. Set NARSIL_HTTP_ADDR to run the HTTP/JSON gateway beside the gRPC listener:

```bash
NARSIL_HTTP_ADDR=127.0.0.1:8080 cargo run -- 127.0.0.1:50051
```

The gateway exposes:

Method	Path	Body
`POST`	`/v1/infer`	one JSON inference request
`POST`	`/v1/infer_batch`	`{ "requests": [ ... ] }`

Both routes share the same executor as gRPC, so inline, worker, and batch execution modes behave consistently across transports.
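
Because JSON carries tensor data as base64-encoded bytes, a client has to turn its values into little-endian bytes and then base64 them before building the request body. The sketch below does that with the standard library only. It assumes the standard base64 alphabet with `=` padding, which this page does not specify, and both function names are hypothetical; the JSON field names are deliberately left out because they are not given here.

```rust
/// Standard base64 alphabet (RFC 4648); assumed, not confirmed by the gateway docs.
const ALPHABET: &[u8; 64] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Encodes bytes as padded base64.
pub fn base64_encode(bytes: &[u8]) -> String {
    let mut out = String::with_capacity((bytes.len() + 2) / 3 * 4);
    for chunk in bytes.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling missing bytes.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        out.push(ALPHABET[((n >> 18) & 63) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 63) as usize] as char);
        // Emit '=' padding for the positions that had no input byte.
        out.push(if chunk.len() > 1 { ALPHABET[((n >> 6) & 63) as usize] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[(n & 63) as usize] as char } else { '=' });
    }
    out
}

/// Serialises f32 values as little-endian bytes, then base64-encodes them.
pub fn f32_tensor_to_base64(values: &[f32]) -> String {
    let bytes: Vec<u8> = values.iter().flat_map(|v| v.to_le_bytes()).collect();
    base64_encode(&bytes)
}
```

In a real client a maintained base64 crate would be the usual choice; the hand-written encoder is here only to keep the example dependency-free.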

HTTP/3 over QUIC

The QUIC transport is behind the quic feature and the NARSIL_QUIC_ADDR runtime toggle:

```bash
NARSIL_QUIC_ADDR=127.0.0.1:50052 cargo run --features quic -- 127.0.0.1:50051
```

It serves the same HTTP paths as the JSON gateway (/v1/infer, /v1/infer_batch, /metrics, /health/live, /health/ready) over HTTP/3 with ALPN h3, and dispatches through the same shared executor. Each request stream is handled concurrently so a slow upload or long inference on one stream does not head-of-line block the others multiplexed on the same connection, and request bodies are capped (64 MiB) to bound per-request memory. On shutdown the endpoint stops accepting new connections and drains in-flight streams (bounded) before closing. If NARSIL_QUIC_ADDR is set without --features quic, startup fails fast with a configuration error — before any listener binds — instead of silently ignoring the transport.

TLS is mandatory (TLS 1.3). Provide a real certificate chain and key via NARSIL_QUIC_TLS_CERT and NARSIL_QUIC_TLS_KEY (both PEM, both required together) in any non-local deployment. With neither set, Narsil generates an ephemeral self-signed certificate for development only and logs a warning; that certificate cannot be verified by clients and must not be used in production.
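
The certificate rule above amounts to a small decision over two optional settings. The following sketch is one way to express it, under two assumptions that this page does not confirm: that the variables hold file paths to the PEM files, and that setting only one of them is rejected as a configuration error. `TlsSource`, `resolve_tls`, and `resolve_tls_from_env` are hypothetical names.

```rust
/// Hypothetical outcome of resolving the QUIC TLS configuration.
#[derive(Debug, PartialEq)]
pub enum TlsSource {
    /// Operator-provided certificate chain and key (assumed to be PEM file paths).
    Files { cert_path: String, key_path: String },
    /// Development-only fallback; clients cannot verify this certificate.
    EphemeralSelfSigned,
}

/// Both values must be present together, or both absent.
pub fn resolve_tls(cert: Option<String>, key: Option<String>) -> Result<TlsSource, String> {
    match (cert, key) {
        (Some(cert_path), Some(key_path)) => Ok(TlsSource::Files { cert_path, key_path }),
        (None, None) => Ok(TlsSource::EphemeralSelfSigned),
        // Assumption: a half-configured pair is an error rather than a silent fallback.
        (Some(_), None) => {
            Err("NARSIL_QUIC_TLS_CERT is set but NARSIL_QUIC_TLS_KEY is missing".to_string())
        }
        (None, Some(_)) => {
            Err("NARSIL_QUIC_TLS_KEY is set but NARSIL_QUIC_TLS_CERT is missing".to_string())
        }
    }
}

/// Reads the two variables named in this section from the environment.
pub fn resolve_tls_from_env() -> Result<TlsSource, String> {
    resolve_tls(
        std::env::var("NARSIL_QUIC_TLS_CERT").ok(),
        std::env::var("NARSIL_QUIC_TLS_KEY").ok(),
    )
}
```

Keeping the decision in a pure function separate from the environment lookup makes the four cases easy to unit test.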

Observability

Narsil installs a Prometheus recorder around the shared transport path. The recorder is installed only when an HTTP listener will actually render it. A gRPC-only deploy still records metrics, but only through the no-op recorder, and logs a startup warning that nothing is exposed.

Prefer a dedicated, privately bound observability listener via NARSIL_OBSERVABILITY_ADDR. /metrics is deliberately kept off the public inference port: it is served on the observability listener, while the health probes are served on both the gateway and the observability listener.

```bash
NARSIL_OBSERVABILITY_ADDR=127.0.0.1:9090 cargo run -- 127.0.0.1:50051
```

Listener / route placement:

Env	`/metrics`	`/health/live` + `/health/ready`
`NARSIL_OBSERVABILITY_ADDR` (recommended, bind private)	yes	yes
`NARSIL_HTTP_ADDR` with observability listener set	no	yes
`NARSIL_HTTP_ADDR` only (no observability listener)	yes (single-port convenience; warns)	yes
gRPC-only (neither set)	no (collected, not exposed; warns)	gRPC `tonic-health` only
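
The placement table can be read as a function of which of the two listeners is configured. The sketch below encodes the rows as written. `Placement` and `place_routes` are hypothetical names, and the `warn` flag assumes that only the two rows marked "warns" log a warning.

```rust
/// Hypothetical summary of where the HTTP observability routes end up.
#[derive(Debug, PartialEq)]
pub struct Placement {
    pub metrics_on_observability: bool,
    pub metrics_on_gateway: bool,
    pub health_on_observability: bool,
    pub health_on_gateway: bool,
    /// True for the two rows of the table that are marked "warns".
    pub warn: bool,
}

/// `http` is true when NARSIL_HTTP_ADDR is set;
/// `observability` is true when NARSIL_OBSERVABILITY_ADDR is set.
pub fn place_routes(http: bool, observability: bool) -> Placement {
    Placement {
        // The dedicated listener always serves /metrics and the health probes.
        metrics_on_observability: observability,
        health_on_observability: observability,
        // The gateway serves /metrics only as a single-port convenience.
        metrics_on_gateway: http && !observability,
        // Health probes stay on the gateway whenever it is running.
        health_on_gateway: http,
        // Without a dedicated listener, /metrics is either public or not exposed.
        warn: !observability,
    }
}
```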

Method	Path	Response
`GET`	`/metrics`	Prometheus text exposition
`GET`	`/health/live`	`204` while the process + execution machinery are alive, `503` otherwise
`GET`	`/health/ready`	`204` once the backend is loaded and the executor is live, `503` during shutdown/drain

Readiness is wired to real backend/executor health: it gates on the backend producing valid metadata() (a model is loaded) and, in batch mode, on the continuous-batching collector loop still being alive. If that loop exits, liveness and readiness flip to 503 instead of reporting healthy forever.
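
A minimal sketch of that gating, assuming three shared flags that the backend loader, the collector loop, and the shutdown path would flip. The `Health` type and its method names are hypothetical, and in non-batch modes the collector flag is assumed to stay true.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical shared health state behind /health/live and /health/ready.
pub struct Health {
    backend_loaded: AtomicBool,
    collector_alive: AtomicBool,
    draining: AtomicBool,
}

impl Health {
    pub fn new() -> Self {
        Health {
            backend_loaded: AtomicBool::new(false),
            // Assumed to stay true in modes that have no collector loop.
            collector_alive: AtomicBool::new(true),
            draining: AtomicBool::new(false),
        }
    }

    /// Called once the backend has produced valid metadata.
    pub fn set_backend_loaded(&self) {
        self.backend_loaded.store(true, Ordering::SeqCst);
    }

    /// Called when the batch collector loop exits for any reason.
    pub fn collector_exited(&self) {
        self.collector_alive.store(false, Ordering::SeqCst);
    }

    /// Called when shutdown begins.
    pub fn begin_drain(&self) {
        self.draining.store(true, Ordering::SeqCst);
    }

    /// HTTP status for /health/live.
    pub fn live_status(&self) -> u16 {
        if self.collector_alive.load(Ordering::SeqCst) { 204 } else { 503 }
    }

    /// HTTP status for /health/ready.
    pub fn ready_status(&self) -> u16 {
        let ready = self.backend_loaded.load(Ordering::SeqCst)
            && self.collector_alive.load(Ordering::SeqCst)
            && !self.draining.load(Ordering::SeqCst);
        if ready { 204 } else { 503 }
    }
}
```

With this shape, a dead collector loop fails both probes, while a drain fails only readiness, so an orchestrator stops routing traffic without restarting the process mid-drain.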

Metric names:

Metric	Type	Labels	Meaning
`narsil_requests_total`	counter	`transport`, `endpoint`, `status`	request count
`narsil_request_latency_seconds`	histogram	`transport`, `endpoint`, `status`	request latency
`narsil_in_flight_requests`	gauge	`transport`, `endpoint`	in-flight requests
`narsil_batch_size`	histogram	`transport`, `endpoint`	client-supplied requests per transport call
`narsil_fused_batch_size`	histogram	—	effective fused batch the `batch`-mode collector forms before `infer_batch`

The same gRPC health service remains registered through tonic-health; the HTTP readiness route is the scrape-friendly companion for load balancers and Kubernetes probes.

Backends

Backend (`NARSIL_BACKEND`)	Feature	Compute
`burn-affine`, `burn-mlp`, `burn-dlrm`	default	Burn CPU/`Flex`.
`burn-cuda-mlp`, `burn-cuda-dlrm`	`cuda`	Burn CUDA (CubeCL); FP32/FP16.
`burn-cuda-dlrm` + `NARSIL_DLRM_INTERACTION=cutile`	`cuda-cutile`	Burn + an opt-in fused cuTile interaction kernel.
`torch-dlrm`	`torch`	Route A — libtorch + FBGEMM via `tch-rs`.

The two families embody the thesis: Burn is the pure-Rust / custom-kernel path; torch is the native-torch path. Both sit behind the identical serving shell.

Route A: the torch backend

TorchDlrmBackend (src/backend/torch_dlrm.rs) serves a TorchRec TorchScript artifact directly:

- Startup: CUDA libtorch builds retain libtorch_cuda.so at link time (--no-as-needed when the library exists), so ATen's CUDA hooks are registered without a startup dlopen. The backend still dlopens the FBGEMM inference .so set with RTLD_GLOBAL so the artifact's fbgemm::* ops register before CModule::load runs on the CUDA device.
- Artifact: scripts/create_torchrec_dlrm_artifact.py generates the current-stack INT8 TorchScript package from third_party/torchrec into target/torchrec/.
- Per request: parses the Narsil request tensors, builds the CUDA Dict<str, Tensor> the model expects, runs forward_is, and decodes the "default" output tensor.
- Batching: infer_batch concatenates dense + sparse inputs across requests into one fused forward over the summed batch, then slices the output back per request.
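
The batching step can be sketched on plain vectors, without libtorch. This illustration makes one assumption that this page does not state: that the sparse ids are laid out feature-major (all samples of feature 0, then feature 1, and so on), so fusing requests means interleaving per feature rather than appending whole requests. `DlrmInput`, `Fused`, `fuse`, and `split_output` are hypothetical names, and inputs are assumed to be shape-checked already.

```rust
pub const DENSE: usize = 13;
pub const SPARSE: usize = 26;

/// One request's inputs: `dense` is [batch, 13] row-major,
/// `values` is [26 * batch], assumed feature-major, one id per feature.
#[derive(Debug, Clone)]
pub struct DlrmInput {
    pub batch: usize,
    pub dense: Vec<f32>,
    pub values: Vec<i32>,
}

/// Inputs for one fused forward over the summed batch.
#[derive(Debug, Clone, PartialEq)]
pub struct Fused {
    pub batch: usize,
    pub dense: Vec<f32>,
    pub lengths: Vec<i32>,
    pub values: Vec<i32>,
}

/// Concatenates requests into one batch. Panics on mis-sized inputs.
pub fn fuse(requests: &[DlrmInput]) -> Fused {
    let batch: usize = requests.iter().map(|r| r.batch).sum();

    // Dense rows are sample-major, so plain concatenation preserves order.
    let dense: Vec<f32> = requests.iter().flat_map(|r| r.dense.iter().copied()).collect();

    // Sparse ids: for each feature, append every request's slice for that feature.
    let mut values = Vec::with_capacity(SPARSE * batch);
    for f in 0..SPARSE {
        for r in requests {
            values.extend_from_slice(&r.values[f * r.batch..(f + 1) * r.batch]);
        }
    }

    // One id per feature per sample, so every length is 1.
    Fused { batch, dense, lengths: vec![1; SPARSE * batch], values }
}

/// Slices the fused [sum of batches] output back into per-request outputs.
pub fn split_output(output: &[f32], batches: &[usize]) -> Vec<Vec<f32>> {
    let mut offset = 0;
    batches
        .iter()
        .map(|&b| {
            let part = output[offset..offset + b].to_vec();
            offset += b;
            part
        })
        .collect()
}
```

If the real layout is sample-major instead, the sparse step reduces to the same plain concatenation used for the dense rows.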

Because CModule is Send + Sync, the backend holds an Arc<CModule>; concurrency is bounded by the chosen executor rather than a lock.

The DLRM I/O contract

Mirrors TorchRec's dlrm_predict.py:

forward(self, batch: Dict[str, Tensor]) -> Dict[str, Tensor]
  in : float_features              f32 [B, 13]
       id_list_features.lengths    i32 [B * 26]   (all ones: one id per feature)
       id_list_features.values     i32 [B * 26]
  out: "default"                   f32 [B]        (on CPU)

The same Narsil protobuf request (dense, lengths, values tensors) drives both the Burn and torch backends — no schema change to switch compute paths.
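
A request can be checked against this contract before it reaches either backend. The function below is a hedged sketch of such a check on already-decoded tensors: `check_dlrm_shapes` is a hypothetical helper, and rejecting an empty batch is an assumption of this example.

```rust
/// Validates decoded DLRM inputs against the contract above and returns the batch size B.
pub fn check_dlrm_shapes(
    float_features: &[f32],
    lengths: &[i32],
    values: &[i32],
) -> Result<usize, String> {
    const DENSE: usize = 13;
    const SPARSE: usize = 26;

    // float_features is [B, 13], so its length must be a non-zero multiple of 13.
    if float_features.is_empty() || float_features.len() % DENSE != 0 {
        return Err(format!(
            "float_features has {} elements, expected a non-zero multiple of {}",
            float_features.len(),
            DENSE
        ));
    }
    let b = float_features.len() / DENSE;

    // lengths is [B * 26] and must be all ones (one id per feature).
    if lengths.len() != b * SPARSE {
        return Err(format!("lengths has {} elements, expected {}", lengths.len(), b * SPARSE));
    }
    if lengths.iter().any(|&l| l != 1) {
        return Err("lengths must be all ones".to_string());
    }

    // With one id per feature, values has the same length as lengths.
    if values.len() != b * SPARSE {
        return Err(format!("values has {} elements, expected {}", values.len(), b * SPARSE));
    }
    Ok(b)
}
```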

Burn-native DLRM

BurnDlrmModel (src/backend/dlrm.rs) is the no-libtorch path. It keeps the same request contract and raw FP32 weight file as the earlier prototype, but the hot DLRM blocks are now batched:

- all embedding tables are stacked into one [tables * rows, dim] tensor, so sparse ids become one global index vector and one select, reshaped to [B, tables, dim];
- dense projection and sparse embeddings are concatenated into [B, 27, dim];
- the interaction is one batched Gram matmul plus one triangular gather, preserving the old pair ordering before concatenating with the dense vector and feeding the over-arch MLP.

The opt-in NARSIL_DLRM_INTERACTION=cutile path still consumes the same [B, 27, dim] feature tensor and replaces only the interaction block with the fused cuTile kernel.
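
Two of the batched steps are easy to show on plain slices: the global index into the stacked embedding table, and the pairwise interaction for one sample. This is a scalar sketch, not the Burn implementation. It assumes every table has the same `rows`, that ids are arranged as [B, tables] row-major, and that pairs are emitted in row-major order over the strictly lower triangle; the actual pair ordering in src/backend/dlrm.rs may differ. Both function names are hypothetical.

```rust
/// Maps per-table ids (laid out [B, tables] row-major) to row indices in one
/// stacked [tables * rows, dim] embedding tensor.
pub fn global_indices(ids: &[usize], tables: usize, rows: usize) -> Vec<usize> {
    ids.iter()
        .enumerate()
        // Position i belongs to table (i % tables); offset the id by that table's base row.
        .map(|(i, &id)| (i % tables) * rows + id)
        .collect()
}

/// Dot product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Pairwise feature interaction for one sample: the strictly lower triangle of the
/// Gram matrix F * F^T, where `features` holds n vectors of the same dimension.
pub fn interaction(features: &[Vec<f32>]) -> Vec<f32> {
    let n = features.len();
    let mut out = Vec::with_capacity(n * n.saturating_sub(1) / 2);
    for i in 0..n {
        for j in 0..i {
            out.push(dot(&features[i], &features[j]));
        }
    }
    out
}
```

For the 27 feature vectors per sample this yields 27 * 26 / 2 = 351 interaction terms; the batched version computes the same values with one matmul and one gather instead of nested loops.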

Build features

```bash
cargo build                          # default: Burn CPU backends
cargo build --features cuda          # + Burn CUDA backends
cargo build --features cuda-cutile   # + opt-in fused cuTile DLRM interaction
cargo build --features torch         # + route A (libtorch/FBGEMM via tch-rs)
cargo build --features quic          # + HTTP/3 over QUIC transport
```

The torch feature links libtorch 2.11 (via tch 0.24); the recommended build incantation points tch at an existing torch install: LIBTORCH_USE_PYTORCH=1 cargo build --features torch.
