Runtime & Language Support

InfraMind is designed as a runtime-agnostic execution layer. The protocol does not enforce a specific framework, language, or serving toolchain—instead, it relies on container isolation and schema-conformant interfaces to validate that a given model behaves as declared. This allows developers to deploy models using the libraries and runtimes they are most familiar with, so long as the output can be verified and the runtime environment is self-contained.

Every model must be served from within a container that exposes either a REST or gRPC endpoint. Input/output must comply with the declared schema in model.yaml. Beyond that, runtime selection is entirely up to the deployer.
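For intuition, this is what a schema-conformant call looks like from the client side. It is a sketch only: the localhost URL, port 9000, the /inference route, and the input/result field names are assumptions borrowed from the examples later in this section; the real values are whatever the deployer declares in model.yaml.

import requests

# Hypothetical endpoint -- the route, port, and field names must match
# what the model declares in model.yaml.
payload = {"input": "Summarize: InfraMind is a runtime-agnostic execution layer."}
resp = requests.post("http://localhost:9000/inference", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json()["result"])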


Python Runtime (FastAPI, Flask, raw)

Python is the most widely supported runtime on the mesh, typically served through FastAPI thanks to its speed, built-in schema validation, and async support.

Recommended for:

  • Transformers

  • Custom ML pipelines

  • Text-to-text models

  • Tabular inference

  • Fine-tuned LLMs and classical ML models (e.g. transformers, sentence-transformers, scikit-learn)

Example:

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/inference")
async def infer(req: Request):
    data = await req.json()
    # process() is the model-specific inference function; its input and
    # output must match the schema declared in model.yaml.
    output = process(data["input"])
    return {"result": output}

In model.yaml:

runtime: python3.10
protocol: rest
port: 9000
entrypoint: serve.py

The node agent uses the declared runtime to sandbox and execute the model, matching against expected schema and verifying response latency.
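For intuition only, the sketch below shows the kind of check the agent performs. It is not the agent's actual implementation; it assumes the declared output schema has already been translated into a JSON Schema document and that the latency requirement is a single fixed budget.

import time
import requests
from jsonschema import validate

# Illustrative sketch only -- not the node agent's real code.
OUTPUT_SCHEMA = {"type": "object", "required": ["result"]}  # assumed, derived from model.yaml
LATENCY_BUDGET_S = 2.0                                      # assumed latency budget

def verify_endpoint(url: str, sample_input: dict) -> bool:
    start = time.monotonic()
    resp = requests.post(url, json=sample_input, timeout=LATENCY_BUDGET_S)
    elapsed = time.monotonic() - start
    validate(instance=resp.json(), schema=OUTPUT_SCHEMA)  # raises on schema mismatch
    return resp.ok and elapsed <= LATENCY_BUDGET_S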


ONNX Runtime

ONNX models are supported via onnxruntime or onnxruntime-gpu. The container must install onnxruntime and expose a wrapper script that accepts JSON input, feeds it to the session, and returns serialized output.

Example ONNX wrapper:

import onnxruntime as ort
import numpy as np

# Load the exported graph once at container startup.
session = ort.InferenceSession("model.onnx")

def predict(input_vector):
    # The key "input" and the float32 dtype must match the graph's declared input tensor.
    inputs = {"input": np.array(input_vector).astype(np.float32)}
    return session.run(None, inputs)[0].tolist()
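The wrapper script must still accept JSON over REST, as described above. A minimal sketch of how run_onnx.py might expose predict() behind the same /inference route used elsewhere in this section; the "input" field name is an assumption and should mirror the declared schema.

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/inference")
async def infer(req: Request):
    data = await req.json()
    # predict() is the function defined above.
    return {"result": predict(data["input"])}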

ONNX runtime is typically used for:

  • Exported scikit-learn pipelines

  • Quantized transformers

  • Edge-optimized vision models

  • Language classifiers

model.yaml should declare:

runtime: python3.10
protocol: rest
entrypoint: run_onnx.py

TensorFlow Runtime

TensorFlow models must be served using either:

  • tensorflow-serving inside a container

  • A custom Flask/FastAPI wrapper using tf.keras.models.load_model

InfraMind recommends explicit wrappers for reproducibility and portability.

TensorFlow jobs often require GPU support. The base image should be:

FROM tensorflow/tensorflow:2.13.0-gpu

Containerized wrapper:

import tensorflow as tf

# Load the SavedModel exported to export/
model = tf.keras.models.load_model("export/")

def infer(input_data):
    return model.predict(input_data).tolist()

Job containers should disable eager execution, or otherwise run the forward pass in graph mode, when inference performance is critical.
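One way to do this is sketched below: the loaded Keras model is wrapped in tf.function so repeated calls reuse a traced graph instead of executing eagerly. Calling tf.compat.v1.disable_eager_execution() before the model is loaded is an alternative. The tensor shapes and dtypes here are assumptions, not part of the protocol.

import tensorflow as tf

model = tf.keras.models.load_model("export/")

# Trace the forward pass once; later calls reuse the compiled graph.
serving_fn = tf.function(lambda x: model(x, training=False))

def infer(input_data):
    return serving_fn(tf.convert_to_tensor(input_data)).numpy().tolist()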


FastAPI Runtime (Preferred REST server)

FastAPI is the default for REST-based serving in InfraMind. It allows schema enforcement, async request processing, and auto-documentation.

Benefits:

  • Integrates easily with model.yaml input/output schema

  • Compatible with JSON Schema validation

  • Runs under Uvicorn, optionally managed by Gunicorn with Uvicorn workers

CLI tooling assumes a default /inference route with a POST method, though custom paths can be configured in the manifest.
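A minimal sketch of what that schema enforcement looks like with Pydantic request/response models; the field names and the process() helper are assumptions standing in for the schema declared in model.yaml and the model-specific inference code.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class InferenceRequest(BaseModel):
    input: str

class InferenceResponse(BaseModel):
    result: str

@app.post("/inference", response_model=InferenceResponse)
async def infer(req: InferenceRequest):
    # Payloads that do not match InferenceRequest are rejected automatically,
    # and both models appear in the generated docs at /docs.
    return InferenceResponse(result=process(req.input))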


PyTorch (TorchServe)

TorchServe is supported as a model server backend. Containers must run torch-model-archiver to bundle the model and expose the REST inference API via port 8080.

Container layout:

/model_store/
  └── summarizer.mar
/config/
  └── config.properties

Start TorchServe in Dockerfile:

CMD ["torchserve", "--start", "--model-store", "model_store", "--models", "summarizer.mar"]

model.yaml example:

runtime: torchserve
entrypoint: summarizer.mar
port: 8080
resources:
  gpu: true

TorchServe is ideal for large vision or speech models, multi-modal systems, or inference that requires GPU-optimized batch execution.
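Once the container is running, the standard TorchServe inference API can be exercised directly. A quick smoke test in Python, assuming the summarizer model from the layout above; the exact payload format depends on the model's handler.

import requests

# TorchServe exposes registered models at /predictions/<model_name> on port 8080.
resp = requests.post(
    "http://localhost:8080/predictions/summarizer",
    data="InfraMind is a runtime-agnostic execution layer for AI models.",
    timeout=60,
)
print(resp.text)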


Rust (Coming Soon)

Rust support is in progress via WebAssembly (WASI) and native execution within secure enclaves. The goal is to allow lightweight, compiled model runners written in Rust to serve stateless functions with ultra-low latency.

Expected support:

  • tangram, tract, onnxruntime-sys

  • Native async web frameworks (axum, actix-web)

  • WASM/TEE model execution with deterministic IO

Example:

#[post("/inference")]
async fn infer(payload: Json<Input>) -> Json<Output> {
    let result = model.predict(&payload.data);
    Json(Output { result })
}

Runtime specification will use:

runtime: rust
protocol: rest
port: 9090

These containers will require static compilation and ABI compatibility. Signed WASM models may also be supported through enclave isolation.


Summary of Supported Runtimes

| Runtime | Container Requirements | Default Port | GPU Support | Use Case |
| --- | --- | --- | --- | --- |
| Python | python:3.10 | 9000 | optional | LLMs, NLP, tabular, text |
| ONNX | onnxruntime | 9000 | optional | Quantized, portable models |
| TensorFlow | tensorflow:2.x | 9000 | required | Image, audio, LLM fine-tunes |
| FastAPI | uvicorn | 9000 | optional | General REST wrapper |
| TorchServe | torchserve | 8080 | required | Large batch, computer vision |
| Rust (WASI) | Static binary / WASM | 9090 | no | Embedded, secure, ultra-fast |

All runtimes must expose input/output handlers that comply with the declared schema in model.yaml. The mesh assumes no language, framework, or dependency set beyond what is declared by the deployer inside the container.

InfraMind’s runtime layer is deliberately flexible rather than prescriptive. The only requirements are that your model runs deterministically, accepts schema-conformant input, returns valid JSON, and can be validated without trusting the runtime. Whether it is written in Python or Rust, or wrapped in a compiled inference runner, any model can become a global endpoint with a single registration.
