Job Lifecycle
The lifecycle of an InfraMind job defines how a single AI inference task moves from initial submission to verified execution and finally, reward distribution. This pipeline is built to be decentralized, fault-tolerant, and observable. The system is designed to support thousands of concurrent job flows with minimal central coordination, enabling high-throughput compute operations across a globally distributed mesh of independently run nodes.
Each job is tracked as a cryptographically signed object. It contains a payload, a reference to a containerized model, a resource profile, and a deadline. Every state transition — from submission to execution to settlement — is logged, validated, and either completed or fault-handled. All components of the lifecycle are designed to operate under variable latency, regional partitioning, or partial node failure.
The job lifecycle is composed of six core phases:
Submit
Schedule
Assign
Execute
Return
Reward
Submit
The job begins when a client sends a request to an InfraMind inference endpoint. This is a signed REST or gRPC call referencing a published model container.
Example request:
POST /inference/v1/summarizer-v1
Authorization: Bearer $TOKEN
Content-Type: application/json
{
  "text": "InfraMind replaces the cloud with a global runtime mesh."
}
The endpoint router parses the model ID, verifies the request format against the model’s input_schema, and encapsulates the job into an unsigned payload:
{
  "job_id": "fae7-2c1a",
  "model_ref": "QmXab123...",
  "input": { "text": "..." },
  "ttl": 4000,
  "region_hint": "eu-west"
}
This job is then handed to the nearest mesh scheduler for assignment.
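The sketch below shows the shape of this step in TypeScript: a structural check standing in for full input_schema validation, followed by construction of the job payload. The JobPayload fields mirror the example above; validateInput and the generated ID are illustrative, not the router’s actual implementation.

// Submit phase sketch: validate the request body, then wrap it into a job
// payload. validateInput is a stand-in for full input_schema validation.
import { randomUUID } from "node:crypto";

interface JobPayload {
  job_id: string;
  model_ref: string;
  input: Record<string, unknown>;
  ttl: number;           // end-to-end time budget in milliseconds
  region_hint?: string;
}

// Minimal structural check; a real router would compile the model's
// published input_schema instead.
function validateInput(input: unknown): input is { text: string } {
  return (
    typeof input === "object" &&
    input !== null &&
    typeof (input as { text?: unknown }).text === "string"
  );
}

function buildJob(modelRef: string, input: unknown, regionHint?: string): JobPayload {
  if (!validateInput(input)) {
    throw new Error("input does not match input_schema");
  }
  return {
    job_id: randomUUID(), // short IDs like "fae7-2c1a" are illustrative
    model_ref: modelRef,
    input,
    ttl: 4000,
    region_hint: regionHint,
  };
}

console.log(
  buildJob("QmXab123...", { text: "InfraMind replaces the cloud with a global runtime mesh." }, "eu-west"),
);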
Schedule
The scheduler’s role is to find the best candidate node from the current mesh view. Nodes are filtered and ranked based on:
Region proximity
Current availability
Required hardware profile
Success rate
Stake reputation
Historical latency
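As a rough illustration, the ranking can be expressed as a filter over hard requirements followed by a weighted score. The weights and field names below are assumptions for the sketch, not protocol constants:

// Schedule phase sketch: hard-filter unavailable or under-specced nodes,
// then rank the rest by a weighted score over the criteria listed above.
interface NodeView {
  id: string;
  region: string;
  available: boolean;
  hasHardwareProfile: boolean; // meets the job's resource profile
  successRate: number;         // 0..1, rolling window
  stakeReputation: number;     // 0..1, normalized
  p95LatencyMs: number;        // historical latency percentile
}

function rankCandidates(nodes: NodeView[], regionHint: string): NodeView[] {
  return nodes
    .filter((n) => n.available && n.hasHardwareProfile)   // hard filters
    .map((n) => ({
      node: n,
      score:
        (n.region === regionHint ? 1.0 : 0.0) * 0.3 +     // region proximity
        n.successRate * 0.3 +                             // success rate
        n.stakeReputation * 0.2 +                         // stake reputation
        (1 / (1 + n.p95LatencyMs / 100)) * 0.2,           // historical latency
    }))
    .sort((a, b) => b.score - a.score)
    .map((s) => s.node);
}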
The scheduler signs the job assignment and forwards it to the selected node via secure pub/sub or push socket:
{
  "job_id": "fae7-2c1a",
  "assigned_node": "0xB1f...",
  "signature": "0xabc...",
  "deadline": 1719552000
}
Job assignment includes a strict timeout. If the assigned node does not acknowledge within t_ack = 300ms, the scheduler triggers fallback to the next candidate.
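A minimal sketch of that handshake, assuming a sendAssignment transport whose promise resolves when the node acknowledges:

// Assignment handshake sketch: try each ranked candidate in turn and fall
// back if no acknowledgement arrives within t_ack.
const T_ACK_MS = 300;

async function assignWithFallback(
  jobId: string,
  candidates: string[],
  sendAssignment: (nodeId: string, jobId: string) => Promise<void>, // resolves on ack
): Promise<string> {
  for (const nodeId of candidates) {
    const ackTimeout = new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error("t_ack_timeout")), T_ACK_MS),
    );
    try {
      await Promise.race([sendAssignment(nodeId, jobId), ackTimeout]);
      return nodeId; // acknowledged within t_ack
    } catch {
      // timed out or refused: fall back to the next ranked candidate
    }
  }
  throw new Error("no candidate acknowledged the assignment");
}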
Assign
The node receives the job, verifies the scheduler signature, checks for container availability (local cache or pull), and reserves compute resources for execution. If the node is overloaded, or the job violates runtime limits (e.g., an untrusted input format), the job is declined.
Accepted jobs move to the running state. Jobs that time out or are rejected are reassigned.
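In sketch form, the node-side admission decision might look like the following; verifySchedulerSig, the 0.9 load threshold, and the containerCached flag are illustrative stand-ins:

// Assign phase sketch: verify, check capacity and deadline, then either
// reserve resources (running) or decline.
interface Assignment {
  job_id: string;
  deadline: number; // unix seconds
  signature: string;
}

function admit(
  a: Assignment,
  verifySchedulerSig: (a: Assignment) => boolean,
  currentLoad: number,     // 0..1 utilization
  containerCached: boolean,
): "running" | "declined" {
  if (!verifySchedulerSig(a)) return "declined";         // bad scheduler signature
  if (currentLoad > 0.9) return "declined";              // node overloaded
  if (a.deadline * 1000 < Date.now()) return "declined"; // deadline already passed
  if (!containerCached) {
    // pull the model container before reserving resources (not shown)
  }
  return "running"; // resources reserved for execution
}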
Execute
The container is launched in a sandboxed environment. The node executes the model entrypoint, feeds it the input payload, and records stdout/stderr, execution duration, and memory usage. Execution must complete within the declared TTL.
Execution container logs:
{
  "started_at": 1719551943,
  "latency_ms": 318,
  "output": {
    "summary": "InfraMind enables decentralized inference."
  },
  "container_exit_code": 0
}
Node agents validate the result against output_schema. Invalid outputs are rejected, logged, and flagged. Nodes that repeatedly return malformed responses face slashing penalties and reduced scheduling priority.
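The TTL bound and the schema check can be combined into one guard around the container run. In the sketch below, runContainer and validateOutput are assumed stand-ins for the sandboxed launch and the compiled output_schema:

// Execute phase sketch: race the container run against the job's TTL, then
// validate the result before it can be returned.
interface ExecResult {
  output: unknown;
  latency_ms: number;
  container_exit_code: number;
}

async function executeWithTtl(
  runContainer: () => Promise<ExecResult>,
  ttlMs: number,
  validateOutput: (o: unknown) => boolean, // stands in for output_schema check
): Promise<ExecResult> {
  const started = Date.now();
  const ttl = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("t_exec_max exceeded")), ttlMs),
  );
  const result = await Promise.race([runContainer(), ttl]);
  if (result.container_exit_code !== 0 || !validateOutput(result.output)) {
    throw new Error("invalid_output"); // rejected, logged, and flagged
  }
  return { ...result, latency_ms: Date.now() - started };
}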
Return
Upon successful execution, the node generates a proof of serving:
{
  "job_id": "fae7-2c1a",
  "output_hash": "0x7b8f...",
  "latency_ms": 318,
  "node_id": "0xB1f...",
  "signature": "0x6cf1..."
}
The result is returned to the client via the original endpoint, and the signed proof is relayed to the reward oracle for post-verification.
The job transitions to completed.
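A sketch of proof construction, hashing the serialized output with SHA-256 and delegating to the node’s signing key; which fields the real signature covers is not specified here, so the sign helper is an assumption:

// Return phase sketch: hash the output, then sign to produce the proof.
import { createHash } from "node:crypto";

interface ProofOfServing {
  job_id: string;
  output_hash: string;
  latency_ms: number;
  node_id: string;
  signature: string;
}

function buildProof(
  jobId: string,
  output: unknown,
  latencyMs: number,
  nodeId: string,
  sign: (digest: string) => string, // node's key scheme, assumed
): ProofOfServing {
  const output_hash =
    "0x" + createHash("sha256").update(JSON.stringify(output)).digest("hex");
  return {
    job_id: jobId,
    output_hash,
    latency_ms: latencyMs,
    node_id: nodeId,
    signature: sign(output_hash), // real proofs may sign additional fields
  };
}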
Reward
Final settlement happens either off-chain or on-chain, depending on reward configuration.
On-chain reward transaction:
infra claim --job fae7-2c1a --node 0xB1f...
Rewards are calculated from:
Job complexity (resource multiplier)
Node latency percentile
Stake-weighted bonus
Job type (real-time, batched, confidential)
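One plausible way these factors combine, with illustrative weights, multipliers, and base rate (the protocol’s actual formula is not given here):

// Reward sketch: base rate scaled by complexity, latency, stake, and job type.
type JobType = "real-time" | "batched" | "confidential";

const TYPE_MULTIPLIER: Record<JobType, number> = {
  "real-time": 1.5,
  batched: 1.0,
  confidential: 2.0,
};

function rewardInfra(
  baseRate: number,           // $INFRA per unit of work
  resourceMultiplier: number, // job complexity
  latencyPercentile: number,  // 0..1, lower is faster
  stakeWeight: number,        // 0..1 stake-weighted bonus factor
  jobType: JobType,
): number {
  const latencyBonus = 1 + (1 - latencyPercentile) * 0.25; // faster nodes earn more
  const stakeBonus = 1 + stakeWeight * 0.1;
  return baseRate * resourceMultiplier * latencyBonus * stakeBonus * TYPE_MULTIPLIER[jobType];
}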
Nodes with verified proof are paid in $INFRA tokens. Failed or dishonest nodes are slashed according to severity.
Timeouts and Failures
t_ack_timeout: if the node doesn’t acknowledge within the threshold, the job is reassigned
t_exec_max: if the node exceeds the runtime TTL, the job is cancelled and reassigned
invalid_output: a schema mismatch is a soft failure; the job is retriable
node_offline: the node is marked temporarily unavailable and deprioritized
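A dispatcher over these failure modes might look like this sketch; the handler bodies are stubs:

// Failure-handling sketch: map each failure mode to its recovery action.
type FailureKind = "t_ack_timeout" | "t_exec_max" | "invalid_output" | "node_offline";

function handleFailure(kind: FailureKind, jobId: string, nodeId: string): void {
  switch (kind) {
    case "t_ack_timeout":
      reassign(jobId);                        // next candidate in the fallback queue
      break;
    case "t_exec_max":
      cancel(jobId); reassign(jobId);         // cancel the run, then reassign
      break;
    case "invalid_output":
      retry(jobId);                           // soft failure: retriable
      break;
    case "node_offline":
      deprioritize(nodeId); reassign(jobId);  // mark node unavailable, reroute
      break;
  }
}

// Stub implementations so the sketch compiles.
function reassign(jobId: string) { console.log("reassign", jobId); }
function cancel(jobId: string) { console.log("cancel", jobId); }
function retry(jobId: string) { console.log("retry", jobId); }
function deprioritize(nodeId: string) { console.log("deprioritize", nodeId); }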
Retry Logic
The scheduler maintains a fallback queue. If all nodes fail or time out, the job is held in a retry buffer (default: 3s). If no candidate responds after n_retries = 3, the job is marked failed, and the user receives an error.
Retries are issued with exponential backoff and region scope widening.
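A sketch of that loop, using the 3s buffer and n_retries = 3 from above; the region tiers and the tryRegions dispatch are illustrative:

// Retry sketch: exponential backoff plus region scope widening per attempt.
const RETRY_BUFFER_MS = 3000;
const N_RETRIES = 3;

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryJob(
  jobId: string,
  regionTiers: string[][], // e.g. [["eu-west"], ["eu-west", "eu-central"], ["*"]]
  tryRegions: (jobId: string, regions: string[]) => Promise<boolean>,
): Promise<"completed" | "failed"> {
  for (let attempt = 0; attempt < N_RETRIES; attempt++) {
    const regions = regionTiers[Math.min(attempt, regionTiers.length - 1)];
    if (await tryRegions(jobId, regions)) return "completed";
    await sleep(RETRY_BUFFER_MS * 2 ** attempt); // exponential backoff
  }
  return "failed"; // surfaced to the user as an error
}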
Slashing Events
Slashing is enforced for:
Proof forgery (invalid signature)
Repeated malformed outputs
Fake resource declarations
Over-promising node capacity
Refusal of jobs after acceptance
Slashed stake is partially burned and partially redistributed to high-performing nodes in the same epoch.
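For illustration only, the split could be computed as below; the 50/50 ratio is an assumption, since the text says only “partially” for each side:

// Slashing settlement sketch: split slashed stake between burn and
// redistribution to high-performing nodes in the same epoch.
function settleSlash(slashedStake: number, burnRatio = 0.5) {
  const burned = slashedStake * burnRatio;         // burnRatio is an assumption
  const redistributed = slashedStake - burned;     // to high performers this epoch
  return { burned, redistributed };
}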
Job lifecycle guarantees include:
At-most-once execution
Transparent fallback
Deterministic routing trace
Per-job audit logs
User-verifiable receipts
The system does not assume perfect conditions. It is built to recover, retry, and reassign under unpredictable latency, machine availability, or partial node failure. This enables InfraMind to operate without orchestration servers, without fixed clusters, and without privileged regions—delivering runtime as a protocol, not a product.