Job Lifecycle

The lifecycle of an InfraMind job defines how a single AI inference task moves from initial submission through verified execution to final reward distribution. This pipeline is built to be decentralized, fault-tolerant, and observable. The system is designed to support thousands of concurrent job flows with minimal central coordination, enabling high-throughput compute operations across a globally distributed mesh of independently run nodes.

Each job is tracked as a cryptographically signed object containing a payload, a reference to a containerized model, a resource profile, and a deadline. Every state transition — from submission to execution to settlement — is logged, validated, and either completed or handled as a fault. All components of the lifecycle are designed to operate under variable latency, regional partitioning, and partial node failure.

The job lifecycle is composed of six core phases:

  1. Submit

  2. Schedule

  3. Assign

  4. Execute

  5. Return

  6. Reward

Submit

The job begins when a client sends a request to an InfraMind inference endpoint. This is a signed REST or gRPC call referencing a published model container.

Example request:

POST /inference/v1/summarizer-v1
Authorization: Bearer $TOKEN
Content-Type: application/json

{
  "text": "InfraMind replaces the cloud with a global runtime mesh."
}
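
For illustration, a minimal client submission in Python might look like the sketch below. The endpoint host and token handling are assumptions, not part of the documented API:

import requests  # third-party HTTP client: pip install requests

# Hypothetical endpoint host and bearer token; adjust for your deployment.
ENDPOINT = "https://mesh.example.net/inference/v1/summarizer-v1"
TOKEN = "..."  # issued out of band

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"text": "InfraMind replaces the cloud with a global runtime mesh."},
    timeout=4.0,  # client-side cap, mirroring the 4000 ms job TTL
)
resp.raise_for_status()
print(resp.json())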

The endpoint router parses the model ID, validates the request body against the model’s input_schema, and wraps the job in a payload, which remains unsigned until the scheduler assigns it:

{
  "job_id": "fae7-2c1a",
  "model_ref": "QmXab123...",
  "input": { "text": "..." },
  "ttl": 4000,
  "region_hint": "eu-west"
}
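
A sketch of that router step, assuming the model’s input_schema is a JSON Schema document and using the jsonschema library; the helper name and ID scheme are illustrative only:

import uuid

from jsonschema import ValidationError, validate  # pip install jsonschema

def build_job(model_ref: str, input_schema: dict, body: dict,
              ttl_ms: int = 4000, region_hint: str = "eu-west") -> dict:
    """Validate the request body, then wrap it in an unsigned job payload."""
    try:
        validate(instance=body, schema=input_schema)
    except ValidationError as err:
        raise ValueError(f"input rejected: {err.message}")
    return {
        "job_id": uuid.uuid4().hex[:8],  # illustrative; real IDs may differ
        "model_ref": model_ref,
        "input": body,
        "ttl": ttl_ms,
        "region_hint": region_hint,
    }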

This job is then handed to the nearest mesh scheduler for assignment.

Schedule

The scheduler’s role is to select the best candidate node from its current view of the mesh. Nodes are filtered and ranked on the following criteria (a ranking sketch follows the list):

  • Region proximity

  • Current availability

  • Required hardware profile

  • Success rate

  • Stake reputation

  • Historical latency
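
A minimal ranking sketch, assuming normalized node metrics; the field names and weights below are illustrative, not the protocol’s actual scoring function:

from dataclasses import dataclass

@dataclass
class NodeView:
    node_id: str
    region: str
    available: bool
    has_hw_profile: bool      # satisfies the job's hardware requirements
    success_rate: float       # 0.0 - 1.0
    stake_reputation: float   # normalized 0.0 - 1.0
    p50_latency_ms: float     # historical median latency

def rank_candidates(nodes: list[NodeView], region_hint: str) -> list[NodeView]:
    """Filter out ineligible nodes, then rank the rest by a composite score."""
    eligible = [n for n in nodes if n.available and n.has_hw_profile]

    def score(n: NodeView) -> float:
        proximity = 1.0 if n.region == region_hint else 0.5
        latency = 1.0 / (1.0 + n.p50_latency_ms / 100.0)  # lower is better
        return proximity + n.success_rate + n.stake_reputation + latency

    return sorted(eligible, key=score, reverse=True)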

The scheduler signs the job assignment and forwards it to the selected node via secure pub/sub or push socket:

{
  "job_id": "fae7-2c1a",
  "assigned_node": "0xB1f...",
  "signature": "0xabc...",
  "deadline": 1719552000
}

Job assignment includes a strict timeout. If the assigned node does not acknowledge within t_ack = 300ms, the scheduler triggers fallback to the next candidate.
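
One way to picture the acknowledgement window (transport details omitted; send and wait_for_ack are hypothetical helpers):

import time

T_ACK_S = 0.300  # t_ack = 300 ms

def assign_with_fallback(job, ranked_nodes, send, wait_for_ack):
    """Offer the job to candidates in rank order until one acknowledges."""
    for node in ranked_nodes:
        send(node, job)  # signed assignment over pub/sub or push socket
        deadline = time.monotonic() + T_ACK_S
        if wait_for_ack(job["job_id"], node, until=deadline):
            return node  # node accepted within t_ack
    return None          # no taker; caller pushes the job to the retry buffer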

Assign

The node receives the job, verifies the scheduler signature, checks container availability (local cache or pull), and reserves compute resources for execution. If the node is overloaded, or the job violates runtime limits (e.g., an untrusted input format), the job is declined.

Accepted jobs move to the running state. Jobs that time out or are rejected are reassigned (see the admission sketch below).
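
Node-side admission might look roughly like this; every helper on node is hypothetical, standing in for signature verification, the container cache, and resource accounting:

def handle_assignment(assignment: dict, job: dict, node) -> str:
    """Decide whether this node accepts, declines, or rejects a job."""
    if not node.verify_scheduler_signature(assignment):
        return "reject"                        # forged or stale assignment
    if not node.container_cached(job["model_ref"]):
        node.pull_container(job["model_ref"])  # fetch the image first
    if node.overloaded() or not node.input_within_limits(job["input"]):
        return "decline"                       # scheduler reassigns the job
    node.reserve_resources(job)
    return "accept"                            # job moves to the running state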

Execute

The container is launched in a sandboxed environment. The node executes the model entrypoint, feeds it the input payload, and records stdout/stderr, execution duration, and memory usage. Execution must complete within the declared TTL.

Example execution record:

{
  "started_at": 1719551943,
  "latency_ms": 318,
  "output": {
    "summary": "InfraMind enables decentralized inference."
  },
  "container_exit_code": 0
}

Node agents validate the result against output_schema. Invalid outputs are rejected, logged, and flagged. Nodes that repeatedly return malformed responses face slashing penalties and reduced scheduling priority.
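
The node-agent check reduces to schema validation over the container output, for example with the jsonschema library (a sketch, assuming output_schema is a JSON Schema document):

from jsonschema import ValidationError, validate

def result_is_valid(output: dict, output_schema: dict) -> bool:
    """Return True only if the container output conforms to the schema."""
    try:
        validate(instance=output, schema=output_schema)
        return True
    except ValidationError:
        return False  # rejected, logged, and flagged upstream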

Return

Upon successful execution, the node generates a proof of serving:

{
  "job_id": "fae7-2c1a",
  "output_hash": "0x7b8f...",
  "latency_ms": 318,
  "node_id": "0xB1f...",
  "signature": "0x6cf1..."
}
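
The output_hash can be computed over a canonical serialization of the result; the sketch below assumes SHA-256 and an injected sign function, both illustrative choices:

import hashlib
import json

def make_proof(job_id: str, output: dict, latency_ms: int,
               node_id: str, sign) -> dict:
    """Build a proof-of-serving record; `sign` is a hypothetical signer."""
    canonical = json.dumps(output, sort_keys=True, separators=(",", ":"))
    output_hash = "0x" + hashlib.sha256(canonical.encode()).hexdigest()
    proof = {
        "job_id": job_id,
        "output_hash": output_hash,
        "latency_ms": latency_ms,
        "node_id": node_id,
    }
    proof["signature"] = sign(json.dumps(proof, sort_keys=True).encode())
    return proof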

The result is returned to the client via the original endpoint, and the signed proof is relayed to the reward oracle for post-verification.

The job transitions to completed.

Reward

Final settlement happens either off-chain or on-chain, depending on reward configuration.

Example on-chain claim, issued via the CLI:

infra claim --job fae7-2c1a --node 0xB1f...

Rewards are calculated from the following factors (a toy formula follows the list):

  • Job complexity (resource multiplier)

  • Node latency percentile

  • Stake-weighted bonus

  • Job type (real-time, batched, confidential)
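
A toy version of that calculation; the base reward, multipliers, and weights are assumptions for illustration, not protocol constants:

BASE_REWARD = 1.0  # in $INFRA; hypothetical

TYPE_MULTIPLIER = {"real-time": 1.5, "batched": 1.0, "confidential": 2.0}

def reward(complexity: float, latency_percentile: float,
           stake_weight: float, job_type: str) -> float:
    """Combine the four factors into a single payout amount."""
    latency_bonus = max(0.0, 1.0 - latency_percentile)  # faster = larger bonus
    return (BASE_REWARD
            * complexity                    # resource multiplier
            * TYPE_MULTIPLIER[job_type]
            * (1.0 + latency_bonus)
            * (1.0 + stake_weight))         # stake-weighted bonus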

Nodes with verified proof are paid in $INFRA tokens. Failed or dishonest nodes are slashed according to severity.

Timeouts and Failures

  • t_ack_timeout: if the node does not acknowledge within the threshold, the job is reassigned

  • t_exec_max: if the node exceeds the runtime TTL, the job is cancelled and reassigned

  • invalid_output: a schema mismatch is a soft failure; the job is retriable

  • node_offline: the node is marked temporarily unavailable and deprioritized

Retry Logic

The scheduler maintains a fallback queue. If all candidate nodes fail or time out, the job is held in a retry buffer (default: 3s). If no candidate responds after n_retries = 3, the job is marked failed and the user receives an error.

Retries are issued with exponential backoff and a progressively widened region scope, as sketched below.
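
Put together, the retry loop behaves roughly as follows; schedule_once and widen_region are hypothetical helpers standing in for scheduler internals:

import time

N_RETRIES = 3
RETRY_BUFFER_S = 3.0  # default hold in the retry buffer

def retry_schedule(job, schedule_once, widen_region):
    """Retry with exponential backoff, widening the region scope each pass."""
    delay = RETRY_BUFFER_S
    for _ in range(N_RETRIES):
        time.sleep(delay)             # job sits in the retry buffer
        node = schedule_once(job)
        if node is not None:
            return node               # assignment succeeded
        job = widen_region(job)       # e.g. eu-west -> eu -> global (assumed)
        delay *= 2                    # exponential backoff
    return None                       # job marked failed; user gets an error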

Slashing Events

Slashing is enforced for:

  • Proof forgery (invalid signature)

  • Repeated malformed outputs

  • Fake resource declarations

  • Over-promising node capacity

  • Refusal of jobs after acceptance

Slashed stake is partially burned and partially redistributed to high-performing nodes in the same epoch.
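
A sketch of that split, with an assumed burn fraction (the real ratio is protocol-defined) and an even redistribution among the epoch’s high performers:

BURN_FRACTION = 0.5  # assumed split; the actual ratio is set by the protocol

def settle_slash(slashed_stake: float, top_nodes: list[str]) -> dict:
    """Burn part of the slashed stake, redistribute the rest evenly."""
    burned = slashed_stake * BURN_FRACTION
    pool = slashed_stake - burned
    share = pool / len(top_nodes) if top_nodes else 0.0
    return {"burned": burned,
            "payouts": {node: share for node in top_nodes}}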

Job lifecycle guarantees include:

  • At-most-once execution

  • Transparent fallback

  • Deterministic routing trace

  • Per-job audit logs

  • User-verifiable receipts

The system does not assume perfect conditions. It is built to recover, retry, and reassign under unpredictable latency, fluctuating machine availability, and partial node failure. This lets InfraMind operate without orchestration servers, without fixed clusters, and without privileged regions — delivering runtime as a protocol, not a product.
