Monitoring Your Node

Monitoring an InfraMind node is essential for maintaining high uptime, optimizing job performance, and ensuring that resource declarations reflect actual capacity. The node agent exposes a comprehensive local interface for telemetry inspection, historical usage analysis, and fault diagnostics. These observability layers are native to the InfraMind runtime and require no third-party plugins, though external integrations (Prometheus, Grafana) are also supported for full-stack operators.

Logs, job receipts, node statistics, and scheduler activity are all accessible via command-line tools and optionally via a local or remote Web UI.

Log Access

All node logs are written to:

~/.inframind/logs/inframind.log

These include:

  • Container execution logs

  • Job assignment metadata

  • Scheduling decisions and fallbacks

  • Heartbeat acknowledgments

  • Reward claim results

  • Errors and runtime exceptions

To follow the log in real time:

tail -f ~/.inframind/logs/inframind.log

Or using journalctl (for systemd deployments):

journalctl -u inframind-node -f
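
Since errors and runtime exceptions land in the same file, a quick grep surfaces recent failures (the pattern is illustrative; adjust it to the message format you actually see):

grep -iE "error|exception" ~/.inframind/logs/inframind.log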

Logs are rotated every 24 hours by default and compressed after 7 days. To modify this behavior, update your system’s logrotate configuration.
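
For reference, a minimal logrotate stanza might look like the following sketch (the path and retention values are illustrative; logrotate does not expand ~, so substitute the node user's absolute home path):

# sketch only; adjust the path and retention to your deployment
/home/infra/.inframind/logs/inframind.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    # copytruncate rotates in place so the agent keeps its open file handle
    copytruncate
}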

Job History

Every job executed by a node is tracked locally as a receipt:

~/.inframind/receipts/{job_id}.json

Each receipt includes:

{
  "job_id": "f23a-8e9c",
  "timestamp": 1719823198,
  "latency_ms": 247,
  "model_ref": "ipfs://QmU1Wxje...",
  "status": "success",
  "output_hash": "0xab8f...",
  "node_signature": "0x4921...",
  "reward": "1.21 INFRA"
}

Receipts can be queried using the CLI:

infra jobs --limit 10

Or to inspect a specific job:

infra job --id f23a-8e9c
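
Because receipts are plain JSON files (one object per file, as shown above), standard tools such as jq can aggregate them directly alongside the CLI:

# count successful jobs across all local receipts
jq -s 'map(select(.status == "success")) | length' ~/.inframind/receipts/*.json

# average latency in milliseconds across all receipts
jq -s 'map(.latency_ms) | add / length' ~/.inframind/receipts/*.json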

Historical job statistics (rolling averages, success rates, proof timestamps) are computed at runtime and displayed via:

infra stats

Node Status

To monitor node health, system load, and connection state, use:

infra status

Typical output:

Node ID:       0xA39f2b...
Uptime:        19h 32m
Jobs Served:   234
Avg Latency:   213ms
Stake:         200 INFRA
GPU Enabled:   true
CPU Usage:     22%
Memory Usage:  3.8 GB / 8.0 GB
Cache:         8.2 GB used / 25 GB allocated
Current Region: europe-west
Mesh Connected: true
Heartbeat OK:  every 5s

This command aggregates local telemetry with the last scheduler response. If your node is offline or behind a NAT/firewall, Mesh Connected will report false and the scheduler will deprioritize the node.
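
Because the output is line-oriented, a simple cron-driven health check can watch for mesh disconnects. A minimal sketch (the field name matches the output above; the alert action is a placeholder to adapt):

#!/usr/bin/env bash
# alert when the node reports a lost mesh connection
if infra status | grep -q "Mesh Connected: false"; then
    echo "InfraMind node lost mesh connectivity at $(date -u)" >&2
    # e.g. forward to a webhook of your choice:
    # curl -fsS -X POST "$ALERT_WEBHOOK_URL" -d '{"alert":"mesh_disconnected"}'
fi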

Web UI Dashboard

The InfraMind node agent exposes a local monitoring dashboard by default at:

http://localhost:5050

If running remotely:

http://<your-node-ip>:5050

Dashboard sections include:

  • Job timeline

  • Live system metrics

  • Container cache viewer

  • Reward claim history

  • Model execution summaries

  • Scheduler handshake logs

  • Node reputation trend

The dashboard binds to localhost only by default. To expose it over the internet, place it behind a reverse proxy such as nginx or Caddy and add an authentication layer, as in the sketch below.
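
A minimal nginx sketch with HTTP basic auth (the hostname and certificate paths are placeholders; create the htpasswd file yourself):

server {
    listen 443 ssl;
    server_name node.example.com;

    ssl_certificate     /etc/ssl/certs/node.pem;
    ssl_certificate_key /etc/ssl/private/node.key;

    location / {
        # require credentials before proxying to the local dashboard
        auth_basic           "InfraMind Dashboard";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:5050;
        proxy_set_header     Host $host;
    }
}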

To disable the dashboard or change its port, edit the dashboard block in your node configuration:

dashboard:
  enabled: true   # set to false to disable the dashboard
  port: 5050

Or via an environment variable:

export INFRA_DASHBOARD_PORT=8080

Prometheus & Grafana Integration

InfraMind nodes expose a native Prometheus exporter on port 9100:

http://localhost:9100/metrics

Example metrics:

infra_node_jobs_total 342
infra_node_latency_avg_ms 213
infra_node_gpu_available 1
infra_node_container_cache_hits 82
infra_node_stake_total 200.0
infra_node_sla_uptime 0.9912
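
To verify the exporter is reachable and filter for node metrics:

curl -s http://localhost:9100/metrics | grep '^infra_node_'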

Scrape config for prometheus.yml:

scrape_configs:
  - job_name: 'inframind'
    static_configs:
      - targets: ['localhost:9100']

Grafana dashboards are available via InfraMind community templates or can be created manually using PromQL.

Example PromQL:

rate(infra_node_jobs_total[5m])
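
Building on these metrics, an alerting rule can flag SLA degradation. A sketch of a Prometheus rule file (the metric name comes from the sample output above; the threshold is illustrative):

groups:
  - name: inframind
    rules:
      - alert: InfraMindLowUptime
        # fires if SLA uptime stays below 99% for 10 minutes
        expr: infra_node_sla_uptime < 0.99
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "InfraMind node SLA uptime below 99%"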

Custom Alerts and Diagnostics

To test execution health, simulate a local job run:

infra simulate --model ./summarizer.yaml --input test.json

For GPU stress tests:

infra benchmark --type=gpu --duration=60s

To configure watchdog-style auto-recovery:

infra watchdog enable

This process pings the agent every 60 seconds and restarts it on memory exhaustion, hung timeouts, or a failed handshake response.
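
For systemd deployments, service-level restarts can complement the watchdog. A sketch of a unit drop-in (the unit name matches the journalctl example above):

# /etc/systemd/system/inframind-node.service.d/override.conf
[Service]
Restart=on-failure
RestartSec=10

Apply it with systemctl daemon-reload followed by systemctl restart inframind-node.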

Remote Monitoring (Optional)

You can register your node with a fleet monitoring UI via the cloud operator interface. It provides:

  • Uptime leaderboard

  • Reward heatmaps

  • Regional job density

  • Latency scatter plot

  • Offline notifications (via webhook or Telegram)

This is opt-in and uses anonymized metadata only. Activate it with:

infra cloud join --node 0xABC --name "Node 🇳🇱 - Rotterdam"

Summary

All InfraMind nodes provide introspection tools for both personal use and automated monitoring. Whether you run a single bare-metal box or manage a GPU fleet across regions, logs, metrics, and diagnostics are first-class citizens. Full transparency of execution behavior ensures that nodes can be tuned, debugged, and optimized for consistent, high-performance operation in the mesh.
