Monitoring your Node

Monitoring an InfraMind node is essential for maintaining high uptime, optimizing job performance, and ensuring that resource declarations reflect actual capacity. The node agent exposes a comprehensive local interface for telemetry inspection, historical usage analysis, and fault diagnostics. These observability layers are native to the InfraMind runtime and require no third-party plugins, though external integrations (Prometheus, Grafana) are also supported for full-stack operators.

Logs, job receipts, node statistics, and scheduler activity are all accessible via command-line tools and optionally via a local or remote Web UI.

Log Access

All node logs are written to:

~/.inframind/logs/inframind.log

These include:

  • Container execution logs

  • Job assignment metadata

  • Scheduling decisions and fallbacks

  • Heartbeat acknowledgments

  • Reward claim results

  • Errors and runtime exceptions

Example command to view in real-time:

tail -f ~/.inframind/logs/inframind.log

Or using journalctl (for systemd deployments):

journalctl -u inframind-node -f

Logs are rotated every 24 hours by default and compressed after 7 days. To modify this behavior, update your system’s logrotate configuration.
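
As a sketch, a logrotate drop-in mirroring those defaults might look like the following; the /etc/logrotate.d/ path and the home-directory glob are assumptions, so adjust them to your install:

# /etc/logrotate.d/inframind — illustrative drop-in; path and glob are assumptions
# "daily" mirrors the default 24-hour rotation; "rotate 7" keeps about a week of history
/home/*/.inframind/logs/inframind.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}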

Job History

Every job executed by a node is tracked locally as a receipt. Each receipt captures the job's execution details, including timing, outcome, and proof timestamps.

Receipts can be queried using the CLI, either in aggregate or for a specific job by its ID. Historical job statistics (rolling averages, success rates, proof timestamps) are also calculated and presented at runtime.
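
The exact subcommands are release-specific and not reproduced here; the invocations below are illustrative assumptions only (hypothetical jobs subcommand and placeholder job ID):

# Hypothetical: list recent receipts
inframind jobs list

# Hypothetical: inspect a single receipt by job ID
inframind jobs show <job-id>

# Hypothetical: rolling statistics across receipts
inframind jobs stats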

Node Status

To monitor node health, system load, and connection state, run the agent's status command.
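
Both the subcommand name and the output layout below are assumptions rather than the documented interface; treat them as a sketch of the kind of fields to expect:

# Hypothetical status invocation
inframind status

# Illustrative output only — real field names may differ
mesh_connected:  true
cpu_load:        0.42
gpu_util:        18%
last_heartbeat:  3s ago
active_jobs:     1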

This command aggregates local telemetry and the last scheduler response. If your node is offline or behind NAT/firewall, mesh status will return false and the node will be deprioritized.

Web UI Dashboard

The InfraMind node agent exposes a local monitoring dashboard, bound to localhost on the node by default.

If the node is remote, forward the dashboard port to your workstation rather than exposing the dashboard directly.
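
One common approach is an SSH tunnel; the port below is a placeholder for whatever port your dashboard is configured to use:

# Forward the remote dashboard to your workstation (replace <port> and the host)
ssh -N -L <port>:localhost:<port> user@node-host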

Dashboard sections include:

  • Job timeline

  • Live system metrics

  • Container cache viewer

  • Reward claim history

  • Model execution summaries

  • Scheduler handshake logs

  • Node reputation trend

This service is secured by binding to the local interface by default. To expose it over the internet, put it behind a reverse proxy such as nginx or Caddy and add an authentication layer.
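
A minimal nginx sketch, assuming the dashboard listens on a placeholder local port and that TLS certificates and the htpasswd file are managed separately:

# /etc/nginx/conf.d/inframind-dashboard.conf — illustrative only
server {
    listen 443 ssl;
    server_name dashboard.example.com;

    # ssl_certificate / ssl_certificate_key omitted for brevity

    location / {
        auth_basic           "InfraMind dashboard";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:<dashboard-port>;
        proxy_set_header     Host $host;
        proxy_set_header     X-Forwarded-For $remote_addr;
    }
}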

The dashboard can be disabled, or moved to a different port, through the node's configuration file or the equivalent environment variable; check your release's configuration reference for the exact key names.
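
Neither the configuration key nor the environment variable name is pinned down here, so the names below are placeholders only:

# Hypothetical names — substitute the keys your release actually documents
inframind config set dashboard.port 9090
export INFRAMIND_DASHBOARD_PORT=9090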

Prometheus & Grafana Integration

InfraMind nodes expose a native Prometheus exporter on port 9100; you can verify it locally before configuring scraping.
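
The exporter's metric names are not reproduced here; to see what it actually publishes, query the endpoint directly (this assumes the conventional /metrics path used by Prometheus exporters):

# Inspect the first few exported series
curl -s http://localhost:9100/metrics | head -n 20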

Scrape config for prometheus.yml:
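
The job name below is arbitrary, and the target assumes Prometheus runs on the node itself; use the node's address when scraping remotely:

scrape_configs:
  - job_name: "inframind-node"          # arbitrary label for this target group
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:9100"]     # exporter port from above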

Grafana dashboards are available via InfraMind community templates or can be created manually using PromQL.

Example PromQL:
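
The query syntax is standard PromQL, but the metric name below is a hypothetical placeholder; substitute a series that actually appears in the exporter output:

# Hypothetical counter — per-second job completion rate over 5 minutes
rate(inframind_jobs_completed_total[5m])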

Custom Alerts and Diagnostics

To test execution health, simulate a local job run with the agent's diagnostic commands; a GPU stress mode is also useful on accelerator nodes (see the sketch below).
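
The diagnostic subcommands are release-specific, so the names below are assumptions rather than the documented interface:

# Hypothetical: run a local job end-to-end without the scheduler
inframind diag run-local

# Hypothetical: sustained GPU load test
inframind diag gpu-stress --duration 5m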

Watchdog-style auto-recovery is enabled through the agent's own settings; check your release's configuration reference for the exact option.

This process pings the agent every 60 seconds and restarts it on memory exhaustion, timeout hangs, or a failed handshake response.
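
If you want an external safety net on systemd hosts, a minimal shell loop can approximate the same behavior; the status subcommand is the same assumption used earlier, and the unit name comes from the journalctl example above:

# Illustrative external watchdog — not the built-in mechanism
# Restarts the systemd unit whenever the (hypothetical) status check fails
while true; do
  inframind status > /dev/null 2>&1 || sudo systemctl restart inframind-node
  sleep 60
done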

Remote Monitoring (Optional)

You can register your node with a fleet monitoring UI via the cloud operator interface. This includes:

  • Uptime leaderboard

  • Reward heatmaps

  • Regional job density

  • Latency scatter plot

  • Offline notifications (via webhook or Telegram)

This is opt-in and uses anonymized metadata only, and is activated with a single CLI command.
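
The exact command is release-specific; a hypothetical form (the subcommand name is an assumption):

# Hypothetical opt-in command
inframind fleet enroll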

Summary

All InfraMind nodes provide introspection tools for both personal use and automated monitoring. Whether you run a single bare-metal box or manage a GPU fleet across regions, logs, metrics, and diagnostics are first-class citizens. Full transparency of execution behavior ensures that nodes can be tuned, debugged, and optimized for consistent, high-performance operation in the mesh.
