6.6 Agent Status, Metrics, and Troubleshooting

Orbnetes deployment and release orchestration documentation for operators and platform teams.

Agent status and runtime metrics are your first diagnostic layer.

Typical status signals:

  • agent state (online, offline, or inactive),
  • last heartbeat time,
  • reported runner version,
  • OS/platform/hostname,
  • runtime metrics (CPU, memory, disk where available).
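
As a rough illustration of how these signals can be read together, the sketch below models a status record and flags a stale heartbeat. The AgentStatus type, its field names, and the 90-second threshold are assumptions made for this example; they are not the Orbnetes schema or its defaults.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class AgentStatus:
    """Hypothetical record mirroring the status signals listed above."""
    state: str                              # "online", "offline", or "inactive"
    last_heartbeat: datetime                # timezone-aware timestamp
    runner_version: str
    hostname: str
    disk_used_pct: Optional[float] = None   # metrics may not be available


def heartbeat_age(status: AgentStatus, now: datetime) -> timedelta:
    """Return how long ago the agent last reported in."""
    return now - status.last_heartbeat


def is_stale(status: AgentStatus, now: datetime,
             max_age: timedelta = timedelta(seconds=90)) -> bool:
    """True when the last heartbeat is older than max_age (assumed threshold)."""
    return heartbeat_age(status, now) > max_age


# Example: an agent still shown as online, but silent for five minutes.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
agent = AgentStatus("online", now - timedelta(minutes=5), "1.4.2", "build-01")
print(is_stale(agent, now))  # True
```

An agent whose reported state is online but whose heartbeat is stale is worth checking before any job-level debugging.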

Quick troubleshooting workflow

1. Agent not claiming jobs
  • Verify that the agent is online.
  • Verify that the project allows this agent.
  • Verify that the blueprint job tags match the agent tags.
  • Check the queue for jobs blocked on dependencies or waiting for approval.
2. Agent appears online but jobs fail immediately
  • Inspect the job-run live log, starting at the first failing step.
  • Check shell availability and permissions on the host.
  • Verify that the runtime config (secrets/vars) is present.
3. Version mismatch in UI
  • Confirm the version of the binary actually running on the host.
  • Confirm that the heartbeat payload includes the updated agent version.
  • Check that the service was restarted after the update.
  • Verify that the update package target points to the intended build.
4. Update fails or loops
  • Inspect service logs for restart behavior.
  • Validate package format and executable naming.
  • Ensure that API credentials are valid and the download endpoint is reachable from the host.
  • Roll back to known-good runner package if needed.
5. Disk or memory pressure
  • Review runtime metrics from agent status.
  • Clean up runner work directories and leftover artifacts.
  • Increase host capacity or split workload across more agents.
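
The first checklist above (agent not claiming jobs) can be sketched as a small eligibility check. This is a minimal illustration, not the Orbnetes scheduler: the dictionary shapes and the claim_blockers helper are hypothetical, and the rule that an agent must carry every tag a job requests is an assumption about how tag matching works.

```python
from typing import Dict, List


def claim_blockers(agent: Dict, job: Dict) -> List[str]:
    """Return reasons this agent would not claim this job (empty = eligible).

    Assumed shapes (hypothetical, for illustration only):
      agent = {"state": str, "tags": set, "projects": set}
      job   = {"project": str, "tags": set}
    """
    reasons = []
    if agent["state"] != "online":
        reasons.append(f"agent is {agent['state']}")
    if job["project"] not in agent["projects"]:
        reasons.append(f"project {job['project']!r} does not allow this agent")
    # Assumption: the agent must carry every tag the job requests.
    missing = set(job["tags"]) - set(agent["tags"])
    if missing:
        reasons.append("missing tags: " + ", ".join(sorted(missing)))
    return reasons


agent = {"state": "online", "tags": {"linux", "docker"}, "projects": {"web"}}
job = {"project": "web", "tags": {"linux", "gpu"}}
print(claim_blockers(agent, job))  # ['missing tags: gpu']
```

An empty result means none of these three checks explains the stall, so the next place to look is the queue for blocked dependencies or pending approvals.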

Operational best practices

  • Keep at least one spare agent for critical tags.
  • Monitor heartbeat freshness and queue depth together.
  • Standardize runner versions per environment tier.
  • Regularly test the fresh-install path, not only the upgrade path.
  • Treat agent fleet as managed infrastructure, not ad-hoc hosts.
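
The spare-agent practice can be checked mechanically. The sketch below lists critical tags that are served by fewer than two online agents; the fleet records, the tags_without_spare helper, and the min_online default are hypothetical and would need to be fed from whatever agent inventory is actually available.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def tags_without_spare(agents: Iterable[Dict], critical_tags: Set[str],
                       min_online: int = 2) -> List[str]:
    """List critical tags served by fewer than min_online online agents."""
    online_per_tag = Counter()
    for agent in agents:
        if agent["state"] == "online":
            # Each tag on an online agent counts once toward its coverage.
            online_per_tag.update(agent["tags"])
    return sorted(t for t in critical_tags if online_per_tag[t] < min_online)


fleet = [
    {"state": "online", "tags": {"prod", "linux"}},
    {"state": "online", "tags": {"linux"}},
    {"state": "offline", "tags": {"prod"}},
]
print(tags_without_spare(fleet, {"prod", "linux"}))  # ['prod']
```

Here "prod" is reported because only one online agent carries it, so a single host failure would leave prod jobs with no agent to claim them.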