20.5 Incident Triage with Graph + Live Logs

Orbnetes deployment and release orchestration documentation for operators and platform teams.

Objective

Diagnose and mitigate a failed or stuck execution quickly using built-in runtime visibility.

Triage Workflow

  1. Open the release or pipeline page.
  2. Inspect the DAG:
    • find the first failed node,
    • identify the dependents it blocks.
  3. Open the corresponding live job page.
  4. Use the step timeline to locate the first failing step.
  5. Search the logs for an error signature (permission denied, not found, timeout, etc.).
  6. Classify the failure type:
    • routing/tag,
    • config/secrets,
    • runtime/tooling,
    • external dependency/network.
  7. Decide on a recovery action:
    • rerun failed,
    • rerun all,
    • cancel,
    • rollback.
  8. Capture evidence (downloaded logs and relevant IDs) for the incident record.
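
The graph inspection in step 2 can be sketched in a few lines of Python. This is a minimal illustration, not an Orbnetes API: the function name triage_dag, the node names, and the status strings ("failed", "success", "pending") are all hypothetical, and the graph is assumed to be available as a plain mapping from each node to the nodes it depends on.

```python
from collections import deque

def triage_dag(deps, status):
    """Return (root_failures, blocked_dependents).

    deps   -- hypothetical mapping: node -> list of nodes it depends on
    status -- hypothetical mapping: node -> state string such as "failed"
    """
    # Build reverse edges: node -> nodes that depend on it.
    dependents = {node: [] for node in deps}
    for node, parents in deps.items():
        for parent in parents:
            dependents.setdefault(parent, []).append(node)

    # A root failure is a failed node with no failed upstream dependency.
    roots = [
        node for node in deps
        if status.get(node) == "failed"
        and not any(status.get(p) == "failed" for p in deps[node])
    ]

    # Breadth-first walk downstream to collect everything the roots block.
    blocked = set()
    queue = deque(roots)
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in blocked:
                blocked.add(child)
                queue.append(child)

    return roots, sorted(blocked - set(roots))

# Illustrative pipeline (assumed names, not taken from a real release).
deps = {
    "build": [],
    "test": ["build"],
    "migrate": ["build"],
    "deploy": ["test", "migrate"],
    "notify": ["deploy"],
}
status = {
    "build": "success",
    "test": "failed",
    "migrate": "success",
    "deploy": "pending",
    "notify": "pending",
}

roots, blocked = triage_dag(deps, status)
print(roots, blocked)  # ['test'] ['deploy', 'notify']
```

Starting from the root failure rather than from the last node in the graph keeps attention on the job whose live log page should be opened in step 3.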

Success Criteria

  • root cause category identified quickly,
  • recovery action executed with minimal guesswork,
  • incident evidence preserved (links/logs/status timeline).

Common Pitfalls

  • focusing on the final error line instead of the first causal failure,
  • rerunning repeatedly without correcting the underlying config/routing issue,
  • assuming runner failure without first checking approval/dependency gates.
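
To illustrate the first pitfall, the sketch below scans a downloaded log from the top and reports the first line that matches a known error signature, along with a failure category. It is a rough example under stated assumptions: the function name first_error, the regular expressions, and the mapping from signature to category are hypothetical choices for illustration and are not defined by Orbnetes; the sample log lines are invented.

```python
import re

# Hypothetical signature table: (category, pattern). The categories follow
# the triage workflow; the patterns and their mapping are assumptions.
SIGNATURES = [
    ("routing/tag", re.compile(r"no matching runner|tag mismatch", re.I)),
    ("config/secrets", re.compile(r"permission denied|access denied", re.I)),
    ("runtime/tooling", re.compile(r"not found", re.I)),
    ("external dependency/network",
     re.compile(r"timeout|timed out|connection refused", re.I)),
]

def first_error(lines):
    """Return (line_number, category, text) for the first matching line,
    or None when no signature matches."""
    for number, line in enumerate(lines, start=1):
        for category, pattern in SIGNATURES:
            if pattern.search(line):
                return number, category, line.strip()
    return None

# Invented log excerpt: the last error is a symptom of the earlier timeout.
sample_log = [
    "step 1: checkout ok",
    "step 2: fetch artifact: connection timed out",
    "step 3: deploy.sh: artifact.tar not found",
    "job failed with exit code 1",
]

print(first_error(sample_log))
# (2, 'external dependency/network', 'step 2: fetch artifact: connection timed out')
```

Reading only the tail of this log would suggest a tooling problem (the missing artifact), while the first matching line points to a network failure. A real signature table would need tuning per service.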

Operational Note for Playbook Usage

Treat these playbooks as baseline templates. For production readiness, add service-specific guardrails:

  • health-check gates,
  • rollback eligibility rules,
  • communication/escalation steps,
  • post-deploy validation checklist.