15.5 Common Failure Patterns

Orbnetes deployment and release orchestration documentation for operators and platform teams.

Below are frequent failure classes and how to recognize them quickly.

1) Tag routing mismatch

Symptoms:

  • jobs stay queued,
  • no agent picks up the job despite an active pipeline.

Check:

  • job tags vs agent tags,
  • project allowed agents mapping,
  • agent online state.
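If routing uses subset semantics (the job runs only on agents carrying every job tag — a common convention; confirm against your Orbnetes routing rules), the tag check can be sketched in a few lines. `agent_matches` and the tag names are illustrative, not an Orbnetes API:

```python
def agent_matches(job_tags, agent_tags):
    """A job can be routed to an agent only if every job tag is
    present on the agent (subset semantics are an assumption)."""
    return set(job_tags) <= set(agent_tags)

# Job needs linux+docker; an agent offering only linux never matches,
# so the job stays queued even though the agent is online.
print(agent_matches({"linux", "docker"}, {"linux"}))                   # False
print(agent_matches({"linux", "docker"}, {"linux", "docker", "x64"}))  # True
```

A job with an empty tag set matches every agent under these semantics, which is why an over-tagged job, not an offline agent, is often the real cause of an eternally queued run.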

2) Missing secrets/variables

Symptoms:

  • step fails immediately with missing env/config errors.

Check:

  • key exists in expected scope (environment/project/global),
  • environment selection at launch is correct,
  • key name is an exact match.
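A preflight step at the top of a blueprint can surface all missing keys at once instead of failing on the first one. `REQUIRED_KEYS` below is a hypothetical list; substitute the names your steps actually read:

```python
import os

REQUIRED_KEYS = ["DB_URL", "API_TOKEN"]  # hypothetical example names

def missing_keys(required, env=None):
    """Return required keys that are absent or empty in the environment.
    Empty values are treated as missing, since a blank secret usually
    means a scope or environment-selection mistake upstream."""
    env = os.environ if env is None else env
    return [k for k in required if k not in env or not env[k]]

# Simulated launch environment missing one key:
print(missing_keys(REQUIRED_KEYS, {"DB_URL": "postgres://..."}))  # ['API_TOKEN']
```

Reporting the full list per run makes scope mistakes (environment vs project vs global) obvious from a single log line.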

3) Missing release file binding

Symptoms:

  • deploy step references the release file variable, but its path/value is empty.

Check:

  • source/tag/file selected in release,
  • blueprint actually expects ORBN_RELEASE_FILE,
  • selected artifact available from source.
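The binding check distinguishes two distinct failures: the variable was never set (release selection problem) versus it points at a path that does not exist on the runner (artifact availability problem). A sketch, assuming `ORBN_RELEASE_FILE` carries a filesystem path as in this section:

```python
import os

def check_release_binding(env=None):
    """Classify the ORBN_RELEASE_FILE binding for a deploy step."""
    env = os.environ if env is None else env
    value = env.get("ORBN_RELEASE_FILE", "")
    if not value:
        # No source/tag/file selected in the release, or the
        # blueprint does not actually receive the variable.
        return "unset-or-empty"
    if not os.path.exists(value):
        # Variable is bound, but the artifact never arrived from the source.
        return "path-missing"
    return "ok"

print(check_release_binding({"ORBN_RELEASE_FILE": ""}))  # unset-or-empty
```

"unset-or-empty" sends you back to the release configuration; "path-missing" sends you to the artifact source.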

4) Dependency blocking (needs)

Symptoms:

  • downstream job waits indefinitely, or is skipped/blocked after an upstream failure.

Check:

  • upstream status,
  • dependency chain in graph,
  • failure policy/conditions.
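Tracing a dependency chain by hand is error-prone on larger graphs. If you can export job statuses and the `needs` edges (the dict shapes below are assumptions, not an Orbnetes export format), a breadth-first walk finds the first non-successful upstream job:

```python
from collections import deque

def first_blocking_upstream(job, needs, status):
    """Walk the `needs` graph upward from `job` (breadth-first) and
    return the first upstream job whose status is not 'success'."""
    seen, queue = set(), deque(needs.get(job, []))
    while queue:
        up = queue.popleft()
        if up in seen:
            continue
        seen.add(up)
        if status.get(up) != "success":
            return up
        queue.extend(needs.get(up, []))
    return None  # all upstream jobs succeeded

needs = {"deploy": ["test"], "test": ["build"], "build": []}
status = {"build": "failed", "test": "blocked"}
print(first_blocking_upstream("deploy", needs, status))  # test
```

Note this returns the nearest blocker ("test" above), not the root cause ("build"); rerun the function from the blocker to walk further up the chain.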

5) Shell/runtime mismatch

Symptoms:

  • command syntax errors, path issues, executable not found.

Check:

  • shell type (bash vs PowerShell),
  • agent OS/architecture expectations,
  • tool presence on runner host.
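Tool presence can be verified up front rather than discovered mid-run as "executable not found". A minimal host preflight, with a hypothetical toolchain list:

```python
import platform
import shutil

def preflight(tools):
    """Return required executables that are not on this host's PATH."""
    return [t for t in tools if shutil.which(t) is None]

# Hypothetical toolchain for a deploy blueprint:
missing = preflight(["git", "docker", "kubectl"])
print(platform.system(), "missing:", missing)
```

Running this as the first step of a job turns a confusing mid-deploy failure into an immediate, named routing/provisioning problem, and `platform.system()` in the same line catches bash-vs-PowerShell and OS/architecture mismatches early.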

6) External service/network failures

Symptoms:

  • timeout, connection refused, DNS/auth errors.

Check:

  • target endpoint availability,
  • network egress/firewall,
  • credentials/token validity,
  • transient vs persistent pattern across reruns.
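The error classes above (DNS, refused, timeout) each point at a different layer, so it helps to classify the probe result instead of just logging "failed". A sketch using a raw TCP connect; the category names are mine, not an Orbnetes convention:

```python
import socket

def probe(host, port, timeout=3.0):
    """Attempt a TCP connection and classify the outcome:
    dns-error -> name resolution (check DNS / hostname),
    refused   -> host reachable, service down or port wrong,
    timeout   -> likely firewall/egress or overloaded target."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.gaierror:
        return "dns-error"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "other"

print(probe("no-such-host.invalid", 443))  # dns-error
```

Running the same probe across reruns also answers the transient-vs-persistent question: a result that flips between "ok" and "timeout" is transient; a stable "refused" or "dns-error" is a configuration problem.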

7) Resource pressure on agent

Symptoms:

  • step slowdowns, intermittent command failures, disk write errors.

Check:

  • runtime metrics (CPU/memory/disk),
  • work directory space,
  • concurrent job load on same host.
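Work-directory space is the cheapest of these checks to automate. A sketch using the standard library; the 5 GB threshold is an assumed example, tune it per workload:

```python
import shutil

def disk_headroom(path, min_free_gb=5.0):
    """Return (free_gb, ok) for the given directory.
    The threshold is an assumption; size it to your largest artifact."""
    usage = shutil.disk_usage(path)
    free_gb = usage.free / 1e9
    return free_gb, free_gb >= min_free_gb

# In practice, point this at the agent's work directory.
free, ok = disk_headroom("/")
print(f"free={free:.1f} GB ok={ok}")
```

Emitting this at job start gives the "disk write errors appeared mid-step" investigation a baseline number to compare against.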

Observability Response Checklist

When a run fails:

  1. Open pipeline graph and locate first failed/blocked branch.
  2. Open job live page and inspect first failed step output.
  3. Correlate status + duration + timestamp for context.
  4. Classify failure pattern (routing, config, runtime, dependency, external).
  5. Apply a rerun strategy (rerun failed jobs only, or all) with corrected inputs/config.
  6. Preserve logs for incident record if production impact exists.
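Step 4 above, classifying the failure pattern, can be partially automated with keyword heuristics over the first failed step's output. The patterns below are examples drawn from this section's failure classes; extend them from your own logs rather than treating them as complete:

```python
import re

# Heuristic patterns per failure class; illustrative, not exhaustive.
PATTERNS = {
    "routing":    [r"no matching agent", r"stuck in queue"],
    "config":     [r"missing (env|variable|secret)", r"is not set"],
    "runtime":    [r"command not found", r"syntax error"],
    "dependency": [r"blocked by", r"upstream .* failed"],
    "external":   [r"timed? ?out", r"connection refused", r"dns"],
}

def classify(log_line):
    """Return the first failure class whose patterns match the line."""
    low = log_line.lower()
    for cls, pats in PATTERNS.items():
        if any(re.search(p, low) for p in pats):
            return cls
    return "unclassified"

print(classify("bash: kubectl: command not found"))    # runtime
print(classify("connection refused by 10.0.0.5:443"))  # external
```

Even a rough first-pass label like this speeds up step 5, since the rerun strategy differs by class: routing and config failures need corrected inputs before any rerun, while transient external failures may only need a retry.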

This workflow gives fast diagnosis while keeping evidence and actions traceable.