Operations Runbook

Note: Each procedure lists the expected metric values at key steps. If an observed value differs from the expected one, follow the procedure's escalation path or refer to the linked section.

P1-001: Supervisor Process Crash

Symptoms: Supervisor process exits unexpectedly; children become orphaned.

Step | Action | Expected Metrics | Estimated Duration
--- | --- | --- | ---
1 | Check process status: pgrep -x supervisor | exit code == 0 (process running) or exit code == 1 (not running) | 1min
2 | If not running, check last log lines: journalctl -u supervisor -n 50 | Log ends with ShutdownPhase::Completed (planned) or Panic (unexpected) | 2min
3 | If unexpected crash: collect core dump and backtrace | Core dump file present in /tmp/ | 2min
4 | Restart supervisor: cargo run --release --example supervisor_quickstart | health.status == "ready" within 30s | 5min
5 | Verify children reconnected: check dashboard IPC | dashboard_link == "connected" | 2min

Escalation: If restart fails twice, escalate to L2 engineering with core dump and logs. Total estimated duration: 12min (within 15min SLA).
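
The decision in step 2 (planned shutdown versus unexpected crash) can be sketched as a small classifier over the journal tail. This is a minimal illustration, not part of the supervisor's API: the names ExitKind and classify_exit are hypothetical, and the sketch assumes the two markers from the table appear literally in the last non-empty log line.

```rust
/// How the supervisor's last run ended, judged from the tail of its journal.
/// Hypothetical type, introduced only for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum ExitKind {
    Planned,    // last line mentions ShutdownPhase::Completed
    Unexpected, // last line mentions Panic
    Unknown,    // neither marker found: inspect the log manually
}

/// Classify the last non-empty line of a journal excerpt, for example the
/// output of `journalctl -u supervisor -n 50`.
fn classify_exit(log_tail: &str) -> ExitKind {
    let last = log_tail.lines().rev().find(|l| !l.trim().is_empty());
    match last {
        Some(l) if l.contains("ShutdownPhase::Completed") => ExitKind::Planned,
        Some(l) if l.contains("Panic") => ExitKind::Unexpected,
        _ => ExitKind::Unknown,
    }
}

fn main() {
    // Made-up log lines, used only to exercise the classifier.
    let tail = "INFO draining children\nERROR Panic: index out of bounds\n";
    println!("{:?}", classify_exit(tail)); // prints "Unexpected"
}
```

An Unknown result means neither expected marker was found, which is itself a deviation from the expected metrics and should be handled through the escalation path.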

P1-002: Child Task Crash Loop

Symptoms: A child task repeatedly fails and restarts; current_state shows elevated restart counts.

Step | Action | Expected Metrics | Estimated Duration
--- | --- | --- | ---
1 | Query current state: handle.current_state() | children.failed > 0 and child_runtime_records[].restart_count > threshold | 1min
2 | Check child exit reason in journal | TaskExit::Panicked or TaskExit::Failed with reason string | 2min
3 | If restart budget exhausted: restart_budget.tokens == 0 | Budget exhausted; child quarantined automatically | 1min
4 | Remove or replace the faulty child spec: handle.remove_child() | CommandResult::Accepted | 2min
5 | Verify no lingering slot: check current_state() | children.running == target_count | 2min

Escalation: If the child's root cause is not identified within 10min, file a bug with the exit reason and a journal snippet. Total estimated duration: 8min.
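
The check in step 1 amounts to filtering the runtime records by restart count. The sketch below models that filter on a stand-in type. The struct ChildRuntimeRecord, its name field, and the function crash_looping are assumptions made for illustration; only the restart_count field and the threshold comparison come from the table above, and the real shape returned by handle.current_state() may differ.

```rust
/// Stand-in for one entry of child_runtime_records (hypothetical shape).
struct ChildRuntimeRecord {
    name: String,       // assumed identifier field
    restart_count: u32, // field named in step 1
}

/// Return the names of children whose restart count exceeds the threshold,
/// i.e. the candidates for a crash loop.
fn crash_looping(records: &[ChildRuntimeRecord], threshold: u32) -> Vec<&str> {
    records
        .iter()
        .filter(|r| r.restart_count > threshold)
        .map(|r| r.name.as_str())
        .collect()
}

fn main() {
    // Made-up records, used only to exercise the filter.
    let records = vec![
        ChildRuntimeRecord { name: "worker-a".to_string(), restart_count: 1 },
        ChildRuntimeRecord { name: "worker-b".to_string(), restart_count: 7 },
    ];
    println!("{:?}", crash_looping(&records, 3)); // prints ["worker-b"]
}
```

The comparison is strict (greater than), matching the expected metric in step 1, so a child sitting exactly at the threshold is not reported.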

P1-003: Dashboard IPC Disconnected

Symptoms: health.dashboard_link == "disconnected"; dashboard UI shows no data.

Step | Action | Expected Metrics | Estimated Duration
--- | --- | --- | ---
1 | Check IPC socket path existence: ls -la /tmp/supervisor.sock | Socket file present with correct permissions | 1min
2 | Check relay process: pgrep -x relay | Process running | 1min
3 | Restart relay: kill -TERM <relay_pid> and wait for auto-restart | Supervisor auto-restarts relay; dashboard_link == "connected" within 10s | 3min
4 | If still disconnected, restart the supervisor process (dashboard IPC lifecycle is tied to the supervisor process) | health.dashboard_link == "connected" | 2min

Escalation: If the error indicates IPC socket path contention (it contains field_path="dashboard.path"), check the socket path configuration in the deployment guide. Total estimated duration: 7min.
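
The "within 10s" expectation in step 3 is a poll-with-deadline check. A minimal sketch of such a helper follows, using only the Rust standard library. The name wait_until and the probe closure are hypothetical; in practice the probe would read health.dashboard_link, which is simulated here by a counter.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `probe` until it returns true or `timeout` elapses.
/// Returns true if the condition was observed before the deadline.
fn wait_until<F: FnMut() -> bool>(mut probe: F, timeout: Duration, interval: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if probe() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}

fn main() {
    // Simulated probe: reports "connected" on the third poll.
    let mut polls = 0;
    let connected = wait_until(
        || {
            polls += 1;
            polls >= 3
        },
        Duration::from_secs(10),
        Duration::from_millis(50),
    );
    println!("connected: {connected} after {polls} polls");
}
```

The probe is always evaluated at least once, so a link that is already connected returns immediately. A false result corresponds to moving on to step 4.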

P1-004: Runtime Starvation

Symptoms: Control loop iterations stall; health.uptime_secs advances but events are not processed.

Step | Action | Expected Metrics | Estimated Duration
--- | --- | --- | ---
1 | Check Tokio runtime metrics: handle.health().control_loop_iterations | iterations_per_sec > 0 | 1min
2 | If stalled, check for blocking tasks: review child task list | No child in BlockForever or IgnoreCancel state | 2min
3 | Quarantine suspicious children: handle.quarantine_child() | Child marked as Quarantined | 2min
4 | Verify recovery: health.control_loop_iterations increases | iterations_per_sec > 0 after 5s | 3min

Escalation: If starvation persists after quarantining all non-critical children, escalate to L2 with a runtime metrics snapshot. Total estimated duration: 8min.
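
Steps 1 and 4 compare iterations_per_sec against zero. One way to obtain that rate, assuming control_loop_iterations is a monotonically increasing counter, is to sample it twice and divide the difference by the elapsed time. The functions iterations_per_sec and is_stalled below are hypothetical helpers for illustration, not part of the health API.

```rust
use std::time::Duration;

/// Rate of control loop iterations between two counter samples.
/// Returns 0.0 for a zero-length interval or a counter that went backwards.
fn iterations_per_sec(before: u64, after: u64, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs == 0.0 {
        return 0.0;
    }
    after.saturating_sub(before) as f64 / secs
}

/// The loop counts as stalled when no iteration completed in the interval.
fn is_stalled(before: u64, after: u64, elapsed: Duration) -> bool {
    iterations_per_sec(before, after, elapsed) == 0.0
}

fn main() {
    // Made-up samples taken 5 seconds apart, as in step 4.
    let window = Duration::from_secs(5);
    println!("{}", iterations_per_sec(100, 600, window)); // prints 100
    println!("{}", is_stalled(600, 600, window)); // prints true
}
```

A sampling window of 5s matches step 4. A shorter window may report a stall for a loop that is merely slow.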