Chinese Manual

Open the Chinese manual (中文手册).

rust-supervisor Manual

Language: 中文

Project Scope

rust-supervisor is a Rust task supervision core for Tokio services. It uses declarative models to manage child startup, stop, restart, quarantine, state query, event recording, health checks, and Shutdown Without Orphaned Tasks.

The configuration boundary uses rust-config-tree v0.1.9 with YAML files. Runtime tunable values must enter the system through this centralized configuration path.

This project has no legacy interface burden. Users should import public types from owning module paths, such as rust_supervisor::runtime::supervisor::Supervisor.

Reading Path

  • Getting Started: start a minimal supervisor from YAML configuration.
  • Configuration: understand SupervisorConfig, ConfigState, and startup rejection boundaries.
  • Supervisor Tree: understand SupervisorSpec, SupervisorTree, and registry ownership.
  • Task Model: understand ChildSpec, TaskFactory, TaskContext, and readiness.
  • Policies: understand restart decisions, backoff, fuse rules, quarantine, and task exit classification.
  • Runtime Control: understand SupervisorHandle commands and idempotent behavior.
  • Dashboard: understand the three-end workflow across the target process, relay, and dashboard client.
  • Shutdown: understand four-stage shutdown and blocking worker boundaries.
  • Observability: understand events, logs, tracing, metrics, audit data, and run summaries.
  • Examples: run each learning example under examples/.
  • Quality Gates: run formatting, build, test, documentation, SBOM, and release checks.

Runtime Boundary

The supervisor core governs lifecycle behavior only. High-frequency business messages belong in the data plane. The control plane handles lifecycle commands, current state queries, events, and governance decisions.

Getting Started

Prerequisites

This project is a Rust library. The examples require Cargo and a Tokio application environment. Repository examples include their required dependencies.

The primary configuration file is examples/config/supervisor.yaml. The loader uses rust-config-tree v0.1.9, reads YAML, and produces ConfigState.

Minimal Command

cargo run --example supervisor_quickstart

The example loads YAML through load_config_state, derives SupervisorSpec through ConfigState::to_supervisor_spec, starts the runtime through Supervisor::start, queries current_state, and then shuts down the tree through shutdown_tree.

Minimal Code Path

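use rust_supervisor::config::loader::load_config_state;
use rust_supervisor::runtime::supervisor::Supervisor;

#[tokio::main]
async fn main() -> Result<(), rust_supervisor::error::types::SupervisorError> {
    // Load and validate the YAML primary configuration into ConfigState.
    let state = load_config_state("examples/config/supervisor.yaml")?;
    // Derive the declarative SupervisorSpec from the validated state.
    let spec = state.to_supervisor_spec()?;
    // Start the runtime and keep the handle for control commands.
    let handle = Supervisor::start(spec).await?;
    // Query and print the current supervisor state.
    let current = handle.current_state().await?;
    println!("{current:#?}");
    // Shut down the whole tree, passing the requester and the reason.
    handle.shutdown_tree("operator", "quickstart complete").await?;
    Ok(())
}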

Result

The example validates the integration path. It is not a business task template. Application workers should live inside the ChildSpec and TaskFactory boundaries instead of being started as unmanaged background tasks.

Configuration and Schema

Entry Point

The configuration entry point is rust_supervisor::config::loader::load_config_state. It accepts only the YAML primary configuration file. The repository example path is examples/config/supervisor.yaml.

The current configuration shape contains supervisor, policy, shutdown, and observability groups. They map into SupervisorRootConfig, PolicyConfig, ShutdownConfig, and ObservabilityConfig.

Configuration State

rust_supervisor::config::configurable::SupervisorConfig is the public root configuration struct. It supports confique::Config, schemars::JsonSchema, serde::Serialize, and serde::Deserialize. Users can reuse the same model for YAML loading, template generation, and JSON Schema generation.

ConfigState is the validated immutable state. Runtime modules must not keep separate runtime tunable constants.

ConfigState::to_supervisor_spec derives SupervisorSpec. The implementation fills the supervision strategy, policy defaults, shutdown budgets, health timing, and observability capacity from configuration values.

Template Boundary

The official template is examples/config/supervisor.template.yaml. It remains a single YAML file by default and covers supervisor, policy, shutdown, and observability.

This crate does not add x-tree-split to the public configuration structs, official schema, or official template. Projects that want split configuration files can wrap or reuse SupervisorConfig in their own crate and decide their own tree split layout.

Error Boundary

Configuration loading returns SupervisorError::FatalConfig when startup must be rejected:

  • The file extension is not YAML.
  • The file cannot be read.
  • YAML cannot be parsed into SupervisorConfig.
  • The supervision strategy is not one of OneForOne, OneForAll, or RestForOne.
  • A required numeric value is zero.
  • The initial backoff is greater than the maximum backoff.
  • The jitter ratio is outside the accepted range.

Supervisor::start_from_config_file rejects invalid configuration before it creates runtime channels or spawns the control loop.
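
The numeric rejection rules above can be sketched with the standard library alone. This is a minimal illustration, not the crate's loader: PolicySketch, ConfigRejection, and validate_policy are hypothetical names, the field names are borrowed from the example YAML, and the accepted jitter range of 0.0 to 1.0 is an assumption because the manual does not state the bounds.

```rust
/// Hypothetical stand-in for part of the policy group (not the crate's PolicyConfig).
#[derive(Debug, Clone)]
pub struct PolicySketch {
    pub child_restart_limit: u32,
    pub child_restart_window_ms: u64,
    pub initial_backoff_ms: u64,
    pub max_backoff_ms: u64,
    pub jitter_ratio: f64,
}

/// Hypothetical rejection reasons; the crate reports SupervisorError::FatalConfig.
#[derive(Debug, PartialEq)]
pub enum ConfigRejection {
    ZeroValue(&'static str),
    BackoffOrder { initial_ms: u64, max_ms: u64 },
    JitterOutOfRange(f64),
}

/// Rejects the configuration before any runtime resource would be created.
pub fn validate_policy(p: &PolicySketch) -> Result<(), ConfigRejection> {
    let required = [
        ("child_restart_limit", u64::from(p.child_restart_limit)),
        ("child_restart_window_ms", p.child_restart_window_ms),
        ("initial_backoff_ms", p.initial_backoff_ms),
        ("max_backoff_ms", p.max_backoff_ms),
    ];
    // A required numeric value must not be zero.
    for &(name, value) in required.iter() {
        if value == 0 {
            return Err(ConfigRejection::ZeroValue(name));
        }
    }
    // The initial backoff must not exceed the maximum backoff.
    if p.initial_backoff_ms > p.max_backoff_ms {
        return Err(ConfigRejection::BackoffOrder {
            initial_ms: p.initial_backoff_ms,
            max_ms: p.max_backoff_ms,
        });
    }
    // Assumption: the accepted jitter range is 0.0..=1.0 (NaN is rejected too).
    if !(0.0..=1.0).contains(&p.jitter_ratio) {
        return Err(ConfigRejection::JitterOutOfRange(p.jitter_ratio));
    }
    Ok(())
}
```

Returning the first violation keeps the sketch short; a real loader might prefer to collect every violation before rejecting startup.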

Example Configuration

supervisor:
  strategy: OneForAll
policy:
  child_restart_limit: 10
  child_restart_window_ms: 60000
  supervisor_failure_limit: 30
  supervisor_failure_window_ms: 60000
  initial_backoff_ms: 100
  max_backoff_ms: 5000
  jitter_ratio: 0.10
  heartbeat_interval_ms: 1000
  stale_after_ms: 3000
shutdown:
  graceful_timeout_ms: 5000
  abort_wait_ms: 1000
observability:
  event_journal_capacity: 256
  metrics_enabled: true
  audit_enabled: true

Supervisor Tree

Declaration Model

SupervisorSpec describes one supervisor node. It contains path, strategy, children, config_version, default restart policy, default backoff policy, default health policy, default shutdown policy, supervisor-level fuse limits, restart_budget, escalation_policy, group_strategies, child_strategy_overrides, and dynamic_supervisor_policy.

ChildSpec describes one child. It contains id, name, kind, factory, restart_policy, shutdown_policy, health_policy, readiness_policy, backoff_policy, dependencies, tags, and criticality.

Tree Building

SupervisorTree::build validates SupervisorSpec and converts children into path-aware nodes. Each child path is derived from the parent path and ChildId.

SupervisorPath::root returns the root path. SupervisorPath::join appends a child path segment. SupervisorPath::parent returns the parent path when it exists.
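
To make the root, join, and parent operations concrete, here is a small stand-alone model. PathSketch is a hypothetical type, not the crate's SupervisorPath, and the slash-separated rendering is an assumption chosen only for readability.

```rust
/// Hypothetical model of a supervisor path as an ordered list of child segments.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PathSketch {
    segments: Vec<String>,
}

impl PathSketch {
    /// The root path has no segments.
    pub fn root() -> Self {
        PathSketch { segments: Vec::new() }
    }

    /// Returns a new path with one child segment appended.
    pub fn join(&self, child_id: &str) -> Self {
        let mut segments = self.segments.clone();
        segments.push(child_id.to_string());
        PathSketch { segments }
    }

    /// Returns the parent path, or None when called on the root.
    pub fn parent(&self) -> Option<Self> {
        if self.segments.is_empty() {
            return None;
        }
        let mut segments = self.segments.clone();
        segments.pop();
        Some(PathSketch { segments })
    }

    /// Assumed display form: "/" for the root, "/a/b" for nested children.
    pub fn render(&self) -> String {
        format!("/{}", self.segments.join("/"))
    }
}
```

Because join returns a new value instead of mutating the parent, a child path can be derived from the parent path and a child identifier without affecting sibling nodes.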

Startup And Shutdown Order

startup_order returns nodes in declaration order. shutdown_order returns nodes in reverse declaration order. This ordering is the basis for Shutdown Without Orphaned Tasks.

Restart Planning

restart_execution_plan resolves the runtime restart scope from the tree and SupervisorSpec. It keeps per-child overrides, group strategies, restart budgets, escalation policies, and dynamic supervisor policy in one plan so the runtime control loop does not duplicate strategy selection logic.

Registry

RegistryStore stores ChildRuntime values by child identifier, supervisor path, and declaration order. Runtime control and current state queries should go through the registry instead of bypassing it.

Task Model

Task Kinds

TaskKind distinguishes AsyncWorker, BlockingWorker, and Supervisor. A blocking worker must not be treated as a normal asynchronous worker that can always be aborted immediately.

Task Factory

TaskFactory is the core construction contract. Every attempt must create a fresh future. service_fn is an ergonomic adapter that still targets TaskFactory; it does not replace the core model.

TaskResult distinguishes Succeeded, Cancelled, and Failed. The Failed variant carries TaskFailure and TaskFailureKind.
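
The rule that every attempt gets a fresh unit of work can be sketched without an async runtime. In this illustration a boxed closure stands in for the future, and FactorySketch, OutcomeSketch, FlakyFactory, and run_attempts are hypothetical names rather than the crate's TaskFactory or TaskResult.

```rust
/// Hypothetical outcome type mirroring the three result kinds named above.
#[derive(Debug, PartialEq)]
pub enum OutcomeSketch {
    Succeeded,
    Cancelled,
    Failed(String),
}

/// One attempt's work. A plain closure stands in for a future here.
pub type AttemptSketch = Box<dyn FnOnce() -> OutcomeSketch>;

/// The factory is asked for a new unit of work on every attempt.
pub trait FactorySketch {
    fn build(&self, attempt: u32) -> AttemptSketch;
}

/// Example factory whose work fails until a chosen attempt number.
pub struct FlakyFactory {
    pub succeed_on_attempt: u32,
}

impl FactorySketch for FlakyFactory {
    fn build(&self, attempt: u32) -> AttemptSketch {
        let succeed_on = self.succeed_on_attempt;
        Box::new(move || {
            if attempt >= succeed_on {
                OutcomeSketch::Succeeded
            } else {
                OutcomeSketch::Failed(format!("attempt {} failed", attempt))
            }
        })
    }
}

/// Runs attempts 1..=max_attempts, stopping at the first success.
pub fn run_attempts(factory: &dyn FactorySketch, max_attempts: u32) -> Vec<OutcomeSketch> {
    let mut outcomes = Vec::new();
    for attempt in 1..=max_attempts {
        // FnOnce is consumed by the call, so a finished attempt cannot be reused.
        let work = factory.build(attempt);
        let outcome = work();
        let done = outcome == OutcomeSketch::Succeeded;
        outcomes.push(outcome);
        if done {
            break;
        }
    }
    outcomes
}
```

Using FnOnce makes the "fresh per attempt" contract a compile-time property of the sketch: the only way to run again is to call build again.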

Task Context

TaskContext contains child identifier, supervisor path, generation, attempt, cancellation token, heartbeat sender, and readiness sender.

Workers should use TaskContext::heartbeat to report health, TaskContext::mark_ready to report explicit readiness, and TaskContext::is_cancelled or TaskContext::cancellation_token to react to shutdown.
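
A worker loop that reports heartbeats and reacts to cancellation might look like the following synchronous sketch. ContextSketch and run_worker are hypothetical; an AtomicBool stands in for the cancellation token and a standard channel stands in for the heartbeat sender, so none of this is the crate's TaskContext API.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::Sender;
use std::sync::Arc;

/// Hypothetical, simplified context handed to a worker for one attempt.
pub struct ContextSketch {
    pub child_id: String,
    pub attempt: u32,
    cancelled: Arc<AtomicBool>,
    heartbeats: Sender<u64>,
}

impl ContextSketch {
    pub fn new(
        child_id: &str,
        attempt: u32,
        cancelled: Arc<AtomicBool>,
        heartbeats: Sender<u64>,
    ) -> Self {
        ContextSketch {
            child_id: child_id.to_string(),
            attempt,
            cancelled,
            heartbeats,
        }
    }

    /// True once shutdown has been requested.
    pub fn is_cancelled(&self) -> bool {
        self.cancelled.load(Ordering::SeqCst)
    }

    /// Reports liveness; a closed receiver is ignored in this sketch.
    pub fn heartbeat(&self, tick: u64) {
        let _ = self.heartbeats.send(tick);
    }
}

/// Runs one unit of work per tick until cancellation is observed.
/// Returns the number of ticks that were executed.
pub fn run_worker<F: FnMut(u64)>(ctx: &ContextSketch, mut do_work: F) -> u64 {
    let mut tick = 0;
    while !ctx.is_cancelled() {
        tick += 1;
        ctx.heartbeat(tick);
        do_work(tick);
    }
    tick
}
```

The loop checks cancellation before each unit of work, so the worker stops at a well-defined boundary instead of being interrupted midway.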

Readiness

ReadinessPolicy supports Immediate and Explicit. An explicitly ready child should not appear as ready in current state or events until it reports readiness.

Policies

Supervision Strategy

SupervisionStrategy decides the restart scope after a failure. OneForOne selects only the failed child. OneForAll selects every child in the selected scope. RestForOne selects the failed child and every child declared after it in the selected scope.

restart_scope calculates the restart scope from SupervisorTree, the strategy, and the failed child identifier.
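
The three strategies can be illustrated over a flat list of child identifiers in declaration order. StrategySketch and restart_scope_sketch are hypothetical names; the crate's restart_scope works on SupervisorTree, which this sketch does not model.

```rust
/// Hypothetical mirror of the three supervision strategies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrategySketch {
    OneForOne,
    OneForAll,
    RestForOne,
}

/// Returns the children to restart, given the declaration order and the failed child.
/// An unknown child identifier yields an empty scope.
pub fn restart_scope_sketch<'a>(
    declared: &[&'a str],
    strategy: StrategySketch,
    failed: &str,
) -> Vec<&'a str> {
    let position = match declared.iter().position(|id| *id == failed) {
        Some(index) => index,
        None => return Vec::new(),
    };
    match strategy {
        // Only the failed child.
        StrategySketch::OneForOne => vec![declared[position]],
        // Every child in the scope.
        StrategySketch::OneForAll => declared.to_vec(),
        // The failed child and every child declared after it.
        StrategySketch::RestForOne => declared[position..].to_vec(),
    }
}
```

With children declared as market_feed, risk_engine, audit_sink and a failure in risk_engine, RestForOne selects risk_engine and audit_sink while leaving market_feed running.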

restart_execution_plan combines the supervisor strategy, GroupStrategy, ChildStrategyOverride, RestartBudget, EscalationPolicy, and DynamicSupervisorPolicy into a StrategyExecutionPlan. Child overrides take precedence over group strategies, and group strategies take precedence over the supervisor-wide strategy.
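
The precedence rule (child override first, then group strategy, then the supervisor-wide strategy) reduces to a small selection function. The names below are hypothetical, and the sketch leaves out budgets, escalation, and dynamic policy, which the real plan also carries.

```rust
/// Hypothetical mirror of the three supervision strategies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrategySketch {
    OneForOne,
    OneForAll,
    RestForOne,
}

/// Records which layer supplied the selected strategy.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrategySource {
    ChildOverride,
    Group,
    Supervisor,
}

/// Child overrides win over group strategies, which win over the supervisor default.
pub fn select_strategy(
    child_override: Option<StrategySketch>,
    group: Option<StrategySketch>,
    supervisor: StrategySketch,
) -> (StrategySketch, StrategySource) {
    match (child_override, group) {
        (Some(strategy), _) => (strategy, StrategySource::ChildOverride),
        (None, Some(strategy)) => (strategy, StrategySource::Group),
        (None, None) => (supervisor, StrategySource::Supervisor),
    }
}
```

Returning the source alongside the strategy is one way to let lifecycle events explain why a given scope was chosen.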

The runtime control loop receives child exits and, when the policy returns a restart decision, applies the selected StrategyExecutionPlan automatically. Runtime lifecycle events carry restart_plan so operators can see the selected strategy, group, and child scope.

Group Strategy And Overrides

GroupStrategy uses child tags to define a smaller restart scope. A child can belong to at most one configured strategy group. ChildStrategyOverride applies a per-child strategy and governance override when one child needs stricter restart behavior than its group or supervisor.

Restart Budget And Escalation

RestartBudget records the maximum restart count and the counting window selected for a plan. EscalationPolicy records the follow-up action when restart governance cannot remain local, including parent escalation, tree shutdown, or scope quarantine.

Dynamic Supervisor Policy

DynamicSupervisorPolicy controls runtime add_child acceptance. The current command accepts child manifests and tracks the dynamic manifest count. It rejects additions when dynamic supervision is disabled or the configured child limit has already been reached.
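
The acceptance rule for add_child can be modelled as a counter guarded by two checks. DynamicPolicySketch, ManifestLedger, and AddChildRejection are hypothetical names, and the limit is assumed to apply to declared plus dynamic children combined.

```rust
/// Hypothetical policy: whether dynamic additions are allowed and the total child limit.
#[derive(Debug, Clone, Copy)]
pub struct DynamicPolicySketch {
    pub enabled: bool,
    pub child_limit: usize,
}

/// Hypothetical rejection reasons for a dynamic addition.
#[derive(Debug, PartialEq, Eq)]
pub enum AddChildRejection {
    DynamicSupervisionDisabled,
    ChildLimitReached { limit: usize },
}

/// Tracks declared children plus accepted dynamic manifests.
pub struct ManifestLedger {
    declared: usize,
    dynamic: usize,
    policy: DynamicPolicySketch,
}

impl ManifestLedger {
    pub fn new(declared: usize, policy: DynamicPolicySketch) -> Self {
        ManifestLedger { declared, dynamic: 0, policy }
    }

    /// Declared children plus accepted dynamic manifests.
    pub fn child_count(&self) -> usize {
        self.declared + self.dynamic
    }

    /// Governs the addition before accepting it; returns the new child count.
    pub fn add_child(&mut self) -> Result<usize, AddChildRejection> {
        if !self.policy.enabled {
            return Err(AddChildRejection::DynamicSupervisionDisabled);
        }
        if self.child_count() >= self.policy.child_limit {
            return Err(AddChildRejection::ChildLimitReached {
                limit: self.policy.child_limit,
            });
        }
        self.dynamic += 1;
        Ok(self.child_count())
    }
}
```

Both checks run before the counter changes, so a rejected request leaves the ledger exactly as it was.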

Restart Policy

RestartPolicy contains Permanent, Transient, and Temporary. PolicyEngine reads TaskExit, the failure category, and the restart policy, then returns RestartDecision.
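
The manual does not spell out what each restart policy means, so the following sketch assumes the conventional Erlang/OTP semantics: Permanent always restarts, Transient restarts only after an abnormal exit, and Temporary never restarts. Treating cancellation as a deliberate stop that is never restarted is a further assumption. All type and function names are hypothetical.

```rust
/// Hypothetical mirror of the three restart policies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestartPolicySketch {
    Permanent,
    Transient,
    Temporary,
}

/// Hypothetical typed exit classification (no string matching involved).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExitSketch {
    Succeeded,
    Cancelled,
    Failed,
    Panicked,
    TimedOut,
}

/// Hypothetical decision returned by the policy layer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DecisionSketch {
    Restart,
    DoNotRestart,
}

/// Assumed OTP-style semantics; the crate's PolicyEngine may differ.
pub fn decide(policy: RestartPolicySketch, exit: ExitSketch) -> DecisionSketch {
    match (policy, exit) {
        // Temporary children are never restarted.
        (RestartPolicySketch::Temporary, _) => DecisionSketch::DoNotRestart,
        // Assumption: a cancelled child was stopped on purpose.
        (_, ExitSketch::Cancelled) => DecisionSketch::DoNotRestart,
        // Permanent children restart after any other exit.
        (RestartPolicySketch::Permanent, _) => DecisionSketch::Restart,
        // Transient children restart only after an abnormal exit.
        (RestartPolicySketch::Transient, ExitSketch::Succeeded) => DecisionSketch::DoNotRestart,
        (RestartPolicySketch::Transient, _) => DecisionSketch::Restart,
    }
}
```

Matching on typed enums keeps the decision table exhaustive: adding a new exit kind forces the compiler to revisit every policy.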

Backoff And Jitter

BackoffPolicy describes initial delay, maximum delay, jitter mode, and reset-after behavior. Tests can use deterministic jitter so coverage does not depend on random output.
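
A deterministic backoff calculation could look like the sketch below. It assumes exponential doubling per attempt, which the manual does not confirm, and it takes the jitter sample as a parameter so tests can pin it. backoff_delay_ms is a hypothetical helper, not part of BackoffPolicy.

```rust
/// Computes a capped backoff delay with caller-supplied jitter.
///
/// Assumptions in this sketch:
/// - `attempt` is 1-based and the base delay doubles on every attempt.
/// - `jitter_unit` is a sample in [-1.0, 1.0]; tests pass a fixed value.
/// - The cap is applied again after jitter, so the result never exceeds `max_ms`.
pub fn backoff_delay_ms(
    initial_ms: u64,
    max_ms: u64,
    attempt: u32,
    jitter_ratio: f64,
    jitter_unit: f64,
) -> u64 {
    // Limit the shift so that `1 << exponent` cannot overflow a u64.
    let exponent = attempt.saturating_sub(1).min(63);
    let factor = 1u64 << exponent;
    let base = initial_ms.saturating_mul(factor).min(max_ms);

    // Keep the sample inside [-1.0, 1.0] even if the caller passes something wider.
    let unit = jitter_unit.max(-1.0).min(1.0);
    let jittered = base as f64 * (1.0 + jitter_ratio * unit);

    (jittered.round().max(0.0) as u64).min(max_ms)
}
```

With the example configuration (initial 100 ms, maximum 5000 ms, jitter ratio 0.10) and a jitter sample of zero, attempts 1, 2, and 3 wait 100, 200, and 400 ms, and later attempts settle at the 5000 ms cap.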

Fuse And Quarantine

MeltdownPolicy limits restarts or failures inside configured windows. Crossing a child-level fuse places the child in quarantine. Crossing a supervisor-level fuse escalates the failure to the parent.
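
A fuse of this kind is commonly built as a sliding-window counter. The sketch below is one such model under stated assumptions: FuseSketch is a hypothetical type, timestamps are passed in as milliseconds so the behavior is deterministic, and the fuse trips when the number of failures inside the window exceeds the limit.

```rust
use std::collections::VecDeque;

/// Hypothetical sliding-window fuse; not the crate's MeltdownPolicy.
pub struct FuseSketch {
    limit: usize,
    window_ms: u64,
    failures: VecDeque<u64>,
}

impl FuseSketch {
    pub fn new(limit: usize, window_ms: u64) -> Self {
        FuseSketch {
            limit,
            window_ms,
            failures: VecDeque::new(),
        }
    }

    /// Records a failure at `now_ms` and returns true when the fuse trips.
    /// Timestamps are expected to be non-decreasing.
    pub fn record_failure(&mut self, now_ms: u64) -> bool {
        self.failures.push_back(now_ms);
        // Drop failures that have aged out of the window.
        while let Some(&oldest) = self.failures.front() {
            if now_ms.saturating_sub(oldest) >= self.window_ms {
                self.failures.pop_front();
            } else {
                break;
            }
        }
        // Assumption: "crossing" the fuse means exceeding the limit.
        self.failures.len() > self.limit
    }
}
```

A caller would quarantine the child when a child-level fuse returns true, and escalate to the parent when a supervisor-level fuse does.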

Task Exit Classification

TaskExit distinguishes success, cancellation, typed failure, panic, and timeout. The policy layer must read typed classifications instead of inferring behavior from strings.

Runtime Control

Control Entry Point

SupervisorHandle is the runtime control entry point. It sends requests to the runtime control loop through a command channel and returns CommandResult.

Control Commands

  • add_child: accept a dynamic child manifest when DynamicSupervisorPolicy allows another child.
  • remove_child: stop the target child before removing its registry record.
  • restart_child: request a restart for the target child.
  • pause_child: pause governance for the target child.
  • resume_child: resume governance for the target child.
  • quarantine_child: place the target child into quarantine.
  • shutdown_tree: shut down the whole supervisor tree.
  • current_state: return the current SupervisorState.
  • subscribe_events: subscribe to lifecycle events.

Idempotent Behavior

Repeated control commands should not create unrecoverable errors. Pausing an already paused child returns the current state. Quarantining an already quarantined child returns the current state. Shutting down an already completed tree returns the existing shutdown result.
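
Idempotent commands can be modelled as state transitions that return the current state when there is nothing left to change. ChildSketch and ChildStateSketch are hypothetical, and leaving a quarantined child untouched when it is paused is an assumption made for this sketch.

```rust
/// Hypothetical child states relevant to pause and quarantine.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChildStateSketch {
    Running,
    Paused,
    Quarantined,
}

/// Hypothetical child record that counts real state transitions.
pub struct ChildSketch {
    pub state: ChildStateSketch,
    pub transitions: u32,
}

impl ChildSketch {
    pub fn new() -> Self {
        ChildSketch {
            state: ChildStateSketch::Running,
            transitions: 0,
        }
    }

    /// Pauses a running child; any other state is returned unchanged.
    pub fn pause(&mut self) -> ChildStateSketch {
        if self.state == ChildStateSketch::Running {
            self.state = ChildStateSketch::Paused;
            self.transitions += 1;
        }
        self.state
    }

    /// Quarantines the child; repeating the command changes nothing.
    pub fn quarantine(&mut self) -> ChildStateSketch {
        if self.state != ChildStateSketch::Quarantined {
            self.state = ChildStateSketch::Quarantined;
            self.transitions += 1;
        }
        self.state
    }
}
```

The transition counter shows the effect: sending the same command twice yields the same returned state and only one recorded transition.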

Dynamic Additions

Dynamic additions are governed before the manifest is accepted. The runtime rejects add_child when dynamic supervision is disabled or when the declared plus dynamic child count has reached the configured limit. current_state includes accepted dynamic manifests in child_count.

Audit Data

Each control command carries requested_by, reason, target_path, accepted_at, and command_id. These fields support audit events and incident review.

Dashboard Three-End Workflow

The dashboard feature is delivered by three repositories. rust-supervisor owns only target-process local IPC and shared contracts. ~/rust-supervisor-relay owns the relay and external wss:// sessions. ~/rust-supervisor-ui owns the browser dashboard client.

The screenshot below shows the dashboard client view for target lists, topology, state, and runtime streams.

[Figure: rust-supervisor dashboard screenshot]

Three-End Responsibilities

  • rust-supervisor: The target process reads SupervisorConfig, opens a Unix domain socket when ipc.enabled=true, and produces snapshots, event records, log records, command results, and registration heartbeats.
  • rust-supervisor-relay: The relay listens on the registration socket, stores the target registry, exposes external wss:// dashboard sessions, validates mTLS and allowed IPC path prefixes, and forwards session commands to the target process.
  • rust-supervisor-ui: The dashboard client connects to the relay through wss:// and displays the target list, topology, state, event stream, log tail, and command audit.

Local Demo Flow

  1. Start the relay first. It must listen on the registration socket before the target process can register itself.
     cd ~/rust-supervisor-relay
     cargo run -- --config examples/config/dashboard-relay.local.yaml
  2. Start the target process next. It opens the local IPC socket and sends registration heartbeats to the relay.
     cd ~/rust-supervisor
     cargo run --example demo -- --config examples/config/supervisor.local.yaml
  3. Start the dashboard client last. Browser code connects only to the relay and never reads the target-process local IPC socket directly.
     cd ~/rust-supervisor-ui
     VITE_SUPERVISOR_RELAY_URL=wss://localhost:9443/supervisor npm run dev

Runtime Order

After receiving a registration heartbeat, the relay only stores the target process in the target registry. Registration does not trigger proactive event or log push. After the dashboard client establishes an authenticated dashboard session and selects a target, the relay connects to the target-process IPC socket, reads state, and subscribes to events.subscribe or logs.tail only when the session requests those streams.

Control commands must start from the dashboard client, pass relay session validation, and then reach the target process. Each command must carry operator identity, target identity, and reason. Dangerous commands must also be confirmed in the client.

Verification Commands

cd ~/rust-supervisor
cargo test --test dashboard_config_test --test dashboard_protocol_shape_test --test dashboard_state_test --test dashboard_stream_test --test dashboard_performance_test

cargo test --manifest-path ~/rust-supervisor-relay/Cargo.toml
npm --prefix ~/rust-supervisor-ui run test
npm --prefix ~/rust-supervisor-ui run build
npm --prefix ~/rust-supervisor-ui run test:e2e:three-end

Production Notes

The target process may expose only a local Unix domain socket and must not expose IPC directly to the network. The relay must use wss:// for external access. The browser or operating-system certificate store selects the mTLS client certificate, and page scripts must not read the certificate private key. ipc.path, registration.relay_registration_path, and the relay allowed IPC path prefix must match, otherwise the target will fail to register or the relay will reject the connection.

Shutdown

Formal Term

This project uses Shutdown Without Orphaned Tasks to describe the shutdown goal. After root shutdown completes, the runtime should leave no orphaned task.

Four Stages

The shutdown protocol has four stages:

  • Request stop: accept the shutdown cause and propagate the cancellation token.
  • Graceful drain: wait for each child to finish on its own.
  • Abort stragglers: forcibly abort asynchronous tasks that exceed their timeout, or escalate when they cannot be aborted.
  • Reconcile: align registry state, current state, metrics, and the event journal.
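
The four stages can be walked through with a pure simulation that needs no runtime. In this sketch each child declares how long it needs to stop once cancellation is requested, and the graceful timeout is assumed to apply per child. ChildSketch, StopOutcome, ShutdownReport, and shutdown_sketch are hypothetical names.

```rust
/// How a child ended up stopping in the simulation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StopOutcome {
    Drained,
    Aborted,
}

/// Hypothetical child: a name and the time it needs to stop after cancellation.
pub struct ChildSketch {
    pub name: &'static str,
    pub drain_ms: u64,
}

/// Result of the simulated shutdown, in the order children were stopped.
#[derive(Debug, PartialEq, Eq)]
pub struct ShutdownReport {
    pub order: Vec<(&'static str, StopOutcome)>,
    pub aborted: usize,
}

/// Simulates request stop, graceful drain, abort stragglers, and reconcile.
pub fn shutdown_sketch(declared: &[ChildSketch], graceful_timeout_ms: u64) -> ShutdownReport {
    let mut order = Vec::new();
    // Stage 1, request stop: every child is treated as cancelled at time zero.
    // Children are then visited in reverse declaration order.
    for child in declared.iter().rev() {
        let outcome = if child.drain_ms <= graceful_timeout_ms {
            // Stage 2, graceful drain: the child finished within its budget.
            StopOutcome::Drained
        } else {
            // Stage 3, abort stragglers: the budget was exceeded.
            StopOutcome::Aborted
        };
        order.push((child.name, outcome));
    }
    // Stage 4, reconcile: summarize the terminal outcomes for state and metrics.
    let aborted = order
        .iter()
        .filter(|entry| entry.1 == StopOutcome::Aborted)
        .count();
    ShutdownReport { order, aborted }
}
```

Every declared child receives exactly one terminal outcome, which is the property that lets a reconcile step confirm that no task was left behind.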

Order

Startup runs in declaration order. Shutdown runs in reverse declaration order. startup_order and shutdown_order expose this rule.

Blocking Worker Boundary

BlockingWorker represents spawn_blocking work or other work that cannot be assumed to abort immediately. If such a worker is still running after the shutdown timeout, the runtime should record that it could not be terminated immediately and follow the escalation policy.

Shutdown Cause

ShutdownCause records requested_by and reason. The cause should appear in audit and diagnostic output.

Observability

Event Model

SupervisorEvent describes one lifecycle fact. It contains When, Where, What, sequence, and correlation identifier.

When records wall-clock time, monotonic time, uptime, generation, and attempt. Where records supervisor path, child identifier, parent identifier, and task name. What records state transition, policy decision, health state, exit reason, or control command.

Pipeline Outputs

The observability pipeline publishes the same lifecycle fact as these signals:

  • SupervisorEvent.
  • Structured log.
  • Tracing span and tracing event.
  • Metrics.
  • Audit event.
  • Event journal entry.
  • Test recorder entry.

Metric Labels

Metric labels must stay low-cardinality. Acceptable labels include supervisor path, child identifier, state, decision, and failure category. Full error text, user input, and unbounded dynamic values should not become labels.

Diagnostic Replay

The event journal stores a fixed number of recent events. RunSummary is built from the event journal, current state, and policy decisions so operators can explain meltdown, shutdown timeout, or parent escalation.
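
A journal that keeps only a fixed number of recent events is typically a bounded queue. The sketch below uses VecDeque for that purpose; JournalSketch is a hypothetical type, events are reduced to a sequence number and a short description, and sequence numbers are assumed to start at 1.

```rust
use std::collections::VecDeque;

/// Hypothetical bounded journal holding (sequence, description) pairs.
pub struct JournalSketch {
    capacity: usize,
    next_sequence: u64,
    events: VecDeque<(u64, String)>,
}

impl JournalSketch {
    pub fn new(capacity: usize) -> Self {
        JournalSketch {
            capacity,
            next_sequence: 1,
            events: VecDeque::with_capacity(capacity),
        }
    }

    /// Appends an event, evicting the oldest one when the journal is full.
    /// Returns the sequence number assigned to the new event.
    pub fn record(&mut self, what: &str) -> u64 {
        let sequence = self.next_sequence;
        self.next_sequence += 1;
        if self.capacity == 0 {
            // Nothing can be stored, but the sequence still advances.
            return sequence;
        }
        if self.events.len() == self.capacity {
            self.events.pop_front();
        }
        self.events.push_back((sequence, what.to_string()));
        sequence
    }

    /// Returns the retained events from oldest to newest.
    pub fn recent(&self) -> Vec<(u64, String)> {
        self.events.iter().cloned().collect()
    }
}
```

Because sequence numbers keep increasing after eviction, a gap between the first retained sequence and 1 tells a reader how many older events were dropped.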

Examples

Quick Start

cargo run --example supervisor_quickstart

supervisor_quickstart reads examples/config/supervisor.yaml, derives SupervisorSpec, starts a supervisor, queries current state, and shuts down the tree.

Configuration Tree

cargo run --example config_tree_supervisor

config_tree_supervisor shows the rust-config-tree v0.1.9 YAML loading path and prints the derived SupervisorSpec.

Restart Policy Lab

cargo run --example restart_policy_lab

restart_policy_lab shows the basic shapes of TaskFailure, TaskFailureKind, RestartPolicy, the canonical spec::supervisor::SupervisionStrategy, and RestartDecision.

Shutdown Tree

cargo run --example shutdown_tree

The shutdown_tree example demonstrates request stop, graceful drain, abort stragglers, and reconcile, and then calls the shutdown_tree command.

Observability Probe

cargo run --example observability_probe

observability_probe subscribes to events, queries current state, prints one event, and shuts down. It checks the observability integration path.

Supervisor Tree Story

cargo run --example supervisor_tree_story

supervisor_tree_story declares market feed, risk engine, and audit sink children. It shows dependencies, tags, criticality, explicit readiness, startup order, shutdown order, and RestForOne restart scope.

Runtime Control Story

cargo run --example runtime_control_story

runtime_control_story starts a real supervisor and runs add_child, pause_child, resume_child, quarantine_child, current_state, subscribe_events, and shutdown_tree. It combines operator control with audit events.

Policy Failure Matrix

cargo run --example policy_failure_matrix

policy_failure_matrix feeds success, external dependency failure, fatal bug failure, and panic into Permanent, Transient, and Temporary restart policies. It also shows deterministic jitter and meltdown tracking.

Diagnostic Replay

cargo run --example diagnostic_replay

diagnostic_replay builds deterministic events, writes them into the event journal, replays failure, backoff, and restart facts, then generates metric samples and RunSummary.

Quality Gates

Baseline Commands

cargo fmt --check
cargo check
cargo test
cargo doc --no-deps
cargo package --list
scripts/check-coding-standard.sh
scripts/check-maintainability.sh
scripts/generate-sbom.sh
scripts/validate-sbom.sh
cargo publish --dry-run

Documentation Synchronization

The manual, engineering docs, README files, examples, quickstart, public API contract, and glossary must stay synchronized. When public APIs, configuration shape, example behavior, or observability signals change, documentation must be updated in the same implementation pass.

Coding Standard

scripts/check-coding-standard.sh checks required release materials, example files, primary configuration, documentation punctuation, and No Compatibility language.

Maintainability

scripts/check-maintainability.sh checks paired manual and docs entries, example count, validation artifacts, the Shutdown Without Orphaned Tasks term, and the rust-config-tree term.

SBOM And Release

scripts/generate-sbom.sh creates minimal CycloneDX JSON and SPDX JSON release artifacts. scripts/validate-sbom.sh checks file existence, JSON shape, package name, Cargo.lock digest, and sensitive path leakage.