rust-supervisor Manual
Language: 中文
Project Scope
rust-supervisor is a Rust task supervision core for Tokio services. It uses declarative models to manage child startup, stop, restart, quarantine, state query, event recording, health checks, and Shutdown Without Orphaned Tasks.
The configuration boundary uses rust-config-tree v0.1.9 with YAML files. Runtime tunable values must enter the system through this centralized configuration path.
This project has no legacy interface burden. Users should import public types from owning module paths, such as rust_supervisor::runtime::supervisor::Supervisor.
Reading Path
- Getting Started: start a minimal supervisor from YAML configuration.
- Configuration: understand SupervisorConfig, ConfigState, and startup rejection boundaries.
- Supervisor Tree: understand SupervisorSpec, SupervisorTree, and registry ownership.
- Task Model: understand ChildSpec, TaskFactory, TaskContext, and readiness.
- Policies: understand restart decisions, backoff, fuse rules, quarantine, and task exit classification.
- Runtime Control: understand SupervisorHandle commands and idempotent behavior.
- Dashboard: understand the three-end workflow across the target process, relay, and dashboard client.
- Shutdown: understand four-stage shutdown and blocking worker boundaries.
- Observability: understand events, logs, tracing, metrics, audit data, and run summaries.
- Examples: run each learning example under examples/.
- Quality Gates: run formatting, build, test, documentation, SBOM, and release checks.
Runtime Boundary
The supervisor core governs lifecycle behavior only. High-frequency business messages belong in the data plane. The control plane handles lifecycle commands, current state queries, events, and governance decisions.
Getting Started
Prerequisites
This project is a Rust library. The examples require Cargo and a Tokio application environment. Repository examples include their required dependencies.
The primary configuration file is examples/config/supervisor.yaml. The loader uses rust-config-tree v0.1.9, reads YAML, and produces ConfigState.
Minimal Command
cargo run --example supervisor_quickstart
The example loads YAML through load_config_state, derives SupervisorSpec through ConfigState::to_supervisor_spec, starts the runtime through Supervisor::start, queries current_state, and then shuts down the tree through shutdown_tree.
Minimal Code Path
use rust_supervisor::config::loader::load_config_state;
use rust_supervisor::runtime::supervisor::Supervisor;

#[tokio::main]
async fn main() -> Result<(), rust_supervisor::error::types::SupervisorError> {
    // Load and validate the YAML configuration.
    let state = load_config_state("examples/config/supervisor.yaml")?;
    // Derive the declarative supervisor specification.
    let spec = state.to_supervisor_spec()?;
    // Start the runtime and query its current state.
    let handle = Supervisor::start(spec).await?;
    let current = handle.current_state().await?;
    println!("{current:#?}");
    // Shut down the whole tree with a requester identity and a reason.
    handle.shutdown_tree("operator", "quickstart complete").await?;
    Ok(())
}
Result
The example validates the integration path. It is not a business task template. Application workers should live inside the ChildSpec and TaskFactory boundaries instead of being started as unmanaged background tasks.
Configuration and Schema
Entry Point
The configuration entry point is rust_supervisor::config::loader::load_config_state. It accepts only the YAML primary configuration file. The repository example path is examples/config/supervisor.yaml.
The current configuration shape contains supervisor, policy, shutdown, and observability groups. They map into SupervisorRootConfig, PolicyConfig, ShutdownConfig, and ObservabilityConfig.
Configuration State
rust_supervisor::config::configurable::SupervisorConfig is the public root configuration struct. It supports confique::Config, schemars::JsonSchema, serde::Serialize, and serde::Deserialize. Users can reuse the same model for YAML loading, template generation, and JSON Schema generation.
ConfigState is the validated immutable state. Runtime modules must not keep separate runtime tunable constants.
ConfigState::to_supervisor_spec derives SupervisorSpec. The implementation fills the supervision strategy, policy defaults, shutdown budgets, health timing, and observability capacity from configuration values.
Template Boundary
The official template is examples/config/supervisor.template.yaml. It remains a single YAML file by default and covers supervisor, policy, shutdown, and observability.
This crate does not add x-tree-split to the public configuration structs, official schema, or official template. Projects that want split configuration files can wrap or reuse SupervisorConfig in their own crate and decide their own tree split layout.
Error Boundary
Configuration loading returns SupervisorError::FatalConfig when startup must be rejected:
- The file extension is not YAML.
- The file cannot be read.
- YAML cannot be parsed into SupervisorConfig.
- The supervision strategy is not one of OneForOne, OneForAll, or RestForOne.
- A required numeric value is zero.
- The initial backoff is greater than the maximum backoff.
- The jitter ratio is outside the accepted range.
Supervisor::start_from_config_file rejects invalid configuration before it creates runtime channels or spawns the control loop.
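The value checks in the list above can be sketched in plain Rust. This is a minimal illustration, not the crate's loader: PolicySketch and validate_policy are hypothetical names, the error is a plain String instead of SupervisorError::FatalConfig, and the accepted jitter range of 0.0 to 1.0 is an assumption.

```rust
/// Hypothetical, simplified stand-in for the values that the loader validates.
#[derive(Debug, Clone)]
pub struct PolicySketch {
    pub strategy: String,
    pub child_restart_limit: u32,
    pub initial_backoff_ms: u64,
    pub max_backoff_ms: u64,
    pub jitter_ratio: f64,
}

/// Returns the first rule that rejects startup, or Ok(()) when every rule passes.
pub fn validate_policy(p: &PolicySketch) -> Result<(), String> {
    const STRATEGIES: [&str; 3] = ["OneForOne", "OneForAll", "RestForOne"];
    if !STRATEGIES.contains(&p.strategy.as_str()) {
        return Err(format!("unknown supervision strategy: {}", p.strategy));
    }
    if p.child_restart_limit == 0 || p.initial_backoff_ms == 0 || p.max_backoff_ms == 0 {
        return Err("a required numeric value is zero".to_string());
    }
    if p.initial_backoff_ms > p.max_backoff_ms {
        return Err("initial backoff is greater than maximum backoff".to_string());
    }
    // Assumption: the accepted jitter range is 0.0..=1.0; the real bounds may differ.
    if !(0.0..=1.0).contains(&p.jitter_ratio) {
        return Err("jitter ratio is outside the accepted range".to_string());
    }
    Ok(())
}
```

Returning on the first failing rule mirrors the idea that invalid configuration is rejected before any runtime resource is created.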
Example Configuration
supervisor:
  strategy: OneForAll
policy:
  child_restart_limit: 10
  child_restart_window_ms: 60000
  supervisor_failure_limit: 30
  supervisor_failure_window_ms: 60000
  initial_backoff_ms: 100
  max_backoff_ms: 5000
  jitter_ratio: 0.10
  heartbeat_interval_ms: 1000
  stale_after_ms: 3000
shutdown:
  graceful_timeout_ms: 5000
  abort_wait_ms: 1000
observability:
  event_journal_capacity: 256
  metrics_enabled: true
  audit_enabled: true
Supervisor Tree
Declaration Model
SupervisorSpec describes one supervisor node. It contains path, strategy, children, config_version, default restart policy, default backoff policy, default health policy, default shutdown policy, supervisor-level fuse limits, restart_budget, escalation_policy, group_strategies, child_strategy_overrides, and dynamic_supervisor_policy.
ChildSpec describes one child. It contains id, name, kind, factory, restart_policy, shutdown_policy, health_policy, readiness_policy, backoff_policy, dependencies, tags, and criticality.
Tree Building
SupervisorTree::build validates SupervisorSpec and converts children into path-aware nodes. Each child path is derived from the parent path and ChildId.
SupervisorPath::root returns the root path. SupervisorPath::join appends a child path segment. SupervisorPath::parent returns the parent path when it exists.
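The root, join, and parent operations can be pictured with a small stand-alone type. PathSketch is a hypothetical model; the real SupervisorPath may store segments differently, and the "/" rendering is an assumed display format.

```rust
/// Hypothetical path type that models root, join, and parent.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PathSketch {
    segments: Vec<String>,
}

impl PathSketch {
    /// The root path has no segments.
    pub fn root() -> Self {
        Self { segments: Vec::new() }
    }

    /// Returns a new path with one child segment appended.
    pub fn join(&self, child_id: &str) -> Self {
        let mut segments = self.segments.clone();
        segments.push(child_id.to_string());
        Self { segments }
    }

    /// Returns the parent path, or None for the root.
    pub fn parent(&self) -> Option<Self> {
        if self.segments.is_empty() {
            return None;
        }
        let mut segments = self.segments.clone();
        segments.pop();
        Some(Self { segments })
    }

    /// Renders the path with "/" separators (an assumed display format).
    pub fn render(&self) -> String {
        format!("/{}", self.segments.join("/"))
    }
}
```

Each operation returns a new value, so a child path never mutates the parent path it was derived from.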
Startup And Shutdown Order
startup_order returns nodes in declaration order. shutdown_order returns nodes in reverse declaration order. This ordering is the basis for Shutdown Without Orphaned Tasks.
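A minimal sketch of the two orderings, assuming children are identified by plain string identifiers. The function names are hypothetical and only illustrate the rule; they are not the crate's startup_order and shutdown_order signatures.

```rust
/// Returns child identifiers in declaration order.
pub fn startup_order_sketch(declared: &[&str]) -> Vec<String> {
    declared.iter().map(|id| id.to_string()).collect()
}

/// Returns child identifiers in reverse declaration order,
/// so the child declared last is stopped first.
pub fn shutdown_order_sketch(declared: &[&str]) -> Vec<String> {
    declared.iter().rev().map(|id| id.to_string()).collect()
}
```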
Restart Planning
restart_execution_plan resolves the runtime restart scope from the tree and SupervisorSpec. It keeps per-child overrides, group strategies, restart budgets, escalation policies, and dynamic supervisor policy in one plan so the runtime control loop does not duplicate strategy selection logic.
Registry
RegistryStore stores ChildRuntime values by child identifier, supervisor path, and declaration order. Runtime control and current state queries should go through the registry instead of bypassing it.
Task Model
Task Kinds
TaskKind distinguishes AsyncWorker, BlockingWorker, and Supervisor. A blocking worker must not be treated as a normal asynchronous worker that can always be aborted immediately.
Task Factory
TaskFactory is the core construction contract. Every attempt must create a fresh future. service_fn is an ergonomic adapter that still targets TaskFactory; it does not replace the core model.
TaskResult distinguishes Succeeded, Cancelled, and Failed. The Failed variant carries TaskFailure and TaskFailureKind.
Task Context
TaskContext contains child identifier, supervisor path, generation, attempt, cancellation token, heartbeat sender, and readiness sender.
Workers should use TaskContext::heartbeat to report health, TaskContext::mark_ready to report explicit readiness, and TaskContext::is_cancelled or TaskContext::cancellation_token to react to shutdown.
Readiness
ReadinessPolicy supports Immediate and Explicit. An explicitly ready child should not appear as ready in current state or events until it reports readiness.
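The difference between the two policies can be sketched as a small predicate. ReadinessPolicySketch and is_ready are hypothetical names, and treating a started Immediate child as ready right away is an assumption about the intended semantics.

```rust
/// Hypothetical mirror of the two readiness modes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReadinessPolicySketch {
    Immediate,
    Explicit,
}

/// Reports whether a child should be shown as ready.
/// `started` means the task is running; `reported_ready` means the worker
/// has signalled readiness itself.
pub fn is_ready(policy: ReadinessPolicySketch, started: bool, reported_ready: bool) -> bool {
    match policy {
        // Assumption: an Immediate child is ready as soon as it has started.
        ReadinessPolicySketch::Immediate => started,
        // An Explicit child stays not-ready until it reports readiness.
        ReadinessPolicySketch::Explicit => started && reported_ready,
    }
}
```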
Policies
Supervision Strategy
SupervisionStrategy decides the restart scope after a failure. OneForOne selects only the failed child. OneForAll selects every child in the selected scope. RestForOne selects the failed child and every child declared after it in the selected scope.
restart_scope calculates the restart scope from SupervisorTree, the strategy, and the failed child identifier.
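The three scopes can be sketched over a flat list of child identifiers in declaration order. StrategySketch and restart_scope_sketch are hypothetical names; the real restart_scope works on SupervisorTree, and returning an empty scope for an unknown child is an assumption.

```rust
/// Hypothetical mirror of the three supervision strategies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrategySketch {
    OneForOne,
    OneForAll,
    RestForOne,
}

/// Returns the children to restart, in declaration order.
/// Returns an empty list when `failed` is not part of `declared`.
pub fn restart_scope_sketch<'a>(
    declared: &[&'a str],
    strategy: StrategySketch,
    failed: &str,
) -> Vec<&'a str> {
    let index = match declared.iter().position(|id| *id == failed) {
        Some(index) => index,
        None => return Vec::new(),
    };
    match strategy {
        // Only the failed child.
        StrategySketch::OneForOne => vec![declared[index]],
        // Every child in the scope.
        StrategySketch::OneForAll => declared.to_vec(),
        // The failed child and every child declared after it.
        StrategySketch::RestForOne => declared[index..].to_vec(),
    }
}
```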
restart_execution_plan combines the supervisor strategy, GroupStrategy, ChildStrategyOverride, RestartBudget, EscalationPolicy, and DynamicSupervisorPolicy into a StrategyExecutionPlan. Child overrides take precedence over group strategies, and group strategies take precedence over the supervisor-wide strategy.
The runtime control loop receives child exits and applies the selected StrategyExecutionPlan automatically when the policy returns a restart decision. Runtime lifecycle events carry restart_plan so operators can see the selected strategy, group, and child scope.
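The precedence rule (child override, then group strategy, then supervisor-wide strategy) can be sketched as a lookup chain. Strategy and resolve_strategy are hypothetical names, and the two maps stand in for ChildStrategyOverride and GroupStrategy; the real plan also carries budgets and escalation data.

```rust
use std::collections::HashMap;

/// Hypothetical mirror of the three supervision strategies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    OneForOne,
    OneForAll,
    RestForOne,
}

/// Resolves the strategy for one child: override first, then group, then default.
pub fn resolve_strategy(
    child_id: &str,
    child_group: Option<&str>,
    overrides: &HashMap<String, Strategy>,
    groups: &HashMap<String, Strategy>,
    supervisor_default: Strategy,
) -> Strategy {
    // A per-child override wins over everything else.
    if let Some(strategy) = overrides.get(child_id) {
        return *strategy;
    }
    // Otherwise use the strategy of the child's group, when it has one.
    if let Some(strategy) = child_group.and_then(|group| groups.get(group)) {
        return *strategy;
    }
    // Fall back to the supervisor-wide strategy.
    supervisor_default
}
```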
Group Strategy And Overrides
GroupStrategy uses child tags to define a smaller restart scope. A child can belong to at most one configured strategy group. ChildStrategyOverride applies a per-child strategy and governance override when one child needs stricter restart behavior than its group or supervisor.
Restart Budget And Escalation
RestartBudget records the maximum restart count and the counting window selected for a plan. EscalationPolicy records the follow-up action when restart governance cannot remain local, including parent escalation, tree shutdown, or scope quarantine.
Dynamic Supervisor Policy
DynamicSupervisorPolicy controls runtime add_child acceptance. The current command accepts child manifests and tracks the dynamic manifest count. It rejects additions when dynamic supervision is disabled or the configured child limit has already been reached.
Restart Policy
RestartPolicy contains Permanent, Transient, and Temporary. PolicyEngine reads TaskExit, the failure category, and the restart policy, then returns RestartDecision.
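The manual does not spell out the decision table, so the sketch below assumes the conventional Erlang/OTP meanings: Permanent always restarts, Transient restarts only after a failure, and Temporary never restarts. Treating cancellation as "do not restart" is also an assumption. All type and function names are hypothetical, not the PolicyEngine API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestartPolicySketch {
    Permanent,
    Transient,
    Temporary,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExitSketch {
    Succeeded,
    Cancelled,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DecisionSketch {
    Restart,
    DoNotRestart,
}

/// Maps a typed exit and a restart policy to a decision (assumed OTP-style rules).
pub fn decide(policy: RestartPolicySketch, exit: ExitSketch) -> DecisionSketch {
    match (policy, exit) {
        // Assumption: a cancelled task is part of shutdown and is never restarted.
        (_, ExitSketch::Cancelled) => DecisionSketch::DoNotRestart,
        (RestartPolicySketch::Permanent, _) => DecisionSketch::Restart,
        (RestartPolicySketch::Transient, ExitSketch::Failed) => DecisionSketch::Restart,
        (RestartPolicySketch::Transient, ExitSketch::Succeeded) => DecisionSketch::DoNotRestart,
        (RestartPolicySketch::Temporary, _) => DecisionSketch::DoNotRestart,
    }
}
```

Matching on typed enums rather than on error strings follows the rule that the policy layer reads typed classifications.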
Backoff And Jitter
BackoffPolicy describes initial delay, maximum delay, jitter mode, and reset-after behavior. Tests can use deterministic jitter so coverage does not depend on random output.
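A minimal sketch of a capped backoff with deterministic jitter. It assumes the delay doubles on every attempt, which the manual does not state, and backoff_delay_ms is a hypothetical function. The caller supplies jitter_unit in the range -1.0 to 1.0, so a test can pass a fixed value instead of a random one.

```rust
/// Computes the delay before restart attempt `attempt` (0-based).
/// `jitter_unit` is clamped to -1.0..=1.0 and scales the jitter ratio.
pub fn backoff_delay_ms(
    initial_ms: u64,
    max_ms: u64,
    attempt: u32,
    jitter_ratio: f64,
    jitter_unit: f64,
) -> u64 {
    // Assumed growth: double per attempt, saturating instead of overflowing.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    let base = initial_ms.saturating_mul(factor).min(max_ms);
    // Apply jitter as a signed fraction of the base delay.
    let unit = jitter_unit.clamp(-1.0, 1.0);
    let jittered = base as f64 * (1.0 + jitter_ratio * unit);
    // Never go below zero or above the configured maximum.
    (jittered.max(0.0).round() as u64).min(max_ms)
}
```

With the example configuration values (initial 100 ms, maximum 5000 ms), the base delay under this assumption grows as 100, 200, 400, and so on until it is capped at 5000.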
Fuse And Quarantine
MeltdownPolicy limits restarts or failures inside configured windows. Crossing a child-level fuse places the child in quarantine. Crossing a supervisor-level fuse escalates the failure to the parent.
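A sliding-window counter is one way to picture a fuse. FuseSketch is a hypothetical type, time is passed in as milliseconds so the behavior is deterministic, and the choice that the fuse trips when the count exceeds the limit (rather than when it reaches it) is an assumption.

```rust
use std::collections::VecDeque;

/// Hypothetical fuse: counts restarts inside a sliding time window.
pub struct FuseSketch {
    limit: usize,
    window_ms: u64,
    restarts: VecDeque<u64>,
}

impl FuseSketch {
    pub fn new(limit: usize, window_ms: u64) -> Self {
        Self { limit, window_ms, restarts: VecDeque::new() }
    }

    /// Records a restart at `now_ms` and reports whether the fuse has tripped.
    pub fn record_restart(&mut self, now_ms: u64) -> bool {
        // Forget restarts that have fallen out of the window.
        while let Some(&oldest) = self.restarts.front() {
            if now_ms.saturating_sub(oldest) >= self.window_ms {
                self.restarts.pop_front();
            } else {
                break;
            }
        }
        self.restarts.push_back(now_ms);
        // Assumption: the fuse trips once the count exceeds the limit.
        self.restarts.len() > self.limit
    }
}
```

A caller would quarantine the child when a child-level fuse trips and escalate to the parent when a supervisor-level fuse trips.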
Task Exit Classification
TaskExit distinguishes success, cancellation, typed failure, panic, and timeout. The policy layer must read typed classifications instead of inferring behavior from strings.
Runtime Control
Control Entry Point
SupervisorHandle is the runtime control entry point. It sends requests to the runtime control loop through a command channel and returns CommandResult.
Control Commands
- add_child: accept a dynamic child manifest when DynamicSupervisorPolicy allows another child.
- remove_child: stop the target child before removing its registry record.
- restart_child: request a restart for the target child.
- pause_child: pause governance for the target child.
- resume_child: resume governance for the target child.
- quarantine_child: place the target child into quarantine.
- shutdown_tree: shut down the whole supervisor tree.
- current_state: return the current SupervisorState.
- subscribe_events: subscribe to lifecycle events.
Idempotent Behavior
Repeated control commands should not create unrecoverable errors. Pausing an already paused child returns the current state. Quarantining an already quarantined child returns the current state. Shutting down an already completed tree returns the existing shutdown result.
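Idempotent commands can be modeled as state transitions that report whether anything changed. ChildStateSketch, pause, and quarantine are hypothetical names; leaving a quarantined child untouched on pause is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChildStateSketch {
    Running,
    Paused,
    Quarantined,
}

/// Returns the resulting state and whether the command changed it.
pub fn pause(state: ChildStateSketch) -> (ChildStateSketch, bool) {
    match state {
        ChildStateSketch::Running => (ChildStateSketch::Paused, true),
        // Repeating the command returns the current state without an error.
        ChildStateSketch::Paused => (ChildStateSketch::Paused, false),
        // Assumption: pausing does not lift an existing quarantine.
        ChildStateSketch::Quarantined => (ChildStateSketch::Quarantined, false),
    }
}

/// Returns the resulting state and whether the command changed it.
pub fn quarantine(state: ChildStateSketch) -> (ChildStateSketch, bool) {
    match state {
        ChildStateSketch::Quarantined => (ChildStateSketch::Quarantined, false),
        _ => (ChildStateSketch::Quarantined, true),
    }
}
```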
Dynamic Additions
Dynamic additions are governed before the manifest is accepted. The runtime rejects add_child when dynamic supervision is disabled or when the declared plus dynamic child count has reached the configured limit. current_state includes accepted dynamic manifests in child_count.
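The acceptance rule can be sketched as a pure function over the current counts. DynamicPolicySketch, AddChildOutcome, and check_add_child are hypothetical names and do not reflect the fields of DynamicSupervisorPolicy.

```rust
/// Hypothetical policy: whether dynamic children are allowed and the total limit.
#[derive(Debug, Clone, Copy)]
pub struct DynamicPolicySketch {
    pub enabled: bool,
    pub max_children: usize,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AddChildOutcome {
    /// The manifest was accepted; `child_count` includes the new child.
    Accepted { child_count: usize },
    RejectedDisabled,
    RejectedLimitReached,
}

/// Decides whether one more dynamic child may be accepted.
pub fn check_add_child(
    policy: DynamicPolicySketch,
    declared: usize,
    dynamic: usize,
) -> AddChildOutcome {
    if !policy.enabled {
        return AddChildOutcome::RejectedDisabled;
    }
    // Reject when declared plus dynamic children have already reached the limit.
    if declared + dynamic >= policy.max_children {
        return AddChildOutcome::RejectedLimitReached;
    }
    AddChildOutcome::Accepted { child_count: declared + dynamic + 1 }
}
```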
Audit Data
Each control command carries requested_by, reason, target_path, accepted_at, and command_id. These fields support audit events and incident review.
Dashboard Three-End Workflow
The dashboard feature is delivered by three repositories. rust-supervisor owns only target-process local IPC and shared contracts. ~/rust-supervisor-relay owns the relay and external wss:// sessions. ~/rust-supervisor-ui owns the browser dashboard client.
The screenshot below shows the dashboard client view for target lists, topology, state, and runtime streams.

Three-End Responsibilities
- rust-supervisor: The target process reads SupervisorConfig, opens a Unix domain socket when ipc.enabled=true, and produces snapshots, event records, log records, command results, and registration heartbeats.
- rust-supervisor-relay: The relay listens on the registration socket, stores the target registry, exposes external wss:// dashboard sessions, validates mTLS and allowed IPC path prefixes, and forwards session commands to the target process.
- rust-supervisor-ui: The dashboard client connects to the relay through wss:// and displays the target list, topology, state, event stream, log tail, and command audit.
Local Demo Flow
- Start the relay first. It must listen on the registration socket before the target process can register itself.
cd ~/rust-supervisor-relay
cargo run -- --config examples/config/dashboard-relay.local.yaml
- Start the target process next. It opens the local IPC socket and sends registration heartbeats to the relay.
cd ~/rust-supervisor
cargo run --example demo -- --config examples/config/supervisor.local.yaml
- Start the dashboard client last. Browser code connects only to the relay and never reads the target-process local IPC socket directly.
cd ~/rust-supervisor-ui
VITE_SUPERVISOR_RELAY_URL=wss://localhost:9443/supervisor npm run dev
Runtime Order
After receiving a registration heartbeat, the relay only stores the target process in the target registry. Registration does not trigger proactive event or log push. After the dashboard client establishes an authenticated dashboard session and selects a target, the relay connects to the target-process IPC socket, reads state, and subscribes to events.subscribe or logs.tail only when the session requests those streams.
Control commands must start from the dashboard client, pass relay session validation, and then reach the target process. Each command must carry operator identity, target identity, and reason. Dangerous commands must also be confirmed in the client.
Verification Commands
cd ~/rust-supervisor
cargo test --test dashboard_config_test --test dashboard_protocol_shape_test --test dashboard_state_test --test dashboard_stream_test --test dashboard_performance_test
cargo test --manifest-path ~/rust-supervisor-relay/Cargo.toml
npm --prefix ~/rust-supervisor-ui run test
npm --prefix ~/rust-supervisor-ui run build
npm --prefix ~/rust-supervisor-ui run test:e2e:three-end
Production Notes
The target process may expose only a local Unix domain socket and must not expose IPC directly to the network. The relay must use wss:// for external access. The browser or operating-system certificate store selects the mTLS client certificate, and page scripts must not read the certificate private key. ipc.path, registration.relay_registration_path, and the relay allowed IPC path prefix must match, otherwise the target will fail to register or the relay will reject the connection.
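A minimal sketch of the allowed-prefix check, assuming the relay compares the target IPC path against its prefix component by component. ipc_path_allowed and the example paths are hypothetical; the relay's real validation may differ, and this sketch does not resolve ".." segments or symlinks.

```rust
use std::path::Path;

/// Returns true when `ipc_path` lies under `allowed_prefix`.
/// `Path::starts_with` compares whole path components, so
/// "/run/supervisor-other" does not match the prefix "/run/supervisor".
pub fn ipc_path_allowed(ipc_path: &str, allowed_prefix: &str) -> bool {
    Path::new(ipc_path).starts_with(Path::new(allowed_prefix))
}
```

Comparing components rather than raw strings avoids accepting a sibling directory whose name merely begins with the allowed prefix.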
Shutdown
Formal Term
This project uses Shutdown Without Orphaned Tasks to describe the shutdown goal. After root shutdown completes, the runtime should leave no orphaned task.
Four Stages
The shutdown protocol has four stages:
- Request stop: accept the shutdown cause and propagate the cancellation token.
- Graceful drain: wait for each child to finish on its own.
- Abort stragglers: force or escalate asynchronous tasks that exceed their timeout.
- Reconcile: align registry state, current state, metrics, and the event journal.
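The four stages form a fixed sequence, which can be sketched as an enum with a next step. ShutdownStageSketch and all_stages are hypothetical names used only to illustrate the ordering.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ShutdownStageSketch {
    RequestStop,
    GracefulDrain,
    AbortStragglers,
    Reconcile,
}

impl ShutdownStageSketch {
    /// Returns the stage that follows this one, or None after the last stage.
    pub fn next(self) -> Option<Self> {
        match self {
            Self::RequestStop => Some(Self::GracefulDrain),
            Self::GracefulDrain => Some(Self::AbortStragglers),
            Self::AbortStragglers => Some(Self::Reconcile),
            Self::Reconcile => None,
        }
    }
}

/// Walks the protocol from the first stage to the last.
pub fn all_stages() -> Vec<ShutdownStageSketch> {
    let mut stages = Vec::new();
    let mut current = Some(ShutdownStageSketch::RequestStop);
    while let Some(stage) = current {
        stages.push(stage);
        current = stage.next();
    }
    stages
}
```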
Order
Startup runs in declaration order. Shutdown runs in reverse declaration order. startup_order and shutdown_order expose this rule.
Blocking Worker Boundary
BlockingWorker represents spawn_blocking work or other work that cannot be assumed to abort immediately. After shutdown timeout, the runtime should record the non-immediate termination boundary and follow the escalation policy.
Shutdown Cause
ShutdownCause records requested_by and reason. The cause should appear in audit and diagnostic output.
Observability
Event Model
SupervisorEvent describes one lifecycle fact. It contains When, Where, What, sequence, and correlation identifier.
When records wall-clock time, monotonic time, uptime, generation, and attempt. Where records supervisor path, child identifier, parent identifier, and task name. What records state transition, policy decision, health state, exit reason, or control command.
Pipeline Outputs
The observability pipeline publishes the same lifecycle fact as these signals:
- SupervisorEvent.
- Structured log.
- Tracing span and tracing event.
- Metrics.
- Audit event.
- Event journal entry.
- Test recorder entry.
Metric Labels
Metric labels must stay low-cardinality. Acceptable labels include supervisor path, child identifier, state, decision, and failure category. Full error text, user input, and unbounded dynamic values should not become labels.
Diagnostic Replay
The event journal stores a fixed number of recent events. RunSummary is built from the event journal, current state, and policy decisions so operators can explain meltdown, shutdown timeout, or parent escalation.
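A fixed-capacity journal can be sketched as a ring buffer that drops the oldest entry when full. JournalSketch is a hypothetical type; the capacity plays the role of event_journal_capacity from the example configuration, and the real journal may store richer entries.

```rust
use std::collections::VecDeque;

/// Hypothetical journal that keeps only the most recent `capacity` entries.
pub struct JournalSketch<T> {
    capacity: usize,
    entries: VecDeque<T>,
}

impl<T> JournalSketch<T> {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, entries: VecDeque::with_capacity(capacity) }
    }

    /// Appends an event, evicting the oldest one when the journal is full.
    pub fn push(&mut self, event: T) {
        if self.capacity == 0 {
            return;
        }
        if self.entries.len() == self.capacity {
            self.entries.pop_front();
        }
        self.entries.push_back(event);
    }

    /// Returns the retained events from oldest to newest.
    pub fn recent(&self) -> Vec<&T> {
        self.entries.iter().collect()
    }
}
```

Because memory use is bounded by the capacity, a summary built from the journal only explains the most recent window of events.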
Examples
Quick Start
cargo run --example supervisor_quickstart
supervisor_quickstart reads examples/config/supervisor.yaml, derives SupervisorSpec, starts a supervisor, queries current state, and shuts down the tree.
Configuration Tree
cargo run --example config_tree_supervisor
config_tree_supervisor shows the rust-config-tree v0.1.9 YAML loading path and prints the derived SupervisorSpec.
Restart Policy Lab
cargo run --example restart_policy_lab
restart_policy_lab shows the basic shapes of TaskFailure, TaskFailureKind, RestartPolicy, the canonical spec::supervisor::SupervisionStrategy, and RestartDecision.
Shutdown Tree
cargo run --example shutdown_tree
shutdown_tree demonstrates request stop, graceful drain, abort stragglers, and reconcile before calling shutdown_tree.
Observability Probe
cargo run --example observability_probe
observability_probe subscribes to events, queries current state, prints one event, and shuts down. It checks the observability integration path.
Supervisor Tree Story
cargo run --example supervisor_tree_story
supervisor_tree_story declares market feed, risk engine, and audit sink children. It shows dependencies, tags, criticality, explicit readiness, startup order, shutdown order, and RestForOne restart scope.
Runtime Control Story
cargo run --example runtime_control_story
runtime_control_story starts a real supervisor and runs add_child, pause_child, resume_child, quarantine_child, current_state, subscribe_events, and shutdown_tree. It combines operator control with audit events.
Policy Failure Matrix
cargo run --example policy_failure_matrix
policy_failure_matrix feeds success, external dependency failure, fatal bug failure, and panic into Permanent, Transient, and Temporary restart policies. It also shows deterministic jitter and meltdown tracking.
Diagnostic Replay
cargo run --example diagnostic_replay
diagnostic_replay builds deterministic events, writes them into the event journal, replays failure, backoff, and restart facts, then generates metric samples and RunSummary.
Quality Gates
Baseline Commands
cargo fmt --check
cargo check
cargo test
cargo doc --no-deps
cargo package --list
scripts/check-coding-standard.sh
scripts/check-maintainability.sh
scripts/generate-sbom.sh
scripts/validate-sbom.sh
cargo publish --dry-run
Documentation Synchronization
The manual, engineering docs, README files, examples, quickstart, public API contract, and glossary must stay synchronized. When public APIs, configuration shape, example behavior, or observability signals change, documentation must be updated in the same implementation pass.
Coding Standard
scripts/check-coding-standard.sh checks required release materials, example files, primary configuration, documentation punctuation, and No Compatibility language.
Maintainability
scripts/check-maintainability.sh checks paired manual and docs entries, example count, validation artifacts, the Shutdown Without Orphaned Tasks term, and the rust-config-tree term.
SBOM And Release
scripts/generate-sbom.sh creates minimal CycloneDX JSON and SPDX JSON release artifacts. scripts/validate-sbom.sh checks file existence, JSON shape, package name, Cargo.lock digest, and sensitive path leakage.