rust-supervisor Manual
Language: 中文
Project Scope
rust-supervisor is a Rust task supervision core for Tokio services. It uses declarative models to manage child startup, stop, restart, quarantine, state query, event recording, health checks, and Shutdown Without Orphaned Tasks.
The configuration boundary uses rust-config-tree 0.3.0 with YAML files. Runtime tunable values must enter the system through this centralized configuration path.
This project has no legacy interface burden. Users should import public types from owning module paths, such as rust_supervisor::runtime::supervisor::Supervisor.
Reading Path
- Getting Started: start a minimal supervisor from YAML configuration.
- Configuration: understand SupervisorConfig, ConfigState, and startup rejection boundaries.
- Split Configuration and Transparent Array Sections: understand split groups.yaml / children.yaml layout and template CLI commands.
- Supervisor Tree: understand SupervisorSpec, SupervisorTree, and registry ownership.
- Task Model: understand ChildSpec, TaskFactory, TaskContext, and readiness.
- Policies: understand restart decisions, backoff, fuse rules, quarantine, and task exit classification.
- Runtime Control: understand SupervisorHandle commands and idempotent behavior.
- Dashboard: understand the three-end workflow across the target process, relay, and dashboard client.
- Shutdown: understand four-stage shutdown and blocking worker boundaries.
- Observability: understand events, logs, tracing, metrics, audit data, and run summaries.
- Examples: run each learning example under examples/.
- Quality Gates: run formatting, build, test, documentation, SBOM, and release checks.
Runtime Boundary
The supervisor core governs lifecycle behavior only. High-frequency business messages belong in the data plane. The control plane handles lifecycle commands, current state queries, events, and governance decisions.
Getting Started
Walkthrough: This guide has 5 steps (Step 1 of 5 to Step 5 of 5). Estimated completion time: 5 minutes.
Step 1 of 5: Prerequisites
This project is a Rust library. The examples require Cargo and a Tokio application environment. Repository examples include their required dependencies.
The primary configuration file is examples/config/supervisor.yaml. The loader uses rust-config-tree 0.3.0, reads YAML, and produces ConfigState.
Step 2 of 5: Minimal Command
cargo run --example supervisor_quickstart
The example loads YAML through load_config_from_yaml_file, derives SupervisorSpec through ConfigState::to_supervisor_spec, starts the runtime through Supervisor::start, queries current_state, and then shuts down the tree through shutdown_tree.
Step 3 of 5: Minimal Code Path
use rust_supervisor::config::loader::load_config_from_yaml_file;
use rust_supervisor::runtime::supervisor::Supervisor;

#[tokio::main]
async fn main() -> Result<(), rust_supervisor::error::types::SupervisorError> {
    // Load and validate the YAML configuration into ConfigState.
    let state = load_config_from_yaml_file("examples/config/supervisor.yaml")?;
    // Derive the declarative supervisor specification.
    let spec = state.to_supervisor_spec()?;
    let handle = Supervisor::start(spec).await?;
    let current = handle.current_state().await?;
    println!("{current:#?}");
    handle.shutdown_tree("operator", "quickstart complete").await?;
    Ok(())
}
Step 4 of 5: Result
The example validates the integration path. It is not a business task template. Application workers should live inside ChildSpec and TaskFactory boundaries instead of being started as unmanaged background tasks.
Step 5 of 5: Health Self-Check
After startup, the supervisor prints a health self-check JSON to stdout. The JSON schema is formally defined in health-selfcheck-schema.md.
Expected output (example):
{
  "status": "ready",
  "supervisor_version": "0.1.2",
  "uptime_secs": 3600,
  "children": { "total": 5, "running": 5, "failed": 0 },
  "dashboard_link": "connected"
}
If status is not "ready", check the operations runbook for troubleshooting steps.
Entry Points
The Supervisor struct in src/runtime/supervisor.rs:36-83 provides 3 entry methods:
| Method | Input | When to Use |
|---|---|---|
| Supervisor::start(spec) | SupervisorSpec (built programmatically) | You already have a spec object |
| Supervisor::start_from_config_state(state) | ConfigState (validated config) | You loaded config via the loader |
| Supervisor::start_from_config_file(path) | YAML file path | Direct launch from a file |
All 3 converge on the private start_with_policy() (src/runtime/supervisor.rs:95-126), which:
- Calls spec.validate() to verify all child declarations
- Creates an mpsc command channel and a broadcast event channel
- Creates RuntimeControlPlane and ObservabilityPipeline
- Builds RuntimeControlState
- Spawns the control loop via tokio::spawn(run_control_loop(...))
- Starts RuntimeWatchdog to monitor control loop health
- Returns SupervisorHandle for commands (restart, shutdown, etc.) and event subscriptions
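The shape of this startup sequence (validate, create a command channel, spawn a control loop, return a handle) can be sketched with the standard library alone. This is only an analog under stated assumptions: it uses std::thread and std::sync::mpsc instead of Tokio, and Command, HandleSketch, and start_sketch are hypothetical names, not the crate's API.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical control-plane commands; the real command set is richer.
pub enum Command {
    CurrentState(Sender<usize>),
    Shutdown,
}

/// Hypothetical handle that owns the command sender and the control loop.
pub struct HandleSketch {
    commands: Sender<Command>,
    control_loop: JoinHandle<()>,
}

/// Validates first, then creates the channel and spawns the loop.
pub fn start_sketch(children: Vec<String>) -> Result<HandleSketch, String> {
    // Reject invalid declarations before any channel or thread exists.
    if children.iter().any(|name| name.is_empty()) {
        return Err("child name must be non-empty".to_string());
    }
    let (commands, inbox) = mpsc::channel::<Command>();
    let control_loop = thread::spawn(move || {
        // The loop ends on Shutdown or when every sender is dropped.
        for command in inbox {
            match command {
                Command::CurrentState(reply) => {
                    let _ = reply.send(children.len());
                }
                Command::Shutdown => break,
            }
        }
    });
    Ok(HandleSketch { commands, control_loop })
}

impl HandleSketch {
    /// Asks the control loop how many children it tracks.
    pub fn child_count(&self) -> Option<usize> {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.commands.send(Command::CurrentState(reply_tx)).ok()?;
        reply_rx.recv().ok()
    }

    /// Sends Shutdown and waits for the control loop to finish.
    pub fn shutdown(self) -> bool {
        let _ = self.commands.send(Command::Shutdown);
        self.control_loop.join().is_ok()
    }
}
```

Validating before creating the channel mirrors the ordering in the list above: a rejected spec leaves no thread or channel behind.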
Usage Examples
From YAML file via ConfigState — start_from_config_state
Full example: examples/supervisor_quickstart.rs. Config: examples/config/supervisor.yaml.
use rust_supervisor::config::loader::load_config_from_yaml_file;
use rust_supervisor::runtime::supervisor::Supervisor;

#[tokio::main]
async fn main() -> Result<(), rust_supervisor::error::types::SupervisorError> {
    let state = load_config_from_yaml_file("examples/config/supervisor.yaml")?;
    let handle = Supervisor::start_from_config_state(state).await?;
    handle.shutdown_tree("operator", "quickstart complete").await?;
    Ok(())
}
load_config_from_yaml_file returns a ConfigState. Its to_supervisor_spec() is called internally by start_from_config_state.
Direct from YAML file path — start_from_config_file
One-step shortcut that calls load_config_from_yaml_file internally:
use rust_supervisor::runtime::supervisor::Supervisor;

#[tokio::main]
async fn main() -> Result<(), rust_supervisor::error::types::SupervisorError> {
    let handle = Supervisor::start_from_config_file("examples/config/supervisor.yaml").await?;
    handle.shutdown_tree("operator", "done").await?;
    Ok(())
}
Programmatic spec — start
Full example: examples/supervisor_tree_story.rs.
use std::sync::Arc;

use rust_supervisor::id::types::ChildId;
use rust_supervisor::runtime::supervisor::Supervisor;
use rust_supervisor::spec::child::{ChildSpec, TaskKind};
use rust_supervisor::spec::supervisor::SupervisorSpec;
use rust_supervisor::task::factory::{TaskResult, service_fn};

#[tokio::main]
async fn main() -> Result<(), rust_supervisor::error::types::SupervisorError> {
    // The factory creates a fresh future for every start attempt.
    let factory = service_fn(|ctx| async move {
        ctx.heartbeat();
        ctx.mark_ready();
        println!("child running at path={}", ctx.path);
        TaskResult::Succeeded
    });
    let child = ChildSpec::worker(
        ChildId::new("demo-worker"),
        "Demo Worker",
        TaskKind::AsyncWorker,
        Arc::new(factory),
    );
    let spec = SupervisorSpec::root(vec![child]);
    let handle = Supervisor::start(spec).await?;
    let state = handle.current_state().await?;
    println!("{state:#?}");
    handle.shutdown_tree("operator", "demo complete").await?;
    Ok(())
}
ChildSpec::worker() automatically sets task_role = Some(TaskRole::Worker), equivalent to task_role: worker in YAML.
TaskRole Behavior
The 5 TaskRole variants dispatch to different default lifecycle policies via RoleDefaultPolicy::for_role():
| Dimension | Service | Worker | Job | Sidecar | Supervisor |
|---|---|---|---|---|---|
| On success | Restart | Stop | Stop | Restart | Restart |
| On timeout | RestartWithBackoff | RestartWithBackoff | StopAndEscalate | RestartWithBackoff | RestartWithBackoff |
| Max restarts | 10 | 3 | 1 | 5 | 3 |
| Default severity | Critical | Standard | Optional | Standard | Critical |
The per-task role defaults are defined by 5 constructors in src/policy/task_role_defaults.rs:418-464:
- Service: long-running daemon; restarts on success, up to 10 restarts, Critical severity. Expected to stay online indefinitely.
- Worker: background task; stops on success, up to 3 restarts, Standard severity.
- Job: one-shot task; stops on success, a timeout stops and escalates instead of restarting with backoff, at most 1 restart, Optional severity.
- Sidecar: auxiliary process; same restart-on-success behavior as Service but a lower restart budget (5). Requires a SidecarConfig binding to a primary.
- Supervisor: nested supervision tree; same restart-on-success behavior as Service, up to 3 restarts, Critical severity.
When task_role is None, EffectivePolicy::merge() falls back to TaskRole::Worker with a warning. semantic_conflicts_for_child() detects role violations (e.g., Job with permanent restart policy).
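The role table can be read as a pure lookup from role to defaults. The sketch below restates it with hypothetical stand-in types (RoleDefaults, defaults_for_role, effective_role); it is not the crate's RoleDefaultPolicy, and it omits the warning that the real fallback emits.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskRole { Service, Worker, Job, Sidecar, Supervisor }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OnSuccess { Restart, Stop }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OnTimeout { RestartWithBackoff, StopAndEscalate }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity { Critical, Standard, Optional }

/// Hypothetical record of one column of the role table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RoleDefaults {
    pub on_success: OnSuccess,
    pub on_timeout: OnTimeout,
    pub max_restarts: u32,
    pub severity: Severity,
}

fn row(on_success: OnSuccess, on_timeout: OnTimeout, max_restarts: u32, severity: Severity) -> RoleDefaults {
    RoleDefaults { on_success, on_timeout, max_restarts, severity }
}

/// Returns the defaults listed in the table for one role.
pub fn defaults_for_role(role: TaskRole) -> RoleDefaults {
    use OnTimeout::{RestartWithBackoff, StopAndEscalate};
    match role {
        TaskRole::Service => row(OnSuccess::Restart, RestartWithBackoff, 10, Severity::Critical),
        TaskRole::Worker => row(OnSuccess::Stop, RestartWithBackoff, 3, Severity::Standard),
        TaskRole::Job => row(OnSuccess::Stop, StopAndEscalate, 1, Severity::Optional),
        TaskRole::Sidecar => row(OnSuccess::Restart, RestartWithBackoff, 5, Severity::Standard),
        TaskRole::Supervisor => row(OnSuccess::Restart, RestartWithBackoff, 3, Severity::Critical),
    }
}

/// A missing role falls back to Worker, as described above.
pub fn effective_role(declared: Option<TaskRole>) -> TaskRole {
    declared.unwrap_or(TaskRole::Worker)
}
```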
Configuration and Schema
Entry Point
The configuration entry point is rust_supervisor::config::loader::load_config_from_yaml_file. It accepts only the YAML primary configuration file. The repository example path is examples/config/supervisor.yaml.
The configuration struct SupervisorConfig contains these top-level groups:
| Group | Type | Description |
|---|---|---|
| include | Vec<PathBuf> | Additional config files included by rust-config-tree |
| supervisor | SupervisorRootConfig | Root supervision strategy |
| policy | PolicyConfig | Restart, backoff, heartbeat, failure window, restart budget, meltdown fuse, and supervision pipeline capacities |
| shutdown | ShutdownConfig | Graceful timeout and abort wait budgets |
| observability | ObservabilityConfig | Event journal capacity and metric/audit switches |
| audit | AuditConfig | Audit storage backend, JSON Lines file path, and write failure strategy |
| backpressure | BackpressureConfig | Backpressure strategy, thresholds, window, and audit channel capacity for observability subscribers |
| groups | GroupsConfigSection | Group name and group-level restart budget overrides; membership is declared on children[].group; supports split file groups.yaml |
| group_strategies | Vec<GroupStrategyConfig> | Group-level supervision strategies, restart limits, and escalation policies |
| group_dependencies | Vec<GroupDependencyConfig> | Cross-group failure propagation edges |
| child_strategy_overrides | Vec<ChildStrategyOverrideConfig> | Child-level supervision strategies, restart limits, and escalation policies |
| severity_defaults | Vec<SeverityDefaultConfig> | Default severity class per task role |
| dashboard | Option<DashboardIpcConfig> | Optional dashboard IPC socket (Unix only) |
| children | ChildrenConfigSection | Declarative child specifications; serialized as a YAML array; supports split file children.yaml |
Configuration State
rust_supervisor::config::configurable::SupervisorConfig is the public root configuration struct. It supports confique::Config, schemars::JsonSchema, serde::Serialize, and serde::Deserialize. Users can reuse the same model for YAML loading, template generation, and JSON Schema generation.
ConfigState is the validated immutable state. Runtime modules must not keep separate runtime tunable constants.
ConfigState::to_supervisor_spec derives SupervisorSpec. The implementation fills the supervision strategy, policy defaults, shutdown budgets, health timing, observability capacity, backpressure policy, dynamic supervisor policy, restart budget, failure window, meltdown fuse, supervision pipeline capacities, group policies, and child strategy overrides from configuration values.
Templates and Split Configuration
The official single-file template is examples/config/supervisor.template.yaml.
groups and children use transparent array sections. They can live in the root file or be split into groups.yaml and children.yaml through include. Split files contain only the array body, not an items: wrapper.
- Details: Split Configuration and Transparent Array Sections
- Generated template tree: config/supervisor_config/
- Runnable split example: cargo run --example split_config_supervisor
Generate templates and schemas. CLI subcommands are top-level without a config prefix. --config belongs to the run and validate-config subcommands; generate-template and generate-schema use examples/config/supervisor.yaml as the default template source:
cargo run -- run --config examples/config/supervisor.yaml
cargo run -- validate-config --config examples/config/split/supervisor.yaml
cargo run -- generate-template \
    --output config/supervisor_config/supervisor_config.example.yaml
cargo run -- generate-schema \
    --output config/supervisor_config/supervisor.schema.json
Error Boundary
Configuration loading returns SupervisorError::FatalConfig when startup must be rejected:
Root-level checks:
- The file extension is not YAML.
- The file cannot be read.
- YAML cannot be parsed into SupervisorConfig.
- The supervision strategy is not one of OneForOne, OneForAll, or RestForOne.
- A required numeric value is zero.
- The initial backoff is greater than the maximum backoff.
- The jitter ratio is outside the accepted range.
- policy.restart_budget.window_secs, policy.restart_budget.max_burst, or policy.restart_budget.recovery_rate_per_sec is invalid.
- policy.failure_window.window_secs, policy.failure_window.max_count, or policy.failure_window.threshold is invalid.
- A policy.meltdown.* window or threshold is zero.
- A policy.supervision_pipeline.* capacity or concurrent restart limit is zero.
- supervisor.dynamic_supervisor.child_limit is zero.
- backpressure.warn_threshold_pct is not between 1 and 100.
- backpressure.critical_threshold_pct is not between 1 and 100.
- backpressure.warn_threshold_pct is greater than or equal to backpressure.critical_threshold_pct.
- backpressure.window_secs or backpressure.audit_channel_capacity is zero.
Child declaration checks:
- Child ID and name must be non-empty.
- Tags must be non-empty.
- A child with kind: Supervisor must not have a factory; a child with kind: AsyncWorker or kind: BlockingWorker must have one.
- Sidecar task role requires sidecar_config, and vice versa.
- Dependency cycles are rejected.
- Group membership is declared only on children[].group; referenced group names must exist in groups.
- Group names referenced by group_strategies and group_dependencies must exist.
- Child names referenced by child_strategy_overrides must exist.
- severity_defaults must not declare the same task role more than once.
IPC checks (when dashboard.enabled = true):
- target_id must be non-empty.
- path is required and must be absolute.
- Registration relay_registration_path is required and must be absolute.
- lease_seconds must be greater than zero.
- heartbeat_interval_seconds must be positive and less than lease_seconds.
Supervisor::start_from_config_file rejects invalid configuration before it creates runtime channels or spawns the control loop.
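A few of the numeric rules above can be sketched as plain functions that return an error before anything is started. FatalConfig here is a hypothetical stand-in for SupervisorError::FatalConfig, the function names are illustrative, and the 0.0 to 1.0 jitter range is an assumption because the accepted range is not stated in this manual.

```rust
/// Hypothetical stand-in for SupervisorError::FatalConfig.
#[derive(Debug, PartialEq, Eq)]
pub struct FatalConfig(pub String);

pub struct BackoffSettings {
    pub initial_backoff_ms: u64,
    pub max_backoff_ms: u64,
    pub jitter_ratio: f64,
}

/// Checks the zero, ordering, and jitter rules for backoff values.
pub fn validate_backoff(backoff: &BackoffSettings) -> Result<(), FatalConfig> {
    if backoff.initial_backoff_ms == 0 || backoff.max_backoff_ms == 0 {
        return Err(FatalConfig("backoff values must be non-zero".to_string()));
    }
    if backoff.initial_backoff_ms > backoff.max_backoff_ms {
        return Err(FatalConfig("initial backoff exceeds maximum backoff".to_string()));
    }
    // Assumption: the accepted jitter range is 0.0..=1.0 (NaN is rejected too).
    if !(0.0..=1.0).contains(&backoff.jitter_ratio) {
        return Err(FatalConfig("jitter ratio is outside the accepted range".to_string()));
    }
    Ok(())
}

/// Checks that both thresholds are in 1..=100 and that warn < critical.
pub fn validate_backpressure(warn_pct: u8, critical_pct: u8) -> Result<(), FatalConfig> {
    for (name, value) in [("warn", warn_pct), ("critical", critical_pct)] {
        if !(1..=100).contains(&value) {
            return Err(FatalConfig(format!("{name} threshold must be between 1 and 100")));
        }
    }
    if warn_pct >= critical_pct {
        return Err(FatalConfig("warn threshold must be below critical threshold".to_string()));
    }
    Ok(())
}

/// Checks the dashboard lease and heartbeat relationship.
pub fn validate_lease(lease_seconds: u64, heartbeat_interval_seconds: u64) -> Result<(), FatalConfig> {
    if lease_seconds == 0 {
        return Err(FatalConfig("lease_seconds must be greater than zero".to_string()));
    }
    if heartbeat_interval_seconds == 0 || heartbeat_interval_seconds >= lease_seconds {
        return Err(FatalConfig("heartbeat interval must be positive and less than the lease".to_string()));
    }
    Ok(())
}
```

With the values from the example configuration (100 ms and 5000 ms backoff, 0.10 jitter, 80 and 95 percent thresholds, a 30 second lease with a 15 second heartbeat), all three checks in this sketch pass.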
Example Configuration
supervisor:
  strategy: OneForAll
  escalation_policy: escalate_to_parent
  control_channel_capacity: 256
  event_channel_capacity: 256
  dynamic_supervisor:
    enabled: true
    child_limit: 16
policy:
  child_restart_limit: 10
  child_restart_window_ms: 60000
  supervisor_failure_limit: 30
  supervisor_failure_window_ms: 60000
  initial_backoff_ms: 100
  max_backoff_ms: 5000
  jitter_ratio: 0.10
  heartbeat_interval_ms: 1000
  stale_after_ms: 3000
  restart_budget:
    window_secs: 60
    max_burst: 10
    recovery_rate_per_sec: 0.50
  failure_window:
    mode: time_sliding
    window_secs: 60
    max_count: 5
    threshold: 5
  meltdown:
    child_max_restarts: 3
    child_window_secs: 10
    group_max_failures: 5
    group_window_secs: 30
    supervisor_max_failures: 10
    supervisor_window_secs: 60
    reset_after_secs: 120
  supervision_pipeline:
    journal_capacity: 100
    subscriber_capacity: 10
    concurrent_restart_limit: 5
shutdown:
  graceful_timeout_ms: 5000
  abort_wait_ms: 1000
observability:
  event_journal_capacity: 256
  metrics_enabled: true
  audit_enabled: true
audit:
  enabled: true
  backend: memory
  failure_strategy: fail_closed
  max_defer_queue: 1000
backpressure:
  strategy: alert_and_block
  warn_threshold_pct: 80
  critical_threshold_pct: 95
  window_secs: 30
  audit_channel_capacity: 1024
groups:
  - name: core
    children:
      - api
    budget:
      window_secs: 60
      max_burst: 10
      recovery_rate_per_sec: 0.50
  - name: upstream
    children: []
group_strategies:
  - group: core
    strategy: OneForOne
    restart_limit:
      max_restarts: 5
      window_ms: 60000
    escalation_policy: quarantine_scope
group_dependencies:
  - from_group: core
    to_group: upstream
    propagation: Full
child_strategy_overrides:
  - child_id: api
    strategy: RestForOne
    restart_limit:
      max_restarts: 3
      window_ms: 30000
    escalation_policy: shutdown_tree
severity_defaults:
  - task_role: service
    severity: Critical
children:
  - name: api
    kind: supervisor
    criticality: critical
    tags:
      - core
    task_role: supervisor
    severity: Critical
    group: core
    restart_policy: transient
dashboard:
  enabled: true
  target_id: payments-worker-a
  path: /tmp/rust-supervisor-demo/payments-worker-a.sock
  permissions: "0600"
  bind_mode: replace_stale
  registration:
    enabled: true
    relay_registration_path: /tmp/rust-supervisor-demo/dashboard-relay-registration.sock
    display_name: "payments worker a"
    lease_seconds: 30
    registration_heartbeat_interval_seconds: 15
Secret Placeholders
Configuration values that reference secrets use the ${SECRET_NAME} placeholder format.
Replace these placeholders with environment variables or your secret management solution
before starting the supervisor. Example:
dashboard:
  security_config:
    peer_identity:
      allowed_uids: [ "${SUPERVISOR_UID}" ]
dashboard.security_config does not carry audit settings. IPC audit persistence uses the root audit section so there is one authoritative AuditConfig.
The supervisor does not resolve placeholders at runtime; replacement must happen
before configuration loading (e.g., via envsubst or your deployment pipeline).
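Because replacement happens outside the supervisor, a deployment step may want to confirm that no placeholder survived substitution. The helper below is a hypothetical pre-flight check, not part of the crate; it only scans text for the ${NAME} pattern described above.

```rust
/// Returns the names of all `${NAME}` placeholders still present in `text`.
/// Hypothetical helper for a deployment pipeline; not a crate function.
pub fn unresolved_placeholders(text: &str) -> Vec<String> {
    let mut found = Vec::new();
    let mut rest = text;
    while let Some(start) = rest.find("${") {
        let after = &rest[start + 2..];
        match after.find('}') {
            Some(end) => {
                found.push(after[..end].to_string());
                rest = &after[end + 1..];
            }
            // An unterminated "${" is not a complete placeholder; stop scanning.
            None => break,
        }
    }
    found
}
```

A pipeline could run this over the rendered YAML and refuse to start the supervisor when the returned list is non-empty.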
TLS is handled by the relay layer (rust-supervisor-relay) using wss://. The supervisor
target process exposes only a local Unix domain socket and does not terminate TLS.
Upgrade
This version does not support in-place upgrades. To upgrade, deploy a fresh instance with the new version and migrate state through the external IPC interface.
Split Configuration and Transparent Array Sections
Overview
groups and children on SupervisorConfig use transparent array sections. They appear as YAML arrays on disk, are stored in Rust behind an items field, and are loaded or templated by rust-config-tree 0.3.0.
In short: use children: [...] in a single file, or write only [...] in a split file. Do not wrap split bodies with items:.
Fields
| Field | Rust type | Split file | Section schema top-level type |
|---|---|---|---|
| groups | GroupsConfigSection | groups.yaml | array |
| children | ChildrenConfigSection | children.yaml | array |
Access items with:
config.children.len();
config.children.as_slice();
config.groups.as_slice();
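To make the "array on disk, items field in Rust" idea concrete, here is a minimal sketch of a section type that hides its items field behind accessors. ChildEntry and ChildrenSectionSketch are hypothetical names, and the serde and schema implementations that make the real sections transparent are left out.

```rust
/// Hypothetical child entry with only two of the declared fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChildEntry {
    pub name: String,
    pub kind: String,
}

/// Hypothetical section wrapper: callers never touch `items` directly.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct ChildrenSectionSketch {
    items: Vec<ChildEntry>,
}

impl ChildrenSectionSketch {
    pub fn from_items(items: Vec<ChildEntry>) -> Self {
        Self { items }
    }

    /// Number of declared children.
    pub fn len(&self) -> usize {
        self.items.len()
    }

    pub fn is_empty(&self) -> bool {
        self.items.is_empty()
    }

    /// Borrowed view of the declared children in declaration order.
    pub fn as_slice(&self) -> &[ChildEntry] {
        &self.items
    }
}
```

Keeping items private is one way to let the on-disk shape stay a bare array while the Rust side keeps a named field.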
Single-file layout
groups:
  - name: core
children:
  - name: api
    kind: async_worker
Split layout
Root config:
include:
  - groups.yaml
  - children.yaml
supervisor:
  strategy: OneForAll
policy:
  child_restart_limit: 10
# ... remaining policy / shutdown / observability fields
Split files contain only the array body:
# groups.yaml
- name: core
  budget:
    window_secs: 60
    max_burst: 10
    recovery_rate_per_sec: 0.5

# children.yaml
- name: worker
  kind: async_worker
  criticality: optional
  restart_policy: permanent
Repository references:
- Generated template tree: config/supervisor_config/
- Example inputs: examples/config/split/
- Runnable example: cargo run --example split_config_supervisor
Supported YAML shapes
The loader accepts all three forms:
| Shape | Example |
|---|---|
| Transparent array | children: [{ name: api }] |
| Body-only split file | children.yaml contains only - name: api |
| Legacy items wrapper | children:\n items: [...] |
The loader accepts flow-style arrays such as [{ name: worker }], but avoid them in files you maintain. Template generation emits block YAML.
CLI
Use the rust-tokio-supervisor binary or cargo run. Subcommands are top-level; there is no config prefix. --config belongs to the run and validate-config subcommands, not the global CLI.
# Validate and print a summary (default `run` command)
cargo run -- run --config examples/config/supervisor.yaml
# Validate the full config tree (includes, defaults, runtime validation)
cargo run -- validate-config --config examples/config/split/supervisor.yaml
# Generate split templates (includes groups.yaml / children.yaml)
# Template source defaults to examples/config/supervisor.yaml
cargo run -- generate-template \
    --output config/supervisor_config/supervisor_config.example.yaml
# Generate JSON Schemas
cargo run -- generate-schema \
    --output config/supervisor_config/supervisor.schema.json
No post-processing is required after generation. The library strips section root keys and rewrites flow arrays to block YAML.
Loading in code
use rust_config_tree::config::load_config;
use rust_supervisor::config::{
    configurable::SupervisorConfig,
    loader::load_config_from_yaml_file,
};

// Typed load through rust-config-tree.
let config = load_config::<SupervisorConfig>("supervisor.yaml")?;
// Load through the supervisor loader, which produces ConfigState.
let state = load_config_from_yaml_file("supervisor.yaml")?;
Runtime defaults vs template samples
| Scenario | children at runtime |
|---|---|
| children omitted from every config file | []; template sample worker is not injected |
| generate-template command | children.yaml may include a worker sample from #[config(default = ...)] |
| Body-only split file | Loads the array contents normally |
groups template defaults to [].
IDE completion
Bind the section schema in split files:
# yaml-language-server: $schema=./children.schema.json
- name: worker
  kind: async_worker
children.schema.json is a top-level array, not an { items: [...] } object.
Reusing the pattern in another crate
To add another transparent array section:
- Declare a dedicated struct per transparent array section with an items: Vec<T> field, and implement transparent-array Serialize, Deserialize, JsonSchema, and accessors (see ChildrenConfigSection and GroupsConfigSection).
- Mark the root field with #[schemars(extend("x-tree-split" = true, "x-tree-transparent-array" = true))].
- Implement ConfigSchema::include_paths.
- Use load_config, write_config_templates, and write_config_schemas.
See the x-tree-transparent-array section in the rust-config-tree documentation.
factory_key Configuration
1. Summary
factory_key is a task factory key in YAML configuration. Its value is the name shared by the configuration file and Rust code, such as api_server. It connects a declarative worker child to a TaskFactory registered in Rust code.
The configuration file stores only declarations. It does not store executable closures. The real task startup logic must be supplied by Rust code.
2. Problem
A Supervisor task tree can declare children through configuration files. However, async_worker and blocking_worker children need an executable TaskFactory when they actually start. A TaskFactory contains Rust code and usually a closure, so it cannot be safely stored directly in YAML.
factory_key defines the boundary. The configuration file writes an agreed key, and Rust code registers a task factory under the same key. Before startup, the system binds the declaration to the executable factory.
3. Configuration
children.yaml can declare workers like this:
- name: api
  kind: async_worker
  factory_key: api_server
- name: exporter
  kind: blocking_worker
  factory_key: report_exporter
api_server and report_exporter are not function names. They are configuration-level task factory keys. Rust code must register matching TaskFactory values.
4. Rust Registration
Rust code uses TaskFactoryRegistry to map keys to TaskFactory values.
use rust_supervisor::spec::child::TaskKind;
use rust_supervisor::task::factory::{TaskResult, service_fn};
use rust_supervisor::task::factory_registry::{
    TaskFactoryDescriptor, TaskFactoryRegistry,
};
use std::sync::Arc;

let mut registry = TaskFactoryRegistry::new();

// Key, title, description, allowed kinds, then the factory itself.
registry.register(TaskFactoryDescriptor::new(
    "api_server",
    "API Server",
    "Runs the API service.",
    [TaskKind::AsyncWorker],
    Arc::new(service_fn(|_ctx| async { TaskResult::Succeeded })),
))?;
registry.register(TaskFactoryDescriptor::new(
    "report_exporter",
    "Report Exporter",
    "Runs blocking export work.",
    [TaskKind::BlockingWorker],
    Arc::new(service_fn(|_ctx| async { TaskResult::Succeeded })),
))?;
TaskFactoryDescriptor stores 3 kinds of data:
- key: The task factory key used by configuration files.
- title and description: Metadata shown by schema-backed editor completion.
- allowed_kinds: The task kinds that may use this factory, such as TaskKind::AsyncWorker or TaskKind::BlockingWorker.
5. Startup Binding
After configuration loading, factory_key is still only a string. Before startup, the string must be resolved to a real TaskFactory.
The current binding path is:
- ConfigState reads child declarations from YAML.
- to_supervisor_spec_with_factories uses TaskFactoryRegistry to bind workers.
- bind_task_factories checks every worker's factory_key.
- The registry resolves the matching TaskFactory and writes it into ChildSpec.
- Supervisor starts with already-bound executable task factories.
Binding rules:
- Worker children must declare factory_key.
- Supervisor child nodes must not declare factory_key.
- An unknown factory_key causes a configuration error.
- A factory that does not support the current TaskKind causes a configuration error.
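The four binding rules can be sketched as a single check against a key-to-kinds map. This is an illustration only: RegistrySketch, Kind, and BindError are hypothetical names, the real registry also stores the factory and its metadata, and the real errors are configuration errors rather than this enum.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind { AsyncWorker, BlockingWorker, Supervisor }

/// Hypothetical error type with one variant per binding rule.
#[derive(Debug, PartialEq, Eq)]
pub enum BindError {
    MissingKey,
    UnexpectedKey,
    UnknownKey(String),
    KindNotAllowed(String),
}

/// Hypothetical registry that tracks only which kinds each key allows.
#[derive(Default)]
pub struct RegistrySketch {
    allowed_kinds: HashMap<String, Vec<Kind>>,
}

impl RegistrySketch {
    pub fn register(&mut self, key: &str, kinds: &[Kind]) {
        self.allowed_kinds.insert(key.to_string(), kinds.to_vec());
    }

    /// Applies the binding rules to one child declaration.
    pub fn check(&self, kind: Kind, factory_key: Option<&str>) -> Result<(), BindError> {
        match (kind, factory_key) {
            // Supervisor nodes must not declare a key.
            (Kind::Supervisor, Some(_)) => Err(BindError::UnexpectedKey),
            (Kind::Supervisor, None) => Ok(()),
            // Worker children must declare a key.
            (_, None) => Err(BindError::MissingKey),
            (_, Some(key)) => match self.allowed_kinds.get(key) {
                None => Err(BindError::UnknownKey(key.to_string())),
                Some(kinds) if !kinds.contains(&kind) => {
                    Err(BindError::KindNotAllowed(key.to_string()))
                }
                Some(_) => Ok(()),
            },
        }
    }
}
```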
6. Completion Generation
Editor completion depends on JSON Schema. The current implementation does not rewrite the rust-config-tree schema generator. Instead, it post-processes the base schema generated by rust-config-tree.
The flow is:
- generate-template or generate-schema asks rust-config-tree to generate the base schema.
- supervisor_schema_targets_with_factory_registry receives the root schema and split-section schemas.
- Each schema is parsed into serde_json::Value.
- inject_factory_key_completions_if_present finds the factory_key field.
- The system writes keys from TaskFactoryRegistry into oneOf.
- The schema is serialized again and written to the target file.
After generation, children.schema.json contains a factory_key field like this:
{
  "factory_key": {
    "description": "TaskFactory registry key used to bind worker children before startup.",
    "oneOf": [
      {
        "const": "api_server",
        "description": "Runs the API service.",
        "title": "API Server"
      },
      {
        "const": "report_exporter",
        "description": "Runs blocking export work.",
        "title": "Report Exporter"
      }
    ],
    "type": [
      "string",
      "null"
    ]
  }
}
When an editor reads the yaml-language-server schema directive at the top of children.yaml, it can offer factory_key candidates.
7. Commands
Generate templates:
target/debug/rust-tokio-supervisor generate-template
This command writes configuration templates and schemas with completion metadata.
Generate schemas only:
target/debug/rust-tokio-supervisor generate-schema
This command writes schemas only, and the generated schema also contains factory_key candidates in oneOf.
8. Current Boundaries
- factory_key is a configuration declaration, not executable code.
- Completion candidates come from the TaskFactoryRegistry used by the command.
- If Rust code does not register a key, a configuration file using that key cannot start.
- Schema-backed completion helps editors suggest valid candidates, but it does not replace startup binding validation.
- Runtime child addition goes through the same kind of binding validation, so dynamic additions cannot bypass the registry.
Supervisor Tree
Declaration Model
SupervisorSpec describes one supervisor node. It contains:
- path: stable path for this supervisor
- strategy: restart scope strategy (OneForOne, OneForAll, RestForOne)
- children: child specifications in declaration order
- config_version: configuration version that produced this spec
- default_restart_policy, default_backoff_policy, default_health_policy, default_shutdown_policy: policies inherited by children that do not override them
- supervisor_failure_limit: maximum supervisor failures before parent escalation
- restart_limit: optional supervisor-level restart limit
- escalation_policy: optional supervisor-level escalation policy
- group_strategies: group-level strategy overrides
- group_configs: group-level restart budget, membership, and isolation configs
- group_dependencies: cross-group dependency edges for fault propagation
- severity_defaults: default severity class per task role for escalation bifurcation
- child_strategy_overrides: per-child strategy and governance overrides
- dynamic_supervisor_policy: runtime add_child acceptance policy
- control_channel_capacity: mpsc command channel capacity
- event_channel_capacity: broadcast event channel capacity
ChildSpec describes one child. It contains:
- id, name, kind: stable identity and task kind
- factory: optional Arc<dyn TaskFactory> for worker children
- restart_policy, shutdown_policy, health_policy, readiness_policy, backoff_policy: per-child policy overrides
- dependencies: child IDs that must become ready before this child starts
- tags: low-cardinality grouping labels
- criticality: Critical or Optional
- task_role: optional TaskRole that selects default lifecycle policy semantics
- sidecar_config: optional sidecar binding (required when role is Sidecar)
- severity: optional explicit severity override
- group: optional group name for group-level isolation and budget tracking
- health_check, readiness: optional health/readiness check configurations
- resource_limits: optional resource limits
- command_permissions: command permissions granted to this child
- environment, secrets: environment variables and secret references
Tree Building
SupervisorTree::build validates SupervisorSpec and converts children into path-aware nodes. Each child path is derived from the parent path and ChildId.
SupervisorPath::root returns the root path. SupervisorPath::join appends a child path segment. SupervisorPath::parent returns the parent path when it exists.
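The three path operations behave like operations on a list of segments. The sketch below uses a hypothetical PathSketch type to show that relationship; the slash-separated rendering is an assumption and may differ from how SupervisorPath actually displays.

```rust
use std::fmt;

/// Hypothetical path type: an ordered list of child segments.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PathSketch {
    segments: Vec<String>,
}

impl PathSketch {
    /// The root path has no segments.
    pub fn root() -> Self {
        Self { segments: Vec::new() }
    }

    /// Returns a new path with one more segment appended.
    pub fn join(&self, segment: &str) -> Self {
        let mut segments = self.segments.clone();
        segments.push(segment.to_string());
        Self { segments }
    }

    /// Returns the parent path, or None for the root.
    pub fn parent(&self) -> Option<Self> {
        if self.segments.is_empty() {
            return None;
        }
        let mut segments = self.segments.clone();
        segments.pop();
        Some(Self { segments })
    }
}

impl fmt::Display for PathSketch {
    // Assumption: render as "/" for root and "/a/b" for nested paths.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if self.segments.is_empty() {
            return write!(f, "/");
        }
        for segment in &self.segments {
            write!(f, "/{segment}")?;
        }
        Ok(())
    }
}
```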
Startup And Shutdown Order
startup_order returns nodes in declaration order. shutdown_order returns nodes in reverse declaration order. This ordering is the basis for Shutdown Without Orphaned Tasks.
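As a small illustration of the ordering rule, the two functions below take child names in declaration order and return the start and stop sequences. They are hypothetical free functions over string slices, not the SupervisorTree methods, which operate on tree nodes.

```rust
/// Start order is the declaration order, unchanged.
pub fn startup_order_sketch<'a>(declared: &[&'a str]) -> Vec<&'a str> {
    declared.to_vec()
}

/// Shutdown order is the declaration order reversed, so children that
/// started last are stopped first.
pub fn shutdown_order_sketch<'a>(declared: &[&'a str]) -> Vec<&'a str> {
    declared.iter().rev().copied().collect()
}
```

For children declared as db, cache, api, this sketch stops api first and db last.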
Restart Planning
restart_execution_plan resolves the runtime restart scope from the tree and SupervisorSpec. It keeps per-child overrides, group strategies, restart limits, escalation policies, and dynamic supervisor policy in one plan so the runtime control loop does not duplicate strategy selection logic.
Registry
RegistryStore stores ChildRuntime values by child identifier, supervisor path, and declaration order. Runtime control and current state queries should go through the registry instead of bypassing it.
Task Model
Task Kinds
TaskKind distinguishes AsyncWorker, BlockingWorker, and Supervisor. A blocking worker must not be treated as a normal asynchronous worker that can always be aborted immediately.
Task Factory
TaskFactory is the core construction contract. Every attempt must create a fresh future. service_fn is an ergonomic adapter that still targets TaskFactory; it does not replace the core model.
TaskResult distinguishes Succeeded, Cancelled, and Failed. The Failed variant carries TaskFailure and TaskFailureKind.
Task Context
TaskContext contains child identifier, supervisor path, generation, attempt, cancellation token, heartbeat sender, and readiness sender.
Workers should use TaskContext::heartbeat to report health, TaskContext::mark_ready to report explicit readiness, and TaskContext::is_cancelled or TaskContext::cancellation_token to react to shutdown.
Readiness
ReadinessPolicy supports Immediate and Explicit. An explicitly ready child should not appear as ready in current state or events until it reports readiness.
ChildSpec and ChildDeclaration
How do ChildSpec and ChildDeclaration relate?
ChildDeclaration is the external declaration that arrives from configuration and RPC. ChildSpec is the internal specification the supervisor runtime uses to register, start, and restart children. The two share many fields but serve different roles. They are connected through TryFrom conversion, which also fills in defaults.
What each one is
| | ChildDeclaration | ChildSpec |
|---|---|---|
| Module | src/spec/child_declaration.rs | src/spec/child.rs |
| Role | Input model for YAML, add_child payloads, and similar sources | Runtime model in the registry and control loop |
| Typical source | Config file deserialization, dynamic child add requests | Converted from a declaration, or built directly in code |
| Can it run alone? | No. It has no factory and no fully materialized policy objects | Yes. The supervisor manages lifecycle from it |
ChildDeclaration focuses on a serializable, validatable declaration: names, dependency names, environment variables, secret placeholders, health_check / readiness config blocks, and rules such as validate_child_declaration (name format, ${SECRET} syntax, and so on).
Beyond declaration fields, ChildSpec also carries runtime essentials such as:
- A resolved ChildId derived from name
- factory: Option<Arc<dyn TaskFactory>>, the task factory that actually runs work (not part of serde)
- Materialized HealthPolicy, ReadinessPolicy, ShutdownPolicy, and BackoffPolicy
- Runtime fields such as isolation and cleanup_paths
How they connect
The data flow looks like this:
YAML / add_child RPC
|
v
ChildDeclaration ---- validate_child_declaration ----+
| |
| TryFrom<ChildDeclaration> for ChildSpec |
v |
ChildSpec --------------------------------------------+
|
v
Register topology, start children, policy pipeline, restart / meltdown, etc.
The conversion lives in TryFrom<ChildDeclaration> for ChildSpec inside child_declaration.rs. It performs steps such as:
- name -> ChildId::new(&decl.name)
- dependency names in dependencies -> Vec<ChildId>
- health_check -> HealthPolicy with default intervals
- readiness present -> ReadinessPolicy::Explicit, otherwise Immediate
- shutdown_policy / backoff_policy and similar fields receive defaults during conversion even when the declaration omits them
When a child is added dynamically, PendingChild keeps both the declaration and the converted child_spec. Auditing also stores a SHA-256 of the declaration (declaration_hash) for reconciliation and compensation.
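A declaration-to-spec conversion of this shape can be sketched with the standard TryFrom trait. Everything below is a simplified assumption for illustration: DemoDeclaration, DemoSpec, DemoReadiness, the String error type, and the 5-second default interval are hypothetical and do not reflect the real fields or defaults in child_declaration.rs.

```rust
use std::convert::TryFrom;
use std::time::Duration;

/// Hypothetical, trimmed-down declaration: serializable input fields only.
#[derive(Debug, Clone, Default)]
pub struct DemoDeclaration {
    pub name: String,
    pub dependencies: Vec<String>,
    pub health_check_interval_ms: Option<u64>,
    pub readiness: Option<String>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DemoReadiness {
    Immediate,
    Explicit,
}

/// Hypothetical runtime spec: every policy is materialized.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DemoSpec {
    pub id: String,
    pub dependencies: Vec<String>,
    pub health_interval: Duration,
    pub readiness: DemoReadiness,
}

impl TryFrom<DemoDeclaration> for DemoSpec {
    type Error = String;

    fn try_from(decl: DemoDeclaration) -> Result<Self, Self::Error> {
        // Reject declarations that cannot produce a usable identifier.
        if decl.name.trim().is_empty() {
            return Err("child name must not be empty".to_string());
        }
        // A readiness block selects Explicit; its absence selects Immediate.
        let readiness = if decl.readiness.is_some() {
            DemoReadiness::Explicit
        } else {
            DemoReadiness::Immediate
        };
        Ok(DemoSpec {
            id: decl.name,
            dependencies: decl.dependencies,
            // Assumed default interval, applied when the declaration omits it.
            health_interval: Duration::from_millis(
                decl.health_check_interval_ms.unwrap_or(5_000),
            ),
            readiness,
        })
    }
}
```

The point of the pattern is that defaults are filled in exactly once, at the conversion boundary, so the runtime never has to handle a half-specified child.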
Shared types
Shared enums and config structs such as RestartPolicy, TaskKind, and HealthCheckConfig are defined in child.rs. ChildDeclaration reuses them to avoid parallel type trees. The top-level containers remain separate: declaration container vs specification container.
ChildSpec construction paths
The repository has 6 paths that construct ChildSpec. They serve different use cases and should not be collapsed into one entry point.
| Path | Typical entry | Use case | Validation behavior |
|---|---|---|---|
| Builder | ChildSpecBuilder::worker, service, job, sidecar, supervisor, new | Direct runtime spec construction in Rust code | build() calls ChildSpec::validate() |
| Worker convenience function | ChildSpec::worker(...) | Worker default bundle only | Delegates to ChildSpecBuilder::worker(...).build() |
| Declaration conversion | TryFrom<ChildDeclaration> for ChildSpec | YAML config, RPC payloads, dynamic child adds | validate_child_declaration runs before conversion, and supervisor-level validation catches final issues |
| Role template | ServiceTemplate::child_spec, JobTemplate::child_spec, and related role templates | Caller already implemented role traits but does not want to hand-build adapters and specs | Calls the matching ChildSpecBuilder internally |
| Macro-generated helper | child_spec() generated by #[service], #[worker], #[job], #[sidecar], and #[supervisor_role] | Default role contract entry path | Generated code calls the matching ChildSpecBuilder |
| Serde | serde_json::from_value::<ChildSpec>(...) | Mainly tests for deserialization defaults and invalid enum handling | Does not pass through the builder, so callers must validate before runtime use or rely on later spec validation |
Important boundaries:
- ChildSpecBuilder::build() is the main exit for Rust code construction paths.
- Configuration and RPC should not accept ChildSpec directly. They should accept ChildDeclaration first, then convert it into ChildSpec.
- Role templates and macros are not new runtime models. They turn role lifecycle objects into adapters, then call ChildSpecBuilder to produce specs.
- Serde can construct ChildSpec because ChildSpec derives Deserialize. That path does not automatically call ChildSpecBuilder::build().
Adjacent paths that do not construct a ChildSpec:
| Entry | Why it is not a ChildSpec construction path |
|---|---|
| SupervisorSpec::root(Vec<ChildSpec>) | It accepts already constructed child specs and builds a supervisor spec |
| SupervisorSpecBuilder::root(Vec<ChildSpec>) | It wraps supervisor spec construction and does not create an individual child spec |
| ConfigState::to_supervisor_spec() | It assembles a supervisor spec from the Vec<ChildSpec> already stored in ConfigState |
| bind_child_factory(...) | It binds a task factory to an existing ChildSpec and does not create a new one |
| clone() | It copies an existing ChildSpec instead of generating one from an input model |
How to remember them
- Writing config, handling API input, validating declarations -> think ChildDeclaration
- Seeing how the supervisor manages a child or what the policy engine reads -> think ChildSpec
- Asking whether YAML and runtime use the same thing -> same underlying information, different lifecycle stage: declaration is input, spec is the landed form
In-code construction
Configuration and RPC should still use ChildDeclaration. When you construct a runtime spec directly in Rust, prefer ChildSpecBuilder:
// Assumes the enclosing function returns Result<_, SupervisorError>,
// so that `?` can propagate a build() failure.
use rust_supervisor::id::types::ChildId;
use rust_supervisor::policy::task_role_defaults::TaskRole;
use rust_supervisor::spec::child::TaskKind;
use rust_supervisor::spec::child_builder::ChildSpecBuilder;
use rust_supervisor::task::factory::{TaskResult, service_fn};
use std::sync::Arc;

// service_fn adapts an async closure to the TaskFactory contract.
let factory = service_fn(|_ctx| async { TaskResult::Succeeded });

let spec = ChildSpecBuilder::worker(
    ChildId::new("worker"),
    "worker",
    TaskKind::AsyncWorker,
    Arc::new(factory),
)
.task_role(TaskRole::Worker)
.tag("invoice")
.build()?; // build() runs ChildSpec::validate() before returning the spec
Entry methods:
| Method | Purpose |
|---|---|
| ChildSpecBuilder::worker(...) | Async or blocking worker; defaults match ChildSpec::worker |
| ChildSpecBuilder::service(...) | Long-running service; sets TaskRole::Service |
| ChildSpecBuilder::job(...) | Finite job; sets TaskRole::Job |
| ChildSpecBuilder::sidecar(...) | Sidecar; sets sidecar binding and the primary child dependency |
| ChildSpecBuilder::supervisor(...) | Nested supervisor; no factory |
| ChildSpecBuilder::new(...) | Minimal skeleton; caller must set kind and, for workers, factory |
Build exit:
| Method | Behavior |
|---|---|
| build() | Calls ChildSpec::validate() after construction; returns SupervisorError on failure |
ChildSpec::worker(...) remains available. It delegates to ChildSpecBuilder::worker(...).build() and also returns Result<ChildSpec, SupervisorError>.
For builder details, see child-spec-builder.md. For field-by-field mapping and defaults through TryFrom, inspect child_declaration.rs directly.
ChildSpecBuilder
One-sentence summary
ChildSpecBuilder is the fluent API for constructing ChildSpec values in Rust code. Configuration and RPC should still use ChildDeclaration. The build exit is build() -> Result<ChildSpec, SupervisorError>, which calls ChildSpec::validate() internally.
Relationship to child-spec.md: that page explains how declarations and specs divide responsibility. This page focuses on builder entry points, setters, and common usage patterns.
Module path
use rust_supervisor::spec::child_builder::ChildSpecBuilder;
The module is defined in src/spec/child_builder.rs. Per project module-boundary rules, there is no pub use re-export.
When to use the builder
| Scenario | Recommended approach |
|---|---|
| YAML config, add_child RPC payloads | ChildDeclaration + TryFrom |
| Tests, examples, hand-built runtime specs in code | ChildSpecBuilder |
| Worker default bundle only, no fluent chain | ChildSpec::worker(...)? (delegates to the builder internally) |
Legacy code may still mutate fields after construction. New code should prefer the builder.
Entry methods
| Method | Purpose | Default highlights |
|---|---|---|
| worker(id, name, kind, factory) | Async or blocking worker | Matches ChildSpec::worker: Transient restart, Critical criticality, TaskRole::Worker, and so on |
| service(id, name, kind, factory) | Long-running service | Based on worker defaults: TaskRole::Service, Critical criticality |
| job(id, name, kind, factory) | Finite job | Based on worker defaults: TaskRole::Job, Optional criticality |
| sidecar(id, name, kind, factory, sidecar_config) | Sidecar attached to a primary child | Based on worker defaults: TaskRole::Sidecar, writes sidecar_config, and automatically adds the primary child dependency |
| supervisor(id, name) | Nested supervisor | kind = Supervisor, factory = None, task_role = Supervisor, criticality = Critical |
| new(id, name) | Minimal skeleton | Sets only id / name plus baseline policies; caller must add kind and, for workers, factory |
Build exit
| Method | Behavior |
|---|---|
| build() | Takes the inner ChildSpec, calls validate(), returns Ok(spec) or SupervisorError |
All entry methods and setters return ChildSpecBuilder, which means construction is still in progress. Only build() consumes the builder and returns the final ChildSpec.
There is no build_validated(). Validation is always performed inside build().
ChildSpec::worker(...) also returns Result<ChildSpec, SupervisorError> via ChildSpecBuilder::worker(...).build().
Basic usage
use rust_supervisor::error::types::SupervisorError;
use rust_supervisor::id::types::ChildId;
use rust_supervisor::policy::task_role_defaults::TaskRole;
use rust_supervisor::spec::child::{ChildSpec, TaskKind};
use rust_supervisor::spec::child_builder::ChildSpecBuilder;
use rust_supervisor::task::factory::{TaskResult, service_fn};
use std::sync::Arc;

fn build_worker() -> Result<ChildSpec, SupervisorError> {
    // The factory must produce a fresh future for every start attempt.
    let factory = Arc::new(service_fn(|_ctx| async { TaskResult::Succeeded }));

    ChildSpecBuilder::worker(
        ChildId::new("invoice-worker"),
        "Invoice Worker",
        TaskKind::AsyncWorker,
        factory,
    )
    .task_role(TaskRole::Worker)
    .tag("invoice")
    .build() // validates the spec and returns Result<ChildSpec, SupervisorError>
}
Propagate errors with ?, or use build().expect("...") in tests.
Fluent setter coverage
Each setter consumes self and returns Self. You can chain them in any order that remains semantically valid.
Policy fields: isolation, restart_policy, shutdown_policy, health_policy, readiness_policy, backoff_policy
Topology and classification: dependencies, dependency, tags, tag, criticality, task_role, without_task_role, sidecar_config, without_sidecar_config, severity, without_severity, group, without_group
Config blocks: health_check, without_health_check, readiness, without_readiness, resource_limits, without_resource_limits, command_permissions, environment, env_var, secrets, secret, cleanup_paths, cleanup_path
Runtime: kind, factory, without_factory (for new() or supervisor paths)
Naming convention: plural fields use dependencies(...), tags(...); singular helpers use dependency(...), tag(...). The same pattern applies to environment / env_var, secrets / secret, and cleanup_paths / cleanup_path.
Common combinations
Service
Long-running services should prefer service(...); callers do not need to set TaskRole::Service by hand:
// id and factory are built as in the basic usage example.
ChildSpecBuilder::service(id, "API Service", TaskKind::AsyncWorker, factory)
    .tag("service")
    .build()?;
Job
Finite work should prefer job(...). You can still override restart_policy for one-shot behavior:
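// RestartPolicy is one of the shared enums defined in child.rs.
use rust_supervisor::spec::child::RestartPolicy;

// id and factory are built as in the basic usage example.
ChildSpecBuilder::job(id, "Nightly Export", TaskKind::AsyncWorker, factory)
    .restart_policy(RestartPolicy::Temporary)
    .build()?;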
Sidecar
Sidecars attached to a primary child should prefer sidecar(...). This entry writes sidecar_config and automatically adds the primary child dependency:
use rust_supervisor::policy::task_role_defaults::SidecarConfig;

// primary_id is the ChildId of the primary child this sidecar attaches to.
ChildSpecBuilder::sidecar(
    id,
    "Metrics Sidecar",
    TaskKind::AsyncWorker,
    factory,
    SidecarConfig::new(primary_id.clone(), false),
)
.build()?;
If you still configure task_role = Sidecar manually with setters, you must also set sidecar_config, or build() validation fails.
Worker from new()
// new() sets only id and name, so kind and factory must be supplied before build().
ChildSpecBuilder::new(ChildId::new("custom"), "custom")
    .kind(TaskKind::AsyncWorker)
    .factory(factory)
    .build()?;
Data flow (short)
ChildSpecBuilder::worker / service / job / sidecar / supervisor / new
|
v
fluent setters (policy, role, deps, env, ...)
|
v
build() --> ChildSpec::validate()
|
+-- Ok(ChildSpec) --> Supervisor::start / register topology
+-- Err(SupervisorError)
Example program
Runnable demo:
cargo run --example child_spec_builder
Source: examples/child_spec_builder.rs. Covers worker, service, job, sidecar, supervisor, the new() path, and an intentionally invalid sidecar combination.
Tests and regression
External tests: src/spec/tests/child_builder_test.rs
| Test | What it verifies |
|---|---|
| worker_builder_matches_child_spec_worker_defaults | Builder output matches ChildSpec::worker field-for-field |
| supervisor_builder_produces_valid_supervisor_child | Supervisor entry has no factory and validates |
| service_builder_sets_service_role | Service entry sets TaskRole::Service and Critical criticality |
| job_builder_sets_job_role_and_optional_criticality | Job entry sets TaskRole::Job and Optional criticality |
| sidecar_builder_sets_sidecar_role_binding_and_dependency | Sidecar entry sets the binding and automatically adds the primary child dependency |
| builder_setters_apply_expected_fields | Sidecar, dependency, tag, and related setters |
| build_rejects_invalid_sidecar_combination | Missing sidecar_config makes build() fail |
| new_builder_can_build_valid_worker_with_factory | new() path works after required fields are set |
Run:
cargo test --test child_builder_test
Known boundaries
- Default policy bundles for TryFrom<ChildDeclaration> are not fully shared with the builder yet. The two paths may evolve independently; review both when changing defaults.
- The builder does not handle serde. Dynamic child adds still flow through ChildDeclaration.
- Legacy examples and tests were not bulk-migrated from ChildSpec::worker. Both styles are runtime-equivalent when callers handle Result.
Further reading
- child-spec.md: how ChildDeclaration and ChildSpec relate, plus a short builder introduction
- docs/architecture.md: module boundaries and the no re-export rule
Policies
Supervision Strategy
SupervisionStrategy decides the restart scope after a failure. OneForOne selects only the failed child. OneForAll selects every child in the selected scope. RestForOne selects the failed child and every child declared after it in the selected scope.
restart_scope calculates the restart scope from SupervisorTree, the strategy, and the failed child identifier.
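The three strategies can be illustrated with a standalone function over child names in declaration order. This is a sketch under simplifying assumptions: the Strategy enum and this restart_scope signature are local to the example, and the real function works on SupervisorTree rather than a slice of names.

```rust
/// Local stand-in for the three strategies described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    OneForOne,
    OneForAll,
    RestForOne,
}

/// Returns the children to restart. `children` must be in declaration order.
/// An unknown `failed` child yields an empty scope.
pub fn restart_scope<'a>(strategy: Strategy, children: &[&'a str], failed: &str) -> Vec<&'a str> {
    let pos = match children.iter().position(|child| *child == failed) {
        Some(pos) => pos,
        None => return Vec::new(),
    };
    match strategy {
        // Only the failed child.
        Strategy::OneForOne => vec![children[pos]],
        // Every child in the scope.
        Strategy::OneForAll => children.to_vec(),
        // The failed child and everything declared after it.
        Strategy::RestForOne => children[pos..].to_vec(),
    }
}
```

For children declared as db, cache, api, a cache failure under RestForOne selects cache and api, while db keeps running.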
restart_execution_plan combines the supervisor strategy, GroupStrategy, ChildStrategyOverride, RestartLimit, EscalationPolicy, and DynamicSupervisorPolicy into a StrategyExecutionPlan. Child overrides take precedence over group strategies, and group strategies take precedence over the supervisor-wide strategy. The plan stores a dynamic_supervisor_enabled boolean; the full DynamicSupervisorPolicy struct (including child_limit) is evaluated by the control loop at add_child time, not embedded in the plan.
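The precedence rule (child override, then group strategy, then supervisor-wide strategy) reduces to a chain of optional values. The sketch below is illustrative only; effective_strategy and the local Strategy enum are hypothetical names, and the real plan carries more than the strategy.

```rust
/// Local stand-in for the supervision strategies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    OneForOne,
    OneForAll,
    RestForOne,
}

/// Picks the strategy that applies to one child:
/// a child override wins over a group strategy,
/// and a group strategy wins over the supervisor-wide strategy.
pub fn effective_strategy(
    child_override: Option<Strategy>,
    group_strategy: Option<Strategy>,
    supervisor_strategy: Strategy,
) -> Strategy {
    child_override
        .or(group_strategy)
        .unwrap_or(supervisor_strategy)
}
```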
The runtime control loop now receives child exits and applies the selected StrategyExecutionPlan automatically when policy returns a restart decision. Runtime lifecycle events include restart scope information so operators can see the selected strategy, group, and child scope.
Group Strategy And Overrides
GroupStrategy uses child tags to define a smaller restart scope. A child can belong to at most one configured strategy group. ChildStrategyOverride applies a per-child strategy and governance override when one child needs stricter restart behavior than its group or supervisor.
GroupConfig configures restart budget, membership, and isolation at the group level. GroupDependencyEdge defines cross-group dependency edges for fault propagation.
Restart Limit And Escalation
RestartLimit records the maximum restart count and the counting window selected for a plan. EscalationPolicy records the follow-up action when restart governance cannot remain local, including parent escalation, tree shutdown, or scope quarantine.
Dynamic Supervisor Policy
DynamicSupervisorPolicy controls runtime add_child acceptance. It is a stateless configuration object with enabled and child_limit fields. The control loop maintains the current child count externally and calls allows_addition(current_child_count) at add_child time. Additions are rejected when dynamic supervision is disabled or the configured child limit has already been reached.
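A stateless admission check of this kind might look like the following. The struct is a local model: the field names and the allows_addition method name come from the text above, but the Option<usize> type for child_limit (None meaning unlimited) is an assumption.

```rust
/// Local model of the policy. The real field types may differ.
#[derive(Debug, Clone, Copy)]
pub struct DynamicSupervisorPolicy {
    pub enabled: bool,
    /// Assumed shape: None means no limit is configured.
    pub child_limit: Option<usize>,
}

impl DynamicSupervisorPolicy {
    /// The caller tracks the current child count; the policy keeps no state.
    pub fn allows_addition(&self, current_child_count: usize) -> bool {
        // Rejected when disabled, or when the limit has already been reached.
        self.enabled
            && self
                .child_limit
                .map_or(true, |limit| current_child_count < limit)
    }
}
```

Keeping the count outside the policy means the same policy value can be shared or reloaded from configuration without resetting any counters.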
Restart Policy
RestartPolicy contains Permanent, Transient, and Temporary. PolicyEngine reads TaskExit, the failure category, and the restart policy, then returns RestartDecision.
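The text does not spell out what each policy decides, so the sketch below assumes the conventional OTP-style meaning: Permanent always restarts, Transient restarts only after a failure, and Temporary never restarts. Treating cancellation as "do not restart" is a further assumption. ExitKind, Decision, and decide are hypothetical names, not the crate's TaskExit, RestartDecision, or PolicyEngine.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestartPolicy {
    Permanent,
    Transient,
    Temporary,
}

/// Simplified exit classification for the example.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExitKind {
    Succeeded,
    Cancelled,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Restart,
    DoNotRestart,
}

/// Assumed OTP-style semantics; the real engine also reads the failure category.
pub fn decide(policy: RestartPolicy, exit: ExitKind) -> Decision {
    match (policy, exit) {
        // Temporary children are never restarted.
        (RestartPolicy::Temporary, _) => Decision::DoNotRestart,
        // Assumption: a cancelled attempt was stopped on purpose.
        (_, ExitKind::Cancelled) => Decision::DoNotRestart,
        // Permanent children restart after success and after failure.
        (RestartPolicy::Permanent, _) => Decision::Restart,
        // Transient children restart only after a failure.
        (RestartPolicy::Transient, ExitKind::Failed) => Decision::Restart,
        (RestartPolicy::Transient, ExitKind::Succeeded) => Decision::DoNotRestart,
    }
}
```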
Backoff And Jitter
BackoffPolicy describes initial delay, maximum delay, and jitter ratio. There are two types with this name:
- spec::child::BackoffPolicy: used in ChildSpec; fields are initial_delay, max_delay, jitter_ratio (a 0.0-1.0 float ratio).
- policy::backoff::BackoffPolicy: used by the runtime policy engine; fields are initial, max, jitter_mode (an enum: Disabled, Deterministic, FullJitter, DecorrelatedJitter), jitter_percent, and reset_after.
Tests can use JitterMode::Deterministic so coverage does not depend on random output.
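To show why a deterministic jitter source helps testing, here is a minimal delay calculation. It is a sketch under stated assumptions: the Backoff struct is a local stand-in, the doubling growth curve is assumed (the text only names initial delay, maximum delay, and jitter ratio), and the unit parameter replaces a random sample so the caller controls it.

```rust
use std::time::Duration;

/// Local stand-in; field names follow the spec-level type described above.
pub struct Backoff {
    pub initial_delay: Duration,
    pub max_delay: Duration,
    /// Fraction of the base delay used as jitter spread; assumed finite.
    pub jitter_ratio: f64,
}

impl Backoff {
    /// Delay before restart number `attempt` (0-based).
    /// `unit` is a sample in 0.0..=1.0; a fixed value keeps tests deterministic.
    pub fn delay(&self, attempt: u32, unit: f64) -> Duration {
        // Assumption: the base delay doubles per attempt, capped at max_delay.
        let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
        let base = self
            .initial_delay
            .checked_mul(factor)
            .unwrap_or(self.max_delay)
            .min(self.max_delay);
        let spread = base.mul_f64(self.jitter_ratio.clamp(0.0, 1.0));
        // Map unit 0.0..=1.0 onto roughly base - spread ..= base + spread.
        let jittered = base + spread.mul_f64(2.0 * unit.clamp(0.0, 1.0)) - spread;
        jittered.min(self.max_delay)
    }
}
```

With jitter_ratio set to 0.0 the result is exactly the capped exponential base, which is the property a deterministic test mode relies on.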
Fuse And Quarantine
MeltdownPolicy limits restarts or failures inside configured windows at three levels: child, group, and supervisor. Crossing a child-level fuse places the child in quarantine. Crossing a group-level fuse escalates to the supervisor level. Crossing a supervisor-level fuse escalates the failure to the parent.
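A single fuse level can be modeled as a sliding-window counter. The Fuse type below is hypothetical and covers one level only; the assumption that the fuse trips when the count inside the window exceeds the limit is mine, and the real MeltdownPolicy may count differently.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical single-level fuse: at most `limit` failures per `window`.
pub struct Fuse {
    limit: usize,
    window: Duration,
    failures: VecDeque<Instant>,
}

impl Fuse {
    pub fn new(limit: usize, window: Duration) -> Self {
        Fuse { limit, window, failures: VecDeque::new() }
    }

    /// Records a failure observed at `now`.
    /// Returns true when the fuse trips (more than `limit` failures in the window).
    pub fn record_failure(&mut self, now: Instant) -> bool {
        // Drop failures that have fallen out of the window.
        while let Some(&oldest) = self.failures.front() {
            if now.saturating_duration_since(oldest) > self.window {
                self.failures.pop_front();
            } else {
                break;
            }
        }
        self.failures.push_back(now);
        self.failures.len() > self.limit
    }
}
```

Passing the timestamp in, instead of reading the clock inside the method, keeps the window logic testable without sleeping.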
Task Exit Classification
TaskExit distinguishes success, cancellation, typed failure, panic, and timeout. The policy layer reads typed classifications from TaskFailureKind (which includes Panic, Timeout, and typed failure categories) instead of inferring behavior from strings.
Runtime Control
Control Entry Point
SupervisorHandle is the runtime control entry point. It sends requests to the runtime control loop through a command channel and returns CommandResult.
Control Commands
These are ControlCommand enum variants sent through the command channel:
- add_child: accept a dynamic child manifest when DynamicSupervisorPolicy allows another child.
- remove_child: mark the target child runtime state record as Removed, deliver cancellation to the active attempt, and remove the runtime state record after the attempt exits.
- restart_child: request a restart for the target child.
- pause_child: mark the target child runtime state record as Paused, deliver cancellation to the active attempt, and pause automatic restarts.
- resume_child: resume governance for the target child.
- quarantine_child: mark the target child runtime state record as Quarantined, deliver cancellation to the active attempt, and block automatic restarts.
- shutdown_tree: shut down the whole supervisor tree.
- current_state: return the current SupervisorState and expose each child runtime fact through CurrentState.child_runtime_records.
Handle Methods
These are methods on SupervisorHandle that do not go through ControlCommand:
- subscribe_events: subscribe to lifecycle events via a broadcast::Receiver.
- is_alive: quickly check whether the runtime control loop can still accept ordinary control commands.
- health: return RuntimeHealthReport, including alive, control-plane state, started_at_unix_nanos, last_observed_at_unix_nanos, failure, and exit_report.
- join: wait until the runtime control plane reaches a final state and repeatedly return the same RuntimeExitReport.
- shutdown: shut down only the runtime control plane. It does not replace shutdown_tree.
Child Runtime State Control
PauseChild, RemoveChild, and QuarantineChild are stop-style control commands defined by this feature. All 3 commands return CommandResult::ChildControl, and the result contains ChildControlResult. The old CommandResult::ChildState shape is no longer part of the public result model.
PauseChild writes ChildRuntimeState.operation as Paused. If an active attempt exists, the runtime control loop delivers cancellation to that attempt and moves stop progress to CancelDelivered. While the child is paused, the supervision strategy does not automatically restart that child.
RemoveChild writes ChildRuntimeState.operation as Removed. If an active attempt exists, the runtime control loop first delivers cancellation and then physically removes the record from child_runtime_states after the attempt exits. If no active attempt exists, the runtime control loop returns a NoActiveAttempt result and then removes the runtime state record.
QuarantineChild writes ChildRuntimeState.operation as Quarantined. If an active attempt exists, the runtime control loop delivers cancellation. The quarantined runtime state record remains visible, but the supervision strategy no longer automatically restarts that child. An operator can still run RemoveChild later.
These 3 stop-style control commands do not synchronously wait for the child future to end. If a child ignores cancellation for too long, a later CurrentState call or repeated stop-style command triggers reconcile_stop_deadlines and exposes the stop failure through ChildControlFailure.
CurrentState returns child_runtime_records. Each ChildRuntimeRecord is ordered by declaration order. Construction performs only non-blocking reads, does not wait for a child future, and does not perform extra I/O. This collection is the main entry point for reading runtime state facts.
RestartChild and ResumeChild remain existing commands. This feature only requires them not to corrupt runtime state facts. It does not define new lifecycle semantics for them.
See the full contract in child-runtime-state-control.md.
ChildControlResult Fields
- child_id: stable identifier of the controlled child.
- attempt: active attempt targeted by the command. It is None when no active attempt exists.
- generation: generation targeted by the command. It is None when no active attempt exists.
- operation_before: ChildControlOperation observed when the command arrived.
- operation_after: ChildControlOperation after command handling.
- status: current ChildAttemptStatus for the attempt. It is None when no active attempt exists.
- cancel_delivered: whether this command actually delivered cancellation.
- stop_state: ChildStopState after command handling.
- restart_limit: current RestartLimitState, including window, limit, used count, remaining count, and exhaustion flag.
- liveness: current ChildLivenessState, including last heartbeat time, heartbeat stale flag, and readiness.
- idempotent: whether this command reused an already existing target state.
- failure: current control failure. It is None when no failure exists.
- generation_fence: optional GenerationFenceOutcome used by restart control commands.
- admission_conflict: optional AdmissionConflict detail when a concurrent request is rejected.
ChildRuntimeRecord Fields
- child_id: stable identifier of the child represented by this runtime state record.
- path: child path in the supervisor tree.
- generation: current active generation. It is None when no active attempt exists.
- attempt: current active attempt. It is None when no active attempt exists.
- status: current ChildAttemptStatus for the attempt.
- operation: current ChildControlOperation, which can be Active, Paused, Quarantined, or Removed.
- liveness: current ChildLivenessState.
- restart_limit: current RestartLimitState.
- stop_state: current ChildStopState.
- failure: most recent ChildControlFailure. When stop_state is Failed, this must be Some.
- generation_fence_phase: current GenerationFencePhase for dashboard projection.
- pending_restart: optional PendingRestartSummary for queued restarts behind a generation fence.
Idempotent Behavior
Repeated control commands should not create unrecoverable errors. Pausing an already paused child returns the current state. Quarantining an already quarantined child returns the current state. Shutting down an already completed tree returns the existing shutdown result.
join caches the final RuntimeExitReport from the control loop. Repeated calls to join on the same handle return the same result every time and do not consume the underlying exit receiver again.
shutdown only asks the runtime control loop to exit normally. If the control plane has already completed or failed, another shutdown call directly returns the existing final report. shutdown_tree remains responsible for child task and full supervisor tree shutdown semantics.
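The idempotent pause behavior can be sketched as a state transition that reports whether it changed anything. The types here are hypothetical and trimmed to the Active and Paused operations; PauseOutcome borrows a few field names from ChildControlResult but is not the real struct. The sketch also assumes that a repeated pause does not deliver cancellation a second time.

```rust
/// Trimmed local model: only the two operations this example needs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operation {
    Active,
    Paused,
}

/// Hypothetical subset of a control result.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PauseOutcome {
    pub operation_before: Operation,
    pub operation_after: Operation,
    pub cancel_delivered: bool,
    pub idempotent: bool,
}

/// Pauses a child. Pausing an already paused child changes nothing
/// and reports `idempotent = true` instead of returning an error.
pub fn pause_child(operation: &mut Operation, has_active_attempt: bool) -> PauseOutcome {
    let before = *operation;
    if before == Operation::Paused {
        return PauseOutcome {
            operation_before: before,
            operation_after: before,
            cancel_delivered: false,
            idempotent: true,
        };
    }
    *operation = Operation::Paused;
    PauseOutcome {
        operation_before: before,
        operation_after: Operation::Paused,
        // Cancellation is delivered only when an attempt is actually running.
        cancel_delivered: has_active_attempt,
        idempotent: false,
    }
}
```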
Runtime Health
is_alive is a low-cost state check. It returns true when the control plane is alive. It returns false when the control plane is starting, shutting down, completed, or failed.
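The mapping from control-plane state to is_alive is small enough to write out. ControlPlaneState is a hypothetical enum built from the states listed above; the real state type may use different names.

```rust
/// Hypothetical control-plane states, taken from the prose above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ControlPlaneState {
    Starting,
    Alive,
    ShuttingDown,
    Completed,
    Failed,
}

/// Only the Alive state accepts ordinary control commands.
pub fn is_alive(state: ControlPlaneState) -> bool {
    matches!(state, ControlPlaneState::Alive)
}
```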
health returns structured state. After an abnormal control-plane exit, health can still read: alive, state, started_at_unix_nanos, last_observed_at_unix_nanos, failure (with phase, reason, panic flag, recoverable flag), and exit_report. Ordinary control commands after the control plane has ended return SupervisorError with the same exit reason.
Dynamic Additions
Dynamic additions are governed before the manifest is accepted. The runtime rejects add_child when dynamic supervision is disabled or when the declared child count plus dynamic child count has reached the configured limit. current_state.child_count includes accepted dynamic manifests.
Audit Data
Each control command carries CommandMeta with command_id, requested_by, and reason. These fields must be non-empty text. SupervisorHandle rejects empty values before the command enters the channel, and the runtime control loop validates them again before execution. This preserves traceable audit sources for manual operations, dashboard IPC forwarding, and internal control calls.
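The non-empty rule can be checked with one small validation function that both the handle and the control loop could call. This is a local model: the three field names come from the text, while the String error type, the validate method, and treating whitespace-only text as empty are assumptions.

```rust
/// Local model of the audit metadata carried by each command.
#[derive(Debug, Clone)]
pub struct CommandMeta {
    pub command_id: String,
    pub requested_by: String,
    pub reason: String,
}

impl CommandMeta {
    /// Rejects any empty field. Whitespace-only text counts as empty (assumption).
    pub fn validate(&self) -> Result<(), String> {
        let fields = [
            ("command_id", &self.command_id),
            ("requested_by", &self.requested_by),
            ("reason", &self.reason),
        ];
        for &(name, value) in fields.iter() {
            if value.trim().is_empty() {
                return Err(format!("{} must not be empty", name));
            }
        }
        Ok(())
    }
}
```

Running the same check at both boundaries means a command that bypasses the handle still cannot reach execution without an audit source.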
The event payload CommandAudit additionally records target_path and accepted_at_unix_nanos for audit events and incident review.
Dashboard Three-End Workflow
The dashboard feature is delivered by three repositories. rust-supervisor owns only target-process local IPC and shared contracts. ~/rust-supervisor-relay owns the relay and external wss:// sessions. ~/rust-supervisor-ui owns the browser dashboard client.
The screenshot below shows the dashboard client view for target lists, topology, state, and runtime streams.

Three-End Responsibilities
- rust-supervisor: The target process reads SupervisorConfig, opens a Unix domain socket when dashboard.enabled=true, and produces state, event records, log records, command results, and registration heartbeats.
- rust-supervisor-relay: The relay listens on the registration socket, stores the target registry, exposes external wss:// dashboard sessions, validates mTLS and allowed IPC path prefixes, and forwards session commands to the target process.
- rust-supervisor-ui: The dashboard client connects to the relay through wss:// and displays the target list, topology, state, event stream, log tail, and command audit.
Generation fencing projections
IPC-visible child control envelopes include nullable generation_fence payloads on DashboardChildControlResult. DashboardChildRuntimeRecord mirrors pending_restart summaries whenever the runtime pins an older attempt while a restart waits behind the fence.
Local Demo Flow
- Start the relay first. It must listen on the registration socket before the target process can register itself.
cd ~/rust-supervisor-relay
cargo run -- --config examples/config/dashboard-relay.local.yaml
- Start the target process next. It opens the local IPC socket and sends registration heartbeats to the relay.
cd ~/rust-supervisor
cargo run --example demo -- --config examples/config/supervisor.local.yaml
- Start the dashboard client last. Browser code connects only to the relay and never reads the target-process local IPC socket directly.
cd ~/rust-supervisor-ui
VITE_SUPERVISOR_RELAY_URL=wss://localhost:9443/supervisor npm run dev
Runtime Order
After receiving a registration heartbeat, the relay only stores the target process in the target registry. Registration does not trigger proactive event or log push. After the dashboard client establishes an authenticated dashboard session and selects a target, the relay connects to the target-process IPC socket, reads state, and subscribes to events.subscribe or logs.tail only when the session requests those streams.
Control commands must start from the dashboard client, pass relay session validation, and then reach the target process. Each command must carry operator identity, target identity, and reason. Dangerous commands must also be confirmed in the client.
Verification Commands
cd ~/rust-supervisor
cargo test --test dashboard_config_test --test dashboard_protocol_shape_test --test dashboard_state_test --test dashboard_stream_test --test dashboard_performance_test
cargo test --manifest-path ~/rust-supervisor-relay/Cargo.toml
npm --prefix ~/rust-supervisor-ui run test
npm --prefix ~/rust-supervisor-ui run build
npm --prefix ~/rust-supervisor-ui run test:e2e:three-end
Production Notes
The target process may expose only a local Unix domain socket and must not expose IPC directly to the network. The relay must use wss:// for external access. The browser or operating-system certificate store selects the mTLS client certificate, and page scripts must not read the certificate private key. dashboard.path, registration.relay_registration_path, and the relay allowed IPC path prefix must match, otherwise the target will fail to register or the relay will reject the connection.
Shutdown
Formal Term
This project uses Shutdown Without Orphaned Tasks to describe the shutdown goal. After root shutdown completes, the runtime should leave no orphaned task.
Four Stages
The shutdown protocol has four stages:
- Request stop: accept the shutdown cause and propagate the cancellation token.
- Graceful drain: wait for each child to finish on its own.
- Abort stragglers: force or escalate asynchronous tasks that exceed their timeout.
- Reconcile: align registry state, current state, metrics, and the event journal.
Order
Startup runs in declaration order. Shutdown runs in reverse declaration order. startup_order and shutdown_order expose this rule.
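The ordering rule is simple to express over a list of child names. The two functions below reuse the names startup_order and shutdown_order from the text, but their signatures are assumptions made for this example.

```rust
/// Startup follows declaration order unchanged.
pub fn startup_order<'a>(declared: &[&'a str]) -> Vec<&'a str> {
    declared.to_vec()
}

/// Shutdown walks the same list in reverse, so dependents stop first.
pub fn shutdown_order<'a>(declared: &[&'a str]) -> Vec<&'a str> {
    declared.iter().rev().copied().collect()
}
```

For children declared as db, cache, api, shutdown visits api, cache, db.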
Blocking Worker Boundary
BlockingWorker represents spawn_blocking work or other work that cannot be assumed to abort immediately. After shutdown timeout, the runtime should record the non-immediate termination boundary and follow the escalation policy.
Shutdown Cause
ShutdownCause records requested_by and reason. The cause should appear in audit and diagnostic output.
Completion Result
shutdown_tree returns ShutdownResult. After the pipeline finishes, ShutdownResult.report contains ShutdownPipelineReport with per-child outcomes, a reconcile report, and dashboard socket status. When the core runtime does not own the dashboard IPC socket, the report records socket status as NotOwned.
Observability
Event Model
SupervisorEvent describes one lifecycle fact. It contains When, Where, What, sequence, and correlation identifier.
When records wall-clock time, monotonic time, uptime, generation, and attempt. Where records supervisor path, child identifier, parent identifier, and task name. What records state transition, policy decision, health state, exit reason, or control command.
Pipeline Outputs
The observability pipeline publishes the same lifecycle fact as these signals:
- SupervisorEvent.
- Structured log.
- Tracing span and tracing event.
- Metrics.
- Audit event.
- Event journal entry.
- Test recorder entry.
Metric Labels
Metric labels must stay low-cardinality. Acceptable labels include supervisor path, child identifier, state, decision, and failure category. Full error text, user input, and unbounded dynamic values should not become labels.
Real Shutdown Pipeline
After ShutdownTree runs the real shutdown pipeline, the observability pipeline must surface lifecycle facts for each stage. ChildShutdownCancelDelivered means the runtime delivered CancellationToken to the in-flight child attempt. ChildShutdownGraceful means the child task returned inside the graceful drain budget. ChildShutdownAborted means the runtime requested abort for a stuck task. ChildShutdownLateReport means the child task returned after the normal shutdown accounting window. ShutdownCompleted means the pipeline emitted the final reconcile report.
Metrics record shutdown facts with low-cardinality labels. supervisor_shutdown_duration_seconds measures full pipeline duration. supervisor_shutdown_child_outcomes_total counts outcomes by status and phase and must not place child_id on metric labels. supervisor_shutdown_abort_total counts abort paths by bounded reason. supervisor_shutdown_late_reports_total counts late reports by phase.
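One way to keep the child outcome counter low-cardinality is to key it only by closed enums. The sketch below is an illustration under assumptions: `StatusLabel`, `PhaseLabel`, and `ChildOutcomeCounter` are hypothetical names, and the variant sets are guesses rather than the crate's real label values.

```rust
use std::collections::HashMap;

/// Hypothetical bounded status label.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum StatusLabel {
    Graceful,
    Aborted,
    LateReport,
}

/// Hypothetical bounded phase label.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum PhaseLabel {
    RequestStop,
    GracefulDrain,
    AbortStragglers,
    Reconcile,
}

/// Counter keyed by (status, phase) only, so at most 3 * 4 series can exist.
#[derive(Debug, Default)]
pub struct ChildOutcomeCounter {
    counts: HashMap<(StatusLabel, PhaseLabel), u64>,
}

impl ChildOutcomeCounter {
    /// `_child_id` is available to the caller for logs or audit records,
    /// but it is deliberately never used as part of the metric key.
    pub fn record(&mut self, _child_id: &str, status: StatusLabel, phase: PhaseLabel) {
        *self.counts.entry((status, phase)).or_insert(0) += 1;
    }

    pub fn get(&self, status: StatusLabel, phase: PhaseLabel) -> u64 {
        self.counts.get(&(status, phase)).copied().unwrap_or(0)
    }

    /// Number of distinct label combinations seen so far.
    pub fn series_count(&self) -> usize {
        self.counts.len()
    }
}
```

Because both labels are enums, the number of series is bounded at compile time no matter how many children the tree holds.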
Audit events record cancel delivered, graceful outcome, abort outcome, late report, and completed reconcile. When the core runtime does not own the dashboard IPC socket, the reconcile report records socket status as NotOwned.
Diagnostic Replay
The event journal stores a fixed number of recent events. RunSummary is built from the event journal, current state, and policy decisions so operators can explain meltdown, shutdown timeout, or parent escalation.
Examples
Quick Start
cargo run --example supervisor_quickstart
supervisor_quickstart reads examples/config/supervisor.yaml, derives SupervisorSpec, starts a supervisor, queries current state, and shuts down the tree.
Configuration Tree
cargo run --example config_tree_supervisor
config_tree_supervisor shows the rust-config-tree 0.3.0 YAML loading path and prints the derived SupervisorSpec.
Split Configuration
cargo run --example split_config_supervisor
split_config_supervisor loads examples/config/split/supervisor.yaml, which includes body-only groups.yaml and children.yaml. See Split Configuration and Transparent Array Sections.
Restart Policy Lab
cargo run --example restart_policy_lab
restart_policy_lab shows the basic shapes of TaskFailure, TaskFailureKind, RestartPolicy, the canonical spec::supervisor::SupervisionStrategy, and RestartDecision.
Shutdown Tree
cargo run --example shutdown_tree
The shutdown_tree example demonstrates the four stages (request stop, graceful drain, abort stragglers, and reconcile) and then calls shutdown_tree.
Observability Probe
cargo run --example observability_probe
observability_probe subscribes to events, queries current state, prints one event, and shuts down. It checks the observability integration path.
Supervisor Tree Story
cargo run --example supervisor_tree_story
supervisor_tree_story declares market feed, risk engine, and audit sink children. It shows dependencies, tags, criticality, explicit readiness, startup order, shutdown order, and RestForOne restart scope.
Runtime Control Story
cargo run --example runtime_control_story
runtime_control_story starts a real supervisor and runs add_child, pause_child, resume_child, quarantine_child, current_state, subscribe_events, and shutdown_tree. It combines operator control with audit events.
Policy Failure Matrix
cargo run --example policy_failure_matrix
policy_failure_matrix feeds success, external dependency failure, fatal bug failure, and panic into Permanent, Transient, and Temporary restart policies. It also shows deterministic jitter and meltdown tracking.
Diagnostic Replay
cargo run --example diagnostic_replay
diagnostic_replay builds deterministic events, writes them into the event journal, replays failure, backoff, and restart facts, then generates metric samples and RunSummary.
Operations Runbook
Note: Each procedure lists expected metric values at key steps. If an observed value differs, follow the escalation path or refer to the linked section.
P1-001: Supervisor Process Crash
Symptoms: Supervisor process exits unexpectedly; children become orphaned.
| Step | Action | Expected Metrics | Estimated Duration |
|---|---|---|---|
| 1 | Check process status: pgrep -x supervisor | exit code == 0 (process running) or exit code == 1 (not running) | 1min |
| 2 | If not running, check last log lines: journalctl -u supervisor -n 50 | Log ends with ShutdownPhase::Completed (planned) or Panic (unexpected) | 2min |
| 3 | If unexpected crash: collect core dump and backtrace | Core dump file present in /tmp/ | 2min |
| 4 | Restart supervisor: cargo run --release --example supervisor_quickstart | health.status == "ready" within 30s | 5min |
| 5 | Verify children reconnected: check dashboard IPC | dashboard_link == "connected" | 2min |
Escalation: If restart fails twice, escalate to L2 engineering with core dump and logs. Total estimated duration: 12min (within 15min SLA).
P1-002: Child Task Crash Loop
Symptoms: A child task repeatedly fails and restarts; current_state shows elevated restart counts.
| Step | Action | Expected Metrics | Estimated Duration |
|---|---|---|---|
| 1 | Query current state: handle.current_state() | children.failed > 0 and child_runtime_records[].restart_count > threshold | 1min |
| 2 | Check child exit reason in journal | TaskExit::Panicked or TaskExit::Failed with reason string | 2min |
| 3 | If restart budget exhausted: restart_budget.tokens == 0 | Budget exhausted; child quarantined automatically | 1min |
| 4 | Remove or replace the faulty child spec: handle.remove_child() | CommandResult::Accepted | 2min |
| 5 | Verify no lingering slot: check current_state() | children.running == target_count | 2min |
Escalation: If the child's root cause is not identified within 10min, file a bug with the exit reason and journal snippet. Total estimated duration: 8min.
P1-003: Dashboard IPC Disconnected
Symptoms: health.dashboard_link == "disconnected"; dashboard UI shows no data.
| Step | Action | Expected Metrics | Estimated Duration |
|---|---|---|---|
| 1 | Check IPC socket path existence: ls -la /tmp/supervisor.sock | Socket file present with correct permissions | 1min |
| 2 | Check relay process: pgrep -x relay | Process running | 1min |
| 3 | Restart relay: kill -TERM <relay_pid> and wait for auto-restart | Supervisor auto-restarts relay; dashboard_link == "connected" within 10s | 3min |
| 4 | If still disconnected, restart the supervisor process (dashboard IPC lifecycle is tied to the supervisor process) | health.dashboard_link == "connected" | 2min |
Escalation: If the error indicates IPC socket path contention (the error contains field_path="dashboard.path"), check the socket path configuration in the deployment guide. Total estimated duration: 7min.
P1-004: Runtime Starvation
Symptoms: Control loop iterations stall; health.uptime_secs advances but events are not processed.
| Step | Action | Expected Metrics | Estimated Duration |
|---|---|---|---|
| 1 | Check Tokio runtime metrics: handle.health().control_loop_iterations | iterations_per_sec > 0 | 1min |
| 2 | If stalled, check for blocking tasks: review child task list | No child in BlockForever or IgnoreCancel state | 2min |
| 3 | Quarantine suspicious children: handle.quarantine_child() | Child marked as Quarantined | 2min |
| 4 | Verify recovery: health.control_loop_iterations increases | iterations_per_sec > 0 after 5s | 3min |
Escalation: If starvation persists after quarantining all non-critical children, escalate to L2 with runtime metrics snapshot. Total estimated duration: 8min.
Frequently Asked Questions (FAQ)
Basics
What is the difference between ChildDeclaration and ChildSpec?
ChildDeclaration is the input model used in YAML configuration and add_child RPC payloads. It focuses on serializable, validatable declarations. ChildSpec is the runtime model used by the supervisor to register, start, and restart children. It carries resolved ChildId, Arc<dyn TaskFactory>, and materialized policy objects.
See ChildSpec and ChildDeclaration for details.
Which entry methods can start a Supervisor?
Supervisor provides 3 entry methods:
| Method | Input | When to use |
|---|---|---|
| Supervisor::start(spec) | SupervisorSpec (pre-built spec) | Programmatic startup |
| Supervisor::start_from_config_state(state) | ConfigState (validated config) | Start from config loader output |
| Supervisor::start_from_config_file(path) | YAML file path | Start directly from YAML file |
All three converge into start_with_policy(), which validates, creates channels, spawns the control loop, and returns a SupervisorHandle.
What does “Shutdown Without Orphaned Tasks” mean?
This is the core shutdown goal of the project. After the root supervisor completes shutdown, no orphan tasks may remain in the runtime. This is achieved through the four-stage shutdown protocol (request stop -> graceful drain -> abort stragglers -> reconcile) and by shutting down children in reverse declaration order, ensuring every child is properly terminated.
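The protocol can be modeled without a runtime. The sketch below is a simplified, synchronous illustration and not the crate's implementation: `Child`, `honors_cancel`, `Report`, and `shutdown_in_reverse` are hypothetical names, child names are assumed to be unique, and a boolean stands in for "the task returned inside the graceful drain budget".

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Graceful,
    Aborted,
}

/// Hypothetical child: `honors_cancel` stands in for finishing within the drain budget.
pub struct Child {
    pub name: &'static str,
    pub honors_cancel: bool,
}

#[derive(Debug)]
pub struct Report {
    pub outcomes: Vec<(&'static str, Outcome)>,
    /// Tasks still alive after reconcile; the shutdown goal is zero.
    pub orphaned: usize,
}

pub fn shutdown_in_reverse(children: &[Child]) -> Report {
    // Track every declared child so reconcile can detect anything left behind.
    let mut live: BTreeSet<&'static str> = children.iter().map(|c| c.name).collect();
    let mut outcomes = Vec::with_capacity(children.len());

    // Stage 1, request stop: visit children in reverse declaration order.
    let requested: Vec<&Child> = children.iter().rev().collect();

    // Stage 2, graceful drain: cooperative children exit on their own.
    let mut stragglers = Vec::new();
    for child in requested {
        if child.honors_cancel {
            live.remove(child.name);
            outcomes.push((child.name, Outcome::Graceful));
        } else {
            stragglers.push(child);
        }
    }

    // Stage 3, abort stragglers: force-terminate whatever is still running.
    for child in stragglers {
        live.remove(child.name);
        outcomes.push((child.name, Outcome::Aborted));
    }

    // Stage 4, reconcile: anything still in `live` would be an orphan.
    Report { outcomes, orphaned: live.len() }
}
```

The `orphaned` count makes the goal checkable: a shutdown that leaves it non-zero has not met the Shutdown Without Orphaned Tasks requirement.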
Configuration
What child fields does the YAML children entry support?
children is a YAML array backed by ChildrenConfigSection in Rust. Access items with .as_slice(). Each declaration supports these fields:
| Category | Field | Description |
|---|---|---|
| Identity | name | Child name, required, non-empty |
| Kind | kind | async_worker, blocking_worker, or supervisor |
| Criticality | criticality | critical or optional |
| Restart policy | restart_policy | permanent, transient, or temporary |
| Dependencies | dependencies | Names of the children this child depends on |
| Health check | health_check | Health check interval, timeout, etc. |
| Readiness | readiness | Explicit readiness check config |
| Resource limits | resource_limits | CPU, memory and other resource constraints |
| Command permissions | command_permissions | Commands this child is allowed to execute |
| Environment | environment | Key-value environment variable list |
| Secrets | secrets | ${SECRET_NAME}-format secret references |
| Tags | tags | Low-cardinality grouping tags |
| Task role | task_role | service, worker, job, sidecar, supervisor |
See Configuration for a complete config sample.
How do I split groups and children into separate YAML files?
Add include in the root config and write body-only split files:
# supervisor.yaml (root config)
include:
  - groups.yaml
  - children.yaml
# children.yaml
- name: worker
  kind: async_worker
See Split Configuration and Transparent Array Sections. Run cargo run --example split_config_supervisor.
What happens when children is omitted from a config file?
Runtime loading yields an empty list []. Template sample entries such as worker are not injected at runtime. Only generate-template writes sample entries.
What configurations cause rejection at startup?
Configuration loading returns SupervisorError::FatalConfig when startup must be rejected. Rejection reasons include:
- The file is not YAML format or cannot be read
- Supervision strategy is not OneForOne, OneForAll, or RestForOne
- Numeric values are zero or out of valid range
- Initial backoff is greater than max backoff
- Jitter ratio is not between 0.0 and 1.0
- Restart budget, failure window, or meltdown config is invalid
- Child declaration has circular dependencies
- Child ID or name is empty
- Sidecar task role is missing sidecar_config
- Dashboard IPC path is not absolute
See Configuration for the full rejection list.
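A few of these rules can be expressed as a fail-fast validator. This is a minimal sketch under assumptions: the `FatalConfig` struct here only stands in for `SupervisorError::FatalConfig`, and `PolicySection`, its field names, and the `field_path` strings are hypothetical rather than taken from the real schema.

```rust
/// Hypothetical stand-in for the data carried by SupervisorError::FatalConfig.
#[derive(Debug, Clone, PartialEq)]
pub struct FatalConfig {
    pub field_path: String,
    pub hint: String,
}

/// Hypothetical slice of the policy configuration.
pub struct PolicySection {
    pub strategy: String,
    pub initial_backoff_ms: u64,
    pub max_backoff_ms: u64,
    pub jitter_ratio: f64,
}

fn fatal(field_path: &str, hint: &str) -> FatalConfig {
    FatalConfig {
        field_path: field_path.to_string(),
        hint: hint.to_string(),
    }
}

/// Returns the first violated rule, so startup is rejected before any task runs.
pub fn validate_policy(p: &PolicySection) -> Result<(), FatalConfig> {
    if !matches!(p.strategy.as_str(), "OneForOne" | "OneForAll" | "RestForOne") {
        return Err(fatal("/policy/strategy", "expected OneForOne, OneForAll, or RestForOne"));
    }
    if p.initial_backoff_ms == 0 {
        return Err(fatal("/policy/initial_backoff_ms", "must be greater than zero"));
    }
    if p.max_backoff_ms == 0 {
        return Err(fatal("/policy/max_backoff_ms", "must be greater than zero"));
    }
    if p.initial_backoff_ms > p.max_backoff_ms {
        return Err(fatal("/policy/initial_backoff_ms", "must not exceed max_backoff_ms"));
    }
    // A NaN ratio also fails this range check and is rejected.
    if !(0.0..=1.0).contains(&p.jitter_ratio) {
        return Err(fatal("/policy/jitter_ratio", "must be between 0.0 and 1.0"));
    }
    Ok(())
}
```

Returning the offending path together with a hint lets an operator fix the YAML without reading the loader source.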
Runtime Control
What is the five-step add_child transaction?
add_child chains five steps into a single transaction:
- Parse: Deserialize the RPC payload into a ChildDeclaration
- Validate: Run validate_child_declaration, checking name format, dependency name existence, secret placeholder syntax, etc.
- Register: Update topology, insert the new child into the registry, and run cycle detection
- Launch: Create and start the child future via TaskFactory
- Audit Persist: Write audit records including the declaration SHA-256 hash
If any step fails, the entire transaction rolls back to the pre-call topology view, or writes a compensating record for post-recovery handling.
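The rollback behavior can be sketched with a snapshot taken before the first step. This is an illustration only: `Topology`, `Faults`, `AddChildError`, and the `name=<child>` payload format are hypothetical, the launch and audit steps are simulated by flags, and the compensating-record path is not modeled.

```rust
/// Hypothetical topology view: just the registered child names.
#[derive(Debug, Clone, PartialEq)]
pub struct Topology {
    pub children: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum AddChildError {
    Parse,
    Validate,
    Register,
    Launch,
    AuditStorageFailure,
}

/// Switches that let the example force a late-stage failure.
pub struct Faults {
    pub launch_fails: bool,
    pub audit_fails: bool,
}

/// Runs the five steps; on any error the topology is restored to its pre-call view.
pub fn add_child(
    topology: &mut Topology,
    payload: &str,
    faults: &Faults,
) -> Result<(), AddChildError> {
    let snapshot = topology.clone();
    let result = run_steps(topology, payload, faults);
    if result.is_err() {
        *topology = snapshot;
    }
    result
}

fn run_steps(
    topology: &mut Topology,
    payload: &str,
    faults: &Faults,
) -> Result<(), AddChildError> {
    // 1. Parse: this sketch accepts a payload of the form "name=<child name>".
    let name = payload.strip_prefix("name=").ok_or(AddChildError::Parse)?;
    // 2. Validate: non-empty name with a restricted character set.
    let valid_chars = name
        .chars()
        .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-');
    if name.is_empty() || !valid_chars {
        return Err(AddChildError::Validate);
    }
    // 3. Register: reject duplicates, then insert into the topology.
    if topology.children.iter().any(|c| c == name) {
        return Err(AddChildError::Register);
    }
    topology.children.push(name.to_string());
    // 4. Launch: simulated by a flag instead of a real task factory.
    if faults.launch_fails {
        return Err(AddChildError::Launch);
    }
    // 5. Audit persist: simulated by a flag instead of a real audit write.
    if faults.audit_fails {
        return Err(AddChildError::AuditStorageFailure);
    }
    Ok(())
}
```

Cloning the whole view is the simplest rollback strategy; a real registry would more likely undo only the entries the call touched.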
Which runtime control commands are idempotent?
Repeated control commands do not create unrecoverable errors:
- Pausing an already paused child returns the current state
- Quarantining an already quarantined child returns the current state
- Calling shutdown after shutdown is complete returns the existing result
- join caches the final RuntimeExitReport; repeated calls return the same result
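Idempotent pause and quarantine can be sketched as state transitions that do nothing when the target state already holds. The names `Operation`, `ChildRecord`, and `transitions` are hypothetical, and the choice that pausing a quarantined child leaves it quarantined is an assumption of this sketch.

```rust
/// Hypothetical operation state of one child record.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operation {
    Active,
    Paused,
    Quarantined,
}

#[derive(Debug)]
pub struct ChildRecord {
    pub operation: Operation,
    /// Counts real state changes, so tests can prove repeats are no-ops.
    pub transitions: u32,
}

impl ChildRecord {
    pub fn new() -> Self {
        ChildRecord { operation: Operation::Active, transitions: 0 }
    }

    /// Pauses an active child; any other state is returned unchanged.
    pub fn pause(&mut self) -> Operation {
        if self.operation == Operation::Active {
            self.operation = Operation::Paused;
            self.transitions += 1;
        }
        self.operation
    }

    /// Quarantines the child once; repeated calls return the current state.
    pub fn quarantine(&mut self) -> Operation {
        if self.operation != Operation::Quarantined {
            self.operation = Operation::Quarantined;
            self.transitions += 1;
        }
        self.operation
    }
}
```

Returning the current state instead of an error means an operator or an automated retry can resend a command safely.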
What is the difference between pause, quarantine, and remove?
All three are stop-type control commands, but they behave differently:
| Command | operation set to | Record kept | Auto-restart |
|---|---|---|---|
| pause_child | Paused | Kept | Suspended while paused |
| quarantine_child | Quarantined | Kept | Disabled permanently |
| remove_child | Removed | Physically deleted after attempt exits | N/A |
Pause can be resumed via resume_child. Quarantined children can be removed later. Remove is final: the runtime record is physically deleted.
Policies & Failure Handling
When should each RestartPolicy value be used?
| Value | Behavior | When to use |
|---|---|---|
| Permanent | Always restart | Critical services like API servers, database connections |
| Transient | Restart only for certain failure categories | Restart on external dependency failures, not on fatal bugs |
| Temporary | Restart at most once | One-shot jobs, do not retry after failure |
How do the three meltdown levels cascade?
The meltdown policy (MeltdownPolicy) limits restarts or failures within a window, across three levels:
- Child-level: Exceeds child_max_restarts / child_window_secs -> enters quarantine
- Group-level: Exceeds group_max_failures / group_window_secs -> escalates to supervisor
- Supervisor-level: Exceeds supervisor_max_failures / supervisor_window_secs -> escalates to parent
After meltdown triggers, it auto-resets after reset_after_secs.
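The three levels can be pictured as three sliding-window counters fed by the same failure stream. This is a rough sketch under assumptions: `WindowCounter`, `MeltdownTracker`, and `MeltdownAction` are hypothetical names, the rule that the highest tripped level wins is a guess, time is passed in as plain seconds, and the reset_after_secs behavior is not modeled.

```rust
use std::collections::VecDeque;

/// Counts events inside a sliding time window.
pub struct WindowCounter {
    limit: usize,
    window_secs: u64,
    stamps: VecDeque<u64>,
}

impl WindowCounter {
    pub fn new(limit: usize, window_secs: u64) -> Self {
        WindowCounter { limit, window_secs, stamps: VecDeque::new() }
    }

    /// Records an event at `now_secs` and reports whether the limit is exceeded.
    pub fn record(&mut self, now_secs: u64) -> bool {
        // Drop timestamps that have aged out of the window.
        while let Some(&oldest) = self.stamps.front() {
            if oldest + self.window_secs <= now_secs {
                self.stamps.pop_front();
            } else {
                break;
            }
        }
        self.stamps.push_back(now_secs);
        self.stamps.len() > self.limit
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MeltdownAction {
    None,
    QuarantineChild,
    EscalateToSupervisor,
    EscalateToParent,
}

/// One counter per level; a single child is tracked to keep the sketch short.
pub struct MeltdownTracker {
    pub child: WindowCounter,
    pub group: WindowCounter,
    pub supervisor: WindowCounter,
}

impl MeltdownTracker {
    /// Feeds one failure to all levels and returns the highest level that tripped.
    pub fn on_failure(&mut self, now_secs: u64) -> MeltdownAction {
        let child = self.child.record(now_secs);
        let group = self.group.record(now_secs);
        let supervisor = self.supervisor.record(now_secs);
        if supervisor {
            MeltdownAction::EscalateToParent
        } else if group {
            MeltdownAction::EscalateToSupervisor
        } else if child {
            MeltdownAction::QuarantineChild
        } else {
            MeltdownAction::None
        }
    }
}
```

With limits of 2 per 10s, 4 per 60s, and 6 per 600s, three failures in quick succession quarantine the child, while slower failures that stay under the child limit can still trip the wider group window.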
Observability
How do I subscribe to lifecycle events?
Call SupervisorHandle::subscribe_events() to get a broadcast::Receiver. Events are of type SupervisorEvent, containing When (wall time, monotonic time, uptime, generation, attempt), Where (supervisor path, child ID, task name), and What (state transitions, policy decisions, health status, exit reasons, or control commands).
What happens when the event journal is full?
The event journal is a fixed-capacity ring buffer. When full, it overwrites the oldest entries. Capacity is configured via observability.event_journal_capacity. However, the add_child-dedicated audit channel does not silently overwrite: it returns Err(AuditStorageFailure) when full.
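The two overflow behaviors can be contrasted side by side. The sketch below is an assumption-laden illustration: `EventJournal`, `AuditChannel`, and the unit struct `AuditStorageFailure` are hypothetical types built on `VecDeque`, not the crate's own storage.

```rust
use std::collections::VecDeque;

/// Ring-buffer journal: when full, the oldest entry is overwritten.
pub struct EventJournal {
    capacity: usize,
    entries: VecDeque<String>,
}

impl EventJournal {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "journal capacity must be positive");
        EventJournal { capacity, entries: VecDeque::with_capacity(capacity) }
    }

    pub fn push(&mut self, entry: &str) {
        if self.entries.len() == self.capacity {
            // Full: drop the oldest entry to make room.
            self.entries.pop_front();
        }
        self.entries.push_back(entry.to_string());
    }

    /// Entries from oldest to newest.
    pub fn snapshot(&self) -> Vec<&str> {
        self.entries.iter().map(String::as_str).collect()
    }
}

/// Hypothetical error returned when the audit channel has no room left.
#[derive(Debug, PartialEq)]
pub struct AuditStorageFailure;

/// Bounded audit channel: when full, the write is refused instead of overwriting.
pub struct AuditChannel {
    capacity: usize,
    records: VecDeque<String>,
}

impl AuditChannel {
    pub fn new(capacity: usize) -> Self {
        AuditChannel { capacity, records: VecDeque::with_capacity(capacity) }
    }

    pub fn try_push(&mut self, record: &str) -> Result<(), AuditStorageFailure> {
        if self.records.len() >= self.capacity {
            return Err(AuditStorageFailure);
        }
        self.records.push_back(record.to_string());
        Ok(())
    }

    pub fn len(&self) -> usize {
        self.records.len()
    }
}
```

The journal favors recency for diagnostics, while the audit channel favors completeness and pushes the failure back to the caller.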
Dashboard
Which three repositories does the Dashboard feature require?
The dashboard feature spans three repositories:
| Repository | Responsibility |
|---|---|
| rust-supervisor (this project) | Target process local IPC and shared contracts |
| ~/rust-supervisor-relay | Relay and external wss:// sessions |
| ~/rust-supervisor-ui | Browser dashboard client |
The target process exposes only a local Unix domain socket. IPC must never be exposed to external networks.
What IPC methods are supported?
Supported methods: hello, state, events.subscribe, logs.tail, command.restart_child, command.pause_child, command.resume_child, command.quarantine_child, command.remove_child, command.add_child, and command.shutdown_tree.
Project & Build
What does target/debug/rust-tokio-supervisor generate-template do without arguments?
generate-template with no arguments does not output to stdout. It writes to config/<root-config-name>/<root-config-name>.example.yaml by default.
For this project:
# No terminal output after running
./target/debug/rust-tokio-supervisor generate-template
# But files are actually written
ls config/supervisor_config/
# supervisor_config.example.yaml
# supervisor_config.schema.json
Options:
# Specify output path
./target/debug/rust-tokio-supervisor generate-template --output /tmp/my-config.yaml
# Also generate JSON Schema
./target/debug/rust-tokio-supervisor generate-template --schema /tmp/schema.json
The output format is inferred from the file extension; unknown or missing extensions use YAML by default.
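Extension-based inference with a YAML fallback might look like the sketch below. It is not the tool's actual code: the `OutputFormat` enum, the `infer_format` function, the JSON and TOML variants, and the case-insensitive match are all assumptions; only the YAML default for unknown or missing extensions comes from the text above.

```rust
use std::path::Path;

/// Hypothetical set of output formats; only the YAML default is documented.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OutputFormat {
    Yaml,
    Json,
    Toml,
}

/// Picks a format from the file extension, falling back to YAML.
pub fn infer_format(path: &Path) -> OutputFormat {
    let extension = path
        .extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.to_ascii_lowercase());
    match extension.as_deref() {
        Some("json") => OutputFormat::Json,
        Some("toml") => OutputFormat::Toml,
        // "yaml", "yml", unknown extensions, and missing extensions all land here.
        _ => OutputFormat::Yaml,
    }
}
```

Routing every unrecognized case through one fallback arm keeps the behavior predictable when a path has no extension at all.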
Why does Cargo.toml declare only one [[bin]] (rust-tokio-supervisor) but there are multiple binaries in target/debug/?
Cargo supports two ways to declare binary targets:
- Explicit declaration: via [[bin]] entries in Cargo.toml, e.g., src/main.rs -> rust-tokio-supervisor
- Auto-discovery: each .rs file in src/bin/ automatically becomes a binary target, using the filename as the target name
So Cargo.toml shows only [[bin]] name = "rust-tokio-supervisor", but src/bin/generate_supervisor.rs and src/bin/generate_supervisor_config.rs are auto-discovered by Cargo, producing additional binaries.
Note: The src/bin/ directory may be cleaned up or moved after feature completion to keep the project structure tidy.
Common Errors
What is SupervisorError::FatalConfig?
FatalConfig indicates an unrecoverable error during configuration loading. The error includes field_path (JSON Pointer format) and a human-readable hint to help locate the specific problem.
What should I do when add_child returns Err(SupervisorShuttingDown)?
The supervisor is currently executing its shutdown sequence and cannot accept new add_child requests. Wait for the supervisor to complete shutdown, restart it, then retry the add operation.
What should I do when add_child returns Err(ChildLimitExceeded)?
The runtime child count has reached its limit (currently 1000). Either remove unnecessary children via remove_child, or adjust the dynamic_supervisor.child_limit configuration.
What happens when audit storage fails?
When the audit channel (ring buffer) write fails:
- add_child enters the compensating flow and returns Err(AuditStorageFailure)
- The topology view rolls back to its pre-call state
- No orphaned semi-parsed state is left behind
Quality Gates
Baseline Commands
cargo fmt --check
cargo check
cargo test
cargo doc --no-deps
cargo package --list
scripts/check-coding-standard.sh
scripts/check-maintainability.sh
scripts/generate-sbom.sh
scripts/validate-sbom.sh
cargo publish --dry-run
Documentation Synchronization
The manual, engineering docs, README files, examples, quickstart, public API contract, and glossary must stay synchronized. When public APIs, configuration shape, example behavior, or observability signals change, documentation must be updated in the same implementation pass.
Coding Standard
scripts/check-coding-standard.sh checks required release materials, example files, primary configuration, documentation punctuation, and No Compatibility language. Chinese-language docs in this repository must use ASCII punctuation.
Maintainability
scripts/check-maintainability.sh checks isomorphic manual/zh and manual/en entries, isomorphic docs/zh and docs/en entries for quality gate and parallel governance pages, example count against the contract, validation artifacts, the Shutdown Without Orphaned Tasks term, and the rust-config-tree term.
SBOM And Release
scripts/generate-sbom.sh creates artifacts/sbom/rust-supervisor.cdx.json and artifacts/sbom/rust-supervisor.spdx.json. scripts/validate-sbom.sh checks file presence, JSON shape, package name, Cargo.lock digest, and leakage of secrets, tokens, local absolute paths, or build scratch paths.