rust-supervisor Manual

Language: 中文

Project Scope

rust-supervisor is a Rust task supervision core for Tokio services. It uses declarative models to manage child startup, stop, restart, quarantine, state query, event recording, health checks, and Shutdown Without Orphaned Tasks.

The configuration boundary uses rust-config-tree 0.3.0 with YAML files. Runtime tunable values must enter the system through this centralized configuration path.

This project has no legacy interface burden. Users should import public types from owning module paths, such as rust_supervisor::runtime::supervisor::Supervisor.

Reading Path

  • Getting Started: start a minimal supervisor from YAML configuration.
  • Configuration: understand SupervisorConfig, ConfigState, and startup rejection boundaries.
  • Split Configuration and Transparent Array Sections: understand split groups.yaml / children.yaml layout and template CLI commands.
  • Supervisor Tree: understand SupervisorSpec, SupervisorTree, and registry ownership.
  • Task Model: understand ChildSpec, TaskFactory, TaskContext, and readiness.
  • Policies: understand restart decisions, backoff, fuse rules, quarantine, and task exit classification.
  • Runtime Control: understand SupervisorHandle commands and idempotent behavior.
  • Dashboard: understand the three-end workflow across the target process, relay, and dashboard client.
  • Shutdown: understand four-stage shutdown and blocking worker boundaries.
  • Observability: understand events, logs, tracing, metrics, audit data, and run summaries.
  • Examples: run each learning example under examples/.
  • Quality Gates: run formatting, build, test, documentation, SBOM, and release checks.

Runtime Boundary

The supervisor core governs lifecycle behavior only. High-frequency business messages belong in the data plane. The control plane handles lifecycle commands, current state queries, events, and governance decisions.

Getting Started

Walkthrough: This guide has 5 steps (Step 1 of 5 to Step 5 of 5). Estimated completion time: 5 minutes.

Step 1 of 5: Prerequisites

This project is a Rust library. The examples require Cargo and a Tokio application environment. Repository examples include their required dependencies.

The primary configuration file is examples/config/supervisor.yaml. The loader uses rust-config-tree 0.3.0, reads YAML, and produces ConfigState.

Step 2 of 5: Minimal Command

cargo run --example supervisor_quickstart

The example loads YAML through load_config_from_yaml_file, derives SupervisorSpec through ConfigState::to_supervisor_spec, starts the runtime through Supervisor::start, queries current_state, and then shuts down the tree through shutdown_tree.

Step 3 of 5: Minimal Code Path

use rust_supervisor::config::loader::load_config_from_yaml_file;
use rust_supervisor::runtime::supervisor::Supervisor;

#[tokio::main]
async fn main() -> Result<(), rust_supervisor::error::types::SupervisorError> {
    let state = load_config_from_yaml_file("examples/config/supervisor.yaml")?;
    let spec = state.to_supervisor_spec()?;
    let handle = Supervisor::start(spec).await?;
    let current = handle.current_state().await?;
    println!("{current:#?}");
    handle.shutdown_tree("operator", "quickstart complete").await?;
    Ok(())
}

Step 4 of 5: Result

The example validates the integration path. It is not a business task template. Application workers should live inside ChildSpec and TaskFactory boundaries instead of being started as unmanaged background tasks.

Step 5 of 5: Health Self-Check

After startup, the supervisor prints a health self-check JSON to stdout. The JSON schema is formally defined in health-selfcheck-schema.md.

Expected output (example):

{
  "status": "ready",
  "supervisor_version": "0.1.2",
  "uptime_secs": 3600,
  "children": { "total": 5, "running": 5, "failed": 0 },
  "dashboard_link": "connected"
}

If status is not "ready", check the operations runbook for troubleshooting steps.


Entry Points

The Supervisor struct in src/runtime/supervisor.rs:36-83 provides 3 entry methods:

| Method | Input | When to Use |
|---|---|---|
| Supervisor::start(spec) | SupervisorSpec (built programmatically) | You already have a spec object |
| Supervisor::start_from_config_state(state) | ConfigState (validated config) | You loaded config via the loader |
| Supervisor::start_from_config_file(path) | YAML file path | Direct launch from a file |

All 3 converge on the private start_with_policy() (src/runtime/supervisor.rs:95-126), which:

  1. Calls spec.validate() to verify all child declarations
  2. Creates an mpsc command channel and a broadcast event channel
  3. Creates RuntimeControlPlane and ObservabilityPipeline
  4. Builds RuntimeControlState
  5. Spawns the control loop via tokio::spawn(run_control_loop(...))
  6. Starts RuntimeWatchdog to monitor control loop health
  7. Returns SupervisorHandle for commands (restart, shutdown, etc.) and event subscriptions
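
The ordering above (validate first, then create channels, then spawn the loop, then hand back a handle) can be modelled with the standard library alone. The following is a minimal sketch, not the crate's implementation: it swaps Tokio for std::thread and std::sync::mpsc, omits the broadcast event channel and the watchdog, and the names Command, Handle, and start are hypothetical.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical lifecycle commands; the real command set lives in the crate.
enum Command {
    CurrentState(Sender<Vec<String>>),
    Shutdown,
}

/// Hypothetical stand-in for SupervisorHandle.
struct Handle {
    commands: Sender<Command>,
    control_loop: JoinHandle<()>,
}

/// Validate first, then create the channel and spawn the control loop.
fn start(children: Vec<String>) -> Result<Handle, String> {
    // Step 1: reject invalid declarations before any runtime resource exists.
    if children.iter().any(|name| name.is_empty()) {
        return Err("child name must be non-empty".to_string());
    }
    // Step 2: the command channel (std mpsc here, a Tokio channel in the crate).
    let (commands, inbox) = mpsc::channel::<Command>();
    // Step 5: the control loop owns the state and serves commands until shutdown.
    let control_loop = thread::spawn(move || {
        for command in inbox {
            match command {
                Command::CurrentState(reply) => {
                    let _ = reply.send(children.clone());
                }
                Command::Shutdown => break,
            }
        }
    });
    // Step 7: the caller only ever sees the handle.
    Ok(Handle { commands, control_loop })
}

impl Handle {
    /// Ask the control loop for the current child list.
    fn current_state(&self) -> Vec<String> {
        let (reply, answer) = mpsc::channel();
        self.commands
            .send(Command::CurrentState(reply))
            .expect("control loop alive");
        answer.recv().expect("control loop replied")
    }

    /// Stop the control loop and wait for it to exit.
    fn shutdown(self) {
        self.commands
            .send(Command::Shutdown)
            .expect("control loop alive");
        self.control_loop.join().expect("control loop exited cleanly");
    }
}

fn main() {
    let handle = start(vec!["api".to_string()]).expect("valid spec");
    println!("{:?}", handle.current_state());
    handle.shutdown();
}
```

Because validation runs before the channel is created, a rejected spec leaves no thread or channel behind, which mirrors the rejection boundary described for configuration loading.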

Usage Examples

From YAML file via ConfigState — start_from_config_state

Full example: examples/supervisor_quickstart.rs. Config: examples/config/supervisor.yaml.

use rust_supervisor::config::loader::load_config_from_yaml_file;
use rust_supervisor::runtime::supervisor::Supervisor;

#[tokio::main]
async fn main() -> Result<(), rust_supervisor::error::types::SupervisorError> {
    let state = load_config_from_yaml_file("examples/config/supervisor.yaml")?;
    let handle = Supervisor::start_from_config_state(state).await?;
    handle.shutdown_tree("operator", "quickstart complete").await?;
    Ok(())
}

load_config_from_yaml_file returns a ConfigState. Its to_supervisor_spec() is called internally by start_from_config_state.

Direct from YAML file path — start_from_config_file

One-step shortcut that calls load_config_from_yaml_file internally:

use rust_supervisor::runtime::supervisor::Supervisor;

#[tokio::main]
async fn main() -> Result<(), rust_supervisor::error::types::SupervisorError> {
    let handle = Supervisor::start_from_config_file("examples/config/supervisor.yaml").await?;
    handle.shutdown_tree("operator", "done").await?;
    Ok(())
}

Programmatic spec — start

Full example: examples/supervisor_tree_story.rs.

use std::sync::Arc;
use rust_supervisor::id::types::ChildId;
use rust_supervisor::runtime::supervisor::Supervisor;
use rust_supervisor::spec::child::{ChildSpec, TaskKind};
use rust_supervisor::spec::supervisor::SupervisorSpec;
use rust_supervisor::task::factory::{TaskResult, service_fn};

#[tokio::main]
async fn main() -> Result<(), rust_supervisor::error::types::SupervisorError> {
    let factory = service_fn(|ctx| async move {
        ctx.heartbeat();
        ctx.mark_ready();
        println!("child running at path={}", ctx.path);
        TaskResult::Succeeded
    });

    let child = ChildSpec::worker(
        ChildId::new("demo-worker"),
        "Demo Worker",
        TaskKind::AsyncWorker,
        Arc::new(factory),
    );

    let spec = SupervisorSpec::root(vec![child]);
    let handle = Supervisor::start(spec).await?;

    let state = handle.current_state().await?;
    println!("{state:#?}");
    handle.shutdown_tree("operator", "demo complete").await?;
    Ok(())
}

ChildSpec::worker() automatically sets task_role = Some(TaskRole::Worker), equivalent to task_role: worker in YAML.

TaskRole Behavior

The 5 TaskRole variants dispatch to different default lifecycle policies via RoleDefaultPolicy::for_role():

| Dimension | Service | Worker | Job | Sidecar | Supervisor |
|---|---|---|---|---|---|
| On success | Restart | Stop | Stop | Restart | Restart |
| On timeout | RestartWithBackoff | RestartWithBackoff | StopAndEscalate | RestartWithBackoff | RestartWithBackoff |
| Max restarts | 10 | 3 | 1 | 5 | 3 |
| Default severity | Critical | Standard | Optional | Standard | Critical |

The per-task role defaults are defined by 5 constructors in src/policy/task_role_defaults.rs:418-464:

  • Service: long-running daemon, restart on success, 10 retries, Critical severity — expected to stay online forever.
  • Worker: background task, stop on success, 3 retries, Standard severity — stops when done.
  • Job: one-shot task, stop on success, a timeout stops and escalates immediately instead of restarting with backoff, max restarts 1, Optional severity — runs once then exits.
  • Sidecar: auxiliary process, same staying behavior as Service but lower restart budget (5), requires a SidecarConfig binding to a primary.
  • Supervisor: nested supervision tree, same staying behavior as Service, 3 retries, Critical severity.

When task_role is None, EffectivePolicy::merge() falls back to TaskRole::Worker with a warning. semantic_conflicts_for_child() detects role violations (e.g., Job with permanent restart policy).
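
The role table can be read as a single match expression. The sketch below is illustrative only: the type and function names (RoleDefaults, defaults_for, effective_role) are hypothetical and the real RoleDefaultPolicy carries more dimensions, but the values are the ones listed in the table, and the None case falls back to Worker as described.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum TaskRole { Service, Worker, Job, Sidecar, Supervisor }

#[derive(Debug, Clone, Copy, PartialEq)]
enum OnSuccess { Restart, Stop }

#[derive(Debug, Clone, Copy, PartialEq)]
enum OnTimeout { RestartWithBackoff, StopAndEscalate }

#[derive(Debug, Clone, Copy, PartialEq)]
enum Severity { Critical, Standard, Optional }

/// Hypothetical subset of the per-role defaults shown in the table.
#[derive(Debug, Clone, Copy, PartialEq)]
struct RoleDefaults {
    on_success: OnSuccess,
    on_timeout: OnTimeout,
    max_restarts: u32,
    severity: Severity,
}

/// One arm per role, mirroring one column of the table.
fn defaults_for(role: TaskRole) -> RoleDefaults {
    use OnSuccess::*;
    use OnTimeout::*;
    let (on_success, on_timeout, max_restarts, severity) = match role {
        TaskRole::Service => (Restart, RestartWithBackoff, 10, Severity::Critical),
        TaskRole::Worker => (Stop, RestartWithBackoff, 3, Severity::Standard),
        TaskRole::Job => (Stop, StopAndEscalate, 1, Severity::Optional),
        TaskRole::Sidecar => (Restart, RestartWithBackoff, 5, Severity::Standard),
        TaskRole::Supervisor => (Restart, RestartWithBackoff, 3, Severity::Critical),
    };
    RoleDefaults { on_success, on_timeout, max_restarts, severity }
}

/// A missing role falls back to Worker and emits a warning.
fn effective_role(declared: Option<TaskRole>) -> TaskRole {
    declared.unwrap_or_else(|| {
        eprintln!("warning: task_role not set, falling back to Worker");
        TaskRole::Worker
    })
}

fn main() {
    let role = effective_role(None);
    println!("{:?} -> {:?}", role, defaults_for(role));
}
```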

Configuration and Schema

Entry Point

The configuration entry point is rust_supervisor::config::loader::load_config_from_yaml_file. It accepts only the YAML primary configuration file. The repository example path is examples/config/supervisor.yaml.

The configuration struct SupervisorConfig contains these top-level groups:

| Group | Type | Description |
|---|---|---|
| include | Vec<PathBuf> | Additional config files included by rust-config-tree |
| supervisor | SupervisorRootConfig | Root supervision strategy |
| policy | PolicyConfig | Restart, backoff, heartbeat, failure window, restart budget, meltdown fuse, and supervision pipeline capacities |
| shutdown | ShutdownConfig | Graceful timeout and abort wait budgets |
| observability | ObservabilityConfig | Event journal capacity and metric/audit switches |
| audit | AuditConfig | Audit storage backend, JSON Lines file path, and write failure strategy |
| backpressure | BackpressureConfig | Backpressure strategy, thresholds, window, and audit channel capacity for observability subscribers |
| groups | GroupsConfigSection | Group name and group-level restart budget overrides; membership is declared on children[].group; supports split file groups.yaml |
| group_strategies | Vec<GroupStrategyConfig> | Group-level supervision strategies, restart limits, and escalation policies |
| group_dependencies | Vec<GroupDependencyConfig> | Cross-group failure propagation edges |
| child_strategy_overrides | Vec<ChildStrategyOverrideConfig> | Child-level supervision strategies, restart limits, and escalation policies |
| severity_defaults | Vec<SeverityDefaultConfig> | Default severity class per task role |
| dashboard | Option<DashboardIpcConfig> | Optional dashboard IPC socket (Unix only) |
| children | ChildrenConfigSection | Declarative child specifications; serialized as a YAML array; supports split file children.yaml |

Configuration State

rust_supervisor::config::configurable::SupervisorConfig is the public root configuration struct. It supports confique::Config, schemars::JsonSchema, serde::Serialize, and serde::Deserialize. Users can reuse the same model for YAML loading, template generation, and JSON Schema generation.

ConfigState is the validated immutable state. Runtime modules must not keep separate runtime tunable constants.

ConfigState::to_supervisor_spec derives SupervisorSpec. The implementation fills the supervision strategy, policy defaults, shutdown budgets, health timing, observability capacity, backpressure policy, dynamic supervisor policy, restart budget, failure window, meltdown fuse, supervision pipeline capacities, group policies, and child strategy overrides from configuration values.

Templates and Split Configuration

The official single-file template is examples/config/supervisor.template.yaml.

groups and children use transparent array sections. They can live in the root file or be split into groups.yaml and children.yaml through include. Split files contain only the array body, not an items: wrapper.

Use the CLI to generate templates and schemas. Subcommands are top-level, with no config prefix. The --config flag belongs to the run and validate-config subcommands; generate-template and generate-schema use examples/config/supervisor.yaml as the default template source:

cargo run -- run --config examples/config/supervisor.yaml

cargo run -- validate-config --config examples/config/split/supervisor.yaml

cargo run -- generate-template \
  --output config/supervisor_config/supervisor_config.example.yaml

cargo run -- generate-schema \
  --output config/supervisor_config/supervisor.schema.json

Error Boundary

Configuration loading returns SupervisorError::FatalConfig when startup must be rejected:

Root-level checks:

  • The file extension is not YAML.
  • The file cannot be read.
  • YAML cannot be parsed into SupervisorConfig.
  • The supervision strategy is not one of OneForOne, OneForAll, or RestForOne.
  • A required numeric value is zero.
  • The initial backoff is greater than the maximum backoff.
  • The jitter ratio is outside the accepted range.
  • policy.restart_budget.window_secs, policy.restart_budget.max_burst, or policy.restart_budget.recovery_rate_per_sec is invalid.
  • policy.failure_window.window_secs, policy.failure_window.max_count, or policy.failure_window.threshold is invalid.
  • A policy.meltdown.* window or threshold is zero.
  • A policy.supervision_pipeline.* capacity or concurrent restart limit is zero.
  • supervisor.dynamic_supervisor.child_limit is zero.
  • backpressure.warn_threshold_pct is not between 1 and 100.
  • backpressure.critical_threshold_pct is not between 1 and 100.
  • backpressure.warn_threshold_pct is greater than or equal to backpressure.critical_threshold_pct.
  • backpressure.window_secs or backpressure.audit_channel_capacity is zero.

Child declaration checks:

  • Child ID and name must be non-empty.
  • Tags must be non-empty.
  • A child with kind: Supervisor must not have a factory; a child with kind: AsyncWorker or kind: BlockingWorker must have one.
  • Sidecar task role requires sidecar_config, and vice versa.
  • Dependency cycles are rejected.
  • Group membership is declared only on children[].group; referenced group names must exist in groups.
  • Group names referenced by group_strategies and group_dependencies must exist.
  • Child names referenced by child_strategy_overrides must exist.
  • severity_defaults must not declare the same task role more than once.
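
The dependency-cycle check in the list above is a standard graph problem. The sketch below shows one common way to detect a cycle, a depth-first search with "visiting" and "done" marks. It is an assumption about the technique, not the crate's code, and the function names are hypothetical.

```rust
use std::collections::HashMap;

/// Marks used by the depth-first search.
enum Mark {
    Visiting,
    Done,
}

/// Returns true when a cycle is reachable from `node`.
fn visit<'a>(
    node: &'a str,
    deps: &HashMap<&'a str, Vec<&'a str>>,
    marks: &mut HashMap<&'a str, Mark>,
) -> bool {
    match marks.get(node) {
        Some(Mark::Done) => return false,
        // Reaching a node that is still on the current path means a cycle.
        Some(Mark::Visiting) => return true,
        None => {}
    }
    marks.insert(node, Mark::Visiting);
    if let Some(children) = deps.get(node) {
        for &next in children {
            if visit(next, deps, marks) {
                return true;
            }
        }
    }
    marks.insert(node, Mark::Done);
    false
}

/// `deps` maps a child name to the names it depends on.
fn has_dependency_cycle<'a>(deps: &HashMap<&'a str, Vec<&'a str>>) -> bool {
    let mut marks = HashMap::new();
    deps.keys().any(|&node| visit(node, deps, &mut marks))
}

fn main() {
    let mut deps = HashMap::new();
    deps.insert("api", vec!["db"]);
    deps.insert("db", vec!["api"]);
    println!("cycle detected: {}", has_dependency_cycle(&deps));
}
```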

IPC checks (when dashboard.enabled = true):

  • target_id must be non-empty.
  • path is required and must be absolute.
  • registration.relay_registration_path is required and must be absolute.
  • lease_seconds must be greater than zero.
  • heartbeat_interval_seconds must be positive and less than lease_seconds.

Supervisor::start_from_config_file rejects invalid configuration before it creates runtime channels or spawns the control loop.
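
Several of the root-level checks are plain numeric comparisons. A minimal sketch of the backoff and backpressure rules follows. The struct names, the FatalConfig stand-in, and the error messages are hypothetical, and the accepted jitter range of 0.0 to 1.0 is an assumption, since this page does not state the range.

```rust
/// Hypothetical stand-in for SupervisorError::FatalConfig.
#[derive(Debug, PartialEq)]
struct FatalConfig(String);

struct BackoffSettings {
    initial_backoff_ms: u64,
    max_backoff_ms: u64,
    jitter_ratio: f64,
}

struct BackpressureSettings {
    warn_threshold_pct: u8,
    critical_threshold_pct: u8,
    window_secs: u64,
    audit_channel_capacity: usize,
}

fn reject(message: &str) -> Result<(), FatalConfig> {
    Err(FatalConfig(message.to_string()))
}

fn validate_backoff(b: &BackoffSettings) -> Result<(), FatalConfig> {
    if b.initial_backoff_ms == 0 || b.max_backoff_ms == 0 {
        return reject("backoff values must be non-zero");
    }
    if b.initial_backoff_ms > b.max_backoff_ms {
        return reject("initial backoff is greater than maximum backoff");
    }
    // Assumption: the accepted range is 0.0..=1.0 (NaN is rejected as well).
    if !(0.0..=1.0).contains(&b.jitter_ratio) {
        return reject("jitter ratio is outside the accepted range");
    }
    Ok(())
}

fn validate_backpressure(b: &BackpressureSettings) -> Result<(), FatalConfig> {
    let in_range = |pct: u8| (1..=100).contains(&pct);
    if !in_range(b.warn_threshold_pct) || !in_range(b.critical_threshold_pct) {
        return reject("thresholds must be between 1 and 100");
    }
    if b.warn_threshold_pct >= b.critical_threshold_pct {
        return reject("warn threshold must be below critical threshold");
    }
    if b.window_secs == 0 || b.audit_channel_capacity == 0 {
        return reject("window and audit channel capacity must be non-zero");
    }
    Ok(())
}

fn main() {
    // Values taken from the example configuration on this page.
    let backoff = BackoffSettings {
        initial_backoff_ms: 100,
        max_backoff_ms: 5000,
        jitter_ratio: 0.10,
    };
    let backpressure = BackpressureSettings {
        warn_threshold_pct: 80,
        critical_threshold_pct: 95,
        window_secs: 30,
        audit_channel_capacity: 1024,
    };
    println!("{:?}", validate_backoff(&backoff));
    println!("{:?}", validate_backpressure(&backpressure));
}
```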

Example Configuration

supervisor:
  strategy: OneForAll
  escalation_policy: escalate_to_parent
  control_channel_capacity: 256
  event_channel_capacity: 256
  dynamic_supervisor:
    enabled: true
    child_limit: 16
policy:
  child_restart_limit: 10
  child_restart_window_ms: 60000
  supervisor_failure_limit: 30
  supervisor_failure_window_ms: 60000
  initial_backoff_ms: 100
  max_backoff_ms: 5000
  jitter_ratio: 0.10
  heartbeat_interval_ms: 1000
  stale_after_ms: 3000
  restart_budget:
    window_secs: 60
    max_burst: 10
    recovery_rate_per_sec: 0.50
  failure_window:
    mode: time_sliding
    window_secs: 60
    max_count: 5
    threshold: 5
  meltdown:
    child_max_restarts: 3
    child_window_secs: 10
    group_max_failures: 5
    group_window_secs: 30
    supervisor_max_failures: 10
    supervisor_window_secs: 60
    reset_after_secs: 120
  supervision_pipeline:
    journal_capacity: 100
    subscriber_capacity: 10
    concurrent_restart_limit: 5
shutdown:
  graceful_timeout_ms: 5000
  abort_wait_ms: 1000
observability:
  event_journal_capacity: 256
  metrics_enabled: true
  audit_enabled: true
audit:
  enabled: true
  backend: memory
  failure_strategy: fail_closed
  max_defer_queue: 1000
backpressure:
  strategy: alert_and_block
  warn_threshold_pct: 80
  critical_threshold_pct: 95
  window_secs: 30
  audit_channel_capacity: 1024
groups:
  - name: core
    children:
      - api
    budget:
      window_secs: 60
      max_burst: 10
      recovery_rate_per_sec: 0.50
  - name: upstream
    children: []
group_strategies:
  - group: core
    strategy: OneForOne
    restart_limit:
      max_restarts: 5
      window_ms: 60000
    escalation_policy: quarantine_scope
group_dependencies:
  - from_group: core
    to_group: upstream
    propagation: Full
child_strategy_overrides:
  - child_id: api
    strategy: RestForOne
    restart_limit:
      max_restarts: 3
      window_ms: 30000
    escalation_policy: shutdown_tree
severity_defaults:
  - task_role: service
    severity: Critical
children:
  - name: api
    kind: supervisor
    criticality: critical
    tags:
      - core
    task_role: supervisor
    severity: Critical
    group: core
    restart_policy: transient
dashboard:
  enabled: true
  target_id: payments-worker-a
  path: /tmp/rust-supervisor-demo/payments-worker-a.sock
  permissions: "0600"
  bind_mode: replace_stale
  registration:
    enabled: true
    relay_registration_path: /tmp/rust-supervisor-demo/dashboard-relay-registration.sock
    display_name: "payments worker a"
    lease_seconds: 30
    registration_heartbeat_interval_seconds: 15

Secret Placeholders

Configuration values that reference secrets use the ${SECRET_NAME} placeholder format. Replace these placeholders with environment variables or your secret management solution before starting the supervisor. Example:

dashboard:
  security_config:
    peer_identity:
      allowed_uids: [ "${SUPERVISOR_UID}" ]

dashboard.security_config does not carry audit settings. IPC audit persistence uses the root audit section so there is one authoritative AuditConfig.

The supervisor does not resolve placeholders at runtime; replacement must happen before configuration loading (e.g., via envsubst or your deployment pipeline).
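
Because replacement happens outside the supervisor, a deployment pipeline may want to confirm that no placeholder survived substitution. The following sketch is a hypothetical pre-flight helper, not part of rust-supervisor: it scans rendered configuration text for any remaining ${NAME} tokens.

```rust
/// Returns the names of all `${NAME}` placeholders still present in `text`.
/// Hypothetical pre-flight helper; not part of the rust-supervisor API.
fn unresolved_placeholders(text: &str) -> Vec<String> {
    let mut found = Vec::new();
    let mut rest = text;
    while let Some(start) = rest.find("${") {
        // Skip past the two-byte "${" opener.
        let after = &rest[start + 2..];
        match after.find('}') {
            Some(end) => {
                found.push(after[..end].to_string());
                rest = &after[end + 1..];
            }
            // An opener without a closer is not a complete placeholder.
            None => break,
        }
    }
    found
}

fn main() {
    let rendered = "allowed_uids: [ \"${SUPERVISOR_UID}\" ]";
    let leftover = unresolved_placeholders(rendered);
    if !leftover.is_empty() {
        eprintln!("unresolved placeholders: {:?}", leftover);
    }
}
```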

TLS is handled by the relay layer (rust-supervisor-relay) using wss://. The supervisor target process exposes only a local Unix domain socket and does not terminate TLS.

Upgrade

This version does not support in-place upgrades. To upgrade, deploy a fresh instance with the new version and migrate state through the external IPC interface.

Split Configuration and Transparent Array Sections

Overview

groups and children on SupervisorConfig use transparent array sections. They appear as YAML arrays on disk, are stored in Rust behind an items field, and are loaded or templated by rust-config-tree 0.3.0.

In short: use children: [...] in a single file, or write only [...] in a split file. Do not wrap split bodies with items:.

Fields

| Field | Rust type | Split file | Section schema top-level type |
|---|---|---|---|
| groups | GroupsConfigSection | groups.yaml | array |
| children | ChildrenConfigSection | children.yaml | array |

Access items with:

// `config` is a loaded SupervisorConfig.
config.children.len();
config.children.as_slice();
config.groups.as_slice();
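
The "array on disk, items field in Rust" idea can be illustrated without any serialization library. The sketch below uses hypothetical types (ChildEntry, ChildrenSection) and leaves out the transparent Serialize, Deserialize, and JsonSchema implementations that the real section types provide. It only shows the wrapper and its accessors.

```rust
/// Hypothetical child entry; the real declaration has many more fields.
#[derive(Debug, Clone, PartialEq)]
struct ChildEntry {
    name: String,
    kind: String,
}

/// Wrapper that stores the array behind an `items` field.
#[derive(Debug, Default)]
struct ChildrenSection {
    items: Vec<ChildEntry>,
}

impl ChildrenSection {
    fn len(&self) -> usize {
        self.items.len()
    }

    fn is_empty(&self) -> bool {
        self.items.is_empty()
    }

    fn as_slice(&self) -> &[ChildEntry] {
        &self.items
    }
}

/// Callers build the section from a plain array and never name `items`.
impl From<Vec<ChildEntry>> for ChildrenSection {
    fn from(items: Vec<ChildEntry>) -> Self {
        Self { items }
    }
}

fn main() {
    let section = ChildrenSection::from(vec![ChildEntry {
        name: "api".to_string(),
        kind: "async_worker".to_string(),
    }]);
    for child in section.as_slice() {
        println!("{} ({})", child.name, child.kind);
    }
    println!("count = {}, empty = {}", section.len(), section.is_empty());
}
```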

Single-file layout

groups:
  - name: core
children:
  - name: api
    kind: async_worker

Split layout

Root config:

include:
  - groups.yaml
  - children.yaml
supervisor:
  strategy: OneForAll
policy:
  child_restart_limit: 10
  # ... remaining policy / shutdown / observability fields

Split files contain only the array body:

# groups.yaml
- name: core
  budget:
    window_secs: 60
    max_burst: 10
    recovery_rate_per_sec: 0.5
# children.yaml
- name: worker
  kind: async_worker
  criticality: optional
  restart_policy: permanent

Repository references:

  • Generated template tree: config/supervisor_config/
  • Example inputs: examples/config/split/
  • Runnable example: cargo run --example split_config_supervisor

Supported YAML shapes

The loader accepts all three forms:

| Shape | Example |
|---|---|
| Transparent array | children: [{ name: api }] |
| Body-only split file | children.yaml contains only - name: api |
| Legacy items wrapper | children:\n items: [...] |

Prefer block style when writing these files by hand. The loader accepts the flow-style form [{ name: worker }], but template generation emits block YAML.

CLI

Use the rust-tokio-supervisor binary or cargo run. Subcommands are top-level; there is no config prefix. --config belongs to the run and validate-config subcommands, not the global CLI.

# Validate and print a summary (default `run` command)
cargo run -- run --config examples/config/supervisor.yaml

# Validate the full config tree (includes, defaults, runtime validation)
cargo run -- validate-config --config examples/config/split/supervisor.yaml

# Generate split templates (includes groups.yaml / children.yaml)
# Template source defaults to examples/config/supervisor.yaml
cargo run -- generate-template \
  --output config/supervisor_config/supervisor_config.example.yaml

# Generate JSON Schemas
cargo run -- generate-schema \
  --output config/supervisor_config/supervisor.schema.json

No post-processing is required after generation. The library strips section root keys and rewrites flow arrays to block YAML.

Loading in code

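use rust_config_tree::config::load_config;
use rust_supervisor::config::{
    configurable::SupervisorConfig,
    loader::load_config_from_yaml_file,
};

// Inside a function whose return type can carry both error types:
// Raw configuration model, loaded through rust-config-tree.
let config = load_config::<SupervisorConfig>("supervisor.yaml")?;
// Validated ConfigState, loaded through the supervisor loader.
let state = load_config_from_yaml_file("supervisor.yaml")?;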

Runtime defaults vs template samples

| Scenario | children at runtime |
|---|---|
| children omitted from every config file | []; template sample worker is not injected |
| generate-template command | children.yaml may include a worker sample from #[config(default = ...)] |
| Body-only split file | Loads the array contents normally |

groups template defaults to [].

IDE completion

Bind the section schema in split files:

# yaml-language-server: $schema=./children.schema.json

- name: worker
  kind: async_worker

children.schema.json is a top-level array, not an { items: [...] } object.

Reusing the pattern in another crate

To add another transparent array section:

  1. Declare a dedicated struct per transparent array section with an items: Vec<T> field, and implement transparent-array Serialize, Deserialize, JsonSchema, and accessors (see ChildrenConfigSection and GroupsConfigSection).
  2. Mark the root field with #[schemars(extend("x-tree-split" = true, "x-tree-transparent-array" = true))].
  3. Implement ConfigSchema::include_paths.
  4. Use load_config, write_config_templates, and write_config_schemas.

See the x-tree-transparent-array section in the rust-config-tree documentation.

factory_key Configuration

1. Summary

factory_key is a task factory key in YAML configuration. Its value is the name shared by the configuration file and Rust code, such as api_server. It connects a declarative worker child to a TaskFactory registered in Rust code.

The configuration file stores only declarations. It does not store executable closures. The real task startup logic must be supplied by Rust code.

2. Problem

A Supervisor task tree can declare children through configuration files. However, async_worker and blocking_worker children need an executable TaskFactory when they actually start. A TaskFactory contains Rust code and usually a closure, so it cannot be safely stored directly in YAML.

factory_key defines the boundary. The configuration file writes an agreed key, and Rust code registers a task factory under the same key. Before startup, the system binds the declaration to the executable factory.

3. Configuration

children.yaml can declare workers like this:

- name: api
  kind: async_worker
  factory_key: api_server

- name: exporter
  kind: blocking_worker
  factory_key: report_exporter

api_server and report_exporter are not function names. They are configuration-level task factory keys. Rust code must register matching TaskFactory values.

4. Rust Registration

Rust code uses TaskFactoryRegistry to map keys to TaskFactory values.

use rust_supervisor::spec::child::TaskKind;
use rust_supervisor::task::factory::{TaskResult, service_fn};
use rust_supervisor::task::factory_registry::{
    TaskFactoryDescriptor, TaskFactoryRegistry,
};
use std::sync::Arc;

// Inside a function that returns a compatible Result:
let mut registry = TaskFactoryRegistry::new();

// Arguments: key, title, description, allowed kinds, factory.
registry.register(TaskFactoryDescriptor::new(
    "api_server",
    "API Server",
    "Runs the API service.",
    [TaskKind::AsyncWorker],
    Arc::new(service_fn(|_ctx| async { TaskResult::Succeeded })),
))?;

registry.register(TaskFactoryDescriptor::new(
    "report_exporter",
    "Report Exporter",
    "Runs blocking export work.",
    [TaskKind::BlockingWorker],
    Arc::new(service_fn(|_ctx| async { TaskResult::Succeeded })),
))?;

TaskFactoryDescriptor stores 3 kinds of data:

  • key: The task factory key used by configuration files.
  • title and description: Metadata shown by schema-backed editor completion.
  • allowed_kinds: The task kinds that may use this factory, such as TaskKind::AsyncWorker or TaskKind::BlockingWorker.

5. Startup Binding

After configuration loading, factory_key is still only a string. Before startup, the string must be resolved to a real TaskFactory.

The current binding path is:

  1. ConfigState reads child declarations from YAML.
  2. to_supervisor_spec_with_factories uses TaskFactoryRegistry to bind workers.
  3. bind_task_factories checks every worker’s factory_key.
  4. The registry resolves the matching TaskFactory and writes it into ChildSpec.
  5. Supervisor starts with already-bound executable task factories.

Binding rules:

  • Worker children must declare factory_key.
  • Supervisor child nodes must not declare factory_key.
  • An unknown factory_key causes a configuration error.
  • A factory that does not support the current TaskKind causes a configuration error.

6. Completion Generation

Editor completion depends on JSON Schema. The current implementation does not rewrite the rust-config-tree schema generator. Instead, it post-processes the base schema generated by rust-config-tree.

The flow is:

  1. generate-template or generate-schema asks rust-config-tree to generate the base schema.
  2. supervisor_schema_targets_with_factory_registry receives the root schema and split-section schemas.
  3. Each schema is parsed into serde_json::Value.
  4. inject_factory_key_completions_if_present finds the factory_key field.
  5. The system writes keys from TaskFactoryRegistry into oneOf.
  6. The schema is serialized again and written to the target file.

After generation, children.schema.json contains a factory_key field like this:

{
  "factory_key": {
    "description": "TaskFactory registry key used to bind worker children before startup.",
    "oneOf": [
      {
        "const": "api_server",
        "description": "Runs the API service.",
        "title": "API Server"
      },
      {
        "const": "report_exporter",
        "description": "Runs blocking export work.",
        "title": "Report Exporter"
      }
    ],
    "type": [
      "string",
      "null"
    ]
  }
}

When an editor reads the yaml-language-server schema directive at the top of children.yaml, it can offer factory_key candidates.

7. Commands

Generate templates:

target/debug/rust-tokio-supervisor generate-template

This command writes configuration templates and schemas with completion metadata.

Generate schemas only:

target/debug/rust-tokio-supervisor generate-schema

This command writes schemas only, and the generated schema also contains factory_key candidates in oneOf.

8. Current Boundaries

  • factory_key is a configuration declaration, not executable code.
  • Completion candidates come from the TaskFactoryRegistry used by the command.
  • If Rust code does not register a key, a configuration file using that key cannot start.
  • Schema-backed completion helps editors suggest valid candidates, but it does not replace startup binding validation.
  • Runtime child addition goes through the same kind of binding validation, so dynamic additions cannot bypass the registry.

Supervisor Tree

Declaration Model

SupervisorSpec describes one supervisor node. It contains:

  • path — stable path for this supervisor
  • strategy — restart scope strategy (OneForOne, OneForAll, RestForOne)
  • children — child specifications in declaration order
  • config_version — configuration version that produced this spec
  • default_restart_policy, default_backoff_policy, default_health_policy, default_shutdown_policy — policies inherited by children that do not override
  • supervisor_failure_limit — maximum supervisor failures before parent escalation
  • restart_limit — optional supervisor-level restart limit
  • escalation_policy — optional supervisor-level escalation policy
  • group_strategies — group-level strategy overrides
  • group_configs — group-level restart budget, membership, and isolation configs
  • group_dependencies — cross-group dependency edges for fault propagation
  • severity_defaults — default severity class per task role for escalation bifurcation
  • child_strategy_overrides — per-child strategy and governance overrides
  • dynamic_supervisor_policy — runtime add_child acceptance policy
  • control_channel_capacity — mpsc command channel capacity
  • event_channel_capacity — broadcast event channel capacity

ChildSpec describes one child. It contains:

  • id, name, kind — stable identity and task kind
  • factory — optional Arc<dyn TaskFactory> for worker children
  • restart_policy, shutdown_policy, health_policy, readiness_policy, backoff_policy — per-child policy overrides
  • dependencies — child IDs that must become ready before this child starts
  • tags — low-cardinality grouping labels
  • criticality — Critical or Optional
  • task_role — optional TaskRole that selects default lifecycle policy semantics
  • sidecar_config — optional sidecar binding (required when role is Sidecar)
  • severity — optional explicit severity override
  • group — optional group name for group-level isolation and budget tracking
  • health_check, readiness — optional health/readiness check configurations
  • resource_limits — optional resource limits
  • command_permissions — command permissions granted to this child
  • environment, secrets — environment variables and secret references

Tree Building

SupervisorTree::build validates SupervisorSpec and converts children into path-aware nodes. Each child path is derived from the parent path and ChildId.

SupervisorPath::root returns the root path. SupervisorPath::join appends a child path segment. SupervisorPath::parent returns the parent path when it exists.
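
The three path operations behave like a small segment list. The sketch below is a hypothetical model (PathSketch), not the crate's SupervisorPath. In particular, the "/"-separated display format is an assumption made for illustration.

```rust
use std::fmt;

/// Hypothetical model of a supervisor path as a list of segments.
#[derive(Debug, Clone, PartialEq, Eq)]
struct PathSketch {
    segments: Vec<String>,
}

impl PathSketch {
    /// The root path has no segments.
    fn root() -> Self {
        PathSketch { segments: Vec::new() }
    }

    /// Returns a new path with one more segment appended.
    fn join(&self, segment: &str) -> Self {
        let mut segments = self.segments.clone();
        segments.push(segment.to_string());
        PathSketch { segments }
    }

    /// Returns the parent path, or None for the root.
    fn parent(&self) -> Option<Self> {
        if self.segments.is_empty() {
            return None;
        }
        let mut segments = self.segments.clone();
        segments.pop();
        Some(PathSketch { segments })
    }
}

impl fmt::Display for PathSketch {
    // Assumption: paths render as "/" for root and "/a/b" otherwise.
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        if self.segments.is_empty() {
            return write!(f, "/");
        }
        for segment in &self.segments {
            write!(f, "/{}", segment)?;
        }
        Ok(())
    }
}

fn main() {
    let child = PathSketch::root().join("core").join("api");
    println!("{}", child);
    println!("{:?}", child.parent().map(|p| p.to_string()));
}
```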

Startup And Shutdown Order

startup_order returns nodes in declaration order. shutdown_order returns nodes in reverse declaration order. This ordering is the basis for Shutdown Without Orphaned Tasks.
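
As a small illustration of the two orderings, the sketch below uses free functions over child names. The real methods operate on tree nodes, so the signatures here are hypothetical.

```rust
/// Startup follows declaration order.
fn startup_order<'a>(declared: &[&'a str]) -> Vec<&'a str> {
    declared.to_vec()
}

/// Shutdown walks the same list backwards, so a child stops
/// before anything that was declared ahead of it.
fn shutdown_order<'a>(declared: &[&'a str]) -> Vec<&'a str> {
    declared.iter().rev().copied().collect()
}

fn main() {
    let declared = ["db", "cache", "api"];
    println!("start: {:?}", startup_order(&declared));
    println!("stop:  {:?}", shutdown_order(&declared));
}
```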

Restart Planning

restart_execution_plan resolves the runtime restart scope from the tree and SupervisorSpec. It keeps per-child overrides, group strategies, restart limits, escalation policies, and dynamic supervisor policy in one plan so the runtime control loop does not duplicate strategy selection logic.
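
The three strategy names follow the Erlang/OTP supervisor vocabulary. Assuming the crate uses the conventional meaning of those names, the restart scope for a failed child can be sketched as below. The function restart_scope is hypothetical and ignores group strategies, per-child overrides, and limits.

```rust
#[derive(Clone, Copy)]
enum Strategy {
    OneForOne,
    OneForAll,
    RestForOne,
}

/// Returns the children to restart when `failed` exits, in declaration order.
/// Assumes conventional OTP semantics for the three strategy names.
fn restart_scope<'a>(strategy: Strategy, declared: &[&'a str], failed: &str) -> Vec<&'a str> {
    let index = match declared.iter().position(|&id| id == failed) {
        Some(index) => index,
        // An unknown child restarts nothing.
        None => return Vec::new(),
    };
    match strategy {
        // Only the failed child.
        Strategy::OneForOne => vec![declared[index]],
        // Every child under the supervisor.
        Strategy::OneForAll => declared.to_vec(),
        // The failed child and everything declared after it.
        Strategy::RestForOne => declared[index..].to_vec(),
    }
}

fn main() {
    let declared = ["db", "cache", "api"];
    println!("{:?}", restart_scope(Strategy::RestForOne, &declared, "cache"));
}
```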

Registry

RegistryStore stores ChildRuntime values by child identifier, supervisor path, and declaration order. Runtime control and current state queries should go through the registry instead of bypassing it.

Task Model

Language: 中文

Task Kinds

TaskKind distinguishes AsyncWorker, BlockingWorker, and Supervisor. A blocking worker must not be treated as a normal asynchronous worker that can always be aborted immediately.

Task Factory

TaskFactory is the core construction contract. Every attempt must create a fresh future. service_fn is an ergonomic adapter that still targets TaskFactory; it does not replace the core model.

TaskResult distinguishes Succeeded, Cancelled, and Failed. The Failed variant carries TaskFailure and TaskFailureKind.

Task Context

TaskContext contains child identifier, supervisor path, generation, attempt, cancellation token, heartbeat sender, and readiness sender.

Workers should use TaskContext::heartbeat to report health, TaskContext::mark_ready to report explicit readiness, and TaskContext::is_cancelled or TaskContext::cancellation_token to react to shutdown.

Readiness

ReadinessPolicy supports Immediate and Explicit. An explicitly ready child should not appear as ready in current state or events until it reports readiness.
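
The difference between the two policies comes down to one condition. The sketch below is a hypothetical model: ChildState and its fields are illustrative, and the real runtime tracks readiness through TaskContext and events.

```rust
#[derive(Clone, Copy)]
enum ReadinessPolicy {
    Immediate,
    Explicit,
}

/// Hypothetical per-child view used only for this illustration.
struct ChildState {
    policy: ReadinessPolicy,
    started: bool,
    reported_ready: bool,
}

impl ChildState {
    /// Immediate children are ready once started; Explicit children
    /// must also have reported readiness.
    fn is_ready(&self) -> bool {
        self.started
            && match self.policy {
                ReadinessPolicy::Immediate => true,
                ReadinessPolicy::Explicit => self.reported_ready,
            }
    }
}

fn main() {
    let mut child = ChildState {
        policy: ReadinessPolicy::Explicit,
        started: true,
        reported_ready: false,
    };
    println!("before mark_ready: {}", child.is_ready());
    child.reported_ready = true;
    println!("after mark_ready:  {}", child.is_ready());
}
```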

ChildSpec and ChildDeclaration

How do ChildSpec and ChildDeclaration relate?

ChildDeclaration is the external declaration that arrives from configuration and RPC. ChildSpec is the internal specification the supervisor runtime uses to register, start, and restart children. The two share many fields but serve different roles. They are connected through TryFrom conversion, which also fills in defaults.

What each one is

| Aspect | ChildDeclaration | ChildSpec |
|---|---|---|
| Module | src/spec/child_declaration.rs | src/spec/child.rs |
| Role | Input model for YAML, add_child payloads, and similar sources | Runtime model in the registry and control loop |
| Typical source | Config file deserialization, dynamic child add requests | Converted from a declaration, or built directly in code |
| Can it run alone? | No. It has no factory and no fully materialized policy objects | Yes. The supervisor manages lifecycle from it |

ChildDeclaration focuses on a serializable, validatable declaration: names, dependency names, environment variables, secret placeholders, health_check / readiness config blocks, and rules such as validate_child_declaration (name format, ${SECRET} syntax, and so on).

Beyond declaration fields, ChildSpec also carries runtime essentials such as:

  • A resolved ChildId derived from name
  • factory: Option<Arc<dyn TaskFactory>>, the task factory that actually runs work (not part of serde)
  • Materialized HealthPolicy, ReadinessPolicy, ShutdownPolicy, and BackoffPolicy
  • Runtime fields such as isolation and cleanup_paths

How they connect

The data flow looks like this:

YAML / add_child RPC
        |
        v
  ChildDeclaration  ---- validate_child_declaration ----+
        |                                                  |
        | TryFrom<ChildDeclaration> for ChildSpec           |
        v                                                  |
     ChildSpec  --------------------------------------------+
        |
        v
  Register topology, start children, policy pipeline, restart / meltdown, etc.

The conversion lives in TryFrom<ChildDeclaration> for ChildSpec inside child_declaration.rs. It performs steps such as:

  • name -> ChildId::new(&decl.name)
  • dependency names in dependencies -> Vec<ChildId>
  • health_check -> HealthPolicy with default intervals
  • readiness present -> ReadinessPolicy::Explicit, otherwise Immediate
  • shutdown_policy / backoff_policy and similar fields receive defaults during conversion even when the declaration omits them
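
The conversion steps above can be sketched with two reduced types. This is an illustration only: `DeclSketch`, `SpecSketch`, `ReadinessSketch`, the `String` error type, and the 10-second default interval are all assumptions, not the crate's real fields or defaults.

```rust
use std::convert::TryFrom;
use std::time::Duration;

/// Hypothetical, heavily reduced declaration; the real `ChildDeclaration` has more fields.
pub struct DeclSketch {
    pub name: String,
    pub dependencies: Vec<String>,
    pub readiness: Option<String>,
    pub health_interval_secs: Option<u64>,
}

#[derive(Debug, PartialEq)]
pub enum ReadinessSketch {
    Immediate,
    Explicit,
}

/// Hypothetical, heavily reduced runtime spec.
#[derive(Debug)]
pub struct SpecSketch {
    pub id: String,
    pub dependencies: Vec<String>,
    pub readiness: ReadinessSketch,
    pub health_interval: Duration,
}

impl TryFrom<DeclSketch> for SpecSketch {
    type Error = String;

    fn try_from(decl: DeclSketch) -> Result<Self, Self::Error> {
        // Stand-in for declaration validation: reject an empty name.
        if decl.name.trim().is_empty() {
            return Err("child name must not be empty".to_string());
        }
        // A readiness block selects Explicit; its absence selects Immediate.
        let readiness = if decl.readiness.is_some() {
            ReadinessSketch::Explicit
        } else {
            ReadinessSketch::Immediate
        };
        Ok(SpecSketch {
            id: decl.name,
            dependencies: decl.dependencies,
            readiness,
            // Assumed default of 10 seconds when the declaration omits the interval.
            health_interval: Duration::from_secs(decl.health_interval_secs.unwrap_or(10)),
        })
    }
}
```

The point of the shape is that defaults are filled in exactly once, at the boundary, so runtime code never has to ask whether a field was omitted.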

When a child is added dynamically, PendingChild keeps both the declaration and the converted child_spec. Auditing also stores a SHA-256 of the declaration (declaration_hash) for reconciliation and compensation.

Shared types

Shared enums and config structs such as RestartPolicy, TaskKind, and HealthCheckConfig are defined in child.rs. ChildDeclaration reuses them to avoid parallel type trees. The top-level containers remain separate: declaration container vs specification container.

ChildSpec construction paths

The repository has 6 paths that construct ChildSpec. They serve different use cases and should not be collapsed into one entry point.

| Path | Typical entry | Use case | Validation behavior |
|---|---|---|---|
| Builder | ChildSpecBuilder::worker, service, job, sidecar, supervisor, new | Direct runtime spec construction in Rust code | build() calls ChildSpec::validate() |
| Worker convenience function | ChildSpec::worker(...) | Worker default bundle only | Delegates to ChildSpecBuilder::worker(...).build() |
| Declaration conversion | TryFrom<ChildDeclaration> for ChildSpec | YAML config, RPC payloads, dynamic child adds | validate_child_declaration runs before conversion, and supervisor-level validation catches final issues |
| Role template | ServiceTemplate::child_spec, JobTemplate::child_spec, and related role templates | Caller already implemented role traits but does not want to hand-build adapters and specs | Calls the matching ChildSpecBuilder internally |
| Macro-generated helper | child_spec() generated by #[service], #[worker], #[job], #[sidecar], and #[supervisor_role] | Default role contract entry path | Generated code calls the matching ChildSpecBuilder |
| Serde | serde_json::from_value::<ChildSpec>(...) | Mainly tests for deserialization defaults and invalid enum handling | Does not pass through the builder, so callers must validate before runtime use or rely on later spec validation |

Important boundaries:

  • ChildSpecBuilder::build() is the main exit for Rust code construction paths.
  • Configuration and RPC should not accept ChildSpec directly. They should accept ChildDeclaration first, then convert it into ChildSpec.
  • Role templates and macros are not new runtime models. They turn role lifecycle objects into adapters, then call ChildSpecBuilder to produce specs.
  • Serde can construct ChildSpec because ChildSpec derives Deserialize. That path does not automatically call ChildSpecBuilder::build().

Adjacent paths that do not construct a ChildSpec:

| Entry | Why it is not a ChildSpec construction path |
|---|---|
| SupervisorSpec::root(Vec<ChildSpec>) | It accepts already constructed child specs and builds a supervisor spec |
| SupervisorSpecBuilder::root(Vec<ChildSpec>) | It wraps supervisor spec construction and does not create an individual child spec |
| ConfigState::to_supervisor_spec() | It assembles a supervisor spec from the Vec<ChildSpec> already stored in ConfigState |
| bind_child_factory(...) | It binds a task factory to an existing ChildSpec and does not create a new one |
| clone() | It copies an existing ChildSpec instead of generating one from an input model |

How to remember them

  • Writing config, handling API input, validating declarations -> think ChildDeclaration
  • Seeing how the supervisor manages a child or what the policy engine reads -> think ChildSpec
  • Asking whether YAML and runtime use the same thing -> same underlying information, different lifecycle stage: the declaration is the input, the spec is the resolved runtime form

In-code construction

Configuration and RPC should still use ChildDeclaration. When you construct a runtime spec directly in Rust, prefer ChildSpecBuilder:

use rust_supervisor::error::types::SupervisorError;
use rust_supervisor::id::types::ChildId;
use rust_supervisor::policy::task_role_defaults::TaskRole;
use rust_supervisor::spec::child::{ChildSpec, TaskKind};
use rust_supervisor::spec::child_builder::ChildSpecBuilder;
use rust_supervisor::task::factory::{TaskResult, service_fn};
use std::sync::Arc;

// Returning Result lets the caller receive the validation error from build().
fn build_invoice_worker() -> Result<ChildSpec, SupervisorError> {
    let factory = service_fn(|_ctx| async { TaskResult::Succeeded });
    ChildSpecBuilder::worker(
        ChildId::new("worker"),
        "worker",
        TaskKind::AsyncWorker,
        Arc::new(factory),
    )
    .task_role(TaskRole::Worker)
    .tag("invoice")
    .build()
}

Entry methods:

| Method | Purpose |
|---|---|
| ChildSpecBuilder::worker(...) | Async or blocking worker; defaults match ChildSpec::worker |
| ChildSpecBuilder::service(...) | Long-running service; sets TaskRole::Service |
| ChildSpecBuilder::job(...) | Finite job; sets TaskRole::Job |
| ChildSpecBuilder::sidecar(...) | Sidecar; sets sidecar binding and the primary child dependency |
| ChildSpecBuilder::supervisor(...) | Nested supervisor; no factory |
| ChildSpecBuilder::new(...) | Minimal skeleton; caller must set kind and, for workers, factory |

Build exit:

| Method | Behavior |
|---|---|
| build() | Calls ChildSpec::validate() after construction; returns SupervisorError on failure |

ChildSpec::worker(...) remains available. It delegates to ChildSpecBuilder::worker(...).build() and also returns Result<ChildSpec, SupervisorError>.

For builder details, see child-spec-builder.md. For the field-by-field mapping and the defaults applied through TryFrom, inspect child_declaration.rs directly.

ChildSpecBuilder

One-sentence summary

ChildSpecBuilder is the fluent API for constructing ChildSpec values in Rust code. Configuration and RPC should still use ChildDeclaration. The build exit is build() -> Result<ChildSpec, SupervisorError>, which calls ChildSpec::validate() internally.

Relationship to child-spec.md: that page explains how declarations and specs divide responsibility. This page focuses on builder entry points, setters, and common usage patterns.

Module path

use rust_supervisor::spec::child_builder::ChildSpecBuilder;

The module is defined in src/spec/child_builder.rs. Per project module-boundary rules, there is no pub use re-export.

When to use the builder

| Scenario | Recommended approach |
|---|---|
| YAML config, add_child RPC payloads | ChildDeclaration + TryFrom |
| Tests, examples, hand-built runtime specs in code | ChildSpecBuilder |
| Worker default bundle only, no fluent chain | ChildSpec::worker(...)? (delegates to the builder internally) |

Legacy code may still mutate fields after construction. New code should prefer the builder.

Entry methods

| Method | Purpose | Default highlights |
|---|---|---|
| worker(id, name, kind, factory) | Async or blocking worker | Matches ChildSpec::worker: Transient restart, Critical criticality, TaskRole::Worker, and so on |
| service(id, name, kind, factory) | Long-running service | Based on worker defaults: TaskRole::Service, Critical criticality |
| job(id, name, kind, factory) | Finite job | Based on worker defaults: TaskRole::Job, Optional criticality |
| sidecar(id, name, kind, factory, sidecar_config) | Sidecar attached to a primary child | Based on worker defaults: TaskRole::Sidecar, writes sidecar_config, and automatically adds the primary child dependency |
| supervisor(id, name) | Nested supervisor | kind = Supervisor, factory = None, task_role = Supervisor, criticality = Critical |
| new(id, name) | Minimal skeleton | Sets only id / name plus baseline policies; caller must add kind and, for workers, factory |

Build exit

| Method | Behavior |
|---|---|
| build() | Takes the inner ChildSpec, calls validate(), returns Ok(spec) or SupervisorError |

All entry methods and setters return ChildSpecBuilder, which means construction is still in progress. Only build() consumes the builder and returns the final ChildSpec.

There is no build_validated(). Validation is always performed inside build().

ChildSpec::worker(...) also returns Result<ChildSpec, SupervisorError> via ChildSpecBuilder::worker(...).build().

Basic usage

use rust_supervisor::error::types::SupervisorError;
use rust_supervisor::id::types::ChildId;
use rust_supervisor::policy::task_role_defaults::TaskRole;
use rust_supervisor::spec::child::{ChildSpec, TaskKind};
use rust_supervisor::spec::child_builder::ChildSpecBuilder;
use rust_supervisor::task::factory::{TaskResult, service_fn};
use std::sync::Arc;

// build() validates the spec, so this function hands any validation error to its caller.
fn build_worker() -> Result<ChildSpec, SupervisorError> {
    // service_fn adapts the closure into a TaskFactory.
    let factory = Arc::new(service_fn(|_ctx| async { TaskResult::Succeeded }));
    ChildSpecBuilder::worker(
        ChildId::new("invoice-worker"),
        "Invoice Worker",
        TaskKind::AsyncWorker,
        factory,
    )
    .task_role(TaskRole::Worker)
    .tag("invoice")
    .build()
}

Propagate errors with ?, or use build().expect("...") in tests.

Fluent setter coverage

Each setter consumes self and returns Self. You can chain them in any order that remains semantically valid.

Policy fields: isolation, restart_policy, shutdown_policy, health_policy, readiness_policy, backoff_policy

Topology and classification: dependencies, dependency, tags, tag, criticality, task_role, without_task_role, sidecar_config, without_sidecar_config, severity, without_severity, group, without_group

Config blocks: health_check, without_health_check, readiness, without_readiness, resource_limits, without_resource_limits, command_permissions, environment, env_var, secrets, secret, cleanup_paths, cleanup_path

Runtime: kind, factory, without_factory (for new() or supervisor paths)

Naming convention: plural fields use dependencies(...), tags(...); singular helpers use dependency(...), tag(...). The same pattern applies to environment / env_var, secrets / secret, and cleanup_paths / cleanup_path.

Common combinations

Service

Long-running services should prefer service(...); callers do not need to set TaskRole::Service by hand:

// Assumes `id` and `factory` are in scope and the enclosing function returns Result.
ChildSpecBuilder::service(id, "API Service", TaskKind::AsyncWorker, factory)
    .tag("service")
    .build()?;

Job

Finite work should prefer job(...). You can still override restart_policy for one-shot behavior:

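use rust_supervisor::spec::child::RestartPolicy;

// Assumes `id` and `factory` are in scope and the enclosing function returns Result.
ChildSpecBuilder::job(id, "Nightly Export", TaskKind::AsyncWorker, factory)
    .restart_policy(RestartPolicy::Temporary)
    .build()?;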

Sidecar

Sidecars attached to a primary child should prefer sidecar(...). This entry writes sidecar_config and automatically adds the primary child dependency:

use rust_supervisor::policy::task_role_defaults::SidecarConfig;

// Assumes `id`, `primary_id`, and `factory` are in scope and the enclosing function returns Result.
ChildSpecBuilder::sidecar(
    id,
    "Metrics Sidecar",
    TaskKind::AsyncWorker,
    factory,
    SidecarConfig::new(primary_id.clone(), false),
)
.build()?;

If you still configure task_role = Sidecar manually with setters, you must also set sidecar_config, or build() validation fails.

Worker from new()

// Assumes `factory` is in scope and the enclosing function returns Result.
ChildSpecBuilder::new(ChildId::new("custom"), "custom")
    .kind(TaskKind::AsyncWorker)
    .factory(factory)
    .build()?;

Data flow (short)

ChildSpecBuilder::worker / service / job / sidecar / supervisor / new
        |
        v
   fluent setters (policy, role, deps, env, ...)
        |
        v
   build()  -->  ChildSpec::validate()
        |
        +-- Ok(ChildSpec)  -->  Supervisor::start / register topology
        +-- Err(SupervisorError)

Example program

Runnable demo:

cargo run --example child_spec_builder

Source: examples/child_spec_builder.rs. Covers worker, service, job, sidecar, supervisor, the new() path, and an intentionally invalid sidecar combination.

Tests and regression

External tests: src/spec/tests/child_builder_test.rs

| Test | What it verifies |
|---|---|
| worker_builder_matches_child_spec_worker_defaults | Builder output matches ChildSpec::worker field-for-field |
| supervisor_builder_produces_valid_supervisor_child | Supervisor entry has no factory and validates |
| service_builder_sets_service_role | Service entry sets TaskRole::Service and Critical criticality |
| job_builder_sets_job_role_and_optional_criticality | Job entry sets TaskRole::Job and Optional criticality |
| sidecar_builder_sets_sidecar_role_binding_and_dependency | Sidecar entry sets the binding and automatically adds the primary child dependency |
| builder_setters_apply_expected_fields | Sidecar, dependency, tag, and related setters |
| build_rejects_invalid_sidecar_combination | Missing sidecar_config makes build() fail |
| new_builder_can_build_valid_worker_with_factory | new() path works after required fields are set |

Run:

cargo test --test child_builder_test

Known boundaries

  • Default policy bundles for TryFrom<ChildDeclaration> are not fully shared with the builder yet. The two paths may evolve independently; review both when changing defaults.
  • The builder does not handle serde. Dynamic child adds still flow through ChildDeclaration.
  • Legacy examples and tests were not bulk-migrated from ChildSpec::worker. Both styles are runtime-equivalent when callers handle Result.

Further reading

  • child-spec.md — how ChildDeclaration and ChildSpec relate, plus a short builder introduction
  • docs/architecture.md — module boundaries and the no re-export rule

Policies

Supervision Strategy

SupervisionStrategy decides the restart scope after a failure. OneForOne selects only the failed child. OneForAll selects every child in the selected scope. RestForOne selects the failed child and every child declared after it in the selected scope.

restart_scope calculates the restart scope from SupervisorTree, the strategy, and the failed child identifier.
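
The three strategies differ only in which slice of the declaration order they select. The sketch below models that with child names in a flat list; `StrategySketch` and `restart_scope_sketch` are hypothetical names, and returning an empty scope for an unknown child is an assumption of this sketch rather than documented behavior.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrategySketch {
    OneForOne,
    OneForAll,
    RestForOne,
}

/// Selects the children to restart. `declared` lists child names in
/// declaration order; an unknown `failed` name yields an empty scope.
pub fn restart_scope_sketch<'a>(
    declared: &[&'a str],
    strategy: StrategySketch,
    failed: &str,
) -> Vec<&'a str> {
    let index = match declared.iter().position(|name| *name == failed) {
        Some(index) => index,
        None => return Vec::new(),
    };
    match strategy {
        // Only the failed child.
        StrategySketch::OneForOne => vec![declared[index]],
        // Every child in the scope.
        StrategySketch::OneForAll => declared.to_vec(),
        // The failed child and everything declared after it.
        StrategySketch::RestForOne => declared[index..].to_vec(),
    }
}
```

For children a, b, c with b failing, the scopes are b alone, all three, and b plus c respectively.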

restart_execution_plan combines the supervisor strategy, GroupStrategy, ChildStrategyOverride, RestartLimit, EscalationPolicy, and DynamicSupervisorPolicy into a StrategyExecutionPlan. Child overrides take precedence over group strategies, and group strategies take precedence over the supervisor-wide strategy. The plan stores a dynamic_supervisor_enabled boolean; the full DynamicSupervisorPolicy struct (including child_limit) is evaluated by the control loop at add_child time, not embedded in the plan.
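
The precedence rule (child override, then group strategy, then supervisor-wide strategy) reduces to a chain of fallbacks. The sketch below shows only that rule; `StrategyKind` and `effective_strategy` are hypothetical names, and the real plan carries more than the strategy.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrategyKind {
    OneForOne,
    OneForAll,
    RestForOne,
}

/// Picks the strategy that applies to one child.
/// A child override wins over a group strategy, which wins over the supervisor default.
pub fn effective_strategy(
    child_override: Option<StrategyKind>,
    group_strategy: Option<StrategyKind>,
    supervisor_strategy: StrategyKind,
) -> StrategyKind {
    child_override
        .or(group_strategy)
        .unwrap_or(supervisor_strategy)
}
```

Resolving the choice in one place is what keeps the control loop from repeating strategy selection logic.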

The runtime control loop receives child exits and applies the selected StrategyExecutionPlan automatically when policy returns a restart decision. Runtime lifecycle events include restart scope information so operators can see the selected strategy, group, and child scope.

Group Strategy And Overrides

GroupStrategy uses child tags to define a smaller restart scope. A child can belong to at most one configured strategy group. ChildStrategyOverride applies a per-child strategy and governance override when one child needs stricter restart behavior than its group or supervisor.

GroupConfig configures restart budget, membership, and isolation at the group level. GroupDependencyEdge defines cross-group dependency edges for fault propagation.

Restart Limit And Escalation

RestartLimit records the maximum restart count and the counting window selected for a plan. EscalationPolicy records the follow-up action when restart governance cannot remain local, including parent escalation, tree shutdown, or scope quarantine.

Dynamic Supervisor Policy

DynamicSupervisorPolicy controls runtime add_child acceptance. It is a stateless configuration object with enabled and child_limit fields. The control loop maintains the current child count externally and calls allows_addition(current_child_count) at add_child time. Additions are rejected when dynamic supervision is disabled or the configured child limit has already been reached.
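
Because the policy is stateless, the admission check is a pure function of the configuration and the count the caller passes in. A minimal sketch, assuming child_limit is a plain usize (the real field type may differ); `DynamicPolicySketch` is a hypothetical name.

```rust
/// Hypothetical stand-in for the stateless policy object.
#[derive(Debug, Clone, Copy)]
pub struct DynamicPolicySketch {
    pub enabled: bool,
    /// Assumed to be a plain count; the crate may model "no limit" differently.
    pub child_limit: usize,
}

impl DynamicPolicySketch {
    /// The caller owns the current count; the policy only compares it with the limit.
    pub fn allows_addition(&self, current_child_count: usize) -> bool {
        self.enabled && current_child_count < self.child_limit
    }
}
```

Keeping the count outside the policy means the same policy value can be cloned into plans and events without risk of stale state.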

Restart Policy

RestartPolicy contains Permanent, Transient, and Temporary. PolicyEngine reads TaskExit, the failure category, and the restart policy, then returns RestartDecision.
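
The manual does not spell out what each policy does per exit kind, so the sketch below assumes the conventional Erlang/OTP meaning: Permanent always restarts, Transient restarts only after a failure, and Temporary never restarts. Treating cancellation as "do not restart" under every policy is also an assumption. All type names are hypothetical, and the real PolicyEngine additionally reads the failure category.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestartPolicySketch {
    Permanent,
    Transient,
    Temporary,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExitSketch {
    Succeeded,
    Cancelled,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DecisionSketch {
    Restart,
    DoNotRestart,
}

/// Maps (policy, exit) to a decision under the assumed OTP-style semantics.
pub fn decide(policy: RestartPolicySketch, exit: ExitSketch) -> DecisionSketch {
    match (policy, exit) {
        // Assumption: a cancelled attempt was stopped on purpose, so it is never restarted.
        (_, ExitSketch::Cancelled) => DecisionSketch::DoNotRestart,
        (RestartPolicySketch::Permanent, _) => DecisionSketch::Restart,
        (RestartPolicySketch::Transient, ExitSketch::Failed) => DecisionSketch::Restart,
        (RestartPolicySketch::Transient, ExitSketch::Succeeded) => DecisionSketch::DoNotRestart,
        (RestartPolicySketch::Temporary, _) => DecisionSketch::DoNotRestart,
    }
}
```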

Backoff And Jitter

BackoffPolicy describes initial delay, maximum delay, and jitter ratio. There are two types with this name:

  • spec::child::BackoffPolicy — used in ChildSpec, fields are initial_delay, max_delay, jitter_ratio (a 0.0-1.0 float ratio).
  • policy::backoff::BackoffPolicy — used by the runtime policy engine, fields are initial, max, jitter_mode (an enum: Disabled, Deterministic, FullJitter, DecorrelatedJitter), jitter_percent, and reset_after.

Tests can use JitterMode::Deterministic so coverage does not depend on random output.
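
To make the three spec-level fields concrete, here is a sketch of one common way to combine them: double the delay per attempt, cap it at the maximum, then spread it by the jitter ratio. The doubling rule, the symmetric spread, and the caller-supplied `unit_sample` are assumptions of this sketch, not the crate's formula; `BackoffSketch` is a hypothetical type.

```rust
use std::time::Duration;

/// Hypothetical reduction of a spec-level backoff policy.
pub struct BackoffSketch {
    pub initial_delay: Duration,
    pub max_delay: Duration,
    /// Fraction of the delay that jitter may add or remove, in 0.0..=1.0.
    pub jitter_ratio: f64,
}

impl BackoffSketch {
    /// Computes the delay before restart number `attempt` (starting at 0).
    /// `unit_sample` is a value in 0.0..=1.0; production code would draw it
    /// from a random source, while tests can pass a fixed value.
    /// Assumes `max_delay` is far below `Duration::MAX`.
    pub fn delay(&self, attempt: u32, unit_sample: f64) -> Duration {
        // Double per attempt, saturating instead of overflowing.
        let factor = 2u32.saturating_pow(attempt);
        let base = self.initial_delay.saturating_mul(factor).min(self.max_delay);
        // Map the sample to an offset in -ratio..=+ratio around the base delay.
        let ratio = self.jitter_ratio.clamp(0.0, 1.0);
        let offset = (unit_sample.clamp(0.0, 1.0) * 2.0 - 1.0) * ratio;
        base.mul_f64(1.0 + offset).min(self.max_delay)
    }
}
```

Passing the sample in as a parameter is one way to get the property the text asks for: a test that fixes the sample gets the same delay on every run.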

Fuse And Quarantine

MeltdownPolicy limits restarts or failures inside configured windows at three levels: child, group, and supervisor. Crossing a child-level fuse places the child in quarantine. Crossing a group-level fuse escalates to the supervisor level. Crossing a supervisor-level fuse escalates the failure to the parent.
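
A fuse of this kind is usually a counter over a sliding time window. The sketch below shows one such counter plus the level-to-action mapping described above. `FuseSketch`, `FuseLevelSketch`, `FuseActionSketch`, and the "trips when the count exceeds the maximum" rule are assumptions for illustration; the crate's MeltdownPolicy may count differently.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FuseLevelSketch {
    Child,
    Group,
    Supervisor,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FuseActionSketch {
    QuarantineChild,
    EscalateToSupervisor,
    EscalateToParent,
}

/// Maps the level whose fuse tripped to the follow-up action from the text.
pub fn action_for(level: FuseLevelSketch) -> FuseActionSketch {
    match level {
        FuseLevelSketch::Child => FuseActionSketch::QuarantineChild,
        FuseLevelSketch::Group => FuseActionSketch::EscalateToSupervisor,
        FuseLevelSketch::Supervisor => FuseActionSketch::EscalateToParent,
    }
}

/// Sliding-window failure counter (hypothetical).
pub struct FuseSketch {
    window: Duration,
    max_failures: usize,
    failures: VecDeque<Instant>,
}

impl FuseSketch {
    pub fn new(window: Duration, max_failures: usize) -> Self {
        Self { window, max_failures, failures: VecDeque::new() }
    }

    /// Records a failure at `now` and returns true when the fuse trips,
    /// meaning more than `max_failures` failures fall inside the window.
    pub fn record_failure(&mut self, now: Instant) -> bool {
        // Drop failures that have aged out of the window.
        while let Some(&oldest) = self.failures.front() {
            if now.saturating_duration_since(oldest) > self.window {
                self.failures.pop_front();
            } else {
                break;
            }
        }
        self.failures.push_back(now);
        self.failures.len() > self.max_failures
    }
}
```

Taking `now` as a parameter keeps the counter testable without sleeping.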

Task Exit Classification

TaskExit distinguishes success, cancellation, typed failure, panic, and timeout. The policy layer reads typed classifications from TaskFailureKind (which includes Panic, Timeout, and typed failure categories) instead of inferring behavior from strings.

Runtime Control

Control Entry Point

SupervisorHandle is the runtime control entry point. It sends requests to the runtime control loop through a command channel and returns CommandResult.

Control Commands

These are ControlCommand enum variants sent through the command channel:

  • add_child — accept a dynamic child manifest when DynamicSupervisorPolicy allows another child.
  • remove_child — mark the target child runtime state record as Removed, deliver cancellation to the active attempt, and remove the runtime state record after the attempt exits.
  • restart_child — request a restart for the target child.
  • pause_child — mark the target child runtime state record as Paused, deliver cancellation to the active attempt, and pause automatic restarts.
  • resume_child — resume governance for the target child.
  • quarantine_child — mark the target child runtime state record as Quarantined, deliver cancellation to the active attempt, and block automatic restarts.
  • shutdown_tree — shut down the whole supervisor tree.
  • current_state — return the current SupervisorState and expose each child runtime fact through CurrentState.child_runtime_records.

Handle Methods

These are methods on SupervisorHandle that do not go through ControlCommand:

  • subscribe_events — subscribe to lifecycle events via a broadcast::Receiver.
  • is_alive — quickly check whether the runtime control loop can still accept ordinary control commands.
  • health — return RuntimeHealthReport, including alive, control-plane state, started_at_unix_nanos, last_observed_at_unix_nanos, failure, and exit_report.
  • join — wait until the runtime control plane reaches a final state and repeatedly return the same RuntimeExitReport.
  • shutdown — shut down only the runtime control plane. It does not replace shutdown_tree.

Child Runtime State Control

PauseChild, RemoveChild, and QuarantineChild are stop-style control commands defined by this feature. All 3 commands return CommandResult::ChildControl, and the result contains ChildControlResult. The old CommandResult::ChildState shape is no longer part of the public result model.

PauseChild writes ChildRuntimeState.operation as Paused. If an active attempt exists, the runtime control loop delivers cancellation to that attempt and moves stop progress to CancelDelivered. While the child is paused, the supervision strategy does not automatically restart that child.

RemoveChild writes ChildRuntimeState.operation as Removed. If an active attempt exists, the runtime control loop first delivers cancellation and then physically removes the record from child_runtime_states after the attempt exits. If no active attempt exists, the runtime control loop returns a NoActiveAttempt result and then removes the runtime state record.

QuarantineChild writes ChildRuntimeState.operation as Quarantined. If an active attempt exists, the runtime control loop delivers cancellation. The quarantined runtime state record remains visible, but the supervision strategy no longer automatically restarts that child. An operator can still run RemoveChild later.

These 3 stop-style control commands do not synchronously wait for the child future to end. If a child ignores cancellation for too long, a later CurrentState call or repeated stop-style command triggers reconcile_stop_deadlines and exposes the stop failure through ChildControlFailure.

CurrentState returns child_runtime_records. Each ChildRuntimeRecord is ordered by declaration order. Construction performs only non-blocking reads, does not wait for a child future, and does not perform extra I/O. This collection is the main entry point for reading runtime state facts.

RestartChild and ResumeChild remain existing commands. This feature only requires them not to corrupt runtime state facts. It does not define new lifecycle semantics for them.

See the full contract in child-runtime-state-control.md.

ChildControlResult Fields

  • child_id: stable identifier of the controlled child.
  • attempt: active attempt targeted by the command. It is None when no active attempt exists.
  • generation: generation targeted by the command. It is None when no active attempt exists.
  • operation_before: ChildControlOperation observed when the command arrived.
  • operation_after: ChildControlOperation after command handling.
  • status: current ChildAttemptStatus for the attempt. It is None when no active attempt exists.
  • cancel_delivered: whether this command actually delivered cancellation.
  • stop_state: ChildStopState after command handling.
  • restart_limit: current RestartLimitState, including window, limit, used count, remaining count, and exhaustion flag.
  • liveness: current ChildLivenessState, including last heartbeat time, heartbeat stale flag, and readiness.
  • idempotent: whether this command reused an already existing target state.
  • failure: current control failure. It is None when no failure exists.
  • generation_fence: optional GenerationFenceOutcome used by restart control commands.
  • admission_conflict: optional AdmissionConflict detail when a concurrent request is rejected.

ChildRuntimeRecord Fields

  • child_id: stable identifier of the child represented by this runtime state record.
  • path: child path in the supervisor tree.
  • generation: current active generation. It is None when no active attempt exists.
  • attempt: current active attempt. It is None when no active attempt exists.
  • status: current ChildAttemptStatus for the attempt.
  • operation: current ChildControlOperation, which can be Active, Paused, Quarantined, or Removed.
  • liveness: current ChildLivenessState.
  • restart_limit: current RestartLimitState.
  • stop_state: current ChildStopState.
  • failure: most recent ChildControlFailure. When stop_state is Failed, this must be Some.
  • generation_fence_phase: current GenerationFencePhase for dashboard projection.
  • pending_restart: optional PendingRestartSummary for queued restarts behind a generation fence.

Idempotent Behavior

Repeated control commands should not create unrecoverable errors. Pausing an already paused child returns the current state. Quarantining an already quarantined child returns the current state. Shutting down an already completed tree returns the existing shutdown result.
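
The idempotent pause case can be sketched as a state transition that reports whether it changed anything. The types below are hypothetical and reduced to two operations; they borrow field names from ChildControlResult (operation_before, operation_after, cancel_delivered, idempotent) but do not reproduce the real struct.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OperationSketch {
    Active,
    Paused,
}

#[derive(Debug, PartialEq, Eq)]
pub struct ControlResultSketch {
    pub operation_before: OperationSketch,
    pub operation_after: OperationSketch,
    pub cancel_delivered: bool,
    pub idempotent: bool,
}

/// Hypothetical per-child runtime state.
pub struct ChildStateSketch {
    pub operation: OperationSketch,
    pub has_active_attempt: bool,
}

impl ChildStateSketch {
    pub fn pause(&mut self) -> ControlResultSketch {
        let before = self.operation;
        if before == OperationSketch::Paused {
            // Repeated pause: report the current state and change nothing.
            return ControlResultSketch {
                operation_before: before,
                operation_after: before,
                cancel_delivered: false,
                idempotent: true,
            };
        }
        self.operation = OperationSketch::Paused;
        ControlResultSketch {
            operation_before: before,
            operation_after: OperationSketch::Paused,
            // Cancellation is delivered only when an attempt is running.
            cancel_delivered: self.has_active_attempt,
            idempotent: false,
        }
    }
}
```

The second call is a success, not an error, which is what lets operators and automation retry a command safely.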

join caches the final RuntimeExitReport from the control loop. Repeated calls to join on the same handle return the same result every time and do not consume the underlying exit receiver again.

shutdown only asks the runtime control loop to exit normally. If the control plane has already completed or failed, another shutdown call directly returns the existing final report. shutdown_tree remains responsible for child task and full supervisor tree shutdown semantics.

Runtime Health

is_alive is a low-cost state check. It returns true when the control plane is alive. It returns false when the control plane is starting, shutting down, completed, or failed.

health returns structured state. After an abnormal control-plane exit, health can still read: alive, state, started_at_unix_nanos, last_observed_at_unix_nanos, failure (with phase, reason, panic flag, recoverable flag), and exit_report. Ordinary control commands after the control plane has ended return SupervisorError with the same exit reason.

Dynamic Additions

Dynamic additions are governed before the manifest is accepted. The runtime rejects add_child when dynamic supervision is disabled or when the declared child count plus dynamic child count has reached the configured limit. current_state.child_count includes accepted dynamic manifests.

Audit Data

Each control command carries CommandMeta with command_id, requested_by, and reason. These fields must be non-empty text. SupervisorHandle rejects empty values before the command enters the channel, and the runtime control loop validates them again before execution. This preserves traceable audit sources for manual operations, dashboard IPC forwarding, and internal control calls.
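
The non-empty rule is easy to check before a command enters the channel. A minimal sketch: `CommandMetaSketch` mirrors the three documented field names, while the `validate` method, its `&'static str` error, and the choice to treat whitespace-only text as empty are assumptions.

```rust
/// Hypothetical stand-in that mirrors the three documented metadata fields.
pub struct CommandMetaSketch {
    pub command_id: String,
    pub requested_by: String,
    pub reason: String,
}

impl CommandMetaSketch {
    /// Returns `Err` with the name of the first empty field.
    /// Assumption: whitespace-only text counts as empty.
    pub fn validate(&self) -> Result<(), &'static str> {
        let fields = [
            ("command_id", &self.command_id),
            ("requested_by", &self.requested_by),
            ("reason", &self.reason),
        ];
        for &(name, value) in fields.iter() {
            if value.trim().is_empty() {
                return Err(name);
            }
        }
        Ok(())
    }
}
```

Running the same check on both sides of the channel, as the text describes, means a caller that bypasses the handle still cannot produce an untraceable command.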

The event payload CommandAudit additionally records target_path and accepted_at_unix_nanos for audit events and incident review.

Dashboard Three-End Workflow

The dashboard feature is delivered by three repositories. rust-supervisor owns only target-process local IPC and shared contracts. ~/rust-supervisor-relay owns the relay and external wss:// sessions. ~/rust-supervisor-ui owns the browser dashboard client.

The screenshot below shows the dashboard client view for target lists, topology, state, and runtime streams.

[Figure: rust-supervisor dashboard screenshot]

Three-End Responsibilities

  • rust-supervisor: The target process reads SupervisorConfig, opens a Unix domain socket when dashboard.enabled=true, and produces state, event records, log records, command results, and registration heartbeats.
  • rust-supervisor-relay: The relay listens on the registration socket, stores the target registry, exposes external wss:// dashboard sessions, validates mTLS and allowed IPC path prefixes, and forwards session commands to the target process.
  • rust-supervisor-ui: The dashboard client connects to the relay through wss:// and displays the target list, topology, state, event stream, log tail, and command audit.

Generation fencing projections

IPC-visible child control envelopes include nullable generation_fence payloads on DashboardChildControlResult. DashboardChildRuntimeRecord mirrors pending_restart summaries whenever the runtime pins an older attempt while a restart waits behind the fence.

Local Demo Flow

  1. Start the relay first. It must listen on the registration socket before the target process can register itself.
     cd ~/rust-supervisor-relay
     cargo run -- --config examples/config/dashboard-relay.local.yaml
  2. Start the target process next. It opens the local IPC socket and sends registration heartbeats to the relay.
     cd ~/rust-supervisor
     cargo run --example demo -- --config examples/config/supervisor.local.yaml
  3. Start the dashboard client last. Browser code connects only to the relay and never reads the target-process local IPC socket directly.
     cd ~/rust-supervisor-ui
     VITE_SUPERVISOR_RELAY_URL=wss://localhost:9443/supervisor npm run dev

Runtime Order

After receiving a registration heartbeat, the relay only stores the target process in the target registry. Registration does not trigger proactive event or log push. After the dashboard client establishes an authenticated dashboard session and selects a target, the relay connects to the target-process IPC socket, reads state, and subscribes to events.subscribe or logs.tail only when the session requests those streams.

Control commands must start from the dashboard client, pass relay session validation, and then reach the target process. Each command must carry operator identity, target identity, and reason. Dangerous commands must also be confirmed in the client.

Verification Commands

cd ~/rust-supervisor
cargo test --test dashboard_config_test --test dashboard_protocol_shape_test --test dashboard_state_test --test dashboard_stream_test --test dashboard_performance_test

cargo test --manifest-path ~/rust-supervisor-relay/Cargo.toml
npm --prefix ~/rust-supervisor-ui run test
npm --prefix ~/rust-supervisor-ui run build
npm --prefix ~/rust-supervisor-ui run test:e2e:three-end

Production Notes

The target process may expose only a local Unix domain socket and must not expose IPC directly to the network. The relay must use wss:// for external access. The browser or operating-system certificate store selects the mTLS client certificate, and page scripts must not read the certificate private key. dashboard.path, registration.relay_registration_path, and the relay allowed IPC path prefix must match. Otherwise the target fails to register or the relay rejects the connection.

Shutdown

Formal Term

This project uses Shutdown Without Orphaned Tasks to describe the shutdown goal. After root shutdown completes, the runtime should leave no orphaned task.

Four Stages

The shutdown protocol has four stages:

  • Request stop: accept the shutdown cause and propagate the cancellation token.
  • Graceful drain: wait for each child to finish on its own.
  • Abort stragglers: force or escalate asynchronous tasks that exceed their timeout.
  • Reconcile: align registry state, current state, metrics, and the event journal.
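
The four stages run strictly in sequence, and the boundary between the second and third is a time budget. The sketch below models both ideas. `ShutdownStageSketch`, `DrainOutcomeSketch`, and `classify_drain` are hypothetical names, and classifying by a single graceful budget is a simplification of whatever timeouts the runtime actually applies.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ShutdownStageSketch {
    RequestStop,
    GracefulDrain,
    AbortStragglers,
    Reconcile,
}

impl ShutdownStageSketch {
    /// Returns the stage that follows, or `None` after the last stage.
    pub fn next(self) -> Option<Self> {
        match self {
            ShutdownStageSketch::RequestStop => Some(ShutdownStageSketch::GracefulDrain),
            ShutdownStageSketch::GracefulDrain => Some(ShutdownStageSketch::AbortStragglers),
            ShutdownStageSketch::AbortStragglers => Some(ShutdownStageSketch::Reconcile),
            ShutdownStageSketch::Reconcile => None,
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum DrainOutcomeSketch {
    Graceful,
    Aborted,
}

/// `stopped_after` is how long the child took to stop after cancellation,
/// or `None` if it had not stopped when the budget ran out.
pub fn classify_drain(
    stopped_after: Option<Duration>,
    graceful_budget: Duration,
) -> DrainOutcomeSketch {
    match stopped_after {
        Some(elapsed) if elapsed <= graceful_budget => DrainOutcomeSketch::Graceful,
        // Too slow or still running: the abort stage takes over.
        _ => DrainOutcomeSketch::Aborted,
    }
}
```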

Order

Startup runs in declaration order. Shutdown runs in reverse declaration order. startup_order and shutdown_order expose this rule.

Blocking Worker Boundary

BlockingWorker represents spawn_blocking work or other work that cannot be assumed to abort immediately. After shutdown timeout, the runtime should record the non-immediate termination boundary and follow the escalation policy.

Shutdown Cause

ShutdownCause records requested_by and reason. The cause should appear in audit and diagnostic output.

Completion Result

shutdown_tree returns ShutdownResult. After the pipeline finishes, ShutdownResult.report contains ShutdownPipelineReport with per-child outcomes, a reconcile report, and dashboard socket status. When the core runtime does not own the dashboard IPC socket, the report records socket status as NotOwned.

Observability

Event Model

SupervisorEvent describes one lifecycle fact. It contains When, Where, What, sequence, and correlation identifier.

When records wall-clock time, monotonic time, uptime, generation, and attempt. Where records supervisor path, child identifier, parent identifier, and task name. What records state transition, policy decision, health state, exit reason, or control command.
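
The shape described above could be sketched as plain Rust data types. This is only an illustration: the field names, field types, and the `EventSketch` wrapper are assumptions, and the real SupervisorEvent definition may differ.

```rust
use std::time::{Duration, SystemTime};

/// "When": timing and attempt context (field types are assumed).
#[derive(Debug, Clone)]
pub struct When {
    pub wall_clock: SystemTime,
    pub monotonic: Duration, // elapsed time on a monotonic clock
    pub uptime: Duration,
    pub generation: u64,
    pub attempt: u32,
}

/// "Where": location of the fact inside the supervisor tree.
#[derive(Debug, Clone)]
pub struct Where {
    pub supervisor_path: String,
    pub child_id: Option<String>,
    pub parent_id: Option<String>,
    pub task_name: Option<String>,
}

/// "What": the kind of lifecycle fact, with simplified string payloads.
#[derive(Debug, Clone, PartialEq)]
pub enum What {
    StateTransition { from: String, to: String },
    PolicyDecision(String),
    HealthState(String),
    ExitReason(String),
    ControlCommand(String),
}

/// Hypothetical stand-in for one lifecycle event.
#[derive(Debug, Clone)]
pub struct EventSketch {
    pub when: When,
    pub location: Where, // `where` is a Rust keyword, so the field is renamed
    pub what: What,
    pub sequence: u64,
    pub correlation_id: String,
}

/// Builds one sample event with made-up values.
pub fn sample_event() -> EventSketch {
    EventSketch {
        when: When {
            wall_clock: SystemTime::UNIX_EPOCH + Duration::from_secs(1_700_000_000),
            monotonic: Duration::from_millis(1_500),
            uptime: Duration::from_millis(1_500),
            generation: 1,
            attempt: 2,
        },
        location: Where {
            supervisor_path: "/root".to_string(),
            child_id: Some("worker".to_string()),
            parent_id: None,
            task_name: Some("worker-task".to_string()),
        },
        what: What::ExitReason("panicked".to_string()),
        sequence: 42,
        correlation_id: "corr-1".to_string(),
    }
}
```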

Pipeline Outputs

The observability pipeline publishes the same lifecycle fact as these signals:

  • SupervisorEvent.
  • Structured log.
  • Tracing span and tracing event.
  • Metrics.
  • Audit event.
  • Event journal entry.
  • Test recorder entry.

Metric Labels

Metric labels must stay low-cardinality. Acceptable labels include supervisor path, child identifier, state, decision, and failure category. Full error text, user input, and unbounded dynamic values should not become labels.
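
One way to enforce the rule is an allowlist check before a label reaches the metrics backend. The sketch below is an assumption-heavy illustration: the snake_case key spellings, the 64-byte value limit, and the function name are all hypothetical.

```rust
/// Label keys named as acceptable in the text above (snake_case spellings are assumed).
const ALLOWED_LABEL_KEYS: [&str; 5] = [
    "supervisor_path",
    "child_id",
    "state",
    "decision",
    "failure_category",
];

/// Assumed upper bound on label value length; the source does not specify one.
const MAX_LABEL_VALUE_LEN: usize = 64;

/// Returns true when the key is allowlisted and the value looks like a bounded token.
/// Values with whitespace are rejected as a cheap proxy for free-form error text.
pub fn is_acceptable_label(key: &str, value: &str) -> bool {
    ALLOWED_LABEL_KEYS.contains(&key)
        && !value.is_empty()
        && value.len() <= MAX_LABEL_VALUE_LEN
        && !value.chars().any(char::is_whitespace)
}
```

A check like this cannot prove that a value set is bounded; it only filters the obvious cases such as full error messages or user input.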

Real Shutdown Pipeline

After ShutdownTree runs the real shutdown pipeline, the observability pipeline must surface lifecycle facts for each stage. ChildShutdownCancelDelivered means the runtime delivered CancellationToken to the in-flight child attempt. ChildShutdownGraceful means the child task returned inside the graceful drain budget. ChildShutdownAborted means the runtime requested abort for a stuck task. ChildShutdownLateReport means the child task returned after the normal shutdown accounting window. ShutdownCompleted means the pipeline emitted the final reconcile report.

Metrics record shutdown facts with low-cardinality labels. supervisor_shutdown_duration_seconds measures full pipeline duration. supervisor_shutdown_child_outcomes_total counts outcomes by status and phase and must not place child_id on metric labels. supervisor_shutdown_abort_total counts abort paths by bounded reason. supervisor_shutdown_late_reports_total counts late reports by phase.

Audit events record cancel delivered, graceful outcome, abort outcome, late report, and completed reconcile. When the core runtime does not own the dashboard IPC socket, the reconcile report records socket status as NotOwned.

Diagnostic Replay

The event journal stores a fixed number of recent events. RunSummary is built from the event journal, current state, and policy decisions so operators can explain meltdown, shutdown timeout, or parent escalation.

Examples

Quick Start

cargo run --example supervisor_quickstart

supervisor_quickstart reads examples/config/supervisor.yaml, derives SupervisorSpec, starts a supervisor, queries current state, and shuts down the tree.

Configuration Tree

cargo run --example config_tree_supervisor

config_tree_supervisor shows the rust-config-tree 0.3.0 YAML loading path and prints the derived SupervisorSpec.

Split Configuration

cargo run --example split_config_supervisor

split_config_supervisor loads examples/config/split/supervisor.yaml, which includes body-only groups.yaml and children.yaml. See Split Configuration and Transparent Array Sections.

Restart Policy Lab

cargo run --example restart_policy_lab

restart_policy_lab shows the basic shapes of TaskFailure, TaskFailureKind, RestartPolicy, the canonical spec::supervisor::SupervisionStrategy, and RestartDecision.

Shutdown Tree

cargo run --example shutdown_tree

The shutdown_tree example walks through request stop, graceful drain, abort stragglers, and reconcile, and then calls the shutdown_tree method.

Observability Probe

cargo run --example observability_probe

observability_probe subscribes to events, queries current state, prints one event, and shuts down. It checks the observability integration path.

Supervisor Tree Story

cargo run --example supervisor_tree_story

supervisor_tree_story declares market feed, risk engine, and audit sink children. It shows dependencies, tags, criticality, explicit readiness, startup order, shutdown order, and RestForOne restart scope.

Runtime Control Story

cargo run --example runtime_control_story

runtime_control_story starts a real supervisor and runs add_child, pause_child, resume_child, quarantine_child, current_state, subscribe_events, and shutdown_tree. It combines operator control with audit events.

Policy Failure Matrix

cargo run --example policy_failure_matrix

policy_failure_matrix feeds success, external dependency failure, fatal bug failure, and panic into Permanent, Transient, and Temporary restart policies. It also shows deterministic jitter and meltdown tracking.

Diagnostic Replay

cargo run --example diagnostic_replay

diagnostic_replay builds deterministic events, writes them into the event journal, replays failure, backoff, and restart facts, then generates metric samples and RunSummary.

Operations Runbook

Note: Each procedure lists the expected metric values at key steps. If an observed value differs, follow the escalation path or refer to the linked section.

P1-001: Supervisor Process Crash

Symptoms: Supervisor process exits unexpectedly; children become orphaned.

| Step | Action | Expected Metrics | Estimated Duration |
|------|--------|------------------|--------------------|
| 1 | Check process status: pgrep -x supervisor | exit code == 0 (process running) or exit code == 1 (not running) | 1min |
| 2 | If not running, check last log lines: journalctl -u supervisor -n 50 | Log ends with ShutdownPhase::Completed (planned) or Panic (unexpected) | 2min |
| 3 | If unexpected crash: collect core dump and backtrace | Core dump file present in /tmp/ | 2min |
| 4 | Restart supervisor: cargo run --release --example supervisor_quickstart | health.status == "ready" within 30s | 5min |
| 5 | Verify children reconnected: check dashboard IPC | dashboard_link == "connected" | 2min |

Escalation: If restart fails twice, escalate to L2 engineering with core dump and logs. Total estimated duration: 12min (within 15min SLA).

P1-002: Child Task Crash Loop

Symptoms: A child task repeatedly fails and restarts; current_state shows elevated restart counts.

| Step | Action | Expected Metrics | Estimated Duration |
|------|--------|------------------|--------------------|
| 1 | Query current state: handle.current_state() | children.failed > 0 and child_runtime_records[].restart_count > threshold | 1min |
| 2 | Check child exit reason in journal | TaskExit::Panicked or TaskExit::Failed with reason string | 2min |
| 3 | If restart budget exhausted: restart_budget.tokens == 0 | Budget exhausted; child quarantined automatically | 1min |
| 4 | Remove or replace the faulty child spec: handle.remove_child() | CommandResult::Accepted | 2min |
| 5 | Verify no lingering slot: check current_state() | children.running == target_count | 2min |

Escalation: If the child's root cause is not identified within 10min, file a bug with the exit reason and journal snippet. Total estimated duration: 8min.

P1-003: Dashboard IPC Disconnected

Symptoms: health.dashboard_link == "disconnected"; dashboard UI shows no data.

| Step | Action | Expected Metrics | Estimated Duration |
|------|--------|------------------|--------------------|
| 1 | Check IPC socket path existence: ls -la /tmp/supervisor.sock | Socket file present with correct permissions | 1min |
| 2 | Check relay process: pgrep -x relay | Process running | 1min |
| 3 | Restart relay: kill -TERM <relay_pid> and wait for auto-restart | Supervisor auto-restarts relay; dashboard_link == "connected" within 10s | 3min |
| 4 | If still disconnected, restart the supervisor process (dashboard IPC lifecycle is tied to the supervisor process) | health.dashboard_link == "connected" | 2min |

Escalation: If the cause is IPC socket path contention (the error contains field_path="dashboard.path"), check the socket path configuration in the deployment guide. Total estimated duration: 7min.

P1-004: Runtime Starvation

Symptoms: Control loop iterations stall; health.uptime_secs advances but events are not processed.

| Step | Action | Expected Metrics | Estimated Duration |
|------|--------|------------------|--------------------|
| 1 | Check Tokio runtime metrics: handle.health().control_loop_iterations | iterations_per_sec > 0 | 1min |
| 2 | If stalled, check for blocking tasks: review child task list | No child in BlockForever or IgnoreCancel state | 2min |
| 3 | Quarantine suspicious children: handle.quarantine_child() | Child marked as Quarantined | 2min |
| 4 | Verify recovery: health.control_loop_iterations increases | iterations_per_sec > 0 after 5s | 3min |

Escalation: If starvation persists after quarantining all non-critical children, escalate to L2 with runtime metrics snapshot. Total estimated duration: 8min.

Frequently Asked Questions (FAQ)

Basics

What is the difference between ChildDeclaration and ChildSpec?

ChildDeclaration is the input model used in YAML configuration and add_child RPC payloads. It focuses on serializable, validatable declarations. ChildSpec is the runtime model used by the supervisor to register, start, and restart children. It carries resolved ChildId, Arc<dyn TaskFactory>, and materialized policy objects.

See ChildSpec and ChildDeclaration for details.

What are the entry methods for starting Supervisor?

Supervisor provides three entry methods:

| Method | Input | When to use |
|--------|-------|-------------|
| Supervisor::start(spec) | SupervisorSpec (pre-built spec) | Programmatic startup |
| Supervisor::start_from_config_state(state) | ConfigState (validated config) | Start from config loader output |
| Supervisor::start_from_config_file(path) | YAML file path | Start directly from YAML file |

All three converge into start_with_policy(), which validates, creates channels, spawns the control loop, and returns a SupervisorHandle.

What does “Shutdown Without Orphaned Tasks” mean?

This is the core shutdown goal of the project. After the root supervisor completes shutdown, no orphan tasks may remain in the runtime. This is achieved through the four-stage shutdown protocol (request stop -> graceful drain -> abort stragglers -> reconcile) and by shutting down children in reverse declaration order, ensuring every child is properly terminated.

Configuration

What child fields does the YAML children entry support?

children is a YAML array backed by ChildrenConfigSection in Rust. Access items with .as_slice(). Each declaration supports these fields:

| Category | Field | Description |
|----------|-------|-------------|
| Identity | name | Child name, required, non-empty |
| Kind | kind | async_worker, blocking_worker, or supervisor |
| Criticality | criticality | critical or optional |
| Restart policy | restart_policy | permanent, transient, or temporary |
| Dependencies | dependencies | List of dependent child names |
| Health check | health_check | Health check interval, timeout, etc. |
| Readiness | readiness | Explicit readiness check config |
| Resource limits | resource_limits | CPU, memory and other resource constraints |
| Command permissions | command_permissions | Commands this child is allowed to execute |
| Environment | environment | Key-value environment variable list |
| Secrets | secrets | ${SECRET_NAME}-format secret references |
| Tags | tags | Low-cardinality grouping tags |
| Task role | task_role | service, worker, job, sidecar, supervisor |

See Configuration for a complete config sample.

How do I split groups and children into separate YAML files?

Add include in the root config and write body-only split files:

# Root config
include:
  - groups.yaml
  - children.yaml

# children.yaml
- name: worker
  kind: async_worker

See Split Configuration and Transparent Array Sections. Run cargo run --example split_config_supervisor.

What happens when children is omitted from a config file?

Runtime loading yields an empty list []. Template sample entries such as worker are not injected at runtime. Only generate-template writes sample entries.

What configurations cause rejection at startup?

Configuration loading returns SupervisorError::FatalConfig when startup must be rejected. Rejection reasons include:

  • The file is not YAML format or cannot be read
  • Supervision strategy is not OneForOne, OneForAll, or RestForOne
  • Numeric values are zero or out of valid range
  • Initial backoff is greater than max backoff
  • Jitter ratio is not between 0.0 and 1.0
  • Restart budget, failure window, or meltdown config is invalid
  • Child declaration has circular dependencies
  • Child ID or name is empty
  • Sidecar task role is missing sidecar_config
  • Dashboard IPC path is not absolute

See Configuration for the full rejection list.
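
Three of the rejection rules above (zero values, initial backoff greater than max backoff, and jitter ratio outside 0.0 to 1.0) can be sketched as a small validation function. The error struct, the field paths, and the parameter names are hypothetical; the real SupervisorError::FatalConfig may be shaped differently.

```rust
/// Hypothetical stand-in for a fatal configuration error.
#[derive(Debug, PartialEq)]
pub struct FatalConfigSketch {
    pub field_path: &'static str, // illustrative path, not a real config key
    pub hint: &'static str,
}

/// Validates backoff settings; returns the first violated rule.
pub fn validate_backoff(
    initial_ms: u64,
    max_ms: u64,
    jitter_ratio: f64,
) -> Result<(), FatalConfigSketch> {
    if initial_ms == 0 {
        return Err(FatalConfigSketch {
            field_path: "/policy/initial_backoff_ms",
            hint: "initial backoff must be greater than zero",
        });
    }
    if initial_ms > max_ms {
        return Err(FatalConfigSketch {
            field_path: "/policy/initial_backoff_ms",
            hint: "initial backoff must not exceed max backoff",
        });
    }
    // `contains` is false for NaN, so NaN is rejected as well.
    if !(0.0..=1.0).contains(&jitter_ratio) {
        return Err(FatalConfigSketch {
            field_path: "/policy/jitter_ratio",
            hint: "jitter ratio must be between 0.0 and 1.0",
        });
    }
    Ok(())
}
```

Returning the first violation keeps the example short; a real loader might collect every violation before rejecting startup.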

Runtime Control

What is the five-step add_child transaction?

add_child chains five steps into a single transaction:

  1. Parse: Deserialize the RPC payload into a ChildDeclaration
  2. Validate: Run validate_child_declaration, checking name format, dependency name existence, secret placeholder syntax, etc.
  3. Register: Update topology, insert the new child into the registry, and run cycle detection
  4. Launch: Create and start the child future via TaskFactory
  5. Audit Persist: Write audit records including the declaration SHA-256 hash

If any step fails, the entire transaction rolls back to the pre-call topology view, or writes a compensating record for post-recovery handling.
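
The rollback behavior can be illustrated with a toy registry. This is a sketch under strong simplifications: the registry is a list of names, the payload format `name=<value>` is invented, and launch and audit outcomes are passed in as flags. None of these names come from the crate.

```rust
/// Hypothetical error kinds for the sketch.
#[derive(Debug, PartialEq)]
pub enum AddChildError {
    Parse,
    Validate,
    Launch,
    AuditStorageFailure,
}

/// Runs the five steps against a toy registry and restores the snapshot on failure.
pub fn add_child_sketch(
    registry: &mut Vec<String>,
    payload: &str,
    launch_ok: bool,
    audit_ok: bool,
) -> Result<(), AddChildError> {
    // 1. Parse: the invented payload format is "name=<value>".
    let name = payload.strip_prefix("name=").ok_or(AddChildError::Parse)?;

    // 2. Validate: only the non-empty name rule is modeled here.
    if name.is_empty() {
        return Err(AddChildError::Validate);
    }

    // Keep the pre-call topology view so later steps can roll back to it.
    let snapshot = registry.clone();

    // 3. Register: insert the new child into the registry.
    registry.push(name.to_string());

    // 4. Launch: simulated by a flag instead of a real task factory.
    if !launch_ok {
        *registry = snapshot;
        return Err(AddChildError::Launch);
    }

    // 5. Audit persist: simulated by a flag. A real runtime would also have
    //    to stop the child launched in step 4 before rolling back.
    if !audit_ok {
        *registry = snapshot;
        return Err(AddChildError::AuditStorageFailure);
    }
    Ok(())
}
```

Cloning the whole registry is affordable only in a toy; the point is that every failure path after registration returns the caller to the view it had before the call.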

Which runtime control commands are idempotent?

Repeated control commands do not create unrecoverable errors:

  • Pausing an already paused child returns the current state
  • Quarantining an already quarantined child returns the current state
  • Calling shutdown after shutdown is complete returns the existing result
  • join caches the final RuntimeExitReport; repeated calls return the same result
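
Idempotency here means that applying a command twice gives the same result as applying it once. The sketch below models that property with a hypothetical `OperationSketch` enum; how the real runtime treats a pause request for a quarantined child is an assumption in this example.

```rust
/// Hypothetical operation state of a child.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OperationSketch {
    Active,
    Paused,
    Quarantined,
}

/// Pauses an active child; any other state is returned unchanged.
pub fn pause(current: OperationSketch) -> OperationSketch {
    match current {
        OperationSketch::Active => OperationSketch::Paused,
        // Already paused (or quarantined, by assumption): report the current state.
        other => other,
    }
}

/// Quarantines a child; quarantining again yields the same state.
pub fn quarantine(_current: OperationSketch) -> OperationSketch {
    OperationSketch::Quarantined
}
```

Because a repeated command returns the current state instead of an error, an operator script can retry after a timeout without first checking whether the earlier attempt landed.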

What is the difference between pause, quarantine, and remove?

All three are stop-type control commands, but they behave differently:

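| Command | Operation set to | Record kept | Auto-restart |
|---------|------------------|-------------|--------------|
| pause_child | Paused | Kept | Suspended while paused |
| quarantine_child | Quarantined | Kept | Disabled permanently |
| remove_child | Removed | Physically deleted after attempt exits | N/A |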

A paused child can be resumed via resume_child. A quarantined child can be removed later. Remove is final: the runtime record is physically deleted.

Policies & Failure Handling

When should each RestartPolicy value be used?

| Value | Behavior | When to use |
|-------|----------|-------------|
| Permanent | Always restart | Critical services like API servers, database connections |
| Transient | Restart only for certain failure categories | Restart on external dependency failures, not on fatal bugs |
| Temporary | Restart at most once | One-shot jobs, do not retry after failure |
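
The Behavior column can be read as a decision function. The sketch below follows that column literally, including "restart at most once" for Temporary; the enum names, the two failure kinds, and the `restarts_so_far` parameter are assumptions and do not mirror the crate's RestartPolicy or RestartDecision types.

```rust
/// Hypothetical policy values matching the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PolicySketch {
    Permanent,
    Transient,
    Temporary,
}

/// Hypothetical failure categories; only the two named in the table are modeled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailureSketch {
    ExternalDependency,
    FatalBug,
}

/// Decides whether a failed child should be restarted.
pub fn should_restart(policy: PolicySketch, failure: FailureSketch, restarts_so_far: u32) -> bool {
    match policy {
        // Always restart.
        PolicySketch::Permanent => true,
        // Restart on external dependency failures, not on fatal bugs.
        PolicySketch::Transient => failure == FailureSketch::ExternalDependency,
        // Literal reading of "restart at most once".
        PolicySketch::Temporary => restarts_so_far < 1,
    }
}
```

This covers only the policy value. Backoff, restart budget, and meltdown limits would apply on top of this decision in a real runtime.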

How do the three meltdown levels cascade?

The meltdown policy (MeltdownPolicy) limits restarts or failures within a window, across three levels:

  1. Child-level: Exceeds child_max_restarts / child_window_secs -> enters quarantine
  2. Group-level: Exceeds group_max_failures / group_window_secs -> escalates to supervisor
  3. Supervisor-level: Exceeds supervisor_max_failures / supervisor_window_secs -> escalates to parent

After meltdown triggers, it auto-resets after reset_after_secs.
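
The child-level rule (more than child_max_restarts restarts inside child_window_secs) is a sliding-window count. The sketch below shows one way to compute it with explicit timestamps in seconds; `RestartWindow` and its methods are hypothetical, and the group and supervisor levels as well as reset_after_secs are left out.

```rust
use std::collections::VecDeque;

/// Hypothetical sliding-window counter for child restarts.
pub struct RestartWindow {
    max_restarts: usize,
    window_secs: u64,
    timestamps: VecDeque<u64>,
}

impl RestartWindow {
    pub fn new(max_restarts: usize, window_secs: u64) -> Self {
        Self { max_restarts, window_secs, timestamps: VecDeque::new() }
    }

    /// Records a restart at `now_secs` and returns true when the number of
    /// restarts inside the window exceeds the limit.
    pub fn record(&mut self, now_secs: u64) -> bool {
        self.timestamps.push_back(now_secs);
        // Drop restarts that have aged out of the window.
        while let Some(&oldest) = self.timestamps.front() {
            if now_secs.saturating_sub(oldest) >= self.window_secs {
                self.timestamps.pop_front();
            } else {
                break;
            }
        }
        self.timestamps.len() > self.max_restarts
    }
}
```

With a limit of 2 restarts per 10 seconds, restarts at t=0, t=1, and t=2 trip the limit on the third call, while restarts at t=0, t=1, and t=20 do not because the first two have aged out.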

Observability

How do I subscribe to lifecycle events?

Call SupervisorHandle::subscribe_events() to get a broadcast::Receiver. Events are of type SupervisorEvent, containing When (wall-clock time, monotonic time, uptime, generation, attempt), Where (supervisor path, child ID, parent ID, task name), and What (state transitions, policy decisions, health status, exit reasons, or control commands).

What happens when the event journal is full?

The event journal is a fixed-capacity ring buffer. When full, it overwrites the oldest entries. Capacity is configured via observability.event_journal_capacity. However, the add_child-dedicated audit channel does not silently overwrite; it returns Err(AuditStorageFailure) when full.
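
The two full-buffer behaviors can be contrasted in a few lines. This is a sketch only: `JournalSketch` and `AuditChannelSketch` are hypothetical types built on `VecDeque`, and the unit struct `AuditStorageFailure` merely borrows the error name from the text above.

```rust
use std::collections::VecDeque;

/// Fixed-capacity journal that overwrites the oldest entry when full.
pub struct JournalSketch<T> {
    capacity: usize,
    entries: VecDeque<T>,
}

impl<T: Clone> JournalSketch<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        Self { capacity, entries: VecDeque::with_capacity(capacity) }
    }

    /// Appends an entry, evicting the oldest one if the journal is full.
    pub fn push(&mut self, entry: T) {
        if self.entries.len() == self.capacity {
            self.entries.pop_front();
        }
        self.entries.push_back(entry);
    }

    /// Returns the retained entries, oldest first.
    pub fn snapshot(&self) -> Vec<T> {
        self.entries.iter().cloned().collect()
    }
}

/// Stand-in error for a full audit channel.
#[derive(Debug, PartialEq)]
pub struct AuditStorageFailure;

/// Fixed-capacity audit channel that rejects writes when full.
pub struct AuditChannelSketch<T> {
    capacity: usize,
    entries: VecDeque<T>,
}

impl<T> AuditChannelSketch<T> {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, entries: VecDeque::new() }
    }

    /// Appends an entry, or fails without dropping existing records.
    pub fn push(&mut self, entry: T) -> Result<(), AuditStorageFailure> {
        if self.entries.len() >= self.capacity {
            return Err(AuditStorageFailure);
        }
        self.entries.push_back(entry);
        Ok(())
    }
}
```

The journal favors recency for diagnostics, while the audit channel favors completeness: it would rather fail the add_child call than lose a record.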

Dashboard

Which three repositories does the Dashboard feature require?

The dashboard feature spans three repositories:

| Repository | Responsibility |
|------------|----------------|
| rust-supervisor (this project) | Target process local IPC and shared contracts |
| ~/rust-supervisor-relay | Relay and external wss:// sessions |
| ~/rust-supervisor-ui | Browser dashboard client |

The target process exposes only a local Unix domain socket. IPC must never be exposed to external networks.

What IPC methods are supported?

Supported methods: hello, state, events.subscribe, logs.tail, command.restart_child, command.pause_child, command.resume_child, command.quarantine_child, command.remove_child, command.add_child, and command.shutdown_tree.

Project & Build

What does target/debug/rust-tokio-supervisor generate-template do without arguments?

generate-template with no arguments does not output to stdout. It writes to config/<root-config-name>/<root-config-name>.example.yaml by default.

For this project:

# No terminal output after running
./target/debug/rust-tokio-supervisor generate-template

# But files are actually written
ls config/supervisor_config/
# supervisor_config.example.yaml
# supervisor_config.schema.json

Options:

# Specify output path
./target/debug/rust-tokio-supervisor generate-template --output /tmp/my-config.yaml

# Also generate JSON Schema
./target/debug/rust-tokio-supervisor generate-template --schema /tmp/schema.json

The output format is inferred from the file extension; unknown or missing extensions use YAML by default.

Why does Cargo.toml declare only one [[bin]] (rust-tokio-supervisor) but there are multiple binaries in target/debug/?

Cargo supports two ways to declare binary targets:

  1. Explicit declaration: via [[bin]] entries in Cargo.toml, e.g., src/main.rs -> rust-tokio-supervisor
  2. Auto-discovery: each .rs file in src/bin/ automatically becomes a binary target, using the filename as the target name

So Cargo.toml shows only [[bin]] name = "rust-tokio-supervisor", but src/bin/generate_supervisor.rs and src/bin/generate_supervisor_config.rs are auto-discovered by Cargo, producing additional binaries.

Note: The src/bin/ directory may be cleaned up or moved after feature completion to keep the project structure tidy.

Common Errors

What is SupervisorError::FatalConfig?

FatalConfig indicates an unrecoverable error during configuration loading. The error includes field_path (JSON Pointer format) and a human-readable hint to help locate the specific problem.

What should I do when add_child returns Err(SupervisorShuttingDown)?

The supervisor is currently executing its shutdown sequence and cannot accept new add_child requests. Wait for the supervisor to complete shutdown, restart it, then retry the add operation.

What should I do when add_child returns Err(ChildLimitExceeded)?

The runtime child count has reached its limit (currently 1000). Either remove unnecessary children via remove_child, or adjust the dynamic_supervisor.child_limit configuration.

What happens when audit storage fails?

When the audit channel (ring buffer) write fails:

  • add_child enters compensating flow and returns Err(AuditStorageFailure)
  • The topology view rolls back to its pre-call state
  • No orphaned semi-parsed state is left behind

Quality Gates

Baseline Commands

cargo fmt --check
cargo check
cargo test
cargo doc --no-deps
cargo package --list
scripts/check-coding-standard.sh
scripts/check-maintainability.sh
scripts/generate-sbom.sh
scripts/validate-sbom.sh
cargo publish --dry-run

Documentation Synchronization

The manual, engineering docs, README files, examples, quickstart, public API contract, and glossary must stay synchronized. When public APIs, configuration shape, example behavior, or observability signals change, documentation must be updated in the same implementation pass.

Coding Standard

scripts/check-coding-standard.sh checks required release materials, example files, primary configuration, documentation punctuation, and No Compatibility language. Chinese-language docs in this repository must use ASCII punctuation.

Maintainability

scripts/check-maintainability.sh checks isomorphic manual/zh and manual/en entries, isomorphic docs/zh and docs/en entries for quality gate and parallel governance pages, example count against the contract, validation artifacts, the Shutdown Without Orphaned Tasks term, and the rust-config-tree term.

SBOM And Release

scripts/generate-sbom.sh creates artifacts/sbom/rust-supervisor.cdx.json and artifacts/sbom/rust-supervisor.spdx.json. scripts/validate-sbom.sh checks file presence, JSON shape, package name, Cargo.lock digest, and leakage of secrets, tokens, local absolute paths, or build scratch paths.