Frequently Asked Questions (FAQ)


Basics

What is the difference between ChildDeclaration and ChildSpec?

ChildDeclaration is the input model used in YAML configuration and add_child RPC payloads. It focuses on serializable, validatable declarations. ChildSpec is the runtime model used by the supervisor to register, start, and restart children. It carries resolved ChildId, Arc<dyn TaskFactory>, and materialized policy objects.

See ChildSpec and ChildDeclaration for details.
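
The split between the two models can be illustrated with a minimal sketch. The field names, the string form of restart_policy, the name-keyed factory lookup, and the resolve function are all assumptions made for illustration; they are not the crate's actual definitions.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for the crate's task factory trait.
pub trait TaskFactory: Send + Sync {
    fn task_name(&self) -> String;
}

/// Input model: plain, serializable data as it might appear in YAML.
#[derive(Debug, Clone)]
pub struct ChildDeclaration {
    pub name: String,
    pub restart_policy: String, // e.g. "permanent"
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChildId(pub String);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestartPolicy {
    Permanent,
    Transient,
    Temporary,
}

/// Runtime model: resolved identity, a live factory, and a typed policy.
pub struct ChildSpec {
    pub id: ChildId,
    pub factory: Arc<dyn TaskFactory>,
    pub restart_policy: RestartPolicy,
}

/// Resolve a declaration into a spec (assumed: factories are looked up by child name).
pub fn resolve(
    decl: &ChildDeclaration,
    factories: &HashMap<String, Arc<dyn TaskFactory>>,
) -> Result<ChildSpec, String> {
    if decl.name.trim().is_empty() {
        return Err("child name must be non-empty".to_string());
    }
    let restart_policy = match decl.restart_policy.as_str() {
        "permanent" => RestartPolicy::Permanent,
        "transient" => RestartPolicy::Transient,
        "temporary" => RestartPolicy::Temporary,
        other => return Err(format!("unknown restart_policy: {}", other)),
    };
    let factory = factories
        .get(&decl.name)
        .cloned()
        .ok_or_else(|| format!("no factory registered for {}", decl.name))?;
    Ok(ChildSpec {
        id: ChildId(decl.name.clone()),
        factory,
        restart_policy,
    })
}
```

The point of the sketch is that the declaration holds only data that can be validated and serialized, while the spec holds objects that exist only inside a running process.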

What are the entry methods for starting a Supervisor?

Supervisor provides three entry methods:

Method	Input	When to use
`Supervisor::start(spec)`	`SupervisorSpec` (pre-built spec)	Programmatic startup
`Supervisor::start_from_config_state(state)`	`ConfigState` (validated config)	Start from config loader output
`Supervisor::start_from_config_file(path)`	YAML file path	Start directly from YAML file

All three converge into start_with_policy(), which validates, creates channels, spawns the control loop, and returns a SupervisorHandle.

What does “Shutdown Without Orphaned Tasks” mean?

This is the core shutdown goal of the project. After the root supervisor completes shutdown, no orphan tasks may remain in the runtime. This is achieved through the four-stage shutdown protocol (request stop -> graceful drain -> abort stragglers -> reconcile) and by shutting down children in reverse declaration order, ensuring every child is properly terminated.
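
As a rough illustration of the ordering rule and the final reconcile check, here is a minimal sketch. ShutdownStage, shutdown_order, and orphans are hypothetical names, and the real protocol works on live tasks rather than on name lists.

```rust
/// The four stages named in the text, in execution order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ShutdownStage {
    RequestStop,
    GracefulDrain,
    AbortStragglers,
    Reconcile,
}

pub const STAGES: [ShutdownStage; 4] = [
    ShutdownStage::RequestStop,
    ShutdownStage::GracefulDrain,
    ShutdownStage::AbortStragglers,
    ShutdownStage::Reconcile,
];

/// Children are stopped in reverse declaration order.
pub fn shutdown_order(declared: &[&str]) -> Vec<String> {
    declared.iter().rev().map(|name| name.to_string()).collect()
}

/// Reconcile check (assumed form): any declared child that is not in the
/// terminated list counts as an orphan.
pub fn orphans(declared: &[&str], terminated: &[String]) -> Vec<String> {
    declared
        .iter()
        .copied()
        .filter(|name| !terminated.iter().any(|done| done.as_str() == *name))
        .map(str::to_string)
        .collect()
}
```

Shutdown satisfies the goal only when orphans returns an empty list after the reconcile stage.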

Configuration

What child fields does the YAML `children` entry support?

children is a YAML array backed by ChildrenConfigSection in Rust. Access items with .as_slice(). Each declaration supports these fields:

Category	Field	Description
Identity	`name`	Child name, required, non-empty
Kind	`kind`	`async_worker`, `blocking_worker`, or `supervisor`
Criticality	`criticality`	`critical` or `optional`
Restart policy	`restart_policy`	`permanent`, `transient`, or `temporary`
Dependencies	`dependencies`	Names of the children this child depends on
Health check	`health_check`	Health check interval, timeout, etc.
Readiness	`readiness`	Explicit readiness check config
Resource limits	`resource_limits`	CPU, memory and other resource constraints
Command permissions	`command_permissions`	Commands this child is allowed to execute
Environment	`environment`	Key-value environment variable list
Secrets	`secrets`	`${SECRET_NAME}`-format secret references
Tags	`tags`	Low-cardinality grouping tags
Task role	`task_role`	`service`, `worker`, `job`, `sidecar`, `supervisor`

See Configuration for a complete config sample.

How do I split `groups` and `children` into separate YAML files?

Add an include list to the root config, and have each split file contain only the section body (the bare array, without the top-level groups or children key):

include:
  - groups.yaml
  - children.yaml

# children.yaml
- name: worker
  kind: async_worker

See Split Configuration and Transparent Array Sections. For a runnable example, run cargo run --example split_config_supervisor.

What happens when `children` is omitted from a config file?

Runtime loading yields an empty list []. Template sample entries such as worker are not injected at runtime. Only generate-template writes sample entries.

What configurations cause rejection at startup?

Configuration loading returns SupervisorError::FatalConfig when startup must be rejected. Rejection reasons include:

The file is not YAML format or cannot be read
Supervision strategy is not OneForOne, OneForAll, or RestForOne
Numeric values are zero or out of valid range
Initial backoff is greater than max backoff
Jitter ratio is not between 0.0 and 1.0
Restart budget, failure window, or meltdown config is invalid
Child declaration has circular dependencies
Child ID or name is empty
Sidecar task role is missing sidecar_config
Dashboard IPC path is not absolute

See Configuration for the full rejection list.
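
A few of the rejection rules above can be sketched as plain functions. This is an illustrative sketch only: the SupervisorError stand-in, its String payload, the millisecond units, and all function names are assumptions, not the crate's real validation code.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Hypothetical stand-in for the crate's error type; the String payload is assumed.
#[derive(Debug, PartialEq)]
pub enum SupervisorError {
    FatalConfig(String),
}

fn fatal(msg: &str) -> SupervisorError {
    SupervisorError::FatalConfig(msg.to_string())
}

/// Backoff checks: non-zero values, initial <= max, jitter ratio within [0.0, 1.0].
pub fn validate_backoff(initial_ms: u64, max_ms: u64, jitter_ratio: f64) -> Result<(), SupervisorError> {
    if initial_ms == 0 || max_ms == 0 {
        return Err(fatal("backoff values must be non-zero"));
    }
    if initial_ms > max_ms {
        return Err(fatal("initial backoff is greater than max backoff"));
    }
    // NaN fails `contains`, so it is rejected as well.
    if !(0.0..=1.0).contains(&jitter_ratio) {
        return Err(fatal("jitter ratio must be between 0.0 and 1.0"));
    }
    Ok(())
}

/// The dashboard IPC path must be absolute.
pub fn validate_ipc_path(path: &str) -> Result<(), SupervisorError> {
    if Path::new(path).is_absolute() {
        Ok(())
    } else {
        Err(fatal("dashboard IPC path is not absolute"))
    }
}

/// Depth-first search over `child name -> dependency names`; true if a cycle exists.
pub fn has_cycle<'a>(deps: &HashMap<&'a str, Vec<&'a str>>) -> bool {
    // State per node: 1 = on the current DFS path, 2 = fully explored.
    fn visit<'a>(
        node: &'a str,
        deps: &HashMap<&'a str, Vec<&'a str>>,
        state: &mut HashMap<&'a str, u8>,
    ) -> bool {
        match state.get(node).copied() {
            Some(1) => return true,
            Some(2) => return false,
            _ => {}
        }
        state.insert(node, 1);
        if let Some(next_nodes) = deps.get(node) {
            for &next in next_nodes {
                if visit(next, deps, state) {
                    return true;
                }
            }
        }
        state.insert(node, 2);
        false
    }
    let mut state = HashMap::new();
    deps.keys().any(|&node| visit(node, deps, &mut state))
}
```

Each check returns early with a FatalConfig value, which mirrors the idea that a single invalid setting is enough to reject startup.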

Runtime Control

What is the five-step add_child transaction?

add_child chains five steps into a single transaction:

Parse: Deserialize the RPC payload into a ChildDeclaration
Validate: Run validate_child_declaration, checking name format, dependency name existence, secret placeholder syntax, etc.
Register: Update topology, insert the new child into the registry, and run cycle detection
Launch: Create and start the child future via TaskFactory
Audit Persist: Write audit records including the declaration SHA-256 hash

If any step fails, the entire transaction rolls back to the pre-call topology view, or writes a compensating record for post-recovery handling.
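
The rollback behavior can be sketched with an in-memory topology. Everything here is a simplified assumption: the "name:dep1,dep2" payload format stands in for the real RPC payload, the launch closure stands in for TaskFactory, the audit log is a plain Vec, and the SHA-256 hash of the declaration is omitted because it needs a crate outside the standard library.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
pub enum AddChildError {
    Parse(String),
    Validate(String),
    Launch(String),
    AuditStorageFailure,
}

/// Topology view: child name -> names of the children it depends on.
pub type Topology = BTreeMap<String, Vec<String>>;

/// Parse the assumed "name:dep1,dep2" payload format.
fn parse(payload: &str) -> Result<(String, Vec<String>), AddChildError> {
    let (name, deps) = payload.split_once(':').unwrap_or((payload, ""));
    let name = name.trim();
    if name.is_empty() {
        return Err(AddChildError::Parse("missing child name".to_string()));
    }
    let deps = deps
        .split(',')
        .map(str::trim)
        .filter(|dep| !dep.is_empty())
        .map(str::to_string)
        .collect();
    Ok((name.to_string(), deps))
}

pub fn add_child<F>(
    topology: &mut Topology,
    payload: &str,
    launch: F,
    audit: &mut Vec<String>,
    audit_capacity: usize,
) -> Result<(), AddChildError>
where
    F: Fn(&str) -> Result<(), String>,
{
    // 1. Parse
    let (name, deps) = parse(payload)?;
    // 2. Validate: unique name, every dependency already known.
    if topology.contains_key(&name) {
        return Err(AddChildError::Validate(format!("duplicate child: {}", name)));
    }
    if let Some(missing) = deps.iter().find(|dep| !topology.contains_key(*dep)) {
        return Err(AddChildError::Validate(format!("unknown dependency: {}", missing)));
    }
    // Keep the pre-call view so later failures can restore it.
    let snapshot = topology.clone();
    // 3. Register. A new child that depends only on existing children cannot
    //    close a cycle, so this sketch needs no separate cycle check.
    topology.insert(name.clone(), deps);
    // 4. Launch
    if let Err(reason) = launch(&name) {
        *topology = snapshot;
        return Err(AddChildError::Launch(reason));
    }
    // 5. Audit Persist. A real implementation would also stop the launched child here.
    if audit.len() >= audit_capacity {
        *topology = snapshot;
        return Err(AddChildError::AuditStorageFailure);
    }
    audit.push(format!("add_child {}", name));
    Ok(())
}
```

Taking the snapshot before the first mutation is what lets every later failure path restore the exact pre-call view with a single assignment.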

Which runtime control commands are idempotent?

Repeated control commands do not create unrecoverable errors:

Pausing an already paused child returns the current state
Quarantining an already quarantined child returns the current state
Calling shutdown after shutdown is complete returns the existing result
join caches the final RuntimeExitReport; repeated calls return the same result
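
The idempotency rules can be sketched for a single child as follows. Handle, ChildState, and the fields of RuntimeExitReport are hypothetical and exist only to make the caching visible; the real handle is asynchronous.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChildState {
    Running,
    Paused,
    Quarantined,
}

/// Assumed shape; the real report carries more detail.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RuntimeExitReport {
    pub children_stopped: usize,
}

pub struct Handle {
    state: ChildState,
    pub transitions: u32,
    report: Option<RuntimeExitReport>,
    pub shutdown_runs: u32,
}

impl Handle {
    pub fn new() -> Self {
        Handle { state: ChildState::Running, transitions: 0, report: None, shutdown_runs: 0 }
    }

    /// Pausing changes state only when the child is running; otherwise it
    /// reports the current state unchanged.
    pub fn pause(&mut self) -> ChildState {
        if self.state == ChildState::Running {
            self.state = ChildState::Paused;
            self.transitions += 1;
        }
        self.state
    }

    /// Quarantining an already quarantined child is a no-op.
    pub fn quarantine(&mut self) -> ChildState {
        if self.state != ChildState::Quarantined {
            self.state = ChildState::Quarantined;
            self.transitions += 1;
        }
        self.state
    }

    /// The first call performs the shutdown; later calls return the cached report.
    pub fn join(&mut self) -> RuntimeExitReport {
        if let Some(report) = &self.report {
            return report.clone();
        }
        self.shutdown_runs += 1;
        let report = RuntimeExitReport { children_stopped: 1 };
        self.report = Some(report.clone());
        report
    }
}
```

Counting transitions and shutdown runs makes it easy to confirm that a repeated command had no additional effect.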

What is the difference between pause, quarantine, and remove?

All three are stop-type control commands, but they behave differently:

Command	`operation` set to	Record kept	Auto-restart
`pause_child`	`Paused`	Kept	Suspended while paused
`quarantine_child`	`Quarantined`	Kept	Disabled permanently
`remove_child`	`Removed`	Physically deleted after attempt exits	N/A

A paused child can be resumed via resume_child. A quarantined child can be removed later. Remove is final: the runtime record is physically deleted.
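
The differences in the table can be sketched with a small in-memory registry. Registry and its methods are hypothetical, and the sketch deletes the record immediately on removal, whereas the text says deletion happens after the current attempt exits.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operation {
    Active,
    Paused,
    Quarantined,
}

#[derive(Default)]
pub struct Registry {
    records: HashMap<String, Operation>,
}

impl Registry {
    pub fn add(&mut self, name: &str) {
        self.records.insert(name.to_string(), Operation::Active);
    }

    /// Pause keeps the record and suspends auto-restart.
    pub fn pause_child(&mut self, name: &str) -> Option<Operation> {
        let op = self.records.get_mut(name)?;
        if *op == Operation::Active {
            *op = Operation::Paused;
        }
        Some(*op)
    }

    /// Resume only undoes a pause; a quarantined child stays quarantined.
    pub fn resume_child(&mut self, name: &str) -> Option<Operation> {
        let op = self.records.get_mut(name)?;
        if *op == Operation::Paused {
            *op = Operation::Active;
        }
        Some(*op)
    }

    /// Quarantine keeps the record and disables auto-restart.
    pub fn quarantine_child(&mut self, name: &str) -> Option<Operation> {
        let op = self.records.get_mut(name)?;
        *op = Operation::Quarantined;
        Some(*op)
    }

    /// Remove deletes the record; returns false if there was none.
    pub fn remove_child(&mut self, name: &str) -> bool {
        self.records.remove(name).is_some()
    }

    pub fn has_record(&self, name: &str) -> bool {
        self.records.contains_key(name)
    }

    pub fn auto_restart_allowed(&self, name: &str) -> bool {
        self.records.get(name) == Some(&Operation::Active)
    }
}
```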

Policies & Failure Handling

When should each RestartPolicy value be used?

Value	Behavior	When to use
`Permanent`	Always restart	Critical services like API servers, database connections
`Transient`	Restart only for certain failure categories	Restart on external dependency failures, not on fatal bugs
`Temporary`	Restart at most once	One-shot jobs, do not retry after failure

How do the three meltdown levels cascade?

The meltdown policy (MeltdownPolicy) limits restarts or failures within a window, across three levels:

Child-level: more than child_max_restarts restarts within child_window_secs -> the child enters quarantine
Group-level: more than group_max_failures failures within group_window_secs -> escalates to the supervisor
Supervisor-level: more than supervisor_max_failures failures within supervisor_window_secs -> escalates to the parent

Once triggered, a meltdown resets automatically after reset_after_secs.
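
One way to picture a single level of this policy is a sliding-window counter. The sketch below is an assumption about how such a limit could be tracked, not the crate's MeltdownPolicy implementation; WindowCounter, Level, and on_exceeded are hypothetical names, and the reset_after_secs timer is left out.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Counts events inside a sliding time window.
pub struct WindowCounter {
    max: usize,
    window: Duration,
    events: VecDeque<Instant>,
}

impl WindowCounter {
    pub fn new(max: usize, window: Duration) -> Self {
        WindowCounter { max, window, events: VecDeque::new() }
    }

    /// Record an event at `now`; returns true when the count inside the
    /// window exceeds `max`.
    pub fn record(&mut self, now: Instant) -> bool {
        // Drop events that have fallen out of the window.
        while let Some(&oldest) = self.events.front() {
            if now.saturating_duration_since(oldest) > self.window {
                self.events.pop_front();
            } else {
                break;
            }
        }
        self.events.push_back(now);
        self.events.len() > self.max
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Child,
    Group,
    Supervisor,
}

/// What each level does once its limit is exceeded, as described in the text.
pub fn on_exceeded(level: Level) -> &'static str {
    match level {
        Level::Child => "quarantine child",
        Level::Group => "escalate to supervisor",
        Level::Supervisor => "escalate to parent",
    }
}
```

With one counter per level, a child-level trip feeds a failure into the group counter, and a group-level trip feeds the supervisor counter, which gives the cascade its shape.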

Observability

How do I subscribe to supervisor events?

Call SupervisorHandle::subscribe_events() to get a broadcast::Receiver. Events are of type SupervisorEvent, containing When (wall time, monotonic time, uptime, generation, attempt), Where (supervisor path, child ID, task name), and What (state transitions, policy decisions, health status, exit reasons, or control commands).

What happens when the event journal is full?

The event journal is a fixed-capacity ring buffer. When full, it overwrites the oldest entries. Capacity is configured via observability.event_journal_capacity. However, the add_child-dedicated audit channel does not silently overwrite — it returns Err(AuditStorageFailure) when full.
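
The contrast between the two buffers can be sketched like this. EventJournal and AuditChannel are illustrative types built on standard collections, and AuditStorageFailure is modeled as a unit struct, which is an assumption about its shape.

```rust
use std::collections::VecDeque;

/// Fixed-capacity ring buffer: when full, the oldest entry is overwritten.
pub struct EventJournal<T> {
    capacity: usize,
    entries: VecDeque<T>,
}

impl<T> EventJournal<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        EventJournal { capacity, entries: VecDeque::with_capacity(capacity) }
    }

    pub fn push(&mut self, entry: T) {
        if self.entries.len() == self.capacity {
            self.entries.pop_front();
        }
        self.entries.push_back(entry);
    }

    /// Entries from oldest to newest.
    pub fn snapshot(&self) -> Vec<&T> {
        self.entries.iter().collect()
    }
}

#[derive(Debug, PartialEq, Eq)]
pub struct AuditStorageFailure;

/// Bounded audit storage: when full, the write is refused instead of overwriting.
pub struct AuditChannel<T> {
    capacity: usize,
    records: Vec<T>,
}

impl<T> AuditChannel<T> {
    pub fn new(capacity: usize) -> Self {
        AuditChannel { capacity, records: Vec::new() }
    }

    pub fn push(&mut self, record: T) -> Result<(), AuditStorageFailure> {
        if self.records.len() >= self.capacity {
            return Err(AuditStorageFailure);
        }
        self.records.push(record);
        Ok(())
    }

    pub fn len(&self) -> usize {
        self.records.len()
    }
}
```

The journal favors availability, since recent events are always recorded, while the audit channel favors completeness, since a record is never dropped silently.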

Dashboard

Which three repositories does the Dashboard feature require?

The dashboard feature spans three repositories:

Repository	Responsibility
`rust-supervisor` (this project)	Target process local IPC and shared contracts
`~/rust-supervisor-relay`	Relay and external `wss://` sessions
`~/rust-supervisor-ui`	Browser dashboard client

The target process exposes only a local Unix domain socket. IPC must never be exposed to external networks.

What IPC methods are supported?

Supported methods: hello, state, events.subscribe, logs.tail, command.restart_child, command.pause_child, command.resume_child, command.quarantine_child, command.remove_child, command.add_child, and command.shutdown_tree.
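
A client or relay might check method names before sending a request. The sketch below only encodes the list above; treating the command. prefix as the marker for control commands is an assumption drawn from the naming, and both function names are hypothetical.

```rust
/// The IPC method names listed in this FAQ.
pub const IPC_METHODS: [&str; 11] = [
    "hello",
    "state",
    "events.subscribe",
    "logs.tail",
    "command.restart_child",
    "command.pause_child",
    "command.resume_child",
    "command.quarantine_child",
    "command.remove_child",
    "command.add_child",
    "command.shutdown_tree",
];

/// True if the method name appears in the supported list.
pub fn is_supported(method: &str) -> bool {
    IPC_METHODS.contains(&method)
}

/// True for supported methods in the `command.` namespace (assumed to be
/// the ones that change supervisor state).
pub fn is_control_command(method: &str) -> bool {
    is_supported(method) && method.starts_with("command.")
}
```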

Project & Build

What does `target/debug/rust-tokio-supervisor generate-template` do without arguments?

generate-template with no arguments does not output to stdout. It writes to config/<root-config-name>/<root-config-name>.example.yaml by default.

For this project:

# No terminal output after running
./target/debug/rust-tokio-supervisor generate-template

# But files are actually written
ls config/supervisor_config/
# supervisor_config.example.yaml
# supervisor_config.schema.json

Options:

# Specify output path
./target/debug/rust-tokio-supervisor generate-template --output /tmp/my-config.yaml

# Also generate JSON Schema
./target/debug/rust-tokio-supervisor generate-template --schema /tmp/schema.json

The output format is inferred from the file extension; unknown or missing extensions use YAML by default.
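
The inference rule can be sketched as follows. Only the YAML fallback is stated in the text; the set of recognized formats (YAML and JSON here), the case-insensitive match, and the names OutputFormat and infer_format are assumptions.

```rust
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OutputFormat {
    Yaml,
    Json,
}

/// Pick the output format from the file extension, falling back to YAML
/// for unknown or missing extensions.
pub fn infer_format(path: &Path) -> OutputFormat {
    let extension = path
        .extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.to_ascii_lowercase());
    match extension.as_deref() {
        Some("json") => OutputFormat::Json,
        _ => OutputFormat::Yaml,
    }
}
```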

Why does `Cargo.toml` declare only one `[[bin]]` (rust-tokio-supervisor) but there are multiple binaries in `target/debug/`?

Cargo supports two ways to declare binary targets:

Explicit declaration: via [[bin]] entries in Cargo.toml, e.g., src/main.rs -> rust-tokio-supervisor
Auto-discovery: each .rs file in src/bin/ automatically becomes a binary target, using the filename as the target name

So Cargo.toml shows only [[bin]] name = "rust-tokio-supervisor", but src/bin/generate_supervisor.rs and src/bin/generate_supervisor_config.rs are auto-discovered by Cargo, producing additional binaries.

Note: The src/bin/ directory may be cleaned up or moved after feature completion to keep the project structure tidy.

When audit persistence fails during add_child:

add_child enters the compensating flow and returns Err(AuditStorageFailure)
The topology view rolls back to its pre-call state
No orphaned, partially parsed state is left behind
