Supervision
Where retry answers "run this once,
replaying on failure", a Supervisor answers the different question "keep
this alive": restart a child per policy whenever it exits, with bounded
restarts, exponential backoff, and jitter — a minimal runit/systemd-style
keeper, platform-agnostic because it sits entirely on the
ProcessRunner seam.
- The shape
- Policies: what counts as a crash
- Backoff and jitter
- Failure storms
- Stopping
- Outcomes
- Supervising inside a shared group
- Errors and cancellation
The shape
    use processkit::{Command, RestartPolicy, Supervisor};
    use std::time::Duration;

    #[tokio::main]
    async fn main() -> processkit::Result<()> {
        let outcome = Supervisor::new(Command::new("my-server").args(["--port", "8080"]))
            .restart(RestartPolicy::OnCrash)          // default
            .max_restarts(5)                          // default: unlimited
            .backoff(Duration::from_millis(200), 2.0) // default: 200ms × 2.0
            .max_backoff(Duration::from_secs(30))     // default: 30s cap
            .jitter(true)                             // default: on
            .stop_when(|res| res.code() == Some(0))   // optional exit condition
            .run()
            .await?;

        println!(
            "ended after {} restarts, reason: {:?}, last exit: {:?}",
            outcome.restarts,
            outcome.stopped,
            outcome.final_result.code(),
        );
        Ok(())
    }
Each incarnation is one full captured run of the command (so the command's
own timeout, stdin, env, … all apply per run — with the usual
one-shot-stdin caveat for the second run
onward).
Policies: what counts as a crash
A crash is any run that is not a success (ProcessResult::is_success,
which honors the command's ok_codes): an exit code outside the accepted set
(default {0}), a timeout, a signal-kill, or a spawn failure. A command with
ok_codes([0, 2]) that exits 2 is a success, so OnCrash treats it as clean,
not a crash.
| RestartPolicy | Restarts after… |
|---|---|
| OnCrash (default) | crashes only; a clean exit ends supervision (PolicySatisfied) |
| Always | every completed run, clean or not; pair it with stop_when/max_restarts or it loops forever |
| Never | nothing: one run, reported as-is |
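
The table can be read as a small decision function. The sketch below is a standalone model, not processkit's implementation: `Policy`, `is_success`, and `restarts_after` are hypothetical names chosen for illustration, and `None` stands in for any run without an exit code (timeout, signal-kill, spawn failure).

```rust
/// Hypothetical stand-in for the three restart policies (not the library type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Policy {
    OnCrash,
    Always,
    Never,
}

/// Models the accepted-code check: a run is a success only if it exited
/// with a code in `ok_codes`. `None` means the run produced no exit code.
fn is_success(code: Option<i32>, ok_codes: &[i32]) -> bool {
    matches!(code, Some(c) if ok_codes.contains(&c))
}

/// Returns true when the policy alone asks for another run.
fn restarts_after(policy: Policy, success: bool) -> bool {
    match policy {
        Policy::OnCrash => !success,
        Policy::Always => true,
        Policy::Never => false,
    }
}

fn main() {
    // Exit 2 is a crash with the default accepted set {0} ...
    assert!(restarts_after(Policy::OnCrash, is_success(Some(2), &[0])));
    // ... but a clean exit once 2 is listed as an accepted code.
    assert!(!restarts_after(Policy::OnCrash, is_success(Some(2), &[0, 2])));
    // Always restarts even after success; Never does not restart after a crash.
    assert!(restarts_after(Policy::Always, true));
    assert!(!restarts_after(Policy::Never, false));
    println!("policy model ok");
}
```

The model covers the policy decision only; stop_when and max_restarts are separate gates.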
Backoff and jitter
The n-th restart (0-based) sleeps
delay(n) = min(base × factor^n, max_backoff) × jitter
with jitter drawn uniformly from [0.5, 1.5) per restart. Jitter is on by
default so a fleet of supervised workers restarted by the same incident
doesn't stampede back in lockstep; jitter(false) gives deterministic delays
(useful in tests with a paused tokio clock). A non-finite or < 1.0 factor is
treated as 1.0 — constant delay, never a shrinking one.
base=200ms, factor=2.0, cap=30s:
    restart #0  → ~200ms
    restart #1  → ~400ms
    restart #2  → ~800ms
    …
    restart #7  → ~25.6s
    restart #8+ → 30s (cap)
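
The delay formula is easy to check by hand. The following is a minimal sketch that uses only the standard library and assumes nothing about processkit's internals: `restart_delay` is a hypothetical helper, and the caller supplies the random sample so the result stays deterministic.

```rust
use std::time::Duration;

/// Delay before the n-th restart (0-based): min(base × factor^n, cap) × jitter.
/// `unit` is a caller-supplied sample in [0, 1); `None` disables jitter.
fn restart_delay(n: u32, base: Duration, factor: f64, cap: Duration, unit: Option<f64>) -> Duration {
    // A non-finite or < 1.0 factor is treated as 1.0 (constant delay).
    let factor = if factor.is_finite() && factor >= 1.0 { factor } else { 1.0 };
    // Grow exponentially (in seconds), then clamp to the cap.
    let grown = base.as_secs_f64() * factor.powi(n as i32);
    let capped = grown.min(cap.as_secs_f64());
    // Map the unit sample onto the jitter range [0.5, 1.5).
    let jitter = unit.map_or(1.0, |u| 0.5 + u);
    Duration::from_secs_f64(capped * jitter)
}

fn main() {
    let base = Duration::from_millis(200);
    let cap = Duration::from_secs(30);
    for n in 0..=8 {
        // Jitter disabled, so the printed delays are deterministic.
        println!("restart #{n}: {:?}", restart_delay(n, base, 2.0, cap, None));
    }
}
```

Because the jitter multiplies the already-capped value, a jittered delay in this model can land up to 1.5 times above max_backoff, which matches the formula as written.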
Failure storms
Backoff spaces individual restarts; max_restarts is a lifetime cap.
Neither distinguishes a service that fails once a day from one that is
suddenly crash-looping. The opt-in storm guard does (a design borrowed
from Go's suture supervisor — the
idea, not the code):
    use processkit::{Command, Supervisor};
    use std::time::Duration;

    #[tokio::main]
    async fn main() -> processkit::Result<()> {
        let outcome = Supervisor::new(Command::new("worker"))
            .storm_pause(Duration::from_secs(15))   // master switch, off by default
            .failure_decay(Duration::from_secs(30)) // score half-life (default 30s)
            .failure_threshold(5.0)                 // trip point (default 5.0)
            .run()
            .await?;

        println!("storm pauses taken: {}", outcome.storm_pauses);
        Ok(())
    }
Each failed run adds 1 to a score that halves every failure_decay:
score = score × 0.5^(Δt / failure_decay) + 1
- Fails rarely: the score decays back toward 1 between failures and never reaches the threshold, so the guard stays out of the way.
- Failure storm: failures arrive faster than the half-life drains them, the score climbs past failure_threshold, and the supervisor takes one collective pause of storm_pause (jittered into [0.5, 1.5) like the backoff), resets the score, and resumes.
Only failures feed the score — crashes and spawn errors — not clean exits
restarted under RestartPolicy::Always. The pause stacks with (runs before)
the per-restart backoff, and the max_restarts budget is checked first, so a
storm pause never extends an exhausted budget. Pauses taken are reported in
SupervisionOutcome::storm_pauses.
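
To see how the score separates the two regimes, here is a standalone model of the decay formula. It is a sketch under stated assumptions, not the library's code: `StormGuard` and `record_failure` are hypothetical names, "climbs past" is modelled as reaching or exceeding the threshold, and the pause itself (and its jitter) is not simulated.

```rust
/// Hypothetical model of the decaying failure score (not a processkit type).
struct StormGuard {
    score: f64,
    half_life_secs: f64,
    threshold: f64,
    pauses: u32,
}

impl StormGuard {
    fn new(half_life_secs: f64, threshold: f64) -> Self {
        StormGuard { score: 0.0, half_life_secs, threshold, pauses: 0 }
    }

    /// Records a failure that happened `dt_secs` after the previous one.
    /// Returns true when the score trips the threshold and a pause is due.
    fn record_failure(&mut self, dt_secs: f64) -> bool {
        // score = score × 0.5^(Δt / failure_decay) + 1
        self.score = self.score * 0.5f64.powf(dt_secs / self.half_life_secs) + 1.0;
        // Assumption: reaching the threshold counts as tripping it.
        if self.score >= self.threshold {
            self.score = 0.0; // the score resets after the collective pause
            self.pauses += 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Crash loop: one failure per second against a 30s half-life.
    let mut storm = StormGuard::new(30.0, 5.0);
    let first_trip = (1..=20).find(|_| storm.record_failure(1.0));
    println!("crash loop trips the guard at failure #{first_trip:?}");

    // Rare failures: one per day, so the old score has fully decayed each time.
    let mut rare = StormGuard::new(30.0, 5.0);
    let tripped = (0..100).any(|_| rare.record_failure(86_400.0));
    println!("daily failures trip the guard: {tripped}");
}
```

With the defaults (30s half-life, threshold 5.0), failures one second apart push the score to about 4.78 on the fifth failure and about 5.67 on the sixth, so this model trips on the sixth.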
Stopping
Three gates, checked in this order after every completed run:
1. stop_when(predicate): sees the run's ProcessResult; returning true ends supervision regardless of policy (→ StopReason::Predicate). "Exit 0 is done, anything else is a crash" is the classic: stop_when(|res| res.code() == Some(0)) under RestartPolicy::Always.
2. The policy: OnCrash stops on a clean exit (→ PolicySatisfied).
3. max_restarts(n): at most n restarts = n + 1 total runs; an exhausted budget reports the last result (→ RestartsExhausted). max_restarts(0) means exactly one run.
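
The gate order can be modelled by replaying a script of exit codes. This is a simplified sketch, not the library's loop: `Policy`, `Stop`, and `supervise` are hypothetical names, a clean exit is modelled as code 0, the restart budget is mandatory here so the loop always terminates, and Never is left out because the text above does not say which reason it reports.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Policy {
    OnCrash,
    Always,
}

#[derive(Debug, PartialEq)]
enum Stop {
    Predicate,
    PolicySatisfied,
    RestartsExhausted,
}

/// Replays scripted exit codes through the three gates, in order.
/// Returns (restarts performed, stop reason). The last code repeats
/// if the script is shorter than the number of runs.
fn supervise(
    exits: &[i32],
    policy: Policy,
    max_restarts: u32,
    stop_when: impl Fn(i32) -> bool,
) -> (u32, Stop) {
    assert!(!exits.is_empty(), "need at least one scripted exit code");
    let mut restarts = 0u32;
    loop {
        let code = exits[(restarts as usize).min(exits.len() - 1)];
        // Gate 1: the predicate wins regardless of policy.
        if stop_when(code) {
            return (restarts, Stop::Predicate);
        }
        // Gate 2: OnCrash is satisfied by a clean exit.
        if policy == Policy::OnCrash && code == 0 {
            return (restarts, Stop::PolicySatisfied);
        }
        // Gate 3: the restart budget (n restarts = n + 1 total runs).
        if restarts >= max_restarts {
            return (restarts, Stop::RestartsExhausted);
        }
        restarts += 1;
    }
}

fn main() {
    // Two crashes, then exit 0: the predicate ends an Always loop.
    let outcome = supervise(&[1, 1, 0], Policy::Always, 5, |code| code == 0);
    println!("{outcome:?}");
    // A child that always crashes uses up the whole budget.
    let outcome = supervise(&[1], Policy::OnCrash, 3, |_| false);
    println!("{outcome:?}");
}
```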
Outcomes
run() resolves to a SupervisionOutcome:
    use processkit::{Command, Supervisor};

    #[tokio::main]
    async fn main() -> processkit::Result<()> {
        let outcome = Supervisor::new(Command::new("job")).run().await?;

        outcome.final_result; // ProcessResult<String> of the LAST run
        outcome.restarts;     // how many restarts happened (not counting run #1)
        outcome.stopped;      // StopReason::{Predicate, PolicySatisfied, RestartsExhausted}
        outcome.storm_pauses; // failure-storm pauses taken (0 unless storm_pause is set)
        Ok(())
    }
Note that run() returning Ok does not mean the child succeeded; it means
supervision concluded. Inspect final_result (or call ensure_success() on it) for
the child's own verdict.
Supervising inside a shared group
The supervisor runs through any ProcessRunner. The headline production
variant injects a ProcessGroup so every incarnation —
and everything it spawns — lives in one kill-on-drop container:
    use processkit::{Command, ProcessGroup, RestartPolicy, Supervisor};

    #[tokio::main]
    async fn main() -> processkit::Result<()> {
        let group = ProcessGroup::new()?;
        let outcome = Supervisor::new(Command::new("worker"))
            .with_runner(&group) // &group is itself a ProcessRunner
            .restart(RestartPolicy::OnCrash)
            .max_restarts(10)
            .run()
            .await?;
        // The group outlives supervision: drop it (or shutdown) to reap any strays.
        Ok(())
    }
Mind one interaction: don't supervise into a group you've suspended — under the cgroup mechanism the restarted child would start frozen (and the spawn itself can block). Resume first.
The same injection point makes supervision logic hermetically testable — script a sequence of fake results and assert the restart/stop behavior with no real process; see Testing your code.
Errors and cancellation
A run that produces no result at all (spawn/IO failure) can't be judged by
stop_when; the policy treats it as a crash and restarts (with backoff)
unless the policy is Never or the budget is exhausted — then the error
itself surfaces as run()'s Err.
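
The rule for result-less runs reduces to a two-way choice. The sketch below models it with hypothetical names (`Policy`, `Next`, `after_spawn_error`); it is an illustration of the paragraph above, not processkit's code, and stop_when does not appear because there is no result for it to inspect.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Policy {
    OnCrash,
    Always,
    Never,
}

#[derive(Debug, PartialEq)]
enum Next {
    /// Sleep for the backoff, then run again.
    Restart,
    /// Give up: the error becomes the Err returned by run().
    SurfaceError,
}

/// Decides what follows a run that produced no result (spawn/IO failure).
/// `max_restarts` of `None` models the unlimited default.
fn after_spawn_error(policy: Policy, restarts_done: u32, max_restarts: Option<u32>) -> Next {
    let budget_left = max_restarts.map_or(true, |max| restarts_done < max);
    if policy != Policy::Never && budget_left {
        Next::Restart
    } else {
        Next::SurfaceError
    }
}

fn main() {
    // A failed spawn counts as a crash, so OnCrash and Always both retry it.
    assert_eq!(after_spawn_error(Policy::OnCrash, 0, None), Next::Restart);
    assert_eq!(after_spawn_error(Policy::Always, 2, Some(5)), Next::Restart);
    // Never, or an exhausted budget, surfaces the error instead.
    assert_eq!(after_spawn_error(Policy::Never, 0, None), Next::SurfaceError);
    assert_eq!(after_spawn_error(Policy::OnCrash, 5, Some(5)), Next::SurfaceError);
    println!("spawn-error model ok");
}
```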
A cancelled incarnation is
terminal: run() returns
Err(Error::Cancelled) immediately. The token never un-cancels, so a restart
could only produce another instantly-cancelled run — the supervisor refuses
the futile loop.
Next: Testing your code · Timeouts, retries & cancellation · Process groups