Supervision

Where retry answers "run this once, replaying on failure", a Supervisor answers the different question "keep this alive": restart a child per policy whenever it exits, with bounded restarts, exponential backoff, and jitter — a minimal runit/systemd-style keeper, platform-agnostic because it sits entirely on the ProcessRunner seam.

The shape

use processkit::{Command, RestartPolicy, Supervisor};
use std::time::Duration;

#[tokio::main]
async fn main() -> processkit::Result<()> {
    let outcome = Supervisor::new(Command::new("my-server").args(["--port", "8080"]))
        .restart(RestartPolicy::OnCrash)           // default
        .max_restarts(5)                           // default: unlimited
        .backoff(Duration::from_millis(200), 2.0)  // default: 200ms × 2.0
        .max_backoff(Duration::from_secs(30))      // default: 30s cap
        .jitter(true)                              // default: on
        .stop_when(|res| res.code() == Some(0))    // optional exit condition
        .run()
        .await?;

    println!(
        "ended after {} restarts, reason: {:?}, last exit: {:?}",
        outcome.restarts, outcome.stopped, outcome.final_result.code(),
    );
    Ok(())
}

Each incarnation is one full captured run of the command (so the command's own timeout, stdin, env, … all apply per run — with the usual one-shot-stdin caveat for the second run onward).

Policies: what counts as a crash

A crash is any run that is not a success (ProcessResult::is_success, which honors the command's ok_codes): an exit code outside the accepted set (default {0}), a timeout, a signal-kill, or a spawn failure. A command with ok_codes([0, 2]) that exits 2 is a success, so OnCrash treats it as clean, not a crash.

RestartPolicy      | Restarts after…
OnCrash (default)  | crashes only; a clean exit ends supervision (PolicySatisfied)
Always             | every completed run, clean or not; pair it with stop_when/max_restarts or it loops forever
Never              | nothing: one run, reported as-is
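
The classification and the policy table above can be sketched as two small pure functions. This is an illustrative model only: RunEnd, Policy, is_crash, and policy_restarts are hypothetical names invented for the sketch, not processkit types, and the real library derives the verdict from ProcessResult::is_success.

```rust
/// How one run ended. Hypothetical stand-in for what a ProcessResult
/// (or a spawn error) would tell the supervisor; not a processkit type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RunEnd {
    Exited(i32),
    TimedOut,
    Signaled,
    SpawnFailed,
}

/// Local mirror of the three restart policies, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Policy {
    OnCrash,
    Always,
    Never,
}

/// A run is a crash unless it exited with a code in the accepted set.
fn is_crash(end: RunEnd, ok_codes: &[i32]) -> bool {
    match end {
        RunEnd::Exited(code) => !ok_codes.contains(&code),
        RunEnd::TimedOut | RunEnd::Signaled | RunEnd::SpawnFailed => true,
    }
}

/// Whether the policy alone asks for another run (stop_when and
/// max_restarts are separate gates and are not modelled here).
fn policy_restarts(policy: Policy, crashed: bool) -> bool {
    match policy {
        Policy::OnCrash => crashed,
        Policy::Always => true,
        Policy::Never => false,
    }
}
```

Under this model, is_crash(RunEnd::Exited(2), &[0, 2]) is false, so an OnCrash supervisor would treat that exit as clean and stop.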

Backoff and jitter

The n-th restart (0-based) sleeps

delay(n) = min(base × factor^n, max_backoff) × jitter

with jitter drawn uniformly from [0.5, 1.5) per restart. Jitter is on by default so a fleet of supervised workers restarted by the same incident doesn't stampede back in lockstep; jitter(false) gives deterministic delays (useful in tests with a paused tokio clock). A non-finite or < 1.0 factor is treated as 1.0 — constant delay, never a shrinking one.

base=200ms, factor=2.0, cap=30s:
restart #0 → ~200ms   #1 → ~400ms   #2 → ~800ms … #7 → ~25.6s   #8+ → ~30s (capped; jitter still applies)
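
The delay formula can be reproduced with a few lines of standard-library Rust. The function below is a sketch, not processkit's implementation: restart_delay is a hypothetical name, and the jitter multiplier is passed in explicitly (1.0 means no jitter) so the result is deterministic and needs no random-number crate.

```rust
use std::time::Duration;

/// delay(n) = min(base × factor^n, max_backoff) × jitter
///
/// `jitter` is the multiplier for this restart; the supervisor described
/// above would draw it from [0.5, 1.5), here the caller supplies it.
fn restart_delay(
    n: u32,
    base: Duration,
    factor: f64,
    max_backoff: Duration,
    jitter: f64,
) -> Duration {
    // A non-finite or < 1.0 factor is treated as 1.0 (constant delay).
    let factor = if factor.is_finite() && factor >= 1.0 { factor } else { 1.0 };
    // Clamp the exponent so the cast to i32 cannot wrap for huge n.
    let exponent = n.min(i32::MAX as u32) as i32;
    let uncapped = base.as_secs_f64() * factor.powi(exponent);
    // An overflow to infinity is harmless: min() brings it back to the cap.
    let capped = uncapped.min(max_backoff.as_secs_f64());
    Duration::from_secs_f64(capped * jitter)
}
```

Because the cap is applied before the jitter multiplier, a capped delay can still land anywhere between half and one and a half times max_backoff when jitter is on.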

Failure storms

Backoff spaces individual restarts; max_restarts is a lifetime cap. Neither distinguishes a service that fails once a day from one that is suddenly crash-looping. The opt-in storm guard does (a design borrowed from Go's suture supervisor — the idea, not the code):

use processkit::{Command, Supervisor};
use std::time::Duration;

#[tokio::main]
async fn main() -> processkit::Result<()> {
    let outcome = Supervisor::new(Command::new("worker"))
        .storm_pause(Duration::from_secs(15))     // master switch; off by default
        .failure_decay(Duration::from_secs(30))   // score half-life (default 30s)
        .failure_threshold(5.0)                   // trip point (default 5.0)
        .run()
        .await?;

    println!("storm pauses taken: {}", outcome.storm_pauses);
    Ok(())
}

Each failed run adds 1 to a score that halves every failure_decay:

score = score × 0.5^(Δt / failure_decay) + 1

  • Fails rarely: the score decays toward 0 between failures, so each new failure brings it back to roughly 1 and it never reaches the threshold; the guard stays out of the way.
  • Failure storm: failures arrive faster than the half-life drains them, the score climbs past failure_threshold, and the supervisor takes one collective pause of storm_pause (jittered into [0.5, 1.5) like the backoff), resets the score, and resumes.

Only failures feed the score — crashes and spawn errors — not clean exits restarted under RestartPolicy::Always. The pause stacks with (runs before) the per-restart backoff, and the max_restarts budget is checked first, so a storm pause never extends an exhausted budget. Pauses taken are reported in SupervisionOutcome::storm_pauses.
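
To get a feel for when the guard trips, the score update can be simulated in isolation. The sketch below is a hypothetical model, not processkit's internals: StormScore and failures_until_trip are made-up names, the trip condition is assumed to be "strictly greater than the threshold", the reset is assumed to go back to 0, and the jittered pause itself is left out.

```rust
/// Hypothetical model of the storm guard's decaying failure score.
struct StormScore {
    score: f64,
    half_life_secs: f64,
    threshold: f64,
}

impl StormScore {
    fn new(half_life_secs: f64, threshold: f64) -> Self {
        StormScore { score: 0.0, half_life_secs, threshold }
    }

    /// Records one failure that happened `dt_secs` after the previous one.
    /// Returns true when the guard trips; the score is then reset
    /// (assumed: to 0).
    fn record_failure(&mut self, dt_secs: f64) -> bool {
        // score = score × 0.5^(Δt / failure_decay) + 1
        self.score = self.score * 0.5f64.powf(dt_secs / self.half_life_secs) + 1.0;
        if self.score > self.threshold {
            self.score = 0.0;
            true
        } else {
            false
        }
    }
}

/// Feeds evenly spaced failures and reports which one (1-based) trips the
/// guard, or None if none of the first `limit` failures does.
fn failures_until_trip(
    dt_secs: f64,
    half_life_secs: f64,
    threshold: f64,
    limit: u32,
) -> Option<u32> {
    let mut guard = StormScore::new(half_life_secs, threshold);
    (1..=limit).find(|_| guard.record_failure(dt_secs))
}
```

With the defaults (30s half-life, threshold 5.0), evenly spaced failures settle at a score of 1 / (1 − 0.5^(Δt / 30s)), so under this model the guard only trips when failures arrive closer together than roughly 9.7s; one failure per minute levels off near 1.33.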

Stopping

Three gates, checked in this order after every completed run:

  1. stop_when(predicate) — sees the run's ProcessResult; returning true ends supervision regardless of policy (→ StopReason::Predicate). "Exit 0 is done, anything else is a crash" is the classic: stop_when(|res| res.code() == Some(0)) under RestartPolicy::Always.
  2. The policy: OnCrash stops on a clean exit (→ PolicySatisfied).
  3. max_restarts(n) — at most n restarts = n + 1 total runs; an exhausted budget reports the last result (→ RestartsExhausted). max_restarts(0) means exactly one run.
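
The gate order above can be written down as a single decision function. This is a sketch under stated assumptions: Step and after_run are hypothetical names, StopReason is a local mirror of the variants named above rather than the processkit type, and the sketch reports PolicySatisfied whenever the policy declines a restart, which the text only states for OnCrash on a clean exit.

```rust
/// Local mirror of the stop reasons named above, for illustration only.
#[derive(Debug, PartialEq)]
enum StopReason {
    Predicate,
    PolicySatisfied,
    RestartsExhausted,
}

/// What the supervisor does after a completed run (hypothetical type).
#[derive(Debug, PartialEq)]
enum Step {
    Stop(StopReason),
    Restart,
}

/// Applies the three gates in order: predicate, policy, restart budget.
/// `max_restarts` of None models the unlimited default.
fn after_run(
    predicate_says_stop: bool,
    policy_wants_restart: bool,
    restarts_so_far: u32,
    max_restarts: Option<u32>,
) -> Step {
    // Gate 1: stop_when wins regardless of policy.
    if predicate_says_stop {
        return Step::Stop(StopReason::Predicate);
    }
    // Gate 2: the policy may be satisfied (e.g. OnCrash after a clean exit).
    if !policy_wants_restart {
        return Step::Stop(StopReason::PolicySatisfied);
    }
    // Gate 3: at most n restarts, i.e. n + 1 total runs.
    if let Some(max) = max_restarts {
        if restarts_so_far >= max {
            return Step::Stop(StopReason::RestartsExhausted);
        }
    }
    Step::Restart
}
```

In this model max_restarts(0) corresponds to after_run(false, true, 0, Some(0)), which stops with RestartsExhausted after the very first run.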

Outcomes

run() resolves to a SupervisionOutcome:

use processkit::{Command, Supervisor};

#[tokio::main]
async fn main() -> processkit::Result<()> {
    let outcome = Supervisor::new(Command::new("job")).run().await?;

    outcome.final_result; // ProcessResult<String> of the LAST run
    outcome.restarts;     // how many restarts happened (not counting run #1)
    outcome.stopped;      // StopReason::{Predicate, PolicySatisfied, RestartsExhausted}
    outcome.storm_pauses; // failure-storm pauses taken (0 unless storm_pause is set)
    Ok(())
}

Note: run() returning Ok does not mean the child succeeded; it only means supervision concluded. Inspect final_result (or call ensure_success() on it) for the child's own verdict.

Supervising inside a shared group

The supervisor runs through any ProcessRunner. The headline production variant injects a ProcessGroup so every incarnation — and everything it spawns — lives in one kill-on-drop container:

use processkit::{Command, ProcessGroup, RestartPolicy, Supervisor};

#[tokio::main]
async fn main() -> processkit::Result<()> {
    let group = ProcessGroup::new()?;

    let outcome = Supervisor::new(Command::new("worker"))
        .with_runner(&group)                 // &group is itself a ProcessRunner
        .restart(RestartPolicy::OnCrash)
        .max_restarts(10)
        .run()
        .await?;

    // The group outlives supervision: drop it (or shutdown) to reap any strays.
    Ok(())
}

Mind one interaction: don't supervise into a group you've suspended — under the cgroup mechanism the restarted child would start frozen (and the spawn itself can block). Resume first.

The same injection point makes supervision logic hermetically testable — script a sequence of fake results and assert the restart/stop behavior with no real process; see Testing your code.

Errors and cancellation

A run that produces no result at all (spawn/IO failure) can't be judged by stop_when; the policy treats it as a crash and restarts (with backoff) unless the policy is Never or the budget is exhausted — then the error itself surfaces as run()'s Err.

A cancelled incarnation is terminal: run() returns Err(Error::Cancelled) immediately. The token never un-cancels, so a restart could only produce another instantly-cancelled run — the supervisor refuses the futile loop.


Next: Testing your code · Timeouts, retries & cancellation · Process groups