Timeouts, retries & cancellation

Three ways a run ends early, with three different philosophies:

a timeout is data — the deadline was part of the run's contract, so its expiry is captured in the result (and only the success-checking verbs turn it into an error);
a retry is a policy — the success-checking verbs replay the run while your classifier says the failure is transient;
a cancellation is an abandonment — the caller changed its mind, so every path reports an error; there is no result worth inspecting.
Timeouts
Retries
Cancellation
Precedence and interactions

Timeouts

Command::timeout(d) kills the whole process tree at the deadline — not just the direct child, so a wrapper script's grandchildren die too.

#![allow(unused)]
fn main() {
use processkit::Command;
use std::time::Duration;

// Captured: inspect the flag yourself.
let result = Command::new("slow-tool")
    .timeout(Duration::from_secs(5))
    .output_string()
    .await?;
if result.timed_out() {
    println!("partial output before the kill: {}", result.stdout());
}

// Raised: the checking verbs convert the flag into a typed error.
let err = Command::new("slow-tool")
    .timeout(Duration::from_secs(5))
    .run()
    .await
    .unwrap_err();
assert!(matches!(err, processkit::Error::Timeout { .. }));
}

Where each verb lands:

Verb	Deadline expiry becomes
`output_string()` / `output_bytes()`	`Ok` result with `timed_out() == true`, `code() == None`, partial output kept
`run()` / `exit_code()` / `probe()` / `checked()`	`Error::Timeout { program, timeout, stdout, stderr }` — the partial output captured before the kill is attached (`err.diagnostic()` surfaces a hung tool's last words)
`first_line(pred)`	`Error::Timeout` (the line never arrived in time)
`start()` + streaming	the stream ends at the deadline (tree killed, pipes closed); `finish` then reports the kill (`outcome == Outcome::TimedOut`)
`ensure_success()` on a captured result	`Error::Timeout`, checked before the exit code
`Pipeline`	chain deadline → `timed_out` result; per-stage deadlines fold into pipefail

Error::Timeout is #[non_exhaustive] (like Exit and Signalled), so read it through accessors instead of destructuring { program, timeout, stdout, stderr } — a full literal no longer compiles, and a later field would break one. Alongside the lossy-UTF-8 err.stdout() / err.stderr() strings it now carries err.stdout_bytes() -> Option<&[u8]>: the exact captured bytes when the failure rose from a byte path (output_bytes().await?.ensure_success()?), and None on the text verbs (run / output_string / checked), whose decoded stdout is already complete. The full error table lists every variant.

Two distinct deadline families to keep apart:

Command::timeout — the run's own contract, this section.
The readiness probes' within parameter — gives Error::NotReady and never kills the child.

Optional and explicitly-unbounded deadlines

A CliClient's default_timeout is a gap-fill: it lends its deadline to any command that didn't set one, and a per-command timeout overrides it. Two verbs shape that fill:

Command::no_timeout() marks a command explicitly unbounded — a tail -f, a watch loop — so the client's default_timeout leaves it alone. That is not the same as simply never setting a timeout (which the client does fill): you are opting out on purpose. The last of timeout / no_timeout on the chain wins, so a late no_timeout() clears an earlier deadline and a late timeout(d) restores one.
Command::timeout_opt(Option<Duration>) folds a config value straight in: Some(d) is exactly timeout(d), None is exactly no_timeout() — the deliberate opt-out, not a silent fall back to the client default. It collapses the match cfg { Some(d) => c.timeout(d), None => c.no_timeout() } dance at a config-driven call site into one call.

#![allow(unused)]
fn main() {
use processkit::Command;
use std::time::Duration;

// Config-driven deadline: `Some(d)` behaves like `timeout(d)`, `None` like
// `no_timeout()` (stay unbounded — do NOT fall back to a client default_timeout).
let deadline: Option<Duration> = None;
Command::new("worker")
    .timeout_opt(deadline)
    .run()
    .await?;

// A follower that must outlive an ambient default_timeout — opt out explicitly:
Command::new("tail").args(["-f", "app.log"]).no_timeout().run().await?;
}

Graceful timeout

By default the deadline hard-kills at once. Add timeout_grace(d) to give the tree a chance to clean up: at the deadline it is sent SIGTERM (or the signal chosen with timeout_signal, which needs the process-control feature), allowed up to the grace window to exit, then SIGKILLed — the same SIGTERM → wait → SIGKILL tier as ProcessGroup::shutdown. A signal-handling child that exits ends the grace early.

#![allow(unused)]
fn main() {
use processkit::Command;
use std::time::Duration;

let result = Command::new("slow-tool")
    .timeout(Duration::from_secs(30))
    .timeout_grace(Duration::from_secs(5)) // SIGTERM, wait up to 5s, then SIGKILL
    .output_string()
    .await?;
}

timed_out() is true regardless of whether the child exited on the signal or was SIGKILLed after the grace — the deadline is what fired. Windows has no signal tier: timeout_grace is accepted but the deadline kills the job atomically.

The explicit RunningProcess::shutdown(grace) verb (stop a started handle on demand) composes with a Command::timeout: its own SIGTERM → grace → SIGKILL is the single teardown (it does not also fire the run's timeout teardown), and if the deadline has already elapsed when you call shutdown, the outcome is reported as Outcome::TimedOut — the grace you pass governs the teardown timing.

Retries

retry(max_attempts, backoff, classifier) replays a failed run — up to max_attempts total attempts, sleeping backoff between tries, retrying only while the classifier accepts the error:

#![allow(unused)]
fn main() {
use processkit::{Command, Error};
use std::time::Duration;

let out = Command::new("curl")
    .args(["-fsS", "https://example.com/api"])
    .timeout(Duration::from_secs(10))
    .retry(3, Duration::from_millis(250), |e| {
        // transient: network timeouts and curl's "couldn't connect" (7)
        matches!(e, Error::Timeout { .. })
            || matches!(e, Error::Exit { code: 7, .. })
    })
    .run()
    .await?;
}

Ground rules:

Retries apply to the success-checking paths only (run, exit_code, probe, ProcessRunnerExt::checked — and everything built on them, e.g. CliClient). The non-erroring output_string capture never retries: it didn't fail.
The classifier sees the typed error — match on variants, codes, even the captured stderr.
Each attempt re-runs the same Command: a one-shot stdin source (table) is consumed by attempt #1, so attempt #2 fails loud with an Error::Io (InvalidInput) at launch (D10) rather than silently feeding empty stdin. Use reusable sources for retried commands.
A Cancelled error is never retried, classifier or not — the token stays cancelled forever, so another attempt could only fail the same way.

Exponential backoff with `RetryPolicy`

retry's backoff is a fixed delay — the same pause between every attempt, unchanged. For network-style load, reach for retry_with(policy, classifier) and a RetryPolicy: exponential growth, a per-delay cap, and optional AWS-style full jitter (each wait spread uniformly over [0, delay], so a herd of retriers doesn't resynchronize). It drives the same retry loop as retry — only the delay schedule differs.

#![allow(unused)]
fn main() {
use processkit::{Command, Error, RetryPolicy};
use std::time::Duration;

let policy = RetryPolicy::new()   // == RetryPolicy::default()
    .max_retries(5)               // 5 retries after the first → 6 attempts total
    .initial_backoff(Duration::from_millis(100))
    .multiplier(2.0)              // 100ms, 200ms, 400ms, … (before cap + jitter)
    .max_backoff(Duration::from_secs(30))
    .jitter(true);

let out = Command::new("curl")
    .args(["-fsS", "https://example.com/api"])
    .timeout(Duration::from_secs(10))
    .retry_with(policy, |e| matches!(e, Error::Timeout { .. }))
    .run()
    .await?;
}

RetryPolicy is #[non_exhaustive]; build it with new() / default() and the setters above. Mind the counting difference: retry(n, …) counts n total attempts, but a RetryPolicy counts max_retries — the runs after the first — so retry(3, …) corresponds to RetryPolicy::new().max_retries(2). Default is a sane cloud client: 3 retries (4 attempts) / 100 ms initial / ×2 / 30 s cap / jitter on.

Client-wide retries and opting out

Set a policy once on a CliClient and every verb it runs retries on it — the client-wide analogue of a per-command retry:

#![allow(unused)]
fn main() {
use processkit::{CliClient, RetryPolicy};

let gh = CliClient::new("gh")
    .default_retry(RetryPolicy::default(), |e| e.is_transient());
// every gh.run(...) / gh.checked(...) now retries a transient failure
}

A per-command retry or retry_with wins over the client default (gap-fill, not override — the same rule as timeout vs. default_timeout). To make one command a single shot despite the default, chain Command::retry_never(): it runs exactly once and suppresses the fill, symmetric with no_timeout() (and tidier than the retry(1, Duration::ZERO, |_| false) idiom it replaces).

For "keep it alive" (restart a service whenever it exits) rather than "replay this one operation", use a Supervisor — same backoff shape, different loop condition.

Every backoff sleep is cancelable: it is raced against the command's cancellation token, so a cancel landing mid-backoff resolves promptly with Error::Cancelled instead of waiting out a (possibly 30 s) delay. The same holds for a Supervisor's restart backoff and its storm-pause windows.

Cancellation

Hand any command a CancellationToken (re-exported at the crate root); cancelling the token kills the run's tree and makes every consuming path report Error::Cancelled:

use processkit::{CancellationToken, Command};

#[tokio::main]
async fn main() -> processkit::Result<()> {
    let shutdown = CancellationToken::new();

    // Wire the same parent token into many jobs via child tokens:
    let job = tokio::spawn({
        let token = shutdown.child_token();
        async move {
            Command::new("long-export").cancel_on(token).run().await
        }
    });

    // Ctrl-C handler, sibling failure, UI button, …
    shutdown.cancel();

    assert!(matches!(
        job.await.unwrap(),
        Err(processkit::Error::Cancelled { .. })
    ));
    Ok(())
}

The contract, path by path:

Situation	Behavior
Cancel during `run` / `output_string` / `output_bytes` / `wait` / `profile` / `exit_code` / `probe`	tree killed, `Error::Cancelled { program }`
Cancel during streaming (`stdout_lines`)	the stream ends; the following `finish` reports `Error::Cancelled`
Token already cancelled before the run	short-circuits before spawning — no process is ever created
Cancel on a shared-`ProcessGroup` handle	kills the child itself, leaves the group's siblings alone (same scope as a timeout)
A `Pipeline` stage's token cancels	that stage dies; the cancellation errors the whole pipeline and the private group reaps the other stages
Under `retry`	terminal — never retried, whatever the classifier says
Under a `Supervisor`	terminal — supervision returns `Err(Cancelled)` instead of restarting into a still-cancelled token
`wait_any` mid-run	surfaces `Err(Cancelled)` — each racer's wait path resolves to `Cancelled` when its token fires, the same as a bulk verb (a pre-cancelled token still hits the pre-spawn short-circuit)
`first_line` mid-run	surfaces `Error::Cancelled` once the token fires — a cancelled stream that closes without a match is reported as cancellation, not `Ok(None)`

Client-level default

A typed wrapper built on CliClient usually constructs and consumes its Commands internally — there is no place to chain a per-call cancel_on. Set the token once on the client; every command it builds carries it:

#![allow(unused)]
fn main() {
use processkit::{CancellationToken, CliClient};

let token = CancellationToken::new();
let gh = CliClient::new("gh").default_cancel_on(token.child_token());
// ... controller cancels `token` → every in-flight command of THIS client
// dies (whole tree), surfacing Error::Cancelled to the awaiting call.
}

Clients are cheap — scope cancellation by building one client per cancellable scope with its own (child) token, instead of threading tokens through call signatures. cli_client!-generated wrappers re-emit the builder, so Git::new().default_cancel_on(t) works for downstream crates too.

Precedence: a per-command cancel_on chained on a built command replaces the client default (explicit beats default, like a per-command timeout after default_timeout). To honor both sources, wire it explicitly — CancellationToken has no built-in merge: derive a child of the default (let c = default.child_token()), hand the command cancel_on(c.clone()), and have the second source call c.cancel(). Or simpler: build a dedicated client per scope.

Precedence and interactions

Timeout vs. cancellation. A timeout is captured; a cancellation is always an error. When both land on the same run, cancellation wins — you asked the run to stop mattering, so no result is synthesized:

#![allow(unused)]
fn main() {
use processkit::{CancellationToken, Command};
use std::time::Duration;

let token = CancellationToken::new();
token.cancel();

let err = Command::new("tool")
    .timeout(Duration::from_millis(1))   // would have been a Timeout…
    .cancel_on(token)                    // …but cancellation takes priority
    .run()
    .await
    .unwrap_err();
assert!(matches!(err, processkit::Error::Cancelled { .. }));
}

Which knob for which job:

You want	Reach for
"This run may not take longer than X"	`Command::timeout`
"This operation is flaky, try a few times"	`Command::retry`
"Stop everything when the app shuts down"	`cancel_on` + one shared token
"Keep this service alive across crashes"	`Supervisor`
"Tell me when it's ready, don't kill it"	readiness probes

Next: Supervision · Streaming & interactive I/O · Running commands

ProcessKit