Skip to main content

State machine

Everything ferrus does is modeled as an explicit state machine. There is no hidden context — the current state lives in .ferrus/STATE.json on disk and agents are stateless between spawns.

States and transitions

Idle
└─► Executing ← create_task (Supervisor)
└─► Checking ← check (Executor, pass)
├─► Consultation ← consult (Executor)
│ └─► (restore previous state) ← wait_for_consult
├─► [FAIL, retries < max] Addressing → check again
├─► [FAIL, retries ≥ max] Failed
└─► Reviewing ← submit (Executor)
├─► [REJECT] Addressing → check loop (retries reset)
│ └─► [cycles ≥ max] Failed
└─► Complete ← approve (Supervisor)

Every transition is explicit and gated by a tool call from the right role — supervisors can't submit, executors can't approve.

State catalog

StateMeaning
IdleNo active task. STATE.json exists with state = Idle.
ExecutingExecutor is implementing the task described in TASK.md.
AddressingChecks failed with retries remaining, or reviewer rejected — executor is fixing.
ConsultationExecutor paused to ask the supervisor a question via consult.
AwaitingHumanAgent paused to ask a human a question via ask_human.
ReviewingExecutor submitted work; supervisor is reading SUBMISSION.md.
CompleteSupervisor approved the submission. Next /task starts cleanly.
FailedRetries or review cycles exhausted. /reset required to clear.

Retry and review budgets

Two counters guard the loop:

  • check_retries — consecutive check failures. Reset to 0 on each successful pass and on each rejected submission. Exhausting this budget moves the task to Failed.
  • review_cycles — number of reject → fix round trips. Exhausting this budget also moves the task to Failed.

Both limits live in ferrus.toml under [limits]. See Configuration.

Crash safety

STATE.json is always written atomically — ferrus writes to STATE.json.tmp and then renames. A crash mid-write leaves the previous valid state in place.

If an agent process dies, the lease attached to its state entry expires after lease.ttl_secs. You can then /resume the executor headlessly and the loop picks up exactly where it left off.

Leases

Only one executor works on a task at a time. When an executor claims a state, it writes its identity (executor:codex:1) into claimed_by and starts calling heartbeat every heartbeat_interval_secs. If heartbeats stop for longer than ttl_secs, the claim is considered dead and can be taken by a new process.

This is the mechanism that makes ferrus restart-safe: crashes, terminal closes, or power losses don't leave the state machine stuck.

Why a state machine, not a chat?

Chat-based agents are stateful by accident — context accumulates in a conversation window and any crash or restart loses it. ferrus flips this:

  • State lives on disk, in plain text.
  • Agents are stateless between runs; each spawn is handed a short prompt that points to the files it needs.
  • Lifecycle is explicit: you can look at STATE.json and know exactly what's happening.

This is what "deterministic orchestration" means in practice — not that the agents are deterministic, but that ferrus is.