Killing a Process-Handshake Deadlock with Rust Typestate

The symptom

loopal is a multi-process agent runtime: a thin parent spawns a hub subprocess in --hub-only mode, and the hub in turn spawns an agent-server. The parent waits on the child's stdout for a handshake line before it considers startup complete. When a user had an MCP server configured, startup hung for 30 seconds and then died with:

hub child did not produce a handshake within 30s

The 30s is the parent-side HANDSHAKE_TIMEOUT in hub_spawn.rs. Hitting it means the child never printed its handshake at all — not slow, fully stuck.

The root cause: send-before-receive

Bootstrap ran in this order (hub_bootstrap.rs, pre-fix):

let root_session_id = client.start_agent(&params).await?;   // (1) blocks

let (root_conn, incoming_rx) = client.into_parts();
loopal_agent_hub::agent_io::start_agent_io(                  // (2) starts consumer
    hub.clone(), ROOT_AGENT_NAME, root_conn, incoming_rx,
);

start_agent is a request to the agent-server. While handling it, the agent-server fires reverse IPC back at the hub — hub/mcp/snapshot, hub/secret/* — to discover MCP tools and resolve secrets. Those reverse requests land in incoming_rx.

But incoming_rx is only drained by start_agent_io, which is step (2) — and step (2) does not run until step (1) returns. The agent-server blocks waiting for the hub to answer its reverse call; the hub blocks at start_agent().await waiting for the agent-server to answer. Classic two-party deadlock. With no MCP configured there was nothing to snapshot, so the reverse call never fired and the bug stayed hidden.

A tokio::time::sleep before start_agent, or a retry, would have masked it. Neither removes the ordering hazard: the next person to add a reverse call on the bootstrap path reintroduces the hang.
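The hang can be reproduced in miniature without any of loopal's code. The sketch below is a toy model, not the real bootstrap: two std threads and channels stand in for the hub and the agent-server, the names bootstrap and consumer_first are hypothetical, and a short timeout replaces the real unbounded wait so the demo terminates.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// A reverse request carries the channel on which the hub must answer.
type ReverseRequest = mpsc::Sender<&'static str>;

/// Toy bootstrap. `consumer_first = false` models the pre-fix order:
/// wait for the start reply, and only then start draining reverse requests.
fn bootstrap(consumer_first: bool) -> Result<&'static str, &'static str> {
    let (incoming_tx, incoming_rx) = mpsc::channel::<ReverseRequest>();
    let (reply_tx, reply_rx) = mpsc::channel::<&'static str>();
    let wait = Duration::from_millis(500); // stands in for "forever"

    // Toy agent-server: it needs a reverse answer before it can reply.
    thread::spawn(move || {
        let (answer_tx, answer_rx) = mpsc::channel();
        let _ = incoming_tx.send(answer_tx);
        if answer_rx.recv_timeout(wait).is_ok() {
            let _ = reply_tx.send("session-1");
        }
    });

    // Toy consumer: answers every reverse request it sees.
    let start_consumer = move || {
        thread::spawn(move || {
            for answer_tx in incoming_rx {
                let _ = answer_tx.send("mcp snapshot");
            }
        });
    };

    let wait_for_reply = || reply_rx.recv_timeout(wait).map_err(|_| "start timed out");

    if consumer_first {
        start_consumer();
        wait_for_reply()
    } else {
        let result = wait_for_reply(); // nobody is draining incoming_rx yet
        start_consumer();
        result
    }
}
```

In this model, bootstrap(false) can only time out, because the agent is waiting on the very consumer that the hub has not started yet; bootstrap(true) completes as soon as the reverse request is answered.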

Fix 1: the connection can't send before it reads

The IPC Connection now carries its lifecycle in the type. Two zero-sized marker types, one generic parameter:

pub struct Inactive;
pub struct Listening;

pub struct Connection<S = Inactive> {
    transport: Arc<dyn Transport>,
    pending: PendingMap,
    next_id: AtomicI64,
    _state: PhantomData<S>,
}

send_request, respond, and respond_error are implemented only for Connection<Listening>. The single way to obtain a Listening connection is to consume an Inactive one, and that constructor is what spawns the reader loop:

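impl Connection<Inactive> {
    pub fn into_listening(self) -> (Arc<Connection<Listening>>, mpsc::Receiver<Incoming>) {
        let rx = spawn_reader_loop(self.transport.clone(), self.pending.clone());
        // Move every field across unchanged; only the state marker differs.
        let listening = Connection {
            transport: self.transport,
            pending: self.pending,
            next_id: self.next_id,
            _state: PhantomData::<Listening>,
        };
        (Arc::new(listening), rx)
    }
}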

You cannot call send_request on a connection whose reader has not started — there is no such method on Connection<Inactive>. The old free-standing start() escape hatch was deleted. "Send before the reader is up" went from a runtime hang to a type error.
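As a self-contained illustration of the same pattern, here is a minimal sketch that uses only the standard library. It is not loopal's Connection: the Conn type, its wire_tx and wire_rx fields, and loopback_roundtrip are hypothetical, and a std channel plays the role of the transport.

```rust
use std::marker::PhantomData;
use std::sync::mpsc;
use std::thread;

pub struct Inactive;
pub struct Listening;

/// Toy connection: `wire_tx` is the write half, `wire_rx` the unread read half.
pub struct Conn<S = Inactive> {
    wire_tx: mpsc::Sender<String>,
    wire_rx: Option<mpsc::Receiver<String>>,
    _state: PhantomData<S>,
}

impl Conn<Inactive> {
    pub fn new(wire_tx: mpsc::Sender<String>, wire_rx: mpsc::Receiver<String>) -> Self {
        Conn { wire_tx, wire_rx: Some(wire_rx), _state: PhantomData }
    }

    /// Consumes the inactive connection and starts the reader thread.
    pub fn into_listening(mut self) -> (Conn<Listening>, mpsc::Receiver<String>) {
        let wire_rx = self.wire_rx.take().expect("read half is present while Inactive");
        let (incoming_tx, incoming_rx) = mpsc::channel();
        thread::spawn(move || {
            // Forward everything read from the wire until either side closes.
            for msg in wire_rx {
                if incoming_tx.send(msg).is_err() {
                    break;
                }
            }
        });
        let conn = Conn { wire_tx: self.wire_tx, wire_rx: None, _state: PhantomData };
        (conn, incoming_rx)
    }
}

impl Conn<Listening> {
    /// Sending exists only in the Listening state.
    pub fn send_request(&self, body: &str) -> Result<(), mpsc::SendError<String>> {
        self.wire_tx.send(body.to_string())
    }
}

/// Loopback demo: what is written to the wire comes back through the reader.
pub fn loopback_roundtrip() -> String {
    let (wire_tx, wire_rx) = mpsc::channel();
    let inactive = Conn::<Inactive>::new(wire_tx, wire_rx);
    // inactive.send_request("ping"); // does not compile: no such method on Conn<Inactive>
    let (conn, incoming) = inactive.into_listening();
    conn.send_request("ping").expect("reader thread holds the read half");
    incoming.recv().expect("reader thread forwards the message")
}
```

Uncommenting the marked line turns the ordering mistake into a compile error, which is the whole point of the state parameter.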

Fix 2: the bootstrap order is a typestate chain

The same idea, one level up. Bootstrap is now five states, each a distinct struct, each transition consuming self:

HubBuilt → ListenerBound → DispatcherReady → AgentSpawned → Ready

let bs = HubBuilt::new(cwd, config).await;
let bs = bs.bind_listener().await?;          // -> ListenerBound
if let Some(tx) = alive_tx {
    let _ = tx.send(bs.alive_info());        // ALIVE the instant the listener binds
}
let bs = bs.register_handlers(cli).await?;   // -> DispatcherReady
let bs = bs.spawn_agent_process().await?;    // -> AgentSpawned (consumer now draining)
let bs = bs.start_root_agent(&params).await?; // -> Ready

You cannot skip or reorder a step: start_root_agent exists only on AgentSpawned, which only spawn_agent_process produces, which only exists on DispatcherReady, and so on. The compiler walks the chain for you.
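A stripped-down, synchronous sketch of such a chain follows. The state and method names mirror the ones above, but the bodies are placeholders: each state just carries a log of completed steps, and record and run_bootstrap are hypothetical helpers added for illustration.

```rust
/// Each state owns the log of steps completed so far.
pub struct HubBuilt(Vec<&'static str>);
pub struct ListenerBound(Vec<&'static str>);
pub struct DispatcherReady(Vec<&'static str>);
pub struct AgentSpawned(Vec<&'static str>);
pub struct Ready(pub Vec<&'static str>);

/// Appends one step name and hands the log to the next state.
fn record(mut log: Vec<&'static str>, step: &'static str) -> Vec<&'static str> {
    log.push(step);
    log
}

impl HubBuilt {
    pub fn new() -> Self {
        HubBuilt(Vec::new())
    }
    pub fn bind_listener(self) -> ListenerBound {
        ListenerBound(record(self.0, "bind_listener"))
    }
}

impl ListenerBound {
    /// Only this state can report liveness.
    pub fn alive_info(&self) -> &'static str {
        "listener bound"
    }
    pub fn register_handlers(self) -> DispatcherReady {
        DispatcherReady(record(self.0, "register_handlers"))
    }
}

impl DispatcherReady {
    pub fn spawn_agent_process(self) -> AgentSpawned {
        AgentSpawned(record(self.0, "spawn_agent_process"))
    }
}

impl AgentSpawned {
    pub fn start_root_agent(self) -> Ready {
        Ready(record(self.0, "start_root_agent"))
    }
}

/// Walks the chain in the only order the types allow.
pub fn run_bootstrap() -> Vec<&'static str> {
    let bs = HubBuilt::new();
    // bs.start_root_agent(); // does not compile: HubBuilt has no such method
    let bs = bs.bind_listener();
    let _alive = bs.alive_info();
    let bs = bs.register_handlers();
    let bs = bs.spawn_agent_process();
    bs.start_root_agent().0
}
```

Because every transition takes self by value, a state that has been advanced cannot be reused, so a step can be neither skipped nor repeated.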

The deadlock-relevant move lives in spawn_agent_process: it starts the agent IO loop and waits for the consumer to confirm it is draining before the next state can issue start_agent:

loopal_agent_hub::agent_io::start_agent_io(
    hub.clone(), dispatcher, ROOT_AGENT_NAME, conn.clone(), incoming_rx,
    Some(ready_tx),                 // consumer signals readiness on this oneshot
);
wait_consumer_ready(ready_rx).await?;   // 2s budget; reverse-IPC channel draining
// ... only now does AgentSpawned::start_root_agent send agent/start
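// The readiness handoff, sketched with std primitives only. This is an
// assumption-laden illustration, not loopal's code: the bodies of
// start_consumer and wait_consumer_ready below are hypothetical, a std
// channel stands in for the oneshot, and threads stand in for tasks.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Toy consumer: signals readiness, then drains `incoming` until it closes.
fn start_consumer(
    incoming: mpsc::Receiver<String>,
    ready_tx: mpsc::Sender<()>,
) -> thread::JoinHandle<Vec<String>> {
    thread::spawn(move || {
        let _ = ready_tx.send(()); // "I am about to drain"
        incoming.into_iter().collect()
    })
}

/// Waits for the readiness signal, failing once the budget is spent.
fn wait_consumer_ready(ready_rx: &mpsc::Receiver<()>, budget: Duration) -> Result<(), String> {
    ready_rx
        .recv_timeout(budget)
        .map_err(|e| format!("consumer not ready: {e}"))
}

/// Spawn, wait for readiness, and only then send the first message.
fn spawn_then_send() -> Result<Vec<String>, String> {
    let (incoming_tx, incoming_rx) = mpsc::channel();
    let (ready_tx, ready_rx) = mpsc::channel();
    let consumer = start_consumer(incoming_rx, ready_tx);
    wait_consumer_ready(&ready_rx, Duration::from_secs(2))?;
    incoming_tx
        .send("hub/mcp/snapshot".to_string())
        .map_err(|e| e.to_string())?;
    drop(incoming_tx); // close the channel so the consumer can finish
    consumer.join().map_err(|_| "consumer panicked".to_string())
}
```

// If the consumer never starts, or drops its sender without signalling,
// the wait returns an error instead of letting the next step proceed.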

alive_info() is deliberately a method on ListenerBound only. The ALIVE handshake fires the moment the TCP listener is bound, no longer gated behind the uplink connect or the agent spawn — so a slow agent start can never delay the liveness signal.

Fix 3: IPC cost is declared, never defaulted

The reverse calls that deadlocked were unbudgeted — they would wait forever. So every IPC callsite now takes an explicit budget, and the type has no Default:

#[derive(Debug, Clone, Copy)]
pub enum IpcBudget {
    Allowed(Duration),
    Forbidden,
}

pub const HUB_RPC_BUDGET: IpcBudget = IpcBudget::Allowed(Duration::from_secs(8));

// There is intentionally no `Default` impl: callers must explicitly choose
// between `Allowed(d)` (with a justified timeout) and `Forbidden`.

Forbidden is for code on a latency-critical path, such as bootstrap before the reverse channel is being drained. A remote MCP provider rejects such a call immediately instead of blocking:

IpcBudget::Forbidden => {
    return Err(format!("{method} rejected: IpcBudget::Forbidden on critical path"));
}
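// How a callsite might honor both variants, as a minimal std-only sketch.
// call_with_budget is a hypothetical helper, not loopal's API; it runs the
// call on a worker thread so that the Allowed case can be bounded.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, Clone, Copy)]
pub enum IpcBudget {
    Allowed(Duration),
    Forbidden,
}

/// Runs `call` under `budget`. Forbidden fails before any work starts.
pub fn call_with_budget<T, F>(budget: IpcBudget, method: &str, call: F) -> Result<T, String>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let limit = match budget {
        IpcBudget::Forbidden => {
            return Err(format!("{method} rejected: IpcBudget::Forbidden on critical path"));
        }
        IpcBudget::Allowed(limit) => limit,
    };

    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the budget expired.
        let _ = tx.send(call());
    });

    match rx.recv_timeout(limit) {
        Ok(value) => Ok(value),
        Err(mpsc::RecvTimeoutError::Timeout) => {
            Err(format!("{method} exceeded budget of {limit:?}"))
        }
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            Err(format!("{method} worker exited without a reply"))
        }
    }
}
```

// The signature has no way to omit the budget, so every caller has to
// write down either a duration or Forbidden.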

The numbers are layered so the innermost failure surfaces first: proxy(8s) < start_agent(20s) < HANDSHAKE(30s). A stuck reverse call trips the 8s proxy budget and the error names the real layer, instead of bubbling up as a generic 30s handshake timeout.
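One way to keep that layering from drifting is a compile-time check. The sketch below is an assumption about how it could be done, not something the codebase is known to contain: PROXY_BUDGET and START_AGENT_TIMEOUT are hypothetical constant names, and only the three values and HANDSHAKE_TIMEOUT come from the text above.

```rust
use std::time::Duration;

pub const PROXY_BUDGET: Duration = Duration::from_secs(8);
pub const START_AGENT_TIMEOUT: Duration = Duration::from_secs(20);
pub const HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(30);

// Fails the build if an inner layer is ever given more time than an outer one.
const _: () = assert!(
    PROXY_BUDGET.as_secs() < START_AGENT_TIMEOUT.as_secs()
        && START_AGENT_TIMEOUT.as_secs() < HANDSHAKE_TIMEOUT.as_secs()
);

/// Names the layer whose timeout expires first when everything is stuck.
pub fn first_to_fire() -> &'static str {
    let layers = [
        ("handshake", HANDSHAKE_TIMEOUT),
        ("start_agent", START_AGENT_TIMEOUT),
        ("proxy", PROXY_BUDGET),
    ];
    layers
        .iter()
        .min_by_key(|(_, timeout)| *timeout)
        .map(|(name, _)| *name)
        .unwrap_or("none")
}
```

With these values the proxy layer is the first to report, which matches the intent that the error should name the innermost stuck call.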

Result

Startup with no MCP and no secrets, ephemeral session, measured:

LOOPAL_HUB_ALIVE: 243 ms (listener bound)
LOOPAL_HUB_READY: 488 ms (root agent started)

READY lands roughly 60× under the 30s handshake ceiling that the deadlocked startup used to hit. The handshake itself was split into two phases, LOOPAL_HUB_ALIVE <addr> <token> and LOOPAL_HUB_READY <session_id>, with the legacy single line still emitted for back-compat.
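A parent-side reader for the two phased lines could look like the sketch below. It assumes the fields are separated by whitespace and that nothing follows them, which the format shown above suggests but does not guarantee; the Handshake enum and parse_handshake are hypothetical names, and the legacy line is left unparsed because its format is not given here.

```rust
#[derive(Debug, PartialEq)]
pub enum Handshake {
    Alive { addr: String, token: String },
    Ready { session_id: String },
}

/// Parses one stdout line; returns None for anything that is not a
/// well-formed ALIVE or READY line.
pub fn parse_handshake(line: &str) -> Option<Handshake> {
    let mut parts = line.split_whitespace();
    let parsed = match parts.next()? {
        "LOOPAL_HUB_ALIVE" => Handshake::Alive {
            addr: parts.next()?.to_string(),
            token: parts.next()?.to_string(),
        },
        "LOOPAL_HUB_READY" => Handshake::Ready {
            session_id: parts.next()?.to_string(),
        },
        _ => return None,
    };
    // Reject lines with unexpected trailing fields.
    if parts.next().is_some() {
        return None;
    }
    Some(parsed)
}
```

Treating the two lines as separate variants lets the parent act on ALIVE immediately and keep waiting for READY under its own timeout.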

Three regression tests pin the behavior:

hub_only_mcp_deadlock_test.rs — boots --hub-only with an MCP server that is literally sleep 60, asserts a handshake line within 5s.
bootstrap_typestate_e2e_test.rs — drives a real subprocess, asserts ALIVE within 3s and READY within 8s.
ipc_budget_test.rs — a hub whose send_request is pending() forever; Forbidden must return in under 50ms.

What carried over

The fix added states and a non-defaulting enum; it did not add a single sleep or retry. Two constraints generalize beyond this codebase:

If an ordering is required for correctness, make the wrong order unrepresentable. A reader that must start before a writer, a listener that must bind before a send — encode the precondition as a type the next operation consumes, not as a comment or a runtime assert. The cost is a few zero-sized structs; the payoff is that the regression cannot compile.
Make resource cost explicit at the callsite, with no default. An IPC call that can block forever is a latent deadlock. Forcing every caller to choose Allowed(d) or Forbidden turns "I didn't think about blocking here" from a silent hang into a line of code a reviewer can see.