The symptom
loopal is a multi-process agent runtime: a thin parent spawns a hub subprocess in --hub-only mode, and the hub in turn spawns an agent-server. The parent waits on the child's stdout for a handshake line before it considers startup complete. When a user had an MCP server configured, startup hung for 30 seconds and then died with:
hub child did not produce a handshake within 30s
The 30s is the parent-side HANDSHAKE_TIMEOUT in hub_spawn.rs. Hitting it means the child never printed its handshake at all — not slow, fully stuck.
The root cause: send-before-receive
Bootstrap ran in this order (hub_bootstrap.rs, pre-fix):
let root_session_id = client.start_agent(&params).await?; // (1) blocks
let (root_conn, incoming_rx) = client.into_parts();
loopal_agent_hub::agent_io::start_agent_io( // (2) starts consumer
hub.clone(), ROOT_AGENT_NAME, root_conn, incoming_rx,
);
start_agent is a request to the agent-server. While handling it, the agent-server fires reverse IPC back at the hub — hub/mcp/snapshot, hub/secret/* — to discover MCP tools and resolve secrets. Those reverse requests land in incoming_rx.
But incoming_rx is only drained by start_agent_io, which is step (2) — and step (2) does not run until step (1) returns. The agent-server blocks waiting for the hub to answer its reverse call; the hub blocks at start_agent().await waiting for the agent-server to answer. Classic two-party deadlock. With no MCP configured there was nothing to snapshot, so the reverse call never fired and the bug stayed hidden.
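The ordering hazard described above can be reproduced without any of loopal's code. The following is a minimal sketch using only std threads and channels; every name in it (bootstrap_completes, the three channels, the timeouts) is hypothetical and stands in for the real IPC layer. The "agent" thread answers the start request only after its reverse call has been answered, and the "hub" either starts draining reverse calls before or after sending the start request.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Returns true if the hub's start request was answered in time.
/// `drain_first` picks the fixed order (consumer first) or the buggy order.
/// All names and timeouts here are illustrative, not loopal's real API.
pub fn bootstrap_completes(drain_first: bool) -> bool {
    let (start_tx, start_rx) = mpsc::channel::<()>(); // hub -> agent: start request
    let (started_tx, started_rx) = mpsc::channel::<()>(); // agent -> hub: start reply
    let (reverse_tx, reverse_rx) = mpsc::channel::<mpsc::Sender<()>>(); // agent -> hub: reverse call

    // Agent: while handling start, issue a reverse call and wait for its answer.
    thread::spawn(move || {
        if start_rx.recv().is_err() {
            return;
        }
        let (answer_tx, answer_rx) = mpsc::channel();
        if reverse_tx.send(answer_tx).is_err() {
            return;
        }
        // Bounded wait so the demo thread always exits.
        if answer_rx.recv_timeout(Duration::from_millis(500)).is_ok() {
            let _ = started_tx.send(());
        }
    });

    // Hub-side consumer: answers every reverse call it receives.
    let start_consumer = move || {
        thread::spawn(move || {
            for answer_tx in reverse_rx {
                let _ = answer_tx.send(());
            }
        });
    };

    if drain_first {
        start_consumer(); // reader is up before the request goes out
        start_tx.send(()).unwrap();
        started_rx.recv_timeout(Duration::from_millis(200)).is_ok()
    } else {
        start_tx.send(()).unwrap();
        // Nobody is draining reverse calls yet, so this wait cannot succeed.
        let ok = started_rx.recv_timeout(Duration::from_millis(200)).is_ok();
        start_consumer(); // too late: the hub already gave up
        ok
    }
}
```

In this sketch the buggy order fails only because of the 200 ms wait; with an unbounded wait, as in the pre-fix bootstrap, it would never return.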
A tokio::time::sleep before start_agent, or a retry, would have masked it. Neither removes the reordering hazard — the next person to add a reverse call on the bootstrap path reintroduces the hang.
Fix 1: the connection can't send before it reads
The IPC Connection now carries its lifecycle in the type. Two zero-sized marker types, one generic parameter:
pub struct Inactive;
pub struct Listening;
pub struct Connection<S = Inactive> {
transport: Arc<dyn Transport>,
pending: PendingMap,
next_id: AtomicI64,
_state: PhantomData<S>,
}
send_request, respond, and respond_error are implemented only for Connection<Listening>. The single way to obtain a Listening connection is to consume an Inactive one, and that constructor is what spawns the reader loop:
impl Connection<Inactive> {
pub fn into_listening(self) -> (Arc<Connection<Listening>>, mpsc::Receiver<Incoming>) {
let rx = spawn_reader_loop(self.transport.clone(), self.pending.clone());
(Arc::new(Connection { /* state: Listening */ ... }), rx)
}
}
You cannot call send_request on a connection whose reader has not started — there is no such method on Connection<Inactive>. The old free-standing start() escape hatch was deleted. "Send before the reader is up" went from a runtime hang to a type error.
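To make the mechanism concrete, here is a self-contained toy version of the same typestate, assuming a loopback channel in place of the real Transport. The field layout, the String message type, and loopback_roundtrip are assumptions made for illustration; only the Inactive/Listening markers, into_listening, and send_request mirror names from the snippets above.

```rust
use std::marker::PhantomData;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

pub struct Inactive;
pub struct Listening;

/// Toy connection. `outbound` and `inbound` stand in for a real transport.
pub struct Connection<S = Inactive> {
    outbound: mpsc::Sender<String>,
    inbound: Option<mpsc::Receiver<String>>, // taken by the reader thread
    _state: PhantomData<S>,
}

impl Connection<Inactive> {
    pub fn new(outbound: mpsc::Sender<String>, inbound: mpsc::Receiver<String>) -> Self {
        Connection { outbound, inbound: Some(inbound), _state: PhantomData }
    }

    /// The only way to get a `Listening` connection; it starts the reader.
    pub fn into_listening(self) -> (Connection<Listening>, mpsc::Receiver<String>) {
        let Connection { outbound, inbound, .. } = self;
        let (incoming_tx, incoming_rx) = mpsc::channel();
        if let Some(wire) = inbound {
            thread::spawn(move || {
                for msg in wire {
                    if incoming_tx.send(msg).is_err() {
                        break; // receiver dropped, stop reading
                    }
                }
            });
        }
        (Connection { outbound, inbound: None, _state: PhantomData }, incoming_rx)
    }
}

impl Connection<Listening> {
    /// Sending exists only once the reader loop is running.
    pub fn send_request(&self, method: &str) -> Result<(), mpsc::SendError<String>> {
        self.outbound.send(method.to_string())
    }
}

/// Sends one message over a loopback wire and reads it back.
pub fn loopback_roundtrip() -> String {
    let (wire_tx, wire_rx) = mpsc::channel();
    let conn = Connection::<Inactive>::new(wire_tx, wire_rx);
    // conn.send_request("x"); // would not compile: no such method on Connection<Inactive>
    let (conn, incoming) = conn.into_listening();
    conn.send_request("hub/mcp/snapshot").unwrap();
    incoming.recv_timeout(Duration::from_secs(1)).unwrap()
}
```

Uncommenting the marked line is the quickest way to see the guarantee: the compiler reports that send_request is not found for Connection<Inactive>.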
Fix 2: the bootstrap order is a typestate chain
The same idea, one level up. Bootstrap is now five states, each a distinct struct, each transition consuming self:
HubBuilt → ListenerBound → DispatcherReady → AgentSpawned → Ready
let bs = HubBuilt::new(cwd, config).await;
let bs = bs.bind_listener().await?; // -> ListenerBound
if let Some(tx) = alive_tx {
let _ = tx.send(bs.alive_info()); // ALIVE the instant the listener binds
}
let bs = bs.register_handlers(cli).await?; // -> DispatcherReady
let bs = bs.spawn_agent_process().await?; // -> AgentSpawned (consumer now draining)
let bs = bs.start_root_agent(&params).await?; // -> Ready
You cannot skip or reorder a step: start_root_agent exists only on AgentSpawned, which only spawn_agent_process produces, which only exists on DispatcherReady, and so on. The compiler walks the chain for you.
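A stripped-down sketch of such a chain, assuming synchronous transitions that do no real work beyond recording their name. The state and method names follow the snippet above; the steps field, the advance helper, and the "ALIVE" return value are hypothetical placeholders.

```rust
// Each state is its own struct; each transition consumes `self`.
pub struct HubBuilt { steps: Vec<&'static str> }
pub struct ListenerBound { steps: Vec<&'static str> }
pub struct DispatcherReady { steps: Vec<&'static str> }
pub struct AgentSpawned { steps: Vec<&'static str> }
pub struct Ready { steps: Vec<&'static str> }

// Hypothetical helper: record a completed step and hand the log onward.
fn advance(mut steps: Vec<&'static str>, step: &'static str) -> Vec<&'static str> {
    steps.push(step);
    steps
}

impl HubBuilt {
    pub fn new() -> Self {
        HubBuilt { steps: vec!["hub built"] }
    }
    pub fn bind_listener(self) -> ListenerBound {
        ListenerBound { steps: advance(self.steps, "listener bound") }
    }
}

impl ListenerBound {
    /// Available on this state only, so ALIVE cannot be sent earlier.
    pub fn alive_info(&self) -> &'static str {
        "ALIVE"
    }
    pub fn register_handlers(self) -> DispatcherReady {
        DispatcherReady { steps: advance(self.steps, "handlers registered") }
    }
}

impl DispatcherReady {
    pub fn spawn_agent_process(self) -> AgentSpawned {
        AgentSpawned { steps: advance(self.steps, "agent spawned") }
    }
}

impl AgentSpawned {
    pub fn start_root_agent(self) -> Ready {
        Ready { steps: advance(self.steps, "root agent started") }
    }
}

/// Walks the chain in the only order that type-checks.
pub fn bootstrap() -> (&'static str, Vec<&'static str>) {
    let bs = HubBuilt::new();
    let bs = bs.bind_listener();
    let alive = bs.alive_info();
    let bs = bs.register_handlers();
    let bs = bs.spawn_agent_process();
    let bs: Ready = bs.start_root_agent();
    (alive, bs.steps)
}
```

Swapping any two calls in bootstrap, for example calling start_root_agent on the DispatcherReady value, fails with a "no method named" error rather than at runtime.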
The deadlock-relevant move lives in spawn_agent_process: it starts the agent IO loop and waits for the consumer to confirm it is draining before the next state can issue start_agent:
loopal_agent_hub::agent_io::start_agent_io(
hub.clone(), dispatcher, ROOT_AGENT_NAME, conn.clone(), incoming_rx,
Some(ready_tx), // consumer signals readiness on this oneshot
);
wait_consumer_ready(ready_rx).await?; // 2s budget; reverse-IPC channel draining
// ... only now does AgentSpawned::start_root_agent send agent/start
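The readiness wait can be approximated with std primitives. This is a sketch under stated assumptions: a std mpsc channel stands in for the oneshot, a thread stands in for the agent IO task, and start_consumer, spawn_then_start, and CONSUMER_READY_BUDGET are hypothetical names. Only wait_consumer_ready and the 2s budget come from the snippet above, and the real function's signature may differ.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Assumed to match the 2s budget mentioned in the snippet above.
const CONSUMER_READY_BUDGET: Duration = Duration::from_secs(2);

/// Starts a consumer that signals readiness, then drains `incoming` until
/// every sender is dropped. Returns what it drained, for inspection.
fn start_consumer(
    incoming: mpsc::Receiver<String>,
    ready_tx: mpsc::Sender<()>,
) -> thread::JoinHandle<Vec<String>> {
    thread::spawn(move || {
        let _ = ready_tx.send(()); // signal right before entering the drain loop
        incoming.into_iter().collect()
    })
}

/// Blocks until the consumer signals readiness or the budget runs out.
fn wait_consumer_ready(ready_rx: &mpsc::Receiver<()>, budget: Duration) -> Result<(), String> {
    ready_rx
        .recv_timeout(budget)
        .map_err(|e| format!("consumer not ready within {budget:?}: {e}"))
}

/// Starts the consumer, waits for readiness, and only then sends traffic.
fn spawn_then_start() -> Result<Vec<String>, String> {
    let (incoming_tx, incoming_rx) = mpsc::channel();
    let (ready_tx, ready_rx) = mpsc::channel();
    let consumer = start_consumer(incoming_rx, ready_tx);

    wait_consumer_ready(&ready_rx, CONSUMER_READY_BUDGET)?;

    // Only now would the request that triggers reverse calls be issued.
    incoming_tx
        .send("hub/mcp/snapshot".to_string())
        .map_err(|e| e.to_string())?;
    drop(incoming_tx); // close the channel so the consumer finishes
    consumer.join().map_err(|_| "consumer panicked".to_string())
}
```

The explicit readiness signal matters because spawning a task only schedules it; without the wait, the start request could still race ahead of the first poll of the consumer.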
alive_info() is deliberately a method on ListenerBound only. The ALIVE handshake fires the moment the TCP listener is bound, no longer gated behind the uplink connect or the agent spawn — so a slow agent start can never delay the liveness signal.
Fix 3: IPC cost is declared, never defaulted
The reverse calls that deadlocked were unbudgeted — they would wait forever. So every IPC callsite now takes an explicit budget, and the type has no Default:
#[derive(Debug, Clone, Copy)]
pub enum IpcBudget {
Allowed(Duration),
Forbidden,
}
pub const HUB_RPC_BUDGET: IpcBudget = IpcBudget::Allowed(Duration::from_secs(8));
// There is intentionally no `Default` impl: callers must explicitly choose
// between `Allowed(d)` (with a justified timeout) and `Forbidden`.
Forbidden is for code on a latency-critical path — bootstrap before the reverse channel drains. A remote MCP provider rejects it immediately instead of blocking:
IpcBudget::Forbidden => {
return Err(format!("{method} rejected: IpcBudget::Forbidden on critical path"));
}
The numbers are layered so the innermost failure surfaces first: proxy(8s) < start_agent(20s) < HANDSHAKE(30s). A stuck reverse call trips the 8s proxy budget and the error names the real layer, instead of bubbling up as a generic 30s handshake timeout.
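A minimal sketch of how a budgeted call and the layering might look, assuming a blocking std channel as the reply path. IpcBudget and the three durations are taken from the text; call_with_budget, START_AGENT_TIMEOUT, PROXY_BUDGET, and budgets_are_layered are hypothetical names introduced for illustration.

```rust
use std::sync::mpsc;
use std::time::Duration;

#[derive(Debug, Clone, Copy)]
pub enum IpcBudget {
    Allowed(Duration),
    Forbidden,
}

// Durations from the text; the constant names other than the handshake
// timeout are illustrative.
pub const PROXY_BUDGET: Duration = Duration::from_secs(8);
pub const START_AGENT_TIMEOUT: Duration = Duration::from_secs(20);
pub const HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(30);

/// Waits for a reply within the caller's declared budget.
/// `Forbidden` returns immediately without touching the channel.
pub fn call_with_budget<T>(
    method: &str,
    budget: IpcBudget,
    reply: &mpsc::Receiver<T>,
) -> Result<T, String> {
    match budget {
        IpcBudget::Forbidden => Err(format!(
            "{method} rejected: IpcBudget::Forbidden on critical path"
        )),
        IpcBudget::Allowed(d) => reply
            .recv_timeout(d)
            .map_err(|e| format!("{method} failed within {d:?} budget: {e}")),
    }
}

/// Inner timeouts must be strictly shorter than outer ones, so the
/// innermost layer reports the failure first.
pub fn budgets_are_layered() -> bool {
    PROXY_BUDGET < START_AGENT_TIMEOUT && START_AGENT_TIMEOUT < HANDSHAKE_TIMEOUT
}
```

Because the enum has no Default, a call such as call_with_budget("hub/mcp/snapshot", Default::default(), &rx) does not compile; the caller has to write the budget out.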
Result
Startup with no MCP and no secrets, ephemeral session, measured:
LOOPAL_HUB_ALIVE: 243 ms (listener bound)
LOOPAL_HUB_READY: 488 ms (root agent started)
Roughly 60× under the old 30s ceiling. The handshake itself was split into phases — LOOPAL_HUB_ALIVE <addr> <token> and LOOPAL_HUB_READY <session_id> — with the legacy single line still emitted for back-compat.
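On the parent side, the two-phase handshake could be parsed along these lines. This is a sketch that assumes the fields are separated by whitespace, as the <addr> <token> notation suggests; the Handshake enum and parse_handshake are hypothetical, and the legacy single-line format is not modeled because its shape is not given here.

```rust
/// One parsed handshake line from the hub child's stdout.
#[derive(Debug, PartialEq, Eq)]
pub enum Handshake {
    Alive { addr: String, token: String },
    Ready { session_id: String },
}

/// Parses a single stdout line. Returns None for anything that is not a
/// well-formed ALIVE or READY line (including the legacy line).
pub fn parse_handshake(line: &str) -> Option<Handshake> {
    let mut parts = line.split_whitespace();
    match parts.next()? {
        "LOOPAL_HUB_ALIVE" => {
            let addr = parts.next()?.to_string();
            let token = parts.next()?.to_string();
            Some(Handshake::Alive { addr, token })
        }
        "LOOPAL_HUB_READY" => {
            let session_id = parts.next()?.to_string();
            Some(Handshake::Ready { session_id })
        }
        _ => None,
    }
}
```

Keeping the two phases as separate variants lets the parent apply a short deadline to ALIVE and a longer one to READY, instead of a single 30s wait.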
Three regression tests pin the behavior:
- hub_only_mcp_deadlock_test.rs: boots --hub-only with an MCP server that is literally sleep 60, asserts a handshake line within 5s.
- bootstrap_typestate_e2e_test.rs: drives a real subprocess, asserts ALIVE within 3s and READY within 8s.
- ipc_budget_test.rs: a hub whose send_request is pending() forever; Forbidden must return in under 50ms.
What carried over
The fix added states and a non-defaulting enum; it did not add a single sleep or retry. Two constraints generalize beyond this codebase:
- If an ordering is required for correctness, make the wrong order unrepresentable. A reader that must start before a writer, a listener that must bind before a send — encode the precondition as a type the next operation consumes, not as a comment or a runtime assert. The cost is a few zero-sized structs; the payoff is that the regression cannot compile.
- Make resource cost explicit at the callsite, with no default. An IPC call that can block forever is a latent deadlock. Forcing every caller to choose Allowed(d) or Forbidden turns "I didn't think about blocking here" from a silent hang into a line of code a reviewer can see.