Killing the Zombie Leak in Remote Workstation Daemons

The symptom

Under a single production runner process (one PID) we counted roughly 108 <defunct> entries that had accumulated over 13 days. Each one was a dead PodDaemon the kernel was keeping in the process table, waiting for a parent that would never call wait(). The count tracked pod create-destroy cycles: one zombie per cycle.

A zombie holds no memory and no file descriptors, only its PID and a slot in the process table. It is harmless until the table fills, and then it is a hard outage that is easy to misread: fork starts failing while every running process looks healthy. The interesting part is not the cleanup; it is why a daemon designed to outlive its parent was being held by that parent at all.
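One way to watch the count from inside a process is to read /proc directly. The sketch below is an illustration, not the tooling used in the incident: it assumes Linux, and parseStat and countZombieChildren are hypothetical names. It relies only on the documented layout of /proc/<pid>/stat, where the state letter and ppid follow the parenthesised command name.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseStat extracts the state letter and parent PID from the contents of
// /proc/<pid>/stat. The comm field is wrapped in parentheses and may itself
// contain spaces or ')', so fields are located relative to the last ')'.
func parseStat(stat string) (state byte, ppid int, err error) {
	end := strings.LastIndexByte(stat, ')')
	if end < 0 {
		return 0, 0, fmt.Errorf("malformed stat line: %q", stat)
	}
	fields := strings.Fields(stat[end+1:])
	if len(fields) < 2 || len(fields[0]) != 1 {
		return 0, 0, fmt.Errorf("malformed stat line: %q", stat)
	}
	ppid, err = strconv.Atoi(fields[1])
	if err != nil {
		return 0, 0, fmt.Errorf("bad ppid in stat line: %w", err)
	}
	return fields[0][0], ppid, nil
}

// countZombieChildren counts Z-state processes whose parent is parent.
func countZombieChildren(parent int) (int, error) {
	paths, err := filepath.Glob("/proc/[0-9]*/stat")
	if err != nil {
		return 0, err
	}
	n := 0
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // process went away between Glob and ReadFile
		}
		state, ppid, err := parseStat(string(data))
		if err == nil && state == 'Z' && ppid == parent {
			n++
		}
	}
	return n, nil
}

func main() {
	n, err := countZombieChildren(os.Getpid())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("zombie children of %d: %d\n", os.Getpid(), n)
}
```

Pointing countZombieChildren at the runner's PID instead of os.Getpid() would give the per-runner count described above.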

What PodDaemon is, and why Release looked correct

A PodDaemon holds the PTY for one pod's agent session and accepts IPC from the runner for I/O forwarding. Its defining requirement is survival across a runner restart: when the runner upgrades, existing pod sessions must keep running, and the new runner re-attaches to them. So the daemon must be detached from the runner's lifecycle by design.

The old startup did this with os.StartProcess followed by proc.Release():

proc, err := os.StartProcess(binPath, []string{binPath}, attr)
// ...
// Release the process so it becomes a proper daemon
if err := proc.Release(); err != nil {
    return pid, fmt.Errorf("release daemon process: %w", err)
}

attr already set Setsid: true, so the daemon got its own session. The naming reads like a complete detachment. It is not.

Release() only frees the resources the os package associates with the *os.Process handle, nothing more. It does not touch the kernel's parent-child relationship. The daemon's ppid was still the runner. When the daemon later died (on pod destroy), the kernel sent SIGCHLD to the runner and parked the corpse as a zombie until someone reaped it. Nobody ever did, because the runner had thrown away the handle it would have needed to Wait().

This is the trap: Setsid detaches the session; only the parent's own exit detaches the parent, at which point the kernel hands the child to PID 1 (or to a designated subreaper). Releasing the handle removed the one path back to the corpse without removing the corpse.
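The trap is easy to reproduce in isolation. The following is a minimal sketch, assuming Linux with /proc mounted and a `true` binary on PATH; procState and stateAfterRelease are hypothetical helper names, not code from the runner. It starts a child with Setsid, calls Release() without Wait(), and then asks the kernel what state the child is in.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"syscall"
	"time"
)

// procState returns the state letter from /proc/<pid>/stat (Linux only).
func procState(pid int) (byte, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return 0, err
	}
	s := string(data)
	i := strings.LastIndexByte(s, ')')
	if i < 0 || i+2 >= len(s) {
		return 0, fmt.Errorf("malformed stat: %q", s)
	}
	return s[i+2], nil // layout is "pid (comm) S ppid ..."
}

// stateAfterRelease starts a child that exits immediately, calls Release
// without Wait, and reports the state the kernel shows for it afterwards.
func stateAfterRelease() (byte, error) {
	bin, err := exec.LookPath("true")
	if err != nil {
		return 0, err
	}
	proc, err := os.StartProcess(bin, []string{bin}, &os.ProcAttr{
		Sys: &syscall.SysProcAttr{Setsid: true}, // own session, same parent
	})
	if err != nil {
		return 0, err
	}
	pid := proc.Pid // save first; the handle is unusable after Release
	if err := proc.Release(); err != nil {
		return 0, err
	}
	// Reap by raw PID on the way out so the demo itself does not leak.
	defer syscall.Wait4(pid, nil, 0, nil)

	var state byte
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if state, err = procState(pid); err != nil {
			return 0, err
		}
		if state == 'Z' {
			break
		}
		time.Sleep(10 * time.Millisecond)
	}
	return state, nil
}

func main() {
	state, err := stateAfterRelease()
	if err != nil {
		fmt.Fprintln(os.Stderr, "demo needs Linux /proc and a `true` binary:", err)
		return
	}
	fmt.Printf("child state after Release without Wait: %c\n", state)
}
```

On a Linux host the reported state should be Z: the child has its own session, yet it stays parked under the process that started it until that process waits on it or exits.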

The reparent move: a real double-fork

To make the daemon both survive runner restart and not leave a zombie, its parent has to actually go away. When a parent exits, the kernel reparents its children to init(1), and init reaps them. That is the property the old code never achieved.

The fix introduces a launcher subprocess. The runner re-execs itself with a marker argument, __processmgr_launcher__. That short-lived process spawns the real daemon, reports the daemon's PID back, and exits. Its exit is what flips the daemon's ppid to init(1):

// inside RunLauncher() — the __processmgr_launcher__ subcommand
cmd := exec.Command(binPath, args...)
configureDaemonSysProcAttr(cmd) // Setsid: own session
if err := cmd.Start(); err != nil { /* ... */ }

// report the daemon PID up the fd-3 pipe so the runner knows it
fmt.Fprintln(pipe, cmd.Process.Pid)

// Release here, then exit — the kernel reparents the daemon to init
cmd.Process.Release()
os.Exit(0)

The launcher itself is spawned by the process manager with a normal exec.Cmd, and that Cmd is Wait'd. So the launcher produces no zombie, and the daemon it leaves behind belongs to init, not the runner. The daemon's PID travels back through a pipe on file descriptor 3 (ExtraFiles[0]), bounded by a LauncherStartTimeout (default 10s); a launcher that does not report a PID in time is treated as a fork failure and killed.
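The runner's half of that handshake can be sketched as a bounded read of one line. This is an illustration under assumptions: readDaemonPID, parsePIDLine, and errLauncherTimeout are hypothetical names, and the reader stands in for the read end of the fd-3 pipe.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strconv"
	"strings"
	"time"
)

var errLauncherTimeout = errors.New("launcher did not report a PID in time")

// parsePIDLine validates the single line the launcher writes to the pipe.
func parsePIDLine(line string) (int, error) {
	pid, err := strconv.Atoi(strings.TrimSpace(line))
	if err != nil {
		return 0, fmt.Errorf("bad PID line %q: %w", line, err)
	}
	if pid <= 0 {
		return 0, fmt.Errorf("bad PID line %q: not a positive PID", line)
	}
	return pid, nil
}

// readDaemonPID waits up to timeout for one PID line on r.
// On timeout the caller should close the pipe (which unblocks the reader
// goroutine) and kill the launcher.
func readDaemonPID(r io.Reader, timeout time.Duration) (int, error) {
	type result struct {
		pid int
		err error
	}
	ch := make(chan result, 1) // buffered: the goroutine never blocks on send
	go func() {
		line, err := bufio.NewReader(r).ReadString('\n')
		if err != nil {
			ch <- result{0, fmt.Errorf("read launcher pipe: %w", err)}
			return
		}
		pid, err := parsePIDLine(line)
		ch <- result{pid, err}
	}()
	select {
	case res := <-ch:
		return res.pid, res.err
	case <-time.After(timeout):
		return 0, errLauncherTimeout
	}
}

func main() {
	// strings.NewReader stands in for the read end of the fd-3 pipe.
	pid, err := readDaemonPID(strings.NewReader("4242\n"), time.Second)
	fmt.Println(pid, err) // prints "4242 <nil>"
}
```

Validating the line before trusting it matters here because the PID is the only link the runner keeps to a process it no longer parents.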

Windows has no zombie state, and Go's os/exec does not support ExtraFiles there, so the launcher scheme is both unnecessary and unavailable on that platform. That path spawns the daemon directly with DETACHED_PROCESS | CREATE_NEW_PROCESS_GROUP, which already gives the parent-detachment Windows users need. The startup is split by platform rather than papered over with a shared abstraction that fits neither.

One manager, one Wait per child

The root cause generalizes past PodDaemon: the codebase had several places calling exec/StartProcess directly, and the invariant "every Start() is paired with exactly one Wait()" lived only in each author's head. The fix makes that invariant structural.

A new internal/processmgr package (23 non-test .go files) became the single source of truth for every long-lived child the runner spawns. Callers describe what they want with a Spec and get back a Handle; they never see the underlying *exec.Cmd:

type Mode int
const (
    ModeNormal Mode = iota // long-lived child, reaped on Stop
    ModePTY                 // PTY-backed child
    ModeDaemon              // double-fork detach; survives runner restart
)

Returning a Handle instead of a Cmd is the load-bearing decision. A caller cannot forget the Wait if it never holds the thing it would have to Wait on; the manager holds the Cmd and does the waiting. Internally every exec.Cmd gets exactly one cmd.Wait(), run in a panic-safe goroutine. The PodDaemon, ACP client, and MCP server lifecycles all migrated onto it.
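A minimal sketch of that shape, covering only the ModeNormal path, might look like the following. Everything beyond the Spec, Handle, and Mode names is an assumption for illustration: the Spec fields, the Done and Err methods, and the Start function are hypothetical, and the daemon path would go through the launcher instead.

```go
package main

import (
	"fmt"
	"os/exec"
)

type Mode int

const (
	ModeNormal Mode = iota
	ModePTY
	ModeDaemon
)

// Spec describes the child to run. Field names are assumptions.
type Spec struct {
	Path string
	Args []string
	Mode Mode
}

// Handle is all a caller gets; it deliberately exposes no *exec.Cmd.
type Handle interface {
	PID() int
	Mode() Mode
	Done() <-chan struct{} // closed after the single Wait returns
	Err() error            // blocks until Done, then returns Wait's result
}

type child struct {
	pid  int
	mode Mode
	done chan struct{}
	err  error
}

func (c *child) PID() int              { return c.pid }
func (c *child) Mode() Mode            { return c.mode }
func (c *child) Done() <-chan struct{} { return c.done }
func (c *child) Err() error            { <-c.done; return c.err }

// Start launches the child and pairs it with exactly one Wait.
func Start(spec Spec) (Handle, error) {
	cmd := exec.Command(spec.Path, spec.Args...)
	if err := cmd.Start(); err != nil {
		return nil, fmt.Errorf("start %s: %w", spec.Path, err)
	}
	c := &child{pid: cmd.Process.Pid, mode: spec.Mode, done: make(chan struct{})}
	go func() {
		defer close(c.done) // runs last, even if the body panics
		defer func() {
			if r := recover(); r != nil {
				c.err = fmt.Errorf("wait goroutine panicked: %v", r)
			}
		}()
		c.err = cmd.Wait() // the one and only Wait for this child
	}()
	return c, nil
}

func main() {
	h, err := Start(Spec{Path: "true"})
	if err != nil {
		fmt.Println("start failed:", err)
		return
	}
	<-h.Done()
	fmt.Println("child", h.PID(), "reaped, err:", h.Err())
}
```

Because cmd never leaves Start, there is no second place a Wait could be called and no place it could be omitted.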

Stop policy is encoded in the manager, not at call sites. StopAll deliberately skips daemons — that is what preserves the cross-restart semantic:

func (m *manager) StopAll(ctx context.Context) error {
    return m.stopMatching(ctx, func(p Handle) bool { return p.Mode() != ModeDaemon })
}
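To make the predicate's effect concrete, here is a self-contained sketch of what stopMatching could look like. The body of stopMatching, the Stop method on Handle, and the fakeChild type are all assumptions for illustration; only StopAll and its predicate come from the excerpt above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

type Mode int

const (
	ModeNormal Mode = iota
	ModePTY
	ModeDaemon
)

// Handle is reduced to what the stop path needs. Stop is hypothetical.
type Handle interface {
	Mode() Mode
	Stop(ctx context.Context) error
}

type manager struct {
	mu       sync.Mutex
	children []Handle
}

// stopMatching stops every tracked child the predicate selects and keeps
// tracking the rest.
func (m *manager) stopMatching(ctx context.Context, match func(Handle) bool) error {
	m.mu.Lock()
	var targets, kept []Handle
	for _, h := range m.children {
		if match(h) {
			targets = append(targets, h)
		} else {
			kept = append(kept, h)
		}
	}
	m.children = kept
	m.mu.Unlock()

	var errs []error
	for _, h := range targets {
		if err := h.Stop(ctx); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...) // nil when every Stop succeeded
}

func (m *manager) StopAll(ctx context.Context) error {
	return m.stopMatching(ctx, func(h Handle) bool { return h.Mode() != ModeDaemon })
}

// fakeChild records whether Stop was called, in place of a real process.
type fakeChild struct {
	name    string
	mode    Mode
	stopped bool
}

func (f *fakeChild) Mode() Mode                 { return f.mode }
func (f *fakeChild) Stop(context.Context) error { f.stopped = true; return nil }

func runDemo() []*fakeChild {
	kids := []*fakeChild{
		{name: "acp-client", mode: ModeNormal},
		{name: "pty-shell", mode: ModePTY},
		{name: "pod-daemon", mode: ModeDaemon},
	}
	m := &manager{}
	for _, k := range kids {
		m.children = append(m.children, k)
	}
	_ = m.StopAll(context.Background())
	return kids
}

func main() {
	for _, k := range runDemo() {
		fmt.Printf("%s stopped=%v\n", k.name, k.stopped)
	}
	// prints:
	// acp-client stopped=true
	// pty-shell stopped=true
	// pod-daemon stopped=false
}
```

Keeping the filter inside the manager means a shutdown path cannot stop a daemon by accident; it would have to ask for that explicitly with a different predicate.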

Belt and suspenders: a reaper that should never fire

The manager runs a safety-net sweep on a 30s timer:

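func reapOrphans() int {
    count := 0
    for {
        var ws syscall.WaitStatus
        // -1 matches any exited child, including a managed one whose
        // cmd.Wait() has not collected it yet.
        pid, err := syscall.Wait4(-1, &ws, syscall.WNOHANG, nil)
        if err == syscall.EINTR {
            continue // interrupted by a signal; retry
        }
        // pid == 0: children exist but none has exited; ECHILD: no children
        if err != nil || pid <= 0 {
            return count
        }
        // ...
        count++
    }
}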

The design intent is that this returns 0 forever. A non-zero return means some Start() path bypassed processmgr — so the reaper increments runner_zombie_reaped_total and logs a warning pointing at /debug/processes. The metric is a leak detector, not a cleanup mechanism. The same endpoint lists every tracked child by owner, mode, PID, and uptime, so an operator can answer "which pod owns this PID?" without standing up Prometheus.
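The detector wiring around the sweep can be sketched as follows. This is an assumption-laden illustration: sweepOnce and runSweeper are hypothetical names, a plain atomic counter stands in for the runner_zombie_reaped_total metric, and the reap function is injected so the logic can be exercised without real child processes.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// zombieReapedTotal stands in for the runner_zombie_reaped_total counter.
var zombieReapedTotal int64

// sweepOnce runs one sweep. It stays silent when nothing was reaped and
// counts and warns otherwise, because any reap means a Start() bypassed
// the manager.
func sweepOnce(reap func() int, warnf func(format string, args ...any)) int {
	n := reap()
	if n > 0 {
		atomic.AddInt64(&zombieReapedTotal, int64(n))
		warnf("reaped %d untracked children; see /debug/processes", n)
	}
	return n
}

// runSweeper calls sweepOnce on a fixed interval until ctx is cancelled.
func runSweeper(ctx context.Context, every time.Duration, reap func() int, warnf func(string, ...any)) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			sweepOnce(reap, warnf)
		}
	}
}

func main() {
	warnf := func(format string, args ...any) {
		fmt.Printf("WARN "+format+"\n", args...)
	}
	sweepOnce(func() int { return 0 }, warnf) // healthy: nothing logged
	sweepOnce(func() int { return 2 }, warnf) // leak: counted and logged
	fmt.Println("total:", atomic.LoadInt64(&zombieReapedTotal)) // prints "total: 2"
}
```

In the runner, the reap argument would be reapOrphans and the interval 30 seconds; an alert on any increase of the counter turns the safety net into the leak detector described above.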

Making the bug unrepresentable again

A fix that depends on everyone remembering the rule decays. Two guards keep it from coming back:

A forbidigo lint forbids os.StartProcess everywhere except inside processmgr itself. bazel run //runner:lint catches the bare pattern before review. The error message names the replacement, so the lint teaches the fix.
TestE2E_ManyDaemonsLeakNoZombies runs 20 real daemon create-destroy cycles — real fork, real SIGKILL, real ppid reassignment — then scans the process table with ps for Z-state entries owned by the test process and asserts zero. It also asserts the reaper caught nothing, which proves the daemon path went through the manager rather than leaking and being swept up.

A separate stability pass (#189) had already added panic recovery and auto-restarting goroutines to the daemon so that a crash leaves a stack trace instead of empty logs. That made daemon failures diagnosable; #411 made them not leak.

Takeaways

Release() is not detachment. It frees a handle; it does not change ppid. If you need a child to outlive its parent without becoming a zombie, the child's immediate parent must exit so that the kernel reparents the child to init, and a double-fork via a throwaway launcher is the standard way to get there on Unix.
Setsid detaches the session, not the parent. They are different relationships and they fail independently.
The durable fix for "every Start needs a Wait" is to make the unpaired form unrepresentable (hand out a Handle, never the Cmd) and back it with a lint plus an end-to-end test, so the invariant does not rely on memory.
Reaper-style safety nets earn their keep as detectors. If yours is doing real work in steady state, treat that as an alarm, not a feature.