The shape of the problem
On the runner, each agent pod runs as a detached daemon process. A short-lived manager spawns it, then attaches over a local IPC channel to forward terminal I/O. Until #197 that channel was platform-split: a Unix domain socket on macOS and Linux, a named pipe on Windows.
// ipc_unix.go
//go:build !windows
func Listen(path string) (net.Listener, error) {
	_ = os.Remove(path) // clean up stale socket from a crashed daemon
	return net.Listen("unix", path)
}

// ipc_windows.go
//go:build windows
func Listen(path string) (net.Listener, error) {
	return winio.ListenPipe(path, nil) // github.com/Microsoft/go-winio
}
Two Listen/Dial implementations behind build tags, a third-party dependency (go-winio) only Windows pulled in, and a separate ipc_windows_test.go. The Unix side also carried the lifecycle burden of an on-disk file: remove a stale socket on start, remove it again in CleanupSession, and once more on the daemon's failure path.
Why the socket path bit us first
The macOS Unix socket path limit is 104 bytes, including the terminating NUL. The socket path was originally derived from the pod's sandbox directory, and sandbox paths sit under the workspace root. A long workspace path overflowed the limit and net.Listen("unix", path) failed before the daemon could accept anything.
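To make the arithmetic concrete, here is a minimal sketch of a pre-flight length check. The helper name fitsSunPath and both example paths are hypothetical and not taken from the runner; 104 is the macOS limit quoted above.

```go
package main

import "fmt"

// macOSSunPathMax is the macOS socket path limit in bytes,
// counting the terminating NUL.
const macOSSunPathMax = 104

// fitsSunPath reports whether path plus its terminating NUL fits in a
// socket path buffer of sunPathMax bytes.
func fitsSunPath(path string, sunPathMax int) bool {
	return len(path)+1 <= sunPathMax
}

func main() {
	// Hypothetical paths, for illustration only.
	short := "/tmp/agent/sockets/pod-1.sock"
	long := "/Users/someone/Projects/clients/a-rather-long-workspace-name/" +
		"repos/service/.sandboxes/pod-3f9c2a7e-1b44-4c0d-9e57/ipc/daemon.sock"

	fmt.Println(len(short), fitsSunPath(short, macOSSunPathMax))
	fmt.Println(len(long), fitsSunPath(long, macOSSunPathMax))
}
```

A check like this only reports the overflow earlier; it does nothing to make a deep workspace path shorter.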
The earlier fix (#147) worked around this by moving sockets to a short, predictable directory:
// GetSocketDir returns the directory for IPC sockets.
// Derived from TempBaseDir to guarantee short, predictable paths
// that stay within Unix socket path limits (104 bytes on macOS).
func (c *Config) GetSocketDir() string {
	return filepath.Join(TempBaseDir(), "sockets")
}
That kept paths short but left the structural cost in place: the limit, the file lifecycle, the platform split. #197 removed the cause instead of containing it.
TCP loopback, OS-assigned port
The replacement is one file with no build tags:
// ipc.go — 18 lines, replaces ipc_unix.go + ipc_windows.go
func Listen() (net.Listener, error) {
	return net.Listen("tcp", "127.0.0.1:0")
}
func Dial(addr string) (net.Conn, error) {
	return net.Dial("tcp", addr)
}
127.0.0.1:0 asks the kernel for a free ephemeral port. No path, so no 104-byte ceiling and no socket file to leak. The same code compiles and runs on darwin, linux, and windows; go-winio came out of go.mod.
This inverts address discovery. Previously the manager pre-computed the socket path and handed it to the daemon. Now the port is only known after Listen(), so the daemon binds first and writes the result back into its state file:
listener, err := Listen() // OS assigns the port
// ...
state.IPCAddr = listener.Addr().String() // e.g. "127.0.0.1:54213"
state.DaemonPID = os.Getpid()
SaveState(state)
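The bind-then-discover order can be seen in a self-contained round trip. This is a sketch rather than the runner's code: echoOnce is a hypothetical helper that plays both roles in one process, standing in for the daemon (listen, read the assigned address) and the manager (dial that address).

```go
package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

// echoOnce binds 127.0.0.1:0, serves a single echo connection in the
// background, dials the address the listener reports, and returns that
// address together with the echoed reply.
func echoOnce(msg string) (addr, reply string, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0") // port 0: the OS picks a free port
	if err != nil {
		return "", "", err
	}
	defer ln.Close()

	// The real port is only known now, after the bind.
	addr = ln.Addr().String()

	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		io.Copy(conn, conn) // echo until the client closes
	}()

	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return addr, "", err
	}
	defer conn.Close()
	conn.SetDeadline(time.Now().Add(2 * time.Second)) // never hang the demo

	if _, err := conn.Write([]byte(msg)); err != nil {
		return addr, "", err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(conn, buf); err != nil {
		return addr, "", err
	}
	return addr, string(buf), nil
}

func main() {
	addr, reply, err := echoOnce("ping")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("bound address:", addr)
	fmt.Println("echoed reply: ", reply)
}
```

The port differs on every run, which is exactly why the address has to be published somewhere the other process can read it.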
The manager no longer connects to a known path; it polls pod_daemon.json until IPCAddr is populated, with a fail-fast check on the daemon PID so a dead child doesn't cost the full timeout:
const maxAttempts = 50
const retryDelay = 100 * time.Millisecond // 5s ceiling
// ...
state, err := LoadState(sandboxPath)
if err == nil && state.IPCAddr != "" {
	dpty, err := connectDaemon(connectOpts{Addr: state.IPCAddr, AuthToken: authToken})
	// ...
}
if pid > 0 && process.IsAlive(pid) != nil {
	return nil, nil, fmt.Errorf("daemon (pid %d) exited before IPC ready", pid)
}
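Put together, the publish-and-poll exchange might look like the sketch below. Everything in it is an assumption made for illustration: the daemonState fields and JSON names, the publishAddr and waitForAddr helpers, the liveness callback standing in for process.IsAlive, and the write-then-rename step, which the original does not say the runner uses.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// daemonState is a hypothetical subset of pod_daemon.json; the JSON field
// names are assumptions made for this sketch.
type daemonState struct {
	IPCAddr   string `json:"ipc_addr"`
	DaemonPID int    `json:"daemon_pid"`
}

var errDaemonExited = errors.New("daemon exited before IPC ready")

// publishAddr is the daemon side: write the state to a temp file, then
// rename it into place so a reader does not see a half-written file.
func publishAddr(statePath string, st daemonState) error {
	data, err := json.Marshal(st)
	if err != nil {
		return err
	}
	tmp := statePath + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, statePath)
}

// waitForAddr is the manager side: poll statePath until IPCAddr is set,
// fail fast if alive reports the daemon gone, give up after attempts tries.
func waitForAddr(statePath string, attempts int, delay time.Duration, alive func() bool) (string, error) {
	for i := 0; i < attempts; i++ {
		if data, err := os.ReadFile(statePath); err == nil {
			var st daemonState
			if json.Unmarshal(data, &st) == nil && st.IPCAddr != "" {
				return st.IPCAddr, nil
			}
		}
		if !alive() {
			return "", errDaemonExited
		}
		time.Sleep(delay)
	}
	return "", fmt.Errorf("IPC address not published after %d attempts", attempts)
}

// demo runs both sides in one process against a temporary directory.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "pod-sandbox-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	statePath := filepath.Join(dir, "pod_daemon.json")

	go func() {
		time.Sleep(50 * time.Millisecond) // pretend the daemon is still starting
		publishAddr(statePath, daemonState{IPCAddr: "127.0.0.1:54213", DaemonPID: os.Getpid()})
	}()

	return waitForAddr(statePath, 50, 10*time.Millisecond, func() bool { return true })
}

func main() {
	addr, err := demo()
	fmt.Println(addr, err)
}
```

Checking liveness after each failed read, rather than only at the end, is what keeps a crashed daemon from costing the full polling budget.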
What loopback gives up, and how the token covers it
A Unix socket carries filesystem permissions; a loopback TCP port does not. Any local process can connect() to 127.0.0.1:<port> if it can guess or scan the port, so the access control that came for free from the filesystem now has to be made explicit at the application layer.
The manager generates a 32-byte token from crypto/rand per session, hex-encodes it, and stores it in the same state file:
const authTokenBytes = 32
func generateAuthToken() (string, error) {
	b := make([]byte, authTokenBytes)
	if _, err := rand.Read(b); err != nil {
		return "", fmt.Errorf("generate auth token: %w", err)
	}
	return hex.EncodeToString(b), nil
}
The attach handshake carries it. The protocol bumped from version 1 to 2, and MsgAttach grew from {version uint8} to {version uint8}{auth_token bytes}:
// client side: [version][token]
attachPayload := make([]byte, 1+len(tokenBytes))
attachPayload[0] = protocolVersion // 2
copy(attachPayload[1:], tokenBytes)
WriteMessage(conn, MsgAttach, attachPayload)
The daemon validates with a constant-time compare, so the time taken to reject a token does not reveal how many of its bytes matched:
receivedToken := payload[1:] // bytes after version
if len(receivedToken) != len(expectedToken) ||
	subtle.ConstantTimeCompare(receivedToken, expectedToken) != 1 {
	log.Warn("client auth failed: invalid token")
	conn.Close()
	return
}
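Both halves of the handshake fit in a small round trip. The sketch below is illustrative only: buildAttachPayload, checkAttachPayload, and the error values are hypothetical names, the message framing done by WriteMessage is left out, and the explicit empty-token guard is an added assumption rather than something the original describes.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
)

const protocolVersion = 2

var (
	errShortPayload = errors.New("attach payload too short")
	errBadVersion   = errors.New("unsupported protocol version")
	errBadToken     = errors.New("invalid auth token")
)

// buildAttachPayload lays out the client side: [version][token].
func buildAttachPayload(token []byte) []byte {
	p := make([]byte, 1+len(token))
	p[0] = protocolVersion
	copy(p[1:], token)
	return p
}

// checkAttachPayload is the daemon side: check the version byte, then
// compare the remaining bytes against the expected token.
func checkAttachPayload(payload, expected []byte) error {
	if len(payload) < 1 {
		return errShortPayload
	}
	if payload[0] != protocolVersion {
		return errBadVersion
	}
	// Two empty slices compare as equal, so refuse an empty expected token.
	if len(expected) == 0 {
		return errBadToken
	}
	// ConstantTimeCompare returns 0 right away when the lengths differ;
	// only the comparison of equal-length inputs is constant-time.
	if subtle.ConstantTimeCompare(payload[1:], expected) != 1 {
		return errBadToken
	}
	return nil
}

func main() {
	expected := []byte("0123abcd") // stand-in; the real token is 64 hex characters

	fmt.Println(checkAttachPayload(buildAttachPayload(expected), expected))
	fmt.Println(checkAttachPayload(buildAttachPayload([]byte("0123abce")), expected))
	fmt.Println(checkAttachPayload([]byte{1}, expected))
}
```

Because the token length is fixed and not secret, an early return on a length mismatch gives an attacker nothing useful.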
The token lives only in pod_daemon.json, which sits in the per-pod sandbox under user-owned permissions — so filesystem access control still gates the secret, while the socket itself no longer depends on it. The manager also re-reads the token after the daemon writes its address and rejects a mismatch as possible tampering.
Connection parameters as a struct
The old connectDaemon(ipcPath string) became connectDaemon(opts connectOpts). With address and token both being strings, a positional signature is one transposition away from sending the address as the token:
type connectOpts struct {
	Addr      string // "127.0.0.1:12345"
	AuthToken string // hex-encoded
}
A small change, but the kind of confusion that compiles cleanly and fails at runtime is worth designing out.
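Named fields remove the positional hazard at the call site. If a belt-and-braces runtime check is also wanted, one option is a validation method along these lines. The validate method is hypothetical and not part of the change described here; it assumes the address is always a loopback IP with a port and that the token is 32 bytes, hex-encoded.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"net"
	"strconv"
	"strings"
)

const authTokenBytes = 32

type connectOpts struct {
	Addr      string // "127.0.0.1:12345"
	AuthToken string // hex-encoded
}

// validate rejects values that cannot be what the field name says.
// Error messages are static so a misplaced token is never echoed into logs.
func (o connectOpts) validate() error {
	host, port, err := net.SplitHostPort(o.Addr)
	if err != nil {
		return errors.New("addr: not in host:port form")
	}
	if ip := net.ParseIP(host); ip == nil || !ip.IsLoopback() {
		return errors.New("addr: host is not a loopback IP")
	}
	if n, err := strconv.ParseUint(port, 10, 16); err != nil || n == 0 {
		return errors.New("addr: invalid port")
	}
	raw, err := hex.DecodeString(o.AuthToken)
	if err != nil || len(raw) != authTokenBytes {
		return errors.New("auth token: want 64 hex characters")
	}
	return nil
}

func main() {
	token := strings.Repeat("ab", authTokenBytes) // placeholder, not a real secret

	good := connectOpts{Addr: "127.0.0.1:12345", AuthToken: token}
	swapped := connectOpts{Addr: token, AuthToken: "127.0.0.1:12345"}

	fmt.Println(good.validate())
	fmt.Println(swapped.validate())
}
```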
Result and constraints
The diff was 28 files, +429 / −505: the IPC code got smaller while gaining auth. One transport replaces two; ipc_unix.go, ipc_windows.go, and ipc_windows_test.go are gone, along with the go-winio dependency and the GetSocketDir/EnsureSocketDir surface. The change was verified with unit and integration tests, cross-compilation for linux, windows, and darwin, and end-to-end pod creation on all three.
Two constraints fell out of it:
- Loopback is not an access boundary. Binding 127.0.0.1 keeps traffic off the network, but any local process can reach the port. If you drop the filesystem's implicit ACL, you owe an explicit one: here, a per-session random token plus constant-time validation. The token's own secrecy still rides on file permissions; loopback moved the boundary, it didn't remove the need for one.
- OS-assigned ports invert discovery. The bound address exists only after Listen(), so the producer has to publish it and the consumer has to poll, with a liveness check to fail fast. That replaced a pre-shared path with a small write-then-read protocol over the state file.
One operational note from CI: TCP accept/read timing is looser than a local socket under load. Read deadlines in the daemon tests were raised from 2s to 10s to stop slow shared runners from flaking.
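For readers unfamiliar with Go's deadline API, the sketch below shows the mechanism such tests rely on. It is not the runner's test code: readWithDeadline, isTimeout, and demo are hypothetical helpers, and net.Pipe stands in for the TCP connection so the example needs no open port.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

// readWithDeadline performs one Read that gives up after timeout.
func readWithDeadline(conn net.Conn, buf []byte, timeout time.Duration) (int, error) {
	if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return 0, err
	}
	return conn.Read(buf)
}

// isTimeout reports whether err came from an expired deadline.
func isTimeout(err error) bool {
	return errors.Is(err, os.ErrDeadlineExceeded)
}

// demo shows a short deadline expiring and a generous one absorbing a
// slow writer on the same connection.
func demo() (timedOut bool, got string, err error) {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	buf := make([]byte, 16)

	// Nothing has been written yet, so a short deadline expires.
	_, rerr := readWithDeadline(client, buf, 50*time.Millisecond)
	timedOut = isTimeout(rerr)

	// A writer that shows up late is still inside a generous deadline.
	go func() {
		time.Sleep(100 * time.Millisecond)
		server.Write([]byte("ok"))
	}()
	n, err := readWithDeadline(client, buf, 10*time.Second)
	if err != nil {
		return timedOut, "", err
	}
	return timedOut, string(buf[:n]), nil
}

func main() {
	timedOut, got, err := demo()
	fmt.Println(timedOut, got, err)
}
```

A longer deadline costs nothing on a fast machine, since Read returns as soon as data arrives; it only bounds the worst case on a slow one.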