Tech · 5/25/2026 · 9 min

Splitting Analytics Into Three Binaries, Moving to Connect-RPC, and Isolating ClickHouse

One analytics binary did ingest, enrich, query, and admin. We split it into collector/worker/datahouse, replaced HTTP+JSON+HMAC with Connect-RPC, and made test data physically incapable of reaching prod dashboards.

The starting point

The analytics service began as a single Go binary with cobra subcommands: serve (public event ingest), worker (enrich + load to ClickHouse), query (dashboard reads), plus migrate / ch-migrate / seed / replay. One process image held PostgreSQL credentials, ClickHouse DDL credentials, the Redpanda consumer, and a public HTTP listener.

That layout has three problems, and they compound:

  • Attack surface. The public ingest endpoint runs in the same binary that can DROP TABLE in ClickHouse. A compromise of the front door reaches the warehouse.
  • Failure domain. Ingest is latency-sensitive and CPU-light; the worker is throughput-heavy and bursty. Sharing a process means a worker backlog competes with request handling for the same resources.
  • Wire protocol drift. The SDK shipped hand-rolled HTTP + JSON + an HMAC signature over the request body. The server re-implemented the same parsing and verification. Two copies of one contract, kept in sync by hand.

Two changes addressed these: a Connect-RPC migration, then a three-binary split.

Connect-RPC: deleting the hand-rolled transport

The SDK and server previously agreed on a contract that lived in code on both sides — JSON field names, gzip framing, and an HMAC signature computed over the raw request body. Every change to that contract was two edits that had to match.
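To make the old contract concrete, here is a minimal sketch of body signing. It assumes HMAC-SHA256 with a hex-encoded signature; the original scheme's hash, encoding, and key handling are not described here, and signBody / verifyBody are hypothetical names. It shows why both sides had to agree on the exact bytes: the same JSON value with different whitespace no longer verifies.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signBody returns a hex-encoded HMAC-SHA256 over the exact request bytes.
// (Assumed algorithm and encoding, for illustration only.)
func signBody(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verifyBody recomputes the MAC over the received bytes and compares it
// to the presented signature in constant time.
func verifyBody(secret, body []byte, sig string) bool {
	want, err := hex.DecodeString(sig)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(mac.Sum(nil), want)
}

func main() {
	secret := []byte("demo-secret")
	compact := []byte(`{"events":[{"name":"open"}]}`)
	spaced := []byte(`{"events": [{"name": "open"}]}`) // same JSON value, different bytes

	sig := signBody(secret, compact)
	fmt.Println(verifyBody(secret, compact, sig)) // true
	fmt.Println(verifyBody(secret, spaced, sig))  // false
}
```

Any re-serialization between signer and verifier changes the bytes and therefore the signature, which is the coupling the two hand-written implementations had to preserve.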

The migration moved the contract into .proto as the single source of truth (Server/libs/proto/analytics/v1/), with Go and Swift generated from the same files. The server now exposes a single :8080 HTTP/2 (h2c) port that speaks Connect, gRPC, and gRPC-Web simultaneously. The handler signature is the standard Connect shape:

func (s *Service) IngestEvents(
    ctx context.Context,
    req *connect.Request[analyticsv1.IngestEventsRequest],
) (*connect.Response[analyticsv1.IngestEventsResponse], error)

On the SDK side, three files were deleted outright: RequestSigner.swift, GzipCompressor.swift, and AnalyticsHTTPClient.swift. Gzip, HTTP/2, and error-code mapping are now the Connect runtime's job. The uploader holds a generated client and builds a generated request:

private let client: Aio_Analytics_V1_CollectorServiceClient
// ...
var req = Aio_Analytics_V1_IngestEventsRequest()
req.events = events
let resp = await client.ingestEvents(request: req, headers: [:])

Two non-obvious points fell out of this.

The signature interceptor stopped signing the body. The same port now accepts one logical message as binary protobuf or JSON, compressed or uncompressed, under Connect, gRPC, or gRPC-Web framing, so there is no single stable byte sequence left to HMAC. Body signing was removed; auth moved to a header-based scheme verified by an interceptor. The server-side note is explicit that reintroducing body signing under a multi-protocol codec would immediately reintroduce drift, which was the reason for the migration in the first place.

Codegen needed custom Bazel rules. Generating connect-go alongside protoc-gen-go-grpc produces conflicting type names, so the build uses a connect_go compiler with package_suffix= to emit into the same package as the .pb.go. On Swift, connect-swift and swift-protobuf must be pulled through one compiler path, otherwise the two copies of swift-protobuf collide at link time with duplicate symbols.

Health and readiness endpoints stayed plain HTTP GET. Kubernetes and Docker probes expect that, and there is no reason to route a liveness check through an RPC codec.

Three binaries, with constraints enforced by imports

With one wire protocol, the binary split became mechanical. The single process became three deployables, each with a single responsibility:

  • analytics-collector — public ingest. Runs the Connect-RPC server, mounts only CollectorService, writes events to Redpanda. It does not hold ClickHouse credentials, does not consume MQ, and carries no admin commands.
  • analytics-worker — internal consumer. Runs enricher + loader + persons-merger against three Redpanda topics, writes ClickHouse. It opens no business HTTP listener; the only socket is 127.0.0.1:8081/healthz for the Docker health probe.
  • datahouse — folded into the existing mainline server rather than left standalone. CH admin (DDL) and read-only ad-hoc query are exposed as DatahouseService RPCs, reached through mainline's main domain with a PAT. CLI, agents, and CI no longer connect to ClickHouse directly.

The interesting part is that the boundaries are not documented conventions — they are compile-time facts. The collector cannot import libs/go/analytics/{clickhouse, chviews, replay}, so it physically cannot acquire warehouse credentials. The worker has no interceptor.APIKeyAuth or collector RPC import, so it cannot accidentally grow a public surface. Shared logic that both needed — config, entity types, MQ client, enricher/loader, testkit — moved down into Server/libs/go/analytics/.
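How the build enforces this is not spelled out above. One way to keep such a boundary from regressing, sketched here as an assumption rather than the project's actual mechanism, is a small check that parses a source file's import block and flags forbidden packages. The module path example.com/server and the helper name forbiddenImports are hypothetical.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// forbiddenImports parses only the import declarations of src and returns
// every import path that equals, or sits beneath, a forbidden package path.
func forbiddenImports(src string, forbidden []string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "src.go", src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var hits []string
	for _, imp := range f.Imports {
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return nil, err
		}
		for _, bad := range forbidden {
			if path == bad || strings.HasPrefix(path, bad+"/") {
				hits = append(hits, path)
			}
		}
	}
	return hits, nil
}

func main() {
	// Hypothetical collector source that breaks the rule.
	src := `package collector

import (
	"context"

	"example.com/server/libs/go/analytics/clickhouse"
)
`
	forbidden := []string{
		"example.com/server/libs/go/analytics/clickhouse",
		"example.com/server/libs/go/analytics/chviews",
		"example.com/server/libs/go/analytics/replay",
	}
	hits, err := forbiddenImports(src, forbidden)
	if err != nil {
		panic(err)
	}
	fmt.Println(hits) // [example.com/server/libs/go/analytics/clickhouse]
}
```

This only inspects direct imports of one file. Catching transitive dependencies needs the full dependency list, for example from go list -deps, or a visibility rule in the build system.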

The split commit was a net deletion: +1893 / −3388 lines, mostly from removing the standalone query backend and the duplicated SDK transport.

datahouse's read path leans on database permissions rather than SQL parsing for safety. ExecSQL runs arbitrary SELECT against a ClickHouse user pinned to readonly = 2:

CREATE USER IF NOT EXISTS analytics_readonly ... SETTINGS readonly = 2;
GRANT SELECT ON analytics.* TO analytics_readonly;

readonly = 2 is the relevant ClickHouse setting: SELECT-only, but it still permits per-query settings like max_execution_time (which readonly = 1 would reject). Injection defense is the user grant, not a query validator. A separate dbt_runner user holds the CREATE VIEW / DROP VIEW grants for materializing metric views — the DDL path and the query path never share a credential.

Three levels of ClickHouse isolation

The third concern is keeping test and development data out of production dashboards. This is handled at three physical levels, not one flag.

Test isolation — is_test, defended in three layers. Events carry an is_test Bool DEFAULT false CODEC(ZSTD(1)) column. The column is nearly free: 99%+ of values are false, and ZSTD on a near-constant column compresses to roughly zero bytes per row. Three independent layers feed it:

  1. The iOS SDK's ReporterModeResolver short-circuits capture entirely under XCTest, XCUITest, or any simulator (SIMULATOR_DEVICE_NAME / SIMULATOR_UDID are part of the simulator runtime contract). Test events never reach the network. A single environment signal covers unit tests, UI tests, and local make run_app_on_sim, so every new product inherits the isolation with no per-product config.
  2. If anything bypasses the SDK and still sends $is_test = true, the worker's TestModeEnricher reads it and writes the column. This enricher has no configuration toggle — the rule is part of the SDK↔server contract, so operations cannot accidentally turn it off.
  3. The query service injects AND is_test = false into dashboard reads by default.
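Layers 2 and 3 can be sketched in a few lines of Go. The Event struct, the function names, and the analytics.events table are hypothetical, and the sketch assumes $is_test arrives as a boolean property; the real enricher and query builder are not shown in this post.

```go
package main

import "fmt"

// Event is a simplified stand-in for the worker's event type.
type Event struct {
	Name       string
	Properties map[string]any
	IsTest     bool // maps to the is_test column
}

// enrichTestMode copies a truthy $is_test property onto the column field.
// There is deliberately no configuration switch to disable it.
func enrichTestMode(e *Event) {
	if v, ok := e.Properties["$is_test"].(bool); ok && v {
		e.IsTest = true
	}
}

// dashboardQuery wraps the caller's condition and excludes test rows
// unless the caller explicitly asks for them.
func dashboardQuery(where string, includeTest bool) string {
	q := "SELECT count() FROM analytics.events WHERE (" + where + ")"
	if !includeTest {
		q += " AND is_test = false"
	}
	return q
}

func main() {
	e := Event{Name: "open", Properties: map[string]any{"$is_test": true}}
	enrichTestMode(&e)
	fmt.Println(e.IsTest) // true

	fmt.Println(dashboardQuery("name = 'open'", false))
	// SELECT count() FROM analytics.events WHERE (name = 'open') AND is_test = false
}
```

Wrapping the caller's condition in parentheses keeps an OR inside it from escaping the is_test filter.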

There is a deliberate ANALYTICS_FORCE_PROD=1 escape hatch for dogfooding real production reporting from a simulator — but it does not lift the XCTest/XCUITest guards. A real test process should never be able to pollute monitoring data, regardless of env.

Dev isolation — per-worktree port offset. The local stack (PostgreSQL, Redis, Redpanda, ClickHouse, MinIO, plus the three app containers) runs from a docker-compose whose host ports are offset by a hash of the git worktree name. Each service gets a 1000-wide band (POSTGRES_PORT = 15000 + offset, CLICKHOUSE_NATIVE_PORT = 18000 + offset, COLLECTOR_PORT = 24000 + offset, …) with the offset bounded below 500, so bands never collide and multiple worktrees run their full stack in parallel. In CI, a WORKTREE_NAME env (e.g. ci-job-$CI_JOB_ID) replaces the worktree hash for per-job isolation.
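The offset arithmetic can be sketched as follows. The hash function (FNV-1a here) and the helper names are assumptions, since the post only says the offset is a hash of the worktree name bounded below 500; the three base ports are the ones quoted above.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// maxOffset keeps every port inside the lower half of its 1000-wide band.
const maxOffset = 500

type band struct {
	Name string
	Base int
}

// bands lists the base ports quoted in the text; the real stack has more.
var bands = []band{
	{"POSTGRES_PORT", 15000},
	{"CLICKHOUSE_NATIVE_PORT", 18000},
	{"COLLECTOR_PORT", 24000},
}

// portOffset maps a worktree (or CI job) name to a stable offset in [0, 500).
func portOffset(worktree string) int {
	h := fnv.New32a()
	h.Write([]byte(worktree))
	return int(h.Sum32() % maxOffset)
}

// hostPorts returns the host port for each service in the given worktree.
func hostPorts(worktree string) map[string]int {
	off := portOffset(worktree)
	ports := make(map[string]int, len(bands))
	for _, b := range bands {
		ports[b.Name] = b.Base + off
	}
	return ports
}

func main() {
	for _, name := range []string{"main", "feature-x", "ci-job-1234"} {
		ports := hostPorts(name)
		fmt.Printf("%s (offset %d)\n", name, portOffset(name))
		for _, b := range bands {
			fmt.Printf("  %s=%d\n", b.Name, ports[b.Name])
		}
	}
}
```

Because the offset stays below 500 and bases are at least 1000 apart, one worktree's services can never overlap each other. Two different worktree names can still hash to the same offset, so a sketch like this would need a collision check or a manual override if that matters.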

Prod isolation — separate VM and ingress. Production is its own analytics-prod VM. Public ingest enters through analytics.monitor.agentsmesh.ai → an sg-001 Traefik that terminates TLS and reverse-proxies h2c to the collector container; the worker and ClickHouse sit on the LAN with no public port. The dev compose's is_test-default data and the prod warehouse are different machines, not different rows.

What transferred

A few constraints from this are reusable beyond analytics:

  • Put the wire contract in one artifact and generate both ends. A protocol kept in sync by code review on two sides drifts; a .proto that compiles into Go and Swift cannot.
  • Enforce service boundaries with the import graph, not a wiki page. "The collector must not touch ClickHouse" is a comment; "the collector does not import the clickhouse package" is checked by the compiler.
  • Defense in depth beats a single switch. Test-data isolation that depends on one boolean fails silently when the boolean is wrong. Three independent layers — SDK short-circuit, worker enricher, query filter — each catch what the others miss.
  • Lean on the datastore's permission model for query safety. A readonly = 2 ClickHouse user is a stronger guarantee than a SQL parser, and it is one line.