Tech · 5/29/2026 · 8 min

Service Registered != Data Flowing: Anatomy of a Silent Telemetry Outage

A performance metric pipeline shipped, compiled, and registered cleanly, yet recorded zero rows for months. The cause was four gaps in series, each of which passed every check.

We replaced PostHog and Sentry with an in-house telemetry stack: an iOS SDK records events, performance histograms, and crashes; a Connect-RPC collector writes to redpanda; Go workers consume into ClickHouse. The events and crash streams worked. The performance metric stream — wired last — recorded zero rows on a production device for months without a single error.

The service was registered. The build was green. The dashboards were empty. "Service registered" turned out to carry no information about whether data was actually flowing.

The path from record() to a database row

For a metric to land in ClickHouse, four independent links have to be intact:

  1. register — the assembly is registered into the MicroKernel service container, so the service exists.
  2. emit — something actually calls record(); a registered service nobody invokes produces nothing.
  3. start — start() runs, so the snapshot timer fires and buffered samples get uploaded.
  4. consent — the consent gate lets the record through instead of dropping it.

Each link lives in a different file, addresses a different concern, and is owned by a different layer. Any one of them, if broken, yields zero data with no error. All four were broken, and because they sit in series, each fix only exposed the next gap.
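
The way a broken early link hides the later ones can be modeled as an ordered chain. This is a minimal sketch, assuming a hypothetical PipelineLink enum and a hand-filled status map; neither exists in the SDK described here.

```swift
// Hypothetical model of the four links; the names are illustrative only.
enum PipelineLink: CaseIterable {
    case register, emit, start, consent
}

/// Returns the first link that is not known to be intact, or nil if the
/// whole chain holds. Links are checked in declaration order because a
/// broken early link masks every link after it.
func firstBrokenLink(_ intact: [PipelineLink: Bool]) -> PipelineLink? {
    for link in PipelineLink.allCases where intact[link] != true {
        return link
    }
    return nil
}

// Starting state: nothing is wired. Each fix reveals the next broken link.
var status: [PipelineLink: Bool] = [:]
while let broken = firstBrokenLink(status) {
    print("zero rows; next broken link: \(broken)")
    status[broken] = true   // apply the fix, then re-check from the top
}
```

The loop runs once per link, which mirrors how the outage was debugged: the sink kept reading zero until the last link was repaired.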

Gap 1: registered nowhere

MetricCollectionService, its assembly, the uploader, the generated RPC client, and the emitter were all implemented, but the bootstrap never called MetricCollectionAssembly.registerIfConfigured(...). As a result the MicroKernel could not resolve the service, and no emitter could be instantiated.

The terminal symptom is the only one that matters:

redpanda    metrics.histogram / metrics.sparse / metrics.tail    HW = 0
ClickHouse  metrics_histogram / metrics_sparse / metrics_tail    0 rows
collector   zero IngestHistograms calls

The fix was one line in registerPlatformServices(), placed after TelemetryContextService registers, because the install ID is read from that service as the single source of truth (SSOT). Commit ea368a8a2.
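
The failure mode is easy to reproduce with a toy container. The sketch below assumes a hypothetical MiniKernel and MetricRecording protocol that only imitate the shape of a service container; it is not the real MicroKernel API. An unregistered service resolves to nil, and any caller using `if let` simply skips its work without raising an error.

```swift
// Hypothetical stand-in for a service container; not the real MicroKernel.
final class MiniKernel {
    private var services: [ObjectIdentifier: Any] = [:]

    func register<T>(_ type: T.Type, _ instance: T) {
        services[ObjectIdentifier(type)] = instance
    }

    /// Returns nil when nothing was registered for the type. No error is raised.
    func getService<T>(_ type: T.Type) -> T? {
        services[ObjectIdentifier(type)] as? T
    }
}

// Hypothetical protocol standing in for the metric service interface.
protocol MetricRecording {
    func record(_ name: String, value: UInt64)
}

struct NoopMetricService: MetricRecording {
    func record(_ name: String, value: UInt64) {}
}

let kernel = MiniKernel()

// Before the fix: resolution quietly yields nil.
let before = kernel.getService(MetricRecording.self)

// After the one-line fix: the service resolves.
kernel.register(MetricRecording.self, NoopMetricService())
let after = kernel.getService(MetricRecording.self)

print(before == nil, after != nil)
```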

Gap 2: registered, but nobody emits

Wiring the assembly registered the service. It did not connect any producer. The LaunchPerformanceEmitter was implemented but never instantiated, so record() was never called.

The fix instantiates the emitter right after the service registers, then records the first frame via CATransaction.setCompletionBlock, which fires once the launch transaction reaches the screen:

if let metricService = MicroKernel.shared.getService(MetricCollectionServiceProtocol.self) {
    launchEmitter = LaunchPerformanceEmitter(service: metricService)
    launchEmitter?.recordDidFinishLaunching()
}
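
What the emitter has to do can be sketched without UIKit. The version below is an illustration under stated assumptions: SketchLaunchEmitter, its injected clock, and recordFirstFrame() are hypothetical, and the recorder closure stands in for the service's record(). In the app, the first-frame call would be triggered from the completion block described above.

```swift
// Hypothetical emitter; the `record` closure stands in for the service's record().
final class SketchLaunchEmitter {
    private let now: () -> UInt64                  // injected monotonic clock, in ns
    private let record: (String, UInt64) -> Void
    private var launchStart: UInt64?

    init(now: @escaping () -> UInt64, record: @escaping (String, UInt64) -> Void) {
        self.now = now
        self.record = record
    }

    /// Marks the start of the launch measurement.
    func recordDidFinishLaunching() {
        launchStart = now()
    }

    /// Call once the first frame is on screen; emits at most one sample per launch.
    func recordFirstFrame() {
        guard let start = launchStart else { return }
        launchStart = nil
        record("app_launch_to_first_frame", now() - start)
    }
}

// Usage with a fake clock so the elapsed time is deterministic.
var fakeTime: UInt64 = 0
var samples: [(String, UInt64)] = []
let emitter = SketchLaunchEmitter(now: { fakeTime },
                                  record: { samples.append(($0, $1)) })
emitter.recordDidFinishLaunching()
fakeTime = 2_340_000_000
emitter.recordFirstFrame()
emitter.recordFirstFrame()   // ignored: the launch sample was already emitted
```

Without the instantiation step none of this code runs, which is why a fully implemented emitter still produced no samples.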

Gap 3: emitted, but never uploaded

MetricCollectionService.start() starts the periodic snapshot timer, replays persisted records, and subscribes to the consent stream. The assembly's assemble() only constructed the service — it never called start(). record() wrote to the in-memory store, and the store was never snapshotted or uploaded.

A reference implementation was already in the same module. AnalyticsReporterServiceAssembly.assemble() calls service.start(configuration:) before returning, which rules out a "half-assembled" state where the service exists but is dormant. The metric assembly was changed to match:

let service = MetricCollectionService(/* ... */)
// Without an explicit start, record() only writes the in-memory store and
// never snapshots/uploads — metrics topic stays at 0 traffic.
Task { await service.start() }
return service
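
The difference between a constructed service and a started one can be shown in isolation. This is a simplified sketch, assuming a hypothetical SketchMetricService with a synchronous start() and a manual snapshotTick() in place of the real timer and async start.

```swift
// Hypothetical lifecycle service; illustrates the pattern, not the real SDK types.
final class SketchMetricService {
    private(set) var isStarted = false
    private(set) var buffered: [UInt64] = []
    private(set) var uploaded: [UInt64] = []

    func start() { isStarted = true }

    func record(_ value: UInt64) { buffered.append(value) }

    /// Stands in for the snapshot timer firing. A dormant service uploads nothing.
    func snapshotTick() {
        guard isStarted else { return }
        uploaded.append(contentsOf: buffered)
        buffered.removeAll()
    }
}

enum SketchAssembly {
    /// Construct-only assembly: the caller must remember to call start().
    static func assembleDormant() -> SketchMetricService {
        SketchMetricService()
    }

    /// Starts inside the assembly, so a resolved service is always running.
    static func assemble() -> SketchMetricService {
        let service = SketchMetricService()
        service.start()
        return service
    }
}

let dormant = SketchAssembly.assembleDormant()
dormant.record(1_000_000_000)
dormant.snapshotTick()

let running = SketchAssembly.assemble()
running.record(1_000_000_000)
running.snapshotTick()

print(dormant.uploaded.count, running.uploaded.count)
```

Both services accept record() without complaint; only the started one moves the sample out of the in-memory buffer.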

Gap 4: started, but dropped at the consent gate

The consent state defaults to .undecided, and the host app never called setMetricConsent(.granted). The record path gated on == .granted:

guard consent.metricConsent == .granted else { return }

Every record was dropped. This is the convenient-default trap: an opt-in default (undecided) colliding with an == .granted check produces a silent, total drop.

Performance histograms are anonymous timings and counts with no PII, which fits an opt-out model. The gate was changed so .undecided is collected and only an explicit .declined short-circuits:

guard consent.metricConsent != .declined else { return }
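
The two gates differ only in how they treat .undecided. A minimal sketch, assuming a three-case consent enum with the same case names as above; the gate function names are hypothetical:

```swift
// Three-state consent, mirroring the case names used in the text.
enum MetricConsent {
    case undecided, granted, declined
}

/// Opt-in gate: only an explicit grant lets a record through.
func passesOptInGate(_ consent: MetricConsent) -> Bool {
    consent == .granted
}

/// Opt-out gate: everything passes except an explicit decline.
func passesOptOutGate(_ consent: MetricConsent) -> Bool {
    consent != .declined
}

// Print the truth table for both gates.
for consent in [MetricConsent.undecided, .granted, .declined] {
    print(consent, passesOptInGate(consent), passesOptOutGate(consent))
}
```

The gates agree on .granted and .declined. They disagree only on .undecided, which is the state every install starts in.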

A test pins the semantics so this cannot regress:

func testRecord_consentUndecided_recordsAnyway() {
    let (s, _, _, _, _, _, store) = makeService(consent: .undecided)
    s.record(Metric.appLaunchToFirstFrame, value: 1_000_000_000,
             tags: Metric.AppLaunchToFirstFrameTags(launchType: .cold))
    XCTAssertEqual(store.snapshot().histogramCount, 1)
}

Commits ea368a8a2 (gap 1) and 414dba729 (gaps 2–4).

What "verified" means

The four fixes were not declared done because the build passed. They were verified at the terminal — the database row, not the registration call. A cold launch of WidgetCraft on a physical device, with a 60-second wait for the snapshot interval, closed the chain:

collector   IngestHistograms status=ok
redpanda    metrics.histogram HW=1
ClickHouse  metrics_histogram:
              metric_name = app_launch_to_first_frame
              metric_count = 1
              metric_sum   = 2.34e9 ns  (~2.34 s cold-launch-to-first-frame)
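
The same check can be written as a small routine that starts at the sink and walks back. This is a sketch under assumptions: the StageCount type and the hand-entered counts are hypothetical, and a real check would query the collector, redpanda, and ClickHouse directly.

```swift
// Hypothetical per-stage counters, ordered from source to sink.
struct StageCount {
    let stage: String
    let count: Int
}

/// Starts at the sink and walks back toward the source. Returns nil when the
/// sink has data (or no stages are given); otherwise returns the earliest
/// stage in the trailing run of zeros, which is where the flow stops.
func firstEmptyStage(_ stages: [StageCount]) -> String? {
    guard let sink = stages.last, sink.count == 0 else { return nil }
    var culprit = sink.stage
    for stage in stages.reversed().dropFirst() {
        if stage.count > 0 { break }
        culprit = stage.stage
    }
    return culprit
}

// Counts during the outage and after the fixes, taken from the readings above.
let outage = [
    StageCount(stage: "collector", count: 0),
    StageCount(stage: "redpanda", count: 0),
    StageCount(stage: "ClickHouse", count: 0),
]
let healthy = [
    StageCount(stage: "collector", count: 1),
    StageCount(stage: "redpanda", count: 1),
    StageCount(stage: "ClickHouse", count: 1),
]
print(firstEmptyStage(outage) ?? "flowing", firstEmptyStage(healthy) ?? "flowing")
```

When every stage reads zero, the routine points at the collector, meaning the problem is upstream in the SDK. That matches the outage: all four gaps sat on the device, before the first RPC.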

Constraints worth carrying

Each gap looked connected from where it was edited. The pattern is general:

  • A registered service is not a running one. Lifecycle services must start() inside assemble(). Leaving start to the caller creates a half-assembled state — present but dormant — that no compiler catches.
  • Verify at the sink, not the source. "Registered ok" and "no error returned" are not evidence of data flow. The check is the terminal store: ClickHouse rows, redpanda high-watermark > 0. When it reads zero, walk one link back.
  • Convenient defaults plus silent gates conspire. An undecided default landing on an == .granted check drops everything quietly. Decide opt-in vs opt-out explicitly, and make the gate's semantics a test, not a comment.
  • The right pattern was already in the repo. The events reporter started inside its assembly; the metric service did not. A new telemetry service should copy the existing one's lifecycle, link for link, rather than re-deriving it.