Replacing Sentry with a Self-Hosted Crash Pipeline

Starting point

iOS crash reporting ran through Sentry. The SDK was on-device, the release pipeline uploaded dSYMs to Sentry's upload API, and symbolication happened on their servers. When we deleted the PostHog and Sentry SDKs from the apps to consolidate on our own telemetry stack, the crash path was the last thing still tethered to a third party.

The release flow for a candidate build has three stages. The third one was named sentry and did one job: upload the dSYM to Sentry. Removing the dependency meant rebuilding that stage and the entire symbolication-and-analysis backend behind it.

The dSYM upload path was already split

One detail made the cutover smaller than it looked. The sentry stage had been dual-writing for two days before the switch: alongside the Sentry upload, it also PUT the dSYM into a self-hosted MinIO dsym bucket, because the symbolicator already referenced that bucket as a source. The dual-write was added defensively while the self-hosted path was being stood up.

So the refactor (commit 157b03018) did not build a new path. It promoted the existing MinIO write to the sole path and deleted the Sentry side:

infra/sentry — a 263-line HTTP client plus its tests
domain/service/sentry — the 113-line orchestration service
config.SentryConfig and the SENTRY_API_BASE / ORG_SLUG / TEAM_SLUG / AUTH_TOKEN environment variables

The stage was renamed end to end: sentry → report_dsym, including the entity fields, the workflow (WorkflowReportDsym), the task payload, and the DB columns. Migration 038 renames the candidate columns in place:

ALTER TABLE release_candidates RENAME COLUMN sentry_status TO report_dsym_status;
ALTER TABLE release_candidates RENAME COLUMN sentry_release_id TO report_dsym_debug_id;
-- ...started_at / completed_at / error_msg
UPDATE release_stages SET process_id = 'report_dsym' WHERE process_id IN ('sentry', 'sentry_upload');

Net diff: +1221 / -2859. The error-monitoring Sentry (SENTRY_DSN for server-side traces) is a separate concern and stayed.

Extracting the debug id ourselves

Removing Sentry's upload API removed the thing that returned the dSYM's debug id. The symbolicator indexes debug files by the Mach-O LC_UUID, so the new stage extracts that id itself before writing to the bucket.

dsym.ExtractDebugID (Server/libs/go/analytics/dsym/extract_uuid.go) reads the LC_UUID load command (0x1b) directly. Go's debug/macho does not surface a typed UUID command; it hands back load commands it does not decode as raw macho.LoadBytes, so the function matches the command id in the first four bytes of each one:

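import (
    "debug/macho"
    "encoding/hex"
)

// lcUUID is the LC_UUID load command id.
const lcUUID = 0x1b

// uuidFromLoads returns the image UUID as 32 lowercase hex characters,
// or "" when the file carries no LC_UUID command.
func uuidFromLoads(mf *macho.File) string {
    for _, l := range mf.Loads {
        // Commands debug/macho does not decode are kept as raw LoadBytes.
        lb, ok := l.(macho.LoadBytes)
        if !ok {
            continue
        }
        raw := lb.Raw()
        if len(raw) < 24 {
            continue // too short to hold cmd + cmdsize + uuid
        }
        // layout: cmd(4) + cmdsize(4) + uuid(16)
        if mf.ByteOrder.Uint32(raw[0:4]) == lcUUID {
            return hex.EncodeToString(raw[8:24])
        }
    }
    return ""
}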

It handles a raw thin or fat Mach-O DWARF binary as well as a .dSYM.zip (extracting Contents/Resources/DWARF/<binary> first), caps in-memory handling at 600 MB to guard against zip bombs, and returns the id as 32 lowercase hex characters without hyphens, the exact form the object key uses. The report_dsym handler downloads the dSYM, extracts the id, writes the file to the bucket, and marks the stage complete. If the writer is not wired, the stage is skipped gracefully rather than blocking the release chain.
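The zip handling and the size cap can be sketched as follows. This is a minimal illustration, not the code in extract_uuid.go: the function names, the error values, and the bundle name in the demo are hypothetical, and it assumes the archive is already in memory.

```go
package main

import (
	"archive/zip"
	"bytes"
	"errors"
	"fmt"
	"io"
	"strings"
)

// dwarfDir is the directory inside a .dSYM bundle that holds the DWARF binary.
const dwarfDir = "Contents/Resources/DWARF/"

// maxDWARFBytes mirrors the 600 MB cap mentioned in the text.
const maxDWARFBytes = 600 << 20

var (
	errTooLarge = errors.New("dsym: DWARF entry exceeds size cap")
	errNoDWARF  = errors.New("dsym: no DWARF binary in archive")
)

// readDWARFFromZip returns the first file that sits directly inside a
// Contents/Resources/DWARF/ directory, refusing to buffer more than limit bytes.
func readDWARFFromZip(zipData []byte, limit int64) ([]byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(zipData), int64(len(zipData)))
	if err != nil {
		return nil, fmt.Errorf("open zip: %w", err)
	}
	for _, f := range zr.File {
		i := strings.Index(f.Name, dwarfDir)
		if i < 0 {
			continue
		}
		rest := f.Name[i+len(dwarfDir):]
		if rest == "" || strings.Contains(rest, "/") {
			continue // the directory entry itself, or something nested deeper
		}
		rc, err := f.Open()
		if err != nil {
			return nil, fmt.Errorf("open %s: %w", f.Name, err)
		}
		// Read one byte past the cap so an oversized entry is detected from
		// the bytes actually produced, not from the size the header declares.
		data, err := io.ReadAll(io.LimitReader(rc, limit+1))
		rc.Close()
		if err != nil {
			return nil, fmt.Errorf("read %s: %w", f.Name, err)
		}
		if int64(len(data)) > limit {
			return nil, errTooLarge
		}
		return data, nil
	}
	return nil, errNoDWARF
}

// mustBuildZip creates a one-entry archive in memory for the demo below.
func mustBuildZip(name string, content []byte) []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create(name)
	if err != nil {
		panic(err)
	}
	if _, err := w.Write(content); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	z := mustBuildZip("WidgetCraft.app.dSYM/Contents/Resources/DWARF/WidgetCraft", []byte("fake-dwarf"))
	data, err := readDWARFFromZip(z, maxDWARFBytes)
	fmt.Println(len(data), err)
}
```

Enforcing the cap on bytes read, rather than on the size recorded in the zip header, is the design choice that matters for a zip-bomb guard: the header is attacker-controlled, the byte count is not.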

The MinIO layout was the expensive lesson

Writing to the bucket is easy. Writing it so the symbolicator finds the file took several iterations, and every failure was silent: the symbolicator returned missing, not an error.

The first version used layout: native. That name is misleading: sentry/symbolicator's native layout is a Microsoft symstore-style 5-level directory split on 4-char chunks with a .app suffix. We uploaded to 20/2680550c.../WidgetCraft; the symbolicator looked for 2026/8055/0C25/3246/899E/8EC69155D60D.app. Permanently not found.

The fix (commit d7f5ea2ce) switched to layout: unified, which is what we'd assumed native was:

/sources/20/2680550c253246899e8ec69155d60d/debuginfo   (dSYM)
/sources/20/2680550c253246899e8ec69155d60d/executable  (stripped binary)
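Both paths follow mechanically from the debug id, which makes the mismatch easy to reproduce. The sketch below derives the unified key shown above and the symstore-style path the symbolicator looked up under layout: native. The function names are hypothetical, the native rule is written from the description in this post rather than from the symbolicator source, and the /sources/ prefix is left to the caller.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// normalizeDebugID lowercases the id and checks it is 32 hex characters.
func normalizeDebugID(id string) (string, error) {
	id = strings.ToLower(id)
	if len(id) != 32 {
		return "", fmt.Errorf("debug id must be 32 hex chars, got %d", len(id))
	}
	if _, err := hex.DecodeString(id); err != nil {
		return "", fmt.Errorf("debug id is not hex: %w", err)
	}
	return id, nil
}

// unifiedDebugKey builds <first 2 chars>/<remaining 30>/debuginfo.
// The filename is fixed, so callers cannot pass a binary name.
func unifiedDebugKey(debugID string) (string, error) {
	id, err := normalizeDebugID(debugID)
	if err != nil {
		return "", err
	}
	return id[:2] + "/" + id[2:] + "/debuginfo", nil
}

// symstoreStyleKey builds the path described for layout: native:
// uppercase, five 4-char directory levels, then the remaining 12 chars + ".app".
func symstoreStyleKey(debugID string) (string, error) {
	id, err := normalizeDebugID(debugID)
	if err != nil {
		return "", err
	}
	up := strings.ToUpper(id)
	parts := make([]string, 0, 6)
	for i := 0; i < 20; i += 4 {
		parts = append(parts, up[i:i+4])
	}
	parts = append(parts, up[20:]+".app")
	return strings.Join(parts, "/"), nil
}

func main() {
	const id = "202680550c253246899e8ec69155d60d"
	u, _ := unifiedDebugKey(id)
	n, _ := symstoreStyleKey(id)
	fmt.Println("unified:", u)
	fmt.Println("native: ", n)
}
```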

Unified requires a fixed filename (debuginfo for the debug file), so the writer hardcodes it and the caller cannot pass a binary name. That keeps the layout decision in exactly one place. Other silent gaps in the same area, fixed across adjacent commits: the symbolicate request field is stacktraces, not threads; a custom S3 endpoint must be passed as a [name, endpoint] region tuple, not a standalone field, or it's dropped; symbolicator 26.x rejects RFC 1918 addresses unless connect_to_reserved_ips: true; and ${VAR} in the config template needs os.ExpandEnv before the subprocess starts.
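One caveat on the last item: os.ExpandEnv replaces an unset variable with an empty string, which is a silent gap of its own. A stricter variant can be built on os.Expand. This is a sketch, not the project's code; expandStrict and the template line in main are hypothetical, and the template is not real symbolicator config.

```go
package main

import (
	"fmt"
	"os"
	"sort"
)

// expandStrict expands $VAR and ${VAR} like os.ExpandEnv, but returns an
// error naming every variable that lookup could not resolve.
func expandStrict(tmpl string, lookup func(string) (string, bool)) (string, error) {
	missing := map[string]struct{}{}
	out := os.Expand(tmpl, func(name string) string {
		v, ok := lookup(name)
		if !ok {
			missing[name] = struct{}{}
		}
		return v
	})
	if len(missing) > 0 {
		names := make([]string, 0, len(missing))
		for n := range missing {
			names = append(names, n)
		}
		sort.Strings(names) // stable error text
		return "", fmt.Errorf("config template references unset variables: %v", names)
	}
	return out, nil
}

func main() {
	// Illustrative placeholder line; the key and variable name are made up.
	tmpl := "endpoint: ${DSYM_BUCKET_ENDPOINT}\n"
	out, err := expandStrict(tmpl, os.LookupEnv)
	if err != nil {
		fmt.Println("refusing to start:", err)
		return
	}
	fmt.Print(out)
}
```

Passing the lookup function in, instead of calling os.LookupEnv directly, keeps the helper testable without touching the process environment.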

Because none of these surface as errors, a later pass (commit 2ddfa74f2) added two checks. The first is a static configcheck linter in CI, needed because sentry/symbolicator 26.5.1 has no --check-config flag. The second is a testcontainers e2e test that boots MinIO plus the symbolicator, uploads a minimal LC_UUID-only Mach-O fixture, and asserts debug_status: found. The principle: a debugging session that burned a day on silent gaps should leave behind checks that fail loudly.
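A fixture of that kind is small enough to build by hand. The sketch below assembles a 64-bit little-endian Mach-O header followed by a single LC_UUID command and reads the id back with debug/macho. It is an assumption about what such a fixture could look like, not the one in the repository, and it does not show that the symbolicator accepts this exact image; the helper names are hypothetical.

```go
package main

import (
	"bytes"
	"debug/macho"
	"encoding/binary"
	"encoding/hex"
	"errors"
	"fmt"
)

const lcUUID = 0x1b

// minimalMachO returns a 56-byte image: 32-byte header + 24-byte LC_UUID command.
func minimalMachO(uuid [16]byte) []byte {
	var buf bytes.Buffer
	words := []uint32{
		macho.Magic64,          // magic
		uint32(macho.CpuArm64), // cputype
		0,                      // cpusubtype
		0xa,                    // filetype: MH_DSYM
		1,                      // ncmds
		24,                     // sizeofcmds
		0,                      // flags
		0,                      // reserved (64-bit header only)
		lcUUID,                 // load command: cmd
		24,                     // load command: cmdsize
	}
	// Writing fixed-size values to a bytes.Buffer does not fail.
	_ = binary.Write(&buf, binary.LittleEndian, words)
	buf.Write(uuid[:])
	return buf.Bytes()
}

// debugIDFromImage parses a thin Mach-O image and returns its LC_UUID
// as 32 lowercase hex characters.
func debugIDFromImage(img []byte) (string, error) {
	mf, err := macho.NewFile(bytes.NewReader(img))
	if err != nil {
		return "", err
	}
	for _, l := range mf.Loads {
		lb, ok := l.(macho.LoadBytes)
		if !ok || len(lb) < 24 {
			continue
		}
		if mf.ByteOrder.Uint32(lb[0:4]) == lcUUID {
			return hex.EncodeToString(lb[8:24]), nil
		}
	}
	return "", errors.New("no LC_UUID load command")
}

// mustUUID decodes 32 hex characters into a 16-byte array.
func mustUUID(s string) [16]byte {
	var u [16]byte
	b, err := hex.DecodeString(s)
	if err != nil || len(b) != 16 {
		panic("bad uuid literal")
	}
	copy(u[:], b)
	return u
}

func main() {
	img := minimalMachO(mustUUID("202680550c253246899e8ec69155d60d"))
	id, err := debugIDFromImage(img)
	fmt.Println(len(img), id, err)
}
```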

The runtime path

The worker side (commit 5882bad25) is a crash-worker binary consuming events.crash.raw from Redpanda. The worker is the only caller of the symbolicator, so instead of a separate deployment the symbolicator runs as a child process of the worker: the worker starts the symbolicator binary on 127.0.0.1:3021 at startup, and if the child exits the parent panics to trigger a restart. Fail fast, no self-healing.
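The fail-fast supervision can be sketched with os/exec. This is an illustration under assumptions: startChild, failFast, and demoChild are hypothetical names, the demo runs a short-lived sh command in place of the symbolicator binary, and main recovers the panic only so the example exits cleanly.

```go
package main

import (
	"fmt"
	"os/exec"
)

// startChild starts cmd and returns a channel that receives the result of
// cmd.Wait exactly once, when the child exits.
func startChild(cmd *exec.Cmd) (<-chan error, error) {
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	exited := make(chan error, 1)
	go func() { exited <- cmd.Wait() }()
	return exited, nil
}

// failFast blocks until the child exits and then panics. Any exit counts,
// including a clean one: the worker cannot do its job without the child.
func failFast(exited <-chan error) {
	err := <-exited
	panic(fmt.Sprintf("child process exited: %v", err))
}

// demoChild stands in for the symbolicator command line, which is not shown here.
func demoChild() *exec.Cmd {
	return exec.Command("sh", "-c", "exit 3")
}

func main() {
	// A real worker would let the panic crash the process so the
	// orchestrator restarts it; the recover is for the demo only.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered for demo:", r)
		}
	}()
	exited, err := startChild(demoChild())
	if err != nil {
		fmt.Println("start failed:", err)
		return
	}
	failFast(exited)
}
```

In a worker, failFast would run in its own goroutine next to the consumer loop, so an unrecovered panic there takes the whole process down.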

Pipeline.Process runs eight steps per event, ordered so the idempotent steps come first. Among them: unmarshal, schema validation, archive the raw envelope to S3 (best-effort), symbolicate, compute a fingerprint, write ClickHouse crash_events, and upsert a projection into the mainline Postgres crash_issues table. Symbolication failure is non-fatal: the pipeline falls back to raw frames for the fingerprint so an issue still groups. The KSCrash JSON the iOS reporter sends is converted to the symbolicator wire format: binary-image UUIDs lowercased, float addresses rendered to 0x hex, and only the crashed thread kept.
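The three conversion rules can be sketched as below. The struct shapes are hypothetical and heavily trimmed; the real KSCrash report and the symbolicator request carry more fields, and no JSON field names are implied here.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"strings"
)

// Hypothetical input shapes: addresses arrive as JSON numbers, so float64.
type rawImage struct {
	UUID string
	Addr float64
}

type rawThread struct {
	Crashed bool
	Addrs   []float64
}

// Hypothetical output shape for one binary image.
type module struct {
	DebugID   string
	ImageAddr string
}

var errNoCrashedThread = errors.New("report has no crashed thread")

// hexAddr renders a whole-number float as 0x-prefixed lowercase hex.
func hexAddr(a float64) (string, error) {
	if a < 0 || a >= 1<<64 || a != math.Trunc(a) {
		return "", fmt.Errorf("address %v is not a valid unsigned integer", a)
	}
	return fmt.Sprintf("0x%x", uint64(a)), nil
}

// convert lowercases image UUIDs, hex-encodes addresses, and keeps only
// the frames of the first thread marked as crashed.
func convert(images []rawImage, threads []rawThread) ([]module, []string, error) {
	mods := make([]module, 0, len(images))
	for _, img := range images {
		addr, err := hexAddr(img.Addr)
		if err != nil {
			return nil, nil, err
		}
		mods = append(mods, module{DebugID: strings.ToLower(img.UUID), ImageAddr: addr})
	}
	for _, t := range threads {
		if !t.Crashed {
			continue
		}
		frames := make([]string, 0, len(t.Addrs))
		for _, a := range t.Addrs {
			s, err := hexAddr(a)
			if err != nil {
				return nil, nil, err
			}
			frames = append(frames, s)
		}
		return mods, frames, nil
	}
	return nil, nil, errNoCrashedThread
}

func main() {
	mods, frames, err := convert(
		[]rawImage{{UUID: "20268055-0C25-3246-899E-8EC69155D60D", Addr: 4294967296}},
		[]rawThread{{Addrs: []float64{1}}, {Crashed: true, Addrs: []float64{4294971392}}},
	)
	fmt.Println(mods, frames, err)
}
```

A float64 represents integers exactly only up to 2^53, so this relies on addresses staying below that bound.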

The two writes are deliberately independent: ClickHouse for analysis, Postgres for the issue-tracking business system. The worker writes the Postgres projection by connecting to PG directly across a VLAN with a restricted role, rather than calling back into a mainline HTTP RPC.

On the device, CrashReporterService wraps KSCrash 2.5.1 and uploads over Connect-RPC with HMAC auth, reusing the analytics-reporter signing scheme (the shared signer was extracted into Foundation/Services/AnalyticsConnectAuth). Pending crashes are reconciled from a state file and flushed on a background delay so cold start is untouched.

How an agent reads it

The point of keeping issues in mainline Postgres is that an agent can query them through the same CLI it uses for everything else. The mainline /api/v1/crash/* REST API (wired in commit f5282f9ea) backs four mainline-cli subcommands:

mainline-cli crash triage --product WidgetCraft --release 1.4.0 --since 24h
mainline-cli crash issues --product WidgetCraft --status open
mainline-cli crash show <fingerprint>
mainline-cli crash patch <fingerprint> --status resolved --resolved-in-release 1.4.1

The CLI defaults to JSON output so it pipes cleanly; --format=table is for humans. The issue record carries projection fields the worker writes directly — latest_frames, latest_dist, latest_os_version, latest_device_model — so show returns a representative symbolicated stack without a second query against ClickHouse events. State changes (patch) only touch the business columns and never the worker-maintained projection, so the two writers don't collide. The list reader deliberately omits the large frames field; only GetEvent pulls the full stacktrace JSON for deep analysis.
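The split between the two writers can be expressed as two statements over disjoint column sets. This is a sketch under assumptions: the projection column names come from the text, while the business columns (status, resolved_in_release), the fingerprint conflict key, and the function names are guesses made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Columns maintained by the crash worker (names from the text).
var projectionCols = []string{"latest_frames", "latest_dist", "latest_os_version", "latest_device_model"}

// Columns changed by `crash patch` (hypothetical names).
var businessCols = []string{"status", "resolved_in_release"}

// upsertProjectionSQL is the worker's write: $1 is the fingerprint,
// $2.. are the projection values. It assumes a unique fingerprint column.
func upsertProjectionSQL() string {
	cols := append([]string{"fingerprint"}, projectionCols...)
	params := make([]string, len(cols))
	for i := range cols {
		params[i] = fmt.Sprintf("$%d", i+1)
	}
	sets := make([]string, len(projectionCols))
	for i, c := range projectionCols {
		sets[i] = c + " = EXCLUDED." + c
	}
	return fmt.Sprintf(
		"INSERT INTO crash_issues (%s) VALUES (%s) ON CONFLICT (fingerprint) DO UPDATE SET %s",
		strings.Join(cols, ", "), strings.Join(params, ", "), strings.Join(sets, ", "))
}

// patchSQL is the API's write: $1 is the fingerprint, $2.. are business values.
func patchSQL() string {
	sets := make([]string, len(businessCols))
	for i, c := range businessCols {
		sets[i] = fmt.Sprintf("%s = $%d", c, i+2)
	}
	return "UPDATE crash_issues SET " + strings.Join(sets, ", ") + " WHERE fingerprint = $1"
}

// touches reports whether sql mentions any of the given column names.
func touches(sql string, cols []string) bool {
	for _, c := range cols {
		if strings.Contains(sql, c) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(upsertProjectionSQL())
	fmt.Println(patchSQL())
	fmt.Println("worker touches business columns:", touches(upsertProjectionSQL(), businessCols))
	fmt.Println("patch touches projection columns:", touches(patchSQL(), projectionCols))
}
```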

The end-to-end check that closed the loop: a WidgetCraft AboutPage.swift:60 array out-of-bounds crash, symbolicated against an uploaded dSYM, resolving to the exact line.

What carried over

A dual-write window before a cutover turns a risky migration into a deletion. The hard part (the new path) was validated in production while the old path still carried the load.
When you remove a managed service, inventory what it silently provided. Sentry's upload API returned the debug id; that obligation moved onto us and became 139 lines of Mach-O parsing.
A symbolicator that returns missing instead of an error will eat days. Every silent gap we hit is now a CI lint or an e2e assertion, because "it works on my repro" is not the same as "it fails when it's wrong."
Keeping crash issues in the same store the rest of the platform queries means triage is a CLI call an agent already knows how to make, not a separate console to integrate.