Splitting One River Queue into Nine, and Decoupling OTel Metrics from Traces

The shared queue head-of-line problem

Mainline runs background jobs on River, a Postgres-backed Go job queue. Every job kind — cron health checks, App Store rank scans, signing-profile provisioning, WidgetCraft publish workflows — landed in River's single default queue, sharing one MaxWorkers pool.

That arrangement has a failure mode that only shows up under the right load shape. The daily App Store rank scan (apprank_scan_country) fans out at 02:00 and enqueues roughly 100 items, each taking about 136 seconds. With one shared pool, those long jobs occupy every worker slot for ~20 minutes. Anything behind them — including the sub-second cron kinds whose SLO is latency, not throughput — sits in the available state waiting. Head-of-line blocking, in a job queue that has no notion of priority across a flat pool.

Routing by kind instead of one pool

The fix is to give each job family its own queue with its own concurrency budget, so a long-running batch kind cannot consume slots that a latency-sensitive kind needs. Nine queues, each carrying an independent MaxWorkers cap:

var QueueMaxWorkers = map[QueueName]int{
    QueueAppRankScan:    10, // the 100-deep daily burst, isolated
    QueueAppRankAgg:     3,  // daily aggregation tail (trend, thinning, partition)
    QueueCronShort:      6,  // sub-second cron; SLO is latency
    QueueSigning:        4,  // provision / health-check family
    QueueSigningRefresh: 2,  // profile_auto_refresh quarantined alone
    QueuePublish:        3,  // WidgetCraft publish workflows
    QueueRelease:        3,  // App Store release plumbing
    QueueMedia:          3,  // asset-heavy / upstream-bound kinds
    QueueImport:         2,  // product / file import
}

The total is 36 workers, sized for the worker VM (2 CPU, 4G RAM) running IO-bound handlers. Two isolation decisions are worth calling out. apprank_scan gets its own 10 slots so its burst is bounded to that queue. And profile_auto_refresh — historically the noisiest job in the signing family — is quarantined into signing_refresh with just two slots, so a panic loop in refresh caps its blast radius at two workers instead of draining the slots that provisioning and health-checks share.

Static routing, no silent fallback

Each TaskKind maps to exactly one queue through a static table:

var kindToQueue = map[TaskKind]QueueName{
    TypeStaleTaskCheck:     QueueCronShort,
    TypeTrunkImport:        QueueCronShort,
    TypeReleaseStatusPoll:  QueueCronShort,
    TypeAppRankScanCountry: QueueAppRankScan,
    TypeAppRankDailyTrend:  QueueAppRankAgg,
    TypeProfileAutoRefresh: QueueSigningRefresh,
    // ... one entry per kind
}

func QueueFor(kind TaskKind) (QueueName, error) {
    q, ok := kindToQueue[kind]
    if !ok {
        return "", fmt.Errorf("task: kind %q has no queue mapping", kind)
    }
    return q, nil
}

QueueFor returns an error on an unmapped kind rather than falling back to a default queue. That choice is deliberate. A silent fallback would route a newly-added kind to River's now-empty default queue, where its jobs would stall in available forever with no worker consuming them — the exact regression class the table exists to prevent. The enqueue path enforces it: when a caller hasn't pinned a queue explicitly, it looks the kind up and refuses to enqueue an unmapped one.

Three layers guard the invariant that every kind has exactly one queue:

Bootstrap self-check. queue.AddWorker records every registered kind (River's worker registry is opaque — no enumerate API), and at startup the bootstrap iterates RegisteredKinds() and panics if any lacks a mapping. Drift fails fast at boot, not silently at runtime.
Subpackage registration. WidgetCraft owns its publish kinds and registers them from init() via RegisterKindQueue, which keeps the kind constants colocated with their payloads without forcing an import cycle. It panics on duplicate registration to surface ambiguity at startup.
Unit tests. Six guard tests pin the contract — every known kind resolves to a non-empty queue, no routing entry is dead, the total budget stays at 36 (a deliberate review gate, not an accidental map edit), no queue has a non-positive worker count, and every queue has at least one kind routing to it.

The observability gap behind all this

Splitting queues only pays off if you can see per-queue behavior. Mainline emits OTel metrics through an in-process otelprom bridge into the shared Prometheus registry, so /metrics serves them to Grafana. The problem was the gate: a single OTelReady() switch — true only when an OTLP endpoint is configured — controlled the entire OTel pipeline, traces and metrics together.

Traces genuinely need a remote collector (Tempo) to be useful, so gating trace export on the endpoint is correct. Metrics don't. The otelprom bridge is in-process with zero external dependencies; the only cost beyond a no-op is an in-memory aggregator. Coupling the two meant business instrumentation produced nothing on /metrics until the trace collector was deployed — metric visibility blocked on trace rollout, for no technical reason.

The decoupling moves metric init out from behind the gate entirely:

func (c *ObservabilityConfig) OTelReady() bool {
    return c.OTelEndpoint != "" // gates trace export + task propagation only
}

InitOTelMetrics no longer early-returns when the endpoint is unset; the metric bridge is always on. A follow-up fix was needed the same day — buildOTelResource was swallowing a schema-URL conflict that suppressed metric exposure — so the decoupling is two commits, not one.

Instrumenting the workers

With metrics unconditionally live, each River worker gets an OTel middleware emitting three instruments:

river.job.duration  // histogram (s)        attrs: queue, kind, status
river.job.count     // counter              attrs: queue, kind, status, attempt
river.job.active    // UpDownCounter         attrs: queue

Two implementation details matter. First, middleware ordering. The chain is Trace → Recovery → Metrics → Log → Controlled, outermost first. Metrics sits inside Recovery so that when Recovery converts a panic into an error, the metrics layer still records status=error rather than missing the job entirely. And river.job.active is incremented on entry and decremented in a defer, so even a panicking job decrements correctly — otherwise Recovery catching the panic upstream would leave the active gauge stuck at +1 forever.

Second, cardinality. The attempt attribute is bucketed to 1 / 2 / >=3 rather than carried as a raw count. River's default max_attempts is 25; emitting each distinct attempt number as its own attribute value would multiply time series 25-fold for no analytical gain. The same discipline applies to the companion HTTP middleware added alongside — otelgin produces trace spans but not metrics, so an HTTP metrics middleware emits http.server.request.duration keyed on c.FullPath() (the route template /api/v1/products/:id, not the resolved URL, which would explode cardinality on the path parameter).

Takeaways

A flat job pool has no defense against one kind's load shape monopolizing it; isolation has to be structural — per-queue worker budgets — not a hoped-for fair share. When you build a routing table, make the unmapped case an error, not a default: a silent fallback turns a missing mapping into stuck jobs that no test will catch but every guard will. And keep observability gates scoped to what actually needs the dependency. Metrics that run in-process should not wait on a trace collector; one boolean controlling both pipelines is a coupling that costs you exactly the visibility you need to operate the rest.