We sample to survive
Our analytics pipeline ingests events and histograms from every install. At scale, keeping 100% of everything is neither cheap nor necessary, so we sample — per install, so a given device is consistently in or out of the sampled set rather than flipping a coin per request.
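One common way to get that "consistently in or out" behaviour is to hash the install ID and compare the result against the rate. The sketch below is an illustration under assumptions, not our production code: the function name `in_sample`, the `salt` parameter, and the choice of SHA-256 are all hypothetical.

```python
import hashlib

def in_sample(install_id: str, rate: float, salt: str = "sampling-v1") -> bool:
    """Return True if this install belongs to the sampled set.

    The decision depends only on (salt, install_id), so the same device
    gets the same answer on every request.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{install_id}".encode("utf-8")).digest()
    # Treat the first 8 bytes of the hash as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Compare in integer space so rate=0.0 keeps nothing and rate=1.0 keeps everything.
    return bucket < int(rate * 2**64)

# Same install, same answer, no matter how many times we ask.
print(in_sample("install-123", 0.10) == in_sample("install-123", 0.10))
```

A side effect of comparing a fixed hash against a threshold is that raising the rate only adds installs to the sample; nobody who was in gets dropped.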
Sampling is one of those decisions that feels innocent until a number on a dashboard is quietly wrong.
The trap: a total that read low
A dashboard showed "total requests" by reading the count field off a histogram. With sampling on, that number came in low — roughly by the sampling rate. The fix looked obvious: divide by the rate to "scale it back up."
Whether that scaling helps depends on the kind of metric: some never need it, and for others it produces a number you still cannot trust. The difference is the whole point.
Ratios are free; totals are not
Split every metric into two families:
- Ratios and shape — error rate, mean latency, P50/P99, cache hit ratio. These are relative quantities.
- Absolute totals — total requests, total revenue, number of installs that did X. These are extensive quantities.
Under sampling, the two families behave in completely different ways.
Why P99 and error rate survive sampling
A percentile or a rate is a property of the distribution, not of how many samples you drew from it. Sample uniformly and the sampled distribution has (in expectation) the same shape as the full one: P99 of the sample ≈ P99 of the population. An error rate — errors over total, both scaled by the same factor — has the sampling rate cancel out entirely.
The beautiful part: you don't even need to know the sampling rate to read these correctly. They come out unbiased, for free.
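A small simulation makes the cancellation concrete. Everything here is made up for illustration: the population size, the 2% error rate, the 10% sampling rate, and the seed are assumptions, and each install is modelled as a `(requests, errors)` pair.

```python
import random

def error_rate(installs):
    """Errors divided by requests, pooled across the given installs."""
    errors = sum(e for _, e in installs)
    requests = sum(n for n, _ in installs)
    return errors / requests

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population: 20,000 installs, 20 requests each, 2% true error rate.
population = []
for _ in range(20_000):
    requests = 20
    errors = sum(rng.random() < 0.02 for _ in range(requests))
    population.append((requests, errors))

# Per-install sampling: keep or drop each install's whole stream.
rate = 0.10
sample = [install for install in population if rng.random() < rate]

full = error_rate(population)
sampled = error_rate(sample)

# Note that `rate` never appears in the error-rate calculation itself.
print(f"full={full:.4f} sampled={sampled:.4f}")
```

The two printed values should land close together, even though the sample holds roughly a tenth of the installs. The same reasoning applies to percentiles computed over the sampled events.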
Why the total stays wrong — even with a lot of data
To recover an absolute total you do need the rate: take the sample's count and multiply by 1 / rate, which is the same "divide by the rate" step. That's an estimator, and its error isn't governed by how many events you collected. It's governed by how many installs you sampled, and how evenly volume spreads across them.
Per-install sampling is cluster sampling: you keep or drop a whole device's stream at once. If business volume is heavy-tailed, with a handful of power users generating most of the traffic, then whether your estimate is right depends almost entirely on how many of those few whales landed in the sample. More of them than expected and you over-count; fewer and you under-count. Collecting more events per install doesn't help; only more installs narrows the error. The scaled-up estimate is unbiased on average, but under a heavy tail its variance stays large even at a large install count, so any single reading can be far from the truth.
So "divide by the rate" yields a number that looks precise and is quietly off — sometimes by a lot.
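To see how far off a single reading can be, here is a rough simulation of the scale-up estimator on a heavy-tailed population. The numbers are invented for illustration (2,000 installs, 10 of them whales, a 10% rate, 200 repeated draws), and `estimate_total` is a hypothetical helper, not part of our pipeline.

```python
import random
import statistics

rng = random.Random(11)  # fixed seed so the run is repeatable

# Hypothetical population: 1,990 ordinary installs and 10 whales.
volumes = [10] * 1_990 + [100_000] * 10
true_total = sum(volumes)  # 19,900 + 1,000,000 = 1,019,900

rate = 0.10

def estimate_total(volumes, rate, rng):
    """Keep each install with probability `rate`, then scale the kept volume by 1/rate."""
    kept = sum(v for v in volumes if rng.random() < rate)
    return kept / rate

# Redraw the sample many times to see the range of answers one dashboard could show.
estimates = [estimate_total(volumes, rate, rng) for _ in range(200)]
mean_est = statistics.mean(estimates)
spread = statistics.pstdev(estimates)

print(f"true total      : {true_total:,}")
print(f"mean of estimates: {mean_est:,.0f}")
print(f"std of estimates : {spread:,.0f}")
print(f"min / max        : {min(estimates):,.0f} / {max(estimates):,.0f}")
```

Averaged over many draws the estimate sits near the true total, but a real pipeline only ever gets one draw, and the spread between draws is on the order of the total itself. In this setup each whale that lands in the sample adds a million to the estimate, so the answer is mostly a count of sampled whales.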
The fix: count for totals, sample for shape
The rule we settled on:
- Business totals go through always-on, full-fidelity events. They are never reconstructed from a sampled histogram.
- Sampled histograms are for ratios and shape — percentiles, rates, distributions — where sampling is unbiased and cheap.
In our dbt models this is two deliberately different columns: sample_count (the size of the sample, for variance/confidence) versus request_count (the real business volume, from full events). They are not interchangeable — conflating them is exactly the bug we hit. A cross-repo contract lint now asserts the two stay separate and sourced from the right place, so the semantics are enforced, not remembered.
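The idea behind that lint can be sketched in a few lines. This is not the actual lint: the source names `sampled_histograms` and `full_events`, the `ALLOWED_SOURCES` table, and the `check_columns` function are all hypothetical, and the sketch assumes each model can be described as a plain column-to-source mapping.

```python
# Hypothetical contract: the only upstream source each column may come from.
ALLOWED_SOURCES = {
    "sample_count": "sampled_histograms",
    "request_count": "full_events",
}

def check_columns(model_columns: dict[str, str]) -> list[str]:
    """Return a list of violations for a model's {column: source} mapping.

    Columns that are not part of the contract are ignored.
    """
    violations = []
    for column, expected in ALLOWED_SOURCES.items():
        actual = model_columns.get(column)
        if actual is not None and actual != expected:
            violations.append(f"{column} must come from {expected}, not {actual}")
    return violations

# The bug we hit: a business total sourced from a sampled histogram.
print(check_columns({"request_count": "sampled_histograms"}))
```

Running a check like this in CI turns the naming convention into a failing build instead of a code-review comment.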
Takeaway
Sampling is free for ratios and percentiles and a trap for totals. Before you divide by the sampling rate, ask whether the quantity is relative or extensive. If it's a total that matters to the business, don't sample it — count it — and name your columns so the next person can't lose the distinction.