Spotify In-House Experimentation Platform

Overview

Most experimentation platforms measure success by win rate—how often a variant beats control. Spotify's internal platform, evolved over more than a decade from an earlier system called ABBA, made a deliberate cultural break from that framing. Their Experiments with Learning (EwL) metric treats a clear negative result—a detected regression that stops a bad ship—as exactly as valuable as a positive lift. The numbers behind the reframe are concrete: across Spotify's experimentation program, only roughly 12% of experiments produce clear winners, yet teams identify valid learning from approximately 64% of their experiments. An EwL must satisfy two conditions—validity (all systems, metrics, and sample checks worked as intended) and decision-readiness (results clearly indicate whether to ship, abort, or declare "neutral but powered")—and experiments that fail these criteria are classified into specific failure modes: invalid experiments, unpowered experiments, or early aborts. That instrumented framework, tracked at the platform level, reshaped how 300+ teams across the company think about what a "successful" experiment looks like.

The technical motivation matched the cultural one. Spotify runs roughly 10,000 experiments per year across 600 million users, spanning mobile clients, backend services, and ML-driven recommendation surfaces. The original ABBA platform, built starting in 2013, mapped each experiment one-to-one to a feature flag named after the experiment—a coupling that grew unwieldy as the company scaled. A/B testing events under ABBA eventually consumed almost 25% of Spotify's total event volume. The system couldn't coordinate the volume of concurrent tests the organization needed, didn't support the statistical methods required for longitudinal user data (listeners measured repeatedly over weeks), and lacked a centralized metrics layer. Between 2018 and 2020 the team rebuilt from scratch around three modular components—Remote Configuration, a Metrics Catalog, and an Experiment Planner—each designed to solve a specific class of bottleneck.

What makes the platform worth studying isn't any single technique but the coherence between statistical methodology, infrastructure design, and organizational practice. Sequential testing adapted for longitudinal data, CUPED-style variance reduction applied to ratio metrics, interleaving for ranking evaluation, quarterly holdback cohorts for compound-effect measurement, encouragement designs with instrumental variables for noncompliance-prone features, heterogeneous treatment effect estimation for personalized interventions, and a cultural framework that values learning over winning—these are tightly coupled pieces of the same system. The commercialization of the platform as Confidence extends the same architecture to external organizations, a rare case of an in-house experimentation platform being externalized with SDKs spanning Swift, Kotlin, Java, Go, JavaScript, Python, Flutter, Rust, and PHP.

Architecture & Approach

The platform is organized into three components with distinct responsibilities.

Remote Configuration replaced ABBA's one-flag-per-experiment model with a property-based system: instead of toggling a feature on or off, teams define configurable properties (button color, number of tracks in a list, ranking algorithm variant) and assign values per experiment arm. Flag resolution happens locally in-process at 10–50 microsecond latency with zero network dependency per evaluation—a hard requirement for a real-time streaming product. The Confidence provider downloads flags and all associated rules from experiments and feature gates, updating state periodically in the background; once downloaded, all subsequent resolutions execute entirely in-process. This architecture works across cloud services, edge computing at CDNs, and client devices alike.

The Metrics Catalog serves as a single source of truth for metric definitions. Raw event data flows through SQL pipelines, is joined with experiment-group assignments, aggregated into an OLAP cube, and exposed through an API with sub-second query latency. Teams don't manage their own metric calculations—the catalog is a managed environment that eliminates the need for experimenters to understand the underlying storage infrastructure.

The Experiment Planner, surfaced through Spotify's internal developer portal Backstage, orchestrates creation, launch, and analysis. It has programmatic knowledge of available properties and their types, reducing misconfiguration, and can coordinate a single experiment across Android, iOS, and backend simultaneously. Sample size calculation queries historical data from the data warehouse (a query that may take several minutes) to determine whether available traffic under the planned allocation will deliver adequate power.

User assignment relies on a salt machine that hashes users into buckets via a tree of cryptographic salts, allowing dynamic reshuffling without stopping running experiments. Disjoint experiments targeting non-overlapping populations can use independent salt trees, maximizing concurrency. A domain system maps experiments to product surfaces—roughly corresponding to distinct surfaces like the mobile home feed or web player—with timelines showing past and planned tests to prevent hidden conflicts between teams. Each quarter, a fresh holdback cohort is carved out and excluded from all new experiments; at quarter's end, the cohort receives the combined treatment of every shipped improvement while a control group experiences the prior baseline, enabling measurement of the compound interaction effect that single-experiment analysis cannot address.
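The salt-based assignment scheme can be illustrated with a single salt level; Spotify's salt machine composes a tree of such salts, which this flat sketch only hints at. The names, bucket count, and split are assumptions for illustration.

```python
# Minimal sketch of deterministic salt-based bucketing (single salt level).
import hashlib

BUCKETS = 10_000

def assign_bucket(salt: str, user_id: str) -> int:
    """Hash (salt, user) into a stable bucket: same inputs, same bucket."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % BUCKETS

def assign_arm(salt: str, user_id: str, treatment_share: float = 0.5) -> str:
    """Map buckets to arms. Rotating the salt reshuffles users without
    disturbing experiments keyed under other salts in the tree."""
    if assign_bucket(salt, user_id) < BUCKETS * treatment_share:
        return "treatment"
    return "control"
```

Because disjoint experiments hash under independent salts, their bucket assignments are statistically independent, which is what allows the platform to maximize concurrency without cross-experiment interference in allocation.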

On the statistics side, Spotify's most distinctive contribution is solving what they call "peeking problem 2.0": the false-positive inflation that arises when running sequential tests on longitudinal data, where each user contributes multiple measurements over time. Standard sequential methods correct for peeking across users but not within a user's measurement timeline. Spotify's approach uses group sequential tests (GSTs) operating on multivariate vectors of test statistics, exploiting the known covariance structure between consecutive analyses to derive marginal critical bounds that correctly spend alpha at each interim look. Their research also revealed a counter-intuitive property of cumulative metrics: statistical power can actually decrease with more observations, because as an experiment progresses the share of early users measured during periods with large treatment effects shrinks relative to later users measured during periods with smaller effects.

Variance reduction uses a full regression adjustment estimator (Negi and Wooldridge's formulation rather than the original CUPED), applied separately to numerator and denominator for ratio metrics following Jin and Ba's method. For large-scale quantile analysis, the team developed an index-based bootstrap method leveraging properties of the Poisson bootstrap algorithm that makes difference-in-quantiles confidence intervals computationally tractable even for experiments with hundreds of millions of observations.

For recommendation and ranking evaluation, the platform supports interleaving experiments—the Home feed alone runs over 250 tests per year. Beyond standard A/B tests, the platform implements encouragement designs with instrumental variables for features where user noncompliance is expected, and heterogeneous treatment effect estimation using uplift modeling with multi-task learning to identify which user segments respond most favorably to interventions like in-app messaging.
Bandit algorithms and contextual personalization systems are deliberately kept in Spotify's separate ML stack and evaluated as treatment variants within the experimentation platform—a conscious architectural boundary where each stack does what it's optimized for: the ML infrastructure handles low-latency feature computation and recommendation generation, while the experimentation infrastructure measures outcomes accurately across teams.
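The core mechanism behind pre-exposure variance reduction can be shown on simulated data. Note the platform uses the fuller Negi–Wooldridge regression-adjustment estimator (with treatment-covariate interactions, applied per numerator and denominator for ratio metrics); this single-coefficient CUPED-style sketch conveys only the basic idea, and all data here is simulated.

```python
# CUPED-style variance reduction on simulated data: subtract the part of the
# in-experiment metric explained by pre-exposure behavior.
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
pre = rng.normal(10.0, 3.0, n)                        # pre-exposure covariate
treat = rng.integers(0, 2, n)                         # random 50/50 assignment
post = 0.8 * pre + 0.2 * treat + rng.normal(0, 1, n)  # in-experiment metric

# theta = cov(pre, post) / var(pre); the adjusted metric removes the
# variance that pre-exposure behavior explains.
theta = np.cov(pre, post)[0, 1] / np.var(pre, ddof=1)
adjusted = post - theta * (pre - pre.mean())

def diff_in_means(y, t):
    return y[t == 1].mean() - y[t == 0].mean()

print(diff_in_means(post, treat))       # noisy estimate of the true 0.2 effect
print(diff_in_means(adjusted, treat))   # same target, much lower variance
print(post.var(), adjusted.var())       # variance shrinks substantially
```

Because randomization makes the pre-exposure covariate independent of assignment, the adjustment leaves the treatment-effect estimate unbiased while tightening its confidence interval, which is what lets sequential tests reach decisions faster.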

What Makes It Notable

Spotify's public writing on experimentation is unusually deep and unusually honest about hard problems. The two-part series on sequential testing with longitudinal data doesn't just present a solution—it names a failure mode ("peeking problem 2.0") that most platforms quietly ignore, walks through the math, and explains why naïve application of standard sequential methods breaks down. The Experiments with Learning framework is similarly concrete: rather than vaguely advocating for "learning culture," the team defined a measurable proxy, instrumented it in the platform, and used it to shift organizational incentives away from win-rate vanity metrics. The scaling story is equally specific—the organization identified that with audience size essentially fixed, velocity gains had to come from running more tests per user and turning them around faster, which required decoupling feature delivery from app release cycles so teams could ship rough experimental variations without waiting for production-quality polish.

Practitioners can take away several specific ideas. The quarterly holdback mechanism for measuring compound effects is a clean solution to a problem most teams hand-wave past. The deliberate separation of ML and experimentation stacks—using A/B tests to evaluate bandits rather than embedding bandit logic in the experimentation layer—is a design decision worth considering for any team running both personalization and controlled experiments. The index-based Poisson bootstrap for quantile metrics unlocks inference tools that are normally computationally prohibitive at scale. The open-source Confidence Python library, with 286 stars and 44 releases through version 4.1.0, provides implementations of Z-tests, Welch's T-tests, chi-squared tests, a BetaBinomial Bayesian alternative, and CUPED-based variance reduction using pre-exposure data for teams that want the statistical machinery without the full platform. And the commercial Confidence product—with SDKs built on an open-source Rust resolver that can run natively or as WebAssembly, integrated directly into Backstage—represents a rare case of an in-house platform being externalized with the same property-based flag resolution, variance reduction defaults, and analysis tooling that Spotify's internal teams use at scale.

People

Johan Rydberg

Experimentation Lead

Karina Ivanova

Experimentation & Instrumentation Platform Engineer

Jo Kelly-Fenton

Engineer

Aleksandar Mitic

Engineer

Key Facts

Methodology
frequentist, Bayesian, sequential, bandit, interleaving, CUPED
Platform Type
server-side, client-side, full-stack, mobile, ML models
Scale

10,000+ experiments/year across 300+ teams and 600M users

Year Started

~2013

Tech Stack
SQL, OLAP
#variance-reduction #sequential-testing #longitudinal-data #experiments-with-learning #experiment-culture #interleaving

Last updated: 2026-03-28