
Overview
Confidence began life inside Spotify as the experimentation backbone for a product used by hundreds of millions of listeners. The platform it replaced, an A/B testing system called ABBA built in 2013, mapped experiments one-to-one with feature flags; by around 2017 its event logging had grown to almost 25% of Spotify's total event volume, creating cost and reliability problems that forced a rethink. The successor platform was decomposed into three parts: Remote Configuration (replacing the old feature-flagging service with a more general concept of configurable "properties"), a Metrics Catalog (a managed environment running SQL pipelines to ingest metrics into a data warehouse, enabling sub-second-latency queries from UIs and notebooks), and an Experiment Planner. That architecture supported Spotify's experimentation culture as it grew to more than 300 teams running over 10,000 experiments per year across mobile apps, backend systems, and everything in between.
The in-house-origin story matters because it shaped defaults and tradeoffs that a greenfield SaaS product would not arrive at on its own. Confidence was built under real constraints at Spotify's scale—latency-sensitive clients, complex ecosystems of features, and the need for both product teams and analysts to trust results. Before being offered externally, it was trusted by hundreds of teams inside Spotify. Commercialization, which moved through a private beta starting in 2023, means external buyers inherit tooling and workflows that were stress-tested on one of the world's largest streaming services, not only on slide decks. The platform is designed to scale from startups with 1,000 users to organizations serving a billion.
Spotify offers three deployment models: a managed service with minimal operational overhead, a Backstage plugin that mirrors how Spotify itself runs the platform, and direct API integration for teams requiring customization or advanced use cases such as multi-armed bandit optimization and switchback testing.
Key Features
Confidence spans feature management and A/B testing in one product. Flags gate functionality and support targeting through segments (rules built on user attributes such as geography, subscription status, or device type), with allocation percentages controlling what proportion of matched users see each variant. Randomization is bucket-based: a hash of a specified evaluation-context field (typically a user identifier) guarantees that the same user always resolves to the same variant. Server-side SDKs are built on an open-source Rust-based flag resolver that can run natively or as WebAssembly, evaluating rules locally in microseconds with no network dependency at resolution time and syncing logging asynchronously in the background. SDK coverage spans Go, Java, Python, JavaScript (Node.js and browser), Swift, Kotlin, Flutter, Rust, PHP, and Ruby, all implementing the OpenFeature specification for portability across flag platforms. The platform distinguishes between flag evaluation and flag application: an explicit "apply" event is reported only when a user actually experiences the variant, preventing sample-size inflation and preserving experiment validity.
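The deterministic, bucket-based assignment described above can be sketched in a few lines. The hash function, salt scheme, and weight handling below are illustrative assumptions, not Confidence's actual implementation:

```python
import hashlib

def assign_variant(user_id: str, flag_salt: str, variants: list[tuple[str, float]]) -> str:
    """Deterministically map a user to a variant by hashing.

    `variants` is a list of (name, weight) pairs whose weights sum to 1.0.
    The salt isolates experiments, so the same user can land in different
    buckets for different flags.
    """
    digest = hashlib.md5(f"{flag_salt}:{user_id}".encode()).hexdigest()
    # Map the first 8 hex digits to a point in [0, 1].
    point = int(digest[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for name, weight in variants:
        cumulative += weight
        if point < cumulative:
            return name
    return variants[-1][0]  # guard against floating-point edge cases

# The same user always resolves to the same variant:
v1 = assign_variant("user-123", "checkout-redesign", [("control", 0.5), ("treatment", 0.5)])
v2 = assign_variant("user-123", "checkout-redesign", [("control", 0.5), ("treatment", 0.5)])
assert v1 == v2
```

Because assignment is a pure function of the salt and the evaluation-context field, no per-user state needs to be stored or synchronized across resolvers.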
Gradual rollouts reduce blast radius while metrics accumulate, with monitoring and alerts to detect regressions and instant rollback to contain them. Experiment analysis supports multiple statistical approaches. Sequential testing enables daily checks on all metrics, so peeking and early stopping are handled with appropriate methods rather than fixed-end-date assumptions. Spotify's engineers have published work on fixed-power designs, in which an experiment starts without a pre-specified sample size; the required sample size is instead continuously re-estimated from observed outcome data, and the experiment stops when the current sample exceeds that estimate. Variance reduction is enabled by default through a full regression-adjustment estimator (advancing beyond the original CUPED approach), which fits separate regressions of the outcome on pre-treatment covariates for the treatment and control groups to achieve tighter confidence intervals. For ratio metrics (click-through rate, average order value), the adjustment is applied independently to each of the four terms in the ratio expression. A sample-size calculator draws on at least fourteen days of historical metric data from prior experiments or flag assignments to estimate required durations, with conservative defaults that reflect hard-won lessons about underpowered tests.
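As a rough illustration of how regression adjustment tightens estimates, the sketch below fits a separate slope per group on a pre-experiment covariate. It is a simplified stand-in for the full estimator described above; the function name and numerical details are hypothetical:

```python
import numpy as np

def regression_adjusted_means(y_t, x_t, y_c, x_c):
    """Variance-reduced group means via per-group linear regression adjustment.

    y_*: observed outcomes; x_*: pre-experiment covariates for the treatment
    (t) and control (c) groups. Each group gets its own slope, and adjusted
    outcomes are centred on the pooled covariate mean.
    """
    x_bar = np.mean(np.concatenate([x_t, x_c]))  # pooled pre-period mean

    def adjust(y, x):
        theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # group-specific slope
        return y - theta * (x - x_bar)

    yt_adj, yc_adj = adjust(y_t, x_t), adjust(y_c, x_c)
    return yt_adj.mean(), yc_adj.mean(), yt_adj, yc_adj

# Synthetic check: when outcomes correlate with the pre-period covariate,
# the adjusted outcomes have far lower variance than the raw ones.
rng = np.random.default_rng(0)
x_t, x_c = rng.normal(10, 2, 5000), rng.normal(10, 2, 5000)
y_t = x_t + rng.normal(0.5, 1, 5000)  # true lift of 0.5
y_c = x_c + rng.normal(0.0, 1, 5000)
mt, mc, yt_adj, _ = regression_adjusted_means(y_t, x_t, y_c, x_c)
lift = mt - mc  # close to 0.5, with a much tighter confidence interval
assert yt_adj.var() < y_t.var()
```

The tighter intervals translate directly into shorter experiments, since the required sample size scales with outcome variance.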
The metrics framework supports average metrics, ratio metrics, filtered metrics (restricting measurement to subsets like mobile-only transactions), and window-based metrics with both cumulative and end-of-window aggregation. Collaboration features include code-review-style experiment reviews, comments, approvals, and access control. The architecture is warehouse-native: data stays in the customer's own data warehouse rather than being centralized within Confidence's infrastructure, preserving existing governance policies and simplifying compliance with data-residency requirements.
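A toy sketch of how filtered, window-based ratio metrics compose over raw events (the `Event` schema and `ratio_metric` helper are hypothetical, not Confidence's API; in practice these run as SQL in the customer's warehouse):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical event records; field names are illustrative only.
@dataclass
class Event:
    user_id: str
    name: str
    value: float
    platform: str
    ts: datetime

def ratio_metric(events, numerator, denominator, platform=None, window=None, start=None):
    """Filtered, window-based ratio metric: sum(numerator) / sum(denominator).

    `platform` restricts measurement to a subset (a filtered metric);
    `window` keeps only events within `window` of `start` (a window metric).
    """
    def keep(e):
        if platform is not None and e.platform != platform:
            return False
        if window is not None and not (start <= e.ts < start + window):
            return False
        return True

    kept = [e for e in events if keep(e)]
    num = sum(e.value for e in kept if e.name == numerator)
    den = sum(e.value for e in kept if e.name == denominator)
    return num / den if den else float("nan")

t0 = datetime(2024, 1, 1)
events = [
    Event("u1", "clicks", 2, "mobile", t0 + timedelta(hours=1)),
    Event("u1", "impressions", 10, "mobile", t0 + timedelta(hours=1)),
    Event("u2", "clicks", 5, "web", t0 + timedelta(hours=2)),
    Event("u2", "impressions", 10, "web", t0 + timedelta(hours=2)),
]
ctr_mobile = ratio_metric(events, "clicks", "impressions",
                          platform="mobile", window=timedelta(days=1), start=t0)
# mobile-only CTR: 2 / 10 = 0.2
```

The same composition (base metric, optional filter, optional window) covers the average, ratio, filtered, and window-based metric types described above.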
What Makes It Notable
Few vendors can credibly say their product was the primary experimentation layer for a global consumer platform before it became a commercial offering. Confidence's position is unusual not just because of scale but because of the methodological contributions that came with it. Spotify's Experiments with Learning (EwL) framework redefines what counts as a successful experiment: rather than equating success with finding a statistically significant winner, it recognizes that a successful experiment is one that yields enough valid information to inform a product decision—whether the treatment won, lost, or showed no effect. Internal metrics show an EwL rate of approximately 64% against a win rate of approximately 12%, meaning the vast majority of valuable learning comes from experiments that do not identify winning treatments. That distinction reshapes how organizations should incentivize experimentation: rewarding valid information over victory laps.
The platform also carries a particular philosophy about how experimental evidence travels through organizations. Spotify's public material emphasizes that value can be lost at every phase of the pipeline—when ideas become hypotheses, when hypotheses are implemented, when implementations run as experiments, and when analyses inform decisions—and that simplicity often wins over complexity. The power lies in running many clean, reliable tests rather than a few intricate ones. Confidence is Spotify's attempt to package that operational philosophy, along with the statistical machinery and infrastructure engineering behind it, for teams that want vendor-managed experimentation and feature delivery without building the stack from scratch.
Key Facts
- Scale: more than 10,000 experiments per year (internal)
- Origin: ~2015
- Type: Product
- Last updated: 2026-03-28