Booking.com

In-House

Experiment Tool

Overview

Few companies have run controlled experiments as long or as intensely as Booking.com. The Experiment Tool — the internal platform behind all of it — has been in production since roughly 2005, and today orchestrates over 1,000 concurrent experiments across 75 countries and 43 languages. The 2017 arXiv paper by Kaufman, Pitchforth, and Vermeer — Democratizing online controlled experiments at Booking.com — laid out the organizational and technical architecture that made this possible: a self-serve model where roughly 80% of product and technology teams launch experiments independently, resolving disagreement through evidence rather than seniority. With over 1.5 million room nights reserved per day, even fraction-of-a-percent conversion improvements carry material revenue implications, which means the statistical machinery underneath has to be trustworthy at a level most organizations never need to reach.

The platform exists because Booking.com operates a two-sided travel marketplace where interventions ripple in non-obvious ways. A change to search ranking doesn't just affect guest conversion — it reshapes which properties get visibility, which alters supplier behavior, which feeds back into inventory and pricing. Optimizing one metric in isolation (say, bookings) can quietly degrade others (cancellations, customer service contacts, partner churn). The Experiment Tool encodes this complexity through an Overall Evaluation Criteria (OEC) framework paired with mandatory guardrail metrics, forcing teams to optimize for durable value rather than proximate clicks. Research from Evercore Group found that Booking.com's testing drives conversions at two to three times the industry average — a compounding advantage that accumulates over years of disciplined iteration.

What makes the volume trustworthy rather than reckless is the machinery underneath: consistent assignment via stable hashing across devices and sessions, layered namespacing that isolates concurrent tests within product surfaces while permitting orthogonal stacking across them, automated power calculations that run after an initial data-collection burn-in period, and continuous sample ratio mismatch (SRM) detection that catches broken randomization before it corrupts results. Harvard Business School professor Stefan Thomke estimated annual volume at twenty to thirty thousand experiments — implying each product team runs hundreds per year, creating learning cycles far faster than organizations operating on quarterly planning horizons.
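
The assignment and SRM machinery described above can be sketched in a few lines. This is an illustrative reconstruction, not Booking.com's actual implementation; the function names, the 50/50 split, and the SRM alarm threshold are assumptions:

```python
import hashlib
import math

def assign(user_id: str, experiment: str, variants=("control", "treatment")):
    """Deterministic bucketing: hash (experiment, user) so the same user
    always lands in the same variant regardless of device or session.
    Salting the hash with the experiment name keeps concurrent
    experiments statistically orthogonal to one another."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 1000
    return variants[0] if bucket < 500 else variants[1]

def srm_pvalue(n_control: int, n_treatment: int, expected_ratio: float = 0.5) -> float:
    """Chi-square test (1 degree of freedom) for sample ratio mismatch
    against the configured split; a tiny p-value means randomization broke."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    return math.erfc(math.sqrt(chi2 / 2))  # survival function of chi2 with 1 df

# Assignment is stable across calls:
assert assign("user-42", "ranker-v2") == assign("user-42", "ranker-v2")
# A 10,000-unit experiment that lands 5,300 / 4,700 trips the SRM alarm:
print(srm_pvalue(5300, 4700) < 0.001)  # → True
```

The salt-per-experiment detail is what makes layered namespacing work: within a layer, experiments split the same traffic exclusively; across layers, independent salts make assignments uncorrelated, so tests stack orthogonally.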

Architecture & Approach

Experiments flow through a self-serve model with tiered governance. Most teams design, configure, and launch tests independently using guided templates that enforce explicit hypothesis statements, minimum detectable effects, ramp plans, and guardrail definitions before anything goes live. High-risk domains — pricing, payments, search ranking, compliance-sensitive surfaces — require a lightweight pre-launch review. The platform performs automatic power calculations using historical variance patterns and continuously monitors SRM and guardrail thresholds, pausing experiments that trip safety constraints without waiting for human intervention. Booking.com has open-sourced a Power Calculator (a Vue.js component) that helps experimenters determine appropriate durations given baseline conversion rates, minimum detectable effects, and significance levels. Company guidelines also recommend running experiments for full-week cycles to account for day-of-week variability in user behavior.
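
The core of such a power calculation is the standard normal-approximation sample-size formula for comparing two proportions, rounded up to full-week cycles per the guideline above. This is a generic textbook sketch, not the open-sourced component's actual code; the function names and the traffic figure in the usage line are illustrative:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline: float, mde_rel: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-variant sample size for a two-proportion z-test.
    baseline: control conversion rate; mde_rel: relative minimum detectable effect."""
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    delta = p2 - p1
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / delta ** 2
    return math.ceil(n)

def duration_weeks(n_per_variant: int, daily_traffic_per_variant: int) -> int:
    """Round the run length up to whole weeks to absorb day-of-week effects."""
    days = math.ceil(n_per_variant / daily_traffic_per_variant)
    return math.ceil(days / 7)

# A 2% relative lift on a 3% baseline needs roughly 1.3 million users per arm;
# at an assumed 100k eligible users per variant per day, that rounds to 2 weeks.
n = sample_size_per_variant(0.03, 0.02)
print(n, duration_weeks(n, 100_000))
```

The steep cost of small minimum detectable effects is why automated power checks matter at this scale: halving the detectable lift quadruples the required sample.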

The statistical backbone reflects nearly two decades of compounding investment. The platform supports frequentist sequential testing with alpha-spending functions and the Mixture Sequential Probability Ratio Test, enabling continuous monitoring and early stopping without inflating false positives — experimenters need not specify a maximum sample size in advance. CUPED variance reduction, documented in a January 2018 post by data scientist Simon Jackson, uses pre-experiment covariates to shrink confidence intervals; the technique, originally developed by Microsoft's Experiment Platform team (Deng, Xu, Kohavi, & Walker, 2013), became an industry standard partly because Booking.com's public documentation provided practical guidance on covariate selection and pipeline integration that teams at Netflix, Meta, Airbnb, and DoorDash subsequently adopted. For ranking and recommendation systems, interleaving merges results from competing algorithms at the position level within a single search page — at QCon London 2026, Jabez Eliezer Manuel described how roughly 50% of ranking experiments now use interleaving rather than standard user-level randomization, achieving approximately 50x the sensitivity of conventional designs and up to 100x when combined with counterfactual evaluation. For marketplace-specific questions where customer-level randomization is insufficient, the team developed partial blockout experiments, a two-dimensional design that removes a feature from random partner–customer combinations, revealing effects on both sides simultaneously.
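
The CUPED adjustment itself is only a few lines. Below is a sketch of the standard formulation from Deng et al. (2013), with simulated data standing in for real pre-experiment covariates; nothing here is taken from Booking.com's pipeline:

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def cuped_adjust(y, x):
    """CUPED: remove the variance in metric y that is explained by the
    pre-experiment covariate x. theta = cov(x, y) / var(x); the adjusted
    metric keeps the same mean but yields tighter confidence intervals."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    var = sum((xi - mx) ** 2 for xi in x) / (len(x) - 1)
    theta = cov / var
    return [yi - theta * (xi - mx) for yi, xi in zip(y, x)]

# Simulated users whose pre-period behavior strongly predicts the metric:
random.seed(0)
pre = [random.gauss(100, 20) for _ in range(5000)]
post = [0.8 * p + random.gauss(10, 5) for p in pre]
adj = cuped_adjust(post, pre)

var_post = sum((v - mean(post)) ** 2 for v in post) / (len(post) - 1)
var_adj = sum((v - mean(adj)) ** 2 for v in adj) / (len(adj) - 1)
assert abs(mean(adj) - mean(post)) < 1e-6  # the mean is unchanged
assert var_adj < 0.2 * var_post            # variance drops sharply
```

With a covariate this predictive, the adjusted variance is a small fraction of the raw variance — exactly the confidence-interval shrinkage that makes covariate selection (typically the same metric measured pre-experiment) the practical crux of CUPED.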

Beyond randomized tests, a dedicated non-randomized experiment tool runs hundreds of observational studies annually using synthetic control, difference-in-differences, and Bayesian structural time-series methods — essential for geographic campaigns or feature rollouts where full randomization is infeasible. That tool enforces pre-registration, runs placebo tests on historical data to validate modeling assumptions, monitors for contamination in control regions, and automatically detects when control markets are inadvertently used as treatment markets in other experiments.

Mediation analysis built into the platform decomposes treatment effects into direct and indirect pathways using two-stage models under sequential ignorability, letting teams understand why an intervention works. A causal forest implementation supports heterogeneous treatment effect estimation, and recent published work on sensitivity analysis for causal machine learning provides quantitative frameworks for assessing how robust conclusions remain when unobserved confounders may exist. The platform has also addressed the problem of bias in point estimates used for impact quantification: when the same dataset is used both to decide whether to ship a change and to estimate its magnitude, selection effects produce systematic upward bias; Booking.com researchers proposed using lower bounds of confidence intervals rather than point estimates when data must serve dual purposes.

Over 480 ML models generating 400 billion predictions daily are evaluated through an integrated pipeline, and the 2019 KDD paper 150 Successful Machine Learning Models delivered the uncomfortable finding that offline model performance can be entirely uncorrelated with business outcomes — reinforcing the necessity of online evaluation for ML systems.
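
The dual-use selection bias noted above — using the same data to gate the ship decision and to quantify impact — is easy to demonstrate in simulation. The thresholds and effect sizes below are illustrative inventions, not figures from the published work:

```python
import random
from statistics import NormalDist

random.seed(1)
se = 0.004       # assumed standard error of each experiment's lift estimate
z_ship = 1.96    # ship only "statistically significant" winners
z_lower = NormalDist().inv_cdf(0.95)  # one-sided 95% lower confidence bound

# 20,000 experiments whose TRUE lift is exactly zero; ship the apparent winners.
shipped = [obs for obs in (random.gauss(0.0, se) for _ in range(20000))
           if obs / se > z_ship]

naive = sum(shipped) / len(shipped)                         # point estimates
conservative = sum(o - z_lower * se for o in shipped) / len(shipped)

print(naive > 0.008)         # → True: phantom lift despite zero true effect
print(conservative < naive)  # → True: lower bounds claim far less impact
```

Summing the selected point estimates reports a large aggregate lift even though every experiment was null — the winner's curse in miniature — while crediting only the lower confidence bound sharply discounts that phantom impact.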

A central experiment registry records every test's hypothesis, owner, metrics plan, expected effect size, guardrails, outcome, and ship decision — searchable across the full history back to 2005. The registry prevents duplicate work, enables meta-analysis across thousands of accumulated experiments, and preserves institutional memory independent of team turnover. It also supports what the team calls meta-experiments: experiments on the experimentation process itself, testing whether changes to platform defaults, onboarding flows, or governance policies improve the quality and velocity of testing across the organization.
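
A registry record of this kind can be sketched as a simple typed structure. The field names mirror what the text says is recorded, but the schema, enum values, and the example experiment are hypothetical:

```python
from dataclasses import dataclass, field
from enum import Enum

class ShipDecision(Enum):
    SHIPPED = "shipped"
    ABANDONED = "abandoned"
    ITERATING = "iterating"

@dataclass
class ExperimentRecord:
    """One searchable entry in a central experiment registry; keeping
    hypothesis, metrics plan, and outcome together preserves institutional
    memory independent of team turnover."""
    experiment_id: str
    hypothesis: str
    owner: str
    metrics_plan: list[str]
    expected_effect: float          # minimum detectable effect, relative
    guardrails: list[str]
    outcome: str = ""
    decision: ShipDecision = ShipDecision.ITERATING
    tags: list[str] = field(default_factory=list)

rec = ExperimentRecord(
    experiment_id="exp-2026-0142",  # hypothetical example
    hypothesis="Showing free-cancellation badges earlier lifts conversion",
    owner="search-team",
    metrics_plan=["booking_conversion"],
    expected_effect=0.005,
    guardrails=["cancellation_rate", "customer_service_contacts"],
)
```

Requiring the hypothesis and guardrails at record-creation time is what makes later meta-analysis possible: the registry captures intent before the outcome is known, not a rationalization after it.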

What Makes It Notable

The lasting contribution is cultural as much as technical. Booking.com demonstrated — and documented publicly — that experimentation could become an organizational default rather than a specialist function, and that doing so required investing equally in tooling, statistical rigor, and education. The principle that customer sentiment should drive product development, not the highest-paid person's opinion, is enforced through organizational habits: onboarding curricula, internal case studies of confident opinions overturned by data, a network of embedded Experiment Ambassadors within product teams, and a ranking of evidence sources that places one's own experiment data above all forms of opinion — including other people's experiment data, which provides a false sense of certainty when stripped of its original context. The guidelines-over-rules approach (experiment on everything, experiment atomically, ensure proper power, run full-week cycles) scales quality without creating bottlenecks. Booking.com hosted the first Experimentation Conference at its Amsterdam headquarters in May 2024, bringing together ninety practitioners from twenty-four companies, and followed with a second conference in 2025 — establishing the company as a convening force for the practitioner community.

The specific methodological contributions have shaped practice well beyond the company. The CUPED implementation documentation helped establish variance reduction as an industry expectation. The sequential testing work showed group sequential tests (GST) achieving near-fixed-horizon power while permitting early stopping. The partial blockout design for two-sided marketplaces solved a problem other platforms hadn't publicly addressed. The interleaving work demonstrated sensitivity gains large enough to fundamentally change how ranking experiments are designed. And one widely cited finding — that displaying sold-out properties alongside available options increased overall bookings by heightening scarcity signals — illustrates how experimentation uncovers mechanisms that expert judgment alone would miss. Engineers from the Experiment Tool team, including former Experimentation Lead Jonas Alves, went on to found ABsmartly, explicitly aiming to make methodologies pioneered internally at Booking.com available to smaller organizations. For practitioners evaluating what a mature experimentation program looks like after two decades of compounding investment, this remains the most thoroughly documented reference point available.

People

Lukas Vermeer

Director of Experimentation

Christina Katsimerou

Data Scientist (Causal Inference & Non-Randomized Experiments)

Edgar Cano

Experimentation Platform Lead

Daisy Duursma

Experimentation Researcher

Nils Skotara

Experimentation Researcher

Key Facts

Methodology
frequentist, sequential, bandit, interleaving, cuped, switchback, synthetic-control, causal-forest
Platform Type
server-side, full-stack, mobile, ml-models, marketplace
Scale

1,000+ concurrent experiments; tens of thousands per year

Year Started

~2005

#variance-reduction #marketplace-experiments #democratization #ml-experimentation #interleaving #two-sided-marketplace

Last updated: 2026-03-28