Twitter

In-House

Duck Duck Goose

Overview

Few experimentation platforms have left as visible a fingerprint on community methodology as Duck Duck Goose. Built in 2010 as Twitter was scaling past hundreds of millions of users, DDG was designed around a core tension: the platform needed real-time visibility into whether an experiment was broken right now, but final decisions had to rest on deep offline analysis that could handle terabytes of interaction logs, social-graph mutations, and client events without cutting statistical corners. The result was a system that split those concerns explicitly—streaming health checks on one side, a multi-stage batch pipeline on the other—and a research program that published unusually frank accounts of what goes wrong at scale.

That research program is what most practitioners encounter first. Twitter's 2015–2016 blog series tackled problems that many teams silently struggle with: bucket imbalance caused by experiments influencing their own trigger rates, and the surprisingly common anti-pattern of creating two control groups "for safety" that actually inflates false negatives. The bucket-imbalance work showed that comparing unique bucketed users is substantially more effective than examining total triggers or total visits—focusing on the fundamental unit of randomization rather than derived quantities that correlate with the treatment itself. The multiple-controls analysis demonstrated that pooling two control buckets into one larger control dominates both the "use one as a check" and "pick whichever looks more representative" strategies. These analyses didn't just flag the issues—they provided concrete detection heuristics, specific test procedures, and statistical arguments that have been absorbed into how the broader community thinks about experiment health checks and design.
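The pooled-control argument comes down to simple arithmetic on standard errors. A minimal sketch, with illustrative numbers rather than Twitter's actual code: for a fixed amount of control traffic, merging both control buckets halves the variance contribution of the control side of the comparison.

```python
import math

def diff_se(var: float, n_treat: int, n_control: int) -> float:
    """Standard error of the difference in means between two buckets,
    assuming the same per-user metric variance `var` in each bucket."""
    return math.sqrt(var / n_treat + var / n_control)

var = 1.0          # per-user metric variance (illustrative)
n = 100_000        # users per bucket

se_single = diff_se(var, n, n)       # compare treatment against one control bucket
se_pooled = diff_se(var, n, 2 * n)   # pool both control buckets into one

# Pooling shrinks the standard error, i.e. raises power for the same traffic.
print(f"single control SE: {se_single:.5f}")
print(f"pooled control SE: {se_pooled:.5f}")
```

The same arithmetic explains why the "use one control as a check" strategy loses: it spends half the control traffic on a comparison that cannot detect a treatment effect.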

The organizational bet was that a single platform, tightly integrated with Twitter's event logging and data warehouse, could make experimentation the default rather than the exception for product teams. The platform balanced three competing priorities—flexibility of available metrics, predictability of analysis, and ease of interpretation—while processing data encompassing tweets, social graph changes, server logs, and detailed user interactions across web and mobile clients. Feature switches delayed treatment assignment until the moment a user actually encountered the relevant surface, keeping non-exposed users out of analysis and preserving statistical power. Metric groups were versioned, owned by the teams that defined them, and queryable by anyone—an early attempt at metric governance that anticipated problems many organizations only discovered later. DDG also shipped tools for power analysis that let an experimenter specify a similar past experiment and an expected lift; the system would load historical statistics for all metrics from that reference experiment and recommend traffic allocation, sidestepping the problem that experimenters rarely know metric variance a priori.
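The variance-from-history trick can be sketched with the standard two-sample sample-size formula. Everything below is a hedged illustration: the function name, the hardcoded critical values (two-sided alpha = 0.05, 80% power), and the example numbers are assumptions, not DDG's actual implementation.

```python
import math

# Normal-approximation critical values, hardcoded for
# two-sided alpha = 0.05 (z = 1.96) and 80% power (z = 0.84).
Z_ALPHA = 1.96
Z_BETA = 0.84

def users_per_bucket(hist_variance: float, expected_lift: float) -> int:
    """Users needed in each bucket to detect `expected_lift` (an absolute
    change in the metric mean), with variance taken from a similar past
    experiment instead of asking the experimenter to guess it."""
    n = 2 * (Z_ALPHA + Z_BETA) ** 2 * hist_variance / expected_lift ** 2
    return math.ceil(n)

# Illustrative numbers: historical per-user variance 4.0, hoped-for lift 0.05.
print(users_per_bucket(4.0, 0.05))
```

Given a required n per bucket and the reference experiment's daily traffic, recommending an allocation is then a simple division.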

Architecture & Approach

DDG's defining architectural choice is its three-stage Scalding pipeline running on Hadoop, designed to progressively reduce data volume while increasing analytical richness. The first stage aggregates raw client events and server logs into per-user, per-hour metric values—a general-purpose dataset that feeds experimentation analysis but also serves top-level metric calculations and ad-hoc cohort analysis. The second stage joins those hourly rollups with A/B test impression logs, computing per-user metric aggregates scoped to each experiment's runtime. Because a user's entry time and status (new, casual, or frequent user) are recorded at impression time, this stage directly supports heterogeneous treatment effect analysis and measurement of attribute changes during the experiment without requiring analysts to reprocess raw events. The third stage rolls everything into summary statistics—effect estimates, confidence intervals, segment breakdowns—and loads them into Manhattan, Twitter's key-value data store, where internal dashboards serve results to product teams. A great deal of engineering effort goes into keeping this pipeline efficient, including automated full-scale testing and continuous improvements to profiling and monitoring within Hadoop; even small percentage-point gains in pipeline efficiency translate directly to faster time-to-results.
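The stage-1 rollup can be pictured as a keyed aggregation. The sketch below uses plain Python for clarity; the real job is a Scalding pipeline on Hadoop over raw logs, and the event schema shown here is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

def hourly_rollup(events):
    """Collapse raw events into (user_id, hour, metric) -> value aggregates,
    the per-user, per-hour dataset that later stages join against.
    Illustrative only; event fields are an assumed schema."""
    out = defaultdict(float)
    for e in events:  # e: {"user": ..., "ts": unix seconds, "metric": ..., "value": ...}
        hour = datetime.fromtimestamp(e["ts"], tz=timezone.utc).strftime("%Y-%m-%dT%H")
        out[(e["user"], hour, e["metric"])] += e["value"]
    return dict(out)

events = [
    {"user": "u1", "ts": 0,    "metric": "tweets", "value": 1},
    {"user": "u1", "ts": 1800, "metric": "tweets", "value": 1},  # same hour
    {"user": "u1", "ts": 3700, "metric": "tweets", "value": 1},  # next hour
]
print(hourly_rollup(events))
```

Keying on (user, hour, metric) is what lets one dataset serve experiment analysis, top-level metrics, and ad-hoc cohort work at once.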

Sitting in front of the batch pipeline is a real-time layer: lightweight statistics computed by TSAR, a streaming job on Heron (Twitter's post-Storm stream processor), ingesting from a central event ingest service. TSAR doesn't attempt deep analysis; it provides early warning—anomalous traffic splits, unexpected metric movements, broken logging—so that teams can kill a misconfigured experiment in minutes rather than waiting for the next batch run. This explicit separation of health monitoring (streaming, low-latency, approximate) from decision-quality analysis (batch, high-fidelity, authoritative) avoids the common trap of trying to do both in a single system and doing neither well.
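The kind of approximate early warning the streaming layer provides can be illustrated with a one-sample proportion check on the live traffic split. This is a hedged sketch with an invented function name and alert threshold, not TSAR's actual logic.

```python
import math

def split_alert(n_control: int, n_treatment: int,
                expected_frac: float = 0.5, z_threshold: float = 4.0) -> bool:
    """Cheap streaming-style health check: flag the experiment when the
    observed treatment share drifts further from `expected_frac` than
    chance plausibly allows. Threshold is deliberately conservative,
    since this fires alerts rather than making decisions."""
    n = n_control + n_treatment
    if n == 0:
        return False
    observed = n_treatment / n
    se = math.sqrt(expected_frac * (1 - expected_frac) / n)
    return abs(observed - expected_frac) / se > z_threshold

print(split_alert(50_312, 49_705))  # roughly 50/50: no alert
print(split_alert(60_000, 40_000))  # badly skewed: alert
```

A check this cheap can run continuously on counts alone, which is exactly what a low-latency health layer needs; the authoritative verdict still comes from the batch pipeline.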

Assignment uses a delayed feature-switch model: an engineer creates an experiment via a web UI, receives code snippets to integrate into production, and the treatment decision is deferred until a user hits the relevant code path. Only then is an "ab test impression" event logged, which becomes the fundamental unit linking users to experiments in downstream analysis. Users who never encounter the experiment surface are never bucketed, which concentrates statistical power on the population where the treatment could actually have an effect. Bucket health is checked automatically via multinomial goodness-of-fit tests across all buckets—where the chi-square statistic captures how much each bucket deviates from expected allocation, following a chi-square distribution with k−1 degrees of freedom—and per-bucket binomial tests when the overall check fails, using only first-time bucketed users to avoid confounding from the experiment itself altering trigger frequency. Twitter reported that building this check into the toolchain saved many hours of investigation and analysis. Results dashboards use a color-coding system: statistically significant positive changes appear green, significant negatives red, and likely-underpowered metrics yellow, with color intensity reflecting the p-value or MDE magnitude—transforming abstract statistical concepts into visual signals that non-specialist practitioners can act on quickly.
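The multinomial goodness-of-fit check described above can be sketched directly. The critical values below are standard chi-square 95th percentiles hardcoded for small bucket counts; a production system would compute exact p-values, and the function name and example counts are assumptions.

```python
CHI2_CRIT_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}  # df -> 95% critical value

def bucket_imbalance(first_bucketed_counts, expected_fracs):
    """Multinomial goodness-of-fit over first-time bucketed users only,
    so the check is not confounded by the experiment altering how often
    users return. Returns (chi-square statistic, imbalance detected?)."""
    n = sum(first_bucketed_counts)
    stat = sum((obs - n * f) ** 2 / (n * f)
               for obs, f in zip(first_bucketed_counts, expected_fracs))
    df = len(first_bucketed_counts) - 1  # k buckets -> k - 1 degrees of freedom
    return stat, stat > CHI2_CRIT_95[df]

# Three buckets meant to receive equal traffic:
stat, alarm = bucket_imbalance([33_400, 33_205, 33_395], [1/3, 1/3, 1/3])
print(round(stat, 2), alarm)
```

When the overall check fires, per-bucket binomial tests (as the text describes) localize which bucket is off.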

What Makes It Notable

DDG's most lasting contribution is methodological rather than architectural. The bucket-imbalance detection work formalized a principle many teams learn the hard way: post-exposure impression counts are unreliable for checking randomization health because the treatment can change how often users return. Testing balance on first-bucketing events only is now standard advice, but Twitter published the reasoning and the specific test procedures in 2015, when the practice was far from universal. Similarly, the multiple-controls analysis gave the community a citable reference for a design mistake that persists in organizations without strong statistical review. Beyond these published analyses, Twitter's internal experience catalogued three further categories of pitfalls, each requiring distinct analytical countermeasures: dilution (users assigned to treatment who never actually receive it), carryover effects (treatment exposure influencing behavior after an experiment ends), and novelty impacts (transient user responses that fade with habituation). The platform also proved itself on high-stakes product decisions. When Twitter tested expanding the 140-character tweet limit, it ran DDG experiments across multiple countries, organized by how much each language suffered from character cramming. Historically, 9% of English tweets had hit the 140-character limit; under the expanded cap that figure dropped to 1%, and during the test only 5% of tweets exceeded 140 characters and just 2% exceeded 190. The results confirmed that brevity would survive the change, while users reported greater satisfaction with their ability to express themselves.

For practitioners studying platform design, DDG's three-stage pipeline is a clean illustration of how to layer aggregation so that each stage serves distinct consumers: Stage 1 feeds general analytics, Stage 2 enables flexible experiment deep-dives and methodology research, and Stage 3 powers standardized dashboards for non-specialist decision-makers. The metric governance model—versioned metric groups with clear team ownership—addressed metric sprawl and definition drift before "metric stores" became a recognized infrastructure category. The power-analysis tooling, which bootstraps variance estimates from past experiments rather than asking experimenters to guess, tackled one of the most common friction points in experiment planning. What teams can take away is less the specific technology choices (Scalding, Hadoop, Manhattan are artifacts of their era) and more the structural decisions: separate streaming health from batch analysis, delay assignment to preserve power, automate statistical sanity checks rather than relying on manual review, version your metric definitions like you version your code, and invest in visual communication of statistical uncertainty so that results dashboards actually change behavior.

People

Gary Lam

Technical Lead, Machine Learning Platform

Key Facts

Methodology
Frequentist, CUPED, switchback
Platform Type
Server-side, client-side, full-stack, mobile
Year Started

~2010

Tech Stack
Scalding, Hadoop, Heron, Manhattan
#bucket-imbalance-detection #metric-governance #heterogeneous-treatment-effects #multi-stage-pipeline #experimentation-culture #delayed-assignment

Last updated: 2026-03-28

Related Platforms