Tinder

In-House

Phoenix

Overview

Running controlled experiments on a two-sided marketplace where one user's treatment directly affects another user's outcomes is a genuinely hard problem. On Tinder, showing User A a different set of recommended profiles changes not just A's behavior but the incoming attention, match rates, and experience of everyone A interacts with. Standard user-level randomization breaks down when treatment and control groups are connected through the matching graph. Phoenix, Tinder's internal experimentation platform, was built to handle this class of problem alongside the more conventional product tests that any mobile-first consumer app needs to run at volume.

The platform was developed by Tinder's Quantitative Tools team with a specific mandate: make it fast and easy to set up, manage, and conclude controlled experiments. Andrew Han, Senior PM of Experimentation, has presented Phoenix's scope in webinar formats pitched at product managers—a telling audience choice that reflects the platform's design philosophy. Phoenix covers three phases of the experiment lifecycle—ideation and design, configuration and management, and analysis and conclusion—decomposed internally into four components: Ground Control (lifecycle management from creation through rollout and termination), Assignment, the Levers System (remote configuration), and the Metrics System (analysis). These were documented in a three-part Medium series authored by Kenneth Yeh, Jinqiang Han, Juzheng Li, Connor Wybranowski, Siggi Jonsson, Keith McKnight, Uday Babbar, Limin Yao, Fan Jiang, and staff data scientist Lucile Lu. At peak periods the platform supports roughly 400 concurrent experiments, with usage doubling annually—a trajectory that reflects experimentation expanding from a specialist practice into routine product development across teams. In a separate GrowthHackers AMA, Han described the platform computing approximately 100 metrics per experiment and identified the importance of anchoring experiments to north star metrics rather than allowing teams to optimize for arbitrary intermediates.

The financial stakes justify the infrastructure investment. Concrete examples reinforce the value of defaulting to randomized testing over intuition: Tinder tested three news feed button designs through Phoenix and found the original control won by roughly 5–8% in engagement—a case where shipping the "new" design without testing would have quietly degraded the product. Experiments on the first-time user experience revealed that requiring a bio during onboarding caused user drop-off even with explanatory instructions, while a swiping tutorial improved the experience for women in developing markets. These are the kinds of findings that only surface through controlled experimentation, not observational analytics. More recently, Phoenix has provided the infrastructure for validating complex machine learning systems like Tinder's "Chemistry" AI-powered personalization layer, which was tested initially in Australia and New Zealand before expanding to the US and Canada—an example of using experimentation to gate a major algorithmic change through measured rollout rather than simultaneous deployment.

Architecture & Approach

Phoenix sits on top of Tinder's AWS-hosted infrastructure, integrating with MongoDB for persistence and Amazon ElastiCache for Redis to cache experiment assignments at the sub-hundred-millisecond latencies a mobile app demands. Tinder's custom API gateway (TAG) serves as the coordination layer, propagating assignment metadata through the request path so that every downstream service—recommendation, messaging, payments, safety—sees a consistent view of a user's experimental state without point-to-point wiring. Assignment supports both pre-auth experiments keyed on device ID and post-auth experiments keyed on user ID, and can operate client-side or server-side. Sticky assignment maintains a user's treatment group even if attributes like device or region change mid-experiment; non-sticky experiments use real-time attributes for segmentation instead.
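Sticky, deterministic bucketing of this kind is typically implemented by hashing a stable subject key together with the experiment ID. The sketch below illustrates the general pattern only; the function, variant names, and split logic are assumptions for illustration, not Phoenix's actual API.

```python
import hashlib

def assign_variant(subject_id: str, experiment_id: str,
                   variants: list[str], weights: list[float]) -> str:
    """Deterministically map a subject (device ID pre-auth, user ID
    post-auth) to a variant. Hashing subject_id together with
    experiment_id gives each experiment an independent split, and the
    same subject always lands in the same bucket -- sticky assignment
    without any stored state."""
    digest = hashlib.sha256(f"{experiment_id}:{subject_id}".encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if point < cumulative:
            return variant
    return variants[-1]

# Pre-auth experiments would key on a device ID, post-auth on a user ID.
assign_variant("device-123", "swipe-tutorial-v2", ["control", "treatment"], [0.5, 0.5])
```

Because assignment is a pure function of the key, it can be computed client-side or server-side and cached (as Phoenix does with ElastiCache) without risk of the two disagreeing.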

The Levers System provides the remote configuration layer that decouples experience delivery from app release cycles. Each Lever consists of an identifier, a type (Boolean, String, or Number), and a value. A Config Service SDK on each platform—iOS, Android, Web—processes, persists, and vends Lever values from Phoenix. A common pattern pairs a Boolean Lever for primary feature gating with additional Levers controlling parameters like copy, button color, or notification delay timing, enabling variant changes without waiting for an App Store review cycle. This abstraction means a single feature might use several Levers simultaneously, with the Boolean controlling whether the feature is active and the String and Number Levers parameterizing its behavior.
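The Lever abstraction described above can be sketched as a typed key-value store with client-side defaults. The lever names, store class, and default-fallback behavior below are illustrative assumptions, not the Config Service SDK's real interface.

```python
from dataclasses import dataclass
from typing import Union

LeverValue = Union[bool, str, float, int]

@dataclass(frozen=True)
class Lever:
    identifier: str
    value: LeverValue  # Boolean, String, or Number

class LeverStore:
    """Toy stand-in for a Config Service SDK: vends Lever values with
    caller-supplied defaults so the app degrades gracefully if the
    remote config backend is unreachable."""
    def __init__(self, levers: dict[str, LeverValue]):
        self._levers = levers

    def get(self, identifier: str, default: LeverValue) -> LeverValue:
        return self._levers.get(identifier, default)

# Illustrative lever names: a Boolean gates the feature, while String and
# Number levers parameterize copy and timing without an app release.
store = LeverStore({
    "new_feed_enabled": True,
    "feed_cta_copy": "See what's new",
    "feed_refresh_seconds": 30,
})
if store.get("new_feed_enabled", False):
    copy = store.get("feed_cta_copy", "Open feed")
    refresh = store.get("feed_refresh_seconds", 60)
```

The payoff of this shape is that shipping a new variant is a server-side value change, not a binary release gated on App Store review.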

The Metrics System uses Amazon Redshift as its primary analytical engine, with custom metrics defined as rollups via Redshift queries. An "experiment family" concept enables high concurrency by separating member traffic so users can participate in multiple experiments simultaneously without cross-contamination between unrelated tests. Metrics are processed in per-metric-batch fashion—rather than per-experiment or per-metric individually—to support throughput across hundreds of concurrent experiments. Multiple comparison correction is applied as a system invariant to guard against the false discovery inflation that naturally accompanies metric proliferation at this scale. Methodologically, Phoenix combines frequentist hypothesis testing with CUPED variance reduction and sequential testing that allows teams to monitor accumulating results without inflating false-positive rates through naive peeking. For marketplace-level experiments where user-level randomization would introduce interference bias, the platform supports switchback designs that alternate treatment and control across time windows within geographic regions, ensuring everyone in a cluster experiences the same variant at any given moment. Experimenters must specify metrics and articulate hypotheses before launch, with power analysis embedded at design time and automated computation of treatment effects and confidence intervals at analysis time.
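The CUPED adjustment mentioned above has a compact closed form: regress the in-experiment metric on a pre-experiment covariate and subtract the explained component. This is a minimal stdlib-only sketch of the standard CUPED formula, with synthetic data; it is not Phoenix's implementation.

```python
import random

def cuped_adjust(y, x):
    """CUPED variance reduction: remove the component of metric y explained
    by pre-experiment covariate x. theta = cov(x, y) / var(x); subtracting
    theta * (x - mean(x)) lowers variance without biasing the treatment
    effect, because x was measured before assignment."""
    n = len(y)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    theta = cov / var_x
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

rng = random.Random(0)
x = [rng.gauss(50, 10) for _ in range(5000)]   # pre-period engagement, e.g. swipe volume
y = [0.8 * xi + rng.gauss(0, 5) for xi in x]   # correlated in-experiment metric
y_adj = cuped_adjust(y, x)
# y_adj keeps the same mean as y but with much lower variance, so the same
# sample size can detect smaller treatment effects.
```

The stronger the pre-period covariate predicts the metric, the larger the variance reduction, which is why high-signal covariates like prior swipe and match activity matter so much in this setting.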

What Makes It Notable

Phoenix's most distinctive contribution is its practical integration of switchback experiments into a general-purpose, self-serve experimentation platform. Many companies acknowledge that network effects and marketplace interference violate the stable unit treatment value assumption (SUTVA), but few have operationalized an alternative design pattern within tooling accessible to product managers. For practitioners working on two-sided marketplaces—ride-sharing, food delivery, dating—Tinder's approach to embedding switchback methodology alongside conventional A/B testing offers a concrete reference point for handling interference without abandoning randomized experimentation. The specific pairing of CUPED for sensitivity with switchback for validity under network effects is a pragmatic combination: pre-experiment engagement signals like swipe volume, match history, and messaging cadence are strongly predictive of post-experiment behavior, meaning substantial variance reduction is achievable and translates directly into shorter experiment durations or the ability to detect smaller effects.
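A switchback design of the kind described here randomizes at the (region, time window) level instead of the user level. The sketch below shows one common way to do that deterministically; the region codes, experiment name, and window length are illustrative assumptions, not Tinder's configuration.

```python
import hashlib
from datetime import datetime, timezone

def switchback_variant(region: str, ts: datetime, experiment_id: str,
                       window_hours: int = 4) -> str:
    """Assign treatment at the (region, time window) level rather than per
    user: everyone in a region experiences the same condition during a
    window, so user-to-user interference stays within a single condition.
    Hashing the window index pseudo-randomizes which windows get treatment."""
    window_index = int(ts.timestamp()) // (window_hours * 3600)
    key = f"{experiment_id}:{region}:{window_index}"
    bucket = int(hashlib.sha256(key.encode()).hexdigest()[:8], 16)
    return "treatment" if bucket % 2 == 0 else "control"

# All users in the same region and 4-hour window share one condition.
now = datetime(2024, 1, 1, 0, 30, tzinfo=timezone.utc)
switchback_variant("AU", now, "chemistry-rollout")
```

Analysis then compares treated and control windows within each region, trading some statistical power (fewer randomization units) for validity under interference.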

The organizational pattern matters as much as the technical one. The Quantitative Tools team builds and maintains Phoenix but deliberately avoids acting as a gatekeeper deciding which experiments teams should prioritize—individual product teams retain autonomy over their experimental agendas while inheriting variance reduction, interference-aware designs, and multiple comparison correction as platform-level defaults. Han has described this as a conscious choice: the team functions as an enabler, not a bottleneck. The platform's growth to 400 concurrent experiments suggests that removing technical barriers to rigorous testing, rather than adding statistical sophistication that only specialists can wield, is what drives experimentation adoption at scale. The public documentation is thin but precise—three Medium posts providing genuine architectural detail and a handful of talks supplying statistical philosophy and business context—with no open-source components or peer-reviewed papers. For teams building experimentation infrastructure for marketplace products, the key takeaway is that baking interference-aware methodology and automated safeguards into the platform layer, so experiment owners inherit them by default, produces compounding returns as the organization's experimentation volume grows.

People

Andrew Han

Senior PM of Experimentation

Kenneth Yeh

Author, Phoenix Testing Platform

Jinqiang Han

Author, Phoenix Testing Platform

Juzheng Li

Author, Phoenix Testing Platform

Connor Wybranowski

Author, Phoenix Testing Platform

Siggi Jonsson

Author, Phoenix Testing Platform

Keith McKnight

Author, Phoenix Testing Platform

Uday Babbar

Author, Phoenix Testing Platform

Limin Yao

Author, Phoenix Testing Platform

Fan Jiang

Author, Phoenix Testing Platform

Lucile Lu

Author, Phoenix Testing Platform

Key Facts

Methodology
frequentist, sequential, CUPED, switchback, multiple-comparison correction
Platform Type
server-side, client-side, mobile, marketplace, feature flags, ML experiments
Scale

~400 concurrent experiments, doubling annually

Year Started

~2019

Tech Stack
AWS, MongoDB, Redis, AWS Amplify, Amazon ElastiCache, Amazon Redshift
#marketplace-experiments #variance-reduction #switchback #two-sided-marketplace #experimentation-culture #mobile-experimentation #remote-configuration #feature-flags #ml-experiments

Last updated: 2026-03-29