Bunsen

Yelp · In-House

Overview

Most experimentation platforms start as technical projects—someone builds an assignment service, bolts on a stats engine, and ships a dashboard. Bunsen is interesting because it started as an organizational problem. Before 2018, Yelp's teams ran experiments in incompatible ways: different significance thresholds, different sample-size heuristics, no shared metric taxonomy, and no central record of what had already been tested. Post-analysis for individual experiments took 7–14 days of manual work, and experimentation had become a drag on product velocity rather than an accelerant. The platform that emerged was designed as much to fix that fragmentation as to provide infrastructure.

Rohan Katyal, the product manager who spearheaded the program, framed it as an organizational transformation rather than a technical build. His three-pillar approach—education, standardization, and continuous monitoring—shaped every design decision. The education pillar included explaining concepts like p-values directly on Bunsen's experimentation scorecard, training distributed "Bunsen Deputies" through a four-week program, and running hands-on sessions ("Build Me a Bunsen" for system design exercises, "Break Me a Bunsen" for surfacing gaps). Standardization meant a shared metric taxonomy of decision metrics, tracking metrics, and guardrail metrics, plus a Product Experimentation Process (PEP) template that governed how experiments were created, interpreted, and communicated. The monitoring pillar treated the program itself as something to measure: quarterly meta-analyses assessed whether each experiment was correctly designed, properly powered, and soundly interpreted. Over eighteen months, that feedback loop produced a 120% improvement in decision accuracy—not experiment volume, but the proportion of experiments yielding correct conclusions. The work became significant enough to be documented as a Harvard Business School case study (621-064) by Iavor Bojinov and Karim R. Lakhani, later revised in March 2024.

The platform scaled to handle what Justin Norman, Yelp's VP of Data Science, described in a 2020 podcast as over 700 experiments running at any one time across Bunsen and its companion data platform Beaker. Combined with a 10X increase in experiment volume and a 2X improvement in statistical decision accuracy, the program transformed experimentation from a specialist function into an organizational habit—one whose quality was quantified, not assumed. Katyal later moved to Meta's New Product Experimentation team, but the contrast he drew publicly is telling: Yelp's approach was education- and culture-led where Facebook's was tool-driven and abstraction-led.

Architecture & Approach

Bunsen handles the full experiment lifecycle: creation and configuration, consistent user randomization, data collection, statistical analysis, and result visualization through an Experimentation Scorecard designed for quick, unambiguous reads. The platform supports traditional A/B tests, sequential testing for early stopping, and progressive feature rollouts with metric monitoring at each stage. Cohort management handles mid-experiment allocation changes—adjusting treatment/control splits after launch while preserving statistical validity, a problem covered in detail on Yelp's engineering blog. Integration was engineered for minimal friction: dedicated client libraries for every technology stack at Yelp allowed developers to connect with Bunsen in two lines of code.
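The source doesn't describe Bunsen's assignment internals, but "consistent user randomization" of this kind is typically implemented by hashing a (user, experiment) pair, so assignment is deterministic and stateless across services. A minimal Python sketch under that assumption (function name and hashing scheme are illustrative, not Bunsen's API):

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   split=(("control", 0.5), ("treatment", 0.5))) -> str:
    """Deterministically map a user to a variant by hashing.

    Hashing (experiment, user_id) means the same user always sees the
    same variant, and assignments are independent across experiments
    because the experiment name salts the hash.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    threshold = 0.0
    for variant, weight in split:
        threshold += weight
        if bucket <= threshold:
            return variant
    return split[-1][0]  # guard against floating-point rounding

# Same inputs always yield the same assignment, with no stored state:
assert assign_variant("user_42", "new_search_ui") == \
       assign_variant("user_42", "new_search_ui")
```

A scheme like this also makes mid-experiment reallocation tractable: widening a variant's weight range only moves users at the boundary, rather than reshuffling everyone.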

Data flows through Yelp's real-time pipeline built on Kafka (processing billions of messages daily), with Flink handling stream transformations and Cassandra providing high-volume event storage. Analytics on historical data run through Amazon Athena and AWS Glue. Automated safeguards are baked into the workflow: Bunsen Prechecks verify that experiments meet minimum quality standards before launch, checking for sample ratio mismatch, randomization failures, and design soundness. Bunsen Health monitors the platform's own internal operations, alerting teams to infrastructure issues. Guardian metrics function as automated circuit breakers—if an experiment severely harms core business metrics, the system can roll back the change without manual intervention. This proved its value when the local services team discovered through experimentation that a new design was decreasing project creation; the rollback prevented what would have been millions of dollars in lost revenue.
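The sample ratio mismatch check that Bunsen Prechecks performs can be illustrated with the standard one-degree-of-freedom chi-square goodness-of-fit test on the observed split. The function below is a hedged sketch of that technique, not Yelp's implementation:

```python
def srm_check(control_n: int, treatment_n: int,
              expected_ratio: float = 0.5, chi2_critical: float = 3.84) -> bool:
    """Sample ratio mismatch check via a chi-square goodness-of-fit test.

    Returns True if the observed control/treatment split is consistent
    with the configured ratio (the precheck passes), False if the split
    deviates enough to suggest a randomization or logging failure.
    3.84 is the chi-square critical value for df=1 at alpha=0.05.
    """
    total = control_n + treatment_n
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    return chi2 < chi2_critical

# A 50/50 experiment observing 10,090 vs 9,910 users passes; a
# 10,300 vs 9,700 split is flagged as a likely randomization bug.
```

Catching SRM before launch matters because even a small systematic skew (e.g., a client library dropping events for one variant) silently biases every downstream metric comparison.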

Beyond standard A/B testing, Yelp has invested in statistical methods that accelerate experiment cycles without sacrificing rigor. Engineers developed techniques using Kaplan-Meier and mean cumulative function estimators to generate reduced-variance estimates of n-day retention and cumulative spend from partially observed data. Monte Carlo simulations showed these estimators enabled a 12–16% reduction in required cohort observation time, translating to a 25–50% reduction in the follow-up period needed to measure treatment effects—with no loss in statistical power. For machine learning experimentation, Bunsen integrates with Beaker, Yelp's feature generation and tracking platform, to support experiments on adaptive systems. Yelp published detailed guidance on A/B testing bandit algorithms, recommending an 80-80 convergence definition and matched pair design with pairwise deletion to handle the methodological complications of experimenting on systems that learn during the test. A Back-Testing Engine introduced in February 2026 extended this further for ad budget allocation, simulating the entire ecosystem with proposed algorithm changes using production code (included as Git submodules) and Poisson-sampled outcomes before committing to live A/B tests.
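The Kaplan-Meier idea behind the retention accelerator is to estimate n-day retention from a cohort in which many users haven't yet been observed for the full n days: per-day survival fractions are computed over only the users still "at risk" each day, then multiplied. A plain-NumPy sketch of the textbook estimator (Yelp's production version and its variance-reduction details are not public):

```python
import numpy as np

def km_retention(days_retained: np.ndarray, observed_churn: np.ndarray,
                 horizon: int) -> float:
    """Kaplan-Meier estimate of `horizon`-day retention.

    days_retained: for each user, the last day they were known retained
                   (the churn day, or the last observed day if censored).
    observed_churn: True where the user actually churned on that day;
                    False where observation simply ended (right-censored).
    """
    survival = 1.0
    for t in range(1, horizon + 1):
        at_risk = np.sum(days_retained >= t)   # users still observable at day t
        churned = np.sum((days_retained == t) & observed_churn)
        if at_risk > 0:
            survival *= 1.0 - churned / at_risk
    return float(survival)

# Ten users: two churn on day 1, eight are still active when the data
# snapshot is taken on day 5, so days 2-5 are estimated from survivors only.
days = np.array([1, 1, 5, 5, 5, 5, 5, 5, 5, 5])
churn = np.array([True, True] + [False] * 8)
print(km_retention(days, churn, 5))  # 0.8
```

Because censored users still contribute to the risk sets for the days they were observed, the estimator uses every partial observation instead of discarding users who haven't completed the full follow-up window, which is where the quoted reduction in required observation time comes from.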

The organizational architecture mirrors the technical one. Rather than centralizing expertise, the Deputies model distributes it: trained practitioners across the company handle adoption, training, and quality within their own teams, while the platform team focuses on infrastructure. Different stakeholder tiers get different engagement: ICs receive documentation, buddy programs, and peer-learning sessions; managers participate in experimentation councils; executives see quarterly program-health reviews with decision-accuracy trends. Weekly or monthly experimentation newsletters highlight wins from other teams, and executive dashboards visualize experiment volume by team—a gentle gamification that, Yelp found, correlated positively with team performance.

What Makes It Notable

The quarterly meta-analysis of decision accuracy is Bunsen's most distinctive contribution to experimentation practice. Most organizations measure experiment volume or velocity; Yelp measured whether the program was actually producing correct conclusions, then used that signal to improve training, tooling, and process. The 120% improvement figure over eighteen months is one of the few publicly cited, quantified measures of experimentation program quality rather than throughput. That this metric was reported to executive leadership quarterly—not buried in a data science retrospective—signals how deeply the feedback loop was embedded in organizational governance.

The Deputies model offers a practical template for scaling expertise without building an ever-larger central team, and it has been documented in enough detail—through the Harvard Business School case study, Katyal's conference talks, podcast appearances, and Yelp's engineering blog—that practitioners can study the specific training structure, incentive design, and rollout sequence. The statistical innovations are equally instructive: the partially observed data accelerators, the bandit experimentation framework, and the back-testing engine each address real limitations of standard A/B testing that most platforms leave to practitioners to solve ad hoc. For teams wrestling with how to move experimentation from a specialist function to an organizational habit while maintaining statistical rigor, Bunsen's public record—spanning academic case studies, technical blog posts, and conference presentations—is among the most complete available.

People

Rohan Katyal, Experimentation Program Lead
Rohan Singh, Experimentation Infrastructure Researcher
Iavor Bojinov, Harvard Business School Professor (Case Author)
Karim Lakhani, Harvard Business School Professor (Case Author)

Key Facts

Methodology: Frequentist, Sequential, CUPED
Platform Type: Server-side, Client-side, Mobile, Full-stack
Scale: 1,000+ concurrent experiments
Year Started: ~2018
Tech Stack: Kafka, Flink, Cassandra, Python, Scikit-Optimize, AWS, Amazon Athena, AWS Glue
Tags: #variance-reduction #democratization #experimentation-culture #power-analysis #guardrail-metrics #sample-ratio-mismatch

Last updated: 2026-03-28