Uber

In-House

Citrus

Overview

Most experimentation platforms treat assignment as a network call: ask a central service which bucket a user belongs to, get a response, move on. At low volumes that works fine. At Uber's scale—thousands of microservices, 1,000+ concurrent experiments, latency budgets measured in single-digit milliseconds—every RPC adds up. Project Citrus, begun in 2020 as a ground-up rewrite of "Morpheus," the legacy platform built roughly seven years earlier, started from the premise that the evaluation model itself was wrong: experiment assignment should be a local computation, not a network request.

The architectural insight was that Uber already had a battle-tested system for pushing configuration to every host in its fleet: Flipr, the company's dynamic configuration platform, which manages over 350,000 active properties, processes roughly 150,000 configuration changes per week, and serves approximately three million queries per second across 700+ microservices on 50,000+ hosts. Citrus reframed experiments as temporary override layers on Flipr parameters—when an experiment exists for a given parameter, the platform intercepts requests and provides the experiment-determined value; when no experiment is active, the system serves Flipr's default. Rules engines were pushed to host agents, assignment logic read from local file caches, and the Parameter Service RPC was eliminated entirely. The result was a 100× reduction in p99 evaluation latency—from 10ms to 100µs—rolled out initially to Go services. By the second half of 2023, over 100 Go services had migrated, covering nearly 70% of experiment traffic.
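The override layering can be sketched as follows. This is a hypothetical illustration of the resolution order described above, not Uber's actual API; `ParamStore`, `Resolve`, and the field names are invented for the example.

```go
package main

import "fmt"

// ParamStore sketches Citrus-style resolution: an active experiment acts as
// a temporary override layer on top of the Flipr default for a parameter.
type ParamStore struct {
	fliprDefaults map[string]string            // base configuration values
	experiments   map[string]map[string]string // param -> variant -> value
	assignment    func(param string) (variant string, ok bool)
}

// Resolve serves the experiment-determined value when an experiment exists
// for the parameter, and falls back to the Flipr default otherwise.
func (s *ParamStore) Resolve(param string) string {
	if variants, exists := s.experiments[param]; exists {
		if v, ok := s.assignment(param); ok {
			if val, found := variants[v]; found {
				return val
			}
		}
	}
	return s.fliprDefaults[param]
}

func main() {
	store := &ParamStore{
		fliprDefaults: map[string]string{"eta_model": "v1"},
		experiments:   map[string]map[string]string{"eta_model": {"treatment": "v2"}},
		assignment:    func(string) (string, bool) { return "treatment", true },
	}
	fmt.Println(store.Resolve("eta_model")) // "v2": experiment override wins
	store.experiments = nil
	fmt.Println(store.Resolve("eta_model")) // "v1": no experiment, Flipr default
}
```

The same fallback ordering is what makes the reliability story work: removing the experiment layer degrades gracefully to configuration, never to an error.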

Beyond the performance story, Citrus unified what had been three separate worlds: configuration management, feature flags, and A/B tests now share the same client libraries across Go, Java, Android, iOS, JavaScript, and other languages, the same staged rollout machinery, and the same targeting primitives. The statistical layer supports a wide methodological range—frequentist A/B/N tests (t-tests, chi-squared, rank-sum), CUPED variance reduction, sequential testing with always-valid confidence sequences, causal inference methods (synthetic control, difference-in-differences, propensity score matching, inverse probability of treatment weighting, doubly-robust estimation), contextual bandits (LinUCB for CRM personalization), Thompson sampling, and switchback designs for marketplace experiments where individual randomization violates SUTVA. That breadth reflects the diversity of Uber's product surface: a pricing algorithm change on the rider side creates interference patterns fundamentally different from a UI copy test in Uber Eats or a driver cash payment policy that can only be evaluated at the city level.

Architecture & Approach

The core design decision is push-based local evaluation. Rather than microservices pulling assignment from a central Parameter Service via RPC, Citrus continuously distributes experiment rule definitions—targeting criteria, randomization salt, variant weights—to file caches on every host through the same pipeline Flipr uses for configuration parameters. When a service needs to evaluate an experiment, it reads from the local cache and runs the rules engine in-process. No network hop, no serialization overhead, no dependency on a remote service's availability. Randomization uses consistent hashing: a unit identifier (typically a user or driver ID) is hashed together with an experiment-key-derived salt, and the hash taken modulo the bucket count (typically 100) yields a bucket in 0–99. Because the hash depends only on the identifier and salt, the same unit always receives the same assignment regardless of which server evaluates the request.
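The bucketing step can be sketched in a few lines of Go. The choice of SHA-256 here is an assumption for the example—any stable hash with good distribution works—and the 50/50 split threshold is illustrative:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// bucket deterministically maps a unit to one of 100 buckets by hashing the
// unit ID with an experiment-key-derived salt. Because the result depends
// only on these two inputs, every host computes the same assignment.
func bucket(unitID, experimentSalt string) int {
	h := sha256.Sum256([]byte(experimentSalt + ":" + unitID))
	// Interpret the first 8 bytes as an unsigned integer, reduce mod 100.
	return int(binary.BigEndian.Uint64(h[:8]) % 100)
}

func main() {
	// Same unit + salt always lands in the same bucket, on any server.
	fmt.Println(bucket("rider-42", "exp-pricing-v3") ==
		bucket("rider-42", "exp-pricing-v3")) // true

	// A 50/50 split: buckets 0-49 -> control, 50-99 -> treatment.
	if bucket("rider-42", "exp-pricing-v3") < 50 {
		fmt.Println("control")
	} else {
		fmt.Println("treatment")
	}
}
```

Salting per experiment key also decorrelates assignments across experiments: a unit in treatment for one experiment is equally likely to land in either arm of another.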

Treatment groups are organized in a tree structure of contiguous bucket ranges, allowing hierarchical subdivision for multi-variant experiments. Contextual constraints—geographic region, device type, operating system—define which portion of the bucket space each experiment controls, and a custom logic engine detects overlapping experiments at configuration time to prevent conflicts. This means two independent experiments can run on the same parameter simultaneously as long as their context spaces don't intersect, significantly increasing the throughput of concurrent experiments. Multiple layers of SDK fallback ensure reliability: mobile SDKs cache the last-received configuration payload; if the backend is unavailable, they serve from cache, and if the cache is empty, they fall back to Flipr's default parameter value. Backend services similarly fall back to locally-served Flipr defaults on timeout or failure, decoupling experimentation availability from core service availability.
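The overlap check that allows two experiments to share a parameter can be sketched as follows. This is an illustrative simplification of the configuration-time detection described above—the real engine also reasons about bucket ranges and the treatment-group tree—and all names are hypothetical. The key observation: two context spaces are disjoint if they are disjoint on even one dimension.

```go
package main

import "fmt"

// ContextSpace maps a context dimension (region, device type, OS, ...) to
// the values an experiment targets. An empty slice means "any value".
type ContextSpace map[string][]string

// dimsOverlap reports whether two constraint sets on one dimension intersect.
func dimsOverlap(a, b []string) bool {
	if len(a) == 0 || len(b) == 0 {
		return true // unconstrained matches everything
	}
	set := make(map[string]bool, len(a))
	for _, v := range a {
		set[v] = true
	}
	for _, v := range b {
		if set[v] {
			return true
		}
	}
	return false
}

// spacesOverlap reports whether two experiments' context spaces intersect:
// they conflict only if they overlap on *every* dimension simultaneously.
func spacesOverlap(a, b ContextSpace, dims []string) bool {
	for _, d := range dims {
		if !dimsOverlap(a[d], b[d]) {
			return false // disjoint on one dimension -> no shared traffic
		}
	}
	return true
}

func main() {
	dims := []string{"region", "os"}
	expA := ContextSpace{"region": {"us"}, "os": {"ios"}}
	expB := ContextSpace{"region": {"us"}, "os": {"android"}}
	expC := ContextSpace{"region": {"us"}} // any OS

	fmt.Println(spacesOverlap(expA, expB, dims)) // false: disjoint on OS
	fmt.Println(spacesOverlap(expA, expC, dims)) // true: both reach us/iOS users
}
```

Running this check at configuration time, before an experiment launches, is what lets the platform reject conflicts early rather than contaminating live traffic.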

Metric classification drives automatic statistical test selection. The platform categorizes every tracked metric as a proportion, continuous value, or ratio, then applies the appropriate hypothesis test and variance reduction strategy without requiring the experimenter to make that choice. CUPED is standard practice: pre-experiment covariates construct a baseline prediction for each unit, and post-experiment outcomes are adjusted against that prediction to remove variance from pre-existing differences. For experiments that need early stopping, sequential testing allocates the Type I error budget across time using dynamic boundaries—conservative early on when data is sparse, relaxing as observations accumulate. Switchback experiments handle marketplace-specific interference: entire geographic regions are assigned to treatment or control for time windows, alternating over time to estimate aggregate causal effects when individual randomization breaks down. For synthetic control scenarios—such as testing driver surge pricing changes in a single city—the platform constructs a statistical control as a weighted combination of similar untreated markets, enabling causal impact assessment without randomization. For continuous optimization problems like personalized CRM messaging, contextual multi-armed bandits (LinUCB) adaptively allocate traffic toward higher-performing message variants conditioned on user features. Regression detection during staged rollouts uses sequential likelihood ratio tests to continuously monitor crash rates, ANR events, and app-state anomalies; ramp-up follows either time-based schedules or a Bayesian risk-based approach that calculates the maximum safe exposure given current observations, with automated rollback when boundaries are crossed.

Separately, Uber AI researchers have released an optimal experimental design (OED) framework built on Pyro that uses expected information gain to computationally select experiment designs maximizing uncertainty reduction, enabling iterative cycling through design, observation, and inference stages.

What Makes It Notable

The inversion from remote evaluation to local computation is the standout engineering contribution. It's a pattern other large organizations with microservice architectures can learn from directly: if you already have a configuration distribution layer, experimentation assignment can ride on top of it rather than requiring its own RPC infrastructure. The deliberate decision to layer experiments as temporary overrides on Flipr's existing multimap data structure—rather than building a parallel system—meant Citrus inherited years of operational maturity, rollback capabilities, and staged deployment patterns from day one. The unification of feature flags, configuration, and experiments under one system eliminates the common organizational friction where flag management and experiment management are separate workflows maintained by separate teams with separate tooling.

Uber has also been unusually generous in sharing methodology. The CausalML open-source library (uplift modeling, heterogeneous treatment effects, doubly-robust estimation) gives external practitioners access to the same causal inference toolkit used internally. The OED framework released atop Pyro makes adaptive experimental design accessible outside Uber. Public talks on sequential testing implementation, blog posts detailing the Citrus architecture and Flipr integration, and academic papers on adaptive rollout methodologies provide concrete, reproducible detail. The universal holdout practice—maintaining a persistent population segment isolated from all feature changes to measure cumulative long-term impact—addresses a question most platforms ignore: whether the sum of individually positive experiments actually compounds into a better product over time.

People

Olivia Liao

Data Scientist

Eva Feng

Data Scientist, Experimentation Platform

Zhenyu Zhao

Data Scientist, Experimentation Platform

Ty Smith

Engineering Leader, Uber Dev Platform

Anshu Chada

Engineering Leader, Uber Dev Platform

Key Facts

Methodology
frequentist, sequential, bandit, CUPED, switchback
Platform Type
server-side, client-side, full-stack, marketplace
Scale

1,000+ concurrent experiments

Year Started

~2020

Tech Stack
Go, Python, Node.js, Java, Flink

#variance-reduction #marketplace-experiments #sequential-testing #local-evaluation #causal-inference #staged-rollouts

Last updated: 2026-03-28