Walmart

In-House

Expo

Overview

When your experimentation infrastructure fails during Black Friday, you're flying blind at precisely the moment decisions matter most. That was the reality Walmart faced with its earlier Lambda architecture for A/B testing—a dual-system design where batch and streaming layers diverged under peak load, delivering stale or inconsistent metrics during the highest-traffic hours of the year. Expo, built by WalmartLabs, replaced that architecture with a unified Spark Structured Streaming pipeline capable of processing 30,000 to 60,000 events per second across eight tenants (mobile apps, walmart.com, and other digital properties), computing experiment metrics at minute-level granularity and surfacing anomalies in real time. The system detects anomalous trends and false positives as experiments run, providing early warning for problematic tests before they damage customer experience.

The platform reflects a specific set of constraints that come with running experiments at the world's largest retailer. Traffic is extreme and spiky, seasonality is acute, and the cost of a bad shipping decision—or a missed positive signal—is measured in billions. Expo needed to handle not just volume but also the operational reality that events arrive late (mobile clients buffer telemetry, networks hiccup), experiments span heterogeneous surfaces with different instrumentation, and multiple business units must run concurrent tests without cross-contamination. The system's design choices—two-phase aggregation, 10-minute watermarks for late-arriving events, geographic redundancy with mirrored Spark jobs in separate data centers—are direct responses to those constraints. Treatment assignment relies on hashing algorithms that consistently map users to experiments based on combinations of experiment ID and user ID, ensuring stable assignment throughout an experiment's lifetime rather than relying on runtime randomization that could produce inconsistent groupings.
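The deterministic assignment scheme described above can be sketched in a few lines. The hash function, bucket count, and function name below are illustrative assumptions, not Walmart's actual implementation; the point is that hashing the experiment ID together with the user ID yields a mapping that is stable for the life of an experiment yet uncorrelated across experiments.

```python
import hashlib

def assign_variant(experiment_id: str, user_id: str, variants: list) -> str:
    """Deterministically map a user to a variant for a given experiment.

    Hashing experiment_id together with user_id means the same user can
    land in different buckets across different experiments, while always
    receiving the same variant within any one experiment -- no runtime
    randomization, no assignment storage required.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 10_000
    return variants[bucket % len(variants)]
```

Because the mapping is a pure function of the two IDs, any service that knows them can recompute the assignment independently, which matters when eight tenants instrument the same experiment across different surfaces.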

Beyond the core infrastructure, Walmart's experimentation practice layers additional statistical methods on top of Expo's data collection. CUPED is used for variance reduction, with reported reductions of 5–20% across interfaces, compressing experimentation lifecycles by reaching reliable conclusions faster. Interleaving—described in detail by Garima Choudhary and colleagues at Walmart Global Tech—accelerates search algorithm comparison by blending ranked results from two competing models into a single search results page shown to the same users simultaneously. Three variants are employed: Balanced Interleaving, Team Draft Interleaving, and Probabilistic Interleaving, with Balanced Interleaving noted to introduce bias when both algorithms return the same items in different orders. Each customer interaction in interleaving provides attribution intelligence—clicks and add-to-cart actions are credited to the algorithm that surfaced the item—compressing feedback loops from weeks to days. Promising models identified through interleaving then graduate to larger-scale A/B testing through Expo. Causal forest and quasi-experimental methods like propensity score matching and difference-in-differences handle cases where full randomization isn't feasible—marketing campaign evaluation being a primary example.

Architecture & Approach

Expo's pipeline begins with event ingestion into Apache Kafka, where user interactions from all eight tenants are keyed by session identifier so that a session's events land in the same partition. From there, a two-phase Spark Structured Streaming pipeline—written in Scala and running on 180-core instances—performs the heavy lifting. The first phase aggregates raw events into session-level summaries every minute: grouping by session ID, classifying events by experimental variant, and computing per-session metrics like conversion flags, revenue, and engagement measures including clicks and add-to-cart actions. The second phase rolls these session summaries into population-level aggregates per minute per variant, producing the time-series data that experiment owners actually monitor. Structured Streaming's consistency guarantee—that output at any point is equivalent to executing a batch job on a prefix of the data—prevents the subtle divergence bugs that plagued the earlier Lambda approach, where batch and streaming layers could disagree about the same metric.
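The two phases can be illustrated with a plain-Python sketch of one micro-batch. The real pipeline is Scala on Spark; the event schema and field names here are assumptions made for the example, chosen to mirror the metrics named above.

```python
from collections import defaultdict

def phase1_sessions(events):
    """Phase 1: collapse raw events into one summary per session.

    Each event is assumed to look like:
    {"session_id", "variant", "type", "minute", optional "revenue"}.
    """
    sessions = {}
    for e in events:
        s = sessions.setdefault(e["session_id"], {
            "variant": e["variant"], "minute": e["minute"],
            "clicks": 0, "add_to_cart": 0, "revenue": 0.0, "converted": False,
        })
        if e["type"] == "click":
            s["clicks"] += 1
        elif e["type"] == "add_to_cart":
            s["add_to_cart"] += 1
        elif e["type"] == "purchase":
            s["converted"] = True
            s["revenue"] += e.get("revenue", 0.0)
    return list(sessions.values())

def phase2_rollup(session_summaries):
    """Phase 2: roll session summaries into per-minute, per-variant totals."""
    out = defaultdict(lambda: {"sessions": 0, "conversions": 0,
                               "revenue": 0.0, "clicks": 0})
    for s in session_summaries:
        agg = out[(s["minute"], s["variant"])]
        agg["sessions"] += 1
        agg["conversions"] += int(s["converted"])
        agg["revenue"] += s["revenue"]
        agg["clicks"] += s["clicks"]
    return dict(out)
```

Splitting the work this way keeps the expensive per-event processing at session granularity, so the second phase only ever touches one compact row per session rather than the raw event firehose.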

These minute-level aggregates are written to KairosDB (backed by Cassandra), a time-series store chosen for its natural fit with the dominant query pattern: "show me metric X for experiment Y between time A and time B." Throughput to storage runs at 1–3 million records per minute. Grafana dashboards connect to KairosDB for real-time visualization and alerting, with automated anomaly detection flagging data quality issues (impossible metric values, unexpected volume spikes) and statistical artifacts (apparent effects consistent with noise rather than genuine treatment differences). The system also pushes its own health metrics into KairosDB, so platform issues don't go undetected or get conflated with actual experimental results.
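The dominant query pattern maps cleanly onto KairosDB's REST interface (`POST /api/v1/datapoints/query`). A request body along these lines expresses "metric X for experiment Y between A and B"; the metric name and tag schema below are illustrative assumptions, not Walmart's actual naming.

```python
import json

def kairos_query(metric: str, experiment_id: str,
                 start_ms: int, end_ms: int) -> str:
    """Build a JSON body for KairosDB's datapoints query endpoint.

    Filters the named metric by an 'experiment' tag (hypothetical
    schema) and sums datapoints into one-minute buckets, matching the
    minute-level granularity the pipeline writes.
    """
    body = {
        "start_absolute": start_ms,
        "end_absolute": end_ms,
        "metrics": [{
            "name": metric,
            "tags": {"experiment": [experiment_id]},
            "aggregators": [{
                "name": "sum",
                "sampling": {"value": 1, "unit": "minutes"},
            }],
        }],
    }
    return json.dumps(body)
```

Tag-based filtering is what makes the multi-tenant setup workable: one metric name can serve every experiment, with tenant and variant carried as tags rather than baked into thousands of separate series names.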

The system's resilience design is worth noting. Every Spark job is mirrored in a second data center, so a facility-level outage doesn't interrupt metric computation or cause data loss. The 10-minute watermark accommodates late-arriving events—a practical concession to the realities of mobile telemetry and network variability—while treating anything arriving after that window as anomalous. The overall architecture is stateless where possible, enabling horizontal scaling to absorb the traffic spikes that define retail. This infrastructure extends into Walmart's broader data ecosystem: Kafka serves as the backbone for streaming across the organization, and Cassandra's distributed, coordinator-free architecture supports the write throughput required when thousands of concurrent experiments generate millions of data points per minute.
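The watermark semantics can be made concrete with a minimal sketch. In the real Scala pipeline this is a one-liner (`withWatermark("eventTime", "10 minutes")`), and Spark additionally tracks watermarks per batch across partitions; the simplified model below captures the core rule that governs late data.

```python
from datetime import datetime, timedelta

class Watermark:
    """Simplified event-time watermark (illustrative, not Spark's internals).

    Track the maximum event time seen so far; any event older than
    (max - delay) has missed its aggregation window and is rejected,
    to be logged as anomalous rather than silently folded in.
    """
    def __init__(self, delay=timedelta(minutes=10)):
        self.delay = delay
        self.max_event_time = None

    def admit(self, event_time: datetime) -> bool:
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return event_time >= self.max_event_time - self.delay
```

The 10-minute delay is the trade-off knob: a longer window tolerates slower mobile clients but forces Spark to hold aggregation state open longer before finalizing each minute's totals.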

What Makes It Notable

Expo's primary contribution to the experimentation community is a well-documented case study of migrating from Lambda architecture to unified streaming for experiment analytics—a transition many organizations have contemplated but few have described in detail at comparable throughput. The 2019 InfoQ coverage, driven largely by senior engineer Reza Esfandani, provided specific numbers (event rates, core counts, watermark durations, storage throughput) that are rare in public descriptions of internal platforms. For practitioners evaluating whether to build streaming experimentation infrastructure, Expo remains one of the more concrete reference architectures available.

What's also instructive is how Expo fits within a broader methodological stack rather than trying to be everything. The platform handles assignment, event collection, and metric aggregation; CUPED-based variance reduction, interleaving analysis, and causal inference are layered on top as separate analytical concerns. That separation of infrastructure from methodology—letting the pipeline be good at ingestion and aggregation while statistical sophistication lives elsewhere—is an architectural pattern worth studying, particularly for organizations whose experimentation needs are outgrowing their monolithic testing tools. More recently, Walmart's experimentation culture has expanded beyond traditional A/B testing into agent orchestration, where a system of coordinated AI "super agents" serves 900,000 associates handling three million weekly questions, with measurable outcomes including 40% faster customer support and significant reductions in operational planning time. The company has also tested voice-enabled shopping, generative AI-powered design assistance, and contextual commerce within virtual environments—domains where success metrics extend well beyond conversion into conversation quality, creative appropriateness, and cross-channel engagement. Expo's streaming foundation and Walmart's layered statistical approach provide the measurement backbone for these increasingly complex experimental surfaces.

People

Reza Esfandani

Senior Software Engineer, WalmartLabs

Key Facts

Methodology
frequentist, cuped, interleaving, causal-forest

Platform Type
server-side, full-stack, mobile

Tech Stack
Apache Spark Structured Streaming, Apache Kafka, KairosDB, Apache Cassandra, Grafana, Scala

#hyperscale-streaming #lambda-architecture-migration #variance-reduction #multi-tenant-experimentation #real-time-anomaly-detection #retail-experimentation

Last updated: 2026-03-28
