Skyscanner

In-House

Dr Jekyll

Overview

Most experimentation platforms treat feature flagging and A/B testing as adjacent concerns. Skyscanner's Dr Jekyll—paired with its API counterpart Mr Hyde—was built from the start as a single system for both, giving teams one place to manage configuration rollouts and run controlled experiments across roughly 300 Java microservices handling 35 million daily searches. The unification matters: when a feature flag and an experiment share the same assignment and targeting infrastructure, graduating a winning variant to a permanent rollout is a configuration change, not a migration between systems.

The platform emerged alongside Skyscanner's shift from a pure metasearch engine (sending users to airline and OTA sites) to a marketplace where Skyscanner itself became the merchant of record for bookings. That transition, beginning around 2019, introduced experimental scenarios the original metasearch model never demanded—testing checkout flows, payment processing, customer-service models, and two-sided marketplace dynamics where supply and demand interact. Dr Jekyll had to support not only classical user-level randomization but also designs suited to marketplace interference, such as switchback experiments that alternate treatments across time intervals rather than across users. It also implements CUPED for variance reduction, letting experiments converge faster on a platform where daily traffic could otherwise tempt teams to call results too early.
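The CUPED adjustment mentioned above is compact enough to sketch directly. This is an illustrative version, not Dr Jekyll's implementation: Y is the in-experiment metric, X is the same metric measured in a pre-experiment window, and the adjustment removes the part of Y's variance that X predicts while leaving the mean untouched.

```java
/**
 * Minimal CUPED sketch (hypothetical code, not Skyscanner's).
 * y = in-experiment metric per unit, x = pre-experiment covariate per unit.
 */
public class Cuped {
    public static double mean(double[] v) {
        double s = 0;
        for (double d : v) s += d;
        return s / v.length;
    }

    /**
     * Returns y adjusted by theta * (x - mean(x)). Centring x keeps E[Y]
     * unchanged; variance drops whenever x correlates with y.
     */
    public static double[] adjust(double[] y, double[] x) {
        double my = mean(y), mx = mean(x);
        double cov = 0, varX = 0;
        for (int i = 0; i < y.length; i++) {
            cov  += (y[i] - my) * (x[i] - mx);
            varX += (x[i] - mx) * (x[i] - mx);
        }
        double theta = cov / varX; // OLS slope of y on x
        double[] adj = new double[y.length];
        for (int i = 0; i < y.length; i++) {
            adj[i] = y[i] - theta * (x[i] - mx);
        }
        return adj;
    }
}
```

Because pre-experiment behaviour usually predicts in-experiment behaviour well, the adjusted metric needs fewer samples to reach the same confidence interval, which is exactly the faster convergence the text describes.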

One detail that reveals the team's care for data quality: Dr Jekyll explicitly classifies incoming requests by whether they originate from web crawlers, using request-level attributes to exclude non-human traffic from experiment assignment. In travel search, where bots constitute a meaningful share of inbound requests, ignoring this distinction would quietly bias metrics across every running test.
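Building the exclusion into the assignment layer can be sketched as follows. The class, marker list, and bucketing scheme here are illustrative assumptions, not Dr Jekyll's code; the point is that bot requests are rejected before bucketing, so they never enter either arm's metrics.

```java
import java.util.Optional;
import java.util.Set;

/** Sketch of crawler-aware assignment (hypothetical names and thresholds). */
public class Assigner {
    // Illustrative substrings; a production classifier would use richer
    // request-level attributes than the user agent alone.
    private static final Set<String> BOT_MARKERS =
            Set.of("googlebot", "bingbot", "crawler", "spider", "headless");

    public static boolean isCrawler(String userAgent) {
        String ua = userAgent.toLowerCase();
        return BOT_MARKERS.stream().anyMatch(ua::contains);
    }

    /**
     * Deterministic 50/50 split; Optional.empty() means the request is
     * excluded from the experiment entirely.
     */
    public static Optional<String> assign(String experimentId, String userId,
                                          String userAgent) {
        if (isCrawler(userAgent)) return Optional.empty();
        // String.hashCode is stable across JVMs; a real system would likely
        // use a seeded hash such as MurmurHash for better distribution.
        int bucket = Math.floorMod((experimentId + ":" + userId).hashCode(), 100);
        return Optional.of(bucket < 50 ? "control" : "treatment");
    }
}
```

Excluding at assignment time, rather than scrubbing in post-analysis, means every downstream consumer of the assignment log sees clean data by construction.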

Architecture & Approach

Dr Jekyll (the UI) and Mr Hyde (the API) are deliberately separated to allow independent scaling and evolution. The split also reflects a hard performance constraint: native apps call the Mr Hyde API on application start and require experiment data within a one-second window—calls exceeding this budget are aborted. In Skyscanner's original data-centre setup, roughly 70% of requests globally came in under that threshold. After migrating to AWS and adding Akamai geo-aware routing in front of Mr Hyde, performance matched or improved on the data-centre figures globally, a detail Raymond Davies documented in his 2017 migration write-up.
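The one-second contract can be sketched from the client side. The API shape below is a hypothetical reconstruction (Skyscanner's SDKs are not public): the remote call is raced against the budget, and a miss falls back to default allocations rather than delaying app start.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch of a hard client-side budget like the Mr Hyde one-second rule. */
public class BudgetedFetch {
    public static List<String> fetchAllocations(Callable<List<String>> remoteCall,
                                                long budgetMillis,
                                                List<String> defaults) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<List<String>> future = pool.submit(remoteCall);
        try {
            return future.get(budgetMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            future.cancel(true); // abort the call rather than block app start
            return defaults;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The design choice worth noting is that the budget is enforced by the caller, not the server: no matter how the network behaves, the app's start-up latency has a known ceiling.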

That migration itself is instructive. Data moved from a Couchbase cluster in the data centre to a Postgres database on AWS, with scheduled AWS Lambda functions handling synchronization by truncating and re-inserting records. The team discovered that S3 replication lag—sometimes 30 minutes to several hours—was problematic for multi-region distribution, a real-world constraint that shaped their eventual architecture. All of this happened with active experiments in flight, meaning the migration could not invalidate running tests or introduce assignment inconsistencies during the cutover.
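The truncate-and-reinsert pattern is small enough to model. This sketch stands a plain map in for the Postgres table (the real job ran inside a scheduled Lambda against the database); building the new snapshot completely before swapping it in plays the role of wrapping the truncate and the inserts in one transaction, so readers never observe a half-empty table.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Sketch of a truncate-and-reinsert sync (illustrative, in-memory model). */
public class ConfigSync {
    private volatile Map<String, String> table = Map.of();

    /** Replace the whole table with the source-of-truth snapshot. */
    public void sync(Map<String, String> sourceOfTruth) {
        Map<String, String> next = new HashMap<>();  // "TRUNCATE": start empty
        next.putAll(sourceOfTruth);                  // re-insert every record
        table = Collections.unmodifiableMap(next);   // swap is atomic for readers
    }

    public Optional<String> lookup(String experimentId) {
        return Optional.ofNullable(table.get(experimentId));
    }
}
```

Full replacement trades efficiency for simplicity: there is no delta computation to get wrong, which matters when the data being synced decides which experiment variant every request receives.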

The broader infrastructure context matters for understanding how Dr Jekyll operates at scale. Skyscanner deploys roughly 10,000 times per month across a cell-based Kubernetes architecture spanning four regions and twelve availability zones, with Istio handling cross-cluster service mesh routing and Argo CD orchestrating GitOps-driven rollouts. Experiment assignments and events flow into two parallel streaming paths: one through Kafka for real-time metrics and logs, another through Kinesis consumed by Flink and written to S3 at 15-minute intervals. This dual-pipeline design supports both low-latency monitoring and batch analysis without either path constraining the other.

The platform's targeting and segmentation layer goes beyond simple random allocation: teams can segment on request attributes including user locale, platform, device type, and crawler status, enabling stratified randomization and hold-out groups. Experiment data feeds into Databricks with governance via Unity Catalog for lineage tracking and access control, while an observability stack rebuilt around OpenTelemetry and New Relic means services participating in experiments emit structured traces and metrics without bespoke instrumentation per test.
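Targeting rules of the kind described compose naturally as predicates over request attributes. A sketch with illustrative attribute names (not Dr Jekyll's rule language):

```java
import java.util.Map;
import java.util.function.Predicate;

/** Sketch of attribute-based segmentation rules (hypothetical attribute keys). */
public class Targeting {
    /** A rule matching requests whose attribute equals the given value. */
    public static Predicate<Map<String, String>> attrEquals(String key, String value) {
        return attrs -> value.equals(attrs.get(key));
    }
}
```

Because rules are plain predicates, locale, platform, device type, and crawler status compose with `and`/`or`/`negate`, which is all a stratified-randomization or hold-out scheme needs at the eligibility-check layer.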

What Makes It Notable

Dr Jekyll is not well-documented publicly—there are no conference papers or detailed engineering blog series comparable to what Netflix or Booking.com have published. What makes it worth studying is the combination of experimentation and configuration management in a single system operating across a large microservices fleet during a fundamental business-model transition. Few platforms have had to support the jump from referral-based metasearch to direct-booking marketplace while keeping thousands of concurrent experiments valid. The one-second hard cutoff on Mr Hyde API responses, and the engineering work required to preserve that contract across an infrastructure migration with live experiments, illustrate the kind of constraint that shapes real experimentation systems but rarely appears in public talks.

The crawler-aware segmentation is a practical contribution worth borrowing: any platform with significant bot traffic—travel, e-commerce, classifieds—faces the same contamination risk, and building detection into the assignment layer rather than cleaning it up in post-analysis is a cleaner architectural choice. Davies' 2017 migration write-up, with its candid account of S3 replication lag surprises and the Lambda-based sync workaround, remains the most detailed public artifact on Dr Jekyll and offers a rare window into what it costs to move a live experimentation system between infrastructure providers without breaking the experiments running on top of it.

People

Raymond Davies

Lead, Experimentation Services Squad

Andrew Phillips

Vice President of Engineering

Resources

Key Facts

Methodology
Switchback, CUPED
Platform Type
Server-side, full-stack, ML models, marketplace
Scale

thousands of experiments annually

Tech Stack
Java, AWS, Kafka, Kinesis, Flink, S3, Kubernetes, Istio, OpenTelemetry, New Relic, Databricks, Argo CD
#crawler-detection #feature-flags #marketplace-experiments #microservices-experimentation #data-driven-culture #cloud-migration

Last updated: 2026-03-28