Most Companies Waste 80% of Their A/B Testing Budgets
Updated: December 18, 2025
In 2019, Booking.com ran over 25,000 A/B tests. That's 68 experiments per day, every day, for a year. They declared victory: fastest learner wins. But when researchers examined their actual learning, they found something unsettling. After thousands of tests, Booking.com had thousands of isolated yes/no answers but no causal model of how their product actually worked. They'd built a knowledge graveyard, not a knowledge engine.
This reveals the central illusion of modern experimentation. Companies think they're buying platforms when they're actually building epistemological systems – systems for generating reliable knowledge from messy reality. Platforms are about infrastructure and velocity. Epistemology is about whether you're learning anything real. Most organizations optimize for the wrong one.
The pattern repeats everywhere. Companies spend millions on experimentation platforms – Statsig, LaunchDarkly, Optimizely – then wonder why they're not learning faster. The answer is uncomfortable: the platform is maybe 20-30% of the solution. The other 70-80% is statistical literacy, data quality discipline, and organizational coordination. You can't buy those. You have to build them, slowly, against organizational resistance.
Around 2019, something changed in how successful companies thought about experimentation data. The shift was subtle but fundamental: measurement stopped being an output and became infrastructure.
Before, you ran an experiment, collected data, calculated results, made a decision. Measurement was the endpoint. After, measurement became continuous validation of whether your entire testing machinery was working correctly. The data quality itself became the product.
This emerged because of a specific technical problem that only appears at scale. When you're running 10 concurrent experiments, you can manually verify each one. When you're running 500, you can't. Silent failures become inevitable. A tracking pixel doesn't fire. A user assignment gets cached incorrectly. A metric calculation drifts. These errors corrupt 10-15% of experiments at scale, but they're invisible unless you're actively hunting for them.
Sample Ratio Mismatch detection became the canary in the coal mine. The principle is simple: if you assign users 50/50 to control and treatment, you should see roughly 50/50 distribution in your data. If you see 52/48, something broke. Maybe bot traffic. Maybe a redirect issue. Maybe a client-side race condition. The specifics don't matter – what matters is that your test results are now unreliable, and you won't know unless you're checking.
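In code, the check is almost embarrassingly small. Here is a minimal sketch, assuming a planned 50/50 split and using a chi-square goodness-of-fit test with a deliberately strict threshold; the counts are made up:

```python
# Minimal SRM check: chi-square goodness-of-fit on assignment counts.
# The counts and the 50/50 expected split are hypothetical.
from scipy.stats import chisquare

def srm_check(control_users: int, treatment_users: int,
              expected_split=(0.5, 0.5), alpha: float = 0.001) -> bool:
    """Return True if the observed split is suspicious (likely SRM)."""
    total = control_users + treatment_users
    expected = [total * expected_split[0], total * expected_split[1]]
    stat, p_value = chisquare([control_users, treatment_users], f_exp=expected)
    return p_value < alpha  # tiny p-value: assignment is broken, stop trusting the test

# 52/48 on a million users is not a rounding error; it's a broken experiment.
print(srm_check(520_000, 480_000))   # True: investigate before reading results
print(srm_check(500_400, 499_600))   # False: within normal random variation
```

The strict threshold is the point: the alarm should fire only when assignment is genuinely broken, not on ordinary randomness.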
Netflix, Airbnb, and Microsoft all built automated SRM detection around 2018-2019. It wasn't optional. They discovered that human reviewers miss these issues until after decisions get made. By then, you've already shipped a feature that hurt users based on corrupted data.
This created the discontinuity: scaling from 10 to 100 experiments per month is an infrastructure problem. Scaling from 100 to 1,000 is a data quality problem. The binding constraints shift from "can we run more tests" to "can we trust the tests we're running."
Most companies measure outcomes. Should we ship feature X? Does blue button convert better than red? These are useful questions, but they don't compound. After 500 experiments, you have 500 disconnected facts. You're no smarter about the underlying mechanisms driving user behavior.
Frontier organizations now treat experiments differently. They're building causal models, not collecting yes/no answers. The technical term is heterogeneous treatment effect estimation, but the concept is straightforward: instead of asking "did X work," ask "for whom did X work, under what conditions, and why."
This matters because of variance in treatment effects. Booking.com might test a new search filter and find it improves conversions by 2% on average. That's a clear win, ship it. But the heterogeneous analysis reveals something more interesting: it improves conversions by 15% for users searching in unfamiliar cities, does nothing for frequent travelers, and actually hurts conversions by 8% for mobile users with slow connections.
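A crude way to surface this is to break the readout out by segment and attach uncertainty to each slice. Production systems use dedicated estimators for this (causal forests, meta-learners); the sketch below is the simplified segment-level version, and the column names, segments, and data layout are hypothetical:

```python
# Segment-level lift estimates: a simplified stand-in for full heterogeneous
# treatment effect estimation. Column names and data are hypothetical.
import numpy as np
import pandas as pd

def segment_lifts(df: pd.DataFrame, segment_col: str = "segment") -> pd.DataFrame:
    """Per-segment difference in conversion rate (treatment minus control) with a 95% CI."""
    rows = []
    for segment, grp in df.groupby(segment_col):
        t = grp[grp["variant"] == "treatment"]["converted"]
        c = grp[grp["variant"] == "control"]["converted"]
        lift = t.mean() - c.mean()
        se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
        rows.append({"segment": segment, "lift": lift,
                     "ci_low": lift - 1.96 * se, "ci_high": lift + 1.96 * se,
                     "n": len(grp)})
    return pd.DataFrame(rows)

# An averaged +2% can hide a +15% segment and a -8% segment;
# the per-segment table is where the mechanism shows up.
```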
Now you have mechanism insight. The filter helps with information overload but creates interaction costs. This guides the next 20 experiments. You don't need to retest variations – you have a model of how the feature operates. Your learning compounds.
The shift from isolated tests to causal models is where the next decade's competitive advantage lives. Companies running 1,000 disconnected experiments learn slower than companies running 100 experiments that build on each other. The velocity-to-insight ratio matters more than raw velocity.
Here's an uncomfortable estimate: perhaps 5% of software engineers understand statistical power analysis. This creates a bizarre situation where companies build sophisticated experimentation platforms that anyone can use, then watch as 70% of experiments produce misleading results because the basics got ignored.
The most common failure mode is obvious in retrospect: insufficient sample size. A company with 1 million users per month wants to detect a 0.1% improvement in conversion rate. The math is unforgiving: depending on the baseline conversion rate, detecting that effect reliably takes on the order of 16 million user observations. At 1 million users per month, that's a 16-month experiment. Nobody waits 16 months.
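The arithmetic is worth seeing once. A back-of-the-envelope sample-size calculation for a two-proportion test, assuming an illustrative 50% baseline, a two-sided 5% significance level, and 80% power (your baseline and power bar will move the total around):

```python
# Back-of-the-envelope sample size for a two-proportion test
# (two-sided alpha = 0.05, 80% power). Baseline rate and lift are illustrative.
from scipy.stats import norm

def required_n_per_arm(baseline: float, absolute_lift: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed in EACH arm to detect baseline -> baseline + absolute_lift."""
    p1, p2 = baseline, baseline + absolute_lift
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(z**2 * variance / absolute_lift**2) + 1

n = required_n_per_arm(baseline=0.50, absolute_lift=0.001)
print(f"{n:,} users per arm, {2 * n:,} total")
# Roughly 3.9 million per arm at these settings; change the baseline, demand
# 90%+ power, or read "0.1%" as a relative lift and the total climbs into the
# tens of millions. That's how a small lift quietly becomes a year-long test.
```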
So they run it for 2 weeks, see a positive signal, ship it, declare victory. What actually happened? They detected noise and mistook it for signal. The feature might help, might hurt, might do nothing – the data can't distinguish. But the organizational machinery requires decisions, so people make them on insufficient evidence.
This is why platforms alone don't solve the problem. You can build the world's best experimentation infrastructure, but if users don't understand power analysis, they'll run underpowered tests and make bad decisions faster. You've just accelerated organizational learning in a random direction.
The solution isn't "train everyone in statistics." That doesn't scale and doesn't stick. The solution is encoding statistical rigor into the platform itself. Statsig and others now refuse to start experiments that are obviously underpowered. You input your expected effect size and current traffic, and the system tells you "this will take 6 weeks to reach 80% power." If you don't want to wait, you either find more traffic or accept a larger minimum detectable effect.
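The guardrail itself is not sophisticated. Here is a sketch of the kind of pre-flight check such platforms apply, with hypothetical names and thresholds rather than any vendor's actual API:

```python
# A pre-flight duration check of the sort experimentation platforms run before
# letting a test start. Names and thresholds are hypothetical, not a real API.
import math

def preflight(required_n_total: int, weekly_traffic: int,
              max_weeks: int = 8) -> str:
    """Translate a required sample size into calendar time and block hopeless tests."""
    weeks = math.ceil(required_n_total / weekly_traffic)
    if weeks <= max_weeks:
        return f"OK to start: ~{weeks} weeks to reach target power."
    return (f"Blocked: ~{weeks} weeks to reach target power at current traffic. "
            "Increase traffic, raise the minimum detectable effect, or drop the test.")

# e.g. 7.8M total observations needed, 250k eligible users per week:
print(preflight(required_n_total=7_850_000, weekly_traffic=250_000))
```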
This works because it changes the default. Before, the default was "run any experiment anyone wants." After, the default is "only run experiments we can actually learn from." The friction is intentional and valuable.
GDPR and CCPA aren't temporary obstacles. They're the new baseline, and they fundamentally reshape what's possible.
The constraint is specific: you can't collect and retain user data indefinitely anymore. You need consent, you need purpose limitation, you need deletion capabilities. For experimentation, this means your sample sizes shrink. Users who don't consent drop out. Historical data expires. Your statistical power degrades.
Most companies treat this as an annoying compliance cost. Smart companies recognize it as a forcing function toward better methods. If you can't collect unlimited data, you need to extract more signal from limited data.
Federated learning becomes interesting here. The basic idea: instead of collecting user data centrally and running experiments on aggregated data, you run experiments locally on user devices and only aggregate the statistical summaries. The raw data never leaves the device.
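A toy sketch of the aggregation step, under the assumption that each device reports only per-variant counts and sums rather than raw events (real deployments layer secure aggregation and differential-privacy noise on top):

```python
# Toy federated aggregation: each device reports only per-variant counts and
# sums (sufficient statistics), never raw events. Real systems add secure
# aggregation and noise; this shows only the shape of the computation.
from dataclasses import dataclass

@dataclass
class DeviceSummary:
    variant: str      # "control" or "treatment"
    n: int            # observations on this device
    total: float      # sum of the metric (e.g., conversions) on this device

def pooled_effect(summaries: list[DeviceSummary]) -> float:
    """Treatment-minus-control difference in means from device-level summaries."""
    agg = {"control": [0, 0.0], "treatment": [0, 0.0]}
    for s in summaries:
        agg[s.variant][0] += s.n
        agg[s.variant][1] += s.total
    mean = {v: total / n for v, (n, total) in agg.items()}
    return mean["treatment"] - mean["control"]

# Each device contributes three numbers; the server never sees who did what.
devices = [DeviceSummary("control", 40, 3.0), DeviceSummary("treatment", 38, 5.0),
           DeviceSummary("control", 55, 4.0), DeviceSummary("treatment", 61, 9.0)]
print(pooled_effect(devices))  # ~0.068 lift
```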
This sounds like a research curiosity until you realize it's mandatory in certain regulated industries. Healthcare can't centralize patient data. Finance has similar constraints. If you want to run experiments in these domains, you need privacy-preserving methods or you don't experiment at all.
The performance penalty is real – federated learning is 15-25% less efficient than centralized analysis. But "less efficient than impossible" is still a win. Companies that build federated experimentation capabilities now will have 5-year leads in regulated markets. Everyone else will be catching up.
Right now, there are maybe 30 credible experimentation platforms. In 5 years, there will be 5-8. The economics point one direction: consolidation.
Building a production experimentation platform costs 5-10 person-years upfront, plus ongoing maintenance. Most companies under 2,000 employees can't justify that investment. The unit economics favor buying over building, and once you buy, switching costs are high.
This creates natural oligopoly conditions. A few platforms capture most of the market, then get acquired into larger analytics and observability stacks. Mixpanel, Amplitude, Datadog – they all want to own the "observe-analyze-decide-act" loop. Experimentation is the "decide" layer. It's too valuable to leave disaggregated.
The alternative scenario is fragmentation into verticals. Healthcare experimentation needs different compliance features than e-commerce. Clinical trials need different statistical rigor than pricing tests. Manufacturing needs different integration points than SaaS. If vertical-specific needs diverge enough, we get industry-specific platforms instead of horizontal consolidation.
Which happens depends on whether abstraction economies of scale outweigh vertical specialization benefits. History suggests consolidation wins for horizontal infrastructure, but fragmentation wins for domain-specific tools. Experimentation platforms sit in the middle – they're infrastructure, but with deep domain requirements.
My read: 60% consolidation, 40% vertical fragmentation. Most companies use one of 5-6 general platforms, with specialized tools for regulated industries. The general platforms win through integration breadth and statistical sophistication. The vertical platforms win through compliance bundles and workflow optimization.
The current model is discrete: propose hypothesis, design experiment, collect data, analyze results, make decision, repeat. This batch processing model is ending.
The next generation replaces batch experimentation with continuous learning systems. Instead of discrete A/B tests, you run contextual multi-armed bandits. Instead of testing then deciding, the system learns and adapts in real time. Instead of experimenters proposing hypotheses, large language models generate candidates automatically based on observed user patterns.
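The non-contextual core of that idea fits in a few lines: a Beta-Bernoulli Thompson sampler that shifts traffic toward better arms as evidence accumulates. Contextual versions condition these posteriors on user features; the arm names here are hypothetical:

```python
# Beta-Bernoulli Thompson sampling: the non-contextual core of a continuous
# learning system. Traffic shifts toward better arms as evidence accumulates,
# instead of waiting for a batch readout at the end of a fixed test window.
import random

class ThompsonSampler:
    def __init__(self, arms):
        # Beta(1, 1) prior: one pseudo-success and one pseudo-failure per arm.
        self.successes = {arm: 1 for arm in arms}
        self.failures = {arm: 1 for arm in arms}

    def choose(self) -> str:
        """Sample a plausible conversion rate per arm, play the best draw."""
        draws = {arm: random.betavariate(self.successes[arm], self.failures[arm])
                 for arm in self.successes}
        return max(draws, key=draws.get)

    def update(self, arm: str, converted: bool) -> None:
        if converted:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1

sampler = ThompsonSampler(["thumbnail_a", "thumbnail_b", "thumbnail_c"])
arm = sampler.choose()
sampler.update(arm, converted=random.random() < 0.1)
```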
This isn't speculative – early versions already exist. Recommendation systems have worked this way for years. Netflix doesn't run an A/B test to decide which thumbnail to show you; it learns continuously from billions of impression-click pairs. Advertising platforms optimize bids continuously across millions of auctions per second. Pricing engines adjust dynamically based on inventory and demand signals. The only question is how far the pattern extends beyond these domains.
The limiting factor is interpretability and control. Batch experiments are interpretable – you can explain why you made a decision. Continuous learning systems are black boxes – they work, but you can't always articulate why. This matters enormously for regulated decisions. You can't price insurance with a system you can't explain. You can't make medical decisions with a model you can't audit.
So we get bifurcation: continuous learning for optimization domains where interpretability is optional (recommendations, layouts, content ranking), discrete testing for domains where interpretability is mandatory (pricing, treatment selection, credit decisions).
The interesting edge is where these boundaries shift. As explainable AI methods improve, more decisions migrate from discrete to continuous. As regulatory scrutiny increases, some decisions migrate back from continuous to discrete. The equilibrium point moves, but slowly.
If you're building products, experimentation velocity is now table stakes. Companies running 100 experiments per month learn 5x faster than companies running 20. The good news: this is achievable with modest investment ($50-100K annually plus cultural shift). The bad news: it requires discipline most organizations resist.
Start with 10 tests focused on core product questions. Not "should button be blue or green," but "does personalized onboarding improve 90-day retention." Build from there. The goal isn't testing everything – it's testing the things that matter until you understand mechanism.
If you're building platforms, the differentiation frontier is moving from features to trust. Every platform can run tests. The valuable platforms prevent bad tests from happening, detect data quality issues automatically, and encode statistical rigor as defaults. The platform becomes an epistemological guardrail, not just infrastructure.
If you're investing in this space, watch for M&A activity in 2025-2026. The consolidation wave is starting. Also watch for federated learning adoption in healthcare and finance – that's the leading indicator for privacy-preserving methods going mainstream.
The deeper pattern: we're moving from "testing as tool" to "testing as culture" to "testing as continuous learning system." Each transition changes what competitive advantage looks like. In 2010, having A/B testing was an advantage. In 2020, testing velocity was the advantage. In 2030, causal model building will be the advantage.
The companies winning then are building that capability now, slowly, while everyone else is still shopping for platforms.