What Breaks When Experiment Programs Scale?
The 3 things that break after early wins (and how to fix them)
👋 Welcome back!
Here’s a line I wish more experimentation leaders internalised early:
Experimentation fails fastest when it succeeds.
Early wins don’t just create momentum. They create load. More demand from across the business. More scrutiny from leadership. More stakeholders with opinions. And more things that can quietly break beneath the surface.
Most teams scale up experimentation but don’t scale the systems around it. The result is a program that worked beautifully at low volume but starts to buckle under its own success.
In my experience, three things break first: territorial pushback, tracking chaos, and production bottlenecks. In this edition, I’m sharing what each looks like in practice, how to spot the warning signs, and what to do about them.
1) Territorial pushback
When your experimentation speed jumps, everyone notices. Including senior leaders who suddenly see tests showing up in “their” area:
“Why are you testing in our funnel?”
“Who approved this?”
“What if this hurts our number?”
This is one of the most common plateau triggers. Not because the concern is irrational, but because it is unmanaged.
I’ve repeatedly seen this play out in a particularly sharp way: senior leaders or executives who had zero prior involvement in the experimentation program — people I’d often never even met — swoop in and kill tests on the spot. They don’t come to the experimentation team. They go straight to the direct leadership of the team I’m working with. One conversation, and the test is dead.
It’s jarring the first time it happens. But once you see the pattern, it makes sense. As you expand experimentation across surfaces, you’re implicitly challenging three things:
who owns customer experience decisions
who carries downside risk
who gets credit for upside
If you don’t make those trade-offs explicit, politics will.
The fix
You don’t win territorial pushback with better statistics. You win it with clarity and trust.
Make the value impossible to argue with. Keep a simple running record of what shipped and what improved. Frame it in business language, not experiment language. When a senior leader can see that experimentation contributed to a measurable lift in a metric they care about, pushback loses its footing.
Show you take their concerns seriously. Be explicit about what you’re monitoring, what “unacceptable harm” means in their context, and exactly when you will stop a test. The goal is for any leader to look at your guardrails and feel confident that the things they care about are being actively protected and that harm will be detected and stopped.
Pull leadership in before they push back. Don’t wait for experimentation to feel like an invasion. Present in forums that go beyond your immediate team, such as an all-hands, a cross-functional sync, or a leadership review. Ask your direct leaders to surface the program among their peers. Look for likely friction early, before it becomes a confrontation. The best alignment happens before anyone feels blindsided.
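To make "exactly when you will stop a test" concrete, here is a minimal sketch in Python of a written-down guardrail. Everything in it is an illustrative assumption rather than a prescribed method: the `Guardrail` fields, the `breached` helper, and the 2% threshold are hypothetical. It also compares raw point estimates, so a real check would need to account for statistical noise before stopping anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    """One protected metric, agreed with the leader who owns it (hypothetical schema)."""
    metric: str
    owner: str
    max_relative_drop: float  # e.g. 0.02 means "stop at a 2% drop versus control"


def breached(guardrail: Guardrail, control_value: float, variant_value: float) -> bool:
    """Return True when the variant is down by at least the agreed amount.

    Assumption: higher is better for the metric, and the inputs are point
    estimates. This sketch deliberately ignores sampling noise.
    """
    if control_value <= 0:
        raise ValueError("control_value must be positive")
    relative_change = (variant_value - control_value) / control_value
    return relative_change <= -guardrail.max_relative_drop


# Illustrative values only, not real data.
checkout_guardrail = Guardrail(
    metric="checkout conversion",
    owner="VP Payments",
    max_relative_drop=0.02,
)
stop_now = breached(checkout_guardrail, control_value=100.0, variant_value=97.0)
keep_running = not breached(checkout_guardrail, control_value=100.0, variant_value=99.0)
```

The point is less the arithmetic than the fact that the threshold and the owner are agreed and written down before the test goes live.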
A useful reminder:
Scaling experimentation is rarely a systems problem. It’s a people problem.
2) Tracking chaos
Remembering two experiments is easy. Remembering twenty is hard. Remembering a hundred is impossible. And yet, most teams don’t build any tracking structure until they’re already drowning.
Without that structure, teams quietly lose track of what’s live, what won, what lost, what got killed, and — most importantly — what actually shipped. The program keeps running, but the learning stops compounding. You can’t build strategic advantage on top of forgotten work.
This is where many teams accidentally turn experimentation into motion: lots of tests, lots of slide decks, but very little institutional memory. The program looks active, but when someone asks “what have we learned about onboarding in the last six months?”, nobody can answer without digging through old documents.
The fix
The key is not a fancy tool (though good tools can help). It’s building a consistent tracking structure before the volume outpaces your memory. If you wait until you’re running 50 experiments per quarter, you’ll never catch up.
At a minimum, capture three things for every experiment:
Business objective: What part of the business is this connected to? Acquisition? Retention? Monetisation?
Problem space: What specific driver or area of focus does this sit within? Onboarding friction? Feature visibility? Pricing clarity?
Outcome: What happened, and what did we do about it? Win / Lose / Inconclusive, and then Shipped / Dropped / Iterated.
The middle one — problem space — is the field most teams miss. And it’s the most important, because it’s what lets you zoom out and ask: “Are we making progress against this problem, or just running disconnected tests?”
That’s the difference between a testing archive and a decision system.
To learn why this information is so important, read the last edition: How to Turn Tests Into a Strategic Experimentation Program
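The three fields above can be sketched as a simple record. This is a minimal Python illustration, not a recommended tool: the class and function names (`ExperimentRecord`, `summarise_by_problem_space`) and the sample entries are hypothetical, and a spreadsheet with the same columns would do the same job.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from enum import Enum


class Result(Enum):
    WIN = "win"
    LOSE = "lose"
    INCONCLUSIVE = "inconclusive"


class Decision(Enum):
    SHIPPED = "shipped"
    DROPPED = "dropped"
    ITERATED = "iterated"


@dataclass(frozen=True)
class ExperimentRecord:
    name: str
    business_objective: str  # e.g. acquisition, retention, monetisation
    problem_space: str       # e.g. onboarding friction, pricing clarity
    result: Result           # what happened
    decision: Decision       # what we did about it


def summarise_by_problem_space(records):
    """Count results and decisions per problem space, so you can zoom out."""
    summary = defaultdict(Counter)
    for record in records:
        summary[record.problem_space][record.result.value] += 1
        summary[record.problem_space][record.decision.value] += 1
    return {space: dict(counts) for space, counts in summary.items()}


# Made-up entries, purely for illustration.
log = [
    ExperimentRecord("Shorter signup form", "acquisition", "onboarding friction",
                     Result.WIN, Decision.SHIPPED),
    ExperimentRecord("Skip tutorial button", "acquisition", "onboarding friction",
                     Result.LOSE, Decision.DROPPED),
    ExperimentRecord("Annual plan badge", "monetisation", "pricing clarity",
                     Result.INCONCLUSIVE, Decision.ITERATED),
]
summary = summarise_by_problem_space(log)
```

With records like these, "what have we learned about onboarding in the last six months?" becomes a lookup on `problem_space` rather than a dig through old documents.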
3) Production bottlenecks
This one is painfully common, and it catches teams off-guard because the symptoms look like a different problem entirely.
You’ve got five wins ready to ship. But engineering cannot keep up. So the wins sit in a queue. Stakeholders start to wonder whether experimenting was worth the effort. And experimenters, under pressure to show progress, compensate by running more tests, which only makes the queue longer.
The result is a growing pile of “value” that never gets realised. Programs in this state often get misdiagnosed. People say “we need better ideas” or “we need faster analysis.” But the real bottleneck is shipping. No amount of experimentation throughput matters if the decisions it produces never make it into the product.
The fix
Treat production capacity as a first-class constraint, not an afterthought.
Sync capacity with whoever owns production early. If you can create wins faster than the organisation can ship them, you’re not building momentum. You’re building frustration. Have the capacity conversation before the queue becomes your default state.
Get an engineer on the team part-time if you need to. Not forever. Just enough to prevent implementation from becoming the permanent bottleneck.
Make productionisation part of how you prioritise. Ideas that would be straightforward to get into production should be upweighted. Ideas that would be expensive or complex to ship should be considered more carefully. This doesn’t mean you avoid ambitious tests. It means you factor in the full cost of realising value, not just the cost of running the experiment.
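One way to factor in the full cost of realising value is to put shipping effort into the prioritisation score itself. The sketch below is an assumption on my part, not a standard formula: the `Idea` fields, the unit (days), and the value-per-day ratio are all hypothetical, and any weighting that makes shipping cost visible would serve the same purpose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Idea:
    name: str
    expected_value: float   # rough estimate, in whatever unit you already use
    test_cost_days: float   # effort to build and run the experiment
    ship_cost_days: float   # effort to productionise a winning variant


def priority_score(idea: Idea) -> float:
    """Expected value per day of total effort, including the cost of shipping."""
    return idea.expected_value / (idea.test_cost_days + idea.ship_cost_days)


def rank(ideas):
    """Highest score first."""
    return sorted(ideas, key=priority_score, reverse=True)


# Made-up numbers, purely for illustration.
ideas = [
    Idea("Rebuild checkout flow", expected_value=20, test_cost_days=2, ship_cost_days=18),
    Idea("Reword pricing page", expected_value=10, test_cost_days=2, ship_cost_days=3),
]
ranked_names = [idea.name for idea in rank(ideas)]
```

Judged on test cost alone, the checkout rebuild looks better (20 / 2 versus 10 / 2). Once shipping effort is included, the pricing page comes first (10 / 5 versus 20 / 20). The ambitious test is still on the list; it is just priced honestly.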
A simple rule of thumb:
If your implementation rate is low, you’re not running an experimentation program. You’re running a reporting engine.
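If you want to track that number, one reasonable reading of "implementation rate" is shipped wins divided by total wins. That definition is my assumption for this sketch, as is the function name; the figures in the example are made up.

```python
from typing import Optional


def implementation_rate(wins: int, shipped_wins: int) -> Optional[float]:
    """Share of winning experiments that actually reached production.

    Returns None when there are no wins yet, since the rate is undefined.
    """
    if wins < 0 or shipped_wins < 0 or shipped_wins > wins:
        raise ValueError("expected 0 <= shipped_wins <= wins")
    if wins == 0:
        return None
    return shipped_wins / wins


# Five wins ready to ship, one actually shipped (illustrative numbers).
rate = implementation_rate(wins=5, shipped_wins=1)
```

Reviewing this alongside test volume makes a growing queue visible before stakeholders start asking about it.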
The irony (and the real lesson)
The better your experimentation program performs, the harder it gets to manage. Success increases the surface area. And if you do not scale the systems around experimentation, the program collapses under its own momentum.
The goal is not simply to “run more tests.”
The goal is to build a program that:
earns trust across the org
keeps track of what matters
turns wins into shipped decisions
The takeaway
Expect pushback as you scale. It’s a signal you’re expanding influence.
Build tracking before you need it. Memory is a strategic asset.
Protect shipping capacity. Wins that do not ship are just stories.
Until next time 🙌
Some of my other resources you might find useful:
🧮 The Experimenter’s Calculator: a tool to plan high-quality experiments
📊 Intro to A/B Test Statistics: a free webinar for practitioners






