Prioritized Experiments
Does it matter which experiments you run first?
Perceived Winners First?
Imagine you have a finite set of 100 A/B tests to run. Some will be winners, some losers, and many will be flat. You reject stat-sig negatives (don't ship them) and ship stat-sig positives.
If a human can correctly guess the direction of a test even 59% of the time, should you let them order the backlog with predicted winners first? The simulation below shows what happens.
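A minimal sketch of this setup, not the page's actual simulation code: Gaussian true effects stand in for the winner/loser/flat mix, a guesser is right 59% of the time, and as a simplification every true winner is assumed to be detected and shipped while every loser is rejected. The helper names are illustrative only.

```python
import random

def make_ideas(n=100, seed=0):
    """Finite idea set with Gaussian true effects (illustrative model)."""
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(n)]

def guess_direction(effect, accuracy, rng):
    """Human guesses the sign correctly with probability `accuracy`."""
    return (effect > 0) if rng.random() < accuracy else (effect <= 0)

def run(effects, order):
    """Run tests in `order`, shipping only true winners; return the
    cumulative-lift curve slot by slot."""
    curve, total = [], 0.0
    for i in order:
        if effects[i] > 0:
            total += effects[i]
        curve.append(total)
    return curve

rng = random.Random(1)
effects = make_ideas()
guesses = [guess_direction(e, 0.59, rng) for e in effects]

# Stable sort: predicted winners (guess == True) move to the front.
winners_first = sorted(range(len(effects)), key=lambda i: not guesses[i])
random_order = list(range(len(effects)))
rng.shuffle(random_order)

a = run(effects, winners_first)
b = run(effects, random_order)
# Same finite set, same shipping rule: the final lift is identical.
# Ordering only changes *when* the gains arrive along the curve.
assert abs(a[-1] - b[-1]) < 1e-9
```

The final assertion is the whole point of the lesson: with a fixed set and a fixed ship rule, reordering cannot change the endpoint, only the shape of the path to it.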
Simulation
Takeaways Over 1,000 Simulations
Ordering does not magically create more winners in a finite set. What it does change is when you reach them. Front-loading likely winners compounds gains sooner, so the product gets better faster.
Only Test Positive Estimates From A Finite Set?
Let's see what happens when we apply our intuition to gate experiments believed to be negative.
Imagine the same finite set of 100 ideas and compare two policies: run only the tests with positive estimates, or run all test ideas, scheduling the negatively estimated ones in the second half.
Simulation
In this simulation, a shipped loss persists for two test slots and then returns to zero, as if the mistake is caught and rolled back.
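The two policies can be sketched as follows, under the simplifying assumption that shipped losses are always rolled back, so only shipped winners count toward final lift. The model (Gaussian effects, 59%-accurate sign guesses) is illustrative, not the page's actual simulation.

```python
import random

def simulate(gate_negatives, n=100, accuracy=0.59, seed=7):
    """Finite set of n ideas. A human estimate guesses each idea's sign
    correctly with probability `accuracy`. Shipped losers are rolled
    back after two slots, so they contribute nothing to final lift."""
    rng = random.Random(seed)
    ideas = [rng.gauss(0, 1) for _ in range(n)]
    guessed_positive = [(e > 0) == (rng.random() < accuracy) for e in ideas]
    final = 0.0
    for effect, pos in zip(ideas, guessed_positive):
        if gate_negatives and not pos:
            continue          # gated: this idea is never run
        if effect > 0:
            final += effect   # shipped winner sticks
        # shipped losers revert after two slots: no lasting impact
    return final

gated = simulate(gate_negatives=True)
run_all = simulate(gate_negatives=False)

# Gating only ever removes ideas, including some mis-guessed real
# winners, so it can never end above the run-everything policy here.
assert run_all >= gated
```

Note the asymmetry this toy model bakes in: because losses revert, the only lasting cost of gating is the winners it filters out.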
Takeaways Over 1,000 Simulations
If you use intuition to decide whether a test gets run at all, you may avoid waste. But at 59% accuracy you will still filter out some real winners, and you may also remove the failures that would have generated valuable follow-up ideas.
What If The Idea Backlog Is Infinite?
When new ideas keep coming in, skipping a negative-estimated test doesn't waste a slot — a fresh candidate fills it immediately. So the question shifts: is there still a case for running negative-estimated ideas at all?
Here we compare test positives only and random ordering against two strategies that also continue into negative-estimated ideas: adding them within the main prioritized queue, or running them in parallel alongside positives.
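A rough sketch of these queue shapes, with the same illustrative assumptions as before (Gaussian effects, a 59%-accurate estimator, shipped losses rolled back). The `draw_idea` helper is hypothetical; it models the infinite backlog by drawing fresh candidates until one matches the requested estimate sign.

```python
import random

ACCURACY = 0.59

def draw_idea(rng, want_positive_estimate):
    """Draw from the infinite backlog until the human estimate has the
    requested sign. Estimates are right ACCURACY of the time."""
    while True:
        effect = rng.gauss(0, 1)
        guessed_positive = (effect > 0) == (rng.random() < ACCURACY)
        if guessed_positive == want_positive_estimate:
            return effect

def run_policy(slots, slot_wants_positive, parallel_negatives=False, seed=3):
    """Winners ship and stick; losers are caught and rolled back, so
    only positive true effects accumulate."""
    main, side = random.Random(seed), random.Random(seed + 1)
    total = 0.0
    for t in range(slots):
        effect = draw_idea(main, slot_wants_positive(t))
        if effect > 0:
            total += effect
        if parallel_negatives:
            extra = draw_idea(side, False)  # free parallel capacity
            if extra > 0:
                total += extra
    return total

positives_only = run_policy(100, lambda t: True)
negatives_in_queue = run_policy(100, lambda t: t < 50)  # back half: negatives
with_parallel = run_policy(100, lambda t: True, parallel_negatives=True)

# Parallel negatives never displace a main-queue slot, so they can
# only add lift on top of the positives-only policy.
assert with_parallel >= positives_only
```

The assertion captures the lesson's caveat: parallel negatives compound freely only because they are modeled as truly extra capacity, while `negatives_in_queue` spends real slots on lower-expectation ideas.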
Simulation
Takeaways Over 1,000 Simulations
With an infinite backlog, skipping a negative-estimated idea costs nothing — a fresh positive candidate takes its place. Running negatives in parallel (line 3) adds tests without slowing the main queue, so the lift compounds freely — but only if those extra tests truly run in the same time frame. Slotting negatives into the main queue after positives (line 2) means the queue gets slower, and that drag shows up in the results.
Direction vs. Impact Magnitude Sorting?
One sorting method uses predicted direction to both filter and order: only positive-estimated tests run, sorted positives-first. The other replaces the binary guess with a noisy numeric estimate: it drops anything whose estimated effect is negative and runs the largest estimated effects first.
This lesson isolates the two signals to show which carries more value on its own.
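The contrast between the two signals can be sketched like this, with the usual illustrative simplifications (Gaussian effects, winners always ship, losses revert). The `est_error` parameter plays the role of the impact estimate error slider; all helper names are assumptions, not the page's code.

```python
import random

def sorted_lift(n=100, accuracy=0.59, est_error=1.0, seed=5):
    rng = random.Random(seed)
    effects = [rng.gauss(0, 1) for _ in range(n)]
    # Binary direction signal: right `accuracy` of the time.
    direction = [(e > 0) == (rng.random() < accuracy) for e in effects]
    # Numeric impact estimate: true effect plus noise (the error slider).
    estimates = [e + rng.gauss(0, est_error) for e in effects]

    def curve(order):
        total, out = 0.0, []
        for i in order:
            if effects[i] > 0:  # ship winners; losses revert
                total += effects[i]
            out.append(total)
        return out

    # Direction signal: filter to predicted winners, order arbitrary.
    by_direction = [i for i in range(n) if direction[i]]
    # Magnitude signal: drop negative estimates, biggest estimates first.
    by_magnitude = sorted(
        (i for i in range(n) if estimates[i] > 0),
        key=lambda i: -estimates[i],
    )
    return curve(by_direction), curve(by_magnitude)

dir_curve, mag_curve = sorted_lift(est_error=0.5)
```

Raising `est_error` degrades both the filter and the ordering of the magnitude policy at once, which is exactly the sensitivity the slider is meant to expose.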
Simulation
Takeaways Over 1,000 Simulations
Direction sorting uses a binary signal — predicted win or loss — and ignores magnitude entirely. Magnitude-only sorting uses estimated effect size but has no directional filter, so it can front-load large losers alongside large winners. The impact estimate error slider shows how quickly the magnitude signal degrades as estimates get noisier.
Does Iterating On Results Help?
What happens when we follow up on statistically significant results with one additional test?
We compare three strategies: test positives only, iterate on all results, or iterate only on results above an effect size threshold.
Follow-up tests retain a configurable fraction of the source test's absolute effect (the relative follow-up gain). A follow-up on a stat-sig loser is treated as flipping the original loss into a smaller positive gain.
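A minimal sketch of the three strategies, under assumed simplifications: fresh ideas have Gaussian effects, winners ship and losses revert, and a follow-up is worth `follow_gain` times the source's absolute effect, so a loser's follow-up becomes a smaller win, as described above.

```python
import random

def run(slots, iterate, threshold=0.0, follow_gain=0.5, seed=9):
    """Each slot runs either a queued follow-up or a fresh idea.
    Follow-ups are queued for results whose |effect| clears `threshold`."""
    rng = random.Random(seed)
    total, followups = 0.0, []
    for _ in range(slots):
        if iterate and followups:
            total += followups.pop(0)  # follow-up slot: smaller, surer win
            continue
        effect = rng.gauss(0, 1)
        if effect > 0:
            total += effect            # ship the winner
        if iterate and abs(effect) >= threshold:
            # Keep a fraction of the source's absolute effect; a
            # loser's follow-up flips the loss into a smaller win.
            followups.append(follow_gain * abs(effect))
    return total

no_iter  = run(100, iterate=False)
iter_all = run(100, iterate=True, threshold=0.0)
iter_big = run(100, iterate=True, threshold=1.5)
```

The trade-off is visible in the structure: every follow-up slot is a slot not spent on a fresh idea, which is why the threshold variant exists at all.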
Simulation
Takeaways Over 1,000 Simulations
Does Iterating On Selected Higher Impact Results Help?
Lesson 5 showed that iterating on large effects can beat iterating on all results. This lesson asks: how does threshold-based iteration compare to simply running negatives in parallel as free follow-ups?
The cutoff applies to absolute effect size, so both strong winners and strong losers clear it: a big loss is treated as a signal worth following up, since a follow-up on a loser can flip it into a smaller win.
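The comparison can be sketched as below, reusing the same illustrative model (Gaussian fresh ideas, losses revert, follow-ups worth a fraction of the source's absolute effect). The `"parallel"` mode is the assumption that follow-ups run on extra capacity at no slot cost.

```python
import random

def final_lift(slots, mode, threshold=1.5, follow_gain=0.5, seed=11):
    """mode: 'sequential' spends a main-queue slot on each follow-up
    above the absolute-effect cutoff; 'parallel' runs follow-ups
    alongside the main queue without consuming a slot."""
    rng = random.Random(seed)
    total, queue, t = 0.0, [], 0
    while t < slots:
        if mode == "sequential" and queue:
            total += queue.pop(0)  # follow-up takes this slot
            t += 1
            continue
        effect = rng.gauss(0, 1)
        if effect > 0:
            total += effect
        if abs(effect) >= threshold:
            fu = follow_gain * abs(effect)  # loser flips to a smaller win
            if mode == "parallel":
                total += fu  # free capacity: no slot consumed
            else:
                queue.append(fu)
        t += 1
    return total

seq = final_lift(100, "sequential")
par = final_lift(100, "parallel")

# In this toy model parallel follow-ups are strictly free, so they
# can never fall behind the sequential version of the same policy.
assert par >= seq
```

That inequality is baked in by construction; the interesting question the lesson asks is how realistic the "free parallel capacity" assumption is in practice.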
Simulation
Takeaways Over 1,000 Simulations
Sequential iteration has to earn its slot. Larger source effects can survive decay better, while smaller source effects often get diluted enough that the next fresh positive-estimated idea is a stronger bet.
Cumulative Gains After Running 100 Tests
This final view stacks the policies from earlier lessons into a single progression, from a random roadmap to prioritized, filtered, open-ended, and iterative experimentation.
The bar chart compares average final cumulative lift after the same number of experiment slots, using the same batch-of-10 candidate rule and the same relative follow-up gain setting for the open-ended and iterative steps.
Average Final Lift
Takeaway
Each added layer changes a different part of the system: ordering changes timing, filtering changes selection, an open-ended backlog changes opportunity cost batch by batch, and iteration changes what happens after you learn something. How much it pays depends both on hit rate and on the relative follow-up gain you assume.