Retired Blog

How many trades before a win rate is real

A trader who has gone 16-and-4 over the last twenty trades will tell you they have an 80% win rate. They're wrong. What they actually have is a single 20-trade sample from a process whose true win rate sits, at 95% confidence, somewhere between roughly 56% and 94%. The headline number is real, but the confidence around it is so loose that the strategy could be a money printer or barely above a coin flip, and the data can't tell you which.

This isn't pedantry. It's the difference between trusting a backtest enough to scale up and getting wiped out when the strategy reveals it was lucky, not skilled. Most retail traders quit a working strategy or scale up a doomed one because they treated 20 trades as enough evidence. The math says it isn't. The math says you need 200, ideally 500, before the win-rate number narrows to the kind of precision people assume it has from the first day.

This post walks through the binomial confidence interval, what it actually says about your strategy, the rough heuristic that gets you to a useful answer without the formula, and why the gap between "what the number looks like" and "what the number means" is the single most underestimated source of trading error. The interactive at the end takes your win rate and trade count and shows you the actual range your number sits inside.

What a win rate measurement actually is

Every time you take a trade, the outcome is roughly a coin flip — except the coin is biased toward whatever your strategy's true edge is. After enough flips, the proportion of heads converges on the true bias. Before then, it doesn't.

The rate at which it converges is governed by the binomial distribution. The standard error of a sample proportion is √(p(1−p)/n), where p is the observed rate and n is the sample size. Multiply that by 1.96 and you get the half-width of the 95% confidence interval — the range you'd expect to contain the true win rate 95% of the time.

The numbers come out brutally:

  • 20 trades, 80% win rate: standard error 0.089, 95% CI ≈ 80% ± 18% → range 62% to 98%.
  • 50 trades, 80% win rate: SE 0.057, CI ± 11% → 69% to 91%.
  • 100 trades, 80% win rate: SE 0.040, CI ± 8% → 72% to 88%.
  • 200 trades, 80% win rate: SE 0.028, CI ± 5.5% → 74.5% to 85.5%.
  • 500 trades, 80% win rate: SE 0.018, CI ± 3.5% → 76.5% to 83.5%.
  • 1000 trades, 80% win rate: SE 0.013, CI ± 2.5% → 77.5% to 82.5%.
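The table above can be reproduced in a few lines. This is a minimal sketch of the normal-approximation formula from the text; the function name `normal_ci` is illustrative, not part of any library.

```python
import math

Z_95 = 1.96  # two-sided 95% critical value of the standard normal


def normal_ci(p_hat: float, n: int, z: float = Z_95) -> tuple[float, float, float]:
    """Return (standard error, lower bound, upper bound) for an observed win rate."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    half_width = z * se
    return se, p_hat - half_width, p_hat + half_width


# Same observed 80% win rate, increasing sample sizes.
for n in (20, 50, 100, 200, 500, 1000):
    se, lo, hi = normal_ci(0.80, n)
    print(f"{n:>5} trades: SE {se:.3f}, 95% CI {lo:.1%} to {hi:.1%}")
```

Swap in your own win rate and trade count to see where your number sits in the same table.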

Read those carefully. After 100 trades — already more than most retail traders do in a year — the actual win rate could plausibly be anywhere from 72% to 88%. After 20 trades, the band is so wide that an "80% win rate" strategy is statistically indistinguishable from a "70% win rate" strategy, and almost indistinguishable from a "60%" one. You can't tell from the data alone.

The ratio that matters is intuitive: doubling sample size shrinks the interval by about √2 ≈ 1.41. To halve the error bar, you need four times the trades. To quarter it, sixteen times. This is the slow, unforgiving cost of certainty.

The funnel

[Figure: Win-rate confidence funnel for an observed 80% win rate. Horizontal axis: number of trades (20, 50, 100, 200, 500, 1000). Vertical axis: true win rate (60% to 100%). Band half-widths: ±18%, ±11%, ±8%, ±5.5%, ±3.5%, ±2.5%.]

That funnel — wide at the left, tight at the right — is what every new trader is doing battle with whether they realise it or not. Twenty trades give you the leftmost slice. The dotted line is the observed 80%; the red band is where the true rate could plausibly be hiding. Anyone declaring strategy results from inside the wide end of that funnel is making a probabilistic claim they don't have the evidence to support.

Why this is the single biggest source of strategy error

Sample-size error doesn't feel like error. It feels like a real number that just happens to wiggle. The wiggle is the entire problem.

Four concrete patterns that flow from the funnel:

Quitting strategies that work. A real 65% win-rate strategy will produce a 20-trade stretch at 55% or worse by pure chance about a quarter of the time. To the trader living through one, the strategy "stopped working." They ditch it. The strategy was fine; the sample was too small to distinguish a normal cold streak from a broken edge. Tversky and Kahneman's 1971 paper "Belief in the law of small numbers" showed that even statistically trained researchers make this mistake systematically.

Scaling strategies that don't. The mirror image. A coin-flip strategy will produce stretches of 70% wins over 20 trades by chance about 5% of the time. To the trader running it, that's "the strategy works, time to size up." Three weeks later the wins regress to 50% and the increased size delivers larger losses than before. The strategy didn't change; the sample was always misleading.
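Streak probabilities like the two above can be checked directly from the binomial distribution. The sketch below assumes independent trades with a fixed true win rate (an idealisation), and reads "55% over 20 trades" as 11 wins or fewer and "70% over 20 trades" as 14 wins or more. The helper names are illustrative.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k wins in n independent trades with win rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_most(k: int, n: int, p: float) -> float:
    """Probability of k or fewer wins."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability of k or more wins."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


cold_streak = prob_at_most(11, 20, 0.65)   # a real 65% strategy showing 55% or worse
hot_streak = prob_at_least(14, 20, 0.50)   # a coin flip showing 70% or better
print(f"65% strategy at 11/20 or worse: {cold_streak:.1%}")
print(f"50% strategy at 14/20 or better: {hot_streak:.1%}")
```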

Comparing two strategies on too few trades. "Strategy A is at 75%, Strategy B is at 65%, A wins, switch to A." With 50 trades each, the confidence intervals overlap so heavily that the gap is statistically meaningless. You cannot tell from that data which strategy is better. You'd need at least 200 trades each, and even then the answer is "probably A, but it could still be B with bad luck on a fair sample."
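One way to put a number on "statistically meaningless" is a two-proportion z-test. This is a sketch under the assumption that the two strategies' trades are independent samples; the function name is illustrative, and the unpooled standard error is one of several reasonable choices.

```python
import math
from statistics import NormalDist


def two_proportion_z(p_a: float, n_a: int, p_b: float, n_b: int) -> tuple[float, float]:
    """Unpooled z statistic and two-sided p-value for the gap between two win rates."""
    se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = (p_a - p_b) / se_diff
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Strategy A at 75% versus Strategy B at 65%, with equal trade counts.
for n in (50, 200):
    z, p_value = two_proportion_z(0.75, n, 0.65, n)
    print(f"{n} trades each: z = {z:.2f}, two-sided p = {p_value:.3f}")
```

A z statistic below about 1.96 means the observed gap is within what chance alone would produce at the 95% level.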

The strategy-edit feedback loop. Every time a trader changes a parameter and re-runs the backtest, they're starting the sample-size clock over. A strategy that's been "improved" 6 times in the last year doesn't have 600 backtested trades — it has effectively 100, with the more recent edits sitting on dramatically smaller samples. The win rate displayed on the latest version is mostly noise.

The pattern across all of them: the human brain treats short sequences as informative when they aren't. Every time you've heard someone say "well, it's working so far" after fewer than 100 trades, the most accurate translation is "I have no evidence one way or the other yet."

A useful heuristic

The full math is the binomial confidence interval. The useful shortcut is the square-root rule:

To halve your uncertainty about the true win rate, multiply your sample size by 4.

That's because the standard error scales with 1/√n. Going from 50 trades to 100 shrinks the error bar by a factor of √2, to about 0.71x its original width. Going from 50 to 200 shrinks it by √4 = 2, to 0.5x. Going from 50 to 800 shrinks it by √16 = 4, to 0.25x. Each cut buys you tighter resolution on the true rate, but the cost in additional trades grows quadratically.

The practical implication: if your current win-rate measurement has, say, ±10% error and you'd like to know the rate to within ±2.5%, you need roughly 16 times as many trades. If you've taken 50 so far, you'd need 800. There's no shortcut.
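The same relationship can be inverted to ask how many trades a target precision costs. A minimal sketch, assuming the normal approximation and an observed 80% win rate; `trades_needed` is an illustrative name.

```python
import math


def trades_needed(p_hat: float, half_width: float, z: float = 1.96) -> int:
    """Smallest n whose normal-approximation 95% half-width is at most half_width."""
    return math.ceil(z**2 * p_hat * (1 - p_hat) / half_width**2)


# Each halving of the target error bar roughly quadruples the required trades.
for target in (0.10, 0.05, 0.025):
    print(f"±{target:.1%}: {trades_needed(0.80, target)} trades")
```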

The other useful number is the Wilson score interval, which is the better choice for very small samples or extreme proportions, where the normal approximation breaks down. For a trader with 16 wins out of 20 trades, the simple ±18% formula is technically optimistic about the downside: the Wilson interval puts the true rate between roughly 58% and 92%, and the exact Clopper-Pearson interval is wider still at roughly 56% to 94%. The shape of the result is the same (a wide error bar), but the math is more honest about it.
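For reference, the Wilson score interval is short enough to compute by hand. This sketch follows the standard formula from Wilson (1927); the function name is illustrative, and the example uses the 16-and-4 record from the top of the post.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p_hat = wins / n
    denom = 1 + z**2 / n
    # The centre is pulled from the observed rate toward 50%.
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


lo, hi = wilson_interval(16, 20)
print(f"16 wins in 20 trades: {lo:.1%} to {hi:.1%}")
```

Unlike the simple formula, the interval is not symmetric around the observed rate and can never run past 0% or 100%.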

How big a sample you actually need

The right question isn't "how many trades?" in the abstract — it's "how precisely do I need to measure?"

For deciding whether a strategy works at all (true rate above 50%, or above some break-even rate), 100 trades is usually sufficient. The interval at 100 is tight enough to rule out coin-flip outcomes if the strategy has a real edge.

For comparing two strategies and picking the better one, 200-300 trades each, with the proviso that small differences in true rate (e.g. 75% vs 70%) require even more. Compare strategies with very different rates first; the small-difference comparisons are the ones that demand patience.

For choosing whether to scale risk based on the win rate, 500 trades is a reasonable floor. The error bar at 500 is about ±3.5% for an 80% rate, which means a real strategy at 80% looks like it could be anywhere from 76% to 84% on the data — narrow enough that scaling decisions can rest on the headline number without major risk of mis-sizing.

For statistical significance against a benchmark (e.g. "is this strategy reliably better than coin-flipping with my fee structure?"), the required sample depends on the size of the effect you're testing for. Detecting a small edge (say 55% vs 50%) requires hundreds to low thousands of trades. Detecting a large edge (75% vs 50%) needs only tens.
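To make "depends on the size of the effect" concrete, here is a rough power calculation for a one-sample test against a benchmark rate. The 5% significance level and 80% power are assumptions chosen for illustration, not figures from this post, and `trades_to_detect` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def trades_to_detect(p_true: float, p_null: float = 0.50,
                     alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate trades needed for a two-sided z-test to detect p_true vs p_null."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(p_null * (1 - p_null))
                 + z_beta * math.sqrt(p_true * (1 - p_true)))
    return math.ceil((numerator / (p_true - p_null)) ** 2)


for p_true in (0.55, 0.60, 0.75):
    print(f"true rate {p_true:.0%} vs 50% benchmark: {trades_to_detect(p_true)} trades")
```

Replace `p_null` with your own break-even rate after fees to test against that instead of a pure coin flip.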

For a typical retail crypto trader taking 1-3 trades per day, hitting 500 trades is roughly a year of consistent trading. There is no version of "take 20 trades and decide" that the math supports.

Calculate yours

Tools that go with this

Two tools cover the practical version of this exact problem:

Win-rate confidence calculator

Enter what you've measured. See what you actually know.

The interval shown is the Wilson score interval, which is more accurate than the simple normal approximation when sample sizes are small or proportions are extreme. The bar shows the range that contains the true win rate with 95% confidence. Change the inputs and watch the band tighten as you add trades. The pattern is exactly the funnel from the chart above, just rendered as numbers you can plug your own data into.

How professional traders handle this

Three things that distinguish people who get this right from people who don't.

Don't change the strategy until the sample says you can. A working strategy will have multi-week stretches that look broken. A real strategy edit needs another full sample to evaluate. Most retail traders edit weekly or daily, which means their actual evaluation sample is always 10-20 trades. They never accumulate the data needed to know if anything they're doing works.

Backtest on 1000+ samples or accept the data won't tell you much. If your strategy only generates 50 signals on a year of historical data, the backtest result is noise. Either widen the parameters to generate more signals, lengthen the backtest period, or accept that you're operating on intuition with statistical decoration.

Compare the right way. Don't pit Strategy A's performance over the last 100 trades against Strategy B's performance over the next 100. The market regime probably shifted between them. Run both strategies in parallel on the same trades, or use a paired comparison where each trade is evaluated under both strategy rule sets at the same time. This eliminates regime confounding and dramatically reduces the sample size you need.

For more on the variance side of small samples — what a "bad month" looks like statistically even at high win rates — the why-traders-lose-money post covers the broader picture, and the chart-confirmation-bias post covers why the human brain insists on declaring patterns from too little evidence. The same sample-size math is also the gating constraint on tag-level analytics — see the trade-tagging post for why ~30 trades per tag bucket is the floor before any per-setup or per-session number means anything. The 30-trade-per-cohort floor also gates the post-trade FOMO audit in the chart-before-setup post — the comparison between the chart-first cohort and the clean cohort needs both buckets to be sample-sized before the gap means anything.

Sources
  • Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76(2), 105-110.
  • Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158), 209-212.
  • Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science, 16(2), 101-117. (Comparison of confidence interval methods.)
  • Taleb, N. N. (2004). Fooled by Randomness. Random House. (Popular treatment of small-sample inference in finance.)

What's the minimum sample to know if a strategy works at all?

Roughly 100 trades for a strategy that's clearly above the break-even rate. The 95% confidence interval at 100 trades is tight enough that a real 65%+ strategy will be statistically distinguishable from a 50% coin-flip outcome. Strategies whose true edge is closer to break-even need much more — sometimes thousands of trades — to confirm.

Why does the math break down at small samples?

The standard normal-approximation formula assumes the proportion isn't too close to 0 or 1 and that n is large enough for the binomial distribution to look roughly normal. At small n or extreme proportions (like 18/20), neither holds — you have to use the Wilson score interval or the exact Clopper-Pearson method. The simple formula gives an interval that's too narrow and too symmetric.

How do I compare two strategies properly?

The cleanest method is paired evaluation: run both strategies on the same period, ideally the same signals, and compute the difference in outcomes per matched trade. Because both strategies face the same market conditions, their outcomes tend to be positively correlated, so the standard error of the paired difference is smaller than that of a difference between two independent samples, and you need fewer trades to reach significance. About 100-200 paired trades is usually enough to distinguish strategies whose true rates differ by 5+ points.
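A minimal sketch of the paired calculation, assuming each matched trade is scored as a win or loss under both rule sets. The function name and the example outcomes are hypothetical, chosen only to show the mechanics.

```python
import math


def paired_difference(a_wins: list[bool], b_wins: list[bool]) -> tuple[float, float]:
    """Difference in win rate (A minus B) on matched trades, with its standard error."""
    if len(a_wins) != len(b_wins):
        raise ValueError("paired evaluation needs one outcome per strategy per trade")
    n = len(a_wins)
    only_a = sum(a and not b for a, b in zip(a_wins, b_wins))  # A won, B lost
    only_b = sum(b and not a for a, b in zip(a_wins, b_wins))  # B won, A lost
    diff = (only_a - only_b) / n
    # Per-trade difference is +1, 0 or -1; its variance is E[d^2] - (E[d])^2.
    se = math.sqrt((only_a + only_b) / n - diff**2) / math.sqrt(n)
    return diff, se


# Hypothetical matched outcomes: 60 both win, 15 only A wins, 5 only B wins, 20 both lose.
a = [True] * 60 + [True] * 15 + [False] * 5 + [False] * 20
b = [True] * 60 + [False] * 15 + [True] * 5 + [False] * 20
diff, se = paired_difference(a, b)
print(f"A - B = {diff:.1%}, paired SE = {se:.1%}")
```

Only the trades where the two strategies disagree contribute to the standard error, which is why agreement on most trades makes the comparison sharper.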

Does a backtest count as samples or do live trades count differently?

Mathematically, samples are samples. Practically, backtest samples have systematic biases (overfitting, look-ahead, optimistic execution) that live samples don't. A reasonable rule of thumb is to discount backtest sample size by 50-70% when projecting forward — a 1000-trade backtest is roughly equivalent to 300-500 live trades for the purpose of estimating true win rate.

What's a good win rate to aim for given sample-size constraints?

Higher win rates are easier to confirm with smaller samples because the gap to coin-flip is wider. A 75% strategy is "obviously above break-even" in 50 trades. A 55% strategy needs closer to 400 before its 95% interval clears a 50% break-even. If you're trying to validate a strategy quickly, optimise for higher win rate (which usually means smaller R:R), accepting that the strategy might end up structurally vulnerable to costs as covered in the cost-floor post.

Should I use Bayesian methods instead?

Yes, if you have a useful prior. A Bayesian approach updates your prior belief about the strategy's win rate using the trade outcomes, which produces narrower intervals when the prior is informative. The catch: a wrong prior produces wrong intervals. For most retail traders without strong priors, the frequentist Wilson interval is the safer default — it makes no assumptions you can't justify.
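As a sketch of what the Bayesian version looks like, the snippet below uses a Beta prior on the win rate and estimates an equal-tailed credible interval by sampling from the posterior. The uniform Beta(1, 1) default and the example informative prior are assumptions for illustration only, and the function name is hypothetical.

```python
import random


def beta_credible_interval(wins: int, losses: int, prior_a: float = 1.0,
                           prior_b: float = 1.0, level: float = 0.95,
                           draws: int = 100_000, seed: int = 0) -> tuple[float, float]:
    """Monte Carlo equal-tailed credible interval for the win rate under a Beta prior."""
    rng = random.Random(seed)
    # Beta prior plus binomial data gives a Beta(prior_a + wins, prior_b + losses) posterior.
    samples = sorted(rng.betavariate(prior_a + wins, prior_b + losses)
                     for _ in range(draws))
    tail = (1 - level) / 2
    return samples[int(tail * draws)], samples[int((1 - tail) * draws) - 1]


flat = beta_credible_interval(16, 4)                            # uniform prior
informed = beta_credible_interval(16, 4, prior_a=30, prior_b=20)  # hypothetical prior near 60%
print(f"uniform prior:     {flat[0]:.1%} to {flat[1]:.1%}")
print(f"informative prior: {informed[0]:.1%} to {informed[1]:.1%}")
```

The informative prior narrows the interval, but it also drags the estimate toward the prior's own centre, which is exactly the risk when the prior is wrong.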

Does sample size matter for strategies that hold positions for weeks?

It matters even more, because each "sample" takes longer to accumulate. A swing strategy that takes one trade a week needs about 10 years to hit 500 samples. The math doesn't care — you still need the trades — but practically it means swing traders are operating on smaller samples than scalpers, even though their conviction is often higher. Be more humble about win-rate claims at lower frequencies, not less.
