Brier scores, for people who don't trade for a living

The National Weather Service uses the Brier score to verify its probability-of-precipitation forecasts. Many commercial forecast providers track it too, whether or not they publish the result. The score isn't fancy. It's the average squared distance between the probability you said and what actually happened, where "what actually happened" is 1 if it rained and 0 if it didn't.

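A forecaster who says 80% chance of rain every day for a year, on days where it rained 80% of the time, is perfectly calibrated in the technical sense, but does not score zero. On the rainy days the error is (0.80 − 1)² = 0.04, on the dry days it is (0.80 − 0)² = 0.64, and the weighted average is 0.8 × 0.04 + 0.2 × 0.64 = 0.16. A score of zero takes more than calibration: you have to say 100% or 0% every time and be right every time. A forecaster who says 50% chance of rain every day, regardless of what they actually believe, gets a score of 0.25, the benchmark usually described as the coin flipper's score. A forecaster who says 95% and is wrong gets penalized harder than one who said 60% and was wrong (0.9025 against 0.36), because the confidence was misplaced and the math knows.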

This isn't an esoteric statistical exercise. It is the basic feedback mechanism that separates a forecaster from a person with opinions, and it works on any prediction with a probability and a binary outcome. Will NVDA gap down on earnings? (Yes / no.) Will the next CPI print come in below 3.0%? (Yes / no.) Will gold close above 4,200 by end of Q3? (Yes / no.) Attach a probability to each of those, wait for the world to settle them, and you have the data you need.

Most retail traders never compute this number. The ones who do are usually surprised by what they find.

The math, in three lines

For a single prediction:

score = (p − outcome)²

Where p is the probability you assigned (between 0 and 1), and outcome is 1 if the predicted thing happened, 0 if it didn't.

For a set of predictions:

brier = average of all the (p − outcome)² values

Lower is better. Zero is perfect. 0.25 is what you get by saying 50% on everything, which is the "coin flipper" benchmark. Anything above 0.25 means you'd be doing better by ignoring your own forecasts and just guessing 50/50 on everything.

That's it. There is no other formula. The reason it works — the reason it's load-bearing despite looking trivial — is that it punishes overconfidence and rewards calibration in the same equation, without you having to notice it.
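
For anyone who would rather read code than notation, here is a minimal Python sketch of the same formula. The function name brier_score and the (p, outcome) pair layout are choices made for this example, not a standard.

```python
def brier_score(forecasts):
    """Mean squared gap between stated probability and the 0/1 outcome.

    `forecasts` is an iterable of (p, outcome) pairs, with p in [0, 1]
    and outcome equal to 1 if the event happened, 0 if it did not.
    """
    pairs = list(forecasts)
    if not pairs:
        raise ValueError("need at least one resolved forecast")
    for p, outcome in pairs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        if outcome not in (0, 1):
            raise ValueError(f"outcome must be 0 or 1, got {outcome}")
    return sum((p - outcome) ** 2 for p, outcome in pairs) / len(pairs)


# Saying 50% on everything scores 0.25 whatever happens.
print(brier_score([(0.5, 1), (0.5, 0), (0.5, 1), (0.5, 0)]))  # 0.25
```

The range checks are there because a conviction typed as 70 instead of 0.70 would otherwise produce a huge score with no warning.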

A worked example, ten predictions long

Suppose you logged ten forecasts over a quarter, with conviction expressed as a probability. The score for each is (conviction − outcome)², where outcome is 1 for a hit and 0 for a miss. Scores are rounded to two decimals:

NVDA gaps down on print — conviction 70%, hit. Score: (0.70 − 1)² = 0.09.
Brent holds 70 through end of June — conviction 60%, hit. Score: 0.16.
Tesla loses trillion-dollar cap before next earnings — conviction 80%, miss. Score: 0.64.
Fed cuts in September — conviction 55%, hit. Score: 0.20.
XLE outperforms SPX in Q2 — conviction 75%, hit. Score: 0.06.
EUR/USD above 1.10 by month end — conviction 60%, miss. Score: 0.36.
Gold breaks 4,200 by end of Q3 — conviction 85%, miss. Score: 0.72.
The next CPI print below consensus — conviction 50%, hit. Score: 0.25.
AMD beats revenue, gaps up — conviction 65%, hit. Score: 0.12.
DXY weaker on the month — conviction 70%, miss. Score: 0.49.

Average: 0.31.

Above 0.25. The trader represented by these numbers would have been better off ignoring their own forecasts and flipping a coin. Not because they got more wrong than right: they actually got 6 out of 10 correct. The problem is that three of the four misses (3, 7, and 10) carried conviction of 70% or more, while the hits averaged lower conviction than the misses (62.5% against 73.75%). The math reads that as a confidence signal pointing the wrong way.

This is the kind of pattern that's invisible without the score. Eyeballed, the trader sees six wins and four losses and feels reasonably good. The Brier score sees a person who is consistently most confident on the predictions that don't pan out.
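
The arithmetic is easy to check by machine. The sketch below re-enters the ten forecasts as (label, conviction, outcome) tuples and recomputes the average, along with the mean conviction on hits and on misses. The tuple layout and variable names are choices made for this example.

```python
# (label, conviction, outcome) for the ten forecasts listed above.
forecasts = [
    ("NVDA gaps down on print", 0.70, 1),
    ("Brent holds 70 through end of June", 0.60, 1),
    ("Tesla loses trillion-dollar cap before next earnings", 0.80, 0),
    ("Fed cuts in September", 0.55, 1),
    ("XLE outperforms SPX in Q2", 0.75, 1),
    ("EUR/USD above 1.10 by month end", 0.60, 0),
    ("Gold breaks 4,200 by end of Q3", 0.85, 0),
    ("The next CPI print below consensus", 0.50, 1),
    ("AMD beats revenue, gaps up", 0.65, 1),
    ("DXY weaker on the month", 0.70, 0),
]

# Per-forecast squared error, then the plain average.
scores = [(p - outcome) ** 2 for _, p, outcome in forecasts]
brier = sum(scores) / len(scores)

# Convictions split by how the forecast resolved.
hits = [p for _, p, outcome in forecasts if outcome == 1]
misses = [p for _, p, outcome in forecasts if outcome == 0]

print(f"Brier score:               {brier:.2f}")
print(f"Hit rate:                  {len(hits) / len(forecasts):.0%}")
print(f"Mean conviction on hits:   {sum(hits) / len(hits):.2%}")
print(f"Mean conviction on misses: {sum(misses) / len(misses):.2%}")
```

Run as written, this reports a Brier score of 0.31 on a 60% hit rate, with mean conviction of 62.50% on the hits and 73.75% on the misses. The last two lines are the pattern in miniature: more confidence on the calls that failed.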

What the score surfaces that hit-rate hides

A bare hit-rate (six out of ten) tells you nothing about whether your conviction signal is informative. A Brier score tells you whether the conviction tracks the outcome.

You can be a 60%-hit-rate trader and have a worse Brier score than a 50%-hit-rate trader, if the 60%-er is wildly overconfident and the 50%-er is humble. You can also be a 40%-hit-rate trader on certain types of forecasts (long-shot macro calls, say) and still beat the coin flipper, if your high-conviction calls are the ones that pay off. The score doesn't reward you for being right often; it rewards you for being right when you said you were sure, and for not claiming sureness when you weren't.
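
To make that concrete, here is a small sketch with three made-up traders, ten calls each. The track records are hypothetical and chosen only to illustrate the point, and "hit" here means the event they assigned a probability to actually happened.

```python
def brier_score(pairs):
    """Average of (p - outcome)^2 over (p, outcome) pairs."""
    return sum((p - outcome) ** 2 for p, outcome in pairs) / len(pairs)


# Hypothetical trader A: 6 hits in 10, but says 95% on every call.
overconfident = [(0.95, 1)] * 6 + [(0.95, 0)] * 4

# Hypothetical trader B: 5 hits in 10, and says 50% on every call.
humble = [(0.50, 1)] * 5 + [(0.50, 0)] * 5

# Hypothetical trader C: only 4 hits in 10, but said 80% on the four
# that landed and 30% on the six that did not.
long_shots = [(0.80, 1)] * 4 + [(0.30, 0)] * 6

print(f"A (60% hit rate): {brier_score(overconfident):.4f}")  # 0.3625
print(f"B (50% hit rate): {brier_score(humble):.4f}")         # 0.2500
print(f"C (40% hit rate): {brier_score(long_shots):.4f}")     # 0.0700
```

Trader A has the best hit rate and the worst score. Trader C has the worst hit rate and the best score, because the conviction moved with the outcome.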

The decomposition is more useful than the raw number. Group your predictions by the probability you assigned and look at the realised hit-rate for each bucket:

Of all the calls where you said 80%, what fraction actually happened?
Of all the calls where you said 50%, what fraction actually happened?

If you said 80% on 20 things and 14 happened, you're overconfident at the high end: your 80% calls are landing at 70% (you'd want 16 to land for perfect calibration). If you said 50% on 20 things and 5 happened, the calls you treat as coin flips are landing at 25%. Your "I genuinely don't know but I lean this way" lean is wrong more often than chance.

Those numbers are the actionable part. Tetlock's Superforecasters spend almost no time looking at their topline Brier score. They spend their time on the decomposition, because the decomposition tells them where to update.
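
The bucketing itself takes a few lines. The sketch below is one way to do it, assuming convictions are logged in round steps so that grouping by the exact stated value makes sense. The function name calibration_table is made up for this example, and this is the simple bucket table described above, not a formal statistical decomposition of the score.

```python
from collections import defaultdict


def calibration_table(forecasts):
    """Group (p, outcome) pairs by stated probability.

    Returns rows of (stated_p, n_calls, realised_rate), sorted by stated_p.
    Bucketing on the exact stated value is an assumption that fits logs
    kept in round steps (50%, 60%, ...); finer logs would need wider bins.
    """
    buckets = defaultdict(list)
    for p, outcome in forecasts:
        buckets[round(p, 2)].append(outcome)
    return [
        (p, len(outcomes), sum(outcomes) / len(outcomes))
        for p, outcomes in sorted(buckets.items())
    ]


# The two buckets from the text: 14 of 20 at 80%, 5 of 20 at 50%.
log = (
    [(0.80, 1)] * 14 + [(0.80, 0)] * 6
    + [(0.50, 1)] * 5 + [(0.50, 0)] * 15
)

for stated, n, realised in calibration_table(log):
    print(f"said {stated:.0%} on {n} calls, {realised:.0%} happened")
```

On this log it prints one line per bucket: the 50% calls landed 25% of the time and the 80% calls landed 70% of the time. The gap between the stated and realised column is the thing to act on.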

Why retail almost never has this number

Two reasons.

First, the data isn't usually available. To compute a Brier score you need the probability you assigned, written down at the time, before the outcome was known. Most retail predictions are stated as directional opinions ("I'm bullish") rather than probabilities, and they're stated in places (DMs, voice notes, charts) that don't preserve the prediction-and-confidence pair as a structured artifact.

Second, the data, when reconstructed from memory, is contaminated. If you sit down today and try to write out your last twenty predictions with the conviction you "had" at the time, the brain will helpfully sandbag the misses ("I wasn't really that sure about that one") and inflate the hits. The reconstructed dataset says you're well calibrated. The contemporaneous one would have said something different.

Both problems have the same fix: log the prediction and the conviction at the moment of formation, and don't edit either field afterward. Then run the math.

The version that earns the discipline

You don't need software for this. You need a habit of writing down a probability with every directional view you form, and a willingness to sit with the score the math returns.

What you'll find — almost everyone does, the first time they actually run the numbers honestly — is that you're overconfident at the high end. Your 90%-confidence calls land at maybe 70%. Your 60%-confidence calls land somewhere around 60%, which is fine. The 90%s are where the damage lives, because they're the ones you size around.

That fact, surfaced cleanly, is worth more than any individual call. It tells you to discount your own conviction signal by a known factor when it's high. That's the kind of knowledge a year of P&L can't deliver, no matter how much you stare at it.

A hunch journal is the data structure for this. The fields it captures — the read, the conviction, the horizon, the outcome — are exactly what the math needs. The score is what falls out at the end of the quarter.
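
If you do want it in code, a journal entry can be sketched as a small immutable record. Everything named here (Hunch, resolve, quarterly_brier, the field types, the dates) is an assumption made for illustration, not a reference to any particular app. The record is frozen so that the read and the conviction cannot be quietly edited after the fact.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Hunch:
    """One journal entry. Field names are illustrative, not a standard."""
    read: str                      # the claim, phrased so it resolves yes/no
    conviction: float              # probability assigned at the time, 0 to 1
    horizon: date                  # date by which the claim must resolve
    outcome: Optional[int] = None  # 1 = happened, 0 = did not, None = open


def resolve(entry, happened):
    """Return a copy with the outcome set; read and conviction stay as logged."""
    if entry.outcome is not None:
        raise ValueError("entry already resolved")
    return replace(entry, outcome=1 if happened else 0)


def quarterly_brier(journal):
    """Brier score over resolved entries only; open entries are skipped."""
    resolved = [h for h in journal if h.outcome is not None]
    if not resolved:
        raise ValueError("no resolved entries yet")
    return sum((h.conviction - h.outcome) ** 2 for h in resolved) / len(resolved)


# Two entries from the worked example; the dates are placeholders.
journal = [
    Hunch("Fed cuts in September", 0.55, date(2025, 9, 30)),
    Hunch("Gold breaks 4,200 by end of Q3", 0.85, date(2025, 9, 30)),
]
journal[0] = resolve(journal[0], happened=True)
journal[1] = resolve(journal[1], happened=False)
print(f"{quarterly_brier(journal):.4f}")  # 0.4625
```

Trying to assign to journal[0].conviction raises an error, which is the point: the only thing that changes after logging is the outcome.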

Whether you compute it in Excel or have an app do it for you is a detail. Whether you compute it at all is the question.