How well do modern LLMs predict the future? A recent evaluation tested them on ~300 Kalshi prediction markets. Claude Opus 4.5 performed best. Its Brier score (the mean squared error of its predicted probabilities; lower is better) of ~0.23 still trails human superforecasters (0.15-0.2) but is approaching that range.
The markets covered Oct-Nov 2025. Gemini 3 Pro wasn't included in the comparison, and GPT 5.2 XHigh disappointed. Source:
(ForecastBench is a similar effort, but it's stale and doesn't include the newer models.)
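
For a feel of what those Brier numbers mean, here is a minimal sketch of the score for binary markets. It assumes each market resolves to 1 (yes) or 0 (no) and each forecast is a single probability of "yes". The function name and the sample numbers are made up for illustration and are not taken from the evaluation.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    Hypothetical helper for illustration only; lower is better.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Made-up example: three markets, the third one is a confident miss.
example = brier_score([0.9, 0.2, 0.7], [1, 0, 0])

# Baseline: always answering 50% scores exactly 0.25 on any binary market.
coin_flip = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 0, 1])

print(round(example, 4), coin_flip)
```

Under this definition, always saying 50% scores 0.25, so ~0.23 is only modestly better than not forecasting at all, while 0.15-0.2 is a meaningful gap.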