Compiling academic and media forecasters’ 2012 American Presidential election predictions and statistically judging correctness; Nate Silver was not the best.
I statistically analyzed in R hundreds of predictions compiled for ~10 forecasters of the 2012 American Presidential election, ranking them by Brier, RMSE, & log scores.
The best overall performance seems to be by Drew Linzer and Wang & Ferguson, while Nate Silver appears somewhat overrated and the famous Intrade prediction market turned in a disappointing overall performance.
In November 2012, I was hired by CFAR to compile an extensive dataset of pundits, modelers, hobbyists, and academics who had attempted to statistically forecast the 2012 American presidential race and other minor races; the results were interesting in that they contradicted the lionization of Nate Silver’s forecasts in The New York Times. This page is a full listing of the R source code I used to produce my analysis for the CFAR essay; notes on the derivation of each dataset are stored at 2012-gwern-notes.txt.
This election-prediction judging is divided into several sections, dealing with different categories of predictions:
the overall Presidential race predictions: probability of Obama victory, final electoral vote count, and percentage of popular vote
the Presidential state-by-state predictions: the percentage Obama will take (vote share/margin/edge), as well as the probability he will win that state at all
the Senate state-by-state predictions: similar, but normalized for the Democratic candidate
Few forecasters made predictions in all categories, and the ones who did make predictions did not always make their full predictions public, etc. Note that all percentages are normalized in terms of the vote going to Obama, Democrats, or in some cases, Independents/Greens. The “Reality” ‘forecaster’ is the ground truth; these were all updated 23 November in what is hopefully a final update.
The point of these calculations is to extract Brier scores (for categorical predictions like the probability of an Obama victory) and RMSE sums (for continuous/quantitative predictions like vote share). Intrade prices were interpreted as straightforward probabilities, without any correction for Intrade’s long-shot bias.1
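For concreteness, the Brier score is just the mean squared difference between the forecast probabilities and the 0/1 outcomes. A minimal sketch of the computation (in Python rather than the R used below, with made-up forecasts):

```python
def brier(outcomes, forecasts):
    """Mean squared error of probability forecasts against 0/1 outcomes:
    0 is perfect, 0.25 is an uninformative coin-flip guess, 1 is maximally wrong."""
    return sum((o - f) ** 2 for o, f in zip(outcomes, forecasts)) / len(outcomes)

print(brier([1], [0.9]))          # ~0.01: a confident correct forecast scores well
print(brier([1, 0], [0.5, 0.5]))  # 0.25: always guessing 50% scores 0.25 regardless of outcomes
```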
```r
presidential <- read.csv("https://gwern.net/doc/statistics/prediction/election/2012-presidential.csv",
                         row.names=1)
# Reality = 2012 result; 2008 = 2008 results
presidential
#                 probability electoral popular
# Reality              1.0000       332   50.79
# 2008                 1.0000       365   53.00
# Nate Silver          0.9090       313   50.80
# Drew Linzer          0.9900       332      NA
# Simon Jackman        0.9140       332   50.80
# DeSart               0.8862       303   51.37
# Margin of Error      0.6800       303   51.50
# Wang & Ferguson      1.0000       303   51.10
# Intrade              0.6580       291   50.75
# Josh Putnam              NA       332      NA
# Unskewed Polls           NA       263   48.88

# probability can be scored as a Brier score; available in the 'verification' library
install.packages("verification")
library(verification)
# handle lists & vectors for later
br <- function(obs, pred) brier(unlist(obs), unlist(pred), bins=FALSE)$bs # bins=FALSE avoids rounding
# convenience function
brp <- function(p) brier(presidential["Reality",]$probability,
                         presidential[p,]$probability, bins=FALSE)$bs
lapply(rownames(presidential)[1:9], brp)
```
Reality: 0
2008: 0
Wang: 0
Linzer: 0.0001
Jackman: 0.007396
Silver: 0.008281
DeSart: 0.01295044
Margin: 0.1024
Intrade: 0.116964
Random: 0.25 (50% guess is always 0.25)
```r
# To score electorals and populars, we use RMSE
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2, na.rm=TRUE))
rpe <- function(p) rmse(presidential["Reality",]$electoral, presidential[p,]$electoral)
lapply(rownames(presidential), rpe)
```
```r
statemargin <- read.csv("https://gwern.net/doc/statistics/prediction/election/2012-statemargin.csv",
                        row.names=1)
statemargin
#                       al       ak       az       ar       ca       co       ct
# Reality         38.42829 40.79253 44.44855 36.87899 59.69455 51.56536 58.38274
# 2008            38.80000 37.70000 45.00000 38.80000 60.90000 53.50000 60.50000
# Nate Silver     36.70000 38.60000 46.20000 38.60000 58.10000 50.80000 56.60000
# Drew Linzer     40.30000 37.50000 46.20000 37.10000 59.80000 51.20000 56.80000
# Margin of Error 37.00000 41.00000 49.00000 39.00000 61.00000 53.00000 59.00000
# Josh Putnam           NA       NA 46.59500       NA 58.39500 50.87500 55.92000
# Unskewed Polls  37.78000 36.40000 43.95000 44.68000 57.65000 49.48000 54.55000
# Intrade               NA       NA       NA       NA       NA       NA       NA
# Simon Jackman   38.70000       NA 46.10000 36.40000 58.60000 51.00000 56.80000
# DeSart          35.20000 32.20000 46.40000 38.70000 59.20000 50.10000 57.70000
# Wang & Ferguson 42.50000 39.00000 46.00000 38.00000 57.50000 51.00000 56.50000
# (output truncated: the remaining state columns through 'wy' are in the linked CSV)
```
What’s the equivalent of the Brier score for outcomes which aren’t yes/no binary? We need a more quantitative measure; a common choice is the RMSE (which punishes outliers). In this case, we’re looking at the difference between the predicted edge in votes and the actual edge, over all the states for which a predictor gave us numbers:
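The RMSE itself is simple to state; a minimal Python sketch of the same formula the R code below uses (`sqrt(mean((obs-pred)^2))`), with toy vote shares:

```python
import math

def rmse(obs, pred):
    """Root-mean-square error: squaring makes a single large miss
    cost more than several small ones."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

print(rmse([50, 52, 48], [51, 53, 47]))  # 1.0: off by 1 point in every state
print(rmse([50, 52, 48], [50, 52, 43]))  # ~2.89: one 5-point miss outweighs two exact calls
```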
The Senate win predictions (done only by Wang, Silver, & Intrade in this dataset):

```r
senatewin <- read.csv("https://gwern.net/doc/statistics/prediction/election/2012-senatewin.csv",
                      row.names=1)
senatewin
#                    az    ca    ct   de    fl   hi indiana    me   md    ma   mi
# Reality         0.000 1.000 1.000 1.00 1.000 1.00    1.00 1.000 1.00 1.000 1.00
# Nate Silver     0.040 1.000 0.960 1.00 1.000 1.00    0.70 0.930 1.00 0.940 1.00
# Intrade         0.225 0.998 0.888 0.99 0.859 0.96    0.85 0.957 0.96 0.786 0.95
# Wang & Ferguson 0.120 0.950 0.998 0.95 0.950 0.95    0.84 0.950 0.95 0.960 0.96
#                   mn   ms    mo    mt   ne   nv   nj   nm   ny    nd   oh   pa
# Reality         1.00 0.00 1.000 1.000 0.00 0.00 1.00 1.00 1.00 1.000 1.00 1.00
# Nate Silver     1.00 0.00 0.980 0.340 0.01 0.17 1.00 0.97 1.00 0.080 0.97 0.99
# Intrade         0.95 0.00 0.703 0.371 0.06 0.06 0.96 0.95 1.00 0.155 0.84 0.86
# Wang & Ferguson 0.95 0.05 0.960 0.690 0.05 0.27 0.95 0.95 0.95 0.750 0.95 0.95
#                   ri   tn    tx   ut   vt   va   wa    wv    wi   wy
# Reality         1.00 0.00 0.000 0.00 0.00 1.00 1.00 1.000 1.000 0.00
# Nate Silver     1.00 0.00 0.000 0.00 0.00 0.88 1.00 0.920 0.790 0.00
# Intrade         0.99 0.00 0.025 0.00 0.05 0.78 0.96 0.951 0.626 0.00
# Wang & Ferguson 0.95 0.05 0.050 0.05 0.05 0.96 0.95 0.950 0.720 0.05
```
To combine the state win predictions with the presidency win prediction and also the Senate race win predictions requires data on all 3, so still Wang vs Silver vs Intrade:
```r
senatemargin <- read.csv("https://gwern.net/doc/statistics/prediction/election/2012-senatemargin.csv",
                         row.names=1)
senatemargin
#               az   ca   ct   de   fl   hi   id   me   md   ma   mi   mn   ms
# Reality     45.8 61.6 55.2 66.4 55.2 62.6 49.9 52.9 55.3 53.7 54.7 65.3 40.3
# Nate Silver 46.6 59.6 52.6 66.5 53.2 56.6 50.0 53.0 60.8 51.7 56.0 63.7 32.3
#               mo   mt   ne   nv   nj   nm   ny   nd   oh   pa   ri   tn   tx
# Reality     54.7 48.7 41.8 44.7 58.5 51.0 71.9 50.5 50.3 53.6 64.8 30.4 40.5
# Nate Silver 52.2 48.4 45.6 47.5 56.1 53.4 67.5 47.2 51.9 52.9 59.1 35.5 41.5
#               ut   vt   va   wa   wv   wi   wy
# Reality     30.2 24.8 52.5 60.2 60.6 51.5 21.6
# Nate Silver 32.4 25.0 51.0 59.3 56.0 51.1 27.7
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2, na.rm=TRUE))
r <- function(x) rmse(senatemargin["Reality",], senatemargin[x,])
r("Reality"); r("Nate Silver") # no one else's predictions are available
```
Reality: 0
Nate Silver: 3.272197
Not bad at all.
Let’s combine the state margin with the electoral / popular to get an overall RMSE picture of the predictors:
```r
r <- function(p) rmse(unlist(c(statemargin["Reality",], presidential["Reality",]$electoral,
                               presidential["Reality",]$popular)),
                      unlist(c(statemargin[p,], presidential[p,]$electoral,
                               presidential[p,]$popular)))
lapply(rownames(statemargin), r)
```
Reality: 0
Josh Putnam: 2.002633
Simon Jackman: 2.206758
Drew Linzer: 2.503588
Nate Silver: 3.186463
DeSart: 4.635004
Margin of Error: 4.641332
Wang & Ferguson: 4.83369
2008: 5.525641
Unskewed Polls: 11.84946
(Which shows you how bad Unskewed Polls was: we could fit Putnam, Jackman, Linzer, and Silver’s errors into his and have room left over.)
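(A quick Python check of that arithmetic, using the combined-RMSE figures listed above:)

```python
# Combined RMSEs from the list above: Putnam + Jackman + Linzer + Silver vs Unskewed Polls
others = 2.002633 + 2.206758 + 2.503588 + 3.186463
unskewed = 11.84946
print(round(others, 6))   # 9.899442
print(others < unskewed)  # True: their errors sum to less, with ~1.95 points to spare
```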
We mentioned there were other proper scoring rules besides the Brier score; another binary-outcome rule, less used by political forecasters, is the “logarithmic scoring rule” (see Wikipedia or Eliezer Yudkowsky’s “Technical Explanation”); it has some deep connections to areas like information theory, data compression, and Bayesian inference, which makes it invaluable in some contexts. But because a log score ranges between 0 and negative infinity (bigger is better, smaller is worse) rather than 0 and 1 (smaller is better) and has some different behaviors, it’s a bit harder to understand than a Brier score.

(One way in which the log score differs from the Brier score is the treatment of 100%/0% predictions: the log score of a 100% prediction which is wrong is negative infinity, while in Brier it’d simply be 1 and one can recover; hence if you say 100% twice and are wrong once, your Brier score would recover to 0.5 but your log score will still be negative infinity! This is what happens with the “2008” benchmark.)

An example of the difference between the Brier and log scores:

```r
# Oops!
brier(0, 1, bins=FALSE)$bs
# [1] 1
# But we can recover by getting the second one right
brier(c(0,1), c(1,1), bins=FALSE)$bs
# [1] 0.5
# Oops!
logScore(1, 0)
# [1] -Inf
# Can we recover? ...we're screwed
logScore(c(1,1), c(0,1))
# [1] -Inf
```
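The `logScore` helper used in the R example isn’t defined in this excerpt; a minimal Python sketch of the binary logarithmic scoring rule it implements (the function name and calling convention here are assumptions):

```python
import math

def log_score(outcome, p):
    """Log scoring rule for one binary event: the log of the probability
    assigned to what actually happened; 0 is perfect, more negative is worse,
    and a wrong 100%/0% claim is -inf, which no later success can undo."""
    q = p if outcome == 1 else 1 - p
    return math.log(q) if q > 0 else float("-inf")

print(log_score(1, 0.99))  # ~ -0.01005: cf. Linzer's 0.9900 presidential forecast
print(log_score(1, 1.00))  # 0.0: the risky 100% call, maximally rewarded
print(log_score(1, 0.00))  # -inf: the unrecoverable case
```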
Log scores for the state-by-state win probabilities:

Reality: 0
Linzer: -0.9327548
Wang & Ferguson: -1.750359
Silver: -2.057887
Jackman: -2.254638
DeSart: -3.30201
Intrade: -5.719922
Margin of Error: -10.20808
2008: -Infinity
Log scores for the overall Presidential win probability:

Reality: 0
2008: 0
Wang & Ferguson: 0
Linzer: -0.01005034
Jackman: -0.08992471
Silver: -0.09541018
DeSart: -0.1208126
Margin of Error: -0.3856625
Intrade: -0.4185503
Note that the 2008 benchmark and Wang & Ferguson took a risk here by giving an outright 100% chance of victory, which the log score rewarded with a 0: if somehow Obama had lost, then the log score of any set of their predictions which included the presidential win probability would automatically be negative infinity, rendering them officially The Worst Predictors In The World. This is why one should allow for the unthinkable by including some fraction of a percent; of course, I’m sure Wang & Ferguson don’t mean 100% literally, but more like “it’s so close to 100% that we can’t be bothered to report the tiny remaining possibility”.
I have been told that once Intrade prices have been corrected for this long-shot bias, the new results are comparable to Silver & Wang. This doesn’t necessarily surprise me, but during the original analysis I did not look into doing the long-shot bias correction: hardly anyone does so in discussions of prediction markets; it would’ve been more work; and I’m not sure it’s really legitimate, since if Intrade is biased, then it’s biased. If someone produces extreme estimates which can be easily improved by regressing to some relevant mean, it doesn’t seem quite honest to present your corrected version instead as what they “really” meant.
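For illustration, one simple form such a correction could take (a hypothetical sketch, not the correction actually applied by whoever re-scored Intrade): stretch the market price on the log-odds scale, pushing it away from 50% by a factor k chosen by the corrector:

```python
import math

def extremize(price, k=2.0):
    """Hypothetical long-shot-bias correction: scale the log-odds of a
    market price by k > 1, pushing it away from 50%; k is a free parameter
    here, not something estimated from Intrade data."""
    if price <= 0.0 or price >= 1.0:
        return price
    log_odds = math.log(price / (1.0 - price))
    return 1.0 / (1.0 + math.exp(-k * log_odds))

print(extremize(0.658))  # Intrade's presidential price, pushed toward certainty
print(extremize(0.5))    # 0.5: a toss-up is left unchanged
```

Note the ambiguity this creates: after correction, it is the corrector’s choice of k, not the market, doing much of the forecasting work, which is exactly the honesty worry raised above.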
Summing together RMSEs from different metrics is statistically illegitimate & misleading, since the sum will reflect almost entirely the electoral-vote performance, which is on a scale much larger than the other metrics. I include it for curiosity only.
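To see the scale problem concretely, a toy Python illustration with made-up numbers: a forecaster who is off by 1 point in every state but 20 electoral votes overall has the combined RMSE dominated by the single electoral-vote miss:

```python
import math

def rmse(obs, pred):
    """Root-mean-square error over paired observations and predictions."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

# 50 states each missed by 1 percentage point, plus one electoral-vote total missed by 20:
state_only = rmse([50.0] * 50, [51.0] * 50)
combined   = rmse([50.0] * 50 + [332.0], [51.0] * 50 + [312.0])
print(state_only, combined)  # the lone electoral-vote error nearly triples the score
```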