In the aftermath of the 2012 election, campaign prognosticators Nate Silver, Simon Jackman, Drew Linzer, and Sam Wang make preliminary quantitative assessments of how well their final predictions played out. Others have posted comparisons of these and other election prediction and poll aggregation outfits. Hopefully, we'll one day compare and combine the models based on their long term predictive power. To compare and combine models effectively, we need a good quantitative measure of their accuracy. The prognosticators have used something called the Brier score to measure the accuracy of their election eve predictions of state-level outcomes. Despite its historical success in measuring forecast accuracy, the Brier score fails in at least two ways as a forecast score. I'll review its inadequacies and suggest a better method.
The Brier score measures the accuracy of binary probabilistic predictions. To calculate it, take the average, squared difference between the forecast probability of a given outcome (e.g., Obama winning the popular vote in California) and the observed probability that the event occurred (.e.g, one if the Obama won, zero if he didn't win). The higher the Brier score, the worse the predictive accuracy. As Nils Barth suggested to Sam Wang
, you can also calculate a normalized Brier score by subtracting four times the Brier score from one. A normalized Brier score compares the predictive accuracy of a model to the predictive accuracy of a model that perfectly predicted the outcomes. The higher the normalized Brier score, the greater the predictive accuracy.
Because the Brier score (and its normalized cousin) measure predictive accuracy, I've suggested that we can use them to construct certainty weights for prediction models
, which we could then use when calculating an average model that combines the separate models into a meta-prediction. Recently, I've discovered research in the weather forecasting community about a better way to score forecast accuracy. This new score ties directly to a well-studied model averaging mechanism. Before describing the new scoring method, let's describe the problems with the Brier score.
) notes that the Brier score doesn't deal adequately with very improbable or probable events. For example, suppose that the probability that a Black Democrat wins Texas is 1 in 1000. Suppose we have one forecast model that predicts Obama will surely lose in Texas, whereas another model predicts that Obama's probability of winning is 1 in 400. Well, Obama lost Texas. The Brier score would tell us to prefer the model that predicted a sure loss for Obama. Yet the model that gave him a small probability of winning is closer to the "truth" in the sense that it estimates he has a small probably of winning. In addition to its poor performance scoring highly improbable and probable events, the Brier score doesn't perform well when scoring very poor forecasts (Benedetti 2010
; sorry for the pay wall).
These issues with the Brier score should give prognosticators pause for two reasons. First, they suggest that the Brier score will not perform well in the "safe" states of a given party. Second, they suggest that Brier scores will not perform well for models whose predictions were poor (here's lookin' at you, Bickers and Berry). So what should we do instead? It's all about the likelihood. Well, actually its logarithm.
Both Jewson and Benedetti convincingly argue that the proper score of forecast accuracy is something called the log likelihood. A likelihood is the probability of a set of observations given the model of reality that we assume produced those observations. As Jewson points out, the likelihood in our case is the probability of a set of observations (i.e., which states Obama won) given the forecasts associated with those observations (i.e., the forecast probability that Obama would win those states). A score based on the log likelihood penalizes measures that are very certain one way or the other, giving the lowest scores to models that are perfectly certain of the outcome.
To compare the accuracy of two models, simply take the difference in their log likelihood. To calculate model weights, first subtract the likelihood score of each model from the minimum likelihood score across all the models. Then exponentiate the difference you just calculated. Then divide the exponentiated difference of each model by the sum of those values across all the models. Voila. A model averaging weight.
Some problems remain. For starters, we haven't factored Occam's razor into our scoring of models. Occam's razor, of course, is the idea that simpler models are better than complex models all else equal. Some of you might notice that the model weight calculation in the previous paragraph is identical to the model weight calculation method based on the information criterion scores of models that have the same number of variables. I argue that we can ignore Occam's razors for our purposes. What we're doing is measuring a model's predictive accuracy, not its fit to previous observations. I leave it up to the first order election prognosticators to decide which parameters they include in their model. In making meta election forecasts, I'll let the models' actual predictive performance decide which ones should get more weight.
A funny short story about the triumph and perils of endless recursions in meta-analysis. NOT a critique of meta-analysis itself.
Once upon a time, there was a land called the United States of America, which was ruled by a shapeshifter whose physiognomy and political party affiliation was recast every four years by an electoral vote, itself a reflection of the vote of the people. For centuries, the outcome of the election had been foretold by a cadre of magicians and wizards collectively known as the Pundets. Gazing into their crystal balls at the size of crowds at political rallies, they charted the course of the shapeshifting campaign. They were often wrong, but people listened to them anyway.
Then, from the labyrinthine caves beneath the Marginuvera Mountains emerged a troglodyte race known as the Pulstirs. Pasty of skin and snarfy in laughter, they challenged the hegemony of the Pundet elite by crafting their predictions from the collective utterances of the populace. Trouble soon followed. Some of the powerful new Pulstir craftsmen forged alliances with one party or another. And as more and more Pulstirs emerged from Marginuvera, they conducted more and more puls.
The greatest trouble came, unsurprisingly, from the old Pundet guard in their ill-fated attempts to merge their decrees with Pulstir findings. Unable to cope with the number of puls, unwilling to so much as state an individual pul's marginuvera, the Pundet's predictions confused the people more than it informed them.
Then, one day, unbeknownst to one another, rangers emerged from the Forests of Metta Analisis. Long had each of them observed the Pundets and Pulstirs from afar. Long had they anguished over the amount of time the Pundets spent bullshyting about what the ruler of America would look like after election day rather than discussing in earnest the policies that the shapeshifter would adopt. Long had the rangers shaken their fists at the sky every time Pundets with differing loyalties supported their misbegotten claims with a smattering of gooseberry-picked puls. Long had the rangers tasted vomit at the back of their throats whenever the Pundets at Sea-en-en jabbered about it being a close race when one possible shapeshifting outcome had been on average trailing the other by several points in the last several fortnights of puls.
Each ranger retreated to a secluded cave, where they used the newfangled signal torches of the Intyrnet to broadcast their shrewd aggregation of the Pulstir's predictions. There, they persisted on a diet of espresso, Power Bars, and drops of Mountain Dew. Few hours they slept. In making their predictions, some relied only on the collective information of the puls. Others looked as well to fundamental trends of prosperity in each of America's states.
Pundets on all (by that, we mean both) sides questioned the rangers' methods, scoffed at the certainty with which the best of them predicted that the next ruler of America would look kind of like a skinny Nelson Mandela, and would support similar policies to the ones he supported back when he had a bigger chin and lighter skin, was lame of leg, and harbored great fondness for elegantly masculine cigarette holders.
On election day, it was the rangers who triumphed, and who collectively became known as the Quants, a moniker that was earlier bestowed upon another group of now disgraced, but equally pasty rangers who may have helped usher in the Great Recession of the early Second Millennium. The trouble is that the number of Quants had increased due to the popularity and controversy surrounding their predictions. While most of the rangers correctly predicted the physiognomy of the president, they had differing levels of uncertainty in the outcome, and their predictions fluctuated to different degrees over the course of the lengthy campaign.
Soon after the election, friends of the Quants, who had also trained in the Forests of Metta Analisis, made a bold suggestion. They argued that, just as the Quants had aggregated the puls to form better predictions about the outcome of the election, we could aggregate the aggregates to make our predictions yet more accurate.
Four years later, the Meta-Quants broadcast their predictions alongside those of the original Quants. Sure enough, the Meta-Quants predicted the outcome with greater accuracy and precision than the original Qaunts.
Soon after the election, friends of the Meta-Quants, who had also trained in the Forests of Metta Analsis, made a bold suggestion. They argued that, just as the Meta-Quants had aggregated the Quants to form better predictions about the outcome of the election, we could aggregate the aggregates of the aggregates to make even better predictions.
Four years later, the Meta-Meta-Quants broadcast their predictions alongside those of the Quants and the Meta-Quants. Sure enough, the Meta-Meta-Quants predicted the outcome with somewhat better accuracy and precision than the Meta-Quants, but not as much better as the Meta-Quants had over the Quants. Nobody really paid attention to that part of it.
Which is why, soon after the election, friends of the Meta-Meta-Quants, who had also trained in the Forests of Metta Analisis, made a bold suggestion. They argued that, just as the Meta-Meta-Quants had aggregated the Meta-Quants to form better predictions about the outcome of the election, we could aggregate the aggregates of the aggregates of the aggregates to make even better predictions.
One thousand years later, the (Meta x 253)-Quants broadcast their predictions alongside those of all the other types of Quants. By this time, 99.9999999% of Intyrnet communication was devoted to the prediction of the next election, and the rest was devoted to the prediction of the election after that. A Dyson Sphere was constructed around the sun to power the syrvers necessary to compute and communicate the prediction models of the (Meta x 253)-Quants, plus all the other types of Quants. Unfortunately, most of the brilliant people in the Solar System were employed making predictions about elections. Thus the second-rate constructors of the Dyson Sphere accidentally built its shell within the orbit of Earth, blocking out the sun and eventually causing the extinction of life on the planet.
UPDATE: Edited out some two-am-induced errors.
Now we've established that people who analyze polling data might have something there, let's devise ways to compare and contrast the different models. Drew Linzer at votamatic.com
already described his strategy for checking how well his model worked, and started Tweeting
some of his post hoc analyses
. So did Simon Jackman
. As of this moment, Micah Cohen at Nate Silver's FiveThirtyEight blog says "Stay tuned
." Darryl Holman is busy covering the Washington State race
, but I suspect we'll see some predictive performance analysis from him soon, too.
Tonight (okay, this morning), I want to compare the predictions that three of the modelers made about the electoral vote count to show you just how awesome these guys did, but also to draw some contrasts in the results of their modeling strategy. Darryl Holman, Simon Jackman, and Sam Wang all shared the probability distribution of their final electoral vote predictions for Obama with me. Here are the three probability distributions in the same plot for what I think is the first time.
The first thing to notice is that the two most likely outcomes in each of the models are 303 and 332. These two outcomes together are between 15%, 30%, and 36% likely for Holman, Jackman, and Wang, respectively.
Three hundred and three votes happens to be the number of votes Obama currently has secured. Three hundred and thirty-two votes would be the number Obama would have if 29 electoral votes from the remaining toss-up state, Florida, went to him. As most of you know, Obama won the popular vote in Florida, but by a small margin. That's the power of well designed and executed quantitative analysis.
Note, however, that the distributions aren't identical. Jackman's and Wang's distributions are more dispersed, more kurtotic (peaked), and more skewed than Holman's distribution. If you look at Silver's distribution, it is also more dispersed and kurtotic than Holman's. The models also differ in the relative likelihood they give to the two most likely outcomes. Another difference is that Jackman's distribution (and Silver's) has a third most likely outcome favorable to Obama that is much more distinguishable from the noise than it is is for Holman's model.
I've argued in a previous post
that differences like these are important, if not on election eve, then earlier in the campaign. I've also argued
that all of these models together might better predict the election in aggregate than they do on their own. So let's see what these models had to say in aggregate in their final runs before the election. It might seem silly to do this analysis after the election is already over, but, hey, they're still counting Florida.
Here is the average probability distribution of the three models.
Whoopdeedoo. It's an average distribution. Who cares, right? Well that histogram shows us what the models predicted in aggregate for the 2012 election. The aggregate distribution leads to more uncertainty regarding the two most likely outcomes than for some models (especially Holman), but less uncertainty for others (especially Wang). If we had added Drew Linzer's model and Nate Silver's model, which both predicted higher likelihood of 332 than 303 electoral votes, perhaps the uncertainty would have decreased even more in favor of 332. That third outcome also shows up as important in the aggregate model.
Model averaging and model comparison like this would have been helpful earlier in the campaign because it would have given us a sense of what all the models said in aggregate, but also how they differed. The more models we average, and the better we estimate the relative weights to give the models when calculating that average, the better.
Anyway, the outcome that truly matters has already been decided. I admit that I'm happy about it.