In the aftermath of the 2012 election, campaign prognosticators Nate Silver, Simon Jackman, Drew Linzer, and Sam Wang made preliminary quantitative assessments of how well their final predictions played out. Others have posted comparisons of these and other election prediction and poll aggregation outfits. Hopefully, we'll one day compare and combine the models based on their long-term predictive power. To compare and combine models effectively, we need a good quantitative measure of their accuracy. The prognosticators have used something called the Brier score to measure the accuracy of their election eve predictions of state-level outcomes. Despite its historical success in measuring forecast accuracy, the Brier score fails in at least two ways as a forecast score. I'll review its inadequacies and suggest a better method.
The Brier score measures the accuracy of binary probabilistic predictions. To calculate it, take the average squared difference between the forecast probability of a given outcome (e.g., Obama winning the popular vote in California) and the observed outcome (e.g., one if Obama won, zero if he didn't). The higher the Brier score, the worse the predictive accuracy. As Nils Barth suggested to Sam Wang
, you can also calculate a normalized Brier score by subtracting four times the Brier score from one. A normalized Brier score compares the predictive accuracy of a model to the predictive accuracy of a model that perfectly predicted the outcomes. The higher the normalized Brier score, the greater the predictive accuracy.
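As a concrete sketch, the two scores can be computed like this (the forecast probabilities below are invented for illustration, not any forecaster's actual numbers):

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and the
    observed outcomes (1 if the event happened, 0 if it didn't)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def normalized_brier(forecasts, outcomes):
    """Nils Barth's normalization: 1 for a perfect forecast, lower as
    accuracy degrades."""
    return 1 - 4 * brier_score(forecasts, outcomes)

# Hypothetical state-level forecasts: P(Obama wins) in three states.
forecasts = [0.95, 0.80, 0.30]
outcomes = [1, 1, 0]  # Obama won the first two states, lost the third

print(brier_score(forecasts, outcomes))       # lower = more accurate
print(normalized_brier(forecasts, outcomes))  # higher = more accurate
```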
Because the Brier score and its normalized cousin measure predictive accuracy, I've suggested that we can use them to construct certainty weights for prediction models
, which we could then use when calculating an average model that combines the separate models into a meta-prediction. Recently, I discovered research from the weather forecasting community on a better way to score forecast accuracy. This new score ties directly to a well-studied model averaging mechanism. Before describing the new scoring method, let's review the problems with the Brier score.
Jewson notes that the Brier score doesn't deal adequately with very improbable or probable events. For example, suppose that the probability that a Black Democrat wins Texas is 1 in 1000. Suppose we have one forecast model that predicts Obama will surely lose in Texas, whereas another model predicts that Obama's probability of winning is 1 in 400. Well, Obama lost Texas. The Brier score would tell us to prefer the model that predicted a sure loss for Obama. Yet the model that gave him a small probability of winning is closer to the "truth" in the sense that it estimates he had a small probability of winning. In addition to its poor performance scoring highly improbable and highly probable events, the Brier score doesn't perform well when scoring very poor forecasts (Benedetti 2010
; sorry for the paywall).
These issues with the Brier score should give prognosticators pause for two reasons. First, they suggest that the Brier score will not perform well in the "safe" states of a given party. Second, they suggest that Brier scores will not perform well for models whose predictions were poor (here's lookin' at you, Bickers and Berry). So what should we do instead? It's all about the likelihood. Well, actually, its logarithm.
Both Jewson and Benedetti convincingly argue that the proper score of forecast accuracy is something called the log likelihood. A likelihood is the probability of a set of observations given the model of reality that we assume produced those observations. As Jewson points out, the likelihood in our case is the probability of a set of observations (i.e., which states Obama won) given the forecasts associated with those observations (i.e., the forecast probability that Obama would win those states). A score based on the log likelihood harshly penalizes forecasts that were very certain one way or the other and turned out wrong, assigning the worst possible score to a model that was perfectly certain of an outcome that didn't occur.
To compare the accuracy of two models, simply take the difference in their log likelihoods. To calculate model weights, first subtract the maximum log-likelihood score across all the models from the log-likelihood score of each model. Then exponentiate the difference you just calculated. Then divide the exponentiated difference of each model by the sum of those values across all the models. Voila. A model averaging weight.
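Here is a minimal sketch of that recipe (the model forecasts are invented for illustration). Subtracting the best score before exponentiating keeps the arithmetic numerically stable and gives the best model a raw weight of one, as in Akaike weights:

```python
import math

def log_likelihood(forecasts, outcomes):
    """Log probability of the observed outcomes under a model's forecasts."""
    return sum(math.log(f if o == 1 else 1 - f)
               for f, o in zip(forecasts, outcomes))

def model_weights(scores):
    """Turn log-likelihood scores into averaging weights that sum to one."""
    best = max(scores)
    raw = [math.exp(s - best) for s in scores]  # best model gets raw weight 1
    total = sum(raw)
    return [r / total for r in raw]

outcomes = [1, 1, 0]  # which states Obama actually won
sharp = log_likelihood([0.90, 0.80, 0.20], outcomes)  # confident and accurate
vague = log_likelihood([0.60, 0.55, 0.50], outcomes)  # hedging its bets

print(model_weights([sharp, vague]))  # sharper, more accurate model weighs more
```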
Some problems remain. For starters, we haven't factored Occam's razor into our scoring of models. Occam's razor, of course, is the idea that, all else equal, simpler models are better than complex ones. Some of you might notice that the model weight calculation in the previous paragraph is identical to the model weight calculation based on the information criterion scores of models that have the same number of variables. I argue that we can ignore Occam's razor for our purposes. What we're doing is measuring a model's predictive accuracy, not its fit to previous observations. I leave it up to the first-order election prognosticators to decide which parameters they include in their models. In making meta election forecasts, I'll let the models' actual predictive performance decide which ones should get more weight.
A funny short story about the triumph and perils of endless recursions in meta-analysis. NOT a critique of meta-analysis itself.
Once upon a time, there was a land called the United States of America, which was ruled by a shapeshifter whose physiognomy and political party affiliation were recast every four years by an electoral vote, itself a reflection of the vote of the people. For centuries, the outcome of the election had been foretold by a cadre of magicians and wizards collectively known as the Pundets. Gazing into their crystal balls at the size of crowds at political rallies, they charted the course of the shapeshifting campaign. They were often wrong, but people listened to them anyway.
Then, from the labyrinthine caves beneath the Marginuvera Mountains emerged a troglodyte race known as the Pulstirs. Pasty of skin and snarfy in laughter, they challenged the hegemony of the Pundet elite by crafting their predictions from the collective utterances of the populace. Trouble soon followed. Some of the powerful new Pulstir craftsmen forged alliances with one party or another. And as more and more Pulstirs emerged from Marginuvera, they conducted more and more puls.
The greatest trouble came, unsurprisingly, from the old Pundet guard in their ill-fated attempts to merge their decrees with Pulstir findings. Unable to cope with the number of puls, unwilling to so much as state an individual pul's marginuvera, the Pundets' predictions confused the people more than they informed them.
Then, one day, unbeknownst to one another, rangers emerged from the Forests of Metta Analisis. Long had each of them observed the Pundets and Pulstirs from afar. Long had they anguished over the amount of time the Pundets spent bullshyting about what the ruler of America would look like after election day rather than discussing in earnest the policies that the shapeshifter would adopt. Long had the rangers shaken their fists at the sky every time Pundets with differing loyalties supported their misbegotten claims with a smattering of gooseberry-picked puls. Long had the rangers tasted vomit at the back of their throats whenever the Pundets at Sea-en-en jabbered about it being a close race when one possible shapeshifting outcome had been on average trailing the other by several points in the last several fortnights of puls.
Each ranger retreated to a secluded cave, where they used the newfangled signal torches of the Intyrnet to broadcast their shrewd aggregation of the Pulstir's predictions. There, they persisted on a diet of espresso, Power Bars, and drops of Mountain Dew. Few hours they slept. In making their predictions, some relied only on the collective information of the puls. Others looked as well to fundamental trends of prosperity in each of America's states.
Pundets on all (by that, we mean both) sides questioned the rangers' methods, scoffed at the certainty with which the best of them predicted that the next ruler of America would look kind of like a skinny Nelson Mandela, and would support similar policies to the ones he supported back when he had a bigger chin and lighter skin, was lame of leg, and harbored great fondness for elegantly masculine cigarette holders.
On election day, it was the rangers who triumphed, and who collectively became known as the Quants, a moniker that was earlier bestowed upon another group of now disgraced, but equally pasty rangers who may have helped usher in the Great Recession of the early Second Millennium. The trouble was that the number of Quants had increased due to the popularity and controversy surrounding their predictions. While most of the rangers correctly predicted the physiognomy of the president, they had differing levels of uncertainty in the outcome, and their predictions fluctuated to different degrees over the course of the lengthy campaign.
Soon after the election, friends of the Quants, who had also trained in the Forests of Metta Analisis, made a bold suggestion. They argued that, just as the Quants had aggregated the puls to form better predictions about the outcome of the election, we could aggregate the aggregates to make our predictions yet more accurate.
Four years later, the Meta-Quants broadcast their predictions alongside those of the original Quants. Sure enough, the Meta-Quants predicted the outcome with greater accuracy and precision than the original Quants.
Soon after the election, friends of the Meta-Quants, who had also trained in the Forests of Metta Analisis, made a bold suggestion. They argued that, just as the Meta-Quants had aggregated the Quants to form better predictions about the outcome of the election, we could aggregate the aggregates of the aggregates to make even better predictions.
Four years later, the Meta-Meta-Quants broadcast their predictions alongside those of the Quants and the Meta-Quants. Sure enough, the Meta-Meta-Quants predicted the outcome with somewhat better accuracy and precision than the Meta-Quants, but not as much better as the Meta-Quants had over the Quants. Nobody really paid attention to that part of it.
Which is why, soon after the election, friends of the Meta-Meta-Quants, who had also trained in the Forests of Metta Analisis, made a bold suggestion. They argued that, just as the Meta-Meta-Quants had aggregated the Meta-Quants to form better predictions about the outcome of the election, we could aggregate the aggregates of the aggregates of the aggregates to make even better predictions.
One thousand years later, the (Meta x 253)-Quants broadcast their predictions alongside those of all the other types of Quants. By this time, 99.9999999% of Intyrnet communication was devoted to the prediction of the next election, and the rest was devoted to the prediction of the election after that. A Dyson Sphere was constructed around the sun to power the syrvers necessary to compute and communicate the prediction models of the (Meta x 253)-Quants, plus all the other types of Quants. Unfortunately, most of the brilliant people in the Solar System were employed making predictions about elections. Thus the second-rate constructors of the Dyson Sphere accidentally built its shell within the orbit of Earth, blocking out the sun and eventually causing the extinction of life on the planet.
I've written a lot recently about the promise of combining the results from the different election prediction models that have cropped up over the last decade. (Here's a scroll of those articles
in reverse chronological order.) One suggestion I've made is to average the results of the election prediction models. The marginoferror blog made the same suggestion, noting that the averaged aggregator performs better than any individual aggregator
(that they included in their sample of aggregators).
Today, I present suggestions for how to calculate averaging weights for a given prediction of the winner of the presidency in each state, and of the percent popular vote in that state. These suggestions were inspired by the reporting of Brier scores and other prediction accuracy statistics by Simon Jackman, Sam Wang, and Drew Linzer.
State-level outcomes (thus EV outcomes)
To calculate the model weight for a given model at a given point in time, start with Christopher A. T. Ferro's sample-size-adjusted Brier score (see equation 8, which depends on equation 3 and the first expression in section 2.a), comparing all observed state-level outcomes to the probabilities estimated from all of the years that an aggregator has made predictions at the specified calendar distance from election day.
Ferro's adjusted Brier score is best because it accounts for the effects of sample size on the Brier score.
Next, subtract that Brier score from one, which is the highest possible value for a Brier score. The result is an absolute score that increases as the Brier score decreases. Recall that the Brier score is larger when there is greater distance between the predicted and observed values.
Next, we repeat that process for all aggregators that have made predictions at that distance from election day.
Next, we normalize all the absolute scores by the summed absolute scores to give each model a relative weight.
Finally, we weight each model by its relative weight when averaging.
This method could easily be modified to give models weights corresponding to entire prediction histories, and/or to prediction within a given time interval at a given distance from election. It could also be extended to deal with one-off forecasts that are never updated. Because state-level outcomes largely determine the electoral vote, I propose that the same model weight calculated as above could be used when averaging electoral vote distributions.
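The steps above can be sketched as follows. For simplicity, the ordinary Brier score stands in for Ferro's adjusted version, whose sample-size correction term (his equation 8) is omitted here, and the forecasts are invented:

```python
def brier(forecasts, outcomes):
    """Ordinary Brier score; Ferro's adjustment would further correct
    this for sample size."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def averaging_weights(brier_scores):
    """Subtract each model's Brier score from one, then normalize the
    results across models so the weights sum to one."""
    absolute = [1 - s for s in brier_scores]
    total = sum(absolute)
    return [a / total for a in absolute]

# Invented forecasts from two hypothetical aggregators over four states.
outcomes = [1, 1, 0, 1]
scores = [brier([0.9, 0.8, 0.1, 0.7], outcomes),
          brier([0.7, 0.6, 0.4, 0.5], outcomes)]

print(averaging_weights(scores))  # the more accurate aggregator weighs more
```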
State-level shares of popular vote
The method is identical to what I described above, except we replace Ferro's adjusted Brier score with the sample-normalized root mean squared error, which measures the average percentage-point difference between the observed and expected popular vote outcomes. Simply calculate one minus the sample-normalized root mean squared error of a given model, and divide the difference by the sum of the same quantity across all of the models. Then, calculate a weighted average.
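A sketch of the vote-share version, with invented predictions and with the RMSE normalized by the range of the observed shares (one common convention; the choice of normalization is an assumption here):

```python
import math

def nrmse(predicted, observed):
    """Root mean squared error of predicted vote shares, scaled by the
    spread of the observed shares."""
    mse = sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)
    return math.sqrt(mse) / (max(observed) - min(observed))

def vote_share_weights(nrmse_scores):
    """One minus each model's normalized RMSE, normalized across models."""
    absolute = [1 - s for s in nrmse_scores]
    total = sum(absolute)
    return [a / total for a in absolute]

# Invented two-party vote-share predictions (percent) across three states.
observed = [52.0, 61.0, 45.0]
scores = [nrmse([51.0, 60.0, 46.0], observed),   # off by 1 point per state
          nrmse([55.0, 57.0, 49.0], observed)]   # off by 3-4 points per state

print(vote_share_weights(scores))
```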
These methods have a lot of nice features:
- They result in weights that are easily interpreted.
- The weights can also be decomposed into different components because they are based on the Brier score and root mean squared error. For example, the Brier score can be decomposed to examine calibration and uncertainty effects. The mean squared error can be decomposed into bias and variance components.
- The methods are flexible enough to accommodate any scope of predictive power that interests researchers.
UPDATE: Edited out some two-am-induced errors.
Now that we've established that people who analyze polling data might have something there, let's devise ways to compare and contrast the different models. Drew Linzer at votamatic.com
already described his strategy for checking how well his model worked, and started Tweeting
some of his post hoc analyses
. So did Simon Jackman
. As of this moment, Micah Cohen at Nate Silver's FiveThirtyEight blog says "Stay tuned
." Darryl Holman is busy covering the Washington State race
, but I suspect we'll see some predictive performance analysis from him soon, too.
Tonight (okay, this morning), I want to compare the predictions that three of the modelers made about the electoral vote count to show you just how awesome these guys did, but also to draw some contrasts in the results of their modeling strategy. Darryl Holman, Simon Jackman, and Sam Wang all shared the probability distribution of their final electoral vote predictions for Obama with me. Here are the three probability distributions in the same plot for what I think is the first time.
The first thing to notice is that the two most likely outcomes in each of the models are 303 and 332. Together, these two outcomes are 15%, 30%, and 36% likely for Holman, Jackman, and Wang, respectively.
Three hundred and three votes happens to be the number of votes Obama currently has secured. Three hundred and thirty-two votes would be the number Obama would have if 29 electoral votes from the remaining toss-up state, Florida, went to him. As most of you know, Obama won the popular vote in Florida, but by a small margin. That's the power of well designed and executed quantitative analysis.
Note, however, that the distributions aren't identical. Jackman's and Wang's distributions are more dispersed, more kurtotic (peaked), and more skewed than Holman's distribution. If you look at Silver's distribution, it is also more dispersed and kurtotic than Holman's. The models also differ in the relative likelihood they give to the two most likely outcomes. Another difference is that Jackman's distribution (and Silver's) has a third most likely outcome favorable to Obama that is much more distinguishable from the noise than it is for Holman's model.
I've argued in a previous post
that differences like these are important, if not on election eve, then earlier in the campaign. I've also argued
that all of these models together might better predict the election in aggregate than they do on their own. So let's see what these models had to say in aggregate in their final runs before the election. It might seem silly to do this analysis after the election is already over, but, hey, they're still counting Florida.
Here is the average probability distribution of the three models.
Whoopdeedoo. It's an average distribution. Who cares, right? Well that histogram shows us what the models predicted in aggregate for the 2012 election. The aggregate distribution leads to more uncertainty regarding the two most likely outcomes than for some models (especially Holman), but less uncertainty for others (especially Wang). If we had added Drew Linzer's model and Nate Silver's model, which both predicted higher likelihood of 332 than 303 electoral votes, perhaps the uncertainty would have decreased even more in favor of 332. That third outcome also shows up as important in the aggregate model.
Model averaging and model comparison like this would have been helpful earlier in the campaign because it would have given us a sense of what all the models said in aggregate, but also how they differed. The more models we average, and the better we estimate the relative weights to give the models when calculating that average, the better.
Anyway, the outcome that truly matters has already been decided. I admit that I'm happy about it.
For the last few news cycles, Nate Silver
(and by extension all election forecasters) has faced intense but misguided scrutiny from pundits (especially conservative ones). His critics run the gamut from partisan charlatans to scared pundits clinging steadfastly to their relevance as raconteurs. It's fun to laugh at the innumerate, wishy-washy arguments these people make. Yet a rational strain of the debate gets muffled by our collective cachinnation: the argument that Silver's model isn't perfect, simply because no model is likely to be perfect. I'll examine this strain of the debate and argue that we need to systematically compare and combine the results of the numerous election forecasting models that have cropped up in the last decade, and will crop up in the years to come.
This is a manifesto for better election prediction by comparing and contrasting prediction models. That's right. We need to aggregate the aggregators or, if you prefer, meta-analyze the meta-analyses.
The most extreme argument that Silver's forecasts are wrong comes from a model widely cited in the conservative media that predicts a landslide victory for Romney
. Political scientists Kenneth Bickers and Michael Berry developed the model. Nate Silver has soundly questioned its merits, although only in a series of Tweets
. In his Tweet-ique, Silver cited a model developed by Douglas Hibbs
as a more meritorious example that favors Romney.
The critiques of Silver's model don't only come from those whose models forecast a Romney victory. Consistent variance exists among those forecasting an Obama win, as well. Drew Linzer's Votamatic
forecast predicts a larger electoral vote victory for Obama than Silver's. Sam Wang's meta-margin
and Darryl Holman's electoral vote predictions
are consistently between Silver's and Linzer's. These differences have been in the tens of electoral votes, although the models are converging (with the exception of Linzer's) because we are so close to the election.
Similarly, variance exists among modelers in the odds they give for an Obama win. For example, today Wang gives over 99% probability to Obama winning the electoral vote. Holman gives over 94% probability. Today, Silver gives about 80% probability to an Obama win. Those differences might look small as percents, but in terms of odds, they're big. Wang gives 99 to 1 odds that Obama will win, Holman gives about 16 to 1 odds, and Silver gives about 4 to 1 odds. From a bettor's perspective, those differences are effing huge
because the odds are a strictly convex function of the probability of an event. That means the odds grow more and more quickly as an event becomes more probable. If you don't know what the Hell I am talking about, look at this graph of odds as a function of the percent chance of an event
. See what I mean?
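The conversion is just p/(1-p), and plugging in the three probabilities above shows how quickly the odds blow up near certainty:

```python
def odds(p):
    """Odds in favor of an event with probability p."""
    return p / (1 - p)

# Silver (~80%), Holman (~94%), and Wang (~99%) probabilities of an Obama win.
for p in (0.80, 0.94, 0.99):
    print(f"{p:.0%} -> {odds(p):.1f} to 1")
```

An extra point of probability near 50% barely moves the odds; the same point near 99% roughly doubles them. That convexity is why the forecasters' seemingly similar percentages imply very different bets.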
Anyway, those are just the forecasters I follow. HuffPo
now have their shiny-looking forecasts, and I'm sure a bunch of other outfits exist that I haven't even heard about. And you know there will be umpteen more forecasting blogs come election 2016, perhaps even by the mid-terms. Each of these forecasts will present a probabilistic picture of an election at a given point in time. They will use varying methods and, as we've already seen, they will likely produce varying results. Browsing each of them obsessively every day will give you a basic sense of how a campaign race is going. Just as well, it might confuse you. Furthermore, your personal biases and unchecked assumptions about what models are more right might cause you to subconsciously weight some models more heavily than others, leading to a conclusion that is about as justified as Wolf Blitzer's or Joe Scarborough's claim that it's a close race.
In the spirit of principled forecasting, subjective and subconscious impressions of the available election forecasts simply cannot stand, lest we instantaneously regress to the (hopefully) waning days of sensationalist campaign punditry. So what is the way forward?
Enter meta-meta-analysis. To understand what that is, you first need to understand what meta-analysis is. Many of the forecasters, like Silver and Wang and Holman and Linzer and Simon Jackman
, use information from many polls to guide their models. Each poll is an independent experiment that measures the probability that someone will vote for a particular candidate. Some models, like Silver's, contrast
these experiments by assigning a weight to the poll that is proportional to its historical accuracy. All election prediction models (except the ones that rely only on economic indicators to make predictions) also combine
information from the separate polls to produce their predictions. An analysis that contrasts and/or combines the results of different studies is a meta-analysis.
But what if some of the analyses that you are combining and contrasting are themselves meta-analyses? That's what would happen if you took my advice and placed the sprawling set of election prediction models under systematic, quantitative scrutiny. I call such a study a meta-meta-analysis. I'm not the only one who calls it so. Recently, a paper in the Canadian Medical Association Journal did a meta-analysis of meta-analyses of medical research and found many of them problematic.
So why meta-meta-analysis? I've already argued that unsystematic review of election forecast models can lead us astray. A more positive argument is that election prediction meta-meta-analysis would provide us with a more complete, and possibly more accurate, picture of the current state of a campaign race. The reason is that different election prediction models emphasize different aspects of the race. Some models include variables about the state of the economy. Among those, some models emphasize certain variables over others or (in the case of the Berry/Bickers model) include an economic variable that no other models include. Other models take the polls at face value. The relative emphases on economic variables and polls represent one way in which forecasters differ in their approach to prediction.
Statisticians have argued that we must take into account our uncertainty in the best model to use when making predictions. Even if we could measure which of the models is the best one, statisticians argue, we shouldn't necessarily take that best model as the right answer. That would be as silly as always going with the answer of the pollster whose long-term accuracy is highest. While some election models might predict elections better than others, it is likely that none of them are perfectly accurate. It turns out that we might make better predictions by taking a weighted average of the results of the models available to us, with the weights proportional to our relative certainty that a given model is correct. If you want to geek out on this concept, read this excellent interview that Prashant Nair at PNAS did of statistician Adrian Raftery
, who has applied model averaging methods to topics ranging from weather prediction to demographic forecasting.
I'm going out on a limb here to say that I'm a pioneer in the model averaging of election forecasting models. My meta-meta-analyses (which you can read here
in reverse chronological order) are simple - almost stupidly simple. They combine only two models. They assume that the two models have equal predictive power in the absence of any evidence to the contrary. But they provide a first glimpse at the potential power of election prediction meta-meta-analysis, and present a simple method to average the electoral vote predictions of forecasters.
Where do we go from here? First, we need to compare more than two models. Yet election forecasters, particularly those who are licensed by major media companies, are likely wary of sharing their raw data for fear of being scooped (not so with at least two open source modelers, Sam Wang and Darryl Holman, who provide their raw electoral vote probability distributions for each forecast update). Second, we need to come up with a method to estimate our uncertainty in the predictive power of a given model (until then, I suggest we place equal weight on each model). Doing so, we can begin to make informed decisions about how to weight models' results when we average them. We would also have some measure of which models are "better". Third, we need to develop methods to effectively communicate the results of these meta-meta-analyses to the public so that people can understand what it all means (at least if they're not frothy-mouthed partisan buffoons or spotlight-grubbing dead-heat-arguing pundits).
So there you have it. A manifesto for meta-meta-analysis of election prediction models. I hope this message reaches election prediction modelers and convinces them that it is worthwhile to compromise our competing interests and Internet market share in the interest of making better predictions that result in a more informed public. I welcome your comments and criticism. Even if you'd never participate in this effort, election forecasters, at least give me your two cents.
Lastly, thanks to all the election forecasters for the important work that they do.
They look similar. Yesterday, both models predicted the same number of electoral votes for Obama. Their estimates of the odds that Obama wins, however, differ slightly. If I had access to Linzer's and Silver's EV distributions, they would also look different. Linzer's would be shifted to the right, Silver's to the left. Alas, I don't have access to those model results beyond what those authors publish on the Internet.
I averaged the two probability distributions. This method makes sense because the result of the averaging still sums to one (so it is still a probability distribution), and because the two models could be interpreted as estimating 538 parameters (the probabilities of each electoral vote result), and because I don't have much reason to believe that the two models have unequal predictive power.
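That averaging step can be sketched as follows (with tiny made-up three-outcome distributions standing in for the full electoral-vote range):

```python
def average_distributions(dists):
    """Element-wise mean of discrete probability distributions defined
    over the same support; the result is itself a probability
    distribution (it still sums to one)."""
    n = len(dists)
    return [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]

# Two invented distributions, each summing to one.
model_a = [0.2, 0.5, 0.3]
model_b = [0.1, 0.4, 0.5]
avg = average_distributions([model_a, model_b])

print(avg)       # element-wise mean of the two models
print(sum(avg))  # still sums to one
```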
Below is the average probability distribution.
The averaged model predicts that Obama will win a median 303 votes. We can be 95% confident that he will receive between 265 and 344 votes. Obama will win with probability greater than 96%, giving him greater than 24 to 1 odds of winning the election.
And, guess what, election simulation naysayers. Neither of these simulators weights polls, and one of them includes all polls. And I've reported the median to avoid outlier effects. And the distribution doesn't look bimodal.
What do Nate Silver
, Darryl Holman
, Drew Linzer
, and Sam Wang
all have in common? They all use statistical methods to forecast elections, especially presidential ones. Their models all tend to say the same thing: the odds are pretty good that Obama is going to win. Yet they often make different predictions about the number of electoral votes that, say, Obama will get, and about the probability that Obama would win if an election were held right now.
For example, as of right now, Silver predicts 294 electoral votes to Obama with 3 to 1 odds of an Obama win. Holman predicts an average 299 electoral votes with 9 to 1 odds of an Obama win. Wang predicts a median 291 electoral votes, also with 9 to 1 odds of an Obama win. Linzer predicts a whopping 332 electoral votes and doesn't report the probability of an Obama win.
I contacted each of those men to request access to their electoral vote probability distributions. So far, Sam Wang and Darryl Holman have accepted. Drew Linzer declined. Nate Silver hasn't answered, likely because his mailbox is chock full of fan and hate mail.
Wang and Holman now both offer their histogram of electoral vote probabilities on their respective web pages. I went and grabbed these discrete probability distributions and did what a good, albeit naive model averager would do: I averaged the probability distributions to come up with a summary probability distribution (which, by the way, still sums to one).
This method makes sense because, basically, these guys are estimating 538 parameters, and I'm simply averaging those 538 parameters across the models to which I currently have access, because I have no reason to think they differ much in predictive power (although later on the method could be extended to include weights).
From the aggregated electoral vote distribution, I calculated the mean, median, 2.5th percentile, and 97.5th percentile of the number of electoral votes (EV) to Obama. I also calculated the probability that Obama will get 270 EV or more, winning him the election.
Mean EV: 296
Median EV: 294
95% Confidence interval: 261, 337
Probability Obama wins: over 90%
So 9 to 1 odds Obama wins. Something like 294 or 296 electoral votes.
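The summary numbers come from a cumulative pass over the averaged distribution; here's a sketch, with a toy distribution (invented, concentrated on a few EV totals) in place of the real aggregated one:

```python
def summarize(ev_probs):
    """Mean, median, central 95% interval, and P(win) for a distribution
    over electoral-vote totals, where index i holds P(EV == i)."""
    mean = sum(ev * p for ev, p in enumerate(ev_probs))
    cdf, lo, median, hi = 0.0, None, None, None
    for ev, p in enumerate(ev_probs):
        cdf += p
        if lo is None and cdf >= 0.025:
            lo = ev
        if median is None and cdf >= 0.5:
            median = ev
        if hi is None and cdf >= 0.975:
            hi = ev
    p_win = sum(ev_probs[270:])  # 270 electoral votes wins the election
    return mean, median, (lo, hi), p_win

# Toy distribution over EV totals 0..538.
dist = [0.0] * 539
dist[260], dist[294], dist[303], dist[332] = 0.05, 0.40, 0.35, 0.20

mean, median, ci, p_win = summarize(dist)
print(mean, median, ci, p_win)
```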
I'd love to see what happens if I put Nate Silver into the equation. Obviously, it will drag the distribution down. I might look into modeling weights at that point, too, because both Holman and Wang predicted the electoral votes better than Silver, and I believe Wang did a slightly better job than Holman, although I forget.
Anyway, there you have it. Rest easy and VOTE.