In the aftermath of the 2012 election, campaign prognosticators Nate Silver, Simon Jackman, Drew Linzer, and Sam Wang made preliminary quantitative assessments of how well their final predictions played out. Others have posted comparisons of these and other election prediction and poll aggregation outfits. Hopefully, we'll one day compare and combine the models based on their long-term predictive power. To compare and combine models effectively, we need a good quantitative measure of their accuracy. The prognosticators have used something called the Brier score to measure the accuracy of their election eve predictions of state-level outcomes. Despite its historical success in measuring forecast accuracy, the Brier score fails in at least two ways as a forecast score. I'll review its inadequacies and suggest a better method.
The Brier score measures the accuracy of binary probabilistic predictions. To calculate it, take the average squared difference between the forecast probability of a given outcome (e.g., Obama winning the popular vote in California) and the observed outcome (e.g., one if Obama won, zero if he didn't). The higher the Brier score, the worse the predictive accuracy. As Nils Barth suggested to Sam Wang, you can also calculate a normalized Brier score by subtracting four times the Brier score from one. A normalized Brier score compares the predictive accuracy of a model to the predictive accuracy of a model that perfectly predicted the outcomes. The higher the normalized Brier score, the greater the predictive accuracy.
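The calculation is simple enough to sketch in a few lines of Python. The forecast probabilities and outcomes below are made up for illustration, not any prognosticator's actual numbers:

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def normalized_brier(forecasts, outcomes):
    """Barth's normalization: 1 - 4 * Brier, so a perfect forecaster scores 1."""
    return 1 - 4 * brier_score(forecasts, outcomes)

# Hypothetical state-level forecasts of P(Obama wins) and the observed outcomes.
forecasts = [0.9, 0.8, 0.3]
outcomes = [1, 1, 0]
print(brier_score(forecasts, outcomes))      # lower is better
print(normalized_brier(forecasts, outcomes))  # higher is better
```

Note that a coin-flip forecaster (probability 0.5 everywhere) gets a Brier score of 0.25, hence a normalized score of zero, which is what makes the factor of four convenient.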
Because the Brier score (and its normalized cousin) measure predictive accuracy, I've suggested that we can use them to construct certainty weights for prediction models
, which we could then use when calculating an average model that combines the separate models into a meta-prediction. Recently, I've discovered research in the weather forecasting community about a better way to score forecast accuracy. This new score ties directly to a well-studied model averaging mechanism. Before describing the new scoring method, let's describe the problems with the Brier score.
Jewson notes that the Brier score doesn't deal adequately with very improbable or very probable events. For example, suppose that the probability that a Black Democrat wins Texas is 1 in 1,000. Suppose we have one forecast model that predicts Obama will surely lose Texas, whereas another model predicts that Obama's probability of winning is 1 in 400. Well, Obama lost Texas. The Brier score would tell us to prefer the model that predicted a sure loss for Obama. Yet the model that gave him a small probability of winning is closer to the "truth" in the sense that it estimates he has a small probability of winning. In addition to its poor performance scoring highly improbable and highly probable events, the Brier score doesn't perform well when scoring very poor forecasts (Benedetti 2010; sorry for the paywall).
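You can make the failure concrete by comparing the two scores in expectation. The sketch below assumes the "true" upset probability really is 1 in 1,000 (a made-up figure, as above): the expected Brier score slightly prefers the sure-loss model, while a log-based score assigns negative infinity to any model that rules out an event that can actually happen:

```python
import math

def expected_brier(p_forecast, p_true):
    # E[(forecast - outcome)^2] when the event occurs with probability p_true.
    return p_true * (1 - p_forecast) ** 2 + (1 - p_true) * p_forecast ** 2

def expected_log_score(p_forecast, p_true):
    # Expected log likelihood of the outcome under the forecast.
    if p_forecast in (0.0, 1.0):
        # Total certainty about a genuinely uncertain event is infinitely penalized.
        return float("-inf") if 0 < p_true < 1 else 0.0
    return p_true * math.log(p_forecast) + (1 - p_true) * math.log(1 - p_forecast)

p_true = 1 / 1000    # assumed "true" probability of the upset
sure_loss = 0.0      # model that says Obama surely loses Texas
long_shot = 1 / 400  # model that gives him a 1-in-400 chance

print(expected_brier(sure_loss, p_true) < expected_brier(long_shot, p_true))
# True: the Brier score prefers the sure-loss model
print(expected_log_score(long_shot, p_true) > expected_log_score(sure_loss, p_true))
# True: the log score prefers the long-shot model
```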
These issues with the Brier score should give prognosticators pause for two reasons. First, they suggest that the Brier score will not perform well in the "safe" states of a given party. Second, they suggest that Brier scores will not perform well for models whose predictions were poor (here's lookin' at you, Bickers and Berry). So what should we do instead? It's all about the likelihood. Well, actually its logarithm.
Both Jewson and Benedetti convincingly argue that the proper score of forecast accuracy is something called the log likelihood. A likelihood is the probability of a set of observations given the model of reality that we assume produced those observations. As Jewson points out, the likelihood in our case is the probability of a set of observations (i.e., which states Obama won) given the forecasts associated with those observations (i.e., the forecast probability that Obama would win those states). A score based on the log likelihood penalizes forecasts that are confidently wrong, giving the worst possible score (negative infinity) to a model that was perfectly certain of an outcome that did not occur.
To compare the accuracy of two models, simply take the difference in their log likelihood scores. To calculate model weights, first subtract the maximum log likelihood score across all the models from each model's score. Then exponentiate the difference you just calculated. Then divide each model's exponentiated difference by the sum of those values across all the models. Voila. A model averaging weight.
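One way to turn log likelihood scores into averaging weights is sketched below (the scores themselves are made up for illustration). Subtracting the best score before exponentiating keeps the arithmetic numerically stable and cancels out after normalization:

```python
import math

def model_weights(log_likelihoods):
    """Convert per-model log likelihood scores into model averaging weights."""
    best = max(log_likelihoods)
    raw = [math.exp(ll - best) for ll in log_likelihoods]  # best model maps to 1
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical log likelihood scores for three forecast models.
weights = model_weights([-12.1, -11.3, -14.8])
print(weights)  # sums to one; the best-scoring model gets the largest weight
```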
Some problems remain. For starters, we haven't factored Occam's razor into our scoring of models. Occam's razor, of course, is the idea that, all else being equal, simpler models are better than complex ones. Some of you might notice that the model weight calculation in the previous paragraph is identical to the model weight calculation based on the information criterion scores of models that have the same number of variables. I argue that we can ignore Occam's razor for our purposes. What we're doing is measuring a model's predictive accuracy, not its fit to previous observations. I leave it up to the first-order election prognosticators to decide which parameters to include in their models. In making meta election forecasts, I'll let the models' actual predictive performance decide which ones should get more weight.
What do Nate Silver, Darryl Holman, Drew Linzer, and Sam Wang all have in common? They all use statistical methods to forecast elections, especially presidential ones. Their models all tend to say the same thing: the odds are pretty good that Obama is going to win. Yet they often make different predictions about the number of electoral votes that, say, Obama will get, and about the probability that Obama would win if an election were held right now.
For example, as of right now, Silver predicts 294 electoral votes to Obama with 3 to 1 odds of an Obama win. Holman predicts an average 299 electoral votes with 9 to 1 odds of an Obama win. Wang predicts a median 291 electoral votes, also with 9 to 1 odds of an Obama win. Linzer predicts a whopping 332 electoral votes and doesn't report the probability of an Obama win.
I contacted each of those men to request access to their electoral vote probability distributions. So far, Sam Wang and Darryl Holman have accepted. Drew Linzer declined. Nate Silver hasn't answered, likely because his mailbox is chock full of fan and hate mail.
Wang and Holman now both offer their histogram of electoral vote probabilities on their respective web pages. I went and grabbed these discrete probability distributions and did what a good, albeit naive model averager would do: I averaged the probability distributions to come up with a summary probability distribution (which, by the way, still sums to one).
This method makes sense because, basically, these guys are each estimating the same 538 parameters, and I'm simply averaging those parameters across the models to which I currently have access. I currently have no reason to think the models differ much in predictive power, although later on the method could be extended to include weights.
From the aggregated electoral vote distribution, I calculated the mean, median, 2.5th percentile, and 97.5th percentile of the number of electoral votes (EV) to Obama. I also calculated the probability that Obama will get 270 EV or more, winning him the election.
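The averaging and summary steps might look like the minimal sketch below. The two electoral vote distributions are made up stand-ins, not the actual Wang and Holman histograms:

```python
def average_distributions(dists):
    """Unweighted average of discrete probability distributions over 0..538 EV."""
    n = len(dists)
    return [sum(d[ev] for d in dists) / n for ev in range(539)]

def summarize(dist):
    """Mean, median, central 95% interval, and P(win) for an EV distribution."""
    mean_ev = sum(ev * p for ev, p in enumerate(dist))
    cum, median, lo, hi = 0.0, None, None, None
    for ev, p in enumerate(dist):
        cum += p
        if lo is None and cum >= 0.025:
            lo = ev
        if median is None and cum >= 0.5:
            median = ev
        if hi is None and cum >= 0.975:
            hi = ev
    p_win = sum(dist[270:])  # 270 EV or more wins the election
    return mean_ev, median, (lo, hi), p_win

# Two toy models, each putting all mass on a couple of EV counts.
dist_a = [0.0] * 539
dist_a[280], dist_a[300] = 0.5, 0.5
dist_b = [0.0] * 539
dist_b[260], dist_b[320] = 0.5, 0.5

avg = average_distributions([dist_a, dist_b])  # still sums to one
mean_ev, median_ev, ci, p_win = summarize(avg)
print(mean_ev, median_ev, ci, p_win)
```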
- Mean EV: 296
- Median EV: 294
- 95% confidence interval: 261, 337
- Probability Obama wins: over 90%
So 9 to 1 odds Obama wins. Something like 294 or 296 electoral votes.
I'd love to see what happens if I put Nate Silver into the equation. Obviously, it will drag the distribution down. I might look into modeling weights at that point, too, because both Holman and Wang predicted the electoral votes better than Silver, and I believe Wang did a slightly better job than Holman, although I forget.
Anyway, there you have it. Rest easy and VOTE.
PolitiFact recently published a list of their Truth-O-Meter rulings on statements that the candidates and their "surrogates" have made regarding foreign policy, including Iraq and Afghanistan. Because this is a small and probably biased sample of rulings, it gives me an opportunity to demonstrate the power of Malark-O-Meter to show how much signal there is amid the noise when it comes to how much malarkey candidates spew. It also gives me a chance to showcase Malark-O-Meter's current limitations.
This is quick and dirty because the debates begin soon. So let me know if there are any mistakes.
I collated the statements made by red and blue candidates and their surrogates into a red pile and a blue pile. Then I used Malark-O-Meter's simulation methods to simulate the probability distribution of each team's foreign-policy-specific malarkey score, and to simulate the ratio of the red team's foreign-policy-specific malarkey to the blue team's. Then I calculated the 95% confidence interval (95% CI) of the individual scores and the ratio, and calculated the probability that the red team spews more foreign-policy-specific malarkey (FPSM) than the blue team.
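Malark-O-Meter's actual simulation machinery isn't described here, but a bootstrap-style sketch conveys the flavor. Everything below is an assumption for illustration: the rulings are hypothetical, and the mapping of each Truth-O-Meter ruling to a 0-100 malarkey value is my own stand-in:

```python
import random
random.seed(2012)

# Hypothetical rulings mapped to 0-100 malarkey values per statement
# (e.g., True = 0 ... Pants on Fire = 100); not PolitiFact's actual data.
blue_rulings = [0, 20, 40, 40, 60, 80, 60, 100, 20, 80]
red_rulings = [20, 40, 60, 80, 100, 60, 40, 80, 20, 60]

def simulate_scores(rulings, n_sims=10_000):
    """Bootstrap the mean malarkey score by resampling rulings with replacement."""
    sims = []
    for _ in range(n_sims):
        sample = random.choices(rulings, k=len(rulings))
        sims.append(sum(sample) / len(sample))
    return sims

def ci95(sims):
    """Empirical 2.5th and 97.5th percentiles of the simulated scores."""
    s = sorted(sims)
    return s[int(0.025 * len(s))], s[int(0.975 * len(s))]

blue = simulate_scores(blue_rulings)
red = simulate_scores(red_rulings)
ratios = [r / b for r, b in zip(red, blue) if b > 0]

print("blue 95% CI:", ci95(blue))
print("red 95% CI:", ci95(red))
print("ratio 95% CI:", ci95(ratios))
print("P(red > blue):", sum(r > b for r, b in zip(red, blue)) / len(blue))
```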
Here are the results:
- Blue team's FPSM -- mean: 50; 95% CI: 36 to 64 (so, maybe half truthful)
- Red team's FPSM -- mean: 56; 95% CI: 36 to 75 (so, maybe a little more than half full of malarkey)
- Ratio -- mean: 1.15; 95% CI: 0.68 to 1.73
- Probability that the red team spews more FPSM than the blue team: about 70%.
So according to the analysis, there are only slightly better than 2 to 1 odds that the red team spews more malarkey than the blue team, but we expect the difference between the two teams to be pretty small (yes, I've been calling the Republicans and Democrats the red and blue team for the last few paragraphs).
But what does this mean? Well, PolitiFact chose precisely the same number of statements for each team. Maybe they subconsciously chose rulings that in aggregate downplay any difference between the teams. Or maybe PolitiFact has a liberal bias, as some allege. In that case, we'd expect them to inflate the difference between the two teams, or even invert it if their bias is strong enough. If both biases act in tandem, we might expect a small difference favoring the blue team.
But honestly, all of this is hand waving. We need more evidence to know if such biases exist and how strong they are. And we need more statements on foreign policy from each team. Well, we're going to get the latter tonight. As for the former. Well. Some day.
For what it's worth, however, this is evidence. I encourage you to gather more and to share it with me. But based on this evidence, I predict that Romney will spew somewhat more malarkey tonight than Obama.
And yes, I'm going to examine that question tomorrow (after I do some field work for my dissertation project).