In the aftermath of the 2012 election, campaign prognosticators Nate Silver, Simon Jackman, Drew Linzer, and Sam Wang made preliminary quantitative assessments of how well their final predictions played out. Others have posted comparisons of these and other election prediction and poll aggregation outfits. Hopefully, we'll one day compare and combine the models based on their long-term predictive power. To compare and combine models effectively, we need a good quantitative measure of their accuracy. The prognosticators have used something called the Brier score to measure the accuracy of their election eve predictions of state-level outcomes. Despite its historical success in measuring forecast accuracy, the Brier score fails in at least two ways as a forecast score. I'll review its inadequacies and suggest a better method.

The Brier score measures the accuracy of binary probabilistic predictions. To calculate it, take the average squared difference between the forecast probability of a given outcome (e.g., Obama winning the popular vote in California) and the observed outcome (e.g., one if Obama won, zero if he didn't). The higher the Brier score, the worse the predictive accuracy. As Nils Barth suggested to Sam Wang, you can also calculate a normalized Brier score by subtracting four times the Brier score from one. The normalized Brier score compares a model's predictive accuracy to that of a model that perfectly predicted the outcomes: a perfect forecaster scores one, while a forecaster that always guesses fifty-fifty scores zero. The higher the normalized Brier score, the greater the predictive accuracy.
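Here's a minimal sketch of both calculations in Python, with made-up forecast probabilities standing in for any real model's output:

```python
import numpy as np

# Hypothetical forecast probabilities that Obama wins each of four states,
# paired with the observed outcomes (1 = he won, 0 = he lost).
forecast = np.array([0.95, 0.80, 0.30, 0.02])
outcome = np.array([1, 1, 0, 0])

# Brier score: the average squared difference between forecast and outcome.
brier = np.mean((forecast - outcome) ** 2)

# Normalized Brier score: one minus four times the Brier score, so a
# perfect forecaster scores 1 and a constant 50/50 forecaster scores 0.
normalized_brier = 1 - 4 * brier

print(brier, normalized_brier)
```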

Because the Brier score (and its normalized cousin) measures predictive accuracy, I've suggested that we can use it to construct certainty weights for prediction models, which we could then use when calculating an average model that combines the separate models into a meta-prediction. Recently, I've discovered research in the weather forecasting community on a better way to score forecast accuracy. This new score ties directly to a well-studied model averaging mechanism. Before describing the new scoring method, let's look at the problems with the Brier score.

Jewson (2004) notes that the Brier score doesn't deal adequately with very improbable (or very probable) events. For example, suppose that the probability that a Black Democrat wins Texas is 1 in 1000. Suppose we have one forecast model that predicts Obama will surely lose Texas, whereas another model gives him a 1 in 400 chance of winning. Well, Obama lost Texas. The Brier score would tell us to prefer the model that predicted a sure loss for Obama. Yet the model that gave him a small probability of winning is closer to the "truth" in the sense that it estimates he has a small probability of winning. In addition to its poor performance on very improbable and very probable events, the Brier score doesn't perform well when scoring very poor forecasts (Benedetti 2010; sorry for the paywall).

These issues with the Brier score should give prognosticators pause for two reasons. First, they suggest that the Brier score will not perform well in the "safe" states of a given party. Second, they suggest that Brier scores will not perform well for models whose predictions were poor (here's lookin' at you, Bickers and Berry). So what should we do instead? It's all about the likelihood. Well, actually, its logarithm.

Both Jewson and Benedetti convincingly argue that the proper score of forecast accuracy is something called the log likelihood. A likelihood is the probability of a set of observations given the model of reality that we assume produced those observations. As Jewson points out, the likelihood in our case is the probability of a set of observations (i.e., which states Obama won) given the forecasts associated with those observations (i.e., the forecast probability that Obama would win those states). A score based on the log likelihood penalizes forecasts that are overly certain one way or the other, assigning the worst possible score to a model that was perfectly certain of an outcome that did not occur.
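As a sketch of what that looks like in code (same made-up forecasts as above; the function name is mine, not Jewson's):

```python
import numpy as np

def log_likelihood_score(forecast, outcome):
    """Sum of the log probabilities a model assigned to what actually happened."""
    forecast = np.asarray(forecast, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    # Probability the model gave to the observed outcome in each state.
    p_observed = np.where(outcome == 1, forecast, 1 - forecast)
    return np.sum(np.log(p_observed))

# A model that was perfectly certain (probability 0 or 1) of an outcome
# that didn't happen gets log(0) = -infinity, the worst possible score.
print(log_likelihood_score([0.95, 0.80, 0.30, 0.02], [1, 1, 0, 0]))
```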

To compare the accuracy of two models, simply take the difference in their log likelihoods. To calculate model weights, first subtract the maximum log-likelihood score across all the models from each model's log-likelihood score, so the best-scoring model sits at zero. Then exponentiate each difference. Then divide each model's exponentiated difference by the sum of those values across all the models. Voila. A model averaging weight.
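In code, the weight calculation looks something like this (the three scores are hypothetical, just to show the mechanics):

```python
import numpy as np

def model_weights(log_likelihood_scores):
    """Turn log-likelihood scores into model averaging weights."""
    ll = np.asarray(log_likelihood_scores, dtype=float)
    # Subtract the best (maximum) score so the best model sits at zero;
    # this also keeps the exponentials numerically well behaved.
    w = np.exp(ll - ll.max())
    return w / w.sum()

# Hypothetical log-likelihood scores for three forecast models.
print(model_weights([-3.2, -4.1, -7.8]))  # the best-scoring model gets the most weight
```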

Some problems remain. For starters, we haven't factored Occam's razor into our scoring of models. Occam's razor, of course, is the idea that, all else being equal, simpler models are better than complex ones. Some of you might notice that the model weight calculation in the previous paragraph is identical to the model weight calculation based on the information criterion scores of models that have the same number of parameters. I argue that we can ignore Occam's razor for our purposes. What we're doing is measuring a model's predictive accuracy, not its fit to previous observations. I leave it up to the first-order election prognosticators to decide which parameters they include in their models. In making meta election forecasts, I'll let the models' actual predictive performance decide which ones should get more weight.
 
 
A funny short story about the triumph and perils of endless recursion in meta-analysis. NOT a critique of meta-analysis itself.
Once upon a time, there was a land called the United States of America, which was ruled by a shapeshifter whose physiognomy and political party affiliation were recast every four years by an electoral vote, itself a reflection of the vote of the people. For centuries, the outcome of the election had been foretold by a cadre of magicians and wizards collectively known as the Pundets. Gazing into their crystal balls at the size of crowds at political rallies, they charted the course of the shapeshifting campaign. They were often wrong, but people listened to them anyway.

Then, from the labyrinthine caves beneath the Marginuvera Mountains emerged a troglodyte race known as the Pulstirs. Pasty of skin and snarfy in laughter, they challenged the hegemony of the Pundet elite by crafting their predictions from the collective utterances of the populace. Trouble soon followed. Some of the powerful new Pulstir craftsmen forged alliances with one party or another. And as more and more Pulstirs emerged from Marginuvera, they conducted more and more puls.

The greatest trouble came, unsurprisingly, from the old Pundet guard in their ill-fated attempts to merge their decrees with Pulstir findings. Unable to cope with the number of puls, unwilling to so much as state an individual pul's marginuvera, the Pundets issued predictions that confused the people more than they informed them.

Then, one day, unbeknownst to one another, rangers emerged from the Forests of Metta Analisis. Long had each of them observed the Pundets and Pulstirs from afar. Long had they anguished over the amount of time the Pundets spent bullshyting about what the ruler of America would look like after election day rather than discussing in earnest the policies that the shapeshifter would adopt. Long had the rangers shaken their fists at the sky every time Pundets with differing loyalties supported their misbegotten claims with a smattering of gooseberry-picked puls. Long had the rangers tasted vomit at the back of their throats whenever the Pundets at Sea-en-en jabbered about it being a close race when one possible shapeshifting outcome had been on average trailing the other by several points in the last several fortnights of puls.

Each ranger retreated to a secluded cave, where they used the newfangled signal torches of the Intyrnet to broadcast their shrewd aggregation of the Pulstir's predictions. There, they persisted on a diet of espresso, Power Bars, and drops of Mountain Dew. Few hours they slept. In making their predictions, some relied only on the collective information of the puls. Others looked as well to fundamental trends of prosperity in each of America's states. 

Pundets on all (by that, we mean both) sides questioned the rangers' methods, scoffed at the certainty with which the best of them predicted that the next ruler of America would look kind of like a skinny Nelson Mandela, and would support similar policies to the ones he supported back when he had a bigger chin and lighter skin, was lame of leg, and harbored great fondness for elegantly masculine cigarette holders.

On election day, it was the rangers who triumphed, and who collectively became known as the Quants, a moniker that was earlier bestowed upon another group of now disgraced, but equally pasty rangers who may have helped usher in the Great Recession of the early Second Millennium. The trouble is that the number of Quants had increased due to the popularity and controversy surrounding their predictions. While most of the rangers correctly predicted the physiognomy of the president, they had differing levels of uncertainty in the outcome, and their predictions fluctuated to different degrees over the course of the lengthy campaign.

Soon after the election, friends of the Quants, who had also trained in the Forests of Metta Analisis, made a bold suggestion. They argued that, just as the Quants had aggregated the puls to form better predictions about the outcome of the election, we could aggregate the aggregates to make our predictions yet more accurate. 

Four years later, the Meta-Quants broadcast their predictions alongside those of the original Quants. Sure enough, the Meta-Quants predicted the outcome with greater accuracy and precision than the original Quants.

Soon after the election, friends of the Meta-Quants, who had also trained in the Forests of Metta Analisis, made a bold suggestion. They argued that, just as the Meta-Quants had aggregated the Quants to form better predictions about the outcome of the election, we could aggregate the aggregates of the aggregates to make even better predictions.

Four years later, the Meta-Meta-Quants broadcast their predictions alongside those of the Quants and the Meta-Quants. Sure enough, the Meta-Meta-Quants predicted the outcome with somewhat better accuracy and precision than the Meta-Quants, but not as much better as the Meta-Quants had over the Quants. Nobody really paid attention to that part of it.

Which is why, soon after the election, friends of the Meta-Meta-Quants, who had also trained in the Forests of Metta Analisis, made a bold suggestion. They argued that, just as the Meta-Meta-Quants had aggregated the Meta-Quants to form better predictions about the outcome of the election, we could aggregate the aggregates of the aggregates of the aggregates to make even better predictions.

...

One thousand years later, the (Meta x 253)-Quants broadcast their predictions alongside those of all the other types of Quants. By this time, 99.9999999% of Intyrnet communication was devoted to the prediction of the next election, and the rest was devoted to the prediction of the election after that. A Dyson Sphere was constructed around the sun to power the syrvers necessary to compute and communicate the prediction models of the (Meta x 253)-Quants, plus all the other types of Quants. Unfortunately, most of the brilliant people in the Solar System were employed making predictions about elections. Thus the second-rate constructors of the Dyson Sphere accidentally built its shell within the orbit of Earth, blocking out the sun and eventually causing the extinction of life on the planet.

The end.
 
 
UPDATE: Edited out some two-am-induced errors.

As xkcd put it: [xkcd comic embedded in the original post]
Now that we've established that people who analyze polling data might have something there, let's devise ways to compare and contrast the different models. Drew Linzer at votamatic.com has already described his strategy for checking how well his model worked, and has started tweeting some of his post hoc analyses. So has Simon Jackman. As of this moment, Micah Cohen at Nate Silver's FiveThirtyEight blog says "Stay tuned." Darryl Holman is busy covering the Washington State race, but I suspect we'll see some predictive performance analysis from him soon, too.

Tonight (okay, this morning), I want to compare the predictions that three of the modelers made about the electoral vote count, both to show you just how awesome these guys did and to draw some contrasts in the results of their modeling strategies. Darryl Holman, Simon Jackman, and Sam Wang all shared with me the probability distributions of their final electoral vote predictions for Obama. Here are the three probability distributions in the same plot for what I think is the first time.
The first thing to notice is that the two most likely outcomes in each of the models are 303 and 332. Together, these two outcomes are 15%, 30%, and 36% likely for Holman, Jackman, and Wang, respectively.

Three hundred and three votes happens to be the number of electoral votes Obama currently has secured. Three hundred and thirty-two is the number Obama would have if the 29 electoral votes from the remaining toss-up state, Florida, went to him. As most of you know, Obama won the popular vote in Florida, but by a small margin. That's the power of well-designed and well-executed quantitative analysis.

Note, however, that the distributions aren't identical. Jackman's and Wang's distributions are more dispersed, more kurtotic (peaked), and more skewed than Holman's distribution. If you look at Silver's distribution, it is also more dispersed and kurtotic than Holman's. The models also differ in the relative likelihood they give to the two most likely outcomes. Another difference is that Jackman's distribution (and Silver's) has a third most likely outcome favorable to Obama that is much more distinguishable from the noise than it is for Holman's model.

I've argued in a previous post that differences like these are important, if not on election eve, then earlier in the campaign. I've also argued that all of these models together might better predict the election in aggregate than they do on their own. So let's see what these models had to say in aggregate in their final runs before the election. It might seem silly to do this analysis after the election is already over, but, hey, they're still counting Florida.

Here is the average probability distribution of the three models.
Whoopdeedoo. It's an average distribution. Who cares, right? Well, that histogram shows us what the models predicted in aggregate for the 2012 election. The aggregate distribution implies more uncertainty about the two most likely outcomes than some models do (especially Holman's), but less than others (especially Wang's). If we had added Drew Linzer's model and Nate Silver's model, which both predicted a higher likelihood of 332 than of 303 electoral votes, perhaps the uncertainty would have decreased even further in favor of 332. That third outcome also shows up as important in the aggregate model.
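Here's a minimal sketch of that aggregation step, assuming each model reports a probability for every possible electoral vote total; the distributions below are random placeholders, not the modelers' actual numbers:

```python
import numpy as np

rng = np.random.default_rng(2012)
n_ev = 539  # possible electoral vote totals for Obama: 0 through 538

# Placeholder probability distributions standing in for Holman, Jackman, and Wang.
models = rng.dirichlet(np.ones(n_ev), size=3)

# Unweighted aggregate: the simple average of the three distributions.
aggregate = models.mean(axis=0)

# If we had predictive-accuracy weights (e.g., from log-likelihood scores),
# we'd use a weighted average instead.
weights = np.array([1 / 3, 1 / 3, 1 / 3])  # placeholder weights
aggregate_weighted = weights @ models

print(np.argmax(aggregate))  # the aggregate's most likely electoral vote total
```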

Model averaging and model comparison like this would have been helpful earlier in the campaign because it would have given us a sense of what all the models said in aggregate, but also how they differed. The more models we average, and the better we estimate the relative weights to give the models when calculating that average, the better.

Anyway, the outcome that truly matters has already been decided. I admit that I'm happy about it.
 

    about

    Malark-O-blog published news and commentary about the statistical analysis of the comparative truthfulness of the 2012 presidential and vice presidential candidates. It has since closed down while its author makes bigger plans.

    author

    Brash Equilibrium is an evolutionary anthropologist and writer. His real name is Benjamin Chabot-Hanowell. His wife calls him Babe. His daughter calls him Papa.

    what is malarkey?

    It's a polite word for bullshit. Here, it's a measure of falsehood. 0 means you're truthful on average. 100 means you're 100% full of malarkey. Details.

    what is simulated malarkey?

    Fact checkers only rate a small sample of the statements that politicians make. How uncertain are we about the real truthfulness of politicians? To find out, treat fact checker report cards like an experiment, and use random number generators to repeat that experiment a lot of times to see all the possible outcomes. Details.
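A minimal sketch of that idea, assuming a report card boils down to a count of statements rated and a count rated false; the numbers and the simple binomial resampling here are illustrative, not Malark-O-Meter's exact method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical report card: 60 statements rated, 33 of them rated false.
n_rated, n_false = 60, 33

# Treat the report card as one run of an experiment and rerun it many times
# to see how much the malarkey score could plausibly vary.
simulated_false = rng.binomial(n_rated, n_false / n_rated, size=100_000)
malarkey = 100 * simulated_false / n_rated

print(malarkey.mean(), np.percentile(malarkey, [2.5, 97.5]))
```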

    malark-O-glimpse

    Can you tell the difference between the 2012 presidential election tickets from just a glimpse at their simulated malarkey score distributions?

    [Figure: simulated malarkey score distributions; dark = pres, light = vp]

    fuzzy portraits of malarkey

    Simulated distributions of malarkey for each 2012 presidential candidate with 95% confidence interval on either side of the simulated average malarkey score. White line at half truthful. (Rounded to nearest whole number.)

    • 87% certain Obama is less than half full of malarkey.
    • 100% certain Romney is more than half full of malarkey.
    • 66% certain Biden is more than half full of malarkey.
    • 70% certain Ryan is more than half full of malarkey.
    (Probabilities rounded to nearest percent.)

    fuzzy portraits of ticket malarkey

    Simulated distributions of collated and average malarkey for each 2012 presidential election ticket, with 95% confidence interval labeled on either side of the simulated malarkey score. White line at half truthful. (Rounded to nearest whole number.)

    [Figure: malarkometer fuzzy ticket portraits, 2012-10-16]
    • 81% certain Obama/Biden's collective statements are less than half full of malarkey.
    • 100% certain Romney/Ryan's collective statements are more than half full of malarkey.
    • 51% certain the Democratic candidates are less than half full of malarkey.
    • 97% certain the Republican candidates are on average more than half full of malarkey.
    • 95% certain the candidates' statements are on average more than half full of malarkey.
    • 93% certain the candidates themselves are on average more than half full of malarkey.
    (Probabilities rounded to nearest percent.)

    Comparisons

    Simulated probability distributions of the difference between the malarkey scores of one 2012 presidential candidate or party and another, with 95% confidence interval labeled on either side of the simulated mean difference. Blue bars are when Democrats spew more malarkey, red when Republicans do. White line and purple bar at equal malarkey. (Rounded to nearest hundredth.)

    • 100% certain Romney spews more malarkey than Obama.
    • 55% certain Ryan spews more malarkey than Biden.
    • 100% certain Romney/Ryan collectively spew more malarkey than Obama/Biden.
    • 94% certain the Republican candidates spew more malarkey on average than the Democratic candidates.
    (Probabilities rounded to nearest percent.)

    2012 prez debates

    presidential debates

    Simulated probability distribution of the malarkey spewed by individual 2012 presidential candidates during debates, with 95% confidence interval labeled on either side of simulated mean malarkey. White line at half truthful. (Rounded to nearest whole number.)

    • 66% certain Obama was more than half full of malarkey during the 1st debate.
    • 81% certain Obama was less than half full of malarkey during the 2nd debate.
    • 60% certain Obama was less than half full of malarkey during the 3rd debate.
    (Probabilities rounded to nearest percent.)

    • 78% certain Romney was more than half full of malarkey during the 1st debate.
    • 80% certain Romney was less than half full of malarkey during the 2nd debate.
    • 66% certain Romney was more than half full of malarkey during the 3rd debate.
    (Probabilities rounded to nearest percent.)

    aggregate 2012 prez debate

    Distributions of malarkey for collated 2012 presidential debate report cards and the average presidential debate malarkey score.
    • 68% certain Obama's collective debate statements were less than half full of malarkey.
    • 68% certain Obama was less than half full of malarkey during the average debate.
    • 67% certain Romney's collective debate statements were more than half full of malarkey.
    • 57% certain Romney was more than half full of malarkey during the average debate.
     (Probabilities rounded to nearest percent.)

    2012 vice presidential debate

    • 60% certain Biden was less than half full of malarkey during the vice presidential debate.
    • 89% certain Ryan was more than half full of malarkey during the vice presidential debate.
    (Probabilities rounded to nearest percent.)

    overall 2012 debate performance

    Malarkey score from collated report card comprising all debates, and malarkey score averaged over candidates on each party's ticket.
    • 72% certain Obama/Biden's collective statements during the debates were less than half full of malarkey.
    • 67% certain the average Democratic ticket member was less than half full of malarkey during the debates.
    • 87% certain Romney/Ryan's collective statements during the debates were more than half full of malarkey.
    • 88% certain the average Republican ticket member was more than half full of malarkey during the debates.

    (Probabilities rounded to nearest percent.)

    2012 debate self comparisons

    Simulated probability distributions of the difference in malarkey that a 2012 presidential candidate spews normally compared to how much they spewed during a debate (or aggregate debate), with 95% confidence interval labeled on either side of the simulated mean difference. Light bars mean less malarkey was spewed during the debate than usual, dark bars more. White bar at equal malarkey. (Rounded to nearest hundredth.)

    individual 2012 presidential debates

    • 80% certain Obama spewed more malarkey during the 1st debate than he usually does.
    • 84% certain Obama spewed less malarkey during the 2nd debate than he usually does.
    • 52% certain Obama spewed more malarkey during the 3rd debate than he usually does.
    • 51% certain Romney spewed more malarkey during the 1st debate than he usually does.
    • 98% certain Romney spewed less malarkey during the 2nd debate than he usually does.
    • 68% certain Romney spewed less malarkey during the 3rd debate than he usually does.

    (Probabilities rounded to nearest percent.)

    aggregate 2012 presidential debate

    • 58% certain Obama's statements during the debates were more full of malarkey than they usually are.
    • 56% certain Obama spewed more malarkey than he usually does during the average debate.
    • 73% certain Romney's statements during the debates were less full of malarkey than they usually are.
    • 86% certain Romney spewed less malarkey than he usually does during the average debate.

    (Probabilities rounded to nearest percent.)

    vice presidential debate

    • 70% certain Biden spewed less malarkey during the vice presidential debate than he usually does.
    • 86% certain Ryan spewed more malarkey during the vice presidential debate than he usually does.

    (Probabilities rounded to nearest percent.)

    2012 opponent comparisons

    Simulated probability distributions of the difference in malarkey between the Republican candidate and the Democratic candidate during a debate, with 95% confidence interval labeled on either side of the simulated mean difference. Blue bars are when Democrats spew more malarkey, red when Republicans do. White bar at equal malarkey. (Rounded to nearest hundredth.)

    individual 2012 presidential debates

    • 60% certain Romney spewed more malarkey during the 1st debate than Obama.
    • 49% certain Romney spewed more malarkey during the 2nd debate than Obama.
    • 72% certain Romney spewed more malarkey during the 3rd debate than Obama.

    (Probabilities rounded to nearest percent.)

    aggregate 2012 presidential debate

    • 74% certain Romney's statements during the debates were more full of malarkey than Obama's.
    • 67% certain Romney was more full of malarkey than Obama during the average debate.

    (Probabilities rounded to nearest percent.)

    vice presidential debate

    • 92% certain Ryan spewed more malarkey than Biden during the vice presidential debate.

    (Probabilities rounded to nearest percent.)

    overall 2012 debate comparison

    Party comparison of 2012 presidential ticket members' collective and individual average malarkey scores during debates.
    • 88% certain that Republican ticket members' collective statements were more full of malarkey than Democratic ticket members'.
    • 86% certain that the average Republican candidate spewed more malarkey during the average debate than the average Democratic candidate.

    (Probabilities rounded to nearest percent.)

    observe & report

    Below are the observed malarkey scores of the 2012 presidential candidates, along with comparisons formed from those scores.

    2012 prez candidates

    Truth-O-Meter only (observed)

    candidate malarkey
    Obama 44
    Biden 48
    Romney 55
    Ryan 58

    The Fact Checker only (observed)

    candidate malarkey
    Obama 53
    Biden 58
    Romney 60
    Ryan 47

    Averaged over fact checkers

    candidate malarkey
    Obama 48
    Biden 53
    Romney 58
    Ryan 52

    2012 Red prez vs. Blue prez

    Collated bullpucky

    ticket malarkey
    Obama/Biden 46
    Romney/Ryan 56

    Average bullpucky

    ticket malarkey
    Obama/Biden 48
    Romney/Ryan 58

    2012 prez debates

    1st presidential debate

    opponent malarkey
    Romney 61
    Obama 56

    2nd presidential debate (town hall)

    opponent malarkey
    Romney 31
    Obama 33

    3rd presidential debate

    opponent malarkey
    Romney 57
    Obama 46

    collated presidential debates

    opponent malarkey
    Romney 54
    Obama 46

    average presidential debate

    opponent malarkey
    Romney 61
    Obama 56

    vice presidential debate

    opponent malarkey
    Ryan 68
    Biden 44

    collated debates overall

    ticket malarkey
    Romney/Ryan 57
    Obama/Biden 46

    average debate overall

    ticket malarkey
    Romney/Ryan 61
    Obama/Biden 56

    the raw deal

    You've come this far. Why not just check out the raw data Malark-O-Meter is using? I promise you: it is as riveting as a phone book.

