Last year, while living away from my family for a year to do ethnographic fieldwork in a remote village on a tiny Lesser Antillean island, I kept myself sane and connected to the political news in my home country by creating a new hobby. I applied my knowledge of inferential statistics and computational simulation, using fact checker reports from PolitiFact.com and The Fact Checker at The Washington Post to comparatively judge the truthfulness of the 2012 presidential and vice presidential candidates, and (more importantly) to measure our uncertainty in those judgments.
The site (and its syndication on the Daily Kos) generated some good discussion, some respectable traffic, and (I hope) showed its followers the potential for a new kind of inference-driven fact checking journalism. My main conclusions from the 2012 election analysis were:
(1) The candidates aren't as different as partisans left or right would have us believe.
(2) But the Democratic ticket was somewhat more truthful than the Republican ticket, both overall, and during the debates.
(3) It's quite likely that the 2012 Republican ticket was less truthful than the 2008 Republican ticket, and somewhat likely that the 2012 Democratic ticket was less truthful than the 2008 Democratic ticket.
Throughout, I tempered these conclusions with the recognition that my analyses did not account for the possible biases of fact checkers, including biases toward fairness, newsworthiness, and, yes, political beliefs. Meanwhile, I discussed ways to work toward measuring these biases and adjusting measures of truthfulness for them. I also suggested that fact checkers should begin in earnest to acknowledge that they aren't just checking facts, but the logical validity of politicians' arguments as well. That is, fact checkers should also become fallacy checkers who gauge the soundness of an argument, not simply the truth of its premises.
Now, it's time to close up shop. Not because I don't plan on moving forward with what I'm proud to have done here. I'm closing up shop because I have much bigger ideas.
I've started writing up a master plan for a research institute and social media platform that will revolutionize fact checking journalism. For now, I'm calling the project Sound Check. I might have to change the name because that domain name is taken. Whatever its eventual name, Sound Check will be like FiveThirtyEight meets YouGov meets PolitiFact meets RapGenius: data-driven soundness checking journalism and research on an annotated social web. You can read more about the idea in this draft executive summary.
Anyway, over the next three years (and beyond!), I hope you're going to hear a lot about this project. Already, I've started searching for funding so that I can, once I obtain my PhD in June 2014, start working full time on Sound Check.
One plan is to become an "Upstart". Upstart is a new idea from some ex-Googlers. At Upstart, individual graduates hedge their personal risk by looking for investor/mentors, who gain returns from the Upstart's future income (which is predicted by a proprietary algorithm owned by Upstart). Think of it as a capitalist, mentoring-focused sort of patronage. Unlike Kickstarter and other crowd-funding mechanisms, where patrons get feel-good vibes and rewards, Upstart investors invest in a person the way they would invest in a company.
Another plan is, of course, to go the now almost traditional crowd-funding route, but only for clearly defined milestones of the project. For example, first I'd want to get funding to organize a meet-up of potential collaborators and investors. Next I'd want funding for beta-testing the sound checking algorithm. After that, I'd seek funding for a beta test of the social network aspect of Sound Check. Perhaps these (hopefully successful) crowd-funded projects would create interest among heavy-hitting investors.
Yet another idea is to entice some university (UW?) and some wealthy person or group interested in civic engagement and political fact checking to partner with Sound Check, much as FactCheck.org grew out of the Annenberg Public Policy Center at the University of Pennsylvania.
Sound Check is a highly ambitious idea. It will need startup funding for servers, programmers, and administrative staff, as well as for training and maintaining Sound Checkers (that is, fact checkers who also check fallacies). So I've got my work cut out for me. I'm open to advice and new mentors. And soon, I'll be open, along with Sound Check, to investors and donors.
Glenn Kessler, Fact Checker at The Washington Post, gave two out of four Pinocchios to Barney Frank, who claimed that GOP gerrymandering allowed Republicans to maintain their House majority. Kessler would have given Frank three Pinocchios, but Frank publicly recanted his statement in a live television interview. Here at Malark-O-Meter, we equate a score of three Pinocchios with a PolitiFact Truth-O-Meter score of "Mostly False". Kessler was right to knock off a Pinocchio for Frank's willingness to publicly recant his claim. I'll explain why Kessler's fact check was correct, and why he was right to be lenient on Frank.
Frank was wrong because, as a Brennan Center for Justice report suggests, the Democrats wouldn't have won the House majority even before the 2010 redistricting. Although the Republicans clearly won the latest redistricting game, redistricting doesn't fully explain how they maintained their majority. The other factor is geography. Dan Hopkins at The Monkey Cage cited a study by Chen and Rodden showing that Democrats are clustered inefficiently in urban areas. Consequently, they get big Congressional wins in key urban districts, but at the cost of small-margin losses in the majority of districts. (And no, fellow fans of the Princeton Election Consortium, it doesn't matter that the effect is even bigger than the one Sam Wang predicted; it's still not only because of redistricting.)
So why was Kessler right to knock off a Pinocchio for Barney's willingness to recant? At Malark-O-Meter, we see fact checker report cards as a means to measure the overall factuality of individuals and groups. If an individual recants a false statement, that individual's marginal factuality should go up in our eyes for two reasons. First, that person made a statement that adheres to the facts. Second, the act of recanting a falsehood is a testament to one's adherence to the facts.
Regardless of its causes, and no matter what Barney's malarkey score ends up being because of his remarks about it, what do we make of the disparity between the popular vote and the House seat margin, which has occurred only three other times in the last century? Should we modify U.S. Code, Title 2, Chapter 1, Section 2c (2 USC § 2c), which became law in 1967 and requires states with more than one apportioned Representative to be divided into one-member districts? Should we instead go with a general ticket, which gives all House seats to the party that wins a state's popular vote? Is there some sensible middle ground? (Of course there is.)
The answer to these questions depends critically on the role we want the geographic distribution of the U.S. population to play in determining the composition of the House. The framers of the Constitution meant for the House of Representatives to be the most democratic body of the national government, which is why we apportion Representatives based on the Census, and why there are more Representatives than Senators. Clearly, it isn't democratic for our redistricting rules to be vague enough that a party can benefit simply by holding the House majority in a Census year. Is it also undemocratic to allow the regional geography of the United States to determine the House composition?
I don't think so. Instead, the geographic distribution of humans in the United States should determine the House composition. There are a bunch of redistricting algorithms out there that would help this happen. The underlying theme of the best algorithms is that Congressional districts should have comparable population size. Let's just pick an algorithm and do it already. And if we're not sure which of these algorithms is the best one, let's just do them all and take the average.
In the aftermath of the 2012 election, campaign prognosticators Nate Silver, Simon Jackman, Drew Linzer, and Sam Wang made preliminary quantitative assessments of how well their final predictions played out. Others have posted comparisons of these and other election prediction and poll aggregation outfits. Hopefully, we'll one day compare and combine the models based on their long-term predictive power. To compare and combine models effectively, we need a good quantitative measure of their accuracy. The prognosticators have used something called the Brier score to measure the accuracy of their election eve predictions of state-level outcomes. Despite its historical success in measuring forecast accuracy, the Brier score fails in at least two ways as a forecast score. I'll review its inadequacies and suggest a better method.
The Brier score measures the accuracy of binary probabilistic predictions. To calculate it, take the average squared difference between the forecast probability of a given outcome (e.g., Obama winning the popular vote in California) and the observed probability that the event occurred (e.g., one if Obama won, zero if he didn't). The higher the Brier score, the worse the predictive accuracy. As Nils Barth suggested to Sam Wang, you can also calculate a normalized Brier score by subtracting four times the Brier score from one. A normalized Brier score compares the predictive accuracy of a model to the predictive accuracy of a model that perfectly predicted the outcomes. The higher the normalized Brier score, the greater the predictive accuracy.
Because the Brier score (and its normalized cousin) measures predictive accuracy, I've suggested that we can use it to construct certainty weights for prediction models, which we could then use when calculating an average model that combines the separate models into a meta-prediction. Recently, I've discovered research in the weather forecasting community about a better way to score forecast accuracy. This new score ties directly to a well-studied model averaging mechanism. Before describing the new scoring method, let's describe the problems with the Brier score.
Jewson notes that the Brier score doesn't deal adequately with very improbable or probable events. For example, suppose that the probability that a Black Democrat wins Texas is 1 in 1000. Suppose we have one forecast model that predicts Obama will surely lose in Texas, whereas another model predicts that Obama's probability of winning is 1 in 400. Well, Obama lost Texas. The Brier score would tell us to prefer the model that predicted a sure loss for Obama. Yet the model that gave him a small probability of winning is closer to the "truth" in the sense that it estimates he has a small probability of winning. In addition to its poor performance scoring highly improbable and probable events, the Brier score doesn't perform well when scoring very poor forecasts (Benedetti 2010; sorry for the paywall).
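To see the problem numerically, we can compare the two scores in expectation under the assumed "true" win probability of 1 in 1000. The forecasts below are the hypothetical ones from the example; the log likelihood score discussed below is shown for contrast:

```python
import math

def expected_brier(f, p):
    """Expected Brier score of forecast f when the true win probability is p."""
    return p * (f - 1) ** 2 + (1 - p) * f ** 2

def expected_log_score(f, p):
    """Expected log likelihood score of forecast f under true probability p."""
    win = math.log(f) if f > 0 else float("-inf")
    lose = math.log(1 - f) if f < 1 else float("-inf")
    return p * win + (1 - p) * lose

p_true = 1 / 1000                 # assumed true probability of the upset
for f in (0.0, 1 / 400):          # the "sure loss" model and the "1 in 400" model
    print(f, expected_brier(f, p_true), expected_log_score(f, p_true))
```

The expected Brier scores of the two models differ by barely one part in a million, and the Brier score actually prefers the sure-loss forecast; the log score, by contrast, treats a sure-thing forecast as infinitely bad the moment there's any chance of the upset.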
These issues with the Brier score should give prognosticators pause for two reasons. First, they suggest that the Brier score will not perform well in the "safe" states of a given party. Second, they suggest that Brier scores will not perform well for models whose predictions were poor (here's lookin' at you, Bickers and Berry). So what should we do instead? It's all about the likelihood. Well, actually, its logarithm.
Both Jewson and Benedetti convincingly argue that the proper score of forecast accuracy is something called the log likelihood. A likelihood is the probability of a set of observations given the model of reality that we assume produced those observations. As Jewson points out, the likelihood in our case is the probability of a set of observations (i.e., which states Obama won) given the forecasts associated with those observations (i.e., the forecast probability that Obama would win those states). A score based on the log likelihood heavily penalizes forecasts that were very certain of the wrong outcome, giving the lowest possible score to a model that was perfectly certain of an outcome that didn't happen.
To compare the accuracy of two models, simply take the difference in their log likelihoods. To calculate model weights, first subtract the maximum log likelihood score across all the models from each model's score. Then exponentiate the difference you just calculated. Then divide each model's exponentiated difference by the sum of those values across all the models. Voila: a model averaging weight.
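Here's a minimal sketch of that recipe, with invented forecasts for two hypothetical models. One standard way to implement it (equivalent to Akaike-style weights) subtracts each model's log likelihood from the best model's before exponentiating, so the best model's raw weight is 1 before normalizing:

```python
import math

def log_score(forecasts, outcomes):
    """Sum of log probabilities the forecasts assigned to what actually happened."""
    return sum(math.log(f if o == 1 else 1 - f) for f, o in zip(forecasts, outcomes))

def model_weights(scores):
    """Turn log likelihood scores into normalized model-averaging weights."""
    best = max(scores)
    raw = [math.exp(s - best) for s in scores]
    total = sum(raw)
    return [r / total for r in raw]

outcomes = [1, 1, 0]           # hypothetical state outcomes
model_a = [0.85, 0.60, 0.10]   # invented forecast probabilities
model_b = [0.95, 0.55, 0.30]
scores = [log_score(m, outcomes) for m in (model_a, model_b)]
weights = model_weights(scores)
print(weights)  # the weights sum to 1; the better-scoring model gets more
```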
Some problems remain. For starters, we haven't factored Occam's razor into our scoring of models. Occam's razor, of course, is the idea that, all else equal, simpler models are better than complex models. Some of you might notice that the model weight calculation in the previous paragraph is identical to the calculation of model weights from the information criterion scores of models that have the same number of variables. I argue that we can ignore Occam's razor for our purposes. What we're doing is measuring a model's predictive accuracy, not its fit to previous observations. I leave it up to the first order election prognosticators to decide which parameters they include in their models. In making meta election forecasts, I'll let the models' actual predictive performance decide which ones should get more weight.
UPDATE: Edited out some two-am-induced errors.
Now that we've established that people who analyze polling data might have something there, let's devise ways to compare and contrast the different models. Drew Linzer at votamatic.com already described his strategy for checking how well his model worked, and started Tweeting some of his post hoc analyses. So did Simon Jackman. As of this moment, Micah Cohen at Nate Silver's FiveThirtyEight blog says "Stay tuned." Darryl Holman is busy covering the Washington State race, but I suspect we'll see some predictive performance analysis from him soon, too.
Tonight (okay, this morning), I want to compare the predictions that three of the modelers made about the electoral vote count to show you just how awesome these guys did, but also to draw some contrasts in the results of their modeling strategy. Darryl Holman, Simon Jackman, and Sam Wang all shared the probability distribution of their final electoral vote predictions for Obama with me. Here are the three probability distributions in the same plot for what I think is the first time.
The first thing to notice is that the two most likely outcomes in each of the models are 303 and 332 electoral votes. Together, these two outcomes are 15%, 30%, and 36% likely for Holman, Jackman, and Wang, respectively.
Three hundred and three votes happens to be the number of votes Obama currently has secured. Three hundred and thirty-two votes would be the number Obama would have if 29 electoral votes from the remaining toss-up state, Florida, went to him. As most of you know, Obama won the popular vote in Florida, but by a small margin. That's the power of well designed and executed quantitative analysis.
Note, however, that the distributions aren't identical. Jackman's and Wang's distributions are more dispersed, more kurtotic (peaked), and more skewed than Holman's distribution. If you look at Silver's distribution, it is also more dispersed and kurtotic than Holman's. The models also differ in the relative likelihood they give to the two most likely outcomes. Another difference is that Jackman's distribution (and Silver's) has a third most likely outcome favorable to Obama that is much more distinguishable from the noise than it is for Holman's model.
I've argued in a previous post that differences like these are important, if not on election eve, then earlier in the campaign. I've also argued that all of these models together might better predict the election in aggregate than they do on their own. So let's see what these models had to say in aggregate in their final runs before the election. It might seem silly to do this analysis after the election is already over, but, hey, they're still counting Florida.
Here is the average probability distribution of the three models.
Whoopdeedoo. It's an average distribution. Who cares, right? Well that histogram shows us what the models predicted in aggregate for the 2012 election. The aggregate distribution leads to more uncertainty regarding the two most likely outcomes than for some models (especially Holman), but less uncertainty for others (especially Wang). If we had added Drew Linzer's model and Nate Silver's model, which both predicted higher likelihood of 332 than 303 electoral votes, perhaps the uncertainty would have decreased even more in favor of 332. That third outcome also shows up as important in the aggregate model.
Model averaging and model comparison like this would have been helpful earlier in the campaign because it would have given us a sense of what all the models said in aggregate, but also how they differed. The more models we average, and the better we estimate the relative weights to give the models when calculating that average, the better.
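The averaging step itself is simple. Here's a sketch using invented stand-ins for the models' distributions (the real distributions put probability on every possible electoral vote count):

```python
def average_distribution(distributions, weights=None):
    """Weighted average of several discrete probability distributions."""
    weights = weights or [1 / len(distributions)] * len(distributions)
    avg = {}
    for dist, w in zip(distributions, weights):
        for ev, p in dist.items():
            avg[ev] = avg.get(ev, 0.0) + w * p
    return avg

# Invented probabilities over a few electoral vote totals, one dict per model.
holman = {303: 0.08, 332: 0.07, 290: 0.03}
jackman = {303: 0.12, 332: 0.18, 347: 0.05}
wang = {303: 0.16, 332: 0.20}
print(average_distribution([holman, jackman, wang]))
```

With equal weights this is a plain mixture of the models; plugging in the likelihood-based certainty weights discussed earlier would let better-scoring models pull the aggregate toward their predictions.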
Anyway, the outcome that truly matters has already been decided. I admit that I'm happy about it.
Tomorrow is election day. At this point, the probability is vanishingly small that any single fact checked statement from any of the 2012 presidential or vice presidential candidates could sway anyone's decision. It's similarly unlikely that an analysis of the aggregate truthfulness of the candidates will influence votes. Since everyone's already decided, now is a good time to reflect. On the malarkey scale, how do the presidential candidates this election year compare to the candidates in 2008? Specifically, did the candidates spew more malarkey or less? I'll use Malark-O-Meter's factuality scale and statistical analysis tools to address this question. The question is an important one because it gives us some insight into how the shifts in political climate since Obama entered office have influenced campaign politics.
The malarkey score uses fact checker rulings from PolitiFact's Truth-O-Meter, The Washington Post's Fact Checker's Pinocchio scale, or both to measure the average falsehood of the statements that individuals or groups make. These organizations rate the factuality of statements using categories that range from true to false. Malark-O-Meter turns these categories into numbers, then averages the numeric ratings of an individual's or group's statements. The result is a score that ranges from 0, which suggests that 0% of what comes out of your mouth is malarkey, to 100, which suggests that 100% of what you say is malarkey. For more details on its calculation, read this. For caveats to the validity of this measure, read this. For a justification of comparing truthfulness among individuals like I do, read this.
Karen S. at Politi-Psychotics shared with me her collection of all PolitiFact rulings for Obama, Biden, Romney, Ryan, McCain, and Palin. I used that data to construct malarkey scores that estimate the falseness of Obama, Biden, McCain, and Palin as of October 30, 2008. I would have included The Fact Checker's rulings, but Kessler's column wasn't a permanent part of WaPo until 2011, and I was unable to collect all the necessary data in time for election eve. To calculate the malarkey scores for the 2012 candidates, I calculated the malarkey score separately from each candidate's Truth-O-Meter report card as of October 30, 2012.
You might wonder why I don't limit the scope of the 2012 malarkey scores to the campaign season. It's because I'm not trying to measure the malarkey spewed during a campaign season. I'm trying to measure the overall factuality of a presidential hopeful. Back in 2008, Obama and Biden might have been more or less factual than they have become in the last four years. Yet we shouldn't be blind to the malarkey that they or the two 2012 Republican candidates spewed before 2011. Moreover, we shouldn't punish candidates back in 2008 for statements that they haven't made yet.
With those caveats out of the way, let's see what the observed malarkey scores are before measuring our statistical uncertainty in them. Our aim is to get a better understanding of how our beliefs about the factuality of the two campaigns compare at identical points in their history.
If you take the observed data at face value, it suggests that there are some candidate-level differences between the two election years. The differences aren't big. Still, in four years, the candidates range between zero and 12 percent more full of malarkey in 2012 than their counterparts in 2008. Curiously, the data also suggest that, according to her PolitiFact report card, Palin spewed less malarkey during the campaign than McCain (my how things have changed since then). But with what degree of certainty can we make such statements given the evidence we have? Enter Malark-O-Meter's statistical methods for estimating and comparing malarkey scores.
Basically, we treat fact checker report cards as a sort of experiment that gauges the factuality of a particular candidate. Because fact checkers rate only a small number of an individual's statements, there's uncertainty in these measures that arises from sampling error. This is particularly important because sampling error is higher in smaller samples. Fewer statements were collected for Obama in 2008 than in 2012, which increases our level of uncertainty in the comparisons we make between Obama's performance in the two years. The number of statements that have been fact checked also differs across the four candidates, with presidential candidates fact checked more than vice presidential candidates and Democratic candidates fact checked more than Republican candidates.
Anyway, we can use probability theory to simulate the universe of possible malarkey scores given the data that fact checkers have collected, then estimate the likelihood that a candidate's malarkey score is a particular value. Let's start with the observation that each candidate's malarkey scores from 2012 were higher than the malarkey scores of their counterparts in 2008.
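A minimal version of that simulation treats a report card as a multinomial sample and resamples it many times to see how far the malarkey score could plausibly wander. The counts and the category-to-number mapping here are invented for illustration:

```python
import random

SCORES = [0, 25, 50, 75, 100, 100]  # "True" ... "False", "Pants on Fire"

def simulate_scores(counts, n_sims=10_000, rng=random.Random(42)):
    """Resample a report card of the same size many times; return the scores."""
    total = sum(counts)
    probs = [c / total for c in counts]
    sims = []
    for _ in range(n_sims):
        draws = rng.choices(range(len(counts)), weights=probs, k=total)
        sims.append(sum(SCORES[d] for d in draws) / total)
    return sims

sims = sorted(simulate_scores([20, 30, 25, 15, 8, 2]))  # invented report card
lo, hi = sims[int(0.025 * len(sims))], sims[int(0.975 * len(sims))]
print(f"95% interval: {lo:.1f} to {hi:.1f}")
```

With only 100 rated statements, the 95% interval spans roughly a dozen points on the malarkey scale, which is why small observed differences between candidates deserve skepticism. The same simulated distributions also let you estimate the probability that one candidate's score exceeds another's: just count how often one simulated score beats the other.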
It turns out that we can be 95% confident that Barack Obama's malarkey score was between 6% smaller and 14% larger going up to this election year than it was up to 2008. The odds are about 3 to 1 that if we repeated this experiment, we'd find Obama to have spewed more malarkey by 2012 than he had by 2008.
The comparison isn't as clear for Biden because we have less data than we do for Obama. We can be 95% confident that Biden's malarkey score was between 18% smaller and 24% larger by 2012 than it was by 2008. If we repeated this experiment, it is a coin toss whether we'd again find that Biden spewed more or less malarkey in 2012 than 2008. So we can't tell a difference between Biden 2008 and Biden 2012.
Is Romney today more or less truthful as a Republican presidential candidate than McCain was by 2008? We can be 95% confident that Romney's malarkey score was between 6% smaller and 13% larger by 2012 than McCain's was by 2008. The odds are about 3 to 1 that if we did this experiment again, we'd find Romney to have spewed more malarkey by 2012 than McCain had by 2008.
Is Ryan more or less truthful as a Republican presidential candidate than Palin was in 2008? We can be 95% certain that Ryan's malarkey score was between 5% smaller and 56% larger by 2012 than Palin's was by 2008. The odds are better than 15 to 1 that Ryan spewed more malarkey by 2012 than Palin had by 2008.
The finding about Ryan and Palin strikes me because Palin has spewed numerous falsehoods in her selfish bid for wingnut fame since the 2008 election. Yet Ryan is touted as the facts man of the GOP. Remember, however, that Palin hadn't gone rogue until late in the campaign, and her truthiness has only exploded since then. Moreover, the story in 2008 wasn't so much that Palin was false. It was more that she didn't know her ass from her elbow (which is a reminder that factuality isn't the only important characteristic to look for in a candidate). Recall that in the 2008 vice presidential debate, she said little that was even worthy of fact checking. By comparison, Ryan's rhetoric makes fact checkers salivate because he often ties numbers and report findings to his arguments. He's actually quite knowledgeable, albeit bullshittingly so.
That's the picture for the individual positions on the campaign ticket. What is the picture for the tickets as a whole? Here are the observed collated malarkey scores for the party tickets in 2008 versus 2012. For each party, the collated score sums up the statements in each category that the two members of a ticket made. For this reason, collated malarkey measures the average amount of malarkey in the statements made collectively by the members of a ticket.
[Charts: collated malarkey scores for the 2008 tickets and the 2012 tickets]
Again, it looks like there are small differences between the two years. Let's see what statistical confidence we can place in that assessment.
We can be 95% certain that Obiden's collated malarkey score is between 6% smaller and 13% larger by 2012 than it was by 2008. If we repeated this experiment, the odds are a bit less than 5 to 2 that we would again conclude that Obiden 2012 has spewed more malarkey than Obiden 2008.
What about Rymney versus McPalin? We can be 95% certain that Rymney's collated malarkey score is between 2% smaller and 18% larger by 2012 than McPalin's was by 2008. The odds are just under 19 to 1 that if we repeated this experiment, we'd find Rymney to have spewed more malarkey than McPalin.
At the ticket level, we can be fairly confident that each party's ticket collectively spewed more malarkey by election 2012 than its counterpart had by election 2008. Also note that we can be more certain that Rymney spews more malarkey than McPalin than that Obiden 2012 spews more malarkey than Obiden 2008.
The collated malarkey score rates the average falsehood of the statements a ticket makes. What about the average falsehood of the members of each ticket? That's what the member average malarkey score measures. Simply calculate the malarkey for each candidate on a ticket, then average the malarkey scores of the candidates on that ticket. Here are the observed malarkey scores.
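The two ticket-level measures differ in how they weight the candidates: collation weights by statement count, while the member average counts both candidates equally. A sketch with invented report cards (counts ordered from "True" to "Pants on Fire", using an assumed evenly spaced mapping) shows how they can disagree:

```python
CATEGORY_SCORES = [0, 25, 50, 75, 100, 100]  # "True" ... "Pants on Fire"

def score(card):
    """Malarkey score of one report card (a list of counts per category)."""
    return sum(s * n for s, n in zip(CATEGORY_SCORES, card)) / sum(card)

def collated(card_a, card_b):
    """Pool the two candidates' statements, then score the pooled card."""
    return score([a + b for a, b in zip(card_a, card_b)])

def member_average(card_a, card_b):
    """Score each candidate separately, then average the two scores."""
    return (score(card_a) + score(card_b)) / 2

pres = [40, 60, 50, 30, 16, 4]  # invented counts: 200 rated statements
veep = [5, 10, 12, 8, 4, 1]     # invented counts: 40 rated statements
print(collated(pres, veep), member_average(pres, veep))
```

Because the presidential candidate here has five times as many rated statements, the collated score leans toward his record, while the member average gives the less-checked running mate equal say.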
Again, we observe small differences. Given the data, how certain can we be that these differences exist?
We can be 95% confident that Obama's and Biden's malarkey scores are on average between 10% smaller and 14% larger by the 2012 election than they were by the 2008 election. The odds are only about 3 to 2 that we're right in saying that Obama and Biden were on average more full of malarkey by 2012 than they were by 2008.
For the Republicans, we can be 95% confident that Romney's and Ryan's malarkey scores are on average between 2% smaller and 27% larger by this election year than McCain's and Palin's were by election 2008. The odds are 19 to 1 that we'd be right in saying that Romney and Ryan were on average more full of malarkey by election 2012 than McCain and Palin were by election 2008.
Again, the evidence that the newest Republican ticket is less factual than the 2008 Republican ticket is stronger than the evidence that Obiden 2012 is on average less factual than Obiden 2008. Except this time, the difference is in the average factuality of the members on each ticket. Together, these findings are consistent with Politi-Psychotics' evidence that Republicans have become less factual since Obama took office, a phenomenon that can't be interpreted as evidence that PolitiFact has become more partisan since its separation from Congressional Quarterly.
Two steps remain in this analysis, both pertaining to the overall truthfulness of the candidates in each election year. First, let's compare the collated malarkey score of all candidates, regardless of party, between election 2008 and election 2012 (remember, collated scores add up all the statements in each category for all the individuals included, then calculate a malarkey score from the collated report card). For both years, the collated malarkey score of all candidates is 48, just under half full of malarkey (actually, the malarkey score is very slightly smaller for the 2008 election, but we round up to the nearest whole number). We might as well toss a coin to decide whether the candidates' statements were collectively more or less full of malarkey running up to the 2012 election than the 2008 election.
The differences are clearer when we look at the average malarkey score of the candidates by election year. In 2008, the candidates were on average 42% full of malarkey. The 2012 candidates are on average 44% full of malarkey. How statistically confident can we be in saying that the 2008 candidates were on average more truthful than the 2012 candidates? Sadly, we can be about 92% confident in this conclusion.
The difference in the candidates' average malarkey between this election year and election 2008 is only two points on the malarkey scale. Yet if the trend continues over the next four presidential elections, then most of what our presidential and vice presidential candidates say could be false by the time my daughter can decide whether or not to give a president a chance at a second term. If the trend continues another six or seven elections after that, then my unborn grandchild will be choosing between two sets of liars.
So what explains the trend? There are at least two hypotheses. First, PolitiFact's rulings could have become tougher since 2008. I doubt this, but I can't rule it out completely without analyzing the full set of PolitiFact rulings.
An alternative explanation reflects current political reality. Our country has become increasingly polarized in recent years at the same time that the stakes of obtaining our country's highest political office have increased. Consequently, the premium on strategic deception is higher. This hypothesis is consistent with my finding that neither ticket has a solid record of factuality. The hypothesis also jibes with the chilling fact that, as actor Rainn Wilson recently Tweeted,
There's a figure more damning than any malarkey score.
(UPDATE 2012-11-02: I made some changes to the prose to increase readability.)
Recently, I got into a slap fight with PolitiFactBias.com (PFB). The self-proclaimed PolitiFact whistle blower bristled at my claim that my estimate of the partisan bias among two leading fact checkers is superior to theirs. A recurring theme in the debate surrounded PFB's finding that PolitiFact.com's "Pants on Fire" category, which PolitiFact reserves for egregious statements, occurs much more often for Republicans than for Democrats. Because the "Pants on Fire" category is the most subjective of the categories in PolitiFact's Truth-O-Meter, PFB believes the comparison is evidence of PolitiFact's liberal bias.
I agree with PFB that the "Pants on Fire" category is highly subjective. That's why, when I calculate my factuality scores, I treat the category the same as I treat the "False" category. Yet treating the two categories the same doesn't account for selection bias. Perhaps PolitiFact is more likely to choose ridiculous statements that Republicans make so that it can rate them "Pants on Fire", rather than because Republicans tend to make ridiculous statements more often than Democrats.
One way to adjust for selection bias on ridiculous statements is to pretend that "Pants on Fire" rulings never happened. Presumably, the rest of the Truth-O-Meter categories are less susceptible to partisan bias in the selection and rating of statements. Therefore, the malarkey scores calculated from a report card excluding "Pants on Fire" statements might be a cleaner estimate of the factuality of an individual or group.
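As a sketch of that adjustment, here's a malarkey score for an invented report card computed with and without the "Pants on Fire" column (assuming an evenly spaced category mapping with "Pants on Fire" scored the same as "False"):

```python
def malarkey_score(card):
    """Average numeric rating (0-100) over a report card of category counts."""
    scores = {"True": 0, "Mostly True": 25, "Half True": 50,
              "Mostly False": 75, "False": 100, "Pants on Fire": 100}
    total = sum(card.values())
    return sum(scores[c] * n for c, n in card.items()) / total

# An invented report card, scored with and without the most subjective category.
card = {"True": 20, "Mostly True": 30, "Half True": 25,
        "Mostly False": 15, "False": 8, "Pants on Fire": 2}
without_pof = {c: n for c, n in card.items() if c != "Pants on Fire"}
print(malarkey_score(card), malarkey_score(without_pof))
```

Dropping the category shaves the score slightly, because "Pants on Fire" statements sit at the top of the scale; comparing the with- and without- versions across tickets is the robustness check described below.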
To examine the effect of excluding the "Pants on Fire" category on the comparison of malarkey scores between Republican and Democrats, I used Malark-O
-Meter's simulation methods
to statistically compare the collated malarkey scores of Rymney and Obiden after excluding the "Pants on Fire" statements from the observed PolitiFact report cards. The collated malarkey score adds up the statements in each category across all the individuals in a certain group (such as a campaign ticket), and then calculates a malarkey score from the collated ticket. I examine the range of values of the modified comparison in which we have 95% statistical confidence. I chose the collated malarkey score comparison because it is one of the comparisons that my original analysis
was most certain about, and because the collated malarkey score is a summary measure of the falsehood in statements made collectively by a campaign ticket.
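The collation step is simple enough to sketch in a few lines of Python. The category weights and the counts below are illustrative assumptions, not Malark-O-Meter's actual scoring scheme or report card data; the point is just the mechanics of summing a ticket's report cards and scoring the collated card with or without "Pants on Fire".

```python
# Assumed falsehood weights per Truth-O-Meter category (0 = fully true,
# 100 = fully false); Malark-O-Meter's actual scheme may differ.
WEIGHTS = {
    "True": 0, "Mostly True": 25, "Half True": 50,
    "Mostly False": 75, "False": 100, "Pants on Fire": 100,
}

def collate(report_cards):
    """Sum category counts across every individual on a ticket."""
    collated = {cat: 0 for cat in WEIGHTS}
    for card in report_cards:
        for cat, n in card.items():
            collated[cat] += n
    return collated

def malarkey_score(card, exclude=()):
    """Count-weighted average falsehood of a (collated) report card,
    optionally excluding categories such as "Pants on Fire"."""
    cats = [c for c in card if c not in exclude]
    total = sum(card[c] for c in cats)
    return sum(WEIGHTS[c] * card[c] for c in cats) / total

# Hypothetical two-candidate ticket, NOT actual report card data.
ticket = [
    {"True": 20, "Mostly True": 30, "Half True": 25,
     "Mostly False": 15, "False": 10, "Pants on Fire": 4},
    {"True": 15, "Mostly True": 20, "Half True": 20,
     "Mostly False": 10, "False": 8, "Pants on Fire": 1},
]
card = collate(ticket)
with_pof = malarkey_score(card)
without_pof = malarkey_score(card, exclude=("Pants on Fire",))
```

Because "Pants on Fire" is a small, heavily weighted category, excluding it nudges the score down only a little, which is the same qualitative behavior the analysis finds in the real data.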
My original analysis suggested that Rymney spews 1.17 times more malarkey than Obiden (either that or fact checkers have 17% liberal bias
). Because we have a small sample of fact checked statements, however, we can only be 95% confident that the true comparison (or the true partisan bias) leads to the conclusion that Rymney spewed between 1.08 and 1.27 times more malarkey than Obiden. We can, however, be 99.99% certain that Rymney spewed more malarkey than Obiden, regardless of how much more.
After excluding the "Pants on Fire" category, you know what happens to the estimated difference between the two tickets and our degree of certainty in that difference? Not much
. The mean comparison drops to Rymney spewing 1.14 times more malarkey than Obiden (a difference of 0.03 times, whatever that means!). The 95% confidence intervals shift a smidge left to show Rymney spewing between 1.05 and 1.24 times more malarkey than Obiden (notice that the width of the confidence intervals does not change). The probability that Rymney spewed more malarkey than Obiden plunges
(sarcasm fully intended) to 99.87%. By the way, those decimals are probably meaningless for our purposes. Basically, we can be almost completely certain that Rymney's malarkey score is higher than Obiden's.
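For readers curious about the machinery behind these confidence intervals, here is a sketch of the simulation. It samples plausible category proportions for each ticket from a Dirichlet distribution whose concentration parameters are the observed counts plus one (the parameterization discussed later in this post), converts each draw to a malarkey-style score, and summarizes the ratio. The counts and weights are hypothetical placeholders, not the actual report card data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed falsehood weight per category, scaled 0-1
# (True, Mostly True, Half True, Mostly False, False, Pants on Fire).
WEIGHTS = np.array([0, 25, 50, 75, 100, 100]) / 100

# Hypothetical collated counts per category; NOT the actual data.
rymney_counts = np.array([30, 45, 50, 35, 25, 10])
obiden_counts = np.array([40, 55, 50, 30, 18, 3])

def ratio_samples(counts_a, counts_b, n=100_000):
    """Sample category proportions for each ticket from a Dirichlet
    with concentration parameters counts + 1, convert each draw to a
    malarkey-style score, and return the ratio a / b."""
    p_a = rng.dirichlet(counts_a + 1, size=n)
    p_b = rng.dirichlet(counts_b + 1, size=n)
    return (p_a @ WEIGHTS) / (p_b @ WEIGHTS)

r = ratio_samples(rymney_counts, obiden_counts)
lo, hi = np.percentile(r, [2.5, 97.5])   # 95% interval for the ratio
p_more = (r > 1).mean()                  # P(ticket A spews more malarkey)
```

Dropping the "Pants on Fire" column from both count vectors and re-running the simulation is all it takes to reproduce the "not much changes" comparison above.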
Why doesn't the comparison change that much after excluding the "Pants on Fire" rulings? There are two interacting, proximate reasons. First, the malarkey score is actually an average of malarkey scores calculated separately from the rulings of PolitiFact and The Fact Checker
at The Washington Post
. When I remove the "Pants on Fire" rulings from Truth-O-Meter report cards, it does nothing to The Fact Checker report cards or their associated malarkey scores.
Second, the number of "Pants on Fire" rulings is small compared to the number of other rulings. In fact, it is only 3% of the total sample of rulings across all four candidates, 2% of the Obiden collated report card, and 8% of the Rymney collated report card. So although Rymney has 4 times more "Pants on Fire" rulings than Obiden, it doesn't affect their malarkey scores from the Truth-O-Meter report cards much.
When you average one malarkey score that doesn't change all that much and another that doesn't change at all, the obvious result is that not much change happens.
What does this mean for the argument that including "Pants on Fire" rulings muddies the waters, even if I treat them the same as "False" rulings? It means that the differences I measure aren't affected heavily by the "Pants on Fire" bias, if it exists. So I'm just going to keep including them. This finding also lends credence to my argument that, if you want to call foul on PolitiFact and other top fact checkers, you need to cry foul on the whole shebang, not just one type of subjective ruling.
If you want to cry foul on all of PolitiFact's rulings, you need to estimate the potential bias in all of their rulings. That's what I did a few days ago
, but it's something PFB hasn't done. I suggested a better way for them to fulfill their mission of exposing PolitiFact as liberally biased (which they've tried to downplay as their mission, but it clearly is). Strangely, they don't want to take my advice. It's just as well, because my estimate of PolitiFact's bias (and their estimate) can just as easily be interpreted as an estimate of true party differences.
The other day, I posted estimates of the potential partisan and centrist bias in fact checker rulings
. My post was critical of fact checking critics as different in political positions as PolitiFactBias.com and Rachel Maddow.
On Sunday, Politifactbias.com posted what they call a "semi-smackdown" of my claim that they provide little quantitative evidence that PolitiFact has liberal bias.
I want to thank PolitiFactBias for engaging me in a rational debate. (I'm serious. This is good!) To show how grateful I am, I'm going to systematically tear their semi-smackdown to shreds. In the process, I will clear up points of confusion that PolitiFactBias.com (PFB.com) has about who I am, and about Malark-O-Meter's methods.

1. "Our pseudonymous subject goes by 'Brash Equilibrium.'"
My name is Benjamin Chabot-Hanowell
. I prefer the Internet to know me as Brash Equilibrium, and I don't mind if people call me Brash in meatspace. The link between my true identity and my pseudonym is apparent on the Internet
because I value transparency. That said, yes, call me Brash, not Benjamin.

2. "Brash goes through the trouble of adding Kessler's Pinocchios together with PolitiFact's 'Truth-O-Meter' ratings..."
I don't add the two types of report card together. Doing so would bias the estimate heavily in favor of PolitiFact, which posts many times more rulings than Kessler, and is harder on Republicans than Kessler. Instead, I calculate the malarkey score from a report card (or collated report card, or subset of statements) and average the scores for the same subset. Doing so gives the two fact checkers equal weight. I don't do this for my debate analyses because Kessler doesn't do separate rulings for each statement made during the debates.

3. "...and then calculates confidence intervals for various sets of ratings, based on the apparent assumption that the selection of stories is essentially random."
My confidence intervals don't assume anything about the selection of stories. What they do assume is that fact checkers assemble a sample of statements from a population of statements, which results in sampling error. The population of statements from which those statements are selected could be everything that individual or group says. Or it could be the population of statements that are susceptible to whatever selection biases fact checkers have. Either way, the basic mechanics of the calculation of the confidence intervals are the same. The question lies in whether I have parameterized my sampling distribution properly. Basically, PFB.com is saying that I haven't.
But what would PFB.com have me do? Introduce a prior probability distribution on the concentration parameters of the Dirichlet that isn't equal to the counts in each category plus one? Where would my prior beliefs about those parameters come from? From PFB.com's allegations that PolitiFact cherrypicks liberal statements that are more likely to be true, whereas it cherrypicks conservative statements that are more likely to be false? Okay. What model should I use to characterize the strength of that bias, and its separate effects on conditional inclusion in each category?
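To make the question concrete, here is a hypothetical sketch of what such a prior would do. The counts, the weights, and especially the bias-encoding pseudo-counts are invented; the strength of the bias prior is precisely the quantity no one has estimated.

```python
import numpy as np

# Assumed falsehood weights and hypothetical observed counts
# (True, Mostly True, Half True, Mostly False, False, Pants on Fire).
WEIGHTS = np.array([0, 25, 50, 75, 100, 100]) / 100
counts = np.array([30, 45, 50, 35, 25, 10])

# The parameterization I use: a flat prior that adds one pseudo-count
# per category, so the Dirichlet concentration is counts + 1.
flat_posterior = counts + 1

# A hypothetical "cherrypicking" prior: if we believed fact checkers
# over-select false statements from this speaker, we might counteract it
# with extra pseudo-counts in the truthful categories. The strength
# (10 pseudo-counts here) is exactly the unestimated quantity at issue.
bias_pseudo_counts = np.array([10, 10, 0, 0, 0, 0])
biased_posterior = counts + 1 + bias_pseudo_counts

# Posterior mean malarkey-style score under each prior.
flat_mean = (flat_posterior / flat_posterior.sum()) @ WEIGHTS
adjusted_mean = (biased_posterior / biased_posterior.sum()) @ WEIGHTS
```

The adjusted score is lower, as expected, but by an amount that depends entirely on the made-up pseudo-counts, which is why a principled bias model has to come first.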
We don’t know what model we should use because no one has statistically analyzed fact checker rating bias or selection bias, and that is the point of my article. Until someone does that, we can only estimate how much bias might
exist. To do this, we perform a thought experiment in which we assume that I am measuring fact checker bias instead of real differences among politicians. In doing so, I gave PFB.com two figures that it is free to use to support their argument that PolitiFact is biased (they'll also have to assert that Glenn Kessler is biased; look for PolitiFactAndTheFactCheckerBias.com soon!).
Meanwhile, I am free to use my findings to support my argument that the Republican ticket is less factual than the Democratic ticket. The truth is probably somewhere in between those two extremes, and the other extreme that fact checkers have centrist bias, as partisan liberals allege. For now, we don’t know exactly where the truth lies within that simplex of extremes. Although PFB.com's qualitative analysis suggests there might be some liberal bias, its authors rhetorically argue that there is a lot
of bias. They actually argue that it's all bias
! They present no statistical estimates of bias that cannot also be interpreted as statistical estimates of true differences.

4. "It's a waste of time calculating confidence intervals if the data set exhibits a significant degree of selection bias."
Item 3 soundly defended my methods against this criticism. In sum, it is not a waste of time. What is a waste of time? Assuming that you know how biased an organization is when you've no conclusive estimate of the strength of that bias whatsoever.

5. "Our case against PolitiFact is based on solid survey data showing a left-of-center ideological tendency among journalists, an extensive set of anecdotes showing mistakes that more often unfairly harm conservatives and our own study of PolitiFact's bias based on its ratings."
Survey data that shows journalists tend to be liberal doesn't automatically allow you to conclude that fact checker rulings are all bias. It doesn't give you an estimate of the strength of that bias if it exists. All it does is give one pause. And, yeah, it gives me pause, as I stated in my article when I conceded that there could be as much as 17% liberal bias in fact checker rulings!

6. "Our study does not have a significant selection bias problem."
I highly doubt that. That PFB.com makes this assumption about its research, which relies heavily on blog entries in which it re-interprets a limited subset of PolitiFact rulings, makes me as suspicious of it as it is suspicious of PolitiFact.

7. "Brash's opinion of PolitiFact Bias consists of an assertion without any apparent basis in fact."
And I never said it did. That is, in fact, the whole point of my article. Similarly, however, PFB.com's rhetoric about the strength of PolitiFact's bias has little evidentiary support. At least I recognize the gaps in my knowledge!
My methods, however, have much stronger scientific foundations than PFB.com's.

8. In response to one of my recommendations about how to do better fact checking, PFB.com writes, "How often have we said it? Lacking a control for selection bias, the aggregated ratings tell us about PolitiFact and The Fact Checker, not about the subjects whose statements they grade."
No. It tells us about both the subjects whose statements they grade, and about the raters. We don't know the relative importance of these two factors in determining the results. PFB.com thinks it does. Actually, so do I. Our opinions differ markedly. Neither is based on a good estimate of how much bias there is among fact checkers.
Subjectively, however, I think it's pretty ridiculous to assume that it's all just bias. But I guess someday we'll see!

9. "We need fact checkers who know how to draw the line between fact and opinion."
Sorry, PFB.com, you're never going to get that. What we actually need is a statistical method to estimate the influence of political beliefs on the report cards of individuals assembled from the rulings of professional fact checkers, and then a statistical method to adjust for that bias.

10. "And critics who know enough to whistle a foul when "fact checkers" cross the line and conflate the two."
Yes. People like you and Rachel Maddow (strange bedfellows, to be sure!) are valuable whistleblowers. But your value isn't in estimating the strength of political bias among fact checkers.

UPDATE (same day):
PFB.com and I fling more poo at one another here.
They look similar. Yesterday, both models predicted the same number of electoral votes for Obama. Their random error estimates of the odds that Obama wins, however, differ slightly. If I had access to Linzer's and Silver's EV distributions, they would also look different. Linzer's would be shifted to the right, Silver's to the left. Alas, I don't have access to those model results beyond what those authors publish on the Internet.
I averaged the two probability distributions. This method makes sense because the result of the averaging still sums to one (so it is still a probability distribution), and because the two models could be interpreted as estimating 538 parameters (the probabilities of each electoral vote result), and because I don't have much reason to believe that the two models have unequal predictive power.
Below is the average probability distribution.
The averaged model predicts that Obama will win a median 303 votes. We can be 95% confident that he will receive between 265 and 344 votes. Obama will win with probability greater than 96%, giving him greater than 24 to 1 odds of winning the election.
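For readers who want the mechanics, here is a sketch of the averaging. The two distributions below are invented Gaussian-shaped stand-ins, not Linzer's or Silver's actual output; only the averaging, median, and win-probability arithmetic reflect the method described above.

```python
import numpy as np

# Stand-ins for two models' probability distributions over Obama's electoral
# votes (0..538). The real inputs would be the two simulators' published
# distributions; these placeholders are purely illustrative.
ev = np.arange(539)
model_a = np.exp(-0.5 * ((ev - 300) / 20.0) ** 2)
model_a /= model_a.sum()
model_b = np.exp(-0.5 * ((ev - 306) / 22.0) ** 2)
model_b /= model_b.sum()

# The unweighted average of two probability distributions is itself a
# probability distribution: every bin stays in [0, 1] and the bins sum to 1.
avg = (model_a + model_b) / 2

cdf = avg.cumsum()
median_ev = int(ev[np.searchsorted(cdf, 0.5)])  # median electoral votes
p_win = avg[ev >= 270].sum()                    # probability of at least 270
odds = p_win / (1 - p_win)                      # odds of winning
```

Equal weights are the key modeling choice here; if I had a reason to trust one model's predictive record more, a weighted average would still sum to one.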
And, guess what, election simulation naysayers. Neither of these simulators weights polls, and one of them includes all polls. And I've reported the median to avoid outlier effects. And the distribution doesn't look bimodal.
Earlier this month, Michael Scherer published an article called "Fact Checking and the False Equivalence Dilemma
" on Time
's Swampland blog. Scherer wrote the article in response to criticism of a cover story he wrote about the "factual deceptions" of Barry Obama and Willard Romney. Some readers accused him of false centrism.
Scherer's defense is that we cannot reliably compare the deceptiveness of individuals or groups, especially not based on fact checker rulings. He based his defense on comments by the leaders of the fact checking industry during a press conference that Scherer attended
. (In fact, the comments responded to a question that Scherer himself asked.)
Evidenced by my previous post on estimating partisan and centrist bias from fact checker report cards
, I sympathize with Scherer's defense against frothy-mouthed partisans who are convinced that the other side tells nothing but a bunch of stuff. Yet I disagree with him and the leaders of the fact checking industry that we cannot reliably compare fact checker rulings (notice I don't say deceptiveness) across politicians and political groups.
To make my point, I'll condense into a list what the fact checking industry leaders and Michael Scherer have said about what Scherer calls the "false equivalence dilemma" (but which should be called the "false comparison dilemma"). For each item in the list, I'll describe the issue, then explain why it's not that big of a deal.

1. "...it's self selective process," says Glenn Kessler from The Fact Checker at The Washington Post.
Kessler argues that fact checkers cherrypick the statements that they fact check. No, not out of centrist or partisan bias. In this case, Kessler's talking about a bias toward the timeliness and relevance of the statement. Kessler says that he decides what to fact check based on how much he thinks the fact check will educate the public about something important, like medicare or health insurance reform. He shies away from mere slips of the tongue.
Wait a minute. If the only bias that fact checkers had was to fact check timely and relevant remarks about policy, that would make Malark-O-Meter's comparisons more valid, not less. Far more concerning is the possibility that some fact checkers have a fairness bias. Which brings me to...

2. "...it would look like we were endorsing the other candidate," says Brooks Jackson of FactCheck.org.
This comment raises one non-issue against comparisons while implying another. Brooks argues that by demonstrating that one politician is more deceptive than another, FactCheck.org would open itself up to accusations of partisanship. From a publishing standpoint, this makes some sense, especially if your organization wants to maintain a nonpartisan reputation. Yet the ensuing controversy might cause the buzz about your organization to get louder. Just look what's happened with Nate Silver's political calculus this week. Or better yet, look what's happened to Internet searches for PolitiFact compared to factcheck.org over the last year. (Among frothy-mouthed right-wing partisans, PolitiFact is the poster child of the liberal fact checking establishment.)
Yet from the standpoint of informing the public (which is what we're trying to do, right?), who cares if you gain a false reputation of partisan bias? Many people already believe that the fact checking industry is biased, but at least as many people find it highly readable and refreshing. Perhaps that same demographic will find lucid, academically respectable factuality comparisons similarly refreshing.
Interestingly, Jackson's comment hints at the separate issue of centrist bias among today's top fact checkers. In the quest to avoid a partisan reputation, frothy-mouthed liberals allege, the fact checking industry is too fair-minded and falsely balanced (the same criticism leveled against Scherer's cover story in Time
).
I've already shown that we can use Malark-O-Meter's statistical methods to estimate the likely level of centrist bias
(assuming that one exists). In the same article, I made suggestions for how to estimate the actual level of centrist (and partisan) bias among professional fact checkers.
Furthermore, if what we're aiming at is a more informed public, why must we always shy away from ambiguity? Yes, Malark-O-Meter's measurements are a complex mix of true difference, bias, sampling error, and perceptual error. No, we don't know the relative weights of those influences. But that doesn't make the estimates useless. In fact, it makes them something for people to discuss in light of other evidence about the comparative factuality of political groups.

3. “Politicians in both parties will stretch the truth if it is in their political interest,” says Glenn Kessler.
Glenn Kessler argues that comparing politicians is fruitless because all politicians lie. Well, I statistically compared the factuality of Obama, Biden, Romney, and Ryan
. While all of them appear about half factual, there are some statistically significant differences. I estimate that Rymney's statements are collectively nearly 20% more false than Obiden's statements (I also estimated our uncertainty in that judgment). So yes, both parties' candidates appear to stretch (or maybe just not know) the facts about half the time. But one of them most likely does it more than the other, and maybe that matters
.

4. "...not all deceptions are equally deceiving, and different people will reach different judgements about which is worse," says Michael Scherer.
Scherer goes on to ask:
He then says he doesn't know the answer to those questions. Neither do I, but I don't think the answers matter. What matters is the extent to which an individual's or group's policy recommendations and rhetoric adhere to the facts. That is why the fact checking industry exists. If the questions above bother you, then the fact checking industry writ large should bother you, not just the comparison niche that Malark-O-Meter is carving out. Furthermore, since Kessler has already established that fact checkers tend to examine statements that would lead to instructive journalism, we can be confident that most rulings that we would compare are, roughly speaking, equally cogent.
Which brings me to the straw man of the false equivalence dilemma:

5. We can't read someone's mind.
Much of the fact checking industry leaders' commentary, and Michael Scherer's subsequent blog entry, assumed that what we're comparing is the deceptiveness (or conversely the truthfulness) of individuals or groups. This opened up the criticism that we can't read people's minds to determine if they are being deceptive. All we can do is rate the factuality of what they say. I agree with this statement so much that I discuss this issue in the section of my website about the caveats to the malarkey score and its analysis.
I contend, however, that when words come out of someone's mouth that we want to fact check, that person is probably trying to influence someone else's opinion. The degree to which people influence our opinion should
be highly positively correlated with the degree to which their statements are true. No, not true in the value laden sense. True in the sense that matters to people like scientists and court judges. So I don't think it matters whether or not we can tell if someone is trying to be deceptive. What matters should be the soundness and validity of someone's arguments. The fact checking industry exists to facilitate such evaluations. Malark-O-Meter's comparisons facilitate similar evaluations at a higher level.
Lastly, I want to address one of Michael Scherer's remarks about a suggestion by political deceptiveness research pioneer, Kathleen Hall Jamieson, who works with Brooks Jackson at the Annenberg Public Policy Center, which runs FactCheck.org.
Three things. First, this is definitely a fine idea...if you want to measure the level of deception that moved voters. But what if you simply want to measure the average factuality of the statements that an individual or group makes? In that case, there is no need to weight fact check rulings by the size of their audience. In fact, by believing this measure is a measure of individual or group factuality (rather than a measure of the effects of an individual or group's statements), you would overestimate the factuality or falsehood of highly influential people relative to less influential people.
Second, most fact check rulings are of timely and relevant statements, and they are often a campaign's main talking points. So I would be interested to see what information all that extra work would add to a factuality score.
Third, while it is difficult to do in real time, it isn't impossible, especially not in pseudo real time. (Why do we have to do it in real time, anyway? Can't people wait a day? They already wait that long or more for most fact checker rulings! Moreover, we once believed real time fact checking was too difficult, and yet that's what PolitiFact did during the debates.)
Anyway, for any given campaign ad or speech or debate, there's usually a transcript. We often know the target audience. We can also estimate the size of the audience. Come up with a systematic way to put those pieces of information together, and it will become as straightforward as...well...fact checking!
In sum, so long as fact checkers are doing their job fairly well (and I think they are) people like me can do our job (oh, but I wish it actually were my job!) fairly well. That said, there is much room for improvement and innovation. Stay tuned to Malark-O-Meter, where I hope some of that will happen.
Many accuse fact checkers like PolitiFact
and The Fact Checker
of bias. Most of these accusations come from the right, for which the most relevant example is politifactbias.com
. Conservatives don't focus as heavily on The Washington Post
's Fact Checker, perhaps because its rulings are apparently more centrist than PolitiFact's, and because PolitiFact rulings apparently favor Democrats at least a little bit.
We can use Malark-O-Meter's recent analysis of the 2012 election candidates' factuality
to estimate the magnitude of liberal bias necessary to explain the differences observed between the two parties and
estimate our uncertainty in the size of that bias.
The simplest way to do this is to re-interpret my findings as measuring the average liberal bias of the two fact checkers, assuming that there is no difference between the two tickets. The appropriate comparison here is what I call the collated ticket malarkey, which sums all statements that the members of a ticket make in each category, then calculates the malarkey score
from the collated ticket. Using statistical simulation methods
, I've estimated the probability distribution of the ratio of the collated malarkey scores of Rymney to Obiden.
Here's a plot of that distribution with the 95% confidence intervals labeled on either side of the mean ratio. The white line lies at equal malarkey scores between the two tickets.
Interpreted as a true comparison of factuality, the probability distribution indicates that we can expect Rymney's statements to be on average 17% more full of malarkey than Obiden's, although we can be 95% confident that the comparison is somewhere between 8% and 27% more red than blue malarkey.
Interpreted as an indicator of the average bias of PolitiFact and The Fact Checker, the probability distribution suggests that, if the two tickets spew equal amounts of malarkey, then the fact checkers on average rate the Democratic ticket's statements as somewhere between 8% and 27% more truthful than the Republican ticket's statements.
I'm going to speak against my subjective beliefs as a bleeding heart liberal and say that amount of bias isn't all that unrealistic, even if the bias is entirely subconscious.
If instead we believed like a moderate conservative that the true comparison was reversed - that is, if we believed that Obiden spewed 17% more malarkey than Rymney - then it suggests that the fact checkers' average bias is somewhere between 16% and 54% for the Democrats, with a mean estimated bias of 34%.
It seems unrealistic to me that PolitiFact and The Fact Checker are on average that
biased against the Republican party, even subconsciously. So while I think it's likely that bias could inflate the difference between the Republicans and Democrats, I find it much less likely that bias has reversed the comparison between the two tickets. Of course, these beliefs are based on hunches. Unlike politifactbias.com's rhetoric and limited quantitative analysis, however, my hunches are informed by good estimates of the possible bias, and of our uncertainty in it.
It isn't just conservatives that accuse PolitiFact and The Fact Checker of bias. Believe it or not, liberals do, too. Liberals accuse fact checkers of being too centrist in a supposedly misguided quest to appear fair. You can look to Rachel Maddow
as a representative of this camp. Maddow's accusations, like politifactbias.com's, typically nitpick a few choice rulings (which is funny, because a lot of critics on both sides accuse PolitiFact and The Fact Checker of cherrypicking).
Such accusations amount to the suggestion that fact checkers artificially shrink
the difference between the two parties, making the histogram that I showed above incorrectly hover close to a ratio of one. So how much centrist bias do the fact checkers have on average?
Well, let's assume for a moment that we don't know which party spews more malarkey. We just know that, as I've estimated, the fact checkers on average rule that one party spews somewhere between 1.08 and 1.27 times the malarkey that the other party spews. Now let's put on a Rachel Maddow wig or a Rush Limbaugh bald cap and fat suit to become true partisans that believe the other side is actually, say, 95% full of crap, while our side is only 5% full of crap. This belief leads to a ratio of 19 to 1 comparing the malarkey of the enemy to our preferred party. Already, it seems unrealistic. But let's continue.
Next, divide each bin in the histogram I showed above by 19, which is the "true" ratio according to the partisans. The result is a measure of the alleged centrist bias of the average fact checker (at least at PolitiFact or The Fact Checker). Get a load of the 95% confidence interval of this new distribution: it runs from about 6% to about 7%. That is, a partisan would conclude that PolitiFact and The Fact Checker are on average so centrist that their rulings shrink the difference between the two parties to a mere SIX PERCENT of what it "truly" is.
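That arithmetic is easy to check. The ratio distribution below is a rough stand-in for the simulated one (centered near 1.17 with a 95% interval of roughly 1.08 to 1.27, not the actual samples); dividing it by the partisan's "true" ratio of 19 gives the alleged centrist shrinkage.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the simulated ratio distribution: roughly centered at 1.17
# with a 95% interval near 1.08-1.27 (illustrative, not the actual samples).
measured_ratio = rng.normal(loc=1.17, scale=0.05, size=100_000)

# A hard partisan's "true" ratio: 95% vs. 5% full of crap -> 19 to 1.
true_ratio = 0.95 / 0.05

# Alleged centrist shrinkage: the fraction of the "true" difference that
# survives in the fact checkers' measured rulings.
shrinkage = measured_ratio / true_ratio
lo, hi = np.percentile(shrinkage, [2.5, 97.5])
```

The 95% interval of the shrinkage lands in the neighborhood of 6% to 7%, matching the figure quoted above.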
I don't know about you, but I find this accusation at least as hard to swallow as, if not harder than, the accusation that there is minor partisan bias among fact checkers.
Then again, my belief that fact checkers on average get it about right is entirely subjective. Given the data we currently have, it is not currently possible to tell how much partisan bias versus centrist bias versus honest mistakes versus honest fact checking contribute to the differences that I have estimated.
So what is the way forward? How can we create a system of fact checking that is less susceptible to accusations of bias, whether partisan or centrist? Here are my suggestions, which will require a lot of investment and time.
- More fact checking organizations. We need more large-scale fact checking institutions that provide categorical rulings like The Fact Checker and PolitiFact. The more fact checker rulings we have access to, the more fact checker rulings we can analyze and combine into some (possibly weighted) average.
- More fact checkers. We need more fact checkers in each institution so that we can rate more statements. The more statements we can rate, the weaker selection bias will be because, after some point, you can't cherrypick anymore.
- Blind fact checkers. After the statements are collected, they should be passed to people who do not see who made the statement. While it will be possible for people to figure out who made some statements, particularly when they are egregious, and particularly when they are repeated by a specific party or individual, many statements that fact checkers examine can be stripped of information about the individuals or parties involved so that fact checkers can concentrate on the facts.
- Embrace the partisans and centrists. There should be at least one institution that employs professional fact checkers who are, according to some objective measure, at different points along the various political dimensions that political scientists usually measure. So long as they are professional fact checkers and not simply politically motivated hacks, let these obvious partisans and centrists subconsciously cherrypick, waffle, and misrule to their heart's content so that we can actually measure the amount of subconscious bias rather than make accusations based on scanty evidence and fact checker rulings that make our neck hairs bristle.
I hope that Malark-O-Meter will someday grow into an organization that can realize at least one of these recommendations.