The other day, I posted estimates of the potential partisan and centrist bias in fact checker rulings. My post was critical of fact checking critics as different in political positions as PolitiFactBias.com and Rachel Maddow.
On Sunday, Politifactbias.com posted what they call a "semi-smackdown" of my claim that they provide little quantitative evidence that PolitiFact has liberal bias.
I want to thank PolitiFactBias for engaging me in a rational debate. (I'm serious. This is good!) To show how grateful I am, I'm going to systematically tear their semi-smackdown to shreds. In the process, I will clear up points of confusion that PolitiFactBias.com (PFB.com) has about who I am, and about Malark-O-Meter's methods.

1. "Our pseudonymous subject goes by 'Brash Equilibrium.'"
My name is Benjamin Chabot-Hanowell. I prefer the Internet to know me as Brash Equilibrium, and I don't mind if people call me Brash in meatspace. The link between my true identity and my pseudonym is apparent on the Internet because I value transparency. That said, yes, call me Brash, not Benjamin.

2. "Brash goes through the trouble of adding Kessler's Pinocchios together with PolitiFact's 'Truth-O-Meter' ratings..."
I don't add the two types of report card together. Doing so would bias the estimate heavily in favor of PolitiFact, which posts many times more rulings than Kessler, and is harder on Republicans than Kessler. Instead, I calculate the malarkey score from each report card (or collated report card, or subset of statements) and average the scores for the same subset. Doing so gives the two fact checkers equal weight. I don't do this for my debate analyses because Kessler doesn't do separate rulings for each statement made during the debates.

3. "...and then calculates confidence intervals for various sets of ratings, based on the apparent assumption that the selection of stories is essentially random."
My confidence intervals don't assume anything about the selection of stories. What they do assume is that fact checkers assemble a sample of statements from a population of statements, which results in sampling error. The population of statements from which those statements are selected could be everything that the individual or group says. Or it could be the population of statements that are susceptible to whatever selection biases the fact checkers have. Either way, the basic mechanics of the confidence interval calculation are the same. The question lies in whether I have parameterized my sampling distribution properly. Basically, PFB.com's saying that I haven't.
But what would PFB.com have me do? Introduce a prior probability distribution on the concentration parameters of the Dirichlet that isn't equal to the counts in each category plus one? Where would my prior beliefs about those parameters come from? From PFB.com's allegations that PolitiFact cherrypicks liberal statements that are more likely to be true, whereas it cherrypicks conservative statements that are more likely to be false? Okay. What model should I use to characterize the strength of that bias, and its separate effects on conditional inclusion in each category?
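The basic mechanics I'm defending can be sketched in a few lines of Python. Everything below is invented for illustration: the report card counts and the 0-100 weights I assign to each rating category are made up, and I'm using a Dirichlet with concentration equal to the counts plus one, as discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical report card counts, ordered from True to Pants on Fire.
# These numbers are invented for illustration.
counts = np.array([20, 28, 25, 22, 14, 6])

# Hypothetical "fullness of malarkey" weight for each category,
# scaled 0-100 (True = 0, Pants on Fire = 100).
weights = np.array([0, 20, 40, 60, 80, 100])

# Dirichlet concentration parameters: observed counts plus one,
# i.e., a uniform prior over category proportions.
alpha = counts + 1

# Simulate the sampling distribution of category proportions,
# then the implied malarkey score for each draw.
props = rng.dirichlet(alpha, size=100_000)
scores = props @ weights

lo, hi = np.percentile(scores, [2.5, 97.5])
print(f"mean malarkey score: {scores.mean():.1f}")
print(f"95% interval: [{lo:.1f}, {hi:.1f}]")
```

Nothing about this calculation requires the sample of statements to be random; it only quantifies sampling error given whatever population the fact checkers are sampling from.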
We don't know what model we should use because no one has statistically analyzed fact checker rating bias or selection bias, and that is the point of my article. Until someone does that, we can only estimate how much bias might exist. To do this, I performed a thought experiment in which I assumed that my analysis measures fact checker bias instead of real differences among politicians. Doing so, I gave PFB.com two figures that it is free to use to support its argument that PolitiFact is biased (they'll also have to assert that Glenn Kessler is biased; look for PolitiFactAndTheFactCheckerBias.com soon!).
Meanwhile, I am free to use my findings to support my argument that the Republican ticket is less factual than the Democratic ticket. The truth probably lies somewhere between those two extremes and a third extreme: that fact checkers have centrist bias, as partisan liberals allege. For now, we don't know exactly where the truth lies within that simplex of extremes. Although PFB.com's qualitative analysis suggests there might be some liberal bias, its authors rhetorically argue that there is a lot of bias. They actually argue that it's all bias! They present no statistical estimates of bias that cannot also be interpreted as statistical estimates of true differences.

4. "It's a waste of time calculating confidence intervals if the data set exhibits a significant degree of selection bias."
Item 3 soundly defended my methods against this criticism. In sum, it is not a waste of time. What is a waste of time? Assuming that you know how biased an organization is when you've no conclusive estimate of the strength of that bias whatsoever.

5. "Our case against PolitiFact is based on solid survey data showing a left-of-center ideological tendency among journalists, an extensive set of anecdotes showing mistakes that more often unfairly harm conservatives and our own study of PolitiFact's bias based on its ratings."
Survey data that shows journalists tend to be liberal doesn't automatically allow you to conclude that fact checker rulings are all bias. It doesn't give you an estimate of the strength of that bias if it exists. All it does is give one pause. And, yeah, it gives me pause, as I stated in my article when I conceded that there could be as much as 17% liberal bias in fact checker rulings!

6. "Our study does not have a significant selection bias problem."
I highly doubt that. That PFB.com makes this assumption about its research, which relies heavily on blog entries in which it re-interprets a limited subset of PolitiFact rulings, makes me as suspicious of it as it is suspicious of PolitiFact.

7. "Brash's opinion of PolitiFact Bias consists of an assertion without any apparent basis in fact."
And I never said it did. That is, in fact, the whole point of my article. Similarly, however, PFB.com's rhetoric about the strength of PolitiFact's bias has little evidentiary support. At least I recognize the gaps in my knowledge!
My methods, however, have much stronger scientific foundations than PFB.com's.

8. In response to one of my recommendations about how to do better fact checking, PFB.com writes, "How often have we said it? Lacking a control for selection bias, the aggregated ratings tell us about PolitiFact and The Fact Checker, not about the subjects whose statements they grade."
No. It tells us about both the subjects whose statements they grade, and about the raters. We don't know the relative importance of these two factors in determining the results. PFB.com thinks it does. Actually, so do I. Our opinions differ markedly. Neither is based on a good estimate of how much bias there is among fact checkers.
Subjectively, however, I think it's pretty ridiculous to assume that it's all just bias. But I guess someday we'll see!

9. "We need fact checkers who know how to draw the line between fact and opinion."
Sorry, PFB.com, you're never going to get that. What we actually need is a statistical method to estimate the influence of political beliefs on the report cards of individuals assembled from the rulings of professional fact checkers, and then a statistical method to adjust for that bias.

10. "And critics who know enough to whistle a foul when "fact checkers" cross the line and conflate the two."
Yes. People like you and Rachel Maddow (strange bedfellows, to be sure!) are valuable whistle blowers. But your value isn't in estimating the strength of political bias among fact checkers.

UPDATE (same day): PFB.com and I fling more poo at one another here.
They look similar. Yesterday, both models predicted the same number of electoral votes for Obama. Their random error estimates of the odds that Obama wins, however, differ slightly. If I had access to Linzer's and Silver's EV distributions, they would also look different. Linzer's would be shifted to the right, Silver's to the left. Alas, I don't have access to those model results beyond what those authors publish on the Internet.
I averaged the two probability distributions. This method makes sense because the result of the averaging still sums to one (so it is still a probability distribution), and because the two models could be interpreted as estimating 538 parameters (the probabilities of each electoral vote result), and because I don't have much reason to believe that the two models have unequal predictive power.
Below is the average probability distribution.
The averaged model predicts that Obama will win a median 303 votes. We can be 95% confident that he will receive between 265 and 344 votes. Obama will win with probability greater than 96%, giving him greater than 24 to 1 odds of winning the election.
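For the curious, here is how a median, a 95% interval, and a win probability fall out of a discrete electoral vote distribution. The distribution below is a stand-in (a discretized normal centered at 303 EV), not the real averaged histogram.

```python
import numpy as np

# Stand-in for the averaged electoral-vote distribution: probabilities
# over 0..538 EV. A discretized normal centered at 303 is used here
# purely for illustration.
ev = np.arange(539)
w = np.exp(-0.5 * ((ev - 303) / 20.0) ** 2)
p = w / w.sum()

# The cumulative distribution gives the median and the 95% interval.
cdf = np.cumsum(p)
median_ev = int(ev[np.searchsorted(cdf, 0.5)])
lo_ev = int(ev[np.searchsorted(cdf, 0.025)])
hi_ev = int(ev[np.searchsorted(cdf, 0.975)])

# The win probability is the mass at 270 EV or more; odds follow.
p_win = float(p[ev >= 270].sum())
odds = p_win / (1 - p_win)
print(median_ev, (lo_ev, hi_ev), round(p_win, 3), round(odds, 1))
```

A 96% win probability converts to 0.96/0.04, or 24 to 1 odds, which is where the figure in the text comes from.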
And, guess what, election simulation naysayers. Neither of these simulators weights polls, and one of them includes all polls. And I've reported the median to avoid outlier effects. And the distribution doesn't look bimodal.
Earlier this month, Michael Scherer published an article called "Fact Checking and the False Equivalence Dilemma" on Time's Swampland blog. Scherer wrote the article in response to criticism of a cover story he wrote about the "factual deceptions" of Barry Obama and Willard Romney. Some readers accused him of false centrism.
Scherer's defense is that we cannot reliably compare the deceptiveness of individuals or groups, especially not based on fact checker rulings. He based his defense on comments by the leaders of the fact checking industry during a press conference that Scherer attended. (In fact, the comments responded to a question that Scherer himself asked.)
As evidenced by my previous post on estimating partisan and centrist bias from fact checker report cards, I sympathize with Scherer's defense against frothy-mouthed partisans who are convinced that the other side tells nothing but a bunch of stuff. Yet I disagree with him and the leaders of the fact checking industry that we cannot reliably compare fact checker rulings (notice I don't say deceptiveness) across politicians and political groups.
To make my point, I'll condense into a list what the fact checking industry leaders and Michael Scherer have said about what Scherer calls the "false equivalence dilemma" (but which should be called the "false comparison dilemma"). For each item in the list, I'll describe the issue, then explain why it's not that big of a deal.

1. "...it's self selective process," says Glenn Kessler from The Fact Checker at The Washington Post.
Kessler argues that fact checkers cherrypick the statements that they fact check. No, not out of centrist or partisan bias. In this case, Kessler's talking about a bias toward the timeliness and relevance of the statement. Kessler says that he decides what to fact check based on how much he thinks the fact check will educate the public about something important, like medicare or health insurance reform. He shies away from mere slips of the tongue.
Wait a minute. If the only bias that fact checkers had was to fact check timely and relevant remarks about policy, that would make Malark-O-Meter's comparisons more valid, not less. Far more concerning is the possibility that some fact checkers have a fairness bias. Which brings me to...

2. "...it would look like we were endorsing the other candidate," says Brooks Jackson of FactCheck.org.
This comment raises one non-issue against comparisons while implying another. Brooks argues that by demonstrating that one politician is more deceptive than another, FactCheck.org would open itself up to accusations of partisanship. From a publishing standpoint, this makes some sense, especially if your organization wants to maintain a nonpartisan reputation. Yet the ensuing controversy might cause the buzz about your organization to get louder. Just look what's happened with Nate Silver's political calculus this week. Or better yet, look what's happened to Internet searches for PolitiFact compared to factcheck.org over the last year. (Among frothy-mouthed right-wing partisans, PolitiFact is the poster child of the liberal fact checking establishment.)
Yet from the standpoint of informing the public (which is what we're trying to do, right?), who cares if you gain a false reputation of partisan bias? Many people already believe that the fact checking industry is biased, but at least as many people find it highly readable and refreshing. Perhaps that same demographic will find lucid, academically respectable factuality comparisons similarly refreshing.
Interestingly, Jackson's comment hints at the separate issue of centrist bias among today's top fact checkers. In the quest to avoid a partisan reputation, frothy-mouthed liberals allege, the fact checking industry is too fair-minded and falsely balanced (the same criticism leveled against Scherer's cover story in Time).

I've already shown that we can use Malark-O-Meter's statistical methods to estimate the likely level of centrist bias (assuming that one exists). In the same article, I made suggestions for how to estimate the actual level of centrist (and partisan) bias among professional fact checkers.
Furthermore, if what we're aiming at is a more informed public, why must we always shy away from ambiguity? Yes, Malark-O-Meter's measurements are a complex mix of true difference, bias, sampling error, and perceptual error. No, we don't know the relative weights of those influences. But that doesn't make the estimates useless. In fact, it makes them something for people to discuss in light of other evidence about the comparative factuality of political groups.

3. "Politicians in both parties will stretch the truth if it is in their political interest," says Glenn Kessler.
Glenn Kessler argues that comparing politicians is fruitless because all politicians lie. Well, I statistically compared the factuality of Obama, Biden, Romney, and Ryan. While all of them appear about half factual, there are some statistically significant differences. I estimate that Rymney's statements are collectively nearly 20% more false than Obiden's statements (I also estimated our uncertainty in that judgment). So yes, both parties' candidates appear to stretch (or maybe just not know) the facts about half the time. But one of them most likely does it more than the other, and maybe that matters.

4. "...not all deceptions are equally deceiving, and different people will reach different judgements about which is worse," says Michael Scherer.
Scherer goes on to ask a series of questions about which kinds of deception are worse than others, and then says he doesn't know the answers. Neither do I, but I don't think the answers matter. What matters is the extent to which an individual's or group's policy recommendations and rhetoric adhere to the facts. That is why the fact checking industry exists. If those questions bother you, then the fact checking industry writ large should bother you, not just the comparison niche that Malark-O-Meter is carving out. Furthermore, since Kessler has already established that fact checkers tend to examine statements that would lead to instructive journalism, we can be confident that most rulings that we would compare are, roughly speaking, equally cogent.
Which brings me to the straw man of the false equivalence dilemma:

5. We can't read someone's mind.
Much of the fact checking industry leaders' commentary, and Michael Scherer's subsequent blog entry, assumed that what we're comparing is the deceptiveness (or conversely the truthfulness) of individuals or groups. This opened up the criticism that we can't read people's minds to determine if they are being deceptive. All we can do is rate the factuality of what they say. I agree with this statement so much that I discuss this issue in the section of my website about the caveats to the malarkey score and its analysis.
I contend, however, that when words come out of someone's mouth that we want to fact check, that person is probably trying to influence someone else's opinion. The degree to which people influence our opinion should be highly positively correlated with the degree to which their statements are true. No, not true in the value-laden sense. True in the sense that matters to people like scientists and court judges. So I don't think it matters whether or not we can tell if someone is trying to be deceptive. What matters should be the soundness and validity of someone's arguments. The fact checking industry exists to facilitate such evaluations. Malark-O-Meter's comparisons facilitate similar evaluations at a higher level.
Lastly, I want to address one of Michael Scherer's remarks about a suggestion by political deceptiveness research pioneer, Kathleen Hall Jamieson, who works with Brooks Jackson at the Annenberg Public Policy Center, which runs FactCheck.org.
Three things. First, this is definitely a fine idea...if you want to measure the level of deception that moved voters. But what if you simply want to measure the average factuality of the statements that an individual or group makes? In that case, there is no need to weight fact check rulings by the size of their audience. In fact, by believing this measure is a measure of individual or group factuality (rather than a measure of the effects of an individual or group's statements), you would overestimate the factuality or falsehood of highly influential people relative to less influential people.
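A toy example (all numbers invented) shows why audience weighting measures something different from a speaker's own factuality:

```python
# Hypothetical rulings on a 0 (true) to 100 (pants on fire) scale,
# and the audience each statement reached. All numbers are invented.
rulings = [20, 60, 80, 40]
audience = [5_000_000, 200_000, 1_000_000, 50_000]

# Unweighted: the speaker's average malarkey across statements.
unweighted = sum(rulings) / len(rulings)

# Audience-weighted: dominated by the most widely heard statements,
# so it measures the malarkey that reached voters, not the speaker's
# base rate of malarkey.
weighted = sum(r * a for r, a in zip(rulings, audience)) / sum(audience)

print(unweighted, round(weighted, 2))
```

The two numbers can differ a lot for a speaker whose most widely broadcast statements happen to be unusually true or unusually false, which is exactly the distortion described above.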
Second, most fact check rulings are of timely and relevant statements, and they are often a campaign's main talking points. So I would be interested to see what information all that extra work would add to a factuality score.
Third, while it is difficult to do in real time, it isn't impossible, especially not in pseudo real time. (Why do we have to do it in real time, anyway? Can't people wait a day? They already wait that long or more for most fact checker rulings! Moreover, didn't we once believe real time fact checking was too difficult? And yet that's what PolitiFact did during the debates.)
Anyway, for any given campaign ad or speech or debate, there's usually a transcript. We often know the target audience. We can also estimate the size of the audience. Come up with a systematic way to put those pieces of information together, and it will become as straightforward as...well...fact checking!
In sum, so long as fact checkers are doing their job fairly well (and I think they are) people like me can do our job (oh, but I wish it actually were my job!) fairly well. That said, there is much room for improvement and innovation. Stay tuned to Malark-O-Meter, where I hope some of that will happen.
Many accuse fact checkers like PolitiFact and The Fact Checker of bias. Most of these accusations come from the right, for which the most relevant example is politifactbias.com. Conservatives don't focus as heavily on The Washington Post's Fact Checker, perhaps because its rulings are apparently more centrist than PolitiFact's, and because PolitiFact rulings apparently favor Democrats at least a little bit.
We can use Malark-O-Meter's recent analysis of the 2012 election candidates' factuality to estimate the magnitude of liberal bias necessary to explain the differences observed between the two parties, and to estimate our uncertainty in the size of that bias.
The simplest way to do this is to re-interpret my findings as measuring the average liberal bias of the two fact checkers, assuming that there is no difference between the two tickets. The appropriate comparison here is what I call the collated ticket malarkey, which sums all statements that the members of a ticket make in each category, then calculates the malarkey score from the collated ticket. Using statistical simulation methods, I've estimated the probability distribution of the ratio of the collated malarkey scores of Rymney to Obiden.
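A sketch of that kind of simulation, with invented category counts and 0-100 category weights standing in for the real collated report cards:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical weights for how full of malarkey each rating category
# is, from 0 (True) to 100 (Pants on Fire). Both these weights and the
# report card counts below are invented for illustration.
weights = np.array([0, 20, 40, 60, 80, 100])

def malarkey_draws(counts, n=100_000):
    """Simulated malarkey scores from a collated report card."""
    props = rng.dirichlet(np.asarray(counts) + 1, size=n)
    return props @ weights

rymney = malarkey_draws([25, 40, 50, 55, 45, 20])
obiden = malarkey_draws([45, 55, 55, 45, 25, 10])

# Distribution of the ratio of the two tickets' malarkey scores.
ratio = rymney / obiden
lo, hi = np.percentile(ratio, [2.5, 97.5])
print(f"mean ratio {ratio.mean():.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
print(f"P(Rymney spews more malarkey) = {(ratio > 1).mean():.3f}")
```

The same draws can be read either way: as a comparison of the tickets, or, under the thought experiment, as a measure of average fact checker bias.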
Here's a plot of that distribution with the 95% confidence intervals labeled on either side of the mean ratio. The white line lies at equal malarkey scores between the two tickets.
Interpreted as a true comparison of factuality, the probability distribution indicates that we can expect Rymney's statements to be on average 17% more full of malarkey than Obiden's, although we can be 95% confident that the comparison is somewhere between 8% and 27% more red than blue malarkey.
Interpreted as an indicator of the average bias of PolitiFact and The Fact Checker, the probability distribution suggests that, if the two tickets spew equal amounts of malarkey, then the fact checkers on average rate the Democratic ticket's statements as somewhere between 8% and 27% more truthful than the Republican ticket's statements.
I'm going to speak against my subjective beliefs as a bleeding heart liberal and say that amount of bias isn't all that unrealistic, even if the bias is entirely subconscious.
If instead we believed like a moderate conservative that the true comparison was reversed - that is, if we believed that Obiden spewed 17% more malarkey than Rymney - then it suggests that the fact checkers' average bias is somewhere between 16% and 54% for the Democrats, with a mean estimated bias of 34%.
It seems unrealistic to me that PolitiFact and The Fact Checker are on average that biased against the Republican party, even subconsciously. So while I think it's likely that bias could inflate the difference between the Republicans and Democrats, I find it much less likely that bias has reversed the comparison between the two tickets. Of course, these beliefs are based on hunches. Unlike politifactbias.com's rhetoric and limited quantitative analysis, however, they are based on good estimates of the possible bias, and our uncertainty in it.
It isn't just conservatives that accuse PolitiFact and The Fact Checker of bias. Believe it or not, liberals do, too. Liberals accuse fact checkers of being too centrist in a supposedly misguided quest to appear fair. You can look to Rachel Maddow as a representative of this camp. Maddow's accusations, like politifactbias.com's, typically nitpick a few choice rulings (which is funny, because a lot of critics on both sides accuse PolitiFact and The Fact Checker of cherrypicking).
Such accusations amount to the suggestion that fact checkers artificially shrink the difference between the two parties, making the histogram that I showed above incorrectly hover close to a ratio of one. So how much centrist bias do the fact checkers have on average?
Well, let's assume for a moment that we don't know which party spews more malarkey. We just know that, as I've estimated, the fact checkers on average rule that one party spews somewhere between 1.08 and 1.27 times the malarkey that the other party spews. Now let's put on a Rachel Maddow wig or a Rush Limbaugh bald cap and fat suit to become true partisans that believe the other side is actually, say, 95% full of crap, while our side is only 5% full of crap. This belief leads to a ratio of 19 to 1 comparing the malarkey of the enemy to our preferred party. Already, it seems unrealistic. But let's continue.
Next, divide each bin in the histogram I showed above by 19, which is the "true" ratio according to the partisans. The result is a measure of the alleged centrist bias of the average fact checker (at least at PolitiFact or The Fact Checker). Get a load of the 95% confidence interval of this new distribution: it runs from about 6% to about 7%. That is, a partisan would conclude that PolitiFact and The Fact Checker are on average so centrist that their rulings shrink the difference between the two parties to a mere SIX PERCENT of what it "truly" is.
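The arithmetic of that thought experiment, spelled out (using the confidence interval endpoints reported above):

```python
# The observed malarkey ratio between the tickets (95% CI roughly
# 1.08 to 1.27, as reported above), divided by the ratio a hard
# partisan believes is true (95% crap vs. 5% crap = 19 to 1).
observed_lo, observed_hi = 1.08, 1.27
partisan_true_ratio = 0.95 / 0.05  # 19.0

implied_shrinkage_lo = observed_lo / partisan_true_ratio
implied_shrinkage_hi = observed_hi / partisan_true_ratio
print(f"{implied_shrinkage_lo:.3f} to {implied_shrinkage_hi:.3f}")
# A hard partisan must conclude that fact checkers shrink the "true"
# difference to roughly 6-7% of its actual size.
```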
I don't know about you, but I find this accusation as hard to swallow as, if not harder than, the accusation that there is minor partisan bias among fact checkers.
Then again, my belief that fact checkers on average get it about right is entirely subjective. Given the data we currently have, it is not possible to tell how much partisan bias versus centrist bias versus honest mistakes versus honest fact checking contribute to the differences that I have estimated.
So what is the way forward? How can we create a system of fact checking that is less susceptible to accusations of bias, whether partisan or centrist? Here are my suggestions, which will require a lot of investment and time.
- More fact checking organizations. We need more large-scale fact checking institutions that provide categorical rulings like The Fact Checker and PolitiFact. The more fact checker rulings we have access to, the more fact checker rulings we can analyze and combine into some (possibly weighted) average.
- More fact checkers. We need more fact checkers in each institution so that we can rate more statements. The more statements we can rate, the weaker selection bias will be because, after some point, you can't cherrypick anymore.
- Blind fact checkers. After the statements are collected, they should be passed to people who do not see who made the statement. While it will be possible for people to figure out who made some statements, particularly when they are egregious, and particularly when they are repeated by a specific party or individual, many statements that fact checkers examine can be stripped of information about the individuals or parties involved so that fact checkers can concentrate on the facts.
- Embrace the partisans and centrists. There should be at least one institution that employs professional fact checkers who are, according to some objective measure, at different points along the various political dimensions that political scientists usually measure. So long as they are professional fact checkers and not simply politically motivated hacks, let these obvious partisans and centrists subconsciously cherrypick, waffle, and misrule to their heart's content so that we can actually measure the amount of subconscious bias rather than make accusations based on scanty evidence and fact checker rulings that make our neck hairs bristle.
I hope that Malark-O-Meter will someday grow into an organization that can realize at least one of these recommendations.
What do Nate Silver, Darryl Holman, Drew Linzer, and Sam Wang all have in common? They all use statistical methods to forecast elections, especially presidential ones. Their models all tend to say the same thing: the odds are pretty good that Obama is going to win. Yet they often make different predictions about the number of electoral votes that, say, Obama will get, and about the probability that Obama would win if an election were held right now.
For example, as of right now, Silver predicts 294 electoral votes to Obama with 3 to 1 odds of an Obama win. Holman predicts an average 299 electoral votes with 9 to 1 odds of an Obama win. Wang predicts a median 291 electoral votes, also with 9 to 1 odds of an Obama win. Linzer predicts a whopping 332 electoral votes and doesn't report the probability of an Obama win.
I contacted each of those men to request access to their electoral vote probability distributions. So far, Sam Wang and Darryl Holman have accepted. Drew Linzer declined. Nate Silver hasn't answered, likely because his mailbox is chock full of fan and hate mail.
Wang and Holman now both offer their histogram of electoral vote probabilities on their respective web pages. I went and grabbed these discrete probability distributions and did what a good, albeit naive model averager would do: I averaged the probability distributions to come up with a summary probability distribution (which, by the way, still sums to one).
This method makes sense because, basically, these guys are estimating 538 parameters, and I'm simply averaging those 538 parameters across the models to which I currently have access, because I have no reason to think they differ much in predictive power (although later on the method could be extended to include weights).
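Here's a minimal sketch of that averaging, with discretized normal curves standing in for the two real histograms (the centers and spreads are invented; the real distributions would be grabbed from Wang's and Holman's pages):

```python
import numpy as np

ev = np.arange(539)  # possible electoral vote totals for Obama

def fake_forecast(center, spread):
    """Stand-in for a published EV histogram (discretized normal)."""
    w = np.exp(-0.5 * ((ev - center) / spread) ** 2)
    return w / w.sum()

# Invented stand-ins for the two models' histograms.
wang = fake_forecast(291, 18)
holman = fake_forecast(299, 22)

# Equal-weight model averaging: the average of two probability
# distributions over the same support still sums to one.
avg = (wang + holman) / 2

mean_ev = float((ev * avg).sum())
p_win = float(avg[ev >= 270].sum())
odds = p_win / (1 - p_win)
print(f"mean EV {mean_ev:.0f}, P(win) {p_win:.2f}, odds {odds:.1f} to 1")
```

Weighted averaging would only change the `(wang + holman) / 2` line to a weighted sum whose weights sum to one.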
From the aggregated electoral vote distribution, I calculated the mean, median, 2.5th percentile, and 97.5th percentile of the number of electoral votes (EV) to Obama. I also calculated the probability that Obama will get 270 EV or more, winning him the election.
Mean EV: 296
Median EV: 294
95% Confidence interval: 261, 337
Probability Obama wins: over 90%
So 9 to 1 odds Obama wins. Something like 294 or 296 electoral votes.
I'd love to see what happens if I put Nate Silver into the equation. Obviously, it will drag the distribution down. I might look into modeling weights at that point, too, because both Holman and Wang predicted the electoral votes better than Silver, and I believe Wang did a slightly better job than Holman, although I forget.
Anyway, there you have it. Rest easy and VOTE.
Election Day fast approaches. A few remain undecided. The majority have decided, and look for any last bits of manure to fling at the other side. In the era of fact checking websites, a candidate's factuality has become important to a lot of people, including me.
That's why, earlier this week, I introduced Malark-O-Meter to the world (well...more like only 500 people in the world). I statistically analyzed fact checker report cards from the Truth-O-Meter and The Fact Checker to compare the factuality of the 2012 presidential and vice presidential candidates overall, and during the debates that had happened so far. I promised I'd get back to you on the third debate, and with a summary of how the two parties did in the debates overall, compared to one another, and to their usual selves.
Unless Bidama and Rymney (or is it Obiden and Romyan?) blast each other in the next few weeks as much as they have in the last year, this is probably my final 2012 election malarkey analysis until Election Day. This is one of the most comprehensive, sophisticated, and detailed analyses of the 2012 presidential candidates' factuality.
Share it with your friends. Discuss its results. Debate its merits. Tell me precisely why you think I'm full of shit. Supersize the histograms and paste them onto brick walls like you're Shepard Fairey. Because this stuff matters. It matters because the facts matter. It matters because we should understand how confident we can be in our judgments about people.
Enough histrionics. Let's get to the science. If you've never been here before, quickly skim how I calculate the malarkey score and how I do my statistical comparisons before continuing. If you read my last 2012 presidential campaign update, not much has changed. So you might want to scroll down to my analysis of the third debate, and the debates as a whole.
Full report cards
I collected the full Truth-O-Meter and Fact Checker report cards for Obama, Biden, Romney, and Ryan this morning. Let's start with what we observe. That is, what can we say about the factuality of the two sides if we take the report cards at face value? Here are the revised overall malarkey scores for each individual candidate and each ticket.
(average malarkey in statements)
(average malarkey of individuals)
Okay. Not much has changed since last time. We observe that Obama spews less malarkey than Romney, and Biden less than Ryan. We observe that the blue team's statements are less full of malarkey than the red team's, and the Democratic candidates themselves are less full of malarkey than the Republican candidates. But not by much. No candidate or ticket appears much better or worse than half full of malarkey.
The trouble is, for each candidate (and party), we only have a small sample of the statements they've made. That introduces sampling error. We must calculate the certainty with which we can make judgments about the candidates and parties given the data we have.
To the right are the probability distributions of malarkey scores for the four candidates, labeled with the 95% confidence intervals on either side of the expected malarkey score. The white line lies at a malarkey score of 50.
How certain can we be that the candidates are much better or worse than half full of malarkey? From the probability distributions shown at right, I calculated the probabilities.
Odds are 9 to 1 that Obama's less than half full of malarkey, but not by much. It's almost 100% likely that Romney is more than half full of malarkey. The difference is greater than for Obama, but still not much different from half a bucket of malarkey.
The odds are only around 2 to 1 that Biden is more than half full of malarkey. Again, not by much. The same is true for Ryan.
We can only be pretty certain about how the presidential candidates compare to a half bucket of malarkey. What about the party tickets as a whole?
To the left are the probability distributions of the collated and average malarkey scores. Based on these distributions, the odds are nearly 9 to 1 that Obiden's collective statements are on average less than half full of malarkey (but not by much), while it is almost certain that Rymney's are more than half full of malarkey (by a wider margin, but still not by much).
It's a statistical toss-up whether Obama and Biden are on average less than half full of malarkey, whereas the odds are over 19 to 1 that Romney and Ryan are on average more than half full of malarkey (but still not by much).
But how do the candidates and tickets compare, and how certain can we be about those comparisons?
To the right are probability distributions of the ratio between a Republican malarkey score and a Democrat malarkey score. Red bars occur when the Republican malarkey score is greater than the Democrat score. The comparisons run from very murky (for the v.p. candidates) to pretty clear (for the collated ticket report cards and the presidential candidates).
I am essentially 100% certain that Romney spews more malarkey than Obama...but not much more. Not even twice as much. Not even one and a half times as much. I'm also nearly 100% certain that Obiden collectively spew less malarkey than Rymney. But again, they're not that different. Basically, I can't tell a difference between Biden and Ryan. If there is a difference between them, it is tiny, but in favor of Biden. I can, however, give 9 to 1 odds that Obama and Biden are on average more factual than Romney and Ryan. But not that much more factual!
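The ratio comparisons work the same way: divide one set of simulated score draws by the other and count how often the ratio exceeds one. This sketch uses invented report cards and an assumed category-to-score mapping, not the actual Truth-O-Meter or Fact Checker data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed mapping from "True" ... "Pants on Fire" to a 0-100 scale.
weights = np.array([0, 25, 50, 75, 100, 100])

def score_draws(counts, n=100_000):
    """Simulate malarkey scores from a report card via a Dirichlet posterior."""
    return rng.dirichlet(np.asarray(counts) + 1, size=n) @ weights

red = score_draws([6, 10, 12, 11, 9, 5])   # hypothetical Republican card
blue = score_draws([9, 14, 13, 8, 6, 2])   # hypothetical Democrat card

ratio = red / blue
p = (ratio > 1).mean()   # P(red score exceeds blue score)
odds = p / (1 - p)       # convert the probability to odds
print(f"mean ratio: {ratio.mean():.2f}, P = {p:.2f}, odds about {odds:.1f} to 1")
```

A mean ratio near 1 with high P(ratio > 1) is exactly the "certain, but not by much" pattern described above: the direction of the difference is clear even when its size is small.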
We draw two lessons. First, and I repeat from last time, the differences in factuality between the two parties aren't as large as either side would have you believe. That said, there is a clear difference. The differences we can be certain about favor Democrats. And all of the differences we've measured, regardless of our certainty in them, suggest that the Democratic candidates are more factual than the Republicans.
On to the debates.
This morning, I collected the Truth-O-Meter rulings of claims made during the final debate. To the right are the observed malarkey scores for that debate.
Once again, Romney spews more malarkey than Obama. At least, that's what the data says. But how strong is the evidence? Let's bust out the simulator.
To the left are the probability distributions of Obama's malarkey score in all three debates. The odds are only about 2 to 1 that Obama was more than half full of malarkey in the first debate. But they're better than 4 to 1 that he was less than half full of malarkey in the second. The odds are only 3 to 2 that he was less than half full of malarkey during the last debate.
As for Romney, the odds are better than 3 to 1 that he was more than half full of malarkey during the first debate. His report card from the second debate was surprisingly truthful, with odds of 4 to 1 that he was less than half full of malarkey. As for the third debate, we're back to 2 to 1 odds that he spewed more than half a bucket of malarkey.
How did the presidential candidates perform in the debates overall? To answer this question, I calculated two summary malarkey scores for the debates.
First, I collated the candidates' report cards from each debate, then calculated a malarkey score from the result. This measures the average falsehood of the statements a candidate made across all the debates. The odds are about 2 to 1 that Obama's statements during the debates were less than half full of malarkey. The odds are about the same that Romney's statements were more than half full of malarkey.
Second, I averaged the candidate's malarkey score across the three debates. The odds are again about 2 to 1 that Obama was on average less than half full of malarkey during a debate. Statistically, we can't tell whether Romney was on average more or less than half full of malarkey during a debate.
People seemed very let down by Obama's performance in the first debate. Mostly, it had to do with his demeanor. But could people have been subconsciously disappointed with his truthfulness during that debate as well, perhaps cued by subtle facial expressions and body language suggesting that he was being less truthful than usual?
Maybe. In any case, the odds are 4 to 1 that he spewed a few more falsehoods during the first debate than he usually does. The odds are again 4 to 1 that his performance improved during the second debate, when he appeared to be more factual than usual. Statistically, we can't tell a difference between the usual Obama and the Obama in the final debate.
In contrast, we can't tell a difference between the usual Romney and the Romney in the first debate. The odds are 49 to 1 that Romney was more factual during the second debate than he usually is. Yet Romney lost that debate.
As for the third debate, the odds are only 2 to 1 that Romney was more factual in the final debate than he usually is. Whatever the case, he seems to have lost that one, too.
So much for the persuasive power of facts!
We've seen how the presidential candidates compare to themselves, but how did their debate performances compare to one another? To the right are the probability distributions of the ratio of Romney's malarkey score to Obama's in each of the debates. Red means Romney spewed more malarkey, blue means Obama did. The odds are only 2 to 1 that Romney spewed more malarkey than Obama during the first debate. It's a toss-up who was more factual in the second debate. And the odds are better than 3 to 1 that Obama was more factual than Romney in the final debate. So the evidence is pretty damn weak, but favors Obama for two of the debates. But even if Obama was more factual than Romney, he probably wasn't that much more factual. What about overall performance during the presidential debates?
To the right are the probability distributions of the ratio of Romney's collated debate report card to Obama's (top), and the ratio of Romney's average debate malarkey score to Obama's (bottom). The odds are nearly 3 to 1 that Obama's statements during the debates were more factual than Romney's. The odds are only about 2 to 1 that he was more factual on average than Romney was in a given debate. Again, the evidence is fairly weak, but it favors Obama. Even if it favors Obama, the differences in factuality aren't that great.
To prepare for my analysis of the collective debate performance of each party's ticket, I review my analysis of the vice presidential debate. To the right are the probability distributions of Biden's and Ryan's malarkey scores during the vice presidential debate.
The odds are about 3 to 2 that Biden was less than half full of malarkey during the debate. Contrastingly, the odds are nearly 9 to 1 that Ryan was more than half full of malarkey during the debate. So Biden was probably right. It was all just a bunch of stuff! Well, not all of it. Actually, not much more than half of it was malarkey.
Still, given the small amount of evidence we have from the debate (which introduces a lot of sampling error), it's quite interesting that the odds remain so high that Ryan spewed so much malarkey. Perhaps it was Biden's mastering of the facts after all that dampened the Republican's Romentum! Well, at least it was his ability to point out Ryan's factual missteps. But remember, Biden was about half full of malarkey during the debate, too.
That said, the plot at right suggests that Biden was probably about as factual as he usually is during the debate, whereas the odds are better than 6 to 1 that Ryan was less factual than usual.
I'd like to think that Ryan's subtle cues of his own falsehood were a letdown to some undecided voters who had expected more from him, but the polls were pretty split about who won the debate.
Regardless of whether people think Ryan or Biden "won" the debate, the probability distribution of the ratio of Ryan's to Biden's debate malarkey score suggests better than 9 to 1 odds that Ryan spewed more malarkey than Biden. In this case, the mean difference between the two scores is actually fairly large. Or at least larger than we've come to expect from these candidates, who are all basically half charlatans. So here's a shout-out to Biden, whose spirited use of the term "malarkey" inspired the name of my factuality score. Good work, Mr. Vice President. Or at least, I'm over 90% sure it's good work.
Is there a way to analyze the collective malarkey scores across all four debates, and across the presidential and vice presidential candidates from each party? I'm Brash Equilibrium, baby! Of course there is.
I came up with two measures of malarkey across all four debates. First, I simply collated the statements from the presidential and vice presidential candidates into two summary report cards, one for each party.
Second, I took the average of a presidential candidate's average malarkey across all three presidential debates, and the vice presidential candidate's debate malarkey score. Let's unpack that a bit. Step one was to average the presidential candidate's malarkey across all three debates. Step two was to take the average of that average and the vice presidential candidate's malarkey score from the vice presidential debate.
Why did I take an average of averages? Because I wanted to measure the average malarkey score of an individual on a party's ticket, not the party's average score across the four debates. If I'd done the latter, I would have weighted the presidential candidates more heavily, which I already do in the collated measure, since presidential candidates were more heavily fact checked, and had more debates, than vice presidential candidates.
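The arithmetic behind the average of averages is easy to sketch. The debate scores below are invented for illustration, not real data:

```python
# A party ticket's debate malarkey, two ways. Scores are hypothetical.
pres_debates = [55.0, 44.0, 51.0]   # presidential candidate, 3 debates
vp_debate = 58.0                    # vice presidential candidate, 1 debate

# Step one: the presidential candidate's average across his debates.
pres_avg = sum(pres_debates) / len(pres_debates)

# Step two: average that average with the VP candidate's single score,
# so each candidate (not each debate) gets equal weight.
ticket_avg = (pres_avg + vp_debate) / 2

# Compare with the naive per-debate average, which overweights the
# presidential candidate (3 debates vs. 1).
naive_avg = (sum(pres_debates) + vp_debate) / 4

print(pres_avg, ticket_avg, naive_avg)  # prints 50.0 54.0 52.0
```

With these made-up numbers the two weightings differ by two points; the point is only that the average of averages answers "how full of malarkey is the average candidate on the ticket," not "how full of malarkey was the average debate."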
Okay, let's look at some graphs and calculate some odds.
At left are the probability distributions for the collated and candidate average debate malarkey for each party.
The odds are almost 3 to 1 that Obiden's collective statements during the debates were less than half full of malarkey. The odds are about 2 to 1 that the average Democratic candidate's average debate performance was more than half factual.
Contrastingly, the odds are better than 6 to 1 that the collective statements of Rymney were more than half full of malarkey. The case is similar for the average Republican candidate's average debate performance.
Note the similarities between the confidence intervals and means of the debate summaries and those of each party's malarkey scores calculated from their full report cards. These similarities make me confident that malarkey scores taken from full report cards are a pretty good predictor of malarkey scores accrued during events like televised debates. Remember also that the candidates' overall malarkey scores were calculated from two fact checkers, whereas the debate data comes from just one.
Maybe there is something to this Malark-O-Meter thing after all. Which brings me to our final plot.
The odds are better than 7 to 1 that the Republican candidates' collective statements during the debates were more full of malarkey than the Democrats'. The difference isn't that big, but it's not trivial.
The odds are better than 6 to 1 that the average Republican candidate's average debate performance was more full of malarkey than the average Democratic candidate's. Again, the difference isn't big, but it's not trivial.
If factuality were all you cared about in a candidate, there is pretty strong evidence that you should prefer the Democrats over the Republicans. That said, there is also pretty strong evidence that the differences that likely exist between the two tickets aren't massive. Still, they aren't trivial.
Of course, you don't only care about factuality. You care about policy. You care about issues. But therein lies the rub. When politicians design and advocate for policies, they ideally do so with some grounding in the facts. Evidence matters, or at least it should, just as much to policymaking as it does in a courtroom or a chemistry lab.
What about values? You should care about your candidates' values too, right? But how are your candidates' values informed by the facts?
You see where I'm going here. I understand that factuality isn't the only characteristic we should consider when deciding who gets our vote.
But it sure seems to be at the root of all the others!
Those who've read my description of the malarkey score for a group (such as the members of a presidential campaign ticket) know that I have two group malarkey measures: the collated malarkey score and the average malarkey score.
Collated malarkey combines the statements of separate individuals into a single report card, then calculates the malarkey score from the combined report card. Average malarkey calculates a malarkey score for each member of the group, then averages them. Collated malarkey measures the average falseness of the statements a group makes. Average malarkey measures the average falsehood of a group's members.
You can also calculate collated and average malarkey scores for report cards grouped by a type of event. For example, there were three presidential debates. I will have a report card for each presidential candidate and each debate. I can collate them and calculate a malarkey score or calculate a malarkey score from each and average a candidate's malarkey scores across the three debates.
Right now, it looks like the most rulings will be for the first and last debates. That means the collated bullpucky score will be influenced more by the statements in these debates than in the second debate. Yet if I average across the debates, I treat each debate equally.
Hrm. Well, I think there's value in both strategies. So I'll just do both.
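Both strategies are easy to express in code. Here's a sketch, assuming a simple 0 to 100 mapping of rating categories (Malark-O-Meter's actual weighting may differ) and invented per-debate report cards:

```python
import numpy as np

# Assumed category values from "True" ... "Pants on Fire".
WEIGHTS = np.array([0, 25, 50, 75, 100, 100])

def malarkey(card):
    """Malarkey score of one report card (counts per category)."""
    card = np.asarray(card, dtype=float)
    return (card @ WEIGHTS) / card.sum()

def collated_malarkey(cards):
    """Combine report cards first, then score: busy debates weigh more."""
    return malarkey(np.sum(cards, axis=0))

def average_malarkey(cards):
    """Score each report card, then average: each debate counts equally."""
    return np.mean([malarkey(c) for c in cards])

# Hypothetical per-debate report cards for one candidate.
debates = [[4, 6, 5, 3, 2, 1],   # first debate (heavily fact checked)
           [1, 2, 2, 1, 1, 0],   # second debate (few rulings)
           [3, 5, 4, 3, 2, 1]]   # third debate
print(collated_malarkey(debates), average_malarkey(debates))
```

The two numbers diverge exactly when the heavily fact-checked debates differ from the lightly checked ones, which is why both summaries are worth reporting.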
"The Hitchhiker's Guide to the Galaxy skips lightly over this tangle of academic abstraction, pausing only to note that the term 'Future Perfect' has been abandoned since it was discovered not to be."
(Douglas Adams, The Restaurant at the End of the Universe. Pan Books, 1980)
So where am I going with Malark-O-Meter? I've lost an unhealthy amount of sleep working on it when I should have been resting after a day of interviews and dissertation data management. My heart flutters when I'm about to finish a Malark-O-Meter analysis, again when I'm about to share it, and again when I get some feedback. Wait a second. That must mean I'm passionate about it!
I'm passionate enough about Malark-O-Meter (M.O.M.) to think seriously about its future. I'm also passionate (almost to a fault) about doing things transparently. So I'm going to write my business plan in public view, using Google Docs. Every journey starts with one (almost) blank page (as of this writing).
Before I type another word in that document, I'll lay out what I've been thinking about so far. I'll warn you. It's too big for M.O.M.'s current britches. But I think if I play my cards right and meet the right people, all of this will one day be feasible.
I envision a non-profit organization that adheres to and extends M.O.M.'s core mission: to statistically analyze fact checker rulings to make judgments about the factuality of politicians, and to measure our uncertainty in those judgments.
M.O.M. would continue to calculate and statistically compare intuitive measures of factuality. It would mine data sufficient to calculate malarkey scores for entire political parties, and for the complex web of politicians, PACs, SuperPACs, supporters, and surrogates that surround a political campaign. As data amasses over its existence, we would be able to study changes in factuality over time, and for a larger sample of individuals, groups, and localities.
We'd extend M.O.M.'s core mission by doing our own, in-house fact check rulings. This would increase the sample size of fact checker report cards that we use to generate malarkey scores. I see a transparent fact checking system employing at least two competent fact checkers, preferably at varying points within the political spectra, who make rulings comparable to the Truth-O-Meter and The Fact Checker. By getting into the business of fact-checking, we would also be in a position to do longitudinal malarkey research with historical scope as wide and deep as our archives of political claims.
That's right. One of the projects I'd like to do is to, within reason, make a fact check report card for every American president, alive or dead. Maybe every vice president, too. American politics has a rich historical record. Let's find a new use for it! We could ask and answer questions like, "Have presidential candidates become more brazen as the power of the presidency has increased?" Of course, to fact check history properly, we'd need to consult with at least one able historian.
Another way to extend M.O.M.'s mission is to make outreach a key component. How do we make the technical machinery of M.O.M., and its equally technical output, understandable to the broadest audience possible? To fulfill this mission, we need to blur the lines between political science and journalism. This has already happened with projects like fivethirtyeight.com and the Princeton Election Consortium (and I'll put in a plug for the less well-known but equally awesome Darryl Holman at horsesass.org, whose voting simulations you should take seriously despite the website's hilarious name). Fact checking itself is a hybrid of journalism and scholarship. M.O.M. would add to and innovate within this new information ecology.
M.O.M.'s revenues could come from several sources. Of course, there are grants, there's crowd-funding, there are "please give us money" buttons. But we could also gain revenue by doing commissioned studies for the media, for think tanks, and, yes, even for political campaigns. I'm thinking the funding will come mostly from grants, then commissioned studies, then begging for money from the crowd.
So, yeah. These ideas are big. I am serious about this project. It is not a toy or a gimmick or a game. I want to follow through with this to the end. I hope you'll follow along. And for some of you, I hope some day you will participate.
For now, stay tuned for my next analysis update, in which I'll examine all four debates, and all four candidates.
If you watched the final presidential debate last night, you probably noticed that both candidates veered off the topic of foreign policy several times. Originally, I thought this might complicate my malarkey analysis of the third debate because I'd have to filter out claims unrelated to foreign policy. I changed my mind for two reasons.
First, the candidates appeared to believe that their off-topic comments were relevant to foreign policy. Maybe some of them actually were. Second, I reminded myself that my task is to compare the factuality of the two candidates during the debate, and relative to their overall record, not to fact check foreign policy commentary specifically.
Anyway, stay tuned for my next Malark-O-Meter data update, in which I analyze the presidential candidates' performances during the debate relative to one another and their overall records, compare and summarize their performances across all three debates, compare and summarize the performances of both tickets across all four of the debates, and update our beliefs about the overall factuality of the red and blue teams.
And yes, calling the Republicans and Democrats the red and blue team is a thing for me. Because sometimes, this campaign has been as comical as Red vs. Blue.
PolitiFact recently published a list of their Truth-O-Meter rulings on what the candidates and their "surrogates" have said regarding foreign policy, including Iraq and Afghanistan. Because this is a small and probably biased sample of rulings, it gives me an opportunity to demonstrate the power of Malark-O-Meter to show how much signal there is amid the noise when it comes to how much malarkey candidates spew. It also gives me a chance to showcase Malark-O-Meter's current limitations.
This is quick and dirty because the debates begin soon. So let me know if there are any mistakes.
I collated the statements made by red and blue candidates and their surrogates into a red pile and a blue pile. Then I used Malark-O-Meter's simulation methods to simulate the probability distribution of the individual foreign-policy-specific malarkey scores, and to simulate the ratio of the red team's foreign-policy-specific malarkey to the blue team's. Then I calculated the 95% confidence interval (95% CI) of the individual scores and the ratio, and calculated the probability that the red team spews more foreign-policy-specific malarkey (FPSM) than the blue team.
Here are the results:
- Blue team's FPSM -- mean: 50; 95% CI: 36 to 64 (so, maybe half truthful)
- Red team's FPSM -- mean: 56; 95% CI: 36 to 75 (so, maybe a little more than half full of malarkey)
- Ratio -- mean: 1.15; 95% CI: 0.68 to 1.73
- Probability that the red team spews more FPSM than the blue team is about 70%.
So according to the analysis, there are only slightly better than 2 to 1 odds that the red team spews more malarkey than the blue team, but we expect the difference between the two teams to be pretty small (yes, I've been calling the Republicans and Democrats the red and blue team for the last few paragraphs).
But what does this mean? Well, PolitiFact chose precisely the same number of statements for each team. Maybe they subconsciously chose rulings that in aggregate downplay any difference between the teams. Or maybe PolitiFact has a liberal bias, as some allege. In that case, we'd expect them to inflate the difference between the two teams, or even invert it if their bias is strong enough. If both biases act in tandem, we might expect a small difference favoring the blue team.
But honestly, all of this is hand waving. We need more evidence to know if such biases exist and how strong they are. And we need more statements on foreign policy from each team. Well, we're going to get the latter tonight. As for the former. Well. Some day.
For what it's worth, however, this is evidence. I encourage you to gather more and to share it with me. But based on this evidence, I predict that Romney will spew somewhat more malarkey tonight than Obama.
And yes, I'm going to examine that question tomorrow (after I do some field work for my dissertation project).