I lowered my estimation of Russ Roberts and did the reverse for Robin Hanson after listening to this episode of EconTalk. Roberts says he wants to discourage people from being persuaded by empirical arguments. I would like to discourage people from being persuaded by Roberts in that case. Here is my rank ordering of persuasive evidence:
1: Controlled, replicated double-blind experiments, or (even better) meta-studies of such
2: Non-manipulated (meant in a non-pejorative sense) non-experimental empirical evidence
2.5: Case studies (may be considered a subset of 2)
3: Simple regressions/correlations with few controls from non-experimental empirical evidence
3.5: Same as above, but with more steps/complications
4: Mathematical models
5: Verbal arguments relying on logic
6: Verbal arguments relying on analogies
7: Catch-all for things I haven’t mentioned
Last: Proof by assertion
Roberts and I agree that simple empirical evidence is better than fancy complicated analysis of data. The reason is that the more additional input the analyst brings to bear, the closer their analysis gets to logical argument. The point about meta-studies should also be applied to numbers other than 1, but I didn’t feel like repeating myself. One should resort to a low number only when a higher number is not available.
UPDATE: A reader asks Andrew Gelman for his favorite example of an experimental study debunking a causal relation found in a regression.
February 4, 2009 at 8:19 pm
“3: Simple regressions/correlations with few controls from non-experimental empirical evidence
4: Mathematical models
5: Verbal arguments relying on argument
6: Verbal arguments relying on analogies”
What if it was the case that 3 fails to find any positive effects from free trade, but 4, 5, and 6 do?
My understanding is that this is in fact the case. A good example of 6, btw, is Landsburg's "Iowa Car Crop" (ch. 21, The Armchair Economist).
I agree about Hanson in that interview – I was quite impressed.
February 4, 2009 at 9:54 pm
We've seen a number of autarkies and they were all poor. The richest countries tend to do a lot of trade (including importing) with other countries. So that falls under 2/2.5.
February 4, 2009 at 10:37 pm
See above.
More generally, I don’t think that a hierarchy like that is useful. Such things tend to be the basis of pseudo-science or bad philosophy of science.
Hanson often points out that the grand-daddy example of 1 is medicine, where each technique is confirmed double-blind by fantastically expensive studies but medicine as a whole is disconfirmed by equivalently careful studies.
February 4, 2009 at 10:39 pm
Also, one is ALWAYS left with 5 or 6 around the edges of a chain of argument to connect it to anything real (for instance, your verbal argument about autarkies above), and such chains are serial, so they can't be stronger than their weakest link.
February 5, 2009 at 12:23 am
How is 4 not a sub-set of 6?
February 5, 2009 at 1:19 am
Johnny: 4 is a sub-set of 6. More importantly, everything is a sub-set of 5.
Tggp: thoughts on this?
http://scienceblogs.com/gnxp/2009/02/ceo_pay_part_ii.php
BTW, I’m curious as to why you and Mencius never clashed on empiricism.
February 5, 2009 at 8:28 am
You need to note something about the source of the data used in 2-3. If the data for 2-3 come from a probability sample (like the GSS, NLSY, or Gallup Poll), they are superior to evidence from convenience samples or case studies.
February 5, 2009 at 11:59 am
“where each technique is confirmed double blind by fantastically expensive studies ”
Except when they aren’t, as is the case for virtually all surgery.
February 5, 2009 at 12:41 pm
Oh, before I forget:
I don’t find any one kind of evidence especially convincing. I tend to be convinced when multiple types of evidence converge on the same conclusion — especially when they do so from different ‘directions’.
February 5, 2009 at 1:08 pm
TGGP: What, if anything, do you think of Paul Feyerabend's book "Against Method"?
February 5, 2009 at 7:23 pm
My point about autarkies was just a blog comment, but I think it falls more under the heading of case studies (at least in its basis) rather than analogy/logic.
5 has a typo; it should read "relying on logic" rather than the repetitive "argument".
Mathematical models make assumptions explicit and explore boundaries/different conditions. Kevin Murphy's recent equation for the stimulus is a good example of something that lets both opponents and proponents agree on a framework and determine what differences in their beliefs result in their ultimate disagreement. Tabarrok on intellectual property and Caplan on federalism are both examples of model-based thinking that have reduced my confidence, even if I can't say I changed my mind.
I have clashed with Moldbug over empiricism. His protestations to the contrary, I accuse him of rationalism with the occasional fallback on disguised mysterianism (perhaps “anti-reductionism” might be a better description). I have recently been dissenting from him on whether corporate governance is efficient, so I find the argument about CEO pay plausible. This Overcoming Bias post also seems relevant.
Yes, case studies aren’t as good as comparisons of lots of data and good sampling is better than poor sampling. melendwyr’s point about agreement among multiple sources is correct, which is why replicated results are better than single ones that by chance might beat the null hypothesis.
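To put rough numbers on that point (a back-of-the-envelope sketch, assuming independent studies run at the conventional 5% significance level): a single positive result is a fluke one time in twenty, but two independent replications are both flukes only

$$P(\text{both spurious}) = 0.05 \times 0.05 = 0.0025,$$

or one time in four hundred. Real studies are never fully independent (shared methods, shared publication biases), so this overstates the gain, but the direction is right.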
L.A., I have never read Feyerabend, just David Stove making fun of him.
February 6, 2009 at 2:25 am
Tggp: Glad to see you agree with melendwyr on this critical point and that you recognize that Moldbug is on crack on corporate governance, his key practical rather than theoretical mistake.
Glad to see the GMU guys independently moving your attitudes in 3 vertiginous directions.
February 6, 2009 at 12:44 pm
It’s not just multiple sources, but multiple types. It’s the only way to determine if there’s a problem with one of the types, or a serious failure of theory.
Many people making mathematical models of climate change might agree, but this is far less useful in evaluating either the models or our understanding than finding agreement between a model and empirical observations.
Our conclusions are robust when different paths converge on them. They are fragile when we have only a few ways of generating them. We are vulnerable to limitations in our methods in that case.
February 6, 2009 at 6:44 pm
Kevin Murphy is not from GMU. And what do you think is his key theoretical mistake? My main reason for pointing to Patri Friedman instead is that he actually has a plan to get from here to there that involves marginalism.
That's a good point, melendwyr. Different sources might have a lot of shared information and we should be wary of double-counting it, just as when we hear a single person say something multiple times.
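In odds form (a standard Bayesian bookkeeping identity, offered here as an illustration rather than anything melendwyr committed to), two pieces of evidence only multiply up when the second is informative beyond the first:

$$\frac{P(H \mid E_1, E_2)}{P(\lnot H \mid E_1, E_2)} = \frac{P(H)}{P(\lnot H)} \times \frac{P(E_1 \mid H)}{P(E_1 \mid \lnot H)} \times \frac{P(E_2 \mid E_1, H)}{P(E_2 \mid E_1, \lnot H)}$$

If $E_2$ is just a restatement of $E_1$, the last factor equals 1 and the second report should move us not at all, however often it is repeated.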
February 7, 2009 at 11:53 am
I’d order things:
1: Controlled, replicated double-blind experiments, or (even better) meta-studies of such
2: Multi-variate regressions only when you have massive quantities of controlled data, and can successfully make predictions based on the regressions.
3: Case studies / detective work
4: Deduction from broadly established empirical facts
5: Simple regressions and statistics with assumptions clearly stated, on messy data.
Everything else – models and complicated regressions based on messy data, analogies, etc. – is useless. Analogies are never evidence or proof; they are simply a way of illustrating evidence and helping people understand an issue. Most "controlled" regressions are about as useful as reading bird entrails (and far less honest).
I'd also note that 1) and 2) are very, very rare outside of chemistry and physics. I have yet to see a good example in the social sciences or economics. Thus case studies and deduction are the proper way to study any field involving human action. In economics, I'll take deduction over multivariate regression any day of the week.
TGGP-
Can you give a couple of examples of a good empirical argument that an economist or social scientist has made?
February 7, 2009 at 1:03 pm
Devin, your ranking does not seem too implausible and I think I may have placed case studies too high. However, the caveat on #2 seems large enough that it would require some elaboration on what is sufficient. Also, on the comment about “messy data”, I thought it was messiness that necessitated more complex data analysis.
There's loads of empirical stuff in Freakonomics, and while not all of it is good there's plenty that is. I think it was stupid for John Lott to frame his book as an attack on Freakonomics, and the same thing I said about Levitt's book applies. To point out one memorable part, his argument about the effect of campaign donations on the behavior of retiring politicians really made me think. Russ Roberts's point about Milton Friedman & inflation seems correct, though that's what's attacked in the Post Keynesian link I highlighted in my next post. Rather than his book (which I haven't read) I would say the stagflation of the 70s discredited old Keynesianism and Volcker's successful clampdown on inflation in the 80s vindicated Friedman. Of course, it is precisely those decades that Steve Keen claims discredited Friedman! The Bell Curve & The Nurture Assumption make use of lots of empirical data (though only the former presents lots of new regressions and teaches enough statistics to know what that means) to make points people are not inclined to accept unless forced to. The Inductivist, Audacious Epigone & La Griffe du Lion (though he seems to make more use of mathematics) seem to be following in their footsteps.
February 7, 2009 at 5:45 pm
TGGP-
On second thought, I'm going to put "deduction based on broadly established empirical facts" ahead of case studies.
The Friedman versus Keynes debates in the 70's illustrate the point nicely. Doing a case study of the period provides a base of established fact. But we must use deduction and logic to make sense of it. For example, the Keynesians claim that the 70's do not disprove Keynes, because the OPEC oil price manipulation caused the inflation. There is no way to refute this empirically. But we can refute the claim logically.
Also, on the comment about “messy data”, I thought it was messiness that necessitated more complex data analysis.
Perhaps instead of messy I should have said: variables that are subjective, impossible to define precisely, and/or too numerous.
For example, let’s say I’m trying to compare well being between 1900 and 2000. I could use the following methods:
1) Read trustworthy personal accounts written during the time period.
2) Look at statistics that compare the prices of certain products (eggs, shoes, a house) to a laborer's wage.
3) Combine all the goods in the economy into a basket, then use hedonics, chain weighting, and other complicated methods to distill the inputs into one number that defines economic well being.
Of these methods, the first is the best: it is the only one that can give me a real feel for the time period. Which is better: listening to a CD or listening to a doo-wop group? A 3,000 square foot house in the suburbs in 2000, or a row house in 1900 Brooklyn? These questions do not have objective answers. We can only read accounts that try to place us in the old time period, and then make up our own minds.
The second method is helpful because it helps provide a quantitative dimension. But it is still limited because there are relatively few products we can compare quantitatively between the two time periods. An egg is an egg, but a house can be a very different thing depending on the surroundings.
The third method may seem better, but in reality it is useless. It makes a thousand assumptions and subjective judgements and then obscures them behind one number. I am much better off just having the raw numbers and then reasoning through the assumptions and making the subjective judgements myself.
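For concreteness, the machinery method 3 hides looks something like a chained Fisher index (a textbook form, much simplified; real statistical agencies layer hedonic quality adjustments on top of it):

$$I_t = I_{t-1} \times \sqrt{\frac{\sum_i p_{i,t}\, q_{i,t-1}}{\sum_i p_{i,t-1}\, q_{i,t-1}} \cdot \frac{\sum_i p_{i,t}\, q_{i,t}}{\sum_i p_{i,t-1}\, q_{i,t}}}$$

Every choice in it (which goods $i$ enter the basket, how the quantities $q$ are measured, how the prices $p$ are adjusted for quality) is exactly one of those buried subjective judgements.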
To point out one memorable part, his argument about the effect of campaign donations on the behavior of retiring politicians really made me think.
I’ve never read Freakonomics, but when I was (briefly) a political science major I read a bunch of studies that tried using regressions to figure out how politicians voted. They were all absurd. If you want to understand the motivations of politicians, intern in Washington for a few months.
February 7, 2009 at 8:13 pm
We do combine logic with data to make sense of it. I don’t trust human rationality and so I want to minimize the amount of fallible reasoning necessary, binding our brains to reality without room for imagining nonsense.
I don’t trust source #1. You are unlikely to get a representative sample, and even then I do not trust people to be accurate in their personal accounts. How can you establish what is “trustworthy” in any case? I don’t trust “hedonics” (even the subject of “well being” doesn’t have an objective meaning), so that rules out 3. Instead I would look to measures of consumption.
I don’t think Freakonomics had much on politicians. The John Lott book was Freedomnomics, and he claimed to be attacking the cynicism of Freakonomics. I think cynicism is generally more accurate than the alternative, but some of my prior notions about public choice now seem to have more basis in ideology than an examination of an often messy reality.
February 8, 2009 at 12:36 pm
How can you establish what is “trustworthy” in any case?
You have a brain, use it. It’s not always easy to distinguish levels of trustworthiness, but it is usually possible.
Method #2 is fine but limited. For instance, there is no number that tells you whether communities are stronger and happier today than they were in 1900. First hand accounts are your only option. To the extent that you cannot find trustworthy sources, you have to admit that you do not know the answer. Not every answer lies under the lamp post, and if you refuse to look in the less well lit areas, you've just cut out most of your brain.
I think cynicism is generally more accurate than the alternative, but some of my prior notions about public choice now seem to have more basis in ideology than an examination of an often messy reality.
The blogger whose understanding of Washington best matches my experiences working there is Mencius Moldbug. Actual quid pro quo corruption is rare. The real culprit is a selection effect that promotes well-meaning people into power who genuinely believe that helping interest group X is in the public interest.
I also recommend the book “Government’s End” by Jonathan Rauch. He worked at the National Journal for years and knows the beast intimately. His analysis of how Washington actually works is spot on.
February 8, 2009 at 12:45 pm
You have a brain, use it.
That’s no explanation at all. I could have replaced my entire post with that. Somehow your brain-state must become entangled with reality in order for its conclusion to have any accuracy better than random.
stronger and happier
What do those mean?
First hand accounts are your only option
You just explicitly mentioned some other options.
you have to admit that you do not know the answer
I think that is often the case.
He worked at the National Journal for years and knows the beast intimately.
People who work in Washington are predominantly liberals. If I thought first-hand accounts were reliable, I’d then have to accept that liberals have a correct view of Washington. Of course, I do not.
February 8, 2009 at 1:20 pm
TGGP-
Most people over time acquire the ability to distinguish trustworthy accounts from BS. Partly it just comes from reading a lot and partly it comes naturally. There are tools and tricks you can teach yourself. It's really not that hard to read a first hand account of an event and to figure out where the author is coming from and what his motivations are. And remember – your beloved statistics are always a secondary source. All statistics are someone else's firsthand observation. So if you cannot trust any primary account, you can trust statistics even less.
stronger and happier – What do those mean?
Let me put it this way. Let's say you were making policy for a city, and your primary goal was to attract more residents by making it a nice place to live. As such, you were studying history to see what you could learn. Your best way of learning about what worked and what did not work in the past is by reading first hand accounts. Knowing some statistics about the price of eggs in 1900 is useless.
If I thought first-hand accounts were reliable, I’d then have to accept that liberals have a correct view of Washington
Rauch is a liberal. He very accurately describes how Washington does work. I very much disagree with him about how Washington should work. A source may be very fair and accurate, but if the source has different priors, you might still disagree with it. But that doesn't mean you cannot learn from it.
February 8, 2009 at 2:29 pm
Rauch is a libertarian liberal.
February 8, 2009 at 2:52 pm
Most people over time acquire the ability to distinguish trustworthy accounts from BS
Is this ability magic? If not, there must be a process to it. How is it that one does this distinguishing? How do we discover what is true?
It’s really not that hard to read a first hand account of an event and to figure out where the author is coming from and what his motivations are
How do you know whether your conclusion is accurate?
All statistics are someone else’s firsthand observation
We have machines that generate statistical data as it is produced (from cash registers, for instance). Part of what my company does is to help our clients do this, and that was also part of Steve Sailer's old job. This removes much of the element of human error. If you relied on self-reports to determine how many people were present at Martin Luther King's "I Have a Dream" speech, you would get a number far larger than the capacity of the area.
You earlier stated that you thought MM’s view was the most accurate one about how Washington works. MM says that most progressives think evil corporations, lobbyists, televangelists and the military-industrial complex have power over Washington. He thinks that’s completely wrong. Most people in Washington are not libertarians like Rauch.
February 8, 2009 at 5:50 pm
You may be interested in the recent work of philosopher Nancy Cartwright on this topic. I particularly enjoyed the first and last of the listed papers.
February 8, 2009 at 11:41 pm
Jason Malloy and Sister Y were exchanging dueling papers in another thread and I found one of them particularly relevant to this discussion for its use of revealed preference in prices as a superior indicator to self-reports in surveys.
February 9, 2009 at 12:01 am
Most people over time acquire the ability to distinguish trustworthy accounts from BS
Is this ability magic? If not, there must be a process to it. How is it that one does this distinguishing? How do we discover what is true?
Is this a serious question? Do you honestly not know? Do you really want me to explain to you how people learn? What's your deal?
February 9, 2009 at 12:18 am
No, I'm asking you not to beg the question. Police officers claim that they develop the ability to detect BS over time. Experiments show they actually don't. Your intuition that you can detect the truth is not necessarily reliable. In order to detect BS we must find what constitutes evidence of it. In some cases it may match up with your intuitions and in other cases it may not. I'm asking you to open up the black box and explain what it is that makes it reliable, i.e., how it is entangled with truth.
February 9, 2009 at 2:00 pm
This higher-level talk about how to detect sound processes for matching beliefs with reality seems pretty useless to me. The fact is, we should always have some reason to be skeptical of our beliefs. But I don’t see where that gets us. It’s a nice thing to know, but then we go on almost exactly as before.
The fact is, most people do not have the time to seriously engage with intellectual issues (i.e., look at *all* relevant evidence), but they still really like to have opinions. I guess you're providing a new kind of shortcut, one that focuses on the type of evidence rather than the source.
But I think your suggested hierarchy is misleading, because it mistakes form for content. A randomized experiment looks more scientific than an observational case study, but in real life we always have to generalize the results from the experiment population to another population. #1 always comes along with #6. This is not just an esoteric point. If different people have different reactions to a drug, applying the results of a randomized experiment to an actual patient requires a #6-type judgment.
February 9, 2009 at 8:33 pm
This higher-level talk about how to detect sound processes for matching beliefs with reality seems pretty useless to me
Maybe you’re right and Magic 8-Balls, LSD trips and telephone psychics are just as reliable.
The fact is, we should always have some reason to be skeptical of our beliefs
True. I think there are some systematic errors we make that we can be aware of and actions we can take to reduce error.
But I don’t see where that gets us
One thing is to have generally less confidence in our beliefs. Don’t take large gambles on what seems like a sure thing. Furthermore, we can find out particular ways we are likely to screw up. We don’t have to go on as before.
but they still really like to have opinions
Tying in with my earlier point about survey data, Philip Converse called many of them “non-opinions” because they don’t reflect anything other than the desire to give a response. We raise our status by taking a bold stand, even if we delude ourselves into thinking our culture brainwashes us into conformity. We like having opinions, but not necessarily having accurate opinions. There is something we can do other than going on as before: not having an opinion on things we don’t know enough about. Practice saying out loud to others “I am uninformed on the issue and agnostic”.
I guess you’re providing a new kind of shortcut, one that focuses on the type of evidence rather than the source
Concentrating on the source can feed into confirmation bias and affective death spirals. Ask yourself if the source is really less likely to have accurate beliefs than yourself. Ask yourself what reason the source should have for doubting their own beliefs and putting more stock in someone else’s (notably, yours). That’s part of the idea behind Aumann’s Agreement Theorem. Think of good suggestions to give over a chronophone. Because meta is max.
Does a case study not also have to be generalized? And do we not discover that different people have different reactions through randomized trials? An analogy is not the same thing as the application of a general rule. None of which is to deny that data by itself is inert and we must make use of it, and we may easily do so poorly.
February 10, 2009 at 12:44 pm
Let me clarify my objection to a hierarchy of evidence.
Say you want to figure out if it's worthwhile for the government to fight a "War on Teenage Motherhood." First, you ask your imagination, "If a teenager becomes a mom, does that screw up her life?" Second, you compare lifetime earnings for teenage moms and women who become mothers later. Third, you compare lifetime earnings for teenage moms and women with similar observable characteristics (parents' SES). Fourth, you compare lifetime earnings for teenage moms and their sisters who weren't teenage moms. Fifth, you compare lifetime earnings for teenage moms and women who became pregnant while teenagers but miscarried.
Now this is pretty straightforward; each additional step rules out possibilities that the previous steps couldn't. Hence we have a clear hierarchy of evidence, moving from gut feelings to something as close to a randomized experiment as we're going to get with this issue.
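To see why the later steps matter, here is a toy simulation (everything in it is hypothetical: the functional forms, the magnitudes, and the assumption of zero causal effect are invented purely for illustration):

```python
import math
import random

random.seed(0)

def simulate_woman():
    # Hypothetical world: family SES drives both teen motherhood and
    # lifetime earnings, while teen motherhood itself has ZERO causal
    # effect on earnings.
    ses = random.gauss(0.0, 1.0)
    # Logistic link: lower SES makes teen motherhood more likely.
    teen_mom = random.random() < 1.0 / (1.0 + math.exp(2.0 * ses))
    earnings = 50000 + 20000 * ses + random.gauss(0.0, 10000.0)
    return ses, teen_mom, earnings

women = [simulate_woman() for _ in range(100000)]

def mean(xs):
    return sum(xs) / len(xs)

# Step two: naive mom/non-mom comparison, confounded by SES; it finds
# a large earnings gap even though the true causal effect is zero.
moms = [e for s, t, e in women if t]
others = [e for s, t, e in women if not t]
print("naive gap:", round(mean(others) - mean(moms)))

# Step three: compare only women of similar SES; the gap shrinks
# toward the true causal effect (zero in this toy world).
moms_matched = [e for s, t, e in women if t and abs(s) < 0.25]
others_matched = [e for s, t, e in women if not t and abs(s) < 0.25]
print("SES-matched gap:", round(mean(others_matched) - mean(moms_matched)))
```

The naive gap comes out in the thousands of dollars even though motherhood does nothing in this world, while the matched gap is far smaller; the sister and miscarriage comparisons would shrink it further by also controlling family factors nobody measured.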
But let’s say for the sake of argument that 98% of women who become pregnant are poor (I’m making this up), and now we’re interested in whether or not it’s helpful for there to be a stigma against teenage motherhood for rich people. How does our “gold standard” experimental study shape up against, say, observational case studies of rich women’s life trajectories? I don’t see why it should get special treatment. Yeah, it’s more “scientific,” but in ways that are no longer especially relevant to the matter at hand.
If you’re judging evidence by its appearance rather than its content, then you shouldn’t have an opinion.
Moreover, I think the debates that never end tend to be ones that face off two world views, both of which explain some but not all facts well. Stuff like “Is prostitution fundamentally an exchange of cash for services, or is it a type of social relation between men and women?” is best answered by “Yes,” not by figuring out which evidence counts more. Other popular debates involve questions like “Does he have 40% of a good case, or only 10%?” Again, not really resolvable with a hierarchy of evidence.
I agree with you that people should be more reluctant to offer opinions. But getting them to stop is kind of like walking into the middle of a religious civil war and telling people to put down their weapons unless they have a good reason to want to kill their opponents. Worse, it’s a bloodless civil war in which no one seems to be actually getting hurt. Bottom line, no one will stop fighting unless everyone stops fighting.
February 10, 2009 at 12:48 pm
98% of women who become pregnant are poor
Should be pregnant as teenagers, not pregnant, of course.
February 10, 2009 at 8:50 pm
There wasn’t any example of an actual experiment there, but the one with miscarriages seems the closest thing to a “natural” one. It is still flawed in that there are physical causes of miscarriage that have a decent chance of being correlated with what we are trying to study.
If our sample size isn't large enough, it might not contain enough of the few rich pregnant teens, leaving us subject to massive sampling error. The case studies are themselves a small sample and as a result less reliable than a larger one. The introduction of stigma as a variable would require an even larger sample for us to make use of variance. We could conduct an experiment by randomly subjecting some to stigma and some not (did we want to know if stigma prevents pregnancy, or what its effects are given pregnancy?). Otherwise stigma seems so likely to be correlated with other unexamined factors that it will be hard for us to determine whether it is playing any causal role. You didn't specify enough for us to say much about your case studies.
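To put a rough number on the sampling worry (the standard large-sample formula, nothing specific to this example): the standard error of an estimated mean shrinks only with the square root of the sample size,

$$SE = \frac{\sigma}{\sqrt{n}},$$

so a case-study-sized group of 10 rich teen mothers carries a standard error ten times that of a sample of 1,000, which leaves a lot of room for chance to masquerade as signal.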
If you’re judging evidence by its appearance rather than its content, then you shouldn’t have an opinion.
So I should not give any greater credence to probability sampling than convenience sampling? Or is that not an issue of appearance, in which case what is it and what’s content?
Prostitution can take place with means other than cash and I'm shocked that you'd be so heteronormative as to exclude poofters. What's odd is your use of the word "or". Would you ask "Is Illinois a state or does it begin with the letter 'I'?"
I think one can make a case in which half of your arguments are sound and half are unsound. You can make a case that is twice as convincing as a different case or half as convincing as you anticipated (in which case you should actually adjust your belief in it downwards).
I think the debates that really never end are between world views that don’t say anything in any positive sense and so don’t explain anything. How could they be resolved?
ADDENDUM: This from Gene Expression is a pretty sweet example of a natural experiment.
May 9, 2009 at 2:32 pm
[…] Not only does he support my prejudices regarding statistics vs anecdotes, he also fits with my hierarchy of evidence regarding formal vs informal economic theorizing in Two Cheers for Formalism. As with his essay on […]
May 5, 2016 at 9:58 am
[…] talked a little about the hierarchy of the sciences and what kinds of evidence are convincing. Greg is an actual scientist who has worked in both “hard” and “soft” […]