From Eliezer Yudkowsky on Less Wrong (a few years old, but worth revisiting in the light of my recent Gigerenzer v Kahneman and Tversky post):

When a single experiment seems to show that subjects are guilty of some horrifying sinful bias - such as thinking that the proposition “Bill is an accountant who plays jazz” has a higher probability than “Bill is an accountant” - people may try to dismiss (not defy) the experimental data. Most commonly, by questioning whether the subjects interpreted the experimental instructions in some unexpected fashion - perhaps they misunderstood what you meant by “more probable”.

Experiments are not beyond questioning; on the other hand, there should always exist some mountain of evidence which suffices to convince you.

Here is (probably) the single most questioned experiment in the literature of heuristics and biases, which I reproduce here exactly as it appears in Tversky and Kahneman (1982):

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

Please rank the following statements by their probability, using 1 for the most probable and 8 for the least probable:

(5.2) Linda is a teacher in elementary school.

(3.3) Linda works in a bookstore and takes Yoga classes.

(2.1) Linda is active in the feminist movement. (F)

(3.1) Linda is a psychiatric social worker.

(5.4) Linda is a member of the League of Women Voters.

(6.2) Linda is a bank teller. (T)

(6.4) Linda is an insurance salesperson.

(4.1) Linda is a bank teller and is active in the feminist movement. (T & F)

(The numbers at the start of each line are the mean ranks of each proposition, lower being more probable.)
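
As a side illustration that is not part of the original post: the conjunction rule says P(T & F) can never exceed P(T) or P(F), so a coherent ranking should never place a conjunction above (give it a lower rank number than) either of its conjuncts. A minimal Python sketch of that check, using the mean ranks listed above, might look like this. The function name and the dictionary layout are assumptions made for the example.

```python
# Mean ranks from the list above (lower rank = judged more probable).
mean_ranks = {
    "F": 2.1,    # active in the feminist movement
    "T": 6.2,    # bank teller
    "T&F": 4.1,  # bank teller and active in the feminist movement
}


def violated_conjuncts(ranks, conjunction, conjuncts):
    """Return the conjuncts that were ranked as less probable than the conjunction.

    Hypothetical helper for illustration: since P(A and B) <= P(A), a
    conjunction ranked above one of its conjuncts breaks the conjunction rule.
    """
    return [c for c in conjuncts if ranks[conjunction] < ranks[c]]


print(violated_conjuncts(mean_ranks, "T&F", ["T", "F"]))
```

Mean ranks aggregate over many subjects, so this only shows that the average ordering is incoherent with respect to (T); it says nothing about how any individual subject answered.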

How do you know that subjects did not interpret “Linda is a bank teller” to mean “Linda is a bank teller and is not active in the feminist movement”? For one thing, dear readers, I offer the observation that most bank tellers, even the ones who participated in anti-nuclear demonstrations in college, are probably not active in the feminist movement. So, even so, Teller should rank above Teller & Feminist.  …  But the researchers did not stop with this observation; instead, in Tversky and Kahneman (1983), they created a between-subjects experiment in which either the conjunction or the two conjuncts were deleted. Thus, in the between-subjects version of the experiment, each subject saw either (T&F), or (T), but not both. With a total of five propositions ranked, the mean rank of (T&F) was 3.3 and the mean rank of (T) was 4.4, N=86. Thus, the fallacy is not due solely to interpreting “Linda is a bank teller” to mean “Linda is a bank teller and not active in the feminist movement.”
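
The first observation in the paragraph above can be made concrete with a small sketch. Suppose subjects really did read (T) as "teller and not feminist". Then (T) should still outrank (T & F) whenever fewer than half of bank tellers like Linda are active feminists, that is, whenever P(F | T) < 0.5. The function name and the probabilities below are purely illustrative assumptions, not figures from the studies.

```python
def teller_split(p_feminist_given_teller, p_teller=1.0):
    """Split P(T) into P(T & not F) and P(T & F).

    p_feminist_given_teller is a hypothetical value for P(F | T); p_teller
    only rescales both parts, so it does not affect which one is larger.
    """
    p_t_and_f = p_teller * p_feminist_given_teller
    p_t_and_not_f = p_teller * (1.0 - p_feminist_given_teller)
    return p_t_and_not_f, p_t_and_f


# Illustrative values only; none of these numbers come from the studies.
for p in (0.1, 0.3, 0.49):
    not_f, f = teller_split(p)
    print(p, not_f > f)
```

Under this reading, ranking (T & F) above (T) would be coherent only for someone who believes that more than half of such bank tellers are active feminists.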

Another way of knowing whether subjects have misinterpreted an experiment is to ask the subjects directly. Also in Tversky and Kahneman (1983), a total of 103 medical internists … were given problems like the following:

A 55-year-old woman had pulmonary embolism documented angiographically 10 days after a cholecystectomy. Please rank order the following in terms of the probability that they will be among the conditions experienced by the patient (use 1 for the most likely and 6 for the least likely). Naturally, the patient could experience more than one of these conditions.

  • Dyspnea and hemiparesis

  • Calf pain

  • Pleuritic chest pain

  • Syncope and tachycardia

  • Hemiparesis

  • Hemoptysis

As Tversky and Kahneman note, “The symptoms listed for each problem included one, denoted B, that was judged by our consulting physicians to be nonrepresentative of the patient’s condition, and the conjunction of B with another highly representative symptom denoted A. In the above example of pulmonary embolism (blood clots in the lung), dyspnea (shortness of breath) is a typical symptom, whereas hemiparesis (partial paralysis) is very atypical.”

In indirect tests, the mean ranks of A&B and B respectively were 2.8 and 4.3; in direct tests, they were 2.7 and 4.6. In direct tests, subjects ranked A&B above B between 73% and 100% of the time, with an average of 91%.

The experiment was designed to eliminate, in four ways, the possibility that subjects were interpreting B to mean “only B (and not A)”. First, by carefully wording the instructions: “…the probability that they will be among the conditions experienced by the patient”, plus an explicit reminder, “the patient could experience more than one of these conditions”. Second, by including indirect tests as a comparison. Third, the researchers afterward administered a questionnaire:

In assessing the probability that the patient described has a particular symptom X, did you assume that (check one):

X is the only symptom experienced by the patient?

X is among the symptoms experienced by the patient?

60 of 62 physicians, asked this question, checked the second answer.

Fourth and finally, as Tversky and Kahneman write, “An additional group of 24 physicians, mostly residents at Stanford Hospital, participated in a group discussion in which they were confronted with their conjunction fallacies in the same questionnaire. The respondents did not defend their answers, although some references were made to ‘the nature of clinical experience.’ Most participants appeared surprised and dismayed to have made an elementary error of reasoning.”

Does the conjunction fallacy arise because subjects misinterpret what is meant by “probability”? This can be excluded by offering students bets with payoffs. In addition to the colored dice discussed yesterday, subjects have been asked which possibility they would prefer to bet $10 on in the classic Linda experiment. This did reduce the incidence of the conjunction fallacy, but only to 56% (N=60), which is still more than half the students.
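
The reason the bet settles the "what does probability mean" question is a dominance argument: a bet on (T) pays off in every possible world in which a bet on (T & F) pays off. A rough sketch of that argument follows. It assumes a simple payout of $10 if the chosen statement is true and nothing otherwise, which is an assumption made for illustration rather than the documented payoff scheme; the helper name is likewise hypothetical.

```python
from itertools import product

STAKE = 10  # dollars, as in the betting version described above


def payoff(bet, teller, feminist):
    """Payout for a bet on "T" or "T&F" in one possible world (hypothetical helper)."""
    if bet == "T":
        wins = teller
    elif bet == "T&F":
        wins = teller and feminist
    else:
        raise ValueError(f"unknown bet: {bet}")
    return STAKE if wins else 0


# Enumerate all four combinations of (is a teller, is a feminist).
worlds = list(product([True, False], repeat=2))
dominates = all(
    payoff("T", t, f) >= payoff("T&F", t, f) for t, f in worlds
)
print(dominates)
```

Because the comparison holds world by world, it does not depend on any particular probabilities: whatever a subject believes about Linda, betting on (T) is never worse than betting on (T & F).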

But the ultimate proof of the conjunction fallacy is also the most elegant. In the conventional interpretation of the Linda experiment, subjects substitute judgment of representativeness for judgment of probability: Their feelings of similarity between each of the propositions and Linda’s description determine how plausible it feels that each of the propositions is true of Linda. …

You just take another group of experimental subjects, and ask them how much each of the propositions “resembles” Linda. This was done - see Kahneman and Frederick (2002) - and the correlation between representativeness and probability was nearly perfect.  0.99, in fact.

The conjunction fallacy is probably the single most questioned bias ever introduced, which means that it now ranks among the best replicated. The conventional interpretation has been nearly absolutely nailed down.

There are a few additional experiments in Yudkowsky’s post that I have not included here.