The marshmallow test held up OK

Author

Jason Collins

Published

May 31, 2018

A common theme I see on my weekly visits to Twitter is the hordes piling onto the latest psychological study or effect that hasn’t survived a replication or meta-analysis. More often than not, the study deserves the criticism. But recently, the hordes have occasionally swung into action too quickly.

One series of tweets suggested that loss aversion had entered the replication crisis. A better description of the two papers that triggered the tweets is that they were the latest salvos in a decade-old debate about the interpretation of many loss aversion experiments. They have nothing to do with replication. (If you’re interested, the papers are here (ungated) and here. I have sympathy with parts of the arguments, and some other critiques of the concept of loss aversion. I’ll discuss these papers in a later post.)

Another set of tweets concerned a conceptual replication of the marshmallow test. Many of the comments suggested that the replication was a failure, and that the original study was rubbish. My view is that the original work has actually held up OK, although the interpretation of the result and some of the story-telling that followed the study are challenged.

First, to the original paper by Shoda, Mischel, and Peake, published in 1990 (pdf). In that study, four-year-old children were placed at a table with a bell and a pair of “reward objects”. The pair of reward objects might be one marshmallow and two marshmallows, or one pretzel and two pretzels, and so on.

The children were told that the experimenter was going to leave the room, and that if they waited until the experimenter came back, they could have their preferred reward (the two marshmallows). Otherwise, they could call the experimenter back earlier by ringing the bell, but in that case they could only have their less preferred reward (one marshmallow). (Could a truly impatient child just not ring the bell and eat all three marshmallows?) The time until the children rang the bell, up to a maximum of 15 to 20 minutes, was recorded.

The headline result was that the time to ring the bell was predictive of future achievement in the SAT. Those who delayed their gratification had higher achievement. The time waited correlated 0.57 with SAT math scores and 0.42 with SAT verbal scores.

The new paper discusses a “conceptual replication”. It doesn’t copy the experimental design and replicate it precisely, but relies on a similar experimental design and a measure of academic achievement based on a composite of age-15 reading and math scores.

The main point to emerge from this replication is that there is an association between the delay in gratification and academic achievement, but the correlation (0.28) is only half to two-thirds of that found in the original study.

Anyone familiar with the replication literature will find this reduction in correlation unsurprising. One of the headline findings from the Reproducibility Project was that effect sizes in replications were around half of those in the original studies. Small sample sizes (low experimental power) also tend to result in Type M errors, whereby the effect size is exaggerated. (The original study only had 35 children in the baseline condition for which they were able to get the later academic results.)
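A rough simulation gives a feel for how large that exaggeration can be. This is only a sketch under assumptions that are mine rather than either paper's: the true correlation is taken to be the replication's 0.28, the data are bivariate normal, each simulated study has 35 children, and "significant" means passing a two-sided 5% test based on the Fisher z approximation. The function names are made up for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_type_m(true_r=0.28, n=35, n_sims=5000, seed=1):
    """Simulate many small studies and keep only the 'significant' ones.

    true_r = 0.28 is an ASSUMPTION (the replication's estimate treated
    as the truth); n = 35 is the sample size mentioned in the post.
    Returns (power, mean |r| among significant studies, exaggeration ratio).
    """
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided 5% threshold, normal approximation
    noise_sd = math.sqrt(1 - true_r ** 2)
    significant = []
    for _ in range(n_sims):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        # y shares a fraction true_r of x, so corr(x, y) = true_r
        ys = [true_r * x + noise_sd * rng.gauss(0, 1) for x in xs]
        r = pearson_r(xs, ys)
        # Fisher z test of r against zero
        if abs(math.atanh(r)) * math.sqrt(n - 3) > z_crit:
            significant.append(r)
    power = len(significant) / n_sims
    mean_sig = sum(abs(r) for r in significant) / len(significant)
    return power, mean_sig, mean_sig / true_r


if __name__ == "__main__":
    power, mean_sig, ratio = simulate_type_m()
    print(f"power: {power:.2f}")
    print(f"mean |r| when significant: {mean_sig:.2f}")
    print(f"exaggeration ratio: {ratio:.2f}")
```

Under these assumptions a sample of 35 only reaches significance when the observed correlation is above roughly 0.33, so any significant estimate must overstate an assumed true value of 0.28.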

Shoda and friends recognised this possibility (although perhaps not the reasons for it). As they wrote in the original paper:

[G]iven the smallness of the sample, the obtained coefficients could very well exaggerate the magnitude of the true association. For example, in the diagnostic condition, the 95% confidence interval for the correlation of preschool delay time with SAT verbal score ranges from .10 to .66, and with SAT quantitative score, the confidence interval ranges from .29 to .76. The value and importance given to SAT scores in our culture make caution essential before generalizing from the present study; at the very least, further replications with other populations, cohorts, and testing conditions seem necessary next steps.
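The intervals in that passage can be approximately reconstructed with the standard Fisher z-transformation. This is a sketch, not a claim about how the original authors computed them: it assumes a sample of 35 children (the figure mentioned earlier in this post) and uses the reported correlations of 0.42 and 0.57. The function name is mine.

```python
import math
from statistics import NormalDist


def correlation_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z.

    Transform r to z = atanh(r), whose standard error is about
    1 / sqrt(n - 3), build a normal interval, then transform back.
    """
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


if __name__ == "__main__":
    n = 35  # ASSUMPTION: sample size taken from the post, not the quote
    for label, r in [("SAT verbal", 0.42), ("SAT quantitative", 0.57)]:
        lo, hi = correlation_ci(r, n)
        print(f"{label}: r = {r}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

With those inputs the calculation gives roughly .10 to .66 and .29 to .76, matching the quoted intervals. The lower bounds show how little a sample of this size pins down the correlation.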

The differences between the experiments could also be behind the difference in size of correlation. Each study used different measures of achievement. The marshmallow test in the replication had a maximum wait of only 7 minutes, compared to 15 to 20 minutes in the original (although most of the predictive power in the new study was found to be in the first 20 seconds). The replication created categories for time waited (e.g. 0 to 20 seconds, 20 seconds to 2 minutes, and so on), rather than using time as a continuous variable. It also focused on children with parents who did not have a college education - too many of the children with college-educated parents waited the full seven minutes. The original study drew its sample from the Stanford community.

Given the original authors’ notes about effect size, and the differences in study design, the original findings have held up rather well. For a simple diagnostic, the marshmallow test still has a surprising amount of predictive power. Delay of gratification at age 4 predicts later achievement. Some of the write-ups of this new work have stated that the marshmallow test may not be as strong a predictor of future outcomes as previously believed, but how strong did you actually believe it to be in the first place?

The other headline from the replication is that the predictive ability of the marshmallow test disappears with controls. That is, if you account for the children’s socioeconomic status, parental characteristics and a set of measures of cognitive and behavioural development, the marshmallow test does not provide any further information about that future achievement. It’s no surprise that controls of this nature do this. It simply suggests that the controls are better predictors. The original claim was not that the marshmallow test was the best or only predictor.
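A toy calculation shows how a raw correlation can vanish once a control is added. It uses the textbook first-order partial correlation formula with a single composite control, which is a simplification of the replication's larger set of controls. The correlations of 0.4 and 0.7 with the control are hypothetical numbers chosen for illustration, not estimates from either study.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


if __name__ == "__main__":
    r_delay_achievement = 0.28   # raw correlation reported in the replication
    r_delay_control = 0.4        # HYPOTHETICAL: delay time vs. the control
    r_achievement_control = 0.7  # HYPOTHETICAL: achievement vs. the control

    partial = partial_correlation(
        r_delay_achievement, r_delay_control, r_achievement_control
    )
    print(f"raw correlation:     {r_delay_achievement:.2f}")
    print(f"partial correlation: {partial:.2f}")
```

Because 0.4 times 0.7 equals 0.28, the control fully accounts for the raw association and the partial correlation is zero. Delay time is still a real predictor on its own; it just adds nothing once the control is known.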

What is called into question are the implications that have been drawn from the marshmallow test studies. Shoda and friends suggested that the predictive power of the test might be related to the meta-cognitive strategies that the children employed. For instance, successful children might divert themselves so that they don’t just sit and stare at the marshmallows. If that is the case, we could teach children these strategies, and they might then be better able to delay gratification and have higher achievement in life. This has been a common theme of discussion of the marshmallow test for the last 30 years.

In the replication data, most of the predictive power of the marshmallow test was found to lie in the first 20 seconds. There was not a lot of difference between the kids who waited more than 20 seconds and those who waited the full seven minutes. It is questionable whether meta-cognitive strategies come into play in those first few seconds. If not, there may be little benefit in teaching children strategies to enable them to delay gratification. It seems less a problem of developing strategies for delaying gratification, and more one of basic impulse control. To increase future achievement, broader behaviour and cognitive change might be required.