Category: Replications

This is what happened when psychologists tried to replicate 100 previously published findings

While 97 per cent of the original results showed a statistically significant effect, this was reproduced in only 36 per cent of the replications

After some high-profile and at times acrimonious failures to replicate past landmark findings, psychology as a discipline and scientific community has led the way in trying to find out more about why some scientific findings reproduce and others don’t, including instituting reporting practices to improve the reliability of future results. Much of this endeavour is thanks to the Center for Open Science, co-founded by the University of Virginia psychologist Brian Nosek.

Today, the Center has published its latest large-scale project: an attempt by 270 psychologists to replicate findings from 100 psychology studies published in 2008 in three prestigious journals that cover cognitive and social psychology: Psychological Science, the Journal of Personality and Social Psychology, and the Journal of Experimental Psychology: Learning, Memory and Cognition.

The Reproducibility Project is designed to estimate the “reproducibility” of psychological findings and complements the Many Labs Replication Project which published its initial results last year. The new effort aimed to replicate many different prior results to try to establish the distinguishing features of replicable versus unreliable findings: in this sense it was broad and shallow and looking for general rules that apply across the fields studied. By contrast, the Many Labs Project involved many different teams all attempting to replicate a smaller number of past findings – in that sense it was narrow and deep, providing more detailed insights into specific psychological phenomena.

The headline result from the new Reproducibility Project report is that whereas 97 per cent of the original results showed a statistically significant effect, this was reproduced in only 36 per cent of the replication attempts. Some replications found the opposite effect to the one they were trying to recreate. This is despite the fact that the Project went to incredible lengths to make the replication attempts true to the original studies, including consulting with the original authors.

Just because a finding doesn’t replicate doesn’t mean the original result was false – there are many possible reasons for a replication failure, including unknown or unavoidable deviations from the original methodology. Overall, however, the results of the Project are likely indicative of the biases that researchers and journals show towards producing and publishing positive findings. For example, a survey published a few years ago revealed the questionable practices many researchers use to achieve positive results, and it’s well known that journals are less likely to publish negative results.

The Project found that studies that initially reported weaker or more surprising results were less likely to replicate. In contrast, the expertise of the original research team or replication research team was not related to the chances of replication success. Meanwhile, social psychology replications were less than half as likely to achieve a significant finding compared with cognitive psychology replication attempts, but in terms of declines in size of effect, both fields showed the same average reduction from original study to replication attempt, to less than half (cognitive psychology studies started out with larger effects and this is why more of the replications in this area retained statistical significance).

Among the studies that failed to replicate was research on loneliness increasing supernatural beliefs; conceptual fluency increasing a preference for concrete descriptions (e.g. if I prime you with the name of a city, that increases your conceptual fluency for the city, which supposedly makes you prefer concrete descriptions of that city); and research on links between people’s racial prejudice and their response times to pictures showing people from different ethnic groups alongside guns. A full list of the findings that the researchers attempted to replicate can be found on the Reproducibility Project website (as can all the data and replication analyses).

This may sound like a disappointing day for psychology, but in fact the opposite is true. Through the Reproducibility Project, psychology and psychologists are blazing a trail, helping shed light on a problem that afflicts all of science, not just psychology. The Project, which was backed by the Association for Psychological Science (publisher of the journal Psychological Science), is a model of constructive collaboration showing how original authors and the authors of replication attempts can work together to further their field. In fact, some investigators on the Project were in the position of being both an original author and a replication researcher.

“The present results suggest there is room to improve reproducibility in psychology,” the authors of the Reproducibility Project concluded. But they added: “Any temptation to interpret these results as a defeat for psychology, or science more generally, must contend with the fact that this project demonstrates science behaving as it should” – that is, being constantly sceptical of its own explanatory claims and striving for improvement. “This isn’t a pessimistic story”, added Brian Nosek in a press conference for the new results. “The project shows science demonstrating an essential quality, self-correction – a community of researchers volunteered their time to contribute to a large project for which they would receive little individual credit.”

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science

further reading
How did it feel to be part of the Reproducibility Project?
A replication tour de force
Do psychology findings replicate outside the lab?
A recipe for (attempting to) replicate existing findings in psychology
A special issue of The Psychologist on issues surrounding replication in psychology.
Serious power failure threatens the entire field of neuroscience 

Post written by Christian Jarrett (@psych_writer) for the BPS Research Digest.


The trouble with tDCS? Electrical brain stimulation may not work after all

By guest blogger Neuroskeptic

A widely-used brain stimulation technique may be less effective than previously believed.

Transcranial Direct Current Stimulation (tDCS) is an increasingly popular neuroscience tool. tDCS involves attaching electrodes to the scalp, through which a weak electrical current flows. The idea is that this current modulates the activity of the brain tissue underneath the electrode – safely and painlessly.

Outside of the neuroscience lab, tDCS is also used by hobbyists looking to boost their own brain power and a number of consumer stimulation devices are now being sold. The technique regularly makes the news, under headlines such as “Zapping your brain could help you lose weight”.

However, according to Australian neuroscientists Jared Horvath, Jason Forte and Olivia Carter, a single session of tDCS may have no detectable effect on cognitive function in most people. In a new paper published in the journal Brain Stimulation, Horvath and colleagues reviewed the published evidence on tDCS. They performed a meta-analysis of the data on how tDCS influences cognitive functions such as memory, language, and mental arithmetic.

For example, in experiments investigating language function, neuroscientists generally place the active tDCS electrode over the left frontal lobe of the volunteers. This ensures that the electrode is near to Broca’s area, a part of the brain known to be involved in language production. Then, the current is switched on and the volunteer is asked to do a linguistic task such as verbal fluency, in which the goal is to think of as many words beginning with a certain letter (say “p”) as possible within one minute. The performance of the volunteers given tDCS is compared to the performance of people given “sham” tDCS, in which the electrodes are attached but no current is applied.

Horvath et al. found that overall, there was no statistically significant difference between active and sham tDCS on any of the cognitive tasks that they examined. They say that:

Of the 59 analyses undertaken, tDCS was not found to generate a significant effect on any. Taken together, the evidence does not support the assertion that a single-session of tDCS has a reliable effect on cognitive tasks in healthy adult populations.

That seems pretty clear-cut. However, Horvath et al. acknowledge that their analysis did not include any of the studies that have been conducted on individuals with brain diseases or on the elderly, and they note that tDCS might be more effective in such cases.

What’s more, Horvath et al.’s meta-analysis didn’t utilize all of the studies on healthy people. The authors decided to only include results that had at least one published independent replication attempt. In other words, they only included studies that had measured the effects of tDCS on a given cognitive task, if more than one different research group had published papers using that technique. Even if one team of scientists had published several studies all showing that tDCS does influence some aspect of cognition, those results weren’t included unless at least one other team of researchers had published tDCS results using that same task. One hundred and seventy-six articles were excluded as a result.
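The inclusion rule described above can be sketched in a few lines of Python. This is only an illustration: the record fields (`task`, `group`), the example entries and the function name are hypothetical and are not taken from Horvath et al.'s data or analysis code.

```python
from collections import defaultdict

def tasks_with_independent_replication(studies, min_groups=2):
    """Keep only studies whose task was used by at least `min_groups` different research groups."""
    # Collect the set of distinct research groups that published on each task.
    groups_per_task = defaultdict(set)
    for study in studies:
        groups_per_task[study["task"]].add(study["group"])
    eligible = {task for task, groups in groups_per_task.items() if len(groups) >= min_groups}
    return [study for study in studies if study["task"] in eligible]

# Hypothetical records: one entry per published study.
studies = [
    {"task": "verbal fluency", "group": "Lab A"},
    {"task": "verbal fluency", "group": "Lab B"},
    {"task": "mental arithmetic", "group": "Lab C"},
    {"task": "mental arithmetic", "group": "Lab C"},
]

included = tasks_with_independent_replication(studies)
```

In this made-up example the two mental arithmetic studies are dropped because both come from the same group, however consistent their results might be.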

Horvath et al. explain their decision not to consider those studies by saying that:

We chose to exclude measures that have only been replicated by a single research group to ensure all data included in and conclusions generated by this review accurately reflect the effects of tDCS itself, rather than any unique device, protocol, or condition utilized in a single lab.

However, this is a slightly unusual restriction to use on a meta-analysis. It might be interesting to see whether including these additional studies would have changed the results.

This is the second time Horvath, Forte and Carter have published a sceptical meta-analysis of tDCS. In November last year they reviewed studies on the neurophysiological effects of tDCS and concluded that tDCS has virtually no measurable effects on brain function. So Horvath et al. seem to have comprehensively shown that tDCS essentially has no impact in healthy people, either on a biological or on a cognitive level.

However, I spoke to Dr Nick Davis, Lecturer in Psychology at Swansea University, who has published several papers about tDCS. Davis says that:

This is a really useful review, as it helps us to think about the way we talk about the effects of tDCS.

However, I believe that the way the analysis was conducted may have obscured some of the very real effects of tDCS. The authors have made a judgement about which studies can be pooled together and which studies cannot be pooled. One always has to make these kinds of decisions and I am not sure I would have made the same decisions given the same choices.

tDCS is still a developing technology. I think that with more principled methods of targeting the current flow to the desired brain area, we will see tDCS become one of the standard tools of cognitive neuroscience, just as EEG and fMRI have become.


Horvath, J., Forte, J., & Carter, O. (2015). Quantitative Review Finds No Evidence of Cognitive Effects in Healthy Populations from Single-Session Transcranial Direct Current Stimulation (tDCS) Brain Stimulation DOI: 10.1016/j.brs.2015.01.400

Post written for the BPS Research Digest by Neuroskeptic, a British neuroscientist who blogs for Discover Magazine.

further reading
It’s shocking – How the press are hyping the benefits of electrical brain stimulation
Read this before zapping your brain
Bloggers behind the blogs: Neuroskeptic

A replication tour de force

In his famous 1974 lecture, Cargo Cult Science, Richard Feynman recalls his experience of suggesting to a psychology student that she should try to repeat a previous experiment before attempting a novel one:

“She was very delighted with this new idea, and went to her professor. And his reply was, no, you cannot do that, because the experiment has already been done and you would be wasting time. This was in about 1947 or so, and it seems to have been the general policy then to not try to repeat psychological experiments, but only to change the conditions and see what happened.”

Despite the popularity of the lecture, few took his comments about lack of replication in psychology seriously – and least of all psychologists. Another 40 years would pass before psychologists turned a critical eye on just how often they bother to replicate each other’s experiments. In 2012, US psychologist Matthew Makel and colleagues surveyed the top 100 psychology journals since 1900 and estimated that for every 1000 papers published, just two sought to closely replicate a previous study. Feynman’s instincts, it seems, were spot on.

Now, after decades of the status quo, psychology is finally coming to terms with the idea that replication is a vital ingredient in the recipe of discovery. The latest issue of the journal Social Psychology reports an impressive 15 papers that attempted to replicate influential findings related to personality and social cognition. Are men really more distressed by infidelity than women? Does pleasant music influence consumer choice? Is there an automatic link between cleanliness and moral judgements?

Many supposedly ‘classic’ effects could not be found

Several phenomena replicated successfully. An influential finding by Stanley Schachter from 1951 on ‘deviation rejection’ was successfully repeated by Eric Wesselmann and colleagues. Schachter had originally found that individuals whose opinions persistently deviate from a group norm tend to be disempowered by the group and socially isolated. Wesselmann replicated the result, though finding that it was smaller than originally supposed.

On the other hand, many supposedly ‘classic’ effects could not be found. For instance, the replications found no evidence that making people feel physically warm promotes social warmth, no evidence that asking people to recall immoral behaviour makes the environment seem darker, and no evidence for the Romeo and Juliet effect.

The flagship of the special issue is the Many Labs project, a remarkable effort in which 50 psychologists located in 36 labs worldwide collaborated to replicate 13 key findings, across a sample of more than 6000 participants. Ten of the effects replicated successfully.

Adding further credibility to this enterprise, each of the studies reported in the special issue was pre-registered and peer reviewed before the authors collected data. Study pre-registration ensures that researchers adhere to the scientific method and is rapidly emerging as a vital tool for increasing the credibility and reliability of psychological science.

The entire issue is open access and well worth a read. I think Feynman would be glad to see psychology leaving the cargo cult behind and, for that, psychology can be proud too.

– Further reading: A special issue of The Psychologist on issues surrounding replication in psychology.


Klein, R., Ratliff, K., Vianello, M., Adams, Jr., R., Bahník, Š., Bernstein, M., Bocian, K., Brandt, M., Brooks, B., Brumbaugh, C., Cemalcilar, Z., Chandler, J., Cheong, W., Davis, W., Devos, T., Eisner, M., Frankowska, N., Furrow, D., Galliani, E., Hasselman, F., Hicks, J., Hovermale, J., Hunt, S., Huntsinger, J., IJzerman, H., John, M., Joy-Gaba, J., Kappes, H., Krueger, L., Kurtz, J., Levitan, C., Mallett, R., Morris, W., Nelson, A., Nier, J., Packard, G., Pilati, R., Rutchick, A., Schmidt, K., Skorinko, J., Smith, R., Steiner, T., Storbeck, J., Van Swol, L., Thompson, D., van ’t Veer, A., Vaughn, L., Vranka, M., Wichman, A., Woodzicka, J., & Nosek, B. (2014). Data from Investigating Variation in Replicability: A “Many Labs” Replication Project Journal of Open Psychology Data, 2 (1) DOI: 10.5334/

Post written for the BPS Research Digest by guest host Chris Chambers, senior research fellow in cognitive neuroscience at the School of Psychology, Cardiff University, and contributor to the Guardian psychology blog, Headquarters.

Not so easy to spot: A failure to replicate the Macbeth Effect across three continents

“Out, damned spot!” cries a guilt-ridden Lady Macbeth as she desperately washes her hands in the vain pursuit of a clear conscience. Consistent with Shakespeare’s celebrated reputation as an astute observer of the human psyche, a wealth of contemporary research findings have demonstrated the reality of this close link between our sense of moral purity and physical cleanliness.

One manifestation of this was nicknamed the Macbeth Effect – first documented by Chen-Bo Zhong and Katie Liljenquist in an influential paper in the high-impact journal Science in 2006 – in which feelings of moral disgust were found to provoke a desire for physical cleansing. For instance, in their second study, Zhong and Liljenquist found that US participants who hand-copied a story about an unethical deed were subsequently more likely to rate cleansing products as highly desirable.

There have been many “conceptual replications” of the Macbeth Effect. A conceptual replication is when a different research methodology supports the proposed theoretical mechanism underlying the original effect. For example, last year, Mario Gollwitzer and André Melzer found that novice video gamers showed a strong preference for hygiene products after playing a violent game.

Given the strong theoretical foundations of the Macbeth Effect, combined with several conceptual replications, University of Oxford psychologist Brian Earp and his colleagues were surprised when a pilot study of theirs failed to replicate Zhong and Liljenquist’s second study. This pilot study had been intended as the start of a new project looking to further develop our understanding of the Macbeth Effect. Rather than filing away this negative result, Earp and his colleagues were inspired to examine the robustness of the Macbeth Effect with a series of direct replications. Unlike conceptual replications, direct replications seek to mimic the methods of an original study as closely as possible.

Following best practice guidelines, Earp’s team contacted Zhong and Liljenquist, who kindly shared their original materials. Another feature of a high-quality replication is to ensure you have enough statistical power to replicate the original effect. In psychology, this usually means recruiting an adequate number of participants. Accordingly, Earp’s team recruited 153 undergrad participants – more than five times as many as took part in Zhong and Liljenquist’s second study.
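To make the link between sample size and statistical power concrete, here is a minimal Python sketch. It uses a normal approximation for a two-sided comparison of two equally sized groups; the effect size (d = 0.5) and the group sizes are hypothetical choices for illustration, not figures from Earp’s or Zhong and Liljenquist’s studies.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison (normal approximation)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)       # critical value for the chosen alpha
    noncentrality = d * sqrt(n_per_group / 2)  # expected test statistic if the effect is real
    # Probability that the statistic lands beyond either critical value.
    return norm.cdf(noncentrality - z_crit) + norm.cdf(-noncentrality - z_crit)

# Hypothetical standardised effect of d = 0.5 at two sample sizes.
small_sample_power = approx_power(0.5, 15)
large_sample_power = approx_power(0.5, 75)
```

Under these assumptions the larger sample has a much better chance of detecting the same underlying effect, which is why an adequately powered replication usually needs more participants than the original study.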

Exactly as in the original research, the British students hand-copied a story about an unethical deed (an office worker shreds a vital document needed by a colleague) or about an ethical deed (the office worker finds and saves the document for their colleague). They then rated the desirability and value of several consumer products. These were the exact same products used in the original study – including soap, toothpaste, batteries and fruit juice – except that a few brand names were changed to suit the UK as opposed to US context. Students who copied the unethical story rated the desirability and value of the various hygiene and other products just the same as the students who copied the ethical story. In other words, there was no Macbeth Effect.

It’s possible that the Macbeth Effect is a culturally specific phenomenon. Next, Earp and his team conducted a replication attempt with 156 US participants using Amazon’s Mechanical Turk survey website. The materials and methods were almost identical to the original except that participants were required to re-type and add punctuation to either the ethical or unethical version of the office worker story. Again, exposure to the unethical story made no difference to the participants’ ratings of the value or desirability of the consumer products – with just one anomaly. Participants in the unethical condition placed a higher value on toothpaste. In the context of their other findings, Earp’s team think this is likely a spurious result.

Finally, the exact same procedures were followed with an Indian sample – another culture that, like the US, places high value on moral purity. Nearly three hundred Indian participants were recruited via Amazon’s Mechanical Turk, but again no effect of exposure to an ethical or unethical story was found on ratings of hygiene or other products.

Earp and his colleagues want to be clear – they’re not saying that there is no link between physical and moral purity, nor are they dismissing the existence of a Macbeth Effect. But they do believe their three direct, cross-cultural replication failures call for a “careful reassessment of the evidence for a real-life ‘Macbeth Effect’ within the realm of moral psychology.”

This study, due for publication next year, comes at a time when reformers in psychology are calling for more value to be placed on replication attempts and negative results. “By resisting the temptation … to bury our own non-significant findings with respect to the Macbeth Effect, we hope to have contributed a small part to the ongoing scientific process,” Earp and his colleagues concluded.


Brian D. Earp, Jim A. C. Everett, Elizabeth N. Madva, and J. Kiley Hamlin (2014). Out, damned spot: Can the “Macbeth Effect” be replicated? Basic and Applied Social Psychology, In Press.

— Further reading —
An unsuccessful conceptual replication of the Macbeth Effect was published in 2009 (pdf). Later, in 2011, another paper failed to replicate all four of Zhong and Liljenquist’s studies, although the replications may have been underpowered. 

From the Digest archive: Your conscience really can be wiped clean; Feeling clean makes us harsher moral judges.

See also: Psychologist magazine special issue on replications.

Christian Jarrett (@Psych_Writer) is Editor of BPS Research Digest

A recipe for (attempting to) replicate existing findings in psychology

Regular readers of this blog will know that social psychology has gone through a traumatic time of late. Some of its most high profile proponents have been found guilty of research fraud. And some of the field’s landmark findings have turned out to be less robust than hoped. This has led to soul searching and one proposal for strengthening the discipline is to encourage more replication attempts of existing research findings.

To this end, some journals have introduced dedicated replication article formats and pressure is building on others to follow suit. As the momentum for reform builds, an international team of psychologists has now published an open-access article outlining their “Replication Recipe” – key steps for conducting and publishing a convincing replication.

This is an important development because when high-profile findings have failed to replicate there’s been a tendency in recent times for ill-feeling and controversy to ensue. In particular, on more than one occasion the authors of the original findings have complained that the failed replication attempt was of poor quality or not suitably similar to the original methods. In fact it’s notable that one of the co-authors of this new paper is Ap Dijksterhuis who published his own tetchy response to a failed replication of his work earlier this year.

Dijksterhuis and the others, led by Mark Brandt at Tilburg University (coincidentally the institution of the disgraced psychologist Diederik Stapel), outline five main ingredients for a successful replication recipe:

1. Carefully defining the effects and methods that the researcher intends to replicate;
2. Following as exactly as possible the methods of the original study (including participant recruitment, instructions, stimuli, measures, procedures, and analyses);
3. Having high statistical power [this usually means having a large enough sample];
4. Making complete details about the replication available, so that interested experts can fully evaluate the replication attempt (or attempt another replication themselves);
5. Evaluating replication results, and comparing them critically to the results of the original study.

To help the would-be replicator, the authors have also compiled a checklist of 36 decisions that should be made, and items of information that should be collated, before the replication attempt begins. It appears as table 1 in their open-access article and they’ve made it freely available for completion and storage as a template on the Open Science Framework.

Here are some more highlights from their paper:

In deciding which past findings are worth attempting to replicate, Brandt and his team urge researchers to choose based on an effect’s theoretical importance, its value to society, and the existing confidence in the effect.

They remind readers that a perfect replication is of course impossible – replications inevitably will happen at a different time, probably in a different place, and almost certainly with different participants. While they recommend contact with the authors of the original study in order to replicate the original methodology as closely as possible, they also point out that a replication in a different time or place may actually require a change in methodology for the purpose of mimicking the context of the original. For instance, a study originally conducted in the US involving baseballs might do well to switch to cricket balls if replicated in the UK. Also bear in mind that something as seemingly innocuous as the brand of the computer monitor used to display stimuli could have a bearing on the results.

Another important point they make is that replicators should set out to measure any factors that they anticipate may cause the new findings to deviate from the original. Not only will this help achieve a successful replication but it furthers scientific understanding by establishing boundary conditions for the original effect.

On their point about having enough statistical power, Brandt and his colleagues urge replicators to err on the side of caution, towards having a larger sample size. Where calculating the necessary statistical power is tricky, they suggest a simple rule of thumb: aim for 2.5 times the sample size of the original study.
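The rule of thumb is simple arithmetic, sketched here in Python. The function name and the original sample size of 30 are hypothetical; only the 2.5 multiplier comes from the paper as described above.

```python
from math import ceil

def rule_of_thumb_sample_size(original_n, multiplier=2.5):
    """Sample size suggested by the 2.5x rule of thumb, rounded up to a whole participant."""
    return ceil(original_n * multiplier)

# Hypothetical original study with 30 participants.
target = rule_of_thumb_sample_size(30)
```

Rounding up rather than down keeps the replication on the cautious side that the authors recommend.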

It’s not enough to say simply that a replication has failed or succeeded, Brandt and co also advise. Instead replicators should conduct two tests: first establish whether the effect of interest was statistically significant in the new study; and second, establish whether the findings from the attempted replication differ statistically from the findings of the original. A meta-analysis that combines results from the original study and the replication attempt is also recommended.
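One way to picture these three checks is the rough Python sketch below. It assumes the effect is reported as a correlation and uses the Fisher z-transform with a fixed-effect (inverse-variance) pooling step; these are common textbook choices, not necessarily the procedures Brandt and colleagues prescribe. The function names and the input numbers are made up for illustration.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

NORM = NormalDist()

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - NORM.cdf(abs(z)))

def replication_tests(r_orig, n_orig, r_rep, n_rep):
    # Fisher z-transform each correlation; its sampling variance is 1 / (n - 3).
    z_orig, z_rep = atanh(r_orig), atanh(r_rep)
    var_orig, var_rep = 1 / (n_orig - 3), 1 / (n_rep - 3)

    # Test 1: is the effect significant in the replication on its own?
    p_replication = two_sided_p(z_rep / sqrt(var_rep))

    # Test 2: does the replication effect differ from the original effect?
    p_difference = two_sided_p((z_orig - z_rep) / sqrt(var_orig + var_rep))

    # Fixed-effect meta-analysis: inverse-variance weighted mean, back-transformed to r.
    w_orig, w_rep = 1 / var_orig, 1 / var_rep
    z_pooled = (w_orig * z_orig + w_rep * z_rep) / (w_orig + w_rep)

    return {
        "p_replication": p_replication,
        "p_difference": p_difference,
        "pooled_r": tanh(z_pooled),
    }

# Hypothetical inputs: a small original study and a larger replication.
result = replication_tests(r_orig=0.45, n_orig=30, r_rep=0.05, n_rep=150)
```

With these invented inputs the replication effect is not significant on its own and also differs from the original at the .05 level, while the pooled estimate sits between the two and is pulled towards the larger study.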

“By conducting high-powered replication studies of important findings we can build a cumulative science,” the authors conclude. “With our Replication Recipe, we hope to encourage more researchers to conduct convincing replications that contribute to theoretical development, confirmation and disconfirmation.”


Mark J. Brandt, Hans IJzerman, Ap Dijksterhuis, Frank J. Farach, Jason Geller, Roger Giner-Sorolla, James A. Grange, Marco Perugini, Jeffrey R. Spies, and Anna van ‘t Veer (2013). The Replication Recipe: What Makes for a Convincing Replication? Journal of Experimental Social Psychology DOI: 10.1016/j.jesp.2013.10.005

Further reading
Psychologist magazine special issue on replications.

Post written by Christian Jarrett (@psych_writer) for the BPS Research Digest.

Most brain imaging papers fail to provide enough methodological detail to allow replication

Amidst recent fraud scandals in social psychology and other sciences, leading academics are calling for a greater emphasis to be placed on the replicability of research. “Replication is our best friend because it keeps us honest,” wrote the psychologists Chris Chambers and Petroc Sumner recently.

For replication to be possible, scientists need to provide sufficient methodological detail in their papers for other labs to copy their procedures. Focusing specifically on fMRI-based brain imaging research (a field that’s no stranger to controversy), University of Michigan psychology grad student Joshua Carp has reported a worrying observation – the vast majority of papers he sampled failed to provide enough methodological detail to allow other labs to replicate their work.

Carp searched the literature from 2007 to 2011 looking for open-access human studies that mentioned “fMRI” and “brain” in their abstracts. Of the 1392 papers he identified, Carp analysed a random sample of 241 brain imaging articles from 68 journals, including PLoS One, NeuroImage, PNAS, Cerebral Cortex and the Journal of Neuroscience. Where an article featured supplementary information published elsewhere, Carp considered this too.

There was huge variability in the methodological detail reported in different studies, and often the amount of detail was woeful, as Carp explains:

“Over one third of studies did not describe the number of trials, trial duration, and the range and distribution of inter-trial intervals. Fewer than half reported the number of subjects rejected from analysis; the reasons for rejection; how or whether subjects were compensated for participation; and the resolution, coverage, and slice order of functional brain images.”

Other crucial detail that was often omitted included information on correcting for slice acquisition timing, co-registering to high-resolution scans, and the modelling of temporal auto-correlations. In all, Carp looked at 179 methodological decisions. To non-specialists, some of these will sound like highly technical detail, but brain imagers know that varying these parameters can make a major difference to the results that are obtained.

One factor that non-specialists will appreciate relates to corrections made for problematic head-movements in the scanner. Only 21.6 per cent of analysed studies described the criteria for rejecting data based on head movements. Another factor that non-specialists can easily relate to is the need to correct for multiple comparisons. Of the 59 per cent of studies that reported using a formal correction technique, nearly one third failed to reveal what that technique was.

“The widespread omission of these parameters from research reports, documented here, poses a serious challenge to researchers who seek to replicate and build on published studies,” Carp said.

As well as looking at the amount of methodological detail shared by brain imagers, Carp was also interested in the variety of techniques used. This is important because the more analytical techniques and parameters available for tweaking, the more risk there is of researchers trying different approaches until they hit on a significant result.
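The arithmetic behind this risk can be sketched in Python. The calculation assumes there is no true effect and that each analysis is an independent test at the same threshold. Real analysis pipelines run on the same data are correlated, so the numbers are illustrative only, and the figure of 14 analyses is an arbitrary example rather than anything reported by Carp.

```python
def chance_of_false_positive(n_analyses, alpha=0.05):
    """Probability that at least one of n independent tests is significant by chance alone."""
    return 1 - (1 - alpha) ** n_analyses

# One pre-planned analysis versus trying 14 alternative analyses (hypothetical count).
single = chance_of_false_positive(1)
many = chance_of_false_positive(14)
```

Under these simplifying assumptions, a researcher who tries 14 independent analyses has a better than even chance of finding at least one "significant" result even when nothing is there.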

Carp found 207 combinations of analytical techniques (including 16 unique data analysis software packages) – that’s nearly as many different methodological approaches as studies. Although there’s no evidence that brain imagers are indulging in selective reporting, the abundance of analytical techniques and parameters is worrying. “If some methods yield more favourable results than others,” Carp said, “investigators may choose to report only the pipelines that yield favourable results, a practice known as selective analysis reporting.”

The field of medical research has adopted standardised guidelines for reporting randomised clinical trials. Carp advocates the adoption of similar standardised reporting rules for fMRI-based brain imaging research. Relevant guidelines were proposed by Russell Poldrack and colleagues in 2008, although these may now need updating.

Carp said the reporting practices he uncovered were unlikely to reflect malice or dishonesty. He thinks researchers are merely following the norms in the field. “Unfortunately,” he said, “these norms do not encourage researchers to provide enough methodological detail for the independent replication of their findings.”


Carp J (2012). The secret lives of experiments: Methods reporting in the fMRI literature. NeuroImage, 63 (1), 289-300 PMID: 22796459

–Further reading– Psychologist magazine opinion special on replication.
An uncanny number of psychology findings manage to scrape into statistical significance.
Questionable research practices are rife in psychology, survey finds.

Post written by Christian Jarrett for the BPS Research Digest.

Do psychology findings replicate outside the lab?

Most psychology research takes place under laboratory conditions allowing tight control over the exact interventions and procedures participants are exposed to. That makes for neater science but leaves the discipline vulnerable to claims that the results aren’t relevant to real life where things are far messier. Now Gregory Mitchell at the University of Virginia has tested this very issue by poring over the literature looking for previously published meta-analyses that compared findings in the lab to the same issue addressed in a field experiment. His searches, which built on a similar 1999 study, led him to 82 meta-analyses from the last three decades, comprising 217 lab vs. field study comparisons.

Overall, Mitchell found that lab findings usually replicate in the real world (r = .71, where 1 would be a perfect match), but the devil is in the detail: some sub-disciplines in psychology fared much better than others; the size of the effects often differed greatly between lab and real world; and in a worrying number of cases, the real world results were actually in the opposite direction to the lab findings.

“Many small effects from the laboratory will turn out to be unreliable,” Mitchell concluded, “and a surprising number of laboratory findings may turn out to be affirmatively misleading about the nature of relations among variables outside the laboratory.”

Breaking the results down by sub-discipline, findings replicated from the lab most often in Industrial-Organisational Psychology (based on 72 comparisons) and least often in Developmental Psychology, where the three comparisons showed the average field result was actually in the opposite direction to the lab findings. The massive discrepancy in the number of comparisons in these sub-disciplines makes it difficult and unfair to draw any definitive conclusions from this particular contrast. However, Social Psychology had a similar number of comparisons (80) to Industrial-Organisational Psychology, yet produced a far lower replication rate (r = .53 vs. r = .89). Mitchell said further research is needed to find out why this might be.

There were also important differences in replication rates (from lab to field study) within different psychology sub-disciplines. For example, Industrial-Organisational Psychology studies of performance evaluations translated less well from the lab than other topics of study in that discipline. Across subfields, lab studies of gender differences were particularly unlikely to translate to the real world. “We should recognise those domains of research that produce externally valid research,” Mitchell said, “and we should learn from those domains to improve the generalisability of laboratory research in other domains.”


Mitchell, G. (2012). Revisiting Truth or Triviality: The External Validity of Research in the Psychological Laboratory. Perspectives on Psychological Science, 7 (2), 109-117 DOI: 10.1177/1745691611432343

Further reading: Gregory Mitchell contributed to The Psychologist’s current opinion special on replication in psychology (free access).

Post written by Christian Jarrett for the BPS Research Digest.

Milgram’s obedience studies – not about obedience after all?

Stanley Milgram’s seminal experiments in the 1960s may not have been a demonstration of obedience to authority after all, a new study claims.

Milgram appalled the world when he showed the willingness of ordinary people to administer a lethal electric shock to an innocent person, simply because an experimenter ordered them to do so. Participants believed they were punishing an unsuccessful ‘learner’ in a learning task; the reality was the learner was a stooge. The conventional view is that the experiment demonstrated many people’s utter obedience to authority.

Attempts to explore the issue through replication have stalled in recent decades because of concerns the experiment could be distressing for participants. Jerry Burger at Santa Clara University found a partial solution to this problem in a 2009 study, after he realised that 79 per cent of Milgram’s participants who went beyond the 150-volt level (at which the ‘learner’ was first heard to call out in distress) subsequently went on to apply the maximum lethal shock level of 450 volts, almost as if the 150-volt level were a point of no return. Burger conducted a modern replication up to the 150-volt level and found that the proportion of people willing to go beyond this point (70 per cent) was similar to the proportion willing to do so in the 1960s (82.5 per cent). Presumably, most of these participants would have gone all the way to the 450-volt level had the experiment not been stopped short.

Now Burger and his colleagues have studied the utterances made by the modern-day participants during the 2009 partial-replication, and afterwards during de-briefing. They found that participants who expressed a sense that they were responsible for their actions were the ones least likely to go beyond the crucial 150-volt level. Relevant to this is that Milgram’s participants (and Burger’s) were told, if they asked, that responsibility for any harm caused to the learner rested with the experimenter.

In contrast to the key role played by participants’ sense of responsibility, utterances betraying concern about the learner’s wellbeing were not associated with whether they went beyond the 150-volt level. Yes, participants who voiced more concerns required more prompts from the experimenter to continue, but ultimately they were just as likely to apply the crucial 150-volt shock.

However, it’s the overall negligible effect of these experimenter prompts that’s led Burger and his team to question whether Milgram’s study is really about obedience at all. In their 2009 partial-replication, Burger’s lab copied the prompts used in the seminal research, word-for-word. The first time a participant exhibited reluctance to continue, the experimenter said, ‘Please continue’. With successive signs of resistance, the experimenter’s utterances became progressively more forceful: ‘The experiment requires that you continue’; ‘It is absolutely essential that you continue’; and finally ‘You have no other choice, you must go on.’

The revelation from Burger’s team (based on their 2009 replication) is that as the experimenter utterances became more forceful – effectively more like a command, or an order – their effectiveness dwindled. In fact, of the participants who were told ‘you have no choice, you must continue’, all chose to disobey and none reached the 150-volt level. ‘The more the experimenter’s statement resembled an order,’ the researchers said, ‘the less likely participants did what the experimenter wished.’ It would be interesting to learn if the same pattern applied during Milgram’s original studies, but those results were not reported here, perhaps because the necessary data are not available.

Burger and his colleagues said their new observation has implications for how Milgram’s studies are portrayed to students and the wider public. Their feeling is that Milgram’s results say less about obedience and rather more about our general proclivity for acting out of character in certain circumstances. ‘The point is that these uncharacteristic behaviours may not be limited to circumstances in which an authority figure gives orders,’ Burger and his team said. ‘Few of us will ever find ourselves in a situation like My Lai or Abu Ghraib. But each of us may well encounter settings that lead us to act in surprising and perhaps disturbing ways.’

Burger, J., Girgis, Z., and Manning, C. (2011). In Their Own Words: Explaining Obedience to Authority Through an Examination of Participants’ Comments. Social Psychological and Personality Science DOI: 10.1177/1948550610397632

Post written by Christian Jarrett (@psych_writer) for the BPS Research Digest.

More on Milgram:
Milgram’s personal archive reveals how he created the ‘strongest obedience situation’.
Classic 1960s obedience experiment reproduced in virtual reality.