Category: Replications

Not so easy to spot: A failure to replicate the Macbeth Effect across three continents

“Out, damned spot!” cries a guilt-ridden Lady Macbeth as she desperately washes her hands in the vain pursuit of a clear conscience. Consistent with Shakespeare’s celebrated reputation as an astute observer of the human psyche, a wealth of contemporary research has demonstrated the reality of this close link between our sense of moral purity and physical cleanliness.

One manifestation of this was nicknamed the Macbeth Effect – first documented by Chen-Bo Zhong and Katie Liljenquist in an influential paper in the high-impact journal Science in 2006 – in which feelings of moral disgust were found to provoke a desire for physical cleansing. For instance, in their second study, Zhong and Liljenquist found that US participants who hand-copied a story about an unethical deed were subsequently more likely to rate cleansing products as highly desirable.

There have been many “conceptual replications” of the Macbeth Effect. A conceptual replication is when a different research methodology supports the proposed theoretical mechanism underlying the original effect. For example, last year, Mario Gollwitzer and André Melzer found that novice video gamers showed a strong preference for hygiene products after playing a violent game.

Given the strong theoretical foundations of the Macbeth Effect, combined with several conceptual replications, University of Oxford psychologist Brian Earp and his colleagues were surprised when a pilot study of theirs failed to replicate Zhong and Liljenquist’s second study. This pilot study had been intended as the start of a new project looking to further develop our understanding of the Macbeth Effect. Rather than filing away this negative result, Earp and his colleagues were inspired to examine the robustness of the Macbeth Effect with a series of direct replications. Unlike conceptual replications, direct replications seek to mimic the methods of an original study as closely as possible.

Following best practice guidelines, Earp’s team contacted Zhong and Liljenquist, who kindly shared their original materials. Another feature of a high-quality replication is to ensure you have enough statistical power to replicate the original effect. In psychology, this usually means recruiting an adequate number of participants. Accordingly, Earp’s team recruited 153 undergrad participants – more than five times as many as took part in Zhong and Liljenquist’s second study.
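To give a feel for why sample size matters for statistical power, here is a minimal sketch in Python. It uses a textbook normal approximation for a two-group comparison of means. The effect size (d = 0.5) and the group sizes are assumptions chosen for illustration (76 per group is roughly half of 153, assuming an even split); none of these figures or function names come from Earp's paper or from the original study.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_two_groups(effect_size_d: float, n_per_group: int,
                            alpha: float = 0.05) -> float:
    """Approximate power of a two-sided comparison of two group means.

    Normal approximation; it ignores the tiny chance of a significant
    result in the wrong direction.
    """
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - alpha / 2)
    # Expected test statistic if the true standardised difference is d.
    expected_z = effect_size_d * sqrt(n_per_group / 2)
    return std_normal.cdf(expected_z - critical_z)


# Hypothetical inputs: d = 0.5 is an assumed effect size for illustration.
for n_per_group in (15, 76):
    power = approx_power_two_groups(0.5, n_per_group)
    print(f"n per group = {n_per_group}: approximate power = {power:.2f}")
```

Under these assumed numbers, the larger sample has a far better chance of detecting a true effect of that size, which is the logic behind recruiting many more participants than the original study.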

Exactly as in the original research, the British students hand-copied a story about an unethical deed (an office worker shreds a vital document needed by a colleague) or about an ethical deed (the office worker finds and saves the document for their colleague). They then rated the desirability and value of several consumer products. These were the exact same products used in the original study – including soap, toothpaste, batteries and fruit juice – except that a few brand names were changed to suit the UK as opposed to US context. Students who copied the unethical story rated the desirability and value of the various hygiene and other products just the same as the students who copied the ethical story. In other words, there was no Macbeth Effect.

It’s possible that the Macbeth Effect is a culturally specific phenomenon. Next, Earp and his team conducted a replication attempt with 156 US participants using Amazon’s Mechanical Turk survey website. The materials and methods were almost identical to the original except that participants were required to re-type and add punctuation to either the ethical or unethical version of the office worker story. Again, exposure to the unethical story made no difference to the participants’ ratings of the value or desirability of the consumer products – with just one anomaly. Participants in the unethical condition placed a higher value on toothpaste. In the context of their other findings, Earp’s team think this is likely a spurious result.

Finally, the exact same procedures were followed with an Indian sample – another culture that, like the US, places a high value on moral purity. Nearly three hundred Indian participants were recruited via Amazon’s Mechanical Turk, but again no effect of exposure to an ethical or unethical story was found on ratings of hygiene or other products.

Earp and his colleagues want to be clear – they’re not saying that there is no link between physical and moral purity, nor are they dismissing the existence of a Macbeth Effect. But they do believe their three direct, cross-cultural replication failures call for a “careful reassessment of the evidence for a real-life ‘Macbeth Effect’ within the realm of moral psychology.”

This study, due for publication next year, comes at a time when reformers in psychology are calling for more value to be placed on replication attempts and negative results. “By resisting the temptation … to bury our own non-significant findings with respect to the Macbeth Effect, we hope to have contributed a small part to the ongoing scientific process,” Earp and his colleagues concluded.


Brian D. Earp, Jim A. C. Everett, Elizabeth N. Madva, and J. Kiley Hamlin (2014). Out, damned spot: Can the “Macbeth Effect” be replicated? Basic and Applied Social Psychology, In Press.

— Further reading —
An unsuccessful conceptual replication of the Macbeth Effect was published in 2009 (pdf). Later, in 2011, another paper failed to replicate all four of Zhong and Liljenquist’s studies, although the replications may have been underpowered. 

From the Digest archive: Your conscience really can be wiped clean. Feeling clean makes us harsher moral judges.

See also: Psychologist magazine special issue on replications.

Christian Jarrett (@Psych_Writer) is Editor of BPS Research Digest

A recipe for (attempting to) replicate existing findings in psychology

Regular readers of this blog will know that social psychology has gone through a traumatic time of late. Some of its most high profile proponents have been found guilty of research fraud. And some of the field’s landmark findings have turned out to be less robust than hoped. This has led to soul searching and one proposal for strengthening the discipline is to encourage more replication attempts of existing research findings.

To this end, some journals have introduced dedicated replication article formats and pressure is building on others to follow suit. As the momentum for reform builds, an international team of psychologists has now published an open-access article outlining their “Replication Recipe” – key steps for conducting and publishing a convincing replication.

This is an important development because when high-profile findings have failed to replicate there’s been a tendency in recent times for ill-feeling and controversy to ensue. In particular, on more than one occasion the authors of the original findings have complained that the failed replication attempt was of poor quality or not suitably similar to the original methods. In fact it’s notable that one of the co-authors of this new paper is Ap Dijksterhuis who published his own tetchy response to a failed replication of his work earlier this year.

Dijksterhuis and the others, led by Mark Brandt at Tilburg University (coincidentally the institution of the disgraced psychologist Diederik Stapel), outline five main ingredients for a successful replication recipe:

1. Carefully defining the effects and methods that the researcher intends to replicate;
2. Following as exactly as possible the methods of the original study (including participant recruitment, instructions, stimuli, measures, procedures, and analyses);
3. Having high statistical power [this usually means having a large enough sample];
4. Making complete details about the replication available, so that interested experts can fully evaluate the replication attempt (or attempt another replication themselves);
5. Evaluating replication results, and comparing them critically to the results of the original study.

To help the would-be replicator, the authors have also compiled a checklist of 36 decisions that should be made, and items of information that should be collated, before the replication attempt begins. It appears as table 1 in their open-access article and they’ve made it freely available for completion and storage as a template on the Open Science Framework.

Here are some more highlights from their paper:

In deciding which past findings are worth attempting to replicate, Brandt and his team urge researchers to choose based on an effect’s theoretical importance, its value to society, and the existing confidence in the effect.

They remind readers that a perfect replication is of course impossible – replications inevitably will happen at a different time, probably in a different place, and almost certainly with different participants. While they recommend contact with the authors of the original study in order to replicate the original methodology as closely as possible, they also point out that a replication in a different time or place may actually require a change in methodology for the purpose of mimicking the context of the original. For instance, a study originally conducted in the US involving baseballs might do well to switch to cricket balls if replicated in the UK. Also bear in mind that something as seemingly innocuous as the brand of the computer monitor used to display stimuli could have a bearing on the results.

Another important point they make is that replicators should set out to measure any factors that they anticipate may cause the new findings to deviate from the original. Not only will this help achieve a successful replication but it furthers scientific understanding by establishing boundary conditions for the original effect.

On their point about having enough statistical power, Brandt and his colleagues urge replicators to err on the side of caution, towards having a larger sample size. Where calculating the necessary statistical power is tricky, they suggest a simple rule of thumb: aim for 2.5 times the sample size of the original study.
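The rule of thumb is simple arithmetic, sketched below in Python. The function name and the example sample sizes are hypothetical; the only figure taken from the text is the 2.5 multiplier. Rounding up is an assumption, made so the target never falls below the rule.

```python
from math import ceil


def rule_of_thumb_sample_size(original_n: int, multiplier: float = 2.5) -> int:
    """Target sample size for a replication: 2.5 times the original, rounded up."""
    return ceil(original_n * multiplier)


# Hypothetical original sample sizes, for illustration only.
for original_n in (27, 30, 45):
    print(original_n, "->", rule_of_thumb_sample_size(original_n))
```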

It’s not enough to say simply that a replication has failed or succeeded, Brandt and co also advise. Instead replicators should conduct two tests: first establish whether the effect of interest was statistically significant in the new study; and second, establish whether the findings from the attempted replication differ statistically from the findings of the original. A meta-analysis that combines results from the original study and the replication attempt is also recommended.
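One way to picture the two tests and the combined analysis is the sketch below. It assumes the effect is expressed as a correlation and uses the standard Fisher z-transformation with a fixed-effect, inverse-variance meta-analysis. That choice, along with every name and number in the example, is an assumption made for illustration; the paper may recommend different procedures.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def evaluate_replication(r_orig: float, n_orig: int,
                         r_rep: float, n_rep: int) -> dict:
    # Fisher z-transform; the sampling variance of each z is 1 / (n - 3).
    z_orig, z_rep = atanh(r_orig), atanh(r_rep)
    var_orig, var_rep = 1 / (n_orig - 3), 1 / (n_rep - 3)

    # Test 1: is the effect statistically significant in the replication?
    p_replication = two_sided_p(z_rep / sqrt(var_rep))

    # Test 2: does the replication differ statistically from the original?
    p_difference = two_sided_p((z_orig - z_rep) / sqrt(var_orig + var_rep))

    # Fixed-effect meta-analysis: inverse-variance weighted mean of the z values.
    w_orig, w_rep = 1 / var_orig, 1 / var_rep
    pooled_z = (w_orig * z_orig + w_rep * z_rep) / (w_orig + w_rep)
    p_pooled = two_sided_p(pooled_z * sqrt(w_orig + w_rep))

    return {
        "p_replication": p_replication,
        "p_difference": p_difference,
        "pooled_r": tanh(pooled_z),
        "p_pooled": p_pooled,
    }


# Hypothetical numbers: a small original study and a larger replication.
result = evaluate_replication(r_orig=0.45, n_orig=30, r_rep=0.02, n_rep=150)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Reporting all three quantities together avoids reducing the outcome to a bare "failed" or "succeeded" verdict.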

“By conducting high-powered replication studies of important findings we can build a cumulative science,” the authors conclude. “With our Replication Recipe, we hope to encourage more researchers to conduct convincing replications that contribute to theoretical development, confirmation and disconfirmation.”


Mark J. Brandt, Hans IJzerman, Ap Dijksterhuis, Frank J. Farach, Jason Geller, Roger Giner-Sorolla, James A. Grange, Marco Perugini, Jeffrey R. Spies, and Anna van ’t Veer (2013). The Replication Recipe: What Makes for a Convincing Replication? Journal of Experimental Social Psychology DOI: 10.1016/j.jesp.2013.10.005

Further reading
Psychologist magazine special issue on replications.

Post written by Christian Jarrett (@psych_writer) for the BPS Research Digest.

Most brain imaging papers fail to provide enough methodological detail to allow replication

Amidst recent fraud scandals in social psychology and other sciences, leading academics are calling for a greater emphasis to be placed on the replicability of research. “Replication is our best friend because it keeps us honest,” wrote the psychologists Chris Chambers and Petroc Sumner recently.

For replication to be possible, scientists need to provide sufficient methodological detail in their papers for other labs to copy their procedures. Focusing specifically on fMRI-based brain imaging research (a field that’s no stranger to controversy), University of Michigan psychology grad student Joshua Carp has reported a worrying observation – the vast majority of papers he sampled failed to provide enough methodological detail to allow other labs to replicate their work.

Carp searched the literature from 2007 to 2011 looking for open-access human studies that mentioned “fMRI” and “brain” in their abstracts. Of the 1392 papers he identified, Carp analysed a random sample of 241 brain imaging articles from 68 journals, including PLoS One, NeuroImage, PNAS, Cerebral Cortex and the Journal of Neuroscience. Where an article featured supplementary information published elsewhere, Carp considered this too.

There was huge variability in the methodological detail reported in different studies, and often the amount of detail was woeful, as Carp explains:

“Over one third of studies did not describe the number of trials, trial duration, and the range and distribution of inter-trial intervals. Fewer than half reported the number of subjects rejected from analysis; the reasons for rejection; how or whether subjects were compensated for participation; and the resolution, coverage, and slice order of functional brain images.”

Other crucial detail that was often omitted included information on correcting for slice acquisition timing, co-registering to high-resolution scans, and the modelling of temporal auto-correlations. In all, Carp looked at 179 methodological decisions. To non-specialists, some of these will sound like highly technical detail, but brain imagers know that varying these parameters can make a major difference to the results that are obtained.

One factor that non-specialists will appreciate relates to corrections made for problematic head-movements in the scanner. Only 21.6 per cent of analysed studies described the criteria for rejecting data based on head movements. Another factor that non-specialists can easily relate to is the need to correct for multiple comparisons. Of the 59 per cent of studies that reported using a formal correction technique, nearly one third failed to reveal what that technique was.
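For readers unfamiliar with the idea, here is a minimal sketch of one well-known correction, the Bonferroni method, which divides the significance threshold by the number of tests. It is offered only as an example of a formal correction technique: Carp's paper is not being quoted here, the studies he sampled may have used other methods, and the p-values below are made up.

```python
def bonferroni_significant(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Flag which tests stay significant after a Bonferroni correction."""
    # Each test is held to a stricter threshold: alpha divided by the test count.
    adjusted_alpha = alpha / len(p_values)
    return [p <= adjusted_alpha for p in p_values]


# Hypothetical p-values from four comparisons.
p_values = [0.01, 0.04, 0.03, 0.005]
print(bonferroni_significant(p_values))
```

Without knowing which correction was applied, a reader cannot tell how strict the threshold for each reported result actually was.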

“The widespread omission of these parameters from research reports, documented here, poses a serious challenge to researchers who seek to replicate and build on published studies,” Carp said.

As well as looking at the amount of methodological detail shared by brain imagers, Carp was also interested in the variety of techniques used. This is important because the more analytical techniques and parameters available for tweaking, the more risk there is of researchers trying different approaches until they hit on a significant result.
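The risk can be put in rough numbers. The sketch below computes the chance of at least one false positive when several analyses are each tested at the usual 5 per cent level. It assumes the analyses are independent, which is a simplification: different pipelines run on the same data are correlated, so the true inflation will usually be smaller. The function name is hypothetical.

```python
def chance_of_false_positive(n_analyses: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent analyses."""
    # Each analysis avoids a false positive with probability (1 - alpha).
    return 1 - (1 - alpha) ** n_analyses


for n_analyses in (1, 5, 14):
    print(n_analyses, round(chance_of_false_positive(n_analyses), 3))
```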

Carp found 207 combinations of analytical techniques (including 16 unique data analysis software packages) – that’s nearly as many different methodological approaches as studies. Although there’s no evidence that brain imagers are indulging in selective reporting, the abundance of analytical techniques and parameters is worrying. “If some methods yield more favourable results than others,” Carp said, “investigators may choose to report only the pipelines that yield favourable results, a practice known as selective analysis reporting.”

The field of medical research has adopted standardised guidelines for reporting randomised clinical trials. Carp advocates the adoption of similar standardised reporting rules for fMRI-based brain imaging research. Relevant guidelines were proposed by Russell Poldrack and colleagues in 2008, although these may now need updating.

Carp said the reporting practices he uncovered were unlikely to reflect malice or dishonesty. He thinks researchers are merely following the norms in the field. “Unfortunately,” he said, “these norms do not encourage researchers to provide enough methodological detail for the independent replication of their findings.”


Carp J (2012). The secret lives of experiments: Methods reporting in the fMRI literature. NeuroImage, 63 (1), 289-300 PMID: 22796459

–Further reading–
Psychologist magazine opinion special on replication.
An uncanny number of psychology findings manage to scrape into statistical significance.
Questionable research practices are rife in psychology, survey finds.

Post written by Christian Jarrett for the BPS Research Digest.

Do psychology findings replicate outside the lab?

Most psychology research takes place under laboratory conditions allowing tight control over the exact interventions and procedures participants are exposed to. That makes for neater science but leaves the discipline vulnerable to claims that the results aren’t relevant to real life where things are far messier. Now Gregory Mitchell at the University of Virginia has tested this very issue by poring over the literature looking for previously published meta-analyses that compared findings in the lab to the same issue addressed in a field experiment. His searches, which built on a similar 1999 study (pdf), led him to 82 meta-analyses from the last three decades, comprising 217 lab vs. field study comparisons.

Overall, Mitchell found that lab findings usually replicate in the real world (r = .71, where 1 would be a perfect match), but the devil is in the detail: some sub-disciplines in psychology fared much better than others; the size of the effects often differed greatly between lab and real world; and in a worrying number of cases, the real world results were actually in the opposite direction to the lab findings.
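To make the headline figure concrete, the sketch below shows how a correlation between paired lab and field effect sizes can be computed, and how cases with opposite signs can be counted. The data are invented for illustration, and this is a plain unweighted Pearson correlation; Mitchell's own analysis may have handled the comparisons differently.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Unweighted Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


def count_sign_reversals(lab: list[float], field: list[float]) -> int:
    """Count pairs where the field effect points the opposite way to the lab effect."""
    return sum(1 for lab_effect, field_effect in zip(lab, field)
               if lab_effect * field_effect < 0)


# Hypothetical paired effect sizes (lab result, field result).
lab_effects = [0.40, 0.25, 0.10, 0.55, 0.05]
field_effects = [0.30, 0.35, -0.05, 0.45, 0.10]

print("correlation:", round(pearson_r(lab_effects, field_effects), 2))
print("sign reversals:", count_sign_reversals(lab_effects, field_effects))
```

A high correlation can coexist with individual sign reversals, which is why the overall figure and the detail need to be read together.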

“Many small effects from the laboratory will turn out to be unreliable,” Mitchell concluded, “and a surprising number of laboratory findings may turn out to be affirmatively misleading about the nature of relations among variables outside the laboratory.”

Breaking the results down by sub-discipline, findings replicated from the lab most often in Industrial-Organisational Psychology (based on 72 comparisons) and least often in Developmental Psychology, where the three comparisons showed the average field result was actually in the opposite direction to the lab findings. The massive discrepancy in number of comparisons in these sub-disciplines makes it difficult and unfair to draw any definitive conclusions from this particular contrast. However, Social Psychology had a similar number of comparisons (80) to Industrial-Organisational Psychology, yet produced a far lower replication rate (r = .53 vs. r = .89). Mitchell said further research is needed to find out why this might be.

There were also important differences in replication rates (from lab to field study) within different psychology sub-disciplines. For example, Industrial-Organisational Psychology studies of performance evaluations translated less well from the lab compared with other topics of study in that discipline. Across subfields, lab studies of gender differences were particularly unlikely to translate to the real world. “We should recognise those domains of research that produce externally valid research,” Mitchell said, “and we should learn from those domains to improve the generalisability of laboratory research in other domains.”


Mitchell, G. (2012). Revisiting Truth or Triviality: The External Validity of Research in the Psychological Laboratory. Perspectives on Psychological Science, 7 (2), 109-117 DOI: 10.1177/1745691611432343

Further reading: Gregory Mitchell contributed to The Psychologist’s current opinion special on replication in psychology (free access).

Post written by Christian Jarrett for the BPS Research Digest.

Milgram’s obedience studies – not about obedience after all?

Stanley Milgram’s seminal experiments in the 1960s may not have been a demonstration of obedience to authority after all, a new study claims.

Milgram appalled the world when he showed the willingness of ordinary people to administer a lethal electric shock to an innocent person, simply because an experimenter ordered them to do so. Participants believed they were punishing an unsuccessful ‘learner’ in a learning task; the reality was the learner was a stooge. The conventional view is that the experiment demonstrated many people’s utter obedience to authority.

Attempts to explore the issue through replication have stalled in recent decades because of concerns the experiment could be distressing for participants. Jerry Burger at Santa Clara University found a partial solution to this problem in a 2009 study, after he realised that 79 per cent of Milgram’s participants who went beyond the 150-volt level (at which the ‘learner’ was first heard to call out in distress) subsequently went on to apply the maximum lethal shock level of 450 volts, almost as if the 150-volt level were a point of no return. Burger conducted a modern replication up to the 150-volt level and found that a similar proportion of people (70 per cent) were willing to go beyond this point as in the 1960s (82.5 per cent). Presumably, most of these participants would have gone all the way to the 450-volt level had the experiment not been stopped short.

Now Burger and his colleagues have studied the utterances made by the modern-day participants during the 2009 partial-replication, and afterwards during de-briefing. They found that participants who expressed a sense that they were responsible for their actions were the ones least likely to go beyond the crucial 150-volt level. Relevant to this is that Milgram’s participants (and Burger’s) were told, if they asked, that responsibility for any harm caused to the learner rested with the experimenter.

In contrast to the key role played by participants’ sense of responsibility, utterances betraying concern about the learner’s wellbeing were not associated with whether they went beyond the 150-volt level. Yes, participants who voiced more concerns required more prompts from the experimenter to continue, but ultimately they were just as likely to apply the crucial 150-volt shock.

However, it’s the overall negligible effect of these experimenter prompts that’s led Burger and his team to question whether Milgram’s study is really about obedience at all. In their 2009 partial-replication, Burger’s lab copied the prompts used in the seminal research, word-for-word. The first time a participant exhibited reluctance to continue, the experimenter said, ‘Please continue’. With successive signs of resistance, the experimenter’s utterances became progressively more forceful: ‘The experiment requires that you continue’; ‘It is absolutely essential that you continue’; and finally ‘You have no other choice, you must go on.’

Burger’s revelation (based on their 2009 replication) is that as the experimenter utterances became more forceful – effectively more like a command, or an order – their effectiveness dwindled. In fact, of the participants who were told ‘you have no choice, you must continue’, all chose to disobey and none reached the 150-volt level. ‘The more the experimenter’s statement resembled an order,’ the researchers said, ‘the less likely participants did what the experimenter wished.’ It would be interesting to learn if the same pattern applied during Milgram’s original studies, but those results were not reported here, perhaps because the necessary data are not available.

Burger and his colleagues said their new observation has implications for how Milgram’s studies are portrayed to students and the wider public. Their feeling is that Milgram’s results say less about obedience and rather more about our general proclivity for acting out of character in certain circumstances. ‘The point is that these uncharacteristic behaviours may not be limited to circumstances in which an authority figure gives orders,’ Burger and his team said. ‘Few of us will ever find ourselves in a situation like My Lai or Abu Ghraib. But each of us may well encounter settings that lead us to act in surprising and perhaps disturbing ways.’

Burger, J., Girgis, Z., and Manning, C. (2011). In Their Own Words: Explaining Obedience to Authority Through an Examination of Participants’ Comments. Social Psychological and Personality Science DOI: 10.1177/1948550610397632

Post written by Christian Jarrett (@psych_writer) for the BPS Research Digest.

More on Milgram:
Milgram’s personal archive reveals how he created the ‘strongest obedience situation’.
Classic 1960s obedience experiment reproduced in virtual reality.