Category: Methodological

How burnt-out students could be skewing psychology research

It’s well known that psychology research relies too heavily on student volunteers. So many findings are assumed to apply to people in general, when they could be a quirk unique to undergrads. Now Michael Nicholls and his colleagues have drawn attention to another problem with relying on student participants – those who volunteer late in their university term or semester lack motivation and tend to perform worse than those who volunteer early.

A little background about student research participants. Psychology students often volunteer for numerous studies throughout a semester. Usually, they’re compelled to do this at least once in return for course credits that count towards their degree. Other times they receive cash or other forms of compensation. When in the semester they opt to volunteer for course credit is usually down to their discretion. To over-generalise, conscientious students tend to volunteer early in the semester, whereas less disciplined students leave it until the last minute, when time is short and deadlines are pressing.

Nicholls’ team first recruited 40 student participants (18 men) at Flinders University during the third week of a 14-week semester. Half of them were first years who’d chosen to volunteer early in return for course credits. The other half, who hailed from various year groups, had chosen to receive $10 compensation instead. The challenge for both groups of students was the same: to perform 360 trials of a sustained attention task. On each trial they had to press a button as fast as possible if they saw any number between 1 and 9, except for the number 3, in which case they were to withhold their response.
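The task described here resembles the Sustained Attention to Response Task (SART), a go/no-go design. The post does not say exactly how speed and accuracy were scored, so what follows is only a minimal Python sketch under assumed scoring rules: a press on a 3 counts as a commission error, a missed press on any other digit counts as an omission error, and speed is summarised as the mean and standard deviation of go-trial response times. The `Trial` type and `score_trials` function are hypothetical names, not taken from the study.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import List, Optional

NO_GO_DIGIT = 3  # participants had to withhold their response to a 3

@dataclass
class Trial:
    digit: int              # digit shown on this trial (1 to 9)
    rt_ms: Optional[float]  # response time in milliseconds, or None if no press

def score_trials(trials: List[Trial]) -> dict:
    """Summarise one participant's go/no-go performance (assumed scoring rules)."""
    go = [t for t in trials if t.digit != NO_GO_DIGIT]
    no_go = [t for t in trials if t.digit == NO_GO_DIGIT]
    go_rts = [t.rt_ms for t in go if t.rt_ms is not None]
    return {
        # Commission error: pressing the button when a 3 appeared.
        "commission_errors": sum(1 for t in no_go if t.rt_ms is not None),
        # Omission error: failing to press for any other digit.
        "omission_errors": sum(1 for t in go if t.rt_ms is None),
        "mean_rt_ms": mean(go_rts) if go_rts else None,
        # Spread of go-trial response times; stdev needs at least two values.
        "sd_rt_ms": stdev(go_rts) if len(go_rts) >= 2 else None,
    }

# Hypothetical five-trial example (a real session in the study had 360 trials).
example = [Trial(1, 300.0), Trial(3, 280.0), Trial(5, None), Trial(7, 340.0), Trial(3, None)]
print(score_trials(example))
```

The standard deviation of response times is one simple way to express the kind of performance variability the researchers reported; the study itself may have used other measures.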

At this early stage of the semester there was no difference in the performance (based on speed and accuracy) of the students who volunteered for course credit or for money. There was also no difference in their motivation levels, as revealed in a questionnaire.

Later in the semester, between weeks 9 and 12, the researchers repeated the exercise with 20 more students who’d enrolled for course credit and 20 more who’d applied to participate in return for cash. Now the researchers found a difference between the groups: those receiving payment outperformed those who had volunteered in return for course credit. The late course-credit group also showed more variability in their performance than their course-credit counterparts had at the start of the semester, and they reported lower motivation.

These results suggest that students who wait to volunteer for course credit until late in the semester lack motivation and their performance suffers as a result. Nicholls and his colleagues explained that their findings have serious implications for experimental design. “A lack of motivation and/or poorer performance may introduce noise into the data and obscure effects that may have been significant otherwise. Such effects become particularly problematic when experiments are conducted at different times of semester and the results are compared.”

One possible solution for researchers planning to compare findings across experiments conducted at different points in the semester is to test only paid participants. Unlike participants who volunteer for course credit, those who are paid seem to show consistent performance and motivation across the semester.


Nicholls, M., Loveless, K., Thomas, N., Loetscher, T., & Churches, O. (2014). Some participants may be better than others: Sustained attention and motivation are higher early in semester. The Quarterly Journal of Experimental Psychology, 1-19. DOI: 10.1080/17470218.2014.925481

Further reading
The use and abuse of student participants
Improving the student participant experience

Post written by Christian Jarrett (@psych_writer) for the BPS Research Digest.

A replication tour de force

In his famous 1974 lecture, Cargo Cult Science, Richard Feynman recalls his experience of suggesting to a psychology student that she should try to repeat a previous experiment before attempting a novel one:

“She was very delighted with this new idea, and went to her professor. And his reply was, no, you cannot do that, because the experiment has already been done and you would be wasting time. This was in about 1947 or so, and it seems to have been the general policy then to not try to repeat psychological experiments, but only to change the conditions and see what happened.”

Despite the popularity of the lecture, few took his comments about the lack of replication in psychology seriously, least of all psychologists. Another 40 years would pass before psychologists turned a critical eye on just how often they bother to replicate each other’s experiments. In 2012, US psychologist Matthew Makel and colleagues surveyed the top 100 psychology journals since 1900 and estimated that for every 1000 papers published, just two sought to closely replicate a previous study. Feynman’s instincts, it seems, were spot on.

Now, after decades of the status quo, psychology is finally coming to terms with the idea that replication is a vital ingredient in the recipe of discovery. The latest issue of the journal Social Psychology reports an impressive 15 papers that attempted to replicate influential findings related to personality and social cognition. Are men really more distressed by infidelity than women? Does pleasant music influence consumer choice? Is there an automatic link between cleanliness and moral judgements?

Several phenomena replicated successfully. An influential 1951 finding by Stanley Schachter on ‘deviation rejection’ was successfully repeated by Eric Wesselmann and colleagues. Schachter had originally found that individuals whose opinions persistently deviate from a group norm tend to be disempowered by the group and socially isolated. Wesselmann’s team replicated the result, though they found the effect was smaller than originally supposed.

On the other hand, many supposedly ‘classic’ effects could not be found. For instance, the replications found no evidence that making people feel physically warm promotes social warmth, that asking people to recall immoral behaviour makes the environment seem darker, or that parental interference intensifies romantic love (the so-called Romeo and Juliet effect).

The flagship of the special issue is the Many Labs project, a remarkable effort in which 50 psychologists located in 36 labs worldwide collaborated to replicate 13 key findings, across a sample of more than 6000 participants. Ten of the effects replicated successfully.

Adding further credibility to this enterprise, each of the studies reported in the special issue was pre-registered and peer reviewed before the authors collected data. Pre-registration commits researchers to their hypotheses, methods and analysis plans before they see the results, which makes it much harder to tweak analyses after the fact or quietly drop unwelcome outcomes. It is rapidly emerging as a vital tool for increasing the credibility and reliability of psychological science.

The entire issue is open access and well worth a read. I think Feynman would be glad to see psychology leaving the cargo cult behind, and psychology can be proud of that too.

– Further reading: A special issue of The Psychologist on issues surrounding replication in psychology.


Klein, R., Ratliff, K., Vianello, M., Adams, Jr., R., Bahník, Bernstein, M., Bocian, K., Brandt, M., Brooks, B., Brumbaugh, C., Cemalcilar, Z., Chandler, J., Cheong, W., Davis, W., Devos, T., Eisner, M., Frankowska, N., Furrow, D., Galliani, E., Hasselman, F., Hicks, J., Hovermale, J., Hunt, S., Huntsinger, J., IJzerman, H., John, M., Joy-Gaba, J., Kappes, H., Krueger, L., Kurtz, J., Levitan, C., Mallett, R., Morris, W., Nelson, A., Nier, J., Packard, G., Pilati, R., Rutchick, A., Schmidt, K., Skorinko, J., Smith, R., Steiner, T., Storbeck, J., Van Swol, L., Thompson, D., van ’t Veer, A., Vaughn, L., Vranka, M., Wichman, A., Woodzicka, J., & Nosek, B. (2014). Data from Investigating Variation in Replicability: A “Many Labs” Replication Project. Journal of Open Psychology Data, 2 (1). DOI: 10.5334/

Post written for the BPS Research Digest by guest host Chris Chambers, senior research fellow in cognitive neuroscience at the School of Psychology, Cardiff University, and contributor to the Guardian psychology blog, Headquarters.

Antidepressant brain stimulation: Promising signs or continuing doubts?

Depression is a growing public health concern, affecting 1 in 9 people at some point in their lives, and with a third of sufferers experiencing little or no benefit from medication. The World Health Organization predicts that by 2020 depression will become the second leading cause of disability worldwide. By 2026 it is expected to afflict nearly 1.5 million people in the UK, costing the economy more than £12bn every year.

Faced with this crisis, scientists have looked for alternatives to medication. Since the mid-1990s there has been steady interest in developing brain stimulation methods as antidepressant treatments, particularly for patients who are resistant to drug therapy. The general logic of this approach is that because depression is associated with abnormally low activity in the left prefrontal cortex, methods that increase prefrontal activity, such as transcranial magnetic stimulation (TMS), might help promote recovery.

A new Taiwanese study now reports that a particularly potent form of transcranial magnetic stimulation called theta burst stimulation could lead to benefits in treatment-resistant depression. Cheng-Ta Li and colleagues compared the efficacy of three different types of theta burst stimulation: a protocol believed to increase activity in the left prefrontal cortex, one that reduces activity in the right prefrontal cortex, and a combined protocol that seeks to achieve both in the same treatment session. Compared with sham (placebo) stimulation, the team found that two weeks of daily treatment using the combined protocol was most effective, reducing self-ratings of depression by about 35 per cent.

These results are promising but preliminary. The sample size was small, with just 15 patients per group, and the trial was not preregistered. Such limitations are common in a literature that is dominated by controversy and small exploratory reports. A major 2007 study, which concluded that TMS is clinically effective (and which led to the treatment being approved by the FDA), was later criticised for selectively reporting positive outcomes, deviating from its registered analysis protocol, and being contaminated by uncontrolled placebo effects. The most recent review of the evidence concluded that the benefits of TMS, while statistically measurable, are so small as to be clinically insignificant. And as to how these benefits of TMS arise in the first place – well, the truth is we have almost no idea. Our best guess is ‘because dopamine’.

These uncertainties, in turn, raise concerns about ethics and regulation. With a growing number of companies offering TMS as a private healthcare intervention, and with standard treatments running into thousands of pounds, the fact that its efficacy remains unproven and unexplained is especially pertinent.

Notwithstanding these issues, this latest study by Li and colleagues is a helpful addition to the literature and suggests that more potent neurological interventions, such as theta burst stimulation, have potential. But realising that potential will require a commitment to rigorous and unbiased research practices. We need meticulously preregistered studies to prevent negative findings from being censored and to ensure that authors don’t cherry-pick analyses that ‘work’ or engage in other questionable research practices. We need studies with larger samples to identify which individual differences predict the efficacy of TMS and to generate reproducible effects. And we need a renewed focus on understanding the neurobiology underlying any benefits of TMS.

Once these challenges are met, brain stimulation may well provide a complementary treatment for depression. For now, though, the jury is out.


Li CT, Chen MH, Juan CH, Huang HH, Chen LF, Hsieh JC, Tu PC, Bai YM, Tsai SJ, Lee YC, & Su TP (2014). Efficacy of prefrontal theta-burst stimulation in refractory depression: a randomized sham-controlled study. Brain: A Journal of Neurology. PMID: 24817188

Post written for the BPS Research Digest by guest host Chris Chambers, senior research fellow in cognitive neuroscience at the School of Psychology, Cardiff University, and contributor to the Guardian psychology blog, Headquarters.

‘I’ll have what she’s having’ – Developing the Faking Orgasm Scale for Women

Over 65 per cent of women are believed to have done it at least once in their lives. Magazines, TV shows and self-help books all talk about it. It features in one of the most memorable movie scenes ever. What am I talking about? Faking orgasm, of course.

I’ve done it. I wonder if you have too?

Let’s distract ourselves from this potentially awkward moment with a study by Cooper and colleagues, who have created the Faking Orgasm Scale.

When I saw this paper I bristled at the thought of yet another tool to over-diagnose our sexual lives. Really, does it matter if we fake? Doesn’t this surveillance reduce trust and put more pressure on people?

But I liked their discussion of what orgasm might be, why pressure to ‘achieve’ ‘mind-blowing orgasms’ exists in Western culture, and who perpetuates it. (Clue: it’s not just the media; the pharmaceutical industry’s medicalisation of women’s sexual problems also doesn’t help.)

In a two-stage study, respondents (all heterosexual female college students majoring in psychology) were asked when, why and how they faked orgasm. The researchers then distilled their answers into four categories:

Altruistic Deceit (e.g. faked orgasm to make a partner happy or prevent them feeling guilty)
Fear and Insecurity (e.g. faked orgasm because they felt ashamed they couldn’t experience orgasm)
Elevated Arousal (e.g. faked an orgasm to get more turned on or increase the intensity of the experience)
Sexual Adjournment (e.g. faked orgasm to end a sexual encounter because of tiredness, a lack of enjoyment, etc.)

We tend to view faking orgasm as manipulative, whereas this research suggests that it could well play a positive role in increasing arousal. I could see an additional measure of distress being useful here to identify whether the faking was something done pleasurably to enhance sex, or an indication of other sexual or relationship problems where education or therapy might be of benefit.

A 2008 article in The Psychologist also considered orgasm

Wait! I’m sure you’ve already spotted that these participants might be a bit WEIRD [Western, Educated, Industrialised, Rich, Democratic], so how useful is this study? The authors are up front about their research being limited by its use of a volunteer student sample, and because of this I think the Faking Orgasm Scale is better described as a tool in development than as an established measure.

For that to happen, the scale would need further research with bi and lesbian women, trans women, women in long-term relationships, and women who are not US psychology majors. It could also broaden to sexual experiences beyond penis-in-vagina intercourse and oral sex (respondents were required to have tried both of these activities and to have faked orgasm during them).

The researchers note ‘faking orgasm… seems to have been overlooked almost entirely as a male sexual practice’ – something that future research could certainly benefit from, not least because existing qualitative work indicates faking orgasm is not unique to women and may be equally prevalent in men.

I can see therapists, researchers and healthcare providers welcoming a tool that might encourage us to open up about our sexual experiences. I could also see some practitioners taking issue with the idea of quantifying complex behaviour, and with the notions of authenticity in sex that such a measure implies. Me? I’d welcome anything that might allow us to talk more openly about orgasm so as to resist or reinvent the representations of perfectible sex we’re currently encouraged to aspire to.

– Further reading from The Psychologist – Orgasm.

Cooper EB, Fenigstein A, & Fauber RL (2014). The faking orgasm scale for women: psychometric properties. Archives of sexual behavior, 43 (3), 423-35 PMID: 24346866

Post written for the BPS Research Digest by guest host Petra Boynton, Senior Lecturer in International Primary Care Research, University College London and the Telegraph’s Agony Aunt.

Kidding ourselves on educational interventions?

Journals, especially high-impact ones, are notorious for their lack of interest in publishing replication studies, particularly those that fail to obtain an interesting result. A recent paper by Lorna Halliday emphasises just how important replications are, especially when they involve interventions that promise to help children’s development.

Halliday focused on a commercially available program called Phonomena, which was developed to improve children’s ability to distinguish between speech sounds, a skill thought to be important both for learning to read and for learning English as a second language. An initial study reported by Moore et al. in 2005 gave promising results: a group of 18 children who were trained for 6 hours using Phonomena showed improvements on tests of phonological awareness from pre- to post-training, whereas 12 untrained children did not.

In a subsequent study, however, Halliday and colleagues, using similar methods and stimuli, failed to replicate the positive results of Moore’s group. Although children showed some learning of the contrasts they were trained on, this did not generalise to tests of phonology or language administered before and after training. Rather than simply leaving us with this disappointing result, Halliday decided to investigate possible reasons for the lack of replication. Her analysis should be required reading for anyone contemplating an intervention study, because it reveals a number of apparently trivial factors that appear to play a role in determining results.

The different results could not easily be accounted for by differences between the samples of children, who had very similar pre-training scores. In terms of statistical power, the Halliday sample was larger, so it should have had a better chance of detecting a true effect if one existed. There were some procedural differences in the training methods used in the two studies, but these led to better learning of the trained contrasts in the Halliday study, so we might have expected more transfer to the phonological awareness tests, when in fact the opposite was the case.

Halliday notes a number of factors that did differ between the studies and which may have been key. First, the Halliday study randomly assigned individual children to training groups, whereas the Moore study gave training to one tutor group and used the other as the control group. This is potentially problematic because children in the two tutor groups will have been exposed to different teaching during the training interval. Second, both the experimenter and the children themselves knew which group was which in the Moore study. In the Halliday study, in contrast, two additional control groups were included that also underwent training. This avoids the ‘placebo’ effects that can occur if children are motivated by the experience of training, or if they improve because of greater familiarity with the experimenter. Ideally, in a study like this, the experimenter should be blind to each child’s group. This was not the case in either study, leaving both open to possible experimenter bias, but Halliday noted that in her study the experimenter did not know each child’s pre-test score, whereas in the Moore study the experimenter was aware of this information.
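To make the two design features concrete, here is a minimal Python sketch of individual-level random assignment combined with neutral participant codes, so that whoever runs the tests sees a code rather than a group label. The function name, group labels and coding scheme are hypothetical illustrations, not the procedure used in either study.

```python
import random

def assign_and_blind(child_ids, groups, seed):
    """Randomly assign each child to one of `groups` in near-equal numbers.

    Returns (allocation_key, tester_codes). The key, which links children to
    groups, would be held by someone other than the tester; the tester sees
    only the neutral codes, which carry no group information.
    """
    ids = list(child_ids)
    rng = random.Random(seed)   # seeded so the allocation can be audited later
    shuffled = ids[:]
    rng.shuffle(shuffled)       # randomise individuals, not whole tutor groups
    allocation_key = {cid: groups[i % len(groups)] for i, cid in enumerate(shuffled)}
    # Codes follow the original ID order, so they reveal nothing about group.
    tester_codes = {cid: f"P{n:03d}" for n, cid in enumerate(ids, start=1)}
    return allocation_key, tester_codes

# Hypothetical example: 12 children and three hypothetical group labels.
key, codes = assign_and_blind(range(1, 13), ["trained", "active_control", "untrained"], seed=2014)
```

Assigning individuals rather than intact classes spreads differences in classroom teaching across all groups instead of confounding them with the intervention.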

Drilling down to the raw data, Halliday noted an important contrast between the two studies. In the Moore study, the untrained control group showed little gain on the outcome measures of phonological processing, whereas in her study the controls showed significant improvements on two of the three measures. This might have been because the controls in the Halliday study were rewarded for participation, had regular contact with the experimenters throughout the study, and were tested at the end of the study by someone who was blind to their prior test score.

There has been much debate about the feasibility and appropriateness of using Randomised Controlled Trial methodology in educational settings. Such studies are hard to do, but their stringent methods have evolved for very good reasons: unless we carefully control all aspects of a study, it is easy to kid ourselves that an intervention has a beneficial effect, when in fact, a control group given similar treatment but without the key intervention component may do just as well.
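The underlying arithmetic is simple: what matters is not the trained group’s pre-to-post gain on its own, but how much larger it is than the gain in a comparable control group. A minimal Python sketch with made-up scores (the helper names are hypothetical, and the numbers are not data from either study):

```python
from statistics import mean

def mean_gain(pre, post):
    """Average pre-to-post change for one group, with scores paired by child."""
    if len(pre) != len(post):
        raise ValueError("pre and post scores must be paired")
    return mean(after - before for before, after in zip(pre, post))

def gain_over_control(trained_pre, trained_post, control_pre, control_post):
    """Improvement in the trained group beyond the control group's improvement."""
    return mean_gain(trained_pre, trained_post) - mean_gain(control_pre, control_post)

# Made-up scores, for illustration only.
trained_pre, trained_post = [10, 12, 11], [14, 15, 13]
control_pre, control_post = [10, 11, 12], [13, 14, 15]
print(mean_gain(trained_pre, trained_post))  # 3: looks like a clear improvement
print(gain_over_control(trained_pre, trained_post, control_pre, control_post))  # 0
```

Here the trained group improves by 3 points on average, but so does the control group, so the estimated benefit of the intervention itself is zero.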

Halliday LF (2014). A Tale of Two Studies on Auditory Training in Children: A Response to the Claim that ‘Discrimination Training of Phonemic Contrasts Enhances Phonological Processing in Mainstream School Children’ by Moore, Rosenberg and Coleman (2005). Dyslexia (Chichester, England) PMID: 24470350

Post written for the BPS Research Digest by guest host Dorothy Bishop, Professor of Developmental Neuropsychology and a Wellcome Principal Research Fellow at the Department of Experimental Psychology in Oxford, Adjunct Professor at The University of Western Australia, Perth, and a runner up in the 2012 UK Science Blogging Prize for BishopBlog.

There are 636,120 ways to have post traumatic stress disorder

The latest version of the American Psychiatric Association’s (APA) controversial diagnostic code – “the DSM-5” – continues the check-list approach used in previous editions. To receive a specific diagnosis, a patient must exhibit a minimum number of symptoms in different categories. One problem – this implies someone either has a mental illness or they don’t.

To avoid missing people who ought to be diagnosed, over time the criteria for many conditions have expanded, and nowhere is this more apparent than in the case of post traumatic stress disorder (PTSD). Indeed, in their new analysis of the latest expanded diagnostic criteria for PTSD, Isaac Galatzer-Levy and Richard Bryant calculate that there are now 636,120 ways to be diagnosed with PTSD based on all the possible combinations of symptoms that would fulfil a diagnosis for this condition.

First defined as a distinct disorder in 1980, PTSD was for many years diagnosed based on a patient exhibiting a sufficient number of symptoms in three categories: re-experiencing symptoms (e.g. flashbacks); avoidance and numbing symptoms (e.g. diminished interest in activities); and arousal symptoms (e.g. insomnia). For the latest version of the DSM, a new symptom category was introduced: alterations in mood and cognition (e.g. increased shame). This means a diagnosis of PTSD now requires a minimum of 6 of 20 possible symptoms across four categories (or criteria), with at least one re-experiencing (intrusion) symptom, at least one avoidance symptom, at least two alterations in mood and cognition, and at least two arousal symptoms, so long as these appear after the patient witnessed or experienced an event involving actual or threatened harm.

Counting every combination of symptoms that satisfies these criteria, Galatzer-Levy and Bryant arrive at their figure of 636,120 ways to be diagnosed with PTSD. This compares with 79,794 ways based on DSM-IV – the previous version of the APA’s diagnostic code. The net has not widened in this fashion for all conditions: the criteria for panic disorder, for example, have tightened (there were 54,698 “ways” to be diagnosed with panic disorder in DSM-IV, compared with 23,442 in DSM-5).
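The headline PTSD figures can be reproduced with a short combinatorial count, assuming each symptom cluster is satisfied independently and that the total is the product of the number of qualifying symptom subsets in each cluster. The cluster sizes and minimums below are taken from the DSM-5 PTSD criteria (5 intrusion symptoms, at least 1; 2 avoidance, at least 1; 7 mood and cognition, at least 2; 6 arousal, at least 2) and the DSM-IV criteria (5 re-experiencing, at least 1; 7 avoidance and numbing, at least 3; 5 arousal, at least 2). The sketch ignores non-symptom requirements such as trauma exposure and duration, and the function names are hypothetical.

```python
from math import comb, prod

def ways_to_meet(n_symptoms, minimum):
    """Number of symptom subsets of size >= minimum within one cluster."""
    return sum(comb(n_symptoms, k) for k in range(minimum, n_symptoms + 1))

def total_ways(criteria):
    """Multiply per-cluster counts, since each cluster is met independently."""
    return prod(ways_to_meet(n, m) for n, m in criteria)

# (symptoms in cluster, minimum required), assumed from the DSM criteria
DSM5_PTSD = [(5, 1), (2, 1), (7, 2), (6, 2)]  # intrusion, avoidance, mood/cognition, arousal
DSM4_PTSD = [(5, 1), (7, 3), (5, 2)]          # re-experiencing, avoidance/numbing, arousal

print(total_ways(DSM5_PTSD))  # 636120
print(total_ways(DSM4_PTSD))  # 79794
```

Under these assumptions the script prints 636120 and 79794, matching the figures reported above, which suggests this is essentially how the counts were obtained. The per-cluster counts for DSM-5 are 31, 3, 120 and 57.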

Galatzer-Levy and Bryant believe the PTSD scenario exemplifies the problem with using a set of pre-defined criteria to identify whether a person has a mental health problem or not. In the pursuit of increasing diagnostic reliability, the code loses its meaning in a fog of heterogeneity. The authors fear that despite the increasing diagnostic complexity, people who need help are still missed, while others continue to be misdiagnosed. They believe this could be the reason why the research into risk factors for PTSD, and into the effectiveness of interventions for the condition, tends to produce such highly varied results.

The ideal situation, according to Galatzer-Levy and Bryant, is for our understanding and description of mental health problems to be based on empirical data – in this case about how people respond to stress and trauma. They say a useful approach is to use statistical techniques that reveal the various ways that people are affected over time – a complexity that is missed by simple symptom check-lists. For instance, Galatzer-Levy and Bryant say there are at least three patterns in the way people respond to stressful events: some cope well and show only short-lived symptoms; others struggle at first but recover with time; and a third group continue struggling with chronic symptoms.

“Such an empirical approach for identifying behavioural patterns both in clinical and nonclinical contexts is nascent,” the authors conclude. “A great deal of work is necessary to identify and understand common outcomes of disparate, potentially traumatic, and common stressful life events.”


Isaac R. Galatzer-Levy and Richard A. Bryant (2013). 636,120 Ways to Have Posttraumatic Stress Disorder. Perspectives on Psychological Science

Post written by Christian Jarrett (@psych_writer) for the BPS Research Digest.

Not so easy to spot: A failure to replicate the Macbeth Effect across three continents

“Out, damned spot!” cries a guilt-ridden Lady Macbeth as she desperately washes her hands in the vain pursuit of a clear conscience. Consistent with Shakespeare’s celebrated reputation as an astute observer of the human psyche, a wealth of contemporary research has demonstrated the reality of this close link between our sense of moral purity and physical cleanliness.

One manifestation of this was nicknamed the Macbeth Effect – first documented by Chen-Bo Zhong and Katie Liljenquist in an influential paper in the high-impact journal Science in 2006 – in which feelings of moral disgust were found to provoke a desire for physical cleansing. For instance, in their second study, Zhong and Liljenquist found that US participants who hand-copied a story about an unethical deed were subsequently more likely to rate cleansing products as highly desirable.

There have been many “conceptual replications” of the Macbeth Effect. A conceptual replication uses a different research methodology to test the theoretical mechanism proposed to underlie the original effect. For example, last year, Mario Gollwitzer and André Melzer found that novice video gamers showed a strong preference for hygiene products after playing a violent game.

Given the strong theoretical foundations of the Macbeth Effect, combined with several conceptual replications, University of Oxford psychologist Brian Earp and his colleagues were surprised when a pilot study of theirs failed to replicate Zhong and Liljenquist’s second study. This pilot study had been intended as the start of a new project looking to further develop our understanding of the Macbeth Effect. Rather than filing away this negative result, Earp and his colleagues were inspired to examine the robustness of the Macbeth Effect with a series of direct replications. Unlike conceptual replications, direct replications seek to mimic the methods of an original study as closely as possible.

Following best practice guidelines, Earp’s team contacted Zhong and Liljenquist, who kindly shared their original materials. Another feature of a high-quality replication is having enough statistical power to detect the original effect if it is real. In psychology, this usually means recruiting an adequate number of participants. Accordingly, Earp’s team recruited 153 undergrad participants – more than five times as many as took part in Zhong and Liljenquist’s second study.
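As a rough illustration of why sample size matters, the following Python sketch uses the standard normal approximation for a two-sided comparison of two group means to estimate how many participants each group needs for a given standardised effect size (Cohen’s d). The effect sizes are hypothetical; the post does not describe the replication team’s own power calculation, and exact t-based tools give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants needed per group to detect a standardised mean
    difference d with a two-sided, two-sample test (normal approximation)."""
    z = NormalDist()                    # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

# Hypothetical effect sizes: smaller effects demand far larger samples.
for d in (0.8, 0.5, 0.2):
    print(d, n_per_group(d))
```

With the default 5 per cent threshold and 80 per cent power, the loop prints 25, 63 and 393 participants per group respectively, which shows how quickly the required sample grows as the expected effect shrinks.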

Exactly as in the original research, the British students hand-copied a story about an unethical deed (an office worker shreds a vital document needed by a colleague) or about an ethical deed (the office worker finds and saves the document for their colleague). They then rated the desirability and value of several consumer products. These were the exact same products used in the original study – including soap, toothpaste, batteries and fruit juice – except that a few brand names were changed to suit the UK as opposed to US context. Students who copied the unethical story rated the desirability and value of the various hygiene and other products just the same as the students who copied the ethical story. In other words, there was no Macbeth Effect.

It’s possible that the Macbeth Effect is culturally specific, so Earp and his team next attempted a replication with 156 US participants recruited through Amazon’s Mechanical Turk crowdsourcing website. The materials and methods were almost identical to the original, except that participants were required to re-type and add punctuation to either the ethical or unethical version of the office worker story. Again, exposure to the unethical story made no difference to participants’ ratings of the value or desirability of the consumer products, with just one anomaly: participants in the unethical condition placed a higher value on toothpaste. In the context of their other findings, Earp’s team think this is likely a spurious result.
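Why might a single difference such as the toothpaste result be treated as spurious? When many products are compared, some ‘significant’ differences are expected by chance alone. A minimal Python sketch, assuming independent tests and a conventional 0.05 threshold; the number of comparisons is a hypothetical value, since the post says only that several products were rated, and real product ratings are unlikely to be fully independent.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n independent tests
    when none of the tested effects is real."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the familywise error rate at or below alpha."""
    return alpha / n_tests

# Hypothetical: 10 product ratings compared between the two conditions.
print(round(familywise_error_rate(10), 3))  # 0.401
print(bonferroni_alpha(10))
```

Under these assumptions there is roughly a 40 per cent chance of at least one spurious ‘hit’ among ten comparisons, which is why an isolated effect that fails to appear elsewhere deserves caution.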

Finally, the exact same procedures were followed with an Indian sample, drawn from another culture that, like the US, places a high value on moral purity. Nearly three hundred Indian participants were recruited via Amazon’s Mechanical Turk, but again exposure to the ethical or unethical story had no effect on ratings of hygiene or other products.

Earp and his colleagues want to be clear – they’re not saying that there is no link between physical and moral purity, nor are they dismissing the existence of a Macbeth Effect. But they do believe their three direct, cross-cultural replication failures call for a “careful reassessment of the evidence for a real-life ‘Macbeth Effect’ within the realm of moral psychology.”

This study, due for publication next year, comes at a time when reformers in psychology are calling for more value to be placed on replication attempts and negative results. “By resisting the temptation … to bury our own non-significant findings with respect to the Macbeth Effect, we hope to have contributed a small part to the ongoing scientific process,” Earp and his colleagues concluded.


Brian D. Earp, Jim A. C. Everett, Elizabeth N. Madva, and J. Kiley Hamlin (2014). Out, damned spot: Can the “Macbeth Effect” be replicated? Basic and Applied Social Psychology, In Press.

— Further reading —
An unsuccessful conceptual replication of the Macbeth Effect was published in 2009. Later, in 2011, another paper failed to replicate any of Zhong and Liljenquist’s four studies, although the replications may have been underpowered.

From the Digest archive: Your conscience really can be wiped clean; Feeling clean makes us harsher moral judges.

See also: Psychologist magazine special issue on replications.

Christian Jarrett (@Psych_Writer) is Editor of BPS Research Digest

Students assume psychology is less scientific/important than the natural sciences, says study with scientific limitations

Students see test tubes as more scientific than questionnaires

More than 130 years after the opening of its first laboratory, psychology still struggles to be taken seriously as a science. A new paper by psychologists in the USA suggests this is due in part to superficial assumptions made about the subject matter and methods of behavioural science.

Douglas Krull and David Silvera asked 73 college students (49 women) to rate various topics and pieces of equipment on a 9-point scale according to how scientific they thought they were. On average, the students consistently rated topics from the natural sciences (e.g. brain, solar flares) and natural science equipment (e.g. microscope, magnetic resonance imaging) as more scientific than behavioural science topics and equipment (e.g. attitudes and questionnaires). The average ratings were 7.86 for natural science topics versus 5.06 for behavioural science topics, and 7 for natural science equipment versus 4.34 for behavioural science equipment.

A follow-up study involving 71 more college students was similar, but this time students rated the scientific status of 20 brief scenarios. These varied according to whether the topic was natural or behavioural science and whether the equipment used was natural or behavioural (e.g. “Dr Thompson studies cancer. To do this research, Dr Thompson uses interviews” is an example of a natural science topic studied with behavioural science methods). Natural science topics and equipment were again rated as more scientific than their behavioural science counterparts. The two effects were additive, so natural science topics studied with natural science methods were assumed to be the most scientific of all.

A third and final study was almost identical, but this time the 94 college students revealed their belief that the natural sciences are more important than the behavioural sciences. “Even though the scientific enterprise is defined by its method, people seem to be influenced by the content of the research,” Krull and Silvera concluded. They added that this could have serious adverse consequences, including: students interested in science not going into psychology; psychology findings not being taken seriously; and funding being diverted from psychology to other sciences. “Misperceptions of science have the potential to hinder research and applications of research that could otherwise produce positive changes in society,” they said.

Unfortunately, for a paper on the reputation of psychological science, this one has a series of serious scientific limitations. For instance, not only are all three samples restricted to college students, but we’re also told nothing about the background of these students, not even whether they were humanities or science students. There is also no detail on how the students construed the meaning of “scientific”. If students assume the meaning of scientific has more to do with subject matter than with method, then the findings from the first two studies are simply tautological.

Apart from a couple of exceptions, we are also given no information on how the researchers categorised their list of topics and equipment as belonging to either natural or behavioural science. Sometimes it’s obvious, but not always. For instance, how was “computer programmes” categorised? Where the categorisation is revealed, it doesn’t always seem justified. Is “the brain” exclusively a natural science topic and not a behavioural science topic? In truth, psychologists often make inferences about the brain based on behavioural data. Obviously, carving up scientific disciplines is a tricky business, but Krull and Silvera don’t really address the issue. Their terminology also shifts: the paper starts off distinguishing between natural and behavioural science, with psychology given as one example of a behavioural science, yet the discussion then focuses largely on psychology.

Lastly, it’s unfortunate that Krull and Silvera more than once refer to the seductive allure of brain scans as an example of the way that people are swayed by the superficial merit of natural science. Presumably they wrote their paper before the seductive allure of brain scans was thoroughly debunked earlier this year. They can’t be blamed for not seeing into the future, but it was perhaps scientifically naive to place so much faith in a single study.


Douglas S. Krull and David H. Silvera (2013). The stereotyping of science: superficial details influence perceptions of what is scientific. Journal of Applied Social Psychology DOI: 10.1111/jasp.12118

–Further reading–
Child’s play! The developmental roots of the misconception that psychology is easy
From The Psychologist magazine news archive: A US psychologist has urged the psychological community to do more to challenge the public’s scepticism of our science.

Post written by Christian Jarrett (@psych_writer) for the BPS Research Digest.

Scanning a brain that believes it is dead

What is going on in the brain of someone who has the deluded belief that they are brain dead? A team of researchers led by neuropsychologist Vanessa Charland-Verville at CHU Sart-Tilman Hospital and the University of Liège has attempted to find out by scanning the brain of a depressed patient who held this very belief.

The researchers used a positron emission tomography (PET) scanner, the first time this technology has been used to scan a patient with this kind of delusion – known as Cotard’s syndrome after the French neurologist Jules Cotard. The 48-year-old patient had developed Cotard’s after attempting to take his own life by electrocution. Eight months later he went to his general practitioner complaining that his brain was dead, and that he therefore no longer needed to eat or sleep. He acknowledged that he still had a mind, but (in the words of the researchers) he said he was “condemned to a kind of half-life, with a dead brain in a living body.”

The researchers used the PET scanner to monitor levels of metabolic activity across the patient’s brain as he rested. Compared with 39 healthy, age-matched controls, he showed substantially reduced activity across a swathe of frontal and temporal brain regions incorporating many key parts of what’s known as the “default mode network”. This is a hub of brain regions that shows increased activity when people’s brains are at rest, disengaged from the outside world. It’s been proposed that activity in this network is crucial for our sense of self.

“Our data suggest that the profound disturbance of thought and experience, revealed by Cotard’s delusion, reflects a profound disturbance in the brain regions responsible for ‘core consciousness’ and our abiding sense of self,” the researchers concluded.

Unfortunately the study has a number of serious limitations beyond the fact that it is of course a single case study. As well as having a diagnosis of Cotard’s Delusion, the patient was also depressed and on an intense drug regimen, including sedative, antidepressant and antipsychotic medication. It’s unclear therefore whether his distinctive brain activity was due to Cotard’s, depression or his drugs, although the researchers counter that such an extreme reduction in brain metabolism is not normally seen in patients with depression or on those drugs.

Another issue is the lack of detail on the scanning procedure. Perhaps this is due to the short article format (a “Letter to the Editor”), but it’s not clear for how long the patient and controls were scanned, nor what they were instructed to do in the scanner. For example, did they have their eyes open or closed? What did they think about?

But perhaps most problematic is the issue of how to interpret the findings. Does the patient have Cotard’s Delusion because of his abnormal brain activity, or does he have that unusual pattern of brain activity because of his deluded beliefs? Relevant here, but not mentioned by the researchers, are studies showing that trained meditators also show reduced activity in the default mode network. This provides a graphic illustration of the limits to a purely biological approach to mental disorder. It seems diminished activity in the default mode network can be associated both with feelings of being brain dead and with feelings of tranquil oneness with the world; it depends on who is doing the feeling. Understanding how this can be will likely require researchers to think outside of the brain.


Charland-Verville, V., Bruno, M., Bahri, M., Demertzi, A., Desseilles, M., Chatelle, C., Vanhaudenhuyse, A., Hustinx, R., Bernard, C., Tshibanda, L., Laureys, S., and Zeman, A. (2013). Brain dead yet mind alive: A positron emission tomography case study of brain metabolism in Cotard’s syndrome. Cortex DOI: 10.1016/j.cortex.2013.03.003

Post written by Christian Jarrett (@psych_writer) for the BPS Research Digest.

Serious power failure threatens the entire field of neuroscience

Psychology has had a torrid time of late, with fraud scandals and question marks about the replicability of many of the discipline’s key findings. Today it is joined in the dock by its more biologically oriented sibling: Neuroscience. A team led by Katherine Button at the School of Experimental Psychology in Bristol, and including psychologist Brian Nosek, founder of the new Center for Open Science, make the case in a new paper that the majority of neuroscience studies involve woefully small sample sizes, rendering their results highly unreliable. “Low statistical power is an endemic problem in neuroscience,” they write.

At the heart of their case is a comprehensive analysis of 49 neuroscience meta-analyses published in 2011 (that’s all the meta-analyses published that year that contained the information required for their purposes). This took in 730 individual papers, including genetic studies, drug research and papers on brain abnormalities.

Meta-analyses collate all the findings in a given field as a way to provide the most accurate estimate possible about the size of any relevant effects. Button’s team compared these effect size estimates for neuroscience’s subfields against the average sample sizes used in those same areas of research. If the meta-analyses for a particular subfield suggested an effect – such as a brain abnormality associated with a mental illness – is real, but subtle, then this would indicate that suitable investigations in that field ought to involve large samples in order to be adequately powered. A larger effect size would require more modest samples.
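To make the trade-off between effect size, sample size and power a little more concrete, here is a minimal Python sketch. It assumes the simplest possible case, a comparison of two group means expressed as a standardised effect size (Cohen’s d), and uses a normal approximation rather than an exact t-test. It is not the method Button’s team used across their meta-analyses, and the effect sizes and sample sizes in the example are arbitrary illustrations.

```python
from statistics import NormalDist


def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample test of a standardised
    mean difference d, using the normal approximation (an assumption)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)        # critical value, e.g. about 1.96 for alpha = .05
    shift = d * (n_per_group / 2) ** 0.5     # expected test statistic if the effect is real
    # Probability that the statistic lands beyond either critical value.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Subtle (0.2), medium (0.5) and large (0.8) effects at three sample sizes.
for d in (0.2, 0.5, 0.8):
    print(d, [round(approx_power(d, n), 2) for n in (20, 50, 200)])
```

Running it shows power rising with sample size and falling steeply as the effect shrinks, which is why a subtle effect calls for a large sample.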

Based on this, the researchers estimate that the median statistical power of a neuroscience study is 21 per cent. In other words, a typical study has only about a one-in-five chance of detecting a real effect, so most real effects under investigation (around 79 per cent, for a study of median power) are likely being missed. More worrying still, when underpowered studies do uncover a significant result, the lack of power increases the chance that the finding is spurious. Third, even when the reported effect is in fact real, significant effect sizes uncovered by underpowered studies tend to overestimate the true effect size. This is because, by their very nature, underpowered studies are only likely to turn up significant results in data where the effect size happens to be large.
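The last two problems can be sketched in a few lines of Python. The first function uses the positive predictive value (PPV) formula that Button’s team discuss, PPV = (power × R) / (power × R + α), where R is the pre-study odds that a tested effect is real; the value R = 0.25 (odds of 1:4) used below is an arbitrary assumption. The second function is a hypothetical simulation of the so-called winner’s curse. It generates many identical two-group studies of a small true effect, keeps only the significant ones and averages their estimates. It assumes known variance and a normal sampling distribution, so treat it as a toy model rather than a picture of real neuroscience data.

```python
import random
from statistics import NormalDist, fmean


def ppv(power: float, prior_odds: float, alpha: float = 0.05) -> float:
    """Share of significant findings that reflect true effects, given power,
    pre-study odds R that the effect is real, and the significance level."""
    return power * prior_odds / (power * prior_odds + alpha)


def mean_significant_estimate(true_d: float, n_per_group: int, alpha: float = 0.05,
                              n_sims: int = 20_000, seed: int = 1) -> float:
    """Simulate many identical two-group studies and return the average
    observed effect among only those that reached significance."""
    rng = random.Random(seed)
    se = (2 / n_per_group) ** 0.5            # standard error of d, assuming unit variance is known
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = []
    for _ in range(n_sims):
        d_hat = rng.gauss(true_d, se)        # one study's observed effect size
        if abs(d_hat) / se > z_crit:         # two-sided significance test
            significant.append(d_hat)
    return fmean(significant)


print(round(ppv(0.21, 0.25), 2), round(ppv(0.80, 0.25), 2))
print(round(mean_significant_estimate(0.3, 20), 2),
      round(mean_significant_estimate(0.3, 500), 2))
```

With the assumed odds of 1:4, a study with 21 per cent power gives a PPV of roughly 0.51, so only about half of its significant findings would be true, against 0.8 at 80 per cent power. In the simulation, the significant studies with 20 participants per group report an average effect well above the true value of 0.3, while those with 500 per group land close to it.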

It gets more worrying. The aforementioned issues are what you get when all else in the methodology is sound, bar the inadequate sample size. Trouble is, Button and her colleagues say underpowered studies often have other problems too. For instance, small studies are more vulnerable to the “file-drawer effect”, in which negative results tend to get swept under the carpet (simply because it’s easier to ignore a quick and easy study than a massive, expensive one). Underpowered studies are also more vulnerable to an issue known as “vibration of effects” whereby the results vary considerably with the particular choice of analysis. And yes, there is often a huge choice of analysis methods in neuroscience. A recent paper documented how 241 fMRI studies involved 223 unique analysis strategies.
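The way analytic flexibility can manufacture a positive result is easy to show with a toy simulation. The Python sketch below is a hypothetical example, not taken from the paper. It draws two groups from the same distribution, so there is no real effect, and then runs the same comparison under several arbitrary outlier-exclusion rules. Counting a dataset as a “hit” whenever any of those analyses reaches p < .05 produces false positives more often than the nominal 5 per cent. The large-sample z-test and the particular cutoffs are assumptions chosen to keep the code short.

```python
import random
from statistics import NormalDist, fmean

Z = NormalDist()


def mean_sd(xs):
    """Sample mean and standard deviation."""
    m = fmean(xs)
    return m, (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5


def p_value(a, b):
    """Two-sided p-value for a difference in means (large-sample z approximation)."""
    (ma, sa), (mb, sb) = mean_sd(a), mean_sd(b)
    z = (ma - mb) / (sa ** 2 / len(a) + sb ** 2 / len(b)) ** 0.5
    return 2 * (1 - Z.cdf(abs(z)))


def trim(xs, k):
    """Hypothetical analysis choice: drop values more than k SDs from the mean."""
    if k is None:
        return xs
    m, s = mean_sd(xs)
    return [x for x in xs if abs(x - m) <= k * s]


def false_positive_rates(n=50, n_sims=2000, seed=7):
    """Return (rate for one pre-planned analysis, rate when any analysis counts)."""
    rng = random.Random(seed)
    cutoffs = [None, 3.0, 2.5, 2.0, 1.5]     # None means keep every data point
    single = any_hit = 0
    for _ in range(n_sims):
        # Both groups come from the same distribution, so the null hypothesis is true.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        ps = [p_value(trim(a, k), trim(b, k)) for k in cutoffs]
        single += ps[0] < 0.05
        any_hit += min(ps) < 0.05
    return single / n_sims, any_hit / n_sims


print(false_positive_rates())
```

Because all the analyses share the same data, the inflation is smaller than it would be for fully independent tests, but the “any analysis” rate can only stay level or rise as more options are added.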

Because of the relative paucity of brain imaging papers in their main analysis, Button’s team also turned their attention specifically to the brain imaging field. Based on findings from 461 studies published between 2006 and 2009, they estimate that the median statistical power in the sub-discipline of brain volume abnormality research is just 8 per cent.

Switching targets to the field of animal research (focusing on studies involving rats and mazes), they estimate that most studies had “severely” inadequate statistical power, in the range of 18 to 31 per cent. This raises important ethical issues, Button’s team said, because it makes it highly likely that animals are being sacrificed with minimal chance of discovering true effects. It’s clearly a sensitive area, but one logical implication is that it would be more justifiable to conduct studies with larger samples of animals, because at least then there would be a more realistic chance of discovering the effects under investigation (a similar logic can also be applied to human studies).

The prevalence of inadequately powered studies in neuroscience is all the more disconcerting, Button and her colleagues conclude, because most of the low-hanging fruit in brain science has already been picked. Today, the discipline is largely searching for more subtle effects, and for this mission, suitable studies need to be as highly powered as possible. Yet sample sizes have stood still, while at the same time it has become easier than ever to run repeated, varied analyses on the same data until a seemingly positive result crops up. This leads to a “disquieting conclusion”, the researchers said – “a dramatic increase in the likelihood that statistically significant findings are spurious.” They end their paper with a number of suggestions for how to rehabilitate the field, including performing routine power calculations prior to conducting studies (to ensure they are suitably powered), disclosing methods and findings transparently, and working collaboratively to increase study power.
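The first of those suggestions, a power calculation done before any data are collected, can be sketched as follows. This minimal Python example assumes the same simple case of two groups compared on a standardised effect size d and uses the normal approximation. Dedicated power software and exact t-test calculations give slightly different numbers, and designs such as brain imaging studies need more specialised methods.

```python
import math
from statistics import NormalDist


def n_per_group(d: float, power: float = 0.8, alpha: float = 0.05) -> int:
    """Approximate participants needed per group for a two-sided, two-sample
    comparison of means with standardised effect size d (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the significance test
    z_beta = z.inv_cdf(power)            # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)


for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

For d = 0.5 this gives 63 participants per group, close to the figure of about 64 from an exact t-test calculation. Because the required sample scales with 1/d², halving the expected effect roughly quadruples the number of participants needed.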

KS Button, JPA Ioannidis, C Mokrysz, BA Nosek, J Flint, ESJ Robinson and MR Munafò (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience DOI: 10.1038/nrn3475

Post written by Christian Jarrett (@psych_writer) for the BPS Research Digest.