Category: Methodological

‘I’ll have what she’s having’ – Developing the Faking Orgasm Scale for Women

Over 65 per cent of women are believed to have done it at least once in their lives. Magazines, TV shows and self-help books all talk about it. It features in one of the most memorable movie scenes ever. What am I talking about? Faking orgasm, of course.

I’ve done it. I wonder if you have too?

Let’s distract ourselves from this potentially awkward moment with a study by Cooper and colleagues, who have created the Faking Orgasm Scale.

When I saw this paper I bristled at the thought of yet another tool to over-diagnose our sexual lives. Really, does it matter if we fake? Doesn’t this surveillance reduce trust and put more pressure on people?

But I liked their discussion of what orgasm might be, why pressure to ‘achieve’ ‘mind-blowing orgasms’ exists in Western culture, and who perpetuates it. (Clue: it’s not just the media; the medicalisation of women’s sexual problems by the pharmaceutical industry doesn’t help either.)

In a two-stage study, respondents (all heterosexual female college students majoring in psychology) were asked when, why and how they faked orgasm. The researchers then narrowed this down into four categories:

Altruistic Deceit (e.g. faked orgasm to make a partner happy or prevent them feeling guilty)
Fear and Insecurity (e.g. faked orgasm because they felt ashamed they couldn’t experience orgasm)
Elevated Arousal (e.g. faked an orgasm to get more turned on or increase the intensity of the experience)
Sexual Adjournment (e.g. faked orgasm to end a sexual encounter because of tiredness, a lack of enjoyment etc)

We tend to view faking orgasm as manipulative, whereas this research suggests it could well play a positive role in increasing arousal. I could see an additional measure of distress being useful here, to identify whether the faking was done pleasurably to enhance sex, or was an indication of other sexual or relationship problems where education or therapy might be of benefit.


Wait! I’m sure you’ve already spotted these participants might be a bit WEIRD [Western, Educated, Industrialised, Rich, Democratic], so how useful is this study? The authors are up front about their research being limited by the use of a volunteer student sample, and because of this I think the Faking Orgasm Scale may be better described as a tool in development rather than an established measure.

For that to happen, the scale would need further research with bi and lesbian women, trans women, women in long-term relationships, and those who are not US psychology majors. It could also broaden into sexual experiences beyond penis-in-vagina intercourse or oral sex (the two activities respondents were required to have both tried and faked orgasm during).

The researchers note ‘faking orgasm… seems to have been overlooked almost entirely as a male sexual practice’ – a gap future research could certainly address, not least because existing qualitative work indicates faking orgasm is not unique to women and may be equally prevalent in men.

I can see therapists, researchers and healthcare providers welcoming a tool that might encourage us to open up about our sexual experiences. I could also see some practitioners taking issue with a quantified measure of such complex behaviour, and with its implicit notions of authenticity in sexual behaviour. Me? I’d welcome anything that might allow us to talk more openly about orgasm so as to resist or reinvent the representations of perfectible sex we’re currently encouraged to aspire to.

– Further reading from The Psychologist – Orgasm.

Cooper EB, Fenigstein A, & Fauber RL (2014). The Faking Orgasm Scale for Women: Psychometric properties. Archives of Sexual Behavior, 43(3), 423-435. PMID: 24346866

Post written for the BPS Research Digest by guest host Petra Boynton, Senior Lecturer in International Primary Care Research, University College London and the Telegraph’s Agony Aunt.

Kidding ourselves on educational interventions?

Journals, especially high-impact journals, are notorious for not being interested in publishing replication studies, especially those that fail to obtain an interesting result. A recent paper by Lorna Halliday emphasises just how important replications are, especially when they involve interventions that promise to help children’s development.

Halliday focused on a commercially available program called Phonomena, which was developed with the aim of improving children’s ability to distinguish between speech sounds – a skill thought to be important for learning to read, as well as for those learning English as a second language. An initial study reported by Moore et al in 2005 gave promising results. A group of 18 children who were trained for 6 hours using Phonomena showed improvements on tests of phonological awareness from pre- to post-training, whereas 12 untrained children did not.

In a subsequent study, however, Halliday and colleagues failed to replicate the positive results of Moore’s group using similar methods and stimuli. Although children showed some learning of the contrasts they were trained on, this did not generalise to tests of phonology or language administered before and after the training session. Rather than just leaving us with this disappointing result, however, Halliday decided to investigate possible reasons for the lack of replication. Her analysis should be required reading for anyone contemplating an intervention study, revealing as it does a number of apparently trivial factors that appear to play a role in determining results.

The different results could not be easily accounted for by differences in the samples of children, who were closely similar in terms of their pre-training scores. In terms of statistical power, the Halliday sample was larger, so should have had a better chance of detecting a true effect if it existed. There were some procedural differences in the training methods used in the two studies, but this led to better learning of the trained contrasts in the Halliday study, so we might have expected more transfer to the phonological awareness tests, when in fact the opposite was the case.

Halliday notes a number of factors that did differ between studies and which may have been key. First, the Halliday study used random assignment of children to training groups, whereas the Moore study gave training to one tutor group and used the other as a control group. This is potentially problematic because children will have been exposed to different teaching during the training interval. Second, both the experimenter and the children themselves were aware of which group was which in the Moore study. In the Halliday study, in contrast, two additional control groups were used who also underwent training. This avoids the problems of ‘placebo’ effects that can occur if children are motivated by the experience of training, or if they improve because of greater familiarity with the experimenter. Ideally, in a study like this, the experimenter should be blind to the child’s group status. This was not the case for either of the studies, leaving them open to possible experimenter bias, but Halliday noted that in her study the experimenter did not know the child’s pre-test score, whereas in the Moore study, the experimenter was aware of this information.

Drilling down to the raw data, Halliday noted an important contrast between the two studies. In the Moore study, the untreated control group showed little gain on the outcome measures of phonological processing, whereas in her study they showed significant improvements on two of the three measures. It’s feasible that this might have been because of the fact that the controls in the Halliday study were rewarded for participation, had regular contact with the experimenters throughout the study, and were tested at the end of the study by someone who was blind to their prior test score.

There has been much debate about the feasibility and appropriateness of using randomised controlled trial methodology in educational settings. Such studies are hard to do, but their stringent methods have evolved for very good reasons: unless we carefully control all aspects of a study, it is easy to kid ourselves that an intervention has a beneficial effect, when in fact, a control group given similar treatment but without the key intervention component may do just as well.

Halliday LF (2014). A Tale of Two Studies on Auditory Training in Children: A Response to the Claim that ‘Discrimination Training of Phonemic Contrasts Enhances Phonological Processing in Mainstream School Children’ by Moore, Rosenberg and Coleman (2005). Dyslexia. PMID: 24470350

Post written for the BPS Research Digest by guest host Dorothy Bishop, Professor of Developmental Neuropsychology and a Wellcome Principal Research Fellow at the Department of Experimental Psychology in Oxford, Adjunct Professor at The University of Western Australia, Perth, and a runner up in the 2012 UK Science Blogging Prize for BishopBlog.

There are 636,120 ways to have post traumatic stress disorder

The latest version of the American Psychiatric Association’s (APA) controversial diagnostic code – “the DSM-5” – continues the check-list approach used in previous editions. To receive a specific diagnosis, a patient must exhibit a minimum number of symptoms in different categories. One problem – this implies someone either has a mental illness or they don’t.

To avoid missing people who ought to be diagnosed, over time the criteria for many conditions have expanded, and nowhere is this more apparent than in the case of post traumatic stress disorder (PTSD). Indeed, in their new analysis of the latest expanded diagnostic criteria for PTSD, Isaac Galatzer-Levy and Richard Bryant calculate that there are now 636,120 ways to be diagnosed with PTSD based on all the possible combinations of symptoms that would fulfil a diagnosis for this condition.

First defined as a distinct disorder in 1980, for many years PTSD was diagnosed based on a patient exhibiting a sufficient number of various symptoms in three categories: reexperiencing symptoms (e.g. flashbacks); avoidance and numbing symptoms (e.g. diminished interest in activities); and arousal symptoms (e.g. insomnia). For the latest version of the DSM, a new symptom category was introduced: alterations in mood and cognition (e.g. increased shame). This means a diagnosis of PTSD now depends on the patient having a minimum of 8 of 19 possible symptoms across four categories (or criteria), so long as these appear after they have witnessed or experienced an event involving actual or threatened harm.

Putting these various diagnostic permutations into the statistical grinder, Galatzer-Levy and Bryant arrive at their figure of 636,120 ways to be diagnosed with PTSD. This compares to 79,794 ways based on DSM-IV – the previous version of the APA’s diagnostic code. The net has not widened in this fashion for all conditions – for example the criteria for panic disorder have tightened (there were 54,698 “ways” to be diagnosed with panic disorder in DSM-IV, compared with 23,442 ways in DSM-5).
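The “statistical grinder” here is simple combinatorics: for each symptom category, count how many subsets of symptoms meet that category’s minimum, then multiply across categories. The sketch below assumes one common summary of the criterion structure (at least 1 of 5 intrusion symptoms, 1 of 2 avoidance, 2 of 7 mood/cognition and 2 of 6 arousal symptoms for DSM-5; at least 1 of 5, 3 of 7 and 2 of 5 for DSM-IV) – these exact counts are an assumption rather than a quotation from the paper, but they reproduce both published figures.

```python
from math import comb

def ways(categories):
    """Multiply, across symptom categories, the number of symptom
    subsets that meet each category's minimum-count threshold."""
    total = 1
    for n_symptoms, minimum in categories:
        total *= sum(comb(n_symptoms, k) for k in range(minimum, n_symptoms + 1))
    return total

# Assumed DSM-5 PTSD structure: (category size, minimum required)
dsm5 = ways([(5, 1), (2, 1), (7, 2), (6, 2)])
# Assumed DSM-IV PTSD structure
dsm4 = ways([(5, 1), (7, 3), (5, 2)])

print(dsm5)  # 636120
print(dsm4)  # 79794
```

That the product lands exactly on 636,120 and 79,794 suggests this is the calculation Galatzer-Levy and Bryant performed, though the grouping above is my reconstruction.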

Galatzer-Levy and Bryant believe the PTSD scenario exemplifies the problem with using a set of pre-defined criteria to identify whether a person has a mental health problem or not. In the pursuit of increasing diagnostic reliability, the code loses its meaning in a fog of heterogeneity. The authors fear that despite the increasing diagnostic complexity, people who need help are still missed, while others continue to be misdiagnosed. They believe this could be the reason why the research into risk factors for PTSD, and into the effectiveness of interventions for the condition, tends to produce such highly varied results.

The ideal situation, according to Galatzer-Levy and Bryant, is for our understanding and description of mental health problems to be based on empirical data – in this case about how people respond to stress and trauma. They say a useful approach is to use statistical techniques that reveal the varieties of ways that people are affected over time – a complexity that is missed by simple symptom check-lists. For instance, Galatzer-Levy and Bryant say there are at least three patterns in the way people respond to stressful events – some cope well and show only short-lived symptoms; others struggle at first but recover with time; while a third group continue struggling with chronic symptoms.

“Such an empirical approach for identifying behavioural patterns both in clinical and nonclinical contexts is nascent,” the authors conclude. “A great deal of work is necessary to identify and understand common outcomes of disparate, potentially traumatic, and common stressful life events.”


Isaac R. Galatzer-Levy and Richard A. Bryant (2013). 636,120 Ways to Have Posttraumatic Stress Disorder. Perspectives on Psychological Science.

Post written by Christian Jarrett (@psych_writer) for the BPS Research Digest.

Not so easy to spot: A failure to replicate the Macbeth Effect across three continents

“Out, damned spot!” cries a guilt-ridden Lady Macbeth as she desperately washes her hands in the vain pursuit of a clear conscience. Consistent with Shakespeare’s celebrated reputation as an astute observer of the human psyche, a wealth of contemporary research findings have demonstrated the reality of this close link between our sense of moral purity and physical cleanliness.

One manifestation of this was nicknamed the Macbeth Effect – first documented by Chen-Bo Zhong and Katie Liljenquist in an influential paper in the high-impact journal Science in 2006 – in which feelings of moral disgust were found to provoke a desire for physical cleansing. For instance, in their second study, Zhong and Liljenquist found that US participants who hand-copied a story about an unethical deed were subsequently more likely to rate cleansing products as highly desirable.

There have been many “conceptual replications” of the Macbeth Effect. A conceptual replication is when a different research methodology supports the proposed theoretical mechanism underlying the original effect. For example, last year, Mario Gollwitzer and André Melzer found that novice video gamers showed a strong preference for hygiene products after playing a violent game.

Given the strong theoretical foundations of the Macbeth Effect, combined with several conceptual replications, University of Oxford psychologist Brian Earp and his colleagues were surprised when a pilot study of theirs failed to replicate Zhong and Liljenquist’s second study. This pilot study had been intended as the start of a new project looking to further develop our understanding of the Macbeth Effect. Rather than filing away this negative result, Earp and his colleagues were inspired to examine the robustness of the Macbeth Effect with a series of direct replications. Unlike conceptual replications, direct replications seek to mimic the methods of an original study as closely as possible.

Following best practice guidelines, Earp’s team contacted Zhong and Liljenquist, who kindly shared their original materials. Another feature of a high-quality replication is to ensure you have enough statistical power to replicate the original effect. In psychology, this usually means recruiting an adequate number of participants. Accordingly, Earp’s team recruited 153 undergrad participants – more than five times as many as took part in Zhong and Liljenquist’s second study.
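For readers curious what a power calculation of this kind looks like, here is a minimal sketch using the standard normal-approximation formula for comparing two group means. The effect sizes and the 80 per cent power target are illustrative assumptions, not the figures Earp’s team actually used.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect a
    standardised mean difference d in a two-group comparison,
    using the usual normal-approximation sample-size formula."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-tailed significance threshold
    z_beta = z.inv_cdf(power)           # desired power
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# Illustrative values: smaller expected effects demand much larger samples
print(n_per_group(0.8))  # 25 per group for a 'large' effect
print(n_per_group(0.5))  # 63 per group for a 'medium' effect
print(n_per_group(0.2))  # 393 per group for a 'small' effect
```

The steep growth as d shrinks is why a replication attempt with five times the original sample is so much more informative than the original study.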

Exactly as in the original research, the British students hand-copied a story about an unethical deed (an office worker shreds a vital document needed by a colleague) or about an ethical deed (the office worker finds and saves the document for their colleague). They then rated the desirability and value of several consumer products. These were the exact same products used in the original study – including soap, toothpaste, batteries and fruit juice – except that a few brand names were changed to suit the UK as opposed to US context. Students who copied the unethical story rated the desirability and value of the various hygiene and other products just the same as the students who copied the ethical story. In other words, there was no Macbeth Effect.

It’s possible that the Macbeth Effect is a culturally specific phenomenon. Next, Earp and his team conducted a replication attempt with 156 US participants using Amazon’s Mechanical Turk survey website. The materials and methods were almost identical to the original except that participants were required to re-type and add punctuation to either the ethical or unethical version of the office worker story. Again, exposure to the unethical story made no difference to the participants’ ratings of the value or desirability of the consumer products – with just one anomaly. Participants in the unethical condition placed a higher value on toothpaste. In the context of their other findings, Earp’s team think this is likely a spurious result.

Finally, the exact same procedures were followed with an Indian sample – another culture that, like the US, places a high value on moral purity. Nearly three hundred Indian participants were recruited via Amazon’s Mechanical Turk, but again no effect of exposure to an ethical or unethical story was found on ratings of hygiene or other products.

Earp and his colleagues want to be clear – they’re not saying that there is no link between physical and moral purity, nor are they dismissing the existence of a Macbeth Effect. But they do believe their three direct, cross-cultural replication failures call for a “careful reassessment of the evidence for a real-life ‘Macbeth Effect’ within the realm of moral psychology.”

This study, due for publication next year, comes at a time when reformers in psychology are calling for more value to be placed on replication attempts and negative results. “By resisting the temptation … to bury our own non-significant findings with respect to the Macbeth Effect, we hope to have contributed a small part to the ongoing scientific process,” Earp and his colleagues concluded.


Brian D. Earp, Jim A. C. Everett, Elizabeth N. Madva, and J. Kiley Hamlin (2014). Out, damned spot: Can the “Macbeth Effect” be replicated? Basic and Applied Social Psychology, In Press.

— Further reading —
An unsuccessful conceptual replication of the Macbeth Effect was published in 2009 (pdf). Later, in 2011, another paper failed to replicate all four of Zhong and Liljenquist’s studies, although the replications may have been underpowered. 

From the Digest archive: Your conscience really can be wiped clean. Feeling clean makes us harsher moral judges.

See also: Psychologist magazine special issue on replications.

Christian Jarrett (@Psych_Writer) is Editor of BPS Research Digest

Students assume psychology is less scientific/important than the natural sciences, says study with scientific limitations

Students see test tubes as more scientific than questionnaires

Despite over 130 years passing since the opening of its first laboratory, psychology still struggles to be taken seriously as a science. A new paper by psychologists in the USA suggests this is due in part to superficial assumptions made about the subject matter and methods of behavioural science.

Douglas Krull and David Silvera asked 73 college students (49 women) to rate various topics and pieces of equipment on a 9-point scale in terms of how scientific they thought they were. On average, the students consistently rated natural science topics (e.g. the brain, solar flares; average rating 7.86) and natural science equipment (e.g. microscope, magnetic resonance imaging; 7) as more scientific than behavioural science topics (e.g. attitudes; 5.06) and equipment (e.g. questionnaires; 4.34).

A follow-up study involving 71 more college students was similar but this time students rated the scientific status of 20 brief scenarios. These varied according to whether the topic was natural or behavioural science and whether the equipment used was natural or behavioural (e.g. “Dr Thompson studies cancer. To do this research, Dr Thompson uses interviews” is an example of a natural science topic using behavioural science methods.) Natural science topics and equipment were again rated as more scientific than their behavioural science counterparts. And this was additive, so that natural science topics studied with natural science methods were assumed to be the most scientific of all.

A third and final study was almost identical but this time the 94 college students revealed their belief that the natural sciences are more important than the behavioural sciences. “Even though the scientific enterprise is defined by its method, people seem to be influenced by the content of the research,” Krull and Silvera concluded. They added that this could have serious adverse consequences including students interested in science not going into psychology; psychology findings not being taken seriously; and funding being diverted from psychology to other sciences. “Misperceptions of science have the potential to hinder research and applications of research that could otherwise produce positive changes in society,” they said.

Unfortunately for a paper on the reputation of psychological science, the paper contains a series of serious scientific limitations. For instance, not only are all three samples restricted to college students, we’re also told nothing about the background of these students; not even whether they were humanities or science students. There is also no detail on how the students construed the meaning of “scientific”. If students assume the meaning of scientific has more to do with subject matter than with method then the findings from the first two studies are simply tautological.

Apart from a couple of exceptions, we are also given no information on how the researchers categorised their list of topics and equipment as belonging either to natural or behavioural science. Sometimes it’s obvious, but not always. For instance, how was “computer programmes” categorised? Where the categorisation is revealed it doesn’t always seem justified. Is “the brain” exclusively a natural science topic and not a behavioural science topic? In truth psychologists often make inferences about the brain based on behavioural data. Obviously carving up scientific disciplines is a tricky business, but the issue is not really addressed by Krull and Silvera. In terms of terminology, their paper starts off distinguishing between natural and behavioural science, with psychology given as an example of a behavioural science. Their discussion then focuses largely on psychology.

Lastly, it’s unfortunate that Krull and Silvera more than once refer to the seductive allure of brain scans as an example of the way that people are swayed by the superficial merit of natural science. Presumably they wrote their paper before the seductive allure of brain scans was thoroughly debunked earlier this year. They can’t be blamed for not seeing into the future, but it was perhaps scientifically naive to place so much faith in a single study.


Douglas S. Krull and David H. Silvera (2013). The stereotyping of science: superficial details influence perceptions of what is scientific. Journal of Applied Social Psychology DOI: 10.1111/jasp.12118

–Further reading–
Child’s play! The developmental roots of the misconception that psychology is easy
From The Psychologist magazine news archive: A US psychologist has urged the psychological community to do more to challenge the public’s scepticism of our science.

Post written by Christian Jarrett (@psych_writer) for the BPS Research Digest.

Scanning a brain that believes it is dead

What is going on in the brain of someone who has the deluded belief that they are brain dead? A team of researchers led by neuropsychologist Vanessa Charland-Verville at CHU Sart-Tilman Hospital and the University of Liège has attempted to find out by scanning the brain of a depressed patient who held this very belief.

The researchers used a Positron Emission Tomography (PET) scanner, which is the first time this scanning technology has been used on a patient with this kind of delusion – known as Cotard’s syndrome after the French neurologist Jules Cotard. The 48-year-old patient had developed Cotard’s after attempting to take his own life by electrocution. Eight months later he arrived at his general practitioner complaining that his brain was dead, and that he therefore no longer needed to eat or sleep. He acknowledged that he still had a mind, but (in the words of the researchers) he said he was “condemned to a kind of half-life, with a dead brain in a living body.”

The researchers used the PET scanner to monitor levels of metabolic activity across the patient’s brain as he rested. Compared with 39 healthy, age-matched controls, he showed substantially reduced activity across a swathe of frontal and temporal brain regions incorporating many key parts of what’s known as the “default mode network”. This is a hub of brain regions that shows increased activity when people’s brains are at rest, disengaged from the outside world. It’s been proposed that activity in this network is crucial for our sense of self.

“Our data suggest that the profound disturbance of thought and experience, revealed by Cotard’s delusion, reflects a profound disturbance in the brain regions responsible for ‘core consciousness’ and our abiding sense of self,” the researchers concluded.

Unfortunately the study has a number of serious limitations beyond the fact that it is of course a single case study. As well as having a diagnosis of Cotard’s Delusion, the patient was also depressed and on an intense drug regimen, including sedative, antidepressant and antipsychotic medication. It’s unclear therefore whether his distinctive brain activity was due to Cotard’s, depression or his drugs, although the researchers counter that such an extreme reduction in brain metabolism is not normally seen in patients with depression or on those drugs.

Another issue is with the lack of detail on the scanning procedure. Perhaps this is due to the short article format (a “Letter to the Editor”), but it’s not clear for how long the patient and controls were scanned, nor what they were instructed to do in the scanner. For example, did they have their eyes open or closed? What did they think about?

But perhaps most problematic is the issue of how to interpret the findings. Does the patient have Cotard’s Delusion because of his abnormal brain activity, or does he have that unusual pattern of brain activity because of his deluded beliefs? Relevant here, but not mentioned by the researchers, are studies showing that trained meditators also show reduced activity in the default mode network. This provides a graphic illustration of the limits to a purely biological approach to mental disorder. It seems diminished activity in the default mode network can be associated both with feelings of being brain dead and with feelings of tranquil oneness with the world; it depends on who is doing the feeling. Understanding how this can be will likely require researchers to think outside of the brain.


Charland-Verville, V., Bruno, M., Bahri, M., Demertzi, A., Desseilles, M., Chatelle, C., Vanhaudenhuyse, A., Hustinx, R., Bernard, C., Tshibanda, L., Laureys, S., and Zeman, A. (2013). Brain dead yet mind alive: A positron emission tomography case study of brain metabolism in Cotard’s syndrome. Cortex. DOI: 10.1016/j.cortex.2013.03.003

Post written by Christian Jarrett (@psych_writer) for the BPS Research Digest.

Serious power failure threatens the entire field of neuroscience

Psychology has had a torrid time of late, with fraud scandals and question marks about the replicability of many of the discipline’s key findings. Today it is joined in the dock by its more biologically oriented sibling: Neuroscience. A team led by Katherine Button at the School of Experimental Psychology in Bristol, and including psychologist Brian Nosek, founder of the new Center for Open Science, make the case in a new paper that the majority of neuroscience studies involve woefully small sample sizes, rendering their results highly unreliable. “Low statistical power is an endemic problem in neuroscience,” they write.

At the heart of their case is a comprehensive analysis of 49 neuroscience meta-analyses published in 2011 (that’s all the meta-analyses published that year that contained the information required for their purposes). This took in 730 individual papers, including genetic studies, drug research and papers on brain abnormalities.

Meta-analyses collate all the findings in a given field as a way to provide the most accurate estimate possible about the size of any relevant effects. Button’s team compared these effect size estimates for neuroscience’s subfields against the average sample sizes used in those same areas of research. If the meta-analyses for a particular subfield suggested an effect – such as a brain abnormality associated with a mental illness – is real, but subtle, then this would indicate that suitable investigations in that field ought to involve large samples in order to be adequately powered. A larger effect size would require more modest samples.

Based on this, the researchers estimate that the median statistical power of a neuroscience study is 21 per cent. This means a typical study has only around a one in five chance of detecting a real effect it investigates; the rest of the time, true effects go undetected. More worrying still, when underpowered studies do uncover a significant result, the lack of power means the chances are increased that the finding is spurious. Thirdly, significant effect sizes uncovered by underpowered studies tend to be overestimates of the true effect size, even when the reported effect is in fact real. This is because, by their very nature, underpowered studies are only likely to turn up significant results in data where the effect size happens to be large.
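This last point – that the significant results of underpowered studies overestimate the truth – is easy to demonstrate with a toy simulation. The numbers below (a true effect of d = 0.3, 15 participants per group, a simple z-test with known variance) are illustrative assumptions of mine, not figures from the paper.

```python
import random
from statistics import NormalDist, mean

random.seed(1)
d_true, n, sims = 0.3, 15, 20000    # assumed true effect and per-group n
crit = NormalDist().inv_cdf(0.975)  # two-tailed 5 per cent threshold

significant_effects = []
hits = 0
for _ in range(sims):
    a = [random.gauss(0, 1) for _ in range(n)]       # control group
    b = [random.gauss(d_true, 1) for _ in range(n)]  # 'treated' group
    d_hat = mean(b) - mean(a)        # observed effect (population sd is 1)
    z = d_hat / (2 / n) ** 0.5       # z statistic for a difference in means
    if abs(z) > crit:
        hits += 1
        significant_effects.append(d_hat)

power = hits / sims
print(round(power, 2))                      # low: most true effects are missed
print(round(mean(significant_effects), 2))  # well above 0.3: the winner's curse
```

With these made-up numbers only around one study in eight reaches significance, and those that do report an average effect more than double the true one.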

It gets more worrying. The aforementioned issues are what you get when all else in the methodology is sound, bar the inadequate sample size. Trouble is, Button and her colleagues say underpowered studies often have other problems too. For instance, small studies are more vulnerable to the “file-drawer effect”, in which negative results tend to get swept under the carpet (simply because it’s easier to ignore a quick and easy study than a massive, expensive one). Underpowered studies are also more vulnerable to an issue known as “vibration of effects” whereby the results vary considerably with the particular choice of analysis. And yes, there is often a huge choice of analysis methods in neuroscience. A recent paper documented how 241 fMRI studies involved 223 unique analysis strategies.

Because of the relative paucity of brain imaging papers in their main analysis, Button’s team also turned their attention specifically to the brain imaging field. Based on findings from 461 studies published between 2006 and 2009, they estimate that the median statistical power in the sub-discipline of brain volume abnormality research is just 8 per cent.

Switching targets to the field of animal research (focusing on studies involving rats and mazes), they estimate most studies had a “severely” inadequate statistical power in the range of 18 to 31 per cent. This raises important ethical issues, Button’s team said, because it makes it highly likely that animals are being sacrificed with minimal chance of discovering true effects. It’s clearly a sensitive area, but one logical implication is that it would be more justifiable to conduct studies with larger samples of animals, because at least then there would be a more realistic chance of discovering the effects under investigation (a similar logic can also be applied to human studies).

The prevalence of inadequately powered studies in neuroscience is all the more disconcerting, Button and her colleagues conclude, because most of the low-hanging fruit in brain science has already been picked. Today, the discipline is largely on the search for more subtle effects, and for this mission, studies need to be as highly powered as possible. Yet sample sizes have stood still, while it has become easier than ever to run repeated, varied analyses on the same data until a seemingly positive result crops up. This leads to a “disquieting conclusion”, the researchers said – “a dramatic increase in the likelihood that statistically significant findings are spurious.” They end their paper with a number of suggestions for how to rehabilitate the field, including performing routine power calculations before conducting studies (to ensure they are suitably powered), disclosing methods and findings transparently, and working collaboratively to increase study power.

Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, and Munafò MR (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience DOI: 10.1038/nrn3475

Post written by Christian Jarrett (@psych_writer) for the BPS Research Digest.

Working memory training does not live up to the hype

According to CogMed, one of the larger providers of computerised working memory training, the benefits of such training are “comprehensive” and include “being able to stay focused, resist distractions, plan activities, complete tasks, and follow and contribute to complex discussions.” Similar claims are made by other providers such as Jungle Memory and Cognifit, which is endorsed by the neuroscientist Susan Greenfield.

Working memory describes our ability to hold relevant information in mind for use in mental tasks, while ignoring irrelevant information. If it were possible to improve our working memory capacity and discipline through training, it makes sense that this would have widespread benefits. But that’s a big if.

A new meta-analysis by Monica Melby-Lervåg and Charles Hulme, just published in the February issue of the respected APA journal Developmental Psychology, combined the results from 23 studies of working memory training completed up to 2011 (a PDF is freely available). To be included, a study had to compare outcomes for a working memory training group against outcomes in a control group. Most of the available studies are of healthy adults or children, with just a few involving children with developmental conditions such as ADHD.
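For readers unfamiliar with how a meta-analysis combines studies: each study contributes a standardised effect size, and more precise (typically larger) studies get more weight. Here is a minimal fixed-effect, inverse-variance sketch with made-up numbers – these are not the study's data, and the authors may have used a different pooling model:

```python
from math import sqrt

def pooled_effect(effects, variances):
    """Fixed-effect inverse-variance pooling: each study's effect size
    is weighted by the inverse of its sampling variance, so precise
    studies dominate the combined estimate."""
    weights = [1 / v for v in variances]
    est = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = sqrt(1 / sum(weights))  # standard error of the pooled estimate
    return est, se

# Invented Cohen's d values and variances for three hypothetical studies;
# the small, imprecise study reports the biggest effect (a common pattern)
effects = [0.40, 0.10, 0.05]
variances = [0.10, 0.02, 0.01]
est, se = pooled_effect(effects, variances)
print(est, se)
```

The pooled estimate lands much closer to the two precise studies (around 0.09) than to the small study's inflated 0.40 – which is why a meta-analysis can deflate impressive-looking results from individual small trials.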

The results were absolutely clear. Working memory training leads to short-term gains on working memory performance on tests that are the same as, or similar to, those used in the training. “However,” Melby-Lervåg and Hulme write, “there is no evidence that working memory training produces generalisable gains to the other skills that have been investigated (verbal ability, word decoding, arithmetic), even when assessments take place immediately after training.”

There was a modest, short-term benefit of the training on non-verbal intelligence but this disappeared when only considering the studies with a robust design (i.e. those that randomised participants across conditions and which enrolled control participants in some kind of activity). Similarly, there was a modest benefit of the training on a test of attentional control, but this disappeared at follow-up.

All of this suggests that working memory training isn’t increasing people’s working memory capacity in such a way that they benefit whenever they engage in any kind of task that leans on working memory. Rather, people who complete the training simply seem to have improved at the specific kinds of exercises used in the training, or possibly even just at computer tasks – effects which, anyway, wear off over time.

Overall, Melby-Lervåg and Hulme note that the studies that have looked at the benefits of working memory training have been poor in design. In particular, they tend not to bother enrolling the control group in any kind of intervention, which means any observed benefits of the working memory training could be related simply to the fun and expectations of being in a training programme, never mind the specifics of what that entails. Related to that, some dubious studies reported far-reaching benefits of the working memory training, without finding any improvements in working memory, thus supporting the notion that these benefits had to do with participant expectations and motivation.

A problem with all meta-analyses, this one included, is that they tend to rely on published studies, which means any unpublished results stuck in a filing cabinet get neglected. But of course, it’s usually negative results that get left in the drawer, so if anything, the current meta-analysis presents an overly rosy view of the benefits of working memory training.

Melby-Lervåg and Hulme’s ultimate conclusion was stark: “there is no evidence that these programmes are suitable as methods of treatment for children with developmental cognitive disorders or as ways of effecting general improvements in adults’ or children’s cognitive skills or scholastic achievements.”


Melby-Lervåg M, and Hulme C (2013). Is working memory training effective? A meta-analytic review. Developmental Psychology, 49 (2), 270-91 PMID: 22612437 Free, full PDF of the study.

This meta-analysis only took in studies published up to 2011. If you know of any quality studies into the effects of working memory training published since that time, please do share the relevant links via comments.

–Further reading–
Brain training games don’t work.
Brain training for babies actually works (short term, at least)

Post written by Christian Jarrett (@psych_writer) for the BPS Research Digest.

Emotion research gets real – Was this person just told a joke or told they have great hair?

How accurately could you tell from a person’s display of behaviour and emotions what just happened to them?  Dhanya Pillai and her colleagues call this “retrodictive mindreading” and they say it’s a more realistic example of how we perceive emotions in everyday life, as compared with the approach taken by traditional psychological research, in which volunteers name the emotions displayed in static photos of people’s faces.

In Pillai’s study, the task of a group of 35 male and female participants wasn’t to look at pictures and name the facial expression. Instead, the participants watched clips of people reacting to a real-life social scenario and they had to deduce what scenario had led to that emotional display.

Half the challenge Pillai and her colleagues faced was to create the stimuli for this research. They recruited 40 men and women who thought they were going to be doing the usual thing and categorising emotional facial expressions. In fact, it was their own responses that were to become the stimuli for the study proper.

While these volunteers were sitting down ready for the “study” to start, one of four scenarios unfolded. The female researcher either told them a joke (“why did the woman wear a helmet at the dinner table? She was on a crash diet”); told them a story about a series of misfortunes she’d encountered on the way to work; paid them a compliment (e.g. “you’ve got really great hair, what shampoo do you use?”); or made them wait 5 minutes while she had a drink and did some texting. In each case the volunteers’ emotional responses were recorded on film and formed the stimuli for the real experiment.

The researchers ended up with 40 silent clips, lasting 3 to 9 seconds each, comprising ten clips for each of the four scenarios. The real participants for the study proper were first shown footage of the researcher in the four scenarios and how these were categorised as joke, story, compliment or waiting. Then these observer participants watched the 40 clips of the earlier volunteers, and their task in each case was to say which scenario the person in the video was responding to.

The observing participants’ performance was far from perfect – they averaged 60 per cent accuracy – but it was far better than the 25 per cent level you’d expect if they were merely guessing. They were by far most skilled at recognising when a person was responding to the waiting scenario (90 per cent accuracy); for the other three scenarios, accuracy was roughly even at around 50 per cent. They achieved this success despite the huge variety in the way different volunteers responded to the same scenarios. “From observing just a few seconds of a person’s reaction, it appears we can gauge what kind of event might have happened to that individual with considerable success,” the researchers said.
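That comparison with chance can be checked with a simple binomial calculation – a sketch, not the authors' own analysis. With four scenarios, a pure guesser gets each clip right with probability 0.25, so scoring 24 of the 40 clips (60 per cent) by luck alone is vanishingly unlikely:

```python
from math import comb

def binom_tail(k, n, p):
    """Probability of k or more successes in n independent trials,
    each with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 40 clips, four equally likely scenarios: chance accuracy = 0.25.
# An observed accuracy of 60 per cent corresponds to 24 correct clips.
p_value = binom_tail(24, 40, 0.25)
print(f"{p_value:.1e}")  # far below any conventional significance threshold
```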

A surprise detail came from the recordings of the observing participants’ eye movements. They focused more on the mouth region rather than the eyes. Based on past research (much of it using static facial displays), Pillai and her colleagues thought that better accuracy would go hand-in-hand with more attention paid to the eye region of the targets’ faces. In fact, for three of the scenarios (all except the joke), the opposite was true. This may be because focusing on the eye region is more beneficial when naming specific mental states, as opposed to the “retrodictive mindreading” challenge involved in the current study.

In contrast to much of the existing psychology literature, Pillai and her team concluded that theirs was an important step towards devising tasks “that closely approximate how we understand other people’s behaviour in real life situations.”


Pillai, D., Sheppard, E., and Mitchell, P. (2012). Can People Guess What Happened to Others from Their Reactions? PLoS ONE, 7 (11) DOI: 10.1371/journal.pone.0049859

Note: the picture above is for illustrative purposes only and was not used in the study.

Post written by Christian Jarrett (@psych_writer) for the BPS Research Digest.

A new test for finding out what people really think of their personality

A problem with your standard personality questionnaire is that most people like to make a good impression. This is especially the case when questionnaires are used for job candidates. One way around this is to use so-called implicit measures of personality, designed to probe subconscious beliefs. The famous Rorschach ink-blot test is one example, but many psychologists criticise it for its unreliability. A more modern example is a version of the implicit association test, in which people are timed using the same response key for self-referential words and various personality traits. If they associate the trait with themselves, they should be quicker to answer. Now a team led by Florin Sava have proposed a brand-new test based on what’s called the “semantic misattribution procedure”.

Nearly a hundred participants watched as personality traits were flashed one at a time for a fifth of a second on a computer screen. After each trait (e.g. “anxious”), a neutral-looking Chinese pictograph was flashed on-screen. The participants didn’t know what these Chinese symbols meant. Their task was to ignore the flashed personality traits and to say whether they’d like each Chinese symbol to be printed on a personalised t-shirt for them or not, to reflect their personality.

This method is based on past research showing that we tend to automatically misattribute the meaning of briefly presented words to subsequent neutral stimuli. So, in the example above, participants would be expected to attribute, at a subconscious level, the meaning of “anxious” to the Chinese symbol. When assessing the suitability of the symbol for their t-shirt, it feels subjectively as if they are merely guessing, or making their judgment based on its visual properties. But in fact their choice of whether the symbol is suitable will be influenced by the anxious meaning they’ve attributed to it, and, crucially, whether or not they have an implicit belief that they are anxious.
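To illustrate how such a procedure might be scored – a hypothetical sketch, since the paper's exact scoring method isn't described here – one could take the proportion of pictographs a participant endorses after primes for each trait as their implicit score for that trait (the function name and data below are invented):

```python
from collections import defaultdict

def implicit_trait_scores(trials):
    """trials: list of (prime_trait, endorsed) pairs, where endorsed is True
    if the participant wanted that pictograph on their t-shirt.
    Returns the proportion of endorsed pictographs per priming trait."""
    counts = defaultdict(lambda: [0, 0])  # trait -> [endorsed, shown]
    for trait, endorsed in trials:
        counts[trait][0] += int(endorsed)
        counts[trait][1] += 1
    return {trait: e / n for trait, (e, n) in counts.items()}

# Invented responses: endorsing symbols primed with "anxious" more often
# than those primed with "calm" would suggest an implicit anxious self-view
trials = [("anxious", True), ("anxious", True), ("anxious", False),
          ("calm", False), ("calm", True), ("calm", False)]
print(implicit_trait_scores(trials))
```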

In this initial study, and two more involving nearly 300 participants, Sava and his colleagues showed that participants’ scores on this test for conscientiousness, neuroticism and extraversion correlated with explicit measures of the same traits. The new implicit test also did a better job than explicit measures alone of predicting relevant behaviours, such as church attendance, perseverance on a lab task, and punctuality. The implicit scores for extraversion showed good consistency over 6 months. Finally, the new implicit test showed fewer signs of being influenced by social desirability concerns, as compared with traditional explicit measures. Next, the researchers plan to test whether their new implicit measure is immune to attempts at deliberate fakery.

“The present study suggests that the Semantic Misattribution Procedure is an effective alternative for measuring implicit personality self-concept,” the researchers said.


Sava, F., Maricuţoiu, L., Rusu, S., Macsinga, I., Vîrgă, D., Cheng, C., and Payne, B. (2012). An Inkblot for the Implicit Assessment of Personality: The Semantic Misattribution Procedure. European Journal of Personality, 26 (6), 613-628 DOI: 10.1002/per.1861

–Further reading–
A personality test that can’t be faked

Post written by Christian Jarrett (@psych_writer) for the BPS Research Digest.