Is something rotten in the state of social psychology? Part Two: digging through the past

By Alex Fradera

A new paper in the Journal of Personality and Social Psychology has taken a hard look at psychology’s crisis of replication and research quality and we’re covering its findings in two parts.

In Part One, published yesterday, we reported the views of active research psychologists on the state of their field, as surveyed by Matt Motyl and his colleagues at the University of Illinois at Chicago. Researchers reported a cautious optimism: research practices hadn’t been as bad as feared, and are in any case improving.

But is their optimism warranted? After all, several high-profile replication projects have found that, more often than not, re-running previously successful studies produces only null results. But defenders of the state of psychology argue that replications fail for many reasons, including defects in the reproduction and differences in samples, so the implications aren’t settled.

To get closer to the truth, Motyl’s team complemented their survey findings with a forensic analysis of published data, uncovering results that seem to bolster their optimistic position. In Part Two of our coverage, we look at these findings and why they’re already proving controversial.

Motyl and his colleagues used a relatively new type of analysis to assess the quality and honesty of the data found in over 500 previously published papers in social psychology. Their approach is technical, involving weirdly-named statistics conducted upon even more statistics, so it helps to use an analogy: Just as a vegetable garden produces a variety of tomatoes, some bigger than others, some misshapen, some puny and poor for eating, an honestly-conducted body of research should bear a range of fruit in the same way. True experimental effects shouldn’t always come out exactly the same: they should vary in size from experiment to experiment, including instances when the effect is too small to be statistically significant.

These are the sorts of things you can evaluate in a body of research – in this case with the Test for Insufficient Variance, which Motyl’s study used alongside six other indices. When there were too many irregularities in the data, or bizarre regularity like identikit supermarket tomatoes, this suggested to Motyl and his colleagues that questionable research practices may have been used to make the weak results swell up to reach the desired appearance.
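
To make the tomato logic concrete, here is a minimal Python sketch of the intuition behind a variance check of this kind. It is not the procedure Motyl’s team ran: the true effect size, the number of simulated studies and the 1.96 cut-off are all illustrative assumptions, and the function names are invented for this example. It simulates a batch of honestly reported studies, then a batch where only the "significant" ones were kept, and compares how much the results vary in each.

```python
import random
import statistics


def simulate_z_scores(n_studies, true_effect_z, rng):
    """Simulate honest studies: each z-score is the true effect
    (in z units) plus standard-normal sampling error."""
    return [rng.gauss(true_effect_z, 1.0) for _ in range(n_studies)]


def keep_significant(z_scores, threshold=1.96):
    """Mimic selective reporting: keep only results above the
    conventional significance cut-off (an assumed threshold)."""
    return [z for z in z_scores if z > threshold]


# All numbers below are assumptions chosen for illustration.
rng = random.Random(2017)          # fixed seed so the run is repeatable
honest = simulate_z_scores(5000, 2.0, rng)
selected = keep_significant(honest)

honest_var = statistics.variance(honest)
selected_var = statistics.variance(selected)

print(f"variance of all studies:        {honest_var:.2f}")
print(f"variance of 'significant' only: {selected_var:.2f}")
```

If the simulation behaves as expected, the honest batch should show a variance of roughly 1, while the hand-picked batch should show a markedly smaller one: the identikit supermarket tomatoes.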

Crucially, however, the study found that more often than not, the indices showed low levels of anomalies, suggesting research practices are more likely to be acceptable than questionable. This was the case for studies from 2003-4, before the crisis was fully acknowledged, and the researchers found an even better picture for more recent (2013-14) papers. The fruits of the research may have been tampered with from time to time, but there was no case that the entire enterprise was “rotten to the core”.

This optimistic conclusion conflicts with similar analyses performed in the past, but this might be explained by the different approaches of collecting the data – of gathering the fruit, if you will. Past approaches automatically scraped articles for every instance of a statistic, such as every listed p-value. But this is like a bulldozer ripping out a corner of a garden and measuring everything that looks anything like a tomato, including stones and severed gnome-heads. To take just one example, articles will often list p-values for manipulation checks: confirmations that an experimental condition was set up correctly (did participants agree that the violent kung-fu clip was more violent than the video of grass growing?). But these aren’t tests to determine new scientific knowledge, rather – turning to another analogy – the equivalent of a chemist checking their equipment works before running an experiment. So Motyl’s team took a more nuanced approach, reading through every article and picking out by hand only the relevant statistics.

However, all is not rosy in the garden. At their Datacolada blog, “state of science” researchers Joseph Simmons, Leif Nelson and Uri Simonsohn have already responded to the new analysis, and they’re sceptical. Simmons and co first note the daunting scale of the new enterprise: to correctly identify 1800 relevant test statistics from 500 papers. In an online response, Motyl’s team agreed that yes, it was time consuming, and yes, it required a lot of hands: “there are reasons this paper has many authors: It really took a village,” they said.

But Datacolada sampled some of the statistics that Motyl’s team used in their assessments and they argue that far too many of them were inappropriate, including data from manipulation checks that Motyl’s group had themselves categorised as statistica non grata. To the Datacolada team, this renders the whole enterprise suspect: “We are in no position to say whether their conclusions are right or wrong. But neither are they.” In their response, Motyl’s team make some concessions, but they argue that some of the statistic selection comes down to difference of opinion, and defend both their overall procedure and the number of coding errors they expect their study will contain. So….


So doing high-quality science isn’t straightforward. Neither is doing high-quality science on the quality of science, nor is gathering everything together to form high-quality conclusions. But if we care about the validity of the more sexy findings in psychology – the amazing powers of power poses to make you physically more confident, how you can hack your happiness simply by changing your face, and how even subtle social signals about age, race or gender can transform how we perform at tasks – we need to care about psychological science itself, how it’s working and how it isn’t. (By the way, those findings I just listed? They’ve all struggled to replicate.)

There are surely ways to improve the methods of this new study – perhaps not coincidentally, Datacolada’s Leif Nelson is running a similar project – but even if the new assessment does include some irrelevant statistics, it will likely be an advance on past analyses that included every irrelevant statistic.

So … the new insights have budged my position on the state of science a little: I’m still worried, but I can see a little more light among the dark. Motyl’s group make the case that social psychology isn’t ruined, that the garden isn’t totally contaminated. I hope so. But it’s not hope on its own that will move our field forward, but research, debate, and making sense of the evidence. After all, psychology is too good to give up on.

The State of Social and Personality Science: Rotten to the Core, Not so Bad, Getting Better, or Getting Worse?

Main image: An illustration from ‘The Family Friend’ published by S.W. Partridge & Co. (London, 1874). Lifeboat men rowing towards a wrecked ship in high seas. (via under licence)


Alex Fradera (@alexfradera) is Contributing Writer at BPS Research Digest

14 thoughts on “Is something rotten in the state of social psychology? Part Two: digging through the past”

  1. What many seem to forget when doing these replication studies is that quite a lot of factors that have an impact were not accounted for. For instance, sampling error could have a significant impact. Take a Monte Carlo run of data with the population relationship arbitrarily set to r=.4. Now take 50 random samples, and measure the relationship between your two variables. Without any other manipulation, the obtained r values could range from r=.7 down to r=-.1. And this variation is strictly due to sampling error, where the random variation of the population estimates is due to sample size. Other factors also come into play, such as range and reliability variation, which have at least the same or greater impact on study effect size estimates.

      1. Thanks, I’ll have a look at that, especially the dataset they mention. But my point remains. Given sampling, reliability and range variation, any replication study will randomly vary from the effect size found in the original study. When both are corrected for these factors, the results will be very close.

  2. Thanks to Alex Fradera for such a delightfully poetic elucidation of the blights affecting tender cultivars and culture 😊

    1. What I meant is that this is so beautifully written, reading about research papers becomes as enjoyable as a stroll through a blooming garden 🙂

      1. That’s very kind to say so. Even more than most pieces, I wanted to make this as pleasing as possible, because the subject matter is important but can easily be lost in the technicalities. Thank you.
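
The Monte Carlo thought experiment in the first comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the commenter gives the population correlation (r=.4) and the number of samples (50) but not the size of each sample, so the 30 participants per sample used here is a guess, as are the seed and the helper names `pearson_r` and `draw_sample`.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def draw_sample(n, rho, rng):
    """Draw n (x, y) pairs from a bivariate normal population
    whose true correlation is rho."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # Mixing x with independent noise gives corr(x, y) = rho.
        y = rho * x + math.sqrt(1.0 - rho ** 2) * noise
        xs.append(x)
        ys.append(y)
    return xs, ys


POPULATION_R = 0.4   # from the comment
N_SAMPLES = 50       # from the comment
SAMPLE_SIZE = 30     # assumption: the comment does not give a sample size

rng = random.Random(42)  # fixed seed so the run is repeatable
sample_rs = [
    pearson_r(*draw_sample(SAMPLE_SIZE, POPULATION_R, rng))
    for _ in range(N_SAMPLES)
]

print(f"smallest r: {min(sample_rs):.2f}")
print(f"largest r:  {max(sample_rs):.2f}")
print(f"mean r:     {sum(sample_rs) / N_SAMPLES:.2f}")
```

Nothing differs between the 50 samples except chance, so any spread between the smallest and largest r comes from sampling error alone. Raising `SAMPLE_SIZE` should narrow that spread.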
