One of the greatest temptations for psychologists is to report “marginally significant” research results. When statistical tests spit out values that are tantalisingly close to reaching significance, many just can’t help themselves.
Now a study in Psychological Science has shown just how widespread this practice is. Anton Olsson-Collentine and colleagues from Tilburg University analysed three decades of psychology papers and found that a whopping 40 per cent of p-values between 0.05 and 0.1 – i.e. those not significant according to conventional thresholds – were described by experimenters as “marginally significant”.
Psychologists use p-values to gauge whether a result is statistically significant. The p-value is the probability of obtaining the observed results (or more extreme ones) if the “null hypothesis” were true. (The null hypothesis is that there is no effect, or no difference between the groups being studied.) At a certain threshold – usually when p is less than 0.05 – psychologists reject the null hypothesis and infer that their result probably represents a true effect.
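One way to make this definition concrete is a permutation test, which estimates a p-value directly as the proportion of random relabellings of the data that produce a difference at least as extreme as the one observed. This is a minimal illustrative sketch (not the procedure used in any study discussed here), with hypothetical data:

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=42):
    """Estimate a two-sided p-value by permutation: the proportion of
    random relabellings whose absolute mean difference is at least as
    extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign scores to the two groups
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations

# Hypothetical scores for two groups; a small p would count as
# "significant" under the conventional 0.05 threshold
a = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2]
b = [5.4, 5.6, 5.2, 5.5, 5.3, 5.7]
p = permutation_p_value(a, b)
```

If the two groups really come from the same distribution, extreme relabellings are common and the estimated p-value is large; the 0.05 threshold is simply a convention for how rare the observed difference must be before the null hypothesis is rejected.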
But sometimes researchers treat that threshold rather flexibly. If a p-value is a little greater than 0.05, they often report the result as “marginally significant”, implying that there could still be some kind of real effect going on.
To determine how often p-values are reported this way, Olsson-Collentine’s team examined the ways that values falling between 0.05 and 0.1 were described in journals published by the American Psychological Association from 1985 to 2016.
The team wrote code to work through 44,200 papers and extract 42,504 p-values falling between 0.05 and 0.1. They then searched the text immediately before and after each p-value for words beginning with “margin” or “approach”, which could indicate that the results were being reported as marginally significant (or as “approaching” significance).
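The study's actual extraction code is not reproduced here, but the approach it describes can be sketched with simple regular expressions: find reported p-values in the 0.05–0.1 range, then check a window of surrounding text for words beginning with “margin” or “approach”. The patterns and window size below are illustrative assumptions, not the authors' implementation:

```python
import re

# Simplified pattern for p-value reports like "p = .062" or "p < 0.09"
P_VALUE = re.compile(r"\bp\s*[=<>]\s*(0?\.\d+)", re.IGNORECASE)
# Words beginning with "margin" or "approach", e.g. "marginally", "approached"
MARGINAL = re.compile(r"\b(margin|approach)\w*", re.IGNORECASE)

def find_marginal_reports(text, window=80):
    """Return p-values between 0.05 and 0.1 whose surrounding text
    contains a word beginning with 'margin' or 'approach'."""
    hits = []
    for m in P_VALUE.finditer(text):
        p = float(m.group(1))
        if 0.05 < p < 0.10:
            context = text[max(0, m.start() - window): m.end() + window]
            if MARGINAL.search(context):
                hits.append(p)
    return hits

sentence = "The effect was marginally significant, p = .062."
find_marginal_reports(sentence)  # → [0.062]
```

A real pipeline would also need to handle the varied ways p-values appear in published text (e.g. “p = n.s.”, ranges, or typeset symbols), which is part of why heuristics like this can undercount.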
The researchers found that almost 40 per cent of the non-significant p-values they identified were reported as marginally significant. Of the nine main psychology disciplines, the practice was most common in organisational psychology, where 45 per cent of values were deemed marginally significant, and least common in clinical psychology, where that number dropped to 30 per cent.
This kind of reporting is a problem because it’s likely to contribute to false positives (the misattribution of null findings as true effects) and less reproducible research, say Olsson-Collentine and his colleagues. By calling a result “sort-of-significant”, psychologists are essentially changing the rules for what counts as significant after the fact, and therefore highlighting results that may be less likely to represent “true” effects.
Nevertheless, there was some good news: across the 30-year period covered by the new analysis, the “marginally significant” habit seems to have become less common in most of the nine psychological sub-disciplines. “The downward trend in psychology overall may reflect increasing awareness among researchers that p values in the range of .05 to .10 represent weak evidence against the null,” write the researchers. It may also be the result of editors becoming stricter, they add.
Crucially, the new study only accounted for p-values described as “marginal” or “approaching” significance. But researchers can be much more inventive in the language they use. In 2013, statistician Matthew Hankins compiled a list of hundreds of other phrases psychological scientists have used to describe low-but-not-significant p-values in the literature, from “flirting with conventional levels of significance” to “very closely brushed the limit of statistical significance”. By missing some of the more creative ways that scientists try to squeeze positive results out of their work, it’s possible this new study is underestimating the extent of the problem.