By Jesse Singal
Randomised experiments (also known as A/B testing) are an absolutely critical tool for evaluating everything from online marketing campaigns to new pharmaceutical drugs to school curricula. Rather than making decisions based on ideology, intuition or educated guesswork, you randomise people to one of two groups, expose one group to intervention A (one version of a social media headline, a new drug, or whatever, depending on the context) and the other group to intervention B (a different version of the headline, a different drug, etc), and compare outcomes for the two groups.
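As a rough illustration of the procedure just described, here is a minimal Python sketch. The function names, the participant IDs and the toy outcome values are all hypothetical and chosen only for illustration; none of them come from Meyer's paper or from any real study.

```python
import random
from statistics import mean


def assign_groups(participants, seed=0):
    """Randomly split participants into two groups of (near-)equal size."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(participants)  # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def compare_outcomes(outcomes_a, outcomes_b):
    """Return the difference in mean outcome between the groups (B minus A)."""
    return mean(outcomes_b) - mean(outcomes_a)


# Hypothetical participant IDs 0..99, randomised to intervention A or B.
group_a, group_b = assign_groups(range(100))

# Toy outcomes, invented for illustration (1 = clicked the headline, 0 = did not).
toy_outcomes_a = [1, 0, 1, 1]
toy_outcomes_b = [0, 0, 1, 0]
difference = compare_outcomes(toy_outcomes_a, toy_outcomes_b)
```

A real analysis would also ask whether a difference of that size could plausibly be due to chance, which this sketch leaves out.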
To anyone who believes in evidence-based decision making, medicine and policy, randomised tests make sense. But as a team led by Michelle N. Meyer at the Center for Translational Bioethics and Health Care Policy at the Geisinger Health System in Pennsylvania write in PNAS, for some reason A/B testing sometimes elicits moral outrage. As an example, they point to the anger that ensued when Pearson Education “randomized math and computer science students at different schools to receive one of three versions of its instructional software: two versions displayed different encouraging messages as students attempted to solve problems, while a third displayed no messages.” The goal had been to test objectively whether the encouraging messages would, well, encourage students to do more problems, yet for this, the company received much criticism, including accusations that it had treated students like guinea pigs and failed to obtain their consent.
Viewed from a certain angle, this reaction is strange – prior to the A/B testing, Pearson’s default policy had been a lack of encouraging messages, which didn’t appear to generate any complaints. People didn’t have a problem with a lack of encouraging messages, or with encouraging messages – they only had a problem with comparing the two conditions. Which doesn’t quite make sense. (As Meyer’s team point out, there are situations in which A/B testing could be genuinely unethical. Giving one group an already validated cancer treatment but withholding it from another, for example, is clearly morally problematic. But Meyer and her colleagues focus entirely on “unobjectionable policies or treatments.”)
At root, Meyer et al’s paper had two goals: To determine how widespread this phenomenon is (after all, sometimes there’s a perception that many people are mad about something, but it’s really just a small group of loud people online who have strong opinions), and to poke and prod people’s reasons for experiencing discomfort at the idea of A/B testing. The team used online samples to probe these issues, conducting “16 studies on 5,873 participants from three populations spanning nine domains.”
As it turns out, it isn’t just a small group of online complainers who are uncomfortable: Based on the new findings, it appears that humans have a more general bias against this sort of A/B testing, for reasons that are hard to pin down.
Take Meyer and her colleagues’ first study. They presented online participants with a vignette in which a hospital director, seeking to lower the rate of death and illness caused by a procedure being performed improperly, thinks it might be helpful to present doctors with a safety checklist. Participants then read one of four versions of what happened next and they had to rate the appropriateness of the course of action taken:
Badge (A): The director decides that all doctors who perform this procedure will have the standard safety precautions printed on the back of their hospital ID badges.
Poster (B): The director decides that all rooms where this procedure is done will have a poster displaying the standard safety precautions.
A/B short: The director decides to run an experiment by randomly assigning patients to be treated by a doctor wearing the badge or in a room with the poster.
A/B learn: Same as A/B short, with an added sentence noting that after a year, the director will have all patients treated in whichever way turns out to have the highest survival rate.
As the researchers predicted, there was more opposition to both forms of the A/B testing than to the unilateral introduction of either safety policy. This finding was robust to multiple versions of the vignette and held up whether the researchers used participants recruited via Pollfish or Amazon’s Mechanical Turk (MTurk).
The same phenomenon also popped up in a wide variety of other (hypothetical) situations, from the design of self-driving cars to interventions to boost teacher wellbeing. And the authors write that “the effect is just as strong among those with higher educational attainment and science literacy and those with STEM degrees, and among professionals in the relevant domain.” So it’s not as though this bias can be chalked up to a lack of knowledge about the scientific process, or some sort of lack of critical-thinking skills.
What does explain it, then? The researchers believe that a combination of factors is at work, among them “a belief that consent is required to impose a policy on half of a population but not on the entire population; an aversion to controlled but not to uncontrolled experiments; and the proxy illusion of knowledge,” the last of which the researchers define as the belief that “randomized evaluations are unnecessary because experts already do or should know ‘what works.’”
To many of the sorts of people who rely on A/B testing, of course, this sort of reasoning doesn’t pass muster (why would it be okay to impose a policy on the full population but not on half of it?). We clearly need more research to better understand the public’s concerns and how to respond to them, given how important A/B testing is in so many different circumstances (and that it is only going to become more common as organisations become more science- and data-focused). For now, though, it’s an important first step to have established that this bias generalises across different populations and isn’t driven by any one simple factor.
Post written by Jesse Singal (@JesseSingal) for the BPS Research Digest. Jesse is a contributing writer at BPS Research Digest and New York Magazine, and he publishes his own newsletter featuring behavioral-science-talk. He is also working on a book about why shoddy behavioral-science claims sometimes go viral for Farrar, Straus and Giroux.