People Have A Hard-To-Explain Bias Against Experimental Testing of Policies And Interventions, Preferring Just To See Them Implemented


By Jesse Singal

Randomised experiments (also known as A/B testing) are an absolutely critical tool for evaluating everything from online marketing campaigns to new pharmaceutical drugs to school curricula. Rather than making decisions based on ideology, intuition or educated guess-work, you randomise people to one of two groups and expose one group to intervention A (one version of a social media headline, a new drug, or whatever, depending on the context), one group to intervention B (a different version of the headline, a different drug etc), and compare outcomes for the two groups.
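For readers who like to see the mechanics, the procedure just described can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names (assign_groups, difference_in_means) and the outcome numbers are made up for the example, and a real trial would add a proper statistical test on top of the raw comparison.

```python
import random
from statistics import mean


def assign_groups(participants, seed=0):
    """Randomly split participants into two groups, A and B."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(participants)  # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def difference_in_means(outcomes_a, outcomes_b):
    """Average outcome under B minus average outcome under A."""
    return mean(outcomes_b) - mean(outcomes_a)


# Hypothetical participant IDs, purely for illustration.
participants = list(range(100))
group_a, group_b = assign_groups(participants)

# Made-up outcomes: 1 = the person clicked the headline, 0 = they did not.
outcomes_a = [1, 0, 0, 1, 0]
outcomes_b = [1, 1, 0, 1, 0]
effect = difference_in_means(outcomes_a, outcomes_b)
print(f"Estimated effect of B relative to A: {effect:+.2f}")
```

The key design choice is that chance alone decides who lands in which group, so any systematic difference in outcomes can be attributed to the intervention rather than to who chose or was chosen to receive it.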

To anyone who believes in evidence-based decision making, medicine and policy, randomised tests make sense. But as a team led by Michelle N. Meyer at the Center for Translational Bioethics and Health Care Policy at the Geisinger Health System in Pennsylvania write in PNAS, for some reason A/B testing sometimes elicits moral outrage. As an example, they point to the anger that ensued when Pearson Education “randomized math and computer science students at different schools to receive one of three versions of its instructional software: two versions displayed different encouraging messages as students attempted to solve problems, while a third displayed no messages.” The goal had been to test objectively whether the encouraging messages would, well, encourage students to do more problems, yet for this, the company received much criticism, including accusations that it had treated students like guinea pigs and failed to obtain their consent.

Viewed from a certain angle, this reaction is strange – prior to the A/B testing, Pearson’s default policy had been a lack of encouraging messages, which didn’t appear to generate any complaints. People didn’t have a problem with a lack of encouraging messages, or with encouraging messages – they only had a problem with comparing the two conditions. Which doesn’t quite make sense. (As Meyer’s team point out, there are situations in which A/B testing could be genuinely unethical. Giving one group an already validated cancer treatment but withholding it from another, for example, is clearly morally problematic. But Meyer and her colleagues focus entirely on “unobjectionable policies or treatments.”)

At root, Meyer et al’s paper had two goals: To determine how widespread this phenomenon is (after all, sometimes there’s a perception that many people are mad about something, but it’s really just a small group of loud people online who have strong opinions), and to poke and prod people’s reasons for experiencing discomfort at the idea of A/B testing. The team used online samples to probe these issues, conducting “16 studies on 5,873 participants from three populations spanning nine domains.”

As it turns out, it isn’t just a small group of online complainers who are uncomfortable: Based on the new findings, it appears that humans have a more general bias against this sort of A/B testing, for reasons that are hard to pin down.

Take Meyer and her colleagues’ first study. They presented online participants with a vignette in which a hospital director, seeking to lower the rate of death and illness caused by a procedure being performed improperly, thinks it might be helpful to present doctors with a safety checklist. Participants then read one of four versions of what happened next and rated the appropriateness of the course of action taken:

Badge (A): The director decides that all doctors who perform this procedure will have the standard safety precautions printed on the back of their hospital ID badges.

Poster (B): The director decides that all rooms where this procedure is done will have a poster displaying the standard safety precautions.

A/B short: The director decides to run an experiment by randomly assigning patients to be treated by a doctor wearing the badge or in a room with the poster.

A/B learn: Same as A/B short, with an added sentence noting that after a year, the director will have all patients treated in whichever way turns out to have the highest survival rate.

As the researchers predicted, there was more opposition to both forms of the A/B testing than to the unilateral introduction of either safety policy. This finding was robust to multiple versions of the vignette and held up whether the researchers used participants recruited via Pollfish or Amazon’s Mechanical Turk (MTurk).

The same phenomenon also popped up in a wide variety of other (hypothetical) situations, from the design of self-driving cars to interventions to boost teacher wellbeing. And the authors write that “the effect is just as strong among those with higher educational attainment and science literacy and those with STEM degrees, and among professionals in the relevant domain.” So it’s not as though this bias can be chalked up to a lack of knowledge about the scientific process, or some sort of lack of critical-thinking skills.

What does explain it, then? The researchers believe that a combination of factors is at work, among them “a belief that consent is required to impose a policy on half of a population but not on the entire population; an aversion to controlled but not to uncontrolled experiments; and the proxy illusion of knowledge,” the last of which the researchers define as the belief that “randomized evaluations are unnecessary because experts already do or should know ‘what works.’”

To many of the sorts of people who rely on A/B testing, of course, this sort of reasoning doesn’t pass muster (why would it be okay to impose a policy on the full population but not on half of it?). We clearly need more research to better understand the public’s concerns and how to respond to them, given how important A/B testing is in so many different circumstances (and that it is only going to become more common as organisations become more science- and data-focused). For now, though, it’s an important first step to have established that this bias generalises to various different populations and isn’t driven by any one simple factor.

Objecting to experiments that compare two unobjectionable policies or treatments

Post written by Jesse Singal (@JesseSingal) for the BPS Research Digest. Jesse is a contributing writer at BPS Research Digest and New York Magazine, and he publishes his own newsletter featuring behavioral-science-talk. He is also working on a book about why shoddy behavioral-science claims sometimes go viral for Farrar, Straus and Giroux.

6 thoughts on “People Have A Hard-To-Explain Bias Against Experimental Testing of Policies And Interventions, Preferring Just To See Them Implemented”

  1. That is problematic. Well-functioning democracy requires policies to be testable. It’s a central aspect of democracy, in fact: to try out different policies and systems for different populations in the expectation that whichever works best will be used by other populations. This is the reason for having some level of decentralization, rather than a totalitarian government that simply enforces a single way of doing things on the entire population.

  2. I did in fact make an argument for decentralisation in 2010 at uni! Since then I have researched the history of psychology and made some horrifying discoveries. Psychology in the right hands has the potential to answer and provide many solutions to the problems society faces today! However, in the wrong hands it can create the baffling biases that we’re presently reviewing.

  3. Is it not the case that, in the examples given, certain groups are arbitrarily assigned to conditions of greater or lesser advantage or risk than other groups? To learn less by less encouragement, to not “have the highest survival rate” because of medical error?

    These are only “unobjectionable policies or treatments” if the observer in question MUST assume the outcome to be quite uncertain, and that, seen from the beginning, there should be no clear advantage to being in one group or the other. And indeed it would not seem reasonable to a layman that, for example, encouragement would not be encouraging, or that the best method of having doctors follow life-saving procedures would not be known. This perception of unfair treatment would make the arbitrariness of random assignment in these scenarios galling.

    Despite their often perfect ignorance, humans have an inherent tendency to form moral judgments, and in these experiments they are no less than compelled to. Though it can be set up that such judgments are unreasonable, I don’t think the existence of judgments about fairness is surprising when significant matters are at stake. If there is a deciding force that potentially controls our fate, we will want to hold it accountable to ensure that we are given as much opportunity as anyone else. We believe that authority granted under the social contract, even if it is just a hand that willingly rolls the dice, has an obligation to us to treat us well, and if it cannot treat us well, to at least treat us fairly compared to others (thus the relative acceptance of scenarios which are applied to the entire population, rather than only one segment of it).

    What is framed as a “controlled” experiment by the researcher may actually be seen as an “uncontrolled” risk for the subjects in the scenario and the observer who empathizes with them. That is, the researcher could have control over their fate, but does not assume responsibility for it.

    Thus comes into play the idea that experts should bear responsibility for the things they do not know, if they nevertheless “should” know them. In determination of responsibility or acceptance of risk, the law readily relies on the principle of what a reasonable person ought to or could be expected to know. Given that not only knowledge but meta-knowledge (what is or is not known about the extent of knowledge) is ever further outside of the non-expert’s responsibility, we have no choice but to make assumptions about it. Our prejudices on moral matters may or may not be well-informed, but they are not optional.

    The way that knowledge which informs ethics will vary over time is also difficult for humans to account for. I think we all naturally resist the notion that ignorance should be a legitimate excuse for unethical decisions, be they in the present or in the past. It would not be acceptable today if we randomly assigned some cars to have seat belts and others not. Because we cannot ignore our current knowledge, we have a hard time accepting that there must have been a time when seat belts did not make sense to the experts who should have been taking care of us.

    While we fail in projecting our state of knowledge backwards or forwards, at the same time ethics are assumed to be immutable and constantly applicable over varied circumstances and epochs, as a type of moral law. Of course they cannot be, but this illusion is nothing new.

    1. Uh oh, there’s a Philosopher in the house! Shoo! Shoo! Quickly before they make an essentialist argument! Really though I hope one of the inept persons who undertook this study reads your very insightful reply here. What concerns me the most is the confusion and contempt expressed by “scientists” in attempting to assimilate these results. Perhaps the stereotype of an emotionless machine, frustrated by its poor attempts to emulate human behavior, is not as far off as we would hope it to be.

      1. Thanks for crediting me as a large P Philosopher. Alas, I am not even a large P Psychologist.

        If I was a proper one of those (a researcher with good p values!), I would be able to maintain a professional level of curiosity towards this phenomenon, rather than seeming to deflate it with common sense and general principles that omit any citation of previous evidence. Of course these would not stand in any scientific literature.

        Unfortunately, I could not function at all as a clinician without resorting to such empathic assumptions about human nature.

        I do want to congratulate the discovery of this interesting deviation from rationality, and I do see that it merits more investigation. If it is a mystery, and my ideas were to contribute in any slight way to suggesting what forces may be behind that, I would be delighted.
