Wednesday, 9 March 2016

Is Psychology’s Replication Crisis Really Overblown?

Back in August, the Open Science Collaboration published an article in Science with some alarming findings about the state of psychological research: Over the last several years, in an ambitious effort called the Reproducibility Project: Psychology, or RPP, OSC volunteers had attempted to replicate 100 psychology studies from three journals. Only about 40 percent of them successfully replicated. In the other cases, whatever effects the original researchers claimed to have found simply didn’t pop up, suggesting they may have been artifacts of shoddy methodology or human bias rather than genuine psychological findings.
The paper received widespread coverage, including here on Science of Us, because of the dark pall it cast over the discipline of psychology. And these questions didn’t arise in a vacuum: In recent years there’s been a growing understanding of the ways social science can be led astray by faulty results driven, usually, not by intentional malfeasance but by a variety of complicated biases and institutional incentives. That’s why the Center for Open Science, which is headed by the University of Virginia social psychologist Brian Nosek and which organized the OSC’s efforts, was set up in the first place.
This stuff is all fairly complex, relying as it does on questions of statistical tomfoolery, hidden biases, and so on. And it got even more complex last week with the publication of a “technical comment” also in Science. In that article, a team led by Daniel Gilbert, a renowned psychologist at Harvard, argues that the OSC’s effort was itself riddled with methodological flaws. There’s no reason to be so pessimistic, in other words — there’s less evidence of a reproducibility crisis than Nosek and his colleagues would have you believe.
That article was published alongside a response from Nosek et al., which was itself followed, naturally, by a response to the response from Gilbert et al. Meanwhile, skin-in-the-game observers like the social psychologist Sanjay Srivastava and the statistician Andrew Gelman have written really useful blog posts for those hoping to dive deeper into this controversy.
But for those of us who just want to know whether the first paper’s findings are more or less intact, I’d argue that, yes, they are. The critique by Gilbert and his co-authors isn’t as muscular as it appears at first glance.
There are two main reasons for this. The first is that their statistical claims are shakier than they appear at first blush. Srivastava’s article explains why nicely, though it’s an argument that can’t fully be made in nontechnical terms. Two important components are worth summing up, though: In laying out how often one would expect to successfully replicate an experiment from a pure stats standpoint, Gilbert and his colleagues use a standard that doesn’t really make sense and that would make it strikingly easy to “successfully” replicate weak, low-powered experiments. Then, in comparing the Nosek replication effort with another one, Gilbert and his colleagues claim that the other effort successfully replicated 85 percent of the experiments it looked at. Srivastava explains that this is only true if you adopt a particularly forgiving definition of what it means to successfully replicate an experiment — a definition that inflates the apparent difference between the two replication efforts and that neither Gilbert and his colleagues nor Nosek and his colleagues use elsewhere. When you use a more apples-to-apples comparison, Srivastava writes, the other replication effort was successful only about 40 percent of the time, which is right around what Nosek and his team found.
Okay, we’re out of the statistical weeds. Here’s another important — and less quantitative — complaint Gilbert and his colleagues make:
[M]any of OSC’s replication studies drew their samples from different populations than the original studies did. An original study that measured Americans’ attitudes toward African-Americans (3) was replicated with Italians, who do not share the same stereotypes; an original study that asked college students to imagine being called on by a professor (4) was replicated with participants who had never been to college; and an original study that asked students who commute to school to choose between apartments that were short and long drives from campus (5) was replicated with students who do not commute to school. What’s more, many of OSC’s replication studies used procedures that differed from the original study’s procedures in substantial ways: An original study that asked Israelis to imagine the consequences of military service (6) was replicated by asking Americans to imagine the consequences of a honeymoon; an original study that gave younger children the difficult task of locating targets on a large screen (7) was replicated by giving older children the easier task of locating targets on a small screen; an original study that showed how a change in the wording of a charitable appeal sent by mail to Koreans could boost response rates (8) was replicated by sending 771,408 e-mail messages to people all over the world (which produced a response rate of essentially zero in all conditions).
Not that things weren’t complicated already, but here’s where they get … philosophically complicated, for lack of a better word. What does it mean to accurately replicate a study’s methodology? Given that the OSC’s whole concern is that underpowered, one-off studies conducted on specific groups get taken as “proof” of broad psychological concepts that may not in fact exist, surely it can’t be the case that replicators have to use the exact same category of subjects in every replication attempt. Doesn’t there have to be some leeway here for this conversation to make sense at all?
Digging into the specific examples shows that there might be less to this aspect of Gilbert et al.’s critique than meets the eye, anyway. As Nosek and his colleagues point out in their response, for example, the Italian study was replicated successfully, and in three of the six cases listed above the authors of the original study endorsed the methodology used by the replicators.
But let’s zoom in even more: When Gilbert and his colleagues say that the second round of experiments was conducted differently, what does “differently” mean, exactly? The military/honeymoon example (point six) is a useful one here. It’s from a 2008 Journal of Personality and Social Psychology article (PDF) about how victims and perpetrators of wrongdoing have different emotional needs, and how this knowledge might help defuse conflict.
Here’s the passage in question, in which the authors of the original study lay out the hypothetical they presented to the study’s participants:
Participants were told that they were taking part in a study on interpersonal relationships. They were asked to read a short vignette about an employee in an advertising company who was absent from work for 2 weeks due to maternity leave (for women) or military reserve duty (for men)—the most common reasons for extended work absences in Israeli society. The gender of both the protagonist and the antagonist were matched to that of the participant.

It was further indicated that upon returning to the office, the employee learned that a colleague who had temporarily filled her position was ultimately promoted to her job, whereas she herself was demoted. The demoted employee blamed her colleague for this demotion. In the victim condition, participants were asked to imagine themselves as the employee who was demoted from her position; in the perpetrator condition, participants were asked to imagine themselves as the colleague who was promoted. Following these instructions, participants received a questionnaire that included the before measures: (a) manipulation checks (measuring the extent to which participants perceived themselves as victims or perpetrators), (b) sense of power, (c) public moral image, (d) willingness to reconcile, and (e) psychological needs for power and social acceptance.
It seems pretty clear from reading this that what mattered, from the experimenters’ perspective, was providing a reason for the victim’s absence that shouldn’t have merited a demotion. Maybe the authors of the RPP replication attempt thought military service would be less likely to resonate among American experiment subjects, so they switched it to a honeymoon. The point is the same — you leave your job for two weeks, and when you get back someone has effectively taken it.
Is it fair to potentially pin the failure to replicate on the swapping out of military service for a honeymoon? If the goal in a study like this one is to develop theories that apply not just to Israelis or Americans, and not just to Israelis or Americans hypothesizing about a very specific type of event, then it feels like a stretch to get too hung up on the distinctions. (Plus, over on Retraction Watch Gilbert and a colleague get deeper into the weeds of how this replication was performed, and it’s clear the two situations were presented in a very similar manner.) Maybe in the other examples it mattered more — this was, to be honest, the only one I checked — but in this case the critique is underwhelming.
The back-and-forths between the Nosek and Gilbert camps aren’t over, and a lot of them are going to be really technically complex. It would be a shame for people to get dazed by all the numbers and claims and counterclaims being tossed about, and to lose sight of the underlying fact that the original Science article made a powerful argument that psychologists can and must do a better, more transparent job — and helped show them how to do so.
Plus, it’s not like Nosek and his colleagues decided out of nowhere that this stuff was a problem. Even beyond their specific findings, there are sound theoretical reasons to believe there’s a replication problem afoot. We’ve known for a long time that problems like publication bias — exciting findings getting published, null findings getting tossed in a file drawer, never to be seen again — are a pretty big deal in the social sciences. We’ve also known that certain appealing psychological ideas have a way of catching on despite what turns out, in retrospect, to have been a lack of substantive evidence — if you doubt that, just read Melissa Dahl’s post from last week on the teetering concept of “ego depletion,” which had been taken by many for years as a fact about human nature.
Finally, while the 40 percent “hit” rate on replications certainly earned the original paper a lot of headlines and served as a natural anchoring point for the discussions that followed, it was only part of the point Nosek and his colleagues wanted to get across. Forty-ish percent is no magic number here — the researchers took a somewhat arbitrary approach to their sampling of papers and were open about that fact. More important is that the RPP provided a way forward for future replication efforts — it showed that if you’re dedicated, you really can, and should, rerun past experiments. And it will nudge social norms in a positive direction, in part by telling researchers, “There’s an increasing expectation that someone will want to replicate your experiment, so it’s in your best interest to be as transparent as possible with your data and methodology.”
In short, Gilbert and his colleagues obviously have every right to criticize the methods of Nosek’s team, but their critique comes off as flawed in its own right. Hopefully it won’t knock psychology off its much-needed course correction.
