What’s a “valid” sample? Problems with Mechanical Turk study samples, part 1

It’s commonplace nowadays to see published psychology studies based on samples consisting of “workers” hired to participate in them via Amazon’s “Mechanical Turk,” a proprietary system that enables Amazon to collect a fee for brokering on-line employment relationships.

I’ve been trying to figure out for a while now what I think about this practice.

After considerable reading and thinking, I’ve concluded that “MT” samples are in fact a horribly defective basis for the study of the dynamics I myself am primarily interested in—namely, ones relating to how differences in group commitments interact with the cognitive processes that generate cultural or political polarization over societal risks and other facts that admit of scientific study.

I’m going to explain why in two posts. To lay the groundwork for my assessment of the flaws in MT samples, this post will set out a very basic account of how to think about the “validity” of psychology samples generally.

Sometimes people hold forth on this as if sample validity were some disembodied essence that could be identified and assessed independently of the purpose of conducting a study. They say things like, “That study isn’t any good—it’s based on college students!” or make complex mathematics-pervaded arguments about “probability-based stratification” of general population samples and so forth.

The reason to make empirical observations is to generate evidence that gives us more or less reason than we otherwise would have had to believe some proposition or set of propositions (the ones featured in the study hypotheses) about how the world works.

The validity of a study sample, then, depends entirely on whether it can support inferences of that sort.

Imagine someone is studying some mental operation that he or she has reason to think is common to all people everywhere—say, “perceptual continuity,” which involves the sort of virtual, expectation-based processing of sensory stimuli that makes people shockingly oblivious to what seem like shockingly obvious but unexpected phenomena, like the sudden appearance of a gorilla among a group of basketball players or the sudden substitution of one person for another during a conversation between strangers.

Again, on the researcher’s best understanding of the mechanisms involved, everyone everywhere is subject to this sort of effect, which reflects processes that are in effect “hard wired” and invariant.  If that’s so, then pretty much any group of people—so long as they haven’t suffered some sort of trauma that might change the operation of the relevant mental processes—will do.

So if a researcher wants to test whether a particular intervention—like telling people about this phenomenon—will help to counteract it, he or she can go ahead and test it on any group of normal people the researcher happens to have ready access to—like college undergraduates.

But now imagine that one is studying a phenomenon that one has good reason to believe will generate systematic differences among individuals identified with reference to certain specific characteristics.

That’s true of “cultural cognition” and like forms of motivated reasoning that figure in the tendency of people to fit their assessments of information—from scientific “data” to expository arguments to the positions of putative experts to (again!) their own sense impressions—to positions on risk and like facts that dominate among members of their group.

Because the phenomenon involves individual differences, a sample that doesn’t contain the sorts of individuals who differ in the relevant respects won’t support reliable inferences.

E.g., there’s a decent amount of evidence that white males with hierarchical and individualistic values (or with “conservative” political orientations; cultural values and measures of political ideology or party affiliation are merely alternative indicators of the same latent disposition, although I happen to think cultural worldviews tend to work better) are motivated to be highly skeptical of environmental and technological risks. Such risk claims, this work suggests, are psychically threatening to such individuals, because their status and authority in society tend to be bound up with commercial and industrial activities that are being identified as dangerous and worthy of regulation.

If one wants to investigate how a particular way of “framing” information might dissipate dismissiveness and promote open-minded engagement with evidence on climate change, then it makes no sense to test such a hypothesis on, say, predominantly female undergraduates attending a liberal east-coast university.  How they respond to the messages in question won’t generate reliable inferences about how white, hierarchical individualistic males will—and they are the group in the world that we have reason to believe is reacting in the most dismissive way to scientific evidence on climate change.

Obviously, this account of “sample validity” depends on one being right when one thinks one has “good reason to know” that the dynamics of interest are uniform across people or vary in specific ways across subpopulations of them.

But there’s no getting around that! If one uses a “representative” general population sample to study a phenomenon that in fact varies systematically across subpopulations, then the inferences one draws will also be faulty, unless one both tests for such individual differences and assures that the sample contains a sufficiently large number of the subpopulation members to enable detection of such effects. Indeed, assuring that there are enough members of a subpopulation (particularly a small one, like, say, a racial minority) requires oversampling, which generates a nonrepresentative sample!
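To make that last point concrete, here is a minimal simulation sketch in Python. All of the numbers in it—the effect sizes, the subgroup share, the sample size—are invented for illustration, not drawn from any actual study. The idea it illustrates: holding total N fixed, a treatment-by-subgroup interaction that a proportionally “representative” sample is too thin to detect reliably becomes detectable once the small subgroup is deliberately oversampled.

```python
# Minimal sketch (invented numbers): power to detect a treatment-by-subgroup
# interaction under a proportional ("representative") sample vs. an oversampled
# design with the same total N. Assumes numpy, pandas, and statsmodels.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

def interaction_power(n_total, subgroup_share, n_sims=500, alpha=0.05):
    """Share of simulated studies in which the treatment x subgroup
    interaction is detected at the given significance level."""
    hits = 0
    for _ in range(n_sims):
        subgroup = (rng.random(n_total) < subgroup_share).astype(int)  # small subpopulation
        treated = (rng.random(n_total) < 0.5).astype(int)              # random assignment
        # Hypothetical effects: the "framing" helps most people a bit (+0.2)
        # but works differently in the subgroup (interaction -0.5), plus noise.
        y = 0.2 * treated - 0.5 * treated * subgroup + rng.normal(0, 1, n_total)
        df = pd.DataFrame({"y": y, "treated": treated, "subgroup": subgroup})
        fit = smf.ols("y ~ treated * subgroup", data=df).fit()
        hits += fit.pvalues["treated:subgroup"] < alpha
    return hits / n_sims

# Representative sample: the subgroup is ~5% of respondents.
print("power, representative:", interaction_power(600, subgroup_share=0.05))
# Oversampled (nonrepresentative) sample: same total N, subgroup is ~50%.
print("power, oversampled:  ", interaction_power(600, subgroup_share=0.50))
```

The particular numbers don’t matter; what matters is that detecting the subgroup-specific effect is exactly what forces the deliberate departure from representativeness.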

The point is that the validity of a sample depends on its suitability for the inferences to be drawn about the dynamics in question.  That feature of a sample can’t be determined in the abstract, according to any set of mechanical criteria.  Rather it has to be assessed in a case-specific way, with the exercise of judgment.

And like anything of that sort—or just anything that one investigates empirically—the conclusions one reaches will need to be treated as provisional, insofar as someone might later come along and show that the dynamics in question involve some feature that evaded detection with one’s sample and thus undermines the inferences one drew. Hey, that’s just the way science works!

Maybe on this account Mechanical Turk samples are “valid” for studying some things.

But I’m convinced they are not valid for the study of how cultural or ideological commitments influence motivated cognition: because of several problematic features of such samples, one cannot reliably infer from studies based on them how this dynamic will operate in the real world.

I’ll identify those problematic features of MT samples in part two of this series.
