It never fails! My own best efforts (here & here) to explain the startling and increasingly notorious paper by Miller & Sanjurjo have prompted the authors to step forward and try to restore the usual state of perfect comprehension enjoyed by the 14.3 billion regular readers of this blog. They have determined, in fact, that it will take three separate guest posts to undo the confusion, so apparently I’ve carried out my plan to a [GV]T.
As cool as the result of the M&S paper is, I myself remain fascinated by what it tells us about cognition, particularly among those with exquisitely fine-tuned statistical intuitions. How did the analytical error they uncovered in the classic “hot hand fallacy” studies remain undetected for some thirty years, and why does it continue to provoke stubborn resistance on the part of very, very smart people? To Miller & Sanjurjo’s credit, they have happily and persistently shouldered the immense burden of explication necessary to break the grip of the pesky intuition that their result “just can’t be right!”
Joshua B. Miller & Adam Sanjurjo
Thanks for the invitation to post here, Dan!
Here’s our plan for the upcoming three posts:
- Today’s plan: A bit of the history of the hot hand fallacy; then a clear statement of the bias we find, an explanation of why it invalidates the main conclusion of the original hot hand fallacy study (1985), and a demonstration that correcting for the bias flips the conclusion of the original data, so that it can now be used as evidence supporting the existence of meaningfully large hot hand shooting.
- Next post: Provide a deeper understanding of how the bias emerges.
- Final post: Go deeper into potential implications for research on the hot hand effect, hot hand beliefs, and the gambler’s fallacy.
Part I
In the seminal hot hand fallacy paper, Gilovich, Vallone and Tversky (1985; “GVT”; also see Tversky & Gilovich’s 1989 “Cold Facts” summary paper) set out to conduct a truly informative scientific test of hot hand shooting. After studying two types of in-game shooting data, they conducted a controlled shooting study (experiment) with the Cornell University men’s and women’s basketball teams. This was an effective “…method for eliminating the effects of shot selection and defensive pressure” that were present as confounds in their analysis of game data (we will return to the issue of game data in a follow-up post; for now, see the first page of Dixit & Nalebuff’s 1991 classic book “Thinking Strategically,” and this comment on Andrew Gelman’s blog).

While the common use of the term “hot hand” is vague and varied, everybody agrees that it refers to a temporary elevation in a player’s ability, i.e. in the probability of a successful shot. Because the hot state is unobservable to the researcher (though perhaps not to the player, a teammate, or the coach!), we cannot simply measure a player’s probability of success in the hot state; we need an operational definition. A natural idea is to take a streak of sufficient length as a good signal of whether or not a player is in the hot state, and to define a player as having the hot hand if his/her probability of success is greater after a streak of successful shots (hits) than after a streak of unsuccessful shots (misses). GVT designed a test for this.
Suppose we wanted to test whether Stephen Curry has the hot hand; how would we apply GVT’s test to him? The answer is that we would have Curry attempt 100 shots from locations at which he is expected to have a 50% chance of success (like a coin). Next, we would calculate Curry’s field goal percentage on the shots that immediately follow a streak of hits, and test whether it is bigger than his field goal percentage on the shots that immediately follow a streak of misses; the larger the difference we observe, the stronger the evidence of the hot hand. GVT performed this test on the Cornell players and found that the difference in field goal percentages was statistically significant for only one of the 26 players (two-sample t-test), which is consistent with the chance variation that the coin model predicts.
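To make the comparison concrete, here is a minimal Python sketch of the two conditional field goal percentages at the heart of GVT’s test (the function name and the simulated 50% shooter are our own, for illustration; note that a shot whose previous three shots were all hits is, automatically, a shot on a streak of three *or more* hits):

```python
import numpy as np

def streak_fgps(shots, k=3):
    """Return (FG% on shots immediately preceded by k consecutive hits,
    FG% on shots immediately preceded by k consecutive misses).
    `shots` is a sequence of 1s (hits) and 0s (misses)."""
    after_hits, after_misses = [], []
    for i in range(k, len(shots)):
        prev = shots[i - k:i]
        if all(s == 1 for s in prev):
            after_hits.append(shots[i])
        elif all(s == 0 for s in prev):
            after_misses.append(shots[i])
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return mean(after_hits), mean(after_misses)

# A hypothetical 100-shot record from a coin-like 50% shooter
rng = np.random.default_rng(0)
shots = rng.integers(0, 2, size=100)
fgp_hits, fgp_misses = streak_fgps(shots)
print(f"FG% after 3+ hits: {fgp_hits:.2f}; after 3+ misses: {fgp_misses:.2f}")
```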
Now, one can ask oneself: if Stephen Curry doesn’t get hot, that is, if he has exactly a 50% chance of success on each of his 100 shot attempts, then what would I expect his field goal percentage to be when he is on a streak of three (or more) hits? Similarly, what would I expect his field goal percentage to be when he is on a streak of three (or more) misses?
Following GVT’s analysis, one can form two groups of shots:
Group “3hits”: all shots immediately preceded by three (or more) consecutive hits,
Group “3misses”: all shots immediately preceded by three (or more) consecutive misses.
From here, it is natural to reason as follows: if Stephen Curry always has the same chance of success, then he is like a coin, so we can consider the shots within each group as independent; after all, each shot has been assigned at random to one of three groups: “3hits,” “3misses,” or neither. So far this reasoning is correct. Now, GVT (implicitly) took this intuitive reasoning one step further: because all shots, which are independent, have been assigned at random to the groups, we should expect the field goal percentage to be the same in each group. This is the part that is wrong.
Where does this seemingly fine reasoning go wrong? The first clue that there is a problem is that the variable used to assign shots to groups also shows up as the response variable in the computation of the field goal percentage, though this does not fully explain the problem. The key issue is that there is a bias in how shots are selected for each group. Let’s see this by first focusing on the “3hits” group. Under the assumptions of GVT’s statistical test, Stephen Curry has a 50% chance of success on each shot, i.e. he is like a coin: heads for a hit, tails for a miss. Now, suppose we plan to flip a coin 100 times, then select at random one flip from among those immediately preceded by three consecutive heads, and finally check whether the flip we selected is a heads or a tails. Before we flip, what is the probability that the flip we end up selecting is a heads?
The answer is that this probability is not 0.50, but 0.46! Herein lies the selection bias. The flips that are being selected for analysis are precisely the flips that are immediately preceded by three consecutive heads. Now, returning to the world of basketball shots, this way of selecting shots for analysis implies that for the “3hits” group, there would be a 0.46 chance that the shot we are selecting is a hit, and for the “3misses” group, there would be a 0.54 chance that the shot we are selecting is a hit.
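This is easy to check by simulation. Below is a minimal Monte Carlo sketch of the thought experiment just described, assuming a fair coin, n = 100 flips, and streaks of k = 3 heads (the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def selected_flip_is_heads(n=100, k=3):
    """One run of the thought experiment: flip a coin n times, pick at
    random one flip immediately preceded by k consecutive heads, and
    report whether that flip is heads. Returns None if no flip
    qualifies (rare for n=100, k=3)."""
    flips = rng.integers(0, 2, size=n)  # 1 = heads, 0 = tails
    eligible = [i for i in range(k, n) if flips[i - k:i].all()]
    if not eligible:
        return None
    return bool(flips[rng.choice(eligible)])

outcomes = [selected_flip_is_heads() for _ in range(100_000)]
outcomes = [o for o in outcomes if o is not None]
print(f"P(selected flip is heads) ~ {np.mean(outcomes):.3f}")  # about 0.46
```

The simulated probability lands near 0.46 rather than the intuitive 0.50, which is exactly the bias at issue.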
Therefore, if Stephen Curry does not get hot, i.e. if he always has a 50% chance of success on the 100 shots we study, we should expect him to shoot 46% after a streak of three or more hits, and 54% after a streak of three or more misses. This is the order of magnitude of the bias that was built into the original hot hand study, and it is the bias depicted in Figure 2 on page 13 of our new paper (a simpler version of the figure is below). The bias is large in basketball terms: a difference of more than 8 percentage points is nearly the difference between the median NBA three-point shooter and the very best. Another way to look at the bias is to imagine inviting 100 players to participate in GVT’s experiment, with each player shooting from positions at which the chance of success on each shot is 50%. For each player, check whether his/her field goal percentage after a streak of three or more hits is higher than his/her field goal percentage after a streak of three or more misses. For how many players should we expect this to be true? Correct answer: 40 out of 100 players.
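The 100-player version of the experiment can be simulated the same way. Here is a self-contained sketch under the same assumptions (all names are ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def fgp_after_streak(shots, k, outcome):
    """FG% on shots immediately preceded by k consecutive shots equal
    to `outcome` (1 = hit, 0 = miss); NaN if no such shots occur."""
    sel = [shots[i] for i in range(k, len(shots))
           if all(s == outcome for s in shots[i - k:i])]
    return sum(sel) / len(sel) if sel else float("nan")

def players_shooting_better_after_hits(n_players=100, n_shots=100, k=3):
    """Count how many consistently-50% shooters have a higher FG% after
    a streak of k+ hits than after a streak of k+ misses."""
    count = 0
    for _ in range(n_players):
        shots = rng.integers(0, 2, size=n_shots)
        if fgp_after_streak(shots, k, 1) > fgp_after_streak(shots, k, 0):
            count += 1  # NaN comparisons are False, so undefined cases drop out
    return count

trials = [players_shooting_better_after_hits() for _ in range(200)]
print(f"average count (out of 100 players): {np.mean(trials):.0f}")  # about 40
```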
This selection bias is large enough to invalidate the main conclusion of GVT’s original study, without having to analyze any data. However, beyond this “negative” message there is also a way forward: we can re-analyze the original Cornell dataset in a way that is invulnerable to the bias. It turns out that when we do this, we find considerable evidence of the hot hand in the data. First, if we look at Table 4 in GVT (page 307), we see that, on average, players shot around 3.5 percentage points better when on a hit streak of three or more shots, and that 64% of the players shot better when on a hit streak than when on a miss streak. While GVT do not directly analyze these summary averages, given our knowledge of the bias they are telling (in fact, you can do much more with Table 4; see Kenny LJ’s answer to his own question here). With the correct analysis (described in the next post), there is statistically significant evidence of the hot hand in the original data set, and, as can be seen in Table 2 on page 23 of our new paper, the point estimate of the average hot hand effect size is large (further details in our “Cold Shower” paper here). Adjusting for the bias, we now find that: (1) hitting a streak of three or more shots in a row is associated with an expected 10 percentage point boost in a player’s field goal percentage, (2) 76% of players have a higher field goal percentage when on a hit streak than when on a miss streak, and (3) 4 of the 26 players have an effect large enough to be individually significant by conventional statistical standards (p<.05), which is itself a statistically significant count of significant effects, by conventional standards.
In a later post, we will return to the details of GVT’s paper and talk about the evidence for the hot hand found across other datasets. If you prefer not to wait, please take a look at our Cold Shower paper and related comments on Gelman’s blog.
In the next installment, we will discuss the counter-intuitive probability problem that reveals the bias, and explain what is driving the selection bias there. We will then discuss some common misconceptions about the nature of the selection bias, and some very interesting connections with classic probability paradoxes.