In recent years, a surprising number of scientific results have failed to hold up to continued scrutiny. Part of this ‘replicability crisis’ may be caused by practices that ignore the assumptions of traditional (frequentist) statistical methods (John, Loewenstein, and Prelec 2012). One of these assumptions is that the experimental protocol should be completely determined upfront. In practice, researchers often adjust the protocol due to unforeseen circumstances or collect data until a point has been proven. This practice, which is referred to as optional stopping, can cause true hypotheses to be wrongly rejected much more often than these statistical methods promise.
Bayes factor hypothesis testing has long been advocated as an alternative to traditional testing that can resolve several of its problems; in particular, it was claimed early on that Bayesian methods continue to be valid under optional stopping (Lindley 1957; Raiffa and Schlaifer 1961; Edwards, Lindman, and Savage 1963). In light of the replicability crisis, such claims have received much renewed interest (Wagenmakers 2007; Jeffrey N. Rouder 2014; Schönbrodt et al. 2017; Yu et al. 2014; Sanborn and Hills 2014). But what do they mean mathematically? It turns out that different authors mean quite different things by ‘Bayesian methods handle optional stopping’; moreover, such claims are often shown to hold only in an informal sense, or in restricted contexts. In the paper (Hendriksen, Heide, and Grünwald 2020) we give a systematic overview and formalization of such claims, and explain their relevance for practice: can we effectively rely on Bayes factor testing to do a good job under optional stopping or not? As we shall see, the answer is subtle. Secondly, we extend the reach of such claims to more general settings, for which they have never been formally verified and for which verification is not always trivial. In the paper ( Heide and Grünwald 2020), we explain claims about optional stopping for an audience of methodologists and applied statisticians with the help of computer simulations.

Bayesian inference

Bayesianism is about a certain interpretation of the concept probability: as degrees of belief. A Bayesian first expresses this belief as a probability function. We call this the prior distribution, and we denote it by \(\mathbb{P}(\theta)\), where \(\theta\) is the parameter (or several parameters) of the model. After the specification of the prior, we obtain the data \(D\) and the likelihood \(\mathbb{P}(D | \theta)\). Now we can compute the posterior distribution \(\mathbb{P}(\theta | D)\) with the help of Bayes’ theorem: \[\begin{aligned} \mathbb{P}(\theta | D) = \frac{\mathbb{P}(D | \theta) \mathbb{P}(\theta)}{\mathbb{P}(D)}.\end{aligned}\] Suppose we want to test a null hypothesis \(\mathcal{H}_0\) against an alternative hypothesis \(\mathcal{H}_1\). We can do this in a Bayesian way with Bayes factors: we start with the prior odds \(\mathbb{P}(\mathcal{H}_1) / \mathbb{P}(\mathcal{H}_0)\), our belief before seeing the data. Often we believe that both hypotheses are equally probable, then our prior odds are 1-to-1. Next we gather our data \(D\), and update our odds with the new knowledge, using Bayes’ theorem: \[\begin{aligned} \text{posterior odds}(\mathcal{H}_1 \text{ vs. } \mathcal{H}_0) = \frac{\mathbb{P}(\mathcal{H}_1 | D)}{\mathbb{P}(\mathcal{H}_0 | D)} = \frac{\mathbb{P}(\mathcal{H}_1)}{\mathbb{P}(\mathcal{H}_0)} \frac{\mathbb{P}(D | \mathcal{H}_1)}{\mathbb{P}(D | \mathcal{H}_0)}.\end{aligned}\] The posterior odds is our updated belief about which hypothesis is more likely.

Three notions of optional stopping

Validity under optional stopping is a desirable property of hypothesis testing: we gather some data, look at the results, and decide whether we stop of gather some additional data. Informally, we call ‘peeking at the results to decide whether to collect more data’ optional stopping, but if we want to make more precise what it means if we say that a test can handle optional stopping, it turns out that different approaches (frequentist, subjective Bayesian and objective Bayesian) lead to different interpretations and definitions. It tuns out that we can discern three main mathematical concepts of handling optional stopping, which we identify and formally define in the paper (Hendriksen, Heide, and Grünwald 2020).
The first concept we call subjective Bayesian optional stopping or \(\tau\)-independence. If one considers a purely subjective Bayesian setting, appropriate if one truly believes one’s prior, then Bayesian updating from prior to posterior is not affected by the employed stopping rule: one ends up with the same posterior if one had decided the sample size \(n\) in advance, or if it had been determined, for example, because one was satisfied with the result at this \(n\). In this sense a subjective Bayesian procedure does not depend on the stopping rule.
The second sense of optional stopping we call calibration. As (Jeffrey N. Rouder 2014) writes: ‘If a replicate experiment yielded a posterior odds of 3.5-to-1 in favor of the null, then we expect that the null was 3.5 times as probable as the alternative to have produced the data.’ In more mathematical language, this can be expressed as \[\begin{aligned} \text{post-odds} (\mathcal{H}_1 \text{ vs. } \mathcal{H}_0 | ``\text{post-odds} (\mathcal{H}_1 \text{ vs. } \mathcal{H}_0 | D ) = a") = a.\end{aligned}\] We say this equation expresses calibration of the posterior odds. It turns out that this calibration fails to hold if one does not adhere to a purely subjective Bayesian view, in particular, it does not hold for the default priors the Bayesian psychology community is advocating (Wagenmakers 2007; J. N. Rouder et al. 2012). To get a first idea of one of the issues: default priors sometimes depend on the data. Then it is unclear what optional stopping really means, because if, using prior \(P_1(\theta)\) based on a sample of size \(n\), one had stopped at sample size \(n'< n\), one should have really used prior \(P'_1(\theta)\) based on sample of size \(n'\)...but then one would have stopped at yet another sample size \(n''\), and so on. See our paper ( Heide and Grünwald 2020) for an extensive discussion and many examples.

The third sense is a frequentist interpretation of handling optional stopping, which is about controlling the Type I error of an experiment. A Type I error occurs when we reject the null hypothesis when it is true, also called false positive. The frequentist interpretation of handling optional stopping is that the Type I error guarantee holds if we do not determine the sampling plan — and thus the stopping rule — in advance, but we may stop when we see significant results. In the case \(\mathcal{H}_0\) is simple (containing just one hypothesis), there is a well-known intriguing connection between Bayes factors and Type I error probabilities: if we reject \(\mathcal{H}_0\) iff the posterior odds in favor of \(\mathcal{H}_0\) are smaller than some fixed level \(\alpha\), then we are guaranteed a Type I error of at most \(\alpha\). And interestingly, this holds not just for fixed sample sizes but even under optional stopping. However, for composite \(\mathcal{H}_0\) this does not continue to hold. Except for the special case where all free parameters in \(\mathcal{H}_0\) are nuisance parameters observing a group structure and equipped with the corresponding right-Haar prior, and are shared with \(\mathcal{H}_1\), as we prove in (Hendriksen, Heide, and Grünwald 2020). But for general priors and composite \(\mathcal{H}_0\), this is typically not the case.

Figure 1: Posterior odds in an experiment of testing whether the mean of a normal distribution is 0 (\(\mathcal{H}_0\)), versus non-zero (\(\mathcal{H}_1\)), from 20; 000 replicate experiments. (a) The empirical sampling distribution of the posterior odds as a histogram under \(\mathcal{H}_0\) (blue) and \(\mathcal{H}_1\) (pink). (b) Calibration plot: the observed posterior odds as a function of the nominal posterior odds.

Conclusion

One can give three distinct mathematical meanings to the notion of optional stopping. Whether or not we can say that ‘the Bayes factor method can handle optional stopping’ in practice is a subtle matter, depending on the specifics of the given situation: what models are used, what priors, and what is the goal of the analysis.

References

Edwards, Ward, Harold Lindman, and Leonard J. Savage. 1963. “Bayesian Statistical Inference for Psychological Research.” Psychological Review 70 (3): 193–242. https://doi.org/10.1037/h0044139.
Heide, Rianne, and Peter D. Grünwald. 2020. “Why Optional Stopping Can Be a Problem for Bayesians.” Psychonomic Bulletin & Review, Advance Publication, 1–18.
Hendriksen, Allard, Rianne Heide, and Peter Grünwald. 2020. “Optional Stopping with Bayes Factors: A Categorization and Extension of Folklore Results, with an Application to Invariant Situations.” Bayesian Analysis, Advance Publication.
John, Leslie K, George Loewenstein, and Drazen Prelec. 2012. “Measuring the Prevalence of Questionable Research Practices with Incentives for Truth Telling.” Psychological Science.
Lindley, D. V. 1957. “A Statistical Paradox.” Biometrika 44 (1/2): 187–92. https://doi.org/10.2307/2333251.
Raiffa, H., and R. Schlaifer. 1961. Applied Statistical Decision Theory. Cambridge, MA: Harvard University Press.
Rouder, J N, R D Morey, P L Speckman, and J M Province. 2012. “Default Bayes Factors for ANOVA Designs.” Journal of Mathematical Psychology 56 (5): 356–74.
Rouder, Jeffrey N. 2014. “Optional Stopping: No Problem for Bayesians.” Psychonomic Bulletin & Review 21 (2): 301–8. https://doi.org/10.3758/s13423-014-0595-4.
Sanborn, Adam N., and Thomas T. Hills. 2014. “The Frequentist Implications of Optional Stopping on Bayesian Hypothesis Tests.” Psychonomic Bulletin & Review 21 (2): 283–300. https://doi.org/10.3758/s13423-013-0518-9.
Schönbrodt, Felix D., Eric-Jan Wagenmakers, Michael Zehetleitner, and Marco Perugini. 2017. “Sequential Hypothesis Testing with Bayes Factors: Efficiently Testing Mean Differences.” Psychological Methods 22 (2): 322–39. https://doi.org/10.1037/met0000061.
Wagenmakers, Eric-Jan. 2007. “A Practical Solution to the Pervasive Problems of p Values.” Psychonomic Bulletin & Review 14 (5): 779–804. https://doi.org/10.3758/bf03194105.
Yu, Erica C., Amber M. Sprenger, Rick P. Thomas, and Michael R. Dougherty. 2014. “When Decision Heuristics and Science Collide.” Psychonomic Bulletin & Review 21 (2): 268–82. https://doi.org/10.3758/s13423-013-0495-z.