The Rule of Three and Confidence Intervals
The “Rule of Three” is a neat trick for stating a confidence interval for a binomial proportion in the case where you’ve observed all successes or all failures. Thinking about this led me to some general meditations on confidence intervals.
The rule of three says that if $X \sim \textrm{Binomial}(n, p)$ for known $n$, and $X = 0$, then an approximate 95% confidence interval for $p$ is $[0, 3/n]$. If you’ve never heard of this before, check out the Wikipedia page for it.
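To see where the 3 comes from, it may help to compare $3/n$ against the value it approximates. The sketch below assumes the usual derivation, in which the upper bound is the largest $p$ satisfying $(1 - p)^n \geq 0.05$, so that $p = 1 - 0.05^{1/n} \approx -\ln(0.05)/n \approx 3/n$. The function names are mine, chosen for illustration.

```python
import math

def rule_of_three_upper(n):
    """Upper end of the rule-of-three interval [0, 3/n]."""
    return 3.0 / n

def exact_upper(n, alpha=0.05):
    """Largest p with P(X = 0 | p) = (1 - p)^n >= alpha."""
    return 1.0 - alpha ** (1.0 / n)

# -ln(0.05) is just under 3, which is where the rule gets its name.
print("-ln(0.05) =", -math.log(0.05))

for n in (10, 30, 100, 1000):
    print(n, exact_upper(n), rule_of_three_upper(n))
```

The approximation is slightly conservative for every $n$ (the $3/n$ bound is always a bit larger than the exact one), and the gap shrinks as $n$ grows.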
When I first learned of the rule, I wanted to measure its actual coverage via simulation, but I quickly hit a roadblock. Ordinarily, to test the coverage of a confidence interval procedure, I would choose a value (or loop through several values) for the parameter, simulate many data sets, apply the procedure to each data set, and check how often the intervals produced by those data sets contained the true parameter. But that clearly won’t work here, because for each $n$ and $p$, there is only one data set to which the rule applies. So what does it even mean to say that an interval produced by the rule of three has 95% coverage?
The problem is that, by itself, the rule of three is not, strictly speaking, a procedure for producing confidence intervals, since such a procedure must apply to all possible data sets in order to be meaningful. The rule of three is a method for approximating the results of a legitimate confidence interval procedure, in the case that we get a particular data set. The procedure the rule of three approximates is sort of the archetypal confidence interval procedure: for each data set, the confidence interval consists of all values of the parameter whose likelihood is greater than 5% (in the case of a 95% confidence interval). (Here, and later, I’m fudging some distinctions between probability mass and density, and the difference between the probability of the data vs the probability of data “at least as extreme” as your data, but the main idea holds.) This produces a 95% interval because of the following reasoning: after observing data $D$, for the true parameter $\theta_0$ to not be in the interval produced by the procedure applied to $D$, it must be the case that the likelihood of $\theta_0$ is less than 5% for $D$. In other words, $D$ is one of the 5% most unlikely data sets given $\theta_0$. Clearly, this happens 5% of the time (or at most 5% of the time when the data are discrete), so, turning it around, the probability that the interval contains $\theta_0$ is 95% (or at least 95%).
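Once the procedure is defined for every possible data set, its coverage can be checked, and for a binomial it can even be computed exactly rather than simulated. The sketch below is one concrete reading of the "at least as extreme" version of the procedure, and it is my assumption rather than something stated above: for each observed $x$, the interval is $[0, u(x)]$, where $u(x)$ is the largest $p$ with $P(X \leq x \mid p) \geq 0.05$. For $x = 0$ this reduces to $(1 - p)^n \geq 0.05$, the equation the rule of three approximates. The helper names are hypothetical.

```python
import math

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

def upper_bound(x, n, alpha=0.05, tol=1e-10):
    """Largest p with P(X <= x | p) >= alpha, found by bisection.

    The CDF is decreasing in p for x < n, so bisection is valid.
    """
    if x == n:
        return 1.0  # every p is compatible with observing all successes
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(x, n, mid) >= alpha:
            lo = mid
        else:
            hi = mid
    return lo

def coverage(n, p, alpha=0.05):
    """Exact probability that the interval [0, u(X)] contains p."""
    bounds = [upper_bound(x, n, alpha) for x in range(n + 1)]
    return sum(binom_pmf(x, n, p) for x in range(n + 1) if p <= bounds[x])

n = 20
for p in (0.05, 0.1, 0.3, 0.5):
    print(p, coverage(n, p))
```

Because the data are discrete, the coverage computed this way is at least 95% rather than exactly 95%, and it varies with $p$.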
I say the above confidence interval procedure is archetypal because it is the one that we implicitly invoke to treat confidence intervals as hypothesis tests and vice versa. E.g., when we say things like, “The 95% confidence interval for $\mu$ excludes 0, so we can reject $H_0: \mu = 0$ at the 5% level,” we mean that both statements roughly boil down to “the probability of the observed data given $\mu = 0$ is less than 5%.” (By the way, this isn’t the only way we could formulate confidence intervals. We could, for example, make a really perverse confidence interval that excludes a tiny portion of the most likely values of the parameter, and includes all other (even very, very low-likelihood) values, in such a way that the coverage is 95%.)
Anyway, back to the rule of three. It kind of feels to me like the rule of three, by itself, should be a legitimate interval procedure that has some kind of statistical properties we can talk about. Conditional on the fact that $X = 0$, what I really want to know, what my Bayesian Id yearns to know is, “What is the probability that $p < 3/n$?” From a Bayesian point of view this question makes perfect sense. The usual derivation of the rule of three is decidedly frequentist, but can we come up with a Bayesian interpretation or derivation of the rule? Yes we can.
Assume a uniform prior on $p$. Then by the classic beta/binomial conjugate model, the posterior of $p$ conditional on $X = 0$ is $\textrm{Beta}(1, n + 1)$. Now we can get a 95% credible interval (in fact, the 95% highest posterior density interval, since the posterior density is decreasing in $p$) for $p$ by choosing the left-most 95% of the mass of the posterior. That means the interval will extend from 0 to the 95th percentile of the distribution. The CDF of the $\textrm{Beta}(1, n + 1)$ distribution is $1 - (1 - p)^{n+1}$, so the 95th percentile is where $1 - (1-p)^{n+1} = 0.95$; that is, where $(1 - p)^{n + 1} = 0.05$. This is exactly the equation from the usual derivation of the rule of three (e.g. as seen on Wikipedia), except with $n + 1$ replacing $n$. So our Bayesian rule of three gives us the approximate 95% credible interval $[0, 3/(n + 1)]$. Since $3 / n > 3 / (n + 1)$, we can also use the standard $3 / n$ rule as a conservative 95% credible interval. Thus, the rule of three achieves nice Bayesian properties. Neat.
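The Bayesian version is easy to check numerically, since the $\textrm{Beta}(1, n + 1)$ CDF has the closed form given above. The following sketch (function names are mine) computes the exact 95th posterior percentile, compares it with $3/(n + 1)$ and $3/n$, and evaluates the posterior probability that $p < 3/n$, which is the question my Bayesian Id was asking.

```python
def posterior_cdf(p, n):
    """CDF of Beta(1, n + 1): the posterior of p after X = 0 under a uniform prior."""
    return 1.0 - (1.0 - p) ** (n + 1)

def credible_upper(n, level=0.95):
    """Exact upper end of the one-sided credible interval: solves posterior_cdf = level."""
    return 1.0 - (1.0 - level) ** (1.0 / (n + 1))

for n in (10, 30, 100, 1000):
    exact = credible_upper(n)
    print(
        n,
        exact,                    # exact 95th percentile of the posterior
        3 / (n + 1),              # Bayesian rule of three
        3 / n,                    # standard rule of three
        posterior_cdf(3 / n, n),  # P(p < 3/n | X = 0)
    )
```

For every $n \geq 3$ the last column stays above 0.95, which is the sense in which the standard rule is a conservative credible interval.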