
# Online Control of the False Coverage Rate and False Sign Rate

The false coverage rate (FCR) is the expected ratio of number of constructed confidence intervals (CIs) that fail to cover their respective parameters to the total number of constructed CIs. Procedures for FCR control exist in the offline setting, but none so far have been designed with the online setting in mind. In the online setting, there is an infinite sequence of fixed unknown parameters θ_t ordered by time. At each step, we see independent data that is informative about θ_t, and must immediately make a decision whether to report a CI for θ_t or not. If θ_t is selected for coverage, the task is to determine how to construct a CI for θ_t such that FCR≤α for any T∈N. A straightforward solution is to construct at each step a (1-α) level conditional CI. In this paper, we present a novel solution to the problem inspired by online false discovery rate (FDR) algorithms, which only requires the statistician to be able to construct a marginal CI at any given level. Apart from the fact that marginal CIs are usually simpler to construct than conditional ones, the marginal procedure has an important qualitative advantage over the conditional solution, namely, it allows selection to be determined by the candidate CI itself. We take advantage of this to offer solutions to some online problems which have not been addressed before. For example, we show that our general CI procedure can be used to devise online sign-classification procedures that control the false sign rate (FSR). In terms of power and length of the constructed CIs, we demonstrate that the two approaches have complementary strengths and weaknesses using simulations. Last, all of our methodology applies equally well to online FCR control for prediction intervals, having particular implications for assumption-free selective conformal inference.


## 1 Introduction

While statisticians are trained to be aware of multiple testing issues, temporal multiplicity is often easy to miss. Let us examine the following simplified situation alluded to in the abstract. Consider a team of statisticians at a pharmaceutical company who test a new drug every week of the year. In week $i$, a new drug is under consideration, and to assess its treatment effect $\theta_i$, the team conducts a new randomized clinical trial with new participants. Suppose that the data, such as the normalized empirical difference in means between the treatment and control groups, can be summarized by the observation $X_i \sim N(\theta_i, 1)$, independent of all the previous $X_j$.

Now consider the following selection rule: if $X_i$ does not clear a fixed significance threshold, then the statisticians simply ignore drug $i$, and if it does, then the team reports the two-sided marginal CI for $\theta_i$ to the management (who may then decide to run a much larger second phase clinical trial since the CI does not contain 0). This may initially seem like an innocuous situation: each drug is different and has a different treatment effect $\theta_i$, the data is always fresh and independent, the decision for whether or not to construct the CI for $\theta_i$ is dependent only on $X_i$ and independent of all other $X_j$, and so is the interval if constructed.

Nonetheless, the combination of multiplicity and selection is a cause for concern already in the offline setting, as was insightfully pointed out by Benjamini and Yekutieli (2005). In the online case, when there is an infinite sequence of parameters, it is even easier to construct an example where ignoring selection has undesirable consequences. Indeed, consider the special case where $\theta_i = 0$ for all $i$; in other words, every tested drug is equivalent to a placebo. In this situation, every single CI that is reported to the management is incorrect, since it does not contain zero. Because a selection will eventually occur, the proportion of non-covering CIs among constructed CIs (later formally defined as the false coverage proportion, FCP) will equal one from that point on. Thus, the FCR, which is the expectation of the FCP, is not controlled. Of course, the second phase of the trial will rectify this error, but at a huge cost of time and money, and loss of faith in the team of statisticians.
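The all-placebo example can be checked numerically. The following is a minimal sketch, assuming a hypothetical selection threshold of 3 on $|X_i|$ and an unadjusted 95% marginal CI; neither number is taken from the text, and the function name is illustrative only.

```python
import random
from statistics import NormalDist


def simulate_naive_selection(n_steps=100_000, threshold=3.0, level=0.05, seed=0):
    """Simulate the all-null case theta_i = 0 with unadjusted marginal CIs.

    The threshold and level are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - level / 2)  # half-width of the marginal CI
    n_selected = 0
    n_miscovered = 0
    for _ in range(n_steps):
        x = rng.gauss(0.0, 1.0)  # X_i ~ N(theta_i, 1) with theta_i = 0
        if abs(x) > threshold:  # hypothetical selection rule
            n_selected += 1
            lo, hi = x - z, x + z
            if not (lo <= 0.0 <= hi):  # the CI fails to cover theta_i = 0
                n_miscovered += 1
    # Convention 0/0 = 0 when nothing has been selected.
    fcp = n_miscovered / n_selected if n_selected > 0 else 0.0
    return n_selected, fcp


n_selected, fcp = simulate_naive_selection()
print(n_selected, fcp)
```

Because the assumed threshold exceeds the CI half-width, every selected interval excludes zero, so the realized FCP equals one as soon as a single selection occurs.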

One natural solution for this is provided by conditional post-selection inference: instead of a marginal CI, we may construct a conditional interval, where we condition on the event that $\theta_i$ was selected, leading to inference based on a truncated Gaussian likelihood in the above setting. Confidence intervals based on a truncated normal observation were proposed by Zhong and Prentice (2008) and Weinstein et al. (2013) to counteract the selection effect when providing inference after hypothesis testing. While these works consider the batch (offline) setting, in our simple example constructing such conditional CIs (together with the selection rule) is a legitimate online CI procedure. Furthermore, this controls the FCR; in fact, as will be discussed in Section 3 and demonstrated in our simulations, constructing conditional intervals provides unnecessarily strong guarantees that come at a price.

In this paper, we propose a new approach for online FCR control that is very different from the aforementioned conditional approach. Informally, in order to achieve FCR control at level $\alpha$, instead of constructing conditional CIs, we construct marginal $(1-\alpha_i)$ CIs for some levels $\alpha_i$. The algorithm to set the $\alpha_i$s is inspired by recent advances in the online false discovery rate (FDR) control literature, specifically recent work by the first author (Ramdas et al., 2017). The new CI procedure works in much more generality than the simple example described above, for instance when the $\theta_i$ are multi-dimensional or the data is not necessarily Gaussian; these are cases in which constructing a conditional CI may be substantially harder, if at all possible.

Even more importantly, by constructing marginal instead of conditional CIs, we leave open the possibility to use as a criterion for selection the candidate CI itself. For example, the rule may entail constructing the candidate marginal CI only if it does not include values of opposite signs. Thus, returning to the motivating example, this allows the team of statisticians to ensure that each reported CI is conclusive about the direction of the treatment effect, while the FCR is controlled. With such situations in mind, we instantiate our marginal CI procedure to propose a confidence interval-driven procedure that constructs sign-determining CIs and can be seen as an online adaptation of the ideas of Weinstein and Yekutieli (2019). Every such sign-determining CI procedure corresponds to an online sign-classification procedure that controls the false sign rate (FSR). As a special case we show that for some recently proposed online testing procedures, supplementing rejections based on two-sided $p$-values with directional decisions suffices to control the FSR.

The rest of this paper is organized as follows. Section 2 sets up the problem formally and introduces necessary notation. In Section 3 we discuss a conditional solution to the online FCR problem. A new online procedure that adjusts marginal confidence intervals is presented in Section 4. In Section 5 we show how our marginal CI procedure can be used to solve a general online localization problem, and study the special case of the online sign-classification problem. Simulation results comparing the marginal approach and the conditional approach are reported in Section 6. We end with a brief discussion in Section 7, where we mention how all of our results also hold for prediction intervals for unseen responses, with further details furnished in Appendix B.

## 2 Problem Setup

Let $\theta_1, \theta_2, \dots$ be a fixed sequence of fixed unknown parameters, where the domain $\Theta_i$ of $\theta_i$ is arbitrary, but common examples may include $\mathbb{R}$ or $\mathbb{R}^d$. Any measurable subset of $\Theta_i$ is an acceptable confidence set for $\theta_i$. In our setup, at each time step $i$, we observe an independent observation (or summary statistic) $X_i$, where the distribution of $X_i$ depends on $\theta_i$ (and possibly other parameters). For example, when $\theta_i$ is real-valued, we may have $X_i \sim N(\theta_i, 1)$. Let $S_i$ denote the selection rule that indicates whether or not the user wishes to report a confidence set for $\theta_i$. Explicitly, we let $S_i := S_i(X_i) \in \{0, 1\}$ be the indicator for selection, where $S_i = 1$ means that the user will report a confidence set for $\theta_i$. Let the filtration formed by the sequence of selection decisions be denoted by

$$\mathcal{F}_i = \sigma(S_1, S_2, \dots, S_i).$$

Next, let $I_i$ be the rule for constructing the confidence set for $\theta_i$, the second argument allowing $I_i$ to take as an input a “confidence level”. We denote $I_i := I_i(X_i, \alpha_i)$. Thus, $I_i$ may be a marginal or a conditional confidence set for $\theta_i$ as discussed later, but in general it is no more than a map from an observation and a level to a confidence set, as described above. For simplicity, in the rest of the paper we refer to $I_i$ as a confidence interval (CI), as it would usually be when $\theta_i$ is real-valued, but with the understanding that everything discussed in this paper applies to the more general case of arbitrary confidence sets.

In our setup, the above rules are all required to be predictable, meaning that

$$S_i, I_i, \alpha_i \ \text{are } \mathcal{F}_{i-1}\text{-measurable},$$

and we write $S_i, I_i, \alpha_i \in \mathcal{F}_{i-1}$. Naturally, the instantiated random variables $S_i(X_i)$ and $I_i(X_i, \alpha_i)$ both depend on $X_i$. However, the rules must be $\mathcal{F}_{i-1}$-measurable, hence specified before observing $X_i$. We emphasize that the requirement to be $\mathcal{F}_{i-1}$-measurable also prevents the rules from depending on $X_1, \dots, X_{i-1}$ unless it is through $S_1, \dots, S_{i-1}$. Importantly, $S_i$ can depend on $\alpha_i$ because both are predictable, and hence $S_i$ can depend on $I_i(X_i, \alpha_i)$, for example on whether or not $I_i$ looks “favorable”, a point to which we will return in later sections.

Using these definitions, we now define an online selective-CI procedure. In the rest of the paper, we omit the term “selective", but this is done only for the sake of readability. Thus, an online CI protocol proceeds as follows:

1. At time $i$, first commit to $\alpha_i$, $S_i$ and $I_i$.

2. Then, observe $X_i$. Decide whether or not $\theta_i$ is selected for coverage by setting $S_i = S_i(X_i)$.

3. Report $I_i(X_i, \alpha_i)$ if $S_i = 1$. Then, increment $i$, and go back to step 1.

We next discuss the metrics used to evaluate the errors made by an online CI protocol.

### 2.1 Error metrics

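Let the unknown false coverage indicator be denoted $V_i := S_i \, \mathbb{1}\{\theta_i \notin I_i\}$. Hence, $V_i = 1$ implies that we intended to cover $\theta_i$ but our reported CI failed to do so. Using the aforementioned terminology, define the false coverage proportion up to time $T$ as

$$\mathrm{FCP}(T) = \frac{\#\,\text{reported intervals that fail to cover their parameter}}{\#\,\text{reported intervals}} = \frac{\sum_{i \le T} V_i}{\sum_{i \le T} S_i},$$

where $0/0 = 0$ per standard convention (i.e., if no intervals are constructed, then the false coverage proportion is trivially zero). The false coverage rate (FCR) and the modified FCR are defined, respectively, as

$$\mathrm{FCR}(T) = \mathbb{E}\left[\frac{\sum_{i \le T} V_i}{\sum_{i \le T} S_i}\right], \qquad \mathrm{mFCR}(T) = \frac{\mathbb{E}\left[\sum_{i \le T} V_i\right]}{\mathbb{E}\left[\sum_{i \le T} S_i\right]}.$$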
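As a small illustration of the definition of the false coverage proportion, including the $0/0 = 0$ convention, here is a minimal sketch; the function name and the list-based inputs are illustrative assumptions.

```python
def false_coverage_proportion(selected, covered):
    """FCP(T) from 0/1 selection flags and True/False coverage flags up to time T."""
    n_selected = sum(selected)
    if n_selected == 0:
        return 0.0  # convention 0/0 = 0: no reported intervals, no false coverage
    # A false coverage event requires both selection and a non-covering interval.
    n_false = sum(1 for s, c in zip(selected, covered) if s == 1 and not c)
    return n_false / n_selected


print(false_coverage_proportion([1, 0, 1, 1], [True, False, False, True]))
```

The FCR is the expectation of this quantity over repeated runs, whereas the mFCR is the ratio of the expected numerator to the expected denominator.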

Along the way, we will consider the relationship of the FCR to other error metrics like the positive FCR (pFCR), the false sign rate (FSR) and the well-known false discovery rate (FDR).

### 2.2 Main objective

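The main objective of this paper is to develop and compare algorithms to specify $\alpha_i$ and $I_i$ such that FCR or mFCR control is guaranteed at any time regardless of the choice of $S_i$, that is,

$$\mathrm{FCR}(T) \le \alpha \quad \forall T \in \mathbb{N}, \qquad \text{or} \qquad \mathrm{mFCR}(T) \le \alpha \quad \forall T \in \mathbb{N}.$$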

Specifically, we explore the following two avenues for constructing the CIs:

1. Marginal CI: this has the guarantee that for any $a \in [0, 1]$, we have

$$\Pr\{\theta_i \notin I_i(X_i, a) \mid \mathcal{F}_{i-1}\} \le a, \tag{1}$$

where the probability is taken only over the marginal measure of $X_i$, because the rule $I_i$ is predictable.

2. Conditional CI: this has the property that for any $a \in [0, 1]$, we have

$$\Pr\{\theta_i \notin I_i(X_i, a) \mid \mathcal{F}_{i-1}, S_i = 1\} \le a, \tag{2}$$

where the probability is taken over the measure of $X_i$ conditional on $S_i = 1$, because $S_i$ is predictable.

(Footnote 1: In defining a conditional CI, one may consider requiring only that $\Pr\{\theta_i \notin I_i(X_i, a) \mid S_i = 1\} \le a$. This weaker condition will suffice for mFCR control, as can be seen from the proofs of our theorems. We chose to use the stronger requirement (2) partly because it is more natural to construct a conditional CI when conditioning on $\mathcal{F}_{i-1}$ along with $S_i = 1$; indeed, our simulations include a typical example where we do not know how to construct a conditional CI with the weaker property, but it is easy to construct one with the stronger property (2).)

For either choice, we must specify the level $\alpha_i$ to use with $I_i$ if $\theta_i$ is selected for coverage.

On accomplishing this main objective, we detail in Section 5 exactly how it enables us to solve several other practical problems of interest, such as controlling the false sign rate. As mentioned in the end of the discussion in Section 7, the entire setup of this paper applies equally well to prediction intervals instead of CIs.

## 3 A method based on conditional inference

A conceptually straightforward method to control the mFCR is to construct conditional CIs at the nominal level $\alpha$. This trivially controls the mFCR at level $\alpha$, as seen by the following argument.

###### Theorem 1.

Constructing a conditional $(1-\alpha)$ CI after every selection ensures that $\mathrm{mFCR}(T) \le \alpha$ for any $T \in \mathbb{N}$.

###### Proof.

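From the definition (2) of a conditional CI it follows immediately that

$$\mathbb{E}[V_i \mid S_i = 1] = \mathbb{E}[\mathbb{1}\{\theta_i \notin I_i\} \mid S_i = 1] = \Pr\{\theta_i \notin I_i \mid S_i = 1\} \le \alpha.$$

Together with the fact that $V_i = 0$ whenever $S_i = 0$, we have

$$\mathbb{E}[V_i \mid S_i] \le \alpha \quad \text{a.s.},$$

and hence,

$$\mathbb{E}\Big[\sum_i V_i\Big] = \sum_i \mathbb{E}[V_i] = \sum_i \mathbb{E}[S_i V_i] = \sum_i \mathbb{E}\big[S_i \, \mathbb{E}[V_i \mid S_i]\big] \le \sum_i \mathbb{E}[\alpha S_i] = \alpha \sum_i \mathbb{E}[S_i] = \alpha \, \mathbb{E}\Big[\sum_i S_i\Big].$$

Comparing the first and last expressions in the display above and rearranging yields the desired result. ∎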

Constructing conditional CIs at the nominal level also ensures that the FCR is controlled. As a matter of fact, even the conditional expectation of the FCP given that at least one selection is made,

$$\mathrm{pFCR}(T) := \mathbb{E}\left[\mathrm{FCP}(T) \;\middle|\; \sum_{i=1}^{T} S_i > 0\right],$$

is controlled when using conditional CIs. We call the above the positive FCR, in analogy to the positive FDR (Storey et al., 2003).

###### Theorem 2.

Constructing a conditional CI after every selection ensures that

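$$\mathrm{pFCR}(T) \le \alpha \qquad \forall T \in \mathbb{N}.$$

###### Proof.

Consider any sequence $s_1, \dots, s_T \in \{0, 1\}$ such that $\sum_i s_i > 0$. We have

$$\begin{aligned}
\mathbb{E}\left[\frac{\sum_i V_i}{\sum_i S_i} \;\middle|\; S_1 = s_1, \dots, S_T = s_T\right]
&= \frac{1}{\sum_i s_i} \, \mathbb{E}\left[\sum_i V_i \;\middle|\; S_1 = s_1, \dots, S_T = s_T\right] \\
&= \frac{1}{\sum_i s_i} \, \mathbb{E}\left[\sum_{\{i \le T :\, s_i = 1\}} \mathbb{1}\{\theta_i \notin I_i\} \;\middle|\; S_1 = s_1, \dots, S_T = s_T\right] \\
&= \frac{1}{\sum_i s_i} \sum_{\{i \le T :\, s_i = 1\}} \Pr\{\theta_i \notin I_i \mid S_1 = s_1, \dots, S_T = s_T\} \\
&\overset{(a)}{=} \frac{1}{\sum_i s_i} \sum_{\{i \le T :\, s_i = 1\}} \Pr\{\theta_i \notin I_i \mid S_1 = s_1, \dots, S_{i-1} = s_{i-1}, S_i = 1\} \\
&\le \frac{1}{\sum_i s_i} \sum_{\{i \le T :\, s_i = 1\}} \alpha = \alpha,
\end{aligned}$$

where equality (a) uses the fact that the later selection decisions $S_{i+1}, \dots, S_T$ are independent of $X_i$ given $S_1, \dots, S_i$, because the selection rules are predictable. The original claim follows by taking expectation over the conditional distribution of $(S_1, \dots, S_T)$ given that $\sum_i S_i > 0$. ∎

We immediately conclude that with conditional CIs we also have

$$\mathrm{FCR}(T) = \mathrm{pFCR}(T) \cdot \Pr\left\{\sum_{i=1}^{T} S_i > 0\right\} \le \mathrm{pFCR}(T) \le \alpha.$$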

Control of the pFCR (and hence FCR) may seem pleasant, but in fact this strong guarantee has a price. Our two main criticisms of the conditional approach are:

1. Incompatibility. Conditional CIs are not able to ensure compatibility between selection decisions and the reported CI. For example, it is impossible to ensure that all selected CIs are sign-determining, meaning that it is impossible to select only those confidence intervals that do not contain 0. This is discussed further and explicitly demonstrated in Subsection 6.2.

2. Intractability. The conditional distribution of $X_i$ given $\mathcal{F}_{i-1}$ and the event $S_i = 1$ is the distribution resulting from restricting $X_i$ to some subset of its sample space, which may be intractable to compute in general. At the very least, the conditional approach requires a case-by-case treatment; depending on the marginal distribution of $X_i$ and the selection rules $S_i$, computing the conditional distribution may be far from trivial.

In the next sections, we describe a marginal approach to controlling the FCR, and elaborate on its various advantages with respect to the aforementioned conditional approach.

## 4 Adjusting marginal intervals: the LORD-CI procedure

In what follows, an algorithm is a sequence of mappings from past selection decisions to confidence levels, meaning that it maps $(S_1, \dots, S_{i-1})$ to $\alpha_i$. By definition, such an $\alpha_i$ is $\mathcal{F}_{i-1}$-measurable, hence a procedure that constructs a marginal confidence interval at level $\alpha_i$ whenever $S_i = 1$ is a legitimate online CI protocol. We will refer to such a procedure as a marginal online CI protocol/procedure. A trivial marginal online CI protocol can be obtained by taking any fixed sequence of $\alpha_i$ such that the series $\sum_i \alpha_i \le \alpha$; this procedure is called alpha-spending in the context of online FDR control by Foster and Stine (2008), and controls the familywise error rate (which in our context is the probability of even a single miscoverage event). Naturally, this is a much more stringent notion of error, and hence the resulting selected CIs will be excessively wide. The question we will address below is the following: is there a nontrivial algorithm to set the $\alpha_i$ so that FCR is controlled?
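To make the alpha-spending baseline concrete, the following minimal sketch fixes the levels in advance as $\alpha_i = \alpha \cdot 6/(\pi^2 i^2)$. This particular sequence is an illustrative assumption (any nonnegative sequence summing to at most $\alpha$ would do), as is the function name.

```python
import math


def alpha_spending_levels(alpha, n_steps):
    """Fixed levels alpha_i = alpha * 6 / (pi^2 * i^2); the infinite series sums to alpha."""
    return [alpha * 6.0 / (math.pi ** 2 * i ** 2) for i in range(1, n_steps + 1)]


levels = alpha_spending_levels(alpha=0.1, n_steps=1000)
print(levels[:3], sum(levels))
```

The levels do not react to the selection history and shrink quickly with $i$, which is why the resulting intervals become very wide.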

### 4.1 mFCR control for arbitrary selection rules

Our first result identifies a sufficient condition for an algorithm to imply mFCR control. Thus, we first associate any algorithm with an estimated false coverage proportion,

$$\widehat{\mathrm{FCP}}(T) := \frac{\sum_{i \le T} \alpha_i}{\left(\sum_{i \le T} S_i\right) \vee 1}.$$

We may then define the following procedure for online FCR control.

###### Definition 1 (LORD-CI procedure).

A LORD-CI procedure is any online protocol that constructs marginal confidence intervals, where the $\alpha_i$ are defined in a predictable fashion to maintain the invariant

$$\forall T \in \mathbb{N}, \quad \widehat{\mathrm{FCP}}(T) \le \alpha, \tag{3}$$

regardless of the selection rules $S_i$.

Any LORD-CI procedure comes with the following theoretical guarantee.

###### Theorem 3.

Given an arbitrary sequence of selection rules made by the user, any LORD-CI procedure has the guarantee that $\mathrm{mFCR}(T) \le \alpha$ for any $T \in \mathbb{N}$.

###### Proof.

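By definition of a false coverage event, we have

$$\mathbb{E}\Big[\sum_i V_i\Big] = \mathbb{E}\Big[\sum_i S_i \, \mathbb{1}\{\theta_i \notin I_i\}\Big] \overset{(a)}{\le} \sum_i \mathbb{E}\big[\mathbb{1}\{\theta_i \notin I_i\}\big] = \sum_i \mathbb{E}\big[\mathbb{E}[\mathbb{1}\{\theta_i \notin I_i\} \mid \mathcal{F}_{i-1}]\big] \overset{(b)}{\le} \sum_i \mathbb{E}[\alpha_i] = \mathbb{E}\Big[\sum_i \alpha_i\Big] \overset{(c)}{\le} \alpha \, \mathbb{E}\Big[\sum_i S_i\Big],$$

where inequality (a) holds because $S_i \le 1$, inequality (b) by the definition (1) of a marginal CI, and inequality (c) by the invariant (3). Rearranging the first and last expressions yields the desired result. ∎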

If one really insisted on requiring FCR control as opposed to mFCR control, we provide a guarantee for a subclass of “monotone” selection rules, as introduced below.

### 4.2 Monotonicity of algorithms, intervals and selection rules

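The symbol $\succeq$ is used to compare vectors coordinatewise, so $a \succeq b$ means that $a_i \ge b_i$ for all $i$.

An online FCR algorithm is called monotone if, for any two vectors of past selection decisions $s \succeq \tilde{s}$, the level it assigns under $s$ is at least the level it assigns under $\tilde{s}$. Equivalently, an online FCR algorithm is monotone if

$$\alpha_i \ge \tilde{\alpha}_i \quad \text{whenever } (S_1, \dots, S_{i-1}) \succeq (\tilde{S}_1, \dots, \tilde{S}_{i-1}), \tag{4}$$

where $\tilde{\alpha}_i$ is the level produced by the online FCR algorithm when presented with the history of selection decisions $(\tilde{S}_1, \dots, \tilde{S}_{i-1})$. We say that a CI rule $I$ is monotone if

$$I(x, a_2) \subseteq I(x, a_1) \quad \text{for all } a_1 < a_2.$$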

Monotonicity is satisfied for most natural (even non-equivariant) CI constructions, and thus we do not view this as a restriction. Irrespective of whether the online FCR algorithm and CI rule are monotone, we say that a selection rule is monotone if

$$S_i \ge \tilde{S}_i \quad \text{whenever } (S_1, \dots, S_{i-1}) \succeq (\tilde{S}_1, \dots, \tilde{S}_{i-1}), \tag{5}$$

where, as before, $\tilde{S}_i$ is used to denote the selection decision at time $i$, for the same observation $X_i$, but for a different history $(\tilde{S}_1, \dots, \tilde{S}_{i-1})$.

As a simple special case, if each rule $S_i$ is independent of $\mathcal{F}_{i-1}$, then such a selection rule is trivially monotone, even if the underlying online FCR algorithm is not. In other words, if the final decision $S_i$ is based only on $X_i$ and on none of the past decisions, then such a rule is monotone. For example, setting $S_i = 1$ for every $i$ constitutes a trivial monotone selection rule.

### 4.3 FCR control for monotone selection rules

We can provide the following guarantee for the nontrivial class of monotone selection rules.

###### Theorem 4.

Given an arbitrary sequence of monotone selection rules chosen by the user, any LORD-CI procedure that maintains the invariant (3) also satisfies $\mathrm{FCR}(T) \le \alpha$ for any $T \in \mathbb{N}$.

###### Proof.

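By definition of the FCR, we have

$$\mathrm{FCR}(T) = \mathbb{E}\left[\frac{\sum_{i \le T} V_i}{\sum_{j \le T} S_j}\right] = \sum_{i \le T} \mathbb{E}\left[\frac{S_i \, \mathbb{1}\{\theta_i \notin I_i\}}{\sum_{j \le T} S_j}\right] \le \sum_{i \le T} \mathbb{E}\left[\frac{\alpha_i}{\sum_{j \le T} S_j}\right], \tag{6}$$

where the sole inequality follows by Lemma 1, introduced after this proof. Thus, we see that

$$\mathrm{FCR}(T) \le \mathbb{E}\left[\frac{\sum_{i \le T} \alpha_i}{\sum_{j \le T} S_j}\right] \le \alpha,$$

where the last inequality holds due to invariant (3). ∎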

The critical step in the aforementioned argument is the invocation of the following powerful lemma.

###### Lemma 1.

Given an arbitrary sequence of monotone selection rules, we have

$$\mathbb{E}\Bigg[\underbrace{\frac{S_i \, \mathbb{1}\{\theta_i \notin I_i\}}{\sum_{j \le T} S_j}}_{A_i}\Bigg] \le \mathbb{E}\left[\frac{\alpha_i}{\sum_{j \le T} S_j}\right].$$

Intuitively, the statement of the above lemma would be obvious if the expectation could be taken separately in the numerator, as if it were independent of the denominator, because $\Pr\{\theta_i \notin I_i \mid \mathcal{F}_{i-1}\} \le \alpha_i$ by construction (1). The following proof demonstrates that monotonicity allows us to formally perform such a step.

###### Proof.

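Without loss of generality, we can ignore the case when $S_i = 0$ almost surely for some $i$; in other words, if we would never select $\theta_i$, then $A_i = 0$ almost surely, and we can just ignore the time instant $i$. Hence, we only consider the case when at least one value of $X_i$ leads to selection.

To derive a bound on $\mathbb{E}[A_i]$, consider the following thought experiment. Let us imagine what selection decisions would have occurred under a slightly different series of observations, namely

$$\tilde{X} := (X_1, X_2, \dots, X_{i-1}, X^*, X_{i+1}, \dots, X_T),$$

where $X^*$ is any value that would have led to selection of $\theta_i$, which is a predictable choice, because it can be made based on only the predictable selection rule $S_i$. Let the sequence of selection decisions made by the same algorithm on $\tilde{X}$ be denoted $\tilde{S}_j$, the levels be denoted $\tilde{\alpha}_j$, and the constructed intervals be $\tilde{I}_j$. We then claim that

$$A_i \equiv \frac{S_i \, \mathbb{1}\{\theta_i \notin I_i\}}{\sum_{j \le T} S_j} = \frac{S_i \, \mathbb{1}\{\theta_i \notin I_i\}}{\sum_{j \le T} \tilde{S}_j} =: \tilde{A}_i,$$

where we have intentionally altered only the denominator. To see that the above equality holds, first note that if $S_i = 0$, then $A_i = \tilde{A}_i = 0$. Then note that if $S_i = 1$, then $S_j = \tilde{S}_j$ for all $j$. Indeed, because $X_j = \tilde{X}_j$ for all $j < i$, the first $i-1$ selection decisions are identical by construction; then if $S_i = 1$ (and $\tilde{S}_i = 1$ by construction), the two selection histories coincide up to time $i$, and so every future selection decision is also identical (and so are the constructed CIs, at levels $\alpha_j = \tilde{\alpha}_j$). Hence,

$$\mathbb{E}[A_i] = \mathbb{E}[\tilde{A}_i] \overset{(a)}{\le} \mathbb{E}\left[\frac{\mathbb{1}\{\theta_i \notin I_i\}}{\sum_{j \le T} \tilde{S}_j}\right] \overset{(b)}{=} \mathbb{E}\left[\frac{1}{\sum_{j \le T} \tilde{S}_j} \, \mathbb{E}\big[\mathbb{1}\{\theta_i \notin I_i\} \mid \tilde{\mathcal{F}}_{n \setminus i}\big]\right] \overset{(c)}{\le} \mathbb{E}\left[\frac{\alpha_i}{\sum_{j \le T} \tilde{S}_j}\right] \overset{(d)}{\le} \mathbb{E}\left[\frac{\alpha_i}{\sum_{j \le T} S_j}\right],$$

where inequality (a) holds because $S_i \le 1$, equality (b) follows because $\sum_{j \le T} \tilde{S}_j$ is $\tilde{\mathcal{F}}_{n \setminus i}$-measurable since $\tilde{S}_i = 1$ by construction, inequality (c) holds by definition (1) of a marginal CI, and inequality (d) holds because $\tilde{S}_j \ge S_j$ for all $j$ by the monotonicity of selection rules. This completes the proof of the lemma. ∎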

The above is a generalization of lemmas that have been proved in the context of online FDR control by Javanmard and Montanari (2018) and Ramdas et al. (2017): here the selection event $\{S_i = 1\}$ may or may not be associated with the miscoverage event $\{\theta_i \notin I_i\}$, whereas in online FDR control the rejection event is obviously directly related to the false discovery event. We will later see that online FCR control captures online FDR control as a special case.

### 4.4 An explicit monotone online FCR algorithm

By the theorems above, the class of procedures in Definition 1 yields mFCR (FCR) control. To obtain a specific procedure, we use the LORD++ online FDR algorithm (Ramdas et al., 2017) to set the sequence of the $\alpha_i$. LORD++ was originally designed to maintain the invariant (3) in the context of testing, i.e., when $S_i$ stands for rejection of the $i$-th null hypothesis. In the absence of $p$-values, our algorithm instead substitutes rejection events by arbitrary selection events ($S_i = 1$). We call the aforementioned adaptation of LORD++ to the context of CIs the LORD-CI algorithm, and refer to the corresponding marginal online CI protocol as the LORD-CI procedure. In the sequel, unless indicated otherwise, whenever we refer to a LORD-CI procedure (or simply LORD-CI), we mean this specific procedure, that is, the protocol utilizing LORD++. An explicit description of LORD-CI is given in Protocol 1 below.

In Protocol 1, $\{\gamma_i\}$ is a deterministic nonincreasing sequence of positive constants summing to one, that is specified in advance; $W_0$ is a prespecified constant; $\{S_i\}$ is a sequence of arbitrary predictable selection rules; and $\{I_i\}$ is now a sequence of marginal CIs, that is, each $I_i$ has the property (1). On implementing Protocol 1, any term of the sequence $\gamma$ whose index is not positive is set to zero. It is easy to verify that $\alpha_i$ (see line 7 in Protocol 1) is monotone, because it has an additional nonnegative term in the summation with every new selection. One may also verify that $\alpha_i$ satisfies the invariant (3), because $\sum_{i \le T} \alpha_i$ is always less than $\alpha \left( \sum_{i \le T} S_i \vee 1 \right)$.
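The level update can be sketched in a few lines of Python. This is a minimal sketch of a LORD++-style rule written from our understanding of Ramdas et al. (2017), with rejections replaced by selections; the specific $\gamma$ sequence, the default values of `alpha` and `w0`, and all function names are illustrative assumptions rather than part of the protocol.

```python
import math


def gamma(k):
    """Hypothetical gamma sequence: 6 / (pi^2 k^2) for k >= 1 (sums to one), else zero."""
    return 6.0 / (math.pi ** 2 * k ** 2) if k >= 1 else 0.0


def lord_ci_level(i, selection_times, alpha=0.1, w0=0.05):
    """Level alpha_i given the times of all selections made strictly before step i."""
    level = gamma(i) * w0
    for j, tau in enumerate(selection_times):
        # The first selection earns (alpha - w0); every later selection earns alpha.
        weight = (alpha - w0) if j == 0 else alpha
        level += weight * gamma(i - tau)
    return level


def run_levels(selections, alpha=0.1, w0=0.05):
    """Levels alpha_1, ..., alpha_T produced for a given 0/1 selection history."""
    times, levels = [], []
    for i, s in enumerate(selections, start=1):
        levels.append(lord_ci_level(i, times, alpha, w0))  # commit before seeing S_i
        if s == 1:
            times.append(i)
    return levels


selections = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0] * 5
levels = run_levels(selections)
print(levels[:5])
```

Each level is computed from past selections only, so it is predictable; and since every selection adds a nonnegative term, a coordinatewise larger history never yields a smaller level.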

## 5 Selections that depend on the candidate CIs

In the LORD-CI procedure, the predictable sequences of selection rules $S_i$ and marginal CI rules $I_i$ are both arbitrary, and these may be specified independently of each other. In this section we demonstrate how tying the selection rule to the confidence interval rule, by letting the (candidate) marginal CI determine whether $\theta_i$ is selected or not, can lead to many instantiations of the LORD-CI procedure that are of practical interest.

Informally, the idea is as follows. Suppose that we made some choice in advance for the marginal CI rule. Suppose also that we have in mind a criterion for what constitutes a “good” reported CI. For example, when $\theta_i$ is real-valued, we might consider a reported CI “good” if it excludes zero. Then at each step $i$, upon observing $X_i$, we pretend that we were to construct $I_i(X_i, \alpha_i)$, where the $\alpha_i$ are set by the LORD-CI algorithm, but we only actually select $\theta_i$ and report the interval if it is “good”. By design, then, we only report “good” intervals. Note that because $I_i$ is predictable, choosing to report an interval only if it is “good” is a predictable selection rule. Therefore we may use LORD-CI to determine levels of coverage and immediately be guaranteed FCR control. This is formalized in the definition below. We remark that these ideas appeared first in Weinstein and Yekutieli (2019), but their treatment is rather informal, and, importantly, their proposed procedure is not an online procedure.

For the remainder of this section, whenever we speak of a CI rule, it will be assumed to be monotone. Again, we do not view this as a restriction.

### 5.1 From coverage to localization

Suppose that for each $i$ we have a collection $K_{i1}, \dots, K_{iL_i}$ of pre-specified disjoint subsets of $\Theta_i$. Being able to say that $\theta_i \in K_{il}$ for exactly one $l$ qualifies as having “localized” the signal. (Footnote 2: If the sets are not disjoint a priori, one may either create a new set for the intersection, or generalize the definition of localization to allow for the reporting of multiple sets.) On observing $X_i$, we must either localize $\theta_i$ by specifying which of $K_{i1}, \dots, K_{iL_i}$ it belongs to, or refrain from making any claim at all about $\theta_i$ (the latter reflecting the decision “not enough evidence to decide”). The corresponding natural notion of error for a given procedure is a false localization rate (FLR), namely the expected ratio of the number of incorrect localizations to the total number of localizations made up to time $T$.

As we will see below, the false localization rate generalizes the false discovery rate.

###### Definition 2 (LORD-CI for localization).

Let $I_i$ be an arbitrary pre-specified monotone marginal CI rule for $\theta_i$, and define $S_i$ as follows:

$$S_i = \begin{cases} 1, & \text{if } I_i = I_i(X_i, \alpha_i) \text{ is a subset of exactly one of } K_{i1}, \dots, K_{iL_i}, \\ 0, & \text{otherwise.} \end{cases}$$

Then LORD-CI for localization is the online CI protocol that applies LORD-CI to the above selection rule, and when $S_i = 1$, it outputs the unique index $l$ such that $I_i \subseteq K_{il}$.

The above procedure comes with the following guarantee.

###### Theorem 5.

The LORD-CI for localization procedure (Definition 2) satisfies $\mathrm{FLR}(T) \le \alpha$ for any $T \in \mathbb{N}$.

###### Proof.

Note that the selection rule in Definition 2 can be rewritten as

$$S_i(X_i, I_i) = 1 \iff X_i \in \{x : I_i(x, \alpha_i) \subseteq K_{il} \text{ for some } l\}, \tag{7}$$

which defines a predictable selection rule because $\alpha_i$ and $I_i$ are predictable. Thus, the procedure in Definition 2 is LORD-CI for a predictable selection rule. Because the CI rules are monotone, and the $\alpha_i$ output by the LORD-CI algorithm are also monotone by construction, we conclude that the selection rule (7) is also monotone according to condition (5). Hence, the procedure in Definition 2 is the LORD-CI procedure for a predictable and monotone selection rule, which controls the FCR by Theorem 4. The last step is to observe that a false localization event implies a false coverage event (but not necessarily the other way around), and hence $\mathrm{FLR}(T) \le \mathrm{FCR}(T) \le \alpha$. ∎

Next we consider some special cases of localization and their implications.

### 5.2 Online composite hypothesis testing with FDR control

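Suppose that we have a sequence of composite null hypotheses that we wish to test:

$$H_{i0} : \theta_i \in \Theta_{0i}, \qquad i = 1, 2, \dots,$$

where $\Theta_{0i} \subset \Theta_i$. For any online testing procedure, let $R_i \in \{0, 1\}$ indicate whether the $i$-th null hypothesis is rejected, and define

$$\mathrm{FDR}(T) := \mathbb{E}\left[\frac{\#\{i \le T : R_i = 1, \ \theta_i \in \Theta_{0i}\}}{\#\{i \le T : R_i = 1\}}\right],$$

which reduces to the usual definition of the FDR when each $\Theta_{0i}$ includes a single value, i.e., when testing point null hypotheses. We can use the procedure of Definition 2 to devise an online testing protocol that controls the FDR.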

###### Definition 3 (LORD-CI for composite testing).

Consider an arbitrary marginal CI rule $I_i$ for each parameter $\theta_i$. We reject the $i$-th composite null hypothesis and set $R_i = 1$ if and only if

$$I_i(X_i, \alpha_i) \cap \Theta_{0i} = \emptyset, \tag{8}$$

where $\alpha_i$ is determined by the LORD-CI procedure using $S_i = R_i$.

The procedure in the definition above comes with the following guarantee.

###### Corollary 1.

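The LORD-CI procedure for composite testing (Definition 3) enjoys $\mathrm{FDR}(T) \le \alpha$ for any $T \in \mathbb{N}$.

###### Proof.

Specialize the prescription in Definition 2 by taking $L_i = 1$ and $K_{i1} = \Theta_i \setminus \Theta_{0i}$. Then, $S_i = 1$ if and only if condition (8) holds, meaning that $S_i = R_i$ for all $i$. The use of LORD-CI guarantees that we have $\mathrm{FCR}(T) \le \alpha$ for any $T \in \mathbb{N}$. Last, we have that $\mathrm{FDR}(T) \le \mathrm{FCR}(T)$ simply because a false discovery implies necessarily that a non-covering CI was constructed. ∎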

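Before proceeding, we would like to point out a connection to existing online FDR testing protocols. We can define a p-value for testing $H_{i0}$ by

$$P_i := \sup\{\alpha : I_i(X_i, \alpha) \cap \Theta_{0i} \neq \emptyset\},$$

where $I_i$ is any monotone CI for $\theta_i$. Indeed, if $\theta_i \in \Theta_{0i}$, then for any level $u$, the event that $P_i$ falls below $u$ implies that $I_i(X_i, u) \cap \Theta_{0i} = \emptyset$ and hence that $\theta_i \notin I_i(X_i, u)$, an event whose probability is at most $u$ by (1).

We can therefore apply an existing online FDR protocol using this definition for a p-value. Note that while the computation of the p-value above might not be trivial, we are really only required at each step to check if $P_i \le \alpha_i$, which is equivalent to rejecting when $I_i(X_i, \alpha_i) \cap \Theta_{0i} = \emptyset$. In fact, if we use the same CI rules and the same algorithm to set the $\alpha_i$ as in the CI procedure employed in Definition 3, we obtain exactly the composite testing procedure of Definition 3.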
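The duality between the CI-based rejection rule and a p-value can be illustrated in the simplest case. The sketch below assumes $X \sim N(\theta, 1)$, the point null $\theta = 0$, and the symmetric marginal CI $x \pm z_{1-a/2}$, in which case the supremum above is the familiar two-sided p-value; the function names are illustrative.

```python
from statistics import NormalDist


def two_sided_p_value(x):
    """Two-sided p-value for H0: theta = 0 when X ~ N(theta, 1)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(x)))


def ci_excludes_zero(x, a):
    """True if the marginal CI x +/- z_{1-a/2} does not contain 0."""
    z = NormalDist().inv_cdf(1.0 - a / 2.0)
    return abs(x) > z


for x in (0.5, 1.0, 2.5, 3.0):
    for a in (0.01, 0.05, 0.2):
        # Rejecting because the CI misses the null agrees with comparing p to a.
        print(x, a, ci_excludes_zero(x, a), two_sided_p_value(x) < a)
```

In this setting, checking whether the candidate interval excludes the null value and comparing the p-value with the current level are the same decision.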

### 5.3 Online sign-classification with FSR control

Sometimes we would like to ask about the direction of the effect rather than test a two-sided null hypothesis. As argued in Gelman et al. (2012) and Gelman and Tuerlinckx (2000), this is often a more sensible question than asking whether a parameter is equal to zero. In fact, even statisticians who use a two-sided test of a point null hypothesis tend to supplement (perhaps with a leap of faith) a rejection of the null with a claim about the sign of the parameter (Goeman et al., 2010, call this post hoc inference of the sign). Inferring the signs of multiple parameters simultaneously was considered at least as early as Bohrer (1979), Bohrer and Schervish (1980) and Hochberg (1986). In the story of Section 1, the management might be interested primarily in identifying which drugs have a positive treatment effect and which drugs have a nonpositive treatment effect. Throughout this subsection suppose that each $\theta_i$ is real-valued, and that the $X_i$ are distributed according to a common likelihood function with parameter $\theta_i$, so that a common CI rule $I$ can be used at all times. (Footnote 3: Note that a CI rule depends on the likelihood function only, not on the true value of $\theta_i$; for lack of a better phrase, we call this situation the “common likelihood” case.)

When considering a sign-classification procedure, we will aim to control—in analogy to the FDR—the expected ratio of number of incorrect directional decisions to the total number of directional decisions made. Throughout this paper, to make a directional decision means to classify

(positive) or (non-positive); because zero is included on one side, this can be considered a weak sign-classification (although the definition is not symmetric, zero can be just as well appended to the positive side instead of the negative side). Hence, a sign-classification protocol is an online procedure that outputs

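$$
D_i = \begin{cases}
1, & \text{if } \theta_i \text{ is classified as positive} \\
-1, & \text{if } \theta_i \text{ is classified as non-positive} \\
0, & \text{if no decision on the sign of } \theta_i \text{ is made.}
\end{cases}
$$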

Borrowing a term from Stephens (2016), we define the false sign rate as

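$$
\mathrm{FSR}(T) := \mathbb{E}\left[\frac{\#\{i \le T : \theta_i \le 0,\ D_i = 1\} + \#\{i \le T : \theta_i > 0,\ D_i = -1\}}{\#\{i \le T : D_i = 1\} + \#\{i \le T : D_i = -1\}}\right].
$$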

It is worth noting that this is slightly different from the definitions of Benjamini et al. (1993), who consider procedures that classify parameters as strictly positive or as strictly negative; however, if there are no parameters that equal zero exactly, then all definitions coincide (which is the case in virtually all realistic situations, see, e.g., Tukey, 1991). A natural procedure to consider is applying any online FDR protocol to test the hypotheses that $\theta_i = 0$, and then classifying each rejection according to the sign of an unbiased estimate of $\theta_i$ (so, for example, when $X_i \sim N(\theta_i, 1)$, a rejected null with $X_i > 0$ entails $D_i = 1$). We will see later that, for example, applying LORD++ to the usual two-sided p-values indeed works; however, FSR control is not automatically guaranteed, i.e., this requires a proof (see Gelman and Tuerlinckx, 2000, who point out caveats in replacing rejections with statements about the signs of the parameters).
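To make the error measure concrete, the ratio inside the expectation defining the FSR can be computed for a single realization as in the minimal sketch below. The function name `false_sign_proportion` is hypothetical, and the convention that the ratio equals 0 when no directional decision is made is an assumption borrowed from the usual FDR convention; the text above does not state it.

```python
def false_sign_proportion(theta, decisions):
    """Realized ratio inside the FSR expectation for one run.

    theta     : true parameter values theta_1, ..., theta_T
    decisions : D_1, ..., D_T, each in {1, -1, 0}
    Assumed convention: the ratio is 0.0 when no decision is made.
    """
    if len(theta) != len(decisions):
        raise ValueError("theta and decisions must have equal length")
    # Denominator: number of directional decisions (D_i = 1 or D_i = -1).
    made = sum(1 for d in decisions if d != 0)
    # Numerator: declared positive but theta_i <= 0,
    # or declared non-positive but theta_i > 0.
    wrong = sum(
        1
        for th, d in zip(theta, decisions)
        if (d == 1 and th <= 0) or (d == -1 and th > 0)
    )
    return wrong / made if made else 0.0
```

Note that a parameter equal to zero counts as correctly classified when `D_i = -1`, reflecting the weak (non-symmetric) sign-classification used here. Averaging this quantity over many simulated runs would give a Monte Carlo estimate of FSR(T).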

We can rely again on the procedure of Definition 2 to devise a sign-classification protocol that controls the FSR. Thus, suppose that we have an arbitrary (common) marginal CI rule $I$. Now specialize the prescription in Definition 2 by taking , . In words, this is the LORD-CI procedure that reports $I_i$ whenever it includes either only positive or only non-positive values. This special case of the procedure in Definition 2 is central enough to merit a separate definition.

###### Definition 4 (Sign-determining LORD-CI procedure).

Suppose that we are in the “common likelihood” case, and let $I$ be any marginal confidence interval procedure, i.e., $\Pr_{\theta_i}\big(\theta_i \in I(X_i, \alpha)\big) \ge 1 - \alpha$ for any $\alpha \in (0, 1)$. Assume that for any parameter $\theta_i$ there is a corresponding “null” value $\theta_{0i}$. The sign-determining LORD-CI procedure associated with $I$ is defined to be the LORD-CI procedure that utilizes the selection rules

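$$
S_i = \begin{cases}
1, & \text{if } \{\tau - \theta_{0i} : \tau \in I(x, \alpha_i)\} \subseteq (0, \infty) \ \text{ or } \ \{\tau - \theta_{0i} : \tau \in I(x, \alpha_i)\} \subseteq (-\infty, 0] \\
0, & \text{otherwise,}
\end{cases} \tag{9}
$$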

and constructs $I_i = I(x, \alpha_i)$ if $S_i = 1$.

For simplicity, assume from now on that $\theta_{0i} = 0$ for all $i$. In that case the sign-determining LORD-CI procedure constructs $I_i$ if and only if this interval is sign-determining, meaning that it includes only positive or only nonpositive values.

Returning to the FSR problem, apply now the CI procedure from Definition 4 with an arbitrary choice of $I$, and set

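$$
D_i = \begin{cases}
1, & \text{if } S_i = 1 \text{ and } I_i \subseteq (0, \infty) \\
-1, & \text{if } S_i = 1 \text{ and } I_i \subseteq (-\infty, 0] \\
0, & \text{if } S_i = 0.
\end{cases} \tag{10}
$$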

Then we have the following result:

###### Corollary 2.

The sign-classification procedure given by (10) enjoys $\mathrm{FSR}(T) \le \alpha$ for any $T \in \mathbb{N}$.

###### Proof.

We have $\mathrm{FSR}(T) \le \mathrm{FCR}(T)$ because a wrong decision on the sign of a parameter necessarily implies that a non-covering CI was constructed, while the number of directional decisions equals the number of constructed CIs. Here, the left-hand side of the inequality is the false sign rate associated with the sign-classification procedure in (10), and the right-hand side is the false coverage rate associated with the sign-determining LORD-CI procedure of Definition 4 (using $\theta_{0i} = 0$). On the other hand, $\mathrm{FCR}(T)$ is controlled as a special case of the procedure of Definition 2, because we assumed that $I$ is monotone. ∎

###### Remark 1.

It is easy to see that we could drop the assumption on monotonicity of the CI rules in this section and still be guaranteed control of the respective modified error rates, for example mFDR in Subsection 5.2 and mFSR in Subsection 5.3 (which would now be implied by mFCR control for the corresponding CI procedure).
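The construction of this subsection can be sketched in a few lines of Python for the normal case with the symmetric interval. This is a minimal illustration under stated assumptions, not the authors' implementation: the observations are taken to be $X_i \sim N(\theta_i, 1)$, the null values are $\theta_{0i} = 0$, the levels $\alpha_i$ follow the LORD++ update as given in Ramdas et al. (2017), and the discount sequence $\gamma_t = 6/(\pi^2 t^2)$ and the initial wealth `w0 = alpha / 2` are hypothetical choices made only for simplicity. All function names are illustrative.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def gamma(t):
    """Assumed discount sequence gamma_t = 6 / (pi^2 t^2), t >= 1 (sums to 1)."""
    return 6.0 / (math.pi ** 2 * t ** 2)


def lordpp_level(t, selection_times, alpha, w0):
    """LORD++ level alpha_t from the times tau_1 < tau_2 < ... of past selections."""
    level = gamma(t) * w0
    for j, tau in enumerate(selection_times):
        # The first selection earns (alpha - w0); later ones earn alpha.
        weight = (alpha - w0) if j == 0 else alpha
        level += weight * gamma(t - tau)
    return level


def sign_determining_lord_ci(xs, alpha=0.1, w0=0.05):
    """Return a list of (D_i, interval or None, alpha_i), one entry per observation."""
    selection_times = []
    results = []
    for t, x in enumerate(xs, start=1):
        a_t = lordpp_level(t, selection_times, alpha, w0)
        half = _STD.inv_cdf(1 - a_t / 2)  # half-width of the symmetric interval
        lo, hi = x - half, x + half
        if lo >= 0:        # open interval (lo, hi) lies inside (0, inf)
            d = 1
        elif hi <= 0:      # open interval (lo, hi) lies inside (-inf, 0]
            d = -1
        else:              # interval crosses zero: no CI, no decision
            d = 0
        if d != 0:
            selection_times.append(t)
        results.append((d, (lo, hi) if d != 0 else None, a_t))
    return results
```

For instance, `sign_determining_lord_ci([5.0, 0.1, -4.0])` makes a positive call at the first step, abstains at the second, and makes a non-positive call at the third; each level $\alpha_t$ depends only on selections made before time $t$, so the sequence is predictable.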

### 5.4 Configuring the sign-determining LORD-CI procedure

The sign-determining LORD-CI procedure was used in the previous subsection as a “wrapper” device to control the FSR for any definition of the CI rules $I$, but it can be of interest to design specific rules since we know that the FSR procedure will only select sign-determining CIs. Indeed, in most realistic situations, it is useful to supplement a directional decision with confidence bounds that are consistent with that decision. For example, if the team of statisticians declares a specific drug to have a positive effect, the management will likely want to know at least how large the effect is, as would be quantified by a nonnegative lower endpoint of a CI.

Thus, ideally, the sign-determining LORD-CI procedure selects and constructs a large number of CIs (meaning that it is “powerful” when translated to an FSR protocol as in Section 5.3) while the lower endpoint for a positive interval is as far away from zero as possible (and, similarly, the upper endpoint for a nonpositive interval is as far away from zero as possible). Unfortunately, these two goals are conflicting in general; see Benjamini et al. (1998), who study the single-parameter case. The tradeoff between these two properties will be controlled here through the choice of the marginal CI rule $I$; thus, the corresponding sign-determining LORD-CI procedure may be seen as the online counterpart of the offline sign-determining multiple testing procedure of Weinstein and Yekutieli (2019). Below, we point out a few concrete examples of CI rules $I$. For the rest of this section assume that $X_i \sim N(\theta_i, 1)$, though the constructions below can be extended beyond the normal case.

1. $I$ is the usual symmetric interval, $I(x, \alpha) = (x - z_{\alpha/2},\, x + z_{\alpha/2})$.

It can be easily verified that the sign-determining LORD-CI procedure with this choice for $I$ selects exactly the set of parameters rejected by the LORD++ online FDR procedure using the usual two-sided p-values

$$
P_i = 2\big(1 - \Phi(|X_i|)\big).
$$

A constructed CI has length $2 z_{\alpha_i/2}$, and is guaranteed to be sign-determining. As a byproduct, if we translate this into a sign-classification procedure as explained in Section 5.3, we conclude that selecting with LORD++ and classifying according to the sign of $X_i$ controls the FSR. In fact, this conclusion still holds if $D_i = -1$ is interpreted as declaring that $\theta_i$ is strictly negative rather than non-positive, because the usual symmetric interval is open.

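2. $I$ is the “one-sided” interval given by

$$
I(x, \alpha) = \begin{cases}
(x - z_\alpha,\, x + z_\alpha), & \text{if } 0 < |x| < z_\alpha \\
(0,\, x + z_\alpha), & \text{if } x > z_\alpha \\
(x - z_\alpha,\, 0], & \text{if } x < -z_\alpha.
\end{cases}
$$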

It can be verified that the sign-determining LORD-CI procedure with this choice for $I$ selects exactly the set of parameters rejected by the LORD++ online FDR procedure using “one-sided” p-values

$$
P_i = 1 - \Phi(|X_i|),
$$

hence is much more powerful when translated into a sign-classification procedure. However, constructed intervals do not have an a priori bound on their length, since they take the form $(0,\, X_i + z_{\alpha_i})$ (if the observation is positive) or $(X_i - z_{\alpha_i},\, 0]$ (if the observation is negative). Perhaps more seriously, a reported interval necessarily touches zero, thus failing to address our follow-up question of at least how big the effect is (a nonpositive interval even includes zero).

3. $I$ is the Modified Quasi-Conventional (MQC) CI of Weinstein and Yekutieli (2019).

The MQC confidence interval is in a sense a compromise between the two choices of $I$ presented above: it determines the sign earlier than the two-sided interval but not as early as the “one-sided” interval. In turn, it leads to more power than LORD++ applied to two-sided p-values when interpreted as a sign-classification procedure, and at the same time separates from zero for large enough $|X_i|$. A mathematical definition of the MQC interval is given in Weinstein and Yekutieli (2019), where its properties are further explained; we include a figure instead to illustrate the properties of that interval. In Figure 1 the endpoints of the MQC interval are shown in solid lines as a function of the observation for . The potential gain in power due to using the MQC interval instead of the symmetric interval is demonstrated in Section 6.
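The stated equivalences between sign-determining selection and p-value thresholding for the first two CI rules can be checked numerically. The sketch below assumes $X_i \sim N(\theta_i, 1)$ and the quantile convention $z_\alpha = \Phi^{-1}(1 - \alpha)$; the function names are hypothetical, and the behavior exactly at $|x| = z_\alpha$ (a probability-zero event) is chosen arbitrarily. The MQC interval is not implemented here, since its definition is given only in Weinstein and Yekutieli (2019).

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def z(alpha):
    """Upper-alpha quantile of the standard normal, z_alpha = Phi^{-1}(1 - alpha)."""
    return _STD.inv_cdf(1 - alpha)


def symmetric_interval(x, alpha):
    """Rule 1: the usual two-sided interval (x - z_{alpha/2}, x + z_{alpha/2})."""
    h = z(alpha / 2)
    return (x - h, x + h)


def one_sided_interval(x, alpha):
    """Rule 2: the 'one-sided' interval, truncated at zero once |x| > z_alpha."""
    h = z(alpha)
    if x > h:
        return (0.0, x + h)   # stands for the interval (0, x + z_alpha)
    if x < -h:
        return (x - h, 0.0)   # stands for the interval (x - z_alpha, 0]
    return (x - h, x + h)


def is_sign_determining(interval):
    """True if the interval holds only positive or only non-positive values."""
    lo, hi = interval
    return lo >= 0 or hi <= 0


def check_equivalence(alpha, xs):
    """Compare interval-based selection with p-value thresholding on a grid."""
    for x in xs:
        p_two = 2 * (1 - _STD.cdf(abs(x)))
        p_one = 1 - _STD.cdf(abs(x))
        if is_sign_determining(symmetric_interval(x, alpha)) != (p_two < alpha):
            return False
        if is_sign_determining(one_sided_interval(x, alpha)) != (p_one < alpha):
            return False
    return True
```

With `alpha = 0.05` and an observation `x = 1.8`, the symmetric interval still crosses zero while the one-sided interval is already sign-determining, which illustrates the power gain of the second rule and, since the reported interval is `(0, 1.8 + z(0.05))`, also its drawback of touching zero.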

## 6 Numerical experiments

### 6.1 Simulations

To examine how the LORD-CI procedure compares to conditional CIs, we carry out numerical experiments where online confidence intervals are constructed under different (predictable) selection schemes. We set and in each of simulation runs, we draw parameters i.i.d. from a mixture

$$
\theta_i = \begin{cases}
10^{-3}, & \text{w.p. } 0.45 \\
-10^{-3}, & \text{w.p. } 0.45 \\
1 + W_i, & \text{w.p. } 0.1,
\end{cases}
$$

where . The mass at $\pm 10^{-3}$ represents the “null” component (essentially zero), while the “nonnulls” are drawn so that large effects are rare. The observations are then drawn as . The $X_i$ are revealed one by one, and a confidence interval is to be quoted whenever a parameter is selected. The LORD-CI procedure uses the sequence of $\alpha_i$ specified by the LORD++ procedure (Ramdas et al., 2017) with “default” choices and , as used in the experiments of Javanmard and Montanari (2018); Ramdas et al. (2017). If not indicated otherwise, the marginal CI used for LORD-CI is the symmetric two-sided interval, and the conditional CI used is the construction from Weinstein et al. (2013, Section 2) obtained by inverting shortest acceptance regions. Table 1 gives quantitative summary statistics (averaged over the replications) for the three simulation examples. Below, we examine the output from a single realization of the experiment for each of the examples.

We begin with a simple selection rule, where a CI is constructed when , i.e., when the size of the current observation exceeds a fixed threshold. Figure 2 shows conditional CIs (red) versus LORD-CI intervals (black) for a single realization. The conditional CIs are considerably shorter than LORD-CI, which seems to be conservative with as compared to about 0.1 for conditional. In particular, the conditional CIs become closer to the marginal two-sided interval as the observation size increases, which would in that sense resemble Bayesian credible intervals for our example (these are not shown in the figure). Both the conditional and LORD-CI intervals may cross zero, as can be seen in the plot. In fact, as many as 53% of the conditional CIs cross zero, and 38% of LORD-CI intervals cross zero. Note that the lower endpoint of the CI is monotone non-decreasing for the conditional intervals, but not for LORD-CI. The conditional intervals seem preferable in this situation.

The second simulation example illustrates a situation where we are interested first in detecting the sign of the parameters, and second in supplementing a directional decision with confidence bounds. For this we implement the sign-determining LORD-CI procedure of Section 5, in other words, is selected whenever the candidate (symmetric) LORD-CI interval excludes zero. Because this amounts to selecting when , and because are independent and predictable, the conditional distribution of given and , is that of a truncated normal, and we can use again the intervals of Weinstein et al. (2013) (the cutoff will now be different for every selection, as opposed to the previous example where it was