1 Introduction
Machine learning systems are increasingly used to make decisions in uncertain environments. Decision-making can be viewed in the framework of hypothesis testing in that a decision is made as the result of a rejection of the null hypothesis [2, 10, 6, 28, 17, 5]. When multiple hypotheses are under consideration, an FDR control procedure provides a way to control the rate of erroneous rejections across a batch of hypotheses in small-scale data sets [3, 24, 25, 4, 32, 16]. However, these procedures typically require the test statistics of all of the hypotheses under consideration so that the p-values may be sorted and a set of hypotheses may be selected for rejection. In many modern problems the test statistics for all the hypotheses may not be known simultaneously, and standard FDR procedures do not apply.
Online FDR methods have recently been developed to address the need for FDR control procedures that maintain control for a sequence of tests when the test statistics are not all known at one time. Tukey and Braun [27] proposed the idea that one starts with a fixed amount of “α-wealth” and, for each hypothesis under consideration, the researcher may choose to spend some of that wealth until it is all gone. Foster and Stine [13] extended α-spending by allowing some return on the expenditure of α-wealth if the hypothesis is successfully rejected. Aharoni and Rosset [1] introduced generalized α-investing and provided a deterministic decision rule to optimally set the α-level for each test given the history of test outcomes. A full review of related work is in Section 1.2.
1.1 Contributions
We extend generalized α-investing to address the problem of online FDR control where the cost of data is not negligible. Our specific contributions are:
• a theoretical analysis of the long-term asymptotic behavior of α-wealth in an α-investing procedure,
• a generalized α-investing procedure for sequential testing that simultaneously optimizes sample size and α-level using game-theoretic principles,
• a non-myopic α-investing procedure that maximizes the expected reward over a finite horizon of tests.
1.2 Related Work
Tukey proposed the notion of α-wealth to control the family-wise error rate for a sequence of tests [26, 27]. Foster and Stine [13] proposed α-investing, an online procedure that controls the marginal FDR (mFDR) for any stopping time in the testing sequence. Aharoni and Rosset [1] introduced generalized α-investing and provided a deterministic decision rule to maximize the expected reward for the next test in the sequence. Recently, there has been much work on online FDR control in the context of A/B testing, directed acyclic graphs, and quality-preserving databases [31, 20]. Javanmard and Montanari [14] first proved that generalized α-investing controls the FDR, not only the mFDR, in an online setting with an algorithm called LORD. Ramdas et al. [18] proposed LORD++ to improve the existing LORD. Recent work leverages contextual information in the data to improve statistical power while controlling FDR offline [29] and online [7]. Ramdas et al. [19] then proposed SAFFRON, which also belongs to the α-investing framework but adaptively estimates the proportion of true nulls. All the aforementioned methods are synchronous, meaning that each test can only start once the previous test has finished. Zrnic et al. [34] extend α-investing methods to an asynchronous setting where tests are allowed to overlap in time. These state-of-the-art online FDR control α-investing methods do not address the needs of testing when the cost of data is not negligible. So, we propose a novel α-investing method for a setting that takes into account the cost of data sample collection, the sample size choice, and prior beliefs about the probability of rejection.
Section 2 provides technical background on generalized α-investing. Section 3 contains a theoretical analysis of the long-term asymptotic behavior of the α-wealth. Section 4 presents a cost-aware generalized α-investing decision rule based on the game-theoretic equalizing strategy. Section 5 presents empirical experiments showing that the cost-aware ERO decision rule improves upon existing procedures when data collection costs are nontrivial. Section 6 presents an analysis of two real data sets from gene expression studies showing that cost-aware α-investing aligns with the overall objectives of the application setting. Finally, Section 7 describes limitations and future work.
2 Background on Generalized α-Investing
Following the notation of Foster and Stine [13], consider m null hypotheses, ${H_{1}},\dots ,{H_{m}}$ where ${H_{j}}\subset {\Theta _{j}}$. The random variable ${R_{j}}\in \{0,1\}$ indicates whether ${H_{j}}$ is rejected, regardless of whether the null is true. The random variable ${V_{j}}\in \{0,1\}$ indicates whether ${H_{j}}$ is both true and rejected. These variables are aggregated as $R(m)={\textstyle\sum _{j=1}^{m}}{R_{j}}$ and $V(m)={\textstyle\sum _{j=1}^{m}}{V_{j}}$. With these definitions, the FDR [3] is
\[ \text{FDR}(m)={P_{\theta }}(R(m)\gt 0)\hspace{2.5pt}{\mathbb{E}_{\theta }}\left[\frac{V(m)}{R(m)}\mid R(m)\gt 0\right],\]
and the marginal false discovery rate (mFDR) is
\[ {\text{mFDR}_{\eta }}(m)=\frac{{\mathbb{E}_{\theta }}\left[V(m)\right]}{{\mathbb{E}_{\theta }}\left[R(m)+\eta \right]}.\]
Setting $\eta =1-\alpha $ provides weak control over the family-wise error rate at level α. Aharoni and Rosset [1] make two assumptions in their development of generalized α-investing. Assumption 2.1 constrains the false positive rate to the level of the test, and Assumption 2.2 places an upper bound of ${\rho _{j}}$ on the power of the test, where
(2.3)
\[ {\rho _{j}}=\underset{{\theta _{j}}\in {\Theta _{j}}-{H_{j}}}{\sup }{P_{{\theta _{j}}}}({R_{j}}=1).\]
A pool of α-wealth, ${W_{\alpha }}(j)$, is available to spend on the j-th hypothesis and is updated as ${W_{\alpha }}(j)={W_{\alpha }}(j-1)-{\varphi _{j}}+{R_{j}}{\psi _{j}}$. A deterministic function ${\mathcal{I}_{{W_{\alpha }}(0)}}$ is an α-investing rule that determines: the cost of conducting the j-th hypothesis test, ${\varphi _{j}}$; the reward for a successful rejection, ${\psi _{j}}$; and the level of the test, ${\alpha _{j}}$:
(2.6)
\[ ({\varphi _{j}},{\alpha _{j}},{\psi _{j}})={\mathcal{I}_{{W_{\alpha }}(0)}}(\{{R_{1}},\dots ,{R_{j-1}}\}).\]
The α-investing rule depends only on the outcomes of the previous hypothesis tests. The Foster-Stine cost depends hyperbolically on the level of the test: ${\varphi _{j}}={\alpha _{j}}/(1-{\alpha _{j}})$.
Generalized α-investing can be viewed in a game-theoretic framework where the outcome of the test (reject or fail-to-reject) is random and the procedure provides the optimal amount of “ante” to offer to play and “payoff” to demand should the test successfully reject. We make use of this game-theoretic interpretation in our contributions in Section 4.1.
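The α-wealth dynamics above can be sketched numerically. This minimal example uses the Foster-Stine cost ${\varphi _{j}}={\alpha _{j}}/(1-{\alpha _{j}})$ and the payout ${\psi _{j}}=\alpha +{\varphi _{j}}$ (so a rejection nets exactly α); the per-test levels and rejection outcomes are illustrative assumptions, not values from the paper:

```python
# Minimal sketch of a Foster-Stine alpha-investing wealth trajectory.
# Levels and rejection outcomes below are illustrative assumptions.
import numpy as np

def wealth_trajectory(alphas, rejections, alpha=0.05, w0=0.05):
    """W(j) = W(j-1) - phi_j + R_j * psi_j, with cost
    phi_j = alpha_j / (1 - alpha_j) and payout psi_j = alpha + phi_j."""
    w = [w0]
    for a_j, r in zip(alphas, rejections):
        phi = a_j / (1.0 - a_j)
        psi = alpha + phi              # a rejection nets exactly alpha
        w.append(w[-1] - phi + r * psi)
    return np.array(w)

alphas = np.full(10, 0.01)
rejections = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0])
W = wealth_trajectory(alphas, rejections)
```

Each non-rejection drains the wealth by ${\varphi _{j}}$, while each rejection raises it by α, which is the dynamic analyzed in Section 3.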
Aharoni and Rosset [1] derive a linear constraint on the reward ${\psi _{j}}$ to ensure that, for a given $({\varphi _{j}},{\alpha _{j}})$, the mFDR is controlled at a level α by ensuring the sequence $A(j)=\alpha {R_{j}}-{V_{j}}+\alpha \eta -{W_{\alpha }}(j)$ is a submartingale with respect to ${R_{j}}$. Note that this constraint is not on the α-wealth process directly. Their constraint is
(2.7)
\[ {\psi _{j}}\le \min \left(\frac{{\varphi _{j}}}{{\rho _{j}}}+\alpha ,\frac{{\varphi _{j}}}{{\alpha _{j}}}+\alpha -1\right).\]
Maximizing the expected reward of the next hypothesis test, $\mathbb{E}({R_{j}}){\psi _{j}}$, leads to the following equality
(2.8)
\[ {\psi _{j}}=\frac{{\varphi _{j}}}{{\rho _{j}}}+\alpha =\frac{{\varphi _{j}}}{{\alpha _{j}}}+\alpha -1.\]
Note that this equality selects the point of intersection of the two parts of the constraint in (2.7). ERO α-investing provides two equations for three unknowns in the deterministic decision rule. Aharoni and Rosset [1] address this indeterminacy by considering three allocation schemes for ${\varphi _{j}}$: constant, relative, and relative200, and suggest that the investigator can explore various options and set ${\varphi _{j}}$ on their own. Further details on these schemes are given in Section 5.
Since the dominant paradigm in testing of biological hypotheses is a bounded finite range for ${\Theta _{j}}$, for the remainder we assume ${\Theta _{j}}=[0,{\bar{\theta }_{j}}]$ for some upper bound, ${\bar{\theta }_{j}}$, and ${H_{j}}=\{0\}$. This scenario may be viewed as a test that the expression for gene j is differentially increased in an experimental condition compared to a control. We consider a simple z-test here for concreteness. The power of a one-sided z-test is $(1-\beta ):=1-\Phi \left({z_{1-\alpha }}+\frac{({\mu _{0}}-{\mu _{1}})}{\sigma /\sqrt{n}}\right)$ where ${z_{1-\alpha }}={\Phi ^{-1}}(1-\alpha )$ is the z-score corresponding to level α, ${\mu _{0}}$ is the expected value of the simple null hypothesis, ${\mu _{1}}$ is the expected value of the simple alternative hypothesis, σ is the standard deviation of the measurements, and n is the sample size.
Using Equation (2.3), the best power under the previously defined ${\Theta _{j}}$ is
(2.9)
\[ {\rho _{j}}=1-\Phi \left({z_{1-{\alpha _{j}}}}-\frac{{\bar{\theta }_{j}}}{{\sigma _{j}}/\sqrt{{n_{j}}}}\right).\]
The best power depends on: (1) the level of the test, ${\alpha _{j}}$; (2) the scale of the bound on the alternative, ${\bar{\theta }_{j}}$; (3) the sample size, ${n_{j}}$; and (4) the measurement standard deviation, ${\sigma _{j}}$. One may compare multiple measurement technologies based on their precision by exploring the effect of changing ${\sigma _{j}}$ — for example, for a fixed budget and all other things equal, a trade-off can be computed between more samples with a higher variance technology, versus fewer samples with a lower variance technology. For the remainder, we assume ${\sigma _{j}}$ is fixed and known. ERO α-investing for Neyman-Pearson testing problems is solved by the nonlinear optimization problem (2.10). Constraints (2.10b) and (2.10c) correspond to (2.7), which controls the mFDR level, and constraint (2.10d) ensures the maximal expected reward for the j-th test. The optimal ERO solution still depends on an external choice of the sample size ${n_{j}}$ and the cost of the test ${\varphi _{j}}$.
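Equation (2.9) can be evaluated directly. A minimal sketch (the values of ${\alpha _{j}}$, ${\bar{\theta }_{j}}$, ${\sigma _{j}}$, and n are illustrative assumptions) shows the diminishing returns of power in the sample size:

```python
# Sketch of the best-power expression (2.9) for a one-sided z-test:
# rho_j = 1 - Phi(z_{1-alpha_j} - theta_bar / (sigma / sqrt(n))).
# Parameter values are illustrative assumptions.
from scipy.stats import norm

def best_power(alpha_j, theta_bar, sigma, n):
    z = norm.ppf(1.0 - alpha_j)
    return 1.0 - norm.cdf(z - theta_bar / (sigma / n ** 0.5))

# Power grows with n at a diminishing rate, e.g. for alpha_j = 0.01:
powers = [best_power(0.01, theta_bar=0.5, sigma=1.0, n=n) for n in (4, 16, 64)]
```

Quadrupling n repeatedly moves the power from roughly a tenth toward one, which is the trade-off that the cost-aware procedure in Section 4 optimizes over.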
3 Long-Term α-Wealth
Since the levels of future tests depend on the amount of α-wealth available at the time of the tests, a theoretical consideration in generalized α-investing is whether the long-term α-wealth is submartingale or supermartingale (stochastically non-decreasing or stochastically non-increasing) for a given decision rule. Most prior works include some implicit consideration of the behavior of the long-term α-wealth. In Zhou et al. [33], which predates the seminal work of Foster and Stine [13], test levels are set such that testing may continue indefinitely, even in the worst-case scenario of no rejections, while still utilizing all initial α-wealth. Foster and Stine [13] discuss strategies for setting the level of the test and provide some examples designed to accumulate α-wealth for future tests. They also discuss the practical and ethical concerns with sorting easily rejected tests first so as to accumulate an arbitrary amount of α-wealth before conducting more uncertain tests. Aharoni and Rosset [1] seek to optimize the expected reward of the current test in an effort to maximize the α-wealth available and, in turn, the levels for future tests. Javanmard and Montanari [14] discuss setting the vector γ such that the power is maximized for a mixture model set a priori for the hypothesis stream. In all of these methods, the motivation is to have sufficient α-wealth to conduct tests with appreciable power in perpetuity. Here we outline two scenarios where the long-term α-wealth can be either submartingale or supermartingale.
In order to state the theorems regarding the α-wealth sequence, we require a lemma bounding α-wealth as a function of the prior probability of the null hypothesis.
Lemma 1.
Given an ${\alpha _{j}}$-level for the j-th hypothesis test from rule $\mathcal{I}({R_{1}},\dots ,{R_{j-1}})$, the expected value of α-wealth for Foster-Stine α-investing is
(3.1)
\[ \begin{aligned}{}{\mathbb{E}^{j-1}}\left[{W_{j}}\right]& \le \\ {} -\frac{{\alpha _{j}}}{1-{\alpha _{j}}}& +\left[{\rho _{j}}-({\rho _{j}}-{\alpha _{j}}){q_{j}}\right]\left(\alpha +\frac{{\alpha _{j}}}{1-{\alpha _{j}}}\right),\end{aligned}\]
where ${\mathbb{E}^{j-1}}\left[{W_{j}}\right]=\mathbb{E}\left[W(j)-W(j-1)|W(j-1)\right]$, and ${q_{j}}=\Pr [{\theta _{j}}\in {H_{j}}]$ is the prior probability (belief) that the j-th null hypothesis is true. In the case of a simple null and alternative ${\Theta _{j}}=\{0,{\bar{\theta }_{j}}\}$, the bound is tight.
We are now in a position to understand dynamical properties of the expected value of the sequence of α-wealth, $\{W(j):j\in \mathbb{N}\}$.
Theorem 1 (Submartingale α-Wealth).
Given a simple null and alternative ${\Theta _{j}}=\{0,{\bar{\theta }_{j}}\}$, $\{W(j):j\in \mathbb{N}\}$ is submartingale (stochastically non-decreasing) with respect to $\{{R_{1}},\dots ,{R_{j-1}}\}$ if
\[ \left[{\rho _{j}}-({\rho _{j}}-{\alpha _{j}}){q_{j}}\right]\left(\alpha +\frac{{\alpha _{j}}}{1-{\alpha _{j}}}\right)\ge \frac{{\alpha _{j}}}{1-{\alpha _{j}}}.\]
Theorem 1 shows that one will be able to conduct an infinite number of tests in the long-term if the power is close to one or the prior probability of the null is close to zero. This scenario may occur when the hypothesis stream contains a large proportion of true alternative hypotheses, or if the sample sizes of the individual tests are large.
Theorem 2 (Supermartingale α-Wealth).
For any null and alternative hypothesis, $\{W(j):j\in \mathbb{N}\}$ is supermartingale (stochastically non-increasing) with respect to $\{{R_{1}},\dots ,{R_{j-1}}\}$ if
\[ \left[{\rho _{j}}-({\rho _{j}}-{\alpha _{j}}){q_{j}}\right]\left(\alpha +\frac{{\alpha _{j}}}{1-{\alpha _{j}}}\right)\le \frac{{\alpha _{j}}}{1-{\alpha _{j}}}.\]
Theorem 2 shows that the generalized α-investing testing procedure will end in a finite number of steps if the power of the test is close to zero or the prior probability of the null hypothesis is close to one. This scenario may occur when the hypothesis stream is made up of a large proportion of true null hypotheses, or if the sample sizes used for each test result in underpowered tests.
These theorems provide general insights for understanding when the α-wealth can be expected to be (stochastically) non-decreasing or non-increasing. The non-decreasing α-wealth sequences require that ${\rho _{j}}\uparrow 1$ for a fixed ${\bar{\theta }_{j}}$ which, in the case of a Gaussian, would require ${\sigma _{j}}/\sqrt{n}\to 0$ or $n\to \infty $. So, the α-wealth grows unbounded if the sample size is unbounded. This theory in combination with the premise of non-trivial experiment costs motivates the need for methods for cost-aware α-investing when the sample size is not fixed.
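A quick numerical check of the expected increment in (3.1) illustrates the two regimes; the parameter values are illustrative assumptions:

```python
# Sketch of the expected one-step change in alpha-wealth from Lemma 1,
# evaluated at the tight (simple null vs. simple alternative) case:
# E^{j-1}[W_j] = -alpha_j/(1-alpha_j)
#               + [rho_j - (rho_j - alpha_j) q_j] * (alpha + alpha_j/(1-alpha_j)).
# Parameter values are illustrative assumptions.
def expected_increment(alpha, alpha_j, rho_j, q_j):
    phi = alpha_j / (1.0 - alpha_j)
    p_reject = rho_j - (rho_j - alpha_j) * q_j   # = q*alpha_j + (1-q)*rho_j
    return -phi + p_reject * (alpha + phi)

# High power, mostly true alternatives: wealth drifts upward (Theorem 1).
up = expected_increment(alpha=0.05, alpha_j=0.01, rho_j=0.95, q_j=0.05)
# Low power, mostly true nulls: wealth drifts downward (Theorem 2).
down = expected_increment(alpha=0.05, alpha_j=0.01, rho_j=0.10, q_j=0.95)
```

The sign of the increment flips between the two parameter settings, matching the submartingale and supermartingale conditions of the theorems.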
If the sequence of hypotheses is under the control of the investigator, a strategy they might employ is to select many hypotheses that are likely to be rejected early so that a large amount of α-wealth can be accumulated and then spent later on hypotheses that are less likely to be rejected, but are still of interest to the investigator. This phenomenon is generally called piggybacking. However, the issue with this strategy is that an investigator behaves differently if the difficult hypothesis is the first in the sequence of tests versus if the difficult hypothesis presents after a long sequence of easy tests. If the sequence of tests is not under the control of the investigator, they may still find themselves in a similar situation merely by a fortunate random ordering of the tests. Figure 1 shows that in ERO investing one can accumulate a large amount of α-wealth and distort the expenditure of wealth for difficult tests that are subsequent to easy tests. ERO investing rejects several true nulls while cost-aware ERO (CAERO) (Section 4) does not exhibit such aggressive testing behavior. Our premise, in this work, is that the investigator should be indifferent, in expectation, as to the position of the difficult hypothesis in the sequence of tests.
Figure 1
An example of piggybacking in an individual experiment where ${q_{j}}=0.01$ for the first 100 hypotheses, and ${q_{j}}=0.95$ for the remaining 100 hypotheses. ERO investing with a relative spending scheme makes 8 false rejections following the change in ${q_{j}}$, while CAERO (Section 4) does not.
4 Cost-Aware Generalized α-Investing
In this section, our development derives from two key differences in assumptions compared to previous work. First, the α-cost of a hypothesis test, ${\varphi _{j}}$, should account for the a-priori probability that the null hypothesis is true as well as the pattern of previous rejections. The value of ${\varphi _{j}}$ dictates bounds on ${\alpha _{j}}$ and ${\rho _{j}}$, which, in combination with ${q_{j}}$, directly impact the behavior of ${W_{\alpha }}$. Consequently, setting ${\varphi _{j}}$ has an impact on the level of future tests. In Section 4.1, we extend the ERO problem from Aharoni and Rosset [1] to include ${\varphi _{j}}$ in the optimization problem. Second, we assume that the per-sample monetary cost to conduct hypothesis tests is not trivial, motivating the need to include the sample size of a test, ${n_{j}}$, in the optimization. This extension is detailed in Section 4.2. In Section 4.3, we present these extensions as a single decision rule, and extend this rule to a finite horizon in Section 4.4.
4.1 Optimizing ${\varphi _{j}}$
A key contribution of this work is a procedure for selecting the amount of α-wealth to commit to a given hypothesis test, ${\varphi _{j}}$. Aharoni and Rosset [1] leave ${\varphi _{j}}$ up to the investigator and provide several ways of selecting it: constant, relative, and relative200. Given a value of ${\varphi _{j}}$, the variables ${\alpha _{j}}$, ${\rho _{j}}$, and ${\psi _{j}}$ are chosen such that the expected reward, ${\mathbb{E}_{\theta }}({R_{j}}){\psi _{j}}$, is maximized. In their simulation studies, the choice of ${\varphi _{j}}$ is such that under the data-generating process one is expected to see one true alternative hypothesis before α-death. This section develops a principled method of setting ${\varphi _{j}}$ via a strategy for a two-player zero-sum game between the investigator and nature.
Suppose that we have a zero-sum game involving two players: the investigator (Player I) and nature (Player II). The investigator has two strategies – to test or to not test a hypothesis. Nature, independent of the investigator, chooses to hide ${\theta _{j}}\in {H_{j}}$ with probability ${q_{j}}$ and ${\theta _{j}}\notin {H_{j}}$ otherwise. The utility function for this game is the change in α-wealth. The payoff matrix for the game is provided in Table 1.
If the investigator chooses not to conduct the test, there is no cost (${\varphi _{j}}=0$) and there is no reward (${\psi _{j}}=0$) regardless of what nature has chosen. So, the change in α-wealth when not conducting a test is zero. If the investigator chooses to conduct a test, and nature has hidden ${\theta _{j}}\in {H_{j}}$, then the payout is $-{\varphi _{j}}$ with probability $1-{\alpha _{j}}$, or $-{\varphi _{j}}+{\psi _{j}}$ with probability ${\alpha _{j}}$. In expectation, this payout is $-{\varphi _{j}}+{\alpha _{j}}{\psi _{j}}$. Similarly, if the investigator chooses to conduct the test and nature has hidden ${\theta _{j}}\notin {H_{j}}$, then the expected payout is $-{\varphi _{j}}+{\rho _{j}}{\psi _{j}}$. The mixed strategy of nature is known to be $({q_{j}},1-{q_{j}})$. What remains to be determined in this game are the unknowns in the payoff matrix, as well as the investigator’s strategy. We choose to set the payoffs such that the expected change in α-wealth is identical for both of the investigator’s strategies. By designing the payoff matrix conditioned on nature’s mixed strategy, such that the investigator has the same expected payoff for both pure (and any mixed) strategies, the investigator’s choice to test or not test a hypothesis has no effect (in expectation) on the ability to perform future tests. With these properties, nature is employing an equalizing strategy.
Table 1
Payoff matrix for posing hypothesis testing as a game against nature. The payouts shown are the expected value of the payout for a given pair of pure strategies.
| | Player II (Nature) | |
| | ${\theta _{j}}\in {H_{j}}$ | ${\theta _{j}}\notin {H_{j}}$ |
Player I (investigator) | Conduct Test | $-{\varphi _{j}}+{\alpha _{j}}{\psi _{j}}$ | $-{\varphi _{j}}+{\rho _{j}}{\psi _{j}}$ |
| Skip Test | 0 | 0 |
The result is that the investigator is indifferent between testing and not testing, and the expected payoff is
(4.1)
\[ {q_{j}}(-{\varphi _{j}}+{\alpha _{j}}{\psi _{j}})+(1-{q_{j}})(-{\varphi _{j}}+{\rho _{j}}{\psi _{j}})=0.\]
Since the expected payoff for not testing is 0, this equation ensures that the expected change in α-wealth when testing is also 0. By definition, this gives a self-contained decision rule that makes the α-wealth martingale, striking a balance between the two scenarios given in Theorem 1 and Theorem 2.
Theorem 3 provides a condition on the power of the test which requires a balance between the ratio of ${\varphi _{j}}$ to ${\psi _{j}}$ and the probability of a false positive.
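Solving (4.1) for the payout gives ${\psi _{j}}={\varphi _{j}}/({q_{j}}{\alpha _{j}}+(1-{q_{j}}){\rho _{j}})$. A minimal sketch (parameter values are illustrative assumptions) verifies that the expected change in α-wealth is then exactly zero:

```python
# Sketch: solving the equalizing constraint (4.1) for the payout,
# psi_j = phi_j / (q_j * alpha_j + (1 - q_j) * rho_j),
# so that the expected change in alpha-wealth is zero whether or not
# the investigator tests. Parameter values are illustrative assumptions.
def equalizing_psi(phi_j, alpha_j, rho_j, q_j):
    return phi_j / (q_j * alpha_j + (1.0 - q_j) * rho_j)

phi, alpha_j, rho, q = 0.01, 0.005, 0.8, 0.9
psi = equalizing_psi(phi, alpha_j, rho, q)

# Expected payoff of testing under nature's mixed strategy (q, 1 - q):
expected_change = q * (-phi + alpha_j * psi) + (1 - q) * (-phi + rho * psi)
```

The expected change matches the skip-test payoff of zero, which is exactly the indifference property described above.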
Allowing ${\varphi _{j}}$ to be a free variable in the optimization problem leads to an infinite number of ERO-class solutions. Selecting the maximum of this class would lead to aggressive play, and in many situations leads to betting the farm or bold play. The martingale constraint, (4.2), reduces the solution space to a single nontrivial solution and a trivial solution where ${\varphi _{j}}={\psi _{j}}={\alpha _{j}}=0$. Furthermore, the fact that the expected reward is constrained to be zero by (4.2) means that the ERO optimization problem is a feasibility problem. Even so, we retain the objective function in the optimization problem in anticipation of the next section, where we allow for variable sample sizes ${n_{j}}$.
The investigator may additionally wish to limit the variance of their wealth process. This can be accomplished by setting an upper bound on the proportion of wealth an investigator may spend for an ante. Furthermore, the investigator may wish to lower bound the power of an individual test. In order to satisfy these conditions, the investigator will collect more samples for an individual hypothesis in order to meet their power requirement. We choose to impose a lower bound constraint on the power for this reason.
4.2 Optimizing ${n_{j}}$
In the previous section ${n_{j}}$ was held fixed; we now consider ${n_{j}}$ as a free variable in our optimization problem. As a result, ${\rho _{j}}$ is no longer completely determined by ${\alpha _{j}}$ and one can increase ${\rho _{j}}$ by increasing ${n_{j}}$. In scientific applications, the investigator is also constrained by sample collection costs, and would not wish to spend excessively on a single hypothesis test. We modify the generalized α-investing decision rule, (2.6), to include a notion of dollar-wealth ${W_{\mathrm{\$ }}}(j)$ available for expenditure to collect data to test the j-th hypothesis
(4.3)
\[ ({\varphi _{j}},{\alpha _{j}},{\psi _{j}},{n_{j}})=\mathcal{I}({W_{\alpha }}(0),{W_{\mathrm{\$ }}}(0))(\{{R_{1}},\dots ,{R_{j-1}}\}),\]
where ${n_{j}}$ is the sample size allocated for testing of the j-th hypothesis. A natural update for the dollar-wealth is ${W_{\mathrm{\$ }}}(j)={W_{\mathrm{\$ }}}(j-1)-{c_{j}}{n_{j}}$ with ${W_{\mathrm{\$ }}}(0)=B$, where ${c_{j}}$ is the per-sample cost for data to test the j-th hypothesis and B is the initial dollar-wealth. Allowing the cost to vary with the hypothesis test enables one to model different experimental methods and cost inflation for long-term experimental plans.
Since ${n_{j}}$ is a free variable under the control of the investigator, we again have many solutions to the ERO problem. Furthermore, as ${n_{j}}$ increases up to the allowable expenditure of sample resources, so does the power and therefore the expected reward. Theoretically, the optimal solution allocates the entire available sample budget even though the marginal increase in power, and thus the expected reward, is vanishingly small for large ${n_{j}}$. To address this issue, we modify the objective function to include a small penalty for increasing sample size, $-\lambda {c_{j}}{n_{j}}$, where λ is chosen by the investigator. The solution is fairly robust to the value of λ and this reformulation provides a unique optimal value.
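To see why the penalty $-\lambda {c_{j}}{n_{j}}$ yields a unique interior optimum, consider a toy sketch with the simplified surrogate objective ${\rho _{j}}(n)-\lambda {c_{j}}n$ (the surrogate and all parameter values are our assumptions, not the paper's full objective): power (2.9) has diminishing returns in n, so the penalized objective is eventually decreasing.

```python
# Sketch: power (2.9) minus a linear sampling penalty has an interior
# maximum. The surrogate objective and parameter values are illustrative
# assumptions, not the full cost-aware ERO objective.
import numpy as np
from scipy.stats import norm

def rho(n, alpha_j=0.01, theta_bar=0.5, sigma=1.0):
    return 1.0 - norm.cdf(norm.ppf(1.0 - alpha_j)
                          - theta_bar * np.sqrt(n) / sigma)

lam, c = 0.002, 1.0
ns = np.arange(1, 201)
objective = rho(ns) - lam * c * ns
n_star = int(ns[np.argmax(objective)])  # interior optimum, not the budget cap
```

The argmax lands strictly inside the sample budget rather than at the cap, illustrating why the penalized problem has a unique optimal ${n_{j}}$.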
4.3 Cost-Aware ERO Decision Rule
The investigator’s goal is to conduct as many tests as possible while rejecting as many true alternatives as possible and maintaining control of the false discovery rate. Incorporating the methods for optimizing ${\varphi _{j}}$ and ${n_{j}}$ into the ERO problem yields a self-contained decision rule in the form of (4.3),
Constraints (4.6b) and (4.6c) ensure control over the mFDR. Constraints (4.6e) and (4.6f) connect the level, power, and sample size of the test. Constraints (4.6g) and (4.6h) ensure the existing $(\alpha ,\mathrm{\$ })$-wealth is not exceeded. The parameter $a\in (0,1]$ controls the proportion of α-wealth that a single test can be allocated. Constraint (4.6i) ensures nature’s strategy is equalizing. Written out explicitly,
\[ \mathbb{E}\left[\Delta {W_{\alpha }}\right]=(-{\varphi _{j}}+{\psi _{j}})\Pr [{R_{j}}=1]+(-{\varphi _{j}})\Pr [{R_{j}}=0].\]
A pseudo-code algorithm of the full cost-aware ERO method and further extensions to cost-aware ERO are described in Appendix C.
Lemma 2.
The cost-aware ERO decision rule ensures that α-wealth, ${W_{\alpha }}(j)$, is martingale with respect to $\{{R_{1}},\dots ,{R_{j-1}}\}$.
Constraint (4.6i) sets the expected change in α-wealth equal to zero. This enforces that ${W_{\alpha }}(j)$ is martingale. Allowing ${W_{\alpha }}(j)$ to be submartingale, as per Theorem 1, can lead to a situation where hypotheses are tested at high α-levels due to the accumulated ${W_{\alpha }}$ from previous rejections. This is referred to as piggybacking in the literature when such accumulated wealth leads to poor decisions [18]. On the other hand, allowing ${W_{\alpha }}(j)$ to be supermartingale, as per Theorem 2, causes the testing to end, and is referred to as α-death in the literature. Using a game-theoretic formulation allows us to propose an expected-reward optimal procedure which considers preventing α-death and piggy-backing.
Constraint (4.6i) only controls the expected increment in ${W_{\alpha }}$. It is well known that martingale betting strategies can suffer from gambler’s ruin. Since no bounds are set on the worst-case scenario, which in this case is when ${R_{j}}=0$, it is possible that we could set ${\varphi _{j}}={W_{\alpha }}(j-1)$ and suffer α-death. This occurs, for example, when ${q_{j}}$ is close to 0, and ${\mu _{A}}$ and ${\sigma _{j}}$ allow for ${\rho _{j}}\to 1$. In such a case, a rejection is almost certain, and in turn, so is receiving the reward ${\psi _{j}}$. Recall that we restrict ourselves to an ERO solution, and thus we can interpret constraint (4.6i), without the factor a, as setting ${\varphi _{j}}$ to the expected reward – the quantity that we are maximizing. In order to keep ${W_{\alpha }}$ martingale, this almost guaranteed upside must be counteracted by a devastating downside. In order to avoid α-death, we add a factor a which limits the maximal bet, preventing the investigator from betting the farm. In simulation studies, we found that setting $a=0.025$ gave good results. We require ${\rho _{j}}\ge 0.9$ when n is not constrained.
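Since Problem (4.6) is not reproduced in this excerpt, the following is only a rough grid-search sketch of a cost-aware selection consistent with the constraints quoted in the text: the equalizing constraint (4.1) (under which the expected reward $\Pr [{R_{j}}=1]{\psi _{j}}$ equals ${\varphi _{j}}$), the mFDR bound (2.7), the cap ${\varphi _{j}}\le a{W_{\alpha }}$, the dollar budget, and the sample-size penalty. The function name `caero_step`, the discretization, and all parameter values are our assumptions.

```python
import numpy as np
from scipy.stats import norm

def caero_step(q, theta_bar, sigma, W_alpha, W_dollar, c,
               alpha=0.05, a=0.025, lam=1e-6, rho_min=0.9):
    """Coarse grid search over (alpha_j, n_j). Under the equalizing
    constraint (4.1), psi_j = phi_j / P(R_j = 1), so the expected reward
    equals phi_j; maximize phi_j - lam*c*n subject to (2.7) and the caps."""
    best, best_val = None, -np.inf
    for alpha_j in np.linspace(1e-4, alpha, 60):
        for n in range(1, int(W_dollar // c) + 1):
            rho = 1.0 - norm.cdf(norm.ppf(1.0 - alpha_j)
                                 - theta_bar * np.sqrt(n) / sigma)
            p = q * alpha_j + (1.0 - q) * rho      # P(R_j = 1)
            if rho < rho_min or not (alpha_j < p < rho):
                continue
            # (2.7) with psi = phi / p gives an interval of feasible phi:
            phi_hi = min(a * W_alpha, alpha * p * rho / (rho - p))
            phi_lo = (1.0 - alpha) * alpha_j * p / (p - alpha_j)
            if phi_hi < phi_lo:
                continue
            val = phi_hi - lam * c * n             # expected reward - penalty
            if val > best_val:
                best_val = val
                best = dict(alpha_j=alpha_j, n=n, phi=phi_hi,
                            psi=phi_hi / p, rho=rho)
    return best

sol = caero_step(q=0.9, theta_bar=0.5, sigma=1.0,
                 W_alpha=0.05, W_dollar=200.0, c=1.0)
```

In this toy run the high prior probability of the null forces a small ${\alpha _{j}}$ and a larger ${n_{j}}$ to meet the power floor, mirroring the qualitative behavior described for the decision rule.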
4.4 Finite Horizon Cost-Aware ERO α-Investing
Figure 2
Extensive form of the two-step game between the Investigator (Player I) and Nature (Player II). Strategies for each player are italicized. The leaves are labeled to denote the strategy taken by the investigator and are enumerated for equations presented in Appendix G.
The standard ERO framework optimizes only the one-step expected return, ${\mathbb{E}_{\theta }}({R_{j}}){\psi _{j}}$. But, when tests are expensive, it is logical to consider the expected return after two (or more) tests. We consider ${q_{j}}$ to be known, and extend the game-theoretic framework to a finite horizon of decisions. The extensive form of the game between nature, who hides ${\theta _{j}}$ in the null or alternative hypothesis region, and the investigator, who seeks to find ${\theta _{j}}$ and gain the reward for doing so, is shown in Figure 2. We note that sequential two-step cost-aware ERO investing is a different problem than batch ERO investing because two-step investing accounts for the expected change in $(\alpha ,\mathrm{\$ })$-wealth after each step, while batch cost-aware ERO only receives the payoff at the conclusion of all of the tests in the batch.
The two-step objective function is
(4.7)
\[ \begin{aligned}{}\mathbb{E}({R_{j}}{\psi _{j}}& +{R_{j+1}}{\psi _{j+1}})=\mathbb{E}({R_{j}}){\psi _{j}}+\\ {} & {\psi _{j+1}}[P({R_{j}}=0)\mathbb{E}({R_{j+1}}|{R_{j}}=0)+\\ {} & P({R_{j}}=1)\mathbb{E}({R_{j+1}}|{R_{j}}=1)]\end{aligned}\]
with constraints (4.6b)-(4.6i) from Problem (4.6) remaining for steps j and $j+1$. Designing the game so that nature’s strategy is an equalizing strategy results in a system of equations (Appendix G) that form constraints in the ERO optimization problem. It is worth noting that such a game simplifies to the standard cost-aware ERO method defined in (4.6) when ${W_{\mathrm{\$ }}}\gg 0$. This holds since the parameters of the second test depend on the expected α-wealth available at that step. When the expected increment is 0, as set in constraint (4.6i), and when the available dollar-wealth is not scarce, each step is equivalent to the optimization for the first test. When this constraint is lifted, or when the available ${W_{\mathrm{\$ }}}$ is low, the finite horizon solution differs from the single-step solution.
5 Synthetic Data Experiments
Experimental Settings To compare our method with state-of-the-art related methods, we generate synthetic data as described in Aharoni and Rosset [1]. The synthetic data is composed of $m=1000$ possible hypothesis tests. For the j-th test, the true state of ${\theta _{j}}$ is set to the null value of 0 with probability ${q_{j}}\sim \text{Unif}(0.85,0.95)$ and otherwise set to 2. A set of ${n_{j}}=1000$ potential samples ${({x_{ji}})_{i=1}^{{n_{j}}}}$ was generated i.i.d. from a $\mathcal{N}({\theta _{j}},1)$ distribution. For each hypothesis test, the z-score was computed as ${z_{j}}=\frac{1}{\sqrt{{n_{j}^{\ast }}}}{\textstyle\sum _{i=1}^{{n_{j}^{\ast }}}}{x_{ji}}$, where ${n_{j}^{\ast }}$ is the method-specific sample size described in Table 2, and the one-sided p-value was computed. The methods were tested on $10,000$ realizations of this simulation data generation mechanism. Pseudo-code, as well as other implementation details, for this simulation can be found in Appendix B.
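A minimal sketch of one realization of this data-generating mechanism follows; the fixed value of `n_star` here is an illustrative stand-in for the per-method sample-size choice:

```python
# One realization of the synthetic mechanism described above.
# The fixed n_star is an illustrative stand-in for a method's choice.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, n, n_star = 1000, 1000, 10

q = rng.uniform(0.85, 0.95, size=m)            # prior probability of a true null
theta = np.where(rng.random(m) < q, 0.0, 2.0)  # null value 0, alternative value 2
x = rng.normal(theta[:, None], 1.0, size=(m, n))

# One-sided z-score from the first n_star samples of each stream:
z = x[:, :n_star].sum(axis=1) / np.sqrt(n_star)
pvals = 1.0 - norm.cdf(z)
```

Under the null the z-scores are standard normal, so the one-sided p-values are uniform for true nulls and concentrate near zero for true alternatives.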
5.1 Comparison to State-of-the-Art Methods
Table 2 compares our method, cost-aware ERO, with related state-of-the-art α-investing methods: α-spending [27], α-investing [13], α-rewards [1], ERO investing [1], LORD [14, 18], and SAFFRON [19]. The table is indexed by the allocation scheme (Scheme) and the reward method (Method). The allocation scheme determines the value of ${\varphi _{j}}$ at each step, which in many cases is left to user discretion. We implement the φ-allocation schemes proposed by Aharoni and Rosset [1]. The constant scheme allocates a fixed amount of α-wealth for each test; the relative scheme allocates an amount proportional to the remaining α-wealth and continues until ${W_{\alpha }}(j)\lt (1/1000){W_{\alpha }}(0)$; the relative200 scheme follows the same proportional steps as the relative scheme but always performs 200 tests [1]. The results from our implementation of these methods match or exceed previously reported results.
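The two simplest schemes can be sketched as follows; the constant amount and the relative fraction `b` are placeholders, since the exact values used by Aharoni and Rosset are not reproduced in this section:

```python
# Sketch of the phi-allocation schemes. The constant amount and the
# relative fraction b are placeholder assumptions, not the exact
# values used by Aharoni and Rosset.
def phi_constant(w_alpha, amount=0.005):
    """Constant scheme: the same ante for every test."""
    return amount

def phi_relative(w_alpha, w0, b=0.05):
    """Relative scheme: spend a fraction b of the remaining alpha-wealth,
    stopping once W_alpha(j) < W_alpha(0) / 1000."""
    if w_alpha < w0 / 1000.0:
        return 0.0
    return b * w_alpha
```

The relative scheme shrinks its ante geometrically with the remaining wealth, which is why it sustains many more tests than the constant scheme in Table 2.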
Table 2
Comparison of cost-aware α-investing with state-of-the-art sequential hypothesis testing methods. Values for Tests, True Rejects, and False Rejects are the average across 10,000 iterations, and these estimates are used to compute the mFDR. All methods are constrained to use 1000 samples at most per iteration. For comparison, we include LORD++ with the optimal sample size of $n=3$. However, the optimal sample size for LORD++ was selected by observing the number of true rejects for ${n_{j}}\in [1,10]$, and this information would not be available to an investigator. The optimal value of ${n_{j}^{\ast }}$ for cost-aware ERO, however, was predictable from the observed data.
Scheme | Method | Tests | True Rejects | False Rejects | mFDR
constant | α-spending | 10.0 | 0.28 | 0.04 | 0.033
constant | α-investing | 16.0 | 0.44 | 0.07 | 0.045
constant | α-rewards $k=1$ | 14.6 | 0.40 | 0.06 | 0.043
constant | α-rewards $k=1.1$ | 16.3 | 0.43 | 0.06 | 0.043
constant | ERO investing | 18.9 | 0.53 | 0.08 | 0.051
relative | α-spending | 66.0 | 0.55 | 0.04 | 0.028
relative | α-investing | 81.8 | 0.87 | 0.09 | 0.045
relative | α-rewards $k=1$ | 81.1 | 0.85 | 0.08 | 0.043
relative | α-rewards $k=1.1$ | 80.8 | 0.82 | 0.08 | 0.041
relative | ERO investing | 83.2 | 0.93 | 0.09 | 0.045
other | LORD++ | 1000.0 | 2.06 | 0.07 | 0.022
other | LORD1 | 1000.0 | 1.46 | 0.03 | 0.014
other | LORD2 | 1000.0 | 1.97 | 0.06 | 0.020
other | LORD3 | 1000.0 | 1.99 | 0.07 | 0.024
other | SAFFRON | 1000.0 | 1.28 | 0.07 | 0.031
cost-aware | ERO ${n_{j}}=1$ | 953.0 | 4.23 | 0.12 | 0.023
cost-aware | ERO ${n_{j}^{\ast }}$ | 225.7 | 19.11 | 0.22 | 0.011
other | LORD++ ($n=3$) | 334.0 | 22.54 | 0.79 | 0.032
ERO investing yields more true rejects than α-spending, α-investing, and both α-rewards methods. The LORD variants and SAFFRON perform the maximum number of tests while maintaining control of the mFDR. For the use scenarios considered in the LORD and SAFFRON papers (large-scale A/B testing), this is optimal: tests are nearly free and the goal is to keep testing while maintaining mFDR control. The cost-aware ERO setting is different and more applicable to biological experiments, where one aims to maximize a limited budget of tests to achieve as many true rejects as possible while controlling the mFDR. Increasing the sample size capacity for each test enables cost-aware ERO to achieve higher power with fewer tests than the current state-of-the-art methods with $n=1$. For a fair comparison, we varied n for LORD++ and include the sample size that maximized the number of true rejections. This selection was performed after running all considered sample sizes; an investigator would not have access to such information in a real experiment. Relaxing the restriction on sample size enables cost-aware ERO to allocate an adaptive number of samples based on the prior of the null as well as the available budget. Appendix D shows the comparisons for $q=0.1$ and Appendix F shows comparisons with all of the other methods set to ${n_{j}}=10$. Our cost-aware ERO method with $n=1$ performs more tests and rejects more false null hypotheses than all competing methods at $n=1$.
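As a sanity check, the mFDR column of Table 2 can be reproduced from the reported averages, assuming the common definition $\text{mFDR}_{1}=\mathbb{E}[V]/(\mathbb{E}[R]+1)$, where $V$ is the number of false rejections and $R$ the total number of rejections; the values agree to within rounding of the tabulated averages.

```python
def mfdr(true_rejects, false_rejects, eta=1.0):
    """mFDR_eta = E[V] / (E[R] + eta), with R = V + (true rejects)."""
    return false_rejects / (true_rejects + false_rejects + eta)

# (avg true rejects, avg false rejects, reported mFDR) from Table 2
rows = {
    "ERO investing (constant)": (0.53, 0.08, 0.051),
    "LORD++":                   (2.06, 0.07, 0.022),
    "cost-aware ERO n*":        (19.11, 0.22, 0.011),
}
for name, (tr, fr, reported) in rows.items():
    # reproduced to within rounding of the averaged counts
    assert abs(mfdr(tr, fr) - reported) < 0.005, name
```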
Figure 3
Power, mFDR, and mean number of samples per test for cost-aware ERO and existing methods ($n=1$) with random ${q_{j}}\sim \text{Beta}$.
It is worth noting that ERO and cost-aware ERO with ${n_{j}}=1$ are still quite different despite the restriction on sample size. We can view the difference in performance between these two methods as the benefit of allocating ${\varphi _{j}}$ using our game-theoretic framework. Our decision rule incorporates prior knowledge of the probability that the null hypothesis is true and aims to maintain α-wealth (as a martingale). The experimental setup of Aharoni and Rosset [1] implicitly leverages similar prior knowledge in the spending schemes proposed. All spending schemes proposed in Aharoni and Rosset [1] allow us to test at least one true alternative, in expectation, at which point the α-wealth should increase. This increase in α-wealth should then sustain testing until another true alternative appears. In the cost-aware ERO optimization problem, however, this information is explicitly accounted for, which helps us avoid the situations described in Theorem 1 and Theorem 2. By restricting ${n_{j}}=1$, we have effectively limited our ability to inflate ${\rho _{j}}$ with a large sample size and to influence ${W_{\alpha }}(j)$ toward being a submartingale. On the other hand, nature's equalizing strategy limits the expected payout to 0 by limiting the size of ${\varphi _{j}}$, preventing the experimenter from experiencing α-death quickly, as seen in the constant spending scheme.
5.2 Computation and Implementation
In our experiments, for one set of $1,000$ potential hypothesis tests, ERO investing, cost-aware ERO, and finite-horizon cost-aware ERO each take $\sim 30$ seconds on a single 2.5 GHz core with 16 GB of RAM. The nonlinear optimization problem was solved using CONOPT [11]. Because the solver depends on initial values and heuristics to identify an initial feasible point, it occasionally failed to find a locally optimal solution; in these instances, the solver was restarted 10 times, and if it failed on all restarts the iteration was discarded. Out of $10,000$ data sets, at most 27 iterations were discarded (for ${n_{j}}=1$). Code to replicate these experiments is available at https://github.com/ThomasCook1437/cost-aware-alpha-investing.
5.3 Random Prior of the Null Hypothesis
One of the benefits of incorporating a notion of sampling cost into the hypothesis testing problem is the ability to allocate resources based on the prior probability of the null, q. We generated simulation data as previously described, except that the prior probability of the null hypothesis is drawn at random from ${q_{j}}\sim \text{Beta}(a,b)$ with $a+b=100$, using $2,500$ independent realizations of the data. Appendix B contains pseudo-code and further implementation details. Figure 3(a-c) shows the power, mFDR, and mean number of samples per test as a function of $\mathbb{E}[{q_{j}}]$. The results show that cost-aware ERO α-investing achieves high power while maintaining control of the mFDR. A key result of this experiment is that if it is not possible to collect as many samples as the optimization problem yields, the investigator may choose not to perform the test at all and instead wait for a test (with associated prior) that does yield an optimal sample size within the budget, or may allow the α-wealth ante to adjust to the bound on the sample size. This often occurs for large values of ${q_{j}}$, which by Theorem 2 influence ${W_{\alpha }}(j)$ toward behaving as a supermartingale. Cost-aware ERO compensates by increasing ${\rho _{j}}$ through the sample size, ${n_{j}}$, and will expend the available ${W_{\mathrm{\$ }}}$, as the optimization only considers a single step. It is worth noting that when $\mathbb{E}[{q_{j}}]$ is close to 1, cost-aware ERO with $n=1$ maintains power better than other methods. This can be attributed to the allocation scheme induced by constraint (4.6i): the value of ${\varphi _{j}}$ is kept small so that multiple false null hypotheses are tested at an appreciable level, allowing α-wealth to be earned and testing sustained. This regime is common in biological applications, where false null hypotheses can be rare.
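A minimal sketch of this prior-generation step, assuming the Beta parameters are chosen as $a=100\,\mathbb{E}[{q_{j}}]$ and $b=100-a$ so that $a+b=100$ (the helper name is hypothetical):

```python
import random

def sample_priors(mean_q, m=1000, concentration=100.0, seed=0):
    """Draw q_j ~ Beta(a, b) with a + b = concentration and
    a / (a + b) = mean_q, one draw per hypothesis test."""
    a = concentration * mean_q
    b = concentration - a
    rng = random.Random(seed)
    return [rng.betavariate(a, b) for _ in range(m)]
```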
5.4 Sensitivity to the Prior
Cost-aware ERO makes the prior on the null explicit, while the dependence on the probability of null hypotheses in the sequence is more implicit in other methods. An important question, then, is how sensitive the method is to misspecification of ${q_{j}}$. To address this question, we consider two types of misspecification across the hypothesis sequence: variance with a correct expectation, and bias (Appendix E). We find that cost-aware ERO is robust to variance in q when the expectation is correct. This is likely because cost-aware ERO is relatively conservative in its allocation of $(\alpha ,\mathrm{\$ })$-wealth, so the method has the opportunity to recover from losses due to misspecification. However, cost-aware ERO is sensitive to a biased specification of ${q_{j}}$. If the true probability of the null is 0.9 on average and $q=0.85$ is used in cost-aware ERO, 273 fewer tests are performed compared to using the correct value of q. Essentially, the downward bias in the assumed q causes cost-aware ERO to be more aggressive in spending α-wealth than it should be. In practice, this effect could be mitigated by ensuring that α-wealth spending is conservative or by giving a margin of safety to the assumed value of ${q_{j}}$. A more principled solution would employ a robust optimization formulation of cost-aware ERO or online learning for the q process. While such modifications are outside the scope of this paper, they are of great interest.
5.5 Finite-Horizon Cost-Aware ERO Investing
To test whether extending the horizon of the reward to be maximized would enable better decisions for $(\alpha ,\mathrm{\$ })$-wealth allocation, we varied the length of the horizon considered in the cost-aware ERO investing decision rules. In general, the optimal values returned are identical across the decision rules. This is especially visible at the beginning of the testing process. Discrepancies occur when ${W_{\mathrm{\$ }}}$ is sufficiently low that repeatedly applying the one-step cost-aware ERO decision rule would expend all of ${W_{\mathrm{\$ }}}$ before the final test in the finite horizon. This occurs when the finite horizon is set to a large number of steps or when the experiment is near the end of its funding. We also observed that our solver became less stable as the length of the horizon increased.
Figure 4
Power, mFDR, and mean number of samples per test for finite horizon cost-aware ERO with random ${q_{j}}\sim \text{Beta}$. A larger horizon corresponds to a greater number of future tests considered in the optimization process.
As seen in Figure 4, extending the horizon to include more tests results in the same allocation of samples. For $\mathbb{E}[q]=0.9$, which we consider most applicable to biological applications, Table 3 confirms that the solutions for different horizons are identical, with discrepancies occurring due to computational constraints.
Table 3
Varying the size of the finite horizon when ${q_{j}}\sim \text{Beta}(90,10)$. Values displayed correspond to the mean across 2,500 repetitions.
Horizon | Tests | True Rejects | False Rejects | mFDR
2 | 191.0 | 16.25 | 0.31 | 0.018
3 | 187.0 | 15.90 | 0.30 | 0.018
4 | 181.0 | 15.37 | 0.30 | 0.018
5 | 177.3 | 15.04 | 0.29 | 0.018
Note that this technique optimizes the parameters of each test based on the expected wealth available at that time. The parameters set for future tests will never truly be attained. These results demonstrate that our principled $({W_{\alpha }},{W_{\mathrm{\$ }}})$ spending strategy considering one step sufficiently captures the effect of the current test on our future tests. The martingale constraint enables us to conduct tests so that future tests remain powerful, and we do not benefit from adding additional information to our optimization problem. These results simultaneously suggest that an extended horizon may be appropriate for contexts where the optimization objective is not restricted to the expected reward or where the martingale constraint is not set for each individual test.
6 Real Data Experiments
Biological experiments are typically settings in which sample costs are non-trivial, the proportion of false null hypotheses is small, and the overall number of tests is large. Our methods were compared to the ERO method on two gene expression data sets. The results show that the cost-aware method performs more tests and makes more rejections while spending fewer samples than a method that cannot adapt the sample size. As there is no ground truth for these data sets, we are unable to compare the number of true rejections.
6.1 Prostate Cancer Data
Gene expression data were collected to investigate the molecular determinants of prostate cancer [9]. The data set contains 50 normal samples and 52 tumor samples, and each sample is an $m=6033$-dimensional vector of gene expression levels. The data set has been normalized and log-transformed so that the data for each gene are roughly Gaussian. Let the empirical mean and standard deviation of the log-transformed normal samples be denoted ${\hat{\mu }_{j}}$ and ${\hat{\sigma }_{j}}$, respectively, and let the log-transformed tumor data be denoted ${x_{j}}\in {\mathbb{R}^{52}}$. The goal is to test whether tumor gene expression is increased relative to the normal samples.
We use this data set to simulate a sequential testing scenario across genes. To estimate a prior for the null hypothesis for each gene, a logistic function was fit to only the first two samples, ${\hat{q}_{j}}=1-{\left(1+\exp (-\beta ({[\bar{{x_{j}}}]_{1:2}}-{x_{0}}))\right)^{-1}}$, where ${x_{0}}={\log _{10}}(4)/\sigma $ and $\beta =2$; these first two samples were then removed from the data set. The order of the genes was permuted randomly, and the cost-aware decision function was computed for each gene in sequence with ${\hat{q}_{j}}$ as described and ${\bar{\theta }_{j}}={\log _{10}}(2)/{\hat{\sigma }_{j}}$. We compared cost-aware ERO to ERO investing with the maximum number of samples available ($n=50$) and with $n=3$, because a typical default replication level in biological experiments is to conduct experiments in triplicate. For both procedures, ${c_{j}}=1,\forall j$, and ${W_{\mathrm{\$ }}}[0]=1000$. Pseudo-code and implementation details for this experiment can be found in Appendix B.
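The logistic prior estimate above can be written directly; `q_hat` is a hypothetical helper name, and its inputs are the mean of the first two (log-transformed) tumor samples and the estimated standard deviation:

```python
from math import exp, log10

def q_hat(xbar_first2, sigma_hat, beta=2.0):
    """Logistic prior estimate for the null from the first two tumor samples:
    q_hat = 1 - 1 / (1 + exp(-beta * (xbar - x0))), with x0 = log10(4) / sigma_hat.
    A large observed mean (strong signal) yields a small prior on the null."""
    x0 = log10(4) / sigma_hat
    return 1.0 - 1.0 / (1.0 + exp(-beta * (xbar_first2 - x0)))
```

Note that the estimate is monotone decreasing in the observed mean: genes whose first two samples already show high expression receive a low prior probability of the null.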
Figure 5
Comparison of Cost-aware ERO investing and ERO investing for a prostate cancer gene expression data set. ERO ($n=50$) rejects many tests early, but suffers from an aggressive expenditure of sample collection resources and is unable to test beyond the 20th gene. ERO ($n=3$) is able to reject more tests, but fails to reject early hypotheses, observes a noisier measurement of the true differential gene expression, and benefits from a piggybacking effect for later tests. Cost-aware ERO distributes the finite allocation of samples across a smaller set of genes than ERO ($n=3$), but a larger set of genes than ERO ($n=50$). It allocates, on average, $\bar{{n_{j}^{\ast }}}=15.5$ samples per test which strikes a balance between expenditure of ${W_{\alpha }}$ and ${W_{\mathrm{\$ }}}$.
Figure 5 shows the cost-aware and ERO decision rules on the prostate gene expression data set. The ERO method with $n=50$ selects many tests but rapidly expends ${W_{\mathrm{\$ }}}$, as it does not optimize the sample size. The ERO method with $n=3$ is able to test a much greater number of genes because it is limited in the amount of ${W_{\mathrm{\$ }}}$ expenditure per test. While it may appear that the ERO method with $n=3$ gives a much more favorable result, there are some critical concerns with this rejection sequence. First, ERO ($n=3$) fails to reject nearly all of the tests in the first 50 genes. Among those genes are many that clearly have a strong differential gene expression signature, ${\bar{x}_{j}}$, shown in Figure 5. Congruent with our observations of the effect of piggybacking in ERO investing (Figure 1), had one of the early tests appeared later in the sequence, after ERO had accumulated a significant α-wealth reserve, it would have been rejected. Second, ERO ($n=3$) received a much noisier observation of the true differential gene expression signature compared to ERO ($n=50$). As can be seen in Figure 5, ERO ($n=50$) does reject many early tests that display a 2-fold increase in gene expression in the tumor compared to the normal cells when all 50 samples are observed. ERO ($n=3$) likely fails to reject true alternative hypotheses, in part, because it does not have access to enough samples to accurately assess the true differential expression level. Cost-aware ERO allocates, on average, $\bar{{n_{j}^{\ast }}}=15.5$ samples per test. This allocation strikes a balance between ERO with $n=3$ and $n=50$: it provides a more accurate measurement of the differential gene expression than ERO ($n=3$), yet does not expend the sample collection resources as aggressively as ERO ($n=50$).
Finally, it is not as susceptible to the (random) ordering of the tests as ERO ($n=3$), which suffers from α-piggybacking, or ERO ($n=50$), which bets the farm in terms of sample-resource wealth.
6.2 LINCS L1000
The Library of Integrated Network-Based Cellular Signatures (LINCS) NIH Common Fund program was established to provide publicly available data for studying how cells respond to genetic stressors, such as perturbations by therapeutics [12]. The data considered consist of L1000 assays of 1220 cell lines. The L1000 assay provides mRNA expression for 978 landmark genes. Differential gene expression is then calculated under a protocol known as level 5 preprocessing. Jeon et al. [15] infer the remaining genes using a CycleGAN and make the predictions available on their lab's webpage.
Data were prepared in a similar fashion to the prostate cancer data. Data were available for 1220 samples that experienced a 10 μM perturbation with Vorinostat. Differential expression for $23,614$ genes against controls was processed per the L1000 Level 5 protocol. Following this protocol, we divided the values by the standard deviation of each individual gene so that the data have unit variance. Our experimental protocol used 100 samples to estimate q. We set ${q_{j}}=1-{\left(1+\exp (-\beta ({[\bar{{x_{j}}}]_{1:100}}-{x_{0}}))\right)^{-1}}$, where ${x_{0}}=2/\sigma $ and $\beta =0.6$. The distribution of ${q_{j}}$ reflects our prior belief that most genes are likely to belong to the null hypothesis. Samples used for this estimation were shuffled between iterations. For both procedures, ${c_{j}}=1,\forall j$, and ${W_{\mathrm{\$ }}}[0]=100000$. Our method was allowed $n\le 1120$ samples, while ERO used $n=1120$ samples. The order of genes was randomly shuffled and the procedures were repeated $1,000$ times to collect average statistics. One-sided Gaussian tests were performed with ${\mu _{0}}=0$ and ${\mu _{A}}=0.5$; $\sigma =1$ is assumed since the data are already standardized.
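For the one-sided Gaussian tests above ($\mu _{0}=0$, known $\sigma $), the relationship between sample size and power can be computed with the standard library alone. This is an illustrative sketch, not the paper's optimization; the bisection for the critical value is only an implementation convenience.

```python
from math import erfc, sqrt

def normal_sf(z):
    """Survival function of the standard normal, P(Z >= z)."""
    return 0.5 * erfc(z / sqrt(2))

def power_one_sided(n, alpha, mu_a=0.5, sigma=1.0):
    """Power of the level-alpha one-sided z-test of mu = 0 vs. mu > 0
    when the true mean is mu_a and n i.i.d. samples are observed."""
    lo, hi = -10.0, 10.0                 # bracket for the critical value z_alpha
    for _ in range(100):                 # bisection: find z with sf(z) = alpha
        mid = (lo + hi) / 2
        if normal_sf(mid) > alpha:
            lo = mid
        else:
            hi = mid
    z_alpha = (lo + hi) / 2
    return normal_sf(z_alpha - sqrt(n) * mu_a / sigma)
```

With $\mu _{A}=0.5$, power grows quickly in n at any fixed α-level, which is the tradeoff cost-aware ERO exploits when it chooses how many of its budgeted samples to spend on each gene.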
Our method (CAERO) results in 797 tests with 176.89 rejections and an average sample size per test of $n=108$, while ERO ($n=1120$) results in 90 tests with 32.49 rejections. The results on this data set are consistent with the observations for the prostate cancer gene expression data set: the ERO procedure expends its sample budget long before its α-wealth is exhausted.
7 Discussion
We have introduced an ERO generalized α-investing procedure with a self-contained decision rule. This rule removes the need for a user-specified allocation scheme and optimally selects the sample size for each test. We have presented empirical results supporting the benefits of optimizing these testing parameters rather than leaving them to user choice.
The cost-aware ERO method does require specification of the prior for the null, ${q_{j}}$. We have shown that the number of tests and true rejections is not sensitive to variability in ${q_{j}}$, but is sensitive to bias in ${q_{j}}$, for example, if the investigator is systematically optimistic. For future work, it would be useful to investigate online learning methods for estimating ${q_{j}}$ and robust optimization formulations of the cost-aware ERO decision rule to reduce this sensitivity.
Cost-aware ERO does not yet have an explicit mechanism to hedge the risk of dollar-wealth or α-wealth loss. The current optimization problem assumes a risk-neutral player who wishes not to lose α-wealth, on average, when conducting a test. Since this desire is expressed in expectation, the variance of actual outcomes can be large, leading to α-death without some constraint on the relative expenditure of α-wealth. For future work, it would be interesting to investigate a principled risk-hedging approach that conserves some wealth for future tests in the hope that a test with a more favorable reward structure is over the horizon.
The results from applying ERO and cost-aware ERO (Figure 5) show that there is a trade-off between expenditure of ${W_{\alpha }}$ and ${W_{\mathrm{\$ }}}$. In our formulation of the problem, we have assumed that the initial ${W_{\mathrm{\$ }}}$ is fixed and can only decrease. It would be interesting to apply a similar line of reasoning to ${W_{\mathrm{\$ }}}$ that was used to move from α-spending to α-investing. Specifically, if a test is rejected, there may be some reward towards ${W_{\mathrm{\$ }}}$. Then the decision rule may be modified to maximize ${W_{\mathrm{\$ }}}$ or to constrain it to be a martingale process as we have done here with ${W_{\alpha }}$.