Cost-aware Generalized α-investing for Multiple Hypothesis Testing

We consider the problem of sequential multiple hypothesis testing with nontrivial data collection costs. This problem appears, for example, when conducting biological experiments to identify differentially expressed genes in a disease process. This work builds on the generalized α-investing framework, which enables control of the false discovery rate in a sequential testing setting. We provide a theoretical analysis of the long-term asymptotic behavior of α-wealth which motivates a consideration of sample size in the α-investing decision rule. Posing the testing process as a game with nature, we construct a decision rule that optimizes the expected return (ERO) of α-wealth and provides an optimal sample size for each test. Empirical results show that a cost-aware ERO decision rule correctly rejects more false null hypotheses than other methods. We extend cost-aware ERO investing to finite-horizon testing, which enables the decision rule to allocate samples across many tests. Finally, empirical tests on real data sets from biological experiments show that cost-aware ERO produces actionable decisions to conduct tests at optimal sample sizes.


INTRODUCTION
Machine learning systems are increasingly used to make decisions in uncertain environments. Decision-making can be viewed in the framework of hypothesis testing in that a decision is made as the result of a rejection of the null hypothesis (Arrow et al., 1949; Dickey and Lientz, 1970; Blackwell and Girshick, 1979; Verdinelli and Wasserman, 1995; Parmigiani and Inoue, 2009; Berger, 2013). When multiple hypotheses are under consideration, a false discovery rate (FDR) control procedure provides a way to control the rate of erroneous rejections in a batch of hypotheses for small-scale data sets (Benjamini and Hochberg, 1995; Storey, 2002; Storey et al., 2004; Benjamini et al., 2006; Zeisel et al., 2011; Liang and Nettleton, 2012). However, these procedures typically require the test statistics of all of the hypotheses under consideration so that the p-values may be sorted and a set of hypotheses may be selected for rejection. In many modern problems the test statistics for all the hypotheses may not be known simultaneously and standard FDR procedures do not work.
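To make concrete why batch procedures need all p-values at once, here is a minimal sketch of the classic Benjamini-Hochberg step-up (assuming independent p-values; the function name and structure are ours, for illustration only):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Batch BH step-up: sort all p-values, find the largest rank k with
    p_(k) <= k*alpha/m, and reject the k smallest. Requires every p-value
    up front -- exactly what the online setting lacks."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])  # indices of rejected hypotheses
```

Because the threshold for each hypothesis depends on the sorted position of its p-value among all m, the procedure cannot be run until the whole batch is available.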
Online FDR methods have recently been developed to address the need for FDR control procedures that maintain control for a sequence of tests when the test statistics are not all known at one time. Tukey and Braun (1994) proposed the idea that one starts with a fixed amount of "α-wealth" and, for each hypothesis under consideration, the researcher may choose to spend some of that wealth until it is all gone. Foster and Stine (2008) extended α-spending by allowing some return on the expenditure of α-wealth if the hypothesis is successfully rejected. Aharoni and Rosset (2014) introduced generalized α-investing and provided a deterministic decision rule to optimally set the α-level for each test given the history of test outcomes. A full review of related work is in Section 1.2.

Contributions
We extend generalized α-investing to address the problem of online FDR control where the cost of data is not negligible. Our specific contributions are:
• a theoretical analysis of the long-term asymptotic behavior of α-wealth in an α-investing procedure,
• a generalized α-investing procedure for sequential testing that simultaneously optimizes sample size and α-level using game-theoretic principles,
• a non-myopic α-investing procedure that maximizes the expected reward over a finite horizon of tests.

Related Work
Tukey proposed the notion of α-wealth to control the familywise error rate for a sequence of tests (Tukey, 1991; Tukey and Braun, 1994). Foster and Stine (2008) proposed α-investing, an online procedure that controls the marginal FDR (mFDR) for any stopping time in the testing sequence. Aharoni and Rosset (2014) introduced generalized α-investing and provided a deterministic decision rule to maximize the expected reward for the next test in the sequence. Recently, there has been much work on online FDR control in the context of A/B testing, directed acyclic graphs, and quality-preserving databases (Yang et al., 2017; Ramdas et al., 2019). Javanmard and Montanari (2018) first proved that generalized α-investing controls the FDR, not only the mFDR, in an online setting with an algorithm called LORD. Ramdas et al. (2017) proposed LORD++ to improve upon LORD. Recent work leverages contextual information in the data to improve statistical power while controlling FDR offline (Xia et al., 2017) and online (Chen and Kasiviswanathan, 2020). Ramdas et al. (2018) then proposed SAFFRON, which also belongs to the α-investing framework but adaptively estimates the proportion of true nulls. All the aforementioned methods are synchronous, which means that each test can only start once the previous test has finished. Zrnic et al. (2021) extend α-investing methods to an asynchronous setting where tests are allowed to overlap in time. These state-of-the-art online FDR control α-investing methods do not address the need for testing when the cost of data is not negligible. So, we propose a novel α-investing method for a setting that takes into account the cost of data sample collection, the sample size choice, and prior beliefs about the probability of rejection.
Section 2 provides technical background on generalized α-investing. Section 3 contains a theoretical analysis of the long-term asymptotic behavior of the α-wealth. Section 4 presents a cost-aware generalized α-investing decision rule based on the game-theoretic equalizing strategy. Section 5 presents empirical experiments that show that the cost-aware ERO decision rule improves upon existing procedures when data collection costs are nontrivial. Section 6 presents analysis of two real data sets from gene expression studies that shows cost-aware α-investing aligns with the overall objectives of the application setting. Finally, Section 7 describes limitations and future work.

BACKGROUND ON GENERALIZED α-INVESTING
Following the notation of Foster and Stine (2008), consider m null hypotheses, H_1, . . ., H_m, where H_j ⊂ Θ_j. The random variable R_j ∈ {0, 1} indicates whether H_j is rejected, regardless of whether the null is true or not. The random variable V_j ∈ {0, 1} indicates whether H_j is both true and rejected. These variables are aggregated as R(m) = Σ_{j=1}^m R_j and V(m) = Σ_{j=1}^m V_j. With these definitions, the FDR (Benjamini and Hochberg, 1995) is

FDR(m) = E[ V(m)/R(m) | R(m) > 0 ] Pr(R(m) > 0),

and the marginal false discovery rate (mFDR) is

mFDR_η(m) = E[V(m)] / (E[R(m)] + η).

Setting η = 1 − α provides weak control over the familywise error rate at level α. Aharoni and Rosset (2014) make two assumptions in their development of generalized α-investing: (1) P_{θ_j}(R_j = 1) ≤ α_j for θ_j ∈ H_j, and (2) P_{θ_j}(R_j = 1) ≤ ρ_j for θ_j ∈ Θ_j − H_j, where ρ_j = sup_{θ_j ∈ Θ_j − H_j} P_{θ_j}(R_j = 1). (3) Assumption 1 constrains the false positive rate to the level of the test, and Assumption 2 makes ρ_j an upper bound on the power of the test. A pool of α-wealth, W_α(j), is available to spend on the j-th hypothesis. The α-wealth is updated according to the following equations:

W_α(0) = αη, (4)
W_α(j) = W_α(j−1) − ϕ_j + R_j ψ_j. (5)

A deterministic function I_{W_α(0)} is an α-investing rule that determines: the cost of conducting the j-th hypothesis test, ϕ_j; the reward for a successful rejection, ψ_j; and the level of the test, α_j. The α-investing rule depends only on the outcomes of the previous hypothesis tests. The Foster-Stine cost depends hyperbolically on the level of the test: ϕ_j = α_j/(1 − α_j).
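The wealth bookkeeping implied by the update W_α(j) = W_α(j−1) − ϕ_j + R_j ψ_j can be sketched in a few lines (variable names are ours, for illustration):

```python
def update_alpha_wealth(wealth, phi, psi, rejected):
    """Generalized alpha-investing update: pay the ante phi to run the
    test; earn the reward psi back only if the null is rejected."""
    return wealth - phi + (psi if rejected else 0.0)

# Example trajectory: W_alpha(0) = alpha * eta with alpha = 0.05, eta = 0.95
w = 0.05 * 0.95                                         # initial wealth 0.0475
w = update_alpha_wealth(w, 0.01, 0.02, rejected=True)   # rejection: wealth grows
w = update_alpha_wealth(w, 0.01, 0.02, rejected=False)  # no rejection: wealth shrinks
```

Testing stops when the remaining wealth can no longer cover an ante, which is the α-death scenario discussed later.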
Generalized α-investing can be viewed in a game-theoretic framework where the outcome of the test (reject or fail-to-reject) is random, and the procedure provides the optimal amount of "ante" to offer to play and "payoff" to demand should the test successfully reject. We make use of this game-theoretic interpretation in our contributions in Section 4.1.
Aharoni and Rosset (2014) derive a linear constraint on the reward ψ_j to ensure that, for a given (ϕ_j, α_j), the mFDR is controlled at level α by ensuring the sequence A(j) = αR(j) − V(j) + αη − W_α(j) is a submartingale with respect to R_j. That constraint is

ψ_j ≤ min( ϕ_j/ρ_j + α, ϕ_j/α_j + α − 1 ). (7)

Maximizing the expected reward of the next hypothesis test, E(R_j)ψ_j, leads to the following equality:

ϕ_j/ρ_j + α = ϕ_j/α_j + α − 1, equivalently ϕ_j(1/α_j − 1/ρ_j) = 1.

Note that this equality selects the point of intersection of the two parts of the constraint in (7). Expected Reward Optimum (ERO) α-investing provides two equations for three unknowns in the deterministic decision rule. Aharoni and Rosset (2014) address this indeterminacy by considering three allocation schemes for ϕ_j (constant, relative, and relative200) and suggest that the investigator can explore various options and set ϕ_j on their own. Further details on these schemes are given in Section 5.
Since the dominant paradigm in testing of biological hypotheses is a bounded finite range for Θ_j, for the remainder we assume Θ_j = [0, θ̄_j] for some upper bound θ̄_j, and H_j = {0}. This scenario may be viewed as a test that the expression for gene j is differentially increased in an experimental condition compared to a control. We consider a simple z-test here for concreteness. The power of a one-sided z-test is

1 − β = 1 − Φ( z_α − (μ_1 − μ_0)√n / σ ),

where z_α = Φ^{−1}(1 − α) is the z-score corresponding to level α, μ_0 is the expected value of the simple null hypothesis, μ_1 is the expected value of the simple alternative hypothesis, σ is the standard deviation of the measurements, and n is the sample size.
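The one-sided z-test power is straightforward to evaluate numerically; a small standard-library helper (ours, for illustration):

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()

def z_test_power(alpha, mu0, mu1, sigma, n):
    """P(reject at level alpha | true mean mu1) for a one-sided z-test
    of H0: mu = mu0 against mu > mu0 with n i.i.d. samples."""
    z_alpha = _N.inv_cdf(1 - alpha)         # critical value at level alpha
    shift = (mu1 - mu0) * sqrt(n) / sigma   # noncentrality of the z statistic
    return 1 - _N.cdf(z_alpha - shift)
```

Under the null (mu1 = mu0) the power reduces to the level α, and it increases monotonically in the sample size n.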
Using Equation (3), the best power under the previously defined Θ_j is

ρ_j = 1 − Φ( z_{α_j} − θ̄_j √n_j / σ_j ).

The best power depends on: (1) the level of the test, α_j; (2) the scale of the bound on the alternative, θ̄_j; (3) the sample size, n_j; and (4) the measurement standard deviation, σ_j. One may compare multiple measurement technologies based on their precision by exploring the effect of changing σ_j; for example, for a fixed budget and all other things equal, a trade-off can be computed between more samples with a higher-variance technology versus fewer samples with a lower-variance technology. For the remainder, we assume σ_j is fixed and known. ERO α-investing for Neyman-Pearson testing problems is solved by the following nonlinear optimization problem (Problem 10): maximize E(R_j)ψ_j over (ϕ_j, α_j, ψ_j) subject to ψ_j ≤ ϕ_j/ρ_j + α (10b), ψ_j ≤ ϕ_j/α_j + α − 1 (10c), and ϕ_j/ρ_j + α = ϕ_j/α_j + α − 1 (10d), with ρ_j given by the power expression above. Constraints (10b) and (10c) correspond to (7), which controls the mFDR level, and constraint (10d) ensures the maximal expected reward for the j-th test. The optimal ERO solution still depends on an external choice of the sample size n_j and the cost of the test ϕ_j.
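Assuming the ERO intersection condition ϕ_j(1/α_j − 1/ρ_j) = 1 (our reconstruction of the point where the two mFDR constraints meet), the level α_j is pinned down by ϕ_j and n_j and can be found by bisection. A minimal sketch, not the authors' solver:

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()

def rho(alpha, theta_bar, sigma, n):
    """Best power of the one-sided z-test over Theta_j = [0, theta_bar]."""
    return 1 - _N.cdf(_N.inv_cdf(1 - alpha) - theta_bar * sqrt(n) / sigma)

def ero_alpha(phi, theta_bar, sigma, n, iters=200):
    """Bisect for alpha_j satisfying phi * (1/alpha - 1/rho(alpha)) = 1."""
    f = lambda a: phi * (1 / a - 1 / rho(a, theta_bar, sigma, n)) - 1
    lo, hi = 1e-12, 1 - 1e-12   # f(lo) > 0 > f(hi) for reasonable phi
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return (lo + hi) / 2
```

In practice the paper solves the full nonlinear program with a commercial solver; this sketch only illustrates how the intersection condition determines the level for a fixed ante and sample size.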

LONG-TERM α-WEALTH
Since the levels of future tests depend on the amount of α-wealth available at the time of the tests, a theoretical consideration in generalized α-investing is whether the long-term α-wealth is a submartingale or a supermartingale (stochastically non-decreasing or stochastically non-increasing) for a given decision rule. Most prior works include some implicit consideration of the behavior of the long-term α-wealth.
In ?, which predates the seminal work of Foster and Stine (2008), test levels are set such that testing may continue indefinitely, even in the worst-case scenario of no rejections, while still utilizing all initial α-wealth. Foster and Stine (2008) discuss strategies for setting the level of the test and provide some examples designed to accumulate α-wealth for future tests. They also discuss the practical and ethical concerns with sorting easily rejected tests so as to accumulate an arbitrary amount of α-wealth before conducting more uncertain tests. Aharoni and Rosset (2014) seek to optimize the expected reward of the current test in an effort to maximize the α-wealth available and, in turn, the levels for future tests. Javanmard and Montanari (2018) discuss setting the vector γ such that the power is maximized for a mixture model set a priori for the hypothesis stream. In all of these methods, the motivation is to have sufficient α-wealth to perpetually conduct tests with appreciable power. Here we outline two scenarios where the long-term α-wealth can be either a submartingale or a supermartingale.
In order to state the theorems regarding the α-wealth sequence, we require a lemma bounding α-wealth as a function of the prior probability of the null hypothesis.
Lemma 1. Given an α_j-level for the j-th hypothesis test from rule I(R_1, . . ., R_{j−1}), the expected value of α-wealth for Foster-Stine α-investing is bounded as a function of q_j = Pr[θ_j ∈ H_j], the prior probability (belief) that the j-th null hypothesis is true. In the case of a simple null and alternative, Θ_j = {0, θ̄_j}, the bound is tight.
Proof. Provided in Appendix A.
We are now in a position to understand dynamical properties of the expected value of the sequence of α-wealth, {W (j) : j ∈ N}.
Theorem 1 shows that one will be able to conduct an infinite number of tests in the long term if the power is close to one or the prior probability of the null is close to zero. This scenario may occur when the hypothesis stream contains a large proportion of true alternative hypotheses, or if the sample sizes of the individual tests are large.
Theorem 2 shows that the generalized α-investing testing procedure will end in a finite number of steps if the power of the test is close to zero or the prior probability of the null hypothesis is close to one. This scenario may occur when the hypothesis stream is made up of a large proportion of true null hypotheses, or if the sample sizes used for each test result in underpowered tests.
These theorems provide general insights for understanding when the α-wealth can be expected to be (stochastically) non-decreasing or non-increasing. The non-decreasing sequences require that ρ_j ↑ 1 for a fixed θ̄_j, which, in the case of a Gaussian, would require σ_j/√n → 0 or n → ∞. So, the α-wealth grows unbounded if the sample size is unbounded. This theory, in combination with the premise of non-trivial experiment costs, motivates the need for cost-aware α-investing methods when the sample size is not fixed.

COST-AWARE GENERALIZED α-INVESTING
Our development in this section derives from two key differences in assumptions compared to previous work. Here we make the following assumptions: (1) the per-sample monetary cost to conduct hypothesis tests is not trivial, and (2) the α-cost of a hypothesis test, ϕ_j, should account for the a priori probability that the null hypothesis is true as well as the pattern of previous rejections.

Cost-aware ERO α-investing
The generalized α-investing decision rule, (6), is augmented to include a notion of dollar-wealth, W_$(j), available for expenditure to collect data to test the j-th hypothesis, where n_j is the sample size allocated for testing of the j-th hypothesis. A natural update plan for the dollar-wealth is

W_$(0) = B,    W_$(j) = W_$(j−1) − n_j c_j,

where c_j is the per-sample cost of data to test the j-th hypothesis and B is the initial dollar-wealth. Allowing the cost to vary with the hypothesis test enables one to model different experimental methods and cost inflation for long-term experimental plans. The augmented optimization problem is identical to Problem 10 with objective max_{ϕ_j, α_j, ψ_j, n_j} E_θ(R_j)ψ_j and the additional constraint n_j c_j ≤ W_$(j).
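A trivial sketch of the per-test budget update, assuming the natural plan W_$(j) = W_$(j−1) − n_j c_j (function and variable names are ours):

```python
def update_dollar_wealth(w_dollar, n, c):
    """Spend n samples at per-sample cost c; disallow overspending,
    mirroring the constraint n_j * c_j <= W_$(j)."""
    spend = n * c
    if spend > w_dollar:
        raise ValueError("requested sample size exceeds remaining budget")
    return w_dollar - spend
```

The two wealths play symmetric roles: α-wealth limits how aggressively one may test, and dollar-wealth limits how much data one may buy.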

Game Theoretic Formulation
The resulting optimization problem has an infinite number of solutions because ϕ_j is not constrained. Thus the problem does not yet constitute a self-contained decision rule for (14). Indeed, Aharoni and Rosset (2014), throughout, suggest a scenario where ϕ_j is chosen by the investigator and the level and reward are given by the decision rule. In such a scenario, ϕ_j can be interpreted as an ante for a game against nature.
Suppose that we have a zero-sum game involving two players: the investigator (Player I) and nature (Player II). The investigator has two strategies: to test or not to test a hypothesis. Nature, independent of the investigator, chooses to hide θ_j ∈ H_j with probability q_j and θ_j ∈ Θ_j − H_j otherwise.
The utility function for this game is the change in α-wealth.
The payoff matrix for the game is provided in Table 1.
                          Player II (Nature)
Player I            θ_j ∈ H_j            θ_j ∈ Θ_j − H_j
  Skip              0                    0
  Test              −ϕ_j + α_j ψ_j       −ϕ_j + ρ_j ψ_j

Table 1: Payoff matrix for posing hypothesis testing as a game against nature. The payouts shown are the expected value of the payout for a given pair of pure strategies.
If the investigator chooses not to conduct the test, there is no cost (ϕ_j = 0) and there is no reward (ψ_j = 0), regardless of what nature has chosen. So, the change in α-wealth when not conducting a test is zero. If the investigator chooses to conduct a test, and nature has hidden θ_j ∈ H_j, then the payout is −ϕ_j with probability 1 − α_j, or −ϕ_j + ψ_j with probability α_j. In expectation, this payout is −ϕ_j + α_j ψ_j.
Similarly, if the investigator chooses to conduct the test and nature has hidden θ_j ∈ Θ_j − H_j, then the expected payout is −ϕ_j + ρ_j ψ_j. The mixed strategy of nature is known to be (q_j, 1 − q_j). What remains to be determined in this game are the unknowns in the payoff matrix, as well as the investigator's strategy. We choose to set the payoffs such that the expected change in α-wealth is identical for both of the investigator's strategies. By designing the payoff matrix, given nature's mixed strategy, such that the investigator has the same expected payoff for both pure (and any mixed) strategies, the investigator's choice to test or not test a hypothesis has no effect (in expectation) on the ability to perform future tests. With these properties, nature is employing an equalizing strategy.
Such a payoff matrix is a solution to the equation

q_j(−ϕ_j + α_j ψ_j) + (1 − q_j)(−ϕ_j + ρ_j ψ_j) = 0.

Since the expected payoff for not testing is 0, this equation ensures that the expected change in α-wealth when testing is also 0. By definition, this gives a self-contained decision rule that makes the α-wealth a martingale, striking a balance between the two scenarios given in Theorem 1 and Theorem 2.
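Solving nature's equalizing condition q_j(−ϕ_j + α_j ψ_j) + (1 − q_j)(−ϕ_j + ρ_j ψ_j) = 0 for ψ_j gives a closed form; a sketch under our reconstruction of that condition (names are ours):

```python
def equalizing_reward(phi, alpha, rho, q):
    """psi that zeroes the expected change in alpha-wealth:
    q*(-phi + alpha*psi) + (1-q)*(-phi + rho*psi) = 0."""
    return phi / (q * alpha + (1 - q) * rho)

def expected_payout(phi, psi, alpha, rho, q):
    """Expected change in alpha-wealth when the investigator tests."""
    return q * (-phi + alpha * psi) + (1 - q) * (-phi + rho * psi)
```

The denominator q·α_j + (1 − q_j)·ρ_j is the marginal probability of a rejection, so the equalizing reward is simply the ante divided by the chance of winning it back.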
Theorem 3 (Martingale α-Wealth). Given a simple null and alternative, the α-wealth sequence {W_α(j)} is a martingale when the power satisfies ρ_j = (ϕ_j/ψ_j − q_j α_j)/(1 − q_j). Proof. Provided in Appendix A.
Theorem 3 provides a condition on the power of the test which requires a balance between the ratio of ϕ_j and ψ_j and the probability of a false positive. Next, we present our main result, a self-contained decision rule to provide a martingale α-wealth sequence.

Cost-aware ERO Decision Rule
The investigator's goal is to conduct as many tests as possible while rejecting as many true alternatives as possible and maintaining control of the false discovery rate. In the long run, the best strategy for the investigator is to ensure that the α-wealth sequence is a martingale. Incorporating this objective and the cost of the experiment into the ERO problem yields a self-contained decision rule in the form of (14), Problem 18. Constraints (18b) and (18c) ensure control over the mFDR. Constraint (18e) connects the level, power, and sample size of the test. Constraints (18f) and (18g) ensure the existing (α, $)-wealth is not exceeded. The parameter a ∈ (0, 1] controls the proportion of α-wealth that a single test can be allocated. Constraint (18h) ensures nature's strategy is equalizing; written out explicitly, it sets ϕ_j = a(q_j α_j + (1 − q_j)ρ_j)ψ_j. A pseudo-code algorithm of the full cost-aware ERO method and further extensions to cost-aware ERO are described in Appendix C.
Constraint (18h) sets the expected change in α-wealth equal to zero. This enforces that W_α(j) is a martingale. Allowing W_α(j) to be a submartingale, as per Theorem 1, can lead to a situation where hypotheses are tested at high α-levels due to the W_α accumulated from previous rejections. This is referred to as piggybacking in the literature when such accumulated wealth leads to poor decisions (Ramdas et al., 2017).
On the other hand, allowing W_α(j) to be a supermartingale, as per Theorem 2, causes the testing to end, and is referred to as α-death in the literature. Using a game-theoretic formulation allows us to propose an expected-reward-optimal procedure which prevents both α-death and piggybacking.
Constraint (18h) only controls the expected increment in W_α. It is well known that martingale-based strategies can suffer from gambler's ruin. Since no bounds are set on the worst-case scenario, which in this case is R_j = 0, it is possible that we could set ϕ_j = W_α(j − 1) and suffer α-death. This occurs, for example, when q_j is close to 0 and μ_A and σ_j allow ρ_j → 1. In such a case, a rejection is almost certain, and in turn, so is receiving the reward ψ_j. Recall that we restrict ourselves to an ERO solution, and thus we can interpret constraint (18h), without the factor a, as setting ϕ_j to the expected reward, the quantity that we are maximizing. In order to keep W_α a martingale, this almost guaranteed upside must be counteracted by a devastating downside. In order to avoid α-death, we add a factor which limits the maximal bet. Following the relative scheme chosen in Aharoni and Rosset (2014), we set this factor to 0.1.

Finite horizon cost-aware ERO α-investing
The standard ERO framework optimizes only the one-step expected return, E_θ(R_j)ψ_j. But when tests are expensive, it is logical to consider the expected return after two (or more) tests. We consider q_j to be known, and extend the game-theoretic framework to a finite horizon of decisions.
The extensive form of the game between nature, who hides θ_j in the null or alternative hypothesis region, and the investigator, who seeks to find θ_j and gain the reward for doing so, is shown in Figure 1. We note that sequential two-step cost-aware ERO investing is a different problem than batch ERO investing because two-step investing accounts for the expected change in (α, $)-wealth after each step, while batch cost-aware ERO only receives the payoff at the conclusion of all of the tests in the batch.
The two-step objective function maximizes the expected reward over tests j and j + 1, with constraints (18b)-(18h) from Problem 18 remaining for steps j and j + 1. Designing the game so that nature's strategy is an equalizing strategy results in a system of equations (Appendix G) that form constraints in the ERO optimization problem. It is worth noting that such a game simplifies to the standard cost-aware ERO method defined in Problem 18 when W_$ ≫ 0. This holds since the parameters of the second test depend on the expected α-wealth available at that step. When the expected increment is 0, as set in constraint (18h), and when the available $-wealth is not scarce, each step is equivalent to the optimization occurring on the first test. When this constraint is lifted, or when the available W_$ is low, the finite-horizon solution differs from the single-step solution.

SYNTHETIC DATA EXPERIMENTS
Experimental Settings. To compare our method with state-of-the-art related methods, we generate synthetic data as described in Aharoni and Rosset (2014). The synthetic data is composed of m = 1000 possible hypothesis tests. For the j-th test, the true state of θ_j is set to the null value of 0 with probability 0.9 and otherwise set to 2. A set of n_j = 1000 potential samples (x_ji), i = 1, . . ., n_j, were generated i.i.d. from a N(θ_j, 1) distribution. For each hypothesis test, the z-score was computed as z_j = (n*_j)^{−1/2} Σ_{i=1}^{n*_j} x_ji, where n*_j is described in the table, and the one-sided p-value was computed. The methods were tested on 10,000 realizations of this simulated data generation mechanism. Pseudo-code, as well as other implementation details, for this simulation can be found in Appendix B.
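This generation mechanism can be sketched with the standard library (seed handling and names are ours):

```python
import random
from math import sqrt
from statistics import NormalDist

def simulate_stream(m=1000, q=0.9, theta_alt=2.0, n_star=1, seed=0):
    """Return (is_alternative, one_sided_p) pairs for the synthetic stream:
    theta_j = 0 with probability q, else theta_alt; the z-score is the
    scaled sum of n_star N(theta_j, 1) draws."""
    rng = random.Random(seed)
    nd = NormalDist()
    out = []
    for _ in range(m):
        theta = 0.0 if rng.random() < q else theta_alt
        z = sum(rng.gauss(theta, 1.0) for _ in range(n_star)) / sqrt(n_star)
        out.append((theta != 0.0, 1 - nd.cdf(z)))
    return out
```

Each realization of the stream feeds the p-values to the investing procedures one at a time, which is what distinguishes this setup from batch FDR benchmarks.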

Comparison to state-of-the-art methods
Table 2 compares our method, cost-aware ERO, with related state-of-the-art α-investing methods including: α-spending (Tukey and Braun, 1994), α-investing (Foster and Stine, 2008), α-rewards (Aharoni and Rosset, 2014), ERO-investing (Aharoni and Rosset, 2014), LORD (Javanmard and Montanari, 2018; Ramdas et al., 2017), and SAFFRON (Ramdas et al., 2018). The table is indexed by the allocation scheme (Scheme) and the reward method (Method). The allocation scheme determines the value of ϕ_j at each step. The constant scheme allocates a fixed amount for each test; the relative scheme allocates an amount that is proportional to the remaining α-wealth and continues until W_α(j) < (1/1000)W_α(0); the relative200 scheme follows the same proportional steps as the relative scheme but always performs 200 tests (Aharoni and Rosset, 2014). The results from our implementation of these methods match or exceed previously reported results.
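For intuition, the relative scheme's worst-case ante sequence (no rejections, so no rewards) can be sketched as follows; the fraction k = 0.1 is an illustrative choice, not necessarily the paper's exact constant:

```python
def relative_scheme_antes(w0, k=0.1, floor_frac=1e-3):
    """Worst-case (no rejections) ante sequence under a relative scheme:
    each ante is the fraction k of remaining alpha-wealth, stopping once
    wealth falls below floor_frac * w0."""
    w, antes = w0, []
    while w >= floor_frac * w0:
        phi = k * w
        antes.append(phi)
        w -= phi          # no reward in the worst case
    return antes
```

Because each ante is a fixed fraction of a geometrically shrinking pot, the scheme tolerates a long run of failures before the wealth floor is reached, at the price of ever smaller test levels.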
ERO investing yields more true rejections than α-spending, α-investing, and both α-rewards methods. The LORD variants and SAFFRON perform nearly the maximum number of tests while maintaining control of the mFDR. For the use scenarios considered in the LORD and SAFFRON papers (large-scale A/B testing), this is optimal: tests are nearly free and the goal is to keep testing while maintaining mFDR control. The cost-aware ERO setting is different and more applicable to biological experiments, where one aims to maximize a limited budget of tests to achieve as many true rejections as possible while controlling the mFDR.
Increasing the sample size capacity for each test to n_j ≤ 10 enables cost-aware ERO to achieve higher power with fewer tests than the current state-of-the-art methods. Relaxing the restriction on sample size enables cost-aware ERO to allocate the optimal number of samples based on the prior of the null as well as the available budget, and the method returns an optimal n*_j. Appendix D shows the comparisons for q = 0.1 and Appendix F shows comparisons with all of the other methods set to n_j = 10. Our cost-aware ERO method with n_j ≤ 10 performs more tests and rejects more false null hypotheses than all competing methods. We note that the difference between restricting n_j ≤ 100 and not restricting n_j is rather small. For most of the tests, the sample sizes chosen are quite similar, but differences appear when the α-wealth becomes small and the optimal sample size without restriction is above 100.
It is worth noting that ERO and cost-aware ERO with n_j = 1 are still quite different despite the restriction on sample size. We can view the difference in performance between these two methods as the benefit of allocating ϕ_j using our game-theoretic framework. Our decision rule incorporates our prior knowledge of the probability of the null hypothesis being true and aims to maintain α-wealth (as a martingale). The experimental setup of Aharoni and Rosset (2014) implicitly leverages similar prior knowledge in the spending schemes proposed. All spending schemes proposed in Aharoni and Rosset (2014) allow us to test at least one true alternative, in expectation, at which point the α-wealth should increase. This increase in α-wealth should then sustain testing until another true alternative appears. However, in the cost-aware ERO optimization problem, this information is explicitly accounted for, and helps us avoid the situations described in Theorem 1 and Theorem 2. By restricting n_j = 1, we have effectively limited our ability to inflate ρ_j with a large sample size and influence W_α(j) towards being a submartingale. On the other hand, nature's equalizing strategy limits the expected payout to 0 by limiting the size of ϕ_j, preventing the experimenter from experiencing α-death quickly, as seen in the constant spending scheme.

Computation and Implementation
In our experiments, for one set of 1,000 potential hypothesis tests, ERO investing, cost-aware ERO, and finite-horizon cost-aware ERO each take ∼30 seconds on a single 2.5 GHz core with 16 GB of RAM. The nonlinear optimization problem was solved using CONOPT (Drud, 1994). Because the solver depends on initial values and heuristics to identify an initial feasible point, infrequently the solver was not able to find a locally optimal solution; in these instances, the solver was restarted 10 times, and if it failed on all restarts the iteration was discarded. Out of 10,000 data sets, at most 27 iterations were discarded (for n_j = 1). Code to replicate these experiments is available at https://github.com/ThomasCook1437/cost-aware-alpha-investing.

Random Prior of the Null Hypothesis
One of the benefits of incorporating a notion of sampling cost into the hypothesis testing problem is the ability to allocate resources based on the prior probability of the null, q. We generated simulation data as previously described, except the prior probability of the null hypothesis is selected at random from q_j ∼ Beta(a, b), where a + b = 100, with 2,500 independent realizations of the data. Appendix B contains pseudo-code and further implementation details. Figure 2(a-c) shows the power, mFDR, and mean number of samples per test as a function of E[q_j]. The results show that cost-aware ERO α-investing achieves high power while maintaining control of the mFDR. A key result of this experiment is that, should it not be possible to collect as many samples as the optimization problem yields, the investigator may choose to not perform the test at all and instead wait for a test (with associated prior) that does yield an optimal sample size within the budget, or may choose to allow the α-wealth ante to adjust to the bound on the sample size. This often occurs for large values of q_j, which we know by Theorem 2 will influence W_α(j) towards behaving as a supermartingale. Cost-aware ERO will compensate by increasing ρ_j through the sample size, n_j, and will expend the W_$ available, as the optimization only considers a single step. It is worth noting that when E[q_j] is close to 1, cost-aware ERO with n = 1 maintains power better than other methods. This can be attributed to the allocation scheme that constraint (18h) creates. The value of ϕ_j is kept small so that multiple false null hypotheses are tested at an appreciable level, so that α-wealth can be earned and testing sustained. This setting is common in biological applications, where false null hypotheses can be rare.

Finite-Horizon Cost-aware ERO Investing
To test whether extending the horizon of the reward to be maximized would enable better decisions about (α, $)-wealth allocation, we varied the length of the horizon considered in the cost-aware ERO investing decision rules. In general, the optimal values returned are identical between the decision rules. This is especially visible at the beginning of the testing process. Discrepancies occur when W_$ is sufficiently low such that repeatedly applying the one-step cost-aware ERO decision rule would expend all W_$ prior to the final test in the finite horizon. This occurs when the finite horizon is set to be a large number of steps or when the experiment is near the end of its funding. We also noticed that our solver exhibited less stability as the length of the horizon increased. All methods were limited to allocating at most 10 samples to each individual test.

Figure 2: Power, mFDR, and mean number of samples per test for cost-aware ERO and existing methods with random q_j.

Figure 3: Power, mFDR, and mean number of samples per test for finite horizon cost-aware ERO with random q_j. A larger horizon corresponds to a greater number of future tests considered in the optimization process.
As seen in Figure 3, extending the horizon to include more tests results in a more conservative allocation of samples. For E[q] = 0.9, which we consider most applicable to biological applications, Table 3 shows that this risk aversion increases the number of tests for smaller horizons and then decreases it for larger horizons.
Note that this technique optimizes the parameters of each test based on the expected wealth available at that time. The parameters set for future tests will never truly be attained. Despite this, the risk-averse effect on the initial test, in which W_α(j − 1) is fixed and known, remains. An unintended consequence of this risk aversion is a decrease in power. These results demonstrate that our principled (W_α, W_$) spending strategy considering one step sufficiently captures the effect of the current test on our future tests. The martingale constraint enables us to conduct tests so that future tests remain powerful, and we do not benefit from adding additional information to our optimization problem. These results simultaneously suggest that an extended horizon may be appropriate for contexts where the optimization objective is not restricted to the expected reward.

REAL DATA EXPERIMENTS
Biological experimentation involves non-trivial sample costs, and the proportion of false null hypotheses is typically small while the number of overall tests is quite large. We compared our methods to the ERO method on two gene expression data sets. The results demonstrate that our cost-aware method performs more tests and more rejections in such settings, while spending fewer samples than a method that cannot simultaneously optimize sample size.

Prostate Cancer Data
Gene expression data was collected to investigate the molecular determinants of prostate cancer (Dettling, 2004). The data set contains 50 normal samples and 52 tumor samples, and each sample is an m = 6033 vector of gene expression levels. The data set has been normalized and log-transformed so that the data for each gene is roughly Gaussian. Let the empirical mean and standard deviation of the normal samples be denoted μ_j and σ_j respectively. The goal is to test whether the tumor gene expression is increased relative to the normal samples.
A logistic function of the first two samples for each gene was used to compute the prior probability of the null hypothesis q_j, where x_0 = log_10(4)/σ and β = 2; these first two samples were then removed from the data set. The set of genes was permuted randomly and the cost-aware decision function was computed for each gene in sequence with q_j as described and θ_j = log_10(2)/σ_j. For cost-aware ERO the sample size was limited to 50 samples. A one-sided Gaussian test was performed with the optimal number of samples. This procedure is compared to ERO investing with the maximum number of samples for all tests that were selected to be performed by that procedure. For both procedures c_j = 1 for all j, and W_$(0) = 1000. Pseudo-code and implementation details for this experiment can be found in Appendix B.

Table 3: Varying the size of the finite horizon when q_j ∼ Beta(90, 10). Values displayed correspond to the mean across 2,500 repetitions.
Figure 4 shows a single run of the cost-aware and ERO decision rules on the gene expression data set. The ERO method selects many tests but rapidly expends W_$, as it does not optimize the sample size. In contrast, cost-aware ERO is more conservative in sample allocation. Across 1,000 permutations of the data, cost-aware ERO performed, on average, 26.0 tests, and for the tests that were performed, the average sample size was 38.9.
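The per-gene procedure described above (a logistic prior computed from two held-out samples, followed by a one-sided Gaussian test) can be sketched as follows. This is a minimal sketch, not the paper's implementation: the exact logistic form and the helper names are our assumptions, since the extracted text specifies only x_0 and β.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_prior(xbar, x0, beta):
    """Prior probability of the null from the mean of the held-out samples.
    The logistic form is an assumption; the paper specifies only x0 and beta."""
    return 1.0 / (1.0 + math.exp(beta * (xbar - x0)))

def one_sided_z_test(samples, mu0, sigma, alpha_j):
    """One-sided Gaussian test of H0: mu <= mu0 at level alpha_j."""
    n = len(samples)
    xbar = sum(samples) / n
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    p_value = 1.0 - norm_cdf(z)
    return p_value < alpha_j, p_value
```

A gene whose held-out mean sits well above x_0 receives a small q_j, reflecting a stronger prior belief that it is differentially expressed.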

LINCS L1000
The Library of Integrated Network-Based Cellular Signatures (LINCS) NIH Common Fund program was established to provide publicly available data to study how cells respond to genetic stressors, such as perturbations by therapeutics (?). The data considered is made up of L1000 assays of 1220 cell lines. The L1000 assay provides mRNA expression for 978 landmark genes. Differential gene expression is then calculated under a protocol known as level 5 preprocessing. ? infer the remaining genes using a CycleGAN and make the predictions available on their lab's webpage.
Data was prepared in a similar fashion to the prostate cancer data. Data was available for 1220 samples which experienced a 10 µM perturbation with Vorinostat. Differential expression for 23,614 genes against controls was processed as per the L1000 Level 5 protocol. Following this protocol, we divided the values by the standard deviation for each individual gene so that the data had unit variance. Our experimental protocol utilized 100 samples to estimate q. We set q_j = (1 + exp(β(x̄_j − x_0)))^(−1), where x_0 = 2/σ and β = 0.6. The distribution of q_j reflects our prior belief that most genes are likely to belong to the null hypothesis. Samples used for this estimation were shuffled between iterations. For both procedures c_j = 1 for all j, and W_$(0) = 100000. Our method was allowed up to the remaining 1120 samples, while ERO used all possible 1120 samples. The order of genes was randomly shuffled before testing.
Table 4 shows that our method (CAERO) results in a larger number of tests and a larger number of rejections, all while the average experimental cost is lower. The mean number of samples used for an individual test is about n = 580. This shows that our algorithm selects a portion of the data so that $-spending is balanced against the increase in the likelihood of a rejection and future α-wealth.
It should be noted that the number of rejections is rather high for both methods. This is likely a result of our use of statistical significance versus clinical significance. With n = 1120, the sampling distribution of X̄ is Gaussian with standard deviation 1/√1120. Even small differences in µ_A will cause the null hypothesis to be rejected. Since cost-aware ERO utilizes a more reasonable sample size, we expect that these rejections are more likely to contain true signal in comparison to the set of rejections given by the ERO method.
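The effect of n = 1120 on statistical versus clinical significance can be made concrete with a short power calculation. This is an illustrative sketch, assuming a level-0.05 one-sided z-test with unit variance; the function name is ours.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_one_sided(mu_a, n, sigma=1.0, z_crit=1.6448536269514722):
    """Power of a level-0.05 one-sided z-test against a true shift mu_a.
    Reject when sqrt(n) * xbar / sigma exceeds z_crit = Phi^{-1}(0.95)."""
    return 1.0 - norm_cdf(z_crit - mu_a * math.sqrt(n) / sigma)
```

A shift of only 0.1 standard deviations is rejected with probability above 0.95 at n = 1120, but below 0.2 at n = 50, illustrating why the full-sample ERO rejections may include many clinically negligible effects.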

DISCUSSION
We have introduced an ERO generalized α-investing procedure that has a self-contained decision rule. This rule removes the need for a user-specified allocation scheme and optimally selects the sample size for each test. We have shown empirical results supporting the benefits of optimizing these testing parameters rather than leaving them to user choice. An experiment with gene expression data shows that we can conduct more tests than with previous methods. The negative societal impact and ethical concerns of this work are limited, as this work proposes a method to control for multiple testing.
There are some limitations of the current work. First, the cost-aware ERO method may be sensitive to misspecification of q_j. Simulations with an increasing distance between the true q_j and that used by the cost-aware ERO method show that cost-aware ERO performance degrades for misspecified values of q_j (Appendix E); however, mFDR control is not affected. Specification of q_j should reflect the belief that an individual hypothesis belongs to the null hypothesis, and such granular information is not always available. We have demonstrated potential methods to estimate such quantities in real data simulations. It is also possible to set q_j to a coarser belief about the distribution of hypotheses. For example, if it is believed that hypotheses are generated from a spike-and-slab distribution, it is reasonable to set q_j to the probability mass of the spike. Such information is already captured in the relative spending scheme of ERO and the γ vector in LORD.
Second, the finite-horizon method requires many known values q_j. In a streaming situation this information may not be available. This may be alleviated by dynamically adjusting the size of the horizon as hypotheses arrive. However, we acknowledge that this presents a further complication in the implementation of our method and opens the question of optimizing the size of the horizon, which is beyond the scope of the current work.
Finally, cost-aware ERO does not have an explicit mechanism to hedge the risk of dollar-wealth or α-wealth loss. In our simulations we observed that cost-aware ERO can choose to ante (ϕ_j) the entire pool of α-wealth on the first promising test in the hopes of maximizing the expected reward. While this strategy maximizes the expected reward, it may be too risky in practice. One solution is to constrain ϕ_j for any single test or to include ϕ_j in the objective function. We address this with the addition of the parameter a in constraint (18f), although the setting of a presents an additional parameter to select a priori. The current optimization problem assumes a risk-neutral player who wishes not to lose α-wealth, on average, when conducting a test. Since this desire is expressed in expectation, the variance of actual outcomes can be large, leading to α-death. For future work, it would be interesting to investigate a principled risk-hedging approach that conserves some wealth for future tests in the hope that a test with a more favorable reward structure is over the horizon.

A THEORETICAL ANALYSIS OF LONG-TERM ALPHA-WEALTH AND COST-AWARE ERO SOLUTION
A.1 Proof of Lemma 1

Proof of Lemma 1. The expected increment in α-wealth is

E[W_α(j) − W_α(j − 1)] = Pr(R_j = 1)ψ_j − ϕ_j.

This equation requires the probability of rejection, which can be written in factorized form as

Pr(R_j = 1) = Pr(R_j = 1 | θ_j ∈ H_j) Pr(θ_j ∈ H_j) + Pr(R_j = 1 | θ_j ∉ H_j) Pr(θ_j ∉ H_j).

Now, Pr(R_j = 1 | θ_j ∈ H_j) ≤ α_j by Assumption 1 and Pr(R_j = 1 | θ_j ∉ H_j) ≤ ρ_j by Assumption 2. Defining Pr(θ_j ∈ H_j) = q_j gives the result.
Thus, for a given q_j, the condition on s_j for stochastically non-increasing wealth is given by the upper bound in condition (21), which is valid if it is positive. For j large enough, if α_j/(1 − α_j) < α and α_j < 1/2, this term is positive and the upper bound for s_j is positive.
Simplifying equation 23 yields

0 = (q_j α_j + (1 − q_j)ρ_j)(−ϕ_j + ψ_j) + (q_j(1 − α_j) + (1 − q_j)(1 − ρ_j))(−ϕ_j).    (24)

Solving equation 24 for ρ_j gives

ρ_j = (1/(1 − q_j)) (ϕ_j/ψ_j − q_j α_j),

so that ρ_j ∝ ϕ_j/ψ_j − q_j α_j. This implies that the power of the test must balance the probability of rejection under the null and the ratio of the cost and reward of the test.
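The closed form above can be checked numerically. This is a minimal sketch, assuming the payoff structure ΔW_α = R_j ψ_j − ϕ_j implicit in equation 24; the helper names are ours.

```python
def rho_balance(phi, psi, q, alpha):
    """Power making the expected alpha-wealth increment zero:
    rho_j = (phi_j/psi_j - q_j*alpha_j) / (1 - q_j), as derived above."""
    return (phi / psi - q * alpha) / (1.0 - q)

def expected_increment(phi, psi, q, alpha, rho):
    """E[Delta W_alpha] = Pr(R_j = 1)*psi_j - phi_j with
    Pr(R_j = 1) = q_j*alpha_j + (1 - q_j)*rho_j."""
    return (q * alpha + (1.0 - q) * rho) * psi - phi
```

Plugging the balancing power back into the expected increment returns zero, confirming that the derived ρ_j makes the α-wealth a martingale for the chosen (ϕ_j, ψ_j, α_j, q_j).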

A.5 Existence and Uniqueness of Solution
Since the solution to the cost-aware ERO problem is in fact an ERO solution, the existence of a solution is proven in Lemma 2 of Aharoni and Rosset (2014) under assumptions that hold for a uniformly most powerful test with a continuous distribution function. Since these are the types of tests considered in the current work, the necessary assumptions are met.
Theorem 4. In the cost-aware ERO solution, ϕ is unique.
Proof. Suppose there exists a solution (ϕ*_j, ψ*_j, α*_j, ρ*_j, n*_j) satisfying equation 25. Expanding the expectation of rejections in equation 25 yields equation 26. As per Lemma 2, the α-wealth is a martingale when using a solution to the cost-aware ERO optimization problem. Applying Theorem 3 gives equations 27 and 28. Substituting equations 27 and 28 into equation 26 gives the result. Since the sample size n_j is now a free parameter, a natural question is whether a unique n_j can be selected. This is not necessarily the case. Consider the solution (ϕ_j, ψ_j, α_j, ρ_j, n_j) to the cost-aware ERO problem, and assume that a continuous distribution function is used. We now show that (ψ_j, α_j, ρ_j, n_j) are not necessarily unique.
Suppose there exists a solution (ϕ*_j, ψ*_j, α*_j, ρ*_j, n*_j) satisfying equation 32. From equation 32, we know that ϕ_j = ϕ*_j. From Aharoni and Rosset (2014), any solution that is ERO must satisfy equation 34. Using equation 34, it follows that ϕ*_j satisfies equation 35. Solving equations 34 and 35 for ϕ_j and ϕ*_j respectively gives equations 36 and 37. Substituting equations 36 and 37 into equation 32 gives equation 38. Suppose α*_j > α_j. It follows that ρ*_j > ρ_j. Without loss of generality (with respect to the test statistic having a continuous distribution function), assume the test statistic is normally distributed. Writing out ρ_j and ρ*_j explicitly then implies equation 43. Equation 43 shows that a range of n values can be used. In certain scenarios, this allows (ψ_j, α_j, ρ_j, n_j) ≠ (ψ*_j, α*_j, ρ*_j, n*_j). Considering the case when α*_j < α_j results in equation 43 having the inequality reversed. Note that n_j = 1 is not necessarily permitted by this range. Including n_j in our problem is still useful, despite it not being unique, since an a priori specification may not yield the same maximal expected reward as leaving n_j to be optimized.

B SIMULATION DETAILS
In this section we describe the simulations in greater detail so that our work can be fully reproduced. We briefly present the cost-aware ERO α-investing method in algorithmic form.
Algorithm 1 Cost-aware ERO Algorithm
Input: α, W_α(0), W_$(0)
j ← 0
while W_α(j) > 0 and W_$(j) > 0 do
    Define q_j, c_j for hypothesis j
    Solve Problem 18 to obtain ϕ_j, α_j, ρ_j, ψ_j, and n_j
    Collect data (x_j1, . . . , x_jn_j) and compute the p-value

For cost-aware ERO, the initial value of α_j was set to 0.001, and the initial value of ρ_j was set to 0.9. These initial values tended to give more conservative allocations of sample size, as the optimal value of ρ_j often approached a value of 1. In our simulation we define α = 0.05, W_α(0) = 0.0475, W_$(0) = 1000, n_iter = 10000 (number of iterations), m = 1000 (maximum number of tests per iteration), and c = 1 (cost per sample). W_α(0) for the implementations of LORD and SAFFRON follows suggestions from Javanmard and Montanari (2018) and Ramdas et al. (2018). An explicit algorithm is given in Algorithm 2. A similar experimental set-up is used for Table 5 and Table 7, where q_j and n_j are adjusted respectively.
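The loop structure of Algorithm 1 can be sketched as runnable code. This is a skeleton only: `solve_problem_18` and `run_test` are hypothetical stand-ins for the paper's optimizer and data-collection step, and the wealth updates assume the generalized α-investing payoff (pay ante ϕ_j; earn ψ_j on rejection).

```python
def cost_aware_ero_loop(alpha_wealth, dollar_wealth, solve_problem_18,
                        run_test, max_tests=1000, cost_per_sample=1):
    """Skeleton of Algorithm 1. `solve_problem_18` stands in for the
    paper's optimizer and `run_test` for data collection; both are
    placeholders, not the paper's implementation."""
    rejections = []
    while alpha_wealth > 0 and dollar_wealth > 0 and len(rejections) < max_tests:
        phi, alpha_j, rho_j, psi, n = solve_problem_18(alpha_wealth, dollar_wealth)
        p_value = run_test(n)
        reject = p_value < alpha_j
        # Generalized alpha-investing update: pay the ante phi, earn psi on rejection.
        alpha_wealth += (psi - phi) if reject else -phi
        dollar_wealth -= cost_per_sample * n
        rejections.append(reject)
    return rejections, alpha_wealth, dollar_wealth
```

With a fixed ante and no rejections, the loop terminates exactly when the α-wealth is exhausted, which mirrors the α-death behavior discussed in the main text.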

B.2 Experiment for Figure 2
We next discuss the experimental details for producing Figure 2. The initial value for ρ_j (and for ρ_{j+1}) was set to 1. In contrast to the previous experiment, we wish to demonstrate that myopic allocation of samples is avoided in finite-horizon ERO regardless of initial values. In our simulation we define α = 0.05, W_α(0) = 0.0475, W_$(0) = 1e8, n_iter = 2500, m = 1000, and c = 1. An additional q, specifically q_1001, is drawn for solving the finite-horizon optimization problem when we reach the final test. We sample q_j from a Beta(a, 100 − a) distribution, and then sample whether θ_j is null based on the realization of q_j. This sampling scheme and relevant parameter values are given in Algorithm 3.
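The two-stage sampling scheme above can be sketched as follows. This is an assumed reconstruction of the scheme referenced as Algorithm 3; the helper name and seeding are ours.

```python
import random

def draw_hypotheses(m=1000, a=90, seed=0):
    """Sketch of the sampling scheme: q_j ~ Beta(a, 100 - a), then
    theta_j is null with probability q_j."""
    rng = random.Random(seed)
    qs, is_null = [], []
    for _ in range(m):
        q = rng.betavariate(a, 100 - a)
        qs.append(q)
        is_null.append(rng.random() < q)
    return qs, is_null
```

With a = 90 the prior probabilities concentrate near E[q] = 0.9, so roughly ninety percent of the sampled hypotheses are null, matching the regime studied in the finite-horizon experiments.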

B.3 Experiment for Figure 4
The real data experiment shown in Figure 4 and detailed in Section 6 can be broken down into two steps: preprocessing and testing.
In preprocessing, we load two dataframes: one containing gene expression data for 50 normal (non-cancerous) samples (6033 × 50), and a second containing similar data for 52 tumor samples (6033 × 52). We then take the mean across the normal samples to obtain a (6033 × 1) vector containing the mean gene expression for normal patients. We calculate the standard deviation in a similar manner and use these vectors to standardize the (6033 × 52) dataframe containing tumor samples. Next, the first two columns of the tumor-samples dataframe are separated from the remaining 50 columns to provide an informed estimate of q_j for each test. It is important to note that we are allowing the potential for misspecification of q_j by using an estimate from only two samples. Using these two samples:
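The preprocessing steps above (standardize tumor expression by normal-sample statistics, then split off the first two tumor columns) can be sketched as follows. This is a minimal sketch using plain lists of rows; the function name and return layout are ours.

```python
def preprocess(normal, tumor):
    """Sketch of the preprocessing above. `normal` is genes x 50 and
    `tumor` is genes x 52, each given as a list of per-gene rows."""
    prior_cols, test_cols, means, sds = [], [], [], []
    for norm_row, tum_row in zip(normal, tumor):
        n = len(norm_row)
        mu = sum(norm_row) / n
        var = sum((x - mu) ** 2 for x in norm_row) / (n - 1)
        sd = var ** 0.5
        std_row = [(x - mu) / sd for x in tum_row]
        # First two standardized tumor columns inform q_j; the rest are tested.
        prior_cols.append(std_row[:2])
        test_cols.append(std_row[2:])
        means.append(mu)
        sds.append(sd)
    return prior_cols, test_cols, means, sds
```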

C EXTENSIONS OF COST-AWARE α-INVESTING
In this section, we explore extensions of cost-aware ERO α-investing.
In Problem 18 the monetary cost does not factor into the objective except through the constraints. In many practical applications, it may be useful to simultaneously maximize the α-reward and minimize the $-cost. In those applications, the objective function can be augmented to E(R_j)ψ_j − γ c_j n_j, where γ controls the trade-off between improving α-wealth and minimizing $-cost.
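The augmented objective is a one-line computation; a minimal sketch, with the function name ours:

```python
def augmented_objective(expected_rejection, psi, gamma, cost, n):
    """E(R_j)*psi_j - gamma*c_j*n_j: alpha-reward minus a gamma-weighted
    monetary cost. Setting gamma = 0 recovers the original objective."""
    return expected_rejection * psi - gamma * cost * n
```

Increasing γ penalizes large sample allocations, pushing the optimizer toward cheaper tests at the price of α-reward.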

C.2 Variable utility.
Not all hypotheses may have equal value to the investigator, and their value assessment may be independent of their assessment of the prior probability of the null hypothesis (Ramdas et al., 2017). For example, an investigator may be confident that a gene is differentially expressed in a particular tissue based on prior literature. Then the prior probability that θ_j = 0 is low, q_j ≈ 0, and the utility of testing that hypothesis is also low. There may be a different gene that has not been reported to be differentially expressed in the tissue, but if it were, it would be a major scientific discovery. Then, the investigator may assign a high prior probability to the null θ_j = 0, but also a high utility to the event that the null is rejected. A generalized form of the cost-aware decision rule can be constructed to account for varying utility levels for each hypothesis in the batch by making the objective function ∑_{j=1}^{K} E_θ(R_j) U(R_j) ψ_j, where U(R_j) is the utility of the rejection of the j-th null hypothesis.

C.3 Batch testing.
Many biological experiments are conducted in batches. This scenario leads to a need for a decision rule that provides (α_j, ψ_j, n_j) for j = 1, . . . , K for a batch of K tests. To address this need, the objective function in Problem 18 can be modified to ∑_{j=1}^{K} E_θ(R_j)ψ_j. It seems reasonable to expend all of the α-wealth for each batch and then collect the reward at the completion of the batch so that a next batch of hypotheses can be tested. Therefore, we have constraints ∑_{j=1}^{K} ϕ_j ≤ W_α(0) and ∑_{j=1}^{K} c_j n_j ≤ W_$(0). The other constraints remain and apply to each test in the batch.
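The batch-level budget constraints above can be expressed as a simple feasibility check; a minimal sketch, with the function name ours:

```python
def batch_feasible(phis, costs, ns, w_alpha0, w_dollar0):
    """Check the batch-level budget constraints described above:
    sum(phi_j) <= W_alpha(0) and sum(c_j * n_j) <= W_$(0)."""
    alpha_spend = sum(phis)
    dollar_spend = sum(c * n for c, n in zip(costs, ns))
    return alpha_spend <= w_alpha0 and dollar_spend <= w_dollar0
```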
D METHOD COMPARISON WITH q = 0.1

In Table 5 we explore the comparison of cost-aware ERO investing with other methods for q_j = 0.1. The results presented in the main body of the paper are for q_j = 0.9 to align with previous work. Naturally, when nulls occur infrequently, the issue of multiple testing is not as dire. When true alternatives are abundant, cost-aware ERO requires a large ante (ϕ_j). In this simulation we set a = 1 to highlight this effect. This causes cost-aware ERO to rapidly deplete the α-wealth. In contrast, other methods do not increase the ante as severely as cost-aware ERO. However, it should be noted that the fraction of tests that are true rejects among those performed is very high. For example, in constant ERO investing the proportion of true rejects is 24%, while the proportion of true rejects for cost-aware ERO (n_j ≤ 10) is 90%. This is a highly desirable result for the setting of biological experiments and other settings where sample cost is nontrivial.

Table 5: Comparison of cost-aware α-investing with state-of-the-art sequential hypothesis testing methods with a prior probability of the null, q = 0.1, using 2,500 iterations. Cost-aware uses an initial ρ_j of 1 for each iteration.

Figure 1: Extensive form of the two-step game between the Investigator (Player I) and Nature (Player II). Strategies for each player are italicized. The leaves are labeled to denote the strategy taken by the investigator and are enumerated for the equations presented in Appendix G.

Figure 4: Comparison of cost-aware ERO investing and ERO investing for a single permutation of the prostate cancer gene expression data. Cost-aware ERO distributes the finite allocation of samples across the set of genes, while ERO expends the sample allocation within the first 20 tests. The lack of a rejection for cost-aware ERO at test 8 is a result of a noisy estimate of q_8, which led the algorithm not to allocate the 50 samples the test would have required. However, this restraint in allocating samples allows for rejections at tests 24, 25, and 29.

Table 2: Comparison of cost-aware α-investing with state-of-the-art sequential hypothesis testing methods. Values for Tests, True Rejects, and False Rejects are the average across 10,000 iterations, and these estimates are used for mFDR.

Table 4: Real data analysis results. Values displayed represent the average across 1,000 permutations of each data set.