Basket trials have captured much attention in oncology research in recent years, as advances in health technology have opened up the possibility of classifying patients at the genomic level. Bayesian methods are particularly prevalent in basket trials because a hierarchical structure can be adapted to allow information borrowing across baskets. In this article, we extend Bayesian methods to basket trials with treatment and control arms and continuous endpoints, which is often the case in clinical trials for rare diseases. To account for imbalance in covariates that are potentially strong predictors but not stratified on in a randomized trial, our models adjust for these covariates and allow the coefficients to differ across baskets. In addition, comparisons are drawn between two-stage and one-stage designs for the four Bayesian methods. Extensive simulation studies examine the empirical performance of all models under consideration, and a real data analysis further demonstrates the usefulness of the Bayesian methods.
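To make the information-borrowing idea concrete, the following is a minimal sketch of a covariate-adjusted, basket-specific analysis followed by normal-normal shrinkage toward a common mean. It uses an empirical-Bayes approximation rather than the full Bayesian models studied in the article, and the basket counts, effect sizes, and covariate structure are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 baskets, each with a randomized treatment/control arm,
# a continuous endpoint, and one unstratified covariate with a basket-specific slope.
true_effects = np.array([0.5, 0.5, 0.0, 0.8])
baskets = []
for k, delta in enumerate(true_effects):
    n = 30
    trt = rng.integers(0, 2, n)        # 1 = treatment, 0 = control
    x = rng.normal(size=n)             # prognostic covariate
    y = 0.3 * (k + 1) * x + delta * trt + rng.normal(scale=1.0, size=n)
    baskets.append((y, trt, x))

# Per-basket OLS adjusting for the covariate (basket-specific coefficient).
est, var = [], []
for y, trt, x in baskets:
    X = np.column_stack([np.ones_like(y), trt, x])
    beta, res, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = res[0] / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    est.append(beta[1])
    var.append(cov[1, 1])
est, var = np.array(est), np.array(var)

# Normal-normal shrinkage toward the common mean (empirical-Bayes stand-in for a
# hierarchical model): larger between-basket variance tau2 means less borrowing.
tau2 = max(np.var(est, ddof=1) - var.mean(), 1e-6)
w = tau2 / (tau2 + var)
shrunk = w * est + (1 - w) * np.average(est, weights=1 / var)
print("per-basket estimates:", np.round(est, 2))
print("shrunken estimates:  ", np.round(shrunk, 2))
```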
In diagnostic imaging drug development, the imaging scan read data in controlled imaging drug clinical trials include test-positive and test-negative results. Broadly speaking, the standard of reference data indicate either the presence or absence of a disease or clinical condition. Together, these data are used to assess the diagnostic performance of an investigational imaging drug in a controlled imaging drug clinical trial. For imaging scan reads that cannot be called positive or negative, the “indeterminate” category is commonly used to cover results that may be considered intermediate, indeterminate, or uninterpretable. Similarly, for standard of reference data that cannot be categorized as presence/absence, including uncollected or unavailable reference standard data, the “indeterminate” category may be used. Historically, little attention has been paid to indeterminate imaging scan read data because they are generally rare or considered irrelevant, even though they relate to scanned subjects and can be informative. Subjects lacking a standard of reference are simply excluded, so the study reports analysis results only for subjects with available standard of reference data, known as a completer analysis, similar to the evaluable subjects seen in controlled trials for drug development.
To improve diagnostic clinical trial planning, this paper introduces five attributes of an estimand in diagnostic imaging drug clinical trials. The paper then defines the indeterminate data mechanisms and gives examples of each mechanism specific to the clinical context of a diagnostic imaging drug clinical trial. Several imputation approaches to handling indeterminate data are discussed. Depending on the clinical question of primary interest, indeterminate data may be intercurrent events. The paper ends with a discussion of imputation for intercurrent events occurring in indeterminate imaging scan read data and in indeterminate standard of reference data, and provides points to consider on estimands for diagnostic imaging drug development.
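As a small illustration of why the handling of indeterminate results matters, the sketch below computes sensitivity and specificity under a completer analysis versus imputing indeterminate reads as test-negative. The counts and the imputation rule are purely hypothetical and are not taken from the paper.

```python
# Hypothetical read counts against the standard of reference (disease present / absent),
# with an "indeterminate" read category; the numbers are illustrative only.
counts = {
    ("positive", "present"): 80, ("negative", "present"): 15, ("indeterminate", "present"): 5,
    ("positive", "absent"):  10, ("negative", "absent"):  85, ("indeterminate", "absent"):  5,
}

def sens_spec(counts, indeterminate_as=None):
    """Completer analysis if indeterminate_as is None; otherwise impute
    indeterminate reads as 'positive' or 'negative'."""
    c = dict(counts)
    for truth in ("present", "absent"):
        n_ind = c.pop(("indeterminate", truth))
        if indeterminate_as is not None:
            c[(indeterminate_as, truth)] += n_ind
    sens = c[("positive", "present")] / (c[("positive", "present")] + c[("negative", "present")])
    spec = c[("negative", "absent")] / (c[("negative", "absent")] + c[("positive", "absent")])
    return round(sens, 3), round(spec, 3)

print("completer analysis:        ", sens_spec(counts))
print("indeterminate -> negative: ", sens_spec(counts, indeterminate_as="negative"))
```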
Designing longitudinal studies is generally very challenging because of the complex optimization problems involved. We show that the popular nature-inspired metaheuristic algorithm, Particle Swarm Optimization (PSO), can find different types of optimal exact designs for longitudinal studies under different correlation structures and different types of models. In particular, we demonstrate that PSO-generated D-optimal longitudinal designs for the widely used Michaelis-Menten model with various correlation structures agree with the analytically derived locally D-optimal designs reported in the literature when there are only 2 observations per subject, and with the reported numerical D-optimal designs when there are 3 or 4 observations per subject. We further show the usefulness of PSO by applying it to generate new locally D-optimal designs for estimating model parameters when there are 5 or more observations per subject. Additionally, we find various optimal longitudinal designs for a growth curve model commonly used in animal studies and for a nonlinear HIV dynamic model for studying T-cells in AIDS subjects. In particular, c-optimal exact designs for estimating one or more functions of the model parameters (c-optimality) were found, along with other types of multiple-objective optimal designs.
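The sketch below illustrates the basic PSO-for-design idea on the Michaelis-Menten model in the simplest setting of independent errors (i.e., without the longitudinal correlation structures treated in the paper); the nominal parameter values, design interval, and PSO tuning constants are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: locally D-optimal exact design for the Michaelis-Menten model
# eta(x) = V*x/(K + x) with independent errors, nominal values V = 1, K = 2,
# design space [0, 10], and m support points.
V, K, lo, hi, m = 1.0, 2.0, 0.0, 10.0, 2

def neg_log_det(points):
    f = np.column_stack([points / (K + points),             # d eta / d V
                         -V * points / (K + points) ** 2])  # d eta / d K
    M = f.T @ f                                              # information matrix
    sign, logdet = np.linalg.slogdet(M)
    return np.inf if sign <= 0 else -logdet

# Plain-vanilla PSO over the m design points.
n_particles, n_iter, w, c1, c2 = 30, 200, 0.7, 1.5, 1.5
pos = rng.uniform(lo, hi, size=(n_particles, m))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([neg_log_det(p) for p in pos])
gbest = pbest[np.argmin(pbest_val)].copy()

for _ in range(n_iter):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lo, hi)
    vals = np.array([neg_log_det(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[np.argmin(pbest_val)].copy()

print("PSO design points:", np.sort(np.round(gbest, 3)))
```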
The continuation-ratio (CR) model is frequently used in dose-response studies to model a three-category outcome as the dose level varies. Design issues for a CR model defined on an unrestricted dose interval have been discussed for estimating model parameters or a selected function of the model parameters. This paper uses metaheuristics to address design issues for a CR model defined on any compact dose interval when there are one or more objectives in the study and some are more important than others. Specifically, we use an exemplary nature-inspired metaheuristic algorithm called particle swarm optimization (PSO) to find locally optimal designs for estimating interesting functions of the model parameters, such as the most effective dose ($MED$) and the maximum tolerated dose ($MTD$), as well as for estimating all parameters in a CR model. We demonstrate that PSO can efficiently find locally multiple-objective optimal designs for a CR model on various dose intervals, and a small simulation study shows it tends to outperform the popular deterministic cocktail algorithm (CA) and another competitive metaheuristic algorithm, differential evolution (DE). We also discuss hybrid algorithms and their flexible applications to designing early Phase 2 trials or tackling biomedical problems, such as different strategies for handling the recent pandemic.
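For reference, one common parametrization of the continuation-ratio model for a three-category outcome (e.g., no reaction, efficacy without toxicity, toxicity) at dose $x$ writes the two continuation-ratio logits as linear in dose; the exact parametrization and the target functions used in the paper may differ:

$$\log\frac{\pi_1(x)}{\pi_2(x)+\pi_3(x)} = \alpha_1 + \beta_1 x, \qquad \log\frac{\pi_2(x)}{\pi_3(x)} = \alpha_2 + \beta_2 x, \qquad \pi_1(x)+\pi_2(x)+\pi_3(x)=1.$$

Quantities such as the $MED$ and $MTD$ are then nonlinear functions of $(\alpha_1,\beta_1,\alpha_2,\beta_2)$, which is what makes c-optimal and multiple-objective designs relevant.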
Crossover models and interference models are frequently used in clinical trials, agricultural studies, social studies, and elsewhere. While some theoretical optimality results are available, it is still challenging to apply them in practice. Because exact optimal designs are complex, the available theoretical results typically require specific combinations of the number of treatments ($t$), periods ($p$), and subjects ($n$). A more flexible approach is to build integer programming on approximate design theory, which can handle general cases of $(t,p,n)$. Nonetheless, such results are generally derived for specific models or design problems, and new efforts are needed for new problems. These obstacles make the application of the theoretical results rather difficult. Here we propose a new algorithm, a revision of an existing optimal weight exchange algorithm. It quickly provides efficient crossover designs under various situations, for different optimality criteria, different parameters of interest, different configurations of $(t,p,n)$, as well as arbitrary dropout scenarios. To facilitate the use of our algorithm, the corresponding R package and an R Shiny app providing a more user-friendly interface have been developed.
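As a small illustration of the kind of quantity such algorithms optimize, the sketch below evaluates the information matrix for direct treatment effects of a candidate crossover design under a textbook fixed-effects model with subject, period, and first-order carryover terms and i.i.d. errors; the model, design, and dimensions are simpler than the general settings handled by the proposed algorithm.

```python
import numpy as np

def info_direct_effects(sequences, t):
    """Information matrix for direct treatment effects in a crossover design,
    after projecting out subject, period, and first-order carryover terms.
    `sequences` is an (n x p) array of treatment labels 0..t-1."""
    n, p = sequences.shape
    Td, Tc, S, P = [], [], [], []
    for i in range(n):
        for j in range(p):
            d = np.zeros(t); d[sequences[i, j]] = 1      # direct treatment effect
            c = np.zeros(t)
            if j > 0:
                c[sequences[i, j - 1]] = 1               # first-order carryover
            s = np.zeros(n); s[i] = 1                    # subject effect
            per = np.zeros(p); per[j] = 1                # period effect
            Td.append(d); Tc.append(c); S.append(s); P.append(per)
    Td = np.array(Td)
    Z = np.column_stack([np.array(Tc), np.array(S), np.array(P)])  # nuisance terms
    PZ = Z @ np.linalg.pinv(Z)                           # projection onto nuisance space
    return Td.T @ (np.eye(len(Td)) - PZ) @ Td

# Example: the classical AB/BA design with 4 subjects, t = 2 treatments, p = 2 periods.
seqs = np.array([[0, 1], [1, 0], [0, 1], [1, 0]])
print(np.round(info_direct_effects(seqs, t=2), 3))
```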
Subdata selection from big data is an active area of research that facilitates inference from big data at limited computational expense. For linear regression models, the optimal design-inspired Information-Based Optimal Subdata Selection (IBOSS) method is a computationally efficient method for selecting subdata with excellent statistical properties. But the method can only be used if the subdata size, $k$, is at least twice the number of regression variables, $p$. In addition, even when $k\ge 2p$, under the assumption of effect sparsity one can expect to obtain subdata with better statistical properties by focusing on active variables. Inspired by recent efforts to extend the IBOSS method to situations with a large number of variables $p$, we introduce a method called Combining Lasso And Subdata Selection (CLASS) that, as we show, improves on other proposed methods for variable selection and for building a predictive model from subdata when the full data size $n$ is very large and the number of variables $p$ is large. In terms of computational expense, CLASS is more expensive than recent competitors for moderately large values of $n$, but the roles reverse under effect sparsity for extremely large values of $n$.
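A minimal sketch of the two-step idea behind combining the Lasso with IBOSS-style subdata selection is given below; the pilot-sample screening, tuning parameter, and data-generating setup are hypothetical simplifications, not the paper's exact CLASS algorithm.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)

# Synthetic big-data problem with effect sparsity: only 5 of 50 variables are active.
n, p, k = 100_000, 50, 1_000
beta = np.zeros(p); beta[:5] = 1.0
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)

# Step 1: Lasso screening of active variables on a small pilot subsample.
pilot = rng.choice(n, 2_000, replace=False)
active = np.flatnonzero(Lasso(alpha=0.05).fit(X[pilot], y[pilot]).coef_ != 0)

# Step 2: IBOSS-style selection -- for each screened variable, keep the rows with
# the smallest and largest values, which tends to maximize information.
r = max(k // (2 * len(active)), 1)
keep = set()
for j in active:
    order = np.argsort(X[:, j])
    keep.update(order[:r]); keep.update(order[-r:])
idx = np.array(sorted(keep))

# Fit OLS on the selected subdata, restricted to the screened variables.
Xs = np.column_stack([np.ones(len(idx)), X[np.ix_(idx, active)]])
coef, *_ = np.linalg.lstsq(Xs, y[idx], rcond=None)
print("screened variables:", active, "| subdata size:", len(idx))
print("estimated active coefficients:", np.round(coef[1:], 3))
```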
The supersaturated design is often used to discover important factors in an experiment with a large number of factors and a small number of runs. We propose a method for constructing supersaturated designs with small coherence. Such designs are useful for variable selection methods such as the Lasso. Examples are provided to illustrate the proposed method.
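For concreteness, coherence here refers to the largest absolute normalized inner product between distinct columns of the design matrix. The toy sketch below computes it and searches over random two-level designs; the construction proposed in the paper is structured, not random.

```python
import numpy as np

def coherence(D):
    """Maximum absolute correlation (coherence) between distinct columns of a
    two-level design matrix D with entries +/-1 and n runs."""
    G = D.T @ D / D.shape[0]
    np.fill_diagonal(G, 0.0)
    return np.abs(G).max()

# Toy illustration: random +/-1 supersaturated designs with n = 12 runs and
# m = 20 factors, keeping the one with the smallest coherence found.
rng = np.random.default_rng(3)
best_coh = np.inf
for _ in range(5_000):
    D = rng.choice([-1.0, 1.0], size=(12, 20))
    best_coh = min(best_coh, coherence(D))
print("smallest coherence found:", round(best_coh, 3))
```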
Traditionally, research in nutritional epidemiology has focused on specific foods/food groups or single nutrients in their relation to disease outcomes, including cancer. Dietary pattern analysis was introduced to examine potential cumulative and interactive effects of individual dietary components of the overall diet, in which foods are consumed in combination. Dietary patterns can be identified by using evidence-based, investigator-defined approaches or by using data-driven approaches, which rely on either response-independent (also called “a posteriori” dietary patterns) or response-dependent (also called “mixed-type” dietary patterns) multivariate statistical methods. Against the open methodological challenges related to study design, dietary assessment, identification of dietary patterns, confounding phenomena, and cancer risk assessment, the current paper provides an updated landscape review of novel methodological developments in the statistical analysis of a posteriori/mixed-type dietary patterns and cancer risk. The review starts from standard a posteriori dietary patterns from principal component, factor, and cluster analyses, including mixture models, and examines mixed-type dietary patterns from reduced rank regression, partial least squares, classification and regression tree analysis, and the least absolute shrinkage and selection operator. Novel statistical approaches reviewed include Bayesian factor analysis with modeling of sparsity through shrinkage and sparse priors, and frequentist focused principal component analysis. Most novelties relate to the reproducibility of dietary patterns across studies, where the potential of the Bayesian approach to factor and cluster analysis is best realized.
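A minimal sketch of the standard a posteriori approach, principal component analysis of standardized food-group intakes, is shown below on synthetic data; the food groups, sample size, and induced correlation structure are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)

# Synthetic food-group intakes; induce a correlated "prudent-like" block so that a
# dietary pattern exists to be recovered.
foods = ["vegetables", "fruit", "red_meat", "processed_meat", "whole_grains", "sweets"]
n = 500
intake = rng.gamma(shape=2.0, scale=1.0, size=(n, len(foods)))
intake[:, 1] += 0.8 * intake[:, 0]
intake[:, 4] += 0.6 * intake[:, 0]

# A posteriori patterns: principal components of standardized intakes.
Z = StandardScaler().fit_transform(intake)
pca = PCA(n_components=2).fit(Z)
for i, load in enumerate(pca.components_):
    top = sorted(zip(foods, np.round(load, 2)), key=lambda t: -abs(t[1]))
    print(f"pattern {i + 1} (explains {pca.explained_variance_ratio_[i]:.0%} of variance):", top[:3])
```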
In online experimentation, appropriate metrics (e.g., purchase) provide strong evidence to support hypotheses and enhance the decision-making process. However, incomplete metrics frequently occur in online experimentation, leaving far less data available than planned for the online experiments (e.g., A/B tests). In this work, we introduce the concept of dropout buyers and categorize users with incomplete metric values into two groups: visitors and dropout buyers. For the analysis of incomplete metrics, we propose a clustering-based imputation method using k-nearest neighbors. Our proposed imputation method considers both experiment-specific features and users’ activities along their shopping paths, allowing different imputation values for different users. To facilitate efficient imputation for large-scale data sets in online experimentation, the proposed method uses a combination of stratification and clustering. The performance of the proposed method is compared to several conventional methods in both simulation studies and a real online experiment at eBay.
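A stripped-down sketch of the k-nearest-neighbor imputation idea is given below; it omits the stratification and clustering steps the method uses for scalability, and the activity features and data-generating setup are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(5)

# Hypothetical user-level activity features and a purchase metric that is missing
# for roughly 30% of users (the incomplete-metric problem described above).
n = 2_000
features = np.column_stack([
    rng.poisson(5, n),        # page views
    rng.poisson(2, n),        # add-to-cart events
    rng.integers(0, 2, n),    # returning-user flag
]).astype(float)
purchase = 5.0 + 0.8 * features[:, 1] + rng.normal(scale=2.0, size=n)
observed = rng.random(n) > 0.3

# For each user with a missing metric, average the metric of the k most similar
# users (by activity features) among those with observed values.
k = 10
nn = NearestNeighbors(n_neighbors=k).fit(features[observed])
_, neighbor_idx = nn.kneighbors(features[~observed])
imputed = purchase[observed][neighbor_idx].mean(axis=1)

filled = purchase.copy()
filled[~observed] = imputed
print("mean metric (observed only):     ", round(purchase[observed].mean(), 2))
print("mean metric (after kNN imputation):", round(filled.mean(), 2))
```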
A master protocol is a type of trial design in which multiple therapies and/or multiple disease populations can be investigated in the same trial. A shared control can be used for multiple therapies to gain operational efficiency and make the trial more attractive to patients. To balance controlling the false positive rate against having adequate power for detecting true signals, the impact on the False Discovery Rate (FDR) is evaluated when multiple investigational drugs are studied in the master protocol. With the shared control group, a “random high” or “random low” in the control group can potentially impact all hypothesis tests comparing each test regimen with the control group, in terms of the probability of having at least one positive hypothesis outcome, or multiple positive outcomes. When regulatory agencies make the decision to approve or decline one or more regimens based on the master protocol design, this introduces a different type of error: the simultaneous false-decision error. In this manuscript, we examine in detail the derivations and properties of the simultaneous false-decision error in the master protocol with shared control under the FDR framework. The simultaneous false-decision error consists of two parts: the simultaneous false-discovery rate (SFDR) and the simultaneous false non-discovery rate (SFNR). Based on our analytical evaluations and simulations, the magnitude of SFDR and SFNR inflation is small. Therefore, the multiple error rate controls are generally adequate; further adjusting SFDR or SFNR to a pre-specified level, or reducing the alpha allocated to each individual treatment comparison with the shared control, is deemed unnecessary.
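The Monte Carlo sketch below illustrates, but does not reproduce, the phenomenon discussed here: a shared control’s “random high” or “random low” correlates the test statistics across arms, so simultaneous false decisions occur more often than independent tests would suggest. All design parameters are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Four investigational arms compared with one shared control; the first two arms
# are truly null. One-sided z-tests at alpha = 0.025 per comparison.
n_per_arm, n_arms, n_sims = 100, 4, 20_000
effects = np.array([0.0, 0.0, 0.4, 0.4])

at_least_one_false, both_nulls_rejected = 0, 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n_per_arm)
    z = np.empty(n_arms)
    for a in range(n_arms):
        treat = rng.normal(effects[a], 1.0, n_per_arm)
        z[a] = (treat.mean() - control.mean()) / np.sqrt(2 / n_per_arm)
    reject = z > 1.96
    at_least_one_false += reject[:2].any()
    both_nulls_rejected += reject[:2].all()

print("P(at least one false positive):   ", round(at_least_one_false / n_sims, 4))
print("P(both null arms falsely positive):", round(both_nulls_rejected / n_sims, 4))
```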