Community detection in networks is the process by which unusually well-connected sub-networks are identified, a central component of many applied network analyses. The paradigm of modularity quality function optimization stipulates a partition of the network's vertices that maximizes the difference between the fraction of edges within communities and the corresponding expected fraction if edges were randomly allocated among all vertex pairs while conserving the degree distribution. The modularity quality function incorporates only the network's topology and has been studied extensively, whereas the integration of constraints or external information on community composition has remained largely unexplored. We define a greedy, recursive-backtracking search procedure that identifies high-quality network communities satisfying the global constraint that each community contain at least one vertex from a designated set of so-called special vertices. We apply our methodology to identifying health care communities (HCCs) within a network of hospitals such that each HCC contains at least one hospital in which at least a minimum number of cardiac defibrillator surgeries were performed. This restriction permits meaningful comparisons of cardiac care among the resulting health care communities by standardizing the distribution of cardiac care across the hospital network.
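The constraint can be made concrete in a few lines. The sketch below is a minimal illustration, not the authors' recursive-backtracking procedure: it runs an off-the-shelf greedy modularity search (networkx) and then repairs the partition by merging any community lacking a special vertex into the neighbor it shares the most edges with; the choice of special vertices is hypothetical.

```python
# Minimal illustrative sketch, NOT the authors' recursive-backtracking
# procedure: greedy modularity maximization followed by a repair pass that
# merges any community without a "special" vertex into the neighboring
# community it sends the most edges to.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def constrained_communities(G, special):
    """Communities in which each part contains at least one special vertex."""
    comms = [set(c) for c in greedy_modularity_communities(G)]
    while True:
        bad = next((c for c in comms if not c & special), None)
        if bad is None:
            return comms
        comms.remove(bad)
        # merge into the community receiving the most edges from 'bad'
        target = max(comms, key=lambda c: sum(1 for u in bad
                                              for v in G[u] if v in c))
        target |= bad

G = nx.karate_club_graph()
special = {0, 33}   # hypothetical special vertices (e.g., high-volume hospitals)
for community in constrained_communities(G, special):
    print(sorted(community))
```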
In the interest of business innovation, social network companies often carry out experiments to test product changes and new ideas. In such experiments, users are typically assigned to one of two experimental conditions, and an outcome of interest is observed and compared between the conditions. In this setting, the outcome of a user may be influenced not only by the condition to which they are assigned but also by the conditions of other users via their network connections. This challenges classical experimental design and analysis methodologies and requires specialized methods. We introduce the general additive network effect (GANE) model, which encompasses many existing outcome models in the literature under a unified model-based framework. The model is both interpretable and flexible in modeling the treatment effect as well as the network influence. We show that (quasi) maximum likelihood estimators are consistent and asymptotically normal for a family of model specifications. Quantities of interest such as the global treatment effect are defined and expressed as functions of the GANE model parameters, and hence inference can be carried out using likelihood theory. We further propose the “power-degree” (POW-DEG) specification of the GANE model. The performance of POW-DEG and other specifications of the GANE model is investigated via simulations. Under model misspecification, the POW-DEG specification appears to work well. Finally, we study the characteristics of good experimental designs for the POW-DEG specification. We find that graph-cluster randomization and balanced designs are not necessarily optimal for precise estimation of the global treatment effect, indicating the need for alternative design strategies.
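As a purely illustrative sketch of what an additive network-effect outcome model can look like, one generic form is shown below; the exact GANE specification and its POW-DEG variant are defined in the paper, and the neighbor weights w_ij, degree weight g, and exponent gamma are assumptions made here for concreteness.

```latex
% Illustrative only; not the paper's exact GANE or POW-DEG specification.
\[
  y_i \;=\; \mu \;+\; \tau\, z_i \;+\; g(d_i) \sum_{j\,:\,(i,j)\in E} w_{ij}\, z_j \;+\; \varepsilon_i,
  \qquad g(d) = d^{\gamma},
\]
% z_i: condition of user i; E: edge set; d_i: degree of user i. A
% power-of-degree weight g of this kind would suggest the "power-degree"
% nickname, but the paper's definition governs.
```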
Data matrix centering is an ever-present yet under-examined aspect of data analysis. Functional data analysis (FDA) often operates with a default of centering such that the vectors in one dimension have mean zero. We find that centering along the other dimension identifies a novel and useful mode of variation beyond those familiar in FDA. We explore ambiguities in both matrix orientation and nomenclature, since the differences between centerings, and their potential interaction, are easily misunderstood. We propose a unified framework and new terminology for centering operations, and we demonstrate the intuition behind and the consequences of each centering choice with informative graphics. We also propose a new direction energy hypothesis test as part of a series of diagnostics for determining which choice of centering is best for a data set, and we explore the application of these diagnostics in several FDA settings.
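The centering choices at issue are easy to state concretely. The following minimal numpy sketch shows column centering, row centering, and double centering of a curves-by-timepoints matrix; the orientation convention is itself one of the ambiguities noted above, and the paper's terminology is richer than these three labels.

```python
# Minimal sketch of the centering choices discussed above. Rows index
# curves, columns index evaluation points; this orientation convention is
# itself one of the ambiguities the paper examines.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))          # 20 curves observed at 50 points

col_centered = X - X.mean(axis=0)      # each column (time point) has mean 0
row_centered = X - X.mean(axis=1, keepdims=True)  # each curve has mean 0
double_centered = (X - X.mean(axis=0)
                     - X.mean(axis=1, keepdims=True)
                     + X.mean())       # both at once, grand mean added back

print(np.allclose(double_centered.mean(axis=0), 0),
      np.allclose(double_centered.mean(axis=1), 0))
```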
Systems with both quantitative and qualitative responses are widely encountered in many applications. Design of experiments methods are needed when experiments are conducted to study such systems, and classical experimental design methods are unsuitable here because they often focus on a single type of response. In this paper, we develop a Bayesian D-optimal design method for experiments with one continuous and one binary response. Both noninformative and conjugate informative prior distributions on the unknown parameters are considered. The proposed design criterion has meaningful interpretations in terms of the D-optimality of the models for both types of responses. An efficient point-exchange search algorithm is developed to construct local D-optimal designs for given parameter values. Global D-optimal designs are obtained by accumulating the frequencies of the design points in local D-optimal designs, where the parameters are sampled from the prior distributions. The performance of the proposed methods is evaluated through two examples.
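To illustrate the point-exchange idea in its simplest form, the toy sketch below searches for a local D-optimal design under an ordinary linear model using the classical log|F'F| criterion; the paper's criterion, which couples a continuous and a binary response, is more involved, and the candidate grid here is an assumption.

```python
# Toy point-exchange search for a local D-optimal design under a simple
# linear model y = b0 + b1*x1 + b2*x2, maximizing log|F'F|. Illustrates
# only the exchange mechanics, not the paper's coupled-response criterion.
import numpy as np
from itertools import product

def model_row(x):                        # regression vector f(x)
    return np.array([1.0, x[0], x[1]])

candidates = [np.array(p) for p in product(np.linspace(-1, 1, 5), repeat=2)]

def log_det(design):
    F = np.array([model_row(x) for x in design])
    sign, ld = np.linalg.slogdet(F.T @ F)
    return ld if sign > 0 else -np.inf

rng = np.random.default_rng(1)
design = [candidates[i] for i in rng.choice(len(candidates), size=9)]

improved = True
while improved:                          # exchange points until no gain
    improved = False
    for i in range(len(design)):
        for c in candidates:
            trial = design[:i] + [c] + design[i + 1:]
            if log_det(trial) > log_det(design) + 1e-9:
                design, improved = trial, True
print(np.array(design))                  # runs concentrate at the corners
```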
We are pleased to launch the first issue of the New England Journal of Statistics in Data Science (NEJSDS). NEJSDS is the official journal of the New England Statistical Society (NESS), operated under the leadership of the NESS Vice President for Journal and Publication and sponsored by the College of Liberal Arts and Sciences, University of Connecticut. The aims of the journal are to serve as an interface between statistics and other disciplines in data science, to encourage researchers to exchange innovative ideas, and to promote data science methods to the general scientific community. The journal publishes high-quality original research, novel applications, and timely review articles in all aspects of data science, including all areas of statistical methodology, machine learning and artificial intelligence methods, novel algorithms, computational methods, data management and manipulation, and applications of data science methods, among others. We encourage authors to submit collaborative work driven by real-life problems posed by researchers, administrators, educators, or other stakeholders, work that requires original and innovative solutions from data scientists.
In addition to answering scientific questions, clinical trialists often explore or require other design features, such as increasing power while controlling the type I error rate, minimizing unnecessary exposure to inferior treatments, and comparing multiple treatments in a single clinical trial. We propose implementing adaptive seamless design (ASD) with response-adaptive randomization (RAR) to satisfy these varied design objectives. However, combining ASD and RAR poses a challenge for controlling the type I error rate. In this paper, we investigate how to exploit the advantages of the two adaptive methods while controlling the type I error rate, and we offer the theoretical foundation for this procedure. Numerical studies demonstrate that our methods can achieve efficient and ethical objectives while controlling the type I error rate.
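For readers unfamiliar with RAR, the sketch below simulates one common response-adaptive rule for two arms with binary outcomes, tilting allocation toward the arm more likely to be better under Beta posteriors; this is a generic illustration, not the paper's combined ASD+RAR procedure or its type I error control, and the response rates are hypothetical.

```python
# Minimal sketch of response-adaptive randomization (RAR) for two arms with
# binary outcomes: allocation tilts toward the arm more likely to be better
# under Beta posteriors. Generic illustration only; the paper's ASD+RAR
# procedure and its type I error control are more involved.
import numpy as np

rng = np.random.default_rng(2)
true_rates = [0.3, 0.5]                  # hypothetical response rates
succ, fail = [1, 1], [1, 1]              # Beta(1, 1) priors for each arm

for patient in range(200):
    # Pr(arm 1 better | data), estimated by Monte Carlo from the posteriors
    p_arm1 = np.mean(rng.beta(succ[1], fail[1], 500)
                     > rng.beta(succ[0], fail[0], 500))
    arm = int(rng.random() < p_arm1)     # randomize with tilted probability
    y = int(rng.random() < true_rates[arm])
    succ[arm] += y
    fail[arm] += 1 - y

print("patients per arm:", succ[0] + fail[0] - 2, succ[1] + fail[1] - 2)
```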
The performance of a learning technique relies heavily on its hyperparameter settings. Hyperparameter tuning is therefore called for, yet it may be too computationally expensive for sophisticated learning techniques such as deep learning. It is thus desirable to explore the relationship between the hyperparameters and the performance of a learning technique expeditiously, which in turn calls for design strategies that collect informative data efficiently. Various designs can be considered for this purpose, and the question of which design to use naturally arises. In this paper, we examine the use of different types of designs for efficiently collecting informative data to study the surface of test accuracy, a measure of the performance of a learning technique, over the hyperparameters. Under the settings we considered, we find that the strong orthogonal array outperforms all other comparable designs.
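The data-collection step can be pictured with a small stand-in design. The sketch below explores a hypothetical test-accuracy surface over two hyperparameters using a Latin hypercube from scipy; the strong orthogonal array studied in the paper would replace the sampler (its construction is not shown here), and the accuracy function is an assumed stand-in for training and scoring a real learner.

```python
# Sketch: design-based exploration of a test-accuracy surface over two
# hyperparameters. A Latin hypercube stands in for the strong orthogonal
# array studied in the paper; test_accuracy is a hypothetical surface in
# place of an actual training run.
import numpy as np
from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=2, seed=3)
unit = sampler.random(n=16)                            # 16 runs in [0, 1]^2
lr = 10 ** qmc.scale(unit[:, [0]], -4, -1).ravel()     # learning rate
width = qmc.scale(unit[:, [1]], 16, 256).ravel().round().astype(int)

def test_accuracy(lr, width):
    # hypothetical response surface in place of training a model
    return 0.9 - (np.log10(lr) + 2.5) ** 2 / 20 + width / 5000

for r, wd in zip(lr, width):
    print(f"lr={r:.1e}  width={wd:4d}  acc={test_accuracy(r, wd):.3f}")
```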
As a prominent dimension reduction method for multivariate linear regression, the envelope model has received increased attention over the past decade due to its modeling flexibility and success in enhancing estimation and prediction efficiencies. Several enveloping approaches have been proposed in the literature; among these, two are noteworthy: the partial response envelope model [57], which envelopes only the coefficients for the predictors of interest, and the simultaneous envelope model [14], which combines the predictor and response envelope models within a unified modeling framework. In this article we incorporate these two approaches within a Bayesian framework and propose a novel Bayesian simultaneous partial envelope model that generalizes both approaches and addresses some of their limitations. Our method offers the flexibility of incorporating prior information when available and aids coherent quantification of all modeling uncertainty through the posterior distribution of the model parameters. A block Metropolis-within-Gibbs algorithm is developed for Markov chain Monte Carlo (MCMC) sampling from the posterior. The utility of our model is corroborated by theoretical results, comprehensive simulations, and a real imaging genetics data application from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study.
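The sampler's structure, though not its target, can be conveyed generically. The sketch below runs a block Metropolis-within-Gibbs update on a toy bivariate normal posterior; the simultaneous partial envelope posterior sampled in the paper is far more structured than this target.

```python
# Minimal generic sketch of block Metropolis-within-Gibbs: each parameter
# block gets a random-walk Metropolis update conditional on the others.
# Toy bivariate-normal target only, not the envelope-model posterior.
import numpy as np

RHO = 0.8
def log_post(theta):                     # correlated bivariate normal
    a, b = theta
    return -0.5 * (a**2 + b**2 - 2 * RHO * a * b) / (1 - RHO**2)

rng = np.random.default_rng(4)
theta, draws = np.zeros(2), []
for it in range(5000):
    for block in (0, 1):                 # one Metropolis step per block
        prop = theta.copy()
        prop[block] += rng.normal(scale=1.0)
        if np.log(rng.random()) < log_post(prop) - log_post(theta):
            theta = prop
    draws.append(theta.copy())
draws = np.array(draws)
print("posterior means:", draws.mean(axis=0))
print("posterior corr :", np.corrcoef(draws.T)[0, 1])   # approx RHO
```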
Clinical trials usually involve sequential patient entry. When designing a clinical trial, it is often desirable to include a provision for interim analyses of accumulating data with the potential for stopping the trial early. We review Bayesian sequential clinical trial designs based on posterior probabilities, posterior predictive probabilities, and decision-theoretic frameworks. A pertinent question is whether Bayesian sequential designs need to be adjusted for the planning of interim analyses. We answer this question from three perspectives: a frequentist-oriented perspective, a calibrated Bayesian perspective, and a subjective Bayesian perspective. We also provide new insights into the likelihood principle, which is commonly tied to statistical inference and decision making in sequential clinical trials. Some theoretical results are derived, and numerical studies are conducted to illustrate and assess these designs.
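A minimal example of a posterior-probability-based stopping rule, for a single-arm binary endpoint with a Beta prior, is sketched below; the threshold, interim schedule, and true response rate are hypothetical, and the review above discusses whether and how such rules should be calibrated.

```python
# Minimal sketch of a Bayesian sequential stopping rule based on posterior
# probabilities: stop for efficacy once Pr(p > p0 | data) crosses a cutoff.
# Endpoint, prior, cutoff, and true rate are hypothetical choices.
import numpy as np
from scipy.stats import beta

p0, cutoff = 0.3, 0.95                   # null response rate, efficacy cutoff
a, b = 1, 1                              # Beta(1, 1) prior on response rate p
rng = np.random.default_rng(5)

successes = 0
for n in range(1, 61):                   # sequential entry, max 60 patients
    successes += int(rng.random() < 0.45)   # hypothetical true rate 0.45
    post_prob = 1 - beta.cdf(p0, a + successes, b + n - successes)
    if n >= 10 and post_prob > cutoff:   # interim analyses from n = 10 on
        print(f"stop at n={n}: Pr(p > {p0} | data) = {post_prob:.3f}")
        break
else:
    print("reached maximum sample size without stopping for efficacy")
```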
Controlled experiments are widely applied in many areas, such as clinical trials and user behavior studies in IT companies. Recently, experimental design problems that facilitate personalized decision making have become a popular subject of study. In this paper, we investigate the optimal design of multiple-treatment allocation for personalized decision making in the presence of observed covariates associated with the experimental units (often patients or users). We assume that the response of a subject assigned to a treatment follows a linear model that includes the interaction between covariates and treatments, to facilitate precise decision making. We define the optimization objective as the maximum variance of the estimated personalized treatment effects over different treatments and different covariate values, and obtain the optimal design by minimizing this objective. After reformulating the original optimization problem as a semidefinite program, we use a YALMIP- and MOSEK-based solver to compute the optimal design. Numerical studies are provided to assess its quality.
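The minimax-variance objective admits a compact convex formulation. The sketch below solves a toy version with cvxpy in place of the YALMIP/MOSEK pipeline used in the paper, for an assumed linear model with one covariate, one binary treatment, and their interaction; the candidate grid is also an assumption.

```python
# Minimal sketch: minimax-variance treatment-allocation design as a convex
# program, solved with cvxpy as a stand-in for the paper's YALMIP/MOSEK SDP.
# Assumed model: y = b0 + b1*x + b2*z + b3*x*z with treatment z in {0, 1}.
import cvxpy as cp
import numpy as np

xs = np.linspace(-1, 1, 5)
pts = [(x, z) for x in xs for z in (0.0, 1.0)]       # candidate design points
F = np.array([[1.0, x, z, x * z] for x, z in pts])   # model rows f(x, z)

w = cp.Variable(len(pts), nonneg=True)               # allocation weights
M = sum(w[i] * np.outer(F[i], F[i]) for i in range(len(pts)))  # info matrix
# variance proxy of the personalized effect b2 + b3*x at each covariate value
cvecs = [np.array([0.0, 0.0, 1.0, x]) for x in xs]
obj = cp.max(cp.hstack([cp.matrix_frac(c, M) for c in cvecs]))
prob = cp.Problem(cp.Minimize(obj), [cp.sum(w) == 1])
prob.solve()
print(np.round(w.value, 3))                          # optimal allocation
```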