We are pleased to launch the first issue of the New England Journal of Statistics in Data Science (NEJSDS). NEJSDS is the official journal of the New England Statistical Society (NESS), published under the leadership of its Vice President for Journal and Publication and sponsored by the College of Liberal Arts and Sciences, University of Connecticut. The aims of the journal are to serve as an interface between statistics and other disciplines in data science, to encourage researchers to exchange innovative ideas, and to promote data science methods to the general scientific community. The journal publishes high-quality original research, novel applications, and timely review articles in all aspects of data science, including all areas of statistical methodology, machine learning and artificial intelligence methods, novel algorithms, computational methods, data management and manipulation, and applications of data science methods, among others. We encourage authors to submit collaborative work driven by real-life problems posed by researchers, administrators, educators, or other stakeholders, problems that require original and innovative solutions from data scientists.
This article expands upon my presentation to the panel on “The Radical Prescription for Change” at the 2017 ASA (American Statistical Association) symposium on A World Beyond $p<0.05$. It emphasizes that, to greatly enhance the reliability of, and hence public trust in, statistical and data scientific findings, we need to take a holistic approach. We need to lead by example, incentivize study quality, and inoculate future generations with profound appreciations for the world of uncertainty and the uncertainty world. The four “radical” proposals in the title, with all their inherent defects and trade-offs, are designed to provoke reactions and actions. First, research methodologies are trustworthy only if they deliver what they promise, even if this means that they have to be overly protective, a necessary trade-off for practicing quality-guaranteed statistics. This guiding principle may compel us to double the variance in some situations, a strategy that also coincides with the call to raise the bar from $p<0.05$ to $p<0.005$. Second, teaching principled practicality or corner-cutting is a promising strategy to enhance the scientific community’s, as well as the general public’s, ability to spot, and hence to deter, flawed arguments or findings. A remarkable quick-and-dirty Bayes formula for rare events, which simply divides the prevalence by the sum of the prevalence and the false positive rate (or the total error rate), as featured on the popular radio show Car Talk, illustrates the effectiveness of this strategy. Third, it should be a routine mental exercise to put ourselves in the shoes of those who would be affected by our research findings, in order to combat the tendency to rush to conclusions or to overstate confidence in our findings. A pufferfish/selfish test can serve as an effective reminder, and can help to institute the mantra “Thou shalt not sell what thou refuseth to buy” as the most basic professional decency. Considering personal stakes in our statistical endeavors also points to the concept of behavioral statistics, in the spirit of behavioral economics. Fourth, the current mathematical education paradigm that puts “deterministic first, stochastic second” is likely responsible for the general difficulty of reasoning under uncertainty, a situation that can be improved by introducing the concept of the histogram, or rather the kidstogram, as early as the concept of counting.
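To unpack the Car Talk formula, here is a sketch of the underlying arithmetic; the near-perfect-sensitivity assumption below is ours and is not spelled out in the abstract. Writing $p$ for the prevalence, $f$ for the false positive rate, and $s$ for the test’s sensitivity, Bayes’ theorem gives
\[
P(\text{affected}\mid\text{positive}) = \frac{ps}{ps + (1-p)f} \approx \frac{p}{p+f} \quad \text{when } s\approx 1 \text{ and } p\approx 0,
\]
which is exactly the quick-and-dirty division of the prevalence by the sum of the prevalence and the false positive rate. The variance-doubling arithmetic is equally direct: doubling the variance inflates the standard error by $\sqrt{2}$, so the familiar two-sided threshold $z=1.96$ becomes $1.96\sqrt{2}\approx 2.77$, whose two-sided tail probability is roughly $0.006$, essentially the proposed $0.005$ bar.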
This contribution is a series of comments on Prof. Xiao-Li Meng’s article, “Double Your Variance, Dirtify Your Bayes, Devour Your Pufferfish, and Draw Your Kidstogram”. Prof. Meng’s article offers several radical, and some not-so-radical, proposals to improve the quality of statistical inference used in the sciences and to extend distributional thinking to early education. Discussions and alternative proposals are presented.
We highlight points of agreement between Meng’s suggested principles and those proposed in our 2019 editorial in The American Statistician. We also discuss some questions that arise when applying Meng’s principles in practice.
Random forests are a powerful machine learning tool that captures complex relationships between independent variables and an outcome of interest. Trees built in a random forest depend on several hyperparameters, one of the more critical being the node size. The original algorithm of Breiman controls node size by limiting the size of the parent node, so that a node cannot be split if it has fewer than a specified number of observations. We propose that this hyperparameter should instead be defined as the minimum number of observations in each terminal node. The two random forest approaches are compared in the regression context based on estimated generalization error, squared bias, and variance of the resulting predictions on a number of simulated datasets. Additionally, the two approaches are applied to type 2 diabetes data obtained from the National Health and Nutrition Examination Survey. We have also developed a straightforward method for incorporating weights into the random forest analysis of survey data. Our results demonstrate that generalization error under the proposed approach is competitive with that attained from the original random forest approach when data have large random error variability. The R code created from this work is available and includes an illustration.
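As one way to see the distinction between the two node-size definitions, the following minimal Python sketch uses scikit-learn’s regression forest, whose min_samples_split and min_samples_leaf hyperparameters play roughly the roles of the parent-node control and the proposed terminal-node control; this is an illustrative analogue, not the authors’ R code, and the simulated data are hypothetical.

    # Minimal sketch: parent-node vs. terminal-node size control in a
    # regression random forest (scikit-learn analogue, not the authors' R code).
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Simulated regression data with substantial random error variability.
    X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Parent-node control (Breiman-style): a node with fewer than 10
    # observations is never split, but its children may be very small.
    rf_parent = RandomForestRegressor(n_estimators=200, min_samples_split=10,
                                      random_state=0)

    # Terminal-node control (the proposed definition): every leaf must
    # contain at least 5 observations.
    rf_leaf = RandomForestRegressor(n_estimators=200, min_samples_leaf=5,
                                    random_state=0)

    for label, rf in [("parent-node rule", rf_parent),
                      ("terminal-node rule", rf_leaf)]:
        rf.fit(X_tr, y_tr)
        mse = mean_squared_error(y_te, rf.predict(X_te))
        print(f"{label}: test MSE = {mse:.1f}")

Both rules limit how finely a tree can partition the data, but only the terminal-node rule guarantees that every prediction is an average over at least the specified number of training observations.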
There are many settings in which one has continuous flows over networks and an interest in predicting and monitoring such flows. This paper provides Bayesian models for two types of networks: those in which flow can be bidirectional and those in which flow is unidirectional. The former is illustrated by an application to electrical transmission over the power grid, and the latter is examined with data on volumetric water flow in a river system. Both applications yield good predictive accuracy over short time horizons. Predictive accuracy is important in these applications: it improves the efficiency of the energy market and enables flood warnings and water management.