Statistics and data science education increasingly emphasizes computing. This paper presents the Data Jamboree, a live event that combines computational methods with traditional statistical techniques to address real-world data science problems. Participants, ranging from novices to experienced users, followed workshop leaders in using open-source tools such as Julia, Python, and R to perform tasks such as data cleaning, manipulation, and predictive modeling. The Jamboree showcased the educational benefits of working with open data, providing participants with practical, hands-on experience. We compare the tools in terms of efficiency, flexibility, and statistical power, with Julia excelling in performance, Python in versatility, and R in statistical analysis and visualization. The paper concludes with recommendations for designing similar events to encourage collaborative learning and critical thinking in data science.
Observations of groundwater pollutants, such as arsenic or perfluorooctane sulfonate (PFOS), are riddled with left censoring. These pollutants affect the health and lifestyle of the surrounding population. Left censoring of these spatially correlated observations is usually addressed by applying Gaussian processes (GPs), which have theoretical advantages. However, this comes with a challenging computational complexity of $\mathcal{O}({n^{3}})$, impractical for large datasets. Additionally, a sizable proportion of left-censored data creates further bottlenecks, since the likelihood computation then involves an intractable high-dimensional integral of the multivariate Gaussian density. In this article, we tackle these two problems simultaneously by approximating the GP with a Gaussian Markov random field (GMRF), exploiting the explicit link between a GP with a Matérn correlation function and a GMRF given by stochastic partial differential equations (SPDEs). We introduce a GMRF-based measurement error into the model, which alleviates the likelihood computation for the censored data, drastically improving the computational speed while maintaining admirable accuracy. Our approach demonstrates robustness and substantial computational scalability compared to state-of-the-art methods for censored spatial responses across various simulation settings. Finally, the fit of this fully Bayesian model to the concentration of PFOS in groundwater available at 24,959 sites across California, where 46.62% of responses are censored, produces a prediction surface and uncertainty quantification in real time, thereby substantiating the applicability and scalability of the proposed method. Code for implementation is available via GitHub.
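The computational gain from the measurement-error layer can be seen in how the censored likelihood factorizes: conditional on the latent field, observations are independent, so each left-censored response contributes a univariate normal CDF rather than a term inside one high-dimensional multivariate Gaussian integral. The following minimal sketch (not the paper's implementation; the function name, NaN flag for censoring, and iid $N(0,\tau^2)$ error are illustrative assumptions) shows that factorized likelihood:

```python
import numpy as np
from scipy.stats import norm

def censored_loglik(y, w, lod, tau):
    """Log-likelihood of responses y given latent field values w, under an
    illustrative iid N(0, tau^2) measurement error, with left censoring at
    the limit of detection `lod`.

    Censored entries (flagged here as NaN) contribute log Phi((lod - w_i)/tau);
    observed entries contribute the usual Gaussian log-density. Conditional on
    w the terms are independent, so no multivariate Gaussian integral appears.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    cens = np.isnan(y)  # NaN marks "below detection limit"
    ll = norm.logcdf(lod, loc=w[cens], scale=tau).sum()       # censored part
    ll += norm.logpdf(y[~cens], loc=w[~cens], scale=tau).sum()  # observed part
    return ll
```

In a full model the latent w would carry the SPDE-induced GMRF prior, whose sparse precision matrix is what replaces the $\mathcal{O}(n^3)$ dense-GP cost.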
Up-and-Down designs (UDDs) are ubiquitous for dose-finding in a wide variety of scientific, engineering, and clinical fields. They are defined by a few simple rules that generate a random walk around the target percentile. UDDs’ combination of robust, tractable behavior, straightforward usage, and good dose-finding performance has won the trust of practitioners and their consulting analysts across fields and continents. In contrast, in recent decades the statistical dose-finding design field has given UDDs the cold shoulder, and it is quite possible that many younger dose-finding methods researchers are not even aware of this design approach.
We present a concise overview of UDDs and their current state-of-the-art methodology, with references for further inquiry. We also revisit the performance comparison between UDDs and novel, more complicated design approaches such as the Continual Reassessment Method and the Bayesian Optimal Interval design, which we group under the term “Aim-for-Target” designs. UDDs fare very well in the comparison, particularly in terms of robustness to sources of variability.
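The random-walk mechanism behind UDDs can be illustrated with the classical first-order rule (move one dose level down after a positive response, one level up after a negative one), which targets the median dose, $Q_{50}$. The sketch below is a generic illustration of that rule, not any specific design from the overview; the function name, dose grid, and response probabilities are assumptions:

```python
import numpy as np

def classical_udd(prob_tox, start, n, rng=None):
    """Simulate the classical first-order up-and-down rule on a fixed dose
    grid: after a positive (e.g., toxic) response move one level down, after
    a negative response move one level up, clamped to the grid edges.

    The resulting random walk concentrates around the dose whose response
    probability is closest to 0.5, i.e., the median Q_50.

    prob_tox : response probability at each dose level
    start    : index of the first administered dose
    n        : number of sequential subjects
    Returns the array of dose indices visited.
    """
    rng = np.random.default_rng(rng)
    doses = [start]
    for _ in range(n):
        d = doses[-1]
        tox = rng.random() < prob_tox[d]          # binary response at dose d
        d_next = max(d - 1, 0) if tox else min(d + 1, len(prob_tox) - 1)
        doses.append(d_next)
    return np.array(doses)
```

For example, with response probabilities (0.05, 0.2, 0.5, 0.8, 0.95) across five dose levels, the walk spends most of its time near the middle level, where the response probability is 0.5. Other percentile targets require modified rules (e.g., biased-coin or group UDDs), which the overview covers.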