Data Jamboree: A Party of Open-Source Software Solving Real-World Data Science Problems
Pub. online: 13 March 2025
Type: Software
Open Access
Accepted
16 February 2025
Published
13 March 2025
Abstract
The evolving focus in statistics and data science education highlights the growing importance of computing. This paper presents the Data Jamboree, a live event that combines computational methods with traditional statistical techniques to address real-world data science problems. Participants, ranging from novices to experienced users, followed workshop leaders in using open-source tools like Julia, Python, and R to perform tasks such as data cleaning, manipulation, and predictive modeling. The Jamboree showcased the educational benefits of working with open data, providing participants with practical, hands-on experience. We compared the tools in terms of efficiency, flexibility, and statistical power, with Julia excelling in performance, Python in versatility, and R in statistical analysis and visualization. The paper concludes with recommendations for designing similar events to encourage collaborative learning and critical thinking in data science.