
The New England Journal of Statistics in Data Science


Data Jamboree: A Party of Open-Source Software Solving Real-World Data Science Problems
Lucy D’Agostino McGowan, Shannon Tass, Samantha Tyner, et al. (5 authors)

https://doi.org/10.51387/25-NEJSDS79
Pub. online: 13 March 2025      Type: Software Tutorial And/or Review      Open Access
Area: Software

Accepted
16 February 2025
Published
13 March 2025

Abstract

The evolving focus in statistics and data science education highlights the growing importance of computing. This paper presents the Data Jamboree, a live event that combines computational methods with traditional statistical techniques to address real-world data science problems. Participants, ranging from novices to experienced users, followed workshop leaders in using open-source tools like Julia, Python, and R to perform tasks such as data cleaning, manipulation, and predictive modeling. The Jamboree showcased the educational benefits of working with open data, providing participants with practical, hands-on experience. We compared the tools in terms of efficiency, flexibility, and statistical power, with Julia excelling in performance, Python in versatility, and R in statistical analysis and visualization. The paper concludes with recommendations for designing similar events to encourage collaborative learning and critical thinking in data science.



Copyright
© 2025 New England Statistical Society
Open access article under the CC BY license.

Keywords
Data science education, Julia, Python, R, Statistical computing



  • ISSN: 2693-7166