Effects of stopping criterion on the growth of trees in regression random forests
Volume 1, Issue 1 (2023), pp. 46–61
Pub. online: 31 August 2022
Type: Methodology Article
Open Access
Area: Biomedical Research
Accepted
18 July 2022
Published
31 August 2022
Abstract
Random forests are a powerful machine learning tool that captures complex relationships between independent variables and an outcome of interest. Trees built in a random forest depend on several hyperparameters, one of the more critical being the node size. The original algorithm of Breiman controls node size by limiting the size of the parent node, so that a node cannot be split if it contains fewer than a specified number of observations. We propose that this hyperparameter should instead be defined as the minimum number of observations in each terminal node. The two random forest approaches are compared in the regression context based on estimated generalization error, squared bias, and variance of the resulting predictions across a number of simulated datasets. Additionally, the two approaches are applied to type 2 diabetes data obtained from the National Health and Nutrition Examination Survey. We have also developed a straightforward method for incorporating survey weights into the random forest analysis of survey data. Our results demonstrate that generalization error under the proposed approach is competitive with that attained from the original random forest approach when the data have large random error variability. The R code created for this work is available and includes an illustration.
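The two stopping criteria contrasted in the abstract can be illustrated with off-the-shelf software: scikit-learn's `RandomForestRegressor` exposes both a parent-node rule (`min_samples_split`, in the spirit of Breiman's original criterion) and a terminal-node rule (`min_samples_leaf`, in the spirit of the proposed criterion). The sketch below is not the authors' R code; the data-generating model, sample size, and hyperparameter values are illustrative assumptions only.

```python
# Illustrative comparison of the two node-size stopping criteria
# using scikit-learn (not the authors' implementation).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Simulated regression data (arbitrary illustrative model).
rng = np.random.default_rng(0)
n = 500
X = rng.uniform(size=(n, 5))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Parent-node criterion: a node is split only if it holds at least
# `min_samples_split` observations (Breiman-style control).
rf_parent = RandomForestRegressor(
    n_estimators=200, min_samples_split=10, random_state=0)

# Terminal-node criterion: every leaf must hold at least
# `min_samples_leaf` observations (the proposed-style control).
rf_leaf = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=5, random_state=0)

for name, rf in [("parent-node rule", rf_parent),
                 ("terminal-node rule", rf_leaf)]:
    rf.fit(X_tr, y_tr)
    mse = np.mean((rf.predict(X_te) - y_te) ** 2)
    print(f"{name}: test MSE = {mse:.3f}")
```

Estimated generalization error (here, test-set MSE) is the quantity the paper decomposes into squared bias and variance when comparing the two rules; survey weights could similarly be passed through `fit`'s `sample_weight` argument, though the paper's weighting method is its own.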
Supplementary material
The supplementary material provides code in R software for implementing the algorithms developed in this work. The National Health and Nutrition Examination Survey data utilized in the paper are also provided.
References
Biau, G. and Devroye, L. (2010). On the layered nearest neighbour estimate, the bagged nearest neighbour estimate and the random forest method in regression and classification. Journal of Multivariate Analysis 101 2499–2518. https://doi.org/10.1016/j.jmva.2010.06.019. MR2719877
Biau, G. and Scornet, E. (2016). A random forest guided tour. TEST 25 197–227. https://doi.org/10.1007/s11749-016-0481-7. MR3493512
Breiman, L. (2001). Random Forests. Machine Learning 45 5–32. MR3874153
Hastie, T., Tibshirani, R. and Friedman, J. H. (2017). The elements of statistical learning: Data mining, inference, and prediction. Springer, New York. https://doi.org/10.1007/978-0-387-84858-7. MR2722294
Ishwaran, H. and Kogalur, U. B. (2021). Package ‘randomForestSRC’. https://cran.r-project.org/web/packages/randomForestSRC/randomForestSRC.pdf.
Liaw, A. and Wiener, M. (2018). Package ‘randomForest’. https://cran.r-project.org/web/packages/randomForest/randomForest.pdf.
Nicolo, M. L., Shewokis, P. A., Boullata, J., Sukumar, D., Smith, S., Compher, C. and Volpe, S. L. (2019). Sedentary behavior time as a predictor of hemoglobin A1c among adults, 40 to 59 years of age, living in the United States: National Health and Nutrition Examination Survey 2003 to 2004 and 2013 to 2014. Nutrition and Health 25 275–279.
Scornet, E. (2018). Tuning parameters in random forests. ESAIM: Proceedings and Surveys 60 144–162. https://doi.org/10.1051/proc/201760144. MR3772478
Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, [2017–2018]. http://wwwn.cdc.gov/Nchs/Nhanes/continuousnhanes/default.aspx?BeginYear=2017.
Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, [2015–2016]. http://wwwn.cdc.gov/Nchs/Nhanes/continuousnhanes/default.aspx?BeginYear=2015.