Effects of stopping criterion on the growth of trees in regression random forests
Volume 1, Issue 1 (2023), pp. 46–61
Pub. online: 31 August 2022
Type: Methodology Article
Open Access
Area: Biomedical Research
Accepted
18 July 2022
Published
31 August 2022
Abstract
Random forests are a powerful machine learning tool that captures complex relationships between independent variables and an outcome of interest. Trees built in a random forest depend on several hyperparameters, one of the more critical being the node size. The original algorithm of Breiman controls node size by limiting the size of the parent node, so that a node cannot be split if it contains fewer than a specified number of observations. We propose that this hyperparameter should instead be defined as the minimum number of observations in each terminal node. The two random forest approaches are compared in the regression context based on estimated generalization error, squared bias, and variance of the resulting predictions across a number of simulated datasets. Additionally, the two approaches are applied to type 2 diabetes data obtained from the National Health and Nutrition Examination Survey. We have also developed a straightforward method for incorporating survey weights into the random forest analysis of survey data. Our results demonstrate that generalization error under the proposed approach is competitive with that attained from the original random forest approach when the data have large random error variability. The R code created for this work is available and includes an illustration.
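The two stopping criteria contrasted in the abstract can be illustrated with off-the-shelf software: scikit-learn's `RandomForestRegressor` exposes both a parent-node rule (`min_samples_split`, in the spirit of Breiman's original criterion) and a terminal-node rule (`min_samples_leaf`, in the spirit of the proposed criterion). The sketch below is not the authors' R code; the data-generating model, sample size, and hyperparameter values are illustrative assumptions only.

```python
# Illustrative comparison of the two node-size stopping criteria
# using scikit-learn (not the authors' implementation).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Simulated regression data (arbitrary illustrative model).
rng = np.random.default_rng(0)
n = 500
X = rng.uniform(size=(n, 5))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Parent-node criterion: a node is split only if it holds at least
# `min_samples_split` observations (Breiman-style control).
rf_parent = RandomForestRegressor(
    n_estimators=200, min_samples_split=10, random_state=0)

# Terminal-node criterion: every leaf must hold at least
# `min_samples_leaf` observations (the proposed-style control).
rf_leaf = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=5, random_state=0)

for name, rf in [("parent-node rule", rf_parent),
                 ("terminal-node rule", rf_leaf)]:
    rf.fit(X_tr, y_tr)
    mse = np.mean((rf.predict(X_te) - y_te) ** 2)
    print(f"{name}: test MSE = {mse:.3f}")
```

Estimated generalization error (here, test-set MSE) is the quantity the paper decomposes into squared bias and variance when comparing the two rules; survey weights could similarly be passed through `fit`'s `sample_weight` argument, though the paper's weighting method is its own.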
Supplementary material
The supplementary material provides code in R software for implementing the algorithms developed in this work. The National Health and Nutrition Examination Survey data utilized in the paper are also provided.
References
Biau, G. and Devroye, L. (2010). On the layered nearest neighbour estimate, the bagged nearest neighbour estimate and the random forest method in regression and classification. Journal of Multivariate Analysis 101 2499–2518. https://doi.org/10.1016/j.jmva.2010.06.019. MR2719877
Biau, G. and Scornet, E. (2016). A random forest guided tour. TEST 25 197–227. https://doi.org/10.1007/s11749-016-0481-7. MR3493512
Breiman, L. (2001). Random Forests. Machine Learning 45 5–32. MR3874153
Hastie, T., Tibshirani, R. and Friedman, J. H. (2017). The elements of statistical learning: Data mining, inference, and prediction. Springer, New York. https://doi.org/10.1007/978-0-387-84858-7. MR2722294
Ishwaran, H. and Kogalur, U. B. (2021). Package ‘randomForestSRC’. https://cran.r-project.org/web/packages/randomForestSRC/randomForestSRC.pdf.
Liaw, A. and Wiener, M. (2018). Package ‘randomForest’. https://cran.r-project.org/web/packages/randomForest/randomForest.pdf.
Nicolo, M. L., Shewokis, P. A., Boullata, J., Sukumar, D., Smith, S., Compher, C. and Volpe, S. L. (2019). Sedentary behavior time as a predictor of hemoglobin A1c among adults, 40 to 59 years of age, living in the United States: National Health and Nutrition Examination Survey 2003 to 2004 and 2013 to 2014. Nutrition and Health 25 275–279.
Scornet, E. (2018). Tuning parameters in random forests. ESAIM: Proceedings and Surveys 60 144–162. https://doi.org/10.1051/proc/201760144. MR3772478
Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, [2017–2018]. http://wwwn.cdc.gov/Nchs/Nhanes/continuousnhanes/default.aspx?BeginYear=2017.
Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, [2015–2016]. http://wwwn.cdc.gov/Nchs/Nhanes/continuousnhanes/default.aspx?BeginYear=2015.