Knowledge Distillation Decision Tree for Unravelling Black-Box Machine Learning Models
Pub. online: 7 May 2025
Type: Methodology Article
Open Access
Area: Biomedical Research
Accepted: 18 February 2025
Published: 7 May 2025
Abstract
Machine learning models, particularly black-box models, are widely favored for their outstanding predictive capabilities. However, they often face scrutiny and criticism because they lack interpretability. Paradoxically, their strong predictive performance may indicate a deep understanding of the underlying data, implying significant potential for interpretation. Leveraging the emerging concept of knowledge distillation, we introduce the knowledge distillation decision tree (KDDT). This method distills knowledge about the data from a black-box model into a decision tree, thereby facilitating the interpretation of the black-box model. Essential attributes of a good interpretable model include simplicity, stability, and predictivity. The primary challenge in constructing an interpretable tree lies in ensuring structural stability under the randomness of the training data. KDDT is developed with theoretical foundations demonstrating that structural stability can be achieved under mild assumptions. Furthermore, we propose the hybrid KDDT to achieve both simplicity and predictivity, and we provide an efficient algorithm for constructing it. Simulation studies and a real-data analysis validate the hybrid KDDT's capability to deliver accurate and reliable interpretations. KDDT is an excellent interpretable model with great potential for practical applications.
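To make the distillation idea concrete, the sketch below fits a surrogate decision tree to the predictions of a black-box model. It is a minimal, generic illustration in Python with scikit-learn, assuming a random-forest teacher, a CART student, and a synthetic regression dataset; it is not the KDDT or hybrid KDDT algorithm described in the article.

# Minimal sketch of distilling a black-box model into a surrogate decision tree.
# This illustrates the general distillation idea only; the teacher/student choices
# and the synthetic dataset are placeholder assumptions, not the paper's KDDT.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor, export_text
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Train the black-box "teacher" on the original labels.
teacher = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)

# 2. Query the teacher for its predictions (the "knowledge" to be distilled).
y_teacher = teacher.predict(X_train)

# 3. Fit a shallow decision tree "student" on the teacher's predictions,
#    yielding an interpretable surrogate of the black-box model.
student = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_teacher)

# Fidelity: how closely the tree mimics the teacher on held-out data.
print("Surrogate fidelity (R^2 vs. teacher):", student.score(X_test, teacher.predict(X_test)))
print("Surrogate accuracy (R^2 vs. truth):  ", student.score(X_test, y_test))

# The distilled tree can be inspected directly as a set of if-then rules.
print(export_text(student, feature_names=[f"x{i}" for i in range(X.shape[1])]))

In this generic form, the student's splits summarize the teacher's fitted response surface; the article's contribution lies in constructing such a tree so that its structure is stable under the randomness of the training data and, for the hybrid variant, in balancing simplicity with predictivity.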