Clinical trial design for rare diseases can be challenging due to limited data, heterogeneous clinical manifestations and progression, and a frequent lack of adequate knowledge about the disease. Multiple endpoints are therefore usually used to collectively assess the effectiveness of an investigational drug across several aspects of the disease. Here we propose an adaptive design based on the promising zone framework, allowing for sample size re-estimation (SSR) using interim data for a clinical trial involving multiple endpoints. The proposed SSR procedure incorporates two global tests: the ordinary least squares (OLS) test and the nonparametric permutation test. We consider two SSR approaches: one based on power (SSR-Power) and the other on conditional power (SSR-CP). Simulation results show that the adaptive design achieves type I error control and satisfactory power. Compared with the permutation test, the OLS test has improved type I error control when the sample size is small and the interim analysis is early, while the permutation test achieves slightly higher power in most scenarios. Regarding the SSR methods, SSR-CP consistently achieves higher power than SSR-Power but often requires a larger sample size and more frequently reaches the maximum allowable sample size. The proposed design is particularly useful when a trial has a small initial sample size and has the opportunity to adjust the sample size at an interim analysis to achieve adequate power.
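The main ingredients of this design can be sketched in code. The sketch below is illustrative only, not the paper's exact procedure: it uses an O'Brien-style OLS global statistic (standardized endpoints summed per subject), a label-permutation version of the global test, and the standard current-trend (B-value) conditional power formula; the effect sizes, interim timing, and search grid are all assumed for the example.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

def ols_global_stat(x, y):
    """O'Brien-style OLS global statistic: standardize each endpoint on the
    pooled sample, sum standardized scores per subject, then form a
    two-sample z-type statistic on the summed scores (treatment - control)."""
    pooled = np.vstack([x, y])
    z = (pooled - pooled.mean(axis=0)) / pooled.std(axis=0, ddof=1)
    sx, sy = z[: len(x)].sum(axis=1), z[len(x):].sum(axis=1)
    se = np.sqrt(sx.var(ddof=1) / len(sx) + sy.var(ddof=1) / len(sy))
    return (sy.mean() - sx.mean()) / se

def permutation_pvalue(x, y, n_perm=1000, rng=rng):
    """Nonparametric global test: permute group labels and recompute."""
    obs = ols_global_stat(x, y)
    pooled, n = np.vstack([x, y]), len(x)
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        hits += ols_global_stat(pooled[idx[:n]], pooled[idx[n:]]) >= obs
    return (hits + 1) / (n_perm + 1)

def conditional_power(z1, t, alpha=0.025):
    """Conditional power given interim z-statistic z1 at information
    fraction t, assuming the current trend continues (B-value form)."""
    nd = NormalDist()
    b = z1 * np.sqrt(t)        # B-value at the interim look
    drift = z1 / np.sqrt(t)    # current-trend drift estimate
    return 1 - nd.cdf((nd.inv_cdf(1 - alpha) - b - drift * (1 - t)) / np.sqrt(1 - t))

def ssr_cp(z1, n1, n_planned, n_max, target=0.8):
    """SSR-CP: smallest total sample size reaching the target conditional
    power, capped at the maximum allowable sample size."""
    for n_total in range(n_planned, n_max + 1):
        if conditional_power(z1, n1 / n_total) >= target:
            return n_total
    return n_max

# Illustrative interim look: 3 endpoints, 20 subjects per arm, modest effect
control = rng.normal(0.0, 1.0, size=(20, 3))
treatment = rng.normal(0.4, 1.0, size=(20, 3))
p_interim = permutation_pvalue(control, treatment)
```

With a positive interim trend, `conditional_power` increases in the total sample size, so `ssr_cp` scans upward from the planned size until the target is met or the cap is reached.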
Machine learning models, particularly black-box models, are widely favored for their outstanding predictive capabilities. However, they often face scrutiny and criticism due to their lack of interpretability. Paradoxically, their strong predictive capabilities may indicate a deep understanding of the underlying data, implying significant potential for interpretation. Leveraging the emerging concept of knowledge distillation, we introduce the knowledge distillation decision tree (KDDT). This method distills knowledge about the data from a black-box model into a decision tree, thereby facilitating interpretation of the black-box model. Essential attributes of a good interpretable model include simplicity, stability, and predictivity. The primary challenge in constructing an interpretable tree lies in ensuring structural stability under the randomness of the training data. KDDT is developed with theoretical foundations demonstrating that structural stability can be achieved under mild assumptions. Furthermore, we propose the hybrid KDDT to achieve both simplicity and predictivity, and we provide an efficient algorithm for constructing it. Simulation studies and a real-data analysis validate the hybrid KDDT’s capability to deliver accurate and reliable interpretations. KDDT is an excellent interpretable model with great potential for practical applications.
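The core distillation idea can be illustrated with a minimal scikit-learn sketch, assuming a random forest as the black-box teacher and a shallow CART tree as the student; this is a generic teacher-student setup for intuition, not the KDDT algorithm itself, and the synthetic data and depth limit are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic data: signal is linear in x0 plus a step in x1, with noise
X = rng.uniform(-1, 1, size=(500, 2))
y = X[:, 0] + (X[:, 1] > 0) + rng.normal(0, 0.1, size=500)

# Teacher: a black-box model fit to the noisy labels
teacher = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Distillation: query the teacher on a large pseudo-sample and fit a shallow
# tree to the teacher's predictions rather than to the raw noisy labels
X_pseudo = rng.uniform(-1, 1, size=(5000, 2))
student = DecisionTreeRegressor(max_depth=3, random_state=0).fit(
    X_pseudo, teacher.predict(X_pseudo)
)

# Fidelity: how closely the tree reproduces the black box on fresh inputs
X_new = rng.uniform(-1, 1, size=(1000, 2))
fidelity = student.score(X_new, teacher.predict(X_new))  # R^2 vs teacher
```

Because the student is trained on an arbitrarily large pseudo-sample drawn from the teacher rather than on the finite noisy data, its split structure is far less sensitive to training-data randomness, which is the stability concern the abstract highlights.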
We consider the problem of developing flexible and parsimonious biomarker combinations for cancer early detection in the presence of variables missing at random. Motivated by the need to develop biomarker panels in a cross-institute pancreatic cyst biomarker validation study, we propose logic-regression-based methods for feature selection and construction of logic rules under a multiple imputation framework. We generate ensemble trees for classification decisions, and further select a single decision tree for simplicity and interpretability. We demonstrate the superior performance of the proposed methods compared to alternatives based on complete-case data or single imputation. The methods are applied to the pancreatic cyst data to estimate biomarker panels for pancreatic cyst subtype classification and malignant potential prediction.
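The multiple-imputation-plus-ensemble workflow can be sketched as follows. Logic regression itself is implemented in the R package LogicReg, so this Python sketch substitutes shallow CART trees as rule-like classifiers and uses scikit-learn's `IterativeImputer` with `sample_posterior=True` for stochastic imputations; the synthetic biomarkers, the OR-type rule, and the number of imputations are all illustrative assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)

# Synthetic biomarkers with an OR-type logic rule:
# positive if marker 0 is high OR marker 1 is high
X = rng.normal(size=(400, 4))
y = ((X[:, 0] > 0.5) | (X[:, 1] > 0.5)).astype(int)

# Impose missingness at random on marker 2
X_miss = X.copy()
X_miss[rng.random(400) < 0.3, 2] = np.nan

# Multiple imputation: M stochastic imputed datasets, one shallow
# rule-like tree fit to each imputed copy
M = 5
trees = []
for m in range(M):
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    X_m = imputer.fit_transform(X_miss)
    trees.append(DecisionTreeClassifier(max_depth=2, random_state=m).fit(X_m, y))

def predict(x_new):
    """Ensemble decision: majority vote across the imputed-data trees."""
    votes = np.mean([t.predict(x_new) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)
```

A single member of `trees` can then be inspected for an interpretable rule, mirroring the abstract's step of selecting one decision tree from the ensemble for simplicity.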
The performance of a learning technique relies heavily on its hyperparameter settings. This calls for hyperparameter tuning, which may be too computationally expensive for sophisticated techniques such as deep learning. It is therefore desirable to expeditiously explore the relationship between the hyperparameters and the performance of the learning technique they control, which in turn entails design strategies for collecting informative data efficiently. Various designs can be considered for this purpose, and the question of which design to use naturally arises. In this paper, we examine the use of different types of designs in efficiently collecting informative data to study the surface of test accuracy, a measure of the performance of a learning technique, over the hyperparameters. Under the settings we considered, we find that the strong orthogonal array outperforms all other comparable designs.
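The design-based exploration can be sketched in a few lines. Strong orthogonal arrays are not available in common Python libraries, so this sketch contrasts a plain random design with a hand-rolled Latin hypercube as a simpler space-filling stand-in; the run size, the two hyperparameters, and the synthetic accuracy surface are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 16, 2  # 16 runs over 2 hyperparameters (log-learning-rate, log-weight-decay)

def latin_hypercube(n, d, rng):
    """Space-filling design: exactly one point per 1/n slice in each dimension."""
    strata = np.stack([rng.permutation(n) for _ in range(d)], axis=1)
    return (strata + rng.random((n, d))) / n

random_design = rng.random((n, d))        # i.i.d. uniform design, for contrast
lhs_design = latin_hypercube(n, d, rng)   # stratified space-filling design

def accuracy_surface(u):
    """Cheap synthetic stand-in for test accuracy as a function of the
    hyperparameters, mapped from the unit cube to their natural scales."""
    log_lr = -4 + 3 * u[:, 0]   # learning rate in [1e-4, 1e-1]
    log_wd = -6 + 4 * u[:, 1]   # weight decay in [1e-6, 1e-2]
    return 0.9 - (log_lr + 2.5) ** 2 / 20 - (log_wd + 4.0) ** 2 / 40

# Responses at the design points, to which a surrogate model would be fit
obs = accuracy_surface(lhs_design)
```

In practice each design point would trigger an actual training run rather than a call to a synthetic surface; the comparison between designs then rests on how well a surrogate fitted to `obs` recovers the true accuracy surface.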