Clustering-Based Imputation for Dropout Buyers in Large-Scale Online Experimentation
Volume 1, Issue 3 (2023), pp. 415–425
Pub. online: 24 May 2023
Type: Statistical Methodology
Open Access
Accepted
24 February 2023
Published
24 May 2023
Abstract
In online experimentation, appropriate metrics (e.g., purchase) provide strong evidence to support hypotheses and enhance the decision-making process. However, incomplete metrics frequently occur in online experimentation, leaving much less usable data than the online experiments (e.g., A/B tests) were planned to collect. In this work, we introduce the concept of dropout buyers and categorize users with incomplete metric values into two groups: visitors and dropout buyers. For the analysis of incomplete metrics, we propose a clustering-based imputation method using k-nearest neighbors. The proposed method considers both experiment-specific features and users' activities along their shopping paths, so that different users can receive different imputed values. To make imputation efficient on the large-scale data sets of online experimentation, the method combines stratification with clustering. Its performance is compared with that of several conventional methods in both simulation studies and a real online experiment at eBay.
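The combination of stratification, clustering, and k-nearest-neighbor imputation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `kmeans` and `impute`, the `stratum`/`features`/`metric` fields, the deterministic k-means initialization, and the choice to impute the mean of the k nearest observed neighbors are all assumptions made for illustration. The idea shown is only that neighbor search is restricted to users in the same stratum and cluster, which keeps each search small on large data sets.

```python
import math
from collections import defaultdict

def kmeans(points, n_clusters, n_iter=20):
    """Plain Lloyd's algorithm; centers start at the first n_clusters points (assumption)."""
    centers = [list(p) for p in points[:n_clusters]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def impute(users, n_clusters=2, k=2):
    """Impute missing metrics (None) from the k nearest observed users
    within the same stratum and cluster. Field names are hypothetical."""
    result = [u["metric"] for u in users]
    by_stratum = defaultdict(list)
    for i, u in enumerate(users):
        by_stratum[u["stratum"]].append(i)
    for idx in by_stratum.values():
        pts = [users[i]["features"] for i in idx]
        labels = kmeans(pts, min(n_clusters, len(pts)))
        for c in set(labels):
            members = [i for i, lab in zip(idx, labels) if lab == c]
            donors = [i for i in members if users[i]["metric"] is not None]
            if not donors:
                continue  # no observed user in this cluster: leave the value missing
            for i in members:
                if users[i]["metric"] is None:
                    nearest = sorted(
                        donors,
                        key=lambda j: math.dist(users[i]["features"], users[j]["features"]),
                    )[:k]
                    result[i] = sum(users[j]["metric"] for j in nearest) / len(nearest)
    return result

# Toy data: one stratum, two well-separated groups of users.
users = [
    {"stratum": "A", "features": (0.0, 0.0), "metric": 10.0},
    {"stratum": "A", "features": (10.0, 10.0), "metric": 100.0},
    {"stratum": "A", "features": (0.0, 1.0), "metric": 12.0},
    {"stratum": "A", "features": (0.1, 0.5), "metric": None},
    {"stratum": "A", "features": (10.0, 11.0), "metric": None},
]
imputed = impute(users)
```

In this toy run the fourth user borrows from the two nearby observed users and the fifth from the single observed user in its own cluster, so users with different features receive different imputed values.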