Clustering-Based Imputation for Dropout Buyers in Large-Scale Online Experimentation

Shen, Sumin; Mao, Huiying; Zhang, Zezhong; Chen, Zili; Nie, Keyu; Deng, Xinwei

doi:10.51387/23-NEJSDS33

The New England Journal of Statistics in Data Science

Volume 1, Issue 3 (2023), pp. 415–425

Pub. online: 24 May 2023

Type: Methodology Article

Open Access

Area: Statistical Methodology

Accepted: 24 February 2023

Published: 24 May 2023

Abstract

In online experimentation, appropriate metrics (e.g., purchase) provide strong evidence to support hypotheses and enhance the decision-making process. However, incomplete metrics frequently occur in online experimentation, leaving far less usable data than the planned online experiments (e.g., A/B tests) were designed to collect. In this work, we introduce the concept of dropout buyers and categorize users with incomplete metric values into two groups: visitors and dropout buyers. For the analysis of incomplete metrics, we propose a clustering-based imputation method using k-nearest neighbors. The proposed method considers both experiment-specific features and users' activities along their shopping paths, so that different users can receive different imputed values. To make imputation efficient for large-scale data sets in online experimentation, the method combines stratification with clustering. Its performance is compared with that of several conventional methods in both simulation studies and a real online experiment at eBay.
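To make the idea of combining stratification, clustering, and k-nearest neighbors more concrete, the following is a minimal sketch in Python. It is not the authors' algorithm: the function names (`kmeans`, `impute_by_cluster_knn`), the use of plain Lloyd's k-means, the Euclidean distance, the unweighted neighbor average, and the fallback to the whole stratum are all assumptions made for illustration. The sketch also does not distinguish visitors from dropout buyers; it simply fills every missing metric value.

```python
import numpy as np


def kmeans(X, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's k-means (illustrative); returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    # Start from n_clusters distinct rows chosen at random.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):  # keep the old center if a cluster empties
                centers[c] = X[labels == c].mean(axis=0)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def impute_by_cluster_knn(X, y, strata, n_clusters=2, k=3):
    """Fill NaN entries of y with the mean metric of the k nearest observed
    users, searching only within the same stratum and cluster.

    X      : (n, d) user features (hypothetical: experiment and activity features)
    y      : (n,) metric values, NaN where the metric is incomplete
    strata : (n,) stratum label per user (hypothetical grouping variable)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    strata = np.asarray(strata)
    y_filled = y.copy()
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        n_c = min(n_clusters, len(idx))
        labels = kmeans(X[idx], n_c)
        for c in range(n_c):
            members = idx[labels == c]
            missing = members[np.isnan(y[members])]
            if len(missing) == 0:
                continue
            observed = members[~np.isnan(y[members])]
            if len(observed) == 0:
                # Assumed fallback: borrow from the whole stratum instead.
                observed = idx[~np.isnan(y[idx])]
            if len(observed) == 0:
                continue  # nothing to borrow from; leave these values as NaN
            d = np.linalg.norm(X[missing][:, None, :] - X[observed][None, :, :], axis=2)
            nearest = np.argsort(d, axis=1)[:, : min(k, len(observed))]
            y_filled[missing] = y[observed][nearest].mean(axis=1)
    return y_filled
```

Restricting the neighbor search to one stratum and one cluster at a time is what keeps the distance computation small: each missing user is compared only with the observed users in its own group rather than with the full data set.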


Open access article under the CC BY license.

Keywords

Experimentation, Metrics, Imputation, Clustering, A/B testing
