1 Introduction
Online experimentation plays a key role in data-driven decision making in the IT industry, including Microsoft [16, 17], Google [29], LinkedIn [33], Netflix [32], Uber, eBay [23], and many others [9]. Generally, an online controlled experiment, also known as an A/B test, is conducted for a pre-determined amount of time to compare the difference in metrics between the treatment group and the control group, to which users are randomly assigned. Prior to experimentation, a set of high-quality metrics is determined to assess the effects of new features in the treatment group. The collected metric results can provide strong evidence to support hypotheses and hence accelerate the decision-making process [2, 4, 19]. However, incomplete metrics occur frequently in online experimentation, making the available data much smaller than planned for the A/B test. In this work, our focus is on the analysis of metrics that have incomplete measurements at the end of data collection in experiments.
According to their positions in the shopping funnel, metrics can be categorized as top-, middle-, and bottom-funnel metrics. For instance, a successful purchase typically requires users to take multiple steps from the homepage at the top of the funnel to the purchase page at the bottom. In online experimentation, it is common for millions of users to arrive at the top of the funnel (e.g., the homepage), while only a small percentage of users reach the bottom (e.g., the purchase page). During the transition from the top to the bottom of the funnel, users navigate through multiple pages from which they can exit the shopping process. There are numerous scenarios in which users can exit the funnel, resulting in incomplete records of their purchases or other metrics. A common cause is simply that each experiment has its own duration. Keeping experiments alive for a long period of time is expensive due to high operational effort and business opportunity costs. When we close an experiment, we stop tracking all users, but some users might not yet have completed their purchases. This incompleteness in metrics, due to the delay in collecting measurements for bottom-funnel metrics, is inevitable. There is also the possibility that users are lost to follow-up due to technical issues or user unavailability. For instance, when users switch from the desktop app to the mobile app, they become unavailable. It is essential to fill in the incomplete metrics to improve metric quality, leading to trustworthy results and better decisions.
With incomplete metric measurements, the inference of the difference in metrics between the treatment and the control in experiments is at risk of being inaccurate [8, 13, 14]. To analyze experiments with missing metric values, a naive approach is to disregard users with incomplete outcomes. This approach assumes that the missingness is completely at random and that the fully observed users are representative of the entire population. Such an approach reduces the total number of users in the study, leading to a decrease in the experiment's power. The power decrease is substantial especially when the proportion of missingness is high.
Various imputation methods have been developed to address problems with missing data. One widely used method is single imputation, which fills in missing values with a single value, such as the mean of observed outcomes, for both the treatment group and the control group. Single imputation preserves the full sample size, but it raises concerns about distorted distributions and underestimated uncertainty [28]. In addition, it disregards information from other observed variables collected along users' journeys within the funnel. Other imputation methods have been developed for the missing at random (MAR) and missing not at random (MNAR) scenarios. MAR assumes that the missing mechanism is associated only with the observed variables [1, 12, 26]. Likelihood-based methods, such as generalized linear mixed models, have been developed for clinical trials with incomplete outcomes [22]. The performance of these methods depends on the degree to which the MAR assumptions hold. Under MNAR, in which the effect of missing outcomes is non-ignorable, the observed difference would be a biased estimate of the average treatment effect [22]. Regression-based imputation methods, such as logistic regression, are employed for modeling the missingness indicator [21]. Other prevalent methods, such as matching imputations, identify similar users from a set of variables. In general, these imputation methods require distinguishing users with missing outcomes from users whose outcomes are zero. In other words, general imputation methods are often not appropriate for online experimentation scenarios in which users' missing outcomes can represent both missing cases and zero cases.
To address the above challenges, we propose a clustering-based imputation method using k-nearest neighbors (kNN) for the analysis of online controlled experimentation in the presence of incomplete metrics. The key idea of the proposed method is to identify and impute incomplete metrics with users’ neighbors by incorporating the structure information of data from online experimentation. Specifically, the proposed method consists of two steps. The first step is to partition the data set into clusters after the stratification of experiment-specific features, including the treatment assignment and the buyers’ characteristics. In the second step, we perform the kNN-based imputation. Moreover, we divide users with missing outcomes into two categories: visitors and dropout buyers, such that the information of dropout buyers can be better utilized. Note that our framework assumes that the treatment assignment and user covariates are fully observed, whereas only the outcome at the bottom of the funnel has missing values. The proposed method has three key advantages. First, the proposed method uses the informative covariates during users’ journeys in the shopping funnel to impute incomplete metrics. Specifically, our method evaluates the heterogeneous impact from different user segments on missing rates in metrics. Second, the imputed values from our method are intuitive to understand. Lastly, our method employs stratification and clustering to alleviate the computation issues for large-scale data in online experimentation.
Throughout the paper, we consider the metric Purchase as an example of the incomplete metric at the funnel’s bottom for illustration. We also assume that the Purchase is the only metric (i.e., outcome) of interest in the experiment. The rest of the paper is organized as follows. In Sections 2 and 3, we detail the problem formulation, the proposed method, and the estimation procedures. In Section 4, we present simulations. A real case study is conducted in Section 5. We conclude this work with some discussion in Section 6.
2 Problem Formulation
In the context of online controlled experiments, we can classify users into three types based on their purchase behaviors: visitors, real buyers, and dropout buyers. Visitors participate in experiments but do not make contributions (e.g., purchases). Real buyers not only participate in experiments but also make contributions (e.g., purchases). Dropout buyers could have made their contributions (e.g., completed their transactions) within the experimentation period but failed to do so for various reasons. For example, users could drop out of the experiment because of unexpected external payment issues. Another example is that the experiment lost users due to various technical issues.
Suppose there are n users in an experiment, and let ${y_{i}}\in \{0,1\}$ denote whether the i-th user is a buyer or not. That is,
\[ {y_{i}}=\left\{\begin{array}{l@{\hskip10.0pt}l}1,\hspace{1em}& \text{user}\hspace{2.5pt}i\hspace{2.5pt}\text{is either a real buyer or a dropout buyer},\\ {} 0,\hspace{1em}& \text{user}\hspace{2.5pt}i\hspace{2.5pt}\text{is a visitor},\end{array}\right.\]
and ${y_{i}}=\mathbb{1}({z_{i}}\gt 0)$, where ${z_{i}}\ge 0$ denotes the purchase metric value of the i-th user, and $\mathbb{1}(\cdot )$ is the indicator function. If user i has completed transaction(s) during the experimentation period, we know for sure that he/she is a real buyer and we observe the corresponding purchase amount. Otherwise, it is ambiguous whether he/she is a dropout buyer or merely a visitor. Therefore, we use ${y_{i}^{obs}}$ and ${z_{i}^{obs}}$ if the i-th user is a real buyer and ${y_{i}^{mis}}$ and ${z_{i}^{mis}}$ to represent the ambiguous situation (i.e., he/she could be a dropout buyer or a visitor). To clarify,
\[ {y_{i}}=\left\{\begin{array}{l}{y_{i}^{obs}}=1,\text{user}\hspace{2.5pt}i\hspace{2.5pt}\text{is a real buyer},\hspace{1em}\\ {} {y_{i}^{mis}}=\left\{\begin{array}{l}1,\text{user}\hspace{2.5pt}i\hspace{2.5pt}\text{is a dropout buyer},\hspace{1em}\\ {} 0,\text{user}\hspace{2.5pt}i\hspace{2.5pt}\text{is a visitor}.\hspace{1em}\end{array}\right.\hspace{1em}\end{array}\right.\]
However, some practitioners arbitrarily treat all ${y_{i}^{mis}}$ and ${z_{i}^{mis}}$ as 0 without the diligence to distinguish between dropout buyers and visitors. Here, we denote such an arbitrary but simplified buyer indicator as
\[ {\tilde{y}_{i}}=\left\{\begin{array}{l@{\hskip10.0pt}l}1,\hspace{1em}& \text{user}\hspace{2.5pt}i\hspace{2.5pt}\text{is a real buyer},\\ {} 0,\hspace{1em}& \text{otherwise.}\hspace{2.5pt}\end{array}\right.\]
Suppose there are m real buyers among the total n users, and without loss of generality, assume the first m users are real buyers. Denote the n users' buyer indicators, purchase amounts, and simplified buyer indicators during the experimentation period by the vectors $\boldsymbol{y}=({y_{1}},\dots ,{y_{n}})$, $\boldsymbol{z}=({z_{1}},\dots ,{z_{n}})$, and $\tilde{\boldsymbol{y}}=({\tilde{y}_{1}},\dots ,{\tilde{y}_{n}})$, respectively. Additionally, let ${\boldsymbol{x}_{i}}=({x_{i1}},\dots ,{x_{ip}})\in {R^{p}}$, $p\ge 1$, denote the relevant features for user i, and let $\boldsymbol{X}={({\boldsymbol{x}_{1}},\dots ,{\boldsymbol{x}_{n}})^{T}}$. Without loss of generality, we assume that the p features are continuous variables.
The problem of interest is to impute the missing values ${\boldsymbol{y}^{mis}}$ and ${\boldsymbol{z}^{mis}}$ in the context of online experimentation. Among users with missing values, visitors are mixed with dropout buyers. Therefore, our proposed method first identifies the candidates of dropout buyers (i.e., the candidate 1s in ${\boldsymbol{y}^{mis}}$) with the help of a classification model and then imputes ${\boldsymbol{y}^{mis}}$ and ${\boldsymbol{z}^{mis}}$ using an efficient cluster-based nearest neighbors approach.
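As a concrete illustration of the notation, the following minimal Python sketch (with toy data and illustrative variable names, not taken from any real experiment) constructs the simplified buyer indicator $\tilde{\boldsymbol{y}}$ and the index set of users with missing outcomes from raw purchase records.

```python
import numpy as np

# Toy purchase amounts for n = 6 users; np.nan marks users whose outcome was
# never recorded (visitors mixed with dropout buyers).
z_obs = np.array([12.5, np.nan, np.nan, 47.3, np.nan, 8.0])

# Simplified buyer indicator: treat every unrecorded outcome as 0.
y_tilde = np.where(np.nan_to_num(z_obs, nan=0.0) > 0, 1, 0)   # [1 0 0 1 0 1]

# Index set I = {i : y_i is missing}, i.e., users who could be either
# visitors or dropout buyers.
I = np.flatnonzero(np.isnan(z_obs))                            # [1 2 4]
```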
3 The Proposed Method
The objective of the imputation problem is to impute missing values such that they are close to the underlying true data. The missing value imputation problem can be formulated as
\[ \underset{{\hat{\boldsymbol{y}}^{mis}}}{\text{min}}\hspace{2.5pt}l({\hat{\boldsymbol{y}}^{mis}},{\boldsymbol{y}^{mis}}),\]
where $l({\hat{\boldsymbol{y}}^{mis}},{\boldsymbol{y}^{mis}})$ is a loss function to quantify the difference between the imputed missing values ${\hat{\boldsymbol{y}}^{mis}}$ and the underlying true values ${\boldsymbol{y}^{mis}}$.

Imputing missing values with non-parametric methods such as the nearest neighbors algorithm in large-scale data sets is challenging due to the large computational requirement of calculating distances between pairs of data points. To address this challenge, we propose to incorporate the data clustering patterns into the imputation. In other words, we partition users into c clusters and then perform imputations within each cluster. Thus, the cluster-based imputation problem is described as
(3.1)
\[\begin{aligned}{}& \underset{{\hat{\boldsymbol{y}}^{mis}}}{\text{min}}\hspace{2.5pt}{\sum \limits_{h=1}^{c}}{\sum \limits_{C(i)=h}^{}}l({\hat{y}_{i}^{mis}},{y_{i}^{mis}}),\\ {} \hspace{2.5pt}& s.t.{\sum \limits_{h=1}^{c}}{\sum \limits_{C(i)=h}^{}}||{\boldsymbol{x}_{i}}-{\boldsymbol{\mu }_{h}}|{|_{2}^{2}}\le g,\hspace{2.5pt}i\in I,\end{aligned}\]
where ${\boldsymbol{x}_{i}}$ denotes the features for user i, $C(i)=h$ indicates that user i with missing value ${y_{i}^{mis}}$ belongs to cluster h with centroid ${\boldsymbol{\mu }_{h}}$, the constant g controls the within-cluster distances, and $||\cdot |{|_{2}}$ is the L${_{2}}$-norm. The set of indices I is defined as $I=\{i:{y_{i}}\hspace{2.5pt}\text{is missing}\}$. The features are selected based on experiment owners' domain knowledge. After imputing ${y_{i}^{mis}}$, we can estimate the corresponding ${z_{i}^{mis}}$ as well.

Note that it is unknown whether a user with an incomplete metric is a visitor or a dropout buyer. The dropout buyers are mixed with visitors because neither have their purchase information recorded. To address this challenge, in Section 3.1 we apply a logistic regression model to identify a certain portion of visitors and narrow down the candidates of dropout buyers. Section 3.2 details the proposed cluster-based imputation. Notice that the data set in online controlled experiments is often too large for conventional clustering methods to be conducted efficiently. To alleviate this computational issue, Section 3.3 considers a stratification-based clustering approach and describes how to choose the number of clusters.
3.1 Identifying Dropout Buyer Candidates
The practitioners’ simplified buyer indicator $\tilde{\boldsymbol{y}}$ reveals partial information in the true buyer indicator $\boldsymbol{y}$. Therefore, a classification model based on $(\boldsymbol{X},\tilde{\boldsymbol{y}})$ provides us with the likelihood of purchases. Users with a high likelihood but missing purchase records can serve as the candidates for dropout buyers. Since $\tilde{\boldsymbol{y}}$ is used as a substitution of $\boldsymbol{y}$, we call $\tilde{\boldsymbol{y}}$ pseudo-response.
Specifically, we propose to apply the logistic regression model for the buyer identification. Denote the conditional probability for user i as $p({\boldsymbol{x}_{i}})=Pr({\tilde{y}_{i}}=1|{\boldsymbol{x}_{i}})$, that is,
\[ {\tilde{y}_{i}}|{\boldsymbol{x}_{i}}=\left\{\begin{array}{l@{\hskip10.0pt}l}1,& \text{w.p.}\hspace{2.5pt}p({\boldsymbol{x}_{i}}),\\ {} 0,& \text{w.p.}\hspace{2.5pt}1-p({\boldsymbol{x}_{i}}).\end{array}\right.\]
We model the conditional probability $p({\boldsymbol{x}_{i}})$ with the logistic model $\log (p({\boldsymbol{x}_{i}})/(1-p({\boldsymbol{x}_{i}})))={\boldsymbol{x}_{i}^{T}}\boldsymbol{\beta }$, where $\boldsymbol{\beta }={({\beta _{1}},\dots ,{\beta _{p}})^{T}}$. The features used in the logistic regression model are believed to be closely related to users' purchase behaviors. A threshold is needed to turn the predicted probabilities into classifications; one widely used value is 0.5, and practitioners can instead set the threshold to the proportion of true negatives (TN; see Table 1) in the whole sample according to their domain knowledge.

Comparing the model predictions with the pseudo-response, Table 1 summarizes four types of classification results: false positive (FP), true negative (TN), false negative (FN), and true positive (TP). An FP indicates a user whose pseudo-response is 0 but whom the model predicts to have purchase information. We use this inconsistency to identify the candidates of dropout buyers; that is, FP cases can be either visitors or dropout buyers. A TN indicates agreement that the user has no recorded purchases, so we treat all TN cases as visitors. FN and TP users have recorded purchase behaviors and are therefore real buyers, neither dropout buyers nor visitors.
Table 1
Summary of four categories of results in the logistic regression model.
Category | Pseudo-response ($\tilde{y}$) | Prediction | Description |
True Negative (TN) | 0 | 0 | Visitors |
False Positive (FP) | 0 | 1 | Candidates of dropout buyers |
False Negative (FN) | 1 | 0 | Real buyers |
True Positive (TP) | 1 | 1 | Real buyers |
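As an illustration of this step, the following Python sketch fits the classification model on $(\boldsymbol{X},\tilde{\boldsymbol{y}})$ and extracts the FP cases as dropout buyer candidates; the simulated features, coefficients, and the 0.5 threshold are assumptions for illustration rather than the configuration used in practice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

# Illustrative funnel covariates (e.g., session and search counts), fully observed.
X = rng.normal(size=(n, 3))

# Pseudo-response: 1 for real buyers, 0 for everyone else (visitors + dropout buyers).
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, 0.5, -0.3]) - 1.0)))
y_tilde = rng.binomial(1, p_true)

# Fit the classification model on (X, y_tilde) and classify with a 0.5 threshold.
clf = LogisticRegression().fit(X, y_tilde)
pred = (clf.predict_proba(X)[:, 1] >= 0.5).astype(int)

# FP (pseudo-response 0, prediction 1): candidates of dropout buyers.
# TN (pseudo-response 0, prediction 0): treated as visitors.
dropout_candidates = np.flatnonzero((y_tilde == 0) & (pred == 1))
visitors = np.flatnonzero((y_tilde == 0) & (pred == 0))
```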
Suppose there are r visitors and $n-m-r$ dropout buyer candidates that have been identified. Without loss of generality, let us assume the first r users in the missing set are those visitors. Then we write ${\boldsymbol{y}^{mis}}$ as
\[ {\boldsymbol{y}^{mis}}=({\boldsymbol{y}^{est}},{\boldsymbol{y}^{\ast }})=({y_{m+1}^{est}},\dots ,{y_{m+r}^{est}},{y_{m+r+1}^{\ast }},\dots ,{y_{n}^{\ast }}),\]
where ${y_{i}^{est}}=0,i=m+1,\dots ,m+r$, and ${y_{i}^{\ast }}\in \{0,1\},i=m+r+1,\dots ,n$ with 0 representing visitors and 1 representing dropout buyers. Similarly, we denote the corresponding continuous response for the purchase amount as
\[ {\boldsymbol{z}^{mis}}=({\boldsymbol{z}^{est}},{\boldsymbol{z}^{\ast }})=({z_{m+1}^{est}},\dots ,{z_{m+r}^{est}},{z_{m+r+1}^{\ast }},\dots ,{z_{n}^{\ast }}),\]
where ${z_{i}^{est}}=0,i=m+1,\dots ,m+r$, represents the purchase amounts from estimated visitors and ${z_{i}^{\ast }}$ represents the missing non-negative response from the $n-m-r$ users. In the imputation methods in Section 3.2, we consider ${\boldsymbol{z}^{est}}$ to be zeros and our major focus is to impute ${\boldsymbol{z}^{\ast }}$.

3.2 Clustering-Based Imputation for Dropout Buyers
To impute the missing purchase values ${\boldsymbol{z}^{\ast }}$ of the dropout buyers, we adopt a clustering-based method using kNN techniques. Clustering improves data analysis efficiency by identifying inherent structural patterns and partitioning the large-scale data set into small subsets. In each stratum ${\boldsymbol{X}_{tu}}$ (described later in the stratification step), we perform the K-means clustering method [20] to form clusters, which minimizes the within-cluster sum of squared distances
\[ \underset{C(\cdot ),\{{\boldsymbol{\mu }_{h}}\}}{\text{min}}\hspace{2.5pt}{\sum \limits_{h=1}^{c}}{\sum \limits_{C(i)=h}^{}}||{\boldsymbol{x}_{i}}-{\boldsymbol{\mu }_{h}}|{|_{2}^{2}}.\]
On top of clustering, we use the triangle inequality rule (described later) to ensure the consistent identification of nearest neighbors in the k-nearest neighbors (kNN) approach for imputation. The main idea of the kNN method is that nearby data points are similar to each other. The kNN algorithm is straightforward and does not require parametric model estimation, but it is computationally expensive and becomes slow as the size of the data set increases. This computational burden, however, is greatly mitigated by the clustering strategy. Given a specific cluster h (i.e., fixing the constraint in (3.1)), the imputation problem (3.1) with the kNN method can be written as
(3.2)
\[ {y_{i}^{\ast }}=\underset{L}{\text{argmax}}\sum \limits_{{x_{j}}\in {N_{k}}({x_{i}})}\mathbb{1}\{{y_{j}}=L\},\hspace{2.5pt}i\in I,\hspace{2.5pt}C(i)=h,\]
where $L\in \{0,1\}$ is the binary label, k is a positive integer representing the size of the target user's set of nearest neighbors ${N_{k}}({x_{i}})$, and j indexes the nearest neighbors. The performance of the kNN method may be affected by different k values, and the optimal k depends on the underlying structure of the data set. In this work, we use a fixed value of $k=15$. It is not difficult to derive the solution to the objective function in (3.2), which is written as
\[ {\hat{y}_{i}^{\ast }}=\left\{\begin{array}{l@{\hskip10.0pt}l}1,\hspace{1em}& {\textstyle\textstyle\sum _{j=1}^{k}}{y_{j}}/k\ge 0.5,\\ {} 0,\hspace{1em}& \text{otherwise},\end{array}\right.\]
where ${\textstyle\sum _{j=1}^{k}}{y_{j}}/k$ is the average of the responses ${y_{j}}$ in the nearest neighbors.

With the imputed ${\hat{y}_{i}^{\ast }}$, we obtain the corresponding imputed missing value ${\hat{z}_{i}^{\ast }}$ from the cost function formulated as
\[\begin{aligned}{}\underset{{z_{i}^{\ast }}}{\text{minimize}}\hspace{2.5pt}& {\hat{y}_{i}^{\ast }}{\sum \limits_{j=1}^{k}}||{z_{i}^{\ast }}-{z_{j}}|{|_{2}^{2}}+(1-{\hat{y}_{i}^{\ast }})||{z_{i}^{\ast }}|{|_{2}^{2}},\\ {} \hspace{2.5pt}& s.t.\hspace{2.5pt}{z_{i}^{\ast }}\ge 0,\hspace{2.5pt}i\in I,\hspace{2.5pt}C(i)=h.\end{aligned}\]
That is, the estimated ${\hat{z}_{i}^{\ast }}$ is given by
\[ {\hat{z}_{i}^{\ast }}=\left\{\begin{array}{l@{\hskip10.0pt}l}{\textstyle\textstyle\sum _{j=1}^{k}}{z_{j}}/k,\hspace{1em}& {\textstyle\textstyle\sum _{j=1}^{k}}{y_{j}}/k\ge 0.5,\\ {} 0,\hspace{1em}& \text{otherwise},\end{array}\right.\]
where ${\textstyle\sum _{j=1}^{k}}{z_{j}}/k$ is the average of the responses ${z_{j}}$ in the nearest neighbors.

The nearest neighbors are determined based on their distances to the target user; that is, the k closest neighbors are the k users j with the smallest distances $d({\boldsymbol{x}_{i}},{\boldsymbol{x}_{j}})$ to the target user i, where $d({\boldsymbol{x}_{i}},{\boldsymbol{x}_{j}})$ is the distance between users i and j.
To further accelerate the computation, we adopt the triangle inequality rule [30], which avoids unnecessary distance calculations. We first obtain the k nearest neighbors within the closest cluster and denote their largest distance to the target user as ${d_{max}}$. Let ${d_{1}}$ denote the distance between the target user and the centroid of any other cluster, ${d_{2}}$ the distance between a user in that cluster and its cluster centroid, and ${d_{3}}$ the distance between the target user and that user. By the triangle inequality, ${d_{3}}\ge |{d_{1}}-{d_{2}}|$, so whenever ${d_{max}}\le |{d_{1}}-{d_{2}}|$ we have ${d_{max}}\le {d_{3}}$ and ${d_{3}}$ need not be calculated explicitly. This greatly speeds up the distance computation and ensures that the identification of nearest neighbors is robust to the clustering. In this study, we use the L${_{2}}$-norm to measure distances.
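The within-cluster kNN imputation and the triangle inequality pruning can be sketched as follows; the function signature is illustrative and assumes that the stratum's observed users, their K-means cluster labels, centroids, and stored distances to their own centroids are available.

```python
import numpy as np

def knn_impute_with_pruning(x_i, X_obs, y_obs, z_obs, labels, centroids,
                            d_to_centroid, own_cluster, k=15):
    """Impute (y*, z*) for one target user x_i within its stratum.

    X_obs, y_obs, z_obs: features, buyer indicator, and purchase amount of the
    fully observed users; labels and d_to_centroid come from K-means.
    """
    # Step 1: k nearest neighbors within the target user's closest cluster.
    in_own = labels == own_cluster
    d_own = np.linalg.norm(X_obs[in_own] - x_i, axis=1)
    order = np.argsort(d_own)[:k]
    nn_d = d_own[order]
    nn_idx = np.flatnonzero(in_own)[order]
    d_max = nn_d.max()

    # Step 2: scan users in other clusters, skipping those ruled out by the
    # triangle inequality: d_max <= |d1 - d2| implies d_max <= d3.
    d1 = np.linalg.norm(centroids - x_i, axis=1)   # target to every centroid
    for j in np.flatnonzero(~in_own):
        if d_max <= abs(d1[labels[j]] - d_to_centroid[j]):
            continue                               # d3 cannot beat d_max
        d3 = np.linalg.norm(X_obs[j] - x_i)
        if d3 < d_max:
            worst = np.argmax(nn_d)                # replace the farthest neighbor
            nn_d[worst], nn_idx[worst] = d3, j
            d_max = nn_d.max()

    # Step 3: majority vote for y*, neighbor mean (or zero) for z*.
    y_hat = 1 if y_obs[nn_idx].mean() >= 0.5 else 0
    z_hat = float(z_obs[nn_idx].mean()) if y_hat == 1 else 0.0
    return y_hat, z_hat
```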
3.3 Efficient Clustering Strategy
Note that the data set in online controlled experimentation is often too large to cluster efficiently in the imputation step. To reduce the computational burden in clustering, we propose a stratification-based clustering approach. The key idea is to first stratify the user pool and then perform clustering within each stratum.
In the stratification step, we stratify users into two hierarchical levels: treatment assignment and users' buying characteristics. The treatment assignment, consisting of the treatment group and the control group, is determined by the experimentation configuration. Generally, online controlled experiments have two treatment assignments, control and treatment, although more than two are possible, for example in multi-variant experiments. Users' buying characteristics, including new buyers, infrequent buyers, frequent buyers, and idle buyers, are categorized based on users' purchase activities at eBay; there are in total 12 buyer categories. Note that both the experimentation configuration and the users' buying segments are determined prior to the start of the experimentation. The hierarchical stratification partitions the feature space $\boldsymbol{X}$ into strata ${\boldsymbol{X}_{tu}}$, where ${\boldsymbol{X}_{tu}}$ is the stratum at the t-th treatment level and the u-th level of users' buying characteristics, with T treatment levels and U levels of users' buying characteristics in total.
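For illustration, the stratification step can be expressed with a simple pandas group-by; the column names and segment labels below are assumptions, not eBay's actual schema.

```python
import pandas as pd

# Illustrative user-level table; column names and segment labels are assumed.
users = pd.DataFrame({
    "treatment":         ["control", "treatment", "control", "treatment"],
    "buyer_segment":     ["new", "frequent_II", "idle_I", "frequent_III"],
    "n_sessions":        [3, 12, 1, 20],
    "n_search_sessions": [2, 9, 0, 15],
})

# Each stratum X_tu is one (treatment level, buying-characteristic level) cell;
# clustering and kNN imputation then run independently inside each stratum.
strata = {key: frame for key, frame in users.groupby(["treatment", "buyer_segment"])}
```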
The combination of stratification and clustering within each stratum greatly improves the computational efficiency of the imputation step, in which the neighbors of the target user are searched across all clusters within the stratum.
The number of clusters in each stratum is chosen by maximizing a simplified version of the Silhouette score, also known as the simplified Silhouette. The Silhouette score is an effective measure of clustering goodness [25], but it requires an intensive computation of the distance between each data point and all the other data points. The simplified Silhouette improves the computational efficiency of the Silhouette score by calculating only the distances between each data point and the cluster centroids [11]. The simplified Silhouette of data point i, denoted as $S{S_{i}}$, is defined as
\[ S{S_{i}}=\frac{{b_{i}}-{a_{i}}}{\max ({a_{i}},{b_{i}})},\]
where ${a_{i}}$ is the distance between data point i and the centroid of the cluster it belongs to, and ${b_{i}}$ is the minimum of the distances between data point i and the centroids of the other clusters. The final simplified Silhouette is the average of all data points' simplified Silhouettes. Note that the distance of each data point to its own cluster centroid has already been calculated and recorded during the k-means clustering, which greatly reduces the computational burden of the simplified Silhouette.
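A minimal sketch of selecting the number of clusters with the simplified Silhouette is given below; for clarity it recomputes the centroid distances with scikit-learn's KMeans rather than reusing the distances recorded during clustering.

```python
import numpy as np
from sklearn.cluster import KMeans

def simplified_silhouette(X, labels, centroids):
    """Average simplified Silhouette: only point-to-centroid distances are used."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (n, c)
    a = d[np.arange(len(X)), labels]           # distance to own centroid
    d_other = d.copy()
    d_other[np.arange(len(X)), labels] = np.inf
    b = d_other.min(axis=1)                    # distance to nearest other centroid
    return float(np.mean((b - a) / np.maximum(a, b)))

def choose_num_clusters(X, candidates=range(2, 11), seed=0):
    """Pick the number of clusters that maximizes the simplified Silhouette."""
    scores = {}
    for c in candidates:
        km = KMeans(n_clusters=c, n_init=10, random_state=seed).fit(X)
        scores[c] = simplified_silhouette(X, km.labels_, km.cluster_centers_)
    return max(scores, key=scores.get), scores
```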
A pseudo-code for the proposed method is summarized in Algorithm 1.
4 Simulation
In this section, we conduct simulation studies to evaluate the performance of the proposed cluster-based kNN imputation method. The complete response has two parts: a non-zero part and a zero part. The non-zero part of the response follows a Gaussian model ${z_{s}}=1.5+1.1w+1.1{x_{s1}}+0.2{x_{s2}}+\epsilon $, $\epsilon \sim N(0,0.25)$, where w is the binary assignment to the control or the treatment group, and ${x_{s1}}$ and ${x_{s2}}$ are variables normally distributed as N(0.1, 1) and N(0.2, 2.25), respectively. The binary indicator of the response follows a Bernoulli distribution with the conditional probability $Pr({y_{s}}=1|{x_{s3}})$ given by the logistic regression model $\text{logit}(Pr({y_{s}}=1|{x_{s3}}))=-1+5.8{x_{s3}}$, where ${x_{s3}}$ is a variable with a Gaussian distribution N(0.2, 0.04). In the simulation, we consider three scenarios for generating missing values in the response for both the control and the treatment group. In scenario 1 (S1), the missingness is completely at random. In scenario 2 (S2), the missing probability is described by a logistic regression model depending on an unobserved variable following a Gaussian distribution. In scenario 3 (S3), the missingness depends on the value of the response; specifically, a response is set to missing if its value exceeds a pre-defined threshold within the control and the treatment group. In all three scenarios, we further treat responses equal to zero as missing to represent real cases where users have incomplete records. The sample size is fixed at 5000.
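For reference, the data-generating process described above can be sketched as follows, reading N(a, b) as a mean and a variance; the 30% missing rate used for S1 is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Treatment assignment and covariates (second arguments of N(.,.) read as variances).
w  = rng.binomial(1, 0.5, n)                        # 0 = control, 1 = treatment
x1 = rng.normal(0.1, 1.0, n)
x2 = rng.normal(0.2, np.sqrt(2.25), n)
x3 = rng.normal(0.2, np.sqrt(0.04), n)

# Binary buyer indicator from the logistic model, then the non-zero purchase amount.
p = 1.0 / (1.0 + np.exp(-(-1.0 + 5.8 * x3)))
y = rng.binomial(1, p)
z = np.where(y == 1,
             1.5 + 1.1 * w + 1.1 * x1 + 0.2 * x2 + rng.normal(0, np.sqrt(0.25), n),
             0.0)

# Scenario S1: missing completely at random (S2 and S3 would replace this mask
# with a logistic model on an unobserved variable, or a threshold on z).
mcar = rng.random(n) < 0.3                          # 30% missing rate (assumed)
z_missing = z.copy()
z_missing[mcar | (z == 0)] = np.nan                 # zero outcomes also become missing
```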
We compare the proposed method with six benchmark models, including (i) Complete-case analysis (BM${_{1}}$), (ii) Unconditional control-mean imputation (BM${_{2}}$), (iii) Unconditional treatment-mean imputation (BM${_{3}}$), (iv) Unconditional zero imputation (BM${_{4}}$), (v) Best-case analysis (BM${_{5}}$), (vi) Worst-case analysis (BM${_{6}}$).
Complete-case analysis removes cases with missing values and uses only cases with complete outcomes. Specifically, we discard ${\boldsymbol{z}^{mis}}$ and the sample size is reduced to m. The complete-case analysis is easy to implement but wastes information, especially when the number of incomplete cases is substantial.
Unconditional control-mean imputation uses the mean in the observed users in the control group to impute missing values while unconditional treatment-mean imputation uses the mean in the observed users in the treatment group for imputation. That is,
\[\begin{aligned}{}{\text{BM}_{2}}:\hspace{2.5pt}& {\hat{z}_{i}^{\ast }}=\frac{{\textstyle\textstyle\sum _{c=1}^{{n_{c}}}}{z_{c}^{obs}}}{{n_{c}}},\hspace{2.5pt}c\in C,\\ {} {\text{BM}_{3}}:\hspace{2.5pt}& {\hat{z}_{i}^{\ast }}=\frac{{\textstyle\textstyle\sum _{t=1}^{{n_{t}}}}{z_{t}^{obs}}}{{n_{t}}},\hspace{2.5pt}t\in T,\end{aligned}\]
where the set of indices C is defined as $C=\{c:{z_{c}}\hspace{2.5pt}\text{is in the control group}\}$ and the set of indices T is defined as $T=\{t:{z_{t}}\hspace{2.5pt}\text{is in the treatment group}\}$. ${n_{c}}$ and ${n_{t}}$ are the sample sizes in the control group and in the treatment group, respectively. Unconditional zero imputation uses zero to impute missing values, that is,
\[ {\text{BM}_{4}}:\hspace{2.5pt}{\hat{z}_{i}^{\ast }}=0.\]
These three imputation methods are different types of single-value imputation approaches, which keep the full sample size. However, they treat the missing values as fixed, distorting the distribution and ignoring the uncertainty in the missing values.

The best-case analysis imputes missing values in the treatment (control) group with the mean of the observed users in the treatment (control) group. In contrast, the worst-case analysis imputes missing values in the treatment (control) group with the mean of the observed users in the control (treatment) group. Here, we assume that the tested feature has a positive impact in nature, and thus the mean in the treatment group is expected to be greater than the mean in the control group. As a result, the difference between the imputed missing values in the treatment group and the control group aligns with the feature impact in the best-case analysis, but contradicts it in the worst-case analysis. The best-case analysis and the worst-case analysis are expressed as
\[\begin{aligned}{}{\text{BM}_{5}}:\hspace{2.5pt}& {\hat{z}_{t}^{\ast }}=\frac{{\textstyle\textstyle\sum _{t=1}^{{n_{t}}}}{z_{t}^{obs}}}{{n_{t}}},\hspace{2.5pt}{\hat{z}_{c}^{\ast }}=\frac{{\textstyle\textstyle\sum _{c=1}^{{n_{c}}}}{z_{c}^{obs}}}{{n_{c}}},\\ {} {\text{BM}_{6}}:\hspace{2.5pt}& {\hat{z}_{t}^{\ast }}=\frac{{\textstyle\textstyle\sum _{c=1}^{{n_{c}}}}{z_{c}^{obs}}}{{n_{c}}},\hspace{2.5pt}{\hat{z}_{c}^{\ast }}=\frac{{\textstyle\textstyle\sum _{t=1}^{{n_{t}}}}{z_{t}^{obs}}}{{n_{t}}},\end{aligned}\]
where ${\hat{z}_{t}^{\ast }}$ (${\hat{z}_{c}^{\ast }}$) is the imputed missing value in the treatment (control) group, and ${z_{t}^{obs}}$ (${z_{c}^{obs}}$) is the observed value in the treatment (control) group.

To check the performance of the proposed method, we estimate the mean and variance in the control group, and compute the lift in the mean between the treatment group and the control group, the standard error (SE) of the difference between the treatment and control group, the coefficient of variation (CV) for the control group, the zero rate (ZR), and the p-value. The lift in the mean between the treatment group and the control group is described as
\[\begin{aligned}{}\text{Lift}& =\frac{{\mu _{t}}-{\mu _{c}}}{{\mu _{c}}}\times 100\% \\ {} & =\bigg(\frac{{\textstyle\textstyle\sum _{t=1}^{{n_{t}}}}{z_{t}^{}}}{{n_{t}}}-\frac{{\textstyle\textstyle\sum _{c=1}^{{n_{c}}}}{z_{c}^{}}}{{n_{c}}}\bigg)\bigg/\frac{{\textstyle\textstyle\sum _{c=1}^{{n_{c}}}}{z_{c}^{}}}{{n_{c}}}\times 100\% ,\end{aligned}\]
where ${\mu _{t}}$ and ${\mu _{c}}$ are the means in the treatment group and the control group, respectively.

The SE is expressed as
\[ \text{SE}=\sqrt{\frac{({n_{t}}-1){s_{t}^{2}}+({n_{c}}-1){s_{c}^{2}}}{{n_{t}}+{n_{c}}-2}\cdot \bigg(\frac{1}{{n_{c}}}+\frac{1}{{n_{t}}}\bigg)},\]
where ${s_{t}}$ and ${s_{c}}$ are the standard deviations for the treatment group and the control group, respectively.

Table 2
Performance comparisons of benchmark methods from 50 simulation replications (means with standard errors in parentheses). Note that the method NoMissing uses the original complete response prior to the missing-value assignment.
Scenario | Method | Lift (%) | $\mu {_{\text{c}}}$ | $\mu {_{\text{t}}}$ | $s{_{\text{c}}}$ | CV | $n{_{\text{c}}}$ | ZR |
S1 | BM1 | 65.6 (4.96) | 1.7 (0.05) | 2.8 (0.04) | 1.2 (0.03) | 0.7 (0.03) | 953.8 (30.33) | 0 (0) |
S1 | BM2 | 24.9 (2.24) | 1.7 (0.05) | 2.1 (0.03) | 0.8 (0.02) | 0.5 (0.02) | 2504.1 (28.23) | 0 (0) |
S1 | BM3 | 17.8 (1.14) | 2.4 (0.03) | 2.8 (0.04) | 0.9 (0.02) | 0.4 (0.01) | 2504.1 (28.23) | 0 (0) |
S1 | BM4 | 65.4 (9.85) | 0.6 (0.02) | 1.1 (0.04) | 1.1 (0.02) | 1.8 (0.04) | 2504.1 (28.23) | 0.6 (0.01) |
S1 | BM5 | 65.6 (4.96) | 1.7 (0.05) | 2.8 (0.04) | 0.8 (0.02) | 0.5 (0.02) | 2504.1 (28.23) | 0 (0) |
S1 | BM6 | -11.1 (0.76) | 2.4 (0.03) | 2.1 (0.03) | 0.9 (0.02) | 0.4 (0.01) | 2504.1 (28.23) | 0 (0) |
S1 | Proposed | 40.3 (11.30) | 1.1 (0.25) | 1.5 (0.24) | 1.3 (0.20) | 1.2 (0.09) | 2504.1 (28.23) | 0.4 (0.01) |
S1 | NoMissing | 64.8 (7.41) | 0.9 (0.03) | 1.5 (0.04) | 1.2 (0.02) | 1.4 (0.03) | 2504.1 (28.23) | 0.5 (0.01) |
S2 | BM1 | 65.0 (4.41) | 1.7 (0.04) | 2.8 (0.04) | 1.2 (0.03) | 0.7 (0.03) | 958.6 (29.9) | 0 (0) |
S2 | BM2 | 24.8 (2.02) | 1.7 (0.04) | 2.1 (0.03) | 0.8 (0.02) | 0.5 (0.02) | 2504.1 (28.23) | 0 (0) |
S2 | BM3 | 17.8 (1.06) | 2.4 (0.03) | 2.8 (0.04) | 0.9 (0.03) | 0.4 (0.01) | 2504.1 (28.23) | 0 (0) |
S2 | BM4 | 64.3 (9.47) | 0.6 (0.02) | 1.1 (0.04) | 1.1 (0.02) | 1.7 (0.04) | 2504.1 (28.23) | 0.6 (0.01) |
S2 | BM5 | 65.0 (4.41) | 1.7 (0.04) | 2.8 (0.04) | 0.8 (0.02) | 0.5 (0.02) | 2504.1 (28.23) | 0 (0) |
S2 | BM6 | -11.0 (0.59) | 2.4 (0.03) | 2.1 (0.03) | 0.9 (0.03) | 0.4 (0.01) | 2504.1 (28.23) | 0 (0) |
S2 | Proposed | 39.4 (10.79) | 1.1 (0.25) | 1.5 (0.24) | 1.3 (0.20) | 1.2 (0.09) | 2504.1 (28.23) | 0.4 (0.01) |
S2 | NoMissing | 64.8 (7.41) | 0.9 (0.03) | 1.5 (0.04) | 1.2 (0.02) | 1.4 (0.03) | 2504.1 (28.23) | 0.5 (0.01) |
S3 | BM1 | 100.9 (8.84) | 1.1 (0.05) | 2.2 (0.03) | 0.9 (0.02) | 0.8 (0.05) | 958.6 (29.9) | 0 (0) |
S3 | BM2 | 38.4 (4.02) | 1.1 (0.05) | 1.5 (0.03) | 0.6 (0.02) | 0.5 (0.03) | 2504.1 (28.23) | 0 (0) |
S3 | BM3 | 23.7 (1.33) | 1.8 (0.03) | 2.2 (0.03) | 0.8 (0.02) | 0.4 (0.02) | 2504.1 (28.23) | 0 (0) |
S3 | BM4 | 100.1 (15.36) | 0.4 (0.02) | 0.8 (0.03) | 0.8 (0.02) | 1.8 (0.06) | 2504.1 (28.23) | 0.6 (0.01) |
S3 | BM5 | 100.9 (8.84) | 1.1 (0.05) | 2.2 (0.03) | 0.6 (0.02) | 0.5 (0.03) | 2504.1 (28.23) | 0 (0) |
S3 | BM6 | -14.7 (0.89) | 1.8 (0.03) | 1.5 (0.03) | 0.8 (0.02) | 0.4 (0.02) | 2504.1 (28.23) | 0 (0) |
S3 | Proposed | 71.6 (9.67) | 0.6 (0.03) | 1.0 (0.03) | 0.8 (0.02) | 1.4 (0.05) | 2504.1 (28.23) | 0.5 (0.01) |
S3 | NoMissing | 64.8 (7.41) | 0.9 (0.03) | 1.5 (0.04) | 1.2 (0.02) | 1.4 (0.03) | 2504.1 (28.23) | 0.5 (0.01) |
In online experimentation, the faster we run experiments, the greater the economic benefits and the lower the operational costs. Given constant user traffic, running experiments faster means that a smaller number of users is required [3, 31]. The CV is proportional to the number of users required for achieving a pre-determined statistical power of experiments. The CV is expressed as
\[ \text{CV}=\frac{{s_{c}}}{{\mu _{c}}},\]
where ${s_{c}}$ and ${\mu _{c}}$ are the standard deviation and the mean in the control group, respectively. The smaller the CV, the smaller the user size required to detect the difference at the specific statistical power, and thus the higher the sensitivity.
The ZR is the ratio of the number of zeros (${n_{zero}}$) in the imputed $\boldsymbol{z}$ to the total data size n, described as
\[ \text{ZR}=\frac{{n_{zero}}}{n}.\]
The ZR evaluates the proportion of users whose outcome is zero after the imputation (i.e., visitors).
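The evaluation quantities defined above can be computed as in the following sketch; the function and argument names are illustrative.

```python
import numpy as np

def experiment_metrics(z_t, z_c):
    """Lift, pooled SE, control-group CV, and zero rate from imputed outcomes."""
    mu_t, mu_c = z_t.mean(), z_c.mean()
    s_t, s_c = z_t.std(ddof=1), z_c.std(ddof=1)
    n_t, n_c = len(z_t), len(z_c)
    lift = (mu_t - mu_c) / mu_c * 100
    se = np.sqrt(((n_t - 1) * s_t**2 + (n_c - 1) * s_c**2) / (n_t + n_c - 2)
                 * (1.0 / n_c + 1.0 / n_t))
    cv = s_c / mu_c
    zr = np.count_nonzero(np.concatenate([z_t, z_c]) == 0) / (n_t + n_c)
    return {"lift_pct": lift, "SE": se, "CV": cv, "ZR": zr}
```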
We compare the performance of the proposed method and the benchmark methods in all scenarios in Table 2. In S1 and S2, the proposed cluster-based kNN imputation method has the closest ${\mu _{c}}$, ${\mu _{t}}$, and ZR to the method NoMissing. The BM2 and BM3 methods have larger ${\mu _{c}}$ and ${\mu _{t}}$ because they impute all missing values with nonzero values, as indicated by their ZR values of 0. The proposed method has an $s{_{\text{c}}}$ value comparable to the method NoMissing, while the BM2, BM3, BM4, and BM5 methods have smaller values. This might be explained by the fact that the imputed values in the proposed method are not fixed as in the BM2, BM3, BM4, and BM5 methods. Though the BM1 method has a similar $s{_{\text{c}}}$ to the proposed method, its sample size is smaller due to the removal of samples with missing responses. In S3, the proposed method does not outperform the BM2 method. This is probably because in S3 the missing response values are concentrated in one particular group. When this entire group is missing, it is difficult for the kNN-based imputation approach to find good neighbors for the missing responses. As a result, the estimated ${\mu _{c}}$ and ${\mu _{t}}$ are not close to the truth.
5 Case Study: Search Ranking Experiment
To illustrate the proposed method, this section uses a real online experiment whose objective was to improve eBay's item ranking in search results based on a ranking algorithm. The experiment hypothesis is that integrating information about negative buyer experiences into the ranking algorithm will reduce the visibility of items with a high probability of negative buyer experiences in search results, resulting in lower product return rates and increased revenues. The experiment lasted three weeks. A portion of eligible eBay users were selected and randomized into three variants, two treatment groups and one control group. The number of participant users in each variant exceeded 10 million. One of the most important outcomes is related to purchases, denoted here as PR.
Table 3
Performance comparisons of benchmark methods in the ranking search experiment. Note that the values of $s_{\text{c}}^{2}$, $\mu {_{\text{c}}}$, CV, and SE are not real; they are masked with a particular linear transformation to meet the disclosure requirement.
Method | $s_{\text{c}}^{2}$ | $\mu {_{\text{c}}}$ | CV | ZR | Lift (%) | SE | p-value |
BM1 | 107035.21 | 1235.8 | 0.265 | 0.00 | -0.37 | 0.33 | 0.17 |
BM2 | 20003.17 | 390.5 | 0.362 | 0.00 | -0.16 | 0.06 | 0.28 |
BM3 | 20004.96 | 389.9 | 0.363 | 0.00 | -0.17 | 0.06 | 0.28 |
BM4 | 20693.30 | 213.7 | 0.673 | 0.83 | -0.29 | 0.06 | 0.31 |
BM5 | 20003.17 | 390.5 | 0.362 | 0.00 | -0.29 | 0.06 | 0.05 |
BM6 | 20004.96 | 389.9 | 0.363 | 0.00 | -0.03 | 0.06 | 0.82 |
Proposed | 21194.12 | 246.6 | 0.590 | 0.80 | -0.50 | 0.06 | 0.05 |
The outcome PR is incomplete due to its high missing rate. The PR is recorded when users made purchases during the experiment's data collection period, but not when either of the following occurred: users did not make purchases, or the platform was unable to record the purchases before the end of the experiment's data collection period. To impute PR and thus identify visitors and dropout buyers, we use the following informative covariates: the treatment assignment, the number of sessions, the number of sessions with searches, the number of sessions with qualified events highly related to purchases at eBay, and the user's buying characteristics. The treatment assignment is pre-determined before running the experiment to assign users to the treatment group and the control group. The number of sessions corresponds to the number of sessions users have throughout the experiment. The number of sessions with searches is the number of sessions that contain at least one search activity. The number of sessions with qualified events is the number of sessions that include at least one qualified event activity. The buying characteristics of users are their historical purchasing patterns at eBay. These covariates are complete and do not have missing values. We impute the outcome PR using the proposed cluster-based imputation method. In the stratification step, we divide the large-scale data set into smaller subsets based on two variables: the treatment assignment and the user's buying characteristic. When performing clustering within each stratum, we use the number of sessions, the number of sessions with searches, and the number of sessions with qualified events.
In Table 3, we compare the performance of the proposed cluster-based imputation method and the benchmark methods. The proposed method has a smaller mean in the control group and a smaller ZR than the other methods except for BM${_{4}}$. The proposed imputation method identifies visitors and dropout buyers among the missing values; that is, it imputes zeros for visitors, who form a portion of the users with missing outcomes, and positive values for dropout buyers. Compared to BM${_{4}}$, the proposed imputation method imputes fewer zeros and thus has a larger mean in the control group. Compared to the other mean-imputation methods, which impute all missing values with a single value, the proposed imputation method has more zeros and a smaller mean in the control group. The proposed method has a larger CV in the control group than all other methods, with the exception of BM${_{4}}$. This is largely attributable to the change in the mean of the control group, as the pooled standard errors for all methods, with the exception of BM${_{4}}$, are quite close. The proposed method has the smallest lift, and all methods have a consistent direction of lift. Based on the p-value and a Type I error rate of 10%, the proposed method and BM${_{5}}$ are statistically significant, indicating that there is sufficient evidence to reject the null hypothesis, whereas the other methods are not statistically significant. This is expected because single imputation methods are well known to dilute mean differences, producing results suggesting no difference between the control group and the treatment group. The proposed method has a larger variance in the control group and a larger SE than the other methods except for BM${_{1}}$. BM${_{1}}$ has a reduced sample size, resulting in the largest variance and SE for the control group. Unlike the other methods, with the exception of BM${_{1}}$, the proposed method does not ignore the variance among missing values, resulting in a greater variance.
Figure 1
Comparison of the mean across user segments between the proposed imputation method and the zero imputation method for the treatment group. The tick values on the vertical axis are omitted due to disclosure restrictions.
Figure 2
Comparison of zero rate across user segments between the proposed imputation method and the zero imputation method.
Figure 1 illustrates the increase in the mean of the control group across users' buying segments for the proposed cluster-based imputation method and the zero imputation method. Different user segments have different mean values, with the top two being frequent buyer levels II and III. The proposed imputation method has larger mean values than the zero imputation method in nearly all user segments. The frequent buyer levels II and III segments have considerably larger mean increases than the idle buyer levels. This suggests that dropout buyers are more likely to occur in frequent buyer levels II and III, while in segments such as the idle buyer levels, users with unrecorded outcomes are more likely to be visitors. This is consistent with the findings in Figure 2 regarding the allocation of the zero rate across user segments. Different user segments have varying degrees of zero rate. The zero rates for frequent buyer levels II and III are approximately 45%, whereas the zero rates for idle buyer levels II and III are above 90%. This is reasonable given that frequent buyers at levels II and III are more likely to make purchases, resulting in low zero rates for the outcome PR. The high zero rate corresponds to the low mean value in Figure 1.
Figure 3 shows the distribution of CV across user segments for the proposed imputation method and the zero imputation method. For both methods, the CV values for the frequent buyer levels are less than half of those for the idle buyer levels. However, the CV of the proposed method is consistently lower than that of the zero imputation method across all user segments. The decrease in the CV indicates an improvement in sensitivity for the outcome PR. This improvement in sensitivity is largely attributable to the change in mean values.
6 Discussion
Metrics provide strong evidence to support hypotheses in online experimentation and hence reduce debates in the decision-making process. This paper introduces the concept of dropout buyers and classifies users with incomplete metric values into two categories: visitors and dropout buyers. For the analysis of incomplete metrics, we propose a cluster-based k-nearest neighbors imputation method. The proposed imputation method considers both the experiment-specific features and users' activities along their shopping paths. It incorporates the uncertainty among missing values in the outcome metrics using the k-nearest neighbors method. To facilitate efficient imputation in large-scale data sets in online experimentation, the proposed method employs a combination of stratification and clustering. The stratification approach divides the entire large-scale data set into small subsets to improve computational efficiency in the clustering step. The clustering approach identifies inherent structural patterns to improve the performance of the k-nearest neighbors method within each cluster.
It is worth remarking that the kNN method used in this work considers the simple average of the responses of the nearest neighbors. A weighted average of the nearest neighbors has been proposed, in which data points in the neighborhood contribute differently to the decision based on their distances from the target point [10]; that is, nearby data points, which are closer to the target, have a higher influence on the decision than distant data points. Moreover, one could incorporate network structure information into the kNN for networked A/B testing [34]. Another direction for future research is to study ratio metrics [15] related to purchases in the proposed imputation framework. On the other hand, the proposed imputation method imputes missing values for each individual user with missing outcomes. It would be interesting to categorize users with missing outcomes into various hubs and investigate an imputation strategy for each hub of users altogether.