General Additive Network Effect Models

Bui, Trang; Steiner, Stefan H.; Stevens, Nathaniel T.

doi:10.51387/23-NEJSDS29

The New England Journal of Statistics in Data Science

General Additive Network Effect Models

Volume 1, Issue 3 (2023), pp. 342–360

Trang Bui Stefan H. Steiner Nathaniel T. Stevens

https://doi.org/10.51387/23-NEJSDS29

Pub. online: 27 April 2023 Type: Methodology Article

Open Access

Area: Statistical Methodology

Accepted
6 April 2023

Published
27 April 2023

Abstract

In the interest of business innovation, social network companies often carry out experiments to test product changes and new ideas. In such experiments, users are typically assigned to one of two experimental conditions with some outcome of interest observed and compared. In this setting, the outcome of one user may be influenced by not only the condition to which they are assigned but also the conditions of other users via their network connections. This challenges classical experimental design and analysis methodologies and requires specialized methods. We introduce the general additive network effect (GANE) model, which encompasses many existing outcome models in the literature under a unified model-based framework. The model is both interpretable and flexible in modeling the treatment effect as well as the network influence. We show that (quasi) maximum likelihood estimators are consistent and asymptotically normal for a family of model specifications. Quantities of interest such as the global treatment effect are defined and expressed as functions of the GANE model parameters, and hence inference can be carried out using likelihood theory. We further propose the “power-degree” (POW-DEG) specification of the GANE model. The performance of POW-DEG and other specifications of the GANE model are investigated via simulations. Under model misspecification, the POW-DEG specification appears to work well. Finally, we study the characteristics of good experimental designs for the POW-DEG specification. We find that graph-cluster randomization and balanced designs are not necessarily optimal for precise estimation of the global treatment effect, indicating the need for alternative design strategies.

1 Introduction

1.1 Context

Network data encode not only information about individuals but also the connections among them. Such data exist in many settings: students who are friends with one another [4, 33]; intersecting streets [27]; interconnected brain regions [30]; and social network users who “follow” or “friend” one another [47]. Network data can be represented by a graph where nodes represent individuals and an edge between two nodes indicates that there is a connection between them. For example, Figure 1 visualizes a Caltech Facebook network, which was collected in September 2005 [42, 43]. The version of the data used in this paper contains information about 770 users and was retrieved from the Network Data Repository [38]. In Figure 1, the circles (nodes) represent Caltech Facebook users and a line between two nodes (an edge) signifies a Facebook friendship between two users.

Figure 1

Network of Facebook connections among 770 Caltech Facebook users. The network was plotted using the igraph package with the Fruchterman-Reingold layout [14] generated from the qgraph package.

The dependency structure in network data poses distinctive challenges and opportunities and thus motivates specialized theory and methodology. In this paper, we focus on network experimentation, a topic that has gained recent interest given the widespread experimentation happening in social network companies like LinkedIn, Facebook, and Twitter [17]. These businesses regularly develop updates and new features to their platforms. To decide whether a new feature or update should be deployed for all users, these companies conduct controlled experiments on their user networks to test a new feature and evaluate its causal impact [34, 47].

Current literature on network experimentation generally focuses on the setting where the experiment is conducted on a given network of a fixed number of individuals. Connections among individuals on the network are further assumed to be known and unchanged over the duration of the experiment. In addition, the experiment usually involves only two experimental conditions: the proposed new condition (i.e., the treatment) and the existing or baseline condition (i.e., the control). This type of experiment is also known as an A/B test, since two conditions A (control) and B (treatment) are compared to determine which one performs better in terms of some business metric. For example, the outcome of interest may be a measure of engagement like time spent on the platform, and the goal of the experiment may be to determine which condition maximizes this metric. In general, the experimental outcome is measured on experimental units, who are assigned to the experimental conditions. In the social network setting, the experimental units are often the platform users.

There are two stages in running an experiment: designing the experiment and analyzing the results. In the design stage, the experimenter determines the treatment assignment for the experimental units. Then in the analysis stage, data collected from the experiment are used to estimate the effect of the treatment on the outcome of interest. A popular estimand in the literature is the global treatment effect (sometimes alternatively referred to as the average treatment effect), which is the average difference in experimental outcomes when everyone in the population is assigned to the treatment versus when everyone is assigned to the control [9, 12]. A significantly positive global treatment effect (in the case where higher outcome values are desired) would motivate a business decision to deploy the treatment instead of the control to all users and vice versa. As such, an accurate estimate of the global treatment effect is an important objective to consider when developing experimental design and analysis methodologies. Another relevant concern in network experimentation is the estimation of the network effect: the influence of the network on the outcome of interest.

Classical experimental design and analysis methodologies treat individuals (or experimental units) as independent entities, in the sense that the potential outcome of each unit is assumed to be independent of the experimental conditions the other units are assigned. This is known as the Stable Unit Treatment Value Assumption (SUTVA) [11]. However, when experimental units are connected on a network, SUTVA is often violated. For example, consider an experiment where the treatment is designed to increase the time a user spends on the social network. Suppose Connie is assigned to control and her friend, Tracy, is assigned to treatment. Due to the treatment, suppose Tracy spends more time on the network. Connie may observe Tracy’s increased activity and interact with Tracy more, resulting in an increased amount of time spent on the platform despite being in the control group. Thus, the fact that Tracy is assigned to the treatment affects the potential outcome of Connie. As such, SUTVA no longer holds and the potential outcome of an individual depends not only on their own treatment assignment but also the treatment assignment of other individuals in the network. Consequently, in a network setting, the global treatment effect depends not only on the experimental conditions but also on the structure of the network. Therefore, specialized methods are required for the design and analysis of network experiments.

1.2 Related Work

There are two major classes of approaches to the design and analysis of experiments on networks. The first base their methods on the exposure framework [2, 12, 16], in which experimental units (nodes) are classified into exposure groups where group members have the same level of exposure to the treatment. In a network experiment, even when two units are assigned to the same experimental condition, they may be exposed to different levels of the treatment (e.g., based on the treatment assignment of those around them) and thus belong to different exposure groups. For example, units i and j may belong to different exposure groups if all neighbors of unit i are assigned to treatment while all neighbors of unit j are assigned to control even when units i and j have the same treatment status. The literature often defines exposure groups based on local treatment assignment patterns of neighbors [2, 12, 39]. Based on the defined exposures, various causal effects, such as the global treatment effect and others [20, 41], can be defined as contrasts of the exposure groups’ potential outcomes [2]. Accordingly, estimators can be derived using inverse probability weighting [2, 12]. As an example, Gui et al. [16] define the treatment exposure as containing units such that (i) they are assigned to the treatment and (ii) the majority of their neighbors are also assigned to the treatment. Similarly, units are said to belong to the control exposure if (i) they are assigned to the control and (ii) the majority of their neighbors are also assigned to the control. The potential outcome of the treatment exposure group serves as a proxy for the outcome observed if the whole network is assigned to treatment. Similarly, the control exposure group is assumed to behave as if the whole network is assigned to control. The global treatment effect can then be estimated by taking the (inverse probability weighted) difference between the average outcome of the treatment exposure group and that of the control exposure group.

To achieve a precise estimator of the global treatment effect, each exposure group needs to contain a large number of units. In other words, the exposure-based analysis strategy requires that experimental units are mostly surrounded by neighbors assigned to the same experimental conditions as themselves. This motivates a popular design strategy in the network experimentation literature based on graph-cluster randomization [8, 16, 23, 44]. Graph-cluster randomization designs first partition the network into clusters, i.e., subgraphs of nodes that are densely connected within each cluster and sparsely connected between clusters. These clusters are then randomly assigned to treatment or control and all units in each cluster are assigned to the same experimental condition. This ensures that most units share the same experimental condition as their neighbors. Eckles et al. [12] illustrate via simulations that graph-cluster randomization, together with inverse-probability-weighted estimators, reduces bias in estimating the global treatment effect in experiments on networks.

The second class of approaches is based on assuming a statistical model for the experimental outcomes. There is a rich literature on network regression models for inference and prediction problems [1, 7, 22, 29, 31]. However, for network experiments, models need to characterize and quantify the influence of the treatment assignment to the outcome of interest, compared to the control, while taking into account the network influence. Parker et al. [35] and Koutra et al. [26] posit that an experimental condition can affect the outcome of a unit in two ways: (i) through the unit’s own treatment assignment and/or (ii) through the treatment assignment of the unit’s neighbors. These effects are assumed to be fixed and additive and then modeled via an ordinary linear regression model. Basse and Airoldi [5] and Pokhilko et al. [36] instead suggest that the experimental conditions only assert influence via the nodes’ own treatment assignment. Network interference is then modeled via the correlation of random errors based on the network structure, in a similar fashion to the conditional autoregressive (CAR) model from the spatial statistics literature [3, 10].

In the model-based framework, once the model is formalized, experimenters can select design criteria to reduce, for example, the variance of the model parameters and/or functions of them. Designs that optimize the posed design criteria can be found using exhaustive search [26, 35], random search [35] or other optimal design search methods [4, 36]. In the analysis stage, model parameters will be estimated by fitting the model to the observed data using methods such as least squares or maximum likelihood.

Of the two classes of approaches, the exposure framework seems more popular in the network experimentation literature thanks to its focus on the global treatment effect. When the main goal of the experiment is to accurately estimate the global treatment effect, the exposure framework only requires the experimenters to define the treatment and control exposures. For instance, in the Gui et al. [16] method the experimenter must determine what percentage of a unit’s neighbors must be assigned to treatment for it to be classified into the treatment exposure group. After that, graph-cluster randomization can be applied as a general design strategy to reduce the bias of the global treatment effect estimate. Nevertheless, there exist pitfalls associated with this class of approaches. Chin [9] points out that social networks are usually locally dense, making it hard for algorithms to partition the network into separate clusters. In addition, experimental outcomes on units not classified into the treatment or the control exposures will be wastefully discarded in the analysis stage [16], resulting in a low effective sample size [39].

On the contrary, if the generating model for the experimental outcome is correctly specified, the model-based framework will have several advantages over the exposure framework. First, data from all units participating in the experiment can be utilized when fitting the model in the analysis stage. Second, additional inference beyond the global treatment effect can be achieved via model parameters and functions of them. Finally, design selection can be calibrated according to the experimenters’ specific interests, which may sometimes be something other than the global treatment effect.

1.3 Our Contribution

In this paper, we attempt to unify many of the model-based approaches by introducing the general additive network effect (GANE) model, in which the experimental outcome of a unit is modeled as an additive function of the effect of its own treatment assignment and the effects that come from other treated or controlled nodes in the network. The model not only encompasses (as special cases) many network experiment models from the literature, but it also flexibly extends the manner in which network effects are modeled. In particular, the network influence from the treatment and control groups can be modeled differently in terms of both size and functional form. Moreover, the model is specified such that its parameters are interpretable. We show that (quasi) maximum likelihood estimation is possible for a family of model specifications. The resulting (quasi) maximum likelihood estimators are then proven to be consistent and asymptotically normal, which facilitates familiar likelihood-based inference.

Existing work in the model-based direction concentrates on inference for the model parameters themselves, however, in many cases, the global treatment effect is the main quantity of interest. To expand the utility of the model-based framework, we illustrate how quantities such as the global treatment effect can be expressed as functions of the GANE model parameters, which permits inference via the delta method [45]. Similarly, via the model parameters, the experimenters can also build design criteria and select experimental designs that optimize these criteria. Overall, through the GANE model, we provide a generalized model-based framework for the design and analysis of experiments on networks.

We further propose a particular specification of the GANE model, the power-degree (POW-DEG) specification, where the network effects are modeled as a nonlinear function of the number of neighbors assigned to treatment or control. Using the power as an additional parameter, the POW-DEG model specification gains additional flexibility in capturing the network effect compared to the linear counterpart that has been proposed in the literature [35]. We investigate inferential properties of the POW-DEG specification as well as other existing specifications in the literature via simulations on real-life networks. We find that the POW-DEG specification yields good estimates of the true global treatment effect, even under model misspecification. By examining designs in the context of the POW-DEG specification, we also find that cluster randomization and balanced treatment assignment are not necessarily optimal for this model specification and other design strategies should therefore be explored in the future.

As with any model-based approach, the GANE model framework faces the challenge of model selection. When the GANE model is estimated via maximum likelihood, experimenters can use popular model selection criteria such as AIC, BIC, etc. to choose the model specification that best fits the data. However, in the design phase, before data have been collected, experimenters must use their domain knowledge to select a suitable model specification, with which they can build design criteria and select a design. In the absence of information concerning which model specification is appropriate, we suggest the use of the POW-DEG specification because of its flexibility and good performance in terms of global treatment effect estimation and inference. Nevertheless, model selection for both the design and analysis of experiments on networks remains an open area for future study.

The rest of the paper is structured as follows. In Section 2, we introduce the general additive network effect (GANE) model framework. We present the theoretical results for maximum likelihood estimation in Section 3. In Section 4 we report the results of several simulation studies used to investigate the design and inferential properties of a few particular specifications of the GANE model. In Section 5 we conclude with a summary and discussion of several extensions to this work.

2 General Additive Network Effect Model

2.1 Modeling Framework

Suppose that the experiment is on a network of n experimental units. Let $\mathcal{G}=(\mathcal{V},\mathcal{E})$ denote the network where $\mathcal{V}$ denotes the set of experimental units (i.e., nodes) and $\mathcal{E}$ denotes the set of connections among experimental units (i.e., edges). As in most of the network experiment literature, we assume that $\mathcal{G}$ is known and fixed throughout the experiment. Furthermore, $\mathcal{G}$ is assumed to be undirected and simple, and so $\mathcal{G}$ can be represented by an $n\times n$ adjacency matrix A where for $i,j=1,2,\dots ,n$

\[ {A_{ij}}\hspace{0.1667em}=\hspace{0.1667em}{A_{ji}}\hspace{0.1667em}=\hspace{0.1667em}\left\{\begin{array}{l@{\hskip10.0pt}l}\hspace{-0.1667em}1\hspace{1em}& \hspace{-0.1667em}\hspace{-0.1667em}\text{if units}\hspace{2.5pt}i\ne j\hspace{2.5pt}\text{are connected}\\ {} \hspace{-0.1667em}0\hspace{1em}& \hspace{-0.1667em}\hspace{-0.1667em}\text{if}\hspace{2.5pt}i=j\hspace{2.5pt}\text{or if units}\hspace{2.5pt}i\ne j\hspace{2.5pt}\text{are not connected}\end{array}\right..\]

If ${A_{ij}}=1$, unit i and j are said to be neighbors. The number of neighbors of unit i, i.e., the sum of the ith row (or ith column) in the adjacency matrix, is called the degree of unit i, which we denote by ${k_{i}}$. We assume the treatment vs. control setting and let ${Z_{i}}$ be the binary indicator denoting the experimental condition of unit i, where ${Z_{i}}=1$ if unit i is assigned to treatment and ${Z_{i}}=0$ if unit i is assigned to control. Let ${Y_{i}}$ denote the experimental outcome of unit i and further let $\mathbf{D}=(\mathbf{A},\mathbf{Z},\mathbf{Y},\mathbf{X})$ denote the experimental data where Z and Y are the treatment assignment and outcome vectors respectively, A is the adjacency matrix defined above, and X is a possible $n\times p$ matrix containing the units’ covariates.

Parker et al. [35] introduce the linear network effect (LNE) model

(2.1)

\[ {Y_{i}}=\mu +\tau {Z_{i}}+{\gamma _{T}}{\sum \limits_{j=1}^{n}}{A_{ij}}{Z_{j}}+{\gamma _{C}}{\sum \limits_{j=1}^{n}}{A_{ij}}(1-{Z_{j}})+{\epsilon _{i}},\]

where ${\epsilon _{i}}\sim \mathcal{N}(0,1)$ independently for all $i=1,2,\dots ,n$. The model assumes that when a unit is assigned to the treatment group, the unit itself will experience an effect of size τ while exerting an effect of size ${\gamma _{T}}$ on each of its neighbors. If a unit is assigned to the control, it will exert an effect of size ${\gamma _{C}}$ on each of its neighbors. The outcome of a unit i is then the sum of the baseline μ, the effect of the treatment assignment, and the sum of the effects from its neighbors plus some random error ${\epsilon _{i}}$. In other words, the experimental outcome of unit i is a linear combination of which experimental condition it belongs to and the numbers of its neighbors that are assigned to treatment (treatment degree) and control (control degree). The model offers a straightforward parameter interpretation: μ is the expected baseline outcome, τ is the effect of the treatment assignment, and ${\gamma _{T}}$ and ${\gamma _{C}}$ quantify the effect of the neighbors’ treatment assignments. The model is linear in the parameters and can therefore be fit using ordinary least squares.

However, on a network, the outcome of a unit may not only be influenced by the number of neighbors it has. It may also be influenced by the outcomes of its neighbors [1, 40], or by neighbors of neighbors, etc. To enable experimenters to model the experimental outcomes more flexibly, we propose a more general model for the design and analysis of experiments on networks, which we call the general additive network effect (GANE) model:

(2.2)

\[ {Y_{i}}=\mu +\tau {Z_{i}}+{f_{T,i}}(\mathbf{D},\boldsymbol{\eta })+{f_{C,i}}(\mathbf{D},\boldsymbol{\eta })+{\epsilon _{i}},\]

where ${f_{T,i}}$ models the effect of treated units on unit i and ${f_{C,i}}$ models the effect of controlled units. Hence, ${f_{T}}$ and ${f_{C}}$ are functions of the network (via the adjacency matrix A) and the treatment assignment vector Z. They may also be functions of the outcome vector Y, a possible covariate matrix X, and a parameter vector $\boldsymbol{\eta }$. By separating the network effect based on the source (treatment or control), the GANE model not only allows the network effect to come from both treated and controlled neighbors, but the network effect can also be of different forms and sizes, i.e., ${f_{T}}$ and ${f_{C}}$ may have different forms and/or may depend on different parameters (which are all contained in the vector $\boldsymbol{\eta }$). The functional forms of ${f_{T}}$ and ${f_{C}}$ may be determined by the experimenters’ domain knowledge or other considerations.

The GANE model has a structure similar to that of the LNE model (2.1) by Parker et al. [35], in which μ parametrizes the baseline outcome, τ parametrizes the effect of the treatment assignment to a unit and ${f_{T}}$ and ${f_{C}}$ respectively model the influence the unit receives from other treated and controlled units. Therefore, it retains the clear interpretability of Model (2.1) while increasing the flexibility with which the influence of other nodes is modeled.

2.2 Model Specifications

The GANE model encompasses several existing models in the literature, a few of which are identified below.

The Linear Network Effect (LNE) Model: Model (2.1) is a GANE model with ${f_{T}}$ modeling the treatment degree and ${f_{C}}$ modeling the control degree:

(2.3)

\[\begin{aligned}{}{f_{T,i}}(\mathbf{D},\boldsymbol{\eta })& ={\gamma _{T}}{\sum \limits_{j=1}^{n}}{A_{ij}}{Z_{j}}\\ {} {f_{C,i}}(\mathbf{D},\boldsymbol{\eta })& ={\gamma _{C}}{\sum \limits_{j=1}^{n}}{A_{ij}}(1-{Z_{j}}),\end{aligned}\]

where $\boldsymbol{\eta }={({\gamma _{T}},{\gamma _{C}})^{\top }}$.

The Local Aggregate (LAG) Model: Advani and Malde [1] propose that the outcome of a unit may be influenced by the sum of its neighbors’ outcomes in cases such as (i) a person’s criminal behavior might depend on the number of crimes committed by their friends; (ii) the number of items purchased by a customer may depend on the number of items purchased by their friends; and (iii) the number of hours a student spends studying may depend on the number of hours their friends spend studying. Accordingly, under the GANE model, functions ${f_{T}}$ and ${f_{C}}$ can respectively model the sum of the treated and controlled neighbors’ outcomes:

(2.4)

\[\begin{aligned}{}{f_{T,i}}(\mathbf{D},\boldsymbol{\eta })& ={\rho _{T}}{\sum \limits_{j=1}^{n}}{A_{ij}}{Z_{j}}{Y_{j}},\\ {} {f_{C,i}}(\mathbf{D},\boldsymbol{\eta })& ={\rho _{C}}{\sum \limits_{j=1}^{n}}{A_{ij}}(1-{Z_{j}}){Y_{j}},\end{aligned}\]

where $\boldsymbol{\eta }={({\rho _{T}},{\rho _{C}})^{\top }}$.

The Homophily (HOM) Model: In the social economics literature, a unit’s outcome is often modeled as a function of the average outcome of its neighbors [7, 31]. This is often referred to as the homophily effect [40] which is summarized by the conjecture that “you are the average of the people around you”, and a result of people’s desire to conform to social norms [1]. Gui et al. [16] formalize a model with homophily effect for the analysis of a network A/B test as follows:

(2.5)

\[ {Y_{i}}=\mu +\tau {Z_{i}}+\gamma {\sum \limits_{j=1}^{n}}{A_{ij}}{Z_{j}}+\rho \frac{1}{{k_{i}}}{\sum \limits_{j=1}^{n}}{A_{ij}}{Y_{j}}+{\epsilon _{i}}.\]

That is, besides the homophily effect, each treated node will exert an effect of size γ on each of its neighbors. This model can be reparameterized and written in the GANE model framework with

\[\begin{aligned}{}{f_{T,i}}(\mathbf{D},\boldsymbol{\eta })& ={\gamma _{T}}{\sum \limits_{j=1}^{n}}{A_{ij}}{Z_{j}}+{\rho _{T}}\frac{1}{{k_{i}}}{\sum \limits_{j=1}^{n}}{A_{ij}}{Z_{j}}{Y_{j}},\\ {} {f_{C,i}}(\mathbf{D},\boldsymbol{\eta })& ={\rho _{C}}\frac{1}{{k_{i}}}{\sum \limits_{j=1}^{n}}{A_{ij}}(1-{Z_{j}}){Y_{j}}.\end{aligned}\]

In this case, $\boldsymbol{\eta }={({\rho _{T}},{\rho _{C}},{\gamma _{T}})^{\top }}$ with constraints ${\rho _{T}}={\rho _{C}}=\rho $ and ${\gamma _{T}}=\gamma $.

The above are existing models that have been proposed and discussed in the literature. We next propose a new specification under the GANE framework.

The Power-Degree (POW-DEG) Model: Let us continue with the example in Section 1.1 where the outcome of interest is the amount of time that users spend on a social network platform. Recall, since Tracy is assigned to treatment, Connie (in the control group) may spend more time on the platform simply because she is Tracy’s friend. However, Connie’s time increase due to her first treated friend may not necessarily be equal to the time increase due to her 100th treated friend. In particular, as the number of treated friends grows, the effect of an additional treated friend is likely to decrease. This intuition is similar to the law of diminishing marginal utility in economics [15].

When the number of neighbors is low, for example, in agricultural settings where units are plots placed on a lattice, it may be reasonable to model the network effects homogeneously as in the LNE specification (2.3). However, in the social network setting where a user can have hundreds to thousands of friends, this linearity assumption seems less likely. As a result, we suggest modeling the network effect as a non-linear function of the number of neighbors (i.e., the degree). In particular, we propose the following GANE model specification with

(2.6)

\[\begin{aligned}{}{f_{T,i}}(\mathbf{D},\boldsymbol{\eta })& ={\gamma _{T}}{\left({\sum \limits_{j=1}^{n}}{A_{ij}}{Z_{j}}\right)^{\lambda }},\\ {} {f_{C,i}}(\mathbf{D},\boldsymbol{\eta })& ={\gamma _{C}}{\left({\sum \limits_{j=1}^{n}}{A_{ij}}(1-{Z_{j}})\right)^{\lambda }},\end{aligned}\]

where $\boldsymbol{\eta }={({\gamma _{T}},{\gamma _{C}},\lambda )^{\top }}$. We call this the power-degree (POW-DEG) specification because the network effects are modeled as powers of the treatment and control degrees. The power parameter λ serves to temper the growth of network effects as the treatment and control degrees increase, and so we expect that $0\lt \lambda \lt 1$. However, in the interest of ample flexibility, we do not make this assumption. We allow for the possibility that $\lambda \gt 1$ and also the possibility of the LNE specification (2.3) arising as a special case when $\lambda =1$.

Which estimation method is appropriate for fitting the GANE model depends on the specification of the functions ${f_{T}}$ and ${f_{C}}$. For instance, estimation and inference for the LNE specification (2.3) can be performed using ordinary least squares. However, when ${f_{T}}$ and ${f_{C}}$ are more complicated such as in specifications (2.4) or (2.5), the GANE model becomes (spatially) autoregressive, in which case maximum likelihood estimation is required.

2.3 Quantities of Interest

As discussed in Section 1.2, compared to the exposure framework, one of the advantages of the model-based framework is that more quantities can be easily estimated, assuming they can be expressed as functions of the model parameters. Here, we review several such quantities that may be of interest to experimenters.

Global treatment effect (GTE): As noted already, an important quantity for business decision-making is the global treatment effect (GTE), which is defined as the difference in average outcomes when everyone in the network is assigned to the treatment versus when everyone is assigned to the control. Note that GTE measures the treatment effect at the global level, instead of the individual level, taking into account the structure of the network and possible network effects. In the GANE framework, this quantity can be expressed as

(2.7)

\[\begin{aligned}{}\text{GTE}=& \mathbb{E}\left[\frac{1}{n}{\sum \limits_{i=1}^{n}}{Y_{i}}\hspace{2.84526pt}\Bigg|\hspace{2.84526pt}\mathbf{Z}={\mathbf{1}_{n}}\right]-\mathbb{E}\left[\frac{1}{n}{\sum \limits_{i=1}^{n}}{Y_{i}}\hspace{2.84526pt}\Bigg|\hspace{2.84526pt}\mathbf{Z}={\mathbf{0}_{n}}\right]\\ {} =& \tau +\frac{1}{n}{\sum \limits_{i=1}^{n}}\mathbb{E}\big[{f_{T,i}}({\mathbf{D}_{\mathbf{Z}={\mathbf{1}_{n}}}},\boldsymbol{\eta })\big]\\ {} & -\frac{1}{n}{\sum \limits_{i=1}^{n}}\mathbb{E}\big[{f_{C,i}}({\mathbf{D}_{\mathbf{Z}={\mathbf{0}_{n}}}},\boldsymbol{\eta })\big]\end{aligned}\]

where the subscripts on D indicate that the functions are evaluated when all the experimental units are assigned to treatment or control, respectively. Note that the expectations in the second equivalence are only necessary when ${f_{T}}$ and/or ${f_{C}}$ are functions of the outcome Y. As shown, the GTE can be expressed as a function of the model parameters. Moreover, GTE can be decomposed into two components, the direct and indirect treatment effects, which aid interpretation.

Direct treatment effect (DTE): Traditionally, the direct treatment effect with respect to a treatment assignment vector Z is defined as the average difference in expected outcomes when the treatment assignment of one unit changes while the others remain the same, i.e.,

\[ \text{DTE}(\mathbf{Z})=\frac{1}{n}{\sum \limits_{i=1}^{n}}\bigg(\mathbb{E}\big[{Y_{i}}\big|{Z_{i}}=1,{\mathbf{Z}_{-i}}\big]-\mathbb{E}\big[{Y_{i}}\big|{Z_{i}}=0,{\mathbf{Z}_{-i}}\big]\bigg),\]

where ${\mathbf{Z}_{-i}}$ denotes vector Z without the ith element. This definition is difficult to interpret and calculate, especially when the network effect functions are complicated. So instead, we define the direct treatment effect as the expected difference in outcomes when a node is assigned to the treatment versus when a node is assigned to control, keeping the network effects fixed. In the GANE model framework, the direct treatment effect is simply

(2.8)

\[ \text{DTE}=\tau .\]

Our definition is clear, easy to interpret, easy to calculate, and does not depend on any specific treatment assignment vector. Hence, it can be used across all specifications of the GANE model.

Indirect treatment effect (ITE): Interest may also lie in quantifying the amount of the global treatment effect due to the network. The ITE is therefore defined as the difference between the global treatment effect and the direct treatment effect. In the GANE framework, this is

(2.9)

\[\begin{aligned}{}\text{ITE}=& \text{GTE}-\tau \\ {} =& \frac{1}{n}{\sum \limits_{i=1}^{n}}\mathbb{E}\big[{f_{T,i}}({\mathbf{D}_{\mathbf{Z}={\mathbf{1}_{n}}}},\boldsymbol{\eta })\big]-\frac{1}{n}{\sum \limits_{i=1}^{n}}\mathbb{E}\big[{f_{C,i}}({\mathbf{D}_{\mathbf{Z}={\mathbf{0}_{n}}}},\boldsymbol{\eta })\big].\end{aligned}\]

Hence, the indirect treatment effect can also be interpreted as the difference between the network effect induced by the treatment versus that induced by the control.

Each of these quantities can be expressed as functions of the model parameters μ, τ, and $\boldsymbol{\eta }$. When the expectations of ${f_{T}}$ and ${f_{C}}$ are known, estimates and hypothesis tests for these quantities can be developed based on parametric inference associated with the model. With respect to hypothesis tests, the experimenters may be interested in one or more of the following hypotheses.

Hypothesis 1 (Direct treatment effect).

${H_{01}}:\text{DTE}=\tau =0$ is the null hypothesis that the direct treatment effect is 0, i.e., keeping the network effect fixed, a node’s outcome is the same no matter if it is assigned to treatment or to control.

Hypothesis 2 (SUTVA).

${H_{02}}:{f_{T}}={f_{C}}=0$ is the null hypothesis that there is no network effect and the SUTVA is satisfied.

Hypothesis 3 (Indirect treatment effect).

${H_{03}}:\text{ITE}=0$ is the null hypothesis that the indirect treatment effect is 0, i.e., the network influence from treated and controlled neighbors is the same.

Hypothesis 4 (Global treatment effect).

${H_{04}}:\text{GTE}=0$ is the null hypothesis that the global treatment effect is 0, i.e., on average, treatment does not have an effect on the nodes’ outcomes.

When the GANE model can be estimated using maximum likelihood estimation, since DTE, ITE, and GTE are functions of the model parameters, Hypotheses 1, 3, and 4 can be tested using Wald-type tests, and Hypothesis 2 can be tested using a likelihood ratio test.

2.4 Experimental Design

To design an A/B test, the experimenter decides which units are assigned to the treatment group and which units are assigned to the control group. This corresponds to specifying the treatment assignment vector Z where each unit i is assigned ${Z_{i}}=0$ or 1. Interest lies in determining an optimal treatment assignment.

Under the GANE framework, experimenters can define design criteria related to the variance of the parameters’ estimators. For example, D-optimality [37] aims to minimize the determinant of the variance-covariance matrix and is therefore a popular choice to generally minimize the confidence region for the parameters. However, in the network experiment context, as discussed in Section 2.3, GTE (or possibly DTE or ITE) is the primary quantity of interest. Therefore, in the experimental design simulations in Section 4.4, we will focus on the variance of GTE estimates as our design criterion.

With the design criterion determined, experimenters can find good designs using methods such as exchange algorithms [26, 35], or random search [35]. Random search refers to the process in which the design is chosen by first randomly generating a large number of designs, and then choosing the design with the best-evaluated design criteria. Exchange algorithms [26, 35] instead take a greedy approach by iteratively changing the treatment assignment for each unit in the direction of optimizing the design criteria. Via simulation, Parker et al. [35] find that random search, despite its simplicity, yields nearly as good designs as the computationally less efficient exchange algorithm. We thus use random search as a design selection strategy in our simulations in Section 4.4.

3 Maximum Likelihood Inference

3.1 Estimation

We have discussed in Section 2.2 that different specifications of the GANE model may require different estimation techniques. In Section 2.3, we also mentioned that when maximum likelihood estimation is possible, different hypotheses can be tested using the maximum likelihood framework. Therefore, in this section, we discuss the maximum likelihood estimation for the GANE model.

In order to obtain the likelihood of the outcome vector Y, we consider the family of GANE specifications where the outcome ${Y_{i}}$ either (i) does not depend on neighboring outcomes, or (ii) depends linearly on neighboring outcomes. That is,

(3.1)

\[\begin{aligned}{}{f_{T,i}}(\mathbf{D},\boldsymbol{\eta })& ={\rho _{T}}{\sum \limits_{j=1}^{n}}{W_{T,ij}}{Y_{j}}+{\gamma _{T}}{g_{T,i}}(\boldsymbol{\varphi }),\\ {} {f_{C,i}}(\mathbf{D},\boldsymbol{\eta })& ={\rho _{C}}{\sum \limits_{j=1}^{n}}{W_{C,ij}}{Y_{j}}+{\gamma _{C}}{g_{C,i}}(\boldsymbol{\varphi }),\end{aligned}\]

where $\boldsymbol{\eta }={({\rho _{T}},{\rho _{C}},{\gamma _{T}},{\gamma _{C}},{\boldsymbol{\varphi }^{\top }})^{\top }}$ and ${W_{T,ij}}$ (or ${W_{C,ij}}$) is the ${(i,j)^{th}}$ element of pre-specified weight matrix ${\mathbf{W}_{T}}$ (or ${\mathbf{W}_{C}}$). The diagonals of these weight matrices are zero, i.e., ${W_{l,ii}}=0$ for $l\in \{T,C\}$ and $i=1,\dots ,n$. In addition, ${g_{T,i}}(\boldsymbol{\varphi })$ and ${g_{C,i}}(\boldsymbol{\varphi })$ are real-valued functions, possibly depending on the parameter vector $\boldsymbol{\varphi }$, the experimental data D, but not the outcome vector Y. We can see that Model (3.1) generalizes all model specifications discussed in Section 2.2, in which the outcome of an experiment may depend linearly on other unit’s outcomes and/or possibly nonlinearly on other covariates. Model (3.1), however, excludes cases where the outcome of unit i is dependent on a nonlinear function of the outcome vector Y, which complicates the maximum likelihood theory.

To perform estimation, we consider the matrix form of Model (3.1) as follows

(3.2)

\[\begin{aligned}{}\mathbf{Y}=& \mu {\mathbf{1}_{n}}+\tau \mathbf{Z}+\big({\rho _{T}}{\mathbf{W}_{T}}\mathbf{Y}+{\gamma _{T}}{\mathbf{G}_{T}}(\boldsymbol{\varphi })\big)\\ {} & +\big({\rho _{C}}{\mathbf{W}_{C}}\mathbf{Y}+{\gamma _{C}}{\mathbf{G}_{C}}(\boldsymbol{\varphi })\big)+\boldsymbol{\epsilon },\end{aligned}\]

where ${\mathbf{G}_{T}}(\boldsymbol{\varphi })$ denotes the $n\times 1$ vector of ${g_{T,i}}(\boldsymbol{\varphi })$ values and ${\mathbf{G}_{C}}(\boldsymbol{\varphi })$ denotes the $n\times 1$ vector of ${g_{C,i}}(\boldsymbol{\varphi })$ values. Let $\mathbf{M}(\boldsymbol{\varphi })=[{\mathbf{1}_{n}}\hspace{5.69054pt}\mathbf{Z}\hspace{5.69054pt}{\mathbf{G}_{T}}(\boldsymbol{\varphi })\hspace{5.69054pt}{\mathbf{G}_{C}}(\boldsymbol{\varphi })]$ be the model matrix, which depends on the parameter vector $\boldsymbol{\varphi }$. Further, let $\boldsymbol{\beta }={(\mu ,\tau ,{\gamma _{T}},{\gamma _{C}})^{\top }}$ and $\boldsymbol{\rho }={({\rho _{T}},{\rho _{C}})^{\top }}$. The model may be rewritten, isolating for Y on the left-hand side, as follows

(3.3)

\[\begin{aligned}{}\mathbf{Y}& =\big({\rho _{T}}{\mathbf{W}_{T}}+{\rho _{C}}{\mathbf{W}_{C}}\big)\mathbf{Y}+\mathbf{M}(\boldsymbol{\varphi })\boldsymbol{\beta }+\boldsymbol{\epsilon },\\ {} & =\mathbf{S}{(\boldsymbol{\rho })^{-1}}\bigg(\mathbf{M}(\boldsymbol{\varphi })\boldsymbol{\beta }+\boldsymbol{\epsilon }\bigg),\end{aligned}\]

where $\mathbf{S}(\boldsymbol{\rho })={\mathbf{I}_{n}}-{\rho _{T}}{\mathbf{W}_{T}}-{\rho _{C}}{\mathbf{W}_{C}}$. The expression in (3.3) is possible if and only if $\mathbf{S}(\boldsymbol{\rho })$ is invertible. Lemma 1 gives sufficient conditions on $\boldsymbol{\rho }$ so that $\mathbf{S}(\boldsymbol{\rho })$ is nonsingular. Although the condition is based on any matrix norm, in practice, we can use the popular spectral norm [19] to derive the constraints. The proof of Lemma 1 is given in Appendix A.1.

Lemma 1.

\[ \max \left(|{\rho _{T}}|,|{\rho _{C}}|\right)\lt \frac{1}{||{\mathbf{W}_{T}}||+||{\mathbf{W}_{C}}||},\]

where $||\cdot ||$ denotes a matrix norm [19], then $\mathbf{S}(\boldsymbol{\rho })$ is invertible.

With Y expressed as in Equation (3.3) and assuming $\boldsymbol{\epsilon }\sim \mathcal{N}({\mathbf{0}_{n}},{\sigma ^{2}}{\mathbf{I}_{N}})$, the log-likelihood function for Y is

(3.4)

\[\begin{aligned}{}\log L(\boldsymbol{\theta })=& -\frac{n}{2}\log (2\pi )-\frac{n}{2}\log ({\sigma ^{2}})+\log |\mathbf{S}(\boldsymbol{\rho })|\\ {} & -\frac{1}{2{\sigma ^{2}}}{\Big(\mathbf{S}(\boldsymbol{\rho })\mathbf{Y}-\mathbf{M}(\boldsymbol{\varphi })\boldsymbol{\beta }\Big)^{\top }}\Big(\mathbf{S}(\boldsymbol{\rho })\mathbf{Y}-\mathbf{M}(\boldsymbol{\varphi })\boldsymbol{\beta }\Big),\end{aligned}\]

where $\boldsymbol{\theta }={({\boldsymbol{\rho }^{\top }},{\boldsymbol{\beta }^{\top }},{\boldsymbol{\varphi }^{\top }},{\sigma ^{2}})^{\top }}$ is the vector of all model parameters. If the normality assumption is not made, then (3.4) becomes the quasi log-likelihood [28] and the estimators $\hat{\boldsymbol{\theta }}$ that maximize (3.4) are called the quasi maximum likelihood estimators.

To find the maximum likelihood estimates, we take the first order derivatives with respect to $\boldsymbol{\beta }$ and ${\sigma ^{2}}$ and equate them to zero to obtain

(3.5)

\[\begin{aligned}{}\hat{\boldsymbol{\beta }}(\boldsymbol{\rho },\boldsymbol{\varphi })=& {\Big(\mathbf{M}{(\boldsymbol{\varphi })^{\top }}\mathbf{M}(\boldsymbol{\varphi })\Big)^{-1}}\mathbf{M}{(\boldsymbol{\varphi })^{\top }}\mathbf{S}(\boldsymbol{\rho })\mathbf{Y};\end{aligned}\]

(3.6)

\[\begin{aligned}{}{\hat{\sigma }^{2}}(\boldsymbol{\rho },\boldsymbol{\varphi })=& \frac{1}{n}{\Big(\mathbf{S}(\boldsymbol{\rho })\mathbf{Y}-\mathbf{M}(\boldsymbol{\varphi })\hat{\boldsymbol{\beta }}(\boldsymbol{\rho },\boldsymbol{\varphi })\Big)^{\top }}\\ {} & \times \Big(\mathbf{S}(\boldsymbol{\rho })\mathbf{Y}-\mathbf{M}(\boldsymbol{\varphi })\hat{\boldsymbol{\beta }}(\boldsymbol{\rho },\boldsymbol{\varphi })\Big)\gt 0.\end{aligned}\]

Note that these are the solution of an ordinary least squares that regresses the transformed outcome variable $\mathbf{S}(\boldsymbol{\rho })\mathbf{Y}$ on the covariate matrix $\mathbf{M}(\boldsymbol{\varphi })$ when $\boldsymbol{\rho }$ and $\boldsymbol{\varphi }$ are known. Plugging this into the log-likelihood (3.4), we can obtain the profile log-likelihood

(3.7)

\[ {\ell _{\text{P}}}(\boldsymbol{\rho },\boldsymbol{\varphi })=-\frac{n}{2}\left[\log (2\pi )+1\right]+\log |\mathbf{S}(\boldsymbol{\rho })|-\frac{n}{2}\log {\hat{\sigma }^{2}}(\boldsymbol{\rho },\boldsymbol{\varphi }).\]

Then, we can find $\hat{\boldsymbol{\varphi }}$ and $\hat{\boldsymbol{\rho }}$ by maximizing ${\ell _{\text{P}}}$ using a grid search on their respective parameter space or using numerical algorithms such as the Nelder-Mead method [32]. The maximum likelihood estimates $\hat{\boldsymbol{\beta }}$ and ${\hat{\sigma }^{2}}$ can be obtained by plugging $\hat{\boldsymbol{\varphi }}$ and $\hat{\boldsymbol{\rho }}$ into Equations (3.5) and (3.6).

Note that in (3.5), it is implicitly required that $M{(\boldsymbol{\varphi })^{\top }}M(\boldsymbol{\varphi })$ is invertible, i.e., that the columns of the model matrix $M(\boldsymbol{\varphi })$ are linearly independent. Although unlikely, multicollinearity may exist. For instance, in the POW-DEG specification (2.6), when the graph is fully connected (i.e., every node is connected with one another), or when the treatment and/or control degrees are the same for all nodes, the model matrix will have linearly dependent columns. It is thus important in the design stage to choose a design that ensures the model matrix has full rank.

3.2 Asymptotic Results

Here, we study the behavior of the maximum likelihood estimators as the network size increases to infinity. We use the subscript n to denote the data for a given network size n. Model (3.3) then becomes

\[ {\mathbf{Y}_{n}}={\mathbf{S}_{n}}{(\boldsymbol{\rho })^{-1}}\bigg({\mathbf{M}_{n}}(\boldsymbol{\varphi })\boldsymbol{\beta }+{\boldsymbol{\epsilon }_{n}}\bigg),\]

where ${\mathbf{S}_{n}}(\boldsymbol{\rho })={\mathbf{I}_{n}}-{\rho _{T}}{\mathbf{W}_{Tn}}-{\rho _{C}}{\mathbf{W}_{Cn}}$. Let ${\boldsymbol{\theta }_{0}}={({\boldsymbol{\rho }_{0}^{\top }},{\boldsymbol{\beta }_{0}^{\top }},{\boldsymbol{\varphi }_{0}^{\top }},{\sigma _{0}^{2}})^{\top }}$ be the true parameter values. The consistency and asymptotic normality properties of the maximum likelihood estimators ${\hat{\boldsymbol{\theta }}_{n}}$ are given in Theorem 1 below.

Theorem 1.

Under Assumption 1–6 (given in Appendix A.2), the (quasi) maximum likelihood estimator ${\hat{\boldsymbol{\theta }}_{n}}$ obtained by maximizing the log-likelihood in (3.4) is consistent to ${\boldsymbol{\theta }_{0}}$. Further assuming that ${J_{n}}({\boldsymbol{\theta }_{0}})=-\mathbb{E}\left[\frac{\partial \log {L_{n}}({\boldsymbol{\theta }_{0}})}{\partial \boldsymbol{\theta }\partial {\boldsymbol{\theta }^{\top }}}\right]$ and ${V_{n}}({\boldsymbol{\theta }_{0}})=\mathbb{E}\left[\left(\frac{\partial \log {L_{n}}({\boldsymbol{\theta }_{0}})}{\partial \boldsymbol{\theta }}\right){\left(\frac{\partial \log {L_{n}}({\boldsymbol{\theta }_{0}})}{\partial \boldsymbol{\theta }}\right)^{\top }}\right]$ are positive definite,

\[ {[{V_{n}}({\boldsymbol{\theta }_{0}})]^{-1/2}}[{J_{n}}({\boldsymbol{\theta }_{0}})]({\hat{\boldsymbol{\theta }}_{n}}-{\boldsymbol{\theta }_{0}})\stackrel{d}{\to }\mathcal{N}({\mathbf{0}_{\mathrm{dim}(\boldsymbol{\theta })}},{\mathbf{I}_{\mathrm{dim}(\boldsymbol{\theta })}}),\]

where $\mathrm{dim}(\cdot )$ denotes the length of a vector.

The proof of Theorem 1 is given in Appendix A.3, following the ideas of Lee [28], treating ${\mathbf{S}_{n}}(\boldsymbol{\rho })$ and ${M_{n}}(\boldsymbol{\varphi })$ as non-stochastic for any given $\boldsymbol{\rho }$ and $\boldsymbol{\varphi }$. The random errors ${\epsilon _{i,n}}$ are assumed to be independent and identically distributed with mean zero and variance ${\sigma _{0}^{2}}$. When ${\epsilon _{i,n}}$ follow a normal distribution, ${\hat{\boldsymbol{\theta }}_{n}}$ is the maximum likelihood estimator (instead of a quasi maximum likelihood estimator), and we have ${V_{n}}(\boldsymbol{\theta })={J_{n}}(\boldsymbol{\theta })$ and

\[ {[{J_{n}}({\boldsymbol{\theta }_{0}})]^{1/2}}({\hat{\boldsymbol{\theta }}_{n}}-{\boldsymbol{\theta }_{0}})\stackrel{d}{\to }\mathcal{N}({\mathbf{0}_{\mathrm{dim}(\boldsymbol{\theta })}},{\mathbf{I}_{\mathrm{dim}(\boldsymbol{\theta })}}).\]

3.3 Inference for Causal Quantities

With the asymptotic normality result, inference for the parameters can be performed accordingly. The inference for the causal quantities given in Section 2.3 can then be carried out via the Delta method [45]. In particular, the global treatment effect for Model (3.1) is calculated as

(3.8)

\[\begin{aligned}{}\text{GTE}(\boldsymbol{\theta })=& \frac{1}{n}{\mathbf{1}_{n}^{\top }}\bigg[\left(\mu +\tau +\frac{1}{n}{\sum \limits_{i=1}^{n}}{g_{T,i}}({\mathbf{D}_{\mathbf{Z}={\mathbf{1}_{n}}}},\boldsymbol{\varphi })\right)\\ {} & \times {({\mathbf{I}_{n}}-{\rho _{T}}{\mathbf{W}_{T,\mathbf{Z}={\mathbf{1}_{n}}}})^{-1}}\\ {} & -\left(\mu +\frac{1}{n}{\sum \limits_{i=1}^{n}}{g_{C,i}}({\mathbf{D}_{\mathbf{Z}={\mathbf{0}_{n}}}},\boldsymbol{\varphi })\right)\\ {} & \times {({\mathbf{I}_{n}}-{\rho _{C}}{\mathbf{W}_{C,\mathbf{Z}={\mathbf{0}_{n}}}}))^{-1}}\bigg]{\mathbf{1}_{n}}.\end{aligned}\]

Using the Delta method, the variance of the GTE can be written as

(3.9)

\[ \mathrm{Var}\big[\text{GTE}(\boldsymbol{\theta })\big]={\mathbf{t}^{\top }}\mathrm{Var}(\boldsymbol{\theta })\mathbf{t},\]

where $\mathbf{t}=\frac{\partial \hspace{2.5pt}\text{GTE}(\boldsymbol{\theta })}{\partial {\boldsymbol{\theta }^{\top }}}$. As $\text{DTE}(\boldsymbol{\theta })=\tau $ and $\text{ITE}(\boldsymbol{\theta })=\text{GTE}(\boldsymbol{\theta })-\tau $, inference for DTE and ITE can be derived in a similar manner.

4 Simulations

In this section, we use simulation to study the properties of different specifications of the GANE model. Specifically, we study our proposed POW-DEG specification (2.6) and the HOM specification (2.5) as an illustration of a spatially autoregressive specification. In order to study these model specifications on real-life networks, we use the Caltech Facebook network and the UMichigan Facebook network, both retrieved from the Network Repository [38]. The sizes of these networks are summarized in Table 1. In both cases, these networks provide realistic structure for the graph $\mathcal{G}$, but the experiment and outcomes are hypothetical and simulated using either the POW-DEG (2.6) or HOM (2.5) specifications.

Table 1

Number of nodes and edges of the networks used in the simulations.

Networks	# of nodes $(n)$	# of edges
Caltech network	770	16,656
UMich network	3,749	81,903

4.1 The Distribution of the Estimates

We first investigate the asymptotic properties of the maximum likelihood estimates derived in Section 3. As the theoretical results concern the case where the treatment assignment vector Z is known, in this simulation, we fix a particular design where half of the nodes are randomly assigned to treatment and the other half are assigned to control.

First, we investigate the results for the POW-DEG specification (2.6) by generating outcomes on the given network (either the Caltech or UMich Facebook network) with the following parameter settings: $\boldsymbol{\beta }={(0,1,0.5,0.1)^{\top }}$ and $\sigma =1$. We further vary the power λ within the set $\{0.5,0.75,1,1.25\}$, where $\lambda =1$ corresponds to the LNE specification (2.3). With each combination of parameters, 1,000 runs are conducted where the outcomes are generated and the maximum likelihood estimates are calculated accordingly.

Figure 2

The distribution of parameter estimates of the POW-DEG specification on the Caltech Facebook network with $\boldsymbol{\beta }={(0,1,0.5,0.1)^{\top }}$ and $\lambda \in \{0.05,0.75,1.00,1.25\}$ over 1,000 simulation runs.

The distribution of the parameter and GTE estimates for the POW-DEG specification (2.6) are plotted in Figure 2. We can see that the distributions of all estimates are reasonably bell-shaped and symmetric and centered around the true values (dashed vertical lines) as is expected given the asymptotic theory. While the distribution of $\hat{\tau }$ remains the same under different values of λ, the variances of the other estimators decrease when λ increases. This is because the ranges of values within ${\mathbf{G}_{T}}$ and ${\mathbf{G}_{C}}$ in the model matrix M increase as λ increases, which in turn decreases the variance of the parameter estimates.

Figure 3

The variances of the estimates (left axes, lines) and coverage rates (right axes, bars) of POW-DEG specification on the Caltech Facebook network with $\boldsymbol{\beta }={(0,1,0.5,0.1)^{\top }}$ and $\lambda \in \{0.5,0.75,1.00,1.25\}$ over 1,000 simulation runs.

The coverage of 95% asymptotic confidence intervals and variances of the parameter estimates are given in Figure 3, where left axes correspond to variances and right axes correspond to coverage. The blue lines depict the asymptotic variances derived from $\mathbf{J}({\boldsymbol{\theta }_{0}})$ and the red lines depict the sample variance of the 1,000 parameter estimates. The agreement between these lines suggests that the asymptotic variances may be used reliably for inference. With respect to coverage, the coverage rates for the 95% confidence intervals are plotted as grey bars on the right axes and the dotted lines serve as a reference at 0.95. We can see that the obtained confidence intervals have the correct coverage. To summarize, the simulation corroborates the asymptotic theory and indicates that maximum likelihood procedures work as expected for the POW-DEG specification (2.6). Simulation results for a different set of parameter values are included in Section S1 of the Supplementary Material. These results suggest that when the network effect is small and the network size is moderate, consistent estimation of ${\gamma _{T}}$, ${\gamma _{C}}$, and λ is more difficult. However, the battery of simulations was also run on the UMich Facebook network, whose size ($n=3,749$) is almost 5 times that of the Caltech network, and we find that estimation of all parameters, whether the network and treatment effects are large or small, agrees with the asymptotic theory. These results are also available in Section S1 of the Supplementary Material.

Figure 4

(upper) The distribution of parameter estimates of the HOM specification with $\mu =0$, $\tau =1$, ${\gamma _{T}}=0.5$, ${\rho _{T}}={\rho _{C}}=0.1$ over 1,000 simulation runs. (lower) The corresponding variances of the estimates (left axes, lines) and coverage rates (right axes, bars).

We conducted a similar simulation study on the HOM specification (2.5) with $\mu =0$, $\tau =1$, ${\gamma _{T}}=0.5$, ${\rho _{T}}={\rho _{C}}=0.1$ and ${\sigma ^{2}}=1$. The results for both the Caltech Facebook network and the UMich Facebook network are shown in Figure 4. As with the POW-DEG (2.6) estimates, and in agreement with the likelihood theory, the distributions of these parameter estimates are bell-shaped and centered at the true values. Moreover, since the UMich Facebook network is larger, the variation in the estimates decreases, as expected. Notice that the true values of GTE are different for the two networks, even though all parameters used are the same. This illustrates how the true value of GTE depends not only on the parameters but also on the structure of the graph. Variances and confidence interval coverage are also plotted in Figure 4. As we would expect, the asymptotic variances are suitable for inference and the asymptotic confidence intervals have acceptable coverage. To demonstrate the generality of these findings we present additional simulation results for another set of parameter values in Section S1 of the Supplementary Material. The theory developed in Section 3 and the simulations presented here (for multiple GANE specifications, parameter values, and networks) demonstrate the general utility of maximum likelihood inference with GANE models.

4.2 Hypothesis Testing

As discussed in Section 2.3, under the GANE framework, we can test hypotheses about the DTE, SUTVA, the ITE and the GTE. In particular, testing DTE = 0 is equivalent to testing ${H_{01}}:\tau =0$; testing whether SUTVA is satisfied is equivalent to testing ${H_{02}}:{f_{T}}={f_{C}}=0$; the null hypothesis for testing the indirect treatment effect is ${H_{03}}:\text{ITE}=0$; and the null hypothesis for testing the global treatment effect is ${H_{04}}:\text{GTE}=0$. In the maximum likelihood framework, Hypotheses 1, 3, and 4 can be tested using Wald-type tests and Hypothesis 2 can be tested with a likelihood ratio test.

We study the characteristics of these tests via simulation. Again, as the design Z is treated as fixed in our analysis, we randomly pick a design where half of the nodes are assigned to treatment and the other half are assigned to control. The parameters of the POW-DEG specification (2.6) are set at $\boldsymbol{\beta }={(0,1,0.5,0.1)^{\top }}$, $\sigma =1$ and $\lambda \in \{0.5,0.75,1.00,1.25\}$. Separate simulations are conducted to investigate each of the four hypothesis tests. For each simulation, values of certain parameters in $\boldsymbol{\beta }$ vary while the others stay as stated. In particular, in the simulation for Hypothesis 1, τ varies in the range $[0,1]$; in the simulation for Hypothesis 2, ${\gamma _{T}}={\gamma _{C}}$ and their values vary in the range $[0,0.05]$; for Hypothesis 3, ${\gamma _{C}}$ is fixed at 0.1 and ${\gamma _{T}}-{\gamma _{C}}$ varies in the range $[0,0.5]$. Hypothesis 4 with ${H_{04}}:\text{GTE}=0$ is also tested within each of the three simulations (with different λ) and the results are aggregated over different values of GTE, which corresponds to different parameter combinations. All tests are done at a 5% significance level and 1,000 runs are conducted for each parameter combination. The results are presented in Figure 5. The dotted horizontal line serves as a reference at the 5% level.

As expected, the rejection rates for each test increase as the respective parameter values depart from their null values. Moreover, tests for ${H_{01}}:\tau =0$ seem to behave similarly over different values of λ. This is consistent with the model estimation results in Figure 2 where the variances for $\hat{\tau }$ and $\widehat{\text{GTE}}$ look similar over different values of λ while the variances for $\hat{\gamma }$’s decrease as λ increases. We also remark that the results at null values deviate slightly from the nominal 5% level. This can be attributed to the use of asymptotic (inexact) variances in these tests. Although we do not include the results for the UMich Facebook network, given its size and given the results from Section 4.1, we expect similar results to those presented here for the Caltech Facebook network.

Table 2

Parameters for the simulation in Section 4.3.

Specification	μ	τ	${\rho _{T}}$	${\rho _{C}}$	${\gamma _{T}}$	${\gamma _{C}}$	λ	σ
SUTVA	0	2	0	0	0	0		1
LNE	0	1	0	0	0.1231	0.1		1
POW-DEG	0	1	0	0	0.2691	0.1	0.5	1
LAG	0	1	0.008492	0.001	0	0		0.9977
HOM	0	1	0.1	0.1	0.01728	0		0.9999

Figure 5

Rejection rates of hypothesis tests for POW-DEG specification on the Caltech Facebook network with varying parameters.

We conducted similar simulations for the HOM specification (2.5) with $\mu =0$, $\tau =1$, ${\gamma _{T}}=0.5$, ${\rho _{T}}={\rho _{C}}=0.1$, and varying τ, ${\gamma _{T}}={\rho _{C}}$ and ${\gamma _{T}}-{\rho _{C}}$ in different simulations for different hypothesis tests. The results are included in Section S2 of the Supplementary Material. It can be noted that the results are similar in both networks, except for different values of τ, signifying that this is an important parameter for the HOM specification (2.5). We also note that the rejection rates for Hypothesis 3 always stay at $100\% $ even at ${\gamma _{T}}={\rho _{C}}$ (i.e., when the scaling coefficients for ${f_{T}}$ and ${f_{C}}$ are equal). This shows that the indirect effect is not only affected by the sizes of the network effects, but also by the functional forms of ${f_{T}}$ and ${f_{C}}$. These are different in the HOM specification (2.5).

4.3 Model Misspecification

The simulations in Sections 4.1 and 4.2 explore properties of maximum likelihood inference for different GANE specifications when they are correctly specified. In this section, we further investigate properties of these specifications under model misspecification. The specifications considered here are (i) the SUTVA specification, in which network effects do not exist and ${f_{T}}={f_{C}}=0$; (ii) the linear network effect (LNE) specification in (2.3); (iii) the POW-DEG specification in (2.6); (iv) the local aggregate (LAG) specification in (2.4); and (v) the homophily (HOM) specification in (2.5).

In this simulation, on the Caltech Facebook network, outcomes are generated 1,000 times for each of the listed model specifications. The data are then fitted using each of the five model specifications. We use global treatment effect (GTE) estimation and its inference results to compare performance among specifications because the GTE is generally of primary interest. To make the comparison fair, parameters for each model specification are chosen such that the true global treatment effect (GTE) is fixed at 2.0 and the average outcome variance is 1.0 in all data-generating scenarios. The exact parameter values for each specification are provided in Table 2.

Results of the simulation are plotted as heatmaps in Figure 6. The columns correspond to outcome-generating models and the rows correspond to estimating models. The top left panel shows the log ratio of the average estimated GTE to the true GTE. The desired value is 0, which is colored white. Red represents overestimation and blue represents underestimation. We see that all specifications can estimate the SUTVA specification well because it is nested within all GANE specifications. The POW-DEG specification seems to provide estimates with the lowest bias, even under model misspecification.

Figure 6

Model misspecification simulation results.

The top right panel shows the variances of the GTE estimates where white represents low variances and dark green represents high variances. We can see that the SUTVA and HOM (2.5) specifications provide the lowest variances while the highest variances come from the LAG specification (2.4). Both the POW-DEG (2.6) and the LNE (2.3) specifications provide reasonably low variances.

The bottom left panel shows the coverage rate of 95% confidence intervals for the GTE constructed by each estimating model. The results show that LNE (2.3), POW-DEG (2.6) and LAG (2.4) specifications have high coverage rates while the HOM specification (2.5) has lower coverage rates and the SUTVA specification has the worst. This is because the SUTVA specification does not capture the network effects introduced by other specifications.

Finally, on the bottom right, the model selection results by AIC are presented. Green represents high selection rates while white represents low selection rates. AIC works well as it selects the correct model specification most of the time, which is shown by the green diagonal. This supports the use of likelihood-based model selection criteria such as AIC for the GANE framework. Furthermore, it can be seen that POW-DEG specification (2.6) is selected fairly often no matter the data-generating model, which suggests that it fits the data reasonably well even under model misspecification.

As illustrated in Figure 6, the POW-DEG specification (2.6) is the only one that performs well in each dimension. This illustrates the flexibility of the POW-DEG model to capture a variety of network effects. Hence, we advocate its use generally, especially when there is no prior information or preference for another specification.

4.4 Experimental Design

Given the suggestion for the general use of the POW-DEG specification (2.6) in Section 4.3, in this section, we examine the characteristics of good designs in the context of this specification. Identifying such characteristics will provide useful insights to consider when designing the experiment. To do so, we randomly generate 10,000 designs for the Caltech Facebook network as follows: (i) take a random draw m from the discrete uniform distribution on $[1,n-1]$; (ii) randomly select m nodes and assign them to treatment, and assign the remaining $n-m$ nodes to control.

As GTE is the primary focus for many experiments on networks, in this simulation we use the variance of the estimated global treatment effect $\mathrm{Var}[\widehat{\text{GTE}}]$ as the design criterion to evaluate the designs. Good designs are ones that give lower values of the design criterion. The criterion is evaluated based on the Fisher information matrix of the POW-DEG specification (2.6) with parameters $\mu =0$, $\tau =1$, $\lambda \in \{0.50,0.75,1.00,1.25\}$. We also vary ${\gamma _{T}}$ and ${\gamma _{C}}$ in three settings: (i) ${\gamma _{T}}=0.5\gt {\gamma _{C}}=0.1$; (ii) ${\gamma _{T}}=0.5={\gamma _{C}}$; (iii) ${\gamma _{T}}=0.1\lt {\gamma _{C}}=0.5$.

For each parameter combination, the best and worst 500 (i.e., 5%) designs with respect to $\mathrm{Var}[\widehat{\text{GTE}}]$ are recorded. We investigate three characteristics of these designs using three metrics. First, as Bowers et al. [6] point out, in the presence of network interference, balanced allocation may not be optimal. As such, we examine the percentage of treated nodes in each of the designs. Second, Parker et al. [35] observe that for certain design criteria, it is better to assign nodes with high degrees to treatment and nodes with low degrees to control. Hence we calculate the difference in average degrees of treated and controlled nodes to verify this observation. Third, if the best designs are clustered as graph-cluster randomization algorithms suggest, then nodes will be more likely to be surrounded by neighbors who share the same treatment assignment as themselves. This means that for each node, the percentage of neighbors having the same treatment assignment will be high. However, with our two-step design-generating procedure, we are considering both balanced and unbalanced designs. Suppose a design has b% of treated nodes, then the nodes will be expected to have b% of neighbors treated and $(100-b)$% of neighbors controlled. Nodes in a clustered design will have a high percentage of similarly treated neighbors regardless of such expectation. Therefore, we use the percentage of treated neighbors in each design as a threshold and calculate the percentage of nodes in the design that have the proportion of neighbors sharing the same treatment with themselves higher than this threshold.

Figure 7

Characteristics of “best” 5%, randomly selected, and “worst” 5% designs with respect to the variance of estimated global treatment effect on the Caltech Facebook network.

Figure 7 shows characteristics of the best 500, the worst 500, and 500 randomly selected designs from the 10,000 that were generated. First, let us look at the best 5% designs. We see that the designs are more balanced at higher values of λ. At smaller values of λ, when ${\gamma _{T}}$ is greater, we need more treated nodes and when ${\gamma _{C}}$ is lower, we need more controlled nodes. Thus, in order to decide which treatment allocation to use, we need to know which of ${\gamma _{T}}$ or ${\gamma _{C}}$ is greater. This information may or may not be available in practice. Second, the differences in average degrees between treated nodes and controlled nodes distribute evenly around zero, suggesting that there is no preference in terms of degree when assigning treatments to the nodes. Last, in the third panel, we only observe a slight deviance of the boxplots from the 0.5 reference line at low values of λ. This suggests that graph cluster randomization is not very effective, at least in this specific setting. In contrast, the random designs have all the characteristics randomly distributed around the reference lines, while the worst designs will assign almost all of the nodes to either treatment or control.

Table 3 shows the efficiency gained (lost) by selecting the best (worst) 500 designs out of 10,000 random designs relative to a randomly selected design. We can see that the efficiency gained by the best 500 designs increases when network effects are present, i.e., when $ITE\ne 0$ or ${\gamma _{T}}\ne {\gamma _{C}}$ for the POW-DEG specification (2.6). It also increases when the sizes of the network effects increase, and when λ increases. Hence, even though a randomly generated design may be suitable when there is no network effect, when some network effect is present, other design selection techniques are necessary. We perform an analogous investigation with the UMich Facebook network, the results of which are provided in Section S3 of the Supplementary Material. The findings are similar to those presented here for the Caltech Facebook network, but we also notice that with a larger network, the efficiency of this random search procedure decreased. This suggests that when there are many experimental units, a balanced randomized design may still be efficient.

Table 3

Efficiency in terms of $\mathrm{Var}[\widehat{\text{GTE}}]$ of the best and worst 500 designs selected from 10,000 randomly generated designs compared to 500 randomly generated designs on the Caltech Facebook network.

Designs	λ	0.5	0.75	1.00	1.75
	${\gamma _{T}}\gt {\gamma _{C}}$	1.7521	2.3841	5.5776	23.7801
Best 500 designs	${\gamma _{T}}={\gamma _{C}}$	1.3875	2.0251	5.1557	23.2936
	${\gamma _{T}}\lt {\gamma _{C}}$	1.6177	2.3152	5.5369	23.7562
	${\gamma _{T}}\gt {\gamma _{C}}$	0.2469	0.16627	0.0807	0.0486
Worst 500 designs	${\gamma _{T}}={\gamma _{C}}$	0.4772	0.2032	0.0792	0.04791
	${\gamma _{T}}\lt {\gamma _{C}}$	0.2291	0.1588	0.0797	0.0484

5 Discussion

We introduce the general additive network effect model for network A/B tests, which unifies many existing models in the literature and enhances the modeling flexibility. We further bridge the model-based framework and the exposure framework by defining causal quantities of interest: the global treatment effect, the direct treatment effect, and the indirect treatment effect as functions of the model parameters. Inference for all three quantities may be carried out via the maximum likelihood framework.

Although the model is studied under the A/B testing setting where there are just two experimental conditions (treatment and control) and under the normal independent error assumptions, the GANE model framework can be extended for use in other settings. First, by expanding the model equation, the GANE model can be used to analyze experiments with more than two experimental conditions. Second, by introducing link functions and other distributional and functional assumptions, the framework can be extended to deal with non-normal distributions and discrete outcomes in manners similar to generalized linear models.

Despite the GANE framework’s wide modeling possibilities, we specifically propose the POW-DEG specification (2.6), which models the network effect as powers of the treatment and control degrees. Via simulation, we found that the specification is robust against model misspecification in terms of inference for the global treatment effect. Thus we suggest the use of this specification, especially in the design stage, when there is no prior information or modeling preference. Characteristics of good designs with respect to different criteria on the POW-DEG specification (2.6) are also investigated via simulation. We find that balanced randomization to treatment and control and graph cluster randomization are not necessarily optimal for the estimation of the global treatment effect in this specification.

The design simulation in Section 4.4 suggests that different parameter settings lead to a different desired percentage of treated nodes. In addition, the optimal design search requires the formula of the Fisher information matrix, which further demands the knowledge of both the model and parameters. This is often unknown in the design stage, thus posing challenges for the model-based framework. Future work could consider methods such as Bayesian design to alleviate the dependence on unknown parameters [13]. Moreover, although AIC appears to work in the model misspecification simulation in Section 4.3, the use of AIC is only possible in the analysis stage once the data is observed, or when preliminary data are available. Model selection for the design and analysis of experiments on networks thus remains an open problem for future research.

Appendix A Mathematical Details for Section 3

A.1 Proof of Lemma 1

We use the following two lemmas.

Lemma 2 (Theorem 18.2.16 of Harville [18]).

Let A represent an $n\times n$ matrix. Then, the infinite series $\mathbf{I}+\mathbf{A}+{\mathbf{A}^{2}}+{\mathbf{A}^{3}}+\cdots \hspace{0.1667em}$ converges if and only if ${\lim \nolimits_{k\to \infty }}{\mathbf{A}^{k}}=\mathbf{0}$, in which case $\mathbf{I}-\mathbf{A}$ is nonsingular and

\[ {(\mathbf{I}-\mathbf{A})^{-1}}={\sum \limits_{k=0}^{\infty }}{\mathbf{A}^{k}}=\mathbf{I}+\mathbf{A}+{\mathbf{A}^{2}}+{\mathbf{A}^{3}}+\cdots \hspace{0.1667em},\]

where ${\mathbf{A}^{0}}=\mathbf{I}$.

Lemma 3 (Lemma 5.6.11 of Horn and Johnson [19]).

Let A be an $n\times n$ given matrix. If there is a matrix norm $||\cdot ||$ such that $||A||\lt 1$, then ${\lim \nolimits_{k\to \infty }}{\mathbf{A}^{k}}=0$, that is, each entry of ${\mathbf{A}^{k}}$ tends to zero as $k\to \infty $.

From the two lemmas, if we have $||{\rho _{T}}{\mathbf{W}_{T}}+{\rho _{C}}{\mathbf{W}_{C}}||\lt 1$ for any matrix norm $||\cdot ||$, then $\mathbf{S}(\boldsymbol{\rho })$ will be invertible. Now, if the condition of Lemma 1 is satisfied, using triangle inequality, we can derive

\[\begin{aligned}{}||{\rho _{T}}{\mathbf{W}_{T}}+{\rho _{C}}{\mathbf{W}_{C}}||\le & |{\rho _{T}}|||{\mathbf{W}_{T}}||+|{\rho _{C}}|||{\mathbf{W}_{C}}||,\\ {} \le & \max (|{\rho _{T}}|,|{\rho _{C}}|)\left[||{\mathbf{W}_{T}}||+||{\mathbf{W}_{C}}||\right],\\ {} \lt & 1,\end{aligned}\]

i.e., $\mathbf{S}(\boldsymbol{\rho })$ is invertible.

A.2 Assumptions Needed for Asymptotic Results

In order to achieve the asymptotic results in Theorem 1 in Appendix A.3, we make the following assumptions.

Assumption 1.

${\boldsymbol{\epsilon }_{n}}={({\epsilon _{1n}},{\epsilon _{2n}},\dots ,{\epsilon _{nn}})^{\top }}$ are independently and identically distributed with mean 0 and variance ${\sigma _{0}^{2}}\gt 0$. In addition, the moment $\mathbb{E}(|{\epsilon _{i,n}}{|^{4+\eta }})$ exists for some $\eta \gt 0$.

Assumption 2.

The true parameters ${\boldsymbol{\rho }_{0}}$ and ${\boldsymbol{\varphi }_{0}}$ lie in the interior of a compact parameter space $\mathrm{P}\times \boldsymbol{\Phi }$. The parameters are uniquely identifiable, in the sense that $\mathbb{P}\left({L_{n}}({\boldsymbol{\theta }_{1}})={L_{n}}({\boldsymbol{\theta }_{2}})\right)=0$ for ${\boldsymbol{\theta }_{1}}\ne {\boldsymbol{\theta }_{2}}$.

Assumption 3.

The elements of ${\mathbf{W}_{Tn}}$ and ${\mathbf{W}_{Cn}}$ are at most of order ${h_{n}}$ uniformly, i.e., ${W_{ln,ij}}=O(1/{h_{n}})$ $\forall i,j$ and $l\in \{T,C\}$. The sequence ${h_{n}}$ can be bounded or divergent. Furthermore, ${\lim \nolimits_{n\to \infty }}{h_{n}}/n=0$.

Assumption 4.

The matrix ${\mathbf{S}_{n}}({\boldsymbol{\rho }_{0}})$ is nonsingular.

Assumption 5.

The weight matrices ${\mathbf{W}_{Tn}}$, ${\mathbf{W}_{Cn}}$ and the matrix ${\mathbf{S}_{n}}{({\boldsymbol{\rho }_{0}})^{-1}}$ are uniformly bounded in both row and column sums. Moreover, ${\mathbf{S}_{n}}{(\boldsymbol{\rho })^{-1}}$ is uniformly bounded in either row or column sums.

Assumption 6.

For each i, the functions ${g_{T,i}}$ and ${g_{C,i}}$ are twice continuously differentiable with respect to $\boldsymbol{\varphi }$. The values of these functions and their derivatives are uniformly bounded $\forall \boldsymbol{\varphi }\in \boldsymbol{\Phi }$. Furthermore, $\forall \boldsymbol{\varphi }\in \boldsymbol{\Phi }$, ${\lim \nolimits_{n\to \infty }}{\mathbf{M}_{n}}{(\boldsymbol{\varphi })^{\top }}{\mathbf{M}_{n}}(\boldsymbol{\varphi })/n$ exists and nonsingular.

Assumptions 1, 2 and the differentiability requirement for ${g_{T,i}}$ and ${g_{C,i}}$ in Assumption 6 are usual regularity conditions for the consistency and asymptotic normality of nonlinear least squares regression [21]. The identifiability requirement in Assumption 2 contains the requirement that the columns of the model matrix ${\mathbf{M}_{n}}(\boldsymbol{\varphi })$ are linearly independent as discussed in Section 3.1.

Note that we cannot use the usual central limit theorems to derive the asymptotic behavior of Model (3.3) because as the sample size n changes, the weight matrices may also change, leading to changes in the outcomes ${\mathbf{Y}_{n}}$. For example, when a new unit is added to the network, it can be connected to other existing units, which in turn changes the degree ${k_{i}}$ for each existing unit i, $i=1,2,\dots ,n$. This results in a different set of weight matrices for autoregressive specifications such as the LAG (2.4) or the HOM (2.5) specifications. Therefore, we need to use the Central Limit Theorem for linear-quadratic forms of triangular arrays [25]. Assumptions 3, 5, and 6 are introduced to satisfy the assumptions of this theorem. Essentially, the bounds in these assumptions control the spatial correlations to a manageable degree so that they do not diverge as n goes to infinity [28]. For example, suppose ${\mathbf{W}_{Tn}}={\mathbf{A}_{n}}$, Assumption 3 is satisfied as all elements of A are either 1 or 0, i.e., bounded. However, to satisfy Assumption 5 in this case, we need to further require that the degree of each unit i, i.e., the number of connections, is bounded as n goes to infinity. This is reasonable in social network settings as one will not have infinitely many friends.

Finally, Assumption 4 makes sure that ${\mathbf{Y}_{n}}$ can be expressed in the reduced form as in (3.3).

A.3 Proof of Theorem 1

To prove consistency, we use the following lemma.

Lemma 4 (Theorem 3.4 of White [46]).

Let $(\Omega ,\mathcal{F},\mathbb{P})$ be a complete probability space, let Θ be a compact subset of ${\mathbb{R}^{p}}$, $p\in \mathbb{N}$ and let $\{{\boldsymbol{\Theta }_{n}}\}$ be a sequence of compact subsets of Θ. Let $\{{Q_{n}}\}$ be a sequence of random functions continuous on Θ a.s. and let ${\hat{\boldsymbol{\theta }}_{n}}=\arg {\max _{{\boldsymbol{\Theta }_{n}}}}{Q_{n}}(\cdot ,\boldsymbol{\theta })$ a.s. If ${Q_{n}}(\cdot ,\boldsymbol{\theta })-{\bar{Q}_{n}}(\boldsymbol{\theta })\to 0$ as $n\to \infty $ a.s. uniformly on Θ and if $\{{\bar{Q}_{n}}:\boldsymbol{\Theta }\to \mathbb{R}\}$ has identifiably unique maximizers $\{{\boldsymbol{\theta }_{n}^{\ast }}\}$ on $\{{\boldsymbol{\Theta }_{n}}\}$ then ${\hat{\boldsymbol{\theta }}_{n}}-{\boldsymbol{\theta }_{n}^{\ast }}\to 0$ as $n\to \infty $ a.s.

From the reduced form (3.3), consider

\[\begin{aligned}{}{Q_{n}}(\boldsymbol{\rho },\boldsymbol{\varphi })=& \underset{\boldsymbol{\beta },{\sigma ^{2}}}{\max }\mathbb{E}(\log {L_{n}}(\boldsymbol{\theta })),\\ {} =& \underset{\boldsymbol{\beta },{\sigma ^{2}}}{\max }\bigg[-\frac{n}{2}\log 2\pi -\frac{n}{2}\log ({\sigma ^{2}})+\log |{\mathbf{S}_{n}}(\boldsymbol{\rho })|\\ {} & -\frac{1}{2{\sigma ^{2}}}{\boldsymbol{\beta }^{\top }}{\mathbf{M}_{n}}{(\boldsymbol{\varphi })^{\top }}{\mathbf{M}_{n}}(\boldsymbol{\varphi })\boldsymbol{\beta }-\frac{{\sigma _{0}^{2}}}{2{\sigma ^{2}}}\mathrm{tr}\big({\mathbf{B}_{n}}(\boldsymbol{\rho })\big)\\ {} & +\frac{1}{{\sigma ^{2}}}{\boldsymbol{\beta }^{\top }}{\mathbf{M}_{n}}{(\boldsymbol{\varphi })^{\top }}{\mathbf{S}_{n}}(\boldsymbol{\rho }){\mathbf{S}_{n}}{({\boldsymbol{\rho }_{0}})^{-1}}{\mathbf{M}_{n}}({\boldsymbol{\varphi }_{0}}){\boldsymbol{\beta }_{0}}\\ {} & -\frac{1}{2{\sigma ^{2}}}{\boldsymbol{\beta }_{0}^{\top }}{\mathbf{M}_{n}}{({\boldsymbol{\varphi }_{0}})^{\top }}{\mathbf{S}_{n}}{({\boldsymbol{\rho }_{0}})^{-\top }}{\mathbf{S}_{n}}{(\boldsymbol{\rho })^{\top }}\\ {} & \times {\mathbf{S}_{n}}(\boldsymbol{\rho }){\mathbf{S}_{n}}{({\boldsymbol{\rho }_{0}})^{-1}}{\mathbf{M}_{n}}({\boldsymbol{\varphi }_{0}}){\boldsymbol{\beta }_{0}}\bigg],\end{aligned}\]

where ${\mathbf{B}_{n}}(\boldsymbol{\rho })={\mathbf{S}_{n}}{({\rho _{0}})^{-\top }}{\mathbf{S}_{n}}{(\boldsymbol{\rho })^{\top }}{\mathbf{S}_{n}}(\boldsymbol{\rho }){\mathbf{S}_{n}}{({\rho _{0}})^{-1}}$. Taking the first derivative with respect to $\boldsymbol{\beta }$ and ${\sigma ^{2}}$, we obtain the maximizers of ${Q_{n}}(\boldsymbol{\rho },\boldsymbol{\varphi })$ as follows:

(A.1)

\[\begin{aligned}{}{\boldsymbol{\beta }_{n}^{\ast }}(\boldsymbol{\rho },\boldsymbol{\varphi })=& {\left[{\mathbf{M}_{n}}{(\boldsymbol{\varphi })^{\top }}{\mathbf{M}_{n}}(\boldsymbol{\varphi })\right]^{-1}}{\mathbf{M}_{n}}{(\boldsymbol{\varphi })^{\top }}{\mathbf{S}_{n}}(\boldsymbol{\rho })\\ {} & \times {\mathbf{S}_{n}}{({\boldsymbol{\rho }_{0}})^{-1}}{\mathbf{M}_{n}}({\boldsymbol{\varphi }_{0}}){\boldsymbol{\beta }_{0}};\\ {} {\sigma _{n}^{\ast 2}}(\boldsymbol{\rho },\boldsymbol{\varphi })=& \frac{1}{n}\bigg\{{\boldsymbol{\beta }_{0}^{\top }}{\mathbf{M}_{n}}{({\boldsymbol{\varphi }_{0}})^{\top }}{\mathbf{S}_{n}^{-\top }}({\boldsymbol{\rho }_{0}}){\mathbf{S}_{n}}{(\boldsymbol{\rho })^{\top }}\big[{\mathbf{I}_{n}}-{\mathbf{H}_{n}}(\boldsymbol{\varphi })\big]\\ {} & \times {\mathbf{S}_{n}}(\boldsymbol{\rho }){\mathbf{S}_{n}}{({\boldsymbol{\rho }_{0}})^{-1}}{\mathbf{M}_{n}}({\boldsymbol{\varphi }_{0}}){\boldsymbol{\beta }_{0}}+{\sigma _{0}^{2}}\mathrm{tr}\big({\mathbf{B}_{n}}(\boldsymbol{\rho })\big)\bigg\}.\end{aligned}\]

where ${\mathbf{H}_{n}}(\boldsymbol{\varphi })={\mathbf{M}_{n}}(\boldsymbol{\varphi }){\big[{\mathbf{M}_{n}}{(\boldsymbol{\varphi })^{\top }}{\mathbf{M}_{n}}(\boldsymbol{\varphi })\big]^{-1}}{\mathbf{M}_{n}}{(\boldsymbol{\varphi })^{\top }}$. Now,

\[\begin{aligned}{}{\ell _{\text{P},n}}(\boldsymbol{\rho },\boldsymbol{\varphi })=& -\frac{n}{2}\log 2\pi -\frac{n}{2}\log {\hat{\sigma }_{n}^{2}}(\boldsymbol{\rho },\boldsymbol{\varphi })\\ {} & +\log |{\mathbf{S}_{n}}(\boldsymbol{\rho })|-\frac{n}{2},\\ {} {Q_{n}}(\boldsymbol{\rho },\boldsymbol{\varphi })=& -\frac{n}{2}\log 2\pi -\frac{n}{2}\log {\sigma _{n}^{\ast 2}}(\boldsymbol{\rho },\boldsymbol{\varphi })\\ {} & +\log |{\mathbf{S}_{n}}(\boldsymbol{\rho })|-\frac{n}{2},\end{aligned}\]

where ${\hat{\sigma }_{n}^{2}}(\boldsymbol{\rho },\boldsymbol{\varphi })$ was given in (3.6). To use Lemma A4, we first need to show that

\[\begin{aligned}{}& \frac{1}{n}\big\{{\ell _{\text{P},n}}(\boldsymbol{\rho },\boldsymbol{\varphi })-{Q_{n}}(\boldsymbol{\rho },\boldsymbol{\varphi })\big\}\\ {} & =-\frac{1}{2}\big\{\log {\hat{\sigma }_{n}^{2}}(\boldsymbol{\rho },\boldsymbol{\varphi })-\log {\sigma _{n}^{\ast 2}}(\boldsymbol{\rho },\boldsymbol{\varphi })\big\}={o_{p}}(1).\end{aligned}\]

Note that

\[\begin{aligned}{}{\hat{\sigma }_{n}^{2}}(\boldsymbol{\rho },\boldsymbol{\varphi })-{\sigma _{n}^{\ast 2}}(\boldsymbol{\rho },\boldsymbol{\varphi })=& 2{R_{1n}}(\boldsymbol{\rho },\boldsymbol{\varphi })+{R_{2n}}(\boldsymbol{\rho },\boldsymbol{\varphi })\\ {} & -\frac{1}{n}{\sigma _{0}^{2}}\mathrm{tr}\big({\mathbf{B}_{n}}(\boldsymbol{\rho })\big),\end{aligned}\]

where

\[\begin{aligned}{}{R_{1n}}(\boldsymbol{\rho },\boldsymbol{\varphi })=& \frac{1}{n}{\boldsymbol{\beta }_{0}^{\top }}{\mathbf{M}_{n}}{({\boldsymbol{\rho }_{0}})^{\top }}{\mathbf{S}_{n}^{-\top }}({\boldsymbol{\rho }_{0}}){\mathbf{S}_{n}}{(\boldsymbol{\rho })^{\top }}\big[{\mathbf{I}_{n}}-{\mathbf{H}_{n}}(\boldsymbol{\varphi })\big]\\ {} & \times {\mathbf{S}_{n}}(\boldsymbol{\rho }){\mathbf{S}_{n}}{({\boldsymbol{\rho }_{0}})^{-1}}{\boldsymbol{\epsilon }_{n}},\end{aligned}\]

and

\[\begin{aligned}{}& {R_{2n}}(\boldsymbol{\rho },\boldsymbol{\varphi })\\ {} & =\frac{1}{n}{\boldsymbol{\epsilon }_{n}^{\top }}{\mathbf{S}_{n}^{-\top }}({\boldsymbol{\rho }_{0}}){\mathbf{S}_{n}}{(\boldsymbol{\rho })^{\top }}\big[{\mathbf{I}_{n}}-{\mathbf{H}_{n}}(\boldsymbol{\varphi })\big]{\mathbf{S}_{n}}(\boldsymbol{\rho }){\mathbf{S}_{n}}{({\boldsymbol{\rho }_{0}})^{-1}}{\boldsymbol{\epsilon }_{n}}\\ {} & =\frac{1}{n}{\boldsymbol{\epsilon }_{n}^{\top }}{\mathbf{S}_{n}^{-\top }}({\boldsymbol{\rho }_{0}}){\mathbf{S}_{n}}{(\boldsymbol{\rho })^{\top }}{\mathbf{S}_{n}}(\boldsymbol{\rho }){\mathbf{S}_{n}}{({\boldsymbol{\rho }_{0}})^{-1}}{\boldsymbol{\epsilon }_{n}}\\ {} & \hspace{1em}-\frac{1}{n}{\bigg[\frac{1}{\sqrt{n}}{\mathbf{M}_{n}}{(\boldsymbol{\varphi })^{\top }}{\mathbf{S}_{n}}(\boldsymbol{\rho }){\mathbf{S}_{n}}{({\boldsymbol{\rho }_{0}})^{-1}}{\boldsymbol{\epsilon }_{n}}\bigg]^{\top }}\\ {} & \hspace{1em}\times \bigg[\frac{1}{n}{[{\mathbf{M}_{n}}{(\boldsymbol{\varphi })^{\top }}{\mathbf{M}_{n}}(\boldsymbol{\varphi })]^{-1}}\bigg]\\ {} & \hspace{1em}\times \bigg[\frac{1}{\sqrt{n}}{\mathbf{M}_{n}}{(\boldsymbol{\varphi })^{\top }}{\mathbf{S}_{n}}(\boldsymbol{\rho }){\mathbf{S}_{n}}{({\boldsymbol{\rho }_{0}})^{-1}}{\boldsymbol{\epsilon }_{n}}\bigg].\end{aligned}\]

It can be shown that ${R_{1n}}(\boldsymbol{\rho },\boldsymbol{\varphi })={o_{P}}(1)$ and ${R_{2n}}(\boldsymbol{\rho },\boldsymbol{\varphi })-\frac{1}{n}{\sigma _{0}^{2}}\mathrm{tr}\big({\mathbf{B}_{n}}(\boldsymbol{\rho })\big)={o_{P}}(1)$ using the following three lemmas given in Lee [28] and assumptions on the bounds of matrices.

Lemma 5.

Suppose the elements ${A_{n,ij}}$ of $n\times n$ matrices ${\mathbf{A}_{n}}$ are $O(1/{h_{n}})$ uniformly for all $i,j$. If $n\times n$ matrices $\{{\mathbf{B}_{n}}\}$ are uniformly bounded in column sums (respectively, row sums), then the elements of ${\mathbf{A}_{n}}{\mathbf{B}_{n}}$ (respectively, ${\mathbf{B}_{n}}{\mathbf{A}_{n}}$) have the uniform order $O(1/{h_{n}})$. For these cases, $\mathrm{tr}({\mathbf{A}_{n}}{\mathbf{B}_{n}})=\mathrm{tr}({\mathbf{B}_{n}}{\mathbf{A}_{n}})=O(n/{h_{n}})$.

Lemma 6.

Suppose $\{{\mathbf{A}_{n}}\}$ are uniformly bounded either in row or column sums and their elements ${A_{n,ij}}$ have order $O(1/{h_{n}})$ uniformly in i and j. Then $\mathbb{E}({\boldsymbol{\varepsilon }_{n}^{\top }}{\mathbf{A}_{n}}{\boldsymbol{\varepsilon }_{n}})={\sigma _{0}^{2}}\mathrm{tr}({\mathbf{A}_{n}})=O(n/{h_{n}})$ and $\mathrm{Var}({\boldsymbol{\varepsilon }_{n}^{\top }}{\mathbf{A}_{n}}{\boldsymbol{\varepsilon }_{n}})=O(n/{h_{n}})$. If ${\lim \nolimits_{n\to \infty }}{h_{n}}/n=0$, then $({h_{n}}/n)[{\boldsymbol{\varepsilon }_{n}^{\top }}{\mathbf{A}_{n}}{\boldsymbol{\varepsilon }_{n}}-\mathbb{E}({\boldsymbol{\varepsilon }_{n}^{\top }}{\mathbf{A}_{n}}{\boldsymbol{\varepsilon }_{n}})]={o_{P}}(1)$, where ${\epsilon _{n}}$ satisfies Assumption 1 (possibly without normality but with $\mathbb{E}(|{\epsilon _{n}}{|^{4+\gamma }})\lt \infty $ for some $\gamma \gt 0$).

Lemma 7.

Suppose that ${\mathbf{A}_{n}}$ is a square matrix with its column sums being uniformly bounded and elements of the $n\times k$ matrix ${\mathbf{Z}_{n}}$ are uniformly bounded. Then $(1/\sqrt{n}){\mathbf{Z}_{n}^{\top }}{\mathbf{A}_{n}}{\boldsymbol{\epsilon }_{n}}={O_{p}}(1)$. Furthermore, if the limit of ${\mathbf{Z}_{n}^{\top }}{\mathbf{A}_{n}}{\mathbf{A}_{n}^{\top }}{\mathbf{Z}_{n}}/n$ exists and is positive definite, then $(1/\sqrt{n}){\mathbf{Z}^{\prime }_{n}}{\mathbf{A}_{n}}{\boldsymbol{\epsilon }_{n}}\stackrel{d}{\to }\mathcal{N}(0,{\sigma _{0}^{2}}{\lim \nolimits_{n\to \infty }}{\mathbf{Z}_{n}^{\top }}{\mathbf{A}_{n}}{\mathbf{A}_{n}^{\top }}{\mathbf{Z}_{n}}/n)$.

Therefore, ${\hat{\sigma }_{n}^{2}}(\boldsymbol{\rho },\boldsymbol{\varphi })-{\sigma _{n}^{\ast 2}}(\boldsymbol{\rho },\boldsymbol{\varphi })={o_{P}}(1)$ uniformly on $\mathrm{P}\times \boldsymbol{\Phi }$. Hence, ${\sup _{(\boldsymbol{\rho },\boldsymbol{\varphi })\in \mathrm{P}\times \boldsymbol{\Phi }}}\frac{1}{n}\{{\ell _{\text{P},n}}(\boldsymbol{\rho },\boldsymbol{\varphi })-{Q_{n}}(\boldsymbol{\rho },\boldsymbol{\varphi })\}={o_{P}}(1)$. Second, we need to prove the identification uniqueness condition that, for any ${\epsilon _{1}},{\epsilon _{2}}\gt 0$,

\[ \lim \underset{n\to \infty }{\sup }\underset{\boldsymbol{\rho }\in {\bar{N}_{{\epsilon _{1}}}}({\boldsymbol{\rho }_{0}}),\boldsymbol{\varphi }\in {\bar{N}_{{\epsilon _{2}}}}({\boldsymbol{\varphi }_{0}})}{\max }\frac{1}{n}\big[{Q_{n}}(\boldsymbol{\rho },\boldsymbol{\varphi })-{Q_{n}}({\boldsymbol{\rho }_{0}},{\boldsymbol{\varphi }_{0}})\big]\lt 0,\]

where ${\bar{N}_{{\epsilon _{1}}}}({\boldsymbol{\rho }_{0}})$ denotes the complement of an open neighborhood of ${\boldsymbol{\rho }_{0}}$ of diameter ${\epsilon _{1}}$ and likewise for $\boldsymbol{\varphi }$. To see this, we can write

\[\begin{aligned}{}& \frac{1}{n}\big[{Q_{n}}(\boldsymbol{\rho },\boldsymbol{\varphi })-{Q_{n}}({\boldsymbol{\rho }_{0}},{\boldsymbol{\varphi }_{0}})\big]\\ {} & =\frac{1}{n}\Big\{\mathbb{E}[\log {L_{n}}(\boldsymbol{\rho },{\boldsymbol{\beta }_{0}},{\boldsymbol{\varphi }_{0}})]-\mathbb{E}[\log {L_{n}}({\boldsymbol{\rho }_{0}},{\boldsymbol{\beta }_{0}},{\boldsymbol{\varphi }_{0}})]\Big\}\\ {} & \hspace{1em}-\frac{1}{2}\Big\{\log {\sigma _{n}^{\ast 2}}(\boldsymbol{\rho },\boldsymbol{\varphi })-\log \left(\frac{{\sigma _{0}^{2}}}{n}\mathrm{tr}({\mathbf{B}_{n}}(\boldsymbol{\rho }))\right)\Big\}.\end{aligned}\]

The first term is less than 0 by Jensen’s inequality and the identifiability condition of Assumption 2. Furthermore, ${\sigma _{n}^{\ast 2}}(\boldsymbol{\rho },\boldsymbol{\varphi })\ge \frac{{\sigma _{0}^{2}}}{n}\mathrm{tr}({\mathbf{B}_{n}}(\boldsymbol{\rho }))$ from (A.1) by the positive semi-definiteness of the annihilator matrix $\mathbf{I}-{\mathbf{H}_{n}}(\boldsymbol{\varphi })$. Putting all of these together, we proved ${\hat{\boldsymbol{\theta }}_{n}}={\boldsymbol{\theta }_{0}}+{o_{P}}(1)$.

Now, to prove the asymptotic normality, we apply the mean-value theorem on the first order derivative of $\log {L_{n}}(\boldsymbol{\theta })$ at ${\hat{\boldsymbol{\theta }}_{n}}$ yielding

\[ \frac{\partial \log {L_{n}}({\hat{\boldsymbol{\theta }}_{n}})}{\partial \boldsymbol{\theta }}=\mathbf{0}=\frac{\partial \log {L_{n}}({\boldsymbol{\theta }_{0}})}{\partial \boldsymbol{\theta }}+({\hat{\boldsymbol{\theta }}_{n}}-{\boldsymbol{\theta }_{0}})\frac{{\partial ^{2}}\log {L_{n}}({\tilde{\boldsymbol{\theta }}_{n}})}{\partial \boldsymbol{\theta }\partial {\boldsymbol{\theta }^{\top }}},\]

where ${\tilde{\boldsymbol{\theta }}_{n}}$ lies between ${\hat{\boldsymbol{\theta }}_{n}}$ and ${\boldsymbol{\theta }_{0}}$. Therefore

\[ {\hat{\boldsymbol{\theta }}_{n}}-{\boldsymbol{\theta }_{0}}=-{\left[\frac{{\partial ^{2}}\log {L_{n}}({\tilde{\boldsymbol{\theta }}_{n}})}{\partial \boldsymbol{\theta }\partial {\boldsymbol{\theta }^{\top }}}\right]^{-1}}\left(\frac{\partial \log {L_{n}}({\boldsymbol{\theta }_{0}})}{\partial \boldsymbol{\theta }}\right).\]

We can write down the first derivatives of $\log {L_{n}}(\boldsymbol{\theta })$ with respect to $\boldsymbol{\theta }$ as follows:

\[\begin{aligned}{}\frac{\partial \log {L_{n}}(\boldsymbol{\theta })}{\partial \boldsymbol{\beta }}=& \frac{1}{{\sigma ^{2}}}{\boldsymbol{\epsilon }_{n}^{\top }}{\mathbf{M}_{n}}(\boldsymbol{\varphi }),\\ {} \frac{\partial \log {L_{n}}(\boldsymbol{\theta })}{\partial {\sigma ^{2}}}=& -\frac{n}{2{\sigma ^{2}}}+\frac{1}{2{\sigma ^{4}}}{\boldsymbol{\epsilon }_{n}^{\top }}{\boldsymbol{\epsilon }_{n}},\\ {} \frac{\partial \log {L_{n}}(\boldsymbol{\theta })}{\partial {\boldsymbol{\rho }_{j}}}=& \frac{1}{{\sigma ^{2}}}{\boldsymbol{\epsilon }_{n}^{\top }}{\mathbf{W}_{jn}}{\mathbf{S}_{n}}{(\boldsymbol{\rho })^{-1}}{\mathbf{M}_{n}}(\boldsymbol{\varphi })\boldsymbol{\beta }\\ {} & +\left(\frac{1}{{\sigma ^{2}}}{\boldsymbol{\epsilon }^{\top }}{\mathbf{W}_{jn}}{\mathbf{S}_{n}}{(\boldsymbol{\rho })^{-1}}{\boldsymbol{\epsilon }_{n}}-\mathrm{tr}({\mathbf{W}_{jn}}{\mathbf{S}_{n}^{-1}}(\boldsymbol{\rho }))\hspace{-0.1667em}\right),\\ {} \frac{\partial \log {L_{n}}(\boldsymbol{\theta })}{\partial \boldsymbol{\varphi }}=& \frac{1}{{\sigma ^{2}}}{\boldsymbol{\epsilon }_{n}^{\top }}\frac{\partial {\mathbf{M}_{n}}(\boldsymbol{\varphi })}{\partial \boldsymbol{\varphi }}\boldsymbol{\beta },\end{aligned}\]

for $j\in \{T,C\}$. Note that these are linear and quadratic functions of ${\boldsymbol{\epsilon }_{n}}$. Therefore we can apply the Central Limit Theorem for linear-quadratic functions [25] given as Lemma A8 below.

Lemma 8 (Theorem A.1 of Kelejian and Prucha [25]).

Consider the linear quadratic forms $(r=1,\dots ,m)$

\[ {\mathbf{Q}_{r,n}}={\boldsymbol{\epsilon }_{n}^{\top }}{\mathbf{A}_{r,n}}{\boldsymbol{\epsilon }_{n}}+{\mathbf{b}_{r,n}^{\top }}{\boldsymbol{\epsilon }_{n}},\]

where ${\boldsymbol{\epsilon }_{n}}={({\epsilon _{1,n}},\dots ,{\epsilon _{n,n}})^{\top }}$ is an $n\times 1$ random vector, and ${\mathbf{A}_{r,n}}={({a_{ij,r,n}})_{i,j=1,\dots ,n}}$ is an $n\times n$ non-stochastic real matrix, and ${\mathbf{b}_{r,n}}={({b_{1,r,n}},\dots ,{b_{n,r,n}})^{\top }}$ is an $n\times 1$ non-stochastic real vector. We make the following assumptions:

1. The real-valued random variables of the array $\{{\epsilon _{i,n}}:1\le i\le n,n\ge 1\}$ satisfy $\mathbb{E}[{\epsilon _{i,n}}]=0$. Furthermore, for each $n\ge 1$, the random variables ${\epsilon _{1,n}},\dots ,{\epsilon _{n,n}}$ are totally independent.
2. For $r=1,\dots ,m$, the elements of the array of real numbers $\{{a_{ij,r,n}}:1\le i,j\le n,n\ge 1\}$ satisfy ${a_{ij,r,n}}={a_{ji,r,n}}$ and ${\sup _{1\le j\le n,n\ge 1}}{\textstyle\sum _{i=1}^{n}}|{a_{ij,r,n}}|\lt \infty $. The elements of the array of real numbers $\{{b_{i,r,n}}:1\le i\le n,n\ge 1\}$ satisfy ${\sup _{n}}{n^{-1}}{\textstyle\sum _{i=1}^{n}}|{b_{i,r,n}}{|^{2+{\eta _{1}}}}\lt \infty $ for some ${\eta _{1}}\gt 0$.
3. For $r=1,\dots ,m$, one of the following two conditions holds
- (a) ${\sup _{1\le i\le n,n\ge 1}}\mathbb{E}|{\epsilon _{i,n}}{|^{2+{\eta _{2}}}}\lt \infty $ for some ${\eta _{2}}\gt 0$ and ${a_{ii,r,n}}=0$.
- (b) ${\sup _{1\le i\le n,n\ge 1}}\mathbb{E}|{\epsilon _{i,n}}{|^{4+{\eta _{2}}}}\lt \infty $ for some ${\eta _{2}}\gt 0$ (but possibly ${a_{ii,r,n}}\ne 0$)

Let

\[ {\mathbf{U}_{n}}={[{\mathbf{Q}_{1,n}},\dots ,{\mathbf{Q}_{m,n}}]^{\top }},\]

and ${\boldsymbol{\mu }_{{\mathbf{U}_{n}}}}=\mathbb{E}[{\mathbf{U}_{n}}]$ and ${\boldsymbol{\Sigma }_{{\mathbf{U}_{n}}}}$ denote the mean and variance-covariance matrix of ${\mathbf{U}_{n}}$, respectively. Suppose the assumptions hold and ${n^{-1}}{\lambda _{\min }}({\boldsymbol{\Sigma }_{{U_{n}}}})\ge c$ for some $c\gt 0$. Let ${\boldsymbol{\Sigma }_{{\mathbf{U}_{n}}}}=\big({\boldsymbol{\Sigma }_{{\mathbf{U}_{n}}}^{1/2}}\big){\big({\boldsymbol{\Sigma }_{{\mathbf{U}_{n}}}^{1/2}}\big)^{\top }}$, then

\[ {\boldsymbol{\Sigma }_{{\mathbf{U}_{n}}}^{-1/2}}({\mathbf{U}_{n}}-{\boldsymbol{\mu }_{{\mathbf{U}_{n}}}})\stackrel{d}{\to }\mathcal{N}(\mathbf{0},{\mathbf{I}_{m}}).\]

Therefore, we can apply the above theorem to $\frac{\partial \log {L_{n}}({\boldsymbol{\theta }_{0}})}{\partial \boldsymbol{\beta }}$ with $m=\mathrm{dim}(\boldsymbol{\theta })$ since all the multipliers to ${\boldsymbol{\epsilon }_{n}}$ are bounded. Note that the assumption on the minimum eigenvalue of the variance-covariance matrix is to ensure that matrices ${\boldsymbol{\Sigma }_{{\mathbf{V}_{n}}}}$ stay invertible as $n\to \infty $, to which we have an equivalent condition in Theorem (1). The assumption of symmetry is W.L.O.G since ${\boldsymbol{\epsilon }_{n}}{\mathbf{A}_{n}}{\boldsymbol{\epsilon }_{n}}={\boldsymbol{\epsilon }_{n}^{\top }}[({\mathbf{A}_{n}}+{\mathbf{A}_{n}^{\top }})/2]{\boldsymbol{\epsilon }_{n}}$ [24]. Hence we have

(A.2)

\[ {[{\mathbf{V}_{n}}({\boldsymbol{\theta }_{0}})]^{-1/2}}\frac{\partial \log {L_{n}}({\boldsymbol{\theta }_{0}})}{\partial \boldsymbol{\theta }}\stackrel{d}{\to }\mathcal{N}({\mathbf{0}_{\mathrm{dim}(\boldsymbol{\theta })}},{\mathbf{I}_{\mathrm{dim}.(\boldsymbol{\theta })}})\]

Now, what is left to be proved is

(A.3)

\[\begin{aligned}{}\frac{1}{n}\frac{\partial \log {L_{n}}({\tilde{\boldsymbol{\theta }}_{n}})}{\partial \boldsymbol{\theta }\partial {\boldsymbol{\theta }^{\top }}}=& \frac{1}{n}\frac{\partial \log {L_{n}}({\boldsymbol{\theta }_{0}})}{\partial \boldsymbol{\theta }\partial {\boldsymbol{\theta }^{\top }}}+{o_{P}}(1)\\ {} =& \frac{1}{n}\mathbb{E}\left[\frac{\partial \log {L_{n}}({\boldsymbol{\theta }_{0}})}{\partial \boldsymbol{\theta }\partial {\boldsymbol{\theta }^{\top }}}\right]+{o_{P}}(1).\end{aligned}\]

The second derivatives of $\log {L_{n}}(\boldsymbol{\theta })$ with respect to $\boldsymbol{\theta }$ are

\[\begin{aligned}{}\frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial \boldsymbol{\beta }\partial {\boldsymbol{\beta }^{\top }}}=& -\frac{1}{{\sigma ^{2}}}{\mathbf{M}_{n}}{(\boldsymbol{\varphi })^{\top }}{\mathbf{M}_{n}}(\boldsymbol{\varphi }),\\ {} \frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{{(\partial \sigma )^{2}}}=& \frac{n}{2{\sigma ^{4}}}-\frac{1}{{\sigma ^{6}}}{\boldsymbol{\epsilon }_{n}^{\top }}{\boldsymbol{\epsilon }_{n}},\\ {} \frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial {\boldsymbol{\rho }_{j}}\partial {\boldsymbol{\rho }_{l}}}=& -\frac{1}{{\sigma ^{2}}}{\mathbf{Y}_{n}^{\top }}{\mathbf{W}_{ln}^{\top }}{\mathbf{W}_{jn}}{\mathbf{Y}_{n}}\\ {} & -\mathrm{tr}({\mathbf{W}_{ln}}{\mathbf{S}_{n}}{(\boldsymbol{\rho })^{-1}}{\mathbf{W}_{jn}}{\mathbf{S}_{n}}{(\boldsymbol{\rho })^{-1}}),\\ {} \frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial \boldsymbol{\varphi }\partial {\boldsymbol{\varphi }^{\top }}}=& \frac{1}{{\sigma ^{2}}}\bigg[-{\boldsymbol{\beta }^{\top }}{\left(\frac{\partial {\mathbf{M}_{n}}(\boldsymbol{\varphi })}{\partial \boldsymbol{\varphi }}\right)^{\top }}\left(\frac{\partial {\mathbf{M}_{n}}(\boldsymbol{\varphi })}{\partial \boldsymbol{\varphi }}\right)\boldsymbol{\beta }\\ {} & +{\boldsymbol{\beta }^{\top }}{\left(\frac{{\partial ^{2}}{\mathbf{M}_{n}}(\boldsymbol{\varphi })}{\partial \boldsymbol{\varphi }\partial {\boldsymbol{\varphi }^{\top }}}\right)^{\top }}{\boldsymbol{\epsilon }_{n}}\bigg],\\ {} \frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial {\sigma ^{2}}\partial {\boldsymbol{\beta }^{\top }}}=& -\frac{1}{{\sigma ^{4}}}{\mathbf{M}_{n}}{(\boldsymbol{\varphi })^{\top }}{\boldsymbol{\epsilon }_{n}},\\ {} \frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial {\sigma ^{2}}\partial {\boldsymbol{\rho }_{j}}}=& -\frac{1}{{\sigma ^{4}}}{\mathbf{Y}_{n}^{\top }}{\mathbf{W}_{jn}^{\top }}{\boldsymbol{\epsilon }_{n}},\\ {} \frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial {\sigma ^{2}}\partial {\boldsymbol{\varphi }^{\top }}}=& -\frac{1}{{\sigma ^{4}}}{\boldsymbol{\beta }^{\top }}{\left(\frac{\partial {\mathbf{M}_{n}}(\boldsymbol{\varphi })}{\partial \boldsymbol{\varphi }}\right)^{\top }}{\boldsymbol{\epsilon }_{n}},\\ {} \frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial \boldsymbol{\beta }\partial {\boldsymbol{\rho }_{j}}}=& -\frac{1}{{\sigma ^{2}}}{\mathbf{Y}_{n}^{\top }}{\mathbf{W}_{jn}^{\top }}{\mathbf{M}_{n}}(\boldsymbol{\varphi }),\\ {} \frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial \boldsymbol{\varphi }\partial {\boldsymbol{\rho }_{j}}}=& -\frac{1}{{\sigma ^{2}}}{\mathbf{Y}_{n}^{\top }}{\mathbf{W}_{jn}^{\top }}\left(\frac{\partial {\mathbf{M}_{n}}(\boldsymbol{\varphi })}{\partial \boldsymbol{\varphi }}\right)\boldsymbol{\beta },\\ {} \frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial \boldsymbol{\beta }\partial {\boldsymbol{\varphi }^{\top }}}=& \frac{1}{{\sigma ^{2}}}\bigg[-{\boldsymbol{\beta }^{\top }}{\left(\frac{\partial {\mathbf{M}_{n}}(\boldsymbol{\varphi })}{\partial \boldsymbol{\varphi }}\right)^{\top }}{\mathbf{M}_{n}}(\boldsymbol{\varphi })\\ {} & +{\boldsymbol{\epsilon }_{n}^{\top }}\left(\frac{\partial {\mathbf{M}_{n}}(\boldsymbol{\varphi })}{\partial \boldsymbol{\varphi }}\right)\bigg],\end{aligned}\]

for $l,j\in \{T,C\}$. Let ${\mathbf{C}_{kn}}(\boldsymbol{\rho })={\mathbf{W}_{kn}}{\mathbf{S}_{n}}{(\boldsymbol{\rho })^{-1}}$ for $k\in \{T,C\}$. Using the mean-value theorem, we have

\[\begin{aligned}{}& \frac{1}{n}\mathrm{tr}({\mathbf{W}_{ln}}{\mathbf{S}_{n}}{({\tilde{\boldsymbol{\rho }}_{n}})^{-1}}{\mathbf{W}_{jn}}{\mathbf{S}_{n}}{({\tilde{\boldsymbol{\rho }}_{n}})^{-1}})\\ {} & =\frac{1}{n}\mathrm{tr}({\mathbf{C}_{ln}}({\tilde{\boldsymbol{\rho }}_{n}}){\mathbf{C}_{jn}}({\tilde{\boldsymbol{\rho }}_{n}}))\\ {} & =\frac{1}{n}\mathrm{tr}({\mathbf{C}_{ln}}({\boldsymbol{\rho }_{0}}){\mathbf{C}_{jn}}({\boldsymbol{\rho }_{0}}))+\frac{1}{n}{({\tilde{\boldsymbol{\rho }}_{n}}-{\boldsymbol{\rho }_{0}})^{\top }}\\ {} & \times \left[\begin{array}{c}...\\ {} \mathrm{tr}\bigg\{{\mathbf{C}_{kn}}({\bar{\boldsymbol{\rho }}_{n}})\Big({\mathbf{C}_{ln}}({\bar{\boldsymbol{\rho }}_{n}}){\mathbf{C}_{jn}}({\bar{\boldsymbol{\rho }}_{n}})+{\mathbf{C}_{jn}}({\bar{\boldsymbol{\rho }}_{n}}){\mathbf{C}_{ln}}({\bar{\boldsymbol{\rho }}_{n}})\Big)\bigg\}\\ {} ...\end{array}\right]\\ {} & =\frac{1}{n}\mathrm{tr}({\mathbf{C}_{ln}}({\boldsymbol{\rho }_{0}}){\mathbf{C}_{jn}}({\boldsymbol{\rho }_{0}}))+{o_{P}}(1),\end{aligned}\]

where ${\bar{\boldsymbol{\rho }}_{n}}$ lies between ${\tilde{\boldsymbol{\rho }}_{n}}$ and ${\boldsymbol{\rho }_{0}}$. By consistency, ${\hat{\boldsymbol{\theta }}_{n}}={\boldsymbol{\theta }_{0}}+{o_{P}}(1)$, thus ${\tilde{\boldsymbol{\theta }}_{n}}={\boldsymbol{\theta }_{0}}+{o_{P}}(1)$ and ${\bar{\boldsymbol{\rho }}_{n}}={\boldsymbol{\theta }_{0}}+{o_{P}}(1)$. The elements of ${\mathbf{C}_{kn}}({\bar{\boldsymbol{\rho }}_{n}}){\mathbf{C}_{ln}}({\bar{\boldsymbol{\rho }}_{n}}){\mathbf{C}_{jn}}({\bar{\boldsymbol{\rho }}_{n}})$ and ${\mathbf{C}_{kn}}({\bar{\boldsymbol{\rho }}_{n}}){\mathbf{C}_{jn}}({\bar{\boldsymbol{\rho }}_{n}}){\mathbf{C}_{ln}}({\bar{\boldsymbol{\rho }}_{n}})$ are $O(1)$ and the elements of ${\mathbf{Y}_{n}^{\top }}{\mathbf{W}_{ln}^{\top }}{\mathbf{W}_{jn}}{\mathbf{Y}_{n}}$ are ${O_{P}}(n/{h_{n}})$ by Lemma A7. Therefore

\[ \frac{1}{n}\frac{{\partial ^{2}}\log {L_{n}}({\tilde{\boldsymbol{\theta }}_{n}})}{\partial {\boldsymbol{\rho }_{j}}\partial {\boldsymbol{\rho }_{l}}}=\frac{1}{n}\frac{{\partial ^{2}}\log {L_{n}}({\boldsymbol{\theta }_{0}})}{\partial {\boldsymbol{\rho }_{j}}\partial {\boldsymbol{\rho }_{l}}}+{o_{P}}(1).\]

For other partial derivative terms, we can use the fact that

\[\begin{aligned}{}{\mathbf{M}_{n}}({\tilde{\boldsymbol{\varphi }}_{n}})=& {\mathbf{M}_{n}}({\boldsymbol{\varphi }_{0}})+o(1),\\ {} \frac{\partial {\mathbf{M}_{n}}({\tilde{\boldsymbol{\varphi }}_{n}})}{\partial \boldsymbol{\varphi }}=& \frac{\partial {\mathbf{M}_{n}}({\boldsymbol{\varphi }_{0}})}{\partial \boldsymbol{\varphi }}+o(1),\\ {} \frac{{\partial ^{2}}{\mathbf{M}_{n}}({\tilde{\boldsymbol{\varphi }}_{n}})}{\partial \boldsymbol{\varphi }\partial {\boldsymbol{\varphi }^{\top }}}=& \frac{{\partial ^{2}}{\mathbf{M}_{n}}({\boldsymbol{\varphi }_{0}})}{\partial \boldsymbol{\varphi }\partial {\boldsymbol{\varphi }^{\top }}}+o(1),\end{aligned}\]

by Assumption 6 and the continuous mapping theorem [45]; the fact that elements of ${\mathbf{M}_{n}}(\boldsymbol{\varphi })$, $\partial {\mathbf{M}_{n}}(\boldsymbol{\varphi })/\partial \boldsymbol{\varphi }$ and ${\partial ^{2}}{\mathbf{M}_{n}}(\boldsymbol{\varphi })/\partial \boldsymbol{\varphi }\partial {\boldsymbol{\varphi }^{\top }}$ are all $O(1)$; and Lemma A5, A6 and A7. Similarly, we also can prove

\[ \frac{1}{n}\frac{\partial \log {L_{n}}({\boldsymbol{\theta }_{0}})}{\partial \boldsymbol{\theta }\partial {\boldsymbol{\theta }^{\top }}}+{o_{P}}(1)=\frac{1}{n}\mathbb{E}\left[\frac{\partial \log {L_{n}}({\boldsymbol{\theta }_{0}})}{\partial \boldsymbol{\theta }\partial {\boldsymbol{\theta }^{\top }}}\right]+{o_{P}}(1),\]

with

\[\begin{aligned}{}\mathbb{E}\left[\frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial \boldsymbol{\beta }\partial {\boldsymbol{\beta }^{\top }}}\right]& =-\frac{1}{{\sigma ^{2}}}{\mathbf{M}_{n}}{(\boldsymbol{\varphi })^{\top }}{\mathbf{M}_{n}}(\boldsymbol{\varphi }),\\ {} \mathbb{E}\left[\frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{{(\partial \sigma )^{2}}}\right]& =-\frac{n}{2{\sigma ^{4}}},\\ {} \mathbb{E}\left[\frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial {\boldsymbol{\rho }_{j}}\partial {\boldsymbol{\rho }_{l}}}\right]& =-\frac{1}{{\sigma ^{2}}}{\boldsymbol{\beta }^{\top }}{\mathbf{M}_{n}}{(\boldsymbol{\varphi })^{\top }}{\mathbf{C}_{ln}}{(\boldsymbol{\rho })^{\top }}{\mathbf{C}_{jn}}(\boldsymbol{\rho }){\mathbf{M}_{n}}(\boldsymbol{\varphi })\boldsymbol{\beta }\\ {} & \hspace{1em}-\mathrm{tr}({\mathbf{C}_{ln}}{(\boldsymbol{\rho })^{\top }}{\mathbf{C}_{jn}}(\boldsymbol{\rho }))\hspace{0.1667em}-\hspace{0.1667em}\mathrm{tr}({\mathbf{C}_{ln}}(\boldsymbol{\rho }){\mathbf{C}_{jn}}(\boldsymbol{\rho })),\\ {} \mathbb{E}\left[\frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial \boldsymbol{\varphi }\partial {\boldsymbol{\varphi }^{\top }}}\right]& =-\frac{1}{{\sigma ^{2}}}{\boldsymbol{\beta }^{\top }}{\left(\frac{\partial {\mathbf{M}_{n}}(\boldsymbol{\varphi })}{\partial \boldsymbol{\varphi }}\right)^{\top }}\left(\frac{\partial {\mathbf{M}_{n}}(\boldsymbol{\varphi })}{\partial \boldsymbol{\varphi }}\right)\boldsymbol{\beta },\\ {} \mathbb{E}\left[\frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial {\sigma ^{2}}\partial {\boldsymbol{\beta }^{\top }}}\right]& ={\mathbf{0}_{4}},\\ {} \mathbb{E}\left[\frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial {\sigma ^{2}}\partial {\boldsymbol{\rho }_{j}}}\right]& =-\frac{1}{{\sigma ^{2}}}\mathrm{tr}({\mathbf{C}_{jn}}),\\ {} \mathbb{E}\left[\frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial {\sigma ^{2}}\partial {\boldsymbol{\varphi }^{\top }}}\right]& ={\mathbf{0}_{p}},\\ {} \mathbb{E}\left[\frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial \boldsymbol{\beta }\partial {\boldsymbol{\rho }_{j}}}\right]& =-\frac{1}{{\sigma ^{2}}}{\boldsymbol{\beta }^{\top }}{\mathbf{M}_{n}}{(\boldsymbol{\varphi })^{\top }}{\mathbf{C}_{jn}}{(\boldsymbol{\rho })^{\top }}{\mathbf{M}_{n}}(\boldsymbol{\varphi }),\\ {} \mathbb{E}\left[\frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial \boldsymbol{\varphi }\partial {\boldsymbol{\rho }_{j}}}\right]& =-\frac{1}{{\sigma ^{2}}}{\boldsymbol{\beta }^{\top }}{\mathbf{M}_{n}}{(\boldsymbol{\varphi })^{\top }}{C_{jn}}{(\boldsymbol{\rho })^{\top }}\left(\frac{\partial {\mathbf{M}_{n}}(\boldsymbol{\varphi })}{\partial \boldsymbol{\varphi }}\right)\boldsymbol{\beta },\\ {} \mathbb{E}\left[\frac{{\partial ^{2}}\log {L_{n}}(\boldsymbol{\theta })}{\partial \boldsymbol{\beta }\partial {\boldsymbol{\varphi }^{\top }}}\right]& =-\frac{1}{{\sigma ^{2}}}{\boldsymbol{\beta }^{\top }}{\left(\frac{\partial {\mathbf{M}_{n}}(\boldsymbol{\varphi })}{\partial \boldsymbol{\varphi }}\right)^{\top }}{\mathbf{M}_{n}}(\boldsymbol{\varphi }),\end{aligned}\]

where p is the number of parameters in $\boldsymbol{\varphi }$. This completes the proof for (A.3). Finally, from (A.2), (A.3), and Slutsky’s theorem, Theorem 1 is proved.

References

[1]

Advani, A. and Malde, B. (2018). Methods to identify linear network models: a review. Swiss Journal of Economics and Statistics 154(1) 12.

[2]

Aronow, P. M., Samii, C. et al. (2017). Estimating average causal effects under general interference, with application to a social network experiment. The Annals of Applied Statistics 11(4) 1912–1947. https://doi.org/10.1214/16-AOAS1005. MR3743283

[3]

Banerjee, S., Carlin, B. P. and Gelfand, A. E. (2003). Hierarchical modeling and analysis for spatial data. Chapman and Hall/CRC. MR3362184

[4]

Basse, G. and Feller, A. (2018). Analyzing two-stage experiments in the presence of interference. Journal of the American Statistical Association 113(521) 41–55. https://doi.org/10.1080/01621459.2017.1323641. MR3803438

[5]

Basse, G. W. and Airoldi, E. M. (2018). Model-assisted design of experiments in the presence of network-correlated outcomes. Biometrika 105(4) 849–858. https://doi.org/10.1093/biomet/asy036. MR3877869

[6]

Bowers, J., Desmarais, B. A., Frederickson, M., Ichino, N., Lee, H.-W. and Wang, S. (2018). Models, methods and network topology: experimental design for the study of interference. Social Networks 54 196–208.

[7]

Bramoullé, Y., Djebbari, H. and Fortin, B. (2009). Identification of peer effects through social networks. Journal of Econometrics 150(1) 41–55. https://doi.org/10.1016/j.jeconom.2008.12.021. MR2525993

[8]

Candogan, O., Chen, C. and Niazadeh, R. (2021). Correlated cluster-based randomized Experiments: robust Variance Minimization. Chicago Booth Research Paper No. 21-17.

[9]

Chin, A. (2019). Regression adjustments for estimating the global treatment effect in experiments with interference. Journal of Causal Inference 7(2). https://doi.org/10.1515/jci-2018-0026. MR4350071

[10]

Cliff, A. D. (1981). Spatial processes: models & applications. Pion, London. MR0632256

[11]

Cox, D. R. (1958). Planning of experiments. Wiley. MR0095561

[12]

Eckles, D., Karrer, B. and Ugander, J. (2016). Design and analysis of experiments in networks: reducing bias from interference. Journal of Causal Inference 5(1). https://doi.org/10.1515/jci-2015-0021. MR4323809

[13]

Fedorov, V. V. and Leonov, S. L. (2013). Optimal design for nonlinear response models. CRC Press. MR3113978

[14]

Fruchterman, T. M. and Reingold, E. M. (1991). Graph drawing by force-directed placement. Software: Practice and Experience 21(11) 1129–1164.

[15]

Gossen, H. H. (1983). The laws of human relations and the rules of human action derived therefrom. MIT Press.

[16]

Gui, H., Xu, Y., Bhasin, A. and Han, J. (2015). Network A/B testing: from sampling to estimation. In Proceedings of the 24th International Conference on World Wide Web 399–409.

[17]

Gupta, S., Kohavi, R., Tang, D., Xu, Y., Andersen, R., Bakshy, E., Cardin, N., Chandran, S., Chen, N., Coey, D., Curtis, M., Deng, A., Duan, W., Forbes, P., Frasca, B., Guy, T., Imbens, G. W., Saint Jacques, G., Kantawala, P., Katsev, I., Katzwer, M., Konutgan, M., Kunakova, E., Lee, M., Lee, M., Liu, J., McQueen, J., Najmi, A., Smith, B., Trehan, V., Vermeer, L., Walker, T., Wong, J. and Yashkov, I. (2019). Top challenges from the first practical online controlled experiments summit. ACM SIGKDD Explorations Newsletter 21(1) 20–35. https://doi.org/10.1145/3331651.3331655

[18]

Harville, D. A. (1998). Matrix algebra from a statistician’s perspective. Taylor & Francis. https://doi.org/10.1007/b98818. MR1467237

[19]

Horn, R. A. and Johnson, C. R. (2012). Matrix analysis. Cambridge University Press. MR2978290

[20]

Hudgens, M. G. and Halloran, M. E. (2008). Toward causal inference with interference. Journal of the American Statistical Association 103(482) 832–842. https://doi.org/10.1198/016214508000000292. MR2435472

[21]

Jennrich, R. I. (1969). Asymptotic properties of non-linear least squares estimators. The Annals of Mathematical Statistics 40(2) 633–643. https://doi.org/10.1214/aoms/1177697731. MR0238419

[22]

Jochmans, K. and Weidner, M. (2019). Fixed-Effect Regressions on Network Data. Econometrica 87(5) 1543–1560. https://doi.org/10.3982/ECTA14605. MR4021456

[23]

Karrer, B., Shi, L., Bhole, M., Goldman, M., Palmer, T., Gelman, C., Konutgan, M. and Sun, F. (2021). Network experimentation at scale. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining 3106–3116.

[24]

Kelejian, H. H. and Prucha, I. R. (2001). On the asymptotic distribution of the Moran I test statistic with applications. Journal of Econometrics 104(2) 219–257. https://doi.org/10.1016/S0304-4076(01)00064-1. MR1864417

[25]

Kelejian, H. H. and Prucha, I. R. (2010). Specification and estimation of spatial autoregressive models with autoregressive and heteroskedastic disturbances. Journal of Econometrics 157(1) 53–67. https://doi.org/10.1016/j.jeconom.2009.10.025. MR2652278

[26]

Koutra, V., Gilmour, S. G. and Parker, B. M. (2021). Optimal block designs for experiments on networks. Journal of the Royal Statistical Society: Series C (Applied Statistics) 70(3) 596–618. https://doi.org/10.1111/rssc.12473. MR4275838

[27]

La Vigne, N. G., Lowry, S. S., Markman, J. A. and Dwyer, A. M. (2011). Evaluating the use of public surveillance cameras for crime control and prevention. Washington, DC: US Department of Justice, Office of Community Oriented Policing Services. Urban Institute, Justice Policy Center.

[28]

Lee, L.-F. (2004). Asymptotic distributions of quasi-maximum likelihood estimators for spatial autoregressive models. Econometrica 72(6) 1899–1925. https://doi.org/10.1111/j.1468-0262.2004.00558.x. MR2095537

[29]

Li, T., Levina, E. and Zhu, J. (2019). Prediction models for network-linked data. The Annals of Applied Statistics 13(1) 132–164. https://doi.org/10.1214/18-AOAS1205. MR3937424

[30]

Lynn, C. W. and Bassett, D. S. (2019). The physics of brain network structure, function and control. Nature Reviews Physics 1(5) 318–332.

[31]

Manski, C. F. (1993). Identification of endogenous social effects: the reflection problem. The Review of Economic Studies 60(3) 531–542. https://doi.org/10.2307/2298123. MR1236836

[32]

Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. The Computer Journal 7(4) 308–313. https://doi.org/10.1093/comjnl/7.4.308. MR3363409

[33]

Paluck, E. L., Shepherd, H. and Aronow, P. M. (2016). Changing climates of conflict: a social network experiment in 56 schools. Proceedings of the National Academy of Sciences 113(3) 566–571.

[34]

Panger, G. (2016). Reassessing the Facebook experiment: critical thinking about the validity of big data research. Information, Communication & Society 19(8) 1108–1126.

[35]

Parker, B. M., Gilmour, S. G. and Schormans, J. (2017). Optimal design of experiments on connected units with application to social networks. Journal of the Royal Statistical Society. Series C (Applied Statistics) 66(3) 455–480. https://doi.org/10.1111/rssc.12170. MR3632337

[36]

Pokhilko, V., Zhang, Q., Kang, L. and Mays, D. P. (2019). D-optimal design for network A/B testing. Journal of Statistical Theory and Practice 13(4) 1–23. https://doi.org/10.1007/s42519-019-0058-3. MR4021855

[37]

Pukelsheim, F. (2006). Optimal design of experiments. SIAM. https://doi.org/10.1137/1.9780898719109. MR2224698

[38]

Rossi, R. A. and Ahmed, N. K. (2015). The network data repository with interactive graph analytics and visualization. In AAAI. https://networkrepository.com.

[39]

Saint-Jacques, G., Varshney, M., Simpson, J. and Xu, Y. (2019). Using Ego-Clusters to Measure Network Effects at LinkedIn. arXiv preprint arXiv:1903.08755.

[40]

Shalizi, C. R. and Thomas, A. C. (2011). Homophily and contagion are generically confounded in observational social network studies. Sociological Methods & Research 40(2) 211–239. https://doi.org/10.1177/0049124111404820. MR2767833

[41]

Tchetgen, E. J. T. and VanderWeele, T. J. (2012). On causal inference in the presence of interference. Statistical Methods in Medical Research 21(1) 55–75. https://doi.org/10.1177/0962280210386779. MR2867538

[42]

Traud, A. L., Mucha, P. J. and Porter, M. A. (2012). Social structure of Facebook networks. Journal of Physics A 391(16) 4165–4180.

[43]

Traud, A. L., Kelsic, E. D., Mucha, P. J. and Porter, M. A. (2011). Comparing community structure to characteristics in online collegiate social networks. SIAM Review 53(3) 526–543. https://doi.org/10.1137/080734315. MR2834086

[44]

Ugander, J., Karrer, B., Backstrom, L. and Kleinberg, J. (2013). Graph cluster randomization: network exposure to multiple universes. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 329–337.

[45]

Van der Vaart, A. W. (2000). Asymptotic statistics 3. Cambridge University Press. https://doi.org/10.1017/CBO9780511802256. MR1652247

[46]

White, H. (1996). Estimation, inference and specification analysis. Cambridge University Press. https://doi.org/10.1017/CCOL0521252806. MR1292251

[47]

Xu, Y., Chen, N., Fernandez, A., Sinno, O. and Bhasin, A. (2015). From infrastructure to culture: A/B testing challenges in large scale social networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2227–2236.

Reading mode

Table of contents

1 Introduction
2 General Additive Network Effect Model
3 Maximum Likelihood Inference
4 Simulations
5 Discussion
Appendix A Mathematical Details for Section 3
References

Open access article under the CC BY license.

Keywords

Design of experiments A/B testing Social networks SUTVA Network effect modeling

Funding

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) by way of Grants RGPIN-2019-04212 and RGPIN-2023-03245.

Metrics

since December 2021

935

Article info
views

319

Full article
views

329

PDF
downloads

101

XML
downloads

RSS

Figures
7
Tables
3
Theorems
1
Supplementary
1

Figure 1

Network of Facebook connections among 770 Caltech Facebook users. The network was plotted using the igraph package with the Fruchterman-Reingold layout [14] generated from the qgraph package.

Figure 2

Figure 3

Figure 4

Figure 5

Rejection rates of hypothesis tests for POW-DEG specification on the Caltech Facebook network with varying parameters.

Figure 6

Model misspecification simulation results.

Figure 7

Characteristics of “best” 5%, randomly selected, and “worst” 5% designs with respect to the variance of estimated global treatment effect on the Caltech Facebook network.

Table 1

Number of nodes and edges of the networks used in the simulations.

Table 2

Parameters for the simulation in Section 4.3.

Table 3

Theorem 1.

Supplementary Material
nejsds29_s001.pdf

The Supplementary Material to “General Additive Network Effect Models” contains additional simulation results for Section 4.

Figure 1

Network of Facebook connections among 770 Caltech Facebook users. The network was plotted using the igraph package with the Fruchterman-Reingold layout [14] generated from the qgraph package.

Figure 2

Figure 3

Figure 4

Figure 5

Rejection rates of hypothesis tests for POW-DEG specification on the Caltech Facebook network with varying parameters.

Figure 6

Model misspecification simulation results.

Figure 7

Characteristics of “best” 5%, randomly selected, and “worst” 5% designs with respect to the variance of estimated global treatment effect on the Caltech Facebook network.

Table 1

Number of nodes and edges of the networks used in the simulations.

Networks	# of nodes $(n)$	# of edges
Caltech network	770	16,656
UMich network	3,749	81,903

Table 2

Parameters for the simulation in Section 4.3.

Specification	μ	τ	${\rho _{T}}$	${\rho _{C}}$	${\gamma _{T}}$	${\gamma _{C}}$	λ	σ
SUTVA	0	2	0	0	0	0		1
LNE	0	1	0	0	0.1231	0.1		1
POW-DEG	0	1	0	0	0.2691	0.1	0.5	1
LAG	0	1	0.008492	0.001	0	0		0.9977
HOM	0	1	0.1	0.1	0.01728	0		0.9999

Table 3

Designs	λ	0.5	0.75	1.00	1.75
	${\gamma _{T}}\gt {\gamma _{C}}$	1.7521	2.3841	5.5776	23.7801
Best 500 designs	${\gamma _{T}}={\gamma _{C}}$	1.3875	2.0251	5.1557	23.2936
	${\gamma _{T}}\lt {\gamma _{C}}$	1.6177	2.3152	5.5369	23.7562
	${\gamma _{T}}\gt {\gamma _{C}}$	0.2469	0.16627	0.0807	0.0486
Worst 500 designs	${\gamma _{T}}={\gamma _{C}}$	0.4772	0.2032	0.0792	0.04791
	${\gamma _{T}}\lt {\gamma _{C}}$	0.2291	0.1588	0.0797	0.0484

Theorem 1.

where $\mathrm{dim}(\cdot )$ denotes the length of a vector.

Authors

Abstract

1 Introduction

1.1 Context

Figure 1

1.2 Related Work

1.3 Our Contribution

2 General Additive Network Effect Model

2.1 Modeling Framework

(2.1)

(2.2)

2.2 Model Specifications

(2.3)

(2.4)

(2.5)

(2.6)

2.3 Quantities of Interest

(2.7)

(2.8)

(2.9)

Hypothesis 1 (Direct treatment effect).

Hypothesis 2 (SUTVA).

Hypothesis 3 (Indirect treatment effect).

Hypothesis 4 (Global treatment effect).

2.4 Experimental Design

3 Maximum Likelihood Inference

3.1 Estimation

(3.1)

(3.2)

(3.3)

Lemma 1.

(3.4)

(3.5)

(3.6)

(3.7)

3.2 Asymptotic Results

Theorem 1.

3.3 Inference for Causal Quantities

(3.8)

(3.9)

4 Simulations

Table 1

4.1 The Distribution of the Estimates

Figure 2

Figure 3

Figure 4

4.2 Hypothesis Testing

Table 2

Figure 5

4.3 Model Misspecification

Figure 6

4.4 Experimental Design

Figure 7

Table 3

5 Discussion

Appendix A Mathematical Details for Section 3

A.1 Proof of Lemma 1

Lemma 2 (Theorem 18.2.16 of Harville [18]).

Lemma 3 (Lemma 5.6.11 of Horn and Johnson [19]).

A.2 Assumptions Needed for Asymptotic Results

Assumption 1.

Assumption 2.

Assumption 3.

Assumption 4.

Assumption 5.

Assumption 6.

A.3 Proof of Theorem 1

Lemma 4 (Theorem 3.4 of White [46]).

(A.1)

Lemma 5.

Lemma 6.

Lemma 7.

Lemma 8 (Theorem A.1 of Kelejian and Prucha [25]).

(A.2)

(A.3)

References

Export citation

Copy and paste formatted citation

Download citation in file

Figure 1