E-detectors: a nonparametric framework for sequential change detection

Sequential change detection is a classical problem with a variety of applications. However, the majority of prior work has been parametric, for example, focusing on exponential families. We develop a fundamentally new and general framework for sequential change detection when the pre- and post-change distributions are nonparametrically specified (and thus composite). Our procedures come with clean, nonasymptotic bounds on the average run length (frequency of false alarms). In certain nonparametric cases (like sub-Gaussian or sub-exponential), we also provide near-optimal bounds on the detection delay following a changepoint. The primary technical tool that we introduce is called an \emph{e-detector}, which is composed of sums of e-processes -- a fundamental generalization of nonnegative supermartingales -- that are started at consecutive times. We first introduce simple Shiryaev-Roberts and CUSUM-style e-detectors, and then show how to design their mixtures in order to achieve both statistical and computational efficiency. Our e-detector framework can be instantiated to recover classical likelihood-based procedures for parametric problems, as well as yielding the first change detection method for many nonparametric problems. As a running example, we tackle the problem of detecting changes in the mean of a bounded random variable without i.i.d. assumptions, with an application to tracking the performance of a basketball team over multiple seasons.


Introduction
Suppose we sequentially observe a stream of random variables $X_1, X_2, \ldots$, whose marginal distributions may change at some unknown time, or changepoint, $\nu$. To take one concrete example that we generalize later, denote the data stream by $X_1, \ldots, X_\nu \sim P_{\mu_0}$ and $X_{\nu+1}, X_{\nu+2}, \ldots \sim P_\mu$, where $P_{\mu_0}$ and $P_\mu$ are probability distributions with parameters $\mu_0, \mu \in \mathbb{R}$, respectively. Let $\mathbb{P}_\nu$ and $\mathbb{E}_\nu$ denote probability and expectation, respectively, with respect to the distribution of the entire infinite data stream when the change occurs at time $\nu$. If there is no change, we think of $\nu$ as being equal to $\infty$, and we let $\mathbb{P}_\infty$ and $\mathbb{E}_\infty$ refer to the corresponding probability and expectation.
We are concerned with designing sequential changepoint detection procedures that determine, at each time, whether a changepoint has occurred in the near past, and that (i) provide nonasymptotic false alarm guarantees, (ii) allow for nonparametric classes of pre- and post-change distributions, and (iii) are computationally efficient. Formally, a sequential changepoint detection algorithm consists of a data-dependent stopping rule $N^* \geq 1$. If not specified explicitly, the underlying filtration with respect to which $N^*$ is defined is assumed to be the natural filtration generated by the data stream $X_1, X_2, \ldots$, but in some cases it is beneficial to coarsen the filtration. If the stopping time $N^*$ is finite, then we declare that a changepoint has been detected, in the sense that sufficient evidence has accumulated to support the hypothesis that the data-generating distribution has changed. If the algorithm never stops, we set $N^* = \infty$ and no changepoint is proclaimed. To evaluate the performance of a detection algorithm, we check how quickly it can detect a change in the distribution while controlling the frequency of false alarms.
To control false alarms, we adopt the average run length (ARL) metric [23], defined by
$$\mathrm{ARL}(N^*) := \mathbb{E}_\infty[N^*].$$
We say that the "ARL is controlled at level $\alpha$" if $\mathbb{E}_\infty[N^*] \geq 1/\alpha$. An equivalent error metric is the false alarm rate (FAR), which is the reciprocal of the ARL. We would like to ensure that the FAR is at most $\alpha$, and we call a sequential change detection procedure "valid" if it satisfies this constraint.
A widely used measure of the speed of detection after a changepoint is the worst average delay, conditioned either on the least favorable observations before the change [18] or on the event that the algorithm stops after the changepoint [25]. These are defined by
$$J_L(N^*) := \sup_{\nu \geq 0} \operatorname{ess\,sup}\, \mathbb{E}_\nu\left[(N^* - \nu)^+ \mid X_1, \ldots, X_\nu\right], \qquad J_P(N^*) := \sup_{\nu \geq 0} \mathbb{E}_\nu\left[N^* - \nu \mid N^* > \nu\right],$$
where the subscripts indicate the authors (Lorden and Pollak, respectively), and it is known that $J_P(N^*) \leq J_L(N^*)$ [22]. Our implicit objective is to (approximately) minimize $J_L(N^*)$ or $J_P(N^*)$ while guaranteeing that the ARL is controlled at a prespecified level $\alpha \in (0, 1)$. We differ from other work in that our focus is on nonparametric and composite pre- and post-change distributions, as well as on deriving nonasymptotic guarantees. In several such settings, it is a priori unclear how to define any valid sequential change detection algorithm, let alone an optimal one. Accordingly, we first address the design problem, and then move to questions involving (approximate) optimality for detection delays.

Prior work and our contributions
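If both the pre- and post-change parameters $\mu_0, \mu$ are known and the distributions have densities $p_{\mu_0}$ and $p_\mu$ with respect to some common reference measure, then the CUSUM procedure [23] is known to achieve the optimal worst average delay (exactly for $J_L(N^*)$ and asymptotically for $J_P(N^*)$ as $\alpha \to 0$) among all procedures controlling the ARL at the same level [18,21,31,14]. Recall that the CUSUM procedure is defined by the stopping time
$$N^*_{\mathrm{CU}} := \inf\left\{n \geq 1 : M^{\mathrm{CU}}_n \geq c^{\mathrm{CU}}_\alpha\right\},$$
where $c^{\mathrm{CU}}_\alpha > 0$ is a constant chosen so that $\mathbb{E}_\infty(N^*_{\mathrm{CU}}) = 1/\alpha$, and the test statistic $M^{\mathrm{CU}}_n$ is defined by the recursive formula
$$M^{\mathrm{CU}}_n := \frac{p_\mu(X_n)}{p_{\mu_0}(X_n)} \cdot \max\left\{M^{\mathrm{CU}}_{n-1}, 1\right\}, \qquad M^{\mathrm{CU}}_0 := 0. \tag{4}$$
The Shiryaev-Roberts (SR) procedure [37,32] is defined by the stopping time
$$N^*_{\mathrm{SR}} := \inf\left\{n \geq 1 : M^{\mathrm{SR}}_n \geq c^{\mathrm{SR}}_\alpha\right\},$$
where $c^{\mathrm{SR}}_\alpha > 0$ is a constant chosen so that $\mathbb{E}_\infty(N^*_{\mathrm{SR}}) = 1/\alpha$, and the test statistic $M^{\mathrm{SR}}_n$ is obtained recursively as
$$M^{\mathrm{SR}}_n := \frac{p_\mu(X_n)}{p_{\mu_0}(X_n)} \cdot \left[M^{\mathrm{SR}}_{n-1} + 1\right], \qquad M^{\mathrm{SR}}_0 := 0. \tag{5}$$
Unlike CUSUM, the SR procedure does not achieve exact minimax optimality for $J_L(N^*)$. However, the SR procedure and its generalized versions enjoy strong asymptotic optimality guarantees [26,27,41,39].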
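The two classical recursions can be sketched in a few lines. The following is a minimal illustration only, assuming unit-variance Gaussian pre- and post-change distributions with known means; the function names, the simulated stream, and the changepoint at time 200 are hypothetical choices made for this example and are not taken from the paper.

```python
import math
import random

def gaussian_lr(x, mu0, mu1, sigma=1.0):
    """Likelihood ratio p_mu1(x) / p_mu0(x) for N(mu, sigma^2) densities."""
    return math.exp(((mu1 - mu0) * x - 0.5 * (mu1**2 - mu0**2)) / sigma**2)

def classical_detectors(xs, mu0, mu1, sigma=1.0):
    """Return the CUSUM and SR statistic paths, both started at 0."""
    m_cu, m_sr = 0.0, 0.0
    cu_path, sr_path = [], []
    for x in xs:
        lr = gaussian_lr(x, mu0, mu1, sigma)
        m_cu = lr * max(m_cu, 1.0)  # CUSUM: keep only the best starting time
        m_sr = lr * (m_sr + 1.0)    # SR: add a fresh likelihood ratio at every time
        cu_path.append(m_cu)
        sr_path.append(m_sr)
    return cu_path, sr_path

def first_crossing(path, threshold):
    """1-indexed first time the path reaches the threshold, or None."""
    for n, value in enumerate(path, start=1):
        if value >= threshold:
            return n
    return None

# Hypothetical stream: mean 0 before time 200, mean 1 afterwards.
rng = random.Random(0)
nu = 200
xs = [rng.gauss(0.0, 1.0) for _ in range(nu)]
xs += [rng.gauss(1.0, 1.0) for _ in range(300)]
cu_path, sr_path = classical_detectors(xs, mu0=0.0, mu1=1.0)
```

Both statistics cost constant time and memory per observation, which is the property the baseline e-detectors later in the paper are designed to retain. The SR path always dominates the CUSUM path, since a sum of nonnegative terms is at least their maximum.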
The literature sometimes assumes that the pre-change distribution is known or can be approximated with high precision by using the previous history. However, the post-change distribution is typically unknown and is assumed to belong to a family of distributions $\mathcal{P} := \{p_\mu : \mu \in \Theta\}$. In this case, one natural approach is to replace the unknown parameter $\mu$ with an estimator $\hat{\mu}$. If we use the maximum likelihood estimator (MLE), then we obtain the CUSUM procedure based on the generalized likelihood ratio (GLR) rule [e.g., see 1,49,38]. For those not familiar with sequential change detection, [15] provides a good overview.
Limitations of prior work. Though the use of the GLR statistic for the sequential change detection problem has a long history and often yields good empirical performance, the current literature has two main limitations.
First, most existing methodology relies on parametric assumptions on the family of distributions (e.g., exponential families), both pre-change and post-change. There have been attempts to move away from this setting, and we will discuss these later. However, a framework for deriving sequential change detection procedures in general nonparametric or composite settings has not previously been presented in the generality that we pursue here.
Second, the study of statistical properties has typically focused on the asymptotic regime of $\alpha \to 0$, unless the GLR statistic is defined on a well-separated post-change parameter space. In this paper, we guarantee nonasymptotic control on the ARL at a prespecified level (such as $\alpha = 0.001$). In fact, in many settings considered, we do not know of any existing method to guarantee (even asymptotic) ARL control.
(We think that in theory and practice, the first is a bigger issue than the second. Luckily, our solution for the first automatically handles the second. Indeed, in the composite and nonparametric settings we consider, it is quite unclear how to control the ARL in any asymptotic sense.)
Finally, a direct online implementation of the GLR rule is infeasible since the memory and computation time both increase at least linearly as $n \to \infty$. One natural approach to tackle this online implementation issue is to use window-limited versions of the GLR rule [49,13]. For instance, a simple form of the window-limited GLR rule can be defined by, at each time $n$, computing $\hat{\mu}$ over only times $n - W$ to $n$ for a properly chosen window size $W > 0$. However, the optimal choice of window size $W$ has been studied only in the asymptotic setting ($\alpha \to 0$). For a fixed $\alpha$, the optimal window size depends on the difference between the pre- and post-change distributions, which is unknown.
Our contributions. We present a general framework for sequential change detection, focusing on (parametric and nonparametric) settings with composite pre- and post-change distributions, and nonasymptotic guarantees on the ARL:
1. We introduce the concept of an e-detector that underlies our construction of sequential change detection procedures. The e-detector utilizes a generalization of the underlying martingale structure of likelihood ratios in classical sequential change detection procedures. This e-detector framework is applicable in nonparametric settings including sub-Gaussian, sub-exponential, and bounded random variables, among many others. In such settings, there is no common reference measure and likelihood ratios cannot be directly defined, so composite nonnegative supermartingales, or more generally "e-processes", must be employed in their place.
2. Despite handling composite pre- and post-change distributions, even without an iid (independent and identically distributed) assumption on the data, our e-CUSUM and e-SR sequential change detection procedures based on e-detectors can always nonasymptotically control the ARL at level $\alpha$.
3. Nonasymptotic bounds on the worst average delay are derived in special cases for nonparametric distributions with exponential tail decay (such as sub-Gaussian or sub-exponential), and they match the rate of known lower bounds for exponential families as $\alpha \to 0$.
4. Computationally feasible algorithms are presented, in order to run our procedures in an online fashion without windowing. Practical strategies to choose hyperparameters are discussed. These are based on an adaptive mixture method, with the number of mixture components growing slowly over time.
Our procedures have natural gambling interpretations, and our work can be viewed as setting the foundations for a game-theoretic approach to sequential change detection. Before discussing the general framework in detail, in the following subsection, we present a motivating real-world example involving bounded random variables to illustrate how our nonparametric framework can be easily used in settings in which it is nontrivial to apply other common methods.

Example: A changepoint in Cleveland Cavaliers 2011-2018
The Cleveland Cavaliers are an American professional basketball team. We use the Cavaliers' game point records over the 2010-11 to 2017-18 NBA seasons to illustrate how our proposed nonparametric sequential change detection algorithm can be applied to detect an interesting changepoint in the Cavaliers' recent history. The left plot in Figure 1 shows the difference between the scores of the Cavaliers and those of their opposing teams (also known as Plus-Minus) in all the games from the 2010-11 to the 2017-18 regular seasons. Each red line refers to the yearly average difference score in each season. Roughly, this value shows how well the Cavaliers performed against their opponents in terms of scoring. Typically, if a seasonal average is positive (or negative), then we may say that the Cavaliers showed a strong (or poor) performance in the corresponding season. The right plot shows a changepoint detected in early 2015; NBA fans may recall one major cause of the sharp improvement: LeBron James returned to the Cavaliers in 2014. However, how can we detect such a change on the fly by only tracking the Plus-Minus for each game?
This type of question fits well into the sequential change detection framework. Let $X_1, X_2, \ldots$ be the sequence of Plus-Minus stats we observe sequentially. After observing a poor performance of the Cavaliers in the 2010-11 season, we define the Cavaliers' pre-change distribution on the Plus-Minus stats as follows: the average Plus-Minus of the team is less than or equal to $\mu_0 := -1$. Now, we are interested in detecting a meaningful performance improvement on the fly by defining the post-change distribution as follows: the average Plus-Minus of the team is greater than $\mu_1 := 1$. Here, the gap $|\mu_1 - \mu_0|$ between the pre- and post-change averages of Plus-Minus is the degree of improvement that we consider significant.
Although the formulation of the problem is simple as described above, it is still nontrivial to fit this problem into commonly used sequential change detection procedures, for the following reasons. First, it is not easy to choose a proper parametric model to fit the observed Plus-Minus stats, since they are integer-valued samples with varying mean and variance over seasons, as Figure 1 illustrates. Second, even if we can choose a proper model, it is difficult to find a threshold to detect the changepoint, since we are interested in detecting any change larger than $|\mu_1 - \mu_0|$ instead of a fixed post-change distribution. Many common methods rely on a high-quality simulator or a large enough sample history of pre-change observations to compute a valid threshold. For this example, however, it is hard to access such tools, since the Cavaliers' overall Plus-Minus stats are difficult to model directly, and it is tricky to justify using existing records to get a valid threshold because the team's overall performance varies a lot around the 2010-11 season.
Using the e-detector framework for sequential change detection introduced here, we sidestep the difficulties of the commonly used methods described above as follows. First, due to the nonparametric nature of the new framework, we do not need to choose any parametric model to fit the data. Instead, we simply assume that the absolute value of each Plus-Minus stat is bounded by a large number; in this example, we set the bound to 80. Though we set a conservatively large bound, our detection procedure is variance-adaptive, so that we can detect the changepoint efficiently without specifying the variance of the observations. Second, the nonasymptotic analysis of the new framework makes it possible to choose an explicit detection boundary, equal to $\log(1/\alpha)$ on the log scale, to build a sequential change detection procedure whose ARL is at least $1/\alpha$ for any given $\alpha \in (0, 1)$. In this example, we choose $\alpha = 10^{-3}$ so that the ARL is larger than 10 regular seasons of games. The right plot in Figure 1 shows the log of the e-detectors on which we build the sequential change detection procedure. The red horizontal line corresponds to the detection boundary given by $\log(1/\alpha)$. We can check that the procedure detects the changepoint of the Plus-Minus stat of the Cavaliers in the middle of the 2014-15 season. See Section 5.2 for a detailed explanation of how we construct the sequential change detection procedure based on the general methodology we introduce in the paper.
Paper outline. The rest of the paper is organized as follows. In Section 2, we introduce a general framework for building composite, nonparametric, and nonasymptotic sequential change detection procedures using e-detectors. Section 3 extends this framework to the case where we have a set of e-detectors and explains how to use a mixture method to combine multiple e-detectors effectively. In Section 4, we introduce an exponential structure of e-detectors that makes it possible to design a near-optimal detection procedure with an explicit upper bound on worst average delays. Based on the proposed framework, Section 5 presents two canonical examples, the Bernoulli (parametric) case and the bounded random variable (nonparametric) case, with a real data application to the Cavaliers 2011-2018 statistics. We conclude with a discussion, and defer proofs to the supplement.

Nonparametric sequential change detection using e-detectors

Problem Setup
Let $\mathcal{P}$ denote the set of possible pre-change distributions, which could in general be a nonparametric class. We do not assume the observations in the sequence to be independent or identically distributed: each $P \in \mathcal{P}$ is a distribution over an infinite sequence of random variables. We will assume throughout that up to the unknown changepoint $\nu$, the observations $X_1, \ldots, X_\nu$ follow a distribution $P \in \mathcal{P}$. The remaining observations $X_{\nu+1}, X_{\nu+2}, \ldots$ are drawn from a distribution $Q$ in a class of post-change distributions $\mathcal{Q}$. In this case, we let $\mathbb{P}_{P,\nu,Q}$, $\mathbb{E}_{P,\nu,Q}$ and $\mathbb{V}_{P,\nu,Q}$ denote the probability, expectation and variance operators over the entire data stream.
If there never is a change, we will use the notation $\mathbb{P}_{P,\infty}$, $\mathbb{E}_{P,\infty}$ and $\mathbb{V}_{P,\infty}$. Also, if a change occurs at the beginning ($\nu = 0$), then we use $\mathbb{P}_{0,Q}$, $\mathbb{E}_{0,Q}$ and $\mathbb{V}_{0,Q}$. Note that technically $\mathbb{P}_{0,Q} = \mathbb{P}_{Q,\infty}$, but we use the former to denote that $Q$ is a post-change distribution and a changepoint has occurred at the very start, and the latter to denote that $Q$ is a pre-change distribution and a changepoint never occurs.
Let $\mathcal{F} := \{\mathcal{F}_n\}_{n \geq 0}$ be a filtration, where we let $\mathcal{F}_0 := \{\emptyset, \Omega\}$ for simplicity. Let $M := \{M_n\}_{n \geq 0}$ be a nonnegative adapted process with respect to the filtration $\mathcal{F}$. If required, we define $M_\infty := \limsup_{n \to \infty} M_n$ and $\mathcal{F}_\infty := \sigma\left(\bigcup_{n \geq 0} \mathcal{F}_n\right)$ [see, for example, 4]. It is common to consider the natural filtration $\mathcal{F}_n := \sigma(X_1, \ldots, X_n)$, but there are situations where restricting the filtration could be advantageous (for example, when there are nuisance parameters). There are yet other situations when enlarging the filtration with external randomness can be useful.
Let T denote the set of all stopping times with respect to F, but we will later see that it typically suffices to consider finite stopping times, or those with finite expectation.
Remark 2.1. In our paper, the changepoint $\nu$ and the post-change observations do not need to be independent of the pre-change observations, and the observations do not need to be identically distributed. In other words, $\nu$ could be a stopping time, and $Q$ could itself depend on the pre-change data. It may be helpful to imagine an adversary who adaptively decides at each step whether or not to "stop" the pre-change data. If they choose to stop at time $\nu$, there are two options for the post-change points: (a) they can pick a distribution $Q$, draw a sequence $Y_1, Y_2, \ldots$ from $Q$, and reveal $X_{\nu+i} = Y_i$ sequentially, or (b) they can pick a distribution $Q$ and draw $X_{\nu+1}, X_{\nu+2}, \ldots$ from $Q \mid \mathcal{F}_\nu$. In situations without a common reference measure, setting (b) may be tricky to define formally (especially if the pre-change data have probability zero under $Q$), so one may think of setting (a). None of our results on the ARL require any assumptions on the changepoint $\nu$ or on the post-change distribution of the data. When analyzing detection delay, we will typically assume that $X_{\nu+1}, X_{\nu+2}, \ldots$ are independent of the pre-change data, and are drawn from $Q$ as if the time $\nu$ was reset to zero; we believe this can be relaxed in future work. But for much of the paper, it is also okay to assume that the post-change data is drawn from a data-dependent distribution.
With the appropriate definitions and setup in place, we can now define our central concept, an e-detector.

What is an e-detector?
Definition 2.2 ($\mathcal{P}$-e-detector). The process $M$ is called an e-detector with respect to the class of pre-change distributions $\mathcal{P}$ if it satisfies the property
$$\mathbb{E}_{P,\infty}[M_\tau] \leq \mathbb{E}_{P,\infty}[\tau], \qquad \forall \tau \in \mathcal{T},\ \forall P \in \mathcal{P}. \tag{6}$$
For brevity, we refer to $M$ as "an e-detector for $\mathcal{P}$", or if $\mathcal{P}$ is understood from context, then simply "an e-detector". If a stopping time $\tau$ has a nonzero probability of being infinite, then inequality (6) is trivially satisfied. Thus, the condition is only really required to hold for stopping times with finite expectation under some $P$. This latter set of stopping times depends on $\mathcal{P}$, and so in order to not complicate notation, we continue to simply consider all stopping times $\mathcal{T}$. (Also note that the condition can only be satisfied if the process $M$ is integrable under any $P \in \mathcal{P}$, so this is implicitly assumed to be the case in what follows.)
By the linearity of expectation and Tonelli's theorem, an average (or "a mixture") of e-detectors is also an e-detector. More formally, if $\{M^a\}_{a \in A}$ is a set of e-detectors (where $a$ is a tuning parameter), then so is $\int M^a \, d\mu(a)$ for any fixed probability distribution $\mu$ over $A$. Later in this paper, we will in particular use finite mixtures of the form $(M^1 + M^2 + \cdots + M^K)/K$ in order to adapt to the unknown post-change distribution. (In fact, we will develop more sophisticated mixtures whose number of components grows slowly with time.) For later reference, we state the above as a proposition:
Proposition 2.3. Let $\{M^a\}_{a \in A}$ be a set of e-detectors. Then for any probability measure $\mu$ on $A$, the mixture of e-detectors, $\int M^a \, d\mu(a)$, forms a valid e-detector.
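The mixture claim can be verified in one line. The following is a sketch, assuming that the e-detector property reads $\mathbb{E}_{P,\infty}[M_\tau] \leq \mathbb{E}_{P,\infty}[\tau]$ for every stopping time $\tau$ and every $P \in \mathcal{P}$, and that $(a, \omega) \mapsto M^a_\tau(\omega)$ is jointly measurable so that Tonelli's theorem applies to the nonnegative integrand:

```latex
\mathbb{E}_{P,\infty}\!\left[\int M^{a}_{\tau}\, d\mu(a)\right]
  = \int \mathbb{E}_{P,\infty}\!\left[M^{a}_{\tau}\right] d\mu(a)   % Tonelli
  \le \int \mathbb{E}_{P,\infty}[\tau]\, d\mu(a)                    % each M^a is an e-detector
  = \mathbb{E}_{P,\infty}[\tau].                                    % \mu is a probability measure
```

The last equality is where the requirement that $\mu$ has total mass one is used; a measure with total mass below one would give a (more conservative) valid process as well.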
An e-detector provides a quantification of evidence for whether a changepoint has occurred or not, and may be continuously monitored, stopped and easily interpreted; e.g., a steep and steady increase of the process in recent times should be taken as an indication that a change has taken place. The following theorem shows how one can immediately obtain a sequential change detection procedure from an e-detector $M$.
Theorem 2.4. For any $\alpha \in (0, 1)$ and e-detector $M$, if we declare a changepoint at the stopping time
$$N^* := \inf\left\{n \geq 1 : M_n \geq 1/\alpha\right\}, \tag{7}$$
then we have
$$\inf_{P \in \mathcal{P}} \mathbb{E}_{P,\infty}[N^*] \geq 1/\alpha.$$
That is, the sequential change detection procedure in (7) controls the ARL at level $\alpha$.
The informal proof is one line long: dropping subscripts, the definition of an e-detector implies that $\mathbb{E}N^* \geq \mathbb{E}M_{N^*}$, but by definition of $N^*$, we know that $M_{N^*} \geq 1/\alpha$ if $N^*$ is finite. (If $N^*$ is not almost surely finite, the claim holds anyway.) The full proof is in Appendix A.1.
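The ARL guarantee can be checked numerically in a simple case. The following Monte Carlo sketch is an illustration under stated assumptions, not part of the paper's analysis: it uses the classical SR statistic for a known N(0,1) to N(1,1) mean shift as the e-detector, generates data with no change, and thresholds at 1/α. The function name, the seed, the horizon, and the choice α = 0.05 are hypothetical.

```python
import math
import random

def sr_run_length(rng, alpha, mu1=1.0, horizon=5000):
    """Run length of the SR rule with threshold 1/alpha on N(0,1) data (no change)."""
    m = 0.0
    for n in range(1, horizon + 1):
        x = rng.gauss(0.0, 1.0)
        lr = math.exp(mu1 * x - 0.5 * mu1**2)  # p_{mu1}(x) / p_0(x), unit variance
        m = lr * (m + 1.0)                     # SR recursion started at 0
        if m >= 1.0 / alpha:
            return n
    return horizon  # truncated run; truncation can only lower the estimate

rng = random.Random(1)
alpha = 0.05
runs = [sr_run_length(rng, alpha) for _ in range(2000)]
estimated_arl = sum(runs) / len(runs)
print(f"estimated ARL = {estimated_arl:.1f}, target lower bound = {1.0 / alpha:.1f}")
```

Because the statistic overshoots the threshold when it crosses, the estimated ARL is typically noticeably larger than 1/α, which is consistent with the bound being a guarantee rather than an equality.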

Constructing an e-detector based on a sequence of e-processes
The central building block of our e-detector is called an e-process. E-processes are newly developed tools that have been shown to play a fundamental role in sequential hypothesis testing, especially in composite, nonparametric settings. E-processes are generalizations of nonnegative martingales and supermartingales, and in particular, e-processes are nonparametric and composite generalizations of likelihood ratios. They have strong game-theoretic roots, and have found utility in meta-analysis, as well as for the purposes of anytime-valid inference in the presence of continuous monitoring [28,29,7,11,12]. The properties of e-processes have not yet been explored in changepoint analysis, and we undertake this effort here.
To understand their definition, we briefly forget about sequential change detection and consider testing the null hypothesis that $X_1, X_2, \ldots \sim P$ for some $P \in \mathcal{P}$. An e-process for $\mathcal{P}$, called a $\mathcal{P}$-e-process, is a sequence of nonnegative random variables $(E_t)_{t \geq 1}$ such that for any $P \in \mathcal{P}$ and any stopping time $\tau$, we have $\mathbb{E}_P[E_\tau] \leq 1$. As before, the underlying filtration can be that of the data or that of $(E_t)$, or some enlargement of these. The value of $E_t$ measures evidence against the null (larger values, more evidence). A level-$\alpha$ sequential test can be obtained by rejecting the null as soon as $E_t$ exceeds $1/\alpha$; this is a consequence of Ville's inequality [42,11].
As a result of the optional stopping theorem, nonnegative $\mathcal{P}$-martingales (i.e., processes that are nonnegative $P$-martingales simultaneously for every $P \in \mathcal{P}$) and $\mathcal{P}$-supermartingales are examples of e-processes.
However, e-processes are a distinct and more general class of processes. In fact, there exist natural classes $\mathcal{P}$ for which the only $\mathcal{P}$-martingales are constants, and the only $\mathcal{P}$-supermartingales are decreasing sequences, but there are e-processes for $\mathcal{P}$ that can increase to infinity when the data are not from $\mathcal{P}$. See [29] for one such example, arising from sequentially testing exchangeability of a binary sequence, and [6] for another example arising from testing log-concavity.
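As a concrete instance of the sequential test just described, the sketch below uses the simplest possible e-process: a likelihood ratio for a point null on a binary stream, which is a nonnegative martingale under the null. The null value 0.5, the alternative 0.75, and the function names are hypothetical choices for illustration.

```python
def lr_eprocess(xs, p0, p1):
    """Running likelihood ratio of Bernoulli(p1) against Bernoulli(p0) on binary data."""
    e = 1.0
    path = []
    for x in xs:
        e *= (p1 / p0) if x == 1 else ((1.0 - p1) / (1.0 - p0))
        path.append(e)
    return path

def sequential_test(xs, p0, p1, alpha):
    """Reject the null at the first time the e-process reaches 1/alpha.

    Returns the 1-indexed rejection time, or None if the null is never rejected.
    """
    for t, e in enumerate(lr_eprocess(xs, p0, p1), start=1):
        if e >= 1.0 / alpha:
            return t
    return None

# A run of ones multiplies the evidence by 1.5 at every step.
print(sequential_test([1] * 20, p0=0.5, p1=0.75, alpha=0.05))     # prints 8
# An alternating stream never accumulates enough evidence.
print(sequential_test([1, 0] * 10, p0=0.5, p1=0.75, alpha=0.05))  # prints None
```

For composite or nonparametric nulls the likelihood ratio is replaced by a process that satisfies the e-process inequality for every distribution in the null class, but the stopping rule stays the same.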
In this subsection, we show how to leverage e-processes to build e-detectors.
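Definition 2.5 ($e_j$-process). For any $j \geq 1$, $\Lambda^{(j)} := \{\Lambda^{(j)}_n\}_{n \geq 1}$ is called an $e_j$-process for $\mathcal{P}$ if it is a nonnegative adapted process such that
$$\Lambda^{(j)}_1 = \cdots = \Lambda^{(j)}_{j-1} = 1 \quad \text{and} \quad \mathbb{E}_{P,\infty}\left[\Lambda^{(j)}_\tau \mid \mathcal{F}_{j-1}\right] \leq 1, \qquad \forall \tau \in \mathcal{T},\ \forall P \in \mathcal{P}. \tag{9}$$
When $j = 1$, the $e_j$-process is simply a standard e-process, which tests whether the data distribution is different from the proclaimed null (pre-change) distribution. For $j > 1$, each $e_j$-process can be viewed as an e-process that begins at time $j$, and tests whether the data occurring after time $j$ are well explained by the null hypothesis.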
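Definition 2.6 (SR and CUSUM e-detectors). Given a sequence of $e_j$-processes $\{\Lambda^{(j)}\}_{j \geq 1}$, the SR and CUSUM e-detectors are defined, for each $n \geq 1$, by
$$M^{\mathrm{SR}}_n := \sum_{j=1}^{n} \Lambda^{(j)}_n \qquad \text{and} \qquad M^{\mathrm{CU}}_n := \max_{1 \leq j \leq n} \Lambda^{(j)}_n,$$
with $M^{\mathrm{SR}}_0 = M^{\mathrm{CU}}_0 := 0$.
It is not hard to check that the above processes are indeed e-detectors, meaning they satisfy (6).
Remark 2.7. If each e-process starts with an initial weight smaller than 1 such that the sum of all initial weights is less than or equal to 1, then the corresponding SR and CUSUM procedures control the probability of false alarm, $\sup_{P \in \mathcal{P}} \mathbb{P}_{P,\infty}(\tau < \infty) \leq \alpha$, which is a much more stringent error metric than the ARL. The price to pay is that if there is a change at time $\nu$, then the detection delay will no longer be independent of $\nu$, and will typically increase logarithmically with $\nu$ (so that the worst average delay is unbounded).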

Constructing computationally efficient e-detectors using baseline increments
In general, it may take $O(n)$ time to update the aforementioned e-detectors at time $n$. In order to construct e-detectors that can be updated online in sublinear time and memory (or even near-constant time and memory), it turns out to be computationally convenient to use a common "baseline" increment in order to build the underlying $e_j$-processes, as we do below. Effectively, this amounts to using $e_j$-processes that are $\mathcal{P}$-supermartingales, which is a special case of particular interest.
Definition 2.8 (Baseline increment). A nonnegative, adapted process $L := \{L_n\}_{n \geq 1}$ is called a baseline increment if for each $n \geq 1$, we have
$$\mathbb{E}_{P,\infty}\left[L_n \mid \mathcal{F}_{n-1}\right] \leq 1, \qquad \forall P \in \mathcal{P}. \tag{11}$$
It is easy to check that if $L^1$ and $L^2$ are baseline increments, and $A^1$ and $A^2$ are nonnegative and predictable processes (meaning that $A^1_n$ and $A^2_n$ are both $\mathcal{F}_{n-1}$-measurable) such that $A^1 + A^2$ is strictly positive, then the mixture $(A^1 L^1 + A^2 L^2)/(A^1 + A^2)$ also forms a baseline increment. In short, "predictable mixtures" retain the baseline increment property.
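To make the definition concrete, here is one simple baseline increment for bounded observations whose conditional mean is at most $\mu_0$ before the change. This betting-style increment is an assumption made for illustration (it is not claimed to be the construction used in Section 5.2); the bound 80 and $\mu_0 = -1$ mirror the Cavaliers example, and the function names are hypothetical.

```python
def betting_increment(x, mu0, bound, lam):
    """Increment L_n = 1 + lam * (x - mu0) for observations with |x| <= bound.

    With -bound < mu0 and 0 <= lam <= 1 / (bound + mu0), L_n is nonnegative, and
    E[L_n | F_{n-1}] <= 1 whenever E[X_n | F_{n-1}] <= mu0.
    """
    if not -bound < mu0 <= bound:
        raise ValueError("mu0 must lie in (-bound, bound]")
    if not 0.0 <= lam <= 1.0 / (bound + mu0):
        raise ValueError("lam must lie in [0, 1 / (bound + mu0)]")
    if abs(x) > bound:
        raise ValueError("observation outside the assumed bound")
    return 1.0 + lam * (x - mu0)

def predictable_mixture(l1, l2, a1, a2):
    """Mixture (a1 * l1 + a2 * l2) / (a1 + a2), with weights chosen before seeing x_n."""
    if a1 < 0.0 or a2 < 0.0 or a1 + a2 <= 0.0:
        raise ValueError("weights must be nonnegative with a positive sum")
    return (a1 * l1 + a2 * l2) / (a1 + a2)

MU0, BOUND = -1.0, 80.0
lam_max = 1.0 / (BOUND + MU0)  # largest bet that keeps the increment nonnegative

# Mix a cautious and an aggressive bet on the same observation with equal weights.
x = 12.0
mixed = predictable_mixture(
    betting_increment(x, MU0, BOUND, 0.1 * lam_max),
    betting_increment(x, MU0, BOUND, lam_max),
    a1=1.0,
    a2=1.0,
)
```

Larger values of `lam` react faster to large post-change means but lose more when the change is small, which is one motivation for mixing over several values.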
Comparing (11) to (9), we see that a baseline increment $L$ is not itself an $e_j$-process, because the expectation in (11) applies only at fixed times $n$, with the conditioning being on the previous step $n - 1$, whereas (9) calculates expectations at any stopping time, and conditions on $j - 1$. It is best to think about the baseline increment as the multiplicative increment that forms the $e_j$-process, as follows.
Definition 2.9 (Baseline $e_j$-process). For a given baseline increment $L := \{L_n\}_{n \geq 1}$, we define the corresponding "baseline $e_j$-process" $\Lambda^{(j)}$, for each $j, n \in \mathbb{N}$, as below:
$$\Lambda^{(j)}_n := \prod_{i=j}^{n} L_i \ \text{ for } n \geq j, \qquad \Lambda^{(j)}_n := 1 \ \text{ for } n < j. \tag{12}$$
Under any pre-change distribution $P \in \mathcal{P}$, each $\Lambda^{(j)}$ is a nonnegative supermartingale by definition of the baseline increment $L_i$. Therefore, a straightforward application of the optional stopping theorem implies that each $\Lambda^{(j)}$ satisfies condition (9), and thus is a valid $e_j$-process.
As an example of a baseline $e_j$-process, consider the case where we have iid observations $X_1, X_2, \ldots$ from a distribution $p_\theta$ parameterized by $\theta \in \Theta$, and the pre-change distribution is given by $\theta_0$. Then, for any post-change distribution $p_{\theta_1}$ with $\theta_1 \neq \theta_0$, the likelihood ratio between the two distributions, $L_n := p_{\theta_1}(X_n)/p_{\theta_0}(X_n)$, yields a baseline increment process with the inequality in (11) replaced by an equality. Then, each $\Lambda^{(j)}_n$ is the likelihood ratio based on $X_j, \ldots, X_n$. Further, instead of using a fixed post-change parameter $\theta_1$, we can also plug in a running MLE or any other online nonanticipating estimator that is based on the previous history $\mathcal{F}_{n-1}$ only, say $\hat{\theta}_{n-1}$, into the likelihood ratio. In this case, although the value $L_n := p_{\hat{\theta}_{n-1}}(X_n)/p_{\theta_0}(X_n)$ at time $n$ of the resulting process may depend on the previous history $\mathcal{F}_{n-1}$, the inequality in (11) is still satisfied with equality, yielding valid baseline $e_j$-processes.
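The plug-in construction can be sketched as follows, assuming unit-variance Gaussian densities and using the running sample mean of all past observations as the nonanticipating estimator. The Gaussian model, the initial guess `theta_init`, and the function name are assumptions made for this illustration.

```python
import math

def plugin_increments(xs, theta0, theta_init):
    """Increments L_n = p_{theta_hat_{n-1}}(x_n) / p_{theta0}(x_n) for N(theta, 1).

    theta_hat_{n-1} is the mean of x_1, ..., x_{n-1} (theta_init when n = 1), so it
    never uses the current observation x_n.
    """
    increments = []
    total, count = 0.0, 0
    for x in xs:
        theta_hat = total / count if count > 0 else theta_init
        log_lr = (theta_hat - theta0) * x - 0.5 * (theta_hat**2 - theta0**2)
        increments.append(math.exp(log_lr))
        total += x  # update the estimator only after x has been used
        count += 1
    return increments

print(plugin_increments([1.0, 3.0], theta0=0.0, theta_init=0.5))
```

The order of operations matters: updating the estimator before computing the ratio would let the bet peek at the current observation and would break the conditional expectation bound.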
Remark 2.10. While baseline increment processes provide a natural and computationally convenient way to construct $e_j$-processes, we emphasize that any e-detector, even one that does not use baseline increments, will automatically control the ARL by Theorem 2.4. To elaborate, baseline $e_j$-processes are composite $\mathcal{P}$-supermartingales (meaning $P$-supermartingales for every $P \in \mathcal{P}$), but there exist other $\mathcal{P}$-e-processes, which are not $\mathcal{P}$-supermartingales, that naturally arise, and these can be used to form e-detectors; for example, using universal inference [47, Section 8].
Definition 2.11 (Baseline SR and CUSUM e-detectors). When an SR or CUSUM e-detector is constructed using a sequence of baseline $e_j$-processes (12), we call it a "baseline SR or CUSUM e-detector".
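Each baseline SR or CUSUM e-detector can be computed recursively like its classical analog:
$$M^{\mathrm{SR}}_n = L_n \cdot \left[M^{\mathrm{SR}}_{n-1} + 1\right] \qquad \text{and} \qquad M^{\mathrm{CU}}_n = L_n \cdot \max\left\{M^{\mathrm{CU}}_{n-1}, 1\right\},$$
with $M^{\mathrm{SR}}_0 = M^{\mathrm{CU}}_0 = 0$, for each $n \in \mathbb{N}$. The above computational benefit is the primary reason to consider baseline e-detectors, but as mentioned in Remark 2.10 and when introducing e-processes, more general e-detectors are sometimes necessary for certain classes $\mathcal{P}$.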
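A minimal sketch of this recursion is given below, assuming the SR e-detector is the sum, and the CUSUM e-detector the maximum, over start times j of the products of increments from j to n, with both started at 0. The increments are passed in as plain numbers, both rules are thresholded at 1/α for simplicity, and the function names are hypothetical.

```python
import math

def run_baseline_detectors(increments, alpha):
    """Update baseline SR and CUSUM e-detectors online and record first crossings of 1/alpha."""
    threshold = 1.0 / alpha
    m_sr = m_cu = 0.0
    sr_path, cu_path = [], []
    stop_sr = stop_cu = None
    for n, l_n in enumerate(increments, start=1):
        m_sr = l_n * (m_sr + 1.0)     # sum over all start times j <= n
        m_cu = l_n * max(m_cu, 1.0)   # max over all start times j <= n
        sr_path.append(m_sr)
        cu_path.append(m_cu)
        if stop_sr is None and m_sr >= threshold:
            stop_sr = n
        if stop_cu is None and m_cu >= threshold:
            stop_cu = n
    return sr_path, cu_path, stop_sr, stop_cu

def direct_detectors(increments, n):
    """Same statistics at time n computed from scratch in O(n) products, for checking."""
    products = [math.prod(increments[j - 1:n]) for j in range(1, n + 1)]
    return sum(products), max(products)

sr_path, cu_path, stop_sr, stop_cu = run_baseline_detectors([2.0] * 5, alpha=0.1)
print(sr_path, stop_sr)  # [2.0, 6.0, 14.0, 30.0, 62.0] 3
print(cu_path, stop_cu)  # [2.0, 4.0, 8.0, 16.0, 32.0] 4
```

The recursive version needs constant time and memory per observation, whereas `direct_detectors` recomputes every product and is included only to confirm that the two agree.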
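We briefly verify below that the processes $M^{\mathrm{SR}}$ and $M^{\mathrm{CU}}$ defined above are valid e-detectors. Indeed, for any stopping time $\tau$ and pre-change distribution $P \in \mathcal{P}$, if $\mathbb{P}_{P,\infty}(\tau = \infty) > 0$, then the condition of the e-detector in (6) holds trivially. If not, then $\tau$ is finite almost surely, and we have by linearity of expectation and the tower rule:
$$\mathbb{E}_{P,\infty}\left[M^{\mathrm{CU}}_\tau\right] \leq \mathbb{E}_{P,\infty}\left[M^{\mathrm{SR}}_\tau\right] = \sum_{j=1}^{\infty} \mathbb{E}_{P,\infty}\left[\mathbb{E}_{P,\infty}\left[\Lambda^{(j)}_\tau \mid \mathcal{F}_{j-1}\right] \mathbb{1}(\tau \geq j)\right] \leq \sum_{j=1}^{\infty} \mathbb{P}_{P,\infty}(\tau \geq j) = \mathbb{E}_{P,\infty}[\tau],$$
where the first inequality comes from the nonnegativity of e-processes (a maximum of nonnegative terms is at most their sum), and the second inequality comes from the definition of the e-process $\Lambda^{(j)}$ for each $j \geq 1$, using that the event $\{\tau \geq j\}$ is $\mathcal{F}_{j-1}$-measurable. Note that this proof is also applicable to general SR and CUSUM e-detectors.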

Sequential change detection procedures by thresholding e-detectors
The value of any e-detector process, like $M^{\mathrm{SR}}$ or $M^{\mathrm{CU}}$, is directly interpretable without specifying an explicit threshold: a larger value signals an accumulation of evidence of a changepoint. These can be monitored and adaptively stopped. Nevertheless, to explicitly control the ARL at level $\alpha$, we define SR and CUSUM-style change detection procedures, called "e-SR" and "e-CUSUM" procedures, as follows.
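Definition 2.12 (e-SR and e-CUSUM procedures). Given SR and CUSUM e-detectors $M^{\mathrm{SR}}$ and $M^{\mathrm{CU}}$, define the e-SR and e-CUSUM procedures by the stopping times
$$N^*_{\mathrm{SR}} := \inf\left\{n \geq 1 : M^{\mathrm{SR}}_n \geq 1/\alpha\right\} \qquad \text{and} \qquad N^*_{\mathrm{CU}} := \inf\left\{n \geq 1 : M^{\mathrm{CU}}_n \geq c_\alpha\right\},$$
where $c_\alpha$ is a constant chosen to control the ARL of the e-CUSUM procedure by $1/\alpha$ for some $\alpha \in (0, 1)$. By Theorem 2.4, $1/\alpha$ is a valid choice for $c_\alpha$.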
We note that $c_\alpha = 1/\alpha$ may be a very conservative choice for the e-CUSUM procedure. Indeed, suppose we use the trivial e-processes, that is, we set $\Lambda^{(j)}_n := 1$ for all $j, n$. In this setting, the SR e-detector is given by $M^{\mathrm{SR}}_n = n$ for each $n$. In contrast, the CUSUM e-detector $M^{\mathrm{CU}}_n$ is equal to 1 for all $n$. Therefore, any valid threshold we can choose for the e-SR procedure must be larger than $\lfloor 1/\alpha \rfloor$. On the other hand, any threshold above 1 makes $N^*_{\mathrm{CU}} = \infty$, which of course controls the ARL by $1/\alpha$, but the true ARL is far above the target. Building on this trivial example, it is possible to construct nontrivial examples in which letting $\alpha \to 0$ makes the gap between the tight thresholds of the e-SR and e-CUSUM procedures arbitrarily large.
Remark 2.13. Unless we assume the pre-change distribution is time-stationary, known and parametric, or we can access a good sample of the pre-change distribution or large enough historical data, computing a tight or even valid threshold $c_\alpha$ can be a challenging task. In the application sections below, we will mainly deal with non-stationary pre-change distributions, where pre-change observations may not be identically distributed and thus all observations before the changepoint may follow different distributions. In this case, setting $c_\alpha = 1/\alpha$ seems to be the only reasonable choice, and we recommend using the e-SR procedure rather than e-CUSUM: if we use the same threshold for both procedures, the former always detects the changepoint at least as fast as the latter while provably controlling the ARL at the same level.

Some nontrivial instantiations of e-detectors
E-detectors can be thought of as a general reduction of change detection to sequential testing. Given the recent advances in nonparametric, composite sequential testing using nonnegative supermartingales and e-processes, our e-detectors now make new classes of change detection problems tractable. We detail below some interesting nontrivial examples of change detection problems that can now be solved using e-detectors.
Example 1: when likelihood ratios are well-defined. Consider first the parametric case, when $\mathcal P, \mathcal Q$ have a common reference measure and likelihood ratios are well-defined. When the pre- and post-change distributions are known (meaning $\mathcal P, \mathcal Q$ are singletons), the standard likelihood-ratio based CUSUM and SR processes from (4) and (5) are both e-detectors. If $\mathcal Q$ is composite, then taking a mixture likelihood ratio yields e-detectors (using either a non-anticipating predictable mixture or a fixed mixture distribution). If $\mathcal P$ is also composite, but maximum likelihood estimation is efficient over $\mathcal P$, then e-processes can be constructed that take the ratio of mixture likelihoods over the alternative to the maximum likelihood under the null, as done in universal inference [47, Section 8] or in [40]. If the "reverse information projection" (RIPr) is computable (analytically or numerically), then one can use the method of [7], though there are some subtleties: by default the method produces e-values over blocks of observations, which can be multiplied across independent blocks to produce a nonnegative supermartingale (and thus an e-process) to be used within our framework. But sometimes, the sequence of RIPr's over increasing sample sizes (nested blocks) automatically produces an e-process, and when this happens, it is more powerful than universal inference (see [7,30] for details).
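To make the singleton case concrete, the following is a minimal sketch of likelihood-ratio SR and CUSUM detectors for a known Gaussian mean shift. It assumes unit-variance Gaussians, the recursive forms $M^{SR}_n = L_n (M^{SR}_{n-1} + 1)$ and $M^{CU}_n = L_n \max\{M^{CU}_{n-1}, 1\}$ started from 0, and the common threshold $1/\alpha$. The names `gaussian_lr` and `run_lr_detectors`, and the deterministic toy stream, are hypothetical choices for illustration.

```python
import math
from typing import Iterable, Optional, Tuple


def gaussian_lr(x: float, mu0: float, mu1: float, sigma: float = 1.0) -> float:
    """Likelihood ratio dN(mu1, sigma^2) / dN(mu0, sigma^2) evaluated at x."""
    return math.exp(((x - mu0) ** 2 - (x - mu1) ** 2) / (2.0 * sigma ** 2))


def run_lr_detectors(
    stream: Iterable[float], mu0: float, mu1: float, threshold: float
) -> Tuple[Optional[int], Optional[int]]:
    """Return the first crossing times of the SR and CUSUM detectors (None if no crossing)."""
    m_sr = m_cu = 0.0
    stop_sr: Optional[int] = None
    stop_cu: Optional[int] = None
    for n, x in enumerate(stream, start=1):
        lr = gaussian_lr(x, mu0, mu1)
        m_sr = lr * (m_sr + 1.0)
        m_cu = lr * max(m_cu, 1.0)
        if stop_sr is None and m_sr >= threshold:
            stop_sr = n
        if stop_cu is None and m_cu >= threshold:
            stop_cu = n
        if stop_sr is not None and stop_cu is not None:
            break
    return stop_sr, stop_cu


# Toy stream: 30 observations at the pre-change mean 0, then observations at the
# post-change mean 1 (deterministic values, purely for illustration).
stream = [0.0] * 30 + [1.0] * 30
stop_sr, stop_cu = run_lr_detectors(stream, mu0=0.0, mu1=1.0, threshold=100.0)
print(stop_sr, stop_cu)
```

Because the SR detector dominates the CUSUM detector pathwise, the SR procedure with the same threshold never stops later than the CUSUM procedure.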
Example 2: change in distribution. In this example, $\mathcal P$ is the set of all iid product distributions over infinite sequences (or its convex closure, the set of all exchangeable distributions), so $\mathcal P = \{\mu^\infty : \mu \text{ is a probability distribution}\}$. The conformal sequential change detection procedures of [44,43] are designed to test deviations from exchangeability: they develop a test martingale (and thus an e-process) for $\mathcal P$, so their procedure fits neatly into our framework. Importantly, their filtration is restricted, and is smaller than the natural filtration of the data. This allows nonparametric martingales to exist, but the e-detector property only holds with respect to a smaller class of stopping times. Nevertheless, thresholding our e-detector at $1/\alpha$ still controls the ARL at level α, as the latter property is independent of the filtration used to construct the e-detector. As a side note, if one wanted to construct an e-detector that was valid at all stopping times with respect to the natural filtration of the data, e-detectors based on martingales provably do not suffice, but e-detectors based on e-processes can be constructed using the techniques from [29] (at least for categorical distributions).
Example 3: nonparametric two-sample testing. Suppose we have two streams of (general multivariate) data, $X_1, X_2, \dots$ and $Y_1, Y_2, \dots$. For simplicity below, assume that at time $t$, we observe one point from each stream, $(X_t, Y_t)$. Before the changepoint (if one exists), the distributions of $X_t$ and $Y_t$ are equal, meaning that $\mathcal P = \{(P_X, P_Y)^\infty : P_X = P_Y\}$. This is a very nonparametric class, since it specifies nothing about the distributions except for the fact that they are equal before the changepoint. After the changepoint, the streams have different distributions (maybe the distribution of $X$ changes, or that of $Y$ changes, or both), thus $\mathcal Q = \{(P_X, P_Y)^\infty : P_X \neq P_Y\}$. For this very general nonparametric two-sample testing setup, [34] construct test martingales for $\mathcal P$ that are provably consistent against $\mathcal Q$ under minimal assumptions (in particular not requiring any minimum separation between the different distributions after the changepoint, since the tests automatically adapt to the closeness of the unknown alternative). These test martingales fit seamlessly into our e-detector framework, yielding new and practicable e-detectors for detecting a change from homogeneity to non-homogeneity between the streams.
Example 4: nonparametric independence testing. In this problem setting, we observe a pair of random variables $(X_t, Y_t) \sim P_{XY}$ at each step $t$, where $X_t, Y_t$ can each lie in a general space. Before the changepoint (if one exists), the data are independent, meaning that $\mathcal P = \{P_{XY}^\infty : P_{XY} = P_X \times P_Y\}$. Beyond saying that the joint distribution factorizes into the product of marginals, there is no further structure assumed, making this a rich nonparametric composite class. After the changepoint, $(X, Y)$ become dependent, meaning that $\mathcal Q = \{P_{XY}^\infty : P_{XY} \neq P_X \times P_Y\}$. For this general nonparametric independence testing problem, [24] construct test martingales for $\mathcal P$ that are provably consistent against $\mathcal Q$ under minimal, weak assumptions (as before, not requiring any separation). Again as before, (in the testing problem) the power of these tests automatically adapts to the difficulty of the unknown alternative. When plugged into our framework, these martingales deliver a novel e-detector for a change from independence to dependence.
We briefly remark that in the two preceding examples (homogeneity and independence), we can move past the iid assumption. The same methods work even when the distribution is allowed to drift within $\mathcal P$ before the changepoint, and drift within $\mathcal Q$ after the changepoint. We refer to the original aforementioned papers for details.
Example 5: log-concavity. Here, the data before the changepoint come from a log-concave distribution (in a general dimension $d \geq 1$), so $\mathcal P = \{\mu^\infty : \mu \text{ has a log-concave Lebesgue density}\}$. This is a rich, nonparametric, shape-constrained class. The post-change class of distributions $\mathcal Q$ consists of, for example, any distribution that has a nonzero KL-divergence and Hellinger distance from every distribution in $\mathcal P$. For testing $\mathcal P$ against $\mathcal Q$, [6] show that there exist no nontrivial nonnegative supermartingales, but they design a powerful (and computationally efficient) e-process using universal inference. When plugged into our e-detector, this yields a nontrivial procedure that can detect a deviation from log-concavity.
Example 6: symmetry. Suppose $\mathcal P = \{\mu^\infty : \mu \text{ is symmetric around } 0\}$ consists of the set of all distributions (in a general dimension $d \geq 1$) that are symmetric around the origin, while $\mathcal Q$ consists of its complement (that is, distributions which are not symmetric around the origin). [28] characterize all processes that are nonnegative martingales for $\mathcal P$. When used with our e-detector, these provide a clean way to detect a change from symmetry to asymmetry.
Example 7: change in mean. Suppose $\mathcal P_C = \{\mu^\infty : \mathbb E_{X \sim \mu}[X] \leq 0, X \text{ satisfies } C\}$ consists of the set of all univariate distributions with mean less than or equal to zero satisfying some constraint $C$, while $\mathcal Q_C = \{\mu^\infty : \mathbb E_{X \sim \mu}[X] > 0, X \text{ satisfies } C\}$ consists of those with positive mean. [11] provides a large variety of nonnegative supermartingales under various conditions $C$, such as when $X$ is sub-Gaussian, or bounded from above, or bounded from below, or has only two or three moments; see also [45] for heavy-tailed supermartingales. These can be plugged into the e-detector to yield new nonparametric schemes for changes in mean.
Example 8: Huber-robust change detection. As a final example, suppose we wish to detect a change in mean of heavy-tailed data (as above). But now, suppose that an adversary can also arbitrarily corrupt an ε fraction of the data. [46] develop Huber-robust supermartingales for this setting, which can be plugged into e-detectors to yield a valid e-detector in the presence of adversarial corruptions.
Note that in Examples 3, 4, 5 and 6, if $\mathcal P$ and $\mathcal Q$ were swapped, the testing problem would be much harder, and we are not aware of any powerful test or change detection method. However, if one were simply interested in detecting a change in homogeneity or in dependence, i.e. from some distribution in $\mathcal P \cup \mathcal Q$ to some other one, there are two possible change detection methods that come to mind. First, an e-detector based on the conformal change detection methods in Example 2 would detect any change from any distribution to any other, though choosing the conformity score may be tricky. As a second and more direct option, one can choose a measure of homogeneity or dependence (like the kernel maximum mean discrepancy or energy distance, or the Hilbert-Schmidt independence criterion or distance covariance), construct a confidence sequence for that measure (see [19] for several specific, tight constructions), and plug it into the recent change detection scheme of [35].
Finally, it is worth noting that it is possible to define e-detectors in cases where there is no common reference measure amongst the pre-change and post-change distributions, and thus no easily-defined likelihood ratio process, and also when there is no nontrivial martingale that can be constructed. This is precisely the utility of e-processes, which are nonparametric and composite generalizations of likelihood ratios. We develop one more interesting nonparametric example (not described above) in the simulations section: detecting a change in the mean of a bounded random variable.

Warm-up: bounds on worst average delays for baseline e-detectors, when Q is known
Recall that our objective is to minimize the worst average delays $J_L(N^*)$ or $J_P(N^*)$, given above in (2) and (3) respectively, while controlling the ARL, $\mathbb E_{P,\infty}(N^*) \geq 1/\alpha$. Note that the worst average delays in (2) and (3) were implicitly defined for a fixed pair of pre- and post-change distributions. In our setting, where the pre-change distribution space could be composite, we take an additional supremum over all pre-change distributions when defining both worst average delays for each fixed post-change distribution.
To derive bounds on the worst average delays, we further assume that the post-change observations $X_{\nu+1}, X_{\nu+2}, \dots$ are independent of the pre-change observations and form a strongly stationary process. That is, we assume that, for any finite subset $I \subset \mathbb N$ and any $j \in \mathbb N$, the joint distributions of $\{X_{\nu+i}\}_{i \in I}$ and $\{X_{\nu+i+j}\}_{i \in I}$ are equal to each other. About the underlying baseline increment, we further assume that there exist a function $f$ and an integer $m \geq 0$ such that $L_n = f(X_n, X_{n-1}, \dots, X_{n-m})$ for each $n$. In this warm-up section, assume that we know the post-change distribution $Q$. Then, as we shall soon see, an optimal choice for $L_n$ would set $m = 0$, but if $m$ is strictly positive then we implicitly assume that we can access $m$ observations $X_0, X_{-1}, \dots, X_{1-m}$ from the pre-change distribution in order to build sequential change detection procedures. Under these conditions, the following proposition provides analytically more tractable upper bounds on the worst average delays for the e-SR and e-CUSUM procedures.
Proposition 2.14. For a given $\alpha \in (0, 1)$, let $N^*_{SR}$ and $N^*_{CU}$ be the e-SR and e-CUSUM procedures using baseline e-detectors. Under the settings described above, their worst average delays are upper bounded as in (17) and (18), respectively, where $N_c$ is the stopping time defined for any $c > 1$ as $N_c := \inf\{n \geq 1 : \sum_{i=1}^n \log L_i \geq \log(c)\}$, and $c_\alpha \leq 1/\alpha$ is any threshold that ensures the e-CUSUM procedure (16) has an ARL no smaller than $1/\alpha$. Furthermore, if the post-change observations are iid and each $L_n$ is a function of $X_n$ only (i.e. $m = 0$), then the expected stopping times admit the explicit upper bound (19).
The proof can be found in Appendix A.1.
Remark 2.15. The stopping time $N_{1/\alpha}$ delivers a level-α sequential test for the null hypothesis $H_0 : P \in \mathcal P$. On the other hand, the stopping time $N_{c_\alpha}$ may not necessarily control the type-1 error by α, since the threshold $c_\alpha$ can be significantly smaller than $1/\alpha$, as discussed earlier.
The expected stopping time in the upper bounds (17) and (18) depends on the parameter $m \geq 0$. When $Q$ is known, $m = 0$ suffices because (19) suggests that $L_i$ should simply be chosen to maximize $\mathbb E_{0,Q} \log L_1$, which is identical to the log-optimality criterion used for testing $\mathcal P$ against $Q$ (as discussed in many recent works, like [33,7,48]).
In applications where $Q$ is unknown, the underlying baseline increment may require a long enough sample history (large $m$) in order to achieve a reasonably small expected stopping time by "learning" $Q$ or using an empirical distribution plug-in for $Q$. Then, the above results suggest that a reasonable way to choose the window size $m$ is to pick the one minimizing the upper bound on the worst average delays. However, since the optimal choice of the window size depends on the unknown post-change $Q$, it remains difficult to minimize the upper bound directly. In our simulations, we often encounter cases where a larger window size is better. In such cases, we would choose a window size as large as possible while keeping the procedure computationally tractable. A more principled way to handle unknown $Q$ is developed in detail next.
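As an illustration of the stopping time $N_c$ from Proposition 2.14, the following sketch computes $N_c$ for a stream of increments and compares its Monte Carlo average with the first-order quantity $\log(c) / \mathbb E_{0,Q} \log L_1$. The setup is an assumption chosen for simplicity: pre-change $N(0,1)$, post-change $N(1,1)$, and the likelihood-ratio increment $L_n = \exp(X_n - 1/2)$, for which $\mathbb E_{0,Q} \log L_1 = 1/2$. The function names, the seed and the number of repetitions are hypothetical.

```python
import math
import random
from typing import Iterable, Iterator, Optional


def stopping_time_n_c(increments: Iterable[float], c: float) -> Optional[int]:
    """N_c = inf{n >= 1 : sum_{i<=n} log L_i >= log c}; None if not reached."""
    total, target = 0.0, math.log(c)
    for n, inc in enumerate(increments, start=1):
        total += math.log(inc)
        if total >= target:
            return n
    return None


def post_change_increments(rng: random.Random, mu: float, horizon: int) -> Iterator[float]:
    """Likelihood-ratio increments exp(mu*x - mu^2/2) for x drawn from N(mu, 1)."""
    for _ in range(horizon):
        x = rng.gauss(mu, 1.0)
        yield math.exp(mu * x - 0.5 * mu * mu)


rng = random.Random(0)            # fixed seed so the sketch is reproducible
c, mu, reps = 100.0, 1.0, 2000    # c plays the role of 1/alpha
samples = [stopping_time_n_c(post_change_increments(rng, mu, 10_000), c) for _ in range(reps)]
samples = [n for n in samples if n is not None]
mean_delay = sum(samples) / len(samples)
first_order = math.log(c) / (0.5 * mu * mu)   # log(c) / E_Q[log L_1]
print(f"Monte Carlo mean of N_c: {mean_delay:.2f}, first-order term: {first_order:.2f}")
```

By Wald's identity the expected value of $N_c$ is at least the first-order term; the excess comes from the overshoot of the random walk over the boundary, which is what the remaining terms of an explicit bound must account for.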

Combining baseline e-detectors using the method of mixtures
In the previous section, we discussed how one can construct a valid e-detector and derive upper bounds on worst average delays. However, in most composite and nonparametric sequential change detection scenarios, there is no single optimal e-detector but instead there are often several applicable e-detectors to choose from. In this section, we introduce a practicable and computationally efficient strategy to construct a good e-detector for minimizing the upper bound on worst average delays in Proposition 2.14, especially for the upper bound (19) in the $m = 0$ case.
In detail, suppose we have a set of baseline increments $\{L^\lambda\}_{\lambda \in \Pi}$ parametrized by $\lambda \in \Pi$. Then, under the additional condition assumed in the second part of Proposition 2.14 (namely, that $m = 0$ and the post-change observations are from an iid sequence), an ideal choice of the parameter $\lambda_{op}$ for a post-change distribution $Q$ is given by minimizing the first term of the upper bound in (19), which often becomes the leading term, especially for small enough α. In turn, this term is inversely proportional to
$$D(Q\|\mathcal P) := \mathbb E_{0,Q} \log L^{\lambda_{op}}_1 = \max_{\lambda \in \Pi} \mathbb E_{0,Q} \log L^{\lambda}_1,$$
where the second argument $\mathcal P$ in $D(Q\|\mathcal P)$ explicitly refers to the dependency of the baseline increment $L^{\lambda_{op}}$ on the class of pre-change distributions $\mathcal P$. For the rest of the paper, we will assume that the set of baseline increments $\{L^\lambda\}_{\lambda \in \Pi}$ is rich enough that $D(Q\|\mathcal P) > 0$ for all $Q \in \mathcal Q$. As we observe later, in many canonical cases, we have $D(Q\|\mathcal P) = \inf_{P \in \mathcal P} \mathrm{KL}(Q\|P)$, where $\mathrm{KL}(Q\|P)$ is the Kullback-Leibler (KL) divergence from $Q$ to $P$. Generally, computing $\lambda_{op}$ is not feasible since it depends on the unknown post-change distribution $Q$. Next, we show how to build a mixture of baseline e-detectors that can detect the changepoint nearly as quickly as the one with $\lambda_{op}$, when known lower and upper bounds $\lambda_L$ and $\lambda_U$ on $\lambda_{op}$ are available.
Notice that an average of e-detectors is also a valid e-detector, in the sense of satisfying condition (6). Therefore, for any mixing distribution $W$ supported on $[\lambda_L, \lambda_U]$, we can define mixtures of e-SR and e-CUSUM procedures by the following stopping times:
$$N^*_{mSR} := \inf\Big\{n \geq 1 : \int M^{SR}_n(\lambda)\, dW(\lambda) \geq 1/\alpha\Big\}, \qquad N^*_{mCU} := \inf\Big\{n \geq 1 : \int M^{CU}_n(\lambda)\, dW(\lambda) \geq c_\alpha\Big\},$$
where $c_\alpha > 1$ is a fixed constant which controls the ARL for some $\alpha \in (0, 1)$. Since the mixture of e-CUSUM procedures is based on a valid e-detector, we can always set the threshold $c_\alpha$ equal to $1/\alpha$, the same threshold used for the mixture of e-SR procedures.
Remark 3.1. Instead of using mixtures, one may be tempted to consider swapping the above integral with a supremum over $\lambda \in [\lambda_L, \lambda_U]$. However, this does not in general yield a valid e-detector.

Computational and analytical aspects of mixtures of baseline e-detectors
Though any mixing distribution yields a valid e-detector, for computational efficiency, we only consider discrete mixtures where the support of the mixing distribution has at most countably many elements. To be specific, let $\{\omega_k\}_{k \geq 1}$ be a set of nonnegative mixing weights with $\sum_{k \geq 1} \omega_k = 1$ and let $\{\lambda_k\}_{k \geq 1}$ be the corresponding supporting set. For ease of notation, we denote $L^{\lambda_k}$ by $L(k)$ for each $k \geq 1$. Based on the set of nonnegative mixing weights and the corresponding set of baseline increments, we define mixtures of SR and CUSUM e-detectors as $M^{mSR}_0 = M^{mCU}_0 := 1$, and for each $n \in \mathbb N$,
$$M^{mSR}_n := \sum_{k \geq 1} \omega_k \sum_{j=1}^{n} \prod_{i=j}^{n} L_i(k), \qquad M^{mCU}_n := \sum_{k \geq 1} \omega_k \max_{j \in [n]} \prod_{i=j}^{n} L_i(k).$$
Let $K := |\{k : \omega_k > 0\}|$ be the number of nonzero mixing weights.
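Finite mixtures. If $K < \infty$, we may for simplicity assume that the first $K$ weights $\omega_1, \dots, \omega_K$ are the only nonzero values. In this case, we can compute mixtures of SR and CUSUM e-detectors by
$$M^{mSR}_n = \sum_{k=1}^{K} \omega_k M^{SR}_n(k), \qquad M^{mCU}_n = \sum_{k=1}^{K} \omega_k M^{CU}_n(k),$$
where $M^{SR}_n(k)$ and $M^{CU}_n(k)$ are computed recursively as
$$M^{SR}_n(k) = L_n(k) \cdot \big[M^{SR}_{n-1}(k) + 1\big], \qquad M^{CU}_n(k) = L_n(k) \cdot \max\big\{M^{CU}_{n-1}(k), 1\big\}$$
for each $k \in [K]$ and $n \in \mathbb N$. Therefore, if each computation of $L_n(k)$ has constant time and space complexities, then the evaluation of mixtures of SR and CUSUM e-detectors at each time $n$ requires $O(K)$ time and space complexity.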
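The $O(K)$ update can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: each component is a baseline detector of the form $\sum_{j \le n} \prod_{i=j}^{n} L_i(k)$ (SR) or $\max_{j \le n} \prod_{i=j}^{n} L_i(k)$ (CUSUM), the recursions are started from 0 so that they reproduce these expressions for $n \geq 1$, and the class name `MixtureEDetector`, the helper `brute_force` and the sub-Gaussian-type increments $\exp(\lambda x - \lambda^2/2)$ are hypothetical choices.

```python
import math
from typing import Callable, List, Sequence, Tuple


class MixtureEDetector:
    """Finite mixture of K baseline SR and CUSUM detectors with O(K) updates."""

    def __init__(self, weights: Sequence[float],
                 increments: Sequence[Callable[[float], float]]) -> None:
        if len(weights) != len(increments):
            raise ValueError("need one increment function per weight")
        if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-12:
            raise ValueError("weights must be nonnegative and sum to one")
        self.weights = list(weights)
        self.increments = list(increments)
        self.sr = [0.0] * len(weights)   # per-component SR values
        self.cu = [0.0] * len(weights)   # per-component CUSUM values

    def update(self, x: float) -> Tuple[float, float]:
        """Consume one observation; return the mixture SR and CUSUM values."""
        for k, inc in enumerate(self.increments):
            value = inc(x)
            self.sr[k] = value * (self.sr[k] + 1.0)
            self.cu[k] = value * max(self.cu[k], 1.0)
        m_sr = sum(w * m for w, m in zip(self.weights, self.sr))
        m_cu = sum(w * m for w, m in zip(self.weights, self.cu))
        return m_sr, m_cu


def brute_force(xs: List[float], weights, increments) -> Tuple[float, float]:
    """Direct evaluation of the sum/max-of-products expressions, for checking only."""
    sr = cu = 0.0
    for w, inc in zip(weights, increments):
        prods = [math.prod(inc(x) for x in xs[j:]) for j in range(len(xs))]
        sr += w * sum(prods)
        cu += w * max(prods)
    return sr, cu


lambdas = [0.25, 0.5, 1.0]
weights = [0.5, 0.3, 0.2]
increments = [lambda x, lam=lam: math.exp(lam * x - 0.5 * lam * lam) for lam in lambdas]
xs = [0.3, -1.2, 0.8, 2.1, 1.7, -0.4, 1.1]

detector = MixtureEDetector(weights, increments)
for x in xs:
    recursive_values = detector.update(x)
direct_values = brute_force(xs, weights, increments)
print(recursive_values, direct_values)
```

The recursive and direct evaluations agree up to floating-point error, while the recursive form needs only the $K$ current component values rather than the full history.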
Infinite mixtures, scheduling functions and adaptive re-weighting. If $K = \infty$, or if $K$ is to be chosen adaptively as an increasing function of $n$, we modify our strategy as follows. We first choose an increasing function $K : \mathbb N \to \mathbb N$, and let $K^{-1} : \mathbb N \to \mathbb N$ be the generalized inverse of $K$, defined by $K^{-1}(k) := \inf\{j \geq 1 : K(j) \geq k\}$ for each $k \in \mathbb N$. Note that $K^{-1}$ is also an increasing function. We call such a function $K$ a scheduling function. We intentionally overload notation: in what follows, $K(n)$ plays the same role as the constant $K$ in the case of finite support. Note that $K^{-1}(k) \leq n$ for any $k \leq K(n)$; we will use this simple fact below when defining nested summations.
Based on a scheduling function $K$ and its generalized inverse $K^{-1}$, we define adaptive SR and CUSUM e-detectors, $M^{aSR}_n$ and $M^{aCU}_n$, respectively, as where each $\gamma_j := 1/\sum_{k=1}^{K(j)} \omega_k \geq 1$ is the adaptive re-weighting factor at time $j$, ensuring that the mixing weights always sum to one at each time. Here, we restrict not only the range of the index $k$ from $[1, \infty)$ to $[1, K(n)]$ but also the range of the index $j$ from $[1, n]$ to $[K^{-1}(k), n]$. The resulting components are updated recursively for each $n \geq K^{-1}(k)$, with $M^{aSR}_n(k) = M^{aCU}_n(k) = 0$ for all $n = 0, 1, \dots, K^{-1}(k) - 1$. Therefore, if each computation of $L_n(k)$ has constant time and space complexities, then the computations of adaptive SR and CUSUM e-detectors at each time $n$ have $O(K(n))$ time and space complexities as well. For the purpose of implementing an online algorithm, we are typically interested in the case $K(n) = O(\log(n))$.
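The scheduling function, its generalized inverse and the re-weighting factor can be sketched as below. The particular choices are assumptions made only for illustration: the logarithmic schedule $K(n) = 1 + \lfloor \log_2 n \rfloor$, the geometric weights $\omega_k = 2^{-k}$, and the function names `schedule`, `generalized_inverse`, `omega` and `reweighting_factor`.

```python
from typing import Callable


def schedule(n: int) -> int:
    """Hypothetical scheduling function K(n) = 1 + floor(log2(n)), so K(n) = O(log n)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return n.bit_length()   # equals 1 + floor(log2(n)) for n >= 1


def generalized_inverse(K: Callable[[int], int], k: int, search_limit: int = 1 << 20) -> int:
    """K^{-1}(k) = inf{j >= 1 : K(j) >= k}, found by linear search up to search_limit."""
    for j in range(1, search_limit + 1):
        if K(j) >= k:
            return j
    raise ValueError("K never reaches k within the search limit")


def omega(k: int) -> float:
    """Hypothetical mixing weights omega_k = 2^{-k}, which sum to one over k >= 1."""
    return 2.0 ** (-k)


def reweighting_factor(K: Callable[[int], int], j: int) -> float:
    """gamma_j = 1 / sum_{k=1}^{K(j)} omega_k, so the active weights sum to one."""
    return 1.0 / sum(omega(k) for k in range(1, K(j) + 1))


for n in (1, 2, 5, 100, 1000):
    active = schedule(n)
    first_times = [generalized_inverse(schedule, k) for k in range(1, active + 1)]
    print(n, active, first_times, round(reweighting_factor(schedule, n), 4))
```

With this schedule the $k$-th baseline enters the mixture at time $K^{-1}(k) = 2^{k-1}$, and $\gamma_j$ decreases to 1 as more weights become active.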
Remark 3.2. Both mixtures of SR and CUSUM e-detectors can be viewed as special cases of their adaptive counterparts where the scheduling function $K$ is understood as a constant function. In this case, we have Unlike finite mixtures, the mixing distribution deployed in the adaptive SR and CUSUM e-detectors varies over time. Hence, we cannot simply apply Proposition 2.3 to check whether this adaptive scheme yields valid e-detectors. The following proposition formally states the validity of adaptive SR and CUSUM e-detectors. The proof can be found in Appendix A.
where $\alpha \in (0, 1)$ is a fixed constant and $c_\alpha$ is a positive value controlling the ARL of the adaptive e-CUSUM procedure by $1/\alpha$. Similar to the usual e-CUSUM procedure discussed before, we can always set $c_\alpha = 1/\alpha$. In this case, from the fact that $N^*_{aSR} \leq N^*_{aCU}$, which is implied by $M^{aSR}_n \geq M^{aCU}_n$, we have $\mathbb E_{P,\infty}(N^*_{aCU}) \geq \mathbb E_{P,\infty}(N^*_{aSR}) \geq 1/\alpha$, where the last inequality comes from Theorem 2.4 together with the fact that $M^{aSR}$ is a valid e-detector. However, the threshold $c_\alpha$ for the adaptive e-CUSUM procedure can be chosen to be significantly smaller if we have enough knowledge about the pre-change distribution, as discussed in Section 2.5.

Worst average delay analysis for adaptive mixtures of e-detectors
We now derive general upper bounds on the worst average delays of the adaptive e-SR and e-CUSUM procedures. As before, we further assume that the post-change observations $X_{\nu+1}, X_{\nu+2}, \dots$ are independent of the pre-change observations and form a strongly stationary process. We also assume that there exist a function $f_k$ and an integer $m \geq 0$ such that $L_n(k) = f_k(X_n, X_{n-1}, \dots, X_{n-m})$ for each $k$ and $n$. Again, if $m$ is strictly positive then we implicitly assume that there exist $m$ observations $X_0, X_{-1}, \dots, X_{1-m}$ from the pre-change distribution that we can use to build sequential change detection procedures. Under these additional conditions, the following theorem provides analytically more tractable upper bounds on the worst average delays for $N^*_{aSR}$ and $N^*_{aCU}$.
Theorem 3.4. Under the additional conditions described above, the worst average delays for $N^*_{aSR}$ and $N^*_{aCU}$ can be upper bounded as follows: where, for $j \in \mathbb N$ and $c > 0$, $N_c(j)$ is the stopping time Here, $c_\alpha$ is the same threshold used to build the adaptive e-CUSUM procedure in (34). Note that for mixtures of the SR and CUSUM e-detectors where the scheduling function $K$ is a constant function, the stopping times in the upper bounds do not depend on the index $j$, and thus the upper bounds can be reduced as where, for $c > 0$, $N_c$ is the stopping time
The proof of the upper bounds on the worst average delays can be found in Appendix A.2.
Unlike the baseline e-detector case, however, due to the mixing weights, it is nontrivial to get further simplified upper bounds on the worst average delays as we did in Section 2.7. Next, we present specific adaptive e-SR and e-CUSUM procedures based on exponential baseline increments, for which we can compute both procedures efficiently and derive upper bounds on the worst average delays in explicit forms.

Exponential baseline e-detectors and their mixtures
Building upon recent advances in time-uniform concentration inequalities and sequential testing developed in [11] and [36], below we consider an exponential structure on baseline e-detectors. We show that, in this setting, it is possible to approximate the "oracle" e-SR and e-CUSUM procedures based on the knowledge of the optimal (but unknown) $\lambda_{op}$ by adaptive procedures built using a mixture of a carefully chosen set of baseline increments $\{L^{\lambda_k}\}_{k \geq 1}$ with mixing weights $\{\omega_k\}_{k \geq 1}$.
To be specific, assume there exists an extended real-valued convex function ψ on $\mathbb R$ that is finite and strictly convex on a set $\Pi \subset \mathbb R$ containing 0 in its interior $\Pi^o$. Furthermore, assume ψ is continuously differentiable on $\Pi^o$ with $\nabla\psi(0) = 0 = \psi(0)$. Then define the "exponential baseline increment" as follows.
Definition 4.1 (Exponential baseline increment). For each $n \in \mathbb N$ and $\lambda \in \Pi$, define
$$L^\lambda_n := \exp\{\lambda s(X_n) - \psi(\lambda) v(X_n)\}, \qquad (42)$$
where $s$ is a real-valued function and $v$ is a positive function on the sample space. $L^\lambda := \{L^\lambda_n\}_{n \geq 1}$ is called an exponential baseline increment if it satisfies condition (11) in Definition 2.8.
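Above, $s$ and $v$ are mnemonics for sum and variance. For each $Q \in \mathcal Q$, define
$$\mu(Q) := \mathbb E_{0,Q}\, s(X_1), \qquad \sigma^2(Q) := \mathbb E_{0,Q}\, v(X_1), \qquad \Delta_{op}(Q) := \frac{\mu(Q)}{\sigma^2(Q)}, \qquad (43)$$
where we assume that all expectations are finite. The following proposition provides an explicit expression for $D(Q\|\mathcal P)$ and a sufficient condition to have $D(Q\|\mathcal P) > 0$ when the underlying baseline increments have the form specified in (42).
Proposition 4.2. For a fixed $Q \in \mathcal Q$, suppose there exists $\lambda_{op} \in \Pi^o$ such that $\Delta_{op}(Q) = \nabla\psi(\lambda_{op})$. Then,
$$D(Q\|\mathcal P) = \sigma^2(Q)\, \psi^*(\Delta_{op}(Q)),$$
where $\psi^*$ is the convex conjugate of ψ. Thus, if $\Delta_{op}(Q) \neq 0$, we have $D(Q\|\mathcal P) > 0$.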
The proof of Proposition 4.2 can be found in Appendix B.1. For the rest of the section, we assume that Then, Proposition 4.2 implies $D(Q\|\mathcal P) > 0$ for all $Q \in \mathcal Q$. Also, for ease of notation, we will drop the dependency on $Q$ from related parameters and simply write $\lambda_{op}$, µ, $\sigma^2$ and ∆.
The exponential structure of the baseline increment in (42) results in a simple form of $\lambda_{op}$:
$$\lambda_{op} = (\nabla\psi)^{-1}(\Delta_{op}) = \nabla\psi^*(\Delta_{op}),$$
where the second equality comes from the fact that $\lambda = \nabla\psi^* \circ \nabla\psi(\lambda)$ for each $\lambda \in \Pi$. Although $\lambda_{op}$ still depends on the unknown post-change distribution $Q$ via $\Delta_{op}$, in many cases we can find upper and lower bounds on $\Delta_{op}$. In this section, we explain how to use the knowledge of the range of $\Delta_{op}$ to build a mixture of exponential baseline e-detectors that has explicit upper bounds on worst average delays.
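The relationship between $\Delta_{op}$, $\lambda_{op}$ and the growth rate can be checked numerically. The sketch below assumes the exponential increment has the form $\exp\{\lambda s(x) - \psi(\lambda) v(x)\}$, so that its expected logarithm under $Q$ is $\lambda\mu - \psi(\lambda)\sigma^2$ with $\mu = \mathbb E_Q s(X)$ and $\sigma^2 = \mathbb E_Q v(X)$. It uses two standard choices of ψ with $\psi(0) = \nabla\psi(0) = 0$, namely $\psi(\lambda) = \lambda^2/2$ and $\psi(\lambda) = e^\lambda - \lambda - 1$; the function names and the numerical values of µ and σ² are hypothetical.

```python
import math
from typing import Callable, Tuple


# Gaussian-type psi: psi(l) = l^2/2, with conjugate psi*(d) = d^2/2 and grad psi*(d) = d.
def psi_gaussian(lam: float) -> float:
    return 0.5 * lam * lam

def grad_psi_star_gaussian(delta: float) -> float:
    return delta

def psi_star_gaussian(delta: float) -> float:
    return 0.5 * delta * delta


# Poisson-type psi: psi(l) = e^l - l - 1, with grad psi(l) = e^l - 1,
# grad psi*(d) = log(1 + d) and psi*(d) = (1 + d) log(1 + d) - d for d > -1.
def psi_poisson(lam: float) -> float:
    return math.exp(lam) - lam - 1.0

def grad_psi_poisson(lam: float) -> float:
    return math.expm1(lam)

def grad_psi_star_poisson(delta: float) -> float:
    return math.log1p(delta)

def psi_star_poisson(delta: float) -> float:
    return (1.0 + delta) * math.log1p(delta) - delta


def expected_log_increment(lam: float, mu: float, sigma2: float,
                           psi: Callable[[float], float]) -> float:
    """E_Q[lam * s(X) - psi(lam) * v(X)] with mu = E_Q s(X) and sigma2 = E_Q v(X)."""
    return lam * mu - psi(lam) * sigma2


def oracle(mu: float, sigma2: float,
           grad_psi_star: Callable[[float], float],
           psi_star: Callable[[float], float]) -> Tuple[float, float]:
    """Return (lambda_op, growth rate) = (grad psi*(Delta), sigma2 * psi*(Delta)), Delta = mu / sigma2."""
    delta = mu / sigma2
    return grad_psi_star(delta), sigma2 * psi_star(delta)


mu, sigma2 = 0.3, 1.5   # hypothetical post-change moments, so Delta_op = 0.2
lam_g, rate_g = oracle(mu, sigma2, grad_psi_star_gaussian, psi_star_gaussian)
lam_p, rate_p = oracle(mu, sigma2, grad_psi_star_poisson, psi_star_poisson)
print(lam_g, rate_g, expected_log_increment(lam_g, mu, sigma2, psi_gaussian))
print(lam_p, rate_p, expected_log_increment(lam_p, mu, sigma2, psi_poisson))
```

In both cases the expected log-increment evaluated at $\nabla\psi^*(\Delta_{op})$ coincides with $\sigma^2 \psi^*(\Delta_{op})$, and moving λ away from that value lowers the growth rate.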

Separated pre- and post-change distributions
Suppose we have knowledge of upper and lower bounds on the parameter $\Delta_{op}$ given in (43), i.e. we know a pair $(\Delta_L, \Delta_U)$ such that $\Delta_L < \Delta_{op} < \Delta_U$. It then follows that $\lambda_L < \lambda_{op} < \lambda_U$, where $\lambda_L = \nabla\psi^*(\Delta_L)$, $\lambda_{op} = \nabla\psi^*(\Delta_{op})$ and $\lambda_U = \nabla\psi^*(\Delta_U)$. To simplify presentation, we only consider the one-sided and well-separated case: $0 < \lambda_L < \lambda_U$. Let $1/\alpha$ be the target level of the ARL control for a fixed $\alpha \in (0, 1)$. Let $\{L(k)\}_{k \in [K]}$ and $\{\omega_k\}_{k \in [K]}$ be $K$ exponential baseline increments and mixing weights whose specific values will be defined later in this subsection. Since each $L_n(k)$ is a function of the $n$-th observation $X_n$ for each $k \in [K]$, Theorem 3.4 implies that, if the post-change observations form a strongly stationary process, then the worst average delays for the mixtures of e-SR and e-CUSUM procedures, $N^*_{mSR}$ and $N^*_{mCU}$, can be upper bounded by $\mathbb E_{0,Q} N_{1/\alpha}$ and $\mathbb E_{0,Q} N_{c_\alpha}$, respectively, where $N_c$ is the stopping time defined in (41). Furthermore, as we can always set the threshold for the e-CUSUM procedure to be $c_\alpha \leq 1/\alpha$, we have $\mathbb E_{0,Q} N_{c_\alpha} \leq \mathbb E_{0,Q} N_{1/\alpha}$. Therefore, in this subsection, we construct a set of baseline increments for which we can derive a tight bound on $\mathbb E_{0,Q} N_{1/\alpha}$.
Algorithm 1 describes our methodology for computing mixtures of e-SR procedures in detail. The inputs to the algorithm are the upper and lower bounds $\Delta_U$ and $\Delta_L$ on $\Delta_{op}$ and the maximal number of baseline processes $K_{\max}$. The mixture of e-CUSUM procedures can be executed similarly by replacing Line 9 by the analogous CUSUM update. Also, for the mixture of e-CUSUM procedures, we can replace the threshold $1/\alpha$ with a smaller value $c_\alpha$ if we have enough information about the pre-change distribution. For both e-SR and e-CUSUM, at each time $n$, updates of mixtures of e-detectors have $O(K_\alpha)$ time and space complexities, which do not depend on $n$. Algorithm 1 relies critically on the function computeBaseline in Line 1, which returns a set of parameters and weights to compute a mixture of e-detectors, along with a threshold value $g_\alpha > 0$ that appears in the upper bound on the worst average delays given in Theorem 4.3, explained below. The details of computeBaseline are fairly technical and are given in Algorithm 3 in Appendix B.1.
In the main result of this section, we provide bounds on the ARL and the worst average delays for the mixtures of e-CP procedures obtained with Algorithm 1, expressed as functions of the parameter $\lambda_{op}$ and the threshold value $g_\alpha$. The proof can be found in Appendix B.1.
Theorem 4.3. Let $N^*_{mSR}$ and $N^*_{mCU}$ be the stopping times corresponding to the mixtures of e-SR and e-CUSUM procedures in Algorithm 1 and its variant, respectively. Then, both procedures control the ARL by $1/\alpha$. If we further assume that the post-change observations $X_{\nu+1}, X_{\nu+2}, \dots$ are iid samples from a post-change distribution $Q$, then the worst average delays for $N^*_{mSR}$ and $N^*_{mCU}$ can be bounded as The same bound also holds for $J_P(N^*_{mSR})$ and $J_P(N^*_{mCU})$.
In Proposition B.2 in Appendix B.1, we show that if the number of baseline processes $K_{\max}$ in Algorithm 1 is chosen large enough, then the quantity $g_\alpha$ returned by computeBaseline is at most which can be easily evaluated numerically. Expression (99) in Appendix B.1 provides a precise formula for how large $K_{\max}$ needs to be in order for the above bound to be in effect. In most practical cases, $K_{\max} = 1000$ is a large enough choice. Also, in many canonical examples we will present later, if we choose $K_{\max}$ large enough to satisfy the condition (99), then the first term $g_\alpha / D(Q\|\mathcal P)$ of the upper bound on the worst average delays in Theorem 4.3 becomes the leading term. In this case, from the inequality (48), we can check that this leading term is $O(\log(1/\alpha)/D(Q\|\mathcal P))$ as $\alpha \to 0$.
Remark 4.4. If there is only one pre-change distribution $P$ and one post-change distribution $Q$, both from a natural univariate exponential family, then their likelihood ratio forms an exponential baseline increment. In this case, the above upper bound becomes $O(\log(1/\alpha)/\mathrm{KL}(Q\|P))$ as $\alpha \to 0$, matching the rate of the known lower bounds [18].

The bound on the worst average delays in Theorem 4.3 is obtained by analyzing the auxiliary stopping time Ngα defined in (49). Using the same arguments as in the proof of Proposition 2.14, we immediately have that if the post-change observations are iid from $Q$, then for any $g > 1$, The bound (47) is finally established by showing that the stopping time Ngα obtained by using the threshold $g_\alpha$ produced by Algorithm 3 is a deterministic upper bound on the stopping times $N_{c_\alpha}$ and $N_{1/\alpha}$ corresponding to mixtures of SR and CUSUM e-detectors. In detail, it holds that for any stream of observations $X_1, X_2, \dots$, This nontrivial result is formally stated in Lemma B.1 in Appendix B.1. Its proof leverages geometric arguments used in [36, Theorem 2] to analyze sequential generalized likelihood ratio tests.
Algorithm 1: Pseudo-code of the mixture of e-SR procedures
Input: ARL parameter $\alpha \in (0, 1)$; boundary values $0 < \Delta_L < \Delta_U$; maximum number of baselines $K_{\max} \in \mathbb N$.
Output: Stopping time $N^*_{mSR}$ of the mixture of e-SR procedures.
Data: Data stream $X_1, X_2, \dots$ (observed sequentially).
The stopped time $N^*_{mSR}$
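The overall flow of a mixture of e-SR procedures can be sketched as follows. This is not the authors' Algorithm 1: the function computeBaseline is replaced by a hypothetical stand-in (`geometric_grid` with uniform weights), the increments are assumed to be of the sub-Gaussian type $\exp(\lambda x - \lambda^2/2)$ with $s(x) = x$ and $v(x) = 1$, and the threshold is fixed at $1/\alpha$. The toy data stream is deterministic and chosen only to make the behavior easy to inspect.

```python
import math
from typing import Iterable, List, Optional


def geometric_grid(lam_low: float, lam_up: float, size: int) -> List[float]:
    """Hypothetical stand-in for computeBaseline: a geometric grid on [lam_low, lam_up]."""
    if size == 1:
        return [lam_low]
    ratio = lam_up / lam_low
    return [lam_low * ratio ** (k / (size - 1)) for k in range(size)]


def mixture_e_sr_stop(stream: Iterable[float], lambdas: List[float],
                      alpha: float) -> Optional[int]:
    """Stop the first time the uniformly weighted mixture of SR detectors reaches 1/alpha.

    Assumes increments exp(lam * x - lam^2 / 2), i.e. s(x) = x, v(x) = 1 and
    psi(lam) = lam^2 / 2; returns None if the stream ends before stopping.
    """
    weights = [1.0 / len(lambdas)] * len(lambdas)
    detectors = [0.0] * len(lambdas)
    threshold = 1.0 / alpha
    for n, x in enumerate(stream, start=1):
        for k, lam in enumerate(lambdas):
            increment = math.exp(lam * x - 0.5 * lam * lam)
            detectors[k] = increment * (detectors[k] + 1.0)   # O(K) work per observation
        if sum(w * m for w, m in zip(weights, detectors)) >= threshold:
            return n
    return None


grid = geometric_grid(0.25, 2.0, 4)                    # roughly [0.25, 0.5, 1.0, 2.0]
changed = [0.0] * 40 + [1.0] * 40                      # mean shifts from 0 to 1 after time 40
stop_changed = mixture_e_sr_stop(changed, grid, alpha=0.01)
stop_unchanged = mixture_e_sr_stop([0.0] * 500, grid, alpha=0.01)
print(stop_changed, stop_unchanged)
```

On this toy stream the procedure stops shortly after the change and never stops on the unchanged stream; a calibrated choice of grid and weights, as produced by computeBaseline, is what yields the explicit delay bound of Theorem 4.3.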

Non-separated pre- and post-change distributions
The previous subsection discussed how to build mixtures of e-SR and e-CUSUM procedures with an explicit upper bound on the worst average delays when we have known and positive boundary values $\lambda_L$ and $\lambda_U$ on the unknown $\lambda_{op}$ via the knowledge of $\Delta_L < \Delta_{op} < \Delta_U$. However, in many cases, we may not be fully certain about the boundary values. In this subsection, we discuss how we can generalize the previous argument to the no-separation case, in which we only know the sign of $\lambda_{op}$ ($> 0$) but do not have specific boundary values.
Recall that, for the well-separated case, we calibrated the mixtures of finitely many exponential baseline e-detectors using the stopping time Ngα in (49), which is in turn based on the maximum of the underlying baseline increments over the known upper and lower bounds on $\lambda_{op}$. Since we no longer have knowledge of the boundary values $\lambda_L$ and $\lambda_U$, we may use similar stopping times where the range of the maximum and the threshold slowly increase over time. In this case, we need an infinite sequence of baseline procedures $\{L(k)\}_{k \in \mathbb N}$ and mixing weights $\{\omega_k\}_{k \in \mathbb N}$ to build adaptive e-SR and e-CUSUM procedures.
The bound in Theorem 3.4, along with the fact that $\gamma_j \geq 1$ for all $j \in \mathbb N$, implies that, for any given scheduling function $K : \mathbb N \to \mathbb N$, if the post-change observations form a strongly stationary process, then the worst average delays for the adaptive e-SR and e-CUSUM procedures can be upper bounded by $\min_{j \geq 1} [\mathbb E_{0,Q} N_{1/\alpha}(j) + j - 1]$ and $\min_{j \geq 1} [\mathbb E_{0,Q} N_{c_\alpha}(j) + j - 1]$, respectively, where we recall that $N_c(j)$ is defined for $c > 0$ by Again, since we can set the threshold for the e-CUSUM procedure in such a manner that $c_\alpha \leq 1/\alpha$ (so that $\mathbb E_{0,Q} N_{c_\alpha}(j) \leq \mathbb E_{0,Q} N_{1/\alpha}(j)$), in this subsection we focus on constructing a set of baseline increments for which we can derive a tight upper bound on $\min_{j \geq 1} [\mathbb E_{0,Q} N_{1/\alpha}(j) + j - 1]$.
To derive the set of baseline increments, we use a time-varying boundary function $g$. Here, we intentionally overload notation: the constant $g$ in the previous subsection for the well-separated case can be viewed as a constant function $g$ in what follows. Let $g : [1, \infty) \to [0, \infty)$ be a nonnegative and nondecreasing continuous function such that the mapping $t \mapsto g(t)/t$ is nonincreasing and $\lim_{t \to \infty} g(t)/t = 0$. For a chosen positive number $\Delta_0 > 0$, let Finally, based on the sequence $\{\Delta_k\}_{k \geq 0}$, define and set $\omega_0 := \alpha^{-1} e^{-g(V_0)}\, \mathbb 1(g(V_0) > v_{\min} D_0)$ and $\omega_k := \alpha^{-1} e^{-g(V_0 \eta^k)/\eta}$ for each $k \in \mathbb N$, where $v_{\min} := \min_x v(x)$, recalling the function $v$ from Definition 4.1.
Based on the quantities defined above, we can construct the stopping time $N_{1/\alpha}(j)$ for each $j$. The following lemma shows that we can upper bound the stopping time $N_{1/\alpha}(j)$ by another stopping time Ng (j), for which we can derive an explicit upper bound on its expectation.
Lemma 4.5. For any fixed $j \geq 1$, $\Delta_0 > 0$, and tuning parameter $\eta > 1$, let $N_{1/\alpha}(j)$ be the stopping time based on the parameters defined above. Then, we have where Ng (j) is a stopping time defined by
Note that the chosen set of weights $\{\omega_k\}_{k \geq 0}$ yields valid adaptive e-SR and e-CUSUM procedures if Once the above condition is satisfied, we can use the worst average delay analysis in Section 3.2 with the bound in Lemma 4.5 to get an explicit upper bound on the worst average delay of the adaptive e-SR and e-CUSUM procedures.
In detail, let j op be the smallest integer satisfying λ K(j op ) < λ op and set K op := K(j op ). If we also have λ op < λ 0 , then Lemma 4.5 implies where the stopping time N op is defined by and the expectation E 0,Q N op is typically on the order of g(V 0 η K op )/D(Q||P). Based on this observation, in the rest of this subsection, we introduce a practical and interpretable way to choose a boundary function g and related tuning parameters which minimize the leading term g(V 0 η K op ) while satisfying the condition (56) on the set of mixing weights.
First note that, although we have no bounds on ∆ op in the no-separation case, we can still choose ∆ L and ∆ 0 with ∆ L < ∆ 0 as tuning parameters that represent our initial guess of the range of the unknown ∆ op . Since the unknown parameter ∆ op of the post-change distribution may lie outside the interval (∆ L , ∆ 0 ), instead of assigning the entire α to the inside of the guessed interval, we split it into two parts, rα and (1 − r)α, where r ∈ (0, 1) is another tuning parameter called the importance weight. Roughly speaking, a larger r means that we bet more heavily on the unknown ∆ op lying inside our chosen interval (∆ L , ∆ 0 ). Now, given tuning parameters ∆ L , ∆ 0 and r, we compute the set {g rα , K L , η} by executing the function computeBaseline, just as in Algorithm 1, except that α is replaced by rα. Then, we can extend the boundary function g to accommodate the case in which the unknown ∆ op is not inside the interval we initially guessed. To be specific, we use the boundary function where V 0 := g rα /D 0 and s > 1 is a constant obtained as the solution of the equation Note that the right-hand side of the above equation is approximately equal to (1 − r)αe grα/η . Therefore, In Algorithm 2, we provide the detailed steps for the adaptive e-SR procedure based on the boundary function in (59). The algorithm can be easily modified for the adaptive e-CUSUM procedure by replacing the update in Line 14 with the rule Also, for the adaptive e-CUSUM procedure, we can replace the threshold 1/α with a smaller value c α if we have enough information about the pre-change distribution.
Compute ∆ k as the solution of ψ * (z) = g k V 0 η k with respect to z(> 0), where V 0 := g rα /D 0 and g k := g rα + sη log (1 The stopped time N * aSR where m ≥ 1 is a tuning parameter. Therefore, for both adaptive e-SR and e-CUSUM procedures, updating the statistics has O(m log η n) time and space complexity at each time n. Although this is not a fully online algorithm, the logarithmic time and space complexity makes it feasible to run the adaptive e-SR and e-CUSUM procedures in most practical online settings.
From Section 3.1, we know that both procedures control the ARL by 1/α. The following corollary gives explicit bounds on the worst average delays for both procedures.
Corollary 4.6. Let N * aSR and N * aCU be the stopping times corresponding to the adaptive e-SR procedure in Algorithm 2 and its e-CUSUM variant, respectively. Then, both procedures control the ARL by 1/α. If we further assume that the post-change observations X ν+1 , X ν+2 , . . . are iid samples from a post-change distribution, then the worst average delays for N * aSR and N * aCU can be upper bounded as in (64). Note that η, s > 1 and r ∈ (0, 1) do not depend on the unknown ∆ op .

Application to real data and simulation study

5.1. Bernoulli random variables with dependent, time-varying means
Winning rates of the Cavaliers. To illustrate how sequential change detection procedures based on e-detectors work, we revisit the example of the Cleveland Cavaliers, an American professional basketball team introduced in Section 1.2. Instead of using Plus-Minus stats, in this example we monitor the performance of the Cavaliers by keeping track of wins and losses over all the games. Let X 1 , X 2 , • • • ∈ {0, 1} be the sequence of win indicators during the 2010-11 to 2017-18 regular seasons, where X i = 1 if the Cavaliers won game i. Though Figure 2 presents monthly and seasonal averages for the purpose of visualization, we use the underlying binary sequence to build a sequential change detection procedure.
Modeling winning probabilities as a dependent sequence of Bernoullis. To detect a significant improvement in the performance of the Cavaliers, we assume that before an unknown changepoint ν ∈ N ∪ {∞}, the conditional winning probability given the sample history is less than or equal to p 0 := 0.49. That is, under any pre-change distribution P we have p n := E P,∞ [X n | F n−1 ] ≤ p 0 . (For simplicity, F is taken to be the natural filtration of the data.) Thus, the pre-change class of distributions is P := {(p 1 , p 2 , . . . ) : p i ≤ p 0 , ∀i ≥ 1}, where we parameterize each distribution P over binary sequences by its sequence of conditional probabilities.
Our objective is to build mixtures of e-SR and e-CUSUM procedures tuned to quickly detect any significantly improved win rate larger than q 0 := 0.51 after the changepoint. This can be modeled by assuming that after some changepoint ν, the distribution Q is such that each conditional winning probability q n given the sample history satisfies q n ≥ q 0 . Thus, we may think of the post-change class of distributions as being Q := {(q 1 , q 2 , . . . ) : q i ≥ q 0 , ∀i ≥ 1}.
In particular, this formalization allows for the winning probabilities to fluctuate over time before and after the changepoint (accounting for factors like form, injuries, etc.).
Deriving exponential baseline processes. For each λ > 0, define a baseline increment process L (λ) := {L (λ) n } n≥1 by
$$L^{(\lambda)}_n := \exp\{\lambda (X_n - p_0) - \psi_B(\lambda)\},$$
where ψ B (λ) := log(1 − p 0 + p 0 e λ ) − λp 0 is the Bernoulli cumulant generating function. Note that each L (λ) is a valid baseline increment as it satisfies the inequality (11) in Definition 2.8. That is, under any pre-change distribution P , we have
$$E_{P,\infty}\left[L^{(\lambda)}_n \mid F_{n-1}\right] = \frac{1 - p_n + p_n e^{\lambda}}{1 - p_0 + p_0 e^{\lambda}} \le 1,$$
since p n ≤ p 0 and λ > 0. To derive exponential baseline processes, we first consider a simplified post-change distribution Q where each post-change observation is identically distributed with E 0,Q [X 1 ] := q ≥ q 0 > p 0 . In this case, the optimal choice of λ ≥ 0 is given by
$$\lambda_{\mathrm{op}} := \arg\max_{\lambda \ge 0} E_{0,Q} \log L^{(\lambda)}_1 = \log \frac{q(1 - p_0)}{p_0 (1 - q)}.$$
Since the baseline increment has the exponential structure, by Proposition 4.2, we have that
$$E_{0,Q} \log L^{(\lambda_{\mathrm{op}})}_1 = \mathrm{KL}(q \| p_0),$$
where KL(q||p 0 ) is the Kullback-Leibler (KL) divergence between Bernoulli distributions with parameters q and p 0 , written as
$$\mathrm{KL}(q \| p_0) := q \log \frac{q}{p_0} + (1 - q) \log \frac{1 - q}{1 - p_0}$$
for q, p 0 ∈ (0, 1). The appearance of the KL divergence in (67) is not a coincidence, as the baseline increment can be viewed as a re-parametrized likelihood ratio between two Bernoulli processes. However, the simple geometric structure of the baseline increment makes it possible to utilize prior knowledge about the post-change distribution via Algorithms 1 and 2.
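The relationship between the Bernoulli baseline increment, the optimal λ, and the KL divergence can be checked numerically. The following is a minimal sketch, assuming the increment has the exponential form exp{λ(X n − p 0 ) − ψ B (λ)}; the function names are illustrative and do not come from the paper.

```python
import math

def psi_B(lam, p0):
    """Cumulant generating function of a centered Bernoulli(p0) variable."""
    return math.log(1.0 - p0 + p0 * math.exp(lam)) - lam * p0

def optimal_lambda(q, p0):
    """Maximizer of the expected log-increment when the post-change mean is q."""
    return math.log(q * (1.0 - p0) / (p0 * (1.0 - q)))

def expected_log_increment(lam, q, p0):
    """E[lam * (X - p0) - psi_B(lam)] for X ~ Bernoulli(q)."""
    return lam * (q - p0) - psi_B(lam, p0)

def mean_increment(lam, p, p0):
    """E[exp(lam * (X - p0) - psi_B(lam))] for X ~ Bernoulli(p)."""
    return (1.0 - p + p * math.exp(lam)) / (1.0 - p0 + p0 * math.exp(lam))

def bernoulli_kl(q, p0):
    """KL divergence between Bernoulli(q) and Bernoulli(p0)."""
    return q * math.log(q / p0) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p0))

p0, q = 0.49, 0.51                      # values used in the Cavaliers example
lam_op = optimal_lambda(q, p0)
growth = expected_log_increment(lam_op, q, p0)
```

Under these assumptions, `growth` coincides with `bernoulli_kl(q, p0)`, and `mean_increment(lam, p, p0)` stays at or below one whenever p ≤ p0, which is the validity condition for a baseline increment.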
For instance, suppose we know upper and lower bounds on the conditional means of the post-change distribution, q n ∈ (q L , q U ), ∀n > ν. Let N * mSR and N * mCU be the stopping times of the mixtures of e-SR and e-CUSUM procedures in Algorithm 1. In this case, the derived sequential change detection procedures do not rely on a specific choice of a post-change distribution Q ∈ Q. However, these procedures can still perform almost as well as the one optimized for a specific post-change distribution within the same range (q L , q U ). In particular, if the post-change observations are iid samples from a post-change distribution Q with E 0,Q [X 1 ] := q ∈ (q L , q U ), then by Theorem 4.3, the worst average delays have the following explicit bound: For small α ≪ 1, from Proposition B.2, we can simplify the above upper bound as which matches the rate of the worst average delays, O (log(1/α)/KL(q||p 0 )), of the oracle sequential change detection procedure as α → 0.

Implementation of Algorithm 1 and its results. The lower bound q L can be chosen as q 0 = 0.51 since it is the minimum winning rate we consider a significant improvement over the pre-change rates, which are upper bounded by p 0 = 0.49. We can also safely assume that the win rate cannot be too high given the competitiveness of the NBA, so that the improved win rates cannot be larger than 0.9. In our framework, these considerations can be encoded by setting ∆ L := q 0 − p 0 = 0.02 and ∆ U := 0.41 as input parameters of Algorithm 1. As in Section 1.2, we set α := 10^{-3} to ensure that the ARL is at least 1/α := 10^3, which is more than the total number of games over 12 years of regular seasons. Finally, we set the maximum number of baselines K max := 1000. In fact, the computeBaseline function of Algorithm 3 returns only 69 baseline processes, and thus the resulting mixtures of e-SR and e-CUSUM procedures of Algorithm 1 can be computed efficiently in an online fashion. The right plot in Figure 2 presents the log e-detector values using mixtures of e-SR (red) and e-CUSUM (green) procedures. Although there were a few months in which monthly win rates were higher than p 0 , overall the log e-detectors remained at a stable level over the first four seasons. However, after the 2014-15 season starts, the log e-detectors increase rapidly and both procedures detect a changepoint during the 2014-15 season, which is the season that marked the return of LeBron James to the Cavaliers.
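To make the online computation concrete, here is a minimal sketch of mixture e-SR and e-CUSUM updates for Bernoulli data. It assumes the standard recursions M n = L n (M n−1 + 1) for SR and M n = L n max(M n−1 , 1) for CUSUM, and it uses a small hand-picked grid of λ values with uniform weights as a stand-in for the output of computeBaseline, which is not reproduced here.

```python
import math

def bernoulli_increment(x, lam, p0):
    """exp(lam * (x - p0) - psi_B(lam)), simplified algebraically."""
    return math.exp(lam * x) / (1.0 - p0 + p0 * math.exp(lam))

def run_mixture_detector(xs, lams, weights, p0, alpha, kind="SR"):
    """Return the first time the weighted mixture reaches 1/alpha, else None.

    lams and weights are illustrative inputs; the weights should sum to one.
    """
    stats = [0.0] * len(lams)
    for n, x in enumerate(xs, start=1):
        for k, lam in enumerate(lams):
            inc = bernoulli_increment(x, lam, p0)
            if kind == "SR":
                stats[k] = inc * (stats[k] + 1.0)      # sum over start times
            else:
                stats[k] = inc * max(stats[k], 1.0)    # max over start times
        mixture = sum(w * s for w, s in zip(weights, stats))
        if mixture >= 1.0 / alpha:
            return n
    return None

# Toy stream: 50 losses followed by a winning streak (not real NBA data).
stream = [0] * 50 + [1] * 200
lams = [0.1, 0.5, 1.0]
weights = [1.0 / 3.0] * 3
t_sr = run_mixture_detector(stream, lams, weights, p0=0.49, alpha=1e-3, kind="SR")
t_cu = run_mixture_detector(stream, lams, weights, p0=0.49, alpha=1e-3, kind="CUSUM")
```

Each update costs one multiplication per baseline, so with a few dozen baselines the per-game cost is negligible. Because the SR statistic is never smaller than the CUSUM statistic for the same baselines and threshold, the SR alarm in this sketch cannot come later than the CUSUM alarm.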

Mean-shift detection in general bounded random variables
Plus-Minus of the Cavaliers revisited. We return to the Cavaliers 2011-2018 example from Section 1.2. Let X 1 , X 2 , . . . be the sequence of Plus-Minus stats from each game. We assume that the average Plus-Minus of the team is less than or equal to µ < := −1 before the changepoint (if any), while after the changepoint it is greater than µ > := 1. Here, the gap |µ > − µ < | between the pre- and post-change averages of Plus-Minus is the degree of improvement we consider significant.
For convenience, we first normalize the observed sequence. We assume that the absolute value of each Plus-Minus is bounded by 80, meaning that no team beats another by over 80 points (such an extreme game has never happened in NBA history). Accordingly, we map each raw Plus-Minus value x to (x + 80)/160 ∈ [0, 1] and, with a slight abuse of notation, continue to write X n for the normalized value. Then, the pre-change observations have conditional mean at most m := (µ < + 80)/160 ≈ 0.494 and the minimum gap to detect is equal to δ := |µ > − µ < |/160 = 0.0125.
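The normalization and the resulting constants can be reproduced in a few lines. This is a small sketch; the helper name `normalize_plus_minus` is hypothetical.

```python
def normalize_plus_minus(raw, bound=80.0):
    """Map a raw Plus-Minus value in [-bound, bound] to [0, 1]."""
    if abs(raw) > bound:
        raise ValueError("Plus-Minus outside the assumed range")
    return (raw + bound) / (2.0 * bound)

mu_pre, mu_post = -1.0, 1.0                      # thresholds from the text
m = normalize_plus_minus(mu_pre)                 # pre-change mean boundary
delta = normalize_plus_minus(mu_post) - m        # minimum gap to detect
```

Here `m` evaluates to 79/160 = 0.49375, which the text rounds to 0.494, and `delta` equals 2/160 = 0.0125.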
Modeling plus-minus stats as a bounded sequence with time-varying, dependent means. After the normalization above, the Plus-Minus stats form a sequence of bounded random variables X 1 , X 2 , . . . on [0, 1]. Each observation may have a different distribution (due to seasonal effects, injuries, form, etc.), but we assume that all observations before an unknown changepoint ν have a mean less than or equal to a known boundary m ∈ (0, 1), when conditioned on the past sample history. That is, under any pre-change distribution P , we have µ n := E P,∞ [X n | F n−1 ] ≤ m. In other words, we use P := {(µ 1 , µ 2 , . . . ) : µ i ≤ m, ∀i ≥ 1}, where other characteristics of P (outside of its sequence of conditional means) are irrelevant. After the changepoint, all observations have (conditional) mean larger than the boundary m by at least the minimum gap δ. Thus, Q := {(µ 1 , µ 2 , . . . ) : µ i ≥ m + δ, ∀i ≥ 1}.
To build an e-SR or e-CUSUM procedure, we need to choose a baseline increment. To derive it, we first consider a simplified setting where both pre- and post-change observations are independently and identically distributed with E P,∞ [X] ≤ m and E 0,Q [X] ≥ m + δ, respectively. In this simplified case, we let P and Q denote the marginal pre- and post-change distributions, and P and Q their collections. Then, define KL inf (Q; m) := inf P ∈P KL(Q, P ) to be the smallest KL divergence between Q and P. It is known (see, e.g., [9,10]) that KL inf has the following variational representation:
$$\mathrm{KL}_{\inf}(Q; m) = \sup_{\lambda \in [0,1]} E_{0,Q} \log\left(1 + \lambda\left(\frac{X}{m} - 1\right)\right).$$
Accordingly, for each λ ∈ (0, 1), define the baseline increment L λ := {L λ n } n≥1 as
$$L^{\lambda}_n := 1 + \lambda\left(\frac{X_n}{m} - 1\right)$$
for each n ∈ N. Though the baseline increment above has been derived in the simplified iid setting, it can be checked that L λ is also a valid baseline increment for the general case of time-varying, dependent means, since it is nonnegative whenever X n , m ∈ [0, 1] as assumed in our setup, and for each pre-change distribution P ∈ P, we have
$$E_{P,\infty}\left[L^{\lambda}_n \mid F_{n-1}\right] = 1 + \lambda\left(\frac{\mu_n}{m} - 1\right) \le 1,$$
where the inequality comes from the condition µ n ≤ m for any pre-change distribution. Interestingly, the baseline increments in (70) correspond to rescaled increments of the capital process used in [48] to design test martingales for confidence sequences of means of bounded random variables. Though the expressions are essentially identical, ours was obtained via a variational representation of the KL divergence between distributions of bounded random variables, while the derivation presented in [48] is based on a betting interpretation of hypothesis testing.
For any Q ∈ Q, let λ op be the optimal choice of λ ∈ [0, 1] given by
$$\lambda_{\mathrm{op}} := \arg\max_{\lambda \in [0,1]} E_{0,Q} \log\left(1 + \lambda\left(\frac{X_1}{m} - 1\right)\right).$$
Unfortunately, it is typically difficult to compute the optimal λ op since it depends on the unknown post-change distribution Q in a complicated way. In this case, we use a sub-exponential lower bound from [5,12], given by
$$L^{\lambda}_n \ge \tilde{L}^{\lambda}_n := \exp\{\lambda s(X_n) - \psi_E(\lambda) v(X_n)\},$$
where ψ E (λ) := − log(1 − λ) − λ for λ ∈ (0, 1). For each λ ∈ (0, 1), the process L̃ λ is itself a valid exponential baseline increment with s(x) := x/m − 1 and v(x) := (x/m − 1) 2 . The lower bound in (72) also implies the lower bound where ψ * E (u) := u − log(1 + u) is the convex conjugate of ψ E , while µ, σ 2 and ∆ op from Section 4 are: Noting that ψ * E (u) ≈ u 2 /2 for small u, we see that for small ∆ op ≪ 1, one has Note that the oracle ∆ op depends on the unknown post-change distribution only via its first and second moments. Therefore, in contrast to the original set of baseline increments {L λ } λ∈(0,1) , the exponential baseline increments {L̃ λ } λ∈(0,1) that lower bound them allow us to more easily set a range (∆ L , ∆ U ) to build mixtures of the e-SR and e-CUSUM procedures. For example, if we assume that the post-change distribution has mean at least m + δ for a positive δ, then we can upper and lower bound ∆ op by Now, given ∆ L and ∆ U , we can use Algorithm 1 to run the mixture of e-SR or e-CUSUM procedures to detect the changepoint based on the exponential baseline increments {L̃ λ } λ∈(0,1) . It is also straightforward to build the corresponding mixtures of e-SR and e-CUSUM procedures for the original baseline increment {L λ } λ∈(0,1) , which is always more sample-efficient.
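The lower-bound relationship between the two families of baseline increments can be verified numerically on a grid. The sketch below assumes that the exponential baseline increment takes the form exp{λ s(x) − ψ E (λ) v(x)} with s and v as defined above; the grid and function names are illustrative.

```python
import math

def psi_E(lam):
    """psi_E(lam) = -log(1 - lam) - lam for lam in (0, 1)."""
    return -math.log(1.0 - lam) - lam

def bounded_increment(x, lam, m):
    """Betting-style increment 1 + lam * (x / m - 1) for x in [0, 1]."""
    return 1.0 + lam * (x / m - 1.0)

def exponential_increment(x, lam, m):
    """Assumed sub-exponential lower bound exp(lam * s - psi_E(lam) * s^2)."""
    s = x / m - 1.0
    return math.exp(lam * s - psi_E(lam) * s * s)

m = 0.494
xs = [i / 200.0 for i in range(201)]            # grid over [0, 1]
lams = [j / 20.0 for j in range(1, 20)]         # grid over (0, 1)
worst_gap = min(
    bounded_increment(x, lam, m) - exponential_increment(x, lam, m)
    for x in xs
    for lam in lams
)
```

A nonnegative `worst_gap` (up to floating-point error) indicates that the exponential increment never exceeds the original one on the grid, with equality at x = 0 and at x = m. This is why replacing L λ by its lower bound can only cost efficiency, never validity.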
Implementation of Algorithm 1 and its results. Recall that in the plus-minus stats running example, we use the pre-change mean m = 0.494 and the minimum gap δ = 0.0125, which bounds ∆ op by ∆ L := 0.024 and ∆ U := m(1 − m)/δ 2 ≈ 1600. As before, we choose α = 10^{-3} to make the ARL larger than 12 regular seasons and set the maximum number of baselines K max = 1000. Based on these parameters, we can build mixtures of e-SR and e-CUSUM procedures. Though the difference between ∆ L and ∆ U may seem large, the actual number of baselines returned by the function computeBaseline in Algorithm 1 is 190, which is small enough to update the procedure efficiently on the fly. Figure 3 shows the e-detectors (left) and their logarithms (right). The horizontal line corresponds to the detection boundary given by 1/α (left) and log α −1 (right). Similar to the winning rate example, the log e-detectors remained stable during the first four regular seasons, although the difference between the SR and CUSUM e-detectors is larger than before. After the 2014-15 season started, both e-detectors increased rapidly. The e-SR procedure detects a changepoint during the 2014-15 season, but e-CUSUM detects the changepoint only in the following season (as expected, since both procedures use the same threshold and the e-SR statistic is never smaller than the e-CUSUM statistic).

Simulation-based comparison with parametric methods
In the Bernoulli example of Section 5.1, we showed that, in the simple i.i.d. Bernoulli setup, our mixtures of e-SR and e-CUSUM procedures match the rate of the worst average delays, O (log(1/α)/KL(q||p 0 )), of the oracle sequential change detection procedure as α → 0. In this subsection, we conduct a simulation study to compare the efficiency of our e-SR procedure with the oracle CUSUM procedure with the exact threshold [21,31], given by where p * is the true post-change distribution parameter (hence the oracle designation) and c * α is the value of the threshold for which the ARL is exactly 1/α. It is well known that the oracle CUSUM procedure (with the appropriate choice of the stopping threshold that controls the ARL exactly) minimizes the worst average delay.
We also compare our method with a version of the GLR procedure based on the stopping time where each p j:n is the MLE of the post-change parameter and the exact threshold c α is tuned to control the ARL exactly at 1/α (this is typically only possible in such simple parametric settings, either by analytic derivations or by simulations). Unlike the oracle CUSUM procedure, the GLR procedure does not have an iterative update rule, as we need to recompute the MLE of the post-change parameter at each time. As a result, its computational cost at time n is O(n), which makes an online implementation very costly. In practice, we may want to use a window-limited GLR procedure to overcome the computational challenge. However, in our study, we deploy the full (non-windowed) GLR procedure to avoid the additional challenge of picking a window size.
Figure caption (fragment): Even though e-SR uses the conservative log(1/α) threshold, its detection delay is excellent, often even better than the oracle CUSUM method (which has optimal average worst-case (across ν) delay).
Simulation details. Throughout this simulation, we draw pre-change observations as iid Bernoulli random variables with p 0 = 0.5, and post-change observations using p 1 = 0.6. For non-oracle methods, we only assume that the post-change parameter is known to be in the interval [0.51, 0.99]. The e-SR and e-CUSUM procedures in Algorithm 1 use this range to set ∆ L := 0.01 and ∆ U := 0.49. For the GLR procedure, the MLE of the post-change parameter is p j:n := min{max{X̄ j:n , 0.51}, 0.99}, where X̄ j:n is the sample average over the last n − j + 1 observations. Our ARL target is equal to 1/α := 500, and each simulation is repeated 5000 times to estimate average delays. The time of the changepoint ν varies in {0, 100, . . ., 500}. For simplicity, each run ends no later than time n = 1000.
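The clipped MLE and the O(n) cost per time step of the GLR statistic can be illustrated as follows. This is a sketch under an assumption: since the display defining the GLR stopping time is not reproduced above, the code uses the standard form, namely the maximum over candidate change locations j of the Bernoulli log-likelihood ratio evaluated at the clipped MLE.

```python
import math

def glr_statistic(xs, p0=0.5, lo=0.51, hi=0.99):
    """Max over suffixes of the log-likelihood ratio at the clipped MLE.

    Scans suffixes from shortest to longest with a running count of ones,
    so one call costs O(n) for n observations.
    """
    best = -math.inf
    ones = 0
    for length, x in enumerate(reversed(xs), start=1):
        ones += x
        p_hat = min(max(ones / length, lo), hi)   # clipped MLE on the suffix
        llr = (ones * math.log(p_hat / p0)
               + (length - ones) * math.log((1.0 - p_hat) / (1.0 - p0)))
        best = max(best, llr)
    return best

all_wins = glr_statistic([1] * 10)
all_losses = glr_statistic([0, 0, 0])
```

Because the MLE changes with every suffix, the whole scan has to be redone at each new observation, in contrast to the constant-time recursions available for CUSUM-type and SR-type statistics with a fixed set of baselines.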
For the oracle CUSUM, GLR, and e-CUSUM procedures (but not e-SR), we use the same simulation setup to find the exact threshold value that controls the ARL at exactly 1/α for each method. For the e-SR procedure, we simply use the universal threshold of log(1/α) to demonstrate its efficiency. In particular, practitioners are not required to resort to expensive simulation to identify a good threshold value. Figure 4 shows the average delays of the oracle CUSUM, GLR, e-CUSUM and e-SR procedures for each changepoint ν ∈ {0, 100, . . ., 500} (each experiment has only one change at ν). The vertical bar on each point represents the 95% confidence interval of the average delay. As the theory guarantees, the oracle CUSUM procedure with an exact threshold results in the smallest worst average delay (91.3 ± 1.7), while surprisingly the GLR procedure with an exact threshold shows the largest worst average delay (123.7 ± 2.7) despite its high computational complexity. The two e-detectors perform reasonably well, and the e-SR detector in particular performs quite favorably overall despite using its conservative log(1/α) threshold.
As the value of the changepoint approaches the ARL target of 500, the average delays tend to decrease sharply for the e-SR procedure, even falling below those of the oracle CUSUM method. This is plausibly because e-SR sums the underlying e-processes, in contrast to the CUSUM-style procedures, which take the maximum of the underlying e-processes. We may also intuitively expect the oracle CUSUM delay to be relatively flat across ν because it is known to be minimax optimal, in the sense of minimizing the worst-case delay, and minimax procedures often have constant "risk profiles". Proving these fine-grained behaviors is beyond the scope of the current paper.
Figure 5 illustrates the "pre-change false alarm rate": the fraction of simulation runs in which the detection procedures stopped before the changepoint at ν (if a change occurs at ν = 0, then it is zero by definition).This is not a common metric, since we provably control ARL at the target level.However, it is an interesting metric, so we plot it.As the time of the true changepoint ν becomes closer to the ARL target of 500, false alarm rates increase across all methods.We notice that the e-SR procedure with the log(1/α) threshold results in the smallest false alarm rates in most cases.

Game-theoretic interpretation of an e-detector
We briefly mention here a game-theoretic interpretation of an e-detector along the lines of the game-theoretic interpretations of martingales and supermartingales as the wealth of a gambler playing a fair game (well known since the time of [42]).We first summarize the game-theoretic interpretation of a P-e-process, as described in [29].
The standard game-theoretic setup of [33] involves three players: a forecaster, a skeptic, and reality (also called nature). The forecaster claims at the beginning that P is a plausible model for the yet-to-be-observed data, meaning that the observations are in accordance with (or generated by) some P ∈ P. The skeptic plays (in parallel) a family of games indexed by P ∈ P against nature and begins with one dollar in each game. The objective of the skeptic in the P -th game is to sequentially test whether P is a good explanation for the data by betting against P . At each time step, the skeptic places fair bets (relative to P , in the P -th game) about the next outcome. Then nature reveals the next outcome, and the skeptic's wealth in every game is updated. The magnitude of the skeptic's wealth in the P -th game is direct evidence against P being a good explanation; the higher the wealth, the less plausible it is that the data came from P . Thus in each game, the skeptic places different bets, but nature's moves (the outcomes) are identical across all games. The skeptic's overall evidence against P is measured by their worst wealth across all the games. If this evidence exceeds 1/α, it means that the skeptic multiplied their initial capital by at least 1/α in every game, and if we reject P when this happens, Ville's inequality implies that we have a valid level-α sequential test.
Since our e-detectors are constructed to be cumulative sums of e-processes started at consecutive times, their game-theoretic interpretation builds on the aforementioned one. Informally, the forecaster not only claims that the data sequence follows P from the start, but also that this will not change after some amount of time. The skeptic now wishes to detect a change, if one occurs, as soon as possible. To accomplish this task, the skeptic is provided with one extra dollar every day, which they invest (using a P-e-process) into testing whether the data from that day onwards are still explained well by P. E-detectors use the wealth in all these games (one against each P ∈ P, starting at each time) as a measure of evidence against the forecaster's claims. The SR e-detector uses the sum (across time) of the minimum wealth (across P at each time), though it could use the amount by which this wealth exceeds n, which is the total dollar amount invested up to time n. The CUSUM e-detector uses the max-min wealth: the maximum (across time) of the minimum wealth (across P ). These are only two ways of constructing e-detectors, and we leave other constructions to future work.

Viewing Lorden's reduction to sequential testing as an e-detector
Lorden [18] proposed a simple method to construct a change detection method with ARL control via a reduction to sequential testing.We describe this below, first defining a sequential test formally.
A sequential test φ is a mapping from increasing amounts of data to a sequence of zeros and ones, where a one represents a rejection of the null hypothesis, and a zero means "continue collecting data".Formally, define the decision at time t as φ t : X t → {0, 1}, and let φ := {φ t } t≥0 be the collection of such decisions made one at a time based on the first t datapoints, with φ 0 = 0 by default.The sequence of tests φ is called a level-α sequential test for P if sup P ∈P P (∃t ≥ 1 : φ t (X 1 , . . ., X t ) = 1) ≤ α, i.e. if the probability of ever falsely rejecting the null is at most α.By convention, if φ t = 1, we set φ s = 1 for s > t.This is equivalent to requiring that, for each P ∈ P, P (φ τ (X 1 , . . ., X τ ) = 1) ≤ α for any stopping time τ.
Let φ (j) denote a sequential test that is started at time j; that is, for φ (j) , the first observation is actually X j (and not X 1 ), but the test itself can depend on the first j − 1 observations (for example, we can choose our betting strategy based on the first j − 1 points, even though our betting score will only be evaluated from time j onwards). Note that φ (1) is simply a standard sequential test as defined above. Ideally, these tests are powerful against the alternative Q.
Lorden's change detection procedure is simple and works as follows. At each time t, start a new sequential test φ (t) , in addition to the ones that are already running. In other words, consider a sequence of level-α sequential tests φ (1) , φ (2) , . . ., starting at consecutive times. Lorden declares a change as soon as any of those sequential tests rejects the null P, that is, at the first time t at which φ (j) t = 1 for some j ≤ t. Lorden proved that this method controls the ARL at 1/α if the data are iid, and if the same test φ (j) is deployed at each j (i.e., apart from the delayed start, the tests are identical). We first observe that Lorden's method is a special case of an e-detector. Indeed, with each level-α sequential test φ ≡ {φ t } t≥0 , we can associate an e-process Λ Lorden t := 1(φ t = 1)/α = φ t /α. Note that Λ Lorden only takes on two values, 0 and 1/α, and when it reaches the latter, it stays there. Furthermore, note that E[Λ Lorden τ ] ≤ 1 for any stopping time τ , which makes it an e-process as claimed.
Last, note that if we form a "Lorden e-detector" using either the SR or CUSUM methods in (10), then both e-detectors start out at 0, and either e-detector jumps to level 1/α if and only if one of the (delayed start) sequential tests rejects the null, and further our e-detector declares a changepoint at exactly the same instant that Lorden's does.Thus, Lorden's procedure can be subsumed within our e-detector framework without any loss of generality or performance.
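The equivalence between Lorden's rule and the "Lorden e-detector" can be illustrated with a toy sketch. The rejection times below are made-up inputs rather than the output of any real test: `reject_times[j - 1]` is assumed to hold the first time at which the test started at time j rejects, or `None` if it never does.

```python
def lorden_alarm_times(reject_times, horizon, alpha):
    """First crossing times of the SR and CUSUM Lorden e-detectors."""
    threshold = 1.0 / alpha
    sr_alarm = None
    cusum_alarm = None
    for t in range(1, horizon + 1):
        # e-process of the test started at j: 1/alpha once it has rejected.
        values = [
            threshold if (rt is not None and rt <= t) else 0.0
            for rt in reject_times[:t]            # only tests started by time t
        ]
        if sr_alarm is None and sum(values) >= threshold:
            sr_alarm = t
        if cusum_alarm is None and max(values) >= threshold:
            cusum_alarm = t
    return sr_alarm, cusum_alarm

# Hypothetical example: the tests started at times 2 and 3 reject at 7 and 5.
reject_times = [None, 7, 5, None, None, None, None, None]
sr_alarm, cusum_alarm = lorden_alarm_times(reject_times, horizon=8, alpha=0.05)
first_rejection = min(rt for rt in reject_times if rt is not None)
```

In this toy run both alarms coincide with the earliest rejection time, matching the observation that each Lorden e-detector sits at 0 until one of the delayed-start tests rejects.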
There are two benefits to viewing Lorden's method as an e-detector.First, we can dispense with both the aforementioned conditions that Lorden required to prove ARL control: the iid assumption, and the condition that the underlying tests φ (j) are identical across j.Indeed, our main result guarantees ARL control for any e-detector no matter what the underlying e j -processes are, or whether the data are iid or not.
Second, this viewpoint allows us to see why e-detectors could have much smaller detection delay than Lorden's method (that is, Lorden's e-detector). In Lorden's e-detector, there is no sharing of evidence across different e j -processes: each sequential test acts alone without help from the others, and we need a single e j -process to reach 1/α before we can declare a change. When a general (say SR-type) e-detector crosses 1/α, it usually does so because of a collaborative effect across various e-processes (caused by the nontrivial sum of e j -processes in the definition of the e-detector), none of which have yet individually reached 1/α. This will happen much sooner than any individual one reaches that threshold, causing an earlier detection than Lorden's method. In fact, every level-α sequential test must in some sense be based on thresholding an e-process at level 1/α [28], and using our e-detector with those underlying e-processes will be more statistically efficient (shorter delay) than directly using Lorden's reduction.
For the sake of future reference, we summarize the above observations below in a "generalized Lorden's lemma", whose proof follows immediately from the properties of an e-detector and the discussion above.
Lemma 6.1 (Generalized Lorden's Lemma). Suppose the data initially come from a distribution in the pre-change class P and, if a change occurs, they later come from a distribution in the post-change class Q (note the lack of any iid assumption). For each j, let φ (j) denote a (one-sided) level-α sequential test for P against Q that is started at time j (but need not be identical or related in any way to any other φ (k) , for k ≠ j). If we declare a change at the first time when any one of these sequential tests rejects the null, the resulting change detection procedure has ARL at most 1/α. Further, this generalization of Lorden's change detector is a special case of an e-detector.

Future directions
There remain a whole host of follow-up directions; we mention only a few below. First, our sequential change detection framework can be straightforwardly generalized to the multi-stream setting where we are monitoring a large number of data streams. In the classical parametric setting, the minimum or the sum of local CUSUM statistics for multi-stream data has been proposed and its asymptotic optimality studied [8,20]. Since either the minimum or the scaled sum (average) of e-detectors also forms a valid e-detector, we can apply the framework in this paper to the multi-stream setting seamlessly. It would be interesting to investigate how the framework can be generalized even further to structural multi-stream settings [50,51,2].
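The claim about minima and averages follows in one line from the defining property of an e-detector. As a sketch, write M^{(1)}, ..., M^{(S)} for e-detectors on S streams (this notation is ours, introduced only for illustration), each satisfying E_{P,∞}[M^{(s)}_τ] ≤ E_{P,∞}[τ] for every stopping time τ and every pre-change distribution P. Then:

```latex
\begin{align*}
E_{P,\infty}\!\left[\frac{1}{S}\sum_{s=1}^{S} M^{(s)}_{\tau}\right]
  &= \frac{1}{S}\sum_{s=1}^{S} E_{P,\infty}\!\left[M^{(s)}_{\tau}\right]
  \le E_{P,\infty}[\tau], \\
E_{P,\infty}\!\left[\min_{1 \le s \le S} M^{(s)}_{\tau}\right]
  &\le E_{P,\infty}\!\left[M^{(1)}_{\tau}\right]
  \le E_{P,\infty}[\tau].
\end{align*}
```

Both combined processes therefore inherit ARL control at 1/α when thresholded at 1/α. The unscaled sum would only satisfy the bound with S times the right-hand side, which is why the average is used.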
Kernel sequential change detection is an important class of sequential change detection methods [3,16]. An interesting open direction is how to instantiate existing kernel-based methods within our general framework, which would make it possible to analyze kernel sequential change detection algorithms in a nonasymptotic way. As it currently stands, neither framework is more general than the other, because the kernel methods often assume iid data before the changepoint, while we abstain from such strong assumptions.
Last, throughout this paper, we have only focused on detecting whether a changepoint happened or not but have not dealt with inferential questions surrounding when the change occurred.Future work could study how to perform such inference with e-detectors in our nonparametric settings, either online or post-hoc.

Summary
We have presented a general framework for sequential change detection based on a new concept called e-detectors. The proposed framework is nonparametric as it does not rely on a parametric assumption on the data-generating distribution (though, when such assumptions are made, we recover well-known parametric methods as special cases). Also, the framework comes with nonasymptotic guarantees, since every component of the framework can be chosen and analyzed explicitly without any asymptotic approximations. By introducing additional structures such as baseline increments and exponential e-detectors on top of the general framework, we can construct computationally and statistically efficient online algorithms that have explicit upper bounds on worst average delays. Finally, through examples involving Bernoulli and bounded random variables, we explained how one can apply the presented framework in practical settings, with NBA data serving as a running case study.

A. Main proofs

A.1. Proofs for statements in Section 2
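Proof of Theorem 2.4 (ARL control). For any given α ∈ (0, 1), let N * be the stopping time defined by
$$N^* := \inf\{n \ge 1 : M_n \ge 1/\alpha\}.$$
For any pre-change distribution P ∈ P, we may assume N * < ∞ with probability one under P , without loss of generality. (If not, we have E P,∞ N * = ∞, which immediately proves the claim.) Then, from the definition of an e-detector, we have
$$E_{P,\infty}\left[M_{N^*}\right] \le E_{P,\infty}\left[N^*\right],$$
which, together with the fact that M N * ≥ 1/α by the definition of N * , implies that, for any P ∈ P,
$$E_{P,\infty}\left[N^*\right] \ge E_{P,\infty}\left[M_{N^*}\right] \ge 1/\alpha,$$
as desired.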
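The ARL guarantee can be sanity-checked by simulation. The sketch below is only an illustration under assumptions of our choosing: a single Bernoulli baseline increment with λ = 0.5, iid Bernoulli(0.5) pre-change data, α = 0.1, a fixed seed, and a cap on the run length to guarantee termination (capping can only shorten runs, so it cannot inflate the estimate).

```python
import math
import random

def sr_stopping_time(rng, lam, p0, alpha, cap=100_000):
    """Run an SR e-detector on iid Bernoulli(p0) data until it reaches 1/alpha."""
    denom = 1.0 - p0 + p0 * math.exp(lam)
    stat = 0.0
    for n in range(1, cap + 1):
        x = 1 if rng.random() < p0 else 0
        increment = math.exp(lam * x) / denom     # baseline increment L_n
        stat = increment * (stat + 1.0)           # SR recursion
        if stat >= 1.0 / alpha:
            return n
    return cap

rng = random.Random(0)                            # fixed seed for repeatability
alpha = 0.1
runs = [sr_stopping_time(rng, lam=0.5, p0=0.5, alpha=alpha) for _ in range(4000)]
estimated_arl = sum(runs) / len(runs)
```

The empirical average run length should sit at or above 1/α = 10. The excess over 1/α reflects the overshoot of the detector past the threshold at the stopping time, which is exactly the slack in the last inequality of the proof.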
Proof of Proposition 2.14 (Bounds on worst average delay). We first prove (17). Note that For each fixed P ∈ P and Q ∈ Q, since N j is independent of F ν for all j > ν + m, we have where the third equality comes from the fact that the distribution of X j−m , X j−m+1 , . . . under P P,ν,Q is equal to that of X j−ν−m , X j−ν−m+1 , . . . under P 0,Q provided that j − m > ν, and the fifth equality is based on the strong stationarity of the post-change observations. Since the last term does not depend on P , ν or F ν , we obtain the claimed result as desired.
To prove (18), first note that The remainder of the proof of (18) follows by the same argument as for (17). Finally, to prove (19), it is enough to show that the following inequality holds: where $N(g) := \inf\{n \ge 1 : \sum_{i=1}^{n} \log L_i \ge g\}$ for each $g > 0$. The proof of the upper bound (81) is based on Lorden's inequality [17], which can be stated as follows.
Fact A.1 (Lorden's inequality [17]). Suppose $X_1, X_2, \ldots$ are i.i.d. samples with $\mathbb{E}X_1 = \mu > 0$ and $\mathbb{E}X_1^2 < \infty$. For each $g > 0$, set $N(g) := \inf\{n : S_n := \sum_{i=1}^{n} X_i \ge g\}$ and $R_g := S_{N(g)} - g$. Then, the following inequality holds: $\sup_{g > 0} \mathbb{E}R_g \le \mathbb{E}X_1^2 / \mu = \mu + \sigma^2/\mu$, where $\sigma^2 = \mathbb{V}X_1$.
Now, to prove the upper bound (81), fix a constant $g > 0$. Since $\mathbb{E}_{0,Q} \log L_1 > 0$, we have $\mathbb{E}_{0,Q} N(g) < \infty$. Therefore, by Wald's equation, For each $g > 0$, set $R_g := \sum_{i=1}^{N(g)} \log L_i - g$. Then, from Lorden's inequality, we have where the first inequality comes from the definition of $N(g)$. Multiplying both sides of the inequality (84) by $1/\mathbb{E}_{0,Q} \log L_1$ yields the claimed upper bound, completing the proof.
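To make the last step concrete: combining Wald's equation, $\mathbb{E}S_{N(g)} = \mu\,\mathbb{E}N(g)$, with the standard form of Lorden's inequality, $\mathbb{E}R_g \le \mathbb{E}X_1^2/\mu$, gives $\mathbb{E}N(g) \le g/\mu + (\mu^2 + \sigma^2)/\mu^2$. The sketch below evaluates this bound and compares it with a simulated first-passage time. The Gaussian increments, the parameter values, and the function names are assumptions chosen for illustration, and the exact form of (81) in the text may differ from the bound computed here.

```python
import random


def lorden_wald_bound(g, mu, sigma2):
    """Upper bound on E[N(g)]: g/mu + E[X^2]/mu^2, where E[X^2] = mu^2 + sigma2."""
    if mu <= 0:
        raise ValueError("the drift mu must be positive")
    return g / mu + (mu * mu + sigma2) / (mu * mu)


def first_passage_time(rng, g, mu, sigma):
    """N(g) = inf{n >= 1 : S_n >= g} for a random walk with N(mu, sigma^2) steps."""
    s, n = 0.0, 0
    while s < g:
        s += rng.gauss(mu, sigma)
        n += 1
    return n


def mean_first_passage(g, mu, sigma, n_runs=4000, seed=0):
    """Monte Carlo estimate of E[N(g)] (illustrative Gaussian increments)."""
    rng = random.Random(seed)
    total = sum(first_passage_time(rng, g, mu, sigma) for _ in range(n_runs))
    return total / n_runs


if __name__ == "__main__":
    g, mu, sigma = 10.0, 0.5, 1.0
    print("simulated E[N(g)]:", mean_first_passage(g, mu, sigma))
    print("Wald + Lorden bound:", lorden_wald_bound(g, mu, sigma ** 2))
```

Since $S_{N(g)} \ge g$, Wald's equation also gives the lower bound $\mathbb{E}N(g) \ge g/\mu$, so the simulated value should fall between $g/\mu$ and the bound.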

A.2. Proofs for statements in Section 3
Proof of Proposition 3.3 (Validity of adaptive e-detectors). To see that the adaptive SR and CUSUM e-detectors are valid e-detectors, first note that $M^{\mathrm{aCU}}_n \le M^{\mathrm{aSR}}_n$ for each $n \in \mathbb{N}$. Therefore, it is enough to show that $\mathbb{E}_{P,\infty} M^{\mathrm{aSR}}_\tau \le \mathbb{E}_{P,\infty} \tau$ for any stopping time $\tau$ and pre-change distribution $P \in \mathcal{P}$. If $\mathbb{P}_{P,\infty}(\tau = \infty) > 0$, then the above inequality holds trivially. Otherwise, if $\mathbb{P}_{P,\infty}(\tau = \infty) = 0$, we have that as desired. Above, the sole inequality follows since each $\Lambda^{(j)}(k)$ is an $e_j$-process, and the subsequent equality invokes the definition of $\gamma_j$ for each $j$.
Proof of Theorem 3.4 (Delay bounds for adaptive e-detectors). We first prove the upper bound for the adaptive e-SR procedure in (36). Note that where the first inequality follows because $\gamma_j \ge 1$ for each $j$. For each fixed $P \in \mathcal{P}$ and $Q \in \mathcal{Q}$, since $N_j \perp\!\!\!\perp \mathcal{F}_\nu$ for all $j > \nu + m$, we have where the second equality comes from the fact that the distribution of $X_{j-m}, X_{j-m+1}, \ldots$ under $\mathbb{P}_{P,\nu,Q}$ is equal to that of $X_{j-\nu-m}, X_{j-\nu-m+1}, \ldots$ under $\mathbb{P}_{0,Q}$ provided that $j - m > \nu$, and the fourth equality is based on the strong stationarity of the post-change observations. Since the last term does not depend on $P$, $\nu$, or $\mathcal{F}_\nu$, we have the claimed result: as desired.
To prove the bound for the adaptive e-CUSUM procedure in (37), first note that where the first inequality comes from $\gamma_j \ge 1$ together with the fact that $K^{-1}(k) \le j$ for any $k \le K(j)$. The remainder of the proof of (37) follows the same argument used to obtain (36).

B. Remaining Proofs
Then, the same argument used in the proof of Proposition 2.14 immediately implies that, if the post-change observations are i.i.d. from $Q$, then, for any $g > 0$, The claim of the theorem follows from Lemma B.1, whose statement and proof are given below.
Lemma B.1. Let $N_{1/\alpha}$ and $N_{c_\alpha}$ be stopping times where the underlying mixing weights $\{\omega_k\}$ and parameters of baseline increments $\{\lambda_k\}$ are chosen via Algorithm 3. Let $N_{g_\alpha}$ be the stopping time defined in (49) with the threshold given by Algorithm 3. Then, for any stream of observations $X_1, X_2, \ldots$, deterministically, $N_{c_\alpha} \le N_{1/\alpha} \le N_{g_\alpha}$, provided that $1 < c_\alpha < 1/\alpha$.
Proof of Lemma B.1 and Algorithm 3. Throughout this proof, we set $D_L := \psi^*(\Delta_L) < \psi^*(\Delta_U) =: D_U$. The first inequality $N_{c_\alpha} \le N_{1/\alpha}$ follows directly from the definition of the stopping time in (41) along with the condition that $c_\alpha \le 1/\alpha$. To prove the second inequality $N_{1/\alpha} \le N_{g_\alpha}$, we exploit a general geometric construction introduced in [36] to analyze the performance of sequential generalized likelihood ratio tests. To that effect, set $\mu_n := S_n / V_n$. Then, for each fixed $\lambda > 0$ such that $\Delta = \nabla\psi(\lambda)$, Proposition 4.
Then, the stopping event of $N_{g_\alpha}$ can be expressed as ∃n ≥ 1 : sup $H(\Delta_U)$ and $H(\Delta_L)$ are half spaces contained in and tangent to $R$ at $(\Delta_U, g_\alpha / V_U)$ and $(\Delta_L, g_\alpha / V_L)$, respectively. See Figure 6 for an illustration of the stopping event of $N_{g_\alpha}$.
Note that the first part of the decomposition, ∃n ≥ 1 : ≥ g α, is nonempty only if $V_U > \min_x v(x) =: v_{\min}$, which is equivalent to $g_\alpha > v_{\min} D_U$. For the second part, a straightforward extension of Lemma 1 in the appendix of [36] implies that, for any fixed $\eta > 1$, the second part can be further decomposed into sets of simple events as follows: where $K(\eta)$ is a positive integer defined by and, for $k = 1, \ldots, K(\eta) - 1$, $\lambda_k$ is given by $\lambda_k := \psi^*(\Delta_k)$, with $\Delta_k$ the solution with respect to $z > 0$ of the equation As a quick remark, note that $\omega_0(\eta)$ does not depend on $\eta$ and each $\omega_k(\eta)$ in fact does not depend on the index $k$, but we use this notation for consistency.
$V^{(j)}_n := \max\{V_0, \min\{V_n, V_0 \eta^{K(j)}\}\}$.
From this decomposition of the stopping event of $N_g^{(j)}$, for any fixed $\eta > 1$, we can lower bound the stopping time $N_g^{(j)}$ as follows: for each $j \ge 1$.
Now, let us first consider the case $\lambda_{\mathrm{op}} \ge \lambda_0$. In this case, we use the following simple upper bound: ≤ E 0,Q inf n ≥ 1 : Since $\mathbb{E}_{0,Q} \log L^{(\lambda_0)}_1 = \sigma^2(\lambda_0 \Delta_{\mathrm{op}} - \psi(\lambda_0)) \ge \sigma^2(\lambda_0 \Delta_0 - \psi(\lambda_0)) = \sigma^2 \psi^*(\Delta_0)$, by the same argument as in Proposition 2.14, the last term above can be further upper bounded by Since we are in the case $\lambda_{\mathrm{op}} \ge \lambda_0$, we have $\psi^*(\Delta_{\mathrm{op}}) / \psi^*(\Delta_0) \ge 1$, which can be understood as a measure of inefficiency due to the misspecified upper bound of the oracle $\lambda_{\mathrm{op}}$.
Now, consider the case $\lambda_{\mathrm{op}} < \lambda_0$, where we correctly specified the upper bound. In this case, let $j_{\mathrm{op}}$ be the smallest integer satisfying $\lambda_{K(j_{\mathrm{op}})} := \lambda_{K_{\mathrm{op}}} < \lambda_{\mathrm{op}} < \lambda_0$. Then, we can further upper bound the worst average delays by By Equation (50), we have the following intermediate upper bound on the worst average delays, Note that if $j_{\mathrm{op}} = 1$, then $\lambda_{K(1)} = \lambda_L < \lambda_{\mathrm{op}}$. Thus, in this case, we also correctly specified the lower bound, and the above bound reduces to the upper bound on the worst average delays in Theorem 4.3 for the well-separated case, except that the ARL parameter $\alpha$ is replaced by $r\alpha$.
Finally, to get an explicit upper bound on $j_{\mathrm{op}}$ for the case $j_{\mathrm{op}} > 1$, first note that, from the definition of $\lambda_{K(j_{\mathrm{op}}-1)}$ together with the fact that $\lambda_{\mathrm{op}} < \lambda^{(j_{\mathrm{op}}-1)}_1 \Leftrightarrow \Delta_{\mathrm{op}} < \Delta_{K(j_{\mathrm{op}}-1)}$, we have $\frac{g\left(V_0 \eta^{K(j_{\mathrm{op}}-1)}\right)}{V_0 \eta^{K(j_{\mathrm{op}}-1)}} = \psi^*\left(\Delta_{K(j_{\mathrm{op}}-1)}\right) > \psi^*(\Delta_{\mathrm{op}})$. (108)
Also, the condition $K(j) \ge K_L + m \log_\eta j$ implies for each $j \ge 1$. By combining the two inequalities above, we have ψ * (∆ op ) In sum, by combining all bounds above, we have

C. An explicit way to compute the threshold in Algorithm 3
The following pseudo-code describes how to compute the threshold $g_\alpha$ defined by $g_\alpha := \inf\{g > \log(1/\alpha) : e^{-g}\,\mathbb{1}(g > v_{\min} D_U) + \min$

Figure 1: Left: Plus-Minus of the Cavaliers from the 2010-11 to 2017-18 seasons. Each horizontal red line corresponds to the seasonal average. Right: The sample path of (the logarithm of) one of our e-detectors. The horizontal red line is the threshold, equal to log(1/α), which ensures that the ARL is at least 1/α = 1000. In this example, the procedure detects the changepoint in the Plus-Minus of the Cavaliers at the end of the 2014-15 season.

Definition 2.6 (SR and CUSUM e-detectors). Based on a sequence of e-processes $\{\Lambda^{(j)}\}_{j \ge 1}$, define SR and CUSUM e-detectors $M^{\mathrm{SR}}$ and $M^{\mathrm{CU}}$, respectively, by $M^{\mathrm{SR}}_0 = M^{\mathrm{CU}}_0 := 0$ and, for each $n \ge 1$,
$[1, n]$ to $[K^{-1}(k), n]$. This choice makes it possible to compute both $M^{\mathrm{aSR}}_n$ and $M^{\mathrm{aCU}}_n$ efficiently, since each $M^{\mathrm{SR}}_n(k)$ and $M^{\mathrm{CU}}_n(k)$ has the following recursive representation:
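The recursive representations mentioned above can be sketched as follows. The sketch assumes (this is not restated in the excerpt) that each e-process is a running product of baseline increments $L_i$, so that $\Lambda^{(j)}_n = \prod_{i=j}^{n} L_i$, and that the SR and CUSUM e-detectors are, respectively, the sum and the maximum of $\Lambda^{(j)}_n$ over start times $j \le n$. The function names are hypothetical. Under these assumptions the detectors satisfy $M^{\mathrm{SR}}_n = L_n (M^{\mathrm{SR}}_{n-1} + 1)$ and $M^{\mathrm{CU}}_n = L_n \max(M^{\mathrm{CU}}_{n-1}, 1)$, which the code checks against the direct definition.

```python
from math import prod


def sr_cusum_recursive(increments):
    """O(1)-per-step updates for the SR and CUSUM e-detectors.

    Assumes product-form e-processes built from the baseline increments L_i.
    """
    m_sr, m_cu = 0.0, 0.0
    path = []
    for l in increments:
        m_sr = l * (m_sr + 1.0)       # sum over all start times j <= n
        m_cu = l * max(m_cu, 1.0)     # maximum over all start times j <= n
        path.append((m_sr, m_cu))
    return path


def sr_cusum_direct(increments):
    """Direct O(n^2) evaluation, recomputing every e-process at each step."""
    path = []
    for n in range(1, len(increments) + 1):
        # e_values[j] is the product of increments from start time j+1 up to n.
        e_values = [prod(increments[j:n]) for j in range(n)]
        path.append((sum(e_values), max(e_values)))
    return path


if __name__ == "__main__":
    increments = [2.0, 0.5, 3.0]
    print(sr_cusum_recursive(increments))
    print(sr_cusum_direct(increments))
```

The recursive form needs only the previous detector value, which is what makes online computation cheap; the direct form is included only as a cross-check.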

1
for each j, and thus M aSR n

Proposition 3.3. For any mixing weights $\{\omega_k\}_{k \in \mathbb{N}}$ and a scheduling function $K$, the adaptive SR and CUSUM e-detectors defined in (29) and (30) form valid e-detectors satisfying the condition (6).
Now, based on $M^{\mathrm{aCU}}_n$ and $M^{\mathrm{aSR}}_n$, the adaptive e-SR and e-CUSUM procedures are defined by the stopping times:

Figure 2: Left: Monthly win rates of the Cavaliers from the 2010-11 to 2017-18 seasons (the raw data are Bernoulli, which is harder to visualize). Each red line corresponds to the seasonal average. Right: Paths of log e-detectors (SR: red; CUSUM: green). The horizontal line is the threshold (common to both procedures) equal to log(1/α), ensuring that the ARL is at least $1/\alpha = 10^3$, larger than the number of games in 12 seasons (82 per season). The e-SR procedure detects a changepoint during the 2014-15 season.

Figure 3: Left: E-detectors (SR: red; CUSUM: green) over eight seasons (82 games per season). Right: Logarithm of e-detectors against date (the sharp rise of e-SR is simply due to the log scale). In both plots, horizontal lines are thresholds equal to 1/α (left) and log(1/α) (right), ensuring that the ARL is at least $1/\alpha = 10^3$. The e-SR procedure detects a change during the 2014-15 season, while e-CUSUM takes longer (as expected).

Figure 4: Average detection delay for each changepoint ν = 0, 100, . . ., 500 (each experiment has exactly one changepoint at ν).Three of the methods use an exact threshold calculated via simulation (only possible in this simple, parametric example).Only the Oracle CUSUM method knows the post-change distribution.Even though e-SR uses the conservative log(1/α) threshold, its detection delay is excellent, often even better than the Oracle CUSUM method (which has optimal average worst-case (across ν) delay).

Figure 5: Pre-change false alarm rates of detection methods for each changepoint ν = 0, 100, . . ., 500 (each experiment has exactly one changepoint at ν). Three of the methods use an exact threshold calculated via simulation (only possible in this simple, parametric example). Only the Oracle CUSUM method knows the post-change distribution exactly. e-SR has the smallest false alert ratio (defined in the text).

Figure 6: Illustration of the stopping event of $N_{g_\alpha}$ defined in (49), and related regions $H(\Delta_U)$, $H(\Delta_L)$ and $R$. The stopping time $N_{g_\alpha}$ is the first time when $(\mu_n, g_\alpha / V_n)$ is located in one of the colored areas.

Figure 7: Illustration of the stopping event of $N_g^{(j)}$ defined in (55), and related regions $H(\Delta_0)$, $H(\Delta_{K(j)})$ and $R$. The stopping time $N_g^{(j)}$ is the first time when $(\mu_n, g(V^{(j)}_n) / V_n)$ is located in one of the colored areas.

else
    Compute $g_\alpha := \inf\left\{g \in \left[v_{\min} D_U, \frac{D_U}{D_L} \log(2/\alpha)\right] : e^{-g} + f(g) \le \alpha\right\}$ by applying the bisection method to the function $g \mapsto e^{-g} + f(g) - \alpha$ with endpoints $v_{\min} D_U$ and $\frac{D_U}{D_L} \log(2/\alpha)$.
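The bisection step can be sketched in a few lines. The function $f$ is not specified in this excerpt, so the sketch takes it as an argument. It additionally assumes that $g \mapsto e^{-g} + f(g)$ is continuous and nonincreasing on the bracket, and that the constraint holds at the right endpoint. The function name `bisect_threshold`, the tolerance, and the toy choice $f(g) = e^{-g}$ in the usage example are illustrative assumptions only.

```python
import math


def bisect_threshold(f, alpha, lo, hi, tol=1e-10, max_iter=200):
    """Approximate the smallest g in [lo, hi] with exp(-g) + f(g) <= alpha.

    Assumes g -> exp(-g) + f(g) is continuous and nonincreasing on [lo, hi].
    """
    def excess(g):
        return math.exp(-g) + f(g) - alpha

    if excess(lo) <= 0:
        return lo  # the constraint already holds at the left endpoint
    if excess(hi) > 0:
        raise ValueError("constraint is not satisfied at the right endpoint")
    for _ in range(max_iter):
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if excess(mid) <= 0:
            hi = mid  # mid is feasible, so the infimum lies in [lo, mid]
        else:
            lo = mid  # mid is infeasible, so the infimum lies in (mid, hi]
    return hi  # hi is always feasible, so the returned value is conservative


if __name__ == "__main__":
    alpha = 1e-3
    # Toy f for illustration: the constraint becomes 2 * exp(-g) <= alpha.
    g_alpha = bisect_threshold(lambda g: math.exp(-g), alpha, lo=0.0, hi=20.0)
    print(g_alpha, math.log(2 / alpha))
```

Returning the feasible endpoint `hi` rather than the midpoint keeps the computed threshold on the conservative side, so the constraint $e^{-g} + f(g) \le \alpha$ holds at the returned value.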