1 Introduction
The study of the efficiency of nonparametric tests that started in the late 1940s is often regarded as a success story in statistics. Some nonparametric tests, such as Wilcoxon’s signed-rank and rank-sum tests, are highly efficient even when used in the framework of popular parametric models, such as the Gaussian model. Theoretical results mostly concern asymptotic efficiency of those tests, but there is also empirical evidence for their finite-sample efficiency. While some nonparametric tests (such as Wilcoxon’s) became very popular after their high efficiency had been discovered, others (such as Wald and Wolfowitz’s run test) were gradually discarded from the statistical literature after their low efficiency had been demonstrated [16, Introduction].
The usual approach to hypothesis testing is based on critical regions or p-values, but in this paper we replace them with their alternative, e-values (see, e.g., [22, 20, 7]). We show that some of the old results about the efficiency of nonparametric tests carry over to hypothesis testing based on e-values. To distinguish our notions of power, tests, etc., from the standard notions, we add the prefix “e-”. (The prefix “p-” is sometimes added to signify standard notions based on p-values, but in this paper we rarely need it since the key notion that we are interested in, Pitman’s asymptotic relative efficiency, is defined in terms of critical regions rather than p-values.)
We explain the basics of e-testing in Sect. 2 and, in particular, state an analogue of the Neyman–Pearson lemma for e-testing. In the following section, Sect. 3, we give a simple example of a parametric e-test, one for testing the null hypothesis $N(0,1)$ against an alternative $N(\theta ,1)$ in an IID situation.
In Sect. 4 we give the first, and in some sense most powerful, of the three examples of nonparametric e-tests that we discuss in this paper. It was introduced by Fisher in his 1935 book [5]. Our nonparametric null hypothesis is that of symmetry around 0 (and for simplicity we consider independent observations coming from a continuous distribution).
The material of Sects. 2–4 is standard. After that (Sect. 5) we define the asymptotic relative efficiency of e-tests in the spirit of Pitman’s definition [17]. We regard our definition of asymptotic relative efficiency as a direct translation of the classical definition. Then in Sect. 6 we compute the Pitman-type asymptotic relative efficiency of the Fisher-type test discussed in Sect. 4. This is complemented by similar computations for e-versions of the sign test in Sect. 7 and Wilcoxon’s signed-rank test in Sect. 8. Our results for all three tests agree perfectly with the classical results. This is just a first step, and in Sect. 9 we discuss limitations of our approach (which are considerable) and list natural directions of further research.
2 General Principles of E-testing
Let P be a given probability measure on a sample space Ω (a measurable space). Our null hypothesis is $\{P\}$; it is simple in the sense of containing a single probability measure.
We observe $\omega \in \Omega $ and are interested in whether ω was generated from P. An e-variable for testing P is a $[0,\infty ]$-valued random variable E such that $\textstyle\int E\hspace{0.1667em}\mathrm{d}P\le 1$. In order to be used for testing, we need to choose E before we observe ω. By Markov’s inequality, E can be large only with a small probability (for any threshold $c\gt 1$, $P(E\ge c)\le 1/c$); therefore, observing a large E casts doubt on ω being generated from P.
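To make the definition concrete, here is a minimal numerical sketch (not part of the paper’s development) in which the e-variable is the likelihood ratio for a hypothetical alternative $N(1,1)$; it checks by simulation that the mean of E under P is (at most) 1 and that Markov’s inequality holds.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.0                                        # hypothetical alternative N(theta, 1)
omega = rng.normal(0.0, 1.0, size=1_000_000)       # observations generated from the null P = N(0, 1)

# The likelihood ratio dN(theta,1)/dN(0,1) evaluated at the observations is an e-variable
E = np.exp(theta * omega - theta**2 / 2)

c = 20.0
print("mean of E under P:", E.mean())                       # close to 1
print("P(E >= c) =", (E >= c).mean(), "  1/c =", 1 / c)     # Markov's inequality
```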
In the classical Neyman–Pearson approach to hypothesis testing, in addition to P we also have an alternative hypothesis Q. The e-power of an e-variable E is then defined as $\textstyle\int \log E\hspace{0.1667em}\mathrm{d}Q$. This is an analogue of the usual notion of power, but it only works in regular cases. One such regular case will be discussed in the next section. The following lemma is very well known (see, e.g., [20, Sect. 2.2.1] and the references therein), and we provide a simple proof.
Lemma 1.
For given null and alternative hypotheses P and Q, respectively, such that $Q\ll P$, the largest e-power is attained by the likelihood ratio $\mathrm{d}Q/\mathrm{d}P$: for any e-variable E,
(2.1)
\[ \int \log E\hspace{0.1667em}\mathrm{d}Q\le \int \log \frac{\mathrm{d}Q}{\mathrm{d}P}\hspace{0.1667em}\mathrm{d}Q.\]
And if $Q\ll P$ is violated, the largest e-power is ∞.
The likelihood ratio $\mathrm{d}Q/\mathrm{d}P$ in Lemma 1 is understood to be the Radon–Nikodym derivative of Q w.r. to P.
Proof of Lemma 1.
If $Q\ll P$ is violated, there is an event $A\subseteq \Omega $ such that $P(A)=0$ and $Q(A)\gt 0$. Then the e-power of the e-variable
\[ E:=\infty \cdot {1_{A}}\]
is ∞.
It remains to consider the case $Q\ll P$. In this case, let q be a probability density function of Q w.r. to P. In terms of q, we can rewrite (2.1) as
\[ \int q\log E\hspace{0.1667em}\mathrm{d}P\le \int q\log q\hspace{0.1667em}\mathrm{d}P,\hspace{1em}\text{i.e.,}\hspace{1em}\int q\log \frac{E}{q}\hspace{0.1667em}\mathrm{d}P\le 0.\]
The last inequality follows from $\log x\le x-1$: indeed, $q\log (E/q)\le q(E/q-1)=E-q$, and $\textstyle\int (E-q)\hspace{0.1667em}\mathrm{d}P\le 1-1=0$.  □
According to Lemma 1, which is an analogue for e-values of the Neyman–Pearson lemma, the optimal e-variable for testing a null hypothesis P against an alternative $Q\ll P$ is the likelihood ratio $\mathrm{d}Q/\mathrm{d}P$. The maximum e-power is
\[ \operatorname{KL}(Q\| P):=\int \log \frac{\mathrm{d}Q}{\mathrm{d}P}\hspace{0.1667em}\mathrm{d}Q\]
(cf. [20, Sect. 2.3] and [7, Theorem 1]). This is simply the Kullback–Leibler divergence [12] of the alternative Q from the null hypothesis P; we will call it the optimal e-power.
We will sometimes refer to $\log E$ as the observed e-power of E; the e-power is then the expectation of the observed e-power w.r. to the alternative hypothesis Q.
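As a small numerical illustration of Lemma 1 (a sketch with a hypothetical choice $P=N(0,1)$, $Q=N(0.7,1)$, not taken from the paper), the Monte Carlo estimate of the e-power of the likelihood ratio is close to $\operatorname{KL}(Q\|P)={\theta ^{2}}/2$, while an e-variable tuned to a wrong alternative has a smaller e-power.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.7
z = rng.normal(theta, 1.0, size=1_000_000)      # observations from the alternative Q = N(theta, 1)

# Observed e-power of the optimal e-variable dQ/dP
log_lr = theta * z - theta**2 / 2
print("e-power of dQ/dP:", log_lr.mean(), "  KL(Q||P) = theta^2/2 =", theta**2 / 2)

# A suboptimal e-variable: the likelihood ratio tuned to a wrong alternative N(theta/2, 1)
wrong = theta / 2
log_e_wrong = wrong * z - wrong**2 / 2
print("e-power of the mistuned e-variable:", log_e_wrong.mean())  # smaller, as Lemma 1 predicts
```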
The notion of e-power is very close to Shafer’s [20] implied target, the main difference being that the implied target only depends on the null hypothesis P and the e-variable E.
As a short detour, let us check that our notion of e-power enjoys a natural property in testing with multiple e-values. Denote by ${\Pi ^{Q}}$ the function
(2.2)
\[ {\Pi ^{Q}}(E):=\int \log E\hspace{0.1667em}\mathrm{d}Q\]
that maps an e-variable to its e-power. Independent e-variables ${E_{1}},\dots ,{E_{K}}$ can be combined into one e-variable using a merging function, the most common choices being convex mixtures of the product functions
\[ {F_{M}}({e_{1}},\dots ,{e_{K}}):={\prod \limits_{k\in M}}{e_{k}},\]
where M is a subset of $\{1,\dots ,K\}$, with ${F_{\varnothing }}$ set to 1. Denote by $\mathcal{M}$ the convex hull of all functions ${F_{M}}$. Useful elements of the class $\mathcal{M}$ are U-statistics with the product as kernel, the symmetric merging functions discussed in [22, Sect. 4].
Proposition 1.
Let $\mathbf{E}=({E_{1}},\dots ,{E_{K}})$ be a vector of independent e-variables.
(i) For all $F\in \mathcal{M}$, $F(\mathbf{E})$ is an e-variable.
(ii) If ${\Pi ^{Q}}({E_{k}})\gt 0$ for each $k=1,\dots ,K$, then ${\Pi ^{Q}}(F(\mathbf{E}))\gt 0$ for all $F\in \mathcal{M}\setminus \{{F_{\varnothing }}\}$.
(iii) If ${\Pi ^{Q}}({E_{k}})\ge 0$ for each $k=1,\dots ,K$, then ${\Pi ^{Q}}(F(\mathbf{E}))\ge 0$ for all $F\in \mathcal{M}$.
Proof.
Part (i) follows from the fact that the product of independent e-variables is an e-variable, and a convex mixture of e-variables is an e-variable. Next we prove (ii). For all M other than $M=\varnothing $, we have
\[ {\Pi ^{Q}}\big({F_{M}}(\mathbf{E})\big)={\sum \limits_{k\in M}}{\Pi ^{Q}}({E_{k}})\gt 0,\]
and ${\Pi ^{Q}}({F_{\varnothing }}(\mathbf{E}))=0$. Note that the mapping (2.2) is concave on the set of nonnegative random variables. Since $F(\mathbf{E})$ is a convex mixture of ${F_{M}}(\mathbf{E})$ for $M\subseteq \{1,\dots ,K\}$, we get ${\Pi ^{Q}}(F(\mathbf{E}))\ge 0$, and the inequality is strict unless $F={F_{\varnothing }}$. This proves (ii). The case (iii) is similar to (ii). □
Proposition 1 shows that e-power remains positive when combining independent e-values with positive e-power using a large class of merging functions. As a special case of Proposition 1 applied to only one e-variable, if ${\Pi ^{Q}}(E)\gt 0$, then ${\Pi ^{Q}}(1-\lambda +\lambda E)\gt 0$ for all $\lambda \in (0,1]$. The operation of changing E to $1-\lambda +\lambda E$ is common in building e-processes; see, e.g., [24].
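The following Monte Carlo sketch (with hypothetical Gaussian e-variables; not part of the paper) illustrates Proposition 1: the product ${F_{\{1,2\}}}$, a convex mixture from the class $\mathcal{M}$, and the e-variable $1-\lambda +\lambda E$ all have mean at most 1 under the null and positive e-power under the alternative.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n = 0.5, 1_000_000

def e_var(z):
    # Likelihood ratio dN(theta,1)/dN(0,1): an e-variable for the null N(0,1)
    return np.exp(theta * z - theta**2 / 2)

def combos(z1, z2, lam=0.3):
    E1, E2 = e_var(z1), e_var(z2)
    return {"product F_{1,2}": E1 * E2,
            "mixture 0.5*F_{1,2} + 0.5*F_{1}": 0.5 * E1 * E2 + 0.5 * E1,
            "1 - lam + lam*E_1": 1 - lam + lam * E1}

z_null = rng.normal(0.0, 1.0, size=(2, n))      # independent observations under the null P
z_alt = rng.normal(theta, 1.0, size=(2, n))     # independent observations under the alternative Q
for name, E in combos(*z_null).items():
    print("mean under P of", name, ":", E.mean())        # at most 1 (Proposition 1(i))
for name, E in combos(*z_alt).items():
    print("e-power of", name, ":", np.log(E).mean())     # positive (Proposition 1(ii))
```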
3 A Parametric E-test
We start our discussion of specific e-tests from a very simple parametric case, that of the Gaussian statistical model ${Q_{\theta }}:=N(\theta ,1)$, $\theta \in \mathbb{R}$, with the variance known to be 1. We observe realizations of independent ${Z_{1}},\dots ,{Z_{n}}\sim N(\theta ,1)$. The null hypothesis P is $N(0,1)$, and we are interested in the alternatives $Q={Q_{\theta }}=N(\theta ,1)$ for $\theta \ne 0$.
For observations ${z_{1}},\dots ,{z_{n}}$ and a given alternative $N(\theta ,1)$, the likelihood ratio of the alternative to the null hypothesis is
(3.1)
\[\begin{aligned}{}{E_{\theta }}({z_{1}},\dots ,{z_{n}})& :=\frac{\exp (-\frac{1}{2}{\textstyle\textstyle\sum _{i=1}^{n}}{({z_{i}}-\theta )^{2}})}{\exp (-\frac{1}{2}{\textstyle\textstyle\sum _{i=1}^{n}}{z_{i}^{2}})}\\ {} & =\exp \Bigg(\theta {\sum \limits_{i=1}^{n}}{z_{i}}-\frac{1}{2}n{\theta ^{2}}\Bigg).\end{aligned}\]
The corresponding optimal e-power is
(3.2)
\[ \operatorname{KL}\big({Q_{\theta }^{n}}\big\| {P^{n}}\big)=\frac{1}{2}n{\theta ^{2}}.\]
The interpretation of the optimal e-power (3.2) usually depends on the law of large numbers and its refinements (such as the central limit theorem and large deviation inequalities). The presence of log in the definition $\textstyle\int \log E\hspace{0.1667em}\mathrm{d}Q$ of the e-power of E under the alternative Q reflects the fact that a typical e-value is obtained by multiplying components coming from the individual observations ${z_{i}}$. This can be seen from (3.1) (and also expressions (4.4), (7.2), and (8.3) below, which are typical). Taking the logarithm leads to a much more regular distribution, which is, e.g., approximately Gaussian under standard regularity conditions. In the case of (3.1), the key component of the logarithm is ${\textstyle\sum _{i=1}^{n}}{z_{i}}$, and we can apply, e.g., the central limit theorem to see that the observed e-power is between the narrow limits $\frac{1}{2}n{\theta ^{2}}\pm c\sqrt{n}\theta $ with probability close (in this particular case, even exactly equal) to $\Phi (c)-\Phi (-c)$, where $c\gt 0$ and Φ is the standard Gaussian cumulative distribution function.
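A quick simulation sketch (not from the paper; the parameter values are arbitrary) of the last claim: under the alternative, the observed e-power of (3.1) has mean $n{\theta ^{2}}/2$ and standard deviation $\sqrt{n}\theta $.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 0.2, 500, 10_000
z = rng.normal(theta, 1.0, size=(reps, n))                     # repeated samples from N(theta, 1)

observed_e_power = theta * z.sum(axis=1) - n * theta**2 / 2    # log of (3.1)
print("mean of log E_theta:", observed_e_power.mean(), "  n*theta^2/2 =", n * theta**2 / 2)
print("std  of log E_theta:", observed_e_power.std(), "  sqrt(n)*theta =", np.sqrt(n) * theta)
```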
Remark 1.
To get the full idea of the power of E under Q, we need the whole distribution of the observed e-power $\log E$ under Q, and replacing it by its expectation is a crude step. (The next step might be, e.g., complementing the expectation with the standard deviation of $\log E$ under Q.) We leave such more realistic notions of power for future research.
We regard the family (3.1) of e-variables as a test (an e-test) of the null hypothesis $N(0,1)$. While for several important statistical models there are uniformly most powerful p-tests (see, e.g., [14, Chap. 3]), this is not the case for e-tests, and the e-tests considered in this paper are always families of e-variables.
The fact that the e-variable (3.1) depends on the unknown alternative parameter θ is a disadvantage. A natural way out is to integrate it under the prior distribution $N(0,1)$ over θ, which gives us the e-variable
(3.3)
\[\begin{aligned}{}& \frac{1}{\sqrt{2\pi }}\int \exp \Bigg(\theta {\sum \limits_{i=1}^{n}}{z_{i}}-\frac{1}{2}n{\theta ^{2}}-\frac{1}{2}{\theta ^{2}}\Bigg)\mathrm{d}\theta \\ {} & \hspace{1em}=\sqrt{\frac{1}{n+1}}\exp \Bigg(\frac{1}{2n+2}{\Bigg({\sum \limits_{i=1}^{n}}{z_{i}}\Bigg)^{2}}\Bigg)\end{aligned}\]
(cf. Remark 2 below). Notice that the operation of integration makes the e-variable “two-sided”: while (3.1) is monotone in ${\textstyle\sum _{i}}{z_{i}}$, (3.3) is monotone in $|{\textstyle\sum _{i}}{z_{i}}|$. The remaining disadvantage of the e-variable (3.3) is that it is valid only under the simple Gaussian null hypothesis $N(0,1)$. In the following sections we will replace this simple null hypothesis with a composite nonparametric one.
Remark 2.
In our computations in this paper we often use the formula
\[ \int \exp \big(-A{x^{2}}+Bx\big)\mathrm{d}x=\sqrt{\frac{\pi }{A}}\exp \bigg(\frac{{B^{2}}}{4A}\bigg),\]
where $A\gt 0$ and $B\in \mathbb{R}$. Equations (3.1) and (3.3) are simple calculations, and they appear in the context of mixture martingales, which date back to, at least, the work of Robbins (e.g., [19]); see also the more recent [10] and the references therein.
4 Fisher-type Nonparametric E-test of Symmetry
Let ${Z_{1}},\dots ,{Z_{n}}$ be continuous IID random variables. We are interested in the null hypothesis that their distribution is symmetric around 0. This is an example of a nonparametric hypothesis, since the distribution of ${Z_{1}},\dots ,{Z_{n}}$ is not described in a natural way by finitely many real-valued parameters. Intuitively, we are interested in two alternatives: the one-sided alternative that ${Z_{i}}$, even though IID, are not symmetric but shifted to the right; and the two-sided alternative that ${Z_{i}}$ are shifted to the right or to the left.
A typical case in applications is where ${Z_{i}}:={Y_{i}}-{X_{i}}$, ${X_{i}}$ is a pre-treatment measurement, ${Y_{i}}$ is a post-treatment measurement, and we are interested in whether the treatment has any effect. Assuming that raising the measurement is desirable, the one-sided alternative is that the treatment is beneficial.
We will formalize our null hypothesis in a way similar to repetitive and one-off structures [23, Sects. 11.2.4 and 11.2.5]. However, we will not need general definitions and will adapt them to our special case.
The symmetry model for a sample size n is the pair $(t,b)$, where $t:{\mathbb{R}^{n}}\to \Sigma $ is the mapping
\[ t({z_{1}},\dots ,{z_{n}}):=(|{z_{1}}|,\dots ,|{z_{n}}|)\]
from the sample space ${\mathbb{R}^{n}}$ to the summary space $\Sigma :={[0,\infty )^{n}}$, and b is the Markov kernel that maps each summary ${({z_{1}},\dots ,{z_{n}})\in [0,\infty )^{n}}$ to the uniform probability measure on the set
(4.1)
\[\begin{aligned}{}& {t^{-1}}({z_{1}},\dots ,{z_{n}})\\ {} & \hspace{1em}=\big\{({j_{1}}{z_{1}},\dots ,{j_{n}}{z_{n}})\mid ({j_{1}},\dots ,{j_{n}})\in {\{-1,1\}^{n}}\big\}.\end{aligned}\]
An e-variable for testing the null hypothesis of symmetry is a function $E:{\mathbb{R}^{n}}\to [0,\infty ]$ such that $\textstyle\int E\hspace{0.1667em}\mathrm{d}b(t({z_{1}},\dots ,{z_{n}}))\le 1$ for all ${z_{1}},\dots ,{z_{n}}$. It is admissible if this inequality holds as an equality for all ${z_{1}},\dots ,{z_{n}}$; in other words, if it ceases to be an e-variable (w.r. to the symmetry model) as soon as its value is increased at any point.
Remark 3.
The definition of admissibility that we give is adapted to our current context; see [18, Sect. 9] for a more general discussion.
In this section we define the first of our three e-tests for testing symmetry. We are interested in the e-variables of the form
(4.2)
\[ {E_{\lambda }}({z_{1}},\dots ,{z_{n}}):=\exp \big(\lambda S({z_{1}},\dots ,{z_{n}})-C\big),\]
where $S({z_{1}},\dots ,{z_{n}}):={\textstyle\sum _{i=1}^{n}}{z_{i}}$, $\lambda \gt 0$ is a parameter, and C is chosen to make E an admissible e-variable, i.e.,
\[ C=C\big(\lambda ,t({z_{1}},\dots ,{z_{n}})\big):=\log \int \exp (\lambda S)\mathrm{d}b\big(t({z_{1}},\dots ,{z_{n}})\big)\]
(in other words, $C:=\log \mathbb{E}\exp (\lambda S)$, the expectation being under the null hypothesis, i.e., under the symmetry model). Lemma 2 will give a convenient formula for computing C.
The form (4.2) for our e-variables can be justified by the analogy with the e-variable (3.1) that we obtained in the Gaussian case. The expression for the normalizing constant C will, however, be different and will be derived momentarily.
The justification of the symmetry model from the point of view of standard statistical modelling is that, under the null hypothesis of symmetry, t is a sufficient statistic giving rise to b as conditional distribution.
For simplicity, we will assume that ${z_{1}},\dots ,{z_{n}}$ are all different (under our assumption that the random variables ${Z_{1}},\dots ,{Z_{n}}$ are continuous, the realizations will be all different almost surely).
Lemma 2.
The constant C in (4.2) is given by
(4.3)
\[ C={\sum \limits_{i=1}^{n}}\log \frac{{e^{\lambda {z_{i}}}}+{e^{-\lambda {z_{i}}}}}{2}.\]
Proof.
We find
\[\begin{aligned}{}{e^{C}}& ={2^{-n}}{\sum \limits_{{j_{1}}=0}^{1}}\dots {\sum \limits_{{j_{n}}=0}^{1}}{e^{\lambda {j_{1}}{z_{1}}+\cdots +\lambda {j_{n}}{z_{n}}}}\\ {} & ={2^{-n}}{\prod \limits_{i=1}^{n}}\big({e^{\lambda {z_{i}}}}+{e^{-\lambda {z_{i}}}}\big).\end{aligned}\]
(Alternatively, we can see straight away that the average of (4.4) below w.r. to $b(t({z_{1}},\dots ,{z_{n}}))$ is 1.)  □
Plugging (4.3) into (4.2) gives the e-variable
(4.4)
\[ {E_{\lambda }}({z_{1}},\dots ,{z_{n}})={e^{-C}}{\prod \limits_{i=1}^{n}}{e^{\lambda {z_{i}}}}={\prod \limits_{i=1}^{n}}\frac{{e^{\lambda {z_{i}}}}}{\frac{1}{2}({e^{\lambda {z_{i}}}}+{e^{-\lambda {z_{i}}}})}.\]
This is an e-version of Fisher’s permutation test, which he introduced and applied to Charles Darwin’s data [3, Chap. 1] in his 1935 book [5, Sects. 21 and 21.1] on experimental design. Again, since there is no uniformly most powerful e-test, we consider a family of e-variables. The e-variable (4.4) is, of course, admissible.
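A small sketch (with made-up observations) of how (4.4) can be computed and of its admissibility: averaging ${E_{\lambda }}$ over all ${2^{n}}$ points of the set (4.1), i.e., over the uniform measure $b(t({z_{1}},\dots ,{z_{n}}))$, gives exactly 1.

```python
import itertools
import numpy as np

def fisher_e(z, lam):
    """The e-variable (4.4): the product of e^{lam*z_i} / cosh(lam*z_i)."""
    z = np.asarray(z, dtype=float)
    return np.exp(np.sum(lam * z - np.log(np.cosh(lam * z))))

z = [0.8, -0.3, 1.2, 0.5, -0.1, 0.9]     # hypothetical observations
lam = 0.7
print("E_lambda(z):", fisher_e(z, lam))

# Average over the uniform probability measure on the set (4.1): equals 1 exactly
m = np.abs(z)
flips = [np.array(s) * m for s in itertools.product([-1, 1], repeat=len(m))]
print("average over all sign flips:", np.mean([fisher_e(x, lam) for x in flips]))
```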
Figure 1
The inequality (4.6) on the log scale.
The e-variable (4.4) dominates
(4.5)
\[ {E^{\prime }_{\lambda }}({z_{1}},\dots ,{z_{n}}):={\prod \limits_{i=1}^{n}}{e^{\lambda {z_{i}}-{\lambda ^{2}}{z_{i}^{2}}/2}}\]
in the sense ${E^{\prime }}\le E$. Therefore, ${E^{\prime }}$ is also an e-variable, albeit inadmissible in general. To check the inequality ${E^{\prime }}\le E$, it suffices to check that
(4.6)
\[ \frac{{e^{x}}+{e^{-x}}}{2}\le {e^{{x^{2}}/2}}.\]
Expanding both sides into Taylor’s series shows that this inequality indeed holds for all x. The inequality is not excessively loose, especially for small values of x (which will be the case that we will be interested in when computing the Pitman efficiencies): cf. Figure 1.
Remark 4.
The fact that (4.5) is an e-variable was established by de la Peña [4, Lemma 6.1]. Ramdas et al. [18, Sect. 10] point out that it is inadmissible, and they define several natural admissible alternatives to (4.4). Investigating the asymptotic relative efficiency of those admissible alternatives is an interesting direction of further research.
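A numerical sketch of the domination (assuming nothing beyond the formulas above): the pointwise inequality (4.6) is checked on a grid, and consequently (4.5) never exceeds (4.4) for arbitrary simulated data.

```python
import numpy as np

# Inequality (4.6): cosh(x) <= exp(x^2 / 2) for all x
x = np.linspace(-5, 5, 2001)
print("max of cosh(x) - exp(x^2/2):", np.max(np.cosh(x) - np.exp(x**2 / 2)))   # <= 0

rng = np.random.default_rng(4)
z = rng.normal(0.3, 1.0, size=20)     # arbitrary data
lam = 0.5
E = np.exp(np.sum(lam * z - np.log(np.cosh(lam * z))))       # (4.4)
E_prime = np.exp(np.sum(lam * z - lam**2 * z**2 / 2))        # (4.5)
print("E' <= E:", E_prime <= E, " (E' =", E_prime, ", E =", E, ")")
```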
In order to get rid of the dependence of (4.4) or (4.5) on λ, we can integrate these expressions over a prior distribution on λ. This can be easily done explicitly (see Remark 2) in the case of (4.5) and the prior distribution $N(0,1)$ on λ:
(4.7)
\[\begin{aligned}{}& \frac{1}{\sqrt{2\pi }}\int {\prod \limits_{i=1}^{n}}{e^{\lambda {z_{i}}-{\lambda ^{2}}{z_{i}^{2}}/2-{\lambda ^{2}}/2}}\hspace{0.1667em}\mathrm{d}\lambda \\ {} & \hspace{1em}=\sqrt{\frac{1}{1+{\textstyle\textstyle\sum _{i=1}^{n}}{z_{i}^{2}}}}\exp \bigg(\frac{{({\textstyle\textstyle\sum _{i=1}^{n}}{z_{i}})^{2}}}{2+2{\textstyle\textstyle\sum _{i=1}^{n}}{z_{i}^{2}}}\bigg).\end{aligned}\]
The right-hand side of (4.7) is close to the right-hand side of (3.3) under $N(0,1)$ as the null hypothesis: this follows from ${\textstyle\sum _{i=1}^{n}}{z_{i}^{2}}\approx n$ (for large n and with high probability). However (as noticed in [4]), this relatively small change drastically changes the validity properties of the e-test: while the right-hand side of (3.3) is an e-variable only for testing $N(0,1)$, the right-hand side of (4.7) is an e-variable for testing the nonparametric hypothesis of symmetry.
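The closeness of (4.7) to (3.3) can be checked numerically; the following sketch (with arbitrary parameter values) computes both closed-form expressions for a sample from $N(\theta ,1)$.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n = 0.3, 200
z = rng.normal(theta, 1.0, size=n)

s, s2 = z.sum(), np.sum(z**2)
e_param = np.sqrt(1 / (n + 1)) * np.exp(s**2 / (2 * n + 2))       # (3.3), valid under N(0,1) only
e_nonpar = np.sqrt(1 / (1 + s2)) * np.exp(s**2 / (2 + 2 * s2))    # (4.7), valid under symmetry
print("parametric e-value (3.3):   ", e_param)
print("nonparametric e-value (4.7):", e_nonpar)
```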
Results for Charles Darwin’s Data
In this subsection we will compute Fisher-type nonparametric e-values for data used by Darwin [3, Chap. 1] to test whether cross-fertilization of plants was advantageous to the progeny as compared with self-fertilization. This was an important question from the evolutionary point of view, and Darwin’s preliminary work had convinced him that cross-fertilization was indeed advantageous; in particular, nature went to great lengths to prevent self-fertilization [2].
Table 1
Differences in eighths of an inch between cross- and self-fertilised plants of the same pair (Table 3 in [5, Sect. 17]).
49 | 23 | 56 |
$-67$ | 28 | 24 |
8 | 41 | 75 |
16 | 14 | 60 |
6 | 29 | $-48$ |
Table 1 reports results for a small subset of Darwin’s data, those for maize. This subset was analyzed for Darwin by Francis Galton (as Darwin describes in detail in [3, Chap. 1]) and was reanalyzed by Fisher in [5, Chap. 3]. Fisher offered both parametric analysis (assuming the Gaussian distribution) and novel nonparametric analysis, and his finding was that Student’s t-test and Fisher’s nonparametric test produce remarkably similar results.
Table 1 lists the differences in height between 15 pairs of matched plants, with a cross- and self-fertilized plant in each pair (meaning a plant grown from a cross- or self-fertilized seed, respectively). A positive difference means that the cross-fertilized plant is taller, which we a priori expect to happen more often. Fisher was interested in two alternatives to the null hypothesis of symmetry: the one-sided alternative of positive observations being more common than negative ones and the two-sided alternative of asymmetry (with positive observations being either more or less common than negative ones).
Fisher’s p-value for testing the one-sided hypothesis is 2.634%, and his p-value for testing the two-sided hypothesis is twice as large, 5.267%. Therefore, the one-sided p-value is significant but not highly significant, whereas the two-sided p-value is not even significant.
Figure 2 plots the Fisher-type admissible e-values (4.4) (in blue) and the simplified e-values (4.5) (in red) for the parameter λ in the range $[0,1]$. The meaning of λ depends on the scale of the numbers ${z_{1}},\dots ,{z_{15}}$ in Table 1, and in order to make λ less arbitrary we normalize ${z_{1}},\dots ,{z_{15}}$ by dividing them by the standard deviation of these 15 numbers. Jeffreys’s [11, Appendix B] rule of thumb is to consider an e-value of 10 as being analogous to a p-value of $1\% $ and to consider an e-value of $\sqrt{10}\approx 3.162$ as being analogous to a p-value of $5\% $. (See [22, Sect. 2] for a more detailed discussion of relations between e-values and p-values.) This makes Figure 2 roughly comparable to Fisher’s p-values, especially if we ignore the inadmissible simplified e-values. If we guess in advance that $\lambda :=0.5$ is a good parameter value, we will get an e-value of 7.651. More realistically, averaging the e-values for $\lambda \in [0,1]$ will give the one-sided e-value 5.149. Replacing $\lambda \in [0,1]$ by $\lambda \in [-1,1]$ gives the two-sided e-value 2.633 not reaching the threshold of $\sqrt{10}$.
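For readers who want to redo this computation, here is a sketch of how the numbers above can be obtained; the exact values depend on details not fixed here (e.g., how the averages over λ are discretized), so small discrepancies with the figures quoted above are possible.

```python
import numpy as np

# Darwin's maize differences (Table 1), in eighths of an inch
z = np.array([49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48], dtype=float)
z = z / z.std()                      # normalize by the standard deviation of the 15 numbers

def fisher_e(lam):
    """The Fisher-type e-value (4.4) for the normalized data."""
    return np.exp(np.sum(lam * z - np.log(np.cosh(lam * z))))

print("e-value at lambda = 0.5:", fisher_e(0.5))
one_sided = np.mean([fisher_e(l) for l in np.linspace(0, 1, 1001)])
two_sided = np.mean([fisher_e(l) for l in np.linspace(-1, 1, 2001)])
print("average over lambda in [0, 1]: ", one_sided)
print("average over lambda in [-1, 1]:", two_sided)
```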
5 Pitman-type Asymptotic Relative Efficiency
The following definition is in the spirit of Pitman’s definition, which can be found in, e.g., [21, Sect. 14.3]. Let $({Q_{\theta }}\mid \theta \in \Theta )$ be a statistical model, i.e., a set of probability measures on the real line $\mathbb{R}$, with the observations generated from one of those probability measures in the IID fashion. We assume, for simplicity, that $\Theta =\mathbb{R}$ and regard ${Q_{0}}$ as the null hypothesis; informally, the alternative is either one-sided, $\theta \gt 0$, or two-sided, $\theta \ne 0$ (for specific e-tests, we will have the same results for one-sided and two-sided Pitman efficiency). By an e-variable we mean an e-variable w.r. to ${Q_{0}^{n}}$. In our asymptotic framework we consider sequences of parameter values ${\theta _{\nu }}$ that depend on the “difficulty” $\nu =1,2,\dots \hspace{0.1667em}$ of our testing problem; in the one-sided case we will assume ${\theta _{\nu }}\downarrow 0$ (the sequence is strictly decreasing and converges to 0), and in the two-sided case we will assume ${\theta _{\nu }}\to 0$.
Let ${\mathcal{E}_{1}^{n}}$ and ${\mathcal{E}_{2}^{n}}$ be families of e-variables on ${\mathbb{R}^{n}}$; we are interested in the case where ${\mathcal{E}_{1}^{n}}$ is a family of interest to us (a nonparametric e-test such as (4.4) above, or (7.3) or (8.1) below) and ${\mathcal{E}_{2}^{n}}$ is the baseline family of all e-variables on ${\mathbb{R}^{n}}$. The asymptotic relative efficiency of ${\mathcal{E}_{1}^{n}}$ w.r. to ${\mathcal{E}_{2}^{n}}$ is c if, for any $\beta \gt 0$ and any ${\theta _{\nu }}\downarrow 0$ (one-sided case) or ${\theta _{\nu }}\to 0$ (two-sided case), we have ${n_{\nu ,2}}/{n_{\nu ,1}}\to c$, where ${n_{\nu ,j}}$, $j=1,2$, is the minimal number of observations n such that
\[ \exists E\in {\mathcal{E}_{j}^{n}}:\int \log E\hspace{0.1667em}\mathrm{d}{Q_{{\theta _{\nu }}}^{n}}\ge \beta .\]
For example, if the asymptotic relative efficiency is 0.5, the best e-test in $({\mathcal{E}_{1}^{n}})$ requires twice as many observations n as the best test in $({\mathcal{E}_{2}^{n}})$ to achieve the same e-power (if the best e-tests exist).
The idea of using an auxiliary parametric statistical model $({Q_{\theta }})$, such as the Gaussian model, to assay the efficiency of nonparametric e-tests is illustrated in Figure 3. We are testing a nonparametric null hypothesis (the hypothesis of symmetry in this paper), but we are afraid that for a popular parametric model (the Gaussian model ${Q_{\theta }}:=N(\theta ,1)$ in this paper, which plays the role of an assay statistical model) our testing method loses a lot. We are interested in the case where the intersection between the nonparametric null hypothesis and the assay model contains only one probability measure; we refer to this intersection as the parametric null hypothesis in Figure 3 (in this paper, it is $\{N(0,1)\}$). For a given simple alternative hypothesis $Q={Q_{\theta }}$ in the assay model (shown as the red dot in Figure 3), we are hoping to show that the best e-power achieved for testing the simple parametric null hypothesis vs Q is not much better than the best e-power achieved for testing the composite (and usually massive) nonparametric null hypothesis. Or, if a Pitman-type notion of efficiency is to be used (as in this paper), that the same e-power is attained for numbers of observations that are not wildly different.
Our use of the Gaussian model with variance 1 as assay model motivates using (4.2) with $S({z_{1}},\dots ,{z_{n}}):={z_{1}}+\cdots +{z_{n}}$ as a nonparametric e-test. The sign and Wilcoxon versions will be natural modifications (corresponding to relaxing the symmetry assumption, as explained in Remark 6 below).
For all three nonparametric e-tests considered in this paper (Sects. 6–8 below) we will need the number ${n_{\nu ,2}}$ of observations required by our baseline, which is, by Lemma 1, the likelihood ratio $\mathrm{d}N({\theta _{\nu }},1)/\mathrm{d}N(0,1)$. By (3.2), achieving an e-power of β requires approximately
(5.1)
\[ 2\beta {\theta _{\nu }^{-2}}\]
observations (namely, $\lceil 2\beta {\theta _{\nu }^{-2}}\rceil $ observations).
Remark 5.
In the context of regular statistical models such as Gaussian, it is natural to set ${\theta _{\nu }}:=c{\nu ^{-1/2}}$. In this case the “difficulty” ν (referred to as “time” in [21, Sect. 14.3]) becomes proportional to the number of observations required to achieve a given e-power.
6 Asymptotic Efficiency of the Fisher-type E-test
In the classical case, the relative efficiency of Fisher’s test is 1 [6, Chapter 7, Example 4.1], as first shown by Hoeffding [9] (according to Mood [15]). Let us check that this remains true for the e-version as well.
First we find informally a suitable e-variable in the family (4.4) and then show that it requires the optimal number (5.1) of observations to achieve an e-power of β. Under the symmetry model, each observation ${z_{i}}$ is split into its magnitude ${m_{i}}:=|{z_{i}}|$ and sign ${s_{i}}:=\operatorname{sign}({z_{i}})$. Given the magnitudes, the signs are independent and $\mathbb{P}({s_{i}}=1)=1/2$ under the null hypothesis $N(0,1)$ and
\[\begin{aligned}{}\mathbb{P}({s_{i}}=1)& =\frac{\exp (-\frac{1}{2}{({m_{i}}-{\theta _{\nu }})^{2}})}{\exp (-\frac{1}{2}{({m_{i}}-{\theta _{\nu }})^{2}})+\exp (-\frac{1}{2}{(-{m_{i}}-{\theta _{\nu }})^{2}})}\\ {} & =\frac{\exp ({\theta _{\nu }}{m_{i}})}{\exp ({\theta _{\nu }}{m_{i}})+\exp (-{\theta _{\nu }}{m_{i}})}\end{aligned}\]
under the alternative hypothesis $N({\theta _{\nu }},1)$. The conditional likelihood ratio for the signs is
\[\begin{aligned}{}& {\prod \limits_{i=1}^{n}}\frac{2\exp ({\theta _{\nu }}{z_{i}})}{\exp ({\theta _{\nu }}{m_{i}})+\exp (-{\theta _{\nu }}{m_{i}})}\\ {} & \hspace{1em}={\prod \limits_{i=1}^{n}}\frac{\exp ({\theta _{\nu }}{z_{i}})}{1+{\theta _{\nu }^{2}}{m_{i}^{2}}/2+o({\theta _{\nu }^{2}}{m_{i}^{2}})}.\end{aligned}\]
This is Fisher’s e-test (4.4) corresponding to $\lambda :={\theta _{\nu }}$. Its observed e-power is
\[\begin{aligned}{}& {\sum \limits_{i=1}^{n}}\big({\theta _{\nu }}{z_{i}}-{\theta _{\nu }^{2}}{m_{i}^{2}}/2+o\big({\theta _{\nu }^{2}}{m_{i}^{2}}\big)\big)\\ {} & \hspace{1em}={\theta _{\nu }}{\sum \limits_{i=1}^{n}}{z_{i}}-\big(1+o(1)\big)\frac{{\theta _{\nu }^{2}}}{2}{\sum \limits_{i=1}^{n}}{m_{i}^{2}}.\end{aligned}\]
Since, under the alternative hypothesis $N({\theta _{\nu }},1)$,
\[ \mathbb{E}{\sum \limits_{i=1}^{n}}{z_{i}}=n{\theta _{\nu }}\]
and
\[ \mathbb{E}{\sum \limits_{i=1}^{n}}{m_{i}^{2}}=\mathbb{E}{\sum \limits_{i=1}^{n}}{z_{i}^{2}}=n+n{\theta _{\nu }^{2}}=\big(1+o(1)\big)n,\]
the e-power is
\[ n{\theta _{\nu }^{2}}-\big(1+o(1)\big)\frac{{\theta _{\nu }^{2}}}{2}n\sim \frac{1}{2}n{\theta _{\nu }^{2}}.\]
We obtain the optimal e-power (3.2) with $\theta ={\theta _{\nu }}$, and so the asymptotic relative efficiency of Fisher’s e-test is 1.
7 Sign E-test
In this and the following sections we use (4.2) for different statistics S, with C still chosen to make ${E_{\lambda }}$ an admissible e-variable. In this section we make the simplest choice of $S({z_{1}},\dots ,{z_{n}})$ in (4.2), which is the number k of positive ${z_{i}}$ among ${z_{1}},\dots ,{z_{n}}$. This gives the sign e-test with parameter $\lambda \gt 0$. The use of the signs for hypothesis testing goes back to [1].
To obtain a useful alternative representation of the sign e-test, let $p\in (0,1)$ be defined by the equation
\[ \lambda =\log \frac{p}{1-p}\]
(so that λ becomes the log-odds ratio). The e-variable (4.2) then becomes
(7.1)
\[ {E_{\lambda }}({z_{1}},\dots ,{z_{n}})=\exp (\lambda k-C)={e^{-C}}{\bigg(\frac{p}{1-p}\bigg)^{k}}=\frac{{p^{k}}{(1-p)^{n-k}}}{{2^{-n}}}.\]
The last expression is the likelihood ratio of an alternative to the null hypothesis (the signs of the observations being positive with probability p independently), and so is an admissible e-variable. This gives us the representation
(7.2)
\[ {E_{p}}({z_{1}},\dots ,{z_{n}}):={2^{n}}{p^{k}}{(1-p)^{n-k}}\]
of the sign e-test.
The equality between the last two terms in (7.1) gives an explicit expression for C,
\[ C=n\log \frac{1}{2(1-p)}=n\log \frac{1+{e^{\lambda }}}{2},\]
which in turn gives the alternative representation
\[ {E_{\lambda }}({z_{1}},\dots ,{z_{n}})={e^{\lambda k}}{\bigg(\frac{2}{1+{e^{\lambda }}}\bigg)^{n}}\]
of the sign e-test.
In view of our informal alternative hypothesis, we are often interested in $\lambda \gt 0$, i.e., $p\gt 1/2$.
Remark 6.
Notice that in this section we are actually testing a wider null hypothesis than the symmetry model, since the magnitudes of ${z_{i}}$ do not matter. Namely, the sign e-test is valid for testing the hypothesis that the signs of ${Z_{1}},\dots ,{Z_{n}}$ are $\pm 1$ with probability $1/2$ each, independently. A similar remark can also be made about the nonparametric e-test discussed in the following section, which in fact tests an intermediate null hypothesis.
As before, we have a dependence of the sign e-test (7.2) on a parameter, p. To get rid of this dependence, we can, e.g., integrate (7.2) over $p\in [0,1]$, obtaining
(7.3)
\[ {\int _{0}^{1}}{2^{n}}{p^{k}}{(1-p)^{n-k}}\hspace{0.1667em}\mathrm{d}p={2^{n}}\operatorname{B}(k+1,n-k+1),\]
where B is the beta function. For testing the one-sided hypothesis we can integrate (7.2) over the uniform probability measure on $[0.5,1]$, which gives
\[ {2^{n+1}}\big(\operatorname{B}(k+1,n-k+1)-\operatorname{B}(0.5;k+1,n-k+1)\big),\]
where the second entry of B stands for the incomplete beta function.
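A short sketch of how the two integrated sign e-values can be evaluated with standard special functions (SciPy’s betainc is the regularized incomplete beta function, hence the conversion below); the counts n = 15, k = 13 match Table 1, and these exact integrals may differ slightly from the grid averages reported for Darwin’s data below.

```python
import numpy as np
from scipy.special import beta, betainc    # betainc(a, b, x) is the regularized incomplete beta

def sign_e_two_sided(k, n):
    """(7.3): the sign e-test (7.2) integrated over p uniform on [0, 1]."""
    return 2.0**n * beta(k + 1, n - k + 1)

def sign_e_one_sided(k, n):
    """(7.2) integrated over the uniform probability measure on [0.5, 1]."""
    incomplete = beta(k + 1, n - k + 1) * (1 - betainc(k + 1, n - k + 1, 0.5))
    return 2.0**(n + 1) * incomplete

n, k = 15, 13          # 13 positive differences out of 15, as in Table 1
print("two-sided e-value:", sign_e_two_sided(k, n))
print("one-sided e-value:", sign_e_one_sided(k, n))
```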
Efficiency of the Sign Test
In this section and the next we consider the same assay parametric model and still assume that the null hypothesis is $N(0,1)$ and the alternative is $N({\theta _{\nu }},1)$. Suppose we only observe the signs ${s_{i}}$ of ${z_{i}}$, which is sufficient when testing the null hypothesis with the sign e-test. By Lemma 1 the largest e-power for an e-variable of this kind will be achieved by the likelihood ratio for the signs.
The sign of ${Z_{i}}$ is 1 with probability $1/2$ under the null hypothesis and $1/2+{\tilde{\theta }_{\nu }}/\sqrt{2\pi }$ under the alternative for ${\tilde{\theta }_{\nu }}\sim {\theta _{\nu }}$, due to the first-order Taylor approximation of the standard Gaussian cumulative distribution function Φ. With k being the number of positive ${z_{i}}$, the likelihood ratio for the signs is
\[\begin{aligned}{}& \frac{{(\frac{1}{2}+\frac{{\tilde{\theta }_{\nu }}}{\sqrt{2\pi }})^{k}}{(\frac{1}{2}-\frac{{\tilde{\theta }_{\nu }}}{\sqrt{2\pi }})^{n-k}}}{{(1/2)^{n}}}\\ {} & \hspace{1em}={\bigg(1+\sqrt{\frac{2}{\pi }}{\tilde{\theta }_{\nu }}\bigg)^{k}}{\bigg(1-\sqrt{\frac{2}{\pi }}{\tilde{\theta }_{\nu }}\bigg)^{n-k}}.\end{aligned}\]
This is an instance of the sign e-test (7.2), corresponding to $p=1/2+{\tilde{\theta }_{\nu }}/\sqrt{2\pi }$. The observed e-power of this e-test is
\[\begin{aligned}{}& k\log \bigg(1+\sqrt{\frac{2}{\pi }}{\tilde{\theta }_{\nu }}\bigg)+(n-k)\log \bigg(1-\sqrt{\frac{2}{\pi }}{\tilde{\theta }_{\nu }}\bigg)\\ {} & \hspace{1em}=(2k-n)\sqrt{\frac{2}{\pi }}{\tilde{\theta }_{\nu }}-\frac{1}{\pi }n{\tilde{\theta }_{\nu }^{2}}+o\big(n{\tilde{\theta }_{\nu }^{2}}\big)\end{aligned}\]
(we have used the second-order Taylor approximation). This gives the e-power
\[\begin{aligned}{}& \bigg(2\bigg(\frac{1}{2}+\frac{{\tilde{\theta }_{\nu }}}{\sqrt{2\pi }}\bigg)n-n\bigg)\sqrt{\frac{2}{\pi }}{\tilde{\theta }_{\nu }}-\frac{1}{\pi }n{\tilde{\theta }_{\nu }^{2}}+o\big(n{\tilde{\theta }_{\nu }^{2}}\big)\\ {} & \hspace{1em}=\frac{1}{\pi }n{\tilde{\theta }_{\nu }^{2}}+o\big(n{\tilde{\theta }_{\nu }^{2}}\big)\sim \frac{1}{\pi }n{\theta _{\nu }^{2}}.\end{aligned}\]
To achieve an e-power of β, the sign e-test needs $\sim \pi \beta {\theta _{\nu }^{-2}}$ observations. Therefore, the asymptotic efficiency of the sign e-test is $2/\pi \approx 0.64$, exactly the same as in the standard case [6, Example 3.1]. (In the standard case the sign test is usually compared with the t-test, but in this paper we use an even more basic assay parametric model; namely, we assume that the variance is known to be 1.)
Since the asymptotic efficiency is approximately 2/3, we can say that the sign test wastes every third observation in our Gaussian setting. This is the least efficient of the three nonparametric e-tests considered in this paper when efficiency is measured using the Gaussian assay model as yardstick.
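The asymptotic comparisons of Sects. 6 and 7 can be illustrated by a crude Monte Carlo sketch (with an arbitrary small θ and large n, not from the paper): the e-power of the Fisher-type e-test (4.4) with $\lambda =\theta $ is close to the optimal $n{\theta ^{2}}/2$, while the e-power of the sign e-test (7.2) with $p=1/2+\theta /\sqrt{2\pi }$ is close to $n{\theta ^{2}}/\pi $.

```python
import numpy as np

rng = np.random.default_rng(6)
theta, n, reps = 0.1, 2000, 2000
z = rng.normal(theta, 1.0, size=(reps, n))        # samples from the alternative N(theta, 1)
k = (z > 0).sum(axis=1)                           # numbers of positive observations

baseline = theta * z.sum(axis=1) - n * theta**2 / 2                  # log of (3.1)
fisher = np.sum(theta * z - np.log(np.cosh(theta * z)), axis=1)      # log of (4.4) with lambda = theta
p = 0.5 + theta / np.sqrt(2 * np.pi)
sign = n * np.log(2) + k * np.log(p) + (n - k) * np.log(1 - p)       # log of (7.2)

print("baseline e-power:   ", baseline.mean(), "  n*theta^2/2 =", n * theta**2 / 2)
print("Fisher-type e-power:", fisher.mean())
print("sign e-test e-power:", sign.mean(), "  n*theta^2/pi =", n * theta**2 / np.pi)
```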
Sign Test for Darwin’s Data
It is interesting that the sign test gives the one-sided p-value of 0.00369 and the two-sided p-value of 0.00739. In contrast with Fisher’s p-test, both p-values are highly significant, the reason being that the two negative numbers in Table 1 are so large in absolute value.
Figure 4 is an analogue of Figure 2 for the sign test. The attainable e-values are now much larger, and the average over all $p\in [0,1]$ is 19.310. To use Jeffreys’s [11, Appendix B] expressions, we have strong evidence against the null hypothesis of cross- and self-fertilization being equally efficient. The corresponding one-sided e-value, found as the average over all $p\in [0.5,1]$, is 38.544, and in Jeffreys’s terminology it provides very strong evidence (for cross-fertilization tending to produce taller plants, in this context).
Table 1 comprises only a small part of the overwhelming evidence in favour of cross-fertilization collected by Darwin over 11 years. Darwin chose maize to illustrate his and Galton’s statistical methods in [3, Chap. 1], but in [3, Chaps. 2–6] he has 99 similar tables (with our Table 1 corresponding to Darwin’s Table 97). With this amount of data, statistics is hardly needed to see that the evidence is really overwhelming.
8 Wilcoxon’s Signed-Rank E-tests
Wilcoxon’s signed-rank test [25] is based on arranging the magnitudes $|{z_{i}}|$ of the observations in ascending order and assigning to each its rank, which is a number in the range $\{1,\dots ,n\}$: the observation ${z_{i}}$ with the smallest $|{z_{i}}|$ gets rank 1, the one with the second smallest $|{z_{i}}|$ gets rank 2, etc. Notice that the symmetry model (i.e., the uniform probability measure on (4.1)) implies that for any set $A\subseteq \{1,\dots ,n\}$, the probability is ${2^{-n}}$ that the observations with the ranks in A will be positive and all other observations will be negative. This determines the distribution (conditional on the magnitudes $|{z_{i}}|$) of Wilcoxon’s statistic ${V_{n}}$ defined as the sum of the ranks of the positive observations.
We will be interested in the nonparametric e-test (4.2) with $S:={V_{n}}$, i.e.,
(8.1)
\[ {E_{\lambda }}({z_{1}},\dots ,{z_{n}}):=\exp (\lambda {V_{n}}-C).\]
The following lemma gives a convenient formula for computing C.
Lemma 3.
The constant C in (8.1) is given by
(8.2)
\[ C={\sum \limits_{i=1}^{n}}\log \frac{1+{e^{\lambda i}}}{2},\]
so that
(8.3)
\[ {E_{\lambda }}({z_{1}},\dots ,{z_{n}})=\exp (\lambda {V_{n}}){\prod \limits_{i=1}^{n}}\frac{2}{1+{e^{\lambda i}}}.\]
Proof.
Using Fisher’s conditional distribution (the uniform probability measure on (4.1)), we can write C in the form
\[ C=\log \bigg({2^{-n}}\sum \limits_{A\subseteq \{1,\dots ,n\}}\exp \big(\lambda \operatorname{sum}(A)\big)\bigg),\]
where $\operatorname{sum}(A)$ is the sum of all elements of A. Setting
\[ {G_{i}}:=\sum \limits_{A\subseteq \{1,\dots ,i\}}{\Lambda ^{\operatorname{sum}(A)}},\hspace{1em}i=0,1,\dots ,n,\]
where $\Lambda :=\exp (\lambda )$, and using the recursion
\[ {G_{i}}=(1+{\Lambda ^{i}}){G_{i-1}},\hspace{1em}{G_{0}}=1\]
(obtained by splitting all subsets of $\{1,\dots ,i\}$ into those that do not contain i and those that do), we obtain
\[ {e^{C}}={2^{-n}}{G_{n}}={\prod \limits_{i=1}^{n}}\frac{1+{\Lambda ^{i}}}{2}.\]
 □
Efficiency of Wilcoxon’s Signed-Rank E-test
Our derivation in this subsection will follow [13, Example 3.3.6]. The statistic
(8.4)
\[ {T_{n}}:=\frac{{V_{n}}}{n(n+1)/2},\]
${V_{n}}$ being Wilcoxon’s signed-rank statistic defined at the beginning of this section, is asymptotically normal both under the null hypothesis $N(0,1)$,
(8.5)
\[ {T_{n}}\approx N\bigg(\frac{1}{2},\frac{1}{3n}\bigg),\]
and under the alternative hypothesis $N({\theta _{\nu }},1)$,
(8.6)
\[ {T_{n}}\approx N\bigg(\frac{1}{2}+\frac{{\theta _{\nu }}}{\sqrt{\pi }},\frac{1}{3n}\bigg).\]
The mean value $1/2+{\theta _{\nu }}/\sqrt{\pi }$ in (8.6) is found as the first-order approximation to the probability of ${Z_{1}}+{Z_{2}}\gt 0$, where ${Z_{1}}$ and ${Z_{2}}$ are independent and distributed according to the alternative hypothesis $N({\theta _{\nu }},1)$ (see [13, (3.3.40)]). Namely, it is obtained from ${Z_{1}}+{Z_{2}}\sim N(2{\theta _{\nu }},2)$ and from the standard Gaussian density being $1/\sqrt{2\pi }$ at 0.
From (8.5) and (8.6) we obtain the asymptotic likelihood ratio
(8.7)
\[\begin{aligned}{}& \frac{\exp (-\frac{1}{2}{({T_{n}}-\frac{1}{2}-\frac{{\theta _{\nu }}}{\sqrt{\pi }})^{2}}/\frac{1}{3n})}{\exp (-\frac{1}{2}{({T_{n}}-\frac{1}{2})^{2}}/\frac{1}{3n})}\\ {} & \hspace{1em}=\exp \bigg(3n\bigg({T_{n}}-\frac{1}{2}\bigg)\frac{{\theta _{\nu }}}{\sqrt{\pi }}-\frac{3n}{2}\frac{{\theta _{\nu }^{2}}}{\pi }\bigg)\end{aligned}\]
(of the form (8.1); see below). The observed e-power is obtained by removing the exp, and then the e-power is obtained by taking the expectation w.r. to ${T_{n}}$ distributed as (8.6). Therefore, the e-power is, asymptotically,
\[ 3n\frac{{\theta _{\nu }}}{\sqrt{\pi }}\frac{{\theta _{\nu }}}{\sqrt{\pi }}-\frac{3n}{2}\frac{{\theta _{\nu }^{2}}}{\pi }=\frac{3n}{2}\frac{{\theta _{\nu }^{2}}}{\pi }.\]
The number of observations required for achieving an e-power of β is, asymptotically,
\[ \frac{2\pi }{3}\beta {\theta _{\nu }^{-2}}.\]
Comparing this with the baseline (5.1) gives the asymptotic relative efficiency of $3/\pi \approx 0.955$, as in the classical case. Wilcoxon’s test wastes one observation out of about 22 (under the Gaussian model as compared with the e-test optimized for that model).
The approximate e-test used in this calculation (given by the right-hand side of (8.7)) is of the form (8.1) with
\[ \lambda :=\frac{6{\theta _{\nu }}}{(n+1)\sqrt{\pi }}\]
(obtained by expressing (8.7) in terms of ${V_{n}}$ using (8.4)). This, however, ignores the definition of C in (8.1). In practical applications we should use, of course, the precise expression (8.3).
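To conclude the section, here is a sketch (with made-up data and an arbitrary λ) of how the Wilcoxon e-variable (8.1) can be computed with the exact normalizing constant of Lemma 3, together with a brute-force check that its average over all sign assignments of the magnitudes is 1.

```python
import itertools
import numpy as np

def wilcoxon_e(z, lam):
    """The e-variable (8.1) with the exact normalizing constant C from Lemma 3."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    ranks = np.argsort(np.argsort(np.abs(z))) + 1     # ranks of |z_i| (no ties assumed)
    v = ranks[z > 0].sum()                            # Wilcoxon's signed-rank statistic V_n
    C = np.sum(np.log((1 + np.exp(lam * np.arange(1, n + 1))) / 2))
    return np.exp(lam * v - C)

z = [0.8, -0.3, 1.2, 0.5, -0.1, 0.9, -2.0, 1.4]       # hypothetical observations
print("E_lambda(z):", wilcoxon_e(z, 0.2))

# Average over the uniform probability measure on the set (4.1): equals 1 exactly
m = np.abs(z)
flips = [np.array(s) * m for s in itertools.product([-1, 1], repeat=len(m))]
print("average over all sign flips:", np.mean([wilcoxon_e(x, 0.2) for x in flips]))
```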
9 Directions of Further Research
In the previous sections we mentioned several limitations of our definitions. In this concluding section we will add further details.
The Notion of E-power as Used in the Definition of Efficiency
Our notion of e-power for an e-variable E is crude in that it depends only on the expectation of $\log E$, as explained in Remark 1. This crudeness is inherited by our definition of the asymptotic relative efficiency of e-tests. According to our definition in Sect. 5, the asymptotic relative efficiency is c if ${n_{\nu ,2}}\sim c{n_{\nu ,1}}$. This statement will be particularly useful if, under the alternative hypothesis, the full distribution of the original likelihood ratio, such as (3.1) for $\theta ={\theta _{\nu }}$ and ${n_{\nu ,2}}$ observations, is close, in a suitable sense, to the full distribution of the e-test, such as (4.4), (7.3), or (8.3) (with ${n_{\nu ,1}}$ observations and the corresponding value of the parameter). Therefore, a fuller treatment of asymptotic relative efficiency will not use e-power directly (which will make it more complicated).
Definition of Efficiency in Terms of Mixtures
Our definition of Pitman-type efficiency is close to being a direct translation of the classical one. It considers the alternatives $N({\theta _{\nu }},1)$ that approach the null hypothesis $N(0,1)$ as the difficulty ν increases. In the classical case, this works perfectly for many popular assay models because of the existence of a uniformly most powerful test: the optimal size α critical region does not depend on ν (assuming ${\theta _{\nu }}\gt 0$). In the e-case, on the contrary, the optimal e-variable does depend on ν.
A possible alternative definition would be to replace $N({\theta _{\nu }},1)$ by a mixture $\textstyle\int N(\theta ,1){\mu _{\nu }}(\mathrm{d}\theta )$ of $N(\theta ,1)$ w.r. to a probability measure ${\mu _{\nu }}(\mathrm{d}\theta )$ that is increasingly concentrated around $\theta =0$ as $\nu \to \infty $. In a sense, the assay statistical model considered in this paper is “pure” in that it consists of pure Gaussian distributions. Considering mixtures $\textstyle\int N(\theta ,1){\mu _{\nu }}(\mathrm{d}\theta )$ would make the results more realistic but would also make the definitions more complicated.
Other Assay Models
In our efficiency results, the Gaussian model can be replaced by other statistical models. It is particularly interesting to compare nonparametric e-tests with the optimal e-tests under those models; nowadays, comparison with the t-test, which was done in many of the classical papers (e.g., [8]), looks less convincing for non-Gaussian assay models.
Our choice of the form (4.2) of the nonparametric e-tests considered in this paper was motivated by the Gaussian assay model: see the likelihood ratio (3.1). Using other assay models would lead to other nonparametric e-tests. Therefore, varying the assay model may be a useful design tool for nonparametric e-tests.
Other Notions of Efficiency
The Pitman-type notion of efficiency is “local”, in the sense of being defined in terms of progressively more difficult alternatives that tend to the null hypothesis as $\nu \to \infty $. It is the most popular notion of efficiency for nonparametric tests, but it would be interesting to develop e-versions of other, non-local, notions of asymptotic relative efficiency (see, e.g., [16, Chap. 1]).