1 Introduction
Tragedies like the 9/11 attacks, earthquakes or volcanic eruptions are rare events, but their consequences are catastrophic. Estimating the probabilities of such extreme events has become more important and urgent in recent decades [3]. Both large deviations theory [11] and extreme value theory, which is widely used in disciplines such as actuarial science, the environmental sciences and physics [8], address the theoretical and practical problems arising from rare events [12, 6].
With the popularization of Bayesian statistics, we now have two approaches for evaluating the probability of the tail: Bayesian and frequentist [13, 9, 1]. However, these two methods sometimes lead to different conclusions. As the case study in [4] shows, before 1999 simple frequentist extreme value techniques were used to predict future levels of extreme rainfall in Venezuela. In December 1999, a daily precipitation of more than 410 mm, almost three times the previously recorded daily maximum, was not anticipated by those analyses, and the resulting catastrophe caused an estimated 30,000 deaths. Figure 1 in [4] shows that the 1999 precipitation is exceptional even relative to the better-fitting model under frequentist maximum likelihood estimation (MLE). Figure 2, however, shows that the 1999 event could have been anticipated if Bayesian inference had been used to fully account for the uncertainties due to parameter estimation and to exploit all available data. Table 1, taken from [5], gives the return period estimates of 410.4 mm, the 1999 annual maximum, under different models. The table also shows that the Bayesian method gives a much smaller return period than frequentist MLE under either model.
Figure 1
Return level plots for models fitted to the annual maximum Venezuelan rainfall data (GEV model: maximum likelihood estimates and limits of the 95% confidence intervals; Gumbel model: maximum likelihood estimates and limits of the 95% confidence intervals; empirical estimates based on the complete 49-year data set): (a) excluding the 1999 data; (b) including the 1999 data.
Figure 2
Predictive distributions for seasonal models fitted to the Venezuelan rainfall data, including and excluding the 1999 data, together with empirical estimates based on the 49 available annual maxima.
Table 1
Return period estimates of 410.4 mm, the 1999 annual maximum, under different models and modes of inference. For the MLE analysis the values are the maximum likelihood estimates of the return period; for the Bayesian analysis they are the predictive return periods. From [5] (with permission).
Return period of 410.4 mm
Mode of inference | Model  | 1999 data excluded | 1999 data included
MLE               | Gumbel | 17,600,000         | 737,000
                  | GEV    | 4,280              | 302
Bayes             | Gumbel | 2,170,000          | 233,000
                  | GEV    | 660                | 177
The lesson learned from [5] and [4] motivates our research. We investigate why these two methods can give hugely different results, and in particular why the Bayesian model usually gives a larger tail probability than the classic frequentist method. The Bayesian estimation of tail probabilities is well founded on probability theory: it is a marginal computation that integrates out the parameters of the tail. The “plug-in” insertion of a point estimate of the parameters, on the other hand, is an ad hoc procedure based not on the probability calculus but on an approximation, one that may be close to the correct calculation at the center of the range but that deteriorates spectacularly as we move toward the interesting tails, where extreme values occur.
1.1 Definitions and Goal
Let X be a random variable. If X indicates the magnitude and intensity of an earthquake, then the tail probability $P(X\gt a)$ is the probability of an earthquake whose magnitude exceeds some value a much greater than the mean. The Bayesian estimator of this probability is defined as
\[ {P_{B}}(X\gt a)={\int _{\boldsymbol{\Theta }}}[1-F(a|\boldsymbol{\theta })]\pi (\boldsymbol{\theta }|a)d\boldsymbol{\theta },\]
which is the expectation of the tail probability $1-F(a|\boldsymbol{\theta })$ under the posterior distribution $\pi (\boldsymbol{\theta }|x)$ evaluated at $x=a$, where $\boldsymbol{\theta }$ denotes the parameters of the distribution function and may be one-dimensional or a higher-dimensional vector. The frequentist estimator, also called the plug-in estimator, is defined as
\[ {P_{F}}(X\gt a)=1-F(a|\hat{\boldsymbol{\theta }}),\]
where $\hat{\boldsymbol{\theta }}$ is the Maximum Likelihood Estimator (MLE) of $\boldsymbol{\theta }$. Numerical experimental results point to the fact that the Bayesian estimator of the tail probability is usually asymptotically higher than the frequentist estimator. We will use a very simple hierarchical model as an example to illustrate this phenomenon. Suppose we have a sequence of random variables $\boldsymbol{x}=({x_{1}},\dots ,{x_{n}})$ that follows the exponential distribution. The density of the exponential distribution is given by
\[ f(x|\lambda )=\frac{1}{\lambda }{e^{-\frac{x}{\lambda }}},\]
where $x\ge 0$ and $\lambda \gt 0$. So the tail probability is
\[ \varphi (\lambda )=1-F(a|\lambda )={\int _{a}^{\infty }}f(x|\lambda )dx={e^{-\frac{a}{\lambda }}}\]
The marginal distribution can be calculated as
\[ m(\mathbf{x})={\int _{0}^{\infty }}f(\mathbf{x}|\lambda ){\pi _{J}}(\lambda )d\lambda =\Gamma (n){\Big({\sum \limits_{i=1}^{n}}{x_{i}}\Big)^{-n}},\]
where we use the Jeffreys prior ${\pi _{J}}(\lambda )\propto 1/\lambda $, from which the posterior distribution is obtained as
\[ \pi (\lambda |\mathbf{x})=\frac{1}{{\lambda ^{(n+1)}}\Gamma (n)}\exp \left(-\frac{{\textstyle\textstyle\sum _{i=1}^{n}}{x_{i}}}{\lambda }\right){\left({\sum \limits_{i=1}^{n}}{x_{i}}\right)^{n}}\]
Then the Bayesian estimator of the tail probability is
\[\begin{aligned}{}{P_{B}}(X\gt a)& =E[\varphi (\lambda )|\boldsymbol{x}]\\ {} & ={\int _{0}^{\infty }}\varphi (\lambda )\pi (\lambda |\boldsymbol{x})d\lambda ={\left(1+\frac{a}{n\bar{x}}\right)^{-n}}\end{aligned}\]
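As a quick sanity check, this closed form can be compared against direct numerical integration of the tail probability over the posterior. The following is a minimal Python sketch; the values of n, $\bar{x}$ and a are illustrative assumptions.

```python
# Sketch: verify P_B(X > a) = (1 + a/(n*xbar))**(-n) for the exponential model by
# integrating exp(-a/lambda) against the posterior pi(lambda | x), which is an
# inverse-gamma distribution with shape n and scale S = n*xbar.
# The values of n, xbar and a below are illustrative assumptions.
import numpy as np
from scipy import stats
from scipy.integrate import quad

n, xbar, a = 30, 10.0, 200.0
S = n * xbar
posterior = stats.invgamma(n, scale=S)            # pi(lambda | x)

numeric, _ = quad(lambda lam: np.exp(-a / lam) * posterior.pdf(lam), 0, np.inf)
closed_form = (1 + a / S) ** (-n)
print(numeric, closed_form)                       # the two values should agree closely
```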
The frequentist estimator of the tail probability is
\[ {P_{F}}(X\gt a)=1-F(a|\hat{\lambda })={e^{-\frac{a}{\bar{x}}}},\]
since the MLE of λ is $\hat{\lambda }=\bar{x}$.
Note that ${P_{B}}$ goes to zero at a polynomial rate in a, whereas ${P_{F}}$ goes to zero at an exponential rate. They are therefore not even asymptotically equivalent, and in fact
\[ \frac{{P_{B}}(X\gt a)}{{P_{F}}(X\gt a)}\to \infty \hspace{2.5pt}\hspace{2.5pt}\hspace{2.5pt}\text{as}\hspace{2.5pt}a\to \infty \]
We conducted numerical experiments and computed the ratio of ${P_{B}}$ to ${P_{F}}$ over different ranges of a (a minimal sketch of this computation appears below). When a is a moderate number between 50 and 100, the ratio is greater than 1 and increases slowly. When a varies from 500 to 1000, however, the ratio becomes extremely large very quickly. In other words, $\frac{{P_{B}}(X\gt a)}{{P_{F}}(X\gt a)}\to \infty \hspace{2.5pt}\text{as}\hspace{2.5pt}a\to \infty $, which implies that the Bayesian estimate is asymptotically higher than the frequentist estimate. We investigate the reasons for this behavior and the conditions under which it happens. We first use Jensen’s inequality [7, 2], which states that if $\varphi (\boldsymbol{\theta })=1-F(a|\boldsymbol{\theta })$ is a convex function of the random variable $\boldsymbol{\theta }$, then
\[ \varphi \left(E[\boldsymbol{\theta }]\right)\le E\left[\varphi (\boldsymbol{\theta })\right].\]
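A minimal sketch of the experiment described above, for the exponential example, is the following; the sample size n = 30 and sample mean $\bar{x}=10$ are illustrative assumptions.

```python
# Sketch of the numerical experiment: ratio of the Bayesian to the frequentist tail
# estimate for the exponential model, P_B/P_F = (1 + a/(n*xbar))**(-n) / exp(-a/xbar).
# The sample size n = 30 and sample mean xbar = 10 are illustrative assumptions.
import numpy as np

n, xbar = 30, 10.0

def ratio(a):
    p_bayes = (1 + a / (n * xbar)) ** (-n)   # Bayesian (predictive) tail probability
    p_freq = np.exp(-a / xbar)               # plug-in tail probability with lambda_hat = xbar
    return p_bayes / p_freq

for a in [50, 75, 100, 500, 750, 1000]:
    print(f"a = {a:5d}   P_B/P_F = {ratio(a):.4g}")
# For moderate a the ratio is slightly above 1; as a grows it blows up,
# illustrating P_B/P_F -> infinity as a -> infinity.
```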
We then verify that the convexity conditions are met for certain distributions widely used in extreme value analysis. To do so, we use a Taylor series approximation of the tail probability $1-F(a|\boldsymbol{\theta })$ around the MLE of $\boldsymbol{\theta }$ and plug it into the difference between the Bayesian and the frequentist estimators, which is defined as
\[ D(a)={P_{B}}(X\gt a)-{P_{F}}(X\gt a).\]
In conclusion, we show that the convexity of $\varphi (\boldsymbol{\theta })=1-F(a|\boldsymbol{\theta })$ is a sufficient condition for ${P_{B}}\gt {P_{F}}$ asymptotically, and we verify this convexity condition for specific distributions that are widely used in extreme value theory.
2 Methodology
In this section we investigate why the Bayesian estimator of the tail probability is usually asymptotically higher than the frequentist one. The approach is to show that if $\varphi (\boldsymbol{\theta })=1-F(a|\boldsymbol{\theta })$ is convex, then Jensen’s inequality can be applied directly. We then use the Taylor expansion of the tail probability $1-F(a|\boldsymbol{\theta })$ to verify this result, which leads to the same conditions on the distribution function $F(a|\boldsymbol{\theta })$.
2.1 Convexity Investigation Using Jensen’s Inequality
For tail probability estimation, the Bayesian method gives ${P_{B}}(X\gt a)={E^{\pi (\boldsymbol{\theta }|a)}}[1-F(a|\boldsymbol{\theta })]$, while the frequentist method uses ${P_{F}}(X\gt a)=1-F(a|\hat{\boldsymbol{\theta }})$. To investigate the relation between ${P_{B}}$ and ${P_{F}}$, Jensen’s inequality provides a closely related statement, which we state formally here:
Theorem 1.
Let $(\Omega ,A,\mu )$ be a probability space, such that $\mu (\Omega )=1$. If $g:\Omega \to {\mathbb{R}^{d}}$ is a μ-integrable function and φ is a convex function on ${\mathbb{R}^{d}}$, then:
\[ \varphi \left({\int _{\Omega }}g\hspace{0.1667em}d\mu \right)\le {\int _{\Omega }}(\varphi \circ g)\hspace{0.1667em}d\mu .\]
Note that here the measurable function g plays the role of our parameter $\boldsymbol{\theta }$, distributed according to the posterior. Jensen’s inequality thus gives $\varphi \left(\operatorname{E}[\boldsymbol{\theta }]\right)\le \operatorname{E}\left[\varphi (\boldsymbol{\theta })\right]$ [2]. The inequality we want to prove, however, is $\varphi [\hat{\boldsymbol{\theta }}]\le E[\varphi (\boldsymbol{\theta })]$. The following theorem and proof show that as $a\to \infty $, $\varphi [\hat{\boldsymbol{\theta }}]$ and $\varphi [E(\boldsymbol{\theta })]$ become arbitrarily close to each other, which implies that $\varphi [\hat{\boldsymbol{\theta }}]\le E[\varphi (\boldsymbol{\theta })]$ asymptotically.
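To make the three quantities involved concrete, the following sketch compares $\varphi (\hat{\lambda })$, $\varphi [E(\lambda )]$ and $E[\varphi (\lambda )]$ for the exponential example of Section 1.1, using the closed forms derived there; the values of n, $\bar{x}$ and a are illustrative assumptions.

```python
# Sketch: compare phi(lambda_hat), phi(E[lambda]) and E[phi(lambda)] for the exponential
# example of Section 1.1, where phi(lambda) = exp(-a/lambda) and the posterior is
# inverse-gamma with shape n and scale S = n*xbar (so E[lambda | x] = S/(n-1)).
# The values of n, xbar and a are illustrative assumptions.
import numpy as np

n, xbar, a = 30, 10.0, 200.0
S = n * xbar

phi_mle = np.exp(-a / xbar)                # phi(lambda_hat), with lambda_hat = xbar
phi_post_mean = np.exp(-a * (n - 1) / S)   # phi(E[lambda])
e_phi = (1 + a / S) ** (-n)                # E[phi(lambda) | x], the Bayesian estimate

print(phi_mle, phi_post_mean, e_phi)
# phi(lambda_hat) and phi(E[lambda]) differ by a vanishing amount as a grows,
# and both lie below E[phi(lambda)], as Jensen's inequality predicts.
```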
Theorem 2.
Let X be a continuous random variable supported on a semi-infinite interval, usually $[c,\infty )$ for some c, or on the whole real line. Let $F(a|\boldsymbol{\theta })$ be the cumulative distribution function (CDF) of X, where a is some extremely large number on the support, and suppose $\varphi (\boldsymbol{\theta })=1-F(a|\boldsymbol{\theta })$ is a convex function. If $\hat{\boldsymbol{\theta }}$ is the maximum likelihood estimate of the parameter $\boldsymbol{\theta }$, then
\[ \underset{a\to \infty }{\lim }\left|\varphi [\hat{\boldsymbol{\theta }}]-\varphi [E(\boldsymbol{\theta })]\right|=0,\]
and hence, if φ is strictly convex, $\varphi [\hat{\boldsymbol{\theta }}]\le E[\varphi (\boldsymbol{\theta })]$, i.e. ${P_{F}}(X\gt a)\le {P_{B}}(X\gt a)$, for a sufficiently large.
Proof.
\[\begin{aligned}{}\left|\varphi [E(\boldsymbol{\theta })]-\varphi [\hat{\boldsymbol{\theta }}]\right|& =\left|\left(1-F[a|E(\boldsymbol{\theta })]\right)-\left(1-F[a|\hat{\boldsymbol{\theta }}]\right)\right|\\ {} & =\left|\left(1-{\int _{-\infty }^{a}}f(x|E(\boldsymbol{\theta }))dx\right)-\cdots \hspace{0.1667em}\right.\\ {} & \hspace{42.67912pt}\dots \left.\left(1-{\int _{-\infty }^{a}}f(x|\hat{\boldsymbol{\theta }})dx\right)\right|\\ {} & =\left|{\int _{a}^{\infty }}f(x|\hat{\boldsymbol{\theta }})dx-{\int _{a}^{\infty }}f(x|E(\boldsymbol{\theta }))dx\right|\\ {} & \le {\int _{a}^{\infty }}\left|f(x|\hat{\boldsymbol{\theta }})-f(x|E(\boldsymbol{\theta }))\right|dx\end{aligned}\]
Let us define
\[ g(x)=f(x|\hat{\boldsymbol{\theta }})-f(x|E(\boldsymbol{\theta })).\]
Then we can see that
\[\begin{aligned}{}{\int _{-\infty }^{\infty }}|g(x)|dx& \le {\int _{-\infty }^{\infty }}\left|f(x|\hat{\boldsymbol{\theta }})\right|dx+{\int _{-\infty }^{\infty }}\left|f(x|E(\boldsymbol{\theta }))\right|dx\\ {} & ={\int _{-\infty }^{\infty }}f(x|\hat{\boldsymbol{\theta }})dx+{\int _{-\infty }^{\infty }}f(x|E(\boldsymbol{\theta }))dx\\ {} & =1+1=2\lt \infty .\end{aligned}\]
which means that $g(x)$ is an integrable function. This implies ${\lim \nolimits_{a\to \infty }}{\textstyle\int _{a}^{\infty }}|g(x)|dx=0$, i.e.
\[ \underset{a\to \infty }{\lim }{\int _{a}^{\infty }}\left|f(x|\hat{\boldsymbol{\theta }})-f(x|E(\boldsymbol{\theta }))\right|dx=0\]
Thus, for every $\epsilon \gt 0$ there exists a such that
\[ {\int _{a}^{\infty }}\left|f(x|\hat{\boldsymbol{\theta }})-f(x|E(\boldsymbol{\theta }))\right|dx\lt \epsilon .\]
which implies that there exists a such that
\[ \left|\varphi [\hat{\boldsymbol{\theta }}]-\varphi [E(\boldsymbol{\theta })]\right|\lt \epsilon .\]
Thus $-\epsilon \lt \varphi [\hat{\boldsymbol{\theta }}]-\varphi [E(\boldsymbol{\theta })]\lt \epsilon $ or we could write
\[ \varphi [E(\boldsymbol{\theta })]-\epsilon \lt \varphi [\hat{\boldsymbol{\theta }}]\lt \varphi [E(\boldsymbol{\theta })]+\epsilon .\]
Equality in Jensen’s inequality holds only if φ coincides with an affine function on the support of the posterior (or the posterior is degenerate). If our function $\varphi (\boldsymbol{\theta })$ is strictly convex, which is true for most of the cases that we encounter, then Jensen’s inequality is also strict, i.e. $\varphi [E(\boldsymbol{\theta })]\lt E[\varphi (\boldsymbol{\theta })]$. This implies that there exists $\epsilon \gt 0$ such that
\[ \varphi [E(\boldsymbol{\theta })]+\epsilon \le E[\varphi (\boldsymbol{\theta })].\]
Hence, for this ϵ, as $a\to \infty $ we have
\[ \varphi [\hat{\boldsymbol{\theta }}]\lt \varphi [E(\boldsymbol{\theta })]+\epsilon \le E[\varphi (\boldsymbol{\theta })],\]
i.e. ${P_{F}}(X\gt a)\le {P_{B}}(X\gt a)$ asymptotically.
□
2.2 Taylor Expansion Examination
In this section, we will use Taylor series for the tail probability to check the results we got in the previous section. Let
\[\begin{aligned}{}D(a)& ={P_{B}}(X\gt a)-{P_{F}}(X\gt a)\\ {} & ={\int _{\boldsymbol{\Theta }}}[1-F(a|\boldsymbol{\theta })]\pi (\boldsymbol{\theta }|a)d\boldsymbol{\theta }-\left(1-F(a|\hat{\boldsymbol{\theta }})\right)\end{aligned}\]
which is the difference of the tail probabilities between the Bayesian and the frequentist estimators. The Taylor series of $1-F(a|\boldsymbol{\theta })$ at the MLE of $\boldsymbol{\theta }$, $\hat{\boldsymbol{\theta }}$, is given as
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle \varphi (\boldsymbol{\theta })& \displaystyle =& \displaystyle 1-F(a|\boldsymbol{\theta })\\ {} & \displaystyle =& \displaystyle 1-F(a|\hat{\boldsymbol{\theta }})-{\left.{\nabla _{\boldsymbol{\theta }}}F(a|\boldsymbol{\theta })\right|_{\boldsymbol{\theta }=\hat{\boldsymbol{\theta }}}}(\boldsymbol{\theta }-\hat{\boldsymbol{\theta }})-\cdots \\ {} & & \displaystyle \hspace{28.45274pt}\frac{1}{2}{\left.{H_{\boldsymbol{\theta }}}\left(F(a|\boldsymbol{\theta })\right)\right|_{\boldsymbol{\theta }=\hat{\boldsymbol{\theta }}}}{(\boldsymbol{\theta }-\hat{\boldsymbol{\theta }})^{2}}-R(\boldsymbol{\theta })\\ {} \displaystyle R(\boldsymbol{\theta })& \displaystyle =& \displaystyle \frac{1}{6}{\left.{D_{\boldsymbol{\theta }}^{3}}\left(F(a|\boldsymbol{\theta })\right)\right|_{\boldsymbol{\theta }={\boldsymbol{\theta }_{L}}}}{(\boldsymbol{\theta }-\hat{\boldsymbol{\theta }})^{3}}\end{array}\]
where ${\boldsymbol{\theta }_{L}}$ is between $\boldsymbol{\theta }$ and $\hat{\boldsymbol{\theta }}$, and where ${\nabla _{\boldsymbol{\theta }}}F(a|\boldsymbol{\theta })$ is the gradient of $F(a|\boldsymbol{\theta })$ such that
\[ {\left({\nabla _{\boldsymbol{\theta }}}F(a|\boldsymbol{\theta })\right)_{i}}=\frac{\partial }{\partial {\theta _{i}}}F(a|\boldsymbol{\theta })\]
and ${H_{\boldsymbol{\theta }}}$ is the Hessian matrix of dimension $|\boldsymbol{\theta }|\times |\boldsymbol{\theta }|$ such that
\[ {H_{ml}}=\frac{{\partial ^{2}}}{\partial {\theta _{m}}\partial {\theta _{l}}}F(a|\boldsymbol{\theta })\]
and ${D_{\boldsymbol{\theta }}^{3}}\left(F(a|\boldsymbol{\theta })\right)$ denotes the third partial derivatives of $F(a|\boldsymbol{\theta })$ with respect to $\boldsymbol{\theta }$, defined in a similar manner. Then $D(a)$ can be rewritten as
\[\begin{aligned}{}D(a)& ={\int _{\boldsymbol{\Theta }}}[1-F(a|\boldsymbol{\theta })]\pi (\boldsymbol{\theta }|a)d\boldsymbol{\theta }-\left(1-F(a|\hat{\boldsymbol{\theta }})\right)\\ {} & =-{\left.{\nabla _{\boldsymbol{\theta }}}F(a|\boldsymbol{\theta })\right|_{\hat{\boldsymbol{\theta }}}}{\int _{\Theta }}\pi (\boldsymbol{\theta }|a)(\boldsymbol{\theta }-\hat{\boldsymbol{\theta }})d\boldsymbol{\theta }-\cdots \\ {} & \hspace{1em}\hspace{2.5pt}\frac{1}{2}{\left.{H_{\boldsymbol{\theta }}}\left(F(a|\boldsymbol{\theta })\right)\right|_{\hat{\boldsymbol{\theta }}}}{\int _{-\infty }^{\infty }}\pi (\boldsymbol{\theta }|a){(\boldsymbol{\theta }-\hat{\boldsymbol{\theta }})^{2}}d\boldsymbol{\theta }-{R^{\ast }}(\boldsymbol{\theta })\\ {} & =-{\left.{\nabla _{\boldsymbol{\theta }}}F(a|\boldsymbol{\theta })\right|_{\hat{\boldsymbol{\theta }}}}{E^{\pi (\boldsymbol{\theta }|a)}}(\boldsymbol{\theta }-\hat{\boldsymbol{\theta }})-\cdots \\ {} & \hspace{1em}\hspace{2.5pt}\frac{1}{2}{\left.{H_{\boldsymbol{\theta }}}\left(F(a|\boldsymbol{\theta })\right)\right|_{\hat{\boldsymbol{\theta }}}}{E^{\pi (\boldsymbol{\theta }|a)}}{(\boldsymbol{\theta }-\hat{\boldsymbol{\theta }})^{2}}-{R^{\ast }}(\boldsymbol{\theta })\end{aligned}\]
Here we simplify the notation by writing $dF(a|\boldsymbol{\theta })/d\boldsymbol{\theta }{\left.\right|_{\boldsymbol{\theta }=\hat{\boldsymbol{\theta }}}}={F^{\prime }}(a|\hat{\boldsymbol{\theta }})$, ${d^{2}}F(a|\boldsymbol{\theta })/d{\boldsymbol{\theta }^{2}}{\left.\right|_{\boldsymbol{\theta }=\hat{\boldsymbol{\theta }}}}={F^{\prime\prime }}(a|\hat{\boldsymbol{\theta }})$, and
\[\begin{aligned}{}{R^{\ast }}(\boldsymbol{\theta })& =\frac{1}{6}{\left.{D_{\boldsymbol{\theta }}^{3}}\left(F(a|\boldsymbol{\theta })\right)\right|_{{\boldsymbol{\theta }_{L}}}}{\int _{-\infty }^{\infty }}\pi (\boldsymbol{\theta }|x){(\boldsymbol{\theta }-\hat{\boldsymbol{\theta }})^{3}}d\boldsymbol{\theta }\\ {} & =\frac{1}{6}{\left.{D_{\boldsymbol{\theta }}^{3}}\left(F(a|\boldsymbol{\theta })\right)\right|_{{\boldsymbol{\theta }_{L}}}}{E^{\pi (\boldsymbol{\theta }|x)}}{(\boldsymbol{\theta }-\hat{\boldsymbol{\theta }})^{3}}\end{aligned}\]
In order for $\varphi (\boldsymbol{\theta })$ to be convex and $D(a)$ to be positive, we need the first term ${\left.{\nabla _{\boldsymbol{\theta }}}F(a|\boldsymbol{\theta })\right|_{\hat{\boldsymbol{\theta }}}}{E^{\pi (\boldsymbol{\theta }|a)}}(\boldsymbol{\theta }-\hat{\boldsymbol{\theta }})$ and the third term ${R^{\ast }}(\boldsymbol{\theta })$ to go to zero asymptotically, and the second term ${\left.{H_{\boldsymbol{\theta }}}\left(F(a|\boldsymbol{\theta })\right)\right|_{\hat{\boldsymbol{\theta }}}}{E^{\pi (\boldsymbol{\theta }|a)}}{(\boldsymbol{\theta }-\hat{\boldsymbol{\theta }})^{2}}$ to be negative as $a\to \infty $. In what follows, we show examples with specific distributions widely used in extreme value analysis where this happens. We conjecture that this is true for a broad set of cumulative distribution functions, since it holds in all the examples we tried; this would be an interesting open problem to solve in the future.

Example 1 (Exponential distribution).
The density of the exponential distribution is given by
\[ f(x|\lambda )=\frac{1}{\lambda }{e^{-\frac{x}{\lambda }}},\]
where $x\ge 0$ and $\lambda \gt 0$. So the tail probability is
\[ \varphi (\lambda )=1-F(a|\lambda )={e^{-\frac{a}{\lambda }}}.\]
Taking derivatives with respect to λ we have
\[\begin{array}{l}\displaystyle -\frac{d}{d\lambda }F(a|\lambda )=\frac{a}{{\lambda ^{2}}}{e^{-\frac{a}{\lambda }}}\\ {} \displaystyle -\frac{{d^{2}}}{d{\lambda ^{2}}}F(a|\lambda )=\left(-\frac{2a}{{\lambda ^{3}}}+\frac{{a^{2}}}{{\lambda ^{4}}}\right){e^{-\frac{a}{\lambda }}}\\ {} \displaystyle -\frac{{d^{3}}}{d{\lambda ^{3}}}F(a|\lambda )=\left(\frac{6a}{{\lambda ^{4}}}-\frac{6{a^{2}}}{{\lambda ^{5}}}+\frac{{a^{3}}}{{\lambda ^{6}}}\right){e^{-\frac{a}{\lambda }}}\end{array}\]
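These derivatives can be checked symbolically; a short SymPy sketch:

```python
# Sketch: symbolic check of the first three derivatives of F(a|lambda) = 1 - exp(-a/lambda)
# with respect to lambda, matching the expressions displayed above.
import sympy as sp

a, lam = sp.symbols('a lambda', positive=True)
F = 1 - sp.exp(-a / lam)

for k in (1, 2, 3):
    print(sp.simplify(-sp.diff(F, lam, k)))
# Expected (up to algebraic rearrangement):
#   a*exp(-a/lambda)/lambda**2
#   (a**2/lambda**4 - 2*a/lambda**3)*exp(-a/lambda)
#   (a**3/lambda**6 - 6*a**2/lambda**5 + 6*a/lambda**4)*exp(-a/lambda)
```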
Suppose we have an i.i.d. sample $\mathbf{x}=({x_{1}},\dots ,{x_{n}})$ from $f(x|\lambda )$; then the marginal distribution can be calculated as
\[ m(\mathbf{x})={\int _{0}^{\infty }}f(\mathbf{x}|\lambda ){\pi _{J}}(\lambda )d\lambda =\Gamma (n){\Big({\sum \limits_{i=1}^{n}}{x_{i}}\Big)^{-n}},\]
where we use the Jeffreys prior ${\pi _{J}}(\lambda )\propto 1/\lambda $, from which the posterior distribution is obtained as
\[ \pi (\lambda |\mathbf{x})=\frac{1}{{\lambda ^{(n+1)}}\Gamma (n)}\exp \left(-\frac{{\textstyle\textstyle\sum _{i=1}^{n}}{x_{i}}}{\lambda }\right){\left({\sum \limits_{i=1}^{n}}{x_{i}}\right)^{n}}\]
After some arithmetic manipulation and using $\hat{\lambda }=\bar{x}$ we obtain
\[\begin{aligned}{}{E^{\pi (\lambda |\mathbf{x})}}(\lambda -\hat{\lambda })& ={\int _{0}^{\infty }}(\lambda -\hat{\lambda })\pi (\lambda |\mathbf{x})d\lambda =\frac{\bar{x}}{(n-1)},\\ {} {E^{\pi (\lambda |\mathbf{x})}}{(\lambda -\hat{\lambda })^{2}}& ={\int _{0}^{\infty }}{(\lambda -\hat{\lambda })^{2}}\pi (\lambda |\mathbf{x})d\lambda \\ {} & ={\bar{x}^{2}}\frac{n+2}{(n-1)(n-2)},\\ {} {E^{\pi (\lambda |\mathbf{x})}}{(\lambda -\hat{\lambda })^{3}}& ={\int _{0}^{\infty }}{(\lambda -\hat{\lambda })^{3}}\pi (\lambda |\mathbf{x})d\lambda \\ {} & ={\bar{x}^{3}}\frac{7n+6}{(n-1)(n-2)(n-3)}.\end{aligned}\]
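These posterior moments can be checked numerically by quadrature against the inverse-gamma posterior; the values of n and $\bar{x}$ in the sketch below are illustrative assumptions.

```python
# Sketch: check the posterior central moments about lambda_hat = xbar against the
# closed forms above, using the fact that pi(lambda | x) is inverse-gamma with
# shape n and scale S = n*xbar. The values of n and xbar are illustrative assumptions.
import numpy as np
from scipy import stats
from scipy.integrate import quad

n, xbar = 30, 10.0
S = n * xbar
post = stats.invgamma(n, scale=S)   # posterior pi(lambda | x)

for k, closed in [(1, xbar / (n - 1)),
                  (2, xbar**2 * (n + 2) / ((n - 1) * (n - 2))),
                  (3, xbar**3 * (7 * n + 6) / ((n - 1) * (n - 2) * (n - 3)))]:
    numeric, _ = quad(lambda lam: (lam - xbar) ** k * post.pdf(lam), 0, np.inf)
    print(k, numeric, closed)       # each pair should agree closely
```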
Plugging these terms into $D(a)$, we have
\[\begin{aligned}{}D(a)& \hspace{0.1667em}=\hspace{0.1667em}-{\left.\frac{d}{d\lambda }F(a|\lambda )\right|_{\hat{\lambda }}}{E^{\pi (\lambda |\mathbf{x})}}(\lambda -\hat{\lambda })-\cdots \\ {} & \hspace{1em}\hspace{2.5pt}\frac{1}{2}{\left.\frac{{d^{2}}}{d{\lambda ^{2}}}F(a|\lambda )\right|_{\hat{\lambda }}}{E^{\pi (\lambda |\mathbf{x})}}{(\lambda -\hat{\lambda })^{2}}-\cdots \\ {} & \hspace{1em}\hspace{2.5pt}\frac{1}{6}{\left.\frac{{d^{3}}}{d{\lambda ^{3}}}F(a|\lambda )\right|_{{\lambda _{L}}}}{E^{\pi (\lambda |\mathbf{x})}}{(\lambda -\hat{\lambda })^{3}}\\ {} & \hspace{0.1667em}=\hspace{0.1667em}{e^{-\frac{a}{\bar{x}}}}\left[\frac{a}{\bar{x}}\frac{-4}{(n-1)(n-2)}+\frac{{a^{2}}}{2{\bar{x}^{2}}}\frac{n+2}{(n-1)(n-2)}\right]+\cdots \\ {} & \hspace{1em}\hspace{2.5pt}{e^{-\frac{a}{{\lambda _{L}}}}}\left(\frac{{a^{3}}\hspace{0.1667em}-\hspace{0.1667em}6{\lambda _{L}}{a^{2}}\hspace{0.1667em}+\hspace{0.1667em}6{\lambda _{L}^{2}}a}{6{\lambda _{L}^{6}}}\right)\frac{{\bar{x}^{3}}(7n\hspace{0.1667em}+\hspace{0.1667em}6)}{(n\hspace{0.1667em}-\hspace{0.1667em}1)(n\hspace{0.1667em}-\hspace{0.1667em}2)(n\hspace{0.1667em}-\hspace{0.1667em}3)}\end{aligned}\]
Here $a\gg 0$ and $\bar{x}\ge 0$; ${\lambda _{L}}$ is some number between λ and $\hat{\lambda }=\bar{x}$, so ${\lambda _{L}}\ge 0$. Since a is extremely large, the positive leading powers of a dominate each bracket, which implies $D(a)\ge 0$.
Next, to show that ${\lim \nolimits_{a\to \infty }}{R^{\ast }}(\lambda )=0$, it is sufficient to show that
\[ \underset{a\to \infty }{\lim }{\left.\frac{{d^{3}}F(a|\lambda )}{d{\lambda ^{3}}}\right|_{{\lambda _{L}}}}=0.\]
and this follows simply by using L’Hospital’s rule. In conclusion, the second derivative of $F(a|\lambda )$ for the exponential distribution $\exp (\lambda )$ with respect to λ is
\[\begin{aligned}{}\frac{{d^{2}}}{d{\lambda ^{2}}}F(a|\lambda )& =\left(2-\frac{a}{\lambda }\right)\frac{a}{{\lambda ^{3}}}{e^{-\frac{a}{\lambda }}}\end{aligned}\]
Since a is assumed to be extremely large, in particular $a\gt 2\lambda $, this implies ${d^{2}}F(a|\lambda )/d{\lambda ^{2}}\le 0$, i.e. the tail probability $\varphi (\lambda )=1-F(a|\lambda )$ is convex.
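The convexity region can also be examined numerically. The sketch below, with an illustrative threshold a and grid of λ values, confirms that ${d^{2}}F(a|\lambda )/d{\lambda ^{2}}\le 0$ exactly when $\lambda \le a/2$.

```python
# Sketch: sign of d^2 F(a|lambda)/d lambda^2 = (2 - a/lambda) * (a/lambda**3) * exp(-a/lambda).
# It is non-positive precisely when lambda <= a/2, so for a fixed extreme threshold a
# the tail probability phi(lambda) = 1 - F(a|lambda) is convex wherever lambda <= a/2.
# The threshold a and the lambda grid below are illustrative choices.
import numpy as np

a = 200.0                            # illustrative extreme threshold
lam = np.linspace(0.5, 150.0, 300)   # illustrative grid of lambda values
d2F = (2 - a / lam) * (a / lam**3) * np.exp(-a / lam)

print(np.all(d2F[lam <= a / 2] <= 0))   # True: convex region
print(np.all(d2F[lam > a / 2] > 0))     # True: concavity beyond lambda = a/2
```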
Example 2 (Pareto Distribution).
Given the scale parameter $\beta =1$ and the shape parameter α as unknown, the pdf of the Pareto distribution is given by
\[ f(x|\alpha )=\alpha {x^{-(\alpha +1)}},\]
where $x\ge 1$ and $0\lt \alpha \lt 1$, and the cumulative distribution function is
\[ F(x|\alpha )=1-{x^{-\alpha }}.\]
By setting the derivative of the log-likelihood equal to zero we get the MLE of α as $\hat{\alpha }=n/{\textstyle\sum _{i=1}^{n}}\log {x_{i}}$. We are interested in calculating the tail probability when b is extremely large. Note that
\[ \varphi (\alpha )=1-F(b|\alpha )={b^{-\alpha }}.\]
Taking derivatives of $\varphi (\alpha )$ with respect to α we obtain
\[\begin{array}{l}\displaystyle -\frac{d}{d\alpha }F(b|\alpha )=-{b^{-\alpha }}\ln b;\\ {} \displaystyle -\frac{{d^{2}}}{d{\alpha ^{2}}}F(b|\alpha )={b^{-\alpha }}{(\ln b)^{2}};\\ {} \displaystyle -\frac{{d^{3}}}{d{\alpha ^{3}}}F(b|\alpha )=-{b^{-\alpha }}{(\ln b)^{3}}.\end{array}\]
Using the Jeffreys prior ${\pi _{J}}(\alpha )\propto 1/\alpha $, we have
\[ m(\mathbf{x})={\int _{0}^{1}}f(\mathbf{x}|\alpha ){\pi _{J}}(\alpha )d\alpha =\frac{\Gamma (n,0)-\Gamma (n,{\textstyle\textstyle\sum _{i=1}^{n}}\ln {x_{i}})}{{\textstyle\textstyle\prod _{i=1}^{n}}{x_{i}}{\left[{\textstyle\textstyle\sum _{i=1}^{n}}\ln {x_{i}}\right]^{n}}},\]
where the upper incomplete gamma function is defined as $\Gamma (s,x)={\textstyle\int _{x}^{\infty }}{t^{s-1}}{e^{-t}}dt$. Then the posterior distribution is given by
\[\begin{aligned}{}\pi (\alpha |\mathbf{x})& =\frac{L(\alpha |\mathbf{x}){\pi _{J}}(\alpha )}{m(\mathbf{x})}\\ {} & =\frac{{\alpha ^{n-1}}{\left({\textstyle\textstyle\prod _{i=1}^{n}}{x_{i}}\right)^{-\alpha }}{\left[{\textstyle\textstyle\sum _{i=1}^{n}}\ln {x_{i}}\right]^{n}}}{\Gamma (n,0)-\Gamma (n,{\textstyle\textstyle\sum _{i=1}^{n}}\ln {x_{i}})}\end{aligned}\]
Using the properties of the incomplete gamma function, and integration by parts we find the recurrence relation $\Gamma (s+1,x)=s\Gamma (s,x)+{x^{s}}{e^{-x}}$. We obtain
\[\begin{aligned}{}{E^{\pi (\alpha |\mathbf{x})}}(\alpha -\hat{\alpha })& ={\int _{0}^{1}}(\alpha -\hat{\alpha })\pi (\alpha |\mathbf{x})d\alpha \\ {} & =-\frac{{\left[{\textstyle\textstyle\sum _{i=1}^{n}}\ln {x_{i}}\right]^{n-1}}}{{\textstyle\textstyle\prod _{i=1}^{n}}{x_{i}}[\Gamma (n,0)-\Gamma (n,{\textstyle\textstyle\sum _{i=1}^{n}}\ln {x_{i}})]},\\ {} {E^{\pi (\alpha |\mathbf{x})}}{(\alpha -\hat{\alpha })^{2}}& =\frac{n}{{\left[{\textstyle\textstyle\sum _{i=1}^{n}}\ln {x_{i}}\right]^{2}}}+\cdots \\ {} & \frac{(n-1){({\textstyle\textstyle\sum _{i=1}^{n}}\ln {x_{i}})^{n-2}}-{({\textstyle\textstyle\sum _{i=1}^{n}}\ln {x_{i}})^{n-1}}}{{\textstyle\textstyle\prod _{i=1}^{n}}{x_{i}}[\Gamma (n,0)-\Gamma (n,{\textstyle\textstyle\sum _{i=1}^{n}}\ln {x_{i}})]},\\ {} {E^{\pi (\alpha |\mathbf{x})}}{(\alpha -\hat{\alpha })^{3}}& =\frac{2n}{{({\textstyle\textstyle\sum _{i=1}^{n}}\ln {x_{i}})^{3}}}-\cdots \end{aligned}\]
\[ \frac{{({\sum \limits_{i=1}^{n}}\ln {x_{i}})^{n-3}}[{n^{2}}+2+(2-2n)({\sum \limits_{i=1}^{n}}\ln {x_{i}})+{({\sum \limits_{i=1}^{n}}\ln {x_{i}})^{2}}]}{({\textstyle\textstyle\prod _{i=1}^{n}}{x_{i}})[\Gamma (n,0)-\Gamma (n,{\textstyle\textstyle\sum _{i=1}^{n}}\ln {x_{i}})]}.\]
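As a sanity check, these expressions can be compared with direct quadrature over the posterior, which is a gamma density truncated to $(0,1)$. In the sketch below we fix n and $S={\textstyle\sum _{i=1}^{n}}\ln {x_{i}}$ directly instead of simulating data; the values are illustrative assumptions.

```python
# Sketch: numerical check of the posterior moments of (alpha - alpha_hat) for the
# Pareto(beta = 1) model under the Jeffreys prior. Instead of simulating data we fix
# n and S = sum(log x_i) directly (illustrative values); the posterior is
# pi(alpha | x) proportional to alpha**(n-1) * exp(-alpha*S) on (0, 1).
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma, gammainc

n, S = 20, 40.0
alpha_hat = n / S                        # MLE of alpha
D = gamma(n) * gammainc(n, S)            # Gamma(n, 0) - Gamma(n, S)
norm = S ** n / D                        # normalizing constant of the posterior
post = lambda alpha: norm * alpha ** (n - 1) * np.exp(-alpha * S)

closed = [
    -S ** (n - 1) * np.exp(-S) / D,
    n / S**2 + ((n - 1) * S ** (n - 2) - S ** (n - 1)) * np.exp(-S) / D,
    2 * n / S**3 - S ** (n - 3) * (n**2 + 2 + (2 - 2 * n) * S + S**2) * np.exp(-S) / D,
]
for k, c in zip((1, 2, 3), closed):
    numeric, _ = quad(lambda a_: (a_ - alpha_hat) ** k * post(a_), 0, 1)
    print(k, numeric, c)                 # each pair should agree closely
```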
To show $D(b)\ge 0$, it is equivalent to show that the first term in the expression of $D(b)$, after plugging in the Taylor expansion of $1-F(b|\alpha )$, goes to zero as $b\to \infty $, which can be obtained using L’Hospital’s rule, and that the second term ${\left.{d^{2}}/d{\alpha ^{2}}F(b|\alpha )\right|_{\hat{\alpha }}}{E^{\pi (\alpha |\mathbf{x})}}{(\alpha -\hat{\alpha })^{2}}$ is asymptotically negative. The latter follows from the fact that ${\left.{d^{2}}/d{\alpha ^{2}}F(b|\alpha )\right|_{\hat{\alpha }}}=-{b^{-\hat{\alpha }}}{(\ln b)^{2}}\le 0$ and ${E^{\pi (\alpha |\mathbf{x})}}{(\alpha -\hat{\alpha })^{2}}\ge 0$.
Then, to show that ${\lim \nolimits_{b\to \infty }}{R^{\ast }}(\alpha )=0$, it is sufficient to show that
\[ \underset{b\to \infty }{\lim }{\left.\frac{{d^{3}}}{d{\alpha ^{3}}}F(b|\alpha )\right|_{{\alpha _{L}}}}=0.\]
and this follows simply by using L’Hospital’s rule. In conclusion, the second derivative for the Pareto distribution is
\[\begin{aligned}{}\frac{{d^{2}}}{d{\alpha ^{2}}}F(b|\alpha )& =-{b^{-\alpha }}{(\ln b)^{2}}\end{aligned}\]
Since $-{b^{-\alpha }}{(\ln b)^{2}}\le 0$ for the extremely large b under consideration, we have ${d^{2}}F(b|\alpha )/d{\alpha ^{2}}\le 0$, i.e. the tail probability $\varphi (\alpha )=1-F(b|\alpha )$ is convex.
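A numerical comparison of the Bayesian and plug-in tail estimates for this Pareto model, using the same illustrative n and S as above, is sketched below.

```python
# Sketch: Bayesian vs. plug-in tail probability P(X > b) = b**(-alpha) for the
# Pareto(beta = 1) model, with illustrative n and S = sum(log x_i).
# P_B integrates b**(-alpha) against the posterior; P_F plugs in alpha_hat = n/S.
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma, gammainc

n, S = 20, 40.0
alpha_hat = n / S
norm = S ** n / (gamma(n) * gammainc(n, S))
post = lambda alpha: norm * alpha ** (n - 1) * np.exp(-alpha * S)

for b in [1e3, 1e6, 1e9]:
    p_bayes, _ = quad(lambda a_: b ** (-a_) * post(a_), 0, 1)
    p_freq = b ** (-alpha_hat)
    print(f"b = {b:.0e}   P_B = {p_bayes:.3e}   P_F = {p_freq:.3e}   ratio = {p_bayes/p_freq:.3g}")
# The ratio P_B/P_F grows with b, as in the exponential example.
```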
Example 3 (Normal Distribution).
The normal distribution with unknown expectation μ and standard deviation σ is a case where the parameter is a two-dimensional vector, i.e. $\boldsymbol{\theta }=(\mu ,\sigma )$. Since $x|\mu ,\sigma \sim N(\mu ,{\sigma ^{2}})$, the CDF evaluated at $x=a$ is
\[ F(a|\mu ,\sigma )=\frac{1}{2}\left[1+\textit{erf}\left(\frac{a-\mu }{\sigma \sqrt{2}}\right)\right]=\frac{1}{2}+\frac{1}{\sqrt{\pi }}{\int _{0}^{\frac{a-\mu }{\sigma \sqrt{2}}}}{e^{-{t^{2}}}}dt\]
where $\textit{erf}(x)$ is the error function, defined as $\textit{erf}(x)=2/\sqrt{\pi }{\textstyle\int _{0}^{x}}{e^{-{t^{2}}}}dt$. Examining the Hessian matrix, we have
\[ H=\left[\begin{array}{c@{\hskip10.0pt}c}\frac{{\partial ^{2}}}{\partial {\mu ^{2}}}F(a|\mu ,\sigma )& \frac{{\partial ^{2}}}{\partial \sigma \partial \mu }F(a|\mu ,\sigma )\\ {} \frac{{\partial ^{2}}}{\partial \mu \partial \sigma }F(a|\mu ,\sigma )& \frac{{\partial ^{2}}}{\partial {\sigma ^{2}}}F(a|\mu ,\sigma )\end{array}\right],\]
where
\[\begin{aligned}{}\frac{{\partial ^{2}}}{\partial {\mu ^{2}}}F(a|\mu ,\sigma )& =-\frac{a-\mu }{\sqrt{2\pi }{\sigma ^{3}}}{e^{-{\left(\frac{a-\mu }{\sigma \sqrt{2}}\right)^{2}}}},\\ {} \frac{{\partial ^{2}}}{\partial \mu \partial \sigma }F(a|\mu ,\sigma )& =\frac{{\partial ^{2}}}{\partial \sigma \partial \mu }F(a|\mu ,\sigma )=-\frac{1}{\sqrt{2\pi }{\sigma ^{2}}}{e^{-{\left(\frac{a-\mu }{\sigma \sqrt{2}}\right)^{2}}}}\left[\frac{{(a-\mu )^{2}}}{{\sigma ^{2}}}-1\right],\\ {} \frac{{\partial ^{2}}}{\partial {\sigma ^{2}}}F(a|\mu ,\sigma )& =-\frac{a-\mu }{\sqrt{2\pi }{\sigma ^{3}}}{e^{-{\left(\frac{a-\mu }{\sigma \sqrt{2}}\right)^{2}}}}\left[\frac{{(a-\mu )^{2}}}{{\sigma ^{2}}}-2\right].\end{aligned}\]
To show that H is negative definite for large a, we need to show that for all ${\mathbf{v}^{T}}=({v_{1}},{v_{2}})\ne \mathbf{0}$ we have
\[ {\mathbf{v}^{T}}H\mathbf{v}\lt 0.\]
By tedious calculation we have
\[\begin{aligned}{}{\mathbf{v}^{T}}H\mathbf{v}& =-\frac{1}{\sqrt{2\pi }{\sigma ^{2}}}{e^{-{\left(\frac{a-\mu }{\sigma \sqrt{2}}\right)^{2}}}}\left[({v_{1}^{2}}-2{v_{2}^{2}})\left(\frac{a-\mu }{\sigma }\right)+\right.\dots \\ {} & \left.\hspace{28.45274pt}2{v_{1}}{v_{2}}{\left(\frac{a-\mu }{\sigma }\right)^{2}}+{v_{2}^{2}}{\left(\frac{a-\mu }{\sigma }\right)^{3}}-2{v_{1}}{v_{2}}\right]\end{aligned}\]
Since a is assumed to be an extremely large number, $a-\mu \gt 0$ and the leading term in the bracket is
\[ {v_{2}^{2}}{\left(\frac{a-\mu }{\sigma }\right)^{3}}\hspace{2.5pt}\text{(or}\hspace{2.5pt}{v_{1}^{2}}\frac{a-\mu }{\sigma }\hspace{2.5pt}\text{when}\hspace{2.5pt}{v_{2}}=0\text{),}\]
which is positive. Hence, ${\mathbf{v}^{T}}H\mathbf{v}\lt 0$, i.e. H is negative definite for large a as we expected. In other words, $\varphi (\mu ,\sigma )=1-F(a|\mu ,\sigma )$ is a convex function.
3 Conclusions
Bayesian and frequentist estimates of the tail probability sometimes differ enormously, which can have serious consequences. In practice we therefore have to take both results into consideration. More specifically, the Bayesian method usually estimates the tail probability to be higher than the frequentist method does. The Bayesian estimation of tail probabilities is well founded on probability theory: it is a marginal computation that integrates out the parameters of the tail. The frequentist estimation, on the other hand, is an approximation.
By looking at the Taylor expansion of the tail and investigating the convexity of the distribution function, we conclude that whether the Bayesian estimator of the tail probability is higher than the frequentist estimator depends on the shape of $\varphi (\boldsymbol{\theta })=1-F(a|\boldsymbol{\theta })$. The condition that $\varphi (\boldsymbol{\theta })$ is a strictly convex function is equivalent to ${H_{\boldsymbol{\theta }}}F(a|\boldsymbol{\theta })\lt 0$. Other examples (continuous distributions with infinite support) such as the Cauchy, logistic, log-normal, double exponential and Weibull distributions also satisfy the convexity conditions derived here.
In general, however, convexity is a much stronger requirement than what Jensen’s inequality needs: $\varphi (\boldsymbol{\theta })=1-F(a|\boldsymbol{\theta })$ being convex, or equivalently ${H_{\boldsymbol{\theta }}}F(a|\boldsymbol{\theta })\lt 0$, is only a sufficient condition for the inequality to hold, not a necessary one. There are distributions with ${H_{\boldsymbol{\theta }}}F(a|\boldsymbol{\theta })\ge 0$ for which the Bayesian estimator of the tail probability is still higher than the frequentist approximation. In [10], conditions on the random variable are found that make the other direction work; this will be discussed in our future work.