1 Introduction
Function estimation using shrinkage estimators in the wavelet domain was initiated in the work of David Donoho and Iain Johnstone in the early 1990s. Since then, various shrinkage methods have been proposed. In this article, we propose and investigate a novel approach to estimation in the wavelet domain. The approach brings together the classic principle of Γ-minimaxity, Huber's famous ϵ-contamination class [23], the mathematics of decision theory, and a detailed experimental and graphical analysis of the performance of our proposal. In this way, the article offers a comprehensive treatment of the problem of estimation in the wavelet domain.
For the rest of this introductory section, we review the fundamentals of Γ-minimax estimation, wavelet shrinkage, and Bayesian approaches to wavelet shrinkage. This section is mostly expository and provides the necessary background and motivation.
1.1 Γ-Minimax Theory
The Γ-minimax paradigm, originally proposed by Robbins [29], deals with the problem of selecting decision rules in statistical inference. The Γ-minimax approach falls between the Bayes paradigm, which selects procedures that work well “on average a posteriori” for a specific prior, and the minimax paradigm, which guards against least favorable outcomes, however unlikely. The approach has evolved from seminal papers in the fifties [29, 21] and early sixties, through extensive research on foundations and parametric families in the seventies, to a branch of Bayesian robustness theory in the eighties and nineties. A comprehensive discussion of the Γ-minimax approach can be found in [5, 6]; see also the fundamental contribution [8].
The Γ-minimax paradigm incorporates prior information about the statistical model through a family of plausible priors, denoted by Γ, rather than through a single prior. Elicitation of “prior families” is often encountered in practice. Given the family of priors, the decision maker selects a rule that is optimal with respect to the least favorable prior in the family.
We note that least favorable priors typically do not exist on unbounded parameter spaces. For example, there is no least favorable prior for estimating an unconstrained normal or Poisson mean. A least favorable prior does exist for estimating a binomial parameter, whose parameter space is compact, but no amount of intuition can predict its form. Moreover, as soon as the loss is changed, say to normalized squared error, the least favorable prior changes. Intuition about least favorable priors is not portable across problems; the matter is delicate and problem specific. For example, in a quite general setup, Clarke and Barron [13] show that Jeffreys' prior is asymptotically least favorable when the loss is the Kullback–Leibler distance, thus providing theoretical justification for reference priors.
Inference of this kind is often interpreted in terms of game theory. Formally, if $\mathcal{D}$ is the set of all decision rules and Γ is a family of prior distributions over the parameter space Θ, then a rule ${\delta ^{\ast }}\in \mathcal{D}$ is Γ-minimax if
(1.1)
\[ \underset{\delta \in \mathcal{D}}{\inf }\underset{\pi \in \Gamma }{\sup }r(\pi ,\delta )=\underset{\pi \in \Gamma }{\sup }r(\pi ,{\delta ^{\ast }})=r({\pi ^{\ast }},{\delta ^{\ast }}),\]
where $r(\pi ,\delta )={E^{\theta }}\left[{E^{X\mid \theta }}\mathcal{L}(\theta ,\delta )\right]={E^{\theta }}R(\theta ,\delta )$ is the Bayes risk under the loss $\mathcal{L}(\theta ,\delta )$. Here $R(\theta ,\delta )={E^{X\mid \theta }}\mathcal{L}(\theta ,\delta )$ denotes the frequentist risk of δ, ${\pi ^{\ast }}$ is the least favorable prior, and $\mathcal{L}(\theta ,\delta )$ is the loss function, often the squared error loss ${(\theta -\delta )^{2}}$. Note that when Γ is the set of all priors, the Γ-minimax rule coincides with the minimax rule, if a minimax rule exists; when Γ contains a single prior, the Γ-minimax rule coincides with the Bayes rule with respect to that prior. When the decision problem, viewed as a statistical game, has a value, that is, when ${\inf _{\delta \in \mathcal{D}}}{\sup _{\pi \in \Gamma }}r(\pi ,\delta )={\sup _{\pi \in \Gamma }}{\inf _{\delta \in \mathcal{D}}}r(\pi ,\delta )$, the Γ-minimax solution coincides with the Bayes rule with respect to the least favorable prior. For the interplay between the Γ-minimax and Bayesian paradigms, see [6]. A review of Γ-minimax estimation can be found in [32].

In a nutshell, Γ-minimax is a philosophical compromise between subjective Bayes analysis with a single prior and the conservative approach of not having a prior at all.
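To make the inf-sup problem in (1.1) concrete, here is a toy numerical illustration of our own construction (not from the paper): observe $X\sim N(\theta ,1)$, restrict attention to linear rules ${\delta _{a}}(x)=ax$, and let Γ be the symmetric two-point priors at $\pm t$, $t\in [0,1]$. A direct computation gives $r({\pi _{t}},{\delta _{a}})={a^{2}}+{(a-1)^{2}}{t^{2}}$, so the game can be solved by grid search.

```python
# Toy illustration of the inf-sup in (1.1), under assumptions NOT from the
# paper: X ~ N(theta, 1), linear rules delta_a(x) = a*x, and Gamma equal to
# the symmetric two-point priors at +-t with t in [0, 1].  Then
# r(pi_t, delta_a) = a^2 + (a - 1)^2 * t^2.
import numpy as np

a_grid = np.linspace(0, 1, 1001)   # candidate rules delta_a
t_grid = np.linspace(0, 1, 1001)   # candidate priors pi_t in Gamma

# Bayes risk matrix: variance a^2 plus squared bias (a - 1)^2 * t^2.
r = a_grid[:, None] ** 2 + (a_grid[:, None] - 1) ** 2 * t_grid[None, :] ** 2

worst = r.max(axis=1)              # sup over priors, for each rule
a_star = a_grid[worst.argmin()]    # Gamma-minimax rule
print(round(a_star, 3), round(worst.min(), 3))  # -> 0.5 0.5
```

The grid search recovers the saddle point ${a^{\ast }}=1/2$ with least favorable ${t^{\ast }}=1$ and game value $1/2$, matching the closed-form calculation.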
1.2 Wavelet Shrinkage
We apply the Γ-minimax approach to the classical nonparametric regression problem
(1.2)
\[ {y_{i}}=f({t_{i}})+\sigma {\varepsilon _{i}},\hspace{1em}i=1,\dots ,n,\]
where ${t_{i}}$, $i=1,\dots ,n$, are deterministic equispaced design points on $[0,1]$, the random errors ${\varepsilon _{i}}$ are i.i.d. standard normal random variables, and the noise level ${\sigma ^{2}}$ may or may not be known. The interest is in recovering the function f from the observations ${y_{i}}$. Additionally, we assume that the unknown signal f has bounded ${L^{2}}$-energy ${\textstyle\int _{[0,1]}}\hspace{0.1667em}{f^{2}}(t)\,dt$, and hence its wavelet coefficients take values in bounded intervals. After applying a linear orthogonal wavelet transform, the model in (1.2) becomes (see, e.g., [16])
(1.3)
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {c_{{J_{0}},k}}& \displaystyle =& \displaystyle {\theta _{{J_{0}},k}}+\sigma {\epsilon _{{J_{0}},k}},\hspace{2.5pt}\hspace{2.5pt}k=0,\dots ,{2^{{J_{0}}}}-1,\\ {} \displaystyle {d_{j,k}}& \displaystyle =& \displaystyle {\theta _{j,k}}+\sigma {\epsilon _{j,k}},\hspace{2.5pt}\hspace{2.5pt}j={J_{0}},\dots ,J-1,\\ {} & & \displaystyle k=0,\dots ,{2^{j}}-1,\end{array}\]
where ${d_{j,k}}$ (${c_{j,k}}$), ${\theta _{j,k}}$ and ${\epsilon _{j,k}}$ are the wavelet (scaling) coefficients at resolution level j and location k corresponding to y, f and ε, respectively; ${J_{0}}$ and $J-1$ are the coarsest and finest levels of detail in the wavelet decomposition. If the ϵ's are i.i.d. standard normal, an arbitrary wavelet coefficient from (1.3) can be modeled as
(1.4)
\[ d=\theta +\sigma \epsilon ,\]
where, due to the (approximate) independence of the coefficients, we omit the indices j, k. The prior information on the energy bound of the signal implies that a wavelet coefficient θ corresponding to the signal part in (1.2) takes its values in a bounded interval, say $\Theta =[-m(j),m(j)]$, which depends on the level j.

Wavelet shrinkage rules have been studied extensively in the literature, but mostly when no additional information on the parameter space Θ is available. For the implementation of wavelet methods in nonparametric regression problems, we also refer to [4], where several methods are described and numerically compared.
1.3 Bayesian Model in the Wavelet Domain
Bayesian shrinkage methods in the wavelet domain have received considerable attention in recent years. Depending on the prior, Bayes rules are shrinkage rules. The shrinkage process is defined as follows: a shrinkage rule is applied in the wavelet domain, and the observed wavelet coefficients d are replaced with their shrunken versions $\hat{\theta }=\delta (d)$. In the subsequent step, the coefficients are transformed back to the domain of the original data by the inverse wavelet transform, resulting in a smoothing of the data. The shape of the particular rule $\delta (\cdot )$ influences the denoising performance.
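The pipeline just described (transform, shrink the detail coefficients, invert) can be sketched in a few lines. The following is a minimal NumPy implementation using an orthonormal Haar transform; the shrinkage rule $\delta (\cdot )$ is plain soft thresholding, used only as a placeholder, not the Γ-minimax rule developed later in the paper.

```python
# Minimal sketch of the wavelet-shrinkage pipeline: forward transform,
# shrink detail coefficients, inverse transform.  Orthonormal Haar DWT in
# plain NumPy; soft thresholding stands in for the shrinkage rule delta(.).
import numpy as np

def haar_analysis(x, levels):
    """Forward Haar DWT: coarse scaling coefficients plus detail arrays."""
    details = []
    c = np.asarray(x, dtype=float)
    for _ in range(levels):
        a, b = c[0::2], c[1::2]
        details.append((a - b) / np.sqrt(2.0))  # detail (wavelet) coefficients
        c = (a + b) / np.sqrt(2.0)              # scaling coefficients
    return c, details

def haar_synthesis(c, details):
    """Inverse Haar DWT."""
    for d in reversed(details):
        out = np.empty(2 * c.size)
        out[0::2] = (c + d) / np.sqrt(2.0)
        out[1::2] = (c - d) / np.sqrt(2.0)
        c = out
    return c

rng = np.random.default_rng(0)
n = 1024
f = np.where(np.arange(n) < n // 2, 1.0, -1.0)   # step test signal
y = f + 0.5 * rng.standard_normal(n)             # noisy observations

c, details = haar_analysis(y, levels=6)
lam = 0.5 * np.sqrt(2 * np.log(n))               # illustrative threshold
details = [np.sign(d) * np.maximum(np.abs(d) - lam, 0.0) for d in details]
f_hat = haar_synthesis(c, details)

print(np.mean((f_hat - f) ** 2) < np.mean((y - f) ** 2))  # denoising helps
```

On this step signal, whose Haar detail coefficients are sparse, thresholding removes most of the noise while preserving the jump.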
Bayesian models on the wavelet coefficients have proved capable of incorporating some prior information about the unknown signal, such as smoothness, periodicity, sparseness, self-similarity and, for some particular bases (e.g., Haar), monotonicity.
The shrinkage is usually achieved by eliciting a single prior distribution π on the space of parameters Θ, and then choosing an estimator $\hat{\theta }=\delta (d)$ that minimizes the Bayes risk with respect to the adopted prior.
It is well known that most noiseless signals encountered in practical applications have, at each resolution level, empirical distributions of wavelet coefficients that are centered at and peaked at zero. A realistic Bayesian model that takes this prior knowledge into account should employ a prior distribution whose prior predictive distribution is in reasonable agreement with the observations. A realistic prior distribution on the wavelet coefficient θ is given by
(1.5)
\[ \pi (\theta )=\epsilon {\delta _{0}}+(1-\epsilon )\xi (\theta ),\]
where ${\delta _{0}}$ is a point mass at zero, ξ is a symmetric distribution on the parameter space Θ, and ϵ is a fixed number in $[0,1]$, usually level dependent, that regulates the amount of shrinkage for values of d close to 0. Priors for wavelet coefficients as in (1.5) have been considered by Vidakovic and Ruggeri [33]. Prior literature on wavelet shrinkage using Bayesian ideas includes important work by [12, 14, 30, 19], among others. The review by Reményi and Vidakovic [28] surveys a range of Bayesian wavelet shrinkage methods and provides numerous references on Bayesian wavelet shrinkage strategies.
Our approach differs from all of this prior literature. The motivation is as follows.

It is clear that specifying a single prior distribution π on the parameter space Θ can never be done exactly. Indeed, prior knowledge of real phenomena always contains uncertainty, and a multitude of prior distributions can match the prior belief; on the basis of partial knowledge about the signal, it is possible to elicit only a family of plausible priors, Γ. From a robust Bayesian point of view, the choice of a particular rule δ should not be influenced by the choice of a particular prior, as long as that prior is in agreement with our prior belief. Several approaches have been considered for ensuring the robustness of a specific rule, Γ-minimax being one compromise.
In this paper, we incorporate prior information on the boundedness of the energy of the signal (the ${L_{2}}$-norm of the regression function). Prior information on the energy bound often exists in real-life problems, and it can be modelled by the assumption that the parameter space Θ is bounded. Estimation of a bounded normal mean has been considered in numerous papers, e.g., [20, 7, 11, 15, 9, 17], and in the classic text of Ibragimov and Hasminskii [24].
It is well known that estimating a bounded normal mean is a challenging problem. In our context, even if the structure of the prior (1.5) is supported by the analysis of the empirical distribution of the wavelet coefficients, a precise elicitation of the distribution ξ cannot be done without some kind of approximation. Any symmetric distribution supported on the bounded set, say $[-m,m]$, is a possible candidate for ξ.
Let Γ denote the family
(2.1)
\[ \Gamma =\left\{\pi :\pi =\epsilon {\delta _{0}}+(1-\epsilon )\xi ,\hspace{2.5pt}\xi \in {\Gamma _{S[-m,m]}}\right\},\]
where ${\Gamma _{S[-m,m]}}$ is the class of all symmetric distributions supported on $[-m,m]$, ${\delta _{0}}$ is a point mass at zero, and ϵ is a fixed constant between 0 and 1. We also require that the distribution ξ have no atom at 0.
We consider two models; both assume that a wavelet coefficient follows a normal distribution (a statement about the distribution of the noise), $d\sim \mathcal{N}(\theta ,{\sigma ^{2}})$. In Model I, the variance of the noise is assumed to be known, while in Model II the variance is unknown and is given a prior distribution. We keep the treatment of Model II brief, since a detailed treatment would require substantial additional space.
The rest of the paper is organized as follows. Section 2 contains mathematical aspects and results concerning the Γ-minimax rules. An exact risk analysis of the rule is discussed in Section 3. Section 4 proposes methods for eliciting the hyperparameters defining the model. One could, for instance, “estimate” ϵ, and even m itself, from simulated data, but such estimates would vary from run to run and none of the theoretical results of this paper would then stand; this could be the topic of an entirely different paper. The performance of the shrinkage rule in the wavelet domain and an application to a data set are given in Section 5. Section 6 provides the conclusions. Proofs are mostly deferred to Appendices A and B.
2 Characterizing Γ-Minimax Rules
In this section, we discuss the existence and characterization of Γ-minimax shrinkage rules that are Bayes with respect to least favorable priors on the interval $[-m,m]$. It will turn out that there exists a critical ${m^{\ast }}$ such that for $m\le {m^{\ast }}$ the least favorable prior is a three-point prior; that is, it assigns mass ϵ to 0 and mass $(1-\epsilon )/2$ to each of the two boundary points $\pm m$. It is in this way that three-point priors enter our wavelet shrinkage analysis. We look at two scenarios: when the variance of the noise is known (Model I) and when it is not known (Model II). For reasons of space, we limit our treatment of Model II to a short exposition.
We emphasize that below we treat Γ-minimax estimation on very general compact convex sets in finite-dimensional Euclidean spaces. By giving a complete and self-contained derivation of the existence of a least favorable prior and of the Γ-minimax estimator in this generality, when the parameter space is sufficiently small, we provide a useful unification of the Γ-minimax problem.
Figure 1
Γ-Minimax Rule for Model I. Left: Rules for $m=3$ and $\epsilon =0.5,0.7$, and 0.9. Right: Rules for $\epsilon =0.9$ and $m=3,4$, and 5.
2.1 Model I
Theorem 1.

(a) Let ${\mathcal{S}_{0}}$ be a compact convex set in ${R^{p}}$ containing the origin $\mathbf{0}$ in its interior. Let $\mathbf{a}\in {R^{p}}$ and $b\gt 0$, define ${\mathcal{S}_{\mathbf{a},b}}=\mathbf{a}+b\hspace{0.1667em}{\mathcal{S}_{0}}$, and let $\partial \hspace{0.1667em}{\mathcal{S}_{\mathbf{a},b}}$ denote the boundary of ${\mathcal{S}_{\mathbf{a},b}}$. Let $\mathbf{X}\sim \mathbf{f}(\mathbf{x}\mid \boldsymbol{\theta })$, a density in the p-parameter exponential family, and consider estimation of $\boldsymbol{\theta }$ under squared error loss. Then there exists a unique least favorable prior on ${\mathcal{S}_{\mathbf{a},b}}$, as well as on $\partial \hspace{0.1667em}{\mathcal{S}_{\mathbf{a},b}}$.

(b) There exists ${b^{\ast }}\gt 0$ such that for $b\le {b^{\ast }}$, the least favorable prior on $\partial \hspace{0.1667em}{\mathcal{S}_{\mathbf{a},b}}$ is also the least favorable prior on ${\mathcal{S}_{\mathbf{a},b}}$.

(c) If, in particular, $\mathbf{X}\sim {\mathbf{N}_{\mathbf{p}}}(\boldsymbol{\theta },\mathbf{I})$, and θ belongs to a p-dimensional ball centered at $\mathbf{0}$ with radius b, then for $b\le {b^{\ast }}$ the least favorable prior on the ball is simply the uniform distribution on the boundary of the ball.

(d) In the one-dimensional case, for the entire one-parameter exponential family, if θ belongs to a compact interval $[u,v]$, the least favorable prior is necessarily finitely supported.

(e) In the one-dimensional case with $X\sim N(\theta ,1)$, let Γ be the family in (2.1), where ϵ is fixed in $[0,1]$ and ξ is any distribution on $[-m,m]$ without an atom at 0. Then there exists ${m^{\ast }}\gt 0$ such that for $0\lt m\le {m^{\ast }}$ the least favorable prior is
\[ {\pi ^{\ast }}(\theta )=\epsilon {\delta _{0}}+\frac{1-\epsilon }{2}({\delta _{-m}}+{\delta _{m}}),\]
and the corresponding Γ-minimax rule is the Bayes rule with respect to ${\pi ^{\ast }}$,
(2.2)
\[ {\delta ^{\ast }}(d)=\frac{m\left({e^{md}}-{e^{-md}}\right)}{{e^{md}}+\frac{2\epsilon }{1-\epsilon }{e^{{m^{2}}/2}}+{e^{-md}}}.\]
Remark.
Figure 1 shows the shrinkage rule in (2.2) for selected values of the parameters ϵ and m. Note that the rules heavily shrink small coefficients but, unlike traditional shrinkage rules, remain bounded between $-m$ and $m$. The values of ${m^{\ast }}$ are given in Table 1.
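As an illustration, the posterior-mean (Bayes) rule under the three-point prior $\pi =\epsilon {\delta _{0}}+\frac{1-\epsilon }{2}({\delta _{-m}}+{\delta _{m}})$ with $d\mid \theta \sim N(\theta ,1)$ can be evaluated directly; the closed form used below follows from the normal likelihood ratios (a sketch, with σ = 1 assumed).

```python
# Sketch: Bayes (posterior-mean) rule under the three-point prior
# pi = eps*delta_0 + ((1-eps)/2)*(delta_{-m} + delta_{m}), d | theta ~ N(theta, 1).
# Dividing posterior weights by phi(d) leaves the tilts exp(+-m*d - m^2/2).
import numpy as np

def three_point_rule(d, m, eps):
    d = np.asarray(d, dtype=float)
    num = m * (np.exp(m * d) - np.exp(-m * d))
    den = np.exp(m * d) + (2 * eps / (1 - eps)) * np.exp(m ** 2 / 2) + np.exp(-m * d)
    return num / den

d = np.linspace(-6, 6, 13)
delta = three_point_rule(d, m=3.0, eps=0.9)
print(np.all(np.abs(delta) < 3.0))       # rule stays inside (-m, m)
print(np.allclose(delta, -delta[::-1]))  # antisymmetric in d
```

The rule shrinks small d heavily toward 0 (the more so the larger ϵ) and saturates near $\pm m$ for large $|d|$, matching the shapes shown in Figure 1.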
Table 1
Values of ${m^{\ast }}$ for both models for different values of ϵ.
ϵ  ${m^{\ast }}$ (Model I)  ${m^{\ast }}$ (Model II) 
0.0  1.05674  0.81758 
0.1  1.15020  0.91678 
0.2  1.27739  1.05298 
0.3  1.46988  1.25773 
0.4  1.84922  1.52579 
0.5  2.28384  1.74714 
0.6  2.41918  1.91515 
0.7  2.50918  2.05511 
0.8  2.58807  2.19721 
0.9  2.69942  2.40872 
0.95  2.81605  2.63323 
0.99  3.10039  3.24539 
2.2 Model II
In Model II, the variance ${\sigma ^{2}}$ is not known and is given an exponential prior. It is well known that the exponential distribution is the entropy maximizer in the class of all distributions supported on ${R^{+}}$ with a fixed first moment. The choice is thus noninformative, in the form of a maximum-entropy prior.
The model is:
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}& & \displaystyle [d\mid \theta ,{\sigma ^{2}}]\sim \mathcal{N}(\theta ,{\sigma ^{2}}),\\ {} & & \displaystyle [{\sigma ^{2}}]\sim \mathcal{E}(\mu ),\hspace{2.5pt}\hspace{2.5pt}\hspace{2.5pt}\hspace{2.5pt}\hspace{2.5pt}\left[f({\sigma ^{2}})=\frac{1}{\mu }\exp \left\{-\frac{{\sigma ^{2}}}{\mu }\right\}\right]\end{array}\]
The marginal likelihood is double exponential, as an exponential scale mixture of normals,
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle [d\mid \theta ]& \displaystyle \sim & \displaystyle \mathcal{DE}\left(\theta ,\sqrt{\frac{\mu }{2}}\right)\\ {} & & \displaystyle \left[g(d\mid \theta ,\mu )=\sqrt{\frac{1}{2\mu }}\exp \left\{-\sqrt{\frac{2}{\mu }}|d-\theta |\right\}\right].\end{array}\]
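The scale-mixture representation can be checked numerically. The following Monte Carlo sketch (with arbitrarily chosen values of μ, θ, and d) averages the normal density over draws of ${\sigma ^{2}}$ from the exponential prior and compares the result with the stated double-exponential density.

```python
# Numerical check, by Monte Carlo, that mixing N(theta, sigma^2) over
# sigma^2 ~ Exp(mean mu) reproduces the double-exponential marginal above.
import numpy as np

rng = np.random.default_rng(0)
mu, theta, d = 1.3, 0.4, 1.1   # arbitrary illustrative values

s2 = rng.exponential(mu, size=2_000_000)   # sigma^2 draws from the prior
normal_pdf = np.exp(-(d - theta) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
mixture = normal_pdf.mean()                # marginal density of d at the point d

de = np.sqrt(1 / (2 * mu)) * np.exp(-np.sqrt(2 / mu) * abs(d - theta))
print(abs(mixture - de) < 0.01)            # agrees with DE(theta, sqrt(mu/2))
```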
Theorem 2.
If in Model II the family of priors on the location parameter is (2.1), the resulting Γ-minimax rule is
(2.3)
\[ {\delta _{B}}(d)=\frac{m\left({e^{-\sqrt{2/\mu }\hspace{0.1667em}|d-m|}}-{e^{-\sqrt{2/\mu }\hspace{0.1667em}|d+m|}}\right)}{{e^{-\sqrt{2/\mu }\hspace{0.1667em}|d-m|}}+\frac{2\epsilon }{1-\epsilon }{e^{-\sqrt{2/\mu }\hspace{0.1667em}|d|}}+{e^{-\sqrt{2/\mu }\hspace{0.1667em}|d+m|}}},\]
which is Bayes with respect to the least favorable prior
\[ [\theta ]\sim \pi (\theta )=\epsilon {\delta _{0}}+\frac{1-\epsilon }{2}({\delta _{-m}}+{\delta _{m}}),\]
whenever $m\le {m^{\ast }}$.

Figure 2 shows the rule in (2.3) for selected values of the parameters ϵ and m. The values of ${m^{\ast }}$ depend on ϵ and are given in Table 1 for both models. We remark that the model in Theorem 2 is similar in spirit to those in BAMS wavelet shrinkage [33] and the Bayesian Lasso [27]: the double exponential distributions in the marginal likelihood (for BAMS) and in the prior (for the Bayesian Lasso) are obtained as exponential scale mixtures of normals.
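A sketch evaluating the Model II Γ-minimax rule of Theorem 2, i.e., the posterior mean under the three-point prior with the double-exponential marginal likelihood:

```python
# Sketch: the Model II Gamma-minimax rule, a posterior mean in which the
# double-exponential marginal likelihood replaces the normal one.
import numpy as np

def model2_rule(d, m, eps, mu):
    d = np.asarray(d, dtype=float)
    s = np.sqrt(2.0 / mu)                  # inverse scale of the DE marginal
    g_plus = np.exp(-s * np.abs(d - m))    # likelihood at theta = +m (up to a constant)
    g_minus = np.exp(-s * np.abs(d + m))   # likelihood at theta = -m
    g_zero = np.exp(-s * np.abs(d))        # likelihood at theta = 0
    return m * (g_plus - g_minus) / (g_plus + (2 * eps / (1 - eps)) * g_zero + g_minus)

d = np.linspace(-6, 6, 13)
delta = model2_rule(d, m=2.197, eps=0.8, mu=1.0)
print(np.all(np.abs(delta) < 2.197))     # bounded between -m and m
print(np.allclose(delta, -delta[::-1]))  # antisymmetric
```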
3 Risk, Bias, and Variance of Γ-Minimax Rules
The frequentist risk of a rule δ, as a function of θ, decomposes into the sum of two functions, the variance and the squared bias,
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle R(\delta ,\theta )& \displaystyle =& \displaystyle {E^{d\mid \theta }}{(\delta (d)-\theta )^{2}}\\ {} & \displaystyle =& \displaystyle {E^{d\mid \theta }}{(\delta (d)-{E^{d\mid \theta }}(\delta (d)))^{2}}\\ {} & & \displaystyle +{(\theta -{E^{d\mid \theta }}(\delta (d)))^{2}}.\end{array}\]
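The decomposition above can be estimated by Monte Carlo for any rule. The sketch below uses a simple bounded soft-thresholding rule as a stand-in for the Γ-minimax rule (our own placeholder, with Model I noise and σ = 1 assumed).

```python
# Sketch: Monte Carlo estimate of risk = variance + bias^2 at a fixed theta.
# The rule here is a bounded soft-thresholding stand-in, not the paper's rule.
import numpy as np

def risk_components(delta, theta, n_rep=200_000, seed=0):
    rng = np.random.default_rng(seed)
    d = theta + rng.standard_normal(n_rep)   # d | theta ~ N(theta, 1)
    est = delta(d)
    variance = est.var()
    bias_sq = (est.mean() - theta) ** 2
    return variance + bias_sq, variance, bias_sq

m = 2.197
delta = lambda d: np.clip(np.sign(d) * np.maximum(np.abs(d) - 1.0, 0.0), -m, m)

risk0, var0, b0 = risk_components(delta, theta=0.0)
risk_m, var_m, b_m = risk_components(delta, theta=m)
print(risk0 < risk_m)   # heavy shrinkage: small risk near 0, larger near m
```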
To explore the behavior of the two risk components in the context of Models I and II, we computed the risk of the Γ-minimax rules for $\epsilon =0.8$ and $m=2.197$ (Fig. 3). This particular value of m ensures that the rules are Γ-minimax, with $m={m^{\ast }}$ for Model II. Note that δ in Model II has a smaller risk for values of θ in the neighborhood of m, while for θ close to 0 the risk of the rule from Model I is smaller.
Figure 4
Risk components of the Γ-minimax rules for $\epsilon =0.8$ and $m=2.197$. Dashed plots for Model I and solid for Model II. Left: Bias-squared; Right: Variance.
The most interesting finding is the behavior of the risk function of the rule. The risk function has a wavy shape, with maxima at locations depending on m; typically the risk is maximized at $\theta =\pm m$. This is important because, once the risk of a Bayes rule is maximized on the support of its prior, general minimax theory takes over and establishes that the proposed rule is minimax, and hence also Γ-minimax. Similar, but less pronounced, behavior is present in the bias-squared function (Fig. 4, left panel). Compared to Model I, the variance of δ in Model II is significantly smaller for values of θ in the neighborhood of $\pm m$, and larger for θ in the neighborhood of zero. The preference for Model I or Model II depends on the size of the signal part of greater interest. If there is more uncertainty about the signal bound m, the rule from Model II is preferable. However, Model I has a lower risk, and lower risk components, for θ in the neighborhood of 0, which translates to a possibly more precise shrinkage of small wavelet coefficients.
4 Elicitation of Parameters
The proposed Bayesian shrinkage procedures with three-point priors depend on three parameters, m, ϵ and μ, that need to be specified. The criteria used for selecting these parameters are critical for effective signal denoising. We propose that these hyperparameters be elicited in an empirical Bayes fashion, that is, from the observed wavelet coefficients.

(1) Elicitation of m: The bound m in the domain of signal acquisition translates to level-dependent bounds on the parameter θ in the wavelet domain. Given a data signal $y=({y_{1}},\dots ,{y_{n}})$, the hyperparameter m at the jth multiresolution level is estimated as in (4.1), where $J={\log _{2}}n$ is the resolution level of the transformation and $\hat{\sigma }$ is an estimator of the noise size. The noise size is often estimated by a robust estimator of the standard deviation based on the wavelet coefficients at the finest multiresolution level,
(4.2)
\[ \hat{\sigma }=\frac{\text{median}(|{d_{J-1,\bullet }}-\text{median}({d_{J-1,\bullet }})|)}{0.6745}.\]
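A sketch of the robust noise estimate (4.2): the median absolute deviation of the finest-level wavelet coefficients, scaled by 1/0.6745 so that the estimator is consistent for the standard deviation under Gaussian noise.

```python
# Sketch of (4.2): MAD-based robust estimate of the noise standard deviation
# from the finest-level wavelet coefficients.
import numpy as np

def mad_sigma(d_finest):
    d = np.asarray(d_finest, dtype=float)
    return np.median(np.abs(d - np.median(d))) / 0.6745

rng = np.random.default_rng(0)
noise = 1.5 * rng.standard_normal(4096)   # pure noise coefficients, sigma = 1.5
print(abs(mad_sigma(noise) - 1.5) < 0.1)  # close to the true sigma
```

Because it is median-based, the estimate is insensitive to the few large coefficients contributed by the signal at the finest level.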
(2) Elicitation of ϵ: This parameter controls the amount of shrinkage in the neighborhood of zero and the overall shape of the shrinkage rule. For the levels of fine detail, this parameter should be close to 1. Based on the proposals for ϵ given by Angelini and Vidakovic [3] and Sousa et al. [31], we suggest a level-dependent ${\epsilon _{j}}$, for ${J_{0}}\le j\le J-1$, depending on two positive constants k and l. As we indicated, ϵ should be close to one at the multiresolution levels of fine detail, and then decrease gradually for levels approaching the coarsest level [3]. When k and l are large, ϵ remains close to one over all levels; this results in an almost noise-free reconstruction, but can lead to oversmoothing. On the other hand, $l\gt 1$ guarantees a certain level of shrinkage even at the coarsest level. Thus, the hyperparameters l and k should be selected with care in order to achieve good performance for a wide range of signals. Numerical simulations led us to suggest $l\ge 6$ and $k=2$ as reasonable choices. However, these parameters should depend on the smoothness of the data signals and their size. We further discuss the selection of l and k for specific signals in Section 5.

(3) Elicitation of μ: This parameter is needed only for Model II. Since the prior on the noise level ${\sigma ^{2}}$ is exponential with prior mean μ, simple moment matching selects μ as ${\hat{\sigma }^{2}}$. A possible choice for $\hat{\sigma }$ is the robust estimator in (4.2).
5 Simulation Study
In the simulation study, we assessed the performance of the proposed shrinkage procedures on a battery of standard test signals. We used nine different test signals (step, wave, blip, blocks, bumps, heavisine, doppler, angles, and parabolas), which are constructed to mimic a variety of signals encountered in applications (Fig. 5). As is standard in the literature, the Haar and Daubechies six-tap (Daubechies 6) filters were used for blocks and bumps, and the Symmlet 8-tap filter was used for the remaining test signals. The shrinkage procedures are compared using the average mean squared error (AMSE), as in (5.1). All simulations were performed in MATLAB using the toolbox GaussianWaveDen [4], which can be found at http://www.mas.ucy.ac.cy/~fanis/links/software.html.
We generated noisy data samples of the nine test signals by adding normal noise with zero mean and variance ${\sigma ^{2}}=1$. The signals were rescaled so that ${\sigma ^{2}}=1$ leads to a prescribed SNR. Each sample consisted of $n=1024$ data points equally spaced on the interval $[0,1]$. Figure 6 shows a noisy version of the nine test signals with SNR = 1/4. Each noisy signal was transformed into the wavelet domain. After the shrinkage was applied to the transformed signal, the inverse wavelet transform was performed on the processed coefficients to produce a smoothed version of a signal in the original domain. The AMSE was computed as
where f denotes the original test signal and ${\hat{f}_{j}}$ its estimate in the jth iteration. To calculate the average mean squared error, this process was repeated $N=100$ times.
(5.1)
\[ AMSE(f(t))=\frac{1}{nN}{\sum \limits_{j=1}^{N}}{\sum \limits_{i=1}^{n}}{\left(f({t_{i}})-{\hat{f}_{j}}({t_{i}})\right)^{2}},\]

Figure 6
Noisy versions of the nine signals from Fig. 5.
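The AMSE computation in (5.1) can be sketched as follows. The `denoise` callable and the identity baseline are hypothetical names of ours, not part of the GaussianWaveDen toolbox.

```python
# Sketch of the AMSE in (5.1): average squared error over N noisy replicates
# of a test signal, each processed by some estimator `denoise` (hypothetical).
import numpy as np

def amse(f, denoise, sigma=1.0, n_rep=100, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rep):
        y = f + sigma * rng.standard_normal(f.size)   # noisy replicate
        total += np.sum((f - denoise(y)) ** 2)
    return total / (f.size * n_rep)

n = 1024
t = np.arange(n) / n
f = np.sin(2 * np.pi * t)                   # stand-in for a test signal
identity = lambda y: y                      # "no denoising" baseline
print(abs(amse(f, identity) - 1.0) < 0.05)  # AMSE of raw data is about sigma^2
```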
Figure 7
Change in the average MSE (AMSE) as a function of hyperparameters k (left) and l (right), exemplified on the heavisine test signal, $SNR=1/5$, and $n=1024$.
The shrinkage procedure was applied to each test signal and the AMSE was computed for a range of values of l and k. For example, Fig. 7 shows the average MSE obtained on the heavisine test signal when SNR = 1/5 and l and k vary in the ranges $l\in [2,15]$ and $k\in [1.0,3.5]$. As is evident from Fig. 7, the estimator achieves its best performance for values $k\approx 2.4$ and $l\approx 5.8$. With these selected values of l and k, Fig. 8 shows that the estimate is sufficiently close to the original test signal, even though the SNR is quite small.
Based on our simulations, the optimal values of l and k varied with the nature (e.g., smoothness) of the test signal. For larger values of k and l, the estimator performs better on smooth signals, because the corresponding wavelet coefficients decay rapidly as the resolution increases. However, larger values of l and k may fail to detect localized features in signals (e.g., cusps, discontinuities, sharp peaks), resulting in oversmoothing. For low values of SNR, the three-point prior estimator is more sensitive to the values of l and k. When the SNR increases, the estimation method performs better for most of the test signals with relatively small values of l and k. Moreover, higher values of l and k are preferred when the sample size is large.
In general, we suggest $k=2.5$ and $l\ge 6$ as the most universal choice. The suggested values could, however, be adjusted depending on available information about the nature of signals.
Figure 8
Estimation of the heavisine test signal, obtained by three-point priors with $n=1024$, $k=2.4$, $l=5.8$ and $SNR=1/5$.
5.1 Performance Comparison with Some Existing Methods
We compared the performance of the proposed three-point prior estimator with eight existing estimation techniques: the Bayesian adaptive multiresolution shrinker (BAMS) [33], Decompsh [22], block-median and block-mean [1], a hybrid version of the block-median procedure [1], BlockJS [10], VisuShrink [16], and generalized cross-validation [2]. The first five techniques rely on a Bayesian procedure and are based on level-dependent shrinkage. The BlockJS method uses level-dependent thresholding, while VisuShrink and generalized cross-validation use global thresholding. Readers can find more details about these techniques in [4].
In the simulation study, we computed the AMSE using the parameter values $l=6$ and $k=2.5$ and compared it with the AMSE of the selected estimation techniques. As can be seen in Fig. 9, the proposed estimator shows comparable, and for some signals better, performance relative to the selected estimation methods. In particular, the three-point prior estimator performs better for smooth signals (e.g., wave, heavisine) than for nonsmooth signals, such as blip. Moreover, among the level-dependent estimation methods, the BAMS method shows competitive (or better) performance in most of the cases. We also investigated the influence of the SNR level and the sample size on the performance of the proposed estimators relative to the methods considered. For higher SNR (Fig. 10), the three-point priors-based shrinkage procedure does not provide better performance, except for the wave, angles, and time-shifted sine signals. In general, the Γ-minimax shrinkage shows comparable or better performance than the other methods considered when the SNR is low. Also, for larger sample sizes, the three-point prior shrinkage shows improved performance.
Figure 9
The box plots of the MSE for the ten estimation methods: (1) Rule I, (2) Rule II, (3) Bayesian adaptive multiresolution shrinker (BAMS), (4) Decompsh, (5) Block-median, (6) Block-mean, (7) Hybrid version of the block-median procedure, (8) BlockJS, (9) VisuShrink, and (10) Generalized cross-validation. The MSE was computed using $SNR=1/5$, $k=2.5$, $l=6$, and $n=1024$ data points.
Figure 10
The same plot as in Fig. 9, with SNR = 3.
6 Conclusions
We proposed a method for wavelet denoising of signals when prior information about the size of the signal is available. Simple, level-dependent shrinkage rules that are Bayes with respect to a prior supported on three points are at the same time Γ-minimax for the set of all priors with a bounded support symmetric about zero and a fixed point mass at zero. This statement holds when the signal is bounded on $[-m,m]$ for small values of m, which in denoising applications translates to estimation of a signal with a low signal-to-noise ratio. For larger values of m the least favorable prior is still discrete, with support on a finite number of points symmetric about 0, but finding the exact locations and prior weights of these points presents a challenging optimization problem. Thus, using the three-point priors for noisy signals with a high signal-to-noise ratio is suboptimal from the standpoint of minimaxity, and numerical experiments show that in this case the three-point prior shrinkage rules underperform other standard shrinkage methods. We considered two estimation scenarios: when the variance of the noise is known and when it is not. In the first case a plug-in robust estimator of the noise variance is used, while in the second the variance is given an exponential prior and subsequently integrated out. The main theoretical result is given in Theorem 1, which describes the form of Γ-minimax rules when the parameter belongs to a very general convex set in a finite-dimensional Euclidean space. The two estimation scenarios proposed in this paper are special cases of Theorem 1.
As demonstrated by simulations on a battery of standard test functions, the performance of the proposed shrinkage rules in terms of AMSE is comparable, and for some signals superior, to state-of-the-art methods when the SNR is low and the signals are smooth.