1 Introduction
Fiber sensors are versatile devices with broad applicability [38, 23, 46, 63].
They are of high interest in smart structures to sense and react to the environment [42, 58]. For quasi-distributed sensing based on wavelength-division multiplexing (WDM), fiber Bragg grating (FBG) sensors are often employed
due to their sensitivity to strain or temperature [38, 23]. An FBG is a local variation of the refractive index that reflects light at a specific wavelength, called the Bragg wavelength. Typically, a number of detuned FBGs are imprinted into the core of an optical fiber. Fiber interrogation is performed
using broadband light sources or wavelength-tunable lasers.
The latter feature higher local signal-to-noise ratios (SNRs) [46, 63].
However, in order to monitor time-varying perturbations, the laser has to sweep quickly through the tuning range. This requires high-speed analog-to-digital converters (ADCs) and produces large amounts of data.
Compressed sensing (CS) [5, 15, 27] can help to alleviate these problems by taking samples in the form of
projections onto a low-dimensional subspace.
The original signal can be reconstructed by exploiting the sparsity of the signal with respect to an adequate dictionary [14, 11]. This task strongly resembles the sparse synthesis problem with redundant dictionaries in [57, 53].
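To make the CS setting concrete, the following sketch simulates the acquisition of a sparse signal through random projections and recovers it with a simple greedy pursuit in the spirit of OMP. All dimensions, the identity placeholder dictionary, and the Gaussian sampling matrix are illustrative assumptions, not the fiber-sensing model introduced later.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 256, 128, 3            # ambient dimension, #CS samples, sparsity (hypothetical)

D = np.eye(N)                    # placeholder dictionary; later sections use an FBG model
Phi = rng.standard_normal((M, N)) / np.sqrt(M)   # Gaussian CS sampling matrix

x = np.zeros(N)                  # K-sparse coefficient vector
x[rng.choice(N, K, replace=False)] = rng.uniform(0.9, 1.2, K)
y = Phi @ D @ x                  # CS measurements: projections onto a low-dim subspace

# Greedy pursuit (OMP-style): pick the best-matching atom, refit by least squares
A = Phi @ D
support, r = [], y.copy()
for _ in range(K):
    support.append(int(np.argmax(np.abs(A.T @ r))))
    coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
    r = y - A[:, support] @ coef

x_hat = np.zeros(N)
x_hat[support] = coef            # with enough incoherent samples, this typically matches x
```

With an incoherent random matrix and few active atoms, the greedy pursuit usually identifies the correct support; the coherent, redundant dictionaries discussed below are exactly the regime where such guarantees break down.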
Besides greedy methods, such as Orthogonal Matching Pursuit (OMP) [49], ℓ1 minimization is a popular method to solve
the sparse reconstruction problem [56, 24, 12]. It relies on the restricted isometry property (RIP), which essentially states that unique sparse solutions can be recovered by restricting
the ℓ1 norm instead of the ℓ0 norm [25, 13].
Redundant dictionaries can yield highly sparse representations that allow for estimating quantities at high resolution
directly in the sparse domain [24, 41]. However, redundancy causes inter-column coherence, and it is likely that the required RIP conditions are no longer fulfilled [24, 52, 12]. The ℓp norm, with 0 < p < 1, offers a trade-off: it avoids the NP-hard combinatorial
problem imposed by the ℓ0 norm, while a unique solution might still be retrieved [17, 18].
Dictionaries can be classified as parametric or nonparametric. Nonparametric dictionaries are typically learned from training data and are often used if no analytical model is available
[53]. While they can yield sparser representations of certain data realizations [40], nonparametric dictionaries usually lack an interpretable structure and are inefficient in terms of storage [53]. Parametric dictionaries, in turn, rely on an analytical model for the observed signal. Their analytic form offers an efficient implementation and a means to obtain optimality proofs and error bounds [53]. They are also favorable in terms of scalability and storage efficiency [4, 53]. Translation-invariant dictionaries represent an important subclass of parametric dictionaries that can be used to estimate the translation coefficients of localized signals [36, 9, 29]. Nonetheless, due to the complexity of natural signals, some model parameters might be unknown or contain uncertainty. Parametric Dictionary Learning (DL) addresses this problem with the aim of estimating these parameters from the measured data. Herein, statistical DL methods, such as maximum likelihood (ML) or maximum a posteriori (MAP) estimation, are commonly employed [53]. In order to solve the resulting optimization problem, alternating estimation (AE) is a frequently pursued suboptimal paradigm that iteratively optimizes a local objective function [43, 6, 2]. In a Bayesian setting, the Expectation Maximization (EM) algorithm is a popular variant of AE-based estimation
[53]. A model for the sparse coefficients can be of deterministic or probabilistic nature. While the deterministic case is often assumed in sparse estimation [24, 12], a probabilistic model offers high flexibility to take model deviations and measurement errors into account. Moreover, a hierarchical structure can be used to incorporate additional uncertainty in prior assumptions. Sparsity can either be promoted by continuous distributions, resulting in weakly sparse models, or by discrete mixtures, leading to strongly sparse models [43]. A prominent example of discrete mixtures are Spike & Slab models [34]. They are based on binary activations and yield strongly sparse representations. Continuous sparse priors, such as a Gaussian or double-exponential (Laplace) prior, feature high excess kurtosis with heavy tails and a narrow peak around zero
[50, 43]. Besides sparsity, additional knowledge of the signal, e.g. correlation, can be incorporated [26, 66]. For many practical models, evaluating the posterior distribution is not feasible, and approximate methods, such as Markov Chain Monte Carlo (MCMC) or variational Bayes methods, have to be used to accomplish inference [8, 48, 54]. Variational methods use rather simple analytic functions to approximate the posterior distribution by factorization, which is favorable in terms of scalability and computational cost but leads to a deterministic approximation [54, 8]. MCMC methods attempt to sample the posterior distribution, where subsequent samples form a Markov chain [8]. The Hamiltonian Monte Carlo (HMC) method is a powerful technique that is especially suitable for sampling high-dimensional spaces in the presence of correlation [47]. However, MCMC performance is generally limited by the available computation time, thereby relying on a stochastic approximation. Another application of MCMC is found in nonconvex optimization, where Stochastic Gradient (SG) MCMC has gained popularity for large-scale Bayesian learning [22, 20, 19].
In the present work, we consider the problem of Compressed Fiber Sensing (CFS) with highly coherent translation-invariant dictionaries and imperfectly known parameters. For the sparse coefficients, a weakly sparse hierarchical model is considered. We also establish a relation between this model and nonconvex optimization with ℓp norm constraints for 0 < p < 1. In order to alleviate the problem of dictionary coherence, we leverage additional structure of the dictionary and achieve augmented sparsity by establishing a Markov random field (MRF) relation among the sparse coefficients. For dictionary learning, we pursue two different strategies: In the first strategy (S1), we consider a deterministic dictionary parameter that is estimated using a Monte Carlo EM algorithm. In the second strategy (S2), a probabilistic hierarchical model for the dictionary parameter is considered, leading to a full Bayesian formulation and joint estimation of the sparse coefficients and the dictionary parameter. In both strategies, approximate inference is accomplished using a hybrid MCMC method based on Gibbs sampling and HMC. Finally, we use simulations and real data to compare the proposed methods to previous work in [62], where a deterministic model is considered for the sparse coefficients and the dictionary parameter. For the deterministic case, we derive the Cramér-Rao bound (CRB) to assess the performance gain achieved by a probabilistic model.
1.1 Contributions

We propose a probabilistic model for the sparse coefficients, where a Weibull prior is used to promote (weak) sparsity. Additional collective shrinkage is achieved by establishing an MRF relation among the sparse coefficients based on a bivariate kernel function in the joint prior density. This helps to moderate the impact of severe dictionary coherence and can be used in general sparse synthesis problems with a similar dictionary structure. We also establish a relation to nonconvex optimization with constraints on the ℓp norm for 0 < p < 1.

For dictionary learning, we investigate two conceptually different strategies, assuming either a deterministic (S1) or a stochastic (S2) dictionary parameter. In both strategies, the noise level can be jointly estimated along with the sparse coefficients. We further highlight advantages, disadvantages and limitations to offer support in choosing an adequate method for practical systems.

To accomplish inference in these models, we use a hybrid MCMC method combining HMC and Gibbs sampling. We show its applicability and efficacy in the considered sampling problem for CFS.

We use simulations to evaluate the performance of the proposed sparse estimation and DL methods for various scenarios with different CS sample sizes, SNRs and CS matrices. These results are compared to an existing method in [62], where the sparse coefficients and the dictionary parameter are assumed to be deterministic. In addition, we provide a real-data example to verify the practical applicability of S1 and S2.

We derive the Cramér-Rao bound for jointly estimating the sparse coefficients and the dictionary parameter in the deterministic case. It is a valid bound for the competing method in [62] and serves to assess the achieved performance gain of our probabilistic approach.
2 Related Work
There exists little work addressing the combination of CS and DL for the application of WDM-based distributed fiber-optic sensing [60, 61, 62].
In [60], a model for the received sensor signal is presented, from which a redundant shift-invariant parametric dictionary is created.
The works in [61, 62] focus on the aspect of CS and sparse estimation in the case of uncertain dictionary parameters.
The authors use AE-based estimation to determine the dictionary parameters, where a preprocessing routine
accounts for severe dictionary coherence.
Unlike our approach, these works use a deterministic model for the sparse coefficients and dictionary parameters.
Weakly sparse models have been widely used in the literature.
A comprehensive analysis of different hierarchical sparse prior models is provided in [44]. The general problem of choosing the prior in weakly sparse models for sparse regression is addressed in [50], where
the authors describe various properties of different shrinkage priors and illuminate the selection problem
from two perspectives: prior distributions and penalty functions.
The work in [43] also investigates Bayesian methods with different sparse models in comparison to classical ℓ1 minimization.
Seeger [54] found that the Laplace prior is able to shrink most components close to zero, while allowing selected components to become sufficiently large.
This effect, termed selective shrinkage in [35], is most noticeable for heavy-tailed priors, e.g. the Student's t prior [54] or the horseshoe prior in [16, 50].
Based on these findings, we select a sparsity prior that resembles a
positive version of the horseshoe prior. Other works, which focus on the perspective of penalized regression, report higher sparsity levels
by penalizing the ℓp norm with 0 < p < 1 instead of the ℓ1 norm [30].
The authors in [17] show that the RIP requirements for the dictionary can be relaxed in this case. It is also pointed out in [17, 18] that nonconvex CS with ℓp norm penalization requires fewer measurements than standard CS, which is based on the ℓ1 norm. We rely on these results
and show a relation between the considered sparsity prior and nonconvex optimization with ℓp norm constraints.
There exist several approaches to exploit additional structure of the signal. One example is block sparsity [26].
A block sparse Bayesian learning framework is proposed in [66], pointing out how correlation can be exploited in regularization algorithms.
Wakin et al. [59] introduce the concept of joint sparsity for signal recovery in distributed CS theory.
In [41], temporal correlation across subsequent CS measurements is considered, while the authors in [21] use correlation to achieve smoothness. Another related concept is proposed in [3], where a truncated multivariate Ising MRF model is used to describe the correlation between adjacent pixels for image processing.
Different from these works, we use the ideas of MRFs [45] and exploit correlation to achieve collective shrinkage among the sparse coefficients.
A comparative analysis in [43] suggests that MCMC methods are powerful for inference in sparse models.
In [47], the benefits of HMC and Gibbs sampling in hierarchical models are outlined. It is also shown that HMC can be more effective than a Gibbs sampler for sampling high-dimensional spaces in the presence of correlation.
In line with these results, we consider a hybrid MCMC method that combines HMC and Gibbs sampling for inference in our hierarchical model, where the sparse
coefficients are high-dimensional and correlated.
For parametric DL, the Monte Carlo EM algorithm in S1 represents one variant of the frequently applied AE-based estimation
technique [39, 51]. Comparable to S2 is the Bayesian framework for sparse estimation and DL in [31].
However, the authors use a Gaussian prior without correlation.
2.1 Outline
In Section 3, the signal model for CFS is introduced, and in Section 4, the CRB for joint estimation of the deterministic sparse coefficients and dictionary parameter is derived. Section 5 details the sparsity and local similarity model, while Section 6 describes the hybrid MCMC method for approximate inference in this model. The parametric DL strategies S1 and S2 are described in Section 7. Section 8 demonstrates the working principle and analyzes the performance of the proposed and an existing method, based on simulations and experimental data. A discussion of the results and findings is given in Section 9. Section 10 concludes this work.
3 Signal Model
In order to determine the quantity and nature of impairments at the FBGs in a WDMbased fiber sensor, the time delays of the reflections from the individual FBGs need to be estimated. We adopt the model in [60, 61, 62], where CSbased acquisition is employed to reduce the number of samples to be stored and processed. The CS measurements are described by
(1) y = Φ A(ν) x + n,
where Φ is the CS sampling matrix and n is a Gaussian noise component with independent and identically distributed (i.i.d.) entries, n_i ∼ N(0, σ²).
The vector x
is sparse with K significant components, and ν is a scalar dictionary parameter. The matrix A(ν) represents a redundant shift-invariant dictionary, and its columns, called atoms, represent FBG reflections on a dense grid of delays. The indices of the significant components in x indicate the desired reflection delays. They are collected in the set S. We can write the full data likelihood function for this model as
(2) p(y | x, ν, σ²) = N(y; Φ A(ν) x, σ² I).
The l-th dictionary atom, a_l(ν), is defined by [62]
(3) 
where the generating function describes the reflection from a single FBG, incrementally shifted on the delay grid and sampled with a design sampling period. In order to specify the dictionary parameter in CFS according to [62], we write
(4) 
Herein, the received photocurrent is considered in the frequency domain, and the transfer function of a low-pass filter models a limited effective bandwidth of the receiver circuitry. This bandwidth is described in terms of the positive dictionary parameter ν. As an auxiliary parameter, it accounts for different indistinguishable sources of uncertainty that all contribute to the broadening of the temporal response of the FBG reflections. A detailed model is provided in [62].
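The structure of such a shift-invariant dictionary can be illustrated with a toy construction: each atom is a shifted pulse (standing in for an FBG reflection) smoothed by a low-pass filter whose bandwidth plays the role of the dictionary parameter ν. The pulse shape, the filter form, and all dimensions below are hypothetical stand-ins for the model in (3)-(4).

```python
import numpy as np

def make_dictionary(nu, N=128, L=64, dt=1.0, width=2.0):
    """Shift-invariant dictionary: column l is a pulse delayed by l*dt,
    smoothed by a low-pass filter whose bandwidth is set by nu (toy model)."""
    t = np.arange(N)
    f = np.fft.rfftfreq(N)
    H = 1.0 / (1.0 + (f / nu) ** 2)        # simple low-pass stand-in for the transfer function
    atoms = []
    for l in range(L):
        g = np.exp(-0.5 * ((t - l * dt - N // 4) / width) ** 2)  # FBG reflection proxy
        a = np.fft.irfft(np.fft.rfft(g) * H, n=N)
        atoms.append(a / np.linalg.norm(a))
    return np.column_stack(atoms)

A_wide = make_dictionary(nu=0.4)
A_narrow = make_dictionary(nu=0.05)        # smaller bandwidth -> broader atoms
coh = lambda A: np.max(np.abs(A.T @ A - np.eye(A.shape[1])))
c_wide, c_narrow = coh(A_wide), coh(A_narrow)
```

Shrinking the bandwidth broadens the atoms, which increases the mutual coherence between neighboring columns; this is the core difficulty the following sections address.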
4 The CRB for joint estimation of (x, ν) in CFS
We derive the CRB for jointly estimating the deterministic parameters (x, ν). This is a valid bound for the model considered
in [62] and can be used to assess the
relative performance gain achieved by the proposed probabilistic sparse model and DL strategies.
Although the Bayesian CRB in [10] can be empirically determined, we found that this bound is very loose, due to the high information content in the considered sparsity prior. Therefore, and in view of the comparative analysis with the deterministic case in [62], the non-Bayesian
CRB is more useful in this case.
The constrained CRB for estimating x with sparsity constraints has been derived in [7].
However, this derivation does not assume uncertainty in the dictionary. It is based on locally balanced sets and involves the projection of the Fisher information matrix (FIM)
onto a low-dimensional subspace spanned by the so-called feasible directions. Any estimator for which the constrained CRB is
a valid lower bound must be unbiased
with respect to these directions. The projection matrix U can be created from the unit vectors e_i corresponding to the
nonzero coefficients in x, i.e., from the columns e_i with i ∈ S.
For a Gaussian likelihood as in (2), the FIM can be derived
from the expected value of the Hessian matrix of the log-likelihood function, i.e. [37, 7]
(5) J(x) = (1/σ²) BᵀB,
with B = Φ A(ν). Further, we define the reduced FIM by J_S = Uᵀ J(x) U. Then, given that x is exactly K-sparse, the constrained CRB for a known dictionary becomes [7]
(6) E[(x̂ − x)(x̂ − x)ᵀ] ⪰ U J_S⁻¹ Uᵀ.
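Numerically, the constrained CRB for a known dictionary reduces to a projection and a small matrix inverse. The sketch below assumes a generic measurement matrix (the product of the CS matrix and the dictionary) and a hypothetical support; it is not tied to the specific FBG dictionary.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, K = 40, 100, 3
B = rng.standard_normal((M, N))      # stand-in for (CS matrix) @ (dictionary)
sigma2 = 0.01
S = [10, 40, 70]                     # hypothetical support of the K-sparse vector

J = (B.T @ B) / sigma2               # FIM of a Gaussian likelihood
U = np.eye(N)[:, S]                  # feasible directions: unit vectors on the support
J_red = U.T @ J @ U                  # reduced (projected) FIM, K x K
crb = U @ np.linalg.inv(J_red) @ U.T # constrained CRB for a known dictionary
total_mse_bound = np.trace(crb)      # lower bound on the total MSE over the support
```

Only the K×K reduced FIM has to be inverted, which is what makes the sparsity-constrained bound tractable even for large N.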
Based on these results, we derive the CRB for the joint parameters (x, ν). First, we derive the Fisher information for ν, given that x is known. It is given by
(7) J(ν) = (1/σ²) ‖Φ A′(ν) x‖²,
where A′(ν) denotes the (element-wise) derivative of A(ν) with respect to ν. Next, we have to take into account that x and ν share some mutual information. Therefore, we define the combined FIM:
(8) J(x, ν) = [ J(x)  j(x, ν) ; j(ν, x)  J(ν) ],
where j(ν, x) = j(x, ν)ᵀ. Since the partial derivatives can be interchanged, the off-diagonal elements are identical. In order to complete the definition of J(x, ν), we determine
(9) j(x, ν) = (1/σ²) Bᵀ Φ A′(ν) x.
The reduced FIM is obtained by appending ν to the set of feasible directions, such that its coordinate is included in the projection. To obtain the inverse, we apply the matrix inversion lemma twice [32]
(10) 
where , and
(11) 
The constrained CRB for the joint parameters (x, ν) becomes
(12) 
Finally, a lower bound for the mean squared error (MSE) in the joint setting is obtained by updating the individual estimation errors to account for the information shared between x and ν:
(13)  
(14) 
5 Probabilistic sparse model
Regarding the model in (1), the data can be explained in different ways. On the one hand, many nonzero components in x and a large bandwidth ν result in many narrow temporal peaks that can yield a good approximation of the observed reflections. On the other hand, it is known that the sensing fiber contains K FBGs, so we expect exactly K reflections. Therefore, a more useful explanation is given by K significant elements in x with a smaller value of ν, such that S correctly indicates the reflection delays. Nevertheless, even for a suitable value of ν, the signal is usually not exactly sparse but contains many small elements close to zero, e.g. due to measurement noise. In a strongly sparse model, these contributions are not taken into account, which impacts the positions of the nonzero elements in x. Hence, it may lead to incorrectly estimated reflection delays. This motivates a weakly sparse model, where the most significant components indicate the reflection delays. When x and ν are both unknown, the reflection delays can only be estimated when prior information on sparsity is incorporated, since the estimate of x depends on ν and vice versa. Severe dictionary coherence aggravates this problem and results in several nonzero components with moderate amplitudes around the true significant elements. The coherence is even stronger when the dimensionality of the acquired data is further reduced by CS. Thus, an adequate sparse model for x must compensate for this effect. Classic ℓ1 minimization can be interpreted as an MAP estimation problem where x has i.i.d. entries with Laplace priors [43]. However, the required performance guarantees for ℓ1 minimization, essentially the RIP [14, 11], are no longer fulfilled in the case of strong dictionary coherence. According to [17, 18], the RIP conditions can be relaxed for ℓp minimization with 0 < p < 1. Therefore, we use a prior with a stronger selective shrinkage effect that can be related to constraints on the ℓp norm in nonconvex optimization. 
Yet, specific characteristics of the signal have to be considered. The measured reflection signal is proportional to the optical power, and the dictionary atoms essentially model the optical power reflected from the individual FBGs. Thus, the prior must also account for the nonnegativity of the data. Due to these restrictions, we choose a Weibull prior that resembles a positive version of the horseshoe prior in [50] and induces the required selective shrinkage effect:
(15) p(x_i) = (k/λ) (x_i/λ)^(k−1) exp(−(x_i/λ)^k), x_i ≥ 0,
where λ > 0 and k > 0 are the scale and shape parameters, respectively. Then, the joint prior density of x is given by
(16) p(x) = ∏_i p(x_i).
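The selective-shrinkage behavior of this prior can be checked numerically: for shape k < 1, the Weibull log-density has a pole at zero and a heavier tail than the exponential case k = 1. The parameter values below are illustrative.

```python
import numpy as np

def weibull_logpdf(x, lam, k):
    """Log-density of the Weibull prior:
    p(x) = (k/lam) * (x/lam)**(k-1) * exp(-(x/lam)**k), for x > 0."""
    x = np.asarray(x, dtype=float)
    return np.log(k / lam) + (k - 1) * np.log(x / lam) - (x / lam) ** k

lam, k = 1.0, 0.5          # shape k < 1: pole at zero plus a heavy tail
small, large = 1e-3, 5.0
near_zero  = weibull_logpdf(small, lam, k) > weibull_logpdf(0.5, lam, k)
heavy_tail = weibull_logpdf(large, lam, k) > weibull_logpdf(large, lam, 1.0)
```

Both flags are True for these values: probability mass piles up near zero (most coefficients are shrunk), while large amplitudes remain more plausible than under the exponential prior — the selective shrinkage exploited above.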
Fig. 1 (top left)
shows qualitatively the shape of the considered prior in the bivariate case.
Based on (16) and (2), we can relate the problem to
constrained ML estimation.
First, let us consider an interpretation in terms of MAP estimation as in [44], by
maximizing the posterior or, equivalently, its logarithm,
(17) 
where the objective combines the log-likelihood and the log-prior. In order to formulate a related constrained ML problem, let us define two functions,
(18) 
where the two functions are related to the sparse coefficients. The functions in (18) can represent inequality constraints that account for the impact of the prior by restricting the search space. Hence, a constrained version of the ML problem can be formulated by
(19)  
s.t.  (20)  
and  (21) 
In this nonconvex problem, ‖·‖_p denotes the ℓp norm with 0 < p < 1. The hyperparameters
control the shrinkage effects. Fig. 1 (top right) depicts the search space restricted by the constraints (20)-(21). The borders are shown for fixed hyperparameter values in the bivariate case.
5.1 Local covariance model for augmented sparsity
In analogy to the concept of block sparsity [26], we can use the specific sparse structure of the signal with respect to the shift-invariant dictionary for CFS to exploit sparsity among groups of variables. The signal contains only K reflections that arrive at temporally separated delays, indicated by the significant components in x. Therefore, we can assume that a significant coefficient is always surrounded by larger groups of nonsignificant coefficients, and that any two significant components are always well separated. Also, it is likely that the amplitudes of adjacent nonsignificant coefficients are similarly close to zero. Borrowing from the ideas of MRFs [45], such local similarity can be modeled by a prior on the differential coefficients x_i − x_{i−1}. It restricts the variation of adjacent amplitudes and establishes an MRF relation between neighboring coefficients in x. Then, nonsignificant coefficients with larger amplitudes are pulled down to match the majority with almost-zero amplitudes, which promotes additional collective shrinkage. However, if a significant coefficient follows a nonsignificant one (or vice versa), the model should allow for larger changes. Therefore, the differential variation must be locally specified, dependent on the respective amplitudes, in order to avoid undesired shrinkage or equalization.
To this end, we define a kernel function for all adjacent pairs of sparse coefficients (x_i, x_{i−1}), with an associated hyperparameter:
(22) 
The bivariate function controls the similarity level between adjacent coefficients. Within the scope of this work, we consider cases where this function takes a parametric form with positive constants. These can be incorporated to yield a modified joint prior density,
(23) 
with a normalization constant. For any admissible choice of the constants, the kernel is bounded and
(24) 
is bounded as well. Hence,
there exists a positive constant that normalizes (23), making it a proper density.
Fig. 1 (bottom) visualizes the function and its impact on the original prior in the bivariate case.
In the view of constrained ML estimation, the modified prior density in (23) can be related to
the optimization problem in (19)-(21) by imposing
additional constraints
(25) 
Fig. 1 (top right) depicts a bivariate example. In order to show the MRF relation between the coefficients, we calculate the conditional density of each coefficient given all others. To this end, we conveniently define suitable auxiliary terms and get
(26) 
(27) 
(28) 
where dependencies appear only between directly adjacent coefficients.
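The effect of such an amplitude-dependent difference penalty can be sketched numerically. The kernel form below (a weighted quadratic difference whose weight decays with the amplitudes) is a hypothetical stand-in for the function in (22), chosen only to illustrate the mechanism.

```python
import numpy as np

def mrf_log_kernel(x, c1=1.0, c2=1.0, gamma=5.0):
    """Pairwise MRF term on adjacent coefficients: penalizes (x_i - x_{i-1})^2
    with a weight c1 * exp(-c2 * (x_i + x_{i-1})) that relaxes near significant
    (large) amplitudes. All constants and the functional form are illustrative."""
    d = np.diff(x)
    w = c1 * np.exp(-c2 * (x[1:] + x[:-1]))
    return -gamma * np.sum(w * d ** 2)

flat   = np.array([0.01, 0.02, 0.01, 3.0, 0.02, 0.01])  # one clean spike
ragged = np.array([0.01, 0.40, 0.01, 3.0, 0.40, 0.01])  # spurious medium amplitudes
prefers_flat = mrf_log_kernel(flat) > mrf_log_kernel(ragged)
```

The kernel assigns higher prior log-density to the "flat" configuration: medium-amplitude stragglers around a true peak, the typical footprint of dictionary coherence, are collectively shrunk, while the large jump at the significant coefficient is only weakly penalized because its weight is small.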
In order to account for deviations from the prior assumptions, we consider randomization of the hyperparameters and assign conjugate inverse Gamma priors to the scale parameters.
Finally, the shape parameter is assigned the conjugate prior distribution according to [28]:
(29) 
Fig. 2 shows a factor graph for the complete sparsity model with randomized hyperparameters.
6 Approximate Inference: Hybrid MCMC
In order to accomplish inference in the sparse model, we apply a hybrid MCMC technique, i.e. HMC within Gibbs sampling. The reasons for using HMC are twofold: First, it only requires an analytic expression for the posterior density to be sampled. Second, it is efficient in sampling high-dimensional spaces in the presence of correlation. However, as pointed out in [47], it can be more efficient to sample the hyperparameters separately, as their posterior distributions are often highly peaked and require a small step size in the HMC algorithm, which limits the overall performance. Therefore, we employ an outer Gibbs sampler for approximate inference of the latent variables. In each iteration, x is sampled using HMC, while all other variables are fixed. Since we are also interested in estimating the noise variance, σ², it is assigned an inverse Gamma (IG) prior and sampled along with the other variables. The resulting model is summarized below:
(30) 
We also introduce a representative variable for the hyperparameters, with corresponding positive, real-valued parameters that belong to the respective density functions. Further, for any variable, a subscripted set denotes the set of all variables without that variable. Fig. 3 shows a graphical model that helps to visualize the dependencies in this model. Herein, the variables associated with the dictionary parameter are only valid for strategy S2, which is discussed in Section 7. For the particular model in (30), we assume that the hyperparameter variables are mutually independent. Gibbs sampling requires the full conditional distributions for each parameter of interest. Based on these assumptions, we obtain the relation
(31)  
(32) 
Since the prior distributions are all conjugate to the Gaussian likelihood function in (2), a simple calculation yields the posterior distributions of the parameters involved in the Gibbs sampling procedure. For σ², we obtain
(33) 
and for the remaining hyperparameters, we obtain
(34) 
with parameters
, , and .
Samples of the posterior variables can be obtained using Metropolis-Hastings [8] or HMC.
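As a concrete instance of one such conjugate Gibbs step, consider the noise variance: with an inverse-Gamma prior IG(a0, b0) and a Gaussian likelihood, the full conditional is IG(a0 + M/2, b0 + ‖r‖²/2), where r is the current residual. This is the standard conjugate result; the prior parameters and the residual model below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_sigma2(y, fitted, a0=2.0, b0=1.0, rng=rng):
    """One Gibbs draw of the noise variance from its IG full conditional."""
    M = y.size
    a = a0 + 0.5 * M
    b = b0 + 0.5 * np.sum((y - fitted) ** 2)
    return 1.0 / rng.gamma(shape=a, scale=1.0 / b)   # IG draw via reciprocal Gamma

# Sanity check: with many observations, draws concentrate near the true variance
M, true_s2 = 5000, 0.25
y = np.sqrt(true_s2) * rng.standard_normal(M)        # residual (fitted part set to zero)
draws = [sample_sigma2(y, np.zeros(M)) for _ in range(200)]
```

An inverse-Gamma draw is obtained as the reciprocal of a Gamma draw; note that NumPy's `gamma` takes a scale (the reciprocal of the rate), hence `scale=1.0 / b`.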
The sparse coefficients are sampled using HMC.
We briefly describe the idea of this method according to [47], adapted to our model for x:
Within the framework of HMC, the sampling process is described in terms of Hamiltonian dynamics, a concept known from classical physics.
It is used to describe the
trajectory of a physical system in phase space, based on its potential and kinetic energy.
HMC assigns to every sparse coefficient, x_i, an associated momentum variable, p_i, i = 1, …, N, that is responsible for
the sampling dynamics. The posterior density to be sampled is related to the potential energy, given by [47]
(35) U(x) = −log p(x | y, Θ) − log Z_U,
where Z_U is a suitable normalization constant and Θ collects the remaining (fixed) variables. Since y and Θ are fixed, we may drop them and write U(x) instead. The kinetic energy, K(p), depends only on the auxiliary variables p. A standard choice for K(p) corresponds to independent particles in free space with masses m_i, i.e. K(p) = Σ_i p_i²/(2m_i). The dynamics of the sampling process are governed by the Hamiltonian function, H(x, p) = U(x) + K(p), which represents the total system energy. The joint density of (x, p) is defined by [47]
(36) p(x, p) = (1/Z) exp(−H(x, p)/T) ∝ p(x | y, Θ) · N(p; 0, I).
Herein, T is called the system temperature and Z is a normalization constant. The last expression is obtained by setting T = 1 and m_i = 1, while the Gaussian density arises from the special choice of the kinetic energy term. In HMC, a proposal for a new sample is obtained from the final point (x*, p*) of a trajectory described by Hamilton's equations of motion. They are calculated for i = 1, …, N [47]:
(37) dx_i/dt = ∂H/∂p_i = p_i/m_i,  dp_i/dt = −∂H/∂x_i = −∂U/∂x_i.
A Metropolis update decides whether a proposed sample is accepted or rejected, with acceptance probability [47]
(38) α = min{1, exp(H(x, p) − H(x*, p*))}.
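A minimal HMC update in the spirit of (35)-(38) combines the leapfrog integrator with a Metropolis test. The sketch below targets a standard bivariate Gaussian instead of the CFS posterior, and uses unit masses and illustrative step-size settings.

```python
import numpy as np

rng = np.random.default_rng(3)

def hmc_step(x, log_post, grad, eps=0.1, L=20, rng=rng):
    """One HMC update: draw momenta, simulate Hamiltonian dynamics with the
    leapfrog integrator, accept/reject with a Metropolis test (unit masses)."""
    p = rng.standard_normal(x.size)
    x_new, p_new = x.copy(), p.copy()
    p_new += 0.5 * eps * grad(x_new)                 # half step for momentum
    for _ in range(L - 1):
        x_new += eps * p_new                         # full step for position
        p_new += eps * grad(x_new)                   # full step for momentum
    x_new += eps * p_new
    p_new += 0.5 * eps * grad(x_new)                 # final half momentum step
    # H = U + K with U = -log_post and K = ||p||^2 / 2
    dH = (-log_post(x_new) + 0.5 * p_new @ p_new) - (-log_post(x) + 0.5 * p @ p)
    return (x_new, True) if rng.uniform() < np.exp(-dH) else (x, False)

# Demo target: standard bivariate Gaussian, log p(x) = -||x||^2 / 2 + const
log_post = lambda v: -0.5 * v @ v
grad = lambda v: -v
x, samples = np.zeros(2), []
for _ in range(2000):
    x, _ = hmc_step(x, log_post, grad)
    samples.append(x)
samples = np.array(samples)
```

Because the leapfrog integrator nearly conserves the Hamiltonian, the energy difference dH stays small and most long-range proposals are accepted, which is what makes HMC effective in correlated, high-dimensional spaces.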
7 Parametric DL strategies for CFS
In this section, we present two strategies for parametric dictionary learning in CFS. In the first strategy (S1), we follow the ideas of hybrid Bayesian inference [64, 65] and AE-based DL [6], where ν is a deterministic parameter that is estimated using the Monte Carlo EM algorithm in [8]. In the second strategy (S2), we pursue a full Bayesian approach and consider a probabilistic model for ν. Herein, approximate inference is accomplished by extending the Gibbs sampler in Section 6 to jointly estimate x and ν. Fig. 3 depicts the dependency graph for both strategies, where the variables associated with the prior on ν belong exclusively to S2. As pointed out in [64, 65], hybrid and full Bayesian strategies have their individual advantages in certain situations. For small sample sizes, Bayesian methods can be superior if good prior knowledge is available [64]. Nonetheless, they are often computationally more complex, and insufficient prior information can lead to a small-sample bias, even if a noninformative prior is used [64]. In CFS, the sample size is small and only vague prior knowledge of ν is available. Therefore, we investigate the performance of both DL strategies based on our probabilistic sparse model. The computational complexity of both strategies is comparable. It is dominated by HMC, i.e. by sampling the high-dimensional vector x in each iteration of the Gibbs sampler. Regarding ν, the following prior knowledge is assumed: In S1, we roughly restrict the range of values that ν can take, while in S2, we define a noninformative prior over the same range. Recall that ν effectively describes the characteristics of the receiver low-pass filter. To create the dictionary for a certain value of ν using (3), the inverse Fourier transform in (4) has to be evaluated for each atom. Thus, the dictionary is not a simple function of ν, and we restrict ourselves to a discrete set of parameters with lower and upper bounds. Since the bandwidth should be positive and bounded, such bounds exist, and the set contains discrete values between them.
7.1 Hybrid DL: iterative estimation of x and ν (S1)
The dictionary parameter in the CFS problem can be iteratively estimated using a Monte Carlo EM algorithm. First, an initial value, ν(0), has to be chosen. In subsequent iterations, indexed by j, we obtain joint samples of all variables by Gibbs sampling and HMC according to Section 6. Then, we determine the posterior expectation of x, using the previous estimate ν(j−1):
(39)  
(40) 
where the integration is over the domain of x. The current estimates of the reflection delays are determined by identifying the indices of the K largest elements in the posterior mean of x, which is obtained by exchanging the respective terms in (40). Besides, we also estimate the amplitudes of the significant components in x. They can be useful to assess the sparsity level of the solution and to determine the amount of optical power reflected from the FBGs. Since the posterior of x is multimodal, with one narrow peak around zero and another peak at some larger amplitude, the MAP estimate is more suitable for this task. It is given by
(41)  
(42) 
However, the estimates of the reflection delays obtained from the MAP solution are less accurate than those obtained by the posterior mean. Therefore, the empirical MAP solution is only used to estimate the reflection amplitudes. Next, we calculate the current estimate of the objective by taking the expected value over x (E-step):
(43)  
(44) 
Herein, the expectation is taken over the product space formed by the individual domains of all latent variables. In the M-step, a locally optimal value, ν(j), is obtained by maximizing over the discrete set of candidate values, i.e.
(45) 
7.1.1 Initialization of ν via bisectional search
An adequate initialization, ν(0), can alleviate the problem of local optima in the EM algorithm. In CFS, the desired sparsity level is known to be the number of reflections, K. Hence, a good choice of ν(0) yields a solution for x with K significant nonzero elements. Starting at an arbitrary value, a bisectional search within the candidate set can quickly determine a suitable initial value. After choosing the first value at random, the set is subdivided into two parts, containing all larger and all smaller values, respectively. When the number of peaks is too high, the next trial is chosen as the median of the lower division. If it is too low, the next trial is the median of the upper division, and so on. For a properly selected ν(0), S1 converges faster and is more likely to approach (or even attain) the global optimum.
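The bisectional search can be sketched as follows, assuming (as the text suggests) that the number of significant peaks grows with the bandwidth; `count_peaks` is a hypothetical stand-in for running the sparse estimator at a trial value.

```python
def init_nu(count_peaks, grid, K):
    """Bisectional search over a sorted grid of candidate bandwidths: find a
    value whose sparse solution exhibits K significant peaks. `count_peaks`
    stands in for the sparse estimator and is assumed nondecreasing in nu."""
    lo, hi = 0, len(grid) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        n = count_peaks(grid[mid])
        if n == K:
            return grid[mid]
        if n > K:
            hi = mid - 1     # too many peaks: search smaller bandwidths
        else:
            lo = mid + 1     # too few peaks: search larger bandwidths
    return grid[max(0, min(lo, len(grid) - 1))]   # closest match if K is never hit

# Toy stand-in: peak count grows with nu (illustrative behavior only)
grid = [0.1 * i for i in range(1, 21)]
count_peaks = lambda nu: round(nu * 10)
nu0 = init_nu(count_peaks, grid, K=4)
```

Each probe halves the remaining candidate range, so only O(log |grid|) runs of the sparse estimator are needed to find a suitable starting point.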
7.2 Bayesian DL: joint estimation of (x, ν) (S2)
In strategy S2, we treat ν as a random variable. Due to its discrete nature, each element of the candidate set is assigned a probability mass. Then, ν is categorically (Cat) distributed over the set of discrete dictionary parameters, with the corresponding probability masses. Uncertainty in the a priori assigned probability masses is taken into account in terms of a prior on these masses. The Dirichlet (Dir) distribution can be used as the conjugate prior, i.e.
(46) 
where the Beta function appears as the normalization and the concentration parameters describe the number of occurrences of the values in the candidate set. When a new element is sampled, a posterior count is assigned to that value. After sampling another value in the next iteration, this count is reassigned to the new value. An indicator vector marks the current sample, i.e. it equals one for a single index, while all other elements are zero. A noninformative prior is obtained if all values are equally likely and each element is assigned a single count. Then, a new sample has a strong impact on the posterior distribution. In contrast, for large concentration values, a new count leaves the distribution almost invariant. The complete model is then given by (30) and, in addition,
(47)  
(48) 
To accomplish approximate inference in this model, the dictionary parameter ν and its probability masses are included in the Gibbs sampling procedure of Section 6. Therefore, the corresponding full conditional distributions must be determined. Based on the dependencies in Fig. 3, and since the remaining variables are assumed to be mutually independent, we find
(49)  
(50) 
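One Gibbs step for the discrete dictionary parameter then amounts to scoring each candidate value by prior and likelihood and drawing from the normalized posterior masses. The toy likelihood below is illustrative, not the CFS model.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_nu(grid, log_lik, alpha, rng=rng):
    """Gibbs draw for a discrete dictionary parameter: combine the categorical
    prior (normalized Dirichlet weights alpha) with the per-candidate
    log-likelihood, then sample an index from the posterior masses."""
    log_prior = np.log(np.asarray(alpha) / np.sum(alpha))
    log_post = log_prior + np.array([log_lik(v) for v in grid])
    log_post -= log_post.max()           # stabilize before exponentiating
    w = np.exp(log_post)
    w /= w.sum()
    return rng.choice(len(grid), p=w)

# Toy demo: likelihood sharply prefers the candidate value 0.6 (index 2)
grid = [0.2, 0.4, 0.6, 0.8]
log_lik = lambda v: -100.0 * (v - 0.6) ** 2
alpha = np.ones(len(grid))               # noninformative prior: one count each
idx = [sample_nu(grid, log_lik, alpha) for _ in range(200)]
```

Subtracting the maximum log-posterior before exponentiating is the usual log-sum-exp trick; without it, the per-candidate likelihoods can underflow to zero.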
Algorithm: Sparse estimation and PDL, strategy S1 & S2
Input: 


Output:  
Parameters:  , 
internal HMC parameters (cf. [47, 33]).  
0. Initialize:  at random via bisectional search, 
as in (30), (S2):  
1. for to do  
2.  for to do 
3.  Gibbs sampling: (i) using (33) and (34), 
(ii) via HMC.  
(S2): (iii) using (49) and (50)  
4.  end for 
5.  Estimate: from in (40) with , 
from (40), from (42),  
5.a  (S1:) . 
5.b  (S2:) from (40) with . 
6.  if or 
7.  return . 