Dictionary Learning Strategies for Compressed Fiber Sensing Using a Probabilistic Sparse Model

We present a sparse estimation and dictionary learning framework for compressed fiber sensing based on a probabilistic hierarchical sparse model. To handle severe dictionary coherence, selective shrinkage is achieved using a Weibull prior, which can be related to non-convex optimization with p-norm constraints for 0 < p < 1. In addition, we leverage the specific dictionary structure to promote collective shrinkage based on a local similarity model. This is incorporated in the form of a kernel function in the joint prior density of the sparse coefficients, thereby establishing a Markov random field relation. Approximate inference is accomplished using a hybrid technique that combines Hamilton Monte Carlo and Gibbs sampling. To estimate the dictionary parameter, we pursue two strategies, relying on either a deterministic or a probabilistic model for the dictionary parameter. In the first strategy, the parameter is estimated based on alternating estimation. In the second strategy, it is jointly estimated along with the sparse coefficients. The performance is evaluated in comparison to an existing method in various scenarios using simulations and experimental data.


1 Introduction

Fiber sensors are versatile devices with broad applicability [38, 23, 46, 63]. They are of high interest in smart structures to sense and react to the environment [42, 58]. For quasi-distributed sensing based on wavelength-division multiplexing (WDM), fiber Bragg grating (FBG) sensors are often employed due to their sensitivity to strain or temperature [38, 23]. An FBG describes a local variation of the refractive index and reflects light at a certain wavelength, called the Bragg wavelength. Typically, a number of detuned FBGs are imprinted into the core of an optical fiber. Fiber interrogation is performed using broadband light sources or wavelength-tunable lasers. The latter feature higher local signal-to-noise ratios (SNRs) [46, 63]. However, in order to monitor time-varying perturbations, the laser has to sweep quickly through the tuning range. This requires high-speed analog-to-digital converters (ADCs) and produces large amounts of data.
Compressed sensing (CS) [5, 15, 27] can help to alleviate these problems by taking samples in the form of projections into a low-dimensional subspace. The original signal can be reconstructed by exploiting the sparsity of the signal with respect to an adequate dictionary [14, 11]. This task strongly resembles the sparse synthesis problem with redundant dictionaries in [57, 53]. Besides greedy methods, such as Orthogonal Matching Pursuit (OMP) [49], ℓ₁-minimization is a popular method to solve the sparse reconstruction problem [56, 24, 12]. It relies on the restricted isometry property (RIP), which essentially states that unique sparse solutions can be recovered by restricting the ℓ₁-norm instead of the ℓ₀-norm [25, 13]. Redundant dictionaries can yield highly sparse representations that allow for estimating quantities at high resolution directly in the sparse domain [24, 41]. However, redundancy causes inter-column coherence, and it is likely that the required RIP conditions are no longer fulfilled [24, 52, 12]. The ℓ_p-norm, with 0 < p < 1, offers a trade-off to avoid the NP-hard combinatorial problem imposed by the ℓ₀-norm, while a unique solution might still be retrieved [17, 18].

Dictionaries can be classified as parametric or non-parametric. Non-parametric dictionaries are typically learned from training data and often used if no analytical model is available [53]. While they can yield sparser representations of certain data realizations [40], non-parametric dictionaries usually lack an interpretable structure and are inefficient in terms of storage [53]. Parametric dictionaries, in turn, rely on an analytical model for the observed signal. Their analytic form offers an efficient implementation and a means to obtain optimality proofs and error bounds [53]. They are also favorable in terms of scalability and storage efficiency [4, 53]. Translation-invariant dictionaries represent an important sub-class of parametric dictionaries that can be used to estimate the translation coefficients of localized signals [36, 9, 29]. Nonetheless, due to the complexity of natural signals, some model parameters might be unknown or contain uncertainty. Parametric Dictionary Learning (DL) addresses this problem with the aim of estimating these parameters from the measured data. Herein, statistical DL methods, such as maximum likelihood (ML) or maximum a posteriori (MAP) estimation, are commonly employed [53]. In order to solve the resulting optimization problem, alternating estimation (AE) is a frequently pursued sub-optimal paradigm that iteratively optimizes a local objective function [43, 6, 2]. In a Bayesian setting, the Expectation Maximization (EM) algorithm is a popular variant of AE-based estimation [53].
A model for the sparse coefficients can be of deterministic or probabilistic nature. While the deterministic case is often assumed in sparse estimation [24, 12], a probabilistic model offers high flexibility to take model deviations and measurement errors into account. Moreover, a hierarchical structure can be used to incorporate additional uncertainty in prior assumptions. Sparsity can either be promoted by continuous distributions, resulting in weakly sparse models, or by discrete mixtures, leading to strongly sparse models [43]. A prominent example of discrete mixtures are Spike & Slab models [34]. They are based on binary activations and yield strongly sparse representations. Continuous sparse priors, such as a Gaussian or double-exponential (Laplace) prior, feature high excess kurtosis with heavy tails and a narrow peak around zero [50, 43]. Besides sparsity, additional knowledge of the signal, e.g. correlation, can be incorporated [26, 66].
For many practical models, evaluating the posterior distribution is not feasible, and approximate methods, such as Markov Chain Monte Carlo (MCMC) or variational Bayes methods, have to be used to accomplish inference [8, 48, 54]. Variational methods use rather simple analytic functions to approximate the posterior distribution by factorization, which is favorable in terms of scalability and computational costs but leads to a deterministic approximation [54, 8]. MCMC methods attempt to sample the posterior distribution, where subsequent samples form a Markov chain [8]. The Hamilton Monte Carlo (HMC) method is a powerful technique that is especially suitable for sampling high-dimensional spaces in the presence of correlation [47]. However, MCMC performance is generally limited by the available computation time, thereby relying on a stochastic approximation. Another application of MCMC is found in non-convex optimization, where Stochastic Gradient (SG) MCMC has gained popularity for large-scale Bayesian learning [22, 20, 19].
In the present work, we consider the problem of Compressed Fiber Sensing (CFS) with highly coherent translation-invariant dictionaries and imperfectly known parameters. For the sparse coefficients, a weakly sparse hierarchical model is considered. We also establish a relation between this model and non-convex optimization with ℓ_p-norm constraints for 0 < p < 1. In order to alleviate the problem of dictionary coherence, we leverage additional structure of the dictionary and achieve augmented sparsity by establishing a Markov random field (MRF) relation among the sparse coefficients. For dictionary learning, we pursue two different strategies: In the first strategy (S1), we consider a deterministic dictionary parameter, that is estimated using a Monte Carlo EM algorithm. In the second strategy (S2), a probabilistic hierarchical model for the dictionary parameter is considered, leading to a full Bayesian formulation and joint estimation of the sparse coefficients and the dictionary parameter. In both strategies, approximate inference is accomplished using a hybrid MCMC method based on Gibbs sampling and HMC. Finally, we use simulations and real data to compare the proposed methods to previous work in [62], where a deterministic model is considered for the sparse coefficients and the dictionary parameter. For the deterministic case, we derive the Cramér-Rao bound (CRB) to assess the performance gain achieved by a probabilistic model.

1.1 Contributions

• We propose a probabilistic model for the sparse coefficients, where a Weibull prior is used to promote (weak) sparsity. Additional collective shrinkage is achieved by establishing an MRF relation among the sparse coefficients based on a bivariate kernel function in the joint prior density. This helps to moderate the impact of severe dictionary coherence and can be used in general sparse synthesis problems with similar dictionary structure. We also establish a relation to non-convex optimization with constraints on the ℓ_p-norm for 0 < p < 1.

• For dictionary learning, we investigate two conceptually different strategies, assuming either a deterministic (S1) or a stochastic (S2) dictionary parameter. In both strategies, the noise level can be jointly estimated along with the sparse coefficients. We further highlight advantages, disadvantages and limitations to offer support in choosing an adequate method for practical systems.

• To accomplish inference in these models, we use a hybrid MCMC method, combining HMC and Gibbs sampling. We show its applicability and efficacy in the considered sampling problem for CFS.

• We use simulations to evaluate the performance of the proposed sparse estimation and DL methods for various scenarios of different CS sample sizes, SNRs and CS matrices. These results are compared to an existing method in [62], where the sparse coefficients and the dictionary parameter are assumed to be deterministic. In addition, we provide a real-data example to verify the practical applicability of S1 and S2.

• We derive the Cramér-Rao bound for jointly estimating the sparse coefficients and the dictionary parameter in the deterministic case. It is a valid bound for the competing method in [62], and serves to assess the achieved performance gain of our probabilistic approach.

2 Related Work

There exists little work addressing the combination of CS and DL for the application of WDM-based distributed fiber-optic sensing [60, 61, 62]. In [60], a model for the received sensor signal is presented, from which a redundant shift-invariant parametric dictionary is created. The works in [61, 62] focus on the aspect of CS and sparse estimation in the case of uncertain dictionary parameters. The authors use AE-based estimation to determine the dictionary parameters, where a pre-processing routine accounts for severe dictionary coherence. Unlike our approach, these works use a deterministic model for the sparse coefficients and dictionary parameters.
Weakly sparse models have been widely used in the literature. A comprehensive analysis of different hierarchical sparse prior models is provided in [44]. The general problem of choosing the prior in weakly sparse models for sparse regression is addressed in [50], where the authors describe various properties of different shrinkage priors and illuminate the selection problem from two perspectives: prior distributions and penalty functions. The work in [43] also investigates Bayesian methods with different sparse models in comparison to classical ℓ₁-minimization. Seeger [54] found that the Laplace prior is able to shrink most components close to zero, while allowing for selected components to become sufficiently large. This effect, termed selective shrinkage in [35], is most noticeable for heavy-tailed priors, e.g. the Student's t-prior [54] or the horseshoe prior in [16, 50]. Based on these findings, we select a sparsity prior that resembles a positive version of the horseshoe prior. Other works, which focus on the perspective of penalized regression, report higher sparsity levels by penalizing the ℓ_p-norm with 0 < p < 1 instead of the ℓ₁-norm [30]. The authors in [17] show that the RIP requirements for the dictionary can be relaxed in this case. It is also pointed out in [17, 18] that non-convex CS with ℓ_p-norm penalization requires fewer measurements than standard CS, which is based on the ℓ₁-norm. We rely on these results and show a relation between the considered sparsity prior and non-convex optimization with ℓ_p-norm constraints.
There exist several approaches to exploit additional structure of the signal. One example is block sparsity [26]. A block sparse Bayesian learning framework is proposed in [66], pointing out how correlation can be exploited in regularization algorithms. Wakin et al. [59] introduce the concept of joint sparsity for signal recovery in distributed CS theory. In [41], temporal correlation across subsequent CS measurements is considered, while the authors in [21] use correlation to achieve smoothness. Another related concept is proposed in [3], where a truncated multivariate Ising MRF model is used to describe the correlation between adjacent pixels for image processing. Different from these works, we use the ideas of MRFs [45] and exploit correlation to achieve collective shrinkage among the sparse coefficients.
A comparative analysis in [43] suggests that MCMC methods are powerful for inference in sparse models. In [47], the benefits of HMC and Gibbs sampling in hierarchical models are outlined. It is also shown that HMC can be more effective than a Gibbs sampler for sampling high-dimensional spaces in the presence of correlation. According to these results, we consider a hybrid MCMC method that combines HMC and Gibbs sampling for inference in our hierarchical model, where the sparse coefficients are high-dimensional and correlated. For parametric DL, the Monte Carlo EM algorithm in S1 represents one variant of the frequently applied AE-based estimation technique [39, 51]. Comparable to S2 is the Bayesian framework for sparse estimation and DL in [31]. However, the authors use a Gaussian prior without correlation.

2.1 Outline

In Section 3, the signal model for CFS is introduced, and in Section 4, the CRB for joint estimation of the deterministic sparse coefficients and dictionary parameters is derived. Section 5 details the sparsity and local similarity model, while Section 6 describes the hybrid MCMC method for approximate inference in this model. The parametric DL strategies S1 and S2 are described in Section 7. Section 8 shows the working principle along with a performance analysis of the proposed and an existing method based on simulations and experimental data. A discussion of the results and findings is given in Section 9. Section 10 concludes this work.

3 Signal Model

In order to determine the quantity and nature of impairments at the FBGs in a WDM-based fiber sensor, the time delays of the reflections from the individual FBGs need to be estimated. We adopt the model in [60, 61, 62], where CS-based acquisition is employed to reduce the number of samples to be stored and processed. The CS measurements are described by

 y=ΦA(θ)x+n, (1)

where Φ ∈ R^{M×L} is the CS sampling matrix and n ∈ R^M is a Gaussian noise component with independent and identically distributed (i.i.d.) entries, n_m ∼ N(0, σ_n²), m = 1,…,M. The vector x ∈ R^N is sparse with K significant components, and θ is a scalar dictionary parameter. The matrix A(θ) ∈ R^{L×N} represents a redundant shift-invariant dictionary, and its columns, called atoms, represent FBG reflections on a dense grid of delays. The indices of the significant components in x indicate the desired reflection delays. They are collected in the set S. We can write the full data likelihood function for this model by

 p(y|x,θ) = (2πσ_n²)^{−M/2} exp( −(1/(2σ_n²)) ‖y − ΦA(θ)x‖₂² ). (2)

The i-th dictionary atom, a_i(θ), is defined by [62]

 [a_i]_l(θ) = r(lT_d − iδt, θ),  l = 1,…,L, (3)

where the generating function, r(t,θ), describes the reflection from a single FBG, incrementally shifted by δt and sampled with a design sampling period, T_d. In order to specify the dictionary parameter in CFS according to [62], we write

 r(t,θ) = ∫_{−∞}^{∞} e^{j2πft} H_LP(f,θ) i_ph(f) df. (4)

Herein, i_ph(f) is the received photocurrent in the frequency domain, and H_LP(f,θ) is the transfer function of a lowpass filter that models a limited effective bandwidth of the receiver circuitry. This bandwidth is described in terms of a positive dictionary parameter, θ. As an auxiliary parameter, it accounts for different indistinguishable sources of uncertainty that all contribute to the broadening in the temporal response of the FBG reflections. A detailed model for i_ph(f) is provided in [62].
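As an illustration, the sketch below builds a toy shift-invariant dictionary in the spirit of Eq. (3) and simulates CS acquisition per Eq. (1). The Gaussian-shaped generating function is a hypothetical stand-in for r(t,θ) (the true kernel follows from the receiver model in (4)); the dimensions and the random CS matrix are likewise illustrative choices, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the generating function r(t, theta) in Eq. (4):
# a Gaussian-shaped reflection whose width shrinks as the bandwidth-like
# parameter theta grows. The true r(t, theta) involves the lowpass transfer
# function H_LP(f, theta) and the photocurrent spectrum i_ph(f).
def r(t, theta):
    return np.exp(-0.5 * (theta * t) ** 2)

L, N = 256, 512          # samples per atom, number of delay-grid atoms
Td, dt = 1.0, 0.5        # design sampling period and delay-grid spacing
theta = 0.2

# Eq. (3): [a_i]_l(theta) = r(l*Td - i*dt, theta)
l = np.arange(1, L + 1)[:, None]
i = np.arange(N)[None, :]
A = r(l * Td - i * dt, theta)          # L x N shift-invariant dictionary

# K-sparse coefficient vector and CS acquisition, Eq. (1): y = Phi A x + n
K, M = 4, 64
x = np.zeros(N)
x[rng.choice(N, K, replace=False)] = rng.uniform(0.5, 1.0, K)
Phi = rng.standard_normal((M, L)) / np.sqrt(M)   # random CS matrix
y = Phi @ A @ x + 0.01 * rng.standard_normal(M)

print(A.shape, y.shape)
```

Note how redundancy (N > L) makes neighboring atoms nearly parallel, which is exactly the coherence problem addressed in Section 5.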

4 The CRB for joint estimation of (x,θ) in CFS

We derive the CRB for jointly estimating the deterministic parameters (x, θ). This is a valid bound for the model considered in [62] and can be used to assess the relative performance gain achieved by the proposed probabilistic sparse model and DL strategies. Although the Bayesian CRB in [10] can be empirically determined, we found that this bound is very loose, due to the high information content in the considered sparsity prior. Therefore, and with regard to the comparative analysis with the deterministic case in [62], the non-Bayesian CRB is more useful in this case.
The constrained CRB for estimating x with sparsity constraints has been derived in [7]. However, this derivation does not assume uncertainty in the dictionary. It is based on locally balanced sets and involves the projection of the Fisher information matrix (FIM), I(x), onto a low-dimensional subspace spanned by the so-called feasible directions. Any estimator, x̂, for which the constrained CRB is a valid lower bound, must be unbiased with respect to these directions. The projection matrix can be created from the unit vectors corresponding to the non-zero coefficients in x, that is, U = [e_i]_{i∈S} ∈ R^{N×K}. For a Gaussian likelihood as in (2), the FIM can be derived from the expected value of the Hessian matrix of the log-likelihood function, i.e. [37, 7]

 I(x) = −E_y[∇²_x log p(y|x,θ)] = (1/σ_n²) B⊤B, (5)

with B = ΦA(θ). Further, we define the reduced FIM by I_K = U⊤I(x)U. Then, given that x is exactly K-sparse, the constrained CRB for a known dictionary becomes [7]

 Cov(x̂) ⪰ U I_K^{−1} U⊤,  ‖x‖₀ = K. (6)

Based on these results, we derive the CRB for the joint parameter vector γ = (x⊤, θ)⊤. First, we derive the Fisher information for θ, given that x is known. It is given by

 I(θ) = −E_y[∂²/∂θ² log p(y|x,θ)]
      = E_y[∂²/∂θ² (1/(2σ_n²)) (y − ΦA(θ)x)⊤(y − ΦA(θ)x)]
      = (1/σ_n²) x⊤A′(θ)⊤Φ⊤ΦA′(θ)x. (7)

Herein, A′(θ) denotes the (element-wise) derivative of A(θ) with respect to θ. Next, we have to take into account that x and θ share some mutual information. Therefore, we define the combined FIM:

 I(γ) = ( I(x)      −E_y[u]
          −E_y[u⊤]   I(θ) ), (8)

where u ∈ R^N collects the mixed derivatives, u_i = ∂²/(∂x_i ∂θ) log p(y|x,θ), i = 1,…,N. Since the partial derivatives can be interchanged, the off-diagonal elements are identical. In order to complete the definition of I(γ), we determine

 −E_y[u_i] = −E_y[∂²/(∂x_i ∂θ) log p(y|x,θ)] = (1/σ_n²) x⊤A′(θ)⊤Φ⊤Φ a_i(θ). (9)

The reduced FIM is obtained by appending the set of feasible directions, such that the θ coordinate is included, i.e. Ũ = diag(U, 1) ∈ R^{(N+1)×(K+1)}. Hence, I_{K+1} = Ũ⊤I(γ)Ũ. To obtain the inverse, we apply the matrix inversion lemma twice [32]

 I_{K+1}^{−1} = ( (I_K − vv⊤/I(θ))^{−1}   −(1/b̆) I_K^{−1} v
                  −(1/b̆) v⊤ I_K^{−1}      1/b̆ ), (10)

where v = −U⊤E_y[u], b̆ = I(θ) − v⊤I_K^{−1}v, and

 (I_K − vv⊤/I(θ))^{−1} = I_K^{−1} + (1/b̆) I_K^{−1} v v⊤ I_K^{−1}. (11)

The constrained CRB for the joint parameter vector γ becomes

 Cov(γ̂) ⪰ Ũ I_{K+1}^{−1} Ũ⊤,  ‖x‖₀ = K. (12)

Finally, a lower bound for the mean squared error (MSE) in the joint setting is obtained by updating the individual estimation errors to account for the information shared between x and θ:

 MSE(x̂) ≥ Tr(I_K^{−1}) + (1/b̆) v⊤ I_K^{−1} I_K^{−1} v, (13)
 MSE(θ̂) ≥ 1/b̆ = I(θ)^{−1} + (v⊤ I_K^{−1} v) / ( I(θ) (I(θ) − v⊤ I_K^{−1} v) ). (14)
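The blockwise structure of the joint bound can be checked numerically. The sketch below builds a synthetic combined FIM from random placeholder matrices (standing in for the columns of ΦA(θ) and their derivatives, not the CFS model itself) and verifies that the blockwise inverse of Eqs. (10)-(11) and the MSE expressions of Eqs. (13)-(14) agree with a direct matrix inverse.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dimensions: K active coefficients, M CS samples. B stands in for the
# K active columns of Phi*A(theta); Bp for the columns of Phi*A'(theta).
M, K = 20, 4
sigma2 = 0.1
B = rng.standard_normal((M, K))
Bp = rng.standard_normal((M, K))
xS = rng.uniform(0.5, 1.0, K)          # active amplitudes

IK = B.T @ B / sigma2                  # reduced FIM of x, cf. Eq. (5)
Itheta = (Bp @ xS) @ (Bp @ xS) / sigma2    # cf. Eq. (7)
v = B.T @ (Bp @ xS) / sigma2           # cross terms, cf. Eq. (9)

# Combined reduced FIM, cf. Eq. (8), and its direct inverse
I_joint = np.block([[IK, v[:, None]], [v[None, :], np.array([[Itheta]])]])
direct = np.linalg.inv(I_joint)

# Blockwise inverse via Eqs. (10)-(11)
IKinv = np.linalg.inv(IK)
b = Itheta - v @ IKinv @ v             # Schur complement (b-breve)
top_left = IKinv + np.outer(IKinv @ v, v @ IKinv) / b
assert np.allclose(direct[:K, :K], top_left)
assert np.isclose(direct[K, K], 1.0 / b)

# MSE bounds, Eqs. (13)-(14): the coupling term is strictly positive,
# so joint estimation is always harder than estimation with known theta.
mse_x = np.trace(IKinv) + (v @ IKinv @ IKinv @ v) / b
mse_theta = 1.0 / b
print(mse_theta > 0, mse_x > np.trace(IKinv))
```

The check also makes the interpretation of b̆ concrete: it is the Schur complement of I_K in the combined FIM, so 1/b̆ exceeds I(θ)^{−1} whenever v ≠ 0.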

5 Probabilistic sparse model

Regarding the model in (1), the data can be explained in different ways. On the one hand, many non-zero components in x and a large bandwidth, θ, result in many narrow temporal peaks that can yield a good approximation of the observed reflections. On the other hand, it is known that the sensing fiber contains K FBGs, so we expect exactly K reflections. Therefore, a more useful explanation is given by K significant elements in x with a smaller value of θ, such that S correctly indicates the reflection delays. Nevertheless, even for a suitable value of θ, the signal is usually not exactly sparse but contains many small elements close to zero, e.g. due to measurement noise. In a strongly sparse model, these contributions are not taken into account, which impacts the positions of the non-zero elements in x. Hence, it may lead to incorrectly estimated reflection delays. This motivates a weakly sparse model, where the most significant components indicate the reflection delays. When x and θ are both unknown, the reflection delays can only be estimated when prior information on sparsity is incorporated, since the estimate of x depends on θ and vice versa. Severe dictionary coherence aggravates this problem and results in several non-zero components with moderate amplitudes around the true significant elements. The coherence level is even stronger when the dimensionality of the acquired data is further reduced by CS. Thus, an adequate sparse model for x must compensate for this effect. Classic ℓ₁-minimization can be interpreted as a MAP estimation problem, where x has i.i.d. entries with Laplace priors [43]. However, the required performance guarantees for ℓ₁-minimization, essentially the RIP [14, 11], are no longer fulfilled in the case of strong dictionary coherence. According to [17, 18], the RIP conditions can be relaxed for ℓ_p-minimization, when 0 < p < 1. Therefore, we use a prior with a stronger selective shrinkage effect, which can be related to constraints on the ℓ_p-norm in non-convex optimization.
Yet, specific characteristics of the signal have to be considered. The measured reflection signal is proportional to the optical power, and the dictionary atoms essentially model the optical power reflected from the individual FBGs. Thus, the prior must also account for the non-negativity of the data. Due to these restrictions, we choose a Weibull prior that resembles a positive version of the horseshoe prior in [50] and induces the required selective shrinkage effect:

 x_i ∼ p(x_i) = W(x_i | λ_w, k_w),  x_i > 0,  i = 1,…,N, (15)

where λ_w and k_w are the scale and shape parameters, respectively. Then, the joint prior density of x is given by

 p(x|k_w,λ_w) = (k_w λ_w^{−k_w})^N ∏_{i=1}^N x_i^{k_w−1} exp( −λ_w^{−k_w} ∑_{i=1}^N x_i^{k_w} ). (16)

Fig. 1 (top left) shows qualitatively the shape of the considered prior in the bivariate case.
Based on (16) and (2), we can relate the problem to constrained ML estimation. First, let us consider an interpretation in terms of MAP estimation as in [44], by maximizing the posterior density of x or, equivalently, solving

 argmin_x ‖y − ΦA(θ)x‖₂² + μ₁ ∑_{i=1}^N log(x_i) + μ₂ ∑_{i=1}^N x_i^{k_w}, (17)

where μ₁ = 2σ_n²(1 − k_w) and μ₂ = 2σ_n² λ_w^{−k_w}. In order to formulate a related constrained ML problem, let us define two functions,

 g₁ = ∑_{i=1}^N x_i^{k_w} − λ₁^{k_w}  and  g₂ = ∑_{i=1}^N log(x_i) − λ₂, (18)

where λ₁ and λ₂ are related to the coefficients μ₂ and μ₁, respectively. The functions in (18) can represent inequality constraints of the form g₁ ≤ 0 and g₂ ≤ 0, which account for the impact of the prior by restricting the search space. Hence, a constrained version of the ML problem can be formulated by

 argmin_{x≻0} ‖y − ΦA(θ)x‖₂² (19)
 s.t. ‖x‖_{k_w} ≤ λ₁ (20)
 and ∑_{i=1}^N log(x_i) ≤ λ₂. (21)

In this non-convex problem, ‖x‖_{k_w} denotes the ℓ_{k_w}-norm with 0 < k_w < 1. The hyperparameters λ₁ and λ₂ control the shrinkage effects. Fig. 1 (top right) depicts the search space restricted by the constraints (20)-(21). The borders are shown for fixed values of λ₁ and λ₂ in the bivariate case.
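To see the selective shrinkage effect behind (17), one can compare the negative log-prior implied by the Weibull density (15) with the linear ℓ₁ (Laplace) penalty. A minimal sketch, assuming k_w < 1 and illustrative parameter values:

```python
import numpy as np

# Negative log of the Weibull prior (15) up to additive constants:
# -log p(x_i) = (1 - k_w) * log(x_i) + (x_i / lam_w)^{k_w}.
# For k_w < 1 the penalty grows sublinearly in x_i (an l_{k_w}-type term
# plus a log term), so large coefficients are shrunk far less aggressively
# than under an l_1 (Laplace) penalty -- the selective shrinkage effect.
def weibull_penalty(x, kw, lam_w):
    return (1.0 - kw) * np.log(x) + (x / lam_w) ** kw

kw, lam_w = 0.5, 1.0          # illustrative hyperparameters
x = np.array([0.1, 1.0, 10.0])
pen = weibull_penalty(x, kw, lam_w)
l1 = x / lam_w                # Laplace penalty for comparison

# Marginal cost of a coefficient growing from 1 to 10:
# sublinear for the Weibull prior, linear for the Laplace prior.
print(pen[2] - pen[1] < l1[2] - l1[1])
```

This is precisely why a large, correct coefficient can "escape" the shrinkage that suppresses the moderate coherence-induced amplitudes around it.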

5.1 Local covariance model for augmented sparsity

In analogy to the concept of block sparsity [26], we can use the specific sparse structure of the signal with respect to the shift-invariant dictionary for CFS to exploit sparsity among groups of variables. The signal contains only K reflections that arrive at temporally separated delays, indicated by the significant components in x. Therefore, we can assume that a significant coefficient is always surrounded by larger groups of non-significant coefficients, and any two significant components are always well separated. Also, it is likely that the amplitudes of adjacent non-significant coefficients are similarly close to zero. Borrowing from the ideas of MRFs [45], such local similarity can be modeled by a prior on the differential coefficients, Δx_i = x_{i+1} − x_i, i = 1,…,N−1. It restricts the variation of adjacent amplitudes and establishes an MRF relation between neighboring coefficients in x. Then, non-significant coefficients with larger amplitudes are pulled down to match the majority with almost-zero amplitudes, which promotes additional collective shrinkage. However, if a significant coefficient follows a non-significant one (or vice versa), the model should allow for larger changes. Therefore, the differential variation must be locally specified, dependent on the respective amplitudes, in order to avoid undesired shrinkage or equalization.

To this end, we define a kernel function for all adjacent pairs of sparse coefficients, (x_i, x_{i+1}), i = 1,…,N−1, with hyperparameter λ_Δ:

 K(x_i, x_{i+1} | λ_Δ) = exp( −λ_Δ |x_{i+1} − x_i| f_K(x_i, x_{i+1}) ). (22)

The bivariate function f_K(x_i, x_{i+1}) controls the similarity level between adjacent coefficients. Within the scope of this work, we consider cases where this function is parameterized by positive constants. The kernel functions can be incorporated to yield a modified joint prior density,

 p̃(x | k_w, λ_w, λ_Δ) = (1/Z_K) ∏_{i=1}^N W(x_i | k_w, λ_w) ∏_{i=1}^{N−1} K(x_i, x_{i+1} | λ_Δ), (23)

with normalization constant Z_K. For any λ_Δ ≥ 0, it holds that 0 < K(x_i, x_{i+1} | λ_Δ) ≤ 1, and

 p̃(x | k_w, λ_w, λ_Δ)|_{Z_K=1} ≤ p(x | k_w, λ_w) (24)

is bounded. Hence, there exists a positive constant Z_K that normalizes (23) to make p̃ a proper density. Fig. 1 (bottom) visualizes the kernel function and its impact on the original prior in the bivariate case.
In the view of constrained ML estimation, the modified prior density in (23) can be related to the optimization problem in (19)-(21) by imposing the additional constraints

 |x_{i+1} − x_i| f_K(x_i, x_{i+1}) ≤ μ_i,  i = 1,…,N−1. (25)

Fig. 1 (top right) depicts a bivariate example. In order to show the MRF relation between the coefficients, we calculate the conditional densities p̃(x_i | x_{∖i}, k_w, λ_w, λ_Δ). To this end, we conveniently define x_{∖i} = (x_1,…,x_{i−1}, x_{i+1},…,x_N)⊤ and get

 p̃(x_i | x_{∖i}, k_w, λ_w, λ_Δ) = p̃(x_i | x_{i−1}, x_{i+1}, k_w, λ_w, λ_Δ)
  ∝ W(x_i | k_w, λ_w) K(x_{i−1}, x_i | λ_Δ) K(x_i, x_{i+1} | λ_Δ), (26)
 p̃(x_1 | x_{∖1}, k_w, λ_w, λ_Δ) ∝ W(x_1 | k_w, λ_w) K(x_1, x_2 | λ_Δ), (27)
 p̃(x_N | x_{∖N}, k_w, λ_w, λ_Δ) ∝ W(x_N | k_w, λ_w) K(x_{N−1}, x_N | λ_Δ), (28)

where dependencies appear only between directly adjacent coefficients.
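This Markov property can be verified numerically. The sketch below uses a hypothetical constant choice f_K ≡ 1 in the kernel (22) (the paper's f_K is a bivariate function with positive constants; a constant keeps the sketch simple) and checks that the conditional density ratio of one coefficient is unaffected by changing a non-neighboring coordinate:

```python
import numpy as np

# Kernel of Eq. (22) with the hypothetical choice f_K(x_i, x_{i+1}) = 1.
def kernel(xi, xj, lam_d):
    return np.exp(-lam_d * np.abs(xj - xi))

def weibull_pdf(x, kw, lam_w):
    return (kw / lam_w) * (x / lam_w) ** (kw - 1) * np.exp(-(x / lam_w) ** kw)

# Unnormalized modified prior, Eq. (23)
def joint_tilde(x, kw, lam_w, lam_d):
    w = np.prod(weibull_pdf(x, kw, lam_w))
    k = np.prod([kernel(x[i], x[i + 1], lam_d) for i in range(len(x) - 1)])
    return w * k

kw, lam_w, lam_d = 0.7, 1.0, 2.0
x = np.array([0.1, 0.12, 1.5, 0.11, 0.09])

# Ratio of the joint density for two values of coordinate i: all factors
# not involving x_i cancel, so the ratio equals the conditional ratio.
def conditional_ratio(x, i, val_a, val_b):
    xa, xb = x.copy(), x.copy()
    xa[i], xb[i] = val_a, val_b
    return joint_tilde(xa, kw, lam_w, lam_d) / joint_tilde(xb, kw, lam_w, lam_d)

r1 = conditional_ratio(x, 2, 1.5, 0.2)
x_far = x.copy(); x_far[0] = 0.5       # perturb a non-neighboring coordinate
r2 = conditional_ratio(x_far, 2, 1.5, 0.2)
print(np.isclose(r1, r2))              # Markov property: ratios agree
```

The cancellation is exactly the content of (26)-(28): only the Weibull factor of x_i and its two adjacent kernel factors survive in the conditional.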
In order to account for deviations from prior assumptions, we consider randomization of the hyperparameters and assign conjugate inverse Gamma priors to the scale parameters λ_w and λ_Δ. Finally, given positive constants a′, b′, d′ and a normalization constant Z_{k_w}, the shape parameter, k_w, is assigned the conjugate prior distribution according to [28]:

 p(k_w | a′, b′, d′, λ_w) = (k_w^{a′} / Z_{k_w}) exp( −b′ k_w − (d′)^{k_w} λ_w^{−k_w} ). (29)

Fig. 2 shows a factor graph for the complete sparsity model with randomized hyperparameters.

6 Approximate Inference: Hybrid MCMC

In order to accomplish inference in the sparse model, we apply a hybrid MCMC technique, i.e. HMC within Gibbs sampling. The reasons for using HMC are twofold: Firstly, it only requires an analytic expression for the posterior density to be sampled. Secondly, it is efficient in sampling high-dimensional spaces in the presence of correlation. However, as pointed out in [47], it can be more efficient to sample the hyperparameters separately, as their posterior distributions are often highly peaked and require a small step size in the HMC algorithm, which limits the general performance. Therefore, we employ an outer Gibbs sampler for approximate inference of the latent variables. In each iteration, x is sampled using HMC, while all other variables are fixed. Since we are also interested in estimating the noise variance, σ_n², it is assigned an inverse Gamma (Inv-Γ) prior and sampled along with the other variables. The resulting model is summarized below:

 x | k_w, λ_w, λ_Δ ∼ p̃(x | k_w, λ_w, λ_Δ) in (23),
 λ_w ∼ Inv-Γ(λ_w | a, b),
 k_w | λ_w ∼ p(k_w | a′, b′, d′, λ_w) in (29),
 λ_Δ ∼ Inv-Γ(λ_Δ | a″, b″),
 σ_n² ∼ Inv-Γ(σ_n² | a_σ, b_σ). (30)

We also define ζ as a representative variable for the hyperparameters in (30), with corresponding positive, real-valued parameters a_ζ and b_ζ that belong to the respective density functions. Further, C denotes the set of latent variables {λ_w, k_w, λ_Δ, σ_n²}, and C∖ζ denotes this set without the respective variable ζ. Fig. 3 shows a graphical model that helps to visualize the dependencies in this model. Herein, θ and its hyperparameters are only valid for strategy S2, which is discussed in Section 7. For the particular model in (30), we assume that the variables in C are mutually independent. Gibbs sampling requires the full conditional distributions for each parameter of interest. Based on these assumptions, we obtain the relation

 p(ζ | y, x, C∖ζ) ∝ p(y | x, C) p(ζ | x, C∖ζ) (31)
                  ∝ p(y | x, C) p(ζ | C∖ζ) p̃(x | C). (32)

Since the prior distributions are all conjugate to the Gaussian likelihood function in (2), a simple calculation yields the posterior distributions of the parameters involved in the Gibbs sampling procedure. For the representative variable ζ, we obtain

 p(ζ | y, x, C∖ζ) ∝ Inv-Γ( ζ | a_ζ + M/2, b_ζ + (1/2)‖y − ΦA(θ)x‖₂² ) p̃(x | C), (33)

and for k_w, we obtain

 p(k_w | y, x, C∖k_w) ∝ p(k_w | ã′, b̃′, d̃′) p̃(x | C), (34)

with parameters ã′ = a′ + N, b̃′ = b′ − ∑_{i=1}^N log(x_i), and (d̃′)^{k_w} = (d′)^{k_w} + ∑_{i=1}^N x_i^{k_w}. Samples of the posterior variables can be obtained using Metropolis Hastings [8] or HMC.
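For intuition, the noise-variance update implied by the Gaussian likelihood (2) and the Inv-Γ prior in (30) can be sketched as follows. This is the standard conjugate update; the residual signal Hx below is a synthetic placeholder for ΦA(θ)x, and the hyperparameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Conjugate Gibbs update for the noise variance under a Gaussian
# likelihood and an Inv-Gamma(a_sig, b_sig) prior:
#   sigma_n^2 | y, x  ~  Inv-Gamma(a_sig + M/2, b_sig + r/2),
# where r is the squared residual norm ||y - Phi A(theta) x||^2.
def sample_sigma2(y, Hx, a_sig, b_sig, rng):
    M = y.size
    resid = y - Hx
    shape = a_sig + 0.5 * M
    scale = b_sig + 0.5 * resid @ resid
    # An Inv-Gamma(shape, scale) draw is the reciprocal of a
    # Gamma(shape, 1/scale) draw.
    return 1.0 / rng.gamma(shape, 1.0 / scale)

M, sigma_true = 2000, 0.3
Hx = rng.standard_normal(M)            # placeholder for Phi A(theta) x
y = Hx + sigma_true * rng.standard_normal(M)

draws = np.array([sample_sigma2(y, Hx, 1e-3, 1e-3, rng) for _ in range(200)])
print(abs(draws.mean() - sigma_true**2) < 0.02)
```

With a vague prior (small a_σ, b_σ), the posterior mean is essentially the empirical residual variance, so the draws concentrate near the true σ_n².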

The sparse coefficients are sampled using HMC. We briefly describe the idea of this method according to [47], adapted to our model for x:
Within the framework of HMC, the sampling process is described in terms of Hamilton dynamics, a concept known from classical physics. It is used to describe the trajectory of a physical system in phase space, based on its potential and kinetic energy. HMC assigns to every sparse coefficient, x_i, an associated momentum variable, ξ_i, i = 1,…,N, that is responsible for the sampling dynamics. The posterior density to be sampled is related to the potential energy, given by [47]

 U(x | y, C) = −log p̃(x | y, C) − log(Z_u), (35)

where Z_u is a suitable normalization constant. Since y and C are fixed, we may drop them and write U(x) instead. The kinetic energy, E_kin(ξ), depends only on the auxiliary variables ξ = (ξ_1,…,ξ_N)⊤. A standard choice for E_kin corresponds to independent particles in free space with masses m_i, i.e. E_kin(ξ) = ∑_{i=1}^N ξ_i²/(2m_i). The dynamics of the sampling process are governed by the Hamiltonian function, which is given by H(x,ξ) = U(x) + E_kin(ξ) and represents the total system energy. The joint density of (x,ξ) is defined by [47]

 p(x,ξ) = (1/Z_c) e^{−H(x,ξ)/T_sys} = p̃(x | y, C) ∏_{i=1}^N N(ξ_i | 0, m_i). (36)

Herein, T_sys is called the system temperature and Z_c is a normalization constant. The last equation is obtained by setting T_sys = 1 and choosing the normalization constants accordingly, while the Gaussian density arises from the special choice of the kinetic energy term. In HMC, a proposal for a new sample is obtained by the final points (x*, ξ*) of a trajectory described by Hamilton's equations of motion. They are calculated for all i = 1,…,N [47]:

 dx_i/dt = ξ_i/m_i,  dξ_i/dt = −∂U(x)/∂x_i = (∂p̃(x | y, C)/∂x_i) / p̃(x | y, C). (37)

A Metropolis update decides whether a proposed sample is accepted or rejected, with acceptance probability [47]

 P(accept) = min(1, exp(−H(x*, ξ*) + H(x, ξ))). (38)
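One HMC transition, i.e., a leapfrog discretization of (37) followed by the Metropolis update (38), can be sketched as follows. This is a minimal sketch: the callbacks `log_post` and `grad_log_post` stand in for the model-specific posterior p̃(x|y,C) and its gradient, and the step size and trajectory length are illustrative assumptions.

```python
import numpy as np

def hmc_step(x, log_post, grad_log_post, masses, eps=0.01, n_leapfrog=20, rng=None):
    """One HMC transition targeting p~(x|y,C) via U(x) = -log p~(x|y,C).
    `log_post`/`grad_log_post` are assumed user-supplied callbacks."""
    rng = np.random.default_rng() if rng is None else rng
    xi = rng.normal(0.0, np.sqrt(masses))            # momenta ~ N(0, m_i), cf. (36)
    H0 = -log_post(x) + 0.5 * np.sum(xi**2 / masses)
    x_new, xi_new = x.copy(), xi.copy()
    xi_new += 0.5 * eps * grad_log_post(x_new)       # initial half step in momentum
    for _ in range(n_leapfrog):
        x_new += eps * xi_new / masses               # dx/dt = xi/m, cf. (37)
        xi_new += eps * grad_log_post(x_new)         # dxi/dt = -dU/dx, cf. (37)
    xi_new -= 0.5 * eps * grad_log_post(x_new)       # undo extra half step at the end
    H1 = -log_post(x_new) + 0.5 * np.sum(xi_new**2 / masses)
    if rng.random() < min(1.0, np.exp(H0 - H1)):     # Metropolis accept, cf. (38)
        return x_new
    return x
```

The initial half step plus the final correction is algebraically equivalent to the standard leapfrog scheme, which keeps the integrator reversible and volume-preserving, so (38) yields the correct stationary distribution.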

7 Parametric DL strategies for CFS

In this section, we present two strategies for parametric dictionary learning in CFS. In the first strategy (S1), we follow the ideas of hybrid Bayesian inference [64, 65] and AM-based DL [6], where θ is a deterministic parameter that is estimated using the Monte Carlo EM algorithm in [8]. In the second strategy (S2), we pursue a full Bayesian approach and consider a probabilistic model for θ. Herein, approximate inference is accomplished by extending the Gibbs sampler in Section 6 to jointly estimate θ. Fig. 3 depicts the dependency graph for both strategies, where the nodes Ξ and ν belong exclusively to S2.
As pointed out in [64, 65], hybrid and full Bayesian strategies have their individual advantages in certain situations. For small sample sizes, Bayesian methods can be superior if good prior knowledge is available [64]. Nonetheless, they are often computationally more complex, and insufficient prior information can lead to a small-sample bias, even if a non-informative prior is used [64]. In CFS, the sample size is small and only vague prior knowledge of θ is available. Therefore, we investigate the performance of both DL strategies based on our probabilistic sparse model. The computational complexity of both strategies is comparable: it is dominated by HMC, i.e., by sampling the high-dimensional vector x in each iteration of the Gibbs sampler. Regarding θ, the following prior knowledge is assumed: in S1, we roughly restrict the range of values that θ can take, while in S2, we define a non-informative prior over the same range. Recall that θ effectively describes the filter characteristics of the lowpass filter. To create the dictionary for a certain value of θ using (3), the inverse Fourier transform in (4) has to be evaluated for each atom. Thus, the dictionary is not a simple function of θ, and we restrict ourselves to a discrete set of parameters, Θ, with lower and upper bounds θ_min and θ_max, respectively. Since the bandwidth should be positive and bounded, we have θ_min > 0 and θ_max < ∞. Then, the set Θ contains R_Θ discrete values θ_1, …, θ_{R_Θ}.

7.1 Hybrid DL: iterative estimation of θ and (x,C) (S1)

The dictionary parameter θ in the CFS problem can be iteratively estimated using a Monte Carlo EM algorithm. First, an initial value, θ̂^(0), has to be chosen. In subsequent iterations with indices d = 1, 2, …, we obtain joint samples {x_l, C_l}, l = 1, …, L_MC, by Gibbs sampling and HMC according to Section 6. Then, we determine the posterior expectation of ζ, using the previous estimate θ̂^(d−1):

 ζ̂^(d) = ∫_dom(ζ) ζ p(ζ | y, θ̂^(d−1)) dζ (39)
 ≈ (1/L_MC) ∑_{l=1}^{L_MC} ζ_l^(d) p(ζ_l^(d) | y, θ̂^(d−1)), (40)

where dom(ζ) is the domain of ζ. The current estimates of the reflection delays are determined by identifying the indices of the largest elements in the posterior mean of x, which is obtained by exchanging ζ with x in (40). Besides, we also estimate the amplitudes of the significant components in x. They can be useful to assess the sparsity level of the solution and to determine the amount of optical power reflected from the FBGs. Since the posterior of x is multimodal, with one narrow peak around zero and another peak at some larger amplitude, the MAP estimate is more suitable for this task. It is given by

 {x̂, Ĉ}^(d)_MAP = argmax_{x, C} log p(x, C | y, θ̂^(d−1)) (41)
 ≈ argmax_{{x_j, C_j} ∈ {x_l, C_l}^(d), l = 1, …, L_MC} log p({x_j, C_j}^(d) | y, θ̂^(d−1)). (42)

However, the estimates of the reflection delays obtained from the MAP solution are less accurate than those obtained by the posterior mean. Therefore, the empirical MAP solution is only used to estimate the reflection amplitudes. Next, we calculate the expected complete-data log-likelihood over (x, C), given θ̂^(d−1) (E-step):

 E[log p(y, x, C | θ)] = ∫_{ℝ₊^N} ∫_Ψ log p(y, x, C | θ) p(x, C | y, θ̂^(d−1)) dC dx (43)
 ≈ (1/L_MC) ∑_{l=1}^{L_MC} log p(y, {x_l, C_l}^(d−1) | θ) ≜ Q(θ | θ̂^(d−1)). (44)

Herein, Ψ is the product space formed by the individual domains of all variables in C. In the M-step, a locally optimal value, θ̂^(d), is obtained by maximizing Q(θ | θ̂^(d−1)) over the set Θ, i.e.

 θ̂^(d) = argmax_{θ ∈ Θ} Q(θ | θ̂^(d−1)). (45)
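The overall MC-EM loop of (39)–(45) over the discrete set Θ might be sketched as follows. All interfaces here are hypothetical stand-ins: `build_dict(theta)` constructs A(θ) via (3)–(4), `gibbs_sampler` represents the Gibbs/HMC machinery of Section 6, and the Gaussian complete-data log-likelihood inside Q is a simplification of the full model.

```python
import numpy as np

def mc_em_dictionary(y, thetas, build_dict, gibbs_sampler, n_iter=10, rng=None):
    """Monte Carlo EM over a discrete parameter set Theta (strategy S1).
    `gibbs_sampler(y, A, L)` is assumed to return L joint posterior
    samples (x_l, zeta_l); the M-step searches Theta exhaustively."""
    rng = np.random.default_rng() if rng is None else rng
    theta_hat = thetas[rng.integers(len(thetas))]    # arbitrary start, cf. 7.1.1
    for _ in range(n_iter):
        A = build_dict(theta_hat)
        samples = gibbs_sampler(y, A, L=50)          # E-step: MC samples, cf. (44)

        def Q(theta):                                # sample-based Q(theta|theta_hat)
            A_t = build_dict(theta)
            return np.mean([-0.5 * np.sum((y - A_t @ x)**2) / zeta
                            for x, zeta in samples])

        theta_hat = max(thetas, key=Q)               # M-step, cf. (45)
    return theta_hat
```

Because Θ is a small discrete set, the argmax in (45) can be evaluated by direct enumeration instead of gradient-based optimization.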

7.1.1 Initialization of θ via bisectional search

An adequate initialization, θ̂^(0), can alleviate the problem of local optima in the EM algorithm. In CFS, the desired sparsity level is known: it equals the number of reflections. Hence, a good choice for θ̂^(0) yields a solution for x with the expected number of significant non-zero elements. Starting at an arbitrary value, a bisectional search within Θ can quickly determine a suitable initial value. After choosing the first value at random, Θ is subdivided into two parts, containing all larger and all smaller values, respectively. When the number of peaks is too high, the next trial is chosen as the median of the lower division. If it is too low, the next trial is the median of the upper division, and so on. For a properly selected θ̂^(0), S1 converges faster and is more likely to approach (or even attain) the global optimum.
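The bisectional search described above can be sketched as follows. The callback `count_peaks`, which runs the sparse estimator for a trial θ and counts the significant coefficients, is a hypothetical interface; the search follows the rule from the text (too many peaks: recurse into the lower division; too few: into the upper division).

```python
def init_theta_bisection(thetas, count_peaks, k_target):
    """Bisectional search over the sorted discrete set Theta for an
    initialization theta^(0) yielding k_target significant peaks.
    `count_peaks(theta)` is an assumed user-supplied callback."""
    lo, hi = 0, len(thetas) - 1
    best = thetas[(lo + hi) // 2]
    while lo <= hi:
        mid = (lo + hi) // 2           # median of the current division
        best = thetas[mid]
        k = count_peaks(best)
        if k == k_target:
            break                      # desired sparsity level reached
        if k > k_target:
            hi = mid - 1               # too many peaks: lower division
        else:
            lo = mid + 1               # too few peaks: upper division
    return best
```

Like any bisection, this needs only O(log |Θ|) runs of the sparse estimator, which is what makes the initialization cheap compared to a full grid search.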

7.2 Bayesian DL: joint estimation of (x,C,θ) (S2)

In strategy S2, we treat θ as a random variable. Due to its discrete nature, each element θ_r ∈ Θ is assigned a probability mass p_r, where r = 1, …, R_Θ. Then, θ is categorically (Cat) distributed over the set of discrete dictionary parameters, Θ, with corresponding probability masses collected in Ξ. Uncertainty in the a priori assigned probability masses is taken into account in terms of a prior on Ξ. The Dirichlet (Dir) distribution can be used as the conjugate prior with parameters ν, i.e.

 p(Ξ) = (1/B(ν)) ∏_{r=1}^{R_Θ} p_r^{ν_r}, (46)

where B(ν) denotes the Beta function and the variables ν_r, r = 1, …, R_Θ, describe the number of occurrences of the values in Θ. When a new element, θ̃, is sampled, a posterior count is assigned to that value. After sampling another value in the next iteration, this count is reassigned to the new value. Let c̆ indicate the current sample, i.e., c̆_r = 1 for one index r, while all other elements are zero. A non-informative prior is obtained if all values are equally likely and each element is assigned a single count. Then, ν_r = 1 for all r, and a new sample has a strong impact on the posterior distribution. In contrast, for large values of ν_r, a new count leaves the distribution almost invariant. The complete model is then given by (30) and, in addition,

 Ξ ∼ Dir(Ξ | ν), (47)
 θ | Ξ ∼ Cat(θ | R_Θ, Ξ). (48)

To accomplish approximate inference in this model, the variables θ and Ξ are included in the Gibbs sampling procedure of Section 6. Therefore, the corresponding conditional distributions must be determined. Based on the dependencies in Fig. 3, and since Ξ and the remaining model parameters are assumed to be mutually independent, we find

 Ξ | θ̃ ∼ Dir(Ξ | ν + c̆), (49)
 θ | y, Ξ̃ ∼ Cat(θ | R_Θ, Ξ̃). (50)
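One sweep of the extended Gibbs sampler over (Ξ, θ) according to (49)–(50) might look as follows. This is a sketch: the array `log_like_per_theta` of per-parameter log-likelihoods log p(y | θ_r) is an assumed stand-in for the model-specific conditional of θ, which in the full sampler would also involve the current (x, C).

```python
import numpy as np

def gibbs_update_theta(nu, theta_idx_current, log_like_per_theta, rng=None):
    """Resample Xi from Dir(nu + c), with a single count c on the current
    theta, then resample theta from the categorical posterior with masses
    proportional to Xi_r * p(y | theta_r), cf. (49)-(50)."""
    rng = np.random.default_rng() if rng is None else rng
    R = len(nu)
    c = np.zeros(R)
    c[theta_idx_current] = 1.0                 # posterior count on current theta
    Xi = rng.dirichlet(nu + c)                 # Xi | theta ~ Dir(nu + c), eq. (49)
    logw = np.log(Xi) + log_like_per_theta     # unnormalized log masses, eq. (50)
    w = np.exp(logw - logw.max())              # stabilized softmax
    w /= w.sum()
    return rng.choice(R, p=w), Xi
```

The max-subtraction before exponentiation avoids overflow when the likelihoods differ by many orders of magnitude, which is typical once the data strongly favor one dictionary parameter.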