Improving Generalization of Sequence Encoder-Decoder Networks for Inverse Imaging of Cardiac Transmembrane Potential

10/12/2018 ∙ by Sandesh Ghimire, et al. ∙ 0

Deep learning models have shown state-of-the-art performance in many inverse reconstruction problems. However, it is not well understood what properties of the latent representation may improve the generalization ability of the network. Furthermore, limited models have been presented for inverse reconstructions over time sequences. In this paper, we study the generalization ability of a sequence encoder decoder model for solving inverse reconstructions on time sequences. Our central hypothesis is that the generalization ability of the network can be improved by 1) constrained stochasticity and 2) global aggregation of temporal information in the latent space. First, drawing from analytical learning theory, we theoretically show that a stochastic latent space will lead to an improved generalization ability. Second, we consider an LSTM encoder-decoder architecture that compresses a global latent vector from all last-layer units in the LSTM encoder. This model is compared with alternative LSTM encoder-decoder architectures, each in deterministic and stochastic versions. The results demonstrate that the generalization ability of an inverse reconstruction network can be improved by constrained stochasticity combined with global aggregation of temporal information in the latent space.





There has been an upsurge in the use of deep learning based methods in inverse problems in computer vision and medical imaging [Lucas et al.2018]. Examples include image denoising [Mao, Shen, and Yang2016], inpainting [Pathak et al.2016], super resolution [Wang et al.2015] and image reconstructions in a variety of medical imaging modalities such as magnetic resonance imaging, X-ray, and computed tomography [Jin et al.2017, Wang et al.2016, Chen et al.2017]. A common deep network architecture for inverse reconstructions is an encoder-decoder network, which learns to encode an input measurement into a latent representation that is then decoded into a desired inverse solution [Pathak et al.2016, Zhu et al.2018]. Despite the significant successes of these models, two important questions remain relatively unexplored. First, it is not well understood what properties of the latent representation improve the generalization ability of the network. Second, existing works mostly focus on solving single-image problems, while few models exist for solving inverse problems on image or signal sequences. The latter, however, is important because the incorporation of temporal information can often help alleviate the ill-posedness of an inverse reconstruction problem.

In this paper, we present a probabilistic sequence encoder-decoder model for solving inverse problems on time sequences. Central to the presented model is an emphasis on the generalization ability of the network in order to learn a general inverse mapping from the measurement to reconstruction results. Our main hypothesis is that the generalization ability of the sequence encoder-decoder network can be improved by the following two properties of the latent space: 1) constrained stochasticity, and 2) global aggregation of information throughout the long time sequence.

First, we theoretically show that using a stochastic latent space during training helps to learn a decoder that is less sensitive to local changes in the latent space and that, based on analytical learning theory [Kawaguchi and Bengio2018], leads to good generalization ability of the network. Second, while it is common in a sequence model to use the last unit of a recurrent encoder network for decoding, we hypothesize that, in the presence of long time sequences, the last hidden unit code may not be able to retain global information and may act as a bottleneck in information flow. This shares a fundamental idea with recent works in which alternative architectures were presented with attention mechanisms to consider information from all the hidden states of long short-term memory (LSTM) networks [Bahdanau, Cho, and Bengio2014, Luong, Pham, and Manning2015]. Alternative strategies were also presented to back-propagate the loss through all hidden states, addressing the issue of vanishing gradients [Lipton et al.2015, Yue-Hei Ng et al.2015]. Here, to arrive at a compact latent space that globally summarizes a long time sequence, we present an architecture (termed svs) that combines and compresses all LSTM units into a vector representation.

While the presented methodology applies for general sequence inverse problems, we test it on the problem of inverse reconstruction of cardiac transmembrane potential (TMP) sequence from high-density electrocardiogram (ECG) sequence acquired on the body surface [MacLeod et al.1995, Ramanathan et al.2004, Wang et al.2013]. In this setting, the measurement at each time instant is a spatial potential map on the body surface, and the reconstruction output is a spatial potential map throughout the three-dimensional heart muscle. The problem at each time instant is severely ill-posed, and incorporating temporal information is recognized as a main approach to alleviate this issue.

To investigate the generalization ability of the presented model, we analyzed in depth the benefits brought by each of the two key components within the presented model. Specifically, we compared the presented svs architecture with two alternatives: one that decodes directly from the sequence produced by the LSTM encoder (termed sss) and one that decodes from the output of the last unit of the LSTM encoder, as most commonly used in language models (termed svs-L) [Sutskever, Vinyals, and Le2014]. We also compared deterministic and stochastic versions of svs and svs-L. The experimental results suggest that the generalization ability of the network in inverse reconstruction can be improved by constrained stochasticity combined with global aggregation of sequence information in the latent space. These findings may set a foundation for investigating the generalization ability of deep networks in sequence inverse problems and inverse problems in general.

Related Work

There is a large body of work on the use of deep learning in inverse imaging in both general computer vision [Lucas et al.2018, Mao, Shen, and Yang2016, Wang et al.2015, Yao et al.2017, Fischer et al.2015] and medical imaging [Jin et al.2017, Wang et al.2016, Chen et al.2017]. Some of these deep inverse reconstruction networks are based on an encoder-decoder structure [Pathak et al.2016, Zhu et al.2018], similar to that investigated in this paper. Among these works in the domain of medical image reconstruction, the presented work is the closest to Automap [Zhu et al.2018], in that the output image is reconstructed directly from the input measurements without any domain-specific intermediate transformations. However, all of these works employ a deterministic architecture which, as we will show later, can be improved with the introduction of stochasticity. Nor do these existing works handle inverse reconstructions of images or signals over time sequences.

Different elements of the presented work are conceptually similar to several works across different domains of machine learning. The use of encoder-decoder architectures for sequential signals is related to existing works in language translation [Sutskever, Vinyals, and Le2014]. However, we investigate an alternative global aggregation of sequence information by utilizing and compressing knowledge from all the units in the last layer of the LSTM encoder. This is similar in concept to the works in [Bahdanau, Cho, and Bengio2014, Luong, Pham, and Manning2015], where information from all the units of an LSTM encoder is used for language translation. However, to our knowledge, no existing works have analyzed in depth the difference in the generalization ability of sequence encoder-decoder models with respect to these different designs of the latent space.

The presented theoretical analysis of stochasticity in generalization utilizes analytical learning theory [Kawaguchi and Bengio2018], which is fundamentally different from classical statistical learning theory in that it is strongly instance-dependent. While statistical learning theory deals with data-independent generalization bounds, or data-dependent bounds for a certain hypothesis space of problems, analytical learning theory provides a bound on how well a model learned from a dataset should perform on the true (unknown) measure of the variable of interest. This makes it well suited for measuring the generalization ability of a stochastic latent space for the given problem and data.

The presented work is related to the variational autoencoder (VAE) in using a stochastic latent space with regularization [Kingma and Welling2013]. Similarly, [Bowman et al.2015] present a sequence-to-sequence VAE based on an LSTM encoder and decoder to generate coherent and diverse sentences from continuous sampling of the latent code. However, it is not well understood why stochasticity of the latent space is so important. In this paper, we intend to provide a justification from the learning theory perspective. In addition, while the VAE by nature is concerned with the reconstruction of the same input data, the presented network is concerned with the ill-posed problem of inverse signal/image reconstruction from (often weak) measurements.


Inverse Imaging of cardiac transmembrane potential (TMP)

Body-surface electrical potential is produced by TMP in the heart. Their mathematical relation is defined by the quasi-static approximation of electromagnetic theory [Plonsey1969] and, when solved on patient-specific heart-torso geometry, can be derived as [Wang et al.2010]:

$$y(t) = H x(t) \tag{1}$$

where $y(t)$ denotes the body-surface potential map, $x(t)$ the 3D TMP map over the heart muscle, and $H$ the measurement matrix specific to the heart-torso geometry of an individual.

The inverse reconstruction of $x$ from $y$ can be carried out at each time instant independently, which however is notoriously ill-posed since the surface potential provides only a weak projection of the 3D TMP. A popular approach is thus to reconstruct the time sequence of TMP propagation over a heart beat, with various strategies to incorporate the temporal information to alleviate the ill-posedness of the problem [Wang et al.2010, Greensite and Huiskamp1998]. Here, we examine the sequence setting, where $X$ and $Y$ represent sequence matrices with each column denoting the potential map at one time instant. This problem has important clinical applications in supporting the diagnosis and treatment of diseases such as ischemia [MacLeod et al.1995] and ventricular arrhythmia [Wang et al.2018].
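The ill-posedness of the single-instant problem can be illustrated with a toy linear system that has fewer measurements than unknowns. The sketch below is only an illustration under assumed names and made-up numbers (`H`, `x_a`, `x_b`, `null_vec` are not cardiac data): two different source vectors produce exactly the same measurement, so the measurement alone cannot tell them apart.

```python
# Toy illustration of an ill-posed linear inverse problem (hypothetical values).

def matvec(H, x):
    """Multiply a matrix H (given as a list of rows) by a vector x."""
    return [sum(h_ij * x_j for h_ij, x_j in zip(row, x)) for row in H]

# 2 "surface" measurements of 3 "heart" unknowns: H has a non-trivial null space.
H = [
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
]

x_a = [1.0, 2.0, 3.0]
null_vec = [1.0, -1.0, 1.0]  # H applied to this vector gives the zero vector
x_b = [a + 2.0 * n for a, n in zip(x_a, null_vec)]  # a different source

y_a = matvec(H, x_a)
y_b = matvec(H, x_b)

print("x_a =", x_a, "-> y =", y_a)
print("x_b =", x_b, "-> y =", y_b)
```

Any prior that couples neighbouring time instants restricts which of these indistinguishable sources is plausible, which is the role temporal information plays in the sequence setting.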


Figure 1: Presented svs stochastic architecture with mean and variance networks in both encoder and decoder.


Recurrent neural networks (RNN) [Werbos1990, Kalchbrenner and Blunsom2013] generalize the traditional neural networks for sequential data by allowing passage of information over time. Compared to traditional RNNs, LSTM networks [Graves2013, Sutskever, Vinyals, and Le2014] can better handle long term dependency in the data by architectural changes such as using a memory line called the cell state that runs throughout the whole sequence, the ability to forget information deemed irrelevant using a forget gate, and the ability to selectively update the cell state.

Variational Autoencoder (VAE)

VAE [Kingma and Welling2013] is a probabilistic generative model that is typically trained by optimizing the variational lower bound of the data log likelihood. A VAE differs from the traditional autoencoder in two aspects: 1) the use of a stochastic latent space realized by sampling with a reparameterization trick, and 2) the use of the Kullback-Leibler (KL) divergence to regularize the latent distribution.
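As a minimal sketch of the reparameterization trick mentioned above, a latent sample can be written as a deterministic function of the mean, the (log-)variance, and independent standard normal noise. The function name `reparameterize`, the log-variance parameterization, and the numbers are assumptions made for illustration; they are not taken from the paper.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw w = mu + sigma * eps with eps ~ N(0, I), element by element."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]

rng = random.Random(0)
mu = [0.5, -1.0]                  # hypothetical latent mean
log_var = [0.0, math.log(0.25)]   # hypothetical log-variance (sigma = 1.0 and 0.5)

samples = [reparameterize(mu, log_var, rng) for _ in range(20000)]

# Empirical mean and standard deviation per latent dimension.
emp_mean = [sum(s[j] for s in samples) / len(samples) for j in range(2)]
emp_std = [
    math.sqrt(sum((s[j] - emp_mean[j]) ** 2 for s in samples) / len(samples))
    for j in range(2)
]
print("empirical mean:", emp_mean)
print("empirical std: ", emp_std)
```

Because the noise is sampled independently of `mu` and `log_var`, gradients can flow through both quantities in a framework with automatic differentiation.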


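We train a probabilistic sequence encoder-decoder network to learn to reconstruct the time sequence of TMP, $X$, from input body-surface potential, $Y$. In a supervised setting, we maximize the log likelihood as follows:

$$\arg\max_{\theta}\; \mathbb{E}_{P(X,Y)} \log p_{\theta}(X \mid Y) \tag{2}$$

where $P(X,Y)$ is the joint distribution of the input-output pair. We introduce a latent random variable $w$ and express the conditional distribution as:

$$p_{\theta}(X \mid Y) = \int p_{\theta_2}(X \mid w)\, p_{\theta_1}(w \mid Y)\, dw \tag{3}$$

where $\theta = \{\theta_1, \theta_2\}$. We model both $p_{\theta_2}(X \mid w)$ and $p_{\theta_1}(w \mid Y)$ with Gaussian distributions, with mean and variance parameterized by neural networks:

$$p_{\theta_1}(w \mid Y) = \mathcal{N}\big(w \mid t_{\theta_1}(Y), \sigma_t^2(Y)\big) \tag{4}$$

$$p_{\theta_2}(X \mid w) = \mathcal{N}\big(X \mid g_{\theta_2}(w), \sigma_x^2(w)\big) \tag{5}$$

where $\sigma_x^2$ denotes a matrix of the same dimension as that of $X$. We implicitly assume that each element in $X$ is independent and Gaussian with variance given by the corresponding element in $\sigma_x^2$; and similarly for $w$. Introducing the latent random variable $w$ in the network allows us to constrain it in two ways to improve the generalization ability of inverse reconstructions. First, we constrain the conditional distribution $p_{\theta_1}(w \mid Y)$ to be close to an isotropic Gaussian distribution. Second, we design it to be a concise vector representation compressed from the whole input time sequence.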

Regularized stochasticity

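Drawing from the VAE [Kingma and Welling2013], we regularize the latent space by constraining the conditional distribution $p_{\theta_1}(w \mid Y)$ of the latent variable $w$ given the input $Y$ to be close to an isotropic Gaussian distribution. Training of the network can then be formulated as a constrained optimization problem as follows:

$$\arg\max_{\theta}\; \mathbb{E}_{P(X,Y)} \log p_{\theta}(X \mid Y) \quad \text{such that} \quad D_{KL}\big(p_{\theta_1}(w \mid Y)\,\|\,\mathcal{N}(0, I)\big) \le \delta \tag{6}$$

Using the method of Lagrange multipliers, we reformulate the objective function into:

$$\mathcal{L} = \mathbb{E}_{P(X,Y)}\Big[-\log \int p_{\theta_2}(X \mid w)\, p_{\theta_1}(w \mid Y)\, dw + \lambda D_{KL}\big(p_{\theta_1}(w \mid Y)\,\|\,\mathcal{N}(0, I)\big)\Big] \le \mathbb{E}_{P(X,Y)}\Big[-\mathbb{E}_{w \sim p_{\theta_1}(w \mid Y)} \log p_{\theta_2}(X \mid w) + \lambda D_{KL}\big(p_{\theta_1}(w \mid Y)\,\|\,\mathcal{N}(0, I)\big)\Big] \tag{7}$$

where the inequality in eq.(7) is due to Jensen's inequality, as the negative logarithm is a convex function. We use reparameterization as described in [Kingma and Welling2013] to compute the inner expectation in the first term. The KL divergence in the second term is analytically available for two Gaussian distributions. We thus obtain the upper bound for the loss function, up to an additive constant, as:

$$\mathcal{L} \le \mathbb{E}_{P(X,Y)}\Big[\mathbb{E}_{w \sim p_{\theta_1}(w \mid Y)} \sum_i \Big(\frac{(X_i - g_i(w))^2}{2\sigma_{x_i}^2} + \frac{1}{2}\log \sigma_{x_i}^2\Big) + \frac{\lambda}{2}\sum_j \big(t_j^2 + \sigma_{t_j}^2 - \log \sigma_{t_j}^2 - 1\big)\Big] \tag{8}$$

where $g_i$ is the function mapping the latent variable to the $i$-th element of the mean of $X$, such that $g_{\theta_2}(w) = [g_i(w)]_i$.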
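The structure of such a regularized objective, a Gaussian reconstruction term plus a weighted KL term, can be sketched for a single training pair and a single latent sample as follows. This is a hedged illustration, not the authors' implementation: the function names, the log-variance parameterization, the dropped additive constant, and the value of the multiplier `lam` are all assumptions.

```python
import math

def gaussian_nll(x, mean, log_var):
    """Negative log-likelihood of independent Gaussians (additive constant dropped)."""
    return sum(
        0.5 * (xi - mi) ** 2 / math.exp(lv) + 0.5 * lv
        for xi, mi, lv in zip(x, mean, log_var)
    )

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )

def regularized_loss(x, x_mean, x_log_var, w_mu, w_log_var, lam):
    """Reconstruction term plus lam times the KL regularizer on the latent code."""
    return gaussian_nll(x, x_mean, x_log_var) + lam * kl_to_standard_normal(w_mu, w_log_var)

# Hypothetical flattened target, decoder outputs, and encoder outputs.
loss = regularized_loss(
    x=[1.0, 0.0],
    x_mean=[0.0, 0.0],
    x_log_var=[0.0, 0.0],
    w_mu=[1.0],
    w_log_var=[0.0],
    lam=2.0,
)
print(loss)
```

In practice the expectation over the latent variable would be approximated by averaging this quantity over reparameterized samples and over the training set.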

Global aggregation of sequence information

In sequence inverse problems where the measurement at each time instant provides only a weak projection of the reconstruction solution, utilizing the temporal information in the sequence becomes important for better inverse reconstructions. This motivates us to design an architecture that can distill from the input sequence a global, time-invariant, and low-dimensional latent representation from which the entire TMP sequence can be reconstructed.

To do so, we present an architecture with two LSTM networks followed by two fully connected neural networks (FC), each respectively for the mean and variance in the encoder network. The decoder then consists of two FC followed by two LSTM networks for the mean and variance of the output. In the encoder, each LSTM decreases spatial dimensions while keeping temporal dimensions constant; the last-layer outputs from all the units in each LSTM are reshaped into a vector, the length of which is decreased by the FC. The structure and dimension of the decoder mirrors that of the encoder. The overall architecture of the presented network is illustrated in Fig. 1.
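The difference between aggregating all LSTM units and keeping only the last one can be made concrete by tracing tensor shapes through one encoder branch (mean or variance). The sketch below only does shape bookkeeping; the sequence length, layer widths, and latent size are hypothetical values chosen for illustration, and the function name is an assumption.

```python
def encoder_branch_shapes(seq_len, in_dim, lstm_dims, latent_dim, use_all_units=True):
    """Trace per-sample tensor shapes through one encoder branch.

    use_all_units=True  -> svs-style: outputs at all time steps are flattened.
    use_all_units=False -> svs-L-style: only the last time step is kept.
    """
    shapes = [("input", (seq_len, in_dim))]
    feat = in_dim
    for k, hidden in enumerate(lstm_dims, start=1):
        feat = hidden  # spatial (feature) dimension shrinks, time axis is kept
        shapes.append((f"lstm_{k}", (seq_len, feat)))
    flat = seq_len * feat if use_all_units else feat
    shapes.append(("reshape", (flat,)))
    shapes.append(("fc", (latent_dim,)))  # fully connected layer compresses the vector
    return shapes

# Hypothetical sizes: 200 time steps, 120 surface leads, two stacked LSTM layers.
svs = encoder_branch_shapes(200, 120, (60, 20), 12, use_all_units=True)
svs_l = encoder_branch_shapes(200, 120, (60, 20), 12, use_all_units=False)

for name, shape in svs:
    print("svs  ", name, shape)
for name, shape in svs_l:
    print("svs-L", name, shape)
```

With these numbers the svs branch compresses a 4000-element vector that sees every time step, while the svs-L branch compresses only the 20 features of the final step.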

Encoder-Decoder Learning from the Perspective of Analytical Learning Theory

In this section we look at encoder-decoder inverse reconstruction from the perspective of analytical learning theory [Kawaguchi and Bengio2018]. We start with a deterministic latent space setting and then show that having a stochastic latent space with regularization helps in generalization.

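Let $z = (Y, X)$ be an input-output pair, and let $D$ denote the total set of training and validation data and $Z_m = \{z_1, \dots, z_m\}$ be a validation set. During training, a neural network learns the parameter $\theta$ by using an algorithm $\mathcal{A}$ and dataset $D$, at the end of which we have a mapping $h_{\mathcal{A}(D)}$ from $Y$ to $X$. Typically, we stop training when the model performs well on the validation set. To evaluate this performance, we define a loss function based on our notion of goodness of prediction as $\ell(X, Y, h_{\mathcal{A}(D)})$. The average validation error is given by $\hat{\mathbb{E}}_{Z_m}[\ell] = \frac{1}{m}\sum_{i=1}^{m} \ell(X_i, Y_i, h_{\mathcal{A}(D)})$.

However, there exists a gap between how well the model performs on the validation set versus on the true distribution of the input-output pair; this gap is called the generalization gap. To be precise, let $(\mathcal{Z}, \Sigma, \mu)$ be a measure space with $\mu$ being a measure on $\mathcal{Z}$. Here, $\mathcal{Z}$ denotes the input-output space of all the observations and inverse solutions. The generalization gap is given by:

$$\mathbb{E}_{\mu}\big[\ell(X, Y, h_{\mathcal{A}(D)})\big] - \hat{\mathbb{E}}_{Z_m}\big[\ell(X, Y, h_{\mathcal{A}(D)})\big] \tag{9}$$

Note that this generalization gap depends on the specific problem instance. Theorem 1 [Kawaguchi and Bengio2018] provides an upper bound on equation (9) in terms of the data distribution in the latent space and properties of the decoder.

Theorem 1 ([Kawaguchi and Bengio2018]).

For any $\ell$, let $(T, f)$ be a pair such that $T: (\mathcal{Z}, \Sigma) \to ([0,1]^d, \mathcal{B}([0,1]^d))$ is a measurable function, $f: ([0,1]^d, \mathcal{B}([0,1]^d)) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is of bounded variation as $V[f] < \infty$, and $\ell(X, Y, h) = (f \circ T)(z)$ for all $z \in \mathcal{Z}$, where $\mathcal{B}(A)$ indicates the Borel $\sigma$-algebra on $A$. Then for any dataset pair $(D, Z_m)$ and any $\ell(X, Y, h_{\mathcal{A}(D)})$,

$$\Big|\mathbb{E}_{\mu}\big[\ell(X, Y, h_{\mathcal{A}(D)})\big] - \hat{\mathbb{E}}_{Z_m}\big[\ell(X, Y, h_{\mathcal{A}(D)})\big]\Big| \le V[f]\, D^{*}\big[T_{*}\mu, T(Z_m)\big]$$

where $T_{*}\mu$ is the pushforward measure of $\mu$ under the map $T$.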

For an encoder-decoder setup, $T$ is the encoder, which maps the observation to the latent space, and $f$ becomes the composition of the loss function and the decoder, which maps the latent representation to the reconstruction loss. Note that the latent domain $[0,1]^d$ can be easily extended to a d-orthotope, as long as the latent variables are bounded, using a function composed of scaling and translation in each dimension. Since such a function is uniformity preserving and affects the partial derivatives of $f$ only up to a scaling factor, it does not affect our analysis. In practice, there always exist intervals within which the latent representations are bounded.

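Theorem 1 provides two ways to decrease the generalization gap in our problem: by decreasing the variation $V[f]$ or the discrepancy $D^{*}[T_{*}\mu, T(Z_m)]$. Here, we show that constrained stochasticity of the latent space helps decrease the variation $V[f]$. The variation of $f$ on $[0,1]^d$ in the sense of Hardy and Krause [Hardy1906] is defined as:

$$V[f] = \sum_{k=1}^{d} \sum_{1 \le j_1 < \dots < j_k \le d} V^{(k)}\big[f_{j_1 \dots j_k}\big] \tag{10}$$

where $V^{(k)}[f_{j_1 \dots j_k}]$ is defined with the following proposition.

Proposition 1 ([Kawaguchi and Bengio2018]).

Suppose that $f_{j_1 \dots j_k}$ is a function for which $\partial_{1,\dots,k} f_{j_1 \dots j_k}$ exists on $[0,1]^k$. Then,

$$V^{(k)}\big[f_{j_1 \dots j_k}\big] \le \sup_{(t_{j_1}, \dots, t_{j_k}) \in [0,1]^k} \big|\partial_{1,\dots,k} f_{j_1 \dots j_k}(t_{j_1}, \dots, t_{j_k})\big|.$$

If $\partial_{1,\dots,k} f_{j_1 \dots j_k}$ is also continuous on $[0,1]^k$,

$$V^{(k)}\big[f_{j_1 \dots j_k}\big] = \int_{[0,1]^k} \big|\partial_{1,\dots,k} f_{j_1 \dots j_k}(t_{j_1}, \dots, t_{j_k})\big|\, dt_{j_1} \cdots dt_{j_k}.$$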

The function $f$ for the encoder-decoder network is the loss as a function of latent representations. Thus, we have,

$$f(w) = \ell\big(X, \hat{X}(w)\big) \tag{11}$$

We use a simple sum-of-squares loss, $\ell = \lVert X - \hat{X} \rVert^2$, where the norm is the Frobenius norm for the matrix $X$, and $g_i$ is a function from the latent space to each element of the estimated $\hat{X}$. Writing $f_i(w) = (X_i - g_i(w))^2$,

$$f(w) = \sum_i f_i(w) \tag{12}$$

so that each cross partial derivative of $f$ is the sum of the corresponding cross partial derivatives of the $f_i$.

Theorem 1 and Proposition 1 imply that if the cross partial derivatives of the loss with respect to the latent vector, at all orders, are small in all directions throughout the latent space, then the approximated validation loss will be closer to the actual loss over the true unknown distribution of the dataset. Intuitively, we want the loss curve as a function of the latent representation to be flat if we want good generalization.
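The link between flatness and variation can be checked numerically in one dimension, where the Hardy-Krause variation reduces to the ordinary total variation and, for a smooth function, equals the integral of the absolute derivative. The two loss curves below are hypothetical and chosen only to contrast a flat and a steep dependence on a scalar latent variable.

```python
def total_variation_1d(f, n=10000):
    """Approximate the total variation of f on [0, 1] on a uniform grid."""
    values = [f(i / n) for i in range(n + 1)]
    return sum(abs(b - a) for a, b in zip(values, values[1:]))

def flat_loss(w):
    """Hypothetical loss that changes little with the latent variable."""
    return 0.1 * (w - 0.5) ** 2

def steep_loss(w):
    """Hypothetical loss that changes strongly with the latent variable."""
    return 10.0 * (w - 0.5) ** 2

v_flat = total_variation_1d(flat_loss)
v_steep = total_variation_1d(steep_loss)
print("variation of flat loss: ", v_flat)
print("variation of steep loss:", v_steep)
```

For the same discrepancy term, the bound in Theorem 1 is a hundred times smaller for the flat curve than for the steep one in this toy setting.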

Using stochastic latent space

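In our formulation, the latent space is a random variable with the cost function given by eq.(8), which makes the latent vector stochastic by design. The inner expectation of the first term in the cost function is given by

$$\mathbb{E}_{w}\Big[\sum_i \frac{(X_i - g_i(w))^2}{2\sigma_{x_i}^2}\Big]$$

where $w \sim \mathcal{N}(t, \sigma_t^2)$.

Result 1.

$$\mathbb{E}_{w}\big[f_i(w)\big] = f_i(t) + \sum_{k=1}^{\infty} \frac{1}{k!}\Big\langle \sigma_t^{\otimes k} \odot \mathbb{E}\big[\epsilon^{\otimes k}\big],\; \nabla^{k} f_i(t) \Big\rangle$$

where $f_i(w) = (X_i - g_i(w))^2$, $\epsilon \sim \mathcal{N}(0, I)$, and $v^{\otimes k}$ denotes the $k$-th order tensor product of a vector $v$ by itself.

Proof. Using the Taylor series expansion for $f_i(w)$ around $t$,

$$\mathbb{E}_{w}\big[f_i(w)\big] = \mathbb{E}_{w}\Big[f_i(t) + \sum_{k=1}^{\infty} \frac{1}{k!}\Big\langle (w - t)^{\otimes k},\; \nabla^{k} f_i(t) \Big\rangle\Big] \tag{13}$$

We move the expectation operator inside both brackets and take the expectation of only the first term in the inner product. Using $w = t + \sigma_t \odot \epsilon$, we get $\mathbb{E}[(w - t)^{\otimes k}] = \sigma_t^{\otimes k} \odot \mathbb{E}[\epsilon^{\otimes k}]$. Using these in eq.(13) yields the required result. ∎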
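The Taylor-expansion argument can be verified in a scalar toy case where everything is a polynomial and the series terminates. The sketch below assumes a hypothetical one-dimensional decoder g(w) = w^2 and target 1, so the squared error is the polynomial 1 - 2w^2 + w^4; it compares the moment-weighted Taylor sum around the latent mean with the exact Gaussian expectation computed from raw moments. All names and numbers are illustrative assumptions.

```python
import math

def poly_eval(coeffs, w):
    """Evaluate sum_n coeffs[n] * w**n."""
    return sum(a * w ** n for n, a in enumerate(coeffs))

def poly_deriv(coeffs):
    """Coefficients of the derivative polynomial."""
    return [n * a for n, a in enumerate(coeffs)][1:]

def std_normal_moment(k):
    """E[eps**k] for eps ~ N(0, 1): zero for odd k, (k-1)!! for even k."""
    return 0.0 if k % 2 else float(math.prod(range(1, k, 2)))

def taylor_expectation(coeffs, t, sigma):
    """Sum over k of sigma**k * E[eps**k] * f^(k)(t) / k! (exact for polynomials)."""
    total, deriv = 0.0, list(coeffs)
    for k in range(len(coeffs)):
        total += sigma ** k * std_normal_moment(k) * poly_eval(deriv, t) / math.factorial(k)
        deriv = poly_deriv(deriv)
    return total

def exact_expectation(coeffs, t, sigma):
    """E[f(w)] for w ~ N(t, sigma**2), using the raw-moment recursion."""
    m = [1.0, t]
    for n in range(2, len(coeffs)):
        m.append(t * m[n - 1] + (n - 1) * sigma ** 2 * m[n - 2])
    return sum(a * m[n] for n, a in enumerate(coeffs))

f = [1.0, 0.0, -2.0, 0.0, 1.0]   # (1 - w**2)**2, hypothetical squared error
t, sigma = 0.5, 0.2              # hypothetical latent mean and standard deviation

deterministic_term = poly_eval(f, t)
stochastic_cost = taylor_expectation(f, t, sigma)
print("deterministic term f(t):", deterministic_term)
print("Taylor sum:             ", stochastic_cost)
print("exact expectation:      ", exact_expectation(f, t, sigma))
```

The difference between the stochastic cost and the deterministic term consists only of derivative terms weighted by powers of sigma, which is the penalty structure discussed next.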

The first term of Result 1, $f_i(t) = (X_i - g_i(t))^2$ (after ignoring the weight $1/(2\sigma_{x_i}^2)$), would be the only term in the cost function if the latent space were deterministic. Thus, the rest of the terms in Result 1 are additional in stochastic training. Each of these terms is an inner product of two tensors, the first being $\sigma_t^{\otimes k} \odot \mathbb{E}[\epsilon^{\otimes k}]$, and the second being the $k$-th order partial derivative tensor $\nabla^{k} f_i(t)$. We can thus consider the first tensor as providing penalizing weights to different partial derivatives in the second tensor. Since each inner product is added to the cost, we are minimizing them during optimization. This has two important implications:

  1. For sufficiently large samples, $\mathbb{E}[\epsilon^{\otimes k}]$ must be close to the central moments of an isotropic Gaussian. However, in practice, the number of samples remains constant, but the number of parameters to be estimated keeps increasing for higher order tensors and reaches the order of the number of samples pretty quickly (fourth order in our case). When the number of parameters is high, we can expect that those higher moments do not converge to those of the standard Gaussian. This, luckily, works in our favor. Since we are minimizing $\langle \sigma_t^{\otimes k} \odot \mathbb{E}[\epsilon^{\otimes k}], \nabla^{k} f_i(t) \rangle$ for each order $k$, the inner product can vanish for arbitrary $\mathbb{E}[\epsilon^{\otimes k}]$ only by driving the partial derivative tensors towards zero. Therefore, minimizing the sum of all the inner products for arbitrary $\mathbb{E}[\epsilon^{\otimes k}]$ would minimize most of the terms in the partial derivative tensor. From Proposition 1, minimizing each of these partial derivatives corresponds to minimizing the variation of the function $f_i$, and consequently the variation of the total loss according to eq.(12). Hence, the additional terms in the stochastic latent space formulation contribute to decreasing the variation of the loss.

  2. Not all the partial derivatives are equally weighted in the cost function. Due to the presence of the weighting tensor $\sigma_t^{\otimes k}$ in the first tensor of the inner product, different partial derivative terms are penalized differently according to the value of $\sigma_t$. The combination of the KL divergence term in eq.(8) with these inner products thus tries to increase the variance towards 1 whenever it does not significantly increase the cost: a higher value of $\sigma_t$ penalizes the partial derivatives in a certain direction more heavily, making the cost flatter in some directions than in others.

Strictly speaking, Proposition 1 requires the cross partial derivatives to be small throughout the domain of the latent variable, which is not included in the above analysis. This, however, should not significantly affect the observation that, compared to the deterministic formulation, the stochastic formulation decreases the variation $V[f]$.


Figure 2: Comparison of TMP propagation over the heart reconstructed from the same ECG using different architectures. The propagation sequence is closer to the ground truth in the case of TMP reconstruction with the svs stochastic architecture.

Experiments & Results


We simulated training and test sets using three human-torso geometry models. Spatiotemporal TMP sequences were generated using the Aliev-Panfilov (AP) model [Aliev and Panfilov1996], and projected to body-surface potential data with noise at 40 dB SNR. Two parameters were varied when simulating the TMP data: the origin of excitation and abnormal tissue properties representing myocardial scar.

The training set was randomly selected with regard to these two parameters. To test generalization ability, test data were selected with origins of excitation and shapes/locations of abnormal tissue different from those used in training. In particular, we prepared test datasets of four types: 1) Scar: Low, Exc: Low, 2) Scar: Low, Exc: High, 3) Scar: High, Exc: Low, and 4) Scar: High, Exc: High, where Scar/Exc indicates the parameter being varied and High/Low denotes the level of difference from the training data.

Implementation Details

For all five models being compared (svs stochastic/deterministic, svs-L stochastic/deterministic, and sss stochastic), we used ReLU activation functions in both the encoder and decoder, the ADAM optimizer [Kingma and Ba2014], and a learning rate of . Each neural network was trained on approximately 2500 TMP simulations on each geometry. In addition to the five neural networks, we included a classic TMP inverse reconstruction method (Greensite) designed to incorporate temporal information [Greensite and Huiskamp1998]. On each geometry, a random set of 100 test cases was selected from the test set and the process was repeated 120 times. We report the average and standard deviation of the results across all three geometry models.
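One way to read the repeated random-subset protocol is as a resampling estimate of a metric's mean and spread. The sketch below is an assumption about how such a summary could be computed, not a description of the authors' scripts; the per-case scores are synthetic placeholders, and only the subset size (100) and repeat count (120) come from the text.

```python
import random
import statistics

def resampled_mean_std(scores, subset_size, n_repeats, seed=0):
    """Average a metric over random subsets and summarise across repeats."""
    rng = random.Random(seed)
    subset_means = [
        statistics.fmean(rng.sample(scores, subset_size))  # sampling without replacement
        for _ in range(n_repeats)
    ]
    return statistics.fmean(subset_means), statistics.stdev(subset_means)

# Synthetic per-test-case scores standing in for one metric on one geometry.
placeholder_rng = random.Random(1)
scores = [placeholder_rng.uniform(0.6, 0.9) for _ in range(500)]

mean_score, std_score = resampled_mean_std(scores, subset_size=100, n_repeats=120)
print(f"{mean_score:.3f} +/- {std_score:.3f}")
```

Results from several geometries could then be pooled by concatenating their subset means before taking the final average and standard deviation.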


Figure 3: Visualization of latent point cloud corresponding to whole dataset: training and validation dataset.


Figure 4: Comparison of the reconstruction ability of different architectures in the presence of different types of unseen test data.


Figure 5: Comparison of scar region identified by different architectures and Greensite method with reference to in vivo CARTO map.


Columns: Metric | MSE | TMP Corr. | AT Corr. | Dice Coeff.
(One row per method; the numeric entries are missing from this copy. In the last row, corresponding to the Greensite method, MSE and TMP Corr. are marked "–".)

Table 1: Metric for each method

The reconstruction accuracy was measured with four metrics: 1) mean square error (MSE) of the TMP sequence, 2) correlation of the TMP sequence, 3) correlation of TMP-derived activation time (AT), excluding late activation due to abnormal scar tissue so that the measure focuses on the accuracy related to excitation points, and 4) Dice coefficient of the abnormal scar tissue identified from the TMP sequence.
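The basic quantities behind these metrics can be sketched as follows. This is a minimal illustration on flattened sequences and on sets of node indices; how activation times and scar regions are derived from the TMP sequence is not specified here, so the inputs below are hypothetical.

```python
import math

def mse(a, b):
    """Mean square error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pearson(a, b):
    """Pearson correlation coefficient between two sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def dice(pred_nodes, true_nodes):
    """Dice coefficient between two sets of node indices."""
    if not pred_nodes and not true_nodes:
        return 1.0
    return 2.0 * len(pred_nodes & true_nodes) / (len(pred_nodes) + len(true_nodes))

# Hypothetical flattened TMP values and scar node sets.
truth, estimate = [1.0, 2.0, 3.0], [1.0, 2.0, 5.0]
print("MSE: ", mse(truth, estimate))
print("Corr:", pearson(truth, estimate))
print("Dice:", dice({1, 2, 3}, {2, 3, 4}))
```

The same `pearson` helper would apply to activation times once they have been extracted from the reconstructed and reference TMP sequences.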

Table 1 summarizes the results from the three geometry models on all datasets. As shown, the accuracy of the svs stochastic architecture was significantly higher than that of the other architectures in all metrics. Similarly, the stochastic version of each architecture was more accurate than its deterministic counterpart. Most of the networks delivered a higher accuracy than the classic Greensite method (which does not preserve the TMP signal shape, and thus its MSE and TMP correlation were not reported). These observations are reflected in the example of reconstructed TMP sequences in Fig. 2.

Analysis of latent representation:

To understand the difference in the latent representations obtained among different architectures, we computed latent projections of the training and validation sets for each method, and visualized the point cloud in the latent space with t-SNE [Maaten and Hinton2008] as shown in Fig.3. Each data point has an excitation label and scar label corresponding to their locations in the heart. The first row shows latent points colored by the excitation label, suggesting that all three stochastic models were able to cluster data points in the latent space according to the origin of excitation. The second row shows latent points colored by scar label. In this case, only the latent cloud from the svs stochastic model was clustered. This suggests that the latent representation of the presented svs stochastic model considered information of both the scar location and origin of excitation, while the other models were more focused on the excitation point.

Analysis of performance under different test conditions:

Figure 4 compares the performance of different methods in the four aforementioned types of test dataset, considering two levels of differences in each of the two parameters (excitation points and scar tissue) relative to the training data.

When only the level of unseen excitation points increases in the new data (see Fig. 4.a, first two points on the x axis), the Dice coefficient decreased in all architectures except for the svs stochastic model. This suggests that the representation of scar location in the svs stochastic model is more robust to errors in the excitation point. On the other hand, the correlation of AT shows that when only the level of unseen scar tissue varies (see Fig. 4.b, first and third points on the x axis), the performance of all methods stays at a constant level, showing that they all encode the origin of excitation. These two findings are also consistent with the visualization of the latent cloud in Fig. 3.

Comparing all the metrics, we observed that even though all the models were trained on a scaled mean square loss, the stochastic models, and the svs stochastic model in particular, performed better in identifying the origin of excitation and the region of scar. This suggests that stochasticity combined with aggregation of sequence information helps capture global generative factors and thus improves generalization ability.

Performance on Real Data:

We present a real-data case study on a patient who underwent catheter ablation due to scar-related ventricular tachycardia. First, the presented models were trained on data simulated on this patient as described earlier. Then, TMP was reconstructed from the real ECG data using the trained networks, from which the scar region was delineated based on TMP duration and compared with low-voltage regions from in-vivo mapping data. As shown in Fig. 5, the identified region of scar from the svs stochastic model is the closest to the in-vivo data, which is consistent with simulated results.


To our knowledge, this is the first work connecting VAE-type stochastic regularization with the generalization ability of a neural network. We have shown both theoretically and experimentally, on inverse TMP reconstruction, that stochasticity and global aggregation of temporal information indeed improve inverse reconstruction. Future work will extend these analyses to a wider variety of image and signal reconstruction problems over time sequences.