There has been an upsurge of deep learning approaches for traditional image reconstruction problems in computer vision and medical imaging. Examples include image denoising, inpainting, and medical image reconstruction across a variety of modalities such as magnetic resonance imaging and computed tomography. Despite the state-of-the-art performance of these deep neural networks, their ability to reconstruct from data not seen in the training distribution is not well understood. To date, very limited work has investigated the generalization ability of these image reconstruction networks from a theoretical perspective, or provided insights into what aspects of representation and learning may improve the ability of these networks to generalize outside the training data.
In this paper, we take an information theoretic perspective, along with analytical learning theory, to investigate and improve the generalization ability of deep image reconstruction networks. Let $x$ be the original image and $y$ be the measurement obtained from $x$ by some transformation process. To reconstruct $x$ from $y$, we adopt a common deep encoder-decoder architecture [14, 19] where a latent representation $w$ is first inferred from $y$ before being used for the reconstruction of $x$. Our objective is to learn transformations that are general, possibly learning the underlying generative process rather than focusing on every detail in training examples. To this end, we propose that the generalization ability of a deep reconstruction network can be improved in two respects: 1) the ability to generalize to data that are generated from $x$ (and thereby $y$) outside the training distribution; and 2) the ability to generalize to unseen variations in data that are introduced during the measurement process but irrelevant to $x$.
For the first type of generalization ability, we hypothesize that it can be improved by using stochastic instead of deterministic latent representations. We support this hypothesis with analytical learning theory, showing that a stochastic latent space helps to learn a decoder that is less sensitive to perturbations in the latent space and thereby leads to better generalization. For the second type of generalization ability, we hypothesize that it can be improved if the encoder compresses the input measurement $y$ into a minimal latent representation $w$ (a code, in information-theoretic terms) containing only the information necessary for the image $x$ to be reconstructed. To obtain a minimal representation from $y$ that is maximally informative of $x$, we adopt the information bottleneck theory to maximize the mutual information between the latent code $w$ and $x$, $I(x; w)$, while constraining the mutual information between $y$ and $w$, $I(y; w)$, to remain below a fixed bound. This can be achieved by minimizing the following objective:

$$\mathcal{L}_{IB} = -I(x; w) + \beta\, I(y; w) \tag{1}$$

where $\beta$ is the Lagrange multiplier. Based on these two primary hypotheses, we present a deep image reconstruction network optimized by a variational approximation of the information bottleneck principle with stochastic latent space.
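As a rough illustration of the information bottleneck objective, the sketch below evaluates $-I(x; w) + \beta I(y; w)$ for small discrete distributions, with $x$ the image, $y$ the measurement, and $w$ the latent code. The joint probability tables and the helper names `mutual_information` and `ib_objective` are hypothetical and chosen only for illustration; the network itself works with continuous variables and a variational bound rather than exact mutual information.

```python
import math


def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint distribution.

    `joint` is a 2-D list whose entries sum to 1.
    """
    row_marginal = [sum(row) for row in joint]
    col_marginal = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:
                mi += p * math.log(p / (row_marginal[i] * col_marginal[j]))
    return mi


def ib_objective(joint_xw, joint_yw, beta):
    """Information bottleneck objective: -I(x; w) + beta * I(y; w)."""
    return -mutual_information(joint_xw) + beta * mutual_information(joint_yw)


# Hypothetical toy tables (assumed numbers, not taken from the paper).
joint_xw = [[0.40, 0.10], [0.10, 0.40]]
joint_yw = [[0.45, 0.05], [0.05, 0.45]]

for beta in (0.0, 0.5, 1.0):
    print(beta, ib_objective(joint_xw, joint_yw, beta))
```

A larger $\beta$ puts more weight on compressing $y$, so for fixed tables the objective grows with $\beta$; during training it is the encoder that changes to trade the two terms against each other.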
While the presented network applies to general reconstruction problems, we test it on the sequence reconstruction of cardiac transmembrane potential (TMP) from high-density body-surface electrocardiograms (ECGs). Given the sequential nature of the problem, we use long short-term memory (LSTM) networks in both the encoder and decoder, with two alternative architectures to compress the temporal information into a vector latent space. We tackle two specific challenges regarding the generalization of the reconstruction. First, because the problem is ill-posed, it has been important to constrain the reconstruction with prior physiological knowledge of TMP dynamics [5, 18, 4]. This, however, has made it difficult to generalize to physiological conditions outside those specified by the prior knowledge. By using the stochastic latent space, we demonstrate the ability of the presented method to generalize outside the physiological knowledge provided in the training data. Second, because the generation of ECGs depends on heart-torso geometry, it has been difficult for existing methods to generalize beyond a patient-specific setting. By using the information bottleneck principle, we demonstrate the robustness of the presented network to geometrical variations in ECG data and therefore a unique ability to generalize to unseen subjects. These generalization abilities are tested on two controlled synthetic datasets as well as in a real-data feasibility study. We hope that these findings may initiate more theoretical and systematic investigations of the generalization ability of deep networks in image reconstruction problems.
2 Related Work
Deep neural networks have become popular in medical image reconstruction across different modalities such as computed tomography, magnetic resonance imaging, and ultrasound. Some of these inverse reconstruction networks are based on an encoder-decoder structure [6, 19], similar to that investigated in this paper. Among these, the presented work is the closest to Automap in that the output image is reconstructed directly from the input measurements without any intermediate domain-specific transformations. However, these existing works have not investigated either the use of stochastic architectures or the information bottleneck principle to improve the ability of the network to generalize outside the training distributions.
The presented theoretical analysis of stochasticity in generalization utilizes analytical learning theory, which is fundamentally different from classic statistical learning theory in that it is strongly instance-dependent. While statistical learning theory deals with data-independent generalization bounds or data-dependent bounds for a certain hypothesis space, analytical learning theory provides a bound on how well a model learned from a given dataset should perform on the true (unknown) measure of the variable of interest. This makes it well suited for measuring the generalization ability of a stochastic latent space for the given problem and data.
The presented variational formulation of the information bottleneck principle is closely related to prior work on the deep variational information bottleneck. However, our work differs in three primary aspects. First, we investigate image reconstruction tasks, in which the role of the information bottleneck has not been clearly understood. Second, we define generalization ability in two different categories, and provide theoretical as well as empirical evidence on how a stochastic latent space can improve the network's generalization ability in a way different from the information bottleneck. Finally, we extend the setting of static image classification to image sequences, in which the latent representation needs to be compressed from temporal information within the whole sequence.
To learn temporal relationships in ECG/TMP sequences, we consider two sequence encoder-decoder architectures. One is commonly used in language translation, where the code from the last unit of the last LSTM encoder layer is used as the latent vector representation to reconstruct the output sequence. We also present a second architecture where fully connected layers are used to compress all the hidden codes of the last LSTM layer into a latent vector representation. This is similar in concept to the attention mechanism, which selectively uses information from all the hidden LSTM codes for decoding. We experimentally compare the generalization ability of using stochastic versus deterministic latent vectors in both architectures, which has not been studied before.
In the application area of cardiac TMP reconstruction, most related to this paper are works constraining the reconstruction with prior temporal knowledge in the form of physics-based simulation models of TMP and, more recently, generative models learned from physics-based TMP simulation. To our knowledge, however, this is the first work to investigate the use of deep learning for the direct inference of TMP from ECG. The method also has the unique potential to generalize outside the patient-specific settings and outside pathological conditions included in the prior knowledge.
Body-surface electrical potential is produced by TMP in the heart. Their mathematical relation is defined by the quasi-static approximation of electromagnetic theory and, when solved on patient-specific heart-torso geometry, can be derived as $y(t) = H x(t)$, where $y(t)$ denotes the time-varying body-surface potential map, $x(t)$ the time-varying TMP map over the 3D heart muscle, and $H$ the measurement matrix specific to the heart-torso geometry of a subject. The inverse reconstruction of $x(t)$ from $y(t)$ at each time instant is ill-posed, and a popular approach is to reconstruct the TMP time sequence constrained by prior physiological knowledge of its dynamics [4, 18, 5]. This is the setting considered in this study, in which the deep network learns to reconstruct with prior knowledge from pairs of $x(t)$ and $y(t)$ generated by physics-based simulation. Note that it is not possible to obtain real TMP data for training, which further highlights the importance of the network's ability to generalize. In what follows, we use $X$ and $Y$ to represent sequence matrices with each column denoting the potential map at one time instant.
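The ill-posedness of inverting a linear measurement model can be seen in a toy system with fewer measurements than unknowns. In the sketch below, the 2-by-3 matrix `H`, the vectors, and the helper `matvec` are hypothetical stand-ins and not an actual heart-torso model: two different source vectors produce exactly the same measurement, so the measurement alone cannot tell them apart.

```python
def matvec(H, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(h * v for h, v in zip(row, x)) for row in H]


# Hypothetical system: 2 surface measurements, 3 unknown source values.
H = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]

x_true = [1.0, 2.0, 3.0]
null_vec = [1.0, 1.0, -1.0]  # lies in the null space of H, so H @ null_vec = 0

# Adding any multiple of a null-space vector leaves the measurement unchanged.
x_alt = [a + 2.0 * b for a, b in zip(x_true, null_vec)]

y_true = matvec(H, x_true)
y_alt = matvec(H, x_alt)
print(y_true, y_alt)
```

This is why some form of prior knowledge, here learned from simulated pairs, is needed to select one solution among the many that fit the data.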
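Given the joint distribution of TMP and ECG, $p(X, Y)$, the encoder gives us a conditional distribution $p(w|Y)$ over the latent code $w$. Together these define a joint distribution of $(X, Y, w)$:

$$p(X, Y, w) = p(X, Y)\, p(w|Y)$$

The first term of the information bottleneck objective in eq.(1) is given by

$$-I(X; w) = -\int p(X, w) \log \frac{p(X|w)}{p(X)}\, dX\, dw$$

where $p(X|w)$ is intractable. Letting $q(X|w)$ be the variational approximation of $p(X|w)$, we have:

$$-I(X; w) = -\int p(w)\, D_{KL}\big(p(X|w)\,\|\,q(X|w)\big)\, dw - \int p(X, w) \log q(X|w)\, dX\, dw - H(X)$$

where the KL divergence in the first term is non-negative and $H(X)$ is the entropy of $X$, which does not depend on the network. This gives us:

$$-I(X; w) \le -\mathbb{E}_{p(X, Y)}\, \mathbb{E}_{p(w|Y)}\big[\log q(X|w)\big] - H(X)$$

The second term of the objective in eq.(1) is given by

$$I(Y; w) = \int p(Y, w) \log \frac{p(w|Y)}{p(w)}\, dY\, dw \le \mathbb{E}_{p(Y)}\big[D_{KL}\big(p(w|Y)\,\|\,r(w)\big)\big]$$

for any fixed distribution $r(w)$ approximating the marginal $p(w)$. Dropping the constant $H(X)$, this gives us

$$\mathcal{L} = -\mathbb{E}_{p(X, Y)}\, \mathbb{E}_{p(w|Y)}\big[\log q(X|w)\big] + \beta\, \mathbb{E}_{p(Y)}\big[D_{KL}\big(p(w|Y)\,\|\,r(w)\big)\big]$$

to be minimized as an upper bound (up to that constant) of the information bottleneck objective formulated in eq.(1).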
3.0.1 Parameterization with neural network:
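We model both the encoder $p_{\theta_1}(w|Y)$ and the decoder $q_{\theta_2}(X|w)$ as Gaussian distributions whose means and variances are given by neural networks:

$$p_{\theta_1}(w|Y) = \mathcal{N}\big(w \mid \mu_w(Y), \sigma_w^2(Y)\big), \qquad q_{\theta_2}(X|w) = \mathcal{N}\big(X \mid \mu_X(w), \sigma_X^2(w)\big)$$

where $\sigma_X^2$ denotes a matrix that consists of the variance of each corresponding element in matrix $X$. This is based on the implicit assumption that each element in $X$ is independent and Gaussian, and similarly for $w$. This gives us:

$$\mathcal{L} = \mathbb{E}_{p(X, Y)}\Big[-\mathbb{E}_{p_{\theta_1}(w|Y)}\big[\log q_{\theta_2}(X|w)\big] + \beta\, D_{KL}\big(p_{\theta_1}(w|Y)\,\|\,r(w)\big)\Big]$$

where $r(w) = \mathcal{N}(0, I)$. We use reparameterization, $w = \mu_w + \sigma_w \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, to compute the inner expectation in the first term. The KL divergence in the second term is analytically available for two Gaussian distributions. Up to an additive constant, we obtain:

$$\mathcal{L} = \mathbb{E}_{p(X, Y)}\Big[\mathbb{E}_{\epsilon}\Big[\sum_i \Big(\frac{(X_i - g_i(w))^2}{2\sigma_{X_i}^2} + \frac{1}{2}\log \sigma_{X_i}^2\Big)\Big] + \frac{\beta}{2}\sum_j \big(\mu_{w_j}^2 + \sigma_{w_j}^2 - \log \sigma_{w_j}^2 - 1\big)\Big] \tag{9}$$

where $g_i$ is the function mapping the latent variable to the $i$-th element of the mean of $X$, such that $\mu_X(w) = [g_1(w), g_2(w), \dots]$. The deep network is trained to minimize $\mathcal{L}$ in eq.(9) with respect to network parameters $\theta = \{\theta_1, \theta_2\}$.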
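A minimal sketch of how such a loss might be evaluated for one training pair is given below, assuming a diagonal Gaussian encoder, a standard normal prior on the latent code, and a decoder variance fixed to a constant (so the reconstruction term reduces to a squared error). The function names, the toy linear `decoder`, and all numbers are hypothetical; a real implementation would use an automatic differentiation framework and LSTM networks.

```python
import math
import random


def reparameterize(mu, sigma, rng):
    """Draw w = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + s * s - 2.0 * math.log(s) - 1.0
                     for m, s in zip(mu, sigma))


def vib_loss(x, mu, sigma, decoder, beta, n_samples, rng):
    """Monte Carlo reconstruction error plus beta-weighted KL term."""
    recon = 0.0
    for _ in range(n_samples):
        w = reparameterize(mu, sigma, rng)
        x_hat = decoder(w)
        recon += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon / n_samples + beta * kl_to_standard_normal(mu, sigma)


def decoder(w):
    """Hypothetical toy decoder: a fixed linear map from 2-D latent to 2-D output."""
    return [w[0] + w[1], w[0] - w[1]]


rng = random.Random(0)
loss = vib_loss(x=[1.0, 0.0], mu=[0.5, 0.5], sigma=[0.1, 0.1],
                decoder=decoder, beta=0.1, n_samples=64, rng=rng)
print(loss)
```

Setting `sigma` to a vector of ones and `mu` to zeros makes the KL term vanish, which is the sense in which the KL term pulls the encoder towards the prior.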
3.0.2 Network architectures:
The sequence reconstruction network is realized using long short-term memory (LSTM) neural networks in both the encoder and decoder. To compress the time sequence into a latent vector representation, we experiment with two alternative architectures. First, based on the commonly used sequence-to-sequence language translation model, we consider a svs-L architecture that employs the hidden code of the last unit in the last encoding LSTM layer as the latent vector representation for reconstructing TMP sequences. Second, we propose a svs architecture where two fully connected layers are used to compress all the hidden codes of the last LSTM layer into a vector representation. In the decoder, this latent representation is expanded by two fully connected layers before being fed into LSTM layers as shown in Fig. 1.
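The difference between the two compression schemes can be sketched on a stand-in for the hidden codes of the last encoder LSTM layer. Everything below is an assumption made for illustration: the hidden states are random numbers rather than LSTM outputs, the layer sizes and the tanh activation are arbitrary, and the helper names are hypothetical.

```python
import math
import random


def dense(W, b, v, activation=None):
    """Fully connected layer: activation(W v + b)."""
    out = [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(W, b)]
    return [activation(o) for o in out] if activation else out


def random_layer(n_out, n_in, rng):
    """Randomly initialised weights and zero biases for one dense layer."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def latent_last_unit(hidden_states):
    """svs-L style: use only the hidden code of the last time step."""
    return list(hidden_states[-1])


def latent_fully_connected(hidden_states, layer1, layer2):
    """svs style: compress all hidden codes with two fully connected layers."""
    flat = [h for step in hidden_states for h in step]
    mid = dense(layer1[0], layer1[1], flat, activation=math.tanh)
    return dense(layer2[0], layer2[1], mid)


rng = random.Random(0)
n_steps, hidden_dim, latent_dim = 6, 4, 3
hidden_states = [[rng.uniform(-1.0, 1.0) for _ in range(hidden_dim)]
                 for _ in range(n_steps)]
layer1 = random_layer(8, n_steps * hidden_dim, rng)
layer2 = random_layer(latent_dim, 8, rng)

z_last = latent_last_unit(hidden_states)
z_fc = latent_fully_connected(hidden_states, layer1, layer2)
print(len(z_last), len(z_fc))
```

In the first scheme the latent size is tied to the LSTM hidden size and depends on a single time step; in the second it is a free design choice and draws on every time step.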
4 Encoder-Decoder Learning from the Perspective of Analytical Learning Theory
In this section, we look at encoder-decoder inverse reconstruction using analytical learning theory. We start with a general framework and then show that having a stochastic latent space with regularization helps generalization.
Let $z = (Y, X)$ be an input-output pair, and let $D_n = \{z_1, \dots, z_n\}$ denote the total set of training and validation data, where $Z_{m'} \subset D_n$ is the validation set. During training, a neural network learns the parameter $\theta$ by using an algorithm $\mathcal{A}$ and dataset $S_m = D_n \setminus Z_{m'}$, at the end of which we have a mapping $h_{\mathcal{A}(S_m)}$ from $Y$ to $X$. Typically, we stop training when the model performs well on the validation set. To evaluate this performance, we define a prediction error function, $\ell(z)$, based on our notion of the goodness of prediction. The average validation error is given by $\mathbb{E}_{Z_{m'}}[\ell] = \frac{1}{m'}\sum_{z \in Z_{m'}} \ell(z)$. However, there exists a so-called generalization gap between how well the model performs on the validation set versus on the true distribution of the input-output pair. To be precise, let $(\mathcal{Z}, \Sigma, \mu)$ be a measure space with $\mu$ being a measure on $(\mathcal{Z}, \Sigma)$. Here, $\mathcal{Z}$ denotes the input-output space of all the observations and inverse solutions. The generalization gap is given by $\mathbb{E}_{\mu}[\ell] - \mathbb{E}_{Z_{m'}}[\ell]$. Theorem 1 of Kawaguchi and Bengio provides an upper bound on the generalization gap in terms of the data distribution in the latent space and properties of the decoder.
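Theorem 4.1
For any $\ell$, let $(T, f)$ be a pair such that $T: (\mathcal{Z}, \Sigma) \to ([0,1]^d, \mathcal{B}([0,1]^d))$ is a measurable function, $f: ([0,1]^d, \mathcal{B}([0,1]^d)) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is of bounded variation as $V[f] < \infty$, and $\ell(z) = (f \circ T)(z)$ for all $z \in \mathcal{Z}$, where $\mathcal{B}(A)$ indicates the Borel $\sigma$-algebra on $A$. Then for any dataset pair $(S_m, Z_{m'})$ and any $\ell$,

$$\mathbb{E}_{\mu}[\ell] - \mathbb{E}_{Z_{m'}}[\ell] \le V[f]\, D^*[T_*\mu, T(Z_{m'})]$$

where $T_*\mu$ is the pushforward measure of $\mu$ under the map $T$.

For an encoder-decoder setup, $T$ is the encoder that maps the observation to the latent space and $f$ becomes the composition of the loss function and the decoder, mapping the latent representation to the reconstruction loss. Theorem 1 provides two ways to decrease the generalization gap in our problem: by decreasing the variation $V[f]$ or the discrepancy $D^*[T_*\mu, T(Z_{m'})]$. Here, we show that stochasticity of the latent space helps decrease the variation $V[f]$. The variation of $f$ on $[0,1]^d$ in the sense of Hardy and Krause is defined as

$$V[f] = \sum_{k=1}^{d} \sum_{1 \le j_1 < \dots < j_k \le d} V^{(k)}[f_{j_1 \dots j_k}]$$

where $V^{(k)}[f_{j_1 \dots j_k}]$ is defined with the following proposition.

Proposition 1
Suppose that $f_{j_1 \dots j_k}$ is a function for which $\partial_{1,\dots,k} f_{j_1 \dots j_k}$ exists on $[0,1]^k$. Then, $V^{(k)}[f_{j_1 \dots j_k}] \le \sup_{(t_{j_1}, \dots, t_{j_k}) \in [0,1]^k} \big|\partial_{1,\dots,k} f_{j_1 \dots j_k}(t_{j_1}, \dots, t_{j_k})\big|$. If $\partial_{1,\dots,k} f_{j_1 \dots j_k}$ is also continuous on $[0,1]^k$, $V^{(k)}[f_{j_1 \dots j_k}] = \int_{[0,1]^k} \big|\partial_{1,\dots,k} f_{j_1 \dots j_k}(t_{j_1}, \dots, t_{j_k})\big|\, dt_{j_1} \cdots dt_{j_k}$.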
In our case, $f$ is the prediction error as a function of latent representations $w$:

$$f(w) = \|X - g(w)\|_F^2 \tag{10}$$

where $\|A\|_F$ denotes the Frobenius norm of matrix $A$, and $g$ maps the latent space to the estimated $X$. Theorem 1 and Proposition 1 imply that if the cross partial derivatives of the loss with respect to the latent vector are small at all orders and in all directions throughout the latent space, then the approximated validation loss will be closer to the actual loss over the true unknown distribution of the dataset. Intuitively, we want the loss as a function of the latent representation to be flat if we want good generalization.
4.0.1 Using stochastic latent space:
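In our formulation, the latent vector $w$ is stochastic with the cost function given by eq.(9). Using reparameterization, $w = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, the inner expectation of the first term in the loss function is given, up to the scaling by the decoder variance, by

Result 1.

$$\mathbb{E}_{\epsilon}\big[f(\mu + \sigma \odot \epsilon)\big] = \sum_{k=0}^{\infty} \frac{1}{k!} \Big\langle \sigma^{\otimes k} \odot \mathbb{E}_{\epsilon}\big[\epsilon^{\otimes k}\big],\ \partial^k f(\mu) \Big\rangle$$

where $\epsilon^{\otimes k}$ denotes the $k$-th order tensor product of the vector $\epsilon$ by itself, and $\partial^k f(\mu)$ denotes the tensor of $k$-th order partial derivatives of $f$ at $\mu$.

Proof. Using the Taylor series expansion of $f$ about $\mu$,

$$\mathbb{E}_{\epsilon}\big[f(\mu + \sigma \odot \epsilon)\big] = \mathbb{E}_{\epsilon}\Big[\sum_{k=0}^{\infty} \frac{1}{k!} \Big\langle (\sigma \odot \epsilon)^{\otimes k},\ \partial^k f(\mu) \Big\rangle\Big] \tag{11}$$

We move the expectation operator inside both brackets and take the expectation of only the first term in the inner product. Using $(\sigma \odot \epsilon)^{\otimes k} = \sigma^{\otimes k} \odot \epsilon^{\otimes k}$, we get $\mathbb{E}_{\epsilon}\big[(\sigma \odot \epsilon)^{\otimes k}\big] = \sigma^{\otimes k} \odot \mathbb{E}_{\epsilon}\big[\epsilon^{\otimes k}\big]$. Using these in eq.(11) yields the required result.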
The first term of Result 1 (after ignoring the decoder variance) would be the only term in the cost function if the latent space were deterministic. The rest of the terms are additional in stochastic training. Each of these terms is an inner product of two tensors, the first being the weighted noise-moment tensor $\sigma^{\otimes k} \odot \mathbb{E}[\epsilon^{\otimes k}]$, and the second being the $k$-th order partial derivative tensor. We can thus consider the first tensor as providing penalizing weights to different partial derivatives in the second tensor. Since each inner product is added to the cost, we are minimizing them during optimization. This has two important implications:

- For sufficiently large samples, $\mathbb{E}[\epsilon^{\otimes k}]$ must be close to the central moments of an isotropic Gaussian. However, in practice, the number of samples of $\epsilon$ remains constant. As we move to higher order moment tensors, we can expect that they do not converge to those of the standard Gaussian. This, luckily, works in our favor. Since we are minimizing the inner product for each order, it can vanish for arbitrary moment tensors only by driving the partial derivative tensors towards zero. Therefore, minimizing the sum of all the inner products for arbitrary moment tensors would minimize most of the terms in the partial derivative tensors. From Proposition 1, this corresponds to minimizing the variation of the function $g$, and consequently the variation of the total error function according to eq.(10). Hence, the additional terms in the stochastic latent space formulation contribute to decreasing the variation and consequently the generalization gap.
- Not all the partial derivatives are equally weighted in the cost function. Due to the presence of the weighting tensor $\sigma^{\otimes k}$ in the first tensor of each inner product, different partial derivative terms are penalized differently according to the value of $\sigma$. The KL divergence term in eq.(9), weighted by $\beta$, tries to increase the standard deviation $\sigma$ towards 1 whenever doing so does not significantly increase the cost: a higher value of $\sigma$ penalizes the partial derivatives in a certain direction more heavily, making the cost flatter in some directions than in others.

Strictly speaking, Proposition 1 requires cross partial derivatives to be small throughout the domain of the latent variable, which is not covered by the above analysis. This, however, should not significantly affect the observation that, compared to the deterministic formulation, the stochastic formulation decreases the variation $V[f]$.
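The effect of the additional Taylor terms can be checked numerically in one dimension. The sketch below uses a hypothetical quartic loss $f(w) = w^4$, chosen only because its Taylor series terminates, and compares the sum of moment-weighted derivative terms with a Monte Carlo estimate of the expected loss under Gaussian noise. The function names and the values of `mu` and `sigma` are assumptions for illustration.

```python
import math
import random

# Raw moments E[eps^k] of a standard Gaussian for k = 0..4.
GAUSS_MOMENTS = [1.0, 0.0, 1.0, 0.0, 3.0]


def f(w):
    """Hypothetical 1-D loss as a function of the latent value."""
    return w ** 4


def f_derivatives(mu):
    """Derivatives of f of order 0..4 evaluated at mu."""
    return [mu ** 4, 4.0 * mu ** 3, 12.0 * mu ** 2, 24.0 * mu, 24.0]


def taylor_expectation(mu, sigma):
    """Sum over k of sigma^k * E[eps^k] * f^(k)(mu) / k!."""
    derivs = f_derivatives(mu)
    return sum(sigma ** k * GAUSS_MOMENTS[k] * derivs[k] / math.factorial(k)
               for k in range(5))


def monte_carlo_expectation(mu, sigma, n_samples, rng):
    """Sample-based estimate of E[f(mu + sigma * eps)]."""
    total = sum(f(mu + sigma * rng.gauss(0.0, 1.0)) for _ in range(n_samples))
    return total / n_samples


mu, sigma = 1.0, 0.5
exact = taylor_expectation(mu, sigma)
estimate = monte_carlo_expectation(mu, sigma, 100_000, random.Random(0))
print(f(mu), exact, estimate)
```

The gap between `f(mu)` and the expected loss comes entirely from the second- and fourth-order terms, which are weighted by powers of `sigma`; a deterministic latent code (`sigma = 0`) would see only the zeroth-order term.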
5 Experiments & Results
Since it is not possible to obtain real TMP data, the reconstruction network is trained on simulated pairs of TMP and ECG data. We focus on evaluating three generalization tasks of the network: to learn how to reconstruct under the prior physiological knowledge given in simulation data while generalizing to 1) unseen pathological conditions in TMP, 2) unseen geometrical variations in ECG that are irrelevant to TMP, and 3) real clinical data.
5.1 Generalizing outside the training distribution of TMP
5.1.1 Dataset and implementation details:
We simulated training and test sets using three human-torso geometry models. Spatiotemporal TMP sequences were generated using the Aliev-Panfilov (AP) model and projected to body-surface potential data with 40 dB SNR noise. Two parameters were varied when simulating the TMP data: the origin of excitation and abnormal tissue properties representing myocardial scar. Training data were randomly selected with regard to these two parameters. Test data were selected such that the values of these two parameters differed from those used in training at four levels: 1) Scar: Low, Exc: Low, 2) Scar: Low, Exc: High, 3) Scar: High, Exc: Low, and 4) Scar: High, Exc: High, where Scar/Exc indicates the parameter being varied and High/Low denotes the level of difference (and therefore difficulty) from the training data. For example, Scar: Low, Exc: High test ECG data were simulated with a region of scar similar to training but an origin of excitation very different from that used in training.
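One common way to corrupt a simulated signal at a prescribed signal-to-noise ratio is sketched below. It assumes additive white Gaussian noise and an SNR defined as the ratio of mean signal power to noise power; the source does not state which definition was used, and the function names and the sinusoidal test signal are hypothetical.

```python
import math
import random


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise so that the expected SNR equals snr_db."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise_std = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, noise_std) for s in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB from a clean signal and its noisy version."""
    signal_power = sum(s * s for s in clean)
    noise_power = sum((n - s) ** 2 for s, n in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


rng = random.Random(0)
clean = [math.sin(0.01 * t) for t in range(5000)]  # stand-in for one ECG lead
noisy = add_noise(clean, 40.0, rng)
print(measured_snr_db(clean, noisy))
```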
[Table 1: Method vs. metric, with columns MSE, TMP Corr., AT Corr., and Dice Coeff.]
The reconstruction accuracy was measured with four metrics: 1) mean square error (MSE) of the TMP sequence, 2) correlation of the TMP sequence, 3) correlation of TMP-derived activation time (AT), and 4) Dice coefficient of the abnormal scar tissue identified from the TMP sequence. As summarized in Figure 2 and Table 1, in all test cases with different levels of pathological differences from the training data, the stochastic version of each architecture was consistently more accurate than its deterministic counterpart. In addition, most of the networks delivered a higher accuracy than the classic Greensite method (which does not preserve TMP signal shape, so its MSE and TMP correlation were not reported), and the accuracy of the svs stochastic architecture was significantly higher than that of the other architectures. These observations are reflected in the examples of reconstructed TMP sequences in Fig. 3.
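The metrics listed above have standard definitions, sketched here on flat lists of values. How activation times and scar regions are extracted from a TMP sequence is not specified in the text, so the sketch takes them as given inputs; the function names and the sample values in the usage lines are hypothetical.

```python
import math


def mse(a, b):
    """Mean square error between two equally long sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def correlation(a, b):
    """Pearson correlation coefficient, usable for TMP values or activation times."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def dice(pred_nodes, true_nodes):
    """Dice coefficient between predicted and reference sets of scar nodes."""
    pred_nodes, true_nodes = set(pred_nodes), set(true_nodes)
    if not pred_nodes and not true_nodes:
        return 1.0
    return 2.0 * len(pred_nodes & true_nodes) / (len(pred_nodes) + len(true_nodes))


print(mse([1.0, 2.0], [1.0, 4.0]))
print(correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(dice({1, 2, 3}, {2, 3, 4}))
```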
5.2 Generalization to geometrical variations irrelevant to TMP
5.2.1 Dataset and implementation details:
TMP data were simulated as described in the previous section, but on a single heart-torso geometry. ECG data were simulated from TMP with controlled geometrical variations by rotating the heart about the Z-axis at different angles (from -20 to +20 degrees at 1-degree intervals). We trained the network to reconstruct TMP using ECG simulated with i) five rotation angles from -2 to +2 degrees, and ii) ten rotation angles from -4 to +5 degrees. We then compared the stochastic and deterministic svs networks on test ECG generated with the rest of the rotation angles. The network architecture and training details were the same as described in the previous section. Test ECG sets at each rotation angle were generated from 250 TMP signals with different tissue properties and origins of excitation, and we report the mean and standard deviation of results for each angle.
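The train/test split over rotation angles, and the rotation itself, can be sketched as follows. The rotation matrix is the standard one for a rotation about the Z-axis; how the rotated heart is then passed through the forward model is not shown, and the helper names are hypothetical.

```python
import math


def rot_z(angle_deg):
    """3x3 rotation matrix for a rotation about the Z-axis."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


def apply(R, point):
    """Rotate a single 3-D point."""
    return [sum(r * p for r, p in zip(row, point)) for row in R]


# Angle sets as described in the text (integer degrees).
all_angles = list(range(-20, 21))
train_small = list(range(-2, 3))   # five angles: -2 .. +2
train_large = list(range(-4, 6))   # ten angles: -4 .. +5
test_small = [a for a in all_angles if a not in train_small]
test_large = [a for a in all_angles if a not in train_large]

print(len(train_small), len(test_small), len(train_large), len(test_large))
print(apply(rot_z(5), [1.0, 0.0, 0.0]))
```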
As summarized in Fig. 4(ii), when trained on a small interval of five rotation values, the stochastic information bottleneck consistently improves the ability of the network to generalize to geometrical values outside the training distribution. This margin of improvement also increases as we move further away from the training set, i.e. as we go left or right from the centre, and seems to be more pronounced when measuring the dice coefficient of the detected scar. When trained on a larger interval of ten rotation values, however, this performance gap diminishes as shown in Fig. 4(i). This suggests that the encoder-decoder architecture with compressed latent space can naturally learn to remove variations irrelevant to the network output, although the use of stochastic information bottleneck allows the network to generalize from a smaller number of training examples.
To understand how the parameter $\beta$ in the information bottleneck loss plays a role in generalization, we repeated the above experiments with different values of $\beta$. As shown in Fig. 5, as we increase $\beta$, the generalization ability of the network first improves and then degrades, reaching an optimum at an intermediate value of $\beta$.
5.3 Generalization to real data: a feasibility study
Finally, we tested the presented networks – trained on simulated data as described earlier – on clinical 120-lead ECG data obtained from a patient with scar-related ventricular tachycardia. From the reconstructed TMP sequence, the scar region was delineated based on TMP duration and compared with low-voltage regions from in-vivo mapping data. As shown in Fig. 6, because the network is directly transferred from the simulated data to real data, the reconstruction accuracy is in general lower than that in synthetic cases. However, similar to the observations in synthetic cases, the svs stochastic model is able to reconstruct the region of scar that is the closest to the in-vivo data.
To our knowledge, this is the first work that theoretically investigates the generalization of inverse reconstruction networks through the two different perspectives of stochasticity and information bottleneck, supported by carefully designed experiments in real-world applications. Note that the upper bound on $I(Y; w)$ used in the loss satisfies $\mathbb{E}_{p(Y)}\big[D_{KL}\big(p(w|Y)\,\|\,r(w)\big)\big] = I(Y; w) + D_{KL}\big(p(w)\,\|\,r(w)\big)$, where $w$ is the latent code, $Y$ the ECG sequence, and $r(w)$ the predefined prior. Therefore, minimizing this bound puts an additional constraint on the marginal $p(w)$ to be close to the predefined $r(w)$. It is possible that the choice of $r(w)$ might also play a role in generalization; this is reserved for future investigations. Future work will also extend the presented study to a wider variety of medical image reconstruction problems.
- [1] Alemi, A., Fischer, I., Dillon, J., Murphy, K.: Deep variational information bottleneck. In: ICLR (2017), https://arxiv.org/abs/1612.00410
- [2] Aliev, R.R., Panfilov, A.V.: A simple two-variable model of cardiac excitation. Chaos, Solitons & Fractals 7(3), 293–301 (1996)
- [3] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
- [4] Ghimire, S., Dhamala, J., Gyawali, P.K., Sapp, J.L., Horacek, M., Wang, L.: Generative modeling and inverse imaging of cardiac transmembrane potential. In: International Conference on MICCAI. pp. 508–516. Springer (2018)
- [5] Greensite, F., Huiskamp, G.: An improved method for estimating epicardial potentials from the body surface. IEEE TBME 45(1), 98–104 (1998)
- [6] Han, Y.S., Yoo, J., Ye, J.C.: Deep residual learning for compressed sensing CT reconstruction via persistent homology analysis. arXiv preprint arXiv:1611.06391 (2016)
- [7] Hardy, G.H.: On double Fourier series and especially those which represent the double zeta-function with real and incommensurable parameters. Quart. J. Math 37(5) (1906)
- [8] Kawaguchi, K., Bengio, Y.: Generalization in machine learning via analytical learning theory. arXiv preprint arXiv:1802.07426 (2018)
- [9] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. ICLR (2015)
- [10] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. ICLR (2013)
- [11] Lucas, A., Iliadis, M., Molina, R., Katsaggelos, A.K.: Using deep neural networks for inverse problems in imaging: beyond analytical methods. IEEE Signal Processing Magazine 35(1), 20–36 (2018)
- [12] Luchies, A.C., Byram, B.C.: Deep neural networks for ultrasound beamforming. IEEE Transactions on Medical Imaging 37(9), 2010–2021 (2018)
- [13] Mao, X., Shen, C., Yang, Y.B.: Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In: Advances in Neural Information Processing Systems. pp. 2802–2810 (2016)
- [14] Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2536–2544 (2016)
- [15] Plonsey, R.: Bioelectric phenomena (1969)
- [16] Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems. pp. 3104–3112 (2014)
- [17] Tishby, N., Pereira, F.C., Bialek, W.: The information bottleneck method. arXiv preprint physics/0004057 (2000)
- [18] Wang, L., Zhang, H., Wong, K.C., Liu, H., Shi, P.: Physiological-model-constrained noninvasive reconstruction of volumetric myocardial transmembrane potentials. IEEE Transactions on Biomedical Engineering 57(2), 296–315 (2010)
- [19] Zhu, B., Liu, J.Z., Cauley, S.F., Rosen, B.R., Rosen, M.S.: Image reconstruction by domain-transform manifold learning. Nature 555(7697), 487 (2018)