In inverse problems the goal is to determine model parameters from indirect noisy observations. Examples of such problems arise in many different fields in science and engineering, e.g., in X-ray CT natterer2001mathematical , electron tomography oktem2015mathematics , and magnetic resonance imaging (MRI) brown2014magnetic . Machine learning has recently also been applied in this area, especially in imaging applications. Using supervised machine learning to solve inverse problems in imaging requires training data in which ground truth images are paired with corresponding noisy indirect observations. The learning provides a mapping that associates observations with corresponding images. However, in several applications it is difficult to obtain the ground truth, e.g., because it has undergone a distortion. For example, a recent study showed that MRI images may be distorted by up to 4 mm due to, e.g., inhomogeneities in the main magnetic field Walker2014 . If such images are used for training, the learned MRI reconstruction will suffer in quality. Similar geometric inaccuracies arise in several other imaging modalities, such as cone beam CT and full waveform inversion in seismic imaging.
This work provides a method for learning a reconstruction operator for an ill-posed inverse problem with a Wasserstein loss, leveraging recent advances in efficient solution of optimal transport problems cuturi2013sinkhorn ; karlsson2016generalized and in learned iterative schemes for inverse problems adler2017learned . The proposed method is demonstrated on a computed tomography example, where we show a significant improvement compared to training the same network with a mean squared error loss. In particular, using the Wasserstein loss instead of the standard mean squared error gives a result that is more robust against potential misalignment in the training data.
2.1 Inverse problems
In inverse problems the goal is to reconstruct an estimate of the signal $f_{\mathrm{true}} \in X$ from noisy indirect measurements (data) $g \in Y$, assuming the relation
$$g = \mathcal{A}(f_{\mathrm{true}}) + \delta g. \qquad (1)$$
In the above, $X$ and $Y$ are referred to as the reconstruction space and the data space, respectively. Both are typically Hilbert or Banach spaces. Moreover, $\mathcal{A} : X \to Y$ denotes the forward operator, which models how a given signal gives rise to data in the absence of noise. Finally, $\delta g \in Y$ is the noise component of the data. Many inverse problems of interest are ill-posed, meaning that there is no unique solution to (1) and hence no inverse to $\mathcal{A}$. Typically, reconstructions of $f_{\mathrm{true}}$ are also sensitive to the data, and small measurement errors get amplified. One way to mitigate these effects is to use regularization engl2000regularization .
In variational regularization one formulates the reconstruction problem as an optimization problem. To this end, one introduces a data discrepancy functional $\mathcal{D}(\mathcal{A}(f), g)$ that quantifies the misfit in data space, and a regularization functional $\mathcal{R}(f)$ that encodes a priori information about $f_{\mathrm{true}}$ by penalizing undesirable solutions. For a given $g \in Y$, this gives an optimization problem of the form
$$\min_{f \in X} \; \mathcal{D}(\mathcal{A}(f), g) + \lambda \mathcal{R}(f). \qquad (2)$$
Here, $\lambda$ acts as a trade-off parameter between the data discrepancy and the regularization functional. In many cases $\mathcal{D}$ is taken to be the negative data log-likelihood, e.g., the squared $L^2$-distance in the case of additive white Gaussian noise. Moreover, a typical choice of regularization functional is total variation (TV) regularization rudin1992nonlinear . Such regularizers typically give rise to large-scale and non-differentiable optimization problems, which require advanced optimization algorithms.
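As a minimal illustration of variational regularization, consider 1-D denoising (identity forward operator) with a smoothed TV penalty, solved by plain gradient descent. All parameters (smoothing constant, step size, regularization weight) are illustrative choices, not the paper's settings, and the smoothing of the TV term is one common way to sidestep its non-differentiability:

```python
import numpy as np

def tv_denoise(g, lam=0.3, eps=1e-2, step=0.05, iters=1000):
    """Minimize ||f - g||^2 + lam * sum_i sqrt((f_{i+1}-f_i)^2 + eps)."""
    f = g.copy()
    for _ in range(iters):
        df = np.diff(f)
        w = df / np.sqrt(df**2 + eps)
        # gradient of the smoothed TV term, with zero boundary terms
        tv_grad = np.concatenate([[-w[0]], -np.diff(w), [w[-1]]])
        f -= step * (2.0 * (f - g) + lam * tv_grad)
    return f

rng = np.random.default_rng(0)
truth = np.concatenate([np.zeros(50), np.ones(50)])   # piecewise-constant signal
noisy = truth + 0.2 * rng.standard_normal(100)
rec = tv_denoise(noisy)
```

TV favors piecewise-constant solutions, so a step function is recovered much better than with a quadratic smoothness penalty.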
Learning for inverse problems
In many applications, and so also in inverse problems, data-driven approaches have shown dramatic improvements over the state-of-the-art LeBeHi15 . Using supervised learning to solve an inverse problem amounts to finding a parametrized operator $\mathcal{A}^{\dagger}_{\theta} : Y \to X$, where the parameters $\theta$ are selected so that
$$\mathcal{A}^{\dagger}_{\theta}(g) \approx f_{\mathrm{true}}.$$
For inverse problems in image processing, such as denoising and deblurring, we have $X = Y$, and it is possible to apply a wide range of well-studied machine learning techniques, such as fully convolutional deep neural networks NIPS2008_3506 and denoising auto-encoders NIPS2012_4686 .
However, in more complicated inverse problems such as tomography, the data and reconstruction spaces are very different, e.g., their dimensions after discretization may differ. For this reason, learning a mapping from $Y$ to $X$ becomes nontrivial, and classical architectures that map, e.g., images to images using convolutional networks cannot be applied as-is. One solution is to use fully-connected layers, as in PaGiKaLoMa04 for very small-scale tomographic reconstruction problems. A major disadvantage of such a fully learned approach is that the parameter space has to be very high-dimensional in order to learn both the prior and the data model, which often renders it infeasible due to training time and lack of training data.
A more successful approach is to first apply some crude reconstruction operator $\mathcal{A}^{\dagger} : Y \to X$ and then use machine learning to post-process the result. This separates the learning from the complications of mapping between spaces, since the operator $\mathcal{A}^{\dagger}$ can be applied off-line, prior to training. Such an approach has been demonstrated for tomographic reconstruction in PeBa13 ; WuGhChMa16 . Its drawback for ill-posed inverse problems is that information is typically lost by applying $\mathcal{A}^{\dagger}$, and this information cannot be recovered by post-processing.
Finally, another approach is to incorporate the forward operator and its adjoint into the neural network. In these learned iterative schemes, classical neural networks are interlaced with applications of the forward and backward operator, thus allowing the learned reconstruction operator to work directly from data without having to learn the data model. For example, in YaSuLiXu16 an ADMM-like scheme for Fourier inversion is learned, and PuWe17 consider solving inverse problems typically arising in image restoration by a learned gradient-descent scheme. In adler2017solving this latter approach is shown to be applicable to large-scale tomographic inversion. Finally, adler2017learned applies learning in both the spaces $X$ and $Y$, yielding a Learned Primal-Dual scheme, and shows that it outperforms learned post-processing for reconstruction of medical CT images.
Loss functions for learning
Once the parametrization of $\mathcal{A}^{\dagger}_{\theta}$ is set, the parameters $\theta$ are typically chosen by minimizing some loss functional $L(\theta)$. Without doubt, the most common loss function is the mean squared error, also called the $L^2$ loss, given by
$$L(\theta) = \mathbb{E}\left[ \big\| \mathcal{A}^{\dagger}_{\theta}(g) - f_{\mathrm{true}} \big\|_2^2 \right]. \qquad (3)$$
It has, however, been noted that this loss is sub-optimal for imaging, and a range of other loss functions have been investigated. These include the classical $L^p$ norms and the structural similarity index (SSIM) Zhao2017 , as well as more complex losses such as perceptual losses Johnson2016 and adversarial networks DBLP:journals/corr/MardaniGCVZATHD17 .
2.2 Optimal mass transport and Sinkhorn iterations
In optimal mass transport the aim is to transform one distribution into another by moving the mass in a way that minimizes the cost of the movement. For an introduction and overview of the topic, see, e.g., the monograph villani2008optimal . Lately, the area has attracted a lot of research cuturi2013sinkhorn ; cuturi2015asmoothed ; chizat2015unbalanced with applications to, e.g., signal processing haker2004optimal ; georgiou2009metrics ; jiang2012geometric ; engquist2014application and inverse problems benamou2015iterative ; karlsson2016generalized .
The optimal mass transport problem can be formulated as follows: let $\Omega$ be a compact set, and let $\mu_0$ and $\mu_1$ be two measures, defined on $\Omega$, with the same total mass. Given a cost $c : \Omega \times \Omega \to \mathbb{R}_+$ that describes the cost of transporting a unit mass from one point to another, find a (mass-preserving) transference plan that is as cheap as possible. Here, the transference plan characterizes how to move the mass of $\mu_0$ in order to deform it into $\mu_1$. Letting the transference plan be a nonnegative measure $M$ on the space $\Omega \times \Omega$ yields a linear programming problem in the space of measures:
$$T(\mu_0, \mu_1) := \min_{M \ge 0} \int_{\Omega \times \Omega} c(x_0, x_1)\, dM(x_0, x_1) \quad \text{subject to} \quad M(\,\cdot\,, \Omega) = \mu_0, \;\; M(\Omega, \,\cdot\,) = \mu_1. \qquad (4)$$
Although this formulation is only defined for measures $\mu_0$ and $\mu_1$ with the same total mass, it can be extended to handle measures with unbalanced masses georgiou2009metrics ; chizat2015unbalanced . Moreover, under suitable conditions one can define the Wasserstein metrics $W_p$ using $T$, by taking $c(x_0, x_1) = d(x_0, x_1)^p$ for $p \ge 1$, where $d$ is a metric on $\Omega$, and $W_p(\mu_0, \mu_1) := T(\mu_0, \mu_1)^{1/p}$ (villani2008optimal, Definition 6.1). As the name indicates, $W_p$ is a metric on the set of nonnegative measures on $\Omega$ with fixed mass (villani2008optimal, Theorem 6.9), and is weak* continuous on this set. One important property is that $T$ (and thus also $W_p$) does not only compare objects point by point, as standard metrics do, but instead quantifies how the mass is moved. This makes optimal transport natural for quantifying uncertainty and modelling deformations jiang2012geometric ; karlsson2013uncertainty .
One way to solve the optimal transport problem in applications is to discretize $\Omega$ and solve the corresponding finite-dimensional linear programming problem. In this setting the two measures are represented by point masses on the discretization grid, i.e., by two vectors $\mu_0, \mu_1 \in \mathbb{R}^n_+$, where the element $\mu_\ell(i)$ corresponds to the mass in the point $x_i$ for $i = 1, \ldots, n$ and $\ell = 0, 1$. Moreover, a transference plan is represented by a matrix $M \in \mathbb{R}^{n \times n}_+$, where the value $M(i,j)$ denotes the amount of mass transported from point $x_i$ to $x_j$. The associated cost of a transference plan is $\operatorname{trace}(C^T M) = \sum_{i,j} C(i,j) M(i,j)$, where $C(i,j)$ is the transportation cost from $x_i$ to $x_j$, and by discretizing the constraints we get that $M$ is a feasible transference plan from $\mu_0$ to $\mu_1$ if the row sums of $M$ equal $\mu_0$ and the column sums of $M$ equal $\mu_1$. The discrete version of (4) thus takes the form
$$T(\mu_0, \mu_1) = \min_{M \ge 0} \operatorname{trace}(C^T M) \quad \text{subject to} \quad M \mathbf{1} = \mu_0, \;\; M^T \mathbf{1} = \mu_1, \qquad (5)$$
where $M \ge 0$ denotes element-wise non-negativity of the matrix. However, even though (5) is a linear programming problem, it is in many cases computationally infeasible due to the vast number of variables: since the plan has $n^2$ entries, solving the optimal transport problem between two images of even moderate resolution quickly results in billions of variables.
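For tiny instances, the linear program (5) can be solved directly with an off-the-shelf LP solver. A sketch with SciPy on a 5-point grid (the grid, marginals, and quadratic cost are illustrative choices):

```python
import numpy as np
from scipy.optimize import linprog

# Discrete OT as the linear program (5): minimize <C, M> over nonnegative
# plans M whose row sums equal mu0 and column sums equal mu1.
n = 5
x = np.arange(n)
mu0 = np.array([0.5, 0.5, 0.0, 0.0, 0.0])
mu1 = np.array([0.0, 0.0, 0.0, 0.5, 0.5])
C = (x[:, None] - x[None, :]) ** 2.0      # C[i, j] = c(x_i, x_j) = (x_i - x_j)^2

# Build the 2n equality constraints on the vectorized plan M.ravel().
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0      # row sums -> mu0
    A_eq[n + i, i::n] = 1.0               # column sums -> mu1
b_eq = np.concatenate([mu0, mu1])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
M = res.x.reshape(n, n)
cost = res.fun   # optimal transport cost; here the monotone pairing 0->3, 1->4
```

For this instance the optimal plan moves the mass at 0 to 3 and the mass at 1 to 4, with cost 0.5·9 + 0.5·9 = 9, illustrating why the cost matrix alone (without the LP constraints) says nothing about the plan.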
One approach for addressing this problem was proposed by Cuturi cuturi2013sinkhorn : introduce an entropic regularization term for approximating the transference plan, so that the resulting perturbed optimal transport problem reads
$$T_{\varepsilon}(\mu_0, \mu_1) = \min_{M \ge 0} \operatorname{trace}(C^T M) + \varepsilon \sum_{i,j} M(i,j) \log M(i,j) \quad \text{subject to} \quad M \mathbf{1} = \mu_0, \;\; M^T \mathbf{1} = \mu_1. \qquad (6)$$
One can show that an optimal solution to (6) is of the form
$$M = \operatorname{diag}(u)\, K \operatorname{diag}(v),$$
where $K = \exp(-C/\varepsilon)$ (point-wise exponential) is known, and $u, v \in \mathbb{R}^n_+$ are unknown. This shows that the solution is parameterized by only $2n$ variables. Moreover, the two vectors can be computed iteratively by so-called Sinkhorn iterations, i.e., by alternately updating $u$ and $v$ so that the marginals of $M$ match $\mu_0$ and $\mu_1$, respectively. This is summarized in Algorithm 1, where $\odot$ denotes elementwise multiplication and $\oslash$ elementwise division. The procedure has been shown to have a linear convergence rate; see cuturi2013sinkhorn and the references therein.
Moreover, when the underlying cost is translation invariant, the discretized cost matrix $C$, and thus also the transformation $K$, has a Toeplitz-block-Toeplitz structure. This structure can be used to compute $Kv$ and $K^T u$ efficiently using the fast Fourier transform in $O(n \log n)$ operations, instead of naive matrix-vector multiplication in $O(n^2)$ karlsson2016generalized . This is crucial for applications in imaging, since for images with $n$ pixels one would otherwise have to explicitly store and multiply with dense $n \times n$ matrices.
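The Toeplitz trick in one dimension: when $K(i,j) = k(i-j)$, the product $Kv$ is a linear convolution, which can be evaluated via zero-padded FFTs without ever forming $K$. The kernel below (a Gaussian in the grid distance) is an illustrative stand-in for the paper's actual $\exp(-C/\varepsilon)$; in 2-D the same idea is applied along both axes of the Toeplitz-block-Toeplitz structure:

```python
import numpy as np

n = 256
eps = 0.1
grid = np.arange(n) / n

# Dense Toeplitz matrix K[i, j] = k((i - j)/n) for reference only.
K = np.exp(-((grid[:, None] - grid[None, :]) ** 2) / eps)

def apply_K_fft(v):
    """Compute K @ v in O(n log n) via a zero-padded linear convolution."""
    t = np.arange(-(n - 1), n) / n                 # all offsets i - j
    kernel = np.exp(-(t ** 2) / eps)               # k evaluated at the offsets
    m = 2 * n - 1                                  # length that avoids wrap-around
    conv = np.fft.irfft(np.fft.rfft(kernel, m) * np.fft.rfft(v, m), m)
    return conv[n - 1:2 * n - 1]                   # entries corresponding to (K v)[0..n-1]

v = np.random.default_rng(0).random(n)
dense = K @ v          # O(n^2) reference
fast = apply_K_fft(v)  # O(n log n), agrees with the dense product
```

Since the kernel here is symmetric, the same routine also applies $K^T$.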
3 Learning a reconstruction operator using Wasserstein loss
In this work we propose to use entropy-regularized optimal transport (6) to train a reconstruction operator, i.e., to select the parameters as
$$\theta \in \arg\min_{\theta} \; \mathbb{E}\left[ T_{\varepsilon}\big(\mathcal{A}^{\dagger}_{\theta}(g), f_{\mathrm{true}}\big) \right]. \qquad (8)$$
This should give better results when data is not aligned with the ground truth $f_{\mathrm{true}}$. To see this, consider the case when $f_{\mathrm{true}}$ is a point mass. In that case, training the network with the loss (3) will (in the ideal case) result in a perfect reconstruction composed with a convolution that "smears" the reconstruction over the area of possible misalignment. On the other hand, since optimal mass transport does not only compare objects point-wise, the network will (in the ideal case) learn a perfect reconstruction combined with a movement of the object to the corresponding barycenter (centroid) of the misalignment. These statements are made more precise in the following propositions. Formal definitions and proofs are deferred to the appendix.
Let be the Dirac delta function on , let be a -valued random variable with probability measure , and let . Then
and otherwise. Furthermore, finding a that minimizes is equivalent to finding the global minimizers to . In particular, if (i) the probability measure is symmetric around its mean, (ii) the underlying cost is of the form , where is convex and symmetric, and (iii) and are such that
both exist and are finite for all , then is an optimal solution. Furthermore, if is also strictly convex, then this is the unique minimizer.
Let be uniformly distributed on , and let . This gives
which has minimum , and hence the (unique) minimizer to is . For the case with the uniform distribution, the minimizer of is the smoothed function .
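The dichotomy in Propositions 1–2 can be checked numerically in one dimension. Below, the ground truth is a point mass whose position is shifted uniformly over $\{-2, \ldots, 2\}$; we compare the centered point mass (the Wasserstein barycenter) with the uniform smear (the expected image, which the squared $L^2$ loss favors). Because one marginal is a point mass, the transport cost reduces to the cost-weighted average distance to that point, as in the proof of Proposition 1. The grid and shift range are illustrative choices:

```python
import numpy as np

shifts = np.arange(-2, 3)
n = len(shifts)

def delta(k):
    """Point mass at position k on the grid {-2, ..., 2}."""
    e = np.zeros(n); e[k + 2] = 1.0
    return e

smear = np.mean([delta(k) for k in shifts], axis=0)   # MSE-optimal "reconstruction"
centered = delta(0)                                   # Wasserstein-optimal one

def mse(f, k):
    return np.sum((f - delta(k)) ** 2)

def w2sq(f, k):
    """Transport cost from f to the point mass at k, squared-distance cost.

    All mass must move to the single point k, so the cost is just the
    weighted average of (x - k)^2 under f."""
    return np.sum(f * (shifts - k) ** 2)

mse_centered = np.mean([mse(centered, k) for k in shifts])  # = 1.6
mse_smear = np.mean([mse(smear, k) for k in shifts])        # = 0.8 -> MSE prefers the smear
w_centered = np.mean([w2sq(centered, k) for k in shifts])   # = 2.0 -> OT prefers the point mass
w_smear = np.mean([w2sq(smear, k) for k in shifts])         # = 4.0
```

So the expected squared $L^2$ loss is minimized by smearing, while the expected transport cost is minimized by placing the mass at the barycenter of the shifts, exactly the behavior described above.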
The most common choice of distance is the squared norm $\|\cdot\|_2^2$, as in the previous example. In this case the result of Proposition 2 can be strengthened, as shown in the following example.
Let be a -valued random variable with probability measure
with finite first and second moments, and let. This gives
which has a unique global minimum in and hence .
4 Implementation and evaluation
We use the recently proposed Learned Primal-Dual structure adler2017learned for learning a reconstruction operator solving the inverse problem in (1). In this algorithm, a sequence of small blocks work alternately in the data (dual) space and the reconstruction (primal) space, connected by the forward operator $\mathcal{A}$ and its adjoint $\mathcal{A}^*$. The algorithm works with any differentiable operator $\mathcal{A}$, but for simplicity we state the version for linear operators in Algorithm 2.
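The control flow of such a scheme can be sketched as follows. This is only a structural sketch: the learned blocks are replaced by fixed hand-written maps (here a plain residual and gradient step, making the scheme Landweber-like), whereas in the actual method each block is a small residual CNN trained end-to-end, and the iterates carry several channels. All names, sizes, and the toy linear operator are illustrative:

```python
import numpy as np

def learned_primal_dual(g, A, n_iter, dual_block, primal_block, n_x):
    """Alternate updates in data (dual) and reconstruction (primal) space,
    connected by the forward operator A and its adjoint A.T."""
    f = np.zeros(n_x)        # primal iterate (reconstruction)
    h = np.zeros(len(g))     # dual iterate (data space)
    for _ in range(n_iter):
        h = dual_block(h, A @ f, g)      # update using the current forward projection
        f = primal_block(f, A.T @ h)     # update using the back-projected dual variable
    return f

# Toy example: random linear "measurement" operator and noise-free data.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 20)) / 10
f_true = rng.standard_normal(20)
g = A @ f_true
step = 0.5
f_rec = learned_primal_dual(
    g, A, n_iter=1000,
    dual_block=lambda h, Af, g: Af - g,          # residual in data space
    primal_block=lambda f, ATh: f - step * ATh,  # gradient step in image space
    n_x=20,
)
data_residual = np.linalg.norm(A @ f_rec - g)
```

With these particular blocks the scheme reduces to gradient descent on the data fit; the point of the learned version is precisely that the blocks are free to implement something better.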
The method was implemented using ODL and TensorFlow abadi2016tensorflow . We used the reference implementation111https://github.com/adler-j/learned_primal_dual with default parameters, i.e., the number of blocks in the primal and dual space was , and the number of primal and dual variables was set to . Moreover, the blocks used a residual structure and had three layers of convolutions with filters. PReLU nonlinearities were used. Thus, this corresponds to a residual CNN with a convolutional depth of , as shown in graphical format in fig. 1. We used zero initial values.
We compare a learned reconstruction operator of this form trained using the loss (3) with one trained using the optimal transport loss (8). Moreover, the evaluation is done on a problem similar to the evaluation problem in adler2017learned ; adler2017solving , i.e., a problem in computed tomography. More specifically, training is done on data consisting of randomly generated circles on a 512×512 domain, and the forward operator is the ray transform natterer2001mathematical . What makes this an ill-posed problem is that the data acquisition is done from only 30 views with 727 parallel lines. Moreover, the source of noise is two-fold in this set-up: (i) the pairs of data and phantoms are not aligned, meaning that the data is computed from a phantom with a random change in position. This random change is independent for the different circles, and for each circle it is a shift, uniformly distributed over pixels, in both the up-down and left-right directions. (ii) On the data computed from the shifted phantom, 5% additive Gaussian noise was added. For an example, see fig. 2.
[Figure 2: a phantom, the shifted phantom, and the corresponding data with additive white noise on top. The pair of noisy data and original phantom, (1(c), 1(a)), is what is used in the training.]
The optimal mass transport distance (6), computed with Sinkhorn iterations, was used as loss function, where we used the transport cost
This cost was chosen since it heavily penalizes large movements while not diverging to infinity, which avoids numerical instabilities. Moreover, it is in fact a metric (see Lemma 6 in the appendix) and thus gives rise to a Wasserstein metric on the space of images. Since this cost is translation invariant, the matrix-vector multiplications $Kv$ and $K^T u$ can be done with the fast Fourier transform, as described in Section 2.2, and this was implemented in TensorFlow. We used 10 Sinkhorn iterations, with entropy regularization , to approximate the optimal mass transport. Automatic differentiation was used to back-propagate the result during training.
Since the optimal mass transport functional (6) is only finite for marginals $\mu_0$ and $\mu_1$ with the same total mass, during training we normalize the output of the reconstruction operator. This makes the loss invariant with respect to the total mass, which is undesirable. To compensate for this, a small penalization on the error in total mass was added to the loss function.
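The combined loss can be sketched as follows: the Sinkhorn approximation of (6) evaluated on mass-normalized marginals, plus a quadratic penalty on the total-mass error. The entropy regularization, number of iterations, penalty weight `rho`, and the normalization to unit mass are illustrative stand-ins for the trained setup:

```python
import numpy as np

def sinkhorn_cost(mu0, mu1, C, eps=0.3, iters=500):
    """Entropy-regularized transport cost <C, M> via Sinkhorn iterations."""
    K = np.exp(-C / eps)
    v = np.ones_like(mu1)
    for _ in range(iters):
        u = mu0 / (K @ v)
        v = mu1 / (K.T @ u)
    M = u[:, None] * K * v[None, :]
    return np.sum(C * M)

def training_loss(f_out, f_true, C, rho=1.0):
    """Transport cost on normalized marginals plus a total-mass penalty."""
    m_out, m_true = f_out.sum(), f_true.sum()
    transport = sinkhorn_cost(f_out / m_out, f_true / m_true, C)
    return transport + rho * (m_out - m_true) ** 2

# Demo on a 1-D grid: a perfect output versus a shifted, wrongly-scaled one.
x = np.linspace(0, 1, 30)
C = (x[:, None] - x[None, :]) ** 2
f_true = np.exp(-((x - 0.5) ** 2) / 0.01)
f_shifted = 1.2 * np.exp(-((x - 0.6) ** 2) / 0.01)
loss_good = training_loss(f_true, f_true, C)
loss_bad = training_loss(f_shifted, f_true, C)
```

In the actual training the cost would be differentiated through by automatic differentiation, and $Kv$, $K^T u$ applied via FFT rather than dense products.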
The training also followed adler2017learned closely. In particular, we used batches of size , using the ADAM optimizer kingma2014adam with default values except for . The learning rate (step length) schedule was cosine annealing loshchilov2016sgdr with initial step length . Moreover, in order to improve training stability we performed gradient norm clipping pascanu2012understanding with norms limited to 1. The convolution parameters were initialized using Xavier initialization glorot2010understanding , and all biases were initialized to zero. The training took approximately 3 hours using a single Titan X GPU. The source code used to replicate these experiments is available online 222https://github.com/adler-j/wasserstein_inverse_problems
Results are presented in fig. 3. As can be seen, the reconstruction trained with the squared-error loss "smears" the reconstruction to an extent where the shapes are impossible to recover. On the other hand, the reconstruction using the Wasserstein loss retains the overall global shape of the object, although the relative and exact positions of the circles are not recovered.
5 Conclusions and future work
In this work we have considered using Wasserstein loss to train a neural network for solving ill-posed inverse problems in imaging where data is not aligned with the ground truth. We give a theoretical motivation for why this should give better results compared to standard mean squared error loss, and demonstrate it on a problem in computed tomography. In the future, we hope that this method can be applied to other inverse problems and to other problems in imaging such as segmentation.
Appendix: Deferred definition and proofs
Proof of Proposition 1.
To show that minimizes we expand the expression and use Fubini’s theorem to get
Rearranging terms and using that , this can be written as
where is a constant. Using this it follows that the minimizing is of the form
To see that we note that, by using Fubini’s theorem, we have
where the first inequality is the arithmetic-geometric mean inequality. This completes the proof. ∎
Let . A subgradient to in a point is a vector so that
The set of all subgradients in a point is called the subdifferential of at , and is denoted by . This is a set-valued operator, and for any measure on we define to be the set-valued operator
Proof of Proposition 2.
We consider finding the marginal that minimizes . Without loss of generality we may assume that is zero-mean, since otherwise we simply consider , which is a zero-mean random variable. First, we note that is only finite for nonnegative measures with total mass , and hence is only finite for such measures. Second, for such a we have
since one needs to transport all mass in into the point where has its mass. Using this and expanding the expression for the expectation gives that
where we have used Fubini’s theorem in the last step. This completes the first half of the statement.
To prove the second half of the statement, note that the optimal has support only in the global minima of the function
which by assumption exists and is finite. Now, since is convex we have that
and convolving this inequality with gives the inequality
where all terms exist and are bounded by assumption. This shows that
Now, since is symmetric we have that is anti-symmetric, i.e., that , since
where the last inclusion follows since is symmetric and is anti-symmetric. Now, since we have that is a global minimizer to (bauschke2011convex, , Theorem 16.2), and thus one optimal solution to the problem is . Now, if is strictly convex, the inequality (9) is strict for , and thus (10) is also strict, which shows that the optimal solution is unique. ∎
Let be a norm on . Then
is a metric on for .
It is easily seen that is symmetric, nonnegative, and equal to zero if and only if . Thus we only need to verify that the triangle inequality holds. To this end we note that if
for all , then by taking , , and using the triangle inequality for the norm we have that
Therefore we will show that (11) holds for all , and to do so we will
(i) show that if a function fulfills , , for all , then ,
(ii) show that for the map fulfills the assumptions in (i) for any .
To show (i) we note that
where the inequality uses that for any since for all .
To show (ii), let and observe that . Differentiating twice gives
For we see that for all . Moreover, for we see that for all and for all if and only if . With the change of variable , we thus want to show that for all and all . To see this we note that and that
This shows (ii), and hence completes the proof. ∎
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
-  J. Adler, H. Kohr, and O. Öktem. ODL - a Python framework for rapid prototyping in inverse problems. 2017. In preparation, KTH Royal Institute of Technology. Code and documentation available online: https://github.com/odlgroup/odl.
-  J. Adler and O. Öktem. Learned primal-dual reconstruction. arXiv preprint arXiv:1707.06474, 2017.
-  J. Adler and O. Öktem. Solving ill-posed inverse problems using iterative deep neural networks. Inverse Problems, 2017.
-  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
-  H.H. Bauschke and P.L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. Springer, New York, 2011.
-  J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
-  R.W. Brown, Y.-C. N. Cheng, E.M. Haacke, M.R. Thompson, and R. Venkatesan. Magnetic Resonance Imaging: Physical Principles and Sequence Design. John Wiley & Sons Ltd, 2014.
-  L. Chizat, G. Peyré, B. Schmitzer, and F.-X. Vialard. Unbalanced optimal transport: geometry and Kantorovich formulation. arXiv preprint arXiv:1508.05216, 2015.
-  M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
-  M. Cuturi and G. Peyré. A smoothed dual approach for variational Wasserstein problems. SIAM Journal on Imaging Sciences, pages 320–343, 2016.
-  H.W. Engl, M. Hanke, and A. Neubauer. Regularization of inverse problems. Kluwer Academic Publisher, 2000.
-  B. Engquist and B.D. Froese. Application of the Wasserstein metric to seismic signals. Communications in Mathematical Sciences, 12(5), 2014.
-  C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T.A. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.
-  T.T. Georgiou, J. Karlsson, and M.S. Takyar. Metrics for power spectra: an axiomatic approach. IEEE Transactions on Signal Processing, 57(3):859–867, 2009.
-  X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
-  S. Haker, L. Zhu, A. Tannenbaum, and S. Angenent. Optimal mass transport for registration and warping. International Journal of Computer Vision, 60(3):225–240, 2004.
-  V. Jain and S. Seung. Natural image denoising with convolutional networks. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 769–776. Curran Associates, Inc., 2009.
-  X. Jiang, Z.-Q. Luo, and T.T. Georgiou. Geometric methods for spectral analysis. IEEE Transactions on Signal Processing, 60(3):1064–1074, 2012.
-  J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision – ECCV 2016, pages 694–711. Springer International Publishing, Cham, 2016.
-  J. Karlsson and T.T. Georgiou. Uncertainty bounds for spectral estimation. IEEE Transactions on Automatic Control, 58(7):1659–1673, 2013.
-  J. Karlsson and A. Ringh. Generalized Sinkhorn iterations for regularizing inverse problems using optimal mass transport. arXiv preprint arXiv:1612.02273, 2016.
-  D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, 2015.
-  I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
-  M. Mardani, E. Gong, J.Y. Cheng, S. Vasanawala, G. Zaharchuk, M.T. Alley, N. Thakur, S. Han, W.J. Dally, J.M. Pauly, and L. Xing. Deep generative adversarial networks for compressed sensing automates MRI. CoRR, abs/1706.00051, 2017.
-  F. Natterer and F. Wübbeling. Mathematical methods in image reconstruction. SIAM, 2001.
-  O. Öktem. Mathematics of electron tomography. In O. Scherzer, editor, Handbook of Mathematical Methods in Imaging, pages 937–1031. Springer, New York, NY, 2015.
-  R. Pascanu, T. Mikolov, and Y. Bengio. Understanding the exploding gradient problem. CoRR, abs/1211.5063, 2012.
-  P. Paschalis, N. D. Giokaris, A. Karabarbounis, G. K. Loudos, D. Maintas, C. N. Papanicolas, V. Spanoudaki, Ch. Tsoumpas, and E. Stiliaris. Tomographic image reconstruction using artificial neural networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 527(1–2):211–215, 2004.
-  D. M. Pelt and K. J. Batenburg. Fast tomographic reconstruction from limited data using artificial neural networks. IEEE Transactions on Image Processing, 22(12):5238–5251, 2013.
-  P. Putzky and M. Welling. Recurrent inference machines for solving inverse problems. Submitted to ICLR 2017, Toulon, France, April 24–26, 2017. Report available from https://openreview.net/pdf?id=HkSOlP9lg, 2017.
-  L.I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1):259–268, 1992.
-  W. van Aarle, W.J. Palenstijn, J. Cant, E. Janssens, F. Bleichrodt, A. Dabravolski, J. De Beenhouwer, K.J. Batenburg, and J. Sijbers. Fast and flexible X-ray tomography using the ASTRA toolbox. Optics express, 24(22):25129–25147, 2016.
-  C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
-  A. Walker, G. Liney, P. Metcalfe, and L. Holloway. MRI distortion: considerations for MRI based radiotherapy treatment planning. Australasian Physical & Engineering Sciences in Medicine, 37(1):103–113, Mar 2014.
-  T. Würfl, F. C. Ghesu, V. Christlein, and A. Maier. Deep learning computed tomography. In S. Ourselin, L. Joskowicz, M. Sabuncu, G. Unal, and W. Wells, editors, MICCAI 2016: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016, volume 9902 of Lecture Notes in Computer Science, pages 432–440. Springer-Verlag, 2016.
-  J. Xie, L. Xu, and E. Chen. Image denoising and inpainting with deep neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 341–349. Curran Associates, Inc., 2012.
-  Y. Yang, J. Sun, H. Li, and Z. Xu. Deep ADMM-Net for compressive sensing MRI. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29, pages 10–18. Curran Associates, 2016. Report available from http://papers.nips.cc/paper/6406-deep-admm-net-for-compressive-sensing-mri.pdf.
-  H. Zhao, O. Gallo, I. Frosio, and J. Kautz. Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging, 3(1):47–57, March 2017.