1 Introduction
Deep image priors (DIP) have been recently introduced as a machine learning approach for some tasks in image processing
ulyanov2017dip . Usually, such machine learning approaches utilize large sets of training data, hence, it was somewhat surprising that deep image priors are based on a single data point The task of DIP is to train a network with parametersby minimizing the simple loss function
(1) 
The minimization is with respect to the input is kept fixed. After training, the solution of the problem is approximated by In image processing typical choices for are the identity operator (denoising) or a projection operator to a subset of the image domain (inpainting). For these applications it has been observed, that minimizing the functional iteratively by gradient descent methods in combination with a suitable stopping criterion leads to amazing results ulyanov2017dip .
This approach is remarkable for several reasons. First of all, the parametrization of a neural network is typically determined during a training phase, afterwards it is fixed and only the input varies in applications. This is based on the assumption, that the set of suitable solutions
to any of the problems mentioned above obeys some probability distribution and that the neural network approach is capable of reproducing this probability distribution by varying the inputs
of a trained network bora2017compressed . The DIP approach, however, uses a fixed input and aims at scanning suitable solutions by varying the parametrization whileis kept fixed. Secondly, as said above DIP approaches do not use large sets of supervised training data, but rely on a single unsupervised data point. Thirdly, one might assume, that this is only possible, if the network architecture is fine tuned to the specific task. This is partially true, nevertheless the presented numerical results perform well with somewhat generic network architectures such as autoencoder networks or even general convolutional feedforward networks.
In this paper we are interested in the analysis of DIP approaches and in particular in proving some convergence properties for iteratively minimizing (1). We will do so in the context of inverse problems, which are modeled by a nonlinear or linear operator between Hilbert spaces and . Contrary to the applications in image processing mentioned above, we assume, that the range of is not closed, which implies, that the inversion or any generalized type of inversion is illposed louis ; engl ; rieder
. Typical examples are compact linear operators or parametertostate mappings for partial differential equations.
Naive applications of neural networks fail for even the most simple inverse problems, see maass2018trivial , but there is a growing number of very convincing numerical experiments using suitable network designs for some of the toughest inverse problems such as photoacoustic tomography hauptmann2018model or Xray tomography with very few measurements jin2017deep ; adler2018learned . Concerning networks based on deep prior approaches for inverse problems, first experimental investigation have been reported, see ulyanov2017dip ; van2018compressed . Similar as for the above mentioned tasks in image processing, DIP for inverse problems is based on two ingredients:

A suitable network design, which leads to our phrase “regularization by architecture”.

Training algorithms for iteratively minimizing Expression (1) with respect to in combination with a suitable stopping criterion.
In this paper we present different mathematical interpretations of DIP approaches (Section 2) and we analyze three network designs in the context of inverse problems in more detail.
Our paper is organized as follows. In the next section we discuss some interpretations of DIP approaches and we add a first mathematical result for a trivial network design, which yields a connection to Landweber iterations, which is known to converge for general inverse problems with arbitrary . Readers solely interested in the mathematical part of the paper may jump directly to Section 3, where we consider a fully connected feed forward network with identical layers. This leads to the notion of analytic deep prior networks, where one can strictly analyze the convergence properties. The key to the theoretical findings is a change of view, which allows to interpret DIP approaches as optimizing families of Tikhonov functionals. We will exemplify our theoretical finding with numerical examples for the standard linear integration operator (fully connected feed forward network) as well as for the challenging inverse problem posed by magnetic particle imaging (UNet with skip connections, Section 4).
2 Deep inverse priors and their interpretations
We start with a description of the state of the art of deep image priors before addressing their generalization to inverse problems. In general, a feed forward neural network is an algorithm that starts with an input
, computes iterativelyand outputs
The parameters of this system are denoted by
and
denotes a nonlinear activation function.
In order to highlight one of the special features of deep image priors, let us first refer to classical generative networks, which e.g. are trained on large sets of “naturallylooking” images. There, is fixed after training and one expects, that changing the input will generate different naturally looking images as well bora2017compressed . Hence, after training the distribution of naturally looking images is parametrized by . In contrast to that, deep image priors keep fixed and aim at parameterizing the distribution of images with .
In general, this is based on the underlying assumption, that complex distributions, e.g. the distribution of natural images, can be obtained by transforming a simpler distribution. This may be the distribution of the coefficients of a deep network ulyanov2017dip or the coefficients of a wavelet representations, see e.g. fista ; unser2007fista , moreover it has been argued, that “naturallylooking” images allow a sparse representation in such a basis. In a recent paper that makes use of DIP for compressed sensing van2018compressed , the parameter distribution of the network is additionally modeled by some Gaussians that are optimized to fit the distribution of some previously obtained image parametrizations.
To some extend, the success of deep image priors are based on a careful designed network architecture. E.g. ulyanov2017dip used a UNetlike “hourglass” architecture with skip connections. In their results they show, that such an architecture can capture the statistics of natural images. However this is still not enough to explain the amazing results they show. In general the DIP learning process may converge towards noisy images or bad reconstructions. The whole success relies on a combination of the network architecture with a suitable stopping rule for the optimization method. The authors of ulyanov2017dip claim that the architecture has an impact on how the solution space is searched during the iterative optimization of the parameters that minimizes (1). In their experiments ulyanov2017dip observed that the training process descends quickly to “naturalylooking” images while requiring much more steps to produce noisy images. Note that this can be somewhat supported by the theoretical findings of saxe2013exact .
This is also in line with the observations of zhang2016understanding , which shows that deep networks are able to fit noise very well, but need more training time to do so. They consider the CIFAR10 krizhevsky2009learning classification task with different noise models and different common architectures like szegedy2016rethinking , krizhevsky2012imagenet
and a vanilla multilayer perceptron. Another paper that hints in this direction is
anonymous2019on , which analyzes whether neural networks could have a bias towards approximating low frequencies.2.1 A trivial remark
We now switch to deep priors for solving illposed inverse problems. For a given operator the general task in inverse problems is to recover an approximation for form measured noisy data
where , with describes the noise in the measurement.
The deep image prior approach to inverse problems asks to train a network with parameters and fixed input by minimizing . After training a final run of the network computes as an approximation to .
We start with an observation for the training process of a trivial network, which simply outputs i.e. Hence, the approximate solution to the inverse problem is given by . Thus can be identified with an element in and training this network by gradient descent of with respect to is equivalent to the classical Landweber iteration, which is a gradient descent method for with respect to .
Despite the obvious trivialization of the neural network approach, this shows that there is potential for training networks with a single data point. Moreover, Landweber iterations converge from rather arbitrary starting points, which indicates that the choice of in the general case is indeed of minor importance.
2.2 Two Perspectives Based on Regression
In this subsection we present two different perspectives on solving inverse problems with the DIP via the minimization of a functional as discussed in the subsection above. The first perspective is based on a reinterpretation of the minimization of the functional
(2) 
in the finite, real setting, i.e. . This setting lets us write
(3)  
(4) 
where denotes the range of the network with regard to for a fixed and the rows of the matrix as well as
the entries of the vector
.This allows for the interpretation that we are solving a linear regression, parameterized by
, which is constrained by a deep learning hypothesis space and given by data pairs of the form .The second perspective is based on a rewriting of the optimization problem via the method of Lagrange multipliers. We start by considering the constrained optimization problem
(5) 
If we now assume that has continuous first partial derivatives with regard to , the Lagrange functional
(6) 
with the correct Lagrange multiplier , has a stationary point at each minimum of the original constraint optimization problem. This gives us a direct connection to unconstrained variational approaches like Tikhonov functionals.
2.3 The Bayesian point of view
The Bayesian approach to inverse problems focuses on computing MAP (maximum a posteriori probability) estimators, i.e. one aims for
(7) 
where is a conditional PDF. From standard Bayesian theory we obtain
(8) 
The setting for inverse problems, i.e. with , yields ()
We now decompose into , , where , resp. , denotes the orthogonal projection onto , resp. . Setting yields
The data only contains information about , which in classical regularization is exploited by restricting any reconstruction to .
However, if available, is a measure on how to extend with an to a suitable . The classical regularization of inverse problems simply uses an extension by zero, i.e. , which not necessarily is optimal. If we accept the interpretation that a network can be a meaningful parametrization of the set of suitable solutions , then for all not in the range of the network and optimizing the network will indeed yield a nontrivial completion . More precisely (I) can be interpreted to be a deep prior on the measurement and (II) to be a deep prior on the nullspace part of the problem.
3 Deep inverse priors and Tikhonov functionals
3.1 A general setting for deep priors for inverse problems
In this section we consider linear operators and aim at rephrasing DIP, i.e. the minimization of (1) with respect to , as an approach for learning optimized Tikhonov functionals. This change of view, i.e. regarding deep inverse priors as an optimization of functionals rather then networks, opens the way for analytic investigations.
We now follow the well known approach of LISTA lista (learned iterated soft thresholding algorithm) or learned PG (proximal gradient) methods as stated in Appendix I. Note, that our loss function is identical with the loss functions used in LISTA or learned PG methods, but using only a single data point. More precisely, we use the particular architecture of a fully connected feedforward network with layers of identical size defined by
(9) 
where
(10) 
for and
I.e. the affine linear map given by is the same for all layers. Moreover the activation function of the network is chosen as the proximal mapping of a regularizing functional , we restrict the matrix by enforcing ( denotes the identity operator) for some and the bias is determined via .
Remark 1
This kind of generalized ISTA or PG schemes converge for rather arbitrary starting points, i.e. the particular choice of in the deep prior context indeed is of minor importance.
Remark 2
Restricting activation functions to be proximal mappings is not as severe as it might look at first glance. E.g. ReLU is the proximal mapping for the indicator function of positive real numbers and soft shrinkage is the proximal mapping for the modulus function.
In this setting, the output of the network is identical to the th iterate of a proximal gradient descent method for minimizing
(11) 
see Appendix I. Hence, updating or changes the discrepancy term in this Tikhonov functional.
In this section we neglect the difference between and and assume that the PG scheme has fully converged, i.e.
(12) 
This leads to the definition of analytic deep priors.
Definition 1
Consider a fully connected neural network with layers, whose activation function is a proximal mapping with respect to a convex functional , i.e.
(13) 
where
(14) 
and Further assume that can be decomposed as with a bounded operator and that the bias satisfies . We define the associated Tikhonov functional and assume that a unique minimizer exists. We call this setting an analytic deep prior if , resp. , is trained from a single data point by gradient descent applied to
(15) 
We now examine the training process for computing resp. in the setting of such analytic deep prior models. This can be either regarded as training a neural network or as determining an optimized Tikhonov functional. In the following we focus on the minimization with respect to and then set which yields a more convenient notation. Hence, the training of the network for given data is achieved by a gradient descent method with respect to for the loss function
(16) 
The stationary points are characterized by and gradient descent iterations with stepsize are given by
(17) 
Hence we need to compute the derivative of with respect to .
Lemma 1
Consider an analytic deep prior network. We define
(18) 
(19) 
Then
(20) 
with
(21) 
which leads to the gradient descent
(22) 
Proof
is a functional which maps operators to real numbers, hence, its derivative is given by
which follows from classical variational calculus, see e.g. engl .
The derivative of with respect to can be computed using the fix point condition for a minimizer of :
which is equivalent to
We apply implicit function theorem applied and obtain the derivative
Combining with yields the required result.
This lemma allows to obtain an explicit description of the gradient descent for or , which in turn leads to an iteration of functionals and minimizers . We will now exemplify this derivation for a rather academic example, which however highlights in particular the differences between a classical Tikhonov minimizer, i.e.
and the solution of the DIP approach.
3.1.1 Example
In this example we examine analytic deep priors for linear inverse problems , i.e. , and
(23) 
The rather abstract characterization of the previous section can be made explicit for this setting. We consider the classical Tikhonov regularization, i.e.
(24) 
This is equivalent to the solution obtained by the analytic deep prior approach with without any iteration. We then take as a starting point. First of all we compute one step of gradient descent for minimizing with respect to and see, how the the resulting differs from .
The proximal mapping corresponding to the functional above is given by
(25) 
We use the explicit description of the iteration
(26) 
with
(27) 
In this case , which is a mapping , can be compute explicitly as
(28) 
The derivative of with respect to is a linear map For we obtain
(29) 
The adjoint operator is a mapping from to , which can be derived from the defining relation
(30) 
Hence,
(31) 
Here, denotes a linear map, which maps an to .
First of all, we now aim at determining explicitly
(32) 
at the starting point of our iteration, i.e. with . I.e. we initialize the iteration with where is the classical Tikhonov regularization. Inserting this yields
(33) 
From this follows a rather lengthy expression for
(34)  
(35) 
and we can compute the update
(36) 
as well as the output of the analytic deep prior approach after one iteration of updating (assuming a suitably chosen )
(37) 
This expression nicely collapses if commutes with . For illustration we assume the rather unrealistic case that , where is a singular function for
with singular value
. The dual singular function is denoted by , i.e. and and we further assume, that the measurement noise in is in the direction of this singular function, i.e. . In this case, the problem is indeed onedimensional and we obtain an iteration restricted to the span of , resp. the span of .A lengthy computation exploiting and shows that the singular value of in the spectral decomposition of obeys the iteration
i.e.
(38) 
with
This iteration has a contracting fix point at , which yields
as opposed to the Tikhonov minimizer .
For fixed , our value is the maximum of for varying . Our artificial setting is onedimensional and hence wellposed, i.e we do not want to regularize or diminish the reconstruction. In this sense, DIP iterations converge to the optimal value for Tikhonov type regularizers for this onedimensional case.
3.2 Some numerical experiments
We now use the analytic deep inverse prior approach for solving an inverse problem with the following integration operator
(39) 
is linear and compact, hence the inverse problem is illposed. We let the matrix be a discretization of and choose to be one of its singular vectors . Then we set the noisy data with and equals of the largest coefficient of see Figure 1.
We aim to recover from considering the setting established in Def. 1 for and a fixed value of . That means the solution is parametrized by the weight matrix of the network, i.e. with and some arbitrary fixed input. Solving the inverse problem is now equivalent to training the network , i.e. finding optimal , to minimize the loss function (1) for the single data point . Afterwards, the reconstruction is obtained by computing .
In order to properly update by backpropagation, the network was implemented taking special care ensuring that (12) holds. For more details please refer to Appendix II.
In Figure 2 some reconstruction results are shown. The first plot contains the true solution the standard Tikhonov solution and the reconstruction obtained with the analytic deep inverse approach after seemed to have converged, in this case that means is the resulting matrix after training iterations. Figure 2 provides some additional reconstructions and plots depicting:

The true error of the network’s output after each update of in a logarithmic scale.

The Frobenius norm of after each update of .

The matrix .
For all choices of the training of converges to a matrix , such that has a smaller true error than . As it can be observed in the last plot, contains some patterns that reflect what was previously obtained in Equation 38. The analytic deep inverse prior approach seems in principle to behave differently than the Tikhonov approach and shows some promising results.
So far, we considered the regularization parameter to be fixed. In a real application one needs to chose it by a technique such as the Lcurve hansen1993lcurve or the discrepancy principle. However, they usually involve finding reconstructions for many different values of . In our case, that would mean to retrain the network each time, which would amount to high computational costs. This motivates an adaptive choice of during the optimization, which could be achieved by letting to be also a trainable weight of the network. The results for the same example and different starting values are shown in Figure 3.
Since we are minimizing (1), it becomes now important to choose when to stop the training, otherwise will probably converge to and will converge to , which would result in bad reconstructions. We did that by monitoring the trajectory/change of over the course of the training. We stopped at an iteration , chosen to be close to the corner of the trajectory depicted in Figure 3, i.e. when starts decreasing very slow. We discuss this further in Appendix II.
4 Deep priors for magnetic particle imaging
In the remainder of this paper we apply a deep inverse prior network for solving the reconstruction problem in magnetic particle imaging (MPI). MPI is an imaging modality based on injecting ferromagnetic nanoparticles, which are consequently transported by the blood flow. Reconstructing the resulting spatial distribution of those nanoparticles is based on exploiting the nonlinear magnetization behavior of ferromagnetic nanoparticles Gleich2005 .
More precisely, one applies a magnetic field, which is a superposition of a static gradient field, which generates a fieldfreepoint (FFP), and a highly dynamic spatially homogeneous field, which moves the FFP in space. The mean magnetic moment of the nanoparticles in the neighborhood of the FFP will generate an electromagnetic field, whose voltages can be measured by socalled receive coils. The timedependent measurements
in the receive coils constitute the data for the inversion process, i.e. for reconstructing .MPI benefits from a high temporal resolution and a potentially high spatial resolution which makes it suitable for several invivo applications, such as imaging blood flow weizenecker2009three ; khandhar2017evaluation , instrument tracking haegele2012magnetic and guidance Salamon:2016cz , flow estimation franke2017 , cancer detection Yu2017 and treatment by hyperthermia murase2015usefulness .
Due to the nonmagnetic coating of the nanoparticles, which largely suppresses particleparticle interactions, MPI is usually modeled by a linear Fredholm integral equation of the first kind describing the relationship between particle concentration and the measured voltage. After removing the voltage induced by the applied magnetic field one obtains a measured signal from the th receive coil as
where
, with some noise and
the kernel of the linear operator. Combining the measurements of all receive coils yields – after discretization and appying the Fourier transform – a linear system of equations
Typically, the rows of are normalized resulting in the final form of the linearized inverse problem denoted by(40) 
This is a coarse simplification of the physical setup, which neglects nonlinear magnetization effects of the nanoparticles as well as the nonhomogeneity of the spatial sensitivity of the receive coils and also the small, but nonnegligible particleparticle interactions. Hence, this is a perfect setup for exploiting the potential of neural networks for matching complex and highdimensional nonlinear models.
For a more detailed introduction to MPI, details on data preprocessing as well as on the implementation using Tensorflow
tensorflow2015whitepaper , see Appendix III. In this section we just report some numerical results.We test the capability of the Deep Imaging Prior approach to improve image reconstruction obtained by standard Tikhonov regularization. For the experiments we use the datasets (D1) and in the Appendix III also (D2) generated by the Bruker preclinical MPI system at the University Medical Center HamburgEppendorf, see Appendix III. We use the deep image prior network introduced by ulyanov2017dip , specifically their Unet architecture. For further details on this architecture we refer to Appendix III and ulyanov2017dip .
The measured voltages are given as a sequence of complex numbers, hence we split up our loss function into the form
(41)  
(42) 
where and denote the real and imaginary parts respectively.
For comparison to our deep inverse prior MPI reconstructions, we also compute sparse and classical Tikhonov reconstructions. We produce the Tikhonov reconstruction, usually associated with the minimization of the functional
(43) 
via the algebraic reconstruction technique (Kaczmarz) as generalized to allow for the constraint by dax1993row . We produce the sparsity reconstruction, usually associated with the minimization of the functional
(44) 
via simply implementing this functional in Tensorflow and minimizing it via gradient descent. In the end we project each entry into
We start by presenting direct comparisons of the Kaczmarz, sparsity and DIP reconstructions in Table 1. Beneath each image we state the parameters we used for the reconstruction where denotes the Frobenius norm and is the regularization parameter as used in Equations 43 or 44 and the learning rate used for Adam. For DIP we always used early stopping after optimization steps. The images started to deteriorate slowly for more iterations.
As one can see, the results presented in Table 1 are roughly on par with, if not even better than, the classical regularization methods. In particular we want to point out the reconstruction of the “2mm” phantom for which only the DIP approach achieves a separation of the two different lines.
Phantom  Kaczmarz with  DIP  Photo  

“4mm” (DS1) 





“2mm” (DS1) 




We now want to compare the DIP based reconstruction process with a Landweber reconstruction with early stopping in more detail. Here we use a phantom in the shape of a “one” provided in DS1. Specifically we compare a DIP reconstruction obtained with Adam kingma2014adam with a DIP reconstruction obtained with gradient descent and with the previously mentioned Landweber reconstruction, see Table 2. To compare them we present the following quantities:

Error: The quantity we are minimizing over the iterations of the optimization. I.e. (yaxis) over (xaxis), where is the iteration index and the reconstruction at iteration .

Lcurve: The path of the points over , where is on the xaxis and on the yaxis. Note that these paths tend to start in the bottom right corner and moves gradually towards the upper left corner.

Change: The change in that happens over the course of the optimization, separately displayed for the spaces and I.e. we plot (yaxis) and (yaxis) over (xaxis). Here denotes the orthogonal projection onto the space

Errors per Singular Value: For each we plot the normalized errors associated with each of the onedimensional linear subspaces spanned by the singular vectors, ordered by the size of their associated singular values, starting with the biggest singular value at
. I.e. for the singular value decomposition
of , where a ordered diagonal matrix and and orthogonal, we plot the quantity over the index and the index 
Final Reconstruction: The reconstruction at the end of the optimization, i.e. where the total number of iterations.
Method  Error  Change  Lcurve  Errors per Singular Value  Final Reconstruction  
























First of all we would like to point out that for the plain Landweber reconstruction and the Adam DIP reconstruction we used optimization steps each, but for the gradient descent DIP reconstruction we needed times that many steps, i.e. steps, to get a reconstruction that did not seem to improve anymore. Also the best Adam reconstruction, based on our visual judgment, was not the final one. The best one was reached after approximately iterations. For both DIP reconstructions we used a learning rate of and for Landweber we chose since the result did not seem to change much based on the learning rate. If we look at the best reconstructions, we can see that both DIP reconstructions look quite good, although the gradient descent one displays the gap between the two dashes of the “one”. The reason we did not use gradient descent instead of Adam for all our reconstructions is first, that one needs to use small learning rates with gradient descent to get good reconstructions and therefore, as already mentioned, one needs considerably more steps to reach good results. Second, the optimization process does not only take much longer than the one with Adam, but also seems to gets stuck in bad local minima most of the time.
As one can see the error curves of the three different methods displayed in Table 2 look quite similar, although the error curve of Adam has, as one would expect from Adam, minor disruptions. Interestingly these disruptions are lining up very well with the disruptions of the change curve of Adam. We found that a DIP reconstruction tends to produce good results, when the choice of the optimizer and its learning rate leads the changes in the null space to be roughly on the same order of magnitude as the changes in the orthogonal of the null space. This is the case for our two DIP examples in the table. For the normal Landweber method we used the Lcurve to decide when to stop the optimization. We also present the Lcurves for the DIP reconstructions. In the DIP with gradient descent reconstruction starts to stagnate in the upper left corner and we could not observe any “exploding” behavior. For DIP with Adam one can see that the Lcurve starts the form a “blob”. In our experiments we saw this blob slightly further extending upward if one used further iterations. The visually best results where usually reached shortly after the “blob”formation started (in this case for example at ca. iterations). The Table 2 also presents the “errors per singular value”. We show them since one often uses so called filter functions engl , defined on the singular values, to describe the properties of regularization methods for linear inverse problems. These filter functions specify how the regularization methods deal differently with the minimization of the errors associated with different singular values – more precisely with the different one dimensional subspaces spanned by the singular vectors. We would like to point out that one can see that the DIP reconstructions, allow for much bigger errors in subspaces associated with large singular values. This hints at the DIP being influential in these subspaces, since for the plain Landweber approach one can clearly see a flat region of small errors for the big singular values at later stages of the optimization.
5 Summary and conclusion
In this paper we investigated the concept of deep inverse priors / regularization by architecture. To our knowledge this is the only deep learning method to solve inverse problems that does not require massive amounts of data, which one usually can only acquire after the inverse problem is solved already. In fact, the method requires one measurement only. We started out by giving different qualitative interpretations of what a regularization is and specifically how regularization by architecture fits into this context.
We followed up with the introduction of the analytic deep inverse prior by specifically showing how proximate gradient descent based architectures, not unlike LISTA, allow for a somewhat transparent regularization by architecture. Specifically we showed that their results can be interpreted as solutions of optimized Tikhonov functionals. This was further investigated with an academic example, were we implemented the analytic deep inverse prior and tested its numerical applicability. The results confirmed our mathematical findings and showed some promising results.
To conclude we applied a deep image prior to the real world problem of computing reconstructions in magnetic particle imaging. We found that this type of regularization compares very well to established and widely used regularization methods, in some cases even surpasses them.
There is obviously, like in deep learning in general, still much work to be done in order to have a good understanding of deep inverse priors, but we see much potential in the idea to use deep architectures to regularize inverse problems; especially since an enormous part of the deep learning research is concerned with the understanding of deep architectures already.
Acknowledgements.
The authors would like to thank P. Szwargulski and T. Knopp from the University Medical Center HamburgEppendorf for their support in conducting the experiments for the dataset DS1. S. Dittmer and T. Kluth acknowledge the support by the Deutsche Forschungsgemeinschaft (DFG) within the framework of GRK 2224/1 “Pi 3 : Parameter Identification—Analysis, Algorithms, Applications”. D. Otero Baguer acknowledges the financial support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Projektnummer 276397488 – SFB 1232, subproject “P02Heuristic, Statistical and Analytical Experimental Design”. Peter Maass acknowledges funding by the European Union’s Horizon 2020 research and innovation programme under the Marie SklodowskaCurie Grant Agreement No. 765374, subproject ’Data driven model adaptations of coil sensitivities in MR systems’.
References
 (1) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Largescale machine learning on heterogeneous systems (2015). URL https://www.tensorflow.org/. Software available from tensorflow.org
 (2) Adler, J., Öktem, O.: Learned primaldual reconstruction. IEEE transactions on medical imaging 37(6), 1322–1332 (2018)
 (3) Anonymous: On the spectral bias of neural networks. In: Submitted to International Conference on Learning Representations (2019). URL https://openreview.net/forum?id=r1gR2sC9FX. Under review
 (4) Bathke, C., Kluth, T., Brandt, C., Maass, P.: Improved image reconstruction in magnetic particle imaging using structural a priori information. International Journal on Magnetic Particle Imaging 3(1), ID 1703015, 10 pages (2017). URL https://journal.iwmpi.org/index.php/iwmpi/article/view/64
 (5) Beck, A., Teboulle, M.: A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009). DOI 10.1137/080716542. URL https://doi.org/10.1137/080716542
 (6) Bora, A., Jalal, A., Price, E., Dimakis, A.G.: Compressed sensing using generative models. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 611 August 2017, pp. 537–546 (2017). URL http://proceedings.mlr.press/v70/bora17a.html

(7)
Chollet, F., et al.: Keras.
https://keras.io (2015)  (8) Combettes, P., Wajs, V.: Signal recovery by proximal forwardbackward splitting. Multiscale Modeling & Simulation 4(4), 1168–1200 (2005). DOI 10.1137/050626090
 (9) Dax, A.: On row relaxation methods for large constrained least squares problems. SIAM Journal on Scientific Computing 14(3), 570–584 (1993)
 (10) Engl, H.W., Hanke, M., Neubauer, A.: Regularization of inverse problems, Mathematics and its Applications, vol. 375. Kluwer Academic Publishers Group, Dordrecht (1996)
 (11) Erb, W., Weinmann, A., Ahlborg, M., Brandt, C., Bringout, G., Buzug, T.M., Frikel, J., Kaethner, C., Knopp, T., März, T., M, M., Storath, M., Weber, A.: Mathematical analysis of the 1D model and reconstruction schemes for magnetic particle imaging. Inverse Problems 34(5), 055012, 21 pp. (2018)
 (12) Franke, J., Heinen, U., Lehr, H., Weber, A., Jaspard, F., Ruhm, W., Heidenreich, M., Schulz, V.: System characterization of a highly integrated preclinical hybrid mpimri scanner. IEEE Transactions on Medical Imaging 35(9), 1993–2004 (2016). DOI 10.1109/TMI.2016.2542041
 (13) Franke, J., Lacroix, R., Lehr, H., Heidenreich, M., Heinen, U., Schulz, V.: Mpi flow analysis toolbox exploiting pulsed tracer information – an aneurysm phantom proof. International Journal on Magnetic Particle Imaging 3(1) (2017). URL https://journal.iwmpi.org/index.php/iwmpi/article/view/36
 (14) Gleich, B., Weizenecker, J.: Tomographic imaging using the nonlinear response of magnetic particles. Nature 435(7046), 1214–1217 (2005)
 (15) Gregor, K., LeCun, Y.: Learning fast approximations of sparse coding. In: ICML 2010  Proceedings, 27th International Conference on Machine Learning, pp. 399–406 (2010)
 (16) Haegele, J., Rahmer, J., Gleich, B., Borgert, J., Wojtczyk, H., Panagiotopoulos, N., Buzug, T., Barkhausen, J., Vogt, F.: Magnetic particle imaging: visualization of instruments for cardiovascular intervention. Radiology 265(3), 933–938 (2012)
 (17) Hammernik, K., Klatzer, T., Kobler, E., Recht, M.P., Sodickson, D.K., Pock, T., Knoll, F.: Learning a variational network for reconstruction of accelerated mri data. Magnetic resonance in medicine 79(6), 3055–3071 (2018)
 (18) Hansen, P., O’Leary, D.: The use of the lcurve in the regularization of discrete illposed problems. SIAM Journal on Scientific Computing 14(6), 1487–1503 (1993)
 (19) Hauptmann, A., Lucka, F., Betcke, M., Huynh, N., Adler, J., Cox, B., Beard, P., Ourselin, S., Arridge, S.: Modelbased learning for accelerated, limitedview 3d photoacoustic tomography. IEEE transactions on medical imaging 37(6), 1382–1393 (2018)

(20)
Jin, K.H., McCann, M.T., Froustey, E., Unser, M.: Deep convolutional neural network for inverse problems in imaging.
IEEE Transactions on Image Processing 26(9), 4509–4522 (2017)  (21) Khandhar, A., Keselman, P., Kemp, S., Ferguson, R., Goodwill, P., Conolly, S., Krishnan, K.: Evaluation of pegcoated iron oxide nanoparticles as blood pool tracers for preclinical magnetic particle imaging. Nanoscale 9(3), 1299–1306 (2017)
 (22) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
 (23) Kluth, T.: Mathematical models for magnetic particle imaging. Inverse Problems 34(8), 083001 (2018). URL http://iopscience.iop.org/article/10.1088/13616420/aac535
 (24) Kluth, T., Jin, B., Li, G.: On the degree of illposedness of multidimensional magnetic particle imaging. Inverse Problems 34(9), 095006 (2018). URL http://stacks.iop.org/02665611/34/i=9/a=095006
 (25) Kluth, T., Maass, P.: Model uncertainty in magnetic particle imaging: Nonlinear problem formulation and modelbased sparse reconstruction. International Journal on Magnetic Particle Imaging 3(2), ID 1707004, 10 pages (2017). URL https://journal.iwmpi.org/index.php/iwmpi/article/view/74
 (26) Knopp, T.: Github mdf. https://github.com/MagneticParticleImaging/MDF. Accessed: 20181116
 (27) Knopp, T.: Open mpi data. https://magneticparticleimaging.github.io/OpenMPIData.jl/latest/index.html. Accessed: 20181116
 (28) Knopp, T., Buzug, T.M.: Magnetic Particle Imaging: An Introduction to Imaging Principles and Scanner Instrumentation. Springer, Berlin/Heidelberg (2012)
 (29) Knopp, T., Gdaniec, N., Möddel, M.: Magnetic particle imaging: from proof of principle to preclinical applications. Physics in Medicine and Biology 62(14), R124 (2017)
 (30) Knopp, T., Hofmann, M.: Online reconstruction of 3D magnetic particle imaging data. Physics in Medicine and Biology 61(11), N257–67 (2016). DOI 10.1088/00319155/61/11/N257
 (31) Knopp, T., Rahmer, J., Sattel, T.F., Biederer, S., Weizenecker, J., Gleich, B., Borgert, J., Buzug, T.M.: Weighted iterative reconstruction for magnetic particle imaging. Phys. Med. Biol. 55(6), 1577–1589 (2010)
 (32) Knopp, T., Viereck, T., Bringout, G., Ahlborg, M., von Gladiss, A., Kaethner, C., Neumann, A., Vogel, P., Rahmer, J., Möddel, M.: Mdf: Magnetic particle imaging data format. ArXiv eprints 1602.06072v6, 1–15 (2018). URL http://arxiv.org/abs/1602.06072v6. Article, MDF
 (33) Konkle, J., Goodwill, P., Hensley, D., Orendorff, R., Lustig, M., Conolly, S.: A convex formulation for magnetic particle imaging xspace reconstruction. PLoS ONE 10(10), e0140137 (2015)
 (34) Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech. rep., Citeseer (2009)

(35)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks.
In: Advances in neural information processing systems, pp. 1097–1105 (2012)  (36) Lampe, J., Bassoy, C., Rahmer, J., Weizenecker, J., Voss, H., Gleich, B., Borgert, J.: Fast reconstruction in magnetic particle imaging. Physics in Medicine and Biology 57(4), 1113–1134 (2012). DOI 10.1088/00319155/57/4/1113
 (37) Louis, A.K.: Inverse und schlecht gestellte Probleme. Vieweg+Teubner Verlag, Wiesbaden (1989)
 (38) Maass, P.: Simple networks for trivial inverse problems. submitted for publication (2018)
 (39) März, T., Weinmann, A.: Modelbased reconstruction for magnetic particle imaging in 2D and 3D. Inv. Probl. Imag. 10(4), 1087–1110 (2016)

(40)
Meinhardt, T., Möller, M., Hazirbas, C., Cremers, D.: Learning proximal
operators: Using denoising networks for regularizing inverse imaging
problems.
In: IEEE International Conference on Computer Vision, pp. 1781–1790 (2017)
 (41) Murase, K., Aoki, M., Banura, N., Nishimoto, K., Mimura, A., Kuboyabu, T., Yabata, I.: Usefulness of magnetic particle imaging for predicting the therapeutic effect of magnetic hyperthermia. Open Journal of Medical Imaging 5(02), 85 (2015)
 (42) Nesterov, Y.: Lectures on Convex Optimization. Springer Optimization and Its Applications. Springer International Publishing (2019). URL https://books.google.de/books?id=JSyNtQEACAAJ
 (43) Rahmer, J., Weizenecker, J., Gleich, B., Borgert, J.: Analysis of a 3D system function measured for magnetic particle imaging. IEEE Transactions on Medical Imaging 31(6), 1289–1299 (2012)
 (44) Rieder, A.: Keine Probleme mit inversen Problemen: eine Einführung in ihre stabile Lösung. Vieweg, Wiesbaden (2003)
 (45) Salamon, J., Hofmann, M., Jung, C., Kaul, M.G., Werner, F., Them, K., Reimer, R., Nielsen, P., vom Scheidt, A., Adam, G., Knopp, T., Ittrich, H.: Magnetic particle/magnetic resonance imaging: Invitro MPIguided real time catheter tracking and 4D angioplasty using a road map and blood pool tracer approach. PloS ONE 11(6), e0156899–14 (2016)
 (46) Saxe, A.M., McClelland, J.L., Ganguli, S.: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120 (2013)
 (47) Storath, M., Brandt, C., Hofmann, M., Knopp, T., Salamon, J., Weber, A., Weinmann, A.: Edge preserving and noise reducing reconstruction for magnetic particle imaging. IEEE Transactions on Medical Imaging 36(1), 74–85 (2017). DOI 10.1109/TMI.2016.2593954

(48)
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the
inception architecture for computer vision.
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826 (2016)
 (49) Them, K., Kaul, M.G., Jung, C., Hofmann, M., Mummert, T., Werner, F., Knopp, T.: Sensitivity enhancement in magnetic particle imaging by background subtraction. IEEE Trans. Med. Imag. 35(3), 893–900 (2016). DOI 10.1109/TMI.2015.2501462
 (50) Ulyanov, D., Vedaldi, A., Lempitsky, V.S.: Deep image prior. CoRR abs/1711.10925 (2017). URL http://arxiv.org/abs/1711.10925
 (51) Van Veen, D., Jalal, A., Price, E., Vishwanath, S., Dimakis, A.G.: Compressed sensing with deep image prior and learned regularization. arXiv preprint arXiv:1806.06438 (2018)
 (52) Vonesch, C., Unser, M.: A fast iterative thresholding algorithm for waveletregularized deconvolution  art. no. 67010d. Wavelets Xii, Pts 1 And 2 6701, D7010–D7010 (2007)
 (53) Weizenecker, J., Gleich, B., Rahmer, J., Dahnke, H., Borgert, J.: Threedimensional realtime in vivo magnetic particle imaging. Physics in Medicine and Biology 54(5), L1 (2009)
 (54) Yu, E.Y., Bishop, M., Zheng, B., Ferguson, R.M., Khandhar, A.P., Kemp, S.J., Krishnan, K.M., Goodwill, P.W., Conolly, S.M.: Magnetic particle imaging: A novel in vivo imaging platform for cancer detection. Nano Letters 17(3), 1648–1654 (2017). DOI 10.1021/acs.nanolett.6b04865. URL http://dx.doi.org/10.1021/acs.nanolett.6b04865
 (55) Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530 (2016)
Appendix I: A reminder on minimization of Tikhonov functionals and the LISTA approach
In this section we consider only linear operators and we review the well known theory for the Iterative Soft Shrinkage Algorithm (ISTA) as well as the slightly more general Proximal Gradient (PG) combettes2005splitting ; nesterov04convex method for minimizing Tikhonov functionals of the type
(45) 
We recapitulate the main steps in deriving ISTA and PG, as far as we need it for our motivation. The necessary first order condition for a minimizer is given by
(46) 
Multiplying with an arbitrary real positive number and adding plus rearranging yields
(47) 
For convex , the term of the right hand side is inverted by the (single valued) proximal mapping of , which yields
(48) 
Hence this is a fixed point condition, which is a necessary condition for all minimizers of . Turning the fixed point condition into an iteration scheme yields the PG method
(49)  
(50) 
This structure is also the motivation for LISTA lista approaches where fully connected networks with internal layers of identical size are used. Moreover, in some versions of LISTA, the affine maps between the layers are assumed to be identical. The values at the th layer are denoted by , hence,
(51) 
LISTA then trains on some given training data. More precisely, it trains two matrices and such that
(52) 
This derivation can be rephrased as follows.
Lemma 2
Let denote a fully connected network with input and internal layers. Further assume, that the activation function is identical to a proximal mapping for a convex functional . Assume is restricted, such that is positive definite, i.e. there exists a matrix such that
(53) 
Furthermore, we assume that the bias term is fixed as . Then is the th iterate of an ISTA scheme with starting value for minimizing
(54) 
Proof
Follows directly from equation (49).
Following these arguments one can rephrase LISTA as a concept for learning the discrepancy term in a classical Tikhonov functional. This point of view opens further connections to variational approaches in deep learning, see e.g. hammernik2018learning ; meinhardt2017learning
Appendix II: Numerical experiments
In this section we provide details about the implementation of the analytic deep inverse prior and the academic example.
Discretizing the integration operator yields the matrix , which has on the main diagonal, everywhere under the main diagonal and above (here ). In our experiments we use .
The analytic deep inverse prior network is implemented using Python and Tensorflow tensorflow2015whitepaper . Initially, the weight matrix is created and then fully connected layers are added to the network, all having the same weight matrix , bias and activation function given by (25). That means the network contains in total parameters (the number of components in ), independently from the number of layers. For the experiments shown in the paper the input is randomly initialized with small norm and is , where
is the biggest eigenvalue of
.In order to guarantee that assumption (12) holds, the network should in principle have thousands of layers, because of the slow convergence of the PG method. However, this is prohibitive from the implementation point of view. During the training iterations, in which we update , we therefore consider only a reduced network with a small number, of layers but we set the input to be the network’s output after the previous iteration. This is equivalent to adding new identical layers, with and , at the end of an implicit network which is growing by layers at a time. Here refers to the th update of when applying gradient descent to minimize (1). After the th iteration, we have implicitly created a network that has layers. However, in the next iteration is updated considering only backpropagation over the last layers.
Let be the output of this implicit network after iteration , see Figure 4. In order to properly update by the backpropagation, we need to be a good approximation of . We checked empirically that when by evaluating and computing at each iteration. The results are shown in Figure 5. As it can be observed, after the error is considerably low.
For the adaptive choice of , we set it as a variable in Tensorflow, i.e. at each training iteration and are updated by the gradient descent. It is expected from the setup in Def. 1 that and this was indeed observed in the experiments, see Figure 3. What is interesting is that it first decays fast and then continues to decrease slowly. Moreover, the turning point corresponds to an iteration that is quite close to the optimal one when looking at the true error plot in Figure 3. Based on that, we used it as stopping criterion for selecting .
Appendix III: Magnetic particle imaging (MPI)
In this appendix we summarize the state of the art concerning analytic models for MPI and we give a detailed description on the MPI experiments using deep inverse prior networks.
Precisely modeling MPI, resp. formulating a physically accurate integral kernel for image reconstruction, is still an unsolved problem. Various modeling aspects, e.g. the magnetization dynamics and particleparticle interactions make it a challenging task such that the integral kernel is commonly determined in a timeconsuming calibration procedure. For further information on the modeling aspects the interested reader is referred to the survey paper Kluth2018b as well as to the review article Knopp2017 for further details on the MPI methodology.
A first step towards a theoretical understanding of this problem, while excluding the temporal dependence, can be found in maerz2015modelbased . A onedimensional problem setup considering the timedependency of the measurements was analyzed in ErbWeinmannAhlborg:2017 for one particular trigonometric excitation. Furthermore, the multidimensional imaging problem for different dynamic magnetic field patterns was analyzed in the context of inverse problems regarding the best possible degree of illposedness Kluth2018a . The theoretical investigations so far (based on the simplified equilibrium model Kluth2018b ) conclude that in the MPI standard setup, which uses a superposition of trigonometric dynamic field excitations and linear gradient fields, is severely illposed.
The concentration reconstruction problem is typically solved by applying Tikhonov regularization weizenecker2009three ; Knopp2010e ; rahmer2012analysis ; Lampe_Fast_2012 . For MPI this is preferably solved by using the algebraic reconstruction technique Knopp2010e ; KnoppOnline2016 combined with a nonnegativity constraint weizenecker2009three . More recently, further sophisticated regularization techniques such as fused lasso regularization, directional total variation, or other gradientbased methods have been applied to the fullcalibrationbased MPI problem storath2016edge ; Bathke2017 ; Konkle2015a . A totalleastsquares approach with respect to operator uncertainty combined with standard Tikhonov regularization as well as a sparsitypromoting penalty term was used to improve reconstruction performance Kluth2017 .
5.1 How Magnetic Particle Imaging works
For the description of the MPI setup we adapt the general notations in Kluth2018a ; Kluth2018b . MPI is inherently a 3D problem such that vectorvalued functions remain 3D even if the domain of the spatial variable is a subset of a dimensional affine subspace . It is further assumed that the support of the concentration function is a subset of the domain . A lower dimensional setup () can be constructed by the assumption that the concentration is a distribution with respect to the orthogonal complement of . Let , , be a bounded domain with a (strong) Lipschitz boundary in . Further, let denote the maximal data acquisition time and the time interval during which the measurement process takes place. The temporal derivative of any function is denoted by .
The measured voltage signal , , obtained at receive coils, is given by
(55) 
where the superscript denotes the transpose of a vector, is the concentration of the magnetic nanoparticles and , , represent the system functions characterizing the mean magnetic moment behavior of nanoparticles. This relation follows from Faraday’s law and the law of reciprocity Knopp2012 . The positive constant is magnetic permeability in vacuum. The scalar functions , , are the analog filters in the signal acquisition chain, and in practice, they are commonly band stop filters adapted to excitation frequencies of the drive field. Note that the analog filters are included in the system functions in this formulation, which can differ in the literature. The functions , , denote the sensitivity profiles of the receive coil units.
In the following it is assumed that the applied magnetic field and the filters are chosen in a way such that all excitation signals , . This assumption on the excitation signals is commonly made but may not be fulfilled in MPI applications Them2016 ; Kluth2017 . Using the previous assumption the inverse problem is to find the concentration from :