1 Introduction
Suppose that the measurement is corrupted with additive noise:
(1) 
where
denotes the unknown mean vector, and
is the variance. Consider a deep neural network model
with the weight , which is trained with the data producing an estimate . Then, our goal is to estimate the prediction error:
(2) 
which quantifies how well can predict a test data , independently drawn from the same distribution of (efron2004estimation, ; tibshirani2018excess, ).
The problem of estimating the prediction error is closely related to the generalizability of neural networks (anthony2009neural, ). Moreover, this problem is tightly linked to the classical approaches for model order selection in the statistics literature (stoica2004model, ). For example, among the most investigated statistical theories addressing this question are the so-called covariance penalty approaches such as Mallows’s Cp (mallows1973some, ), Akaike’s information criterion (AIC) (akaike1974new, ), Stein’s unbiased risk estimate (SURE) (donoho1995adapting, ), etc.
This paper is particularly interested in estimating the prediction error of encoder-decoder convolutional neural networks (ED CNNs) such as U-Net (ronneberger2015u, ; han2018framing, ; ye2019cnn, ). ED CNNs have been extensively used for various inverse problems such as image denoising, super-resolution, medical imaging, etc. (ye2018deep, ). Recent theoretical results showed that, thanks to the ReLU nonlinearities, the input space is partitioned into non-overlapping regions so that input images in each region share the same linear frame representation, but not across different partitions (ye2019cnn, ). In this paper, we will show that this property can be exploited to derive an explicit formulation for the unbiased estimator of the prediction error.
Our explicit formulation reveals an important link to the existing unsupervised denoising networks. For example, an unsupervised training scheme called Noise2Noise (N2N) (lehtinen2018noise2noise, ) is based on the training between multiple realizations of noisy images without clean reference data. We show that the loss used by Noise2Noise is indeed an unbiased estimator of the prediction error when many independent noisy realizations are available for each image. When only one noisy realization is available, the Stein’s unbiased risk estimator (SURE)-based denoiser (soltanayev2018training, ) is shown to be an unbiased estimator of the prediction error. Unfortunately, aside from the inconvenience of using the Monte-Carlo method to estimate the divergence term (ramani2008monte, ), it is often difficult to prevent the network from learning a trivial identity mapping. We show that another denoising scheme, known as Noise2Void (krull2018noise2void, ), can partially overcome the limitation of the SURE denoiser, thanks to the bootstrap sampling of the input data (efron1994introduction, ), which prevents the identity mapping from being learned. Nonetheless, the use of subsampled data as input limits the network performance.
To address this problem, here we provide a novel boosting estimator for the prediction error that can be used for neural network training for both supervised and unsupervised learning problems. Moreover, with proper batch normalization (ioffe2015batch, ; hoffer2018norm, ; cho2017riemannian, ; miyato2018spectral, ; ulyanov2016instance, ), we show that the contribution of the divergence term to the resulting loss can be made trivial, which can significantly simplify the neural network training. The resulting algorithm, which we call Noise2Boosting, has many advantages. In contrast to Noise2Noise, which requires multiple perturbed noisy images as targets, our framework only requires one target image. Unlike Noise2Void, multiple neural network outputs from bootstrap subsamples or random weighted input images are adaptively combined to provide a final output, which can reduce the prediction error. The applicability of the new method is demonstrated using various inverse problems such as denoising, super-resolution, accelerated MRI, electron microscopy imaging, etc., which clearly show that our Noise2Boosting can significantly improve the image quality beyond labels. All proofs for the lemma and propositions are provided in the Supplementary Material.
2 Related works
In Noise2Noise (lehtinen2018noise2noise, ), the main assumption is that a neural network may learn to output the average of all feasible explanations. Therefore, rather than using the ground-truth reference, the authors claimed that any noisy data from the same distribution can be used as a target for the neural network training. However, N2N requires multiple noisy images of the same underlying noiseless image during training, which may not be feasible in some acquisition scenarios. To address this problem, the authors in (krull2018noise2void, ) proposed the so-called Noise2Void (N2V) training scheme that utilizes randomly masked images as input. The main assumption for N2V is that each pixel can be predicted from its neighbors using a neural network different from the identity mapping. Unfortunately, N2V has limitations in capturing fast varying and isolated structures. Instead, the authors in (soltanayev2018training, ) proposed a SURE-based denoising network that employs the divergence penalty to regularize the autoencoder loss. Due to the difficulty of obtaining the divergence penalty, the authors in (soltanayev2018training, ) employed the Monte-Carlo SURE (ramani2008monte, ) to approximate the divergence.
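To make the Monte-Carlo divergence approximation mentioned above concrete, the following is a minimal NumPy sketch of a random-probe divergence estimate. It is not the code of the cited works: the function name `mc_divergence`, the step size `eps`, and the linear toy denoiser are assumptions chosen so that the exact divergence (the trace of the matrix) is known for comparison.

```python
import numpy as np

def mc_divergence(f, y, eps=1e-3, rng=None):
    """Single-probe Monte-Carlo estimate of div f(y) = sum_i d f_i / d y_i."""
    rng = np.random.default_rng() if rng is None else rng
    b = rng.standard_normal(y.shape)              # random Gaussian probe vector
    return float(b @ (f(y + eps * b) - f(y)) / eps)

# Toy check (assumption): a linear "denoiser" f(y) = A y has divergence trace(A).
rng = np.random.default_rng(0)
n = 64
A = rng.standard_normal((n, n)) / np.sqrt(n)
f = lambda y: A @ y
y = rng.standard_normal(n)

estimates = [mc_divergence(f, y, rng=rng) for _ in range(2000)]
mc_mean = float(np.mean(estimates))               # average over many probes
exact = float(np.trace(A))                        # exact divergence of the toy map
```

A single probe is noisy, so the sketch averages many probes; the step size `eps` is the kind of extra hyperparameter that such an approximation introduces.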
Bootstrap aggregation (bagging) (breiman1996bagging, ) is a classical machine learning technique which uses bootstrap sampling and aggregation of the results to reduce the variance and thereby improve the accuracy of the base learner. Boosting (schapire1999brief, ) differs from bagging in that the training data are multiplied by random weights rather than subsampled. The rationale for bagging and boosting is that it may be easier to train several simple weak learners and combine them into a more complex learner than to learn a single strong learner.
3 Theory
This section analyzes Noise2Noise, the SURE-based denoiser, and Noise2Void from the perspective of their capability of estimating the prediction error. This leads to a novel boosting scheme.
3.1 Existing approaches
Recall that the unseen test data in (2) can be represented by
where is drawn from and independent from in (2). This suggests an estimator for the prediction error in (2):
(3) 
We can easily see that , suggesting that is an unbiased estimator of the prediction error in (2). In fact, the Noise2Noise estimator is an extension of (3) for the training samples , resulting in the following loss:
(4) 
Accordingly, the neural network training is done by minimizing the loss:
While is also an unbiased estimator of the prediction error (2), due to independent sampling along and , a large number of noisy samples is required to reduce the variance of the estimator .
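The unbiasedness argument above can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: Gaussian noise, a fixed shrinkage map `c * y` standing in for a trained network, and hypothetical variable names. For this toy map the prediction error has a closed form, which the average Noise2Noise-style loss should approach.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, c = 32, 0.5, 0.7          # dimension, noise std, shrinkage factor (all assumed)
mu = rng.standard_normal(n)         # unknown mean vector of the toy problem
trials = 20000

y = mu + sigma * rng.standard_normal((trials, n))    # noisy network inputs
y2 = mu + sigma * rng.standard_normal((trials, n))   # independent noisy targets

# Noise2Noise-style loss: squared distance between the independent noisy target
# and the estimate computed from the other noisy realization.
n2n_loss = np.sum((y2 - c * y) ** 2, axis=1)
n2n_mean = float(n2n_loss.mean())

# Closed-form prediction error of the shrinkage map for Gaussian noise:
# squared bias + target noise + estimator variance.
closed_form = float((1 - c) ** 2 * (mu @ mu) + n * sigma**2 * (1 + c**2))
```

The per-sample loss fluctuates considerably around its mean, which is consistent with the remark that many noisy samples are needed to reduce the variance of the estimator.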
On the other hand, the authors in (soltanayev2018training, ) employed Stein’s unbiased risk estimator (SURE). The original formulation was derived for the unbiased risk estimator, i.e. , but can be easily converted into a prediction error estimator. Accordingly, the corresponding SURE for the prediction error becomes
(5) 
where denotes the divergence. Then, the SURE denoising network training is done for the training samples by minimizing the following loss:
(6) 
Although the application of SURE for unsupervised denoising is an important advance in theory, there are several practical limitations. First, due to the difficulty of calculating the divergence term, the authors relied on Monte-Carlo SURE (ramani2008monte, ), which calculates the divergence term using Monte-Carlo simulation. This introduces additional hyperparameters, on which the final results critically depend. Another important drawback of the SURE denoising network is that it is difficult to prevent the network from learning a trivial identity mapping. More specifically, if the network learns the identity mapping, the cost function in (6) becomes zero. One way to avoid this trivial solution is to guarantee that the divergence term at the optimal network parameter is negative. However, given that the divergence term comes from the degrees of freedom (efron2004estimation, ) and the amount of excess optimism in estimating the prediction error (tibshirani2018excess, ), enforcing a negative value may be unnatural.
Recently, the authors in (krull2018noise2void, ) proposed the so-called Noise2Void (N2V). In N2V, various positions of the input pixels are masked according to a random distribution. Let denote the random variable for the sampling index, and let the corresponding neural network estimate be referred to as . Then, the corresponding prediction error is given by
(7) 
and the associated SURE estimator is given by
Proposition 1
For the given training samples and random mask , the Noise2Void denoising network training could be done by minimizing the following loss:
(9) 
Similar to the N2N estimator, the N2V estimator requires sampling along and . However, the main goal of the index subsampling is to prevent the network from learning the identity mapping; therefore, the number of index samples is usually small.
In the original N2V, the divergence term in (9) was not used for training. Later, we will show that with the batch normalization, the contribution of the divergence term can be made trivial, which again explains the success of N2V.
3.2 Bagging/Boosting method
Based on the discussion so far, here we propose a novel boosting method, where the input data are subsampled by bootstrapping or multiplied with random weights, and the final output is the aggregated value. More specifically, the bagging (or boosting) provides the network output as an average of the network trained with bootstrap subsampled input (or random weighted input):
(10) 
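As an illustration of the aggregation in (10), the sketch below generates bootstrap-subsampled and randomly weighted versions of one input and averages the outputs of a placeholder network. Everything named here is an assumption for illustration: `toy_network` is a simple moving average standing in for a trained ED CNN, the number of masks `K` is arbitrary, and the uniform weight range with mean 1 is not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 1000, 8                       # signal length and number of random patterns (assumed)
y = rng.standard_normal(n)           # a single noisy measurement

def bootstrap_mask(n, rng):
    """Draw n indices with replacement; keep the positions drawn at least once."""
    idx = rng.integers(0, n, size=n)
    mask = np.zeros(n)
    mask[idx] = 1.0
    return mask

def toy_network(x):
    """Placeholder for a trained network: a 5-tap moving average (assumption)."""
    return np.convolve(x, np.ones(5) / 5, mode="same")

# Bagging: average the outputs over bootstrap-subsampled inputs.
masks = np.stack([bootstrap_mask(n, rng) for _ in range(K)])
bagged = np.mean([toy_network(m * y) for m in masks], axis=0)

# Boosting variant: average the outputs over randomly weighted inputs.
weights = rng.uniform(0.5, 1.5, size=(K, n))   # weight range is an assumption
boosted = np.mean([toy_network(w * y) for w in weights], axis=0)

keep_fraction = float(masks.mean())  # fraction of positions kept by the bootstrap masks
```

With sampling with replacement, each position survives with probability close to 1 - 1/e, so `keep_fraction` should land near 0.63.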
Since the dependency on is integrated out in (10) during the aggregation, the corresponding prediction error takes the same form as in (2). Then, we have the following key result:
Proposition 2
Proposition 2 shows that the prediction error of boosting can be better than that of N2V. Moreover, the gap increases when the network provides diverse outputs for each realization of the index (breiman1996bagging, ). In fact, due to the combinatorial nature of ReLU, it was shown in (ye2019cnn, ) that the input space is divided into non-overlapping partitions with different linear representations. Therefore, by changing the subsampling pattern , a distinct representation may be selected, which can make the corresponding neural network outputs diverse.
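The role of output diversity can be seen in a deterministic analogue of this argument. The sketch below is not the proposition itself (which concerns expectations); it only checks, for hypothetical per-pattern outputs, that the squared error of the averaged output never exceeds the average squared error, and that the difference equals the spread of the outputs around their mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 100, 10                                     # sizes chosen for illustration
target = rng.standard_normal(n)                    # stands in for the test data
outputs = target + rng.standard_normal((K, n))     # K hypothetical per-pattern outputs

mean_output = outputs.mean(axis=0)                 # aggregated (averaged) output

# Average error of the individual outputs versus error of the aggregate.
err_individual = float(np.mean(np.sum((target - outputs) ** 2, axis=1)))
err_aggregated = float(np.sum((target - mean_output) ** 2))

# The gap is exactly the average squared deviation of the outputs from their mean.
gap = err_individual - err_aggregated
spread = float(np.mean(np.sum((outputs - mean_output) ** 2, axis=1)))
```

Because `gap` equals `spread`, more diverse outputs give a larger improvement from aggregation, in line with the discussion above.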
Finally, it is straightforward to see that the SURE estimator for the associated prediction error is given by
(11) 
Then, for the given training samples , the boosted network training can be done by minimizing the loss:
(12) 
4 Noise2Boosting: A Novel Boosting Scheme
Based on the theoretical analysis so far, we now introduce an efficient implementation of (12), which we call Noise2Boosting (N2B). The proposed N2B method is based on the following two simplifications.
4.1 Divergence simplification
In the SURE estimator, calculation of the divergence term is not trivial for general neural networks. This is why the authors in (soltanayev2018training, ) employed the Monte-Carlo SURE. In this section, we first show that there exists a simple explicit form of the divergence term for the case of ED CNNs. Then, the batch normalization is shown to make the divergence term trivial.
Specifically, as shown in (ye2019cnn, ), the output of the ED CNN can be represented by a nonlinear basis representation:
(13) 
where and denote the th column of the following frame basis and its dual:
(14)  
(15) 
where and denote the diagonal matrix with 0 and 1 values that are determined by the ReLU output in the previous convolution steps, and refer to the encoder and decoder matrices, respectively, whose explicit forms can be found in Supplementary Material. Since the patterns of and depend on the input, the expression suggests that the input space is partitioned into multiple regions where input signals for each region share the same linear representation, but not across different partitions.
Using this, we can easily obtain the closed-form expression for the divergence term.
Lemma 1
Let be represented by (13). Then, we have
(16) 
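The idea behind the lemma, that a ReLU network is locally linear so its divergence is the trace of the local linear map, can be checked on a toy network. The sketch below assumes a bias-free two-layer ReLU network with random weights (`W1`, `W2` are hypothetical, not the encoder and decoder matrices of the paper) and compares the trace of the local Jacobian with a central finite-difference divergence.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 16                                  # input and hidden dimensions (assumed)
W1 = rng.standard_normal((m, n))              # toy "encoder" weights
W2 = rng.standard_normal((n, m))              # toy "decoder" weights

def net(y):
    """Bias-free two-layer ReLU network."""
    return W2 @ np.maximum(W1 @ y, 0.0)

y = rng.standard_normal(n)

# The ReLU activation pattern at y fixes the local linear representation.
mask = (W1 @ y > 0).astype(float)
jacobian = W2 @ (mask[:, None] * W1)          # local linear map: W2 diag(mask) W1
div_trace = float(np.trace(jacobian))

# Central finite differences of the i-th output with respect to the i-th input.
eps = 1e-5
div_fd = 0.0
for i in range(n):
    e = np.zeros(n)
    e[i] = eps
    div_fd += (net(y + e)[i] - net(y - e)[i]) / (2 * eps)
```

The two values agree as long as the small perturbation does not change the activation pattern, which is the same locality argument used in the text.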
Proposition 3
Suppose the index set is obtained by either 1) sampling with replacement such that each index can be selected with the probability of , or 2) random weighting with the mean value of . Then,
(17) 
Proposition 3 shows that the divergence term for the boosted estimator can be simply represented using the nonlinear frame of the ED CNN. This leads to a simple approximation of the divergence term by exploiting the property of the batch normalization. Recall that batch normalization has been extensively used to make the training stable (ioffe2015batch, ; hoffer2018norm, ; cho2017riemannian, ; miyato2018spectral, ; ulyanov2016instance, ). It has been consistently shown that the batch normalization is closely related to the norm of the Jacobian matrix , which is equal to in our ED CNNs. For example, in their original paper (ioffe2015batch, ), the authors conjectured that “Batch Normalization may lead the layer Jacobians to have singular values close to 1, which is known to be beneficial for training”. By extending the idea in (ioffe2015batch, ) to multiple layers, batch normalization can be understood as making the covariances of the network output and input similar. For example, for the uncorrelated input with , the batch normalization works to provide . Furthermore, for sufficiently small , we have
where the second equality holds because, within a small perturbation of the input, the corresponding frame representation does not change (ye2019cnn, ). Therefore, we have
since is a square matrix. This suggests that
Since the resulting divergence term is just a constant, the contribution of the divergence term in (12) is considered trivial and can be neglected.
4.2 Approximation of mean aggregation using an attention network
Another important complication in calculating (12) is that the mean aggregation is not available and we only have its empirical estimate. Although the simplest way to obtain an empirical estimate is to average the overall results of the encoderdecoder CNNs, this may not be the best method because it does not reflect the data distribution of the results. Therefore, we propose a data attention network that efficiently combines all data so that it can adaptively incorporate neural network output from various random sampling patterns. More specifically, we use the following weighted average
where denotes the bootstrap subsampling patterns and are the corresponding weights. To calculate the weights, we propose to use the attention network illustrated in Fig. 1. Specifically, the input of the attention network is the dimensional vector whose values are calculated by an average pooling of . This input is fed into a multilayer perceptron to generate the weight , where denotes the attention network parameters.
4.3 Implementation Details
A schematic diagram of the proposed method is illustrated in Figs. 1(a)(b). Our Noise2Boosting method consists of three building blocks: bootstrap subsampling or random weighting, a regression network using an encoder-decoder CNN, and an attention network.
To estimate the weight , our attention network consists of two fully connected layers. The input of the attention network is obtained by average pooling of the concatenated output of the regression network. The number of hidden nodes is 64, and the final dimension of the output is for aggregation.
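A minimal forward-pass sketch of such an attention-based aggregation is given below, using NumPy instead of the TensorFlow implementation of the paper. The two fully connected layers and the 64 hidden nodes follow the description above; the ReLU activation, the softmax normalization of the weights, the random parameter initialization, and all names (`attention_aggregate`, `W1`, `b1`, `W2`, `b2`) are assumptions made for illustration.

```python
import numpy as np

def attention_aggregate(outputs, W1, b1, W2, b2):
    """Weighted aggregation of K network outputs; outputs has shape (K, H, W)."""
    pooled = outputs.mean(axis=(1, 2))             # average pooling -> one value per output
    hidden = np.maximum(W1 @ pooled + b1, 0.0)     # first FC layer (ReLU is an assumption)
    logits = W2 @ hidden + b2                      # second FC layer -> K scores
    logits = logits - logits.max()                 # shift for numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()  # softmax (an assumption)
    aggregated = np.tensordot(weights, outputs, axes=1)  # weighted sum over the K outputs
    return aggregated, weights

rng = np.random.default_rng(0)
K, H, W, hidden_dim = 10, 16, 16, 64               # 64 hidden nodes as in the text
outputs = rng.standard_normal((K, H, W))           # stand-in for regression network outputs
W1 = 0.1 * rng.standard_normal((hidden_dim, K))
b1 = np.zeros(hidden_dim)
W2 = 0.1 * rng.standard_normal((K, hidden_dim))
b2 = np.zeros(K)

aggregated, weights = attention_aggregate(outputs, W1, b1, W2, b2)
```

In training, the parameters of the two layers would be learned by minimizing the chosen loss; here they are random and only the data flow is shown.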
The training was performed in two ways. In the first approach, the neural network weights are first learned by minimizing the loss (9). Then, the weight is fixed and the attention network is trained by minimizing the loss with respect to . In the second approach, both the attention network and the ED network are trained simultaneously by minimizing the loss:
(18) 
While the first approach can reduce the computational time and memory requirement, this is only an approximation and the second training scheme provides significantly better results.
The overall network was trained using Adam optimization (kingma2014adam, ) with the momentum and . The proposed network was implemented in Python using the TensorFlow library (abadi2016tensorflow, ) and trained using an NVidia GeForce GTX 1080Ti graphics processing unit.
5 Experimental Results
Experiments were conducted for various inverse problems such as compressed sensing MRI (lustig2007sparse, ), energy-dispersive X-ray spectroscopy (EDX) (sole2007multiplatform, ) denoising, and super-resolution.
U-Net (ronneberger2015u, ) was used as our regression network for compressed sensing MRI and EDX denoising. The network was composed of four stages with convolution, batch normalization, ReLU, and skip connection with concatenation. Each stage is composed of three convolution layers followed by batch normalization and ReLU, except for the last layer, which is a convolution layer. The number of convolutional filters increases from 64 in the first stage to 1024 in the final stage. For the case of super-resolution, we employed the Deep Back-Projection Network (DBPN) (haris2018deep, ), which restores details by iteratively exploiting up- and down-sampling layers, as the base algorithm for the super-resolution task.
In compressed sensing MRI (lustig2007sparse, ), the goal is to recover high quality MR images from sparsely sampled k-space data to reduce the acquisition time. We performed supervised learning experiments with synthetic k-space data from the Human Connectome Project (HCP) MR dataset (https://db.humanconnectome.org). Among the 34 subject data sets, 28 subject data sets were used for training and validation. The other subject data sets were used for testing. As the input for the neural network, we downsampled the k-space with a uniform sampling pattern, which corresponds to the acceleration factors of . The label for the neural network is the fully acquired k-space data. A recent k-space learning algorithm (han2018k, ), which interpolates the missing elements in the k-space, is employed as the base deep learning algorithm. We trained the baseline algorithm and the proposed method under the same conditions except for the bootstrap subsampling of the k-space data in the proposed method. The number of random subsampling masks was set to 10, and the overall network was trained simultaneously by minimizing the loss (18). We also provide reconstruction results by GRAPPA (griswold2002generalized, ), which is a standard k-space interpolation method in MRI. We used six subjects of MR datasets to confirm the effectiveness of the proposed method. Thanks to the boosting, the proposed method provided nearly perfect reconstruction results compared to other algorithms as shown in Fig. 2. Moreover, its average peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) index significantly outperform the others (PSNR/SSIM: GRAPPA = 34.52 dB/0.72, Baseline = 38.67 dB/0.89, Noise2Boosting = 42.72 dB/0.92).
As for another experiment, we use the EDX data set which is mapped by STEM-EDX mode in transmission electron microscopy (TEM). EDX is widely used for nanoscale quantitative and qualitative elemental composition analysis by measuring the X-ray radiation from the interaction between high energy electrons and the material (mcdowell2012studying, ). However, the specimens can be quickly damaged by the high energy electrons, so the acquisition time should be reduced to the minimum. This usually results in very noisy and even incomplete images as shown in Fig. 3(a), and the goal is to denoise and interpolate the missing data. The main technical difficulty is that there is no label data, so we need an unsupervised learning technique. A widely used approach for EDX analysis is to use an explicit average kernel as shown in Fig. 3(c). Unfortunately, this often results in severe blurring when the measurement data is not sufficient. Noise2Noise and Noise2Void do not work, since there are no specific noise models for the EDX. In fact, this difficulty of EDX denoising was our original motivation for this work.
As for the input for training and inference, we use bootstrap sampled images from the measurement image in Fig. 3(a), and the network output is the measurement data in Fig. 3(a). We used 28 cases from the EDX dataset. The specimens are composed of quantum dots, where core and shell consist of Cadmium (Cd), Selenium (Se), Zinc (Zn), and Sulfur (S), respectively. For the bootstrap subsampling, the number of random subsampling masks was . The regression network was optimized to minimize the loss (9) with respect to first, after which the attention network was trained to properly aggregate the entire interpolated output. Since the regression network learns the measurement statistics, the network output provides more samples than the measured data and the attention network produces the aggregated image. This produces sharper and more accurate images as shown in Fig. 3(c).
For the case of super-resolution, the baseline network is trained using DIV2K (agustsson2017ntire, ), with a total of 800 training images, on the and (in both horizontal and vertical directions) super-resolution, and the training conditions followed those described in (haris2018deep, ). For our Noise2Boosting training, the number of random subsampling masks was set to 32 and 8 for the and tasks, respectively. In addition, the entire networks were trained simultaneously to minimize the loss (18). As described in Table 1, our N2B method can improve the performance of the super-resolution task. Thanks to the bootstrapping and aggregation using the attention network, the data distribution can be fully exploited to restore the high resolution components, which results in properly reconstructed details of the image as shown in Fig. 4.
6 Conclusion
In this paper, we proposed a novel boosting scheme of neural networks for various inverse problems with and without label data. Here, multiple input data were generated by bootstrap subsampling or random weight multiplication, after which the final result is obtained by aggregating the entire output of the network using an attention network. Experimental results with compressed sensing MRI, electron microscopy image denoising and super-resolution showed that the proposed method provides consistent improvement for various inverse problems.
Appendix A Explicit Form of Encoder and Decoder Matrices
The derivation in this appendix is just a brief summary of (ye2019cnn, ), but is included for self-containedness.
Consider a symmetric ED CNN without skip connections. The encoder network maps a given input signal to a feature space , whereas the decoder takes this feature map as an input, processes it, and produces an output . At the th layer, , and denote the dimension of the signal, the number of filter channels, and the total feature vector dimension, respectively. Here, the th channel output from the th layer encoder can be represented by a multi-channel convolution operation:
(19) 
where refers to the th input channel signal, denotes the tap convolutional kernel that is convolved with the th input channel to contribute to the th channel output, and is the pooling operator. Here, is the flipped version of the vector such that with the periodic boundary condition, and is the periodic convolution. The use of the periodic boundary conditions is to simplify the mathematical treatment of the boundary condition. Similarly, the th channel decoder layer convolution output is given by:
(20) 
where denotes the unpooling layer. By concatenating the multichannel signal in column direction as
the encoder and decoder convolution in (19) and (20) can be represented using matrix notation:
(21) 
where
(22) 
(23) 
and
Appendix B Proof of Propositions and Lemma
B.1 Proof of Proposition 1
Proof. For a fixed sampling pattern , can be considered as another neural network with the sampling mask at the first layer. Thus, we can easily see
Now, by taking expectation with respect to , we conclude the proof.
B.2 Proof of Proposition 2
Proof. Note that
where we use Jensen’s inequality. Then, we have
B.3 Proof of Lemma 1
Proof. In Proposition 6 of (ye2019cnn, ), it was shown that for the case of ReLU network. Accordingly, we have
where denotes the trace of . This concludes the proof.
B.4 Proof of Proposition 3
Proof. Note that subsampling is equivalent to multiplying the input vector by a mask. Therefore, the corresponding neural network output can be represented by
(24) 
Here, denotes a diagonal matrix with values, where the probability of 1 is . Accordingly, we have
where we use . The proof for the boosted samples is basically the same. This concludes the proof.
Appendix C Additional Results
Recall that one of the main limitations of Noise2Noise (lehtinen2018noise2noise, ) (N2N) is that it requires many noisy realizations of the same images, which is not usually feasible in many applications. Otherwise, during the training of N2N, different noise is separately added at each epoch to the clean image to generate multiple inputs and labels with different noise realizations. This implies that the ground-truth clean images are necessary for N2N training unless multiple noisy measurements are available.
On the other hand, we only require a single realization of noisy images for the training of our Noise2Boosting (N2B). The main idea is that, as discussed before, a pixel-wise random weight is multiplied to the noisy input data for boosting. The value of the random weight is randomly chosen between and . In order to diversify the image data, we found that noise augmentation was helpful. Specifically, we randomly generated Gaussian noise with noise standard deviation and added it to the noisy measurement data, before applying the random weighting. Note that this procedure is fundamentally different from Noise2Noise, which adds noise to the clean images, since our augmentation adds noise to the noisy measurement. Therefore, we need neither many noisy realizations of the same clean image nor the clean image to synthetically generate the noisy output.
For a fair comparison, a standard U-Net was employed for the N2N and N2B networks using DIV2K (agustsson2017ntire, ), with a total of 800 training images. The initial learning rate for both networks was set to 0.0003, and was halved every 50 epochs until it reached approximately 0.00001. A minibatch size of 8 was used in all experiments. As explained in (lehtinen2018noise2noise, ), synthetic Gaussian noise was added to the clean image for training N2N. Accordingly, during the training of N2N, the standard deviation of the additive Gaussian noise is randomly selected between 0 and 50. Furthermore, different noise realizations are separately added to the clean image to form the input and the label for training the network, respectively.
For inference, we used a single noisy image with for N2N. For N2B, we use boosted samples. As shown in Table 2, although our N2B network does not need clean images for training, the proposed N2B network generally outperformed the N2N network. In addition, the proposed N2B network can restore better details and texture than the N2N network as shown in Fig. 5.
Dataset  Noisy Image (PSNR [dB] / SSIM)  Noise2Noise (lehtinen2018noise2noise, ) (PSNR [dB] / SSIM)  Proposed (PSNR [dB] / SSIM)
set5  20.620 / 0.353  29.180 / 0.756  29.117 / 0.742
set14  20.527 / 0.395  27.847 / 0.711  28.049 / 0.725
bsd  20.477 / 0.390  27.200 / 0.691  27.648 / 0.715
References
 (1) B. Efron, “The estimation of prediction error: covariance penalties and cross-validation,” Journal of the American Statistical Association, vol. 99, no. 467, pp. 619–632, 2004.
 (2) R. J. Tibshirani and S. Rosset, “Excess optimism: How biased is the apparent error of an estimator tuned by SURE?” Journal of the American Statistical Association, pp. 1–16, 2018.
 (3) M. Anthony and P. L. Bartlett, Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
 (4) P. Stoica and Y. Selen, “Model-order selection: a review of information criterion rules,” IEEE Signal Processing Magazine, vol. 21, no. 4, pp. 36–47, 2004.
 (5) C. L. Mallows, “Some comments on Cp,” Technometrics, vol. 15, no. 4, pp. 661–675, 1973.
 (6) H. Akaike, “A new look at the statistical model identification,” in Selected Papers of Hirotugu Akaike. Springer, 1974, pp. 215–222.
 (7) D. L. Donoho and I. M. Johnstone, “Adapting to unknown smoothness via wavelet shrinkage,” Journal of the American Statistical Association, vol. 90, no. 432, pp. 1200–1224, 1995.
 (8) O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
 (9) Y. Han and J. C. Ye, “Framing U-Net via deep convolutional framelets: Application to sparse-view CT,” IEEE transactions on medical imaging, vol. 37, no. 6, pp. 1418–1429, 2018.
 (10) J. C. Ye and W. K. Sung, “Understanding geometry of encoder-decoder CNNs,” Proceedings of the 2019 International Conference on Machine Learning (ICML). also available as arXiv preprint arXiv:1901.07647, 2019.
 (11) J. C. Ye, Y. Han, and E. Cha, “Deep convolutional framelets: A general deep learning framework for inverse problems,” SIAM Journal on Imaging Sciences, vol. 11, no. 2, pp. 991–1048, 2018.
 (12) J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, “Noise2Noise: learning image restoration without clean data,” in International Conference on Machine Learning, 2018, pp. 2971–2980.
 (13) S. Soltanayev and S. Y. Chun, “Training deep learning based denoisers without ground truth data,” in Advances in Neural Information Processing Systems, 2018, pp. 3257–3267.
 (14) S. Ramani, T. Blu, and M. Unser, “Monte-Carlo SURE: a black-box optimization of regularization parameters for general denoising algorithms,” IEEE Transactions on image processing, vol. 17, no. 9, pp. 1540–1554, 2008.
 (15) A. Krull, T.-O. Buchholz, and F. Jug, “Noise2Void - learning denoising from single noisy images,” arXiv preprint arXiv:1811.10980, 2018.
 (16) B. Efron and R. J. Tibshirani, An introduction to the bootstrap. CRC press, 1994.
 (17) S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
 (18) E. Hoffer, R. Banner, I. Golan, and D. Soudry, “Norm matters: efficient and accurate normalization schemes in deep networks,” in Advances in Neural Information Processing Systems, 2018, pp. 2160–2170.
 (19) M. Cho and J. Lee, “Riemannian approach to batch normalization,” in Advances in Neural Information Processing Systems, 2017, pp. 5225–5235.
 (20) T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” arXiv preprint arXiv:1802.05957, 2018.
 (21) D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv preprint arXiv:1607.08022, 2016.
 (22) L. Breiman, “Bagging predictors,” Machine learning, vol. 24, no. 2, pp. 123–140, 1996.
 (23) R. E. Schapire, “A brief introduction to boosting,” in Ijcai, vol. 99, 1999, pp. 1401–1406.
 (24) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 (25) M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: A System for Large-Scale Machine Learning.” in OSDI, vol. 16, 2016, pp. 265–283.
 (26) M. Lustig, D. Donoho, and J. M. Pauly, “Sparse MRI: The application of compressed sensing for rapid MR imaging,” Magn. Reson. Med., vol. 58, no. 6, pp. 1182–1195, 2007.
 (27) V. Solé, E. Papillon, M. Cotte, P. Walter, and J. Susini, “A multiplatform code for the analysis of energy-dispersive X-ray fluorescence spectra,” Spectrochimica Acta Part B: Atomic Spectroscopy, vol. 62, no. 1, pp. 63–68, 2007.
 (28) M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1664–1673.
 (29) Y. Han and J. C. Ye, “k-Space Deep Learning for Accelerated MRI,” arXiv preprint arXiv:1805.03779, 2018.
 (30) M. A. Griswold, P. M. Jakob, R. M. Heidemann, M. Nittka, V. Jellus, J. Wang, B. Kiefer, and A. Haase, “Generalized autocalibrating partially parallel acquisitions (GRAPPA),” Magn. Reson. Med., vol. 47, no. 6, pp. 1202–1210, 2002.
 (31) M. T. McDowell, I. Ryu, S. W. Lee, C. Wang, W. D. Nix, and Y. Cui, “Studying the kinetics of crystalline silicon nanoparticle lithiation with in situ transmission electron microscopy,” Advanced Materials, vol. 24, no. 45, pp. 6034–6041, 2012.
 (32) E. Agustsson and R. Timofte, “NTIRE 2017 challenge on single image super-resolution: Dataset and study,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 126–135.