Pioneered by AlexNet, deep learning models such as convolutional neural networks (CNNs) have achieved remarkable success in solving computer vision problems [2, 3, 4, 5, 6, 7]. One major research interest is to design powerful architectures that extract more discriminative features from data; examples include ResNet and DenseNet. Moreover, to ease the training of deep networks and to alleviate over-fitting, many techniques have been proposed. Typical works include batch normalization (BN), dropout, ReLU [9], ELU, and SELU. Besides, many domain-specific techniques have been developed to further fine-tune networks for specific applications; a detailed discussion of such techniques is beyond the scope of this work.
Another big bottleneck in deep learning is the lack of real data. Training deep CNNs relies on a high volume of data, but this condition is not always satisfied in real scenarios. Therefore, data augmentation becomes a feasible and indispensable approach to increase data diversity. Typical operations such as image rotation, flipping, and shifting are widely used in data pre-processing. Recently, generative adversarial networks (GANs) have also been widely utilized to generate synthetic data that cannot be differentiated even by discriminative models or human beings.
However, although a lot of effort has been made on the aforementioned aspects, it is still unclear how an individual training sample influences the generalization accuracy of a network. To clarify this problem, let us consider two questions. (1) Given a network and its training dataset, can we drop several training samples so that generalization accuracy is improved? (2) If so, how can we leverage the model to fit a subset of the given training dataset? Our work demonstrates that such training samples, which we call unfavorable training samples, do exist. We propose a two-round training approach to improve a CNN's generalization accuracy by dropping those samples, and we name the dropping step data dropout, a scheme for training data optimization. Specifically, we train a network with the given training set in the first round; then, for each training sample, we compute the influence of removing it on the loss across all validation samples. If the influence value is positive, implying that its removal will reduce the whole validation error, we drop that training sample. The training set can thus be rebuilt. In the second round, we use the reconstructed training set to retrain the network from scratch and obtain a new model, which is used for testing. To make our approach more general, we measure the influence of each training sample on a validation set instead of a testing set, because testing data is usually unavailable during the training stage. If no validation set is given originally, one can randomly separate a group of samples from the given training set as validation samples. Even though the network sees fewer training samples due to the removal of unfavorable samples, extensive experiments demonstrate that our data dropout scheme, implemented by the two-round training approach, can further boost the performance of state-of-the-art networks such as ResNet and DenseNet.
Despite its simplicity, our approach does not rely on particular networks or training configurations. The only prerequisite is a network model that can fit the original training data for the specific task. Therefore, it is convenient to apply our approach to existing CNN models.
It is worth noting that our approach is essentially different from further training or fine-tuning, because we retrain the model from scratch in the second round. The model trained in the first round is only utilized to compute influence values for training data optimization.
The main contributions of this work are threefold.
Firstly, we propose the data dropout scheme to optimize the training set by removing unfavorable samples.
Secondly, we design a two-round training approach to leverage data dropout to improve generalization accuracy.
Thirdly, we conduct extensive experiments to demonstrate the effectiveness and generality of our approach in boosting the performance of existing CNN models that were designed for diverse computer vision problems such as image classification and image denoising.
II Related Work
In this section, we review related literature and several benchmark datasets that will be used in our experiments. We also briefly introduce image denoising, a low-level computer vision problem that will be adopted as an example application in the experiments section.
II-A Most Related Research
Our work is partially inspired by [13], but it is worth noting that our work differs in the following aspects. Firstly, the authors' work mainly concentrated on model behavior, while our work focuses on optimizing training data so that we can achieve even better performance with existing network models. Secondly, they studied the feasibility of approximating the influence of removing a training sample on the loss at a testing sample; however, they did not establish a criterion for unfavorable samples, while we explicitly propose this criterion in Section III-B.
II-B Image Classification
Image classification has been a classical task for evaluating CNNs. Well-known models such as All-CNN, ResNet, and DenseNet were originally proposed for this task. In our experiments, we adopt four widely used datasets: the two CIFAR datasets, the SVHN (Street View House Numbers) dataset, and the ImageNet dataset. The CIFAR-10 and CIFAR-100 datasets contain 10 and 100 classes of color images, respectively. There are 50,000 training samples and 10,000 testing samples, all of size 32×32. The SVHN dataset contains 73,257 training images and 26,032 testing images belonging to 10 classes; there are also 531,131 images in the additional training set. All the images have a dimension of 32×32. The ImageNet dataset contains 1.28 million images for training, 50,000 images for validation, and 100,000 images for testing, with 1,000 classes in total. In practice, all these color images can be cropped to a fixed size, such as 224×224.
II-C Image Denoising
Image denoising has been a long-standing open and challenging low-level computer vision problem. The degradation is usually modeled as $y = x + v$, where $x$ denotes the latent clean image, $v$ the additive Gaussian noise, and $y$ the corrupted observation. In addition to image prior methods, discriminative learning based approaches have been widely applied to denoising research. Typical works include MLP, CSF, NLNet, and DnCNN, which presented very competitive results. Unlike other methods that aim to learn the latent clean image directly, DnCNN leverages residual learning to learn the noise $v$; the clean image can then be restored by subtracting the learned noise from the corrupted observation $y$. For a fair comparison, we directly adopt the DnCNN model and its initial configurations in our experiments.
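The residual-learning formulation above means the restored image is simply the noisy input minus the network's noise estimate. A minimal numpy sketch, with a hypothetical `noise_estimator` standing in for the trained network:

```python
import numpy as np

def restore(noisy, noise_estimator):
    """DnCNN-style residual learning: the network predicts the noise v,
    and the clean image is recovered as x_hat = y - v_hat."""
    return noisy - noise_estimator(noisy)

# Toy check with an oracle estimator: if the predicted noise equals the
# true noise, the clean image is recovered exactly.
rng = np.random.default_rng(0)
clean = rng.random((16, 16))
noise = rng.normal(scale=0.1, size=(16, 16))
noisy = clean + noise
restored = restore(noisy, lambda y: noise)   # oracle estimator
print(np.allclose(restored, clean))          # True
```

In practice `noise_estimator` is the trained DnCNN; the oracle lambda here only illustrates the subtraction step.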
III The Proposed Method
To ease the discussion, we start by defining several notations. Let $z_i = (x_i, y_i)$ denote a training sample, and let $f_\theta$ denote a model, such as a CNN with input $x$ and parameters $\theta$. We write $L(z, \theta)$ for the loss, and $I(z_i, z_j^{val})$ for the influence of removing a training sample $z_i$ on the loss at a validation sample $z_j^{val}$. Here, $j = 1, \dots, m$, and $m$ equals the number of validation samples. The goal of training is to learn a set of parameters $\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta)$, where $L$ can be a typical loss function, such as the softmax loss or mean squared error (MSE).
III-A Influence Computation
According to the approximation theory discussed in [13], $I(z_i, z_j^{val})$ can be defined as below,

$I(z_i, z_j^{val}) = -\nabla_\theta L(z_j^{val}, \hat{\theta})^\top H_{\hat{\theta}}^{-1} \nabla_\theta L(z_i, \hat{\theta}),$

where $H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat{\theta})$ is the Hessian and is assumed to be positive definite. In our work, for each training sample $z_i$, we compute its influence on the loss over all validation samples instead of testing samples, since testing data should remain invisible until the testing phase. Hence, the total influence is $I(z_i) = \sum_{j=1}^{m} I(z_i, z_j^{val})$. The expensive factor is the inverse-Hessian-vector product $s_j := H_{\hat{\theta}}^{-1} \nabla_\theta L(z_j^{val}, \hat{\theta})$, which we approximate with the stochastic estimation method; see [13] for more details. In our experiments, we note that, for each training sample $z_i$, computing $I(z_i, z_j^{val})$ over all $z_j^{val}$ at one time is still computationally intensive; hence we slightly change the order of computation, which greatly improves efficiency. We detail the implementation tips in Section III-D.
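To make the influence formula concrete, the sketch below evaluates it exactly on a toy linear least-squares model, where the gradient and Hessian are available in closed form (the model, data, and variable names are our illustration, not the paper's code; deep networks instead approximate $H^{-1}$ stochastically as in [13]):

```python
import numpy as np

def loss_grad(theta, x, y):
    # Gradient of the per-sample loss 0.5 * (x . theta - y)^2 w.r.t. theta.
    return (x @ theta - y) * x

def influence(theta, H_inv, z_train, z_val):
    # I(z_i, z_j_val) = -grad_val^T  H^{-1}  grad_train
    xi, yi = z_train
    xj, yj = z_val
    return -loss_grad(theta, xj, yj) @ H_inv @ loss_grad(theta, xi, yi)

# Fit theta_hat on a tiny synthetic training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
theta = np.linalg.lstsq(X, y, rcond=None)[0]

H = (X.T @ X) / len(X)      # exact Hessian of the mean squared loss
H_inv = np.linalg.inv(H)

I_0 = influence(theta, H_inv, (X[0], y[0]), (X[1], y[1]))
print(I_0)
```

A positive value would indicate that removing the training sample is expected to reduce the loss at that validation sample.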
III-B Data Dropout Criterion
Once $I(z_i, z_j^{val})$ can be computed, we are able to compute the total influence $I(z_i)$ across all validation samples, which is used to approximate the change in validation loss caused by removing $z_i$:

$\sum_{j=1}^{m} \left( L(z_j^{val}, \hat{\theta}_{-z_i}) - L(z_j^{val}, \hat{\theta}) \right) \approx -\frac{1}{n} I(z_i),$

where $\hat{\theta}_{-z_i}$ is defined as the set of parameters learned after removing $z_i$ from the training set. In practice, we expect the left-hand side to be negative, which implies that removing the training sample decreases the total validation loss; this is equivalent to having $I(z_i) > 0$. Therefore, we set the criterion of data dropout as follows: for each $z_i$, if $I(z_i) > 0$, $z_i$ will be dropped from the training set; otherwise, it will be kept. The dropped $z_i$ is named an unfavorable sample in this context.
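The criterion itself is a one-line filter. In the sketch below (illustrative names, not from the paper), `total_influence[i]` holds the total influence of the i-th training sample summed over all validation samples; a positive value marks the sample as unfavorable:

```python
def data_dropout(train_samples, total_influence):
    """Keep only samples whose removal would not reduce validation loss."""
    kept = [z for z, infl in zip(train_samples, total_influence) if infl <= 0]
    dropped = [z for z, infl in zip(train_samples, total_influence) if infl > 0]
    return kept, dropped

kept, dropped = data_dropout(["a", "b", "c"], [-0.3, 0.7, 0.0])
print(kept, dropped)   # ['a', 'c'] ['b']
```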
It is important to note that we utilize the validation loss to closely reflect the potential testing loss when performing data dropout. This makes sense because, in machine learning, training data is commonly assumed to have a similar distribution as the potential testing data; otherwise, the problem may fall into the category of transfer learning, which is beyond the scope of this work. In this context, our validation data is usually separated from the original training data, and it can be assumed to have a similar distribution as the potential testing data. Since $I(z_u) > 0$ for an unfavorable training sample $z_u$, its removal is expected to decrease the loss on testing data as well, which indicates that the testing loss can be reduced by removing unfavorable training samples. In addition, all unfavorable samples are dropped at one time, and thus the network parameters are updated once; therefore, the removal of each individual unfavorable sample is independent of the others.
III-C Two-Round Training
As analyzed above, for an individual training sample $z_i$, we can compute $I(z_i, z_j^{val})$, where $z_j^{val}$ is a validation sample. As a result, we want to examine each training sample to decide whether to drop or keep it. In conventional learning, once training is done, the learned parameters are fixed, hence the testing error rate cannot be changed. Therefore, to make use of the computed influence to further decrease testing error rates, we propose a two-round training approach.
In the first round, we choose an arbitrary network suitable for the given task, and set up the training configuration according to conventional practices, such as ResNet for image classification. We train the model and obtain the learned network parameters $\hat{\theta}$ when training is done. Then, for each training sample $z_i$, we compute $I(z_i)$, the influence of removing $z_i$ on the loss over all validation samples, and remove unfavorable samples according to the data dropout criterion. Thus, a new training set can be rebuilt. In the second round, we use the same network and the same initial configuration as in the first round, but feed the reconstructed training set to the model and retrain it. When this round of training is complete, the resulting model is adopted as the final model for testing. Since the network is trained on the optimized training set in the second round, the learned parameters are quite different from $\hat{\theta}$, which is learned in the first round. We summarize our approach in the Algorithm.
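The two-round procedure above can be sketched as a small driver function. The callables here (`build_model`, `train`, `influence_fn`) are hypothetical placeholders, since the actual training and influence routines depend on the chosen network:

```python
def two_round_training(build_model, train, influence_fn, train_set, val_set):
    """Train-drop-train: round one locates unfavorable samples,
    round two retrains the same architecture from scratch."""
    # Round 1: train on the full set to obtain parameters theta_hat.
    model_1 = train(build_model(), train_set)

    # Total influence I(z_i) of each training sample over the validation set.
    influences = [influence_fn(model_1, z, val_set) for z in train_set]

    # Data dropout: remove samples whose removal lowers validation loss.
    reduced = [z for z, infl in zip(train_set, influences) if infl <= 0]

    # Round 2: same architecture and initial configuration, fresh parameters.
    model_2 = train(build_model(), reduced)
    return model_2, reduced

# Stub demonstration: a fake influence function flags the sample "bad".
model, reduced = two_round_training(
    build_model=lambda: {},
    train=lambda model, data: model,
    influence_fn=lambda model, z, val: 1.0 if z == "bad" else -1.0,
    train_set=["good", "bad", "good"],
    val_set=[],
)
print(reduced)   # ['good', 'good']
```

Note that round two calls `build_model()` again, reflecting the paper's point that the model is retrained from scratch rather than fine-tuned.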
Here, we briefly discuss the appropriate number of training rounds. In general, after the second round of training, we can still find a few unfavorable samples. For instance, for CIFAR-10 classification with ResNet-20, we show the number of located unfavorable samples after each round of training in Figure 1.
It can be seen that the curve is nearly monotonic, and far fewer unfavorable training samples remain after the first round of training. This fact indicates that it is not necessary to perform more rounds of training in order to locate more unfavorable samples. On the other hand, more rounds would be computationally intensive, which is undesirable in deep learning. In fact, we empirically observe that two rounds of training are sufficient to improve generalization accuracy. Therefore, our approach is two-round based, considering both accuracy and efficiency: the first round trains a model used for locating unfavorable samples, and the second round trains the same network from scratch on the optimized training set for testing purposes.
III-D Implementation Tips
According to the Algorithm and the analysis in Section III-C, for each training sample $z_i$, its gradient $\nabla_\theta L(z_i, \hat{\theta})$ is fixed, while $I(z_i, z_j^{val})$ needs to be computed across all validation samples. However, approximating $s_j = H_{\hat{\theta}}^{-1} \nabla_\theta L(z_j^{val}, \hat{\theta})$ is far more computationally intensive than computing a gradient. Therefore, in the implementation of the Algorithm, we first fix $z_j^{val}$ and compute $I(z_i, z_j^{val})$ for all $z_i$, and then repeat this for all $z_j^{val}$. In this way, we only need $m$ (the total number of validation samples) approximations of $s_j$. Otherwise, iterating over all $z_i$ first would need $n \times m$ approximations, where $n$ is the number of training samples. With this optimization, there are $m$ influence values for each training sample in the end; we sum these values to obtain the influence of removing each training sample on the loss at all validation samples. Although this optimization does not change the number of iterations in the Algorithm, it greatly reduces the number of approximation operations.
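The reordering described above amounts to computing the expensive inverse-Hessian-vector product once per validation sample and reusing it against every training gradient. A numpy sketch under that assumption (an exact inverse is used here for clarity; the paper approximates it stochastically):

```python
import numpy as np

def total_influences(H_inv, train_grads, val_grads):
    """Total influence of removing each training sample, summed over the
    validation set: I(z_i) = sum_j -val_grad_j^T H^{-1} train_grad_i.

    Naive order: n * m inverse-Hessian-vector products.
    Reordered:   m products (one s_j per validation gradient), then cheap dots.
    """
    # One expensive solve per validation sample.
    s = [H_inv @ g_val for g_val in val_grads]                    # m solves
    # Cheap inner products for every (train, val) pair, summed over val.
    return np.array([-sum(sj @ g_tr for sj in s) for g_tr in train_grads])

rng = np.random.default_rng(1)
H_inv = np.eye(4)                       # toy symmetric inverse Hessian
train_grads = rng.normal(size=(6, 4))   # n = 6 training gradients
val_grads = rng.normal(size=(3, 4))     # m = 3 validation gradients
I = total_influences(H_inv, train_grads, val_grads)
print(I.shape)   # (6,)
```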
III-E Difference from 'Leave-one-out' Retraining
It is worth noting that our approach is essentially different from 'leave-one-out' (LOO) retraining. For each training sample $z_i$, to compute the influence on the validation loss, LOO needs to retrain the network after removing $z_i$ from the training set. Hence it needs $n$ retrainings to investigate all $n$ training samples, which is not feasible in deep learning. Our approach, instead, computes the influence on the validation loss for all training samples at one time after the first round of training.
Our training method may look like a fine-tuning technique; however, it is essentially different from traditional fine-tuning, because in the second round of training we restore the network to its initial configuration. In fact, the result of the first round of training is completely discarded when we start the second round, since that result is only needed for data dropout, which optimizes the training data.
Despite its simplicity, our approach does not rely on particular models or applications. The model used to solve a domain-specific problem can be either an existing model or a customized network, and there is no restriction on hyper-parameter settings, either. One only needs to follow a train-drop-train manner to achieve further improvement on the testing accuracy of the selected network. In addition, our approach is still applicable when there is no validation data, because some training samples or another dataset can be chosen for validation purposes, as long as the selected validation data has a similar distribution as the potential testing data. This scheme has proven effective in our experiments.
Moreover, our approach can not only improve the state-of-the-art baselines, as shown in the experiments, but also improve the performance of an arbitrary model, even if the model is simpler in structure. For instance, the All-CNN model has a simpler structure than DenseNet and does not give state-of-the-art baselines for the image classification problem, but its performance can still be boosted by the proposed scheme, as shown in Table I.
IV Experiments
To validate the effectiveness of the data dropout and two-round training approach, we conduct extensive experiments on image classification and image denoising. We choose well-known networks for each task and follow common practices to train them for reasonable evaluations. All experiments are implemented in TensorFlow with the Keras API.
In all our experiments, after removing unfavorable training samples, we do not add additional samples to the training sets and keep the initial batch size unchanged, thus allowing a 'not-full' final batch. Moreover, to better estimate the influence of each training sample, data augmentation is turned off in the first round of training and only used in the second round for all experiments except SVHN, where we turn off data augmentation in both rounds to follow common practices.
IV-A Image Classification
For the CIFAR-10 and CIFAR-100 datasets, we separate 5,000 images from the training set as validation data, and the remaining 45,000 images are used for training. In the second round of training, horizontal flipping and translation are adopted for data augmentation.
For the SVHN dataset, we constitute the validation set with 4,000 images from the training set and 2,000 images from the additional training set, sampled evenly from the 10 classes. We pre-process the images by subtracting the mean and dividing by the standard deviation.
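The pre-processing step here is plain standardization; a minimal sketch (per-channel statistics over the training set are one common choice; the exact axes are our assumption, not stated in the text):

```python
import numpy as np

def standardize(images, mean=None, std=None):
    """Subtract the mean and divide by the standard deviation.
    Statistics are computed on the training set and reused for test data."""
    if mean is None:
        mean = images.mean(axis=(0, 1, 2), keepdims=True)   # per channel
        std = images.std(axis=(0, 1, 2), keepdims=True)
    return (images - mean) / (std + 1e-8), mean, std

train = np.random.rand(100, 32, 32, 3).astype(np.float32)
normed, mu, sigma = standardize(train)
print(normed.mean(), normed.std())   # close to 0 and 1
```

At test time the saved `mu` and `sigma` are passed back in, so test images are normalized with training-set statistics.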
By adopting our approach, we re-evaluate the following three well-known image classification networks: ResNet, DenseNet, and All-CNN. We use the models directly without changing the architectures; one can refer to the original papers for architecture details. For each round of training, we adopt the MSRA method to initialize parameters for ResNet and DenseNet, and the Xavier method to initialize All-CNN.
ResNet. For a fair comparison, we follow the practices of the original paper for ResNet evaluation. Firstly, we re-evaluate ResNet-110 with our approach on the two CIFAR datasets. The model is trained with the stochastic gradient descent (SGD) optimizer with a mini-batch size of 128, weight decay of 0.0001, and momentum of 0.9. The initial learning rate is set to 0.1 and reduced to 0.01 and 0.001 at epochs 250 and 375 out of 500 epochs, respectively. Secondly, we re-evaluate ResNet-152 on the SVHN dataset. We train the model for 50 epochs, and the learning rate is reduced to 0.01 and 0.001 at epochs 30 and 35, respectively, from the initial value of 0.1. Other hyper-parameter settings are kept unchanged from the CIFAR experiments.
DenseNet. Although DenseNet has several versions, we choose to re-evaluate the basic version (DenseNet-40), which has no bottleneck layers or compression. There are 16 filters in the initial layer, and the growth rate is set to 12. We train the model for 300 and 40 epochs on the CIFAR and SVHN datasets, respectively, with a mini-batch size of 64. The initial learning rate is set to 0.1 and reduced to 0.01 and 0.001 at 50% and 75% of the total number of epochs, respectively. Training is still optimized by SGD with a momentum of 0.9 and weight decay of 0.0001.
Note that for the two CIFAR datasets, since data augmentation is turned off in the first round of training, a dropout layer with a rate of 0.2 follows each convolutional layer except the first one to avoid overfitting. In the second round of training, we do not add dropout operations since data augmentation is turned back on. For the SVHN dataset, given that there is no data augmentation throughout the training process, dropout layers are added in both rounds of training.
All-CNN. We also evaluate our approach with a typical sequential network, All-CNN, on the two CIFAR datasets. In this model, max-pooling layers are replaced by regular convolutional layers with a stride of 2. We take the most advanced version of All-CNN, named All-CNN-C in the original paper. Each block of this network contains two convolutional layers with a stride of 1 and one convolutional layer with a stride of 2.
We train the network using the SGD optimizer with a momentum of 0.9 and weight decay of 0.001. The model is trained for 350 epochs, and the initial learning rate is set to 0.1; we adjust it by multiplying by a fixed factor of 0.1 after 200, 250, and 300 epochs. For a fair comparison, in the second round of training we only augment the data by horizontal flipping and translation of at most 5 pixels. The pre-processing steps include whitening and normalization.
| Test error (%) | CIFAR-10 | CIFAR-100 | SVHN |
| ResNet-110 (reported by ) | 6.41 | 27.22 | - |
| ResNet-152 (reported by ) | - | - | 2.01 |
Analysis. Table I lists the performance of the three networks trained with and without our approach. As can be seen, our two-round training with data dropout decreases the test error rates of all three networks on all datasets, and the improvement on the two CIFAR datasets is greater than that on SVHN. This is because the images in the two CIFAR datasets contain more complicated scenarios; therefore, dropping unfavorable training samples has a larger probability of removing disturbing features.
The largest margin of improvement occurs on the All-CNN model. The reason can be attributed to its architecture, which is a sequential model in nature: only adjacent layers are connected, and there is no skip connection to feed different levels of features into subsequent layers. Therefore, this network is more subject to the influence of unfavorable samples. ResNet and DenseNet, in contrast, can learn more discriminative features, hence they are more robust to unfavorable samples. A similar interpretation applies to the comparison between ResNet and DenseNet: our approach achieves a larger performance gain on ResNet than on DenseNet, and this holds true across all three datasets. It indicates that data dropout indirectly removes more disturbing features for ResNet, and relatively fewer for DenseNet, owing to the latter's stronger ability to learn discriminative features that suppress disturbing ones. In fact, our results indirectly confirm that DenseNet outperforms ResNet, which in turn performs better than All-CNN.
In addition, we report the numbers of unfavorable training samples in Table II. As can be seen, for the same training set, the amount differs among the three networks: data dropout locates more unfavorable samples for All-CNN and fewer for ResNet and DenseNet. This implies that our approach can improve inferior models by a larger margin. As visual examples, we list in Figure 2 several unfavorable training samples picked from the CIFAR-10 dataset by the proposed data dropout scheme.
IV-B Large Scale Image Classification
To validate the effectiveness of the proposed approach on a very large dataset, we conduct experiments on ImageNet, a benchmark dataset in image classification. We follow common practices [1, 5, 6] to pre-process the images: each image or its horizontal flip is randomly cropped to size 224×224, the per-pixel mean value is subtracted, and the standard color augmentation is applied. We choose ResNet-18 and ResNet-34 as the base networks and train them using the Algorithm described in Section III-C.
We use the SGD optimizer to train both networks for 60 epochs. The momentum and weight decay are set to 0.9 and 0.001, respectively. We choose 0.1 as the initial learning rate and reduce it by multiplying by 0.1 whenever the error has not decreased in the past three epochs. Following common practices, we compare validation errors (with 10-crop) in Table III. As it shows, the data dropout and two-round scheme effectively boost the existing networks on this very large dataset. Similarly, in Table IV, we give the number of unfavorable training samples removed from the original ImageNet training set.
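The learning-rate rule used here (multiply by 0.1 when the error has not improved for three epochs) can be expressed as a small stateful scheduler. This is our own sketch of the stated rule, not the paper's code:

```python
class PlateauScheduler:
    """Multiply the learning rate by `factor` when the monitored error
    has not decreased for `patience` consecutive epochs."""

    def __init__(self, lr=0.1, factor=0.1, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def step(self, error):
        if error < self.best:
            self.best = error
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr

sched = PlateauScheduler()
for err in [0.9, 0.8, 0.8, 0.8, 0.8]:   # error stalls after epoch 2
    lr = sched.step(err)
print(lr)   # reduced once, ~0.01
```

Keras users could instead rely on the built-in `ReduceLROnPlateau` callback, which implements the same idea.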
Table III: validation error rates without and with the Algorithm.
IV-C Image Denoising
As discussed in Section II-C, we re-evaluate DnCNN without changing its default configuration. A noisy input is fed into the network, and the output is the learned noise; a clean image is obtained by subtracting the learned noise from the noisy input. Since our purpose is to validate the proposed data dropout scheme and two-round training approach, we only measure the effects for gray-scale image denoising with a known noise level; however, the method can be easily extended to color image denoising with random noise levels, since our approach is general and independent of models and applications.
We build the training set in a similar way as in the DnCNN work for the first round of training: 400 clean images are selected from the Berkeley segmentation dataset (BSD500), each image is randomly cropped to a new image of size 180×180, and patches of size 40×40 are further cropped from these images. Prior to training, additive white Gaussian noise with a known level is added to the clean images to form noisy inputs. The 12 gray-scale images commonly used in image processing research, shown in Figure 3, are used for testing.
Note that in the DnCNN work, no validation set was used. However, our approach needs validation data to find unfavorable training samples after the first round of training. Therefore, we use the BSD68 dataset for validation, without cropping; there is no common image between the training set and the BSD68 dataset. It is also important to highlight that, when evaluating $I(z_i, z_j^{val})$, $z_i$ refers to a training image patch, while $z_j^{val}$ refers to a full-size validation image.
Similarly, we adopt the MSRA method to initialize the network parameters in both rounds of training. We train the network using the Adam optimizer for 50 epochs. The initial learning rate is set to 0.001 for the first 30 epochs and adjusted to 0.0001 afterwards; other default hyper-parameters of the Adam solver remain unchanged. The mini-batch size is set to 128. No data augmentation is applied in the first round of training, whereas it is used in the second round.
We evaluate the trained model on the testing data and compare its performance with the original DnCNN in Table V. Here, the quality of restored images is measured by the peak signal-to-noise ratio (PSNR), where larger values indicate better denoising results. It can be seen that adopting our training approach increases the average PSNR by around 0.04 dB for the given noise level, which is an acceptable gain in image denoising. For the images House and Couple, our results are inferior to those of the original DnCNN. This is because, when performing data dropout, the influence of each training sample is estimated over the whole validation set, so better performance cannot be guaranteed for every individual testing sample. We illustrate the visual effects in Figure 4. Besides the original DnCNN, we also show the result of BM3D, an image denoising method widely used in engineering. As can be seen, the original DnCNN outperforms BM3D by a large margin, and our approach further improves DnCNN.
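PSNR for 8-bit images is computed from the mean squared error between the clean and restored images; a standard sketch:

```python
import numpy as np

def psnr(clean, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means better restoration."""
    mse = np.mean((clean.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

clean = np.zeros((8, 8))
noisy = np.full((8, 8), 16.0)        # every pixel off by 16, so MSE = 256
print(round(psnr(clean, noisy), 2))  # 24.05
```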
In this section, we would like to provide several useful insights and discussions to help readers better understand our approach.
One may argue that further training a network could increase generalization accuracy; however, it can hardly bring a remarkable improvement when following the practices of the original ResNet, DenseNet, and All-CNN papers. Those published results were measured over five independent runs, and the best ones were reported. Before adopting our approach, we also attempted to improve those published results by increasing the number of training iterations, but better testing results could not be obtained once training had converged. On the other hand, although there are two rounds of training in our scheme, the second round is essentially different from further training or fine-tuning, because we train the model from scratch. In fact, the first round of training can be viewed as a pre-processing step that optimizes a given training dataset by reducing its size.
Different networks do share several unfavorable samples on the same training dataset, but the numbers differ, because a more powerful network is more robust to unfavorable samples. That is, a powerful network such as DenseNet treats fewer samples as unfavorable, whereas an inferior model such as All-CNN-C treats more samples as unfavorable. It is similar to human perception: a capable person usually takes in surroundings positively, while a pessimist may perceive more negative things.
Improving the computational efficiency of locating unfavorable samples could be a useful direction for future work. For a specific model, assume the regular training time is $T_1$ (the first round); our approach will cost $T_1 + T_2 + T_d$ in total, where $T_2$ denotes the time of the second round of training and $T_d$ the time of data optimization. Here, $T_2$ is less than $T_1$ because the second round of training uses a reduced training set, and $T_d$ is much less than $T_1$ because there is no back-propagation in data optimization. For a very large dataset, $T_d$ could still be large; however, our work provides a practical way to make the trade-off, and it can be used when domain-specific accuracy is highly desired.
In this paper, to further boost the performance of existing CNNs, we propose the data dropout scheme, which optimizes training data by removing unfavorable samples. We theoretically analyze the criterion of data dropout and point out that it is convenient to apply in practice. To make use of the proposed scheme, we design a two-round training approach that is general and can be easily integrated with existing networks and model configurations. Our experiments demonstrate the effectiveness of our approach on several well-known CNN models dealing with typical computer vision tasks.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper with convolutions. In CVPR, 2015.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
-  Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
-  Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. arXiv preprint arXiv:1706.02515, 2017.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730, 2017.
-  Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
-  Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
-  Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
-  Harold C Burger, Christian J Schuler, and Stefan Harmeling. Image denoising: Can plain neural networks compete with bm3d? In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2392–2399. IEEE, 2012.
-  Uwe Schmidt and Stefan Roth. Shrinkage fields for effective image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2774–2781, 2014.
-  Stamatios Lefkimmiatis. Non-local color image denoising with convolutional neural networks. arXiv preprint arXiv:1611.06757, 2016.
-  Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, 2017.
-  Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
-  Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
-  Francois Chollet et al. Keras. https://github.com/keras-team/keras, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
-  Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
-  Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
-  D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int’l Conf. Computer Vision, volume 2, pages 416–423, July 2001.
-  Stefan Roth and Michael J Black. Fields of experts. International Journal of Computer Vision, 82(2):205, 2009.
-  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on image processing, 16(8):2080–2095, 2007.