It is generally difficult to solve some image processing problems without complicated procedures. For instance, in natural image edge detection, classical edge detectors such as the Canny operator fail to produce satisfactory results (Ganin and Lempitsky 2014). Textures may be extremely complex, and in some cases we expect a model to output specific edges rather than every edge in the image.
In fact, many of these problems can be treated as mapping an input image to a corresponding output image. Various fully convolutional networks (FCNs) have achieved great success on image-to-image tasks and allow them to be tackled end to end. The concept of the FCN was first introduced for pixel-to-pixel semantic segmentation by removing fully connected layers and inserting upsampling layers at the end of the network (Long, Shelhamer and Darrell 2015).
With this approach, many classification networks can be converted for image-to-image mapping (Xie and Tu 2015; Liu et al. 2017). These networks use upsampling layers to restore feature maps of different resolutions and a fusion layer to output the final image. They show good performance in edge detection of natural images. Apart from modifying classification networks, (Ronneberger, Fischer and Brox 2015) design an encoder-decoder structure consisting of a contracting path and multiple expanding paths to capture context at different scales. It has been applied successfully to biomedical image segmentation.
Trying and training different architectures empirically can yield satisfactory results, but a clear understanding of why performance improves is still lacking. Theoretical analyses of the loss surfaces of classification neural networks have been proposed recently (Nguyen and Hein 2018; Liang et al. 2018). (Li et al. 2017) show, using a visualization method, that the residual mapping structure can lead to a nearly convex objective function. However, there is no similar analysis of FCNs, which have totally different structures and objective functions from classification models. Therefore, we explore FCN-based networks by choosing a solution and projecting its vicinity along two random vectors. To the best of our knowledge, our paper is the first to provide a visualized insight into the geometry of the minimizers of FCNs and their generalization abilities.
The rest of this paper is organized as follows. Firstly, representative FCN-based models and visualization approaches are discussed. Then we introduce the objective function, visualization method as well as different architectures used for comparison in this study. Thirdly, we evaluate these FCNs on multiple datasets and visualize the optimization landscapes to understand their generalization abilities. Detailed analysis is also provided. Finally, we conclude the paper.
Fully Convolutional Networks
The fully convolutional network (FCN) was first proposed in (Long, Shelhamer and Darrell 2015), which makes it possible for image classification networks to output segmentation results. In general, fully connected layers are applied at the end of a CNN to predict class scores: feature maps are flattened and fed into the classifier layers, and spatial information is lost in this flattening operation. Long, Shelhamer and Darrell replace the fully connected layers with upsampling or deconvolution layers. As a result, image classification models can operate on inputs of any size and generate output images of corresponding size.
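The size-independence described above can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions, not the architecture of (Long, Shelhamer and Darrell 2015): a 1×1 convolution stands in for the per-pixel classifier, nearest-neighbour upsampling stands in for a learned deconvolution layer, and the names `conv1x1` and `upsample_nearest` are hypothetical.

```python
import numpy as np

def conv1x1(features, weight, bias):
    """Apply the same linear classifier at every spatial location.

    features: (C, H, W) feature maps; weight: (K, C); bias: (K,).
    Returns per-pixel scores of shape (K, H, W).
    """
    return np.einsum("kc,chw->khw", weight, features) + bias[:, None, None]

def upsample_nearest(scores, factor):
    """Nearest-neighbour upsampling of (K, H, W) scores by an integer factor."""
    return scores.repeat(factor, axis=1).repeat(factor, axis=2)

rng = np.random.default_rng(0)
weight, bias = rng.standard_normal((3, 8)), np.zeros(3)

# The same weights handle feature maps of different spatial sizes,
# because no layer depends on a fixed flattened length.
for height, width in [(4, 6), (10, 3)]:
    features = rng.standard_normal((8, height, width))
    output = upsample_nearest(conv1x1(features, weight, bias), factor=16)
    print(features.shape, "->", output.shape)
```

A flatten-then-classify head would instead require a weight matrix whose size is tied to one specific H×W, which is why it cannot accept inputs of arbitrary size.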
Many basic classification networks, such as VGG-Net (Simonyan and Zisserman 2015), can be used as a backbone for learning image mapping by leveraging the techniques in (Long, Shelhamer and Darrell 2015). The Holistically-nested Edge Detector (HED) extracts features with different resolutions from five layers of VGG-Net and upsamples them to the original scale (Xie and Tu 2015). At the end of the model, a fusion layer concatenates the feature maps along a given dimension and outputs the corresponding result. In addition, both SegNet and DeconvNet consist of an encoder network and a decoder network, where the encoder topology is identical to VGG-Net (Noh, Hong and Han 2015; Vijay, Alex and Roberto 2017). These two models learn to upsample using transposed convolution and then convolve with kernels to produce feature maps for segmentation.
Instead of modifying classification models, another type of network consists of a contracting path to capture context and multiple symmetric expanding paths that enable precise localization. U-Net is one of the most representative models; it was initially designed for biomedical image segmentation and won the ISBI cell tracking challenge in 2015 by a large margin (Ronneberger, Fischer and Brox 2015). Further variations of U-Net have been proposed to solve different tasks in the literature (Milletari, Navab and Ahmadi 2016; Litjens et al. 2017). (Santhanam, Morariu and Davis 2017) create a similar framework and achieve good performance on relighting, denoising and colorization.
Unlike the former methods, (Zhang et al. 2017) design an FCN for image denoising that removes all pooling layers and keeps the feature maps at the input image size throughout the network. Batch normalization and residual learning are also integrated to speed up training and improve performance. They achieve decent denoising results, though this architecture may have a relatively small receptive field and consume more computing resources.
Although a large number of FCNs have been designed for various tasks, visualization methods have only been applied to classification networks. It is well known that the objective function of a deep neural network lies in an extremely high-dimensional space. As a result, we can only visualize it in 2D or 3D space.
A widely used 2D plotting method is linear interpolation, which was first applied to study local minimizers under different batch size settings (Dinh et al. 2017; Keskar et al. 2017). An alternative approach is projection. (Goodfellow, Vinyals and Saxe 2015) use this approach to explore the trajectories of different optimization algorithms. (Li et al. 2017) study the relationship between loss landscapes and the generalization abilities of models by projection. Note that these studies only focus on classification networks, whose structures and objective functions are entirely different from those of FCNs.
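The linear interpolation method mentioned above can be sketched as follows. This is a minimal illustration, assuming the network parameters are flattened into a single vector and that `loss_fn` is any callable returning a scalar loss; the function name `interpolate_loss` and the toy quadratic loss are hypothetical and are not taken from the cited works.

```python
import numpy as np

def interpolate_loss(loss_fn, theta_a, theta_b, alphas):
    """Evaluate the loss along the straight line between two solutions.

    For each alpha, the loss is computed at (1 - alpha) * theta_a + alpha * theta_b,
    so alpha = 0 gives theta_a and alpha = 1 gives theta_b.
    """
    return np.array([loss_fn((1.0 - a) * theta_a + a * theta_b) for a in alphas])

# Toy example: a quadratic bowl with two "solutions" on opposite sides.
toy_loss = lambda theta: float(np.sum(theta ** 2))
curve = interpolate_loss(toy_loss,
                         np.array([1.0, 0.0]),
                         np.array([-1.0, 0.0]),
                         np.linspace(0.0, 1.0, 5))
print(curve)
```

Plotting `curve` against `alphas` gives the 2D picture: one axis for the interpolation coefficient and one for the loss value.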
Image-to-image mapping problems can be cast as pixel-wise regression or classification using FCNs. The main difference is whether the output is continuous or categorical.
We train networks as a regression problem since pixel-wise class labels can be difficult to obtain in practice. For instance, the raindrop removal dataset only provides original images as labels. Therefore, we use mean square error (MSE) as the objective function, which can be expressed as follows:
$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{D} \left\| f(x_i; \theta) - y_i \right\|_2^2 \tag{1}$$
where $f(x_i; \theta)$ is the network output for input image $x_i$ under parameters $\theta$, $y_i$ is the corresponding target image, $N$ is used to represent the number of samples and $D$ denotes the dimension of outputs. Generally, the loss and MSE loss are almost the same in terms of regression accuracy, while sometimes the result of MSE is slightly better (Zhang et al. 2016).
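For concreteness, the MSE objective can be sketched in NumPy as below. The normalization is an assumption made for illustration: the squared error is summed over all output values and divided by the number of samples `n` times the number of output values per sample `d`. The function name `mse_loss` and the array layout (samples along the first axis) are likewise hypothetical.

```python
import numpy as np

def mse_loss(pred, target):
    """Mean square error over n samples with d output values each.

    pred, target: arrays of identical shape (n, ...); the trailing
    dimensions are the output image dimensions.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    n = pred.shape[0]   # number of samples
    d = pred[0].size    # output dimension per sample
    return np.sum((pred - target) ** 2) / (n * d)

# Two single-channel 2x2 outputs that are off by exactly one everywhere.
pred = np.zeros((2, 1, 2, 2))
target = np.ones((2, 1, 2, 2))
print(mse_loss(pred, target))
```

With this normalization the value equals `np.mean((pred - target) ** 2)`, so it does not depend on the image size or the batch size.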
Visualization of Optimization Landscapes
Following (Goodfellow, Vinyals and Saxe 2015; Li et al. 2017), we choose a solution of the network with parameters $\theta^*$ and sample two sets of random vectors $\delta = \{\delta_1, \dots, \delta_k\}$ and $\eta = \{\eta_1, \dots, \eta_k\}$ from a Gaussian distribution $\mathcal{N}(\mu, \Sigma)$, where $k$ denotes the number of filters in FCNs, $\mu$ is set at zero and $\Sigma$ is assumed to be the identity matrix. After that, the optimization landscape can be plotted by projecting the vicinity of $\theta^*$ onto a 3D space along these two random directions. It can be formulated as the following equation:
$$f(\alpha, \beta) = L(\theta^* + \alpha \delta + \beta \eta) \tag{2}$$
where $\alpha$ and $\beta$ determine the optimization step size. We apply filter-wise normalization by dividing each sampled vector $\delta_i$ and $\eta_i$ by its norm. To make the method easier to follow, the pseudocode of the visualization algorithm is provided in Algorithm 1.
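The projection procedure can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than a reproduction of Algorithm 1: the parameters are represented as a hypothetical 2D array with one row per filter, `loss_fn` is any callable that maps such an array to a scalar loss, and the names `filter_normalize` and `loss_surface` are invented for this sketch.

```python
import numpy as np

def filter_normalize(direction):
    """Divide each filter (row) of a random direction by its own norm."""
    norms = np.linalg.norm(direction, axis=1, keepdims=True)
    return direction / (norms + 1e-12)  # small constant avoids division by zero

def loss_surface(loss_fn, theta_star, alphas, betas, rng):
    """Evaluate the loss on a grid around theta_star along two random directions.

    theta_star: (k, m) array, k filters with m weights each.
    Returns an array of shape (len(alphas), len(betas)).
    """
    # Two directions drawn from a zero-mean, identity-covariance Gaussian.
    delta = filter_normalize(rng.standard_normal(theta_star.shape))
    eta = filter_normalize(rng.standard_normal(theta_star.shape))
    surface = np.empty((len(alphas), len(betas)))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            surface[i, j] = loss_fn(theta_star + a * delta + b * eta)
    return surface

# Toy usage: a quadratic loss whose minimizer is the all-zero parameter array.
grid = np.linspace(-1.0, 1.0, 5)
surface = loss_surface(lambda t: float(np.sum(t ** 2)),
                       np.zeros((4, 9)), grid, grid,
                       np.random.default_rng(0))
print(surface.shape)
```

For a real network, each grid point requires a full pass over the evaluation data, so the cost grows with the product of the two grid sizes; this is the reason such plots are usually drawn at low resolution.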
There may be two concerns about this approach. The first is the generalization of local minimizers. Although finding a global minimum turns out to be an NP-hard problem for the non-convex objective function of deep neural networks, (Kawaguchi 2016; Lu and Kawaguchi 2017) show that local minima are not an obstacle for networks to generalize well. The second is the repeatability of 3D projection plots that use random vectors for high-dimensional loss functions. (Li et al. 2017) have shown that different random directions produce similar shapes when plotted multiple times. We observe the same phenomenon in our experiments.
Fully Convolutional Networks
We compare three fully convolutional networks: two existing representative models (FCN-16s and U-Net) and a newly proposed one. Figure 1 gives brief illustrations of these three structures.
Here FCN-16s refers to the network proposed in (Long, Shelhamer and Darrell 2015). For a fairer comparison, we slightly adjust it to have almost the same encoder-decoder structure as U-Net, except that the fusion is only from the last max pooling layer.
U-Net is the model designed in (Ronneberger, Fischer and Brox 2015). It is set up as follows: Convolutional and max pooling layers are stacked to generate feature maps with different resolutions. Before each max pooling process, the network branches off and connects to the input of every upsampling layer. This encoder-decoder structure has been widely used as an entire network or a key component in image-mapping tasks.
The network in the right plot of Figure 1 is proposed in this paper for comparison. The design of this model is motivated by the need not only to capture global context at every scale but also to consolidate the local information at the original resolution. Our model inherits its main backbone from U-Net. The difference between our model and U-Net is that we insert a varying number of residual modules (He et al. 2016) between the pre-pooled and up-sampled feature maps to emphasize the importance of information from the original resolution. We do not use a convolution layer with kernel size greater than 7, since smaller kernels can still capture a large spatial context (Szegedy et al. 2015). We note that HourglassNet has similar skip connections with convolutional layers inserted (Newell, Yang and Deng 2016), but it is different from this model.
Experimental Datasets and Settings
To compare FCN-based networks, we conduct different image-to-image experiments including electron microscope (EM) imaging segmentation, vessel extraction and raindrop removal. Descriptions of the datasets, models, training details and evaluation methods are provided below. Unless otherwise mentioned, we split each dataset into training and testing sets at a ratio of 7:3. All images in the training sets are flipped and rotated in steps of 90 degrees, which yields eight times as much data after augmentation. All models are trained using the SGD optimizer with batch size 16 and momentum 0.8 for 60 epochs. The learning rate is initialized at 0.025. Note that our goal is not to achieve state-of-the-art performance on these tasks but rather to study the generalization abilities of FCNs by visualizing the optimization landscapes.
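The eightfold augmentation can be sketched in NumPy as below. The exact combination of flips and rotations is an assumption made for illustration: the sketch takes each image and its horizontal mirror and rotates both by 0, 90, 180 and 270 degrees, which gives the eight variants. The function name `augment_eightfold` is hypothetical.

```python
import numpy as np

def augment_eightfold(image):
    """Return the eight flip/rotation variants of a 2D (or HxWxC) image."""
    variants = []
    for flipped in (image, np.fliplr(image)):
        for quarter_turns in range(4):
            # np.rot90 rotates in the plane of the first two axes.
            variants.append(np.rot90(flipped, quarter_turns))
    return variants

# Toy usage on a small asymmetric image.
image = np.arange(16).reshape(4, 4)
variants = augment_eightfold(image)
print(len(variants))
```

For image-to-image training, the same transform has to be applied to an input and its label image so that the pair stays aligned.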
The dataset is provided by the EM segmentation challenge (Arganda-Carreras et al. 2015). In our experiments, each image is cropped into several 256×256 patches with a 128-pixel overlap. The foreground-restricted Rand score and the information theoretic score after border thinning are used to evaluate the quality of segmentation results (Unnikrishnan, Pantofaru and Hebert 2007).
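The patch extraction can be sketched as a sliding window whose stride equals the patch size minus the overlap. This is a minimal illustration; the function name `crop_patches` is hypothetical, and the 512×512 input used in the example is an assumption about the image size, not a value stated in this paper.

```python
import numpy as np

def crop_patches(image, patch=256, overlap=128):
    """Crop square patches with a fixed overlap between neighbours.

    Windows that would extend past the image border are skipped.
    """
    stride = patch - overlap
    height, width = image.shape[:2]
    return [image[top:top + patch, left:left + patch]
            for top in range(0, height - patch + 1, stride)
            for left in range(0, width - patch + 1, stride)]

# Assumed 512x512 input: three window positions per axis.
patches = crop_patches(np.zeros((512, 512)))
print(len(patches), patches[0].shape)
```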
Experiments are implemented on the Digital Retinal Images for Vessel Extraction (DRIVE) database, which was proposed for studies on the extraction of blood vessels (Staal et al. 2004). Models are trained after resizing all data to 256×256. The Rand score and information theoretic score are also used as evaluation metrics.
We conduct raindrop removal experiments on the database created by (Qian et al. 2018). There are 1100 pairs of images in total, with various background scenes and raindrops. We resize all images to 480×480 for training. Peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) are applied for quantitative comparison.
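As a reference for the first metric, PSNR can be computed from the mean squared error between a restored image and its ground truth. The sketch below uses the standard definition and assumes 8-bit images with a peak value of 255; the function name `psnr` and the default `max_value` are choices made for this illustration.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in decibels between two images."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy usage: an all-zero image against a copy that is off by one everywhere.
reference = np.zeros((4, 4), dtype=np.uint8)
print(psnr(reference, reference + 1))
```

Casting to floating point before subtracting matters for unsigned 8-bit inputs, where a direct subtraction would wrap around.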
Results on Image-to-image Mapping Tasks
The following provides quantitative comparisons of the three FCNs. Example results on various image-to-image mapping tasks are shown in Figure 2.
We plot the training and testing loss curves over the 60-epoch period in Figure 3 to compare the training efficiency and generalization ability of the three networks. The curves also show that the difference in generalization is not caused by underfitting or overfitting. As can be seen, U-Net and FCN-16s converge at nearly the same speed and faster than our model. In contrast, the testing loss reveals that U-Net generalizes better than FCN-16s on the EM segmentation task. Our model also shows better generalization than FCN-16s. Interestingly, the testing loss fluctuates, for which we provide an explanation in the visualization experiments. Table 1 lists the best quantitative results during these 60 epochs, which are consistent with the testing MSE loss.
| Model | Rand Score | Information Theoretic Score |
Table 2 lists the best evaluation results. It shows that our model performs on par with U-Net and better than FCN-16s.
| Model | Rand Score | Information Theoretic Score |
The average PSNR and SSIM results on raindrop removal are presented in Table 3. It can be seen that U-Net yields the highest PSNR and SSIM on this task and our model achieves better performance than FCN-16s.
Visualization of Optimization Landscapes
We implement the visualization of optimization landscapes on EM segmentation dataset to address the following three issues:
As illustrated in the experimental results, our model and U-Net outperform FCN-16s on all image-to-image tasks. This motivates us to ask what the optimization landscapes of the three FCNs look like.
Although FCN-16s has a similar encoder-decoder structure and even the same number of parameters as U-Net, a significant difference of the two structures is that FCN-16s only preserves the feature representations after 16 max pooling. One might wonder how the skip-layer connections affect the loss surface of FCNs.
Previous literature has shown that, for classification models, small-batch training can obtain flat minimizers which generalize better than those found by large-batch methods (Keskar et al. 2017; Li et al. 2017). However, FCNs for image-to-image tasks have different structures and objective functions. Does this phenomenon still exist in FCNs? If so, how do the minimizers reached by small-batch and large-batch training differ?
Optimization Landscapes of the Three FCNs
Given the computation cost in visualization, we can only plot the loss surface in low-resolution, i.e., in Algorithm 1. For a better view of the optimization landscapes, we choose an interval , i.e., . The resolution is not high enough to capture the complex non-convexity of networks in large regions. For the convenience of explanation, we first illustrate flat/sharp minimizers loosely as shown in Figure 4.
Using the method described in the previous section, we first choose, for each of the three models, the solution where the training loss value is smallest during the 60 epochs. After that, we plot the 3D loss landscapes and projected 2D contours. As shown in Figure 5, a noticeable difference is that the optimization landscape around the minimizer of FCN-16s is much sharper, while those of U-Net and our model are much flatter. Note that U-Net and FCN-16s are plotted by choosing center points with the same training loss value.
In addition to visual comparisons, the flatness/sharpness of minimizers can be described by the Hessian matrix of the loss function. However, computing it is extremely expensive for neural networks. Hence, we adopt a metric from (Keskar et al. 2017) to characterize the sharpness.
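Given $\theta \in \mathbb{R}^n$ and $\epsilon > 0$, the $C_\epsilon$-sharpness of $L$ at $\theta$ is defined as:
$$\phi_{\theta, L}(\epsilon) = \frac{\max_{z \in C_\epsilon} L(\theta + z) - L(\theta)}{1 + L(\theta)} \times 100 \tag{3}$$
where $\epsilon$ determines the size of the box around the solution of the loss function and $C_\epsilon$ denotes a constraint set, which is defined as:
$$C_\epsilon = \{ z \in \mathbb{R}^n : -\epsilon (|\theta_i| + 1) \le z_i \le \epsilon (|\theta_i| + 1), \; i = 1, \dots, n \} \tag{4}$$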
The operation of adding one in (3) and (4) is to guard against the zero case, which is discarded in our experiments.
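A rough numerical estimate of this sharpness metric can be sketched as follows. The metric of (Keskar et al. 2017) is the largest relative increase of the loss inside a box around the solution, expressed as a percentage, where the half-width of the box in each coordinate is the box-size parameter times the absolute parameter value plus one. The inner maximization is replaced here by plain random sampling inside the box, which is a swapped-in simplification that only gives a lower bound on the true maximum; the function name `sharpness_estimate` and its arguments are hypothetical.

```python
import numpy as np

def sharpness_estimate(loss_fn, theta, eps, n_samples, rng):
    """Lower-bound estimate of box-constrained sharpness by random search.

    theta: flat parameter vector; eps: box-size parameter.
    """
    base = loss_fn(theta)
    half_width = eps * (np.abs(theta) + 1.0)  # the +1 guards against theta_i == 0
    worst = base
    for _ in range(n_samples):
        z = rng.uniform(-half_width, half_width)  # one point inside the box
        worst = max(worst, loss_fn(theta + z))
    return (worst - base) / (1.0 + base) * 100.0  # the +1 guards against base == 0

# Toy comparison: a shallow and a steep quadratic bowl, same minimizer.
theta = np.zeros(5)
flat = sharpness_estimate(lambda t: 0.1 * np.sum(t ** 2), theta, 1e-3, 200,
                          np.random.default_rng(0))
sharp = sharpness_estimate(lambda t: 10.0 * np.sum(t ** 2), theta, 1e-3, 200,
                           np.random.default_rng(0))
print(flat < sharp)
```

A gradient-based constrained optimizer would give a tighter estimate of the maximum than random sampling, at a higher cost per evaluation.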
We repeat the experiments 5 times at the same training loss (0.03 for FCN-16s and U-Net) and list the average sharpness values within two different constraint sets in Table 4. A smaller value indicates a flatter minimizer. As can be seen, U-Net and our model converge to flat minimizers while FCN-16s tends to reach a sharper one. The quantitative results agree with our visual observations.
In fact, it has been widely discussed that convergence to sharp minimizers gives rise to poor generalization in deep learning. The large sensitivity of the objective function at a sharp minimizer has a negative impact on the ability of the trained model to generalize to new testing data (Keskar et al. 2017). Keskar et al. also provide a detailed literature review in statistics, Bayesian learning and Gibbs energy to support this view. This may explain why U-Net generalizes better than FCN-16s on EM segmentation.
Skip-layer Connections Promote Flat Loss Landscape
To explore the effect of skip-layer connections on the optimization landscape of FCNs, we conduct comparative experiments on three models, i.e., FCN-16s, FCN-8s and FCN-4s. The notations -16s, -8s and -4s indicate that the skip-layer connections are added after 16, 8 and 4 max pooling operations respectively. We use the same setting as before and report the sharpness values in Table 5. The visualization results are shown in Figure 6. All networks in these experiments have nearly the same training loss of 0.03, but there is a significant difference in their generalization performance. Comparing these results with FCN-16s in Figure 5, we see that the skip-layer connections from the pre-pooled features to the post-upsampled features help FCNs obtain flat optimization landscapes.
It is well known in computer vision that context and locality form a trade-off. On the one hand, max pooling modules expand local receptive fields and reduce the computation cost. On the other hand, we want the regression prediction to be pixel-perfect, which is somewhat difficult for upsampling layers to restore from coarser features, because the feature maps become heterogeneous and degraded after max pooling operations. Some sharp boundaries and tiny details at the original scale are lost to some extent. Therefore, it is necessary to add skip-layers to alleviate this problem. The transition of the loss landscape from sharp to flat possibly explains, from the perspective of optimization, why expanding connection paths at finer features are important in FCNs.
Small-batch Training Leads FCNs to Flatter Minimizers with Smooth Vicinities That Generalize Better
In order to answer the last question, we train the three FCNs with a small batch size of 2 (B2) and a large batch size of 16 (B16). Note that FCNs generally have a higher computational cost, which makes it difficult, with limited computation resources, to train them with batch sizes as large as those used for classification models (say 128-512). The training and testing loss curves are shown in Figure 7. The curves show that models trained with large batch sizes generalize worse.
To investigate the reasons, we visualize their optimization landscapes using the same resolution () but smaller region (). This time we compare the minimizer’s flatness of the same network. As shown in Figure 8, small-batch method leads the model to flat minimizers with smooth and nearly convex regions while large-batch training makes the model converge to sharp minimizers with chaotic vicinity. Table 6 lists the sharpness values.
In conclusion, we provide visualized insights into the optimization landscapes of FCNs. We observe that consolidating the original-scale features by adding skip-layer connections in FCNs promotes a flat loss landscape, which is closely related to good generalization. In addition, our experiments show that models trained with small batch sizes generalize better. We investigate the cause and present evidence that the small-batch method leads the model to flat minimizers with smooth and nearly convex regions, while large-batch training makes the model converge to sharp minimizers with chaotic vicinities. Admittedly, a theoretical proof is difficult to provide at present, but our work may contribute to the understanding and analysis needed for designing and training FCNs.
Arganda-Carreras, I.; Turaga, S. C.; Berger, D. R.; and et al. 2015. Crowdsourcing the Creation of Image Segmentation Algorithms for Connectomics. Frontiers in neuroanatomy 9(142):1-13.
Dinh, L.; Pascanu, R.; Bengio, S.; and Bengio, Y. 2017. Sharp Minima Can Generalize for Deep Nets. In Proceedings of the 34th International Conference on Machine Learning, 70:1019-1028.
Ganin, Y., and Lempitsky, V. 2014. N4-Fields: Neural Network Nearest Neighbor Fields for Image Transforms. In Proceedings of Asian Conference on Computer Vision, 536-551. Cham: Springer.
Goodfellow, I. J.; Vinyals, O.; and Saxe, A. M. 2015. Qualitatively Characterizing Neural Network Optimization Problems. In International Conference on Learning Representations.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778.
Kawaguchi, K. 2016. Deep Learning without Poor Local Minima. In Advances in Neural Information Processing Systems, 586-594.
Keskar, N. S.; Mudigere, D.; Nocedal, J.; and et al. 2017. On Large-batch Training for Deep Learning: Generalization Gap and Sharp Minima. In International Conference on Learning Representations.
Li, H.; Xu, Z.; Taylor, G.; and Goldstein, T. 2017. Visualizing the Loss Landscape of Neural Nets. arXiv:1712.09913.
Li, Y.; and Yuan, Y. 2017. Convergence Analysis of Two-layer Neural Networks with ReLU Activation. In Advances in Neural Information Processing Systems, 597-607.
Liang, S.; Sun, R.; Li, Y.; and Srikant, R. 2018. Understanding the Loss Surface of Neural Networks for Binary Classification. arXiv:1803.00909.
Litjens, G.; Kooi, T.; Bejnordi, B. E.; and et al. 2017. A Survey on Deep Learning in Medical Image Analysis. Medical Image Analysis 42: 60-88.
Liu, Y.; Cheng, M. M.; Hu, X.; Wang, K.; and Bai, X. 2017. Richer Convolutional Features for Edge Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5872-5881.
Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431-3440.
Lu, H.; and Kawaguchi, K. 2017. Depth Creates No Bad Local Minima. arXiv:1702.08580.
Newell, A.; Yang, K.; and Deng, J. 2016. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of European Conference on Computer Vision, 483-499. Cham: Springer.
Nguyen, Q., and Hein, M. 2018. Optimization Landscape and Expressivity of Deep CNNs. In International Conference on Machine Learning, 3727-3736.
Noh, H.; Hong, S.; and Han, B. 2015. Learning Deconvolution Network for Semantic Segmentation. In Proceedings of the IEEE International Conference on Computer Vision, 1520-1528.
Qian, R.; Tan, R. T.; Yang, W.; Su, J.; and Liu, J. 2018. Attentive Generative Adversarial Network for Raindrop Removal from a Single Image. In European Conference on Computer Vision, 694-711. Cham: Springer.
Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention, 234-241. Cham: Springer.
Santhanam, V.; Morariu, V. I.; and Davis, L. S. 2017. Generalized Deep Image to Image Regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5609-5619.
Simonyan, K., and Zisserman, A. 2015. Very Deep Convolutional Networks for Large-scale Image Recognition. In International Conference on Learning Representations.
Staal, J.; Abràmoff, M. D.; Niemeijer, M.; and et al. 2004. Ridge-based Vessel Segmentation in Color Images of the Retina. IEEE Transactions on Medical Imaging 23(4): 501-509.
Szegedy, C.; Liu, W.; Jia, Y.; and et al. 2015. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1-9.
Unnikrishnan, R.; Pantofaru, C.; and Hebert, M. 2007. Toward Objective Evaluation of Image Segmentation Algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 6: 929-944.
Vijay, B.; Alex K.; and Roberto C. 2017. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(12): 2481-2495.
Xie, S., and Tu, Z. 2015. Holistically-nested Edge Detection. In Proceedings of the IEEE International Conference on Computer Vision, 1395-1403.
Zhang, C. L.; Zhang, H.; Wei, X. S.; and Wu, J. 2016. Deep Bimodal Regression for Apparent Personality Analysis. In European Conference on Computer Vision, 311-324. Cham: Springer.
Zhang, K.; Zuo, W.; Chen, Y.; Meng, D.; and Zhang, L. 2017. Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising. IEEE Transactions on Image Processing 26(7): 3142-3155.