Introduction
It is generally difficult to solve some image processing problems without complicated procedures. For instance, in natural image edge detection, edge detectors such as the Canny operator cannot obtain satisfactory results (Ganin and Lempitsky 2014). The textures may be extremely complex, and in some cases we expect models to output specific edges rather than every edge in the image.
In fact, many of these problems can be treated as mapping an input image to a corresponding output image. Fully convolutional networks (FCNs) have achieved great success in image-to-image tasks and allow them to be tackled end-to-end. The concept of the FCN was first introduced for pixel-to-pixel semantic segmentation by removing fully connected layers and inserting upsampling layers at the end of the network (Long, Shelhamer and Darrell 2015).
With this approach, many classification networks can be converted for image-to-image mapping (Xie and Tu 2015; Liu et al. 2017). These models use upsampling layers to recover feature maps of different resolutions and a fusion layer to output the final image, and they perform well in edge detection on natural images. Apart from modifying classification networks, Ronneberger, Fischer and Brox (2015) design an encoder-decoder structure consisting of a contracting path and multiple expanding paths to capture context at different scales. It has been applied successfully to biomedical image segmentation.
Trying and training different architectures empirically can yield satisfactory results, but a clear understanding of why performance improves is still lacking. Theoretical explorations of the loss surfaces of classification networks have been proposed recently (Nguyen and Hein 2018; Liang et al. 2018). Using a visualization method, Li et al. (2017) show that the residual mapping structure can lead to a nearly convex loss landscape. However, there is no similar analysis of FCNs, whose structures and objective functions differ substantially from those of classification models. We therefore explore FCN-based networks by choosing a solution and projecting its vicinity along two random vectors. To the best of our knowledge, this paper is the first to provide visualized insight into the geometry of FCN minimizers and their generalization abilities.
The rest of this paper is organized as follows. First, representative FCN-based models and visualization approaches are discussed. Second, we introduce the objective function, the visualization method and the architectures used for comparison in this study. Third, we evaluate these FCNs on multiple datasets and visualize the optimization landscapes to understand their generalization abilities, with detailed analysis. Finally, we conclude the paper.
Related Work
Fully Convolutional Networks
The fully convolutional network (FCN) was first proposed by Long, Shelhamer and Darrell (2015) and makes it possible for image classification networks to output segmentation results. In a typical CNN, fully connected layers are applied at the end of the network to predict class scores: feature maps are flattened and fed into the classifier layers, and spatial information is lost in this squeezing operation. FCNs replace the fully connected layers with upsampling or deconvolution layers. As a result, image classification models can operate on inputs of any size and generate correspondingly sized output images.
Many basic classification networks, such as VGGNet (Simonyan and Zisserman 2015), can be used as a backbone for learning image mappings by leveraging the techniques of Long, Shelhamer and Darrell (2015). The Holistically-nested Edge Detector (HED) extracts features at different resolutions from five layers of VGGNet and upsamples them to the original scale (Xie and Tu 2015). At the end of the model, a fusion layer concatenates the feature maps along a given dimension and outputs the corresponding result. In addition, both SegNet and DeconvNet consist of an encoder network and a decoder network, where the encoder topology is identical to VGGNet (Noh, Hong and Han 2015; Vijay, Alex and Roberto 2017). These two models learn to upsample using transposed convolution and then convolve with kernels to produce feature maps for segmentation.
Instead of modifying classification models, another type of network consists of a contracting path to capture context and multiple symmetric expanding paths that enable precise localization. U-Net is one of the most representative models; it was initially designed for biomedical image segmentation and won the ISBI cell tracking challenge in 2015 by a large margin (Ronneberger, Fischer and Brox 2015). Further variations of U-Net have been proposed for different tasks (Milletari, Navab and Ahmadi 2016; Litjens et al. 2017). Santhanam, Morariu and Davis (2017) create a similar framework and achieve good performance on relighting, denoising and colorization.
Unlike the former methods, Zhang et al. (2017) design an FCN for image denoising that removes all pooling layers and keeps the image size constant throughout the network. Batch normalization and residual learning are also integrated to speed up training and improve performance. They achieve decent denoising results, though this architecture may have a relatively small receptive field and consume more computing resources.
Visualization Methods
Although a large number of FCNs have been designed for various tasks, visualization methods have so far been applied only to classification networks. The objective function of a deep neural network lies in a very high-dimensional space. As a result, it can only be visualized in 2D or 3D space.
A widely used 2D plotting method is linear interpolation, which has been applied to study local minimizers under different batch size settings (Dinh et al. 2017; Keskar et al. 2017). An alternative approach is projection. Goodfellow, Vinyals and Saxe (2015) use this approach to explore the trajectories of different optimization algorithms, and Li et al. (2017) study the relationship between loss landscapes and models' generalization abilities by projection. Note that these studies focus only on classification networks, whose structures and objective functions are entirely different from those of FCNs.
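The linear interpolation method mentioned above can be illustrated with a minimal sketch. It evaluates the loss along the straight line between two parameter vectors. The function name `interpolate_loss`, the quadratic `toy_loss` and the two end points are hypothetical stand-ins chosen for illustration, not part of the cited works.

```python
import numpy as np

def interpolate_loss(loss_fn, theta_a, theta_b, alphas):
    """Evaluate loss_fn along the line (1 - a) * theta_a + a * theta_b."""
    theta_a = np.asarray(theta_a, dtype=np.float64)
    theta_b = np.asarray(theta_b, dtype=np.float64)
    return np.array([loss_fn((1.0 - a) * theta_a + a * theta_b) for a in alphas])

# Toy quadratic loss standing in for a network's objective (assumption).
toy_loss = lambda theta: float(np.sum(theta ** 2))

alphas = np.linspace(0.0, 1.0, 5)
curve = interpolate_loss(toy_loss, np.array([-1.0, -1.0]), np.array([1.0, 1.0]), alphas)
```

Plotting `curve` against `alphas` gives the 2D plot: one axis for the interpolation coefficient and one for the loss value.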
Method
Objective Function
Image-to-image mapping problems can be cast as pixel-wise regression or classification using FCNs. The main difference is whether the output is continuous or categorical.
We train the networks as a regression problem since pixel-wise class labels are often difficult to obtain in practice. For instance, the raindrop removal dataset only provides the original images as labels. Therefore, we use the mean squared error (MSE) as the objective function, which can be expressed as follows:
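$$L(\theta) = \frac{1}{N D} \sum_{i=1}^{N} \left\| f(x_i; \theta) - y_i \right\|_2^2 \qquad (1)$$
where $N$ is the number of samples, $D$ denotes the dimension of the outputs, $f(x_i; \theta)$ is the network output for input $x_i$ under parameters $\theta$, and $y_i$ is the corresponding target image. Generally, the $\ell_1$ loss and the MSE loss are almost the same in terms of regression accuracy, while the result of MSE is sometimes slightly better (Zhang et al. 2016).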
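As a minimal sketch of the objective, the MSE can be computed by averaging the squared error over all samples and all output dimensions. The function name `mse_loss` and the array shapes are assumptions for illustration; whether the average is also taken over the output dimension is assumed here rather than stated in the text.

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error averaged over N samples and D output dimensions."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    n = pred.shape[0]
    diff = (pred - target).reshape(n, -1)  # shape (N, D)
    d = diff.shape[1]
    return float(np.sum(diff ** 2) / (n * d))

# Two 4x4 "images" that differ from their targets by 1 at every pixel.
example_loss = mse_loss(np.zeros((2, 4, 4)), np.ones((2, 4, 4)))
```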
Visualization of Optimization Landscapes
Following Goodfellow, Vinyals and Saxe (2015) and Li et al. (2017), we choose a solution of the network with parameters $\theta^*$ and sample two sets of random vectors $\delta = \{\delta_i\}_{i=1}^{n}$ and $\eta = \{\eta_i\}_{i=1}^{n}$ from a Gaussian distribution $\mathcal{N}(\mu, \Sigma)$, where $n$ denotes the number of filters in the FCN, $\mu$ is set to zero and $\Sigma$ is assumed to be the identity matrix. The optimization landscape can then be plotted by projecting the vicinity of $\theta^*$ onto a 3D space along these two random directions:
$$f(\alpha, \beta) = L(\theta^* + \alpha \delta + \beta \eta) \qquad (2)$$
where $\alpha$ and $\beta$ determine the optimization step size. We apply filter-wise normalization by dividing each sampled vector $\delta_i$ and $\eta_i$ by its norm. To make the method easier to follow, the pseudocode of the visualization algorithm is provided in Algorithm 1.
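The projection can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not a reproduction of Algorithm 1: the names `filterwise_normalize` and `loss_surface` are hypothetical, the network is replaced by a toy quadratic loss, and each normalized filter direction is additionally rescaled to the norm of the corresponding weight filter, as done by Li et al. (2017).

```python
import numpy as np

def filterwise_normalize(direction, weights, eps=1e-10):
    """Divide each filter of `direction` by its norm, then rescale it to the
    norm of the matching filter in `weights` (rescaling as in Li et al.)."""
    out = []
    for d, w in zip(direction, weights):
        d_flat = d.reshape(d.shape[0], -1)  # one row per filter
        w_flat = w.reshape(w.shape[0], -1)
        d_norm = np.linalg.norm(d_flat, axis=1, keepdims=True)
        w_norm = np.linalg.norm(w_flat, axis=1, keepdims=True)
        out.append((d_flat / (d_norm + eps) * w_norm).reshape(d.shape))
    return out

def loss_surface(loss_fn, weights, alphas, betas, seed=0):
    """Evaluate the loss on a grid around `weights` along two random directions."""
    rng = np.random.default_rng(seed)
    delta = filterwise_normalize([rng.standard_normal(w.shape) for w in weights], weights)
    eta = filterwise_normalize([rng.standard_normal(w.shape) for w in weights], weights)
    surface = np.empty((len(alphas), len(betas)))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            shifted = [w + a * d + b * e for w, d, e in zip(weights, delta, eta)]
            surface[i, j] = loss_fn(shifted)
    return surface

# Toy stand-in: two "convolution layers" whose loss is minimal at all-ones weights.
toy_weights = [np.ones((4, 3, 3, 3)), np.ones((2, 4, 3, 3))]
toy_loss = lambda ws: sum(float(np.sum((w - 1.0) ** 2)) for w in ws)
grid = np.linspace(-1.0, 1.0, 5)
surface = loss_surface(toy_loss, toy_weights, grid, grid)
```

The resulting `surface` array can be passed to any 3D surface or contour plotting routine. For a real network, `loss_fn` would load the shifted weights into the model and evaluate the training loss, which is what makes high-resolution grids expensive.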
Two concerns may arise about this approach. The first is the generalization of local minimizers. Although finding a global minimum of the non-convex objective function of a deep neural network is NP-hard, Kawaguchi (2016) and Lu and Kawaguchi (2017) show that local minima are not an obstacle for networks to generalize well. The second is the repeatability of 3D projection plots of high-dimensional loss functions using random vectors. Li et al. (2017) have shown, by plotting multiple times, that different random directions produce similar shapes. We observe the same phenomenon in our experiments.
Fully Convolutional Networks
We compare three fully convolutional networks: two existing representative models (FCN-16s and U-Net) and a newly proposed one. Figure 1 gives brief illustrations of the three structures.
FCN-16s
Here FCN-16s refers to the network proposed by Long, Shelhamer and Darrell (2015). For a fair comparison, we slightly adjust it to have almost the same encoder-decoder structure as U-Net, except that the fusion comes only from the last max pooling layer.
U-Net
U-Net is the model designed by Ronneberger, Fischer and Brox (2015). It is set up as follows: convolutional and max pooling layers are stacked to generate feature maps at different resolutions. Before each max pooling operation, the network branches off and connects to the input of the corresponding upsampling layer. This encoder-decoder structure has been widely used as an entire network or as a key component in image-mapping tasks.
Proposed Network
The network in the right plot of Figure 1 is proposed in this paper for comparison. Its design is motivated by the need not only to capture global context at every scale but also to consolidate local information at the original resolution. Our model inherits its main backbone from U-Net. The difference between our model and U-Net is that we insert a varying number of residual modules (He et al. 2016) between the pre-pooled and upsampled feature maps to emphasize the importance of information from the original resolution. We do not use convolution layers with kernel size greater than 7, since smaller kernels can still capture a large spatial context (Szegedy et al. 2015). We note that HourglassNet has similar skip connections with convolutional layers inserted (Newell, Yang and Deng 2016), but it differs from this model.
Experiments
Experimental Datasets and Settings
To compare FCN-based networks, we conduct several image-to-image experiments: electron microscope (EM) image segmentation, vessel extraction and raindrop removal. Descriptions of the datasets, models, training details and evaluation methods are provided in turn. Unless otherwise mentioned, we split each dataset into training and testing sets at a ratio of 7:3. All images in the training sets are flipped and rotated in steps of 90 degrees, which yields eight times the original data after augmentation. All models are trained for 60 epochs using the SGD optimizer with batch size 16 and momentum 0.8. The learning rate is initialized at 0.025. Note that our goal is not to achieve state-of-the-art performance on these tasks but rather to study the generalization abilities of FCNs by visualizing their optimization landscapes.
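The eightfold augmentation can be sketched as the combination of an optional horizontal flip with the four rotations by multiples of 90 degrees. The function name `dihedral_augment` and the small test array are assumptions for illustration; this is one plausible reading of the flip-and-rotate description that yields exactly eight variants.

```python
import numpy as np

def dihedral_augment(image):
    """Return the 8 variants of an H x W (x C) image: 4 rotations of the
    original plus 4 rotations of its left-right mirror."""
    variants = []
    for base in (image, np.fliplr(image)):
        for k in range(4):
            variants.append(np.rot90(base, k))  # rotate by k * 90 degrees
    return variants

sample = np.arange(16).reshape(4, 4)
augmented = dihedral_augment(sample)
```

In an image-to-image setting, the same transform has to be applied to the input image and to its target image so that the pair stays aligned.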
EM Segmentation
The dataset is provided by the EM segmentation challenge (Arganda-Carreras et al. 2015). In our experiments, each image is cropped into several 256×256 patches with a 128-pixel overlap. The foreground-restricted Rand score and the information theoretic score after border thinning are used to evaluate the quality of the segmentation results (Unnikrishnan, Pantofaru and Hebert 2007).
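Cropping with overlap amounts to sliding a window whose stride equals the patch size minus the overlap. The sketch below assumes a 512×512 array as a stand-in for one EM slice; the function name `extract_patches` is hypothetical, and border regions that do not fit a full patch are simply dropped here.

```python
import numpy as np

def extract_patches(image, patch=256, overlap=128):
    """Crop patch x patch windows from an image with the given pixel overlap."""
    stride = patch - overlap
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            patches.append(image[top:top + patch, left:left + patch])
    return patches

em_slice = np.zeros((512, 512), dtype=np.uint8)  # placeholder image (assumption)
patches = extract_patches(em_slice)
```

With these settings a 512×512 image yields a 3×3 grid of patches, since the window starts at offsets 0, 128 and 256 along each axis.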
Vessel Extraction
Experiments are conducted on the Digital Retinal Images for Vessel Extraction (DRIVE) database, which was proposed for studies on the extraction of blood vessels (Staal et al. 2004). Models are trained after resizing all images to 256×256. The Rand score and the information theoretic score are again used as evaluation metrics.
Raindrop Removal
We conduct raindrop removal experiments on the database created by Qian et al. (2018). It contains 1100 pairs of images in total, with various background scenes and raindrops. We resize all images to 480×480 for training. Peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) are used for quantitative comparison.
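For reference, PSNR follows directly from the mean squared error between the clean image and the restored one. The sketch below assumes 8-bit images with a peak value of 255; the function name `psnr` and the `max_value` parameter are illustrative choices, not taken from the paper.

```python
import numpy as np

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    restored = np.asarray(restored, dtype=np.float64)
    mse = np.mean((reference - restored) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

clean = np.zeros((8, 8))
score = psnr(clean, clean + 1.0)  # every pixel off by one gray level
```

Higher values indicate a restoration closer to the reference. SSIM is more involved because it compares local means, variances and covariances, so a library implementation is usually preferable to a hand-written one.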
Results on Image-to-image Mapping Tasks
The following subsections provide quantitative comparisons of the three FCNs. Example results on the image-to-image mapping tasks are shown in Figure 2.
EM Segmentation
We plot the training and testing loss curves over the 60 epochs in Figure 3 to compare the training efficiency and generalization ability of the three networks. The curves also show that the difference in generalization is not caused by underfitting or overfitting. As can be seen, U-Net and FCN-16s converge at nearly the same speed and faster than our model. The testing loss, in contrast, reveals that U-Net generalizes better than FCN-16s on the EM segmentation task. Our model also generalizes better than FCN-16s. Interestingly, the testing loss fluctuates, for which we provide explanations in the visualization experiments. Table 1 lists the best quantitative results during these 60 epochs, which are consistent with the testing MSE loss.
Table 1: Best quantitative results on EM segmentation.
Model | Rand Score | Information Theoretic Score
FCN-16s | 0.5697 | 0.5786
U-Net | 0.9714 | 0.9743
Our Model | 0.9296 | 0.9407
Vessel Extraction
Table 2 lists the best evaluation results. Our model performs on par with U-Net and better than FCN-16s.
Table 2: Best quantitative results on vessel extraction.
Model | Rand Score | Information Theoretic Score
FCN-16s | 0.6594 | 0.6667
U-Net | 0.7832 | 0.7974
Our Model | 0.7696 | 0.7894
Raindrop Removal
The average PSNR and SSIM results on raindrop removal are presented in Table 3. U-Net yields the highest PSNR and SSIM on this task, and our model achieves better performance than FCN-16s.
Table 3: Average PSNR and SSIM on raindrop removal.
Model | PSNR | SSIM
FCN-16s | 17.043 | 0.820
U-Net | 27.049 | 0.982
Our Model | 26.328 | 0.978
Visualization of Optimization Landscapes
We visualize the optimization landscapes on the EM segmentation dataset to address the following three questions:

As the experimental results show, our model and U-Net outperform FCN-16s on all image-to-image tasks. This motivates us to ask what the optimization landscapes of the three FCNs look like.

Although FCN-16s has a similar encoder-decoder structure and even the same number of parameters as U-Net, a significant difference between the two structures is that FCN-16s only preserves the feature representations after 16 max pooling operations. One might wonder how the skip-layer connections affect the loss surface of FCNs.

Previous literature has shown that, for classification models, small-batch training can obtain flat minimizers that generalize better than those of large-batch training (Keskar et al. 2017; Li et al. 2017). However, FCNs for image-to-image tasks have different structures and objective functions. Does this phenomenon still exist in FCNs? If so, how do the minimizers reached by small-batch and large-batch training differ?
Optimization Landscapes of the Three FCNs
Given the computational cost of visualization, we can only plot the loss surface at a low resolution, i.e., on a coarse grid in Algorithm 1. For a better view of the optimization landscapes, we restrict the plot to a small interval around the minimizer. This resolution is not high enough to capture the complex non-convexity of the networks over large regions. For convenience of explanation, we first illustrate flat and sharp minimizers loosely, as shown in Figure 4.
Using the method described in the previous section, we first choose, for each of the three models, the solution with the smallest training loss during the 60 epochs. We then plot the 3D loss landscapes and the projected 2D contours. As shown in Figure 5, a noticeable difference is that the optimization landscape around the minimizer of FCN-16s is much sharper, while those of U-Net and our model are much flatter. Note that U-Net and FCN-16s are plotted by choosing center points with the same training loss value.
In addition to visual comparison, the flatness or sharpness of a minimizer can be described by the Hessian matrix of the loss function. However, computing it is extremely expensive for neural networks. Hence, we adopt a metric from Keskar et al. (2017) to characterize sharpness.
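Given $x \in \mathbb{R}^n$ and $\epsilon > 0$, the sharpness of $f$ at $x$ is defined as:
$$\phi_{x,f}(\epsilon) := \frac{\left(\max_{y \in C_\epsilon} f(x + y)\right) - f(x)}{1 + f(x)} \times 100 \qquad (3)$$
where $\epsilon$ determines the size of the box around the solution $x$ of the loss function $f$, and $C_\epsilon$ denotes a constraint set, which is defined as:
$$C_\epsilon = \{ z \in \mathbb{R}^n : -\epsilon(|x_i| + 1) \le z_i \le \epsilon(|x_i| + 1), \ \forall i \in \{1, \dots, n\} \} \qquad (4)$$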
The operation of adding one in (3) and (4) is to guard against the zero case, which is discarded in our experiments.
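To make the metric concrete, the sketch below estimates it for a toy loss. It is a simplification under stated assumptions: the inner maximization is approximated by uniform random sampling inside the box, which only gives a lower bound on the true maximum, whereas Keskar et al. (2017) solve it with a bound-constrained optimizer. The function name `sharpness_random_search` and the two quadratic losses are hypothetical.

```python
import numpy as np

def sharpness_random_search(loss_fn, x, epsilon, n_samples=1000, seed=0):
    """Lower-bound estimate of the sharpness of loss_fn at x.

    Samples perturbations y uniformly from the box
    |y_i| <= epsilon * (|x_i| + 1) and keeps the largest loss found.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=np.float64)
    base = loss_fn(x)
    half_width = epsilon * (np.abs(x) + 1.0)
    worst = base
    for _ in range(n_samples):
        y = rng.uniform(-half_width, half_width)
        worst = max(worst, loss_fn(x + y))
    return 100.0 * (worst - base) / (1.0 + base)

# Two toy minimizers at the origin with different curvature (assumption).
flat_loss = lambda v: 0.1 * float(np.sum(v ** 2))
sharp_loss = lambda v: 10.0 * float(np.sum(v ** 2))
flat_value = sharpness_random_search(flat_loss, np.zeros(5), 1e-2)
sharp_value = sharpness_random_search(sharp_loss, np.zeros(5), 1e-2)
```

The loss with higher curvature receives the larger sharpness value, which is the ordering the metric is meant to capture. For a real network, each call to `loss_fn` is a full loss evaluation, so the number of samples has to be chosen with the computational budget in mind.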
We repeat the experiments 5 times at the same training loss (0.03 for FCN-16s and U-Net) and list the average sharpness values within two different constraint sets in Table 4. A smaller value indicates a flatter minimizer. As can be seen, U-Net and our model converge to flat minimizers, while FCN-16s tends to reach a sharper one. The quantitative results agree with our visual observations.
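Table 4: Average sharpness values of the three FCNs.
Model | Sharpness (first constraint set) | Sharpness (second constraint set)
FCN-16s | 2.1822 | 7.3241
U-Net | 1.4608 | 4.7711
Our Model | 1.5519 | 5.0324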
In fact, it has been widely discussed that convergence to sharp minimizers gives rise to poor generalization in deep learning. The high sensitivity of the objective function at a sharp minimizer has a negative impact on the ability of the trained model to generalize to new testing data (Keskar et al. 2017). Keskar et al. also review the literature in statistics, Bayesian learning and Gibbs energy that supports this view. This may explain why U-Net generalizes better than FCN-16s on EM segmentation.
Skip-layer Connections Promote Flat Loss Landscape
To explore the effect of skip-layer connections on the optimization landscape of FCNs, we conduct comparative experiments on three models: FCN-16s, FCN-8s and FCN-4s. The notations 16s, 8s and 4s indicate that skip-layer connections are added after 16, 8 and 4 max pooling operations, respectively. We use the same settings as in the previous experiment and report the sharpness values in Table 5. The visualization results are shown in Figure 6. All networks are evaluated at nearly the same training loss of 0.03, but their generalization performance differs significantly. Comparing with FCN-16s in Figure 5, we see that skip-layer connections from the pre-pooled features to the post-upsampled features lead FCNs to flatter optimization landscapes.
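Table 5: Sharpness values of FCNs with different skip-layer connections.
Model | Sharpness (first constraint set) | Sharpness (second constraint set)
FCN-16s | 2.1822 | 7.3241
FCN-8s | 1.7724 | 6.1531
FCN-4s | 1.4904 | 4.8242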
It is well known in computer vision that context and locality form a trade-off. On the one hand, max pooling modules expand local receptive fields and reduce the computational cost. On the other hand, we want the regression prediction to be pixel-perfect, which is difficult for upsampling layers to restore from coarser features, because the feature maps suffer some heterogeneity and deterioration after max pooling operations. Some sharp boundaries and tiny details at the original scale are lost to some extent. Therefore, it is necessary to add skip-layers to alleviate this problem. From the perspective of optimization, the transition of the loss landscape from sharp to flat possibly explains why expanding the connection path at finer features is important in FCNs.
Small-batch Training Leads FCNs to Flatter Minimizers with Smooth Vicinities That Generalize Better
To answer the last question, we train the three FCNs with a small batch size of 2 (B2) and a large batch size of 16 (B16). Note that FCNs generally have a higher computational cost, which makes it difficult, with limited computational resources, to train with batch sizes as large as those used for classification models (say 128-512). The training and testing loss curves are shown in Figure 7. Models trained with the large batch size clearly generalize worse.
To investigate the reasons, we visualize their optimization landscapes using the same resolution but a smaller region. This time we compare the flatness of the minimizers of the same network. As shown in Figure 8, the small-batch method leads the model to flat minimizers with smooth and nearly convex regions, while large-batch training makes the model converge to sharp minimizers with chaotic vicinities. Table 6 lists the sharpness values.
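Table 6: Sharpness values of FCNs trained with batch size 2 (B2) and batch size 16 (B16).
Model | B2 (first constraint set) | B16 (first constraint set) | B2 (second constraint set) | B16 (second constraint set)
FCN-16s | 0.0252 | 0.0418 | 0.1193 | 0.1964
U-Net | 0.0175 | 0.0288 | 0.0971 | 0.1344
Our Model | 0.0187 | 0.0315 | 0.1072 | 0.1699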
Conclusion
In conclusion, we provide visualized insights into the optimization landscapes of FCNs. We observe that consolidating the original-scale features by adding skip-layer connections in FCNs promotes a flat loss landscape, which is closely related to good generalization. In addition, experiments show that models trained with small batch sizes generalize better. We investigate the cause and present evidence that the small-batch method leads the model to flat minimizers with smooth and nearly convex regions, while large-batch training makes the model converge to sharp minimizers with chaotic vicinities. Admittedly, a theoretical proof is difficult to provide at present, but our work may contribute to the understanding and analysis needed for designing and training FCNs.
References
Arganda-Carreras, I.; Turaga, S. C.; Berger, D. R.; et al. 2015. Crowdsourcing the Creation of Image Segmentation Algorithms for Connectomics. Frontiers in Neuroanatomy 9(142): 1-13.
Dinh, L.; Pascanu, R.; Bengio, S.; and Bengio, Y. 2017. Sharp Minima Can Generalize for Deep Nets. In Proceedings of the 34th International Conference on Machine Learning, 70: 1019-1028.
Ganin, Y., and Lempitsky, V. 2014. N4-Fields: Neural Network Nearest Neighbor Fields for Image Transforms. In Proceedings of Asian Conference on Computer Vision, 536-551. Cham: Springer.
Goodfellow, I. J.; Vinyals, O.; and Saxe, A. M. 2015. Qualitatively Characterizing Neural Network Optimization Problems. In International Conference on Learning Representations.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778.
Kawaguchi, K. 2016. Deep Learning without Poor Local Minima. In Advances in Neural Information Processing Systems, 586-594.
Keskar, N. S.; Mudigere, D.; Nocedal, J.; et al. 2017. On Large-batch Training for Deep Learning: Generalization Gap and Sharp Minima. In International Conference on Learning Representations.
Li, H.; Xu, Z.; Taylor, G.; and Goldstein, T. 2017. Visualizing the Loss Landscape of Neural Nets. arXiv:1712.09913.
Li, Y., and Yuan, Y. 2017. Convergence Analysis of Two-layer Neural Networks with ReLU Activation. In Advances in Neural Information Processing Systems, 597-607.
Liang, S.; Sun, R.; Li, Y.; and Srikant, R. 2018. Understanding the Loss Surface of Neural Networks for Binary Classification. arXiv:1803.00909.
Litjens, G.; Kooi, T.; Bejnordi, B. E.; et al. 2017. A Survey on Deep Learning in Medical Image Analysis. Medical Image Analysis 42: 60-88.
Liu, Y.; Cheng, M. M.; Hu, X.; Wang, K.; and Bai, X. 2017. Richer Convolutional Features for Edge Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5872-5881.
Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431-3440.
Lu, H., and Kawaguchi, K. 2017. Depth Creates No Bad Local Minima. arXiv:1702.08580.
Milletari, F.; Navab, N.; and Ahmadi, S. A. 2016. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In IEEE International Conference on 3D Vision, 565-571.
Newell, A.; Yang, K.; and Deng, J. 2016. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of European Conference on Computer Vision, 483-499. Cham: Springer.
Nguyen, Q., and Hein, M. 2018. Optimization Landscape and Expressivity of Deep CNNs. In International Conference on Machine Learning, 3727-3736.
Noh, H.; Hong, S.; and Han, B. 2015. Learning Deconvolution Network for Semantic Segmentation. In Proceedings of the IEEE International Conference on Computer Vision, 1520-1528.
Qian, R.; Tan, R. T.; Yang, W.; Su, J.; and Liu, J. 2018. Attentive Generative Adversarial Network for Raindrop Removal from a Single Image. In European Conference on Computer Vision, 694-711. Cham: Springer.
Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention, 234-241. Cham: Springer.
Santhanam, V.; Morariu, V. I.; and Davis, L. S. 2017. Generalized Deep Image to Image Regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5609-5619.
Simonyan, K., and Zisserman, A. 2015. Very Deep Convolutional Networks for Large-scale Image Recognition. In International Conference on Learning Representations.
Staal, J.; Abràmoff, M. D.; Niemeijer, M.; et al. 2004. Ridge-based Vessel Segmentation in Color Images of the Retina. IEEE Transactions on Medical Imaging 23(4): 501-509.
Szegedy, C.; Liu, W.; Jia, Y.; et al. 2015. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1-9.
Unnikrishnan, R.; Pantofaru, C.; and Hebert, M. 2007. Toward Objective Evaluation of Image Segmentation Algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 6: 929-944.
Vijay, B.; Alex K.; and Roberto C. 2017. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(12): 2481-2495.
Xie, S., and Tu, Z. 2015. Holistically-nested Edge Detection. In Proceedings of the IEEE International Conference on Computer Vision, 1395-1403.
Zhang, C. L.; Zhang, H.; Wei, X. S.; and Wu, J. 2016. Deep Bimodal Regression for Apparent Personality Analysis. In European Conference on Computer Vision, 311-324. Cham: Springer.
Zhang, K.; Zuo, W.; Chen, Y.; Meng, D.; and Zhang, L. 2017. Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising. IEEE Transactions on Image Processing 26(7): 3142-3155.