Recent work in image segmentation has shown that deep segmentation networks (stacking tens of convolutional layers) generally perform better than shallower networks He et al. , due to their ability to learn more complex nonlinear functions. However, they are difficult to train because of the large number of parameters and vanishing gradients. One recent key idea to facilitate training and handle vanishing gradients is to introduce skip connections between subsequent layers in the network, which has been shown to improve some of the encoder-decoder segmentation networks (e.g., 2D U-Net Ronneberger et al. , 3D U-Net Çiçek et al. , 3D V-Net Milletari et al. , and The One Hundred Layers Tiramisu (DenseNet) Jégou et al. ). Skip connections help the training process by recovering spatial information lost during down-sampling, as well as by reducing the risk of vanishing gradients Huang et al. . It has also been shown that skip connections eliminate singularities Orhan and Pitkow . However, directly transferring feature maps from earlier layers to subsequent layers may also lead to redundant and non-discriminative feature maps being transferred. Moreover, as the transferred feature maps are concatenated to the feature maps in subsequent layers, memory usage increases many-fold.
Complexity reduction. Recently, there have been several efforts to reduce the training and runtime computations of deep classification networks Leroux et al. , Howard et al. , Zhang et al. , Iandola et al. . A few other works have attempted to simplify the structure of deep networks, e.g., by tensor factorization Jaderberg et al. , Kim et al. , Lebedev et al. , channel/network pruning Wen et al. , Hu et al. , or applying sparsity to connections Han et al. [2015b, a], Liu et al. , Guo et al. , Han et al. . However, the non-structured connectivity and irregular memory access caused by sparsity regularization and connection pruning methods adversely impact practical speedup Wen et al. . On the other hand, tensor factorization is not compatible with recently designed networks, e.g., GoogleNet and ResNet, and many such methods may even end up with more computations than the original architectures He et al. . Wen et al.  introduced a learning-based sparsity approach by adding sparsity terms to their optimization function and leveraging group Lasso Yuan and Lin . Similarly, Alvarez and Salzmann  added a regularization term to their loss function to reduce the number of neurons in a learnable setting. However, for both works, it is not trivial to decide on the level of contribution of each term in the loss function. Hu et al.  proposed a neuron pruning method for obtaining an efficient architecture. Although their method performs on par with the original unpruned model, it is a complex multi-stage, threshold-based method applied only to parameter-dense layers of the network. Yang et al. [2017b] proposed an energy-aware method to reduce hardware energy consumption by pruning convolutional neural networks. However, similar to Liu et al. , Denton et al. , Jaderberg et al. , He et al. , their approach traded off network performance for reduced computational complexity/energy consumption (i.e., they reported lower accuracy scores along with lower energy consumption), which may not be desirable in some applications, especially in medical image segmentation.
Gates and attention.
Attention can be viewed as using information transferred from several subsequent layers/feature maps to localize the most discriminative (or salient) part of the input signal. Attention models have been widely applied to machine translation Bahdanau et al. , visual question answering Das et al. , sequence-based models Luong et al. , and image captioning Xu et al. . Srivastava et al.  modified ResNet to control the flow of information through a connection; their proposed gates control the level of contribution between the unmodified input and the activations passed to a consecutive layer. Hu et al.  proposed a selection mechanism where feature maps are first aggregated using global average pooling and reduced to a single channel descriptor, and then an activation gate is used to highlight the most discriminative features. Recently, Wang et al.  added an attention module to ResNet for image classification. Their proposed attention module consists of several encoding-decoding layers which, although they helped improve image classification accuracy, also increased the computational complexity of the model by an order of magnitude Wang et al. .
In this paper, we propose a modification of the traditional skip connections, using a novel select-attend-transfer gate, which aims at simultaneously improving segmentation accuracy and reducing memory usage and network parameters (Fig. 1). We focus on skip connections in encoder-decoder architectures (i.e., as opposed to skip connections in residual networks) designed for segmentation tasks. Our proposed select-attend-transfer gate favours sparse feature representations and uses this property to select and attend to the most discriminative feature channels and spatial regions within a skip connection. Specifically, we first learn to identify the most discriminative feature maps in a skip connection, using a set of trainable weights under a sparsity constraint. Then, we reduce the feature maps to a single channel using a convolutional filter, followed by an attention layer that identifies salient spatial locations within the produced feature map. This compact representation forms the final feature map of the skip connection.
Note that our feature selection method differs from these previous works, most notably because: a) instead of indirectly training for selection, i.e., by incorporating new terms in the objective function Wen et al. , we directly train the selection parameters along with the other parameters of the network; b) in contrast to the previous works mentioned above, we focus specifically on improving segmentation accuracy and consistency by re-designing the skip connections within fully convolutional encoder-decoder networks; and c) instead of transferring all the channels, the proposed method transfers only one attention map, which reduces memory usage and network parameters.
2 The Select-Attend-Transfer (SAT) gate
Notation: We define F as an input feature map of size H × W × D × N, where H, W, D, and N refer to the height, width, and depth of the volumetric data, and the number of channels, respectively. The notation in the paper is defined for 3D (volumetric) input images, but the method can be easily adapted to 2D images by removing the extra dimension and applying 2D convolutions instead of 3D.
An overview of the proposed SAT gate is shown in Fig. 2. It consists of the following modules: 1) Select: re-weighting the channels of the input feature map F, using a learned weight vector with sparsity constraints, to encourage sparse feature map representations, that is, only those channels with non-zero weights are selected; 2) Attend: discovering the most salient spatial locations within the final feature map; and 3) Transfer: transferring the output of the gate into subsequent layers via a skip connection.
Selection via sparsification. We weight the channels of an input feature map F using a scalar weight vector W = (w_1, …, w_N), trained along with all other network parameters. The weight vector is defined such that we encourage sparse feature maps, resulting in completely/partially turning feature maps off or on. Instead of relying on complex loss functions, we clip negative weights to zero and positive weights to at most one using a truncated ReLU function LeCun et al. , τ(w) = min(max(w, 0), 1) (eq. 2). Each channel F_i of the feature map block F is multiplied by τ(w_i) as follows (eq. 1):

F′_i = τ(w_i) · F_i,  i = 1, …, N,  (1)

where w_i is the scalar weight associated with the input feature map channel F_i, and τ is the truncated ReLU function of eq. 2. The output F′ is of size H × W × D × N and is a sparse representation of F. Zero weights turn off the corresponding feature maps completely, whereas positive weights fully/partially turn on feature maps, i.e., implementing soft feature map selection.
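As a concrete illustration, the selection step can be sketched in a few lines of framework-agnostic NumPy (the function and variable names here are our own, not taken from the paper's Keras implementation; in a real network the weights would be trainable parameters updated by backpropagation):

```python
import numpy as np

def truncated_relu(w):
    """Clip weights to [0, 1]: negatives become 0, values above 1 become 1."""
    return np.clip(w, 0.0, 1.0)

def select(F, w):
    """Soft channel selection: scale channel F[..., i] by truncated_relu(w[i]).

    F: feature map of shape (H, W, D, N); w: weight vector of shape (N,).
    Channels whose weight clips to zero are turned off entirely.
    """
    return F * truncated_relu(w)  # broadcasts over the last (channel) axis
```

Note how a negative weight removes its channel completely, while a weight in (0, 1) only attenuates it, matching the soft selection described above.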
Attention via filtering. The output F′ is further filtered to identify the most discriminative linear combination of feature channels. For this, we employ a single convolution filter K of size 1 × 1 × 1 × N (1 × 1 × N for 2D data), which allows us to learn how best to aggregate the different channels of the feature map F′. This feature selection step reduces F′ to F″ of size H × W × D × 1 (eq. 3):

F″ = F′ ∗ K,  (3)

where ∗ is the convolution operation. To identify salient spatial locations within F″, we introduce an attention gate (eq. 4) composed of a sigmoid activation function. Using the sigmoid as an activation function allows us to identify multiple discriminative spatial locations (as opposed to one single location, which is the case with softmax) within the feature map F″:

A = σ(F″),  (4)

where σ denotes the sigmoid function. The computed A forms a compact summary of the input feature map F.
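In NumPy terms (again with our own, hypothetical naming), the 1 × 1 × 1 × N convolution is simply a per-voxel dot product with the kernel, followed by an element-wise sigmoid:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attend(F_sel, k):
    """Collapse N channels to one with a 1x1x1 convolution, then apply a sigmoid.

    A 1x1x1xN convolution is, per voxel, just a dot product with the kernel k.
    F_sel: selected feature map of shape (H, W, D, N); k: kernel of shape (N,).
    Returns the attention map A of shape (H, W, D) with values in (0, 1).
    """
    f_bar = np.tensordot(F_sel, k, axes=([-1], [0]))  # shape (H, W, D)
    return sigmoid(f_bar)
```

Because the sigmoid acts independently on every voxel, several spatial locations can be highlighted at once, which is the behaviour the text contrasts with softmax.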
Transfer via skip connection. The computed attention map A is transferred to subsequent layers via a skip connection.
Special cases. There are two special cases of the proposed SAT gate. One is the ST gate, which skips the A part; that is, only channel selection but no attention is performed, and the selected signal is directly fed to subsequent layers. The other is the AT gate, which skips the S part by setting all weights to one; this way, there is no channel selection and only attention is performed.
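To make the distinction between the variants concrete, here is a minimal NumPy sketch (names are ours, not from the released code) in which each ablation simply skips one stage of the gate:

```python
import numpy as np

def sat_gate(F, w, k, mode="SAT"):
    """Sketch of the SAT gate and its ST/AT ablations.

    F: (H, W, D, N) feature map; w: (N,) selection weights; k: (N,) 1x1x1 kernel.
    - "ST": select only; returns all N re-weighted channels.
    - "AT": attend only; selection weights are forced to one.
    - "SAT": select, then attend; returns a single-channel attention map.
    """
    if mode == "AT":
        w = np.ones_like(w)                # S skipped: no channel selection
    F_sel = F * np.clip(w, 0.0, 1.0)       # Select: truncated ReLU on weights
    if mode == "ST":
        return F_sel                       # A skipped: transfer N channels
    f_bar = np.tensordot(F_sel, k, axes=([-1], [0]))  # 1x1x1 conv -> 1 channel
    return 1.0 / (1.0 + np.exp(-f_bar))    # sigmoid attention map
```

The sketch makes the memory argument visible: ST still hands N channels to the decoder, whereas AT and SAT transfer a single map.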
Training and implementation details.
Networks are trained using stochastic gradient descent with momentum. We do not rely on special layers or non-differentiable operations, so the models can be trained via standard backpropagation. To set the hyperparameters, we started with the values proposed in the U-Net, V-Net, and The One Hundred Layers Tiramisu papers. However, we found experimentally that the ADADELTA optimizer Zeiler  (with its proposed default parameters), combined with the Glorot uniform model initializer, also called the Xavier uniform initializer Glorot and Bengio , works best for all the networks. All the models are implemented using Keras with a TensorFlow backend. After each convolution layer we use batch normalization, which allows us to use a higher learning rate since the effect of outliers is reduced. Note that to test on the 2D dataset, we replace all the 3D operations in 3D V-Net with 2D operations, and for 3D datasets we replace all the 2D operations of The One Hundred Layers Tiramisu model with 3D ones.
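For reference, the ADADELTA update rule is compact enough to state in full. The sketch below follows Zeiler's paper with its defaults ρ = 0.95 and ε = 1e-6 (the defaults exposed by a given framework's Adadelta optimizer may differ slightly); it is an illustration of the rule, not the training code used here:

```python
import numpy as np

def adadelta_step(x, grad, state, rho=0.95, eps=1e-6):
    """One ADADELTA update (Zeiler, 2012): no global learning rate is needed.

    state = (Eg2, Edx2): running averages of squared gradients and updates.
    """
    Eg2, Edx2 = state
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2            # accumulate gradient
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2            # accumulate update
    return x + dx, (Eg2, Edx2)
```

The appeal for hyperparameter search is visible in the signature: the step size adapts from the two running averages, so no learning-rate schedule has to be tuned per network.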
In this section, we evaluate the performance of the proposed method on three commonly used fully convolutional segmentation networks which leverage skip connections: U-Net (both 2D and 3D), V-Net (both 2D and 3D), and the One Hundred Layers Tiramisu network (both 2D and 3D). As U-Net and V-Net were originally designed for biomedical image segmentation, we tested the proposed method on a series of volumetric (3D) and 2D medical imaging datasets (Fig. 3) including (i) a magnetic resonance imaging (MRI) dataset; (ii) a skin lesion dataset; and (iii) a computed tomography (CT) dataset.
To analyze the performance of the proposed method, we performed the following experiments: a) We tested the performance of the proposed method, in terms of segmentation accuracy, on datasets i and ii. b) We applied the proposed SAT gate to the recently introduced Deep Image to Image Network (DI2IN) Yang et al. [2017a] for liver segmentation (Section 3.2). c) We quantitatively and qualitatively analysed the proposed skip connections in terms of the amount of data transferred, and we visualized the outputs of both the channel selection and attention layers. We also compared the proposed method vs. the original networks in terms of memory usage and number of parameters (Section 3.3). Note that for the 2D skin dataset, as the original V-Net is 3D, we replaced all the 3D operations in V-Net with 2D counterparts, and for testing the One Hundred Layers Tiramisu model with volumetric data, we replaced all the network's 2D operations with 3D ones.
3.1 Volumetric and 2D binary segmentation
Volumetric CT liver segmentation. In this experiment, the goal is to segment the liver from CT images. We used more than 2000 CT scans (to the best of our knowledge, the largest CT dataset used in the literature so far) of different resolutions, collected internally and from The Cancer Imaging Archive (TCIA) QIN-HEADNECK dataset Beichel et al. , Fedorov et al. , Clark et al. , which were resampled to isotropic voxels. The QIN-HEADNECK dataset was originally collected from a set of head and neck cancer patients and contains multiple whole-body positron emission tomography/computed tomography (PET/CT) scans before and after therapy; here, we use only the CT scans, for the purpose of liver segmentation. We picked 61 challenging volumes of the whole dataset for testing and trained the networks on the remaining volumes.
Volumetric MRI prostate segmentation. In this experiment, we test the proposed method on a volumetric MRI prostate dataset. The dataset contains 1029 MRI volumes of different resolutions, collected internally and from the TCIA ProstateX dataset Geert et al. , Litjens et al. , Clark et al. , which were resampled to isotropic voxels. We used 770 images for training and 258 images for testing.
2D RGB skin lesion segmentation. For this experiment, we used the 2D RGB skin lesion dataset from the 2017 IEEE ISBI International Skin Imaging Collaboration (ISIC) Challenge Codella et al. . We train on a dataset of 2000 images and test on a different set of 150 images.
As reported in Table 1, overall, equipping U-Net and V-Net with SAT improved Dice results for all 4 (2 modalities × 2 networks) experiments. Specifically: (i) For the MRI dataset, the SAT gate improves U-Net performance by 1.15% (0.87 to 0.88) and 35.3% (0.17 to 0.11) in Dice and FNR, respectively. Using SAT with V-Net improved results by 2.4% (0.85 to 0.87) and 20% (0.005 to 0.004) in Dice and FPR. Note that, although for V-Net (MRI data) the Dice improvement is small, instead of transferring all the channels, the proposed method transfers only one attention map, which substantially reduces memory usage (4th column in Table 2). (ii) For the skin dataset, equipping U-Net with SAT improved Dice and FPR by 2.53% (0.79 to 0.81) and 25% (0.04 to 0.03), and similarly V-Net with SAT resulted in 2.5% (0.81 to 0.83) and 35% (0.20 to 0.13) improvement in terms of Dice and FNR, respectively.
Note also that, although ST sometimes obtained results similar to SAT, SAT transfers only one channel whereas ST transfers multiple channels (i.e., ST requires more parameters and involves higher memory usage). Similar to SAT, AT also transfers only one channel; however, as shown in Figures 6 and 7, AT tends to attend to wrong regions, as it does not leverage the selection step to filter the most discriminative channels.
|Dataset|Network|Variant|Channels transferred|
|MRI|3D U-Net Çiçek et al. |ORG|C = N|
|MRI|3D U-Net|AT|C = 1|
|MRI|3D U-Net|SAT|C = 1|
|MRI|3D V-Net Milletari et al. |ORG|C = N|
|MRI|3D V-Net|AT|C = 1|
|MRI|3D V-Net|SAT|C = 1|
|Skin|2D U-Net Ronneberger et al. |ORG|C = N|
|Skin|2D U-Net|AT|C = 1|
|Skin|2D U-Net|SAT|C = 1|
|Skin|2D V-Net|ORG|C = N|
|Skin|2D V-Net|AT|C = 1|
|Skin|2D V-Net|SAT|C = 1|
We further test the proposed SAT gate on The One Hundred Layers Tiramisu network Jégou et al. . Note that we apply the proposed SAT gate to more skip connections in this network, i.e., both the long skip connections between the encoder and decoder (similar to Figure 1) and the long skip connections inside each dense block (Figure 4). While our proposed SAT gate reduces the number of parameters in the original Tiramisu network (Table 3), it improves Dice results (Table 2) by 3.7% and 2.5% for the MRI and skin datasets, respectively.
As the next experiment, we applied our proposed SAT gate to the DI2IN Yang et al. [2017a] method for liver segmentation from CT images and achieved the same performance as the original DI2IN network, i.e., a Dice score of 0.96, while reducing the number of parameters by 12% (from 2,353,089 to 2,063,053) and reducing the number of channels in each skip connection to only one.
3.2 Quantitative and qualitative analysis of the proposed skip connections
In this section, we visualize the outputs of both the channel selection and attention layers. As shown in Fig. 5, after applying the selection step, some of the less discriminative channels are completely turned off by the proposed channel selection.
After the selected channels are transferred to the attention layer, the model learns to attend to the most important part(s) of the image, which helps in segmenting the target object(s) more accurately. A few samples of the attention maps for two of the proposed skip connections, used within the 2D U-Net and 3D V-Net architectures, are shown in Figs. 6 and 7 for the 2D skin lesion and 3D prostate MRI datasets, respectively. As can be seen in both figures, a model with only the attention layer (i.e., only AT) tends to attend to several areas of the image, including both where the object is present and where it is absent (note the red colour visible over the whole image). However, applying channel selection (i.e., ST) before the attention layer prevents the model from attending to less discriminative regions.
We also quantitatively analyzed the proposed learnable skip connections in terms of the percentage of channels “turned off” (i.e., channels i for which τ(w_i) is zero in the channel selection layer). For both U-Net and V-Net, a considerable percentage of channels were off for the MRI and skin datasets. Since transferring only one channel (instead of N channels) as the output of the SAT gate reduces the number of convolution kernel parameters needed on the other side of the skip connections, we further report the total number of parameters before and after applying the proposed method (i.e., SAT) in Table 3. As can be seen, the total number of parameters is reduced by 30.2% and 7.3% for U-Net, and by 30.6% and 8.02% for V-Net, for the MRI and skin datasets, respectively. For The One Hundred Layers Tiramisu network, the number of parameters is reduced for all the datasets. Further note that the proposed method reduces the number of convolution operations after each concatenation by almost 50%, as demonstrated next. As an example, after an original skip connection that carries 256 channels and concatenates them with another 256 channels on the other side of the network, the convolution layer right after the concatenation will need 512 (= 256 + 256) convolution operations. However, as the proposed skip connections carry only one channel, for the same example, only 257 (= 1 + 256) convolutions are needed. Note that the reason for the difference in the number of parameters of the original networks in Table 3 is that, because of memory limitations, we reduced the number of layers and/or channels for different datasets.
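The arithmetic above extends directly to parameter counts. A small sketch (the kernel size and channel numbers are illustrative, not taken from any specific layer of the evaluated networks):

```python
def conv_params(in_ch, out_ch, k=3):
    """Parameters of a k x k convolution layer: weights plus one bias per filter."""
    return k * k * in_ch * out_ch + out_ch

# Convolution following the concatenation, assuming 256 output channels:
orig = conv_params(256 + 256, 256)  # original skip transfers all 256 channels
sat = conv_params(1 + 256, 256)     # SAT transfers a single attention map
```

With these illustrative numbers, `orig` is roughly twice `sat`, consistent with the near-50% reduction discussed above.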
|Method||2D/3D U-Net Ronneberger et al. , Çiçek et al. ||2D/3D V-Net Milletari et al. ||2D/3D Tiramisu Jégou et al. |
Advantages of channel selection and attention. For some experiments, we observed only slight improvements in Dice score when comparing the proposed method vs. channel selection (or attention) alone, but we argue that there are clear advantages in combining channel selection and attention: 1) channel selection reduces the number of feature maps and, more importantly, enforces learning sparse representations; 2) channel selection alone outputs many channels, which require more memory and more parameters to learn, whereas the proposed method transfers only one channel (which is more interpretable), thus reducing the number of network parameters and memory usage; and 3) attention after the selection helps focus on the most important spatial regions in the input, which is not captured by the channel selection part; attention without channel selection, however, tends to attend to wrong areas, as visualized above (Figures 6 and 7).
More consistent segmentations.
By selecting the most relevant features, the proposed method achieves accurate segmentations more consistently, i.e., with a smaller standard deviation compared to the original methods (see the standard deviations in Table 1).
Using ReLU vs. softmax or sigmoid in channel selection. An advantage of ReLU is that it is computationally simpler than the sigmoid, which requires computing an exponent; this advantage is more pronounced for deeper networks. Empirically, we observed better performance using ReLU than with the hard sigmoid or sigmoid.
Applying the proposed SAT gate to densely connected networks. Dense connections make it possible to design deep architectures, as they help prevent vanishing gradients. On the other hand, the high memory usage of dense networks (because of the many concatenations) is a disadvantage that might limit the number of layers inside each dense block or the number of dense blocks themselves. Our proposed SAT gate helps reduce memory usage in densely connected networks, thus allowing more layers/dense blocks if needed.
We proposed a novel architecture for skip connections in fully convolutional segmentation networks. Our proposed skip connection involves a channel selection step followed by an attention gate. Equipping popular segmentation networks (e.g., U-Net, V-Net, and The One Hundred Layers Tiramisu network) with the proposed skip connections allowed us to reduce computations and network parameters (the proposed method transfers only one feature channel instead of many) while obtaining more accurate and more consistent segmentation results (it attends to the most discriminative channels and regions within the feature maps of a skip connection).
Disclaimer: This feature is based on research, and is not commercially available. Due to regulatory reasons its future availability cannot be guaranteed.
- Alvarez and Salzmann  J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pages 2270–2278, 2016.
- Bahdanau et al.  D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Beichel et al.  R. R. Beichel, E. J. Ulrich, C. Bauer, A. Wahle, B. Brown, T. Chang, K. A. Plichta, B. J. Smith, J. J. Sunderland, T. Braun, A. Fedorov, D. Clunie, M. Onken, J. Riesmeier, S. Pieper, R. Kikinis, M. M. Graham, T. L. Casavant, M. Sonka, and J. M. Buatti. Data from QIN-HEADNECK http://doi.org/10.7937/k9/tcia.2015.k0f5cgli. The Cancer Imaging Archive, 2015.
- Çiçek et al.  Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger. 3D U-Net: learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 424–432. Springer, 2016.
- Clark et al.  K. Clark, B. Vendt, K. Smith, J. Freymann, J. Kirby, P. Koppel, S. Moore, S. Phillips, D. Maffitt, M. Pringle, et al. The cancer imaging archive (TCIA): Maintaining and operating a public information repository. Journal of Digital Imaging, 26(6):1045–1057, 2013.
- Codella et al.  N. C. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti, S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler, et al. Skin lesion analysis toward melanoma detection: A challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). arXiv preprint arXiv:1710.05006, 2017.
- Das et al.  A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? Computer Vision and Image Understanding, 163:90–100, 2017.
- Denton et al.  E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.
- Fedorov et al.  A. Fedorov, D. Clunie, E. Ulrich, C. Bauer, A. Wahle, B. Brown, M. Onken, J. Riesmeier, S. Pieper, R. Kikinis, J. Buatti, and R. Beichel. DICOM for quantitative imaging biomarker development: a standards based approach to sharing clinical data and structured PET/CT analysis results in head and neck cancer research. peerj 4:e2057 https://doi.org/10.7717/peerj.2057. 2016.
- Geert et al.  L. Geert, D. Oscar, B. Jelle, K. Nico, and H. Henkjan. Prostatex challenge data. https://doi.org/10.7937/k9tcia.2017.murs5cl. The Cancer Imaging Archive, 2017.
- Glorot and Bengio  X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
- Guo et al.  Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient DNNs. In Advances In Neural Information Processing Systems, pages 1379–1387, 2016.
- Han et al. [2015a] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015a.
- Han et al. [2015b] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015b.
- Han et al.  S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: efficient inference engine on compressed deep neural network. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pages 243–254. IEEE, 2016.
- He et al.  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- He et al.  Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision (ICCV), volume 2, page 6, 2017.
- Howard et al.  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- Hu et al.  H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
- Hu et al.  J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
- Huang et al.  G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 3, 2017.
- Iandola et al.  F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
- Jaderberg et al.  M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
- Jégou et al.  S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1175–1183. IEEE, 2017.
- Kim et al.  Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.
- Lebedev et al.  V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
- LeCun et al.  Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
- Leroux et al.  S. Leroux, P. Molchanov, P. Simoens, B. Dhoedt, T. Breuel, and J. Kautz. Iamnn: Iterative and adaptive mobile neural network for efficient image classification. arXiv preprint arXiv:1804.10123, 2018.
- Litjens et al.  G. Litjens, O. Debats, J. Barentsz, N. Karssemeijer, and H. Huisman. Computer-aided detection of prostate cancer in MRI. IEEE Transactions on Medical Imaging, 33(5):1083–1092, 2014.
- Liu et al.  B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015.
- Luong et al.  M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
- Milletari et al.  F. Milletari, N. Navab, and S.-A. Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 565–571. IEEE, 2016.
- Orhan and Pitkow  A. E. Orhan and X. Pitkow. Skip connections eliminate singularities. arXiv preprint arXiv:1701.09175, 2017.
- Ronneberger et al.  O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
- Srivastava et al.  R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
- Wang et al.  F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. arXiv preprint arXiv:1704.06904, 2017.
- Wen et al.  W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
- Xu et al.  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
- Yang et al. [2017a] D. Yang, D. Xu, S. K. Zhou, B. Georgescu, M. Chen, S. Grbic, D. Metaxas, and D. Comaniciu. Automatic liver segmentation using an adversarial image-to-image network. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 507–515. Springer, 2017a.
- Yang et al. [2017b] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint, 2017b.
- Yuan and Lin  M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
- Zeiler  M. D. Zeiler. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Zhang et al.  X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.