(VGG16, VGG19), human pose estimation shg (Stacked Hourglass Networks) and image segmentation unet
(U-net). However, the increasing number of parameters of these models restricts their deployability in resource-constrained environments, such as on mobile devices. As a consequence, several lines of research have emerged to train more compact networks that achieve a performance similar to that of the popular, large ones. In particular, most neural network compression approaches fall into three broad categories: weight quantization quant1; quant2, architecture pruning prun1; prun2; AlvarezSalzmannNIPS17 and knowledge distillation caruna; hinton; kd1. While the first two aim to reduce the size of a given large model by either limiting the number of bits used to represent each parameter, or by removing some of its units or layers, the third, which we investigate here, seeks to train a compact model (the student) using knowledge acquired by a larger one (the teacher). The first distillation method was introduced by Caruana et al. caruna,
who focused on the multi-layer perceptron case with an RMSE-based error metric. The concept was then popularized by Hinton et al. hinton,
with an approach exploiting the teacher’s predicted probabilities to train the student, and by Romero et al. kd1, who proposed to further exploit the teacher’s intermediate representations to guide the student. Since then, several application-driven knowledge distillation strategies have been developed, e.g., for face model identification mobid, object detection objdet and face verification veri. In this paper, we study the use of the knowledge distillation technique of hinton to compress a U-net architecture for biomedical image segmentation. We first show that, without performing any distillation, the number of parameters of the U-net can be reduced drastically at virtually no loss of accuracy. We then observe that a direct application of knowledge distillation is insufficient to further compress the U-net, and propose to complement distillation with batch normalization and class re-weighting. As evidenced by our experiments, this allows us to reduce the U-net size to 0.1% of its original number of parameters with only a negligible loss of segmentation accuracy.
U-net Architecture. The U-net architecture, initially introduced in unet and depicted by Fig. 1, is a fully convolutional network with skip connections, comprising a contracting path and an expansive path. It relies on a channel depth of 64 at the first level and doubles it in 4 consecutive stages, reaching 1024 at the bottom level. This is then reduced back to 64 by the expansive path.
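As a back-of-the-envelope check of how the starting channel depth governs model size, the sketch below estimates only the weights of the two 3x3 convolutions per level of the contracting path (ignoring biases and the expansive path, which scales the same way). Since each convolution has roughly c_in * c_out * k^2 weights, shrinking the starting depth from 64 to 2 divides the count by about (64/2)^2 = 1024, i.e., down to roughly 0.1%:

```python
def channel_depths(start, levels=5):
    # Contracting path: the channel depth doubles at each of the 4 stages.
    return [start * 2 ** i for i in range(levels)]

def approx_params(start, levels=5, k=3, in_channels=1):
    # Rough weight count for the two 3x3 convs per level of the contracting
    # path only (no biases, no expansive path): c_in * c_out * k * k each.
    total, c_in = 0, in_channels
    for c_out in channel_depths(start, levels):
        total += c_in * c_out * k * k    # first conv of the level
        total += c_out * c_out * k * k   # second conv of the level
        c_in = c_out
    return total

print(channel_depths(64))                    # [64, 128, 256, 512, 1024]
print(approx_params(2) / approx_params(64))  # ~0.001, i.e., about 0.1%
```

This is only a sketch of the scaling argument, not an exact count of the networks used in the paper.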
| Starting Channel Depth | 64 | 16 | 4 | 2 |
|---|---|---|---|---|
| Number of Iterations | 80,000 | 150,000 | 300,000 | 290,000 |
Knowledge Distillation. Let us consider a binary image segmentation problem, where the input is an image $x$ and the output a binary label map $y \in \{0,1\}^{H \times W}$. Given $N$ training pairs $\{(x_i, y_i)\}$, the training procedure of a U-net makes use of the cross-entropy loss
$$\mathcal{L}_{hard} = -\sum_{i=1}^{N} \sum_{j} \Big[ y_{ij} \log p_{ij} + (1 - y_{ij}) \log(1 - p_{ij}) \Big],$$
where $y_{ij}$ indicates the label of sample $i$ at location $j$, and $p_{ij}$ is the probability map predicted by the network. In the standard setting, this probability map is obtained via the softmax function. As discussed in hinton, however, softer probabilities can be obtained by increasing the temperature $T$ of this softmax. Distillation is then achieved by training a teacher network at a temperature $T > 1$ to generate soft probabilities $\tilde{y}_{ij}$ for each sample in a validation set. These probabilities are then employed to train the student network, at the same temperature, using the cross-entropy $\mathcal{L}_{soft}$ obtained by replacing $y_{ij}$ with $\tilde{y}_{ij}$ above. This cross-entropy can be used either on its own, for vanilla distillation, or in conjunction with the original cross-entropy $\mathcal{L}_{hard}$, for mixed distillation. After training, the softmax temperature of the student network is reduced back to $T = 1$.

Improving Distillation. As evidenced by our results in Section 3, standard distillation, whether vanilla or mixed, did not prove sufficient to distill a standard U-net into a very small one. To overcome this, we therefore propose two modifications of the original strategy. First, as indicated in Fig. 1, we introduce batch normalization bn operations in every convolution layer of the contracting path. Second, we re-weight the classes according to their proportions in the training set. Specifically, the contribution of each foreground pixel to the loss is multiplied by a weight $w_{fg}$ equal to the ratio of the number of background pixels over the number of foreground ones. In practice, because there are many more background pixels, this weight is larger than 1, i.e., $w_{fg} > 1$. The contribution of the background pixels is kept unchanged, i.e., $w_{bg} = 1$. These two modifications, i.e., batch normalization and class re-weighting, are performed on the student network only, to further help overcome the general difficulty of training shallow networks: the former reduces the internal covariate shift and the latter combats the inherent class imbalance.
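The combination of soft and hard cross-entropy terms with class re-weighting can be sketched as follows. This is a minimal NumPy illustration of the loss structure, not the paper’s implementation; the temperature `T`, mixing weight `alpha` and foreground weight `w_fg` below are hypothetical values:

```python
import numpy as np

def softened_probs(logits, T=1.0):
    # Per-pixel softmax over (background, foreground) logits at temperature T.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def weighted_ce(p_student, target_fg, w_fg):
    # Cross-entropy against (possibly soft) foreground targets; foreground
    # terms are weighted by w_fg = #background / #foreground, background by 1.
    eps = 1e-12
    p_fg = p_student[..., 1]
    loss = -(w_fg * target_fg * np.log(p_fg + eps)
             + (1.0 - target_fg) * np.log(1.0 - p_fg + eps))
    return loss.mean()

def mixed_distillation_loss(student_logits, teacher_logits, hard_labels,
                            T=4.0, alpha=0.5, w_fg=10.0):
    # Soft term: student vs. teacher probabilities, both at temperature T.
    soft = weighted_ce(softened_probs(student_logits, T),
                       softened_probs(teacher_logits, T)[..., 1], w_fg)
    # Hard term: student at T = 1 vs. ground-truth binary labels.
    hard = weighted_ce(softened_probs(student_logits), hard_labels, w_fg)
    return alpha * soft + (1.0 - alpha) * hard
```

Setting `alpha=1.0` recovers vanilla distillation (soft term only), while `alpha=0.0` reduces to standard re-weighted training on hard labels.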
As shown below, these two modifications allowed us to significantly reduce the necessary U-net size for accurate segmentation.
| Test Loss | Training Loss (Hard) | Training Loss (Soft) |
|---|---|---|
| Network | # Trainable Parameters | IoU Score | Cross-Entropy Loss |
|---|---|---|---|
| Our 2-Unet (soft loss only) | 30,902 | 0.752 | 0.134 |
| Our 2-Unet (mixed distillation) | 30,902 | 0.759 | 0.135 |
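The IoU score reported for segmentation quality can be computed from binary masks as in the following sketch. This is the standard intersection-over-union definition, not code tied to the paper’s experiments:

```python
import numpy as np

def iou_score(pred, target):
    # Intersection over union for binary masks (foreground = 1).
    pred = pred.astype(bool)
    target = target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    inter = np.logical_and(pred, target).sum()
    return inter / union

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 1]])
print(iou_score(pred, gt))  # intersection 2, union 4 -> 0.5
```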
Dataset. We use the Electron Microscopy (EM) Mitochondria Segmentation dataset of ds1. It contains a 5x5x5 micrometer section taken from the CA1 hippocampus region of the brain, corresponding to 1065x2048x1536 voxels with a resolution of approximately 5x5x5 nm. This volume is separated into training and testing sub-volumes, each of which consists of 165 slices. We treat each slice as an image and aim to produce the corresponding binary label map, indicating the presence or absence of a mitochondrion at each location.

Compressing without Distillation. First, we experiment with reducing the number of channels in the first U-net layer, while keeping the doubling trend of the contracting path, without any distillation. As shown in Table 1, a U-net with only 4 initial channels (4-Unet) achieves a test loss similar to that of the original 64-Unet. In the remaining experiments, we use the 4-Unet as teacher network for distillation.

Exploiting Standard Distillation. We then tried to make use of the standard distillation procedure of hinton to train a 2-Unet from a 4-Unet. To this end, we evaluated vanilla distillation, mixed distillation and sequential distillation, where we started training with soft labels and then finished with hard ones, or vice versa. All these attempts were unsuccessful, even with class re-weighting. This is illustrated by Table 2 for mixed distillation at different temperatures.

Distilling Our Modified U-net. Finally, we evaluate the use of distillation with our modified U-net, which incorporates batch normalization and class re-weighting. Note that, for all the distillation experiments, the teacher is the original 4-Unet trained from scratch with the hard training loss. As shown in Table 3, with both vanilla and mixed distillation, a 2-Unet achieves segmentation accuracies similar to those of a standard 64-Unet, while requiring only about 0.1% of its capacity.
Note that training a 2-Unet without distillation but with our modifications yields a training loss of 0.265 and a test loss of 0.307, which shows that distillation is still required.
We have introduced a modified distillation strategy to compress a U-net architecture by over 1000x while retaining an accuracy close to the original U-net. This was achieved by modifying the U-net to incorporate batch normalization and class re-weighting. In the future, we plan to investigate the use of these modifications to perform distillation of other networks and for other application domains.
-  Alvarez, J.M. and Salzmann, M., 2017. Compression-aware training of deep networks. In Advances in Neural Information Processing Systems (pp. 856-867).
-  Buciluǎ, C., Caruana, R. and Niculescu-Mizil, A., 2006, August. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 535-541). ACM.
-  Chen, G., Choi, W., Yu, X., Han, T. and Chandraker, M., 2017. Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems (pp. 742-751).
-  Han, S., Pool, J., Tran, J. and Dally, W., 2015. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems (pp. 1135-1143).
-  Hinton, G., Vinyals, O. and Dean, J., 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
-  Ioffe, S. and Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
-  LeCun, Y., Denker, J.S. and Solla, S.A., 1990. Optimal brain damage. In Advances in neural information processing systems (pp. 598-605).
-  Lucchi, A., Smith, K., Achanta, R., Knott, G. and Fua, P., 2012. Supervoxel-based segmentation of mitochondria in em image stacks with learned shape features. IEEE transactions on medical imaging, 31(2), pp.474-486.
-  Luo, P., Zhu, Z., Liu, Z., Wang, X. and Tang, X., 2016, February. Face model compression by distilling knowledge from neurons. In Proceedings of the AAAI Conference on Artificial Intelligence (pp. 3560-3566).
-  Newell, A., Yang, K. and Deng, J., 2016, October. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision (pp. 483-499). Springer, Cham.
-  Rastegari, M., Ordonez, V., Redmon, J. and Farhadi, A., 2016, October. XNOR-Net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision (pp. 525-542). Springer, Cham.
-  Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C. and Bengio, Y., 2014. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550.
-  Ronneberger, O., Fischer, P. and Brox, T., 2015, October. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (pp. 234-241). Springer, Cham.
-  Simonyan, K. and Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
-  Wang, C., Lan, X. and Zhang, Y., 2017. Model Distillation with Knowledge Transfer from Face Classification to Alignment and Verification. arXiv preprint arXiv:1709.02929.
-  Zhu, C., Han, S., Mao, H. and Dally, W.J., 2016. Trained ternary quantization. arXiv preprint arXiv:1612.01064.