Image retrieval employs compact feature descriptors to locate target images containing the same object or scene as the one depicted in a query image (Gordo et al., 2016; Husain & Bober, 2017; Yasmin et al., 2014). With the rapid progress of deep neural networks over the last few years, the convolutional neural network (CNN) has become the dominant approach and has been adopted by MPEG for generating descriptors (ISO, 2018; Lin et al., 2017; Duan et al., 2017).
State-of-the-art CNNs consist of hundreds of millions of neurons, which require hundreds of megabytes of storage when represented in floating-point precision (Han et al., 2015). Image retrieval applications with CNNs require descriptor extraction to be performed with limited hardware resources. It is therefore preferable to use a fixed-point format with fewer bits, which reduces both logic and memory requirements and accelerates CNN processing. Meanwhile, CNN model compression is widely used in fixed-point hardware to reduce the computation cost and the size of the model stored in local memory (Cheng et al., 2017); compression methods include pruning, coefficient clustering, and quantization (Rastegari et al., 2016; Gong et al., 2014). Scalar and vector quantization techniques, together with a scheme that shares weights across layers, have been used to reduce CNN model size, achieving negligible loss in retrieval performance with 4-bit quantization for VGG-16 (Chandrasekhar et al., 2017). It has also been demonstrated that CNN-based descriptors can be extracted with a 4-bit compressed CNN model on devices with memory and power constraints (ISO, 2019). In addition, the limited image buffer size of chips must be considered when implementing a CNN in an ASIC, especially for large-scale images.
In this paper, we propose a deep model compression and weight quantization approach that enables efficient hardware implementation of CNN-based descriptor extraction. Our compression method balances bit precision against performance using a hybrid bit quantization scheme across layers. The compressed model with as few as 2-bit weight quantization delivers performance similar to the floating-point model for image retrieval. Meanwhile, to handle large-scale images, we use a region nested invariance pooling (RNIP) strategy based on concatenating feature maps from different sub-images to extract the deep feature descriptor. The proposed RNIP approach is compatible with, and can be combined with, the deep model compression technique to enable large-scale image retrieval.
The original contributions of this paper are as follows:
Using as few as 2-bit weight quantization of a CNN on an ASIC chip, we achieve image retrieval performance similar to that of the floating-point model;
We propose an improved pooling strategy, RNIP, that can be integrated with the 2-bit CNN model compression approach to retrieve large-scale images.
The paper is organized as follows. Section 2 shows how CNN and NIP are used for image retrieval. The proposed compressed model is presented in Section 3. The proposed RNIP strategy is described in Section 4. Section 5 shows the results of the compressed model and RNIP on chip for image retrieval. Section 6 concludes this paper.
2 Image retrieval with CNN
For image retrieval, CNN plus NIP is standardized as a reference model to generate the deep feature descriptor (ISO, 2018). In the first step, rotated copies of the input image are created at 0, 90, 180, and 270 degrees. Each rotated image is then fed into the CNN to generate convolutional feature maps. Following that, NIP generates a single deep feature descriptor from the feature maps of the rotated images by applying square-root pooling, average pooling, and max pooling in a chain (Lin et al., 2017; Duan et al., 2017; Lou et al., 2017). Finally, a 512-D vector is generated with the VGG-16 model.
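As a concrete illustration, the pooling chain can be sketched in NumPy. The sketch below is our own reading of the NIP chain, not the standardized reference code; the choice of ROI windows, the exact form of the square-root (generalized-mean) pooling, and the final L2 normalization are assumptions.

```python
import numpy as np

def nip_descriptor(rot_feature_maps, rois):
    """Sketch of Nested Invariance Pooling (NIP).

    rot_feature_maps: list of (C, H, W) arrays, one per rotation
                      (0, 90, 180, 270 degrees).
    rois: list of (y0, y1, x0, x1) windows on the feature-map grid.
    Returns a C-dimensional descriptor (512-D for VGG-16 conv5).
    """
    per_rotation = []
    for fmap in rot_feature_maps:
        per_roi = []
        for (y0, y1, x0, x1) in rois:
            patch = fmap[:, y0:y1, x0:x1]                  # (C, h, w)
            # square-root (generalized-mean, p = 0.5) pooling over space
            pooled = np.mean(np.sqrt(np.maximum(patch, 0)),
                             axis=(1, 2)) ** 2
            per_roi.append(pooled)
        # average pooling over ROIs
        per_rotation.append(np.mean(per_roi, axis=0))
    # max pooling over rotations, then L2-normalize
    d = np.max(per_rotation, axis=0)
    return d / (np.linalg.norm(d) + 1e-12)

# toy usage: 512 channels on a 14x14 grid, 4 rotations, 2 ROIs
maps = [np.random.rand(512, 14, 14) for _ in range(4)]
rois = [(0, 14, 0, 14), (3, 11, 3, 11)]
desc = nip_descriptor(maps, rois)
```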
3 Deep Model Compression
In a CNN, each convolutional layer is made up of neurons that consist of learnable weights and biases. Consider the $i$-th weight and the $j$-th bias of each layer, defined by $W_i$ and $b_j$. For each layer, our proposed quantization and compression scheme is as follows:
1) The bias is represented as a 12-bit signed integer.
2) To quantize $W_i$, the $i$-th 3x3 filter can be expressed as $W_i = \alpha_i M_i$,
where $\alpha_i$ is the scalar for the $i$-th filter and is quantized to 8 bits; $M_i$ is the quantized 3x3 mask for the $i$-th filter, which can be quantized to 1, 2, 3, 4, or 5 bits for different layers. Consider the $k$-th element of $W_i$ and the $k$-th element of $M_i$, defined by $w_{ik}$ and $m_{ik}$, respectively. In the proposed quantization and compression scheme, for 1-bit quantization, $m_{ik}$ is projected to 1 if $w_{ik} \ge t$ and to $-1$ otherwise, with $t$ being a threshold, where the scalar $\alpha_i$ is calculated as the mean of $|w_{ik}|$.
3) To accommodate the dynamic range of the filter coefficients, a layer-wise 4-bit shifting value represents the exponent value of the weights.
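A minimal sketch of step 2 for the 1-bit case. The zero default for the threshold and the use of the mean absolute coefficient as the scalar (XNOR-Net-style binarization) are assumptions where the text leaves details open:

```python
import numpy as np

def quantize_filter_1bit(w, t=0.0):
    """1-bit quantization of one 3x3 filter w.

    m[k] = +1 if w[k] >= t else -1; the scalar alpha = mean(|w|)
    (alpha itself is quantized to 8 bits elsewhere in the pipeline).
    Returns (alpha, mask, reconstructed filter alpha * mask).
    """
    mask = np.where(w >= t, 1.0, -1.0)
    alpha = float(np.mean(np.abs(w)))
    return alpha, mask, alpha * mask

# example filter
w = np.array([[ 0.12, -0.40,  0.05],
              [-0.07,  0.33, -0.21],
              [ 0.18, -0.02,  0.09]])
alpha, mask, w_hat = quantize_filter_1bit(w)
```

Storing one 8-bit scalar plus nine 1-bit mask elements replaces nine 32-bit floats, which is where the per-layer compression figures in Section 3.2 come from.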
To quantize the floating-point coefficients, it is important to trade off range against precision. Directly quantizing the model weight parameters degrades the model accuracy. We found that retraining the model under the constraint of fixed-point weights brings its precision very close to that of the floating-point model while using far fewer bits for the weights. Algorithm 1 is used to retrain the fixed-point model.
3.2 Weights Precision and Compression Ratio
Relative to the 32-bit floating-point model, the compression ratio can be calculated as $CR = 32N / (N b_m + b_s)$, where $N$ is set to 9 for a 3x3 filter, $b_m$ is the number of bits for each element in the mask, and $b_s$ is the number of bits for the scalar. In this work, we quantize the coefficients with different precision for different layers: the earlier layers use more bits for the mask to improve accuracy without increasing the model size too much. For the first seven convolutional layers of VGG-16 we use 3-bit masks, resulting in an 8.2x compression ratio. The last six convolutional layers use 1-bit masks, giving about a 17x compression ratio compared to the 32-bit floating-point model. For VGG-16, this achieves an overall compression ratio of about 15.1x, which is close to 2-bit fixed-point quantization. The proposed compressed model can be easily implemented in an ASIC hardware design for image retrieval.
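The quoted ratios can be reproduced directly from the formula. In the sketch below, the per-layer filter counts come from the standard VGG-16 architecture, and biases and the 4-bit shift values are ignored for the overall estimate; those simplifications are ours:

```python
# per-layer compression ratio vs. 32-bit floats: CR = 32*N / (N*b_m + b_s),
# with N = 9 weights per 3x3 filter, b_m bits per mask element,
# b_s = 8 bits for the per-filter scalar.
def layer_cr(b_m, N=9, b_s=8):
    return 32 * N / (N * b_m + b_s)

print(round(layer_cr(3), 1))   # 8.2  (3-bit masks, first 7 layers)
print(round(layer_cr(1), 1))   # 16.9 (1-bit masks, last 6 layers)

# overall ratio over the 13 conv layers of VGG-16, each holding
# in_ch * out_ch 3x3 filters (channel sizes from the standard VGG-16)
chans = [(3, 64), (64, 64), (64, 128), (128, 128), (128, 256),
         (256, 256), (256, 256), (256, 512), (512, 512), (512, 512),
         (512, 512), (512, 512), (512, 512)]
filters = [i * o for i, o in chans]
bits = (sum(f * (9 * 3 + 8) for f in filters[:7]) +
        sum(f * (9 * 1 + 8) for f in filters[7:]))
overall = 32 * 9 * sum(filters) / bits
print(round(overall, 1))       # 15.1
```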
4 Region Nested Invariance Pooling
Drawing on object detection and image segmentation with cropped sub-images, this paper proposes an improved NIP strategy, called RNIP, for deep feature descriptor extraction. As shown in Fig. 1, the RNIP strategy uses cropped sub-images of the original image as the CNN input; using sub-images as input reduces the buffer size required on devices. First, we crop the input image to generate sub-images of multiple sizes, using the same crop strategy as the region of interest (ROI) sampling of feature maps in standard NIP. Second, each cropped image is fed into the CNN to generate convolutional feature maps. The set of ROIs from the feature maps of the corresponding cropped images is then concatenated. Finally, NIP pooling is performed on the set of ROIs to generate a single descriptor. For VGG-16, RNIP generates a 512-D feature descriptor for image retrieval.
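The steps above can be sketched as follows; `cnn` and `nip` are placeholder callables standing in for the real network and pooling chain, and the toy crop windows are illustrative only:

```python
import numpy as np

def rnip_descriptor(image, crops, cnn, nip):
    """RNIP sketch: feed cropped sub-images through the CNN separately,
    collect the resulting feature-map ROIs, then apply NIP once.

    image: (H, W, 3) array; crops: list of (y0, y1, x0, x1) windows
    mirroring the ROI sampling of standard NIP; cnn: maps a sub-image to
    a (C, h, w) feature map; nip: pools a list of (C, h, w) ROIs into a
    single C-dimensional descriptor.
    """
    rois = [cnn(image[y0:y1, x0:x1]) for (y0, y1, x0, x1) in crops]
    return nip(rois)

# toy stand-ins: a random "CNN" and average-then-max pooling as "NIP"
fake_cnn = lambda img: np.random.rand(512, 7, 7)
fake_nip = lambda rois: np.max([r.mean(axis=(1, 2)) for r in rois], axis=0)

img = np.random.rand(448, 448, 3)
crops = [(0, 448, 0, 448), (0, 224, 0, 224), (224, 448, 224, 448)]
desc = rnip_descriptor(img, crops, fake_cnn, fake_nip)
```

Because each crop passes through the CNN independently, only one sub-image (e.g. 224x224) needs to fit in the on-chip image buffer at a time, which is what makes the scheme compatible with the compressed model on the ASIC.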
5 Experiment Results
5.1 Image Retrieval with RNIP and NIP
This section presents the performance of RNIP on image retrieval, as shown in Table 1. We build descriptors with the floating-point VGG-16 model. Image retrieval performance is evaluated by mean average precision (mAP).
In Table 1, all images on the INRIA Holidays dataset are resized to 224x224 for image retrieval. RNIP 5x and RNIP 14x use 5 and 14 cropped sub-images, respectively, to extract deep feature descriptors; 512-D, 512-byte, and 512-bit denote descriptors whose elements are single-precision floating point, quantized to 8 bits, and binarized, respectively.
The retrieval results for the 512-byte descriptor and the 512-D single-precision floating-point descriptor are very close; hence the descriptor size per image can be reduced from 2 KB to 0.5 KB without sacrificing retrieval performance. For the binarized descriptor, retrieval performance drops by around 0.03 mAP. Table 1 also shows that RNIP improves image retrieval performance.
5.2 Image Retrieval using ASIC Engine
For the classification task, the proposed retrained fixed-point VGG-16 model achieves top-1 accuracy of 70.13% and top-5 accuracy of 89.91% on the ILSVRC-2012 dataset, using as few as 2-bit weight quantization on the ASIC chip.
Fig. 2 shows the performance of image retrieval using the ASIC engine and NIP on the INRIA Holidays dataset. The retrained fixed-point VGG-16 model with 2-bit weight quantization achieves the same performance as the floating-point model for both 224x224 and 640x480 input image sizes. In particular, Fig. 2 (a) shows that the proposed deep model compression and quantization scheme on the CNN ASIC engine delivers the same image retrieval performance as the floating-point model.
Table 2 shows that the proposed RNIP can extract deep features locally from sub-images with no loss in image retrieval performance. The proposed model compression scheme reduces the VGG-16 model size from 59 MB to 4 MB. With RNIP, VGG-16 using the proposed 2-bit weight quantization on chip with 224x224 image input delivers the same image retrieval performance as the floating-point VGG-16 with 640x480 image input.
| | Baseline | Proposed |
|---|---|---|
| Input cropping | N/A | 9x crop |
| CNN model input size | 640x480 | 224x224 |
| CNN model format | Floating point | 2-bit fixed point |
| CNN model size | 59 MB | 4 MB |
| Descriptor size | 512 bytes | 512 bytes |
6 Conclusion
We proposed a deep model compression and quantization scheme, together with an improved pooling strategy named RNIP, to enable efficient hardware implementation of CNN-based descriptors for image retrieval. Experimental results show that our compressed VGG-16 with as few as 2-bit quantization can be ported directly onto our specially designed ASIC CNN engine while achieving performance similar to the floating-point model. Integrated with the RNIP strategy, the model with the proposed 2-bit quantization approach can retrieve 640x480 images with a limited 224x224 image buffer on the ASIC chip with similar performance.
- ISO (2018) Information technology - multimedia content description interface - part 15: Compact descriptors for video analysis. Technical report, ISO/IEC JTC1/SC29/WG11, Geneva, Switzerland, 2018.
- ISO (2019) White paper on CDVA. Technical report, ISO/IEC JTC1/SC29/WG11/Nxxxxx, Marrakech, MA, 2019.
- Chandrasekhar et al. (2017) Chandrasekhar, V., Lin, J., Liao, Q., Morère, O., Veillard, A., Duan, L., and Poggio, T. A. Compression of deep neural networks for image instance retrieval. CoRR, abs/1701.04923, 2017. URL http://arxiv.org/abs/1701.04923.
- Cheng et al. (2017) Cheng, Y., Wang, D., Zhou, P., and Zhang, T. A survey of model compression and acceleration for deep neural networks. CoRR, abs/1710.09282, 2017. URL http://arxiv.org/abs/1710.09282.
- Duan et al. (2017) Duan, L., Chandrasekhar, V., Wang, S., Lou, Y., Lin, J., Bai, Y., Huang, T., Kot, A. C., and Gao, W. Compact descriptors for video analysis: the emerging MPEG standard. CoRR, abs/1704.08141, 2017. URL http://arxiv.org/abs/1704.08141.
- Gong et al. (2014) Gong, Y., Liu, L., Yang, M., and Bourdev, L. D. Compressing deep convolutional networks using vector quantization. CoRR, abs/1412.6115, 2014. URL http://arxiv.org/abs/1412.6115.
- Gordo et al. (2016) Gordo, A., Almazán, J., Revaud, J., and Larlus, D. Deep image retrieval: Learning global representations for image search. CoRR, abs/1604.01325, 2016. URL http://arxiv.org/abs/1604.01325.
- Han et al. (2015) Han, S., Pool, J., Tran, J., and Dally, W. J. Learning both weights and connections for efficient neural networks. CoRR, abs/1506.02626, 2015. URL http://arxiv.org/abs/1506.02626.
- Husain & Bober (2017) Husain, S. S. and Bober, M. Improving large-scale image retrieval through robust aggregation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1783–1796, Sep. 2017. ISSN 0162-8828. doi: 10.1109/TPAMI.2016.2613873.
- Lin et al. (2017) Lin, J., Duan, L., Wang, S., Bai, Y., Lou, Y., Chandrasekhar, V., Huang, T., Kot, A., and Gao, W. Hnip: Compact deep invariant representations for video matching, localization, and retrieval. IEEE Transactions on Multimedia, 19(9):1968–1983, Sep. 2017. ISSN 1520-9210. doi: 10.1109/TMM.2017.2713410.
- Lou et al. (2017) Lou, Y., Bai, Y., Lin, J., Wang, S., Chen, J., Chandrasekhar, V., Duan, L., Huang, T., Kot, A. C., and Gao, W. Compact deep invariant descriptors for video retrieval. In 2017 Data Compression Conference (DCC), pp. 420–429, April 2017. doi: 10.1109/DCC.2017.31.
- Rastegari et al. (2016) Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. Xnor-net: Imagenet classification using binary convolutional neural networks. CoRR, abs/1603.05279, 2016. URL http://arxiv.org/abs/1603.05279.
- Yasmin et al. (2014) Yasmin, M., Mohsin, S., and Sharif, M. Intelligent image retrieval techniques: A survey. Journal of Applied Research and Technology, 12(1):87 – 103, 2014. ISSN 1665-6423. doi: https://doi.org/10.1016/S1665-6423(14)71609-8. URL http://www.sciencedirect.com/science/article/pii/S1665642314716098.