1 Introduction
Image retrieval employs compact feature descriptors to locate target images containing the same object or scene as the one depicted in a query image (Gordo et al., 2016; Husain & Bober, 2017; Yasmin et al., 2014). With the rapid progress of deep neural networks over the last few years, convolutional neural networks (CNNs) have become the dominant approach and have been adopted by MPEG for generating descriptors (ISO, 2018; Lin et al., 2017; Duan et al., 2017).
State-of-the-art CNNs consist of hundreds of millions of neurons, which require hundreds of megabytes of storage if represented in floating-point precision (Han et al., 2015). Image retrieval applications with CNNs require descriptor extraction to be performed with limited hardware resources. Therefore, it is preferable to use a fixed-point format with fewer bits, which reduces both logic and memory requirements and accelerates CNN processing. Meanwhile, CNN model compression is broadly used in fixed-point hardware to reduce the computation cost and the model size stored in local memory (Cheng et al., 2017). Model compression methods include pruning, coefficient clustering, and quantization (Rastegari et al., 2016; Gong et al., 2014). Scalar and vector quantization techniques, together with a scheme that shares weights across layers, have been used to reduce the CNN model size, achieving negligible loss in retrieval performance with 4-bit quantization for VGG16
(Chandrasekhar et al., 2017). It has also been demonstrated that a CNN-based descriptor can be extracted using a 4-bit compressed CNN model on devices with memory and power constraints (ISO, 2019). In addition, the limited image buffer size of chips must be considered when implementing a CNN in an ASIC, especially for large-scale images.

In this paper, we propose a deep model compression and weight quantization approach to enable efficient hardware implementation of CNN-based descriptor extraction. Our compression method balances compressed bit precision against performance using a hybrid bit quantization scheme across layers. The compressed model, with as few as 2-bit weight quantization, delivers performance similar to the floating-point model for image retrieval. Meanwhile, to handle large-scale images, we use a region nested invariance pooling (RNIP) strategy that concatenates feature maps from different sub-images to extract the deep feature descriptor. The proposed RNIP approach is compatible with, and can be combined with, the deep model compression technique to enable large-scale image retrieval.
The original contributions of this paper are as follows:

- Using as few as 2-bit weight quantization of the CNN on an ASIC chip, we achieve image retrieval performance similar to that of the floating-point model;

- We propose an improved pooling strategy, RNIP, that can be integrated with the 2-bit CNN model compression approach to retrieve large-scale images.
The paper is organized as follows. Section 2 shows how CNN and NIP are used for image retrieval. The proposed compressed model is presented in Section 3. The proposed RNIP strategy is described in Section 4. Section 5 shows the results of the compressed model and RNIP on chip for image retrieval. Section 6 concludes this paper.
2 Image Retrieval with CNN
For image retrieval, CNN with NIP is standardized as a reference model to generate the deep feature descriptor (ISO, 2018). In the first step, multiple rotated images are created from the input image at 0, 90, 180 and 270 degrees. Then, each rotated image is fed into the CNN to generate convolutional feature maps. Following that, NIP is used to generate a single deep feature descriptor from the feature maps of the rotated images by applying square-root pooling, average pooling, and max pooling in a chain (Lin et al., 2017; Duan et al., 2017; Lou et al., 2017). Finally, a 512-D vector is generated with the VGG16 model.
3 Deep Model Compression
3.1 Quantization
In a CNN, each convolutional layer is made up of neurons that consist of learnable weights and biases. Consider the $i$th weight and the $i$th bias of each layer, denoted $W_i$ and $b_i$. For each layer, our proposed quantization and compression scheme is as follows:
1) The bias $b_i$ is represented as a 12-bit signed integer.
2) To quantize $W_i$, the $i$th 3x3 filter can be expressed as
$$ W_i = \alpha_i M_i \qquad (1) $$
where $\alpha_i$ is the scalar for the $i$th filter and is quantized to 8 bits; $M_i$ is the quantized 3x3 mask for the $i$th filter, which can be quantized to 1, 2, 3, 4 or 5 bits for different layers. Consider the $k$th element of $W_i$ and the $k$th element of $M_i$, denoted $w_{i,k}$ and $m_{i,k}$, respectively. In the proposed scheme, for 1-bit quantization, $m_{i,k}$ is projected to $1$ if $w_{i,k} \ge t$, otherwise to $-1$, with $t$ being a threshold; the scalar is calculated as the mean of the absolute weights, $\alpha_i = \frac{1}{9}\sum_{k}|w_{i,k}|$.
3) To accommodate the dynamic range of the filter coefficients, a layer-wise 4-bit shifting value represents the exponent value of the weights.
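The 1-bit case of the scheme above can be sketched as follows. This is a minimal illustration: the threshold value (zero) and the omission of the 8-bit quantization of the scalar are simplifications of the scheme, not part of it.

```python
import numpy as np

def quantize_filter_1bit(w, t=0.0):
    """1-bit quantization of one 3x3 filter.

    w : float array of shape (3, 3), the original filter weights.
    t : threshold (its value is left open in the text; 0.0 is an assumption).
    Returns (alpha, mask): the per-filter scalar and a {+1, -1} mask.
    """
    # Mask element is +1 where the weight is at least t, otherwise -1.
    mask = np.where(w >= t, 1, -1).astype(np.int8)
    # The per-filter scalar is the mean of the absolute weights; its further
    # quantization to 8 bits is omitted here for clarity.
    alpha = np.abs(w).mean()
    return alpha, mask

def dequantize_filter(alpha, mask):
    # Reconstruct the filter as W_i ~ alpha_i * M_i, as in Eq. (1).
    return alpha * mask
```

For example, a filter whose weights average 0.23 in magnitude is stored as that single scalar plus nine signs, instead of nine 32-bit floats.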
To quantize the floating-point coefficients, it is important to trade off range against precision. Directly quantizing the model weights degrades accuracy. We found that retraining the model under fixed-point weight constraints brings its accuracy very close to that of the floating-point model while using far fewer bits for the weights. Algorithm 1 is used to retrain the fixed-point model.
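Since Algorithm 1 is not reproduced here, the following is only a generic sketch of retraining under fixed-point weight constraints, using a straight-through update on a toy least-squares task; the quantizer, the task, and the hyper-parameters are all assumptions, not the paper's algorithm.

```python
import numpy as np

def quantize(w, bits=2):
    # Uniform symmetric fixed-point quantizer; 2 bits give levels {-s, 0, +s}.
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / levels
    return np.round(w / scale) * scale if scale > 0 else w

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 4))
y = x @ np.array([0.5, -0.25, 0.75, -0.5])    # toy least-squares target

w = rng.normal(size=4) * 0.1                   # float "shadow" weights
loss_before = np.mean((x @ quantize(w) - y) ** 2)
for _ in range(200):
    wq = quantize(w)                           # forward pass uses quantized weights
    grad = x.T @ (x @ wq - y) / len(x)         # gradient w.r.t. the quantized weights
    w -= 0.1 * grad                            # straight-through: update the floats
loss_after = np.mean((x @ quantize(w) - y) ** 2)
```

The key idea this illustrates is that the float weights keep accumulating gradients computed through their quantized counterparts, so the model adapts to the fixed-point constraint instead of merely suffering it.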
3.2 Weights Precision and Compression Ratio
Relative to the 32-bit floating-point model, the compression ratio can be calculated as $r = 32N / (N b_m + b_s)$, where $N$ is set to 9 for a 3x3 filter, $b_m$ is the number of bits for each element in the mask, and $b_s$ is the number of bits for the scalar. In this work, we quantize the coefficients with different precision for different layers, where the earlier layers use more bits for the mask to improve accuracy without increasing the model size too much. For the first seven convolution layers of VGG16, we use 3 bits for the masks, giving an 8.2x compression ratio. The last six convolution layers use 1-bit masks, giving about a 17x compression ratio relative to the 32-bit floating-point model. For VGG16, this achieves an overall compression ratio of about 15.1x, which is close to 2-bit fixed-point quantization. The proposed compressed model can be easily implemented in an ASIC hardware design for image retrieval.
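The per-layer ratios quoted above follow directly from this formula (per-layer bias and 4-bit shift overheads are ignored, as in the text):

```python
def compression_ratio(b_m, b_s=8, n=9):
    """Compression ratio vs. 32-bit floats: 32*n / (n*b_m + b_s).

    n   : elements per filter (9 for a 3x3 filter)
    b_m : bits per mask element
    b_s : bits for the per-filter scalar
    """
    return 32 * n / (n * b_m + b_s)

print(round(compression_ratio(3), 1))  # first 7 conv layers, 3-bit masks -> 8.2
print(round(compression_ratio(1), 1))  # last 6 conv layers, 1-bit masks -> 16.9
```

The overall 15.1x figure is the parameter-weighted mix of these two per-layer ratios, since the later (1-bit) layers hold most of VGG16's convolutional weights.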
4 Region Nested Invariance Pooling
With reference to object detection and image segmentation using cropped sub-images, this paper proposes an improved NIP strategy, called RNIP, for deep feature descriptor extraction. As shown in Fig. 1, the RNIP strategy uses cropped sub-images from the original image as the input of the CNN. Using sub-images as CNN input reduces the required buffer size on devices. First, we crop the input image to generate multiple sub-images of different sizes, using the same crop strategy as the region of interest (ROI) sampling of feature maps in standard NIP. Second, each cropped image is fed into the CNN to generate convolutional feature maps. Following that, the set of ROIs from the feature maps of the corresponding cropped images is concatenated. Finally, NIP pooling is performed on the set of ROIs to generate a single descriptor. For VGG16, RNIP generates a 512-D feature descriptor for image retrieval.
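The four steps above can be sketched as follows. Several details are assumptions for illustration only: `cnn` is a random-projection stand-in for VGG16's convolutional stack, and the five crops (full image plus four quadrants) stand in for the RNIP 5x crop strategy; the paper's actual ROI sampling may differ.

```python
import numpy as np

def cnn(img):
    # Placeholder for the VGG16 convolutional stack (an assumption for this
    # sketch): average-pool 16x16 patches and project to 512 channels.
    h, w = img.shape[0] // 16, img.shape[1] // 16
    patches = img[: h * 16, : w * 16].reshape(h, 16, w, 16).mean(axis=(1, 3))
    proj = np.random.default_rng(42).normal(size=(512, 1, 1))
    return proj * patches[None, :, :]          # feature map of shape (512, h, w)

def rnip_descriptor(img):
    h, w = img.shape
    # Step 1: crop sub-images (assumed here: full image plus four quadrants).
    crops = [img,
             img[: h // 2, : w // 2], img[: h // 2, w // 2:],
             img[h // 2:, : w // 2], img[h // 2:, w // 2:]]
    # Steps 2-4: per-crop feature maps, then a NIP-style pooling chain:
    # square-root per element, average over each crop's spatial positions,
    # max across the concatenated set of crops.
    per_crop = [np.sqrt(np.abs(cnn(c))).mean(axis=(1, 2)) for c in crops]
    desc = np.stack(per_crop).max(axis=0)      # single 512-D descriptor
    return desc / np.linalg.norm(desc)
```

The point of the sketch is structural: each crop passes through the CNN independently, so the on-chip buffer only ever needs to hold one sub-image, yet the pooled output is still a single 512-D descriptor.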
5 Experiment Results
5.1 Image Retrieval with RNIP and NIP
This section evaluates the performance of RNIP on image retrieval, as shown in Table 1. Descriptors are built with the floating-point VGG16 model, and retrieval performance is evaluated by mean average precision (mAP).
In Table 1, all images of the INRIA Holidays dataset are resized to 224x224 for retrieval. RNIP 5x and RNIP 14x use 5 and 14 cropped sub-images, respectively, to extract deep feature descriptors; 512-D, 512-byte and 512-bit denote descriptors whose elements are single-precision floating-point, quantized to 8 bits, and binarized, respectively.

Table 1. Image retrieval performance (mAP) on INRIA Holidays (all images resized to 224x224).

Descriptor   NIP     RNIP 5x   RNIP 14x
512-D        0.806   0.842     0.852
512-byte     0.806   0.841     0.852
512-bit      0.773   0.803     0.824

Table 1 shows that the retrieval results for the 512-byte descriptor and the 512-D single-precision floating-point descriptor are very close. Hence, the descriptor size per image can be reduced from 2 KB to 0.5 KB without sacrificing retrieval performance. For the binarized descriptor, retrieval performance drops by around 0.03 mAP. Table 1 also shows that RNIP improves image retrieval performance over standard NIP.
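The three descriptor formats compared above can be illustrated as follows. The exact quantizer and binarization rule are assumptions (uniform min-max 8-bit quantization and thresholding at the mean); the paper does not spell them out.

```python
import numpy as np

def to_512byte(desc):
    # Map each float element to an unsigned 8-bit code (min-max scaling is
    # an assumed quantizer, for illustration only).
    lo, hi = desc.min(), desc.max()
    return np.round((desc - lo) / (hi - lo) * 255).astype(np.uint8)

def to_512bit(desc):
    # One bit per element: 1 if above the mean (assumed threshold), packed
    # into 64 bytes.
    return np.packbits(desc > desc.mean())

desc = np.random.default_rng(1).normal(size=512).astype(np.float32)
print(desc.nbytes, to_512byte(desc).nbytes, to_512bit(desc).nbytes)
# -> 2048 512 64  (i.e. 2 KB, 0.5 KB, and 64 bytes per image)
```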
5.2 Image Retrieval using ASIC Engine
For the classification task, the proposed retrained fixed-point VGG16 model achieves 70.13% top-1 and 89.91% top-5 accuracy on the ILSVRC2012 dataset, using as few as 2-bit weight quantization on the ASIC chip.
The performance of image retrieval using the ASIC engine and NIP on the INRIA Holidays dataset is shown in Fig. 2. The retrained fixed-point VGG16 model with 2-bit weight quantization achieves the same performance as the floating-point model for both 224x224 and 640x480 input image sizes. As Fig. 2(a) shows, the proposed deep model compression and quantization scheme on the CNN ASIC engine delivers the same image retrieval performance as the floating-point model.
Table 2 shows that the proposed RNIP can extract deep features locally from sub-images with no loss of retrieval performance. The proposed model compression scheme reduces the VGG16 model size from 59M bytes to 4M bytes. With RNIP, the 2-bit quantized VGG16 on chip with 224x224 image inputs delivers the same image retrieval performance as the floating-point VGG16 with 640x480 image inputs.
Table 2. Floating-point VGG16+NIP versus the proposed ASIC+RNIP pipeline.

                        VGG16+NIP        ASIC+RNIP
Input size:             640x480          640x480
Input cropping:         N/A              9x crop
CNN model input size:   640x480          224x224
CNN model format:       floating-point   2-bit fixed-point
CNN model size:         59M bytes        4M bytes
Descriptor size:        512 bytes        512 bytes
mAP score:              86.81%           86.94%

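The model sizes in Table 2 can be cross-checked from the VGG16 architecture. It is an assumption here that the 59M-byte figure counts only the 3x3 weights of the 13 convolutional layers (the descriptor pipeline drops the fully-connected layers) and that biases are negligible.

```python
# (input channels, output channels) of VGG16's 13 convolutional layers
cfg = [(3, 64), (64, 64), (64, 128), (128, 128),
       (128, 256), (256, 256), (256, 256),
       (256, 512), (512, 512), (512, 512),
       (512, 512), (512, 512), (512, 512)]
weights = sum(cin * cout * 9 for cin, cout in cfg)   # 9 weights per 3x3 kernel

float_mb = weights * 4 / 1e6          # 32-bit floats, in megabytes
compressed_mb = float_mb / 15.1       # overall compression ratio from Sec. 3.2
print(weights, round(float_mb), round(compressed_mb, 1))
# -> 14710464 59 3.9
```

Both numbers agree with Table 2: about 59M bytes for the floating-point model and about 4M bytes after compression.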
6 Conclusion
We proposed a deep model compression and quantization scheme, together with an improved pooling strategy named RNIP, to enable efficient hardware implementation of CNN-based descriptors for image retrieval. Experimental results show that our compressed VGG16, with as few as 2-bit quantization, can be directly ported onto our specially designed ASIC CNN engine while achieving performance similar to the floating-point model. Integrated with the RNIP strategy, the 2-bit quantized model can retrieve 640x480 images with a limited 224x224 image buffer on the ASIC chip, again with similar performance.
References
 ISO (2018) Information technology - multimedia content description interface - part 15: Compact descriptors for video analysis. Technical report, ISO/IEC JTC1/SC29/WG11, Geneva, Switzerland, 2018.
 ISO (2019) White paper on CDVA. Technical report, ISO/IEC JTC1/SC29/WG11/Nxxxxx, Marrakech, MA, 2019.
 Chandrasekhar et al. (2017) Chandrasekhar, V., Lin, J., Liao, Q., Morère, O., Veillard, A., Duan, L., and Poggio, T. A. Compression of deep neural networks for image instance retrieval. CoRR, abs/1701.04923, 2017. URL http://arxiv.org/abs/1701.04923.
 Cheng et al. (2017) Cheng, Y., Wang, D., Zhou, P., and Zhang, T. A survey of model compression and acceleration for deep neural networks. CoRR, abs/1710.09282, 2017. URL http://arxiv.org/abs/1710.09282.
 Duan et al. (2017) Duan, L., Chandrasekhar, V., Wang, S., Lou, Y., Lin, J., Bai, Y., Huang, T., Kot, A. C., and Gao, W. Compact descriptors for video analysis: the emerging MPEG standard. CoRR, abs/1704.08141, 2017. URL http://arxiv.org/abs/1704.08141.
 Gong et al. (2014) Gong, Y., Liu, L., Yang, M., and Bourdev, L. D. Compressing deep convolutional networks using vector quantization. CoRR, abs/1412.6115, 2014. URL http://arxiv.org/abs/1412.6115.
 Gordo et al. (2016) Gordo, A., Almazán, J., Revaud, J., and Larlus, D. Deep image retrieval: Learning global representations for image search. CoRR, abs/1604.01325, 2016. URL http://arxiv.org/abs/1604.01325.
 Han et al. (2015) Han, S., Pool, J., Tran, J., and Dally, W. J. Learning both weights and connections for efficient neural networks. CoRR, abs/1506.02626, 2015. URL http://arxiv.org/abs/1506.02626.
 Husain & Bober (2017) Husain, S. S. and Bober, M. Improving large-scale image retrieval through robust aggregation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1783–1796, Sep. 2017. ISSN 0162-8828. doi: 10.1109/TPAMI.2016.2613873.
 Langley (2000) Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
 Lin et al. (2017) Lin, J., Duan, L., Wang, S., Bai, Y., Lou, Y., Chandrasekhar, V., Huang, T., Kot, A., and Gao, W. HNIP: Compact deep invariant representations for video matching, localization, and retrieval. IEEE Transactions on Multimedia, 19(9):1968–1983, Sep. 2017. ISSN 1520-9210. doi: 10.1109/TMM.2017.2713410.
 Lou et al. (2017) Lou, Y., Bai, Y., Lin, J., Wang, S., Chen, J., Chandrasekhar, V., Duan, L., Huang, T., Kot, A. C., and Gao, W. Compact deep invariant descriptors for video retrieval. In 2017 Data Compression Conference (DCC), pp. 420–429, April 2017. doi: 10.1109/DCC.2017.31.
 Rastegari et al. (2016) Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. CoRR, abs/1603.05279, 2016. URL http://arxiv.org/abs/1603.05279.
 Yasmin et al. (2014) Yasmin, M., Mohsin, S., and Sharif, M. Intelligent image retrieval techniques: A survey. Journal of Applied Research and Technology, 12(1):87–103, 2014. ISSN 1665-6423. doi: https://doi.org/10.1016/S1665-6423(14)71609-8. URL http://www.sciencedirect.com/science/article/pii/S1665642314716098.