1 Introduction
In the modern era, where we must handle an ever-growing amount of high-dimensional data, it is useful to have systems that can efficiently retrieve the information we care about. Examples of such systems are content-based image retrieval (CBIR) (Datta et al., 2008; Babenko et al., 2014) and document/information retrieval (Mitra & Craswell, 2018). In large-scale systems, a linear search through the dataset is prohibitively expensive. Therefore, one often resorts to approximate methods, which trade accuracy for speed. These are commonly called approximate nearest neighbour (ANN) methods. Another important aspect of modern systems is data locality: ideally the data is stored on a fast local disk, and the resulting memory restrictions require a quantised representation. Notable examples in common use are locality-sensitive hashing (LSH) (Datar et al., 2004) and product quantisation (PQ) (Jegou et al., 2011; Ge et al., 2013).
Recently, deep learning has become an increasingly powerful tool for learning embeddings, due to the success of deep embedding learning (Oh Song et al., 2016; Hermans et al., 2017; Wu et al., 2017). These advances have motivated an approach called deep hashing (Wang et al., 2016; Erin Liong et al., 2015; Zhu et al., 2016), where one attempts to obtain a hash code directly from an image for use in content-based image retrieval tasks. These methods have been shown to greatly outperform traditional approaches. However, most of them rely on explicitly incorporating class label prediction (as opposed to constructing an affinity matrix) to improve performance, which leads to the following issues. Firstly, while exploiting class labels can improve discriminative capability, it makes incorporating new labels a non-trivial task. Secondly, the methods do not directly account for semantic similarity at a granular level, making them unsuitable for tasks such as duplicate detection. Lastly, their efficacy is commonly demonstrated only on datasets with a small number of classes, and generalisation to large-scale datasets has yet to be proven.
In this work, we propose a novel network architecture for end-to-end semantic hashing, which can be used for both deep hashing and learning an index structure (Kraska et al., 2018). Our network is inspired by the catalyser network (Sablayrolles et al., 2018) and the supervised structured binary code (SUBIC) (Jain et al., 2017): it explores the idea of transforming an input distribution into a uniform distribution, but directly learns to generate the hash code. Our method is also flexible in that it relies only on a similarity measure, which can be derived from neighbour ranking or class labels. We show the applicability of our model to a retrieval task on a publicly available dataset, and we show experimentally that our approach outperforms baseline methods such as LSH and PQ, in particular when the available bitrate is limited.
2 Related Work
In the hashing literature, common methods include LSH, Iterative Quantisation (ITQ) (Gong et al., 2013) and PQ. While the first two aim to generate codes in Hamming space, PQ aims to represent data using a codebook for a set of sub-vectors. Recently, deep learning-based hashing has become an active area of research (Lin et al., 2015; Liu et al., 2016). In particular, supervised hashing is concerned with hashing and retrieving objects (e.g. text or images) belonging to specific categories. In deep learning-based hashing, three aspects are often considered for the objective function:
How to preserve semantic similarities of the inputs in their generated hash codes?

How to devise a continuous representation that can be trained using a neural network, which simultaneously minimises the discrepancy from test-time discretisation/binarisation?

How can we optimise the available bitrate of a database (the output space), which can help minimise the collision probability?
The first challenge is often addressed by using a metric learning approach, such as contrastive loss (Hadsell et al., 2006), triplet loss (Schroff et al., 2015; Hermans et al., 2017) and their way extensions (Chen et al., 2017)
. The main drawback of these losses is that they are difficult to optimise in a high-dimensional space due to the curse of dimensionality (Friedman et al., 2001). Therefore, when class labels are available, it can be more effective to utilise them (Jain et al., 2017). Doing so induces a dependence on the quality and availability of the class labels. In fact, as pointed out in Sablayrolles et al. (2017), retrieving objects based on classification puts an upper bound on the recall performance. It can be more desirable to optimise the model on a more flexible objective which allows granular hashing based on semantic similarity, rather than solely on the label information. The second challenge is a product of the non-differentiability of a naive discretisation step. Therefore, at train time, one can resort to nonlinearities such as and , or continuous functions with better properties (Cao et al., 2017, 2018). The last point is usually handled by variants of entropy-based regularisation (Jain et al., 2017).
We also provide a more in-depth account of SUBIC (Jain et al., 2017) and the catalyser network (Sablayrolles et al., 2018), as these networks form the foundation of our network architecture.
SUBIC
Let be an input data (e.g. an image) and is the class label. Given an image , SUBIC outputs a structured binary code , which is expressed as blocks: with . This is achieved by the following network:
(1) 
where is a feature extractor, is a hash encoder, is a nonlinearity which applies the softmax function to each block and “∘” is the composition operator. During training, the blocks are relaxed into simplices: with . The novelty of SUBIC is to fit a classification layer to learn a discriminative binary code. Given a minibatch , the network is trained by minimising the following loss function:
(2) 
where , and is mean entropy of blocks:
(3) 
The idea of the Entropy term is to encourage the network output to become one-hot-like. The Negative Batch Entropy term, on the other hand, encourages uniform block support so that the available bitrate is fully exploited. At test time, is replaced by a block-wise operation, where the binary code is obtained by setting the maximally activated entry in each block to 1 and the rest to 0.
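The two entropy terms and the test-time binarisation described above can be sketched in NumPy. This is a minimal illustration and not the SUBIC reference implementation: the function names, the block count `M`, the block size `K` and the random logits are all assumptions made for the example.

```python
import numpy as np

def block_softmax(logits, M, K):
    """Apply a softmax independently to each of the M blocks of size K."""
    x = logits.reshape(-1, M, K)
    x = x - x.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p, eps=1e-12):
    """Shannon entropy along the last axis."""
    return -(p * np.log(p + eps)).sum(axis=-1)

def mean_entropy(p):
    """Average per-sample, per-block entropy; small when outputs are one-hot-like."""
    return entropy(p).mean()

def batch_entropy(p):
    """Entropy of the batch-averaged block outputs; large when all K entries are used."""
    return entropy(p.mean(axis=0)).mean()

def binarise(p):
    """Test-time code: 1 at the maximally activated entry of each block, 0 elsewhere."""
    idx = p.argmax(axis=-1)
    return np.eye(p.shape[-1], dtype=np.uint8)[idx]

# Hypothetical sizes and random logits, used only to exercise the functions.
rng = np.random.default_rng(0)
M, K = 4, 8
logits = rng.normal(size=(16, M * K))
p = block_softmax(logits, M, K)   # relaxed code, shape (16, M, K)
codes = binarise(p)               # structured binary code, shape (16, M, K)
```

In such a sketch, a training objective would subtract a weighted `batch_entropy` and add a weighted `mean_entropy`, which matches the roles of the two terms described in the text.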
Catalyser network
Let be the input, be the network embedding before quantisation is applied. Let be the quantised representation. The idea of the catalyser network is to embed data uniformly on an sphere, i.e. , which is subsequently encoded by an efficient lattice quantiser. The network is trained by minimising the triplet rank loss (Hermans et al., 2017) and maximising the entropy loss. Given a triplet of inputs (anchor point, positive sample and negative sample respectively), the loss is defined as:
(4) 
for margin . For entropy regularisation, the Kozachenko and Leonenko (KoLeo) entropy estimator is used as a surrogate function:
(5) 
The equation is simplified to:
(6) 
The geometric idea is to ensure that any two points are sufficiently far from each other, where the penalty decays logarithmically. The network then quantises the output using a Gosset code; we refer the interested reader to Sablayrolles et al. (2018) for more details.
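As a rough illustration of the two losses described above, the following NumPy sketch computes a margin-based triplet loss and a nearest-neighbour log-distance regulariser in the spirit of the simplified KoLeo term. The use of unsquared Euclidean distances, the default margin value and all function names are assumptions made for this example, not details taken from the cited papers.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge on (distance to positive) - (distance to negative) + margin."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()

def koleo_regulariser(z, eps=1e-12):
    """Negative mean log-distance of each point to its nearest neighbour in the batch.

    Minimising this value pushes every point away from its closest neighbour,
    with a penalty that decays logarithmically in the distance.
    """
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    return -np.log(d.min(axis=1) + eps).mean()

def normalise(z):
    """Project points onto the unit sphere."""
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Two hypothetical batches on the unit sphere: one spread out, one clumped.
rng = np.random.default_rng(0)
spread = normalise(rng.normal(size=(64, 8)))
pole = np.array([1.0] + [0.0] * 7)
clumped = normalise(pole + 0.01 * rng.normal(size=(64, 8)))

penalty_spread = koleo_regulariser(spread)
penalty_clumped = koleo_regulariser(clumped)
```

The clumped batch receives the larger penalty, which is the behaviour the regulariser is meant to have.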
3 Proposed Approach
While SUBIC is effective, its application is limited to cases where classification labels are available. On the other hand, catalyser networks are more flexible, but not end-to-end trainable. In this work, we propose a flexible approach which combines the benefits of both and mitigates their limitations.
The proposed network is composed of three components: a feature extractor , a catalyser , and a quantiser , see Fig. 2. Given an input image , the network directly generates a hash, which is a structured binary code as in SUBIC: . Let , . The difference between catalyser networks and our work is that we learn the quantisation network , making the architecture end-to-end trainable. Our quantiser is given by , where is a block-wise K-softmax.
For simplicity, consider each block separately. Our key insight is the following: a fully connected layer is simply a dot product between and the row vectors of . We have . Since at test time binarisation is done by selecting the maximally activated entry (within each block), this is equivalent to selecting the row vector with the smallest angular difference. This can be visualised using the row vectors, which linearly partition the output space, with decision boundaries extending from the origin (Fig. 3).
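The equivalence between picking the maximally activated entry and picking the row vector with the smallest angle can be checked numerically. The sketch below assumes a bias-free layer whose rows and inputs are both normalised to unit length; the sizes `d` and `K` and the names `W` and `z` are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 16, 32  # hypothetical embedding dimension and block size

# Unit-norm rows of a (bias-free) fully connected layer.
W = rng.normal(size=(K, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Unit-norm embeddings, standing in for the catalyser output.
z = rng.normal(size=(100, d))
z /= np.linalg.norm(z, axis=1, keepdims=True)

logits = z @ W.T                        # dot products, i.e. cosines of the angles
by_activation = logits.argmax(axis=1)   # test-time binarisation rule

angles = np.arccos(np.clip(logits, -1.0, 1.0))
by_angle = angles.argmin(axis=1)        # row with the smallest angular difference

dists = np.linalg.norm(z[:, None, :] - W[None, :, :], axis=-1)
by_distance = dists.argmin(axis=1)      # nearest row in Euclidean distance
```

With unit-norm vectors, the squared Euclidean distance equals 2 minus twice the dot product, so all three selection rules pick the same row.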
Let denote the probability distribution of the catalyser output taking a specific value in . To achieve the maximal entropy of (i.e. to maximise the used bitrate), one can assign to each row vector ’s with an equal probability. Geometrically, this can be seen as equally partitioning the support of by ’s, where . If is uniformly distributed on , then it is sufficient to uniformly distribute ’s to partition the equally. This gives us the following strategy: we (1) encourage the distribution of to be uniform on a sphere and (2) uniformly distribute ’s on the sphere. We can achieve both by using KoLeo entropy estimators in Eq. 6:
(7)
(8) 
However, a perfectly uniform distribution is unlikely to be achieved by training, especially once combined with other deep embedding losses. To circumvent this, we add the following:
(9) 
where (i.e. the index of the closest to ). By minimising Eq. 9, the row vectors ’s gravitate towards the probability mass of .
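The effect described above can be illustrated with a small NumPy sketch. It assumes that the quantisation loss is the mean squared Euclidean distance between each embedding and its closest row vector; that exact form, the names `quantisation_loss` and `update_rows`, the two-dimensional toy data and the omission of row normalisation are all assumptions made for this example.

```python
import numpy as np

def quantisation_loss(z, W):
    """Mean squared distance from each embedding to its closest row of W."""
    d2 = ((z[:, None, :] - W[None, :, :]) ** 2).sum(axis=-1)  # shape (N, K)
    nearest = d2.argmin(axis=1)                               # index of closest row
    return d2[np.arange(len(z)), nearest].mean(), nearest

def update_rows(z, W, lr=0.5):
    """One gradient step on W only; the embeddings z are treated as constants."""
    _, nearest = quantisation_loss(z, W)
    grad = np.zeros_like(W)
    # Each embedding contributes 2 * (w - z) to the gradient of its closest row.
    np.add.at(grad, nearest, 2.0 * (W[nearest] - z))
    return W - lr * grad / len(z)

# Toy data: embeddings concentrated away from the initial row vectors.
rng = np.random.default_rng(0)
z = rng.normal(loc=[2.0, 0.0], scale=0.3, size=(200, 2))
W = rng.normal(size=(8, 2))

loss_before, _ = quantisation_loss(z, W)
for _ in range(20):
    W = update_rows(z, W)
loss_after, _ = quantisation_loss(z, W)
```

Only the rows that are closest to some embedding receive a gradient, so they drift towards the region where the embeddings concentrate while the remaining rows stay in place.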
The remaining aspect is similar to previous approaches: we minimise a triplet loss to ensure similar points are embedded closely. We can minimise the triplet loss either in the output space of the catalyser or in the relaxed output space . In principle, it is beneficial to optimise directly in the final target space. Interestingly, however, it turns out that minimising the triplet rank loss on the simplex is difficult because most points are very close to each other in high-dimensional simplices (see Appendix), which yields training instability. We mitigate this issue by using an asymmetrical triplet loss:
(10) 
where , , are the anchor, positive and negative points in the embedded space respectively, is a margin, and ’s are the discretised points (i.e. obtained by replacing softmax with argmax). Note that sampling ’s is non-differentiable due to the argmax, but we can nevertheless backpropagate the information using the straight-through estimator (STE) proposed by Bengio et al. (2013), which has a resemblance to stochastic graph computation approaches (Maddison et al., 2016). Secondly, the loss becomes zero if and share the same binary representation. We empirically found that incorporating the triplet loss in was a useful additional loss to overcome this issue. The final objective is thus:
(11)
where ’s are hyperparameters to be optimised. Note that for the triplet loss, normalising and the rows of is important, as otherwise arbitrary scaling can make the training unstable. Secondly, to reduce the number of parameters, we partition to only take each block and learn , where . (Footnote 1: Ideally, the blocks would be decorrelated to remove redundancy; this is left as future work.) Finally, note that the feature extractor and catalyser are only optimised with respect to , and , whereas the quantiser weight is optimised only with respect to , and . In particular, Eq. 9 is only minimised by .
3.1 Encoding and distance computation
Given a set of data points, we encode via . The resulting vector can be compressed by storing the indices of the one-of-K vectors, which only requires bits. The distance between two compressed points can be given by the Euclidean distance: , which can be efficiently computed by lookup: , where is the index of having one. One can also perform asymmetric distance comparison (ADC), in which case is replaced by , the data representation prior to quantisation.
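The following NumPy sketch illustrates index-based storage and distance computation for block-wise one-hot codes. It relies on the fact that the squared Euclidean distance between two such codes equals twice the number of blocks whose indices differ. The function names and the choice of the relaxed softmax output as the unquantised query representation for ADC are assumptions made for this example.

```python
import numpy as np

def encode(p):
    """Compress block-wise outputs of shape (N, M, K) to M indices per point."""
    return p.argmax(axis=-1)

def bits_per_code(M, K):
    """Storage cost when each of the M indices is stored with ceil(log2 K) bits."""
    return M * int(np.ceil(np.log2(K)))

def symmetric_sq_dist(idx_query, idx_db):
    """Squared Euclidean distance between one-hot block codes, from indices only."""
    return 2 * (idx_query != idx_db).sum(axis=-1)

def asymmetric_sq_dist(p_query, idx_db):
    """Squared distance between a relaxed query (M, K) and one-hot database codes (N, M)."""
    M = p_query.shape[0]
    looked_up = p_query[np.arange(M), idx_db]  # query mass at each stored index, (N, M)
    return (p_query ** 2).sum() + M - 2.0 * looked_up.sum(axis=-1)

# Hypothetical relaxed outputs: N points, M blocks, each block on the K-simplex.
rng = np.random.default_rng(0)
N, M, K = 5, 4, 8
p = rng.dirichlet(np.ones(K), size=(N, M))
idx = encode(p)                          # shape (N, M), what would be stored
sym = symmetric_sq_dist(idx[0], idx)     # quantised query vs database
asym = asymmetric_sq_dist(p[0], idx)     # unquantised query vs database
```

Neither distance requires materialising the one-hot vectors, so a query only touches the stored indices.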
3.2 Network Implementation
For the feature encoder, a pretrained network can be used, such as VGG or ResNet architectures. The catalyser was implemented as a fully connected network with 2 hidden layers, each having 256 features, and a final layer which maps the dimension to . We used batch normalisation and Rectified Linear Units (ReLU) as the nonlinearity on all layers except the final one. The quantiser consists of separate fully connected layers with features. The overall network was trained using Adam with . The convergence speed of the network depends on the size of , but sufficient performance can usually be obtained within 3 hours of training.
4 Experiment
4.1 BigANN1M
We evaluate our proposed approach using the BigANN1M dataset (publicly available at http://corpus-texmex.irisa.fr/), which contains a collection of 128-dimensional SIFT feature vectors. As the input already consists of feature vectors, we set . The training data contains 30,000 points; the test data contains 10,000 query points and 1 million database points. For each point, we labelled the top nearest points in terms of Euclidean distance as the neighbours for the triplet loss.
For evaluation, we used the metric 1-Recall@K=10, which measures the probability of retrieving the true first neighbour within the first 10 candidates. We compare against LSH, ITQ and PQ as baseline methods. For PQ, we chose for each sub-vector and varied the values of to achieve the desired bit length .
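The evaluation metric can be computed as in the following sketch, which assumes that each query comes with a ranked list of candidate database ids and the id of its true nearest neighbour. The function name and the toy arrays are illustrative only.

```python
import numpy as np

def one_recall_at_k(ranked_ids, true_nn, k=10):
    """Fraction of queries whose true nearest neighbour is among the first k candidates."""
    ranked_ids = np.asarray(ranked_ids)
    true_nn = np.asarray(true_nn)
    hits = (ranked_ids[:, :k] == true_nn[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: two queries, three ranked candidates each.
ranked = np.array([[3, 1, 2],
                   [0, 2, 1]])
truth = np.array([1, 1])  # id of the true first neighbour for each query

recall_at_1 = one_recall_at_k(ranked, truth, k=1)
recall_at_2 = one_recall_at_k(ranked, truth, k=2)
recall_at_3 = one_recall_at_k(ranked, truth, k=3)
```

In this toy example the true neighbour appears at rank 2 for the first query and at rank 3 for the second, so the metric rises from 0.0 to 0.5 to 1.0 as k grows from 1 to 3.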
The results are summarised in Fig. 4. For the proposed method, we varied the number of , and to obtain different numbers of bits. One can see that the performance of the proposed approach is comparable to PQ, but better for lower numbers of bits. Note that ITQ and LSH use symmetric distance comparison, so the comparison is not a fair one. We also compared the proposed model with and without and we see a noticeable improvement. We speculate that this is because, although uniformly distributed points already allow a sufficient level of reconstruction without the loss, minimising the quantisation loss removes the “gaps” in .
4.1.1 Visualisation
We visualise the learnt weight vectors of the quantiser . For each sub-block, we randomly select 500 row vectors. Then we visualise 2 axes of these vectors (i.e. a projection onto a 2-dimensional plane rather than using dimensionality reduction techniques). Without using , the weights are uniformly distributed on the sphere (Fig. 5). However, when the loss is introduced, we see that the mass of the rows concentrates in a more local area.
5 Conclusion
In this work, we proposed a deep neural network which can perform end-to-end hashing of its input and which only requires knowledge of a similarity graph, a slightly more relaxed constraint than class labels. The network operates by transforming the input space into a uniform distribution by penalising the cost given by the KoLeo differential entropy estimator; the result is quantised by weight vectors uniformly distributed in its support. The network performs comparably to the baseline methods; however, there is plenty of room for improvement. In the future, it will be interesting to impose a different prior on the distribution on the simplex, e.g. via a Dirichlet distribution, to help control the output distribution, rather than relying on a uniform distribution.
Acknowledgements
We thank Lucas Theis, Ferenc Huszár, Hanchen Xiong and Twitter London CAML team for their valuable insights and comments for this work.
References

 Babenko et al. (2014) Artem Babenko, Anton Slesarev, Alexandr Chigorin, and Victor Lempitsky. Neural codes for image retrieval. In European conference on computer vision, pp. 584–599. Springer, 2014.
 Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013. URL http://arxiv.org/abs/1308.3432.
 Cao et al. (2018) Yue Cao, Mingsheng Long, Bin Liu, and Jianmin Wang. Deep cauchy hashing for hamming space retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1229–1237, 2018.
 Cao et al. (2017) Zhangjie Cao, Mingsheng Long, Jianmin Wang, and Philip S Yu. Hashnet: Deep learning to hash by continuation. In ICCV, pp. 5609–5618, 2017.
 Chen et al. (2017) Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
 Datar et al. (2004) Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pp. 253–262. ACM, 2004.
 Datta et al. (2008) Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (Csur), 40(2):5, 2008.
 Erin Liong et al. (2015) Venice Erin Liong, Jiwen Lu, Gang Wang, Pierre Moulin, and Jie Zhou. Deep hashing for compact binary codes learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2475–2483, 2015.
 Friedman et al. (2001) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics. Springer, New York, NY, USA, 2001.
 Ge et al. (2013) Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized product quantization for approximate nearest neighbor search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2946–2953, 2013.
 Gong et al. (2013) Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2916–2929, 2013.
 Hadsell et al. (2006) Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1735–1742. IEEE, 2006.
 Hermans et al. (2017) Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
 Jain et al. (2017) Himalaya Jain, Joaquin Zepeda, Patrick Pérez, and Rémi Gribonval. Subic: A supervised, structured binary code for image search. In Proc. Int. Conf. Computer Vision, volume 1, pp. 3, 2017.
 Jegou et al. (2011) Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2011.
 Kraska et al. (2018) Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, pp. 489–504. ACM, 2018.
 Lin et al. (2015) Kevin Lin, Huei-Fang Yang, Jen-Hao Hsiao, and Chu-Song Chen. Deep learning of binary hash codes for fast image retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 27–35, 2015.
 Liu et al. (2016) Haomiao Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen. Deep supervised hashing for fast image retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2064–2072, 2016.
 Maddison et al. (2016) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Mitra & Craswell (2018) Bhaskar Mitra and Nick Craswell. An introduction to neural information retrieval. Foundations and Trends® in Information Retrieval (to appear), 2018.
 Oh Song et al. (2016) Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012, 2016.
 Sablayrolles et al. (2017) Alexandre Sablayrolles, Matthijs Douze, Nicolas Usunier, and Hervé Jégou. How should we evaluate supervised hashing? In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 1732–1736. IEEE, 2017.
 Sablayrolles et al. (2018) Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Hervé Jégou. A neural network catalyzer for multidimensional similarity search. arXiv preprint arXiv:1806.03198, 2018.

 Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823, 2015.
 Wang et al. (2016) Jun Wang, Wei Liu, Sanjiv Kumar, and Shih-Fu Chang. Learning to hash for indexing big data - a survey. Proceedings of the IEEE, 104(1):34–57, 2016.
 Wu et al. (2017) Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krähenbühl. Sampling matters in deep embedding learning. In Proc. IEEE International Conference on Computer Vision (ICCV), 2017.
 Zhu et al. (2016) Han Zhu, Mingsheng Long, Jianmin Wang, and Yue Cao. Deep hashing network for efficient similarity retrieval. In AAAI, pp. 2415–2421, 2016.
6 Appendix
6.1 Distribution of pairwise distances on surfaces in n dimensions
In the main manuscript, we argued that it is difficult to directly train with the triplet rank loss on a high-dimensional simplex. Here we show how points on dimensional objects are distributed in high dimension as a part of the argument.
6.1.1 Interior of Simplex
We use a Dirichlet distribution with concentration parameter to sample points uniformly in the interior of a dimensional simplex. As one can see from Fig. 8, as the dimension increases, the points become more concentrated around the centre of the simplex . The distribution of the distance between two uniformly sampled points on the simplex also concentrates sharply around a small value, as can be seen in Fig. 7. However, in the case of the triplet rank loss, we would like to guarantee a sufficiently large margin to ensure separation between different classes. For example, the distance from any of the vertices to the centre of the simplex is and the distance between two vertices is . We empirically saw that the network often collapses to predicting just , and it is difficult to satisfy a meaningful margin while also pushing the points towards one of the vertices.
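The concentration effect can be reproduced with a short NumPy sketch. It assumes a Dirichlet distribution with all concentration parameters equal to 1, which is uniform on the simplex; the sample sizes, dimensions and function names are chosen for illustration. For reference, in a simplex with K vertices the distance from a vertex to the centre is sqrt(1 - 1/K) and the distance between two vertices is sqrt(2).

```python
import numpy as np

def mean_pairwise_distance(K, n=2000, seed=0):
    """Mean distance between independent pairs of points drawn uniformly from the simplex."""
    rng = np.random.default_rng(seed)
    x = rng.dirichlet(np.ones(K), size=n)  # concentration 1: uniform on the simplex
    y = rng.dirichlet(np.ones(K), size=n)
    return np.linalg.norm(x - y, axis=1).mean()

def vertex_to_centre(K):
    """Distance from a one-hot vertex to the centre (1/K, ..., 1/K)."""
    return np.sqrt(1.0 - 1.0 / K)

vertex_to_vertex = np.sqrt(2.0)  # distance between two distinct one-hot vertices

low_dim = mean_pairwise_distance(4)     # small simplex
high_dim = mean_pairwise_distance(256)  # high-dimensional simplex
```

The typical distance between random points shrinks as the dimension grows, while the vertex-to-vertex distance stays at sqrt(2), so a margin that is meaningful near the vertices is large compared with typical distances in the interior.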
6.1.2 Surface of nSphere
We sample points uniformly on the sphere by first sampling from , followed by normalisation. In this case, the distribution of distances between two uniformly sampled points on the sphere is given by: (Wu et al., 2017). As , the probability distribution converges to . In this case, there is sufficient space left between the majority of points, which is why we speculate that it is easier to train with the triplet rank loss. Note, however, that since all points are already far apart, careful negative example mining becomes very important to yield useful gradients.
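The sampling procedure and the concentration of distances can be checked with the following NumPy sketch. It assumes standard normal samples before normalisation; the dimensions, sample sizes and function names are illustrative. In high dimension the pairwise distances cluster around sqrt(2), the distance between two orthogonal unit vectors.

```python
import numpy as np

def sample_sphere(n, d, rng):
    """Draw n points uniformly on the unit sphere in d dimensions."""
    x = rng.normal(size=(n, d))  # isotropic Gaussian samples
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def pairwise_distance_stats(d, n=2000, seed=0):
    """Mean and standard deviation of the distance between independent random pairs."""
    rng = np.random.default_rng(seed)
    a = sample_sphere(n, d, rng)
    b = sample_sphere(n, d, rng)
    dist = np.linalg.norm(a - b, axis=1)
    return dist.mean(), dist.std()

low_mean, low_std = pairwise_distance_stats(3)
high_mean, high_std = pairwise_distance_stats(256)
```

The spread of distances is much narrower in 256 dimensions than in 3, which is consistent with the remark that most pairs are already far apart and that informative negatives have to be mined carefully.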
6.1.3 Interior of ncube
We also study the distribution of distances between two random points in the interior of a hypercube. Here, the distances get increasingly large as . Therefore, a hypercube would have been an alternative shape we could use as a domain for hashing, which could be interesting for future work. In this case, we could use a sigmoid function to set the range, but this could result in gradient saturation.