1 Introduction
Learning semantic representations using deep neural networks (DNN) is now a fundamental facet of applications ranging from visual search
(Jing et al., 2015; Hadi Kiapour et al., 2015) and semantic text matching (Neculoiu et al., 2016) to one-shot classification (Koch et al., 2015), clustering (Oh Song et al., 2017), and recommendation (Shankar et al., 2017). The high-dimensional dense embeddings generated from DNNs, however, pose a computational challenge for performing nearest neighbor search in large-scale problems with millions of instances. In particular, when the embedding dimension is high, evaluating the distance of any query to all the instances in a large database is expensive, so that efficient search without sacrificing accuracy is difficult. Representations generated using DNNs typically have a higher dimension compared to hand-crafted features such as SIFT (Lowe, 2004), and moreover are dense. The key caveat with dense features is that, unlike bag-of-words features, they cannot be efficiently searched through an inverted index without approximations. Since accurate search in high dimensions is prohibitively expensive in practice (Wang, 2011), one typically has to sacrifice accuracy for efficiency by resorting to approximate methods. Addressing the problem of efficient approximate Nearest-Neighbor Search (NNS) (Jegou et al., 2011) or Maximum Inner-Product Search (MIPS) (Shrivastava and Li, 2014) is thus an active area of research, which we review briefly in the related work section. Most approaches (Charikar, 2002; Jegou et al., 2011) aim to learn compact lower-dimensional representations that preserve distance information.
While there has been ample work on learning compact representations, learning sparse higher-dimensional representations has been addressed only recently (Jeong and Song, 2018; Cao et al., 2018). As a seminal instance, Jeong and Song (2018) propose an end-to-end approach to learn sparse and high-dimensional hashes, showing significant speedup in retrieval time on benchmark datasets compared to dense embeddings. This approach has also been motivated from a biological viewpoint (Li et al., 2018) by relating it to a fruit fly's olfactory circuit, thus suggesting the possibility of hashing using higher dimensions instead of reducing the dimensionality. Furthermore, as suggested by Glorot et al. (2011), sparsity can have the additional advantages of linear separability and information disentanglement.
In a similar vein, in this work we propose to learn high-dimensional embeddings that are sparse and hence efficient to retrieve using sparse matrix multiplication operations. In contrast to compact lower-dimensional ANN-style representations that typically lead to decreased representational power, a key facet of our higher-dimensional sparse embeddings is that they can have the same representational capacity as the initial dense embeddings. The core idea behind our approach is inspired by two key observations: (i) retrieval of $d$-dimensional sparse embeddings with a fraction $p$ of nonzero values on average can be sped up by a factor of $1/p$; (ii) the speedup can be further improved to a factor of $1/p^2$ by ensuring that the nonzero values are evenly distributed across all the dimensions. This indicates that sparsity alone is not sufficient to ensure maximal speedup; the distribution of the nonzero values plays a significant role as well. This motivates us to consider the effect of sparsity on the number of floating point operations (FLOPs) required for retrieval with an inverted index. We propose a penalty function on the embedding vectors that is a continuous relaxation of the exact number of FLOPs, and encourages an even distribution of the nonzeros across the dimensions.
We apply our approach to the large scale metric learning problem of learning embeddings for facial images. Our training loss consists of a metric learning loss (Weinberger and Saul, 2009) aimed at learning embeddings that mimic a desired metric, and a FLOPs loss to minimize the number of operations. We perform an empirical evaluation of our approach on the Megaface dataset (Kemelmacher-Shlizerman et al., 2016), and show that our proposed method successfully learns high-dimensional sparse embeddings that are orders-of-magnitude faster. We compare our approach to multiple baselines, demonstrating an improved or similar speed-vs-accuracy tradeoff.
The rest of the paper is organized as follows. In Section 3 we analyze the expected number of FLOPs, for which we derive an exact expression. In Section 4 we derive a continuous relaxation that can be used as a regularizer and optimized using gradient descent; we also provide some analytical justification for our relaxation. In Section 5 we compare our method to baselines on a large metric learning task, showing an improved speed-accuracy tradeoff.
2 Related Work
Learning compact representations, ANN.
Exact retrieval of the top-k nearest neighbours is expensive in practice for high-dimensional dense embeddings learned from deep neural networks, with practitioners often resorting to approximate nearest neighbours (ANN) for efficient retrieval. Popular approaches for ANN include: locality sensitive hashing (LSH) (Gionis et al., 1999; Andoni et al., 2015; Raginsky and Lazebnik, 2009), which relies on random projections; navigable small world graphs (NSW) (Malkov et al., 2014) and hierarchical NSW (HNSW) (Malkov and Yashunin, 2018), based on constructing efficient search graphs by finding clusters in the data; product quantization (PQ) (Ge et al., 2013; Jegou et al., 2011), which decomposes the original space into a cartesian product of low-dimensional subspaces and quantizes each of them separately; and spectral hashing (Weiss et al., 2009), which involves the NP-hard problem of computing an optimal binary hash, relaxed to continuous valued hashes that admit a simple solution in terms of the spectrum of the similarity matrix. Overall, for compact representations and to speed up query times, most of these approaches use a variety of carefully chosen data structures, such as hashes (Neyshabur and Srebro, 2015; Wang et al., 2018), locality sensitive hashes (Andoni et al., 2015), inverted file structures (Jegou et al., 2011; Baranchuk et al., 2018), trees (Ram and Gray, 2012), clustering (Auvolat et al., 2015), and quantization sketches (Jegou et al., 2011; Ning et al., 2016), as well as dimensionality reductions based on principal component analysis and t-SNE (Maaten and Hinton, 2008).

End-to-end ANN.
Learning the ANN structure end-to-end is another thread of work that has gained popularity recently. Norouzi et al. (2012) propose to learn binary representations for the Hamming metric by minimizing a margin based triplet loss. Erin Liong et al. (2015) use the signed output of a deep neural network as hashes, while imposing independence and orthogonality conditions on the hash bits. Other end-to-end approaches for learning hashes include (Cao et al., 2016; Li et al., 2017). An advantage of end-to-end methods is that they learn hash codes that are optimally compatible with the feature representations.
Sparse representations.
Sparse representations have been previously studied from various viewpoints. Glorot et al. (2011) explore sparse neural networks in modeling biological neural networks and show improved performance, along with additional advantages such as better linear separability and information disentangling. Ranzato et al. (2008, 2007) and Lee et al. (2008) propose learning sparse features using deep belief networks. Olshausen and Field (1997) explore sparse coding with an overcomplete basis, from a neurobiological viewpoint. Sparsity in autoencoders has been explored by Ng and others (2011) and Kavukcuoglu et al. (2010). Arpit et al. (2015) provide sufficient conditions to learn sparse representations, and also provide an excellent review of sparse autoencoders. Dropout (Srivastava et al., 2014) and a number of its variants (Molchanov et al., 2017; Park et al., 2018; Ba and Frey, 2013) have also been shown to impose sparsity in neural networks.

High-dimensional sparse representations.
Sparse deep hashing (SDH) (Jeong and Song, 2018) is an end-to-end approach that starts with a pretrained network and then performs alternate minimization consisting of two steps: one for training the binary hashes and the other for training the continuous dense embeddings. The first involves computing an optimal hash best compatible with the dense embedding using a min-cost-max-flow approach; the second is a gradient descent step that learns a dense embedding by minimizing a metric learning loss. A related approach, sparse autoencoders (Makhzani and Frey, 2013), learns representations in an unsupervised manner with at most $k$ nonzero activations. The idea of high-dimensional sparse embeddings is also reinforced by the sparse-lifting approach (Li et al., 2018), where sparse high-dimensional embeddings are learned from dense features, motivated by the biologically inspired fly algorithm (Dasgupta et al., 2017). Experimental results indicated that sparse-lifting improves both precision and speed when compared to traditional techniques like LSH that rely on dimensionality reduction.
$\ell_1$ regularization, lasso.
The lasso (Tibshirani, 1996) is the most popular approach to impose sparsity and has been used in a variety of applications, including sparsifying and compressing neural networks (Liu et al., 2015; Wen et al., 2016). The group lasso (Meier et al., 2008) is an extension of the lasso that encourages all features in a specified group to be selected together. Another extension, the exclusive lasso (Kong et al., 2014; Zhou et al., 2010), is on the other hand designed to select a single feature in a group. Our proposed regularizer, originally motivated by the idea of minimizing FLOPs, closely resembles the exclusive lasso. Our focus, however, is on sparsifying the produced embeddings rather than the parameters.
Sparse matrix vector product (SpMV).
Existing work on SpMV computations includes (Haffner, 2006; Kudo and Matsumoto, 2003), which propose algorithms based on inverted indices. Inverted indices are, however, known to suffer from severe cache misses. Linear algebra backends such as BLAS (Blackford et al., 2002) rely on efficient cache access to achieve speedup. Haffner (2006), Mellor-Crummey and Garvin (2004), and Krotkiewski and Dabrowski (2010) propose cache-efficient algorithms for sparse matrix vector products. There has also been substantial interest in speeding up SpMV computations using specialized hardware such as GPUs (Vazquez et al., 2010; Vázquez et al., 2011), FPGAs (Zhuo and Prasanna, 2005; Zhang et al., 2009), and custom hardware (Prasanna and Morris, 2007).
Metric learning.
While there exist many settings for learning embeddings (Hinton and Salakhutdinov, 2006; Kingma and Welling, 2013; Kiela and Bottou, 2014), in this paper we restrict our attention to the context of metric learning (Weinberger and Saul, 2009). Some examples of metric learning losses include the large margin softmax loss for CNNs (Liu et al., 2016), the triplet loss (Schroff et al., 2015), and proxy based metric losses (Movshovitz-Attias et al., 2017).
3 Expected number of FLOPs
In this section we study the effect of sparsity on the expected number of FLOPs required for retrieval, and derive an exact expression for it. The main idea in this paper is based on the key insight that if each of the $d$ dimensions of the embedding is nonzero with probability $p$
(not necessarily independently), then it is possible to achieve a speedup up to an order of $1/p^2$ using an inverted index on the set of embeddings. Consider two embedding vectors $u, v \in \mathbb{R}^d$. Computing $u^\top v$ requires computing only the pointwise products at the indices where both $u$ and $v$ are nonzero. This is the main motivation behind using inverted indices, and leads to the aforementioned speedup. Before we analyze it more formally, we introduce some notation.

Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ be a set of $n$ independent training samples drawn from $\mathcal{X} \times \mathcal{Y}$ according to a distribution $P$, where $\mathcal{X}$ and $\mathcal{Y}$ denote the input and label spaces respectively. Let $\{f_\theta : \mathcal{X} \to \mathbb{R}^d \mid \theta \in \Theta\}$ be a class of functions parameterized by $\theta$, mapping input instances to $d$-dimensional embeddings. Typically, for image tasks, the function $f_\theta$ is chosen to be a suitable CNN (Krizhevsky et al., 2012). Suppose $u = f_\theta(x)$ for $(x, y) \sim P$; then define the activation probability $p_j = P(u_j \neq 0)$, and its empirical version $\bar p_j = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[f_\theta(x_i)_j \neq 0]$.
We now show that sparse embeddings can lead to a quadratic speedup. Consider a $d$-dimensional sparse query vector $u_q \in \mathbb{R}^d$ and a database of $n$ sparse vectors forming a matrix $D \in \mathbb{R}^{n \times d}$. We assume that $u_q$ and the rows of $D$ are sampled independently from the embedding distribution. Computing the vector-matrix product $D u_q$ requires looking at only the columns of $D$ corresponding to the nonzero entries of $u_q$, given by $\{j : u_{q,j} \neq 0\}$. Furthermore, in each of those columns we only need to look at the nonzero entries. This can be implemented efficiently in practice by storing the nonzero indices for each column in independent lists, as depicted in Figure 1.
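As a concrete illustration, the column-wise inverted index described above can be sketched in a few lines of Python (a toy sketch of the data structure in Figure 1, not the paper's C++ implementation; sparse vectors are represented as dicts mapping dimension to value):

```python
# Toy sketch: per-column inverted lists for sparse query-vs-database scoring.
from collections import defaultdict

def build_index(vectors):
    """For each dimension j, store (row_id, value) for rows with a nonzero at j."""
    index = defaultdict(list)
    for i, vec in enumerate(vectors):          # vec: dict {dimension: value}
        for j, v in vec.items():
            index[j].append((i, v))
    return index

def score(index, query, n_rows):
    """Accumulate dot products, touching only columns where the query is nonzero."""
    scores = [0.0] * n_rows
    for j, qv in query.items():                # nonzero query entries only
        for i, v in index.get(j, []):          # nonzero column entries only
            scores[i] += qv * v
    return scores

db = [{0: 1.0, 2: 2.0}, {1: 3.0}, {2: 0.5, 3: 1.0}]
idx = build_index(db)
print(score(idx, {2: 2.0}, n_rows=3))          # [4.0, 0.0, 1.0]: only column 2 touched
```

The FLOPs incurred are exactly the pairs of overlapping nonzeros, which is what the next derivation counts in expectation.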
The number of FLOPs incurred is given by

$$F(D, u_q) = \sum_{j=1}^{d} \mathbb{1}[u_{q,j} \neq 0] \; |\{ i : D_{ij} \neq 0 \}|.$$

Taking the expectation on both sides w.r.t. $u_q$ and $D$, and using the independence of the data, we get

$$\mathbb{E}[F(D, u_q)] = n \sum_{j=1}^{d} p_j^2, \qquad (1)$$

where $p_j = P(u_j \neq 0)$. Since the expected number of FLOPs scales linearly with the number of vectors in the database, a more suitable quantity is the mean-FLOPs-per-row, defined as

$$\mathcal{F}(f_\theta, P) = \mathbb{E}[F(D, u_q)] / n = \sum_{j=1}^{d} p_j^2. \qquad (2)$$
Note that for a fixed amount of sparsity $\sum_{j} p_j$, this is minimized when each of the $d$ dimensions is nonzero with equal probability $\bar p = \frac{1}{d} \sum_{j=1}^{d} p_j$, upon which $\mathcal{F}(f_\theta, P) = d \bar p^2$ (so that $\mathcal{F}$ as a regularizer will in turn encourage such a uniform distribution across dimensions). Given such a uniform distribution, compared to dense multiplication, which has a complexity of $d$ per row, we thus get an improvement by a factor of $1/\bar p^2$. Thus when only a $\bar p$ fraction of all the entries is nonzero, and evenly distributed across all the columns, we achieve a speedup of $1/\bar p^2$. Note that independence of the nonzero indices is not necessary, due to the linearity of expectation; in fact, features from a neural network are rarely uncorrelated in practice.

FLOPs versus speedup.

While FLOPs reduction is a reasonable measure of speedup on primitive processors with limited parallelization and cache memory, it is not an accurate measure of the actual speedup on mainstream commercial processors such as Intel's CPUs and Nvidia's GPUs: the latter have cache and SIMD (Single-Instruction Multiple Data) mechanisms highly optimized for dense matrix multiplication, while sparse matrix multiplication is inherently less tailored to their cache and SIMD design (Sohoni et al., 2019). On the other hand, there have been threads of research on hardware with cache and parallelization tailored to sparse operations, showing speedup proportional to the FLOPs reduction (Han et al., 2016; Parashar et al., 2017). Modeling the cache and other hardware aspects can potentially lead to better performance but less generality, and is left to future work.
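The quadratic dependence on the activation probabilities can be checked numerically. The following sketch (values are ours, purely illustrative) shows that for a fixed total sparsity, the per-row FLOPs $\sum_j p_j^2$ is smallest when the probabilities are equal, yielding the $1/\bar p^2$ speedup over the dense cost of $d$ per row:

```python
def flops(p):
    """Expected FLOPs per database row: sum_j p_j^2 (Eqn. 2)."""
    return sum(pj * pj for pj in p)

d, total = 4, 0.4                    # dimensions, total activation probability
even   = [total / d] * d             # nonzeros spread evenly (p_bar = 0.1 each)
skewed = [total] + [0.0] * (d - 1)   # all nonzeros concentrated in one dimension

print(round(flops(even), 6), round(flops(skewed), 6))  # 0.04 0.16
print(round(d / flops(even)))        # speedup over dense cost d: 1/p_bar^2 = 100
```

The skewed distribution has the same total sparsity but four times the expected FLOPs, which is exactly why the regularizer must penalize uneven distributions and not just total sparsity.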
4 Our Approach
The $\ell_1$ regularization is the most common approach to induce sparsity. However, as we will also verify experimentally, it does not ensure the uniform distribution of the nonzeros across all the dimensions that is required for the optimal speedup. Therefore, we resort to incorporating the actual FLOPs incurred directly into the loss function, which will lead to an optimal tradeoff between the search time and accuracy. The FLOPs, being a discontinuous function of the model parameters, are hard to optimize, so we will instead optimize a continuous relaxation.

Denote by $\ell(f_\theta; \mathcal{D})$ any metric loss on $\mathcal{D}$ for the embedding function $f_\theta$. The goal in this paper is to minimize the loss while controlling the expected FLOPs defined in Eqn. 2. Since the distribution $P$ is unknown, we use the samples to get an estimate of $\mathcal{F}(f_\theta, P)$. Recall the empirical fraction of nonzero activations $\bar p_j$, which converges in probability to $p_j$. Therefore, with a slight abuse of notation, define $\mathcal{F}(f_\theta, \mathcal{D}) = \sum_{j=1}^{d} \bar p_j^2$, which is a consistent estimator of $\mathcal{F}(f_\theta, P)$ based on the samples $\mathcal{D}$. Note that $\mathcal{F}$ denotes either the population or empirical quantity depending on whether its second argument is $P$ or $\mathcal{D}$. We now consider the following regularized loss:

$$\min_\theta \; \ell(f_\theta; \mathcal{D}) + \lambda \, \mathcal{F}(f_\theta, \mathcal{D}) \qquad (3)$$

for some parameter $\lambda$ that controls the FLOPs-accuracy tradeoff. The regularized loss poses a further hurdle, as $\bar p_j$, and consequently $\mathcal{F}(f_\theta, \mathcal{D})$, are not continuous due to the presence of the indicator functions. We thus compute the following continuous relaxation. Define the mean absolute activation $a_j = \mathbb{E}[|u_j|]$ and its empirical version $\bar a_j = \frac{1}{n} \sum_{i=1}^{n} |f_\theta(x_i)_j|$, which is the $\ell_1$ norm of the activations (scaled by $1/n$), in contrast to the $\ell_0$ quasi-norm in the FLOPs calculation. Define the relaxation $\widetilde{\mathcal{F}}(f_\theta, P) = \sum_{j=1}^{d} a_j^2$ and its consistent estimator $\widetilde{\mathcal{F}}(f_\theta, \mathcal{D}) = \sum_{j=1}^{d} \bar a_j^2$. We propose to minimize the following relaxation, which can be optimized using any off-the-shelf stochastic gradient descent optimizer:

$$\min_\theta \; \ell(f_\theta; \mathcal{D}) + \lambda \, \widetilde{\mathcal{F}}(f_\theta, \mathcal{D}) \qquad (4)$$
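As a minimal sketch (ours, not the paper's TensorFlow code), the empirical relaxation in Eqn. 4 can be computed from a batch of activations by squaring the per-dimension mean absolute activation and summing:

```python
# Sketch of the relaxed FLOPs regularizer: sum_j (mean_i |u_ij|)^2.
def flops_regularizer(batch):
    """batch: list of n activation vectors, each of length d."""
    n, d = len(batch), len(batch[0])
    reg = 0.0
    for j in range(d):
        a_j = sum(abs(u[j]) for u in batch) / n   # mean absolute activation
        reg += a_j * a_j
    return reg

print(flops_regularizer([[1.0, 0.0], [1.0, 0.0]]))  # 1.0: all mass in one dimension
print(flops_regularizer([[1.0, 0.0], [0.0, 1.0]]))  # 0.5: spreading lowers the penalty
```

In a deep learning framework the same expression would be written with differentiable ops (square the column-wise means of the absolute activations and sum), then added to the metric loss scaled by $\lambda$.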
Sparse retrieval and reranking.
During inference, the sparse vector of a query image is first obtained from the learned model, and the nearest neighbour is searched for in a database of sparse vectors forming a sparse matrix. An efficient algorithm to compute the dot product of the sparse query vector with the sparse matrix is presented in Figure 1. It consists of first building, for each column, a list of the nonzero values and their positions. As motivated in Section 3, given a sparse query vector, it then suffices to iterate through its nonzero values and the corresponding columns. Next, a filtering step keeps only the scores greater than a specified threshold, and the top candidates among the remaining items are returned. The complete algorithm is presented in Algorithm 1. In practice, the sparse retrieval step alone is not sufficient to ensure good performance, so the top shortlisted candidates are further reranked using dense embeddings, as done in SDH. This step involves multiplying a small dense matrix with a dense vector. The number of shortlisted candidates is chosen such that the dense reranking time does not dominate the total time.
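The retrieval pipeline can be sketched as follows (a simplified sketch of Algorithm 1 with hypothetical names and illustrative thresholds; `sparse_scores` would come from the inverted-index product, and the dense dot products stand in for the reranking matrix-vector multiply):

```python
# Sketch: threshold filtering on sparse scores, then dense reranking of a shortlist.
def retrieve(sparse_scores, dense_db, dense_query, threshold, shortlist_size):
    # keep candidates whose sparse score clears the threshold
    cands = [i for i, s in enumerate(sparse_scores) if s > threshold]
    # shortlist the top scorers under the sparse score
    cands.sort(key=lambda i: sparse_scores[i], reverse=True)
    cands = cands[:shortlist_size]
    # rerank the shortlist with dense dot products (small dense matrix-vector)
    dense_score = lambda i: sum(a * b for a, b in zip(dense_db[i], dense_query))
    return sorted(cands, key=dense_score, reverse=True)

dense_db = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(retrieve([0.5, 0.6, 0.1], dense_db, [1.0, 0.0],
               threshold=0.2, shortlist_size=2))   # [0, 1]
```

Note how item 1 wins the sparse stage but item 0 wins after dense reranking; the shortlist size trades off this correction against the dense multiply cost.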
Comparison to SDH (Jeong and Song, 2018).
It is instructive to contrast our approach with that of SDH (Jeong and Song, 2018). In contrast to the binary hashes in SDH, our approach learns sparse real-valued representations. SDH uses a min-cost-max-flow approach in one of its training steps, while we train ours using only SGD. During inference in SDH, a shortlist of candidates is first created by considering the examples in the database whose hashes have nonempty intersections with the query hash; the candidates are then reranked using the dense embeddings. The shortlist in our approach, on the other hand, consists of the examples with the top scores from the sparse embeddings.
Comparison to unrelaxed FLOPs regularizer.
We provide an experimental comparison of our continuous relaxation based FLOPs regularizer to its unrelaxed variant, showing that the performance of the two is markedly similar. Setting up this experiment requires some analytical simplifications based on recent deep neural network analyses. We first recall recent results indicating that the output of a batch norm layer nearly follows a Gaussian distribution (Santurkar et al., 2018). In our context, we can thus make the simplifying approximation that the pre-activation $v_j$ (where $u_j = g(v_j)$, and $g$ is the activation used at the neural network output) is distributed as $\mathcal{N}(\mu_j, \sigma_j^2)$, with mean and variance depending on the model parameters $\theta$. We experimentally verify that this assumption holds by minimizing the KS distance (Massey Jr, 1951) between the CDF of $g(Z)$, where $Z \sim \mathcal{N}(\mu_j, \sigma_j^2)$, and the empirical CDF of the activations; the KS distance is minimized w.r.t. $(\mu_j, \sigma_j)$. Figure 2 shows the empirical CDF and the fitted CDF for two different architectures.

While $(\mu_j, \sigma_j)$ cannot be tuned independently due to their dependence on $\theta$, in practice the huge representational capacity of neural networks allows $\mu_j$ and $\sigma_j$ to be tuned almost independently. We consider a toy setting with 2-dimensional embeddings. For a tractable analysis, we make the simplifying assumption that, for $j \in \{1, 2\}$, $u_j$ is distributed as $g(Z_j)$ with $Z_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$, thus dropping the dependence on $\theta$.

We now analyze how minimizing the continuous relaxation $\widetilde{\mathcal{F}}$ compares to minimizing $\mathcal{F}$ itself. Note that we consider the population quantities here instead of the empirical ones, as they are more amenable to theoretical analysis due to the existence of closed-form expressions. We also consider the $\ell_1$ regularizer as a baseline. We initialize the parameters $(\mu_j, \sigma_j)$ and minimize the three quantities via gradient descent with infinitesimally small learning rates. For this contrastive analysis, we have not considered the effect of the metric loss. Note that while the discontinuous empirical quantity cannot be optimized via gradient descent, it is possible to do so for its population counterpart, which is available in closed form as a continuous function under the Gaussian assumption. The details of computing the gradients can be found in Appendix A.

We start with unequal activation probabilities $p_1 > p_2$, and plot the trajectory taken by gradient descent, shown in Figure 2. Without the effect of the metric loss, the probabilities are expected to go to zero, as observed in the plot. It can be seen that, in contrast to the $\ell_1$ regularizer, $\mathcal{F}$ and $\widetilde{\mathcal{F}}$ both tend to sparsify the less sparse activation ($p_1$) at a faster rate, which corroborates the fact that they encourage an even distribution of nonzeros.
[Figure 2: (a) Empirical and fitted CDFs of the activations, supporting the assumption that the pre-activation is a Gaussian random variable. (b) Gradient descent trajectories showing that $\mathcal{F}$ and $\widetilde{\mathcal{F}}$ behave similarly, sparsifying the less sparse activation at a faster rate when compared to the $\ell_1$ regularizer.]

$\widetilde{\mathcal{F}}$ promotes orthogonality.
We next show that, when the embeddings are normalized to have unit norm, as is typically done in metric learning, minimizing $\widetilde{\mathcal{F}}$ is equivalent to promoting orthogonality on the absolute values of the embedding vectors. Let $u_i = f_\theta(x_i)$ with $\|u_i\|_2 = 1$; we then have the following:

$$\widetilde{\mathcal{F}}(f_\theta, \mathcal{D}) = \sum_{j=1}^{d} \Big( \frac{1}{n} \sum_{i=1}^{n} |u_{ij}| \Big)^2 = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{k=1}^{n} \langle |u_i|, |u_k| \rangle, \qquad (5)$$

where $|u|$ denotes the element-wise absolute value. Since the diagonal terms $\langle |u_i|, |u_i| \rangle = 1$ are constant, $\widetilde{\mathcal{F}}$ is minimized when the vectors $|u_i|$ are orthogonal. Metric learning losses aim at minimizing the inter-class dot product, whereas the FLOPs regularizer aims at minimizing pairwise dot products irrespective of the class, leading to a tradeoff between sparsity and accuracy. This approach of pushing the embeddings apart bears some resemblance to the idea of spreading vectors (Sablayrolles et al., 2019), where an entropy based regularizer is used to uniformly distribute the embeddings on the unit sphere, albeit without considering any sparsity. Minimizing the pairwise dot products helps in reducing FLOPs, as is illustrated by the following toy example. Consider a set of $d$ vectors $\{u_i\}_{i=1}^{d}$ (here $n = d$) satisfying $\|u_i\|_2 = 1$. Then the sum of pairwise dot products $\sum_{i \neq k} \langle |u_i|, |u_k| \rangle$ is minimized when $u_i = e_i$, where $e_i$ is the one-hot vector with the $i$-th entry equal to 1 and the rest 0. The FLOPs regularizer thus tends to spread out the nonzero activations across all the dimensions, producing balanced embeddings. This simple example also demonstrates that when the number of classes in the training set is smaller than or equal to the number of dimensions $d$, a trivial embedding that minimizes the metric loss and also achieves a small number of FLOPs is $f_\theta(x) = e_y$, where $y$ is the true label for $x$. This is equivalent to predicting the class of the input instance. The caveat with such embeddings is that they might not be semantically meaningful beyond the specific supervised task, and will naturally hurt performance on unseen classes, and on tasks where the representation itself is of interest. In order to avoid such a collapse in our experiments, we ensure that the embedding dimension is smaller than the number of training classes. Furthermore, as recommended by Sablayrolles et al. (2017), we perform all our evaluations on unseen classes.
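The identity in Eqn. 5 can be verified numerically. The following sketch (with arbitrary unit-norm vectors of our choosing) checks that the sum of squared mean absolute activations equals the mean pairwise dot product of the absolute embeddings:

```python
# Numeric check of Eqn. 5 on a small set of unit-norm embeddings.
import math

U = [[0.6, 0.8], [1.0, 0.0], [0.0, 1.0]]   # unit-norm rows
n, d = len(U), len(U[0])

# left side: sum over dimensions of the squared mean absolute activation
lhs = sum((sum(abs(U[i][j]) for i in range(n)) / n) ** 2 for j in range(d))
# right side: mean over all pairs of <|u_i|, |u_k|>
rhs = sum(sum(abs(U[i][j]) * abs(U[k][j]) for j in range(d))
          for i in range(n) for k in range(n)) / n ** 2

assert math.isclose(lhs, rhs)
print(round(lhs, 4))   # 0.6444
```

Replacing `U` with the orthogonal one-hot set `[[1, 0], [0, 1]]` drives the cross terms, and hence the regularizer beyond its constant diagonal, to zero.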
Exclusive lasso.
Also known as the $\ell_{1,2}$ norm, the exclusive lasso has previously been used to induce competition (or exclusiveness) among features in the same group. More formally, consider $d$ features indexed by $\{1, \dots, d\}$, and a set of groups $\mathcal{G} \subseteq 2^{\{1, \dots, d\}}$ (the power set of the feature indices). Let $w \in \mathbb{R}^d$
denote the weight vector for a linear classifier. The exclusive lasso regularizer is defined as

$$\Omega(w) = \sum_{g \in \mathcal{G}} \|w_g\|_1^2,$$

where $w_g$ denotes the subvector of $w$ corresponding to the indices in $g$. The choice of $\mathcal{G}$ can be used to induce various kinds of structural properties; for instance, $\mathcal{G}$ can consist of groups of correlated features, in which case the regularizer prevents feature redundancy by selecting only a few features from each group.
Our proposed FLOPs based regularizer has the same form as the exclusive lasso: applied to a batch of activations, with the groups being the columns of the activation matrix (and rows corresponding to different inputs), the exclusive lasso is equivalent to the FLOPs regularizer. Within each activation column, the FLOPs regularizer can thus be said to induce competition between different input examples for having a nonzero activation.
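This equivalence is easy to check numerically. A minimal sketch (our toy example) computes the exclusive lasso with groups taken as the columns of a small activation matrix; dividing by $n^2$ recovers the relaxed FLOPs regularizer of the batch:

```python
# Sketch: exclusive lasso sum_g ||w_g||_1^2, with groups = activation columns.
def exclusive_lasso(groups):
    """Sum over groups of the squared l1 norm of each group's entries."""
    return sum(sum(abs(v) for v in g) ** 2 for g in groups)

acts = [[1.0, 0.0],
        [1.0, 2.0]]                 # rows = inputs, columns = embedding dimensions
cols = list(zip(*acts))             # groups = columns of the activation matrix

print(exclusive_lasso(cols))                    # (1+1)^2 + (0+2)^2 = 8.0
print(exclusive_lasso(cols) / len(acts) ** 2)   # 2.0: the relaxed FLOPs penalty
```

The column-wise squared-$\ell_1$ form is exactly $\sum_j (\sum_i |u_{ij}|)^2 = n^2 \sum_j \bar a_j^2$, hence the $1/n^2$ rescaling.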
5 Experiments
We evaluate our proposed approach on a large scale metric learning dataset: Megaface (Kemelmacher-Shlizerman et al., 2016), used for face recognition. This is a much more fine-grained retrieval task (with 85k classes for training) compared to the datasets used by Jeong and Song (2018). This dataset also satisfies our requirement of the number of classes being orders of magnitude higher than the dimension of the sparse embedding. As discussed in Section 4, a small number of classes during training can lead the model to simply learn an encoding of the training classes and thus not generalize to unseen classes. Face recognition datasets avoid this situation by virtue of the huge number of training classes and a balanced distribution of examples across all the classes.

Following the standard protocol for evaluation on the Megaface dataset (Kemelmacher-Shlizerman et al., 2016), we train on a refined version of the MS-Celeb-1M (Guo et al., 2016) dataset released by Deng et al. (2018), consisting of several million images spanning 85k classes. We evaluate with 1 million distractors from the Megaface dataset and 3.5k query images from the Facescrub dataset (Ng and Winkler, 2014), which were not seen during training.
Network architecture.
We experiment with two architectures: MobileFaceNet (Chen et al., 2018) and ResNet-101 (He et al., 2016). We use ReLU activations in the embedding layer for MobileFaceNet, and SThresh activations (defined below) for ResNet. The activations are $\ell_2$-normalized to produce an embedding on the unit sphere, and used to compute the Arcface loss (Deng et al., 2018). We learn 1024-dimensional sparse embeddings for the $\ell_1$ and $\mathcal{F}$ regularizers, and 128- and 512-dimensional dense embeddings as baselines. All models were implemented in Tensorflow (Abadi et al., 2016), with the sparse retrieval algorithm implemented in C++. The reranking step used 512-dimensional dense embeddings.

Activation function.
In practice, having a nonlinear activation at the embedding layer is crucial for sparsification. Layers with activations such as ReLU are easier to sparsify due to the bias parameter in the layer before the activation (linear or batch norm), which acts as a direct control knob for the sparsity. More specifically, $\mathrm{ReLU}(v + b)$ can be made more (less) sparse by decreasing (increasing) the components of $b$, where $b$ is the bias parameter of the previous linear layer. In this paper we consider two types of activations: $\mathrm{ReLU}(v) = \max(0, v)$, and the soft thresholding operator $\mathrm{SThresh}(v) = \mathrm{sign}(v) \max(|v| - t, 0)$ for a threshold $t > 0$ (Boyd and Vandenberghe, 2004). ReLU activations always produce positive values, whereas soft thresholding can produce negative values as well.
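The two activations can be sketched as follows (the threshold $t = 0.5$ is an illustrative default of ours, not necessarily the value used in the experiments):

```python
def relu(v):
    """Rectified linear unit: zero out negative values."""
    return max(0.0, v)

def sthresh(v, t=0.5):
    """Soft thresholding: shrink |v| by t, zeroing anything within [-t, t]."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

print([relu(x) for x in (-1.0, 0.3, 2.0)])     # [0.0, 0.3, 2.0]
print([sthresh(x) for x in (-1.0, 0.3, 2.0)])  # [-0.5, 0.0, 1.5]
```

Note how both map a band of inputs exactly to zero, which is what makes the embedding genuinely sparse rather than merely small.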
Practical considerations.
In practice, setting a large regularization weight $\lambda$ from the beginning is harmful for training. Sparsifying too quickly using a large $\lambda$ leads to many dead activations (saturated to zero) in the embedding layer, and the model getting stuck in a local minimum. Therefore, we use an annealing procedure and gradually increase $\lambda$ throughout the training using a regularization weight schedule $\lambda(t)$ that maps the training step $t$ to a real-valued regularization weight. In our experiments we choose a schedule that increases quadratically as $\lambda(t) = \lambda (t/T)^2$ until step $t = T$, where $T$ is the threshold step beyond which $\lambda(t) = \lambda$.
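The annealing schedule can be sketched as follows (the values of $\lambda$ and $T$ are illustrative, not the paper's settings):

```python
# Quadratic warm-up of the regularization weight, flat after step T.
def reg_weight(step, lam, T):
    return lam * min(step / T, 1.0) ** 2

print(reg_weight(0, 1e-3, 1000))     # 0.0
print(reg_weight(500, 1e-3, 1000))   # 0.00025: quarter of the final weight at T/2
print(reg_weight(2000, 1e-3, 1000))  # 0.001: capped at lambda beyond T
```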
Baselines.
We compare our proposed regularizer with multiple baselines: exhaustive search with dense embeddings, sparse embeddings using $\ell_1$ regularization, Sparse Deep Hashing (SDH) (Jeong and Song, 2018), and PCA, LSH, and PQ applied to the 512-dimensional dense embeddings from both architectures. We train the SDH model using the aforementioned architectures for 512-dimensional embeddings, varying the number of active hash bits. We use numpy (with efficient MKL optimizations in the backend) for the matrix multiplication required for exhaustive search in the dense and PCA baselines. We use the CPU version of the Faiss (Johnson et al., 2017) library for LSH and PQ (we use the IVFPQ index from Faiss).
Further details on the training hyperparameters and the hardware used can be found in Appendix B.

5.1 Results
We report the recall and the time-per-query for various hyperparameters of our proposed approach and the baselines, yielding tradeoff curves. The reported times include the time required for reranking. The tradeoff curves for MobileNet and ResNet are shown in Figure 3. We observe that while vanilla $\ell_1$ regularization is an improvement by itself for some hyperparameter settings, the $\mathcal{F}$ regularizer is a further improvement and yields the best tradeoff curve. SDH has a very poor speed-accuracy tradeoff, mainly due to the explosion in the number of shortlisted candidates with increasing numbers of active bits, leading to an increase in the retrieval time; on the other hand, while a small number of active bits is faster, it leads to a smaller recall. For the other baselines we notice the usual order of performance, with PQ having the best speedup compared to LSH and PCA. While dimensionality reduction using PCA leads to some speedup for relatively high dimensions, the speedup quickly wanes as the dimension is reduced further.
[Figure 3: Time-vs-recall and sparsity-vs-suboptimality tradeoff curves for the MobileNet and ResNet architectures.]
We also report the suboptimality ratio $r = \sum_{j=1}^{d} \bar p_j^2 / (d \bar p^2)$ computed over the test dataset, where $\bar p = \frac{1}{d} \sum_{j=1}^{d} \bar p_j$ is the mean activation probability estimated on the test data. Notice that $r \ge 1$, and the optimum $r = 1$ is achieved when $\bar p_j = \bar p$ for all $j$, that is, when the nonzeros are evenly distributed across the dimensions. The sparsity-vs-suboptimality plots for MobileNet and ResNet are shown in Figure 3. We notice that the $\mathcal{F}$ regularizer yields values of $r$ closer to 1 when compared to the $\ell_1$ regularizer. For the MobileNet architecture, the $\ell_1$ regularizer achieves values of $r$ close to those of $\mathcal{F}$ in the less sparse region; however, the gap increases substantially with increasing sparsity. For the ResNet architecture, on the other hand, the $\ell_1$ regularizer yields extremely suboptimal embeddings in all regimes. The $\mathcal{F}$ regularizer is therefore able to produce a more balanced distribution of nonzeros.
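The suboptimality ratio can be sketched as follows (taking $r = \sum_j \bar p_j^2 / (d \bar p^2)$, which equals 1 exactly when the nonzeros are evenly distributed; the example probabilities are ours):

```python
# Sketch: suboptimality ratio of a set of empirical activation probabilities.
def suboptimality(p):
    d = len(p)
    p_bar = sum(p) / d                     # mean activation probability
    return sum(pj * pj for pj in p) / (d * p_bar ** 2)

print(round(suboptimality([0.1, 0.1, 0.1, 0.1]), 6))  # 1.0: evenly distributed
print(round(suboptimality([0.4, 0.0, 0.0, 0.0]), 6))  # 4.0: all mass in one dimension
```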
The suboptimality is also reflected in the recall values. The gap in the recall values of the $\ell_1$ and $\mathcal{F}$ models is much larger when the suboptimality gap is larger, as in the case of ResNet, and small when the suboptimality gap is small, as in the case of MobileNet. This shows the significance of having a balanced distribution of nonzeros. Additional results, including results without the reranking step and performance on CIFAR-100, can be found in Appendix C.
6 Conclusion
In this paper we proposed a novel approach to learn high-dimensional embeddings with the goal of improving the efficiency of retrieval tasks. Our approach integrates the FLOPs incurred during retrieval into the loss function as a regularizer, and optimizes it directly through a continuous relaxation. We provide further insight into our approach by showing that it favors an even distribution of the nonzero activations across all the dimensions, and experimentally showed that it indeed leads to a more even distribution when compared to the $\ell_1$ regularizer. We compared our approach to a number of baselines and showed that it has a better speed-vs-accuracy tradeoff. Overall, we were able to show that sparse embeddings can be around 50× faster than dense embeddings without a significant loss of accuracy.
Acknowledgements
We thank HongYou Chen for helping in running some baselines during the early stages of this work. This work has been partially funded by the DARPA D3M program and the Toyota Research Institute. Toyota Research Institute ("TRI") provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
References

Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
Practical and optimal LSH for angular distance. NeurIPS.
Why regularized auto-encoders learn sparse representation? arXiv preprint arXiv:1505.05561.
Clustering is efficient for approximate maximum inner product search. arXiv preprint arXiv:1507.05910.
Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems, pp. 3084–3092.
Revisiting the inverted indices for billion-scale approximate nearest neighbors. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 202–216.
An updated set of basic linear algebra subprograms (BLAS). ACM Transactions on Mathematical Software 28 (2), pp. 135–151.
Convex optimization. Cambridge University Press.
Deep Cauchy hashing for Hamming space retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1229–1237.
Deep quantization network for efficient image retrieval. AAAI.
Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, pp. 380–388.
MobileFaceNets: efficient CNNs for accurate real-time face verification on mobile devices. arXiv preprint arXiv:1804.07573.
A neural algorithm for a fundamental computing problem. Science.
ArcFace: additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698.
Deep hashing for compact binary codes learning. CVPR.
Optimized product quantization for approximate nearest neighbor search. CVPR.
Similarity search in high dimensions via hashing. VLDB.
Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323.
MS-Celeb-1M: a dataset and benchmark for large scale face recognition. In European Conference on Computer Vision.
Where to buy it: matching street clothing photos in online shops. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3343–3351.
Fast transpose methods for kernel learning on sparse data. In Proceedings of the 23rd International Conference on Machine Learning, pp. 385–392.
EIE: efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254.
Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645.
Reducing the dimensionality of data with neural networks. Science 313 (5786), pp. 504–507.
Labeled faces in the wild: a database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition.
Product quantization for nearest neighbor search. TPAMI.
Efficient end-to-end learning for quantizable representations. ICML.
Visual search at Pinterest. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1889–1898.
Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734.
Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint arXiv:1010.3467.
The MegaFace benchmark: 1 million faces for recognition at scale. CVPR.
Learning image embeddings using convolutional neural networks for improved multimodal semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 36–45.
Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2.
Exclusive feature learning on arbitrary structures via ℓ1,2-norm. In Advances in Neural Information Processing Systems, pp. 1655–1663.
CIFAR-10 and CIFAR-100 datasets. URL: https://www.cs.toronto.edu/~kriz/cifar.html.
ImageNet classification with deep convolutional neural networks. NeurIPS.
Parallel symmetric sparse matrix–vector product on scalar multi-core CPUs. Parallel Computing 36 (4), pp. 181–198.
Fast methods for kernel-based text analysis. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Volume 1, pp. 24–31.
Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems, pp. 873–880.
Deep supervised discrete hashing. NeurIPS.
Fast similarity search via optimal sparse lifting. NeurIPS.
Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814.
Large-margin softmax loss for convolutional neural networks. In ICML, Vol. 2, pp. 7.
Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision.
Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
K-sparse autoencoders. arXiv preprint arXiv:1312.5663.
Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. TPAMI.
Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems.
The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association 46 (253), pp. 68–78.
The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 (1), pp. 53–71.
Optimizing sparse matrix–vector product computations using unroll and jam. The International Journal of High Performance Computing Applications 18 (2), pp. 225–236.
Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2498–2507.
AgeDB: the first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
No fuss distance metric learning using proxies. arXiv preprint arXiv:1703.07464.
Learning text similarity with siamese recurrent networks. In Proceedings of the 1st Workshop on Representation Learning for NLP, pp. 148–157.
On symmetric and asymmetric LSHs for inner product search. In International Conference on Machine Learning, pp. 1926–1934.
Sparse autoencoder. CS294A Lecture Notes 72 (2011), pp. 1–19.
A data-driven approach to cleaning large face datasets. In 2014 IEEE International Conference on Image Processing (ICIP).
Scalable image retrieval by sparse product quantization. IEEE Transactions on Multimedia 19 (3), pp. 586–597.
Hamming distance metric learning. NeurIPS.
Deep metric learning via facility location. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5382–5390.
Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research 37 (23), pp. 3311–3325.
SCNN: an accelerator for compressed-sparse convolutional neural networks. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 27–40.
Adversarial dropout for supervised and semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
Sparse matrix computations on reconfigurable hardware. Computer 40 (3), pp. 58–64.
Locality-sensitive binary codes from shift-invariant kernels. NeurIPS.
Maximum inner-product search using cone trees. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 931–939.
Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems, pp. 1185–1192.
Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems, pp. 1137–1144.
Spreading vectors for similarity search. ICLR.
How should we evaluate supervised hashing? In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1732–1736.
How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pp. 2483–2493.
FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
Deep learning based large scale visual recommendation and search for e-commerce. arXiv preprint arXiv:1703.02344.
Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems, pp. 2321–2329.
Low-memory neural network training: a technical report. arXiv preprint arXiv:1904.10631.
Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288.
A new approach for sparse matrix vector product on NVIDIA GPUs. Concurrency and Computation: Practice and Experience 23 (8), pp. 815–826.
Improving the performance of the sparse matrix vector product with GPUs. In 2010 10th IEEE International Conference on Computer and Information Technology, pp. 1146–1151.
A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 769–790.
A fast exact k-nearest neighbors algorithm for high dimensional search using k-means clustering and triangle inequality. In The 2011 International Joint Conference on Neural Networks (IJCNN), pp. 1293–1299.
Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10 (Feb), pp. 207–244.
Spectral hashing. NeurIPS.
Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082.
FPGA vs. GPU for sparse matrix vector multiply. In 2009 International Conference on Field-Programmable Technology, pp. 255–262.
Exclusive lasso for multi-task feature selection. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 988–995.
Sparse matrix-vector multiplication on FPGAs. In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays, pp. 63–74.
Appendix A Gradient computations for analytical experiments
As described in the main text, for the purposes of an analytical toy experiment, we consider a simplified setting with 2-d embeddings, the j-th (j ∈ {1, 2}) activation being Gaussian-distributed with a negative mean, as is typical for sparse activations (activation probability below 1/2). Computing the gradients of the three compared regularizers then boils down to computing the gradients of the expectations appearing in the following lemmas. We drop the subscript j for brevity, as the computations are identical for both activations.
(6) 
and,
(7) 
where $\Phi$ denotes the cdf of the standard Gaussian distribution.
Proof of Lemma A.
The proof is based on standard Gaussian identities.
∎
(8)  
(9) 
Proof of Lemma A.
Follows directly from the statement by standard differentiation. ∎
(10)  
(11) 
Proof of Lemma A.
(12)  
(13) 
(14)  
(15) 
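As a reference point, the standard Gaussian identities for a rectified Gaussian activation are the following. This is a hedged reconstruction, not necessarily the lemmas' exact statements: we take $a = \mathrm{relu}(z)$ with $z \sim \mathcal{N}(\mu, \sigma^2)$.

```latex
\mathbb{E}[a] \;=\; \mu\,\Phi\!\left(\frac{\mu}{\sigma}\right) + \sigma\,\phi\!\left(\frac{\mu}{\sigma}\right),
\qquad
\frac{\partial}{\partial \mu}\,\mathbb{E}[a] \;=\; \Phi\!\left(\frac{\mu}{\sigma}\right),
\qquad
\frac{\partial}{\partial \mu}\,\bigl(\mathbb{E}[a]\bigr)^{2} \;=\; 2\,\mathbb{E}[a]\,\Phi\!\left(\frac{\mu}{\sigma}\right),
```

where $\Phi$ and $\phi$ denote the standard normal cdf and pdf. The first identity follows from integrating $z$ against the Gaussian density over $[0, \infty)$; the second follows by differentiating it and using $\phi'(x) = -x\,\phi(x)$, and the third by the chain rule.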
Appendix B Experimental details
All images were resized and aligned using a pretrained aligner (https://github.com/deepinsight/insightface). For the ArcFace loss function, we used the recommended margin and temperature parameters. We trained our models on 4 NVIDIA Tesla V100 GPUs using SGD with momentum. Both architectures were trained for a total of 230k steps, with the learning rate decayed after 170k steps. We use batch sizes of 256 and 64 per GPU for MobileFaceNet and ResNet, respectively.
Pretraining in SDH is performed in the same way as described above. The hash learning step is trained on a single GPU. The ResNet model is trained for 200k steps with a batch size of 64, and the MobileFaceNet model is trained for 150k steps with a batch size of 256. The number of active bits and the pairwise cost are held fixed across runs.
Hyperparameters for MobileNet models.

The regularization parameter for the FLOPs regularizer was varied as 200, 300, 400, 600.

The regularization parameter for the ℓ1 regularizer was varied as 1.5, 2.0, 2.7, 3.5.

The PCA dimension was varied as 64, 96, 128, 256.

The number of LSH bits was varied as 512, 768, 1024, 2048, 3072.

For IVFPQ from the faiss library, the following parameters were fixed: nlist=4096, M=64, nbit=8, and nprobe was varied as 100, 150, 250, 500, 1000.
Hyperparameters for ResNet baselines.

The regularization parameter for the FLOPs regularizer was varied as 50, 100, 200, 630.

The regularization parameter for the ℓ1 regularizer was varied as 2.0, 3.0, 5.0, 6.0.

The PCA dimension was varied as 48, 64, 96, 128.

The number of LSH bits was varied as 256, 512, 768, 1024, 2048.

For IVFPQ, the following parameters were the same as in MobileNet: nlist=4096, M=64, nbit=8. nprobe was varied as 50, 100, 150, 250, 500, 1000.
Selecting the top k.
We use the following heuristic to create the shortlist of candidates after the sparse ranking step. We first shortlist all candidates with a score greater than a confidence threshold, which we set to 0.25 in our experiments. If the size of this shortlist is larger than k, it is further shrunk by keeping only the top-k scorers; the same fixed k is used in all our experiments. This heuristic avoids sorting the whole score array, which can otherwise be a bottleneck. The parameters are chosen such that the time required for the reranking step does not dominate the total retrieval time.
Hardware.

All models were trained on 4 NVIDIA Tesla V100 GPUs, each with 16 GB of memory.

System memory: 256 GB.

CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz.

Number of threads: 32.

Cache: L1d cache 32K, L1i cache 32K, L2 cache 256K, L3 cache 46080K.
All timing experiments were performed on a single thread in isolation.
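The top-k shortlisting heuristic described earlier in this appendix can be sketched as follows. The 0.25 threshold matches the text; the value of k, the helper name, and the use of `np.argpartition` for partial selection (linear-time on average, versus a full O(n log n) sort) are our assumptions.

```python
import numpy as np

def shortlist(scores, threshold=0.25, k=1000):
    """Indices of candidates scoring above `threshold`, truncated to the
    top-k scorers via partial selection rather than a full sort."""
    cand = np.flatnonzero(scores > threshold)
    if cand.size > k:
        # argpartition places the k largest scores in the last k slots
        # without fully sorting the array
        top = np.argpartition(scores[cand], -k)[-k:]
        cand = cand[top]
    return cand

scores = np.array([0.9, 0.1, 0.5, 0.3, 0.05])
# threshold 0.25 keeps indices 0, 2, 3; with k=2 only the two best remain
```

The returned indices are unordered; if the reranking step needs them sorted by score, a final sort of only k elements is cheap.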
Appendix C Additional Results
C.1 Results without reranking
Figure 4 compares the approaches with and without reranking. We notice a significant dip in performance without reranking, with the gap being smaller for ResNet with FLOPs regularization. We also notice that the FLOPs regularizer has a better tradeoff curve in the no-reranking setting as well.
[Figure 4: speed-vs-accuracy tradeoff with and without reranking.]
C.2 FPR and TPR curves
In the main text we reported recall@1, a standard face recognition metric. This however is not sufficient to ensure good face verification performance. The goal in face verification is to predict whether two faces are similar or dissimilar; a natural metric in this scenario is the FPR-TPR curve. Standard face verification datasets include LFW [Huang et al., 2008] and AgeDB [Moschoglou et al., 2017]. We produce embeddings using our trained models and use them to compute similarity scores (dot products) for pairs of images. The similarity scores are used to compute the FPR-TPR curves shown in Figure 5. We notice that for curves with a similar probability of activation, the FLOPs regularizer performs better than ℓ1. This demonstrates that the FLOPs regularizer makes efficient use of all the dimensions, which helps in learning richer representations at the same sparsity.
We also observe that the gap between sparse and dense models is smaller for ResNet, thus suggesting that the ResNet model learns better representations due to increased model capacity. Lastly, we also note that the gap between the dense and sparse models is smaller for LFW compared to AgeDB, thus corroborating the general consensus that LFW is a relatively easier dataset.
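A minimal numpy sketch of computing such an FPR-TPR curve from pairwise similarity scores follows. It sweeps the decision threshold from high to low; tied scores are not deduplicated here, and a production evaluation would likely use a library routine such as scikit-learn's `roc_curve` instead.

```python
import numpy as np

def fpr_tpr(scores, same):
    """FPR and TPR after admitting each pair in decreasing score order.

    scores: similarity (dot product) per face pair
    same:   boolean array, True when the pair shares an identity
    """
    order = np.argsort(-scores)            # sweep thresholds high -> low
    same = np.asarray(same, bool)[order]
    tpr = np.cumsum(same) / max(same.sum(), 1)     # true positive rate
    fpr = np.cumsum(~same) / max((~same).sum(), 1) # false positive rate
    return fpr, tpr

scores = np.array([0.9, 0.8, 0.4, 0.2])
same = np.array([True, False, True, False])
fpr, tpr = fpr_tpr(scores, same)
```

Each position along the returned arrays corresponds to lowering the acceptance threshold past one more pair, tracing the curve from (0, 0) toward (1, 1).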
[Figure 5: FPR-TPR curves for MobileNet (left) and ResNet (right) on AgeDB (top) and LFW (bottom).]
C.3 CIFAR-100 results
We also experimented with the CIFAR-100 dataset [Krizhevsky et al., 2009], consisting of 60000 examples and 100 classes, with 500 train and 100 test examples per class. We compare the ℓ1 and FLOPs regularized approaches with the sparse deep hashing (SDH) approach. All models were trained using the triplet loss [Schroff et al., 2015] with an embedding dimension of 64. For the dense and SDH baselines, no activation was used on the embeddings, while the ℓ1 and FLOPs regularized models used the SThresh activation. Following Jeong and Song [2018], the train-test and test-test precision values are reported in Table 1; the reported results are without reranking. CIFAR-100 being a small dataset, we only report the FLOPs-per-row, as time measurements can be misleading at this scale. In our experiments, we achieved slightly higher precisions for the dense model compared to Jeong and Song [2018]. We notice that our models use only a fraction of the computation of SDH, albeit with a slightly lower precision.
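The SThresh activation mentioned above can be sketched as a soft-thresholding nonlinearity, i.e. a shrinkage operator that produces exact zeros for small inputs and hence sparse embeddings. The exact parameterization is our assumption; in particular, the offset `b` is fixed here but may be learned in practice.

```python
import numpy as np

def sthresh(x, b=0.1):
    # Classic soft thresholding: shrink every entry toward zero by b,
    # zeroing anything with magnitude below b. Unlike relu, it yields
    # exact zeros on both sides, which is what makes embeddings sparse.
    return np.sign(x) * np.maximum(np.abs(x) - b, 0.0)

x = np.array([-0.5, 0.05, 0.3])
# sthresh(x) is approximately [-0.4, 0.0, 0.2]
```

During retrieval, only the surviving nonzero coordinates contribute to the dot products, which is why sparsity translates directly into fewer FLOPs per row.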
Table 1: Precision results on CIFAR-100 (F = FLOPs-per-row).

                               Train            Test
Model                  F       prec    prec     prec    prec
Dense                  64      61.53   61.26    57.31   56.95
SDH                    1.18    62.29   61.94    57.22   55.87
SDH                    3.95    60.93   60.15    55.98   54.42
SDH                    8.82    60.80   59.96    55.81   54.10
Ours (no reranking)    0.40    61.05   61.08    55.23   55.21
Ours (no reranking)    0.47    60.50   60.17    54.32   54.96