Minimizing FLOPs to Learn Efficient Sparse Representations

Biswajit Paria et al., Carnegie Mellon University

Deep representation learning has become one of the most widely adopted approaches for visual search, recommendation, and identification. Retrieval of such representations from a large database is however computationally challenging. Approximate methods based on learning compact representations have been widely explored for this problem, such as locality sensitive hashing, product quantization, and PCA. In this work, in contrast to learning compact representations, we propose to learn high dimensional and sparse representations that have similar representational capacity as dense embeddings while being more efficient due to sparse matrix multiplication operations, which can be much faster than dense multiplication. Following the key insight that the number of operations decreases quadratically with the sparsity of embeddings provided the non-zero entries are distributed uniformly across dimensions, we propose a novel approach to learn such distributed sparse embeddings via the use of a carefully constructed regularization function that directly minimizes a continuous relaxation of the number of floating-point operations (FLOPs) incurred during retrieval. Our experiments show that our approach is competitive with other baselines and yields a similar or better speed-vs-accuracy tradeoff on practical datasets.






1 Introduction

Learning semantic representations using deep neural networks (DNN) is now a fundamental facet of applications ranging from visual search (Jing et al., 2015; Hadi Kiapour et al., 2015) and semantic text matching (Neculoiu et al., 2016) to one-shot classification (Koch et al., 2015), clustering (Oh Song et al., 2017), and recommendation (Shankar et al., 2017). The high-dimensional dense embeddings generated from DNNs however pose a computational challenge for performing nearest neighbor search in large-scale problems with millions of instances. In particular, when the embedding dimension is high, evaluating the distance of any query to all the instances in a large database is expensive, so that efficient search without sacrificing accuracy is difficult. Representations generated using DNNs typically have a higher dimension compared to hand-crafted features such as SIFT (Lowe, 2004), and moreover are dense. The key caveat with dense features is that, unlike bag-of-words features, they cannot be efficiently searched through an inverted index without approximations.

Since accurate search in high dimensions is prohibitively expensive in practice (Wang, 2011), one has to typically sacrifice accuracy for efficiency by resorting to approximate methods. Addressing the problem of efficient approximate Nearest-Neighbor Search (NNS) (Jegou et al., 2011) or Maximum Inner-Product Search (MIPS) (Shrivastava and Li, 2014) is thus an active area of research, which we review in brief in the related work section. Most approaches (Charikar, 2002; Jegou et al., 2011) aim to learn compact lower-dimensional representations that preserve distance information.

While there has been ample work on learning compact representations, learning sparse higher dimensional representations has been addressed only recently (Jeong and Song, 2018; Cao et al., 2018). As a seminal instance, Jeong and Song (2018) propose an end-to-end approach to learn sparse and high-dimensional hashes, showing significant speed-up in retrieval time on benchmark datasets compared to dense embeddings. This approach has also been motivated from a biological viewpoint (Li et al., 2018) by relating it to a fruit fly's olfactory circuit, thus suggesting the possibility of hashing using higher dimensions instead of reducing the dimensionality. Furthermore, as suggested by Glorot et al. (2011), sparsity can have the additional advantages of linear separability and information disentanglement.

In a similar vein, in this work we propose to learn high dimensional embeddings that are sparse and hence efficient to retrieve using sparse matrix multiplication operations. In contrast to compact lower-dimensional ANN-esque representations that typically lead to decreased representational power, a key facet of our higher dimensional sparse embeddings is that they can have the same representational capacity as the initial dense embeddings. The core idea behind our approach is inspired by two key observations: (i) retrieval with $d$-dimensional sparse embeddings, having a fraction $p$ of non-zero values on average, can be sped up by a factor of $1/p$; (ii) the speedup can be further improved to a factor of $1/p^2$ by ensuring that the non-zero values are evenly distributed across all the dimensions. This indicates that sparsity alone is not sufficient to ensure maximal speedup; the distribution of the non-zero values plays a significant role as well. This motivates us to consider the effect of sparsity on the number of floating point operations (FLOPs)

required for retrieval with an inverted index. We propose a penalty function on the embedding vectors that is a continuous relaxation of the exact number of FLOPs, and encourages an even distribution of the non-zeros across the dimensions.

We apply our approach to the large scale metric learning problem of learning embeddings for facial images. Our training loss consists of a metric learning (Weinberger and Saul, 2009) loss aimed at learning embeddings that mimic a desired metric, and a FLOPs loss to minimize the number of operations. We perform an empirical evaluation of our approach on the Megaface dataset (Kemelmacher-Shlizerman et al., 2016), and show that our proposed method successfully learns high-dimensional sparse embeddings that are orders-of-magnitude faster. We compare our approach to multiple baselines demonstrating an improved or similar speed-vs-accuracy trade-off.

The rest of the paper is organized as follows. In Section 3 we analyze the expected number of FLOPs, for which we derive an exact expression. In Section 4 we derive a continuous relaxation that can be used as a regularizer, and optimized using gradient descent. We also provide some analytical justifications for our relaxation. In Section 5 we then compare our method on a large metric learning task showing an improved speed-accuracy trade-off compared to the baselines.

2 Related Work

Learning compact representations, ANN.

Exact retrieval of the top-k nearest neighbours is expensive in practice for high-dimensional dense embeddings learned from deep neural networks, with practitioners often resorting to approximate nearest neighbours (ANN) for efficient retrieval. Popular approaches for ANN include locality sensitive hashing (LSH) (Gionis et al., 1999; Andoni et al., 2015; Raginsky and Lazebnik, 2009) relying on random projections, navigable small world graphs (NSW) (Malkov et al., 2014) and hierarchical NSW (HNSW) (Malkov and Yashunin, 2018) based on constructing efficient search graphs by finding clusters in the data, product quantization (PQ) (Ge et al., 2013; Jegou et al., 2011) approaches which decompose the original space into a cartesian product of low-dimensional subspaces and quantize each of them separately, and spectral hashing (Weiss et al., 2009) which involves an NP hard problem of computing an optimal binary hash, which is relaxed to continuous valued hashes, admitting a simple solution in terms of the spectrum of the similarity matrix. Overall, for compact representations and to speed up query times, most of these approaches use a variety of carefully chosen data structures, such as hashes (Neyshabur and Srebro, 2015; Wang et al., 2018), locality sensitive hashes (Andoni et al., 2015), inverted file structures (Jegou et al., 2011; Baranchuk et al., 2018), trees (Ram and Gray, 2012), clustering (Auvolat et al., 2015), quantization sketches (Jegou et al., 2011; Ning et al., 2016), as well as dimensionality reductions based on principal component analysis and t-SNE (Maaten and Hinton, 2008).

End to end ANN.

Learning the ANN structure end-to-end is another thread of work that has gained popularity recently. Norouzi et al. (2012) propose to learn binary representations for the Hamming metric by minimizing a margin based triplet loss. Erin Liong et al. (2015) use the signed output of a deep neural network as hashes, while imposing independence and orthogonality conditions on the hash bits. Other end-to-end learning approaches for learning hashes include (Cao et al., 2016; Li et al., 2017). An advantage of end-to-end methods is that they learn hash codes that are optimally compatible with the feature representations.

Sparse representations.

Sparse representations have been previously studied from various viewpoints. Glorot et al. (2011) explore sparse neural networks in modeling biological neural networks and show improved performance, along with additional advantages such as better linear separability and information disentangling. Ranzato et al. (2007, 2008) and Lee et al. (2008) propose learning sparse features using deep belief networks. Olshausen and Field (1997) explore sparse coding with an overcomplete basis, from a neurobiological viewpoint. Sparsity in auto-encoders has been explored by Ng and others (2011) and Kavukcuoglu et al. (2010). Arpit et al. (2015) provide sufficient conditions to learn sparse representations, and also provide an excellent review of sparse autoencoders. Dropout (Srivastava et al., 2014) and a number of its variants (Molchanov et al., 2017; Park et al., 2018; Ba and Frey, 2013) have also been shown to impose sparsity in neural networks.

High-dimensional sparse representations.

Sparse deep hashing (SDH) (Jeong and Song, 2018) is an end-to-end approach that involves starting with a pre-trained network and then performing alternate minimization consisting of two minimization steps, one for training the binary hashes and the other for training the continuous dense embeddings. The first involves computing an optimal hash best compatible with the dense embedding using a min-cost-max-flow approach. The second step is a gradient descent step to learn a dense embedding by minimizing a metric learning loss. A related approach, $k$-sparse autoencoders (Makhzani and Frey, 2013), learns representations in an unsupervised manner with at most $k$ non-zero activations. The idea of high dimensional sparse embeddings is also reinforced by the sparse-lifting approach (Li et al., 2018), where sparse high dimensional embeddings are learned from dense features. The idea is motivated by the biologically inspired fly algorithm (Dasgupta et al., 2017). Experimental results indicated that sparse-lifting is an improvement both in terms of precision and speed, when compared to traditional techniques like LSH that rely on dimensionality reduction.

$\ell_1$ regularization, lasso.

The Lasso (Tibshirani, 1996) is the most popular approach to impose sparsity, and has been used in a variety of applications including sparsifying and compressing neural networks (Liu et al., 2015; Wen et al., 2016). The group lasso (Meier et al., 2008) is an extension of the lasso that encourages all features in a specified group to be selected together. Another extension, the exclusive lasso (Kong et al., 2014; Zhou et al., 2010), on the other hand, is designed to select a single feature in a group. Our proposed regularizer, originally motivated by the idea of minimizing FLOPs, closely resembles the exclusive lasso. Our focus however is on sparsifying the produced embeddings rather than sparsifying the parameters.

Sparse matrix vector product (SpMV).

Existing work on SpMV computations includes (Haffner, 2006; Kudo and Matsumoto, 2003), proposing algorithms based on inverted indices. Inverted indices are however known to suffer from severe cache misses. Linear algebra back-ends such as BLAS (Blackford et al., 2002) rely on efficient cache accesses to achieve speedup. Haffner (2006); Mellor-Crummey and Garvin (2004); Krotkiewski and Dabrowski (2010) propose cache efficient algorithms for sparse matrix vector products. There has also been substantial interest in speeding up SpMV computations using specialized hardware such as GPUs (Vazquez et al., 2010; Vázquez et al., 2011), FPGAs (Zhuo and Prasanna, 2005; Zhang et al., 2009), and custom hardware (Prasanna and Morris, 2007).

Metric learning.

While there exist many settings for learning embeddings (Hinton and Salakhutdinov, 2006; Kingma and Welling, 2013; Kiela and Bottou, 2014), in this paper we restrict our attention to the context of metric learning (Weinberger and Saul, 2009). Some examples of metric learning losses include the large margin softmax loss for CNNs (Liu et al., 2016), the triplet loss (Schroff et al., 2015), and proxy based metric losses (Movshovitz-Attias et al., 2017).

3 Expected number of FLOPs

In this section we study the effect of sparsity on the expected number of FLOPs required for retrieval and derive an exact expression for it. The main idea in this paper is based on the key insight that if each of the dimensions of the embedding is non-zero with probability $p$ (not necessarily independently), then it is possible to achieve a speedup of up to an order of $1/p^2$ using an inverted index on the set of embeddings. Consider two embedding vectors $u, v \in \mathbb{R}^d$. Computing $u^\top v$ requires computing only the pointwise products at the indices where both $u$ and $v$ are non-zero. This is the main motivation behind using inverted indices and leads to the aforementioned speedup. Before we analyze it more formally, we introduce some notation.

Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ be a set of independent training samples drawn from $\mathcal{X} \times \mathcal{Y}$ according to a distribution $P$, where $\mathcal{X}, \mathcal{Y}$ denote the input and label spaces respectively. Let $\mathcal{F} = \{f_\theta : \mathcal{X} \to \mathbb{R}^d \mid \theta \in \Theta\}$ be a class of functions parameterized by $\theta$, mapping input instances to $d$-dimensional embeddings. Typically, for image tasks, the function $f_\theta$ is chosen to be a suitable CNN (Krizhevsky et al., 2012). Suppose $x \sim P$; then define the activation probability $p_j = P(f_j(x) \neq 0)$, and its empirical version $\bar{p}_j = \frac{1}{n} \sum_{i=1}^n \mathbb{1}[f_j(x_i) \neq 0]$.

We now show that sparse embeddings can lead to a quadratic speedup. Consider a $d$-dimensional sparse query vector $u_q \in \mathbb{R}^d$ and a database of $n$ sparse vectors forming a matrix $D \in \mathbb{R}^{n \times d}$. We assume that $u_q$ and the rows of $D$ are sampled independently from the distribution of $f_\theta(x)$. Computing the vector–matrix product $D u_q$ requires looking at only the columns of $D$ corresponding to the non-zero entries of $u_q$, given by $\{j : (u_q)_j \neq 0\}$. Furthermore, in each of those columns we only need to look at the non-zero entries. This can be implemented efficiently in practice by storing the non-zero indices for each column in independent lists, as depicted in Figure 1.

The number of FLOPs incurred is given by

$$ F(u_q, D) = \sum_{j=1}^{d} \mathbb{1}[(u_q)_j \neq 0] \sum_{i=1}^{n} \mathbb{1}[D_{ij} \neq 0]. \qquad (1) $$

Taking the expectation on both sides w.r.t. $u_q$ and $D$, and using the independence of the data, we get

$$ \mathbb{E}[F(u_q, D)] = n \sum_{j=1}^{d} p_j^2, $$

where $p_j = P(f_j(x) \neq 0)$. Since the expected number of FLOPs scales linearly with the number of vectors in the database, a more suitable quantity is the mean-FLOPs-per-row defined as

$$ \mathcal{F}(f_\theta, P) = \frac{1}{n} \mathbb{E}[F(u_q, D)] = \sum_{j=1}^{d} p_j^2. \qquad (2) $$
Note that for a fixed amount of sparsity $\sum_j p_j = d\bar{p}$, this is minimized when each of the dimensions is non-zero with equal probability $p_j = \bar{p}$, upon which $\mathcal{F}(f_\theta, P) = d\bar{p}^2$ (so that $\mathcal{F}$ as a regularizer will in turn encourage such a uniform distribution across dimensions). Given such a uniform distribution, compared to dense multiplication, which has a complexity of $d$ per row, we thus get an improvement by a factor of $1/\bar{p}^2$. Thus when only a $\bar{p}$ fraction of all the entries is non-zero, and evenly distributed across all the columns, we achieve a speedup of $1/\bar{p}^2$. Note that independence of the non-zero indices is not necessary due to the linearity of expectation – in fact, features from a neural network are rarely uncorrelated in practice.
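To make the effect of the distribution of non-zeros concrete, the following numpy sketch (our own illustration, not from the paper's code) compares the mean-FLOPs-per-row of two activation patterns with identical overall sparsity:

```python
import numpy as np

def mean_flops_per_row(p):
    """Mean-FLOPs-per-row F = sum_j p_j^2, where p_j is the probability
    that dimension j of an embedding is non-zero (Eqn. 2)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p ** 2))

d = 100
uniform = np.full(d, 0.10)               # 10% non-zeros, spread evenly
skewed = np.zeros(d); skewed[:10] = 1.0  # same sparsity, packed into 10 dims

# The uniform support costs d * p^2 = 1 FLOP per row on average, a
# 1/p^2 = 100x speedup over the dense cost of d = 100; the skewed
# pattern, despite identical average sparsity, costs 10x more.
```

This is exactly why sparsity alone is insufficient: both patterns have 10% non-zeros, but only the evenly distributed one achieves the quadratic speedup.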

FLOPs versus speedup.

While FLOPs reduction is a reasonable measure of speedup on primitive processors with limited parallelization and cache memory, it is not an accurate measure of actual speedup on mainstream commercial processors such as Intel CPUs and Nvidia GPUs, whose caches and SIMD (Single Instruction, Multiple Data) mechanisms are highly optimized for dense matrix multiplication, while sparse matrix multiplication is inherently less tailored to their cache and SIMD design (Sohoni et al., 2019). On the other hand, there have been threads of research on hardware with caches and parallelization tailored to sparse operations, showing speedups proportional to the FLOPs reduction (Han et al., 2016; Parashar et al., 2017). Modeling the cache and other hardware aspects can potentially lead to better performance but less generality, and is left to future work.


1:(Build Index)
2:Input: Sparse matrix $D \in \mathbb{R}^{n \times d}$
3:     for $j = 1, \ldots, d$ do
4:         Init $\mathcal{L}_j \leftarrow$ empty list
5:         $\mathcal{L}_j \leftarrow \{(i, D_{ij}) : D_{ij} \neq 0\}$  Stores the non-zero values and their indices
6:     end for
7:
8:(Query)
9:Input: Sparse query $u$, threshold $T$, number of NNs $k$
10:     Init score vector $s \leftarrow 0 \in \mathbb{R}^n$
11:     for $j : u_j \neq 0$ do  SpMV product
12:         for $(i, v) \in \mathcal{L}_j$ do
13:             $s_i \leftarrow s_i + u_j \cdot v$
14:         end for
15:     end for
16:     $C \leftarrow \{i : s_i \geq T\}$  Thresholding
17:     $I \leftarrow$ top-$k$ indices of $C$ based on $s$  Using nth_element from the C++ STL
18:     return $I$
Algorithm 1 Sparse Nearest Neighbour
Figure 1: SpMV product: The colored cells denote non-zero entries, and the arrows indicate the list structure for each of the columns, with solid arrows denoting links that were traversed for the given query. The green and grey cells denote the non-zero entries that were accessed and not accessed, respectively. The non-zero values of the product (blue) can be computed using only the common non-zero values (green). Selecting top-$k$: The sparse product vector is then filtered using a threshold $T$, after which the top-$k$ indices are returned.
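The index-and-query procedure of Algorithm 1 can be sketched in a few lines of Python; the names (`build_index`, `sparse_nn`) are ours, and a plain dense array stands in for a real sparse matrix format:

```python
import numpy as np

def build_index(D):
    """Build Algorithm 1's index: one list per column holding the
    (row, value) pairs of that column's non-zero entries."""
    n, d = D.shape
    return [[(i, D[i, j]) for i in range(n) if D[i, j] != 0.0]
            for j in range(d)]

def sparse_nn(index, n, u, T, k):
    """Score all n database rows against a sparse query u, keep rows
    with score >= T, and return the top-k row indices by score."""
    s = np.zeros(n)
    for j in np.flatnonzero(u):      # only columns where u_j != 0
        for i, v in index[j]:        # only that column's non-zero rows
            s[i] += u[j] * v
    cand = [i for i in range(n) if s[i] >= T]  # thresholding
    cand.sort(key=lambda i: -s[i])             # top-k (nth_element in C++)
    return cand[:k]

# Toy database of 3 sparse rows in 3 dimensions:
D = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0],
              [4.0, 0.0, 5.0]])
index = build_index(D)
u = np.array([1.0, 0.0, 1.0])
# Row 1 is never touched: its only non-zero sits in a column where u is zero.
```

Note that a full sort is used above for brevity; the C++ implementation described in the paper can use `std::nth_element` for partial selection instead.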

4 Our Approach


$\ell_1$ regularization is the most common approach to induce sparsity. However, as we will also verify experimentally, it does not ensure the uniform distribution of non-zeros across dimensions that is required for the optimal speed-up. We therefore resort to incorporating the actual FLOPs incurred directly into the loss function, which will lead to an optimal trade-off between search time and accuracy. The FLOPs, being a discontinuous function of the model parameters, is hard to optimize, and hence we will instead optimize a continuous relaxation of it.

Denote by $\ell(\mathcal{D}; \theta)$ any metric loss on the data $\mathcal{D}$ for the embedding function $f_\theta$. The goal in this paper is to minimize the loss while controlling the expected FLOPs $\mathcal{F}(f_\theta, P)$ defined in Eqn. 2. Since the distribution $P$ is unknown, we use the samples to get an estimate of $\mathcal{F}$. Recall the empirical fraction of non-zero activations $\bar{p}_j = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[f_j(x_i) \neq 0]$, which converges in probability to $p_j$. Therefore, with a slight abuse of notation, define $\mathcal{F}(f_\theta, \mathcal{D}) = \sum_{j=1}^d \bar{p}_j^2$, which is a consistent estimator for $\mathcal{F}(f_\theta, P)$ based on the samples $\mathcal{D}$. Note that $\mathcal{F}$ denotes either the population or empirical quantity depending on whether the functional argument is $P$ or $\mathcal{D}$. We now consider the following regularized loss:

$$ \min_\theta \; \ell(\mathcal{D}; \theta) + \lambda \mathcal{F}(f_\theta, \mathcal{D}), \qquad (3) $$

for some parameter $\lambda$ that controls the FLOPs-accuracy tradeoff. The regularized loss poses a further hurdle, as $\bar{p}_j$ and consequently $\mathcal{F}(f_\theta, \mathcal{D})$ are not continuous due to the presence of the indicator functions. We thus compute the following continuous relaxation. Define the mean absolute activation $a_j = \mathbb{E}[|f_j(x)|]$ and its empirical version $\bar{a}_j = \frac{1}{n}\sum_{i=1}^n |f_j(x_i)|$, which is the $\ell_1$ norm of the activations (scaled by $1/n$) in contrast to the $\ell_0$ quasi norm in the FLOPs calculation. Define the relaxation $\widetilde{\mathcal{F}}(f_\theta, P) = \sum_{j=1}^d a_j^2$ and its consistent estimator $\widetilde{\mathcal{F}}(f_\theta, \mathcal{D}) = \sum_{j=1}^d \bar{a}_j^2$. We propose to minimize the following relaxation, which can be optimized using any off-the-shelf stochastic gradient descent optimizer:

$$ \min_\theta \; \ell(\mathcal{D}; \theta) + \lambda \widetilde{\mathcal{F}}(f_\theta, \mathcal{D}). \qquad (4) $$
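As a concrete illustration (our own sketch, not the paper's code), the empirical relaxed regularizer $\sum_j \bar{a}_j^2$ is straightforward to compute on a batch of activations, and unlike a plain $\ell_1$ penalty it distinguishes between concentrated and spread-out non-zeros:

```python
import numpy as np

def flops_regularizer(A):
    """Relaxed FLOPs regularizer on a batch of activations A (n x d):
    sum_j ( (1/n) * sum_i |A_ij| )^2, the squared per-dimension mean
    absolute activations."""
    return float(np.sum(np.mean(np.abs(A), axis=0) ** 2))

def l1_regularizer(A):
    """Plain l1 baseline: mean per-example l1 norm of the activations."""
    return float(np.mean(np.sum(np.abs(A), axis=1)))

# Two batches with identical l1 mass, distributed differently:
concentrated = np.array([[2.0, 0.0], [2.0, 0.0]])  # all mass in dim 0
spread = np.array([[1.0, 1.0], [1.0, 1.0]])        # mass split evenly
# l1 cannot tell them apart, but the FLOPs regularizer penalizes the
# concentrated batch more, pushing non-zeros to spread across dims.
```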
Sparse retrieval and re-ranking.

During inference, the sparse vector of a query image is first obtained from the learned model and its nearest neighbours are searched in a database of sparse vectors forming a sparse matrix. An efficient algorithm to compute the dot product of the sparse query vector with the sparse matrix is presented in Figure 1. It consists of first building a list of the non-zero values and their positions in each column. As motivated in Section 3, given a sparse query vector, it is sufficient to iterate only through the non-zero values and the corresponding columns. Next, a filtering step is performed, keeping only scores greater than a specified threshold. The top-$k$ candidates from the remaining items are returned. The complete algorithm is presented in Algorithm 1. In practice, the sparse retrieval step is not sufficient to ensure good performance. The top-$k$ shortlisted candidates are therefore further re-ranked using dense embeddings, as done in SDH. This step involves multiplication of a small dense matrix with a dense vector. The number of shortlisted candidates is chosen such that the dense re-ranking time does not dominate the total time.

Comparison to SDH (Jeong and Song, 2018).

It is instructive to contrast our approach with that of SDH (Jeong and Song, 2018). In contrast to the binary hashes in SDH, our approach learns sparse real valued representations. SDH uses a min-cost-max-flow approach in one of the training steps, while we train ours only using SGD. During inference in SDH, a shortlist of candidates is first created by considering the examples in the database that have hashes with non-empty intersections with the query hash. The candidates are further re-ranked using the dense embeddings. The shortlist in our approach on the other hand is constituted of the examples with the top scores from the sparse embeddings.

Comparison to unrelaxed FLOPs regularizer.

We provide an experimental comparison of our continuous relaxation based FLOPs regularizer to its unrelaxed variant, showing that the performance of the two is markedly similar. Setting up this experiment requires some analytical simplifications based on recent deep neural network analyses. We first recall recent results indicating that the output of a batch norm layer nearly follows a Gaussian distribution (Santurkar et al., 2018), so that in our context we can make the simplifying approximation that $f_j(x)$ (where $x \sim P$) is distributed as $\mathrm{act}(Z_j)$, where $Z_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$ and $\mathrm{act}$ is the activation used at the neural network output. We have thus modelled the pre-activation as a Gaussian distribution with mean and variance depending on the model parameters $\theta$. We experimentally verify that this assumption holds by minimizing the KS (Kolmogorov–Smirnov) distance (Massey Jr, 1951) between the CDF of $\mathrm{act}(Z)$, where $Z \sim \mathcal{N}(\mu, \sigma^2)$, and the empirical CDF of the activations; the KS distance is minimized w.r.t. $(\mu, \sigma)$. Figure 2 shows the empirical CDF and the fitted CDF of $\mathrm{act}(Z)$ for two different architectures.

While $(\mu_j, \sigma_j)$ cannot be tuned independently due to their dependence on $\theta$, in practice the huge representational capacity of neural networks allows $\mu_j$ and $\sigma_j$ to be tuned almost independently. We consider a toy setting with 2-d embeddings. For a tractable analysis, we make the simplifying assumption that, for $j \in \{1, 2\}$, $f_j(x)$ is distributed as $\mathrm{act}(Z_j)$ where $Z_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$, thus dropping the dependence on $\theta$.

We now analyze how minimizing the continuous relaxation $\widetilde{\mathcal{F}}$ compares to minimizing $\mathcal{F}$ itself. Note that we consider the population quantities here instead of the empirical quantities, as they are more amenable to theoretical analysis due to the existence of closed form expressions. We also consider the $\ell_1$ regularizer as a baseline. We initialize the parameters $(\mu_j, \sigma_j)$ and minimize the three quantities via gradient descent with infinitesimally small learning rates. For this contrastive analysis, we have not considered the effect of the metric loss. Note that while the discontinuous empirical quantity cannot be optimized via gradient descent, it is possible to do so for its population counterpart, since the latter is available in closed form as a continuous function under the Gaussian assumptions. The details of computing the gradients can be found in Appendix A.

We start with unequal activation probabilities $(p_1, p_2)$ and plot the trajectory taken when performing gradient descent, shown in Figure 2. Without the effect of the metric loss, the probabilities are expected to go to zero, as observed in the plot. It can be seen that, in contrast to the $\ell_1$ regularizer, $\mathcal{F}$ and $\widetilde{\mathcal{F}}$ both tend to sparsify the less sparse activation at a faster rate, which corroborates the fact that they encourage an even distribution of non-zeros.

Figure 2: (a) The CDF of $\mathrm{act}(Z)$ fitted to minimize the KS distance to the empirical CDF of the activations, for two different architectures: the CDF of the activations (red) closely resembles the fitted CDF (blue), where $Z$ is a Gaussian random variable. (b) The trajectories of the activation probabilities when minimizing the respective regularizers: $\mathcal{F}$ and $\widetilde{\mathcal{F}}$ behave similarly, sparsifying the less sparse activation at a faster rate than the $\ell_1$ regularizer.

$\widetilde{\mathcal{F}}$ promotes orthogonality.

We next show that, when the embeddings are normalized to have unit norm, as is typically done in metric learning, minimizing $\widetilde{\mathcal{F}}$ is equivalent to promoting orthogonality on the absolute values of the embedding vectors. Let $u_i = |f_\theta(x_i)|$ denote the elementwise absolute value of the $i$-th embedding; we then have

$$ \widetilde{\mathcal{F}}(f_\theta, \mathcal{D}) = \sum_{j=1}^{d} \Big( \frac{1}{n} \sum_{i=1}^{n} |f_j(x_i)| \Big)^2 = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{i'=1}^{n} \langle u_i, u_{i'} \rangle, $$

which is minimized when the vectors $\{u_i\}$ are orthogonal. Metric learning losses aim at minimizing the interclass dot products, whereas the FLOPs regularizer aims at minimizing pairwise dot products irrespective of the class, leading to a tradeoff between sparsity and accuracy. This approach of pushing the embeddings apart bears some resemblance to the idea of spreading vectors (Sablayrolles et al., 2019), where an entropy based regularizer is used to uniformly distribute the embeddings on the unit sphere, albeit without considering any sparsity. Minimizing the pairwise dot products helps in reducing FLOPs, as illustrated by the following toy example. Consider a set of $n$ vectors $\{u_i\}_{i=1}^n$ (here $n \leq d$) satisfying $\|u_i\|_2 = 1$. Then $\widetilde{\mathcal{F}}$ is minimized when $u_i = e_{\pi(i)}$ for some permutation $\pi$, where $e_k$ is the one-hot vector with the $k$-th entry equal to 1 and the rest 0. The FLOPs regularizer thus tends to spread out the non-zero activations across all the dimensions, producing balanced embeddings. This simple example also demonstrates that when the number of classes in the training set is smaller than or equal to the number of dimensions $d$, a trivial embedding that minimizes the metric loss and also achieves a small number of FLOPs is $f_\theta(x) = e_y$, where $y$ is the true label for $x$. This is equivalent to predicting the class of the input instance. The caveat with such embeddings is that they might not be semantically meaningful beyond the specific supervised task, and will naturally hurt performance on unseen classes and on tasks where the representation itself is of interest. In order to avoid such a collapse in our experiments, we ensure that the embedding dimension is smaller than the number of training classes. Furthermore, as recommended by Sablayrolles et al. (2017), we perform all our evaluations on unseen classes.
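The identity between the relaxed regularizer and the mean pairwise inner product of the absolute embeddings is easy to check numerically (our own sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 16))  # n = 8 embeddings, d = 16
U = np.abs(A)                     # elementwise absolute embeddings u_i
n = A.shape[0]

# Relaxed FLOPs regularizer: sum_j ( (1/n) sum_i |A_ij| )^2 ...
flops_reg = float(np.sum(np.mean(U, axis=0) ** 2))

# ... equals the average pairwise inner product of the absolute
# embeddings, (1/n^2) sum_{i,i'} <u_i, u_{i'}>, so minimizing it
# pushes the u_i towards orthogonality.
pairwise_mean = float((U @ U.T).sum() / n ** 2)
```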

Exclusive lasso.

Also known as the $\ell_{1,2}$ norm, in previous works it has been used to induce competition (or exclusiveness) among features in the same group. More formally, consider $d$ features indexed by $\{1, \ldots, d\}$, and groups $g \subseteq \{1, \ldots, d\}$ forming a set of groups $G$ (a subset of the power set of $\{1, \ldots, d\}$). Let $w \in \mathbb{R}^d$ denote the weight vector for a linear classifier. The exclusive lasso regularizer is defined as

$$ \Omega_{\mathrm{EL}}(w) = \sum_{g \in G} \|w_g\|_1^2, $$

where $w_g$ denotes the sub-vector of $w$ corresponding to the indices in $g$. $\Omega_{\mathrm{EL}}$ can be used to induce various kinds of structural properties. For instance, $G$ can consist of groups of correlated features. The regularizer prevents feature redundancy by selecting only a few features from each group.

Our proposed FLOPs based regularizer has the same form as exclusive lasso. Therefore exclusive lasso applied to the batch of activations, with the groups being columns of the activation matrix (and rows corresponding to different inputs), is equivalent to the FLOPs regularizer. It can be said that, within each activation column, the FLOPs regularizer induces competition between different input examples for having a non-zero activation.
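This equivalence can be verified numerically; in the sketch below (names are ours), the exclusive lasso over column groups of the flattened activation matrix equals $n^2$ times the relaxed FLOPs regularizer:

```python
import numpy as np

def exclusive_lasso(w, groups):
    """Exclusive lasso: sum over groups of the squared l1 norm of each
    group's sub-vector of w."""
    return float(sum(np.sum(np.abs(w[list(g)])) ** 2 for g in groups))

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 6))  # batch of n = 4 activations, d = 6
n, d = A.shape

# Groups = columns of the activation matrix (flattened row-major):
groups = [[i * d + j for i in range(n)] for j in range(d)]
ex_lasso = exclusive_lasso(A.reshape(-1), groups)

# The relaxed FLOPs regularizer is the same quantity scaled by 1/n^2,
# so the two differ only by a constant factor absorbed into lambda.
flops_reg = float(np.sum(np.mean(np.abs(A), axis=0) ** 2))
```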

5 Experiments

We evaluate our proposed approach on a large scale metric learning dataset: MegaFace (Kemelmacher-Shlizerman et al., 2016), used for face recognition. This is a much more fine grained retrieval task (with 85k classes for training) compared to the datasets used by Jeong and Song (2018). This dataset also satisfies our requirement of the number of classes being orders of magnitude higher than the dimension of the sparse embedding. As discussed in Section 4, a small number of classes during training can lead the model to simply learn an encoding of the training classes and thus not generalize to unseen classes. Face recognition datasets avoid this situation by virtue of the huge number of training classes and a balanced distribution of examples across all the classes.

Following the standard protocol for evaluation on the MegaFace dataset (Kemelmacher-Shlizerman et al., 2016), we train on a refined version of the MSCeleb-1M (Guo et al., 2016) dataset released by Deng et al. (2018), consisting of 5.8 million images spanning 85k classes. We evaluate with 1 million distractors from the MegaFace dataset and 3.5k query images from the FaceScrub dataset (Ng and Winkler, 2014), which were not seen during training.

Network architecture.

We experiment with two architectures: MobileFaceNet (Chen et al., 2018) and ResNet-101 (He et al., 2016). We use ReLU activations in the embedding layer for MobileFaceNet, and SThresh activations (defined below) for ResNet. The activations are $\ell_2$-normalized to produce an embedding on the unit sphere, which is used to compute the ArcFace loss (Deng et al., 2018). We learn 1024 dimensional sparse embeddings for the $\ell_1$ and $\widetilde{\mathcal{F}}$ regularizers, and 128 and 512 dimensional dense embeddings as baselines. All models were implemented in TensorFlow (Abadi et al., 2016), with the sparse retrieval algorithm implemented in C++. The re-ranking step used 512-d dense embeddings.

Activation function.

In practice, having a non-linear activation at the embedding layer is crucial for sparsification. Layers with activations such as ReLU are easier to sparsify due to the bias parameter in the layer before the activation (linear or batch norm), which acts as a direct control knob for the sparsity. More specifically, $\mathrm{relu}(z + b)$ can be made more (less) sparse by decreasing (increasing) the components of $b$, where $b$ is the bias parameter of the previous linear layer. In this paper we consider two types of activations: $\mathrm{relu}(x) = \max(x, 0)$, and the soft thresholding operator $\mathrm{sthresh}(x, b) = \mathrm{sign}(x)\max(|x| - b, 0)$ (Boyd and Vandenberghe, 2004). ReLU activations are always non-negative, whereas soft thresholding can produce negative values as well.
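Both activations are one-liners in numpy (our own sketch, for illustration):

```python
import numpy as np

def relu(x):
    """relu(x) = max(x, 0): keeps only non-negative values."""
    return np.maximum(x, 0.0)

def sthresh(x, b):
    """Soft thresholding sign(x) * max(|x| - b, 0): zeroes out entries
    with magnitude below b while preserving both signs."""
    return np.sign(x) * np.maximum(np.abs(x) - b, 0.0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
# relu discards the negative tail entirely; sthresh with b = 1 keeps
# both tails but shrinks them towards zero.
```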

Practical considerations.

In practice, setting a large regularization weight $\lambda$ from the beginning is harmful for training. Sparsifying too quickly using a large $\lambda$ leads to many dead activations (saturated to zero) in the embedding layer, and the model gets stuck in a local minimum. Therefore, we use an annealing procedure and gradually increase $\lambda$ throughout training using a regularization weight schedule $\lambda(t)$ that maps the training step $t$ to a real valued regularization weight. In our experiments we choose a schedule that increases quadratically as $\lambda(t) = \lambda (t/T)^2$ until step $t = T$, where $T$ is the threshold step beyond which $\lambda(t) = \lambda$.
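A minimal sketch of such a schedule (the function name and arguments are ours):

```python
def reg_weight(t, lam, T):
    """Quadratic annealing: lam * (t / T)^2 until step T, then constant
    at lam, so the sparsity pressure ramps up gently early in training."""
    return lam * min(t / T, 1.0) ** 2
```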


We compare our proposed FLOPs regularizer with multiple baselines: exhaustive search with dense embeddings, sparse embeddings using ℓ1 regularization, Sparse Deep Hashing (SDH) (Jeong and Song, 2018), and PCA, LSH, and PQ applied to the 512-dimensional dense embeddings from both architectures. We train the SDH model using the aforementioned architectures for 512-dimensional embeddings, with varying numbers of active hash bits. We use numpy (with efficient MKL optimizations in the backend) for the matrix multiplication required for exhaustive search in the dense and PCA baselines. We use the CPU version of the Faiss library (Johnson et al., 2017) for LSH and PQ (specifically the IVF-PQ index).
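The FLOPs regularizer compared here penalizes a continuous relaxation of the expected retrieval cost; a minimal numpy sketch of one natural form (the per-dimension mean absolute activation, squared and summed, which is an assumption consistent with the description in the text; function names are ours), side by side with plain ℓ1:

```python
import numpy as np

def flops_regularizer(acts):
    # acts: (batch, d) embedding activations.
    # Relax the per-dimension activation probability p_j with the batch mean
    # of |a_j|; expected FLOPs scale with sum_j p_j^2.
    mean_abs = np.abs(acts).mean(axis=0)
    return float(np.sum(mean_abs ** 2))

def l1_regularizer(acts):
    # Plain l1: penalizes total activation mass, regardless of how that mass
    # is distributed across dimensions.
    return float(np.abs(acts).mean(axis=0).sum())

balanced = np.ones((4, 2))   # same mass in both dimensions
skewed = np.zeros((4, 2))
skewed[:, 0] = 2.0           # same total l1 mass, all in one dimension
```

For equal total activation mass, the squared-mean penalty is strictly smaller when the mass is spread uniformly across dimensions, which is the balancing behavior the comparisons below probe.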

Further details on the training hyperparameters and the hardware used can be found in Appendix B.


5.1 Results

We report the recall and the time-per-query for various hyperparameters of our proposed approach and the baselines, yielding trade-off curves. The reported times include the time required for re-ranking. The trade-off curves for MobileNet and ResNet are shown in Figures 3(a) and 3(c) respectively. We observe that while vanilla ℓ1 regularization is an improvement by itself for some hyperparameter settings, the FLOPs regularizer improves on it further and yields the best trade-off curve. SDH has a very poor speed-accuracy trade-off, mainly due to the explosion in the number of shortlisted candidates as the number of active bits increases, which drives up the retrieval time; conversely, a small number of active bits is faster but yields a smaller recall. For the other baselines we notice the usual order of performance, with PQ providing the best speed-up ahead of LSH and PCA. While dimensionality reduction using PCA leads to some speed-up at relatively high dimensions, the gains quickly wane as the dimension is reduced further.

(a) Time per query vs recall for MobileNet.
(b) Sub-optimality ratio vs sparsity for MobileNet.
(c) Time per query vs recall for ResNet.
(d) Sub-optimality ratio vs sparsity for ResNet.
Figure 3: Figures (a) and (c) show the speed vs recall trade-off for the MobileNet and ResNet architectures respectively. The trade-off curves are produced by varying the hyper-parameters of the respective approaches; points with higher recall and lower time (top-left of the plots) are better. The SDH baseline, being out of range of both plots, is indicated with an arrow. Figures (b) and (d) show the sub-optimality ratio vs sparsity plots for MobileNet and ResNet respectively. A ratio closer to 1 indicates that the non-zeros are uniformly distributed across the dimensions.

We also report the sub-optimality ratio ρ = d Σⱼ pⱼ² / (Σⱼ pⱼ)² computed over the dataset, where pⱼ is the mean activation probability of dimension j estimated on the test data. Notice that ρ ≥ 1, and the optimum ρ = 1 is achieved when pⱼ is the same for all j, that is, when the non-zeros are evenly distributed across the dimensions. The sparsity-vs-suboptimality plots for MobileNet and ResNet are shown in Figures 3(b) and 3(d) respectively. We notice that the FLOPs regularizer yields values of ρ closer to 1 than the ℓ1 regularizer. For the MobileNet architecture, the ℓ1 regularizer achieves values of ρ close to those of the FLOPs regularizer in the less sparse regime; however, the gap widens substantially with increasing sparsity. For the ResNet architecture, on the other hand, the ℓ1 regularizer yields extremely sub-optimal embeddings in all regimes. The FLOPs regularizer is therefore able to produce a more balanced distribution of non-zeros.
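The sub-optimality ratio can be estimated directly from test-set activations; a numpy sketch, assuming the ratio takes the form ρ = d · Σⱼ pⱼ² / (Σⱼ pⱼ)², which satisfies ρ ≥ 1 with equality for a uniform distribution of non-zeros, consistent with how the ratio is described in the text:

```python
import numpy as np

def suboptimality_ratio(acts):
    # acts: (num_examples, d) embeddings.
    # p_j = empirical activation probability of dimension j.
    p = (acts != 0).mean(axis=0)
    d = p.shape[0]
    # rho = d * sum(p_j^2) / (sum p_j)^2 >= 1, equality iff p is uniform.
    return d * np.sum(p ** 2) / np.sum(p) ** 2

uniform = np.eye(4)        # each dimension active equally often
skewed = np.zeros((4, 4))
skewed[:, 0] = 1.0         # all activation mass on a single dimension
```

Perfectly balanced activations give ρ = 1, while concentrating all non-zeros in one of d dimensions gives ρ = d, the worst case.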

The sub-optimality is also reflected in the recall values. The gap between the recall of the ℓ1 and FLOPs models is much larger when the sub-optimality gap is larger, as for ResNet, and small when the sub-optimality gap is small, as for MobileNet. This shows the significance of a balanced distribution of non-zeros. Additional results, including results without the re-ranking step and performance on CIFAR-100, can be found in Appendix C.

6 Conclusion

In this paper we proposed a novel approach to learn high-dimensional embeddings with the goal of improving the efficiency of retrieval tasks. Our approach integrates the FLOPs incurred during retrieval into the loss function as a regularizer and optimizes it directly through a continuous relaxation. We provide further insight into the approach by showing that it favors an even distribution of the non-zero activations across all dimensions, and we experimentally showed that it indeed leads to a more even distribution than the ℓ1 regularizer. Compared to a number of baselines, our approach achieves a better speed-vs-accuracy trade-off. Overall, we showed that sparse embeddings can be around 50× faster than dense embeddings without a significant loss of accuracy.


Acknowledgments

We thank Hong-You Chen for helping in running some baselines during the early stages of this work. This work has been partially funded by the DARPA D3M program and the Toyota Research Institute. Toyota Research Institute ("TRI") provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.


References

  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §5.
  • A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt (2015) Practical and optimal lsh for angular distance. NeurIPS. Cited by: §2.
  • D. Arpit, Y. Zhou, H. Ngo, and V. Govindaraju (2015) Why regularized auto-encoders learn sparse representation?. arXiv preprint arXiv:1505.05561. Cited by: §2.
  • A. Auvolat, S. Chandar, P. Vincent, H. Larochelle, and Y. Bengio (2015) Clustering is efficient for approximate maximum inner product search. arXiv preprint arXiv:1507.05910. Cited by: §2.
  • J. Ba and B. Frey (2013) Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems, pp. 3084–3092. Cited by: §2.
  • D. Baranchuk, A. Babenko, and Y. Malkov (2018) Revisiting the inverted indices for billion-scale approximate nearest neighbors. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 202–216. Cited by: §2.
  • L. S. Blackford, A. Petitet, R. Pozo, K. Remington, R. C. Whaley, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, et al. (2002) An updated set of basic linear algebra subprograms (blas). ACM Transactions on Mathematical Software 28 (2), pp. 135–151. Cited by: §2.
  • S. Boyd and L. Vandenberghe (2004) Convex optimization. Cambridge university press. Cited by: §5.
  • Y. Cao, M. Long, B. Liu, and J. Wang (2018) Deep cauchy hashing for hamming space retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1229–1237. Cited by: §1.
  • Y. Cao, M. Long, J. Wang, H. Zhu, and Q. Wen (2016) Deep quantization network for efficient image retrieval. AAAI. Cited by: §2.
  • M. S. Charikar (2002) Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM Symposium on Theory of Computing, pp. 380–388. Cited by: §1.
  • S. Chen, Y. Liu, X. Gao, and Z. Han (2018) MobileFaceNets: efficient cnns for accurate real-time face verification on mobile devices. arXiv preprint arXiv:1804.07573. Cited by: §5.
  • S. Dasgupta, C. F. Stevens, and S. Navlakha (2017) A neural algorithm for a fundamental computing problem. Science. Cited by: §2.
  • J. Deng, J. Guo, and S. Zafeiriou (2018) Arcface: additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698. Cited by: §5, §5.
  • V. Erin Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou (2015) Deep hashing for compact binary codes learning. CVPR. Cited by: §2.
  • T. Ge, K. He, Q. Ke, and J. Sun (2013) Optimized product quantization for approximate nearest neighbor search. CVPR. Cited by: §2.
  • A. Gionis, P. Indyk, R. Motwani, et al. (1999) Similarity search in high dimensions via hashing. VLDB. Cited by: §2.
  • X. Glorot, A. Bordes, and Y. Bengio (2011) Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323. Cited by: §1, §2.
  • Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao (2016) MS-Celeb-1M: a dataset and benchmark for large scale face recognition. In European Conference on Computer Vision, Cited by: §5.
  • M. Hadi Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg (2015) Where to buy it: matching street clothing photos in online shops. In Proceedings of the IEEE international conference on computer vision, pp. 3343–3351. Cited by: §1.
  • P. Haffner (2006) Fast transpose methods for kernel learning on sparse data. In Proceedings of the 23rd international conference on Machine learning, pp. 385–392. Cited by: §2.
  • S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally (2016) EIE: efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254. Cited by: §3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §5.
  • G. E. Hinton and R. R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. science 313 (5786), pp. 504–507. Cited by: §2.
  • G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller (2008) Labeled faces in the wild: a database forstudying face recognition in unconstrained environments. In Workshop on faces in’Real-Life’Images: detection, alignment, and recognition, Cited by: §C.2.
  • H. Jegou, M. Douze, and C. Schmid (2011) Product quantization for nearest neighbor search. TPAMI. Cited by: §1, §2.
  • Y. Jeong and H. O. Song (2018) Efficient end-to-end learning for quantizable representations. ICML. Cited by: §C.3, §1, §2, §4, §4, §5, §5.
  • Y. Jing, D. Liu, D. Kislyuk, A. Zhai, J. Xu, J. Donahue, and S. Tavel (2015) Visual search at pinterest. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1889–1898. Cited by: §1.
  • J. Johnson, M. Douze, and H. Jégou (2017) Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734. Cited by: §5.
  • K. Kavukcuoglu, M. Ranzato, and Y. LeCun (2010) Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint arXiv:1010.3467. Cited by: §2.
  • I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard (2016) The megaface benchmark: 1 million faces for recognition at scale. CVPR. Cited by: §1, §5, §5.
  • D. Kiela and L. Bottou (2014) Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 36–45. Cited by: §2.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.
  • G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2. Cited by: §1.
  • D. Kong, R. Fujimaki, J. Liu, F. Nie, and C. Ding (2014) Exclusive feature learning on arbitrary structures via ℓ1,2-norm. In Advances in Neural Information Processing Systems, pp. 1655–1663. Cited by: §2.
  • A. Krizhevsky, V. Nair, and G. Hinton (2009) CIFAR-10 and CIFAR-100 datasets. URL: https://www.cs.toronto.edu/~kriz/cifar.html. Cited by: §C.3.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. NeurIPS. Cited by: §3.
  • M. Krotkiewski and M. Dabrowski (2010) Parallel symmetric sparse matrix–vector product on scalar multi-core cpus. Parallel Computing 36 (4), pp. 181–198. Cited by: §2.
  • T. Kudo and Y. Matsumoto (2003) Fast methods for kernel-based text analysis. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pp. 24–31. Cited by: §2.
  • H. Lee, C. Ekanadham, and A. Y. Ng (2008) Sparse deep belief net model for visual area v2. In Advances in neural information processing systems, pp. 873–880. Cited by: §2.
  • Q. Li, Z. Sun, R. He, and T. Tan (2017) Deep supervised discrete hashing. NeurIPS. Cited by: §2.
  • W. Li, J. Mao, Y. Zhang, and S. Cui (2018) Fast similarity search via optimal sparse lifting. NeurIPS. Cited by: §1, §2.
  • B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky (2015) Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814. Cited by: §2.
  • W. Liu, Y. Wen, Z. Yu, and M. Yang (2016) Large-margin softmax loss for convolutional neural networks.. In ICML, Vol. 2, pp. 7. Cited by: §2.
  • D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision. Cited by: §1.
  • L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §2.
  • A. Makhzani and B. Frey (2013) K-sparse autoencoders. arXiv preprint arXiv:1312.5663. Cited by: §2.
  • Y. A. Malkov and D. A. Yashunin (2018) Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. TPAMI. Cited by: §2.
  • Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov (2014) Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems. Cited by: §2.
  • F. J. Massey Jr (1951) The kolmogorov-smirnov test for goodness of fit. Journal of the American statistical Association 46 (253), pp. 68–78. Cited by: §4.
  • L. Meier, S. Van De Geer, and P. Bühlmann (2008) The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 (1), pp. 53–71. Cited by: §2.
  • J. Mellor-Crummey and J. Garvin (2004) Optimizing sparse matrix–vector product computations using unroll and jam. The International Journal of High Performance Computing Applications 18 (2), pp. 225–236. Cited by: §2.
  • D. Molchanov, A. Ashukha, and D. Vetrov (2017) Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2498–2507. Cited by: §2.
  • S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou (2017) Agedb: the first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, Cited by: §C.2.
  • Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh (2017) No fuss distance metric learning using proxies. arXiv preprint arXiv:1703.07464. Cited by: §2.
  • P. Neculoiu, M. Versteegh, and M. Rotaru (2016) Learning text similarity with siamese recurrent networks. In Proceedings of the 1st Workshop on Representation Learning for NLP, pp. 148–157. Cited by: §1.
  • B. Neyshabur and N. Srebro (2015) On symmetric and asymmetric lshs for inner product search. In International Conference on Machine Learning, pp. 1926–1934. Cited by: §2.
  • A. Ng et al. (2011) Sparse autoencoder. CS294A Lecture notes 72 (2011), pp. 1–19. Cited by: §2.
  • H. Ng and S. Winkler (2014) A data-driven approach to cleaning large face datasets. In Image Processing (ICIP), 2014 IEEE International Conference on, Cited by: §5.
  • Q. Ning, J. Zhu, Z. Zhong, S. C. Hoi, and C. Chen (2016) Scalable image retrieval by sparse product quantization. IEEE Transactions on Multimedia 19 (3), pp. 586–597. Cited by: §2.
  • M. Norouzi, D. J. Fleet, and R. R. Salakhutdinov (2012) Hamming distance metric learning. NeurIPS. Cited by: §2.
  • H. Oh Song, S. Jegelka, V. Rathod, and K. Murphy (2017) Deep metric learning via facility location. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5382–5390. Cited by: §1.
  • B. A. Olshausen and D. J. Field (1997) Sparse coding with an overcomplete basis set: a strategy employed by v1?. Vision research 37 (23), pp. 3311–3325. Cited by: §2.
  • A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally (2017) Scnn: an accelerator for compressed-sparse convolutional neural networks. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 27–40. Cited by: §3.
  • S. Park, J. Park, S. Shin, and I. Moon (2018) Adversarial dropout for supervised and semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
  • V. K. Prasanna and G. R. Morris (2007) Sparse matrix computations on reconfigurable hardware. Computer 40 (3), pp. 58–64. Cited by: §2.
  • M. Raginsky and S. Lazebnik (2009) Locality-sensitive binary codes from shift-invariant kernels. NeurIPS. Cited by: §2.
  • P. Ram and A. G. Gray (2012) Maximum inner-product search using cone trees. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 931–939. Cited by: §2.
  • M. Ranzato, Y. Boureau, and Y. L. Cun (2008) Sparse feature learning for deep belief networks. In Advances in neural information processing systems, pp. 1185–1192. Cited by: §2.
  • M. Ranzato, C. Poultney, S. Chopra, and Y. L. Cun (2007) Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems, pp. 1137–1144. Cited by: §2.
  • A. Sablayrolles, M. Douze, C. Schmid, and H. Jégou (2019) Spreading vectors for similarity search. ICLR. Cited by: §4.
  • A. Sablayrolles, M. Douze, N. Usunier, and H. Jégou (2017) How should we evaluate supervised hashing?. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1732–1736. Cited by: §4.
  • S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry (2018) How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pp. 2483–2493. Cited by: §4.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §C.3, §2.
  • D. Shankar, S. Narumanchi, H. Ananya, P. Kompalli, and K. Chaudhury (2017) Deep learning based large scale visual recommendation and search for e-commerce. arXiv preprint arXiv:1703.02344. Cited by: §1.
  • A. Shrivastava and P. Li (2014) Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Advances in Neural Information Processing Systems, pp. 2321–2329. Cited by: §1.
  • N. S. Sohoni, C. R. Aberger, M. Leszczynski, J. Zhang, and C. Ré (2019) Low-memory neural network training: a technical report. arXiv preprint arXiv:1904.10631. Cited by: §3.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §2.
  • R. Tibshirani (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288. Cited by: §2.
  • F. Vázquez, J. Fernández, and E. M. Garzón (2011) A new approach for sparse matrix vector product on nvidia gpus. Concurrency and Computation: Practice and Experience 23 (8), pp. 815–826. Cited by: §2.
  • F. Vazquez, G. Ortega, J. Fernández, and E. M. Garzón (2010) Improving the performance of the sparse matrix vector product with gpus. In 2010 10th IEEE International Conference on Computer and Information Technology, pp. 1146–1151. Cited by: §2.
  • J. Wang, T. Zhang, N. Sebe, H. T. Shen, et al. (2018) A survey on learning to hash. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 769–790. Cited by: §2.
  • X. Wang (2011) A fast exact k-nearest neighbors algorithm for high dimensional search using k-means clustering and triangle inequality. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pp. 1293–1299. Cited by: §1.
  • K. Q. Weinberger and L. K. Saul (2009) Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10 (Feb), pp. 207–244. Cited by: §1, §2.
  • Y. Weiss, A. Torralba, and R. Fergus (2009) Spectral hashing. NeurIPS. Cited by: §2.
  • W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li (2016) Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pp. 2074–2082. Cited by: §2.
  • Y. Zhang, Y. H. Shalabi, R. Jain, K. K. Nagar, and J. D. Bakos (2009) FPGA vs. gpu for sparse matrix vector multiply. In 2009 International Conference on Field-Programmable Technology, pp. 255–262. Cited by: §2.
  • Y. Zhou, R. Jin, and S. C. Hoi (2010) Exclusive lasso for multi-task feature selection. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 988–995. Cited by: §2.
  • L. Zhuo and V. K. Prasanna (2005) Sparse matrix-vector multiplication on fpgas. In Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, pp. 63–74. Cited by: §2.

Appendix A Gradient computations for analytical experiments

As described in the main text, for the purposes of an analytical toy experiment, we consider a simplified setting with 2-d embeddings, in which each activation is obtained by applying the embedding non-linearity to a Gaussian pre-activation with negative mean, which is typical for sparse activations. The three compared regularizers are the ℓ1, ℓ2, and FLOPs regularizers evaluated in expectation, so computing the regularizer gradients boils down to computing the gradients of the corresponding expectations, as provided in the following lemmas. We hide the coordinate subscript for brevity, as the computations are similar for both coordinates.




where Φ denotes the CDF of the standard Gaussian distribution.

Proof of Lemma A.

The proof is based on standard Gaussian identities.

Proof of Lemma A.

Follows directly from the statement by standard differentiation. ∎

Proof of Lemma A.

where the last step follows from Lemma A.

where the last step follows from Lemma A. ∎

Proof of Lemma A.

Follows directly from Lemma A. ∎

Proof of Lemma A.

Follows directly from Lemma A. ∎

Appendix B Experimental details

All images were resized to a common size and aligned using a pre-trained aligner. For the Arcface loss, we used the recommended margin and temperature parameters. We trained our models on 4 NVIDIA Tesla V-100 GPUs using SGD with momentum. Both architectures were trained for a total of 230k steps, with the learning rate decayed by a constant factor after 170k steps. We used batch sizes of 256 and 64 per GPU for MobileFaceNet and ResNet respectively.

Pre-training in SDH is performed in the same way as described above. The hash learning step is trained on a single GPU. The ResNet model is trained for 200k steps with a batch size of 64, and the MobileFaceNet model is trained for 150k steps with a batch size of 256. We also fix the number of active bits and the pairwise cost.

Hyper-parameters for MobileNet models.

  1. The regularization parameter for the regularizer was varied as 200, 300, 400, 600.

  2. The regularization parameter for the regularizer was varied as 1.5, 2.0, 2.7, 3.5.

  3. The PCA dimension is varied as 64, 96, 128, 256.

  4. The number of LSH bits was varied as 512, 768, 1024, 2048, 3072.

  5. For IVF-PQ from the faiss library, the following parameters were fixed: nlist=4096, M=64, nbit=8, and nprobe was varied as 100, 150, 250, 500, 1000.

Hyper-parameters for ResNet baselines.

  1. The regularization parameter for the regularizer was varied as 50, 100, 200, 630.

  2. The regularization parameter for the regularizer was varied as 2.0, 3.0, 5.0, 6.0.

  3. The PCA dimension is varied as 48, 64, 96, 128.

  4. The number of LSH bits was varied as 256, 512, 768, 1024, 2048.

  5. For IVF-PQ, the following parameters were the same as in MobileNet: nlist=4096, M=64, nbit=8. nprobe was varied as 50, 100, 150, 250, 500, 1000.

Selecting top-k.

We use the following heuristic to create the shortlist of candidates after the sparse ranking step. We first shortlist all candidates with a score greater than a confidence threshold; in our experiments we set this threshold to 0.25. If the size of this shortlist is larger than k, it is further shrunk by considering only the top-k scorers. This heuristic avoids sorting the whole score array, which can otherwise become a bottleneck. The parameters are chosen so that the time required for the re-ranking step does not dominate the total retrieval time.
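The heuristic above can be sketched with numpy's argpartition, which selects the top scorers without a full sort; the 0.25 threshold is the one stated in the text, while the default cap k=1000 here is a placeholder of ours:

```python
import numpy as np

def shortlist(scores, threshold=0.25, k=1000):
    # Keep candidates above the confidence threshold without sorting everything.
    cand = np.flatnonzero(scores > threshold)
    if cand.size > k:
        # argpartition selects the k largest scorers in O(n), avoiding the
        # O(n log n) full sort of the whole score array.
        keep = np.argpartition(-scores[cand], k - 1)[:k]
        cand = cand[keep]
    return cand
```

Note the returned candidate set is unordered; the subsequent re-ranking with dense embeddings imposes the final order.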


  1. All models were trained on 4 NVIDIA Tesla V-100 GPUs with 16 GB of memory each.

  2. System memory: 256 GB.

  3. CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz.

  4. Number of threads: 32.

  5. Cache: L1d cache 32K, L1i cache 32K, L2 cache 256K, L3 cache 46080K.

All timing experiments were performed on a single thread in isolation.

Appendix C Additional Results

C.1 Results without re-ranking

Figure 4 shows a comparison of the approaches with and without re-ranking. We notice a significant dip in performance without re-ranking, with the gap being smaller for ResNet with FLOPs regularization. We also notice that the FLOPs regularizer has the better trade-off curve in the no-re-ranking setting as well.


Figure 4: Time vs Recall@1 plots for retrieval with and without re-ranking. Results from the same model and regularizer share the same color. Diamonds denote results with re-ranking, and triangles denote results without re-ranking.

C.2 FPR and TPR curves

In the main text we reported recall@1, a standard face recognition metric. This, however, is not sufficient to ensure good face verification performance. The goal in face verification is to predict whether two faces are similar or dissimilar, and a natural metric in this scenario is the FPR-TPR curve. Standard face verification datasets include LFW [Huang et al., 2008] and AgeDB [Moschoglou et al., 2017]. We produce embeddings using our trained models and use them to compute similarity scores (dot products) for pairs of images. These similarity scores are used to compute the FPR-TPR curves shown in Figure 5. We notice that for curves with a similar probability of activation, the FLOPs regularizer performs better than ℓ1. This demonstrates that the efficient utilization of all dimensions under the FLOPs regularizer helps in learning richer representations at the same sparsity.

We also observe that the gap between sparse and dense models is smaller for ResNet, thus suggesting that the ResNet model learns better representations due to increased model capacity. Lastly, we also note that the gap between the dense and sparse models is smaller for LFW compared to AgeDB, thus corroborating the general consensus that LFW is a relatively easier dataset.

(Panels: AgeDB with MobileNet, AgeDB with ResNet, LFW with MobileNet, LFW with ResNet.)

Figure 5: FPR-TPR curves. The ℓ1 curves are all shown in shades of red, whereas the FLOPs curves are all shown in shades of blue. The probability of activation is provided in the legend for comparison. For curves with a similar probability of activation, the FLOPs regularizer performs better than ℓ1, demonstrating that the FLOPs regularizer learns richer representations at the same sparsity.

C.3 CIFAR-100 results

We also experimented with the CIFAR-100 dataset [Krizhevsky et al., 2009], consisting of 60000 examples and 100 classes; each class consists of 500 train and 100 test examples. We compare the ℓ1 and FLOPs regularized approaches with the sparse deep hashing approach. All models were trained using the triplet loss [Schroff et al., 2015] and embedding dimension 64. For the dense and SDH baselines, no activation was used on the embeddings; for the ℓ1 and FLOPs regularized models we used the SThresh activation. Similar to Jeong and Song [2018], the train-test and test-test precision values are reported in Table 1, and the reported results are without re-ranking. CIFAR-100 being a small dataset, we report only the FLOPs-per-row, as time measurements can be misleading. In our experiments, we achieved slightly higher precisions for the dense model than reported by Jeong and Song [2018]. We notice that our models use roughly a third of the computation of the cheapest SDH model, albeit with a slightly lower precision.

                               Train          Test
Model                   F      prec    prec   prec    prec
Dense                   64     61.53   61.26  57.31   56.95
SDH                     1.18   62.29   61.94  57.22   55.87
SDH                     3.95   60.93   60.15  55.98   54.42
SDH                     8.82   60.80   59.96  55.81   54.10
ℓ1 (no re-ranking)      0.40   61.05   61.08  55.23   55.21
FLOPs (no re-ranking)   0.47   60.50   60.17  54.32   54.96

Table 1: CIFAR-100 results using the triplet loss and embedding size 64. For the ℓ1 and FLOPs models, no re-ranking was used. F denotes the FLOPs-per-row (lower is better). The SDH results are reported from the original paper.