1 Introduction
Stochastic gradient descent (SGD) is one of the most important algorithms for scalable machine learning
[7, 36, 27]. SGD optimizes an objective function by successively following noisy estimates of its gradient, computed on minibatches of a large underlying dataset. One usually requires this gradient to be unbiased, meaning that the expected stochastic gradient equals the true gradient. When combined with a suitably decreasing learning rate schedule, the algorithm converges to a local optimum of the objective
[7]. Often we are not interested in learning an unbiased estimator of the gradient, but are rather willing to introduce some bias. There are many reasons why this might be the case. First, biased SGD schemes such as momentum
[34], iterate averaging [40], or preconditioning [9, 16, 43, 46] may reduce the stochastic gradient noise or ease the optimization problem, and therefore often lead to faster convergence. Another reason is that we may decide to actively select samples based on their relevance or difficulty, as in boosting [10], or because we believe that our dataset is in some respect imbalanced [12]. In this paper, we propose and investigate a biased minibatch subsampling scheme for imbalanced data.
Real-world datasets are naturally imbalanced. For instance, the sports topic appears more often in the news than biology; the internet contains more images of young people than of senior people; and YouTube has more videos of cats than of bees or ants. By maximizing the probability of generating such training data, machine learning models redundantly refine the dominant information but ignore important yet scarce data. For example, a model trained on YouTube data might be very sensitive to different cats but unable to recognize ants. We may therefore decide to learn on a more balanced dataset by actively selecting diversified minibatches.
The currently most common tool for minibatch diversification is stratified sampling [30, 48]. In this approach, one groups the data into a finite set of strata based on discrete or continuous features such as a label or cluster assignment. To rebalance the data set, the data can then be subsampled such that each stratum occurs with equal probability in the minibatch (in the following, we refer to this method as biased stratified sampling). Unfortunately, the data are not always amenable to biased stratified sampling because discrete features may not exist, or the data may not be unambiguously clustered. Instead of subsampling based on discrete strata, it would be desirable to diversify the minibatch based on a soft similarity measure between data points. As we show in this paper, this can be achieved using Determinantal Point Processes (DPPs) [19].
The DPP is a point process which mimics repulsive interactions between samples. Being based on a similarity matrix between the data points, a draw from a DPP yields diversified subsets of the data. The main contribution of this paper is to use this mechanism to diversify the minibatches in stochastic gradient-based learning, and to analyze this setup theoretically. In more detail, our main achievements are:

We present a minibatch diversification scheme based on DPPs for stochastic gradient algorithms. This approach requires a similarity measure among data points, which can be constructed using low-level features of the data. Since the sampling strategy is independent of the learning objective, diversified minibatches can be precomputed in parallel and reused for different learning tasks. Our approach applies to both supervised and unsupervised models.

We prove that our method is a generalization of stratified sampling and i.i.d. minibatch sampling. Both cases emerge for specific similarity kernels of the data.

We theoretically analyze the conditions under which the variance of the DMSGD gradient is reduced. We also give an unbiased version of DMSGD which optimizes the original objective without rebalancing the data.

We carry out extensive experiments on several models and datasets. Our approach leads to faster learning and higher classification accuracies in deep supervised learning. For topic models, we find that the resulting document features are more interpretable and better suited for subsequent supervised learning tasks.
Our paper is structured as follows. In Section 2 we list related work. Section 3 introduces our main concept of a diversified risk and presents the DMSGD method. Section 4 discusses theoretical properties of our approach such as variance reduction. Finally, in Section 5, we give empirical evidence that our approach leads to higher classification accuracy and better feature extraction than i.i.d. sampling.
2 Related Work
We revisit the most relevant prior work based on the following aspects. Diversification and Stratification comprises methods which aim at rebalancing the empirical distribution of the data. Variance reduction summarizes stochastic gradient methods that aim at faster convergence by reducing the stochastic gradient noise. Finally, we list related applications and extensions of determinantal point processes.
Diversification and stratification.
Since our method diversifies the minibatches by non-uniform subsampling from the data, it relates to stratification methods.
Stratification [30, 29] assumes that the data decompose into disjoint sub-datasets, called strata. These are formed based on certain criteria such as a class label. Instead of uniformly sampling from the whole dataset, each stratum is subsampled independently, which reduces the variance of the estimator of interest.
Stratified sampling has been suggested as a variance reduction method for stochastic gradient algorithms [11, 48]. If one subsamples the same number of data points from every stratum to form a minibatch, as in [48], one naturally balances the training procedure. This approach was also used in [1]. Our work relates closely to this type of biased stratified sampling. It differs in that it does not rely on discrete strata, but only requires a measure of similarity between data points to achieve a similar effect. This applies more broadly.
Variance reduction.
Besides rebalancing the dataset, our approach also reduces the variance of the stochastic gradients. Several ways of variance reduction of stochastic gradient algorithms have been proposed, an important class relying on control variates [26, 32, 35, 44, 38]. A second class of methods relies on nonuniform sampling of minibatches [8, 11, 33, 39, 48, 49]. None of these methods rely on similarity measures between data points.
Our approach is most closely related to clusteringbased sampling (CBS) [11] and stratified sampling (StS) [48]
. StS applies stratified sampling to SGD and builds on pre-specified strata. For every stratum, the same number of data points is uniformly selected and then reweighted according to the size of the stratum to make the sampling scheme unbiased. CBS uses a similar strategy, but does not require a pre-specified set of strata. Instead, the strata are formed by pre-clustering the raw data with k-means. (Thus, if the data are clustered based on a class label, CBS is identical to StS.) The problem is that the data are not always amenable to clustering. Second, both StS and CBS ignore the within-cluster variations between data points. In contrast, our approach relies on a continuous measure of similarity between samples. We furthermore show that it is a strict generalization of both setups for particular choices of similarity kernels.
Determinantal point processes.
The DPP [19, 25] has been proposed [20, 22, 45] and advanced [4, 23, 24] in the machine learning community in recent years. It has been applied in subset sampling [18, 24] and results filtering [22].
The DPP has also been used as a diversity-enhancing prior in Bayesian models [20, 45]. In big data setups, the data may overwhelm the prior, such that the strength of the prior has to scale with the number of data points, introducing a bias. The approach is furthermore constrained to hierarchical Bayesian models, whereas our approach applies to all empirical risk minimization problems.
Recently, efficient algorithms have been proposed to make sampling from the DPP more scalable. In the traditional formulation, minibatch sampling costs $O(N k^3)$ per draw, with an initial fixed cost of $O(N^3)$ for diagonalizing the similarity matrix [19], where $N$ is the size of the data and $k$ is the size of the minibatch. Recent scalable versions of the DPP rely on core-sets and low-rank approximations and scale more favorably [4, 24]. These versions were used in our large-scale experiments.
3 Method
Our method, DMSGD, uses a version of the DPP for minibatch sampling in stochastic gradient descent. We show that this balances the underlying data distribution and simultaneously accelerates convergence due to variance reduction. We briefly revisit the DPP first, and then introduce our minibatch diversification method. Theoretical aspects are discussed in Section 4.
3.1 Determinantal Point Processes
A point process is a collection of points randomly located in some mathematical space. The most prominent example is the Poisson process on the real line [17], which models independently occurring events. In contrast, the DPP [19, 25] models repulsive correlations between these points.
In this paper, we restrict ourselves to a finite set of $N$ points $\{x_1, \dots, x_N\}$. Denote by $L \in \mathbb{R}^{N \times N}$ a similarity kernel matrix between these points, e.g. based on spatial distances or some other criterion. $L$ is real, symmetric and positive definite, and its elements $L_{ij}$ are some appropriately defined measure of similarity between the $i$-th and $j$-th data points. The DPP assigns a probability to subsampling any subset $S$ of $\{1, \dots, N\}$, which is proportional to the determinant of the submatrix $L_S$ of $L$ which indexes the subset,

$P(S) \propto \det(L_S)$.  (1)
For instance, if $S = \{i, j\}$ consists of only two elements, then $P(S) \propto L_{ii} L_{jj} - L_{ij} L_{ji}$. Because $L_{ij}$ and $L_{ji}$ measure the similarity between elements $i$ and $j$, being more similar lowers the probability of co-occurrence. On the other hand, when the subset is very diverse, the determinant is larger and correspondingly its co-occurrence is more likely. The DPP thus naturally diversifies the selection of subsets.
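This two-point determinant can be checked numerically. The kernel below is a made-up three-point example (not from the paper) in which points 0 and 1 are similar and point 2 is dissimilar to both:

```python
import numpy as np

# Hypothetical 3-point similarity kernel: points 0 and 1 are similar,
# point 2 is dissimilar to both.
L = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])

def det_score(subset):
    """Unnormalized DPP probability: determinant of the subset's submatrix."""
    return np.linalg.det(L[np.ix_(subset, subset)])

p_similar = det_score([0, 1])   # 1 - 0.9^2 = 0.19
p_diverse = det_score([0, 2])   # 1 - 0.1^2 = 0.99
```

As expected, the diverse pair receives a much higher unnormalized probability than the similar pair.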
In this paper, we propose to use the DPP to diversify minibatches. In practice, the minibatch size is usually constrained by empirical bounds or hardware restrictions. In this case, we want to use a DPP conditioned on a given size $k$. Therefore, a slightly modified version of the DPP is needed, called the $k$-DPP [18]. It assigns probabilities only to subsets of fixed size $k$,

$P_k(S) = \dfrac{\det(L_S)}{\sum_{|S'| = k} \det(L_{S'})}$.  (2)
Apart from conditioning on the size $k$ of the subset of points, the $k$-DPP has the same diversification effect as the DPP [18]. In order to have a fixed minibatch size, we use the $k$-DPP in this work.
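For intuition, a brute-force $k$-DPP sampler can be written by enumerating all size-$k$ subsets and normalizing their determinants as in Eq. 2. This is only feasible for toy problems; practical implementations use the eigendecomposition-based sampler of [18] or scalable variants. The kernel below is illustrative:

```python
import itertools
import numpy as np

def kdpp_sample(L, k, rng):
    """Brute-force k-DPP: enumerate all size-k subsets and sample one with
    probability proportional to det(L_S). Toy-scale only."""
    n = L.shape[0]
    subsets = list(itertools.combinations(range(n), k))
    dets = np.array([np.linalg.det(L[np.ix_(s, s)]) for s in subsets])
    probs = dets / dets.sum()
    return subsets[rng.choice(len(subsets), p=probs)]

# Illustrative kernel: points 0 and 1 similar, point 2 dissimilar.
L = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
rng = np.random.default_rng(0)
batch = kdpp_sample(L, 2, rng)
```

Draws containing the dissimilar point 2 occur far more often than the similar pair (0, 1).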
3.2 Mini-Batch Diversification
The diversifying property of the $k$-DPP makes it well-suited for diversifying minibatches. We first discuss our learning objective, the diversified risk. We then introduce our algorithm and qualitatively discuss its properties.
Expected, empirical, and diversified risk.
Many problems in machine learning amount to minimizing some loss function $\ell(x, \theta)$ which depends both on a set of parameters $\theta$ and on data $x$. In probabilistic modeling, $\ell$ could be the negative logarithm of the likelihood of a probabilistic model, or a variational lower bound [6, 14]. We often thereby assume that the data were generated as draws from some underlying unknown data-generating distribution $p(x)$, also called the population distribution. To best generalize to unseen data, we would ideally like to minimize this function’s expectation under $p$,

$J(\theta) = \mathbb{E}_{p(x)}[\ell(x, \theta)]$.  (3)
This objective function is also called the expected risk [7]. Since $p(x)$ is unknown and we believe that our observed data are in some sense a representative draw from the population distribution, we can replace the expectation by an expectation over the empirical distribution of the data $\{x_1, \dots, x_N\}$, which leads to the empirical risk [7],

$\hat{J}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, \theta)$.  (4)
A typical goal in machine learning is not to minimize the empirical risk to high accuracy, but to learn model parameters that generalize well to unseen data. For every data point in a test set, we wish our model to have high predictive accuracy. If this test set is more balanced than the training set (for instance, because it contains all classes in equal proportions in a classification setup), we would naturally like to train our model on a more balanced training set than the original one, without throwing away data. In this work, we present a systematic way to achieve this goal based on biased subsampling of the training data. We term the collection of all samples generated from biased subsampling the balanced dataset.
To this end, we introduce the diversified risk, where we average the loss function over diversified minibatches $B$ of size $k$ drawn from a $k$-DPP,

$\hat{J}_k(\theta) = \mathbb{E}_{B \sim k\text{-DPP}(L)}\left[\frac{1}{k} \sum_{i \in B} \ell(x_i, \theta)\right]$.  (5)
Due to the repulsive nature of the $k$-DPP, similar data points are less likely to co-occur in the same draw. Thus, data points which are very different from the rest are more likely to be sampled and obtain a higher weight, as illustrated in Figure 2 (e).
The diversified risk depends both on the minibatch size $k$ and on the similarity kernel $L$ of the data. A more theoretical analysis of the diversified risk is carried out in Section 4.
Algorithm.
Our proposed algorithm directly optimizes the diversified risk in Eq. 5. To this end, we propose SGD updates on diversified minibatches $B_t$ of fixed size $k$,

$\theta_{t+1} = \theta_t - \rho_t \frac{1}{k} \sum_{i \in B_t} \nabla_\theta \ell(x_i, \theta_t)$.  (6)
Above, $B_t$ is a collection of $k$ indices, drawn from the $k$-DPP. In every stochastic gradient step, we thus sample a minibatch $B_t \sim k\text{-DPP}(L)$ and carry out an update with decreasing learning rate $\rho_t$.
Sampling from the $k$-DPP first requires an eigendecomposition of its kernel $L$. This decomposition can also be approximated and has to be computed only once per dataset. Drawing a sample then has computational complexity $O(N k^3)$, where $k$ is the minibatch size; since $k$ is commonly small, this is much more efficient than the decomposition. This approach is briefly summarized in Algorithm 1; details on the sampling procedure are given in the supplementary material. We refer to [19] for more details, and to [4, 24] for more efficient sampling procedures.
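A minimal sketch of the resulting training loop, on a made-up 1-D least-squares toy problem (the data, RBF kernel, batch size, and learning-rate schedule are all assumptions, and the $k$-DPP is sampled by brute-force enumeration rather than the eigendecomposition-based sampler):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D least-squares problem: loss_i = 0.5 * (theta - x_i)^2. Two
# nearly duplicate points and one outlier, with an RBF similarity kernel.
x = np.array([0.0, 0.1, 5.0])
L = np.exp(-(x[:, None] - x[None, :]) ** 2)
k = 2

# Presample-style k-DPP by enumeration (toy-scale only).
subsets = list(itertools.combinations(range(len(x)), k))
dets = np.array([np.linalg.det(L[np.ix_(s, s)]) for s in subsets])
probs = dets / dets.sum()

theta = 0.0
for t in range(2000):
    B = subsets[rng.choice(len(subsets), p=probs)]  # diversified minibatch
    grad = np.mean([theta - x[i] for i in B])       # (1/k) * sum of per-point grads
    theta -= 0.05 / (1 + 0.01 * t) * grad           # decreasing learning rate
```

The iterate settles near the marginal-weighted mean of the data rather than the plain data mean, reflecting the reweighting implicit in Eq. 5: the near-duplicate points share their weight, while the isolated point keeps a high marginal probability.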
Variance reduction and connections to biased stratified sampling.
Dividing the data into different strata and sampling data from each stratum with adjusted probabilities may reduce the variance of SGD. This insight forms the basis of stratified sampling [48], and the related preclustering based method [11]. As we will demonstrate rigorously in the next section, our approach also enjoys variance reduction but does not require an artificial partition of the data into clusters.
For many models, the gradient varies smoothly as a function of the data. Subsampling data from diversified regions in data space will therefore decorrelate the gradient contributions. This, in turn, may reduce the variance of the stochastic gradient. To some degree, methods such as biased stratified sampling or preclustering sample data from diversified regions, but ignore the fact that gradients within clusters may still be highly correlated. If the data are not amenable to clustering, this variance may be just as large as the intercluster variance. Our approach does not rely on the notion of clusters. Instead, we have a continuous measure of similarity between samples, given by the similarity kernel. This applies more broadly.
In Figure 2, we investigate how well our subsampling procedure using the $k$-DPP allows us to recover an original distribution of data from which we only observe an imbalanced subset. Panel (a) shows the original (uniform) distribution of data points, and (b) shows the observed dataset which we use to re-estimate the original one. While biased stratified sampling (c) or pre-clustering based on k-means (d) need an artificial way of dividing the data into finitely many strata and rebalancing their corresponding weights, our approach (e) relies on a continuous similarity measure between data points and takes into account both intra-stratum and inter-stratum variations.
Computational overhead.
Sampling from the $k$-DPP implies a computational overhead over classical SGD. Regarding overall runtime, the benefits of the approach therefore come mainly into play in setups where each gradient update is expensive. One example is stochastic variational inference for models with local latent variables. For example, in LDA, the computational bottleneck is updating the per-document topic proportions. The time spent on sampling a minibatch using the $k$-DPP is only a small fraction of the time required to infer these local variables and estimate the gradient (see Table 1 in Section 5). Spending this tiny overhead on actively selecting training examples is well invested, as the resulting stochastic gradient has a lower variance.
Since the sampling procedure is independent of the learning algorithm, we can parallelize it or even draw the samples as a preprocessing step and reuse them for different hyperparameter settings. Moreover, there are approximate versions of $k$-DPP sampling which are scalable to big datasets [4, 23]. In this paper, we use the fast $k$-DPP [23] in our large-scale experiments (Section 5.3).
4 Theoretical Considerations
In this section, we give the theoretical foundation of the DMSGD scheme. We first prove that biased stratified sampling and pre-clustering emerge as special cases of our algorithm for particular choices of the kernel matrix $L$. We then prove that the diversified risk of DMSGD is a reweighted variant of the empirical risk, where the weights are given by the marginal probabilities of the $k$-DPP (we also present an unbiased DMSGD scheme which approximates the true gradients, but which performs less favorably in practice). Last, we investigate under which circumstances DMSGD reduces the variance of the stochastic gradient.
Notation.
For what follows, let $b_i \in \{0, 1\}$ denote a variable which indicates whether the data point $x_i$ was sampled under the $k$-DPP, and let $\mathbb{E}[\cdot]$ always denote the expectation under the $k$-DPP. This lets us express the expectation of any quantity which depends additively on the data points as

$\mathbb{E}\left[\sum_{i \in B} f(x_i)\right] = \mathbb{E}\left[\sum_{i=1}^{N} b_i f(x_i)\right] = \sum_{i=1}^{N} \mathbb{E}[b_i]\, f(x_i)$.  (7)
Next, we introduce shorthand notation for first and second moments. Denote the marginal probability of a point $x_i$ being sampled as

$P_i = P(i \in B) = \mathbb{E}[b_i]$,  (8)

which has an analytic form and can be computed efficiently.
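These marginals can be checked by enumeration on a small toy kernel (illustrative values, not from the paper). Note that they always sum to $k$, since every minibatch contains exactly $k$ points:

```python
import itertools
import numpy as np

# Brute-force marginals P_i = P(i in B) under a k-DPP; closed forms via
# the kernel eigendecomposition exist, but enumeration suffices at toy scale.
L = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
k = 2
n = L.shape[0]
subsets = list(itertools.combinations(range(n), k))
dets = np.array([np.linalg.det(L[np.ix_(s, s)]) for s in subsets])
probs = dets / dets.sum()
P = np.array([sum(p for s, p in zip(subsets, probs) if i in s) for i in range(n)])
```

The dissimilar point 2 obtains a larger marginal probability than either of the similar points 0 and 1.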
which has an analytic form and can be computed efficiently. We also introduce the correlation matrix
(9) 
In contrast to minibatch SGD, where $b_i$ and $b_j$ are independent for $i \neq j$ and hence $C_{ij} = 0$, this is no longer true under the $k$-DPP. Instead, the correlation can be both negative (when data points are similar) and even positive (when data points are very dissimilar).
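The same enumeration yields the correlation matrix. In the hypothetical three-point kernel below (not from the paper), the similar pair (0, 1) is the most strongly anti-correlated; also, because the minibatch size is fixed, each row of $C$ sums to zero:

```python
import itertools
import numpy as np

# Brute-force second moments E[b_i b_j] and correlation matrix C.
L = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
k, n = 2, 3
subsets = list(itertools.combinations(range(n), k))
dets = np.array([np.linalg.det(L[np.ix_(s, s)]) for s in subsets])
probs = dets / dets.sum()

Ebb = np.zeros((n, n))
for s, p in zip(subsets, probs):
    for i in s:
        for j in s:
            Ebb[i, j] += p        # accumulates E[b_i b_j] (diagonal gives P_i)
P = np.diag(Ebb).copy()
C = Ebb - np.outer(P, P)
```

The zero row sums follow from $\sum_i b_i = k$ being constant, so any single indicator is uncorrelated with the total.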
Lastly, let $\bar{g} = \frac{1}{N} \sum_{i=1}^{N} g_i$ denote the gradient of the empirical risk, which is the batch gradient, and $g_i = \nabla_\theta \ell(x_i, \theta)$ its individual contributions from the data $x_i$.
We first prove that our algorithm captures two important limiting cases, namely (biased) stratified sampling and pre-clustering.
Proposition 1.
Biased stratified sampling (StS) [48], where data from different strata are subsampled with equal probability, is equivalent to DMSGD with a similarity matrix $L$, defined as a block-diagonal matrix with

$L_{ij} = 1$ if $y_i = y_j$, and $L_{ij} = 0$ otherwise,  (10)

where $y_i$ denotes the label for the stratum of data point $x_i$.
Proof.
It is enough to show that a draw from the $k$-DPP which contains multiple data points with the same stratum assignment has probability zero.
Let $S = A \cup A^c$, where $A$ is a collection of at least two indices which come from the same stratum, and $A^c$ is its disjoint complement in $S$. Because of the block structure of $L$, we have $\det(L_S) = \det(L_A)\, \det(L_{A^c})$.
However, $\det(L_A) = 0$ because $L_A$ is a matrix of all ones. Therefore, $\det(L_S) = 0$, and hence $S$ has zero probability under the $k$-DPP. Therefore, every draw from the $k$-DPP with $L$ defined as above contains at most one data point from each stratum. When $k$ is the same as the number of classes, we recover StS. If $k$ is smaller than the number of classes, we obtain a direct generalization of StS. ∎
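The determinant argument is easy to verify numerically with a block-of-ones kernel for hypothetical strata labels:

```python
import numpy as np

# Block-of-ones kernel for hypothetical strata labels [0, 0, 1, 2]:
# L_ij = 1 iff points i and j share a stratum.
labels = np.array([0, 0, 1, 2])
L = (labels[:, None] == labels[None, :]).astype(float)

det_same = np.linalg.det(L[np.ix_([0, 1], [0, 1])])         # two points, one stratum
det_mixed = np.linalg.det(L[np.ix_([0, 2, 3], [0, 2, 3])])  # one point per stratum
```

Any subset repeating a stratum has determinant zero and hence zero probability, while subsets with one point per stratum keep positive probability.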
Proposition 2.
Pre-clustering [11] results as a special case of DMSGD, with $L_{ij} = 1$ if the data points $x_i$ and $x_j$ are assigned to the same cluster, and $L_{ij} = 0$ otherwise.
It is furthermore simple to see that regular minibatch SGD results from DMSGD when choosing the identity kernel $L = \mathbb{I}$.
Next, we analyze the objective function of DMSGD. We prove that the diversified risk (Eq. 5) is given by a reweighted version of the empirical risk (Eq. 4) of the data.
Proposition 3.
The diversified risk (Eq. 5) can be expressed as a reweighted empirical risk with the marginal $k$-DPP probabilities $P_i$ as weights,

$\hat{J}_k(\theta) = \frac{1}{k} \sum_{i=1}^{N} P_i\, \ell(x_i, \theta)$.

In the case of a trivial similarity kernel $L = \mathbb{I}$, where $P_i = k/N$, this quantity just becomes the empirical risk.
Proof.
We employ the indicators defined above:

$\hat{J}_k(\theta) = \mathbb{E}\left[\frac{1}{k} \sum_{i \in B} \ell(x_i, \theta)\right] = \frac{1}{k} \sum_{i=1}^{N} \mathbb{E}[b_i]\, \ell(x_i, \theta) = \frac{1}{k} \sum_{i=1}^{N} P_i\, \ell(x_i, \theta)$. ∎
The following corollary allows us to construct an unbiased stochastic gradient based on DMSGD in case we are not interested in rebalancing the population.
Proposition 4.
The following SGD scheme leads to an unbiased stochastic gradient:
$\theta_{t+1} = \theta_t - \rho_t \sum_{i \in B_t} \frac{1}{N P_i} \nabla_\theta \ell(x_i, \theta_t)$.  (11)

This is a simple consequence of the identity $\mathbb{E}[b_i] = P_i$: in expectation, Eq. 11 follows the batch gradient $\frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell(x_i, \theta_t)$.
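Both propositions can be verified by exhaustive enumeration on a toy kernel with arbitrary per-point losses (all values below are made up): the expected minibatch loss matches the $P_i$-reweighted empirical risk, and the $1/(N P_i)$ reweighting recovers the plain empirical mean:

```python
import itertools
import numpy as np

L = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
loss = np.array([1.0, 2.0, 3.0])    # arbitrary per-point losses
k, n = 2, 3

subsets = list(itertools.combinations(range(n), k))
dets = np.array([np.linalg.det(L[np.ix_(s, s)]) for s in subsets])
probs = dets / dets.sum()
P = np.array([sum(p for s, p in zip(subsets, probs) if i in s) for i in range(n)])

# Proposition 3: E[(1/k) sum_{i in B} loss_i] == (1/k) sum_i P_i loss_i.
diversified = sum(p * np.mean(loss[list(s)]) for s, p in zip(subsets, probs))
reweighted = (P * loss).sum() / k

# Proposition 4: reweighting by 1/(N P_i) recovers the empirical risk.
unbiased = sum(p * sum(loss[i] / (n * P[i]) for i in s)
               for s, p in zip(subsets, probs))
empirical = loss.mean()
```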
Finally, we investigate under which circumstances the DMSGD gradient has a lower variance than simple minibatch SGD on the diversified risk. To this end, consider the gradient components $g_i$ and $g_j$ of data points $x_i$ and $x_j$, respectively, as well as their correlation $C_{ij}$ under the $k$-DPP. A sufficient condition for DMSGD to reduce the variance is given as follows.
Theorem 1.
Assume that for all data points $x_i$ and $x_j$ and for all parameters $\theta$ in a region of interest, the scalar product $g_i^\top g_j$ is positive (negative) whenever the correlation $C_{ij}$ is negative (positive), respectively, i.e.

$C_{ij}\, g_i^\top g_j \le 0$ for all $i \neq j$.  (12)
Then, DMSGD has a lower variance than SGD.
Remark.
The sufficient conditions outlined in Theorem 1 are very strong, but the proof provides us with valuable insights into why variance reduction occurs.
Proof.
To begin with, define

$G = \frac{1}{k} \sum_{i \in B} g_i = \frac{1}{k} \sum_{i=1}^{N} b_i g_i$,  (13)

$\bar{G} = \mathbb{E}[G] = \frac{1}{k} \sum_{i=1}^{N} P_i g_i$,  (14)

where $G$ is the DMSGD gradient and $\bar{G}$ is the full gradient of the diversified risk.
We denote the difference between the expected and stochastic gradient as

$\Delta G = G - \bar{G}$.  (15)
By construction, this quantity has expectation zero. We are interested in the trace of the stochastic gradient covariance,

$\mathbb{E}\big[\|\Delta G\|^2\big] = \mathbb{E}\big[\|G\|^2\big] - \|\bar{G}\|^2$.  (16)

The second term can be expressed as

$\|\bar{G}\|^2 = \frac{1}{k^2} \sum_{i,j} P_i P_j\, g_i^\top g_j$.

We can furthermore compute

$\mathbb{E}\big[\|G\|^2\big] = \frac{1}{k^2} \sum_{i,j} \mathbb{E}[b_i b_j]\, g_i^\top g_j$, with $\mathbb{E}[b_i b_j] = \delta_{ij} P_i + (1 - \delta_{ij})(P_i P_j + C_{ij})$,

where $\delta_{ij}$ is the Kronecker symbol (we used $b_i^2 = b_i$).

Collecting all terms, the variance can be written as

$\mathbb{E}\big[\|\Delta G\|^2\big] = \frac{1}{k^2} \sum_{i} P_i (1 - P_i)\, \|g_i\|^2 + \frac{1}{k^2} \sum_{i \neq j} C_{ij}\, g_i^\top g_j$.

The first term is just the variance of regular minibatch SGD, where we sample each data point with probability proportional to $P_i$, which also optimizes the diversified risk. This term is always positive because $0 < P_i < 1$. Under the assumption of Eq. 12, the second term is non-positive, which proves the claim. ∎
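The variance decomposition can likewise be checked numerically on a toy problem with scalar per-point gradients (the kernel and gradient values are illustrative):

```python
import itertools
import numpy as np

L = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
g = np.array([1.0, 1.2, -2.0])   # scalar per-point gradients (made up)
k, n = 2, 3

subsets = list(itertools.combinations(range(n), k))
dets = np.array([np.linalg.det(L[np.ix_(s, s)]) for s in subsets])
probs = dets / dets.sum()

# Second moments E[b_i b_j]; the diagonal equals the marginals P_i.
Ebb = np.zeros((n, n))
for s, p in zip(subsets, probs):
    for i in s:
        for j in s:
            Ebb[i, j] += p
P = np.diag(Ebb).copy()
C = Ebb - np.outer(P, P)

# Variance of the minibatch gradient, by enumeration and by the formula.
G = np.array([np.mean(g[list(s)]) for s in subsets])
var_enum = (probs * G**2).sum() - (probs * G).sum()**2
var_formula = (C * np.outer(g, g)).sum() / k**2
```

Both computations agree to machine precision, confirming that the stochastic gradient variance is exactly $\frac{1}{k^2} \sum_{i,j} C_{ij}\, g_i g_j$ in this scalar setting.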
Discussion of Theorem 1.
If the similarity kernel relies on spatial distances, nearby data points $x_i$ and $x_j$ have a negative correlation $C_{ij}$ under the $k$-DPP. However, if the loss function is smooth, the gradients $g_i$ and $g_j$ of nearby points tend to align (i.e. have a positive scalar product). Eq. 12 is therefore naturally satisfied for these points. $C_{ij}$ can also be positive: since some combinations of data points are less likely to co-occur, others must be more likely to co-occur. Since these points tend to be far apart, it is reasonable to assume that their gradients show no tendency to align. It is therefore plausible to assume that for these points, Eq. 12 also applies. (Strictly, we only need the negative contributions to outweigh the positive ones to see variance reduction.)
To summarize, if the condition in Eq. 12 is met, we can guarantee variance reduction relative to minibatch SGD, and we have given arguments for why it is plausible that this condition is met to some degree when using DMSGD with a distance-dependent similarity kernel. In our experimental section, we show that DMSGD has a faster learning curve, which we attribute to this phenomenon.
5 Experiments
We evaluate the performance of our method in different settings. In Section 5.1 we demonstrate the usage of DMSGD for Latent Dirichlet Allocation (LDA) [6], an unsupervised probabilistic topic model. We show that the learned diversified topic representations are better suited for subsequent text classification. In Section 5.2
we evaluate the supervised scenario based on multinomial (softmax) logistic regression with imbalanced data. We compare against stratified sampling, which emerges naturally in this example. In Section 5.3 we show that our method also maintains performance on the balanced MNIST dataset, where we tested convolutional neural networks. In all experiments, we pre-sample the minibatch indices using the $k$-DPP implementation from [19] for small datasets, and from [23] for big datasets. In this way, sampling is treated as a pre-scheduling step and can easily be parallelized. We found that our approach yields more diversified feature representations (in unsupervised setups) and higher predictive accuracies (in supervised setups). We also found that DMSGD converges within fewer passes through the data than standard minibatch sampling, due to variance reduction.
5.1 Topic Learning with LDA
We follow Algorithm 2 for LDA. First, we demonstrate the performance of DMSVI on synthetic data. We show that by balancing our minibatches, we obtain a much better recovery of the topics that were used to generate the data. Second, we use a real-world news dataset. We demonstrate that we can learn more diverse topics that are also better features for text classification tasks.
In this setting, stratified sampling is not applicable since there is no discrete feature such as a class label available. With only word frequencies available, no simple criterion can be used to divide the data into meaningful strata.
5.1.1 Synthetic Data
We generate a synthetic dataset (shown in the supplementary material) following the generative process of LDA with a fixed global latent parameter (the graphical topics). We choose five distinct patterns as shown in Figure 3 (a), where each row represents a topic and each column represents a word. To generate an imbalanced dataset, we use different Dirichlet priors for the per-document topic distribution. 300 documents are generated with prior (0.5, 0.5, 0.01, 0.01, 0.01); 50 with prior (0.01, 0.5, 0.5, 0.5, 0.01); and 10 with prior (0.01, 0.01, 0.01, 0.5, 0.5). Hence, the first two topics are used very often in the corpus, topics 3 and 4 appear a few times, and topic 5 appears very rarely.
We fit LDA to recover the topics of the synthetic data using traditional SVI and our proposed DMSVI, respectively. Here, the raw word occurrences are used to construct the similarity matrix $L$. We check how well the global parameters are recovered. Fully recovered latent variables indicate that the model is able to capture the underlying structure of the data. Figure 3 (b) shows the estimated per-topic word distributions with SVI, and Figure 3 (c) shows the result with our proposed DMSVI.
In Figure 3 (b), we see that the first three topics are recovered using traditional SVI. Topic four is roughly recovered, but with information from topic five mixed in. The last topic is not recovered at all; instead, it is a repetition of the first topic. This shows the drawback of the traditional method: when the data are imbalanced, the model creates redundant topics to refine the likelihood of the dense data but ignores the scarce data, even when they carry important information. In Figure 3 (c), we see that all topics are correctly recovered thanks to the balanced dataset.
5.1.2 R8 News Data Experiment
We also evaluate the effect of DMSVI on the Reuters news R8 dataset [3]. This dataset contains eight classes of news with an extremely imbalanced number of documents per class, as shown in Figure 4 (a). To measure similarities between documents, we represent each document with a vector $v_i$ of the tf-idf [37] scores of each word in the document. We then define an annealed linear kernel $L_{ij} = (v_i^\top v_j)^p$ with parameter $p < 1$, which is more sensitive to small feature overlap. We run LDA with SVI and DMSVI with one effective pass through the data, with fixed minibatch size and number of topics. We first compare the frequencies at which documents with particular labels were subsampled. Figure 4 shows the actual frequency of these classes in the original dataset compared with their frequency over the balanced dataset (a collection of minibatches sampled using the $k$-DPP). We can see that the number of documents is more balanced among the different classes.
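As a sketch, one reading of this kernel construction (the tf-idf vectors and the exponent $p$ below are assumptions for illustration, not values from the paper):

```python
import numpy as np

# Annealed linear kernel on (made-up) tf-idf document vectors:
# L_ij = (v_i . v_j)^p with 0 < p < 1, which amplifies small overlaps.
V = np.array([[0.8, 0.6, 0.0],   # three hypothetical documents, three terms
              [0.7, 0.7, 0.1],
              [0.0, 0.1, 1.0]])
V = V / np.linalg.norm(V, axis=1, keepdims=True)   # unit-normalize rows
p = 0.5
lin = V @ V.T
L = np.clip(lin, 0.0, None) ** p
```

Relative to the plain linear kernel, the annealing raises the weakly overlapping pair (0, 2) much closer to the strongly overlapping pair (0, 1), making the sampler more sensitive to small shared vocabulary.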
To demonstrate that DMSVI leads to a more useful topic representation, we classify each document in the test set based on the learned topic proportions with a linear SVM. The global variable (the per-topic word distribution) is only trained on the training set. The resulting confusion matrices for traditional SVI and DMSVI are shown in Figure 5. DMSVI improves both the average accuracy over the 8 classes and the total accuracy (the number of correctly classified documents over the number of test documents). Thus the overall classification performance is improved using DMSVI features, and especially the performance on the classes with few documents (such as "grain" and "ship") improves significantly.
We also visualize the first two principal components (PCs) of the global topics in Figure 6. In traditional SVI, many topics are redundant and share large parts of their vocabulary, resulting in a single dense cluster. In contrast, we see that the topics in DMSVI are more spread out. In this regard, DMSVI achieves a similar effect as the diversity priors of [20], without the need to grow the prior with the data. The top words from each topic are shown in the appendix, where we present more evidence that the topics learned by DMSVI are more diverse.
Size           k = 10   k = 30   k = 50   k = 80
Relative cost
The relative cost of sampling per iteration for LDA is shown in Table 1. Because every local update is expensive, the relative overhead of minibatch sampling is small. More details are given in the appendix.
5.2 Multiclass Logistic Regression
In this section, we demonstrate DMSGD on a finegrained classification task. The Oxford 102 flower dataset [31, 41] is used here for evaluation.
Many datasets in computer vision are balanced, even though the raw collected data are extremely imbalanced. The reason is that the performance of machine learning models usually suffers from imbalanced training data. One example is the Oxford 102 flower dataset, which contains 1020 images in the training set, with 10 images per class. The test set, however, contains 6149 images and is highly imbalanced. In this experiment, we make the learning task harder: we use the original test set for training and the original training set for testing. This setting reflects the real-life scenario where we can only collect biased data but wish the model to perform well in all situations.
Figure 8 shows the test accuracy as a function of training epochs on the Oxford 102 multi-class classification task. We show DMSGD for different values of the weighting factor $w$, with large $w$ approaching biased stratified sampling (see Eq. 17 and the discussion below). Each panel indicates the batch size and the three best-performing values of $w$; 'Rand' indicates regular SGD sampling. We list the final test accuracy after convergence, where "Best" indicates the best performance within our DMSGD experiments and "Baseline" indicates regular SGD as our baseline; DMSGD improves over this baseline.
Off-the-shelf CNN features [41] are used in this experiment. A pretrained VGG-16 network [42] is used for feature extraction. We use the first fully connected layer (fc1) as features, since [5] shows that this layer is most robust.
The similarity kernel of the $k$-DPP was constructed as follows. We chose a linear kernel $L_{ij} = \phi_i^\top \phi_j$, where $\phi_i$ is a weighted concatenation of the fc1 features $x_i$ and a one-hot vector representation $y_i$ of the class label,

$\phi_i = [\, x_i,\; w\, y_i \,]$.  (17)
This kernel construction enables the population to be balanced both among classes and within classes. When $w$ is large, the algorithm focuses more on the class labels. When $w$ is small, balancing is performed mostly based on the features. The weighting factor $w$ is a free parameter. As $w \to \infty$ results in biased stratified sampling (see Proposition 1), this baseline is naturally captured in our approach.
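A sketch of this kernel construction (the feature dimension, number of classes, and the value of $w$ below are made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Eq. 17 sketch: phi_i concatenates a (hypothetical) fc1 feature vector
# with a w-weighted one-hot label; L = Phi Phi^T is the linear kernel.
n_classes, d = 3, 5
features = rng.normal(size=(6, d))       # stand-in fc1 features
labels = np.array([0, 0, 1, 1, 2, 2])
w = 2.0

onehot = np.eye(n_classes)[labels]
Phi = np.concatenate([features, w * onehot], axis=1)
L = Phi @ Phi.T
```

For large $w$, the label block dominates and $L$ approaches (a scaled version of) the block kernel of Proposition 1, recovering biased stratified sampling; for small $w$, the feature similarities dominate.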
In this setting, the class label is a natural criterion to divide the data into strata. One can then resample the same amount of data from each stratum in order to rebalance the dataset. Such a mechanism constrains the minibatch size to be a multiple of the number of classes/strata. As proved in Section 4, in the limit of large $w$ and with $k$ equal to the number of classes, DMSGD is equivalent to this type of (biased) stratified sampling.
Figure 7 shows the percentage of data in each class for the original dataset and for the balanced dataset. It shows that with larger λ, the dataset is more balanced among classes. More examples are shown in the supplementary material.
We demonstrate this application with a standard linear softmax classifier for multi-class classification. In our case, the inputs are the off-the-shelf CNN fc1 features. We can also view this procedure as fine-tuning a neural network.
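The classifier itself is standard; one SGD step for a linear softmax model can be sketched as follows (an illustrative numpy sketch, not the paper's code):

```python
import numpy as np

def softmax_sgd_step(W, X, y, lr):
    """One SGD step for a linear softmax classifier on a minibatch.

    W: (D, C) weight matrix; X: (k, D) minibatch features; y: (k,) labels.
    Returns the updated weights after one cross-entropy gradient step.
    """
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)            # softmax probabilities
    p[np.arange(len(y)), y] -= 1.0               # dLoss/dlogits = p - onehot(y)
    grad = X.T @ p / len(y)                      # average over the minibatch
    return W - lr * grad
```

In DMSGD, only the way the minibatch `X, y` is drawn changes; the gradient step itself is untouched.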
Figure 8 shows how the test accuracy changes over the training epochs. We compare DMSGD with different weighting factors against random sampling. The learning rate schedule is kept the same across experiments. Different minibatch sizes are used, as shown in the caption of each panel in the figure. We can see that with DMSGD, we reach high model performance more rapidly. Additionally, for a classification task, balancing the data with respect to classes is important, since performance is generally better for larger λ. On the other hand, the feature information is essential as well, since the best performance is mostly obtained at finite values of λ. Comparing these plots, we see that the performance benefits are larger when the minibatch size is comparably small. Small minibatches are generally preferred due to their low cost, and our method maximizes the usage of small minibatches.
5.3 CNN Classification on MNIST
Finally, we show the performance of our method in a scenario where the dataset is balanced, which is a less favorable scenario for DMSGD. Here we consider the MNIST dataset [21], which contains approximately the same number of examples per handwritten digit.
Since our method is independent of the model, we can use any low-level data statistics. Here, we demonstrate DMSGD with raw data features and apply it to training a CNN. We construct the similarity kernel using an RBF kernel. As low-level features, we directly use the normalized raw pixel values. To encode both feature and label information, we again compute the similarity matrix from a weighted concatenation of the features with a one-hot label vector, with the weighting factor fixed for this experiment. We use half of the training data from MNIST to train a 5-layer CNN as in [2]. Figure 9 shows the test accuracy over the iterations for different minibatch sizes. We can see that even if the data are balanced, DMSGD still performs better than random sampling due to its variance reduction property.
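The RBF kernel construction can be sketched similarly (a hedged sketch; the bandwidth and the exact feature scaling are our choices for illustration, not specified above):

```python
import numpy as np

def rbf_dpp_kernel(pixels, labels, num_classes, lam, bandwidth=1.0):
    """RBF similarity kernel over normalized raw pixel vectors concatenated
    with lambda-weighted one-hot labels (illustrative sketch)."""
    one_hot = np.eye(num_classes)[labels]
    phi = np.hstack([pixels, lam * one_hot])     # weighted concatenation
    sq = np.sum(phi ** 2, axis=1)
    # Pairwise squared distances, clipped to avoid tiny negative values.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * phi @ phi.T, 0.0)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))  # K_ij in (0, 1]
```

Unlike the linear kernel, the RBF kernel is bounded in (0, 1] with unit diagonal, so no further normalization of the similarity matrix is needed.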
6 Conclusion
We proposed a diversified minibatch sampling scheme based on determinantal point processes. Our method, DMSGD, builds on a similarity matrix between the data points and suppresses the co-occurrence of similar data points in the same minibatch. This leads to a training outcome which generalizes better to unseen data. We also derived sufficient conditions under which the method reduces the variance of the stochastic gradient, leading to faster learning. We showed that our approach generalizes both stratified sampling and pre-clustering. In the future, we will explore the possibility of further improving the efficiency of the algorithm with data reweighting [28] and tackling imbalanced learning problems involving different modalities in supervised [47] and multimodal [15] settings.
References
[1] Image classification with ImageNet. https://github.com/soumith/imagenetmultiGPU.torch/blob/master/dataset.lua.
[2] Multilayer convolutional network. https://www.tensorflow.org/get_started/mnist/pros.
[3] R8 dataset. http://csmining.org/index.php/r52andr8ofreuters21578.html.
[4] R. H. Affandi, A. Kulesza, E. B. Fox, and B. Taskar. Nyström approximation for large-scale determinantal processes. In AISTATS, pages 85–98, 2013.
 [5] H. Azizpour, A. Sharif Razavian, J. Sullivan, A. Maki, and S. Carlsson. From generic to specific deep representations for visual recognition. In CVPR WS, pages 36–45, 2015.
 [6] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. JMLR, 3:993–1022, 2003.
[7] L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, pages 177–186, 2010.
 [8] D. Csiba and P. Richtarik. Importance sampling for minibatches. arXiv:1602.02283, 2016.
 [9] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12(Jul):2121–2159, 2011.
[10] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, pages 23–37. Springer, 1995.
[11] T. F. Fu and Z. H. Zhang. CPSG-MCMC: Clustering-based preprocessing method for stochastic gradient MCMC. In AISTATS, 2017.
 [12] H. He and E. A. Garcia. Learning from imbalanced data. TKDE, 21(9):1263–1284, 2009.
 [13] M. Hoffman, F. R. Bach, and D. M. Blei. Online learning for latent dirichlet allocation. In NIPS, pages 856–864, 2010.
 [14] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
 [15] A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache. Learning visual features from large weakly supervised data. In ECCV, 2016.
 [16] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
 [17] J. F. C. Kingman. Poisson processes. Wiley Online Library, 1993.
[18] A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In ICML, pages 1193–1200, 2011.
 [19] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. arXiv:1207.6083, 2012.
 [20] J. T. Kwok and R. P. Adams. Priors for diversity in generative latent variable models. In NIPS, pages 2996–3004, 2012.
 [21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [22] D. Lee, G. Cha, M. H. Yang, and S. Oh. Individualness and determinantal point processes for pedestrian detection. In ECCV, pages 330–346, 2016.
[23] C. T. Li, S. Jegelka, and S. Sra. Fast DPP sampling for Nyström with application to kernel methods. arXiv:1603.06052, 2016.
[24] C. T. Li, S. Jegelka, and S. Sra. Efficient sampling for k-determinantal point processes. arXiv:1509.01618, 2015.
[25] O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 1975.
 [26] S. Mandt and D. M. Blei. Smoothed gradients for stochastic variational inference. In NIPS, pages 2438–2446, 2014.
 [27] S. Mandt, M. D. Hoffman, and D. M. Blei. Stochastic gradient descent as approximate Bayesian inference. arXiv:1704.04289, 2017.
 [28] S. Mandt, J. McInerney, F. Abrol, R. Ranganath, and D. M. Blei. Variational Tempering. In AISTATS, pages 704–712, 2016.
 [29] M. D. McKay, R. J. Beckman, and W. J. Conover. Comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2):239–245, 1979.
 [30] J. Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97(4):558–625, 1934.
 [31] M. E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In ICVGIP, 2008.
 [32] J. Paisley, D. Blei, and M. Jordan. Variational bayesian inference with stochastic search. arXiv:1206.6430, 2012.
 [33] D. Perekrestenko, V. Cevher, and M. Jaggi. Faster coordinate descent via adaptive importance sampling. In AISTATS, 2017.
 [34] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
 [35] R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In AISTATS, pages 814–822, 2014.
 [36] H. Robbins and D. Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Herbert Robbins Selected Papers, pages 111–135. Springer, 1985.
 [37] S. Robertson. Understanding inverse document frequency: on theoretical arguments for IDF. Journal of documentation, 60(5):503–520, 2004.
[38] T. Salimans and D. A. Knowles. On using control variates with stochastic approximation for variational Bayes and its connection to stochastic linear regression. arXiv:1401.1022, 2014.
 [39] M. Schmidt, R. Babanezhad, M. O. Ahmed, A. Defazio, A. Clifton, and A. Sarkar. Nonuniform stochastic average gradient method for training conditional random fields. In AISTATS, 2015.
 [40] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2013.
[41] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR WS, pages 806–813, 2014.
[42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
[43] T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
[44] C. Wang, X. Chen, A. J. Smola, and E. P. Xing. Variance reduction for stochastic gradient optimization. In NIPS, pages 181–189, 2013.
[45] P. T. Xie, Y. T. Deng, and E. Xing. Diversifying restricted Boltzmann machine for document modeling. In ACM SIGKDD, pages 1315–1324. ACM, 2015.
[46] M. D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv:1212.5701, 2012.
 [47] C. Zhang and H. Kjellström. How to Supervise Topic Models. In ECCV WS, 2014.
[48] P. L. Zhao and T. Zhang. Accelerating minibatch stochastic gradient descent using stratified sampling. arXiv:1405.3080, 2014.
 [49] P.L. Zhao and T. Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In ICML, pages 1–9, 2015.
Appendix A Supplement
Algorithm 3 shows the details of how to sample a minibatch using a k-DPP [19], which is used for the DMSGD and DMSVI algorithms in the paper.
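Algorithm 3 itself is not reproduced here, but the standard k-DPP sampling procedure of [18, 19] that it builds on can be sketched as follows (an illustrative numpy implementation assuming the PSD kernel L has rank at least k; not the paper's exact code):

```python
import numpy as np

def sample_kdpp(L, k, seed=None):
    """Sample a size-k subset from a k-DPP with PSD kernel L (rank >= k)."""
    rng = np.random.default_rng(seed)
    vals, vecs = np.linalg.eigh(L)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    N = len(vals)

    # Elementary symmetric polynomials: E[l, n] = e_l(vals[0], ..., vals[n-1]).
    E = np.zeros((k + 1, N + 1))
    E[0, :] = 1.0
    for l in range(1, k + 1):
        for n in range(1, N + 1):
            E[l, n] = E[l, n - 1] + vals[n - 1] * E[l - 1, n - 1]

    # Phase 1: select k eigenvectors with the correct marginal probabilities.
    keep, l = [], k
    for n in range(N, 0, -1):
        if l == 0:
            break
        if rng.random() < vals[n - 1] * E[l - 1, n - 1] / E[l, n]:
            keep.append(n - 1)
            l -= 1
    V = vecs[:, keep]

    # Phase 2: sample one item per dimension of the selected subspace.
    items = []
    while V.shape[1] > 0:
        p = np.sum(V ** 2, axis=1)
        i = rng.choice(N, p=p / p.sum())
        items.append(int(i))
        # Zero out coordinate i within span(V), then re-orthonormalize.
        j = int(np.argmax(np.abs(V[i, :])))
        V = V - np.outer(V[:, j] / V[i, j], V[i, :])
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)
    return items
```

The eigendecomposition dominates the cost at O(N³); the fast sampler of [23] referenced in the paper avoids this via Nyström approximation.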
Tables 2 and 3 show the top words per topic for LDA using traditional SVI and our proposed DMSVI, respectively. We can see that the topics learned by DMSVI are more diverse, and rare topics such as grain (colored in blue) are captured.
Figure 10 shows the synthetic data that are used in the LDA experiment. Each row represents a document and each column represents a word.
Table 2: Top words per topic for LDA trained with traditional SVI.
Topic 1  pct shares stake and group investment securities stock commission firm 
Topic 2  year pct and for last lower growth debt profits company 
Topic 3  and merger for will approval companies corp acquire into letter 
Topic 4  and for canadian company management pacific bid southern court units 
Topic 5  baker official and that treasury western policy administration study budget 
Topic 6  and president for executive chief shares plc company chairman cyclops 
Topic 7  bank pct banks rate rates money interest and reuter today 
Topic 8  and unit inc sale sell reuter company systems corp terms 
Topic 9  mln stg and reuter months year for plc market pretax 
Topic 10  and national loan federal savings reuter association insurance estate real 
Topic 11  trade and for bill not united imports that surplus south 
Topic 12  and february for china january gulf issue month that last 
Topic 13  market dollar that had and will exchange system currency west 
Topic 14  dlrs quarter share for company earnings year per and fiscal 
Topic 15  billion mln tax year profit credit marks francs net pct 
Topic 16  usair inc twa reuter trust air department chemical diluted piedmont 
Topic 17  and will union spokesman not two that reuter security port 
Topic 18  offer share tender shares that general and gencorp dlrs not 
Topic 19  and company for that board proposal group made directors proposed 
Topic 20  that japan japanese and world industry government for told officials 
Topic 21  american analysts and that analyst chrysler shearson express stock not 
Topic 22  loss profit mln reuter cts net shr dlrs qtr year 
Topic 23  mln dlrs and assets for dlr operations year charge reuter 
Topic 24  mln net cts shr revs dlrs qtr year oper reuter 
Topic 25  cts april reuter div pay prior record qtly march sets 
Topic 26  dividend stock split for two reuter march payable record april 
Topic 27  oil and prices crude for energy opec petroleum production bpd 
Topic 28  agreement for development and years program technology reuter conditions agreed 
Topic 29  and foreign that talks for international industrial exchange not since 
Topic 30  corp inc acquisition will company common shares reuter stock purchase 
Table 3: Top words per topic for LDA trained with DMSVI.
Topic 1  oil and that prices for petroleum dlrs energy crude field 
Topic 2  pct and that rate market banks term rates this will 
Topic 3  billion and pct mln group marks sales year capital rose 
Topic 4  and saudi oil gulf that arabia december minister prices for 
Topic 5  and dlrs debt for brazil southern mln will medical had 
Topic 6  and grain that will futures for program farm certificates agriculture 
Topic 7  bank banks rate and pct interest rates for foreign banking 
Topic 8  and union for national seamen california port security that strike 
Topic 9  and trade that for dollar deficit gatt not exports economic 
Topic 10  and financial for sale inc services reuter systems agreement assets 
Topic 11  dollar and for yen mark march that dealers sterling market 
Topic 12  and for south unit equipment reuter two will state corp 
Topic 13  and firm stock company will for pct not share that 
Topic 14  and world that talks economic official for countries system monetary 
Topic 15  and gencorp for offer general company partners that dlrs share 
Topic 16  mln canada canadian stg and pct will air that royal 
Topic 17  usair and twa that analysts not for pct analyst piedmont 
Topic 18  and that for companies not years study this areas overseas 
Topic 19  trade and bill for house that reagan foreign states committee 
Topic 20  company dlrs offer stock and for corp share shares mln 
Topic 21  dlrs year and quarter company for earnings will tax share 
Topic 22  mln cts net loss dlrs profit reuter shr year qtr 
Topic 23  exchange paris and rates that treasury baker allied for western 
Topic 24  and shares inc for group dlrs pct offer reuter share 
Topic 25  merger and that pacific texas hughes baker commerce for company 
Topic 26  and american company subsidiary china french reuter pct for owned 
Topic 27  japan japanese and that trade officials for government industry pact 
Topic 28  oil opec mln bpd prices production ecuador and output crude 
Topic 29  and that had shares block for mln government not san 
Topic 30  mln pct and profits dlrs year for billion company will 
The sampling time in seconds on the R8 dataset is listed in Table 4. There are 5485 training documents. The first rows of the table show the sampling time for different minibatch sizes k and different versions of kDPP sampling: in practice, we use the original fast sampling implementation from [23], and for comparison we list the elapsed time of traditional kDPP sampling [19]. The last row shows the running time per local LDA update, excluding sampling.
Table 4: Sampling time in seconds on the R8 dataset.
Size  k = 10  k = 30  k = 50  k = 80 
Fast kDPP  0.001  0.0139  0.0541  0.2199 
kDPP  0.0098  0.1468  0.6438  2.6698 
LDA  0.8777  1.2530  1.6414  2.2312 
The computational time for training a neural network depends highly on the network structure and implementation details. For example, when using only one softmax layer as in the flower experiment, the cost per gradient step is in the milliseconds. In this setup, kDPP sampling is not effective from a runtime perspective, but still results in better final classification accuracies. However, the cost of each gradient step for a simple 5-layer NN as in the MNIST experiment with k = 100 is 1.294 seconds. In the latter case, this time is comparable to kDPP sampling (0.7941 sec); see Table 5. We thus expect our method to benefit expensive models and imbalanced training datasets the most.
Table 5: Sampling time and per-gradient-step cost in seconds for the MNIST experiment.
Size  k = 10  k = 100  k = 200 
Fast kDPP  0.0012  0.7941  5.4216 
NN cost  0.166948  1.29452  2.64811 
Figure 11 shows bar plots of the frequency of images in each class of the Oxford Flower dataset, using the number of classes as the minibatch size. With this setting, we can see that as λ → ∞, DMSGD becomes equivalent to biased stratified sampling (StS).