Stochastic gradient descent (SGD) is one of the most important algorithms for scalable machine learning [7, 36, 27]. SGD optimizes an objective function by successively following noisy estimates of its gradient based on mini-batches from a large underlying dataset. We usually ensure that this gradient is unbiased, meaning that the expected stochastic gradient equals the true gradient. When combined with a suitably decreasing learning rate schedule, the algorithm converges to a local optimum of the objective.
Often we are not interested in an unbiased estimator of the gradient, but are rather willing to introduce some bias. There are many reasons why this might be the case. First, biased SGD schemes such as momentum, iterate averaging, or preconditioning [9, 16, 43, 46] may reduce the stochastic gradient noise or ease the optimization problem, and therefore often lead to faster convergence. Another reason is that we may decide to actively select samples based on their relevance or difficulty, as in boosting, or because we believe that our dataset is in some respect imbalanced. In this paper, we propose and investigate a biased mini-batch subsampling scheme for imbalanced data.
Real-world data sets are naturally imbalanced. For instance, the sports topic appears more often in the news than biology; the internet contains more images of young people than of senior people, and YouTube has more videos of cats than of bees or ants. Because they aim to maximize the probability of generating such training data, machine learning models tend to refine dominant, redundant information while ignoring important but scarce data. For example, a model trained on YouTube data might be very sensitive to different cats but unable to recognize ants. We may therefore decide to learn on a more balanced data set by actively selecting diversified mini-batches.
Currently, the most common tool for mini-batch diversification is stratified sampling [30, 48]. In this approach, one groups the data into a finite set of strata based on discrete or continuous features such as a label or cluster assignment. To re-balance the data set, the data can then be subsampled such that each stratum occurs with equal probability in the mini-batch (in the following, we refer to this method as biased stratified sampling). Unfortunately, the data are not always amenable to biased stratified sampling because discrete features may not exist, or the data may not be unambiguously clustered. Instead of subsampling based on discrete strata, it would be desirable to diversify the mini-batch based on a soft similarity measure between data points. As we show in this paper, this can be achieved using Determinantal Point Processes (DPPs).
The DPP is a point process which mimics repulsive interactions between samples. Being based on a similarity matrix between the data points, a draw from a DPP yields diversified subsets of the data. The main contribution of this paper is using this mechanism to diversify the mini-batches in stochastic gradient-based learning and analyzing this setup theoretically. In more detail, our main achievements are:
We present a mini-batch diversification scheme based on DPPs for stochastic gradient algorithms. This approach requires a similarity measure among data points, which can be constructed using low-level features of the data. Since the sampling strategy is independent of the learning objective, diversified mini-batches can be precomputed in parallel and reused for different learning tasks. Our approach applies to both supervised and unsupervised models.
We prove that our method is a generalization of stratified sampling and i.i.d. mini-batch sampling. Both cases emerge for specific similarity kernels of the data.
We theoretically analyze the conditions under which the variance of the DM-SGD gradient gets reduced. We also give an unbiased version of DM-SGD which optimizes the original objective without re-balancing the data.
We carry out extensive experiments on several models and datasets. Our approach leads to faster learning and higher classification accuracies in deep supervised learning. For topic models we find that the resulting document features are more interpretable and are better suited for subsequent supervised learning tasks.
Our paper is structured as follows. In Section 2 we list related work. Section 3 introduces our main concept of a diversified risk and presents the DM-SGD method. Section 4 discusses theoretical properties of our approach such as variance reduction. Finally, in Section 5, we give empirical evidence that our approach leads to higher classification accuracy and better feature extraction than i.i.d. sampling.
2 Related Work
We revisit the most relevant prior work based on the following aspects. Diversification and Stratification comprises methods which aim at re-balancing the empirical distribution of the data. Variance reduction summarizes stochastic gradient methods that aim at faster convergence by reducing the stochastic gradient noise. Finally, we list related applications and extensions of determinantal point processes.
Diversification and stratification.
Since our method diversifies the mini-batches by means of non-uniform subsampling from the data, it relates to stratification methods.
Stratification [30, 29] assumes that the data decomposes into disjoint sub-datasets, called strata. These are formed based on certain criteria such as a class-label. Instead of uniformly sampling from the whole dataset, each stratum is sub-sampled independently, which reduces the variance of the estimator of interest.
Stratified sampling has been suggested as a variance reduction method for stochastic gradient algorithms [11, 48]. If one subsamples the same number of data points from every stratum to form a mini-batch as in , one naturally balances the training procedure. This approach was also used in . Our work relates closely to this type of biased stratified sampling. It is different in that it does not rely on discrete strata, but only requires a measure of similarity between data points to achieve a similar effect. It therefore applies more broadly.
Variance reduction.
Besides re-balancing the dataset, our approach also reduces the variance of the stochastic gradients. Several ways of reducing the variance of stochastic gradient algorithms have been proposed, an important class relying on control variates [26, 32, 35, 44, 38]. A second class of methods relies on non-uniform sampling of mini-batches [8, 11, 33, 39, 48, 49]. None of these methods relies on similarity measures between data points.
Most closely related to our work are stratified sampling (StS) and the pre-clustering based method CBS. StS applies stratified sampling to SGD and builds on pre-specified strata. For every stratum, the same number of data points are uniformly selected, and then re-weighted according to the size of the stratum to make the sampling scheme unbiased. CBS uses a similar strategy, but does not require a pre-specified set of strata. Instead, the strata are formed by pre-clustering the raw data with k-means. (Thus, if the data are clustered based on a class label, CBS is identical to StS.) These methods have two drawbacks. First, the data are not always amenable to clustering. Second, both StS and CBS ignore the within-cluster variations between data points. In contrast, our approach relies on a continuous measure of similarity between samples. We furthermore show that it is a strict generalization of both setups for particular choices of similarity kernels.
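The two stratified schemes described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the implementation used in the cited work: the helper names `stratified_minibatch` and `unbiased_weights`, the integer stratum labels, and the toy stratum sizes are all hypothetical.

```python
import numpy as np

def stratified_minibatch(strata, per_stratum, rng):
    """Draw `per_stratum` indices uniformly (without replacement) from every stratum."""
    batch = []
    for s in np.unique(strata):
        members = np.flatnonzero(strata == s)
        batch.extend(rng.choice(members, size=per_stratum, replace=False))
    return np.array(batch)

def unbiased_weights(strata, batch, per_stratum):
    """Per-sample weights N_s / (N * m) that undo the re-balancing.

    With these weights, the weighted mini-batch sum is an unbiased
    estimate of the plain dataset mean.
    """
    labels, counts = np.unique(strata, return_counts=True)
    size_of = dict(zip(labels.tolist(), counts.tolist()))
    n = len(strata)
    return np.array([size_of[int(strata[i])] / (n * per_stratum) for i in batch])

rng = np.random.default_rng(0)
strata = np.array([0] * 90 + [1] * 9 + [2] * 3)  # heavily imbalanced toy strata
batch = stratified_minibatch(strata, per_stratum=2, rng=rng)
weights = unbiased_weights(strata, batch, per_stratum=2)
print(strata[batch], weights)
```

Using the batch with equal weights corresponds to the biased (re-balancing) variant, while applying `weights` corresponds to the unbiased variant.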
Determinantal point processes.
The DPP [19, 25] has been proposed [20, 22, 45] and advanced [4, 23, 24] in the machine learning community in recent years. It has been applied in subset sampling [18, 24] and results filtering .
The DPP has also been used as a diversity-enhancing prior in Bayesian models [20, 45]. In big data setups, the data may overwhelm the prior, such that the strength of the prior has to scale with the number of data points, which introduces a bias. The approach is furthermore constrained to hierarchical Bayesian models, while our approach applies to all empirical risk minimization problems.
Recently, efficient algorithms have been proposed to make sampling using the DPP more scalable. In the traditional formulation, the cost of drawing a mini-batch depends on the size $N$ of the data and the size $k$ of the mini-batch, on top of an initial fixed cost of $O(N^3)$ for diagonalizing the $N \times N$ similarity matrix. Recent scalable versions of the DPP rely on core-sets and low-rank approximations and scale more favorably [4, 24]. These versions were used in our large-scale experiments.
Our method, DM-SGD, uses a version of the DPP for mini-batch sampling in stochastic gradient descent. We show that this balances the underlying data distribution and simultaneously accelerates convergence due to variance reduction. We first briefly revisit the DPP, and then introduce our mini-batch diversification method. Theoretical aspects are then discussed in Section 4.
3.1 Determinantal Point Processes
A point process is a collection of points randomly located in some mathematical space. The most prominent example is the Poisson process on the real line, which models independently occurring events. In contrast, the DPP [19, 25] models repulsive correlations between these points.
In this paper, we restrict ourselves to a finite set of $N$ points. Denote by $L \in \mathbb{R}^{N \times N}$ a similarity kernel matrix between these points, e.g. based on spatial distances or some other criterion. $L$ is real, symmetric and positive definite, and its elements $L_{ij}$ are some appropriately defined measure of similarity between the $i$-th and $j$-th data points. The DPP assigns a probability to subsampling any subset $Y$ of $\{1, \dots, N\}$, which is proportional to the determinant of the sub-matrix $L_Y$ of $L$ which indexes the subset,
$$P(Y) = \frac{\det(L_Y)}{\det(L + I)} \propto \det(L_Y).$$
For instance, if $Y = \{i, j\}$ consists of only two elements, then $P(Y) \propto L_{ii} L_{jj} - L_{ij} L_{ji}$. Because $L_{ij}$ and $L_{ji}$ measure the similarity between elements $i$ and $j$, being more similar lowers the probability of co-occurrence. On the other hand, when the subset is very diverse, the determinant is bigger and correspondingly its co-occurrence is more likely. The DPP thus naturally diversifies the selection of subsets.
In this paper, we propose to use the DPP to diversify mini-batches. In practice, the mini-batch size is usually constrained by empirical bounds or hardware restrictions. In this case, we want to use the DPP conditioned on a given size $k$. Therefore, a slightly modified version of the DPP is needed, which is called the $k$-DPP. It assigns probabilities to subsets $Y$ of size $k$,
$$P_L^k(Y) = \frac{\det(L_Y)}{\sum_{|Y'| = k} \det(L_{Y'})}.$$
Apart from conditioning on the size of the subset of points, the $k$-DPP has the same diversification effect as the DPP. In order to have a fixed mini-batch size, we use the $k$-DPP in this work.
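To make the repulsion concrete, the following minimal sketch enumerates every subset of size $k$ of a toy dataset and normalizes the sub-determinants, which mirrors the $k$-DPP definition but is only feasible for very small $N$. The RBF kernel, the three toy points and the helper name `k_dpp_probabilities` are illustrative assumptions.

```python
import itertools
import numpy as np

def k_dpp_probabilities(L, k):
    """Return {subset: probability} for all size-k subsets under a k-DPP.

    Brute-force enumeration: P(Y) = det(L_Y) / sum over |Y'| = k of det(L_Y').
    """
    n = L.shape[0]
    subsets = list(itertools.combinations(range(n), k))
    dets = np.array([np.linalg.det(L[np.ix_(s, s)]) for s in subsets])
    return dict(zip(subsets, dets / dets.sum()))

# Toy data: points 0 and 1 are close together, point 2 is far away.
X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 0.0]])
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
L = np.exp(-0.5 * sq_dists)  # RBF similarity kernel (an illustrative choice)

probs = k_dpp_probabilities(L, k=2)
for subset, p in probs.items():
    print(subset, round(p, 4))
```

The pair of nearby points receives far less probability mass than either pair that contains the distant point, which is the diversification effect described above.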
3.2 Mini-Batch Diversification
The diversifying property of the $k$-DPP makes it well-suited to diversify mini-batches. We first discuss our learning objective: the diversified risk. We then introduce our algorithm and qualitatively discuss its properties.
Expected, empirical, and diversified risk.
Many problems in machine learning amount to minimizing some loss function $\ell(x, \theta)$ which depends both on a set of parameters $\theta$ and on data $x$. In probabilistic modeling, $\ell$ could be the negative logarithm of the likelihood of a probabilistic model, or the negative of a variational lower bound [6, 14]. We often assume that the data were generated as draws from some underlying unknown data-generating distribution $p_{\mathrm{data}}(x)$, also called the population distribution. To best generalize to unseen data, we would ideally like to minimize this function's expectation under $p_{\mathrm{data}}$,
$$J(\theta) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\ell(x, \theta)].$$
This objective function is also called the expected risk. Since $p_{\mathrm{data}}$ is unknown and we believe that our observed data $x_1, \dots, x_N$ are in some sense a representative draw from the population distribution, we can replace the expectation by an expectation over the empirical distribution of the data $\hat{p}_{\mathrm{data}}$, which leads to the empirical risk,
$$\hat{J}(\theta) = \mathbb{E}_{x \sim \hat{p}_{\mathrm{data}}}[\ell(x, \theta)] = \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, \theta).$$
A typical goal in machine learning is not to minimize the empirical risk with high accuracy, but to learn model parameters that generalize well to unseen data. For every data point in a test set, we wish our model to have high predictive accuracy. If this test set is more balanced than the training set (for instance, because it contains all classes to equal proportions in a classification setup), we would naturally like to train our model on a more balanced training set than the original one without throwing away data. In this work, we present a systematic way to achieve this goal based on biased subsampling of the training data. We term the collection of all samples generated from biased subsampling the balanced dataset.
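To this end, we introduce the diversified risk, where we average the loss function over diversified mini-batches of size $k$,
$$\hat{J}_{\mathrm{div}}(\theta) = \mathbb{E}_{B \sim k\text{-DPP}}\Big[\frac{1}{k} \sum_{i \in B} \ell(x_i, \theta)\Big], \tag{5}$$
where $B$ is a set of $k$ data indices and $\ell(x_i, \theta)$ is the loss at data point $x_i$.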
Due to the repulsive nature of the $k$-DPP, similar data points are less likely to co-occur in the same draw. Thus, data points which are very different from the rest are more likely to be sampled and obtain a higher weight, as illustrated in Figure 2 (e).
The diversified risk depends both on the mini-batch size and on the similarity kernel of the data. A more theoretical analysis of the diversified risk is carried out in Section 4.
Our proposed algorithm directly optimizes the diversified risk in Eq. 5. To this end, we propose SGD updates on diversified mini-batches of fixed size $k$,
$$\theta_{t+1} = \theta_t - \rho_t \frac{1}{k} \sum_{i \in B} \nabla_\theta \ell(x_i, \theta_t).$$
Above, $B$ is a collection of $k$ indices, drawn from the $k$-DPP. In every stochastic gradient step, we thus sample a mini-batch from the $k$-DPP and carry out an update with decreasing learning rate $\rho_t$.
Sampling from the $k$-DPP first requires an eigendecomposition of its kernel. This decomposition can also be approximated and has to be computed only once per dataset. Drawing a sample then has a computational cost that is linear in the dataset size and polynomial in the mini-batch size $k$, which is much cheaper than the eigendecomposition since $k$ is commonly small. This approach is briefly summarized in Algorithm 1; details on the sampling procedure are given in the supplementary material. For more details, we refer to  and to [4, 24] for more efficient sampling procedures.
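The two-stage procedure described above (one eigendecomposition, then cheap repeated draws) can be sketched as follows. This is a hedged sketch of the standard two-phase $k$-DPP sampler from the DPP literature, not necessarily the exact routine behind Algorithm 1. The function names, the RBF toy kernel and the assumption that the kernel has rank at least $k$ are ours.

```python
import numpy as np

def elementary_symmetric(lam, k):
    """E[l, n] is the l-th elementary symmetric polynomial of lam[:n]."""
    n = len(lam)
    E = np.zeros((k + 1, n + 1))
    E[0, :] = 1.0
    for l in range(1, k + 1):
        for j in range(1, n + 1):
            E[l, j] = E[l, j - 1] + lam[j - 1] * E[l - 1, j - 1]
    return E

def sample_k_dpp(lam, V, k, rng):
    """Draw one size-k subset given the eigendecomposition L = V diag(lam) V^T.

    Assumes lam >= 0 and that at least k eigenvalues are strictly positive.
    """
    n = len(lam)
    E = elementary_symmetric(lam, k)
    # Phase 1: select exactly k eigenvectors.
    chosen, remaining = [], k
    for j in range(n, 0, -1):
        if remaining == 0:
            break
        if rng.random() < lam[j - 1] * E[remaining - 1, j - 1] / E[remaining, j]:
            chosen.append(j - 1)
            remaining -= 1
    B = V[:, chosen]  # orthonormal columns spanning the selected subspace
    # Phase 2: pick one item per step, then project that item out of the basis.
    items = []
    for _ in range(k):
        p = (B ** 2).sum(axis=1)
        i = rng.choice(n, p=p / p.sum())
        items.append(int(i))
        j = np.argmax(np.abs(B[i, :]))       # column with a nonzero i-th entry
        pivot = B[:, j].copy()
        B = np.delete(B, j, axis=1)
        B = B - np.outer(pivot, B[i, :] / pivot[i])  # zero out row i
        if B.shape[1] > 0:
            B, _ = np.linalg.qr(B)           # re-orthonormalize
    return sorted(items)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
L = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))  # toy RBF kernel
lam, V = np.linalg.eigh(L)           # computed once per dataset
lam = np.clip(lam, 0.0, None)        # guard against tiny negative round-off
batch = sample_k_dpp(lam, V, k=8, rng=rng)
print(batch)
```

Because `lam` and `V` depend only on the kernel, the eigendecomposition can be cached and `sample_k_dpp` called once per SGD step, or mini-batch indices can be pre-sampled before training.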
Variance reduction and connections to biased stratified sampling.
Dividing the data into different strata and sampling data from each stratum with adjusted probabilities may reduce the variance of SGD. This insight forms the basis of stratified sampling and of the related pre-clustering based method. As we will demonstrate rigorously in the next section, our approach also enjoys variance reduction but does not require an artificial partition of the data into clusters.
For many models, the gradient varies smoothly as a function of the data. Subsampling data from diversified regions in data space will therefore decorrelate the gradient contributions. This, in turn, may reduce the variance of the stochastic gradient. To some degree, methods such as biased stratified sampling or pre-clustering sample data from diversified regions, but ignore the fact that gradients within clusters may still be highly correlated. If the data are not amenable to clustering, this variance may be just as large as the inter-cluster variance. Our approach does not rely on the notion of clusters. Instead, we have a continuous measure of similarity between samples, given by the similarity kernel. This applies more broadly.
In Figure 2, we investigate how well our subsampling procedure using the $k$-DPP allows us to recover an original distribution of data from which we only observe an imbalanced subset. Panel (a) shows the original (uniform) distribution of data points, and (b) shows the observed data set which we use to re-estimate the original dataset. While biased stratified sampling (c) or pre-clustering based on k-means (d) need an artificial way of dividing the data into finitely many strata and re-balancing their corresponding weights, our approach (e) relies on a continuous similarity measure between data and takes into account both intra-strata and inter-strata variations.
Sampling from the $k$-DPP implies a computational overhead over classical SGD. Regarding the overall runtime, the benefits of the approach therefore come mainly into play in setups where each gradient update is expensive. One example is stochastic variational inference for models with local latent variables. For example, in LDA, the computational bottleneck is to update the per-document topic proportions. The time spent on sampling a mini-batch using the $k$-DPP is only a small fraction of the time needed to infer these local variables and estimate the gradient (see Table 1 in Section 5). Spending this small overhead on actively selecting training examples is well invested, as the resulting stochastic gradient has a lower variance.
Since the sampling procedure is independent of the learning algorithm, we can parallelize it or even draw the samples as a pre-processing step and reuse them for different hyperparameter settings. Moreover, there are approximate versions of $k$-DPP sampling which are scalable to big datasets [4, 23]. In this paper, we use the fast $k$-DPP in our large-scale experiments (Section 5.3).
4 Theoretical Considerations
In this section, we give the theoretical foundation of the DM-SGD scheme. We first prove that biased stratified sampling and pre-clustering emerge as special cases of our algorithm for particular choices of the kernel matrix $L$. We then prove that the diversified risk of DM-SGD is a re-weighted variant of the empirical risk, where the weights are given by the marginal likelihoods of the $k$-DPP (we also present an unbiased DM-SGD scheme which approximates the true gradients, but which performs less favorably in practice). Last, we investigate under which circumstances DM-SGD reduces the variance of the stochastic gradient.
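For what follows, let $b_i \in \{0, 1\}$ denote a variable which indicates whether the $i$-th data point was sampled under the $k$-DPP. Furthermore, let $\mathbb{E}[\cdot]$ always denote the expectation under the $k$-DPP. This lets us express the expectation of any quantity which depends additively on the data points as
$$\mathbb{E}\Big[\sum_{i \in B} f(x_i)\Big] = \mathbb{E}\Big[\sum_{i=1}^{N} b_i f(x_i)\Big] = \sum_{i=1}^{N} \mathbb{E}[b_i]\, f(x_i).$$
Next, we introduce shorthand notation for first and second moments. Denote the marginal probability for a point $x_i$ being sampled as
$$\bar{b}_i := \mathbb{E}[b_i],$$
which has an analytic form and can be computed efficiently. We also introduce the correlation matrix
$$C_{ij} := \mathbb{E}[b_i b_j] - \mathbb{E}[b_i]\, \mathbb{E}[b_j].$$
In contrast to minibatch SGD where $\mathbb{E}[b_i b_j] = \mathbb{E}[b_i]\, \mathbb{E}[b_j]$ for $i \neq j$ and hence $C_{ij} = 0$, this is no longer true under the $k$-DPP. Instead, the correlation can be both negative (when data points are similar) and even positive (when data points are very dissimilar).
Lastly, let $g(\theta)$ denote the gradient of the empirical risk, which is the batch gradient, and $g_i(\theta)$ its individual contributions, i.e. the gradient of the loss at data point $x_i$.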
We first prove that our algorithm captures two important limiting cases, namely (biased) stratified sampling and pre-clustering.
Biased stratified sampling (StS), where data from different strata are subsampled with equal probability, is equivalent to DM-SGD with a similarity matrix $L$, defined as a block-diagonal matrix with
$$L_{ij} = \begin{cases} 1 & \text{if } s_i = s_j, \\ 0 & \text{otherwise,} \end{cases}$$
where $s_i$ denotes the label for the stratum of data point $x_i$.
It is enough to show that a draw from the $k$-DPP which has multiple data points with the same stratum assignment has probability zero.
Let $Y = A \cup \bar{A}$, where $A$ collects the indices in $Y$ which come from the same stratum, and $\bar{A}$ is its disjoint complement. Because of the block structure of $L$, we have that
$$\det(L_Y) = \det(L_A)\, \det(L_{\bar{A}}).$$
However, $\det(L_A) = 0$ whenever $A$ contains more than one index, because $L_A$ is then a matrix of all ones. Therefore, $\det(L_Y) = 0$, and hence $Y$ has zero probability under the $k$-DPP. Therefore, every draw from the $k$-DPP with $L$ defined as above contains at most one data point from each stratum. When $k$ is the same as the number of classes, we recover StS. If $k$ is smaller than the number of classes, we provide a direct generalization of StS. ∎
Pre-clustering results as a special case of DM-SGD, with $L_{ij} = 1$ if the data points $x_i$ and $x_j$ are assigned to the same cluster, and $L_{ij} = 0$ otherwise.
It is furthermore simple to see that regular minibatch SGD results from DM-SGD when choosing the identity kernel.
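Both limiting cases can be checked numerically on a toy problem by enumerating all subsets. The sketch below is illustrative only: the stratum labels are hypothetical, and the brute-force helper `k_dpp_probabilities` is our own and scales only to very small $N$.

```python
import itertools
import numpy as np

def k_dpp_probabilities(L, k):
    """Brute-force k-DPP: normalize det(L_Y) over all size-k subsets Y."""
    n = L.shape[0]
    subsets = list(itertools.combinations(range(n), k))
    dets = np.array([np.linalg.det(L[np.ix_(s, s)]) for s in subsets])
    return dict(zip(subsets, dets / dets.sum()))

strata = np.array([0, 0, 0, 1, 1, 2])  # hypothetical stratum labels
num_strata = 3

# Block kernel: L_ij = 1 if i and j share a stratum, 0 otherwise.
L_block = (strata[:, None] == strata[None, :]).astype(float)
probs_block = k_dpp_probabilities(L_block, k=num_strata)
support = [s for s, p in probs_block.items() if p > 1e-12]
for s in support:
    # Every subset with nonzero probability has one point per stratum.
    print(s, strata[list(s)], round(probs_block[s], 4))

# Identity kernel: every size-k subset is equally likely,
# i.e. uniform sampling without replacement.
probs_identity = k_dpp_probabilities(np.eye(len(strata)), k=num_strata)
print(set(round(p, 6) for p in probs_identity.values()))
```

With the block kernel, only subsets containing exactly one point from each stratum survive, and they are equally likely, which matches biased stratified sampling with one sample per stratum.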
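The diversified risk (Eq. 5) can be expressed as a re-weighted empirical risk with the marginal $k$-DPP weights $\bar{b}_i = \mathbb{E}[b_i]$,
$$\hat{J}_{\mathrm{div}}(\theta) = \frac{1}{k} \sum_{i=1}^{N} \bar{b}_i\, \ell(x_i, \theta),$$
where $\ell(x_i, \theta)$ is the loss at data point $x_i$.
As $\bar{b}_i = k/N$ in the case of a trivial similarity kernel $L = I$, this quantity just becomes the empirical risk.
We employ the sampling indicators $b_i \in \{0, 1\}$ defined above:
$$\hat{J}_{\mathrm{div}}(\theta) = \frac{1}{k}\, \mathbb{E}\Big[\sum_{i \in B} \ell(x_i, \theta)\Big] = \frac{1}{k}\, \mathbb{E}\Big[\sum_{i=1}^{N} b_i\, \ell(x_i, \theta)\Big] = \frac{1}{k} \sum_{i=1}^{N} \bar{b}_i\, \ell(x_i, \theta).$$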
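The following corollary allows us to construct an unbiased stochastic gradient based on DM-SGD in case we are not interested in re-balancing the population.
The following SGD scheme leads to an unbiased stochastic gradient, provided that all marginal probabilities $\bar{b}_i = \mathbb{E}[b_i]$ are positive:
$$\theta_{t+1} = \theta_t - \rho_t \frac{1}{N} \sum_{i \in B} \frac{1}{\bar{b}_i} \nabla_\theta \ell(x_i, \theta_t).$$
This is a simple consequence of the identity $\mathbb{E}\big[\sum_{i \in B} \bar{b}_i^{-1} f(x_i)\big] = \sum_{i=1}^{N} f(x_i)$.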
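The re-weighting statement and the unbiased correction can both be verified exactly on a small example by enumerating the $k$-DPP. The following is a minimal sketch under assumptions: the toy RBF kernel, the random stand-in loss values `f`, and the brute-force helper are ours, and per-point losses take the place of gradients since the identities are linear.

```python
import itertools
import numpy as np

def k_dpp_probabilities(L, k):
    """Brute-force k-DPP: normalize det(L_Y) over all size-k subsets Y."""
    n = L.shape[0]
    subsets = list(itertools.combinations(range(n), k))
    dets = np.array([np.linalg.det(L[np.ix_(s, s)]) for s in subsets])
    return dict(zip(subsets, dets / dets.sum()))

rng = np.random.default_rng(0)
n, k = 6, 3
X = rng.normal(size=(n, 2))
L = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))  # toy RBF kernel
probs = k_dpp_probabilities(L, k)

# Marginal inclusion probabilities: P(i is in the sampled mini-batch).
marginals = np.zeros(n)
for subset, p in probs.items():
    marginals[list(subset)] += p

f = rng.normal(size=n)  # stand-in per-point loss values

# Diversified risk: expected mini-batch average under the k-DPP ...
diversified = sum(p * f[list(s)].mean() for s, p in probs.items())
# ... equals the empirical risk re-weighted by the marginals.
reweighted = (marginals * f).sum() / k

# Dividing each sampled term by its marginal recovers the plain empirical risk.
unbiased = sum(p * (f[list(s)] / marginals[list(s)]).sum() / n
               for s, p in probs.items())

print(diversified, reweighted)
print(unbiased, f.mean())
```

The marginals sum to $k$, so points in sparse regions (large marginals) are up-weighted relative to the uniform value $k/N$ in the diversified risk.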
Finally, we investigate under which circumstances the DM-SGD gradient has a lower variance compared to simple mini-batch SGD on the diversified risk. To this end, consider the gradient components $g_i$, $g_j$ of data points $x_i$ and $x_j$, respectively, as well as their correlation $C_{ij}$ under the $k$-DPP. A sufficient condition for DM-SGD to reduce the variance is given as follows.
Assume that for all data points $x_i$ and $x_j$ and for all parameters $\theta$ in a region of interest, the scalar product $g_i^\top g_j$ is always positive (negative) whenever the correlation $C_{ij}$ is negative (positive), respectively, i.e.
$$C_{ij}\, g_i(\theta)^\top g_j(\theta) < 0 \quad \forall\, i \neq j. \tag{12}$$
Then, DM-SGD has a lower variance than SGD.
The sufficient conditions outlined in Theorem 1 are very strong, but its proof provides us with valuable insights into why variance reduction occurs.
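To begin with, let $b_i \in \{0, 1\}$ be the sampling indicators with marginals $\bar{b}_i = \mathbb{E}[b_i]$, and define
$$G(\theta) = \frac{1}{k} \sum_{i=1}^{N} b_i\, g_i(\theta), \qquad \bar{G}(\theta) = \mathbb{E}[G(\theta)] = \frac{1}{k} \sum_{i=1}^{N} \bar{b}_i\, g_i(\theta),$$
where $G$ is the DM-SGD gradient and $\bar{G}$ is the full gradient of the diversified risk.
We denote the difference between the expected and stochastic gradient as
$$\Delta = G - \bar{G} = \frac{1}{k} \sum_{i=1}^{N} (b_i - \bar{b}_i)\, g_i.$$
By construction, this quantity has expectation zero. We are interested in the trace of the stochastic gradient covariance,
$$\mathrm{tr}\big(\mathbb{E}[\Delta \Delta^\top]\big) = \mathbb{E}[\Delta^\top \Delta].$$
This quantity can be expressed as
$$\mathbb{E}[\Delta^\top \Delta] = \frac{1}{k^2} \sum_{i,j=1}^{N} \mathbb{E}\big[(b_i - \bar{b}_i)(b_j - \bar{b}_j)\big]\, g_i^\top g_j.$$
We can furthermore compute
$$\mathbb{E}\big[(b_i - \bar{b}_i)(b_j - \bar{b}_j)\big] = (1 - \delta_{ij})\, C_{ij} + \delta_{ij}\, \bar{b}_i (1 - \bar{b}_i),$$
where $\delta_{ij}$ is the Kronecker symbol and $C_{ij} = \mathbb{E}[b_i b_j] - \bar{b}_i \bar{b}_j$ (we used $b_i^2 = b_i$).
Collecting all terms, the variance can be written as
$$\mathbb{E}[\Delta^\top \Delta] = \frac{1}{k^2} \sum_{i=1}^{N} \bar{b}_i (1 - \bar{b}_i)\, \|g_i\|^2 + \frac{1}{k^2} \sum_{i \neq j} C_{ij}\, g_i^\top g_j.$$
The first term is just the variance of regular mini-batch SGD, where we sample each data point with probability proportional to $\bar{b}_i$, which also optimizes the diversified risk. This term is always non-negative because $0 \le \bar{b}_i \le 1$ and thus $\bar{b}_i (1 - \bar{b}_i) \ge 0$. Under the condition of Theorem 1, every summand of the second term is negative, so the total variance is smaller than the first term alone. ∎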
Discussion of Theorem 1.
If the similarity kernel relies on spatial distances, nearby data points $x_i$ and $x_j$ have a negative correlation $C_{ij}$ under the $k$-DPP. However, if the loss function is smooth, $g_i$ and $g_j$ tend to align (i.e. have a positive scalar product). Eq. 12 is therefore naturally satisfied for these points. $C_{ij}$ can also be positive: since some combinations of data points are less likely to co-occur, others must be more likely to co-occur. Since these points tend to be far apart, it is reasonable to assume that their gradients show no tendency to align. It is therefore plausible to assume that for these points, Eq. 12 also applies. (We only need to ensure that the negative contributions outweigh the positive ones to see variance reduction.)
To summarize, if the condition in Eq. 12 is met, we can guarantee variance reduction relative to mini-batch SGD, and we have given arguments why it is plausible that these are met to some degree when using DM-SGD with a distance-dependent similarity kernel. In our experimental section we show that DM-SGD has a faster learning curve, which we attribute to this phenomenon.
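The variance decomposition behind Theorem 1 can be checked exactly on a toy problem. The sketch below is a hedged illustration: the RBF kernel, the random stand-in gradients `g` and the brute-force enumeration are our assumptions, and the symbols follow the usual convention of inclusion marginals `pi` and covariance `C` between inclusion indicators.

```python
import itertools
import numpy as np

def k_dpp_probabilities(L, k):
    """Brute-force k-DPP: normalize det(L_Y) over all size-k subsets Y."""
    n = L.shape[0]
    subsets = list(itertools.combinations(range(n), k))
    dets = np.array([np.linalg.det(L[np.ix_(s, s)]) for s in subsets])
    return dict(zip(subsets, dets / dets.sum()))

rng = np.random.default_rng(0)
n, k, d = 6, 3, 4
X = rng.normal(size=(n, 2))
L = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))  # toy RBF kernel
probs = k_dpp_probabilities(L, k)

# Second moments M_ij = P(i and j both sampled); the diagonal holds the marginals.
M = np.zeros((n, n))
for subset, p in probs.items():
    idx = list(subset)
    M[np.ix_(idx, idx)] += p
pi = np.diag(M).copy()
C = M - np.outer(pi, pi)  # covariance of the inclusion indicators

g = rng.normal(size=(n, d))  # stand-in per-point gradients
gram = g @ g.T               # gram[i, j] = g_i . g_j

# Exact trace of the mini-batch gradient covariance, by enumeration.
mean_grad = sum(p * g[list(s)].mean(axis=0) for s, p in probs.items())
var_exact = sum(p * np.sum((g[list(s)].mean(axis=0) - mean_grad) ** 2)
                for s, p in probs.items())

# Decomposition into a diagonal term and a correlation term.
diag_term = np.sum(pi * (1 - pi) * np.diag(gram)) / k**2
off_diag_term = (np.sum(C * gram) - np.sum(np.diag(C) * np.diag(gram))) / k**2

print(var_exact, diag_term + off_diag_term)
print("correlation term:", off_diag_term)
```

The sign of `off_diag_term` decides whether the repulsive sampling helps: it is negative when negatively correlated pairs tend to have aligned gradients. With random stand-in gradients, as here, no particular sign is guaranteed.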
We evaluate the performance of our method in different settings. In Section 5.1 we demonstrate the usage of DM-SGD for Latent Dirichlet Allocation (LDA), an unsupervised probabilistic topic model. We show that the learned diversified topic representations are better suited for subsequent text classification. In Section 5.2 we evaluate the supervised scenario based on multinomial (softmax) logistic regression with imbalanced data. We compare against stratified sampling, which emerges naturally in this example. In Section 5.3 we show that our method also maintains performance on the balanced MNIST data set, where we tested convolutional neural networks. In all the experiments, we pre-sample the mini-batch indices using the $k$-DPP implementation from  for small datasets, and from  for big datasets. In this way, sampling is treated as a pre-scheduling step and can easily be parallelized. We found that our approach finds more diversified feature representations (in unsupervised setups) and higher predictive accuracies (in supervised setups). We also found that training with the $k$-DPP converges within fewer passes through the data compared to standard minibatch sampling due to variance reduction.
5.1 Topic Learning with LDA
We follow Algorithm 2 for LDA. First, we demonstrate the performance of DM-SVI on synthetic data with LDA. We show that by balancing our mini-batches, we obtain a much better recovery of the topics that were used to generate the data. Second, we use a real-world news dataset. We demonstrate that we can learn more diverse topics that are also better features for text classification tasks.
In this setting, stratified sampling is not applicable since there is no discrete feature such as a class label available. With only word frequencies available, no simple criterion can be used to divide the data into meaningful strata.
5.1.1 Synthetic Data
We generate a synthetic dataset (shown in the supplementary material) following the generative process of LDA with a fixed global latent parameter (the graphical topics). We choose distinct patterns as shown in Figure 3 (a), where each row represents a topic and each column represents a word. To generate an imbalanced data set, we use different Dirichlet priors for the per-document topic distribution. 300 documents are generated with prior (0.5 0.5 0.01 0.01 0.01); 50 with prior (0.01 0.5 0.5 0.5 0.01) and 10 with prior (0.01 0.01 0.01 0.5 0.5). Hence, the first two topics are used very often in the corpus. Topics 3 and 4 appear a few times and topic 5 appears very rarely.
We fit LDA to recover the topics of the synthetic data using traditional SVI and our proposed DM-SVI, respectively. Here, the raw data occurrence is used to construct the similarity matrix. We check how well the global parameters are recovered. Fully recovered latent variables indicate that the model is able to capture the underlying structure of the data. Figure 3 (b) shows the estimated per-topic word distribution with SVI and Figure 3 (c) shows the result with our proposed DM-SVI.
In Figure 3 (b), we see that the first three topics are recovered using traditional SVI. Topic four is roughly recovered but with information from topic five mixed in. The last topic is not recovered at all, instead, it is a repetition of the first topic. This shows the drawback of the traditional method: when the data is not balanced, the model creates redundant topics to refine the likelihood of the dense data but ignores the scarce data even when they carry important information. In Figure 3 (c), we see that all the topics are correctly recovered thanks to the balanced dataset.
5.1.2 R8 News Data Experiment
To measure similarities between documents, we represent each document with a vector of the tf-idf scores of each word in the document. We then define an annealed linear kernel with parameter , which is more sensitive to small feature overlap. We run LDA with SVI and DM-SVI with one effective pass through the data, where we set the mini-batch size to and use topics.
We first compare the frequencies at which documents with particular labels were sub-sampled. Figure 4 shows the actual frequency of these classes in the original data set compared with the frequency of these classes over the balanced dataset (a collection of sampled mini-batches using the $k$-DPP). We can see that the number of documents is more balanced among different classes.
To demonstrate that DM-SVI leads to a more useful topic representation, we classify each document in the test set based on the learned topic proportions with a linear SVM. The global variable (per-topic word distribution) is only trained on the training set. The resulting confusion matrices are shown in Figure 5 using traditional SVI and DM-SVI, respectively. With traditional SVI, the average performance over 8 classes is ; the total accuracy (number of correctly classified documents over number of test documents) is . With DM-SVI, the average performance over 8 classes is and the total accuracy is .
Thus the overall classification performance is improved using DM-SVI features, and especially the performance on the classes with few documents (such as "grain" and "ship") is improved significantly.
We also visualize the first two principal components (PC) of the global topics in Figure 6. In traditional SVI, many topics are redundant and share large parts of their vocabulary, resulting in a single dense cluster. In contrast, we see that the topics in DM-SVI are more spread out. In this regard, DM-SVI achieves a similar effect as diversity priors, without the need to grow the prior with the data. The top words from each topic are shown in the appendix, where we present more evidence that the topics learned by DM-SVI are more diverse.
Table 1: Relative cost of mini-batch sampling per iteration for LDA. Columns: Size, k = 10, k = 30, k = 50, k = 80.
The relative costs of sampling per iteration for LDA are shown in Table 1. Because every local update is expensive, the relative overhead of mini-batch sampling is small. More details are given in the appendix.
5.2 Multiclass Logistic Regression
Many datasets in computer vision are balanced even though the originally collected data are extremely imbalanced. The reason is that the performance of machine learning models usually suffers from imbalanced training data. One example is the Oxford 102 flower dataset, which contains 1020 images in the training set with 10 images per class. However, in the test set, 6149 images are available with high imbalance. In this experiment, we make the learning task harder. We use the original testing set for training and use the original training set for testing. This setting reflects the real-life scenario where we can only collect biased data but wish the model to perform well in all situations.
Figure 8: Test accuracy as a function of training epochs on the Oxford 102 multi-class classification task. We show DM-SGD for different values of the weight $w$ (see Eq. 17 and the discussion below), with biased stratified sampling included as a limiting case. The plot caption indicates the batch size and the three best performing values of $w$. 'Rand' indicates regular SGD sampling. We list the final test accuracy after convergence, where "Best" indicates the best performance within our DM-SGD experiments, and "Baseline" indicates regular SGD as our baseline. The improvement is up to .
Off-the-shelf CNN features  are used in this experiment. A pre-trained VGG16 network  is used for the feature extraction. We use the first fully connected layer as features, since  shows that this layer is most robust.
The similarity kernel of the $k$-DPP was constructed as follows. We chose a linear kernel $L = H H^\top$, where each row $h_i$ of $H$ is a weighted concatenation of the fc1 features $f_i$ and a one-hot-vector representation $y_i$ of the class label,
$$h_i = [f_i,\; w\, y_i]. \tag{17}$$
This kernel construction enables the population to be balanced both among classes and within classes. When $w$ is large, the algorithm focuses more on the class labels. When $w$ is small, balancing is performed mostly based on the features. The weighting factor $w$ is a free parameter. As the limit of large $w$ results in stratified sampling (see Theorem 10), this baseline is naturally captured in our approach.
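The kernel construction described above can be sketched as follows. This assumes that the concatenation has the form `[features, w * one_hot(label)]`; the function name `label_feature_kernel`, the random stand-in features and the toy labels are hypothetical and only serve to show the effect of the weight.

```python
import numpy as np

def label_feature_kernel(features, labels, num_classes, w):
    """Linear kernel on the concatenation [features, w * one_hot(label)].

    The result equals features @ features.T + w**2 * (labels equal),
    so w trades off feature similarity against class membership.
    """
    one_hot = np.eye(num_classes)[labels]
    h = np.hstack([features, w * one_hot])
    return h @ h.T

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))       # stand-in for CNN features
labels = np.array([0, 0, 1, 1, 2, 2])    # toy class labels

L_small_w = label_feature_kernel(features, labels, num_classes=3, w=0.1)
L_large_w = label_feature_kernel(features, labels, num_classes=3, w=100.0)

same_class = (labels[:, None] == labels[None, :]).astype(float)
# For large w the rescaled kernel approaches the block kernel of stratified sampling.
print(np.round(L_large_w / 100.0**2, 3))
print(np.max(np.abs(L_large_w / 100.0**2 - same_class)))
```

For small `w` the kernel is dominated by the feature Gram matrix, so diversification acts mainly within and across classes through the features. For large `w` it is dominated by class membership.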
In this setting, the class label is a natural criterion to divide the data into strata. One can then re-sample the same amount of data from each stratum in order to re-balance the data set. Such a mechanism constrains the mini-batch size to be $k = n c$, where $c$ is the number of classes/strata and $n$ is a positive integer. As proved in Section 4, when the kernel reduces to the block structure over strata and $k$ equals the number of classes, DM-SGD is equivalent to this type of (biased) stratified sampling.
Figure 7 shows the percentage of data in each class for the original dataset and for the balanced dataset. It shows that with larger $w$, the dataset is more balanced among classes. More examples are shown in the supplementary material.
We demonstrate this application with a standard linear softmax classifier for multi-class classification. In our case, the inputs are the off-the-shelf CNN fc1 features. We can also view this procedure as fine-tuning a neural network.
Figure 8 shows how the test accuracy changes over the training epochs. We compare DM-SGD with different weights $w$ against random sampling. The learning rate schedule is kept the same among different experiments. Different mini-batch sizes are used, as shown in the caption of each panel in the figure. We can see that with DM-SGD, we reach a high model performance more rapidly. Additionally, for a classification task, balancing data with respect to classes is important, since the performance is in general better for bigger $w$. On the other hand, the feature information is essential as well since the best performance is mostly obtained with and . Comparing these plots, we can see that the performance benefits more when the mini-batch size is comparably small. Small mini-batches are in general preferred due to their low cost, and our method makes the most of small mini-batches.
5.3 CNN Classification on MNIST
Finally, we show the performance of our method in a scenario where the dataset is balanced, which is a less favorable scenario for DM-SGD. Here we consider the MNIST dataset, which contains approximately the same number of examples per handwritten digit.
Since our method is independent of the model, we can use any low-level data statistics. Here, we demonstrate DM-SGD with raw data features and apply it to training a CNN. We construct the similarity kernel using an RBF kernel. For the low-level feature, we use the normalized raw pixel values directly. To encode both class information and label information, we use to compute the similarity matrix, where for this experiment. We use half of the training data from MNIST to train a 5-layer CNN as in . Figure 9 shows the test accuracy at each iteration with mini-batch size and respectively. We can see that even if the data are balanced, DM-SGD still performs better than random sampling due to its variance reduction property.
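To make the role of the RBF similarity kernel concrete, the sketch below builds such a kernel on toy one-dimensional points and computes the k-DPP distribution exactly by enumerating all size-k subsets, using the fact that a k-DPP assigns a subset S a probability proportional to det(L_S). The enumeration is only feasible for tiny ground sets and is not the sampler used in the experiments; the bandwidth `sigma`, the toy points, and all function names are illustrative assumptions.

```python
import itertools
import math
import random


def rbf_kernel(points, sigma):
    """RBF similarity matrix exp(-||a - b||^2 / (2 sigma^2))."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [[math.exp(-sqdist(a, b) / (2.0 * sigma ** 2)) for b in points]
            for a in points]


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def kdpp_probabilities(L, k):
    """Exact k-DPP probabilities: P(S) proportional to det(L_S), |S| = k."""
    subsets = list(itertools.combinations(range(len(L)), k))
    weights = [max(det([[L[i][j] for j in S] for i in S]), 0.0) for S in subsets]
    total = sum(weights)
    return {S: wt / total for S, wt in zip(subsets, weights)}


# Points 0 and 1 are near-duplicates; point 2 is far from both.
points = [[0.0], [0.05], [1.0]]
L = rbf_kernel(points, sigma=0.5)
probs = kdpp_probabilities(L, k=2)

# Draw one diversified mini-batch of size 2 from the exact distribution.
rng = random.Random(0)
batch = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

In this toy example the near-duplicate pair (0, 1) receives far less probability mass than either pair containing the distant point, which is the repulsion effect that diversifies the mini-batch.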
We proposed a diversified mini-batch sampling scheme based on determinantal point processes. Our method, DM-SGD, builds on a similarity matrix between the data points and suppresses the co-occurrence of similar data points in the same mini-batch. This leads to a training outcome which generalizes better to unseen data. We also derived sufficient conditions under which the method reduces the variance of the stochastic gradient, leading to faster learning. We showed that our approach generalizes both stratified sampling and pre-clustering. In the future, we will explore the possibility of further improving the efficiency of the algorithm with data reweighting and tackle imbalanced learning problems involving different modalities for supervised and multi-modal settings.
-  Image classification with imagenet. https://github.com/soumith/imagenet-multiGPU.torch/blob/master/dataset.lua.
-  Multilayer convolutional network. https://www.tensorflow.org/get_started/mnist/pros.
-  R8 dataset. http://csmining.org/index.php/r52-and-r8-of-reuters-21578.html.
-  R. H. Affandi, A. Kulesza, E. B. Fox, and B. Taskar. Nystrom approximation for large-scale determinantal processes. In AISTATS, pages 85–98, 2013.
-  H. Azizpour, A. Sharif Razavian, J. Sullivan, A. Maki, and S. Carlsson. From generic to specific deep representations for visual recognition. In CVPR WS, pages 36–45, 2015.
-  D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. JMLR, 3:993–1022, 2003.
-  L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, pages 177–186. 2010.
-  D. Csiba and P. Richtarik. Importance sampling for minibatches. arXiv:1602.02283, 2016.
-  J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12(Jul):2121–2159, 2011.
-  Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, pages 23–37. Springer, 1995.
-  T.F. Fu and Z.H. Zhang. CPSG-MCMC: Clustering-based preprocessing method for stochastic gradient MCMC. In AISTATS, 2017.
-  H. He and E. A. Garcia. Learning from imbalanced data. TKDE, 21(9):1263–1284, 2009.
-  M. Hoffman, F. R. Bach, and D. M. Blei. Online learning for latent dirichlet allocation. In NIPS, pages 856–864, 2010.
-  M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
-  A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache. Learning visual features from large weakly supervised data. In ECCV, 2016.
-  D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
-  J. F. C. Kingman. Poisson processes. Wiley Online Library, 1993.
-  A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In ICML, pages 1193–1200, 2011.
-  A. Kulesza and B. Taskar. Determinantal point processes for machine learning. arXiv:1207.6083, 2012.
-  J. T. Kwok and R. P. Adams. Priors for diversity in generative latent variable models. In NIPS, pages 2996–3004, 2012.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  D. Lee, G. Cha, M. H. Yang, and S. Oh. Individualness and determinantal point processes for pedestrian detection. In ECCV, pages 330–346, 2016.
-  C. T. Li, S. Jegelka, and S. Sra. Fast DPP sampling for Nyström with application to kernel methods. arXiv:1603.06052, 2016.
-  C.T. Li, S. Jegelka, and S. Sra. Efficient sampling for k-determinantal point processes. arXiv:1509.01618.
-  O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability.
-  S. Mandt and D. M. Blei. Smoothed gradients for stochastic variational inference. In NIPS, pages 2438–2446, 2014.
-  S. Mandt, M. D. Hoffman, and D. M. Blei. Stochastic gradient descent as approximate Bayesian inference. arXiv:1704.04289, 2017.
-  S. Mandt, J. McInerney, F. Abrol, R. Ranganath, and D. M. Blei. Variational Tempering. In AISTATS, pages 704–712, 2016.
-  M. D. McKay, R. J. Beckman, and W. J. Conover. Comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2):239–245, 1979.
-  J. Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97(4):558–625, 1934.
-  M. E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In ICVGIP, 2008.
-  J. Paisley, D. Blei, and M. Jordan. Variational bayesian inference with stochastic search. arXiv:1206.6430, 2012.
-  D. Perekrestenko, V. Cevher, and M. Jaggi. Faster coordinate descent via adaptive importance sampling. In AISTATS, 2017.
-  B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
-  R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In AISTATS, pages 814–822, 2014.
-  H. Robbins and D. Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Herbert Robbins Selected Papers, pages 111–135. Springer, 1985.
-  S. Robertson. Understanding inverse document frequency: on theoretical arguments for IDF. Journal of documentation, 60(5):503–520, 2004.
-  T. Salimans and D.A. Knowles. On using control variates with stochastic approximation for variational bayes and its connection to stochastic linear regression. arXiv:1401.1022.
-  M. Schmidt, R. Babanezhad, M. O. Ahmed, A. Defazio, A. Clifton, and A. Sarkar. Non-uniform stochastic average gradient method for training conditional random fields. In AISTATS, 2015.
-  M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2013.
-  A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In CVPR WS, pages 806–813, 2014.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
-  T. Tieleman and G. Hinton. Lecture 6.5-RMSPROP: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning.
-  C. Wang, X. Chen, A. J. Smola, and E. P. Xing. Variance reduction for stochastic gradient optimization. In NIPS, pages 181–189, 2013.
-  P.T. Xie, Y.T. Deng, and E. Xing. Diversifying restricted boltzmann machine for document modeling. In ACM SIGKDD, pages 1315–1324. ACM, 2015.
-  M. D. Zeiler. ADADELTA: an adaptive learning rate method. arXiv:1212.5701, 2012.
-  C. Zhang and H. Kjellström. How to Supervise Topic Models. In ECCV WS, 2014.
-  P.L. Zhao and T. Zhang. Accelerating minibatch stochastic gradient descent using stratified sampling. arXiv:1405.3080.
-  P.L. Zhao and T. Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In ICML, pages 1–9, 2015.
Appendix A Supplement
Tables 2 and 3 show the top words per topic for LDA using traditional SVI and our proposed DM-SVI, respectively. We can see that the topics learned by DM-SVI are more diverse and that rare topics such as grain (colored in blue) are captured.
Figure 10 shows the synthetic data that are used in the LDA experiment. Each row represents a document and each column represents a word.
Table 2: Top words per topic learned by LDA with traditional SVI.
|Topic 1||pct shares stake and group investment securities stock commission firm|
|Topic 2||year pct and for last lower growth debt profits company|
|Topic 3||and merger for will approval companies corp acquire into letter|
|Topic 4||and for canadian company management pacific bid southern court units|
|Topic 5||baker official and that treasury western policy administration study budget|
|Topic 6||and president for executive chief shares plc company chairman cyclops|
|Topic 7||bank pct banks rate rates money interest and reuter today|
|Topic 8||and unit inc sale sell reuter company systems corp terms|
|Topic 9||mln stg and reuter months year for plc market pretax|
|Topic 10||and national loan federal savings reuter association insurance estate real|
|Topic 11||trade and for bill not united imports that surplus south|
|Topic 12||and february for china january gulf issue month that last|
|Topic 13||market dollar that had and will exchange system currency west|
|Topic 14||dlrs quarter share for company earnings year per and fiscal|
|Topic 15||billion mln tax year profit credit marks francs net pct|
|Topic 16||usair inc twa reuter trust air department chemical diluted piedmont|
|Topic 17||and will union spokesman not two that reuter security port|
|Topic 18||offer share tender shares that general and gencorp dlrs not|
|Topic 19||and company for that board proposal group made directors proposed|
|Topic 20||that japan japanese and world industry government for told officials|
|Topic 21||american analysts and that analyst chrysler shearson express stock not|
|Topic 22||loss profit mln reuter cts net shr dlrs qtr year|
|Topic 23||mln dlrs and assets for dlr operations year charge reuter|
|Topic 24||mln net cts shr revs dlrs qtr year oper reuter|
|Topic 25||cts april reuter div pay prior record qtly march sets|
|Topic 26||dividend stock split for two reuter march payable record april|
|Topic 27||oil and prices crude for energy opec petroleum production bpd|
|Topic 28||agreement for development and years program technology reuter conditions agreed|
|Topic 29||and foreign that talks for international industrial exchange not since|
|Topic 30||corp inc acquisition will company common shares reuter stock purchase|
Table 3: Top words per topic learned by LDA with DM-SVI.
|Topic 1||oil and that prices for petroleum dlrs energy crude field|
|Topic 2||pct and that rate market banks term rates this will|
|Topic 3||billion and pct mln group marks sales year capital rose|
|Topic 4||and saudi oil gulf that arabia december minister prices for|
|Topic 5||and dlrs debt for brazil southern mln will medical had|
|Topic 6||and grain that will futures for program farm certificates agriculture|
|Topic 7||bank banks rate and pct interest rates for foreign banking|
|Topic 8||and union for national seamen california port security that strike|
|Topic 9||and trade that for dollar deficit gatt not exports economic|
|Topic 10||and financial for sale inc services reuter systems agreement assets|
|Topic 11||dollar and for yen mark march that dealers sterling market|
|Topic 12||and for south unit equipment reuter two will state corp|
|Topic 13||and firm stock company will for pct not share that|
|Topic 14||and world that talks economic official for countries system monetary|
|Topic 15||and gencorp for offer general company partners that dlrs share|
|Topic 16||mln canada canadian stg and pct will air that royal|
|Topic 17||usair and twa that analysts not for pct analyst piedmont|
|Topic 18||and that for companies not years study this areas overseas|
|Topic 19||trade and bill for house that reagan foreign states committee|
|Topic 20||company dlrs offer stock and for corp share shares mln|
|Topic 21||dlrs year and quarter company for earnings will tax share|
|Topic 22||mln cts net loss dlrs profit reuter shr year qtr|
|Topic 23||exchange paris and rates that treasury baker allied for western|
|Topic 24||and shares inc for group dlrs pct offer reuter share|
|Topic 25||merger and that pacific texas hughes baker commerce for company|
|Topic 26||and american company subsidiary china french reuter pct for owned|
|Topic 27||japan japanese and that trade officials for government industry pact|
|Topic 28||oil opec mln bpd prices production ecuador and output crude|
|Topic 29||and that had shares block for mln government not san|
|Topic 30||mln pct and profits dlrs year for billion company will|
The sampling time in seconds for the R8 dataset is listed in Table 4. There are 5485 training documents. The first row in the table shows the sampling time for different mini-batch sizes k and different versions of k-DPP sampling. In practice, we use the original implementation from with . To compare with the traditional k-DPP, we list the elapsed time with . The last row shows the running time per local LDA update, excluding sampling.
|Size||k = 10||k = 30||k = 50||k = 80|
The computational time for training a neural network depends heavily on the network structure and implementation details. For example, when using only one softmax layer as in the flower experiment, the cost per gradient step is on the order of milliseconds. In this setup, k-DPP is not effective from a runtime perspective, but still results in better final classification accuracies. However, the cost of each gradient step for a simple 5-layer NN as in the MNIST experiment with is 1.294 seconds. In the latter case, this time is comparable to k-DPP sampling (0.7941 seconds; see Table 5). We thus expect our methods to benefit expensive models and imbalanced training datasets more.
|Size||k = 10||k = 100||k = 200|
Figure 11 shows bar plots of the frequency of images in each class for the Oxford Flower dataset, using the number of classes as the mini-batch size. With this setting, we can see that when , DM-SGD is equivalent to StS.