Stochastic Subset Selection

06/25/2020 ∙ by Tuan A. Nguyen, et al. ∙ AItrics, KAIST Department of Mathematical Sciences

Current machine learning algorithms are designed to work with huge volumes of high-dimensional data such as images. However, these algorithms are increasingly deployed to resource-constrained systems such as mobile devices and embedded systems. Even where large computing infrastructure is available, the size of each data instance, as well as of entire datasets, can be a major bottleneck in data transfer across communication channels. There are also strong incentives, in both energy and monetary terms, to reduce the computational and memory requirements of these algorithms. For non-parametric models that need to leverage the stored training data at inference time, the increased cost in memory and computation can be even more problematic. In this work, we aim to reduce the volume of data these algorithms must process through an end-to-end two-stage neural subset selection model, where the first stage selects a set of candidate points using a conditionally independent Bernoulli mask, followed by an iterative coreset selection via a conditional Categorical distribution. The subset selection model is trained by meta-learning with a distribution of sets. We validate our method on set reconstruction and classification tasks with feature selection, as well as on the selection of representative samples from a given dataset, on which it outperforms relevant baselines. Our experiments also show that our method enhances the scalability of non-parametric models such as Neural Processes.


1 Introduction

The recent success of deep learning algorithms partly owes to the availability of huge volumes of data Deng et al. (2009); Krizhevsky et al. (2009); Liu et al. (2015), which enables the training of very large deep neural networks. However, the high dimensionality of each data instance and the large size of datasets make it difficult, especially for resource-limited devices Chan et al. (2018); Li et al. (2019); Bhatia et al. (2019), to store and transfer the dataset, or to perform on-device learning with the data. This problem becomes even more pronounced for non-parametric models such as Neural Processes Hensel (1973); Kim et al. (2019a), which require the training dataset to be stored for inference. Therefore, it is appealing to reduce the size of the data, both at the instance level Dovrat et al. (2019); Li et al. (2018b) and at the dataset level, such that we select only a small number of samples from the dataset, each of which contains only a few selected input features (e.g., pixels). We could then use the selected subset for the reconstruction of the entire set (either each instance or the entire dataset) or for a prediction task, such as classification.

The simplest way to obtain such a subset is random sampling, but it is highly sub-optimal in that it treats all elements in the set equally. However, the pixels of each image and the examples in each dataset have varying degrees of importance Katharopoulos and Fleuret (2018) for a target task, whether it is reconstruction or prediction, and thus random sampling will generally incur a large loss of accuracy on the target task. There exists some work on coreset construction Huggins et al. (2016); Campbell and Broderick (2018, 2019), which proposes to construct a small subset of the most important samples for Bayesian posterior inference. However, these methods cannot be applied straightforwardly to deep learning with an arbitrary target task. How, then, can we sample elements from a given set to construct a subset that suffers minimal accuracy loss on any target task? To this end, we propose a learned sampler that selects the most important elements for a given task, trained jointly with the target task.

Specifically, we learn the sampling rate for individual elements in two stages. First, we learn a Bernoulli sampling rate for each individual element to efficiently screen out less important elements. Then, to select the most important elements out of this candidate set while accounting for their relative importance, we use a Categorical distribution to model the conditional distribution of sampling each element given the set of already selected elements. After learning the sampling probabilities for both stages, we can perform stochastic selection of a given set with linear time complexity. Our Stochastic Subset Selection (SSS) method is a general framework for sampling elements from a set, and it can be applied to both feature sampling and instance sampling. SSS can reduce the memory and computation cost required to process data while retaining performance on downstream tasks.

Our model can benefit a wide range of practical applications. For example, when sending an image to an edge device with low computing power, instead of sending the entire image we could send a subset of pixels with their coordinates, which reduces both communication and inference cost. Similarly, edge devices may need to perform real-time inference on large amounts of data that can be represented as sets (e.g., video, point clouds), and our feature selection could be used to speed up inference. Moreover, our model could help with on-device learning on personal data (e.g., photos), as it can select examples with which to train the model at reduced cost. Finally, it can help non-parametric models that require the storage of training examples, such as Neural Processes, scale up to large-scale problems.

We validate our SSS model on multiple datasets for 1D function regression and 2D image reconstruction and classification, for both feature selection and instance selection. The results show that our method selects samples with a minimal decrease in target-task accuracy, largely outperforming random sampling and an existing learned sampling method. Our contribution in this work is threefold:

  • We propose a novel two-stage stochastic subset selection method that learns to sample a subset from a larger set with linear time complexity and minimal loss of accuracy on the downstream task.

  • We propose a framework that trains the subset selection model via meta-learning, such that it can generalize to unseen tasks.

  • We validate the efficacy and generality of our model on various datasets for feature selection from an instance and instance selection from a dataset, on which it significantly outperforms relevant baselines.

Figure 1: Concept. Our Stochastic Subset Selection method is generic and can be applied to many types of sets. (a) Feature selection: selecting a subset of features (pixels) out of an image. This reduces the communication cost of data transfer between devices and allows faster training or inference on resource-constrained devices due to the reduced computational cost. (b) Instance selection: selecting a subset of instances out of a dataset. This helps resource-constrained systems train faster, or makes non-parametric inference more scalable.

2 Related Work

Set encoding - Permutation invariant networks

Recently, extensive research efforts have been made in the area of set representation learning, with the goal of obtaining order-invariant (or equivariant) and size-invariant representations. Many works propose simple methods to obtain set representations by applying non-linear transformations to each element before a pooling layer (e.g., average pooling or max pooling) Ravanbakhsh et al. (2016); Qi et al. (2017b); Zaheer et al. (2017); Sannai et al. (2019). However, these models are known to have limited expressive power and are sometimes incapable of capturing higher moments of distributions. In contrast, approaches such as the Stochastic Deep Network De Bie et al. (2018) and the Set Transformer Lee et al. (2018) consider pairwise (or higher-order) interactions among set elements and hence can capture more complex statistics of the distributions. These methods often achieve higher performance on classification and regression tasks; however, they have run-time complexities of $O(n^2)$ or higher.

Subset sampling

Several works have been proposed to handle large sets. Dovrat et al. (2019) propose to learn to sample a subset from a set by generating virtual points and then matching them back to a subset of the original set. However, such an element generation and matching process is highly inefficient. Our method, on the other hand, only learns to select from the original elements and does not suffer from such overhead. Wang et al. (2018) propose to distill the knowledge of a large dataset into a small number of artificial data instances. However, these artificial instances are only intended for faster training and do not capture the statistics of the original set. Moreover, the instances are generated artificially and can differ from the original set, making the method less applicable to other tasks. Finally, Qi et al. (2017a, c); Li et al. (2018b); Eldar et al. (1997); Moenning and Dodgson (2003) use farthest point sampling, which selects points from a set by ensuring that the selected samples are far from each other in a given metric space.

Image Compression

Due to the huge demand for image and video transfer over the internet, a number of works have attempted to compress images with minimal distortion. These models Toderici et al. (2017); Rippel and Bourdev (2017); Mentzer et al. (2018); Li et al. (2018a) typically consist of an encoder-decoder pair, where the encoder transforms the image into a compact representation to reduce the memory footprint and communication cost, while the decoder reconstructs the image. These methods, while highly successful at image compression, are less flexible than ours. First, our model can be applied to any type of set (and to instances represented as sets), while the aforementioned models mainly work for images represented in tensor form. Furthermore, our method can be applied at both the instance and the dataset level.

3 Approach

Our work is based on the assumption that, for every set $V$ (in some cases, each set element can be further separated into an input $x$ and a target $y$), there exists a small subset $S \subset V$ which is representative of the entire set. Thus, we aim to learn to select a fixed-size subset $S$ from $V$ (where $|S| = k \ll |V|$) such that $S$ can be used as a proxy when evaluating a loss for the entire set (i.e., $\mathcal{L}(S) \approx \mathcal{L}(V)$). We propose to learn the conditional distribution of the subset by minimizing the approximation gap between the subset and the full set through a two-stage selection process (Figure 2). The resulting distribution of $S$ is denoted by $q_\phi(S \mid V)$. Ultimately, we want to minimize the expected loss function with respect to this conditional distribution to select a representative subset, $\mathbb{E}_{q_\phi(S \mid V)}[\mathcal{L}(S)]$. When $V$ follows a distribution of sets $p(V)$ in the meta-learning framework, the final objective to minimize is $\mathbb{E}_{p(V)}\,\mathbb{E}_{q_\phi(S \mid V)}[\mathcal{L}(S)]$.

Figure 2: Overview. The subset is sampled by a stochastic 2-stage subset selection process. (a) Candidate Selection. (b) Core Subset Selection using continuous relaxations of discrete distributions.

3.1 Stochastic Subset Selection

To select a compact yet representative subset, we must take into account the dependencies between the elements in the set to avoid selecting redundant elements. Processing entire sets in this manner can be computationally infeasible for large sets. Hence we propose a two-stage selection scheme: we first independently select a manageably sized candidate set without considering inter-sample dependencies (candidate selection), and then select the final subset from the candidate set while considering the dependencies among its elements (autoregressive subset selection).

3.1.1 Candidate Selection

In order to select the candidate set $C$, we learn a binary mask $m = (m_1, \dots, m_n)$, where $m_i = 1$ means that element $x_i$ should be retained in the candidate set (Figure 2a). Specifically, for $i = 1, \dots, n$,

(1)   $m_i \sim \mathrm{Bernoulli}(\rho_i), \qquad \rho_i = g\big(x_i, h(V)\big),$

where $h$ is a neural network that computes a representation of the whole set and $g$ is a network that produces the probability of selecting $x_i$. While our method does not restrict the specific form of the permutation-invariant neural network $h$, we use Deep Sets Zaheer et al. (2017) in our implementation. To ensure that the whole process is differentiable, we use continuous relaxations of the Bernoulli distribution Maddison et al. (2016); Gal et al. (2017); Jang et al. (2016) when sampling $m_i$:

(2)   $m_i = \sigma\!\left(\frac{1}{\tau}\Big(\log\frac{\rho_i}{1-\rho_i} + \log\frac{u}{1-u}\Big)\right), \qquad u \sim \mathrm{Uniform}(0, 1),$

where $\sigma$ is the sigmoid function and $\tau$ is the temperature of the relaxation (fixed across all experiments).
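For concreteness, the candidate-selection stage can be sketched as follows in PyTorch. This is a minimal illustration under our own assumptions, not the authors' released code: the layer sizes, the default temperature, and the module name CandidateSelector are all hypothetical.

import torch
import torch.nn as nn

class CandidateSelector(nn.Module):
    """Stage 1: relaxed Bernoulli mask over set elements, conditioned on a set summary."""
    def __init__(self, dim, hidden=128, temperature=0.5):
        super().__init__()
        # Deep Sets style encoder h(V): embed each element, then mean-pool.
        self.phi = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        # g(x_i, h(V)): maps [element, set summary] to a selection logit.
        self.g = nn.Sequential(nn.Linear(dim + hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.temperature = temperature

    def forward(self, V):                              # V: (n, dim)
        summary = self.phi(V).mean(dim=0)              # h(V): (hidden,)
        pair = torch.cat([V, summary.expand(V.size(0), -1)], dim=-1)
        rho = torch.sigmoid(self.g(pair).squeeze(-1)).clamp(1e-6, 1 - 1e-6)
        # Binary Concrete / Gumbel relaxation of Bernoulli(rho_i), cf. Eq. (2).
        u = torch.rand_like(rho).clamp(1e-6, 1 - 1e-6)
        m = torch.sigmoid((torch.log(rho) - torch.log1p(-rho)
                           + torch.log(u) - torch.log1p(-u)) / self.temperature)
        return m, rho                                  # soft mask and Bernoulli rates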

Figure 3: Graphical models. (a) Feature selection for reconstruction. (b) Feature selection for a prediction task. (c) Instance selection model. (d) Instance selection model for classification.

3.1.2 Autoregressive Subset Selection

In the first stage, the mask distributions are conditionally independent. If the set contains two elements that are informative but provide similar information, they can be selected together and introduce redundancy. To alleviate this problem, we introduce an iterative selection procedure which selects one element at a time. Specifically, to select a core subset $S$ (Figure 2b) of size $k$ from the candidate set $C$, $k$ iterative steps are required. At step $t$, we select an element $x_i$ from $C \setminus S_{t-1}$ with probability

(3)   $p(x_i \mid S_{t-1}, C) = \dfrac{f(x_i, S_{t-1})}{\sum_{x_j \in C \setminus S_{t-1}} f(x_j, S_{t-1})},$

where $S_{t-1}$ is the set of elements that have been previously selected and $f$ is a positive function. Note that, when choosing the next element, we use a distribution that is conditioned not only on the candidate set but also on all previously chosen elements. This ensures that the final subset has no redundancy. While the goal of the second stage is to completely eliminate redundancy by selecting one sample at a time from the candidate set, our method also allows the computational cost to be reduced by selecting $s$ samples at a time. Towards this end, we propose a method that samples $s$ elements from $C \setminus S_{t-1}$ for efficient training. Specifically, instead of sampling $s$ times from the Categorical distribution above, we can sample the selection mask for element $x_i$ from $\mathrm{Bernoulli}\big(\min(1,\, s \cdot p(x_i \mid S_{t-1}, C))\big)$. Note that by doing this, the expected number of selected elements is still $s$, and the probability of each element being selected remains very close to the original distribution. At inference time, we can use existing libraries such as NumPy Oliphant (2006) to select $s$ elements at once (without replacement) to obtain $s$ distinct elements.
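As a small illustration of the two selection modes just described (the probability values below are toy numbers, not taken from the paper):

import numpy as np

# Normalized scores p(x_i | S, C) over the remaining candidates (toy values).
probs = np.array([0.05, 0.40, 0.25, 0.20, 0.10])
s = 2

# Inference: draw s distinct elements at once, without replacement.
chosen = np.random.choice(len(probs), size=s, replace=False, p=probs)

# Training-time shortcut: include each element independently with rate s * p_i,
# so the expected number of newly selected elements is still s.
include = np.random.rand(len(probs)) < np.clip(s * probs, 0.0, 1.0)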

Input: $k$ (subset size), $s$ (# elements selected at each iteration), $V$ (full set)
Output: $S$ (selected subset)
procedure StochasticSubsetSelection($V$, $k$, $s$)
     $\rho_i \leftarrow g(x_i, h(V))$ for each $x_i \in V$
     $m_i \sim \mathrm{Bernoulli}(\rho_i)$ for each $x_i \in V$      ▷ Candidate Selection
     $C \leftarrow \{x_i \in V : m_i = 1\}$
     $S \leftarrow \emptyset$
     for $t = 1, \dots, \lceil k/s \rceil$ do      ▷ Autoregressive Subset Selection
          $S \leftarrow S \cup \mathrm{AutoSelect}(C \setminus S, S, s)$      ▷ Select $s$ elements
     return $S$
procedure AutoSelect($C'$, $S$, $s$)
     compute $f(x_i, S)$ for each $x_i \in C'$
     $p_i \leftarrow f(x_i, S) \,/\, \sum_{x_j \in C'} f(x_j, S)$
     $S' \leftarrow$ $s$ elements drawn from $C'$ without replacement with probabilities $p$
     return $S'$
Algorithm 1 Fixed-Size Subset Selection

Algorithm 1 shows the entire subset selection procedure. The inference complexity depends on the choice of the function $f$. If we choose $f$ as a function that considers pairwise interactions between candidate elements and all selected elements, the inference complexity is linear in $n$, $c$, and $k$, where $n$, $c$, and $k$ are the sizes of the original set, the candidate set, and the final subset, respectively.
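Putting the two stages together, inference roughly follows the sketch below; candidate_selector and score_fn stand in for the stage-1 and stage-2 networks, and their names and interfaces are our own assumptions rather than the authors' implementation.

import torch

def stochastic_subset_select(V, candidate_selector, score_fn, k, s):
    """Sample a k-element subset of V following the two-stage procedure of Algorithm 1."""
    with torch.no_grad():
        _, rho = candidate_selector(V)                         # stage 1: Bernoulli rates
        cand = torch.nonzero(torch.bernoulli(rho), as_tuple=False).squeeze(-1)
        selected = []
        while len(selected) < k and cand.numel() > 0:
            # Summarize what has been selected so far (mean pooling as a simple choice).
            summary = (V[torch.tensor(selected)].mean(dim=0) if selected
                       else V.new_zeros(V.size(1)))
            scores = score_fn(V[cand], summary)                # positive scores f(x_i, S)
            probs = scores / scores.sum()
            take = min(s, k - len(selected), cand.numel())
            idx = torch.multinomial(probs, take, replacement=False)
            selected.extend(cand[idx].tolist())
            keep = torch.ones(cand.numel(), dtype=torch.bool)  # drop chosen candidates
            keep[idx] = False
            cand = cand[keep]
    return selected                                            # indices into V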

3.1.3 Greedy Training Algorithm

During training, it might be desirable not to run all iterations of the second selection stage. We therefore propose a greedy training algorithm: in each mini-batch, we sample a number $t$ and a random $t$-element subset $S_t$ of the candidate set. The autoregressive model is then trained to select the $s$-element subset $S'$ from the remaining candidates that makes $S_t \cup S'$ optimal for the output task (i.e., minimizes the task loss). By doing this, we only need to run one forward pass and one backward pass of the autoregressive model during training. During inference, we run the autoregressive model iteratively to select a fixed-size subset. We empirically found that the greedy training algorithm works well in practice (Sections 4.1 and 4.2).

3.1.4 Constraining the Candidate Set Size

From the complexity analysis in Section 3.1.2, our model's inference complexity scales linearly with the candidate set size $c$. In some cases, it is desirable to restrict the size of the candidate set to save computational cost when performing the second-stage selection. We adopt the idea of the information bottleneck and constrain the distribution of the mask $m$ for the candidate set. Specifically, we minimize:

(4)   $\mathcal{L}_{\text{task}} + \beta \, \mathrm{KL}\!\big(q(m \mid V)\,\|\,r(m)\big),$

where $\beta$ is a hyper-parameter and $r(m)$ is a prior on the mask $m$. We can set $r(m)$ to a desirable prior (e.g., a sparse prior when we want the candidate set to be small).

Suppose that $r(m_i) = \mathrm{Bernoulli}(\gamma)$, where $\gamma$ is small in the case of a sparse prior (we set $\gamma$ to 0.1 or 0.01, depending on the experiment); the KL term can then be computed as:

(5)   $\mathrm{KL} = \sum_{i=1}^{n} \Big[\, \rho_i \log\frac{\rho_i}{\gamma} + (1-\rho_i)\log\frac{1-\rho_i}{1-\gamma} \,\Big].$

Note that in cases where we have sufficient memory at inference time, we do not need to constrain the size of the candidate set. In that case, we can simply set $\beta$ to zero, and the model can freely learn the distribution of the candidate set that best helps the second-stage selection minimize the task loss.
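The KL term is straightforward to compute from the first-stage selection probabilities; a minimal sketch matching Eq. (5), with an illustrative default prior rate:

import math
import torch

def candidate_sparsity_kl(rho, gamma=0.1):
    """Sum over elements of KL( Bernoulli(rho_i) || Bernoulli(gamma) ).
    gamma is the prior selection rate; 0.1 is one of the values mentioned above."""
    rho = rho.clamp(1e-6, 1 - 1e-6)
    kl = rho * (torch.log(rho) - math.log(gamma)) \
       + (1 - rho) * (torch.log(1 - rho) - math.log(1 - gamma))
    return kl.sum()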

3.2 The Task Network with Loss Function

Depending on the target task, we can design the task network with a specific loss function. In this section, we describe some common tasks.

Set Reconstruction Given the selected subset $S$ of $V$, a network $p_\theta$, parameterized by $\theta$, is used to reconstruct the original set $V$. The loss function in this case is the negative conditional log-likelihood $-\log p_\theta(V \mid S)$. The overall objective function is then given as:

(6)   $\min_{\theta,\phi}\; \mathbb{E}_{p(V)}\,\mathbb{E}_{q_\phi(S \mid V)}\!\left[-\log p_\theta(V \mid S)\right].$

By minimizing this objective, we learn to select a compact subset that is most representative of the original set, similar in spirit to a Variational Auto-Encoder (VAE); the subset can then be used for other tasks. We choose the Attentive Neural Process (ANP) Kim et al. (2019b) as the network $p_\theta$. The ANP is a member of the Neural Process (NP) family Garnelo et al. (2018a, b). An NP takes as input a context set ($S$ in our case) and predicts the distribution of the elements in the original set ($V$). It mimics the behavior of Gaussian Processes while reducing the complexity of inference.

Set Classification/Prediction We can also opt to train the network to predict a single target $y$ for a set $V$. For example, the target could be the class of an image (classification) or a statistic of the set (regression). For this, we use a network $p_\theta$ to predict the target $y$ from the selected subset $S$. The loss then becomes the negative log-likelihood $-\log p_\theta(y \mid S)$ and the overall objective function is:

(7)   $\min_{\theta,\phi}\; \mathbb{E}_{p(V, y)}\,\mathbb{E}_{q_\phi(S \mid V)}\!\left[-\log p_\theta(y \mid S)\right].$

3.3 Dataset Distillation

Instance Selection For this task, we are given a collection of datasets $\mathcal{D}$, where each element $D \in \mathcal{D}$ is a set of data points sampled from the entire data source. Using CelebA as an illustrative example, each $D$ may consist of randomly sampled faces from the whole dataset. Given such a collection, we apply SSS to learn to select representative instances from each dataset in $\mathcal{D}$. To achieve this, we require a model capable of taking the sets in $\mathcal{D}$ as input and performing a task such as the reconstruction of all elements of a given dataset.

Given the collection of datasets $\mathcal{D}$, we consider a single dataset $D$ on which we apply our SSS model to learn to select representative elements $S \subset D$, which can then be used to reconstruct all elements of $D$. In essence, each dataset is distilled into a small subset of $k$ elements selected according to our SSS. The task then is to reconstruct the entire set conditioned on the selected subset. The first step in this process is to map the selected subset into a unique representative vector for each element in the dataset, akin to the statistics network used in the Neural Statistician Edwards and Storkey (2016) model. Specifically, to generate an element of $D$ given $S$, this representation is computed by applying a stochastic cross-attention mechanism on $S$, where the stochasticity is supplied by a latent query variable. To obtain varying styles in the generated images, we additionally learn a latent variable used to perturb this representation, and the two variables are combined to obtain a new element. The graphical model for this process is depicted in Figure 3(c). Additionally, to ensure that the representation is properly learnt, we add an informativity loss by reconstructing it from the generated samples for a given dataset. The objective for the model depicted in Figure 3(c) for a single dataset is:

(8)

where the priors on the respective latent variables are all chosen to be Gaussian with zero mean and unit variance. Also, to generate a sample in $D$, we must supply a query to the cross-attention module applied on $S$, and this is modeled through the stochastic query variable. This objective is combined with the informativity loss on all samples in $D$. It is important to note that the representation is computed using only $S$ for all elements in the original dataset $D$. Thus, in addition to Equation 8 and the informativity loss, the model is optimized together with the subset selection model described in the previous section. When this model is fully trained, we obtain an SSS model that can then be applied to the instance selection task on a given dataset. In essence, the purpose of the generative model introduced here is to train the subset selection module.

Classification Finally, in the dataset distillation task, we consider the problem of selecting prototypes to be used for few-shot classification. Here, we adopt the few-shot classification framework of Prototypical Networks Snell et al. (2017) and apply our SSS model to the task of selecting representative prototypes from each class to be used for classifying new instances. By learning to select the prototypes, we can remove outliers that would otherwise shift the class decision boundaries in the classification task. The graphical model including our SSS model is depicted in Figure 3(d), where the selected subset now corresponds to the selected prototypes, and the remaining variables correspond to the query and the class label, respectively.
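To make the role of the selected prototypes concrete, a Prototypical-Networks-style classifier uses them as follows. This is a generic sketch with our own variable names and an assumed embedding network, not the paper's implementation:

import torch

def prototype_logits(query_emb, support_emb, support_labels, num_classes):
    """Each class prototype is the mean embedding of its selected support instances;
    queries are classified by negative squared distance to the prototypes."""
    protos = torch.stack([support_emb[support_labels == c].mean(dim=0)
                          for c in range(num_classes)])    # (num_classes, d)
    return -torch.cdist(query_emb, protos) ** 2            # (num_queries, num_classes)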

4 Experiments

In this section, we present our experimental results.

Figure 4: Model performance on the set reconstruction and classification tasks (lighter-colored areas indicate standard deviations). (a) 1D function reconstruction. (b) CelebA (reconstruction). (c) CelebA (classification).
Figure 5: Reconstruction of a 1D function.

Table 1: CelebA attribute classification.
Model: CNN on            # Pixels        Storage   mAUC
Original image           All (38,804)    114KB     0.9157
Random pixels            500             5KB       0.8471
Selected pixels (rec)    500             5KB       0.8921
Selected pixels (MC)     500             5×5KB     0.9132
Selected pixels (ours)   500             5KB       0.9093

4.1 Feature Selection Experiments

Function Reconstruction - Approximation Our first experiment is on 1D function reconstruction. Suppose that we have a function $F$. We first construct a set of data points of that function, $V = \{(x_i, F(x_i))\}_{i=1}^{n}$, where the $x_i$ are uniformly distributed along the x-axis within a fixed interval. If we have a family of functions, this leads to a family of sets. We train our model, which consists of the subset selection model and a task network (e.g., an ANP), on this data and report the reconstruction loss, i.e., the negative log-likelihood.

Here, we compare our reconstruction performance against multiple baselines, namely Random Select (randomly selects a subset of $V$ and uses an ANP to reconstruct the set) and Learning to Sample Dovrat et al. (2019) (uses the method of Dovrat et al. (2019) to sample $k$ elements and an ANP to reconstruct the set). Figure 4(a) shows the performance (reconstruction loss) of our model and the baselines. SSS outperforms Random Select (RS), verifying that the subset selection model learns a meaningful distribution over the selected elements. Our model also outperforms the Learning to Sample (LTS) baseline.

Through the visualization of the selected points in Figure 5, we can see that our model tends to pick more points (shown as red dots) in the drifting parts of the curve, which is understandable since these parts are harder to reconstruct. The other two baselines sometimes fail to do so, which leads to inaccurate reconstructions.

Image Reconstruction Similar to the case of function reconstruction, given an image we learn to select a core subset of pixels that best reconstructs the original image. Here, the input is the 2-dimensional pixel coordinate and the target is the 3-dimensional RGB value. We use an ANP to reconstruct the remaining pixels from a set of context elements (the selected subset in our case). We conduct this experiment on the CelebA dataset Liu et al. (2018). Figure 4(b) shows that our model significantly outperforms ANP with random subsets (as in the original ANP paper) and the Learning to Sample baseline. Figure 6 shows reconstruction samples from our model, which look noticeably better than the reconstructions of the other baselines (with the same number of pixels).

4.2 Classification/Regression

In this subsection, we validate our model on the prediction task. The goal is to learn to select a subset for a target task such as classification or regression. We again use the CelebA dataset, but this time the selected pixels are used to predict 40 attributes of a celebrity face (in a multi-task learning setting). We use a stack of CNN layers for prediction for all models. For our proposed model, only the selected pixels are used for prediction (the values of the other pixels are set to zero). Table 1 shows that using only 500 pixels (1.3% of all pixels), we achieve a mean AUC of 0.9093 (99.3% of the mAUC obtained with the original image). Figure 4(c) shows the classification performance (in terms of mean AUC) versus the number of pixels selected. The AUC with pixels selected by our subset selection model is significantly higher than that of the random-pixels baseline, showing the effectiveness of our subset selection method. We also include another baseline, Selected pixels (rec): our stochastic subset selection model trained with the reconstruction loss but then used for classification. Our model outperforms this variant, showing the effectiveness of training with the target task. Note that Learning to Sample cannot be applied straightforwardly to this experimental setup: during training, the generated virtual points cannot be converted back to an image in matrix form (due to their virtual coordinates), so we cannot train the Learning to Sample model with the CNN-based classifier on the target task.

Ablation Study Since our method is stochastic, the predictive distribution can be written as an expectation over sampled subsets, $p(y \mid V) = \mathbb{E}_{q_\phi(S \mid V)}[p_\theta(y \mid S)]$, and we can use Monte Carlo (MC) sampling to obtain the prediction in practice. However, throughout the experiment section we reported results with a single sampled subset, since this gives the best reduction in memory and computational cost; it can be seen as MC sampling with one sample. We compare it against another variant: SSS with MC sampling (5 samples). It should be noted that with 5 MC samples, the inference cost increases five-fold, and the memory requirement can also increase by up to five times. Table 1 shows that our model achieves performance comparable to this variant, justifying that it can perform well on target tasks while reducing memory and computation requirements.

Table 2: FID score for varying numbers of selected instances.
#Instances   2       5       10      15      20      30
FPS          6.50    4.51    3.07    2.75    2.71    2.29
Random       3.73    1.16    0.90    0.38    0.39    0.20
SSS          2.53    1.02    0.59    0.33    0.24    0.17

Table 3: Accuracy on miniImagenet.
#Instances   1       2       5       10
FPS          0.432   0.501   0.598   0.636
Random       0.444   0.525   0.618   0.663
SSS          0.475   0.545   0.625   0.664
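As a rough illustration of the MC variant, one can average predictive probabilities over several independently sampled subsets; select_subset and classifier below are placeholder interfaces, not the paper's code:

import torch

def mc_predict(V, select_subset, classifier, num_samples=5):
    """Average class probabilities over num_samples sampled subsets of V."""
    probs = [torch.softmax(classifier(select_subset(V)), dim=-1)
             for _ in range(num_samples)]
    return torch.stack(probs).mean(dim=0)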

Figure 6: CelebA reconstruction samples with varying number of pixels.
Figure 7: Instance Selection Samples from a dataset of size 200.

4.3 Dataset Distillation

Instance Selection We present results on the instance selection task applied to a whole dataset. For this task we use the CelebA Liu et al. (2015) dataset, since it is imbalanced both in terms of gender and race. A dataset is constructed by sampling 200 random images from the full dataset, and we seek to select only a few (5-30) representative images from each of these randomly created datasets. Our subset selection module is trained via the procedure detailed in Section 3.3. To evaluate the stochastic subset selection model, we measure the diversity of the selected subset using the Fréchet Inception Distance (FID) Heusel et al. (2017), which measures the similarity and diversity between two datasets. We compare our model against random sampling of instances from the full dataset. Additionally, we compare against the Farthest Point Sampling (FPS) algorithm, which selects points from a given set by computing distances in a metric space between all elements and selecting those that are furthest from each other. FPS in general seeks to obtain wide coverage over a given set and is hence a suitable baseline. The results of this experiment are presented in Table 2, where our selection method achieves a lower FID score than FPS and Random Sampling. Given that the dataset is highly imbalanced, FPS performs worst: by selecting the furthest elements in the given set, it cannot capture the true distribution of the whole dataset, even compared with Random Sampling. Also, for small sample selections, our method outperforms FPS and Random Sampling since it models the interactions within the full dataset and hence can select the most representative subset.

Classification We use the miniImageNet dataset Vinyals et al. (2016) and go from a 20-shot classification task to a 1-, 2-, 5-, or 10-shot classification task. We again compare with Random Sampling and FPS and apply them together with SSS for the reduction in the number of shots. The results for this experiment are shown in Table 3, where it can be observed that SSS learns to select more representative prototypes than the other methods, especially in the few-shot settings where the choice of prototypes matters most. All models were trained for 300 epochs and the best model was picked using a validation set.

5 Conclusion

In this paper, we have proposed a stochastic subset selection method that reduces the size of an arbitrary set while preserving performance on a target task. Our selection method uses a Bernoulli mask to perform candidate selection, followed by a stack of Categorical distributions to iteratively select a core subset from the candidate set. As a result, the selection process takes the dependencies among the set's members into account and can therefore select a compact subset without redundancy. By using the compact subset in place of the original set for a target task, we can save memory, communication, and computational cost. We hope this will facilitate the use of machine learning algorithms in resource-limited systems such as mobile and embedded devices.

References

  • [1] A. Bhatia, P. Varakantham, and A. Kumar (2019) Resource constrained deep reinforcement learning. In Proceedings of the International Conference on Automated Planning and Scheduling, Vol. 29, pp. 610–620. Cited by: §1.
  • [2] T. Campbell and T. Broderick (2018) Bayesian coreset construction via greedy iterative geodesic ascent. arXiv preprint arXiv:1802.01737. Cited by: §1.
  • [3] T. Campbell and T. Broderick (2019) Automated scalable bayesian inference via hilbert coresets. The Journal of Machine Learning Research 20 (1), pp. 551–588. Cited by: §1.
  • [4] M. Chan, D. Scarafoni, R. Duarte, J. Thornton, and L. Skelly (2018) Learning network architectures of deep cnns under resource constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1703–1710. Cited by: §1.
  • [5] G. De Bie, G. Peyré, and M. Cuturi (2018) Stochastic deep networks. arXiv preprint arXiv:1811.07429. Cited by: §2.
  • [6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1.
  • [7] O. Dovrat, I. Lang, and S. Avidan (2019) Learning to sample. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2760–2769. Cited by: §1, §2, §4.1.
  • [8] H. Edwards and A. Storkey (2016) Towards a neural statistician. arXiv preprint arXiv:1606.02185. Cited by: §3.3.
  • [9] Y. Eldar, M. Lindenbaum, M. Porat, and Y. Y. Zeevi (1997) The farthest point strategy for progressive image sampling. IEEE Transactions on Image Processing 6 (9), pp. 1305–1315. Cited by: §2.
  • [10] Y. Gal, J. Hron, and A. Kendall (2017) Concrete dropout. In Advances in neural information processing systems, pp. 3581–3590. Cited by: §3.1.1.
  • [11] M. Garnelo, D. Rosenbaum, C. J. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. J. Rezende, and S. Eslami (2018) Conditional neural processes. arXiv preprint arXiv:1807.01613. Cited by: §3.2.
  • [12] M. Garnelo, J. Schwarz, D. Rosenbaum, F. Viola, D. J. Rezende, S. Eslami, and Y. W. Teh (2018) Neural processes. arXiv preprint arXiv:1807.01622. Cited by: §3.2.
  • [13] H. Hensel (1973) Neural processes in thermoregulation.. Physiological Reviews 53 (4), pp. 948–1017. Cited by: §1.
  • [14] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637. Cited by: §4.3.
  • [15] J. Huggins, T. Campbell, and T. Broderick (2016) Coresets for scalable bayesian logistic regression. In Advances in Neural Information Processing Systems, pp. 4080–4088. Cited by: §1.
  • [16] E. Jang, S. Gu, and B. Poole (2016) Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144. Cited by: §3.1.1.
  • [17] A. Katharopoulos and F. Fleuret (2018) Not all samples are created equal: deep learning with importance sampling. arXiv preprint arXiv:1803.00942. Cited by: §1.
  • [18] H. Kim, A. Mnih, J. Schwarz, M. Garnelo, A. Eslami, D. Rosenbaum, O. Vinyals, and Y. W. Teh (2019) Attentive neural processes. arXiv preprint arXiv:1901.05761. Cited by: §1.
  • [19] H. Kim, A. Mnih, J. Schwarz, M. Garnelo, A. Eslami, D. Rosenbaum, O. Vinyals, and Y. W. Teh (2019) Attentive neural processes. arXiv preprint arXiv:1901.05761. Cited by: §3.2.
  • [20] A. Krizhevsky, V. Nair, and G. Hinton (2009) Cifar-10 and cifar-100 datasets. URL: https://www.cs.toronto.edu/kriz/cifar.html. Cited by: §1.
  • [21] J. Lee, Y. Lee, J. Kim, A. R. Kosiorek, S. Choi, and Y. W. Teh (2018) Set transformer. arXiv preprint arXiv:1810.00825. Cited by: §2.
  • [22] M. Li, E. Yumer, and D. Ramanan (2019) Budgeted training: rethinking deep neural network training under resource constraints. arXiv preprint arXiv:1905.04753. Cited by: §1.
  • [23] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang (2018) Learning convolutional networks for content-weighted image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3214–3223. Cited by: §2.
  • [24] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen (2018) Pointcnn: convolution on x-transformed points. In Advances in neural information processing systems, pp. 820–830. Cited by: §1, §2.
  • [25] Z. Liu, P. Luo, X. Wang, and X. Tang (2015-12) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Cited by: §1, §4.3.
  • [26] Z. Liu, P. Luo, X. Wang, and X. Tang (2018) Large-scale celebfaces attributes (celeba) dataset. Retrieved August 15, 2018. Cited by: §4.1.
  • [27] C. J. Maddison, A. Mnih, and Y. W. Teh (2016) The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712. Cited by: §3.1.1.
  • [28] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool (2018) Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4394–4402. Cited by: §2.
  • [29] C. Moenning and N. A. Dodgson (2003) Fast marching farthest point sampling. Technical report University of Cambridge, Computer Laboratory. Cited by: §2.
  • [30] T. E. Oliphant (2006) A guide to numpy. Vol. 1, Trelgol Publishing USA. Cited by: §3.1.2.
  • [31] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: §2.
  • [32] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660. Cited by: §2.
  • [33] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pp. 5099–5108. Cited by: §2.
  • [34] S. Ravanbakhsh, J. Schneider, and B. Poczos (2016) Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500. Cited by: §2.
  • [35] O. Rippel and L. Bourdev (2017) Real-time adaptive image compression. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2922–2930. Cited by: §2.
  • [36] A. Sannai, Y. Takai, and M. Cordonnier (2019) Universal approximations of permutation invariant/equivariant functions by deep neural networks. arXiv preprint arXiv:1903.01939. Cited by: §2.
  • [37] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in neural information processing systems, pp. 4077–4087. Cited by: §3.3.
  • [38] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell (2017) Full resolution image compression with recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5306–5314. Cited by: §2.
  • [39] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: §4.3.
  • [40] T. Wang, J. Zhu, A. Torralba, and A. A. Efros (2018) Dataset distillation. arXiv preprint arXiv:1811.10959. Cited by: §2.
  • [41] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola (2017) Deep sets. In Advances in neural information processing systems, pp. 3391–3401. Cited by: §2, §3.1.1.

Appendix A Greedy Training Algorithm

Algorithm 2 shows our greedy training algorithm with stochastic gradient descent. The idea of the greedy training algorithm is to train the autoregressive model to select the best next $s$ elements from the candidate set so as to minimize the target loss on the selected samples. By doing this, we do not have to run the autoregressive model for many iterations during training, thus reducing the computational cost.

Input: $k$ (max subset size), $s$ (# elements selected at each iteration), $p(V)$ (distribution of sets), $\eta$ (learning rate), a target task with loss function $\mathcal{L}$
Output: trained model with converged selection parameters $\phi$ and task parameters $\theta$
procedure GreedyTraining
     initialize $\theta$, $\phi$
     while not converged do
          sample a minibatch of sets $\{V_b\}$ from $p(V)$
          compute the candidate set $C_b$ for each $V_b$      ▷ Candidate Selection
          sample a number $t_b$ at random for each $V_b$
          $S_b \leftarrow$ a random $t_b$-element subset of $C_b$ for each $V_b$
          $S'_b \leftarrow$ an $s$-element subset selected from $C_b \setminus S_b$ with the autoregressive model
          update $\theta$ and $\phi$ by a gradient step on $\mathcal{L}(S_b \cup S'_b)$ with learning rate $\eta$
Algorithm 2 Greedy Training Algorithm
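A rough PyTorch sketch of one step of this greedy procedure is given below. It is only illustrative and rests on several of our own assumptions: the module names, the use of soft element weights to keep the step differentiable (via the same relaxations as in Section 3.1), and a prefix size k_max much smaller than the candidate set.

import torch

def greedy_training_step(V, candidate_selector, score_fn, task_loss, optimizer,
                         s, k_max, temperature=0.5):
    """Pretend a random prefix of the subset is already chosen, then train the model
    to add roughly s useful elements (soft, differentiable surrogate of Algorithm 2)."""
    m, rho = candidate_selector(V)                              # stage-1 soft mask, (n,)
    cand = torch.nonzero(m > 0.5, as_tuple=False).squeeze(-1)   # hard candidate indices
    t = int(torch.randint(0, k_max, (1,)))                      # size of the pretend prefix
    perm = cand[torch.randperm(cand.numel())]
    prefix, rest = perm[:t], perm[t:]

    summary = V[prefix].mean(dim=0) if t > 0 else V.new_zeros(V.size(1))
    scores = score_fn(V[rest], summary)                         # positive scores f(x_i, S)
    probs = scores / scores.sum()

    # Relaxed Bernoulli(s * p_i): ~s new elements in expectation, still differentiable.
    rate = (s * probs).clamp(1e-6, 1 - 1e-6)
    u = torch.rand_like(rate).clamp(1e-6, 1 - 1e-6)
    w = torch.sigmoid((torch.log(rate) - torch.log1p(-rate)
                       + torch.log(u) - torch.log1p(-u)) / temperature)

    # Feed the softly weighted selection to the task network and take a gradient step.
    weights = torch.cat([m[prefix], m[rest] * w])
    loss = task_loss(V[torch.cat([prefix, rest])] * weights.unsqueeze(-1), V)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()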

Appendix B Instance Selection Samples

In this section, we show additional examples from our 1D and CelebA experiments of how the models select set elements for the target task.

B.1 1D Function Reconstruction

Figure 8: Reconstruction samples of 1D functions with different selection methods

Figure 8 shows reconstruction samples from our model on the 1D function dataset, which are noticeably better than those of Learning to Sample (LTS) or Random Select (RS). Since RS selects set elements randomly, it can leave out important parts of the 1D curve, leading to incorrect reconstructions. LTS also selects an insufficient number of set elements in some parts of the curves, resulting in suboptimal reconstructions.

B.2 CelebA

Figure 9: Selected pixels for different tasks on CelebA.

Figure 9 shows the selected pixels of our model for both the classification and reconstruction task. For the attribute classification task, the model tends to select pixels mainly from the face, since the task is to classify characteristics of the person. For reconstruction, the selected pixels are more evenly distributed, since the background also contributes significantly to the reconstruction loss.

Appendix C Dataset Distillation: Instance Selection

In Table 4, we present the full results for the instance selection model on the CelebA dataset. For these experiments, we construct a dataset by randomly sampling 200 face images from the full dataset. To evaluate the model, we create multiple such datasets and run the baselines (Random Sampling and FPS) and SSS on the same datasets. The FID metric is then computed on the selected instances and averaged over all the randomly constructed datasets. For FPS, we use the open-source implementation at https://github.com/rusty1s/pytorch_cluster. Further, we provide qualitative results on a single dataset in Figure 10, where we show how our model picks 5 instances from a full set of 200 face images.

Figure 10: Visualization of a set with 200 images for instance selection. The two stage selection method in SSS is visualized as Candidate Set and SSS. A coreset of size 5 is visualized.
Table 4: FID score for varying numbers of selected instances (mean ± std).
#Instances   2                 5                 10                15                20                30
FPS          6.5014 ± 4.3502   4.5098 ± 2.3809   3.0746 ± 1.0979   2.7458 ± 0.6201   2.7118 ± 1.0410   2.2943 ± 0.8010
Random       3.7309 ± 1.1690   1.1575 ± 0.6532   0.8970 ± 0.4867   0.3843 ± 0.2171   0.3877 ± 0.1906   0.1980 ± 0.1080
SSS          2.5307 ± 1.3583   1.0186 ± 0.1982   0.5922 ± 0.3181   0.3331 ± 0.1169   0.2381 ± 0.1153   0.1679 ± 0.0807
Figure 11: Sample visualization of prototype selection for the miniImagenet dataset on the few-shot classification task. Each row represents a set that corresponds to the support from which a prototype is selected for the few-shot classification task.

Appendix D Dataset Distillation: Classification

In Figure 11, we provide visualizations for the instance selection problem as applied to the few-shot classification task. Here, we go from a 20-shot to a 1-shot classification problem, where the prototype is selected from the support set using SSS.