1 Introduction
The recent success of deep learning algorithms partly owes to the availability of huge volumes of data Deng et al. (2009); Krizhevsky et al. (2009); Liu et al. (2015), which enables the training of very large deep neural networks. However, the high dimensionality of each data instance and the large size of datasets make it difficult, especially for resource-limited devices Chan et al. (2018); Li et al. (2019); Bhatia et al. (2019), to store and transfer the dataset, or to perform on-device learning with the data. This problem becomes even more pronounced for nonparametric models such as Neural Processes Garnelo et al. (2018b); Kim et al. (2019a), which require the training dataset to be stored for inference. Therefore, it is appealing to reduce the size of the dataset, both at the instance level Dovrat et al. (2019); Li et al. (2018b) and at the dataset level, such that we select only a small number of samples from the dataset, each of which contains only a few selected input features (e.g., pixels). We could then use the selected subset for the reconstruction of the entire set (either each instance or the entire dataset) or for a prediction task, such as classification.
The simplest way to obtain such a subset is random sampling, but it is highly suboptimal in that it treats all elements in the set equally. In reality, the pixels of each image and the examples of each dataset have varying degrees of importance Katharopoulos and Fleuret (2018) for a target task, whether it is reconstruction or prediction, and thus random sampling will generally incur a large loss of accuracy on the target task. There exists some work on coreset construction Huggins et al. (2016); Campbell and Broderick (2018, 2019), which constructs a small subset of the most important samples for Bayesian posterior inference. However, these methods cannot be applied straightforwardly to deep learning with an arbitrary target task. How can we then sample elements from a given set to construct a subset that suffers minimal accuracy loss on any target task? To this end, we propose to learn a sampler that selects the most important elements for a given task, by training it jointly with the target task.
Specifically, we learn the sampling rate for individual elements in two stages. First, we learn a Bernoulli sampling rate for each element to efficiently screen out less important ones. Then, to select the most important elements out of this candidate set while accounting for their relative importance, we use a categorical distribution to model the conditional distribution of sampling each element given the set of already selected elements. After learning the sampling probabilities for each stage, we can perform stochastic selection of a given set with linear time complexity. Our Stochastic Subset Selection (SSS) method is a general framework for sampling elements from a set, and it can be applied to both feature sampling and instance sampling. SSS can reduce the memory and computation cost required to process data while retaining performance on downstream tasks.
Our model lends itself to a wide range of practical applications. For example, when sending an image to an edge device with low computing power, instead of sending the entire image, we could send a subset of pixels with their coordinates, which reduces both communication and inference cost. Similarly, edge devices may need to perform inference in real time on a huge amount of data that can be represented as a set (e.g., video, point clouds), and our feature selection could be used to speed up the inference. Moreover, our model could also help with on-device learning on personal data (e.g., photos), as it can select examples with which to train the model at a reduced cost. Finally, it can help nonparametric models that require the storage of training examples, such as Neural Processes, scale up to large-scale problems.
We validate our SSS model on multiple datasets for 1D function regression and 2D image reconstruction and classification, for both feature selection and instance selection. The results show that our method is able to select samples with a minimal decrease in target-task accuracy, largely outperforming random sampling and an existing learned sampling method. Our contribution in this work is threefold:

We propose a novel two-stage stochastic subset selection method that learns to sample a subset from a larger set with linear time complexity and minimal loss of accuracy on the downstream task.

We propose a framework that trains the subset selection model via meta-learning, such that it can generalize to unseen tasks.

We validate the efficacy and generality of our model on various datasets, both for feature selection from an instance and for instance selection from a dataset, on which it significantly outperforms relevant baselines.
2 Related Work
Set encoding: permutation-invariant networks
Recently, extensive research efforts have been made in the area of set representation learning, with the goal of obtaining order-invariant (or equivariant) and size-invariant representations. Many works propose simple methods that obtain set representations by applying nonlinear transformations to each element before a pooling layer (e.g., average pooling or max pooling) Ravanbakhsh et al. (2016); Qi et al. (2017b); Zaheer et al. (2017); Sannai et al. (2019). However, these models are known to have limited expressive power and are sometimes incapable of capturing higher moments of distributions. In contrast, approaches such as Stochastic Deep Networks De Bie et al. (2018) and Set Transformer Lee et al. (2018) consider the pairwise (or higher-order) interactions among set elements and hence can capture more complex statistics of the distributions. These methods often achieve higher performance on classification/regression tasks; however, they have run-time complexities of O(n²) or higher in the set size n.
Subset sampling
Several works have been proposed to handle large sets. Dovrat et al. (2019) proposed to learn to sample a subset from a set by generating virtual points and then matching them back to a subset of the original set. However, such an element-generation and matching process is highly inefficient. Our method, on the other hand, only learns to select from the original elements and does not suffer from such overhead. Wang et al. (2018) proposed to distill the knowledge of a large dataset into a small number of artificial data instances. However, these artificial instances are intended only for faster training and do not capture the statistics of the original set. Moreover, since the instances are generated artificially, they can differ from the original set, making the method less applicable to other tasks. Finally, Qi et al. (2017a, c); Li et al. (2018b); Eldar et al. (1997); Moenning and Dodgson (2003) use farthest point sampling (FPS), which selects points from a set by ensuring that the selected samples are far from each other in a given metric space.
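For concreteness, farthest point sampling can be sketched in a few lines. This is a minimal illustration with Euclidean distance, not one of the cited implementations, and it starts deterministically from the first element, whereas in practice the starting point is usually chosen at random:

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedy FPS: repeatedly add the point whose distance to the
    nearest already-selected point is largest (Euclidean metric).
    Index 0 is used as the starting point for determinism."""
    selected = [0]
    # distance from every point to its nearest selected point so far
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))  # farthest from the selected set
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(selected)
```

Each iteration updates the nearest-selected distance in O(n), so selecting k points costs O(nk); this greedy spreading behavior is exactly why FPS favors coverage over representativeness on imbalanced data.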
Image Compression
Due to the huge demand for image and video transfer over the internet, a number of works have attempted to compress images with minimal distortion. These models Toderici et al. (2017); Rippel and Bourdev (2017); Mentzer et al. (2018); Li et al. (2018a) typically consist of an encoder-decoder pair, where the encoder transforms the image into a compact representation to reduce the memory footprint and communication cost, and the decoder is used to reconstruct the image. These methods, while achieving considerable success on the image compression problem, are less flexible than ours. First, our model can be applied to any type of set (and to instances represented as sets), while the aforementioned models mainly work for images represented in tensor form. Furthermore, our method can be applied at both the instance and the dataset level.
3 Approach
Our work is based on the assumption that, for every set X = {x_1, ..., x_n} (in some cases, each set element can be further separated into an input and a target), there exists a small subset which is representative of the entire set. Thus, we aim to learn to select a fixed-size subset S from X (where |S| = k is much smaller than n) such that S can be used as a proxy in evaluating a loss for the entire set (i.e., L(S) ≈ L(X)). We propose to learn the conditional distribution of the subset by minimizing the approximation gap between the subset and the full set through a two-stage selection process (Figure 2). The resulting distribution of S is denoted by p(S | X). Ultimately, we want to minimize the expected loss function with respect to this conditional distribution to select a representative subset, E_{p(S|X)}[L(S)]. When X follows a distribution p(X) over sets in the meta-learning framework, the final objective to minimize is E_{p(X)} E_{p(S|X)}[L(S)].
3.1 Stochastic Subset Selection
To select a compact yet representative subset, we must take into account the dependencies between the elements in the set so as to avoid selecting redundant elements. Processing entire sets in this manner can be computationally infeasible for large sets. Hence, we propose a two-stage selection scheme: we first independently select a manageably sized candidate set without considering inter-sample dependencies (candidate selection), and then select the final subset from the candidate set while considering the dependencies among its elements (autoregressive subset selection).
3.1.1 Candidate Selection
In order to select the candidate set C from the full set X = {x_1, ..., x_n}, we learn a binary mask m_i ∈ {0, 1}, where m_i = 1 means that x_i should be retained in the candidate set (Figure 2a). Specifically, for i = 1, ..., n,

(1)  m_i ~ Bern(ρ_i),  ρ_i = g(x_i, h(X))

where h is a neural network that computes a representation of the whole set and g is a network that produces the probability of selecting x_i. While our method does not restrict the specific form of the permutation-invariant neural network h, we use deep sets Zaheer et al. (2017) in our implementation. To ensure that the whole process is differentiable, we use a continuous relaxation of the Bernoulli distribution Maddison et al. (2016); Gal et al. (2017); Jang et al. (2016) when sampling m_i:

(2)  m_i = σ( (log ρ_i − log(1 − ρ_i) + log u − log(1 − u)) / τ ),  u ~ Uniform(0, 1)
where σ is the sigmoid function and τ is the temperature of the relaxation (fixed across all experiments).
3.1.2 Autoregressive Subset Selection
In the first stage, the mask distributions are conditionally independent: if the set contains two elements that are informative but provide similar information, they can be selected together and introduce redundancy. To alleviate this problem, we introduce an iterative selection procedure that selects one element at a time. Specifically, to select a core subset S (Figure 2b) of size k from the candidate set C, k iterative steps are required. At step t, we select an element x_j from C with probability

(3)  p(x_j | S_t, C) = f(x_j, S_t) / Σ_{x_l ∈ C \ S_t} f(x_l, S_t)

where S_t is the set of elements that have been previously selected and f is a positive function. Note that, when choosing an element, we use a distribution that is conditioned not only on the candidate set but also on all previously chosen elements. This ensures that the final subset has little redundancy. While the goal of the second stage is to eliminate redundancy by selecting one sample at a time from the candidate set, our method also allows for a reduction in computational cost by selecting s samples at a time. Towards this end, we propose a method that samples s elements from C for efficient training. Specifically, instead of sampling s times from the categorical distribution in Eq. (3), we can sample the selection mask for element x_j from Bern(s · p(x_j | S_t, C)). Note that by doing this, the expected number of selected elements is still s, and the probability of element x_j being selected is close to that under the original distribution. At inference time, we can use existing libraries such as NumPy Oliphant (2006) to select s elements at once (without replacement) to obtain s distinct elements.
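Drawing s distinct elements in one shot according to the selection probabilities can be done with NumPy's Generator.choice. A minimal sketch, in which the probabilities stand in for the normalized scores of Equation (3):

```python
import numpy as np

def sample_without_replacement(candidates, probs, s, rng):
    """Draw s distinct elements from the candidate set according to
    the (normalized) selection probabilities."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()  # normalize scores into a categorical distribution
    idx = rng.choice(len(candidates), size=s, replace=False, p=p)
    return [candidates[i] for i in idx]
```

Sampling indices rather than elements keeps the routine applicable to any element type (pixels with coordinates, images, point-cloud points).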
Input: k (subset size), s (# elements selected at each iteration), X (full set)
Output: S (selected subset)
Algorithm 1 shows the entire subset selection procedure. The inference complexity depends on the choice of the function f. If we choose f as a function that considers pairwise interactions between candidate elements and all selected elements, the inference complexity is linear in n, c and k, where n, c and k are the sizes of the original set, the candidate set and the final subset, respectively.
3.1.3 Greedy Training Algorithm
During training, it might be desirable not to run all iterations of the second selection stage. We therefore propose a greedy training algorithm: in each minibatch, we sample a number r and a random r-element subset of the candidate set. The autoregressive model is then trained to select an s-element subset from the remaining candidates such that the resulting selection is optimal for the output task (i.e., minimizes the task loss). By doing this, we only need to run one forward pass and one backward pass of the autoregressive model during training. During inference, we can run the autoregressive model iteratively to select a fixed-size subset. We empirically found that the greedy training algorithm works well in practice (Sections 4.1 and 4.2).
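The inner selection of this greedy step reduces to taking the s best-scoring candidates that are not yet selected. A minimal sketch with placeholder scores; in the actual model, the scores would come from the learned positive function f, conditioned on the already-selected elements:

```python
import numpy as np

def greedy_step(scores, selected, s):
    """One greedy selection step: return the indices of the s
    highest-scoring candidates not already in `selected`.
    `scores` stands in for f evaluated on each candidate."""
    scores = np.asarray(scores, dtype=float)
    mask = np.ones(len(scores), dtype=bool)
    mask[list(selected)] = False          # exclude already-selected elements
    remaining = np.flatnonzero(mask)
    top = remaining[np.argsort(scores[remaining])[::-1][:s]]
    return sorted(top.tolist())
```

During training, only this single step is run per minibatch (one forward and one backward pass); at inference, the step is repeated until k elements are selected.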
3.1.4 Constraining the Candidate Set Size
From the complexity analysis in Section 3.1.2, our model's inference complexity scales linearly with the candidate set size c. In some cases, it is therefore desirable to restrict the size of the candidate set to save computational cost in the second-stage selection. We adopt the idea of the information bottleneck and constrain the distribution of the mask m for the candidate set. Specifically, we minimize:

(4)  E_{p(S|X)}[ L(S) ] + β Σ_i KL( Bern(ρ_i) || p(m_i) )

where β is a hyperparameter and p(m_i) is a prior on the mask m_i. We can set p(m_i) to a desirable prior (e.g., a sparse prior when we want the candidate set to be small).
Suppose that p(m_i) = Bern(π) (π is small in the case of a sparse prior; we set π to 0.1 or 0.01, depending on the experiment); the KL term can then be computed as:

(5)  KL( Bern(ρ_i) || Bern(π) ) = ρ_i log(ρ_i / π) + (1 − ρ_i) log((1 − ρ_i) / (1 − π))
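The closed form in Equation (5) is easy to verify numerically; a minimal sketch, not the paper's code:

```python
import numpy as np

def bernoulli_kl(rho, pi):
    """KL( Bern(rho) || Bern(pi) ), computed elementwise.
    With a sparse prior (small pi), large selection probabilities rho
    incur a large penalty, shrinking the expected candidate set size."""
    rho, pi = np.asarray(rho, dtype=float), np.asarray(pi, dtype=float)
    return rho * np.log(rho / pi) + (1 - rho) * np.log((1 - rho) / (1 - pi))
```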
Note that in cases where we have sufficient memory at inference time, we do not need to constrain the size of the candidate set. In that case, we can simply set β to zero, and the model can freely learn the best distribution of the candidate set that helps the second-stage selection minimize the task loss.
3.2 The Task Network with Loss Function
Depending on the target task, we can design the task network with a specific loss function. In this section, we describe some common tasks.
Set Reconstruction Given the subset S selected from X, a task network parameterized by θ is used to reconstruct the original set X. The loss function in this case is the negative conditional log-likelihood −log p_θ(X | S). The overall objective function is then given as:

(6)  minimize E_{p(S|X)}[ −log p_θ(X | S) ]
By minimizing this objective, we learn to select a compact subset that is most representative of the original set, similar in spirit to a variational autoencoder (VAE); the subset can then be used for other tasks. We choose the Attentive Neural Process (ANP) Kim et al. (2019b) as the task network. The ANP is a member of the Neural Process (NP) family Garnelo et al. (2018a, b). An NP is a type of network that takes as input a context set (S in our case) and predicts the distribution of the elements in the original set (X). It mimics the behavior of Gaussian processes, but reduces the complexity of inference.
Set Classification/Prediction We can also opt to train the network to predict a single target y for a set X. For example, the target could be the class of an image (classification) or a statistic of the set (regression). For this, we use a network parameterized by θ to predict the target y. The loss then becomes the negative log-likelihood −log p_θ(y | S), and the overall objective function is:

(7)  minimize E_{p(S|X)}[ −log p_θ(y | S) ]
3.3 Dataset Distillation
Instance Selection For this task, we are given a collection of datasets D, where each element X ∈ D is a set of data points sampled from the entire dataset. Using CelebA as an illustrative example, each X may consist of randomly sampled faces from the whole dataset. Given such a collection, we apply SSS to learn to select representative instances from each dataset in D. To achieve this, we require a model capable of taking the sets in D as input and performing a task such as the reconstruction of all the elements in a given dataset.
Given the collection of datasets D, we consider a single dataset X on which we apply our SSS model to learn the selection of representative elements S, which can then be used to reconstruct all the elements of X. In essence, each dataset is distilled into a small set of elements selected according to our SSS, and the task is then to reconstruct the entire set conditioned on the selected subset. The first step in this process is the computation, for each element of the dataset, of a representative vector c, akin to the statistics network used in the Neural Statistician Edwards and Storkey (2016) model. Specifically, to generate an element x given S, c is computed by applying a stochastic cross-attention mechanism to S, where the stochasticity is supplied by a latent query variable a. To obtain varying styles in the generated images, we additionally learn a latent variable z used to perturb c, and the two variables are combined to generate a new element. The graphical model for this process is depicted in Figure 2(c). Additionally, to ensure that z is properly learnt, we add an informativity loss by reconstructing z from the generated samples for a given dataset. The objective for the model depicted in Figure 2(c) for a single dataset X is:

(8)  Σ_{x ∈ X} ( E[ −log p(x | c, z) ] + KL( q(z | x) || p(z) ) + KL( q(a | x) || p(a) ) )

where p(z) and p(a) are priors on their respective latent variables. All priors are chosen to be Gaussian with zero mean and unit variance. As noted above, to generate a sample in X we must supply a query to the cross-attention module applied to S, and this is modeled through the stochastic variable a. This objective is combined with the informativity loss on all samples in X. It is important to note that c is computed using only S for all elements in the original dataset X. Thus, in addition to Equation 8 and the informativity loss, the model is optimized together with the subset selection model described in the previous section. When this model is fully trained, we obtain an SSS model that can then be applied to the instance selection task on a given dataset. In essence, the purpose of the generative model introduced here is to train the subset selection module.
Classification Finally, in the dataset distillation setting, we consider the problem of selecting prototypes to be used for few-shot classification. Here, we adopt the few-shot classification framework of Prototypical Networks Snell et al. (2017) and apply our SSS model to the task of selecting representative prototypes from each class to be used for classifying new instances. By learning to select the prototypes, we can remove outliers that would otherwise shift the class decision boundaries in the classification task. The graphical model including our SSS model is depicted in Figure 2(d), where S now corresponds to the selected prototypes and a and y correspond to the query and class label, respectively.
4 Experiments
In this section, we present our experimental results.
Figure 3: Models' performance on the set reconstruction and classification tasks (lighter-colored areas show standard deviations). (a) 1D function reconstruction. (b) CelebA (reconstruction). (c) CelebA (classification).
4.1 Feature Selection Experiments
Function Reconstruction / Approximation Our first experiment is on 1D function reconstruction. Suppose that we have a function f. We first construct a set of data points of that function, X = {(x_i, f(x_i))}, where the x_i are uniformly distributed along the x-axis within a fixed interval. A family of functions then leads to a family of such sets. We train our model, which consists of the subset selection model and a task network (e.g., an ANP), on this data and report the reconstruction loss, which is the negative log-likelihood.
Here, we compare our reconstruction performance with multiple baselines, namely Random Select (randomly selects a subset of X and uses an ANP to reconstruct the set) and Learning to Sample Dovrat et al. (2019) (uses the method of Dovrat et al. (2019) to sample k elements and an ANP to reconstruct the set). Figure 3(a) shows the performance (reconstruction loss) of our model and the baselines. SSS outperforms Random Select (RS), verifying that the subset selection model learns a meaningful distribution over the selected elements. Our model also outperforms the Learning to Sample (LTS) baseline.
Through the visualization of the selected points in Figure 5, we can see that our model tends to pick more points (shown as red dots) in the rapidly changing parts of the curve, which is understandable since these parts are harder to reconstruct. The two baselines sometimes fail to do so, which leads to inaccurate reconstructions.
Image Reconstruction Similar to the case of function reconstruction, given an image we learn to select a core subset of pixels that best reconstructs the original image. Here, each input coordinate is 2-dimensional and each output is 3-dimensional for RGB images. We use an ANP to reconstruct the remaining pixels from a set of context elements (the selected subset in our case). We conduct this experiment on the CelebA dataset Liu et al. (2018). Figure 3(b) shows that our model significantly outperforms an ANP with random subsets (as in the original ANP paper) and the Learning to Sample baseline. Figure 6 shows reconstruction samples from our model, which are qualitatively better than the reconstructions of the other baselines (with the same number of pixels).
4.2 Classification/Regression
In this subsection, we validate our model on the prediction task. The goal is to learn to select a subset for a target task such as classification or regression. We again use the CelebA dataset, but this time the selected pixels are used to predict 40 attributes of a celebrity face (in a multi-task learning setting). We use a stack of CNN layers for prediction in all models. For our proposed model, only the selected pixels are used for prediction (the values of all other pixels are set to zero). Using only 500 pixels (1.3% of all pixels), our model achieves a mean AUC of 0.9093 (99.3% of the performance with the original image). Figure 3(c) shows the classification performance (mean AUC) versus the number of pixels selected. The AUC with pixels selected by our subset selection model is significantly higher than that of the random-pixels baseline, showing the effectiveness of our subset selection method. We also include another baseline: our stochastic subset selection model trained with the reconstruction loss and only later used for classification. Our model outperforms this variant, showing the benefit of training with the target task. Note that Learning to Sample cannot be applied straightforwardly to this experimental setup: during training, the generated virtual points cannot be converted back to an image in matrix form (due to their virtual coordinates), so we cannot train the Learning to Sample model with CNN-based classification on the target task.
Ablation Study Since our method is stochastic, the predictive distribution can be written as p(y | X) = E_{p(S|X)}[ p(y | S) ], and we can use Monte Carlo (MC) sampling to obtain the prediction in practice. Throughout the experiment section, however, we reported results with a single sampled subset, since this gives the best reduction in memory and computational cost; it can be seen as MC sampling with one sample. We compare it against another variant: SSS with MC sampling (5 samples). It should be noted that with 5 MC samples, the inference cost is increased 5 times, and the memory requirement can be increased by up to 5 times as well. Our single-sample model achieves performance comparable to that variant, justifying that it retains good target-task performance while reducing memory and computation requirements.

Table 3: Few-shot classification accuracy on miniImageNet with the selected prototypes.

#Instances   1       2       5       10
FPS          0.432   0.501   0.598   0.636
Random       0.444   0.525   0.618   0.663
SSS          0.475   0.545   0.625   0.664

Table 4: FID scores for instance selection on CelebA (lower is better).

#Instances   2      5      10     15     20     30
FPS          6.50   4.51   3.07   2.75   2.71   2.29
Random       3.73   1.16   0.90   0.38   0.39   0.20
SSS          2.53   1.02   0.59   0.33   0.24   0.17
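The MC-sampling variant above simply averages the task network's predictive distribution over several independently sampled subsets. A minimal sketch, where predict_fn and sample_subset are hypothetical stand-ins for the trained task network and the learned selector:

```python
import numpy as np

def mc_predict(predict_fn, sample_subset, x, n_samples=5):
    """Monte Carlo prediction: average the predictive distribution
    over several stochastically sampled subsets of x."""
    probs = [predict_fn(sample_subset(x)) for _ in range(n_samples)]
    return np.mean(probs, axis=0)
```

With n_samples = 1 this reduces to the single-subset prediction reported in the experiments; larger n_samples trades memory and compute for a smoother estimate.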
4.3 Dataset Distillation
Instance Selection We present results on the instance selection task applied to a whole dataset. In this task, we use the CelebA Liu et al. (2015) dataset, since it is imbalanced in terms of both gender and race. A dataset is constructed by sampling 200 random images from the full dataset. In this experiment, we seek to select only a few (5 to 30) representative images from these randomly created datasets. Our subset selection module is trained via the procedure detailed in Section 3.3. To evaluate the effectiveness of the stochastic subset selection model, we measure the diversity of the selected subset using the Fréchet Inception Distance (FID) Heusel et al. (2017), which measures the similarity and diversity between two datasets. We compare our model with one that randomly samples instances from the full dataset. Additionally, we compare with the Farthest Point Sampling (FPS) algorithm, which selects points from a given set by computing distances in a metric space between all elements and selecting those that are farthest from each other. FPS in general seeks to obtain wide coverage over a given set and hence is a suitable baseline. The results of this experiment are presented in Table 4, where our selection method achieves a lower FID score than FPS and Random Sampling. Given that the dataset is highly imbalanced, FPS performs worst: by selecting the farthest elements in the given set, it cannot capture the true composition of the whole dataset, even compared with Random Sampling. For small selected samples in particular, our method outperforms FPS and Random Sampling, since it models the interactions within the full dataset and hence can select the most representative subset.
Classification We use the miniImageNet dataset Vinyals et al. (2016) and go from a 20-shot classification task to a 1-, 2-, 5- or 10-shot classification task. We again compare with Random Sampling and FPS, applying each alongside SSS for the reduction in the number of shots. The results for this experiment are shown in Table 3, where it can be observed that SSS learns to select more representative prototypes than the other methods, especially in the lower-shot problems where the choice of prototypes matters most. All models were trained for 300 epochs, and the best model was picked using a validation set.
5 Conclusion
In this paper, we have proposed a stochastic subset selection method that reduces the size of an arbitrary set while preserving performance on a target task. Our selection method utilizes a Bernoulli mask to perform candidate selection, and a stack of categorical distributions to iteratively select a core subset from the candidate set. As a result, the selection process takes the dependencies among the set's members into account, and hence it can select a compact subset without redundancy. By using the compact subset in place of the original set for a target task, we can save memory, communication and computational cost. We hope that this can facilitate the use of machine learning algorithms in resource-limited systems such as mobile and embedded devices.
References

[1] (2019) Resource constrained deep reinforcement learning. In Proceedings of the International Conference on Automated Planning and Scheduling, Vol. 29, pp. 610–620.
[2] (2018) Bayesian coreset construction via greedy iterative geodesic ascent. arXiv preprint arXiv:1802.01737.
[3] (2019) Automated scalable Bayesian inference via Hilbert coresets. The Journal of Machine Learning Research 20 (1), pp. 551–588.
[4] (2018) Learning network architectures of deep CNNs under resource constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1703–1710.
[5] (2018) Stochastic deep networks. arXiv preprint arXiv:1811.07429.
[6] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
[7] (2019) Learning to sample. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2760–2769.
[8] (2016) Towards a neural statistician. arXiv preprint arXiv:1606.02185.
[9] (1997) The farthest point strategy for progressive image sampling. IEEE Transactions on Image Processing 6 (9), pp. 1305–1315.
[10] (2017) Concrete dropout. In Advances in Neural Information Processing Systems, pp. 3581–3590.
[11] (2018) Conditional neural processes. arXiv preprint arXiv:1807.01613.
[12] (2018) Neural processes. arXiv preprint arXiv:1807.01622.
[13] (1973) Neural processes in thermoregulation. Physiological Reviews 53 (4), pp. 948–1017.
[14] (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637.
[15] (2016) Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems, pp. 4080–4088.
[16] (2016) Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.
[17] (2018) Not all samples are created equal: deep learning with importance sampling. arXiv preprint arXiv:1803.00942.
[18] (2019) Attentive neural processes. arXiv preprint arXiv:1901.05761.
[19] (2019) Attentive neural processes. arXiv preprint arXiv:1901.05761.
[20] (2009) CIFAR-10 and CIFAR-100 datasets. URL: https://www.cs.toronto.edu/~kriz/cifar.html.
[21] (2018) Set transformer. arXiv preprint arXiv:1810.00825.
[22] (2019) Budgeted training: rethinking deep neural network training under resource constraints. arXiv preprint arXiv:1905.04753.
[23] (2018) Learning convolutional networks for content-weighted image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3214–3223.
[24] (2018) PointCNN: convolution on X-transformed points. In Advances in Neural Information Processing Systems, pp. 820–830.
[25] (2015) Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV).
[26] (2018) Large-scale CelebFaces Attributes (CelebA) dataset. Retrieved August 15, 2018.
[27] (2016) The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
[28] (2018) Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4394–4402.
[29] (2003) Fast marching farthest point sampling. Technical report, University of Cambridge, Computer Laboratory.
[30] (2006) A guide to NumPy. Vol. 1, Trelgol Publishing, USA.
[31] (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
[32] (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
[33] (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108.
[34] (2016) Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500.
[35] (2017) Real-time adaptive image compression. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 2922–2930.
[36] (2019) Universal approximations of permutation invariant/equivariant functions by deep neural networks. arXiv preprint arXiv:1903.01939.
[37] (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087.
[38] (2017) Full resolution image compression with recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5306–5314.
[39] (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638.
[40] (2018) Dataset distillation. arXiv preprint arXiv:1811.10959.
[41] (2017) Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401.
Appendix A Greedy Training Algorithm
Algorithm 2 shows our greedy training algorithm with stochastic gradient descent. The idea of the greedy training algorithm is to train the autoregressive model to select the best next s elements from the candidate set so as to minimize the target loss on the selected samples. By doing this, we do not have to run the autoregressive model k times during training, thus reducing the computational cost.

Input: k (max subset size)
       s (# elements selected at each iteration)
       p(X) (distribution of sets)
       learning rate
       a target task with loss function L
Output: trained model with converged parameters
Appendix B Instance Selection Samples
In this section, we show more examples of our 1D and CelebA experiments on how the models select the set elements for the target task.
B.1 1D Function Reconstruction
Figure 8 shows reconstruction samples of our model on the 1D function dataset, which are qualitatively better than those of Learning to Sample (LTS) or Random Subset (RS). Since RS selects the set elements randomly, it can leave out important parts of the 1D curve, leading to incorrect reconstructions. LTS also selects an insufficient number of set elements in some parts of the curves, resulting in suboptimal reconstructions.
B.2 CelebA
Figure 9 shows the selected pixels of our model for both the classification and reconstruction task. For the attribute classification task, the model tends to select pixels mainly from the face, since the task is to classify characteristics of the person. For reconstruction, the selected pixels are more evenly distributed, since the background also contributes significantly to the reconstruction loss.
Appendix C Dataset Distillation: Instance Selection
In Table 4, we present the full results for the instance selection model on the CelebA dataset. For these experiments, we construct a set by randomly sampling 200 face images from the full dataset. To evaluate the model, we create multiple such datasets and run the baselines (Random Sampling and FPS) and SSS on the same datasets. The FID metric is then computed on the selected instances and averaged over all the randomly constructed datasets. For FPS, we use the open-source implementation at https://github.com/rusty1s/pytorch_cluster. Further, we provide qualitative results on a single dataset in Figure 10, where we show how our model picks 5 instances from the full set of 200 face images.
Appendix D Dataset Distillation: Classification
In Figure 11, we provide visualizations for the instance selection problem as applied to the few-shot classification task. Here, we go from a 20-shot to a 1-shot classification problem, where the prototype is selected from the support set using SSS.