Meta-Learning One-Class Classification with DeepSets: Application in the Milky Way

July 8, 2020 ∙ Ademola Oladosu et al. ∙ New York University / Simons Foundation

We explore in this paper the use of neural networks designed for point-clouds and sets on a new meta-learning task. We present experiments on the astronomical challenge of characterizing the stellar population of stellar streams. Stellar streams are elongated structures of stars in the outskirts of the Milky Way that form when a (small) galaxy breaks up under the Milky Way's gravitational force. We consider that we obtain, for each stream, a small 'support set' of stars that belong to this stream. We aim to predict whether the other stars in that region of the sky are from that stream or not, similar to one-class classification. Each "stream task" could also be transformed into a binary classification problem in a highly imbalanced regime (or supervised anomaly detection) by using the much bigger set of "other" stars and considering them as noisy negative examples. We propose to study the problem in the meta-learning regime: we expect that we can learn general information on characterizing a stream's stellar population by meta-learning across several streams in a fully supervised regime, and transfer it to new streams using only positive supervision. We present a novel use of Deep Sets, a model developed for point-clouds and sets, trained in a fully supervised meta-learning regime and evaluated in a one-class classification setting. We compare it against Random Forests (with and without self-labeling) in the classic setting of binary classification, retrained for each task. We show that our method outperforms the Random Forests even though the Deep Sets model is not retrained on the new tasks and accesses only a small part of the data compared to the Random Forests. We also show that the model performs well on a real-life stream when additional fine-tuning is included.


1 Introduction

Stellar streams are groups of co-moving stars that orbit a galaxy (such as our own, the Milky Way) and are thought to originate in smaller galaxies and star clusters. These smaller galaxies and star clusters are deformed and pulled apart by the differential gravitational field (the "tidal field") of the larger galaxy around which they orbit. The collection of stars pulled out of these smaller systems is stretched by this tidal field and forms linear, "spaghetti-like" structures of stars that continue to orbit coherently for (typically) several orbits around their parent galaxy. Figure 1 shows an artistic visualization of remnants of satellite galaxies wrapping around a bigger galaxy like the Milky Way. Astronomers detect stellar streams in the Milky Way using large-area sky surveys that provide photometric (imaging), astrometric (sky motion), and radial velocity (line-of-sight motion) data. Typically, searches are performed by making cuts in specific regions of feature space, and actual detection is usually done by visual inspection of 2D visualizations, searching for over-densities that are stream-shaped in the selected feature-space region. More recently, automated approaches have also been explored, e.g., Carlin (2016), but even these methods typically require a significant amount of visual inspection and validation.

These streams are immensely useful tools for astronomers and astrophysicists: they are one of the most promising avenues for uncovering the (astrophysical) nature of dark matter and for inferring the accretion history of the Galaxy Küpper et al. (2015); Gibson (2007); others (2018a). More specifically, substructures within streams are of particular interest: for instance, gaps and spurs within a stream can serve as signatures of interactions with clumps of (otherwise invisible) dark matter Banik et al. (2018); others (2018b). Therefore, beyond the sole detection of streams, having a reliable catalog of a stream's stellar population (i.e., determining which stars belong to a given stream) is also extremely important, to allow astronomers to analyze the internal substructure or population characteristics of the stream.

We therefore focus here on the problem of using machine learning methods to characterize the stellar population of individual streams. We consider that we have access, for a given stream, to a small support set of stars, i.e., positive examples that we are confident belong to this stream, e.g., selected in a clearly over-dense region. We also consider a possible negative set, using (a subsample of) the remaining stars: this negative set can be noisy, as part of those stars will actually belong to the stream, which is especially harmful when trying to detect lower-density regions of the stream that lie on the 'true' decision boundary (see Section 2 for more details).
Our goal is to obtain, for each remaining star, a measure characterizing its membership in the stream.
We propose to approach this problem in a meta-learning setting. Indeed, we expect that the "membership" function should share general "principles" across the streams given their support set, i.e., each stream characterization is a similar problem that only differs in its support set. This is a similar motivation to meta-learning methods for few-shot learning, where one aims at building a model that meta-learns how to distinguish different classes of instances using only a few examples of each (e.g., Koch et al. (2015)). However, in this application, our 'small' set of positive examples can be larger than in the usual few-shot setting (e.g., from ten up to a couple hundred examples). The negative set is larger still (e.g., 150 times more 'negatives' than positives).

In light of this, this paper presents the following contributions:

  • A novel use of Deep Sets Zaheer et al. (2017), a neural network model dedicated to point-clouds and sets, in a meta-learning framework.

  • An experimental protocol to meta-learn a one-class Deep Sets classifier that takes as input only a small support set of ’positive examples’ and the example to classify. While the model is trained in the supervised anomaly detection regime (with full supervision), at test time for new tasks/streams, it only accesses the positive set of examples (stream’s stars).

  • Experiments on a novel application to astronomical data, with a dataset of synthetic streams immersed in real data (for which we have ground truth), and a real stream (GD-1) dataset, for which we have an exhaustive catalog.

As a baseline, we compare our model to a classical machine-learning method (Random Forest) trained in the binary classification setting (each stream is considered as a separate classification problem). We use the set of problems to optimize potential hyper-parameters in a meta-fashion. We also extend this baseline with a self-labeling process to increase the size of the support set used in training. We show that our meta-learned model outperforms the Random Forests (with or without self-labeling) on the synthetic streams, even though our model has access to less data (only positive examples) per task at test time. For GD-1, we see a performance drop when using the model 'out-of-the-box,' but a simple fine-tuning (using positive and noisy negative data) efficiently allows us to outperform the baselines.

We provide details about the data and the problem in Section 2. The Deep Sets-based model is described in Section 3. We review related works in Section 4. We present experiments and results in Section 5 and close with discussion in Section 6.

2 Data

This section presents the data used in this paper. Formatted datasets will be available on Github.

Gaia data

Gaia is a space observatory designed for astrometry. This mission has produced a dataset of stars in our Milky Way of unprecedented caliber. The spacecraft measures the positions, distances and motions of stars in the Milky Way brighter than magnitude 20, which represents approximately 1% of the Milky Way population. The mission also conducts spectrophotometric measurements, providing detailed physical properties of the observed stars, such as luminosity and elemental composition. Recently, the second dataset of observations (Gaia DR-2, others (2019)) has been released. It contains measurements of the five-parameter astrometric solution – positions on the sky, parallaxes, and proper motions in two dimensions – and photometry (colors) in three bands for more than 1.3 billion sources.

Features

We describe briefly here the 10 features used in our data, and how they potentially help to characterize streams’ stellar populations.

  • RA-DEC Position in the sky: Positions of the stars projected in the Equatorial Coordinate System, in two dimensions: RA is the barycentric right ascension; DEC is the barycentric declination. The characteristic shape of the stream will be observable in this 2D space. Figure 2 shows a synthetic stream and the foreground stars in that space, subsampled to a ratio of 150 to 1. The upper plot illustrates how one can detect a stream through possible over-densities. In the lower plot, we highlight the actual stream’s stars in red, which illustrates how part of the stream can ’disappear’ in the foreground.

  • Proper motions: Motion of an astronomical object relative to the Sun's frame of reference: pmRA is the proper motion in the direction of right ascension; pmDEC is the proper motion in the direction of declination. The stream's stars will also be structured in this 2D space as they share common motion properties from their orbit around the Milky Way, as illustrated in Figure 3.

  • Colors: Each star in the data set has several photometric features: g, g_bp, and g_rp are the mean absolute magnitudes (brightness of a star as seen from a distance of 10 parsecs) for the green band, the green band minus the red band, and the green band minus the blue band, respectively. These features are indirect indicators of the potential age and composition of a star. There exists a (non-linear) relationship between stars' ages and observed colors called an isochrone. See the Supplementary Material for additional information.

  • Angular coordinates: We additionally use the angular velocity coordinates (2D—just direction) of each star, which combines proper motions and the equatorial coordinates. Essentially, these angles represent the great circle along which the star is moving across the sky.

Figure 1: Artistic illustration of stellar streams wrapping around a galaxy. Image credit: NASA / JPL-Caltech / R. Hurt, SSC & Caltech.
Figure 2: Upper plot shows a synthetic stream and its foreground (ratio 1:150) in RA-DEC space (2D projection of the sky). Bottom plot highlights all the stream's stars in red.
Figure 3: A synthetic stream and its foreground (ratio 1:150) in proper motion (pmRA-pmDEC) feature space, zoomed in on the stream's stars. Bottom plot highlights all the stream's stars in red.

Synthetic streams:

Currently, there is no extensive catalog of 'ground-truth' stream stellar populations: while several streams have been detected (using previous missions, or within Gaia), cataloging each star as belonging to a specific stream or not has not been done. We propose to use in this paper a set of synthetic streams to train, validate and compare our methods. These synthetic streams also help us alleviate the difficulty of building a useful negative set for real streams, by allowing us to meta-train a model in a fully supervised setting but dedicated to one-class classification. The streams are generated by simulating a collection of star clusters as they orbit the Milky Way. In detail, star 'particles' are ejected from a mock (massive) star cluster, and the orbits of the individual star particles are then computed accounting for the mass distribution of the Galaxy along with the mass of the parent star cluster. The star cluster orbits are randomly sampled to match a plausible initial radial distribution of star clusters born in or accreted into the Milky Way. These simulations are performed with the Gala galactic dynamics code Price-Whelan (2017).
After evolving the orbits of the star particles ejected from all of the individual star clusters, the final state of these simulations is a set of synthetic stellar streams: positions and velocities in a Galactocentric reference frame for all star particles in all synthetic streams. We then transform these positions and velocities to heliocentric quantities and mock-observe the star particles to mimic the selection function and noise properties of sources in Gaia DR-2. These streams are then superimposed over the real Gaia data: because the streams are generated so as to orbit around our actual galaxy, we can mimic the "foreground" we would observe if those streams were real (i.e., in terms of positions in the sky / equatorial coordinates). Thus, we can generate realistic datasets composed of real data from Gaia and synthetic streams, where we have supervision (ground truth) for all stars. Each stream dataset is generated by selecting a random window in RA-DEC that contains part of the stream, and injecting real foreground stars from Gaia to a ratio of up to 1:150 (estimated based on the ratio in known streams after astronomically relevant cuts are performed), with the galactic disk removed. Additional information on the data is given in the supplementary material section.

GD-1:

We also show results for one real stream for which we have an exhaustive catalog based on astronomical cuts, GD-1 Price-Whelan and Bonaca (2018); others (2018b). This stream presents interesting substructures (a gap and a spur) and will be useful to further analyze the ability of the methods to preserve these structures.

3 Deep Sets to Meta-Learn One-Class Classification

Let us first highlight the particularities of the problem we propose to address. Similar to few-shot learning and other meta-learning approaches, we consider that we have access to a dataset of 'training tasks' with supervision. Our goal is to build a model that will be able to predict on new tasks, unseen before but of a similar nature, using a 'small' set of supervised examples for the task. However, unlike few-shot learning, our set of positive supervised examples (support set) can span a larger range of sizes than the usual few-shot setting (e.g., between ten and hundreds of examples). The potential negative set is even larger (on the order of tens or hundreds of thousands of examples), and can be quite noisy or tricky to build: while gathering obvious negative examples is trivial for astronomers (e.g., stars far away from the stream in RA-DEC space), such examples will likely be uninformative for the classifier. Labeling informative negative examples (close to the actual decision boundary) is much harder, as that is precisely the goal of the task at hand. One could also use "all the remaining examples" and label them as negative, which leads to a large but very noisy negative set, especially near the actual decision boundary. Therefore, we propose to develop a model that is able to meta-learn one-class classification, so as to be usable at test time potentially without a negative dataset (we note that a (noisy) negative set can still be used to fine-tune the model on a new task, see Sec. 5 on GD-1). The meta-learning will, however, be fully supervised with both positive and negative examples (made possible in our case by the synthetic streams dataset). To summarize, our setting can be defined as (meta) supervised anomaly detection during training, and (meta) one-class classification during testing (our 'one-class' would usually be considered the 'anomalous' class in the literature, given its size compared to the 'main' negative class).

The design of our approach follows a similar motivation to representation and metric learning-based methods designed for meta-learning (see Section 4 for details). However, we propose to use methods that are designed specifically for point-clouds and sets. These methods, such as Qi et al. (2017a, b); Zaheer et al. (2017), have been developed to handle inputs that are sets, i.e., unordered sequences. They are generally used on 3D point-clouds, to solve tasks such as point-cloud classification (e.g., the model receives a surface mapping of an object as a set of points in 3D space, and should predict which object it is) or segmentation (e.g., similarly, a surface mapping of a scene with various objects). While these models have been used mostly on 3D point-clouds or meshes, one could see a dataset of instances as, in itself, a point-cloud.
We propose to use such methods in a meta-learning setting. For each separate task, we propose to consider the support set (positive supervised examples) as the point-cloud taken as input by the model. Additionally, we integrate into the 'point-cloud' the example we want to classify, by concatenating it to each element of the set (i.e., doubling the feature dimension of the cloud's elements). Intuitively, we want our model to learn a representation of the support set conditioned on the current example (or vice versa) that is useful to classify the example as belonging to the class of the set or not.

More specifically, we use the Deep Sets model Zaheer et al. (2017). The model is composed of two networks, one using equivariant layers to build a fixed-size representation of the input set, and a secondary network that takes the set representation as input and is optimized to predict the target. Let us consider an example to classify $x$, within a task $t$. The task is associated with a support set of positive examples $S_t = \{s_1, \dots, s_K\}$. We build a set of instances $X = \{[s_k; x]\}_{k=1}^{K}$ by concatenating $x$ to each support element $s_k$ in $S_t$. This set of instances is passed through several layers of equivariant functions $f$. From a practical point of view, each equivariant layer can be considered as a combination of two fully connected linear layers $\Lambda$ and $\Gamma$ that receive an input set as a batch, i.e., $\Lambda$ and $\Gamma$ are used in a recurrent fashion for all elements of the input set (either the original input set or the outputs of the previous layer). More precisely, for a given batch/set of instances $X = \{x_1, \dots, x_K\}$, we use $f(x_i) = \sigma\left(\Lambda(x_i) - \Gamma(\bar{x})\right)$, with $\bar{x} = \frac{1}{K}\sum_{j=1}^{K} x_j$, where $\bar{x}$ is the mean vector of the set $X$ over its instances and $\sigma$ is an Exponential Linear Unit (ELU) activation function. Then, the outputs of the last equivariant layer are averaged to build the final set representation.
The set representation is then passed through a secondary network to predict the target $y$. In our experiments, this secondary network is composed of two hidden layers with ELU activation functions.
The resulting architecture can be optimized with classic optimization techniques and classification losses. We use a cross-entropy loss, with a hyper-parameter weight to handle the imbalanced data, and the Adam optimizer Kingma and Ba (2014).
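To make the architecture concrete, the following is a minimal PyTorch sketch of the model described above, in the spirit of the Deep Sets implementation we build upon. The class names (`EquivariantLayer`, `MetaDeepSets`), the exact layer arrangement and the pooling details are our illustrative assumptions, not the released code:

```python
import torch
import torch.nn as nn

class EquivariantLayer(nn.Module):
    """Permutation-equivariant layer: a per-element linear map combined with a
    linear map of the set mean, followed by an ELU non-linearity."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.Lambda = nn.Linear(d_in, d_out)   # applied to each set element
        self.Gamma = nn.Linear(d_in, d_out)    # applied to the set mean
        self.act = nn.ELU()

    def forward(self, x):                      # x: (batch, set_size, d_in)
        x_mean = x.mean(dim=1, keepdim=True)   # mean over the set's instances
        return self.act(self.Lambda(x) - self.Gamma(x_mean))

class MetaDeepSets(nn.Module):
    """Deep Sets encoder over {[support_k ; query]} pairs, followed by an MLP classifier."""
    def __init__(self, d_feat=10, d_hidden=100, n_layers=5):
        super().__init__()
        dims = [2 * d_feat] + [d_hidden] * n_layers   # query concatenated to each support star
        self.encoder = nn.Sequential(
            *[EquivariantLayer(dims[i], dims[i + 1]) for i in range(n_layers)]
        )
        self.classifier = nn.Sequential(               # secondary network (two hidden layers)
            nn.Linear(d_hidden, d_hidden), nn.ELU(),
            nn.Linear(d_hidden, d_hidden), nn.ELU(),
            nn.Linear(d_hidden, 2),
        )

    def forward(self, query, support):
        # query: (batch, d_feat); support: (batch, set_size, d_feat)
        q = query.unsqueeze(1).expand(-1, support.size(1), -1)
        pairs = torch.cat([support, q], dim=-1)        # (batch, set_size, 2 * d_feat)
        set_repr = self.encoder(pairs).mean(dim=1)     # average-pool into a set representation
        return self.classifier(set_repr)               # logits: foreground vs. stream
```

At training time, the returned logits are fed to the weighted cross-entropy loss and optimized with Adam, as described above.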

4 Related Work

Meta-Learning and Few-Shot Learning

Meta-learning aims at designing models able to learn how to learn, i.e., how to use information such as a small supervised dataset for a new, unseen task. The goal is to have a model that will adapt and generalize well to new tasks or new environments. To do so, the model is usually trained on several similar tasks, with a 'meta'-training dataset. Various methods have been proposed, with three main types of approaches: (i) optimization-based methods, which aim for instance at predicting how to update the weights more efficiently than usual optimization Andrychowicz et al. (2016); Nichol et al. (2018); Finn et al. (2017); Ravi and Larochelle (2016); (ii) memory-based and model-based methods relying on models like Neural Turing Machines Graves et al. (2014) or Memory Networks Weston et al. (2014), which are trained to learn how to encode, store and retrieve information on new tasks quickly Santoro et al. (2016); Bartunov et al. (2019), or are based on specific architectures with "slow weights" and "fast weights" dedicated to different parts of the training Munkhdalai and Yu (2017); (iii) representation or metric learning-based methods, which aim at learning a good representation function to extract the most information from the examples of a task, in order to then use that representation as a basis for a measure, e.g., a linear distance to the unlabeled examples Snell et al. (2017); Koch et al. (2015); Vinyals et al. (2016); Sung et al. (2018).
Our proposed approach is closest to Sung et al. (2018): one can see the Deep Sets as their 'embedding module,' and our secondary network as their 'relation module'. They sum the representation of each element of a given class and concatenate it with the example before using the 'relation module'. The output of the relation module is a relation score for each class, used to classify. In our case, the concatenation is done on the input set, which is processed through equivariant layers (which use information about the whole set through the average pooling over the set) and averaged at the output of the Deep Sets. Additionally, as we focus here on a 'one-class' setting, our support set contains a single class.

We refer interested readers to more extensive surveys on meta-learning Vanschoren (2018); Hospedales et al. (2020) and few-shot learning Wang and Yao (2019).

Anomaly Detection and One-Class Classification

Detecting or characterizing 'anomalies' has been a widely studied problem in machine learning, across various applications. We only briefly present some references here, and refer the reader to the surveys Chandola et al. (2009) and Chalapathy and Chawla (2019), the latter focusing on deep-learning methods. Anomaly detection is usually formulated as finding instances that are dissimilar to the rest of the data, also called outlier detection or out-of-distribution detection. It can be unsupervised, supervised or semi-supervised.


When supervised, the problem becomes largely similar to classic prediction, the main issues being getting accurate labels and the highly imbalanced data. The latter has been addressed, e.g., through ensemble learning Joshi et al. (2002), two-phase rules (getting high recall first, then high precision) Joshi et al. (2001), or with cost-based / class re-weighting in the classification loss. It is highlighted in Chalapathy and Chawla (2019) that deep-learning methods do not fare well in such settings if the feature space is highly complex and non-linear.
The semi-supervised case, or one-class classification Moya and Hush (1996); Khan and Madden (2009), considers that one only has one type of available labels (usually the 'normal' class, as it is in most cases easier to obtain than examples of the various possible anomalies). The goal is usually to learn a model of the class behavior to get a discriminative boundary around normal instances. Counter-examples can also be injected to refine the boundary. Some of the techniques proposed are One-Class SVMs Schölkopf et al. (2000); Manevitz and Yousef (2001); Li et al. (2003) or representation learning with neural networks Perera and Patel (2019). This setting is also closely related to Positive and Unlabeled Learning (see e.g., Elkan and Noto (2008)), and our application would also fit this definition, as it could be more 'accurate' to consider our 'negative' dataset as unlabeled.

In our specific application, we aim at designing a model able, in practice, to conduct one-class classification for each stellar stream we know. However, our supervised class would be considered the 'anomaly' (in terms of the number of instances in the 'stream class' versus the other foreground stars). We propose to meta-learn this model in a meta-supervised setting, which still suffers from the imbalanced data but where synthetic data grants us accurate and complete labeling of both classes.

5 Experiments

5.1 Baselines

We propose to use Random Forests trained in the classic binary-classification setting as a baseline, motivated by this model's robustness to smaller datasets and its overall good performance on a variety of machine learning problems. For each stream dataset, we train a new model from scratch. We use a class-imbalance weight hyper-parameter to deal with the imbalance of the datasets and to train models with various trade-offs between recall and precision. We also explore the following hyper-parameters: number of trees in the forest (100, 200, 300, 500), max depth (10, 30, 50), min samples per split (2, 5, 10), min samples per leaf (1, 2, 4), and bootstrap vs. no bootstrap.
Our preliminary experiments showed that the Random Forest (RF) approach could achieve high precision with medium or low recall (i.e., conservative models with few false positives). Given that one of the challenges of the problem at hand is the low number of positive examples compared to negative ones, we also explore self-labeling Triguero et al. (2015) with RF. The idea is to use predicted labels from the model to augment the set of (training) positive examples, and retrain. One can start with an initial Random Forest that is highly conservative (high precision / low recall), which means 'safe' but probably less informative examples, or one with a more balanced mixture of precision and recall. Our validation protocol indicates better results when using a selection criterion (e.g., F1) with medium precision and recall, which is what is shown in this paper. We can repeat the self-labeling process for a fixed number of steps or until a stopping criterion is met (e.g., no positive examples are predicted anymore in the self-labeling pool). Hyper-parameters such as the number of iterations are selected in a meta-validation fashion (i.e., on a group of streams used for validation, described below). The following subsection also describes the split used to keep a pool of separate examples for self-labeling.
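To make the baseline and its self-labeling loop concrete, here is a hedged scikit-learn sketch. The function names, the probability threshold and the iteration scheme are our assumptions; the actual hyper-parameters (number of trees, depths, class weight, number of self-labeling iterations) are selected by meta-validation as described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_rf(X, y, pos_weight=50, **rf_params):
    """Binary RF baseline with a class-imbalance weight on the positive (stream) class."""
    rf = RandomForestClassifier(class_weight={0: 1, 1: pos_weight}, **rf_params)
    rf.fit(X, y)
    return rf

def self_label_rf(X_train, y_train, X_pool, n_iter=3, threshold=0.5, **rf_params):
    """Self-labeling: add pool stars predicted as positives to the training set, retrain."""
    X, y = np.asarray(X_train), np.asarray(y_train)
    rf = train_rf(X, y, **rf_params)
    for _ in range(n_iter):
        proba = rf.predict_proba(X_pool)[:, 1]
        new_pos = proba >= threshold
        if not new_pos.any():              # stop: no positives predicted in the pool
            break
        X = np.vstack([X, X_pool[new_pos]])
        y = np.concatenate([y, np.ones(int(new_pos.sum()), dtype=int)])
        X_pool = X_pool[~new_pos]          # remove used stars from the self-labeling pool
        rf = train_rf(X, y, **rf_params)
    return rf
```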

While our 'meta-test' setting is similar to one-class classification, preliminary experiments with classic one-class methods were not conclusive, which is likely explained by the generally low number of examples in our positive/support sets. Therefore, we choose to show here only results of baselines in the binary-classification setting.

5.2 Data Pre-Processing and Dataset building

We build a meta-dataset of 61 synthetic stellar streams and their respective foreground. We split them into a meta-training, meta-validation and meta-test dataset composed of respectively 46, 7 and 8 streams. Each "stream dataset" has a ratio of 150 negative examples for 1 positive example.

Meta-training dataset for Meta-DeepSets:

From each stream dataset, we generate training examples composed of (i) a support set of varying size randomly sampled within the stream's stars, (ii) a star to classify, and (iii) its corresponding label. We can generate meta-training datasets with varying imbalance by changing the ratio of negative examples used and by duplicating positive examples, as they will have different support sets. Results shown here are computed with a ratio of 1:100 and positive examples used twice (i.e., resulting in a dataset with a balance of 1:50 positive vs. negative examples).
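A minimal sketch of how one such meta-training example could be drawn from a stream task is given below; the sampling scheme for the support set size and the positive/negative ratio handling are illustrative assumptions following the description above:

```python
import numpy as np

def make_meta_example(stream_stars, foreground_stars, rng,
                      min_support=7, neg_ratio=100):
    """Draw one (support set, query star, label) example from a stream task.
    stream_stars / foreground_stars: (n, d) feature arrays for one task."""
    # support set of varying size, sampled within the stream's stars
    k = int(rng.integers(min_support, len(stream_stars) + 1))
    support = stream_stars[rng.choice(len(stream_stars), size=k, replace=False)]

    # star to classify: positive (stream) with probability 1 / (1 + neg_ratio)
    if rng.random() < 1.0 / (1.0 + neg_ratio):
        query, label = stream_stars[rng.integers(len(stream_stars))], 1
    else:
        query, label = foreground_stars[rng.integers(len(foreground_stars))], 0
    return support, query, label

# usage:
# rng = np.random.default_rng(0)
# support, query, label = make_meta_example(stream, foreground, rng)
```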

Meta-Validation and Test datasets with Self-Labeling pool:

For each sub-dataset in the meta-validation and meta-test splits, we use 10% of the stream's stars as "training examples" (used to train the RF, or used only as the support set given as input to the Meta-Deep Sets). In the Meta-Test set, the smallest (resp. biggest) support set has 9 stars (resp. 92 stars); the average support set size is 36 examples. The RF also has access to 10% of the foreground stars (negative examples) for training (i.e., keeping a ratio of 1:150 between positive and negative examples). The Meta-Deep Sets does not have access to those stars (except when fine-tuned on them). The remaining stars of each dataset are split into two groups, one reserved as the self-labeling pool. The other half (final test) is dedicated to all final evaluations, common to all methods, to make the comparison of results consistent across methods.
The GD-1 dataset is built similarly with a support set (positive training examples, 197 stars), a negative training set (ratio 1:400), a self-labeling set, and a test set (ratios of 1:150 for both).

Each task's dataset within Meta-Train, Meta-Validation and Meta-Test is normalized 'locally,' i.e., per task/stream, using all examples within the dataset of that task.
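As a small illustration, here is how such a per-task normalization could be implemented; standardization (zero mean, unit variance) is our assumption, since the exact normalization scheme is not specified:

```python
import numpy as np

def normalize_task(X_task):
    """Normalize one task/stream dataset 'locally', using all of its examples."""
    mu = X_task.mean(axis=0)
    sigma = X_task.std(axis=0) + 1e-8   # guard against constant features
    return (X_task - mu) / sigma
```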
The meta-training dataset is used to train the Meta-Deep Sets model. Validation and model selection are conducted on the meta-validation set. Note that the Meta-Deep Sets model is not fine-tuned or retrained on the meta-validation or meta-test sets. We also use the meta-validation dataset to select the hyper-parameters of the Random Forest models.

The following results are obtained with a Deep Sets model with 5 equivariant layers of size 100, exploring the following hyper-parameters: the learning rate, the $\ell_2$-regularization strength, and the class-imbalance weight in the loss.
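For concreteness, a sketch of the corresponding training step is shown below, assuming the `MetaDeepSets` sketch from Section 3 is in scope; the specific learning rate, weight decay ($\ell_2$) and class weight shown are placeholders, since the actual values are selected among the explored hyper-parameters:

```python
import torch
import torch.nn as nn

# model from the Section 3 sketch; widths match the 5 layers of size 100 above
model = MetaDeepSets(d_feat=10, d_hidden=100, n_layers=5)

# class-imbalance weight on the positive (stream) class: placeholder value
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 50.0]))

# weight_decay plays the role of the l2-regularization hyper-parameter; lr is a placeholder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)

def train_step(query, support, labels):
    """One optimization step on a batch of (query star, support set, label)."""
    optimizer.zero_grad()
    logits = model(query, support)      # (batch, 2)
    loss = criterion(logits, labels)    # weighted cross-entropy
    loss.backward()
    optimizer.step()
    return loss.item()
```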

5.3 Synthetic Streams

Given the imbalanced nature of our classification task, we study the performance of the methods on several criteria that capture various trade-offs between precision and recall. The measures are precision, recall, F-1, F-2 (favors recall), F-0.5 (favors precision), Balanced Accuracy (BAcc) and the Matthews Correlation Coefficient (MCC), also known as the Phi Coefficient. BAcc is computed as $\mathrm{BAcc} = \frac{1}{2}(\mathrm{TPR} + \mathrm{TNR})$, where the True Positive Rate (TPR) is the recall and the True Negative Rate (TNR) is the specificity (the number of true negatives divided by the total number of negatives). BAcc is less misleading than classical accuracy for imbalanced datasets and assumes the cost of a false negative is the same as the cost of a false positive. MCC is also a balanced measure even if the classes are of different sizes; it combines the four elements of the confusion matrix Powers (2011).
All scores shown here are computed by averaging the scores obtained for each stream task within the meta-test set, on the final test stars.
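For reference, these criteria can be computed per stream task and then averaged across tasks, e.g., with scikit-learn; a minimal sketch, where the per-task averaging mirrors the protocol just described:

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, fbeta_score,
                             balanced_accuracy_score, matthews_corrcoef)

def task_scores(y_true, y_pred):
    """All criteria for one stream task."""
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "F1": fbeta_score(y_true, y_pred, beta=1.0),
        "F2": fbeta_score(y_true, y_pred, beta=2.0),      # favors recall
        "F0.5": fbeta_score(y_true, y_pred, beta=0.5),    # favors precision
        "BAcc": balanced_accuracy_score(y_true, y_pred),  # (TPR + TNR) / 2
        "MCC": matthews_corrcoef(y_true, y_pred),
    }

def meta_scores(tasks):
    """Average each criterion over the stream tasks of the meta-test set.
    tasks: list of (y_true, y_pred) pairs, one per stream task."""
    per_task = [task_scores(y_t, y_p) for y_t, y_p in tasks]
    return {name: float(np.mean([s[name] for s in per_task])) for name in per_task[0]}
```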
The upper part of Table 1 summarizes the scores of our models selected to maximize different criteria. The hyper-parameters selected for the RF and the RF with self-labeling were the same for F-1, F-2, MCC and Balanced Accuracy. The hyper-parameters selected for the Deep Sets led to two different models, maximizing either F-1, F-0.5, MCC (and precision) or F-2, BAcc (and recall); we therefore show both models in the table. We see that the Deep Sets model manages to generalize well to new tasks in the meta-test set without any fine-tuning or self-labeling. It obtains the best results for all criteria considered, with significant gains for all of them.

Dataset Model Precision Recall F1 F2 F0.5 BAcc MCC
Synthetic RF 0.590 0.538 0.499 0.502 0.534 0.766 0.526
RF Self-Lab 0.525 0.690 0.553 0.609 0.529 0.841 0.574
Meta DS (best meta F1) 0.698 0.643 0.652 0.643 0.674 0.821 0.660
Meta DS (best meta F2) 0.502 0.859 0.619 0.737 0.541 0.927 0.646
GD-1 RF 0.739 0.575 0.647 0.601 0.699 0.787 0.650
RF Self-Lab 0.402 0.869 0.550 0.705 0.450 0.930 0.587
Meta DS (best meta F1) 0.490 0.473 0.481 0.476 0.486 0.735 0.478
Meta DS (best meta F2) 0.266 0.561 0.360 0.459 0.297 0.775 0.380
DS FT (best train F1) 0.731 0.675 0.702 0.686 0.719 0.837 0.701
DS FT (best train F2) 0.624 0.834 0.714 0.782 0.658 0.916 0.720
DS FT (best train F0.5) 0.774 0.541 0.637 0.575 0.712 0.770 0.645
DS FT (best train Rec) 0.341 0.981 0.506 0.713 0.392 0.984 0.574
Table 1: Results on Meta-Test synthetic streams and GD-1 for Random Forest (RF), Deep Sets (DS), Deep Sets Fine-Tuned (DS FT). Best meta F1 indicates the selection criteria used on the Meta-Validation set. Best train F1 indicates criteria selection used on GD-1 training-set when fine-tuning.

5.4 GD-1

The lower part of Table 1 shows the scores on the GD-1 data for the models selected through the same validation process as for the synthetic streams. We see that, on this real stream dataset, our Meta Deep Sets model struggles to obtain results as good. Comparatively, the RF and the RF with self-labeling perform very well. We see different factors that could explain the difference in results between the synthetic streams and GD-1. First, the synthetic streams may be different in nature from GD-1. The meta-approach seems to generalize well to new synthetic streams, but not here; this could indicate either that the synthetic streams are not realistic enough, or that GD-1 has something inherently different. In particular, we know that GD-1 has a very distinctive orbit (observed in pm-RA/pm-DEC space); maybe those features are less important for the synthetic streams, but the RF, since it is trained on them, can pick up on them more easily. Additionally, the number of positive examples is already reasonably 'big' for the RF to learn from (197 examples), compared to the support set sizes in the Meta-Test set. It would be interesting to study when the RF 'breaks' for smaller positive training sets, as other real streams will likely have smaller support sets.

We propose to briefly explore fine-tuning the Deep Sets model with the best meta-validation F-1 score. We use the same (noisy) negative training set as the RF (some examples are false negatives), combined with the original positive support set (also used by the RF). We divide the learning rate by two. We modify the learning scheme so that, at each fine-tuning epoch, the model is trained on a dataset composed of the support set and $k \times N_p$ negative stars, where $N_p$ is the number of positive examples and $k$ an imbalance factor. We try 4 imbalance factors (30, 50, 70 and 100). All fine-tuned models are trained so as to see the entire negative dataset 3 times. We show results on the final test data when selecting the best models for the criteria F-1, F-2, F-0.5 and recall on the full training set (last rows of Table 1, where DS FT is Deep Sets Fine-Tuned). Fine-tuning leads to a large improvement. We also see an improvement compared to the RF. However, it is fair to note that the RFs might also be improved with additional cross-validation on GD-1 instead of using meta-selected hyper-parameters, though cross-validation might be unstable given the low number of positive examples. It is also important to highlight that the best criterion to use for selecting the fine-tuned Deep Sets (depending on what one is trying to optimize) should also be learned in a meta-fashion, as our selection here could be susceptible to overfitting in the fine-tuning phase. However, we feel these results illustrate the potential of the model to reach a variety of trade-offs between precision and recall when fine-tuning is used.
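A minimal sketch of the per-epoch sampling scheme used for fine-tuning is given below; the function and variable names are our assumptions:

```python
import numpy as np

def finetune_epoch_indices(n_pos, n_neg_total, imbalance_factor, rng):
    """Indices used for one fine-tuning epoch: the full positive support set plus
    imbalance_factor * n_pos stars drawn from the (noisy) negative training set."""
    pos_idx = np.arange(n_pos)
    neg_idx = rng.choice(n_neg_total, size=imbalance_factor * n_pos, replace=False)
    return pos_idx, neg_idx

# e.g., for GD-1: 197 positives, 78,000 negative training stars, imbalance factor 50;
# epochs are repeated until the negative set has been seen roughly 3 times in total.
rng = np.random.default_rng(0)
pos_idx, neg_idx = finetune_epoch_indices(n_pos=197, n_neg_total=78000,
                                          imbalance_factor=50, rng=rng)
```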

These experiments show promising results. They encourage us to deepen our understanding of stellar stream characteristics with additional analysis, but also to explore how to improve our current meta-method, e.g., through meta-validation, and how to combine both approaches.

6 Discussion

We presented an application of Deep Sets, a model designed for point-clouds and sets, in a meta-learning framework on a new task from astrophysics. We showed that an adapted Deep Sets model can efficiently tackle one-class classification with small sets of examples (where the single class can be considered ’anomalies’) when trained in a fully supervised meta-learning regime.
The results obtained motivate a more in-depth study of models designed for point-clouds applied to meta-learning problems. It is worth noting that using more recent and complex methods for point-clouds such as Qi et al. (2017b); Zhou and Tuzel (2018) or other equivariant networks Ravanbakhsh (2020) would likely further improve results, as would designing meta-learned self-labeling methods, or possibly using active learning. It would also be interesting to explore these methods in the more classical few-shot setting. However, such methods might need to be adapted to work on inputs of larger dimension such as images or time series. While our approach generalizes well on data similar to what it was trained on, we observed a decrease in performance on our real stream, but fine-tuning could efficiently recover the performance.

7 Broader Impact

Regarding potential broader impact directly related to our application at hand, if the method is successfully applied to more real known streams, this could be a novel way of characterizing the stellar population of streams. Combined with validation using astronomical analysis, it could help find substructures within streams that were not detected before. As mentioned in the Introduction, this could serve as a probe for dark matter structure in our galaxy, among other things.

Beyond this application, the model we present (and, in the more general sense, the problem of (meta) learning one-class classification with small sets for the 'anomalous' classes, which also covers Positive and Unlabeled Learning problems) could match a variety of other applications in different domains. Efficient methods for this problem – derived from this paper or not – could help or improve anomaly detection (of known types of anomaly) on larger sets, and facilitate the integration of new 'groups' of anomalies (but it would not directly or necessarily help to detect new groups of anomalies or new classes). This can be either good or terrible depending on the use. Some uses could be positive, for instance applications where one has sets of rare occurrences of the same nature, like molecules (though the authors want to point out that they are not experts in that domain) or rare diseases, where learning fast on small sets could be useful. Unfortunately, one can also easily think of far worse usages, for instance when applied to personal data, e.g., social network-related data or government-obtained data, where the various 'one-class problems' could be different groups of specific 'behaviours' or 'people'.
However, for all these different applications, the presented approach might not be suitable as-is, especially since it is possible that, in those cases, the different 'anomalous classes' do not share enough in common: we focused here on similar 'types' of anomalies. It is also unclear – and this would be true for both 'positive' and 'negative' possible applications – how crucial it is to have a very good or perfectly labeled meta-training dataset, and this point is likely problem-dependent.

8 Supplementary material

8.1 Code

Code and data have been released on GitHub.
We provide code to reproduce the Random Forest results with and without self-labeling, as well as code to generate the datasets used for the Deep Sets, training code, evaluation code, and the fine-tuning for GD-1.
We built our Meta-Deep Sets upon a previous implementation of Deep Sets by its authors (https://github.com/manzilzaheer/DeepSets). We simply used a slightly deeper Deep Sets with 5 equivariant layers. The results shown in this paper have been obtained by training the Deep Sets models for 50 epochs, where each epoch trains on 300,000 examples.

8.2 Data and Datasets - Additional information and statistics

We provide in this section additional details about the features and behaviors of streams. We also provide detailed Tables on the synthetic streams datasets we generated as well as GD-1. All datasets are available at https://anonymous.4open.science/r/dad5b12a-3dd7-458b-8af7-fb8da319b457/.

Figure 4: Hertzsprung-Russell (HR) diagram with different groups of stars; the $x$-axis is the temperature of the stars, the $y$-axis their luminosity. The position of a star in the diagram provides information about its present stage and its mass. Note that the temperature axis is inverted (cooler / redder is on the right). Image credit: European Southern Observatory.
Figure 5: Isochrone lines for different stellar ages on the HR diagram, from Casanellas and Lopes (2011). Different colors show different stellar age populations and how they move in the HR diagram as they evolve. The $y$-axis is luminosity, the $x$-axis is effective temperature, which can be understood equivalently as color.

Complementary information on astronomical features and streams

As mentioned in Section 2, we use color features for the stars because these colors are indirect indicators of the age and composition of a star. To give the reader additional understanding of the data and the problem, here is a short explanation motivating the use of these features, and also motivating a meta-learning approach. Streams, as explained in Section 1, are thought to be formed when an external smaller galaxy or a globular cluster (i.e., a group of stars) gets deformed and pulled apart by a bigger galaxy. The fact that the stars of a stream originate from the same external group has an impact on the colors observed in its population: such a cluster of stars is expected to be composed of a stellar population of roughly the same age (i.e., all the stars in that external group formed around the same time). The same applies to their composition (metallicity). There exists a relationship between stellar ages and their observed colors called an isochrone. This relationship can be observed in the Hertzsprung-Russell (HR) diagram. The HR diagram plots a star's luminosity against its temperature, or equivalently, its color. This plot has been extremely helpful in improving the understanding of stellar classification and evolution. Within this diagram, different types of stars cluster at different locations (see Figure 4), such as main sequence stars, supergiants, white dwarfs, etc. Additionally, a star's position on the HR diagram changes throughout its life. Figure 5 illustrates the isochrone curves in the HR diagram, each colored line representing a population of stars of the same age. Intuitively, we hope that our meta-learning algorithm will learn a non-linear transformation of the color features that will cluster them efficiently based on their underlying ages.

We provide additional visualizations of GD-1 and one synthetic stream from our Meta-Test set (stream-2805) to help in understanding the data (real and generated) and some of their differences. Figure 6 shows GD-1 zoomed in RA-DEC space (projection on the sky), which illustrates the 'gap' (under-density) in the stream as well as the 'spur' (over-dense region slightly above the 'bend' of the main over-dense region). Figure 8 shows the same dimensions for the synthetic stream, which does not have these substructures (note that the synthetic streams are not simulated to interact with dark matter 'clumps'). Figure 7 (resp. Fig. 9) shows GD-1 (resp. the synthetic stream 2805) in the proper motion space, which is an indicator of the orbit followed by the stars of the stream. We can see that for GD-1, the stars gather on the outskirts of the main 'orbits' in that region: GD-1 actually has a very peculiar orbit which makes it easier to distinguish from the Milky Way. The synthetic stream's stars cluster in two smaller regions of the proper motion space, but within a region that is already 'dense' (i.e., a more 'common' orbit compared to other stars not from the stream). Finally, Figures 10, 11 and 12 (resp. Figs. 13, 14 and 15) show GD-1 (resp. the synthetic stream) in color space (each plot shows two color bands). We can see that GD-1 stars share a 'behavior' in this 3D space, but they are more 'spread out' than the synthetic stream's stars, which seem more compactly clustered in these dimensions. This could explain why the meta-model struggled to transfer directly to GD-1, if most training streams had more information contained in the colors than in the proper motions, or were more tightly clustered.

Figure 6: GD-1 Stream in RA-DEC (projection in the sky) feature space. This dataset is pre-cut on a ’tube’ around GD-1 known position.
Figure 7: GD-1 Stream in pm-RA - pm-DEC space (motion space). This plot illustrates that GD-1 has a very specific orbit in almost the opposite direction as the Milky Way’s rotation, which makes it –in principle– easier to detect and characterize.
Figure 8: Synthetic stream (2805) in RA-DEC (projection in the sky) feature space. The white region without stars corresponds to the galactic disk (an over-dense region of the sky where the plane of the Milky Way lies and streams are unobservable), which is removed.
Figure 9: Synthetic stream (2805) in pm-RA - pm-DEC space (motion space).
Figure 10: GD-1 Stream and foreground in g_bp - g color space.
Figure 11: GD-1 Stream and foreground in g_bp - g_rp color space.
Figure 12: GD-1 Stream and foreground in g_rp - g color space.
Figure 13: Synthetic stream (2805) and foreground in g_bp - g color space.
Figure 14: Synthetic stream (2805) and foreground in g_bp - g_rp color space.
Figure 15: Synthetic stream (2805) and foreground in g_rp - g color space.
Figure 16: Histogram of the stream sizes (number of stars in the stream within the selected window) for the Meta-Train set. Darker ticks illustrate each stream task.
Figure 17: Histogram of the support sets’ sizes in the Meta-Train set.

Datasets composition and statistics

Figure 16 summarizes the sizes of the training stream tasks (Meta-Train) by showing the histogram of the streams' sizes. From the 41 streams we generate a dataset of 25,812 positive examples and 1,637,113 examples in total (the vast majority of which are negatives), by sampling negative examples to a ratio of 1:150 compared to the number of stream stars, and duplicating the positive examples (stream) twice. The support set for each training example is generated by sampling $k$ stars from within the stream's stars, where $k$ is itself randomly sampled (with a minimum size of 7). Tables 2 and 3 provide the detailed composition of each stream in the Meta-Validation dataset and in the Meta-Test set and GD-1, respectively. This gives the reader an idea of the total size of the synthetic streams and their foreground (sampled to a ratio of 1:150), and of the sizes of the support set of each task. 10% of the "Total" data in each stream are used to mimic the training; the remaining stars are then split (roughly 50:50) into a self-labeling dataset and a test set. Note that the GD-1 train set has a foreground ratio of 1:400, but its self-labeling and test sets have the usual ratio of 1:150. We remind the reader that during validation and at test time (without fine-tuning) the Deep Sets only sees the support set (e.g., 18 examples for stream-1012 in Validation) as input, concatenated to the star to predict on (no retraining), and that the Random Forests are trained on the Train set (support set and foreground). For the fine-tuning of the Deep Sets on GD-1, we use the same training set as for the Random Forests.

Stream idx | Total: Stream / FG | Train: Stream (Support set) / FG | Self-Label Set: Stream / FG | Test set: Stream / FG
stream-1012 142 24000 18 2700 72 10662 52 10638
stream-1667 104 17100 10 1500 55 7812 39 7788
stream-1698 125 20550 12 1800 62 9395 51 9355
stream-178 340 56400 36 5400 184 25519 120 25481
stream-3775 99 17400 17 2550 50 7431 32 7419
stream-5489 365 59250 30 4500 192 27415 143 27335
stream-8137 506 83400 50 7500 258 38004 198 37896
Table 2: Detailed statistics for each stream in the Meta-Validation Set with the number of examples in Stream / foreground (FG) in the train set, Self-Labeling (SL) set and Test set.
Stream idx | Total: Stream / FG | Train: Stream (Support set) / FG | Self-Label Set: Stream / FG | Test set: Stream / FG
stream-5402 76 12750 9 1350 45 5702 22 5698
stream-1101 627 107850 92 13800 322 47075 213 46975
stream-1519 661 109050 66 9900 368 49633 227 49517
stream-247 495 81150 46 6900 289 37164 160 37086
stream-2805 409 66600 35 5250 229 30694 145 30656
stream-4717 105 17400 11 1650 60 7881 34 7869
stream-5713 178 29250 17 2550 109 13354 52 13346
stream-9528 144 23400 12 1800 85 10810 47 10790
GD-1 1979 345300 198 78000 985 143566 797 123734
Table 3: Detailed statistics for each stream in the Meta-Test Set and GD-1 with the number of examples in Stream / foreground (FG) in the train set, Self-Labeling (SL) set and Test set.

Detailed results per streams

We provide in Tables 4 and 5 the detailed performance (Precision and Recall) of the four models (Random Forest (RF), Random Forest with Self-Labelization (RF SL), Deep Sets selected with best F1 in validation (DS F1) and Deep Sets selected with best F2 in validation (DS F2)) for each stream. We see a correlation between the Recall obtained and the size of the support set for the Random Forests methods, while the Deep Sets seem more robust even for smaller support set.

Stream idx | Support Set | Precision: RF / RF SL / DS F1 / DS F2 | Recall: RF / RF SL / DS F1 / DS F2
stream-5402 9 0.947 0.969 0.692 0.633 0.400 0.689 0.409 0.864
stream-1101 92 0.595 0.437 0.724 0.485 0.817 0.879 0.432 0.545
stream-1519 66 0.178 0.163 0.609 0.425 0.791 0.801 0.736 0.934
stream-247 46 0.593 0.472 0.819 0.517 0.702 0.799 0.794 0.931
stream-2805 35 0.567 0.500 0.698 0.399 0.686 0.707 0.779 0.979
stream-4717 11 0.459 0.300 0.407 0.360 0.283 0.400 0.706 0.912
stream-5713 17 0.946 0.928 0.778 0.714 0.321 0.587 0.673 0.769
stream-9528 12 0.433 0.433 0.853 0.478 0.306 0.655 0.617 0.936
Table 4: Detailed Precision and Recall obtained for each stream in the Test Set with Random Forest (RF), Random Forest with self-labeling (RF SL), Deep Sets with best validation F1 (DS F1) and F2 (DS F2). We also restate the size of the support set (number of positive training examples) available.
Stream idx | Support Set | Precision: RF / RF SL / DS F1 / DS F2 | Recall: RF / RF SL / DS F1 / DS F2
stream-1012 18 0.380 0.342 0.667 0.627 0.639 0.694 0.538 0.904
stream-1667 10 0.500 0.471 0.625 0.530 0.073 0.145 0.513 0.897
stream-1698 12 1.000 1.000 0.979 0.940 0.258 0.387 0.922 0.922
stream-178 36 0.417 0.434 0.649 0.478 0.679 0.663 0.617 0.825
stream-3775 17 0.263 0.194 0.571 0.345 0.420 0.560 0.375 0.594
stream-5489 30 0.211 0.184 0.677 0.312 0.724 0.839 0.762 0.958
stream-8137 50 0.321 0.255 0.920 0.631 0.748 0.800 0.924 0.995
Table 5: Detailed Precision and Recall obtained for each stream in the Validation Set with Random Forest (RF), Random Forest with self-labeling (RF SL), Deep Sets with best validation F1 (DS F1) and F2 (DS F2). We also restate the size of the support set (number of positive training examples) available.

References

  • M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas (2016) Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pp. 3981–3989. Cited by: §4.
  • N. Banik, G. Bertone, J. Bovy, and N. Bozorgnia (2018) Probing the nature of dark matter particles with stellar streams. Journal of Cosmology and Astroparticle Physics 2018 (07), pp. 061. Cited by: §1.
  • S. Bartunov, J. W. Rae, S. Osindero, and T. P. Lillicrap (2019) Meta-learning deep energy-based memory models. arXiv preprint arXiv:1910.02720. Cited by: §4.
  • J. L. Carlin (2016) Stellar streams and clouds in the galactic halo. In Tidal Streams in the Local Group and Beyond, J. L. Carlin (Ed.), Astrophysics and Space Science Library, Vol. 420, pp. 87–112. Cited by: §1.
  • J. Casanellas and I. Lopes (2011) Signatures of dark matter burning in nuclear star clusters. The Astrophysical Journal Letters 733 (2), pp. L51. Cited by: Figure 5.
  • R. Chalapathy and S. Chawla (2019) Deep learning for anomaly detection: a survey. arXiv preprint arXiv:1901.03407. Cited by: §4.
  • V. Chandola, A. Banerjee, and V. Kumar (2009) Anomaly detection: a survey. ACM computing surveys (CSUR) 41 (3), pp. 1–58. Cited by: §4.
  • C. Elkan and K. Noto (2008) Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 213–220. Cited by: §4.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §4.
  • B. Gibson (2007) The ghosts of galaxies past. Scientific American 296, pp. 40–45. Cited by: §1.
  • A. Graves, G. Wayne, and I. Danihelka (2014) Neural turing machines. arXiv preprint arXiv:1410.5401. Cited by: §4.
  • T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey (2020) Meta-learning in neural networks: a survey. arXiv preprint arXiv:2004.05439. Cited by: §4.
  • M. V. Joshi, R. C. Agarwal, and V. Kumar (2001) Mining needle in a haystack: classifying rare classes via two-phase rule induction. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data, pp. 91–102. Cited by: §4.
  • M. V. Joshi, R. C. Agarwal, and V. Kumar (2002) Predicting rare classes: can boosting make any weak learner strong?. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 297–306. Cited by: §4.
  • S. S. Khan and M. G. Madden (2009) A survey of recent trends in one class classification. In Irish conference on artificial intelligence and cognitive science, pp. 188–197. Cited by: §4.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §3.
  • G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2. Cited by: §1, §4.
  • A. H. Küpper, E. Balbinot, A. Bonaca, K. V. Johnston, D. W. Hogg, P. Kroupa, and B. X. Santiago (2015) Globular cluster streams as galactic high-precision scales – the poster child Palomar 5. The Astrophysical Journal 803 (2), pp. 80. Cited by: §1.
  • K. Li, H. Huang, S. Tian, and W. Xu (2003) Improving one-class svm for anomaly detection. In Proceedings of the 2003 International Conference on Machine Learning and Cybernetics (IEEE Cat. No. 03EX693), Vol. 5, pp. 3077–3081. Cited by: §4.
  • L. M. Manevitz and M. Yousef (2001) One-class svms for document classification. Journal of machine Learning research 2 (Dec), pp. 139–154. Cited by: §4.
  • M. M. Moya and D. R. Hush (1996) Network constraints and multi-objective optimization for one-class classification. Neural Networks 9 (3), pp. 463–474. Cited by: §4.
  • T. Munkhdalai and H. Yu (2017) Meta networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2554–2563. Cited by: §4.
  • A. Nichol, J. Achiam, and J. Schulman (2018) On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999. Cited by: §4.
  • et al. (2018a) Fossil stellar streams and their globular cluster populations in the E-MOSAICS simulations. External Links: 1810.09889 Cited by: §1.
  • et al. (2018b) The spur and the gap in GD-1: dynamical evidence for a dark substructure in the milky way halo.. External Links: 1811.03631 Cited by: §1, §2.
  • et al. (2019) Datamodel description. In Gaia Data Release 2, pp. 521–576. Cited by: §2.
  • P. Perera and V. M. Patel (2019) Learning deep features for one-class classification. IEEE Transactions on Image Processing 28 (11), pp. 5450–5463. Cited by: §4.
  • D. M. Powers (2011) Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation. Cited by: §5.3.
  • A. M. Price-Whelan and A. Bonaca (2018) Off the beaten path: gaia reveals gd-1 stars outside of the main stream. The Astrophysical Journal Letters 863 (2), pp. L20. Cited by: §2.
  • A. M. Price-Whelan (2017) Gala: a Python package for galactic dynamics. The Journal of Open Source Software 2, pp. 388. External Links: Document Cited by: §2.
  • C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017a) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: §3.
  • C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017b) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pp. 5099–5108. Cited by: §3, §6.
  • S. Ravanbakhsh (2020) Universal equivariant multilayer perceptrons. arXiv preprint arXiv:2002.02912. Cited by: §6.
  • S. Ravi and H. Larochelle (2016) Optimization as a model for few-shot learning. Cited by: §4.
  • A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap (2016) Meta-learning with memory-augmented neural networks. In International conference on machine learning, pp. 1842–1850. Cited by: §4.
  • B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt (2000) Support vector method for novelty detection. In Advances in neural information processing systems, pp. 582–588. Cited by: §4.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in neural information processing systems, pp. 4077–4087. Cited by: §4.
  • F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. Cited by: §4.
  • I. Triguero, S. García, and F. Herrera (2015) Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowledge and Information Systems 42 (2), pp. 245–284. Cited by: §5.1.
  • J. Vanschoren (2018) Meta-learning: a survey. arXiv preprint arXiv:1810.03548. Cited by: §4.
  • O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: §4.
  • Y. Wang and Q. Yao (2019) Few-shot learning: a survey. arXiv preprint arXiv:1904.05046. Cited by: §4.
  • J. Weston, S. Chopra, and A. Bordes (2014) Memory networks. arXiv preprint arXiv:1410.3916. Cited by: §4.
  • M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola (2017) Deep sets. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3391–3401. External Links: Link Cited by: 1st item, §3, §3.
  • Y. Zhou and O. Tuzel (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499. Cited by: §6.