Stellar streams are groups of co-moving stars that orbit a galaxy (such as our own, the Milky Way) and are thought to originate in smaller galaxies and star clusters. These smaller galaxies and star clusters are deformed and pulled apart by the differential gravitational field (the “tidal field”) of the larger galaxy around which they orbit. The collection of stars that are pulled out of these smaller systems is stretched by this tidal field and forms linear, “spaghetti-like” structures of stars that continue to orbit coherently for (typically) several orbits around their parent galaxy. Figure 1 shows an artistic visualization of remnants of satellite galaxies wrapping around a bigger galaxy like the Milky Way. Astronomers detect stellar streams in the Milky Way using large-area sky surveys that provide photometric (imaging), astrometric (sky motion), and radial velocity (line-of-sight motion) data. Typically, searches are performed by making cuts in specific regions of feature space, and actual detection is usually done by visual inspection of 2D visualizations, searching for over-densities that are stream-shaped in the selected feature-space region. More recently, automated approaches have also been explored, e.g., Carlin (2016), but even these methods typically require a significant amount of visual inspection and validation.
These streams are immensely useful tools for astronomers and astrophysicists: they are one of the most promising avenues for uncovering the (astrophysical) nature of dark matter and for inferring the accretion history of the Galaxy Küpper et al. (2015); Gibson (2007); others (2018a). More specifically, substructures within streams are of particular interest: for instance, gaps and spurs within a stream can serve as signatures of interactions with clumps of (otherwise invisible) dark matter Banik et al. (2018); others (2018b). Therefore, beyond the sole detection of streams, having a reliable catalog of a stream’s stellar population (i.e., determining which stars do or do not belong to a given stream) is also extremely important, to allow astronomers to analyze the internal substructure or population characteristics of the stream.
We therefore focus here on the problem of using machine learning methods to characterize the stellar population of individual streams. We consider that we have access, for a given stream, to a small support set of stars, i.e., positive examples that we are confident belong to this stream, e.g., selected from a clear over-dense region. We also consider a possible negative set, using (a subsample of) the remaining stars: this negative set can be noisy, since some of those stars will actually belong to the stream; this is especially harmful for detecting lower-density regions of the stream that lie near the ’true’ decision boundary (see Section 2 for more details).
Our goal is to obtain, for each remaining star, a measure characterizing its membership in the stream.
We propose to approach this problem in a meta-learning setting. Indeed, we expect that the "membership" function should share general "principles" across the streams given their support set, i.e., each stream characterization is a similar problem and only differs in the support set. This is a similar motivation to meta-learning methods for few-shot learning, where one aims at building a model that meta-learns how to distinguish different classes of instances using only a few examples of each (e.g., Koch et al. (2015)). However, in this application, our ’small’ set of positive examples can be larger than the usual few-shot setting (e.g., from ten up to a couple hundred examples). The negative set would be even larger (e.g., 150 times more ’negatives’ than positives).
In light of this, this paper presents the following contributions:
A novel use of Deep Sets Zaheer et al. (2017), a neural network model dedicated to point-clouds and sets, in a meta-learning framework.
An experimental protocol to meta-learn a one-class Deep Sets classifier that takes as input only a small support set of ’positive examples’ and the example to classify. While the model is trained in the supervised anomaly detection regime (with full supervision), at test time for new tasks/streams, it only accesses the positive set of examples (stream’s stars).
Experiments on a novel application on astronomical data, with a dataset of synthetic streams immersed in real data (for which we have ground truth), and a real stream (GD-1) dataset, for which we have an exhaustive catalog.
As a baseline, we compare our model to a classical machine-learning method (Random Forest) trained in the binary classification setting (each stream is considered as a separate classification problem). We use the set of problems to optimize potential hyper-parameters in a meta-fashion. We also extend this baseline with a self-labeling process to increase the size of the support set used in training. We show that our meta-learned model outperforms Random Forest (with or without self-labeling) on the synthetic streams, even though our model has access to less data per task at test time (only positive examples). For GD-1, we see a performance drop when using the model ’out-of-the-box,’ but a simple fine-tuning (using positive and noisy negative data) efficiently allows us to outperform the baselines.
This section presents the data used in this paper. Formatted datasets will be available on Github.
Gaia is a space observatory designed for astrometry. This mission has produced a dataset of stars in our Milky Way of unprecedented caliber. The spacecraft measures the positions, distances and motions of stars in the Milky Way brighter than magnitude 20, which represents approximately 1% of the Milky Way population. The mission also conducts spectrophotometric measurements, providing detailed physical properties of the stars observed, such as luminosity and elemental composition. Recently, the second dataset of observations (Gaia DR-2 others (2019)) has been released. It contains measurements of the five-parameter astrometric solution – positions on the sky, parallaxes, and proper motions in two dimensions – and photometry (colors) in three bands for more than 1.3 billion sources.
We describe briefly here the 10 features used in our data, and how they potentially help to characterize streams’ stellar populations.
RA-DEC Position in the sky: Positions of the stars projected in the Equatorial Coordinate System, in two dimensions: RA is the barycentric right ascension; DEC is the barycentric declination. The characteristic shape of the stream will be observable in this 2D space. Figure 2 shows a synthetic stream and the foreground stars in that space, subsampled to a ratio of 150 to 1. The upper plot illustrates how one can detect a stream through possible over-densities. In the lower plot, we highlight the actual stream’s stars in red, which illustrates how part of the stream can ’disappear’ in the foreground.
Proper motions: Movement of an astronomical object relative to the Sun’s frame of reference: pmRA is the proper motion in the direction of right ascension; pmDEC is the proper motion in the direction of declination. The stream’s stars will also be structured in this 2D space, as they share common motion properties from their orbit around the Milky Way, as illustrated in Figure 3.
Colors: Each star in the dataset has several photometric features: g, g_bp, and g_rp are, respectively, the mean absolute magnitude (brightness of a star as seen from a distance of 10 parsecs) in the green band, the green band minus the red band, and the green band minus the blue band. These features are indirect indicators of the potential age and composition of a star. There exists a (non-linear) relationship between stars’ ages and observed colors called an isochrone. See Supplementary Material for additional information.
Angular coordinates: We additionally use the angular velocity coordinates (2D—just direction) of each star, which combines proper motions and the equatorial coordinates. Essentially, these angles represent the great circle along which the star is moving across the sky.
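As a rough illustration (our own sketch, not the paper's exact definition), the direction of a star's apparent motion on the sky can be derived from its two proper-motion components; the conventional position angle is measured East of North:

```python
import numpy as np

def motion_direction_angle(pmra, pmdec):
    """Position angle (radians, East of North) of a star's apparent motion,
    computed from the proper motion in RA (pmra, assumed to already include
    the cos(dec) factor, as in Gaia catalogs) and in DEC (pmdec)."""
    return np.arctan2(pmra, pmdec)

# A star moving purely toward increasing declination points North (angle 0).
print(motion_direction_angle(0.0, 1.0))  # 0.0
# A star moving purely toward increasing right ascension points East (pi/2).
print(motion_direction_angle(1.0, 0.0))
```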
Currently, there is no extensive catalog of ’ground-truth’ streams’ stellar populations: while several streams have been detected (using previous missions, or within Gaia), cataloging each star as belonging to a specific stream or not has not been done. We propose to use in this paper a set of synthetic streams to train, validate and compare our methods. Those synthetic streams will also help us alleviate the difficulty of building a useful negative set for real streams, by allowing us to meta-train a model in a fully supervised setting but dedicated to one-class classification.
These streams are generated by simulating a collection of star clusters as they orbit the Milky Way. In detail, star ‘particles’ are ejected from a mock (massive) star cluster, and the orbits of the individual star particles are then computed, accounting for the mass distribution of the Galaxy along with the mass of the parent star cluster.
The star cluster orbits are randomly sampled to match a plausible initial radial distribution of star clusters born or accreted into the Milky Way.
These simulations are performed with the Gala Galactic Dynamics code Price-Whelan (2017).
After evolving the orbits of the star particles ejected from all of the individual star clusters, the final state of these simulations is a set of synthetic stellar streams: positions and velocities in a Galactocentric reference frame for all star particles in all synthetic streams. We then transform these positions and velocities to heliocentric quantities and mock-observe the star particles to mimic the selection function and noise properties of sources in Gaia DR-2. These streams are then superimposed over the real Gaia data: because the streams are generated so as to orbit our actual galaxy, we can mimic the "foreground" we would observe if those streams were real (i.e., in terms of positions in the sky / equatorial coordinates). Thus, we can generate realistic datasets composed of real data from Gaia and synthetic streams, where we have supervision (ground truth) for all stars. Each stream dataset is generated by selecting a random window in RA-DEC that contains part of the stream, and injecting real foreground stars from Gaia to a ratio of up to 1:150 (estimated from the ratio in known streams after astronomically relevant cuts are performed), with the galactic disk removed. Additional information on the data is given in the supplementary material section.
We also show results for one real stream for which we have an exhaustive catalog based on astronomical cuts, GD-1 Price-Whelan and Bonaca (2018); others (2018b). This stream presents interesting substructures (a gap and a spur) and will be useful for further analyzing the ability of the methods to preserve these structures.
3 Deep Sets to Meta-Learn One-Class Classification
Let us first highlight the particularities of the problem we propose to address. Similar to few-shot learning and other meta-learning approaches, we consider that we have access to a dataset of ’training tasks’ with supervision. Our goal is to build a model able to predict on new tasks, unseen before but of a similar nature, using a ’small’ set of supervised examples for the task. However, unlike few-shot learning, our set of positive supervised examples (support set) can have a larger range of sizes than the usual few-shot setting (e.g., between ten and hundreds of examples). The potential negative set is even larger (on the order of tens or hundreds of thousands of examples), and can be quite noisy or tricky to build: while gathering obvious negative examples is trivial for astronomers (e.g., stars far away from the stream in RA-DEC space), these will likely be uninformative for the classifier. Labeling informative negative examples (close to the actual decision boundary) is much harder, as it is precisely the goal of the task at hand. One could instead label "all the remaining examples" as negative, which leads to a large but very noisy negative set, especially near the actual decision boundary. Therefore, we propose to develop a model that is able to meta-learn one-class classification, so as to be usable at test time potentially without a negative dataset (we note that a (noisy) negative set can still be used to fine-tune the model on a new task, see Sec. 5 on GD-1). The meta-learning will, however, be fully supervised with both positive and negative examples (in our case, made possible by the synthetic streams dataset). To summarize, our setting can be defined as (meta) supervised anomaly detection during training, and (meta) one-class classification during testing (our ’one-class’ would usually be considered the ’anomalous’ class in the literature, given its size compared to the ’main’ negative class).
The design of our approach follows a similar motivation to representation and metric learning-based methods designed for meta-learning (see Section 4 for details). However, we propose to use methods that are designed specifically for point-clouds and sets.
These methods, such as Qi et al. (2017a, b); Zaheer et al. (2017), have been developed to handle inputs that are sets, i.e., unordered sequences. They are generally used on 3D point-clouds, to solve tasks such as point-cloud classification (e.g., the model receives a surface mapping of an object as a set of points in 3D space and should predict which object it is) or segmentation (e.g., similarly, a surface mapping of a scene containing various objects).
While those models have been used mostly on 3D point-clouds or meshes, one could see a dataset of instances as, in itself, a point-cloud.
We propose to use such methods in a meta-learning setting. For each separate task, we propose to consider the support set (the positive supervised examples) as the point-cloud taken as input by the model. Additionally, we integrate into the ’point-cloud’ the example we want to classify, by concatenating it to each element of the set (i.e., doubling the feature dimension of the cloud). Intuitively, we want our model to learn a representation of the support set conditioned on the current example (or vice versa) that is useful to classify the example as belonging to the class of the set or not.
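This construction can be sketched as follows (a minimal illustration; the variable names are ours, not the paper's):

```python
import numpy as np

def build_conditioned_set(support_set, x):
    """Concatenate the query example x to every element of the support set,
    doubling the feature dimension: each row becomes [s_i ; x]."""
    n = support_set.shape[0]
    x_tiled = np.tile(x, (n, 1))   # repeat x once per support element
    return np.concatenate([support_set, x_tiled], axis=1)

support = np.random.rand(5, 10)    # 5 support stars, 10 features each
query = np.random.rand(10)         # the star to classify
conditioned = build_conditioned_set(support, query)
print(conditioned.shape)           # (5, 20)
```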
More specifically, we use the Deep Sets model Zaheer et al. (2017). The model is composed of two networks, one using equivariant layers to build a fixed-size representation of the input set, and a secondary network that takes the set representation as input and is optimized to predict the target. Let us consider an example x to classify, within a task t. The task is associated with a support set of positive examples S = {s_1, ..., s_n}. We build a set of instances X by concatenating x to every support element s_i in S, i.e., X = {[s_i ; x]}. This set of instances is passed through several layers of equivariant functions. From a practical point of view, each equivariant layer can be considered as a combination of two fully connected linear layers Λ and Γ that receive the input set as a batch, i.e., Λ and Γ are used in a recurrent fashion for all elements of the input set (either the original input set or the outputs of the previous layer). More precisely, for a given batch/set of instances X, each element x_i is mapped to σ(Λ(x_i) + Γ(x̄)), where x̄ is the mean vector of the set X over its instances and σ is an Exponential Linear Unit (ELU) activation function. Then, the outputs of the last equivariant layer are averaged to build the final set representation.
The set representation is then passed through a secondary network to predict the target . In our experiments, this secondary network is composed of two hidden layers with ELU activation function.
The resulting architecture can be optimized with classic optimization techniques and classification losses. We use a cross-entropy loss, with a hyper-parameter weight for the imbalanced data, and the ADAM optimizer Kingma and Ba (2014).
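A minimal numpy sketch of the forward pass (untrained random weights; the layer sizes and the exact form of the equivariant layer are our reading of Zaheer et al. (2017), not the authors' code). The final score is invariant to permutations of the input set:

```python
import numpy as np

def elu(z):
    return np.where(z > 0, z, np.exp(z) - 1.0)

def equivariant_layer(X, Lam, Gam):
    """One equivariant layer: each element is transformed by Lam, the set
    mean by Gam, and the two are combined before the ELU activation."""
    mean = X.mean(axis=0, keepdims=True)   # mean vector over the set
    return elu(X @ Lam + mean @ Gam)

def deep_sets_score(X, layers, W1, W2):
    """Equivariant layers, then average pooling, then a small MLP head."""
    for Lam, Gam in layers:
        X = equivariant_layer(X, Lam, Gam)
    h = X.mean(axis=0)                     # permutation-invariant pooling
    return float(elu(h @ W1) @ W2)         # scalar membership score

rng = np.random.default_rng(0)
d, hdim = 20, 16
layers = [(rng.normal(size=(d, hdim)), rng.normal(size=(d, hdim))),
          (rng.normal(size=(hdim, hdim)), rng.normal(size=(hdim, hdim)))]
W1, W2 = rng.normal(size=(hdim, hdim)), rng.normal(size=hdim)

X = rng.normal(size=(8, d))                # a conditioned support set
s1 = deep_sets_score(X, layers, W1, W2)
s2 = deep_sets_score(X[::-1], layers, W1, W2)   # same set, permuted
print(abs(s1 - s2) < 1e-9)                 # True: permutation-invariant
```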
4 Related Work
Meta-Learning and Few-Shot Learning
Meta-learning aims at designing models able to learn how to learn, i.e., how to use information such as a small supervised dataset for a new, unseen task. The goal is to have a model that will adapt and generalize well to new tasks or new environments. To do so, the model is usually trained on several similar tasks, with a ’meta’-training dataset. Various methods have been proposed, with three main types of approaches: (i) optimization-based methods, which aim for instance at predicting how to update the weights more efficiently than usual optimization Andrychowicz et al. (2016); Nichol et al. (2018); Finn et al. (2017); Ravi and Larochelle (2016); (ii) memory-based and model-based methods relying on models like Neural Turing Machines Graves et al. (2014) or Memory Networks Weston et al. (2014), which are trained to learn how to encode, store and retrieve information on new tasks quickly Santoro et al. (2016); Bartunov et al. (2019), or are based on specific architectures with "slow weights" and "fast weights" dedicated to different parts of the training Munkhdalai and Yu (2017); (iii) representation or metric learning-based methods, which aim at learning a good representation function to extract the most information from the examples of a task, in order to then use that representation as a basis for a measure, e.g., a linear distance to the unlabeled examples Snell et al. (2017); Koch et al. (2015); Vinyals et al. (2016); Sung et al. (2018).
Our proposed approach is closest to Sung et al. (2018): one can see the Deep Sets as their ’embedding module,’ and our secondary network as their ’relation module’. They sum the representation of each element of a given class and concatenate it with the example before using the ’relation module’. The output of the relation module is a relation score for each class, used to classify. In our case, the concatenation is done on the input set, processed through equivariant layers (which use information about the whole set through the average pooling on the set) and averaged at the output of the Deep Sets. Additionally, as we focus here on a ’one-class’ setting, our support set contains a single class.
Anomaly Detection and One-Class Classification
Detecting or characterizing ’anomalies’ has been a widely studied problem in machine learning across various applications. We present only a few references briefly here, and refer the reader to the surveys of Chandola et al. (2009) and Chalapathy and Chawla (2019), the latter focusing on deep-learning methods. Anomaly detection is usually formulated as finding instances that are dissimilar to the rest of the data, also called outlier detection or out-of-distribution detection. It can be unsupervised, supervised or semi-supervised.
When supervised, the problem becomes largely similar to classic prediction, with the main issues being obtaining accurate labels and handling highly imbalanced data. The latter has been addressed, e.g., through ensemble-learning Joshi et al. (2002), two-phase rules (obtaining high recall first, then high precision) Joshi et al. (2001), or with cost-based class re-weighting in the classification loss. It is highlighted in Chalapathy and Chawla (2019) that deep-learning methods do not fare well in such settings when the feature space is highly complex and non-linear.
The semi-supervised case, or one-class classification Moya and Hush (1996); Khan and Madden (2009), considers that only one type of label is available (usually the ’normal’ class, as it is in most cases easier to obtain than examples of the various possible anomalies). The goal is usually to learn a model of the class behavior to obtain a discriminative boundary around normal instances. Counter-examples can also be injected to refine the boundary. Among the techniques proposed are One-Class SVMs Schölkopf et al. (2000); Manevitz and Yousef (2001); Li et al. (2003) and representation learning with neural networks Perera and Patel (2019). This setting is also closely related to Positive and Unlabeled Learning (see e.g., Elkan and Noto (2008)), and our application would also fit this definition, as it could be more ’accurate’ to consider our ’negative’ dataset as unlabeled.
In our specific application, we aim at designing a model able, in practice, to conduct one-class classification for each stellar stream we know. However, our supervised class would be considered the ’anomaly’ (in terms of the number of instances in the ’stream class’ versus the other foreground stars). We propose to meta-learn such a model in a meta-supervised setting, which still suffers from the data imbalance but where synthetic data grants us accurate and complete labeling of both classes.
We propose to use Random Forests trained in the classic binary-classification setting as a baseline, motivated by this model’s robustness to smaller datasets and its overall good performance across a variety of machine learning problems. For each stream dataset, we train a new model from scratch. We use a class-imbalance weight hyper-parameter to deal with the imbalance of the datasets and to train models with various trade-offs between recall and precision. We also explore the following hyper-parameters: number of trees in the forest (100, 200, 300, 500), max depth (10, 30, 50), min split (2, 5, 10), min leaf (1, 2, 4), bootstrap vs. no bootstrap.
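A minimal scikit-learn sketch of this baseline, with illustrative toy data in place of the actual stream catalogs (the particular class weight and defaults chosen below are ours; the paper selects them on the meta-validation streams):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for one stream dataset: 10 features, heavy class imbalance.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (1500, 10)),    # foreground stars (negatives)
               rng.normal(2, 0.3, (10, 10))])   # stream stars (positives)
y = np.array([0] * 1500 + [1] * 10)

# Hyper-parameter grid explored in the paper (selection not shown here).
param_grid = {"n_estimators": [100, 200, 300, 500],
              "max_depth": [10, 30, 50],
              "min_samples_split": [2, 5, 10],
              "min_samples_leaf": [1, 2, 4],
              "bootstrap": [True, False]}

# class_weight plays the role of the class-imbalance weight hyper-parameter.
rf = RandomForestClassifier(n_estimators=100, max_depth=10,
                            class_weight={0: 1.0, 1: 50.0}, random_state=0)
rf.fit(X, y)
print(rf.predict(X[-10:]))   # predictions for the 10 stream stars
```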
Our preliminary experiments showed that the Random Forest (RF) approach could achieve high precision with medium or low recall (i.e., conservative models with few false positives). Given that one of the challenges of the problem at hand is the low number of positive examples compared to the negative ones, we propose to also explore self-labeling Triguero et al. (2015) with RF. The idea is to use predicted labels from the model to augment the set of (training) positive examples, and retrain. One can start with an initial Random Forest that is highly conservative (high precision / low recall), which yields ’safe’ but probably less informative examples, or with a more balanced mixture of precision and recall. Our validation protocol indicates better results when selecting the initial model with a balanced criterion (e.g., F1) at medium precision and recall, which are the results shown in this paper. We can repeat the self-labeling process for a fixed number of steps or until a stopping criterion is met (e.g., no positive examples are predicted anymore in the self-labeling pool). Hyper-parameters such as the number of iterations are selected in a meta-validation fashion (i.e., on a group of streams used for validation, described below). The following subsection also describes the split used to keep a pool of separate examples for self-labeling.
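The self-labeling loop can be sketched as follows (toy data; the number of iterations and the stopping criterion follow the description above, everything else is illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
# Toy problem: a labeled training set plus an unlabeled self-labeling pool.
X_train = np.vstack([rng.normal(0, 1, (600, 4)), rng.normal(2, 0.4, (8, 4))])
y_train = np.array([0] * 600 + [1] * 8)
X_pool = np.vstack([rng.normal(0, 1, (600, 4)), rng.normal(2, 0.4, (40, 4))])

for step in range(3):                    # iteration count is meta-validated
    rf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                random_state=0)
    rf.fit(X_train, y_train)
    pred = rf.predict(X_pool)
    new_pos = X_pool[pred == 1]
    if len(new_pos) == 0:                # stopping criterion: no new positives
        break
    # Add newly predicted positives to the training set and retrain.
    X_train = np.vstack([X_train, new_pos])
    y_train = np.concatenate([y_train, np.ones(len(new_pos), dtype=int)])
    X_pool = X_pool[pred == 0]

print(int(y_train.sum()))                # grew from the initial 8 positives
```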
While our ’meta-test’ setting is similar to one-class classification, preliminary experiments with classic one-class methods were not conclusive, likely because of the generally low number of examples in our positive/support sets. We therefore choose to show here only results of baselines in the binary-classification setting.
5.2 Data Pre-Processing and Dataset building
We build a meta-dataset of 61 synthetic stellar streams and their respective foreground. We split them into a meta-training, meta-validation and meta-test dataset composed of respectively 46, 7 and 8 streams. Each "stream dataset" has a ratio of 150 negative examples for 1 positive example.
Meta-training dataset for Meta-DeepSets:
From each stream dataset, we generate training examples composed of (i) a support set of varying size randomly sampled within the stream’s stars, (ii) a star to classify, and (iii) its corresponding label. We can generate meta-training datasets with varying imbalance by changing the ratio of negative examples used and by duplicating positive examples, as each duplicate will have a different support set. Results shown here are computed on a ratio of 1:100 with positive examples used twice (i.e., resulting in a dataset with a balance of 1:50 positive vs. negative examples).
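One way to generate such meta-training examples (a sketch with our own naming and support-set size range; the paper does not give the exact sampling code):

```python
import numpy as np

def make_meta_example(stream_stars, foreground_stars, rng,
                      support_min=10, support_max=100, p_positive=0.5):
    """Build one (support set, query star, label) meta-training example:
    the support set is a random subsample of the stream's stars, and the
    query star is drawn either from the stream (label 1) or the foreground."""
    k = rng.integers(support_min, min(support_max, len(stream_stars)) + 1)
    support = stream_stars[rng.choice(len(stream_stars), size=k, replace=False)]
    if rng.random() < p_positive:
        query, label = stream_stars[rng.integers(len(stream_stars))], 1
    else:
        query, label = foreground_stars[rng.integers(len(foreground_stars))], 0
    return support, query, label

rng = np.random.default_rng(0)
stream = rng.normal(size=(60, 10))        # toy stream: 60 stars, 10 features
foreground = rng.normal(size=(9000, 10))  # toy foreground, ratio 1:150
support, query, label = make_meta_example(stream, foreground, rng)
print(support.shape[1], query.shape, label in (0, 1))
```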
Meta-Validation and Test datasets with Self-Labeling pool:
For each sub-dataset in the meta-validation and meta-testing splits, we use 10% of the stream’s stars as "training examples" – used to train the RF, or used only as the support set input of the Meta-Deep Sets. In the Meta-Test set, the smallest support set has 9 stars, the biggest 92, and the average support-set size is 36 examples. The RF also has access to 10% of the foreground stars (negative examples) for training (i.e., keeping a ratio of 1:150 between positive and negative examples). The Meta-Deep Sets does not have access to those stars (except when fine-tuned on them). The remaining stars of each dataset are split into two groups, one reserved as the self-labeling pool. The other half (final test) is dedicated to all final evaluation, common to all methods, to make the comparison of results consistent across methods.
The GD-1 dataset is built similarly, with a support set (positive training examples, 197 stars), a negative training set (ratio 1:400), a self-labeling set, and a test set (ratios 1:150 for both).
Each task’s dataset within Meta-Train, Meta-Validation and Meta-Test is normalized ’locally,’ i.e., per task/stream, using all examples within the dataset of the task.
The meta-training dataset will be used by the Meta-Deep Sets model to train. Validation and model selection is conducted on the meta-validation set. Note that the Meta-Deep Sets is not fine-tuned or retrained on the meta-validation or meta-test set. We also use the meta-validation dataset to select hyper-parameters for the Random Forest models.
The following results are obtained for Deep Sets with 5 layers of size 100, exploring the following hyper-parameters: the learning rate, the ℓ2-regularization strength, and the class-imbalance weight of the loss.
5.3 Synthetic Streams
Given the imbalanced nature of our classification task, we study the performance of the methods on several criteria that offer various trade-offs between precision and recall. The different measures are precision, recall, F-1, F-2 (favors recall), F-0.5 (favors precision), Balanced Accuracy (BAcc) and the Matthews Correlation Coefficient (MCC) – or Phi Coefficient. BAcc is computed as BAcc = (TPR + TNR) / 2, where the True Positive Rate (TPR) is the recall, and the True Negative Rate (TNR) is the specificity (the number of true negatives divided by the total number of negatives). BAcc is less misleading than classical accuracy for imbalanced datasets and assumes the cost of a false negative is the same as the cost of a false positive. MCC is also a balanced measure even if the classes are of different sizes; it combines the four elements of the confusion matrix Powers (2011).
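Both measures follow directly from the confusion-matrix counts; for concreteness (the example counts below are illustrative, not from the paper):

```python
import math

def balanced_accuracy(tp, fp, tn, fn):
    """BAcc = (TPR + TNR) / 2, the mean of recall and specificity."""
    tpr = tp / (tp + fn)          # recall
    tnr = tn / (tn + fp)          # specificity
    return (tpr + tnr) / 2

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from the four confusion-matrix cells."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Example: 90 of 100 stream stars recovered, 10 false positives among
# 14900 foreground stars.
print(round(balanced_accuracy(90, 10, 14890, 10), 4))  # 0.9497
print(round(mcc(90, 10, 14890, 10), 4))                # 0.8993
```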
All scores shown here are computed by averaging the scores obtained for each stream task within the meta-test set, on the final test stars.
The upper part of Table 1 summarizes the scores of our models selected to maximize different criteria. The hyper-parameters selected for the RF and RF with self-labeling were the same for F-1, F-2, MCC and Balanced Accuracy. The hyper-parameters selected for the Deep Sets led to two different models that maximized either F-1, F-0.5, MCC (and precision) or F-2, BAcc (and recall). We therefore show both models in the Table. We see that the Deep Sets model manages to generalize well to new tasks in the meta-test set without any fine-tuning or self-labeling. It obtains the best results for all criteria considered, with significant gains for all of them.
| Model | Precision | Recall | F-1 | F-2 | F-0.5 | BAcc | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Meta DS, synthetic (best meta F1) | 0.698 | 0.643 | 0.652 | 0.643 | 0.674 | 0.821 | 0.660 |
| Meta DS, synthetic (best meta F2) | 0.502 | 0.859 | 0.619 | 0.737 | 0.541 | 0.927 | 0.646 |
| Meta DS, GD-1 (best meta F1) | 0.490 | 0.473 | 0.481 | 0.476 | 0.486 | 0.735 | 0.478 |
| Meta DS, GD-1 (best meta F2) | 0.266 | 0.561 | 0.360 | 0.459 | 0.297 | 0.775 | 0.380 |
| DS FT, GD-1 (best train F1) | 0.731 | 0.675 | 0.702 | 0.686 | 0.719 | 0.837 | 0.701 |
| DS FT, GD-1 (best train F2) | 0.624 | 0.834 | 0.714 | 0.782 | 0.658 | 0.916 | 0.720 |
| DS FT, GD-1 (best train F0.5) | 0.774 | 0.541 | 0.637 | 0.575 | 0.712 | 0.770 | 0.645 |
| DS FT, GD-1 (best train Rec) | 0.341 | 0.981 | 0.506 | 0.713 | 0.392 | 0.984 | 0.574 |
The lower part of Table 1 shows the scores for GD-1 data for the models selected through the same validation process as the synthetic streams. We see that, on this real stream dataset, our Meta Deep Sets model struggles to obtain results as good. Comparatively, the RF and RF with self-labeling perform very well. Several factors could explain the difference in results between the synthetic streams and GD-1. First, the synthetic streams may differ in nature from GD-1: the meta-approach seems to generalize well to new synthetic streams but not here, which could indicate either that the synthetic streams are not realistic enough or that GD-1 is inherently different. In particular, we know that GD-1 has a very distinctive orbit (observed in pmRA/pmDEC space); these features may be less important on the synthetic streams, whereas the RF, being trained directly on GD-1, can pick up on them more easily. Additionally, the number of positive examples (197) is already reasonably ’big’ for the RF to learn from, compared to the support-set sizes in the Meta-Test set. It would be interesting to study when the RF ’breaks’ for smaller positive training sets, as other real streams will likely have smaller support sets.
We propose to briefly explore fine-tuning the Deep Sets model with the best meta-validation F-1 score. We use the same (noisy) negative training set as the RF (some examples are false negatives), combined with the original positive support set (also used by the RF). We divide the learning rate by two. We modify the learning scheme so that at each fine-tuning epoch, the model is trained on a dataset sampled from the support set and k * n_p negative stars, where n_p is the number of positive examples and k an imbalance factor. We try 4 imbalance factors (30, 50, 70 and 100). All fine-tuned models are retrained so as to see the entire negative dataset 3 times. We show results on the final test data when selecting the best models for criteria F-1, F-2, F-0.5 and recall on the full training set (last rows of Table 1, where DS FT denotes Deep Sets Fine-Tuned). Fine-tuning leads to great improvement. We also see improvement compared to the RF. However, it is fair to note that the RFs might also be improved with additional cross-validation on GD-1 instead of using meta-selected hyper-parameters, though cross-validation might be unstable given the low number of positive examples. It is also important to highlight that the best criterion for selecting the fine-tuned Deep Sets (regarding what one is trying to optimize) should itself be learned in a meta-fashion, as our selection here could be susceptible to overfitting in the fine-tuning phase. However, we feel those results illustrate the potential of the model to reach a variety of trade-offs between precision and recall when using fine-tuning.
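The per-epoch sampling scheme can be sketched as follows (the names n_p, for the number of positive examples, and k, for the imbalance factor, are our own notation):

```python
import numpy as np

def finetune_epoch_data(positives, negatives, k, rng):
    """One fine-tuning epoch's data: all n_p positives plus k * n_p
    negatives sampled without replacement from the (noisy) negative set."""
    n_p = len(positives)
    idx = rng.choice(len(negatives), size=k * n_p, replace=False)
    X = np.vstack([positives, negatives[idx]])
    y = np.concatenate([np.ones(n_p, dtype=int),
                        np.zeros(k * n_p, dtype=int)])
    return X, y

rng = np.random.default_rng(0)
pos = rng.normal(size=(197, 10))          # GD-1 support-set size
neg = rng.normal(size=(197 * 400, 10))    # toy 1:400 noisy negative set
X, y = finetune_epoch_data(pos, neg, k=50, rng=rng)
print(X.shape, int(y.sum()))              # (10047, 10) 197
```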
These experiments show promising results. They encourage us to deepen our understanding of stellar stream characteristics with additional analysis, but also to explore how to improve our current meta-method, e.g., through meta-validation, and how to combine both approaches.
We presented an application of Deep Sets, a model designed for point-clouds and sets, in a meta-learning framework on a new task from astrophysics. We showed that an adapted Deep Sets model can efficiently tackle one-class classification with small sets of examples (where the single class can be considered ’anomalies’) when trained in a fully supervised meta-learning regime.
The results obtained motivate a more in-depth study of models designed for point-clouds applied to meta-learning problems. It is worth noting that using more recent and complex methods for point-clouds, such as Qi et al. (2017b); Zhou and Tuzel (2018), or other equivariant networks Ravanbakhsh (2020), would likely further improve results, as would designing meta-learned self-labeling methods or possibly using active learning. It would also be interesting to explore these methods in the more classical few-shot setting. However, such methods might need to be adapted to work on inputs of larger dimension such as images or time series. While our approach generalizes well on data similar to what it was trained on, we observed a decrease in performance on our real stream, although fine-tuning could efficiently improve the performance.
7 Broader Impact
Regarding potential broader impact directly related to the application at hand: if the method is successfully applied to more real known streams, it could provide a novel way of characterizing the stellar population of streams. Combined with validation through astronomical analysis, it could help find substructures within streams that were not detected before. As mentioned in the Introduction, these could serve as probes of dark matter structure in our galaxy, among other things.
Beyond this application, the model we present (and, more generally, the problem of (meta-)learning one-class classification with small sets for the 'anomalous' classes, which also covers Positive and Unlabeled Learning problems) could match a variety of other applications in different domains. Efficient methods for this problem – derived from this paper or not – could help perform or improve anomaly detection (of known types of anomaly) on larger sets, and facilitate the integration of new 'groups' of anomalies (though they would not directly or necessarily help detect new groups of anomalies or new classes). The impact of this can be positive or negative depending on the use. Positive examples include applications where one has sets of rare occurrences of the same nature, such as molecules (though the authors want to point out that they are not experts in that domain) or rare diseases, where learning fast from small sets could be useful. Unfortunately, one can also easily think of harmful usages, for instance when applied to personal data, e.g., social-network-related or government-obtained data, where various 'one-class problems' could correspond to different groups of specific 'behaviours' or 'people'.
However, for all these different applications, the presented approach might not be suitable as is, especially since in those cases the different 'anomalous' classes may not have enough in common: we focused on similar 'types' of anomalies. It is also unclear – and this holds for both 'positive' and 'negative' possible applications – how crucial it is to have a very well or perfectly labeled meta-training dataset; this point is likely problem-dependent.
8 Supplementary material
Code and data have been released on GitHub here.
We provide code to reproduce the Random Forest results with and without self-labelization, as well as code to generate the datasets used for the Deep Sets, training code, evaluation code, and fine-tuning code for GD-1.
We built our Meta-Deep Sets upon a previous implementation of Deep Sets by its authors (https://github.com/manzilzaheer/DeepSets), simply using a slightly deeper network with 5 equivariant layers. The results shown in this paper were obtained by training Deep Sets models for 50 epochs, where each epoch trains on 300,000 examples.
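For reference, a single permutation-equivariant layer of the kind stacked in our model can be sketched as below (a NumPy illustration of the construction of Zaheer et al. (2017), not the released implementation):

```python
import numpy as np

def equivariant_layer(X, Lam, Gam):
    """Deep Sets permutation-equivariant layer: each set element is
    transformed jointly with a permutation-invariant pooled summary of the
    whole set, so permuting the rows of X permutes the output rows identically.
    X: (set_size, d_in); Lam, Gam: (d_in, d_out) weight matrices."""
    pooled = X.max(axis=0, keepdims=True)           # invariant summary of the set
    return np.maximum(X @ Lam + pooled @ Gam, 0.0)  # ReLU non-linearity
```

Stacking five such layers, as we do, preserves permutation equivariance end-to-end.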
8.2 Data and Datasets - Additional information and statistics
We provide in this section additional details about the features and behaviors of streams. We also provide detailed tables on the synthetic stream datasets we generated, as well as on GD-1. All datasets are available at https://anonymous.4open.science/r/dad5b12a-3dd7-458b-8af7-fb8da319b457/.
Complementary information on astronomical features and streams
As mentioned in Section 2, we use color features for the stars because these colors are indirect indicators of a star's age and composition. To give the reader additional understanding of the data and the problem, here is a short explanation motivating the use of these features, as well as the meta-learning approach. Streams, as explained in Section 1, are thought to form when an external smaller galaxy or a globular cluster (i.e., a group of stars) gets deformed and pulled apart by a bigger galaxy. The fact that the stars of a stream originate from the same external group has an impact on the colors observed in its population: such a cluster is expected to be composed of a stellar population of roughly the same age (i.e., all the stars in that external group formed around the same time). The same applies to their composition (metallicity). There exists a relationship between stellar ages and observed colors called an isochrone. This relationship can be observed in the Hertzsprung-Russell (HR) diagram, which plots a star's luminosity against its temperature or, equivalently, its color. This plot has been extremely helpful in improving our understanding of stellar classification and evolution. Within this diagram, different types of stars cluster at different locations (see Figure 5): main-sequence stars, supergiants, white dwarfs, etc. Additionally, a star's position on the HR diagram changes throughout its life. Figure 5 illustrates the isochrone curves in the HR diagram, each colored line representing a population of stars of the same age. Intuitively, we hope that our meta-learning algorithm will learn a non-linear transformation of the color features that clusters them efficiently based on their underlying ages.
We provide additional visualizations of GD-1 and of one synthetic stream from our Meta-Test set (stream-2805) to help the reader understand the data (real and generated) and some of their differences. Figure 7 shows GD-1 zoomed in in RA-DEC space (projection on the sky), which illustrates the 'gaps' (under-densities) in the stream as well as the 'spur' (an over-dense region slightly above the 'bend' of the main over-dense region). Figure 9 shows the same dimensions for the synthetic stream, which does not have those substructures (note that the synthetic streams are not simulated to interact with dark matter 'clumps'). Figure 7 (resp. Fig. 9) also shows GD-1 (resp. the synthetic stream 2805) in proper-motion space, which is an indicator of the orbit followed by the stars of the stream. We can see that for GD-1 the stars gather on the outskirts of the main 'orbits' in that region: GD-1 actually has a very peculiar orbit, which makes it easier to distinguish from the Milky Way. The synthetic stream's stars cluster in two smaller regions of proper-motion space, but within a region that is already 'dense' (i.e., a more 'common' orbit compared to other stars not from the stream). Finally, Figure 12 (resp. Figure 15) shows GD-1 (resp. the synthetic stream) in color space (each panel showing two color bands). We can see that GD-1's stars share a common 'behavior' in this 3D space, but they are more 'spread out' than the synthetic stream's stars, which appear more compactly clustered in these dimensions. This could explain why the meta-model struggled to transfer directly to GD-1, if most training streams carried more information in the colors than in the proper motions, or were more tightly clustered there.
Datasets composition and statistics
Figure 17 summarizes the sizes of the training stream tasks (Meta-Train) by showing the histogram of stream sizes. From the 41 streams we generate a dataset of 25,812 positive examples and 1,637,113 negative examples, by sampling negative examples at a ratio of 1:150 relative to the number of stream stars and duplicating the positive (stream) examples twice. The support set for each training example is generated by sampling stars within the stream's stars, with the support-set size itself randomly sampled between 7 (the minimum size) and the stream's size. Tables 2 and 3 provide the detailed composition of each stream in, respectively, the Meta-Validation set and the Meta-Test set plus GD-1. This gives the reader an idea of the total size of the synthetic streams and their foreground (sampled at a ratio of 1:150), and of the sizes of the support set for each task. 10% of the "Total" data in each stream are used to mimic the training; the remaining stars are then split (roughly 50:50) into a self-labeling dataset and a test set. Note that the GD-1 train set has a foreground ratio of 1:400, but its self-label and test sets have the usual 1:150 ratio. We remind the reader that at validation and test time (without fine-tuning) the Deep Sets only sees the support set (e.g., for stream-1012 in Validation, 18 examples), concatenated as input to the star to predict on (no retraining), while the Random Forests are trained on the Train set (support set and foreground). For the fine-tuning of the Deep Sets on GD-1, we use the same training set as the Random Forests.
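The task-construction recipe above can be summarized in a short sketch (illustrative only; the names and the exact draw of the support-set size are our assumptions):

```python
import random

def make_stream_task(stream_stars, foreground_stars, rng,
                     neg_ratio=150, min_support=7):
    """Assemble one meta-training task: negatives sampled at ~1:150 per
    stream star, and a support set whose size is drawn at random between
    min_support and the number of stream stars."""
    n_neg = min(neg_ratio * len(stream_stars), len(foreground_stars))
    negatives = rng.sample(foreground_stars, n_neg)
    support_size = rng.randint(min_support, len(stream_stars))
    support = rng.sample(stream_stars, support_size)
    return {"support": support, "positives": stream_stars, "negatives": negatives}
```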
[Tables 2 and 3: for each stream index, the number of stream and foreground (FG) stars in the Total, Train (stream support set and FG), Self-Label, and Test splits, for the Meta-Validation set and the Meta-Test set plus GD-1, respectively.]
Detailed results per stream
We provide in Tables 4 and 5 the detailed performance (Precision and Recall) of the four models (Random Forest (RF), Random Forest with Self-Labelization (RF SL), Deep Sets selected with best F1 in validation (DS F1), and Deep Sets selected with best F2 in validation (DS F2)) for each stream. We observe a correlation between the recall obtained and the size of the support set for the Random Forest methods, while the Deep Sets appear more robust even for smaller support sets.
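Since models are selected and compared via F1, F2, and F0.5, we recall how the F-beta score trades off the precision and recall reported in these tables (the standard formula, not code from the paper):

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more heavily (F2),
    beta < 1 weights precision more heavily (F0.5)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```

For a model with precision 0.5 and recall 1.0, F2 (≈0.83) rewards the high recall more than F1 (≈0.67), which is why the F2-selected Deep Sets trades precision for recall.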
[Tables 4 and 5: for each stream index and support-set size, the Precision and Recall of RF, RF SL, DS F1, and DS F2.]
- Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pp. 3981–3989. Cited by: §4.
- Probing the nature of dark matter particles with stellar streams. Journal of Cosmology and Astroparticle Physics 2018 (07), pp. 061. Cited by: §1.
- Meta-learning deep energy-based memory models. arXiv preprint arXiv:1910.02720. Cited by: §4.
- Stellar streams and clouds in the galactic halo. In Tidal Streams in the Local Group and Beyond, J. L. Carlin (Ed.), Astrophysics and Space Science Library, Vol. 420, pp. 87–112. Cited by: §1.
- Signatures of dark matter burning in nuclear star clusters. The Astrophysical Journal Letters 733 (2), pp. L51. Cited by: Figure 5.
- Deep learning for anomaly detection: a survey. arXiv preprint arXiv:1901.03407. Cited by: §4.
- Anomaly detection: a survey. ACM computing surveys (CSUR) 41 (3), pp. 1–58. Cited by: §4.
- Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 213–220. Cited by: §4.
- Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §4.
- The ghosts of galaxies past. Scientific American 296, pp. 40–45. Cited by: §1.
- Neural turing machines. arXiv preprint arXiv:1410.5401. Cited by: §4.
- Meta-learning in neural networks: a survey. arXiv preprint arXiv:2004.05439. Cited by: §4.
- Mining needle in a haystack: classifying rare classes via two-phase rule induction. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data, pp. 91–102. Cited by: §4.
- Predicting rare classes: can boosting make any weak learner strong?. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 297–306. Cited by: §4.
- A survey of recent trends in one class classification. In Irish conference on artificial intelligence and cognitive science, pp. 188–197. Cited by: §4.
- Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §3.
- Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2. Cited by: §1, §4.
- Globular cluster streams as galactic high-precision scales? The poster child Palomar 5. The Astrophysical Journal 803 (2), pp. 80. Cited by: §1.
- Improving one-class svm for anomaly detection. In Proceedings of the 2003 International Conference on Machine Learning and Cybernetics (IEEE Cat. No. 03EX693), Vol. 5, pp. 3077–3081. Cited by: §4.
- One-class svms for document classification. Journal of machine Learning research 2 (Dec), pp. 139–154. Cited by: §4.
- Network constraints and multi-objective optimization for one-class classification. Neural Networks 9 (3), pp. 463–474. Cited by: §4.
- Meta networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2554–2563. Cited by: §4.
- On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999. Cited by: §4.
- Fossil stellar streams and their globular cluster populations in the E-MOSAICS simulations. Cited by: §1.
- The spur and the gap in GD-1: dynamical evidence for a dark substructure in the Milky Way halo. Cited by: §1, §2.
- Datamodel description. In Gaia Data Release 2, pp. 521–576. Cited by: §2.
- Learning deep features for one-class classification. IEEE Transactions on Image Processing 28 (11), pp. 5450–5463. Cited by: §4.
- Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation. Cited by: §5.3.
- Off the beaten path: Gaia reveals GD-1 stars outside of the main stream. The Astrophysical Journal Letters 863 (2), pp. L20. Cited by: §2.
- Gala: a Python package for galactic dynamics. The Journal of Open Source Software 2, pp. 388. Cited by: §2.
- PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660. Cited by: §3.
- Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pp. 5099–5108. Cited by: §3, §6.
- Universal equivariant multilayer perceptrons. arXiv preprint arXiv:2002.02912. Cited by: §6.
- Optimization as a model for few-shot learning. Cited by: §4.
- Meta-learning with memory-augmented neural networks. In International conference on machine learning, pp. 1842–1850. Cited by: §4.
- Support vector method for novelty detection. In Advances in neural information processing systems, pp. 582–588. Cited by: §4.
- Prototypical networks for few-shot learning. In Advances in neural information processing systems, pp. 4077–4087. Cited by: §4.
- Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. Cited by: §4.
- Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowledge and Information Systems 42 (2), pp. 245–284. Cited by: §5.1.
- Meta-learning: a survey. arXiv preprint arXiv:1810.03548. Cited by: §4.
- Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: §4.
- Few-shot learning: a survey. arXiv preprint arXiv:1904.05046. Cited by: §4.
- Memory networks. arXiv preprint arXiv:1410.3916. Cited by: §4.
- Deep sets. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3391–3401. Cited by: 1st item, §3.
- Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499. Cited by: §6.