Drifting Features: Detection and evaluation in the context of automatic RRLs identification in VVV

05/04/2021
by   J. B. Cabral, et al.

As most modern astronomical sky surveys produce data faster than humans can analyze it, Machine Learning (ML) has become a central tool in Astronomy. Modern ML methods can be characterized as highly resistant to some experimental errors. However, small changes in the data over long distances or long periods of time, which cannot be easily detected by statistical methods, can be harmful to these methods. We develop a new strategy to cope with this problem, using ML methods in an innovative way to identify these potentially harmful features. We introduce and discuss the notion of Drifting Features, related to small changes in the properties as measured in the data features. Building on an earlier work on the identification of RRLs in VVV, we introduce a method for detecting Drifting Features. Our method forces a classifier to learn the tile of origin of diverse sources (mostly stellar 'point sources') and selects the features most relevant to this task as candidates for Drifting Features. We show that this method can efficiently identify a reduced set of features that contains useful information about the tile of origin of the sources. For our particular example of detecting RRLs in VVV, we find that Drifting Features are mostly related to color indices. On the other hand, we show that, even though there is a clear set of Drifting Features in our problem, they are mostly insensitive to the identification of RRLs. Drifting Features can thus be efficiently identified using ML methods; however, in our example, removing them does not improve the identification of RRLs.


1 Introduction

Most modern astronomical sky surveys are characterized by fast-paced data ingestion, data-intensive science cases and automatic reduction pipelines (e.g. feigelson2012big), which often lie on the verge of technological developments and analysis capabilities. This unprecedented availability of observations challenges the traditional approaches to data analysis, leading to a shift in the paradigm for knowledge discovery (bell2009beyond), notably dominated by machine learning (ML) techniques (ball2010data). Despite the difficult mathematical and statistical foundations, a complex terminology driven by the confluence of several sciences and the arduous interpretation of the results, the training of intelligent agents has become an everyday practice in astronomy. The accessibility of easy-to-use free software resources, mostly written for the R (team2000r) or Python (van2003python) languages, has been fundamental to this shift.

In most cases, ML methods can be separated into two basic steps: first, raw data is converted into a set of useful features that are relevant to the task at hand (e.g. periods or intensities), and then these features are fed to a classifier or a statistical method (see e.g. mitchell1997machine).
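A minimal sketch of this two-step pipeline, using toy light curves and placeholder features (mean magnitude and amplitude, not the actual VVV feature set), with scikit-learn assumed as the classification library:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_features(light_curve):
    """Step 1 (toy): convert raw magnitudes into two features."""
    mags = np.asarray(light_curve)
    return [mags.mean(), mags.max() - mags.min()]

# Synthetic light curves: even-indexed curves are quieter than odd-indexed ones.
rng = np.random.default_rng(0)
curves = [rng.normal(15.0, 0.1 + 0.3 * (i % 2), size=50) for i in range(100)]
labels = [i % 2 for i in range(100)]  # 0 = "constant", 1 = "variable" (toy)

# Step 2: feed the feature matrix to a classifier.
X = np.array([extract_features(c) for c in curves])
clf = RandomForestClassifier(random_state=0).fit(X, labels)
print(clf.score(X, labels))
```

The two steps are deliberately decoupled: the same feature matrix can be reused with any classifier, which is how the different models in Section 2.3 are compared.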

ML methods have a number of limitations. For instance, they are highly susceptible to errors produced by the limitations of the datasets (cai2015challenges). The results can also be hampered by the not fully understood role of the features (duboue2020art) or by the biases introduced by improperly defined experiments (domingos2012few). These facts are well known and have not been ignored in the astronomical community (2020MNRAS.492.5377L).

Here we are interested in the role of some sources of noise present in features commonly used in astronomical research, and in their impact on the results of ML methods in this context. We use data from the synoptic survey “Vista Variables in Via Láctea” (VVV, 2010NewA...15..433M), observed with the VISTA telescope (sutherland2015visible), whose main objectives include producing a three-dimensional map of a large part of the galactic center (Bulge) of the Milky Way and a fraction of the internal Galactic Disk. The VVV data is presented in units called “tiles”, rectangular areas of the sky surveyed over time. For each tile, the VVV data reduction pipeline (emerson_vista_2004) provides a pre-processed image and a database of files with the position, magnitude and color indices of the light sources present in the image, which comprise the “photometric catalog”. These catalogs are the main subject of this study.

The images are subject to two noise sources, namely experimental errors and observation conditions. The derived catalogs are also affected, since the noise permeates all the survey information that is captured in the set of features or observables. Atmospheric conditions, moon phases, maintenance of the camera and telescope, or modifications to the software, among many others, can influence the observation or recording of the data. As a consequence, the derived measurements used as features for an ML analysis can also be prone, at different levels, to these errors and conditions.

Random measurement errors are present in every experimental or observational science. They are unavoidable, but each error typically affects only a single observation or a reduced set of observations. ML methods can cope efficiently with this kind of error. For a large survey such as the VVV, however, observational conditions can change slightly (but not randomly) over long periods of time or between different regions of the sky. This problem is more difficult for ML methods. In many situations we want to train an intelligent agent on a well-known portion of the survey and then use it to search for a given astronomical phenomenon in other, less-known zones. Given the ML methodology, the agent will work efficiently on the training data, but will probably fail to generalize to other zones due to these slight changes in observational conditions. Given the diverse nature of the features extracted from the data (intensity, periods, colours, etc.), they will likely reflect this effect in different proportions. It is then interesting to ask whether it is possible to automatically detect which of the extracted features are more sensitive to these changes in observational conditions. Hereafter we call “Drifting Features” those features in a dataset that are sensitive to the observational conditions. We aim at evaluating their influence on a large-scale ML experiment.

As a working example, we focus on the problem of detecting RR-Lyrae (RRLs) variable objects over VVV data. That is, we train classifiers using data from some VVV tiles, and evaluate them in the task of identifying RRLs on other tiles.

Drifting Features should be consistent within a limited zone of the sky (for instance, a tile or two consecutive tiles) but should show slight changes, almost undetectable by simple statistics, between tiles that are far apart. Those changes could potentially alter the capabilities of the classifier. To detect these features and their effect on automatic classification, we propose to use, again, ML methods. If we task an ML method with discriminating data from two tiles, it will be forced to learn the differences between the tiles that are present in the features. We can then use Feature Selection methods (guyon2002gene) to evaluate the importance of each feature for this tile-discriminating classifier, and mark highly relevant features as candidates for being Drifting Features.

This work is divided into the following sections: in Section 2 we explain our experimental setup (data, feature extraction, model selection, etc.). In Section 3 we introduce our procedure for the identification of Drifting Features, and in Section 4 we evaluate the effect of these features on the task of RRL identification. Finally, in Section 5, we discuss our results and draw our conclusions.

2 Experimental setup

2.1 Data

One of the main objectives of the VVV is the creation of a 3D map of the bulge and the galactic center (2010NewA...15..433M), for which the search for variable stars in general, and RRLs in particular, is important due to their use as standard candles (bailey1902discussion). To this end, the survey relies on data from the VIRCAM infrared camera, mounted on ESO's VISTA survey telescope (sutherland2015visible), which at the time of its construction was the largest NIR camera, with 16 non-contiguous 2k x 2k detectors. To cover a contiguous tile, VIRCAM exposes its detectors 6 times with suitable offsets. Each of these exposures is called a pawprint, and the combination of the six overlapping pawprints is a tile. For this reason each pixel is observed in at least 2 pawprints, and the tile edges are shared with the observations of the adjacent tiles. The survey observation plan was organized in two stages: during the first year each tile was observed in five astronomical filters (Z, Y, J, H and Ks) separated by a few hours; then, in subsequent years, it was re-observed in the Ks band for variability studies. Only some tiles were observed in multiple bands after the first year.

The dataset used in this work is the one presented in cabral2020automatic, which consists of 62 features extracted with feets (cabral_fats_2018) from light curves that were reconstructed from the photometric catalogs provided by the Cambridge Astronomical Survey Unit (CASU).

From the original dataset we selected 8 tiles located in different zones of the Bulge, as shown in Figure 1. For each tile we extracted all the RRLs plus a uniform sample of unclassified, unsaturated and non-faint sources (average magnitude between and ). From these selections, sources with invalid values were removed, leaving the final dataset for this work, described in Table 1.

We chose to use a reduced dataset with around sources for each tile in order to dramatically decrease the computational burden of our experiments. As shown in cabral2020automatic, the use of a reduced dataset can lead to optimistic estimations of the accuracy of the detections, but our main objective is to find and characterize the features that best represent the differences between the tiles, not to maximize the accuracy of the detection of the RRLs.

Figure 1: Map of the Bulge tiles of the VVV survey over an extinction map (adapted from gonzalez2012reddening). We highlight the tiles used in this work with red borders.
Tile Total RRL Sample
b206 407720 47 2047
b214 376822 35 2034
b216 334773 43 2043
b261 735838 253 2252
b277 831323 430 2429
b278 857887 437 2436
b360 1029149 679 2669
b396 729671 15 2015
Table 1: Total number of sources, RRL and sample taken in each tile used in this work.

2.2 Error measures

We face two different binary classification problems in this work. First, we separate sources between two tiles; this is done to construct a Tile Classifier (hereafter TC) that allows us to assess the relevance of the features. Then, we build a Source Classifier (hereafter SC) that seeks to discriminate RRL sources from unknown sources. In the first problem both classes are nearly balanced in all cases. On the other hand, as discussed in cabral2020automatic, the identification of a few variable stars within a large set of unknown sources is usually a highly imbalanced problem, which generates several inconveniences, such as those discussed in the recent work by hosenie2020imbalance, and requires specific error measures.

In the RRL detection problem (SC) we will call the RRL samples the positive class and the other sources the negative class. In the tile identification problem (TC) both classes (the two tiles) are equivalent, so we will arbitrarily call one of them positive and the other negative. All positive samples (in this case, either a source or a tile) that are correctly identified by the classifier are called true positives (TP); if they are missed by the classifier they are called false negatives (FN). Negative samples that are wrongly classified are called false positives (FP), and those correctly identified are called true negatives (TN). Using a combination of these four outcomes, we can define two complementary performance measures, called Precision and Recall, which are adequate to deal with unbalanced problems. The Precision is defined as Precision = TP / (TP + FP). It measures, for example, the fraction of real RRLs among all those retrieved by the classifier. The Recall, on the other hand, is defined as Recall = TP / (TP + FN). It measures, in the same example, the fraction of all RRLs that are detected by the classifier.

Many classifiers can change their decision outputs by adjusting the probability threshold above which an observation is considered positive. A high threshold increases the Precision and decreases the Recall, since fewer cases are classified as positive, while a low threshold has the opposite effect. To evaluate Precision and Recall together, we consider Precision-Recall curves, where we plot the set of (Precision, Recall) pairs corresponding to different thresholds. A curve that approaches the top-right corner is, in general, considered to represent a better classifier.
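These definitions can be sketched in a few lines, assuming scikit-learn and entirely synthetic labels and scores (the numbers below are illustrative, not from the VVV dataset):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, precision_recall_curve

# Synthetic ground truth (1 = positive, e.g. RRL) and classifier scores.
y_true  = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05])

# Default threshold of 0.5: count TP, FP, FN by hand.
y_pred = (y_score >= 0.5).astype(int)
tp = np.sum((y_pred == 1) & (y_true == 1))   # 2
fp = np.sum((y_pred == 1) & (y_true == 0))   # 1
fn = np.sum((y_pred == 0) & (y_true == 1))   # 1

print(tp / (tp + fp), precision_score(y_true, y_pred))  # Precision = 2/3
print(tp / (tp + fn), recall_score(y_true, y_pred))     # Recall    = 2/3

# Sweeping the threshold yields the (Precision, Recall) pairs of the PR curve.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
```

Raising the threshold removes low-score positives first, which is why Precision tends to rise while Recall falls, tracing out the curve.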

For balanced classification problems it is more common to find traditional metrics in the literature. Accordingly, for the tile identification problem we also use the Accuracy and the area under the ROC curve (ROC-AUC). The ROC curve is equivalent in concept to the Precision-Recall curve described before, and the area under it is a global measure of the performance of the classifier. The main difference between the two curves is that a ROC curve that approaches the top-left corner represents a better classifier.

2.3 Model Selection

For the TC problem we evaluated four classifiers with diverse foundations: SVM (Support Vector Machine) with linear kernel (vapnik2013nature), SVM with Radial Basis Function (RBF) kernel, K-Nearest Neighbors (KNN, mitchell1997machine) and Random Forest (RF, breiman2001random).
To determine the best hyper-parameters for each model, we executed a grid search over all possible combinations of values for each hyper-parameter, taken from a fixed list. We used a 5-fold cross-validation setup on a dataset with tiles b278 and b261, using Precision as the performance measure. These tiles were chosen because they are not extreme in location or in their balance, such as b396 or b220.
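The grid search can be sketched with scikit-learn's GridSearchCV; the parameter grid below is a placeholder (the actual lists of candidate values are not given in the text), and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the b278/b261 tile dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 5-fold CV over every combination in the (placeholder) grid,
# scored by Precision, as described above.
grid = GridSearchCV(
    SVC(kernel="linear"),
    param_grid={"C": [0.1, 1, 10]},  # hypothetical candidate values
    scoring="precision",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

The same pattern applies to the other three models, with each model's own hyper-parameter grid.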

With this setup, we selected the following hyper-parameter values:

SVM-Linear: .

SVM-RBF: and .

KNN: with a metric; also, the importance of the neighbor class was not weighted by distance.

RF: we created decision trees with Information Gain as the metric; the maximum number of randomly selected features for each tree is of the total number of features, and the minimum number of observations in each leaf is .

Model Precision Recall AUC
SVM-Linear 0.8511 0.8511 0.9286
SVM-RBF 0.8003 0.8003 0.8680
RF 0.7707 0.7707 0.8548
KNN 0.6973 0.6973 0.7685
Table 2: Classification metrics of the SVM models (with kernel linear and RBF), RF and KNN on the tiles b278 and b261 sources.
Figure 2: ROC (left) and Precision-Recall (right) curves of the SVM models (with linear and RBF kernels), RF and KNN, for the prediction of the tile of a given source, using 10-fold CV with tiles b278 and b261.

Using the optimal values of the hyper-parameters, we compared the four models on the same dataset using a 10-fold cross-validation setup. Table 2 shows the corresponding results using the default threshold (0.5) for all models. For all three metrics considered (Precision, Recall and AUC) the SVM-Linear classifier clearly outperformed all the other classifiers. More importantly, Figure 2 shows the corresponding ROC and Precision-Recall curves, which show that SVM-Linear also outperforms the other methods for all possible thresholds. Given these results, we selected SVM-Linear as the classifier for the tile identification problem.
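The 10-fold comparison of the four models can be sketched as follows, again on synthetic stand-in data (the real comparison uses the tuned hyper-parameters on the b278/b261 dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the tile dataset.
X, y = make_classification(n_samples=400, n_features=20, random_state=1)

models = {
    "SVM-Linear": SVC(kernel="linear"),
    "SVM-RBF": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(random_state=1),
    "KNN": KNeighborsClassifier(),
}

# 10-fold CV; ROC-AUC shown here, Precision/Recall follow the same pattern.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
    print(f"{name}: mean ROC-AUC = {scores.mean():.3f}")
```

Which model wins depends on the data; the ranking reported in Table 2 is specific to the VVV tile features.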

For the SC problem cabral2020automatic already determined that RF is the classifier with the best performance for our dataset and general experimental setup.

2.4 Feature Selection

Feature selection (guyon2013introduction) is the process of extracting subsets of features from the entire set in order to optimize the classification performance and/or the computational complexity of the problem. For this work we chose the Recursive Feature Elimination algorithm (RFE, guyon2002gene). The method is widely adopted and characterized by its good performance and simplicity. As a backward selection method, RFE starts with all the features and sequentially eliminates unimportant features in a recursive process.

RFE is integrated with a classification method, which provides at each step of the recursion an importance score for the features. RFE iteratively executes the underlying classifier and extracts the score for each variable; then the variable (or group of variables) with the worst score is eliminated.

The method typically ends when the desired (fixed) number of features to select is reached. Another possibility is to monitor a performance metric for the subsets (for example the accuracy on an independent validation set) and stop the recursion when the metric is optimal.

In this work we rely on the RFE implementation with k-fold Cross-Validation (RFECV) as the stopping criterion, provided by the Scikit-Learn package (pedregosa2011scikit). RFECV produces replicated experiments ( in our work), each one selecting features over () folds and monitoring the classification error () on the remaining fold. It then determines the number of features to select by looking for the lowest average error across all the folds. In the last step, RFECV produces a final selection, using all the dataset to select the features and stopping at the previously determined point.
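A minimal sketch of RFECV, using a linear SVM as the embedded classifier (as in the TC problems) on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

# Synthetic dataset: 30 features, only a few of them informative.
X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           random_state=0)

# Eliminate one feature per step; 5-fold CV decides where to stop.
selector = RFECV(SVC(kernel="linear"), step=1, cv=5, scoring="accuracy")
selector.fit(X, y)

print(selector.n_features_)  # number of features RFECV decided to keep
print(selector.support_)     # boolean mask over the original features
```

The `support_` mask directly yields the selected subset and, by complement, the remaining features — the split used later to define the Drifting and Stable subsets.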

It is worth mentioning that the classifier embedded in the RFE in the feature selection stage may be different from the eventual method in the final classification stage.

3 Finding Drifting Features

As stated in the Introduction, we propose to use ML methods to detect Drifting Features, looking for features that are useful for determining the tile of origin of a given source (exclusively from features derived from the pawprint stack photometry, without any other Header Keyword Data). With this goal, in this first experiment we consider all the sources in each tile together (RRLs and unknowns) and train classifiers to learn the tile of origin of each source, not its astronomical type.

We apply the RFECV method, as described in the previous section, to 28 binary classification problems, each consisting of separating a different pair of tiles from the set of 8 tiles in our dataset. Thus, for each TC problem (for example, separating tiles b206 and b214) we obtain from RFECV a subset of features selected for that problem. Each subset potentially has a different length, as discussed before.
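Constructing the 28 pairwise TC datasets can be sketched as follows; the feature arrays are random stand-ins for the real per-tile feature tables:

```python
from itertools import combinations

import numpy as np

tiles = ["b206", "b214", "b216", "b261", "b277", "b278", "b360", "b396"]

# Stand-in features: tile -> (n_sources, n_features) array.
rng = np.random.default_rng(0)
data = {t: rng.normal(i * 0.1, 1.0, size=(50, 5)) for i, t in enumerate(tiles)}

# One binary problem per unordered pair of tiles; the label is the
# tile of origin of each source, not its astronomical type.
problems = {}
for tile_a, tile_b in combinations(tiles, 2):
    X = np.vstack([data[tile_a], data[tile_b]])
    y = np.array([0] * len(data[tile_a]) + [1] * len(data[tile_b]))
    problems[(tile_a, tile_b)] = (X, y)

print(len(problems))  # 28 pairwise TC problems
```

RFECV is then run independently on each of the 28 (X, y) pairs, yielding one selected subset per problem.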

Figure 3 shows the number of features selected for each problem. It is evident that there are two different behaviours. In some cases RFE selects just a few features, for example tile b216 against any other tile except b206; conversely, in other cases the selected subset contains a high number of features (tile b206 against b216, or b396 against b277 or b278, for example). But the number of selected features is not relevant by itself; what is more important for identifying Drifting Features is how well they separate the two tiles.

We can arbitrarily divide the problems into “few features” (4 or fewer selected features) and “high number” (more than 4 selected features). Figure 4 shows ROC curves for the 12 “high number” problems and their relative locations in space. The figure shows curves for three classifiers: one trained with all the features in the dataset (All features), a second trained using only those selected by RFE, i.e. our candidates for Drifting Features, and a third trained with those not selected by RFE (which we call the Stable subset). All the “few features” cases (b216-b278, for example) produce trivial ROC curves saturated at the top-left corner for the three subsets, with , which we do not show.

Analyzing the results, the first observation is that, as expected, the classifier trained with the Drifting Features (those selected by feature selection) is always very similar in performance to the one trained with all the features. This result confirms that RFECV does its job, selecting a subset of features that is responsible for the separation of the classes. The second result is that the performance of the models on the “few features” problems is clearly superior to that on the problems shown in the figure, i.e. the “high number” problems. This implies that the two behaviours noted in Figure 3 correspond, on the one hand, to problems that are easy to solve, where the separation is almost perfect and can be achieved with a few features, and, on the other, to problems that are harder, where the tiles cannot be fully separated and RFE selects bigger subsets.

It is interesting to note the different response of the classifiers trained on the Stable subset for the easy and hard problems. In the hard problems RFECV selects a high number of features, which means that no single feature contains considerable information about the tile of origin of the source. After these features are selected by RFECV, the rest of the features, the Stable subset, contain much less information about the origin and produce a classifier with low performance. For the easy problems, on the other hand, there are several features with much information about the tile of origin. Using just the 2 to 4 features selected by RFECV is enough to produce an almost perfect classifier, and the Stable subset in this case still contains plenty of features with good information about the origin, producing a classifier with almost perfect performance as well. If we take into account the relative positions of each pair of tiles for the hard and easy problems, no definite pattern emerges. Most problems involving neighbouring tiles, for example b277, b278 and b261, are hard, and most problems involving tiles in the lower-right region are easy.

Figure 3: Number of features selected by RFE for each binary TC problem. Each cell corresponds to the dataset including Tile A (Rows) and Tile B (Columns).
Figure 4: ROC curves for the combinations of two tiles selecting the more important features. Each panel has three curves: one using all the features (thick grey lines), the Drifting Features selected by RFE (dashed lines), and those considered stable, not selected by RFE (dotted lines).
Figure 5: Total number of times that each feature was selected by RFECV over the 28 TC problems considered in this work. The colors identify the group each feature belongs to: orange for those based on color, blue for those based on period, and white for those based only on magnitudes.

Another relevant analysis for a feature selection method concerns which features are selected in each case. The upper half of Figure 5 shows the number of times that each feature was selected by RFECV over the 28 TC problems, for all features that were selected at least twice (Table 3 in Appendix A shows the list of features selected for each problem). Two particular features (c89_hk_color and n09_hk_color) were selected in almost all cases, while all the features related to pseudo-color (cabral2020automatic) were the most frequently selected, appearing in at least half of the cases. This suggests that color-related information in general is the most important characteristic for distinguishing tiles.

Figure 6: Same as Fig. 4 for a fixed RFE selection of 10 features for the Drifting subset. Top row shows two Easy datasets while bottom row shows two Hard datasets.

The two very different behaviours of hard and easy problems make it harder to evaluate the real influence of Drifting Features on the TC problems, because deleting only two features or half of the features leads to very different scenarios. To make the comparison easier and fairer, we changed the feature selection method, using RFE with a fixed number of ten selected features.

The list of these selected features and their frequencies can be seen in the bottom half of Figure 5. Only 15 features were selected in total over the 28 TC problems, of which the 11 most relevant are related to color, probably with a high dependence on the location of the tile.
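Switching from RFECV to plain RFE with a fixed subset size is a one-line change in scikit-learn; a sketch on synthetic data with the paper's 62-feature dimensionality:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in: 62 features, as in the VVV dataset.
X, y = make_classification(n_samples=300, n_features=62, n_informative=8,
                           random_state=0)

# Fixed selection of exactly ten features for every TC problem.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=10).fit(X, y)

drifting = rfe.support_  # mask of the 10 selected (Drifting) features
stable = ~drifting       # the complementary (Stable) features
print(drifting.sum(), stable.sum())  # 10 52
```

Because every problem now yields a Drifting subset of the same size, the Full, Drifting and Stable classifiers become directly comparable across problems.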

The overall performance of the classifiers trained with the Full, Drifting and Stable subsets, for some exemplars of easy and hard datasets, can be seen in Figure 6. For the easy problems (top row) we are now using fewer features in the Stable subset, leading to a lower performance. Conversely, for the hard problems (bottom row) we are now using more features in the Stable subset, leading to a clear increase in its performance. The rest of the TC problems show the same type of result (data not shown).

Using a fixed-size selection with RFE, we obtain in all cases a subset of 10 features (the Drifting subset) that can discriminate the tile of origin with high accuracy, and another subset (the Stable subset) with much less information about the tile of origin of the sources. In the following we analyze the impact of these subsets on the SC problems.

4 Evaluation of the influence of Drifting Features

In this section we evaluate the influence of the Drifting subsets selected in the previous step on the SC problems, i.e., discriminating RRL variable stars from unknown sources.

For each pair of tiles we have three datasets: one with all the features (Full), a second with only the 10 Drifting Features selected by RFE, and finally one with the remaining features, the Stable subset. Unlike the previous problem, for the SC problem we now have two possibilities for each pair of tiles: first, we train our classifiers on one of the tiles and search for RRLs in the other; second, we invert the tiles, training classifiers on the second tile of the pair and looking for RRLs in the first. Thus, for each one of the 56 SC problems we train the corresponding RF classifiers and obtain three PR curves, for the Full, Drifting and Stable classifiers.
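The cross-tile evaluation loop can be sketched as follows; the data, the class imbalance and the Drifting mask are all synthetic stand-ins (a random placeholder mask rather than the RFE-selected one):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, precision_recall_curve

# Imbalanced synthetic problem with the paper's 62-feature dimensionality.
X, y = make_classification(n_samples=600, n_features=62, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, y_train = X[:300], y[:300]  # stand-in for the training tile
X_test, y_test = X[300:], y[300:]    # stand-in for the test tile

# Placeholder Drifting mask (in the paper this comes from the fixed-10 RFE).
rng = np.random.default_rng(0)
drifting = np.zeros(62, dtype=bool)
drifting[rng.choice(62, 10, replace=False)] = True

# Train RF on the training tile, trace a PR curve on the test tile,
# once per feature subset.
for name, mask in [("Full", np.ones(62, dtype=bool)),
                   ("Drifting", drifting),
                   ("Stable", ~drifting)]:
    rf = RandomForestClassifier(random_state=0)
    rf.fit(X_train[:, mask], y_train)
    scores = rf.predict_proba(X_test[:, mask])[:, 1]
    p, r, _ = precision_recall_curve(y_test, scores)
    print(f"{name}: PR-AUC = {auc(r, p):.3f}")
```

Repeating this for both train/test orderings of each of the 28 tile pairs gives the 56 SC problems and their three PR curves each.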

The complete results are presented in Appendix B, while a summary of some representative cases can be seen in Figure 7. A first result is that the Drifting subset clearly shows lower performance in all cases. This was expected, as most Drifting Features are related to color, and cabral2020automatic demonstrated that color alone cannot clearly identify RRLs. More interestingly, if we compare the performance of the Full datasets with the Stable datasets, we can see that there is no clear advantage in eliminating the Drifting Features from the datasets. The Full and Stable curves are nearly identical in all cases. The differences are small, and there is no clear pattern of when eliminating Drifting Features improves the performance of the ML methods.

Figure 7: Precision-Recall curves for the SC problems. We show results for six combinations of train-test tiles using three classifiers trained with the Full, Drifting and Stable subsets of features.

5 Discussion

In this work we introduced and discussed the concept of Drifting Features, related to small changes in the properties measured by the features, which can potentially harm the results of ML methods in astronomy. Using the identification of RRLs in VVV as a working example, we introduced an indirect ML method for detecting Drifting Features: we forced a classifier to learn the tile of origin of diverse sources, and selected the features most relevant to this task as candidates for Drifting Features. We showed that this method can efficiently identify a reduced set of features that contains useful information about the tile of origin of the sources. We also showed that, for our particular example of detecting RRLs in VVV, Drifting Features are mostly related to color. On the other hand, we showed in Section 4 that, even though we have a clear set of Drifting Features in our problem, they are almost harmless to the identification of RRLs.

As future work we will explore the influence of Drifting Features on the detection of other types of variable sources and on other large-scale ML experiments. We will also explore a different way of setting the number of features selected by RFE, considering all features that are relevant to the problem and not only the subset that shows the best performance for some metric, or a fixed-length subset.

Acknowledgements.
The authors would like to thank their families and friends, as well as the IATE astronomers, for useful comments and suggestions. This work was partially supported by the Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET, Argentina) and the Secretaría de Ciencia y Tecnología de la Universidad Nacional de Córdoba (SeCyT-UNC, Argentina). J.B.C. is supported by a fellowship from CONICET. Some processing was achieved with Argentine VO (NOVA) infrastructure, for which the authors express their gratitude. We gratefully acknowledge data from the ESO Public Survey program ID 179.B-2002 taken with the VISTA telescope and products from the Cambridge Astronomical Survey Unit (CASU). J.B.C. thanks Maren Hempel for creating the template on which Figure 1 is based, and Bruno Sánchez and Martín Beroiz for their continuous support and friendship. This research has made use of http://adsabs.harvard.edu/, the Cornell University xxx.arxiv.org repository, adstex (https://github.com/yymao/adstex), astropy and the Python programming language.

References

Appendix A Finding Drifting Features

Feature — number of times selected, out of the 28 TC datasets (b261–b277, b261–b278, b261–b360, b277–b278, b277–b360, b278–b360, b206–b216, b206–b214, b206–b261, b206–b277, b206–b278, b206–b360, b206–b396, b216–b214, b216–b261, b216–b277, b216–b278, b216–b360, b216–b396, b214–b261, b214–b277, b214–b278, b214–b360, b214–b396, b261–b396, b360–b396, b278–b396, b277–b396)

c89_hk_color 27
n09_hk_color 25
n09_c3 15
c89_c3 14
c89_m2 12
c89_m4 12
c89_jh_color 12
n09_m2 12
n09_m4 12
n09_jh_color 10
c89_jk_color 10
n09_jk_color 9
Eta_e 8
MaxSlope 6
Mean 6
Psi_eta 5
MedianBRP 5
PercentAmplitude 5
PercentDifferenceFluxPercentile 5
ppmb 5
Rcs 5
Std 5
FluxPercentileRatioMid65 5
Q31 4
Autocor_length 4
LinearTrend 4
Amplitude 4
Freq1_harmonics_amplitude_1 4
Psi_CS 3
Beyond1Std 3
FluxPercentileRatioMid35 3
MedianAbsDev 3
SmallKurtosis 3
Skew 3
Freq2_harmonics_amplitude_0 3
Freq2_harmonics_amplitude_1 2
FluxPercentileRatioMid80 2
Freq1_harmonics_amplitude_2 2
Freq1_harmonics_rel_phase_1 2
Gskew 2
Freq3_harmonics_amplitude_1 2
Freq3_harmonics_amplitude_0 2
PeriodLS 1
Freq1_harmonics_rel_phase_2 1
FluxPercentileRatioMid50 1
Freq3_harmonics_amplitude_2 1
Freq1_harmonics_amplitude_0 1
Period_fit 1
Freq2_harmonics_rel_phase_3 1
Freq3_harmonics_amplitude_3 1
Freq1_harmonics_rel_phase_3 1
Freq3_harmonics_rel_phase_1 1
Freq2_harmonics_amplitude_3 1
Freq2_harmonics_rel_phase_2 0
PairSlopeTrend 0
Freq2_harmonics_rel_phase_1 0
Freq2_harmonics_amplitude_2 0
Freq3_harmonics_rel_phase_3 0
FluxPercentileRatioMid20 0
Con 0
Freq3_harmonics_rel_phase_2 0
Freq1_harmonics_amplitude_3 0
Table 3: RFE results, without a minimum feature limit, for identifying the tile of a source. The rows contain the names of the features, ordered from highest to lowest by the number of times they were selected over the 28 TC datasets.
Feature — number of times selected, out of the 28 TC datasets

c89_m4 28
c89_jh_color 28
n09_jh_color 27
c89_hk_color 27
n09_m4 25
n09_hk_color