Supernovae provide an enormous amount of information for astrophysics research. For example, Type Ia supernovae (SNIa) are used as standardisable candles, which makes them highly important for extragalactic astrophysics and cosmology, since they allow distances to be measured across the Universe 1974PhDT.........7R; 1977SvA....21..675P; 1998AJ....116.1009R; 1999ApJ...517..565P.
Traditionally, SNIa candidates discovered photometrically are confirmed by spectroscopic follow-up, which requires a significant amount of dedicated observational resources. However, in the era of large synoptic surveys such as the Zwicky Transient Facility (http://ztf.caltech.edu) ztf and the Legacy Survey of Space and Time (LSST; http://lsst.org) lsst, thousands of transient candidates are expected to be found per night, which makes it impossible to confirm all of them spectroscopically. Moreover, it is not possible to perform additional observations of short-lived historical objects. Thus, it has become essential to classify transients based on photometric information alone, so that suitable SNIa candidates can be found even when spectroscopic observations are lacking.
Photometric supernova classification using machine learning is a well-developed field of research. Several simulated datasets are available from the Supernova Photometric Classification Challenge (SPCC) Kessler_etal2019 and the Photometric LSST Astronomical Time-Series Classification Challenge (PLASTiCC) PLAsTiCC. These datasets contain light curves of different types of transients as they would be observed by the Dark Energy Survey (http://darkenergysurvey.org) and LSST, respectively. These challenges and their datasets inspired several papers aimed at providing a solution for the photometric classification of supernovae (see, e.g., Lochner_2016; 2018MNRAS.473.3969R; Markel_etal2019; Muthukrishna_etal2019; vargas-dos-santos_etal2019; Moller_Boissi2020). The supernova data in these challenges are based on spectroscopic templates snana2009; salt2_2007; sugar2019, so every simulated light curve corresponds to one of a few real historical supernovae, observed at a different distance and with a different cadence. However, Type Ia supernovae are very diverse blondin_etal2012, and such simulations do not cover their full variety villar_etal2019. Thus, the performance of algorithms trained and tested on such datasets can be overestimated. That is why we choose to use only real photometric observations in this paper.
We aim to create an automatic algorithm for classifying SNIa based on the transient light curve alone. To train the algorithm, we use photometric data and object labels from the Open Supernova Catalog (OSC; http://sne.space) Guillochon_etal2017, which includes almost all publicly available observations of supernovae and related objects. The OSC has already been used for machine-learning classification problems. The authors of Muthukrishna_etal2019DASH developed a supernova spectral classifier that yields the type, age, redshift and other properties of the target object. pruzhinskaya_etal2019; Ishida_etal2019 built an anomaly detection pipeline to find abnormal light curves in the OSC. narayan_etal2018 used both OSC and simulated photometric data to implement SN photometric classification for the ANTARES LSST broker (http://antares.noao.edu). In the current paper, we use less strict selection criteria than narayan_etal2018 (see Section 2) and do not use any simulated data, which can carry an unmeasurable intrinsic bias.
Previous papers use a wide range of methods to train classification algorithms on supernova light curve data. In particular, the authors of 2018MNRAS.473.3969R based their research on the dataset presented in SPCC (Kessler_etal2010). This dataset includes 18,321 simulated SNe, each described by multicolour light curves. Their models are trained using only candidates with host galaxy redshift information and at least three observation points. This selection reduces their input data to 17,330 objects. The authors suggest a two-step classification approach, training and optimising the hyperparameters of a Diffusion Map and a Random Forest Classifier (RFC) simultaneously. This allows the RFC to take as input not the original light curve vectors but vectors of similarity between objects. This method shows a promising result, with a ROC AUC of 0.96.
In Moller_Boissi2020, the authors consider the classification of 1,983,213 SN light curves simulated with a Public Software Package for Supernova ANAlysis (SNANA) (snana2009). The number of candidates makes it possible to use complex models, such as recurrent and convolutional neural networks (RNN and CNN). CNN classification allows them to achieve excellent quality, with a ROC AUC of 0.98. However, they also tried a Random Forest classifier (RF), which showed an even better result, with a ROC AUC of 0.9929. This result indicates that RF works well with SN data and is suitable for solving our problem.
In this paper, we present a novel light curve feature extraction method while using well-known classification methods: logistic regression, random forest, gradient boosting and artificial neural network.
The rest of the paper is organised as follows. In Section 2, we formally describe the problem, the data used for the analysis and their preprocessing. Section 3 is devoted to feature extraction and machine-learning models. Results and their discussion are presented in Sections 4 and 5, respectively. We conclude the paper in Section 6.
2 Problem Statement and Data sets
This study aims to develop an approach for supernova classification, based only on the photometric information. Given the differences between data and simulation, the algorithm must be trained using as much information from available data as possible. In machine learning terms, we use a supervised approach that provides the most stable results.
The OSC Guillochon_etal2017 is a compilation of different catalogues and individual papers, based on observations made with various facilities and data processing pipelines. This way of collecting data makes the OSC very heterogeneous. As of June 8th, 2019, the OSC consisted of 63,689 objects (our snapshot of the OSC data can be found at http://sai.snad.space/sne20190608/), 53,413 of which had photometric observations and only 7,985 had spectral data. To make our dataset as homogeneous as possible, we consider only light curves in the -band. This selection gave us 8,657 objects described with the following fields: name, supernova type, number of observations, timestamps of observation points, -fluxes for each observation and their errors.
2.1 Filtering Data
Often the quality of observations is not good enough to extract information about the type of an object: for example, there are too few observations or too much noise in the SN light curve. Therefore, we develop a method for filtering the initial dataset to exclude such objects from training.
We suggest two criteria for filtering the source data: the number of light curve observations for each object and the $p$-value of $\chi^2$ per degree of freedom:

$$p = 1 - \frac{1}{2^{k/2}\,\Gamma(k/2)} \int_{0}^{\chi^2} t^{k/2-1} e^{-t/2}\,\mathrm{d}t,$$

where $k$ is the number of degrees of freedom and $\Gamma$ is the Gamma function; $p$ is the probability that a single observation from the $\chi^2$ distribution with $k$ degrees of freedom does not fall in the interval $[0, \chi^2]$. In our case, the $p$-value is calculated for the following null hypothesis: the light curve shape is not different from a constant function. Figure 1 shows the distribution of the number of points in the left-hand panel and the distribution of the $p$-value in the right-hand panel.
We select supernova candidates that are described by more than three observation points. This comes from the assumption that each light curve has to have a starting point, a peak and two points for the hyperbolic part of the curve. Using a selection on the $p$-value, we veto candidates whose light curves have no characteristic shape and tend to be flat. We thus concentrate only on objects with a $p$-value less than 0.001. Applying the selection above, we obtain 1,572 objects, which we use to train our classification model. This dataset contains 1,232 SNe of Type Ia and 340 non-Ia objects. The latter include 190 core-collapse SNe, 39 unconfirmed SN candidates of unknown true type, 45 superluminous SNe and 66 misclassified objects of various true types, the largest group of which is cataclysmic variables, with 36 objects. Figure 2 shows the difference between light curves that pass the filter and those that do not. For rejected candidates, either the observation points cover a shorter time range or the light curves are too flat.
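The two selection criteria can be sketched in a few lines of Python. This is a minimal illustration, not the exact procedure used for the catalogue: it assumes that the constant model is the inverse-variance weighted mean of the fluxes and that the number of degrees of freedom is the number of points minus one. The function names `constant_fit_p_value` and `passes_filter` are hypothetical.

```python
import numpy as np
from scipy.stats import chi2


def constant_fit_p_value(flux, flux_err):
    """p-value for the null hypothesis that the light curve is constant."""
    flux = np.asarray(flux, dtype=float)
    flux_err = np.asarray(flux_err, dtype=float)
    weights = 1.0 / flux_err**2
    # Assumption: the best-fit constant is the inverse-variance weighted mean.
    mean = np.sum(weights * flux) / np.sum(weights)
    chi2_value = np.sum(weights * (flux - mean) ** 2)
    dof = flux.size - 1  # one fitted parameter (the constant level)
    # Survival function = 1 - CDF of the chi-square distribution.
    return chi2.sf(chi2_value, dof)


def passes_filter(flux, flux_err, min_points=4, p_threshold=1e-3):
    """Keep objects with more than three points and a clearly non-flat curve."""
    if len(flux) < min_points:
        return False
    return constant_fit_p_value(flux, flux_err) < p_threshold


# Toy example: a flat curve is rejected, a peaked curve is kept.
flat = passes_filter([1.0, 1.1, 0.9, 1.0, 1.05], [0.1] * 5)
peaked = passes_filter([0.1, 1.0, 0.6, 0.3, 0.1], [0.05] * 5)
too_short = passes_filter([0.1, 1.0, 0.2], [0.05] * 3)
```

A small $p$-value means that a constant flux is a poor description of the data, so the object shows real variability and is kept.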
After filtering out bad candidates, we proceed to feature extraction. Since the machine-learning models we use for classification require input vectors of constant length, the main goal of the data preprocessing step is to represent light curves as vectors of the same length. To this end, we suggest three preprocessing steps and a feature extraction approach that allows us to emphasise the physical properties of SN light curves. We describe them in Subsection 3.3. Finally, we train the models as described in Section 3.4.
3.2 Light Curve Preprocessing
Initially, light curves are presented as arrays of photometric observations. To obtain comparable vector representations of objects, we suggest three steps: normalisation, binarisation and interpolation. Figure 3 shows the intermediate results of preprocessing.
Normalisation consists of shifting
where is a vector of timestamps (modified Julian dates) and is a vector of flux spectral densities.
Binarisation allows us to get vectors of the same length and to consider them in the same vector space. We divide the initial light curve in range from to into 16 equal bins (segments) with calculating the mean value for each bin:
where is the time interval considered (), is the number of bins, is the -th bin interval, is a value of -th bin, is a number of observations within a bin.
Some bins do not contain any observation points and therefore have zeroes in the corresponding places of the resulting vector. Intuitively, a supernova cannot dim for a moment and then light up again; the gaps arise because the observation cadence differs from object to object. That is why we use linear interpolation to fill in the missing observations. The resulting light curves are shown in Fig. 3.
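The three preprocessing steps can be sketched with NumPy as follows. This is an illustration under stated assumptions: the normalisation convention (time shifted to the flux maximum, flux scaled to a unit peak) and the time range given by `t_min` and `t_max` are hypothetical choices made for the example, and only the number of bins (16), the per-bin mean and the linear interpolation are taken from the text. The function name `preprocess` is also hypothetical.

```python
import numpy as np


def preprocess(time, flux, t_min=-50.0, t_max=100.0, n_bins=16):
    """Normalise, bin and interpolate one light curve into a fixed-length vector.

    t_min and t_max are hypothetical values; the normalisation (peak at
    t = 0, peak flux = 1) is an assumed convention.
    """
    time = np.asarray(time, dtype=float)
    flux = np.asarray(flux, dtype=float)

    # Normalisation (assumed): shift time to the peak, scale flux by the peak.
    peak = np.argmax(flux)
    time = time - time[peak]
    flux = flux / flux[peak]

    # Binning: mean flux in each of n_bins equal-width time bins.
    edges = np.linspace(t_min, t_max, n_bins + 1)
    centres = 0.5 * (edges[:-1] + edges[1:])
    in_range = (time >= t_min) & (time <= t_max)
    idx = np.clip(np.digitize(time[in_range], edges) - 1, 0, n_bins - 1)
    sums = np.bincount(idx, weights=flux[in_range], minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    filled = counts > 0
    binned = np.zeros(n_bins)
    binned[filled] = sums[filled] / counts[filled]

    # Interpolation: fill empty bins linearly from the filled neighbours.
    # np.interp holds the edge values constant outside the observed range.
    if filled.any():
        binned = np.interp(centres, centres[filled], binned[filled])
    return binned


vector = preprocess([0, 10, 20, 40, 80], [1.0, 5.0, 10.0, 6.0, 2.0])
```

Every light curve is mapped to a vector of the same length regardless of how many observations it has, which is what the classifiers require.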
3.3 Feature Engineering
As noted earlier, an essential difference between Ia and non-Ia supernovae is the rise time and decline time of their brightness. In order to emphasise the physical nature of supernova light curves and improve the quality of the model, we generate an additional 16 features: for all elements of the resulting vector, we take the moving ratio of its elements:
The resulting light curves are shown in Figure 4.
Finally, we concatenate the resulting vector with the one obtained as a result of the processing steps.
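A possible implementation of the ratio features and the final concatenation is sketched below. The exact definition of the moving ratio is an assumption here: each element is divided by its predecessor, and the first ratio is set to 1 so that the ratio vector has the same length as the binned light curve. The function name `add_ratio_features` and the small constant `eps`, which guards against division by zero, are hypothetical.

```python
import numpy as np


def add_ratio_features(binned, eps=1e-12):
    """Append moving-ratio features to a binned light curve.

    Assumed definition: ratio[i] = binned[i] / binned[i - 1], with
    ratio[0] = 1 so that the ratio vector has the same length as the input.
    """
    binned = np.asarray(binned, dtype=float)
    ratios = np.ones_like(binned)
    ratios[1:] = binned[1:] / (binned[:-1] + eps)
    # Final feature vector: binned fluxes followed by their ratios.
    return np.concatenate([binned, ratios])


features = add_ratio_features([0.2, 0.4, 0.8, 0.4])
```

For a 16-element binned light curve this yields a 32-element feature vector. Ratios above 1 mark the rising part of the curve and ratios below 1 mark the decline.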
In our paper we consider four classification algorithms based on statistical models: Logistic Regression (LogReg), Random Forest Classifier breiman1999random (RF) implemented in the Scikit-learn python library, Gradient Boosting friedman2001greedy
To train any model, we need to split our data into two subsets: one for training and one for testing. In this case, we lose a significant part of our data. To solve this problem, we use a K-Fold cross-validation approach with .
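The cross-validation scheme can be sketched with scikit-learn as follows. The number of folds (`n_splits=5`), the forest size and the synthetic data are assumptions made for illustration; they stand in for the real feature vectors and are not the settings used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
# Synthetic stand-in for the 32-element feature vectors (not OSC data).
X = rng.normal(size=(200, 32))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

cv = KFold(n_splits=5, shuffle=True, random_state=0)  # n_splits is assumed
model = RandomForestClassifier(n_estimators=50, random_state=0)
# Each object is scored by a model that did not see it during training.
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
```

With this scheme every object serves as test data exactly once, so a prediction is available for the whole sample without setting aside a separate test set.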
After training the models, we predict labels for all objects in the initial dataset, so we obtain a true label and a predicted label for each SN object. To evaluate model performance we use five common metrics: area under the receiver operating characteristic curve (ROC AUC), accuracy, F1-score, precision and recall. We consider ROC AUC the target metric, as it takes into account both true positive and false positive rates. Metric scores for each model are presented in Table 1.
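The five metrics can be computed with scikit-learn as in the short sketch below. The labels and scores are a made-up toy example, and the 0.5 threshold used to turn scores into class labels is assumed here. ROC AUC is computed from the continuous scores, while the other four metrics need hard labels.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Toy example: 1 = SN Ia, 0 = non-Ia (values are illustrative only).
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
proba = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.55])
y_pred = (proba >= 0.5).astype(int)  # assumed decision threshold

scores = {
    "roc_auc": roc_auc_score(y_true, proba),  # uses scores, not labels
    "accuracy": accuracy_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
}
```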
We choose the Logistic Regression model as a baseline for our classification problem because of its interpretability. LogReg is not a very strong model, and it requires each data point to be independent of all other data points. In our case, observation points are related to one another. Nevertheless, LogReg shows good results, which means that the presented objects are well separated and can be efficiently classified with linear models.
The RF classifier shows the best scores in all metrics except precision and recall, which can be explained by the fact that the model is resistant to outliers. We choose it as our primary model, and all feature analysis is based on the RF model results.
To evaluate the quality of our model more thoroughly, we calculate a confusion matrix with a threshold value of 0.5, as shown in Table 2. We want the false negative (FN) rate to be as low as possible; our FN rate is equal to . If we examine the shapes of the preprocessed light curves of objects in the FN subset in Figure 6, we notice that most of them are more chaotic than curves in the TP subset. We conclude that in this case we are dealing with a bias caused by the quality of the initial data. If we consider the FN light curves in Figure 7, we see that most of them are of poor quality even though they passed our filter.
|n = 1572||Positive||Negative|
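A confusion matrix at a fixed threshold can be computed directly with NumPy, as in the sketch below. The data are a toy example, and the FN rate is taken in its conventional form, FN / (FN + TP), which is an assumption about the definition used here. The function name `confusion_counts` is hypothetical.

```python
import numpy as np


def confusion_counts(y_true, proba, threshold=0.5):
    """Return (TP, FN, FP, TN) for scores thresholded at `threshold`."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(proba) >= threshold
    tp = int(np.sum(y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    return tp, fn, fp, tn


# Toy example: 1 = SN Ia, 0 = non-Ia (values are illustrative only).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
proba = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.55]
tp, fn, fp, tn = confusion_counts(y_true, proba)
fn_rate = fn / (fn + tp)  # fraction of true SNe Ia that are missed
```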
The ROC curves are shown in Figure 5. The histogram of the RF model output for the two classes, coloured blue (Ia) and red (non-Ia), is shown in Fig. 8. It shows that a large fraction of supernovae can be correctly classified with high probability. Nevertheless, we notice a small peak at x-values around 0.1-0.2. A significant fraction of the objects in this probability range are cataclysmic variables (CVs): 17 out of 31 CVs fall into this range. CVs differ from SNe in both physical nature (the former are flaring close binary systems, the latter are destructive explosions) and observed behaviour: CVs show faint outbursts lasting a few weeks, whereas SNe show bright flares on time scales of months.
In this article, we suggest an SN classification method that uses only real photometric data for model training. Previous works described in Section 1 have shown promising results using samples of simulated SNe. Nevertheless, we suggest that simulated data be used with great care when training models to classify real objects, as such models can overfit to the simulated samples and show less precise classification results for real physical objects observed in future experiments. To demonstrate this, we train two RF models, one on a simulated dataset (PLASTiCC) and one on a real dataset (OSC).
The authors of Boone2019 propose an approach for SN Ia classification and achieve a ROC AUC of 0.957 with a tree-based LightGBM model. Unlike in our experiment, they use all available passbands and preprocess them with Gaussian process regression to smooth the light curves. These GP models allow them to augment a part of the dataset with spectroscopically confirmed objects.
In our case, we use the same preprocessing pipeline and feature extraction methods for both datasets, as described in Subsection 3.2. Then we test the models on the PLASTiCC and OSC datasets. As a result, we get four train-test combinations: RF trained with OSC and tested with PLASTiCC, RF trained with PLASTiCC and tested with OSC, RF trained and tested with OSC and RF trained and tested with PLASTiCC. In order to make the models' resulting scores as comparable as possible, we undersample the non-Ia class in the PLASTiCC dataset to match the OSC class balance. We also select only the 8 SN types from the PLASTiCC dataset that are present in the OSC: SNIa, SNIa-91bg, SNIax, SNII, SNIbc, SLSN-I, TDE, AGN.
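The undersampling step can be sketched as follows. This is a minimal illustration, assuming that all Ia objects are kept and that non-Ia objects are drawn at random without replacement until the Ia fraction matches the OSC sample (1,232 of 1,572 objects). The function name `undersample_negatives` and the toy label array are hypothetical.

```python
import numpy as np


def undersample_negatives(y, target_pos_fraction, rng):
    """Indices keeping all positives and a random subset of negatives so that
    positives make up `target_pos_fraction` of the selection."""
    y = np.asarray(y)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_neg = int(round(len(pos) * (1 - target_pos_fraction) / target_pos_fraction))
    n_neg = min(n_neg, len(neg))  # cannot keep more negatives than exist
    keep_neg = rng.choice(neg, size=n_neg, replace=False)
    return np.sort(np.concatenate([pos, keep_neg]))


# Toy simulated sample: 100 Ia (1) and 400 non-Ia (0) objects.
y_sim = np.array([1] * 100 + [0] * 400)
osc_ia_fraction = 1232 / 1572  # class balance of the filtered OSC sample
selected = undersample_negatives(y_sim, osc_ia_fraction, np.random.default_rng(0))
n_ia = int(np.sum(y_sim[selected] == 1))
n_non_ia = int(np.sum(y_sim[selected] == 0))
```

Matching the class balance keeps threshold-dependent metrics such as accuracy and precision comparable between the two datasets.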
Table 3 shows that the ROC AUC score of a model trained with simulated data decreases from 0.843 to 0.739 when it is tested on real objects. The same effect is observed in the inverse experiment. This brings us to the conclusion that simulated objects cannot fully represent the variety of real SNe.
While some of the papers mentioned above show better results, in practice they suffer from an unknown systematic uncertainty due to the simulated samples used in training. Our approach is free of this drawback and shows that, using only a small amount of real data, one can obtain a scalable result, which might be improved once more data are available.
In this paper, we consider a new data-driven approach to classifying SN objects from the Open Supernova Catalog. We managed to achieve good classification quality using only real objects. To emphasise the physical properties of SN Ia, we suggest adding a vector of generated features to the initial vector. The random forest classifier shows the best performance, with the highest score. We demonstrated that training a model on the PLASTiCC simulated data significantly reduces its efficiency in classifying real objects. Hence, we conclude that validating the model on real data is a necessary step towards achieving good classification quality on real-life tasks.
The authors thank Maria Pruzhinskaya for the fruitful discussion. KM is supported by an RFBR grant 20-02-00779 for preparing the Open Supernova Catalog data. DD is supported by an RSF 19-71-30020 grant in preparing and discussing data augmentation techniques and data-simulation discrepancy measurement methods. This research has made use of NASA's Astrophysics Data System Bibliographic Services and the following Python software packages: NumPy (numpy), Matplotlib (matplotlib), SciPy (scipy), pandas (pandas), and scikit-learn (scikit-learn).