Cosmologists are facing the problem of analyzing the huge quantity of data produced when observing the sky. As an example, in the very near future (2022), the Large Synoptic Survey Telescope (LSST) (https://www.lsst.org/) will produce terabytes of images of the sky per day. Cosmologists thus need automatic analysis algorithms to alert them when a cosmological phenomenon occurs, such as the explosion of a star (supernova). In order to exploit this huge amount of data, cosmologists and computer scientists collaborate on classification algorithms and on benchmarking, for example through international competitions such as the Astronomical Classification Challenge from Kaggle (https://www.kaggle.com/c/PLAsTiCC-2018).
In this paper, we specifically study the supernova phenomenon, and especially the binary classification “type I.a supernovae versus not-I.a supernovae”. Type I.a supernovae produce an extremely bright explosion which can be seen from a very large distance. The uniform intrinsic brightness of this kind of star makes it possible to calculate distances and to better understand the universe and dark energy. Relatively few supernovae have been discovered in the whole history of astronomy. According to [b_lsst1], the LSST will allow the discovery of over ten million supernovae during its ten-year survey. We thus have to anticipate and find methods that will ease both the processing and the analysis of astronomical data.
From a practical point of view, when a zone of the sky is observed by a telescope during the night, an image is produced. The light flux from a region of interest in the image, where a phenomenon is occurring, is computed. Cosmologists repeat this calculation every night. They can then build a time-series giving the light flux as a function of the date. In practice, they build time-series for specific bands: green, red, near-infrared and infrared.
In this work, and having in mind what will be produced by the LSST, we use these four time-series as input data. The set of four time-series is named a light-curve, and its duration is about 100 to 200 days. Figure 1 gives an example of a type I.a supernova light-curve. Light-curves have irregular sampling and are sparse. This sparsity is due to the weather conditions, which do not allow regular observation, and also to the fact that the observed zone of the sky is not the same each night.
Until the end of 2017, the methods used by cosmologists for classifying a light-curve as a supernova of type I.a or not type I.a mostly relied on astrophysical models. They usually use a two-step machine learning approach: feature extraction and classification. Feature extraction is thus a crucial step, but even if features are chosen by experts, they can be incomplete or sub-optimal for the classification step. In this paper, we present two Convolutional Neural Networks (CNNs) to replace the usual two-step machine learning approaches. The first CNN is adapted to time series and thus to the processing of supernovae light-curves. The second one is based on a Siamese CNN [b_siamese] and is adapted to the nature of the data, i.e. their sparsity and their small quantity (small learning database). We benchmarked our CNNs against two well-established state-of-the-art methods, the first one by Lochner et al. [b_lochner], and the second one using the FATS features of Nun et al. [b_fats]. Additionally, we also compared our best CNN with a recurrent neural network (RNN) proposed by Charnock and Moss [b_rnn1], which is also adapted to time series. Confronted with the strong sparsity of the data, our CNN performs better. Note that we took care to compare the algorithms in fair conditions, in a low-data regime (small learning database), and also under hard conditions close to real observational conditions, with more than 70% of the data unavailable in each time series.
This paper is organized as follows. In Section 2 we present the two state-of-the-art methods of supernovae classification. These machine learning algorithms use different feature descriptors; one relies on supernovae models and the other one on time series. We also present a deep learning method using a recurrent neural network [b_rnn2]. In Section 3 our CNN and Siamese CNN architectures are presented. Then, in Section 4 we detail the databases, the experimental setup, the parameter settings, and analyze the results. In Section 5 we conclude and give possible extensions of this work.
2 Existing methods
In this section, we present a small survey of recent machine learning-based methods for supernovae classification. The first method, proposed by Lochner et al. [b_lochner], uses SALT2 features [b2] and was considered the state of the art until the end of 2017. The second method relies on a library named “Feature Analysis for Time Series” (FATS), whose features are dedicated to light-curve analysis [b_fats]. Finally, we also present the work of Charnock and Moss [b_rnn1], which uses a recurrent neural network.
2.1 Boosted decision trees using SALT2 features
In Lochner et al. [b_lochner], the authors compare the results of different classifiers and features. They found that boosted decision trees (BDT) with features of the “Spectral Adaptive Light-curve Template 2” (SALT2) model provided the best performance. SALT2 is the most commonly used model for type I.a supernovae. It relies on the following equation to compute the light-curve's flux in a given band:

$F(p, \lambda) = x_0 \times \left[ M_0(p, \lambda) + x_1 M_1(p, \lambda) + \dots \right] \times \exp\left( c \, CL(\lambda) \right) \quad (1)$
where $p$ is the time since the maximum luminosity in the B-band (the phase), $\lambda$ is the wavelength, $M_0$ is the average spectral sequence, $M_k$ for $k \geq 1$ are higher-order components that describe the variability of supernovae I.a, and $CL(\lambda)$ represents the average color correction law. For each supernova, the redshift $z$ (variation of the wavelength due to the expansion of the universe) is also used. The machine learning algorithm uses five parameters to describe the supernovae: $z$, $t_0$ (the time of peak brightness in the B-band), the normalization term $x_0$, $x_1$ that describes the shape of the light-curve, and $c$,
the color at the maximum luminosity in the B-band. Then, BDT are used to classify the supernovae. BDT are machine learning classifiers that use multiple decision trees to construct a model associating input features with output classes. Multiple decision trees are built on slightly different subsets of data, and the resulting classifications are averaged to provide robustness. Instead of using bagging as in random forests (RF) [b_RF], which select subsets of data with random replacement, BDT use boosting: at each iteration the same data-set is used, but the weights of incorrectly classified examples are increased, which allows subsequent classifiers to focus on difficult cases. Even if Lochner et al. [b_lochner] found that RF and BDT gave almost the same classification results, BDT were usually faster than RF. Moreover, BDT are considered robust classifiers because of the averaging process [b_Ensemble].
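The boosting reweighting step described above can be sketched in a few lines; this is a minimal AdaBoost-style illustration in plain NumPy, not the actual BDT implementation used in [b_lochner] (the update rule and the `alpha` value are illustrative assumptions):

```python
import numpy as np

def boost_weights(sample_weights, misclassified, alpha):
    """AdaBoost-style reweighting: increase the weight of the examples the
    current tree got wrong, so the next tree focuses on the difficult cases."""
    w = sample_weights * np.exp(alpha * misclassified.astype(float))
    return w / w.sum()  # renormalize to a probability distribution

# toy run: 4 examples, the last two were misclassified by the current tree
w = np.full(4, 0.25)
w = boost_weights(w, np.array([False, False, True, True]), alpha=0.5)
```

After the update, the misclassified examples carry more weight than the correctly classified ones, which is the mechanism that lets subsequent trees concentrate on hard cases.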
2.2 Boosted decision trees using FATS library
In Lochner et al. [b_lochner], several feature extraction methods are compared in addition to SALT2. Among them is a method that uses no prior knowledge of supernovae light-curve models, which makes it a generic approach. Nevertheless, similarly to what was observed in [b_lochner], our experimental results show that this wavelet-based method is less efficient than the approach of Lochner et al. We thus chose another generic approach based on the FATS library (a Python library available on GitHub: https://github.com/isadoranun/FATS) [b_fats], which requires no previous supernovae knowledge and allows extracting many more features than SALT2. FATS (Feature Analysis for Time Series) is a Python library which includes a set of features dedicated to time series analysis. FATS extracts information relevant for the classification of astrophysical objects, such as the color (the difference in flux between two separate bands), the skewness, the mean, etc. We then use BDT (presented in Section 2.1) to perform the classification.
2.3 Recurrent neural network for supernovae classification
Recurrent neural networks (RNN) are a class of deep learning methods which exhibit a temporal dynamic behavior for a time sequence [b_rnn2]. RNN are used for tasks such as speech recognition [b_rnn3] and language translation [b_rnn4]. In [b_rnn1], the authors present a supernovae type I.a versus not-I.a classification algorithm using an RNN with Long Short-Term Memory (LSTM) cells [b_lstm]. They used LSTM cells because more traditional RNN are unable to manage long-term information. By using a gating process, the LSTM architecture overcomes the vanishing gradient problem, which means that it is better able to learn on long time series while taking past values into account. Each LSTM cell is composed of three gates. The first one is the forget gate, which allows the network to discard part of the information transmitted by the previous cell. The second one is the input gate, which processes the input information at a given time. The last one, the output gate, combines the input information with the cell state to feed the next cell of the LSTM network with new information. In [b_rnn1], the authors tested multiple RNN LSTM architectures with different numbers of LSTM cells per hidden layer. The RNN architecture that gave the best results is unidirectional with two hidden layers and sixteen LSTM cells per layer.
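The gating process described above can be sketched as a single NumPy time step. The weights here are random and the dimensions (an input of size 5, and 16 hidden units to match the sixteen cells per layer reported above) are only illustrative; this is not the implementation of [b_rnn1]:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the four gates (input i,
    forget f, cell candidate g, output o) along their first axis."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0*n:1*n])        # input gate: admit new information
    f = sigmoid(z[1*n:2*n])        # forget gate: discard old cell state
    g = np.tanh(z[2*n:3*n])        # candidate cell update
    o = sigmoid(z[3*n:4*n])        # output gate: expose the cell state
    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state fed to the next cell
    return h, c

rng = np.random.default_rng(0)
d, n = 5, 16                       # illustrative input and hidden sizes
h = c = np.zeros(n)
W = rng.normal(size=(4*n, d))
U = rng.normal(size=(4*n, n))
b = np.zeros(4*n)
for x in rng.normal(size=(10, d)): # unroll over a 10-step toy sequence
    h, c = lstm_step(x, h, c, W, U, b)
```

The additive update of the cell state `c` is what carries long-term information across steps and mitigates the vanishing gradient.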
3 Network Architecture
In this section, we present the most important characteristics of our convolutional neural network (CNN) and our Siamese network.
3.1 Convolutional neural network
In this section, we present our convolutional neural network (CNN). First, we explain its architecture, and then we detail some of the most important elements.
3.1.1 Network architecture
Our CNN is shown in Figure 2. The source code can be downloaded at https://github.com/Anzzy30/SupernovaeClassification. Our CNN takes as input a light-curve represented by a matrix of dimension $4 \times T$, where 4 stands for the height and T for the width. There are 4 rows (one for the green, one for the red, one for the near-infrared, and one for the infrared band), i.e. 4 time-series, and the duration T of the light-curve is variable. The CNN contains eleven layers, and most of those layers are “inception” modules (described below), where the filters are modified to be 1D temporal filters (applied along the rows of the input). Poolings are also modified to be 1D. A time series is indeed usually represented as a 1D array, and this ensures the extraction of temporal information independently for each time series. 1D convolutions capture the evolution of the input signals over time [b_tconvolution]. Layers 2, 5, 9 and 10 are inception modules with 1D convolutions and a stride of two pixels. This divides the input width by two after each of these inception blocks.
We also introduce a depth-wise convolution (a 1D convolution along the depth axis, i.e. a convolution along the color axis) in the 7th layer to take into account the correlation between the bands. This color-convolution is done without padding, over the four bands, in order to merge the time series and combine the information provided by each of them. This fusion process is important for a more efficient separability between the two classes supernovae I.a and not-I.a [b_color]. We indeed observed an increase in classification accuracy when using this color-convolution.
The last convolution layer is followed by a max pooling over the whole time axis. This “global max pooling” allows treating light-curves of any duration. Then, a fully connected network with 1024 neurons is used for the classification part. The network's output is finally provided by a softmax prediction between the two classes I.a supernovae and not-I.a supernovae.
3.1.2 Inception module
Our network is mainly built as a succession of inception modules [b_googlenet]. These modules have been adapted to work with time series, i.e. with 1D convolutions. Each module is built with $1 \times 1$, $3 \times 1$ and $5 \times 1$ convolution layers and a max pooling, which allows extracting information at multiple resolutions. The $3 \times 1$ and $5 \times 1$ convolutions are preceded by a $1 \times 1$ convolution to reduce the dimensionality and add non-linearity. In our network, each convolution layer is followed by a ReLU activation function [b_relu]. Figure 3 describes our inception module.
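A minimal NumPy sketch of such a 1D inception module may help fix ideas. The channel counts and random weights below are illustrative assumptions, not the values used in our network:

```python
import numpy as np

def conv1d(x, w):
    """'Same'-padded 1D convolution along time. x: (C_in, T), w: (C_out, C_in, K)."""
    c_out, c_in, k = w.shape
    xp = np.pad(x, ((0, 0), (k // 2, k - 1 - k // 2)))
    # np.convolve flips its kernel, so pre-flip to get cross-correlation
    return np.stack([
        sum(np.convolve(xp[ci], w[co, ci, ::-1], mode="valid") for ci in range(c_in))
        for co in range(c_out)])

def relu(x):
    return np.maximum(x, 0.0)

def inception1d(x, p):
    """Parallel 1x1, 3x1 and 5x1 temporal convolutions plus a max-pool
    branch, concatenated along the channel axis."""
    b1 = relu(conv1d(x, p["w1"]))                               # 1x1 branch
    b3 = relu(conv1d(relu(conv1d(x, p["r3"])), p["w3"]))        # 1x1 then 3x1
    b5 = relu(conv1d(relu(conv1d(x, p["r5"])), p["w5"]))        # 1x1 then 5x1
    pad = np.pad(x, ((0, 0), (1, 1)), constant_values=-np.inf)  # window-3 max pool
    pool = np.maximum(np.maximum(pad[:, :-2], pad[:, 1:-1]), pad[:, 2:])
    bp = relu(conv1d(pool, p["wp"]))                            # pool then 1x1
    return np.concatenate([b1, b3, b5, bp], axis=0)

rng = np.random.default_rng(0)
c_in, T = 4, 100                       # 4 bands, 100-day light-curve
p = {"w1": rng.normal(size=(8, c_in, 1)),
     "r3": rng.normal(size=(4, c_in, 1)), "w3": rng.normal(size=(8, 4, 3)),
     "r5": rng.normal(size=(4, c_in, 1)), "w5": rng.normal(size=(8, 4, 5)),
     "wp": rng.normal(size=(8, c_in, 1))}
y = inception1d(rng.normal(size=(c_in, T)), p)
```

The four branches preserve the time dimension and only the channel dimension grows, so modules can be stacked freely.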
3.1.3 Global max pooling
The time series may have a variable duration, i.e. a variable number of columns in the input matrices. A classical CNN is generally built with a fully connected part, that is, with a fixed number of neurons, which imposes a fixed dimension on the inputs. Thus, for a classical CNN, variable-length time series generate feature maps of variable lengths, which cannot be processed by the fully connected layer. To overcome this limitation, we incorporate a global max pooling layer (described in Figure 4) after the last convolution and just before the fully connected layer. This forces the features to have a fixed dimension before entering the fully connected layer. Our network can thus classify light-curves of any duration. Another interesting effect of this choice is that, during the learning phase, we can now sometimes crop the light-curves to between 40% and 80% of their original duration. This provides data augmentation, reduces over-fitting, and increases the generalization abilities.
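The pooling itself reduces to a maximum over the time axis. A minimal sketch (with an assumed channel count of 256) showing that two light-curves of different durations yield feature vectors of the same size:

```python
import numpy as np

def global_max_pool(features):
    """Collapse the variable time axis: (channels, T) -> (channels,).
    Any light-curve duration yields a fixed-size feature vector."""
    return features.max(axis=1)

rng = np.random.default_rng(0)
short = global_max_pool(rng.normal(size=(256, 60)))   # 60-day light-curve
long_ = global_max_pool(rng.normal(size=(256, 180)))  # 180-day light-curve
```

Both pooled vectors have the same dimension, so the downstream fully connected layer never sees a variable-length input.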
3.2 Siamese network
The second method proposed in this paper is a Siamese network. This type of neural network was introduced in the 1990s by Bromley et al. [b_siamese] and contains at least two sub-networks with shared weights. Each sub-network produces an n-dimensional feature vector. Afterwards, we can compute different metrics, such as the L2-norm, between vectors. The learning process for this method brings closer, in the feature space, elements that share the same label, and pushes apart elements with different labels.
With the Siamese network we seek a solution to the sparsity problem which is extremely present in the data (see Fig. 1). We propose a loss function based on the triplet loss presented in [b_triplet], and we also propose an adaptation of this triplet loss. First, each triplet is chosen online (online triplet mining), which means that the useful triplets are computed on the fly. Online triplet mining was introduced in Facenet (see http://bamos.github.io/2016/01/19/openface-0.2.0/). Online triplet mining is more efficient than offline mining regarding both computation time and performance.
The loss function is described by Equations 2 and 3. First, we compute the classic triplet loss, which allows the network to bring elements with the same label closer together. The triplet loss is defined by Equation 2,
with $d$ the L2-norm distance function which takes two vectors as input, $f_a$ a feature vector of dimension $n$ named the anchor, $f_p$ a feature vector named the positive example (it has the same label as the anchor), $f_n$ a feature vector associated to the negative example (it has a different label from the anchor), and $\alpha$ the margin. This triplet loss, when minimized, has the effect of pushing the negative examples to a distance from the anchor greater than the margin $\alpha$ plus the distance from the anchor to the positive example.
During the optimization process, the triplet examples whose loss is greater than zero are the only triplets useful for the minimization. The minimization is thus done on the arithmetic mean of the useful triplets; see Equation 3:
with $N_u$ the number of useful triplets.
In order to better take into account the sparsity of light-curves, in addition to the triplet loss function, we propose an adaptation of the triplet loss described by Equation 4. The goal of this additional loss function is to amalgamate (i.e. force an identical feature representation for) the anchor feature vector and the feature vector obtained by a sub-sampling of the anchor light-curve, while keeping a minimum distance from the negative examples.
where $f_a$ is a feature vector associated to the anchor example, $f_{a'}$ is the feature vector of the sub-sampled anchor light-curve, $f_n$ is a feature vector associated with the negative example (which is an example different from the anchor), and $\alpha$ the margin.
Once again, during the optimization process, the triplet examples whose loss is greater than zero are the only triplets useful for the minimization. The minimization is thus done on the arithmetic mean of the useful triplets; see Equation 5:
To compute the feature vectors of each example, we use two networks with shared weights and the same architecture as the convolutional neural network defined in Section 3.1, but without the fully connected layer. Once the Siamese network has converged, we train a fully connected layer of 1024 neurons followed by a softmax layer. The input is the feature vector (output of the Siamese network), and the outputs are the two classes to predict (type I.a or not-I.a).
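The online triplet mining of Equations 2 and 3 can be sketched in a few lines of NumPy; the batch size, feature dimension and margin below are illustrative assumptions:

```python
import numpy as np

def triplet_losses(anchor, positive, negative, margin=1.0):
    """Per-triplet hinge loss max(d(a, p) - d(a, n) + margin, 0),
    with d the L2 norm (Equation 2)."""
    d_ap = np.linalg.norm(anchor - positive, axis=1)
    d_an = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_ap - d_an + margin, 0.0)

def online_triplet_loss(anchor, positive, negative, margin=1.0):
    """Average only over the 'useful' triplets, i.e. those whose loss is
    strictly positive (the mean of Equation 3)."""
    losses = triplet_losses(anchor, positive, negative, margin)
    useful = losses > 0
    return losses[useful].mean() if useful.any() else 0.0

# toy batch: 32 triplets of 128-dimensional feature vectors
rng = np.random.default_rng(0)
a, p, n = (rng.normal(size=(32, 128)) for _ in range(3))
loss = online_triplet_loss(a, p, n, margin=0.2)
```

A triplet whose negative is already far enough from the anchor contributes a zero loss and is excluded from the mean, which is exactly the online-mining behavior described above.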
4 Experiments

In this section, we present the databases used for the experiments, the parameter settings, the setup for the experiments, and finally, our results.

4.1 Databases
4.1.1 First base
We simulated light-curves of different types of supernovae with the SNANA software [b_snana]. SNANA is an analysis package for supernovae light-curves that contains a simulation, a light-curve fitter and a cosmology fitter. It takes into account actual survey conditions and thus generates realistic light-curves by using the measured observing conditions at each survey epoch and sky location. Each light-curve is composed of 4 time-series. Each time series contains the light flux measured in a specific band each night during a certain period of time (around 100 days). The bands are obtained using color band-pass filters: green, red, near-infrared, and infrared. In the first experiment, we used 5000 light-curves with 2500 supernovae of type I.a and 2500 of type not-I.a. For the deep learning input, each light-curve is day-sampled and stored in a matrix, where each cell holds a specific flux value. If a value is missing, we fill it with zero. As light-curves of supernovae are very sparse (due to missing values), the matrix contains more than 70% of zero values. We will denote this database B1.
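The day-sampled matrix construction can be sketched as follows; the band names and the toy observations are illustrative, not the SNANA output format:

```python
import numpy as np

BANDS = {"g": 0, "r": 1, "nir": 2, "ir": 3}  # green, red, near-infrared, infrared

def light_curve_matrix(observations, duration):
    """Build the day-sampled 4 x T input matrix. Days with no
    measurement in a band stay at zero, hence the >70% sparsity."""
    m = np.zeros((4, duration))
    for day, band, flux in observations:
        m[BANDS[band], day] = flux
    return m

# toy light-curve: a handful of measurements over 100 days
obs = [(3, "g", 1.2), (3, "r", 0.8), (10, "nir", 2.1), (42, "ir", 0.5)]
m = light_curve_matrix(obs, duration=100)
sparsity = (m == 0).mean()
```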
4.1.2 Second base
The second database is available on GitHub (https://github.com/adammoss/supernovae) and contains 21319 light-curves with 5088 I.a supernovae and 16231 not-I.a supernovae. This database is from the Supernova Photometric Classification Challenge [Kessler_SPCC_2010]. We used this database to compare our CNN (our first architecture; see Section 3.1) to the results obtained by the RNN LSTM of [b_rnn1] (see Section 2.3). We will denote this database B2.
4.1.3 Data augmentation
For the convolutional neural networks, we used artificial data augmentation. This method gives the networks a more representative set for the learning process. Each light-curve gets a chance to be altered during each epoch. The alteration crops the light-curve, keeping a random fraction between 0.4 and 0.8 of its original duration. The network is then fed with this new representation. This strategy reduces over-fitting and slightly increases the classification performance.
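A minimal sketch of this cropping augmentation, assuming the $4 \times T$ matrix representation described in Section 4.1.1:

```python
import numpy as np

def random_crop(light_curve, rng, low=0.4, high=0.8):
    """Keep a random contiguous fraction (between 40% and 80%) of the
    time axis; the band axis is untouched."""
    T = light_curve.shape[1]
    new_T = int(T * rng.uniform(low, high))
    start = rng.integers(0, T - new_T + 1)
    return light_curve[:, start:start + new_T]

rng = np.random.default_rng(0)
lc = rng.normal(size=(4, 100))   # a toy 100-day light-curve
crop = random_crop(lc, rng)
```

Because the global max pooling of Section 3.1.3 accepts any duration, the cropped curve can be fed to the network directly.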
4.2 Parameter settings
For the machine learning methods, we used the algorithm available on Lochner's GitHub (https://github.com/MichelleLochner/supernova-machine) with boosted decision trees, as presented in [b_lochner].
We set the number of iterations to 4500 for the convolutional neural network and 9000 for the Siamese network. We apply dropout to the fully connected layer to reduce over-fitting. The learning rate for the two deep learning methods follows an exponential decay. For both methods, we used the Adam optimizer [b_adam] on a cross-entropy loss function. The network weights are initialized with the Xavier uniform initializer [b_xavier], and the batch size is set to 128.
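An exponentially decaying learning rate of this kind can be sketched as follows; the start and end values and the step count below are placeholders, not the values used in our experiments:

```python
def exp_decay_lr(step, lr_start=1e-3, lr_end=1e-5, total_steps=4500):
    """Learning rate decaying exponentially from lr_start down to lr_end
    over the whole training run (illustrative bounds)."""
    return lr_start * (lr_end / lr_start) ** (step / total_steps)

lr_first = exp_decay_lr(0)       # = lr_start
lr_half = exp_decay_lr(2250)     # geometric midpoint
lr_last = exp_decay_lr(4500)     # = lr_end
```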
4.3 Experiments setup
We used TensorFlow and Python to develop our deep learning methods. The training phase is performed on an NVIDIA GTX 1080.
To compare the two methods relying on a two-step machine learning approach with our convolutional neural network and our Siamese network, we used the base B1 with k-fold cross-validation, with k=4 and 5000 light-curves. The database is thus randomly partitioned into 4 equal-sized sub-samples. A single sub-sample is kept for testing the model, and the remaining 3 sub-samples are used as training data. We repeat the process 4 times, each of the k sub-samples being used exactly once as the testing data. This means that each model is trained on only 3750 light-curves and tested on the 1250 others.
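The 4-fold partitioning described above can be sketched in NumPy:

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Random partition of n examples into k equal folds; yield
    (train, test) index arrays, each fold testing exactly once."""
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

rng = np.random.default_rng(0)
splits = list(kfold_indices(5000, 4, rng))
```

With 5000 light-curves and k=4, each model indeed trains on 3750 examples and is tested on the remaining 1250.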
To confront our convolutional neural network with the recurrent neural network [b_rnn1], we followed the same strategy as presented in [b_rnn1] with the database B2. We thus trained our CNN model (our first architecture; see Section 3.1) on a base composed of 5330 supernovae light-curves taken randomly from the database B2. The remaining elements in B2 (15989) are used to evaluate the network. We repeated this process five times to compute the arithmetic mean and the standard deviation of the results. Note that the number of training data is chosen to be close to the training set size used with the database B1.
4.4 Comparison metrics
We use two metrics to compare the different methods. The first one is the accuracy, defined as the ratio of the number of correct predictions to the total number of predictions. The second one is the Area Under the ROC Curve (AUC). The ROC curve represents the True Positive Rate (TPR) versus the False Positive Rate (FPR) as the probability threshold is moved from 0 to 1. TPR and FPR are given by Equations 7 and 8.
where TP are the true positives, FP the false positives, TN the true negatives and FN the false negatives.
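A minimal NumPy sketch of the TPR/FPR computation and the resulting AUC; the toy scores and labels below are illustrative:

```python
import numpy as np

def roc_points(scores, labels):
    """TPR and FPR at every score threshold (Equations 7 and 8):
    TPR = TP / (TP + FN), FPR = FP / (FP + TN)."""
    order = np.argsort(-scores)            # sort by descending score
    labels = labels[order]
    tp = np.cumsum(labels == 1)            # true positives at each cut
    fp = np.cumsum(labels == 0)            # false positives at each cut
    tpr = tp / max((labels == 1).sum(), 1)
    fpr = fp / max((labels == 0).sum(), 1)
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

def auc(fpr, tpr):
    """Area under the ROC curve by trapezoidal integration."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# toy example: a classifier that ranks both positives above both negatives
scores = np.array([0.9, 0.8, 0.3, 0.1])
labels = np.array([1, 1, 0, 0])
fpr, tpr = roc_points(scores, labels)
area = auc(fpr, tpr)
```

A classifier that perfectly separates the two classes, as in this toy example, reaches an AUC of 1.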
4.5 Results and discussion
4.5.1 Database B1: Comparisons of our CNNs and our Siamese network, to SALT2, and FATS
In our first experiment, with the database B1, we confront our CNN and our Siamese network with BDT + SALT2 [b_lochner] and BDT + FATS [b_fats]. Figure 6 shows the obtained ROC curves. The convolutional neural network obtains the best AUC, followed by the BDT with SALT2 features; the Siamese network and the BDT using FATS come next, in that order.
Another interesting metric is the accuracy. The CNN obtains the best accuracy of all four methods (see Table 1). The Siamese network comes second, outperforming the BDT using SALT2 features, while the BDT using FATS gets the lowest accuracy.
The state-of-the-art two-step machine learning methods used for supernovae classification [b_lochner], [b_fats] perform well with a small learning set (i.e. fewer than 1000 learning examples), but with the above results, we can clearly say that CNNs are a promising solution for the binary classification of supernovae type I.a versus not-I.a. Indeed, our CNN performs better than two-step machine learning methods based on hand-crafted features, and this with a small number of examples (3750). A set of 3750 examples is indeed considered a very small database for learning a binary classification with deep learning methods.
The Siamese network gets good results but is harder to train because of the shared weights between its two networks. Even if its results are not as good as those of our CNN, we think they can be improved with a better loss function. Adding some data augmentation, as we do with our CNN, may also improve the performance.
Additionally, we observed that the performance of our CNN increases with the size of the learning set (i.e. with more time series). Another advantage of our method is that it learns the filter values directly, to extract the most relevant features. There is thus no need for prior knowledge as in SALT2 [b_lochner] or FATS [b_fats].
When working with a more complex architecture, we have shown that, in a more realistic and more complicated situation from a cosmological point of view (extremely small database, mismatch between learning set and test set), deep learning approaches surpass all previously known approaches [Pasquet_Pelican_2019].
4.5.2 Database B2: Comparisons of our CNN with an RNN LSTM
In the second experiment, with the database B2, we confront our CNN with the recurrent neural network proposed in [b_rnn1]. The input vector given to the RNN contains the day, the light flux in each band, and the redshift extra-information. The RNN thus uses a little more information than our CNN. In this experiment, 5330 supernovae light-curves are used for learning, whereas in the first experiment 3750 light-curves were used. We chose 5330 light-curves (a fraction of 25% of the database) because it is close to the size used in our first experiment. In this experiment, the proportion of I.a and not-I.a is around 1/4 and 3/4, whereas in the first experiment with B1 it was 1/2 and 1/2. For our CNN the data augmentation is done by cropping, and it artificially increases the learning data-set by a factor of 2 or 3. For the RNN, a random interpolation artificially increases the learning data-set by a factor of 5. The data augmentation in this experiment improves the results of both approaches by roughly 1% (with a smaller increase for our CNN). The comparison should therefore be considered more or less fair, with nevertheless a small disadvantage for our CNN.
The results (see Table 2) show that our CNN is superior to the RNN LSTM of [b_rnn1], with 98.3% AUC against 97.5%, and 94.1% accuracy against 92.9%. Also note that the standard deviation over the five runs is smaller for our CNN. In [b_rnn1], multiple experiments with different fractions of the database are presented. Taking more data, for example a fraction of 50% of the database, provides better results for both our CNN and the RNN, and the gap between them remains the same.
Recurrent neural networks are a great technique for treating time series, as they can extract the dynamic temporal behavior of a time sequence. However, in this work, we showed that our CNN performs better. This may be due to over-fitting, which is more pronounced in the RNN. The heterogeneous information given to the RNN (day, light flux in each band, redshift) can also make its processing harder.
Note that when taking less data, for example a fraction of 5% (about 1065 light-curves) of the database, the SALT2 method [b_lochner] gets the best results. When the size of the learning set is too small (fewer than 1000 light-curves), deep learning approaches suffer from an insufficient number of examples. Measures can nevertheless be taken, such as data augmentation by noise addition, transfer learning or curriculum learning, the use of cosmological parameters such as the redshift, the use of ensembles, the use of multiple classes, etc. (see for example our paper under revision [Pasquet_Pelican_2019]).
5 Conclusion

The field of cosmology is facing great challenges in order to be able to analyze a huge amount of data. One of those challenges is to automatically distinguish type I.a supernovae from not-I.a supernovae. In this paper, we propose a new CNN, adapted to time series (light-curves), that outperforms the state of the art. We also compared our CNN to a Siamese approach and to an RNN LSTM. Our CNN gives the best results; nevertheless, the main conclusion is that all these deep learning approaches are very promising.
We observed a strong improvement in the efficiency of our CNN as the learning database grows in size. We are also aware that more clever data-enrichment could boost the efficiency of the different deep learning approaches, and especially of our CNN, without substantial additional processing.
We believe that this publication will reinforce the use of deep learning in the cosmology and astronomy fields. We put all the files necessary for the deployment of our CNN on an open-access GitHub repository (https://github.com/Anzzy30/SupernovaeClassification), thus giving strong visibility to our work in the cosmology community.