Machine Learning Pipeline for Pulsar Star Dataset

05/03/2020 · Alexander Ylnner Choquenaira Florez, et al. · Universidade de São Paulo

This work brings together some of the most common machine learning (ML) algorithms with the objective of comparing their results on a set of unbalanced data. The dataset is composed of almost 17 thousand observations of astronomical objects made to identify pulsars (HTRU2). The methodological proposal is based on evaluating the accuracy of these different models on the same database, treated with two different strategies for unbalanced data. The results show that, despite the noise and class imbalance present in this type of data, standard ML algorithms can be applied and obtain promising accuracy ratios.


1 Introduction

Pulsars are a rare type of star that can emit radio signals detectable from Earth. Pulsar astronomy began as a field of study around 1967, when Jocelyn Bell discovered by chance a train of pulses, regularly spaced with a period of 1.33 seconds, in radio observation data [9].

In recent years, scientists have become very interested in this rare type of star for various reasons. As pulsars rotate, their emission beam sweeps across the sky, and a detectable pattern of broadband radio emission is produced each time the beam crosses our line of sight. Because pulsars spin rapidly, this pattern repeats periodically [9].

Machine learning is a method of data analysis used to build analytic models. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention; such systems can also modify their behavior based on their own experience. Machine learning generally relies on two basic learning paradigms: unsupervised learning and supervised learning. Supervised learning uses labeled data or prior knowledge to estimate a model that represents the relationships in the data. This work uses the supervised learning approach to perform its experiments [1].

Many works have been developed using machine learning techniques for pulsar detection [8, 11]: identifying credible candidates from pulsar surveys using an artificial neural network [3], using a KNN classifier [9], and using image patterns with deep neural networks in the PICS (Pulsar Image-based Classification System) AI [12].

Nowadays, machine learning tools are used to label pulsar candidates automatically, facilitating rapid analysis. The problem reduces to binary classification: legitimate pulsar examples form the positive minority class, and spurious examples form the negative majority class. All of these examples have been checked by human annotators. In each data row the variables are listed first, and the class label is the final entry; the class labels used are 0 (negative) and 1 (positive) [7].

This paper is organized as follows: Section 2 reviews the literature on methodologies and processes used to explore the dataset and gain insight, Section 3 describes the proposed methodology, Section 4 describes the experiments and presents the results, and Section 5 states the conclusions.

2 State of the Art

A subject of significant research in the field of radio astronomy is the discovery of pulsars. Identifying new pulsar signals in observational radio data can be done either via single-pulse searches [2] or via periodicity searches. Graphical selection tools such as REAPER [4] and JREAPER [5] allow the user to inspect up to several thousand candidates at a time in scatter diagrams. Scoring algorithms such as PEACE [6] have also been developed, combining six numerical candidate-quality factors into one formula that produces a ranking in which pulsars are expected to be found close to the top; this ranking method helped to find 47 new pulsars.

With increasingly sophisticated astronomical instruments, the volumes and rates of this type of data are growing exponentially. This fact requires a focus on artificial intelligence (AI) technologies that can automatically identify pulsar candidates in large sets of astronomical data. The authors of [3] manually created 12 features and trained an artificial neural network with only one hidden layer. In [10], Straightforward Pulsar Identification using Neural Networks (SPINN) was proposed, which designs six features and trains a feed-forward, single-hidden-layer artificial neural network for binary candidate classification; this method contributed to 4 new pulsar discoveries. The authors of [12] developed a classification system based on images of the candidates' histograms and plots, training single-hidden-layer networks, SVMs, and CNNs, and combining all the classifiers into an ensemble with logistic regression. Although CNNs are powerful on two-dimensional image data, training a deep CNN requires many labeled samples.

In a real scenario, labeling this data is a difficult and expensive process. In the task of pulsar candidate selection, positive samples are very limited because the number of real pulsars discovered is small, while millions of candidates are negative samples.

3 Methodology

A diagram of the methodology is shown in Fig. 1; it consists of four sequential stages:

  • Data Analysis

  • Pre-processing

  • Sampling

  • Processing

Figure 1: Diagram of the methodology

3.1 Data Analysis

Data analysis is performed to obtain general information about the dataset: the number of items, the columns, the kinds of values (float, string, etc.), missing values, correlations between features, etc.
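As a minimal sketch of this step (assuming the header-less UCI CSV distribution of HTRU2, stored locally as HTRU_2.csv; both are assumptions, not stated in the paper), the inspection can be done with pandas:

```python
# Minimal exploratory-analysis sketch; the file name and header-less
# format are assumptions based on the common UCI distribution of HTRU2.
import pandas as pd

df = pd.read_csv("HTRU_2.csv", header=None)

print(df.shape)              # number of items and columns
print(df.dtypes)             # kind of values (float, int, ...)
print(df.isna().sum())       # missing values per column
print(df.corr())             # correlation between features
print(df[8].value_counts())  # class distribution (last column is the label)
```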

3.2 Pre-processing

An exploration of the dataset is performed to see the distribution of the number of items in each target class. This is important because it shows that the dataset is imbalanced. To deal with this, oversampling and undersampling techniques are applied to the original dataset.
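The paper does not name the specific sampling implementations. Since Section 4.2 mentions synthetic minority samples, SMOTE is one plausible choice for oversampling; random undersampling of the majority class is likewise an assumption. A hedged sketch with imbalanced-learn:

```python
# Hedged balancing sketch; the choice of SMOTE and RandomUnderSampler is
# an assumption, reproducing the class counts reported later in Table 1.
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

df = pd.read_csv("HTRU_2.csv", header=None)  # assumed file name
X, y = df.iloc[:, :8].values, df.iloc[:, 8].values

# Dataset 2: undersample the majority class down to the minority size.
X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)

# Dataset 3: generate synthetic minority samples until classes balance.
X_o, y_o = SMOTE(random_state=0).fit_resample(X, y)
```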

3.3 Sampling

Sampling is used to choose the training and test sets so as to guarantee the best performance of a machine learning algorithm. K-Fold Cross-Validation is used: the dataset is randomly partitioned into k subsamples of equal size, one subsample is used as the test set, and the remaining subsamples are used as the training set.
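A minimal sketch of this scheme with scikit-learn, using k = 10 as in the experiments below (the estimator is only a placeholder; X and y are as prepared in the previous sketch):

```python
# 10-fold cross-validation sketch; the estimator is a placeholder.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")
print("%.2f (+/- %.2f)" % (scores.mean(), scores.std() * 2))
```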

3.4 Processing

The general process includes different kinds of algorithms; the main objective is to compare their results.

The main algorithms are:

  • Gaussian Naive Bayes

  • Logistic Regression

  • Decision Tree

  • Perceptron

  • Multi-Layer Perceptron (MLPClassifier)

  • SVC with polynomial, RBF, and sigmoid kernels

Also, different kinds of ensembles are applied (a sketch of the full model set follows this list):

  • XGBoost

  • Random Forest

  • Bagging

  • Gradient Boosting

  • Stacking
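A sketch of this model set with scikit-learn and XGBoost follows; the paper does not report hyperparameters, so defaults are assumed throughout, and the composition of the stacking ensemble is an assumption as well:

```python
# Hedged sketch of the compared models; hyperparameters are assumed
# defaults, and the stacking composition is illustrative only.
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              GradientBoostingClassifier, StackingClassifier)
from xgboost import XGBClassifier

models = {
    "Gaussian-NB": GaussianNB(),
    "Logistic-Regression": LogisticRegression(max_iter=1000),
    "Decision-Tree": DecisionTreeClassifier(),
    "Perceptron": Perceptron(),
    "MLPClassifier": MLPClassifier(max_iter=500),
    "SVC-PolyK": SVC(kernel="poly"),
    "SVC-RbfK": SVC(kernel="rbf"),
    "SVC-SigK": SVC(kernel="sigmoid"),
    "Xgboost": XGBClassifier(),
    "RF": RandomForestClassifier(),
    "Bagging": BaggingClassifier(),
    "Gradient": GradientBoostingClassifier(),
    "Stacking": StackingClassifier(
        estimators=[("rf", RandomForestClassifier()),
                    ("xgb", XGBClassifier())],
        final_estimator=LogisticRegression(max_iter=1000)),
}
```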

4 Experiments and Results

4.1 Dataset Description

The dataset used was HTRU2, which describes a sample of pulsar candidates collected during the High Time Resolution Universe (HTRU) survey. Pulsars are a rare type of neutron star that produce radio emission detectable here on Earth. The dataset contains 17,898 examples in total: 1,639 positive examples (real pulsars) and 16,259 negative examples (spurious candidates caused by RFI/noise). Human annotators checked all examples. In each row the variables are listed first, and the class label is the final entry; the class labels used are 0 (negative) and 1 (positive). Each candidate is described by eight continuous variables and a single class variable: mean of the integrated profile; standard deviation of the integrated profile; excess kurtosis of the integrated profile; skewness of the integrated profile; mean of the DM-SNR curve; standard deviation of the DM-SNR curve; excess kurtosis of the DM-SNR curve; skewness of the DM-SNR curve; and the class [7].
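Assuming the UCI CSV distribution (header-less, label in the last column), the eight variables can be given readable names; the names below are illustrative abbreviations, not from the paper:

```python
# Hedged loading sketch; the column names are illustrative only.
import pandas as pd

cols = ["mean_ip", "std_ip", "kurtosis_ip", "skewness_ip",
        "mean_dmsnr", "std_dmsnr", "kurtosis_dmsnr", "skewness_dmsnr",
        "class"]
df = pd.read_csv("HTRU_2.csv", header=None, names=cols)
print(df["class"].value_counts())  # expected: 16259 zeros, 1639 ones
```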

4.2 Experiments Setup

The experiments were performed with three variations of the HTRU2 dataset, considering sampling to balance the unbalanced classes and feature selection using correlation.

4.2.1 Experiment 1.

  • Dataset 1: the original dataset, without any modification.

  • Dataset 2: undersampling is used to decrease the number of samples of the majority class.

  • Dataset 3: oversampling is used to generate synthetic data from the minority class and balance the dataset.

Dataset Name   Dimensionality (rows, columns)   Class Distribution (positive, negative)
Dataset 1      (17898, 9)                       (1639, 16259)
Dataset 2      (3278, 9)                        (1639, 1639)
Dataset 3      (32518, 9)                       (16259, 16259)
Table 1: Dataset Dimensions

A PCA visualization of the three principal components of the datasets is presented in Fig. 2.

Figure 2: PCA Visualization of Dataset 1, 2, 3
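A minimal sketch of such a projection follows (the 3D plot is an assumed presentation; X and y are as prepared in the sketches of Section 3):

```python
# Hedged PCA-projection sketch for one dataset variant.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

Z = PCA(n_components=3).fit_transform(X)  # three principal components
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(Z[:, 0], Z[:, 1], Z[:, 2], c=y, s=2)  # color by class label
ax.set_xlabel("PC1"); ax.set_ylabel("PC2"); ax.set_zlabel("PC3")
plt.show()
```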

4.2.2 Experiment 2.

This experiment is similar to Experiment 1, but feature selection is considered: 6 features are chosen according to the correlation between them (a selection sketch follows the list below):

  • skewness integrated profile

  • excess kurtosis integrated profile

  • std dm-snr curve

  • mean dm-snr curve

  • skewness dm-snr curve

  • excess kurtosis dm-snr curve
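The exact selection rule is not given in the paper. One common correlation-based approach, assumed here purely as a sketch, is to drop one feature from each highly correlated pair (using the named-column df from the Section 4.1 sketch; the 0.9 threshold is an assumption and may not reproduce the paper's exact six features):

```python
# Hedged correlation-based selection sketch; the threshold and the
# drop-one-of-each-pair rule are assumptions, not from the paper.
import numpy as np

corr = df.drop(columns="class").corr().abs()
# Keep only the upper triangle so each feature pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
selected = df.drop(columns=to_drop + ["class"])
print(list(selected.columns))
```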

A PCA visualization of the three principal components of the datasets with the selected features is presented in Fig. 3.

Figure 3: PCA Visualization of Datasets 1, 2, 3 (Experiment 2)

4.3 Results

4.3.1 Experiment 1.

After performing the experiments using K-Fold Cross-Validation with k = 10 and accuracy as the measure, Tab. 2 presents the results.

Model Dataset 1 Dataset 2 Dataset 3
Gaussian-NB 0.95 (+/- 0.02) 0.90 (+/- 0.04) 0.90 (+/- 0.01)
Logistic-Regression 0.98 (+/- 0.01) 0.93 (+/- 0.03) 0.94 (+/- 0.01)
Decision-Tree 0.98 (+/- 0.01) 0.93 (+/- 0.04) 0.95 (+/- 0.01)
Perceptron 0.97 (+/- 0.05) 0.90 (+/- 0.10) 0.86 (+/- 0.19)
MLPClassifier 0.97 (+/- 0.01) 0.93 (+/- 0.03) 0.92 (+/- 0.01)
SVC-PolyK 0.97 (+/- 0.01) 0.92 (+/- 0.04) 0.93 (+/- 0.01)
SVC-RbfK 0.97 (+/- 0.01) 0.92 (+/- 0.04) 0.94 (+/- 0.01)
SVC-SigK 0.97 (+/- 0.01) 0.92 (+/- 0.04) 0.93 (+/- 0.01)
Xgboost 0.98 (+/- 0.01) 0.95 (+/- 0.03) 0.95 (+/- 0.00)
RF 0.97 (+/- 0.01) 0.92 (+/- 0.03) 0.92 (+/- 0.01)
Bagging 0.98 (+/- 0.01) 0.94 (+/- 0.03) 0.97 (+/- 0.01)
Gradient 0.98 (+/- 0.01) 0.94 (+/- 0.04) 0.95 (+/- 0.01)
Stacking 0.97 (+/- 0.003) 0.91 (+/- 0.02) 0.96 (+/- 0.002)
Table 2: Results of Experiments
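A sketch of how such a table can be produced, looping the model set from Section 3.4 over the three dataset variants (the ± convention is not stated in the paper; twice the fold standard deviation, as in scikit-learn's documentation examples, is assumed here):

```python
# Hedged evaluation-loop sketch; reuses `models` and the (X, y) variants
# from the earlier sketches. The +/- interval is assumed to be 2 * std.
from sklearn.model_selection import KFold, cross_val_score

datasets = {"Dataset 1": (X, y), "Dataset 2": (X_u, y_u),
            "Dataset 3": (X_o, y_o)}
cv = KFold(n_splits=10, shuffle=True, random_state=0)

for name, model in models.items():
    row = [name]
    for Xd, yd in datasets.values():
        s = cross_val_score(model, Xd, yd, cv=cv, scoring="accuracy")
        row.append("%.2f (+/- %.2f)" % (s.mean(), s.std() * 2))
    print("  ".join(row))
```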

From the results in Tab. 2 alone, it might seem that undersampling and oversampling bring no benefit to classifier performance and only cost time and resources; however, these accuracy figures do not show the whole truth about the classifiers. The best result in the Dataset 1 column of Tab. 2 (0.98) is shared by five classifiers, and the same criterion applies to the Dataset 2 and 3 columns. But it is necessary to remember that HTRU2 is an unbalanced dataset and that the class of interest is class 1. Considering this criterion, a classification report of the experiments is presented in Tab. 3 to compare the best classifiers against the rest.

Classifier           Metric     Dataset 1        Dataset 2        Dataset 3
                                Class 0  Class 1 Class 0  Class 1 Class 0  Class 1
Logistic Regression  precision  0.97     0.92    0.91     0.97    0.91     0.97
                     recall     0.99     0.72    0.98     0.89    0.97     0.91
Decision Tree        precision  0.98     0.90    0.91     0.91    0.94     0.96
                     recall     0.99     0.84    0.92     0.91    0.96     0.94
XGboost              precision  0.98     0.91    0.92     0.95    0.94     0.97
                     recall     0.99     0.83    0.96     0.92    0.97     0.93
Bagging              precision  0.98     0.90    0.91     0.95    0.97     0.98
                     recall     0.99     0.82    0.96     0.91    0.98     0.97
Gradient             precision  0.98     0.90    0.91     0.95    0.93     0.96
                     recall     0.99     0.79    0.96     0.91    0.96     0.93
Table 3: Classification Report of Experiments

The classification report confirms the previous idea: the classifiers obtained the best overall results on Dataset 1 because of the majority class 0, but for the application of identifying pulsar stars it is the detection of class 1 that matters most. According to these results, sampling is confirmed to be necessary for an unbalanced dataset, in spite of the unsampled results initially looking best. Considering the sampled datasets (Datasets 2 and 3), the benefit of sampling is easy to notice, and oversampling the minority class yielded the best results for class 1 (Dataset 3).
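Per-class precision and recall of this kind can be obtained with scikit-learn's classification report; the paper does not state its exact reporting protocol, so the held-out split below is an assumption:

```python
# Hedged per-class report sketch; the stratified 80/20 split is an
# assumption, since the paper does not describe how Tab. 3 was computed.
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
clf = models["Logistic-Regression"].fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), digits=2))
```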

4.3.2 Experiment 2.

Tab. 4 presents the results of Experiment 2, where a similar situation is found: the classifiers obtained the best results with Dataset 1. There is also a small increment in the accuracy metric because of the feature selection.

Model Dataset 1 Dataset 2 Dataset 3
Gaussian-NB 0.94 (+/- 0.02) 0.90 (+/- 0.05) 0.90 (+/- 0.01)
Logistic-Regression 0.97 (+/- 0.01) 0.92 (+/- 0.05) 0.94 (+/- 0.01)
Decision-Tree 0.98 (+/- 0.01) 0.94 (+/- 0.04) 0.95 (+/- 0.01)
Perceptron 0.97 (+/- 0.01) 0.93 (+/- 0.05) 0.90 (+/- 0.12)
MLPClassifier 0.96 (+/- 0.01) 0.92 (+/- 0.04) 0.92 (+/- 0.01)
SVC-PolyK 0.97 (+/- 0.01) 0.90 (+/- 0.05) 0.93 (+/- 0.01)
SVC-RbfK 0.98 (+/- 0.01) 0.92 (+/- 0.04) 0.94 (+/- 0.01)
SVC-SigK 0.97 (+/- 0.01) 0.90 (+/- 0.05) 0.93 (+/- 0.01)
Xgboost 0.98 (+/- 0.01) 0.95 (+/- 0.04) 0.95 (+/- 0.01)
RF 0.98 (+/- 0.01) 0.94 (+/- 0.04) 0.94 (+/- 0.01)
Bagging 0.98 (+/- 0.01) 0.94 (+/- 0.04) 0.97 (+/- 0.01)
Gradient 0.98 (+/- 0.01) 0.94 (+/- 0.04) 0.95 (+/- 0.01)
Stacking 0.97 (+/- 0.00) 0.92 (+/- 0.02) 0.95 (+/- 0.00)
Table 4: Results of Experiment 2

Classifier     Metric     Dataset 1        Dataset 2        Dataset 3
                          Class 0  Class 1 Class 0  Class 1 Class 0  Class 1
Decision-Tree  precision  0.98     0.87    0.92     0.92    0.94     0.96
               recall     0.99     0.84    0.93     0.92    0.96     0.94
SVC-RbfK       precision  0.97     0.96    0.89     0.98    0.90     0.98
               recall     1.00     0.71    0.98     0.88    0.98     0.90
Xgboost        precision  0.98     0.90    0.92     0.98    0.93     0.97
               recall     0.99     0.83    0.98     0.91    0.97     0.93
RF             precision  0.97     0.92    0.91     0.99    0.91     0.97
               recall     0.99     0.75    0.99     0.90    0.97     0.91
Bagging        precision  0.98     0.89    0.91     0.96    0.96     0.97
               recall     0.99     0.81    0.96     0.90    0.97     0.96
Gradient       precision  0.98     0.90    0.92     0.97    0.94     0.96
               recall     0.99     0.79    0.97     0.91    0.96     0.94
Table 5: Classification Report of Experiment 2

Considering the situation observed in Experiment 1, better metrics for class 1 were expected with Datasets 2 and 3, which Tab. 5 confirms.

5 Conclusions

  • The Pulsar Star Dataset HTRU2 is an unbalanced dataset; this characteristic influences the machine learning pipeline used in the experiments.

  • Sampling is necessary to balance the minority class and obtain appropriate training that reduces bias.

  • Classifiers can show apparently excellent overall results, but it is necessary to analyze the results per class and to focus on the class that is important for the application.

References

  • [1] E. Alpaydin (2010) Introduction to machine learning. 2 edition, Cambridge, Mass. : MIT Press, c2010.. Cited by: §1.
  • [2] J. M. Cordes and M. A. McLaughlin (2003-10) Searches for fast radio transients. The Astrophysical Journal 596 (2), pp. 1142–1154. External Links: Document, Link Cited by: §2.
  • [3] R. P. Eatough, N. Molkenthin, M. Kramer, A. Noutsos, M. J. Keith, B. W. Stappers, and A. G. Lyne (2010-07) Selection of radio pulsar candidates using artificial neural networks. Monthly Notices of the Royal Astronomical Society 407 (4), pp. 2443–2450. External Links: ISSN 0035-8711, Document, Link, http://oup.prod.sis.lan/mnras/article-pdf/407/4/2443/3209742/mnras0407-2443.pdf Cited by: §1, §2.
  • [4] A. J. Faulkner, I. H. Stairs, M. Kramer, A. G. Lyne, G. Hobbs, A. Possenti, D. R. Lorimer, R. N. Manchester, M. A. McLaughlin, N. D’Amico, F. Camilo, and M. Burgay (2004-11) The Parkes Multibeam Pulsar Survey – V. Finding binary and millisecond pulsars. Monthly Notices of the Royal Astronomical Society 355 (1), pp. 147–158. External Links: ISSN 0035-8711, Document, Link, http://oup.prod.sis.lan/mnras/article-pdf/355/1/147/11178642/355-1-147.pdf Cited by: §2.
  • [5] M. J. Keith, R. P. Eatough, A. G. Lyne, M. Kramer, A. Possenti, F. Camilo, and R. N. Manchester (2009-04) Discovery of 28 pulsars using new techniques for sorting pulsar candidates. Monthly Notices of the Royal Astronomical Society 395 (2), pp. 837–846. External Links: ISSN 0035-8711, Document, Link, http://oup.prod.sis.lan/mnras/article-pdf/395/2/837/4882147/mnras0395-0837.pdf Cited by: §2.
  • [6] K. J. Lee, K. Stovall, F. A. Jenet, J. Martinez, L. P. Dartez, A. Mata, G. Lunsford, S. Cohen, C. M. Biwer, M. Rohr, J. Flanigan, A. Walker, S. Banaszak, B. Allen, E. D. Barr, N. D. R. Bhat, S. Bogdanov, A. Brazier, F. Camilo, D. J. Champion, S. Chatterjee, J. Cordes, F. Crawford, J. Deneva, G. Desvignes, R. D. Ferdman, P. Freire, J. W. T. Hessels, R. Karuppusamy, V. M. Kaspi, B. Knispel, M. Kramer, P. Lazarus, R. Lynch, A. Lyne, M. McLaughlin, S. Ransom, P. Scholz, X. Siemens, L. Spitler, I. Stairs, M. Tan, J. van Leeuwen, and W. W. Zhu (2013-05) peace: pulsar evaluation algorithm for candidate extraction – a software package for post-analysis processing of pulsar survey candidates. Monthly Notices of the Royal Astronomical Society 433 (1), pp. 688–694. External Links: ISSN 0035-8711, Document, Link, http://oup.prod.sis.lan/mnras/article-pdf/433/1/688/18722551/stt758.pdf Cited by: §2.
  • [7] R. J. Lyon, B. W. Stappers, S. Cooper, J. M. Brooke, and J. D. Knowles (2016) Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach. Monthly Notices of the Royal Astronomical Society 459 (1), pp. 1104–1123. External Links: Document, 1603.05166 Cited by: §1, §4.1.
  • [8] R. McFadden, A. Karastergiou, and S. Roberts (2017) Machine learning for pulsar detection. Proceedings of the International Astronomical Union 13 (S337), pp. 372–373. External Links: Document Cited by: §1.
  • [9] T. M. Mohamed (2018) Pulsar selection using fuzzy KNN classifier. Future Computing and Informatics Journal 3 (1), pp. 1–6. External Links: ISSN 2314-7288, Document, Link Cited by: §1.
  • [10] V. Morello, E. D. Barr, M. Bailes, C. M. Flynn, E. F. Keane, and W. van Straten (2014-07) SPINN: a straightforward machine learning solution to the pulsar candidate selection problem. Monthly Notices of the Royal Astronomical Society 443 (2), pp. 1651–1662. External Links: ISSN 0035-8711, Document, Link, http://oup.prod.sis.lan/mnras/article-pdf/443/2/1651/3623597/stu1188.pdf Cited by: §2.
  • [11] V. Morello (2016-05) Discovering Pulsars with Machine Learning. Ph.D. Thesis, Faculty of Science, Engineering and Technology Swinburne University. Cited by: §1.
  • [12] W. W. Zhu, A. Berndsen, E. C. Madsen, M. Tan, I. H. Stairs, A. Brazier, P. Lazarus, R. Lynch, P. Scholz, K. Stovall, S. M. Ransom, S. Banaszak, C. M. Biwer, S. Cohen, L. P. Dartez, J. Flanigan, G. Lunsford, J. G. Martinez, A. Mata, M. Rohr, A. Walker, B. Allen, N. D. R. Bhat, S. Bogdanov, F. Camilo, S. Chatterjee, J. M. Cordes, F. Crawford, J. S. Deneva, G. Desvignes, R. D. Ferdman, P. C. C. Freire, J. W. T. Hessels, F. A. Jenet, D. L. Kaplan, V. M. Kaspi, B. Knispel, K. J. Lee, J. van Leeuwen, A. G. Lyne, M. A. McLaughlin, X. Siemens, L. G. Spitler, and A. Venkataraman (2014-01) Searching for pulsars using image pattern recognition. The Astrophysical Journal 781 (2), pp. 117. External Links: Document, Link Cited by: §1, §2.