
Automated identification of transiting exoplanet candidates in NASA Transiting Exoplanets Survey Satellite (TESS) data with machine learning methods

by   Leon Ofman, et al.

A novel artificial intelligence (AI) technique that uses machine learning (ML) methodologies and combines several algorithms developed by ThetaRay, Inc. is applied to NASA's Transiting Exoplanets Survey Satellite (TESS) dataset to identify exoplanetary candidates. The AI/ML ThetaRay system is trained initially with Kepler exoplanetary data and validated with confirmed exoplanets before its application to TESS data. Existing and new features of the data, based on various observational parameters, are constructed and used in the AI/ML analysis by employing semi-supervised and unsupervised machine learning techniques. By applying the ThetaRay system to 10,803 light curves of threshold crossing events (TCEs) produced by the TESS mission, obtained from the Mikulski Archive for Space Telescopes, we uncover 39 new exoplanetary candidate (EPC) targets. This study demonstrates for the first time the successful application of combined multiple AI/ML-based methodologies to a large astrophysical dataset for rapid automated classification of EPCs.



1 Introduction

The Transiting Exoplanet Survey Satellite (TESS) (Ricker et al., 2014) was launched by NASA on April 18, 2018 with the primary objective of surveying more than 200,000 nearby stars across the whole sky in search of transiting exoplanets using high-precision photometry, producing light curves with a 2-minute cadence. The TESS Objects of Interest (TOIs) have been released periodically and archived at the Mikulski Archive for Space Telescopes (MAST). The TOI list includes planetary candidates, as well as potential planetary candidates and other astrophysical targets, including false positives, and comprises the database used for searching for confirmed exoplanets. As of March 23, 2020, TESS has released 1766 TOIs with 43 confirmed planets and 412 false positives.

Previously, the Kepler Space Telescope, launched by NASA in 2009, was designed to determine the occurrence frequency of Earth-sized planets. Towards this objective, Kepler observed about 200,000 stars with high photometric precision, discovering thousands of transiting exoplanets and exoplanetary candidates (Borucki et al., 2010; Jenkins et al., 2010a; Koch et al., 2010; Christiansen et al., 2012). During the prime mission (2009 May 2 - 2013 May 11) Kepler was pointing at a single field of view of about 115 square degrees in the constellations of Cygnus and Lyra. The many periodic signals detected by Kepler were processed using the Kepler Science Processing Pipeline (Jenkins et al., 2010b) and assembled into a database of threshold crossing events (TCEs). Direct human input was required to remove false positives and instrumental effects from this database. However, the resulting TCE database contains data produced by many possible sources, such as eclipsing binaries, background eclipsing binaries and many other possible false alarm sources, in addition to a small fraction of exoplanetary candidates (EPCs), and still requires considerable analysis for confirmed identification of exoplanets.

Recently, Shallue and Vanderburg (2018) identified transiting exoplanets in Kepler satellite data using a deep learning (DL) algorithm based on training convolutional neural networks with the Google-Vizier system (Golovin et al., 2017). Shallue and Vanderburg (2018) trained the neural networks to classify whether a given light curve signal is the signature of a transiting exoplanet, with a low false positive rate. Using their algorithm, they identified multi-planet resonant chains around Kepler-80 and Kepler-90. Later, the extended Kepler K2 mission, which started in Nov. 2013, was designed to use the remaining Kepler capabilities after the completion of the prime mission and the technical failures of the reaction wheels. During this observation phase, the photometric accuracy was reduced, and the pointing varied in different regions of the sky. Nevertheless, Dattilo et al. (2019) applied a similar automated technique, based on the Shallue and Vanderburg (2018) study, to K2 mission data and identified two previously unknown exoplanets.

Automated classification methods for transiting exoplanets from TESS data have been developed using machine learning (ML) techniques in several studies (e.g., Ansdell et al., 2018; Zucker and Giryes, 2018; Yu et al., 2019; Osborn et al., 2020) that demonstrate the usefulness and feasibility of this approach with various degrees of improved classification performance. In this paper, we describe an application of novel algorithms that combine several ML approaches and low rank matrix decomposition, including algorithms that identify anomalies in high dimensional big data by using an augmentation approach. This method, which uses semi-supervised and unsupervised learning, was developed by ThetaRay, Inc. for uncovering financial crimes and for cyber and Internet of Things (IoT) security, and is applied in this study to the search for transiting EPCs. After training and validation on Kepler data with confirmed exoplanets, the ThetaRay platform was applied to TESS data, yielding 39 new EPCs out of nearly 11,000 TCEs and demonstrating the feasibility and utility of this new platform.

The paper is organized as follows: Section 2 discusses the ML methods, Section 3 presents the resulting exoplanet classification in TESS data, and Section 4 contains the discussion and conclusions. Details of the ThetaRay algorithms are described in the Appendix.

2 Machine Learning Methods

2.1 The ThetaRay Algorithm

In the present study we utilize ThetaRay AI-based Fintech algorithms, commercially developed for anomaly detection (financial crimes) in financial institutions, cyber security, and IoT for smooth operations of critical infrastructure installations. Since transiting exoplanet light curves are rare and appear in only a small number of all observed Kepler or TESS stellar light curves, they are classified as ‘anomalies’ in our analysis, in which the ThetaRay system uses the strengths of its algorithms to identify transiting EPCs in the large number of TCEs. To identify these ‘anomalies’, or exoplanet light curves, ThetaRay’s algorithms generate a data-driven ‘normal’ profile of the ingested data and simultaneously identify anomalies, also called abnormal events, providing forensics that categorize each event based on its features. This is done autonomously by the algorithm without the need for rules or signatures. ThetaRay’s algorithmic engine utilizes techniques drawn from a wide variety of mathematical disciplines, such as harmonic analysis, diffusion geometry and stochastic processing, low rank matrix decomposition, randomized algorithms in general and randomized linear algebra in particular, geometric measure theory, manifold learning, neural networks/deep learning, and compact representation by dictionaries. One approach models the data as a diffusion process, using the Brownian motion of a random walk process to geometrize the data. There is no need for any semantic understanding of the processed data, nor are there any predefined rules, heuristics or weights in the system. The diffused collected dataset is then converted into a Markov matrix through a normalized graph Laplacian and modeled as a stochastic process that is applied in many dimensions (possibly thousands); see the Appendix for additional details of the algorithms.
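The diffusion step described above can be illustrated with a textbook diffusion-map construction. The following is a minimal sketch, not ThetaRay's implementation: the Gaussian kernel, the bandwidth `epsilon`, the function name `diffusion_embedding`, and the synthetic two-cluster data are all assumptions made for illustration.

```python
import numpy as np


def diffusion_embedding(X, epsilon, n_components=2):
    """Textbook diffusion-map sketch (illustrative; not ThetaRay's code)."""
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances between the samples (rows of X).
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    # Gaussian affinities; epsilon is an assumed kernel bandwidth.
    K = np.exp(-d2 / epsilon)
    degree = K.sum(axis=1)
    # Row-stochastic Markov matrix of the random walk on the data graph.
    P = K / degree[:, None]
    # S = D^(-1/2) K D^(-1/2) = I - L_sym, where L_sym is the normalized
    # graph Laplacian; S is symmetric and shares its eigenvalues with P.
    S = K / np.sqrt(np.outer(degree, degree))
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    # Right eigenvectors of P, recovered from the eigenvectors of S.
    psi = evecs / np.sqrt(degree)[:, None]
    # Drop the trivial first eigenpair (eigenvalue 1, constant vector).
    coords = psi[:, 1:n_components + 1] * evals[1:n_components + 1]
    return P, evals, coords


# Synthetic stand-in for feature vectors: two well-separated clusters.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 0.1, size=(20, 3)),
                    rng.normal(1.0, 0.1, size=(20, 3))])
P_demo, evals_demo, coords_demo = diffusion_embedding(X_demo, epsilon=0.5)
```

In such a construction each row of the Markov matrix sums to one, and points that are far from the bulk of the data in the diffusion coordinates are natural anomaly candidates.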

2.2 Kepler Satellite data ML training

We focused on light curves produced by the Kepler space telescope, which collected the light curves of 200,000 stars in our Milky Way galaxy for 4 years with continuous 30-min or 1-min sampling. To train the algorithm to identify planet candidates in Kepler light curves, we used a training set of labeled Threshold Crossing Events (TCEs). We obtained all 15,737 TCEs produced by Kepler and used in Google’s deep learning research (Shallue and Vanderburg, 2018; Dattilo et al., 2019), which employed a supervised convolutional neural network architecture with 2,202 features: 201 features of the ‘local view’ and 2,001 features of the ‘global view’. The ‘global view’ represents the entire light curve and the ‘local view’ represents a phase-folded window around the identified transit.

We derived our training set of labeled TCEs from the Autovetter Planet Candidate Catalog for Q1-Q17 DR24 (Catanzarite, 2015; Coughlin et al., 2016) hosted at the NASA Exoplanet Archive. We obtained the TCE labels from the catalog’s “av_training_set” column, which has three possible values: planet candidate (PC), astrophysical false positive (AFP) and non-transiting phenomenon (NTP). We ignored TCEs with the “unknown” label (UNK). These labels were produced by manual vetting and other diagnostics. We obtained additional data on the TCEs such as planet number, radius of the planet, interval between consecutive planetary transits, etc., from the MAST TESS archive ( -survey-satellite-tess) for data labeling and use in our analysis.

2.2.1 Features

Feature engineering is the process of using data domain knowledge to create features by manipulating the data through mathematical and statistical relations among its various components (for examples, see Section 2.2.4) in order to improve the performance of the AI/ML algorithms. The feature engineering process includes deciding which features to develop, creating the features, checking how the features work with the model, improving the features as needed, and going back to deciding on or creating additional data features until the AI/ML algorithm results are optimized. We applied the feature engineering process to our dataset and created new features in addition to the existing features available in MAST, in order to quantify more aspects of the data used by the AI/ML algorithm in the present analysis. We produced a total of 424 features that were used for the analysis. In the feature engineering process, we tested different combinations of features within the limits of ThetaRay’s system and chose the combination that provided the best results, as validated in the training step.

2.2.2 Existing features

Additional TCE data were downloaded from MAST. We narrowed the data down to the fields required for the present task, such as the planet number, the radius of the planet, the interval between consecutive planetary transits, etc., and selected the relevant data from all the fields listed in “Data Columns in the Kepler TCE Table” by visualizing the variables (especially with KDE plots, see below). Below is the description of the variables and labels used in our analysis.

  • Unique key - concatenation of Kepler ID and Planet Number. Kepler ID is a target identification number, as listed in the Kepler Input Catalog (KIC). The KIC was derived from a ground-based imaging survey of the Kepler field conducted prior to launch. The survey’s purpose was to identify stars for the Kepler exoplanet survey by magnitude and color. The full catalog of 13 million sources can be searched at the MAST archive. The subset of 4 million targets found upon the Kepler CCDs can be searched via the Kepler Target Search form.

  • av_training_set - Autovetter Training Set Label. If the TCE was included in the training set, the training label encodes what is believed to be the “true” classification, and takes a value of either PC, AFP or NTP. The TCEs in the UNKNOWN class sample are marked UNK. Training labels are given a value of NULL for TCEs not included in the training set. For more detail about how the training set is constructed, see Autovetter Planet Candidate Catalog for Q1-Q17 Data Release 24 (KSCI-19091).

  • tce_prad - Planetary Radius (Earth radii). The radius of the planet obtained from the product of the planet to stellar radius ratio and the stellar radius.

  • tce_max_mult_ev - Multiple Event Statistic (MES). The maximum calculated value of the MES. TCEs that meet the maximum MES threshold criterion and other criteria listed in the TCE release notes are delivered to the Data Validation (DV) module of the data analysis pipeline for transit characterization and the calculation of statistics required for disposition. A TCE exceeding the maximum MES threshold is removed from the time-series data and the SES and MES statistics are recalculated. If a second TCE exceeds the maximum MES threshold then it is also propagated through the DV module, and the cycle is iterated until no more events exceed the criteria. Candidate multi-planet systems are found this way. Users of the TCE table can exploit the maximum MES statistic to help filter and sort samples of TCEs for the purposes of discerning the event quality, determining the likelihood of planet candidacy, or assessing the risks of observational follow-up.

  • tce_period - Orbital Period (days). The interval between consecutive planetary transits.

  • tce_time0bk - Transit Epoch (BJD) - 2,454,833.0. The time corresponding to the center of the first detected transit in Barycentric Julian Day (BJD) minus a constant offset of 2,454,833.0 days. The offset corresponds to 12:00 on Jan 1, 2009 UTC.

  • tce_duration - Transit Duration (hrs). The duration of the observed transits. Duration is measured from first contact between the planet and star until last contact. Contact times are typically computed from a best-fit model produced by a Mandel and Agol (2002) model fit to a multi-quarter Kepler light curve, assuming a linear orbital ephemeris.

  • tce_model_snr - Transit Signal-to-Noise (SNR). Transit depth normalized by the mean uncertainty in the flux during the transits.

  • av_pred_class - Autovetter Predicted Classification. Predicted classifications, which are the ‘optimum MAP classifications.’ Values are either PC, AFP, or NTP.

  • tce_depth - Transit Depth (ppm). The fraction of stellar flux lost at the minimum of the planetary transit. Transit depths are typically computed from a best-fit model produced by the Mandel and Agol (2002) model fit to a multi-quarter Kepler light curve, assuming a linear orbital ephemeris.

  • tce_impact - Impact Parameter. The sky-projected distance between the center of the stellar disc and the center of the planet disc at conjunction, normalized by the stellar radius.

  • local_view - vector of length 201: a ‘local view’ of the TCE. It shows the shape of the transit in detail (close-up of the transit event).
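As a small illustration of how tce_time0bk and tce_period relate to the phase-folded views, the sketch below converts an offset epoch back to BJD and folds observation times on the orbital period. The function names and the sample numbers are hypothetical; only the 2,454,833.0-day offset comes from the text above.

```python
# Offset quoted in the tce_time0bk description above.
KEPLER_EPOCH_OFFSET = 2454833.0


def bkjd_to_bjd(t_bkjd):
    """Convert an offset epoch (BJD - 2,454,833.0) back to full BJD."""
    return t_bkjd + KEPLER_EPOCH_OFFSET


def phase_fold(times, period, t0):
    """Fold times on the orbital period so that transit centers fall at phase 0.

    Returns phases in days within [-period/2, period/2).
    """
    half = 0.5 * period
    return [((t - t0 + half) % period) - half for t in times]


# Hypothetical TCE: period of 2.5 days, first transit centered at 131.5.
period_demo, t0_demo = 2.5, 131.5
times_demo = [131.5, 134.0, 139.0, 132.0]
phases_demo = phase_fold(times_demo, period_demo, t0_demo)
```

The first three sample times are whole numbers of periods after the epoch, so they fold onto phase 0, while the fourth falls 0.5 days after a transit center.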

2.2.3 Visualization of Kepler Data

We investigated the Kepler data and visualized the variables with the Pandas package in Python. For example, we visualized the distributions of the numerical variables per class using KDE (Kernel Density Estimation) plots. In Figure 1 we show several interesting examples with a gap between the curves labeled ‘Planets’ and ‘Not planets’, as identified by the ThetaRay system and validated by the Kepler data training set. Such a gap indicates that these features are significant in candidate exoplanet identification, and we therefore included them in the model. If both curves coincide, the behavior is the same for the labels ‘Planets’ and ‘Not planets’, and so we chose not to include these features in the model.

Figure 1: The distributions of the numerical variables using KDE (Kernel Density Estimation) plots, where the blue curves are labeled ‘Planet’ and the orange curves are labeled ‘Not a planet’, from Kepler data. When there is a significant difference between the curves, it can be concluded that these features are more significant for planet identification, and therefore we have included them in the model. If both curves coincide, it can be concluded that the behavior is not statistically different between the two populations. The plotted variables are (a) tce_period, (b) tce_duration, (c) tce_time0bk, (d) tce_model_snr (see text for their definitions).
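The per-class KDE comparison can be sketched numerically as follows. This is an illustrative example on synthetic data, not the authors' plotting code: the Gaussian kernel, the rule-of-thumb bandwidth, and the overlap coefficient used to quantify the "gap" between the two curves are all assumptions.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Gaussian kernel density estimate of `samples`, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Normal-reference rule of thumb (an assumed default).
        bandwidth = 1.06 * samples.std(ddof=1) * n ** (-0.2)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    norm = n * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm


def trapezoid(y, x):
    """Trapezoidal-rule integral of y over x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def overlap_coefficient(a, b, grid):
    """Area under min(p_a, p_b): near 1 if the curves coincide, near 0 if not."""
    dens_a = gaussian_kde_1d(a, grid)
    dens_b = gaussian_kde_1d(b, grid)
    return trapezoid(np.minimum(dens_a, dens_b), grid)


# Synthetic stand-ins for one feature in the two classes.
rng = np.random.default_rng(1)
planets = rng.normal(0.0, 1.0, 500)
not_planets = rng.normal(3.0, 1.0, 500)
grid = np.linspace(-6.0, 9.0, 601)
separated = overlap_coefficient(planets, not_planets, grid)
coincident = overlap_coefficient(planets, rng.normal(0.0, 1.0, 500), grid)
```

Under this sketch, a feature whose class-wise overlap is small would be kept, while a feature with overlap close to one would be dropped.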

Another example of our analysis is the ‘heat map’, a color-coded matrix in which the correlation value between each pair of features is used to color the corresponding cell. If there is a high correlation between any variables, the dimension of the data can be reduced. The various features are labeled on the axes. As expected, the cells on the main diagonal, which indicate the correlation of each feature with itself, are light colored. It is evident from the ‘heat map’ shown in Figure 2 that most off-diagonal features are weakly correlated. The only significant off-diagonal correlation is between av_training_set (the training labels, i.e., if the TCE was included in the training set, the training label encodes what is believed to be the “true” classification) and av_pred_class (the predicted classifications, which are the optimum MAP (maximum a posteriori) classifications). In fact, this field does not provide analysis information for the data but is used as a forensic feature. The forensic features are not included directly in the analysis but provide supplementary information about the data that is useful for investigating the analysis. Some artificial correlation is also evident between tce_time0bk (transit epoch, BJD) and tce_period (orbital period, days).

Figure 2: The ‘Heat map’ of some of the features (or parameters) used in the ThetaRay algorithm. The intensity scale indicates the magnitude of the correlation between the features that facilitates determining the dimensionality of the dataset (see text).
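The correlation screening behind such a heat map can be sketched as below. The feature names (`feat_a`, `feat_b`, `feat_c`), the 0.9 threshold, and the synthetic data are hypothetical and chosen only to illustrate how strongly correlated pairs, which are candidates for dimension reduction, can be listed.

```python
import numpy as np


def correlated_pairs(X, names, threshold=0.9):
    """Return the Pearson correlation matrix of the columns of X and the
    list of feature pairs whose absolute correlation reaches `threshold`."""
    corr = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return corr, pairs


# Synthetic example: feat_b is almost a rescaled copy of feat_a.
rng = np.random.default_rng(2)
a = rng.normal(size=1000)
b = 2.0 * a + 0.01 * rng.normal(size=1000)
c = rng.normal(size=1000)
X_demo = np.column_stack([a, b, c])
corr, pairs = correlated_pairs(X_demo, ["feat_a", "feat_b", "feat_c"])
```

The matrix `corr` is what a heat map would display; `pairs` lists the off-diagonal cells that stand out.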

2.2.4 New Features

New features were developed from the original Kepler dataset obtained from MAST to optimize the analysis with the ThetaRay algorithm. These features were constructed from the original dataset as described below, using the phase-folded “Local View” light curves (see, e.g., Shallue and Vanderburg, 2018).

  • global_view - the original vector of length 2001, or a ‘global view’ of the TCE, that shows the characteristics of the light curve over an entire orbital period. Because of the size limitations of ThetaRay’s system, we performed dimension reduction. We represented groups of 20 columns in the ‘global view’ by computing the average and standard deviation of those columns. We have a total of 200 new “global_view” features.

  • spline_bkspace - the break-point spacing in time units, used for the best-fit spline. We chose the optimal spacing of spline breakpoints for each light curve by fitting splines with different breakpoint spacings, calculating the Bayesian Information Criterion (BIC, Schwarz (1978)) for each spline, and choosing the breakpoint spacing that minimized the BIC.

Below is a brief description of the new features that were computed for the “Global View” and “Local View” light curves of each TCE:

  • loc_mean - average of the “Local View” light curve.

  • loc_std - standard deviation of the “Local View” light curve.

  • loc_25% - 25th percentile of the “Local View” light curve.

  • loc_75% - 75th percentile of the “Local View” light curve.

  • loc_max - maximum value of the “Local View” light curve.

  • glob_mean - average of the original “Global View” light curve.

  • glob_std - standard deviation of the original “Global View” light curve.

  • glob_25% - 25th percentile of the original “Global View” light curve.

  • glob_75% - 75th percentile of the original “Global View” light curve.

  • glob_max - maximum value of the original “Global View” light curve.

  • zScore_loc_min - minimum value of the Z-Score on the “Local View” light curve with a window of 10.

  • zScore_loc_max - maximum value of the Z-Score on the “Local View” light curve with a window of 10.

  • zScore_glob_min - minimum value of the Z-Score on the “Global View” light curve with a window of 100.

  • zScore_glob_max - maximum value of the Z-Score on the “Global View” light curve with a window of 100.
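The features listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the text does not say how the 2001st global-view column is handled (here it is dropped so that 100 groups of 20 yield 200 features), which standard-deviation convention is used (population standard deviation here), or how the windowed Z-Score is defined (here, the last point of each sliding window relative to that window's mean and standard deviation). All function names are hypothetical.

```python
import numpy as np


def bin_global_view(global_view, group_size=20):
    """Reduce a global view to per-group means followed by per-group stds."""
    g = np.asarray(global_view, dtype=float)
    n_groups = g.size // group_size  # 2001 // 20 = 100; the remainder is dropped
    blocks = g[: n_groups * group_size].reshape(n_groups, group_size)
    return np.concatenate([blocks.mean(axis=1), blocks.std(axis=1)])


def summary_stats(view, prefix):
    """Mean, std, quartiles and maximum of a view, keyed like loc_* / glob_*."""
    v = np.asarray(view, dtype=float)
    return {
        f"{prefix}_mean": float(v.mean()),
        f"{prefix}_std": float(v.std()),
        f"{prefix}_25%": float(np.percentile(v, 25)),
        f"{prefix}_75%": float(np.percentile(v, 75)),
        f"{prefix}_max": float(v.max()),
    }


def rolling_zscore_extremes(view, window):
    """Minimum and maximum of an assumed windowed Z-Score of the view."""
    v = np.asarray(view, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(v, window)
    mean = windows.mean(axis=1)
    std = windows.std(axis=1)
    last = v[window - 1:]
    z = np.zeros_like(last)
    # Windows with zero spread get a Z-Score of 0 instead of dividing by zero.
    np.divide(last - mean, std, out=z, where=std > 0)
    return float(z.min()), float(z.max())


# Synthetic views: a ramp for the global view, a single-point dip for the local view.
global_demo = np.arange(2001.0)
local_demo = np.ones(201)
local_demo[100] = 0.0
global_features = bin_global_view(global_demo)
local_stats = summary_stats(local_demo, "loc")
z_min, z_max = rolling_zscore_extremes(local_demo, window=10)
```

With this definition a transit-like dip produces a strongly negative minimum Z-Score, which is the kind of signature the zScore_* features are meant to capture.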

2.2.5 Working on ThetaRay’s System

On the ThetaRay platform we built an “analysis chain”, a multi-stage flowchart composed of three main stages: Data Source, Data Frame and Analysis. The data are organized into data sources and uploaded to ThetaRay’s platform. We created data frames in the system with a wrangling method (data wrangling is the process of cleaning, structuring and enriching raw data into a desired format to make it more appropriate and valuable for modeling) and split the data randomly into training and test sets in the ThetaRay system, with 80% allocated for training and 20% for testing. The training procedure generates a profile, which was fed into different types of analyses using ThetaRay’s augmented and unsupervised algorithms to find the parameters that maximize the Area Under the ROC Curve (AUC) in each chain, where the Receiver Operating Characteristic (ROC) curve is a standard evaluation metric for a classification model’s performance. After the analysis and review of these results were completed, the internal parameters of the system were modified and fine-tuned to improve the results, and the data were processed again. Then, identification was executed again.
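The 80/20 random split and the AUC metric can be illustrated independently of the ThetaRay platform. The sketch below uses only the Python standard library; the function names, the seed, and the toy labels and scores are assumptions for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Randomly split rows into (train, test) with the given test fraction."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive is scored above a random negative (ties count one half)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 100 rows split 80/20, and four scored events.
train_demo, test_demo = train_test_split(list(range(100)))
auc_demo = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of planets from non-planets, which is why it is a convenient target when tuning the parameters of each chain.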

2.3 TESS Satellite Data Analysis

2.3.1 Preprocessing the Data

We obtained 10,803 light curves of TCEs produced by the TESS mission from MAST. We wanted to use the same model that we built on the Kepler data to find potential exoplanets (anomalies) in the new data from TESS. To use the same model for the two different satellites, we had to convert the TESS data to the same structure as the Kepler data. Therefore, we performed additional steps to prepare the light curves to be used as inputs to our system. We generated a set of TFRecord files for the TCEs. Each file contains global_view, local_view and spline_bkspace representations, as for Kepler. We also created the following data files in Python:

  • global_view - Vector of length 2001 that shows the characteristics of the light curve over an entire orbital period.

  • local_view - Vector of length 201 that shows the shape of the transit in detail (phase-folded close-up of the transit event).

  • more_features - includes

    • ticid - TESS ID of the target star.

    • planetNumber - TCE number within the target star.

    • planetRadiusEarthRadii - has the same meaning as the field of tce_prad in Kepler data.

    • spline_bkspace - same meaning as spline_bkspace in Kepler data.

    • mes - same meaning as tce_max_mult_ev in Kepler data.

    • orbitalPeriodDays - same meaning as tce_period in Kepler data.

    • transitEpochBtjd - same meaning as tce_time0bk in Kepler data.

    • transitDurationHours - same meaning as tce_duration in Kepler data.

    • transitDepthPpm - same meaning as tce_depth in Kepler Data.

    • minImpactParameter - same meaning as tce_impact in Kepler data.

    TESS data are unlabeled, so the av_training_set and av_pred_class fields do not exist in the TESS data; we therefore filled these fields with zeros. The tce_model_snr feature exists in Kepler data but not in TESS data, so we calculated its value as the ratio of transitDepthPpm to transitDepthPpm_err.

  • Describe files - includes count, mean, std, min, max, 25% percentile, median (50%), 75% percentile. These quantities were computed on each original data row from the global_view and local_view files and on each scaling row of these files.
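The field mapping described in this list can be sketched as a simple renaming step. The dictionary below follows the correspondences stated above (reading `mes` as the counterpart of tce_max_mult_ev, which is an interpretation of the list); the function name and the sample row values are hypothetical, and the sketch assumes a non-zero transitDepthPpm_err.

```python
# TESS field name -> Kepler field name, as listed in the text above.
TESS_TO_KEPLER = {
    "planetRadiusEarthRadii": "tce_prad",
    "mes": "tce_max_mult_ev",
    "orbitalPeriodDays": "tce_period",
    "transitEpochBtjd": "tce_time0bk",
    "transitDurationHours": "tce_duration",
    "transitDepthPpm": "tce_depth",
    "minImpactParameter": "tce_impact",
}


def tess_row_to_kepler_schema(row):
    """Rename TESS TCE fields to the Kepler names and fill the missing ones."""
    out = {kepler: row[tess] for tess, kepler in TESS_TO_KEPLER.items()}
    # TESS TCEs are unlabeled: the label fields are filled with zeros.
    out["av_training_set"] = 0
    out["av_pred_class"] = 0
    # The SNR is not provided, so it is taken as depth over depth uncertainty.
    out["tce_model_snr"] = row["transitDepthPpm"] / row["transitDepthPpm_err"]
    return out


# Hypothetical TESS TCE row (values chosen for illustration only).
tess_row_demo = {
    "planetRadiusEarthRadii": 2.5,
    "mes": 8.0,
    "orbitalPeriodDays": 10.0,
    "transitEpochBtjd": 1400.0,
    "transitDurationHours": 3.0,
    "transitDepthPpm": 1000.0,
    "transitDepthPpm_err": 125.0,
    "minImpactParameter": 0.3,
}
kepler_row_demo = tess_row_to_kepler_schema(tess_row_demo)
```

Keeping the mapping in a single dictionary makes it easy to check that every feature expected by the Kepler-trained model has a TESS counterpart.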

Following the generation of the dataset in the form of Comma-Separated Values (CSV) files, we applied the same manipulation to global_view as for the Kepler data in order to reduce the dimensions, and used the analogous 424 features produced from the TESS data for the analysis on ThetaRay’s system. Following this step, we applied the Detection algorithm to the TESS data according to the saved model from Kepler and used the results for classification and mapping of the TESS light curve TCE data.

3 Results: Transiting Exoplanet Detection

The first results of the ThetaRay algorithm produced around 90 preliminary identifications of EPCs that were further manually vetted, reducing the number of identified EPCs by about a factor of two. Local view light curves were used together with planetary candidate parameters to vet the algorithm’s output. In the manual vetting, physical parameters such as non-typical ‘local view’ light curves (i.e., v-shapes and other non-planetary periodic features), extremely large planetary radius, and very low signal-to-noise were used. The parameters for the remaining 39 EPCs identified by the ThetaRay system from the TESS database of 10,803 TCEs are given in Table 1. In Figure 3 we show the Local View light curves of eight selected exoplanetary candidates identified using the ThetaRay algorithm. The TESS input catalog ID number (TIC_ID), along with several parameters (tce_prad, tce_period, tce_depth, defined in Section 2.2.2) for the identified EPC, are indicated on each panel. Of the 39 validated cases we note that only two cases with planetary radius (tce_prad) or (TIC_ID 307210830 and 259377017), and a total of eight EPCs identified with . Another 15 identified EPCs were similar in size or larger than Jupiter with . We find the following properties of the 39 cases:

  • The orbital periods (tce_period) of the identified EPCs range from 0.38d to just under 23d.

  • The transit depth (tce_depth) varied by about an order of magnitude in the range ppm with the signal-to-noise in the range ppm.

  • The impact parameter was in the range .

  • The duration of the transits (tce_duration) was in the range d.

  • In four cases the identified EPCs suggest multiple planetary systems with 2 and 3 planets.

TIC_ID #p tce_prad tce_max_mult_ev tce_period tce_time0bk tce_duration tce_model_snr tce_depth tce_impact
150162739 1 15.70639992 8.687669754 14.63549995 1335.199951 0.312352091 8.506078927 2042.819946 0.200270995
167603396 1 2.751130104 8.034460068 14.37919998 1365.140015 0.324110419 8.00441099 1444.5 0.269908011
254700590 1 2.881239891 7.790110111 11.77639961 1625.609985 0.193634167 7.599164669 543.3380127 0.651623011
259377017 3 1.372750044 9.26651001 3.359859943 1387.089966 0.057147499 8.651269232 1034.959961 0.463200003
279201188 1 4.806849957 7.558539867 14.4708004 1417.109985 0.112282082 7.839735967 652.2369995 0.375441998
303051566 1 3.854789972 8.216239929 15.53929996 1328.569946 0.06302125 6.706231497 569.2410278 0.0374793
307210830 3 0.869957983 11.3927002 2.253309965 1598.23999 0.042161249 10.71902329 723.9060059 0.501681983
355509914 1 10.77110004 7.868070126 1.738260031 1326.119995 0.052042082 8.795540018 18978.40039 0.200622007
370228465 2 4.152969837 7.710509777 12.32479954 1357.910034 0.057042085 7.614795753 4594.089844 0.0166821
401889161 1 5.161489964 7.100709915 16.49230003 1417.01001 0.126625001 6.860909397 323.1549988 0.495678991
422280868 1 2.566740036 8.417449951 3.133980036 1544.25 0.046455417 6.119703991 628.3099976 0.386207998
447061717 1 2.608789921 21.75650024 9.204919815 1569.719971 0.135038748 18.51823286 4082.899902 0.0298279
453767182 1 2.725820065 7.798190117 10.76249981 1626.119995 0.122235835 7.410331196 13611 0.027590601
101948569 1 3.049010038 11.03339958 19.47240067 1360.109985 0.145037085 11.74638202 1645.949951 0.303943992
102195674 1 20.62459946 68.38349915 4.378769875 1547.459961 0.159099996 67.04560811 30425.69922 0.00999983
120916706 1 5.22453022 10.40649986 0.556737006 1386.170044 0.035527959 11.14050062 63679.10156 0.00999983
141663326 1 24.71339989 65.67880249 6.65583992 1601.439941 0.119643748 55.47800308 6468.180176 0.985625029
167418903 1 10.36499977 20.16550064 21.96240044 1599.300049 0.073109999 17.57117062 8278.05957 0.844699979
170849515 1 10.83209991 16.08620071 1.941280007 1438.150024 0.074110419 20.93755124 37131.69922 0.00999983
172464366 1 19.09110069 79.50800323 2.921689987 1470.050049 0.13241291 77.20788698 17562.40039 0.552250981
178155732 1 2.372940063 10.47840023 5.971879959 1415.630005 0.11720375 12.1434367 316.0220032 0.226411998
200591694 1 4.385819912 8.523739815 13.58699989 1470.150024 0.103903331 9.010778093 4702.040039 0.00999983
206412587 1 3.578089952 7.95663023 16.51129913 1417.02002 0.106479168 7.899363151 524.2609863 0.475097001
218524525 1 3.137619972 7.899419785 16.88459969 1494.859985 0.177322909 8.048498208 986.6010132 0.00999983
219379012 1 4.340690136 15.25220013 1.546159983 1469.709961 0.055939998 17.6102274 1064.26001 0.692296982
219403686 1 5.821829796 22.26140022 0.380145997 1468.579956 0.028838458 29.63871326 1336.619995 0.679122984
235009317 1 23.10919952 73.6289978 7.456830025 1329.209961 0.158223748 49.59156193 21885.40039 0.903016984
264537668 1 24.19919968 141.2850037 4.03110981 1469.109985 0.129590005 128.3034072 40075.69922 0.571915984
270677759 1 9.437470436 14.45300007 9.129110336 1597.199951 0.196854994 14.20179783 8185.049805 0.805234015
306735585 1 8.496970177 12.65380001 4.816760063 1414.079956 0.131257921 11.34648808 5228.529785 0.893122017
307467401 1 4.158410072 44.08229828 9.587329865 1475.709961 0.281197071 25.5473955 1375.969971 0.83335799
308994098 1 5.175449848 19.70980072 10.51659966 1552.050049 0.446095824 18.35474766 991.1049805 0.707704008
309619055 1 10.75909996 20.72480011 10.55350018 1604.969971 0.195166245 11.93659993 9440.740234 0.894083023
322900369 1 7.504670143 95.60189819 3.126100063 1493.890015 0.124930002 92.18038804 5151.629883 0.570958972
335452175 1 15.31970024 58.07910156 15.49790001 1601.26001 0.134304583 60.38542514 8295.75 0.990104973
410214984 3 4.706439972 12.51220036 8.135899544 1332.349976 0.043714583 6.290396653 4277.52002 0.144591004
422655579 1 15.61709976 27.75729942 2.903460026 1413.140015 0.210315004 45.99784581 4657.879883 0.00999983
423275733 1 17.97360039 24.79450035 2.052979946 1518.689941 0.110506669 36.4699389 10176.90039 0.745383978
455278250 1 7.306509972 36.73529816 15.60929966 1521.51001 0.238732085 28.64738963 2824.179932 0.82368201
Table 1: Some of the parameters (see text) of identified exoplanetary candidates (EPCs) from the TESS mission data archive using the ThetaRay system.
Figure 3: “Local view” normalized phase-folded light curves of selected exoplanetary candidates from Table 1 with the parameters tce_prad (the radius in Earth radii), tce_period in days, and tce_depth in ppm indicated on the corresponding panels. The typical temporal shape of a transiting exoplanet light curve is evident.

4 Discussion and Conclusions

The TESS satellite provides observations of a large number (200,000) of stellar light curves with high photometric precision over the whole sky, divided into observing sectors, with the aim of detecting transiting Earth-sized planets. The stellar objects were selected to represent the brightest stars closest to our solar system. The large dataset of nearly 27 gigabytes per day is then processed in the science data pipeline, providing nearly 11,000 TCEs as of the time of writing this paper. Further analysis of the TCEs is required to find confirmed examples of exoplanets, or exoplanetary candidates for more in-depth processing. However, this formidable data analysis task is difficult, if not impossible, to carry out manually. A feasible approach for the TESS data analysis is based on recently developed automated identification techniques customized for the identification of transiting exoplanetary candidates, utilizing AI/ML methods based on DL neural networks combined with the anomaly identification methods reported in the present study. These EPCs could then be vetted further with targeted observations and data analysis.

In this study we apply a novel algorithm developed by ThetaRay, Inc. for cybersecurity and anomaly identification in financial systems. The advantage of this AI/ML system over other machine learning methods is the combination of several algorithms, as described in this paper and the Appendix, and its direct applicability to any large dataset that contains a possibly small number of target data points (‘anomalies’). We apply the system to TESS observations of TCEs in search of transiting exoplanet signatures in the large TCE dataset. For the training set of the ML algorithm we used the Kepler exoplanet TCEs, validated with the confirmed exoplanet dataset. By applying the trained ThetaRay algorithm to TESS TCEs we report 39 new planetary candidates in a wide range of sizes, from below Earth’s radius to super-Jupiter radii, with planetary periods ranging from 0.38d to just under 23d. We demonstrate that the combination of DL neural networks with mathematical anomaly identification techniques provides an efficient AI/ML algorithm for the rapid automated search for light curves of transiting exoplanet candidates. Although we find that manual vetting is needed to reduce the number of false positives, the total number of EPC identifications is small enough for secondary manual vetting of the light curves, and this approach provides the desired identification results. In future applications, ThetaRay’s algorithm could be further optimized for the identification of transiting exoplanets by including, for example, informed ML steps, potentially further reducing the false-positive rate in this application and providing a new tool for analyzing TESS TCE data.


The resources for this research were provided by ThetaRay, Inc. LO would like to acknowledge the hospitality of the Department of Geosciences, Tel Aviv University.


The classification of light curves as exoplanetary candidates in this paper is achieved by using the analytic platform of ThetaRay, which is described in this appendix. This platform processes high dimensional big data to identify anomalous behavior in comparison to a normal profile. This anomaly detection tool is used in the present application for the classification of EPCs in the TESS TCE database. The normal profile is driven by the training data, and its generation is explained below. In the present study we used Kepler TCE data as a training dataset, as described in section 2.2. This appendix describes some of the algorithms that were utilized in this study to identify anomalies in big data using augmentation, semi-supervised and unsupervised algorithms. The same core algorithms for anomaly identification are capable of identifying anomalies in cyber (malware), industrial malfunction (IoT) and financial (crimes) data. The algorithms were applied for the first time to astrophysical data in this study. These algorithms are part of ThetaRay’s core technology portfolio to fight financial crimes (Shabat et al., 2018a). The algorithms are housed in the ThetaRay Computational Platform, which enables efficient data manipulation and processing. The reported results were obtained by executing these algorithms on the ThetaRay platform.

Appendix A Semi-supervised processing via augmentation: Introduction

For background and context, we briefly describe the current commercial applications of the ThetaRay system, which have now been expanded and applied to an astrophysics dataset. The ThetaRay system is designed to provide fast and accurate analytic solutions for identifying emerging risk/crime (classified as anomalies) in financial data, discovering new opportunities, and exposing blind spots within these large, complex, high dimensional data sets. These AI-based algorithms radically reduce false positives, and are uniquely able to uncover “unknown unknowns” (threats that one is not aware of, and does not even know that one is not aware of).

ThetaRay provides constructive solutions to anomaly detection challenges via its analytic platform, which is designed for big data, uncovers previously unknown risks, and does so in real time with industry-low false positive rates, enabling fast forensics.

In this project, we assume that some labels are given for the Kepler TCE data, a dataset related to the TESS TCEs, but that no labels are given for the TESS data. An augmented algorithm, which is considered a learning method, generates a new data frame based on the provided labels. Then, the new data frame serves as an input to unsupervised algorithms. In this project, we apply 3 unsupervised algorithms to the augmented data: geometric-based denoted by NY (see section C.1), algebraic-based denoted by LU (see section C.2), a hybrid of LU and NY and neural network denoted by AE.

The augmentation method is based on a neural network. The default network (which can be user-adjusted) consists of one input layer (the analysis data frame), three hidden layers and one output layer. All the layers are connected through “weights” that are automatically tuned during the learning (optimization) process until the values of the network output layer are close to the values of the provided labels. After optimization, the third hidden layer becomes the new data frame, which is the input to the unsupervised algorithms that are outlined in section B, some of which are described in detail in section C.
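The augmentation step can be illustrated with a minimal NumPy sketch. The source only fixes the layer count and the use of the third hidden layer as the new data frame; the tanh hidden units, sigmoid output, cross-entropy loss, plain gradient descent, layer widths, synthetic data and the names `train_augmenter` and `augment` are all assumptions made here for illustration.

```python
import numpy as np

def forward(X, layers):
    """Return the activations of every layer, input first, output last."""
    acts = [X]
    for i, (W, b) in enumerate(layers):
        z = acts[-1] @ W + b
        is_output = i == len(layers) - 1
        acts.append(1.0 / (1.0 + np.exp(-z)) if is_output else np.tanh(z))
    return acts

def train_augmenter(X, y, hidden=(16, 8, 4), epochs=300, lr=0.1, seed=0):
    """Fit input -> 3 hidden layers -> 1 output to binary labels y (assumed setup)."""
    rng = np.random.default_rng(seed)
    sizes = [X.shape[1], *hidden, 1]
    layers = [(rng.normal(0.0, 1.0 / np.sqrt(a), (a, b)), np.zeros(b))
              for a, b in zip(sizes[:-1], sizes[1:])]
    losses = []
    for _ in range(epochs):
        acts = forward(X, layers)
        p = np.clip(acts[-1][:, 0], 1e-12, 1 - 1e-12)
        losses.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
        # Gradient of the mean cross-entropy w.r.t. the output pre-activation.
        delta = (acts[-1][:, 0] - y)[:, None] / len(y)
        for i in reversed(range(len(layers))):
            W, b = layers[i]
            grad_W, grad_b = acts[i].T @ delta, delta.sum(axis=0)
            if i > 0:  # back-propagate through the tanh of the previous layer
                delta = (delta @ W.T) * (1.0 - acts[i] ** 2)
            layers[i] = (W - lr * grad_W, b - lr * grad_b)
    return layers, losses

def augment(X, layers):
    """New data frame = activations of the third hidden layer."""
    return forward(X, layers)[3]

# Synthetic stand-in for a labelled data frame: 200 rows, 6 features,
# with a minority of rows labelled 1 ("known anomalies").
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 2.0).astype(float)

layers, losses = train_augmenter(X, y)
X_aug = augment(X, layers)  # shape (200, 4): input to the unsupervised stage
```

In this sketch the unsupervised algorithms would receive `X_aug` in place of `X`, so the geometry they see has already been shaped by the available labels.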

ThetaRay’s platform covers detection and monitoring of several verticals, with current emphasis on financial crimes, by supplying an end-to-end solution. ThetaRay provides an un- and semi-supervised, real-time, agnostic, AI-based financial crime detection platform that is based on algorithms for anomaly detection of “unknown unknowns”.

Rule-based technology, which is very popular among anomaly detection tools, is intended for cases in which one knows what to look for. ThetaRay’s detection is achieved by un- and semi-supervised automatic methods that are not based on rules, patterns, signatures, heuristics, data semantics of the features or any prior domain expertise, and that provide a high detection rate with very low false positives. ThetaRay’s methodologies within its Analytics Platform are based on unbiased detection through a series of randomized advanced AI-based algorithms that can process any number of data features. The results can be explained and justified, and anomalies can be traced back to the features that triggered them; therefore the system is not classified as a black box. Thus, the platform enables tracking back the events and features that triggered the occurrence of anomalies. ThetaRay’s system operates under the assumption that it is not known what to look for or what to ask. This potentially allows the technology to detect every type of anomaly automatically, before any rules describing it are discovered. For efficient processing of the algorithms the system uses off-the-shelf hardware components. The inherent parallelism in the algorithms is implemented with GPU utilization. The platform contains advanced and interactive visualization of the input and output phases of the data analysis. The detection approach is data driven; thus, no preexisting models are assumed to exist. This makes the approach universal and generic, and thus opens the way for different applications without the introduction of bias, limitations, and unfounded preconceptions into the processing, a property well suited for large astrophysical datasets. Mathematical and physical justification for most of the available algorithms in the system is given below.

The input training data can be enriched by a given limited set of labels, which increases the detection rate and reduces the false alarm rate. This is the semi-supervised part of the processing; both semi- and unsupervised algorithms are used. Currently, the platform contains eight different unsupervised algorithms for data without labels and three different semi-supervised algorithms for data with partial labels within the detection engine. The results are fused to produce one solution.

ThetaRay combines the strengths of unsupervised and semi-supervised techniques to identify anomalies in the data. Unsupervised learning assumes that there are no labels for the various data components. Semi-supervised learning frameworks have made significant progress in training machine learning models with limited labeled data in the image domain. Augmented unsupervised learning can be used side-by-side with semi-supervised learning. The augmented algorithms generate a new data frame based on the analysis data frame and the provided labels. The new data frame is then the input for all the selected unsupervised algorithms. Labels are binary, with the minority of the labels (known anomalies) marked as “1” and the remainder, which are the majority of unknown cases, assigned “0”.

The augmented process enables covering both the known and the unknown with a relative balance between them. The ThetaRay system allows for configuration of the underlying input features, algorithms and detection logic for each application. Technically, it is a neural network-based process that generates a new data frame based on the input data frame and the binary labels provided by the application (in the present case, stellar light-curve data).

Appendix B Unsupervised algorithms: General description


The NY algorithm (see Figure 4) is based on the diffusion maps (DM) methodology (Coifman and Lafon, 2006a) and is primarily a non-linear dimension reduction process. The anomaly identification procedure takes place inside the lower dimensional space (manifold) that is determined automatically during the training phase. An out-of-sample extension procedure (Coifman and Lafon, 2006b) is applied in the identification phase to each multidimensional data point that did not participate in the training phase, to determine whether it belongs to the manifold (the low dimensional space, classified as normal) or deviates from it (classified as anomalous).

Figure 4: NY algorithm: flow chart.

The NY algorithm, which is based on DM, geometrizes the input training data. DM analyzes the ambient space (training data) and determines automatically where the data actually resides in the embedded space. We can visualize the input training data (ambient space) as a matrix of size $m \times n$, where $m$ is the number of multidimensional data points (the number of rows in the matrix) and each row is of dimension $n$, the number of columns in the matrix. The input data is assumed to be sampled from a low dimensional manifold (embedded space) that captures the dependencies between the observable parameters. DM reduces, in a non-linear way, the dimension of the ambient space, which is the training data. The dimensionality reduction by DM is based on local affinities between multidimensional data points and on a non-linear embedding of the ambient space into a lower dimensional space, described as a manifold, by using a low rank matrix decomposition. The non-parametric nature of this analysis uncovers the important underlying factors of the input data and reveals the intrinsic geometry of the data represented by the embedded manifold. This manifold describes geometrically what we classify as the normal profile of the ambient data. Newly arrived multidimensional data points, which did not participate in the training procedure, are embedded into the lower dimensional space by the application of an out-of-sample extension algorithm. If the embedded multidimensional data point falls into the manifold, it is classified as normal; otherwise it is classified as abnormal (anomalous). See section C.1 for more details.
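The training and identification phases described above can be sketched as follows. This is not ThetaRay's implementation: the Gaussian kernel, the Nyström formula used for the out-of-sample extension (a simpler stand-in for the geometric harmonics of Coifman and Lafon, 2006b), the density-based decision rule, and the names `train_ny`, `extend` and `is_anomalous` are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kernel(X, Y, eps):
    """k(x, y) = exp(-||x - y||^2 / eps) for all pairs of rows of X and Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / eps)

def train_ny(A, eps, n_coords=2, frac=0.5):
    """Diffusion-map embedding of the training rows plus a density threshold."""
    K = gaussian_kernel(A, A, eps)
    d = K.sum(axis=1)                                  # degrees (local density)
    S = K / np.sqrt(np.outer(d, d))                    # symmetric conjugate of P = D^-1 K
    lam, V = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1][1:n_coords + 1]      # skip the trivial eigenvalue 1
    lam, psi = lam[order], V[:, order] / np.sqrt(d)[:, None]
    # Assumed rule: the threshold is a fraction of the smallest training degree.
    return {"A": A, "eps": eps, "lam": lam, "psi": psi, "threshold": frac * d.min()}

def extend(model, x):
    """Nystrom out-of-sample extension of one new point; returns (coords, degree)."""
    k = gaussian_kernel(x[None, :], model["A"], model["eps"])[0]
    degree = k.sum()
    coords = (k / degree) @ model["psi"] / model["lam"] if degree > 0 else None
    return coords, degree

def is_anomalous(model, x):
    """A point whose kernel density falls below the threshold is treated as off-manifold."""
    return extend(model, x)[1] < model["threshold"]

# Training data: a noisy circle (a one-dimensional manifold) in three dimensions.
theta = np.linspace(0.0, 2.0 * np.pi, 300, endpoint=False)
rng = np.random.default_rng(0)
A = np.c_[np.cos(theta), np.sin(theta), np.zeros_like(theta)]
A = A + 0.01 * rng.normal(size=A.shape)
model = train_ny(A, eps=0.1)

x_on = np.array([np.cos(0.3), np.sin(0.3), 0.0])   # lies on the circle
x_off = np.array([3.0, 3.0, 3.0])                  # far from the circle
```

A density score is used for the decision because a row-normalised Nyström extension pulls even distant points towards their nearest training neighbours, so distance in the embedded space alone would not flag them.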


The LU algorithm is based on a randomized low-rank matrix decomposition (Shabat et al., 2018b). This algorithm builds a dictionary from the training data. Then, each newly arrived multidimensional data point that is not well described (not spanned well) by the dictionary is classified as an anomalous data point.

The randomized LU (RLU) algorithm is an algebraic approach applied to an input matrix $A$ of size $m \times n$ with an intrinsic dimension $k$ smaller than $n$. $k$ can be computed automatically or given. RLU is a low rank matrix decomposition that enables the identification of anomalies using a dictionary constructed from the training data. RLU forms a low rank matrix approximation of $A$ such that $PAQ \approx LU$, where $P$ and $Q$ are orthogonal permutation matrices, and $L$ and $U$ are the lower and upper triangular matrices, respectively. A dictionary $D$ is then constructed according to $D = P^T L$ ($P^T$ is the transpose of the matrix $P$). Thus, $D$ is a linear combination of the input matrix and a representation of the normal data. It is also used in the identification step to classify newly arrived multidimensional data points that did not participate in the training phase. Thus, a new incoming multidimensional data point $x$ that satisfies $\|DD^{\dagger}x - x\| < \epsilon$ is classified as normal; otherwise, it is classified as anomalous. Here, $D^{\dagger}$ is the pseudo inverse of $D$ and $\epsilon$ is a quantity defined in the training phase. When applied to a matrix of size $m \times n$, the RLU decomposition reduces the number of multidimensional data points, resulting in a reduced-measurements matrix with fewer rows than the original. Although the algorithm is randomized, it has been proven in Shabat et al. (2018b) that the probability that the RLU approximation generates a large error is very small. See section C.2 for more details.


The DK algorithm relies on successive applications of LU and NY. Assume that a given training matrix $A$ has $m$ data points (rows) and $n$ features (columns). RLU (described in section C.2) is applied to $A$. The size of $A$ is reduced substantially through the application of random projection (Johnson and Lindenstrauss, 1984). Then, NY (described in section C.1) is applied to the reduced matrix, which is embedded into a lower dimensional space, and the NY anomaly identification procedure is carried out in this embedded space.
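The random projection step can be illustrated in isolation. The sketch below shows the Johnson–Lindenstrauss property that makes such a reduction safe for distance-based methods like NY: a scaled Gaussian matrix maps the rows to far fewer dimensions while approximately preserving pairwise distances. The matrix sizes, the target dimension `r` and the function names are assumptions for illustration, not values from the study.

```python
import numpy as np

def random_projection(A, r, seed=0):
    """Map the rows of A from n to r dimensions with a scaled Gaussian matrix."""
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(A.shape[1], r)) / np.sqrt(r)
    return A @ G

def pairwise_distances(X):
    """Euclidean distances between all pairs of rows of X."""
    sq = (X ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(D2, 0.0))

rng = np.random.default_rng(42)
A = rng.normal(size=(20, 2000))          # 20 data points with 2000 features
A_small = random_projection(A, r=400)    # same points with 400 features

# Ratio of projected to original distance for every distinct pair of points.
iu = np.triu_indices(len(A), k=1)
ratios = pairwise_distances(A_small)[iu] / pairwise_distances(A)[iu]
```

The ratios cluster around 1, and the spread shrinks as `r` grows, which is the trade-off between compression and geometric fidelity.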


The AE algorithm is a variational autoencoder. AE is a machine learning tool designed to generate complex models of data after careful modeling of the distribution of example data. In neural net language, an AE consists of an encoder component and a decoder component. We assume that the input data set is generated from an underlying unobserved (latent) representation. Given an input data set, the encoder part of an AE approximates the distribution of the latent variables. Finally, the algorithm sets the distribution parameters of the latent layers in a manner that maximizes the likelihood of generating or reconstructing the input data in the decoder section. As soon as the distribution of the latent variables is approximated, we can sample from this distribution to generate an approximate representation of the input data. Since normality consists of and is defined by most of the data points, those will be well approximated by the AE, while anomalies will be poorly modeled. Therefore, by comparing the original sample with the reconstructed (generated) data, we can calculate a similarity score that enables us to detect anomalies. The goal is to use the AE as a denoising autoencoder: it allows us to encode a sample into the latent space and then reconstruct it. Since we plan to use the AE for anomaly detection, we have to calculate the scores for the input and output.
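The scoring idea, reconstruct each sample and flag those that reconstruct poorly, can be shown without a full variational autoencoder. The sketch below deliberately swaps in a linear autoencoder obtained from principal components, which is not the model used in the study; the 99th-percentile threshold, the synthetic data and the function names are likewise assumptions.

```python
import numpy as np

def fit_linear_autoencoder(X, latent_dim):
    """Linear encoder/decoder from the top principal directions (a PCA stand-in)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:latent_dim].T              # W has shape (n_features, latent_dim)

def reconstruction_score(X, mean, W):
    """Encode, decode and return the reconstruction error of every row."""
    Z = (X - mean) @ W                          # encoder: project to the latent space
    X_hat = Z @ W.T + mean                      # decoder: map back to feature space
    return np.linalg.norm(X - X_hat, axis=1)

# Normal data: 8 observed features driven by 2 latent variables plus small noise.
rng = np.random.default_rng(5)
mixing = rng.normal(size=(2, 8))
X_train = rng.normal(size=(500, 2)) @ mixing + 0.05 * rng.normal(size=(500, 8))

mean, W = fit_linear_autoencoder(X_train, latent_dim=2)
train_scores = reconstruction_score(X_train, mean, W)
threshold = np.quantile(train_scores, 0.99)     # assumed decision rule

# A sample that does not follow the latent structure reconstructs poorly.
x_new = 3.0 * rng.normal(size=(1, 8))
new_score = reconstruction_score(x_new, mean, W)[0]
```

A nonlinear or variational autoencoder would replace the two matrix products with trained encoder and decoder networks; the comparison of input and reconstruction stays the same.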

Appendix C Unsupervised algorithms: Mathematical description

C.1 Diffusion geometry: Background

DM are a kernel-based method for manifold learning that can reveal the intrinsic structures in data and embed them in a low dimensional space. The DM-based approach computes the diffusion geometry. A spectral embedding of the data points provides coordinates that are used to interpolate and approximate the pointwise diffusion map embedding of data.

Manifold learning approaches are often used for modeling and uncovering intrinsic low dimensional structure in high dimensional data. DM is a method that captures data manifolds with random walks that propagate through non-linear pathways in the data. Transition probabilities of a Markovian diffusion process (explained later how to compute them) define an intrinsic diffusion distance metric that is amenable to a low dimensional embedding. By arranging transition probabilities in a row-stochastic diffusion operator, and taking its leading eigenvalues and eigenvectors, one can derive a small set of coordinates where diffusion distances are approximated as Euclidean distances and intrinsic manifold structures are revealed.
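The statement that diffusion distances become Euclidean distances in the spectral coordinates can be checked numerically. The sketch below assumes a Gaussian kernel and a small random dataset; the function name `diffusion_map` and the scaling of the coordinates by the square root of the total degree are choices made here so that the identity holds exactly when all non-trivial eigenvectors are kept.

```python
import numpy as np

def diffusion_map(X, eps, t=1):
    """Diffusion coordinates of the rows of X at time t (all non-trivial ones)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq / eps)                       # Gaussian kernel (an assumed choice)
    d = K.sum(axis=1)
    S = K / np.sqrt(np.outer(d, d))             # symmetric matrix similar to P = D^-1 K
    lam, V = np.linalg.eigh(S)
    lam, V = lam[::-1], V[:, ::-1]              # eigenvalues in descending order
    psi = V / np.sqrt(d)[:, None]               # right eigenvectors of P
    coords = np.sqrt(d.sum()) * psi[:, 1:] * lam[1:] ** t
    return coords, K / d[:, None], d / d.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
t = 2
coords, P, d_hat = diffusion_map(X, eps=1.0, t=t)

# Diffusion distance between points 0 and 5 computed directly from P^t ...
Pt = np.linalg.matrix_power(P, t)
direct = np.sqrt(np.sum((Pt[0] - Pt[5]) ** 2 / d_hat))
# ... and as a plain Euclidean distance in the diffusion coordinates.
embedded = np.linalg.norm(coords[0] - coords[5])
# Keeping only the two leading coordinates gives a lower-dimensional approximation.
truncated = np.linalg.norm(coords[0, :2] - coords[5, :2])
```

Dimensionality reduction corresponds to keeping only the leading columns of `coords`; because the eigenvalues decay, the discarded columns contribute little to the distance for larger `t`.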

In more detail, the NY algorithm uncovers the internal geometry of the input training data denoted as $A$. The use of geometric considerations significantly speeds up the anomaly detection computation. The supporting theory is as follows. The goal is to detect anomalies in $A$ and in newly arrived $n$-dimensional data points that did not participate in the training data $A$. During the training procedure, the size of $n$, which is also called the dimension of $A$, is automatically reduced. This procedure is called dimensionality reduction. As explained later, dimensionality reduction is achieved without damaging the quality and the coherency of the data in $A$. Moreover, there is no loss of data. Dimensionality reduction is just a different representation of the training data that automatically, without any human intervention, reduces the dimension according to the data and uncovers the real dimension where the training data actually resides.

In general, anomaly detection is based on the notion of similarities (or affinities) between the high dimensional data points (these are the rows in the matrix $A$). How do we detect anomalies in this big data efficiently, without introducing bias and without damaging the data? Dimensionality reduction of $A$ is needed. How is this reduction achieved? The following provides the rationale for why geometrization of the training data, together with tracking the movement of newly arrived data points, identifies a low dimensional manifold for learning. It is founded mathematically on the preservation of the quality and the integrity (completeness) of the data in $A$.

The assumption is that the processed data is imbalanced: the majority of the data is normal, so high densities of $n$-dimensional samples (rows in the matrix $A$) represent normal data, whereas samples in low-density regions are classified as anomalous (abnormal).

Theory: How do we find the low dimensional space (manifold)? It is proved that if $A$ is sampled from a low intrinsic dimensional manifold then, as $m$ (the number of samples) tends to infinity, the defined random walk, which travels between all the data samples, converges to a diffusion process over the manifold. This is the key to the processing of $A$ as a diffusion process that guarantees an efficient scan of the data through randomization without the introduction of bias. Three complementary approaches for dimensionality reduction (diffusion distances between $n$-dimensional samples, randomization and manifold learning) emerge from this observation (theorem):

1. A kernel matrix $K$ of size $m \times m$ (huge) is constructed from distances among all the $n$-dimensional samples (rows). The distances are diffusion distances.
2. A random walk is applied to the entries in $K$. This random walk guarantees that there is no bias in the utilization of the distances in $K$.
3. Diffusion Maps (DM) links the matrix $K$ and a lower dimensional space (manifold) via diffusion processing. The dimension of the embedded manifold represents the reduction of $A$.

Geometrization of the training data, outline of the approach: The NY algorithm is based on a geometric uncovering of a low dimensional manifold in the ambient space (the original space represented by $A$) by the application of DM to the ambient space. The input data is assumed to be sampled from a low intrinsic dimensional manifold that captures the dependencies between the observable parameters ($n$-dimensional features). DM reduces the dimension of the training data. It is based on local affinities between multidimensional data points and on a non-linear embedding of the ambient space into a lower dimensional space, described as a manifold, by using a low rank matrix decomposition. The non-parametric nature of this analysis uncovers the important underlying factors of the input data and reveals the intrinsic geometry of the data represented by the embedded manifold. This manifold describes geometrically what we classify as the normal profile in the ambient data. Newly arrived $n$-dimensional data points, which did not participate in the training procedure, are embedded into the lower dimensional space by the application of an out-of-sample extension algorithm. If the embedded $n$-dimensional data point falls into the manifold where most of the normal data reside, it is classified as normal; otherwise it is classified as abnormal (anomalous). The exchange of data between the ambient space and the manifold, where the detection takes place, does not degrade the coherency and the completeness of the data and preserves the geometrical relations (affinities) between the two spaces, ambient and embedded (manifold).

C.1.1 Diffusion geometry: outline

Let $X = \{x_1, \ldots, x_m\}$ be a dataset and let $k: X \times X \to \mathbb{R}$ be a symmetric point-wise positive kernel that defines a connected, undirected and weighted graph over $X$. Then, a random walk over $X$ is defined by the row-stochastic transition probabilities matrix $P = D^{-1}K$, where $K$ is an $m \times m$ matrix whose entries are $K_{ij} = k(x_i, x_j)$ and $D$ is the diagonal degrees matrix whose $i$-th element is $d_i = \sum_{j} k(x_i, x_j)$. The vector $d = (d_1, \ldots, d_m)$ is referred to as the degrees vector of the graph defined by $k$.

The associated time-homogeneous random walk $X(t)$ is defined via the conditional probabilities on its state-space $X$: assuming that the process starts at time $t = 0$, then for any time point $t \in \mathbb{N}$, $\Pr(X(t) = x_j \mid X(0) = x_i) = (P^t)_{ij}$, where $(P^t)_{ij}$ is the $(i,j)$-th entry of the $t$-th power of the matrix $P$. As long as the process is aperiodic, it has a unique stationary distribution $\hat{d}$, which is the steady state of the process, i.e. $\hat{d}_j = \lim_{t \to \infty} (P^t)_{ij}$, regardless of the initial state $x_i$. This steady state is the probability distribution resulting from the $\ell_1$ normalization of the degrees vector $d$, i.e.,

$$\hat{d} = \frac{d}{\|d\|_1}, \tag{1}$$

where $\|d\|_1 = \sum_i d_i$. The diffusion distances at time $t$ are defined by the metric $D^{(t)}: X \times X \to \mathbb{R}$,

$$D^{(t)}(x_i, x_j) = \sqrt{\sum_{l} \frac{\left((P^t)_{il} - (P^t)_{jl}\right)^2}{\hat{d}_l}}. \tag{2}$$

By definition, $(P^t)_{i\cdot}$, the $i$-th row of $P^t$, is the probability distribution over $X$ after $t$ time steps given that the initial state is $x_i$. Therefore, the diffusion distance from Eq. 2 measures the difference between two propagations along $t$ time steps: the first originates in $x_i$ and the second in $x_j$. Weighting the metric by the inverse of the steady state ascribes high weight to similar probabilities on rare states and vice versa. Thus, a family of diffusion geometries is defined by Eq. 2, each corresponding to a single time step $t$.
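The objects defined in this subsection are straightforward to compute for a small dataset. The following sketch assumes a Gaussian kernel with an arbitrary width and a random two-dimensional dataset; it builds the row-stochastic matrix, the normalised degrees vector and the diffusion distance, and it raises the transition matrix to a high power to observe the steady state. The variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                   # toy dataset with 30 points

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq / 4.0)                          # symmetric, point-wise positive kernel
d = K.sum(axis=1)                              # degrees vector
P = K / d[:, None]                             # D^{-1} K, row-stochastic
d_hat = d / d.sum()                            # l1-normalised degrees (steady state)

def diffusion_distance(P, d_hat, i, j, t):
    """Weighted l2 distance between rows i and j of the t-th power of P."""
    Pt = np.linalg.matrix_power(P, t)
    return np.sqrt(np.sum((Pt[i] - Pt[j]) ** 2 / d_hat))

# After many steps every row of P^t approaches the steady state.
P_inf = np.linalg.matrix_power(P, 500)

dist_short = diffusion_distance(P, d_hat, 0, 1, t=1)
dist_long = diffusion_distance(P, d_hat, 0, 1, t=8)
```

Increasing `t` can only shrink the distance between two points, since both propagations drift towards the same steady state; this is what makes the family of distances multiscale.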

Due to the above interpretation, the diffusion distances are naturally utilized for multiscale clustering, since they uncover the connectivity properties of the graph across time. In Bérard et al. (1994) and Coifman and Lafon (2006a) it has been proven that, under some conditions, if $X$ is sampled from a low intrinsic dimensional manifold then, as the number of samples $m$ tends to infinity, the defined random walk converges to a diffusion process over the manifold.

C.2 Randomized LU decomposition: An algorithm for dictionary construction

A dictionary construction algorithm is presented. It is based on a low-rank matrix factorization achieved by applying the randomized LU decomposition (Shabat et al., 2018b) to training data. This method is fast, scalable, parallelizable, consumes little memory, outperforms SVD in these categories and also works extremely well on large sparse matrices. In contrast to existing methods, the randomized LU decomposition constructs an under-complete dictionary, which simplifies both the construction and the classification of newly arrived multidimensional data points. The dictionary construction is generic and fits different applications.

The randomized LU algorithm, which is applied to a given training data matrix $A$ of $m$ multidimensional data points and $n$ features, decomposes $A$ into two matrices $L$ and $U$. The size of $L$ is determined by the decaying spectrum of the singular values of the matrix $A$, and is bounded by $\min\{m, n\}$. Both $L$ and $U$ are linearly independent.

The randomized LU decomposition algorithm (see Figure 5) computes the rank $k$ LU approximation of a full matrix (Algorithm 1). The main building blocks of the algorithm are random projections and Rank Revealing LU (RRLU) (Pan, 2000), which are used to obtain a stable low-rank approximation of an input matrix that serves as training data. In Figure 5, ‘II’ describes the generation of a dictionary by calling item ‘I’, which describes the flow of the randomized LU decomposition. The end of the execution of ‘I’ means that the training is completed. The dictionary is the input of ‘II’, which performs the identification. A newly arrived data point that did not participate in the training is either spanned by the dictionary (classified as normal) or not spanned by it (classified as anomalous).

Figure 5: ‘II’ calls the construction of a dictionary via randomized LU decomposition as described in ‘I’. The LU algorithm is built from the following steps: The inputs to the algorithm are a matrix $A$ and its rank $k$ (see ‘I’). They are submitted to the randomized LU, which generates the following outputs: permutation matrices $P$ and $Q$, and lower and upper triangular matrices $L$ and $U$, respectively. Then, a newly arrived data point that did not participate in the training is either spanned by $P^TL$, and therefore classified as normal, or otherwise classified as abnormal (anomalous).

The RRLU algorithm, used in Algorithm 1, reveals the connection between the LU decomposition of a matrix and its singular values. Similar algorithms exist for rank revealing QR decompositions (see, for example, Gu and Eisenstat (1996)).

Theorem C.1 (Pan (2000)).

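Let $A$ be an $m \times n$ matrix ($m \ge n$). Given an integer $1 \le k < n$, then the following factorization

$$PAQ = \begin{pmatrix} L_{11} & 0 \\ L_{21} & I_{n-k} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix} \tag{3}$$

holds, where $L_{11}$ is a lower triangular matrix with ones on the diagonal, $U_{11}$ is an upper triangular matrix, and $P$ and $Q$ are orthogonal permutation matrices. Let $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_n \ge 0$ be the singular values of $A$, then:

$$\sigma_k \ge \sigma_{\min}(L_{11}U_{11}) \ge \frac{\sigma_k}{k(n-k)+1},$$

$$\sigma_{k+1} \le \|U_{22}\| \le (k(n-k)+1)\,\sigma_{k+1}.$$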


Based on Theorem C.1, we have the following definition:

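Definition C.1 (RRLU Rank $k$ Approximation denoted RRLU$_k$).

Given a RRLU decomposition (Theorem C.1) of a matrix $A$ with an integer $k$ (as in Eq. 3) such that $PAQ = LU$, then the RRLU rank $k$ approximation is defined by taking $k$ columns from $L$ and $k$ rows from $U$ such that

$$\mathrm{RRLU}_k(PAQ) = \begin{pmatrix} L_{11} \\ L_{21} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \end{pmatrix},$$

where $L_{11}, L_{21}, U_{11}, U_{12}, P$ and $Q$ are defined in Theorem C.1.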

Lemma C.2 (Shabat et al. (2018b), RRLU Approximation Error).

The error of the RRLU$_k$ approximation of $A$ is

$$\|PAQ - \mathrm{RRLU}_k(PAQ)\| \le (k(n-k)+1)\,\sigma_{k+1}.$$


Algorithm 1 describes the flow of the RLU decomposition algorithm.

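Input: Matrix $A$ of size $m \times n$ to decompose; $k$ rank of $A$; $l$ number of columns to use (for example, $l = k + 5$).
Output: Matrices $P, Q, L, U$ such that $\|PAQ - LU\| \le O(\sigma_{k+1}(A))$, where $P$ and $Q$ are orthogonal permutation matrices, $L$ and $U$ are the lower and upper triangular matrices, respectively, and $\sigma_{k+1}(A)$ is the $(k+1)$th singular value of $A$.
1:  Create a matrix $G$ of size $n \times l$ whose entries are i.i.d. Gaussian random variables with zero mean and unit standard deviation.
2:  $Y \leftarrow AG$.
3:  Apply RRLU decomposition (see Pan (2000)) to $Y$ such that $PYQ_y = L_yU_y$.
4:  Truncate $L_y$ and $U_y$ by choosing the first $k$ columns and rows, respectively: $L_y \leftarrow L_y(:, 1:k)$ and $U_y \leftarrow U_y(1:k, :)$.
5:  $B \leftarrow L_y^{\dagger}PA$. ($L_y^{\dagger}$ is the pseudo inverse of $L_y$).
6:  Apply LU decomposition to $B$ with column pivoting $BQ = L_bU_b$.
7:  $L \leftarrow L_yL_b$.
8:  $U \leftarrow U_b$.
Algorithm 1 Randomized LU Decomposition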
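A compact NumPy sketch of the eight steps is given below. Two simplifications are assumptions of this sketch and not part of the published algorithm: step 3 uses ordinary LU with partial (row) pivoting instead of a true rank revealing LU, so the column permutation $Q_y$ is the identity, and step 6 obtains the column-pivoted factorization from a row-pivoted LU of the transpose, which places the unit diagonal on the upper factor. The helper `lu_row_pivot` and the test matrix are illustrative.

```python
import numpy as np

def lu_row_pivot(M):
    """LU with partial (row) pivoting of a tall matrix: M[perm] = L @ U."""
    U = np.array(M, dtype=float)
    rows, cols = U.shape
    L, perm = np.zeros((rows, cols)), np.arange(rows)
    for j in range(cols):
        p = j + int(np.argmax(np.abs(U[j:, j])))
        for arr in (U, L, perm):
            arr[[j, p]] = arr[[p, j]]            # swap rows j and p
        if U[j, j] != 0.0:
            L[j:, j] = U[j:, j] / U[j, j]
            U[j + 1:] -= np.outer(L[j + 1:, j], U[j])
        else:
            L[j, j] = 1.0
    return perm, L, np.triu(U[:cols])

def randomized_lu(A, k, l, seed=0):
    """Steps 1-8 of Algorithm 1; returns row/column permutations and L, U."""
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(A.shape[1], l))         # 1: Gaussian matrix of size n x l
    Y = A @ G                                    # 2: random projection
    perm_r, L_y, _ = lu_row_pivot(Y)             # 3: pivoted LU (stand-in for RRLU)
    L_y = L_y[:, :k]                             # 4: keep the first k columns
    B = np.linalg.pinv(L_y) @ A[perm_r]          # 5: B = pinv(L_y) P A
    perm_c, L_t, U_t = lu_row_pivot(B.T)         # 6: B Q = L_b U_b via LU of B^T
    L_b, U_b = U_t.T, L_t.T
    return perm_r, perm_c, L_y @ L_b, U_b        # 7-8: L = L_y L_b, U = U_b

# Test matrix: rank k plus a tiny perturbation, so sigma_{k+1} is very small.
rng = np.random.default_rng(3)
m, n, k = 60, 40, 5
A = rng.normal(size=(m, k)) @ rng.normal(size=(k, n)) + 1e-10 * rng.normal(size=(m, n))

perm_r, perm_c, L, U = randomized_lu(A, k=k, l=k + 5)
PAQ = A[perm_r][:, perm_c]                       # permutations applied as index arrays
rel_err = np.linalg.norm(PAQ - L @ U, 2) / np.linalg.norm(A, 2)
```

The permutation matrices are represented by index arrays, so `A[perm_r][:, perm_c]` plays the role of $PAQ$. With partial pivoting the quality of the factors can be worse than with RRLU on adversarial inputs, which is why the published algorithm uses the rank revealing variant.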

C.2.1 Randomized LU Based Classification Algorithm

Based on Section C.2, we apply the randomized LU decomposition (Algorithm 1) to the matrix $A$, yielding $PAQ \approx LU$. The outputs $P$ and $Q$ are orthogonal permutation matrices. Theorem C.3 shows that $P^TL$ forms (up to a certain accuracy) a basis for $A$. This is the key property of the classification algorithm.

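Theorem C.3 (Shabat et al. (2018b)).

Given a matrix $A$. Its randomized LU decomposition is $PAQ \approx LU$. Then, the error of representing $A$ by $P^TL$, namely $\|(P^TL)(P^TL)^{\dagger}A - A\|$, is bounded with high probability by a constant multiple of $\sigma_{k+1}(A)$; the explicit constant and probability are given in Shabat et al. (2018b).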


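Let $x$ be a multidimensional data point and let $D$ be a dictionary. The distance between $x$ and the dictionary $D$ is defined by $\mathrm{dist}(x, D) = \|DD^{\dagger}x - x\|$, where $D^{\dagger}$ is the pseudo-inverse of the matrix $D$. If $\mathrm{dist}(x, D) < \epsilon$ then $x$ is normal; otherwise it is anomalous.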
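The classification rule can be sketched in a few lines. To keep the example short, the dictionary here is a random sketch of the training matrix rather than the $P^TL$ factor, which is an assumption of this example: both span approximately the same column space when the data is close to low rank. Data points are stored as columns, and the threshold rule (twice the largest training distance) and all names are illustrative choices.

```python
import numpy as np

def dictionary_distance(x, D, D_pinv):
    """Norm of the part of x that is not spanned by the columns of D."""
    return np.linalg.norm(D @ (D_pinv @ x) - x)

rng = np.random.default_rng(7)
n_features, n_points, k = 12, 300, 3

# Normal training points (stored as columns) lie near a k-dimensional subspace.
basis = rng.normal(size=(n_features, k))
A = basis @ rng.normal(size=(k, n_points))
A = A + 1e-3 * rng.normal(size=A.shape)

# Stand-in dictionary: a k-column random sketch of the training matrix.
D = A @ rng.normal(size=(n_points, k))
D_pinv = np.linalg.pinv(D)

# Assumed threshold rule: twice the largest distance seen on the training points.
train_dist = np.array([dictionary_distance(a, D, D_pinv) for a in A.T])
epsilon = 2.0 * train_dist.max()

x_normal = basis @ rng.normal(size=k)          # follows the normal profile
x_anomalous = rng.normal(size=n_features)      # generic direction, not spanned

dist_normal = dictionary_distance(x_normal, D, D_pinv)
dist_anomalous = dictionary_distance(x_anomalous, D, D_pinv)
```

Because the pseudo-inverse is computed once during training, classifying a new point costs only two small matrix-vector products.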


  • Ansdell et al. (2018) Ansdell, M., Ioannou, Y., Osborn, H.P., Sasdelli, M., 2018 NASA Frontier Development Lab Exoplanet Team, Smith, J.C., Caldwell, D., Jenkins, J.M., Räissi, C., Angerhausen, D., NASA Frontier Development Lab Exoplanet Mentors, ., 2018. Scientific Domain Knowledge Improves Exoplanet Transit Classification with Deep Learning. Astrophys. J. Lett. 869, L7. doi:10.3847/2041-8213/aaf23b, arXiv:1810.13434.
  • Bérard et al. (1994) Bérard, P., Besson, G., Gallot, S., 1994. Embedding riemannian manifolds by their heat kernel. Geometric and Functional Analysis GAFA 4, 373–398.
  • Borucki et al. (2010) Borucki, W.J., Koch, D., Basri, G., Batalha, N., Brown, T., Caldwell, D., Caldwell, J., Christensen-Dalsgaard, J., Cochran, W.D., DeVore, E., Dunham, E.W., Dupree, A.K., Gautier, T.N., Geary, J.C., Gilliland, R., Gould, A., Howell, S.B., Jenkins, J.M., Kondo, Y., Latham, D.W., Marcy, G.W., Meibom, S., Kjeldsen, H., Lissauer, J.J., Monet, D.G., Morrison, D., Sasselov, D., Tarter, J., Boss, A., Brownlee, D., Owen, T., Buzasi, D., Charbonneau, D., Doyle, L., Fortney, J., Ford, E.B., Holman, M.J., Seager, S., Steffen, J.H., Welsh, W.F., Rowe, J., Anderson, H., Buchhave, L., Ciardi, D., Walkowicz, L., Sherry, W., Horch, E., Isaacson, H., Everett, M.E., Fischer, D., Torres, G., Johnson, J.A., Endl, M., MacQueen, P., Bryson, S.T., Dotson, J., Haas, M., Kolodziejczak, J., Van Cleve, J., Chandrasekaran, H., Twicken, J.D., Quintana, E.V., Clarke, B.D., Allen, C., Li, J., Wu, H., Tenenbaum, P., Verner, E., Bruhweiler, F., Barnes, J., Prsa, A., 2010. Kepler Planet-Detection Mission: Introduction and First Results. Science 327, 977. doi:10.1126/science.1185402.
  • Brown et al. (2011) Brown, T.M., Latham, D.W., Everett, M.E., Esquerdo, G.A., 2011. Kepler Input Catalog: Photometric Calibration and Stellar Classification. Astron. J. 142, 112. doi:10.1088/0004-6256/142/4/112, arXiv:1102.0342.
  • Catanzarite (2015) Catanzarite, J.H., 2015. Autovetter Planet Candidate Catalog for Q1-Q17 Data Release 24. KSCI-19091-001, NASA Ames Research Center, Moffett Field, CA.
  • Christiansen et al. (2012) Christiansen, J.L., Jenkins, J.M., Caldwell, D.A., Burke, C.J., Tenenbaum, P., Seader, S., Thompson, S.E., Barclay, T.S., Clarke, B.D., Li, J., Smith, J.C., Stumpe, M.C., Twicken, J.D., Cleve, J.V., 2012. The derivation, properties, and value of Kepler’s combined differential photometric precision. Publications of the Astronomical Society of the Pacific 124, 1279–1287. doi:10.1086/668847.
  • Coifman and Lafon (2006a) Coifman, R.R., Lafon, S., 2006a. Diffusion maps. Applied and Computational Harmonic Analysis 21, 5 – 30.
  • Coifman and Lafon (2006b) Coifman, R.R., Lafon, S., 2006b. Geometric harmonics: a novel tool for multiscale out-of-sample extension of empirical functions. Applied and Computational Harmonic Analysis 21, 31–52.
  • Coughlin et al. (2016) Coughlin, J.L., Mullally, F., Thompson, S.E., Rowe, J.F., Burke, C.J., Latham, D.W., Batalha, N.M., Ofir, A., Quarles, B.L., Henze, C.E., Wolfgang, A., Caldwell, D.A., Bryson, S.T., Shporer, A., Catanzarite, J., Akeson, R., Barclay, T., Borucki, W.J., Boyajian, T.S., Campbell, J.R., Christiansen, J.L., Girouard, F.R., Haas, M.R., Howell, S.B., Huber, D., Jenkins, J.M., Li, J., Patil-Sabale, A., Quintana, E.V., Ramirez, S., Seader, S., Smith, J.C., Tenenbaum, P., Twicken, J.D., Zamudio, K.A., 2016. Planetary Candidates Observed by Kepler. VII. The First Fully Uniform Catalog Based on the Entire 48-month Data Set (Q1-Q17 DR24). Astrophys. J. Supp. 224, 12. doi:10.3847/0067-0049/224/1/12, arXiv:1512.06149.
  • Dattilo et al. (2019) Dattilo, A., Vanderburg, A., Shallue, C.J., Mayo, A.W., Berlind, P., Bieryla, A., Calkins, M.L., Esquerdo, G.A., Everett, M.E., Howell, S.B., Latham, D.W., Scott, N.J., Yu, L., 2019. Identifying Exoplanets with Deep Learning. II. Two New Super-Earths Uncovered by a Neural Network in K2 Data. Astron. J. 157, 169. doi:10.3847/1538-3881/ab0e12, arXiv:1903.10507.
  • Golovin et al. (2017) Golovin, D., Solnil, B., Moitra, S., Kochanski, G., Karro, J., D., S., 2017. Google Vizier: A Service for Black-Box Optimization. ACM ISBN 978-1-4503-4887-4/17/08, 1487. doi:10.1145/3097983.3098043.
  • Gu and Eisenstat (1996) Gu, M., Eisenstat, S.C., 1996. Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM Journal on Scientific Computing 17, 848–869.
  • Jenkins et al. (2010a) Jenkins, J.M., Caldwell, D.A., Chandrasekaran, H., Twicken, J.D., Bryson, S.T., Quintana, E.V., Clarke, B.D., Li, J., Allen, C., Tenenbaum, P., Wu, H., Klaus, T.C., Van Cleve, J., Dotson, J.A., Haas, M.R., Gilliland, R.L., Koch, D.G., Borucki, W.J., 2010a. Initial Characteristics of Kepler Long Cadence Data for Detecting Transiting Planets. Astrophys. J. Lett. 713, L120–L125. doi:10.1088/2041-8205/713/2/L120.
  • Jenkins et al. (2010b) Jenkins, J.M., Caldwell, D.A., Chandrasekaran, H., Twicken, J.D., Bryson, S.T., Quintana, E.V., Clarke, B.D., Li, J., Allen, C., Tenenbaum, P., Wu, H., Klaus, T.C., Middour, C.K., Cote, M.T., McCauliff, S., Girouard, F.R., Gunter, J.P., Wohler, B., Sommers, J., Hall, J.R., Uddin, A.K., Wu, M.S., Bhavsar, P.A., Van Cleve, J., Pletcher, D.L., Dotson, J.A., Haas, M.R., Gilliland, R.L., Koch, D.G., Borucki, W.J., 2010b. Overview of the Kepler Science Processing Pipeline. Astrophys. J. Lett. 713, L87–L91. doi:10.1088/2041-8205/713/2/L87.
  • Johnson and Lindenstrauss (1984) Johnson, W.B., Lindenstrauss, J., 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26, 1.
  • Koch et al. (2010) Koch, D.G., Borucki, W.J., Basri, G., Batalha, N.M., Brown, T.M., Caldwell, D., Christensen-Dalsgaard, J., Cochran, W.D., DeVore, E., Dunham, E.W., Gautier, T.N., Geary, J.C., Gilliland, R.L., Gould, A., Jenkins, J., Kondo, Y., Latham, D.W., Lissauer, J.J., Marcy, G., Monet, D., Sasselov, D., Boss, A., Brownlee, D., Caldwell, J., Dupree, A.K., Howell, S.B., Kjeldsen, H., Meibom, S., Morrison, D., Owen, T., Reitsema, H., Tarter, J., Bryson, S.T., Dotson, J.L., Gazis, P., Haas, M.R., Kolodziejczak, J., Rowe, J.F., Van Cleve, J.E., Allen, C., Chandrasekaran, H., Clarke, B.D., Li, J., Quintana, E.V., Tenenbaum, P., Twicken, J.D., Wu, H., 2010. Kepler Mission Design, Realized Photometric Performance, and Early Science. Astrophys. J. Lett. 713, L79–L86. doi:10.1088/2041-8205/713/2/L79.
  • Mandel and Agol (2002) Mandel, K., Agol, E., 2002. Analytic Light Curves for Planetary Transit Searches. Astrophys. J. Lett. 580, L171–L175. doi:10.1086/345520, arXiv:astro-ph/0210099.
  • Osborn et al. (2020) Osborn, H.P., Ansdell, M., Ioannou, Y., Sasdelli, M., Angerhausen, D., Caldwell, D., Jenkins, J.M., Räissi, C., Smith, J.C., 2020. Rapid classification of TESS planet candidates with convolutional neural networks. Astron. Astrophys. 633, A53. doi:10.1051/0004-6361/201935345, arXiv:1902.08544.
  • Pan (2000) Pan, C.T., 2000. On the existence and computation of rank-revealing LU factorizations. Linear Algebra and its Applications 316, 199–222.
  • Ricker et al. (2014) Ricker, G.R., Winn, J.N., Vanderspek, R., Latham, D.W., Bakos, G.Á., Bean, J.L., Berta-Thompson, Z.K., Brown, T.M., Buchhave, L., Butler, N.R., Butler, R.P., Chaplin, W.J., Charbonneau, D., Christensen-Dalsgaard, J., Clampin, M., Deming, D., Doty, J., De Lee, N., Dressing, C., Dunham, E.W., Endl, M., Fressin, F., Ge, J., Henning, T., Holman, M.J., Howard, A.W., Ida, S., Jenkins, J., Jernigan, G., Johnson, J.A., Kaltenegger, L., Kawai, N., Kjeldsen, H., Laughlin, G., Levine, A.M., Lin, D., Lissauer, J.J., MacQueen, P., Marcy, G., McCullough, P.R., Morton, T.D., Narita, N., Paegert, M., Palle, E., Pepe, F., Pepper, J., Quirrenbach, A., Rinehart, S.A., Sasselov, D., Sato, B., Seager, S., Sozzetti, A., Stassun, K.G., Sullivan, P., Szentgyorgyi, A., Torres, G., Udry, S., Villasenor, J., 2014. Transiting Exoplanet Survey Satellite (TESS). volume 9143 of Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series. p. 914320. doi:10.1117/12.2063489.
  • Schwarz (1978) Schwarz, G., 1978. Estimating the dimension of a model. Ann. Statist. 6, 461–464. doi:10.1214/aos/1176344136.
  • Shabat et al. (2018a) Shabat, G., Segev, D., Averbuch, A., 2018a. Uncovering unknown unknowns in financial services big data by unsupervised methodologies: Present and future trends, in: Proceedings of Machine Learning Research, KDD 2017 Workshop on Anomaly Detection in Finance, pp. 8–19.
  • Shabat et al. (2018b) Shabat, G., Shmueli, Y., Aizenbud, Y., Averbuch, A., 2018b. Randomized LU decomposition. Applied and Computational Harmonic Analysis 44, 246–272.
  • Shallue and Vanderburg (2018) Shallue, C.J., Vanderburg, A., 2018. Identifying Exoplanets with Deep Learning: A Five-planet Resonant Chain around Kepler-80 and an Eighth Planet around Kepler-90. Astron. J. 155, 94. doi:10.3847/1538-3881/aa9e09, arXiv:1712.05044.
  • Yu et al. (2019) Yu, L., Vanderburg, A., Huang, C., Shallue, C.J., Crossfield, I.J.M., Gaudi, B.S., Daylan, T., Dattilo, A., Armstrong, D.J., Ricker, G.R., Vanderspek, R.K., Latham, D.W., Seager, S., Dittmann, J., Doty, J.P., Glidden, A., Quinn, S.N., 2019. Identifying Exoplanets with Deep Learning. III. Automated Triage and Vetting of TESS Candidates. Astron. J. 158, 25. doi:10.3847/1538-3881/ab21d6, arXiv:1904.02726.
  • Zucker and Giryes (2018) Zucker, S., Giryes, R., 2018. Shallow Transits—Deep Learning. I. Feasibility Study of Deep Learning to Detect Periodic Transits of Exoplanets. Astron. J. 155, 147. doi:10.3847/1538-3881/aaae05, arXiv:1711.03163.