Detection of gravitational-wave signals from binary neutron star mergers using machine learning

by   Marlin B. Schäfer, et al.

As two neutron stars merge, they emit gravitational waves that can potentially be detected by Earth-bound detectors. Matched-filtering-based algorithms have traditionally been used to extract quiet signals embedded in noise. We introduce a novel neural-network-based machine learning algorithm that uses time series strain data from gravitational-wave detectors to detect signals from non-spinning binary neutron star mergers. For the Advanced LIGO design sensitivity, our network has an average sensitive distance of 130 Mpc at a false-alarm rate of 10 per month. Compared to other state-of-the-art machine learning algorithms, we find an improvement by a factor of 6 in sensitivity to signals with a signal-to-noise ratio below 25. However, this approach is not yet competitive with traditional matched-filtering-based methods. A conservative estimate indicates that our algorithm introduces on average 10.2 s of latency between signal arrival and generating an alert. We give an exact description of our testing procedure, which can be applied not only to machine-learning-based algorithms but to all other search algorithms as well. We thereby improve the ability to compare machine learning and classical searches.









I Introduction

The first direct detection of a gravitational-wave (GW) signal on September 14, 2015 Abbott et al. (2016) marked the dawn of gravitational-wave astronomy. During the first two observing runs, the LIGO Scientific Collaboration and Virgo Collaboration found 11 GWs Abbott et al. (2018) from coalescing compact binary systems. Two independent reanalyses of the data have discovered a further set of events, 3 of which are found to be of astrophysical origin by both studies Venumadhav et al. (2019); Nitz et al. (2019a, b). The third observing run has identified tens of new GW candidate events and has so far reported two new GW detections The LIGO Scientific Collaboration and the Virgo Collaboration (2020); Abbott et al. (2020). With detector sensitivity improving further for future observing runs and KAGRA Aso et al. (2013); Akutsu et al. (2020) joining the detector network, the rate of detections is expected to grow The LIGO Scientific Collaboration (2019).

The most sensitive low-latency searches are tailored specifically to signals from coalescing compact binaries and use a fixed number of pre-calculated templates The LIGO Scientific Collaboration and Virgo Collaboration (2018). Each template is a unique combination of a waveform model and source parameters. These searches work by calculating an inner product between the data and every template to produce a signal-to-noise ratio (SNR) time series. This process is known as matched filtering and is mathematically proven to be optimal for finding signals submerged in stationary, Gaussian noise Maggiore (2008).

If the SNR of a candidate exceeds a pre-selected threshold and the candidate is not excluded by other factors, such as poor data quality or an implausible time-of-arrival difference between two different detectors, the low-latency search algorithms return a candidate event Sachdev et al. (2019); Adams et al. (2016); Nitz et al. (2018); Hooper et al. (2012).

The computational cost of a matched-filter search scales linearly with the number of templates used. This number will grow as planned updates The LIGO Scientific Collaboration (2019) improve detector sensitivity at low frequencies Canton and Harry (2017). If currently neglected effects such as precession Harry et al. (2016); Hannam et al. (2014); Khan et al. (2019); Pan et al. (2014), higher-order modes Harry et al. (2018); London et al. (2018); Khan et al. (2020); Cotesta et al. (2018) or eccentricity Nitz et al. (2020) are taken into account, even more templates will be required. More computationally efficient algorithms would enable searches for sources which cannot currently be targeted due to a fixed computational budget.

When detecting GWs from compact binary systems that contain at least one neutron star, the latency of the detection pipeline is critical, as these systems may produce electromagnetic (EM) signals. To detect these EM counterparts and maximize observation time, observatories need to be notified of possible events quickly. The number of false alarms, on the other hand, should be minimized, as telescope time is expensive. Current low-latency searches introduce a latency of seconds and operate at a false-alarm rate (FAR) of 1 per 2 months The LIGO Scientific Collaboration and Virgo Collaboration (2018); Sachdev et al. (2019); Adams et al. (2016); Nitz et al. (2018); Hooper et al. (2012). Any new search needs to meet or exceed these standards to be considered for production use.

Neural network (NN) based machine learning algorithms are an interesting alternative to traditional search algorithms, as they have shown great improvements in many tasks such as image recognition Russakovsky et al. (2015), sound generation Oord et al. (2016), and certain board and computer games Silver et al. (2016); ope (2018). NNs have also already found some application in the context of GW data analysis Bahaadini et al. (2018); Dreissigacker et al. (2019); Wei and Huerta (2019); Cuoco et al. (2020); Green et al. (2020); Gabbard et al. (2019); Marulanda et al. (2020); Chan et al. (2019); Iess et al. (2020). A few notable examples are the classification of non-Gaussian noise transients Bahaadini et al. (2018), the search for continuous GWs Dreissigacker et al. (2019), and the denoising of detector data to recover injected signals Wei and Huerta (2019). One key advantage of NNs is their computational efficiency once trained. Most of the computational cost is shifted to the training stage, resulting in very quick evaluation. The application of NNs to GW searches might therefore offer a way to reduce the computational cost of low-latency searches.

The authors of George and Huerta (2018a); Gabbard et al. (2018) were the first to directly apply deep NNs to time series strain data to detect GWs from binary black hole (BBH) mergers. They tested the sensitivity of these searches at estimated false-alarm rates per month.[1] Both analyses are able to closely reproduce the performance of a matched-filter search at these false-alarm rates at a fraction of the computational cost. The NNs excel at high false-alarm rates and low SNR. Both networks detected all signals with SNR larger than 10 at the estimated false-alarm rates per month. These results are a promising first step, but the algorithms would need to be tested at the required false-alarm rates of 1 per 2 months on real detector data to demonstrate an improvement over established methods.

[1] We estimate this false-alarm rate by multiplying the false-alarm probabilities given in George and Huerta (2018a); Gabbard et al. (2018) by the respective number of samples times half the duration by which the position of the peak amplitude is varied within the training data-samples.

Starting from their network, we reshaped the architecture significantly to optimize it for detecting signals from binary neutron star (BNS) mergers. Our network-based search estimates the SNR and a quantity we call the p-score for the given input data. The p-score is a measure of how likely the data is to contain a GW signal; it is explicitly not a probability. The network is trained on simulated data of non-spinning binary neutron star systems with masses between and solar masses, isotropically distributed over the sky. All noise is stationary and Gaussian and as such does not contain any transients or other contaminations that are present in real detector data Abbott et al. (2016); Nuttall (2018); Cabero et al. (2019). The previous works George and Huerta (2018a); Gabbard et al. (2018) used data from a single detector. To improve the performance of our search, we expand the algorithm to work with data from two detectors. Using multiple detectors may also enable real-time estimates of the sky position in the future.

Detecting BNS signals using an NN is inherently more difficult than finding a BBH signal, as (1) the GW of a BNS reaches higher frequencies and (2) spends more time in the sensitive bands of the detectors. Due to (1), the data needs to be sampled at a high rate. Combined with (2), this leads to a massive increase in the amount of data that needs to be analyzed. As NNs tend to be difficult to optimize when the input data has many samples, it is not feasible to naïvely use the full time series sampled at a single rate as input. To solve this problem, we sample different parts of the signal at different rates. Frequencies emitted during the early inspiral are low and evolve slowly (see Figure 1). High sample rates are only necessary during the final few cycles, where frequencies are high and grow rapidly.

The false-alarm rates probed by George and Huerta (2018a); Gabbard et al. (2018) are orders of magnitude larger than what is required for low-latency pipelines. Additionally, these false-alarm rates were estimated on a discrete set of samples which either contain a signal or consist of pure noise. The waveforms within these samples are always aligned in a similar way, and no signal is contained only partially in the analyzed segment. As the authors of Gebhard et al. (2019) point out, false-alarm rates estimated on a discrete set of samples may for these reasons not be representative of a realistic search, which has to work with a continuous stream of data.

We propose a standardized way of evaluating NN false-alarm rates and sensitivities. To calculate these metrics, we generate a long stretch of continuous time series data which contains many injected GWs that are roughly separated by the average duration of a BNS signal. Our network is applied to this data and points of interest are clustered into events. All results we provide are derived from the analysis of about 100 days of simulated continuous data. We test the network down to false-alarm rates of per month and find sensitive distances of down to false-alarm rates of per month.

We compare our search to the currently in-use low-latency detection pipeline PyCBC Live Nitz et al. (2018) and to the results given by the authors of Krastev (2019), who were the first to classify BNS signals with a machine learning algorithm. We find an improvement in sensitivity of close to % for BNS signals with SNR below over the previous state-of-the-art machine learning algorithm. This makes our algorithm the best machine learning algorithm for detecting BNS signals at low SNRs. We are, however, not yet able to match the performance of template-based searches. To do so, we would need to either increase the sensitive radius of our search at the lowest false-alarm rates by a factor of or double the sensitive radius while lowering the false-alarm rate by an order of magnitude.

The trained network is public and can be found in the associated data release Schäfer et al. (2020). At the same location we also provide example code of how to apply it to long stretches of data and a function that generates injections as they are used to derive false-alarm rates and sensitivities in this work.

The contents of this paper are structured as follows: Section II describes how search algorithms should be evaluated. It gives the general concepts in the first part and details on how to apply these concepts to NNs in the second part. Section III explains the multi-rate sampling and describes the data used for both training and validation of the network. The following section IV gives an overview of the architecture and how this architecture is trained and tested. We present our results in section V, from which we draw conclusions in section VI.

II False-alarm rate and sensitivity of gravitational-wave search algorithms

There are two important metrics that have been used to evaluate gravitational-wave searches in the past: the false-alarm rate of the search and the corresponding sensitivity Usman et al. (2016a). In principle, these metrics can be applied directly to GW searches that utilize NNs. As pointed out by the authors of Gebhard et al. (2019), in practice the discrete nature of the data that is used to train these networks has led to some divergence between the terminology used for NN and traditional search algorithms.

II.1 Calculation for general search algorithms

The main goal of a search algorithm is to detect GW signals in real data, where the input is a nearly continuous strain time series. A search must therefore produce a list of times of candidate events and rank them by a ranking statistic, a number which signifies how likely the data is to contain a signal. To evaluate the performance of an algorithm, it is applied to mock data containing known injections, i.e. additive GW signals with known parameters. The events generated from this data are compared to the list of injections to determine which injected signals were found, which were missed, and which events are false alarms.

Any event that is reported by the search needs to be assigned a false-alarm rate to express the confidence in its detection. For a given ranking-statistic value, the false-alarm rate is the number of false alarms with at least that ranking statistic per unit time. To estimate it on mock data, the number of false detections exceeding the ranking-statistic threshold is divided by the duration of the analyzed data.
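As a concrete sketch, the false-alarm rate estimate described above can be written in a few lines of Python. The matching window and all names below are our own illustrative choices, not those of the paper's pipeline:

```python
import numpy as np

def false_alarm_rate(event_times, event_stats, injection_times,
                     duration, threshold, match_window=1.0):
    """Estimate the false-alarm rate at a given ranking-statistic threshold.

    An event counts as a false alarm if it is not within `match_window`
    seconds of any injection.  The FAR is the number of false alarms
    above the threshold divided by the analyzed duration.
    """
    event_times = np.asarray(event_times, dtype=float)
    event_stats = np.asarray(event_stats, dtype=float)
    injection_times = np.asarray(injection_times, dtype=float)

    times = event_times[event_stats >= threshold]
    if injection_times.size == 0:
        n_false = times.size
    else:
        # distance from each surviving event to its nearest injection
        dist = np.min(np.abs(times[:, None] - injection_times[None, :]),
                      axis=1)
        n_false = int(np.sum(dist > match_window))
    return n_false / duration
```

Sweeping `threshold` over the observed ranking-statistic values then yields the false-alarm rate as a function of the threshold.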

The ability of the search to recover signals is quantified by the sensitivity, which is a function of the false-alarm rate lower bound. It is often given in terms of the fraction of recovered injections. This fraction, however, strongly depends on the parameter distribution of the injected signals, as the amplitude of the signal in the detector depends on the orientation and location of the source. Thus, the fraction can be diminished by injecting sources at larger distances or unfavorable orientations. A more astrophysically motivated measure of sensitivity is the sensitive volume of the search algorithm. It is an estimate of the volume around the detector from which GW sources will be detectable. This volume may be calculated through

    V(F) = ∫ dx dΛ ε(F; x, Λ) φ(x, Λ),    (1)

where ε(F; x, Λ) is the efficiency of the search pipeline for signals with false-alarm rate F, spatial coordinates x and injection parameters Λ. The function φ(x, Λ) is a probability density function which describes the astrophysical distribution of signals Usman et al. (2016b). When the distribution of injections matches the expected astrophysical distribution (i.e. uniform in volume, isotropic in sky location, etc.), equation (1) can be estimated by

    V(F) ≈ V(d_max) · N_found(F) / N_inj,    (2)

where d_max is the maximal distance of injected signals, V(d_max) is the volume of a sphere with radius d_max, N_inj is the total number of injections, and N_found(F) counts the number of detected injections with a false-alarm rate of at most F.

We use the function volume_montecarlo of the PyCBC software library Nitz et al. (2019c) to carry out this estimation.
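A minimal Python sketch of this Monte Carlo volume estimate follows. It is not PyCBC's `volume_montecarlo` itself, whose interface differs; the function names here are hypothetical, and injections are assumed to be distributed uniformly in volume up to a maximal distance:

```python
import numpy as np

def sensitive_volume(found_far, far_threshold, n_injections, d_max):
    """Monte Carlo estimate of the sensitive volume.

    found_far     : false-alarm rates assigned to recovered injections
    far_threshold : an injection counts as found if its FAR <= threshold
    n_injections  : total number of injections (found and missed)
    d_max         : maximal injection distance (uniform-in-volume assumed)
    """
    n_found = int(np.sum(np.asarray(found_far) <= far_threshold))
    v_max = 4.0 / 3.0 * np.pi * d_max ** 3  # sphere of radius d_max
    return v_max * n_found / n_injections

def sensitive_distance(volume):
    """Radius of a sphere with the given sensitive volume."""
    return (3.0 * volume / (4.0 * np.pi)) ** (1.0 / 3.0)
```

The sensitive distance is simply the radius of the sphere whose volume equals the sensitive volume, which is how a sensitive volume maps to the sensitive distances quoted in this paper.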

Current searches notify astronomers of a gw event when the event is assigned a false-alarm rate of at most 1 per 2 months Nitz et al. (2018). Any new search should hence be tested at least down to these false-alarm rates. To resolve false-alarm rates of that scale at least 2 months of mock data are required.

For our tests we generated 100 files, each containing roughly 1 day of continuous data. Each file contains independently drawn data. For easier multiprocessing, each file is internally split into 22 chunks of duration 4096 seconds. We start by generating a list of injection times, requiring that two injections are separated by 180 to 220 seconds. The exact separation time is chosen uniformly from this interval. To avoid waveforms that are not completely within one chunk, we discard any injections that are within the first or final 256 seconds of each chunk. For every injection time, we generate a waveform using the inspiral-only waveform model TaylorF2 Droz et al. (1999); Blanchet (2002); Faye et al. (2012) with a lower frequency cutoff of 25 Hz. Its parameters are drawn from the distribution specified in Table 1. Finally, the waveform is projected into the frame of the LIGO-Hanford and LIGO-Livingston detectors and added into simulated Gaussian noise such that the peak amplitude is positioned at the injection time. All noise is generated from the analytic estimate of the power spectral density (PSD) of the aLIGO final design sensitivity as provided by the software library LALSuite LIGO Scientific Collaboration (2018).
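The injection-time layout described above can be sketched as follows. This is a simplified illustration; the paper's actual generator may differ in details such as how the first injection time within a chunk is drawn:

```python
import numpy as np

def injection_times(chunk_duration=4096, pad=256, min_sep=180,
                    max_sep=220, seed=0):
    """Draw injection times for one 4096 s chunk of mock data.

    Consecutive injections are separated by a uniform draw from
    [min_sep, max_sep] seconds, and no injection is placed within
    `pad` seconds of either chunk boundary.
    """
    rng = np.random.default_rng(seed)
    times = []
    t = pad + rng.uniform(0, max_sep)   # first injection after the pad
    while t < chunk_duration - pad:
        times.append(t)
        t += rng.uniform(min_sep, max_sep)
    return np.array(times)
```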

parameter            distribution
component masses     uniform
spins                0 (non-spinning)
coalescence phase    uniform
right ascension      uniform

Table 1: The astrophysically motivated distribution of parameters used to generate injections. These are used to estimate the false-alarm rate and sensitivity of the search algorithm presented in this paper.

II.2 Calculation for neural network searches

An NN is a general-purpose function approximator that finds a fit to a set of example input-output pairs. This fitting process is called training, and the example data used is called the training set. Once trained, the network can be applied to data that was not covered by the training set and will evaluate the fitted function at this new point. It does so under the assumption that the unseen data-samples originate from the same underlying process as the training data-samples. The output for any unseen data-sample is thus an interpolation or extrapolation of the example outputs from the training set.

To generate false-alarm rates and sensitivities, previous works Gabbard et al. (2018); George and Huerta (2018a) generated a second set of noise and signal samples with the same duration and sample rate used in the training set. They then applied the network to this second set of data-samples and determined false-alarm rates by counting how many noise samples were classified as signals, and sensitivities by counting how many signal samples were classified as such.

There are two main problems with using a discrete set of data-samples to determine false-alarm rates and sensitivities. The first stems from the structure of the data-samples themselves. To make the training process more efficient, it is necessary to position the peak amplitude of the GW signal within a narrow band of the data-sample. When applied to real data, this property cannot be ensured, and the assumption that the data is similar to the training set is not well approximated. Hence, if a false-alarm rate or sensitivity is calculated on a set where the alignment is guaranteed, it will not necessarily be representative of the performance on realistic data. The second problem is the required fixed duration of input data-samples. Usually, a search algorithm is applied to long stretches of time series data to find potential signals. To evaluate data of greater duration than the accepted input size of the network, the network is applied multiple times via a sliding window. At each position the output gives an estimate of whether a signal is present. If this is the case, it is initially not clear what a true positive is, as the network may predict the presence of a signal for multiple consecutive positions, the input window may only partially contain a signal, or the network may jump between predicting the presence and absence of a signal for multiple subsequent positions Gebhard et al. (2019).

To generate representative false-alarm rates and sensitivities, we propose to use mock data of much greater duration than the input duration of the network. The network is then applied to this data by sliding it across. The step size should be at most half the size of the interval within which the peak amplitudes of the waveforms occur in the training set. This step size ensures that any waveform is positioned correctly for at least one position of the network.
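The sliding-window application can be sketched generically. The names and the index-based interface below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sliding_windows(strain, window_size, step):
    """Yield (start index, window) pairs for applying a fixed-input-size
    network across a long strain time series via a sliding window.

    `step` should be at most half the width of the interval in which
    the training waveforms' peak amplitudes are placed, so that every
    signal is correctly aligned for at least one window position.
    """
    n_windows = (len(strain) - window_size) // step + 1
    for i in range(n_windows):
        start = i * step
        yield start, strain[start:start + window_size]
```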

If the output of the network is not binary but continuous, it can be interpreted as a ranking statistic. Candidate events are then identified in the resulting time series by applying a threshold and clustering the positions that exceed it.

Each event is assigned a specific time and ranking statistic. The resulting list of events is compared to the list of known injections as described in II.1 to calculate false-alarm rates and sensitivities. The specifics of our analysis and clustering algorithm are described in IV.3.
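The thresholding-and-clustering step can be sketched generically as below. The clustering window and the rule of reporting each cluster at its loudest sample are our own illustrative assumptions; the paper's actual procedure is deferred to IV.3:

```python
import numpy as np

def cluster_events(times, stats, threshold, cluster_window=1.0):
    """Threshold a ranking-statistic time series and cluster nearby
    crossings into events.

    Above-threshold samples closer than `cluster_window` seconds are
    merged; each event is reported at the time of its loudest sample,
    with that sample's ranking statistic.
    """
    times = np.asarray(times, dtype=float)
    stats = np.asarray(stats, dtype=float)
    events = []
    cluster_t, cluster_s = [], []
    for t, s in zip(times, stats):
        if s < threshold:
            continue
        if cluster_t and t - cluster_t[-1] > cluster_window:
            # gap too large: close the current cluster as one event
            i = int(np.argmax(cluster_s))
            events.append((cluster_t[i], cluster_s[i]))
            cluster_t, cluster_s = [], []
        cluster_t.append(t)
        cluster_s.append(s)
    if cluster_t:
        i = int(np.argmax(cluster_s))
        events.append((cluster_t[i], cluster_s[i]))
    return events
```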

As every event needs to be assigned a ranking statistic, we can calculate the metrics by only using events that exceed a given threshold. Doing so for many different values allows us to obtain the false-alarm rate as a function of the ranking statistic threshold and subsequently also the sensitivity as a function of the false-alarm rate.

We found that testing our NN on long stretches of data and applying clustering to obtain a list of events increased the false-alarm rate, compared to false-alarm rates measured on a set of discrete samples at the same detection threshold, by at least a factor of 2. To give comparable statistics, we therefore strongly recommend testing networks in the way described above.

III Data processing

To train and evaluate the network, three disjoint data sets with known contents are required. The first is the training set, used to optimize the parameters of the network. The network is usually trained on the training set multiple times, where each complete pass is called an epoch. After multiple epochs, the network may start to learn specifics of the provided data-samples rather than their general structure. This behavior is called overfitting and can be detected by monitoring the performance of the network on a validation set. Different epochs are rated by their performance on this second set. Ranking the different training stages of the network in this way introduces a selection bias and optimizes the network on the validation set. To give unbiased results, a testing set is required. It should represent the data that the network will be applied to most closely and should ideally be generated independently from the training and validation sets. To keep results as unbiased as possible, the testing set should ideally only be analyzed once. This section describes how an individual data-sample needs to be formatted for our network and how the training and validation sets are generated. Details on the testing set are described in subsection II.1.

III.1 Input data preparation

Previous works George and Huerta (2018a); Gabbard et al. (2018); George and Huerta (2018b) have already successfully classified whitened time series data for BBH signals with a simple convolutional neural network. As input they used 1 second of data sampled at . For BBH signals this is a sensible choice for the duration of analyzed data, as these signals sweep through the sensitive frequency range of the detectors in seconds, and the chosen sample rate is sufficient to resolve them. Signals from binary neutron star mergers, on the other hand, spend seconds in the sensitive band of the detectors. Using the usual signal duration as input to the network would lead to a hundredfold increase in data-points over the BBH case. Training an NN with this many input samples is infeasible due to memory constraints and optimization problems.

To reduce the number of input samples, the authors of Krastev (2019) use only the final 2 seconds of each signal as input. This enables them not only to differentiate between noise and BNS signals, but also to distinguish GWs from BBH mergers. They test an architecture, i.e. a network structure, similar to those of George and Huerta (2018a); Gabbard et al. (2018) and are able to closely reproduce their results for BBH data. Their sensitivity to BNS signals looks very promising, but the search has yet to be tested on signals of realistic strength or at realistic false-alarm rates. For further discussion of the strength of the signals used in their work, see section V. The short duration of the input, in comparison to the duration a BNS signal spends inside the sensitive band of the detectors, reduces the SNR contained in the data by about and thus limits the sensitivity of the search.

To retain as much SNR in the data as possible while at the same time reducing the input size to the network, we sample 32 seconds of data containing a potential signal at different rates. During the early inspiral, frequencies are low and the frequency evolution is slow. This allows us to sample a long stretch at the beginning at a low rate. The first 16 seconds are sampled at , the following 8 seconds are sampled at , and so on. The pattern continues until the final second, which is sampled at but split into two parts of equal length. We ensure that no two sections overlap to reduce redundant information. This method of re-sampling generates 7 parts containing 2048 samples each and ensures that a sufficient sample rate is used for every part of the signal while the number of samples is reduced by a factor of 9 (see Figure 1).[2]

[2] Notice that this way of sampling the data differs from our previous work Schäfer (2019) in that we dropped the lowest sample rate. We found that for some signals a sample rate of for the first 32 seconds of a 64-second signal was not sufficient to resolve the highest frequencies during that stage of the binary evolution and introduced unwanted artifacts. We also sampled the PSD used for whitening the data too coarsely in our previous work Schäfer (2019), and signals were more difficult to find as a result.
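The multi-rate layout can be sketched as below. We assume an input segment at 4096 Hz and decimate by naive slicing; the per-part rates then follow from requiring 2048 samples per part, but the paper's exact rates are not stated here, and a production version would use a proper anti-aliasing resampler:

```python
import numpy as np

def multirate_parts(strain, sample_rate=4096):
    """Split a 32 s whitened strain segment into 7 parts of 2048 samples.

    Part durations halve from 16 s down to 1 s, with the final second
    split into two 0.5 s parts kept at the full rate.  Each part is
    decimated by slicing so that it contains exactly 2048 samples.
    """
    assert len(strain) == 32 * sample_rate
    durations = [16, 8, 4, 2, 1, 0.5, 0.5]   # sums to 32 s, no overlap
    parts, start = [], 0
    for d in durations:
        n = int(d * sample_rate)
        segment = strain[start:start + n]
        parts.append(segment[:: n // 2048])  # decimate to 2048 samples
        start += n
    return parts
```

With these durations, 7 × 2048 = 14336 samples replace the 32 × 4096 = 131072 samples of the single-rate segment, the factor-of-9 reduction quoted above.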

Rather than re-sampling the data directly, we first whiten it using the analytic model aLIGOZeroDetHighPower for the aLIGO design sensitivity PSD as provided by the software library LALSuite LIGO Scientific Collaboration (2018). Whitening is a procedure where every frequency bin is reweighted by the average power of the background noise in that bin. It ensures that power in any frequency bin in excess of unity is an indication of the presence of a signal. For computational efficiency during training, noise and signal samples are whitened individually. Since the whitening procedure is a linear operation, whitening the sum is equivalent to whitening both parts individually. Both parts are combined at runtime on the first layer of the NN. The reason to store them separately is an increase in the effective number of samples, which can be achieved by mixing and matching different signals and noise samples. It also helps to improve the efficiency of training by using the same signal template submerged in different realizations of noise.
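A simplified frequency-domain whitening sketch is shown below. Normalization conventions and the handling of filter wrap-around are omitted, and the PSD is passed as a callable for illustration. The linearity property that justifies whitening signal and noise separately is visible directly in the code:

```python
import numpy as np

def whiten(strain, psd, sample_rate):
    """Whiten a strain time series by a one-sided PSD.

    Each frequency bin is divided by the PSD amplitude in that bin, so
    that stationary background noise becomes white.  Because this is a
    linear operation, whiten(signal + noise) equals
    whiten(signal) + whiten(noise), which is why signal and noise
    samples can be whitened separately and summed inside the network.
    """
    n = len(strain)
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    spectrum = np.fft.rfft(strain)
    spectrum /= np.sqrt(psd(freqs))   # reweight by noise amplitude
    return np.fft.irfft(spectrum, n=n)
```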

When evaluating real samples, we cannot trivially separate the signal from the background and thus cannot whiten each part individually. Instead we whiten the total signal by the same analytic model of the psd used for the training and validation data. The whitened data is re-sampled and used as the signal input. As the signal input already contains the total signal including noise and the network only sees the sum of both inputs, the noise input is set to zero.

Figure 1: The top panel shows the strain evolution of an example GW from a BNS merger in black. The bottom panel shows the corresponding frequency evolution in black. The colored boxes represent parts of the signal which we sample at different rates. The height of these boxes in the bottom panel represents the Nyquist frequency of the sample rate used for each part. To fully resolve the signal, the black curve must stay inside the colored boxes of the bottom panel at all times.

III.2 Generating the training and validation sets

All signals for the training and validation sets are generated using the inspiral-only waveform model TaylorF2 from the software library LALSuite LIGO Scientific Collaboration (2018), with all parameters but the distance drawn from the distribution given in Table 1. The luminosity distance is set indirectly by uniformly drawing a target network SNR from the interval . The waveform is first computed at a fiducial distance of with a low-frequency cutoff of . Then the waveform is projected onto the two detectors Hanford and Livingston Schutz (2011) and cropped to a length of 96 seconds. During this step, we shift the waveform such that the peak amplitude occurs within the final to seconds of the 96-second data-segment. The exact position within this interval is drawn uniformly. Next, we calculate the network SNR by taking the root of the sum of the squares of the inner products of the waveforms with themselves Schutz (2011), weighting each frequency by the analytic PSD aLIGOZeroDetHighPower of LIGO Scientific Collaboration (2018). The waveforms are finally scaled by multiplying by the target network SNR and dividing by the network SNR at distance . Afterwards the data is whitened, the initial and final seconds are discarded to avoid filter wrap-around errors, and the last seconds of the remaining segment are re-sampled as described in III.1. Noise samples are simulated from the analytic PSD used above, whitened, and re-sampled. As such, all noise is stationary and Gaussian.
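The network-SNR rescaling described above can be sketched as follows. The discrete optimal-SNR formula sqrt(4 Δf Σ |h(f)|² / Sₙ(f)) is standard; the function names and the array-based PSD interface are our own illustrative choices:

```python
import numpy as np

def matched_filter_snr(freq_signal, psd, df):
    """Optimal SNR of a frequency-domain waveform against a one-sided
    PSD: sqrt(4 * df * sum |h(f)|^2 / S_n(f))."""
    return np.sqrt(4.0 * df * np.sum(np.abs(freq_signal) ** 2 / psd))

def rescale_to_network_snr(detector_signals, psd, df, target_snr):
    """Scale the waveforms in all detectors by one common factor so the
    network SNR (root-sum-square of the single-detector SNRs) equals
    `target_snr`.  For an inspiral, scaling the amplitude is equivalent
    to changing the luminosity distance."""
    snrs = [matched_filter_snr(h, psd, df) for h in detector_signals]
    network_snr = np.sqrt(np.sum(np.square(snrs)))
    factor = target_snr / network_snr
    return [h * factor for h in detector_signals], network_snr
```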

The training set contains 25,000 different GW signals and 80,000 instances of noise. When training the network we pre-select 800,000 unique combinations of signal and noise at random and shuffle them with all 80,000 noise samples to obtain 880,000 training samples with a 10:1 split between signals and noise. To compensate for this inequality we apply sample weights of to all signal samples during training. Sample weights modify the loss by reweighting contributions from the corresponding sample.

The validation set contains 1,500 different GW signals and 8,000 instances of noise. We again generate 24,000 unique combinations of signal and noise and shuffle them with the 8,000 noise samples. This results in a validation set that contains 32,000 samples with a 3:1 split between signals and noise.
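The mix-and-match construction of these sets can be sketched as index bookkeeping. This is a simplified illustration: unlike the paper's "unique combinations", the sketch below draws pairs with replacement, and the weight rule (down-weighting the abundant signal class so both classes contribute equally to the loss) is our reading of the sample-weight scheme:

```python
import numpy as np

def mix_dataset(n_signals, n_noise, n_combinations, seed=0):
    """Build shuffled (signal_index, noise_index) pairs plus pure-noise
    samples, as in the 880,000-sample training set (800,000 signal+noise
    combinations shuffled with 80,000 noise instances).

    A signal index of -1 marks a pure-noise sample.  Returns the index
    pairs and per-sample weights that down-weight the abundant
    signal class.
    """
    rng = np.random.default_rng(seed)
    sig = rng.integers(0, n_signals, size=n_combinations)
    noi = rng.integers(0, n_noise, size=n_combinations)
    pairs = np.concatenate([np.stack([sig, noi], axis=1),
                            np.stack([np.full(n_noise, -1),
                                      np.arange(n_noise)], axis=1)])
    rng.shuffle(pairs, axis=0)
    weights = np.where(pairs[:, 0] >= 0, n_noise / n_combinations, 1.0)
    return pairs, weights
```

With the paper's numbers, the signal weight comes out to 80,000 / 800,000 = 0.1, so both classes contribute equally to the total loss.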

IV The search algorithm

IV.1 Neural network architecture

When working with neural networks, the details of the implementation of the machine learning algorithm are mostly defined by the architecture of the network. There is no known optimal procedure for designing a network that works well for a given problem.

The architecture presented in this paper is highly optimized for the problem of detecting bns signals and relies on the input data format described in III.1. Some of the main improvements over a standard convolutional architecture will be more general and may be of use for different data formats and similar problems.

We started our research by adjusting the architecture given in George and Huerta (2018a); Gabbard et al. (2018) for data sampled at multiple rates, using one channel for every combination of sample rate and detector. In convolutional networks, channels represent different features of the data and are correlated by the convolutional filters. With this as a starting point, we made iterative improvements. The three changes that had the greatest positive effect were the replacement of convolutional layers by inception modules Szegedy et al. (2015), the use of a temporal convolutional network (TCN) as a signal amplifier Bai et al. (2018); Schmitt et al. (2019); Wei and Huerta (2019), and the use of different stacks of inception modules for each sample rate. A detailed description of the evolution of the network can be found in Schäfer (2019). The architecture presented here differs from the earlier iteration presented in Schäfer (2019) only by removing the lowest sample rate as input and adjusting the structure accordingly.

For computational efficiency, we provide the noise and signal time series not as a sum but as separate inputs to the network. They are combined on the first layer of each parallel stack of layers (see Figure 2). This sum is passed to a tcn which tries to recover the pure signal. The denoised data is added back onto the input of the tcn to amplify potential signals. The amplified data is preprocessed by convolutional layers before two inception modules with very small kernel sizes are applied. These inception modules are the main building block of the network. Afterwards, two adjacent stacks are concatenated and used as input to further inception modules. The process is repeated until only a single stack is left. This stack is reduced down to the desired output format by applying dense layers. The final outputs are one scalar for the snr estimate and a tuple of length 2 estimating the p-score. The p-score measures how confident the network is that the data contains a gw; the corresponding output is formatted as a tuple of length 2 for technical reasons. We interpret both of these outputs as ranking statistics. A high-level overview of the architecture can be found in Figure 2.

Alongside the two outputs described above, the network is equipped with 13 further auxiliary outputs. The purpose of these outputs is to prevent vanishing gradients Szegedy et al. (2015) or to provide the intermediate layers with more information on their purpose. The auxiliary outputs thus improve the training efficiency of the network. Seven of the auxiliary outputs are the outputs of the tcns. They are trained using the pure signals as targets. We found that the network is significantly more sensitive if it is not free to use the parameters of the tcns arbitrarily but is instead forced to learn to recover the gw. Five of the remaining six outputs are taken after all but the final concatenation layer. They receive the injected snr as target. Since the output of the concatenation layers is not a scalar, we use a few pooling and dimensional-reduction layers Lin et al. (2013); Schäfer (2019) to reduce the output shape. The final auxiliary output is taken after the first two inception modules of the lowest sample rate and treated in the same way as the auxiliary outputs after the concatenation layers. The network is trained as a complete unit and the loss takes all outputs into account.

The complexity of this architecture comes at the cost of memory size and speed. The model has million trainable parameters. The computational cost is a problem when optimizing the architecture as it is costly to compare two different designs. We therefore suspect that the details of this architecture can be further optimized.

Figure 2: A high level overview of the architecture presented in this work. Details on every block can be found in Schäfer (2019). The network takes signal and noise inputs to , where each number corresponds to a different part of the re-sampled raw data described in III.1. It outputs an estimate of the snr contained in the input and a p-score, which rates how likely the data is to contain a bns signal. All auxiliary outputs that are only used for training are not shown. The network adds the noise and signal input for every re-sampled part individually and the remaining layers operate only on this sum. The output of this addition is amplified by a tcn and processed by an inception network. Afterwards the outputs of two inception networks from adjacent sample rates are concatenated and further analyzed by another inception network. The parallel inception networks are concatenated until only a single one remains. A few final dense layers, which are summarized as the post-processing block, are applied to reduce the output shape to the desired dimensions of the snr estimate and p-score. The pre-processing block is inspired by Szegedy et al. (2015) and contains a small convolutional network.

IV.2 Training

The network requires of memory to be trained with a mini-batch size of 32. We used an NVIDIA V100 gpu with of video memory for training. On this hardware, each training epoch plus the subsequent validation step takes roughly 5 hours to complete. Due to time constraints and instabilities during training, the network is trained for only 24 epochs. The instabilities are discussed below and manifest as a sudden drop in sensitivity during training.

The total loss of the network is the weighted sum of the individual losses for each of the outputs discussed in IV.1. All auxiliary outputs are assigned a mean squared error as their individual loss and given a weight w_aux. The same loss function is used for the snr output, but it receives a weight of 1. Finally, the p-score output uses the categorical cross-entropy and a weight of 0.5. The total loss is thus given by

L_total = w_aux · Σ_i MSE(aux_target,i, aux_pred,i) + MSE(snr_target, snr_pred) + 0.5 · CE(p_target, p_pred),

where MSE is the mean squared error, CE is the categorical cross-entropy, a subscript "target" indicates the known target values, a subscript "pred" indicates the network output, the aux_i are the auxiliary outputs and w_aux is the weight assigned to them. The different weights are used to inform the optimization algorithm of the relative importance of each individual output. The auxiliary outputs are only used during training and are discarded during inference. Their value is unimportant as long as using them improves the performance of the snr and p-score outputs. We use the default implementation of the "Adam" optimizer from the machine learning library Keras Chollet et al. (2019) to train the entire network using the total loss. Altering the initial learning rate in either direction reduced the sensitivity of the network.
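The weighted multi-output loss described above can be sketched in plain NumPy. The value of the auxiliary weight `w_aux` is a placeholder (the paper's value is not reproduced here), and the function names are ours.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error between target and prediction.
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # Categorical cross-entropy; predictions are clipped to avoid log(0).
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)
    return -np.mean(np.sum(np.asarray(y_true) * np.log(y_pred), axis=-1))

def total_loss(aux_true, aux_pred, snr_true, snr_pred,
               p_true, p_pred, w_aux=0.1):
    # Weighted sum of per-output losses: MSE with weight w_aux for each
    # auxiliary output (w_aux=0.1 is a placeholder assumption), MSE with
    # weight 1 for the snr output and categorical cross-entropy with
    # weight 0.5 for the p-score output.
    loss_aux = sum(mse(t, p) for t, p in zip(aux_true, aux_pred))
    return (w_aux * loss_aux
            + 1.0 * mse(snr_true, snr_pred)
            + 0.5 * categorical_cross_entropy(p_true, p_pred))
```

In a Keras-style framework, the same effect is achieved by passing one loss per output together with per-output loss weights, so that a single optimizer step minimizes the weighted sum.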

During training we monitor an estimate of the sensitivity of our network. To do so, we calculate the true positive rate on the validation set by choosing the maximum predicted snr and p-score values over all noise samples from the validation set as thresholds. All signals that are assigned a ranking statistic higher than these thresholds are counted as detected. The number of detected signals is then divided by the total number of signal samples to obtain a true positive rate. We rank the epochs based on this estimate of the sensitivity and thoroughly test the best one.
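This sensitivity estimate amounts to a few lines of code; a minimal sketch (the function name is ours):

```python
import numpy as np

def estimated_true_positive_rate(noise_scores, signal_scores):
    # The loudest prediction on any noise sample sets the threshold, so
    # this estimates the sensitivity at the lowest false-alarm rate
    # resolvable on the validation set. Applied separately to the snr
    # and p-score outputs.
    threshold = np.max(noise_scores)
    return np.mean(np.asarray(signal_scores) > threshold)
```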

We found that the network experienced strong overfitting. While the training loss fell by 25% from the initial to the last epoch, the validation loss doubled. If the loss and the sensitivity were strongly correlated, one would expect the sensitivity to drop with an increasing validation loss. We find the opposite and reach the highest true-positive rate of % on epoch 21. At this point the validation loss had grown by 75% over the initial epoch. The loss in use is therefore at best loosely correlated with the sensitivity. Designing a loss function that is better suited to the problem might improve the search further. The strong overfitting also indicates that the architecture could be simplified significantly without a severe loss of sensitivity, or that the performance of the current architecture could be improved by significantly increasing the size of the training set.

When training networks that predict if a gw signal is present in some time series data, we found that after some number of epochs the sensitivity estimate drops to zero for both the snr and the p-score output. Initially, it often recovers on the next epoch, but the drops become more frequent. After some point the network does not seem to recover at all and the estimated sensitivity stays at zero. This behavior is caused by noise samples that are estimated with very high confidence to contain a gw. These are sometimes assigned physically nonsensical snr values. The number of these misclassified noise samples is low and thus the impact on the total loss is negligible. Furthermore, the values that are given for these noise samples grow over time, which is why the drop occurs only after training for a while. In principle, these outliers may be vetoed by their snr value at the cost of some sensitivity at low false-alarm rates. We disfavor this approach as it introduces artificial constraints on the search algorithm. It is currently unknown what causes the predicted values of the ranking statistics to grow or how the issue can be resolved. To avoid problems, we stop training before the sensitivity estimate stays at zero for many consecutive epochs.

IV.3 Testing on binary neutron star injections

To evaluate the sensitivity of the network, we use the test data described in II.1. It contains 8,794,112 seconds (about 101 days) of data split into 100 files. With this data set, false-alarm rates down to false alarms per month can be resolved.
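The lowest resolvable false-alarm rate follows directly from the duration of the test set, assuming it corresponds to one false alarm over the whole analyzed stretch and a 30-day month; a minimal sketch:

```python
def lowest_resolvable_far_per_month(duration_seconds):
    # One false alarm over the entire analyzed stretch is the smallest
    # rate the test set can resolve (assumption: a month is 30 days).
    seconds_per_month = 30 * 24 * 3600
    return seconds_per_month / duration_seconds
```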

To analyze continuous stretches of data, we use a sliding window of duration 72 seconds with a step size of seconds. We chose the step size based on the training set in which the exact position of the peak amplitude was varied by seconds around a central position.

The content of every window is whitened by the analytic model psd of the Advanced LIGO detectors as provided by the software library LALSuite Collaboration (2018). To avoid filter wrap-around errors, the initial and final 4 seconds are discarded. The final 32 seconds of the remaining data are re-sampled and formatted as described in III.1.

To assign the correct times to each window, the alignment of the waveforms in the training set needs to be considered. The central position for the peak amplitude in the training set is set to seconds from the end. If the merger time is defined as the time of the peak amplitude of the waveform it will on average be positioned seconds from the first sample of the 32 second input window. Considering the seconds that are discarded at the beginning of each window, the first position of a gw merger we are sensitive to is located at seconds from the start of each continuous segment. The reported sample times are therefore


where is the time of the sample and is the starting time of the analyzed data segment.

By applying our network in this way, we obtain two time series. One estimates the snr at each window position while the other gives a p-score at every step. We apply a fixed threshold of snr and p-score to the respective time series. Every position that exceeds the corresponding threshold is marked. All marked positions are then clustered by assuming that two marked positions are generated by the same underlying process if they are within 1 second of each other. The clusters are expanded until there are no marked positions within 1 second of the boundaries of the cluster. Each cluster is an event and is assigned the time and value of the maximum snr or p-score inside this cluster. An event is counted as a true positive if an injection was placed within seconds of the reported time. The times used for clustering and for accepting a signal as a true positive are otherwise arbitrary choices that were empirically found to work well on a different data set.
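The clustering step can be sketched as follows. This is an illustrative implementation of the rule stated above (positions within 1 second belong to the same cluster; each cluster becomes one event carrying its maximum statistic); function names are ours.

```python
def cluster_marked_times(times, max_gap=1.0):
    # Group threshold-crossing times into clusters: a time joins the
    # current cluster if it is within `max_gap` seconds of the cluster's
    # latest member, otherwise it starts a new cluster.
    clusters = []
    for t in sorted(times):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters

def events_from_clusters(clusters, values):
    # Each cluster becomes one event, assigned the time and value of the
    # maximum ranking statistic inside the cluster. `values` maps a
    # marked time to its statistic (snr or p-score).
    events = []
    for c in clusters:
        t_max = max(c, key=lambda t: values[t])
        events.append((t_max, values[t_max]))
    return events
```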

V Results

V.1 False-alarm rate and sensitivity

The analysis of the bns test data described in IV.3 returns a list of events. Each event is assigned a ranking statistic. We obtain the false-alarm rate as a function of the ranking statistics snr and p-score (Figure 3) by considering only those events that exceed the given threshold. Subsequently we can generate the sensitivity as a function of the false-alarm rate (Figure 4). We choose a range of SNR 4 to 20 and p-score 0.1 to 1 to generate these plots.
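Obtaining the false-alarm rate as a function of the threshold amounts to counting noise events above each threshold and converting the count to a monthly rate; a minimal sketch under the assumption of a 30-day month (function name ours):

```python
import numpy as np

def false_alarm_rate_curve(noise_event_stats, thresholds, duration_seconds):
    # False alarms per month as a function of the ranking-statistic
    # threshold: count events (from noise-only data) that exceed each
    # threshold and divide by the analyzed duration in months.
    stats = np.asarray(noise_event_stats)
    seconds_per_month = 30 * 24 * 3600
    months = duration_seconds / seconds_per_month
    return np.array([(stats > thr).sum() / months for thr in thresholds])
```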

We find that the snr estimate is able to resolve false-alarm rates down to per month, whereas the p-score output is able to resolve false-alarm rates down to per month. Both curves drop steeply with the corresponding ranking statistic until they reach a false-alarm rate . At this point both curves level off significantly. Our previous work Schäfer (2019) was able to resolve false-alarm rates down to per month and was tested on a set of roughly half the duration used in this paper. We also observed a change in gradient of the false-alarm rate in Schäfer (2019) although at smaller ranking statistics. For the snr output, this change lined up well with the lower limit of the snr contained in the training samples. This may be a hint that the network presented in Schäfer (2019) successfully learned the lower bound on the snr in the training set.

For high false-alarm rates, both outputs show equivalent sensitivities. At low false-alarm rates, on the other hand, the snr output is more sensitive and has non-negligible sensitivities down to a false-alarm rate of 10 per month, where it reaches a sensitive radius of . The sensitivity of the p-score output becomes negligible around a false-alarm rate of 20 per month and also reaches a sensitive radius of .

In our previous work Schäfer (2019) we observed the opposite behavior with regards to which of the two outputs is more sensitive at low false-alarm rates. We do not know what causes either output to perform better than the other.

We can also observe a change in gradient in the sensitivity curves shown in Figure 4. The locations where the sensitivity suddenly starts to drop steeply line up with the leveling-off of the false-alarm rate observed in Figure 3. At false-alarm rates below this point the sensitivity quickly becomes negligible.

Figure 3: The estimated false-alarm rate as a function of the threshold value used for either output. The bottom x-axis applies only to the false-alarm rate of the snr output. The red line in this plot points out the lowest snr of training samples. In our previous work Schäfer (2019) we found a change in gradient at this position. For the current search this change appears at a higher snr. The top x-axis applies only to the false-alarm rate of the p-score. It is logarithmic and is scaled to give a sense for the distance to .
Figure 4: The sensitive distance as a function of the false-alarm rate. The blue curve shows the sensitive distance when the snr is used to classify events. The yellow curve shows the sensitive distance when the p-score is used. The green curve is generated from the data found in Nitz et al. (2017) by counting all signals at a higher injection snr than the corresponding false-alarm rate. We are able to resolve a small overlap-region between the two different searches but find that the sensitivity of our search drops close to zero for false-alarm rates below 10 per month. At high false-alarm rates both outputs of our network perform equally well, for low false-alarm rates the snr shows superior performance.

V.2 Comparison to PyCBC Live

We compare our search to PyCBC Live Nitz et al. (2017), which is a low-latency analysis and has been used in the second and third observing runs Collaboration and Collaboration (2018); Collaboration et al. (2018); Collaboration and the Virgo Collaboration (2020); Abbott et al. (2020). The green curve in Figure 4 is estimated from Figure 1 in Nitz et al. (2017) on our test set by assuming that all injections with optimal snr are found and all others are missed. Here is the network snr reweighted by signal consistency tests corresponding to a given false-alarm rate. At a false-alarm rate of per month the PyCBC Live search has a sensitive radius of . At a comparable false-alarm rate of 0.6 per month, our search reaches the sensitive radius. At a false-alarm rate of 10 per month, where our search is still marginally sensitive, the radius increases to , which is still about half the radius of the reference value from the PyCBC Live search. To reach this reference value, we would need to operate at a false-alarm rate of per month.

To compare the computational cost of our search to that of PyCBC Live Nitz et al. (2018), we analyze the resources both algorithms require to evaluate incoming data in real time. One pass of our network on the correctly whitened, re-sampled and formatted data takes on an Intel i7-6600U dual-core, 4 thread laptop processor. Neglecting the cost of pre-processing the raw input data, our search would be able to produce estimates of the snr and p-score slightly faster than real time with the above mentioned resources, as each step processes a time span of . We can estimate the number of cpu-cores PyCBC Live would require to run a search for bns signals of the distribution our network was trained on by counting the number of templates the corresponding filter-bank would use. To produce a non-spinning filter-bank with component masses between and solar masses at pn-order we use the pycbc_geom_nonspinbank program from the PyCBC software library Nitz et al. (2019c). The minimum match is set to . With these settings the bank contains templates per detector. The PyCBC Live search is able to evaluate templates per core in real time Nitz et al. (2018). The required templates for a 2 detector search could therefore be evaluated on a single core in real time.

At a false-alarm rate of 10 per month, our search introduces an average latency of seconds for true positives. This value is calculated by taking the difference between the latest position in any cluster at the given false-alarm rate that belongs to a real injection and the reported time of the corresponding event. To ensure that the cluster is complete we add second on top of that latency and another seconds to ensure the network has finished its calculations. We then average over all clusters evaluated that way. Our search algorithm has not yet been optimized for low-latency analysis and we assume that the latency can be reduced by about an order of magnitude by choosing a different clustering algorithm without a large impact on the sensitivity. The reported latency does not take into account any time lost due to whitening, re-sampling or formatting of the raw input data. PyCBC Live for comparison operates at an average latency of . This latency can be reduced to at the cost of doubling the required computational resources Nitz et al. (2018).

V.3 Comparison to another machine learning algorithm

The authors of Krastev (2019) were the first to search for bns signals using a machine learning algorithm. They used a convolutional network very similar in structure to those found in George and Huerta (2018a); Gabbard et al. (2018) to analyze seconds of data sampled at . This setup allowed them to differentiate the three classes ”pure noise”, ”bbh signal” and ”bns signal”. They found that their algorithm is able to distinguish the three classes and is as sensitive to bbh signals as the previous comparable search algorithms George and Huerta (2018a); Gabbard et al. (2018). The sensitivity to bns signals looks as good or in some cases even better than the sensitivity to bbh signals.

All of their results are given in terms of peak signal-to-noise ratio (psnr) instead of optimal or matched-filter snr. To convert between psnr and matched filter snr they quote a factor of , which was derived on bbh data. We calculated this factor on bns data and find a conversion of optimal . Furthermore, they used data from a single detector. Signals detected at snr gain on average a factor of when a network of detectors is used. Our results are compared to the findings of Krastev (2019) by using the conversion .

Figure 5 compares the true positive rate of the search algorithm presented here to the one found in Krastev (2019) at a fixed false-alarm rate of per month. We compute it by fixing the detection thresholds to the corresponding values of and . The injections are then binned by their snr and for each bin the fraction of detected over total injections is calculated. We find that our search does not generalize well to very strong signals. The loudest missed signal at this false-alarm rate was injected with a snr of , which means that our search only reaches sensitivity above snr . The search described in Krastev (2019) is more sensitive to signals above snr and already saturates around snr . For current detectors, on the other hand, most signals are expected to be measured with low snr Schutz (2011). Our search is on average 6 times more sensitive to signals below snr when compared to Krastev (2019) at the same false-alarm rate. Within the snr-range covered by the training set (marked gray in Figure 5), our search is almost 10 times as sensitive. We would expect an increase in sensitivity at high snrs if the range in our training set were expanded to include high-snr signals. The plot also shows the true positive rate of our network at a false-alarm rate of 10 per month, where, within the snr range of the training set, the snr output roughly matches the true positive rate of the algorithm proposed in Krastev (2019) at a 75 times higher false-alarm rate. One can also observe that at a false-alarm rate of 10 per month the p-score output is significantly worse over the entire range of injected signals.
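The binned true positive rate used for this comparison can be sketched as follows (the function name and bin handling are ours; empty bins are returned as NaN here rather than interpolated):

```python
import numpy as np

def binned_true_positive_rate(injection_snrs, detected, bin_edges):
    # Fraction of detected injections per injected-SNR bin. `detected`
    # marks injections recovered above the fixed detection threshold.
    snrs = np.asarray(injection_snrs)
    detected = np.asarray(detected, dtype=bool)
    rates = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (snrs >= lo) & (snrs < hi)
        rates.append(detected[in_bin].mean() if in_bin.any() else np.nan)
    return np.array(rates)
```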

Figure 6 shows the recovered snr against the injected snr at a fixed false-alarm rate of 10 per month. For any missed injection, we give the value of the estimated snr time series that is closest to the injection time. The strongest missed injection at this false-alarm rate has a snr of . We find that the injected snr is recovered with a mean squared error of . Our search is therefore able to distinguish signals from the background, but the estimation of the snr is uninformative. At this false-alarm rate, there are no injections that are only detected in the p-score output. The plot can visually be split into three vertical zones. The lowest zone (red) contains all missed injections. They are recovered with a snr below the threshold for the snr output. If the p-score output were independent of the snr output, we would expect to find a few blue triangles in this region. The second zone (green) contains injections that are only recovered in the snr output. The clear separation from the third zone (black) indicates that the p-score output operates very similarly to the snr output and assigns a value based on the internal snr estimate. The louder the injected signal is, the more likely the network is to detect it in both outputs.

Figure 5: To compare our search to the work of Krastev (2019) we plot their true positive rate at a fixed false-alarm rate of per month in blue and our true positive rate at the same false-alarm rate in green and yellow. On the x-axis we track the injected optimal network snr. The authors of Krastev (2019) give their results in terms of psnr. We use the conversion . To obtain these curves we bin the injected signals by their optimal injection snr and a bin size of 4. For high snrs some bins are empty. Empty bins are interpolated linearly from the remaining data. The area marked gray highlights the region covered by the training set. We find that our search performs better for low snrs but is less sensitive for strong signals. We also show the true positive rate of our search at a false-alarm rate of 10 in red and purple. Within the training range we find that our search closely matches the true positive rate of Krastev (2019) at a higher false-alarm rate.
Figure 6: The plot shows the estimated snr against the optimal injected snr of the test set. There are a few injections with which are not shown here, but all of them are detected. The red crosses correspond to injections that the search did not recover in either of the two outputs at a fixed false-alarm rate of 10 per month. Injections that are found in the p-score output but not in the snr output would be shown as blue triangles, but no injections of this type exist. Green squares show injections that are found only in the snr output. Black hexagons represent injections that were found in both outputs. A clear vertical separation can be seen in this figure. We suspect that the network learns an estimate of the snr internally and only maps the p-score to this internal representation. Otherwise, the borders would not be so sharp and some blue triangles would be seen in the red area.

V.4 Binary black hole injections

Realistic continuous data will not only contain signals from bns mergers but also prominently signals from bbh events. It is therefore interesting to test the response of the nn to these kinds of signals.

To do so, we generate a mock data set containing bbh injections. We use the process described in II.1 but adjust the parameter ranges to represent a distribution for bbh signals. The masses are uniformly distributed in the range from 10 to 30 solar masses, the maximal distance is increased to to adjust for the louder signals and the waveform model is changed to SEOBNRv4_opt. As signals from bbhs are within the sensitive band of the detectors for a shorter duration, the average signal separation can be reduced to 20 seconds with a variance of seconds. The duration of the sections at the beginning and end of each chunk that do not contain any injections is reduced to 16 seconds.

As we only want to make a qualitative statement about the sensitivity of our analysis to bbh signals, we generated and evaluated hours of mock data, containing injected signals. The data is processed in the same way as the data containing bns injections.

For this test set we find that our network has negligible sensitivity to bbh mergers. The bbh waveforms, which are short compared to bns signals, are consistently classified as noise.

VI Conclusions

We presented a new machine learning based search pipeline for bns signals. To allow our network to efficiently process up to of data, we introduced multi-rate sampling: a technique that samples different parts of a gw signal at different sample rates.

Our search improves upon the sensitivity of other machine learning based algorithms at low false-alarm rates and for signals with low snr. For signals below snr we find an improvement of % over previous machine-learning based searches Krastev (2019).

We probed, for the first time, the sensitivity of a machine learning based algorithm at false-alarm rates down to per month. This enabled a direct comparison to the template based low-latency search PyCBC Live Nitz et al. (2018). We found that our machine learning based algorithm is computationally more expensive than using PyCBC Live with a template bank equivalent to our training set. At the same time, the sensitive radius of our search algorithm is lower by a factor of . We therefore conclude that machine learning based search algorithms are not yet sufficiently sensitive or efficient to match the algorithms currently in use.

We do, however, find an improvement in the latency when compared to traditional searches. PyCBC Live introduces on average a latency of s between signal arrival and generating an alert. A very conservative estimate of the latency introduced by our search finds s. This value is limited not by the computational cost of applying the network, but by processing the data it outputs. Choosing a different algorithm to do so is straightforward and might improve the latency by roughly an order of magnitude. The latency of PyCBC Live can be reduced to a similar duration by increasing the computational cost of the analysis. There are also other search algorithms targeted specifically at low-latency detection of candidate events which are already able to achieve latencies of Hooper et al. (2012). The computational cost of all of these searches scales with the size of the template bank used. nns on the other hand have often proven to adapt well to a large variety of features in the input space. It is therefore not unreasonable to believe that machine learning search algorithms may be able to provide low-latency detections at constant or only slightly increased computational cost when the parameter space is enlarged. We think that this is a strong motivation to further pursue a machine learning based search algorithm.

To help compare different search algorithms we proposed a standardized test procedure that can be applied to neural networks as well. We want to stress the importance of providing false-alarm rates and sensitivities for machine learning based searches which are derived on as realistic a data set as possible.

Future works might try to reduce the complexity of the network proposed here to minimize the computational cost and make machine learning based searches a viable alternative. Reducing the complexity of the network may also help to improve the sensitivity of the search. Previous works George and Huerta (2018b) have shown that a network which works well with simulated noise adapts well to real detector noise if re-trained. The algorithm at hand should thus also be extended to be trained and tested on real detector noise. It would further be of interest to test if a computationally less expensive network could be used at a high false-alarm rate to be followed up by a matched-filter search with a heavily reduced template bank.

VII Acknowledgments

We thank Tobias Blenke, Christoph Dreißigacker and Tobias Florin for their comments and suggestions. We acknowledge the Max Planck Gesellschaft and the Atlas cluster computing team at AEI Hannover for support. F.O. was supported by the Max Planck Society’s Independent Research Group Programme.


  • et al. (2016) B. P. A. et al. (LIGO Scientific Collaboration and Virgo Collaboration), Phys. Rev. Lett. 116, 061102 (2016).
  • Collaboration et al. (2018) L. S. Collaboration, V. Collaboration, et al., arXiv preprint arXiv:1811.12907 (2018).
  • Venumadhav et al. (2019) T. Venumadhav, B. Zackay, J. Roulet, L. Dai,  and M. Zaldarriaga, arXiv preprint arXiv:1904.07214 (2019).
  • Nitz et al. (2019a) A. H. Nitz, T. Dent, G. S. Davies, S. Kumar, C. D. Capano, I. Harry, S. Mazzon, L. Nuttall, A. Lundgren,  and M. Tápai, arXiv preprint arXiv:1910.05331 (2019a).
  • Nitz et al. (2019b) A. H. Nitz, C. Capano, A. B. Nielsen, S. Reyes, R. White, D. A. Brown,  and B. Krishnan, Astrophys. J. 872, 195 (2019b), arXiv:1811.01921 [gr-qc].
  • (6) L. S. Collaboration and V. Collaboration, “GraceDB — gravitational-wave candidate event database.”
  • Collaboration and the Virgo Collaboration (2020) The LIGO Scientific Collaboration and the Virgo Collaboration (2020), arXiv:2004.08342 [astro-ph.HE].
  • Abbott et al. (2020) B. P. Abbott, R. Abbott, T. D. Abbott, S. Abraham, F. Acernese, K. Ackley, C. Adams, R. X. Adhikari, V. B. Adya, C. Affeldt,  and et al., The Astrophysical Journal 892, L3 (2020).
  • Aso et al. (2013) Y. Aso, Y. Michimura, K. Somiya, M. Ando, O. Miyakawa, T. Sekiguchi, D. Tatsumi,  and H. Yamamoto (The KAGRA Collaboration), Phys. Rev. D 88, 043007 (2013).
  • Akutsu et al. (2020) T. Akutsu, M. Ando, K. Arai, Y. Arai, S. Araki, A. Araya, N. Aritomi, Y. Aso, S. W. Bae, Y. B. Bae, L. Baiotti, R. Bajpai, M. A. Barton, K. Cannon, E. Capocasa, M. L. Chan, C. S. Chen, K. H. Chen, Y. R. Chen, H. Y. Chu, Y.-K. Chu, S. Eguchi, Y. Enomoto, R. Flaminio, Y. Fujii, M. Fukunaga, M. Fukushima, G. G. Ge, A. Hagiwara, S. Haino, K. Hasegawa, H. Hayakawa, K. Hayama, Y. Himemoto, Y. Hiranuma, N. Hirata, E. Hirose, Z. Hong, B. H. Hsieh, G. Z. Huang, P. W. Huang, Y. J. Huang, B. Ikenoue, S. Imam, K. Inayoshi, Y. Inoue, K. Ioka, Y. Itoh, K. Izumi, K. Jung, P. Jung, T. Kajita, M. Kamiizumi, N. Kanda, G. W. Kang, K. Kawaguchi, N. Kawai, T. Kawasaki, C. Kim, J. Kim, W. Kim, Y. M. Kim, N. Kimura, N. Kita, H. Kitazawa, Y. Kojima, K. Kokeyama, K. Komori, A. K. H. Kong, K. Kotake, C. Kozakai, R. Kozu, R. Kumar, J. Kume, C. M. Kuo, H. S. Kuo, S. Kuroyanagi, K. Kusayanagi, K. Kwak, H. K. Lee, H. W. Lee, R. K. Lee, M. Leonardi, C. Y. Lin, F. L. Lin, L. C. C. Lin, G. C. Liu, L. W. Luo, M. Marchio, Y. Michimura, N. Mio, O. Miyakawa, A. Miyamoto, Y. Miyazaki, K. Miyo, S. Miyoki, S. Morisaki, Y. Moriwaki, K. Nagano, S. Nagano, K. Nakamura, H. Nakano, M. Nakano, R. Nakashima, T. Narikawa, R. Negishi, W. T. Ni, A. Nishizawa, Y. Obuchi, W. Ogaki, J. J. Oh, S. H. Oh, M. Ohashi, N. Ohishi, M. Ohkawa, K. Okutomi, K. Oohara, C. P. Ooi, S. Oshino, K. C. Pan, H. F. Pang, J. Park, F. E. P. Arellano, I. Pinto, N. Sago, S. Saito, Y. Saito, K. Sakai, Y. Sakai, Y. Sakuno, S. Sato, T. Sato, T. Sawada, T. Sekiguchi, Y. Sekiguchi, S. Shibagaki, R. Shimizu, T. Shimoda, K. Shimode, H. Shinkai, T. Shishido, A. Shoda, K. Somiya, E. J. Son, H. Sotani, R. Sugimoto, T. Suzuki, T. Suzuki, H. Tagoshi, H. Takahashi, R. Takahashi, A. Takamori, S. Takano, H. Takeda, M. Takeda, H. Tanaka, K. Tanaka, K. Tanaka, T. Tanaka, T. Tanaka, S. Tanioka, E. N. T. S. Martin, S. Telada, T. Tomaru, Y. Tomigami, T. Tomura, F. Travasso, L. Trozzo, T. T. L. Tsang, K. Tsubono, S. Tsuchida, T. 
Tsuzuki, D. Tuyenbayev, N. Uchikata, T. Uchiyama, A. Ueda, T. Uehara, K. Ueno, G. Ueshima, F. Uraguchi, T. Ushiba, M. H. P. M. van Putten, H. Vocca, J. Wang, C. M. Wu, H. C. Wu, S. R. Wu, W. R. Xu, T. Yamada, K. Yamamoto, K. Yamamoto, T. Yamamoto, K. Yokogawa, J. Yokoyama, T. Yokozawa, T. Yoshioka, H. Yuzurihara, S. Zeidler, Y. Zhao,  and Z. H. Zhu, “Overview of KAGRA: Detector design and construction history,”  (2020), arXiv:2005.05574 [physics.ins-det] .
  • The LIGO Scientific Collaboration (2019) The LIGO Scientific Collaboration, the Virgo Collaboration,  and the KAGRA Collaboration,  (2019), arXiv:1304.0670 .
  • LIGO Scientific Collaboration and Virgo Collaboration (2018) LIGO Scientific Collaboration and Virgo Collaboration, “Online pipelines,”  (2018).
  • Maggiore (2008) M. Maggiore, Gravitational Waves (Oxford University Press, 2008).
  • Sachdev et al. (2019) S. Sachdev, S. Caudill, H. Fong, R. K. L. Lo, C. Messick, D. Mukherjee, R. Magee, L. Tsukada, K. Blackburn, P. Brady, P. Brockill, K. Cannon, S. J. Chamberlin, D. Chatterjee, J. D. E. Creighton, P. Godwin, A. Gupta, C. Hanna, S. Kapadia, R. N. Lang, T. G. F. Li, D. Meacher, A. Pace, S. Privitera, L. Sadeghian, L. Wade, M. Wade, A. Weinstein,  and S. L. Xiao, “The gstlal search analysis methods for compact binary mergers in advanced ligo’s second and advanced virgo’s first observing runs,”  (2019), arXiv:1901.08580 .
  • Adams et al. (2016) T. Adams, D. Buskulic, V. Germain, G. M. Guidi, F. Marion, M. Montani, B. Mours, F. Piergiovanni,  and G. Wang, Classical and Quantum Gravity 33, 175012 (2016).
  • Nitz et al. (2018) A. H. Nitz, T. Dal Canton, D. Davis,  and S. Reyes, Phys. Rev. D 98, 024050 (2018).
  • Hooper et al. (2012) S. Hooper, S. K. Chung, J. Luan, D. Blair, Y. Chen,  and L. Wen, Phys. Rev. D 86, 024012 (2012).
  • Canton and Harry (2017) T. D. Canton and I. W. Harry, “Designing a template bank to observe compact binary coalescences in advanced ligo’s second observing run,”  (2017), arXiv:1705.01845 [gr-qc] .
  • Harry et al. (2016) I. Harry, S. Privitera, A. Bohé,  and A. Buonanno, Phys. Rev. D 94, 024012 (2016).
  • Hannam et al. (2014) M. Hannam, P. Schmidt, A. Bohé, L. Haegel, S. Husa, F. Ohme, G. Pratten,  and M. Pürrer, Phys. Rev. Lett. 113, 151101 (2014).
  • Khan et al. (2019) S. Khan, K. Chatziioannou, M. Hannam,  and F. Ohme, Phys. Rev. D 100, 024059 (2019).
  • Pan et al. (2014) Y. Pan, A. Buonanno, A. Taracchini, L. E. Kidder, A. H. Mroué, H. P. Pfeiffer, M. A. Scheel,  and B. Szilágyi, Phys. Rev. D 89, 084006 (2014).
  • Harry et al. (2018) I. Harry, J. C. Bustillo,  and A. Nitz, Phys. Rev. D 97, 023004 (2018).
  • London et al. (2018) L. London, S. Khan, E. Fauchon-Jones, C. García, M. Hannam, S. Husa, X. Jiménez-Forteza, C. Kalaghatgi, F. Ohme,  and F. Pannarale, Phys. Rev. Lett. 120, 161102 (2018).
  • Khan et al. (2020) S. Khan, F. Ohme, K. Chatziioannou,  and M. Hannam, Phys. Rev. D 101, 024056 (2020).
  • Cotesta et al. (2018) R. Cotesta, A. Buonanno, A. Bohé, A. Taracchini, I. Hinder,  and S. Ossokine, Phys. Rev. D 98, 084028 (2018).
  • Nitz et al. (2020) A. H. Nitz, A. Lenon,  and D. A. Brown, The Astrophysical Journal 890, 1 (2020).
  • Russakovsky et al. (2015) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg,  and L. Fei-Fei, International Journal of Computer Vision (IJCV) 115, 211 (2015).
  • Oord et al. (2016) A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior,  and K. Kavukcuoglu, arXiv preprint arXiv:1609.03499  (2016).
  • Silver et al. (2016) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Nature 529, 484 (2016).
  • OpenAI (2018) OpenAI, “OpenAI Five,”  (2018).
  • Bahaadini et al. (2018) S. Bahaadini, V. Noroozi, N. Rohani, S. Coughlin, M. Zevin, J. Smith, V. Kalogera,  and A. Katsaggelos, Information Sciences 444, 172 (2018).
  • Dreissigacker et al. (2019) C. Dreissigacker, R. Sharma, C. Messenger, R. Zhao,  and R. Prix, Phys. Rev. D 100, 044009 (2019).
  • Wei and Huerta (2019) W. Wei and E. A. Huerta,  (2019), arXiv:1901.00869 .
  • Cuoco et al. (2020) E. Cuoco et al.,  (2020), arXiv:2005.03745 [astro-ph.HE] .
  • Green et al. (2020) S. R. Green, C. Simpson,  and J. Gair,  (2020), arXiv:2002.07656 [astro-ph.IM] .
  • Gabbard et al. (2019) H. Gabbard, C. Messenger, I. S. Heng, F. Tonolini,  and R. Murray-Smith,  (2019), arXiv:1909.06296 [astro-ph.IM] .
  • Marulanda et al. (2020) J. P. Marulanda, C. Santa,  and A. E. Romano,  (2020), arXiv:2004.01050 [gr-qc] .
  • Chan et al. (2019) M. L. Chan, I. S. Heng,  and C. Messenger,  (2019), arXiv:1912.13517 [astro-ph.HE] .
  • Iess et al. (2020) A. Iess, E. Cuoco, F. Morawski,  and J. Powell,  (2020), arXiv:2001.00279 [gr-qc] .
  • George and Huerta (2018a) D. George and E. A. Huerta, Phys. Rev. D 97, 044039 (2018a).
  • Gabbard et al. (2018) H. Gabbard, M. Williams, F. Hayes,  and C. Messenger, Phys. Rev. Lett. 120, 141103 (2018).
  • Abbott et al. (2016) B. P. Abbott et al., Classical and Quantum Gravity 33, 134001 (2016).
  • Nuttall (2018) L. K. Nuttall, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 376, 20170286 (2018).
  • Cabero et al. (2019) M. Cabero, A. Lundgren, A. H. Nitz, T. Dent, D. Barker, E. Goetz, J. S. Kissel, L. K. Nuttall, P. Schale, R. Schofield,  and D. Davis, Classical and Quantum Gravity 36, 155010 (2019).
  • Gebhard et al. (2019) T. D. Gebhard, N. Kilbertus, I. Harry,  and B. Schölkopf, arXiv preprint arXiv:1904.08693  (2019).
  • Krastev (2019) P. G. Krastev,  (2019), arXiv:1908.03151 .
  • Schäfer et al. (2020) M. B. Schäfer, F. Ohme,  and A. H. Nitz, “Data Release: Detection of gravitational-wave signals from binary neutron star mergers using machine learning,” (2020).
  • Usman et al. (2016a) S. A. Usman, A. H. Nitz, I. W. Harry, C. M. Biwer, D. A. Brown, M. Cabero, C. D. Capano, T. D. Canton, T. Dent, S. Fairhurst, M. S. Kehl, D. Keppel, B. Krishnan, A. Lenon, A. Lundgren, A. B. Nielsen, L. P. Pekowsky, H. P. Pfeiffer, P. R. Saulson, M. West,  and J. L. Willis, Classical and Quantum Gravity 33, 215004 (2016a).
  • Nitz et al. (2019c) A. Nitz, I. Harry, D. Brown, C. M. Biwer, J. Willis, T. D. Canton, L. Pekowsky, C. Capano, T. Dent, A. R. Williamson, S. De, M. Cabero, B. Machenschalk, P. Kumar, S. Reyes, T. Massinger, D. Macleod, A. Lenon, S. Fairhurst, S. Khan, GarethDaviesGW, A. Nielsen, shasvath, dfinstad, F. Pannarale, L. Singer, M. Tápai, H. Gabbard, P. Couvares,  and L. M. Zertuche, “gwastro/pycbc: Pycbc release v1.13.5,”  (2019c).
  • Droz et al. (1999) S. Droz, D. J. Knapp, E. Poisson,  and B. J. Owen, Phys. Rev. D 59, 124016 (1999).
  • Blanchet (2002) L. Blanchet, Living Reviews in Relativity 5, 3 (2002).
  • Faye et al. (2012) G. Faye, S. Marsat, L. Blanchet,  and B. R. Iyer, Classical and Quantum Gravity 29, 175004 (2012).
  • LIGO Scientific Collaboration (2018) LIGO Scientific Collaboration, “LIGO Algorithm Library - LALSuite,” free software (GPL) (2018).
  • George and Huerta (2018b) D. George and E. Huerta, Physics Letters B 778, 64 (2018b).
  • Schäfer (2019) M. B. Schäfer, “Analysis of gravitational-wave signals from binary neutron star mergers using machine learning,”  (2019).
  • Schutz (2011) B. F. Schutz, Classical and Quantum Gravity 28, 125023 (2011).
  • Szegedy et al. (2015) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,  and A. Rabinovich, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015).
  • Bai et al. (2018) S. Bai, J. Z. Kolter,  and V. Koltun, CoRR abs/1803.01271 (2018), arXiv:1803.01271 .
  • Schmitt et al. (2019) A. Schmitt, K. Fu, S. Fan,  and Y. Luo, in Proceedings of the 2Nd International Conference on Computer Science and Software Engineering, CSSE 2019 (ACM, New York, NY, USA, 2019) pp. 73–78.
  • Lin et al. (2013) M. Lin, Q. Chen,  and S. Yan, arXiv preprint arXiv:1312.4400  (2013), arXiv:1312.4400 [cs] .
  • Chollet et al. (2019) F. Chollet et al., “Keras,” (2019).
  • Nitz et al. (2017) A. H. Nitz, T. Dent, T. D. Canton, S. Fairhurst,  and D. A. Brown, The Astrophysical Journal 849, 118 (2017).