ggwd
Use PyCBC / LALSuite to generate synthetic gravitationalwave data (e.g., for machine learning endeavors).
view repo
In the last few years, machine learning techniques, in particular convolutional neural networks, have been investigated as a method to replace or complement traditional matched filtering techniques that are used to detect the gravitationalwave signature of merging black holes. However, to date, these methods have not yet been successfully applied to the analysis of long stretches of data recorded by the Advanced LIGO and Virgo gravitationalwave observatories. In this work, we critically examine the use of convolutional neural networks as a tool to search for merging black holes. We identify the strengths and limitations of this approach, highlight some common pitfalls in translating between machine learning and gravitationalwave astronomy, and discuss the interdisciplinary challenges. In particular, we explain in detail why convolutional neural networks alone can not be used to claim a statistically significant gravitationalwave detection. However, we demonstrate how they can still be used to rapidly flag the times of potential signals in the data for a more detailed followup. Our convolutional neural network architecture as well as the proposed performance metrics are better suited for this task than a standard binary classifications scheme. A detailed evaluation of our approach on Advanced LIGO data demonstrates the potential of such systems as trigger generators. Finally, we sound a note of caution by constructing adversarial examples, which showcase interesting "failure modes" of our model, where inputs with no visible resemblance to real gravitationalwave signals are identified as such by the network with high confidence.
READ FULL TEXT VIEW PDFUse PyCBC / LALSuite to generate synthetic gravitationalwave data (e.g., for machine learning endeavors).
Analysis code for the research paper "Convolutional neural networks: a magic bullet for gravitationalwave detection?"
Code for our paper: Improved deep learning techniques in gravitationalwave data analysis.
Matched filtering techniques [1, 2, 3, 4] have proven highly successful in discovering binary black hole coalescences from the recordings of the Advanced LIGO and Advanced Virgo gravitationalwave observatories [5, 6, 7, 8, 9, 10, 11]. Ten observations of merging black holes have now been made [12]. These observations have enabled population studies of the properties of stellarmass black holes and allowed precision tests of general relativity to be carried out [12, 13]. The most important observation to date was arguably the detection of a binary neutron star inspiral together with a gammaray burst and other electromagnetic counterparts [14, 15]. This detection heralds the era of multimessenger gravitationalwave astronomy, has yielded an independent measurement of Hubble’s constant, and probed the behavior of matter at the core of neutron stars [16, 17].
Additional observatories in Japan and India are expected to become operational in the next five years forming an evolving detector network capable of observing hundreds of sources every year [18, 19]. These sources will need to be rapidly observed, localized in the sky and this information quickly disseminated to electromagnetic partners to maximize the chance of multimessenger observations [19]
. This requires reliable, realtime identification of potential compact binary coalescences (CBCs) to provide a time window and basic parameter estimate for slower, but more accurate Bayesian inference techniques to followup
[20, 21]. However, current matched filtering techniques are computationally expensive, with the computational cost scaling as a function of the broadness of the detector’s sensitivity curve and the number of observatories; both of which are expected to increase in the coming years [19].In this work, we investigate whether some of these challenges can efficiently be overcome by using deep convolutional neural networks (CNNs). CNNs are a machine learning technique that has been employed successfully on a wide variety of tasks, including image classification [22, 23, 24]
[25] and audio generation [26]. In the physics community, an early application of CNNs was [27]; Carleo et al. [28] provide a review of recent developments in this direction. In particular, CNNs have also been studied in the literature as a tool for gravitationalwave searches, and previous works have shown that they can indeed be effectively applied to this problem when treating it as a binary (i.e., twoclass) classification task [29, 30].However, despite these promising preliminary results, we believe that the precise role that machine learning can play within the larger scope of CBC searches and practical multimessenger gravitationalwave astronomy has not yet been assayed in sufficient detail. The main goal of this work is, therefore, to carefully and realistically analyze the practical potential of using CNNs to search for GWs from CBCs. Here, we pay particular attention to realistic data generation, an appropriate, taskspecific architecture design and adequately chosen performance metrics. This results in the following main contributions:
We provide an indepth analysis of the challenges one may expect machine learning to solve within the scope of a search for GWs from CBCs, and also discuss their limitations in replacing matched filtering or Bayesian parameter estimation techniques.
We extend the existing, binary classificationbased approach of using CNNs to also handle inputs of varying length. This requires the introduction of new taskspecific performance metrics, which we discuss and relate to the existing metrics.
We highlight potential challenges and subtle pitfalls in the data generation process that may lead to unfair comparisons. To facilitate further research and reproducibility in this area, we release the data generation workflow we have developed as a reusable open source software package.
Finally, the empirical results of our architecture indicate that deep convolutional neural networks are a powerful supplement to the existing pipeline for fast and reliable trigger generation. However, we also demonstrate that—like most deep neural networks—our architecture is also prone to adversarial attacks: We can construct inputs with no visible resemblance to gravitationalwave signals that are nevertheless identified as such by the model.
As a key aspect of this work, we aim to foster communication and understanding between disciplines: On the one hand, we hope to help physicists less acquainted with deep learning techniques understand the strengths and limitations of such methods in gravitationalwave searches and gain intuition towards how they function in this context. Simultaneously, for machine learning experts, we explicitly highlight some problemspecific subtleties—ranging from data generation to model architecture design and meaningful evaluation metrics—to help them to circumvent tempting pitfalls.
The rest of this paper is structured as follows. In section II, we revisit matched filtering (with a focus on the implementation by PyCBC). Furthermore, we discuss the existing literature on using CNNs in the context of gravitationalwave searches. In section III, we then continue by reviewing the previously used binary classification framework more principally, and discuss for which specific tasks CNNs may be useful and for which their output is insufficient. Consequently, after introducing our carefully designed data generation procedure and the corresponding open source software package in section IV, we suggest a fully convolutional network architecture suited for gravitationalwave trigger generation in streaming data in section V. This architecture naturally gives way to novel performance metrics, which we develop in section VI, where we also explain their benefits and relation to current standard metrics. In section VII, we present and discuss the results of our model together with a note of caution concerning adversarial examples, highlighting the still not wellunderstood and unsettling brittleness of deep neural networks. Finally, we conclude with a summary and outlook in section VIII.
Observing compact binary coalescences has always been one of the primary goals of gravitationalwave astronomy. To date, searches for such systems rely on matched filtering using a large template bank (i.e., a set of simulated waveforms covering a carefully chosen parameter space). In the first part of this section, we will describe matched filtering with a specific focus on the implementation provided by the PyCBC software package [3, 31]. We explain the necessary components for a statistically sound search procedure and explain what it means to “detect” a gravitational wave. Readers familiar with the matched filtering search pipeline may wish to skip parts II.1, II.2 and II.3. In part II.4, we then review the existing work using convolutional neural networks for gravitationalwave searches.
Schutz [32] vividly describes the intuition behind the matched filtering technique as follows: “Matched filtering works by multiplying the output of the detector by a function of time (called the template) that represents an expected waveform, and summing (integrating) the result. If there is a signal matching the waveform buried in the noise then the output of the filter will be higher than expected for pure noise.”
In the following, we will formalize this idea mathematically in order to provide the necessary background for a comparison between matched filtering and the outputs of deep learningbased systems later on. Readers interested in further details are referred to the excellent overview of matched filtering in the context of the LIGO and Virgo collaborations by Caudill [33] (and references therein).
The fundamental assumption of matched filtering is that the strain measured by the interferometric detector is made up of two additive components, namely the instrument noise and the (astrophysical) signal :
(1) 
For a given power spectral density of , we can then quantify the agreement between a given template in the template bank and the recorded strain at a time by computing the signaltonoise ratio (SNR). For an appropriate choice of normalization, the matched filtering signaltonoise ratio is given by:
(2) 
where the tilde denotes the Fourier transform. For stationary Gaussian noise it can be shown that—by design—the SNR is indeed the optimal detection statistic for finding a signal
if the timereversed template is equal to the signal [1]. This is called the matched filter. In practice, the template bank should therefore contain accurate simulated waveforms that cover the space of expected signals in the recorded data sufficiently densely. Computing the SNR for every waveform in the template bank and applying a threshold then produces a list of candidate event times.In reality, however, the data is usually neither stationary nor exactly Gaussian. One particular challenge to the data analysis are so called glitches. Glitches are nonstationary noise transients, which comprise a range of different shorttime phenomena that affect the quality of the data measured by the detectors. They occur frequently, at rates up to several times per minute [34]. Some of these effects are well understood, such as the signature of scattered light in the beam tube; others, however, remain enigmatic. For example, a certain common type of glitch named “blip”, whose origin is only poorly understood, tends to mimic the signals that one would expect from the merger of two intermediatemass black holes, thus limiting the sensitivity for this kind of event [35].
As a consequence of these nonGaussian and nonstationary effects, the real distribution of the SNR (and thus the threshold value) is not known and must be determined empirically in order to obtain calibrated statistical results from the computed SNR. Allen et al. [1] provide a detailed account of the merits and challenges of matched filtering in practical gravitationalwave searches.
To understand the crucial components of a full search (which ideally results in a detection), we now outline the current PyCBC search pipeline [3]. The different steps of the search procedure are also illustrated schematically as a flowchart in Figure 1.
In a first step, a template bank containing simulated waveforms that cover the parameter space of interest is constructed; typically using the simulation routines provided by LALSuite [36], the central codebase that implements all waveform models used in Advanced LIGO and Advanced Virgo analyses. For more technical details we refer the reader to, for example, Capano et al. [37].
This template bank is then used to compute an time series for every possible combination of templates and recordings (i.e., we match every template with every observatory). We then find the times of peaks within all these time series that exceed a certain predefined threshold. Next we cluster these times to keep only the times of largest SNR within a 1second window and then store the remaining times alongside the parameters of the template that caused the match. Each of these recordings is called a trigger.
Consequently, we obtain a list of single detector triggers for each observatory independently. Furthermore, a set of signal consistency tests— tests—are computed for every trigger, which help to discriminate between real events and triggers that were caused by noise transients [38]. More precisely, these test values are used to compute a reweighted single detector which serves as a ranking statistic. In a subsequent stage, several coincidence tests (for both the event time and the estimated event parameters) are conducted: the single detector triggers are combined if the same template matched at compatible times (i.e., within light distance of each other) in all detectors. The resulting coincident triggers are called candidate events. Finally, each candidate event is assigned a combined ranking statistic, informally called loudness, which is computed from the parameters of the triggers in each observatory. The precise mathematical definitions of the individual and combined ranking statistics are handtuned and regularly adjusted (see, for example, Nitz et al. [39] or Nitz [40]).
Note that while the loudness is designed to intuitively correspond to our confidence of the candidate being a real event (higher scores indicating higher confidence), the raw numerical values have no significance. Instead, we are interested in the relative ordering of the candidate events that is induced by the loudness score. To claim a detection
—that is, to say that a candidate event with a given loudness in fact corresponds to a true gravitationalwave signal—we must perform the following statistical test: within our model assumptions, what is the probability that we observe this loudness purely by chance, if in reality there is no gravitationalwave signal present? This probability measures the statistical significance of the detection, that is, the confidence with which we can reject the null hypothesis, namely
“there was no real signal in the data”.At this point, it is crucial to contrast this with deep learning based machine learning classifiers. The output of such classifier on a single example—for example, from a softmax or sigmoid output layer—is also between
and and thus at times interpreted as a probability. However, these “probabilities” only reflect the “degree of confidence” of the network regarding its prediction. Therefore, they must not be interpreted as the statistical significance of a detection (see also section III).In PyCBC, the probability of obtaining a given loudness from only noise is estimated via frequentist inference over a given time period. To this end, a matched filtering search is performed on a recording of given length that is known to not contain any gravitationalwave signals. We then count the number of resulting candidate events that are at least as loud as the candidate event.
To obtain data that is guaranteed to not contain any gravitationalwave signals but still shares characteristics of real detector recordings, PyCBC makes use of time shifts. It shifts the recordings of the detectors relative to each other by a time period that is larger than the light travel time between them (see again Figure 1 for where this fits in the pipeline). Assuming that gravitational waves above the detection threshold of the instrument are sparse in time (i.e., further apart than the time shift), this ensures that no real signal will pass the coincidence tests and give rise to a candidate event. Instead, any candidate event found for a timeshifted input must be due to triggers caused by the random detector noise. Therefore, the loudness scores of candidate events found in timeshifted data can be used to estimate the frequency of false positives. This further allows us to derive false alarm rates for candidate events in the non timeshifted data and ultimately assign a statistical significance to a claimed detection. For a slightly more detailed yet compact description of how to estimate these probabilities in practice, we again refer to Caudill [33].
To conclude this introduction to the existing search pipeline, we note that due to the relatively small number of events detected so far, a proper performance evaluation of any search approach hinges on so called injections. An injection is a simulated waveform that is added into a piece of background noise (either synthetic or real) to emulate a real gravitationalwave signal as it would be observed by an actual detector. The search performance can then be evaluated by searching for a large variety of such injections added to the recorded strain data. Because in this case we know the precise location of the injections, we have access to the ground truth required to evaluate the detection rate and false alarm rate of the search pipeline for a given template bank, real recordings, and injections.
In the previous paragraphs, we have glanced over the fact that we can only compute false alarm probabilities and detection rates within our model assumptions. These assumptions include—among other factors—the parameter ranges and distributions of simulated waveforms both for the template bank and injections. Since the true physical distribution of gravitationalwave sources in the Universe (not only in terms of location, but also in terms of the parameters of their constituents) is unknown, these choices will not only affect how the obtained performance results transfer to real searches, but also influence the sensitivity towards various sources. In section IV, we comment on this in a little more detail. However, a full discussion of how to properly incorporate such ad hoc choices in the statistical analysis of the method is beyond of the scope of this work.
The idea of using convolutional neural networks (CNNs) to process time series information goes back to the early days of deep learning itself, more than twenty years ago [41]. Ever since, the community has established CNNs as one of the major work horses for processing images as well as time series data like audio (or various timefrequency representation thereof), which is structurally similar to the strain data produced by gravitationalwave observatories. CNNs have been particularly successful in supervised classification or regression tasks, where they are typically trained to map inputs in
—for example, images of a fixed resolution or fixedlength audio snippets—to either a finite set of labels (classification) or a typically lowdimensional real vector (regression).
All previous work applying convolutional neural networks to the detection of gravitationalwave signals in interferometric detector data has adopted a classificationbased approach. George and Huerta [29] generate white Gaussian noise examples with a fixed length of and, for a subset of them, add simulated gravitationalwave signals from binary black hole mergers similar to the injections in the PyCBC search. The maximum of the signal (which corresponds to the coalescence time) is randomly located in the last quarter of the sample. Using these data, they train a deep neural net, consisting of a common combination of convolutional and fullyconnected layers with a final sigmoid layer, to output a value between and , indicating the confidence of the network about the absence or presence of a gravitationalwave signal in each example. The network output can be thresholded to obtain a binary response. In addition, they train a second neural network, which estimates some basic parameters of the corresponding binary merger whenever the first network claims to have found a signal. In this setup, the CNN’s task is to detect nonGaussianities of a specific form in white Gaussian noise, where the nonGaussianities fall within a specific region of the input snippet.
In later works, they also evaluate this method on snippets of real LIGO recordings, and on an enlarged dataset which also includes waveforms for binary black hole mergers with precessing spins and nonvanishing orbital eccentricities [42, 43]. Longer samples are processed by a slidingwindow approach: recordings are split into overlapping windows to each of which the trained network is applied. Multiple detectors are accounted for by processing each recording separately first and then combining the binary outputs at each time via a logical and function. Notably, the authors suggest that their method can be used for gravitationalwave detection as well as parameter estimation and that it beats matched filtering in terms of errors and computational efficiency while retaining similar sensitivity [43]. We will explain in section III why we believe that a more careful and nuanced interpretation of such claims is essential to understanding the practical merits of CNN based approaches.
Gabbard et al. [30] employ a similar approach: the authors also use a deep neural network consisting of both convolutional and fully connected layers to perform a binary classification task on samples of Gaussian noise which either do or do not contain a simulated GW signal. The focus of their work, however, is the comparison with matched filtering. They conclude that their method is indeed able to closely reproduce the results of a matched filteringbased search on these samples.
A somewhat different approach was presented by Li et al. [44]. In their method, they use a wavelet packet decomposition to preprocess the data before feeding it into a convolutional neural network, which then operates on a frequency representation. They also work with a slidingwindow approach to apply their network to samples of variable length. Ultimately, the practical conclusions of their work are limited by the fact that they use Gaussian noise for the background and an unrealistically simplified damped sinusoid as an analytical waveform model.
In this section, we develop our main conceptual contributions, namely that a) convolutional neural networks are not suited to claim statistically significant detections of gravitational waves, however, b) they can still be useful tools for realtime trigger generation.
Our core argument for claim a) hinges on the fact that the “false alarm rate” which can be derived from machine learningbased classifiers is directly linked to the training dataset. As a consequence, there is only a single significance level that one can assign to every claimed detection, without being able to distinguish particularly loud events from fainter ones. Additional difficulties stem from the fact that in a real search, the task at hand is not to perform binary classification on fixedlength examples, but to identify the temporal location of potential signals in time series data of arbitrary length, or even in streaming data. The significance level obtained in the examplebased binary classification setup does not transfer easily to slidingwindow based approaches for streaming data.
To substantiate b), we highlight the benefits of CNNs in terms of computational complexity and devote the remaining sections of this paper to developing a modified CNN architecture which can overcome many of the pitfalls of the binary classification approach.
Standard performance metrics for classification tasks are the true positive rate (TPR; also called recall) and the false positive rate (FPR), which are defined as:
True Positive Rate (TPR)  
False Positive Rate (FPR) 
Here, TP are true positives (i.e., examples correctly classified as positives), FP are false positives (i.e., examples falsely classified as positive; Type I error), TN are true negatives (i.e., examples correctly classified as negative) and FN are false negatives (i.e., examples falsely classified as negative; Type II error).
Indeed, all previous comparisons of CNNs use a binary classification framework and compare the true positive rate at fixed false positive rate directly to matched filtering results at a given false alarm rate [42, 43, 30]. To obtain this measure in practice, for thresholdbased binary classifiers, one usually sweeps the threshold from to , recording the true positive rate and the false positive rate for each threshold value to produce the receiver operator characteristic (ROC) curve, that is, the true positive rate over the false positive rate. Since the false positive rate is maximal for threshold and minimal (zero) for threshold , we can then simply read off the true positive rate for any given false positive rate. However, there is a subtle difference between the generalization properties of this population level false positive rate and the false alarm rate in matched filtering.
Intuitively, we may interpret the CNN as an implicit abstract representation of all the examples—with and without simulated waveforms—which it has seen during training. In that sense it does not directly capture a compressed version of the template bank alone, but the entire training distribution including the ratio of positive and negative examples. Therefore, unlike matched filtering, the network’s output on new inputs depends also on the relative frequencies of positive and negative examples in the training set and the above performance measures only transfer to unseen examples following the exact same distribution. Consequently, performance evaluations of CNNs on the training distribution (many examples with injections) do not transfer to the test distribution (real recordings with few signals) as is the case for matched filtering, where the output depends only on the template bank. For efficient and stable training, the number of positive and negative examples should be on a similar order of magnitude, which is a clear misrepresentation of the true distribution and calls for caution when interpreting the FPR on handcrafted training or validation sets as false alarm probability in a full search on real data. We note that in [43], the authors have computed an estimate of their FPR by applying their trained network to a continuous stretch of real LIGO data.
The core task of gravitationalwave searches is not a populationlevel performance rating of the search pipeline on synthetic data, but to ascertain the individual statistical significance of a given candidate event. Hence, we must ask ourselves the question: What would be our level of confidence that there is a real event in the data when a binary classifier outputs a ? Here is the problem: If we were to use the false positive rate as a level of confidence for a claimed detection of the CNN (output ), we would assign the same confidence to every candidate! In particular, we would have no way of distinguishing particularly significant detections from fainter ones. This is due to the fact that the false positive rate is a statistic of the network output on the entire dataset, not any given example. Furthermore, as described above, the interpretation of the false positive rate as a confidence is only valid if the test distribution (actual detector recordings) comes in welldefined, distinct fixedlength examples which follow the same distribution (including the frequencies of positive and negative examples) as the training set. Therefore, while the false positive rate may seem like a tempting, convenient measure for the false alarm probability of CNNs, it must not be interpreted as a statistical significance. Consequently, CNNs alone cannot be used to properly claim gravitationalwave detections.
In a real search, we must identify and annotate those parts of an arbitrarily long input time series that contain a signal. The existing works extend the binary classificationbased approach to longer inputs via a sliding window approach. In addition to the fixed input size of the classifier, this requires yet another parameter choice, namely the step size of the sliding window.
Both of these parameters influence the performance metrics directly and in ways that are hard to interpret. First, the tempting conversion of “FPR example length temporal rate of false positives” becomes invalid due to the overlap between neighboring windows. Second, depending on the step size of the sliding window, waveforms may lie only partially within the input window, which can then not be labelled as one or zero in a principled fashion. Moreover, there is no natural interpretation of the sequence of outputs. For example, assume the CNN outputs the sequence , where the coalescence happens roughly at the center value. How should these labels be counted as true (false) positives (negatives)? The interpretation would perhaps also depend on the time step, that is the temporal resolution, and the window size. Finally, while a high temporal resolution (small step size) would be desirable in order to localize the signal in time, it also leads to computational redundancy as we will further elaborate in section V.
All in all, the metrics derived from the examplebased binary classification setup do not easily transfer to the sliding window approach on streaming data; a fact which has largely been overlooked in the literature so far.
We have seen that in the examplebased approach, we cannot easily process inputs with partially contained waveforms. Previous works have therefore positioned injections only in specific regions within the examples, usually such that the coalescence is located towards the end.
Deep learning systems are known to pick up unintentional quirks in the training data which correlate with the labels. This can result in an undesirable behavior called overfitting, where a classifier learns to perform well on training data, but fails on new examples in the real application. In the above example, the CNN may overfit on the location of the coalescence within the training examples. In particular, the final, fully connected layer(s) can learn locationsensitive features. Since the coalescence is the most pronounced part of the waveform, if it is always located in the same region, a network containing fully connected layers may focus exclusively on high amplitude, high frequency oscillations in this region, ignoring other parts of the input.
One crucial measure to avoid overfitting is to make the training set as representative as possible of the context in which the model will be deployed to reduce its potential to adapt to irrelevant characteristics of the training data. In sections IV and V, we discuss a data generation process and network architecture that pay particular attention to minimizing the danger of overfitting.
To conclude this section, let us discuss how CNNs can still complement matched filteringbased searches (instead of replacing them). Looking into the future, various upcoming challenges of matched filtering concern growing computational needs. For example, as more detectors come online, the computational complexity of matched filtering scales at least linearly in the number of detectors (recall that the search for triggers is performed independently for each detector first). Moreover, this trigger generation scales linearly also in the number of waveforms in the template bank. As template banks grow, matched filtering becomes increasingly expensive, causing realtime online trigger generation to become computationally challenging and prohibitive.
Such computational considerations are a key part of the motivation to look into alternative search methods in the first place. Convolutional neural networks are natural candidates, because inference—evaluating the network on new strain data after it has been trained—can be substantially faster than matched filtering. Our architecture (see section V.1) scales to an arbitrary number of detectors with almost no computational overhead. Furthermore, once an architecture is fixed, it can in principle be trained on any distribution of simulated waveforms. Thus, we can view the network training as building an abstract, constant size representation of the template bank. Note that the computational cost of inference is independent of the size of the training data. The expensive training of the network is performed only once up front.
The benefit of fast inference of CNNs—they analyze detector recordings much faster than realtime—makes them natural candidates for trigger generators. Realtime alarms can provide useful hints for follow up searches of electromagnetic counterparts as well as for focused analysis with matched filtering and Bayesian parameter estimation [51]. Arguably, a straightforward extension to also provide rough first parameter estimates could further decrease the computational cost of subsequent analysis by narrowing down the parameter space.
Moreover, while CNNs do not enjoy theoretical guarantees for stationary Gaussian data like matched filtering, one may speculate that they can, in principle, incorporate mechanisms to better deal with common nonGaussianities in the data by learning internal models not only of waveforms, but also of transient glitches. Testing and quantifying this hypothesis is left for future work.
In the remainder of this work, we develop a promising proof of concept implementation for such a usecase that avoids many pitfalls presented earlier in this section.
In this section, we describe the steps we have taken to generate realistic, synthetic data which can be used to train and evaluate a CNNbased model. We discuss our design choices and explain steps where we found a need to compromise between realistically modeling physics on the one hand and the requirements for efficient and reliable machine learning on the other hand. For reasons of transparency and reproducibility, as well as to foster further research in the area, we will make our data generation code publicly available online at ^{1}^{1}1https://github.com/timothygebhard/ggwd.
When choosing background data, one has essentially two options: simulated Gaussian noise, which is then colored using the Power Spectral Density (PSD) of the detectors, or actual detector recordings (in which the existing matched filtering pipeline did not find any gravitationalwave signals). While the first option yields background data that has on average the correct frequency distribution, it will not contain glitches. However, as discussed before, glitches are one of the major challenges for the data analysis. Therefore, we have decided to use real LIGO recordings from the first observation run (O1) to emulate the background noise. O1 included the first three discoveries of gravitational waves: GW150914, GW151012 and GW151226 [7, 8, 12]. The exact detector configuration during O1 is described in detail in [53, 54, 55].
The data from O1 is publicly available through the Gravitational Wave Open Science Center (GWOSC; see also [56, 57]). In our study, we limited ourselves to a subset of the data, specified by the following criteria:
Data available: Both H1 and L1 must have data available (due to different times when the detectors are operating, this is not always the case).
Minimum data quality: For the scope of this study, the data needs to pass all vetoes for CBC searches, meaning that only recording segments with data quality at least CBC_CAT3 (using the GWOSC definitions) are used.
In this section, we develop our specific neural network architecture (which aims to avoid some of the previously mentioned problems of CNNs) and document the training procedure. A highlevel schematic drawing of the model architecture is depicted in Figure 2.
In order to achieve a model that is agnostic to the length of the input time series, we choose a fully convolutional architecture. This means there are no fully connected (or dense) layers. Instead, the neural network only learns convolutional filters (or kernels), which make no assumptions about the size of their input data.
This has two major advantages. First, if the size of the receptive field of the network is , we can directly evaluate our model on a time series of time steps for any , resulting in an output time series of length . The receptive field of a network refers to the number of time steps on the input layer that affect a single time step on the output layer. Typically, an architecture should be chosen such that the receptive field is large enough to cover a substantial part of the signal. Secondly, it is more computationally efficient than a sliding window approach, which—due to the overlap of neighboring windows—performs redundant computations. A fully convolutional architecture avoids this overhead.
Moreover, instead of evaluating the network for each detector separately, we stack the recordings from all observatories and treat them as different channels of a single, multidimensional input. This means that when the number of detectors changes, we only need to adjust the number of input channels of the first layer, while the rest of the architecture remains fixed. While retraining is required after such an extension, the computational complexity of our approach at test time is virtually constant in the number of detectors.
In practice, we use a stack of (convolutional) blocks, each based on a dilated convolutional layer with convolutional kernels of size
. Empirically, we found that increasing the number of channels used in the convolutional blocks generally improves the overall performance. However, memory limitations during training upperbounded the number of channels to 512. Within each block, the convolutional layer itself is followed by a nonlinear activation function, namely a rectified linear unit (ReLU). We did not use any regularization techniques such as dropout or batch normalization.
The difference between the twelve convolutional blocks is the dilation of the kernels, which increases exponentially in powers of two (i.e.,
) with the block number. This simple trick yields a relatively large receptive field of 2 seconds with a moderate depth of only 12 blocks while avoiding loss of resolution or coverage. This was considered sufficient to cover the relevant region around the coalescence for all signals of interest. Other modifications of the kernel, such as strides, were not used.
The stack of convolutional blocks is preceded by an input convolutional layer with kernel size of 1, which maps the input data from two channels (the strains from H1 and L1) to 512 channels. On the output side of the network, the last convolutional block is succeeded by an output convolutional layer, which again has a kernel size of 1 and serves to reduce the number of channels from 512 back to 1. The now onedimensional network output is then passed through a sigmoid layer ^{2}^{2}2The sigmoid activation function is defined as ., which maps it into the interval .
Our implementation (in Python 3.6.7) is based on the PyTorch deep learning framework (version 1.0.1) [65]. All code that was used to obtain the results presented in this work will be made available online at ^{3}^{3}3https://github.com/timothygebhard/magicbullet.
As usual for CNNs, before feeding an example time series as input during training, validation, and test time, we normalize it via , where and
are computed as the medians of the mean and standard deviation of each individual example in the training set. During training, we monitor the generalization performance by regularly evaluating the model on the validation set. For the actual training, we first use the
Kaiming initialization scheme as introduced in [67]to assign initial random values to the network parameters (i.e., the convolutional kernels). During training, the kernel entries are optimized using stochastic gradient descent using
Adam [68] with the AMSgrad modification proposed in [69]. To this end, within every epoch (i.e., a full pass over all training data) the entire training dataset is randomly shuffled and divided into a fixed number of minibatches. We use binary crossentropy(BCE) as the loss function. The
batch loss is calculated as the average of the BCE losses at every time step of every example in the minibatch and its corresponding label value. This batch loss is then automatically differentiated with respect to all kernels, and error backpropagation is used to update the kernel values.At the end of every epoch, the performance of the network with its current parameter values is evaluated both on the full training and validation data set. The current loss (as well as other metrics, see below) are logged and a checkpoint of the model is created. This means that a copy of the model parameters is saved to disk such that the current training state can later be loaded again. We end training after a fixed number of epochs and retrieve the checkpoint corresponding to the lowest validation loss as the final trained model. This is a form of validationbased early stopping, which helps to avoid overfitting.
By default, we choose an initial learning rate of . During training, the learning rate is reduced whenever the loss on the validation set has not decreased by more than a certain threshold over a given number of epochs (default value: 8). This behavior is controlled by PyTorch’s ReduceLROnPlateau method.
Let us now explain how we generate the labels, that is, the desired network output for a given input. In our case, the labels are also time series: Ideally, the network should mark the exact locations of coalescences. A natural way to represent this is a time series which is zero everywhere except at the event time where the signal in H1 reaches its maximum amplitude (where the label takes on a value of ).
From a practical machine learning point of view, however, this is problematic: such sparse signals do not contribute sufficiently to the average loss to keep the network from simply always predicting zero. To prevent this failure mode, instead of labelling a single time point, we choose a fixedwidth interval centered around the time when the injected signal in the H1 channel reaches its maximum amplitude. By construction of our data generation pipeline, this position is fixed for all examples. (Recall that our fully convolutional network architecture is by design unable to overfit to specific locations within input examples.) Thus, labels need not be pregenerated or stored, but can be computed on the fly during training or testing. By default, the labels width (i.e., the length of the symmetric interval around the event time in which the label time series takes on a value of ) is .
In section II we discussed the drawbacks of the true positive rate and the false positive rate as performance measures for gravitationalwave searches in the example based binary classification setup. The fact that our data generation pipeline also generates individual examples is merely to make training convenient. Our model could equivalently be trained on a single time series (of sufficient length) containing any number of injections at arbitrary locations. This is possible because our architecture does not perform binary classification on an example level, but outputs yet another time series. As a consequence, different performance metrics are required.
Our objective is to tag signals in the data by outputting a peak close to the actual coalescence time. This intuition can be formalized to obtain interpretable performance metrics using the following evaluation procedure:
We identify all intervals of value 1 in the smoothed and thresholded network output.
The interval centers are stored as candidate times.
For each candidate time , we test for coincidence with the ground truth injection time , that is, if . By default, we used . Note that
is another free, tuneable hyperparameter whose value will directly affect the performance metrics defined below.
If a candidate time passes this coincidence check, we count it as a true positive or detection (see note below); otherwise, it is a false positive.
Per example, there can only be one true positive. If multiple candidate times pass the coincidence test for a single example, only one of them is counted as a detection, while the others are false positives.
Note: We use the term detection in this context to refer to an injected signal which was successfully recovered (in the sense of the procedure described above) by the network. This is, however, purely for ease of terminology. A “detection” by the CNN cannot be compared to and must not be confused with the (statistically significant) detection of a gravitational wave as described in section II.2. Similarly, the false positive rate (see below) cannot directly be compared to a false alarm rate.
We can now discuss the network performance on the test set in terms of the detection ratio and the false positive ratio. The detection ratio is simply the number of injected signals in the test set that the network could recover, divided by the total number of injected signals. We therefore also call it sensitivity. The false positive ratio is the number of false positives divided by the number of all produced candidate times. It is thus an estimate of the error probability; the probability that any given candidate time does not coincide with an actual signal.
Additionally, we can also define the false positive rate: the total number of false positives divided by the combined duration of all samples in the test set. Its inverse is more intuitive: the inverse false positive rate is the average time between two false positives. Naturally, this number should be as high as possible, meaning false positives should be as infrequent as possible.
Again, note that our metrics do not rely on the existence of distinct examples, but could equally be evaluated on a single time series of arbitrary length containing multiple signals. To illustrate this key difference further, let us go back to our argument why the true positive rate and the false positive rate can not be used to evaluate example based binary classification approaches in the sliding window mode of operation considering the output sequence . First, previous approaches do not explain how to interpret such an undesirable situation. Moreover, their performance metrics are blind to these occurrences, because they are derived only from fixedlength examples, which all have an unambiguous binary label. Taking into account the continuous nature of the task, our metrics acknowledge this issue by counting at least one of the two positive intervals as a false positive if there was only one real signal within the corresponding time interval.
When evaluated on our full test set, our trained model is able to successfully recover approximately 89% of all injections, while on average producing a false positive about once every 19.5 minutes.
For a more fine grained analysis, we then split the positive examples (i.e., the ones that do contain an injection) in the test set into 30 bins based on their respective injection SNR. The bins are distributed equidistantly and cover the full injection SNR range of . On average, every bin therefore contains examples. We then compute the detection ratio independently for each of these bins to investigate how the sensitivity of our method scales with the faintness of the signals. The results in Figure 4 show that the detection ratio increases steeply with the injection SNR and achieves essentially 100% roughly at an SNR of 11.
For comparison, the threshold for a coincident trigger to even be analyzed within the PyCBC search pipeline is . At this injection SNR, our model already recovers more than 80% of all injected signals. Furthermore, the first ten real binary black hole mergers observed so far had network SNRs between and [12], which is well within the region in which our model has a virtually perfect detection ratio.
Next, we systematically investigate the effect of both the smoothing and thresholding parameters. To this end, we postprocess the raw network output on the test set with different sizes of the smoothing window (1, 2, 4, 8, 16, 32, 64, 128, and 256) and different thresholds (0.1, 0.3, 0.5, 0.7, and 0.9). In the parametric plot in Figure 4, we show the detection ratio and the inverse false positive rate averaged over the entire test set for each combination of parameter settings (meaning up and right are better). While there is no single best option, this plot shows that our two parameters provide clearly interpretable tuning knobs to choose an operating point by trading off the sensitivity and the false positive rate. Depending on the application requirements one may use this plot to optimize detection ratio at fixed false positive rate or vice versa.
In the next experiment, we evaluate our model’s ability to generalize from synthetic training data to real events. The first two observations announced during LIGO’s first observation run were GW150914 and GW151226 [7, 8]. These real signals were not included in the training data. At test time, we select an interval centered around the event times from the original recordings for both events, and apply the established whitening and bandpassing procedure. Both samples are then cropped to , again centered around the event time. After normalizing and passing them through the network, we apply our usual postprocessing steps, using a window size of 256 time steps for the smoothing and thresholding the result at .
The results in Figure 5 show that in both cases, the model was able to successfully recover the real GW signal at the correct position despite being slightly less accurate on the fainter event GW151226 (with a network SNR of 13) than the first observed event GW150914 (with a network SNR of 24) [7, 8]. The fainter example highlights the effect of postprocessing: Instead of causing multiple false positives when thresholding the raw network output directly, the additional smoothing step yields a single connected interval (i.e., a single predicted event time).
Finally, we also apply our trained network to all other events in the GWTC1 catalog [12], which consists of 11 confirmed binary mergers from both the first and second observation run of LIGO. Using the event data available from the GWOSC (which was preprocessed in the same way as before), we find that our network can indeed recover all known events, with the exception of GW170817. This is, however, not a surprise: While all other events are binary black hole mergers, and we also trained our model using simulated BBH waveforms, GW170817 is the only confirmed binary neutron star merger [14].
Lastly, the fact that we are able to also successfully recover the events from O2 after using only recordings from O1 to train also indicates that the model is, to a certain extent, robust to changes in the detector characteristics.



In a final experiment, we once more want to emphasize our call for caution when interpreting CNNs in the context of gravitationalwave searches. To address the question “What has the model actually learned?”, we use techniques inspired from activation maximization or feature visualization (see, e.g., [70, 71]), as well as adversarial examples or adversarial attacks (see, e.g., [72]
), which are currently active areas of research within the machine learning community. Specifically, we perform the following test in which we make use of the differentiability of our model to find examples of inputs which cause the network to produce a given target output:
We randomly select a noiseonly example (i.e., an example that does not contain an injection) from our testing set and crop it from the end to a length of . This is our initial network input.
Next, we generate a target label, which is about long ( minus the receptive field of the model) and zero everywhere except for the interval from to , where it takes on a value of 1.
If applicable, we enforce additional constraints on the inputs. For example, we pass the input through a function to create the physically nonsensical scenario of a strain that is strictly nonnegative (see first example in Figure 6 c).
We pass the constrained network input through the trained model from the previous experiments. We then compute a weighted sum of a binary crossentropy and a mean squared error loss between the network prediction and the target. The exact weighting depends on the optimization target.
Unlike when training a neural network, this loss is then not backpropagated to the weights of the network, which stay fixed during this experiment. Instead, the loss is backpropagated to the input, which is updated in order to minimize the loss.
We repeat this procedure (starting with enforcing possible constraints on the inputs) for 256 iterations, again using Adam as the optimizer, with an initial learning rate of . PyTorch’s default cosine annealing scheduler is used to gradually decrease the learning rate every epoch.
Finally, we compute the difference between the original network input and the optimized input. This can be interpreted as the hypothetical “signal”, which—when added into the pure noise example—makes our network produce the target output.
We repeat this procedure for different initial inputs and manually inspect the results in form of the hypothetical “signals” to check if they match our expectation: If the network had truly learned to respond only to gravitational waves, we would expect these hypothetical “signals” to closely resemble gravitationalwave signals.
However, while some of the inputs that have undergone the described optimization procedure do exhibit a chirplike structure (i.e., oscillations increasing in both amplitude and frequency), we find that this is not always the case; see panel a) and b) of Figure 6. Worse yet, we can also achieve the desired output even when imposing nonphysical constraints on the inputs. We investigate three types of such constraints: Firstly, we allow only nonnegative strain values. Secondly, we enforce the strain to be zero in a interval covering the interval in which the target output is one. Thirdly, we clip the network input values to a small interval around zero to minimize the overall amplitude. In all three cases, we can still find examples that obey the constraints and, when passed through the network, yield the desired target output. Examples for this are shown in panel c) of Figure 6.
Since we crafted these examples in a supervised fashion, one may argue that the cases in panel c) are unrealistically out of distribution, that is, they would never occur in real detector recordings and therefore do not lead to complications in practice. However, in particular the unconstrained examples in panel b) of Figure 6 are unsettling, because they illustrate just how easily the network can be fooled even by small changes in the inputs. These results suggest a detailed quantification of how contrived these hypothetical “signals” really are (measured by how likely they are to occur accidentally in future detector recordings) to assess whether one must account for them in the false positive rate. Without such an analysis the worry of overconfident positive CNN output on pure noise or faint nonGaussian transients remains.
In this work we provide an interdisciplinary, indepth analysis of the potential of deep convolutional neural networks (CNNs) as part of the effort around searching for gravitational waves from binary coalescences in strain data. First, we critically scrutinize both the methods as well as the contributions of existing works on this topic by carefully analyzing how standard machine learning approaches and metrics map to the specific task at hand. This analysis yields two major conclusions:
CNNs alone can not be used to claim statistically significant gravitationalwave detections.
Fast inference times, favorable computational scaling in the number of detectors, and a compact internal representation of a large number of waveforms presented during training still make CNNs a useful and promising tool to produce realtime triggers for detailed analysis and follow up searches.
As part of these key conceptual insights, we hope to foster further interdisciplinary research on this topic by highlighting important subtleties of GW searches to machine learning experts and exposing some potential pitfalls and surprising properties of CNNs to physicists.
Building on these insights, we have designed a flexible data generation pipeline which we make publicly available as an open source package. We use a novel network architecture which is more tailored to the physical task at hand than a binary classificationbased approach and also overcomes some subtle pitfalls, such as the danger of overfitting to some particular properties of the training data. We evaluate this approach on real LIGO recordings and demonstrate the potential of such a system as a trigger generator by achieving a detection ratio of 86% with a false positive on average once every 40 minutes. Two tuneable postprocessing parameters allow us to intuitively trade off between the detection ratio and the false positive rate without having to retrain the model.
Finally, as part of our effort for crossdisciplinary understanding, we showcase a selection of “failure modes” of our model which are typical for deep convolutional neural networks. We contrive inputs which the network believes to contain gravitationalwave signals with high confidence, even though they are structurally very different from real detector signals for compact binary coalescences. While some of these inputs are physically unrealistic and thus unlikely to be observed in practice, others appear quite plausible (e.g., tiny modifications of pure noise examples). Because the detector noise properties change on an hourly timescale, the rate of false triggers due to such failures may be hard to predict even for a welltuned CNN. We leave the required quantitative analysis of how such incidences may affect the performance on realworld recordings under changing detector characteristics for future research, and conclude this work with a note of caution: CNNs are a promising tool for gravitationalwave data analysis; however, their exact interpretation requires great care and attention.
In this section, we develop our specific neural network architecture (which aims to avoid some of the previously mentioned problems of CNNs) and document the training procedure. A highlevel schematic drawing of the model architecture is depicted in Figure 2.
In order to achieve a model that is agnostic to the length of the input time series, we choose a fully convolutional architecture. This means there are no fully connected (or dense) layers. Instead, the neural network only learns convolutional filters (or kernels), which make no assumptions about the size of their input data.
This has two major advantages. First, if the size of the receptive field of the network is , we can directly evaluate our model on a time series of time steps for any , resulting in an output time series of length . The receptive field of a network refers to the number of time steps on the input layer that affect a single time step on the output layer. Typically, an architecture should be chosen such that the receptive field is large enough to cover a substantial part of the signal. Secondly, it is more computationally efficient than a sliding window approach, which—due to the overlap of neighboring windows—performs redundant computations. A fully convolutional architecture avoids this overhead.
Moreover, instead of evaluating the network for each detector separately, we stack the recordings from all observatories and treat them as different channels of a single, multidimensional input. This means that when the number of detectors changes, we only need to adjust the number of input channels of the first layer, while the rest of the architecture remains fixed. While retraining is required after such an extension, the computational complexity of our approach at test time is virtually constant in the number of detectors.
In practice, we use a stack of (convolutional) blocks, each based on a dilated convolutional layer with convolutional kernels of size
. Empirically, we found that increasing the number of channels used in the convolutional blocks generally improves the overall performance. However, memory limitations during training upperbounded the number of channels to 512. Within each block, the convolutional layer itself is followed by a nonlinear activation function, namely a rectified linear unit (ReLU). We did not use any regularization techniques such as dropout or batch normalization.
The difference between the twelve convolutional blocks is the dilation of the kernels, which increases exponentially in powers of two (i.e.,
) with the block number. This simple trick yields a relatively large receptive field of 2 seconds with a moderate depth of only 12 blocks while avoiding loss of resolution or coverage. This was considered sufficient to cover the relevant region around the coalescence for all signals of interest. Other modifications of the kernel, such as strides, were not used.
The stack of convolutional blocks is preceded by an input convolutional layer with kernel size of 1, which maps the input data from two channels (the strains from H1 and L1) to 512 channels. On the output side of the network, the last convolutional block is succeeded by an output convolutional layer, which again has a kernel size of 1 and serves to reduce the number of channels from 512 back to 1. The now onedimensional network output is then passed through a sigmoid layer ^{2}^{2}2The sigmoid activation function is defined as ., which maps it into the interval .
Our implementation (in Python 3.6.7) is based on the PyTorch deep learning framework (version 1.0.1) [65]. All code that was used to obtain the results presented in this work will be made available online at ^{3}^{3}3https://github.com/timothygebhard/magicbullet.
As usual for CNNs, before feeding an example time series as input during training, validation, and test time, we normalize it via , where and
are computed as the medians of the mean and standard deviation of each individual example in the training set. During training, we monitor the generalization performance by regularly evaluating the model on the validation set. For the actual training, we first use the
Kaiming initialization scheme as introduced in [67]to assign initial random values to the network parameters (i.e., the convolutional kernels). During training, the kernel entries are optimized using stochastic gradient descent using
Adam [68] with the AMSgrad modification proposed in [69]. To this end, within every epoch (i.e., a full pass over all training data) the entire training dataset is randomly shuffled and divided into a fixed number of minibatches. We use binary crossentropy(BCE) as the loss function. The
batch loss is calculated as the average of the BCE losses at every time step of every example in the minibatch and its corresponding label value. This batch loss is then automatically differentiated with respect to all kernels, and error backpropagation is used to update the kernel values.At the end of every epoch, the performance of the network with its current parameter values is evaluated both on the full training and validation data set. The current loss (as well as other metrics, see below) are logged and a checkpoint of the model is created. This means that a copy of the model parameters is saved to disk such that the current training state can later be loaded again. We end training after a fixed number of epochs and retrieve the checkpoint corresponding to the lowest validation loss as the final trained model. This is a form of validationbased early stopping, which helps to avoid overfitting.
By default, we choose an initial learning rate of . During training, the learning rate is reduced whenever the loss on the validation set has not decreased by more than a certain threshold over a given number of epochs (default value: 8). This behavior is controlled by PyTorch’s ReduceLROnPlateau method.
Let us now explain how we generate the labels, that is, the desired network output for a given input. In our case, the labels are also time series: Ideally, the network should mark the exact locations of coalescences. A natural way to represent this is a time series which is zero everywhere except at the event time where the signal in H1 reaches its maximum amplitude (where the label takes on a value of ).
From a practical machine learning point of view, however, this is problematic: such sparse signals do not contribute sufficiently to the average loss to keep the network from simply always predicting zero. To prevent this failure mode, instead of labelling a single time point, we choose a fixedwidth interval centered around the time when the injected signal in the H1 channel reaches its maximum amplitude. By construction of our data generation pipeline, this position is fixed for all examples. (Recall that our fully convolutional network architecture is by design unable to overfit to specific locations within input examples.) Thus, labels need not be pregenerated or stored, but can be computed on the fly during training or testing. By default, the labels width (i.e., the length of the symmetric interval around the event time in which the label time series takes on a value of ) is .
In section II we discussed the drawbacks of the true positive rate and the false positive rate as performance measures for gravitationalwave searches in the example based binary classification setup. The fact that our data generation pipeline also generates individual examples is merely to make training convenient. Our model could equivalently be trained on a single time series (of sufficient length) containing any number of injections at arbitrary locations. This is possible because our architecture does not perform binary classification on an example level, but outputs yet another time series. As a consequence, different performance metrics are required.
Our objective is to tag signals in the data by outputting a peak close to the actual coalescence time. This intuition can be formalized to obtain interpretable performance metrics using the following evaluation procedure:
We identify all intervals of value 1 in the smoothed and thresholded network output.
The interval centers are stored as candidate times.
For each candidate time , we test for coincidence with the ground truth injection time , that is, if . By default, we used . Note that
is another free, tuneable hyperparameter whose value will directly affect the performance metrics defined below.
If a candidate time passes this coincidence check, we count it as a true positive or detection (see note below); otherwise, it is a false positive.
Per example, there can only be one true positive. If multiple candidate times pass the coincidence test for a single example, only one of them is counted as a detection, while the others are false positives.
Note: We use the term detection in this context to refer to an injected signal which was successfully recovered (in the sense of the procedure described above) by the network. This is, however, purely for ease of terminology. A “detection” by the CNN cannot be compared to and must not be confused with the (statistically significant) detection of a gravitational wave as described in section II.2. Similarly, the false positive rate (see below) cannot directly be compared to a false alarm rate.
We can now discuss the network performance on the test set in terms of the detection ratio and the false positive ratio. The detection ratio is simply the number of injected signals in the test set that the network could recover, divided by the total number of injected signals. We therefore also call it sensitivity. The false positive ratio is the number of false positives divided by the number of all produced candidate times. It is thus an estimate of the error probability; the probability that any given candidate time does not coincide with an actual signal.
Additionally, we can also define the false positive rate: the total number of false positives divided by the combined duration of all samples in the test set. Its inverse is more intuitive: the inverse false positive rate is the average time between two false positives. Naturally, this number should be as high as possible, meaning false positives should be as infrequent as possible.
Again, note that our metrics do not rely on the existence of distinct examples, but could equally be evaluated on a single time series of arbitrary length containing multiple signals. To illustrate this key difference further, let us go back to our argument why the true positive rate and the false positive rate can not be used to evaluate example based binary classification approaches in the sliding window mode of operation considering the output sequence . First, previous approaches do not explain how to interpret such an undesirable situation. Moreover, their performance metrics are blind to these occurrences, because they are derived only from fixedlength examples, which all have an unambiguous binary label. Taking into account the continuous nature of the task, our metrics acknowledge this issue by counting at least one of the two positive intervals as a false positive if there was only one real signal within the corresponding time interval.
When evaluated on our full test set, our trained model is able to successfully recover approximately 89% of all injections, while on average producing a false positive about once every 19.5 minutes.
For a more fine grained analysis, we then split the positive examples (i.e., the ones that do contain an injection) in the test set into 30 bins based on their respective injection SNR. The bins are distributed equidistantly and cover the full injection SNR range of . On average, every bin therefore contains examples. We then compute the detection ratio independently for each of these bins to investigate how the sensitivity of our method scales with the faintness of the signals. The results in Figure 4 show that the detection ratio increases steeply with the injection SNR and achieves essentially 100% roughly at an SNR of 11.
For comparison, the threshold for a coincident trigger to even be analyzed within the PyCBC search pipeline is . At this injection SNR, our model already recovers more than 80% of all injected signals. Furthermore, the first ten real binary black hole mergers observed so far had network SNRs between and [12], which is well within the region in which our model has a virtually perfect detection ratio.
Next, we systematically investigate the effect of both the smoothing and thresholding parameters. To this end, we postprocess the raw network output on the test set with different sizes of the smoothing window (1, 2, 4, 8, 16, 32, 64, 128, and 256) and different thresholds (0.1, 0.3, 0.5, 0.7, and 0.9). In the parametric plot in Figure 4, we show the detection ratio and the inverse false positive rate averaged over the entire test set for each combination of parameter settings (meaning up and right are better). While there is no single best option, this plot shows that our two parameters provide clearly interpretable tuning knobs to choose an operating point by trading off the sensitivity and the false positive rate. Depending on the application requirements one may use this plot to optimize detection ratio at fixed false positive rate or vice versa.
In the next experiment, we evaluate our model’s ability to generalize from synthetic training data to real events. The first two observations announced during LIGO’s first observation run were GW150914 and GW151226 [7, 8]. These real signals were not included in the training data. At test time, we select an interval centered around the event times from the original recordings for both events, and apply the established whitening and bandpassing procedure. Both samples are then cropped to , again centered around the event time. After normalizing and passing them through the network, we apply our usual postprocessing steps, using a window size of 256 time steps for the smoothing and thresholding the result at .
The results in Figure 5 show that in both cases, the model was able to successfully recover the real GW signal at the correct position despite being slightly less accurate on the fainter event GW151226 (with a network SNR of 13) than the first observed event GW150914 (with a network SNR of 24) [7, 8]. The fainter example highlights the effect of postprocessing: Instead of causing multiple false positives when thresholding the raw network output directly, the additional smoothing step yields a single connected interval (i.e., a single predicted event time).
Finally, we also apply our trained network to all other events in the GWTC1 catalog [12], which consists of 11 confirmed binary mergers from both the first and second observation run of LIGO. Using the event data available from the GWOSC (which was preprocessed in the same way as before), we find that our network can indeed recover all known events, with the exception of GW170817. This is, however, not a surprise: While all other events are binary black hole mergers, and we also trained our model using simulated BBH waveforms, GW170817 is the only confirmed binary neutron star merger [14].
Lastly, the fact that we are able to also successfully recover the events from O2 after using only recordings from O1 to train also indicates that the model is, to a certain extent, robust to changes in the detector characteristics.



In a final experiment, we once more want to emphasize our call for caution when interpreting CNNs in the context of gravitationalwave searches. To address the question “What has the model actually learned?”, we use techniques inspired from activation maximization or feature visualization (see, e.g., [70, 71]), as well as adversarial examples or adversarial attacks (see, e.g., [72]
), which are currently active areas of research within the machine learning community. Specifically, we perform the following test in which we make use of the differentiability of our model to find examples of inputs which cause the network to produce a given target output:
We randomly select a noiseonly example (i.e., an example that does not contain an injection) from our testing set and crop it from the end to a length of . This is our initial network input.
Next, we generate a target label, which is about long ( minus the receptive field of the model) and zero everywhere except for the interval from to , where it takes on a value of 1.
If applicable, we enforce additional constraints on the inputs. For example, we pass the input through a function to create the physically nonsensical scenario of a strain that is strictly nonnegative (see first example in Figure 6 c).
We pass the constrained network input through the trained model from the previous experiments. We then compute a weighted sum of a binary crossentropy and a mean squared error loss between the network prediction and the target. The exact weighting depends on the optimization target.
Unlike when training a neural network, this loss is then not backpropagated to the weights of the network, which stay fixed during this experiment. Instead, the loss is backpropagated to the input, which is updated in order to minimize the loss.
We repeat this procedure (starting with enforcing possible constraints on the inputs) for 256 iterations, again using Adam as the optimizer, with an initial learning rate of . PyTorch’s default cosine annealing scheduler is used to gradually decrease the learning rate every epoch.
Finally, we compute the difference between the original network input and the optimized input. This can be interpreted as the hypothetical “signal”, which—when added into the pure noise example—makes our network produce the target output.
We repeat this procedure for different initial inputs and manually inspect the results in form of the hypothetical “signals” to check if they match our expectation: If the network had truly learned to respond only to gravitational waves, we would expect these hypothetical “signals” to closely resemble gravitationalwave signals.
However, while some of the inputs that have undergone the described optimization procedure do exhibit a chirplike structure (i.e., oscillations increasing in both amplitude and frequency), we find that this is not always the case; see panel a) and b) of Figure 6. Worse yet, we can also achieve the desired output even when imposing nonphysical constraints on the inputs. We investigate three types of such constraints: Firstly, we allow only nonnegative strain values. Secondly, we enforce the strain to be zero in a interval covering the interval in which the target output is one. Thirdly, we clip the network input values to a small interval around zero to minimize the overall amplitude. In all three cases, we can still find examples that obey the constraints and, when passed through the network, yield the desired target output. Examples for this are shown in panel c) of Figure 6.
Since we crafted these examples in a supervised fashion, one may argue that the cases in panel c) are unrealistically out of distribution, that is, they would never occur in real detector recordings and therefore do not lead to complications in practice. However, in particular the unconstrained examples in panel b) of Figure 6 are unsettling, because they illustrate just how easily the network can be fooled even by small changes in the inputs. These results suggest a detailed quantification of how contrived these hypothetical “signals” really are (measured by how likely they are to occur accidentally in future detector recordings) to assess whether one must account for them in the false positive rate. Without such an analysis the worry of overconfident positive CNN output on pure noise or faint nonGaussian transients remains.
In this work we provide an interdisciplinary, indepth analysis of the potential of deep convolutional neural networks (CNNs) as part of the effort around searching for gravitational waves from binary coalescences in strain data. First, we critically scrutinize both the methods as well as the contributions of existing works on this topic by carefully analyzing how standard machine learning approaches and metrics map to the specific task at hand. This analysis yields two major conclusions:
Let us now explain how we generate the labels, that is, the desired network output for a given input. In our case, the labels are also time series: Ideally, the network should mark the exact locations of coalescences. A natural way to represent this is a time series which is zero everywhere except at the event time where the signal in H1 reaches its maximum amplitude (where the label takes on a value of ).
From a practical machine learning point of view, however, this is problematic: such sparse signals do not contribute sufficiently to the average loss to keep the network from simply always predicting zero. To prevent this failure mode, instead of labelling a single time point, we choose a fixedwidth interval centered around the time when the injected signal in the H1 channel reaches its maximum amplitude. By construction of our data generation pipeline, this position is fixed for all examples. (Recall that our fully convolutional network architecture is by design unable to overfit to specific locations within input examples.) Thus, labels need not be pregenerated or stored, but can be computed on the fly during training or testing. By default, the labels width (i.e., the length of the symmetric interval around the event time in which the label time series takes on a value of ) is .
In section II we discussed the drawbacks of the true positive rate and the false positive rate as performance measures for gravitationalwave searches in the example based binary classification setup. The fact that our data generation pipeline also generates individual examples is merely to make training convenient. Our model could equivalently be trained on a single time series (of sufficient length) containing any number of injections at arbitrary locations. This is possible because our architecture does not perform binary classification on an example level, but outputs yet another time series. As a consequence, different performance metrics are required.
Our objective is to tag signals in the data by outputting a peak close to the actual coalescence time. This intuition can be formalized to obtain interpretable performance metrics using the following evaluation procedure:
We identify all intervals of value 1 in the smoothed and thresholded network output.
The interval centers are stored as candidate times.
For each candidate time , we test for coincidence with the ground truth injection time , that is, if . By default, we used . Note that
is another free, tuneable hyperparameter whose value will directly affect the performance metrics defined below.
If a candidate time passes this coincidence check, we count it as a true positive or detection (see note below); otherwise, it is a false positive.
Per example, there can only be one true positive. If multiple candidate times pass the coincidence test for a single example, only one of them is counted as a detection, while the others are false positives.
Note: We use the term detection in this context to refer to an injected signal which was successfully recovered (in the sense of the procedure described above) by the network. This is, however, purely for ease of terminology. A “detection” by the CNN cannot be compared to and must not be confused with the (statistically significant) detection of a gravitational wave as described in section II.2. Similarly, the false positive rate (see below) cannot directly be compared to a false alarm rate.
We can now discuss the network performance on the test set in terms of the detection ratio and the false positive ratio. The detection ratio is simply the number of injected signals in the test set that the network could recover, divided by the total number of injected signals. We therefore also call it sensitivity. The false positive ratio is the number of false positives divided by the number of all produced candidate times. It is thus an estimate of the error probability; the probability that any given candidate time does not coincide with an actual signal.
Additionally, we can also define the false positive rate: the total number of false positives divided by the combined duration of all samples in the test set. Its inverse is more intuitive: the inverse false positive rate is the average time between two false positives. Naturally, this number should be as high as possible, meaning false positives should be as infrequent as possible.
Again, note that our metrics do not rely on the existence of distinct examples, but could equally be evaluated on a single time series of arbitrary length containing multiple signals. To illustrate this key difference further, let us go back to our argument why the true positive rate and the false positive rate can not be used to evaluate example based binary classification approaches in the sliding window mode of operation considering the output sequence . First, previous approaches do not explain how to interpret such an undesirable situation. Moreover, their performance metrics are blind to these occurrences, because they are derived only from fixedlength examples, which all have an unambiguous binary label. Taking into account the continuous nature of the task, our metrics acknowledge this issue by counting at least one of the two positive intervals as a false positive if there was only one real signal within the corresponding time interval.
When evaluated on our full test set, our trained model is able to successfully recover approximately 89% of all injections, while on average producing a false positive about once every 19.5 minutes.
For a more fine grained analysis, we then split the positive examples (i.e., the ones that do contain an injection) in the test set into 30 bins based on their respective injection SNR. The bins are distributed equidistantly and cover the full injection SNR range of . On average, every bin therefore contains examples. We then compute the detection ratio independently for each of these bins to investigate how the sensitivity of our method scales with the faintness of the signals. The results in Figure 4 show that the detection ratio increases steeply with the injection SNR and achieves essentially 100% roughly at an SNR of 11.
For comparison, the threshold for a coincident trigger to even be analyzed within the PyCBC search pipeline is . At this injection SNR, our model already recovers more than 80% of all injected signals. Furthermore, the first ten real binary black hole mergers observed so far had network SNRs between and [12], which is well within the region in which our model has a virtually perfect detection ratio.
Next, we systematically investigate the effect of both the smoothing and thresholding parameters. To this end, we postprocess the raw network output on the test set with different sizes of the smoothing window (1, 2, 4, 8, 16, 32, 64, 128, and 256) and different thresholds (0.1, 0.3, 0.5, 0.7, and 0.9). In the parametric plot in Figure 4, we show the detection ratio and the inverse false positive rate averaged over the entire test set for each combination of parameter settings (meaning up and right are better). While there is no single best option, this plot shows that our two parameters provide clearly interpretable tuning knobs to choose an operating point by trading off the sensitivity and the false positive rate. Depending on the application requirements one may use this plot to optimize detection ratio at fixed false positive rate or vice versa.
In the next experiment, we evaluate our model’s ability to generalize from synthetic training data to real events. The first two observations announced during LIGO’s first observation run were GW150914 and GW151226 [7, 8]. These real signals were not included in the training data. At test time, we select an interval centered around the event times from the original recordings for both events, and apply the established whitening and bandpassing procedure. Both samples are then cropped to , again centered around the event time. After normalizing and passing them through the network, we apply our usual postprocessing steps, using a window size of 256 time steps for the smoothing and thresholding the result at .
The results in Figure 5 show that in both cases, the model was able to successfully recover the real GW signal at the correct position despite being slightly less accurate on the fainter event GW151226 (with a network SNR of 13) than the first observed event GW150914 (with a network SNR of 24) [7, 8]. The fainter example highlights the effect of postprocessing: Instead of causing multiple false positives when thresholding the raw network output directly, the additional smoothing step yields a single connected interval (i.e., a single predicted event time).
Finally, we also apply our trained network to all other events in the GWTC1 catalog [12], which consists of 11 confirmed binary mergers from both the first and second observation run of LIGO. Using the event data available from the GWOSC (which was preprocessed in the same way as before), we find that our network can indeed recover all known events, with the exception of GW170817. This is, however, not a surprise: While all other events are binary black hole mergers, and we also trained our model using simulated BBH waveforms, GW170817 is the only confirmed binary neutron star merger [14].
Lastly, the fact that we are able to also successfully recover the events from O2 after using only recordings from O1 to train also indicates that the model is, to a certain extent, robust to changes in the detector characteristics.



In a final experiment, we once more want to emphasize our call for caution when interpreting CNNs in the context of gravitationalwave searches. To address the question “What has the model actually learned?”, we use techniques inspired from activation maximization or feature visualization (see, e.g., [70, 71]), as well as adversarial examples or adversarial attacks (see, e.g., [72]
), which are currently active areas of research within the machine learning community. Specifically, we perform the following test in which we make use of the differentiability of our model to find examples of inputs which cause the network to produce a given target output:
We randomly select a noiseonly example (i.e., an example that does not contain an injection) from our testing set and crop it from the end to a length of . This is our initial network input.
Next, we generate a target label, which is about long ( minus the receptive field of the model) and zero everywhere except for the interval from to , where it takes on a value of 1.
If applicable, we enforce additional constraints on the inputs. For example, we pass the input through a function to create the physically nonsensical scenario of a strain that is strictly nonnegative (see first example in Figure 6 c).
We pass the constrained network input through the trained model from the previous experiments. We then compute a weighted sum of a binary crossentropy and a mean squared error loss between the network prediction and the target. The exact weighting depends on the optimization target.
Unlike when training a neural network, this loss is then not backpropagated to the weights of the network, which stay fixed during this experiment. Instead, the loss is backpropagated to the input, which is updated in order to minimize the loss.
We repeat this procedure (starting with enforcing possible constraints on the inputs) for 256 iterations, again using Adam as the optimizer, with an initial learning rate of . PyTorch’s default cosine annealing scheduler is used to gradually decrease the learning rate every epoch.
Finally, we compute the difference between the original network input and the optimized input. This can be interpreted as the hypothetical “signal”, which—when added into the pure noise example—makes our network produce the target output.
We repeat this procedure for different initial inputs and manually inspect the results in form of the hypothetical “signals” to check if they match our expectation: If the network had truly learned to respond only to gravitational waves, we would expect these hypothetical “signals” to closely resemble gravitationalwave signals.
However, while some of the inputs that have undergone the described optimization procedure do exhibit a chirplike structure (i.e., oscillations increasing in both amplitude and frequency), we find that this is not always the case; see panel a) and b) of Figure 6. Worse yet, we can also achieve the desired output even when imposing nonphysical constraints on the inputs. We investigate three types of such constraints: Firstly, we allow only nonnegative strain values. Secondly, we enforce the strain to be zero in a interval covering the interval in which the target output is one. Thirdly, we clip the network input values to a small interval around zero to minimize the overall amplitude. In all three cases, we can still find examples that obey the constraints and, when passed through the network, yield the desired target output. Examples for this are shown in panel c) of Figure 6.
Since we crafted these examples in a supervised fashion, one may argue that the cases in panel c) are unrealistically out of distribution, that is, they would never occur in real detector recordings and therefore do not lead to complications in practice. However, in particular the unconstrained examples in panel b) of Figure 6 are unsettling, because they illustrate just how easily the network can be fooled even by small changes in the inputs. These results suggest a detailed quantification of how contrived these hypothetical “signals” really are (measured by how likely they are to occur accidentally in future detector recordings) to assess whether one must account for them in the false positive rate. Without such an analysis the worry of overconfident positive CNN output on pure noise or faint nonGaussian transients remains.
In this work we provide an interdisciplinary, indepth analysis of the potential of deep convolutional neural networks (CNNs) as part of the effort around searching for gravitational waves from binary coalescences in strain data. First, we critically scrutinize both the methods as well as the contributions of existing works on this topic by carefully analyzing how standard machine learning approaches and metrics map to the specific task at hand. This analysis yields two major conclusions:
When evaluated on our full test set, our trained model is able to successfully recover approximately 89% of all injections, while on average producing a false positive about once every 19.5 minutes.
For a more fine grained analysis, we then split the positive examples (i.e., the ones that do contain an injection) in the test set into 30 bins based on their respective injection SNR. The bins are distributed equidistantly and cover the full injection SNR range of . On average, every bin therefore contains examples. We then compute the detection ratio independently for each of these bins to investigate how the sensitivity of our method scales with the faintness of the signals. The results in Figure 4 show that the detection ratio increases steeply with the injection SNR and achieves essentially 100% roughly at an SNR of 11.
For comparison, the threshold for a coincident trigger to even be analyzed within the PyCBC search pipeline is . At this injection SNR, our model already recovers more than 80% of all injected signals. Furthermore, the first ten real binary black hole mergers observed so far had network SNRs between and [12], which is well within the region in which our model has a virtually perfect detection ratio.
Next, we systematically investigate the effect of both the smoothing and thresholding parameters. To this end, we postprocess the raw network output on the test set with different sizes of the smoothing window (1, 2, 4, 8, 16, 32, 64, 128, and 256) and different thresholds (0.1, 0.3, 0.5, 0.7, and 0.9). In the parametric plot in Figure 4, we show the detection ratio and the inverse false positive rate averaged over the entire test set for each combination of parameter settings (meaning up and right are better). While there is no single best option, this plot shows that our two parameters provide clearly interpretable tuning knobs to choose an operating point by trading off the sensitivity and the false positive rate. Depending on the application requirements one may use this plot to optimize detection ratio at fixed false positive rate or vice versa.
In the next experiment, we evaluate our model’s ability to generalize from synthetic training data to real events. The first two observations announced during LIGO’s first observation run were GW150914 and GW151226 [7, 8]. These real signals were not included in the training data. At test time, we select an interval centered around the event times from the original recordings for both events, and apply the established whitening and bandpassing procedure. Both samples are then cropped to , again centered around the event time. After normalizing and passing them through the network, we apply our usual postprocessing steps, using a window size of 256 time steps for the smoothing and thresholding the result at .
The results in Figure 5 show that in both cases, the model was able to successfully recover the real GW signal at the correct position despite being slightly less accurate on the fainter event GW151226 (with a network SNR of 13) than the first observed event GW150914 (with a network SNR of 24) [7, 8]. The fainter example highlights the effect of postprocessing: Instead of causing multiple false positives when thresholding the raw network output directly, the additional smoothing step yields a single connected interval (i.e., a single predicted event time).
Finally, we also apply our trained network to all other events in the GWTC1 catalog [12], which consists of 11 confirmed binary mergers from both the first and second observation run of LIGO. Using the event data available from the GWOSC (which was preprocessed in the same way as before), we find that our network can indeed recover all known events, with the exception of GW170817. This is, however, not a surprise: While all other events are binary black hole mergers, and we also trained our model using simulated BBH waveforms, GW170817 is the only confirmed binary neutron star merger [14].
Lastly, the fact that we are able to also successfully recover the events from O2 after using only recordings from O1 to train also indicates that the model is, to a certain extent, robust to changes in the detector characteristics.



In a final experiment, we once more want to emphasize our call for caution when interpreting CNNs in the context of gravitationalwave searches. To address the question “What has the model actually learned?”, we use techniques inspired from activation maximization or feature visualization (see, e.g., [70, 71]), as well as adversarial examples or adversarial attacks (see, e.g., [72]
), which are currently active areas of research within the machine learning community. Specifically, we perform the following test in which we make use of the differentiability of our model to find examples of inputs which cause the network to produce a given target output:
We randomly select a noiseonly example (i.e., an example that does not contain an injection) from our testing set and crop it from the end to a length of . This is our initial network input.
Next, we generate a target label, which is about long ( minus the receptive field of the model) and zero everywhere except for the interval from to , where it takes on a value of 1.
If applicable, we enforce additional constraints on the inputs. For example, we pass the input through a function to create the physically nonsensical scenario of a strain that is strictly nonnegative (see first example in Figure 6 c).
We pass the constrained network input through the trained model from the previous experiments. We then compute a weighted sum of a binary crossentropy and a mean squared error loss between the network prediction and the target. The exact weighting depends on the optimization target.
Unlike when training a neural network, this loss is then not backpropagated to the weights of the network, which stay fixed during this experiment. Instead, the loss is backpropagated to the input, which is updated in order to minimize the loss.
We repeat this procedure (starting with enforcing possible constraints on the inputs) for 256 iterations, again using Adam as the optimizer, with an initial learning rate of . PyTorch’s default cosine annealing scheduler is used to gradually decrease the learning rate every epoch.
Finally, we compute the difference between the original network input and the optimized input. This can be interpreted as the hypothetical “signal”, which—when added into the pure noise example—makes our network produce the target output.
We repeat this procedure for different initial inputs and manually inspect the results in form of the hypothetical “signals” to check if they match our expectation: If the network had truly learned to respond only to gravitational waves, we would expect these hypothetical “signals” to closely resemble gravitationalwave signals.
However, while some of the inputs that have undergone the described optimization procedure do exhibit a chirplike structure (i.e., oscillations increasing in both amplitude and frequency), we find that this is not always the case; see panel a) and b) of Figure 6. Worse yet, we can also achieve the desired output even when imposing nonphysical constraints on the inputs. We investigate three types of such constraints: Firstly, we allow only nonnegative strain values. Secondly, we enforce the strain to be zero in a interval covering the interval in which the target output is one. Thirdly, we clip the network input values to a small interval around zero to minimize the overall amplitude. In all three cases, we can still find examples that obey the constraints and, when passed through the network, yield the desired target output. Examples for this are shown in panel c) of Figure 6.
Since we crafted these examples in a supervised fashion, one may argue that the cases in panel c) are unrealistically out of distribution, that is, they would never occur in real detector recordings and therefore do not lead to complications in practice. However, in particular the unconstrained examples in panel b) of Figure 6 are unsettling, because they illustrate just how easily the network can be fooled even by small changes in the inputs. These results suggest a detailed quantification of how contrived these hypothetical “signals” really are (measured by how likely they are to occur accidentally in future detector recordings) to assess whether one must account for them in the false positive rate. Without such an analysis the worry of overconfident positive CNN output on pure noise or faint nonGaussian transients remains.
In this work we provide an interdisciplinary, indepth analysis of the potential of deep convolutional neural networks (CNNs) as part of the effort around searching for gravitational waves from binary coalescences in strain data. First, we critically scrutinize both the methods as well as the contributions of existing works on this topic by carefully analyzing how standard machine learning approaches and metrics map to the specific task at hand. This analysis yields two major conclusions:
In this work we provide an interdisciplinary, indepth analysis of the potential of deep convolutional neural networks (CNNs) as part of the effort around searching for gravitational waves from binary coalescences in strain data. First, we critically scrutinize both the methods as well as the contributions of existing works on this topic by carefully analyzing how standard machine learning approaches and metrics map to the specific task at hand. This analysis yields two major conclusions:
The following list explains the different parameters and the distributions from which their values are randomly sampled before being passed as inputs to the SEOBNRv4 waveform model in order to simulate synthetic gravitationalwave signals.
, Backpropagation applied to handwritten zip code recognition,
Neural Computation 1 (1989).A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep convolutional neural networks, in
Advances in Neural Information Processing Systems (NeurIPS) (2012)., Searching for pulsars using image pattern recognition,
The Astrophysical Journal 781, 117 (2014).
Comments
There are no comments yet.