Crowdsourcing, also known as citizen science, is revolutionizing the way real-world data sets are obtained [1, 2]. Traditionally, the task of labelling has been accomplished by a single expert annotator, in a process that is time-consuming, expensive, and difficult to scale. The proliferation of web services such as Amazon Mechanical Turk (www.mturk.com) and Figure Eight (www.figure-eight.com, formerly CrowdFlower) allows this process to be outsourced to a distributed workforce that collaborates virtually, sharing the effort among a huge number of annotators [3, 4]. This approach is rapidly growing in popularity, and is being applied to many different fields, such as medical imaging, genetics, remote sensing, topic modelling, and object segmentation.
A very recent application of crowdsourcing in astrophysics is the GravitySpy project, which aims at detecting glitches in the Laser Interferometer Gravitational-Wave Observatory (LIGO). The LIGO collaboration is one of the most exciting and recognized international scientific initiatives. It was awarded the 2017 Nobel Prize in Physics for the first empirical detection of a gravitational wave, in September 2015. These waves are ripples in the fabric of spacetime whose existence was predicted by Einstein's theory of General Relativity in 1916, and they open a whole new way to explore the universe, beyond the electromagnetic signals available so far. However, the LIGO detector is equipped with extremely delicate technology that is sensitive to many different sources of noise. This produces a wide variety of glitches (see figure 1), which make the detection of true gravitational waves difficult. The goal of GravitySpy is to leverage citizen science to label the large data set of glitches produced by LIGO, and then to develop a machine learning system (based on crowdsourcing methods) that helps astrophysicists classify them.
The crowdsourcing scenario introduces new challenges in machine learning, such as combining the unknown expertise of annotators, dealing with disagreements on the labelled samples, or detecting the existence of spammer and adversarial annotators [14, 15]. The first approaches to deal with multiply-annotated data relied on some kind of label aggregation mechanism prior to training. The most straightforward one is majority voting, which assumes that every annotator is equally reliable. More elaborate methods consider the biases of the different annotators, yielding a better calibrated set of training labels; see [17, 18] and the approach that is usually considered the first crowdsourcing work. In all these cases, the idea is to obtain a set of clean true labels, which are then fed to any preferred standard (non-crowdsourcing) classification algorithm.
However, recent works show that jointly modelling the classifier and the annotators' behavior leads to superior performance, since the features provide information to puzzle out the true labels [19, 20]. In this joint setting, Bayesian methods based on Gaussian Processes (GPs) have proved extremely successful at accurately quantifying uncertainty [21, 22, 8]. However, in real-world applications they have been gradually replaced by deep learning based approaches [5, 23, 24], since GPs do not scale well to large data sets [8, 25]. As a result, the sound probabilistic formulation of GPs has been sacrificed, even though large-scale problems could greatly benefit from it. In particular, in order to develop a reliable glitch detection system, the astrophysicists of the GravitySpy project are particularly interested in the Bayesian formulation given by GPs. Therefore, their scalability issues must be addressed.
In this work, we extend the sparse GP approximation behind SVGP to the multiply-annotated crowdsourcing setting. Interestingly, the form of the Evidence Lower Bound (ELBO) remains suitable for Stochastic Variational Inference, which allows for training through mini-batches. To the best of our knowledge, this allows GPs to be used for crowdsourcing problems of virtually any size for the first time. This novel method is referred to as Scalable (or Sparse) Variational Gaussian Processes for Crowdsourcing (SVGPCR). The annotators' noise model is also fully Bayesian, described by per-user confusion matrices which are assigned Dirichlet priors. The underlying true labels are modelled in a probabilistic manner as well. Variational inference is used to approximate the posterior distribution of the model.
In order to deal with the LIGO data, SVGPCR is formulated and implemented as a multi-class method. The implementation is based on GPflow, a very popular GP library that benefits from GPU acceleration through TensorFlow. Three sets of experiments are provided. First, a controlled crowdsourcing problem built on MNIST illustrates the main properties and behavior of SVGPCR. Among these, we may highlight its accurate identification of the annotators' degree of expertise, its reconstruction of the real underlying labels, and how the number of inducing points influences its performance. Second, SVGPCR is compared against previous probabilistic crowdsourcing methods on a relevant binary LIGO problem (most of these previous probabilistic crowdsourcing approaches were originally proposed for binary problems, and the available code is restricted accordingly). SVGPCR stands out as the indisputably best performing approach, thanks to its innovative scalability through mini-batches. Third, SVGPCR is shown to outperform state-of-the-art DL-based methods on the full LIGO data set, especially in terms of test likelihood, due to its more robust uncertainty control.
2 Probabilistic model and inference
This section introduces the theoretical formulation of the proposed method. It follows the rationale of previous GP-based crowdsourcing approaches [21, 22], but achieves scalability through the sparse GP approximation behind SVGP. Figure 2 shows a graphical representation of the proposed model, which will be useful throughout this section.
2.1 The model
In a crowdsourcing problem with $K$ classes, we observe training data $\{(\mathbf{x}_n, \{\mathbf{y}_n^a\}_{a \in A_n})\}_{n=1}^N$, where $\mathbf{x}_n$ are the training features, and $\{\mathbf{y}_n^a\}$ is the set of annotations provided by the $a$-th annotator for the $n$-th instance (notice that annotators are allowed to label the same instance more than once, possibly with different labels; this happens in a few cases in the LIGO data). That is, each $\mathbf{y}_n^a \in \{0,1\}^K$ is a one-hot vector that represents the annotated class (i.e., all of its elements are zero but the $c$-th one, which is one). There are $N$ data points, $A$ annotators, and $A_n$ contains the annotators that labelled the $n$-th instance. All training instances are grouped in $\mathbf{X}$, and analogously all annotations in $\mathbf{Y}$.
For each instance $\mathbf{x}_n$, there exists an unknown real label $\mathbf{z}_n \in \{0,1\}^K$ (also one-hot encoded). The actual annotations depend on this real label and on the degree of expertise of each annotator, which is modelled by the confusion matrix $\mathbf{R}^a = (r_{ij}^a) \in [0,1]^{K \times K}$. Each $r_{ij}^a$ represents the probability that the $a$-th annotator labels as class $i$ an instance whose real class is $j$. Notice that this matrix must add up to one by columns. Mathematically, this is given by

$$p(\mathbf{y}_n^a \mid \mathbf{z}_n, \mathbf{R}^a) = \prod_{i=1}^{K} \prod_{j=1}^{K} (r_{ij}^a)^{y_{n,i}^a \, z_{n,j}}. \qquad (1)$$
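As an illustration of this annotation mechanism, one can sample noisy labels by drawing from the column of a confusion matrix selected by the true class. The sketch below uses a small hypothetical 3-class matrix (values and names are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confusion matrix R^a for one annotator: element [i, j] is the
# probability of answering class i when the real class is j (columns sum to one).
R = np.array([[0.8, 0.1, 0.2],
              [0.1, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
assert np.allclose(R.sum(axis=0), 1.0)

def annotate(true_class, R, rng):
    """Draw a one-hot annotation for an instance whose real class is `true_class`."""
    label = rng.choice(len(R), p=R[:, true_class])
    one_hot = np.zeros(len(R))
    one_hot[label] = 1.0
    return one_hot

y = annotate(true_class=1, R=R, rng=rng)
```

Sampling many annotations for the same true class recovers the corresponding column of the confusion matrix, which is the empirical counterpart of eq. (1).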
Assuming that all annotators label the different instances independently, we have

$$p(\mathbf{Y} \mid \mathbf{Z}, \mathbf{R}) = \prod_{n=1}^{N} \prod_{a \in A_n} p(\mathbf{y}_n^a \mid \mathbf{z}_n, \mathbf{R}^a), \qquad (2)$$

where $\mathbf{Z}$ and $\mathbf{R}$ group the corresponding individual variables, and $p(\mathbf{y}_n^a \mid \mathbf{z}_n, \mathbf{R}^a)$ is given by eq. (1).
Independent Dirichlet priors are placed on the columns of the confusion matrices,

$$p(\mathbf{R}) = \prod_{a=1}^{A} \prod_{j=1}^{K} \mathrm{Dir}(\mathbf{r}_{:j}^a \mid \boldsymbol{\alpha}_{:j}), \qquad (3)$$

where $\mathbf{r}_{:j}^a$ denotes the $j$-th column of the confusion matrix $\mathbf{R}^a$. The hyperparameters $\boldsymbol{\alpha}_{:j}$ codify any prior belief on the behavior of the annotators. If this is not available, the default choice $\boldsymbol{\alpha}_{:j} = (1, \dots, 1)^\intercal$ corresponds to uniform distributions.
For each instance, the true underlying label $\mathbf{z}_n$ is modelled through $K$ latent variables $\mathbf{f}_{n:} = (f_n^1, \dots, f_n^K)$. Both parts are related by means of the likelihood model

$$p(\mathbf{z}_n \mid \mathbf{f}_{n:}), \qquad (4)$$

which can be any vector with positive components that add up to $1$. In this work we will use the popular robust-max likelihood, which prevents overfitting in GP classification. It is given by $p(\mathbf{z}_n = c \mid \mathbf{f}_{n:}) = 1 - \varepsilon$ for $c = \arg\max_k f_n^k$, and $\varepsilon/(K-1)$ for the rest. The value of $\varepsilon$ is set to its default value and kept fixed. This likelihood is implemented in the GPflow library, and in practice it can be substituted by any other one available in GPflow, for instance the soft-max likelihood, which generalizes the sigmoid likelihood to multi-class, i.e. $p(\mathbf{z}_n = c \mid \mathbf{f}_{n:}) \propto \exp(f_n^c)$.
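A minimal sketch of the robust-max construction just described follows (the `eps` default below is our assumption for illustration; GPflow's `RobustMax` implements the same idea):

```python
import numpy as np

def robustmax(f, eps=1e-3):
    """Robust-max likelihood p(z = c | f): the arg-max class receives
    probability 1 - eps, and every other class eps / (K - 1)."""
    f = np.asarray(f)
    K = f.shape[-1]
    p = np.full(K, eps / (K - 1))
    p[np.argmax(f)] = 1.0 - eps
    return p

p = robustmax([0.2, 1.5, -0.3])
```

By construction the output is a valid probability vector, and it is almost insensitive to the magnitude of the latent values, which is what makes it robust against overfitting.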
Assuming that the underlying real labels for the different instances are independent given the latent variables, we have

$$p(\mathbf{Z} \mid \mathbf{F}) = \prod_{n=1}^{N} p(\mathbf{z}_n \mid \mathbf{f}_{n:}), \qquad (5)$$

where $p(\mathbf{z}_n \mid \mathbf{f}_{n:})$ is given by eq. (4), and $\mathbf{F}$ gathers the latent variables for the $N$ instances. Specifically, $\mathbf{F}$ is an $N \times K$ matrix, whose $(n, k)$ term is the value of the $k$-th latent variable for the $n$-th instance. As usual, the $n$-th row of $\mathbf{F}$ is denoted by $\mathbf{f}_{n:}$, and the $k$-th column by $\mathbf{f}_{:k}$.
Finally, independent GP priors are utilized for the $K$ latent variables. This yields the joint prior

$$p(\mathbf{F}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{f}_{:k} \mid \mathbf{0}, \mathbf{K}_{\mathbf{XX}}^k), \qquad (6)$$

where $\mathbf{K}_{\mathbf{XX}}^k$ is the kernel matrix evaluated at the training inputs with the kernel hyperparameters $\boldsymbol{\theta}_k$ for the $k$-th GP. In this work we will use the well-known squared exponential kernel, $k(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left(-\|\mathbf{x} - \mathbf{x}'\|^2 / (2\ell^2)\right)$, which has the hyperparameters of variance $\sigma^2$ and length-scale $\ell$. However, as before, the flexibility of GPflow allows us to use any other kernel.
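For concreteness, the squared exponential kernel matrix between two sets of inputs can be computed as follows (a plain NumPy sketch; GPflow provides an equivalent, optimized implementation):

```python
import numpy as np

def se_kernel(X1, X2, variance=1.0, lengthscale=1.0):
    """Squared exponential kernel:
    k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    # Pairwise squared distances via the expansion ||a - b||^2 = a.a + b.b - 2 a.b
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

X = np.array([[0.0], [1.0], [2.0]])
Kxx = se_kernel(X, X)
```

The resulting matrix is symmetric, has the variance on its diagonal, and decays with the squared distance between inputs, as eq. (6) requires.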
In summary, the full probabilistic model is given by

$$p(\mathbf{Y}, \mathbf{Z}, \mathbf{R}, \mathbf{F}) = p(\mathbf{Y} \mid \mathbf{Z}, \mathbf{R}) \, p(\mathbf{R}) \, p(\mathbf{Z} \mid \mathbf{F}) \, p(\mathbf{F}). \qquad (7)$$
In order to introduce the sparse GP approximation, let us expand this model by introducing $M$ inducing points for each GP. Namely, each GP prior $p(\mathbf{f}_{:k})$ can be naively rewritten as the marginal of $p(\mathbf{f}_{:k} \mid \mathbf{u}_{:k}) \, p(\mathbf{u}_{:k})$, where $\mathbf{u}_{:k}$ are the inducing points. These represent the value of the $k$-th GP at $M$ new locations called inducing inputs, $\tilde{\mathbf{X}} = \{\tilde{\mathbf{x}}_1, \dots, \tilde{\mathbf{x}}_M\}$, just like $\mathbf{f}_{:k}$ does for $\mathbf{X}$. (Notice that the inducing locations do not depend on $k$: although different inducing locations could be used for each GP from a theoretical viewpoint, in practice they are usually taken to be the same. However, the inducing points do depend on $k$, as each GP models a different function.) Analogously to $\mathbf{F}$, we write $\mathbf{U}$ for the matrix gathering all the inducing points, whose rows and columns are denoted by $\mathbf{u}_{m:}$ and $\mathbf{u}_{:k}$ respectively. Then, if the joint GP prior is factorized as $p(\mathbf{F}, \mathbf{U}) = p(\mathbf{F} \mid \mathbf{U}) \, p(\mathbf{U})$, the model in eq. (7) can be analogously rewritten as

$$p(\mathbf{Y}, \mathbf{Z}, \mathbf{R}, \mathbf{F}, \mathbf{U}) = p(\mathbf{Y} \mid \mathbf{Z}, \mathbf{R}) \, p(\mathbf{R}) \, p(\mathbf{Z} \mid \mathbf{F}) \, p(\mathbf{F} \mid \mathbf{U}) \, p(\mathbf{U}). \qquad (8)$$
It is worth stressing that, by marginalizing out $\mathbf{U}$, this model is equivalent to the one in eq. (7). This is important because sparse GP approximations fall into two big categories: those which approximate the model and perform exact inference (like FITC), and those which keep the model unaltered and introduce the approximation at the inference step. Our approach, like SVGP, belongs to the second group, and the approximation is carried out next.
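The equivalence relies on the standard Gaussian conditional $p(\mathbf{f} \mid \mathbf{u})$ for two jointly Gaussian vectors. The following sketch (our own numerical check, with a small SE kernel helper) computes that conditional and verifies a sanity property: evaluated at the inducing inputs themselves, the conditional mean recovers $\mathbf{u}$ and the conditional covariance vanishes.

```python
import numpy as np

def se(X1, X2, var=1.0, ell=1.0):
    # Squared exponential kernel, as in the model above
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return var * np.exp(-0.5 * d2 / ell**2)

def conditional(X, X_ind, u, jitter=1e-10):
    """Mean and covariance of p(f | u) for one latent GP:
    mean = Kfu Kuu^{-1} u,   cov = Kff - Kfu Kuu^{-1} Kuf."""
    Kuu = se(X_ind, X_ind) + jitter * np.eye(len(X_ind))
    Kfu = se(X, X_ind)
    A = np.linalg.solve(Kuu, Kfu.T)          # Kuu^{-1} Kuf
    return Kfu @ np.linalg.solve(Kuu, u), se(X, X) - Kfu @ A

X_ind = np.array([[0.0], [1.0]])
u = np.array([0.5, -0.3])
mean, cov = conditional(X_ind, X_ind, u)     # evaluate at the inducing inputs
```

This is the distribution $p(\mathbf{f}_{:k} \mid \mathbf{u}_{:k})$ that the approximate posterior below keeps unaltered.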
2.2 Variational inference
Given the model in section 2.1, an exact solution would involve calculating the marginal likelihood $p(\mathbf{Y})$ in order to estimate the optimal kernel hyperparameters, and then obtaining the posterior $p(\mathbf{F}, \mathbf{U}, \mathbf{Z}, \mathbf{R} \mid \mathbf{Y})$. However, integrating out $\mathbf{F}$, $\mathbf{U}$, $\mathbf{Z}$, and $\mathbf{R}$ is analytically intractable, and we resort to variational inference to approximate the computations.
The core of variational inference is the following decomposition of the log marginal likelihood associated to the observations, which is straightforward and holds for any distribution $q(\mathbf{F}, \mathbf{U}, \mathbf{Z}, \mathbf{R})$ (in order to lighten the notation, we use the integral symbol also for the discrete variable $\mathbf{Z}$):

$$\log p(\mathbf{Y}) = \mathrm{KL}\left(q(\mathbf{F}, \mathbf{U}, \mathbf{Z}, \mathbf{R}) \,\|\, p(\mathbf{F}, \mathbf{U}, \mathbf{Z}, \mathbf{R} \mid \mathbf{Y})\right) + \int q(\mathbf{F}, \mathbf{U}, \mathbf{Z}, \mathbf{R}) \log \frac{p(\mathbf{Y}, \mathbf{F}, \mathbf{U}, \mathbf{Z}, \mathbf{R})}{q(\mathbf{F}, \mathbf{U}, \mathbf{Z}, \mathbf{R})} \, \mathrm{d}(\mathbf{F}, \mathbf{U}, \mathbf{Z}, \mathbf{R}). \qquad (9)$$

The distribution $q$ must be understood as an approximation to the true posterior $p(\mathbf{F}, \mathbf{U}, \mathbf{Z}, \mathbf{R} \mid \mathbf{Y})$. The second term on the right hand side of eq. (9) is called the Evidence Lower Bound (ELBO), since it is a lower bound for the model evidence or log marginal likelihood (recall that the first term, the KL divergence, is always non-negative, and is zero if and only if both distributions coincide). Moreover, notice that this KL divergence is precisely between the approximate posterior and the true one.
The idea of variational inference is to propose a parametric form for $q$. Then, the ELBO in eq. (9) is maximized with respect to these new variational parameters, the kernel hyperparameters $\boldsymbol{\theta}$, and the inducing locations $\tilde{\mathbf{X}}$ (which are not usually considered fixed). Notice that, by maximizing the ELBO, we are at the same time maximizing a lower bound on the log marginal likelihood and minimizing the KL divergence between $q$ and the real posterior (just solve for the ELBO in eq. (9)). Thus, variational inference converts the problem of posterior distribution approximation into an optimization one [37, Section 10.1], which in practice is addressed through optimization algorithms such as the Adam optimizer.
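The decomposition in eq. (9) can be checked numerically on a tiny discrete model (our own toy example, not from the paper): for any choice of $q$, the log evidence equals the KL term plus the ELBO, and the bound is tight exactly when $q$ is the true posterior.

```python
import numpy as np

# Toy model: latent z in {0, 1}, observation y in {0, 1}
prior = np.array([0.6, 0.4])        # p(z)
lik = np.array([[0.9, 0.2],         # lik[y, z] = p(y | z)
                [0.1, 0.8]])
y = 1

evidence = float(lik[y] @ prior)                 # p(y) = sum_z p(y|z) p(z)
posterior = lik[y] * prior / evidence            # p(z | y)

q = np.array([0.5, 0.5])                         # an arbitrary approximation q(z)
elbo = float(np.sum(q * np.log(lik[y] * prior / q)))   # E_q[log p(y, z) / q(z)]
kl = float(np.sum(q * np.log(q / posterior)))          # KL(q || p(z | y))
# Decomposition: log p(y) = KL(q || p(z|y)) + ELBO, for any q
```

Maximizing the ELBO over `q` therefore drives `kl` to zero, which is exactly the mechanism exploited by the parametric form proposed next.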
Here, the following parametric form is proposed for $q$:

$$q(\mathbf{F}, \mathbf{U}, \mathbf{Z}, \mathbf{R}) = p(\mathbf{F} \mid \mathbf{U}) \, q(\mathbf{U}) \, q(\mathbf{Z}) \, q(\mathbf{R}), \qquad (10)$$
$$q(\mathbf{Z}) = \prod_{n=1}^{N} \mathrm{Cat}(\mathbf{z}_n \mid \mathbf{q}_n), \qquad (11)$$
$$p(\mathbf{F} \mid \mathbf{U}) = \prod_{k=1}^{K} p(\mathbf{f}_{:k} \mid \mathbf{u}_{:k}), \qquad (12)$$
$$q(\mathbf{U}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{u}_{:k} \mid \mathbf{m}_k, \mathbf{S}_k), \qquad (13)$$
$$q(\mathbf{R}) = \prod_{a=1}^{A} \prod_{j=1}^{K} \mathrm{Dir}(\mathbf{r}_{:j}^a \mid \boldsymbol{\beta}_{:j}^a). \qquad (14)$$

The proposed posterior on $\mathbf{Z}$ factorizes across data points, and each $\mathbf{q}_n$ describes the probability that each class is the real one for $\mathbf{x}_n$ (i.e., $q_{nk} = q(z_{nk} = 1)$). The prior conditional $p(\mathbf{F} \mid \mathbf{U})$ does not introduce any new variational parameter. The posterior on $\mathbf{U}$ factorizes across dimensions, and each factor is given by a Gaussian with mean $\mathbf{m}_k$ and (positive-definite) covariance matrix $\mathbf{S}_k$. Finally, $q(\mathbf{R})$ factorizes across annotators and dimensions, and the factors are assigned Dirichlet distributions with parameters $\boldsymbol{\beta}_{:j}^a$. In the sequel, all these variational parameters $\{\mathbf{q}_n, \mathbf{m}_k, \mathbf{S}_k, \boldsymbol{\beta}_{:j}^a\}$ will be jointly denoted by $\mathbf{V}$.
In the proposed form described by eqs. (10)–(14), the prior conditional $p(\mathbf{F} \mid \mathbf{U})$ arises in a natural way if the GP values $\mathbf{F}$ are assumed conditionally independent of any other value given the inducing points $\mathbf{U}$. This is the original assumption of Titsias, and it intuitively implies that all the information is condensed by, and propagated through, the inducing points. This form of $p(\mathbf{F} \mid \mathbf{U})$, plus that of $q(\mathbf{U})$, is at the core of the sparse GP approximation that we are inspired by, SVGP. The distributions $q(\mathbf{Z})$ and $q(\mathbf{R})$ are given the functional form that would arise if a mean-field approach were applied [37, Eq. (10.9)]. For that, the conjugacy between the Dirichlet distribution in $q(\mathbf{R})$ and the categorical in $p(\mathbf{Y} \mid \mathbf{Z}, \mathbf{R})$ is essential.
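To see why this conjugacy singles out the Dirichlet family, consider the pseudo-count update that a mean-field derivation would produce: the Dirichlet parameters are the prior hyperparameters plus the expected counts of "the annotator answered $i$ when the true class was $j$", with the expectation taken under $q(\mathbf{Z})$. The sketch below illustrates this update on a hypothetical 3-class example with one annotator (an illustration of the conjugacy argument, not necessarily the exact optimization scheme used in practice):

```python
import numpy as np

K = 3
alpha = np.ones((K, K))          # uniform Dirichlet prior, one column per true class

# q[n, j]: current approximate posterior probability that instance n is of class j
q = np.array([[0.9, 0.05, 0.05],
              [0.1, 0.80, 0.10]])

# One annotator labelled instance 0 as class 0 and instance 1 as class 1 (one-hot rows)
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

# Conjugate update: beta[i, j] = alpha[i, j] + sum_n q[n, j] * 1[annotation for n is i]
beta = alpha + Y.T @ q
```

Each annotation contributes exactly one unit of pseudo-count, spread across the columns according to the current belief about the true class of that instance.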
Now, we can compute the explicit expression for the ELBO in our case, which must be maximized w.r.t. $\mathbf{V}$, $\boldsymbol{\theta}$, and $\tilde{\mathbf{X}}$:

$$\mathrm{ELBO} = \sum_{n=1}^{N} \sum_{a \in A_n} \mathbb{E}_{q(\mathbf{z}_n) q(\mathbf{R})} \log p(\mathbf{y}_n^a \mid \mathbf{z}_n, \mathbf{R}^a) + \sum_{n=1}^{N} \mathbb{E}_{q(\mathbf{f}_{n:}) q(\mathbf{z}_n)} \log p(\mathbf{z}_n \mid \mathbf{f}_{n:}) - \sum_{n=1}^{N} \mathbb{E}_{q(\mathbf{z}_n)} \log q(\mathbf{z}_n) - \mathrm{KL}\left(q(\mathbf{U}) \,\|\, p(\mathbf{U})\right) - \mathrm{KL}\left(q(\mathbf{R}) \,\|\, p(\mathbf{R})\right). \qquad (15)$$

A detailed derivation of this expression is provided in the supplemental material. Notice that the inclusion of the prior conditional $p(\mathbf{F} \mid \mathbf{U})$ in the approximate posterior makes the cancellation of this factor possible, which is essential for the scalability of the method. All the terms in eq. (15) but the second one can be expressed in closed form as a function of $\mathbf{V}$, $\boldsymbol{\theta}$, and $\tilde{\mathbf{X}}$. The second one can be approximated explicitly through Gaussian–Hermite quadrature, which is already implemented in GPflow for many different likelihoods (like the robust-max used here). Further details and the specific expressions can be found in the supplemental material. As a summary, Table I shows which parameters each term in eq. (15) depends on.
Table I. ELBO terms in eq. (15) and the parameters each of them depends on.
Importantly, observe that the expression for the ELBO factorizes across data points, which allows for stochastic optimization through mini-batches. To the best of our knowledge, this allows GP-based crowdsourcing methods to scale up to previously prohibitive data sets for the first time. More specifically, the computational cost of evaluating the ELBO in eq. (15) is independent of the training set size $N$: it depends only on the mini-batch size $S$, the number of inducing points $M$, the number of classes $K$, and the number of annotations per instance in the mini-batch. This extends the scalability of the sparse GP approximation SVGP to the crowdsourcing setting. It is also interesting to compare eq. (15) with the expression for the ELBO in SVGP [32, Eq. (19)]. The second and fourth terms, which come from the GP prior and the classification likelihood, are analogous to the two terms in SVGP. The other three terms arise naturally from the crowdsourcing modelling.
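The mechanism behind mini-batch training is that any data sum, such as the first three terms of eq. (15), admits an unbiased estimate from a random mini-batch after rescaling by $N/S$. The following sketch (our own illustration, with Gaussian noise standing in for the per-instance terms) verifies this unbiasedness empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
N, S = 1000, 50
per_point = rng.normal(size=N)       # stand-in for the per-instance ELBO terms
full_sum = per_point.sum()

def minibatch_estimate(per_point, S, rng):
    """Unbiased estimate of the full data sum: scale the mini-batch sum by N / S."""
    idx = rng.choice(len(per_point), size=S, replace=False)
    return len(per_point) / S * per_point[idx].sum()

estimates = [minibatch_estimate(per_point, S, rng) for _ in range(10000)]
```

Averaged over many mini-batches, the estimates concentrate around the full-data sum, which is what makes noisy gradient steps with Adam converge to the same optimum as full-batch optimization.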
Once the ELBO is maximized w.r.t. $\mathbf{V}$, $\boldsymbol{\theta}$, and $\tilde{\mathbf{X}}$, we can make predictions for previously unseen data points. Given a new $\mathbf{x}_*$, we have

$$q(\mathbf{f}_*) = \int p(\mathbf{f}_* \mid \mathbf{U}) \, q(\mathbf{U}) \, \mathrm{d}\mathbf{U}, \qquad (16)$$

where $\mathbf{f}_*$ stands for $\mathbf{f}(\mathbf{x}_*)$, and we use the values of $\{\mathbf{m}_k\}$, $\{\mathbf{S}_k\}$, $\boldsymbol{\theta}$, and $\tilde{\mathbf{X}}$ estimated after training. The predictive distribution on the real label is obtained as $q(\mathbf{z}_*) = \mathbb{E}_{q(\mathbf{f}_*)}\left[p(\mathbf{z}_* \mid \mathbf{f}_*)\right]$. For classification likelihoods like ours, this is computed by GPflow through Gaussian–Hermite quadrature. Moreover, as we will illustrate in the experiments, the posterior distributions $q(\mathbf{Z})$ and $q(\mathbf{R})$ provide an estimation of the underlying real labels of the training points and of the annotators' degree of expertise, respectively. Finally, in order to exploit GPU acceleration through TensorFlow, the novel SVGPCR is implemented within the popular GP framework GPflow. The code will be made publicly available on GitHub upon acceptance of the paper, and will be listed in the “projects using GPflow” section of the GPflow site https://github.com/GPflow/GPflow.
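Since both $p(\mathbf{f}_* \mid \mathbf{U})$ and $q(\mathbf{U})$ are Gaussian, the integral in eq. (16) is available in closed form. The sketch below computes it for one latent GP (a plain NumPy illustration under the standard SVGP predictive formulas; the inducing inputs, `m`, and `S` values are hypothetical):

```python
import numpy as np

def se(X1, X2, var=1.0, ell=1.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return var * np.exp(-0.5 * d2 / ell**2)

def predict(X_star, X_ind, m, S, jitter=1e-8):
    """Closed-form q(f*) for one latent GP:
    mean = K*u Kuu^{-1} m
    cov  = K** - K*u Kuu^{-1} (Kuu - S) Kuu^{-1} Ku*"""
    Kuu = se(X_ind, X_ind) + jitter * np.eye(len(X_ind))
    Ksu = se(X_star, X_ind)
    A = np.linalg.solve(Kuu, Ksu.T)          # Kuu^{-1} Ku*
    mean = Ksu @ np.linalg.solve(Kuu, m)
    cov = se(X_star, X_star) - A.T @ (Kuu - S) @ A
    return mean, cov

X_ind = np.array([[0.0], [1.0]])
m = np.array([1.0, -1.0])
S = 0.01 * np.eye(2)
mean, cov = predict(np.array([[0.0]]), X_ind, m, S)
```

At an inducing input itself, the predictive mean essentially recovers the corresponding entry of `m` and the predictive variance collapses to the corresponding entry of `S`, showing how the variational distribution over the inducing points drives the predictions.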
3 LIGO data description
The Laser Interferometer Gravitational-Wave Observatory (LIGO) is a large-scale physics experiment and observatory built to detect gravitational waves (GWs). These are ripples in spacetime produced by non-symmetric movements of masses, and their energy is much higher for events such as binary black hole or neutron star mergers. Their existence is a direct consequence of the theory of General Relativity postulated in 1916. However, Albert Einstein himself believed they would be extremely difficult to detect with any technology foreseen at that time.
The first direct observation of GWs was made one hundred years later by LIGO, on September 14th, 2015. The discovery had a tremendous impact on the scientific community, not only as an empirical validation of one of the most recognized physics theories, but also as a whole new way to explore the universe. Until then, astrophysicists could perceive outer space only through one “sense” (electromagnetic radiation), but were “deaf” to GWs. This detection inaugurated the new era of so-called GW astronomy, and was awarded the 2017 Nobel Prize in Physics.
To identify GWs, LIGO is able to detect changes in the length of its 4-kilometer arms of around one ten-thousandth of the width of a proton. This is proportionally equivalent to measuring the distance to the nearest star outside the Solar System to within the width of a human hair. Such precision requires cutting-edge technology that is also extremely sensitive to different instrumental and environmental sources of noise. In the spectrograms that astrophysicists analyze to search for GWs, this contamination manifests itself in the form of glitches: noisy patterns that adopt many different morphologies. The presence of these glitches hinders the detection of true GWs. Figure 3 shows the 15 types of glitches considered in this work, which are described later.
The goal of the GravitySpy project is to develop a system to accurately classify the different types of glitches. This would help astrophysicists gain insights into their taxonomy and potential causes, enhancing the detection of true GWs. Since LIGO produces a constant stream of data, GravitySpy leverages crowdsourcing through the Zooniverse platform (https://www.zooniverse.org/projects/zooniverse/gravity-spy) in order to label a training set. Then, machine learning crowdsourcing algorithms that can learn from this multiply-annotated data must be applied (like the SVGPCR presented here).
Our training set contains 173,565 instances (glitches) and 1,828,981 annotations (i.e., a mean of more than 10 labels per instance), which have been provided by 3,443 collaborators through the Zooniverse platform. These instances are time-frequency plots (spectrograms) like those in figures 1 and 3, taken with four time windows. For each one, we will use 256 relevant features extracted in previous work. These glitches have been classified into 15 different classes proposed by astrophysicists (recall figure 3). Next, we provide a brief description of them (see the references for a more detailed explanation).
1080 Line: It appears as a string of short yellow dots, always around 1080 Hz. It was reduced after an update in 2017, although it is still present.
1400 Ripple: Glitches of 0.05 s or longer around 1400 Hz. So far, their origin is unknown. They are commonly confused with 1080 Line and Violin Mode Harmonic.
Blip: Short glitches with a symmetric “teardrop” shape in time-frequency. Blips are extremely important since they hamper the detection of binary black hole mergers .
Extremely Loud: These are caused by major disturbances, such as an actuator reaching the end of its range and “railing”, or a photodiode “saturating”. They look very bright, due to their very high energy.
Koifish: Similar to Blips, but resembles a fish, with the head at the low-frequency end, pectoral fins around 30 Hz, and a thin tail around 500 Hz. LIGO scientists do not understand the physical origin of this glitch.
Low Frequency Burst: Resembles a hump with a nearly triangular shape growing from low frequency to a peak, and then dying back down in one or two seconds. It is caused by scattered light driven by motion of the output mirrors.
Low Frequency Lines: These appear as horizontal lines at low frequencies. They can be confused with Scattered Light (which shows some curvature) and with Low Frequency Bursts (Low Frequency Lines continue to look like a line in the 4 s window).
No Glitch: No glitch refers to images that do not have any glitch visible at all. The spectrograms would appear dark blue with only small fluctuations.
Other: This category is a catch-all for glitches that do not fit into the other categories. Therefore, it presents a great variability in its morphology.
Power-line 60 Hz: In the US, mains power is alternating current at 60 Hz. When equipment running on this power switches on or off, glitches can occur at 60 Hz or its harmonics (120 Hz, 180 Hz, …). These glitches usually look narrow in frequency, centered around 60 Hz or its harmonics.
Repeating blips: Analogous to blips, but repeat at regular intervals, usually every 0.125, 0.25 or 0.5 seconds.
Scattered Light: After hitting optical components, some light from the LIGO beam is scattered. It may then reflect off other objects and re-enter the beam with a different phase. It usually looks like upward humps with frequency below 30 Hz. It hinders searches for binary neutron stars, neutron star–black hole binaries, and binary black holes.
Scratchy: Wide band of mid-frequency signals that looks like a ripply path through time-frequency space. This glitch hampers searches for binary black hole mergers.
Violin Mode Harmonic: Test masses in LIGO are suspended from fibers with resonances. These are called violin modes, as they resemble the resonances of violin strings. Thermal excitations of the fibers produce movements at the violin mode frequencies, centered around 500 Hz. Thus, these glitches are short and located around 500 Hz and its harmonics.
Whistle: These glitches usually appear with a characteristic W or V shape. They are caused by radio-frequency signals beating with the LIGO Voltage Controlled Oscillators. Whistles mainly contaminate searches for binary black hole mergers.
For testing purposes, the astrophysicists at GravitySpy have labelled a set of instances, including glitches from all the types explained above.
4 Experimental results
In this section, the proposed SVGPCR is empirically validated and compared against current methods, with a special focus on the LIGO data introduced in the previous section. Three blocks of experiments are presented in sections 4.1, 4.2 and 4.3. Firstly, the behavior of SVGPCR is thoroughly analyzed in a controlled crowdsourcing experiment based on the popular MNIST set. Secondly, SVGPCR is compared with previous probabilistic (mainly GP-based) approaches on the LIGO data. Since most of these methods were proposed for binary problems, we consider a binary task relevant to the GravitySpy project. Thirdly, SVGPCR is compared against state-of-the-art DL-based crowdsourcing methods in the full LIGO data set.
4.1 Understanding the proposed method
Before comparing against other crowdsourcing methodologies, let us analyze the behavior and main properties of the novel SVGPCR. To do so, we simulate five different crowdsourcing annotators for the well-known MNIST data set. The availability of simulated annotators and real training labels for this image data set constitutes a controlled setting that allows for a comprehensive analysis.
We use the standard train/test split of MNIST, with 60K/10K hand-written digits from 0 to 9 (a multi-class problem with 10 classes). Notice that 60K training instances is already prohibitive for standard GPs. Five decreasingly reliable annotators are considered. The first one has a fixed accuracy for each class, that is, $r^1_{cc}$ is the same for every class $c$ (the remaining values in each column, $r^1_{ic}$ with $i \neq c$, are randomly assigned to add up to the remaining probability). The second and third ones are defined analogously, but with lower accuracies. The fourth one is a spammer annotator, that is, $r^4_{ic} = 1/10$ for all $i, c$. This implies that, regardless of the real class, this annotator assigns a random label. The fifth one is an adversarial annotator: with high probability, it labels as the $(c+1)$-th class an instance whose real class is the $c$-th (samples in class 9 are assigned to class 0). The confusion matrices for these annotators are depicted in the first row of figure 4 (the rest of the figure will be explained later). The five annotators label all the instances, which yields 300K annotations that are used to train SVGPCR.
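The three annotator archetypes can be generated as follows (a sketch with hypothetical accuracy values of our choosing, since the exact values used in the experiment are given in figure 4):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # MNIST digits

def reliable(acc, rng):
    """Column j has `acc` at the true class j; the remaining probability mass
    is spread randomly over the other classes (columns sum to one)."""
    R = np.zeros((K, K))
    for j in range(K):
        off = rng.dirichlet(np.ones(K - 1)) * (1.0 - acc)
        R[:, j] = np.insert(off, j, acc)
    return R

# Spammer: a uniformly random label, whatever the true class
spammer = np.full((K, K), 1.0 / K)

# Adversarial: with high probability (0.9 here, a hypothetical value),
# class c is labelled as class c + 1, and class 9 wraps around to class 0
shift = np.roll(np.eye(K), 1, axis=0)
adversarial = 0.9 * shift + 0.1 / (K - 1) * (1.0 - shift)

R1 = reliable(0.8, rng)  # hypothetical accuracy for the first annotator
```

All three constructions produce valid confusion matrices (columns summing to one), matching the model of section 2.1.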
Figure 4. Confusion matrices for Annotators 1 to 5: the true ones (top row) and the SVGPCR estimates (bottom row).
Since the true labels for the training instances are available, let us start by comparing SVGPCR with its theoretical upper bound, namely SVGP trained with the true labels, which we refer to as SVGP-gold. Indeed, since SVGPCR can be considered an extension of SVGP to the noisy scenario of crowdsourcing (recall section 2), the performance of the former can be expected to be bounded by that of the latter. Table II shows the global and per-class test accuracy and mean test likelihood for both approaches. Importantly, notice that the results are very similar for all classes and both metrics, and SVGPCR almost reaches the same global performance as SVGP-gold (in spite of the corrupted labels provided by the annotators).
Table II. Global and per-class test accuracy and mean test likelihood for SVGP-gold and SVGPCR.
This excellent performance of SVGPCR can be explained by its accurate estimation of the annotators' behavior, which in turn allows SVGPCR to properly reconstruct the underlying true labels from the noisy annotations. Firstly, figure 4 shows the exceptional estimations obtained by SVGPCR for the annotators' confusion matrices. Recall from eq. (14) that the expertise degree of the annotators is estimated through posterior Dirichlet distributions; the bottom row of figure 4 shows the mean of those distributions. Interestingly, the maximum variance of those distributions was negligible, which implies a high degree of certainty about the predictions in figure 4. Secondly, as previously mentioned, this allows SVGPCR to correctly puzzle out the underlying true labels from the noisy annotations. In fact, table III shows the excellent per-class and global performance of SVGPCR in this sense (recall that SVGPCR estimates the underlying true labels through the approximate posterior in eq. (11)).
More in depth, we have analyzed the 20 examples where SVGPCR fails to reconstruct the true label, and some of them can certainly be considered hard ones. Figure 5 shows four of them, along with the probabilities assigned by SVGPCR to each. In all cases, the true label is assigned the second highest probability by SVGPCR, and the digit presents some feature that leads to confusion with the class to which SVGPCR assigns more probability.
Another key aspect of the novel SVGPCR is the role of the inducing points. In this example we are using $M = 100$, and the next experiment will be devoted to analyzing the influence of $M$ on the performance of SVGPCR. But first, figure 6 shows the locations to which 30 of the 100 inducing points have converged after training (recall that the ELBO in eq. (15) is also maximized w.r.t. the inducing locations $\tilde{\mathbf{X}}$). For instance, the first column shows the locations of three inducing points which are classified as the same digit by SVGPCR (according to the estimated $q(\mathbf{U})$, recall eq. (13)), and analogously for the rest of the columns. It is very interesting to notice that the inducing point locations comprise different calligraphic representations (in terms of shape, orientation, and thickness) of the same digit. This is related to their intuitive role as entities that summarize the training data.
Next, let us study the influence of $M$ (the number of inducing points) on the behavior of SVGPCR. Figure 7 shows the dependence on $M$ of four different metrics: two measures of the test performance (accuracy and mean likelihood), and two related to the computational cost (at the training and test steps). As expected from the theoretical formulation in section 2, a greater number of inducing points implies a higher test performance (in both accuracy and mean likelihood), since the expressiveness of the model is higher. However, this also leads to heavier training and test costs, since there are more parameters to be estimated (the inducing locations $\tilde{\mathbf{X}}$, $\{\mathbf{m}_k\}$, and $\{\mathbf{S}_k\}$), and the size of several matrices increases.
Moreover, for a given $M$, the model is expected to obtain better test performance as training time evolves (i.e., as more epochs are run). In order to further investigate this, figure 8 shows the test accuracy of SVGPCR as training time evolves, for different values of $M$. It is interesting to observe that, the more inducing points, the higher the test accuracy that can potentially be reached, but also the greater the amount of training time needed to reach it (notice that the steps that take each curve to the level of its final precision happen increasingly later). The conclusion is that, for a given computational budget, the $M$ to be selected is the highest one that can reach convergence in that time (logically, assuming that it allows for the inversion of the associated $M \times M$ kernel matrix).
Finally, since the associated code can leverage GPU acceleration through GPflow, let us compare the CPU and GPU implementations. Figure 9 shows that, for training, the GPU is usually the preferred choice, unless the mini-batch size is very small, in which case the amount of memory copies from CPU to GPU does not compensate for the speed-up provided by the latter. At test time, the GPU is always faster, since much less data transfer between CPU and GPU is involved.
4.2 Comparison to classical probabilistic approaches
As explained in section 1, the most popular approaches to crowdsourcing jointly model a classifier for the underlying true labels along with the annotators' behavior. The first works used basic logistic regression as the classifier, e.g. Raykar and Yan (the difference between them is the noise model considered for the annotators). However, they struggled when dealing with complex non-linear data sets. Then, Gaussian Processes became the preferred choice, since their non-parametric form and accurate uncertainty quantification yielded much better results, e.g. Rodr14 and VGPCR (they differ in the inference procedure used, Expectation Propagation and Variational Inference, respectively). However, the poor scalability of GPs hampered the wide adoption of these approaches in practice. This motivated the development of the so-called RFF and VFF algorithms, which leverage Random Fourier Feature approximations to GPs to propose two more scalable GP-based crowdsourcing methods. These approaches significantly improve the scalability, reducing the cubic cost in the training set size to a linear one (times the number of Fourier frequencies used). In practice, this allows considerably larger data sets to be handled. However, RFF and VFF do not factorize in mini-batches, which prevents them from reaching data sets of virtually any size.
In the last few years, these classical (mainly GP-based) approaches have been replaced by crowdsourcing methods based on Deep Learning (DL) [5, 23]. These achieve excellent scalability through mini-batches, and can handle data sets of almost any size. Because of this, they have become the state-of-the-art approach for real-world crowdsourcing problems. In section 4.3, we will bring GP-based methods back to a state-of-the-art level: we will show that the novel SVGPCR is competitive with DL-based methods, and additionally provides very accurate uncertainty control. But before that, it is worth analyzing the advances that SVGPCR introduces over its classical (mainly GP-based) predecessors.
More specifically, let us compare SVGPCR with the aforementioned Raykar and Yan (based on logistic regression), Rodr14 and VGPCR (based on GPs), and RFF and VFF (based on scalable approximations to GPs). Since most of them were formulated for binary problems, we consider a binary task relevant to the astrophysicists in GravitySpy. Using the data set presented in section 3, the goal is to distinguish between the glitch class called “Other” and the rest of the types. This is important in order to identify potential overlaps between that catch-all class and the rest of the glitches. Moreover, it introduces an imbalanced scenario, since “Other” represents only a small fraction of the total amount of annotations. We will use the area under the ROC curve (AUC) as the test performance metric.
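For an imbalanced binary task like this one, the AUC is a natural metric because it is insensitive to the class proportions. It can be computed directly from its rank (Mann–Whitney) formulation, as the following sketch shows (our own illustration; in practice a library routine such as scikit-learn's would typically be used):

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen positive
    receives a higher score than a randomly chosen negative (ties count 1/2)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

labels = np.array([1, 1, 0, 0])
scores = np.array([0.9, 0.4, 0.6, 0.2])
a = auc(scores, labels)
```

A perfect ranking yields an AUC of 1, a reversed ranking yields 0, and random scores yield about 0.5, regardless of how rare the positive class is.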
We first compare the scalability of the methods as the training set grows. The novel SVGPCR clearly stands out as the most scalable approach. This can be attributed to its mini-batch training scheme, which considerably alleviates the dependence on the training set size. The rest of the methods break down at different points: the heavy EP inference of Rodr14 only allows for training on small subsets, while the GP-based formulation of VGPCR and the complex annotator noise model of Yan let them go only somewhat further, with difficulty. In spite of its GP approximation, VFF cannot reach the full data set in this problem, because of the expensive optimization of the Fourier features. Finally, Raykar (which is based on cheap logistic regression) and RFF (which does not optimize over the Fourier features) can cope with the full data set, although they are significantly slower than SVGPCR.
Moreover, figure 11 shows that their test performance falls well short of that of the novel SVGPCR. Indeed, the logistic regression model underlying Raykar is not sufficient for the non-linear problem at hand, and the GP approximation provided by RFF is known to be poor when the dimensionality of the problem is high (as here, where we work with high-dimensional features, recall section 3). The rest of the methods are also clearly outperformed, since their limited scalability prevents them from processing the full data set. Interestingly, figure 11 shows an intuitive and logical structure: the simpler logistic-regression-based methods are located on the left (lower test AUC), the classical GP-based ones in the central part, and the novel SVGPCR on the right.
4.3 Comparison with state of the art DL-based methods
The state-of-the-art DL-based methods considered here are AggNet [5] and the crowd layers of [23]. The former considers a deep neural network (DNN) as the underlying classifier, and a probabilistic noise model for the annotators based on per-user confusion matrices; training then follows an iterative expectation-maximization (EM) scheme between both parts of the model [37, Section 9.4]. Alternatively, the crowd layers of [23] allow for end-to-end training of the DNN, without the need for the EM scheme. This is significantly cheaper in terms of computational cost, although the probabilistic formulation of AggNet allows for better uncertainty quantification. The three crowd layers studied in [23] will be considered here: CL-VW, CL-VWB and CL-MW. They differ in the parametric form of the annotator noise model, which is increasingly complex: a vector of per-class weights for CL-VW, an additional bias for CL-VWB, and a whole confusion matrix for CL-MW.
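The EM alternation between a classifier and per-annotator confusion matrices can be illustrated, setting the DNN aside, with a minimal Dawid–Skene-style routine. This is an illustrative sketch (uniform class prior, toy data), not AggNet's actual implementation:

```python
import numpy as np

def em_confusion(Y, K, n_iter=50):
    """Minimal Dawid-Skene-style EM over per-annotator confusion matrices.
    Y[n, a] is annotator a's label for instance n (-1 means unlabelled).
    Returns soft true-label posteriors q (N x K) and confusion matrices
    pi (A x K x K), indexed as pi[a, true_class, observed_label]."""
    N, A = Y.shape
    # Initialise the E-step with (soft) majority voting.
    q = np.ones((N, K))
    for n in range(N):
        for a in range(A):
            if Y[n, a] >= 0:
                q[n, Y[n, a]] += 1
    q /= q.sum(1, keepdims=True)
    for _ in range(n_iter):
        # M-step: re-estimate each confusion matrix from the soft labels.
        pi = np.full((A, K, K), 1e-6)
        for n in range(N):
            for a in range(A):
                if Y[n, a] >= 0:
                    pi[a, :, Y[n, a]] += q[n]
        pi /= pi.sum(2, keepdims=True)
        # E-step: recompute the label posteriors under the noise model.
        log_q = np.zeros((N, K))
        for n in range(N):
            for a in range(A):
                if Y[n, a] >= 0:
                    log_q[n] += np.log(pi[a, :, Y[n, a]])
        q = np.exp(log_q - log_q.max(1, keepdims=True))
        q /= q.sum(1, keepdims=True)
    return q, pi

# Three annotators label four instances of a binary task; the third is noisy.
Y = np.array([[0, 0, 1],
              [0, 0, 0],
              [1, 1, 0],
              [1, 1, 1]])
q, pi = em_confusion(Y, K=2)
print(q.argmax(1))  # → [0 0 1 1]
```

In AggNet-like models the majority-vote initialisation and the E-step are replaced by the DNN's predictive probabilities, which is what makes the scheme iterative rather than end-to-end.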
These four DL-based methods (AggNet, CL-VW, CL-VWB, CL-MW) are compared against three increasingly complex SVGPCR models: SVGPCR-10, SVGPCR-50 and SVGPCR-100, where each number is the number of inducing points (M) used. As all these approaches are defined for multi-class tasks, the full LIGO problem described in section 3 can now be addressed.
Tables IV and V show the global and per-class test performance of the compared methods. Table IV is devoted to the test accuracy, which relies only on the mode of the predictive distribution and is less influenced by the uncertainty quantification of the model. Table V shows the test likelihood, which additionally depends on the uncertainty of the predictive distribution, and therefore depends more heavily on its accurate control within the model.
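The distinction between the two metrics can be made concrete: test likelihood rewards well-calibrated confidence that accuracy ignores. A minimal sketch with hypothetical predictive probabilities (two models with identical accuracy but different confidence):

```python
import numpy as np

def mean_test_loglik(probs, y_true, eps=1e-12):
    """Mean predictive log-likelihood over a test set. Unlike accuracy,
    it depends on the probability assigned to the correct class, so it
    is sensitive to the quality of the whole predictive distribution."""
    probs = np.asarray(probs, dtype=float)
    idx = np.arange(len(y_true))
    return float(np.mean(np.log(probs[idx, np.asarray(y_true)] + eps)))

confident = np.array([[0.9, 0.1], [0.1, 0.9]])
hesitant = np.array([[0.6, 0.4], [0.4, 0.6]])
y = np.array([0, 1])
# Both classify every instance correctly (same accuracy), but only the
# confident model is rewarded by the likelihood metric.
print(mean_test_loglik(confident, y) > mean_test_loglik(hesitant, y))  # → True
```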
In both tables, SVGPCR stands out as the best-performing method globally. The difference is greater in the case of the test likelihood, which is naturally explained by the excellent uncertainty quantification of GPs. Indeed, this better control of uncertainty also explains why AggNet outperforms the CL-based methods in test likelihood (whereas they are very similar in accuracy). Moreover, observe that the global superiority of SVGPCR is not due to a great result in only one or two very populated classes. Instead, SVGPCR performs consistently well across the 15 glitch types in both tables, winning in several of them (a few more in test likelihood, as expected). According to astrophysicists at GravitySpy, this regularity across classes is a very desirable property for a reliable glitch detection system.
It is also worth noticing that M = 50 inducing points seem to be enough for the problem at hand. In both tables IV and V, a significant improvement is observed from M = 10 to M = 50, but M = 100 produces very similar results. This small value of M hints at a not very complex internal structure of the data. It is also interesting to observe that, in general, the most difficult classes are “Repeating Blips” and “Other” (recall the 15 types in figure 3). This is not surprising to astrophysicists in GravitySpy, since the former is usually confused with “Blips”, and the latter is a catch-all class to which some conservative annotators resort too often. The case of “Other” is also related to the astrophysicists’ interest in studying it separately in the experiment of section 4.2.
[Table IV: test accuracy of AggNet, CL-VW, CL-VWB, CL-MW, and SVGPCR with M = 10, 50 and 100; entries not recovered.]
[Table V: test likelihood of AggNet, CL-VW, CL-VWB, CL-MW, and SVGPCR with M = 10, 50 and 100; entries not recovered.]
It is also important to highlight that all these methods are scalable enough to cope with the full LIGO data set. More specifically, figure 12 shows the elapsed time at training and testing for the compared methods. In general, the proposed SVGPCR is competitive with DL-based methods in these aspects. At training, SVGPCR is significantly faster than AggNet, due to the heavy iterative EM scheme of the latter, and slower than CL-MW (the results of CL-VW/CL-VWB being worse than CL-MW in figure 12 might be attributed to implementation inefficiency, since the former include for-loops whereas the latter uses matrix multiplication). Nonetheless, less than one hour of training is a competitive result for a data set of this size (recall section 3). At testing, SVGPCR is the fastest approach, which is convenient for the real-time applications the system might be used in.
As already pointed out in Table V, the underlying GP model of SVGPCR implies an advantage over DL-based methods in terms of uncertainty quantification. Importantly, the test likelihood metric is a global measure of the quality of the predictive distribution obtained for each individual test instance. To clearly understand the benefits of the GP modelling, figure 13 shows the predictive distributions for some test instances which are behind the better global performance of SVGPCR. Only the best method (in terms of test likelihood) of each type (i.e. CL-based ones, SVGPCR ones, and AggNet) is considered, which yields the three columns in figure 13. Each row represents a different test instance.
Interestingly, we observe that the three approaches correctly classify the four instances, that is, they assign the highest probability to the correct class (highlighted in red). In particular, this means that these four instances contribute equally to the test accuracy of the three methods. However, notice that the quality of the predictive distribution worsens from left to right (i.e., from the method with the strongest theoretical uncertainty quantification properties to the weakest), since the methods become less certain about the correct answer and assign more probability to wrong ones. This is precisely what the test likelihood metric accounts for. From a practical perspective, this higher quality of the predictive distributions has been particularly appreciated by astrophysicists at GravitySpy, in addition to the improvement in test accuracy (recall table IV).
Finally, a key aspect of crowdsourcing methods is the identification of the annotators’ different behaviors. Unlike in section 4.1, where we used simulated annotators to check the quality of SVGPCR’s estimations, in this real experiment no ground truth is available. Nonetheless, let us compare the predictions obtained by the different methods. We will see that they capture similar patterns, some of which can be explained from the experience of astrophysicists. Figure 14 shows the confusion matrices predicted by the compared methods for five different annotators. In the CL-based family we only consider CL-MW, as it is the best in test likelihood and the only one which provides a confusion matrix.
One of the most distinctive features, for all instances and methods, is the predominance of high values on the diagonal. Astrophysicists considered this positive feedback, as it means that annotators have generally been well instructed to distinguish among glitches. Additionally, other off-diagonal patterns are worth analyzing. For the first column (first volunteer), SVGPCR and AggNet detect that glitches of type 1 (i.e. “1400Ripple”, recall figure 3) are classified as class 13 (“Violin Mode Harmonic”). This is a very frequent mistake according to experts, since the general appearance of both glitches is similar. We also observe that CL-MW does not agree on this prediction. This discrepancy of CL-MW for some particular patterns is recurrent across different annotators, and can be attributed to its different modelling of the annotator noise (not probabilistic, but through weights in the DNN). The second column shows a typical conservative annotator, who resorts too frequently to the catch-all “Other” class. This is reflected in the persistently high values of row number 8 in the matrices, regardless of the column (the real class). For the third column, the three methods identify the confusion from “Violin Mode Harmonic” to “1400Ripple”. Notice that this is the opposite of the first annotator, where the confusion went the other way round. For the fourth annotator, AggNet exhibits a noisy behavior compared to SVGPCR and CL-MW. Although perhaps less explicitly, this can also be observed across different annotators, and might be due to the iterative nature of AggNet, which does not allow for end-to-end learning and leaves some extra noise after training. For the fifth annotator, the three methods identify a very common confusion, which is labelling instances whose real class is “Blip” as “Koifish” (classes 2 and 4, respectively).
Although these glitches seem pretty different in the paradigmatic examples shown in figure 3, wider “Blip” and narrower “Koifish” are frequent in the data set, and might mislead a non-expert volunteer.
Most importantly, the identification of all these wrong behaviors allows crowdsourcing methods to take full advantage of the noisy annotations. It is also worth noticing that the Bayesian nature of SVGPCR provides uncertainties for the confusion matrices obtained here (recall the full posterior Dirichlet distributions in eq. (14)), which is not available for the DL-based methods.
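The closed-form uncertainty that a Dirichlet posterior provides for each confusion-matrix row can be sketched as follows, with hypothetical counts for one annotator (these numbers are illustrative, not taken from eq. (14) or the actual data):

```python
import numpy as np

# Hypothetical counts for one annotator and true class 0 of a 3-class task:
# how often that class was labelled as 0, 1 or 2 by this annotator.
prior = np.ones(3)                   # symmetric Dirichlet prior on the row
counts = np.array([40.0, 8.0, 2.0])  # observed (or expected) label counts
alpha = prior + counts               # Dirichlet posterior concentration
mean = alpha / alpha.sum()           # posterior mean of the confusion row
# Per-entry posterior standard deviation of the Dirichlet distribution.
std = np.sqrt(mean * (1 - mean) / (alpha.sum() + 1))
print(mean.round(3), std.round(3))
```

The standard deviations shrink as more annotations accumulate, so the model can distinguish a reliably estimated confusion pattern from one based on a handful of labels — exactly the information a point estimate of the confusion matrix cannot convey.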
5 Conclusions and future work
In this work we have introduced SVGPCR, a novel GP-based crowdsourcing classification algorithm that scales to very large data sets through its mini-batch training scheme. The motivation for this methodology is the problem of glitch classification in the Nobel Prize-winning LIGO project, which is addressed with crowdsourcing techniques in the GravitySpy sub-project. In order to obtain accurate predictive distributions, astrophysicists were interested in combining the excellent uncertainty quantification of GP-based crowdsourcing methods with the scalability of those based on deep learning (DL). The proposed SVGPCR resorts to the most popular sparse GP approximations in machine learning to make such a combination a reality, and brings GP-based methods back to the state of the art in crowdsourcing.
SVGPCR is competitive with DL-based approaches in terms of test accuracy and computational cost, and stands out in terms of the quality of its predictive distribution. Moreover, its behavior naturally follows its theoretical formulation: it provides very accurate estimations of the annotators’ degrees of expertise, and the inducing points influence the test performance and the computational cost as expected. Finally, the code is based on the popular GPflow library, which leverages GPU acceleration through TensorFlow.
In the LIGO problem, the glitches were described by relevant features extracted by astrophysicists. However, in the case of more complex data such as images, audio or natural language, DL-based methods can benefit from convolutional layers in the deep neural network. From a probabilistic perspective, this could be addressed through Deep Gaussian Processes and the very recent attempts to introduce convolutional structure in GPs [48, 49].
-  A. Irwin, “No phds needed: how citizen science is transforming research,” Nature, vol. 562, pp. 480–482, 2018.
-  C. J. Guerrini, M. A. Majumder, M. J. Lewellyn, and A. L. McGuire, “Citizen science, public policy,” Science, vol. 361, no. 6398, pp. 134–136, 2018.
-  R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng, “Cheap and fast, but is it good?: evaluating non-expert annotations for natural language tasks,” in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2008, pp. 254–263.
-  M. Buhrmester, T. Kwang, and S. D. Gosling, “Amazon’s mechanical turk: A new source of inexpensive, yet high-quality, data?” Perspectives on psychological science, vol. 6, no. 1, pp. 3–5, 2011.
-  S. Albarqouni, C. Baur, F. Achilles, V. Belagiannis, S. Demirci, and N. Navab, “AggNet: Deep Learning From Crowds for Mitosis Detection in Breast Cancer Histology Images,” IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1313–1321, 2016.
-  J. Saez-Rodriguez, J. C. Costello, S. H. Friend, M. R. Kellen, L. Mangravite, P. Meyer, T. Norman, and G. Stolovitzky, “Crowdsourcing biomedical research: leveraging communities as innovation engines,” Nature Reviews Genetics, vol. 17, no. 8, pp. 470–486, 2016.
-  S. Fritz, L. See, C. Perger, I. McCallum, C. Schill, D. Schepaschenko, M. Duerauer, M. Karner, C. Dresel, J.-C. Laso-Bayas et al., “A global dataset of crowdsourced land cover and land use reference data,” Scientific data, vol. 4, p. 170075, 2017.
-  F. Rodrigues, M. Lourenco, B. Ribeiro, and F. Pereira, “Learning supervised topic models for classification and regression from crowds,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2409–2422, 2017.
-  E. Heim, A. Seitel, J. Andrulis, F. Isensee, C. Stock, T. Ross, and L. Maier-Hein, “Clickstream analysis for crowd-based object segmentation with confidence,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 12, pp. 2814–2826, 2018.
-  M. Zevin, S. Coughlin, S. Bahaadini, E. Besler, N. Rohani, S. Allen et al., “Gravity spy: integrating advanced ligo detector characterization, machine learning, and citizen science,” Classical and Quantum Gravity, vol. 34, no. 6, p. 064003, 2017.
-  A. Abramovici, W. E. Althouse, R. W. P. Drever, Y. Gürsel, S. Kawamura, F. J. Raab, D. Shoemaker, L. Sievers, R. E. Spero, K. S. Thorne, R. E. Vogt, R. Weiss, S. E. Whitcomb, and M. E. Zucker, “Ligo: The laser interferometer gravitational-wave observatory,” Science, vol. 256, no. 5055, pp. 325–333, 1992.
-  B. P. Abbott et al. (LIGO Scientific Collaboration and Virgo Collaboration), “Observation of gravitational waves from a binary black hole merger,” Physical Review Letters, vol. 116, p. 061102, Feb 2016.
-  D. Castelvecchi and A. Witze, “Einstein’s gravitational waves found at last,” Nature news, 2016.
-  V. S. Sheng, F. Provost, and P. G. Ipeirotis, “Get another label? improving data quality and data mining using multiple, noisy labelers,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 614–622.
-  P. Donmez and J. G. Carbonell, “Proactive learning: cost-sensitive active learning with multiple imperfect oracles,” in ACM Conference on Information and Knowledge Management, 2008, pp. 619–628.
-  A. P. Dawid and A. M. Skene, “Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm,” Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 20–28, 1979.
-  P. G. Ipeirotis, F. Provost, and J. Wang, “Quality Management on Amazon Mechanical Turk,” in ACM SIGKDD Workshop on Human Computation (HCOMP’10), 2010, pp. 64–67.
-  J. Whitehill, T.-F. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo, “Whose vote should count more: Optimal integration of labels from labelers of unknown expertise,” in Advances in Neural Information Processing Systems (NIPS), 2009, pp. 2035–2043.
-  V. Raykar, S. Yu, L. Zhao, G. Hermosillo Valadez, C. Florin, L. Bogoni, and L. Moy, “Learning from crowds,” The Journal of Machine Learning Research, vol. 11, no. Apr, pp. 1297–1322, 2010.
-  Y. Yan, R. Rosales, G. Fung, R. Subramanian, and J. Dy, “Learning from multiple annotators with varying expertise,” Machine Learning, vol. 95, no. 3, pp. 291–327, 2014.
-  F. Rodrigues, F. Pereira, and B. Ribeiro, “Gaussian process classification and active learning with multiple annotators,” in International Conference on Machine Learning (ICML), 2014, pp. 433–441.
-  P. Ruiz, P. Morales-Álvarez, R. Molina, and A. K. Katsaggelos, “Learning from crowds with variational gaussian processes,” Pattern Recognition, vol. 88, pp. 298–311, 2019.
-  F. Rodrigues and F. Pereira, “Deep learning from crowds,” in Conference on Artificial Intelligence (AAAI), 2018, pp. 1611–1618.
-  M. Y. Guan, V. Gulshan, A. M. Dai, and G. E. Hinton, “Who said what: Modeling individual labelers improves classification,” in Conference on Artificial Intelligence (AAAI), 2018, pp. 3109–3118.
-  C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT, 2006.
-  A. Damianou, “Deep gaussian processes and variational propagation of uncertainty,” Ph.D. dissertation, University of Sheffield, 2015.
-  M. Bauer, M. van der Wilk, and C. E. Rasmussen, “Understanding probabilistic sparse gaussian process approximations,” in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 1533–1541.
-  E. Snelson and Z. Ghahramani, “Sparse gaussian processes using pseudo-inputs,” in Advances in Neural Information Processing Systems (NIPS), 2006, pp. 1257–1264.
-  M. Titsias, “Variational learning of inducing variables in sparse gaussian processes,” in International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 5, 2009, pp. 567–574.
-  P. Morales-Álvarez, A. Pérez-Suay, R. Molina, and G. Camps-Valls, “Remote sensing image classification with large-scale gaussian processes,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 2, pp. 1103–1114, 2018.
-  P. Morales-Álvarez, P. Ruiz, R. Santos-Rodríguez, R. Molina, and A. K. Katsaggelos, “Scalable and efficient learning from crowds with gaussian processes,” Information Fusion, vol. 52, pp. 110–127, 2019.
-  J. Hensman, A. Matthews, and Z. Ghahramani, “Scalable Variational Gaussian Process Classification,” in International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 38, 2015, pp. 351–360.
-  D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational inference: A review for statisticians,” Journal of the American Statistical Association, vol. 112, no. 518, pp. 859–877, 2017.
-  J. Hensman, N. Fusi, and N. D. Lawrence, “Gaussian processes for big data,” in Conference on Uncertainty in Artificial Intelligence (UAI), 2013, pp. 282–290.
-  M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic variational inference,” The Journal of Machine Learning Research, vol. 14, no. 1, pp. 1303–1347, 2013.
-  A. G. de G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman, “GPflow: A gaussian process library using TensorFlow,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 1299–1304, 2017.
-  C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
-  D. Hernández-Lobato, J. M. Hernández-Lobato, and P. Dupont, “Robust multi-class gaussian process classification,” in Advances in Neural Information Processing Systems (NIPS), 2011, pp. 280–288.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference for Learning Representations (ICLR), 2015.
-  F. W. Olver, D. W. Lozier, R. F. Boisvert, and C. W. Clark, NIST handbook of mathematical functions. Cambridge university press, 2010.
-  D. J. Kennefick, Traveling at the speed of thought: Einstein and the quest for gravitational waves. Princeton university press, 2016.
-  S. Bahaadini, V. Noroozi, N. Rohani, S. Coughlin, M. Zevin, J. Smith, V. Kalogera, and A. Katsaggelos, “Machine learning for gravity spy: Glitch classification and dataset,” Information Sciences, vol. 444, pp. 172–186, 2018.
-  B. P. Abbott, R. Abbott, T. Abbott, M. Abernathy, F. Acernese, K. Ackley, M. Adamo, C. Adams, T. Adams, P. Addesso et al., “Characterization of transient noise in advanced ligo relevant to gravitational wave signal gw150914,” Classical and Quantum Gravity, vol. 33, no. 13, p. 134001, 2016.
-  L. K. Nuttall, T. J. Massinger, J. Areeda, J. Betzwieser, S. Dwyer, A. Effler, R. P. Fisher, P. Fritschel, J. S. Kissel, A. P. Lundgren et al., “Improving the data quality of advanced ligo based on early engineering run results,” Classical and Quantum Gravity, vol. 32, no. 24, p. 245005, 2015.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  T. P. Minka, “A family of algorithms for approximate bayesian inference,” Ph.D. dissertation, University of Cambridge, 2001.
-  H. Salimbeni and M. Deisenroth, “Doubly stochastic variational inference for deep gaussian processes,” in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 4588–4599.
-  M. Van der Wilk, C. E. Rasmussen, and J. Hensman, “Convolutional gaussian processes,” in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 2849–2858.
-  K. Blomqvist, S. Kaski, and M. Heinonen, “Deep convolutional gaussian processes,” arXiv preprint arXiv:1810.03052, 2018.