Labeling Bias in Galaxy Morphologies

11/08/2018 ∙ by Guillermo Cabrera-Vives, et al. ∙ University of Concepcion

We present a metric to quantify systematic labeling bias in galaxy morphology data sets stemming from the quality of the labeled data. This labeling bias is independent from labeling errors and requires knowledge about the intrinsic properties of the data with respect to the observed properties. We conduct a relative comparison of label bias for different low redshift galaxy morphology data sets. We show our metric is able to recover previous de-biasing procedures based on redshift as biasing parameter. By using the image resolution instead, we find biases that have not been addressed. We find that the morphologies based on supervised machine-learning trained over features such as colors, shape, and concentration show significantly less bias than morphologies based on expert or citizen-science classifiers. This result holds even when there is underlying bias present in the training sets used in the supervised machine learning process. We use catalog simulations to validate our bias metric, and show how to bin the multidimensional intrinsic and observed galaxy properties used in the bias quantification. Our approach is designed to work on any other labeled multidimensional data sets and the code is publicly available.




1. Introduction

For more than a century astronomers have been working to understand galaxy properties and evolution from their morphology. The seminal example is the Hubble sequence (Hubble, 1926), which first classified galaxies into ellipticals, spirals, barred spirals, and irregulars. Galaxy morphologies have been shown to correlate with other intrinsic properties such as color, brightness, maximum rotation velocity, and gas content (Dressler, 1980). From these properties, it is possible to infer important physical properties such as stellar population fraction, surface star density, total mass, and gas to star conversion rates (see Odewahn et al., 2002, and references therein).

For some time, visual classifications played the dominant role in galaxy morphologies. Classifications have been done by expert astronomers (de Vaucouleurs et al., 1976, 1991; Bundy et al., 2005; Fukugita et al., 2007; Schawinski et al., 2007; Nair & Abraham, 2010; Kartaltepe et al., 2015) as well as by non-expert citizen scientists through crowdsourcing systems such as Galaxy Zoo (Lintott et al., 2008; Bamford et al., 2009; Lintott et al., 2011; Willett et al., 2013; Simmons et al., 2017; Willett et al., 2017). With the advent of large, high-quality surveys such as the Sloan Digital Sky Survey (York & SDSS Collaboration, 2000) and CANDELS (Grogin et al., 2011), we are beginning to see more galaxy data sets morphologically classified by machine learning, using a variety of methods (e.g., Ball et al., 2004; Scarlata et al., 2007; Tasca et al., 2009; Gauci et al., 2010; Huertas-Company et al., 2011; Dieleman et al., 2015; Huertas-Company et al., 2015). Many of the current machine learning classification techniques fall into the category of supervised learning and thus require training data sets, usually based on visually classified morphologies. Examples of unsupervised machine learning classifications can be found in Naim et al. (1997), Edwards & Gaber (2013), Kramer et al. (2013), Shamir et al. (2013), and Schutter & Shamir (2015).

When galaxies are visually classified according to their morphologies, the resulting labels will be biased in terms of observable parameters. The labels of low resolution and dim galaxy images will be biased towards smoother types, because the human annotator in charge of labeling the images will not be able to see the fine structure of these objects. Bias in galaxy morphology catalogs has been extensively studied by the Galaxy Zoo team (Lintott et al., 2008). In Bamford et al. (2009) and Willett et al. (2013) a bias correction term was applied to morphology probabilities by assuming that the morphological fraction does not evolve with redshift within bins of fixed galaxy physical size and luminosity. For the Galaxy Zoo: Hubble morphologies (Willett et al., 2017), artificially redshifted images have been used to quantify this bias. A different way of addressing the problem is through a machine learning approach that simultaneously learns a classification model, estimates the intrinsic biases in the ground truth, and provides new de-biased labels (Cabrera et al., 2014; Bootkrajang, 2016).

In this paper, we present a metric for measuring this labeling bias in morphological classification data sets and we compare low redshift morphological catalogs of spiral/elliptical galaxies from experts (Fukugita et al., 2007; Nair & Abraham, 2010), non-experts (Lintott et al., 2011) and machine learned (Huertas-Company et al., 2011). We release to the public the code for measuring labeling bias and for simulating multi-dimensional labeling bias. This code can be used not only by the galaxy evolution community but also by anyone interested in measuring how biased their catalogs are in terms of observable parameters.

This paper is organized as follows: in Section 2 we develop a statistical measure of labeling bias based on the fraction of objects in terms of their intrinsic and observable properties. Our metric is based upon the assertion that the fractions of labels are fixed within bins on the intrinsic properties. We then quantify variations in labeled fractions from the estimated intrinsic fraction as a function of observed properties. In Section 3 we describe the data sets to be used and how we created simulated galaxy morphology biased data sets. Some considerations on the bias-variance trade-off of our estimators have to be taken into account. This is explained in Section 4, where we also describe the methodology used to address this issue. In Section 5 we measure the biases for different data sets and show that even “expert labels” are often biased in terms of observed quantities like apparent size. In Section 6 we describe the main conclusions coming from this work.

2. Classification Bias

In real data it may be very hard to obtain high quality true classification labels, which we will call the ground truth or gold standard. However, one can always make an estimate of this ground truth. In supervised and semi-supervised machine learning this is usually accomplished through human annotators. In terms of galaxy morphology, the estimated labels stem from visual inspection of galaxy images. These visually determined morphologies are sometimes used directly in scientific analyses. Sometimes, they are used explicitly to train classification algorithms (e.g., supervised learning). Sometimes they are used implicitly to test such algorithms or used in conjunction with unlabeled data (e.g., semi-supervised learning). However, the galaxies are always convolved with the point spread function (PSF) of the telescope, which makes it difficult to visually (or even computationally) resolve the spiral features in small and faint galaxies. For galaxies, this means that in the estimated labels, spirals can be misclassified as ellipticals. This labeling bias is more important when the PSF is close to the angular size of the galaxies, particularly for classifications based on ground-based telescopes, such as Galaxy Zoo. As noted in Bamford et al. (2009) and Cabrera et al. (2014), this bias is neither statistical nor inherent to the visual classifiers, but a direct consequence of the quality of the data.

There are many steps which go into the visual classification of the morphologies of galaxies. While we expect classifiers to notice that the light profile is steeper for ellipticals than for disk-like spirals, classifiers also use color and spatial feature identification during their classification process. Depending on the filters used or the resolution of the galaxy image, it is possible to confuse one type of morphology with another. Worse, these mislabellings can be consistent amongst different human classifiers leading to a high degree of statistical confidence in the wrong classification label.

In Figure 1 right, we show a spiral galaxy that was classified as an elliptical with high confidence in the Galaxy Zoo sample which is based on ground-based imaging from the Sloan Digital Sky Survey DR7 (Abazajian et al., 2009). On the left, we show a higher resolution view of this same galaxy from the Hubble Space Telescope, in which one can clearly identify spiral arms. In this example, the spiral arms are washed out by the convolution of the ground-based PSF, rendering their structure undetectable to the human classifiers. It is the projected intrinsic physical scale of the underlying features relative to the PSF that drives the misclassifications.

Figure 1.— Spiral galaxy biased classification. Left: A spiral galaxy with good resolution taken from low earth orbit. Notice the spiral arms. Right: The same spiral galaxy, except at worse resolution and taken from the ground through the Earth’s atmosphere. Notice that the arms are no longer discernible. This spiral galaxy was classified by of annotators as being an elliptical, even though higher quality data prove that it is a spiral.

In order to measure the amount of bias in different labeled data sets, we follow Bamford et al. (2009) and use the fractions of objects of each class as a function of the observable parameters that may bias our labels. An example of such parameters for galaxy morphologies is the resolution of the galaxies: well-resolved galaxies are unlikely to be mislabeled, while poorly resolved galaxies are more likely to be. We would expect the fractions of an unbiased data set not to depend on these observable parameters. At the same time we should also consider the intrinsic parameters on which the real fractions of labels depend.

Consider a set of intrinsic properties (e.g., physical size, luminosity, or redshift) on which we define multi-dimensional bins. Given a set of labels (e.g., spirals and ellipticals), in each bin we calculate the intrinsic class fraction of objects with each label. For typical galaxy morphology data sets, we define the intrinsic properties of each object as (R, M, z), where R is the physical radius (in kpc), M is the absolute magnitude, and z is the redshift of the object. In other words, given a fixed bin in galaxy physical size, luminosity, and redshift, the intrinsic class fraction of spirals is the fraction of spirals compared to the total number of galaxies in that bin.

We then consider the set of observed properties of the objects (e.g., angular size). We create one-dimensional bins on each observed property within each of the multi-dimensional intrinsic bins. Each one-dimensional bin is identified by the property it refers to and by the range of values it covers for that property. For typical galaxy morphological data sets, the observed properties are defined from the angular size of the galaxy and the estimated size of the point spread function at the galaxy location, expressed in the same units as the angular size.

Note that the intrinsic properties are treated in multi-dimensional bins, whereas within each of those bins, the observed properties are treated in individual one-dimensional bins. This is because our aim is to study the biases with respect to each observed property individually, and so we require at least two bins for each observational property. Figure 2 shows a diagram explaining the binning in the intrinsic and observable parameters. We start by defining bins in the intrinsic parameters using a kd-tree (see Section 4). For each of these multi-dimensional bins we bin again in terms of the observable parameters and calculate the fraction of objects in each of these bins for every class.

We then calculate the observed class fraction (Equation 1). For each class, it is the sum of a Kronecker delta over all galaxies which are simultaneously in the observed single-property bin and in the intrinsic multi-dimensional bin, divided by the total number of objects in that observed-property bin. The Kronecker delta equals one when the estimated classification of a galaxy matches the class and zero otherwise.

For a given classification and intrinsic property bin, we calculate the Euclidean difference between the observed class fraction and the intrinsic class fraction and sum over all the bins for the observed property (Equation 2). Equation 2 should approach zero for a large number of objects per bin and when there is no difference between the intrinsic and observed class fractions, i.e., when the classifications are unbiased with respect to an observable.
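Written out, the two quantities described above could take the following form. This is a hedged reconstruction, not necessarily the original typesetting: every symbol is an assumption introduced here for illustration. The index i labels intrinsic bins, j observed properties, l the one-dimensional bins of property j, and k classes; \hat{y}_g is the estimated label of galaxy g, N_{ijl} the number of galaxies in a bin, n_j the number of bins of property j, and r_{ik} the intrinsic class fraction. Normalizing the second expression by n_j is also an assumption.

```latex
% Observed class fraction in observed bin (j, l) of intrinsic bin i
\hat{r}_{ijlk} = \frac{1}{N_{ijl}} \sum_{g \in (i,j,l)} \delta(\hat{y}_g, k)

% Squared deviation from the intrinsic fraction, averaged over the bins of property j
\sigma^2_{ijk} = \frac{1}{n_j} \sum_{l=1}^{n_j} \left( \hat{r}_{ijlk} - r_{ik} \right)^2
```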

Figure 2.— Binning on the intrinsic and observed object properties. Top left: We first create multidimensional bins based on the intrinsic properties, such as absolute magnitude (M), physical size (R), and redshift (z). Bottom left: These bin edges are defined using a kd-tree, as explained in Section 4. Right: For the sub-set of galaxies within each intrinsic bin, we measure the fraction of labeled objects for each class as a function of each observable. We define equal-sized one-dimensional bins on the observed properties, and we require at least two bins. For an un-biased data set, the fractions within each bin should not change in terms of the observables. When labeling bias is present, the fractions of objects labeled by humans will depend on the observable parameters. We calculate the deviation of fractions from the intrinsic class fraction (Eq. 2) and then define the total labeling bias by summing over all of the properties as in Eq. 3.

We can extend this to all classes and to all intrinsic and observed properties by combining the deviations of Equation 2 over every class, intrinsic bin, and observed property (Equation 3). The expression depends on the number of classes (2 for the case of ellipticals versus spirals). We term Equation 3 the classification bias, which quantifies the difference in the observational class fractions with respect to the intrinsic class fractions.
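The combination over classes, intrinsic bins, and observed properties can be sketched in a few lines of NumPy. This is a minimal illustration rather than the released code: the function name, the argument layout, and in particular the choice of a root-mean-square combination are assumptions, since the exact normalization of Equation 3 is not reproduced here.

```python
import numpy as np

def classification_bias(intrinsic_bin, observed_bin, labels, classes, intrinsic_frac):
    """Root-mean-square deviation of observed from intrinsic class fractions.

    intrinsic_bin  : (n,) integer id of the multi-dimensional intrinsic bin
    observed_bin   : (n,) or (n, n_props) 1D bin index for each observed property
    labels         : (n,) estimated class labels
    classes        : list of class names, e.g. ["S", "E"]
    intrinsic_frac : dict mapping (intrinsic bin id, class) -> intrinsic fraction
    The RMS combination is an assumed normalization, chosen for illustration.
    """
    intrinsic_bin = np.asarray(intrinsic_bin)
    observed_bin = np.asarray(observed_bin)
    labels = np.asarray(labels)
    if observed_bin.ndim == 1:              # a single observed property
        observed_bin = observed_bin[:, None]

    sq_dev = []
    for i in np.unique(intrinsic_bin):
        in_i = intrinsic_bin == i
        for j in range(observed_bin.shape[1]):
            bins_j = observed_bin[in_i, j]
            for k in classes:
                is_k = labels[in_i] == k
                # Observed class fraction in each 1D bin of property j
                observed = np.array([is_k[bins_j == l].mean()
                                     for l in np.unique(bins_j)])
                reference = intrinsic_frac[(int(i), k)]
                sq_dev.append(np.mean((observed - reference) ** 2))
    return float(np.sqrt(np.mean(sq_dev)))
```

With this convention a data set whose class fractions are identical in every observed bin returns zero, and any dependence of the fractions on the observable raises the value.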

We note that the intrinsic class fraction can vary for any data set. For instance, a data set designed to represent ellipticals might have an inherently lower spiral fraction than a broader morphological catalog containing spirals, ellipticals, and irregulars. Alternatively, one might be interested in comparing classification algorithms over a wide range of classes and data sets. If so, care has to be taken that the distributions of the intrinsic parameters are similar, so that no selection effects influence the intrinsic class fractions and, in turn, the value of the classification bias.

One would hope that the fraction of labels within bins of intrinsic properties could in principle be measured using an un-biased (“gold standard”) data set or perhaps a subset of the data itself. It is also possible that the intrinsic fraction could be predicted from theory (e.g., Genel et al., 2014). Here, we take a conservative approach and assume that all observed morphological data sets have some level of bias. We estimate the intrinsic class fraction by using the observed class fraction in the bin of the observed property which is likely to have the least bias. For example, for an observed property such as the angular size, we use the bin which includes the largest values of that property, since it should contain the least biased classifications.
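The estimate described above can be sketched as follows for the galaxies of a single intrinsic bin. The function name and the use of equal-count bins are assumptions made for illustration; the only idea taken from the text is that the bin holding the largest values of the observable serves as the reference.

```python
import numpy as np

def estimate_intrinsic_fraction(observable, labels, target, n_bins=2):
    """Fraction of `target` labels in the bin holding the largest observable values.

    Intended to be applied to the galaxies of one intrinsic bin at a time.
    Bins are equal-count (an assumption of this sketch).
    """
    observable = np.asarray(observable, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(observable, kind="stable")   # ascending in the observable
    top_bin = np.array_split(order, n_bins)[-1]     # indices of the last bin
    return float(np.mean(labels[top_bin] == target))

# Toy example: the four largest objects are S, S, E, S
size = [1, 2, 3, 4, 5, 6, 7, 8]
label = ["E", "E", "E", "S", "S", "S", "E", "S"]
print(estimate_intrinsic_fraction(size, label, "S"))  # 0.75
```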

Figure 3 shows an example of binning in intrinsic and observable parameters for Galaxy Zoo data. Here, we build the kd-tree by splitting the data in terms of R, M, and z, creating a 3-dimensional partition of the data. For the data falling in each of the intrinsic bins we calculate the fraction of spiral and elliptical galaxies as a function of the observable parameter. As the observable decreases (smaller objects), the fraction of spiral galaxies decreases and the fraction of elliptical galaxies increases. In other words, smaller spiral galaxies are confused with ellipticals. In order to calculate our bias metric (Eq. 3) we need the intrinsic class fractions. The least biased bins in the observable parameters are the ones with the largest values of the observable, which we take as our estimate for the intrinsic class fractions. Figure 4 shows the fractions of spiral and elliptical galaxies in terms of the observable parameter for the bins in intrinsic parameters. Independently of the bin in terms of R, M, and z, the fraction of spirals increases with the observable, while the fraction of ellipticals decreases. In order to calculate the dataset bias, we use as intrinsic class fractions the fractions in the bin with the highest value of the observable, denoted by a dot in Figure 4.

Figure 3.— Binning example for Galaxy Zoo biased data. The kd-tree splits the data in terms of the intrinsic properties (center top and bottom; solid line rectangles). For each of these 3-dimensional bins we calculate the fractions of objects in terms of the observable parameter. As the size of the galaxies diminishes, the fraction of spiral galaxies (dotted lines) decreases and the fraction of ellipticals (dashed lines) increases. The least biased bins in observable parameters are the ones with the highest values of the observable, represented by a dot in the plots. We use these lowest bias bins as our estimate for the intrinsic fractions.

Figure 4.— Fractions in terms of observable parameters for Galaxy Zoo biased data using bins in intrinsic parameters. As the angular size of the galaxies diminishes, the fraction of observed spiral galaxies (dotted lines) decreases and the fraction of ellipticals (dashed lines) increases due to observational bias. The least biased bins in observable parameters are the ones with the highest values of the observable, represented by a dot in the plots, which we use as our estimate for the intrinsic fractions.

3. Data Sets

In this section we describe the data we use in our experiments. All data use the r band from SDSS (Abazajian et al., 2009), and we adopt the 9-year WMAP cosmology (Hinshaw et al., 2013) as provided by astropy (The Astropy Collaboration et al., 2018).
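The cosmology enters when observed quantities (angular size, apparent magnitude, redshift) are converted into intrinsic ones (physical radius, absolute magnitude). The sketch below illustrates that conversion with a plain NumPy integration for a flat LambdaCDM model. It is not the astropy implementation used in the paper: the function names are hypothetical, the default parameter values are assumed to approximate the 9-year WMAP cosmology, radiation is neglected, and no K-correction is applied.

```python
import numpy as np

C_KM_S = 299792.458  # speed of light [km/s]

def comoving_distance_mpc(z, h0=69.32, om0=0.2865, n_steps=2049):
    """Comoving distance [Mpc] in flat LambdaCDM (assumed parameters, no radiation)."""
    zz = np.linspace(0.0, z, n_steps)
    inv_e = 1.0 / np.sqrt(om0 * (1.0 + zz) ** 3 + (1.0 - om0))
    integral = np.sum(0.5 * (inv_e[1:] + inv_e[:-1]) * np.diff(zz))  # trapezoid rule
    return C_KM_S / h0 * integral

def physical_radius_kpc(theta_arcsec, z):
    """Physical radius [kpc] from an angular radius [arcsec] at redshift z."""
    d_a = comoving_distance_mpc(z) / (1.0 + z)      # angular diameter distance [Mpc]
    theta_rad = np.deg2rad(theta_arcsec / 3600.0)
    return theta_rad * d_a * 1.0e3

def absolute_magnitude(m_apparent, z):
    """Absolute magnitude from the distance modulus (no K-correction)."""
    d_l_pc = comoving_distance_mpc(z) * (1.0 + z) * 1.0e6   # luminosity distance [pc]
    return m_apparent - 5.0 * np.log10(d_l_pc / 10.0)
```

For real analyses a maintained cosmology library is the safer choice; the point here is only to make the dependence of R and M on redshift explicit.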

3.1. Eyeball Classifications

Fukugita et al. (2007) (hereafter F07) have visually classified 2,275 galaxies, each by three experts. They defined a morphological index that takes a distinct value for each of the types E, S0, Sa, Sb, Sc, Sd, and Im. In order to measure their bias, we focus on just the elliptical (+S0) galaxies (N=941) and the spirals (N=902), selected by ranges of this index, since the other data sets we compare to only use these two classes. We cross-match these data to the SDSS DR7 to obtain their apparent magnitudes (Petrosian r-band), their apparent sizes (Petrosian r-band radii), and their redshifts.

We also use expert labels from Nair & Abraham (2010) (NA10 hereafter), who have visually classified 14,034 spectroscopically-targeted galaxies from the SDSS. They report T-Types as well as other morphological features such as bars, rings, lenses, and tails, among others. As with the F07 sample, we focus on elliptical (+S0) galaxies (N=6,276) and spirals (N=7,640), selected according to their T-Types.

3.2. Galaxy Zoo

We use the Galaxy Zoo 1 data release (Lintott et al., 2011) and their sample with spectra in SDSS which contains classifications for 667,944 galaxies achieved by crowd-sourcing. We define two subsets of the Galaxy Zoo 1: (a) the original biased morphologies (hereafter GZB) and (b) the “debiased” morphologies (hereafter GZD). The debiasing procedure used is described in detail in Bamford et al. (2009) and Lintott et al. (2011). Briefly, they assumed that the morphological fraction within bins of fixed galaxy physical size and luminosity does not evolve over the redshift of their data. From that assumption, a bias correction term was estimated in bins of physical size and luminosity and then applied to the original spiral and elliptical classification probabilities. Their algorithm helped motivate our approach to quantify classification bias as described in Section 2.

We cross-match the Galaxy Zoo catalog to the SDSS DR7 to obtain the observed properties, including the point spread function (PSF) of each galaxy, which is determined over the SDSS field. We used the SDSS field-specific PSF width parameter as an estimate of the FWHM of a Gaussian PSF at the location of each galaxy. When a galaxy belongs to more than one field, we used the classification and properties pertaining to the field with the smallest PSF.

3.3. Supervised Learned Morphologies

Huertas-Company et al. (2011) (hereafter HC11) used a support vector machine (SVM) classification model trained over the data set from Fukugita et al. (2007). The HC11 morphologies are probability densities, and so we defined elliptical (+S0) galaxies as those whose probability of being early-type, P(Early), exceeds a threshold, and spiral galaxies as those whose P(Spiral) exceeds the same threshold, where the threshold takes values of 0.5 and 0.8. As with the previous data sets, we cross-match the HC11 data to the SDSS DR7 to ensure that all galaxies in our data sets have the same observed properties and that there are no duplicates.

3.4. Simulated Morphology Catalogs

In order to assess the validity of our method, we created a simulated catalog following the Galaxy Zoo 1 distribution of parameters. We used a kernel density estimation (see Hastie et al., 2009, and references therein) with a Gaussian kernel to estimate the distribution of angular Petrosian radius, apparent Petrosian magnitude, redshift, PSF, and de-biased probabilities, randomly choosing 100,000 galaxies from GZ1. Using these parameters we calculate derived quantities, including the physical Petrosian radius and the absolute Petrosian magnitude. We consider two of the observed quantities as the biasing parameters, and we artificially created the bias by changing labels from spirals to ellipticals with a Gaussian probability depending on these parameters. The probability of modifying a label from S to E depends on a bias-strength parameter, which controls the amount of bias in the data set. The higher the value of this parameter, the larger the amount of bias. Notice that this added bias is normalized in terms of the two biasing parameters by using their median values.
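The label-flipping step can be illustrated with a short simulation. The functional form below is an assumption made for this sketch, not the expression used in the paper: it uses a single observable, normalizes it by its median, and flips a spiral label with probability exp(-x^2 / (2 b^2)), so that small objects are flipped more often and a larger b gives more bias. The names `add_labeling_bias` and `strength` are hypothetical.

```python
import numpy as np

def add_labeling_bias(labels, observable, strength, rng):
    """Flip 'S' labels to 'E' with a Gaussian probability in the observable.

    Assumed form: p = exp(-0.5 * (x / strength)**2), with x the observable
    divided by its median. `strength` plays the role of the bias parameter.
    """
    labels = np.asarray(labels).copy()
    if strength <= 0:                       # no bias requested
        return labels
    x = np.asarray(observable, dtype=float) / np.median(observable)
    p_flip = np.exp(-0.5 * (x / strength) ** 2)
    flip = (labels == "S") & (rng.random(labels.size) < p_flip)
    labels[flip] = "E"                      # ellipticals are never modified
    return labels

rng = np.random.default_rng(0)
true_labels = np.array(["S"] * 1000 + ["E"] * 1000)
size = rng.uniform(0.5, 5.0, true_labels.size)
biased_labels = add_labeling_bias(true_labels, size, strength=1.0, rng=rng)
```

Passing the biased labels and the original observable to a bias metric then allows a check that the metric grows with `strength`.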

4. Impact of Sampling over the Estimator

Equation 3 is a statistical measure of the classification bias for any data set with multiple classes. It requires bin definitions on the observed properties and a multi-dimensional binning of the intrinsic properties. In this section, we use the simulated morphology catalogs, which have varying degrees of bias, to examine the effects of how the bins are defined.

We bin the intrinsic properties of the data using kd-trees. A kd-tree is a data structure for storing a finite set of points from a k-dimensional space. It was examined in detail by Bentley (1975) and Friedman et al. (1977). kd-trees have the benefit of dividing the data into bins for optimal querying performance. They are well characterized in the literature and numerous libraries exist to build such trees. The total number of bins in these trees is 2^h, where h is the height of the tree. As h increases, the bins get smaller, causing the number of points inside each bin to get smaller too. The dimension of our kd-tree depends on the number of intrinsic properties we are examining. In this effort, we use the absolute magnitude, the physical size, and the redshift for our tree.
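A kd-tree partition of this kind can be sketched with recursive median splits. The splitting rule used here (cycle through the axes, split at the median) is a common construction and is assumed for illustration; the text does not state which rule the released code uses. The function name is hypothetical.

```python
import numpy as np

def kd_tree_bins(points, height):
    """Assign each point to one of 2**height leaves of a median-split kd-tree.

    points : (n, k) array of intrinsic properties, e.g. columns (M, R, z)
    Returns an (n,) integer array of leaf ids in [0, 2**height).
    """
    points = np.asarray(points, dtype=float)
    bin_id = np.zeros(len(points), dtype=int)

    def split(idx, depth):
        if depth == height:
            return
        axis = depth % points.shape[1]               # cycle through the dimensions
        order = idx[np.argsort(points[idx, axis], kind="stable")]
        half = len(order) // 2
        left, right = order[:half], order[half:]
        bin_id[right] += 2 ** (height - depth - 1)   # one bit per tree level
        split(left, depth + 1)
        split(right, depth + 1)

    split(np.arange(len(points)), 0)
    return bin_id

rng = np.random.default_rng(42)
intrinsic = rng.normal(size=(1024, 3))
leaf = kd_tree_bins(intrinsic, height=3)
print(np.bincount(leaf))  # 8 leaves with 128 points each
```

Because every split is at the median, all leaves hold the same number of objects (up to rounding), which keeps the statistical noise comparable across intrinsic bins.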

For the observed properties, we need to build a grid defining the ranges on each of the observational parameters (e.g., the resolution limits within the bin). We choose a simple linear binning procedure such that the number of observed galaxies in each bin is equal.
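Bins holding an equal number of galaxies can be obtained directly from the ranks of the observable. The helper below is a hypothetical sketch of that idea, meant to be applied to the galaxies inside one intrinsic bin.

```python
import numpy as np

def equal_count_bins(values, n_bins):
    """Return a bin index in [0, n_bins) per value, with near-equal counts per bin."""
    values = np.asarray(values, dtype=float)
    # Rank of each value (0 = smallest); double argsort gives the ranks
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return (ranks * n_bins) // len(values)

angular_size = [5, 1, 4, 2, 3, 9, 8, 7, 6, 0]
print(equal_count_bins(angular_size, 2))  # [1 0 0 0 0 1 1 1 1 0]
```

The highest bin index then corresponds to the largest values of the observable, which is the bin used as the least biased reference.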

Having defined the bins on the intrinsic and observed properties of each galaxy, as well as the morphological classifications, we examine the robustness of the labeling bias estimator, Equation 3.

4.1. Finite-Sampling Bias and Variance

The number of 3D bins on the intrinsic properties, the number of 1D bins in the observable parameters, and the total number of objects in each of the bins combine to impact the value of the bias metric. Because real data sets have finite size, the trade-off between bias and variance of our estimators has to be taken into account when defining the binning strategy. We use the simulations to show the impact of the selected binning strategy on our bias metric. Our results are shown in Figure 5, where the left panel is for simulated galaxies following the GZD probability distributions (un-biased) and the right panel shows a simulation with added bias.

Figure 5.— Sampling effect over the bias metric as calculated over the simulated data sets. Left: un-biased simulations. Right: simulations with added bias. Labels indicate the number of bins in the intrinsic parameters obtained from the kd-tree and the number of bins in the observable parameter. The variance increases as we diminish the number of objects per bin, increasing the value of the bias metric. Independently of the binning strategy chosen, our metric obtains a higher value for the biased simulated dataset.

First, consider a simple fixed binning scenario where we allow the number of galaxies per bin to vary. Figure 5 shows this effect over simulations for different binning strategies. One can see that the bias metric decreases as a function of the square root of the total number of galaxies in each bin. There is a point after which adding more galaxies to each bin does not reduce the metric significantly. We use the shape of this curve to define the optimal number of total galaxies per bin.
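The square-root behavior follows from counting statistics. As a short worked example (with symbols assumed here for illustration: f is the true class fraction in a bin and N the number of objects in it), the estimated fraction is a binomial proportion, so even a perfectly unbiased data set shows deviations of the following size from sampling noise alone:

```latex
\mathrm{std}(\hat{f}) = \sqrt{\frac{f\,(1-f)}{N}},
\qquad
f = 0.5:\quad N = 100 \;\Rightarrow\; 0.05,
\qquad N = 400 \;\Rightarrow\; 0.025
```

Quadrupling the number of objects per bin only halves this noise floor, which is why the curves flatten at large N.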

Next, consider a fixed number of objects per bin and a fixed number of bins on the intrinsic properties. As one decreases the number of 1D bins in the observable parameters, due to the bias-variance trade-off (see Hastie et al., 2009), there will be a corresponding decrease in the statistical variance of the estimated fractions, at the expense of increasing the statistical bias. The extreme case is a single bin with very low variance. However, as shown in Figure 2, a single bin in the observable provides no useful information on the bias we are trying to measure: at least two bins in the observed properties are required in order to track observational bias. Regardless, the decrease in variance (simply due to fewer bins) simultaneously decreases the value of the labeling bias. This is shown in Figure 5 by considering curves with the same number of intrinsic bins and noting that the labeling bias is always lowest for the smallest number of observed bins.

On the other hand, if we fix the number of objects per bin as well as the number of bins on the observed properties, then by decreasing the number of bins on the intrinsic parameters we lose information about the true object fractions, thus causing an increase in the bias of the estimator for the intrinsic class fraction. This increases the differences between the observational class fractions and the estimate for the intrinsic class fractions in Eq. 2, which in turn increases the value of the labeling bias. This is shown in Figure 5 by considering curves with an equal number of bins in the observables and noting that the labeling bias is always lowest for the highest number of intrinsic bins.

Figure 5 allows us to define a binning procedure for any data set. Notice how the number of objects per bin and the binning impact the value of our bias metric for data sets with the same amount of simulated bias. Also notice that, for a given binning strategy, the metric is always higher for the data set with higher simulated bias, which suggests that any binning strategy helps evaluate differences in biases between data sets, as long as enough objects per bin are considered.

Since the value of the bias metric can vary as a function of the binning, we must be careful to use the same binning procedure when conducting relative comparisons of one or more data sets, even if the binning is not optimal for any specific data set. In practice, when comparing the labeling bias for different sets of data, the binning strategy is defined by the data set with the smallest number of objects.

4.2. Choosing number of bins for real data

For the datasets in this work, we consider binning strategies that split all parameters (observable and intrinsic) into as similar a number of bins as possible. Given this constraint, we then search for the maximum number of bins such that the running slope of the bias metric in Figure 5 has leveled off at the maximum number of objects per bin allowed by our data set size. Because the real data are noisy, we calculate the mean value of the bias metric over 20 bootstrapping sub-samples, considering the same number of bins for each of the intrinsic and observable parameters. The kd-tree automatically defines the multidimensional binning on the intrinsic parameters.
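Averaging a statistic over bootstrap samples can be sketched as below. The helper name and its interface are hypothetical, and resampling with replacement at the full sample size is an assumption; the text only states that 20 bootstrapping sub-samples are used.

```python
import numpy as np

def bootstrap_mean_std(statistic, data, n_boot=20, rng=None):
    """Mean and standard deviation of `statistic` over bootstrap resamples of `data`.

    data is resampled along its first axis, with replacement (assumed scheme).
    """
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    n = len(data)
    values = np.array([statistic(data[rng.integers(0, n, size=n)])
                       for _ in range(n_boot)])
    return float(values.mean()), float(values.std(ddof=1))

# Toy usage with the sample mean standing in for the bias metric
mean, spread = bootstrap_mean_std(np.mean, np.arange(1000.0),
                                  rng=np.random.default_rng(3))
```

In practice `statistic` would be a function that bins the resampled catalog and evaluates the bias metric, and the returned spread gives error bars of the kind shown in the figures.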

Special care has to be taken when comparing two or more data sets of different sizes. On a larger data set we may be able to use more bins and/or number of objects per bin, but when comparing it to a smaller data set, this sampling is not going to be feasible to use. In order to make a fair comparison, we need to sample in terms of the smaller data set, so that biases and variance over the distribution of fractions are comparable.

5. Bias for galaxy morphologies

Now that we have defined how to choose the binning in Equation 3, we measure the classification bias for the different datasets defined in Section 3. In Section 5.1 we follow the approach proposed by Bamford et al. (2009) and use the redshift as a way to quantify the morphological bias. Then, in Section 5.2 we use the apparent radius as our biasing observable parameter and R, M, and z as intrinsic parameters.

5.1. Redshift as biasing parameter

We start by following the approach proposed by Bamford et al. (2009) and consider redshift as a biasing parameter and physical radius and absolute magnitude as intrinsic parameters. The smallest dataset is F07 with 1843 spirals and ellipticals. From this, we used the technique described in Section 4.2 to determine the binning. We find the best finite-sampling bias levels for a maximum binning size of 8 in the intrinsic parameters and 2 in the observed parameters. We then apply this binning scheme to all of the datasets to measure the classification bias using Equation 3. For the F07 data set we obtain 115 galaxies per bin, so we fix this number for all the other datasets. The data from F07 and NA10 only contain galaxies within a restricted range, so in order for the comparison of biases between datasets to be fair, we apply the same restriction to the GZ and HC11 datasets. Figure 6a shows the bias for the different datasets. Notice that the standard deviation of the bias metric makes it hard to draw statistically significant conclusions on the difference between datasets.
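The number of galaxies per bin quoted above follows from simple arithmetic on the sample size and the bin counts given in the text (8 intrinsic bins, 2 observed bins, 1843 galaxies). The helper name is hypothetical.

```python
def galaxies_per_bin(n_galaxies, n_intrinsic_bins, n_observed_bins):
    """Whole number of galaxies available in each (intrinsic, observed) bin."""
    return n_galaxies // (n_intrinsic_bins * n_observed_bins)

# F07: 1843 galaxies, 8 intrinsic bins (a kd-tree of height 3), 2 observed bins
print(galaxies_per_bin(1843, 8, 2))  # 115
```

The 16 bins hold 1840 galaxies at 115 per bin, so only 3 of the 1843 objects are left over.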

Figure 6.— Bias for different datasets (panels a, b, and c) considering redshift as the unique observable parameter and physical radius and absolute magnitude as intrinsic parameters, as proposed by Bamford et al. (2009). The number of galaxies considered for measuring the bias increases from (a) to (c). Error bars show the standard deviation over 100 bootstrapping samples. (a) Using 8 bins in intrinsic parameters, 2 bins in observable parameters, and 115 galaxies per bin in order to match the total number of galaxies of Fukugita et al. (2007) (F07). (b) Using 4 bins in observable parameters and 217 galaxies per bin in order to match the total number of galaxies of Nair & Abraham (2010) (NA10). (c) Using 16 bins in observable parameters and 58 galaxies per bin in order to match the total number of galaxies in the Galaxy Zoo Biased (GZB) sample (Bamford et al., 2009). Note that the machine learning classifications of Huertas-Company et al. (2011) (HC11) use the F07 classifications for training the support vector machine.

If we exclude F07, the smallest dataset is NA10, with 13,916 galaxies. For these data, and using the procedure from Section 4.2, we find the maximum binning in the intrinsic parameters together with 4 bins in the observable parameters (217 galaxies per bin). Figure 6b shows the results of the bias for each data set. The error bars now allow us to interpret these results with higher significance. The bias for the expert annotators of NA10 is similar to the one from HC11, and both are smaller than the bias of Galaxy Zoo. We can now see that our metric starts to recover the de-biasing procedure from Bamford et al. (2009): the value of the bias metric is lower for GZD than for GZB.

If we exclude both F07 and NA10, we no longer need to restrict the GZ and HC11 samples to match them. The smallest dataset is then the Galaxy Zoo biased sample with 237,963 galaxies, so we are able to use a larger number of bins and thus obtain better estimates. Using the method described in Section 4.2 we obtain a binning with 16 bins for the observable parameters and 58 objects per bin. In Figure 6c we show the labeling bias as defined by Equation 3 using this binning strategy for GZB, GZD, and HC11. Now we can clearly recover the de-biasing procedure proposed by Bamford et al. (2009). The highest values of the bias metric are obtained over the GZB dataset, while the GZD dataset achieves a significantly lower value. This shows that our proposed metric is capable of measuring biases given an assumption of intrinsic and biasing parameters. Again, the lowest labeling bias is obtained for HC11.

5.2. Apparent radius as biasing parameter

As opposed to Bamford et al. (2009), who used redshift as the parameter with which to characterize and correct labeling bias, in this section we treat the apparent size as the parameter that governs bias. It is the apparent size of a galaxy relative to the PSF that determines whether or not spiral features are washed out and become undetectable. We then include redshift as an intrinsic parameter, since we expect it to play a role in the underlying fraction of spirals and ellipticals, which we know to evolve over time (Buitrago et al., 2013; Huertas-Company et al., 2015; Cerulo et al., 2017). There is a concern that the apparent size as an observable parameter is degenerate with the combination of the redshift and the physical size of any galaxy: a small nearby galaxy can have the same apparent size as a larger, more distant galaxy. However, by also including the absolute magnitude as an intrinsic parameter, this degeneracy is broken. In other words, a small and a large galaxy with the same apparent size will never be in the same bin, since the small (and thus intrinsically dim) galaxy will fall in a different magnitude bin than the large (and intrinsically bright) galaxy.
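The degeneracy described above can be illustrated with a short worked example. This is a sketch under simplifying assumptions that are not taken from the paper: a Hubble-law distance valid only at low redshift, an assumed Hubble constant of 70 km/s/Mpc, and the small-angle approximation. The function name and the example galaxies are hypothetical.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s
H0 = 70.0            # assumed Hubble constant in km/s/Mpc (illustrative value)

def apparent_radius_arcsec(physical_radius_kpc, redshift):
    """Small-angle apparent radius using a low-redshift Hubble-law distance."""
    distance_mpc = C_KM_S * redshift / H0  # only reasonable for z << 1
    theta_rad = (physical_radius_kpc / 1000.0) / distance_mpc
    return math.degrees(theta_rad) * 3600.0

# A small nearby galaxy and a galaxy three times larger at three times the redshift.
small_near = apparent_radius_arcsec(physical_radius_kpc=3.0, redshift=0.03)
large_far = apparent_radius_arcsec(physical_radius_kpc=9.0, redshift=0.09)

print("small, nearby galaxy: %.2f arcsec" % small_near)
print("large, distant galaxy: %.2f arcsec" % large_far)
```

Both galaxies subtend roughly 4.8 arcsec in this approximation, so the apparent size alone cannot tell them apart. Adding the absolute magnitude as an intrinsic parameter separates them, since the larger galaxy is expected to be intrinsically brighter.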

We start with the smallest dataset, F07, with 1843 spirals and ellipticals. We find the best finite-sampling bias levels for a maximum of 8 bins in the intrinsic parameters and 2 bins in the observed parameters, obtaining 115 galaxies per bin. Note that these are the minimum numbers of bins we can apply given the number of intrinsic and biasing (observed in this case) properties in the data. Recall that the data from F07 and NA10 only contain galaxies for , so again we consider galaxies with in the GZ and HC11 datasets. Figure 7a shows the biases under these assumptions for the different datasets. Again, due to the size of the standard deviation error bars of the bias, it is hard to draw statistically significant conclusions about the differences between datasets.
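The bin arithmetic above (1843 galaxies spread over 8 x 2 cells gives 115 galaxies per cell) can be checked with a short sketch. The procedure of Section 4.2 is not reproduced here; the code below assumes, purely for illustration, equal-occupancy bins built by recursive median splits in the spirit of a kd-tree. The function `kd_bins` and the random placeholder catalog are hypothetical.

```python
import numpy as np

def kd_bins(points, depth):
    """Assign each point to one of 2**depth equal-occupancy bins using
    recursive median splits that cycle through the dimensions."""
    points = np.asarray(points, dtype=float)
    bin_ids = np.zeros(len(points), dtype=int)
    for level in range(depth):
        dim = level % points.shape[1]  # dimension to split at this level
        new_ids = np.empty_like(bin_ids)
        for b in np.unique(bin_ids):
            members = np.flatnonzero(bin_ids == b)
            order = members[np.argsort(points[members, dim], kind="stable")]
            half = len(order) // 2
            new_ids[order[:half]] = 2 * b       # lower half of the split
            new_ids[order[half:]] = 2 * b + 1   # upper half of the split
        bin_ids = new_ids
    return bin_ids

# Placeholder catalog with two intrinsic parameters for 1843 galaxies.
rng = np.random.default_rng(0)
catalog = rng.normal(size=(1843, 2))

intrinsic_bin = kd_bins(catalog, depth=3)  # 2**3 = 8 intrinsic bins
counts = np.bincount(intrinsic_bin)

# Each intrinsic bin is then split into 2 observable bins: 8 * 2 = 16 cells.
galaxies_per_cell = 1843 // (8 * 2)
print("galaxies per intrinsic bin:", counts.tolist())
print("galaxies per cell:", galaxies_per_cell)
```

Median splits keep the occupancy of all bins within one galaxy of each other, which keeps the finite-sampling noise comparable across cells.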

Figure 7.— Bias for different datasets considering apparent radius as the unique observable parameter and absolute magnitude, physical radius, and redshift as intrinsic parameters. The number of galaxies considered for measuring the bias increases from (a) to (c). Error bars show the standard deviation over 100 bootstrap samples. (a) Using 8 bins in intrinsic parameters, 2 bins in observable parameters, and 115 galaxies per bin in order to match the total number of galaxies of Fukugita et al. (2007) (F07). (b) Using bins in intrinsic parameters, 3 bins in observable parameters, and 144 galaxies per bin in order to match the total number of galaxies of Nair & Abraham (2010) (NA10). (c) Using bins in intrinsic parameters, 4 bins in observable parameters, and 232 galaxies per bin in order to match the total number of galaxies in the Galaxy Zoo Biased (GZB) sample (Bamford et al., 2009). Note that the machine learning classifications of Huertas-Company et al. (2011) (HC11) use the F07 classifications for training the Support Vector Machine.

If we exclude F07, we find a maximum binning size of in the intrinsic parameters and 3 bins in the observable parameters, with 144 galaxies per bin. Figure 7b shows the resulting bias for each data set. HC11 presents the lowest bias. Expert labels from NA10 are less biased than Galaxy Zoo. With this number of galaxies there is no statistically significant difference between GZB and GZD for a given probability threshold.

If we exclude both F07 and NA10, we are able to consider galaxies with and use bins for the intrinsic parameters and 4 bins for the biasing parameter (the apparent radius), from which we obtain 232 objects per bin. In Figure 7c we show the labeling bias as defined by Equation 3 using this binning strategy for GZB, GZD, and HC11. The highest bias values are obtained for the GZ biased data sets and the lowest labeling bias is obtained for HC11. With this amount of data we notice that GZD with the lower probability threshold shows a smaller amount of bias than GZB. At the same time, with the higher probability threshold the selected GZD data set is significantly more biased than the GZD data with the lower threshold, and closer to GZB at the same higher threshold. In other words, it appears that the de-biasing procedure implemented in Bamford et al. (2009) for Galaxy Zoo classifications does not work when the vast majority of classifiers agree on the morphological type.

We explore this interesting result further in Figure 8, where we plot the bias as a function of an increasing Galaxy Zoo classification probability threshold. For the biased sample, we see no clear trend. However, the de-biased sample shows a trend of increasing bias with increasing classification probability threshold.
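As a sketch of how classes could be defined for each point of such a curve, the snippet below turns per-galaxy classification probabilities into labels for a given threshold and discards galaxies that do not reach it in either class. The function name, the label encoding, and the example probabilities are assumptions made for illustration; the bias would then be recomputed on the galaxies that survive each threshold.

```python
import numpy as np

def threshold_labels(p_spiral, p_elliptical, threshold):
    """Return 1 for spirals, 0 for ellipticals, and -1 for galaxies whose
    classification probability does not reach the threshold in either class."""
    if threshold <= 0.5:
        raise ValueError("threshold must exceed 0.5 so the classes cannot overlap")
    p_spiral = np.asarray(p_spiral, dtype=float)
    p_elliptical = np.asarray(p_elliptical, dtype=float)
    labels = np.full(p_spiral.shape, -1, dtype=int)
    labels[p_elliptical >= threshold] = 0
    labels[p_spiral >= threshold] = 1
    return labels

# Hypothetical vote fractions for five galaxies.
p_spiral = [0.95, 0.70, 0.55, 0.10, 0.30]
p_elliptical = [0.05, 0.20, 0.40, 0.85, 0.65]

labels_06 = threshold_labels(p_spiral, p_elliptical, threshold=0.6)
labels_08 = threshold_labels(p_spiral, p_elliptical, threshold=0.8)

for threshold, labels in [(0.6, labels_06), (0.8, labels_08)]:
    kept = int(np.sum(labels >= 0))
    print("threshold %.1f: labels=%s, galaxies kept=%d" % (threshold, labels.tolist(), kept))
```

Raising the threshold shrinks the sample, so the number of galaxies per bin (and therefore the finite-sampling noise) changes along the curve as well.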

Figure 8.— Bias for the Galaxy Zoo Biased sample (GZB) and the Galaxy Zoo Debiased sample (GZD) versus the probability threshold used to define the classes. Notice that the GZB bias does not significantly decrease with increasing probability threshold and that the GZD bias increases with increasing probability threshold. An explanation for these unexpected trends is discussed in the text.

We can explain Figure 8 in the following way. First, Bamford et al. (2009) use a statistical correction (their equations A3 and A4) that depends on both the raw classification probabilities and the intrinsic characteristics of the galaxy (e.g., absolute magnitude, physical Petrosian radius, redshift). This form of correction was chosen under the assumption that at high classification probabilities no morphology adjustment should be applied, since the labels would be correct (see Figure A9 of Bamford et al., 2009). Thus, the fact that the GZD classification bias approaches that of GZB at high probability thresholds in Figure 8 stems from the design of the classification adjustment formalism. Since the correction term approaches zero at high classification probability, the sample reverts to the same level of bias inherent in the nominal biased sample.

What is perhaps surprising is that for the Galaxy Zoo Biased (GZB) sample, the level of bias does not decrease as the classifications reach higher levels of confidence (high classification probability). Recall from the Introduction that our algorithm aims to quantify the presence of classification bias due to mislabeled data. We noted that such mislabeling is not a statistical labeling error, but instead an intrinsic error related to the quality of the data itself (see Figure 1). The high bias at high classification probability in the GZB sample is to be expected, especially for spirals, when the data quality is low or when the classifiers are non-experts. As noted earlier, it can be difficult to distinguish between spirals and ellipticals at low brightness or small apparent size because of the data quality. At high classification probability, it should have been easy for classifiers to identify morphologies, since the classifications from different classifiers agree. This is likely to be true for spirals, but it is nearly impossible for the classifiers to separate ellipticals from spirals when the data quality is bad. When the data are bad, the classifications will always tend towards elliptical with high confidence. In other words, while it is almost certainly the case that galaxies labeled as spirals are true spirals, galaxies labeled as ellipticals are not always true ellipticals. Thus the formalism to adjust classifications for ellipticals should not converge to the raw classification, even at high classification probability.

An alternative approach to correcting biased labels is to produce a set of simulated calibration images. These images are degraded versions of high quality images, for which the ground truth labels can be accurately estimated. In Galaxy Zoo: Hubble (Willett et al., 2017), such images are labeled through the project interface, producing a set of biased labels with their corresponding ground truth labels. Their correction term allows high classification probabilities to be adjusted. Measuring biases on such corrected labels would be very interesting, but is slightly outside the scope of this paper: here, we present a metric to assess biases and show an application to low redshift galaxies. We plan to address biases for higher redshift galaxies in the future, including Galaxy Zoo: Hubble.
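The idea of a calibration image can be sketched with a toy degradation step. This is not the procedure used by Galaxy Zoo: Hubble; it is a minimal illustration that lowers the resolution of an image by block averaging and adds Gaussian noise, so that the label of the original image can serve as ground truth for the degraded one. The function `degrade`, the synthetic image, and all parameter values are assumptions.

```python
import numpy as np

def degrade(image, factor, noise_sigma, rng):
    """Lower the resolution by averaging factor x factor pixel blocks,
    then add Gaussian noise with standard deviation noise_sigma."""
    height, width = image.shape
    # Trim the image so that both sides are multiples of the block size.
    height, width = height - height % factor, width - width % factor
    blocks = image[:height, :width].reshape(
        height // factor, factor, width // factor, factor
    )
    coarse = blocks.mean(axis=(1, 3))
    return coarse + rng.normal(0.0, noise_sigma, size=coarse.shape)

# Synthetic "galaxy": a smooth blob modulated by fine structure.
y, x = np.mgrid[0:64, 0:64]
r2 = (x - 32) ** 2 + (y - 32) ** 2
image = np.exp(-r2 / 200.0) * (1.0 + 0.5 * np.sin(x / 1.5))

rng = np.random.default_rng(1)
noiseless = degrade(image, factor=4, noise_sigma=0.0, rng=rng)
degraded = degrade(image, factor=4, noise_sigma=0.05, rng=rng)
print("original shape:", image.shape, "degraded shape:", degraded.shape)
```

Block averaging preserves the mean surface brightness while washing out structure smaller than the block size, which mimics the loss of spiral features at small apparent size in a very simplified way.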

The final question regarding Figure 7 is why the machine learning classifications show less bias than the training sets they were trained on. Under perfect conditions, the learned classifications should reproduce any biases inherent to the input training sets. Recall that HC11 uses an SVM supervised machine learning algorithm that is trained on the F07 data set. However, since the bias in the F07 data is higher than in the HC11 data, we conclude that the supervised machine learning technique used by Huertas-Company et al. (2011) was able to mitigate the biases inherent in their training sets.

It is important to recognize that labeling bias mitigation can only occur if the “correct” choice of observed features is used in the machine learning training sets. The term “correct” simply means that the chosen observable parameters can in fact cause bias and that this bias can be removed with additional truth information. In other words, the bias caused by the apparent sizes can be fixed by leveraging information about the true sizes and absolute magnitudes. As a counterexample, one should be able to feed the SVM with an n-dimensional set of observed parameters that precisely recovers the training set classification (i.e., 100% accuracy with respect to the original training set). In this case, one would have the same level of bias in the SVM-trained classifications as in the eyeballed training set. In the particular case of HC11, they used features such as colors, shape, and concentration. These features correlate with morphological types independently of observational parameters such as resolution. Therefore, the observed galaxy parameter set chosen by HC11 enables the Fukugita classification biases to be minimized by the machine learning algorithm. At the same time, a machine learning model trained on features such as colors will not be able to correctly identify the morphologies of outlier galaxies, such as red spirals or blue ellipticals. These galaxies may still be recognized by a human from a relatively high quality image. Machine learning models and eyeball labels may therefore be used in a complementary way to find scientifically interesting outliers.

6. Conclusions

Observational parameters, such as resolution, can bias the procedure of human labeling of galaxies. We have developed a metric to assess systematic mislabeling of galaxy morphologies which incorporates information about the galaxy intrinsic parameters, such as their true sizes and absolute magnitudes. Our algorithm requires that the true (but unknown) fractions for the classes be constant when binned against their intrinsic parameters. We then quantify the mean deviation of the fraction of objects from the estimated intrinsic fraction in terms of their observational parameters.

We then conduct a relative comparison of labeling bias for expert, citizen science, and machine learning-based galaxy classifications between spirals and ellipticals (+S0s). We find that, when enough data are provided, the bias in expert labels is statistically lower than in the citizen science labels. We use our metric to recover the Galaxy Zoo de-biasing procedure, under the assumption that labels are biased in terms of the redshift. By using the resolution of the labeled images as the biasing parameter instead, we show that our metric is able to find biases that have not been addressed. These biases may be statistically corrected in the future in the same manner as in Galaxy Zoo. The classifications which use machine learning techniques show the lowest levels of bias, even when they are trained on biased “gold standards”. We conclude that future large-scale morphological classification efforts should employ a combination of human classifications and machine learning in order to minimize labeling bias.

In this paper we have focused on the problem of galaxy morphologies. However, our approach may be applied to any other labeled data set where intrinsic information can be inferred. We have made our code publicly available under the terms of the GNU General Public License v3.0, so that it can be used by the galaxy evolution community or for any other classification problem.

Acknowledgments


We wish to thank Nancy Hitschfeld, Benjamín Bustos, Eduardo Vera, Jaime San Martín, Chris Smith, and Alfredo Zenteno for valuable discussion and supporting our project.

G.C.V. gratefully acknowledges financial support from CONICYT-Chile through its FONDECYT postdoctoral grant number 3160747; CONICYT-Chile and NSF through the Programme of International Cooperation project DPI201400090; Basal Project PFB–03; and the Ministry of Economy, Development, and Tourism’s Millennium Science Initiative through grant IC120009, awarded to The Millennium Institute of Astrophysics (MAS). C.J.M. was supported by the National Science Foundation under Grant No. 1256260. Powered@NLHPC: This research was partially supported by the supercomputing infrastructure of the NLHPC (ECM-02). Most of the table operations and plots were done using TOPCAT (Taylor, 2005) and matplotlib (Hunter, 2007). We used numpy (Oliphant, 2006), scipy (Oliphant, 2007), and pandas (McKinney et al., 2010) for numerical computations. The kernel density estimation model was trained using scikit-learn (Pedregosa et al., 2011).

Funding for the SDSS and SDSS-II has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, the U.S. Department of Energy, the National Aeronautics and Space Administration, the Japanese Monbukagakusho, the Max Planck Society, and the Higher Education Funding Council for England. The SDSS Web Site is

The SDSS is managed by the Astrophysical Research Consortium for the Participating Institutions. The Participating Institutions are the American Museum of Natural History, Astrophysical Institute Potsdam, University of Basel, University of Cambridge, Case Western Reserve University, University of Chicago, Drexel University, Fermilab, the Institute for Advanced Study, the Japan Participation Group, Johns Hopkins University, the Joint Institute for Nuclear Astrophysics, the Kavli Institute for Particle Astrophysics and Cosmology, the Korean Scientist Group, the Chinese Academy of Sciences (LAMOST), Los Alamos National Laboratory, the Max-Planck-Institute for Astronomy (MPIA), the Max-Planck-Institute for Astrophysics (MPA), New Mexico State University, Ohio State University, University of Pittsburgh, University of Portsmouth, Princeton University, the United States Naval Observatory, and the University of Washington.

Based on observations made with the NASA/ESA Hubble Space Telescope, and obtained from the Hubble Legacy Archive, which is a collaboration between the Space Telescope Science Institute (STScI/NASA), the Space Telescope European Coordinating Facility (ST-ECF/ESA) and the Canadian Astronomy Data Centre (CADC/NRC/CSA).

References


  • Abazajian et al. (2009) Abazajian, K. N., Adelman-McCarthy, J. K., Agüeros, M. A., et al. 2009, ApJS, 182, 543
  • Ball et al. (2004) Ball, N. M., Loveday, J., Fukugita, M., et al. 2004, MNRAS, 348, 1038
  • Bamford et al. (2009) Bamford, S. P., Nichol, R. C., Baldry, I. K., et al. 2009, MNRAS, 393, 1324
  • Bentley (1975) Bentley, J. L. 1975, Communications of the ACM, 18, 509
  • Bootkrajang (2016) Bootkrajang, J. 2016, Neurocomputing, 192, 61
  • Buitrago et al. (2013) Buitrago, F., Trujillo, I., Conselice, C. J., & Häußler, B. 2013, MNRAS, 428, 1460
  • Bundy et al. (2005) Bundy, K., Ellis, R. S., & Conselice, C. J. 2005, ApJ, 625, 621
  • Cabrera et al. (2014) Cabrera, G. F., Miller, C. J., & Schneider, J. 2014, in Pattern Recognition (ICPR), 2014 22nd International Conference on, IEEE, in press

  • Cerulo et al. (2017) Cerulo, P., Couch, W., Lidman, C., et al. 2017, MNRAS, 472, 254
  • de Vaucouleurs et al. (1991) de Vaucouleurs, G., de Vaucouleurs, A., Corwin, Jr., H. G., et al. 1991, Third Reference Catalogue of Bright Galaxies. Volume I: Explanations and references. Volume II: Data for galaxies between 0 and 12. Volume III: Data for galaxies between 12 and 24.
  • de Vaucouleurs et al. (1976) de Vaucouleurs, G., de Vaucouleurs, A., & Corwin, J. R. 1976, in Second reference catalogue of bright galaxies, 1976, Austin: University of Texas Press., 0
  • Dieleman et al. (2015) Dieleman, S., Willett, K. W., & Dambre, J. 2015, MNRAS, 450, 1441
  • Dressler (1980) Dressler, A. 1980, ApJ, 236, 351
  • Edwards & Gaber (2013) Edwards, K. J., & Gaber, M. M. 2013, in Artificial Intelligence and Soft Computing, Springer, 146–157

  • Friedman et al. (1977) Friedman, J. H., Bentley, J. L., & Finkel, R. A. 1977, ACM Transactions on Mathematical Software (TOMS), 3, 209
  • Fukugita et al. (2007) Fukugita, M., Nakamura, O., Okamura, S., et al. 2007, AJ, 134, 579
  • Gauci et al. (2010) Gauci, A., Zarb Adami, K., & Abela, J. 2010, ArXiv e-prints
  • Genel et al. (2014) Genel, S., Vogelsberger, M., Springel, V., et al. 2014, ArXiv e-prints
  • Grogin et al. (2011) Grogin, N. A., Kocevski, D. D., Faber, S. M., et al. 2011, ApJS, 197, 35
  • Hastie et al. (2009) Hastie, T., Tibshirani, R., Friedman, J., et al. 2009 (Springer)
  • Hinshaw et al. (2013) Hinshaw, G., Larson, D., Komatsu, E., et al. 2013, ApJS, 208, 19
  • Hubble (1926) Hubble, E. 1926, Contributions from the Mount Wilson Observatory / Carnegie Institution of Washington, 324, 1
  • Huertas-Company et al. (2011) Huertas-Company, M., Aguerri, J. A. L., Bernardi, M., Mei, S., & Sánchez Almeida, J. 2011, A&A, 525, A157
  • Huertas-Company et al. (2015) Huertas-Company, M., Gravet, R., Cabrera-Vives, G., et al. 2015, ApJS, 221, 8
  • Huertas-Company et al. (2015) Huertas-Company, M., Pérez-González, P. G., Mei, S., et al. 2015, ApJ, 809, 95
  • Hunter (2007) Hunter, J. D. 2007, Computing In Science & Engineering, 9, 90
  • Kartaltepe et al. (2015) Kartaltepe, J. S., Mozena, M., Kocevski, D., et al. 2015, ApJS, 221, 11
  • Kramer et al. (2013) Kramer, O., Gieseke, F., & Polsterer, K. L. 2013, Expert Systems with Applications, 40, 2841
  • Lintott et al. (2011) Lintott, C., Schawinski, K., Bamford, S., et al. 2011, MNRAS, 410, 166
  • Lintott et al. (2008) Lintott, C. J., Schawinski, K., Slosar, A., et al. 2008, MNRAS, 389, 1179
  • McKinney et al. (2010) McKinney, W., et al. 2010, in Proceedings of the 9th Python in Science Conference, Vol. 445, Austin, TX, 51–56
  • Naim et al. (1997) Naim, A., Ratnatunga, K. U., & Griffiths, R. E. 1997, ApJS, 111, 357
  • Nair & Abraham (2010) Nair, P. B., & Abraham, R. G. 2010, ApJS, 186, 427
  • Odewahn et al. (2002) Odewahn, S., Cohen, S., Windhorst, R., & Philip, N. S. 2002, ApJ, 568, 539
  • Oliphant (2006) Oliphant, T. E. 2006, A guide to NumPy, Vol. 1 (Trelgol Publishing USA)
  • Oliphant (2007) —. 2007, Computing in Science & Engineering, 9
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, Journal of Machine Learning Research, 12, 2825
  • Scarlata et al. (2007) Scarlata, C., Carollo, C. M., Lilly, S., et al. 2007, ApJS, 172, 406
  • Schawinski et al. (2007) Schawinski, K., Thomas, D., Sarzi, M., et al. 2007, MNRAS, 382, 1415
  • Schutter & Shamir (2015) Schutter, A., & Shamir, L. 2015, Astronomy and Computing, 12, 60
  • Shamir et al. (2013) Shamir, L., Holincheck, A., & Wallin, J. 2013, Astronomy and Computing, 2, 67
  • Simmons et al. (2017) Simmons, B. D., Lintott, C., Willett, K. W., et al. 2017, MNRAS, 464, 4420
  • Tasca et al. (2009) Tasca, L. A. M., Kneib, J.-P., Iovino, A., et al. 2009, A&A, 503, 379
  • Taylor (2005) Taylor, M. B. 2005, in Astronomical Society of the Pacific Conference Series, Vol. 347, Astronomical Data Analysis Software and Systems XIV, ed. P. Shopbell, M. Britton, & R. Ebert, 29
  • The Astropy Collaboration et al. (2018) The Astropy Collaboration, Price-Whelan, A. M., Sipőcz, B. M., et al. 2018, ArXiv e-prints
  • Willett et al. (2013) Willett, K. W., Lintott, C. J., Bamford, S. P., et al. 2013, MNRAS, stt1458
  • Willett et al. (2017) Willett, K. W., Galloway, M. A., Bamford, S. P., et al. 2017, MNRAS, 464, 4176
  • York & SDSS Collaboration (2000) York, D. G., & SDSS Collaboration. 2000, AJ, 120, 1579