1. Introduction
For more than a century, astronomers have been working to understand galaxy properties and evolution from their morphology. The seminal example is the Hubble sequence (Hubble, 1926), which first classified galaxies into ellipticals, spirals, barred spirals, and irregulars. Galaxy morphologies have been shown to correlate with other intrinsic properties such as color, brightness, maximum rotation velocity, and gas content (Dressler, 1980). From these properties, it is possible to infer important physical properties such as stellar population fraction, surface star density, total mass, and gas-to-star conversion rates (see Odewahn et al., 2002, and references therein).
For some time, visual classifications played the dominant role in galaxy morphologies. Classifications have been done by expert astronomers (de Vaucouleurs et al., 1976, 1991; Bundy et al., 2005; Fukugita et al., 2007; Schawinski et al., 2007; Nair & Abraham, 2010; Kartaltepe et al., 2015) as well as by nonexpert citizen scientists through crowdsourcing systems such as Galaxy Zoo (Lintott et al., 2008; Bamford et al., 2009; Lintott et al., 2011; Willett et al., 2013; Simmons et al., 2017; Willett et al., 2017). With the advent of large, high-quality survey data such as those from the Sloan Digital Sky Survey (York & SDSS Collaboration, 2000) and CANDELS (Grogin et al., 2011), we are beginning to see more galaxy data sets morphologically classified by machine learning, using a variety of methods (e.g., Ball et al., 2004; Scarlata et al., 2007; Tasca et al., 2009; Gauci et al., 2010; Huertas-Company et al., 2011; Dieleman et al., 2015; Huertas-Company et al., 2015). Many of the current machine learning classification techniques fall into the category of supervised learning and thus require training data sets, usually based on visually classified morphologies. Examples of unsupervised machine learning classifications can be found in Naim et al. (1997); Edwards & Gaber (2013); Kramer et al. (2013); Shamir et al. (2013); Schutter & Shamir (2015).
When visually classifying galaxies according to their morphologies, the resulting labels will be biased in terms of observable parameters. Low-resolution and dim galaxy images will be biased towards smoother types, because the human annotator in charge of labeling the images will not be able to see the fine structure of these objects. Bias in galaxy morphology catalogs has been extensively studied by the Galaxy Zoo team (Lintott et al., 2008). In Bamford et al. (2009) and Willett et al. (2013), a bias correction term was applied to morphology probabilities by assuming that the morphological fraction does not evolve over the redshift within bins of fixed galaxy physical size and luminosity. For Galaxy Zoo: Hubble morphologies (Willett et al., 2017), artificially redshifted images have been used to quantify this bias. A different way of addressing the problem is through a machine learning approach, simultaneously learning a classification model, estimating the intrinsic biases in the ground truth, and providing new debiased labels (Cabrera et al., 2014; Bootkrajang, 2016).
In this paper, we present a metric for measuring this labeling bias in morphological classification data sets, and we compare low-redshift morphological catalogs of spiral/elliptical galaxies from experts (Fukugita et al., 2007; Nair & Abraham, 2010), nonexperts (Lintott et al., 2011), and machine learning (Huertas-Company et al., 2011). We release to the public the code for measuring labeling bias and for simulating multidimensional labeling bias. This code can be used not only by the galaxy evolution community but also by anyone interested in measuring how biased their catalogs are in terms of observable parameters.
This paper is organized as follows: in Section 2 we develop a statistical measure of labeling bias based on the fraction of objects in terms of their intrinsic and observable properties. Our metric is based upon the assertion that the fractions of labels are fixed within bins on the intrinsic properties. We then quantify variations in labeled fractions from the estimated intrinsic fraction as a function of observed properties. In Section 3 we describe the data sets to be used and how we created simulated galaxy morphology biased data sets. Some considerations on the bias-variance tradeoff of our estimators have to be taken into account. This is explained in Section 4, where we also describe the methodology used to address this issue. In Section 5 we measure the biases for different data sets and show that even “expert labels” are often biased in terms of observed quantities like apparent size. In Section 6 we describe the main conclusions coming from this work.
2. Classification Bias
In real data it may be very hard to obtain the high-quality true classification labels, which we will call the ground truth or gold standard. However, one can always make an estimate of this ground truth. In supervised and semi-supervised machine learning this is usually accomplished through human annotators. In terms of galaxy morphology, the estimated labels stem from visual inspection of galaxy images. These visually determined morphologies are sometimes used directly in scientific analyses. Sometimes they are used explicitly to train classification algorithms (e.g., supervised learning). Sometimes they are used implicitly to test such algorithms or used in conjunction with unlabeled data (e.g., semi-supervised learning). However, the galaxies are always convolved with the point spread function (PSF) of the telescope, which makes it difficult to visually (or even computationally) resolve the spiral features in small and faint galaxies. For galaxies, this means that in the estimated labels, spirals can be misclassified as ellipticals. This labeling bias is more important when the PSF is close to the angular size of the galaxies, particularly for classifications based on ground-based telescope images, such as Galaxy Zoo. As noted in Bamford et al. (2009) and Cabrera et al. (2014), this bias is neither statistical nor inherent to the visual classifiers, but a direct consequence of the quality of the data.
There are many steps which go into the visual classification of the morphologies of galaxies. While we expect classifiers to notice that the light profile is steeper for ellipticals than for disk-like spirals, classifiers also use color and spatial feature identification during their classification process. Depending on the filters used or the resolution of the galaxy image, it is possible to confuse one type of morphology with another. Worse, these mislabelings can be consistent amongst different human classifiers, leading to a high degree of statistical confidence in the wrong classification label.
In Figure 1 (right), we show a spiral galaxy that was classified as an elliptical with high confidence in the Galaxy Zoo sample, which is based on ground-based imaging from the Sloan Digital Sky Survey DR7 (Abazajian et al., 2009). On the left, we show a higher-resolution view of this same galaxy from the Hubble Space Telescope, in which one can clearly identify spiral arms. In this example, the spiral arms are washed out by the convolution of the ground-based PSF, rendering their structure undetectable to the human classifiers. It is the projected intrinsic physical scale of the underlying features relative to the PSF that drives the misclassifications.
In order to measure the amount of bias in different labeled data sets, we follow Bamford et al. (2009) and use the fractions of objects of each class as a function of the observable parameters that may bias our labels. An example of such a parameter for galaxy morphologies is the resolution of the galaxies: well-resolved galaxies are unlikely to be mislabeled, while poorly resolved galaxies are more likely to be. We would expect the fractions of an unbiased data set not to depend on these observable parameters. At the same time, we should also consider the intrinsic parameters on which the real fractions of labels depend.
Consider a set of intrinsic properties (e.g., physical size, luminosity, or redshift) on which we define multidimensional bins . Given a set of labels (e.g. for spirals and ellipticals), in each bin , we calculate the intrinsic class fraction of objects with each label as . For typical galaxy morphology data sets, we define , where is the physical radius (in kpc), is the absolute magnitude, and is the redshift for object . In other words, given a fixed bin in galaxy physical size, luminosity, and redshift, defines the intrinsic fraction of spirals compared to the total number of galaxies in bin .
We then consider the set of observed properties of the objects (e.g., angular size). We define the set of properties and create single-dimensional bins on each observed property for each of the multidimensional bins . Here defines which property and defines the range of the bin for that property. For typical galaxy morphological data sets, we define where is the angular size and is the estimated size of the point spread function at the galaxy location in the same units as its angular size.
Note the intrinsic properties are treated in multidimensional bins, , whereas within each of those bins, the observed properties are treated in individual bins, . This is because our aim is to study the biases with respect to their observed individual properties and so we require at least two bins () for each observational property. Figure 2 shows a diagram explaining binning in the intrinsic and observable parameters. We start by defining bins in the intrinsic parameters using a kd-tree (see Section 4). For each of these multidimensional bins we bin again in terms of the observable parameters and calculate the fraction of objects in each of these bins for every class.
We then calculate the observed class fraction
(1) 
where is the total number of objects with the observed property in bin , is the Kronecker delta given an estimate of each galaxy ’s classification for class . The right-hand side sums over all galaxies which are simultaneously in the observed single property bin and the intrinsic property multidimensional bin .
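The counting behind the observed class fraction can be illustrated with a short sketch. The function name, the string labels "S" and "E", and the boolean mask marking bin membership are assumptions made for this illustration and are not taken from the released code.

```python
def observed_class_fraction(labels, in_bin, k):
    """Fraction of the objects inside one bin whose estimated label equals k.

    labels : estimated class label of every object (hypothetical "S"/"E" strings)
    in_bin : True where the object falls in both the intrinsic bin and the
             observed-property bin under consideration
    k      : the class whose fraction is requested
    """
    selected = [y for y, inside in zip(labels, in_bin) if inside]
    if not selected:
        raise ValueError("the bin contains no objects")
    # The comparison y == k plays the role of the Kronecker delta.
    return sum(1 for y in selected if y == k) / len(selected)


labels = ["S", "E", "S", "E", "E", "S"]
in_bin = [True, True, True, True, False, False]
spiral_fraction = observed_class_fraction(labels, in_bin, "S")
elliptical_fraction = observed_class_fraction(labels, in_bin, "E")
```

Because every object in the bin carries exactly one label, the fractions over all classes add up to one.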
For a given classification and intrinsic property bin , we calculate the Euclidean difference between the observed class fraction and the intrinsic class fraction and sum over all the bins for the observed property
(2) 
Equation 2 should be for large and when there is no difference between the intrinsic and observed class fractions, i.e., when the classifications are unbiased with respect to an observable.
We can extend this to all classes and intrinsic and observed properties as
(3) 
where is the number of classes (2 for the case of ellipticals versus spirals). We term Equation 3 the classification bias, which quantifies the difference between the observational class fractions and the intrinsic class fractions.
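For readers who want the three quantities side by side, one possible way to write them is sketched here. The notation is an assumption introduced for illustration: $j$ indexes the multidimensional intrinsic bins $\mathcal{A}_j$, $l$ the observed property, $m$ the one-dimensional bins $\mathcal{B}_{jlm}$ of that property, $k$ the class, $\hat{y}_i$ the estimated label of galaxy $i$, and $r_{jk}$ the intrinsic class fraction. The normalizations, and the choice of a root mean square at each level, are also assumptions and may differ from the published form.

```latex
% Observed class fraction (Equation 1), assumed notation
\hat{r}_{jklm} = \frac{1}{N_{jlm}}
  \sum_{i \,\in\, \mathcal{A}_j \cap \mathcal{B}_{jlm}} \delta(\hat{y}_i, k)

% Deviation from the intrinsic fraction over the N_m observed bins (Equation 2)
\sigma_{jkl} = \sqrt{\frac{1}{N_m} \sum_{m=1}^{N_m}
  \left( \hat{r}_{jklm} - r_{jk} \right)^2 }

% Combination over K classes, N_j intrinsic bins, N_l observed properties (Equation 3)
L = \sqrt{\frac{1}{K \, N_j \, N_l}
  \sum_{j=1}^{N_j} \sum_{k=1}^{K} \sum_{l=1}^{N_l} \sigma_{jkl}^2 }
```

Under this form, $\sigma_{jkl}$ tends to zero for unbiased labels as the number of objects per bin grows, which matches the behavior described for Equation 2.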
We note that the intrinsic class fraction can vary for any data set. For instance, a data set designed to represent ellipticals might have an inherently lower spiral fraction than a broader morphological catalog containing spirals, ellipticals and irregulars. Alternatively, one might be interested in comparing classification algorithms over a wide range of classes and data sets. If so, care has to be taken that the intrinsic parameter distributions are similar, so that no selection effects influence and in turn the value of .
One would hope that the fraction of labels within bins of intrinsic properties, could in principle be measured using an unbiased (“gold standard”) data set or perhaps a subset of the data itself. It is also possible that could be predicted from theory (e.g. Genel et al., 2014). Here, we take a conservative approach and assume that all observed morphological data sets have some level of bias. We make an estimate by using the observed class fraction for the bin in observed property which is likely to have the least bias. For example, if we are calculating for , then we calculate for the bin which includes the largest values of , since it should contain the least biased classifications.
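The estimate described above can be sketched for a single intrinsic bin and a single observable. This is a minimal illustration and not the released code: the function name, the "S"/"E" labels, the equal-count binning, and the root-mean-square normalization are all assumptions.

```python
import math


def bias_in_one_intrinsic_bin(observable, labels, classes, n_bins):
    """Deviation of observed class fractions from the least-biased bin.

    The objects are sorted by the observable and split into n_bins bins of
    (nearly) equal size. The bin with the largest observable values is taken
    as the estimate of the intrinsic class fraction (an assumption that
    mirrors the text). Returns the RMS deviation averaged over the classes.
    """
    order = sorted(range(len(labels)), key=lambda i: observable[i])
    bins = [order[b * len(order) // n_bins:(b + 1) * len(order) // n_bins]
            for b in range(n_bins)]

    def fraction(indices, k):
        return sum(labels[i] == k for i in indices) / len(indices)

    total = 0.0
    for k in classes:
        r_intrinsic = fraction(bins[-1], k)  # least-biased bin
        squared = [(fraction(b, k) - r_intrinsic) ** 2 for b in bins]
        total += math.sqrt(sum(squared) / n_bins)
    return total / len(classes)


size = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical observable, e.g. apparent size
biased = ["E", "E", "E", "E", "S", "E", "S", "E"]    # small objects all look "E"
unbiased = ["S", "E", "S", "E", "S", "E", "S", "E"]  # same mix at every size
bias_biased = bias_in_one_intrinsic_bin(size, biased, ["S", "E"], 2)
bias_unbiased = bias_in_one_intrinsic_bin(size, unbiased, ["S", "E"], 2)
```

In the toy data the spiral fraction drops from one half in the large-size bin to zero in the small-size bin, so the first value is clearly positive while the second is exactly zero.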
Figure 3 shows an example of binning in intrinsic and observable parameters for Galaxy Zoo data. Here, we build the kd-tree splitting the data in terms of , , and creating a 3-dimensional partition of the data. For the data falling in each of the intrinsic bins we calculate the fraction of spiral and elliptical galaxies as a function of the observable parameter . As decreases (smaller objects), the fraction of spiral galaxies decreases and the fraction of elliptical galaxies increases. In other words, smaller spiral galaxies are confused with ellipticals. In order to calculate our bias metric (Eq. 3) we need the intrinsic class fractions . The least biased bins in the observable parameters are the ones with the biggest , which we consider as our estimate for the intrinsic class fractions. Figure 4 shows the fractions of spiral and elliptical galaxies in terms of the observable parameter for bins in intrinsic parameters. Independently of the bin in terms of , , and , the fraction of spirals increases with , while the fraction of ellipticals decreases. In order to calculate the dataset bias, we use as intrinsic class fractions the fraction in the bin with the highest , denoted by a dot in Figure 4.
3. Data Sets
In this section we describe the data we use in our experiments. All data use the r-band from SDSS (Abazajian et al., 2009) and the 9-year WMAP cosmology (Hinshaw et al., 2013) from astropy (The Astropy Collaboration et al., 2018).
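The cosmology matters because the intrinsic quantities (physical radius and absolute magnitude) are derived from the observed ones (angular radius, apparent magnitude, and redshift). The sketch below shows that conversion for a flat Lambda-CDM model using only the standard library. The default parameter values are illustrative assumptions chosen to be close to a 9-year WMAP cosmology, the function names are hypothetical, and K-corrections and extinction are ignored. For real work the astropy cosmology objects mentioned in the text are the better choice.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s


def comoving_distance_mpc(z, h0=69.32, om0=0.2865, steps=1000):
    """Line-of-sight comoving distance in Mpc for a flat Lambda-CDM model.

    h0 (km/s/Mpc) and om0 are assumed, illustrative parameter values.
    The integral of 1/E(z) is evaluated with the trapezoid rule.
    """
    def inv_e(x):
        return 1.0 / math.sqrt(om0 * (1.0 + x) ** 3 + (1.0 - om0))

    dz = z / steps
    total = 0.5 * (inv_e(0.0) + inv_e(z))
    for i in range(1, steps):
        total += inv_e(i * dz)
    return C_KM_S / h0 * total * dz


def physical_radius_kpc(theta_arcsec, z, **cosmo):
    """Physical size subtended by an angle, via the angular diameter distance."""
    d_a_mpc = comoving_distance_mpc(z, **cosmo) / (1.0 + z)
    theta_rad = theta_arcsec * math.pi / (180.0 * 3600.0)
    return theta_rad * d_a_mpc * 1000.0


def absolute_magnitude(m_apparent, z, **cosmo):
    """Absolute magnitude from the distance modulus (no K-correction)."""
    d_l_pc = comoving_distance_mpc(z, **cosmo) * (1.0 + z) * 1.0e6
    return m_apparent - 5.0 * math.log10(d_l_pc / 10.0)


radius_kpc = physical_radius_kpc(1.0, 0.05)  # 1 arcsec at z = 0.05
abs_mag = absolute_magnitude(17.0, 0.05)     # r = 17 at z = 0.05
```

At z = 0.05 one arcsecond corresponds to roughly one kiloparsec under these assumptions, which gives a feel for how close typical ground-based seeing is to the scale of spiral structure.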
3.1. Eyeball Classifications
Fukugita et al. (2007) (hereafter F07) have visually classified 2,275 galaxies, each by three experts. They defined a morphological index such that for E, S0, Sa, Sb, Sc, Sd, Im, respectively. In order to measure their bias, we focus on just the elliptical (+S0) galaxies (N=941) having , and the spirals (N=902) having , since the other data sets we compare to only use these two classes. We cross-match these data to the SDSS DR7 to obtain their apparent magnitudes (Petrosian r-band), their apparent sizes (Petrosian r-band radii), and their redshifts.
We also use expert labels from Nair & Abraham (2010) (NA10 hereafter), who have visually classified 14,034 spectroscopically targeted galaxies from the SDSS. They report T-Types as well as other morphological features such as bars, rings, lenses, and tails, among others. As with the F07 sample, we focus on elliptical (+S0) galaxies (N=6,276) having and spirals (N=7,640) having , where are their T-Types.
3.2. Galaxy Zoo
We use the Galaxy Zoo 1 data release (Lintott et al., 2011) and their sample with spectra in SDSS which contains classifications for 667,944 galaxies achieved by crowdsourcing. We define two subsets of the Galaxy Zoo 1: (a) the original biased morphologies (hereafter GZB) and (b) the “debiased” morphologies (hereafter GZD). The debiasing procedure used is described in detail in Bamford et al. (2009) and Lintott et al. (2011). Briefly, they assumed that the morphological fraction within bins of fixed galaxy physical size and luminosity does not evolve over the redshift of their data. From that assumption, a bias correction term was estimated in bins of physical size and luminosity and then applied to the original spiral and elliptical classification probabilities. Their algorithm helped motivate our approach to quantify classification bias as described in Section 2.
We cross-match the Galaxy Zoo catalog to the SDSS DR7 to obtain the observed properties, including each galaxy’s point-spread function (PSF, determined over the SDSS field). We used the SDSS field-specific parameter as an estimate of the FWHM for a Gaussian PSF at the location of each galaxy. When galaxies belong to more than one field, we used the galaxy classification and properties pertaining to the field with the smallest PSF.
3.3. Supervised Learned Morphologies
Huertas-Company et al. (2011) (hereafter HC11) used a support vector machine (SVM) classification model trained over the data set from Fukugita et al. (2007). The HC11 morphologies are probability densities, and so we defined elliptical (+S0s) galaxies as having a probability of being early-type Early and spiral galaxies having Spiral, where takes values of 0.5 and 0.8. As with the previous data sets, we cross-match the HC11 data to the SDSS DR7 to ensure that all galaxies in our data sets have the same observed properties and that there are no duplicates.
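Turning class probabilities into hard labels with a threshold can be sketched as follows. The function name, the strict inequality, and the treatment of objects that pass neither cut (left unlabeled here) are assumptions made for this illustration. Only the threshold values 0.5 and 0.8 come from the text.

```python
def labels_from_probabilities(p_early, p_spiral, threshold):
    """Assign "E" or "S" when the matching probability exceeds the threshold.

    Objects for which neither probability exceeds the threshold get None,
    i.e. they are dropped from the two-class comparison (an assumption).
    """
    labels = []
    for pe, ps in zip(p_early, p_spiral):
        if pe > threshold:
            labels.append("E")
        elif ps > threshold:
            labels.append("S")
        else:
            labels.append(None)
    return labels


p_early = [0.9, 0.6, 0.1, 0.45]
p_spiral = [0.1, 0.4, 0.9, 0.55]
loose = labels_from_probabilities(p_early, p_spiral, 0.5)
strict = labels_from_probabilities(p_early, p_spiral, 0.8)
```

A higher threshold keeps only confidently classified objects, so the labeled sample shrinks as the threshold grows.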
3.4. Simulated Morphology Catalogs
In order to assess the validity of our method, we created a simulated catalog following the Galaxy Zoo 1 distribution of parameters. We used a kernel density estimation (see Hastie et al., 2009, and references therein) with a Gaussian kernel to estimate the distribution of angular Petrosian radius , apparent Petrosian magnitude , redshift , PSF, and debiased probabilities, randomly choosing 100,000 galaxies from GZ1. Using these parameters we calculate their physical Petrosian radius , absolute Petrosian magnitude , and . We consider and as the biasing parameters, so we artificially created this bias by changing the labels from spirals to ellipticals with a Gaussian probability depending on these parameters:
(4) 
where the probability of modifying a label from S to E depends on a biasing parameter which controls the amount of bias in the data set. The higher the value of , the larger the amount of bias. Notice that this added bias is normalized in terms of and , by using their median values , and .
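The idea of injecting a controlled labeling bias can be sketched in a few lines. The functional form below is an assumption made for illustration, not a reproduction of Equation 4: it uses a single biasing observable (a hypothetical apparent size `theta`), normalizes it by its median, and applies a Gaussian fall-off scaled by a bias strength between 0 and 1.

```python
import math
import random
import statistics


def flip_spirals(labels, theta, bias, seed=0):
    """Relabel spirals as ellipticals with a size-dependent probability.

    labels : list of "S"/"E" labels treated as the unbiased truth
    theta  : biasing observable for each object (must have a nonzero median)
    bias   : strength in [0, 1]; 0 leaves every label untouched
    The flip probability bias * exp(-0.5 * (theta / median)^2) is an assumed
    form: small objects are flipped most often, large ones almost never.
    """
    rng = random.Random(seed)
    theta_median = statistics.median(theta)
    biased = []
    for y, t in zip(labels, theta):
        p_flip = bias * math.exp(-0.5 * (t / theta_median) ** 2)
        if y == "S" and rng.random() < p_flip:
            biased.append("E")
        else:
            biased.append(y)
    return biased


true_labels = ["S", "S", "S", "S", "S", "S", "E", "E"]
theta = [0.5, 1.0, 1.5, 2.0, 4.0, 8.0, 1.0, 3.0]
untouched = flip_spirals(true_labels, theta, bias=0.0)
biased_labels = flip_spirals(true_labels, theta, bias=0.9)
```

Only spirals are ever changed, so the simulated bias is one-directional, in line with the description above.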
4. Impact of Sampling over the Estimator
Equation 3 is a statistical measure of the classification bias for any data set with classes and requires bin definitions on the observed properties and multidimensional data set binning of the intrinsic properties . In this section, we examine the effects of how the bins are defined using the simulated morphology catalogs which have varying degrees of bias.
We bin the intrinsic properties of the data using kd-trees. A kd-tree is a data structure for storing a finite set of points from a k-dimensional space. It was examined in detail by Bentley (1975) and Friedman et al. (1977). kd-trees have the benefit of dividing the data into bins for optimal querying performance. They are well characterized in the literature and numerous libraries exist to build such trees. The total number of bins in these trees is , where is the height of the tree. As increases, the bins get smaller, causing the number of points inside each bin to get smaller too. The dimension of our kd-tree depends on the number of intrinsic properties we are examining. In this work, we use the absolute magnitude, the physical size, and the redshift for our tree.
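A kd-tree style partition can be sketched with recursive median splits. This is a minimal illustration under stated assumptions (splitting dimensions are cycled in order, and the function name is hypothetical); production code would normally rely on an existing library. With this construction a tree of height h yields 2**h leaf bins of nearly equal occupancy.

```python
import random


def kd_bins(points, height, depth=0):
    """Partition points into 2**height bins by recursive median splits.

    points : list of equal-length tuples, e.g. (magnitude, size, redshift)
    height : number of successive splits; each level halves every bin
    The splitting dimension cycles through the coordinates with depth.
    """
    if height == 0:
        return [points]
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return (kd_bins(ordered[:mid], height - 1, depth + 1)
            + kd_bins(ordered[mid:], height - 1, depth + 1))


rng = random.Random(1)
points = [(rng.random(), rng.random(), rng.random()) for _ in range(64)]
leaves = kd_bins(points, height=3)
leaf_sizes = [len(leaf) for leaf in leaves]
```

Since each split is made at the median, every leaf holds the same number of objects when the sample size is a multiple of 2**h, which keeps the statistical noise comparable across intrinsic bins.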
For the observed properties, we need to build a grid defining the ranges on each of the observational parameters (e.g., the resolution limits within the bin). We choose a simple linear binning procedure such that the number of observed galaxies in each bin is equal.
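Bins holding equal numbers of galaxies can be obtained by sorting on the observable and cutting the sorted list into consecutive chunks. The helper below is a hypothetical sketch of that idea; it returns the indices of the objects in each bin, from the smallest to the largest values of the observable.

```python
def equal_count_bins(values, n_bins):
    """Split object indices into n_bins bins with (nearly) equal counts.

    Bins are ordered by increasing value, so the last bin holds the objects
    with the largest values of the observable.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        bins.append(order[start:stop])
    return bins


apparent_size = [5, 1, 4, 2, 3, 6]  # hypothetical observable values
bins = equal_count_bins(apparent_size, 3)
```

For the toy values above, the three bins contain the indices of the objects with sizes (1, 2), (3, 4), and (5, 6), respectively.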
Having defined the bins on the intrinsic and observed properties of each galaxy, as well as the morphological classifications, we examine the robustness of the labeling bias estimator, Equation 3.
4.1. Finite-Sampling Bias and Variance
The number of 3D bins on the intrinsic properties, the number of 1D bins in the observable parameters, as well as the total number of objects in each of the bins, combine to impact . Because real data sets have finite size, the tradeoff between bias and variance of our estimators has to be taken into account when defining the binning strategy. We use the simulations to show the impact of the selected binning strategy on our bias metric. Our results are shown in Figure 5, where the left panel is for simulated galaxies following GZD probability distributions ( ) and the right panel shows a simulated bias of .
[Figure 5. Left panel: unbiased simulations. Right panel: simulations with added bias.]
First, consider a simple fixed binning scenario where we allow the number of galaxies per bin to vary. Figure 5 shows this effect over simulations for different binning strategies. One can see that decreases as a function of the square root of the total number of galaxies in each bin. There is a point after which adding more galaxies to each bin does not reduce significantly. We use the shape of this curve to define the optimal number of total galaxies per bin.
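The square-root behavior can be checked with a small Monte Carlo experiment. The sketch below is an illustration under simple assumptions: labels are unbiased with a spiral fraction of one half, there are only two observed bins, and the "metric" is reduced to the absolute difference between the two bin fractions. The function name and parameters are hypothetical.

```python
import random


def noise_floor(n_per_bin, n_trials=200, seed=0):
    """Mean absolute difference between two unbiased bin fractions.

    Both bins draw labels with the same spiral probability (0.5), so any
    difference between their fractions is pure finite-sampling noise.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        f_low = sum(rng.random() < 0.5 for _ in range(n_per_bin)) / n_per_bin
        f_high = sum(rng.random() < 0.5 for _ in range(n_per_bin)) / n_per_bin
        total += abs(f_low - f_high)
    return total / n_trials


noise_small_bins = noise_floor(50)
noise_large_bins = noise_floor(5000)
```

Increasing the number of objects per bin by a factor of 100 should lower this noise floor by roughly a factor of 10, which is why a measured bias is only meaningful relative to the floor set by the bin occupancy.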
Next, consider a fixed number of objects per bin and a fixed number of bins on the intrinsic properties. As one decreases the number of 1D bins in the observable parameters, due to the bias-variance tradeoff (see Hastie et al., 2009), there will be a corresponding decrease in the statistical variance for the estimate of the fractions , at the expense of increasing the statistical bias. The extreme case is a single bin with very low variance. However, as shown in Figure 2, a single bin in provides no useful information on the bias we are trying to measure: at least two bins in the observed properties are required in order to track observational bias. Regardless, the decrease in variance (simply due to fewer bins) simultaneously decreases the value of the labeling bias . This is shown in Figure 5 by considering curves with the same number of intrinsic bins and noting that is always lowest for the fewest number of observed bins.
On the other hand, if we fix the number of objects per bin as well as the number of bins on the observed properties, then by decreasing the number of bins on the intrinsic parameters, we lose information about the true object fractions, thus causing an increase in the bias of the estimator for the intrinsic class fraction . This produces an increase of the differences between the observational class fractions and the estimate for the intrinsic class fractions in Eq. 2, increasing the value of . This is shown in Figure 5 by considering curves with equal number of bins in the observables, noting that is always lowest for the highest number of intrinsic bins.
Figure 5 allows us to define a binning procedure for any data set. Notice how the number of objects per bin and the binning impact the value of our bias metric for data sets with the same amount of simulated bias. Also notice that is always higher for the data set with higher simulated bias for a given binning strategy, which suggests that any binning strategy helps evaluate differences in biases between data sets, as long as enough objects per bin are considered.
Since the value of can vary as a function of the binning, we must be careful to use the same binning procedure when conducting relative comparisons of one or more data sets, even if the binning is not optimal for any specific data set. In practice, when comparing the labeling bias for different sets of data, the binning strategy is defined by the data set with the smallest number of objects.
4.2. Choosing number of bins for real data
For the datasets in this work, we consider binning strategies that split all parameters (observable and intrinsic) into the closest number of bins. Given this constraint, we then search for the maximum number of bins such that the running slope of in Figure 5 is for the maximum number of objects per bin allowed by our data set size. Because the real data is noisy, we calculate the mean value of over 20 bootstrapping subsamples, considering the same number of bins for each intrinsic and observable parameter. The kd-tree automatically defines the multidimensional binning on the intrinsic parameters.
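Averaging a statistic over bootstrap subsamples can be sketched as follows. The helper name is hypothetical, resampling with replacement at the original sample size is an assumption about the scheme, and the spiral fraction stands in for the bias metric to keep the example short. The default of 20 resamples follows the text.

```python
import random
import statistics


def bootstrap_mean_std(data, statistic, n_boot=20, seed=0):
    """Mean and standard deviation of a statistic over bootstrap resamples.

    data      : list of objects (here, class labels)
    statistic : function mapping a resampled list to a number
    n_boot    : number of resamples
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]  # draw with replacement
        values.append(statistic(sample))
    return statistics.mean(values), statistics.stdev(values)


def spiral_fraction(sample):
    return sample.count("S") / len(sample)


labels = ["S"] * 30 + ["E"] * 70
mean_fraction, std_fraction = bootstrap_mean_std(labels, spiral_fraction)
```

The standard deviation over the resamples provides the kind of error bar needed to judge whether two data sets differ significantly in their measured bias.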
Special care has to be taken when comparing two or more data sets of different sizes. On a larger data set we may be able to use more bins and/or number of objects per bin, but when comparing it to a smaller data set, this sampling is not going to be feasible to use. In order to make a fair comparison, we need to sample in terms of the smaller data set, so that biases and variance over the distribution of fractions are comparable.
5. Bias for galaxy morphologies
Now that we have defined how to choose the binning in Equation 3, we measure the classification bias for the different datasets defined in Section 3. In Section 5.1 we follow the approach proposed by Bamford et al. (2009) and use the redshift as a way to quantify the morphological bias. Then, in Section 5.2 we use as our biasing observable parameters and , , and as intrinsic parameters.
5.1. Redshift as biasing parameter
We start by following the approach proposed by Bamford et al. (2009) and consider redshift as a biasing parameter and physical radius and absolute magnitude as intrinsic parameters. The smallest dataset is F07 with 1843 spirals and ellipticals. From this, we used the technique described in Section 4.2 to determine the binning. We find the best finite-sampling bias levels for a maximum binning size of 8 in the intrinsic parameters and 2 in the observed parameters. We then apply this binning scheme to all of the datasets to measure the classification bias using Equation 3. For the F07 data set we obtain 115 galaxies per bin, so we fix this number for all the other datasets. The data from F07 and NA10 only contains galaxies for , so in order for the comparison of biases between datasets to be fair, we consider galaxies with in the GZ and HC11 datasets. Figure 6a shows the bias for different datasets. Notice that the standard deviation of makes it hard to draw statistically significant conclusions on the difference between datasets.
[Figure 6: panels (a), (b), and (c).]
If we exclude F07, the smallest dataset is NA10, with 13,916 galaxies. For these data, and using the procedure from Section 4.2, we find a maximum binning size of in the intrinsic parameters and 4 bins in the observable parameters (217 galaxies per bin). Figure 6b shows the results of the bias for each data set. The error bars now allow us to interpret these results with higher significance. The bias for expert annotators of NA10 is similar to the one from HC11, and both are smaller than the bias of Galaxy Zoo. Now, we can see that our metric starts to recover the debiasing procedure from Bamford et al. (2009): the value of is lower for GZD than for GZB.
If we exclude both F07 and NA10, we can consider galaxies with . The smallest dataset is then the Galaxy Zoo biased sample with 237,963 galaxies, so we are able to use a larger number of bins and thus obtain better estimates for . Using the method described in Section 4.2 we obtain bins for the intrinsic parameters and 16 bins for the observable parameters, from which we can use 58 objects per bin. In Figure 6c we show the labeling bias as defined by Equation 3 using this binning strategy for GZB, GZD, and HC11. Now, we can clearly recover the debiasing procedure proposed by Bamford et al. (2009). The highest values of are obtained over the GZB dataset, while the GZD dataset achieves a significantly lower . This shows that our proposed metric is capable of measuring biases given an assumption of intrinsic and biasing parameters. Again, the lowest labeling bias is obtained for HC11.
5.2. Apparent radius as biasing parameter
As opposed to Bamford et al. (2009), who used redshift as the parameter with which to characterize and correct labeling bias, in this section we treat the apparent size as the parameter that governs bias. With respect to the PSF, it is the apparent size of a galaxy that will determine whether or not spiral features are washed out to become undetectable. We then include redshift as an intrinsic parameter since we expect it to play a role in the underlying fraction of spirals and ellipticals, which we know to evolve over time (Buitrago et al., 2013; Huertas-Company et al., 2015; Cerulo et al., 2017). There is a concern that the apparent size as an observable parameter is degenerate with the combination of the redshift and the physical size for any galaxy. A small nearby galaxy can have the same apparent size as a large and more distant galaxy. However, by also including the absolute magnitude as an intrinsic parameter, this degeneracy is broken. In other words, a small and a large galaxy with the same apparent size will never be in the same bin, since the small (and thus intrinsically dim) galaxy will appear in a different magnitude bin than a large (and intrinsically bright) galaxy.
We start with the smallest dataset, F07, with 1843 spirals and ellipticals. We find the best finite-sampling bias levels for a maximum bin number of 8 in the intrinsic parameters and 2 in the observed parameters, obtaining 115 galaxies per bin. Note that this is the minimum number of bins we can apply, given the number of intrinsic and biasing (observed in this case) properties in the data. Recall the data from F07 and NA10 only contains galaxies for , so again we consider galaxies with in the GZ and HC11 datasets. Figure 7a shows the biases under these assumptions for different datasets. Again, due to the size of the standard deviation error bars of , it is hard to draw statistically significant conclusions on the difference between datasets.
[Figure 7: panels (a), (b), and (c).]
If we exclude F07, we find a maximum binning size of in the intrinsic parameters and 3 bins in the observable parameters with 144 galaxies per bin. Figure 7b shows the results of the bias for each data set. HC11 presents the lowest bias. Expert labels from NA10 are less biased than Galaxy Zoo. With this number of galaxies there is no statistically significant difference between GZB and GZD for a given probability threshold.
If we exclude both F07 and NA10, we are able to consider galaxies with and use bins for the intrinsic parameters and 4 bins for the biasing parameter , from which we obtain 232 objects per bin. In Figure 7c we show the labeling bias as defined by Equation 3 using this binning strategy for GZB, GZD, and HC11. The highest values of are obtained over the GZ biased data sets and the lowest labeling bias is obtained for HC11. With this amount of data we notice that GZD with shows a smaller amount of bias than GZB. At the same time, by choosing the selected GZD data set is significantly more biased than the GZD data with and closer to GZB for . In other words, it appears that the debiasing procedure implemented in Bamford et al. (2009) for Galaxy Zoo classifications does not work when the vast majority of classifiers agree on the morphological type.
We explore this interesting result further in Figure 8, where we plot the bias as a function of an increasing Galaxy Zoo classification probability threshold. For the biased sample, we see no clear trend. However, the debiased sample shows a trend of increasing bias with increasing classification probability threshold.
We can explain Figure 8 in the following way. First, Bamford et al. (2009) use a statistical correction (their equations A3 and A4) that depends on both the raw classification probabilities as well as the intrinsic characteristics of the galaxy (e.g., absolute magnitude, physical Petrosian radius, redshift). This form of correction was chosen under the assumption that at high classification probabilities, no morphology adjustment should be applied since the labels would be correct (see Figure A9 of Bamford et al. (2009)). Thus, the fact that the classification bias is closer to the one from GZB at high in Figure 8 stems from the design of the classification adjustment formalism. Since the correction term approaches zero at high , the sample reverts back to the same level of bias inherent in the nominal biased sample.
What is perhaps surprising is that for the Galaxy Zoo Biased (GZB) sample, the level of bias does not decrease as the classifications reach higher levels of confidence (high ). Recall from the Introduction that our algorithm aims to quantify the presence of classification bias due to mislabeled data. We noted that such mislabeling error is not a statistical labeling error, but instead an intrinsic error related to the quality of data itself (see Figure 1). The high bias at in the Galaxy Zoo Biased sample is to be expected, especially for spirals, when the data quality is low or when the classifiers are nonexpert. As noted earlier, it can be difficult to distinguish between spirals and ellipticals due to the data quality at low brightness or small apparent size. At , it should have been easy for classifiers to have identified morphologies since the classifications from different classifiers agree. This is likely to be true for spirals, but it is nearly impossible for the classifiers to separate ellipticals from spirals when the data quality is bad. When the data is bad the classifications will always tend towards elliptical with high confidence. In other words, while it is almost certainly the case that spiral galaxies are true spirals, ellipticals are not always true ellipticals. Thus the formalism to adjust classifications for ellipticals should not converge to the raw classification, even at high .
An alternative approach to correcting biased labels is to produce a set of simulated calibration images. These images are degraded versions of high quality images, for which the ground truth labels can be accurately estimated. In Galaxy Zoo: Hubble (Willett et al., 2017) such images are labeled through the same interface, producing a set of biased labels with their corresponding ground truth labels. Their correction term allows high classification probabilities to be adjusted. Measuring biases on such corrected labels would be very interesting, but is beyond the scope of this paper: here, we present a metric to assess biases and show an application to low redshift galaxies. We plan to address biases for higher redshift galaxies in the future, including Galaxy Zoo: Hubble.
The final question regarding Figure 7 is why the machine learning algorithms perform better than the training sets they used. Under perfect conditions the learned classifications should reproduce any biases inherent to the input training sets. Recall that HC11 uses a supervised SVM machine learning algorithm trained on the F07 data set. However, since the bias in the F07 data is higher than in the HC11 data, we conclude that the supervised machine learning technique used by Huertas-Company et al. (2011) was able to mitigate the biases inherent in its training set.
It is important to recognize that labeling bias mitigation can only occur if the “correct” choice of observed features is used in the machine learning training sets. The term “correct” simply means that the chosen observable parameters can in fact cause bias and that this bias can be removed with additional truth information. In other words, the bias caused by the apparent sizes can be fixed by leveraging information about the true sizes and absolute magnitudes. As a counterexample, one should be able to feed the SVM tool with an n-dimensional set of observed parameters that precisely recovers the training set classification (i.e., 100% accuracy with respect to the original training set). In this case, one would have the same level of bias in the SVM-trained classifications as in the eyeballed training set. In the particular case of HC11, the features used were colors, shape, and concentration. These features correlate with morphological type independently of observational parameters such as resolution. Therefore, the observed galaxy parameter set chosen by HC11 enables the machine learning algorithm to minimize the Fukugita classification biases. At the same time, a machine learning model trained on features such as colors will not be able to correctly identify the morphologies of outlier galaxies, such as red spirals or blue ellipticals. These galaxies may still be recognized by a human from a relatively high quality image. Machine learning models and eyeball labels may therefore be used in a complementary way to find scientifically interesting outliers.
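The mitigation argument can be sketched on synthetic data. In the example below a one-feature threshold classifier stands in for the SVM, and the colour distributions, the eyeball labeling model, and the helper names (`fraction_rms`, `fit_stump`) are hypothetical choices for illustration, not taken from HC11 or F07. The only property that matters is that the feature is independent of resolution while the training labels are not.

```python
import math
import random


def fraction_rms(labels, observable, n_bins=5):
    """RMS deviation of the class-1 fraction across equal-count observable bins."""
    order = sorted(range(len(labels)), key=lambda i: observable[i])
    size = len(order) // n_bins
    overall = sum(labels) / len(labels)
    devs = []
    for k in range(n_bins):
        cell = order[k * size:(k + 1) * size]
        devs.append(sum(labels[i] for i in cell) / len(cell) - overall)
    return math.sqrt(sum(d * d for d in devs) / n_bins)


def fit_stump(feature, labels, candidates):
    """Pick the threshold c minimizing training errors of 'spiral if feature < c'."""
    def errors(c):
        return sum((1 if x < c else 0) != y for x, y in zip(feature, labels))
    return min(candidates, key=errors)


rng = random.Random(1)
n = 5000
true_spiral = [rng.randint(0, 1) for _ in range(n)]
# Assumption: colour separates the classes and does not depend on resolution.
colour = [rng.gauss(-1.0 if t else 1.0, 0.5) for t in true_spiral]
resolution = [rng.random() for _ in range(n)]
# Assumed eyeball labels: a spiral is recognized with probability
# 0.4 + 0.6 * resolution, otherwise it is labelled elliptical.
eyeball = [t if rng.random() < 0.4 + 0.6 * r else 0
           for t, r in zip(true_spiral, resolution)]

# Train on the biased eyeball labels, using colour only.
threshold = fit_stump(colour, eyeball, [c / 20 for c in range(-40, 41)])
predicted = [1 if x < threshold else 0 for x in colour]

bias_eyeball = fraction_rms(eyeball, resolution)
bias_predicted = fraction_rms(predicted, resolution)
print(f"colour threshold: {threshold:.2f}")
print(f"resolution dependence, eyeball labels:   {bias_eyeball:.3f}")
print(f"resolution dependence, predicted labels: {bias_predicted:.3f}")
```

Because the prediction depends on colour alone, its spiral fraction is nearly flat in resolution even though the training labels are not. The same model necessarily mislabels any spiral that falls on the red side of the threshold, which is the outlier limitation described above.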
6. Conclusions
Observational parameters, such as resolution, can bias the human labeling of galaxies. We have developed a metric to assess systematic mislabeling of galaxy morphologies which incorporates information about the galaxies' intrinsic parameters, such as their true sizes and absolute magnitudes. Our algorithm requires that the true (but unknown) class fractions be constant within bins of the intrinsic parameters. We then quantify the mean deviation of the observed class fractions, as a function of the observational parameters, from the estimated intrinsic fraction.
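A minimal sketch of this kind of metric is given below. It is not the released implementation: it assumes a single intrinsic parameter and a single observable parameter, equal-count bins, a binary label, and a root-mean-square deviation as one possible choice of deviation measure. The function name `labeling_bias` and the synthetic data are hypothetical.

```python
import math
import random


def labeling_bias(labels, intrinsic, observable, n_int_bins=4, n_obs_bins=5):
    """Toy labeling-bias score for a binary label (1 = class of interest).

    Objects are split into equal-count bins of the intrinsic parameter; inside
    each intrinsic bin they are split again into equal-count bins of the
    observable parameter. The score is the RMS deviation of the class fraction
    in each observable bin from the fraction of its whole intrinsic bin.
    Assumes at least n_int_bins * n_obs_bins objects.
    """
    order = sorted(range(len(labels)), key=lambda i: intrinsic[i])
    int_size = len(order) // n_int_bins
    sq_dev, n_cells = 0.0, 0
    for b in range(n_int_bins):
        members = order[b * int_size:(b + 1) * int_size]
        # Estimated intrinsic fraction, assumed constant within this bin.
        f_int = sum(labels[i] for i in members) / len(members)
        members.sort(key=lambda i: observable[i])
        obs_size = len(members) // n_obs_bins
        for k in range(n_obs_bins):
            cell = members[k * obs_size:(k + 1) * obs_size]
            f_obs = sum(labels[i] for i in cell) / len(cell)
            sq_dev += (f_obs - f_int) ** 2
            n_cells += 1
    return math.sqrt(sq_dev / n_cells)


# Synthetic catalogue (all distributions are assumptions for illustration).
rng = random.Random(0)
n = 4000
true_size = [rng.uniform(1.0, 10.0) for _ in range(n)]   # intrinsic parameter
resolution = [rng.uniform(0.0, 1.0) for _ in range(n)]   # observable parameter
true_spiral = [1 if rng.random() < 0.6 else 0 for _ in range(n)]
# Biased labels: a true spiral is recognized with probability equal to the
# resolution, otherwise it is labelled elliptical (0).
biased = [t if rng.random() < r else 0 for t, r in zip(true_spiral, resolution)]

bias_true = labeling_bias(true_spiral, true_size, resolution)
bias_biased = labeling_bias(biased, true_size, resolution)
print(f"score for true labels:   {bias_true:.3f}")
print(f"score for biased labels: {bias_biased:.3f}")
```

For unbiased labels the score only reflects the sampling noise of the binned fractions, so it is best read relative to another label set for the same objects rather than as an absolute number.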
We then conduct a relative comparison of labeling bias for expert, citizen science, and machine-learning-based galaxy classifications between spirals and ellipticals (+S0s). We find that, when enough data is provided, the bias in expert labels is statistically lower than that in the citizen science labels. We use our metric to recover the Galaxy Zoo debiasing procedure, under the assumption that labels are biased in terms of redshift. By instead using the resolution of the labeled images as the biasing parameter, we show that our metric is able to find biases that have not been addressed. These biases may be statistically corrected in the future in the same manner as in Galaxy Zoo. The classifications which use machine learning techniques show the lowest levels of bias, even when they are trained on biased “gold standards”. We conclude that future large-scale morphological classification efforts should employ a combination of human classifications and machine learning in order to minimize labeling bias.
In this paper we have focused on the problem of galaxy morphologies. However, our approach may be applied to any other labeled data set where intrinsic information can be inferred. We have made our code publicly available at https://github.com/guillec/labeling_bias (licensed under the terms of the GNU General Public License v3.0) so that it can be used by the galaxy evolution community or for any other classification problem.
Acknowledgements
We wish to thank Nancy Hitschfeld, Benjamín Bustos, Eduardo Vera, Jaime San Martín, Chris Smith, and Alfredo Zenteno for valuable discussion and supporting our project.
G.C.V. gratefully acknowledges financial support from CONICYT-Chile through its FONDECYT postdoctoral grant number 3160747; CONICYT-Chile and NSF through the Programme of International Cooperation project DPI201400090; Basal Project PFB–03; the Ministry of Economy, Development, and Tourism’s Millennium Science Initiative through grant IC120009, awarded to The Millennium Institute of Astrophysics (MAS). CJM was supported by the National Science Foundation under Grant No. 1256260. Powered@NLHPC: This research was partially supported by the supercomputing infrastructure of the NLHPC (ECM-02). Most of the table operations and plots were done using TOPCAT (Taylor, 2005) and matplotlib (Hunter, 2007). We used numpy (Oliphant, 2006), scipy (Oliphant, 2007), and pandas (McKinney et al., 2010) for numerical computations. The kernel density estimation model was trained using scikit-learn (Pedregosa et al., 2011).
Funding for the SDSS and SDSS-II has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, the U.S. Department of Energy, the National Aeronautics and Space Administration, the Japanese Monbukagakusho, the Max Planck Society, and the Higher Education Funding Council for England. The SDSS Web Site is http://www.sdss.org/.
The SDSS is managed by the Astrophysical Research Consortium for the Participating Institutions. The Participating Institutions are the American Museum of Natural History, Astrophysical Institute Potsdam, University of Basel, University of Cambridge, Case Western Reserve University, University of Chicago, Drexel University, Fermilab, the Institute for Advanced Study, the Japan Participation Group, Johns Hopkins University, the Joint Institute for Nuclear Astrophysics, the Kavli Institute for Particle Astrophysics and Cosmology, the Korean Scientist Group, the Chinese Academy of Sciences (LAMOST), Los Alamos National Laboratory, the Max-Planck-Institute for Astronomy (MPIA), the Max-Planck-Institute for Astrophysics (MPA), New Mexico State University, Ohio State University, University of Pittsburgh, University of Portsmouth, Princeton University, the United States Naval Observatory, and the University of Washington.
Based on observations made with the NASA/ESA Hubble Space Telescope, and obtained from the Hubble Legacy Archive, which is a collaboration between the Space Telescope Science Institute (STScI/NASA), the Space Telescope European Coordinating Facility (ST-ECF/ESA) and the Canadian Astronomy Data Centre (CADC/NRC/CSA).
References
 Abazajian et al. (2009) Abazajian, K. N., Adelman-McCarthy, J. K., Agüeros, M. A., et al. 2009, ApJS, 182, 543
 Ball et al. (2004) Ball, N. M., Loveday, J., Fukugita, M., et al. 2004, MNRAS, 348, 1038
 Bamford et al. (2009) Bamford, S. P., Nichol, R. C., Baldry, I. K., et al. 2009, MNRAS, 393, 1324
 Bentley (1975) Bentley, J. L. 1975, Communications of the ACM, 18, 509
 Bootkrajang (2016) Bootkrajang, J. 2016, Neurocomputing, 192, 61
 Buitrago et al. (2013) Buitrago, F., Trujillo, I., Conselice, C. J., & Häußler, B. 2013, MNRAS, 428, 1460
 Bundy et al. (2005) Bundy, K., Ellis, R. S., & Conselice, C. J. 2005, ApJ, 625, 621
 Cabrera et al. (2014) Cabrera, G. F., Miller, C. J., & Schneider, J. 2014, in Pattern Recognition (ICPR), 2014 22nd International Conference on, IEEE, in press
 Cerulo et al. (2017) Cerulo, P., Couch, W., Lidman, C., et al. 2017, MNRAS, 472, 254
 de Vaucouleurs et al. (1991) de Vaucouleurs, G., de Vaucouleurs, A., Corwin, Jr., H. G., et al. 1991, Third Reference Catalogue of Bright Galaxies. Volume I: Explanations and references. Volume II: Data for galaxies between 0 and 12. Volume III: Data for galaxies between 12 and 24.
 de Vaucouleurs et al. (1976) de Vaucouleurs, G., de Vaucouleurs, A., & Corwin, J. R. 1976, in Second reference catalogue of bright galaxies, 1976, Austin: University of Texas Press., 0
 Dieleman et al. (2015) Dieleman, S., Willett, K. W., & Dambre, J. 2015, MNRAS, 450, 1441
 Dressler (1980) Dressler, A. 1980, ApJ, 236, 351
 Edwards & Gaber (2013) Edwards, K. J., & Gaber, M. M. 2013, in Artificial Intelligence and Soft Computing, Springer, 146–157
 Friedman et al. (1977) Friedman, J. H., Bentley, J. L., & Finkel, R. A. 1977, ACM Transactions on Mathematical Software (TOMS), 3, 209
 Fukugita et al. (2007) Fukugita, M., Nakamura, O., Okamura, S., et al. 2007, AJ, 134, 579
 Gauci et al. (2010) Gauci, A., Zarb Adami, K., & Abela, J. 2010, ArXiv e-prints
 Genel et al. (2014) Genel, S., Vogelsberger, M., Springel, V., et al. 2014, ArXiv e-prints
 Grogin et al. (2011) Grogin, N. A., Kocevski, D. D., Faber, S. M., et al. 2011, ApJS, 197, 35
 Hastie et al. (2009) Hastie, T., Tibshirani, R., Friedman, J., et al. 2009 (Springer)
 Hinshaw et al. (2013) Hinshaw, G., Larson, D., Komatsu, E., et al. 2013, ApJS, 208, 19
 Hubble (1926) Hubble, E. 1926, Contributions from the Mount Wilson Observatory / Carnegie Institution of Washington, 324, 1
 Huertas-Company et al. (2011) Huertas-Company, M., Aguerri, J. A. L., Bernardi, M., Mei, S., & Sánchez Almeida, J. 2011, A&A, 525, A157
 Huertas-Company et al. (2015) Huertas-Company, M., Gravet, R., Cabrera-Vives, G., et al. 2015, ApJS, 221, 8
 Huertas-Company et al. (2015) Huertas-Company, M., Pérez-González, P. G., Mei, S., et al. 2015, ApJ, 809, 95
 Hunter (2007) Hunter, J. D. 2007, Computing In Science & Engineering, 9, 90
 Kartaltepe et al. (2015) Kartaltepe, J. S., Mozena, M., Kocevski, D., et al. 2015, ApJS, 221, 11
 Kramer et al. (2013) Kramer, O., Gieseke, F., & Polsterer, K. L. 2013, Expert Systems with Applications, 40, 2841
 Lintott et al. (2011) Lintott, C., Schawinski, K., Bamford, S., et al. 2011, MNRAS, 410, 166
 Lintott et al. (2008) Lintott, C. J., Schawinski, K., Slosar, A., et al. 2008, MNRAS, 389, 1179
 McKinney et al. (2010) McKinney, W., et al. 2010, in Proceedings of the 9th Python in Science Conference, Vol. 445, Austin, TX, 51–56
 Naim et al. (1997) Naim, A., Ratnatunga, K. U., & Griffiths, R. E. 1997, ApJS, 111, 357
 Nair & Abraham (2010) Nair, P. B., & Abraham, R. G. 2010, ApJS, 186, 427
 Odewahn et al. (2002) Odewahn, S., Cohen, S., Windhorst, R., & Philip, N. S. 2002, ApJ, 568, 539
 Oliphant (2006) Oliphant, T. E. 2006, A guide to NumPy, Vol. 1 (Trelgol Publishing USA)
 Oliphant (2007) —. 2007, Computing in Science & Engineering, 9
 Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, Journal of Machine Learning Research, 12, 2825
 Scarlata et al. (2007) Scarlata, C., Carollo, C. M., Lilly, S., et al. 2007, ApJS, 172, 406
 Schawinski et al. (2007) Schawinski, K., Thomas, D., Sarzi, M., et al. 2007, MNRAS, 382, 1415
 Schutter & Shamir (2015) Schutter, A., & Shamir, L. 2015, Astronomy and Computing, 12, 60
 Shamir et al. (2013) Shamir, L., Holincheck, A., & Wallin, J. 2013, Astronomy and Computing, 2, 67
 Simmons et al. (2017) Simmons, B. D., Lintott, C., Willett, K. W., et al. 2017, MNRAS, 464, 4420
 Tasca et al. (2009) Tasca, L. A. M., Kneib, J.-P., Iovino, A., et al. 2009, A&A, 503, 379
 Taylor (2005) Taylor, M. B. 2005, in Astronomical Society of the Pacific Conference Series, Vol. 347, Astronomical Data Analysis Software and Systems XIV, ed. P. Shopbell, M. Britton, & R. Ebert, 29
 The Astropy Collaboration et al. (2018) The Astropy Collaboration, Price-Whelan, A. M., Sipőcz, B. M., et al. 2018, ArXiv e-prints
 Willett et al. (2013) Willett, K. W., Lintott, C. J., Bamford, S. P., et al. 2013, MNRAS, stt1458
 Willett et al. (2017) Willett, K. W., Galloway, M. A., Bamford, S. P., et al. 2017, MNRAS, 464, 4176
 York & SDSS Collaboration (2000) York, D. G., & SDSS Collaboration. 2000, AJ, 120, 1579