In order to constrain the physical processes driving galaxy evolution, it is common practice to measure a number of physical properties for a set of galaxies, and then investigate the correlations between these parameters. In this context, galaxy surveys have become increasingly valuable. The number of galaxies available is growing, and the amount of information available to constrain physical properties is also increasing, yielding more accurate estimates. The precision of these estimates is likely to increase further in the future, either through the combination of wide-angle surveys observing at different wavelengths, or through panchromatic surveys using a large number of filters (e.g. PAU, Benítez et al., 2009), which will benefit from multiband imaging for millions of galaxies. As this data deluge turns astronomy into a data-intensive science, or e-science (see Hey et al., 2009), one is confronted with the problem of analyzing a feature space whose dimensionality keeps increasing. Faced with such a large number of physical properties, one wants to find the minimal set of the most important ones that describes galaxies accurately. A common approach to reducing the dimensionality of these datasets is to perform a Principal Component Analysis (PCA, also known as the Karhunen-Loève transform; see e.g. Efstathiou & Fall, 1984; Murtagh & Heck, 1987).
PCA enables us to find an uncorrelated and orthonormal set of linear combinations of properties (eigenvectors) that optimally describe the correlations and variation of the data. This approach has been used fruitfully in astronomy to classify galaxies and quasars based on their spectra (Connolly et al., 1995; Yip et al., 2004a, b). PCA has also been applied more widely, using various galaxy properties such as the equivalent widths of emission lines (Győry et al., 2011) or a mix of spectral and morphological features (Coppa et al., 2011), to help characterize the galaxy population. It has also proved useful when applied to stellar population synthesis models to derive galaxy physical parameters (Chen et al., 2011). PCA does not, however, capture all the information contained in the input sample. It is linear by nature, and hence cannot describe nonlinear correlations within the data. Other methods, such as locally linear embedding applied to galaxy spectra (Roweis & Saul, 2000; Vanderplas & Connolly, 2009), can take nonlinearities into account, as they map high-dimensional data onto a surface while preserving the local geometry of the data.
In this paper, we introduce the principal curve (P-curve, see e.g. Einbeck et al., 2007, for a review), which can be seen as a nonparametric extension of linear PCA. The principal curve is the curve following the location of the local mean in the multi-dimensional cloud of data points. In practice, the P-curve can be conveniently built in the PCA eigenspace spanned by the most important eigenvectors, along which the variance is highest. The important fact here is that every data point can be assigned a unique closest projection onto the curve, and be labeled by the arc length measured from the beginning of the curve to the projection. This effectively reduces the complexity of multi-dimensional data to only one dimension. Moreover, ranking galaxies according to their associated arc length values provides a natural and objective way of ordering, partitioning and classifying the rich zoo of galaxies in the nearby universe.
We take advantage of the wealth of data and build the principal curve for both physical and photometric properties of galaxies in the low-redshift Main Galaxy Sample (MGS) (Strauss et al., 2002) of the SDSS (Stoughton et al., 2002). Since the MGS is flux limited, Malmquist bias causes the volume density of faint galaxies to be underestimated relative to that of brighter ones. As a result, the common practice of performing a simple PCA on all galaxies yields a result biased toward the properties of bright objects. As a solution, we do not restrict the statistics by constructing a much smaller volume-limited sample, but keep all galaxies and assign them weights, with which we perform Weighted PCA (WPCA) and build the P-curve. We then investigate how the arc length associated with each galaxy correlates with a number of photometric, spectroscopic and physical galaxy properties, as well as with morphology, mean spectra, and the first (luminosity function) and second (clustering) moments of the galaxy distribution. Our results show that the arc length values encode remarkably well a large number of well-known trends in the local Universe.
This paper is organized as follows: in Section 2 we present the dataset we use. Section 3 details the galaxy properties we include in our PCA analysis. Section 4 presents the methods we use for the dimensionality reduction, weighted PCA and principal curve. We detail in Sec. 5 how we build the principal curve from the SDSS data. In Section 6 we present our results and discuss them in Sec. 7.
We use in this paper a flat cosmology with $(\Omega_\Lambda, \Omega_m, h, w) = (0.7, 0.3, 0.7, -1)$.
2. The Galaxy Sample
In particular, we use the Main Galaxy Sample (MGS) (Strauss et al., 2002). These galaxies constitute a flux-limited sample, with an r-band Petrosian apparent magnitude cut of , and a redshift distribution peaking at . Their spectra cover the rest-frame range of 3800-8000 Å, with a resolution of 69 km s$^{-1}$ pix$^{-1}$.
Several selection cuts and flags were enforced in order to obtain a clean sample. We selected only science primary objects appearing in calibrated images having the photometric status flag. We also selected imaging fields where , which ensures good imaging quality with respect to the sky flux and the PSF width. Furthermore, we excluded individual objects with bad deblending (flags PEAKCENTER, DEBLENDNOPEAK, NOTCHECKED), interpolation problems (PSFFLUXINTERP, BADCOUNTSERROR) or suspicious detections (SATURATED, NOPROFILE); detailed explanations of these flags are given at sdss3.org/dr8/algorithms. Finally, we chose galaxies whose spectral line measurements and properties are labeled as RELIABLE.
The sky footprint of the clean spectroscopic survey is built up from a complicated geometry defined by sectors, whose aggregated area covers deg or a fractional area of the whole sky. We choose a redshift window of . The lower limit avoids including large, photometrically cumbersome galaxies on the sky, and the upper limit reduces the amount of evolution of galaxy properties (Gyr), while keeping the statistics high. Redshift incompleteness arises from the fact that two 3” aperture spectroscopic fibers cannot be placed closer than 55” to each other on the same plate. As a strategy, denser regions of the sky are given a greater number of overlapping plates. Nevertheless, 7% of the initial galaxies photometrically targeted as MGS did not have their spectra taken.
We further construct a magnitude-limited sample, on which we center our main study. Here, extinction-corrected Petrosian apparent magnitude cuts of are applied. The lower limit is set because of the cross talk that arises between close fibers in the spectrographs when they contain light from very bright galaxies. The upper limit safely avoids the slight variations over the sky of the limiting apparent magnitude around 17.77 in the targeting algorithm. This leaves us with 174,266 galaxies.
A volume-limited subsample was also created as a subset of the previous magnitude-limited sample. This subsample is used for the study of spatial correlation functions in Sec. 6.4. The redshift range is , with an absolute magnitude window of , which leaves us with galaxies.
3. Selecting Galaxy Properties
Galaxies present a variety of physical, spectroscopic and 5-band photometric properties, made available in the SDSS-DR8 data catalog. We selected the most relevant ones in order to create a p-dimensional cloud of properties, or features, for further study.
Among the photometry-derived properties are the colors, which trace the coarse shape of galaxy spectra and, to some extent, the age of the overall stellar population in the galaxy. Only the colors and were selected, since most of the color combinations possible from the bands (Fukugita et al., 1996) are highly correlated. To compute colors, we used extinction-corrected model magnitudes (Stoughton et al., 2002), as well as k-corrections to an observing rest frame of . The k-corrections are calculated with a template fitting technique, as used in e.g. Budavári et al. (2000) and Csabai et al. (2000). Here, the colors are matched to the colors of a model spectrum defined by a non-negative linear combination of redshifted template spectra. Then, the best model spectrum is blueshifted back to the rest frame (z=0) and the k-correction is computed. The template spectra are drawn from a list provided by Bruzual & Charlot (2003).
Since we will study the luminosity function as a function of position in this cloud (Section 6.3), we decided not to include the absolute magnitude as a property. If we did, any partitioning of the cloud would introduce undesired artificial cuts in the range of absolute magnitudes used in the computation of luminosity functions. Therefore, neither the absolute magnitude nor any property strongly correlated with it (such as stellar mass) should be chosen as one of the properties.
Another photometry-derived feature is the concentration index , where and are the radii enclosing 90% and 50% of the r-band Petrosian flux, respectively. This index has been found to correlate with galaxy morphological type (Strateva et al., 2001; Shimasaku et al., 2001). Indeed, the de Vaucouleurs light profiles of elliptical galaxies are more concentrated than the exponential profiles of the disks of spiral galaxies.
The redshift-dependent -band surface brightness defined by is also included as a property. This breaks the degeneracy between bright and dim spiral galaxies. Here we use the extinction- and k-corrected Petrosian apparent magnitude , taking as a less noisy proxy for the Petrosian radius (Stoughton et al., 2002; Strauss et al., 2002).
The physical properties selected are the star formation rate (SFR), the specific star formation rate (SFR/$M_*$, where $M_*$ is the stellar mass) and the Petrosian r-band mass-to-light ratio (). These are included in SDSS-DR8 and were obtained from the analysis of galaxy spectra at MPA and JHU (http://www.mpa-garching.mpg.de/SDSS), as detailed in Kauffmann et al. (2003a), Brinchmann et al. (2004), Tremonti et al. (2004), Gallazzi et al. (2005) and Salim et al. (2007).
Note that has been derived from template fitting to the total flux in the 5 photometric bands (Aihara et al., 2011). As the diameter of the spectroscopic fibers covers only of the central part of each galaxy, the SFR had to be corrected for this aperture deficiency to obtain its full value (Brinchmann et al., 2004).
4. Methods For Dimensionality Reduction
Most of the time, data mining deals with the data point matrix , composed of the columns that contain observations for each of the properties or features. Thus can be thought of as a length- realization of the random vector with distribution .
In our work, dimensionality reduction is used to explain the variations of as a function of only one parameter. To that end, we use Weighted PCA and principal curves, which are described in detail in Sections 4.1 and 4.2.
4.1. Weighted PCA (WPCA)
Principal Component Analysis (PCA) (Pearson, 1901; Jackson, 1991; Jolliffe, 2002), also known as the Karhunen-Loève transform, is a widely used method for dimensionality reduction and classification. It can be seen as a transformation involving a translation, a linear scaling and a rigid rotation of a collection of -dimensional data points onto a new coordinate system. The new orthonormal axes, or principal components , are constructed such that the projections of the data points onto the s are uncorrelated. is selected as the axis along which the points projected onto it have the highest possible variance. The subsequent s are ordered by descending variance, with having the lowest. Thus, dimensionality reduction is attained by describing the data in terms of the most important principal components (Hastie et al., 2009). This can be done by considering only the space spanned by the first variance-ranked eigenvectors whose cumulative variance exceeds a sufficiently high threshold.
In practice, the PCs and their variances can be found using the singular value decomposition (SVD) of the covariance matrix of the data points (Golub & Van Loan, 1996). SVD allows us to factorize it in the form . Here, is a diagonal matrix with the eigenvalues (variances) and contains the eigenvectors (principal components) in the respective columns. Thus, contains the expansion coefficients of the transformation () from property space to PC space.
In Weighted PCA, the covariance matrix is calculated with a weighting scheme. We are often confronted with noisy or missing data points. As a solution, we can assign a weight to each th data point in order to account for the noisy or missing ones. In this context, WPCA involves using these weights in the calculation of all averages and covariances between the properties. In general, the properties might have different units, so they first have to be made unitless by standardizing the data points (subtracting from each property its weighted average and then dividing it by its weighted standard deviation).
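The weighted standardization and decomposition described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in this paper: the function name `weighted_pca` and its return values are hypothetical, and the symmetric eigendecomposition `numpy.linalg.eigh` is used here in place of an SVD, which gives the same eigenvectors for a covariance matrix.

```python
import numpy as np

def weighted_pca(X, w):
    """Weighted PCA of an (n, p) data matrix X with non-negative weights w (length n)."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                      # normalize the weights to sum to 1
    mean = w @ X                         # weighted mean of each property
    std = np.sqrt(w @ (X - mean) ** 2)   # weighted standard deviation of each property
    Z = (X - mean) / std                 # standardized (unitless) data
    C = (Z * w[:, None]).T @ Z           # weighted covariance matrix of the standardized data
    evals, evecs = np.linalg.eigh(C)     # eigendecomposition of the symmetric matrix C
    order = np.argsort(evals)[::-1]      # sort the PCs by descending variance
    evals, evecs = evals[order], evecs[:, order]
    scores = Z @ evecs                   # coordinates of each data point in PC space
    return evals, evecs, scores, mean, std
```

Because the data are standardized, the eigenvalues sum to the number of properties, so `np.cumsum(evals) / evals.sum()` gives the cumulative variance fraction used to decide how many PCs to keep.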
4.2. Principal Curves
Principal curves (P-curves) and surfaces (P-surfaces) (Hastie, 1984; Hastie & Stuetzle, 1989; Tibshirani, 1992; Gorban et al., 2008) go one step beyond PCA, providing a low-dimensional curved manifold that passes through the middle of the data points. In this paper we consider a 1-parameter (called ) principal curve , where each of the data points is given a unique closest projection onto the curve. As a convention, is chosen to be the arc length from the beginning of the curve to the projection point of . In this context, the P-curve can itself be considered the first and only principal component, as the dimensionality of the data is reduced from to 1 dimension. In practice, the P-curve is composed of line segments that connect the projection points.
The principal curve is defined as the average of the data points that project onto it, minimizing the projection distance between and over all points. This self-consistency property allows us to construct it through a series of iterative projection-expectation steps (Hastie & Stuetzle, 1989). An educated first guess for the P-curve is to make it equal to . The th estimate of the curve at the th expectation step is then calculated as . In practice, we compute this expression using a weighted penalized cubic B-spline regression (Silverman, 1985; Hastie & Tibshirani, 1990; Ruppert et al., 2003; Hastie et al., 2009). These splines are calculated on a series of knots chosen from the data points, while the degrees of freedom () of the regression control the degree of smoothing of the P-curve. The th projection step is performed next, and involves the search for the closest perpendicular projection of onto , which is composed of line segments. The iterations stop when the cumulative projection distance from the data points to the P-curve does not change significantly with respect to that of the previous step.
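As an illustration of the projection step only, the following sketch projects each data point onto a curve represented by line segments and returns the arc length of the projection. It is a minimal example under stated assumptions, not the code used in this work: the function name `project_onto_polyline` is hypothetical, the curve is assumed to be given as an ordered array of vertices with no zero-length segments, and the expectation (spline smoothing) step is not included.

```python
import numpy as np

def project_onto_polyline(points, vertices):
    """Project (n, p) points onto a polyline given by (k+1, p) ordered vertices.

    Returns the arc length of each projection, the projected points,
    and the projection distances.
    """
    points = np.asarray(points, dtype=float)
    vertices = np.asarray(vertices, dtype=float)
    a, b = vertices[:-1], vertices[1:]             # start and end of each segment
    d = b - a                                      # segment direction vectors
    seg_len = np.linalg.norm(d, axis=1)            # assumed to be non-zero
    # arc length accumulated at the start of each segment
    start_arc = np.concatenate([[0.0], np.cumsum(seg_len)[:-1]])
    # fractional position t of the perpendicular foot on every segment
    t = np.einsum("nkp,kp->nk", points[:, None, :] - a[None, :, :], d) / seg_len**2
    t = np.clip(t, 0.0, 1.0)                       # stay within the segment ends
    proj = a[None, :, :] + t[:, :, None] * d[None, :, :]      # (n, k, p) candidates
    dist = np.linalg.norm(points[:, None, :] - proj, axis=2)  # (n, k) distances
    best = np.argmin(dist, axis=1)                 # closest segment for each point
    idx = np.arange(len(points))
    arc = start_arc[best] + t[idx, best] * seg_len[best]
    return arc, proj[idx, best], dist[idx, best]
```

In an iterative construction, the returned arc lengths would serve as the regressor for the next smoothing step, and the sum of the returned distances could be monitored as the convergence criterion.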
Although P-curves are constructed in the -dimensional space of properties, we can consider building the P-curve of the data points projected onto the first most important principal components of the WPCA. This reduces the complexity and the computations, especially in the case , without losing much information. The approximation is of course valid as long as the first eigenmodes contain as much of the total variance as possible.
5. Building the Principal Curve and Population Separators along Arc Length
As the MGS is a magnitude limited sample, not all galaxy types are sampled equally in the survey volume. As a consequence, we used WPCA and a weighted principal curve of the galaxy population to get an unbiased result.
In detail, at higher redshifts we sample mostly the brightest galaxies, missing the faint ones (Malmquist bias). On the other hand, at low redshifts the SDSS spectrograph fails to take the spectra of very bright galaxies (see Section 2).
As a solution, we use the weighting method (Schmidt, 1968) to account for this incompleteness. Here, each -th galaxy is assigned a weight , where is the volume of the survey. Here we note that, given the particular and intervals for the survey, the -th galaxy found at could be observed only within a maximum comoving volume . If the -th galaxy of apparent magnitude , k-correction , and at a luminosity distance were to have limiting apparent magnitudes , then it should be moved to a limiting luminosity distance given by
Hence, the maximum volume is defined by the biggest interval of inside which a galaxy can appear in the survey:
The PCA, P-curve and calculations related to volume densities in this paper (such as histograms) are all weighted.
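To make the weighting concrete, the sketch below computes an inverse maximum-volume weight for a single galaxy. It is a hedged illustration rather than the procedure used in this paper: all function names are hypothetical, the k-correction is assumed not to change with redshift (so the limiting distance follows from the distance modulus alone), the cosmology is flat with a cosmological constant, the survey area enters only through an assumed `area_fraction` of the full sky, and the magnitude and redshift limits are left as inputs because their values are not restated here.

```python
import numpy as np

C_KM_S = 299792.458  # speed of light in km/s

def comoving_distance(z, h=0.7, omega_m=0.3, n_steps=2048):
    """Line-of-sight comoving distance in Mpc for a flat LambdaCDM cosmology."""
    zz = np.linspace(0.0, z, n_steps)
    inv_E = 1.0 / np.sqrt(omega_m * (1.0 + zz) ** 3 + (1.0 - omega_m))
    integral = np.sum(0.5 * (inv_E[1:] + inv_E[:-1]) * np.diff(zz))  # trapezoid rule
    return C_KM_S / (100.0 * h) * integral

def redshift_at_magnitude(m, z, m_target, z_lo=1e-5, z_hi=2.0):
    """Redshift at which a galaxy observed with magnitude m at z would have m_target.

    Simplifying assumption: the k-correction is the same at both redshifts.
    """
    def d_lum(zq):
        return (1.0 + zq) * comoving_distance(zq)
    d_target = d_lum(z) * 10.0 ** (0.2 * (m_target - m))  # distance-modulus scaling
    for _ in range(60):  # bisection; the luminosity distance grows monotonically with z
        z_mid = 0.5 * (z_lo + z_hi)
        if d_lum(z_mid) < d_target:
            z_lo = z_mid
        else:
            z_hi = z_mid
    return 0.5 * (z_lo + z_hi)

def vmax_weight(m, z, m_bright, m_faint, z_min, z_max, area_fraction=1.0):
    """Inverse of the maximum comoving volume (in Mpc^-3) within which the galaxy
    stays inside both the magnitude limits and the redshift window."""
    z_near = max(z_min, redshift_at_magnitude(m, z, m_bright))
    z_far = min(z_max, redshift_at_magnitude(m, z, m_faint))
    shell = comoving_distance(z_far) ** 3 - comoving_distance(z_near) ** 3
    v_max = area_fraction * 4.0 / 3.0 * np.pi * shell
    return 1.0 / v_max
```

With this choice, intrinsically faint galaxies, which are visible only in a small part of the survey volume, receive larger weights than bright ones, which is the behavior needed to counteract the Malmquist bias.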
5.2. WPCA results
Table caption: Variance of each principal component and its associated cumulative variance. Since the data have been standardized, the sum of the variances is equal to the number of dimensions ().
Figure 1 and Table LABEL:Table:PCAeigenvalues present the results of computing WPCA on the 7 galaxy properties. From Table LABEL:Table:PCAeigenvalues, we can see that most of the information (97% of the total variance) is contained in the first 4 principal components. In Fig. 1, each can be viewed as a linear combination of properties, with the expansion coefficient of the th property stored in the th row. Coefficients with stronger color indicate a higher importance of the property for the given PC. The sign of the coefficient shows the correlation or anticorrelation between the property and the final value of the PC.
For , the strength (absolute magnitude) of its expansion coefficients in the basis of the galaxy properties is shared almost evenly among these properties, with , , and being the most important. The correlations show that high values of , , and , together with low values of , and , will produce a high value. We might therefore expect that is a good separator between the young, blue population of spirals/irregulars and the old population of red ellipticals.
For , the and are the most important, with opposite signs. Thus, we expect galaxies with high surface brightness and high star formation to show high values of .
For , the most important property is , with an opposite correlation with respect to the next important properties of mostly equal strength (, , , and ). We can expect that big and bright star-forming spiral galaxies with reddish colors (probably from a red core) should have high .
For , all the properties have the same correlations, with , and being the most important. Thus, concentrated (and possibly star-forming) galaxies of faint surface brightness have high values of . As the variance along is much smaller than that along the previous PCs, this combination of correlations among these properties is rarely observed at the same time.
Furthermore, the last 3 PCs (, and ), which account for less than 2% of the total variance, are less obvious to interpret. They might trace either special cases of galaxy populations, or just artifacts and wrong/noisy measurements of the properties.
5.3. The fitted Principal Curve and Population Separators along it
We decided to construct the principal curve in the 4-dimensional space defined by , since their combined cumulative variance (0.973) is close to unity (see Table LABEL:Table:PCAeigenvalues). Although the computation is not too intensive for the number of dimensions and data points involved, we think of this as a pedagogical example that can be applied to more extreme cases, for instance objects with dimensions. In fact, our choice does not significantly change the results compared to using .
In the expectation step for creating the principal curve, each is fitted with penalized B-splines of 5.4 degrees of freedom (), defined at a sequence of unique knots chosen at equally spaced quantiles of arc-length values. Principal curves with make the curve oscillate excessively, turning back and forth across and along , whereas those with resemble more closely a straight line along the direction.
Figure 2 shows the result of fitting the principal curve to . The 4-dimensional cloud of properties presents 2 density maxima placed mainly along the direction, corresponding to the blue and red galaxy populations. The principal curve mostly resembles the letter ”W”, presenting clearly 4 different regimes or branches separated by 3 turning points (T-points).
We created 20 galaxy groups of equal number density (in Mpc$^{-3}$), labeled as , by placing population separators at fixed arc length values along the P-curve, as shown in Figures 2 and 3. Galaxies are assigned to the same group when the arc length values measured at their projection points onto the P-curve lie between 2 consecutive separators. These separators are positioned in such a way that the (-weighted) number density (in Mpc$^{-3}$) of the galaxies belonging to each of the 20 groups amounts to 1/20th of that of the whole sample of galaxies. This allowed us to study the 4 principal curve branches in detail. We chose the arc length to increase in the direction of increasing , with growing values of arc length as we progress from to . Thus, the P-curve’s 1st branch is contained in , the 2nd branch in , the 3rd branch in and the 4th branch in . Table LABEL:Table:20GroupStatistics2 shows some statistics of these groups.
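Placing separators so that each group carries the same weighted number density amounts to taking weighted quantiles of the arc length. The following sketch shows one way to do this; it is an assumption-laden illustration, with hypothetical function names `equal_density_separators` and `assign_groups`, and with the per-galaxy weights taken as given inputs.

```python
import numpy as np

def equal_density_separators(arc_length, weights, n_groups=20):
    """Arc-length values that split the sample into groups of equal total weight."""
    arc_length = np.asarray(arc_length, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(arc_length)
    s_sorted = arc_length[order]
    cum = np.cumsum(weights[order]) / weights.sum()   # weighted cumulative distribution
    targets = np.arange(1, n_groups) / n_groups       # 1/n, 2/n, ..., (n-1)/n
    idx = np.searchsorted(cum, targets)               # first galaxy reaching each target
    return s_sorted[idx]                              # n_groups - 1 interior separators

def assign_groups(arc_length, separators):
    """Index (0 to n_groups - 1) of the group each galaxy falls into."""
    return np.searchsorted(separators, arc_length, side="left")
```

Summing the weights per group, for example with `np.bincount(groups, weights=w)`, provides a quick check that every group holds close to 1/20th of the total weighted density; the residual differences are of the order of a single galaxy's weight.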
Within each group, we further created 5 subgroups of galaxies along arc length, naming them , also of equal number density (in Mpc$^{-3}$) as explained before. We then partitioned these subgroups in a similar way, now using several radial separators in the direction perpendicular to the curve, defining 10 concentric cylinder-like separating surfaces. In this way, the groups defined by this finer partitioning all have the same number density (in Mpc$^{-3}$), approximately equal to 1/1000th of the number density of the whole sample. This allowed us to identify and extract localized galaxy populations positioned very close to the spine of the cloud of properties, and study them in Sec. 6.5.2.
Figure 3 shows the probability density distribution of the arc-length values, as well as the population separators. The curve has a length of =20.24, and the variance of the arc length values is , measured with respect to the center of the curve at . Note that Table LABEL:Table:20GroupStatistics2 shows that the quadratic mean (root mean square) of all the projection distances from the data points to the P-curve takes a value of , which is small compared to the length of the curve. The blue and red peaks of maximum density are clearly visible, as well as a small green peak. The 1st turning point (at ) lies close to the blue maximum (), whereas the red maximum () lies slightly before the 3rd T-point (), after which we find a hump defining the red sequence of galaxies. We find a green maximum () between the 2nd T-point () and the red maximum.
Figure 4 shows the density maps of the scatter of each as a function of the arc length. The different shapes of this scatter evidently depend on the contortions or twists of the principal curve along the PCs. As the 4 branches of the curve mostly turn left and right along , the scatter in shows the same ”W” shape as the P-curve. On the other hand, the curve increases its length in the direction, so the scatter shows a mostly linear relation between and arc length. The same analysis applies to the scatter of the next PCs, which is boomerang-shaped for and mostly constant with respect to arc length for (although with small wiggles).
denotes the number of galaxies in each group, contained in the arc length interval of average arc length. The value denotes the quadratic mean (root mean square) of the projection distances from the data points onto the P-curve.
6. Galaxy properties and statistics as function of Arc Length
In this section we show how galaxy properties, luminosity functions and spatial clustering change as a function of the equal number density galaxy groups (ordered in ascending arc-length).
Compared to alone, the principal curve provides much more information about particular changes in properties along its arc length. We will see that the evolution of galaxy properties along the curve is intimately related to the ”W” shape of the principal curve, where each of the 4 branches defines a particular galaxy population.
6.1. Morphology and Average Spectra
Figure 5 shows the most representative galaxy morphologies and average spectra for the groups.
The most evident feature is the change in color and in the slope of the spectra (from blue to red), as well as an overall weakening of emission lines (e.g. the Balmer series of hydrogen and forbidden lines such as OIII, OII, NII, etc.) and a strengthening of metallic absorption lines and bands (Na, Mg, H, K, G) as we reach high arc length values. In the same way, morphological types include various types of blue galaxies at the beginning and middle of the curve, whereas red ellipticals dominate its end. This bimodality is expected and agrees with in Fig. 1; it also appears in other studies as the change along the 1st principal component (e.g. Yip et al., 2004b; Coppa et al., 2011). We can, however, also identify more subtle populations along arc length, which are not distinguishable in alone. These distinct populations are defined on each of the 4 branches of the principal curve, connected by the 3 turning points.
With respect to morphologies, we see that the arc length correlates very well with the Hubble galaxy type. We do, however, miss the distinction between barred and non-barred spiral galaxies, due to the lack of properties able to separate them. Blue irregulars and blue compact dwarf (BCD) galaxies (Papaderos et al., 2006; Corbin et al., 2006) appear in the 1st branch of the principal curve. Some BCDs of this type were identified as the green pea galaxies at higher redshift (Cardamone et al., 2009). These morphologies then change into low surface brightness galaxies (LSBGs) with spiral and irregular shapes, which dominate the 1st turning point and the blue maximum. Bright spirals with strong blue star-forming arms appear in the 2nd branch, and by the 2nd turning point they show sizable bulges. A dramatic change happens in the 3rd branch, where reddish big-bulged spirals and lenticulars dominate, forming part of the green and red maxima. A new transition happens at the 3rd turning point, with the big bright red ellipticals (cDs) and brightest cluster galaxies (BCGs) dominating at the end of the P-curve’s 4th branch.
Emission lines, such as the forbidden OII, OIII, SII and NeII, as well as the Balmer series of hydrogen (e.g. H, H, H), are strong in the violently star-forming blue galaxies of the 1st branch. These lines weaken as we transition into LSBGs, but interestingly H and H become stronger in the 2nd branch, reaching maximal values in the star-forming spirals at the 2nd turning point. After this, they weaken again and become imperceptible in the bright ellipticals of the 4th branch. NII follows the same pattern as H, but still remains visible in cD galaxies, as seen in many spectral atlases (e.g. Dobos et al., 2012). On the other hand, OIII declines steadily with arc length, disappearing after the red maximum.
Absorption lines, such as Na, Mg and the G band, become evident in the star-forming spirals by the end of the 2nd branch (as the bulge increases in size), and appear strong in the ellipticals of the 4th branch. Although the H and K lines of calcium are always visible, the 4000 Å break increases steadily with arc length, turning into a striking feature in bright ellipticals.
|Group||log SFR/||Lick G4300||Lick Fe4531||Lick Mg||Lick Na D||Lick H|
|log OIII||eclass||log SFR||log||log H||log NII||fracDeV|