Principal component analysis (PCA) is a popular data science tool used for tasks such as dimensionality reduction, feature extraction, missing value imputation, denoising, and data compression. PCA originated from the works of Pearson  and Hotelling [5, 6] and a detailed review of the subject is provided by Jolliffe . Eckart and Young  described PCA as finding the best approximation of a numeric matrix using a lower rank matrix , where the quality of the approximation is measured using least squares or quadratic loss.
Numerous authors have extended the concepts of PCA by changing the loss function and adding regularization to the low-rank matrix approximation problem. Notably, Collins et al. proposed using exponential family loss functions and Gordon  used matching link-loss function pairs to construct procedures based on Bregman divergence. These contributions generalized PCA and factor analysis similar to how generalized linear models  extended the concepts of regression. Regularization has been used to construct low-rank approximations which account for data characteristics such as sparseness  and non-negativity . Udell et al.  summarized many of the major contributions using the generalized low-rank model framework.
This paper is focused on the task of constructing a low-rank approximation when some of the measured variables contain interesting values which occur frequently. Examples include missing, censored, or truncated values; as well as zero-inflated data. The zero-inflated case is commonly encountered when measuring manufacturing defect counts. For regression analysis settings, Mullahy proposed using the hurdle model, Lambert  described the zero-inflated model, and Min and Agresti  provided enhancements which included random effects. For dimensionality reduction, Pierson and Yau  developed the zero-inflated factor analysis (ZIFA) model for analyzing single cell RNA sequencing data suffering from gene expression dropout. The ZIFA model follows the probabilistic PCA approach of Tipping and Bishop  and optimization is carried out via the EM algorithm . The ZIFA model can be expressed as a special case of the low-rank reduced hurdle model presented in Section 3. The case of performing PCA in the presence of missing data has been examined previously, with Ilin and Raiko  providing a review of existing procedures. The low-rank hurdle model offers a new representation which can be leveraged to gain additional data insights not directly available from competing PCA missing data methods.
The remaining contents are organized as follows. Section 2 describes the generalized low-rank framework. The hurdle model is motivation in Section 3, along with details for proper implementation. In Section 4 the hurdle approach is used to analyze a zero-inflated manufacturing data set and investigate missing value imputation. Lastly, Section 5 contains some concluding remarks.
2 The generalized low-rank model
The following notation is used throughout. Matrices are denoted by bold uppercase letters or Greek symbols (e.g. ,
) , vectors are represented by bold lowercase letters or Greek symbols (e.g., ), and scalars are not bold (e.g. , ). Additionally, matrices with dimensions and vectors of length are denoted as matrices and vectors; respectively, even if occurs for some .
Here we present a generalized framework for low-rank modeling, summarizing the methodology highlighted by Udell et al. . Let be an data table where the rows represent observations consisting of measurements collected on variables. Then for and , represents the variable value for the observation. The domain for each column variable is denoted by , which is not restricted to
, but includes discrete and non-numeric domains to facilitate abstract data types such as count, Boolean, categorical, and ordinal variables. We will approximate abstract data types by representingwith numerical embeddings , where is the embedding dimension of the variable. The resulting embedded dimension of the model is . The loss incurred from using to describe is measured using an appropriately selected loss function .
Essential to this analysis is the construction of a low-rank matrix which approximates our data table with minimal loss. A rank- approximation can be found by specifying where , , and . Notice this decomposition is not unique since for any non-singular matrix . An optimal rank- matrix decomposition can be found by minimizing the following optimization problem. Let denote the row of , and such that denotes the embedded columns associated with the variable of , then the generalized low-rank model for is found using
where and are appropriately selected regularizers, and represents the set of indices such that is observed. An appealing feature of the above generalized structure is the ability to combine different loss functions and regularizers to address different variable characteristics observed in the data table.
Many data reduction methods can be described in terms of equation (1). For example, if unregularized quadratic loss is chosen for a numeric data table with no missing values, then the optimization problem is solved using standard PCA . This motivates the interpretation of the matrix as a low-dimensional representation of , with representing a mapping of back into the original embedded data space. Other special cases described by the general framework include robust and sparse PCA, exponential family PCA, non-negative matrix factorization, and matrix completion .
The task of optimizing equation (1) is simplified for convex loss functions and regularizers. Under these conditions (1) becomes a biconvex minimization problem, which is commonly solved iteratively by alternating between convex updates in one argument while fixing the other. Using the above notation, we alternate minimization over the rows of while fixing , and minimization over the columns of while fixing . These updates can be parallelized over the rows of and the columns of which may significantly improve computing times. In general this alternating approach does not guarantee convergence to the global minimizer, and care may be required to avoid poor solutions. In many applications, the usefulness of the sub-optimal solution is used to justify its adoption.
Variable scaling is a well known issue in multivariate analysis, and commonly data is normalized prior to performing methods such as PCA. The concepts of offset and scaling can be generalized by replacing the loss functions in (1) by , where
and is the number of non-missing values for the variable. Using the above expressions, the loss contribution for the variable is equal to under the offset only model. This motivates the use of as the total loss of the scaled model, which has a similar interpretation as total variation from standard PCA analysis. It is important to note the offset and scaling adjustments are applied to the loss functions, and not directly to the data table itself. This ensures aspects of the data table are maintained, such as sparseness or non-negativity.
3 The hurdle model
Suppose data table contains variable with elements which periodically take on the value . Assume the occurrence of is interesting because it potentially signifies a different generating process as compared to when . For example, defect counts are observed during the manufacturing of hard disc drives. Normal counts are typically zero and appear to be governed by a process which differs from non-zero defect counts; where differences are observable across the data table variables. Regression analysis techniques have been proposed for this paradigm, including the hurdle model from Mullahy  and the zero-inflated model from Lambert .
The hurdle model contains two components, where the first component represents the probability of observingand the second describes the conditional behavior of the data provided is not observed. Explicitly,
where represents the probability density or mass function when is not observed, and is the possibly -truncated density or mass function with mean parameter . Following the generalized linear model framework , appropriate mean functions can be defined such that
where represent predictor row vectors and
represent parameter column vectors. Under the typical assumption of logit link for the probabilities
, maximum likelihood estimation is performed using the following equation:
Equation (3) can be expressed as a minimization problem by examining the negative log of the likelihood function, which yields
where is an embedded indicator variable. Previous authors have examined low-rank procedures motivated by loss functions based on the negative log likelihood; notably Collins et al.  in the case of exponential family models. In the context of the generalized low-rank model presented earlier (1), denoting and replacing with in (4) yields the following equivalent low-rank expression for fixed :
where denotes logistic loss, represents a derived loss, and is a normalizing constant to ensure is non-negative. The derived equation in (7) can be further generalized for arbitrary data structures by making use of the subsequent composite loss definitions.
Definition (hurdle loss). Let , , , , and
be a binary variable indicating whetherhas occurred, then full hurdle loss is specified by
where denotes a non-negative binary loss, and is an appropriate non-negative loss for describing the -truncated data. Furthermore let , then reduced hurdle loss is defined as
The weights assign relative importance to the two model components, with larger weights implying higher importance on the resulting reduced representation. The choice of weights will also affect the aggregated total loss, but this can be corrected using the formulas for offset and scaling appearing in equations (2). One strategy is to assign weights proportional to total loss contributions resulting from the two hurdle components, where total loss is found using the offset only model. Specifically for the variable, allow to represent the number of occurrences, the number of non- occurrences, and offsets and are found using (2), then weights which solve the below system of equations will yield a total loss of and ensure the binary loss contributes times the non- loss:
where , , . Several intuitive choices for the multiplier are or .
In practice, data analysts are required to make several decisions in order for hurdle loss to be implemented within the generalized low-rank framework described in Section 2. Logistic loss is likely a default choice for the binary loss , while selecting a form for may depend more heavily on the underlying data characteristics. For example, quadratic and losses are reasonable choices for continuous data, while Poisson loss is useful for count variables. As previously mentioned, selecting convex loss functions and regularizers allows alternating minimization to be employed which may further guide the decision. A more detailed overview of possible loss functions and regularizers is provided in Udell et al. .
Reduced hurdle loss (9) simplifies the representation by setting and in (7). In the special case of the ZIFA model , quadratic loss is selected for and is based on the binomial probabilities where is a positive decay coefficient. The ZIFA model substitutions were justified in the context of the gene expression data problem and may be unsuitable for other subject domains.
In applications such as matrix completion and data reconstruction, mapping the reduced representation back into the original domain of is required. In general, low-rank models approximate using some function of the vector . For quadratic loss this is simply since this model corresponds to the generalized linear model with identity link. Under hurdle loss the original variable is encoded using both an indicator function and the identity function, where the latter function is applied only when non- values occur. These encodings are then approximated using a vector in the full model setting, or for reduced models such as ZIFA. Hence, the reverse mapping under hurdle loss can be found as follows. Let and , then the reconstructed value is determined by
The above piecewise condition is often simplified since and for commonly used loss functions, such as quadratic and Poisson losses. Hence the condition in (12) becomes a ratio of only the binary loss components. In the case of missing data, is still a reasonable imputed value even when the binary loss components suggest missingness is likely. This is further demonstrated in Subsection 4.2.
The foundations for hurdle loss were developed following a likelihood model explanation, but additional intuition and advantages are worth mentioning. First, in the full model setting each of the hurdle components receives a different principal vector in which allows for differing dependencies on the other columns of . This added flexibility mirrors the modeling complexities available in the hurdle and zero-inflated regression frameworks. Low-rank applications concerned with data similar to those which motivated the regression models may find using the hurdle approach a suitable alternative to competing dimension reduction methods. Second, since the low-dimensional representation retains information related to the likelihood of values, we may extract probability type scores using
and measure associations with other variables by examining the cosine similarity betweenand the remaining columns of . These metrics inform the analyst about the quality of the low-rank representation with respect to discriminating
values, without the need to conduct additional analysis. Lastly, employing the composite hurdle loss provides an additional degree of freedom when determining the offset and scaling for the underlying variable. These values can be strongly influenced by-inflation when using a single loss function, potentially obscuring meaningful representations of the underlying processes.
4.1 Zero-inflated model
The first example investigates a factory data set which contains various defect count variables related to the manufacturing of hard disc drives. In general, defect count variables tend to exhibit high degrees of zero-inflation. This particular data set contains 14 different count variables measured on 2200 unique storage devices. The observed zero-inflation varies across variables and ranges from to , with an aggregated value of about . The distribution of non-zero values displays a long tail with an overall median and mean of and defects, respectively.
The generalized low-rank model (1) was used to analyze the defect data set. Unregularized hurdle loss was chosen for all 14 count variables, with binary and non-zero components selected to be the following logistic and Poisson loss functions
where . However, convergence tends to be slower and numerically unstable when employing the truncated version. Additionally, differences between the ordinary Poisson loss and the zero-truncated version converge quickly to zero as increase.
All loss functions were centered and scaled according to (2). Specifically, offset terms for logistic and Poisson losses are and , where is the sample column mean. Additionally, hurdle loss components were weighted and scaled using (10) with . Ordinary PCA and the ZIFA model were also considered for comparative purposes. For the ZIFA model, initial values were based on PCA and the decay parameter was allowed to vary across variables.
Figure 1 contains three plots comparing the full hurdle, ZIFA, and PCA approaches. The left plot displays proportion of total loss explained as a function of the model dimension . Recall total loss is calculated under the offset only model and equals
when the loss functions are appropriately scaled. The hurdle model achieves a quicker rate of model loss reduction with respect to its model loss space, followed by PCA. The middle plot compares element-wise weighted sum of squared reconstruction errors, where the weights are the sample standard deviations of the target variables. The hurdle model performs similarly to PCA and shows improvement over dimensions 4 through 11. The right plot displays zero misclassification rates for the three methods. Simple threshold decision rules are used to map the reduced rank representations into zero/non-zero responses. Specifically, PCA reports a zero outcome whenever a reconstructed value is less than 0.5, while the hurdle and ZIFA models assign a zero value whenever a reconstructed probability score exceeds 0.5. The plot shows the full hurdle model performs noticeable better than PCA and ZIFA, which both performed similarly. Overall, the ZIFA model either performs similar or worse than PCA. The full hurdle model framework includes an additional column in the representation, which may provide advantages when optimizing the potential trade-offs between the competing composite losses. This added flexibility is absent in reduced hurdle models and may explain the degraded ZIFA performance on this data set.
Missing in the analysis of Figure 1 is the computational speed advantages of ordinary PCA. A parallelized alternating second order gradient descent procedure was used to fit the hurdle model, and the EM algorithm was used for ZIFA. In general, optimizing the generalized model is slower than ordinary PCA and care needs to be taken to avoid poor local minimums. For applications which require fast implementations, stable representations for can be found offline and held fixed for efficient scoring of new data.
4.2 Missing value model
Performing PCA in the presence of missing values is a well studied problem. Ilin and Raiko  provide a review of common practical approaches to PCA with incomplete data. For our purposes the problem can be reformulated in the context of the hurdle model. Specifically, assume a logistic loss for the occurrence of missingness and quadratic loss for the observed data. This approach is investigated by simulating 30 data sets each containing 5000 observations and 10 variables. Each observation vector is generated using the following low-rank sampling scheme:
where for each data set is a diagonal matrix sampled uniformly from (0.9, 1.1), and is a matrix with entries
generated from a standard normal distribution. Missingness is induced using two alternative methods applied to the same generated data set. The first approach assumes data is missing completely at random (MCAR), where as the second assumes data is missing at random (MAR) by correlating selection with the observed data. Only the first entriesin suffer from missingness with exclusions based on the following selection probabilities:
The value of is recalculated for each data set so that the rate of missingness is approximately the same under both MCAR and MAR cases; yet under MAR, missingness is directly associated with the observed values of the second and third measured variables.
Under the zero-inflated model, offset terms for the -truncated loss were found using (2). For quadratic loss this suggests using the sample mean. However under missing data the sample mean is known to be a biased estimate for the offset term , especially when considering MAR type missingness. To account for bias, the offset term for the first variable was updated between alternating minimization steps using
Regularization was included in the low-rank model to reduce over-fitting and improve data imputation. Quadratic regularizers and were selected, and for simplicity was assumed. In order to choose the regularization parameter , missing data was omitted from the generated data sets and new missing values were randomly created using a MCAR scheme with a similar selection rate as observed in the generated data. The new missing values were imputed using a range of values and the mean squared imputation error was used to find optimal values.
The regularized full hurdle model was applied to the MCAR and MAR data sets, along with four additional models for comparative purposes: Bayesian PCA (BPCA) , Probabilistic PCA (PPCA) , Nonlinear Iterative Partial Least Squares (NIPALS) , and imputation using the sample mean of the observed values. All the data reduction techniques assumed a reduced representation. The performance of the various methods was measured based on imputation and offset mean squared errors, and average performance is reported in Table 1. In both cases the low-rank models significantly improve upon the sample mean approach. In the MCAR setting, adding the hurdle structure is unnecessary causing performance to be slightly worse than the BPCA and PPCA approaches. The hurdle model reports the overall best performance for the MAR data sets. Under the MAR setting, missing values provide additional information regarding the underlying data structure which the hurdle model more accurately represents. This point is further expressed by the left most plot in Figure 2. For each of the MAR data sets, the probability of the observed values exceeding the unobserved missing values was recorded. Data sets with high separation between observed and missing distributions exhibit small values. This separation measure was compared to the percentage improvement in MSE for the hurdle model over BPCA, where positive values indicate better performance for the hurdle model. The plot clearly reveals the hurdle model becomes more preferable as the overlap between the observed and missing data decreases.
|Case||Model||Average Imputation MSE||Average Offset MSE|
The hurdle model representation provides several diagnostics for missing data which are not directly obtainable using other approaches. The first is based on the missingness probability score found using the sigmoid expression which represents the fitted value for the Boolean portion of the hurdle data. These scores can be used to construct ROC curves to measure how well the low-rank representation can discriminate between missing and non-missing occurrences. Figure 2 contains ROC curves for both the first simulated MCAR and MAR data sets. In the MCAR data sets, missingness is unexplained by the observed data and the resulting average area under the ROC curve (AUC) was 0.53. The MAR data sets had an average AUC of 0.88 which correctly suggests missingness is not likely to be completely at random. The interpretability of the AUC value is dependent on the degree of the low-rank model. Higher rank models which explain close to of the total loss lack interpretability since their representation will over-fit observed noise. In both MCAR and MAR cases the total loss reductions were near over the offset only models. This suggests missingness is difficult to explain for the MCAR example and remains as noise, whereas missingness is easily represented in the MAR case and does not remain as a contributor to unexplained loss. The latter finding suggests the hurdle model is a useful representation for the underlying MAR data structure.
The second diagnostic is relevant when missingness is easily explained by the model. Variables associated with missingness can be identified by inspecting the cosine similarity between the vector and the other columns in :
The cosine similarities can be converted into distances using , where implies a high degree of dependence and suggests no association. The similarities and distances for the first MAR data set are summarized in Table 2. The distance measures for columns and are small, which correctly suggests the values and are related to missingness. Interestingly columns and also report small distances. Upon inspection, simulated entries and were moderate to highly correlated with and , indicating the reduced representation is distributing the observed influences across the collection of correlated variables. This outcome seems somewhat expected given the nature of low-rank models. Overall of the simulated MAR data sets had at least one of the two influential variables in the top two distance scores, and this increased to when considering the top three.
This paper described the low-rank hurdle model which falls under the generalized low-rank framework. Previous authors have proposed the ZIFA model which is a special case of the reduced hurdle model. The methodology is particularly applicable to dimensionality reduction problems which exhibit characteristics similar to hurdle or zero-inflated regression problems. In addition to providing a more natural loss approximation, the hurdle model’s design allows practitioners to examine aspects of the low-rank representation not readily available when using alternative procedures. This may be particularly useful in the case of missing data which was demonstrated in the applications.
-  Collins, M., Dasgupta, S., & Schapire, R. (2001). A generalization of principal component analysis to the exponential family. In NIPS, 13, 23.
-  Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B, 39(1), 1.
-  Eckart, C. & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3), 211.
-  Gordon, G. (2002). Generalized linear models. In NIPS, 577.
-  Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417.
-  Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321.
Ilin, A. & Raiko, T. (2010).
Practical approaches to principal component analysis in the presence of missing values.
Journal of Machine Learning Research, 11(Jul), 1957.
-  Jolliffe, I. (1986). Principal component analysis. Springer, 2nd edition.
-  Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1), 1.
-  Lee, D., & Seung, H. (2001). Algorithms for non-negative matrix factorization. In NIPS, 556.
-  McCullagh, P. & Nelder, J. (1990). Generalized linear models. CRC Press, 2nd edition.
-  Min, Y. & Agresti, A. (2005). Random effect models for repeated measures of zero-inflated count data. Statistical Modeling, 5(1), 1.
-  Mullahy, J. (1986). Specification and testing of some modified count data models. Journal of Econometrics, 33(3), 341.
-  Oba, S., Sato, M., Takemasa, I., Monden, M., Matsubara, K., & Ishii, S. (2003). A Bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19(16), 2088.
-  Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559.
-  Pierson, E. & Yau, C. (2015). ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biology, 16(1), 241.
-  Roweis, S. (1998). EM algorithms for PCA and SPCA. In NIPS, 626.
-  Tipping, M. & Bishop, C. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society. Series B, 61(3), 611.
-  Udell, M., Horn, C., Zadeh, R., & Boyd, S. (2016). Generalized low rank models. Foundations and Trends in Machine Learning, 9(1), 1.
-  Witten, D., Tibshirani, R., & Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3), 515.
-  Wold, H. (1966). Estimation of principal components and related models by iterative least squares. Multivariate Analysis, 391.