Machine learning models exhibit poor predictive performance in high-dimensional problems, where the number of features is relatively large compared to the number of observations. In the high-dimensional setting, models suffer from high variance and overfit to the training data, a phenomenon known as the curse of dimensionality. Most methods used to mitigate this issue rely solely on data and fail to leverage knowledge from domain experts. In this paper, we propose a method that elicits an expert’s knowledge on feature importances and integrates it into a regularized linear model to improve performance in high-dimensional settings.
High-dimensional datasets are ubiquitous in practice. Certain types of data such as images, text, and time series are inherently high-dimensional. In many domains, features are often cheap and numerous – for instance, one may collect measurements from many sensors without knowing which ones are relevant to the prediction task at hand. In contrast, observations are often expensive to obtain and label. As a result, many problems have a large number of features, but relatively few observations. Under such circumstances, classical approaches are no longer valid (Donoho, 2000).
Given the wide presence of high-dimensional data, techniques such as regularization and dimensionality reduction are often used to fit simpler, less flexible models. These methods reduce the variance of the model at the cost of increased bias. For example, ridge regression and lasso are regularized variants of least squares linear regression, and they reduce the model's variance by shrinking coefficient estimates toward zero (Hoerl and Kennard, 1970; Tibshirani, 1996).
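To make the shrinkage effect concrete, here is a minimal numpy sketch (not the paper's code; the data and penalty values are illustrative) that computes the closed-form ridge solution and shows how increasing the penalty shrinks the coefficient vector toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = rng.uniform(-10, 10, size=5)  # only the first 5 features matter
y = X @ beta_true + rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Increasing the penalty shrinks the coefficient vector toward zero.
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 1.0, 10.0, 100.0)]
print(norms)  # monotonically decreasing in lam
```

Each coefficient is pulled toward zero as the penalty grows, trading a little bias for a large reduction in variance.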
Before introducing our approach, it is worth discussing the strengths and weaknesses of regularization. In Figure 1, we compare the performance of k-nearest neighbors, linear regression, ridge regression, and lasso on a simulated regression problem with p features, of which only the first 10 are truly associated with the response. The remaining p − 10 features are noise features with true coefficient values equal to zero. The training set consists of n observations, and the validation MSE is evaluated on a held-out validation set (for complete details on the data generation procedure, see Section 4). Cross-validation is used for the ridge regression and lasso models to select the optimal regularization parameter λ from a grid of candidate values at each p. As p approaches n, the performance of k-nearest neighbors and linear regression rapidly deteriorates, but that of ridge regression and lasso remains relatively stable.
Still, even ridge regression and lasso suffer from overfitting, indicated by a steady increase in their validation MSEs as p approaches n. With enough noise features, even regularization does not prevent overfitting: chance associations between the features and the response on the training set result in some noise features being assigned nonzero coefficient estimates, even though those features are not truly associated with the response (James et al., 2014). Moreover, truly associated features with large true coefficient values have their coefficient estimates driven toward zero by the regularization penalty.
To better understand these limitations and motivate our approach, it is helpful to look at regularization from a Bayesian perspective. Ridge regression and lasso have simple Bayesian interpretations (Murphy, 2012; Keng, 2016):
Ridge regression is the maximum a posteriori (MAP) solution from assuming a standard linear model with a Gaussian prior on the coefficients, with mean zero and variance parameterized by the regularization parameter λ.
Lasso is the MAP solution from assuming a standard linear model with a Laplace prior on the coefficients, with mean zero and scale parameterized by λ.
In both cases, all model coefficients share the same prior distribution, since they are all centered at zero and parameterized by the same regularization parameter λ. If we expect coefficients to be small or sparse in general, then ridge regression and lasso provide effective ways to encode that prior knowledge into the model. But if we have more informative prior knowledge about individual feature importances, then ridge regression and lasso do not help to encode that knowledge.
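To make the correspondence explicit, under a Gaussian prior β_j ~ N(0, τ²) on each coefficient and a Gaussian likelihood with noise variance σ², the MAP estimate reduces to the ridge objective (a standard derivation, with λ = σ²/τ²):

```latex
\hat{\beta}_{\text{MAP}}
  = \arg\max_{\beta}\ \log p(y \mid X, \beta) + \log p(\beta)
  = \arg\min_{\beta}\ \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2,
\qquad \lambda = \sigma^2 / \tau^2 .
```

With a Laplace prior of scale b, the log-prior contributes an ℓ1 penalty λ‖β‖₁ instead (λ = 2σ²/b), recovering the lasso objective.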
We propose a new regularization technique called Distance Metric Learning Regularization (DMLreg) to elicit prior knowledge on feature importances and incorporate that knowledge into a linear (classification or regression) model. In situations where training data are limited yet domain experts possess a wealth of prior knowledge about the problem, DMLreg combines human-driven priors with data-driven parameter estimation to fit a better regularized model.
2 Related work
This paper relies on distance metric learning, a concept pioneered by Xing et al. (2002) to automatically learn a distance metric from data and user guidance. Distance metric learning algorithms optimize a distance function, most commonly the Mahalanobis distance, to respect the user's notion of similarity between instances. As input to the algorithm, the user provides weak supervision through pairs of similar and dissimilar instances (e.g. “x_i is similar/dissimilar to x_j”). Alternatively, the user can provide relative comparisons between instances (e.g. “x_i is more similar to x_j than to x_k”) (Schultz and Joachims, 2003). Surveys by Kulis (2013) and Bellet et al. (2013) review the vast literature on distance metric learning.
Typically, the learned distance metric is used as an alternative to Euclidean distance to improve the performance of nearest-neighbor or kernel regression methods (Weinberger et al., 2006; Weinberger and Tesauro, 2007); however, in this paper, we integrate the learned metric into a regularized linear model. This offers advantages over nearest-neighbor methods such as greater interpretability, robustness to an incorrect learned distance metric, and a better model fit when a linear model is more appropriate for the underlying data.
Our work has parallels to research in knowledge elicitation. Daee et al. (2017) have a similar objective to elicit and incorporate expert knowledge, but they do so by using expected information gain to identify the most important features on which to query expert feedback. Unlike their approach which elicits knowledge at the feature level, our approach elicits knowledge at the sample level. This gives us advantages in high-dimensional settings where individual features may not be interpretable (e.g. pixel intensities in an image), but comparing samples for similarity may be easier.
More generally, a thorough discussion of the challenges and approaches in high-dimensional problems is provided by Hastie et al. (2009) and James et al. (2014). These resources describe the curse of dimensionality in both kernel methods and linear models. Additionally, Murphy (2012) and Wasserman (2010) offer good references for Bayesian inference, a key framework used in this paper.
Our DMLreg regularization approach has two key steps: (i) elicit domain knowledge on feature importances and (ii) incorporate that knowledge into a linear model. We describe this process in detail and derive MAP estimates for two models (linear regression and logistic regression) combined with DMLreg.
3.1 Eliciting domain knowledge
The first objective of DMLreg is to elicit knowledge on feature importances held by the domain expert. Here, one may ask: why not simply ask the expert which features are relevant to the task at hand? While ideal, this is impractical in the high-dimensional setting – it may be overwhelming or even impossible for an expert to rank or select good features in a feature space with hundreds or thousands of features.
In contrast, in many problems, it is relatively easy for an expert to look at pairs of observations and determine whether they are similar or dissimilar. In DMLreg, we exploit this ability of experts to do pairwise similarity comparisons in order to learn their tacit knowledge about feature importances. In particular, we aim to learn a weighting of the features that reflects their relative importances, and we achieve this task through distance metric learning.
We use the original distance metric learning formulation (Xing et al., 2002), through which we learn a Mahalanobis distance metric between points x_i and x_j that respects the expert's notion of similarity between observations. In other words, the learned distance metric assigns small distances between observations that the expert considers similar and large distances between observations that the expert considers dissimilar.
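Concretely, the Mahalanobis distance parameterized by a positive semi-definite matrix A is:

```latex
d_A(x_i, x_j) = \sqrt{(x_i - x_j)^\top A \, (x_i - x_j)} ,
```

with A = I recovering the ordinary Euclidean distance; learning the metric amounts to choosing A.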
To learn this distance metric, the domain expert must provide weak supervision on pairwise similarities in the form of sets S and D, each containing pairs of observations which the expert considers similar and dissimilar, respectively.
Learning the distance metric can then be posed as a convex optimization problem: find an optimal matrix A that minimizes the sum of squared distances between the pairs of points in set S that the expert has identified as similar.
The constraint in (5) maintains a lower bound on the total of distances between the dissimilar points in D and prevents the trivial solution A = 0 to (4). Constraint (6) requires A to be a positive semi-definite matrix so that it is a valid distance metric satisfying non-negativity and the triangle inequality. Finally, constraint (7) restricts A to be a diagonal matrix, which makes it easier to integrate with the model.
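Under the diagonal restriction, the optimization can be sketched with projected gradient descent on the objective of Xing et al. (2002) (minimize similar-pair squared distances minus the log of the summed dissimilar-pair distances). This is a simplified stand-in for the Newton-Raphson procedure used in the original paper, and the toy data below is purely illustrative:

```python
import numpy as np

def learn_diag_metric(X, S, D, lr=1e-3, n_iter=2000):
    """Learn nonnegative diagonal Mahalanobis weights w (diagonal of A):
    minimize sum of squared distances over similar pairs S minus
    log of summed distances over dissimilar pairs D (Xing et al.)."""
    p = X.shape[1]
    w = np.ones(p)
    dS = np.array([(X[i] - X[j]) ** 2 for i, j in S])  # squared diffs, similar pairs
    dD = np.array([(X[i] - X[j]) ** 2 for i, j in D])  # squared diffs, dissimilar pairs
    for _ in range(n_iter):
        distD = np.sqrt(dD @ w) + 1e-12                # distances over dissimilar pairs
        grad = dS.sum(axis=0) - (dD / (2 * distD[:, None])).sum(axis=0) / distD.sum()
        w = np.maximum(w - lr * grad, 0.0)             # projected gradient step
    return w

# Toy example: feature 0 separates the dissimilar pairs, feature 1 is noise.
X = np.array([[0.0, 0.5], [0.1, -0.5], [5.0, 0.4], [5.1, -0.6]])
S = [(0, 1), (2, 3)]   # pairs labeled similar
D = [(0, 2), (1, 3)]   # pairs labeled dissimilar
w = learn_diag_metric(X, S, D)
print(w)  # feature 0 receives the larger weight
```

The learned weights upweight the feature along which dissimilar pairs differ and drive the noise feature's weight to zero, which is exactly the feature-importance signal DMLreg extracts.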
The diagonal elements of the learned matrix A represent a weighting of the features based on their relative importances as indicated by the expert (at least with regard to computing distances between observations). For simplicity, we will refer to A as the “distance metric” in the following discussion.
3.2 Incorporating domain knowledge into the model
Once we have captured the expert's knowledge on feature importances in the form of a learned distance metric A, DMLreg integrates this knowledge into a regularized linear model. A natural way to combine prior knowledge with data is through Bayesian inference. In Bayesian inference, we specify a prior probability distribution p(β) and a likelihood function p(y | X, β). Using Bayes' theorem, we can write the posterior distribution p(β | X, y) as proportional to the likelihood times the prior.
Specifically, let us first consider the case of Bayesian linear regression, which assumes a linear model y = Xβ + ε with Gaussian errors ε ~ N(0, σ²I). This implies that y is also Gaussian, centered at the regression line, with corresponding likelihood function p(y | X, β) = N(y; Xβ, σ²I).
The prior distribution in Bayesian linear regression lets us express our knowledge about the coefficients β. DMLreg encodes the knowledge held in the learned distance metric A through a Gaussian prior on each coefficient. Like ridge regression, the Gaussian priors are all centered at zero, since A does not provide any information on the direction of the relationship between each feature and the response.
However, the diagonal elements a_jj of A represent the relative importances of the features. For each feature j, if a_jj is large, then we can expect a priori that feature j has a large positive or negative coefficient β_j; in other words, we can expect β_j to have a large variance around zero. So, we encode this information through the variance of each coefficient's prior. This leads to placing a Gaussian prior distribution on each coefficient β_j with mean zero and variance proportional to a_jj.
With our likelihood function and prior distributions defined, we can use Bayes' theorem to calculate the posterior distribution of β. Since we are only interested in point estimates of the coefficients and not the full posterior distribution of β, we use maximum a posteriori (MAP) estimation to find the posterior mode: the value of β that is most likely given the data and the priors.
Equation (21) is the loss function for linear regression with DMLreg (Gaussian prior). We can solve for β to obtain a closed-form solution for the coefficient estimates.
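Under the assumption that the prior variance of β_j scales with a_jj (so the per-feature penalty is λ_j ∝ 1/a_jj), the loss takes a generalized-ridge form; this reconstruction is a sketch consistent with the ridge equivalence at A = I:

```latex
\hat{\beta}
  = \arg\min_{\beta}\ \lVert y - X\beta \rVert_2^2
    + \sum_{j=1}^{p} \lambda_j \beta_j^2,
\qquad \lambda_j \propto \frac{1}{a_{jj}},
\qquad
\hat{\beta} = (X^\top X + \Lambda)^{-1} X^\top y,
\quad \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p).
```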
Thus, for linear regression with DMLreg (Gaussian prior), we can compute coefficient estimates in closed form in terms of X, y, and the learned distance metric A. This is equivalent to the coefficient estimates for ridge regression when A is the identity matrix, for regularization parameter λ. An interpretation of this result is that linear regression with DMLreg (Gaussian prior) is a generalization of ridge regression with a separate regularization parameter λ_j for each feature j, instead of a single regularization parameter λ shared across all features.
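In code, the closed form is a single linear solve. The sketch below is illustrative only: the scaling λ_j = λ/a_jj and the simulated data are assumptions, not the paper's exact setup. It compares an informed metric against the uniform metric, which reduces to plain ridge:

```python
import numpy as np

def dmlreg_gaussian(X, y, metric_diag, lam=1.0, eps=1e-8):
    """MAP estimate for linear regression with per-feature Gaussian priors.
    Features with large metric weights get weak shrinkage (small lambda_j);
    features with small weights get strong shrinkage."""
    lam_j = lam / (np.asarray(metric_diag) + eps)  # assumed scaling lambda_j = lam / a_jj
    return np.linalg.solve(X.T @ X + np.diag(lam_j), X.T @ y)

rng = np.random.default_rng(1)
n, p = 30, 50                                 # high-dimensional: p > n
beta_true = np.zeros(p)
beta_true[:5] = rng.uniform(-10, 10, size=5)  # only the first 5 features matter
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)

a = np.where(np.arange(p) < 5, 10.0, 0.01)    # simulated "perfect" knowledge metric
beta_informed = dmlreg_gaussian(X, y, a)
beta_uniform = dmlreg_gaussian(X, y, np.ones(p))  # A = I reduces to plain ridge
print(np.linalg.norm(beta_informed - beta_true),
      np.linalg.norm(beta_uniform - beta_true))
```

With an informative metric, the noise features are strongly penalized while the relevant ones are nearly unpenalized, so the informed estimate lands closer to the true coefficients than uniform ridge.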
Apart from linear regression, DMLreg can be easily applied to any generalized linear model. In Table 1, we show how to fit linear regression and logistic regression models with DMLreg for both Gaussian and Laplace priors on the coefficients. (For the Logistic Regression + DMLreg models in Table 1, σ(·) denotes the sigmoid function.)
| Model Name | Assumptions | Optimization Method |
We evaluate DMLreg through an experiment on an artificial dataset with simulated domain knowledge. The experiment is outlined in Figure 2.
First, we generate artificial high-dimensional datasets with n observations and p features. We sample a p-dimensional coefficient vector β with the first 10 coefficients drawn from a Uniform distribution between -10 and 10 and the remaining p − 10 coefficients set exactly to zero (line 27). Keeping β fixed, we simulate 100 datasets through the procedure in lines 28-30. To be precise, our training set has n observations; we hold out a separate validation set with 900 observations.
To simulate an expert's domain knowledge on feature importances, we create three true distance metrics, listed in Table 2, which represent varying levels of expertise: perfect, noisy, and incorrect. These true metrics are used to compute pairwise distances between observations in order to generate the sets S and D for distance metric learning (simulating an expert manually comparing observations and creating these sets).
| Knowledge type | True metric value |
In practice, we expect real-world expert knowledge to resemble the noisy or incorrect knowledge metric. The noisy knowledge metric is a diagonal matrix which weighs each feature by the magnitude of its true coefficient value plus some error ε. It represents an expert who knows the correct feature importances up to some error. In contrast, the incorrect knowledge metric weighs the features with completely random weights.
Now, we explain how the sets S and D are generated in the experiment. For a feature matrix X and a true metric, we compute Mahalanobis distances between all pairs of observations in X. The pairs with the highest distances are added to set D, and the pairs with the lowest distances are added to set S.
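As a sketch, simulating the expert's pair labeling might look like the following (numpy, with illustrative data; the diagonal true metric here weighs only the first two features):

```python
from itertools import combinations

import numpy as np

def make_pair_sets(X, metric_diag, k):
    """Simulate an expert: rank all observation pairs by Mahalanobis distance
    under the true (diagonal) metric; the k closest pairs form S (similar),
    the k farthest form D (dissimilar)."""
    pairs = list(combinations(range(len(X)), 2))
    d = np.array([np.sqrt(((X[i] - X[j]) ** 2 * metric_diag).sum())
                  for i, j in pairs])
    order = np.argsort(d)
    S = [pairs[i] for i in order[:k]]      # lowest distances -> similar
    D = [pairs[i] for i in order[-k:]]     # highest distances -> dissimilar
    return S, D

X = np.random.default_rng(2).normal(size=(20, 5))
metric = np.array([1.0, 1.0, 0.0, 0.0, 0.0])   # only the first two features matter
S, D = make_pair_sets(X, metric, k=10)
```

The zero entries in the true metric make the last three features invisible to the simulated expert, so the pair labels carry information only about the relevant features.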
The sets S and D are then passed to the distance metric learning algorithm to learn an approximation A of the true metric. The learned metric A, along with X and y, is used as input to DMLreg, which fits a regularized linear model.
We evaluate the performance of the Linear Regression + DMLreg (Laplace prior) model on the validation set for each learned metric and for each number of features p. Besides the metrics learned to approximate the three true metrics, we also consider the Euclidean distance metric (A equal to the identity matrix), which is identical to the lasso model when used in DMLreg with a Laplace prior. For this evaluation, we fix the number of observation pairs provided in S and D. The validation set MSEs are aggregated across the 100 datasets and displayed in Figure 3.
For high values of p, DMLreg using metrics learned from perfect or noisy knowledge performs better than DMLreg using the Euclidean distance metric (lasso). DMLreg substantially outperforms k-nearest neighbors and linear regression evaluated on the same data (refer to Figure 1). It also performs slightly better than the lasso model which had λ tuned through grid search and cross-validation (refer to Figure 1), even though no hyperparameter search was required for its DMLreg counterpart. Nevertheless, when provided a metric learned from incorrect knowledge, DMLreg performs worse than lasso. These results suggest that DMLreg is effective for high-dimensional problems as long as the domain knowledge provided through the learned metric is reasonably correct.
An important practical concern when using DMLreg is: how many pairs of observations does one need to provide in sets S and D in order to get good performance? Figure 4 shows the validation MSEs for DMLreg on a single dataset with a fixed number of features p as the number of pairs is varied from 25 to 700 observation pairs. For DMLreg using the metric learned from noisy knowledge, providing relatively few observation pairs yields nearly the same performance as the full number of pairs used in the experiment above.
To understand why the DMLreg model using the metric learned from noisy knowledge outperforms lasso, it is helpful to examine the coefficient estimates in Figure 5. For a single dataset, Figure 5 shows the coefficients for the first 10 (relevant) features on the left and the remaining (noise) features on the right. It is clear that the coefficient estimates for the DMLreg model (with noisy knowledge and Laplace prior) are closer to the true coefficients than those for the lasso model. In fact, compared to the DMLreg model, the lasso model slightly underestimates the coefficients of the truly associated features and overestimates the coefficients of the noise features.
When training data are limited but domain knowledge on feature importances is available, DMLreg can be used to fit a better regularized model. Using distance metric learning and regularization, DMLreg elicits and integrates expert knowledge into a linear model. Through an experiment on artificial data using simulated domain knowledge, we demonstrated that DMLreg outperforms ridge regression and lasso when the knowledge elicited is approximately correct.
- A. Bellet, A. Habrard, and M. Sebban (2013). A survey on metric learning for feature vectors and structured data. CoRR abs/1306.6709.
- P. Daee, T. Peltola, M. Soare, and S. Kaski (2017). Knowledge elicitation via sequential probabilistic inference for high-dimensional prediction. Machine Learning 106 (9), pp. 1599–1620.
- D. L. Donoho (2000). High-dimensional data analysis: the curses and blessings of dimensionality. In AMS Conference on Math Challenges of the 21st Century.
- T. Hastie, R. Tibshirani, and J. Friedman (2009). The elements of statistical learning: data mining, inference, and prediction. Springer.
- A. E. Hoerl and R. W. Kennard (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12 (1), pp. 55–67.
- G. James, D. Witten, T. Hastie, and R. Tibshirani (2014). An introduction to statistical learning: with applications in R. Springer, pp. 238–244.
- B. Keng (2016). A probabilistic interpretation of regularization. Blog post.
- B. Kulis (2013). Metric learning: a survey. Foundations and Trends in Machine Learning, Now Publishers.
- K. P. Murphy (2012). Machine learning: a probabilistic perspective. The MIT Press.
- M. Schultz and T. Joachims (2003). Learning a distance metric from relative comparisons. In Proceedings of the 16th International Conference on Neural Information Processing Systems.
- R. Tibshirani (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological) 58 (1), pp. 267–288.
- L. Wasserman (2010). All of statistics: a concise course in statistical inference. Springer.
- K. Q. Weinberger, J. Blitzer, and L. K. Saul (2006). Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems.
- K. Q. Weinberger and G. Tesauro (2007). Metric learning for kernel regression. In Artificial Intelligence and Statistics, pp. 612–619.
- E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell (2002). Distance metric learning, with application to clustering with side-information. In Proceedings of the 15th International Conference on Neural Information Processing Systems.