1 Introduction
High covariate (feature) dimensionality poses challenges for robust prediction and inference, and dimensionality reduction is often performed to address this challenge. Supervised dimensionality reduction techniques [BHPT06] attempt to preserve the dependence of the outcome on a reduced covariate subspace while minimizing the dimensionality of that subspace. Supervised kernel dimension reduction (SKDR) methods provide increased flexibility in specifying the relationship between covariates and outcomes by determining the distance metric between points in an implied (often highdimensional) Hilbert space rather than the dependence of covariates and outcomes directly. Gretton et al. [GBSS05] proposed an independence criterion in reproducing kernel Hilbert spaces [SS02], the HilbertSchmidt Independence Criterion (HSIC). Many proposed covariate selection methods use penalized feature weighting to optimize dependence via the HSIC [All13] [CSS07] [FL12] [GC03] [SPS14] [WEST03]. A stepwise selection procedure was suggested by Song et al. [SSG12]. The approach by Chen et al. [CSWJ17] seeks to find a covariate subset for which the HSIC is minimized conditional on that set. Other methods yield a selection of feature transformations, such as Barshan et al. [BGAJ11]
, who propose a method that selects a subset of feature eigenvectors.
Each of these methods assume data are i.i.d.
(independent and identically distributed), which is not true for clustered data. Such data arise naturally in the setting of longitudinal data, which consists of repeated measurements on multiple subjects. Mixed model approaches such as generalized estimating equations
[HH02]and generalized linear mixed models
[MN01] maximize the dependency of features on the outcome while accounting for clustering, and penalized methods such as GLMM LASSO [GT14]enable feature selection in this context. The degree of clustering might be of interest, in which case
SKDR methods that maximize clustering have been developed [SSGB07]. In the longitudinal setting, clustering may be considered a nuisance characteristic of the data, where the primary interest is maximizing the dependence between features and the outcome within clusters. To the authors’ knowledge, no literature to date proposes SKDR techniques in the context of longitudinal data.This paper derives a method for SKDR in the longitudinal data setting, extending the methods proposed by Barshan et al [BGAJ11]. In Section 2, we review the definition of the HSIC and propose a decomposition of the HSIC resulting in fixed and random components. We derive estimators for this decomposition and prove their bias and rate of convergence. We also propose an algorithm for maximizing these estimators while selecting a feature subspace with reduced dimension. In Section 3, we perform a simulation study to evaluate the properties of this method and compare this derivation with existing methods, and apply this method to the Michigan Intern Health Study. We close with final remarks in Section 4.
2 Methods
2.1 Definition and Decomposition of the Hsic
Consider random variables
and representing distinct features and univariate outcomes, respectively. Let and be separable Hilbert spaces, equipped with reproducing kernels and such that and for some maps and , where denotes an inner product. Gretton et al.[GBSS05] suggest a measurement of dependence in terms of kernels for the features and outcomes:(1) 
(2) 
Random variables and are independent when this value is zero. For i.i.d. data, SKDR consists of minimizing for some selection or transformation map while maximizing .
If a collection of covariate sets is available and a transformation of reduced dimension is desired for each set, Theorem 1 provides a formulation of the joint HSIC in terms of these random variables.
Theorem 1: Let and be random variables, and be separable Hilbert spaces equipped with kernels and such that , and for some feature maps and . Let be the direct sum of these random variables, and be the direct sum of these Hilbert spaces and . Also let , , , and be defined as before. Then:
(3) 
Proof for Theorem 1 is provided in the Supplementary Materials. This theorem readily generalizes to any finite collection of random variables. In general, the random variables and kernels used for between and withinsubject terms are arbitrary. Examples of separate covariate sets suitable for these terms would be subjectlevel demographic information and withinsubject repeated measures. In the context of mixed models, model fits are sought in terms of between and withingroup covariates. Consider a general distribution , representing cluster membership. Theorem 2 provides a decomposition for HSIC for a single covariate set using a generalization of the law of total covariance.
Theorem 2: Let , , and be random variables, and be separable Hilbert spaces equipped with kernels and such that and for some explicit feature maps and . Then:
(4) 
Theorem 2 is proven in the Supplementary Materials. In this equation, is a nuisance crosscovariance term. In the context of mixed models, the first two terms represent the dependence of the outcome on fixed effects and random effects, respectively. Estimates of these terms may then be used in turn to reduce the dimensions of the fixed effect and random effect covariates.
The estimands of interest in Theorem 2 can also be cast in terms of Theorem 1, by treating these estimands as separate subspaces. Theorem 1 assumes that the total HSIC term is a direct sum of random variables and Hilbert spaces, and any covariance between these terms is assumed to be zero. In mixed regression models, fixed and random effects are typically modeled as independent of each other; we therefore restrict our attention to maximizing the first two terms in Theorem 2. In this paper, we consider the decomposition of a single covariate set only.
2.2 Estimators of Hsic
With our estimands defined, we now derive estimators of these quantities. Let subjects have repeated measurements , comprising observations in total. For Subject , let observation with features be written , let be a matrix of feature data for Subject , and let feature data for all subjects be written . Similarly, let and represent the outcome data. Let and be the sample kernels for all observations, where and , respectively. Finally, let , and , and let the centering matrix be defined as .
Fixed Effects
Let selection matrix be defined as:
(5) 
For matrix , and result in the selection of the rows and columns associated with Subject , respectively. Let and . We define the kernel sample mean between two subjects and as follows:
(6)  
(7) 
Let and . We propose the following estimator for :
Theorem 3: Under regulatory conditions, then is a consistent estimator of with bias of order and convergence rate .
Proof for Theorem 3 is provided in the Supplementary Materials.
2.2.1 Random Effects
For random effects, we must specify the withinsubject correlation structure, analogous to choosing a covariance structure in mixed estimation [ZLRS98]. For example, conditional independence holds if withinsubject observations are independent, a strong but common assumption. The following estimator assumes conditional independence holds. Let and
. A simple unbiased estimator for the expected covariance across subjects is the sample average of subject estimates:
(8) 
Theorems 1 and 3 in [GBSS05] imply that the summands in each exhibit bias of order and convergence with rate . In terms of Theorem 1, we can combine both HSIC terms into a single estimator by treating them as the direct sum of their respective spaces:
(9)  
(10)  
(11) 
Because each of these terms is nonnegative and independent, the global maximum of is the maximum of each component.
2.3 Maximization of Hsic Estimators
To perform SKDR in the case, Section 5.3.2 of Barshan et al.[BGAJ11] proposes selecting the top generalized eigenvectors of . We denote this approach as supervised kernel PCA, or skPCA. We now proceed with a similar derivation for closedform solutions for maximizing while reducing the dimension of components for use in longitudinal mixed models. Let and be to fixed and subjectspecific maps of with reduced dimension. We seek that maximize while minimizing , .
2.3.1 Fixed Effects
Consider feature map that yields the subjectlevel mean of such that . Let be a operator from to the subspace of spanned by . Then can be rewritten as . Using the Representer Theorem [KW71], let be written as a linear combination . Then , and the fixed component estimator can be written as follows:
(12)  
(13)  
(14)  
(15) 
subject to . Solving for is characterized as the generalized eigenvector problem for matrices (,
). Selecting a subset of these generalized eigenvectors with the largest eigenvalues and computing the kernel composed from training and test features, we may compute the resulting feature subspace that maximizes
for any number of reduced dimensions.2.3.2 Random Effects
Maximizing is performed in a similar manner. Let be similarly defined for all :
(16)  
(17)  
(18)  
(19) 
subject to . Because each term in is positive, it is maximized when its individual summands are maximized. These steps comprise sklPCA, summarized in Algorithm 1.
3 Application
In this section, we demonstrate the utility of sklPCA as a supervised kernel dimension reduction method in a longitudinal setting by applying sklPCA to simulated and empirical datasets, and compare its performance to skPCA.
3.1 Simulation Specification
We construct a simulation that satisfies the following conditions: (1) features and outcomes are measured multiple times for several subjects exhibiting withinsubject clustering, (2) the relationship between features within and between subjects and the outcome is arbitrary, and (3) features are sampled from a highdimensional, lowrank space. The modeling task in this simulation is to reduce the dimensionality of these highdimensional features resulting in betweensubject and withinsubject components appropriate for use in a mixed model. To elucidate circumstances under which sklPCA and/or skPCA perform well in this task, we repeat the simulation study under a variety of simulated settings.
For subjects at measurement time , the outcome of interest depends on a subspace though an arbitrary function in a way that differs between and within subjects. The observed outcomes are assumed to be noisy measurements of , and the observed features are related to linearly through a projection matrix with rank and dimension , where . The distributional form of these variables are specified as follows:
(20)  
(21)  
(22)  
(23) 
3.2 Model Comparison
We vary four binary factors independently: (1)
, or the ratio of betweensubject variance compared to withinsubject variance, (2) two configurations of distributions applied in the above framework, (3) the rank of the predictors
, and (4) the dimension of the predictors . The two distribution configurations are selected to imitate linear and nonlinear relationships, respectively:
Linear Configuration: are uniform, where the support is defined on to preserve mean and variance . , and kernels and are set to the linear kernel .

Radial Configuration: are standard normal, and , , and are set to the Gaussian radial kernel, .
Variants of these distribution configurations are displayed visually in Figure 1; the above configurations correspond to the diagonal panels. Note that specifies a highdimensional linear mixed model, whereas specifies a more complex relationship for which more flexible methods such as SKDR is required. The selection of the remaining parameters are , , and . For each combination, several other parameters are held constant. To enable arbitrarily accurate model fits when the model is appropriately specified, the error between the outcome and latent space is assumed to be very low, . The data are assumed to exhibit moderate size, ; for simplicity, we assume is the same for all subjects.
After simulation and dimension reduction, linear and mixed regression models are fit to the data reduced using skPCA and sklPCA, respectively. The mixed models are fit using a twostep estimation approach similar to the unrestricted estimation of longitudinal models [Ver97]
, which is performed by first reducing and fitting the fixed component, then reducing and fitting the random components to the resulting residuals. The criterion of 5fold crossvalidated correlation between the predictions and outcome is used to compare results across each simulation and method. Each parameter combination is simulated 100 times, and the sample mean and standard deviation of the outofsample prediction correlations is calculated. Simulations were performed on a 15inch 2017 MacBook Pro laptop computer, with a 2.8 GHz Intel Core i7 processor (4 cores) and 16 GB RAM. The results of these analyses are shown in Table
1.The general form of is designed to produce no little or no correlation when if subject identity is not taken into account, because the relationship between subjects is the negative of that within each subject. This results in poor model performance when using skPCA, which treats each observation independently, whereas sklPCA reduces the feature dimension separately based on these separate relationships. When the betweensubject variance is small (), skPCA performs reasonably well, although sklPCA still yields superior accuracy. Due to the increased complexity, the radial configuration is less performant than the linear configuration. Because the dataset size is held constant, increasing the rank of the feature space has the intuitive effect of reducing overall predictive performance. In contrast, the effect of increasing the dimensionality of the feature space differs for the two distribution configurations: model performance is maintained or slightly improves in the linear case, but model performance decreases in the radial case, and the size of the effect is monotone in and . For moderatesized data, determining complex nonlinear relationships is particularly challenging in highrank, highdimensional settings.
Linear Configuration  Radial Configuration  

0.979  (0.003)  0.050  (0.050)  0.601  (0.051)  0.056  (0.049)  skPCA  
0.999  (1e3)  0.971  (0.004)  0.885  (0.011)  0.765  (0.034)  sklPCA  
310  0.980  (0.002)  0.046  (0.037)  0.595  (0.056)  0.044  (0.047)  skPCA  
0.999  (1e3)  0.972  (0.003)  0.785  (0.024)  0.631  (0.047)  sklPCA  
0.861  (0.082)  0.124  (0.047)  0.674  (0.046)  0.245  (0.043)  skPCA  
0.991  (0.005)  0.905  (0.035)  0.813  (0.027)  0.667  (0.030)  sklPCA  
310  0.964  (0.017)  0.135  (0.062)  0.629  (0.033)  0.006  (0.025)  skPCA  
0.998  (1e3)  0.958  (0.005)  0.573  (0.048)  0.021  (0.014)  sklPCA 
To illustrate performance differences between these methods, we performed additional simulations designed for visual clarity (see Figure 2). The relationships between and are constructed to produce a regular lattice: , , and . The simulation parameters for this simulation are set as , , , and . In this figure, perfectly accurate model predictions for all outcomes would yield a diagonal line. For the diagonal panels, the withinsubject and betweensubject correlations are equal; again, by design, performing data reduction and modeling assuming i.i.d data yields little or no model accuracy in expectation. In contrast, the sklPCA and mixed model approach does capture the withinsubject and betweensubject correlations separately, resulting in highly accurate crossvalidated predictions. The offdiagonal panels show cases where the variance within subjects and between subjects differ markedly. In these cases, the larger variance component will account for an overall correlation between and , even for the i.i.d reduction and modeling strategy. However, Figure 2 shows that predictions using this approach reliably fail to model the remaining correlation resulting from the smaller source of variation; the two sources exhibit correlations of opposite magnitude, and the resulting model fits reveal residuals of potential correlation consistently not captured by this model. In these cases, the sklPCA combined with a mixed model approach successfully models both components, yielding comparatively stronger overall correlations.
3.3 Application to the Michigan Intern Health Study
We now turn to an application of sklPCA to an empirical dataset. The Intern Health Study is a multisite prospective cohort study following physicians throughout internship, for whom correlations are sought between their internship tenure and cognitive and/or clinical outcomes such as mood [KFA18], depression [SKK10], and suicidal ideation [GZK15]. Residencies among interns range from internal medicine, general surgery, pediatrics, obstetrics/gynecology, and psychiatry.
This cohort has participated in an assortment of experimental interventions and clinical measurements of wellbeing. Among these, a subset of physicians installed the Mindstrong digital phenotyping app, which unobtrusively collects contentfree feature information such as swiping and typing patterns. A variety of feature extraction techniques are subsequently used to reduce the behavioral gestures to a single, highdimensional set of features per day per subject. This daily measurement is matched with an outcome of interest: the cumulative sum of minutes of sleep obtained within the last 4 days. The measure of accuracy used to measure model performance is outofsample crossvalidated correlation, where
training folds are created by dividing all paired outcome/feature repeated measurements per subject into a partition of timecontiguous parts each, for which all but one part are used as training data, and the remaining part is used for predictive testing.Among the 181 consented physicians, 108 installed the Mindstrong app and uploaded sufficient phone log data from which to extract features and reported minutes spent sleeping per day. We perform a complete cases analysis, in which a subset of the data exhibiting no missing values is selected, comprising 575 features and 56 subjects with 1415 total daily outcome/feature pairs. Following Bair et al. [BHPT06], the number of features is further reduced by selecting features with significant univariate correlations while controlling for a false discovery rate of 0.1.
After performing these preprocessing steps, we compare the two methods previously discussed in this section. Employing skPCA yields a crossvalidated correlation of 0.438 (<1e10). While using sklPCA, a crossvalidated correlation of 0.814 (<1e10) is achieved (see Figure 3). Individual predictions over time for a subset of subjects are shown in Figure 4. Combined, these suggest that performing SKDR while accounting for subject identity in longitudinal datasets can result in substantially improved model performance.
4 Conclusion
We have proposed a supervised kernel dimension reduction method in the longitudinal setting. In Section 2, we derived a novel decomposition for the HSIC, and proposed sklPCA, a method extending skPCA that maximizes two constituent components of HSIC separately. In Section 3, we showed that our method yields comparable or superior accuracy compared to skPCA for a range of simulation settings. We also applied these approaches to the Michigan Intern Health Study, for which sklPCA yielded a crossvalidated correlation approximately twice that obtained using skPCA.
This estimation procedure might be improved in several ways. This work extends the HSIC estimator proposed by Gretton et al., [GBSS05], yet more efficient estimators in both the computational and statistical sense have recently been proposed [PSCV17]. Extending this approach for longitudinal data might be similarly efficient. Alternatively, the estimator for the HSIC used here is based on statistics [Hoe48], which have been extended to noni.i.d. data [Bor96], which might provide another route through which estimators of the HSIC may be derived for clustered data. Our method transforms covariates by selecting generalized eigenvectors, but techniques for recasting variable transformation methods as feature selection methods have been developed [MDF10], which might be adopted here. Finally, new kernels might be developed that account for withinsubject clustering, such as an extension of the Gaussian mixture proposed by Weiss et al. [Wei99]. We leave such investigations as future work.
5 Supplementary Materials
In this supplement, we provide proofs for the theorems stated in the paper.
5.1 Proof of Theorem 1
Theorem 1: Let , , and be random variables, , , and be separable Hilbert spaces equipped with kernels , , and such that , , , and for some explicit feature maps , , and . Let be the direct sum of these random variables, and be the direct sum of these Hilbert spaces and . Then:
(24) 
Proof: The direct sum of two Hilbert spaces is also a Hilbert space, equipped with the inner product
. The kernel for feature map is therefore:
(25)  
(26)  
(27) 
Let . Using the linearity of expectations and the distributive properties for direct sums and tensor products, we obtain:
(28)  
(29)  
(30) 
for crosscovariance linear operators and . The HilbertSchmidt norm of is therefore:
(31)  
(32)  
(33) 
5.2 Proof of Theorem 2
Theorem 2: Let , , and be random variables, and be separable Hilbert spaces equipped with kernels and such that and for some explicit feature maps and . Then:
(34) 
Proof: For ease of notation, define the following expectations:
(35)  
Comments
There are no comments yet.