Subspace Network -- Research Paper, Supplemental, Code.
Over the past decade a wide spectrum of machine learning models have been developed to model neurodegenerative diseases, associating biomarkers, especially non-intrusive neuroimaging markers, with key clinical scores measuring the cognitive status of patients. Multi-task learning (MTL) has been commonly utilized by these studies to address high dimensionality and small cohort size challenges. However, most existing MTL approaches are based on linear models and suffer from two major limitations: 1) they cannot explicitly consider upper/lower bounds in these clinical scores; 2) they lack the capability to capture complicated non-linear interactions among the variables. In this paper, we propose Subspace Network, an efficient deep modeling approach for non-linear multi-task censored regression. Each layer of the subspace network performs a multi-task censored regression to improve upon the predictions from the last layer via sketching a low-dimensional subspace to perform knowledge transfer among learning tasks. Under mild assumptions, for each layer the parametric subspace can be recovered using only one pass of training data. Empirical results demonstrate that the proposed subspace network quickly picks up the correct parameter subspaces, and outperforms state-of-the-art methods in predicting neurodegenerative clinical scores using information in brain imaging.
Recent years have witnessed increasing interest in applying machine learning (ML) techniques to analyze biomedical data. Such data-driven approaches deliver promising performance improvements in many challenging predictive problems. For example, in the field of neurodegenerative diseases such as Alzheimer’s disease and Parkinson’s disease, researchers have exploited ML algorithms to predict the cognitive functionality of patients from brain imaging scans, e.g., using magnetic resonance imaging (MRI) as in (Adeli-Mosabbeb et al., 2015; Zhang et al., 2012; Zhou et al., 2011b). A key finding is that there are typically various types of prediction targets (e.g., cognitive scores), and they can be jointly learned using multi-task learning (MTL), e.g., (Caruana, 1998; Evgeniou and Pontil, 2004; Zhang et al., 2012), where the predictive information is shared and transferred among related models to reinforce their generalization performance.
Two challenges persist despite the progress in applying MTL to disease modeling problems. First, clinical targets, unlike typical regression targets, are often naturally bounded. For example, the output of the Mini-Mental State Examination (MMSE) test, a key reference for deciding cognitive impairments, ranges from 0 to 30 (a healthy subject): a smaller score indicates a higher level of cognitive dysfunction (please refer to (Tombaugh and McIntyre, 1992)). Other cognitive scores, such as the Clinical Dementia Rating Scale (CDR) (Hughes et al., 1982) and the Alzheimer’s Disease Assessment Scale-Cog (ADAS-Cog) (Rosen et al., 1984), also have specific upper and lower bounds. Most existing approaches, e.g., (Zhang et al., 2012; Zhou et al., 2011b; Poulin et al., 2011), relied on linear regression without considering the range constraint, partially because mainstream MTL models for regression, e.g., (Jalali et al., 2010; Argyriou et al., 2007; Zhang et al., 2012; Zhou et al., 2011a), are developed using the least squares loss and cannot be directly extended to censored regressions. Second, a majority of MTL research has focused on linear models because of their computational efficiency and theoretical guarantees. However, linear models cannot capture the complicated non-linear relationship between features and clinical targets. For example, (Association et al., 2013) showed the early onset of Alzheimer’s disease to be related to single-gene mutations on chromosomes 21, 14, and 1, and the effects of such mutations on cognitive impairment are hardly linear (please refer to (Martins et al., 2005; Sweet et al., 2012)). Recent advances in multi-task deep neural networks (Seltzer and Droppo, 2013; Zhang et al., 2014; Wu et al., 2015) provide a promising direction, but their model complexity and demand for a huge number of training samples prohibit their broader usage in clinical cohort studies.
To address the aforementioned challenges, we propose a novel and efficient deep modeling approach for non-linear multi-task censored regression, called Subspace Network (SN), highlighting the following multi-fold technical innovations:
It efficiently builds up a deep network in a layer-by-layer feedforward fashion, and in each layer considers a censored regression problem. The layer-wise training allows us to grow a deep model efficiently.
It explores a low-rank subspace structure that captures task relatedness for better predictions. A critical difference in subspace decoupling between previous studies such as (Mardani et al., 2015; Shen et al., 2016) and our method lies in our assumption of a low-rank structure in the parameter space among tasks rather than in the original feature space.
Synthetic experiments verify the technical claims of the proposed SN, and it outperforms various state-of-the-art methods in modeling neurodegenerative diseases on real datasets.
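In censored regression, we are given a set of $N$ observations $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ of $D$-dimensional feature vectors $x_i \in \mathbb{R}^{D}$ and $T$ corresponding outcomes $y_i \in \mathbb{R}_+^{T}$, where each outcome $y_{i,t} \in \mathbb{R}_+$, $t \in \{1, \dots, T\}$, can be cognitive scores (e.g., MMSE and ADAS-Cog) or other biomarkers of interest such as proteomics. For each outcome, censored regression passes a noisy linear response through a rectified linear unit (ReLU), i.e., $y_{i,t} = \mathrm{ReLU}(W_t^\top x_i + \epsilon)$, where $W_t \in \mathbb{R}^{D}$ is the coefficient for input features, $\epsilon$ is i.i.d. noise, and ReLU is defined by $\mathrm{ReLU}(z) = \max(z, 0)$. We can thus collectively represent the censored regression for multiple tasks by:
$$y_i = \mathrm{ReLU}(W x_i + \epsilon), \tag{1}$$
where $W = [W_1, \dots, W_T]^\top \in \mathbb{R}^{T \times D}$ is the coefficient matrix. We consider the regression problem for each outcome as a learning task. One commonly used task relationship assumption is that the transformation matrix $W$ belongs to a linear low-rank subspace $\mathcal{U}$. The subspace allows us to represent $W$ as the product of two matrices, $W = UV$, where the columns of $U = [U_1, \dots, U_T]^\top \in \mathbb{R}^{T \times R}$ span the linear subspace $\mathcal{U}$, and $V \in \mathbb{R}^{R \times D}$ is the embedding coefficient. We note that the output $y_i$ can be entry-wise decoupled, such that for each component $y_{i,t} = \mathrm{ReLU}(U_t^\top V x_i + \epsilon)$. By assuming Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$, we derive the following likelihood function:
$$\Pr(y_{i,t}, x_i \mid U_t, V) = \frac{1}{\sigma}\,\phi\!\left(\frac{y_{i,t} - U_t^\top V x_i}{\sigma}\right)\mathbb{I}(y_{i,t} > 0) + Q\!\left(\frac{U_t^\top V x_i}{\sigma}\right)\mathbb{I}(y_{i,t} = 0),$$
where $\phi$ is the probability density function of the standardized Gaussian $\mathcal{N}(0, 1)$ and $Q(z) = \int_z^{\infty} \phi(s)\,ds$ is the standard Gaussian tail. $\sigma$ controls how accurately the low-rank subspace assumption can fit the data. Note that other noise models can be assumed here as well. The likelihood of the pair $(x_i, y_i)$ is thus given by:
$$\Pr(y_i, x_i \mid U, V) = \prod_{t=1}^{T} \Pr(y_{i,t}, x_i \mid U_t, V).$$
The likelihood function allows us to estimate the subspace $U$ and the coefficient $V$ from data $\mathcal{D}$. To enforce a low-rank subspace, one common approach is to impose a trace norm on $UV$, where the trace norm of a matrix $A$ is defined by $\|A\|_* = \sum_j s_j(A)$ and $s_j(A)$ is the $j$-th singular value of $A$. Since $\|W\|_* = \min_{U, V:\, W = UV} \frac{1}{2}\left(\|U\|_F^2 + \|V\|_F^2\right)$, e.g., see (Srebro et al., 2005; Mardani et al., 2015), the objective function of the multi-task censored regression problem is given by:
$$\min_{U, V} \; -\sum_{i=1}^{N} \log \Pr(y_i, x_i \mid U, V) + \frac{\lambda}{2}\left(\|U\|_F^2 + \|V\|_F^2\right). \tag{2}$$
Footnote 1: Without loss of generality, in this paper we assume that outcomes are lower censored at 0. By using variants of Tobit models, e.g., as in (Shen et al., 2016), the proposed algorithms and analysis can be extended to other censored models with minor changes in the loss function.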
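To make the censored (Tobit) likelihood and the regularized objective concrete, here is a minimal sketch in plain Python. It assumes outcomes lower censored at 0, Gaussian noise, and a factorized coefficient matrix; the names `U`, `V`, `sigma`, and `lam`, and the use of nested lists instead of a numerical library, are illustrative assumptions and are not taken from the authors' released code.

```python
import math

def phi(z):
    """Standard Gaussian probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def gauss_tail(z):
    """Standard Gaussian tail Q(z) = P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def neg_log_likelihood(U, V, x, y, sigma=1.0):
    """Negative log-likelihood of one sample, outcomes lower censored at 0.

    U is T x R, V is R x D (assumed shapes), x has D entries, y has T entries.
    """
    z = matvec(U, matvec(V, x))  # linear responses U V x, one per task
    nll = 0.0
    for z_t, y_t in zip(z, y):
        if y_t > 0:
            # Observed outcome: Gaussian density around the linear response.
            nll -= math.log(phi((y_t - z_t) / sigma) / sigma)
        else:
            # Censored outcome: probability that the latent response is <= 0.
            nll -= math.log(gauss_tail(z_t / sigma))
    return nll

def objective(U, V, X, Y, lam=0.1, sigma=1.0):
    """Data term plus Frobenius regularization on both factors."""
    fro = sum(v * v for M in (U, V) for row in M for v in row)
    data = sum(neg_log_likelihood(U, V, x, y, sigma) for x, y in zip(X, Y))
    return data + 0.5 * lam * fro
```

The Frobenius penalty on the two factors stands in for the trace norm of their product, which is why no singular value decomposition appears in the sketch.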
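We propose to solve the objective in (2) via block coordinate descent, which reduces to iteratively updating the following two subproblems:
$$V^+ = \arg\min_{V} \; -\sum_{i=1}^{N} \log \Pr(y_i, x_i \mid U^-, V) + \frac{\lambda}{2}\|V\|_F^2, \tag{P:V}$$
$$U^+ = \arg\min_{U} \; -\sum_{i=1}^{N} \log \Pr(y_i, x_i \mid U, V^+) + \frac{\lambda}{2}\|U\|_F^2. \tag{P:U}$$
Define the instantaneous cost of the $i$-th datum:
$$g(x_i, y_i, U, V) = -\log \Pr(y_i, x_i \mid U, V) + \frac{\lambda}{2}\|U\|_F^2 + \frac{\lambda}{2}\|V\|_F^2,$$
and the online optimization form of (2) can be recast (up to a rescaling of $\lambda$) as an empirical cost minimization given below:
$$\min_{U} \; \frac{1}{N}\sum_{i=1}^{N} g(x_i, y_i, U, V).$$
According to the analysis in Section 2.2, one pass of the training data suffices for the subspace learning problem. We outline the solver for each subproblem as follows:
Problem (P:V) sketches parameters in the current space. We solve (P:V) using gradient descent. The parameter sketching couples all the subspace dimensions in $V$ (not decoupled as in (Shen et al., 2016)), and thus we need to solve this collectively. The update of $V$ ($V^+$) can be obtained by solving the online problem given below:
$$\min_{V} \; -\sum_{t=1}^{T} \log \Pr(y_{i,t}, x_i \mid U_t^-, V) + \frac{\lambda}{2}\|V\|_F^2.$$
$V^+$ can be computed by the following gradient update: $V^+ = V^- - \eta \nabla_V g(x_i, y_i; U^-, V^-)$, where the gradient is given by:
$$\nabla_V g(x_i, y_i; U^-, V) = \lambda V + \sum_{t=1}^{T} c_{i,t}\, U_t^- x_i^\top, \qquad c_{i,t} = \begin{cases} -\dfrac{y_{i,t} - z_{i,t}}{\sigma^2}, & y_{i,t} > 0, \\[2mm] \dfrac{\phi(z_{i,t}/\sigma)}{\sigma\, Q(z_{i,t}/\sigma)}, & y_{i,t} = 0, \end{cases}$$
with $z_{i,t} = (U_t^-)^\top V x_i$, $\phi$ the standard Gaussian density, and $Q$ the standard Gaussian tail.
Problem (P:U) updates the subspace based on the new sketch. We solve (P:U) using stochastic gradient descent (SGD). We note that the problem is decoupled for different subspace dimensions $t = 1, \dots, T$ (i.e., rows of $U$). With careful parallel design, this procedure can be done very efficiently. Given a training data point $(x_i, y_i)$, the problem related to the $t$-th subspace basis is:
$$\min_{U_t} \; g_t(x_i, y_i; U_t, V^+) = -\log \Pr(y_{i,t}, x_i \mid U_t, V^+) + \frac{\lambda}{2}\|U_t\|_2^2.$$
We can revise the subspace by the following gradient update: $U_t^+ = U_t^- - \mu \nabla_{U_t} g_t(x_i, y_i; U_t^-, V^+)$, where the gradient is given by:
$$\nabla_{U_t} g_t(x_i, y_i; U_t, V^+) = \lambda U_t + c_{i,t}\, V^+ x_i,$$
where $c_{i,t}$ is defined as above with $z_{i,t} = U_t^\top V^+ x_i$. We summarize the procedure in Algorithm 1 and show in Section 2.2 that under mild assumptions this procedure will be able to capture the underlying subspace structure in the parameter space with just one pass of the data.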
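The alternating updates described above can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' Algorithm 1: the step sizes `eta` and `mu`, the regularization weight `lam`, the noise level `sigma`, and all function names are assumptions, and the gradients are derived from a Gaussian censored-at-zero likelihood.

```python
import math

def phi(z):
    """Standard Gaussian probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def gauss_tail(z):
    """Standard Gaussian tail Q(z) = P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def responses(U, V, x):
    """Return the sketch s = V x and the linear responses z = U s."""
    s = [sum(v * xd for v, xd in zip(row, x)) for row in V]
    z = [sum(u * sr for u, sr in zip(row, s)) for row in U]
    return s, z

def loss_slopes(z, y, sigma):
    """Derivative of the negative log-likelihood w.r.t. each response z_t."""
    slopes = []
    for z_t, y_t in zip(z, y):
        if y_t > 0:  # observed outcome
            slopes.append(-(y_t - z_t) / sigma ** 2)
        else:        # outcome censored at 0
            tail = max(gauss_tail(z_t / sigma), 1e-300)  # avoid division by zero
            slopes.append(phi(z_t / sigma) / (sigma * tail))
    return slopes

def sample_cost(U, V, x, y, sigma=1.0, lam=0.01):
    """Instantaneous cost: negative log-likelihood plus Frobenius penalties."""
    _, z = responses(U, V, x)
    nll = 0.0
    for z_t, y_t in zip(z, y):
        if y_t > 0:
            nll -= math.log(phi((y_t - z_t) / sigma) / sigma)
        else:
            nll -= math.log(max(gauss_tail(z_t / sigma), 1e-300))
    fro = sum(v * v for M in (U, V) for row in M for v in row)
    return nll + 0.5 * lam * fro

def online_step(U, V, x, y, sigma=1.0, lam=0.01, eta=1e-3, mu=1e-3):
    """One sample: update the sketch V first, then each row of U."""
    # (P:V): gradient step on V with the subspace U held fixed.
    _, z = responses(U, V, x)
    c = loss_slopes(z, y, sigma)
    back = [sum(c_t * row[r] for c_t, row in zip(c, U)) for r in range(len(V))]
    V = [[v - eta * (lam * v + back[r] * xd) for v, xd in zip(V[r], x)]
         for r in range(len(V))]
    # (P:U): the problem decouples over tasks, so each row gets its own step.
    s, z = responses(U, V, x)
    c = loss_slopes(z, y, sigma)
    U = [[u - mu * (lam * u + c_t * sr) for u, sr in zip(row, s)]
         for row, c_t in zip(U, c)]
    return U, V

def one_pass(U, V, X, Y, **kwargs):
    """Stream through the data once, applying one online step per sample."""
    for x, y in zip(X, Y):
        U, V = online_step(U, V, x, y, **kwargs)
    return U, V
```

Because the row updates of the subspace are independent of one another, the inner list comprehension over rows is the part that a parallel implementation would distribute.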
We establish both asymptotic and non-asymptotic convergence properties for Algorithm 1. The proof scheme is inspired by a series of previous works: (Mairal et al., 2010; Kasiviswanathan et al., 2012; Shalev-Shwartz et al., 2012; Mardani et al., 2013, 2015; Shen et al., 2016). We briefly present the proof sketch; more details can be found in the Appendix. At each iteration $i = 1, 2, \dots, N$, we sample $(x_i, y_i)$, and let $U^i, V^i$ denote the intermediate $U$ and $V$, to be differentiated from $U_t, V_t$, which are the $t$-th columns of $U^\top$ and $V$. For the proof feasibility, we assume that $\{(x_i, y_i)\}_{i=1}^{N}$ are sampled i.i.d., and the subspace sequence $\{U^i\}_{i=1}^{N}$ lies in a compact set.
Asymptotic Case: To estimate $U$, the Stochastic Gradient Descent (SGD) iterations can be seen as minimizing the approximate cost $\frac{1}{N}\sum_{i=1}^{N} g'(x_i, y_i, U, V)$, where $g'$ is a tight quadratic surrogate for $g$ based on the second-order Taylor approximation around $U^{N-1}$. Furthermore, $g$ can be shown to be smooth, by bounding its first-order and second-order gradients w.r.t. each $U_t$ (similar to Appendix 1 of (Shen et al., 2016)).
Following (Mairal et al., 2010; Mardani et al., 2015), it can then be established that, as $N \to \infty$, the subspace sequence $\{U^i\}_{i=1}^{N}$ asymptotically converges to a stationary point of the batch estimator, under a few mild conditions. We can sequentially show: 1) $\frac{1}{N}\sum_{i=1}^{N} g'(x_i, y_i, U^i, V)$ asymptotically converges to $\frac{1}{N}\sum_{i=1}^{N} g(x_i, y_i, U^i, V)$, according to the quasi-martingale property in the almost sure sense, owing to the tightness of $g'$; 2) the first point implies convergence of the associated gradient sequence, due to the regularity of $g$; 3) $g$ is bi-convex for the block variables $U_t$ and $V$.
Non-Asymptotic Case: When $N$ is finite, (Mardani et al., 2013) asserts that the distance between successive subspace estimates will vanish as fast as $O(1/i)$: $\|U^i - U^{i-1}\|_F \le B/i$, for some constant $B$ that is independent of $i$ and $N$. Following (Shen et al., 2016) to leverage the unsupervised formulation of regret analysis as in (Kasiviswanathan et al., 2012; Shalev-Shwartz et al., 2012), we can similarly obtain a tight regret bound that will again vanish if $N \to \infty$.
The single layer model in (1) has limited capability to capture the highly nonlinear regression relationships, as the parameters are linearly linked to the subspace except for a ReLU operation. However, the single-layer procedure in Algorithm 1 has provided a building block, based on which we can develop an efficient algorithm to train a deep subspace network (SN) in a greedy fashion. We thus propose a network expansion procedure to overcome such limitation.
After we obtain the parameter subspace $U$ and sketch $V$ for the single-layer case (1), we project the data points by $\bar{x} = \mathrm{ReLU}(UVx)$. A straightforward idea of the expansion is to use $(\bar{x}, y)$ as the new samples to train another layer. Let $f_{[k-1]}$ denote the network structure we obtained before the $k$-th expansion starts, $k = 1, 2, \dots, K-1$; the expansion can recursively stack more ReLU layers:
$$f_{[k]}(x) = \mathrm{ReLU}\big(U_k V_k\, f_{[k-1]}(x) + \epsilon\big). \tag{3}$$
However, we observe that simply stacking layers by repeating (3) many times can cause substantial information loss and degrade the generalization performance, especially since our training is layer-by-layer without “looking back” (i.e., top-down joint tuning). Inspired by deep residual networks (He et al., 2016) that exploit “skip connections” to pass lower-level data and features to higher levels, we concatenate the original samples with the newly transformed, censored outputs after each expansion, i.e., reformulating $\bar{x} = [\mathrm{ReLU}(UVx); x]$ (similar manners could be found in (Zhou and Feng, 2017)). The new formulation after the expansion is given below:
$$f_{[k]}(x) = \mathrm{ReLU}\big(U_k V_k\, [f_{[k-1]}(x); x] + \epsilon\big). \tag{4}$$
Compared to the single-layer model (1), SN gradually refines the parameter subspaces by multiple stacked nonlinear projections. It is expected to achieve superior performance due to the higher learning capacity, and the proposed SN can also be viewed as a gradient boosting method. Meanwhile, the layer-wise low-rank subspace structural prior would further improve generalization compared to naive multi-layer networks.
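The expansion with skip connections can be illustrated with a short forward pass. The sketch below assumes each layer is stored as a pair of matrices (a subspace and a sketch), that the first layer sees the original sample, and that every later layer sees the previous censored output concatenated with the original sample; the function names and the list-of-pairs layout are hypothetical, and layer-wise training is omitted.

```python
def relu_vec(values):
    """Element-wise ReLU, i.e., censoring at zero."""
    return [max(v, 0.0) for v in values]

def layer_output(U, V, inputs):
    """Censored output ReLU(U V inputs) of a single layer."""
    s = [sum(v * a for v, a in zip(row, inputs)) for row in V]
    return relu_vec([sum(u * sr for u, sr in zip(row, s)) for row in U])

def forward(layers, x):
    """Run a stacked subspace network on one sample x.

    layers is a list of (U, V) pairs. The first V has D columns; every
    later V has T + D columns because its input is [previous output; x].
    """
    inputs = list(x)
    output = []
    for U, V in layers:
        output = layer_output(U, V, inputs)
        inputs = output + list(x)  # skip connection: keep the original sample
    return output
```

Keeping the original sample in every layer's input is what lets a greedily trained layer correct the previous prediction instead of working only from an already censored signal.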
The subspace network code and scripts for generating the results in this section are available at https://github.com/illidanlab/subspace-net.
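Subspace recovery in a single layer model. We first evaluate the subspace recovered by the proposed Algorithm 1 using synthetic data. We generated $X \in \mathbb{R}^{N \times D}$, $U \in \mathbb{R}^{T \times R}$ and $V \in \mathbb{R}^{R \times D}$, all as i.i.d. random Gaussian matrices. The target matrix $Y \in \mathbb{R}^{N \times T}$ was then synthesized using (1). We set $N = 5000$, $D = 200$, $T = 100$, , and random noise as .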
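A synthetic data set of this kind can be generated with a few lines of standard-library Python. In the sketch below the function name `synthesize`, the argument names, and in particular the rank and noise level passed in by the caller are assumptions chosen for illustration; they are not the values used in the experiment.

```python
import random

def gaussian_matrix(rows, cols, rng):
    """Matrix with i.i.d. standard Gaussian entries, stored as a list of rows."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def synthesize(N, D, T, R, noise_std, seed=0):
    """Draw X, U, V and censored targets Y = ReLU(X V^T U^T + noise)."""
    rng = random.Random(seed)
    X = gaussian_matrix(N, D, rng)
    U = gaussian_matrix(T, R, rng)
    V = gaussian_matrix(R, D, rng)
    Y = []
    for x in X:
        s = [sum(V[r][d] * x[d] for d in range(D)) for r in range(R)]
        row = []
        for t in range(T):
            latent = sum(U[t][r] * s[r] for r in range(R)) + rng.gauss(0.0, noise_std)
            row.append(max(latent, 0.0))  # lower censoring at 0
        Y.append(row)
    return X, U, V, Y
```

For example, `synthesize(N=50, D=5, T=4, R=2, noise_std=1.0)` returns a small data set in which roughly half of the targets are censored to zero.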
Figure (a) shows the subspace difference between the ground truth and the learned subspace throughout the iterations. This result verifies that Algorithm 1 is able to correctly find and smoothly converge to the underlying low-rank subspace of the synthetic data. The objective values throughout the online training process of Algorithm 1 are plotted in Figure (b). We further show the iteration-wise subspace differences, i.e., the distance between successive subspace estimates, in Figure (c), which complies with the $O(1/i)$ result in our non-asymptotic analysis. Moreover, the distribution of correlation between recovered weights and true weights for all tasks is given in Figure 9, with most predicted weights having correlations with the ground truth above 0.9.
|Metric||Subspace Difference||Maximum Mutual Coherence||Mean Mutual Coherence|
Subspace recovery in a multi-layer subspace network.
We re-generated synthetic data by repeatedly applying (1) three times, each time following the same setting as the single-layer model. A three-layer SN was then learned using Algorithm 3. As one simple baseline, a multi-layer perceptron (MLP) is trained, whose three hidden layers have the same dimensions as the three ReLU layers of the SN. Inspired by (Xue et al., 2013; Sainath et al., 2013; Wang et al., 2015), we then applied low-rank matrix factorization to each layer of the MLP, with the same desired rank $R$, creating the factorized MLP (f-MLP) baseline that has the identical architecture (including both ReLU hidden layers and linear bottleneck layers) to SN. We further re-trained the f-MLP on the same data from end to end, leading to another baseline, named retrained factorized MLP (rf-MLP).
Table 1 evaluates the subspace recovery fidelity in three layers, using three different metrics: (1) the maximum mutual coherence of all column pairs from two matrices, defined in (Candes and Romberg, 2007) as a classical measurement of how correlated the two matrices’ column subspaces are; (2) the mean mutual coherence of all column pairs from two matrices; (3) the subspace difference, defined the same as in the single-layer case. Note that the two mutual coherence-based metrics are immune to linear transformations of subspace coordinates, to which the norm-based subspace difference might become fragile. SN achieves clear overall advantages under all three measurements, over f-MLP and rf-MLP. More notably, while the performance margin of SN in subspace difference seems to be small, the much sharper margins in the two (more robust) mutual coherence-based measurements suggest that the subspaces recovered by SN are significantly better aligned with the ground truth.
Footnote 2: For the two mutual coherence-based metrics, higher values indicate better subspace recovery. This differs from the subspace difference, where smaller is better.
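The two coherence-based metrics can be computed as below. This sketch assumes the coherence of a column pair is the absolute inner product of the two columns after normalizing each to unit length; the exact normalization used for Table 1 is not recoverable from the text, so treat that choice and the function names as assumptions.

```python
import math

def columns(M):
    """Return the columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def pair_coherences(A, B):
    """Absolute normalized inner products for all column pairs of A and B."""
    values = []
    for a in columns(A):
        for b in columns(B):
            dot = sum(p * q for p, q in zip(a, b))
            norm_a = math.sqrt(sum(p * p for p in a))
            norm_b = math.sqrt(sum(q * q for q in b))
            values.append(abs(dot) / (norm_a * norm_b))
    return values

def max_mutual_coherence(A, B):
    """Largest coherence over all column pairs."""
    return max(pair_coherences(A, B))

def mean_mutual_coherence(A, B):
    """Average coherence over all column pairs."""
    values = pair_coherences(A, B)
    return sum(values) / len(values)
```

Both functions require `A` and `B` to have the same number of rows and no all-zero columns.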
|Percent||Single Task (Shallow): Uncensored (LS + $\ell_1$)||Single Task (Shallow): Censored (LS + $\ell_1$)||Single Task (Shallow): Nonlinear Censored (Tobit)||Multi Task (Shallow): Uncensored (Multi Trace)||Multi Task (Shallow): Censored (Multi Trace)|
|40%||0.1412 (0.0007)||0.1127 (0.0010)||0.0428 (0.0003)||0.1333 (0.0009)||0.1053 (0.0027)|
|50%||0.1384 (0.0005)||0.1102 (0.0010)||0.0408 (0.0004)||0.1323 (0.0010)||0.1054 (0.0042)|
|60%||0.1365 (0.0005)||0.1088 (0.0009)||0.0395 (0.0003)||0.1325 (0.0012)||0.1031 (0.0046)|
|70%||0.1349 (0.0005)||0.1078 (0.0010)||0.0388 (0.0004)||0.1315 (0.0013)||0.1024 (0.0042)|
|80%||0.1343 (0.0011)||0.1070 (0.0012)||0.0383 (0.0006)||0.1308 (0.0008)||0.1040 (0.0011)|
|Percent||DNN i (naive)||DNN ii (censored)||DNN iii (censored + low-rank)||SN Layer 1||SN Layer 3|
|40%||0.0623 (0.0041)||0.0489 (0.0035)||0.0431 (0.0041)||0.0390 (0.0004)||0.0369 (0.0002)|
|50%||0.0593 (0.0048)||0.0462 (0.0042)||0.0400 (0.0039)||0.0389 (0.0007)||0.0366 (0.0003)|
|60%||0.0587 (0.0053)||0.0455 (0.0054)||0.0395 (0.0050)||0.0388 (0.0006)||0.0364 (0.0003)|
|70%||0.0590 (0.0071)||0.0447 (0.0043)||0.0386 (0.0058)||0.0388 (0.0006)||0.0363 (0.0003)|
|80%||0.0555 (0.0057)||0.0431 (0.0053)||0.0380 (0.0057)||0.0390 (0.0008)||0.0364 (0.0005)|
|Perc.||Layer 1||Layer 2||Layer 3||Layer 10||Layer 20|
|40%||0.0390 (0.0004)||0.0381 (0.0005)||0.0369 (0.0002)||0.0368 (0.0002)||0.0368 (0.0002)|
|50%||0.0389 (0.0007)||0.0379 (0.0005)||0.0366 (0.0003)||0.0366 (0.0003)||0.0365 (0.0003)|
|60%||0.0388 (0.0006)||0.0378 (0.0004)||0.0364 (0.0003)||0.0364 (0.0003)||0.0363 (0.0003)|
|70%||0.0388 (0.0006)||0.0378 (0.0005)||0.0363 (0.0003)||0.0363 (0.0003)||0.0362 (0.0003)|
|80%||0.0390 (0.0008)||0.0378 (0.0006)||0.0364 (0.0005)||0.0363 (0.0005)||0.0363 (0.0005)|
Benefits of Going Deep. We re-generated synthetic data in the same way as in the first single-layer experiment; this time, however, we aim to show that a deep SN boosts performance over single-layer subspace recovery, even when the data generation does not follow a known multi-layer model. We compare SN (both 1-layer and 3-layer) with two carefully chosen sets of state-of-the-art approaches: (1) single- and multi-task “shallow” models; (2) deep models. For the first set, least squares (LS) is treated as a naive baseline, while ridge (LS + $\ell_2$) and lasso (LS + $\ell_1$) regressions are considered for shrinkage or variable selection; censored regression, also known as the Tobit model, is a non-linear method to predict bounded targets, e.g., (Berberidis et al., 2016). Multi-task models with regularizations on the trace norm (Multi Trace) and the $\ell_{2,1}$ norm (Multi $\ell_{2,1}$) have been demonstrated to be successful on simultaneous structured/sparse learning, e.g., (Yang et al., 2010; Zhang et al., 2013). We also verify the benefits of accounting for boundedness of targets (Uncensored vs. Censored) in both single-task and multi-task settings, with the best performance reported for each scenario (LS + $\ell_1$ for single-task and Multi Trace for multi-task). For the set of deep model baselines, we construct three DNNs for fair comparison: i) a 3-layer fully connected DNN with the same architecture as SN, with a plain MSE loss; ii) a 3-layer fully connected DNN as in i), with a ReLU added to the output layer before feeding into the MSE loss, which naturally implements non-negativity censored training and evaluation; iii) a factorized and re-trained DNN from ii), following the same procedure as rf-MLP in the multi-layer synthetic experiment. By construction, ii) and iii) verify whether the DNN also benefits from the censored target and the low-rank assumption, respectively.
Footnote 3: Least squares, ridge, lasso, and censored regression are implemented with the Matlab optimization toolbox. MTLs are implemented through MALSAR (Zhou et al., 2011a) with parameters carefully tuned.
We performed 10-fold random-sampling validation on the same dataset, i.e., randomly splitting it into training and validation data 10 times. For each split, we fitted the model on the training data and evaluated the performance on the validation data. The average normalized mean square error (ANMSE) across all tasks was obtained as the overall performance for each split. For methods without hyperparameters (least squares and censored regression), the average ANMSE over the 10 splits was regarded as the final performance; for methods with tunable parameters, e.g., $\lambda$ in lasso, we performed a grid search on $\lambda$ values and chose the optimal ANMSE result. We considered different splitting sizes with training samples containing [40%, 50%, 60%, 70%, 80%] of all the samples.
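The evaluation protocol can be sketched as follows. The text does not spell out how the mean square error is normalized, so this sketch assumes each task's error is divided by the variance of that task's validation targets; that normalization, the function names, and the seeding scheme are all assumptions.

```python
import random

def nmse(y_true, y_pred):
    """Mean square error of one task divided by the variance of its targets."""
    n = len(y_true)
    mean = sum(y_true) / n
    mse = sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n
    var = sum((a - mean) ** 2 for a in y_true) / n
    return mse / var

def anmse(Y_true, Y_pred):
    """Average the per-task normalized error over all tasks.

    Y_true and Y_pred are lists of samples, each a list of T task values.
    """
    per_task = [nmse(t, p) for t, p in zip(zip(*Y_true), zip(*Y_pred))]
    return sum(per_task) / len(per_task)

def random_splits(n_samples, train_fraction, n_splits=10, seed=0):
    """Yield (train_indices, validation_indices) for repeated random splits."""
    rng = random.Random(seed)
    cut = int(round(train_fraction * n_samples))
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[:cut], indices[cut:]
```

Under this normalization a model that always predicts the per-task mean of the validation targets scores exactly 1, which gives the reported values a natural reference point.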
Table 2 further compares the performance of all approaches. The standard deviation over 10 trials is given in parentheses (same for all following tables). We can observe that: (1) all censored models significantly outperform their uncensored counterparts, verifying the necessity of censoring the regression targets. Therefore, we will use censored baselines hereinafter, unless otherwise specified; (2) the more structured MTL models tend to outperform single-task models by capturing task relatedness. That is also evidenced by the performance margin of DNN iii over DNN i; (3) the nonlinear models are clearly more favorable: the single-task Tobit model even outperforms the MTL models; (4) as a nonlinear, censored MTL model, SN combines the best of them all, accounting for its superior performance over all competitors. In particular, even a 1-layer SN already produces comparable performance to the 3-layer DNN iii (which is also a nonlinear, censored MTL model trained with back-propagation, with three times the number of parameters of SN), thanks to SN’s theoretically solid online algorithm for sketching subspaces.
Furthermore, increasing the number of layers in SN from 2 to 20 demonstrated that SN can also benefit from growing depth without an end-to-end scheme. As Table 3 reveals, SN steadily improves with more layers until reaching a plateau (as the underlying data distribution is relatively simple here): the gains beyond Layer 3 are within the reported standard deviations. The observation is consistent among all splits.
Computation speed. All experiments run on the same machine (1 x Six-core Intel Xeon E5-1650 v3 [3.50GHz], 12 logic cores, 128 GB RAM). GPU accelerations are enabled for DNN baselines, while SN has not exploited the same accelerations yet. The running time for a single round training on synthetic data (N=5000, D=200, T=100) is given in Table 4. Training each layer of SN will cost 109 seconds on average. As we can see, SN improves generalization performance without a significant computation time burden. Furthermore, we can accelerate SN further, by reading data in batch mode and performing parallel updates.
|Method||Training time (seconds)||Platform|
|SN (per layer)||109||Python|
|Percent||Layer 1||Layer 2||Layer 3||Layer 5||Layer 10|
|40%||0.2016 (0.0057)||0.2000 (0.0039)||0.1981 (0.0031)||0.1977 (0.0031)||0.1977 (0.0031)|
|50%||0.1992 (0.0040)||0.1992 (0.0053)||0.1971 (0.0038)||0.1968 (0.0036)||0.1967 (0.0035)|
|60%||0.1990 (0.0061)||0.1990 (0.0047)||0.1967 (0.0038)||0.1964 (0.0039)||0.1964 (0.0038)|
|70%||0.1981 (0.0046)||0.1966 (0.0052)||0.1953 (0.0039)||0.1952 (0.0039)||0.1951 (0.0038)|
|80%||0.1970 (0.0034)||0.1967 (0.0044)||0.1956 (0.0040)||0.1955 (0.0039)||0.1953 (0.0039)|
|Percent||Single Task (Censored): Least Square||Single Task (Censored): LS + $\ell_1$||Single Task (Censored): Tobit (Nonlinear)||Multi Task (Censored): Multi Trace||Multi Task (Censored): Multi $\ell_{2,1}$|
|40%||0.3874 (0.0203)||0.2393 (0.0056)||0.3870 (0.0306)||0.2572 (0.0156)||0.2006 (0.0099)|
|50%||0.3119 (0.0124)||0.2202 (0.0049)||0.3072 (0.0144)||0.2406 (0.0175)||0.2002 (0.0132)|
|60%||0.2779 (0.0123)||0.2112 (0.0055)||0.2719 (0.0114)||0.2596 (0.0233)||0.2072 (0.0204)|
|70%||0.2563 (0.0108)||0.2037 (0.0042)||0.2516 (0.0108)||0.2368 (0.0362)||0.2017 (0.0116)|
|80%||0.2422 (0.0112)||0.2005 (0.0054)||0.2384 (0.0099)||0.2176 (0.0171)||0.2009 (0.0050)|
|Percent||DNN i (naive)||DNN ii (censored)||DNN iii (censored + low-rank)||SN Layer 1||SN Layer 3|
|40%||0.2549 (0.0442)||0.2388 (0.0121)||0.2113 (0.0063)||0.2016 (0.0057)||0.1981 (0.0031)|
|50%||0.2236 (0.0066)||0.2208 (0.0062)||0.2127 (0.0118)||0.1992 (0.0040)||0.1971 (0.0038)|
|60%||0.2215 (0.0076)||0.2200 (0.0076)||0.2087 (0.0102)||0.1990 (0.0061)||0.1967 (0.0038)|
|70%||0.2149 (0.0077)||0.2141 (0.0079)||0.2093 (0.0137)||0.1981 (0.0046)||0.1953 (0.0039)|
|80%||0.2132 (0.0138)||0.2090 (0.0079)||0.2069 (0.0135)||0.1970 (0.0034)||0.1956 (0.0040)|
|Method||Percent - Rank|
|SN||40%||0.2052 (0.0030)||0.1993 (0.0036)||0.1981 (0.0031)||0.2010 (0.0044)|
|SN||50%||0.2047 (0.0029)||0.1983 (0.0034)||0.1971 (0.0038)||0.2001 (0.0046)|
|SN||60%||0.2052 (0.0033)||0.1988 (0.0047)||0.1967 (0.0038)||0.1996 (0.0052)|
|SN||70%||0.2043 (0.0044)||0.1975 (0.0042)||0.1953 (0.0039)||0.1990 (0.0057)|
|SN||80%||0.2058 (0.0051)||0.1977 (0.0042)||0.1956 (0.0040)||0.1990 (0.0058)|
|DNN iii (censored + low-rank)||40%||0.2322 (0.0146)||0.2360 (0.0060)||0.2113 (0.0063)||0.2196 (0.0124)|
|DNN iii (censored + low-rank)||50%||0.2298 (0.0093)||0.2256 (0.0127)||0.2127 (0.0118)||0.2235 (0.0142)|
|DNN iii (censored + low-rank)||60%||0.2244 (0.0132)||0.2277 (0.0099)||0.2087 (0.0102)||0.2145 (0.0208)|
|DNN iii (censored + low-rank)||70%||0.2178 (0.0129)||0.2177 (0.0115)||0.2093 (0.0137)||0.2083 (0.0127)|
|DNN iii (censored + low-rank)||80%||0.2256 (0.0117)||0.2250 (0.0079)||0.2069 (0.0135)||0.2158 (0.0183)|
|Percent||Non-calibrated||Calibrated|
|40%||0.1993 (0.0034)||0.1977 (0.0031)|
|50%||0.1987 (0.0043)||0.1967 (0.0036)|
|60%||0.1991 (0.0044)||0.1964 (0.0039)|
|70%||0.1982 (0.0042)||0.1951 (0.0038)|
|80%||0.1984 (0.0041)||0.1954 (0.0039)|
We evaluated SN in a real clinical setting to build models for the prediction of important clinical scores representing a subject’s cognitive status and signaling the progression of Alzheimer’s disease (AD), from structural Magnetic Resonance Imaging (sMRI) data. AD is one major neurodegenerative disease that accounts for 60 to 80 percent of dementia. The National Institutes of Health has thus focused on studies investigating brain and fluid biomarkers of the disease, and has supported the long-running Alzheimer’s Disease Neuroimaging Initiative (ADNI) project since 2003. We used the ADNI-1 cohort (http://adni.loni.usc.edu/). In the experiments, we used the 1.5 Tesla structural MRI collected at baseline, and performed cortical reconstruction and volumetric segmentation with FreeSurfer following the protocol in (Jack et al., 2008). For each MRI image, we extracted 138 features representing the cortical thickness and surface areas of regions of interest (ROIs) using the Desikan-Killiany cortical atlas (Desikan et al., 2006). After preprocessing, we obtained a dataset containing 670 samples and 138 features. These imaging features were used to predict a set of 30 clinical scores including ADAS scores (Rosen et al., 1984) at baseline and future (6 months from baseline), baseline Logical Memory from Wechsler Memory Scale IV (Scale—Fourth, 2009), Neurobattery scores (i.e., immediate recall total score and Rey Auditory Verbal Learning Test scores), and the Neuropsychiatric Inventory (Cummings, 1997) at baseline and future.
In MTL formulations we typically assume that the noise variance $\sigma^2$ is the same across all tasks, which may not be true in many cases. To deal with heterogeneous $\sigma^2$ among tasks, we design a calibration step in our optimization process, where we estimate a task-specific $\sigma$ before the ReLU, use the result as the input for the next layer, and repeat this layer by layer. We compare the performance of both non-calibrated and calibrated methods.
Performance. We adopted the two sets of baselines used in the last synthetic experiment for the real-world data. Unlike the synthetic data, where the low-rank structure was predefined, real data has no ground-truth rank available, and we have to try different rank assumptions. Table 8 compares the performance of non-calibrated and calibrated models. We observe a clear improvement by assuming different $\sigma$ across tasks. Table 6 shows the results for all comparison methods, with SN outperforming all others. Table 5 shows how the SN performance grows with an increasing number of layers. Table 7 further reveals the performance of DNNs and SN using varying rank estimations on real data. As expected, the U-shaped curve suggests that an overly low rank may not be informative enough to recover the original weight space, while a high-rank structure cannot enforce as strong a structural prior. However, the overall robustness of SN to rank assumptions is fairly remarkable: its performance under all ranks is competitive, consistently outperforming DNNs under the same rank assumptions and other baselines.
Qualitative Assessment. From the multi-task learning perspective, the subspaces serve as the shared component for transferring predictive knowledge among the censored learning tasks. The subspaces thus capture important predictive information in predicting cognitive changes. We normalized the magnitude of the subspace into a fixed range and visualized the subspace in brain mappings. The five lowest-level subspaces are the five most important, and are illustrated in Figure 10.
We find that each subspace captures very different information. In the first subspace, the volume of the right banks of the superior temporal sulcus, which is found to be involved in prodromal AD (Killiany et al., 2000), the rostral middle frontal gyrus, with the highest Aβ loads in AD pathology (Nicoll et al., 2003), and the volume of the inferior parietal lobule, which was found to have increased S-glutathionylated proteins in a proteomics study (Newman et al., 2007), have significant magnitude. We also find evidence of strong association between AD pathology and brain regions of large magnitude in other subspaces. The subspaces in the remaining levels and a detailed clinical analysis will be available in a journal extension of this paper.
In this paper, we proposed Subspace Network (SN), an efficient deep modeling approach for non-linear multi-task censored regression, where each layer of the subspace network performs a multi-task censored regression to improve upon the predictions from the last layer via sketching a low-dimensional subspace to perform knowledge transfer among learning tasks. We show that under mild assumptions, for each layer we can recover the parametric subspace using only one pass of training data. We demonstrate empirically that the subspace network can quickly capture correct parameter subspaces, and outperforms state-of-the-art methods in predicting neurodegenerative clinical scores from brain imaging. Based on similar formulations, the proposed method can be easily extended to cases where the targets have nonzero bounds, or both lower and upper bounds.
We hereby give more details for the proofs of both asymptotic and non-asymptotic convergence properties for Algorithm 1 to recover the latent subspace . The proofs heavily rely on a series of previous results in (Mairal et al., 2010; Kasiviswanathan et al., 2012; Shalev-Shwartz et al., 2012; Mardani et al., 2013, 2015; Shen et al., 2016), and many key results are directly referred to hereinafter for conciseness. We include the proofs for the manuscript to be self-contained.
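At iteration $i = 1, 2, \dots, N$, we sample $(x_i, y_i)$, and let $U^i, V^i$ denote the intermediate $U$ and $V$, to be differentiated from $U_t, V_t$, which are the $t$-th columns of $U^\top$ and $V$. For the proof feasibility, we assume that $\{(x_i, y_i)\}_{i=1}^{N}$ are sampled i.i.d., and the subspace sequence $\{U^i\}_{i=1}^{N}$ lies in a compact set.
For infinite data streams with $N \to \infty$, we recall the instantaneous cost of the $i$-th datum:
$$g(x_i, y_i, U, V) = -\log \Pr(y_i, x_i \mid U, V) + \frac{\lambda}{2}\|U\|_F^2 + \frac{\lambda}{2}\|V\|_F^2,$$
and the online optimization form recast as an empirical cost minimization:
$$\min_{U} \; \frac{1}{N}\sum_{i=1}^{N} g(x_i, y_i, U, V).$$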
The Stochastic Gradient Descent (SGD) iterations can be seen as minimizing the approximate cost:
where is a tight quadratic surrogate for based on the second-order Taylor approximation around :
with . is further recognized as a locally tight upper-bound surrogate for , with locally tight gradients. Following the Appendix 1 of (Shen et al., 2016), we can show that is smooth, with its first-order and second-order gradients bounded w.r.t. each .
With the above results, the convergence of subspace iterates can be proven in the same regime developed in (Mardani et al., 2015), whose main inspirations came from (Mairal et al., 2010) that established convergence of an online dictionary learning algorithm using the martingale sequence theory. In a nutshell, the proof proceeds by first showing that $\frac{1}{N}\sum_{i=1}^{N} g'(x_i, y_i, U^i, V)$ converges to $\frac{1}{N}\sum_{i=1}^{N} g(x_i, y_i, U^i, V)$ asymptotically, according to the quasi-martingale property in the almost sure sense, owing to the tightness of $g'$. It then implies convergence of the associated gradient sequence, due to the regularity of $g$.
Meanwhile, we notice that $g$ is bi-convex for the block variables $U_t$ and $V$ (see Lemma 2 of (Shen et al., 2016)). Therefore, due to the convexity of $g$ w.r.t. $V$ when $U$ is fixed, the parameter sketches can also be updated exactly per iteration.
All above combined, we can claim the asymptotic convergence for the iterations of Algorithm 1: as $N \to \infty$, the subspace sequence $\{U^i\}_{i=1}^{N}$ asymptotically converges to a stationary point of the batch estimator, under a few mild conditions.
For finite data streams, we rely on the unsupervised formulation of regret analysis (Kasiviswanathan et al., 2012; Shalev-Shwartz et al., 2012) to assess the performance of online iterates. Specifically, at iteration (), we use the previous to span the partial data at . Prompted by the alternating nature of iterations, we adopt a variant of the unsupervised regret to assess the goodness of online subspace estimates in representing the partially available data. With being the loss incurred by the estimate for predicting the -th datum, the cumulative online loss for a stream of size is given by:
Further, we will assess the cost of the last estimate using:
We define as the batch estimator cost. For the sequence , we define the online regret:
We investigate the convergence rate of the sequence to zero as grows. Due to the nonconvexity of the online subspace iterates, it is challenging to directly analyze how fast the online cumulative loss approaches the optimal batch cost . As (Shen et al., 2016) advocates, we instead investigate whether converges to . That is established by first referring to the Lemma 2 of (Mardani et al., 2013): the distance between successive subspace estimates will vanish as fast as : , for some constant that is independent of and . Following the proof of Proposition 2 in (Shen et al., 2016), we can similarly show that: if and are uniformly bounded, i.e., , and , , then with constants and by choosing a constant step size , we have a bounded regret as:
This concludes the proof.
Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech. 2365–2369.
Online learning for multi-task feature selection. In CIKM. ACM, 1693–1696.