1 Introduction
Analyzing and interpreting neural dynamics involves finding patterns in large datasets, both in terms of the dimensionality of the measurements as well as the number of observations [44]. Neuroscientists often resort to dimensionality reduction to make sense of their data. This is commonly done through factor models [46, 16, 47], in which a small number of latent variables explain the variance of the high dimensional data. Modeling the data this way has several benefits. First, lower dimensional representations can increase statistical power by ameliorating issues stemming from multiple hypothesis testing [42]. Second, carefully designed factors in neuroscience can have intuitive biological meaning as “networks” [33]. Third, the loadings from some factor models can be interpreted to generate new biological hypotheses for further experimentation [13]
. Commonly used factor analysis-type approaches include principal component analysis (PCA) [49], nonnegative matrix factorization (NMF) [28], latent Dirichlet allocation [6], independent component analysis (ICA) [19], and many other techniques. In our motivating applications, the data quantify electrophysiological signals, such as electroencephalograms (EEG) [48], local field potentials (LFPs) [43], or neuron firing (spiking)
[12]. While finding networks that explain brain activity is important, it is not sufficient. To answer relevant scientific questions, the networks must both explain observed patterns in the activity data and relate to a specific ‘outcome’, such as behavior [5], genotype [14], or experimental condition [45]. It is well known that the factors explaining most of the variability in a set of predictors are commonly not the most predictive of an outcome [20]. This is certainly true in neuroscience; for example, the leading factors underlying activity may relate to motion but not to a specific neuropsychiatric disorder of interest. In order to target learning of factors more to prediction, joint models have been widely used. In a joint factor model, unobserved latent variables give rise to both the outcome and predictors, usually with the assumption of conditional independence given the factors. One example is supervised probabilistic PCA [56], which corresponds to probabilistic PCA [49] with distinct variances for the predictors and outcome. Such models are appealing in neuroscience because they provide an explicit relationship between electrophysiological data and outcomes. In addition, prediction is straightforward; for test individuals, we simply learn the posterior distribution of the latent factors given only the predictors, and then use the predictive distribution of the outcomes given the inferred latent factors. When the model is well specified, such an approach leads to Bayes optimal predictions.
However, misspecifying a joint model can severely impact the model’s utility for prediction. In particular, it has been documented that underestimating the number of latent factors can destroy predictive ability [15]. The intuitive cause is that when fitting the model jointly, the variance of the predictors often vastly outweighs the variance of the outcome. Hence, if factors are a scarce resource, they will be used to improve characterization of the predictor distribution at the expense of predictive accuracy.
In practice, all models of neural dynamics are merely approximations to the true brain mechanics. Neural dynamics are notoriously complex, which is not surprising given the large amount of work needed for the body to simultaneously maintain homeostasis, move, analyze sensory inputs, make decisions, and so on. It is reasonable to assume that any factor model substantially underestimates the number of true factors underlying neural activity. Thus, any model seeking to predict behavior using neural dynamics must be robust to this type of misspecification.
We provide a new inference method for factor models based on supervised autoencoders [41]
, a technique used with great success in image recognition, especially in the context of semi-supervised learning
[39, 37, 27]. An autoencoder is a method for dimensionality reduction that maps inputs to a low dimensional space using a neural network known as an “encoder.” These latent spaces are then used to reconstruct the original observations by a neural network called a “decoder” [Goodfellow2017]. Both the encoder and the decoder are learned simultaneously to minimize the reconstruction loss. Supervised autoencoders add another transformation from the latent factors to predict the outcome of interest, thus encouraging the latent factors to represent the outcome well [57, 50, 55]. While autoencoders are deterministic, variational autoencoders [26] use a stochastic mapping to approximate the conditional distribution of the scores given the predictors, and can approximate posterior distributions and yield statistical interpretations.

Our modified inference method, which we call encoded supervision, yields substantially better predictions of the outcome under model misspecification. This robustness to misspecification is demonstrated on synthetic data. We also show substantial gains in predictive performance on behavioral tasks from electrophysiology in applied neuroscience datasets. We provide some technical support for encoded supervision, showing the estimator is distinct from previous estimators and giving analytic solutions when both the predictors and outcome are normally distributed in a linear factor model.
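To make the supervised autoencoder idea above concrete, here is a minimal linear toy in numpy: an encoder maps predictors to factors, a decoder reconstructs the predictors, and a supervision head predicts the outcome, all trained jointly by gradient descent. This is an illustrative sketch (the shapes, names, and learning rate are our assumptions), not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 10, 2
X = rng.standard_normal((n, p))          # predictors
y = X @ rng.standard_normal(p)           # outcome linearly tied to X

A = 0.01 * rng.standard_normal((k, p))   # encoder weights
D = 0.01 * rng.standard_normal((p, k))   # decoder weights
w = np.zeros(k)                          # supervision head
lam, lr = 1.0, 1e-3

def loss(A, D, w):
    S = X @ A.T                          # encode all samples
    return (0.5 * np.sum((S @ D.T - X) ** 2)
            + lam * 0.5 * np.sum((S @ w - y) ** 2)) / n

loss_init = loss(A, D, w)
for _ in range(2000):                    # joint gradient descent
    S = X @ A.T
    rx = S @ D.T - X                     # reconstruction residual
    ry = S @ w - y                       # prediction residual
    gS = rx @ D + lam * np.outer(ry, w)  # dL/dS
    A -= lr * (gS.T @ X) / n
    D -= lr * (rx.T @ S) / n
    w -= lr * lam * (S.T @ ry) / n
print(loss(A, D, w) < loss_init)         # → True
```

A nonlinear or variational version replaces the matrices with neural networks, but the joint reconstruction-plus-prediction loss is the same idea.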
In Section 2 we introduce the data motivating this study, as well as highlighting the deficiencies of existing methods in analyzing these data. In Section 3 we introduce the proposed encoded supervision approach. Section 4 provides simulation studies assessing the performance under misspecification. In Section 5 we demonstrate improvements for our motivating data. Finally, in Section 6 we provide a discussion of our method and further directions.
The code implementing the methods described below is publicly available at https://github.com/carlsonlab/encodedSupervision.
2 Data and Scientific Motivation
2.1 Electrophysiology: Tail Suspension Test
Our motivation comes from recent applications in neuropsychiatric research with the explicit goal of characterizing the brain dynamics responsible for mental illnesses at a network level [9, 13, 18]. In these examples, the data come from mouse models of neuropsychiatric disease performing behavioral tasks. During these tasks the electrical activity is measured in multiple brain regions through implanted electrodes [11]. The measured signals are known as Local Field Potentials (LFPs) and are believed to reflect the aggregate behavior of hundreds of thousands of neurons [8]. Compared to individual neurons, LFP signals have been found to be more consistent between individuals, motivating their use to find common patterns of neuropsychiatric disorders in a population. Our goal is to model this electrical activity (the predictors) with factor models and relate the factors to behavior (the outcome), which could potentially be used to develop new treatments [9].
We focus on using our methods to identify a whole-brain network associated with stress [9]. In this experiment, 26 mice were recorded from two genetic backgrounds (14 wild-type and 12 genetically modified CLOCKΔ19). The CLOCKΔ19 genetic modification is a mouse model of bipolar disorder [53]. Each mouse was recorded for 20 minutes across 3 behavioral contexts: 5 minutes in its home cage (non-stressful), 5 minutes in an open field (arousing, mildly stressful), and 10 minutes suspended by its tail (highly stressful). For our purposes, we want to determine a network that predicts stressful activity, so we consider all data from the home cage as the negative label (stress-free condition) and all data from the other two conditions as the positive label (stressed condition). A second prediction task is to determine brain differences between the genetic conditions (i.e., what underlying differences are there between the wild-type and bipolar mouse models?). Data were recorded from 11 biologically relevant brain regions with 32 electrodes (multiple electrodes per region) at 1000 Hz. Data from faulty electrodes were removed, and the time series were averaged per region to yield an 11-dimensional time series per mouse. A visualization of these data can be seen in Figure 1.
There is strong evidence to support the belief that the behaviorally relevant aspects of electrophysiology are frequency-based power within brain regions, as well as frequency-based coherence between brain regions [51, 17]. To characterize the dynamics in the LFPs we discretized the time series into 1-second windows. Windows containing saturated signals were removed (typically due to motion artifacts). This preprocessing procedure yielded a data matrix for each window with an associated binary outcome.
For our purposes any model must: (1) characterize the powers and coherences in terms of latent factors, (2) relate these factors to the observed behavior, and (3) yield interpretable loadings that can be used to construct testable hypotheses to establish causality in follow-up experiments. Following our recent work [13, 18], we will use a factor modeling framework and will explore how our proposed model fitting helps satisfy these conditions.
First, we previously proposed a latent function factor model called Cross-Spectral Factor Analysis (CSFA) [13]. This approach models the observed time series as a weighted superposition of Gaussian processes with a Cross-Spectral Mixture (CSM) kernel [52]. This kernel is constructed to capture power and coherence properties of the data, yielding interpretable networks that were previously called electrical connectome (electome) networks [18]. We denote this approach as CSFA-CSM in the rest of the manuscript. The mathematical framework of this model has been previously published, and we give critical mathematical details in Appendix D.
Instead of directly modeling the time series, a second approach is to first extract the relevant quantities from the recorded data prior to modeling. Specifically, we first calculated the spectral power from 1 Hz to 56 Hz in 1 Hz bands using Welch’s method [54], which is recommended in the neuroscience literature [22]. Prior literature has demonstrated that most of the important information is in the lower frequencies, so we only used information up to 56 Hz. Mean squared coherence was calculated in the same frequency bands for every pair of brain regions [40]. After these preprocessing steps, each window of data was converted to 3696 nonnegative power and coherence features. A visualization of these features can be found in Figure 2.
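The feature extraction just described can be sketched with scipy; the function below is illustrative (our names and a simplified estimator, not the released pipeline). With 11 regions, 56 one-hertz bands, and 55 region pairs it reproduces the 11×56 + 55×56 = 3696 feature count.

```python
import numpy as np
from itertools import combinations
from scipy.signal import welch, coherence

FS = 1000           # sampling rate (Hz)
FMIN, FMAX = 1, 56  # 1 Hz bands retained

def window_features(x):
    """x: (n_regions, n_samples) LFP array for one 1-second window."""
    nper = x.shape[1]                      # 1 Hz resolution for a 1 s window
    feats = []
    freqs, psd = welch(x, fs=FS, nperseg=nper)   # Welch power per region
    band = (freqs >= FMIN) & (freqs <= FMAX)
    feats.append(psd[:, band].ravel())
    for i, j in combinations(range(x.shape[0]), 2):
        # NOTE: on a single 1 s segment this coherence estimate is degenerate
        # (identically 1); real pipelines average over multiple sub-segments.
        f, cxy = coherence(x[i], x[j], fs=FS, nperseg=nper)
        feats.append(cxy[(f >= FMIN) & (f <= FMAX)])
    return np.concatenate(feats)           # nonnegative feature vector

window = np.random.default_rng(0).standard_normal((11, 1000))
print(window_features(window).shape)       # → (3696,)
```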
Given these spectral estimates, we can then use a nonnegative matrix factorization approach to extract the electome networks and feature scores, which we denote as CSFA-NMF. The nonnegative formulation provides a better biological interpretation: individual features are only positive (e.g., power and coherence), and the networks are either active or inactive in the measured data. We provide additional details on the model and inference in Appendix C.
While both CSFA-CSM and CSFA-NMF are useful in these neuroscience applications in their own right, we want to encourage the learned representations to be predictive of our outcome of interest. This motivates our interest in supervised factor modeling (or supervised dimensionality reduction) approaches to improve their scientific utility.
2.2 Difficulties with Existing Methods
There is a substantial literature on estimating latent spaces predictive of an outcome. Joint modeling and supervised autoencoders are two of the most prominent methods for this objective. Joint modeling methods have had successes [56], but are unfortunately not robust. In particular, since the variance of the outcome is generally small relative to that of the predictors, it is often necessary to increase the importance of the outcome in order to make it influence the estimated model. As the relative importance of the outcome increases, however, the posterior estimation of the factors becomes highly dependent on knowledge of the outcome. At test time, when the outcome is unknown, the inferred factor estimates change greatly, an effect we refer to as factor dragging. We detail this effect in our simulations in Section 4. This creates a challenging situation, where we must choose between making the outcome irrelevant for estimating the model or making the estimation of the factors rely on knowledge of the outcome. Several techniques have been explored to address these issues.
One widely studied approach is sufficient dimensionality reduction (SDR), which exploits the assumption that the response is conditionally independent of the predictors given a projection of the predictors to a lower dimensional subspace. Sliced inverse regression provides one popular example of an SDR approach [29]. A related statistical technique is to first use some form of regularized regression to find a subset of the predictors related to the outcome, and then learn a factor model on the reduced set [3]. A major issue with SDR-type methods in our motivating neuroscience applications is that the focus is entirely on prediction and not on simultaneously providing an accurate lower-dimensional characterization of the predictors (in our case, brain activity dynamics, or the electome networks).
In an attempt to improve robustness to misspecification in joint modeling, “cutting the feedback” approaches have been proposed [30, 38, 32]. When used in an MCMC method, the samples of the latent variables are estimated using only the predictors (hence the influence from the outcome is cut). A variety of justifications have been given, including when the model is flawed [30], when the outcome is unreliable [38], or for handling missing data in meta-analysis [34]. While these approaches have been successful for some applications in joint modeling, they yield subpar results, as we later show in Section 5. The primary issue is that these methods require that the relevant information for classification be captured in the highest variance components. Unfortunately, in our application the factors that explain the largest amount of variance are relatively unpredictive (a common issue in statistics [20]), as networks related to motion [24] or even blinking [21] can be particularly dominant.
3 Encoded Supervision
3.1 Joint Modeling
Before introducing our model fitting framework, we first introduce a standard joint modeling approach. We denote the measured predictors as $x_i$ and the corresponding outcome as $y_i$, for observations $i = 1, \dots, N$. The latent factors are denoted $s_i$. A generic objective function to fit a joint factor model is

(1)  $\hat{\Theta}, \hat{\Phi}, \hat{S} = \operatorname{argmax}_{\Theta, \Phi, S} \sum_{i=1}^{N} \log p_{\Theta}(x_i \mid s_i) + \lambda \log p_{\Phi}(y_i \mid s_i)$

In this model, $\Theta$ corresponds to the parameters relating the factors to $x_i$ and $\Phi$ corresponds to the parameters relating the factors to $y_i$. A standard maximum likelihood approach would use $\lambda = 1$; however, in practice $\lambda$ is often modified to upweight the importance of predicting the outcome by taking $\lambda > 1$. Given this objective function, we find the optimal values of the latent factors $S$, $\Theta$, and $\Phi$, which can be done effectively through stochastic gradient methods [31]. Because the latent variables are referred to as “local” variables for these stochastic methods, we denote this strategy as “local supervision,” and we modify this approach below.
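A hedged numpy sketch of this joint objective for a linear-Gaussian factor model (up to additive constants; the names and scales are our illustrative assumptions):

```python
import numpy as np

def local_supervision_objective(X, y, S, W_x, w_y, lam=1.0):
    """The joint objective for Gaussian likelihoods with unit variances,
    up to constants. X: (n, p) predictors, y: (n,) outcome, S: (n, k)
    per-sample latent factors; lam > 1 upweights the outcome term."""
    recon = -0.5 * np.sum((X - S @ W_x.T) ** 2)  # sum_i log p(x_i | s_i)
    pred = -0.5 * np.sum((y - S @ w_y) ** 2)     # sum_i log p(y_i | s_i)
    return recon + lam * pred

rng = np.random.default_rng(1)
n, p, k = 100, 20, 3
S_true = rng.standard_normal((n, k))
W_x, w_y = rng.standard_normal((p, k)), rng.standard_normal(k)
X, y = S_true @ W_x.T, S_true @ w_y              # noiseless data
obj = local_supervision_objective(X, y, S_true, W_x, w_y, lam=5.0)
print(abs(obj) < 1e-9)                           # → True at the true optimum
```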
As mentioned in the introduction, this approach is successful when the model is properly specified. Unfortunately, though, it is not robust to model misspecification, which is prevalent in our application.
3.2 Encoded Objective Function
Our primary modification to Equation 1 is that we require the latent variables to be estimated exclusively from the information in $x_i$, which enforces that our latent variable estimates are the same whether or not $y_i$ is observed. However, we still want $y_i$ to influence how we learn our model parameters. Our strategy to accomplish this is to require that $s_i = f_\psi(x_i)$, where $f_\psi$ is a function that maps the predictors to the latent factors, which we call the “encoder.” The encoder ensures that the latent variables are determined by $x_i$ alone, but we want the encoded values to still be predictive of $y_i$. These modifications yield our novel objective function of

(2)  $\hat{\Theta}, \hat{\Phi}, \hat{\psi} = \operatorname{argmax}_{\Theta, \Phi, \psi} \sum_{i=1}^{N} \log p_{\Theta}(x_i \mid f_\psi(x_i)) + \lambda \log p_{\Phi}(y_i \mid f_\psi(x_i))$

While the encoder can take any functional form, it is often useful to limit its complexity to address overfitting concerns. The simplest useful encoder is an affine transformation $f_\psi(x_i) = A x_i + b$. As in Equation 1, $\lambda$ is a tuning parameter that controls the relative focus on predicting $y_i$ from the latent space.
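The corresponding encoded objective differs from the joint one only in that the factors are a deterministic function of the predictors; a sketch under the same illustrative Gaussian assumptions (names and shapes are ours):

```python
import numpy as np

def encoded_objective(X, y, A, b, W_x, w_y, lam=1.0):
    """Encoded objective with the affine encoder s_i = A x_i + b (unit
    variances, up to constants). The factors depend on X only, never on y."""
    S = X @ A.T + b                               # encode every sample
    recon = -0.5 * np.sum((X - S @ W_x.T) ** 2)   # reconstruct predictors
    pred = -0.5 * np.sum((y - S @ w_y) ** 2)      # predict outcome from s_i
    return recon + lam * pred

# Sanity check: identity encoder/decoder and a zero prediction head give a
# perfect fit when the outcome is identically zero.
X = np.eye(3)
obj = encoded_objective(X, np.zeros(3), np.eye(3), np.zeros(3),
                        np.eye(3), np.zeros(3))
print(abs(obj) < 1e-12)                           # → True
```

Whether or not the outcome is observed at test time, the factor estimates $s_i = A x_i + b$ are unchanged, which is exactly the property the joint objective lacks.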
At first glance, Equations 1 and 2 seem trivially different; however, despite superficial similarities, this inference approach behaves differently under model misspecification. To help understand why, we can think of Eq. 2 as a Lagrangian relaxation of the constrained optimization objective,

(3)  $\max_{\Theta, \Phi, S} \sum_{i=1}^{N} \log p_{\Theta}(x_i \mid s_i) + \lambda \log p_{\Phi}(y_i \mid s_i)$
s.t.  $s_i = f_\psi(x_i)$ for some $\psi$

In this formulation, it is clear that our method finds a latent representation of $x_i$ that requires $y_i$ to be predictable using the information in $x_i$ alone. This contrasts with the effects happening in Eq. 1, where the factors are dragged by the information in $y_i$, especially as $\lambda$ increases. This property makes the solution found by this approach much more robust to model mismatch, and also decreases the sensitivity to whether the information pertinent to classification is a dominant source of variance in $x_i$.
3.3 Analytic Solutions with an $L_2$ Loss
Most encoded supervision and joint models do not have analytic solutions, but we can obtain analytic solutions to Equations 1 and 2 in specific scenarios. One such case is when the likelihood is replaced with an $L_2$ loss. This yields a form similar to PCA, which can be viewed as the limiting case of probabilistic PCA as the variance of the conditional distribution goes to 0 [2].
We define the encoder as a linear transformation, $s_i = A x_i$, in the encoded inference strategy. For convenience, we define matrix forms of the data and factors: $X = [x_1, \dots, x_N]$, $Y = [y_1, \dots, y_N]$, and $S = [s_1, \dots, s_N]$. We let the model parameters be $W_x$ and $W_y$, and for matrices we let $\|\cdot\|_F$ be the Frobenius norm. First, we focus on the solution that comes from the “local” approach, in which $S$ is learned from both $X$ and $Y$. This decomposition and its solution are well known except for the minor extension due to the $\lambda$ term. Following simplifications, the objective is

$\min_{W_x, W_y, S} \sum_{i=1}^{N} \|x_i - W_x s_i\|_2^2 + \lambda \|y_i - W_y s_i\|_2^2$

Rewriting the above objective in matrix form as a minimization problem yields

$\min_{W_x, W_y, S} \|X - W_x S\|_F^2 + \lambda \|Y - W_y S\|_F^2$

The solution for the concatenation of $W_x$ and $W_y$, which we denote $W$, can be found as an eigendecomposition of the matrix

(4)  $\tilde{Z}\tilde{Z}^\top$, where $\tilde{Z} = [X^\top, \sqrt{\lambda}\, Y^\top]^\top$

Predicting the scores when $y_i$ is unknown is a simple projection of the data onto the latent space, $\hat{s}_i = W_x^\top x_i$. This solution form is well known from PCA except for the minor extension with $\lambda$. The derivation of Equation 4 is given in Appendix A.1.
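The local-supervision closed form above can be checked numerically: stacking the predictors with the scaled outcome and eigendecomposing the stacked Gram matrix coincides with joint PCA (computed here via the SVD). The notation is our reconstruction of the appendix result, not verbatim.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, k, lam = 20, 200, 3, 4.0
X = rng.standard_normal((p, n))          # predictors, columns are samples
Y = rng.standard_normal((1, n))          # outcome
Z = np.vstack([X, np.sqrt(lam) * Y])     # stacked, outcome upweighted

# Top-k eigenvectors of Z Z^T give the concatenated loadings.
eigval, eigvec = np.linalg.eigh(Z @ Z.T)
W = eigvec[:, ::-1][:, :k]               # eigh sorts ascending; reverse

# Joint PCA of the stacked data via SVD should agree up to sign.
U = np.linalg.svd(Z, full_matrices=False)[0]
err = min(np.linalg.norm(W[:, 0] - U[:, 0]),
          np.linalg.norm(W[:, 0] + U[:, 0]))
print(err < 1e-8)                        # → True
```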
Next, we derive the solution for encoded supervision, which can similarly be rewritten in terms of minimizing an $L_2$ loss. We assume a linear encoder, $s_i = A x_i$. Once again, we start with the loss corresponding to Equation 2 as

(5)  $\min_{W_x, W_y, A} \sum_{i=1}^{N} \|x_i - W_x A x_i\|_2^2 + \lambda \|y_i - W_y A x_i\|_2^2$

This can be rewritten as a minimization in matrix form as

$\min_{W_x, W_y, A} \|X - W_x A X\|_F^2 + \lambda \|Y - W_y A X\|_F^2$

The solution for the concatenation of $W_x$ and $W_y$ is also an eigendecomposition, but with a different matrix,

(6)  $\tilde{Z} P_X \tilde{Z}^\top$

where $P_X = X^\top (X X^\top)^{-1} X$ is the projection matrix onto the row space of $X$. In contrast to the local supervision approach, the mapping to the latent factors is the same regardless of whether $y_i$ is observed. We provide the derivation of Equation 6 in Appendix A.2.
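A numerical sketch of the contrast between the two closed forms (the notation again follows our reconstruction): projecting onto the row space of $X$ can only remove variance, so under the encoded solution the outcome contributes only what $X$ can linearly explain.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, k, lam = 20, 200, 2, 4.0
X = rng.standard_normal((p, n))
Y = rng.standard_normal((1, n))
Z = np.vstack([X, np.sqrt(lam) * Y])

P = X.T @ np.linalg.solve(X @ X.T, X)       # projection onto row space of X
local_eig = np.linalg.eigvalsh(Z @ Z.T)      # local supervision (Eq. 4)
enc_eig = np.linalg.eigvalsh(Z @ P @ Z.T)    # encoded supervision (Eq. 6)

# Since Z P Z^T ⪯ Z Z^T in the positive semidefinite order, every encoded
# eigenvalue is bounded by its local counterpart (eigvalsh sorts ascending).
print(bool(np.all(enc_eig <= local_eig + 1e-8)))  # → True
```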
While finding an analytic solution is computationally helpful when using these models, it also provides insight into the practical differences between the two forms. When $\lambda = 1$, the solution in Equation 4 is easily recognizable as the solution for PCA on the predictors and outcome jointly. The solution in Equation 6 differs only by the addition of the term $P_X$. In this case it maximizes the variance explained in $X$ and the variance of $Y$ to the extent that $Y$ is linearly predictable by $X$. This makes sense in terms of the original objectives of the learning and inference strategy because the estimated scores are unaffected when $y_i$ is unknown.
3.4 General Inference
The encoded supervision formulation given in Equation 2 can be used with many different modeling formulations. A more general form is

(7)  $\min_{\Theta, \Phi, \psi} \sum_{i=1}^{N} \ell\big(x_i, \hat{x}_i(\Theta, \psi)\big) + \lambda\, \ell\big(y_i, \hat{y}_i(\Phi, \psi)\big)$

where $\ell$ is a loss measuring the distance between the estimated value of $x_i$ or $y_i$ and the observed value. This loss can easily be minimized by stochastic gradient descent, facilitating scalability to large datasets. This is of vital importance, as our datasets contain millions of high-dimensional observations. Stochastic gradient descent has rigorous convergence proofs under mild conditions
[7], and adaptive learning methods can improve computational efficiency and decrease parameter tuning [25, 10]. When used with packages such as TensorFlow [1] that allow for automatic differentiation, a unique statistical model can be implemented with a few lines of code and inference performed in a matter of minutes.

4 Simulation Study
The two main types of model misspecification that concern us are using fewer latent factors than are truly needed to represent the data, as well as an incorrect latent distribution. Prior to analyzing our data of interest, we use synthetic data to illustrate that encoded supervision is robust to both of these difficulties.
We also found that encoded supervision was comparatively robust to the chosen $\lambda$ even when all other properties are properly specified. Since that value can be cross-validated, this is a practical advantage, but it is less important in these studies. A visualization of this property is shown in Appendix E.
4.1 Misspecified Number of Latent Factors
We first examine misspecification in the number of latent factors. We synthesized data with the generative model corresponding to Supervised Probabilistic PCA [56], which is a joint linear latent factor model. We set the number of predictors to 20 with a single outcome and the latent dimensionality was 3. The largest factor was unpredictive of the outcome. Additionally, we limited the inferred dimensionality to 2. This mimics our biological models that underspecify the true number of brain networks. We fit a locally supervised and encoded supervised PCA model with 2 components over a range of supervision strengths using the formulas derived in Section 3.3 and show the results in Figure 3.
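The simulation's generative model can be sketched as follows; the specific factor scales and noise levels are our illustrative assumptions, chosen so that the largest-variance factor carries no outcome signal, as described above.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 500, 20, 3                       # samples, predictors, true factors
scales = np.array([3.0, 1.0, 0.5])         # factor 1 has the largest variance
S = rng.standard_normal((n, k)) * scales
W = rng.standard_normal((p, k))            # predictor loadings
w_y = np.array([0.0, 1.0, 1.0])            # largest factor is unpredictive
X = S @ W.T + 0.1 * rng.standard_normal((n, p))
y = S @ w_y + 0.1 * rng.standard_normal(n)
# A 2-factor model fit to (X, y) must discard one true factor, mirroring the
# underspecified setting studied here.
print(X.shape, y.shape)                    # → (500, 20) (500,)
```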
Both the local and encoded supervision methods yield worse reconstructions of the data when the model is forced to be predictive (high $\lambda$), which is expected since the highest variance component is unpredictive of the outcome. In the local supervision approach, though, the improved prediction of $y$ is contingent upon knowing $y$. Specifically, when $y$ is known the model overfits, forcing the prediction loss to vanish. However, when $y$ is unknown there are only slight initial gains in predictive ability before overfitting actually leads to worse predictions. In contrast, the ability to predict $y$ does not depend on observing $y$ in the encoded scenario, so it is more robust and predicts better. Additionally, because the encoded supervised approach only uses $x$ to predict $y$, it yields better reconstruction of the predictors when the outcome is unknown.
4.2 Unknown Latent Distribution
We chose NMF with a binary classification problem to illustrate the impact of an unknown latent distribution and an unknown number of factors. For this example we wanted the data to be more complex while still knowing the true model behind the system. To accomplish this, we created a pseudo-synthetic dataset from the Fashion-MNIST dataset by fitting a 40-component NMF model to the data and then generating the synthetic data as the reconstruction of the original data plus noise. Thus the synthetic data come from a known NMF model with 40 components. However, the distribution on the latent scores needed to create a probabilistic model is still unknown, making this an ideal demonstration of the utility of encoded supervision, in contrast to the situations above where the latent scores were generated from a known Gaussian distribution.
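The pseudo-synthetic construction can be sketched with scikit-learn's NMF; here random nonnegative data stands in for the Fashion-MNIST images, and the noise scale and clipping are our assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
images = rng.random((200, 784))       # stand-in for flattened 28x28 images

nmf = NMF(n_components=40, init="nndsvda", max_iter=200, random_state=0)
scores = nmf.fit_transform(images)    # latent scores, shape (200, 40)
recon = scores @ nmf.components_      # rank-40 nonnegative reconstruction

# Synthetic data = reconstruction + noise, clipped to stay nonnegative, so
# the true model is a known 40-component NMF but the score distribution is
# unknown.
synthetic = np.clip(recon + 0.05 * rng.standard_normal(recon.shape), 0, None)
print(synthetic.shape)                # → (200, 784)
```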
To simplify our synthetic example, we limited ourselves to three binary classification problems. For each classification problem we selected the relevant observations, trained a supervised model, and evaluated the AUC on a held-out test set. We trained a locally supervised and an encoded supervision model with 5 components, creating an analogous situation where the fitted model severely underestimates the true dimensionality of the data. As comparison methods, we also fit logistic regression on the original NMF features used to generate the data, and we compared local and encoded supervision to a sequentially fit model. The sequential model first fits an NMF to the images and then applies a logistic regression model to the resulting scores.
Table 1 shows the results of the different prediction methods. Encoded supervision matches the performance of logistic regression on every classification task and dramatically improves upon local supervision. The sequentially fit model matches the performance on a single category but lags substantially on the other two classification tasks. The case in which the performance is the same corresponds to when one of the highest variance components is predictive of the outcome, so the NMF captures this feature. However, encoded supervision is able to provide consistently good predictions regardless of the relative importance of the predictive features in the generative model.
Method  |  Pullover/Shirt  |  T-shirt/Shirt  |  T-shirt/Coat
Linear model on ground-truth NMF  |  0.84  |  0.90  |  0.90
Sequential NMF  |  0.78  |  0.90  |  0.76
Local Supervision NMF  |  0.76  |  0.82  |  0.71
Encoded Supervision NMF  |  0.85  |  0.89  |  0.89
5 Learning Relevant Electome Networks
We now evaluate the performance of encoded supervision on the data described in Section 2. There are two classification tasks: differentiating between the two genotypes and differentiating between the experimental conditions (stress biomarker). All models were implemented in TensorFlow and fit with the NADAM stochastic gradient method [10]. Previously, the local supervision solution was referred to as discriminative CSFA (dCSFA) [13], so we denote locally supervised approaches as dCSFA in the tables.
First, we evaluated the different fitting approaches on CSFA-NMF using the derived power and coherence features to learn 5 networks. The loadings were constrained to have unit norm. We also placed an $L_1$ penalty on the supervision coefficients to encourage sparsity and prevent overfitting. All network details can be found in Appendix C. We compared a sequential method, local supervision, and our encoded supervision approach. In our experiments, 18 mice were used for training the model and a set of 8 mice was reserved for testing.
For our applications, it is imperative that networks both predict the outcome and use relevant neural activity so that they are biologically meaningful. Thus, we care about both the outcome prediction metrics and the reconstruction ability. In these experiments, cross-validating over $\lambda$ is computationally expensive, so we chose $\lambda$ such that the reconstruction log-likelihood and the supervision log-likelihood had similar magnitudes. The results are shown in Table 2. In our case, encoded supervision pays only a small penalty on the reconstruction for a vast and meaningful improvement in prediction. Notably, the local supervision approach drastically overfits with the chosen value of $\lambda$. These results match the patterns seen in the synthetic data. We note that if the value of $\lambda$ were fully cross-validated, we would expect local supervision to at least match the performance of the sequential method; however, that is not tractable. Instead, it is preferable in practice on large-scale data to have a method that is fairly robust to this chosen value. The most predictive brain network from the encoded CSFA-NMF is shown in Figure 4.
Method  |  Genotype: Predictive AUC  |  Genotype: Generative Loss  |  Condition: Predictive AUC  |  Condition: Generative Loss
CSFA-NMF  |
dCSFA-NMF  |
Encoded CSFA-NMF  |
As a demonstration of the general applicability of encoded supervision, we next applied it to the same dataset using the CSFA-CSM algorithm [13]. For the encoder we use an affine transform with a softplus activation to ensure nonnegativity of the factors. While the log-likelihood uses the original time series data, we used the same power features derived for CSFA-NMF as input to the encoder (directly using the time series data would require the development of a more complex convolutional neural network encoder). To reflect a common goal in neuroscience, we induce sparsity by limiting a single network to be predictive, which also addresses the multiple comparisons problem. Our CSFA models for the encoded, local, and generative approaches included 25 networks. Computational limitations precluded cross-validating over supervision strength for local supervision; hence we set $\lambda$ such that the supervision loss was approximately 10 percent of the total loss. The classification results comparing the different approaches using the CSM kernel are given in Table 3. As we can see, encoded supervision dramatically outperforms local supervision and the sequential method on both classification tasks, likely because the brain networks dominant in variance are rarely directly related to neuropsychiatric disorders. As for CSFA-NMF, the estimated network from encoded CSFA-CSM is visualized in Figure 5.
Method  |  Genotype: Predictive AUC  |  Genotype: Negative Log-Likelihood  |  Condition: Predictive AUC  |  Condition: Negative Log-Likelihood
CSFA-CSM  |
dCSFA-CSM  |
Encoded CSFA-CSM  |
While these approaches differ in the way they model the data, encoded supervision uniformly provides gains in predictive accuracy. In our experience on these real datasets, the methodology was reliable and robust to the exact setting of the supervision strength, mirroring the synthetic examples.
Finally, encoded supervision was much faster: a single model was trained in approximately half an hour for CSFA-NMF and one hour for CSFA-CSM using a single NVIDIA 2080 Ti GPU, while local supervision took roughly 10 hours for CSFA-NMF and 16 hours for CSFA-CSM. The NMF models were initialized to the learned parameters from the NMF implementation in scikit-learn [35] to both provide a robust initialization and save time. While both the CSFA-CSM and CSFA-NMF models are nonconvex, we empirically found that the results were robust to different random initializations, yielding similar predictive performance and a similar predictive factor.
6 Discussion
Supervised dimension reduction is useful in neuroscience, as most datasets involve explaining variance in high-dimensional observations. Reducing the dimensionality can increase statistical power, and carefully designed factor models can have useful biological interpretation as networks. However, these benefits are contingent upon capturing the information related to the outcome in the latent space. A joint factor model fit by maximum likelihood has been shown to be susceptible to misspecification, a common occurrence in neuroscience and many other fields.
Our formulation for supervised factor models, based on supervised autoencoders, differs from previous methods in that we solve a constrained optimization problem, yielding a latent representation that explains the variation in the observed predictors while requiring that the factors estimated from the observed predictors alone give good predictive performance. We find that our model formulation is substantially more robust to misspecification than approaches that maximize a joint likelihood. In our practical real-data experiments, this led to significant gains in predictive performance. These gains appear for multiple generative models with vastly different likelihoods.
This work can be extended in several directions. Replacing the deterministic mapping with a stochastic one, similar to a variational autoencoder, could mitigate overfitting of the predictions and quantify uncertainty in the network scores. Additionally, combining the developed approach with missing data models would increase its practical utility, as different experiments often focus on different subsets of brain regions.
Acknowledgements
We would like to thank Michael Hunter Klein and Neil Gallagher for providing helpful feedback and testing the code for the NMF and PCA models. Research reported in this publication was supported by the National Institute of Biomedical Imaging and Bioengineering and the National Institute of Mental Health through the National Institutes of Health BRAIN Initiative under Award Number R01EB026937 to DC and KD, and by a W. M. Keck Foundation grant to KD. DD was funded by National Institutes of Health grants R01ES027498 and R01MH118927.
References

[1] (2016) TensorFlow: a system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016), pp. 265–283.
[2] (1963) Pattern recognition and machine learning. Vol. 9.
[3] (2006) Prediction by supervised principal components. Journal of the American Statistical Association 101 (473).
[4] (2014) Hierarchical Modeling and Analysis for Spatial Data. CRC Press.
[5] (2017) Network neuroscience. Nature Neuroscience 20 (3), pp. 353–364.
[6] (2003) Latent Dirichlet allocation. Journal of Machine Learning Research 3, pp. 993–1022.
[7] (2018) Optimization methods for large-scale machine learning. SIAM Review 60 (2), pp. 223–311.
[8] (2012) The origin of extracellular fields and currents: EEG, ECoG, LFP and spikes. Nature Reviews Neuroscience 13 (6), pp. 407–420.
[9] (2017) Dynamically timed stimulation of corticolimbic circuitry activates a stress-compensatory pathway. Vol. 82.
[10] (2016) Incorporating Nesterov momentum into Adam. In ICLR Workshop, pp. 2013–2016.
[11] (2006) Dopaminergic control of sleep-wake states. Journal of Neuroscience 26 (41), pp. 10577–10589.
[12] (2006) Dynamics and effective topology underlying synchronization in networks of cortical neurons. Journal of Neuroscience 26 (33), pp. 8465–8476.
[13] (2017) Cross-spectral factor analysis. In Advances in Neural Information Processing Systems, pp. 6842–6852.
[14] (2012) Effects of the DRD4 genotype on neural networks associated with executive functions in children and adolescents. Developmental Cognitive Neuroscience 2 (4), pp. 417–427.
[15] (2013) Partial factor modeling: predictor-dependent shrinkage for linear regression. Journal of the American Statistical Association 108 (503), pp. 999–1008.
[16] (2003) Response inhibition and impulsivity: an fMRI study. Neuropsychologia 41 (14), pp. 1959–1966.
[17] (2016) Dysregulation of prefrontal cortex-mediated slow-evolving limbic dynamics drives stress-induced emotional pathology. Neuron 91 (2), pp. 439–452.
[18] (2018) Brain-wide electrical spatiotemporal dynamics encode depression vulnerability. Cell 173 (1), pp. 166–180.
[19] (1998) Independent component analysis in the presence of Gaussian noise by maximizing joint likelihood. Neurocomputing 22 (1-3), pp. 49–67.
[20] (1982) A note on the use of principal components in regression. Journal of the Royal Statistical Society, Series C 31 (3), pp. 300–303.
[21] (2004) Automatic removal of eye movement and blink artifacts from EEG data using blind component separation. Psychophysiology 41, pp. 313–325.
[22] (2014) Analysis of Neural Data. Springer, New York.
[23] (1988) Modern Spectral Estimation. Prentice-Hall, Englewood Cliffs, NJ.
[24] (2019) Adaptive artifact removal from intracortical channels for accurate decoding of a force signal in freely moving rats. Frontiers in Neuroscience 13, pp. 1–12.
[25] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[26] (2013) Auto-encoding variational Bayes. In International Conference on Learning Representations.
[27] (2018) Supervised autoencoders: improving generalization performance with unsupervised regularizers. In Advances in Neural Information Processing Systems, pp. 107–117.
[28] (1999) Learning the parts of objects by nonnegative matrix factorization. Nature 401 (6755), pp. 788–791.
[29] (1991) Sliced inverse regression for dimension reduction. Journal of the American Statistical Association 86 (414), pp. 316–327.
[30] (2009) Modularization in Bayesian analysis, with emphasis on analysis of computer models. Bayesian Analysis 4 (1), pp. 119–150.
[31] (2009) Supervised dictionary learning. In Advances in Neural Information Processing Systems, pp. 1033–1040.
[32] (2010) Cutting feedback in Bayesian regression adjustment for the propensity score. International Journal of Biostatistics 6 (2).
[33] (2015) Cognitive network neuroscience. Journal of Cognitive Neuroscience 27 (8), pp. 1471–1491.
[34] (2013) Feedback and modularization in a Bayesian meta-analysis of tree traits affecting forest dynamics. Bayesian Analysis 8 (1), pp. 133–168.
[35] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
[36] (2008) The Matrix Cookbook. Technical report. http://matrixcookbook.com
[37] (2016) Deconstructing the ladder network architecture. In 33rd International Conference on Machine Learning (ICML 2016), pp. 3527–3539.
[38] (2014) Cuts in Bayesian graphical models. Statistics and Computing 25 (1), pp. 37–43.
[39] (2016) Variational autoencoder for deep learning of images, labels and captions. In Advances in Neural Information Processing Systems, pp. 2360–2368.
[40] (1975) Theory and Application of Digital Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.
[41] (2008) Semi-supervised learning of compact document representations with deep networks. In Proceedings of the 25th International Conference on Machine Learning, pp. 792–799.
[42] (2001) Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. American Journal of Human Genetics 69 (1), pp. 138–147.
[43] (2005) Cortical local field potential encodes movement intentions in the posterior parietal cortex. Neuron 46 (2), pp. 347–354.
[44] (2014) Putting big data to good use in neuroscience. Nature Neuroscience 17 (11), pp. 1440–1441.
[45] (2010) Development of a large-scale functional brain network during human non-rapid eye movement sleep. Journal of Neuroscience 30 (34), pp. 11379–11387.
[46] (2011) How advances in neural recording affect data analysis. Nature Neuroscience 14 (2), pp. 139–142.
[47] (2017) Bayesian switching factor analysis for estimating time-varying functional connectivity in fMRI. NeuroImage 155, pp. 271–290.
[48] (2005) Electroencephalogram-based control of an electric wheelchair. IEEE Transactions on Robotics 21 (4), pp. 762–766.
[49] (1999) Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61 (3), pp. 611–622.
[50] (2016) Deep extreme learning machines: supervised autoencoding architecture for classification. Neurocomputing 174, pp. 42–49.
[51] (2008) The role of oscillations and synchrony in cortical networks and their putative relevance for the pathophysiology of schizophrenia. Schizophrenia Bulletin 34 (5), pp. 927–943.
[52] (2015) GP kernels for cross-spectrum analysis. In Advances in Neural Information Processing Systems, pp. 1999–2007.
[53] (2013) Further evidence for ClockΔ19 mice as a model for bipolar disorder mania using cross-species tests of exploration and sensorimotor gating. Behavioural Brain Research.
[54] (1967) The use of fast Fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics 15 (2), pp. 70–73.
[55] (2017) Variational autoencoder for semi-supervised text classification. In 31st AAAI Conference on Artificial Intelligence (AAAI 2017), pp. 3358–3364.
[56] (2006) Supervised probabilistic principal component analysis. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 464–473.
[57] (2015) Supervised representation learning: transfer learning with deep autoencoders. In IJCAI International Joint Conference on Artificial Intelligence, pp. 4119–4125.
Appendix A Proof of Analytic Solutions in Linear Factor Models
Below we give the accompanying proofs for Section 3.3, which focus on PCA-like solutions for the local and encoded supervision models.
A.1 Local Supervision
We first restate the optimization goal from Equation 3.3, showing that local supervision with linear models can be rewritten in matrix form as
This is equivalent to the trace representation of
We now compute the partial derivatives with respect to , and using standard techniques [36], with the aim of finding the fixed points where each partial derivative vanishes. First, the partial derivative with respect to is
(8) 
The derivative with respect to is
(9) 
And the derivative with respect to is
(10) 
We need all of these conditions to be satisfied at the fixed point. We will first solve for a single latent feature, so the matrices , , and are vectors denoted as , , and , respectively. With that, we can rewrite Equation 10 as
(11)
We denote and . Since is a vector, both and are scalars in this case. Thus, in the single-feature case the fixed-point equations (8), (9), and (10) can be written concisely as
When we substitute into the previous equations, we end up with
(12)  
(13) 
Because and are scalars, we note that the solution to Equations 12 and 13 must be an eigenvector of
where the first part of the eigenvector corresponds to the solution for and the second part corresponds to the solution for . In practice, we have always found that the solution corresponds to the largest eigenvector. Further dimensions can be found iteratively by subtracting out the previously explained variance and then repeating the procedure. In practice we have found these to be equivalent to the eigenvectors with the largest eigenvalues of the same matrix, which is supported by the fact that this yields the classic PCA solution in the limiting case.

A.2 Encoded Supervision
We use a similar procedure as before. The encoded supervision objective from Equation 3.3 can be rewritten in matrix form as
Here, the matrix functions as our linear encoder, with and retaining their meaning as the loadings for the outcome and predictors, respectively.
We can rewrite this objective in terms of traces as
Next, we will find the fixed points by determining when the partial derivatives are 0. The partial derivative for is
(14) 
the partial derivative of is
(15) 
The partial derivative of is
(16) 
We begin by finding the solution when only a single latent variable is used, and can rewrite Equation 16 as
We define the scalars and . Thus, our fixed points in the univariate case are the solutions to the system of equations given as
Substituting yields a similar form to the local supervision case, with
(17)  
(18) 
Because and are scalars, we note that the solution to Equations 17 and 18 must be an eigenvector of
where the first part of the eigenvector corresponds to the solution for and the second part corresponds to the solution for . In practice, we have always found that the solution corresponds to the largest eigenvector. Once again, the full solution with multiple latent variables can be found iteratively; in practice we have found these to correspond to the eigenvectors with the largest eigenvalues of the same matrix.
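Both A.1 and A.2 reduce to extracting leading eigenvectors of a system matrix (whose explicit form was elided above), followed by deflation to obtain further dimensions. The iterative procedure can be sketched generically in numpy; the matrix below is an arbitrary symmetric stand-in:

```python
import numpy as np

def leading_components(M, k):
    """Extract the top-k eigenvectors of symmetric M by repeatedly
    taking the leading eigenvector and deflating it out."""
    M = M.copy().astype(float)
    vecs = []
    for _ in range(k):
        vals, V = np.linalg.eigh(M)       # eigenvalues in ascending order
        lam, v = vals[-1], V[:, -1]       # leading eigenpair
        vecs.append(v)
        M -= lam * np.outer(v, v)         # subtract the explained variance
    return np.column_stack(vecs)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
S = A @ A.T                               # symmetric PSD test matrix
U = leading_components(S, 3)
```

For a symmetric positive semidefinite matrix this deflation recovers exactly the eigenvectors with the largest eigenvalues, matching the claim in the text.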
Appendix B Implementation of supervised NMF
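The architecture described in this appendix can be summarized by a minimal numpy forward-pass sketch; the layer sizes, the placement of the softmax over the loadings, and the loss weight `lam` are illustrative assumptions, and the actual model was trained end-to-end with NADAM in TensorFlow:

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)              # smooth nonnegative activation

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def forward(X, y, params, lam=1.0):
    """Encoder -> nonnegative scores; softmax-constrained loadings -> X_hat;
    softmax classifier head -> class probabilities."""
    h = softplus(X @ params["W1"] + params["b1"])     # hidden layer
    s = softplus(h @ params["W2"] + params["b2"])     # latent scores (>= 0)
    H = softmax(params["H_free"], axis=1)             # rows >= 0 and sum to 1
    X_hat = s @ H                                     # NMF-style reconstruction
    p = softmax(s @ params["beta"], axis=1)           # supervised head
    recon = np.mean((X - X_hat) ** 2)
    ce = -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))
    return X_hat, p, recon + lam * ce

rng = np.random.default_rng(0)
n, d, k, classes = 32, 40, 30, 2
X = rng.random((n, d))                                # nonnegative features
y = rng.integers(0, classes, size=n)
params = {"W1": 0.1 * rng.standard_normal((d, 64)), "b1": np.zeros(64),
          "W2": 0.1 * rng.standard_normal((64, k)), "b2": np.zeros(k),
          "H_free": rng.standard_normal((k, d)),
          "beta": 0.1 * rng.standard_normal((k, classes))}
X_hat, p, loss = forward(X, y, params)
```

Passing the unconstrained matrix through a softmax is what enforces both nonnegativity and normalization of the loadings without projected optimization.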
Our encoded supervision version of NMF was implemented in TensorFlow. The encoder was a neural network with a single hidden layer with a softplus activation function, 30 latent features, and a softmax output layer for classification. The loadings were optimized by learning an unconstrained matrix which was passed through a softmax activation function to ensure both nonnegativity and normalization such that . The parameters were learned using Adam with Nesterov momentum (NADAM) as our optimization algorithm with iterations. We set . We used an penalty to regularize the predictive coefficients.

Appendix C Cross-Spectral Factor Analysis: Non-Negative Matrix Factorization (CSFA-NMF)
The CSFA-NMF algorithm is based upon the motivation for the originally proposed CSFA algorithm [13]. However, instead of using a Gaussian process formulation, the Electrical Connectome Networks are learned by NMF over extracted features that capture the relevant properties. In this case, we use the power and coherence features defined immediately below, and then use our inference strategy to learn a supervised NMF model. Note that it would be natural to extend this strategy with Granger causality features as well.
C.1 Generating Power and Coherence Features
The electrophysiological features of interest for these experiments are the frequency-based power within each brain region as well as the coherence between brain regions. While the periodogram, the squared coefficients of a Fourier transform, yields an unbiased estimate of the power spectrum, it is an inconsistent estimator with a fixed variance. There are multiple methods for overcoming this issue, such as spline smoothing or using a parameterized autoregressive model [22]. We instead chose to use Welch's method [54], which breaks the signal into subsections and averages the periodogram over the segments, denoising the estimator. We quantified the coherence using the magnitude squared coherence [40, 23], which computes the power and cross-spectra using Welch's method and scales the cross-spectrum by the power spectrum of each signal, yielding a value between 0 and 1. The power and coherence were estimated in Matlab using the functions pwelch and mscohere, respectively, with the default parameters at 1 Hz intervals.

C.2 Integrating with Supervised NMF
After extracting the relevant features, the approach outlined in the manuscript and in Appendix B is used to learn a supervised NMF model. Each learned component corresponds to a defined brain network, and the extracted features are related to behaviors of interest.
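As a numpy-only sketch of the feature extraction in C.1 (the paper's features were computed with Matlab's pwelch and mscohere; windowing and scaling defaults are simplified here), Welch-averaged power and magnitude squared coherence can be computed as:

```python
import numpy as np

def segment_ffts(x, nperseg):
    """Windowed FFTs of 50%-overlapping segments (core of Welch's method)."""
    step = nperseg // 2
    win = np.hanning(nperseg)
    starts = range(0, len(x) - nperseg + 1, step)
    return np.array([np.fft.rfft(win * x[s:s + nperseg]) for s in starts])

def welch_power(x, nperseg=256):
    """Average the per-segment periodograms to denoise the power estimate."""
    F = segment_ffts(x, nperseg)
    return np.mean(np.abs(F) ** 2, axis=0)

def ms_coherence(x, y, nperseg=256):
    """Magnitude squared coherence: averaged cross-spectrum scaled by the
    averaged power spectra; values fall in [0, 1]."""
    Fx, Fy = segment_ffts(x, nperseg), segment_ffts(y, nperseg)
    Sxx = np.mean(np.abs(Fx) ** 2, axis=0)
    Syy = np.mean(np.abs(Fy) ** 2, axis=0)
    Sxy = np.mean(Fx * np.conj(Fy), axis=0)
    return np.abs(Sxy) ** 2 / (Sxx * Syy)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
y = rng.standard_normal(4096)
```

Identical signals give coherence 1 at every frequency, while independent noise gives low values; the averaging across segments is what makes the estimate consistent.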
Appendix D Implementation of Supervised Cross-Spectral Factor Analysis: Cross-Spectral Mixture (CSFA-CSM)
Below, we give a brief description of the CSFA generative model [13] using the Cross-Spectral Mixture (CSM) kernel, and then we discuss how it can be combined with the supervised framework we developed.
D.1 The CSFA-CSM Generative Model
Cross-Spectral Factor Analysis (CSFA) is designed specifically to model brain networks in an interpretable manner. It uses Gaussian processes to characterize the power and cross-spectra through a carefully designed kernel. Let the data be composed of observations from distinct brain regions. We let window be represented by , whose length is determined by the sampling rate and the duration of the window. The complete dataset is represented by the set . The CSM kernel, designed by [52], is given as
(19) 
where the matrix . This is the real component of a sum of separable kernels. Each of these kernels is given by the combination of a cross-spectral density matrix, , and a stationary function of two time points that defines a frequency band, . Representing , as all kernels used here are stationary and depend only on the difference between the two inputs, the frequency band for each spectral kernel is defined by a spectral Gaussian kernel,
(20) 
which is equivalent to a Gaussian distribution in the frequency domain with variance , centered at .

The matrix is a positive semidefinite matrix with rank . This is also known as a coregionalization matrix in spatial statistics [4]. Choosing the rank to be smaller than the number of regions induces a low-rank approximation that can help prevent overfitting. This relationship is maintained during updates by storing the full matrix as the outer product of a tall matrix with itself:
(21) 
Phase coherence between regions is given by the magnitudes of the complex off-diagonal entries in . The phase offset is given by the complex angle of those off-diagonal entries.
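A minimal numpy sketch of this low-rank construction and the coherence/phase readout is below; the region count and rank are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
C, R = 8, 3                                  # regions, chosen rank (R < C)

# Store the tall complex factor and form B = L L^H, which guarantees a
# Hermitian positive semidefinite matrix of rank at most R.
L = rng.standard_normal((C, R)) + 1j * rng.standard_normal((C, R))
B = L @ L.conj().T

# Off-diagonal magnitude -> coherence strength; complex angle -> phase offset.
coherence_strength = np.abs(B[0, 1])
phase_offset = np.angle(B[0, 1])
```

Optimizing the tall factor rather than the full matrix is what maintains positive semidefiniteness throughout training without any explicit projection.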
CSFA creates a factor model by relating each factor to the data through a CSM kernel. Let represent the time point of the sample in the window and represent . Each window is modeled as
(22)  
(23) 
where is represented as a linear combination of functions drawn from the latent factors, given by . The th latent function is drawn independently for each task according to
(24) 
where is the set of parameters associated with the factor (i.e. ).
The factors are all constrained to be nonnegative, and thus the model treats the brain dynamics as a linear sum of nonnegative brain networks.
D.2 Implementing Encoded CSFA-CSM
Implementing CSFA for the local and unsupervised models follows the developments of [13], although we reimplemented the technique in TensorFlow and used stochastic gradient descent with Nesterov momentum as the optimizer rather than the originally proposed learning strategy. At each step in our algorithm a small batch of data was selected, the factors were trained to convergence, and then a single update of the CSFA parameters was performed given the factors.
We modified the method described in Equation 7 as well as the training algorithm for encoded CSFA. Rather than using the time series as input to our encoder, we used the power features from the CSFA-NMF model. This was done because the features of interest are spectral, so any encoder relying on the raw time series would require a convolutional encoder, which has a high potential to overfit. The original time series was still used, however, to evaluate the likelihood under the CSFA-CSM model.
Due to the vast differences in the magnitude of the likelihood during training, we incorporated an additional step that optimizes the encoder solely with respect to the predictive loss at each iteration. Thus, the encoder was constrained to be predictive at each step of training. This was implemented in TensorFlow version 1.9 with Adam as the optimization method.
Appendix E Robustness to Supervision Strength
Here we examine the sensitivity of both encoded and local supervision to the supervision-strength parameter when the other parameters are known and given. We fit the model using only 200 samples, using the derived PCA-like analytic solutions for the local and encoded supervision models. We choose this assumption on the variances because it is common in practice (e.g., the prevalence of PCA for data analysis) and allows us to compare model parameters without further consideration of model settings. Also, note that while this assumption impacts the eigenvalues, and hence the latent factors, the eigenvectors indicating the directions of maximum variance remain unchanged [49].
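The claim that rescaling the assumed variances changes the eigenvalues but leaves the principal directions intact can be checked numerically; the covariance below is an arbitrary toy example:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
S = A @ A.T                       # toy covariance matrix

# Eigendecompositions of S and a rescaled 4*S.
vals1, vecs1 = np.linalg.eigh(S)
vals2, vecs2 = np.linalg.eigh(4.0 * S)

# Eigenvalues rescale by the constant ...
same_vals = np.allclose(vals2, 4.0 * vals1)
# ... but eigenvectors match up to sign, so the directions are unchanged.
overlaps = np.abs(np.sum(vecs1 * vecs2, axis=0))
```

This is why comparing model parameters under the fixed-variance assumption is meaningful: the latent directions being compared do not depend on the assumed scale.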
Figure 6 visualizes how model performance changes as we vary the supervision strength for the local and encoded supervision. When it is small, the impact of the supervised task is small, and the solutions from the local and encoded methods are very similar. However, as it increases, the models fit by the local and encoded supervised inference strategies diverge. Specifically, the local supervised approach is greatly impacted by knowledge of the outcome, which appears as a large discrepancy in the latent factors depending on whether the outcome is observed (factor dragging); note that the model does not predict well when it is unobserved. In direct contrast, the encoded supervision model is comparatively unaffected and is substantially more robust to the supervision strength, which is visible in the better predictive performance for large values and the better reconstructions from the learned parameters.