Our main focus in this paper is to elucidate the neuronal mechanisms underlying the memory for sequences of events. There is considerable evidence that hippocampal neurons can encode sequences of locations [1, 2, 3, 4, 5]. However, there is no direct evidence that this coding property extends to nonspatial memories (the type of memory typically investigated in human studies). Recent electrophysiological studies have specifically targeted this important issue [6, 7, 8, 9]. These studies commonly involve recording spike train data, which are sequences of spikes (action potentials) produced by neurons over time.
To study temporal organization of memory, we have designed an experiment to record neural activity in the CA1 region of the hippocampus as rats performed a nonspatial sequential memory task (see Figure 1 for more details). The task involves the presentation of repeated sequences of odors at a single port and tests the rats’ ability to identify each odor as “in sequence” (InSeq; by holding their nosepoke response until the signal at 1.2s) or “out of sequence” (OutSeq; by withdrawing their nose before the signal). Spiking activities were recorded using 24 tetrodes (bundles of 4 electrodes) in rats tested on well-trained or novel sequences of 5 odors denoted as A, B, C, D, and E. For each odor presentation (trial), the data typically features spike counts from 40-70 neurons and trial identifiers (e.g., Odor presented, InSeq/OutSeq, response correct/incorrect). Here, our goal is to identify the presented odors (each treated as a class) given the spike train data.
Properly analyzing the massive amount of data collected through this experiment is an extremely difficult task. Many existing statistical methods and machine learning techniques lack the flexibility, rigor, and scalability required for analyzing such complex and high dimensional data. What makes this task more daunting is the fact that the underlying patterns tend to be highly nonlinear. Moreover, the patterns are typically dominated by a specific behaviour or experimental condition mainly due to the class imbalance problem in the data. In our case, the patterns are dominated by odor A since it is the first odor in the sequence, is always in the right order (i.e., always occurs first), and is included in all trials. In contrast, odors D and E might not appear in some trials if the trials end early due to rats’ mistakes in recognizing earlier odors.
In general, class imbalance is commonly encountered in real-world applications, where some classes are infrequent or rare and so are not represented properly in finite samples. As a result, the analysis might lead to biased estimates and inaccurate predictions. In supervised learning, the classification model might be dominated by the most frequent classes, ignoring infrequent but usually important classes. Imbalance can also affect unsupervised learning models. For example, for latent variable models, the low-dimensional representation of the data might be dominated by the information collected on the more frequently observed classes.
In this paper, we propose a framework that can be used to explore, visualize, describe, and predict neural patterns. More specifically, we propose a method for alleviating the class imbalance problem in latent variable models to improve the overall quality of latent representation learning and also to increase the accuracy of the decoding process. To this end, a diversity-encouraging prior, namely a determinantal point process (DPP), is used for the latent variables. The prior regularizes the latent variables to be less redundant. In particular, we use this prior for the latent variables in the Variational Auto-encoder (VAE), which is one of the most widely used deep latent models. Compared to standard VAE models, we expect our method to provide a clearer latent representation of the data by avoiding redundancy in the latent space.
Our proposed method has the following advantages: 1) it can capture nonlinear patterns and map the observed data into a latent space where the linearity assumption (in terms of the relationship between neural data and behaviour) would be more reasonable; 2) by using a DPP prior, our method has a better chance of separating different classes and thus providing a clearer representation of the data; 3) additionally, our approach can ensure that the patterns associated with minor classes (here, odors D and E) are well represented in the latent space and are decoded more accurately; 4) finally, our proposed method can provide unprecedented insight into the hippocampal mechanisms underlying the memory for sequences of events (a defining feature of episodic memory) and can also determine whether spatial coding properties (e.g., sequence replay) thought to be fundamental to hippocampal function extend to the nonspatial domain.
The paper is organized as follows. First, we describe the details of the experiment that inspired this research. We then provide a brief overview of the Variational Auto-encoder and Determinantal Point Processes. We also review some existing methods for dealing with class imbalance. Next, we describe our proposed method in detail. Finally, we provide our analyses and results for two separate problems. First, we show the advantages of our method based on an imbalanced MNIST classification task. We then use our method for analyzing the data collected from the rat experiment, which is a multi-class and high-dimensional neurophysiological experiment.
2 Experimental design and data collection
Subjects were male Long-Evans rats, weighing 350g at the beginning of the experiment. They were individually housed and maintained on a 12h light/dark cycle. Rats had ad libitum access to food and their hydration levels were monitored daily. All procedures were conducted in accordance with the Institutional Animal Care and Use Committee.
The apparatus consisted of a linear track with walls angled outward. An odor port, located on one end of the track, was equipped with photobeam sensors to precisely detect nose entries and was connected to an automated odor delivery system capable of repeated deliveries of multiple distinct odors. Two water ports were used for reward delivery: one located under the odor port, the other at the opposite end of the track. Timing boards (Plexon) and digital input/output boards (National Instruments) were used to measure response times and control the hardware. All aspects of the task were automated using custom Matlab scripts (MathWorks). A 96-channel Multichannel Acquisition Processor (MAP; Plexon) was used to interface with the hardware in real time and record the behavioral and electrophysiological data. Odor stimuli consisted of synthetic food extracts contained in glass jars (A, lemon; B, rum; C, anise; D, vanilla; E, banana) that were volatilized with desiccated, charcoal-filtered air (flow rate, 2 L/min). To prevent cross-contamination, separate Teflon tubing lines were used for each odor. These lines converged in a single channel at the bottom of the odor port. In addition, an air vacuum located at the top of the odor port provided constant negative pressure to quickly evacuate odor traces. Readings from a volatile organic compound detector confirmed that odors were cleared from the port 500-750 ms after odor delivery (inter-odor intervals were limited by software to 800 ms).
Naïve rats were initially trained to nosepoke and reliably hold their nose for 1.2 sec in the odor port for a water reward. Odor sequences of increasing length were then introduced in successive stages (A, AB, ABC, ABCD, and ABCDE) upon reaching behavioral criterion of 80% correct over three sessions per training stage. In each stage, rats were trained to correctly identify each presented item as either InSeq (by holding their nosepoke response for 1.2 sec to receive reward) or OutSeq (by withdrawing their nose before 1.2 sec to receive reward). There were two types of OutSeq items in the dataset, Repeats, in which an earlier item was presented a second time in the sequence (e.g., ABADE), and Skips, in which an item was presented too early in the sequence (e.g., ABDDE, which skipped over item C). Although our previous work has revealed important differences in performance and neural activity on Repeats and Skips, this distinction was beyond the scope of the present analyses and not further discussed here. Note that OutSeq items could be presented in any sequence position except the first (i.e., sequences always began with odor A, though odor A could also be presented later in the sequence as a Repeat). After reaching criterion performance on the five-item sequence (80% correct on both InSeq and OutSeq items), rats underwent microdrive implantation.
Each chronically implanted microdrive contained 20 independently drivable tetrodes. Voltage signals recorded from the tetrode tips were referenced to a ground screw positioned over the cerebellum, and filtered for single-unit activity (154 Hz to 8.8 kHz). The neural signals were then amplified (10,000-32,000), digitized (40 kHz) and recorded to disk with the data acquisition system (MAP, Plexon). Action potentials from individual neurons were manually isolated off-line using a combination of standard waveform features across the four channels of each tetrode (Offline Sorter, Plexon). Proper isolation was verified using interspike interval distributions for each isolated unit (assuming a minimum refractory period of 1 ms) and cross-correlograms for each pair of simultaneously recorded units on the same tetrode. To confirm recording sites, current was passed through the electrodes before perfusion (0.9% PBS followed by 4% paraformaldehyde) to produce small marking lesions, which were subsequently localized on Nissl-stained tissue slices.
3.1 Class imbalance
There are two major branches of methods for dealing with the imbalanced learning problem: sampling (data-level) methods and cost-sensitive (model-level) methods. The goal of imbalanced learning is usually to improve the performance on the underrepresented (minor) classes, sometimes at the cost of slightly worse performance on the dominating classes.
Random oversampling and undersampling are the most popular sampling methods since they are easy and straightforward to implement. However, undersampling has the problem of losing important information regarding the major class, while oversampling might result in overfitting [12, 13].
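The two sampling strategies can be sketched as follows. This is a minimal illustration on hypothetical labels (90 major, 10 minor), not the procedure of any particular reference; the function name `random_resample` and its arguments are assumptions made for this example.

```python
import numpy as np

def random_resample(labels, mode="over", seed=0):
    """Return row indices that balance the classes by random resampling.

    mode="over":  keep every sample and duplicate minority samples at random.
    mode="under": keep a random subset of each class, sized to the rarest class.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    picked = []
    for c, n_c in zip(classes, counts):
        idx = np.flatnonzero(labels == c)
        if mode == "over":
            # Duplicates are exact copies, which is why overfitting is a risk.
            extra = rng.choice(idx, size=counts.max() - n_c, replace=True)
            picked.append(np.concatenate([idx, extra]))
        else:
            # Discarded majority samples are lost information.
            picked.append(rng.choice(idx, size=counts.min(), replace=False))
    return np.concatenate(picked)

labels = np.array([0] * 90 + [1] * 10)   # hypothetical imbalanced labels
over_idx = random_resample(labels, mode="over")
under_idx = random_resample(labels, mode="under")
print(np.bincount(labels[over_idx]), np.bincount(labels[under_idx]))
```

Oversampling here returns 180 indices with 80 duplicated minority rows, while undersampling keeps only 20 of the original 100 rows.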
Cost-sensitive methods usually associate a higher cost with misclassifying minor classes than with misclassifying major classes [11]. In practice, however, finding a reasonable cost function is not trivial.
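As one concrete instance of a cost-sensitive scheme, the sketch below weights each class inversely to its frequency inside a log loss. The inverse-frequency rule is a common heuristic chosen here only for illustration; it is an assumption of this example, and the function names are hypothetical.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Class weights N / (K * n_c): rarer classes receive larger weights."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

def weighted_log_loss(y_true, prob_true_class, class_weights):
    """Weighted average of -log p(true class), one weight per sample."""
    w = np.array([class_weights[c] for c in y_true.tolist()])
    return float(np.sum(-w * np.log(prob_true_class)) / np.sum(w))

labels = np.array([0] * 90 + [1] * 10)          # hypothetical labels
weights = inverse_frequency_weights(labels)
# A hypothetical classifier that is confident on class 0 and poor on class 1.
prob_true = np.where(labels == 0, 0.95, 0.30)
plain = float(np.mean(-np.log(prob_true)))
weighted = weighted_log_loss(labels, prob_true, weights)
print(weights, plain, weighted)
```

With these weights the poorly predicted minor class contributes half of the total loss instead of one tenth, so the weighted loss is larger than the unweighted one.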
In contrast to these alternatives, our proposed method takes a probabilistic perspective. Instead of ‘up-weighting’ the cost of misclassifying minor classes, our method intrinsically ‘up-weights’ them in the prior assumed for the latent representation by using a DPP (this is illustrated in the Results section). Additionally, while most existing methods designed for class imbalance focus on binary classification problems, our approach can be easily applied to general classification problems with multiple classes.
3.2 Variational auto-encoder
Variational auto-encoder (VAE) [15] is an unsupervised model for learning a low-dimensional latent representation of a given dataset. It can also be used for learning deep generative models to generate realistic data such as images and texts. Compared to a standard auto-encoder, the variational auto-encoder imposes a prior on the latent variable instead of treating it as a deterministic term. The prior can also be regarded as a regularizer on the latent variable $z$.
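More formally, we are interested in a dataset, $X = \{x_i\}_{i=1}^{N}$, and its latent representation $z$. In the framework of graphical models, the latent variable $z$ is generated from a prior $p_\theta(z)$, and the data $x$ is generated from the model $p_\theta(x \mid z)$. VAE aims at efficient marginal inference for $x$ as well as finding an approximation for the posterior distribution of $z$ using a simplified model $q_\phi(z \mid x)$. Unlike mean-field variational Bayes, VAE simultaneously maximizes the marginal likelihood of $x$ (evidence) and minimizes the KL-divergence (KLD) between $q_\phi(z \mid x)$ and $p_\theta(z \mid x)$. This is achieved by maximizing the evidence lower bound (ELBO) with respect to $\theta$ and $\phi$:
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right).$$
This is equivalent to minimizing $-\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]$, called the reconstruction loss, while minimizing the KLD between $q_\phi(z \mid x)$ and $p_\theta(z)$ at the same time.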
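The two loss terms can be written out numerically. The sketch below assumes a diagonal Gaussian approximate posterior, a standard normal prior (for which the KLD has a closed form), and binary data with a Bernoulli likelihood; these modeling choices and all function names are assumptions of this example.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

def bernoulli_recon_loss(x, x_prob, eps=1e-7):
    """Negative Bernoulli log-likelihood of x under decoder probabilities x_prob."""
    x_prob = np.clip(x_prob, eps, 1.0 - eps)
    return -np.sum(x * np.log(x_prob) + (1.0 - x) * np.log(1.0 - x_prob), axis=-1)

def negative_elbo(x, x_prob, mu, log_var):
    """Reconstruction loss plus KLD; minimizing this maximizes the ELBO."""
    return bernoulli_recon_loss(x, x_prob) + gaussian_kl(mu, log_var)

x = np.array([1.0, 0.0, 1.0, 1.0])          # hypothetical binary observation
x_prob = np.array([0.9, 0.2, 0.8, 0.7])     # hypothetical decoder output
mu = np.array([0.5, -0.5])                  # hypothetical encoder mean
log_var = np.array([0.0, 0.0])              # hypothetical encoder log-variance
print(negative_elbo(x, x_prob, mu, log_var))
```

The KLD term vanishes when the approximate posterior equals the prior (zero mean, unit variance) and grows as the encoder output moves away from it.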
The basic steps of a VAE involve passing observed data, $x$, into an encoder model to learn $q_\phi(z \mid x)$. Then, a latent variable $z$ is sampled from $q_\phi(z \mid x)$ and fed into a decoder model to learn $p_\theta(x \mid z)$, which can be used to reconstruct new $x$. VAE involves learning both the inference model $q_\phi(z \mid x)$ and the generative model $p_\theta(x \mid z)$. The inference model $q_\phi(z \mid x)$, which is called the encoder, is an approximate posterior distribution of the latent variable given the data. Meanwhile, the generative model $p_\theta(x \mid z)$, which is called the decoder, maps latent variables to a generative distribution for $x$. Both decoder and encoder models can be approximated by neural networks, which are capable of learning most complex and nonlinear patterns.
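A single forward pass through these steps can be sketched with untrained, single-layer stand-ins for the encoder and decoder. The layer sizes, the weight dictionaries `phi` and `theta`, and the use of the reparameterization $z = \mu + \sigma \epsilon$ are assumptions made for illustration; a real model would use deeper networks trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_x, d_z = 8, 50, 2   # hypothetical batch size, data dimension, latent dimension

# Untrained weights: phi parameterizes the encoder, theta the decoder.
phi = {"W_mu": rng.normal(0, 0.1, (d_x, d_z)), "W_lv": rng.normal(0, 0.1, (d_x, d_z))}
theta = {"W": rng.normal(0, 0.1, (d_z, d_x))}

def encode(x):
    """Encoder: map data to the mean and log-variance of the approximate posterior."""
    return x @ phi["W_mu"], x @ phi["W_lv"]

def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z):
    """Decoder: map latent variables to Bernoulli probabilities for each data entry."""
    return 1.0 / (1.0 + np.exp(-(z @ theta["W"])))

x = rng.integers(0, 2, size=(n, d_x)).astype(float)   # hypothetical binary data
mu, log_var = encode(x)
z = sample_latent(mu, log_var, rng)
x_prob = decode(z)
print(z.shape, x_prob.shape)
```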
3.3 Determinantal Point Process (DPP) and k-DPP
DPP is a point process for modeling repulsive interactions between samples [16]. It defines a distribution over subsets of a fixed ground set, and assigns higher probability to more diverse subsets. In a discrete setting, let us denote the ground set as $\mathcal{Y} = \{1, 2, \ldots, N\}$. Then, $\mathcal{P}_L$ is defined to be a determinantal point process if for every $A \subseteq \mathcal{Y}$ we have
$$\mathcal{P}_L(Y = A) \propto \det(L_A),$$
where $Y$ is a subset randomly drawn from $\mathcal{Y}$ according to $\mathcal{P}_L$. $L$ is a kernel matrix: $L \in \mathbb{R}^{N \times N}$, and $L_A$ is its submatrix corresponding to all entries in $A$. For example, if $A = \{i, j\}$, where $i, j \in \mathcal{Y}$, then:
$$\mathcal{P}_L(Y = A) \propto L_{ii} L_{jj} - L_{ij} L_{ji}.$$
Note that because the likelihood is proportional to the determinant of $L_A$, it is also proportional to the square of the volume spanned by the element vectors. Therefore, the likelihood becomes smaller for subsets with similar elements.
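The effect of similarity on the determinant can be checked on two-element subsets. The sketch below uses a Gaussian kernel with unit bandwidth, which is an assumption of this example rather than a kernel specified in the text; for two points the determinant is $1 - k^2$, where $k$ is the off-diagonal kernel value.

```python
import numpy as np

def gaussian_kernel_matrix(points, bandwidth=1.0):
    """Kernel matrix with entries exp(-||p_i - p_j||^2 / (2 * bandwidth^2))."""
    sq_dist = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth**2))

similar = np.array([[0.0, 0.0], [0.1, 0.0]])   # two nearly identical items
diverse = np.array([[0.0, 0.0], [3.0, 0.0]])   # two well-separated items

det_similar = np.linalg.det(gaussian_kernel_matrix(similar))
det_diverse = np.linalg.det(gaussian_kernel_matrix(diverse))
print(det_similar, det_diverse)
```

The nearly identical pair yields a determinant close to 0, while the well-separated pair yields a determinant close to 1, so the diverse subset is far more probable under the DPP.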
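In a continuous setting, denote the ground set as $\Omega \subseteq \mathbb{R}^d$. Similar to the discrete setting, we have a positive definite kernel function $L: \Omega \times \Omega \to \mathbb{R}$. For any finite subset $A \subset \Omega$, we have $\mathcal{P}_L(A) \propto \det(L_A)$, where $L_A$ is the matrix with entries $L(x, y)$ for $x, y \in A$.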
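In some cases, it is necessary to fix the subset size for every draw. A determinantal point process over subsets with cardinality $k$ is denoted as $k$-DPP. For the discrete setting, the likelihood is:
$$\mathcal{P}_L^k(Y = A) = \frac{\det(L_A)}{e_k(\lambda_1, \ldots, \lambda_N)}, \quad |A| = k,$$
where $\lambda_1, \ldots, \lambda_N$ are the eigenvalues of $L$, and $e_k$ is the $k$th elementary symmetric polynomial. Similarly, for the continuous setting we have
$$\mathcal{P}_L^k(A) = \frac{\det(L_A)}{e_k(\lambda_{1:\infty})}.$$
However, the term $e_k(\lambda_{1:\infty})$ is generally infeasible to evaluate.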
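For the discrete case, the normalization can be verified directly on a small example. The sketch below computes the $k$th elementary symmetric polynomial of the eigenvalues with the standard recursion $e_j \leftarrow e_j + \lambda\, e_{j-1}$ and checks that the resulting probabilities over all size-$k$ subsets sum to one. The random kernel matrix and the function names are assumptions of this example.

```python
import itertools
import numpy as np

def elementary_symmetric(eigvals, k):
    """k-th elementary symmetric polynomial of eigvals via the one-pass recursion."""
    e = np.zeros(k + 1)
    e[0] = 1.0
    for lam in eigvals:
        # Update in descending order so each eigenvalue enters every product once.
        for j in range(k, 0, -1):
            e[j] += lam * e[j - 1]
    return e[k]

def k_dpp_prob(L, subset):
    """Probability of a fixed-size subset under a discrete k-DPP with kernel L."""
    subset = list(subset)
    eigvals = np.linalg.eigvalsh(L)
    sub = L[np.ix_(subset, subset)]
    return np.linalg.det(sub) / elementary_symmetric(eigvals, len(subset))

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
L = B @ B.T                      # hypothetical positive semidefinite kernel matrix
total = sum(k_dpp_prob(L, s) for s in itertools.combinations(range(5), 2))
print(total)
```

The check works because the sum of all $k \times k$ principal minors of $L$ equals $e_k$ of its eigenvalues. In the continuous case there are infinitely many eigenvalues, which is why the normalizer cannot be evaluated this way.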
4.1 Diverse latent variables using a k-DPP prior
In the original VAE model, a standard normal prior is used for the latent variable $z$. Instead, we propose to use a continuous k-DPP prior for the continuous latent space, with the cardinality set to be the sample size $N$. This way, we have the following prior for the latent variables $Z = \{z_1, \ldots, z_N\}$:
$$p(Z) = \frac{\det(L_Z)}{e_N(\lambda_{1:\infty})},$$
where $L_Z$ is a kernel matrix.
Notice that the loss function of VAE is composed of the reconstruction loss and the KLD loss. For DPP-VAE, the reconstruction loss remains the same, while the KLD loss is modified to be
$$\mathrm{KL}\left(q_\phi(Z \mid X) \,\|\, p(Z)\right) = \mathbb{E}_{q_\phi(Z \mid X)}\left[\log q_\phi(Z \mid X)\right] - \mathbb{E}_{q_\phi(Z \mid X)}\left[\log \det(L_Z)\right] + \log e_N(\lambda_{1:\infty}).$$
That is, to avoid a large penalty from the KLD loss, the approximating distribution $q_\phi(Z \mid X)$ needs to generate more diverse (hence, more balanced) latent variables across all classes. This is due to the fact that samples from the same class tend to be more similar compared to samples from different classes. As a result, when most samples are from the same class (more likely from the dominating class), the term $\det(L_Z)$ will become smaller and the KLD penalty larger.
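The way the determinant term penalizes imbalanced batches can be illustrated on synthetic latent points. The sketch below assumes a Gaussian kernel with unit bandwidth, a small diagonal jitter for numerical stability, and two hypothetical class clusters in a 2-dimensional latent space; none of these settings are taken from the experiments.

```python
import numpy as np

def log_det_kernel(z, bandwidth=1.0, jitter=1e-6):
    """log det of the Gaussian kernel matrix built from the latent batch z."""
    sq_dist = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    L = np.exp(-sq_dist / (2.0 * bandwidth**2)) + jitter * np.eye(len(z))
    _, logdet = np.linalg.slogdet(L)
    return logdet

rng = np.random.default_rng(0)
centers = np.array([[-3.0, 0.0], [3.0, 0.0]])   # hypothetical class centers

def sample_batch(n0, n1):
    """Draw n0 latent points around class 0 and n1 around class 1."""
    z0 = centers[0] + 0.5 * rng.standard_normal((n0, 2))
    z1 = centers[1] + 0.5 * rng.standard_normal((n1, 2))
    return np.vstack([z0, z1])

balanced = log_det_kernel(sample_batch(10, 10))
imbalanced = log_det_kernel(sample_batch(19, 1))
print(balanced, imbalanced)
```

Both batches contain 20 points, but crowding 19 of them into one cluster makes the kernel matrix closer to singular, so its log-determinant is lower and the corresponding penalty in the loss is higher.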
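The other components of our model remain similar to the conventional VAE. The approximate posterior distribution for the latent variable $z$ is modeled as a normal distribution: $q_\phi(z \mid x) = \mathcal{N}\left(\mu_\phi(x), \operatorname{diag}(\sigma^2_\phi(x))\right)$. The generative model can be written as $p_\theta(x \mid z)$. For example, if $x$ is continuous, we can use $p_\theta(x \mid z) = \mathcal{N}\left(\mu_\theta(z), \operatorname{diag}(\sigma^2_\theta(z))\right)$. If $x$ is binary, we have $p_\theta(x \mid z) = \mathrm{Bernoulli}(\pi_\theta(z))$. Note that the means, variances, and Bernoulli probabilities are modeled by neural networks as nonlinear functions of $x$ ($\phi$) and $z$ ($\theta$), where $\phi$ and $\theta$ are the weight vectors. The neural network structure of our method is shown in Figure 2.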
Similar to obtaining the reconstruction loss, we use Monte Carlo estimates of the expectation of $\log \det(L_Z)$ with respect to $q_\phi(Z \mid X)$. More specifically, we choose a positive definite kernel function $L$ as suggested by [17, 18], for which the eigenvalues $\lambda_n$ can be obtained in closed form.
Note that the normalization term $e_N(\lambda_{1:\infty})$ in the KLD loss has no explicit form, but it admits lower and upper bounds. To keep the KLD loss positive, we can use the upper bound of the normalization term for approximation. Moreover, the required elementary symmetric polynomials can be computed efficiently.
In this section, we apply our proposed method, henceforth called DPP-VAE, to two different problems: 1) MNIST classification, and 2) Bayesian neural decoding of the odor experiment discussed in the introduction section.
5.1 Two-class MNIST data
In this example, VAE and DPP-VAE are trained to draw MNIST digits. We show the effectiveness of DPP-VAE in terms of balancing the class ratios of the generated data and classification accuracy.
We first compare standard VAE and DPP-VAE on a binary classification task based on MNIST data. The training set consists of 5000 MNIST handwritten digits from the classes ‘0’ and ‘1’. For the balanced case, there are 2500 Class 0 and 2500 Class 1 observations. We then vary the imbalance ratio from 1:1 to 1:1000, where digit ‘1’ is considered as the minor class. The test set is a balanced dataset with 500 Class 0 and 500 Class 1 observations.
We select the latent dimension to be 20. Both methods are trained for 10 epochs with the batch size set to 100. A Monte Carlo sample of size 1 is used since the batch size is large enough. A two-layer convolutional and deconvolutional neural network with the ReLU activation function is used for encoding and decoding the images. The latent samples are then used for classification. To this end, we use the learned features in a simple logistic regression model with optimal hyperparameters selected through cross-validation.
The performance of the two methods on balanced test sets is shown in Figure 3. The results are averaged over 20 independent runs. As we can see, for a given computation time, the DPP-VAE model tends to achieve a significantly higher classification accuracy rate compared to the standard VAE. The improvement becomes more obvious as the imbalance ratio increases.
|Class ratio||Training (%)||VAE (%)||DPP-VAE (%)|
Balancing data generation
To further investigate our method, we use both VAE and DPP-VAE to generate digits under three different scenarios: 1:10 ratio, 1:100 ratio, and 1:1000 ratio. Here, the parameter settings are the same as before.
Synthetic MNIST data (digit ‘0’ and ‘1’) are generated by standard VAE and DPP-VAE separately. For this, random latent vectors from standard normal distribution are fed into the trained decoder network to generate handwritten ‘0’ and ‘1’ digits. The same latent vectors are used for both methods. The total training data is set to 5000.
Results are displayed in Table 1. As we can see, the percentage of the minor class in the data generated by our model is substantially higher than its percentage in the training data, and the effect is more pronounced as the imbalance ratio increases: a 1.9-fold increase for the 1:10 ratio (17.7% versus 9.1%), a 3.7-fold increase for the 1:100 ratio (3.68% versus 1%), and a 9.5-fold increase for the 1:1000 ratio (0.9469% versus 0.0999%). This shows that our method is up-weighting the minor class.
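The reported fold increases can be reproduced from the percentages above. The sketch assumes that the reference value for a 1:r ratio is the minor-class share of the training data, 100/(1 + r) percent, which matches the quoted 9.1%, 1%, and 0.0999% after rounding.

```python
# Minor-class share (%) in data generated by DPP-VAE, keyed by imbalance ratio r (1:r).
dpp_vae_share = {10: 17.7, 100: 3.68, 1000: 0.9469}

fold_increase = {}
for r, generated in dpp_vae_share.items():
    training_share = 100.0 / (1 + r)        # minor-class share (%) in the training data
    fold_increase[r] = round(generated / training_share, 1)

print(fold_increase)
```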
5.2 Neural decoding
We now apply our proposed method to the odor experiment, which was discussed above and was the original inspiration for this research. While most existing methods for class imbalance focus on binary classification problems, here we can show that our method can be easily applied to multi-class problems, which are more challenging in general. We also show that our proposed method can improve latent representation learning by avoiding redundancy in the latent space.
As discussed above, in our experiment, rats perform a sequence memory task and are expected to recognize five different odors presented in a specific order: A, B, C, D, and E. Here, we are interested in decoding their corresponding neural spike signals for odor classification. Neural decoding refers to the mapping from neural activities to stimuli (here, odors). In our experiment, a sequence of trials always starts with presenting odor A and terminates early once the rat makes a mistake. Therefore, there are more class A trials than trials of other classes, and decoding results are highly affected by this imbalance. More specifically, there are 58 trials for odor A, 41 trials for odor B, 37 trials for odor C, 32 trials for odor D, and 26 trials for odor E. Previous studies have shown that a given odor is most related to the neural activities occurring during the 0.15s-0.4s time window after odor presentation. Thus, neural spike data during that time window are used for training both VAE and DPP-VAE models along with the multinomial logit model that uses the latent features as predictors.
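Extracting the input features for one trial amounts to counting each neuron's spikes inside the analysis window. The sketch below assumes that spike times are stored per neuron in seconds relative to odor presentation; the data layout and the function name are hypothetical.

```python
import numpy as np

def window_spike_counts(spike_times, t_start=0.15, t_end=0.40):
    """Count spikes per neuron in [t_start, t_end), times in seconds from odor onset."""
    counts = []
    for times in spike_times:
        t = np.asarray(times, dtype=float)
        counts.append(int(np.sum((t >= t_start) & (t < t_end))))
    return np.array(counts)

# One hypothetical trial with three neurons (the third is silent).
trial = [[-0.50, 0.16, 0.20, 0.39, 0.90], [0.05, 0.41], []]
counts = window_spike_counts(trial)
print(counts)
```

Stacking these count vectors over trials gives a trials-by-neurons matrix that can be passed to the encoder.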
The average performance based on a 6-fold cross-validation is displayed in Table 2. The test data for each fold is selected to be a balanced set, with 4 trials for each class. As the results show, our proposed method improves the performance (e.g., F1 score) of the minor classes (C, D and E), without negatively affecting the performance on the major class, odor A (although, the accuracy rate for odor B dropped slightly). The overall performance has also improved substantially.
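The per-class F1 score used to compare the methods can be computed from true and predicted labels as sketched below. The labels in the example are made up for illustration and are unrelated to the results in Table 2.

```python
import numpy as np

def per_class_f1(y_true, y_pred, classes):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = {}
    for c in classes:
        tp = int(np.sum((y_pred == c) & (y_true == c)))
        fp = int(np.sum((y_pred == c) & (y_true != c)))
        fn = int(np.sum((y_pred != c) & (y_true == c)))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

# Hypothetical balanced test fold with 4 trials for each of two odors.
y_true = list("AAAABBBB")
y_pred = list("AAABBBBA")
scores = per_class_f1(y_true, y_pred, classes=["A", "B"])
print(scores)
```

Because each test fold is balanced, a per-class score of this kind is not inflated by the larger number of odor A trials in the training data.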
Visualization for sequence memory replay
Besides neural decoding, another fundamental question that is of immense interest in neuroscience is sequence memory replay. That is, in our experiment we are interested to find out whether the neural activities corresponding to the five odors are replayed in the sequence of ABCDE during each session. Here, we use our method to investigate this phenomenon for the designed nonspatial task, which is quite different from typical experiments discussed in the literature.
The general framework, as before, is to use neural activities during the 0.15-0.4s time window relative to the nose-poke as our training data, and predict/reconstruct odors using neural activities during other time windows within each trial. We expect to identify some sort of replay during the -2s to 2s time frame, regarding the odor presentation as time 0. The neural activities outside of the 0.15-0.4s window are treated as the test data.
More specifically, we first train VAE and DPP-VAE using the 0.15-0.4s neural data as before. We keep the latent dimension at 20 since VAE works better with a moderate dimension. Then, for the purpose of visualization, we use the first two principal components of the latent representation to train a multinomial logit model. Note that applying these linear models to latent variables is reasonable since we expect that the encoding process in VAE and DPP-VAE maps nonlinear patterns in the original space to relatively linear patterns in the latent space. The decision boundaries of the classification model can then be visualized using a 2-dimensional plot.
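The projection and classification steps can be sketched on synthetic latent vectors. The example below computes the first two principal components by singular value decomposition and fits a multinomial logit model by plain gradient descent; the synthetic clusters, the standardization of the two components, the learning rate, and all function names are assumptions of this sketch.

```python
import numpy as np

def pca_project(Z, n_components=2):
    """Project rows of Z onto the leading principal components (via SVD)."""
    mean = Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z - mean, full_matrices=False)
    components = Vt[:n_components]
    return (Z - mean) @ components.T, mean, components

def fit_multinomial_logit(X, y, n_classes, lr=0.5, n_iter=500):
    """Softmax regression with an intercept, fitted by gradient descent."""
    X1 = np.hstack([X, np.ones((len(X), 1))])
    W = np.zeros((X1.shape[1], n_classes))
    Y = np.eye(n_classes)[y]                     # one-hot targets
    for _ in range(n_iter):
        logits = X1 @ W
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        W -= lr * X1.T @ (P - Y) / len(X)
    return W

def predict(W, X):
    """Assign each row of X to the class with the largest linear score."""
    X1 = np.hstack([X, np.ones((len(X), 1))])
    return np.argmax(X1 @ W, axis=1)

rng = np.random.default_rng(0)
n_per_class, d_z, n_classes = 30, 20, 3          # hypothetical sizes
centers = rng.normal(0.0, 4.0, size=(n_classes, d_z))
Z = np.vstack([c + rng.standard_normal((n_per_class, d_z)) for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)

Z2, mean, components = pca_project(Z)
X2 = Z2 / Z2.std(axis=0)                         # put both components on one scale
W = fit_multinomial_logit(X2, y, n_classes)
train_acc = float(np.mean(predict(W, X2) == y))
print(Z2.shape, train_acc)
```

Evaluating `predict` on a grid of 2-dimensional points gives the regions separated by the decision boundaries; latent vectors from other time windows would be projected with the stored `mean` and `components` before classification.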
We focus on the replay pattern during trials associated with odor B presentation. To investigate sequence replay, we simply examine a series of time windows for all B trials. Within each time window, the 2-dimensional representations for all odor B trials are visualized in the same plot. The representations in the test windows are generated by using the models that are trained on the 0.15-0.4s window.
Figure 4 shows the results based on VAE and DPP-VAE for a subset of windows during the presentation of odor B. The latent space is divided into different regions based on the decision boundaries obtained from the multinomial logit model. Each point on the plot corresponds to one trial projected into the latent space. As we can see, before the odor presentation (-1s), the trials are randomly scattered. The movement towards region B starts around 0s (at the time of odor presentation), indicating the rat’s anticipation of the upcoming odor. Most trials move into region B around 0.15s, which is expected since this is the training window. However, the most interesting part of our results is what happens right after that, around 0.7s, which is a test window: the trials clearly move into region C, which corresponds to the next odor in the sequence, while the rat is still in trial B (recall that the trial ends after 1.2s). This is evidence of sequence replay, which has not been shown for this type of nonspatial experiment in the past. More importantly, note that while the overall neural patterns throughout the course of a trial are somewhat similar between VAE and DPP-VAE, the patterns are clearer using our proposed DPP-VAE model: around 1s after the odor presentation, our method shows that most trials have moved into region C.
In this work, we have proposed a novel latent representation learning method based on diversity-encouraging Determinantal Point Processes to alleviate the class imbalance problem and to provide a clearer latent representation of the data. To this end, we have modified the standard variational auto-encoder method by using a continuous k-DPP as the prior for latent variables. Further, we have developed the required inference algorithm to implement this model. Using synthetic and real data, we have shown that our proposed method can in fact improve latent representation learning, as well as the prediction accuracy of classification models (especially for minor classes). Our work can contribute to many areas of machine learning, especially when samples of rare but important classes are hard to collect. Additionally, our proposed method can lead to finding novel phenomena in the field of neuroscience by providing unprecedented insight into the underlying structure of neural data.
This work is supported by NSF grant DMS 1622490 and NIH grant R01 MH115697.
-  W.E. Skaggs and B.L. McNaughton. Replay of neuronal firing sequences in rat hippocampus during sleep following spatial experience. Science, 271(5257):1870, 1996.
-  M.R. Mehta, M.C. Quirk, and M.A. Wilson. Experience-dependent asymmetric shape of hippocampal receptive fields. Neuron, 25(3):707 – 715, 2000.
-  G. Dragoi and G. Buzsaki. Temporal encoding of place sequences by hippocampal cell assemblies. Neuron, 50:145–157, 2006.
-  D.J. Foster and M.A. Wilson. Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature, 440(7084):680–683, 2006.
-  A.S. Gupta, M.A.A. van der Meer, D. S Touretzky, and A.D. Redish. Segmentation of spatial experience by hippocampal theta sequences. Nature neuroscience, 15(7):1032–1039, 2012.
-  H.B. Eichenbaum. Time cells in the hippocampus: a new dimension for mapping memories. Nature Reviews Neuroscience, 15:732– 744, 2014.
-  T.A. Allen, A.M. Morris, A.T. Mattfeld, C.E. Stark, and N.J. Fortin. A sequence of events model of episodic memory shows parallels in rats and humans. Hippocampus, 24:1178–1188, 2014.
-  T.A. Allen, A.M. Morris, S.M. Stark, N.J. Fortin, and C.E. Stark. Memory for sequences of events impaired in typical aging. Learn Mem, 22:138–148, 2015.
-  Timothy A Allen, Daniel M Salz, Sam McKenzie, and Norbert J Fortin. Nonspatial sequence coding in ca1 neurons. Journal of Neuroscience, 36(5):1547–1563, 2016.
-  U. Mitzdorf et al. Current source-density method and application in cat cerebral cortex: investigation of evoked potentials and EEG phenomena. American Physiological Society, 1985.
-  Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9):1263–1284, 2009.
-  Robert C Holte, Liane Acker, Bruce W Porter, et al. Concept learning and the problem of small disjuncts. In IJCAI, volume 89, pages 813–818. Citeseer, 1989.
-  David Mease, Abraham J Wyner, and Andreas Buja. Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research, 8(Mar):409–439, 2007.
-  Charles Elkan. The foundations of cost-sensitive learning. In International joint conference on artificial intelligence, volume 17, pages 973–978. Lawrence Erlbaum Associates Ltd, 2001.
-  Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
-  Alex Kulesza, Ben Taskar, et al. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning, 5(2–3):123–286, 2012.
-  Raja Hafiz Affandi, Emily Fox, Ryan Adams, and Ben Taskar. Learning the parameters of determinantal point process kernels. In International Conference on Machine Learning, pages 1224–1232, 2014.
-  Gregory E Fasshauer and Michael J McCourt. Stable evaluation of Gaussian radial basis function interpolants. SIAM Journal on Scientific Computing, 34(2):A737–A762, 2012.