There has been significant recent effort directed at connecting deep learning to kernel machines Poggio15; Bietti; Mairal; DKL. Specifically, it has been recognized that a deep neural network may be viewed as constituting a feature mapping , for input data . The nonlinear function , with model parameters , has an output that corresponds to a -dimensional feature vector; may be viewed as a mapping of to a Hilbert space , where . The final layer of deep neural networks typically corresponds to an inner product , with weight vector ; for a vector output, there are multiple , with defining the -th component of the output. For example, in a deep convolutional neural network (CNN) CNN, is a function defined by the multiple convolutional layers, the output of which is a -dimensional feature map; represents the fully-connected layer that imposes inner products on the feature map. Learning and , , the cumulative neural network parameters, may be interpreted as learning within a reproducing kernel Hilbert space (RKHS) RKHS, with the function in ; represents the mapping from the space of the input to , with associated kernel , where is another input.
Insights garnered about neural networks from the perspective of kernel machines provide valuable theoretical underpinnings, helping to explain why such models work well in practice. As an example, the RKHS perspective helps explain invariance and stability of deep models, as a consequence of the smoothness properties of an appropriate RKHS to variations in the input Bietti; Mairal. Further, such insights provide the opportunity for the development of new models.
Most prior research on connecting neural networks to kernel machines has assumed a single input , e.g., image analysis in the context of a CNN Poggio15; Bietti; Mairal. However, the recurrent neural network (RNN) has also received renewed interest for analysis of sequential data. For example, long short-term memory (LSTM) lstm; odyssey and the gated recurrent unit (GRU) gru have become fundamental elements in many natural language processing (NLP) pipelines RNN_arch; gru; QandA. In this context, a sequence of data vectors is analyzed, and the aforementioned single-input models are inappropriate.
In this paper, we extend to recurrent neural networks (RNNs) the concept of analyzing neural networks from the perspective of kernel machines. Leveraging recent work on recurrent kernel machines (RKMs) for sequential data rkm, we make new connections between RKMs and RNNs, showing how RNNs may be constructed in terms of recurrent kernel machines, using simple filters. We demonstrate that these recurrent kernel machines are composed of a memory cell that is updated sequentially as new data come in, as well as in terms of a (distinct) hidden unit. A recurrent model that employs a memory cell and a hidden unit evokes ideas from the LSTM. However, within the recurrent kernel machine representation of a basic RNN, the rate at which memory fades with time is fixed. To impose adaptivity within the recurrent kernel machine, we introduce adaptive gating elements on the updated and prior components of the memory cell, and we also impose a gating network on the output of the model. We demonstrate that the result of this refinement of the recurrent kernel machine is a model closely related to the LSTM, providing new insights on the LSTM and its connection to kernel machines.
Continuing with this framework, we also introduce new concepts to models of the LSTM type. The refined LSTM framework may be viewed as convolving learned filters across the input sequence and using the convolutional output to constitute the time-dependent memory cell. Multiple filters, possibly of different temporal lengths, can be utilized, like in the CNN. One recovers the CNN CNN_text_OG; CNN_text; kim2014convolutional and Gated CNN gatedCNN models of sequential data as special cases, by turning off elements of the new LSTM setup. From another perspective, we demonstrate that the new LSTM-like model may be viewed as introducing gated memory cells and feedback to a CNN model of sequential data.
In addition to developing the aforementioned models for sequential data, we demonstrate them in an extensive set of experiments, focusing on applications in natural language processing (NLP) and in analysis of multi-channel, time-dependent local field potential (LFP) recordings from mouse brains. Concerning the latter, we demonstrate marked improvements in performance of the proposed methods relative to recently-developed alternative approaches syncnet.
2 Recurrent Kernel Network
Consider a sequence of vectors , with . For a language model, is the embedding vector for the -th word in a sequence of words. To model this sequence, we introduce , with the recurrent hidden variable satisfying
where , , , , and . In the context of a language model, the vector may be fed into a nonlinear function to predict the next word in the sequence. Specifically, the probability that corresponds to in a vocabulary of words is defined by element of vector , with bias . In classification, such as the LFP-analysis example in Section 6, is the number of classes under consideration.
We constitute the factorization , where and , often with . Hence, we may write , with ; the columns of may be viewed as time-invariant factor loadings, and represents a vector of dynamic factor scores. Let represent a column vector corresponding to the concatenation of and ; then where . Computation of corresponds to inner products of the rows of with the vector . Let be a column vector, with elements corresponding to row of . Then component of is
We view as mapping into an RKHS , and vector is also assumed to reside within . We consequently assume
where . Note that here also depends on index , which we omit for simplicity; as discussed below, will play the primary role when performing computations.
where is a Mercer kernel kernels. Particular kernel choices correspond to different functions , and is meant to represent kernel parameters that may be adjusted.
We initially focus on kernels of the form , [Footnote 1: One may also design recurrent kernels of the form rkm, as for a Gaussian kernel, but if vectors and filters are normalized (, ), then reduces to .] where is a function of parameters , , and is the implicit latent vector associated with the inner product, , . As discussed below, we will not need to explicitly evaluate or to evaluate the kernel, taking advantage of the recursive relationship in (1). In fact, depending on the choice of , the hidden vectors may even be infinite-dimensional. However, because of the relationship , for rigorous analysis should satisfy Mercer’s condition Genton2001; kernels.
The vectors are assumed to satisfy the same recurrence setup as (1), with each vector in the associated sequence assumed to be the same at each time, , associated with , . Stepping backwards in time three steps, for example, one may show
The inner product encapsulates contributions for all times further backwards, and for a sequence of length , plays a role analogous to a bias. As discussed below, for stability the repeated application of yields diminishing (fading) contributions from terms earlier in time, and therefore for large the impact of on is small.
The overall model may be expressed as
where is a memory cell at time , row of corresponds to , and operates pointwise on the components of (see Figure 1). At the start of the sequence of length , may be seen as a vector of biases, effectively corresponding to ; we henceforth omit discussion of this initial bias for notational simplicity, and because for sufficiently large its impact on is small.
Note that via the recursive process by which is evaluated in (6), the kernel evaluations reflected by are defined entirely by the elements of the sequence . Let represent the -th component in vector , and define . Then the sequence is specified by convolving in time with , denoted . Hence, the components of the sequence are completely specified by convolving with each of the filters, , , , taking an inner product of with the vector in at each time point.
In (4) we represented as ; now, because of the recursive form of the model in (1), and because of the assumption , we have demonstrated that we may express the kernel equivalently as , to underscore that it is defined entirely by the elements at the output of the convolution . Hence, we may express component of as .
Component of may be expressed
where represents component of matrix . Considering (7), the connection of an RNN to an RKHS is clear, as made explicit by the kernel . The RKHS is manifested for the final output , with the hidden now absorbed within the kernel, via the inner product (4). The feedback imposed via latent vector is constituted via update of the memory cell used to evaluate the kernel.
where , and recalling (see Figure 2a for illustration). In (8) the input to the kernel is dependent on the input elements and is now also a function of the kernel outputs at the previous time, via . However, note that is still specified entirely by the elements of , for .
3 Choice of Recurrent Kernels & Introduction of Gating Networks
3.1 Fixed kernel parameters & time-invariant memory-cell gating
The function discussed above may take several forms, the simplest of which is a linear kernel, with which (8) takes the form
where and (using analogous notation from rkm) are scalars, with for stability. The scalars and may be viewed as static (i.e., time-invariant) gating elements, with controlling weighting on the new input element to the memory cell, and controlling how much of the prior memory unit is retained; given , this means information from previous time steps tends to fade away and over time is largely forgotten. However, such a kernel leads to time-invariant decay of memory: the contribution from steps before to the current memory is , meaning that it decays at a constant exponential rate. Because the information contained at each time step can vary, this can be problematic. This suggests augmenting the model, with time-varying gating weights, with memory-component dependence on the weights, which we consider below.
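The fixed-rate fading described above can be illustrated with a minimal sketch. It assumes a scalar memory cell and uses the hypothetical names `w_new` and `w_prev` for the two static gating scalars in (9); the per-step `drive` stands in for the contribution of the new input, and the initial bias is ignored.

```python
def static_gated_memory(drive, w_new, w_prev):
    """Scalar memory cell with time-invariant gating (illustrative only).

    drive  : per-step contribution of the new input to the memory cell
    w_new  : static weight on the new input (hypothetical name)
    w_prev : static weight on the previous memory value (hypothetical name)
    """
    if not abs(w_prev) < 1.0:
        raise ValueError("w_prev must have magnitude below 1 for stability")
    c = 0.0  # initial memory; the initial bias term is ignored here
    trace = []
    for d in drive:
        # New memory = weighted new input + weighted previous memory.
        c = w_new * d + w_prev * c
        trace.append(c)
    return trace


# A single unit impulse at t = 0 followed by silence.
trace = static_gated_memory([1.0, 0.0, 0.0, 0.0], w_new=1.0, w_prev=0.25)
print(trace)  # [1.0, 0.25, 0.0625, 0.015625]
```

Each step multiplies the remaining contribution of the impulse by the same factor, regardless of what the input contains, which is the limitation that motivates time-varying gates.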
3.2 Dynamic gating networks & LSTM-like model
Recent work has shown that dynamic gating can be seen as making a recurrent network quasi-invariant to temporal warpings warp. Motivated by the form of the model in (9) then, it is natural to impose dynamic versions of and ; we also introduce dynamic gating at the output of the hidden vector. This yields the model:
are bias vectors; and. In (10), and play dynamic counterparts to and , respectively. Further, , and are vectors, constituting vector-component-dependent gating. Note that starting from a recurrent kernel machine, we have thus derived a model closely resembling the LSTM. We call this model RKM-LSTM (see Figure 2).
Concerning the update of the hidden state, in (10), one may also consider appending a hyperbolic-tangent nonlinearity: . However, recent research has suggested not using such a nonlinearity RAN; gatedCNN; Best_Both, and this is a natural consequence of our recurrent kernel analysis. Using , the model in (10) and (11) is in the form of the LSTM, except without the nonlinearity imposed on the memory cell , while in the LSTM a nonlinearity (and biases) is employed when updating the memory cell lstm; odyssey, , for the LSTM . If for all time (no output gating network), and if (no dependence on for update of the memory cell), this model reduces to the recurrent additive network (RAN) RAN.
While separate gates and were constituted in (10) and (11) to operate on the new and prior composition of the memory cell, one may also consider a simpler model with memory cell updated ; this was referred to as having a Coupled Input and Forget Gate (CIFG) in odyssey. In such a model, the decisions of what to add to the memory cell and what to forget are made jointly, obviating the need for a separate input gate . We call this variant RKM-CIFG.
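A single step of a gated cell of this kind can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the gates are taken to be sigmoids of affine functions of the concatenated input and previous hidden state, the candidate memory is linear with no bias or nonlinearity, and the coupled variant sets the input gate to one minus the forget gate. The parameter dictionary keys (`Wc`, `Wi`, `Wf`, `Wo`, `bi`, `bf`, `bo`) are hypothetical names.

```python
import math


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def affine(W, b, z):
    """Return W z + b for a list-of-rows matrix W."""
    return [sum(w * v for w, v in zip(row, z)) + bj for row, bj in zip(W, b)]


def gated_cell_step(x, h_prev, c_prev, p, coupled=False):
    """One step of an LSTM-like cell without tanh on the cell or the output."""
    z = x + h_prev                       # concatenate input and feedback
    zero = [0.0] * len(c_prev)
    c_tilde = affine(p["Wc"], zero, z)   # linear candidate, no bias (assumed)
    f = [sigmoid(a) for a in affine(p["Wf"], p["bf"], z)]  # forget gate
    o = [sigmoid(a) for a in affine(p["Wo"], p["bo"], z)]  # output gate
    if coupled:
        i = [1.0 - fj for fj in f]       # coupled input/forget gate (CIFG)
    else:
        i = [sigmoid(a) for a in affine(p["Wi"], p["bi"], z)]
    c = [ij * ct + fj * cp for ij, ct, fj, cp in zip(i, c_tilde, f, c_prev)]
    h = [oj * cj for oj, cj in zip(o, c)]  # gated emission, no tanh
    return h, c
```

With `coupled=True` the `Wi` and `bi` entries are never read, which reflects the parameter saving of the coupled variant.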
4 Extending the Filter Length
4.1 Generalized form of recurrent model
Consider a generalization of (1):
where , , and therefore the update of the hidden state [Footnote 2: Note that while the same symbol is used as in (12), clearly takes on a different meaning when .] depends on data observed time steps prior, and also on the previous hidden state . Analogous to (3), we may express
The inner product is assumed represented by a Mercer kernel, and .
Let be an -gram input with zero padding if , and be sets of filters, with the -th rows of collectively representing the -th -gram filter, with . Extending Section 2, the kernel is defined
where . Note that corresponds to the -th component output from the -gram convolution of the filters and the input sequence; therefore, similar to Section 2, we represent as , emphasizing that the kernel evaluation is a function of outputs of the convolution , here with -gram filters. Like in the CNN CNN_text_OG; CNN_text; kim2014convolutional, different filter lengths (and kernels) may be considered to constitute different components of the memory cell.
4.2 Linear kernel, CNN and Gated CNN
For the special case of and equal to a constant (, ), (15) reduces to a convolutional neural network (CNN), with a nonlinear operation typically applied subsequently to .
Rather than setting to a constant, one may impose dynamic gating, yielding the model (with )
where are distinct convolutional filters for calculating , and is a vector of biases. The form of the model in (16) corresponds to the Gated CNN gatedCNN, which we see as a special case of the recurrent model with linear kernel, and dynamic kernel weights (and without feedback, , ). Note that in (16) a nonlinear function is not imposed on the output of the convolution ; there is only dynamic gating via multiplication with , the advantages of which are discussed in gatedCNN. Further, the -gram input considered in (12) need not be consecutive. If spacings between inputs of more than 1 are considered, then the dilated convolution (e.g., as used in wavenet) is recovered.
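The gated convolution without feedback, including the dilated case, can be sketched for a single filter as below. The function names and the choice of a causal window ending at the current time step are assumptions made for illustration; `w_c` and `w_o` stand for the two distinct filters (one for the convolution output, one for the gate) and `b_o` for the gate bias.

```python
import math


def ngram_window(xs, t, n, dilation=1):
    """Concatenate n inputs ending at time t, spaced by `dilation`.

    Positions before the start of the sequence are zero-padded.
    """
    dim = len(xs[0])
    window = []
    for k in range(n - 1, -1, -1):
        idx = t - k * dilation
        window.extend(xs[idx] if idx >= 0 else [0.0] * dim)
    return window


def gated_cnn(xs, w_c, w_o, b_o, n, dilation=1):
    """Single-filter gated convolution: h_t = sigmoid(w_o.win + b_o) * (w_c.win)."""
    hs = []
    for t in range(len(xs)):
        win = ngram_window(xs, t, n, dilation)
        conv = sum(w * v for w, v in zip(w_c, win))      # linear, no nonlinearity
        pre = sum(w * v for w, v in zip(w_o, win)) + b_o
        gate = 1.0 / (1.0 + math.exp(-pre))              # dynamic output gate
        hs.append(gate * conv)
    return hs


xs = [[1.0], [2.0], [3.0]]
print(gated_cnn(xs, [1.0, 1.0], [0.0, 0.0], 0.0, n=2))              # [0.5, 1.5, 2.5]
print(gated_cnn(xs, [1.0, 1.0], [0.0, 0.0], 0.0, n=2, dilation=2))  # [0.5, 1.0, 2.0]
```

Because each output depends only on a window of the input, all time steps could be computed in parallel; the loop here is purely for readability.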
4.3 Feedback and the generalized LSTM
Now introducing feedback into the memory cell, the model in (8) is extended to
Again motivated by the linear kernel, generalization of (17) to include gating networks is
The model in (18) and (19) is similar to the LSTM, with important differences: () there is not a nonlinearity imposed on the update to the memory cell, , and therefore there are also no biases imposed on this cell update; () there is no nonlinearity on the output; and () via the convolutions with , , , and , the memory cell can take into account -grams, and the length of such sequences may vary as a function of the element of the memory cell.
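To make the role of feedback concrete, the following sketch runs a scalar-cell version of an n-gram gated recurrence over a sequence. It is an illustration under assumptions rather than a transcription of (18) and (19): each weight vector (`w_c`, `w_i`, `w_f`, `w_o`, all hypothetical names) acts on the zero-padded n-gram window concatenated with the previous hidden value, and the gates are sigmoids.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))


def ngram_gated_recurrence(xs, n, w_c, w_i, w_f, w_o, b_i=0.0, b_f=0.0, b_o=0.0):
    """Scalar memory cell driven by n-gram windows plus feedback.

    Each weight vector has length n * dim + 1: the n-gram window followed
    by the previous hidden value h_{t-1}.
    """
    dim = len(xs[0])
    h, c, hs = 0.0, 0.0, []
    for t in range(len(xs)):
        z = []
        for k in range(n - 1, -1, -1):                 # oldest input first
            z.extend(xs[t - k] if t - k >= 0 else [0.0] * dim)
        z.append(h)                                    # feedback term
        i = sigmoid(dot(w_i, z) + b_i)
        f = sigmoid(dot(w_f, z) + b_f)
        o = sigmoid(dot(w_o, z) + b_o)
        c = i * dot(w_c, z) + f * c                    # no tanh, no cell bias
        h = o * c                                      # no tanh on the output
        hs.append(h)
    return hs


zeros = [0.0, 0.0, 0.0]
xs = [[1.0], [1.0]]
print(ngram_gated_recurrence(xs, 2, [1.0, 1.0, 0.0], zeros, zeros, zeros))  # [0.25, 0.625]
print(ngram_gated_recurrence(xs, 2, [1.0, 1.0, 1.0], zeros, zeros, zeros))  # [0.25, 0.6875]
```

The two calls differ only in the weight on the feedback term, so the change in the second output isolates the effect of the previous hidden value on the memory cell.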
5 Related Work
In our development of the kernel perspective of the RNN, we have emphasized that the form of the kernel yields a recursive means of kernel evaluation that is only a function of the elements at the output of the convolutions or , for 1-gram and -gram filters, respectively. This underscores that at the heart of such models, one performs convolutions between the sequence of data and filters or . Consideration of filters of length greater than one (in time) yields a generalization of the traditional LSTM. The dependence of such models entirely on convolutions of the data sequence and filters is evocative of CNN and Gated CNN models for text CNN_text_OG; CNN_text; kim2014convolutional; gatedCNN, with this made explicit in Section 4.2 as a special case.
The Gated CNN in (16) and the generalized LSTM in (18)-(19) both employ dynamic gating. However, the generalized LSTM explicitly employs a memory cell (and feedback), and hence offers the potential to leverage long-term memory. While memory affords advantages, a noted limitation of the LSTM is that computation of is sequential, undermining parallel computation, particularly while training gatedCNN; transformer. In the Gated CNN, comes directly from the output of the gated convolution, allowing parallel fitting of the model to time-dependent data. While the Gated CNN does not employ recurrence, the filters of length do leverage extended temporal dependence. Further, via deep Gated CNNs gatedCNN, the effective support of the filters at deeper layers can be expansive.
Recurrent kernels of the form were also developed in rkm, but with the goal of extending recurrent kernel machines to sequential inputs, rather than making connections with RNNs. The formulation in Section 2 has two important differences with that prior work. First, we employ the same vector for all shift positions of the inner product . By contrast, in rkm effectively infinite-dimensional filters are used, because the filter changes with . This makes implementation computationally impractical, necessitating truncation of the long temporal filter. Additionally, the feedback of in (8) was not considered, and as discussed in Section 3.2, our proposed setup yields natural connections to long short-term memory (LSTM) lstm; odyssey.
Prior work analyzing neural networks from an RKHS perspective has largely been based on the feature mapping and the weight Poggio15; Bietti; Mairal; DKL. For the recurrent model of interest here, function plays a role like as a mapping of an input to what may be viewed as a feature vector . However, because of the recurrence, is a function of for an arbitrarily long time period prior to time :
However, rather than explicitly working with , we focus on the kernel .
The authors of string_kernel derive recurrent neural networks from a string kernel by replacing the exact matching function with an inner product and assuming the decay factor to be a nonlinear function. Convolutional neural networks are recovered by replacing a pointwise multiplication with addition. However, the formulation cannot recover the standard LSTM formulation, nor is there a consistent formulation for all the gates. The authors of roth2018kernel introduce a kernel-based update rule to approximate backpropagation through time (BPTT) for RNN training, but still follow the standard RNN structure.
Previous works have considered recurrent models with -gram inputs as in (12). For example, strongly-typed RNNs st_rnn consider bigram inputs, but the previous input is used as a replacement for rather than in conjunction, as in our formulation. Quasi-RNNs QRNN are similar to st_rnn, but generalize them with a convolutional filter for the input and use different nonlinearities. Inputs corresponding to -grams have also been implicitly considered by models that use convolutional layers to extract features from -grams that are then fed into a recurrent network (e.g., cheng2016; wang2016; zhou2015). Relative to (18), these models contain an extra nonlinearity from the convolution and projection matrix from the recurrent cell, and no longer recover the CNN CNN_text_OG; CNN_text; kim2014convolutional or Gated CNN gatedCNN as special cases.
6 Experiments
In the following experiments, we consider several model variants, with nomenclature as follows. The -gram LSTM developed in Section 4.3 is a generalization of the standard LSTM lstm (for which ). RKM-LSTM (recurrent kernel machine LSTM) denotes the model in (10)-(11), which resembles the -gram LSTM, but without a nonlinearity on the cell update or emission . RKM-CIFG denotes an RKM-LSTM with , as discussed in Section 3.2. Linear Kernel w/ corresponds to (10)-(11) with and , with and time-invariant constants; this corresponds to a linear kernel for the update of the memory cell, and dynamic gating on the output, via . We also consider the same model without dynamic gating on the output, , for all (with a nonlinearity on the output), which we call Linear Kernel. The Gated CNN corresponds to the model in gatedCNN, which is the same as Linear Kernel w/ , but with (i.e., no memory). Finally, we consider a CNN model CNN_text_OG, which is the same as the Linear Kernel model, but without feedback or memory, , and . For all of these, we may also consider an -gram generalization as introduced in Section 4. For example, a 3-gram RKM-LSTM corresponds to (18)-(19), with length-3 convolutional filters in the time dimension. The models are summarized in Table 1. All experiments are run on a single NVIDIA Titan X GPU.
|Linear Kernel w/|
|Gated CNN gatedCNN|
Document Classification We show results for several popular document classification datasets CNN_text in Table 2. The AGNews and Yahoo! datasets are topic classification tasks, while Yelp Full is sentiment analysis and DBpedia is ontology classification. The same basic network architecture is used for all models, with the only difference being the choice of recurrent cell, which we make single-layer and unidirectional. Hidden representations are aggregated with mean pooling across time, followed by two fully connected layers, with the second having output size corresponding to the number of classes of the dataset. We use 300-dimensional GloVe glove as our word embedding initialization and set the dimensions of all hidden units to 300. We follow the same preprocessing procedure as in leam. Layer normalization layer_norm is performed after the computation of the cell state . For the Linear Kernel w/ and the Linear Kernel, we set . [Footnote 3: and can also be learned, but we found this not to have much effect on the final performance.]
Notably, the derived RKM-LSTM model performs comparably to the standard LSTM model across all considered datasets. We also find the CIFG version of the RKM-LSTM model to have similar accuracy. As the recurrent model becomes less sophisticated with regard to gating and memory, we see a corresponding decrease in classification accuracy. This decrease is especially significant for Yelp Full, which requires a more intricate comprehension of the entire text to make a correct prediction. This is in contrast to AGNews and DBpedia, where the success of the 1-gram CNN indicates that simple keyword matching is sufficient to do well. We also observe that generalizing the model to consider -gram inputs typically improves performance; the highest accuracies for each dataset were achieved by an -gram model.
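| Model | Parameters (1-gram) | Parameters (3-gram) | AGNews (1-gram) | AGNews (3-gram) | DBpedia (1-gram) | DBpedia (3-gram) | Yahoo! (1-gram) | Yahoo! (3-gram) | Yelp Full (1-gram) | Yelp Full (3-gram) |
|---|---|---|---|---|---|---|---|---|---|---|
| Linear Kernel w/ | 360K | 720K | 92.07 | 91.49 | 98.96 | 98.94 | 77.41 | 77.53 | 65.35 | 65.94 |
| Gated CNN gatedCNN | 180K | 540K | 91.54 | 91.78 | 98.37 | 98.77 | 72.92 | 76.66 | 60.25 | 64.30 |
Table 2: Document classification accuracy for 1-gram and 3-gram versions of various models. Total parameters of each model are shown, excluding word embeddings and the classifier.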
Language Modeling We also perform experiments on popular word-level language generation datasets Penn Treebank (PTB) ptb and Wikitext-2 wikitext, reporting validation and test perplexities (PPL) in Table 3. We adopt AWD-LSTM awd_lstm as our base model [Footnote 4: We use the official codebase https://github.com/salesforce/awd-lstm-lm and report experiment results before two-step fine-tuning.], replacing the standard LSTM with RKM-LSTM, RKM-CIFG, and Linear Kernel w/ to do our comparison. We keep all other hyperparameters the same as the default. Here we consider 1-gram filters, as they performed best for this task; given that the datasets considered here are smaller than those for the classification experiments, 1-grams are less likely to overfit. Note that the static gating on the update of the memory cell (Linear Kernel w/) does considerably worse than the models with dynamic input and forget gates on the memory cell. The RKM-LSTM model consistently outperforms the traditional LSTM, again showing that the models derived from recurrent kernel machines work well in practice for the data considered.
| Model | PTB PPL valid | PTB PPL test | Wikitext-2 PPL valid | Wikitext-2 PPL test |
|---|---|---|---|---|
| LSTM lstm; awd_lstm | 61.2 | 58.9 | 68.74 | 65.68 |
| Linear Kernel w/ | 72.3 | 69.7 | 84.23 | 80.21 |
LFP Classification We perform experiments on a Local Field Potential (LFP) dataset. The LFP signal is a multi-channel time series recorded inside the brain to measure neural activity. The LFP dataset used in this work contains recordings from mice (wild-type or CLOCK van2013further), made while the mice were in their home cages, in an open field, and suspended by their tails. There are a total of channels and the sampling rate is Hz. The goal of this task is to predict the state of a mouse from a second segment of its LFP recording as a 3-way classification problem. To test model generalizability, we perform leave-one-out cross-validation: the data from each mouse are held out in turn for testing, while the data from the remaining mice are used for training.
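The leave-one-out protocol over mice can be sketched as a split generator. The function name and the subject labels below are hypothetical, and the example data are made up purely to show the shape of the splits; they are not the LFP dataset.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_indices, test_indices) for each distinct subject.

    subject_ids[k] is the subject (e.g., mouse) that segment k was recorded from.
    """
    for held_out in sorted(set(subject_ids)):
        train = [k for k, s in enumerate(subject_ids) if s != held_out]
        test = [k for k, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Five illustrative segments from three hypothetical subjects.
segments = ["m1", "m1", "m2", "m3", "m3"]
splits = list(leave_one_subject_out(segments))
for held_out, train, test in splits:
    print(held_out, train, test)
```

Splitting by subject rather than by segment ensures that no segment from the test mouse is seen during training, so the reported accuracy reflects generalization to an unseen animal.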
SyncNet syncnet is a CNN model with specifically designed wavelet filters for neural data. We incorporate the SyncNet form of -gram convolutional filters into our recurrent framework (we have parametric -gram convolutional filters, with parameters learned). As was demonstrated in Section 4.2, the CNN is a memory-less special case of our derived generalized LSTM. An illustration of the modified model (Figure 3) can be found in Appendix A, along with further details on SyncNet.
While the filters of SyncNet are interpretable and can prevent overfitting (because they have a small number of parameters), the same kind of generalization to an -gram LSTM can be made without increasing the number of learned parameters. We do so for all of the recurrent cell types in Table 1, with the CNN corresponding to the original SyncNet model. Compared to the original SyncNet model, our newly proposed models can jointly consider the time dependency within the whole signal. The mean classification accuracies across all mice are compared in Table 4, where we observe substantial improvements in prediction accuracy through the addition of memory cells to the model. Thus, considering the time dependency in the neural signal appears to be beneficial for identifying hidden patterns. Classification performances per subject (Figure 4) can be found in Appendix A.
7 Conclusions
The principal contribution of this paper is a new perspective on gated RNNs, leveraging concepts from recurrent kernel machines. From that standpoint, we have derived a model closely connected to the LSTM lstm; odyssey (for convolutional filters of length one), and have extended such models to convolutional filters of length greater than one, yielding a generalization of the LSTM. The CNN CNN_text_OG; CNN_text; kim2014convolutional, Gated CNN gatedCNN and RAN RAN models are recovered as special cases of the developed framework. We have demonstrated the efficacy of the derived models on NLP and neuroscience tasks, for which our RKM variants show comparable or better performance than the LSTM. In particular, we observe that extending LSTM variants with convolutional filters of length greater than one can significantly improve the performance in LFP classification relative to recent prior work.
The research reported here was supported in part by DARPA, DOE, NIH, NSF and ONR.
Appendix A More Details of the LFP Experiment
In this section, we provide more details on the Sync-RKM model. In order to incorporate the SyncNet model [syncnet] into our framework, the weight defined in Eq. (12) is parameterized as wavelet filters. If there is a total of filters, then is of size .
Specifically, suppose the -gram input data at time is given as with channel number and window size . The -th filter for channel can be written as
has the form of the Morlet wavelet base function. Parameters to be learned are , , and for and .
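As an illustration of a parametric wavelet filter, the sketch below samples a real Morlet-style function, a cosine carrier multiplied by a Gaussian envelope, on a grid centred on the window. The exact parameterization is an assumption made for this example, and the names `omega` (frequency), `phi` (phase), `beta` (envelope width) and `amp` (amplitude) are hypothetical rather than taken from the text.

```python
import math


def morlet_filter(window, omega, phi, beta, amp=1.0):
    """Sample amp * cos(omega * tau + phi) * exp(-beta * tau**2).

    tau runs over `window` unit-spaced points centred on zero, so the
    filter has only a handful of parameters regardless of its length.
    """
    taus = [k - (window - 1) / 2.0 for k in range(window)]
    return [
        amp * math.cos(omega * tau + phi) * math.exp(-beta * tau * tau)
        for tau in taus
    ]


# A length-5 filter: with zero phase it is symmetric about its centre.
taps = morlet_filter(5, omega=1.0, phi=0.0, beta=0.1, amp=2.0)
print(taps)
```

Because the filter length does not change the number of learned parameters, such a filter can span a long temporal window without a corresponding growth in model size.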