Deriving Neural Architectures from Sequence and Graph Kernels

05/25/2017 ∙ by Tao Lei, et al. ∙ 0

The design of neural architectures for structured objects is typically guided by experimental insights rather than a formal process. In this work, we appeal to kernels over combinatorial structures, such as sequences and graphs, to derive appropriate neural operations. We introduce a class of deep recurrent neural operations and formally characterize their associated kernel spaces. Our recurrent modules compare the input to virtual reference objects (cf. filters in CNN) via the kernels. Similar to traditional neural operations, these reference objects are parameterized and directly optimized in end-to-end training. We empirically evaluate the proposed class of neural architectures on standard applications such as language modeling and molecular graph regression, achieving state-of-the-art results across these applications.




1 Introduction

Many recent studies focus on designing novel neural architectures for structured data such as sequences or annotated graphs. For instance, LSTM (Hochreiter & Schmidhuber, 1997), GRU (Chung et al., 2014) and other complex recurrent units (Zoph & Le, 2016) can be easily adapted to embed structured objects such as sentences (Tai et al., 2015) or molecules (Li et al., 2015; Dai et al., 2016) into vector spaces suitable for later processing by standard predictive methods. The embedding algorithms are typically integrated into an end-to-end trainable architecture so as to tailor the learnable embeddings directly to the task at hand.

The embedding process itself is characterized by a sequence of operations summarized in a structure known as the computational graph. Each node in the computational graph identifies the unit/mapping applied while the arcs specify the relative arrangement/order of operations. The process of designing such computational graphs or associated operations for classes of objects is often guided by insights and expertise rather than a formal process.

Recent work has substantially narrowed the gap between desirable computational operations associated with objects and how their representations are acquired. For example, value iteration calculations can be folded into convolutional architectures so as to optimize the representations to facilitate planning (Tamar et al., 2016). Similarly, inference calculations in graphical models about latent states of variables such as atom characteristics can be directly associated with embedding operations (Dai et al., 2016).

We appeal to kernels over combinatorial structures to define the appropriate computational operations. Kernels give rise to well-defined function spaces and possess rules of composition that guide how they can be built from simpler ones. The comparison of objects inherent in kernels is often broken down to elementary relations such as counting of common sub-structures as in

$$\mathcal{K}(\chi, \chi') = \sum_{s \in \mathcal{S}} \mathbf{1}[s \in \chi]\, \mathbf{1}[s \in \chi'] \qquad (1)$$

where $\mathcal{S}$ is the set of possible substructures. For example, in a string kernel (Lodhi et al., 2002), $\mathcal{S}$ may refer to all possible subsequences while a graph kernel (Vishwanathan et al., 2010) would deal with possible paths in the graph. Several studies have highlighted the relation between feed-forward neural architectures and kernels (Hazan & Jaakkola, 2015; Zhang et al., 2016) but we are unaware of any prior work pertaining to kernels associated with neural architectures for structured objects.

In this paper, we introduce a class of deep recurrent neural embedding operations and formally characterize their associated kernel spaces. The resulting kernels are parameterized in the sense that the neural operations relate objects of interest to virtual reference objects through kernels. These reference objects are parameterized and readily optimized for end-to-end performance.

To summarize, the proposed neural architectures, or Kernel Neural Networks [Footnote 1: Code available online.], enjoy the following advantages:

  • The architecture design is grounded in kernel computations.

  • Our neural models remain end-to-end trainable to the task at hand.

  • Resulting architectures demonstrate state-of-the-art performance against strong baselines.

In the following sections, we will introduce these neural components derived from string and graph kernels, as well as their deep versions. Due to space limitations, we defer proofs to supplementary material.

2 From String Kernels to Sequence NNs

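Notations We define a sequence (or a string) of tokens (e.g. a sentence) as $\mathbf{x}_{1:L} = \{\mathbf{x}_i\}_{i=1}^{L}$ where $\mathbf{x}_i \in \mathbb{R}^d$ represents its $i$-th element and $|\mathbf{x}| = L$ denotes the length. Whenever it is clear from the context, we will omit the subscript and directly use $\mathbf{x}$ (and $\mathbf{y}$) to denote a sequence. For a pair of vectors (or matrices) $\mathbf{u}, \mathbf{v}$, we denote $\langle \mathbf{u}, \mathbf{v} \rangle = \sum_k u_k v_k$ as their inner product. For a kernel function $\mathcal{K}_i(\cdot,\cdot)$ with subscript $i$, we use $\phi_i(\cdot)$ to denote its underlying mapping, i.e. $\mathcal{K}_i(\mathbf{x}, \mathbf{y}) = \langle \phi_i(\mathbf{x}), \phi_i(\mathbf{y}) \rangle$.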

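String Kernel A string kernel measures the similarity between two sequences by counting shared subsequences (see Lodhi et al. (2002)). For example, let $\mathbf{x}$ and $\mathbf{y}$ be two strings; a bi-gram string kernel $\mathcal{K}_2(\mathbf{x}, \mathbf{y})$ counts the number of bi-grams $(\mathbf{x}_i, \mathbf{x}_j)$ and $(\mathbf{y}_k, \mathbf{y}_l)$ such that $(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{y}_k, \mathbf{y}_l)$ [Footnote 2: We define n-gram as a subsequence of original string (not necessarily consecutive).],

$$\mathcal{K}_2(\mathbf{x}, \mathbf{y}) = \sum_{1 \le i < j \le |\mathbf{x}|} \; \sum_{1 \le k < l \le |\mathbf{y}|} \lambda_{\mathbf{x},i,j}\, \lambda_{\mathbf{y},k,l}\; \delta(\mathbf{x}_i, \mathbf{y}_k) \cdot \delta(\mathbf{x}_j, \mathbf{y}_l) \qquad (2)$$

where $\lambda_{\mathbf{x},i,j}, \lambda_{\mathbf{y},k,l}$ are context-dependent weights and $\delta(x, y)$ is an indicator that returns 1 only when $x = y$. The weight factors can be realized in various ways. For instance, in temporal predictions such as language modeling, substrings (i.e. patterns) which appear later may have higher impact for prediction. Thus a realization $\lambda_{\mathbf{x},i,j} = \lambda^{|\mathbf{x}|-i-1}$ and $\lambda_{\mathbf{y},k,l} = \lambda^{|\mathbf{y}|-k-1}$ (penalizing substrings far from the end) can be used to determine weights given a constant decay factor $\lambda \in (0, 1)$.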
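As a concrete illustration of the decayed bi-gram kernel, the following Python sketch enumerates all (not necessarily consecutive) bi-grams of two token sequences and sums the weights of the exact matches. It assumes the realization in which a bi-gram starting at 1-based position $i$ of a sequence of length $|\mathbf{x}|$ receives weight $\lambda^{|\mathbf{x}|-i-1}$; the function name `bigram_string_kernel` and the default decay of 0.5 are illustrative choices, not part of the original work.

```python
from itertools import combinations


def bigram_string_kernel(x, y, decay=0.5):
    """Decayed bi-gram string kernel with exact token matching.

    Sums decay**(len(x)-i-1) * decay**(len(y)-k-1) over all pairs of
    bi-grams (x[i], x[j]), i < j, and (y[k], y[l]), k < l, that match
    exactly. Positions i, j, k, l are 1-based in the weights.
    """
    total = 0.0
    for i, j in combinations(range(1, len(x) + 1), 2):
        for k, l in combinations(range(1, len(y) + 1), 2):
            if x[i - 1] == y[k - 1] and x[j - 1] == y[l - 1]:
                total += decay ** (len(x) - i - 1) * decay ** (len(y) - k - 1)
    return total


print(bigram_string_kernel("ab", "ab"))   # 1.0: one shared bi-gram, no decay
print(bigram_string_kernel("abc", "ac"))  # 0.5: ("a", "c") skips one token in "abc"
```

With `decay=1.0` the function reduces to a plain count of shared bi-grams, which is a convenient sanity check.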

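In our case, each token in the sequence is a vector (such as one-hot encoding of a word or a feature vector). We shall replace the exact match $\delta(\mathbf{u}, \mathbf{v})$ by the inner product $\langle \mathbf{u}, \mathbf{v} \rangle$. To this end, the kernel function (2) can be rewritten as,

$$\sum_{i<j} \sum_{k<l} \lambda_{\mathbf{x},i,j}\, \lambda_{\mathbf{y},k,l}\, \langle \mathbf{x}_i, \mathbf{y}_k \rangle \cdot \langle \mathbf{x}_j, \mathbf{y}_l \rangle = \sum_{i<j} \sum_{k<l} \lambda_{\mathbf{x},i,j}\, \lambda_{\mathbf{y},k,l}\, \langle \mathbf{x}_i \otimes \mathbf{x}_j,\; \mathbf{y}_k \otimes \mathbf{y}_l \rangle = \Big\langle \sum_{i<j} \lambda^{|\mathbf{x}|-i-1}\, \mathbf{x}_i \otimes \mathbf{x}_j,\; \sum_{k<l} \lambda^{|\mathbf{y}|-k-1}\, \mathbf{y}_k \otimes \mathbf{y}_l \Big\rangle \qquad (3)$$

where $\mathbf{x}_i \otimes \mathbf{x}_j \in \mathbb{R}^{d \times d}$ (and similarly $\mathbf{y}_k \otimes \mathbf{y}_l$) is the outer-product. In other words, the underlying mapping of kernel $\mathcal{K}_2()$ defined above is $\phi_2(\mathbf{x}) = \sum_{1 \le i < j \le |\mathbf{x}|} \lambda^{|\mathbf{x}|-i-1}\, \mathbf{x}_i \otimes \mathbf{x}_j$. Note we could alternatively use a partial additive scoring $\langle \mathbf{x}_i, \mathbf{y}_k \rangle + \langle \mathbf{x}_j, \mathbf{y}_l \rangle$, and the kernel function can be generalized to n-grams when $n \neq 2$. Again, we commit to one realization in this section.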

String Kernel NNs We introduce a class of recurrent modules whose internal feature states embed the computation of string kernels. The modules project kernel mapping $\phi(\mathbf{x})$ into multi-dimensional vector space (i.e. internal states of recurrent nets). Owing to the combinatorial structure of $\phi(\mathbf{x})$, such projection can be realized and factorized via efficient computation. For the example kernel discussed above, the corresponding neural component is realized as,

$$\begin{aligned} c_1[t] &= \lambda \cdot c_1[t-1] + W^{(1)} \mathbf{x}_t \\ c_j[t] &= \lambda \cdot c_j[t-1] + \big( c_{j-1}[t-1] \odot W^{(j)} \mathbf{x}_t \big), \quad 1 < j \le n \\ h[t] &= \sigma(c_n[t]) \end{aligned} \qquad (4)$$

where $c_j[t]$ are the pre-activation cell states at word $\mathbf{x}_t$, and $h[t]$ is the (post-activation) hidden vector. $c_j[0]$ is initialized with a zero vector. $W^{(1)}, \ldots, W^{(n)}$ are weight matrices to be learned from training examples.

The network operates like other RNNs by processing each input token and updating the internal states. The elementwise multiplication $\odot$ can be replaced by addition $+$ (corresponding to the partial additive scoring above). As a special case, the additive variant becomes a word-level convolutional neural net (Kim, 2014) when $\lambda = 0$ [Footnote 3: $h[t] = \sigma(W^{(1)} \mathbf{x}_{t-1} + W^{(2)} \mathbf{x}_t)$ when $n = 2$.].

Figure 1: An unrolled view of the derived recurrent module for $\mathcal{K}_2()$. Horizontal lines denote decayed propagation from $c[t-1]$ to $c[t]$, while vertical lines represent a linear mapping $W \mathbf{x}_t$ that is propagated to the internal states $c[t]$.

2.1 Single Layer as Kernel Computation

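Now we state how the proposed class embeds string kernel computation. For $j \in \{1, \ldots, n\}$, let $c_j[t][i]$ be the i-th entry of state vector $c_j[t]$, and let $\mathbf{w}^{(j)}_i$ represent the i-th row of matrix $W^{(j)}$. Define $\mathbf{w}_{i,j} = \{\mathbf{w}^{(1)}_i, \mathbf{w}^{(2)}_i, \ldots, \mathbf{w}^{(j)}_i\}$ as a “reference sequence” constructed by taking the i-th row from each matrix $W^{(1)}, \ldots, W^{(j)}$. Theorem 2.1. Let $\mathbf{x}_{1:t}$ be the prefix of $\mathbf{x}$ consisting of first $t$ tokens, and $\mathcal{K}_j$ be the string kernel of $j$-gram shown in Eq.(3). Then $c_j[t][i]$ evaluates kernel function,

$$c_j[t][i] = \mathcal{K}_j(\mathbf{x}_{1:t}, \mathbf{w}_{i,j}) = \langle \phi_j(\mathbf{x}_{1:t}), \phi_j(\mathbf{w}_{i,j}) \rangle$$

for any $j \in \{1, \ldots, n\}$, $t \in \{1, \ldots, |\mathbf{x}|\}$. In other words, the network embeds sequence similarity computation by assessing the similarity between the input sequence $\mathbf{x}_{1:t}$ and the reference sequence $\mathbf{w}_{i,j}$. This interpretation is similar to that of CNNs, where each filter is a “reference pattern” to search in the input. String kernel NN further takes non-consecutive n-gram patterns into consideration (seen from the summation over all n-grams in Eq.(3)).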
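The correspondence between the recurrent states and the kernel can be checked numerically. The NumPy sketch below assumes the multiplicative bi-gram recurrence $c_1[t] = \lambda c_1[t-1] + W^{(1)}\mathbf{x}_t$, $c_2[t] = \lambda c_2[t-1] + c_1[t-1] \odot W^{(2)}\mathbf{x}_t$ with zero initial states, and the decay weights $\lambda^{|\mathbf{x}|-i-1}$; under those assumptions each coordinate of $c_2[t]$ should equal a brute-force kernel between the prefix $\mathbf{x}_{1:t}$ and the two-row reference sequence. Function names, shapes and the random inputs are hypothetical.

```python
import numpy as np


def string_kernel_nn_states(X, Ws, lam):
    """Run the multiplicative recurrence and return c_n[t] for every t.

    X:  (T, d) array of token vectors.
    Ws: list of n weight matrices, each of shape (k, d).
    """
    n, k = len(Ws), Ws[0].shape[0]
    c = [np.zeros(k) for _ in range(n)]       # c_j[0] = 0
    out = []
    for x_t in X:
        prev = [cj.copy() for cj in c]        # states at step t-1
        c[0] = lam * prev[0] + Ws[0] @ x_t
        for j in range(1, n):
            c[j] = lam * prev[j] + prev[j - 1] * (Ws[j] @ x_t)
        out.append(c[-1].copy())
    return np.array(out)


def bigram_kernel(X, w1, w2, lam):
    """Brute-force bi-gram kernel between X and the reference (w1, w2)."""
    T = len(X)
    total = 0.0
    for i in range(T):                        # 0-based positions, i < j
        for j in range(i + 1, T):
            # the 1-based exponent |x| - i - 1 becomes T - i - 2 here
            total += lam ** (T - i - 2) * (X[i] @ w1) * (X[j] @ w2)
    return total


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                   # 6 tokens, d = 4
Ws = [rng.normal(size=(3, 4)) for _ in range(2)]
states = string_kernel_nn_states(X, Ws, lam=0.7)
brute = bigram_kernel(X, Ws[0][1], Ws[1][1], lam=0.7)
print(np.isclose(states[-1][1], brute))       # state coordinate 1 vs. kernel
```

The recurrence costs time linear in the sequence length, whereas the brute-force sum is quadratic; that gap is the practical benefit of factorizing the kernel mapping.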

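Applying Non-linear Activation

In practice, a non-linear activation function such as polynomial or sigmoid-like activation is added to the internal states to produce the final output state $h[t]$. It turns out that many activations are also functions in the reproducing kernel Hilbert space (RKHS) of certain kernel functions (see Shalev-Shwartz et al. (2011); Zhang et al. (2016)). When this is true, the underlying kernel of $h[t]$ is the composition of string kernel and the kernel containing the activation. We give the formal statements below.

Lemma 2.1. Let $\mathbf{x}$ and $\mathbf{w}$ be multi-dimensional vectors with finite norm. Consider the function $f(\mathbf{x}) := \sigma(\mathbf{w}^\top \mathbf{x})$ with non-linear activation $\sigma(\cdot)$. For functions such as polynomials and the sigmoid function, there exist kernel functions $\mathcal{K}_\sigma(\cdot,\cdot)$ and the underlying mapping $\phi_\sigma(\cdot)$ such that $f(\mathbf{x})$ is in the reproducing kernel Hilbert space of $\mathcal{K}_\sigma(\cdot,\cdot)$, i.e.,

$$f(\mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x}) = \langle \phi_\sigma(\mathbf{x}), \psi(\mathbf{w}) \rangle$$

for some mapping $\psi(\mathbf{w})$ constructed from $\mathbf{w}$. In particular, $\mathcal{K}_\sigma(\mathbf{x}, \mathbf{y})$ can be the inverse-polynomial kernel $\frac{1}{2 - \langle \mathbf{x}, \mathbf{y} \rangle}$ for the above activations.

Proposition 2.1. For one layer string kernel NN with non-linear activation $\sigma(\cdot)$ discussed in Lemma 2.1, $h[t]$ as a function of input $\mathbf{x}$ belongs to the RKHS introduced by the composition of $\mathcal{K}_\sigma(\cdot,\cdot)$ and string kernel $\mathcal{K}_n(\cdot,\cdot)$. Here a kernel composition $\mathcal{K}_{\sigma,n}(\mathbf{x}, \mathbf{y})$ is defined with the underlying mapping $\mathbf{x} \mapsto \phi_\sigma(\phi_n(\mathbf{x}))$, and hence $\mathcal{K}_{\sigma,n}(\mathbf{x}, \mathbf{y}) = \phi_\sigma(\phi_n(\mathbf{x}))^\top \phi_\sigma(\phi_n(\mathbf{y}))$.

Proposition 2.1 is the corollary of Lemma 2.1 and Theorem 2.1, since $h[t][i] = \sigma(c_n[t][i]) = \sigma(\mathcal{K}_n(\mathbf{x}_{1:t}, \mathbf{w}_{i,n})) = \langle \phi_\sigma(\phi_n(\mathbf{x}_{1:t})), \psi(\phi_n(\mathbf{w}_{i,n})) \rangle$ and $\phi_\sigma(\phi_n(\cdot))$ is the mapping for the composed kernel. The same proof applies when $h[t]$ is a linear combination of all $c_i[t]$ since kernel functions are closed under addition.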

2.2 Deep Networks as Deep Kernel Construction

We now address the case when multiple layers of the same module are stacked to construct deeper networks. That is, the output states $h^{(\ell)}[t]$ of the $\ell$-th layer are fed to the $(\ell+1)$-th layer as the input sequence. We show that layer stacking corresponds to recursive kernel construction (i.e. the $(\ell+1)$-th kernel is defined on top of the $\ell$-th kernel), which has been proven for feed-forward networks (Zhang et al., 2016).

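We first generalize the sequence kernel definition to enable recursive construction. Notice that the definition in Eq.(3) uses the linear kernel (inner product) as a “subroutine” to measure the similarity between substructures (e.g. tokens) within the sequences. We can therefore replace it with other similarity measures introduced by other “base kernels”. In particular, let $\mathcal{K}^{(1)}(\cdot,\cdot)$ be the string kernel (associated with a single layer). The generalized sequence kernel $\mathcal{K}^{(\ell+1)}(\mathbf{x}, \mathbf{y})$ can be recursively defined as,

$$\begin{aligned} \mathcal{K}^{(\ell+1)}(\mathbf{x}, \mathbf{y}) &= \sum_{i<j} \sum_{k<l} \lambda_{\mathbf{x},i,j}\, \lambda_{\mathbf{y},k,l}\; \mathcal{K}^{(\ell)}_\sigma(\mathbf{x}_{1:i}, \mathbf{y}_{1:k})\; \mathcal{K}^{(\ell)}_\sigma(\mathbf{x}_{1:j}, \mathbf{y}_{1:l}) \\ \mathcal{K}^{(\ell)}_\sigma(\mathbf{x}, \mathbf{y}) &= \phi_\sigma\big(\phi^{(\ell)}(\mathbf{x})\big)^\top \phi_\sigma\big(\phi^{(\ell)}(\mathbf{y})\big) \end{aligned}$$

where $\phi^{(\ell)}(\cdot)$ denotes the pre-activation mapping of the $\ell$-th kernel, $\phi^{(\ell)}_\sigma(\cdot) = \phi_\sigma(\phi^{(\ell)}(\cdot))$ denotes the underlying (post-activation) mapping for non-linear activation $\sigma(\cdot)$, and $\mathcal{K}^{(\ell)}_\sigma(\cdot,\cdot)$ is the $\ell$-th post-activation kernel. Based on this definition, a deeper model can also be interpreted as a kernel computation.

Theorem 2.2. Consider a deep string kernel NN with $L$ layers and activation function $\sigma(\cdot)$. Let the final output state $h^{(L)}[t] = \sigma(c^{(L)}_n[t])$ (or any linear combination of $\{c^{(L)}_i[t]\}$, $i = 1, \ldots, n$). For $\ell = 1, \ldots, L$,

  1. $c^{(\ell)}_n[t]$ as a function of input $\mathbf{x}$ belongs to the RKHS of kernel $\mathcal{K}^{(\ell)}(\cdot,\cdot)$;

  2. $h^{(\ell)}[t]$ belongs to the RKHS of kernel $\mathcal{K}^{(\ell)}_\sigma(\cdot,\cdot)$.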

3 From Graph Kernels to Graph NNs

In the previous section, we encode sequence kernel computation into neural modules and demonstrate possible extensions using different base kernels. The same ideas apply to other types of kernels and data. Specifically, we derive neural components for graphs in this section.

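Notations A graph is defined as $G = (V, E)$, with each vertex $v \in V$ associated with feature vector $\mathbf{f}_v$. The neighbor of node $v$ is denoted as $N(v)$. Following previous notations, for any kernel function $\mathcal{K}_*(\cdot,\cdot)$ with underlying mapping $\phi_*(\cdot)$, we use $\mathcal{K}_{*,\sigma}(\cdot,\cdot)$ to denote the post-activation kernel induced from the composed underlying mapping $\phi_\sigma \circ \phi_*$.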

3.1 Random Walk Kernel NNs

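We start from random walk graph kernels (Gärtner et al., 2003), which count common walks in two graphs. Formally, let $P_n(G)$ be the set of walks $\mathbf{x} = x_1 \cdots x_n$, where $(x_i, x_{i+1}) \in E$ for all $i < n$ [Footnote 4: A single node could appear multiple times in a walk.]. Given two graphs $G$ and $G'$, an $n$-th order random walk graph kernel is defined as:

$$\mathcal{K}^n_W(G, G') = \lambda^{n-1} \sum_{\mathbf{x} \in P_n(G)} \; \sum_{\mathbf{y} \in P_n(G')} \; \prod_{i=1}^{n} \langle \mathbf{f}_{x_i}, \mathbf{f}_{y_i} \rangle \qquad (5)$$

where $\mathbf{f}_{x_i}$ is the feature vector of node $x_i$ in the walk.

Now we show how to realize the above graph kernel with a neural module. Given a graph $G$, the proposed neural module is:

$$\begin{aligned} c_1[v] &= W^{(1)} \mathbf{f}_v \\ c_j[v] &= \lambda \sum_{u \in N(v)} c_{j-1}[u] \odot W^{(j)} \mathbf{f}_v, \quad 1 < j \le n \\ h_G &= \sigma\Big( \sum_v c_n[v] \Big) \end{aligned} \qquad (6)$$

where again $c_*[v]$ is the cell state vector of node $v$, and $h_G$ is the representation of graph $G$ aggregated from node vectors. $h_G$ could then be used for classification or regression.

Now we show the proposed model embeds the random walk kernel. To show this, construct $L_{n,k}$ as a “reference walk” consisting of the row vectors $\{\mathbf{w}^{(1)}_k, \ldots, \mathbf{w}^{(n)}_k\}$ from the parameter matrices. Here $L_{n,k} = (L_V, L_E)$, where $L_V = \{v_1, \ldots, v_n\}$, $L_E = \{(v_i, v_{i+1})\}$ and $v_i$’s feature vector is $\mathbf{w}^{(i)}_k$. We have the following theorem: For any $n \ge 1$, the state value $c_n[v][k]$ (the $k$-th coordinate of $c_n[v]$) satisfies:

$$\sum_v c_n[v][k] = \mathcal{K}^n_W(G, L_{n,k})$$

thus $\sum_v c_n[v]$ lies in the RKHS of kernel $\mathcal{K}^n_W$. As a corollary, $h_G$ lies in the RKHS of kernel $\mathcal{K}^n_{W,\sigma}()$.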
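The same kind of numerical check applies to the graph module. The sketch below assumes the recurrence $c_1[v] = W^{(1)}\mathbf{f}_v$, $c_j[v] = \lambda \sum_{u \in N(v)} c_{j-1}[u] \odot W^{(j)}\mathbf{f}_v$, and compares $\sum_v c_n[v][k]$ with a brute-force sum over all walks of $n$ nodes, each walk scored against a single reference walk built from the $k$-th rows of the weight matrices. The adjacency-matrix representation, the function names and the toy graph are assumptions made for illustration.

```python
import numpy as np


def random_walk_nn(A, F, Ws, lam):
    """Return sum_v c_n[v] for the random-walk kernel module.

    A: (N, N) symmetric 0/1 adjacency matrix; F: (N, d) node features;
    Ws: list of n weight matrices of shape (k, d).
    """
    c = F @ Ws[0].T                         # c_1[v] = W1 f_v, shape (N, k)
    for W in Ws[1:]:
        # (A @ c)[v] sums the previous states of v's neighbours
        c = lam * (A @ c) * (F @ W.T)
    return c.sum(axis=0)


def walks(A, n):
    """Enumerate all walks with n nodes (nodes may repeat)."""
    paths = [[v] for v in range(len(A))]
    for _ in range(n - 1):
        paths = [p + [u] for p in paths for u in range(len(A)) if A[p[-1], u]]
    return paths


def random_walk_kernel(A, F, ref, lam):
    """lam^(n-1) * sum over walks x of prod_i <f_{x_i}, ref_i>."""
    n = len(ref)
    total = sum(
        np.prod([F[v] @ ref[i] for i, v in enumerate(p)]) for p in walks(A, n)
    )
    return lam ** (n - 1) * total


# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
rng = np.random.default_rng(0)
F = rng.normal(size=(4, 3))
Ws = [rng.normal(size=(2, 3)) for _ in range(3)]   # n = 3, k = 2
out = random_walk_nn(A, F, Ws, lam=0.5)
ref = [W[0] for W in Ws]                           # reference walk for k = 0
print(np.isclose(out[0], random_walk_kernel(A, F, ref, lam=0.5)))
```

The module needs only $n - 1$ sparse matrix products, while the number of explicit walks grows exponentially with $n$.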

3.2 Unified View of Graph Kernels

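The derivation of the above neural module could be extended to other classes of graph kernels, such as subtree kernels (cf. Ramon & Gärtner, 2003; Vishwanathan et al., 2010). Generally speaking, most of these kernel functions factorize graphs into local sub-structures, i.e.

$$\mathcal{K}(G, G') = \sum_{v \in G} \sum_{v' \in G'} \mathcal{K}_{loc}(v, v') \qquad (7)$$

where $\mathcal{K}_{loc}(v, v')$ measures the similarity between local sub-structures centered at node $v$ and $v'$.

For example, the random walk kernel $\mathcal{K}^n_W$ can be equivalently defined with $\mathcal{K}_{loc} = \mathcal{K}^{(n)}_{loc}$, where

$$\mathcal{K}^{(j)}_{loc}(v, v') = \begin{cases} \langle \mathbf{f}_v, \mathbf{f}_{v'} \rangle & \text{if } j = 1 \\ \langle \mathbf{f}_v, \mathbf{f}_{v'} \rangle \cdot \lambda \sum_{u \in N(v)} \sum_{u' \in N(v')} \mathcal{K}^{(j-1)}_{loc}(u, u') & \text{if } j > 1 \end{cases}$$

Other kernels like subtree kernels could be recursively defined similarly. Therefore, we adopt this unified view of graph kernels for the rest of this paper.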

In addition, this definition of random walk kernel could be further generalized and enhanced by aggregating neighbor features non-linearly:

where could be either multiplication or addition. denotes a non-linear activation and denotes the post-activation kernel when is involved. The generalized kernel could be realized by modifying Eq.(6) into:


where could be either or operation.

3.3 Deep Graph Kernels and NNs

Following Section 2, we could stack multiple graph kernel NNs to form a deep network. That is:

The local kernel function is recursively defined in two dimensions: depth (term ) and width (term ). Let the pre-activation kernel in the -th layer be , and the post-activation kernel be . We recursively define

for . Finally, the graph kernel is . Similar to Theorem 2.2, we have Consider a deep graph kernel NN with layers and activation function . Let the final output state . For :

  2. as a function of input and graph belongs to the RKHS of kernel ;

  3. belongs to the RKHS of kernel .

  4. belongs to the RKHS of kernel .

3.4 Connection to Weisfeiler-Lehman Kernel

We derived the above deep kernel NN for the purpose of generality. This model could be simplified by setting , without losing representational power (as non-linearity is already involved in depth dimension). In this case, we rewrite the network by reparametrization:


In this section, we further show that this model could be enhanced by sharing weight matrices and across layers. This parameter tying mechanism allows our model to embed Weisfeiler-Lehman kernel (Shervashidze et al., 2011). For clarity, we briefly review basic concepts of Weisfeiler-Lehman kernel below.

Weisfeiler-Lehman Graph Relabeling Weisfeiler-Lehman kernel borrows concepts from the Weisfeiler-Lehman isomorphism test for labeled graphs. The key idea of the algorithm is to augment the node labels by the sorted set of node labels of neighbor nodes, and compress these augmented labels into new, short labels (Figure 2). Such relabeling process is repeated $T$ times. In the $i$-th iteration, it generates a new labeling $l_i(v)$ for all nodes $v$ in graph $G$, with initial labeling $l_0$.
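The discrete relabeling step can be sketched in a few lines of Python. This is a minimal illustration of the classic procedure described above (augment each label with the sorted neighbor labels, then compress each distinct augmented label into a new short label); the function name `wl_relabel`, the dictionary-based graph representation and the use of integers as compressed labels are assumptions of this sketch.

```python
def wl_relabel(adj, labels, iterations):
    """Run Weisfeiler-Lehman relabeling; return the labeling after each round.

    adj:    dict mapping node -> iterable of neighbour nodes.
    labels: dict mapping node -> initial (sortable, hashable) label.
    """
    history = [dict(labels)]
    for _ in range(iterations):
        current = history[-1]
        # Augment each label with the sorted labels of the node's neighbours.
        signatures = {
            v: (current[v], tuple(sorted(current[u] for u in adj[v])))
            for v in adj
        }
        # Compress every distinct augmented label into a short integer label.
        table = {
            sig: idx for idx, sig in enumerate(sorted(set(signatures.values())))
        }
        history.append({v: table[signatures[v]] for v in adj})
    return history


# Path graph 0 - 1 - 2 where every node starts with the same label "C".
adj = {0: [1], 1: [0, 2], 2: [1]}
history = wl_relabel(adj, {0: "C", 1: "C", 2: "C"}, iterations=1)
print(history[1])  # {0: 0, 1: 1, 2: 0}: the middle node becomes distinguishable
```

After one round the two end nodes share a label while the middle node receives a different one, because its neighborhood contains two labels instead of one.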

Generalized Graph Relabeling The key observation here is that graph relabeling operation could be viewed as neighbor feature aggregation. As a result, the relabeling process naturally generalizes to the case where nodes are associated with continuous feature vectors. In particular, let be the relabeling function. For a node :


Note that our definition of is exactly the same as in Equation 9, with being additive composition.

Weisfeiler-Lehman Kernel

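Let $\mathcal{K}$ be any graph kernel (called base kernel). Given a relabeling function $r$, Weisfeiler-Lehman kernel with base kernel $\mathcal{K}$ and depth $L$ is defined as

$$\mathcal{K}^{(L)}_{WL}(G, G') = \sum_{l=0}^{L} \mathcal{K}\big(r^l(G), r^l(G')\big)$$

where $r^0(G) = G$ and $r^l(G), r^l(G')$ are the $l$-th relabeled graph of $G$ and $G'$ respectively.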

Figure 2: Node relabeling in Weisfeiler-Lehman isomorphism test. Figure taken from Shervashidze et al. (2011)

Weisfeiler-Lehman Kernel NN Now with the above kernel definition, and random walk kernel as the base kernel, we propose the following recurrent module:

where and are shared across layers. The final output of this network is .

The above recurrent module is still an instance of a deep kernel, even though some parameters are shared. A minor difference here is that there is an additional random walk kernel NN that connects the $i$-th layer and the output layer. But this is just a linear combination of deep random walk kernels (of different depth). Therefore, as a corollary of Theorem 3.3, we have: For a Weisfeiler-Lehman Kernel NN with $L$ iterations and random walk kernel as base kernel, the final output state $h_G$ belongs to the RKHS of kernel $\mathcal{K}^{(L)}_{WL}$.

4 Adaptive Decay with Neural Gates

The sequence and graph kernels (and their neural components) discussed so far use a constant decay value $\lambda$ regardless of the current input. However, a constant decay is often suboptimal, since the importance of the input can vary across contexts and applications. One extension is to make use of neural gates that adaptively control the decay factor. Here we give two illustrative examples:

Gated String Kernel NN By replacing constant decay with a sigmoid gate, we modify our single-layer sequence module as:

As compared with the original string kernel, now the decay factor is no longer , but rather an adaptive value based on current context.

Gated Random Walk Kernel NN Similarly, we could introduce gates so that different walks have different weights:

The underlying kernel of the above gated network becomes

where each path is weighted by different decay weights, determined by network itself.

5 Related Work

Sequence Networks Considerable effort has gone into designing effective networks for sequence processing. This includes recurrent modules with the ability to carry persistent memories such as LSTM (Hochreiter & Schmidhuber, 1997) and GRU (Chung et al., 2014), as well as non-consecutive convolutional modules (RCNNs, Lei et al. (2015)), and others. More recently, Zoph & Le (2016) exemplified a reinforcement learning-based search algorithm to further optimize the design of such recurrent architectures. Our proposed neural networks offer similar state evolution and feature aggregation functionalities but derive the motivation for the operations involved from well-established kernel computations over sequences.

Recursive neural networks are alternative architectures to model hierarchical structures such as syntax trees and logic forms. For instance, Socher et al. (2013) employs recursive networks for sentence classification, where each node in the dependency tree of the sentence is transformed into a vector representation. Tai et al. (2015) further proposed tree-LSTM, which incorporates LSTM-style architectures as the transformation unit. Dyer et al. (2015, 2016) recently introduced a recursive neural model for transition-based language modeling and parsing. While not specifically discussed in the paper, our ideas do extend to similar neural components for hierarchical objects (e.g. trees).

Graph Networks Most of the current graph neural architectures perform either convolutional or recurrent operations on graphs. Duvenaud et al. (2015) developed Neural Fingerprint for chemical compounds, where each convolution operation is a sum of neighbor node features, followed by a linear transformation. Our model differs from theirs in that our generalized kernels and networks can aggregate neighboring features in a non-linear way. Other approaches, e.g., Bruna et al. (2013) and Henaff et al. (2015), rely on graph Laplacian or Fourier transform.

For recurrent architectures, Li et al. (2015) proposed gated graph neural networks, where neighbor features are aggregated by GRU function. Dai et al. (2016) considers a different architecture where a graph is viewed as a latent variable graphical model. Their recurrent model is derived from Belief Propagation-like algorithms. Our approach is most closely related to Dai et al. (2016), in terms of neighbor feature aggregation and resulting recurrent architecture. Nonetheless, the focus of this paper is on providing a framework for how such recurrent networks could be derived from deep graph kernels.

Kernels and Neural Nets Our work follows recent work demonstrating the connection between neural networks and kernels (Cho & Saul, 2009; Hazan & Jaakkola, 2015). For example, Zhang et al. (2016) showed that standard feedforward neural nets belong to a larger space of recursively constructed kernels (given certain activation functions). Similar results have been obtained for convolutional neural nets (Anselmi et al., 2015) and general computational graphs (Daniely et al., 2016). We extend prior work to kernels and neural architectures over structured inputs, in particular, sequences and graphs. Another difference is how we train the model. While some prior work appeals to convex optimization through improper learning (Zhang et al., 2016; Heinemann et al., 2016) (since the kernel space is larger), we use the proposed networks as building blocks in typical non-convex but flexible neural network training.

6 Experiments

The left-over question is whether the proposed class of operations, despite its formal characteristics, leads to more effective architecture exploration and hence improved performance. In this section, we apply the proposed sequence and graph modules to various tasks and empirically evaluate their performance against other neural network models. These tasks include language modeling, sentiment classification and molecule regression.

6.1 Language Modeling on PTB

Dataset and Setup We use the Penn Tree Bank (PTB) corpus as the benchmark. The dataset contains about 1 million tokens in total. We use the standard train/development/test split of this dataset with vocabulary of size 10,000.

Model Configuration

Following standard practice, we use SGD with an initial learning rate of 1.0 and decrease the learning rate by a constant factor after a certain epoch. We back-propagate the gradient with an unroll size of 35 and use dropout (Hinton et al., 2012) as the regularization. Unless otherwise specified, we train 3-layer networks with $n$ fixed and normalized adaptive decay [Footnote 5: See the supplementary sections for a discussion of network variants.]. Following Zilly et al. (2016), we add highway connections (Srivastava et al., 2015) within each layer:

where , is the gated decay factor and is the transformation gate of highway connections.666We found non-linear activation is no longer necessary when the highway connection is added.

Model                                      |θ|    PPL
LSTM (large) (Zaremba et al., 2014)        66m    78.4
Character CNN (Kim et al., 2015)           19m    78.9
Variational LSTM (Gal & Ghahramani)        20m    78.6
Variational LSTM (Gal & Ghahramani)        66m    73.4
Pointer Sentinel-LSTM (Merity et al.)      21m    70.9
Variational RHN (Zilly et al., 2016)       23m    65.4
Neural Net Search (Zoph & Le, 2016)        25m    64.0
Kernel NN (constant λ)                      5m    84.3
Kernel NN (λ learned as parameter)          5m    76.8
Kernel NN (gated λ)                         5m    73.6
Kernel NN (gated λ)                        20m    69.2
  + variational dropout                    20m    65.5
  + variational dropout, 4 RNN layers      20m    63.8

Table 1: Comparison with state-of-the-art results on PTB. |θ| denotes the number of parameters. Following recent work (Press & Wolf, 2016), we share the input and output word embedding matrix. We report the test perplexity (PPL) of each model. Lower number is better.
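For readers less familiar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The short Python sketch below shows that computation under the assumption that the model exposes natural-log probabilities for each test token; the function name `perplexity` is hypothetical and the inputs are made up for illustration.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that spreads probability uniformly over a 10,000-word vocabulary
# assigns log(1/10000) to every token, so its perplexity is the vocabulary size.
uniform = [math.log(1.0 / 10000)] * 5
print(round(perplexity(uniform)))  # 10000
```

A perplexity of 63.8 therefore means the model is, on average, about as uncertain as a uniform choice among roughly 64 words.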

Results Table 1 compares our model with various state-of-the-art models. Our small model with 5 million parameters achieves a test perplexity of 73.6, already outperforming many results achieved using much larger networks. By increasing the network size to 20 million parameters, we obtain a test perplexity of 69.2 with standard dropout. Adding variational dropout (Gal & Ghahramani, 2016) within the recurrent cells further improves the perplexity to 65.5. Finally, the model achieves 63.8 perplexity when the recurrence depth is increased to 4, which is state-of-the-art and on par with the results reported in (Zilly et al., 2016; Zoph & Le, 2016). Note that Zilly et al. (2016) uses 10 neural layers and Zoph & Le (2016) adopts a complex recurrent cell found by reinforcement learning based search. Our network is architecturally much simpler.

Figure 3 analyzes several variants of our model. Word-level CNNs are degraded cases ($\lambda = 0$) that ignore non-contiguous n-gram patterns. Clearly, this variant performs worse than the other recurrent variants with $\lambda > 0$. Moreover, the test perplexity improves from 84.3 to 76.8 when we train the constant decay vector as part of the model parameters. Finally, the last two variants utilize neural gates (depending on input $\mathbf{x}_t$ only or on both input $\mathbf{x}_t$ and previous state $h[t-1]$), further improving the performance.

Figure 3: Comparison between kernel NN variants on PTB. for all models. Hyper-parameter search is performed for each variant.

6.2 Sentiment Classification

Dataset and Setup We evaluate our model on the sentence classification task. We use the Stanford Sentiment Treebank benchmark (Socher et al., 2013). The dataset consists of 11855 parsed English sentences annotated at both the root (i.e. sentence) level and the phrase level using 5-class fine-grained labels. We use the standard split for training, development and testing. Following previous work, we also evaluate our model on the binary classification variant of this benchmark, ignoring all neutral sentences.

Following the recent work of DAN (Iyyer et al., 2015) and RLSTM (Tai et al., 2015), we use the publicly available 300-dimensional GloVe word vectors (Pennington et al., 2014). Unlike prior work which fine-tunes the word vectors, we normalize the vectors (i.e. $\|\mathbf{w}\|_2 = 1$) and fix them for simplicity.

Model Configuration Our best model is a 3-layer network with and hidden dimension . We average the hidden states across

, and concatenate the averaged vectors from the 3 layers as the input of the final softmax layer. The model is optimized with Adam (Kingma & Ba, 2015), and dropout probability of 0.35.

Model                               Fine    Binary
RNN (Socher et al., 2011)           43.2    82.4
RNTN (Socher et al., 2013)          45.7    85.4
DRNN (Irsoy & Cardie, 2014)         49.8    86.8
RLSTM (Tai et al., 2015)            51.0    88.0
--------------------------------------------------
DCNN (Kalchbrenner et al., 2014)    48.5    86.9
CNN-MC (Kim, 2014)                  47.4    88.1
Bi-LSTM (Tai et al., 2015)          49.1    87.5
LSTMN (Cheng et al., 2016)          47.9    87.0
--------------------------------------------------
PVEC (Le & Mikolov, 2014)           48.7    87.8
DAN (Iyyer et al., 2015)            48.2    86.8
DMN (Kumar et al., 2016)            52.1    88.6
--------------------------------------------------
Kernel NN, constant λ               51.2    88.6
Kernel NN, gated λ                  53.2    89.9

Table 2: Classification accuracy (%) on Stanford Sentiment Treebank. Block I: recursive networks; Block II: convolutional or recurrent networks; Block III: other baseline methods. Higher number is better.

Results Table 2 presents the performance of our model and other networks. We report the best results achieved across 5 independent runs. Our best model obtains 53.2% and 89.9% test accuracies on fine-grained and binary tasks respectively. Our model with only a constant decay factor also obtains quite high accuracy, outperforming other baseline methods shown in the table.

Model                                   |θ|      RMSE
Results reported in Dai et al. (2016)
Mean Predictor                          1        2.4062
Weisfeiler-Lehman Kernel, degree=3      1.6m     0.2040
Weisfeiler-Lehman Kernel, degree=6      1378m    0.1367
Embedded Mean Field                     0.1m     0.1250
Embedded Loopy BP                       0.1m     0.1174
Under Our Setup
Neural Fingerprint                      0.26m    0.1409
Embedded Loopy BP                       0.26m    0.1065
Weisfeiler Kernel NN                    0.26m    0.1058
Weisfeiler Kernel NN, gated             0.26m    0.1043

Table 3: Experiments on Harvard Clean Energy Project. We report Root Mean Square Error (RMSE) on the test set. The first block lists the results reported in Dai et al. (2016) for reference. For fair comparison, we reimplemented their best model so that all models are trained under the same setup. Results under our setup are reported in the second block.

6.3 Molecular Graph Regression

Dataset and Setup We further evaluate our graph NN models on the Harvard Clean Energy Project benchmark, which has been used in Dai et al. (2016); Duvenaud et al. (2015) as their evaluation dataset. This dataset contains 2.3 million candidate molecules, with each molecule labeled with its power conversion efficiency (PCE) value.

We follow exactly the same train-test split as Dai et al. (2016), and the same re-sampling procedure on the training data (but not the test data) to make the algorithm put more emphasis on molecules with higher PCE values, since the data is distributed unevenly.

We use the same feature set as in Duvenaud et al. (2015) for atoms and bonds. Initial atom features include the atom’s element, its degree, the number of attached hydrogens, its implicit valence, and an aromaticity indicator. The bond feature is a concatenation of bond type indicator, whether the bond is conjugated, and whether the bond is in a ring.

Model Configuration Our model is a Weisfeiler-Lehman NN, with 4 recurrent iterations and . All models (including baseline) are optimized with Adam (Kingma & Ba, 2015), with learning rate decay factor 0.9.

Results In Table 3, we report the performance of our model against other baseline methods. Neural Fingerprint (Duvenaud et al., 2015) is a 4-layer convolutional neural network. Convolution is applied to each atom, which sums over its neighbors’ hidden state, followed by a linear transformation and non-linear activation. Embedded Loopy BP (Dai et al., 2016) is a recurrent architecture, with 4 recurrent iterations. It maintains message vectors for each atom and bond, and propagates those vectors in a message passing fashion. Table 3 shows our model achieves state-of-the-art performance against various baselines.

7 Conclusion

We proposed a class of deep recurrent neural architectures and formally characterized its underlying computation using kernels. By linking kernel and neural operations, we have a “template” for deriving new families of neural architectures for sequences and graphs. We hope the theoretical view of kernel neural networks can be helpful for future model exploration.


We thank Prof. Le Song for sharing Harvard Clean Energy Project dataset. We also thank Yu Zhang, Vikas Garg, David Alvarez, Tianxiao Shen, Karthik Narasimhan and the reviewers for their helpful comments. This work was supported by the DARPA Make-It program under contract ARO W911NF-16-2-0023.


Appendix A Examples of kernel / neural variants

Our theoretical results apply to some other variants of sequence kernels and the associated neural components. We give some examples in this section. Table 4 shows three network variants, corresponding to three realizations of string kernels provided in Table 5.

Connection to LSTMs

Interestingly, much recent work has reached similar RNN architectures through empirical exploration. For example, Greff et al. (2015) found that simplifying LSTMs by removing the input gate or coupling it with the forget gate does not significantly change the performance. However, the forget gate (corresponding to the decay factor in our notation) is crucial for performance. This is consistent with our theoretical analysis and the empirical results in Figure 3. Moreover, Balduzzi & Ghifary (2016) and Lee et al. (2017) both suggest that a linear additive state computation suffices to provide competitive performance compared to LSTMs [Footnote 7: Balduzzi & Ghifary (2016) also includes the previous token $\mathbf{x}_{t-1}$ in the computation, which doesn’t affect the discussion here.]:

In fact, this variant becomes an instance of the kernel NN presented in this work (with and adaptive gating), when and or 1.

  (a) Multiplicative mapping, aggregation un-normalized:
  (b) Multiplicative mapping, aggregation normalized:
  (c) Additive mapping, aggregation normalized:
  Final activation:
(any linear combination of )
Table 4: Example sequence NN variants. We present these equations in the context of .
  (a) Multiplicative mapping, aggregation un-normalized:
  (b) Multiplicative mapping, aggregation normalized:
  (c) Additive mapping, aggregation normalized: