1 Introduction
Many recent studies focus on designing novel neural architectures for structured data such as sequences or annotated graphs. For instance, LSTM (Hochreiter & Schmidhuber, 1997), GRU (Chung et al., 2014) and other complex recurrent units (Zoph & Le, 2016) can be easily adapted to embed structured objects such as sentences (Tai et al., 2015) or molecules (Li et al., 2015; Dai et al., 2016) into vector spaces suitable for later processing by standard predictive methods. The embedding algorithms are typically integrated into an end-to-end trainable architecture so as to tailor the learnable embeddings directly to the task at hand.
The embedding process itself is characterized by a sequence of operations summarized in a structure known as the computational graph. Each node in the computational graph identifies the unit/mapping applied, while the arcs specify the relative arrangement/order of operations. The process of designing such computational graphs or associated operations for classes of objects is often guided by insights and expertise rather than a formal process.
Recent work has substantially narrowed the gap between desirable computational operations associated with objects and how their representations are acquired. For example, value iteration calculations can be folded into convolutional architectures so as to optimize the representations to facilitate planning (Tamar et al., 2016). Similarly, inference calculations in graphical models about latent states of variables such as atom characteristics can be directly associated with embedding operations (Dai et al., 2016).
We appeal to kernels over combinatorial structures to define the appropriate computational operations. Kernels give rise to well-defined function spaces and possess rules of composition that guide how they can be built from simpler ones. The comparison of objects inherent in kernels is often broken down to elementary relations such as counting of common substructures as in
$$\mathcal{K}(\chi, \chi') = \sum_{s \in \mathcal{S}} \mathbf{1}[s \in \chi]\,\mathbf{1}[s \in \chi'] \tag{1}$$
where $\mathcal{S}$ is the set of possible substructures. For example, in a string kernel (Lodhi et al., 2002), $\mathcal{S}$ may refer to all possible subsequences, while a graph kernel (Vishwanathan et al., 2010) would deal with possible paths in the graph. Several studies have highlighted the relation between feedforward neural architectures and kernels (Hazan & Jaakkola, 2015; Zhang et al., 2016), but we are unaware of any prior work pertaining to kernels associated with neural architectures for structured objects.
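As a small illustration of counting common substructures, the sketch below takes the substructure set to be all length-n subsequences (not necessarily contiguous) of a token sequence and counts those present in both inputs. The function names and the toy sentences are hypothetical and chosen only for illustration; this is an unweighted, exact-match count rather than the weighted kernels developed later.

```python
from itertools import combinations


def subsequences(tokens, n):
    """Return the set of distinct length-n subsequences (gaps allowed)."""
    return {
        tuple(tokens[i] for i in idx)
        for idx in combinations(range(len(tokens)), n)
    }


def common_substructure_kernel(x, y, n=2):
    """Count the substructures that occur in both x and y."""
    return len(subsequences(x, n) & subsequences(y, n))


x = "the cat sat".split()
y = "the big cat sat down".split()
print(common_substructure_kernel(x, y))  # prints 3
```

The three shared bigrams here are (the, cat), (the, sat) and (cat, sat); the second sentence matches them even though its tokens are not adjacent.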
In this paper, we introduce a class of deep recurrent neural embedding operations and formally characterize their associated kernel spaces. The resulting kernels are parameterized in the sense that the neural operations relate objects of interest to virtual reference objects through kernels. These reference objects are parameterized and readily optimized for end-to-end performance.
To summarize, the proposed neural architectures, or Kernel Neural Networks (code available at https://github.com/taolei87/icml17_knn), enjoy the following advantages:
- The architecture design is grounded in kernel computations.
- Our neural models remain end-to-end trainable to the task at hand.
- Resulting architectures demonstrate state-of-the-art performance against strong baselines.
In the following sections, we will introduce these neural components derived from string and graph kernels, as well as their deep versions. Due to space limitations, we defer proofs to supplementary material.
2 From String Kernels to Sequence NNs
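Notations We define a sequence (or a string) of tokens (e.g. a sentence) as $x_{1:L} = \{x_i\}_{i=1}^{L}$, where $x_i \in \mathbb{R}^d$ represents its $i$-th element and $|x| = L$ denotes the length. Whenever it is clear from the context, we will omit the subscript and directly use $x$ (and $y$) to denote a sequence. For a pair of vectors (or matrices) $u, v$, we denote $\langle u, v \rangle = \sum_k u_k v_k$ as their inner product. For a kernel function $\mathcal{K}_i(\cdot,\cdot)$ with subscript $i$, we use $\phi_i(\cdot)$ to denote its underlying mapping, i.e. $\mathcal{K}_i(x, y) = \langle \phi_i(x), \phi_i(y) \rangle$.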
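String Kernel A string kernel measures the similarity between two sequences by counting shared subsequences (see Lodhi et al. (2002)). For example, let $x$ and $y$ be two strings; a bigram string kernel $\mathcal{K}_2(x, y)$ counts the number of bigrams $(x_i, x_j)$ and $(y_k, y_l)$ such that $(x_i, x_j) = (y_k, y_l)$ (we define an n-gram as a subsequence of the original string, not necessarily consecutive),
$$\mathcal{K}_2(x, y) = \sum_{1 \le i < j \le |x|} \; \sum_{1 \le k < l \le |y|} \lambda_{x,i,j}\, \lambda_{y,k,l}\; \delta(x_i, y_k) \cdot \delta(x_j, y_l) \tag{2}$$
where $\lambda_{x,i,j}, \lambda_{y,k,l} \in [0, 1)$ are context-dependent weights and $\delta(x, y)$ is an indicator that returns 1 only when $x = y$. The weight factors can be realized in various ways. For instance, in temporal predictions such as language modeling, substrings (i.e. patterns) which appear later may have higher impact for prediction. Thus a realization $\lambda_{x,i,j} = \lambda^{|x|-i-1}$ and $\lambda_{y,k,l} = \lambda^{|y|-k-1}$ (penalizing substrings far from the end) can be used to determine weights given a constant decay factor $\lambda \in (0, 1)$.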
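In our case, each token in the sequence is a vector (such as a one-hot encoding of a word or a feature vector). We shall replace the exact match $\delta(u, v)$ by the inner product $\langle u, v \rangle$. To this end, the kernel function (2) can be rewritten as,
$$\begin{aligned} \mathcal{K}_2(x, y) &= \sum_{1 \le i < j \le |x|} \; \sum_{1 \le k < l \le |y|} \lambda_{x,i,j}\, \lambda_{y,k,l}\; \langle x_i, y_k \rangle \cdot \langle x_j, y_l \rangle \\ &= \sum_{1 \le i < j \le |x|} \; \sum_{1 \le k < l \le |y|} \lambda_{x,i,j}\, \lambda_{y,k,l}\; \langle x_i \otimes x_j,\; y_k \otimes y_l \rangle \\ &= \Big\langle \sum_{i<j} \lambda^{|x|-i-1}\, x_i \otimes x_j,\; \sum_{k<l} \lambda^{|y|-k-1}\, y_k \otimes y_l \Big\rangle \end{aligned} \tag{3}$$
where $x_i \otimes x_j \in \mathbb{R}^{d \times d}$ (and similarly $y_k \otimes y_l$) is the outer product. In other words, the underlying mapping of the kernel $\mathcal{K}_2$ defined above is $\phi_2(x) = \sum_{1 \le i < j \le |x|} \lambda^{|x|-i-1}\, x_i \otimes x_j$. Note we could alternatively use a partial additive scoring $\langle x_i, y_k \rangle + \langle x_j, y_l \rangle$, and the kernel function can be generalized to n-grams when $n \neq 2$. Again, we commit to one realization in this section.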
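The identity between the pairwise form of the bigram kernel and the inner product of outer-product feature maps can be checked numerically. The sketch below is an illustration only: it assumes the decay realization in which a bigram starting at (1-based) position i of a sequence of length L receives weight lam ** (L - i - 1), and the helper names (`dot`, `bigram_kernel`, `phi2`) and toy vectors are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def bigram_kernel(x, y, lam):
    """Pairwise form: weighted sum over all bigram pairs of x and y."""
    total = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            for k in range(len(y)):
                for l in range(k + 1, len(y)):
                    # 0-based index i is position i + 1, so the assumed
                    # weight lam^(L - pos - 1) becomes lam ** (L - i - 2).
                    w = lam ** (len(x) - i - 2) * lam ** (len(y) - k - 2)
                    total += w * dot(x[i], y[k]) * dot(x[j], y[l])
    return total


def phi2(x, lam):
    """Feature map: decayed sum of flattened outer products x_i (x) x_j."""
    d = len(x[0])
    feat = [0.0] * (d * d)
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            w = lam ** (len(x) - i - 2)
            for a in range(d):
                for b in range(d):
                    feat[a * d + b] += w * x[i][a] * x[j][b]
    return feat


x_demo = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_demo = [[1.0, 1.0], [0.5, 2.0], [0.0, 1.0]]
lhs = bigram_kernel(x_demo, y_demo, 0.5)
rhs = dot(phi2(x_demo, 0.5), phi2(y_demo, 0.5))
print(abs(lhs - rhs) < 1e-9)
```

The quadruple loop costs O(|x|^2 |y|^2) inner products, whereas the feature-map form separates the two sequences; that separation is what makes a recurrent computation possible.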
String Kernel NNs We introduce a class of recurrent modules whose internal feature states embed the computation of string kernels. The modules project the kernel mapping $\phi(x)$ into a multi-dimensional vector space (i.e. internal states of recurrent nets). Owing to the combinatorial structure of $\phi(x)$, such projection can be realized and factorized via efficient computation. For the example kernel discussed above, the corresponding neural component is realized as,
$$\begin{aligned} c_1[t] &= \lambda \cdot c_1[t-1] + \big(W^{(1)} x_t\big) \\ c_j[t] &= \lambda \cdot c_j[t-1] + \big(c_{j-1}[t-1] \odot W^{(j)} x_t\big), \qquad 1 < j \le n \\ h[t] &= \sigma(c_n[t]) \end{aligned} \tag{4}$$
where $c_j[t]$ are the pre-activation cell states at word $x_t$, and $h[t]$ is the (post-activation) hidden vector. $c_j[0]$ is initialized with a zero vector. $W^{(1)}, \ldots, W^{(n)}$ are weight matrices to be learned from training examples.
The network operates like other RNNs by processing each input token and updating the internal states. The element-wise multiplication $\odot$ can be replaced by addition $+$ (corresponding to the partial additive scoring above). As a special case, the additive variant becomes a word-level convolutional neural net (Kim, 2014) when $\lambda = 0$, in which case $h[t] = \sigma\big(W^{(1)} x_{t-n+1} + \cdots + W^{(n)} x_t\big)$.
2.1 Single Layer as Kernel Computation
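Now we state how the proposed class embeds string kernel computation. For $j \in \{1, \ldots, n\}$, let $c_j[t][i]$ be the $i$-th entry of state vector $c_j[t]$, and let $w_i^{(j)}$ represent the $i$-th row of matrix $W^{(j)}$. Define $w_{i,j} = \{w_i^{(1)}, w_i^{(2)}, \ldots, w_i^{(j)}\}$ as a "reference sequence" constructed by taking the $i$-th row from each matrix $W^{(1)}, \ldots, W^{(j)}$.
Theorem. Let $x_{1:t}$ be the prefix of $x$ consisting of the first $t$ tokens, and $\mathcal{K}_j$ be the string kernel of $j$-gram shown in Eq.(3). Then $c_j[t][i]$ evaluates the kernel function,
$$c_j[t][i] = \mathcal{K}_j(x_{1:t}, w_{i,j}) = \big\langle \phi_j(x_{1:t}),\, \phi_j(w_{i,j}) \big\rangle$$
for any $j \in \{1, \ldots, n\}$, $t \in \{1, \ldots, |x|\}$. In other words, the network embeds sequence similarity computation by assessing the similarity between the input sequence $x_{1:t}$ and the reference sequence $w_{i,j}$. This interpretation is similar to that of CNNs, where each filter is a "reference pattern" to search in the input. String kernel NN further takes non-consecutive n-gram patterns into consideration (seen from the summation over all n-grams in Eq.(3)).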
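The claim that each state coordinate evaluates a string kernel against a reference sequence can be checked on a toy example. The sketch below is not the authors' implementation; it assumes the multiplicative recurrence c_1[t] = lam * c_1[t-1] + W1 x_t and c_j[t] = lam * c_j[t-1] + c_{j-1}[t-1] * (Wj x_t), and compares it with a brute-force sum over all increasing index tuples. All function names are hypothetical.

```python
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(W, x):
    return [dot(row, x) for row in W]


def string_kernel_states(xs, Ws, lam):
    """Run the assumed recurrence; return pre-activation states c[0..n-1]."""
    n, h = len(Ws), len(Ws[0])
    c = [[0.0] * h for _ in range(n)]
    for x in xs:
        prev = [row[:] for row in c]          # states from step t-1
        proj = [matvec(W, x) for W in Ws]     # W^(j) x_t for every j
        c[0] = [lam * prev[0][k] + proj[0][k] for k in range(h)]
        for j in range(1, n):
            c[j] = [lam * prev[j][k] + prev[j - 1][k] * proj[j][k]
                    for k in range(h)]
    return c


def reference_kernel(xs, ref, lam):
    """Brute force: sum over i_1 < ... < i_n of decay * prod <x_i, ref_j>."""
    n, t = len(ref), len(xs)
    total = 0.0
    for idx in combinations(range(t), n):
        # Total decay accumulated by the recurrence (0-based first index).
        term = lam ** (t - idx[0] - n)
        for pos, w in zip(idx, ref):
            term *= dot(xs[pos], w)
        total += term
    return total


xs = [[1.0, 0.5], [0.0, 2.0], [1.5, -1.0], [0.5, 0.5]]
Ws = [
    [[0.2, -0.1], [1.0, 0.3], [0.0, 0.5]],
    [[0.4, 0.4], [-0.3, 0.7], [1.0, -1.0]],
    [[0.1, 0.9], [0.6, -0.2], [0.3, 0.3]],
]
states = string_kernel_states(xs, Ws, 0.7)
ref_row_1 = [W[1] for W in Ws]   # reference sequence built from row 1
print(abs(states[2][1] - reference_kernel(xs, ref_row_1, 0.7)) < 1e-9)
```

The recurrence touches each token once, so it runs in time linear in the sequence length, while the brute-force sum grows combinatorially with n.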
Applying Non-linear Activation In practice, a non-linear activation function such as polynomial or sigmoid-like activation is added to the internal states to produce the final output state $h[t]$. It turns out that many activations are also functions in the reproducing kernel Hilbert space (RKHS) of certain kernel functions (see Shalev-Shwartz et al. (2011); Zhang et al. (2016)). When this is true, the underlying kernel of $h[t]$ is the composition of the string kernel and the kernel containing the activation. We give the formal statements below.
Lemma. Let $x$ and $w$ be multi-dimensional vectors with finite norm. Consider the function $f(x) := \sigma(w^\top x)$ with non-linear activation $\sigma(\cdot)$. For functions such as polynomials and the sigmoid function, there exist a kernel function $\mathcal{K}_\sigma(\cdot,\cdot)$ and an underlying mapping $\phi_\sigma(\cdot)$ such that $f(x)$ is in the reproducing kernel Hilbert space of $\mathcal{K}_\sigma(\cdot,\cdot)$, i.e.,
$$f(x) = \sigma(w^\top x) = \langle \phi_\sigma(x), \psi(w) \rangle$$
for some mapping $\psi(w)$ constructed from $w$. In particular, $\mathcal{K}_\sigma(x, y)$ can be the inverse-polynomial kernel $\frac{1}{2 - \langle x, y \rangle}$ for the above activations.
Theorem. For one layer string kernel NN with non-linear activation $\sigma(\cdot)$ discussed in Lemma 2.1, $h[t]$ as a function of input $x$ belongs to the RKHS introduced by the composition of $\phi_\sigma(\cdot)$ and string kernel $\phi_n(\cdot)$. Here a kernel composition $\mathcal{K}_{\sigma,n}(x, y)$ is defined with the underlying mapping $(\phi_\sigma \circ \phi_n)(x) = \phi_\sigma(\phi_n(x))$, and hence $\mathcal{K}_{\sigma,n}(x, y) = \phi_\sigma(\phi_n(x))^\top \phi_\sigma(\phi_n(y))$.
2.2 Deep Networks as Deep Kernel Construction
We now address the case when multiple layers of the same module are stacked to construct deeper networks. That is, the output states $h^{(l)}[t]$ of the $l$-th layer are fed to the $(l+1)$-th layer as the input sequence. We show that layer stacking corresponds to recursive kernel construction (i.e. the $(l+1)$-th kernel is defined on top of the $l$-th kernel), which has been proven for feedforward networks (Zhang et al., 2016).
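We first generalize the sequence kernel definition to enable recursive construction. Notice that the definition in Eq.(3) uses the linear kernel (inner product) as a "subroutine" to measure the similarity between substructures (e.g. tokens) within the sequences. We can therefore replace it with other similarity measures introduced by other "base kernels". In particular, let $\mathcal{K}^{(1)}(\cdot,\cdot)$ be the string kernel (associated with a single layer). The generalized sequence kernel $\mathcal{K}^{(l+1)}(x, y)$ can be recursively defined as,
$$\mathcal{K}^{(l+1)}(x, y) = \sum_{i<j} \sum_{k<m} \lambda_{x,i,j}\, \lambda_{y,k,m}\; \mathcal{K}^{(l)}_\sigma(x_{1:i}, y_{1:k})\; \mathcal{K}^{(l)}_\sigma(x_{1:j}, y_{1:m}) = \Big\langle \sum_{i<j} \lambda_{x,i,j}\, \phi^{(l)}_\sigma(x_{1:i}) \otimes \phi^{(l)}_\sigma(x_{1:j}),\; \sum_{k<m} \lambda_{y,k,m}\, \phi^{(l)}_\sigma(y_{1:k}) \otimes \phi^{(l)}_\sigma(y_{1:m}) \Big\rangle$$
where $\phi^{(l)}(\cdot)$ denotes the pre-activation mapping of the $l$-th kernel, $\phi^{(l)}_\sigma(\cdot) = \phi_\sigma(\phi^{(l)}(\cdot))$ denotes the underlying (post-activation) mapping for non-linear activation $\sigma(\cdot)$, and $\mathcal{K}^{(l)}_\sigma(\cdot,\cdot)$ is the $l$-th post-activation kernel. Based on this definition, a deeper model can also be interpreted as a kernel computation.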
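Theorem. Consider a deep string kernel NN with $L$ layers and activation function $\sigma(\cdot)$. Let the final output state $h^{(L)}[t] = \sigma(c_n^{(L)}[t])$ (or any linear combination of $\{c_i^{(l)}[t]\}$, $i = 1, \ldots, n$). For $l = 1, \ldots, L$,
- $c_n^{(l)}[t]$ as a function of input $x$ belongs to the RKHS of kernel $\mathcal{K}^{(l)}(\cdot,\cdot)$;
- $h^{(l)}[t]$ belongs to the RKHS of kernel $\mathcal{K}^{(l)}_\sigma(\cdot,\cdot)$.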
3 From Graph Kernels to Graph NNs
In the previous section, we encode sequence kernel computation into neural modules and demonstrate possible extensions using different base kernels. The same ideas apply to other types of kernels and data. Specifically, we derive neural components for graphs in this section.
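Notations A graph is defined as $G = (V, E)$, with each vertex $v \in V$ associated with feature vector $f_v$. The neighbor of node $v$ is denoted as $N(v)$. Following previous notations, for any kernel function $\mathcal{K}_*(\cdot,\cdot)$ with underlying mapping $\phi_*(\cdot)$, we use $\mathcal{K}_{*,\sigma}(\cdot,\cdot)$ to denote the post-activation kernel induced from the composed underlying mapping $\phi_\sigma \circ \phi_*$.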
3.1 Random Walk Kernel NNs
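We start from random walk graph kernels (Gärtner et al., 2003), which count common walks in two graphs. Formally, let $P_n(G)$ be the set of walks $x = x_1 \cdots x_n$, where $\forall i: (x_i, x_{i+1}) \in E$ (a single node could appear multiple times in a walk). Given two graphs $G$ and $G'$, an $n$-th order random walk graph kernel is defined as:
$$\mathcal{K}^n_W(G, G') = \lambda^{n-1} \sum_{x \in P_n(G)} \; \sum_{y \in P_n(G')} \; \prod_{i=1}^{n} \langle f_{x_i}, f_{y_i} \rangle \tag{5}$$
where $f_{x_i} \in \mathbb{R}^d$ is the feature vector of node $x_i$ in the walk.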
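Now we show how to realize the above graph kernel with a neural module. Given a graph $G$, the proposed neural module is:
$$\begin{aligned} c_1[v] &= W^{(1)} f_v \\ c_j[v] &= \lambda \sum_{u \in N(v)} c_{j-1}[u] \odot W^{(j)} f_v, \qquad 1 < j \le n \\ h_G &= \sigma\Big(\sum_{v} c_n[v]\Big) \end{aligned} \tag{6}$$
where again $c_*[v]$ is the cell state vector of node $v$, and $h_G$ is the representation of graph $G$ aggregated from node vectors. $h_G$ could then be used for classification or regression.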
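Now we show the proposed model embeds the random walk kernel. To show this, construct $L_{n,k}$ as a "reference walk" consisting of the row vectors $\{w_k^{(1)}, \ldots, w_k^{(n)}\}$ from the parameter matrices. Here $L_{n,k} = (L_V, L_E)$, where $L_V = \{y_1, \ldots, y_n\}$, $L_E = \{(y_i, y_{i+1})\}$, and $y_i$'s feature vector is $w_k^{(i)}$. We have the following theorem:
Theorem. For any $n \ge 1$, the state value $c_n[v][k]$ (the $k$-th coordinate of $c_n[v]$) satisfies:
$$\sum_{v} c_n[v][k] = \mathcal{K}^n_W(G, L_{n,k})$$
thus $\sum_v c_n[v]$ lies in the RKHS of kernel $\mathcal{K}^n_W$. As a corollary, $h_G$ lies in the RKHS of kernel $\mathcal{K}^n_{W,\sigma}$.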
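The correspondence between the node-state recurrence and walk counting can be checked on a small graph. The sketch below is an illustration under stated assumptions: the graph is undirected (symmetric adjacency lists), the recurrence is c_1[v] = W1 f_v and c_j[v] = lam * sum over neighbors u of c_{j-1}[u], multiplied element-wise by Wj f_v, and the reference walk is treated as a single directed path whose i-th node carries row k of the i-th weight matrix. All names and the toy graph are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(W, x):
    return [dot(row, x) for row in W]


def random_walk_nn(adj, feats, Ws, lam):
    """Return sum_v c_n[v] (pre-activation) for the assumed recurrence."""
    h = len(Ws[0])
    c = {v: matvec(Ws[0], feats[v]) for v in adj}
    for W in Ws[1:]:
        proj = {v: matvec(W, feats[v]) for v in adj}
        c = {
            v: [lam * sum(c[u][k] for u in adj[v]) * proj[v][k]
                for k in range(h)]
            for v in adj
        }
    return [sum(c[v][k] for v in adj) for k in range(h)]


def walks(adj, n):
    """Enumerate all walks with n nodes (nodes may repeat)."""
    paths = [[v] for v in adj]
    for _ in range(n - 1):
        paths = [p + [u] for p in paths for u in adj[p[-1]]]
    return paths


def walk_kernel_vs_reference(adj, feats, ref, lam):
    """Brute force: lam^(n-1) * sum over walks of prod <f_{x_i}, ref_i>."""
    n = len(ref)
    total = 0.0
    for p in walks(adj, n):
        term = 1.0
        for node, w in zip(p, ref):
            term *= dot(feats[node], w)
        total += term
    return lam ** (n - 1) * total


# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
feats = {0: [1.0, 0.0], 1: [0.5, 0.5], 2: [0.0, 1.0], 3: [1.0, 1.0]}
Ws = [
    [[0.3, 0.7], [1.0, -0.5]],
    [[0.2, 0.2], [-0.4, 0.9]],
    [[0.6, 0.1], [0.5, 0.5]],
]
out = random_walk_nn(adj, feats, Ws, 0.8)
ref0 = [W[0] for W in Ws]
print(abs(out[0] - walk_kernel_vs_reference(adj, feats, ref0, 0.8)) < 1e-9)
```

The recurrence aggregates over neighbors once per order, so its cost is proportional to n times the number of edges, while explicit walk enumeration grows exponentially with n.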
3.2 Unified View of Graph Kernels
The derivation of the above neural module could be extended to other classes of graph kernels, such as subtree kernels (cf. Ramon & Gärtner, 2003; Vishwanathan et al., 2010). Generally speaking, most of these kernel functions factorize graphs into local substructures, i.e.
$$\mathcal{K}(G, G') = \sum_{v \in G} \; \sum_{v' \in G'} \mathcal{K}_{loc}(v, v') \tag{7}$$
where $\mathcal{K}_{loc}(v, v')$ measures the similarity between local substructures centered at node $v$ and $v'$.
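For example, the random walk kernel $\mathcal{K}^n_W$ can be equivalently defined by taking $\mathcal{K}_{loc} = \mathcal{K}^{(n)}_{loc}$ in Eq.(7), with
$$\mathcal{K}^{(j)}_{loc}(v, v') = \begin{cases} \langle f_v, f_{v'} \rangle & \text{if } j = 1 \\ \langle f_v, f_{v'} \rangle \cdot \lambda \sum_{u \in N(v)} \sum_{u' \in N(v')} \mathcal{K}^{(j-1)}_{loc}(u, u') & \text{if } j > 1 \end{cases}$$
Other kernels like subtree kernels could be recursively defined similarly. Therefore, we adopt this unified view of graph kernels for the rest of this paper.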
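In addition, this definition of random walk kernel could be further generalized and enhanced by aggregating neighbor features non-linearly:
$$\mathcal{K}^{(j)}_{loc}(v, v') = \langle f_v, f_{v'} \rangle \circ \lambda \sum_{u \in N(v)} \sum_{u' \in N(v')} \mathcal{K}^{(j-1)}_{loc,\sigma}(u, u')$$
where $\circ$ could be either multiplication or addition. $\sigma(\cdot)$ denotes a non-linear activation and $\mathcal{K}^{(j-1)}_{loc,\sigma}(\cdot,\cdot)$ denotes the post-activation kernel when $\sigma(\cdot)$ is involved. The generalized kernel could be realized by modifying Eq.(6) into:
$$\begin{aligned} c_1[v] &= W^{(1)} f_v \\ c_j[v] &= W^{(j)} f_v \circ \lambda \sum_{u \in N(v)} \sigma(c_{j-1}[u]), \qquad 1 < j \le n \\ h_G &= \sigma\Big(\sum_{v} c_n[v]\Big) \end{aligned} \tag{8}$$
where $\circ$ could be either the $+$ or the $\odot$ operation.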
3.3 Deep Graph Kernels and NNs
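Following Section 2, we could stack multiple graph kernel NNs to form a deep network. That is:
$$\begin{aligned} c_1^{(l)}[v] &= W^{(l,1)} h^{(l-1)}[v] \\ c_j^{(l)}[v] &= W^{(l,j)} h^{(l-1)}[v] \circ \lambda \sum_{u \in N(v)} \sigma\big(c_{j-1}^{(l)}[u]\big) \\ h^{(l)}[v] &= \sigma\big(U^{(l)} c_n^{(l)}[v]\big), \qquad 1 \le l \le L,\; 1 < j \le n \end{aligned}$$
The local kernel function is recursively defined in two dimensions: depth (term $h^{(l)}$) and width (term $c_j$). Let the pre-activation kernel in the $l$-th layer be $\mathcal{K}^{(l)}_{loc}(v, v') = \mathcal{K}^{(l,n)}_{loc}(v, v')$, and the post-activation kernel be $\mathcal{K}^{(l)}_{loc,\sigma}(v, v') = \mathcal{K}^{(l,n)}_{loc,\sigma}(v, v')$. We recursively define
$$\mathcal{K}^{(l,j)}_{loc}(v, v') = \begin{cases} \mathcal{K}^{(l-1)}_{loc,\sigma}(v, v') & \text{if } j = 1 \\ \mathcal{K}^{(l-1)}_{loc,\sigma}(v, v') \circ \lambda \sum_{u \in N(v)} \sum_{u' \in N(v')} \mathcal{K}^{(l,j-1)}_{loc,\sigma}(u, u') & \text{if } j > 1 \end{cases}$$
for $j = 1, \ldots, n$. Finally, the graph kernel is $\mathcal{K}^{(L,n)}(G, G') = \sum_{v, v'} \mathcal{K}^{(L,n)}_{loc}(v, v')$. Similar to Theorem 2.2, we have
Theorem. Consider a deep graph kernel NN with $L$ layers and activation function $\sigma(\cdot)$. Let the final output state $h_G = \sum_v h^{(L)}[v]$. For $l = 1, \ldots, L$; $j = 1, \ldots, n$:
- $c_j^{(l)}[v]$ as a function of input $v$ and graph $G$ belongs to the RKHS of kernel $\mathcal{K}^{(l,j)}_{loc}(\cdot,\cdot)$;
- $h^{(l)}[v]$ belongs to the RKHS of kernel $\mathcal{K}^{(l,n)}_{loc,\sigma}(\cdot,\cdot)$;
- $h_G$ belongs to the RKHS of kernel $\mathcal{K}^{(L,n)}(\cdot,\cdot)$.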
3.4 Connection to Weisfeiler-Lehman Kernel
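We derived the above deep kernel NN for the purpose of generality. This model could be simplified by setting $n = 2$, without losing representational power (as non-linearity is already involved in the depth dimension). In this case, we rewrite the network by reparametrization:
$$h_v^{(l)} = \sigma\Big(U_1^{(l)} h_v^{(l-1)} \circ U_2^{(l)} \sum_{u \in N(v)} \sigma\big(V^{(l)} h_u^{(l-1)}\big)\Big) \tag{9}$$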
In this section, we further show that this model could be enhanced by sharing weight matrices $U$ and $V$ across layers. This parameter tying mechanism allows our model to embed the Weisfeiler-Lehman kernel (Shervashidze et al., 2011). For clarity, we briefly review basic concepts of the Weisfeiler-Lehman kernel below.
Weisfeiler-Lehman Graph Relabeling The Weisfeiler-Lehman kernel borrows concepts from the Weisfeiler-Lehman isomorphism test for labeled graphs. The key idea of the algorithm is to augment the node labels by the sorted set of node labels of neighbor nodes, and compress these augmented labels into new, short labels (Figure 2). This relabeling process is repeated $T$ times. In the $i$-th iteration, it generates a new labeling $l_i(v)$ for all nodes $v$ in graph $G$, with initial labeling $l_0$.
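The discrete relabeling step described above can be sketched in a few lines. This is a minimal illustration for a single labeled graph, assuming adjacency lists and hashable, mutually comparable labels; the function name `wl_relabel` is hypothetical. When comparing two graphs, the compression table would have to be shared between them, which this sketch does not do.

```python
def wl_relabel(adj, labels, iterations):
    """Return the labelings [l_0, l_1, ..., l_T] of one graph."""
    history = [dict(labels)]
    for _ in range(iterations):
        current = history[-1]
        # Augment each label with the sorted labels of the neighbors.
        augmented = {
            v: (current[v], tuple(sorted(current[u] for u in adj[v])))
            for v in adj
        }
        # Compress each distinct augmented label into a short integer id.
        table = {
            sig: idx
            for idx, sig in enumerate(sorted(set(augmented.values())))
        }
        history.append({v: table[augmented[v]] for v in adj})
    return history


# Path graph 0 - 1 - 2 in which every node starts with the same label.
path = {0: [1], 1: [0, 2], 2: [1]}
history = wl_relabel(path, {0: "C", 1: "C", 2: "C"}, 2)
print(history[1])
```

After one iteration the two end nodes share a label while the middle node receives a different one, because its neighborhood differs.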
Generalized Graph Relabeling The key observation here is that the graph relabeling operation could be viewed as neighbor feature aggregation. As a result, the relabeling process naturally generalizes to the case where nodes are associated with continuous feature vectors. In particular, let $r$ be the relabeling function. For a node $v \in G$:
$$r(v) = \sigma\Big(U_1 f_v + U_2 \sum_{u \in N(v)} \sigma(V f_u)\Big) \tag{10}$$
Note that our definition of $r(v)$ is exactly the same as $h_v$ in Equation 9, with $\circ$ being additive composition.
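Weisfeiler-Lehman Kernel Let $\mathcal{K}$ be any graph kernel (called base kernel). Given a relabeling function $r$, the Weisfeiler-Lehman kernel with base kernel $\mathcal{K}$ and depth $L$ is defined as
$$\mathcal{K}^{(L)}_{WL}(G, G') = \sum_{l=0}^{L} \mathcal{K}\big(r^l(G), r^l(G')\big) \tag{11}$$
where $r^0(G) = G$ and $r^l(G), r^l(G')$ are the $l$-th relabeled graph of $G$ and $G'$ respectively.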
Weisfeiler-Lehman Kernel NN Now with the above kernel definition, and random walk kernel as the base kernel, we propose the following recurrent module:
where and are shared across layers. The final output of this network is .
The above recurrent module is still an instance of deep kernel, even though some parameters are shared. A minor difference here is that there is an additional random walk kernel NN that connects the $i$-th layer and the output layer. But this is just a linear combination of deep random walk kernels (of different depth). Therefore, as a corollary of Theorem 3.3, we have:
Theorem. For a Weisfeiler-Lehman Kernel NN with $L$ iterations and random walk kernel $\mathcal{K}^n_W$ as base kernel, the final output state $h_G$ belongs to the RKHS of kernel $\mathcal{K}^{(L)}_{WL}(\cdot,\cdot)$.
4 Adaptive Decay with Neural Gates
The sequence and graph kernels (and their neural components) discussed so far use a constant decay value $\lambda$ regardless of the current input. However, a fixed decay is often not appropriate, since the importance of the input can vary across the context or the applications. One extension is to make use of neural gates that adaptively control the decay factor. Here we give two illustrative examples:
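Gated String Kernel NN By replacing the constant decay $\lambda$ with a sigmoid gate, we modify our single-layer sequence module as:
$$\begin{aligned} \lambda_t &= \sigma\big(U [x_t, h_{t-1}] + b\big) \\ c_1[t] &= \lambda_t \odot c_1[t-1] + \big(W^{(1)} x_t\big) \\ c_j[t] &= \lambda_t \odot c_j[t-1] + \big(c_{j-1}[t-1] \odot W^{(j)} x_t\big), \qquad 1 < j \le n \\ h[t] &= \sigma(c_n[t]) \end{aligned}$$
As compared with the original string kernel, now the decay factor $\lambda_{x,i,j}$ is no longer $\lambda^{|x|-i-1}$, but rather an adaptive value based on the current context.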
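A gated version of the sequence recurrence can be sketched as follows. This is an illustration under explicit assumptions rather than the authors' implementation: the gate is computed from the current input only, as sigmoid(U x_t + b), the states are combined multiplicatively, and the function returns the pre-activation state of the highest order. The names `gated_string_kernel_rnn`, `U` and `b` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(W, x):
    return [dot(row, x) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_string_kernel_rnn(xs, Ws, U, b):
    """Return the pre-activation state c_n after the last token."""
    n, h = len(Ws), len(Ws[0])
    c = [[0.0] * h for _ in range(n)]
    for x in xs:
        # Assumption: the decay gate depends on the current input only.
        gate = [sigmoid(dot(U[k], x) + b[k]) for k in range(h)]
        prev = [row[:] for row in c]
        proj = [matvec(W, x) for W in Ws]
        c[0] = [gate[k] * prev[0][k] + proj[0][k] for k in range(h)]
        for j in range(1, n):
            c[j] = [gate[k] * prev[j][k] + prev[j - 1][k] * proj[j][k]
                    for k in range(h)]
    return c[-1]


# With U = 0 and b = 0 the gate is the constant 0.5, so the module
# reduces to the constant-decay recurrence with decay 0.5.
print(gated_string_kernel_rnn([[1.0], [2.0]], [[[1.0]]], [[0.0]], [0.0]))
```

Setting the gate weights to zero recovers a constant decay, which gives a simple sanity check that the gated module generalizes the ungated one.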
Gated Random Walk Kernel NN Similarly, we could introduce gates so that different walks have different weights:
The underlying kernel of the above gated network becomes
where each path is weighted by different decay weights, determined by the network itself.
5 Related Work
Sequence Networks Considerable effort has gone into designing effective networks for sequence processing. This includes recurrent modules with the ability to carry persistent memories such as LSTM (Hochreiter & Schmidhuber, 1997) and GRU (Chung et al., 2014), as well as non-consecutive convolutional modules (RCNNs, Lei et al. (2015)), and others. More recently, Zoph & Le (2016) exemplified a reinforcement learning-based search algorithm to further optimize the design of such recurrent architectures. Our proposed neural networks offer similar state evolution and feature aggregation functionalities but derive the motivation for the operations involved from well-established kernel computations over sequences.
Recursive neural networks are alternative architectures to model hierarchical structures such as syntax trees and logic forms. For instance, Socher et al. (2013) employ recursive networks for sentence classification, where each node in the dependency tree of the sentence is transformed into a vector representation. Tai et al. (2015) further proposed Tree-LSTM, which incorporates LSTM-style architectures as the transformation unit. Dyer et al. (2015, 2016) recently introduced a recursive neural model for transition-based language modeling and parsing. While not specifically discussed in the paper, our ideas do extend to similar neural components for hierarchical objects (e.g. trees).
Graph Networks Most of the current graph neural architectures perform either convolutional or recurrent operations on graphs. Duvenaud et al. (2015) developed Neural Fingerprint for chemical compounds, where each convolution operation is a sum of neighbor node features, followed by a linear transformation. Our model differs from theirs in that our generalized kernels and networks can aggregate neighboring features in a non-linear way. Other approaches, e.g., Bruna et al. (2013) and Henaff et al. (2015), rely on the graph Laplacian or Fourier transform.
For recurrent architectures, Li et al. (2015) proposed gated graph neural networks, where neighbor features are aggregated by a GRU function. Dai et al. (2016) consider a different architecture where a graph is viewed as a latent variable graphical model. Their recurrent model is derived from Belief Propagation-like algorithms. Our approach is most closely related to Dai et al. (2016), in terms of neighbor feature aggregation and resulting recurrent architecture. Nonetheless, the focus of this paper is on providing a framework for how such recurrent networks could be derived from deep graph kernels.
Kernels and Neural Nets Our work follows recent work demonstrating the connection between neural networks and kernels (Cho & Saul, 2009; Hazan & Jaakkola, 2015). For example, Zhang et al. (2016) showed that standard feedforward neural nets belong to a larger space of recursively constructed kernels (given certain activation functions). Similar results have been made for convolutional neural nets (Anselmi et al., 2015), and general computational graphs (Daniely et al., 2016). We extend prior work to kernels and neural architectures over structured inputs, in particular, sequences and graphs. Another difference is how we train the model. While some prior work appeals to convex optimization through improper learning (Zhang et al., 2016; Heinemann et al., 2016) (since kernel space is larger), we use the proposed networks as building blocks in typical nonconvex but flexible neural network training.
6 Experiments
The leftover question is whether the proposed class of operations, despite its formal characteristics, leads to more effective architecture exploration and hence improved performance. In this section, we apply the proposed sequence and graph modules to various tasks and empirically evaluate their performance against other neural network models. These tasks include language modeling, sentiment classification and molecule regression.
6.1 Language Modeling on PTB
Dataset and Setup We use the Penn Tree Bank (PTB) corpus as the benchmark. The dataset contains about 1 million tokens in total. We use the standard train/development/test split of this dataset with vocabulary of size 10,000.
Model Configuration
Following standard practice, we use SGD with an initial learning rate of 1.0 and decrease the learning rate by a constant factor after a certain epoch. We backpropagate the gradient with an unroll size of 35 and use dropout (Hinton et al., 2012) as the regularization. Unless otherwise specified, we train 3-layer networks with and normalized adaptive decay (see the supplementary sections for a discussion of network variants). Following Zilly et al. (2016), we add highway connections (Srivastava et al., 2015) within each layer: where , is the gated decay factor and is the transformation gate of highway connections. (We found that non-linear activation is no longer necessary when the highway connection is added.)
Model | # Params | PPL
LSTM (large) (Zaremba et al., 2014) | 66m | 78.4
Character CNN (Kim et al., 2015) | 19m | 78.9
Variational LSTM (Gal & Ghahramani, 2016) | 20m | 78.6
Variational LSTM (Gal & Ghahramani, 2016) | 66m | 73.4
Pointer Sentinel-LSTM (Merity et al.) | 21m | 70.9
Variational RHN (Zilly et al., 2016) | 23m | 65.4
Neural Net Search (Zoph & Le, 2016) | 25m | 64.0
Kernel NN (constant $\lambda$) | 5m | 84.3
Kernel NN ($\lambda$ learned as parameter) | 5m | 76.8
Kernel NN (gated $\lambda$) | 5m | 73.6
Kernel NN (gated $\lambda$) | 20m | 69.2
+ variational dropout | 20m | 65.5
+ variational dropout, 4 RNN layers | 20m | 63.8
Results Table 1 compares our model with various state-of-the-art models. Our small model with 5 million parameters achieves a test perplexity of 73.6, already outperforming many results achieved using much larger networks. By increasing the network size to 20 million parameters, we obtain a test perplexity of 69.2 with standard dropout. Adding variational dropout (Gal & Ghahramani, 2016) within the recurrent cells further improves the perplexity to 65.5. Finally, the model achieves 63.8 perplexity when the recurrence depth is increased to 4, which is state-of-the-art and on par with the results reported in (Zilly et al., 2016; Zoph & Le, 2016). Note that Zilly et al. (2016) use 10 neural layers and Zoph & Le (2016) adopt a complex recurrent cell found by reinforcement learning-based search. Our network is architecturally much simpler.
Figure 3 analyzes several variants of our model. Word-level CNNs are degraded cases ($\lambda = 0$) that ignore non-contiguous n-gram patterns. Clearly, this variant performs worse compared to other recurrent variants with $\lambda > 0$. Moreover, the test perplexity improves from 84.3 to 76.8 when we train the constant decay vector as part of the model parameters. Finally, the last two variants utilize neural gates (depending on input $x_t$ only, or on both input $x_t$ and previous state $h[t-1]$), further improving the performance.
6.2 Sentiment Classification
Dataset and Setup We evaluate our model on the sentence classification task. We use the Stanford Sentiment Treebank benchmark (Socher et al., 2013). The dataset consists of 11855 parsed English sentences annotated at both the root (i.e. sentence) level and the phrase level using 5-class fine-grained labels. We use the standard split for training, development and testing. Following previous work, we also evaluate our model on the binary classification variant of this benchmark, ignoring all neutral sentences.
Following the recent work of DAN (Iyyer et al., 2015) and RLSTM (Tai et al., 2015), we use the publicly available 300-dimensional GloVe word vectors (Pennington et al., 2014). Unlike prior work which fine-tunes the word vectors, we normalize the vectors and keep them fixed for simplicity.
Model Configuration Our best model is a 3-layer network with and hidden dimension . We average the hidden states across , and concatenate the averaged vectors from the 3 layers as the input of the final softmax layer. The model is optimized with Adam (Kingma & Ba, 2015), with a dropout probability of 0.35.
Model | Fine | Binary
RNN (Socher et al., 2011) | 43.2 | 82.4
RNTN (Socher et al., 2013) | 45.7 | 85.4
DRNN (Irsoy & Cardie, 2014) | 49.8 | 86.8
RLSTM (Tai et al., 2015) | 51.0 | 88.0
DCNN (Kalchbrenner et al., 2014) | 48.5 | 86.9
CNN-MC (Kim, 2014) | 47.4 | 88.1
Bi-LSTM (Tai et al., 2015) | 49.1 | 87.5
LSTMN (Cheng et al., 2016) | 47.9 | 87.0
PVEC (Le & Mikolov, 2014) | 48.7 | 87.8
DAN (Iyyer et al., 2015) | 48.2 | 86.8
DMN (Kumar et al., 2016) | 52.1 | 88.6
Kernel NN, constant $\lambda$ | 51.2 | 88.6
Kernel NN, gated $\lambda$ | 53.2 | 89.9
Results Table 2 presents the performance of our model and other networks. We report the best results achieved across 5 independent runs. Our best model obtains 53.2% and 89.9% test accuracies on the fine-grained and binary tasks respectively. Our model with only a constant decay factor also obtains quite high accuracy, outperforming other baseline methods shown in the table.
Model | # Params | RMSE
Results from Dai et al. (2016):
Mean Predictor | 1 | 2.4062
Weisfeiler-Lehman Kernel, degree=3 | 1.6m | 0.2040
Weisfeiler-Lehman Kernel, degree=6 | 1378m | 0.1367
Embedded Mean Field | 0.1m | 0.1250
Embedded Loopy BP | 0.1m | 0.1174
Under Our Setup:
Neural Fingerprint | 0.26m | 0.1409
Embedded Loopy BP | 0.26m | 0.1065
Weisfeiler Kernel NN | 0.26m | 0.1058
Weisfeiler Kernel NN, gated | 0.26m | 0.1043
6.3 Molecular Graph Regression
Dataset and Setup We further evaluate our graph NN models on the Harvard Clean Energy Project benchmark, which has been used in Dai et al. (2016); Duvenaud et al. (2015) as their evaluation dataset. This dataset contains 2.3 million candidate molecules, with each molecule labeled with its power conversion efficiency (PCE) value.
We follow exactly the same train-test split as Dai et al. (2016), and the same re-sampling procedure on the training data (but not the test data) to make the algorithm put more emphasis on molecules with higher PCE values, since the data is distributed unevenly.
We use the same feature set as in Duvenaud et al. (2015) for atoms and bonds. Initial atom features include the atom’s element, its degree, the number of attached hydrogens, its implicit valence, and an aromaticity indicator. The bond feature is a concatenation of bond type indicator, whether the bond is conjugated, and whether the bond is in a ring.
Model Configuration Our model is a Weisfeiler-Lehman NN, with 4 recurrent iterations and . All models (including baselines) are optimized with Adam (Kingma & Ba, 2015), with learning rate decay factor 0.9.
Results In Table 3, we report the performance of our model against other baseline methods. Neural Fingerprint (Duvenaud et al., 2015) is a 4-layer convolutional neural network. Convolution is applied to each atom, which sums over its neighbors' hidden state, followed by a linear transformation and non-linear activation. Embedded Loopy BP (Dai et al., 2016) is a recurrent architecture, with 4 recurrent iterations. It maintains message vectors for each atom and bond, and propagates those vectors in a message passing fashion. Table 3 shows our model achieves state-of-the-art performance against various baselines.
7 Conclusion
We proposed a class of deep recurrent neural architectures and formally characterized its underlying computation using kernels. By linking kernel and neural operations, we have a “template” for deriving new families of neural architectures for sequences and graphs. We hope the theoretical view of kernel neural networks can be helpful for future model exploration.
Acknowledgement
We thank Prof. Le Song for sharing Harvard Clean Energy Project dataset. We also thank Yu Zhang, Vikas Garg, David Alvarez, Tianxiao Shen, Karthik Narasimhan and the reviewers for their helpful comments. This work was supported by the DARPA MakeIt program under contract ARO W911NF1620023.
References
 Anselmi et al. (2015) Anselmi, Fabio, Rosasco, Lorenzo, Tan, Cheston, and Poggio, Tomaso. Deep convolutional networks are hierarchical kernel machines. preprint arXiv:1508.01084, 2015.

 Balduzzi & Ghifary (2016) Balduzzi, David and Ghifary, Muhammad. Strongly-typed recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
 Bruna et al. (2013) Bruna, Joan, Zaremba, Wojciech, Szlam, Arthur, and LeCun, Yann. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.

 Cheng et al. (2016) Cheng, Jianpeng, Dong, Li, and Lapata, Mirella. Long short-term memory networks for machine reading. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 551–562, 2016.
 Cho & Saul (2009) Cho, Youngmin and Saul, Lawrence K. Kernel methods for deep learning. In Bengio, Y., Schuurmans, D., Lafferty, J. D., Williams, C. K. I., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 22, pp. 342–350. 2009.
 Chung et al. (2014) Chung, Junyoung, Gulcehre, Caglar, Cho, KyungHyun, and Bengio, Yoshua. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
 Dai et al. (2016) Dai, Hanjun, Dai, Bo, and Song, Le. Discriminative embeddings of latent variable models for structured data. arXiv preprint arXiv:1603.05629, 2016.
 Daniely et al. (2016) Daniely, Amit, Frostig, Roy, and Singer, Yoram. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. CoRR, abs/1602.05897, 2016.
 Duvenaud et al. (2015) Duvenaud, David K, Maclaurin, Dougal, Iparraguirre, Jorge, Bombarell, Rafael, Hirzel, Timothy, Aspuru-Guzik, Alán, and Adams, Ryan P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224–2232, 2015.
 Dyer et al. (2015) Dyer, Chris, Ballesteros, Miguel, Ling, Wang, Matthews, Austin, and Smith, Noah A. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Beijing, China, July 2015.
 Dyer et al. (2016) Dyer, Chris, Kuncoro, Adhiguna, Ballesteros, Miguel, and Smith, Noah A. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics, San Diego, California, June 2016.
 Gal & Ghahramani (2016) Gal, Yarin and Ghahramani, Zoubin. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems 29 (NIPS), 2016.
 Gärtner et al. (2003) Gärtner, Thomas, Flach, Peter, and Wrobel, Stefan. On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines, pp. 129–143. Springer, 2003.
 Greff et al. (2015) Greff, Klaus, Srivastava, Rupesh Kumar, Koutník, Jan, Steunebrink, Bas R, and Schmidhuber, Jürgen. Lstm: A search space odyssey. arXiv preprint arXiv:1503.04069, 2015.
 Hazan & Jaakkola (2015) Hazan, Tamir and Jaakkola, Tommi. Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133, 2015.

 Heinemann et al. (2016) Heinemann, Uri, Livni, Roi, Eban, Elad, Elidan, Gal, and Globerson, Amir. Improper deep kernels. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 1159–1167, 2016.
 Henaff et al. (2015) Henaff, Mikael, Bruna, Joan, and LeCun, Yann. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
 Hinton et al. (2012) Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
 Hochreiter & Schmidhuber (1997) Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Irsoy & Cardie (2014) Irsoy, Ozan and Cardie, Claire. Deep recursive neural networks for compositionality in language. In Advances in Neural Information Processing Systems, 2014.
 Iyyer et al. (2014) Iyyer, Mohit, Boyd-Graber, Jordan, Claudino, Leonardo, Socher, Richard, and Daumé III, Hal. A neural network for factoid question answering over paragraphs. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 633–644, Doha, Qatar, October 2014.
 Iyyer et al. (2015) Iyyer, Mohit, Manjunatha, Varun, Boyd-Graber, Jordan, and Daumé III, Hal. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2015.
 Kalchbrenner et al. (2014) Kalchbrenner, Nal, Grefenstette, Edward, and Blunsom, Phil. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.
 Kim (2014) Kim, Yoon. Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 2014.
 Kim et al. (2015) Kim, Yoon, Jernite, Yacine, Sontag, David, and Rush, Alexander M. Character-aware neural language models. Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
 Kingma & Ba (2015) Kingma, Diederik P and Ba, Jimmy Lei. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 Kumar et al. (2016) Kumar, Ankit, Irsoy, Ozan, Ondruska, Peter, Iyyer, Mohit, Bradbury, James, Gulrajani, Ishaan, Zhong, Victor, Paulus, Romain, and Socher, Richard. Ask me anything: Dynamic memory networks for natural language processing. 2016.
 Le & Mikolov (2014) Le, Quoc and Mikolov, Tomas. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1188–1196, 2014.
 Lee et al. (2017) Lee, Kenton, Levy, Omer, and Zettlemoyer, Luke. Recurrent additive networks. Preprint, 2017.
 Lei et al. (2015) Lei, Tao, Joshi, Hrishikesh, Barzilay, Regina, Jaakkola, Tommi, Tymoshenko, Katerina, Moschitti, Alessandro, and Marquez, Lluis. Semi-supervised question retrieval with gated convolutions. arXiv preprint arXiv:1512.05726, 2015.
 Li et al. (2015) Li, Yujia, Tarlow, Daniel, Brockschmidt, Marc, and Zemel, Richard. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
 Lodhi et al. (2002) Lodhi, Huma, Saunders, Craig, Shawe-Taylor, John, Cristianini, Nello, and Watkins, Chris. Text classification using string kernels. Journal of Machine Learning Research, 2(Feb):419–444, 2002.
 Merity et al. (2016) Merity, Stephen, Xiong, Caiming, Bradbury, James, and Socher, Richard. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
 Pennington et al. (2014) Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. GloVe: Global vectors for word representation. volume 12, 2014.
 Press & Wolf (2016) Press, Ofir and Wolf, Lior. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.
 Ramon & Gärtner (2003) Ramon, Jan and Gärtner, Thomas. Expressivity versus efficiency of graph kernels. In First international workshop on mining graphs, trees and sequences, pp. 65–74. Citeseer, 2003.
 Shalev-Shwartz et al. (2011) Shalev-Shwartz, Shai, Shamir, Ohad, and Sridharan, Karthik. Learning kernel-based halfspaces with the 0-1 loss. SIAM Journal on Computing, 40(6):1623–1646, 2011.
 Shervashidze et al. (2011) Shervashidze, Nino, Schweitzer, Pascal, Leeuwen, Erik Jan van, Mehlhorn, Kurt, and Borgwardt, Karsten M. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
 Socher et al. (2011) Socher, Richard, Pennington, Jeffrey, Huang, Eric H, Ng, Andrew Y, and Manning, Christopher D. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 151–161, 2011.
 Socher et al. (2013) Socher, Richard, Perelygin, Alex, Wu, Jean, Chuang, Jason, Manning, Christopher D., Ng, Andrew Y., and Potts, Christopher. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, October 2013.
 Srivastava et al. (2015) Srivastava, Rupesh K, Greff, Klaus, and Schmidhuber, Jürgen. Training very deep networks. In Advances in neural information processing systems, pp. 2377–2385, 2015.
 Tai et al. (2015) Tai, Kai Sheng, Socher, Richard, and Manning, Christopher D. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, 2015.
 Tamar et al. (2016) Tamar, Aviv, Levine, Sergey, Abbeel, Pieter, Wu, Yi, and Thomas, Garrett. Value iteration networks. In Advances in Neural Information Processing Systems, pp. 2146–2154, 2016.
 Vishwanathan et al. (2010) Vishwanathan, S Vichy N, Schraudolph, Nicol N, Kondor, Risi, and Borgwardt, Karsten M. Graph kernels. Journal of Machine Learning Research, 11(Apr):1201–1242, 2010.
 Zaremba et al. (2014) Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
 Zhang et al. (2016) Zhang, Yuchen, Lee, Jason D., and Jordan, Michael I. $\ell_1$-regularized neural networks are improperly learnable in polynomial time. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
 Zilly et al. (2016) Zilly, Julian Georg, Srivastava, Rupesh Kumar, Koutník, Jan, and Schmidhuber, Jürgen. Recurrent Highway Networks. arXiv preprint arXiv:1607.03474, 2016.
 Zoph & Le (2016) Zoph, Barret and Le, Quoc V. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
Appendix A Examples of kernel / neural variants
Our theoretical results also apply to other variants of sequence kernels and their associated neural components. We give some examples in this section. Table 4 shows three network variants, corresponding to the three realizations of string kernels provided in Table 5.
Connection to LSTMs
Interestingly, much recent work has arrived at similar RNN architectures through empirical exploration. For example, Greff et al. (2015) found that simplifying LSTMs, by removing the input gate or coupling it with the forget gate, does not significantly change performance. However, the forget gate (corresponding to the decay factor $\lambda$ in our notation) is crucial for performance. This is consistent with our theoretical analysis and the empirical results in Figure 3. Moreover, Balduzzi & Ghifary (2016) and Lee et al. (2017) both suggest that a linear additive state computation suffices to provide competitive performance compared to LSTMs:
$$c[t] = \lambda_f \odot c[t-1] + \lambda_i \odot (W x_t), \qquad h[t] = \sigma(c[t])$$
In fact, this variant becomes an instance of the kernel NN presented in this work (with $n = 1$ and adaptive gating) when $\lambda_f = \lambda_t$ and $\lambda_i = 1 - \lambda_t$ or $1$.
Footnote 7: Balduzzi & Ghifary (2016) also include the previous token $x_{t-1}$, i.e. $W x_t + W' x_{t-1}$, which does not affect the discussion here.
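The linear additive state computation discussed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `additive_rnn`, the constant decay `lam`, the choice of `tanh` as the activation, and the `coupled` switch (input weight `1 - lam` versus `1`) are all hypothetical choices made for the example. An adaptive gate would instead compute the decay from the input at each step.

```python
import numpy as np

def additive_rnn(xs, W, lam, coupled=True):
    """Run c[t] = lam * c[t-1] + g * (W @ x_t) and h[t] = tanh(c[t]).

    g = 1 - lam when `coupled` is True (normalized aggregation), else g = 1.
    `lam` is a constant decay here; an adaptive gate is an easy extension.
    """
    c = np.zeros(W.shape[0])
    hs = []
    for x in xs:
        g = 1.0 - lam if coupled else 1.0
        c = lam * c + g * (W @ x)      # purely additive state update
        hs.append(np.tanh(c))          # activation applied only on output
    return np.stack(hs), c

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 4))           # 5 tokens, 4-dimensional embeddings
W = rng.normal(size=(3, 4))            # projects tokens to a 3-dimensional state
hs, c_last = additive_rnn(xs, W, lam=0.7)
```

Because the update is linear in the inputs, the final state is simply a decay-weighted sum of the projected tokens, which is what makes the connection to a weighted substructure count possible.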
Table 4: Three network variants: (a) multiplicative mapping, unnormalized aggregation; (b) multiplicative mapping, normalized aggregation; (c) additive mapping, normalized aggregation. The final activation may be any linear combination of the states.
Table 5: The string kernels corresponding to variants (a), (b), and (c) of Table 4.
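To make the correspondence between a recurrent update and a string kernel concrete, the following sketch checks it numerically for an unnormalized, multiplicative variant with substructures of length two. The exact forms used here are assumptions made for illustration: the kernel sums decayed products of token similarities over all ordered position pairs, and the recurrence uses an outer product in place of learned projection matrices so that the feature map is exact. The names `bigram_kernel` and `bigram_features` are hypothetical.

```python
import numpy as np

def bigram_kernel(x, y, lam):
    """Brute force: sum over pairs i<j in x and k<l in y of
    decay * <x_i, y_k> * <x_j, y_l>, with decay depending on i and k."""
    total = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            for k in range(len(y)):
                for l in range(k + 1, len(y)):
                    decay = lam ** (len(x) - i - 2) * lam ** (len(y) - k - 2)
                    total += decay * (x[i] @ y[k]) * (x[j] @ y[l])
    return total

def bigram_features(x, lam):
    """Recurrent computation of the same feature map in one pass.

    c1 aggregates single tokens; c2 aggregates ordered token pairs by
    combining the previous c1 with the current token.
    """
    d = x.shape[1]
    c1 = np.zeros(d)
    c2 = np.zeros((d, d))
    for x_t in x:
        c2 = lam * c2 + np.outer(c1, x_t)  # uses c1 from the previous step
        c1 = lam * c1 + x_t
    return c2

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 3))   # sequence of 4 tokens
y = rng.normal(size=(5, 3))   # sequence of 5 tokens
lam = 0.8
k_brute = bigram_kernel(x, y, lam)
k_recurrent = np.sum(bigram_features(x, lam) * bigram_features(y, lam))
```

The two values agree up to floating-point error, since the inner product of two outer products factorizes as $\langle a \otimes b, c \otimes d \rangle = \langle a, c \rangle \langle b, d \rangle$. The recurrent form needs a single pass over each sequence, whereas the brute-force sum is quadratic in each sequence length.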