1 Introduction
The explosive growth in the volume of scientific publications in the biomedical domain calls for efforts to automate information extraction from biomedical literature. In particular, the task of identifying biological entities and their relations from scientific papers has attracted significant recent attention [Garg et al. 2016, Hahn and Surdeanu 2015, Krallinger et al. 2008, Hakenberg et al. 2008], especially for its potential impact on personalized cancer treatment [Cohen 2015, Rzhetsky 2016, Valenzuela-Escárcega et al. 2017]. See Fig. 1 for an example of the relation extraction task.
For the relation extraction task, approaches based on convolution kernels [Haussler 1999] have demonstrated state-of-the-art performance [Chang et al. 2016, Garg et al. 2016, Qian and Zhou 2012, Tikk et al. 2010, Airola et al. 2008]. Despite their success and intuitive appeal, kernel methods can suffer from relatively high computational cost, since computing kernel similarities between two natural language structures (graphs, paths, sequences, etc.) can be a relatively expensive operation. Furthermore, to build a support vector machine (SVM) or a k-nearest neighbor (kNN) classifier from $N$ training examples, one needs to compute kernel similarities between $O(N^2)$ pairs of training points, which can be prohibitively expensive for large $N$.
There have been several attempts to make kernel methods more efficient for NLP tasks, using techniques such as caching kernel similarities between substructures or making each kernel similarity computation cheaper [Moschitti 2006, Severyn and Moschitti 2012, Severyn and Moschitti 2013]. However, these approaches do not eliminate the quadratic bottleneck of computing kernel similarities between pairs of training examples during model training. The introduction of kernelized locality-sensitive hashing (KLSH) [Kulis and Grauman 2009, Joly and Buisson 2011] allowed a reduction in the number of kernel computations to a number linear in the number of training examples, $O(N)$, by providing an efficient approximation for constructing kNN graphs. Those methods, however, are limited to classifiers that operate on kNN graphs. Thus, the question is whether one can generalize this scalable approach to more general classifiers.
The main contribution of this paper is a principled approach for building an explicit representation for structured data, as opposed to an implicit one, by using random subspaces of KLSH codes. The intuition behind our approach is as follows. If we keep the total number of bits in the KLSH codes of NLP structures relatively large (e.g., 1000 bits), and take many random subsets of bits (e.g., 30 bits each), we can build a large variety of generalized representations that preserve the detailed information present in NLP structures by distributing this information across different generalizations. (Footnote 1: The cost of computing a KLSH code is linear w.r.t. the number of bits.)
The main advantage of the proposed representation is that it lends itself to more general classification methods such as random forest (RF) [Ho 1995, Breiman 2001]. Fig. 2 depicts the high-level workflow of the proposed approach.
As our second major contribution, we propose a theoretically justified and computationally efficient method for optimizing the KLSH representation with respect to: (i) KLSH-specific configuration parameters, or tuning parameters of a classifier such as the number of trees in an RF, and (ii) a reference set of examples w.r.t. which the kernel similarities of each example in the training/test sets are computed for obtaining its KLSH code. This is accomplished by maximizing a variational lower bound on the mutual information between KLSH codes of NLP structures and their class labels. (Footnote 2: Upon publication, code will be released on GitHub, along with preprocessed features for the public datasets.)
In addition to the scalability issue, kernels also suffer from a lack of flexibility, depending typically on a small number of tunable parameters, as opposed to, for instance, neural networks that can have millions of learnable parameters. Another important contribution of this paper is a nonstationary extension to the conventional convolution kernels that achieves better expressiveness and flexibility through a richer parameterization of the kernel similarity function. The additional parameters that result from the nonstationary extension can also be optimized by maximizing the lower bound.
We validate our model on the relation extraction task using four publicly available datasets. We observe significant improvements in F1 scores w.r.t. the state-of-the-art methods, including RNN, CNN, etc., along with large reductions in compute time w.r.t. traditional kernel-based classifiers.
2 Learning Kernelized Hashcode Representation for Classification
As indicated in Fig. 1, we map the relation extraction task to a classification problem, where each candidate interaction, as represented by a corresponding (sub)structure, is classified as either valid or invalid.
Let $S = \{S_i\}_{i=1}^{N}$ be a set of data points representing NLP structures (such as sequences, paths, graphs) with their corresponding class labels, $y = \{y_i\}_{i=1}^{N}$. Our goal is to infer the class label of a given test data point $S_*$. Within the kernel-based methods, this is done via a kernel similarity function $K(S_i, S_j; \theta)$ defined for any pair of structures $S_i$ and $S_j$, augmented with an appropriate kernel-based classifier [Garg et al. 2016, Srivastava et al. 2013, Culotta and Sorensen 2004, Zelenko et al. 2003].
The remainder of this section is organized as follows: We provide a background on kernel locality-sensitive hashing (KLSH) techniques in Sec. 2.1, and present our proposal to leverage high-dimensional KLSH codes as explicit feature representations in Sec. 2.2. (Footnote 3: We use the terms KLSH and kernel-hashing interchangeably.) In Sec. 2.3 we introduce efficient algorithms for task-specific optimization of KLSH codes. Finally, we discuss the nonstationary extension of kernels in Sec. 2.4.
2.1 Kernel Locality-Sensitive Hashing (KLSH)
We now describe a generic method for mapping an arbitrary data point $S_i$ to a binary kernel-hashcode $c_i \in \{0, 1\}^H$, using a KLSH technique that relies upon a convolution kernel function $K(\cdot, \cdot; \theta)$.
Let us consider a set of data points $S$ that might include both labeled and unlabeled examples. As a first step in constructing the KLSH codes, we select a random subset $S^R \subset S$ of size $|S^R| = M$, which we call a reference set; this corresponds to the grey dots in the leftmost panel of Fig. 3. Typically, the size of the reference set is significantly smaller than the size of the whole dataset, $M \ll N$.
Next, let $k_i$ be a real-valued vector of size $M$, whose $j$-th component is the kernel similarity between the data point $S_i$ and the $j$-th element in the reference set, $k_{i,j} = K(S_i, S^R_j; \theta)$. Further, let $h_l(k_i)$, $l = 1, \dots, H$, be a set of $H$ binary-valued hash functions that take $k_i$ as an input and map it to binary bits, and let $h(k_i) = \{h_l(k_i)\}_{l=1}^{H}$. The kernel hashcode representation is then given as $c_i = h(k_i)$.
We now describe a specific choice of hash functions $h_l(\cdot)$ based on nearest neighbors, called Random k Nearest Neighbors (RkNN). For a given $l$, let $S^1_l \subset S^R$ and $S^2_l \subset S^R$ be two randomly selected, equal-sized and non-overlapping subsets of the reference set $S^R$, $|S^1_l| = |S^2_l| = \alpha$, $S^1_l \cap S^2_l = \emptyset$. Those sets are indicated by red and blue dots in Fig. 3. Furthermore, let $k^1_{i,l} = \max_{S \in S^1_l} K(S_i, S)$ be the similarity between $S_i$ and its nearest neighbor in $S^1_l$, with $k^2_{i,l}$ defined similarly (indicated by red and blue arrows in Fig. 3). Then the corresponding hash function is:
(1) $h_l(k_i) = \begin{cases} 1 & \text{if } k^1_{i,l} < k^2_{i,l} \\ 0 & \text{otherwise} \end{cases}$
A pictorial illustration of this hashing scheme is provided in Fig. 3, where the nearest neighbors of $S_i$ in either subset are indicated by the red and blue arrows. (Footnote 4: A small value of $\alpha$, i.e. $1 \le \alpha \ll M$, should ensure that hashcode bits have minimal redundancy w.r.t. each other.)
Other Kernel-Hashing Techniques: The same principle of random subsampling is applied in the KLSH techniques previously proposed in [Kulis and Grauman 2009, Joly and Buisson 2011]. In [Joly and Buisson 2011], each hash function is built by learning a (random) maximum margin boundary (RMM) that discriminates between the two random subsets of the reference set. In [Kulis and Grauman 2009], each hash function is obtained from an (approximately) random linear hyperplane in the kernel-implied feature space; this is referred to as “Kulis” here. A KLSH model is specified by its parameters, which include the type of hash function, chosen from the three options above (RkNN, RMM, or Kulis).
Efficiency of Hashcode Computation: It is worth noting that for a fixed number of kernel computations $M$ per structure, a large number of hashcode bits ($H$) can be generated through the randomization principle, with computational cost linear in $H$. We exploit this property in our main algorithmic contribution, discussed next.
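The RkNN construction can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in this paper: the Jaccard token kernel is a stand-in for a convolution kernel, all function names are hypothetical, and the convention that a bit is 1 when the nearest neighbor in the second subset is the more similar one is assumed here.

```python
import random


def token_overlap_kernel(a, b):
    """Toy kernel (assumption): Jaccard similarity between two token lists."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def build_rknn_hash_functions(ref_size, num_bits, alpha, seed=0):
    """For each bit, draw two disjoint index subsets of the reference set, each of size alpha."""
    rng = random.Random(seed)
    functions = []
    for _ in range(num_bits):
        idx = rng.sample(range(ref_size), 2 * alpha)  # requires 2 * alpha <= ref_size
        functions.append((idx[:alpha], idx[alpha:]))
    return functions


def rknn_hashcode(structure, reference_set, functions, kernel):
    """Compute the kernel vector once, then derive every hashcode bit from it."""
    k = [kernel(structure, r) for r in reference_set]  # one kernel call per reference element
    code = []
    for first, second in functions:
        k1 = max(k[j] for j in first)   # similarity to the nearest neighbor in the first subset
        k2 = max(k[j] for j in second)  # similarity to the nearest neighbor in the second subset
        code.append(1 if k1 < k2 else 0)
    return code
```

Note that the number of kernel evaluations per structure equals the reference set size regardless of how many bits are generated; only the cheap comparisons grow with the number of bits.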
2.2 Kernel-Hashcodes as Explicit Feature Representations of NLP Structures
We propose a novel use of KLSH codes as generalized representations (feature vectors).
KLSH in kNN to General Classifiers: The key idea of KLSH as an approximate technique for finding the nearest neighbors of a structure $S_i$ is that, rather than computing its kernel similarity w.r.t. all other elements in the dataset $S$, kernel similarities are computed only w.r.t. the NLP structures in the bucket of $S_i$ or in the neighboring buckets, i.e. the buckets of hashcodes with small Hamming distance to its hashcode $c_i$. This approximation works well in practice if the hashing approach is locality sensitive. (Footnote 5: See a formal definition of locality-sensitive hashing in [Indyk and Motwani 1998, Definition 7 in Sec. 4.2].) The locality-sensitivity property states that NLP structures that are assigned kernel-hashcodes with low Hamming distance to each other are highly similar (nearest neighbors) to each other according to the kernel function; this property not only means that nearest neighbors can be found within the neighboring hashcode buckets, but also implies that KLSH codes can serve as explicit representations of NLP structures.
Longer Kernel-Hashcodes: In kNN classifiers using KLSH, a small number of hashcode bits, corresponding to a small number of hashcode buckets, generates a coarse partition of the feature space, which is sufficient for approximate computation of a kNN graph. In our representation learning framework, however, hashcodes must extract enough information about class labels from NLP structures, so we propose to generate longer hashcodes.
Kernel-Hashcodes Explicitly Representing Substructures: Unlike regular kernel methods (SVM, kNN, etc.), we use kernels to build an explicit feature space, via kernel-hashing. Referring to Fig. 3, when using the RkNN technique to obtain the hashcode $c_i$ of a structure $S_i$, each hashcode bit should correspond to finding a substructure in $S_i$ that is also present in its 1-NN from one of the two random subsets of the reference set, depending on the bit value being 0 or 1. Thus, $c_i$ represents finding important substructures in $S_i$ in relation to the reference set. The same should apply for the other KLSH techniques. (Footnote 6: Here, high value of , , should lead to higher flexibility in finding generalized substructures.)
Random Subspaces of Kernel Hashcodes:
The next question is how to use the binary-valued representations for building a good classifier. Intuitively, not all the bits may match across the hashcodes of NLP structures in training and test datasets, so a single classifier trained on all the hashcode bits may overfit in testing scenarios. This is especially relevant for bio-information extraction tasks, where there is a high possibility of mismatch between training and test conditions [Airola et al. 2008, Garg et al. 2016]. Therefore, we adopt the approach of building an ensemble of classifiers, with each one built on a random subspace of hashcodes [Zhou 2012, Kuncheva 2004, Ho 1998].
For building each classifier in the ensemble, a subset of bits is selected randomly from the hash bits; for inference on a test NLP structure, we take the mean of the inferred probability vectors from the classifiers, as is standard practice in ensemble approaches. Another way of building an ensemble from subspaces of hashcodes is bagging [Breiman 1996]. If we use a decision tree as the classifier in the ensemble, it corresponds to a random forest [Ho 1995, Breiman 2001].
It is highly efficient to train a random forest (RF) with a large number of decision trees, even on long binary hashcodes, leveraging the fact that decision trees can be very efficient to train and test on binary features.
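The random-subspace idea can be illustrated with the following Python sketch. It is only a sketch under stated assumptions: a small Bernoulli naive Bayes model is used as a stand-in base learner (the paper uses decision trees, i.e. a random forest), and the class and function names are hypothetical. Each base learner sees a random subset of bit positions, and predictions are the mean of the per-learner probability vectors.

```python
import math
import random


class BernoulliNB:
    """Tiny Bernoulli naive Bayes over selected bit positions (stand-in base learner)."""

    def __init__(self, bit_idx):
        self.bit_idx = bit_idx

    def fit(self, codes, labels):
        self.classes = sorted(set(labels))
        self.prior, self.p_one = {}, {}
        for c in self.classes:
            rows = [x for x, y in zip(codes, labels) if y == c]
            self.prior[c] = len(rows) / len(codes)
            # Laplace-smoothed probability that each selected bit equals 1
            self.p_one[c] = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                             for j in self.bit_idx]
        return self

    def predict_proba(self, code):
        log_post = {}
        for c in self.classes:
            lp = math.log(self.prior[c])
            for p, j in zip(self.p_one[c], self.bit_idx):
                lp += math.log(p if code[j] else 1.0 - p)
            log_post[c] = lp
        m = max(log_post.values())
        z = sum(math.exp(v - m) for v in log_post.values())
        return {c: math.exp(v - m) / z for c, v in log_post.items()}


def fit_subspace_ensemble(codes, labels, n_models, bits_per_model, seed=0):
    """Train each base learner on its own random subset of hashcode bit positions."""
    rng = random.Random(seed)
    n_bits = len(codes[0])
    return [BernoulliNB(rng.sample(range(n_bits), bits_per_model)).fit(codes, labels)
            for _ in range(n_models)]


def ensemble_proba(models, code):
    """Mean of the per-model class probability vectors."""
    classes = models[0].classes
    return {c: sum(m.predict_proba(code)[c] for m in models) / len(models)
            for c in classes}
```

Because every base learner only reads its own bit positions, a bit that behaves differently at test time affects only the learners that happened to select it.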
2.3 Supervised Optimization of KLSH Model
Here we propose a framework for optimization of the KLSH model for a supervised classification task. The parameters of the model to be optimized include the reference set (Footnote 7: In the previous works [Joly and Buisson 2011, Kulis and Grauman 2012], the reference set is chosen randomly.), the configuration parameters, and the kernel parameters.
One important aspect of the above-specified optimization problem is that the optimal kernel hashing parameters are shared between all the hash functions jointly, and are not specific to any of the hash functions. This kind of optimization allows us to keep individual hash functions randomized (as each hash function is built using a small random subset of the reference set), which should help in learning kernel-hashcode-based representations that are more robust to overfitting a training dataset. Also, while it is theoretically difficult to establish, this inherent randomization of hash functions, despite the learning of the parameters, should help towards keeping kernel hashcodes locality-sensitive (an important property for hashcodes to be good feature representations).
Mutual Information as an Objective Function: Intuitively, we want to generate hashcodes that are maximally informative about the class labels. Thus, for optimizing the KLSH model, a natural objective function is the mutual information (MI) between the kernel-hashcodes ($c$) of the NLP structures and the class labels ($y$), $I(c : y) = \mathcal{H}(y) - \mathcal{H}(y|c)$, where $\mathcal{H}(\cdot)$ is the Shannon entropy [Cover and Thomas 2012]. MI is a fundamental measure of dependence between random variables and has been used extensively for feature selection problems in supervised settings [Fleuret 2004, Peng et al. 2005, Brown et al. 2012, Nguyen et al. 2014, Chen et al. 2018].
Unfortunately, estimating MI in high-dimensional settings is an extremely difficult problem due to the curse of dimensionality [Donoho and others 2000, Kraskov et al. 2004, Walters-Williams and Li 2009, Singh and Póczos 2014, Gao et al. 2015]. Instead, here we propose to maximize a variational lower bound on MI [Barber and Agakov 2003, Chalk et al. 2016, Gao et al. 2016, Alemi et al. 2017].
To derive the lower bound (LB), we rewrite the mutual information as [Barber and Agakov 2003]:
(2) $I(c : y) = \mathcal{H}(y) + \langle \log p(y|c) \rangle_{p(y, c)} \geq \mathcal{H}(y) + \langle \log q(y|c) \rangle_{p(y, c)}$
where $q(y|c)$ is an arbitrary proposal distribution, and the inequality follows from the non-negativity of the Kullback-Leibler divergence between the true and variational distributions, $p(y|c)$ and $q(y|c)$. We denote the lower bound in Eq. 2 by $I_{LB}(c : y)$. For high-dimensional kernel hashcodes, computing $p(y|c)$, which is required to compute $I(c : y)$, is intractable. One can instead use a proposal distribution $q(y|c)$ approximating $p(y|c)$, which may be easier to compute; this leads to the variational MI LB. Any classifier that can infer class probabilities is usable as a proposal distribution.
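Empirical Estimate of MI LB: For an empirical estimate, the expectation of $\log q(y|c)$ is computed approximately using samples of hashcodes and their class labels: for a given KLSH model, the hashcodes $c_i$ of the sampled structures are computed, and a discriminative classifier is trained on them to obtain the class inference probabilities $q(y_i|c_i)$, which gives
(3) $\hat{I}_{LB}(c : y) = \mathcal{H}(y) + \frac{1}{n} \sum_{i=1}^{n} \log q(y_i | c_i)$
where $n$ is the number of samples.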
For computational efficiency as well as robustness w.r.t. overfitting, we use small random subsets of a training set for stochastic empirical estimates of the MI LB, motivated by the idea of stochastic gradients [Bottou 2010].
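As an illustration, the empirical MI lower bound can be computed from class labels and the class probabilities inferred by any probabilistic classifier. The Python sketch below assumes the estimate takes the form "label entropy plus the average log-probability assigned to the true labels" (in nats); the function names and the clipping constant `eps` are hypothetical choices made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mi_lower_bound(labels, predicted_probs, eps=1e-12):
    """Empirical lower bound: entropy(labels) + mean_i log q(y_i | c_i).

    predicted_probs[i] is a dict mapping each class to the probability that
    the proposal classifier assigns to it for the i-th hashcode.
    """
    avg_log_q = sum(math.log(max(p[y], eps))
                    for y, p in zip(labels, predicted_probs)) / len(labels)
    return entropy(labels) + avg_log_q
```

Under this form, a classifier that always predicts the true label with probability 1 attains the label entropy, while one that predicts the class prior yields a bound of zero, so a higher value indicates hashcodes that are more informative about the labels.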
MI LB Estimate with RF: Preferably, an ensemble (RF) classifier can be used for the proposal distribution $q$.
(4) 
The advantage of using RF for the proposal distribution, besides efficiency, is that it should implicitly lead to computing a LB on the MI between subspaces of the hashcodes $c$ and the class labels $y$, and is thus helpful for more robust learning of the KLSH model.
Next, we discuss algorithms optimizing different kinds of parameters in the kernel hashing model by maximizing the MI LB; since kernel matrices can be computed in parallel using multiple cores, all the algorithms can take advantage of parallel computing.
Informative Selection of Reference Set:
In Alg. 1, the reference set is selected by maximizing the MI LB.
The MI LB is maximized greedily, adding one element to the reference set in each step (line 3); greedy maximization of MI-like objectives has been successful [Gao et al. 2016, Krause et al. 2008]. Since we need a minimal size of the reference set to generate hashcodes, we initialize it as a random subset of the dataset of small size (line 1). Value of increases in each greedy step, as a constant fraction of the increasing size of the reference set (line 8).
In a greedy step, we compute MI LB values for all the candidate NLP structures (lines 8 to 17), and the one with the highest MI LB is selected as the greedy optimal (line 19). Employing the paradigm of stochastic sampling, for estimating MI LB for a candidate, we randomly sample a small subset of of size along with their class labels (line 9). Also, we consider only a small random subset of as candidates, of size (line 10). Kernel matrices are computed in advance (lines 11, 12), and then used to compute MI LB value for each of the candidates (lines 15, 16, 17). Alg. 1 requires kernel computations of order, , with being the sampling size constants; in practice, .
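The greedy loop described above can be sketched as follows. This is a simplified Python sketch, not Alg. 1 itself: `score_fn` is a hypothetical callable standing in for the stochastic MI LB estimate of a candidate reference set, and precomputation of kernel matrices is omitted.

```python
import random


def greedy_reference_selection(pool, score_fn, init_size, final_size,
                               n_candidates, seed=0):
    """Grow a reference set one element at a time.

    pool         -- list of available structures
    score_fn     -- hypothetical stand-in for the MI LB estimate of a reference set
    n_candidates -- number of randomly sampled candidates scored per greedy step
    """
    rng = random.Random(seed)
    reference = rng.sample(pool, init_size)  # random initialization of small size
    while len(reference) < final_size:
        remaining = [s for s in pool if s not in reference]
        if not remaining:
            break
        # Score only a small random subset of candidates in each step
        candidates = rng.sample(remaining, min(n_candidates, len(remaining)))
        best = max(candidates, key=lambda s: score_fn(reference + [s]))
        reference.append(best)
    return reference
```

Scoring only a random subset of candidates per step keeps the number of objective evaluations proportional to the candidate sample size rather than to the size of the whole pool.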
Optimizing Configuration Parameters:
Alg. 2 jointly optimizes the configuration parameters of KLSH model () and the ensemble classification specific parameters (), maximizing MI LB with grid search over the parameters; it is easily extensible for other search strategies [Snoek et al.2015, Bergstra and Bengio2012].
Alg. 2 computes kernel matrix, , between and (line 1). During the optimization, for any given KLSH configuration (), is reused to compute hashcodes for random subsets of (line 3, 4); for each configuration of ensemble (), an empirical estimate of MI LB is obtained from the hashcodes and the class labels (line 7). In theory, if hashcode distributions could be represented analytically, or we had an infinite number of samples, maximizing will favor the highest possible value for ; however, for empirical estimates of , a smaller value for should be more optimal due to a small size () of samples set (line 3), favoring a model with low complexity as desired (aligning to the Occam’s razor principle on model complexity).
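A grid search of this kind can be sketched in Python as below. The sketch is generic and makes assumptions: `objective` is a hypothetical callable that would return the empirical MI LB for a given configuration, and the parameter names in the usage example (`alpha`, `n_trees`) are illustrative only.

```python
from itertools import product


def grid_search(param_grid, objective):
    """Exhaustively evaluate every combination in param_grid and keep the best.

    param_grid -- dict mapping a parameter name to a list of candidate values
    objective  -- hypothetical callable: dict of parameters -> score to maximize
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Usage with a toy objective in place of the MI LB estimate
grid = {"alpha": [1, 3, 5], "n_trees": [50, 100]}
best, score = grid_search(
    grid, lambda p: -(p["alpha"] - 3) ** 2 - abs(p["n_trees"] - 100) / 100)
```

Replacing the exhaustive loop with random or Bayesian search only changes how candidate configurations are proposed; the objective being maximized stays the same.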
2.4 Nonstationary Extension for Kernels
One common principle applicable to all the convolution kernel functions, , defining similarity between any two NLP structures is: is expressed in terms of a kernel function, , that defines similarity between any two tokens (node/edge labels in Fig. 1). Some common examples of , from previous works [Culotta and Sorensen2004, Srivastava et al.2013], are:
Herein, $a, b$ are tokens, and $w_a$, $w_b$ are the corresponding word vectors. The first two kernels are stationary, i.e. translation invariant [Genton 2001]. The last two are nonstationary, though lacking nonstationarity-specific parameters for learning nonstationarity in a data-driven manner.
There are generic nonstationarity-based parameterizations, unexplored in NLP, applicable for extending not only a stationary kernel but any kernel, , to a nonstationary one, , so as to achieve higher expressiveness and generalization in model learning [Higdon 1998, Snelson et al. 2003].
For NLP tasks, from a nonstationary , the corresponding extension of , , is also guaranteed to be a valid nonstationary convolution kernel. (Footnote 8: Proof for this statement omitted due to space constraint.) Some generic nonstationary extensions of are as follows.
(5)  
(6) 
Here, , , are nonstationarity-based parameters; for more details, see [Paciorek and Schervish 2003, Adams and Stegle 2008]. Among the two extensions, the latter one is simpler yet intuitive, as explained in the following. If , it means that the token should be completely ignored when computing a convolution kernel similarity of an NLP structure (tree, path, etc.) that contains the token (node or edge label ) w.r.t. another NLP structure. Thus, the additional nonstationary parameters allow convolution kernels to be expressive enough for deciding if some substructures in an NLP structure should be ignored explicitly. (Footnote 9: This approach is explicit in ignoring substructures irrelevant for a given task, unlike the (complementary) standard skipping over non-matching substructures in a convolution kernel.)
While the above proposed idea of nonstationary kernel extensions for NLP structures remains general, for the experiments, the nonstationary kernel for similarity between tuples with format (edge-label, node-label) is defined as the product of kernels on edge labels, , and node labels, ,
with operating only on edge labels. Edge labels come from syntactic or semantic parses of text with a small-size vocabulary (see syntactic parse-based edge labels in Fig. 1); we keep as a measure for robustness to overfitting. These parameters are optimized by maximizing the MI LB.
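The effect of binary nonstationarity parameters can be illustrated with a short Python sketch. It assumes, for illustration only, that each token carries a weight in {0, 1} that multiplies the base token kernel, so that a zero weight removes the token from the similarity computation; the exact-match token kernel, the all-pairs sequence kernel, and all names are hypothetical stand-ins rather than the kernels used in the experiments.

```python
def exact_match_kernel(a, b):
    """Toy stationary token kernel: 1 if the tokens are identical, else 0."""
    return 1.0 if a == b else 0.0


def nonstationary_kernel(a, b, base_kernel, sigma):
    """Scale the base kernel by per-token binary weights.

    sigma maps a token to 0 (ignore) or 1 (keep); unlisted tokens default to 1.
    """
    return sigma.get(a, 1) * base_kernel(a, b) * sigma.get(b, 1)


def sequence_kernel(seq_a, seq_b, token_kernel):
    """Toy convolution-style kernel: sum of token similarities over all token pairs."""
    return sum(token_kernel(a, b) for a in seq_a for b in seq_b)


# Usage: ignoring the edge label "det" removes its contribution to the similarity
sigma = {"det": 0}
path_a = ["nsubj", "det", "dobj"]
path_b = ["det", "dobj"]
stationary_sim = sequence_kernel(path_a, path_b, exact_match_kernel)
nonstationary_sim = sequence_kernel(
    path_a, path_b,
    lambda a, b: nonstationary_kernel(a, b, exact_match_kernel, sigma))
```

With binary weights, learning the parameters amounts to choosing which labels to drop, which keeps the number of additional parameters bounded by the vocabulary size of the labels.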
3 Experiments
Datasets | No. of Valid Extractions | No. of Invalid Extractions
PubMed45 | 2,794                    | 20,102
BioNLP   | 6,527                    | 34,958
AIMed    | 1,000                    | 4,834
BioInfer | 2,534                    | 7,132
We evaluate our model “KLSH-RF” (kernelized locality-sensitive hashing with random forest) for the biomedical relation extraction task using four public datasets, AIMed, BioInfer, PubMed45, and BioNLP, as described below. (Footnote 10: The PubMed45 dataset is available here: github.com/sgarg87/big_mech_isi_gg/tree/master/pubmed45_dataset; the other three datasets are here: corpora.informatik.hu-berlin.de) Fig. 1 illustrates that the task is formulated as a binary classification of extraction candidates. For evaluation, it is standard practice to compute precision, recall, and F1 score on the positive class (i.e., identifying valid extractions). The number of valid/invalid extractions in each dataset is shown in Tab. 1.
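For concreteness, precision, recall, and F1 on the positive class can be computed as in the following Python sketch; the function name and the convention of returning zero when a denominator is empty are choices made for this example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 score for the positive (valid extraction) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Because invalid extractions heavily outnumber valid ones in all four datasets (Tab. 1), scoring on the positive class avoids the inflated numbers that plain accuracy would give.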
Details on Datasets and Structural Features:
AIMed and BioInfer: For AIMed and BioInfer datasets, cross-corpus evaluation has been performed in many previous works [Airola et al. 2008, Miwa et al. 2009, Tikk et al. 2010, Chang et al. 2016, Peng and Lu 2017, Hsieh et al. 2017, Rios et al. 2018]. Herein, the task is to identify pairs of interacting proteins (PPI) in a sentence while ignoring the interaction type. We follow the same evaluation setup, using Stanford Dependency Graph parses of text sentences (Footnote 11: www.nltk.org/_modules/nltk/parse/stanford.html) to obtain undirected shortest paths as structural features for use with a path kernel (PK) to classify protein-protein pairs.
Models                           | (A, B)       | (B, A)
SVM (Airola08)                   | 0.25         | 0.44
SVM (Airola08)                   | 0.47         | 0.47
SVM (Tikk10)                     | 0.41         | 0.42
                                 | (0.67, 0.29) | (0.27, 0.87)
CNN (Peng17)                     | 0.48         | 0.50
                                 | (0.40, 0.61) | (0.40, 0.66)
RNN (Hsieh17)                    | 0.49         | 0.51
CNN-RevGrad (Ganin16-Rios18)     | 0.43         | 0.47
Bi-LSTM-RevGrad (Ganin16-Rios18) | 0.40         | 0.46
Adv-CNN (Rios18)                 | 0.54         | 0.49
Adv-Bi-LSTM (Rios18)             |              | 0.49
KLSH-kNN                         |              |
                                 | (0.41, 0.68) | (0.38, 0.80)
KLSH-RF                          |              |
                                 | (0.46, 0.74) | (0.37, 0.93)
PubMed45 & BioNLP: We use PubMed45 and BioNLP datasets for an extensive evaluation of our KLSH-RF model; for more details on the two datasets, see [Garg et al. 2016] and [Kim et al. 2009, Kim et al. 2011, Nédellec et al. 2013]. Annotations in these datasets are richer in the sense that a biomolecular interaction can involve up to two participants, along with an optional catalyst, and an interaction type from an unrestricted list. In the PubMed45 (BioNLP) dataset, 36% (17%) of the “valid” interactions are such that an interaction must involve two participants and a catalyst. For both datasets, we use abstract meaning representation (AMR) to build subgraph or shortest path-based structural features [Banarescu et al. 2013], for use with graph kernels (GK) or path kernels (PK) respectively, as done in the recent works evaluating these datasets [Garg et al. 2016, Rao et al. 2017]. For a fair comparison of the classification models, we use the same bio-AMR parser [Pust et al. 2015] as in the previous works. In [Garg et al. 2016], the PubMed45 dataset is split into 11 subsets for evaluation, at paper level. Keeping one of the subsets for testing, we use the others for training a binary classifier. This procedure is repeated for all 11 subsets in order to obtain the final F1 scores (mean and standard deviation values are reported from the numbers for the 11 subsets). For the BioNLP dataset [Kim et al. 2009, Kim et al. 2011, Nédellec et al. 2013], we use training datasets from years 2009, 2011, 2013 for learning a model, and the development dataset from year 2013 as the test set; the same evaluation setup is followed in [Rao et al. 2017].
In addition to the competitive models previously evaluated on these datasets, we also compare our KLSH-RF model to KLSH-kNN (kNN classifier with KLSH approximation).
Figure 4: Detailed evaluation of the KLSH-RF model, using PubMed45 and BioNLP datasets. Here, orange and blue bars are for precision and recall numbers respectively. “NSK” refers to nonstationary kernel learning; PK & GK denote Path Kernels and Graph Kernels respectively; NS-PK and NS-GK are extensions of PK and GK respectively, with addition of nonstationarity-based binary parameters; “M30” represents a reference set of size 30 selected randomly, and the suffix “RO” in “M30-RO” refers to optimization of the reference set (Reference Optimization) in contrast to random selection.
Models       | PubMed45     | PubMed45-ERN | BioNLP
SVM (Garg16) |              |              |
             | (0.58, 0.43) | (0.33, 0.45) | (0.35, 0.67)
LSTM (Rao17) | N.A.         | N.A.         | 0.46
             |              |              | (0.51, 0.44)
KLSH-kNN     |              |              |
             | (0.44, 0.53) | (0.23, 0.29) | (0.63, 0.57)
KLSH-RF      |              |              |
             | (0.67, 0.49) | (0.52, 0.46) | (0.78, 0.52)
Parameter Settings:
We use GK and PK, both using the same word vectors, with kernel parameter settings as in [Garg et al. 2016, Mooney and Bunescu 2005]. (Footnote 12: More details will be provided in documentation for code.)
Reference set size, $M$, does not need tuning in our proposed model; there is a trade-off between compute cost and accuracy; by default, we keep . Unless mentioned otherwise, kernel-hashing parameters in our model are optimized using Alg. 2. For tuning any other parameters in our model or competitive models, including the choice of a kernel similarity function (PK or GK), we use 10% of training data, sampled randomly, for validation purposes.
When selecting the reference set randomly, we perform 10 trials, and report mean statistics. (Footnote 13: Variance across these trials is small, empirically.) The same applies for KLSH-kNN. When optimizing with Alg. 1, we use , , (sampling parameters are easy to tune). We employ 3 cores on an i5 processor, with 16GB memory.
3.1 Main Results for KLSHRF
In the following we evaluate the simplest version of our KLSH-RF model, which is optimized using Alg. 2; it takes 300 seconds to optimize the hashing configuration parameters (lines 2-12 in Alg. 2). In summary, our KLSH-RF model outperforms state-of-the-art models consistently across the four datasets, along with very significant speedups in training time w.r.t. traditional kernel classifiers.
Results for AIMed and BioInfer Datasets:
In reference to Tab. 2, KLSH-RF gives an F1 score significantly higher than state-of-the-art kernel-based models (6 pts gain in F1 score w.r.t. KLSH-kNN), and consistently outperforms the neural models. When using AIMed for training and BioInfer for testing, there is a tie between Adv-Bi-LSTM [Rios et al. 2018] and KLSH-RF. However, KLSH-RF still outperforms their Adv-CNN model by 3 pts; further, the performance of Adv-CNN and Adv-Bi-LSTM is not consistent, giving a low F1 score when training on the BioInfer dataset for testing on AIMed. For the latter setting of AIMed as a test set, we obtain an F1 score improvement of 2 pts w.r.t. the best competitive models (RNN and KLSH-kNN).
The models based on adversarial neural networks [Ganin et al. 2016, Rios et al. 2018], Adv-CNN, Adv-Bi-LSTM, CNN-RevGrad, Bi-LSTM-RevGrad, are learned jointly on labeled training datasets and unlabeled test sets, whereas our model is purely supervised. In contrast to our principled approach, there are also system-level solutions using multiple parses jointly, along with multiple kernels, and knowledge bases [Miwa et al. 2009, Chang et al. 2016]. We refrain from comparing KLSH-RF w.r.t. such system-level solutions, as it would be an unfair comparison from a modeling perspective.
Results for PubMed45 and BioNLP Datasets:
A summary of main results is presented in Tab. 3. “PubMed45-ERN” is another version of the PubMed45 dataset from [Garg et al. 2016], with ERN referring to entity recognition noise. Clearly, our model gives F1 scores significantly higher than SVM, LSTM, and the KLSH-kNN model. For PubMed45 and BioNLP, the F1 score for KLSH-RF is higher by 9 pts and 3 pts respectively w.r.t. the state of the art. Note that standard deviations of F1 scores are high for the PubMed45 dataset (and PubMed45-ERN) because of the high variation in distribution of text across the 11 test subsets (the F1 score improvements with our model are statistically significant, p-value = 4.4e-8).
For the PubMed45 dataset, there are no previously published results with a neural model (LSTM). The LSTM model of [Rao et al. 2017], proposed specifically for the BioNLP dataset, is not directly applicable to the PubMed45 dataset because the list of interaction types in the latter is unrestricted. F1 score numbers for the SVM classifier were also improved in [Garg et al. 2016] by additional contributions such as document-level inference, and the joint use of semantic and syntactic representations; those system-level contributions are complementary to ours, and so are excluded from the comparison.
3.2 Detailed Analysis of KLSHRF
While we are able to obtain superior results with our KLSH-RF model w.r.t. state-of-the-art methods using just core optimization of the hashing configuration parameters, in this subsection we analyze how we can further improve the model. In Fig. 4 we present our results from optimization of other aspects of the KLSH-RF model: nonstationary kernel parameters learning (NS) and reference set optimization (RO). We also analyze the effect of parameters, , under controlled settings ( is fixed to value 20 for , and for ). We report mean values for precision, recall, and F1 scores. For these experiments, we focus on PubMed45 and BioNLP datasets.
Nonstationary Kernel Learning (NSK): In Fig. 4(a) and 4(b), we compare the performance of nonstationary kernels w.r.t. traditional stationary kernels (M=100). As proposed in Sec. 2.4, the idea is to extend a convolution kernel (PK or GK) with nonstationarity-based binary parameters (NS-PK or NS-GK), defined for the top 50% most frequent edge labels, optimized iteratively via MI LB maximization. For the PubMed45 dataset with PK, the advantage of NSK learning is more prominent, leading to a large increase in recall (4 pts), and a marginal increase in precision (1 pt). Compute time for learning the nonstationarity parameters in our KLSH-RF model is less than an hour.
Reference Set Optimization: In Fig. 4(c) and 4(d), we analyze the effect of the reference set optimization (RO), in comparison to random selection, and find that the optimization leads to some increase in precision (5-6 pts for ), but a marginal drop in recall (1-2 pts for ); we used PK for these experiments. For the BioNLP dataset, the improvements are more significant. To optimize reference set for , it takes approximately 2 to 3 hours (with in Alg. 1).
Analyzing Hashing Parameters: In Fig. 4(e) and 4(f), we compare the performance of all three hash functions. Hash functions “RMM” and “RkNN” outperform hash function “Kulis,” especially for the PubMed45 dataset. For PubMed45 dataset, we also vary the parameters (=None & , using PK); see Fig. 4(g). For a low value of (15, 30), the F1 score drops significantly. We also note that despite the high number of hashcode bits, classification accuracy improves only if we have a minimal number of decision trees.
Compute Time: Compute times to train all the models are reported in Fig. 4(h) for the BioNLP dataset; similar time scales apply for other datasets. We observe that our basic KLSH-RF model (trained using Alg. 2) has a very low training cost w.r.t. models like LSTM, KLSH-kNN, etc. (a similar analysis applies for inference cost). The extensions of KLSH-RF, KLSH-RF-RO and KLSH-RF-NS, are more expensive, yet cheaper than LSTM and SVM.
4 Related Work
Other Hashing Techniques: In addition to the hashing techniques considered in this paper, other locality-sensitive hashing techniques [Weinberger et al.2009, Mu and Yan2010, Liu et al.2011, Heo et al.2012, Liu et al.2012, Grauman and Fergus2013, Zhao et al.2014, Wang et al.2014, Wang et al.2017] are either not kernel based, or are defined for specific kernels that are not applicable to hashing of NLP structures [Raginsky and Lazebnik2009]. In deep learning, hashcodes are used for similarity search over objects, but not for classification of objects [Liong et al.2015, Liu et al.2016].
Hashcodes for Feature Compression: Binary hashing has been used as an approximate feature compression technique to reduce storage and computation costs [Li et al.2011, Mu et al.2014]. This work instead proposes hashing as a representation learning technique.
Uses of Hashcodes in NLP: In NLP, hashcodes have been used only for similarity or nearest neighbor search over words/tokens in various tasks [Bawa et al.2005, Ravichandran et al.2005, Goyal et al.2012, Li et al.2014, Wurzer et al.2015, Shi and Knight2017]; kernel-hashing of NLP structures, rather than just tokens, is explored only in this work.
Weighting Substructures: Our idea of skipping substructures through nonstationary kernels bears some similarity to substructure mining algorithms [Suzuki and Isozaki2006, Severyn and Moschitti2013]. Recently, learning the weights of substructures has been proposed for regression, but not for classification problems [Beck et al.2015].
5 Conclusions
In this paper, we propose to use a well-known technique, kernelized locality-sensitive hashing (KLSH), to derive feature vectors from natural language structures. More specifically, we propose to use random subspaces of KLSH codes for building a random forest of decision trees. We find this methodology particularly suitable for modeling natural language structures in supervised settings where there are significant mismatches between the training and the test conditions. Moreover, we optimize a KLSH model in the context of classification performed using a random forest, by maximizing a variational lower bound on the mutual information between the KLSH codes (feature vectors) and the class labels. We apply the proposed approach to the difficult task of extracting information about biomolecular interactions from the semantic or syntactic parsing of scientific papers. Experiments on a wide range of datasets demonstrate the considerable advantages of our method.
References
 [Adams and Stegle2008] Ryan Prescott Adams and Oliver Stegle. 2008. Gaussian process product models for nonparametric nonstationarity. In Proc. of ICML.
 [Airola et al.2008] Antti Airola, Sampo Pyysalo, Jari Björne, Tapio Pahikkala, Filip Ginter, and Tapio Salakoski. 2008. Allpaths graph kernel for proteinprotein interaction extraction with evaluation of crosscorpus learning. BMC Bioinformatics, 9:S2.
 [Alemi et al.2017] Alex Alemi, Ian Fischer, Josh Dillon, and Kevin Murphy. 2017. Deep variational information bottleneck.
 [Banarescu et al.2013] Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse.
 [Barber and Agakov2003] David Barber and Felix Agakov. 2003. The im algorithm: a variational approach to information maximization. In Proceedings of the 16th International Conference on Neural Information Processing Systems, pages 201–208. MIT Press.
 [Bawa et al.2005] Mayank Bawa, Tyson Condie, and Prasanna Ganesan. 2005. Lsh forest: selftuning indexes for similarity search. In Proceedings of the 14th international conference on World Wide Web, pages 651–660. ACM.
 [Beck et al.2015] Daniel Beck, Trevor Cohn, Christian Hardmeier, and Lucia Specia. 2015. Learning structural kernels for natural language processing. arXiv preprint arXiv:1508.02131.
 [Bergstra and Bengio2012] James Bergstra and Yoshua Bengio. 2012. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.
 [Bottou2010] Léon Bottou. 2010. Largescale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer.
 [Breiman1996] Leo Breiman. 1996. Bagging predictors. Machine learning, 24(2):123–140.
 [Breiman2001] Leo Breiman. 2001. Random forests. Machine learning, 45(1):5–32.
 [Brown et al.2012] Gavin Brown, Adam Pocock, MingJie Zhao, and Mikel Luján. 2012. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13(1):27–66.
 [Chalk et al.2016] Matthew Chalk, Olivier Marre, and Gasper Tkacik. 2016. Relevant sparse codes with variational information bottleneck. In Advances in Neural Information Processing Systems, pages 1957–1965.
 [Chang et al.2016] YungChun Chang, ChunHan Chu, YuChen Su, Chien Chin Chen, and WenLian Hsu. 2016. Pipe: a protein–protein interaction passage extraction module for biocreative challenge. Database, 2016.
 [Chen et al.2018] Jianbo Chen, Le Song, Martin J Wainwright, and Michael I Jordan. 2018. Learning to explain: An informationtheoretic perspective on model interpretation. arXiv preprint arXiv:1802.07814.
 [Cohen2015] Paul R Cohen. 2015. Darpa’s big mechanism program. Physical biology, 12(4):045008.
 [Cover and Thomas2012] Thomas M Cover and Joy A Thomas. 2012. Elements of information theory. John Wiley & Sons.
 [Culotta and Sorensen2004] Aron Culotta and Jeffrey Sorensen. 2004. Dependency tree kernels for relation extraction. In Proc. of ACL, page 423.
 [Donoho and others2000] David L Donoho et al. 2000. Highdimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, 1:32.
 [Fleuret2004] François Fleuret. 2004. Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research, 5(Nov):1531–1555.
 [Ganin et al.2016] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domainadversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030.
 [Gao et al.2015] Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. 2015. Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics, pages 277–286.
 [Gao et al.2016] Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. 2016. Variational information maximization for feature selection. In Advances in Neural Information Processing Systems, pages 487–495.
 [Garg et al.2016] Sahil Garg, Aram Galstyan, Ulf Hermjakob, and Daniel Marcu. 2016. Extracting biomolecular interactions using semantic parsing of biomedical text. In Proc. of AAAI.
 [Genton2001] Marc G Genton. 2001. Classes of kernels for machine learning: a statistics perspective. Journal of machine learning research, 2(Dec):299–312.
 [Goyal et al.2012] Amit Goyal, Hal Daumé III, and Raul Guerra. 2012. Fast largescale approximate graph construction for nlp. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1069–1080. Association for Computational Linguistics.
 [Grauman and Fergus2013] Kristen Grauman and Rob Fergus. 2013. Learning binary hash codes for largescale image search. Machine learning for computer vision, 411(4987):1.
 [Hahn and Surdeanu2015] Marco A ValenzuelaEscárcega, Gus HahnPowell, Thomas Hicks, and Mihai Surdeanu. 2015. A domainindependent rulebased framework for event extraction. ACLIJCNLP 2015, page 127.
 [Hakenberg et al.2008] Jörg Hakenberg, Conrad Plake, Loic Royer, Hendrik Strobelt, Ulf Leser, and Michael Schroeder. 2008. Gene mention normalization and interaction extraction with context models and sentence motifs. Genome Biol, 9:S14.
 [Haussler1999] David Haussler. 1999. Convolution kernels on discrete structures. Technical report.
 [Heo et al.2012] JaePil Heo, Youngwoon Lee, Junfeng He, ShihFu Chang, and SungEui Yoon. 2012. Spherical hashing. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2957–2964. IEEE.
 [Higdon1998] David Higdon. 1998. A processconvolution approach to modelling temperatures in the north atlantic ocean. Environmental and Ecological Statistics, 5(2):173–190.
 [Ho1995] Tin Kam Ho. 1995. Random decision forests. In Document Analysis and Recognition, 1995., Proceedings of the Third International Conference on, volume 1, pages 278–282. IEEE.
 [Ho1998] Tin Kam Ho. 1998. The random subspace method for constructing decision forests. IEEE transactions on pattern analysis and machine intelligence, 20(8):832–844.
 [Hsieh et al.2017] YuLun Hsieh, YungChun Chang, NaiWen Chang, and WenLian Hsu. 2017. Identifying proteinprotein interactions in biomedical literature using recurrent neural networks with long shortterm memory. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 240–245.
 [Indyk and Motwani1998] Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing.
 [Joly and Buisson2011] Alexis Joly and Olivier Buisson. 2011. Random maximum margin hashing. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 873–880. IEEE.
 [Kim et al.2009] JinDong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun’ichi Tsujii. 2009. Overview of bionlp’09 shared task on event extraction. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pages 1–9. Association for Computational Linguistics.
 [Kim et al.2011] JinDong Kim, Sampo Pyysalo, Tomoko Ohta, Robert Bossy, Ngan Nguyen, and Jun’ichi Tsujii. 2011. Overview of bionlp shared task 2011. In Proceedings of the BioNLP Shared Task 2011 Workshop, pages 1–6. Association for Computational Linguistics.
 [Krallinger et al.2008] Martin Krallinger, Florian Leitner, Carlos RodriguezPenagos, and Alfonso Valencia. 2008. Overview of the proteinprotein interaction annotation extraction task of biocreative ii. Genome biology, 9:S4.
 [Kraskov et al.2004] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. 2004. Estimating mutual information. Physical Review E, 69:066138.
 [Krause et al.2008] Andreas Krause, Ajit Singh, and Carlos Guestrin. 2008. Nearoptimal sensor placements in gaussian processes: Theory, efficient algorithms and empirical studies. The Journal of Machine Learning Research, 9:235–284.
 [Kulis and Grauman2009] Brian Kulis and Kristen Grauman. 2009. Kernelized localitysensitive hashing for scalable image search. In Computer Vision, 2009 IEEE 12th International Conference on, pages 2130–2137. IEEE.
 [Kulis and Grauman2012] Brian Kulis and Kristen Grauman. 2012. Kernelized localitysensitive hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(6):1092–1104.
 [Kuncheva2004] Ludmila I Kuncheva. 2004. Combining pattern classifiers: methods and algorithms. John Wiley & Sons.
 [Li et al.2011] Ping Li, Anshumali Shrivastava, Joshua L Moore, and Arnd C König. 2011. Hashing algorithms for largescale learning. In Advances in neural information processing systems, pages 2672–2680.
 [Li et al.2014] Hao Li, Wei Liu, and Heng Ji. 2014. Twostage hashing for fast document retrieval. In ACL (2), pages 495–500.
 [Liong et al.2015] Venice Erin Liong, Jiwen Lu, Gang Wang, Pierre Moulin, Jie Zhou, et al. 2015. Deep hashing for compact binary codes learning. In CVPR, volume 1, page 3.
 [Liu et al.2011] Wei Liu, Jun Wang, Sanjiv Kumar, and ShihFu Chang. 2011. Hashing with graphs. In Proc. of ICML.
 [Liu et al.2012] Wei Liu, Jun Wang, Rongrong Ji, YuGang Jiang, and ShihFu Chang. 2012. Supervised hashing with kernels. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2074–2081. IEEE.
 [Liu et al.2016] Haomiao Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen. 2016. Deep supervised hashing for fast image retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2064–2072.
 [Miwa et al.2009] Makoto Miwa, Rune Sætre, Yusuke Miyao, and Jun’ichi Tsujii. 2009. Protein–protein interaction extraction by leveraging multiple kernels and parsers. International journal of medical informatics, 78(12):e39–e46.
 [Mooney and Bunescu2005] Raymond J Mooney and Razvan C Bunescu. 2005. Subsequence kernels for relation extraction. In Proc. of NIPS, pages 171–178.
 [Moschitti2006] Alessandro Moschitti. 2006. Making tree kernels practical for natural language learning. In Eacl, volume 113, page 24.
 [Mu and Yan2010] Yadong Mu and Shuicheng Yan. 2010. Nonmetric localitysensitive hashing. In AAAI, pages 539–544.
 [Mu et al.2014] Yadong Mu, Gang Hua, Wei Fan, and ShihFu Chang. 2014. Hashsvm: Scalable kernel machines for largescale visual classification. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 979–986. IEEE.
 [Nédellec et al.2013] Claire Nédellec, Robert Bossy, JinDong Kim, JungJae Kim, Tomoko Ohta, Sampo Pyysalo, and Pierre Zweigenbaum. 2013. Overview of bionlp shared task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop, pages 1–7. Association for Computational Linguistics.
 [Nguyen et al.2014] Xuan Vinh Nguyen, Jeffrey Chan, Simone Romano, and James Bailey. 2014. Effective global approaches for mutual information based feature selection. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 512–521. ACM.
 [Paciorek and Schervish2003] Christopher J Paciorek and Mark J Schervish. 2003. Nonstationary covariance functions for gaussian process regression. In NIPS, pages 273–280.
 [Peng and Lu2017] Yifan Peng and Zhiyong Lu. 2017. Deep learning for extracting proteinprotein interactions from biomedical literature. BioNLP 2017, pages 29–38.
 [Peng et al.2005] Hanchuan Peng, Fuhui Long, and Chris Ding. 2005. Feature selection based on mutual information criteria of maxdependency, maxrelevance, and minredundancy. IEEE Transactions on pattern analysis and machine intelligence, 27(8):1226–1238.
 [Pust et al.2015] Michael Pust, Ulf Hermjakob, Kevin Knight, Daniel Marcu, and Jonathan May. 2015. Parsing english into abstract meaning representation using syntaxbased machine translation. In EMNLP.
 [Qian and Zhou2012] Longhua Qian and Guodong Zhou. 2012. Tree kernelbased protein–protein interaction extraction from biomedical literature. Journal of biomedical informatics, 45(3):535–543.
 [Raginsky and Lazebnik2009] Maxim Raginsky and Svetlana Lazebnik. 2009. Localitysensitive binary codes from shiftinvariant kernels. In Advances in neural information processing systems.
 [Rao et al.2017] Sudha Rao, Daniel Marcu, Kevin Knight, and Hal Daumé III. 2017. Biomedical event extraction using abstract meaning representation. In Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
 [Ravichandran et al.2005] Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. 2005. Randomized algorithms and nlp: using locality sensitive hash function for high speed noun clustering. In Proceedings of the 43rd annual meeting on association for computational linguistics, pages 622–629. Association for Computational Linguistics.
 [Rios et al.2018] Anthony Rios, Ramakanth Kavuluru, and Zhiyong Lu. 2018. Generalizing biomedical relation classification with neural adversarial domain adaptation. Bioinformatics, 1:9.
 [Rzhetsky2016] Andrey Rzhetsky. 2016. The big mechanism program: Changing how science is done. In DAMDID/RCDL, pages 1–2.
 [Severyn and Moschitti2012] Aliaksei Severyn and Alessandro Moschitti. 2012. Fast support vector machines for convolution tree kernels. Data Mining and Knowledge Discovery, 25(2):325–357.
 [Severyn and Moschitti2013] Aliaksei Severyn and Alessandro Moschitti. 2013. Fast linearization of tree kernels over largescale data. In IJCAI.
 [Shi and Knight2017] Xing Shi and Kevin Knight. 2017. Speeding up neural machine translation decoding by shrinking runtime vocabulary. In ACL.
 [Singh and Póczos2014] Shashank Singh and Barnabás Póczos. 2014. Generalized exponential concentration inequality for rényi divergence estimation. In International Conference on Machine Learning, pages 333–341.
 [Snelson et al.2003] Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. 2003. Warped gaussian processes. In NIPS, pages 337–344.
 [Snoek et al.2015] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. 2015. Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pages 2171–2180.
 [Srivastava et al.2013] Shashank Srivastava, Dirk Hovy, and Eduard H Hovy. 2013. A walkbased semantically enriched tree kernel over distributed word representations. In Proc. of EMNLP, pages 1411–1416.
 [Suzuki and Isozaki2006] Jun Suzuki and Hideki Isozaki. 2006. Sequence and tree kernels with statistical feature mining. In Advances in neural information processing systems, pages 1321–1328.
 [Tikk et al.2010] Domonkos Tikk, Philippe Thomas, Peter Palaga, Jörg Hakenberg, and Ulf Leser. 2010. A comprehensive benchmark of kernel methods to extract protein–protein interactions from literature. PLoS Comput Biol.
 [ValenzuelaEscárcega et al.2017] Marco A ValenzuelaEscárcega, Ozgün Babur, Gus HahnPowell, Dane Bell, Thomas Hicks, Enrique NoriegaAtala, Xia Wang, Mihai Surdeanu, Emek Demir, and Clayton T Morrison. 2017. Largescale automated reading with reach discovers new cancer driving mechanisms.
 [WaltersWilliams and Li2009] Janett WaltersWilliams and Yan Li. 2009. Estimation of mutual information: A survey. In International Conference on Rough Sets and Knowledge Technology, pages 389–396. Springer.
 [Wang et al.2014] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. 2014. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927.
 [Wang et al.2017] Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. 2017. A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence.
 [Weinberger et al.2009] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113–1120. ACM.
 [Wurzer et al.2015] Dominik Wurzer, Victor Lavrenko, and Miles Osborne. 2015. Twitterscale new event detection via kterm hashing. In EMNLP, pages 2584–2589.
 [Zelenko et al.2003] Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. JMLR, 3:1083–1106.
 [Zhao et al.2014] Kang Zhao, Hongtao Lu, and Jincheng Mei. 2014. Locality preserving hashing. In AAAI, pages 2874–2881.
 [Zhou2012] ZhiHua Zhou. 2012. Ensemble methods: foundations and algorithms. CRC press.