Embedding Lexical Features via Low-Rank Tensors

04/02/2016 · Mo Yu et al. · IBM, Johns Hopkins University, Carnegie Mellon University

Modern NLP models rely heavily on engineered features, which often combine word and contextual information into complex lexical features. Such combination results in large numbers of features, which can lead to over-fitting. We present a new model that represents complex lexical features---comprised of parts for words, contextual information and labels---in a tensor that captures conjunction information among these parts. We apply low-rank tensor approximations to the corresponding parameter tensors to reduce the parameter space and improve prediction speed. Furthermore, we investigate two methods for handling features that include n-grams of mixed lengths. Our model achieves state-of-the-art results on tasks in relation extraction, PP-attachment, and preposition disambiguation.


1 Introduction

Statistical NLP models usually rely on hand-designed features, customized for each task. These features typically combine lexical and contextual information with the label to be scored. In relation extraction, for example, there is a parameter for the presence of a specific relation occurring with a feature conjoining a word type (lexical) with dependency path information (contextual). In measuring phrase semantic similarity, a word type is conjoined with its position in the phrase to signal its role. Figure 1b shows an example in dependency parsing, where multiple types (words) are conjoined with POS tags or distance information.

To avoid model over-fitting that often results from features with lexical components, several smoothed lexical representations have been proposed and shown to improve performance on various NLP tasks; for instance, word embeddings [Bengio et al.2006] help improve NER, dependency parsing and semantic role labeling [Miller et al.2004, Koo et al.2008, Turian et al.2010, Sun et al.2011, Roth and Woodsend2014, Hermann et al.2014].

However, word embeddings alone are not sufficient to represent complex lexical features (e.g. in Figure 1c). In these features, the same word embedding conjoined with different non-lexical properties may yield features indicating different labels; the corresponding lexical feature representations should take these interactions into account. Such important interactions also increase the risk of over-fitting as the feature space grows exponentially, yet how to capture these interactions in representation learning remains an open question.

Figure 1: An example of lexical features used in dependency parsing. To predict the "PMOD" arc (the dashed one) between "see" and "with" in (a), we may rely on the lexical features in (b). Here $i$, $j$ and $k$ are the indices of the word "with", its child ("telescope"), and a candidate head. Figure (c) shows what the fifth feature looks like when the candidate head is "see". As is common in multi-class classification tasks, each template generates a different feature for each label $y$; thus a feature is the conjunction of the four parts. Figure (d) is the one-hot representation of this feature, which is equivalent to the outer product (i.e. a 4-way tensor) of the four one-hot vectors. $e_i$ denotes a vector with its single non-zero element in position $i$.

To address the above problems (our paper focuses only on lexical features, as non-lexical features usually suffer less from over-fitting), we propose a general and unified approach to reduce the feature space by constructing low-dimensional feature representations, which provides a new way of combining word embeddings, traditional non-lexical properties, and label information. Our model exploits the inner structure of features by breaking each feature into multiple parts: lexical, non-lexical and (optionally) the label. We demonstrate that the full feature is an outer product among these parts; a parameter tensor then scores each feature to produce a prediction. Our model reduces the number of parameters by approximating the parameter tensor with a low-rank tensor: the Tucker approximation of Yu et al. (2015), but applied to each embedding type (view), or the Canonical/Parallel-Factors Decomposition (CP). Our models use fewer parameters than previous work that learns a separate representation for each feature [Ando and Zhang2005, Yang and Eisenstein2015]. The CP approximation also allows for much faster prediction, going from a method that is cubic in the rank and exponential in the number of lexical parts to one that is linear in both. Furthermore, we consider two methods for handling features that rely on n-grams of mixed lengths.

Our model makes the following contributions when contrasted with prior work:

Lei et al. (2014) applied CP decomposition to combine different views of features. Compared to their work, our use of the CP decomposition differs in its application to feature learning: (1) we focus on dimensionality reduction of existing, well-verified features, while Lei et al. (2014) generate new features (usually different from ours) by combining "atomic" features; their approach may therefore miss some useful features and relies on binary features as a supplement, whereas our model need not. (2) The factorization of Lei et al. (2014) relies on views with explicit meanings, e.g. head/modifier/arc in dependency parsing, making it less general; its application to tasks like relation extraction is therefore less obvious.

Compared to our previous work [Gormley et al.2015, Yu et al.2015], this work allows for higher-order interactions, mixed-length n-gram features, and lower-rank representations. We also demonstrate the strength of the new model via applications to new tasks.

The resulting method learns smoothed feature representations combining lexical, non-lexical and label information, achieving state-of-the-art performance on several tasks: relation extraction, preposition semantics and PP-attachment.

2 Notation and Definitions

We begin with some background on notation and definitions. Let $\mathcal{T} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_K}$ be a $K$-way tensor (i.e., a tensor with $K$ views). In this paper, we consider the tensor $k$-mode product, i.e. multiplying a tensor by a matrix $U \in \mathbb{R}^{J \times I_k}$ (or a vector, if $J = 1$) in mode (view) $k$. The product is denoted by $\mathcal{T} \times_k U$ and is of size $I_1 \times \cdots \times I_{k-1} \times J \times I_{k+1} \times \cdots \times I_K$. Element-wise, we have

$(\mathcal{T} \times_k U)_{i_1 \cdots i_{k-1}\, j\, i_{k+1} \cdots i_K} = \sum_{i_k=1}^{I_k} \mathcal{T}_{i_1 i_2 \cdots i_K}\, U_{j i_k}$

for $j = 1, \ldots, J$. A mode-$k$ fiber of $\mathcal{T}$ is the $I_k$-dimensional vector obtained by fixing all but the $k$-th index. The mode-$k$ unfolding $\mathcal{T}^{(k)}$ of $\mathcal{T}$ is the $I_k \times \prod_{k' \ne k} I_{k'}$ matrix obtained by concatenating all the mode-$k$ fibers along columns.

Given two matrices $A$ and $B$, we write $A \otimes B$ to denote the Kronecker product between $A$ and $B$ (the outer product in the case of vectors). We write $\langle A, B \rangle$ for the Frobenius product (matrix dot product) between two matrices of the same size, and $a \odot b$ for element-wise (Hadamard) multiplication between vectors of the same size.
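The following is a minimal numpy sketch of the operations just defined (mode-$k$ product, unfolding, Kronecker/Frobenius/Hadamard products). All names and sizes are illustrative and not taken from the paper.

```python
import numpy as np

def mode_k_product(T, U, k):
    """Multiply tensor T by matrix U along mode k (the tensor k-mode product)."""
    # Move mode k to the front and flatten the other modes
    # (a mode-k unfolding, up to column order), multiply, then fold back.
    T_unf = np.moveaxis(T, k, 0).reshape(T.shape[k], -1)
    out = U @ T_unf                                      # (J x I_k) @ (I_k x prod of others)
    new_shape = (U.shape[0],) + tuple(np.delete(T.shape, k))
    return np.moveaxis(out.reshape(new_shape), 0, k)

T = np.random.randn(3, 4, 5)           # a 3-way tensor
U = np.random.randn(7, 4)              # maps mode 1 from size 4 to size 7
print(mode_k_product(T, U, 1).shape)   # (3, 7, 5)

A = np.random.randn(2, 3); B = np.random.randn(4, 5)
print(np.kron(A, B).shape)             # Kronecker product: (8, 15)
print(np.sum(A * A))                   # Frobenius (matrix dot) product of A with itself
a, b = np.random.randn(3), np.random.randn(3)
print(a * b)                           # element-wise (Hadamard) product
```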

Tucker Decomposition:

Tucker decomposition represents a tensor $\mathcal{T}$ as:

$\mathcal{T} = \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_K U^{(K)}$    (1)

where each $\times_k$ is the tensor $k$-mode product and each $U^{(k)} \in \mathbb{R}^{I_k \times R_k}$ is a matrix. The tensor $\mathcal{G}$, of size $R_1 \times R_2 \times \cdots \times R_K$, is called the core tensor. We say that $\mathcal{T}$ has Tucker rank $(R_1, \ldots, R_K)$, where $R_k$ is the rank of the mode-$k$ unfolding $\mathcal{T}^{(k)}$. To simplify learning, we define the Tucker rank in terms of $\operatorname{rank}(\mathcal{T}^{(k)})$, which can be bounded simply by the dimensions of $\mathcal{G}$, i.e. $\operatorname{rank}(\mathcal{T}^{(k)}) \le R_k$; this allows us to enforce a rank constraint on $\mathcal{T}$ simply by restricting the dimensions of $\mathcal{G}$, as described in §6.

CP Decomposition:

CP decomposition represents a tensor as a sum of rank-one tensors (i.e. a sum of outer products of vectors):

$\mathcal{T} = \sum_{r=1}^{R} u^{(1)}_r \otimes u^{(2)}_r \otimes \cdots \otimes u^{(K)}_r$    (2)

where each $U^{(k)}$ is an $R \times I_k$ matrix and $u^{(k)}_r$ is the vector in its $r$-th row. For CP decomposition, the rank of a tensor is defined to be the number of rank-one tensors in the decomposition. CP decomposition can be viewed as a special case of Tucker decomposition in which $R_1 = \cdots = R_K = R$ and $\mathcal{G}$ is a superdiagonal tensor.
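As a concrete (and purely illustrative) contrast between the two forms, the sketch below builds a 3-way tensor from a Tucker core and factor matrices, and from a rank-$R$ CP decomposition; the dimensions are made up.

```python
import numpy as np

# Tucker: T = G x_1 U1 x_2 U2 x_3 U3, with a small core tensor G.
G = np.random.randn(2, 3, 4)                       # core tensor (per-mode rank bound)
U1, U2, U3 = (np.random.randn(10, 2),
              np.random.randn(20, 3),
              np.random.randn(30, 4))
T_tucker = np.einsum('abc,ia,jb,kc->ijk', G, U1, U2, U3)
print(T_tucker.shape)                              # (10, 20, 30)

# CP: T = sum_r u1_r (outer) u2_r (outer) u3_r, where u?_r is row r of U?.
R = 5                                              # CP rank = number of rank-one terms
V1, V2, V3 = (np.random.randn(R, 10),
              np.random.randn(R, 20),
              np.random.randn(R, 30))
T_cp = np.einsum('ri,rj,rk->ijk', V1, V2, V3)
print(T_cp.shape)                                  # (10, 20, 30)
```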

3 Factorization of Lexical Features

Suppose we have a feature $f$ that includes information from a label $y$, multiple lexical items $w_1, \ldots, w_n$, and a non-lexical property $u$. This feature can be factorized as a conjunction of its parts: $f = y \wedge u \wedge w_1 \wedge \cdots \wedge w_n$. The feature fires when all of its parts fire in the instance (reflected by the $\wedge$ symbols in $f$). The one-hot representation of $f$ can then be viewed as a tensor $e_y \otimes e_u \otimes e_{w_1} \otimes \cdots \otimes e_{w_n}$, where each feature part is also represented as a one-hot vector (here $e_y$, $e_u$ and $e_{w_i}$ denote one-hot vectors rather than symbols). Figure 1d illustrates this case with two lexical parts.

Given an input instance $x$ and its associated label $y$, we can extract a set of features $F(x,y)$. In a traditional log-linear model, we view the instance as a bag of features, i.e. a feature vector $\phi(x,y)$. Each dimension corresponds to a feature $f$, and has value 1 if $f \in F(x,y)$. Then the log-linear model scores the instance as $s(x,y) = \theta^{\top}\phi(x,y)$, where $\theta$ is the parameter vector. We can re-write $s(x,y)$ based on the factorization of the features using tensor multiplication, in which $\theta$ becomes a parameter tensor $\mathcal{W}$:

$s(x,y) = \sum_{f \in F(x,y)} s_f$    (3)

Here each $f$ has the form $y \wedge u \wedge w_1 \wedge \cdots \wedge w_n$, and

$s_f = \big\langle \mathcal{W},\; e_y \otimes e_u \otimes e_{w_1} \otimes \cdots \otimes e_{w_n} \big\rangle = \mathcal{W} \times_1 e_y \times_2 e_u \times_3 e_{w_1} \cdots \times_{n+2} e_{w_n}$    (4)

Note that the one-hot vectors of words are themselves large (around 500k dimensions), so the above formulation with parameter tensor $\mathcal{W}$ can be very large, making parameter estimation difficult. Instead of estimating only the values of the dimensions that appear in the training data, as in traditional methods, we will reduce the size of the tensor $\mathcal{W}$ via a low-rank approximation. With different approximation methods, (4) will have different equivalent forms, e.g. (6) in §4.1 and (7) in §4.2.
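Before turning to the approximations, the following toy sketch illustrates (3)-(4) for a feature with a single lexical part: the feature is the outer product of one-hot vectors, and scoring it against the (full, un-factorized) parameter tensor is equivalent to indexing that tensor. Sizes and names are illustrative, not the paper's.

```python
import numpy as np

n_labels, n_props, vocab = 4, 6, 100
W = np.random.randn(n_labels, n_props, vocab)       # full parameter tensor (W in the text)

def one_hot(i, n):
    v = np.zeros(n); v[i] = 1.0
    return v

y, u, w = 2, 5, 42                                   # indices of label, property, word
f = np.einsum('a,b,c->abc',                          # outer product of the three one-hot vectors
              one_hot(y, n_labels), one_hot(u, n_props), one_hot(w, vocab))

score_tensor = np.sum(W * f)                         # <W, e_y (x) e_u (x) e_w>
score_index  = W[y, u, w]                            # equivalent lookup of a single entry
assert np.isclose(score_tensor, score_index)
```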

Optimization objective:

The loss function $\ell(x, y; \mathcal{W})$ for training the log-linear model uses (3) for scores, e.g. the log-loss $\ell(x, y; \mathcal{W}) = -s(x,y) + \log \sum_{y'} \exp\big(s(x,y')\big)$. Learning can be formulated as the following optimization problem:

$\min_{\mathcal{W}} \sum_{(x,y)} \ell(x, y; \mathcal{W}) \quad \text{s.t.}\;\; \operatorname{rank}(\mathcal{W}) \le r$    (5)

where the constraints on $\operatorname{rank}(\mathcal{W})$ depend on the chosen tensor approximation method (§2).

The above framework has two main advantages. First, as discussed in §1, we want the representations to capture the rich interactions between the different parts of a lexical feature; the low-rank tensor approximation methods keep the most important interaction information of the original tensor while significantly reducing its size. Second, the low-rank structure encourages weight-sharing among lexical features with similar decomposed parts, leading to better model generalization. Note that features may have different numbers of lexical parts, such as the unigram and bigram features in PP-attachment; we use two different methods to handle such features (§5).

Remarks (advantages of our factorization)

Compared to prior work, e.g. [Lei et al.2014, Lei et al.2015], the proposed factorization addresses the following issues:

  1. Parameter explosion when mapping a view with lexical properties to its representation vector (as will be discussed in §4.3): our factorization allows the model to treat word embeddings as inputs to the views of lexical parts, dramatically reducing the number of parameters. Prior work cannot do this since its views are mixtures of lexical and non-lexical properties. Note that Lei et al. (2014) use embeddings by concatenating them to specific views, which increases dimensionality, but the improvement is limited.

  2. No weight-sharing among conjunctions with the same lexical part, such as the child-word feature "word($j$)" and its conjunction with the head POS tag in Figure 1(b): the factorization in prior work treats these as independent features, greatly increasing the dimensionality. Our factorization builds the representations of both features from the embedding of "word($j$)", thus exploiting their connection and reducing the dimensionality.

The above advantages are also key to overcoming the problems of prior work mentioned at the end of §1.

4 Feature Representations via Low-rank Tensor Approximations

Using one-hot encodings for each of the parts of a feature $f$ results in a very large tensor. This section shows how to compute the score in (4) without constructing the full feature tensor, using two tensor approximation methods (§4.1 and §4.2).

We begin with some intuition. To score the original (full-rank) tensor representation of $f$, we need a parameter tensor of size $n_y \times n_u \times V^n$, where $V$ is the vocabulary size, $n$ is the number of lexical parts in the feature, and $n_y$ and $n_u$ are the numbers of different labels and non-lexical properties, respectively. (§5 will handle $n$ varying across features.) Our methods reduce the tensor size by embedding each part of $f$ into a lower-dimensional space, where we represent each label, non-lexical property and word with an $r_y$-, $r_u$- and $r_w$-dimensional vector, respectively ($r_u \ll n_u$, $r_w \ll V$). These embedded features can then be scored by much smaller tensors. We denote the above transformations as matrices $U_y$, $U_u$ and $U_{w_i}$ for $i = 1, \ldots, n$, and write the corresponding low-dimensional hidden representations as $h_y = U_y e_y$, $h_u = U_u e_u$ and $h_{w_i} = U_{w_i} e_{w_i}$.

In our methods, the above transformations of embeddings are parts of low-rank tensors as in (5), so the embeddings of non-lexical properties and labels can be trained simultaneously with the low-rank tensors. Note that for one-hot input encodings the transformation matrices are essentially lookup tables, making the computation of these transformations sufficiently fast.

4.1 Tucker Form

For our first approximation, we assume that the parameter tensor $\mathcal{W}$ has a low-rank Tucker decomposition, $\mathcal{W} = \mathcal{G} \times_1 U_y^{\top} \times_2 U_u^{\top} \times_3 U_{w_1}^{\top} \cdots \times_{n+2} U_{w_n}^{\top}$. We can then express the scoring function (4) for a feature with $n$ lexical parts as:

$s_f = \mathcal{G} \times_1 (U_y e_y) \times_2 (U_u e_u) \times_3 (U_{w_1} e_{w_1}) \cdots \times_{n+2} (U_{w_n} e_{w_n}) = \mathcal{G} \times_1 h_y \times_2 h_u \times_3 h_{w_1} \cdots \times_{n+2} h_{w_n}$    (6)

which amounts to first projecting $e_y$, $e_u$ and $e_{w_i}$ (for all $i$) to the lower-dimensional vectors $h_y$, $h_u$ and $h_{w_i}$, and then weighting these hidden representations using the (flattened) core tensor $\mathcal{G}$. The low-dimensional representations and the corresponding weights are learned jointly using a discriminative (supervised) criterion. We call the model based on this representation the Low-Rank Feature Representation with Tucker form, or lrfr-tucker.
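A minimal numpy sketch of the Tucker-form scoring in (6) for a feature with one lexical part is given below. The view sizes are taken loosely from the relation-extraction setting (32 labels, 264 non-lexical properties, 200-dimensional word embeddings), while the ranks, matrix values and function names are illustrative assumptions.

```python
import numpy as np

n_labels, n_props, emb_dim = 32, 264, 200            # sizes of the label, property and word views
r_y, r_u, r_w = 20, 20, 50                            # per-mode ranks (hidden dimensions)

Uy = np.random.randn(r_y, n_labels)                   # label transformation (lookup for one-hot input)
Uu = np.random.randn(r_u, n_props)                    # non-lexical property transformation
Uw = np.random.randn(r_w, emb_dim)                    # projection applied to the word embedding
G  = np.random.randn(r_y, r_u, r_w)                   # core tensor

def score_tucker(y_idx, u_idx, v_w):
    hy = Uy[:, y_idx]                                 # one-hot input => column lookup
    hu = Uu[:, u_idx]
    hw = Uw @ v_w                                     # pre-trained embedding as input (see Section 4.3)
    return np.einsum('abc,a,b,c->', G, hy, hu, hw)    # contract hidden vectors with the core tensor

print(score_tucker(3, 100, np.random.randn(emb_dim)))
```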

4.2 CP Form

For the Tucker approximation, the number of parameters in (6) scales exponentially with the number of lexical parts: for instance, if each hidden representation has dimensionality $r$, then $\mathcal{G}$ has $r^{n+2}$ entries. To address scalability and further control the complexity of our tensor-based model, we approximate the parameter tensor using the CP decomposition as in (2), resulting in the following scoring function:

$s_f = \mathbf{1}^{\top}\big( (U_y e_y) \odot (U_u e_u) \odot (U_{w_1} e_{w_1}) \odot \cdots \odot (U_{w_n} e_{w_n}) \big)$    (7)

where each transformation matrix now has $R$ rows, one per rank-one component.

We call this model Low-Rank Feature Representation with CP form (lrfr-cp).
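The following sketch mirrors (7): each view is projected once and the score is a sum over ranks of element-wise products, so the cost grows linearly in the rank and in the number of lexical parts. The rank and sizes below are illustrative, not the paper's tuned values.

```python
import numpy as np

R = 200                                              # CP rank (illustrative)
n_labels, n_props, emb_dim = 32, 264, 200
Uy = np.random.randn(R, n_labels)
Uu = np.random.randn(R, n_props)
Uw = np.random.randn(R, emb_dim)

def score_cp(y_idx, u_idx, v_w):
    # Hadamard product of the three R-dimensional hidden vectors, then sum over ranks.
    return np.sum(Uy[:, y_idx] * Uu[:, u_idx] * (Uw @ v_w))

print(score_cp(3, 100, np.random.randn(emb_dim)))
```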

4.3 Pre-trained Word Embeddings

One of the computational and statistical bottlenecks in learning these lrfr models is the vocabulary size $V$: the number of parameters to learn in each matrix $U_{w_i}$ scales linearly with $V$ and would require very large sets of labeled training data. To alleviate this problem, we use pre-trained continuous word embeddings [Mikolov et al.2013] as input embeddings rather than the one-hot word encodings. We denote the $d$-dimensional embedding of word $w$ by $v_w$; the transformation matrices for the lexical parts are then of size $r_w \times d$, where $d \ll V$.

We note that when sufficiently large labeled data is available, our model allows for fine-tuning the pre-trained word embeddings to improve the expressive strength of the model, as is common with deep network models.

Remarks

Our lrfr models introduce embeddings for non-lexical properties and labels, making them better suited to common NLP settings: rich linguistic properties, and large label sets such as those in open-domain tasks [Hoffmann et al.2010]. The lrfr-cp form better suits $n$-gram features, since when $n$ increases by 1 the only new parameters are those of the corresponding transformation matrix $U_{w_{n+1}}$. It is also very efficient during prediction, since the cost of the transformations can be ignored with the help of look-up tables and pre-computation.

5 Learning Representations for n-gram Lexical Features of Mixed Lengths

For features with $n$ lexical parts, we can train an lrfr model to obtain their representations. However, we often have features of varying $n$ (e.g. both unigrams ($n$=1) and bigrams ($n$=2), as in Figure 1). We require representations for features with arbitrary, different values of $n$ simultaneously.

We propose two solutions. The first is a straightforward solution based on our framework, which handles each value of $n$ with a separate tensor. This strategy is commonly used in NLP; e.g. Taub-Tabib et al. (2015) use different kernel functions for different orders of dependency features. The second is an approximation method which aims to use a single tensor to handle all values of $n$.

Multiple Low-Rank Tensors

Suppose that we can divide the feature set into subsets $F_1, F_2, \ldots, F_n$, which correspond to features with one lexical part (unigram features), two lexical parts (bigram features), ..., and $n$ lexical parts ($n$-gram features), respectively. To handle these types of features, we modify the training objective as follows:

$\min_{\mathcal{W}_1, \ldots, \mathcal{W}_n} \sum_{(x,y)} \ell(x, y; \mathcal{W}_1, \ldots, \mathcal{W}_n) \quad \text{s.t.}\;\; \operatorname{rank}(\mathcal{W}_i) \le r_i, \; i = 1, \ldots, n$    (8)

where the score of a training instance is defined as $s(x,y) = \sum_{i=1}^{n} \sum_{f \in F_i(x,y)} s_f(\mathcal{W}_i)$. We use the Tucker-form low-rank tensor for the unigram features and the CP form for features with more lexical parts. We refer to this method as lrfr-tucker & lrfr-cp. A toy sketch of this combined scoring is given below.
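The sketch below scores an instance as in (8), with a Tucker-form tensor over unigram features and a 4-mode CP-form tensor over bigram features. Feature extraction is omitted and all names, sizes and ranks are illustrative assumptions.

```python
import numpy as np

n_labels, n_props_uni, n_props_bi, emb_dim = 6, 9, 3, 200
r = 20                                                # Tucker per-mode rank
R = 50                                                # CP rank for the bigram tensor

# Tucker-form parameters for unigram features (label x property x word).
uni_y, uni_u, uni_w = (np.random.randn(r, n_labels),
                       np.random.randn(r, n_props_uni),
                       np.random.randn(r, emb_dim))
G = np.random.randn(r, r, r)

# CP-form parameters for bigram features (label x property x word x word).
bi_y, bi_u, bi_w1, bi_w2 = (np.random.randn(R, n_labels),
                            np.random.randn(R, n_props_bi),
                            np.random.randn(R, emb_dim),
                            np.random.randn(R, emb_dim))

def score(unigrams, bigrams):
    s = 0.0
    for (y, u, vw) in unigrams:                       # features with one lexical part
        s += np.einsum('abc,a,b,c->', G, uni_y[:, y], uni_u[:, u], uni_w @ vw)
    for (y, u, vw1, vw2) in bigrams:                  # features with two lexical parts
        s += np.sum(bi_y[:, y] * bi_u[:, u] * (bi_w1 @ vw1) * (bi_w2 @ vw2))
    return s

print(score([(0, 1, np.random.randn(emb_dim))],
            [(0, 2, np.random.randn(emb_dim), np.random.randn(emb_dim))]))
```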

Word Clusters

Alternatively, to handle different numbers of lexical parts, we replace some lexical parts with discrete word clusters. Let $c(w)$ denote the word cluster (e.g. from Brown clustering) for word $w$. For bigram features we have:

$y \wedge u \wedge w_1 \wedge w_2 \;\approx\; y \wedge \big(u \wedge c(w_2)\big) \wedge w_1$    (9)

where for each word we have introduced an additional set of non-lexical properties that are conjunctions of word clusters and the original non-lexical properties. This allows us to reduce an $n$-gram feature representation to a unigram representation. The advantage of this method is that it uses a single low-rank tensor to score features with different numbers of lexical parts, which is particularly helpful when we have very limited labeled data. We denote this method lrfr-Brown, since we use Brown clusters in practice. In the experiments we use the Tucker form for lrfr-Brown.
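The reduction in (9) can be pictured with the toy sketch below: a bigram feature's second word is folded into the non-lexical property via its cluster, leaving a unigram feature. The cluster table and property indexing scheme are illustrative, not the paper's implementation.

```python
# Toy word -> Brown-cluster map; real clusters would come from a clustering run.
brown_cluster = {"telescope": "0110", "branch": "0111"}

def reduce_bigram(y, u, w1, w2, prop_index):
    """Map a bigram feature (y, u, w1, w2) to a unigram feature whose
    non-lexical property conjoins u with the cluster of w2."""
    u_new = prop_index.setdefault((u, brown_cluster[w2]), len(prop_index))
    return (y, u_new, w1)

prop_index = {}                                        # grows as new conjunctions appear
print(reduce_bigram(0, 2, "see", "telescope", prop_index))   # (0, 0, 'see')
print(reduce_bigram(1, 2, "see", "branch", prop_index))      # (1, 1, 'see')
```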

6 Parameter Estimation

The goal of learning is to find a tensor $\mathcal{W}$ that solves problem (5). Note that this is a non-convex objective, so compared to the convex objective of a traditional log-linear model, we are trading better feature representations for a harder optimization problem. While stochastic gradient descent (SGD) is a natural choice for learning representations in large-data settings, problem (5) involves rank constraints, which would require an expensive proximal operation at each iteration of SGD. We seek a more efficient learning algorithm. Note that we fix the size of each transformation matrix so that its smaller dimension matches the upper bound on the rank. Therefore, the rank constraints are always satisfied throughout a run of SGD, and we in essence have an unconstrained optimization problem. Note that in this way we do not guarantee orthogonality or full rank of the learned transformation matrices; these properties are commonly assumed, but are not necessary according to [Kolda and Bader2009].

The gradients are computed via the chain rule. We use AdaGrad [Duchi et al.2011] and apply L2 regularization to all transformation matrices and the core tensor; when the word embeddings are fine-tuned, we initialize them with the pre-trained vectors and regularize their deviation from those initial values. We use early stopping on a development set.
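The update rule can be sketched as below: an unconstrained AdaGrad step with L2 regularization on one of the transformation matrices. The hyperparameter values and names are placeholders, not the paper's tuned settings.

```python
import numpy as np

def adagrad_step(param, grad, hist, lr=0.05, l2=1e-4, eps=1e-8):
    """One AdaGrad update with L2 regularization (toward zero)."""
    grad = grad + l2 * param             # add the L2 penalty's gradient
    hist += grad ** 2                    # accumulate squared gradients per coordinate
    param -= lr * grad / (np.sqrt(hist) + eps)
    return param, hist

U = np.random.randn(20, 200)             # e.g. one transformation matrix
H = np.zeros_like(U)                     # AdaGrad accumulator
g = np.random.randn(*U.shape)            # gradient from one mini-batch (placeholder)
U, H = adagrad_step(U, g, H)
```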

7 Experimental Settings

Task | Benchmark | Dataset | #Labels | #Non-lexical Properties
Relation Extraction | Yu et al. (2015) | ACE 2005 | 32 | 264
PP-attachment | Belinkov et al. (2014) | WSJ | - | 1,213 / 607
Preposition Disambiguation | Ritter et al. (2014) | Ritter et al. (2014) | 6 | 9 / 3
Table 1: Statistics of each task. PP-attachment and preposition disambiguation have both unigram and bigram features; we therefore list the numbers of non-lexical properties for both feature types.
Relation extraction (unigram features; only the non-lexical parts are shown):
Set | Template
HeadEmb | head of a target entity mention
Context | left/right token of an entity head
In-between | word in between the two entity heads
On-path | word on the dependency path between the two entities

Preposition disambiguation (uni/bigram features):
Set | Template
Bag of Words | a word $w$, where $w$ is the modifier noun or the head noun
Word-Position | a word conjoined with its position
Preposition | conjunctions with the preposition word

PP-attachment (uni/bigram features):
Set | Template
Bag of Words | a word $w$, where $w$ is the candidate head $h$ or the child $c$
Distance | Dis, the number of candidate heads between two words
Prep | the preposition word $p$
POS | POS tag of a word in the tuple
NextPOS | POS tag of the following word
VerbNet | preposition collocations of the verb from VerbNet
WordNet | root hypernym of a word from WordNet
Table 2: Top: unigram lexical features (showing only the non-lexical parts) for relation extraction, from Yu et al. (2014); we denote the two target entities (with their head indices and NE types) and their dependency path. Middle: uni/bigram features for preposition disambiguation, defined for each preposition word, its modifier noun and head noun; since the sentences differ from each other only in these three words, we ignore the words in the other positions. Bottom: uni/bigram features for PP-attachment, each defined on a tuple $(h, p, c)$, where $p$ is the preposition word, $c$ is the child of the preposition and $h$ is a candidate head of $p$; Dis is the number of candidate heads between two words, VerbNet supplies preposition collocations of a verb, and WordNet supplies the root hypernym of a word.

We evaluate lrfr on three tasks: relation extraction, PP attachment and preposition disambiguation (see Table 1 for a task summary). We include detailed feature templates in Table 2.

PP-attachment and relation extraction are two fundamental NLP tasks, and we test our models on the largest English data sets. The preposition disambiguation task was designed for compositional semantics, which is an important application of deep learning and distributed representations. On all these tasks, we compare to the state-of-the-art.

We use the same word embeddings as Belinkov et al. (2014) on PP-attachment for a fair comparison. For the other experiments, we use the same 200-dimensional word embeddings as Yu et al. (2015).

Relation Extraction

We use the English portion of the ACE 2005 relation extraction dataset [Walker et al.2006]. Following Yu et al. (2015), we use both gold entity spans and types, train the model on the news domain and test on the broadcast conversation domain. To highlight the impact of training data size, we evaluate with all 43,518 relations (entity mention pairs) and with a reduced training set of the first 10,000 relations. We report precision, recall, and F1.

We compare to two baseline methods: 1) a log-linear model with the rich binary feature set of Sun et al. (2011) and Zhou et al. (2005), as described in Yu et al. (2015) (Baseline); 2) the embedding model (fcm) of Gormley et al. (2015), which uses rich linguistic features for relation extraction. We use the same feature templates and evaluate on fine-grained relations (sub-types, 32 labels) [Yu et al.2015]. This evaluates how well lrfr can utilize non-lexical linguistic features.

PP-attachment

We consider the prepositional phrase (PP) attachment task of Belinkov et al. (2014) (http://groups.csail.mit.edu/rbg/code/pp), where for each PP the correct head (a verb or a noun) must be selected from the content words preceding the PP (within a 10-word window). We formulate the task as a ranking problem, where we optimize the score of the correct head within a list of candidates of varying size.

PP-attachment suffers from data sparsity because of its bi-lexical features, which we model with the methods of §5. Belinkov et al. (2014) show that rich features (POS, WordNet and VerbNet) help this task. The combination of these features gives a large number of non-lexical properties, for which the embeddings of non-lexical properties in lrfr should be useful.

We extract a dev set from section 22 of the PTB, following the description in Belinkov et al. (2014).

Preposition Disambiguation

We consider the preposition disambiguation task proposed by Ritter:2014learningsemantics. The task is to determine the spatial relationship a preposition indicates based on the two objects connected by the preposition. For example, “the apple on the refrigerator” indicates the “support by Horizontal Surface” relation, while “the apple on the branch” indicates the “Support from Above” relation. Since the meaning of a preposition depends on the combination of both its head and child word, we expect conjunctions between these word embeddings to help, i.e. features with two lexical parts.

We include three baselines: point-wise addition (SUM) [Mitchell and Lapata2010], concatenation [Ritter et al.2014], and an SVM based on hand-crafted features in Table 2. Ritter et al. show that the first two methods beat other compositional models.

Hyperparameters

are all tuned on the dev set. The chosen values are the learning rate and the weight of the L2 regularizer for lrfr; the third lrfr in Table 3 uses a different regularization weight. We select the ranks of lrfr-tucker with a grid search over a small set of candidate values, and likewise select the rank of lrfr-cp. For the PP-attachment task there is no label view, since it uses a ranking model. For preposition disambiguation we do not reduce the label dimension, since the number of labels is small.

8 Results

Method | Label / Non-lexical / Word dims (updated?) | Full Set (43,518): P / R / F1 | Reduced Set (10,000): P / R / F1 | Prediction Time (ms)
Baseline | - / - / - | 60.2 / 51.2 / 55.3 | - / - / - | -
fcm | 32/N, 264/N, 200/N | 62.9 / 49.6 / 55.4 | 61.6 / 37.1 / 46.3 | 2,242
lrfr-tucker | 32/N, 20/Y, 200/Y | 62.1 / 52.7 / 57.0 | 51.5 / 40.8 / 45.5 | 3,076
lrfr-tucker | 32/N, 20/Y, 200/N | 63.5 / 51.1 / 56.6 | 52.8 / 40.1 / 45.6 | 2,972
lrfr-tucker | 20/Y, 20/Y, 200/Y | 62.4 / 51.0 / 56.1 | 52.1 / 41.2 / 46.0 | 2,538
lrfr-tucker | 32/Y, 20/Y, 50/Y | 57.4 / 52.4 / 54.8 | 49.7 / 46.1 / 47.8 | 1,198
lrfr-cp | 200/Y | 61.3 / 50.7 / 55.5 | 58.3 / 41.6 / 48.6 | 502
Table 3: Test results for relation extraction. Y(es)/N(o) indicates whether the corresponding embeddings are updated during training.
System | Resources Used | Acc
SVM [Belinkov et al.2014] | distance, word, embedding, clusters, POS, WordNet, VerbNet | 86.0
HPCD [Belinkov et al.2014] | distance, embedding, POS, WordNet, VerbNet | 88.7
lrfr-tucker & lrfr-cp | distance, embedding, POS, WordNet, VerbNet | 90.3
lrfr-Brown | distance, embedding, clusters, POS, WordNet, VerbNet | 89.6
RBG [Lei et al.2014] | dependency parser | 88.4
Charniak-RS [McClosky et al.2006] | dependency parser + re-ranker | 88.6
RBG + HPCD (combined model) | dependency parser + distance, embedding, POS, WordNet, VerbNet | 90.1
Table 4: PP-attachment test accuracy. The baseline results are from Belinkov et al. (2014).
Relation Extraction

All lrfr-tucker models improve over Baseline and fcm (Table 3), making these the best reported numbers for this task. However, lrfr-cp does not work as well on the features with only one lexical part. The Tucker-form does a better job of capturing interactions between different views. In the limited training setting, we find that lrfr-cp does best.

Additionally, the primary advantage of the CP approximation is its reduction in the number of model parameters and running time. We report each model’s running time for a single pass on the development set. The lrfr-cp is by far the fastest. The first three lrfr-tucker models are slightly slower than fcm, because they work on dense non-lexical property embeddings while fcm benefits from sparse vectors.

PP-attachment

Table 4 shows that lrfr (89.6 and 90.3) improves over the previous best standalone system, HPCD (88.7), by a large margin, with exactly the same resources. Belinkov et al. (2014) also reported results for parsers and parser re-rankers, which have access to additional resources (complete parses for training and complete sentences as input), so comparing them with standalone systems like HPCD and our lrfr is not entirely fair. Nonetheless, lrfr-tucker & lrfr-cp (90.3) still outperforms the state-of-the-art parser RBG (88.4), the re-ranker Charniak-RS (88.6), and the combination of the state-of-the-art parser and compositional model, RBG + HPCD (90.1). Thus, even with fewer resources, lrfr becomes the new best system.

Not shown in the table: we also tried lrfr-tucker & lrfr-cp with POS-tag features only (89.7), and with grand-head-modifier conjunctions removed (89.3). Note that, compared to lrfr, RBG benefits from binary features, which also exploit grand-head-modifier structures; yet the above reduced models still work better than RBG (88.4) without using additional resources. (This is still not a fair comparison, since the training objectives differ; using RBG's factorization with our training objective would give a fair comparison, which we leave to future work.) Moreover, the results of lrfr could potentially be improved further by combining it with binary features. The above results show the advantage of our factorization method, which allows for utilizing pre-trained word embeddings and thus benefits from semi-supervised learning.

Preposition Disambiguation

lrfr improves (Table 5) over the best methods (SUM and Concatenation) of Ritter et al. (2014), as well as over the SVM based on the original lexical features (85.1). In this task lrfr-Brown represents the unigram and bigram lexical features better than the use of two low-rank tensors (lrfr-tucker & lrfr-cp). This may be because lrfr-Brown has fewer parameters, which helps on smaller training sets.

We also include a control setting (lrfr-Brown - Control), which has a full-rank parameter tensor with the same inputs on each view as lrfr-Brown, but represented as one-hot vectors without transformation to the hidden representations $h$. This is equivalent to an SVM with compound cluster features as in Koo et al. (2008). It performs much worse than lrfr-Brown, showing the advantage of using word embeddings and low-rank tensors.

Method | Accuracy
SVM - Lexical Features | 85.09
SUM | 80.55
Concatenation | 86.73
lrfr-tucker & lrfr-cp | 87.82
lrfr-Brown | 88.18
lrfr-Brown - Control | 84.18
Table 5: Accuracy for spatial classification of PPs.
Summary

For unigram lexical features, lrfr-tucker achieves better results than lrfr-cp. However, in settings with fewer training examples, with features that have more lexical parts ($n$-grams), or when faster prediction is advantageous, lrfr-cp does best, as it has fewer parameters to estimate. For $n$-grams of variable length, lrfr-tucker & lrfr-cp does best; in settings with fewer training examples, lrfr-Brown does best, as it has only one parameter tensor to estimate.

9 Related Work

Dimensionality Reduction for Complex Features

is a standard technique for addressing high-dimensional features, including PCA, alternating structural optimization [Ando and Zhang2005], denoising autoencoders [Vincent et al.2008], and feature embeddings [Yang and Eisenstein2015]. These methods treat features as atomic elements and ignore their inner structure, so they learn a separate embedding for each feature without shared parameters. As a result, they still suffer from large parameter spaces when the feature space is very large. (For example, a state-of-the-art dependency parser [Zhang and McDonald2014] extracts about 10 million features; learning 100-dimensional feature embeddings would then involve estimating approximately a billion parameters.)

Another line of research studies the inner structure of lexical features: Koo et al. (2008), Turian et al. (2010), Sun et al. (2011), Nguyen and Grishman (2014), Roth and Woodsend (2014), and Hermann et al. (2014) used pre-trained word embeddings to replace the lexical parts of features; Srikumar and Manning (2014), Gormley et al. (2015) and Yu et al. (2015) propose splitting lexical features into different parts and employing tensors to perform classification. These can therefore be seen as special cases of our model that embed only a certain part (view) of the complex features. This restriction also makes their model parameters form a full-rank tensor, resulting in data sparsity and high computational costs when the tensors are large.

Composition Models (Deep Learning) build representations for structures based on their component word embeddings [Collobert et al.2011, Bordes et al.2012, Socher et al.2012, Socher et al.2013b]. When using only word embeddings, these models have achieved successes on several NLP tasks, but they sometimes fail to learn useful syntactic or semantic patterns beyond the strength of combinations of word embeddings, such as the dependency relation in Figure 1(a). To tackle this problem, some work designs model structures around a specific kind of linguistic pattern, e.g. dependency paths [Ma et al.2015, Liu et al.2015], while a recent trend enhances compositional models with linguistic features. For example, Belinkov et al. (2014) concatenate embeddings with linguistic features before feeding them to a neural network; Socher et al. (2013a) and Hermann and Blunsom (2013) enhanced Recursive Neural Networks by refining the transformation matrices with linguistic features (e.g. phrase types). These models are similar to ours in the sense of learning representations based on linguistic features and embeddings.

Low-rank Tensor Models for NLP aim to handle the conjunction among different views of features [Cao and Khudanpur2014, Lei et al.2014, Chen and Manning2014]. Yu and Dredze (2015) proposed a model for composing phrase embeddings from words, which has a form equivalent to our CP-based method under certain restrictions. Our work applies a similar idea to exploiting the inner structure of complex features, and can handle $n$-gram features with different values of $n$. Our factorization (§3) is general and easy to adapt to new tasks. More importantly, it lets the model benefit from pre-trained word embeddings, as shown by the PP-attachment results.

10 Conclusion

We have presented lrfr, a feature representation model that exploits the inner structure of complex lexical features and applies a low-rank tensor to efficiently score features with this representation. lrfr attains state-of-the-art results on several tasks, including relation extraction, PP-attachment, and preposition disambiguation. We make our implementation available for general use at https://github.com/Gorov/LowRankFCM.

Acknowledgements

A major portion of this work was done when MY was visiting MD and RA at JHU. This research was supported in part by NSF grant IIS-1546482.

References

  • [Ando and Zhang2005] Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research, 6.
  • [Belinkov et al.2014] Yonatan Belinkov, Tao Lei, Regina Barzilay, and Amir Globerson. 2014. Exploring compositional architectures and word vector representations for prepositional phrase attachment. Transactions of the Association for Computational Linguistics, 2.
  • [Bengio et al.2006] Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Fréderic Morin, and Jean-Luc Gauvain. 2006. Neural probabilistic language models. In Innovations in Machine Learning. Springer.
  • [Bordes et al.2012] Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2012. A semantic matching energy function for learning with multi-relational data. Machine Learning.
  • [Cao and Khudanpur2014] Yuan Cao and Sanjeev Khudanpur. 2014. Online learning in tensor space. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  • [Chen and Manning2014] Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP.
  • [Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. JMLR, 12.
  • [Duchi et al.2011] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12.
  • [Gormley et al.2015] Matthew R. Gormley, Mo Yu, and Mark Dredze. 2015. Improved relation extraction with feature-rich compositional embedding models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
  • [Hermann and Blunsom2013] Karl Moritz Hermann and Phil Blunsom. 2013. The role of syntax in vector space models of compositional semantics. In Association for Computational Linguistics.
  • [Hermann et al.2014] Karl Moritz Hermann, Dipanjan Das, Jason Weston, and Kuzman Ganchev. 2014. Semantic frame identification with distributed word representations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  • [Hoffmann et al.2010] Raphael Hoffmann, Congle Zhang, and Daniel S. Weld. 2010. Learning 5000 relational extractors. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.
  • [Kolda and Bader2009] Tamara G Kolda and Brett W Bader. 2009. Tensor decompositions and applications. SIAM review, 51(3).
  • [Koo et al.2008] Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proceedings of ACL.
  • [Lei et al.2014] Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  • [Lei et al.2015] Tao Lei, Yuan Zhang, Lluís Màrquez, Alessandro Moschitti, and Regina Barzilay. 2015. High-order low-rank tensors for semantic role labeling. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • [Liu et al.2015] Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng WANG. 2015. A dependency-based neural network for relation classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers).
  • [Ma et al.2015] Mingbo Ma, Liang Huang, Bowen Zhou, and Bing Xiang. 2015. Dependency-based convolutional neural networks for sentence embedding. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers).
  • [McClosky et al.2006] David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the main conference on human language technology conference of the North American Chapter of the Association of Computational Linguistics.
  • [Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • [Miller et al.2004] Scott Miller, Jethran Guinness, and Alex Zamanian. 2004. Name tagging with word clusters and discriminative training. In Proceedings of HLT-NAACL.
  • [Mitchell and Lapata2010] Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive science, 34(8).
  • [Nguyen and Grishman2014] Thien Huu Nguyen and Ralph Grishman. 2014. Employing word representations and regularization for domain adaptation of relation extraction. In Association for Computational Linguistics (ACL).
  • [Ritter et al.2014] Samuel Ritter, Cotie Long, Denis Paperno, Marco Baroni, Matthew Botvinick, and Adele Goldberg. 2014. Leveraging preposition ambiguity to assess representation of semantic interaction in cdsm. In NIPS Workshop on Learning Semantics.
  • [Roth and Woodsend2014] Michael Roth and Kristian Woodsend. 2014. Composition of word representations improves semantic role labelling. In Proceedings of EMNLP.
  • [Socher et al.2012] Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of EMNLP-CoNLL 2012.
  • [Socher et al.2013a] Richard Socher, John Bauer, Christopher D Manning, and Andrew Y Ng. 2013a. Parsing with compositional vector grammars. In Proceedings of ACL.
  • [Socher et al.2013b] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.
  • [Srikumar and Manning2014] Vivek Srikumar and Christopher D Manning. 2014. Learning distributed representations for structured output prediction. In Advances in Neural Information Processing Systems.
  • [Sun et al.2011] Ang Sun, Ralph Grishman, and Satoshi Sekine. 2011. Semi-supervised relation extraction with large-scale word clustering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
  • [Taub-Tabib et al.2015] Hillel Taub-Tabib, Yoav Goldberg, and Amir Globerson. 2015. Template kernels for dependency parsing. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • [Turian et al.2010] Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Association for Computational Linguistics.
  • [Vincent et al.2008] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning.
  • [Walker et al.2006] Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. 2006. ACE 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia.
  • [Yang and Eisenstein2015] Yi Yang and Jacob Eisenstein. 2015. Unsupervised multi-domain adaptation with feature embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 672–682, Denver, Colorado, May–June. Association for Computational Linguistics.
  • [Yu and Dredze2015] Mo Yu and Mark Dredze. 2015. Learning composition models for phrase embeddings. Transactions of the Association for Computational Linguistics, 3.
  • [Yu et al.2015] Mo Yu, Matthew R. Gormley, and Mark Dredze. 2015. Combining word embeddings and feature embeddings for fine-grained relation extraction. In North American Chapter of the Association for Computational Linguistics (NAACL).
  • [Zhang and McDonald2014] Hao Zhang and Ryan McDonald. 2014. Enforcing structural diversity in cube-pruned dependency parsing. In Proceedings of ACL.
  • [Zhou et al.2005] GuoDong Zhou, Jian Su, Jie Zhang, and Min Zhang. 2005. Exploring various knowledge in relation extraction. In Proceedings of ACL.