1 Introduction
Statistical NLP models usually rely on hand-designed features, customized for each task. These features typically combine lexical and contextual information with the label to be scored. In relation extraction, for example, there is a parameter for the presence of a specific relation occurring with a feature conjoining a word type (lexical) with dependency path information (contextual). In measuring phrase semantic similarity, a word type is conjoined with its position in the phrase to signal its role. Figure 1b shows an example in dependency parsing, where multiple types (words) are conjoined with POS tags or distance information.
To avoid the overfitting that often results from features with lexical components, several smoothed lexical representations have been proposed and shown to improve performance on various NLP tasks; for instance, word embeddings [Bengio et al., 2006] help improve NER, dependency parsing and semantic role labeling [Miller et al., 2004; Koo et al., 2008; Turian et al., 2010; Sun et al., 2011; Roth and Woodsend, 2014; Hermann et al., 2014].
However, using only word embeddings is not sufficient to represent complex lexical features (e.g. in Figure 1c). In these features, the same word embedding conjoined with different non-lexical properties may result in features indicating different labels; the corresponding lexical feature representations should take these interactions into consideration. Such interactions also increase the risk of overfitting as the feature space grows exponentially, yet how to capture them in representation learning remains an open question.
To address the above problems, we propose a general and unified approach to reduce the feature space by constructing low-dimensional feature representations, which provides a new way of combining word embeddings, traditional non-lexical properties, and label information. (Footnote 1: Our paper only focuses on lexical features, as non-lexical features usually suffer less from overfitting.) Our model exploits the inner structure of features by breaking each feature into multiple parts: lexical, non-lexical and (optional) label. We demonstrate that the full feature is an outer product among these parts. Thus, a parameter tensor scores each feature to produce a prediction. Our model then reduces the number of parameters by approximating the parameter tensor with a low-rank tensor: either the Tucker approximation of Yu et al. (2015), but applied to each embedding type (view), or the Canonical/Parallel-Factors Decomposition (CP). Our models use fewer parameters than previous work that learns a separate representation for each feature [Ando and Zhang, 2005; Yang and Eisenstein, 2015]. The CP approximation also allows for much faster prediction, going from a method that is cubic in rank and exponential in the number of lexical parts to a method linear in both. Furthermore, we consider two methods for handling features that rely on n-grams of mixed lengths.
Our model makes the following contributions when contrasted with prior work:
Lei et al. (2014) applied CP to combine different views of features. Compared to their work, our usage of CP decomposition differs in its application to feature learning: (1) We focus on dimensionality reduction of existing, well-verified features, while Lei et al. (2014) generate new features (usually different from ours) by combining some “atom” features. Thus their work may ignore some useful features; it relies on binary features as a supplement, whereas our model does not need them. (2) The factorization of Lei et al. (2014) relies on views with explicit meanings, e.g. head/modifier/arc in dependency parsing, making it less general. Therefore its applications to tasks like relation extraction are less obvious.
Compared to our previous work [Gormley et al., 2015; Yu et al., 2015], this work allows for higher-order interactions, mixed-length n-gram features, and lower-rank representations. We also demonstrate the strength of our new model via applications to new tasks.
The resulting method learns smoothed feature representations combining lexical, non-lexical and label information, achieving state-of-the-art performance on several tasks: relation extraction, preposition semantics and PP-attachment.
2 Notation and Definitions
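We begin with some background on notation and definitions. Let $\mathcal{T} \in \mathbb{R}^{d_1 \times \cdots \times d_K}$ be a $K$-way tensor (i.e., a tensor with $K$ views). In this paper, we consider the tensor $k$-mode product, i.e. multiplying a tensor $\mathcal{T}$ by a matrix $x \in \mathbb{R}^{d_k \times J}$ (or a vector if $J = 1$) in mode (view) $k$. The product is denoted by $\mathcal{T} \times_k x$ and is of size $d_1 \times \cdots \times d_{k-1} \times J \times d_{k+1} \times \cdots \times d_K$. Element-wise, we have
$$ (\mathcal{T} \times_k x)_{i_1 \ldots i_{k-1}\, j\, i_{k+1} \ldots i_K} = \sum_{i_k = 1}^{d_k} \mathcal{T}_{i_1 \ldots i_k \ldots i_K}\, x_{i_k j}, $$
for $j = 1, \ldots, J$. A mode-$k$ fiber of $\mathcal{T}$ is the $d_k$-dimensional vector obtained by fixing all but the $k$-th index. The mode-$k$ unfolding $\mathcal{T}_{(k)}$ of $\mathcal{T}$ is the $d_k \times \prod_{i \neq k} d_i$ matrix obtained by concatenating all the $\prod_{i \neq k} d_i$ mode-$k$ fibers along columns.
Given two matrices $W_1$ and $W_2$, we write $W_1 \otimes W_2$ to denote the Kronecker product between $W_1$ and $W_2$ (outer product for vectors). We define the Frobenius product (matrix dot product) $A \odot B = \sum_{i,j} A_{ij} B_{ij}$ between two matrices with the same sizes; and define element-wise (Hadamard) multiplication $a \circ b$ between vectors with the same sizes.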
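The mode product and the unfolding described above can be sketched in a few lines of NumPy. This is only an illustration: the function names `mode_product` and `unfold` are ours, the matrix is assumed to have shape $d_k \times J$, and the column order of the unfolding follows NumPy's row-major layout rather than any particular convention from the literature.

```python
import numpy as np

def mode_product(T, X, k):
    """Multiply tensor T by X along mode k.

    X has shape (d_k, J): out[..., j, ...] = sum_i T[..., i, ...] * X[i, j].
    A 1-D X (a vector) contracts mode k away entirely.
    """
    X = np.asarray(X)
    out = np.tensordot(T, X, axes=([k], [0]))  # the new axis (if any) goes last
    if X.ndim == 1:
        return out
    return np.moveaxis(out, -1, k)             # move the J-sized axis back to mode k

def unfold(T, k):
    """Mode-k unfolding: a d_k x (product of the other dims) matrix of mode-k fibers."""
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

rng = np.random.default_rng(0)
T = rng.normal(size=(3, 4, 5))
X = rng.normal(size=(4, 2))
P = mode_product(T, X, 1)   # mode 1 shrinks from size 4 to size 2
```

With this layout, the mode product is a matrix product on the unfolding: `unfold(P, 1)` equals `X.T @ unfold(T, 1)`.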
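Tucker Decomposition:
Tucker decomposition represents a $d_1 \times \cdots \times d_K$ tensor $\mathcal{T}$ as:
$$ \mathcal{T} = g \times_1 F_1 \times_2 F_2 \cdots \times_K F_K, \tag{1} $$
where each $\times_i$ is the tensor $i$-mode product and each $F_i$ is an $r_i \times d_i$ matrix. The tensor $g$ with size $r_1 \times \cdots \times r_K$ is called the core tensor. We say that $\mathcal{T}$ has a Tucker rank $(r_{(1)}, \ldots, r_{(K)})$, where $r_{(i)} = \mathrm{rank}(\mathcal{T}_{(i)})$ is the rank of the mode-$i$ unfolding. To simplify learning, we define the Tucker rank as $r_{(i)} = \mathrm{rank}(g_{(i)})$, which can be bounded simply by the dimensions of $g$, i.e. $r_{(i)} \le r_i$; this allows us to enforce a rank constraint on $\mathcal{T}$ simply by restricting the dimensions $r_i$ of $g$, as described in §6.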
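As a small sanity check of the rank bound, the sketch below rebuilds a full tensor from a random core and random factor matrices and computes the rank of each unfolding. The sizes are toy values and the helper name `tucker_to_full` is an assumption made for this illustration; each factor is taken to have shape (core dimension, full dimension).

```python
import numpy as np

def tucker_to_full(core, factors):
    """Rebuild the full tensor from a core and one (r_k, d_k) factor per mode."""
    T = core
    for k, F in enumerate(factors):
        # contract mode k of T (size r_k) with the rows of F, leaving size d_k
        T = np.moveaxis(np.tensordot(T, F, axes=([k], [0])), -1, k)
    return T

rng = np.random.default_rng(0)
ranks, dims = (2, 3, 2), (6, 7, 8)   # toy sizes
core = rng.normal(size=ranks)
factors = [rng.normal(size=(r, d)) for r, d in zip(ranks, dims)]
T = tucker_to_full(core, factors)

# The rank of each mode-k unfolding cannot exceed the core dimension r_k
unfolding_ranks = [
    int(np.linalg.matrix_rank(np.moveaxis(T, k, 0).reshape(dims[k], -1)))
    for k in range(len(dims))
]
```

Fixing the core dimensions therefore fixes an upper bound on the rank, which is what lets the rank constraint be enforced by construction.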
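CP Decomposition:
CP decomposition represents a $d_1 \times \cdots \times d_K$ tensor $\mathcal{T}$ as a sum of rank-one tensors (i.e. a sum of outer products of $K$ vectors):
$$ \mathcal{T} = \sum_{j=1}^{r} F_1[j,:] \otimes F_2[j,:] \otimes \cdots \otimes F_K[j,:], \tag{2} $$
where each $F_i$ is an $r \times d_i$ matrix and $F_i[j,:]$ is the vector of its $j$-th row. For CP decomposition, the rank $r$ of a tensor $\mathcal{T}$ is defined to be the number of rank-one tensors in the decomposition. CP decomposition can be viewed as a special case of Tucker decomposition in which $r_1 = r_2 = \cdots = r_K = r$ and $g$ is a superdiagonal tensor.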
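The relationship between the two decompositions can be checked numerically. The sketch below (toy sizes, illustrative names) builds a tensor as a sum of rank-one terms and then rebuilds it in Tucker form with a superdiagonal core of ones.

```python
import numpy as np

def cp_to_full(factors):
    """Sum of r rank-one tensors; factors[k] has shape (r, d_k), row j is component j."""
    r = factors[0].shape[0]
    T = np.zeros(tuple(F.shape[1] for F in factors))
    for j in range(r):
        component = factors[0][j]
        for F in factors[1:]:
            component = np.multiply.outer(component, F[j])  # outer product of the rows
        T += component
    return T

rng = np.random.default_rng(1)
r, dims = 3, (4, 5, 6)   # toy sizes
factors = [rng.normal(size=(r, d)) for d in dims]
T_cp = cp_to_full(factors)

# The same tensor in Tucker form: an r x r x r core that is 1 on the superdiagonal
core = np.zeros((r, r, r))
core[np.arange(r), np.arange(r), np.arange(r)] = 1.0
T_tucker = np.einsum("abc,ai,bj,ck->ijk", core, *factors)
```

The two constructions agree up to floating-point error, which is the special-case relationship stated above.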
3 Factorization of Lexical Features
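Suppose we have a feature $f$ that includes information from a label $y$, multiple lexical items $w_1, \ldots, w_n$ and a non-lexical property $u$. This feature can be factorized as a conjunction of each part: $f = y \wedge u \wedge w_1 \wedge \cdots \wedge w_n$. The feature fires when all $n+2$ parts fire in the instance (reflected by the $\wedge$ symbol in $f$). The one-hot representation of $f$ can then be viewed as a tensor $e_f = \mathbf{y} \otimes \mathbf{u} \otimes \mathbf{w}_1 \otimes \cdots \otimes \mathbf{w}_n$, where each feature part is also represented as a one-hot vector. (Footnote 2: $\mathbf{y}$, $\mathbf{u}$ and $\mathbf{w}_i$ denote one-hot vectors instead of symbols.) Figure 1d illustrates this case with two lexical parts.
Given an input instance $x$ and its associated label $y$, we can extract a set of features $S(x, y)$. In a traditional log-linear model, we view the instance as a bag-of-features, i.e. a feature vector $F(x, y)$. Each dimension corresponds to a feature $f$, and has value 1 if $f \in S(x, y)$. Then the log-linear model scores the instance as $s(x, y; \boldsymbol{\theta}) = \boldsymbol{\theta}^{\top} F(x, y) = \sum_{f \in S(x,y)} s(f; \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ is the parameter vector. We can rewrite $s(x, y; \boldsymbol{\theta})$ based on the factorization of the features using tensor multiplication, in which $\boldsymbol{\theta}$ becomes a parameter tensor $\mathcal{T}$:
$$ s(x, y; \mathcal{T}) = \sum_{f \in S(x,y)} s(f; \mathcal{T}). \tag{3} $$
Here each $f$ has the form $f = (y, u, w_1, \ldots, w_n)$, and
$$ s(f; \mathcal{T}) = \mathcal{T} \times_1 \mathbf{y} \times_2 \mathbf{u} \times_3 \mathbf{w}_1 \cdots \times_{n+2} \mathbf{w}_n. \tag{4} $$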
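With one-hot inputs, contracting every mode of the parameter tensor simply reads out a single entry, and an instance score is the sum of such entries over its features. The sketch below illustrates this with one lexical part; the sizes, indices, and function names are hypothetical and chosen only for the example.

```python
import numpy as np

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def feature_score(T, parts):
    """Contract every mode of T with the corresponding vector in `parts`."""
    out = T
    for v in parts:
        out = np.tensordot(out, v, axes=([0], [0]))  # always contract the leading mode
    return float(out)

def instance_score(T, features):
    """Sum the scores of all features extracted for one (instance, label) pair."""
    return sum(feature_score(T, parts) for parts in features)

rng = np.random.default_rng(2)
n_labels, n_props, vocab = 3, 4, 10   # toy sizes, far smaller than a real vocabulary
T = rng.normal(size=(n_labels, n_props, vocab))
f1 = [one_hot(1, n_labels), one_hot(2, n_props), one_hot(7, vocab)]
f2 = [one_hot(1, n_labels), one_hot(0, n_props), one_hot(3, vocab)]
score = instance_score(T, [f1, f2])   # equals T[1, 2, 7] + T[1, 0, 3]
```

This makes the cost of the full-rank formulation concrete: there is one free parameter per combination of label, property, and word.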
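Note that the one-hot vectors of words are themselves large (on the order of 500k dimensions), so the parameter tensor $\mathcal{T}$ in the above formulation can be very large, making parameter estimation difficult. Instead of estimating only the values of the dimensions which appear in training data, as in traditional methods, we will reduce the size of the tensor $\mathcal{T}$ via a low-rank approximation. With different approximation methods, (4) will have different equivalent forms, e.g. (6) in §4.1 and (7) in §4.2.
Optimization objective:
The loss function $\ell$ for training the log-linear model uses (3) for scores, e.g., the log-loss $\ell(x, y; \mathcal{T}) = -\log \frac{\exp\{s(x, y; \mathcal{T})\}}{\sum_{y'} \exp\{s(x, y'; \mathcal{T})\}}$. Learning can be formulated as the following optimization problem over a training set $\mathcal{D}$:
$$ \min_{\mathcal{T}} \; \sum_{(x,y) \in \mathcal{D}} \ell(x, y; \mathcal{T}) \quad \text{subject to} \quad \begin{cases} \mathrm{rank}(\mathcal{T}) \le (r_1, r_2, \ldots, r_{n+2}) & \text{(Tucker form)} \\ \mathrm{rank}(\mathcal{T}) \le r & \text{(CP form)} \end{cases} \tag{5} $$
where the constraints on $\mathrm{rank}(\mathcal{T})$ depend on the chosen tensor approximation method (§2).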
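The log-loss mentioned above is the usual negative log-probability of the gold label under a softmax over per-label scores. A minimal sketch, assuming the scores for all candidate labels of one instance have already been computed (the function names are ours):

```python
import numpy as np

def log_loss(scores, gold):
    """Negative log-probability of the gold label under a softmax over label scores."""
    shifted = scores - np.max(scores)          # subtract the max for numerical stability
    log_z = np.log(np.sum(np.exp(shifted)))
    return float(log_z - shifted[gold])

def dataset_loss(all_scores, gold_labels):
    """Sum of per-instance losses: the unconstrained part of the training objective."""
    return sum(log_loss(s, y) for s, y in zip(all_scores, gold_labels))

scores = np.array([2.0, 0.5, -1.0])   # toy scores for three labels
loss = log_loss(scores, gold=0)
```

The rank constraints do not appear in this code; they are a property of how the scores are parameterized, not of the loss itself.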
The above framework has some advantages. First, as discussed in §1 and here, we hope the representations capture rich interactions between different parts of the lexical features; the low-rank tensor approximation methods keep the most important interaction information of the original tensor, while significantly reducing its size. Second, the low-rank structure will encourage weight-sharing among lexical features with similar decomposed parts, leading to better model generalization. Note that there are examples where features have different numbers of lexical parts, such as both unigram and bigram features in PP-attachment. We will use two different methods to handle these features (§5).
Remarks (advantages of our factorization)
Compared to prior work, e.g. [Lei et al., 2014; Lei et al., 2015], the proposed factorization avoids the following two problems:
(1) Parameter explosion when mapping a view with lexical properties to its representation vector (as will be discussed in §4.3): Our factorization allows the model to treat word embeddings as inputs to the views of lexical parts, dramatically reducing the number of parameters. Prior work cannot do this since its views are mixtures of lexical and non-lexical properties. Note that Lei et al. (2014) use embeddings by concatenating them to specific views, which increases dimensionality, but the improvement is limited.
(2) No weight-sharing among conjunctions with the same lexical property, like the child-word “word()” and its conjunction with head-postag “word() word()” in Figure 1(b). The factorization in prior work treats them as independent features, greatly increasing the dimensionality. Our factorization builds representations of both features based on the embedding of “word()”, thus utilizing their connections and reducing the dimensionality.
The above advantages are also key to overcoming the problems of prior work mentioned at the end of §1.
4 Feature Representations via Low-rank Tensor Approximations
Using one-hot encodings for each of the parts of feature $f$ results in a very large tensor. This section shows how to compute the score in (4) without constructing the full feature tensor, using two tensor approximation methods (§4.1 and §4.2).
We begin with some intuition. To score the original (full rank) tensor representation of $f$, we need a parameter tensor $\mathcal{T}$ of size $d_1 \times d_2 \times d_3 \times \cdots \times d_{n+2}$, where $d_3 = \cdots = d_{n+2} = |V|$ is the vocabulary size, $n$ is the number of lexical parts in the feature, and $d_1 = |L|$ and $d_2 = |F|$ are the number of different labels and non-lexical properties, respectively. (§5 will handle $n$ varying across features.) Our methods reduce the tensor size by embedding each part of $f$ into a lower-dimensional space, where we represent each label, non-lexical property and word with an $r_1$-, $r_2$- and $r_3, \ldots, r_{n+2}$-dimensional vector respectively ($r_i \ll d_i$ for all $i$). These embedded features can then be scored by much smaller tensors. We denote the above transformations as matrices $W_l \in \mathbb{R}^{r_1 \times d_1}$, $W_f \in \mathbb{R}^{r_2 \times d_2}$ and $W_i \in \mathbb{R}^{r_{i+2} \times d_{i+2}}$ for $i = 1, \ldots, n$, and write the corresponding low-dimensional hidden representations as $h_y = W_l \mathbf{y}$, $h_u = W_f \mathbf{u}$ and $h_{w_i} = W_i \mathbf{w}_i$.
In our methods, the above transformations of embeddings are parts of low-rank tensors as in (5), so the embeddings of non-lexical properties and labels can be trained simultaneously with the low-rank tensors. Note that for one-hot input encodings the transformation matrices are essentially lookup tables, making the computation of these transformations sufficiently fast.
4.1 Tucker Form
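For our first approximation, we assume that tensor $\mathcal{T}$ has a low-rank Tucker decomposition: $\mathcal{T} = g \times_1 W_l \times_2 W_f \times_3 W_1 \cdots \times_{n+2} W_n$. We can then express the scoring function (4) for a feature $f = (y, u, w_1, \ldots, w_n)$ with $n$ lexical parts as:
$$ s(y, u, w_1, \ldots, w_n; g, W_l, W_f, \{W_i\}_{i=1}^{n}) = g \times_1 h_y \times_2 h_u \times_3 h_{w_1} \cdots \times_{n+2} h_{w_n}, \tag{6} $$
which amounts to first projecting $\mathbf{u}$, $\mathbf{y}$ and $\mathbf{w}_i$ (for all $i$) to the lower-dimensional vectors $h_u$, $h_y$ and $h_{w_i}$, and then weighting these hidden representations using the flattened core tensor $g$. The low-dimensional representations and the corresponding weights are learned jointly using a discriminative (supervised) criterion. We call the model based on this representation the Low-Rank Feature Representation with Tucker form, or lrfr-tucker.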
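The two-step computation described above (project each part, then weight with the core) can be sketched for a feature with one lexical part. All sizes and variable names here are illustrative assumptions, and the lexical input is taken to be a dense word embedding rather than a one-hot vector; the final lines compare against the equivalent full parameter tensor.

```python
import numpy as np

rng = np.random.default_rng(3)
n_labels, n_props, emb_dim = 5, 7, 6      # toy sizes
r1, r2, r3 = 2, 3, 4                      # dimensions of the core tensor

W_l = rng.normal(size=(r1, n_labels))     # label transformation (a lookup table)
W_f = rng.normal(size=(r2, n_props))      # non-lexical transformation (a lookup table)
W_w = rng.normal(size=(r3, emb_dim))      # projection of the word embedding
core = rng.normal(size=(r1, r2, r3))

def tucker_score(y, u, e_w):
    """Score one feature: label index y, property index u, word embedding e_w."""
    h_y = W_l[:, y]                       # one-hot input, so projection is a column lookup
    h_u = W_f[:, u]
    h_w = W_w @ e_w
    return float(np.einsum("abc,a,b,c->", core, h_y, h_u, h_w))

e_w = rng.normal(size=emb_dim)
s = tucker_score(1, 4, e_w)

# Reference: build the full parameter tensor and contract it with the raw inputs
T_full = np.einsum("abc,ai,bj,ck->ijk", core, W_l, W_f, W_w)
s_full = float(T_full[1, 4] @ e_w)
```

The full tensor is never needed at prediction time; it is built here only to confirm that both routes give the same score.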
4.2 CP Form
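For the Tucker approximation, the number of parameters in (6) scales exponentially with the number of lexical parts. For instance, suppose each $h_{w_i}$ has dimensionality $r_w$; then $|g| \propto r_w^{n}$. To address scalability and further control the complexity of our tensor-based model, we approximate the parameter tensor using CP decomposition as in (2), resulting in the following scoring function:
$$ s(y, u, w_1, \ldots, w_n; W_l, W_f, \{W_i\}_{i=1}^{n}) = \sum_{j=1}^{r} h_y[j]\; h_u[j] \prod_{i=1}^{n} h_{w_i}[j], \tag{7} $$
where all hidden representations now share the same dimensionality $r$.
We call this model Low-Rank Feature Representation with CP form (lrfr-cp).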
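In CP form the score reduces to an element-wise product of the hidden vectors followed by a sum over the rank dimension. A minimal sketch for a feature with two lexical parts, using toy sizes and hypothetical names, with a check against the equivalent full four-way tensor:

```python
import numpy as np

rng = np.random.default_rng(4)
n_labels, n_props, emb_dim, r = 5, 7, 6, 3   # toy sizes
W_l = rng.normal(size=(r, n_labels))
W_f = rng.normal(size=(r, n_props))
W_1 = rng.normal(size=(r, emb_dim))          # transformation for the first lexical part
W_2 = rng.normal(size=(r, emb_dim))          # transformation for the second lexical part

def cp_score(y, u, word_embs, word_mats):
    """Sum over rank components of the element-wise product of all hidden vectors."""
    h = W_l[:, y] * W_f[:, u]
    for W, e in zip(word_mats, word_embs):
        h = h * (W @ e)                      # one extra Hadamard factor per lexical part
    return float(h.sum())

e1, e2 = rng.normal(size=emb_dim), rng.normal(size=emb_dim)
s = cp_score(2, 3, [e1, e2], [W_1, W_2])

# Reference: the equivalent full 4-way tensor, contracted with the same inputs
T_full = np.einsum("ri,rj,rk,rl->ijkl", W_l, W_f, W_1, W_2)
s_full = float(np.einsum("kl,k,l->", T_full[2, 3], e1, e2))
```

Once the hidden vectors are available, each additional lexical part costs one more length-$r$ multiplication, so the combination step grows linearly in both the rank and the number of lexical parts.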
4.3 Pre-trained Word Embeddings
One of the computational and statistical bottlenecks in learning these lrfr models is the vocabulary size; the number of parameters to learn in each matrix $W_i$ scales linearly with $|V|$ and would require very large sets of labeled training data. To alleviate this problem, we use pre-trained continuous word embeddings [Mikolov et al., 2013] as input embeddings rather than the one-hot word encodings. We denote the $m$-dimensional word embeddings by $e_w$, so the transformation matrices $W_i$ for the lexical parts are of size $r_i \times m$, where $m \ll |V|$.
We note that when sufficiently large labeled data is available, our model allows for fine-tuning the pre-trained word embeddings to improve the expressive strength of the model, as is common with deep network models.
Remarks
Our lrfrs introduce embeddings for non-lexical properties and labels, making them better suited to two common settings in NLP: rich linguistic properties, and large label sets such as in open-domain tasks [Hoffmann et al., 2010]. The lrfr-cp better suits n-gram features, since when $n$ increases by 1, the only new parameters are the corresponding $W_i$. It is also very efficient during prediction ($O(nr)$), since the cost of the transformations can be ignored with the help of lookup tables and pre-computing.
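To make the scaling argument concrete, the sketch below counts parameters for the two forms as the number of lexical parts grows. The input sizes (32 labels, 264 non-lexical properties, 200-dimensional embeddings) are borrowed from the relation extraction setting for illustration; the ranks and the counting functions are assumptions made for this example, not values reported by the paper.

```python
def tucker_params(n_labels, n_props, emb_dim, r_label, r_prop, r_word, n_lex):
    """Core tensor plus one transformation matrix per view."""
    core = r_label * r_prop * r_word ** n_lex
    mats = r_label * n_labels + r_prop * n_props + n_lex * r_word * emb_dim
    return core + mats

def cp_params(n_labels, n_props, emb_dim, rank, n_lex):
    """One (rank x input size) matrix per view; there is no core tensor."""
    return rank * (n_labels + n_props + n_lex * emb_dim)

# Parameter counts for 1, 2 and 3 lexical parts (illustrative ranks)
tucker_counts = [tucker_params(32, 264, 200, 20, 20, 50, n) for n in (1, 2, 3)]
cp_counts = [cp_params(32, 264, 200, 200, n) for n in (1, 2, 3)]
```

In the Tucker form the core grows by a factor of the lexical dimension with every added lexical part, while in the CP form each added part contributes exactly one more rank-by-embedding matrix.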
5 Learning Representations for n-gram Lexical Features of Mixed Lengths
For features with $n$ lexical parts, we can train an lrfr model to obtain their representations. However, we often have features of varying $n$ (e.g. both unigrams ($n$=1) and bigrams ($n$=2) as in Figure 1). We require representations for features with arbitrary, different $n$ simultaneously.
We propose two solutions. The first is a straightforward solution based on our framework, which handles each $n$ with an $(n+2)$-way tensor. This strategy is commonly used in NLP, e.g. Taub-Tabib et al. (2015) use different kernel functions for dependency features of different orders. The second is an approximation method which aims to use a single tensor to handle all $n$s.
Multiple LowRank Tensors
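Suppose that we can divide the feature set $S(x, y)$ into subsets $S_1(x, y), S_2(x, y), \ldots, S_n(x, y)$ which correspond to features with one lexical part (unigram features), two lexical parts (bigram features), and so on up to $n$ lexical parts (n-gram features), respectively. To handle these types of features, we modify the training objective as follows:
$$ \min_{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_n} \; \sum_{(x,y) \in \mathcal{D}} \ell(x, y; \mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_n), \tag{8} $$
where the score of a training instance is defined as $s(x, y; \mathcal{T}_1, \ldots, \mathcal{T}_n) = \sum_{i=1}^{n} \sum_{f \in S_i(x,y)} s(f; \mathcal{T}_i)$. We use the Tucker form low-rank tensor for $\mathcal{T}_1$, and the CP form for $\mathcal{T}_i$ ($i > 1$). We refer to this method as lrfr-tucker & lrfr-cp.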
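The combined score can be sketched as routing each feature to the tensor that matches its number of lexical parts and summing the results. The example below is a simplification under stated assumptions: it uses toy sizes, omits the label view (as in a ranking setting), and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n_props, emb_dim = 4, 6                      # toy sizes
# Unigram tensor in Tucker form (property view x lexical view)
core = rng.normal(size=(2, 3))
U_f, U_w = rng.normal(size=(2, n_props)), rng.normal(size=(3, emb_dim))
# Bigram tensor in CP form with rank 5
r = 5
B_f = rng.normal(size=(r, n_props))
B_1, B_2 = rng.normal(size=(r, emb_dim)), rng.normal(size=(r, emb_dim))

def unigram_score(u, e):
    """Tucker-form score for property index u and one word embedding e."""
    return float(U_f[:, u] @ core @ (U_w @ e))

def bigram_score(u, e1, e2):
    """CP-form score for property index u and two word embeddings."""
    return float(np.sum(B_f[:, u] * (B_1 @ e1) * (B_2 @ e2)))

def instance_score(unigram_feats, bigram_feats):
    """Total score: unigram features use one tensor, bigram features the other."""
    return (sum(unigram_score(u, e) for u, e in unigram_feats)
            + sum(bigram_score(u, e1, e2) for u, e1, e2 in bigram_feats))

e = np.ones(emb_dim)
total = instance_score([(1, e)], [(2, e, e)])
```

Because the two tensors share no parameters in this sketch, the gradient of the loss with respect to each depends only on the features routed to it.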
Word Clusters
Alternatively, to handle different numbers of lexical parts, we replace some lexical parts with discrete word clusters. Let $c(w)$ denote the word cluster (e.g. from Brown clustering) for word $w$. For bigram features we have:
$$ s(y, u, w_1, w_2; \mathcal{T}) = s(y, u \wedge c(w_1), w_2; \mathcal{T}) + s(y, u \wedge c(w_2), w_1; \mathcal{T}), \tag{9} $$
where for each word we have introduced an additional set of non-lexical properties that are conjunctions of word clusters and the original non-lexical properties. This allows us to reduce an n-gram feature representation to a unigram representation. The advantage of this method is that it uses a single low-rank tensor to score features with different numbers of lexical parts. This is particularly helpful when we have very limited labeled data. We denote this method as lrfr-Brown, since we use Brown clusters in practice. In the experiments we use the Tucker form for lrfr-Brown.
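The reduction from a bigram feature to unigram features can be sketched as a simple rewriting step. The function name, the tuple encoding of the conjoined property, and the example cluster identifiers below are all hypothetical; the sketch assumes each bigram yields two unigram features, one per word, with the other word replaced by its cluster.

```python
def bigram_to_unigram_features(label, prop, w1, w2, cluster_of):
    """Rewrite one bigram feature as two unigram features.

    Each output keeps one word as the lexical part and folds the other
    word's cluster id into the non-lexical property.
    """
    return [
        (label, (prop, cluster_of[w1]), w2),
        (label, (prop, cluster_of[w2]), w1),
    ]

# Hypothetical cluster ids written as bit strings
cluster_of = {"ate": "0110", "spaghetti": "1011"}
feats = bigram_to_unigram_features("V-attach", "dist=1", "ate", "spaghetti", cluster_of)
```

The number of distinct non-lexical properties grows with the number of clusters, which is the price paid for scoring everything with a single tensor.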
6 Parameter Estimation
The goal of learning is to find a tensor $\mathcal{T}$ that solves problem (5). Note that this is a non-convex objective, so compared to the convex objective in a traditional log-linear model, we are trading better feature representations for a harder optimization problem. While stochastic gradient descent (SGD) is a natural choice for learning representations in large data settings, problem (5) involves rank constraints, which require an expensive proximal operation to enforce the constraints at each iteration of SGD. We seek a more efficient learning algorithm. Note that we fixed the size of each transformation matrix $W_i$ (of size $r_i \times d_i$) so that the smaller dimension ($r_i$) matches the upper bound on the rank. Therefore, the rank constraints are always satisfied throughout a run of SGD and we in essence have an unconstrained optimization problem. Note that in this way we do not guarantee orthogonality and full rank of the learned transformation matrices. These properties are assumed in general, but are not necessary according to [Kolda and Bader, 2009].
The gradients are computed via the chain rule. We use AdaGrad [Duchi et al., 2011] and apply L2 regularization on all $W_i$s and $g$, except for the case of $r_i = d_i$, where we start with $W_i = I$ and regularize with $\|W_i - I\|_2$. We use early stopping on a development set.
7 Experimental Settings
Table 1: Task summary, with the numbers on each view.
Task                        Benchmark               Dataset               #Labels  #Non-lexical Features
Relation Extraction         Yu et al. (2015)        ACE 2005              32       264
PP-attachment               Belinkov et al. (2014)  WSJ                   -        1,213 / 607
Preposition Disambiguation  Ritter et al. (2014)    Ritter et al. (2014)  6        9 / 3
Table 2: Feature templates.
(a) Relation extraction
Set          Template
HeadEmb      (head of )
Context      (left/right token of )
In-between
On-path
(b) Preposition disambiguation
Set            Template
Bag of Words   , ( is or )
Word-Position  , ,
Preposition    , , ,
(c) PP-attachment
Set           Template
Bag of Words  ( is or ),
Distance      Dis
Prep
POS
NextPOS
VerbNet
WordNet
We evaluate lrfr on three tasks: relation extraction, PP-attachment and preposition disambiguation (see Table 1 for a task summary). We include detailed feature templates in Table 2.
PP-attachment and relation extraction are two fundamental NLP tasks, and we test our models on the largest English data sets. The preposition disambiguation task was designed for compositional semantics, which is an important application of deep learning and distributed representations. On all these tasks, we compare to the state-of-the-art.
We use the same word embeddings as Belinkov et al. (2014) on PP-attachment for a fair comparison. For the other experiments, we use the same 200-dimensional word embeddings as Yu et al. (2015).
Relation Extraction
We use the English portion of the ACE 2005 relation extraction dataset [Walker et al., 2006]. Following Yu et al. (2015), we use both gold entity spans and types, train the model on the news domain and test on the broadcast conversation domain. To highlight the impact of training data size, we evaluate with all 43,518 relations (entity mention pairs) and with a reduced training set of the first 10,000 relations. We report precision, recall, and F1.
We compare to two baseline methods: 1) a log-linear model with a rich binary feature set from Sun et al. (2011) and Zhou et al. (2005), as described in Yu et al. (2015) (Baseline); 2) the embedding model (fcm) of Gormley et al. (2015), which uses rich linguistic features for relation extraction. We use the same feature templates and evaluate on fine-grained relations (sub-types, 32 labels) [Yu et al., 2015]. This will evaluate how well lrfr can utilize non-lexical linguistic features.
PP-attachment
We consider the prepositional phrase (PP) attachment task of Belinkov et al. (2014), where for each PP the correct head (verb or noun) must be selected from content words before the PP (within a 10-word window). (Footnote 3: http://groups.csail.mit.edu/rbg/code/pp) We formulate the task as a ranking problem, where we optimize the score of the correct head from a list of candidates with varying sizes.
PP-attachment suffers from data sparsity because of bi-lexical features, which we will model with the methods in §5. Belinkov et al. show that rich features (POS, WordNet and VerbNet) help this task. The combination of these features gives a large number of non-lexical properties, for which the embeddings of non-lexical properties in lrfr should be useful.
We extract a dev set from section 22 of the PTB following the description in Belinkov et al. (2014).
Preposition Disambiguation
We consider the preposition disambiguation task proposed by Ritter et al. (2014). The task is to determine the spatial relationship a preposition indicates based on the two objects connected by the preposition. For example, “the apple on the refrigerator” indicates the “Support by Horizontal Surface” relation, while “the apple on the branch” indicates the “Support from Above” relation. Since the meaning of a preposition depends on the combination of both its head and child word, we expect conjunctions between these word embeddings to help, i.e. features with two lexical parts.
We include three baselines: point-wise addition (SUM) [Mitchell and Lapata, 2010], concatenation [Ritter et al., 2014], and an SVM based on hand-crafted features in Table 2. Ritter et al. show that the first two methods beat other compositional models.
Hyperparameters
All hyperparameters are tuned on the dev set. The chosen values are learning rate and the weight of L2 regularizer for lrfr, except for the third lrfr in Table 3 which has . We select the rank of lrfr-tucker with a grid search from the following values: , and . For lrfr-cp, we select . For the PP-attachment task there is no label-view rank since it uses a ranking model. For preposition disambiguation we do not tune the label-view rank since the number of labels is small.
8 Results
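Table 3: Relation extraction results on the full training set (43,518 relations) and the reduced training set (10,000 relations), with prediction time in ms. The three setting columns correspond to the label, non-lexical and lexical views.
                                             Full Set          Reduced Set       Prediction
Method       Label  Non-lex.  Lexical       P     R     F1    P     R     F1    Time (ms)
Baseline     -      -         -             60.2  51.2  55.3  -     -     -     -
fcm          32/N   264/N     200/N         62.9  49.6  55.4  61.6  37.1  46.3  2,242
lrfr-tucker  32/N   20/Y      200/Y         62.1  52.7  57.0  51.5  40.8  45.5  3,076
lrfr-tucker  32/N   20/Y      200/N         63.5  51.1  56.6  52.8  40.1  45.6  2,972
lrfr-tucker  20/Y   20/Y      200/Y         62.4  51.0  56.1  52.1  41.2  46.0  2,538
lrfr-tucker  32/Y   20/Y      50/Y          57.4  52.4  54.8  49.7  46.1  47.8  1,198
lrfr-cp      200/Y (all views)              61.3  50.7  55.5  58.3  41.6  48.6  502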
Table 4: PP-attachment accuracy.
System                              Resources Used                                                   Acc
SVM [Belinkov et al., 2014]         distance, word, embedding, clusters, POS, WordNet, VerbNet       86.0
HPCD [Belinkov et al., 2014]        distance, embedding, POS, WordNet, VerbNet                       88.7
lrfr-tucker & lrfr-cp               distance, embedding, POS, WordNet, VerbNet                       90.3
lrfr-Brown                          distance, embedding, clusters, POS, WordNet, VerbNet             89.6
RBG [Lei et al., 2014]              dependency parser                                                88.4
Charniak-RS [McClosky et al., 2006] dependency parser + re-ranker                                    88.6
RBG + HPCD (combined model)         dependency parser + distance, embedding, POS, WordNet, VerbNet   90.1
Relation Extraction
The first three lrfr-tucker models improve over Baseline and fcm on the full training set (Table 3), making these the best reported numbers for this task; the fourth configuration (54.8 F1) falls slightly below both. However, lrfr-cp does not work as well on the features with only one lexical part. The Tucker form does a better job of capturing interactions between different views. In the limited training setting, we find that lrfr-cp does best.
Additionally, the primary advantage of the CP approximation is its reduction in the number of model parameters and running time. We report each model’s running time for a single pass on the development set. The lrfr-cp is by far the fastest. The first three lrfr-tucker models are slightly slower than fcm, because they work on dense non-lexical property embeddings while fcm benefits from sparse vectors.
PP-attachment
Table 4 shows that lrfr (89.6 and 90.3) improves over the previous best standalone system HPCD (88.7) by a large margin, with exactly the same resources. Belinkov et al. (2014) also reported results of parsers and parser re-rankers, which can access additional resources (complete parses for training and complete sentences as inputs), so it is unfair to compare them with standalone systems like HPCD and our lrfr. Nonetheless, lrfr-tucker & lrfr-cp (90.3) still outperforms the state-of-the-art parser RBG (88.4), the re-ranker Charniak-RS (88.6), and the combination of the state-of-the-art parser and compositional model RBG + HPCD (90.1). Thus, even with fewer resources, lrfr becomes the new best system.
Not shown in the table: we also tried lrfr-tucker & lrfr-cp with POS tag features only (89.7), and with grand-head-modifier conjunctions removed (89.3). Note that compared to lrfr, RBG benefits from binary features, which also exploit grand-head-modifier structures. Yet the above reduced models still work better than RBG (88.4) without using additional resources. (Footnote 4: Still, this is not a fair comparison since we have different training objectives. Using RBG’s factorization and training with our objective will give a fair comparison and we leave it to future work.) Moreover, the results of lrfr can still potentially be improved by combining with binary features. The above results show the advantage of our factorization method, which allows for utilizing pre-trained word embeddings, and thus can benefit from semi-supervised learning.
Preposition Disambiguation
lrfr improves (Table 5) over the best methods (SUM and Concatenation) in Ritter et al. (2014) as well as the SVM based on the original lexical features (85.1). In this task lrfr-Brown better represents the unigram and bigram lexical features, compared to the usage of two low-rank tensors (lrfr-tucker & lrfr-cp). This may be because lrfr-Brown has fewer parameters, which is better for smaller training sets.
We also include a control setting (lrfr-Brown - Control), which has a full-rank parameter tensor with the same inputs on each view as lrfr-Brown, but represented as one-hot vectors without transforming them to the hidden representations. This is equivalent to an SVM with the compound cluster features as in Koo et al. (2008). It performs much worse than lrfr-Brown, showing the advantage of using word embeddings and low-rank tensors.
Table 5: Preposition disambiguation accuracy.
Method                   Accuracy
SVM - Lexical Features   85.09
SUM                      80.55
Concatenation            86.73
lrfr-tucker & lrfr-cp    87.82
lrfr-Brown               88.18
lrfr-Brown - Control     84.18
Summary
For unigram lexical features, lrfr-tucker achieves better results than lrfr-cp. However, in settings with fewer training examples, features with more lexical parts (n-grams), or when faster predictions are advantageous, lrfr-cp does best as it has fewer parameters to estimate. For n-grams of variable length, lrfr-tucker & lrfr-cp does best. In settings with fewer training examples, lrfr-Brown does best as it has only one parameter tensor to estimate.
9 Related Work
Dimensionality Reduction for Complex Features is a standard technique to address high-dimensional features, including PCA, alternating structural optimization [Ando and Zhang, 2005], denoising autoencoders [Vincent et al., 2008], and feature embeddings [Yang and Eisenstein, 2015]. These methods treat features as atomic elements and ignore the inner structure of features, so they learn a separate embedding for each feature without shared parameters. As a result, they still suffer from large parameter spaces when the feature space is very large. (Footnote 5: For example, a state-of-the-art dependency parser [Zhang and McDonald, 2014] extracts about 10 million features; in this case, learning 100-dimensional feature embeddings involves estimating approximately a billion parameters.)
Another line of research studies the inner structures of lexical features: e.g. Koo et al. (2008), Turian et al. (2010), Sun et al. (2011), Nguyen and Grishman (2014), Roth and Woodsend (2014), and Hermann et al. (2014) used pre-trained word embeddings to replace the lexical parts of features; Srikumar and Manning (2014), Gormley et al. (2015) and Yu et al. (2015) propose splitting lexical features into different parts and employing tensors to perform classification. The above can therefore be seen as special cases of our model that only embed a certain part (view) of the complex features. This restriction also makes their model parameters form a full-rank tensor, resulting in data sparsity and high computational costs when the tensors are large.
Composition Models (Deep Learning) build representations for structures based on their component word embeddings [Collobert et al., 2011; Bordes et al., 2012; Socher et al., 2012; Socher et al., 2013b]. When using only word embeddings, these models achieved successes on several NLP tasks, but sometimes fail to learn useful syntactic or semantic patterns beyond the strength of combinations of word embeddings, such as the dependency relation in Figure 1(a). To tackle this problem, some work designed model structures according to a specific kind of linguistic pattern, e.g. dependency paths [Ma et al., 2015; Liu et al., 2015], while a recent trend enhances compositional models with linguistic features. For example, Belinkov et al. (2014) concatenate embeddings with linguistic features before feeding them to a neural network; Socher et al. (2013a) and Hermann and Blunsom (2013) enhanced Recursive Neural Networks by refining the transformation matrices with linguistic features (e.g. phrase types). These models are similar to ours in the sense of learning representations based on linguistic features and embeddings.
Low-rank Tensor Models for NLP aim to handle the conjunction among different views of features [Cao and Khudanpur, 2014; Lei et al., 2014; Chen and Manning, 2014]. Yu and Dredze (2015) proposed a model to compose phrase embeddings from words, which has a form equivalent to our CP-based method under certain restrictions. Our work applies a similar idea to exploiting the inner structure of complex features, and can handle n-gram features with different $n$s. Our factorization (§3) is general and easy to adapt to new tasks. More importantly, it makes the model benefit from pre-trained word embeddings, as shown by the PP-attachment results.
10 Conclusion
We have presented lrfr, a feature representation model that exploits the inner structure of complex lexical features and applies a low-rank tensor to efficiently score features with this representation. lrfr attains the state-of-the-art on several tasks, including relation extraction, PP-attachment, and preposition disambiguation. We make our implementation available for general use. (Footnote 6: https://github.com/Gorov/LowRankFCM)
Acknowledgements
A major portion of this work was done when MY was visiting MD and RA at JHU. This research was supported in part by NSF grant IIS-1546482.
References

[Ando and Zhang2005]
Rie Kubota Ando and Tong Zhang.
2005.
A framework for learning predictive structures from multiple tasks
and unlabeled data.
The Journal of Machine Learning Research
, 6.  [Belinkov et al.2014] Yonatan Belinkov, Tao Lei, Regina Barzilay, and Amir Globerson. 2014. Exploring compositional architectures and word vector representations for prepositional phrase attachment. Transactions of the Association for Computational Linguistics, 2.
 [Bengio et al.2006] Yoshua Bengio, Holger Schwenk, JeanSébastien Senécal, Fréderic Morin, and JeanLuc Gauvain. 2006. Neural probabilistic language models. In Innovations in Machine Learning. Springer.
 [Bordes et al.2012] Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2012. A semantic matching energy function for learning with multirelational data. Machine Learning.
 [Cao and Khudanpur2014] Yuan Cao and Sanjeev Khudanpur. 2014. Online learning in tensor space. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
 [Chen and Manning2014] Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP.
 [Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. JMLR, 12.
 [Duchi et al.2011] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12.
 [Gormley et al.2015] Matthew R. Gormley, Mo Yu, and Mark Dredze. 2015. Improved relation extraction with featurerich compositional embedding models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
 [Hermann and Blunsom2013] Karl Moritz Hermann and Phil Blunsom. 2013. The role of syntax in vector space models of compositional semantics. In Association for Computational Linguistics.
 [Hermann et al.2014] Karl Moritz Hermann, Dipanjan Das, Jason Weston, and Kuzman Ganchev. 2014. Semantic frame identification with distributed word representations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
 [Hoffmann et al.2010] Raphael Hoffmann, Congle Zhang, and Daniel S. Weld. 2010. Learning 5000 relational extractors. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.
 [Kolda and Bader2009] Tamara G Kolda and Brett W Bader. 2009. Tensor decompositions and applications. SIAM review, 51(3).
[Koo et al.2008] Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proceedings of ACL.
[Lei et al.2014] Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[Lei et al.2015] Tao Lei, Yuan Zhang, Lluís Màrquez, Alessandro Moschitti, and Regina Barzilay. 2015. High-order low-rank tensors for semantic role labeling. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
[Liu et al.2015] Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng Wang. 2015. A dependency-based neural network for relation classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers).
[Ma et al.2015] Mingbo Ma, Liang Huang, Bowen Zhou, and Bing Xiang. 2015. Dependency-based convolutional neural networks for sentence embedding. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers).
[McClosky et al.2006] David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the main conference on human language technology conference of the North American Chapter of the Association of Computational Linguistics.
 [Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
[Miller et al.2004] Scott Miller, Jethran Guinness, and Alex Zamanian. 2004. Name tagging with word clusters and discriminative training. In Proceedings of HLT-NAACL.
 [Mitchell and Lapata2010] Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive science, 34(8).
 [Nguyen and Grishman2014] Thien Huu Nguyen and Ralph Grishman. 2014. Employing word representations and regularization for domain adaptation of relation extraction. In Association for Computational Linguistics (ACL).
 [Ritter et al.2014] Samuel Ritter, Cotie Long, Denis Paperno, Marco Baroni, Matthew Botvinick, and Adele Goldberg. 2014. Leveraging preposition ambiguity to assess representation of semantic interaction in cdsm. In NIPS Workshop on Learning Semantics.
 [Roth and Woodsend2014] Michael Roth and Kristian Woodsend. 2014. Composition of word representations improves semantic role labelling. In Proceedings of EMNLP.
[Socher et al.2012] Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of EMNLP-CoNLL 2012.
 [Socher et al.2013a] Richard Socher, John Bauer, Christopher D Manning, and Andrew Y Ng. 2013a. Parsing with compositional vector grammars. In Proceedings of ACL.
 [Socher et al.2013b] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.
 [Srikumar and Manning2014] Vivek Srikumar and Christopher D Manning. 2014. Learning distributed representations for structured output prediction. In Advances in Neural Information Processing Systems.
[Sun et al.2011] Ang Sun, Ralph Grishman, and Satoshi Sekine. 2011. Semi-supervised relation extraction with large-scale word clustering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
[TaubTabib et al.2015] Hillel Taub-Tabib, Yoav Goldberg, and Amir Globerson. 2015. Template kernels for dependency parsing. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
[Turian et al.2010] Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Association for Computational Linguistics.
[Vincent et al.2008] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning.
 [Walker et al.2006] Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. 2006. ACE 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia.
[Yang and Eisenstein2015] Yi Yang and Jacob Eisenstein. 2015. Unsupervised multi-domain adaptation with feature embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 672–682, Denver, Colorado, May–June. Association for Computational Linguistics.
 [Yu and Dredze2015] Mo Yu and Mark Dredze. 2015. Learning composition models for phrase embeddings. Transactions of the Association for Computational Linguistics, 3.
[Yu et al.2015] Mo Yu, Matthew R. Gormley, and Mark Dredze. 2015. Combining word embeddings and feature embeddings for fine-grained relation extraction. In North American Chapter of the Association for Computational Linguistics (NAACL).
[Zhang and McDonald2014] Hao Zhang and Ryan McDonald. 2014. Enforcing structural diversity in cube-pruned dependency parsing. In Proceedings of ACL.
 [Zhou et al.2005] GuoDong Zhou, Jian Su, Jie Zhang, and Min Zhang. 2005. Exploring various knowledge in relation extraction. In Proceedings of ACL.