1 Introduction
A major goal of semantic processing is to map natural language utterances to representations that facilitate calculation of meanings, execution of commands, and/or inference of knowledge. Formal semantics supports such representations by defining words as functional units and combining them via a specific logic. A simple and illustrative example is Dependency-based Compositional Semantics (DCS) [Liang et al., 2013]. DCS composes meanings from denotations of words (i.e. sets of things to which the words apply). Say the denotations of the concept drug and the event ban are as shown in Figure 1b, where drug is a list of drug names and ban is a list of the subject-complement pairs in any ban event; then a list of banned drugs can be constructed by first taking the COMP column of all records in ban (projection “$\pi_{\mathrm{COMP}}$”), and then intersecting the results with drug (intersection “$\cap$”). This procedure defines how words can be combined to form a meaning. Better yet, the procedure can be concisely illustrated by the DCS tree of “banned drugs” (Figure 1a), which is similar to a dependency tree but possesses precise procedural and logical meaning (Section 2). DCS has been shown to be useful in question answering [Liang et al., 2013] and textual entailment recognition [Tian et al., 2014].
Orthogonal to the formal semantics of DCS, distributional vector representations are useful in capturing lexical semantics of words [Turney and Pantel, 2010; Levy et al., 2015], and progress has been made in combining word vectors to form meanings of phrases and sentences [Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010; Grefenstette and Sadrzadeh, 2011; Socher et al., 2012; Paperno et al., 2014; Hashimoto et al., 2014]. However, less effort has been devoted to finding a link between vector-based compositions and the composition operations in any formal semantics. We believe that if a link can be found, then symbolic formulas in the formal semantics will be realized by vectors composed from word embeddings, such that similar things are realized by similar vectors; meanwhile, vectors will acquire formal meanings that can be used directly in execution or inference. Still, finding a link is challenging because any vector composition that realizes such a link must conform to the logic of the formal semantics.
In this paper, we establish a link between DCS and certain vector compositions, achieving a vector-based DCS by replacing denotations of words with word vectors, and realizing the composition operations of intersection and projection as addition and linear mapping, respectively. For example, to construct a vector for “banned drugs”, one takes the word vector $v_{\mathrm{ban}}$ and multiplies it by a matrix $M_{\mathrm{COMP}}$, corresponding to the projection $\pi_{\mathrm{COMP}}$; then, one adds the result to the word vector $v_{\mathrm{drug}}$ to realize the intersection operation (Figure 1c). We provide a method to train the word vectors and linear mappings (i.e. matrices) jointly from unlabeled corpora.
The rationale for our model is as follows. First, recent research has shown that additive composition of word vectors approximates the situation where two words have overlapping context [Tian et al., 2015]; therefore, it is suitable for implementing an “and” or intersection operation (Section 3). We design our model such that the resulting distributional representations are expected to have additive compositionality. Second, when intersection is realized as addition, it is natural to implement projection as linear mapping, as suggested by the logical interactions between the two operations (Section 3). Experimentally, we show that vectors and matrices learned by our model exhibit favorable characteristics as compared with vectors trained by GloVe [Pennington et al., 2014] or those learned from syntactic dependencies (Section 5.1). Finally, additive composition gives our model a strong ability to calculate similar vectors for similar phrases, whereas syntactic-semantic roles (e.g. SUBJ, COMP) can be distinguished by different projection matrices (e.g. $M_{\mathrm{SUBJ}}$, $M_{\mathrm{COMP}}$). We achieve near state-of-the-art performance on a wide range of phrase similarity tasks (Section 5.2) and relation classification (Section 5.3).
Furthermore, we show that a vector as constructed above for “banned drugs” can be used as a query vector to retrieve a coarse-grained candidate list of banned drugs, by sorting its dot products with answer vectors that are also learned by our model (Figure 1d). This is due to the ability of our approach to provide a language model that can find likely words to fill in blanks such as “___ is a banned drug” or “the drug ___ is banned by …”. A highlight is that the calculation is done as if a query were “executed” by the DCS tree of “banned drugs”. We quantitatively evaluate this utility on the sentence completion task [Zweig et al., 2012] and report a new state-of-the-art result (Section 5.4).
2 DCS Trees
DCS composes meanings from denotations, or sets of things to which words apply. A “thing” (i.e. an element of a denotation) is represented by a tuple of features of the form Field=Value, with a fixed inventory of fields. For example, a denotation ban might be a set of tuples $\mathrm{ban} = \{(\mathrm{SUBJ}{=}\mathrm{Canada}, \mathrm{COMP}{=}\mathrm{Thalidomide}), \ldots\}$, in which each tuple records the participants of a banning event (e.g. Canada banning Thalidomide).
Operations are applied to sets of things to generate new denotations, for modeling semantic composition. An example is the intersection of pet and fish giving the denotation of “pet fish”. Another necessary operation is projection; by $\pi_N$ we mean a function mapping a tuple to its value of the field N. For example, $\pi_{\mathrm{COMP}}(\mathrm{ban})$ is the value set of the COMP fields in ban, which consists of banned objects (i.e. $\{\mathrm{Thalidomide}, \ldots\}$). In this paper, we assume a field ARG to be names of things representing themselves; hence, for example, $\pi_{\mathrm{ARG}}(\mathrm{drug})$ is the set of names of drugs.
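For a value set $V$, we also consider the inverse image $\pi_N^{-1}(V) := \{x \mid \pi_N(x) \in V\}$. For example,
$$D_1 := \pi_{\mathrm{SUBJ}}^{-1}(\pi_{\mathrm{ARG}}(\mathrm{man}))$$
consists of all tuples of the form $(\mathrm{SUBJ}{=}x, \ldots)$, where $x$ is a man’s name (i.e. $x \in \pi_{\mathrm{ARG}}(\mathrm{man})$). Thus, $\mathrm{sell} \cap D_1$ denotes men’s selling events (as in Figure 2). Similarly, the denotation of “banned drugs” as in Figure 1b is formally written as
$$D_2 := \mathrm{drug} \cap \pi_{\mathrm{ARG}}^{-1}(\pi_{\mathrm{COMP}}(\mathrm{ban})).$$
Hence the following denotation
$$D_3 := \mathrm{sell} \cap D_1 \cap \pi_{\mathrm{COMP}}^{-1}(\pi_{\mathrm{ARG}}(D_2))$$
consists of selling events such that the SUBJ is a man and the COMP is a banned drug.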
The calculation above can proceed in a recursive manner controlled by DCS trees. The DCS tree for the sentence “a man sells banned drugs” is shown in Figure 2. Formally, a DCS tree is defined as a rooted tree in which nodes are denotations of content words and edges are labeled by fields at each end. Assume a node x has children $y_1, \ldots, y_n$, and the edges $(x, y_1), \ldots, (x, y_n)$ are labeled by $(P_1, L_1), \ldots, (P_n, L_n)$, respectively. Then, the denotation $[\![x]\!]$ of the subtree rooted at x is recursively calculated as
$$[\![x]\!] := x \cap \bigcap_{i=1}^{n} \pi_{P_i}^{-1}\bigl(\pi_{L_i}([\![y_i]\!])\bigr). \tag{1}$$
As a result, the denotation of the DCS tree in Figure 2 is the denotation of “a man sells banned drugs” as calculated above. DCS can be further extended to handle phenomena such as quantifiers or superlatives [Liang et al., 2013; Tian et al., 2014]. In this paper, we focus on the basic version, but note that it is already expressive enough to at least partially capture the meanings of a large portion of phrases and sentences.
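The recursive evaluation of a DCS tree can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy database (John, Mike, Acme, Aspirin), the helper names `thing`, `project`, `restrict` and `denote`, and the nested-tuple tree encoding are all hypothetical. `restrict(X, N, V)` computes the intersection of a denotation X with the inverse image of the value set V under the projection onto field N, which is the only way inverse images are used in the recursion.

```python
def thing(**fields):
    """A 'thing' is an immutable set of (field, value) pairs."""
    return frozenset(fields.items())

def project(denotation, field):
    """Projection: the set of values that `field` takes in the denotation."""
    return {value for t in denotation for (f, value) in t if f == field}

def restrict(denotation, field, values):
    """Intersect the denotation with the inverse image of `values` under `field`."""
    return {t for t in denotation
            if any(f == field and v in values for (f, v) in t)}

def denote(tree, lexicon):
    """Evaluate a DCS tree encoded as (word, [(parent_field, child_field, subtree), ...])."""
    word, children = tree
    result = set(lexicon[word])
    for parent_field, child_field, child in children:
        child_values = project(denote(child, lexicon), child_field)
        result = restrict(result, parent_field, child_values)
    return result

# Hypothetical toy database of denotations.
lexicon = {
    "man": {thing(ARG="John"), thing(ARG="Mike")},
    "drug": {thing(ARG="Aspirin"), thing(ARG="Thalidomide")},
    "ban": {thing(SUBJ="Canada", COMP="Thalidomide")},
    "sell": {thing(SUBJ="John", COMP="Aspirin"),
             thing(SUBJ="Mike", COMP="Thalidomide"),
             thing(SUBJ="Acme", COMP="Thalidomide")},
}

# "banned drugs": drug is the root; the edge to ban is labeled (ARG, COMP).
banned_drugs = ("drug", [("ARG", "COMP", ("ban", []))])
# "a man sells banned drugs"
sentence = ("sell", [("SUBJ", "ARG", ("man", [])),
                     ("COMP", "ARG", banned_drugs)])

print(denote(banned_drugs, lexicon))
print(denote(sentence, lexicon))
```

In this toy database only Mike's sale of Thalidomide survives both restrictions, since John sells a drug that is not banned and Acme is not in the man denotation.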
DCS trees can be learned from question-answer pairs and a given database of denotations [Liang et al., 2013], or they can be extracted from dependency trees if no database is specified, by taking advantage of the observation that DCS trees are similar to dependency trees [Tian et al., 2014]. We use the latter approach, obtaining DCS trees by rule-based conversion from universal dependency (UD) trees [McDonald et al., 2013]. Therefore, nodes in a DCS tree are content words in a UD tree, which are in the form of lemma-POS pairs (Figure 3). The inventory of fields is designed to be ARG, SUBJ, COMP, and all prepositions. Unlike content words, which denote sets of things, prepositions act as relations, which we treat similarly to SUBJ and COMP. For example, a prepositional phrase attached to a verb (e.g. play on the grass) is treated as in Figure 3a. The presence of two field labels on each edge of a DCS tree makes it convenient for modeling semantics in several cases, such as a relative clause (Figure 3b).
3 Vector-based DCS
For any content word w, we use a query vector $v_w$ to model its denotation, and an answer vector $u_w$ to model a prototypical element in that denotation. Query vector $v$ and answer vector $u$ are learned such that $\exp(v \cdot u)$ is proportional to the probability of $u$ answering the query $v$. The learning source is a collection of DCS trees, based on the idea that the DCS tree of a declarative sentence usually has a non-empty denotation. For example, “kids play” means there exists some kid who plays. Consequently, some element in the play denotation belongs to $\pi_{\mathrm{SUBJ}}^{-1}(\pi_{\mathrm{ARG}}(\mathrm{kid}))$, and some element in the kid denotation belongs to $\pi_{\mathrm{ARG}}^{-1}(\pi_{\mathrm{SUBJ}}(\mathrm{play}))$. This is a signal to increase the dot product of $u_{\mathrm{play}}$ and the query vector of $\pi_{\mathrm{SUBJ}}^{-1}(\pi_{\mathrm{ARG}}(\mathrm{kid}))$, as well as the dot product of $u_{\mathrm{kid}}$ and the query vector of $\pi_{\mathrm{ARG}}^{-1}(\pi_{\mathrm{SUBJ}}(\mathrm{play}))$. When optimized on a large corpus, the “typical” elements of play and kid should be learned by $u_{\mathrm{play}}$ and $u_{\mathrm{kid}}$, respectively. In general, one has:
Theorem 1
Assume the denotation of a DCS tree is not empty. Given any path from node x to y, assume edges along the path are labeled by $(P, L), \ldots, (K, N)$. Then, an element in the denotation y belongs to $\pi_N^{-1}(\pi_K(\cdots(\pi_L^{-1}(\pi_P(x)))\cdots))$.
Therefore, for any two nodes in a DCS tree, the path from one to another forms a training example, which signals increasing the dot product of the corresponding query and answer vectors.
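To make the notion of a training example concrete, the following sketch enumerates, for every ordered pair of nodes in a small DCS tree, the sequence of field labels along the path between them. The edge-list encoding, the function name `training_paths`, and the assumption that each word occurs only once in the tree are choices made for this illustration, not part of the original formulation.

```python
from collections import defaultdict

def training_paths(edges):
    """Enumerate (start, labels, end) for every ordered pair of distinct nodes.

    `edges` holds (parent, child, parent_field, child_field) tuples. Each label
    on a path is (field at the near end, field at the far end) of one edge.
    Node names are assumed to be unique within the tree.
    """
    adj = defaultdict(list)
    for parent, child, parent_field, child_field in edges:
        adj[parent].append((child, (parent_field, child_field)))
        adj[child].append((parent, (child_field, parent_field)))

    paths = []
    for start in list(adj):
        # Depth-first walk; a tree has no cycles, so skipping the previous
        # node is enough to avoid revisiting.
        stack = [(start, None, [])]
        while stack:
            node, prev, labels = stack.pop()
            if labels:
                paths.append((start, tuple(labels), node))
            for nxt, label in adj[node]:
                if nxt != prev:
                    stack.append((nxt, node, labels + [label]))
    return paths

# "kids play balls": play is the root with a SUBJ child and a COMP child.
edges = [("play", "kid", "SUBJ", "ARG"), ("play", "ball", "COMP", "ARG")]
for start, labels, end in training_paths(edges):
    print(start, labels, end)
```

Each emitted triple corresponds to one query (the start word together with the label sequence) and one answer (the end word).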
It is noteworthy that the above formalization happens to be closely related to the skip-gram model [Mikolov et al., 2013b]. The skip-gram model learns a target vector $v_w$ and a context vector $u_w$ for each word w. It assumes the probability of a word y co-occurring with a word x in a context window is proportional to $\exp(v_x \cdot u_y)$. Hence, if x and y co-occur within a context window, then one gets a signal to increase $v_x \cdot u_y$. If the context window is taken to be the same DCS tree, then the learning of skip-gram and vector-based DCS will be almost the same, except that the target vector $v_x$ becomes the query vector $v$, which is no longer assigned to the word x but to the path from x to y in the DCS tree (e.g. the query vector for $\pi_{\mathrm{SUBJ}}^{-1}(\pi_{\mathrm{ARG}}(\mathrm{kid}))$ instead of $v_{\mathrm{kid}}$). Therefore, our model can also be regarded as extending skip-gram to take account of the changes of meaning caused by different syntactic-semantic roles.
Additive Composition
Word vectors trained by skip-gram are known to be semantically additive, as exhibited in word analogy tasks. The effect of adding up two skip-gram vectors is further analyzed in Tian et al. (2015). Namely, the target vector $v_w$ can be regarded as encoding the distribution of context words surrounding w. If another word x is given, $v_w$ can be decomposed into two parts: one encodes context words shared with x, and the other encodes context words not shared. When $v_w$ and $v_x$ are added up, the non-shared parts tend to cancel out, because non-shared parts have nearly independent distributions. As a result, the shared part gets reinforced. An error bound is derived to estimate how close $\tfrac{1}{2}(v_w + v_x)$ gets to the distribution of the shared part. We can see that the same mechanism exists in vector-based DCS. In a DCS tree, two paths share a context word if they lead to the same node y; semantically, this means some element in the denotation y belongs to both denotations of the two paths (e.g. given the sentence “kids play balls”, $\pi_{\mathrm{SUBJ}}^{-1}(\pi_{\mathrm{ARG}}(\mathrm{kid}))$ and $\pi_{\mathrm{COMP}}^{-1}(\pi_{\mathrm{ARG}}(\mathrm{ball}))$ both contain a playing event whose SUBJ is a kid and COMP is a ball). Therefore, addition of the query vectors of two paths approximates their intersection, because the shared context y gets reinforced.
Projection
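Generally, for any two denotations $X_1, X_2$ and any projection $\pi_N$, we have
$$\pi_N(X_1 \cap X_2) \subseteq \pi_N(X_1) \cap \pi_N(X_2). \tag{2}$$
And the “$\subseteq$” can often become “$=$”, for example when $\pi_N$ is a one-to-one map or $X_1 = \pi_N^{-1}(V)$ for some value set $V$. Therefore, if intersection is realized by addition, it will be natural to realize projection by linear mapping because
$$(v_1 + v_2) M_N = v_1 M_N + v_2 M_N \tag{3}$$
holds for any vectors $v_1, v_2$ and any matrix $M_N$, which is parallel to (2). If $\pi_N$ is realized by a matrix $M_N$, then $\pi_N^{-1}$ should correspond to the inverse matrix $M_N^{-1}$, because $\pi_N(\pi_N^{-1}(V)) = V$ for any value set $V$. So we have realized all composition operations in DCS.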
Query vector of a DCS tree
Now, we can define the query vector of a DCS tree in parallel to (1):
$$v([\![x]\!]) := v_x + \frac{1}{n} \sum_{i=1}^{n} v([\![y_i]\!]) \, M_{L_i} M_{P_i}^{-1}. \tag{4}$$
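The composition of a query vector from a DCS tree can be sketched as below, using row vectors multiplied on the right by matrices. Everything concrete here is an assumption for illustration: the 2-dimensional toy vectors, the rotation matrix standing in for a learned COMP matrix, the helper names, and the averaging of the children's contributions. Each child's vector is first mapped by the matrix of the field on the child's end of the edge, then by the inverse matrix of the field on the parent's end, and the result is added to the parent's word vector.

```python
def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]

def query_vector(tree, word_vec, M, M_inv):
    """Compose a query vector for a tree (word, [(parent_field, child_field, subtree), ...])."""
    word, children = tree
    q = list(word_vec[word])
    if not children:
        return q
    total = [0.0] * len(q)
    for parent_field, child_field, child in children:
        c = query_vector(child, word_vec, M, M_inv)
        # Project out of the child's field, then into the parent's field.
        c = vec_mat(vec_mat(c, M[child_field]), M_inv[parent_field])
        total = add(total, c)
    # Averaging over children is an assumption of this sketch.
    return add(q, [t / len(children) for t in total])

# Hypothetical 2-d parameters: ARG is the identity, COMP a 90-degree rotation.
identity = [[1.0, 0.0], [0.0, 1.0]]
M = {"ARG": identity, "COMP": [[0.0, 1.0], [-1.0, 0.0]]}
M_inv = {"ARG": identity, "COMP": [[0.0, -1.0], [1.0, 0.0]]}
word_vec = {"fight": [1.0, 0.0], "war": [0.0, 1.0]}

# "fight war": fight is the root, war its child; the edge is labeled (COMP, ARG).
fight_war = ("fight", [("COMP", "ARG", ("war", []))])
print(query_vector(fight_war, word_vec, M, M_inv))
```

In this toy setting the inverse COMP matrix rotates the vector of war onto the direction of fight, so the two contributions reinforce each other in the sum.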
4 Training
As described in Section 3, vector-based DCS assigns a query vector $v_w$ and an answer vector $u_w$ to each content word w. And for each field N, it assigns two matrices $M_N$ and $M_N^{-1}$. For any path from node x to y sampled from a DCS tree, assume the edges along the path are labeled by $(P, L), \ldots, (K, N)$. Then, the dot product $v_x M_P M_L^{-1} \cdots M_K M_N^{-1} \cdot u_y$ gets a signal to increase.
Formally, we adopt noise-contrastive estimation [Gutmann and Hyvärinen, 2012] as used in the skip-gram model, and mix the paths sampled from DCS trees with artificially generated noise. Then, $\sigma(v_x M_P M_L^{-1} \cdots M_K M_N^{-1} \cdot u_y)$ models the probability of a training example coming from DCS trees, where $\sigma(\theta) = 1 / (1 + \exp(-\theta))$ is the sigmoid function. The vectors and matrices are trained by maximizing the log-likelihood of the mixed data. We use stochastic gradient descent [Bottou, 2012] for training. Some important settings are discussed below.
Noise
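For any $v_x M_1 M_2^{-1} \cdots M_{2l-1} M_{2l}^{-1} u_y$ obtained from a path of a DCS tree, we generate noise by randomly choosing an index $i \in [2, 2l]$, and then replacing $M_j$ or $M_j^{-1}$ ($\forall j \ge i$) and $u_y$ by $M_{N(j)}$ or $M_{N(j)}^{-1}$ and $u_z$, respectively, where $N(j)$ and z are independently drawn from the marginal (i.e. unigram) distributions of fields and words.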
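The objective and the noise generation can be sketched as follows. This is a minimal illustration, not the training code: the function names, the toy unigram counts, and the choice to draw the cut index uniformly from positions 2 up to the path length are assumptions. A data point contributes the log of the sigmoid of its score, and each noise sample contributes the log of one minus the sigmoid of its score.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_likelihood(data_score, noise_scores):
    """Log-likelihood of one true path score mixed with its noise scores."""
    ll = math.log(sigmoid(data_score))
    for s in noise_scores:
        ll += math.log(sigmoid(-s))  # equals log(1 - sigmoid(s))
    return ll

def make_noise(fields, field_counts, word_counts, rng):
    """Replace a suffix of the field sequence and the answer word with noise.

    `fields` is the flat list of field names along a path (two per edge).
    The cut index is 1-based; its uniform range is an assumption of this sketch.
    """
    cut = rng.randint(2, len(fields))
    n_new = len(fields) - cut + 1
    names, weights = zip(*field_counts.items())
    new_fields = rng.choices(names, weights=weights, k=n_new)
    words, word_weights = zip(*word_counts.items())
    noise_word = rng.choices(words, weights=word_weights, k=1)[0]
    return fields[:cut - 1] + new_fields, noise_word

# Hypothetical unigram counts of fields and words.
field_counts = {"ARG": 50, "SUBJ": 30, "COMP": 20}
word_counts = {"kid": 5, "play": 3, "ball": 2}

rng = random.Random(0)
noisy_fields, noise_word = make_noise(["ARG", "SUBJ", "COMP", "ARG"],
                                      field_counts, word_counts, rng)
print(noisy_fields, noise_word)
print(log_likelihood(2.0, [-1.0, 0.5]))
```

The prefix of the field sequence before the cut is kept intact, which mirrors the idea that only the later part of the path and the answer word are replaced by noise.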
Update
For each data point, when $i$ is the chosen index above for generating noise, we view indices $j < i$ as the “target” part, and $j \ge i$ as the “context”, which is completely replaced by the noise, in analogy to the skip-gram model. Then, at each step we only update one vector and one matrix from each of the target, context, and noise parts; more specifically, we only update $v_x$, $M_{i-1}$ or $M_{i-1}^{-1}$, $M_i$ or $M_i^{-1}$, $M_{N(i)}$ or $M_{N(i)}^{-1}$, $u_y$ and $u_z$ at the step. This is much faster than always updating all matrices.
Initialization
Matrices are initialized as $\tfrac{1}{2}(I + G)$, where $I$ is the identity matrix; and $G$ and all vectors are initialized with i.i.d. Gaussians of variance $1/d$, where $d$ is the vector dimension. We find that the diagonal component $I$ is necessary to bring information from $v_x$ to $u_y$, whereas the randomness of $G$ makes convergence faster. $M_N^{-1}$ is initialized as the transpose of $M_N$.
Learning Rate
We find that the initial learning rate for vectors can be set to . But for matrices, it should be less than otherwise the model diverges. For stable training, we rescale gradients when their norms exceed a threshold.
Regularizer
During training, $M_N$ and $M_N^{-1}$ are treated as independent matrices. However, we use the regularizer $\gamma \lVert M_N^{-1} M_N - \tfrac{1}{d} \mathrm{tr}(M_N^{-1} M_N) I \rVert^2$ to drive $M_N^{-1}$ close to the inverse of $M_N$. (The problem with the naive regularizer $\lVert M_N^{-1} M_N - I \rVert^2$ is that, when the scale of $M_N$ grows larger, it will drive $M_N^{-1}$ smaller, which may lead to degeneration; so we scale $I$ according to the trace of $M_N^{-1} M_N$.) We also use $\kappa \lVert M_N^{\top} M_N - \tfrac{1}{d} \mathrm{tr}(M_N^{\top} M_N) I \rVert^2$ to prevent $M_N$ from having too different scales in different directions (i.e., to drive $M_N$ close to orthogonal). We set and . Despite the rather weak regularizer, we find that $M_N^{-1}$ can be learned to be exactly the inverse of $M_N$, and $M_N$ can actually be an orthogonal matrix, showing some semantic regularity (Section 5.1).

Table 1
| GloVe | no matrix | vecDCS | vecUD |
|---|---|---|---|
| books | essay/N | novel/N | essay/N |
| author | novel/N | essay/N | novel/N |
| published | memoir/N | anthology/N | article/N |
| novel | books/N | publication/N | anthology/N |
| memoir | autobiography/N | memoir/N | poem/N |
| wrote | nonfiction/J | poem/N | autobiography/N |
| biography | reprint/V | autobiography/N | publication/N |
| autobiography | publish/V | story/N | journal/N |
| essay | republish/V | pamphlet/N | memoir/N |
| illustrated | chapbook/N | tale/N | pamphlet/N |
5 Experiments
For training vector-based DCS, we use Wikipedia Extractor (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) to extract texts from the 2015-12-01 dump of English Wikipedia (https://dumps.wikimedia.org/enwiki/). Then, we use Stanford Parser (http://nlp.stanford.edu/software/lexparser.shtml) [Klein and Manning, 2003] to parse all sentences and convert the UD trees into DCS trees by handwritten rules. We assign a weight to each path of the DCS trees as follows.
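Table 2
| house as subject | house as complement | “___ in house” |
|---|---|---|
| victorian/J | build/V | sit/V |
| stand/V | rent/V | house/N |
| vacant/J | leave/V | stand/V |
| 18th-century/J | burn down/V | live/V |
| historic/J | remodel/V | hang/V |
| old/J | demolish/V | seat/N |
| georgian/J | restore/V | stay/V |
| local/J | renovate/V | serve/V |
| 19th-century/J | rebuild/V | reside/V |
| tenement/J | construct/V | hold/V |

| subject of learn | complement of learn | “learn about ___” |
|---|---|---|
| teacher/N | skill/N | otherness/N |
| skill/N | lesson/N | intimacy/N |
| he/P | technique/N | femininity/N |
| she/P | experience/N | self-awareness/N |
| therapist/N | ability/N | life/N |
| student/N | something/N | self-expression/N |
| they/P | knowledge/N | sadomasochism/N |
| mother/N | language/N | emptiness/N |
| lesson/N | opportunity/N | criminality/N |
| father/N | instruction/N | masculinity/N |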
For any path $p$ passing through $k$ intermediate nodes of degrees $n_1, \ldots, n_k$, respectively, we set
$$\mathrm{Weight}(p) := \prod_{i=1}^{k} \frac{1}{n_i - 1}. \tag{5}$$
Note that $n_i \ge 2$ because there is a path $p$ passing through the node; and $\mathrm{Weight}(p) = 1$ if $p$ consists of a single edge. Equation (5) is intended to down-weight long paths which pass through several high-valency nodes. We use a random walk algorithm to sample paths such that the expected number of times a path is sampled equals its weight. As a result, the sampled path lengths range from to , average , with an exponential tail. We convert all words which are sampled fewer than 1000 times to *UNKNOWN*/POS, and all prepositions occurring fewer than 10000 times to an *UNKNOWN* field. As a result, we obtain a vocabulary of 109k words and 211 field names.
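The path weighting can be illustrated with a short sketch. It assumes that the weight of a path is the product, over its intermediate nodes, of one divided by the node's degree minus one; this form is inferred from the surrounding description (a single-edge path has weight 1, and every intermediate node has degree at least 2), and the function name is hypothetical. Exact fractions are used so that the values can be read off directly.

```python
from fractions import Fraction

def path_weight(intermediate_degrees):
    """Weight of a path given the degrees of the nodes it passes through.

    Assumed form: the product of 1 / (degree - 1) over intermediate nodes.
    An empty list (a single-edge path) gives weight 1.
    """
    weight = Fraction(1)
    for degree in intermediate_degrees:
        if degree < 2:
            raise ValueError("an intermediate node must have degree >= 2")
        weight *= Fraction(1, degree - 1)
    return weight

print(path_weight([]))         # single edge
print(path_weight([2]))        # through a node on a chain
print(path_weight([4, 3]))     # through two higher-valency nodes
```

Passing through a node of degree 2 leaves the weight unchanged, while each high-valency node divides it, so long paths through hubs are sampled less often.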
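Table 3
| Model | AN | NN | VO | SVO | GS11 | GS12 |
|---|---|---|---|---|---|---|
| vecDCS | 0.51 | 0.49 | 0.41 | 0.62 | 0.29 | 0.33 |
| no matrix | 0.52 | 0.46 | 0.42 | 0.62 | 0.29 | 0.33 |
| no inverse | 0.47 | 0.43 | 0.38 | 0.58 | 0.28 | 0.33 |
| vecUD | 0.44 | 0.46 | 0.41 | 0.58 | 0.25 | 0.25 |
| GloVe | 0.41 | 0.47 | 0.41 | 0.60 | 0.23 | 0.17 |
| Grefenstette and Sadrzadeh (2011) | - | - | - | - | 0.21 | - |
| Blacoe and Lapata (2012), RAE | 0.31 | 0.30 | 0.28 | - | - | - |
| Grefenstette (PhD thesis) | - | - | - | - | - | 0.27 |
| Paperno et al. (2014) | - | - | - | - | - | 0.36 |
| Hashimoto et al. (2014) | 0.48 | 0.40 | 0.39 | - | 0.34 | - |
| Kartsaklis and Sadrzadeh (2014) | - | - | - | 0.43 | 0.41 | - |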
Using the sampled paths, vectors and matrices are trained as in Section 4 (vecDCS). The vector dimension is set to . We compare with three baselines: (i) all matrices are fixed to identity (“no matrix”), in order to investigate the effects of meaning changes caused by syntactic-semantic roles and prepositions; (ii) the regularizer enforcing $M_N^{-1}$ to be actually the inverse matrix of $M_N$ is switched off by setting its weight to 0 (“no inverse”), in order to investigate the effects of a semantically motivated constraint; and (iii) the same training scheme is applied to UD trees directly, by modeling UD relations as matrices (“vecUD”). In this case, one edge is assigned one UD relation rel, so we implement the transformation from child to parent by $M_{\mathrm{rel}}$, and from parent to child by $M_{\mathrm{rel}}^{-1}$. The same hyperparameters are used to train vecUD. By comparing vecDCS with vecUD we investigate whether applying the semantic framework of DCS makes any difference. Additionally, we compare with the GloVe (6B, 300d) vectors (http://nlp.stanford.edu/projects/glove/) [Pennington et al., 2014]. Norms of all word vectors are normalized to 1 and Frobenius norms of all matrices are normalized to .
5.1 Qualitative Analysis
We observe several special properties of the vectors and matrices trained by our model.
Words are clustered by POS
In terms of cosine similarity, word vectors trained by vecDCS and vecUD are clustered by POS tags, probably due to their interactions with matrices during training. This is in contrast to the vectors trained by GloVe or “no matrix” (Table 1).
Matrices show semantic regularity
Matrices learned for ARG, SUBJ and COMP are exactly orthogonal, and those for some of the most frequent prepositions (of, in, to, for, with, on, as, at, from) are remarkably close. For these matrices, the corresponding $M_N^{-1}$ also converge exactly to their inverses. This suggests regularities in the semantic space, especially because orthogonal matrices preserve cosine similarity: if $M_N$ is orthogonal, two words x, y and their projections $v_x M_N$, $v_y M_N$ will have the same similarity measure, which is semantically reasonable. In contrast, matrices trained by vecUD are only orthogonal for three UD relations, namely conj, dep and appos.
Words transformed by matrices
To illustrate the matrices trained by vecDCS, we start from the query vectors of two words, house and learn, apply different matrices to them, and show the 10 answer vectors with the highest dot products (Table 2). These are the lists of likely words which: take house as a subject, take house as a complement, fill into “___ in house”, serve as a subject of learn, serve as a complement of learn, and fill into “learn about ___”, respectively. As the table shows, matrices in vecDCS are appropriately learned to map word vectors to their syntactic-semantic roles.
MessageTopic(, )  It is a monthly providing and advice on current United States government contract issues. 

MessageTopic(, )  The gives an account of the silvicultural done in Africa, Asia, Australia, South American and the Caribbean. 
MessageTopic(, )  NUS today responded to the Government’s of the longawaited of university funding. 
ComponentWhole(, )  The published political and opinion, but even more than that. 
MessageTopic(, )  It is a 2004 criticizing the political and linguistic of Noam Chomsky. 
5.2 Phrase Similarity
To test whether vecDCS has the compositional ability to calculate similar things as similar vectors, we conduct an evaluation on a wide range of phrase similarity tasks. In these tasks, a system calculates similarity scores for pairs of phrases, and the performance is evaluated as its correlation with human annotators, measured by Spearman’s $\rho$.
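For readers unfamiliar with the metric, a rank correlation of this kind can be computed as in the sketch below: both score lists are converted to ranks (ties receive the average of the ranks they span), and the Pearson correlation of the ranks is returned. The function names and toy scores are illustrative assumptions; the exact evaluation script used for the experiments is not specified here.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(system_scores, human_scores):
    """Spearman's rho: Pearson correlation between the two rank vectors."""
    return pearson(average_ranks(system_scores), average_ranks(human_scores))

# Hypothetical cosine similarities and human ratings (1 to 7) for four phrase pairs.
system = [0.82, 0.10, 0.55, 0.31]
human = [6, 1, 5, 2]
print(spearman(system, human))
```

Because only ranks matter, the metric is insensitive to the scale of the similarity scores, which is convenient when comparing cosine similarities against integer ratings.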
Datasets
Mitchell and Lapata (2010) create datasets (http://homepages.inf.ed.ac.uk/s0453356/) for pairs of three types of two-word phrases: adjective-nouns (AN) (e.g. “black hair” and “dark eye”), compound nouns (NN) (e.g. “tax charge” and “interest rate”) and verb-objects (VO) (e.g. “fight war” and “win battle”). Each dataset consists of 108 pairs and each pair is annotated by 18 humans (i.e., 1,944 scores in total). Similarity scores are integers ranging from 1 to 7. Another dataset (http://www.cs.ox.ac.uk/activities/compdistmeaning/) is created by extending VO to Subject-Verb-Object (SVO), and then assessing similarities by crowdsourcing [Kartsaklis and Sadrzadeh, 2014]. The dataset GS11 created by Grefenstette and Sadrzadeh (2011) (100 pairs, 25 annotators) is also of the form SVO, but in each pair only the verbs are different (e.g. “man provide/supply money”). The dataset GS12 described in Grefenstette's PhD thesis (194 pairs, 50 annotators) is of the form Adjective-Noun-Verb-Adjective-Noun (e.g. “local family run/move small hotel”), where only the verbs are different in each pair.
Our method
We calculate the cosine similarity of query vectors corresponding to phrases. For example, the query vector for “fight war” is calculated as $v_{\mathrm{war}} M_{\mathrm{ARG}} M_{\mathrm{COMP}}^{-1} + v_{\mathrm{fight}}$. For vecUD we use and instead of and , respectively. For GloVe we use additive compositions.
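The scoring step for a phrase pair can be sketched as follows, here with the simplest additive composition of the kind used for the GloVe baseline. The 3-dimensional word vectors are made up for illustration, and the helper names are hypothetical; with vecDCS the composed vectors would instead come from the matrix-based composition described in Section 3.

```python
import math

def compose_additive(words, word_vec):
    """Additive composition: the phrase vector is the sum of its word vectors."""
    dim = len(next(iter(word_vec.values())))
    total = [0.0] * dim
    for w in words:
        total = [t + x for t, x in zip(total, word_vec[w])]
    return total

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy word vectors.
word_vec = {
    "fight": [0.9, 0.1, 0.0],
    "war": [0.8, 0.2, 0.1],
    "win": [0.7, 0.3, 0.0],
    "battle": [0.9, 0.1, 0.2],
    "tax": [0.0, 0.1, 0.9],
    "charge": [0.1, 0.2, 0.8],
}

fight_war = compose_additive(["fight", "war"], word_vec)
win_battle = compose_additive(["win", "battle"], word_vec)
tax_charge = compose_additive(["tax", "charge"], word_vec)

print(cosine(fight_war, win_battle))
print(cosine(fight_war, tax_charge))
```

The resulting similarity scores for all pairs in a dataset are then compared with the human ratings by rank correlation.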
Results
As shown in Table 3, vecDCS is competitive on AN, NN, VO, SVO and GS12, consistently outperforming “no inverse”, vecUD and GloVe, showing strong compositionality. The weakness of “no inverse” suggests that relaxing the constraint of inverse matrices may hurt compositionality, though our preliminary examination of word similarities did not find any difference. The GS11 dataset appears to favor models that can learn from interactions between the subject and object arguments, such as the non-linear model in Hashimoto et al. (2014) and the entanglement model in Kartsaklis and Sadrzadeh (2014). However, these models do not show particular advantages on other datasets. The recursive autoencoder (RAE) proposed in Socher et al. (2011) shares an aspect with vecDCS in that it constructs meanings from parse trees. It is tested by Blacoe and Lapata (2012) for compositionality, where vecDCS appears to be better. Nevertheless, we note that “no matrix” performs as well as vecDCS, suggesting that meaning changes caused by syntactic-semantic roles might not be major factors in these datasets, because the syntactic-semantic relations are all fixed in each dataset.
5.3 Relation Classification
In a relation classification task, the relation between two words in a sentence needs to be classified; we expect vecDCS to perform better than “no matrix” on this task because vecDCS can distinguish the different syntactic-semantic roles of the two slots the two words fit in. We confirm this conjecture in this section.

Table 5
| Method | Macro-F1 |
|---|---|
| vecDCS | 81.2 |
| no matrix | 69.2 |
| no inverse | 79.7 |
| vecUD | 69.2 |
| GloVe | 74.1 |
| Socher et al. (2012) | 79.1 |
| Socher et al. (2012), +3 features | 82.4 |
| dos Santos et al. (2015) | 84.1 |
| Xu et al. (2015) | 85.6 |
Dataset
We use the dataset of SemEval-2010 Task 8 [Hendrickx et al., 2009], in which 9 directed relations (e.g. Cause-Effect) and 1 undirected relation Other are annotated, with 8,000 instances for training and 2,717 for test. Performance is measured by the 9-class direction-aware Macro-F1 score excluding the Other class.
Our method
For any sentence with two words marked as $e_1$ and $e_2$, we construct the DCS tree of the sentence, and take the subtree $T$ rooted at the common ancestor of $e_1$ and $e_2$. We construct four vectors from $T$, namely: the query vector for the subtree rooted at $e_1$ (resp. $e_2$), and the query vector of the DCS tree obtained from $T$ by rerooting it at $e_1$ (resp. $e_2$) (Figure 4). The four vectors are normalized and concatenated to form the only feature used to train a classifier. For vecUD, we use the corresponding vectors calculated from UD trees. For GloVe, we use the word vector of $e_1$ (resp. $e_2$), and the sum of vectors of all words within the span (resp. ) as the four vectors. The classifier is an SVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) with RBF kernel, and . The hyperparameters are selected by 5-fold cross validation.
Results
VecDCS outperforms baselines on relation classification (Table 5). It makes 16 errors in misclassifying the direction of a relation, as compared to 144 such errors made by “no matrix”, 23 by “no inverse”, 30 by vecUD, and 161 by GloVe. This suggests that models with syntactic-semantic transformations (i.e. vecDCS, “no inverse”, and vecUD) are indeed good at distinguishing the different roles played by $e_1$ and $e_2$. VecDCS scores moderately lower than the state of the art [Xu et al., 2015]; however, we note that these results are achieved by adding additional features and training task-specific neural networks [dos Santos et al., 2015; Xu et al., 2015]. Our method only uses features constructed from unlabeled corpora. From this point of view, it is comparable to the MV-RNN model (without features) in Socher et al. (2012), and vecDCS actually does better. Table 4 shows an example of clustered training instances as assessed by cosine similarities between their features. It suggests that the features used in our method can actually cluster similar relations.

Table 6
| “banned drugs” | “banned movies” | “banned books” |
|---|---|---|
| drug/N | bratz/N | publish/N |
| marijuana/N | porn/N | unfair/N |
| cannabis/N | indecent/N | obscene/N |
| trafficking/N | blockbuster/N | samizdat/N |
| thalidomide/N | movie/N | book/N |
| smoking/N | idiots/N | responsum/N |
| narcotic/N | blacklist/N | illegal/N |
| botox/N | grindhouse/N | reclaiming/N |
| doping/N | doraemon/N | redbook/N |
5.4 Sentence Completion
If vecDCS can compose query vectors of DCS trees, one should be able to “execute” the vectors to get a set of answers, as the original DCS trees can do. This is done by taking dot products with answer vectors and then ranking the answers. Examples are shown in Table 6. Since query vectors and answer vectors are trained from unlabeled corpora, we can only obtain a coarse-grained candidate list. However, it is noteworthy that despite the common word “banned” shared by the phrases, their answer lists are largely different, suggesting that composition actually can be done. Moreover, some words indeed answer the queries (e.g. Thalidomide for “banned drugs” and Samizdat for “banned books”).
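Executing a query vector amounts to a ranking by dot product, as in the sketch below. The 2-dimensional query and answer vectors are made up for illustration and the function name is hypothetical; in the actual model both kinds of vectors are learned from unlabeled corpora.

```python
def rank_answers(query, answer_vectors, top_k=3):
    """Return the `top_k` words whose answer vectors have the highest dot product with `query`."""
    scored = []
    for word, vec in answer_vectors.items():
        score = sum(q * a for q, a in zip(query, vec))
        scored.append((score, word))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored[:top_k]]

# Hypothetical answer vectors and a hypothetical composed query for "banned drugs".
answer_vectors = {
    "thalidomide/N": [0.9, 0.1],
    "marijuana/N": [0.8, 0.3],
    "novel/N": [-0.5, 0.7],
}
query = [1.0, 0.2]

print(rank_answers(query, answer_vectors, top_k=2))
```

The same routine can score a fixed set of candidate words, in which case the highest-ranked candidate is taken as the answer.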
Quantitatively, we evaluate this utility of executing queries on the sentence completion task. In this task, a sentence is presented with a blank that needs to be filled in. Five possible words are given as options for each blank, and a system needs to choose the correct one. The task can be viewed as coarse-grained question answering or as an evaluation for language models [Zweig et al., 2012]. We use the MSR sentence completion dataset (http://research.microsoft.com/en-us/projects/scc/), which consists of 1,040 test questions and a corpus for training language models. We train vecDCS on this corpus and use it for evaluation.
Results
As shown in Table 7, vecDCS scores better than the N-gram model and demonstrates promising performance. However, to our surprise, “no matrix” shows an even better result, which is the new state of the art. Here we might be facing the same problem as in the phrase similarity task (Section 5.2); namely, all choices in a question fill into the same blank and the same syntactic-semantic role, so the transforming matrices in vecDCS might not be able to distinguish different choices; on the other hand, vecDCS would suffer more from parsing and POS-tagging errors. Nonetheless, we believe the result by “no matrix” reveals a new horizon for sentence completion, and suggests that composing semantic vectors according to DCS trees could be a promising direction.

Table 7
| Method | Accuracy (%) |
|---|---|
| vecDCS | 50 |
| no matrix | 60 |
| no inverse | 46 |
| vecUD | 31 |
| N-gram (Various) | 39-41 |
| Zweig et al. (2012) | 52 |
| Mnih and Teh (2012) | 55 |
| Gubbins and Vlachos (2013) | 50 |
| Mikolov (RNN) | 55 |
6 Discussion
We have demonstrated a way to link a vector composition model to a formal semantics, combining the strength of vector representations to calculate phrase similarities, and the strength of formal semantics to build up structured queries. In this section, we discuss several lines of previous research related to this work.
Logic and Distributional Semantics
Logic is necessary for implementing the functional aspects of meaning and organizing knowledge in a structured and unambiguous way. In contrast, distributional semantics provides an elegant methodology for assessing semantic similarity and is well suited for learning from data. There have been repeated calls for combining the strengths of these two approaches [Coecke et al., 2010; Baroni et al., 2014; Liang and Potts, 2015], and several systems [Lewis and Steedman, 2013; Beltagy et al., 2014; Tian et al., 2014] have contributed to this direction. In the remarkable work by Beltagy et al. (2016), word and phrase similarities are explicitly transformed to weighted logical rules that are used in a probabilistic inference framework. However, this approach requires a considerable amount of engineering, including the generation of rule candidates (e.g. by aligning sentence fragments), converting distributional similarities to weights, and efficiently handling the rules and inference. What if the distributional representations are equipped with a logical interface, such that the inference can be realized by simple vector calculations? We have shown that it is possible to realize semantic composition; we believe this may lead to significant simplification of the system design for combining logic and distributional semantics.
Compositional Distributional Models
There has been active exploration of how to combine word vectors such that adequate phrase/sentence similarities can be assessed [Mitchell and Lapata, 2010, inter alia], and there is nothing new in using matrices to model changes of meaning. However, previous model designs mostly rely on linguistic intuitions [Paperno et al., 2014, inter alia], whereas our model has an exact logical interpretation. Furthermore, by using additive composition we enjoy a learning guarantee [Tian et al., 2015].
Vector-based Logic Models
This work also shares the spirit of Grefenstette (2013) and Rocktäschel et al. (2014), in exploring vector calculations that realize logic operations. However, the previous works did not specify how to integrate contextual distributional information, which is necessary for calculating semantic similarity.
Formal Semantics
Our model implements a fragment of logic capable of semantic composition, largely due to the simple framework of Dependency-based Compositional Semantics [Liang et al., 2013]. It fits in a long tradition of logic-based semantics [Montague, 1970; Dowty et al., 1981; Kamp and Reyle, 1993], with extensive studies on extracting semantics from syntactic representations such as HPSG [Copestake et al., 2001; Copestake et al., 2005] and CCG [Baldridge and Kruijff, 2002; Bos et al., 2004; Steedman, 2012; Artzi et al., 2015; Mineshima et al., 2015].
Logic for Natural Language Inference
The pursuit of a logic more suitable for natural language inference is also not new. For example, MacCartney and Manning (2008) implemented a model of natural logic [Lakoff1970]. We would not have reached the current formalization of the logic of DCS without reading the work by Calvanese et al. (1998), which is an elegant formalization of database semantics in description logic.
Semantic Parsing
DCS-related representations have been actively used in semantic parsing, and we see potential in applying our model there. For example, Berant and Liang (2014) convert DCS queries to canonical utterances and assess paraphrases at the surface level; an alternative could be to use vector-based DCS to bring distributional similarity directly into the calculation of denotations. We also borrow ideas from previous work: our training scheme is similar to that of Guu et al. (2015) in using paths and composition of matrices, and our method is similar to that of Poon and Domingos (2009) in building structured knowledge by clustering syntactic parses of unlabeled data.
Further Applications
Regarding the usability of the distributional representations learned by our model, a strong point is that the representations take into account the syntactic/structural information of the context. Unlike several previous models [Padó and Lapata2007, Levy and Goldberg2014, Pham et al.2015], our approach simultaneously learns matrices that can extract this information according to different syntactic-semantic roles. A related application is selectional preference [Baroni and Lenci2010, Lenci2011, Van de Cruys2014], for which our model might have potential in smoothly handling composition.
Reproducibility
Find our code at https://github.com/tianran/vecdcs
Acknowledgments
This work was supported by CREST, JST. We thank the anonymous reviewers for their valuable comments.
References
 [Artzi et al.2015] Yoav Artzi, Kenton Lee, and Luke Zettlemoyer. 2015. Broad-coverage CCG semantic parsing with AMR. In Proceedings of EMNLP.
 [Baldridge and Kruijff2002] Jason Baldridge and Geert-Jan Kruijff. 2002. Coupling CCG and hybrid logic dependency semantics. In Proceedings of ACL.
 [Baroni and Lenci2010] Marco Baroni and Alessandro Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4).
 [Baroni and Zamparelli2010] Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of EMNLP.
 [Baroni et al.2014] Marco Baroni, Raffaella Bernardi, and Roberto Zamparelli. 2014. Frege in space: A program for compositional distributional semantics. Linguistic Issues in Language Technology, 9(6).
 [Beltagy et al.2014] Islam Beltagy, Katrin Erk, and Raymond Mooney. 2014. Probabilistic soft logic for semantic textual similarity. In Proceedings of ACL.
 [Beltagy et al.to appear] Islam Beltagy, Stephen Roller, Pengxiang Cheng, Katrin Erk, and Raymond J. Mooney. to appear. Representing meaning with a combination of logical form and vectors. Computational Linguistics, special issue on formal distributional semantics.
 [Berant and Liang2014] Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of ACL.
 [Blacoe and Lapata2012] William Blacoe and Mirella Lapata. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of EMNLP-CoNLL.
 [Bos et al.2004] Johan Bos, Stephen Clark, Mark Steedman, James R. Curran, and Julia Hockenmaier. 2004. Wide-coverage semantic representations from a CCG parser. In Proceedings of ICCL.
 [Bottou2012] Léon Bottou. 2012. Stochastic gradient descent tricks. In Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade. Springer, Berlin.
 [Calvanese et al.1998] Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. 1998. On the decidability of query containment under constraints. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’98).
 [Coecke et al.2010] Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. 2010. Mathematical foundations for a compositional distributional model of meaning. Linguistic Analysis.
 [Copestake et al.2001] Ann Copestake, Alex Lascarides, and Dan Flickinger. 2001. An algebra for semantic construction in constraint-based grammars. In Proceedings of ACL.
 [Copestake et al.2005] Ann Copestake, Dan Flickinger, Carl Pollard, and Ivan A. Sag. 2005. Minimal recursion semantics: An introduction. Research on Language and Computation, 3(2-3).
 [dos Santos et al.2015] Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. In Proceedings of ACL-IJCNLP.
 [Dowty et al.1981] David R. Dowty, Robert E. Wall, and Stanley Peters. 1981. Introduction to Montague Semantics. Springer Netherlands.
 [Grefenstette and Sadrzadeh2011] Edward Grefenstette and Mehrnoosh Sadrzadeh. 2011. Experimental support for a categorical compositional distributional model of meaning. In Proceedings of EMNLP.
 [Grefenstette2013a] Edward Grefenstette. 2013a. Category-Theoretic Quantitative Compositional Distributional Models of Natural Language Semantics. PhD thesis.
 [Grefenstette2013b] Edward Grefenstette. 2013b. Towards a formal distributional semantics: Simulating logical calculi with tensors. In Proceedings of *SEM.
 [Gubbins and Vlachos2013] Joseph Gubbins and Andreas Vlachos. 2013. Dependency language models for sentence completion. In Proceedings of EMNLP.
 [Gutmann and Hyvärinen2012] Michael U. Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res., 13(1).
 [Guu et al.2015] Kelvin Guu, John Miller, and Percy Liang. 2015. Traversing knowledge graphs in vector space. In Proceedings of EMNLP.
 [Hashimoto et al.2014] Kazuma Hashimoto, Pontus Stenetorp, Makoto Miwa, and Yoshimasa Tsuruoka. 2014. Jointly learning word representations and composition functions using predicate-argument structures. In Proceedings of EMNLP.
 [Hendrickx et al.2009] Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2009. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009).
 [Kamp and Reyle1993] Hans Kamp and Uwe Reyle. 1993. From Discourse to Logic. Springer Netherlands.
 [Kartsaklis and Sadrzadeh2014] Dimitri Kartsaklis and Mehrnoosh Sadrzadeh. 2014. A study of entanglement in a categorical framework of natural language. In Proceedings of the 11th Workshop on Quantum Physics and Logic (QPL).
 [Klein and Manning2003] Dan Klein and Christopher D. Manning. 2003. Fast exact inference with a factored model for natural language parsing. In Advances in NIPS.
 [Lakoff1970] George Lakoff. 1970. Linguistics and natural logic. Synthese, 22(1-2).
 [Lenci2011] Alessandro Lenci. 2011. Composing and updating verb argument expectations: A distributional semantic model. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics.
 [Levy and Goldberg2014] Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of ACL.
 [Levy et al.2015] Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of ACL, 3.
 [Lewis and Steedman2013] Mike Lewis and Mark Steedman. 2013. Combined distributional and logical semantics. Transactions of ACL, 1.
 [Liang and Potts2015] Percy Liang and Christopher Potts. 2015. Bringing machine learning and compositional semantics together. Annual Review of Linguistics, 1.
 [Liang et al.2013] Percy Liang, Michael I. Jordan, and Dan Klein. 2013. Learning dependency-based compositional semantics. Computational Linguistics, 39(2).
 [MacCartney and Manning2008] Bill MacCartney and Christopher D. Manning. 2008. Modeling semantic containment and exclusion in natural language inference. In Proceedings of Coling.
 [McDonald et al.2013] Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal dependency annotation for multilingual parsing. In Proceedings of ACL.
 [Mikolov et al.2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv:1301.3781.
 [Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in NIPS.
 [Mineshima et al.2015] Koji Mineshima, Pascual Martínez-Gómez, Yusuke Miyao, and Daisuke Bekki. 2015. Higher-order logical inference with compositional semantics. In Proceedings of EMNLP.
 [Mitchell and Lapata2010] Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8).
 [Mnih and Teh2012] Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of ICML.
 [Montague1970] Richard Montague. 1970. Universal grammar. Theoria, 36.
 [Padó and Lapata2007] Sebastian Padó and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2).
 [Paperno et al.2014] Denis Paperno, Nghia The Pham, and Marco Baroni. 2014. A practical and linguistically-motivated approach to compositional distributional semantics. In Proceedings of ACL.
 [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.
 [Pham et al.2015] Nghia The Pham, Germán Kruszewski, Angeliki Lazaridou, and Marco Baroni. 2015. Jointly optimizing word representations for lexical and sentential tasks with the C-PHRASE model. In Proceedings of ACL.
 [Poon and Domingos2009] Hoifung Poon and Pedro Domingos. 2009. Unsupervised semantic parsing. In Proceedings of EMNLP.
 [Rocktaeschel et al.2014] Tim Rocktaeschel, Matko Bosnjak, Sameer Singh, and Sebastian Riedel. 2014. Low-dimensional embeddings of logic. In ACL Workshop on Semantic Parsing (SP’14).
 [Socher et al.2011] Richard Socher, Eric H. Huang, Jeffrey Pennington, Christopher D. Manning, and Andrew Y. Ng. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in NIPS.
 [Socher et al.2012] Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of EMNLP.
 [Steedman2012] Mark Steedman. 2012. Taking Scope: The Natural Semantics of Quantifiers. MIT Press.
 [Tian et al.2014] Ran Tian, Yusuke Miyao, and Takuya Matsuzaki. 2014. Logical inference on dependency-based compositional semantics. In Proceedings of ACL.
 [Tian et al.2015] Ran Tian, Naoaki Okazaki, and Kentaro Inui. 2015. The mechanism of additive composition. arXiv:1511.08407.
 [Turney and Pantel2010] Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1).
 [Van de Cruys2014] Tim Van de Cruys. 2014. A neural network approach to selectional preference acquisition. In Proceedings of EMNLP.
 [Xu et al.2015] Kun Xu, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2015. Semantic relation classification via convolutional neural networks with simple negative sampling. In Proceedings of EMNLP.
 [Zweig et al.2012] Geoffrey Zweig, John C. Platt, Christopher Meek, Christopher J.C. Burges, Ainur Yessenalina, and Qiang Liu. 2012. Computational approaches to sentence completion. In Proceedings of ACL.