Challenges for Distributional Compositional Semantics

07/10/2012 ∙ by Daoud Clarke, et al.

This paper summarises the current state-of-the art in the study of compositionality in distributional semantics, and major challenges for this area. We single out generalised quantifiers and intensional semantics as areas on which to focus attention for the development of the theory. Once suitable theories have been developed, algorithms will be needed to apply the theory to tasks. Evaluation is a major problem; we single out application to recognising textual entailment and machine translation for this purpose.

1 Introduction

This paper summarises some major challenges for the nascent field of distributional compositional semantics. Research in this area has arisen out of the success of vector-based techniques for representing aspects of lexical semantics, such as latent semantic analysis [Deerwester et al.1990] and measures of distributional similarity [Lin1998, Lee1999].

The automatic nature of these techniques means that much higher coverage can be achieved compared to manually constructed resources such as WordNet [Fellbaum2005]. Additionally, the vector-based nature of the semantic representations allows fine-grained aspects of meaning to be incorporated, in contrast to the type of relations typically expressed in ontologies; moreover, the construction of an ontology is generally a subjective process, whereas vector-based approaches are typically more objective, being formed from observations of the contexts in which words occur in large corpora. There are disadvantages: automatic techniques are arguably less reliable than manually constructed resources, and often do not explicitly identify the variety of relationships between words that are captured in an ontology such as WordNet.

Researchers have begun to look at how such techniques can be extended beyond the word level to represent the meanings of phrases and even whole sentences. Existing techniques cannot be applied directly beyond phrases of two or three words because of the problem of data sparseness: as the length of the phrase increases, the amount of data matching the phrase falls off very quickly, and soon there is not enough data to build vectors reliably. The alternative is to look at how to compose such vectors, so that the vector for a phrase or sentence is determined purely by the vector representations of the individual words in the sentence.

While interest in this area has exploded in recent years, and some significant advances have been made, there is still a lot of work to do:

  • The underlying theory needs to be developed to allow distributional approaches to describe aspects of natural language meaning easily described by model-theoretic semantics, for example, generalised quantifiers and intensional semantics. We explain below why current approaches are not suited to either of these.

  • New algorithms and tools are needed to perform inference with the new theories.

  • We need suitable methods for evaluating distributional models of compositionality. In addition, approaches need to be evaluated across a broader range of natural language processing tasks. In particular we identify textual entailment and machine translation as suitable areas for application of current and future techniques.

In the remainder of the paper, we summarise existing work (Section 2), then motivate each of the above areas in detail (Section 3).

2 Background

Vector representations provide a rich variety of possible methods of composition. The most obvious method is perhaps vector addition [Landauer and Dumais1997, Foltz et al.1998], in which a string of words is represented by the sum of the individual words making up the string. This method has several problems, the most obvious of which is that the operation is commutative, whereas natural language meaning is not: John hit Mary does not mean the same as Mary hit John. Another composition operation that suffers from this problem is point-wise multiplication [Mitchell and Lapata2008].
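As a concrete illustration, here is a minimal sketch of additive and point-wise multiplicative composition (the toy vectors are our own, purely illustrative); it shows that both operations are commutative and so discard word order.

    import numpy as np

    # Toy context vectors; the values are illustrative only.
    john = np.array([0.2, 0.7, 0.1, 0.0])
    hit  = np.array([0.5, 0.1, 0.3, 0.1])
    mary = np.array([0.3, 0.6, 0.0, 0.1])

    def compose_add(*vectors):
        # Additive composition: the phrase vector is the sum of the word vectors.
        return np.sum(vectors, axis=0)

    def compose_mult(*vectors):
        # Point-wise multiplicative composition.
        return np.prod(vectors, axis=0)

    # Both operations are commutative, so "John hit Mary" and "Mary hit John"
    # receive exactly the same representation.
    assert np.allclose(compose_add(john, hit, mary), compose_add(mary, hit, john))
    assert np.allclose(compose_mult(john, hit, mary), compose_mult(mary, hit, john))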

A method of composing vectors that avoids this issue is the tensor product [Smolensky1990, Clark and Pulman2007, Widdows2008]. Given two vectors u and v in vector spaces U and V of dimensionality m and n respectively, the tensor product u ⊗ v is a vector in a much larger space U ⊗ V of dimensionality mn. Each pair of basis vectors in U and V has a corresponding basis vector in U ⊗ V, so given a tensor product u ⊗ v it is always possible to deduce the original vectors u and v, another property that is missing from vector addition.
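A minimal sketch of tensor-product composition (again with toy vectors of our own choosing): the composed representation grows multiplicatively in dimension and, unlike addition, is sensitive to the order of its arguments.

    import numpy as np

    big = np.array([0.4, 0.6, 0.0])        # a vector in a 3-dimensional space
    dog = np.array([0.1, 0.2, 0.7, 0.0])   # a vector in a 4-dimensional space

    # The tensor (outer) product lives in a 3 x 4 = 12-dimensional space.
    big_dog = np.outer(big, dog).flatten()
    print(big_dog.shape)                    # (12,)

    # Unlike addition, the operation is sensitive to the order of its arguments ...
    dog_big = np.outer(dog, big).flatten()
    print(np.allclose(big_dog, dog_big))    # False

    # ... but the result cannot be compared directly with the vector for "dog",
    # since the two live in spaces of different dimensionality (12 vs. 4).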

The problem with the tensor product is that strings of different lengths have different dimensionalities: they live in different vector spaces and are therefore not directly comparable. This means that we cannot say to what extent big dog entails dog. There are several ways to get around this:

  • Use some linear map from the tensor product space to the original space to reduce the dimensionality of vectors and allow them to be compared. This was suggested by Mitchell and Lapata (2008) as a general “multiplicative model” of composition (see the sketch after this list). The problem with this method is that information is lost as meanings compose, since all strings have the same dimensionality.

  • Impose relations on different tensor powers of the space to make them comparable [Clarke et al.2010]. This approach allows a lot of flexibility in describing composition but it is not clear how to determine what relations should be imposed, nor how we can easily compute with the resulting structures. It does, however, resolve the problem of information loss as strings are composed.
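To make the first option concrete, here is a small sketch (our own illustration, not a published implementation) showing that point-wise multiplication arises from the tensor product via a linear map that keeps only the “diagonal” components, so the composed vector stays in the original space.

    import numpy as np

    def diagonal_projection(tensor_product_matrix):
        # A linear map from the tensor product space back to the original space:
        # keep only the components u_i * v_i, discarding all the off-diagonal ones.
        return np.diag(tensor_product_matrix)

    u = np.array([0.4, 0.6, 0.0])
    v = np.array([0.1, 0.2, 0.7])

    composed = diagonal_projection(np.outer(u, v))
    assert np.allclose(composed, u * v)   # identical to point-wise multiplication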

The approach of Grefenstette et al. (2011) is inspired by mathematical similarities between the structure of vector spaces and that of pregroup grammars: both are compact closed categories. Their approach can be viewed as a vectorisation of Montague semantics [Clark et al.2008].

Other approaches to this problem include the use of matrices [Rudolph and Giesbrecht2010] including those learnt directly from data [Baroni and Zamparelli2010].

2.1 Context-theoretic Semantics

The framework of Clarke (2012) is a mathematical formalisation of the idea that meaning is determined by context. The structure proposed to model natural language semantics is an associative algebra over the real numbers. This is a real vector space A, together with a multiplication which satisfies a property called bilinearity:

(αu + βv)w = α(uw) + β(vw)    and    w(αu + βv) = α(wu) + β(wv)

for all u, v, w in A and all real numbers α and β. It can be shown that this type of structure generalises all the approaches we discussed above [Clarke2012].

Clarke (2012) also proposes a principle to determine entailment between strings in distributional semantics, based on the concept of distributional generality [Weeds et al.2004]: terms that have a more general meaning will occur in a wider range of contexts. The theory assumes the existence of a distinguished basis which can be interpreted as defining the contexts in which strings can appear. This defines a partial ordering ≤ on the vector space: u ≤ v if and only if every component of u is less than or equal to the corresponding component of v. The partial ordering is interpreted as entailment and is connected with distributional generality, since the vector for a term a lies below the vector for a term b in this ordering whenever b occurs at least as frequently as a in every context.
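A small sketch of this componentwise ordering (toy context counts, purely illustrative): here cat entails animal because animal occurs at least as often in every context.

    import numpy as np

    # Toy context-count vectors over the same distinguished basis of contexts.
    cat    = np.array([3.0, 0.0, 2.0, 1.0])
    animal = np.array([5.0, 1.0, 2.0, 4.0])

    def entails(u, v):
        # u entails v iff every component of u is <= the corresponding component of v.
        return bool(np.all(u <= v))

    print(entails(cat, animal))   # True
    print(entails(animal, cat))   # False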

3 Challenges

3.1 Theory

The greatest problem currently facing attempts to describe meaning using vectors is to reconcile them with existing theories of meaning, most notably logical approaches to semantics. If distributional semantics is to replace logical semantics, it has to encompass it, since there are things that logical semantics does very well which it is hard to imagine distributional semantics doing in its current form. For example, it is conceivable that an intelligent agent could be built which interpreted natural language sentences using logic. The agent would choose the best course of action given a set of assumptions, perhaps using a combination of theorem provers, automated planning and search tools. The functionality provided by the theorem-proving component in such a system would be essential, allowing diverse pieces of knowledge from a variety of sources to be combined and deductions to be made from them. This is something that distributional approaches are not currently able to do.

Encompassing a whole logical semantic formalism in a manner consistent with distributional semantics is an ambitious goal. We have identified two particular areas with the following characteristics:

  • They are intuitively familiar and easy to understand

  • They occur fairly frequently in ordinary speech and writing

  • No existing framework for compositionality in distributional semantics deals with them satisfactorily

It is our hope that by concentrating on these areas we will be able to make progress towards the ultimate goal.

Generalised Quantifiers

The study of generalised quantifiers concerns expressions such as some, most but not all, no and at least two. In the analysis of Barwise and Cooper (1981), which is based on the earlier work of Montague (1974), the semantics of such a determiner operates on a set of entities (for example the set of people) and returns a set of sets; for example, the semantics of most people is the set of all sets of entities which contain most people.

Formalising these notions mathematically allows us to understand some properties of entailment between sentences containing such quantifiers. For example, all animals breathe entails all cats breathe, whereas some cats like cheese entails some animals like cheese; the change in quantifier has reversed the direction of the entailment.

This property cannot be captured within the framework of Clarke (2012), because of the built-in linearity of the multiplication in the underlying algebra. If we accept the idea of distributional generality, that cat should entail animal because the latter will occur in a broader range of contexts, then it follows from linearity that x cat y will entail x animal y for any strings x and y. More generally, for any u and v such that u ≤ v, we have xu ≤ xv and uy ≤ vy for all x and y.

In fact, what the reversal of entailment indicates is that quantifiers such as all are non-linear: they are not compatible with the bilinearity condition of context-theoretic semantics. This is a problem for all existing approaches to compositionality in distributional semantics, since linearity is a common assumption among them [Clarke2012].
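To make the difficulty concrete, the following sketch (our own illustration) checks that a bilinear composition such as point-wise multiplication preserves the componentwise ordering; a quantifier like all, which reverses the direction of entailment, cannot be modelled by such an order-preserving map.

    import numpy as np

    rng = np.random.default_rng(0)

    def compose(x, y):
        # Point-wise multiplication: a bilinear composition operation.
        return x * y

    cat    = np.array([3.0, 0.0, 2.0, 1.0])
    animal = np.array([5.0, 1.0, 2.0, 4.0])

    # If cat <= animal componentwise, then for any non-negative context vector x,
    # compose(x, cat) <= compose(x, animal): the direction of entailment is preserved.
    for _ in range(1000):
        x = rng.random(4)
        assert np.all(compose(x, cat) <= compose(x, animal))

    # With "all", however, the entailment between the full sentences runs the other
    # way ("all animals breathe" entails "all cats breathe"), which such an
    # order-preserving composition cannot capture.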

The work of Preller and Sadrzadeh (2011) addresses the problem of representing negation in distributional semantics using Bell states. Since negation results in a similar reversal of entailment, it is possible that such an approach would also be useful for modelling generalised quantifiers.

Intensional Semantics

Intensional semantics deals with certain complex semantic phenomena such as those involving the verbs know, believe, want and need. These are described elegantly in Montague semantics [Montague1974], and the ability to reason about such concepts is essential for intelligent agents that would interact with humans in natural language. Reasoning about such sentences requires additional knowledge about the meaning of these words that would normally be described in terms of logic; it is hard to imagine how their meanings could be described reliably within distributional semantics.

3.2 Algorithms and Tools

In order to compete with logical methods in semantics, distributional semantics needs to be able to do the following, given a fixed set of background knowledge (expressed in natural language):

  1. Truth: Estimate the probability that a given sentence is true.

  2. Search: Given a parameterised sentence, for example the queen was born in ___, find the value of the parameter which maximises the probability of the sentence.

  3. Entailment: Given two sentences, compute the degree to which the first entails the second.

The first two of these will be useful in tasks such as question answering, while the third will be useful for any of the tasks associated with recognising textual entailment [Dagan et al.2005], for example information retrieval.
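As a rough sketch of the kind of interface such tools might expose (the class and method names below are our own and purely hypothetical, not an existing library):

    from typing import List

    class DistributionalSemanticsEngine:
        # Hypothetical interface for the three tasks above.

        def __init__(self, background_knowledge: List[str]):
            # Background knowledge is expressed in natural language.
            self.background_knowledge = background_knowledge

        def truth(self, sentence: str) -> float:
            # Estimate the probability that the sentence is true.
            raise NotImplementedError

        def search(self, parameterised_sentence: str, candidates: List[str]) -> str:
            # Return the candidate filler that maximises the probability of the sentence.
            return max(candidates,
                       key=lambda c: self.truth(parameterised_sentence.replace("___", c)))

        def entailment(self, premise: str, hypothesis: str) -> float:
            # Compute the degree to which the premise entails the hypothesis.
            raise NotImplementedError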

There are more complex tasks that may not be expressible in terms of distributional semantics, for example those needed in planning for an intelligent agent; the exact formulation of such tasks may depend on the particular semantic formalism chosen.

When designing algorithms for these tasks, it is likely that we will be able to compute answers much faster if we allow approximate answers, which may be perfectly adequate for many tasks. Without a satisfactory theory of meaning, however, it is hard to speculate on the possible nature of such algorithms.

3.3 Evaluation Methods

A problem for researchers working in this field is how to evaluate models of compositionality. Researchers have evaluated models on short phrases by determining context vectors for the phrases and for the individual words directly. They then compose the vectors for the individual words using their models to obtain vectors for phrases, and measure how similar these are to the observed phrase vectors [Baroni and Zamparelli2010, Guevara2011]. This evaluation technique cannot, however, be extended beyond short phrases, so it may not provide a good measure of how well models handle deep semantics.
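A minimal sketch of this evaluation protocol (toy vectors of our own; the cited papers use corpus-derived vectors and learned composition functions):

    import numpy as np

    def cosine(u, v):
        # Cosine similarity between two vectors.
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Observed (corpus-derived) vectors for the words and for the whole phrase.
    black     = np.array([0.1, 0.8, 0.3])
    cat       = np.array([0.7, 0.2, 0.4])
    black_cat = np.array([0.5, 0.5, 0.4])   # built directly from contexts of "black cat"

    # Compose the word vectors with the model under evaluation (here, simple addition) ...
    predicted = black + cat

    # ... and score the model by how close the composed vector is to the observed one.
    print(cosine(predicted, black_cat))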

The recent Workshop on Distributional Semantics and Compositionality [Biemann and Giesbrecht2011] provided a dataset and a shared task of determining to what degree a phrase is compositional. This is undoubtedly a useful task, but again does not address the question of deep semantics.

In order to evaluate deep semantics, we propose applying methods to two tasks requiring deep semantics to perform well: recognising textual entailment and machine translation. We believe these tasks are suitable for this purpose because they would intuitively seem to require deep semantics to achieve perfect performance, yet statistical approaches are able to achieve reasonable to good performance. These tasks would thus provide a testing ground in which the sophistication of the techniques applied can be increased gradually towards deep semantics, the hope being that the more sophisticated techniques will lead to improved performance.

4 Conclusion

We have summarised some approaches to modelling compositionality in distributional semantics, and highlighted some challenges which we believe to be pertinent. In particular, we identified some aspects of the theory of distributional semantics which we believe to be lacking; anyone able to resolve these will necessarily push the boundaries of our understanding of meaning.

References

  • [Baroni and Zamparelli2010] Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), East Stroudsburg PA: ACL, pages 1183–1193.
  • [Barwise and Cooper1981] Jon Barwise and Robin Cooper. 1981. Generalized quantifiers and natural language. Linguistics and Philosophy, 4:159–219.
  • [Biemann and Giesbrecht2011] Chris Biemann and Eugenie Giesbrecht, editors. 2011. Proceedings of the Workshop on Distributional Semantics and Compositionality. Association for Computational Linguistics, Portland, Oregon, USA, June.
  • [Clark and Pulman2007] Stephen Clark and Stephen Pulman. 2007. Combining symbolic and distributional models of meaning. In Proceedings of the AAAI Spring Symposium on Quantum Interaction, pages 52–55, Stanford, CA.
  • [Clark et al.2008] Stephen Clark, Bob Coecke, and Mehrnoosh Sadrzadeh. 2008. A compositional distributional model of meaning. In Proceedings of the Second Quantum Interaction Symposium (QI-2008), pages 133–140, Oxford, UK.
  • [Clarke et al.2010] Daoud Clarke, Rudi Lutz, and David Weir. 2010. Semantic composition with quotient algebras. In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics, pages 38–44, Uppsala, Sweden, July. Association for Computational Linguistics.
  • [Clarke2012] Daoud Clarke. 2012. A context-theoretic framework for compositionality in distributional semantics. Computational Linguistics, 38(1):41–71.
  • [Dagan et al.2005] Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment, pages 1–8.
  • [Deerwester et al.1990] Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.
  • [Fellbaum2005] C. Fellbaum. 2005. WordNet and wordnets. In Encyclopedia of Language and Linguistics, Second Edition, Oxford: Elsevier, pages 665–670.
  • [Foltz et al.1998] Peter W. Foltz, Walter Kintsch, and Thomas K. Landauer. 1998. The measurement of textual coherence with latent semantic analysis. Discourse Process, 15:285–307.
  • [Grefenstette et al.2011] Edward Grefenstette, Mehrnoosh Sadrzadeh, Stephen Clark, Bob Coecke, and Stephen Pulman. 2011. Concrete sentence spaces for compositional distributional models of meaning. Proceedings of the 9th International Conference on Computational Semantics (IWCS 2011), pages 125–134.
  • [Guevara2011] Emiliano Guevara. 2011. Computing semantic compositionality in distributional semantics. In Proceedings of the 9th International Conference on Computational Semantics (IWCS 2011), pages 135–144.
  • [Landauer and Dumais1997] Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato’s problem: the latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2):211–240.
  • [Lee1999] Lillian Lee. 1999. Measures of distributional similarity. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL-1999), pages 23–32.
  • [Lin1998] Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics (COLING-ACL ’98), pages 768–774, Montreal.
  • [Mitchell and Lapata2008] Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pages 236–244, Columbus, Ohio, June. Association for Computational Linguistics.
  • [Montague1974] Richard Montague. 1974. The proper treatment of quantification in ordinary English. In Formal Philosophy: Selected Papers of Richard Montague. Yale University Press.
  • [Preller and Sadrzadeh2011] Anne Preller and Mehrnoosh Sadrzadeh. 2011. Bell states and negative sentences in the distributed model of meaning. Electronic Notes in Theoretical Computer Science, 270(2):141–153. Proceedings of the 6th International Workshop on Quantum Physics and Logic (QPL 2009).
  • [Rudolph and Giesbrecht2010] Sebastian Rudolph and Eugenie Giesbrecht. 2010. Compositional matrix-space models of language. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 907–916, Uppsala, Sweden, July. Association for Computational Linguistics.
  • [Smolensky1990] Paul Smolensky. 1990. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2):159–216, November.
  • [Weeds et al.2004] Julie Weeds, David Weir, and Diana McCarthy. 2004. Characterising measures of lexical distributional similarity. In Proceedings of Coling 2004, pages 1015–1021, Geneva, Switzerland, Aug 23–Aug 27. COLING.
  • [Widdows2008] Dominic Widdows. 2008. Semantic vector products: Some initial investigations. In Proceedings of the Second Symposium on Quantum Interaction, Oxford, UK, pages 1–8.