Common sense knowledge bases (KBs) are notoriously ‘brittle’: they are generally only usable by those who have spent a lot of time learning precisely how to phrase a question so that it will match the representation in the KB. They are also inevitably incomplete, leaving out many facts that one would expect a system claiming common sense to include. In order to get around these limitations, several researchers have been exploring the possibility of somehow combining the deductive reasoning abilities of a knowledge base with the representation of semantic similarity that is provided by distributional semantic vector spaces. “Query expansion,” for example, involves querying for semantically nearby terms as well as the explicit terms entered. The deductive reasoning in such a system still takes place in the discrete knowledge base, however. When there are concepts or relations missing from the knowledge base that prevent a chain of reasoning from going through for any of these near terms, the system will be unable to return any result.
Searches that take place completely in a semantic vector space, on the other hand, are more akin to searching via a web search engine. These searches forgo any explicit steps of deductive reasoning, relying instead on broad coverage. Combining multiple facts in a chain of reasoning to answer a query is beyond their current capabilities. What we propose in this paper is a way of discovering chains of reasoning connecting a premise to a conclusion directly in a semantic vector space. The method can be applied to various ways of representing knowledge by high-dimensional vectors.
Forming a chain of deductive logical reasoning can be thought of as a special variety of a more general phenomenon in the mind of following a “train of thought.” One idea brings up a related idea, which in turn brings up another related idea, and so forms a connected train. We can deliberately return to an earlier point in the train and follow another path either backward or forward, so that the trains link up to form a larger structure.
Trains of thought serve several purposes. Parts of an essay or a story are often structured as trains of thought, with each sentence building on the one before. Restricted to cause-effect relations, the root cause of an event can be found. Trains formed of links between means and ends can form a plan of actions and subgoals to achieve a larger goal. Trains of reasons can answer “why” questions. Trains of looser relations like resemblance of form and sound form the basis of some kinds of poetry, symbolism, or mysticism. Memory techniques, creativity methods, and dreams also rely on trains of thought.
In order to form chains of reasoning, AI researchers have attempted to find paths between ideas using exhaustive search in a knowledge graph. This blind walk through all connections in the graph seems very different from how we normally think. A path connecting two ideas seems to bubble up: we initially feel the connection more than see it. Ideas shade imperceptibly into one another. Analogy and association are ever-present. An argument as originally conceived generally skips steps, and may include steps which are simply analogous to those in related problems. Turning such a jumble of ideas into a step-by-step proof is a process that takes skill, training, and deliberate conscious effort.
Such imprecision can lead to invalid conclusions and fuzzy thinking, but it has the advantage of being capable of operating under unknown or incompletely represented conditions. When we don’t know, we can guess at the general ballpark of the answer. In order to create a system that can deal with the ambiguities of natural language and take action in an uncertain environment, we need to build in the ability to think in a more flexible, human manner. A more human-like reasoning engine should have at least the following properties:
be capable of associational, analogical, inductive, abductive, and deductive reasoning;
when exact answers can’t be found, guess at an approximate answer;
be aware of the strength or weakness of its arguments;
creatively find connections that were not deliberately given, and
find arguments that add up to a whole, rather than find strictly linear connections.
There are multiple strands of research that involve representing knowledge as vectors. One strand comes from the biologically-inspired cognitive architecture community and is increasingly known as Vector Symbolic Architectures (VSA). Kanerva introduced the idea of using sparse high-dimensional binary vectors as a way of storing information that is resistant to noise and capable of addressing memory with exemplars. These ideas have been developed to include the notion of binding vectors for compositional structure and to be more biologically accurate.
A second strand comes from the linguistics community, beginning with Latent Semantic Analysis to create word and document context vectors, and includes the well-known word vector representation word2vec. The ability of such vectors to solve analogy problems was demonstrated by Turney in 2005. Encoding the meaning of sentences by composing the meanings of the words in the sentence is a very similar problem to encoding triples from a knowledge base. Some researchers encode triples from a knowledge graph directly as vectors, building on the translating-embeddings approach of Bordes et al.
These approaches use machine learning to build methods for composing vectors in a reasoning chain. The system described in this paper does not require any training beyond what is done to create the word vector representations in the first place. It is unique in using sparse vector decomposition to solve a deductive reasoning problem.
We are given a knowledge base of facts represented as triples of the form (a, relation, b). We are also given a semantic vector space where every entity is represented by a high-dimensional vector in such a way that terms that are semantically similar are nearby in the semantic space. Each of the triples is represented within the vector space by a vector of the form b - a. For the purposes of the vector space calculations, these triples are treated as statements that a implies b. The specific predicates are not used in the vector space calculation; instead all predicates are treated as a simple statement of implication. This maps the first-order predicate calculus problem to a “zeroth-order” propositional calculus problem.
We wish to prove that g implies p. The vector representing this relation is p - g. If there is some set of facts in the knowledge base that can prove this, it must be the case that the facts form a chain: g implies e1, e1 implies e2, ..., en implies p.
Representing this chain as vectors, we get (e1 - g) + (e2 - e1) + ... + (p - en).
Cancelling out, we see that this sum is equal to the vector directly from g to p: p - g.
Our goal, then, in order to find a chain of entities linking g to p, is to find a sum of fact vectors of the form b - a that adds up to p - g. Such a sum can be thought of as a weight vector w multiplied by the list of fact vectors, with a weight of 1 for each fact vector included in the chain, and a weight of zero for each fact vector not included. Clearly w will be a sparse vector, with many more zeros than ones. This suggests that in order to find such a sum, we can use sparse approximation techniques such as OMP or LASSO to obtain the sparse weight vector w.
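To make the idea concrete, here is a minimal sketch in numpy. It is not the paper's implementation: random unit vectors stand in for real word embeddings, the entity names and distractor facts are invented for illustration, and a toy greedy OMP replaces a production sparse solver.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 300  # dimensionality of the toy semantic space

def unit():
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

# Hypothetical entities: random unit vectors stand in for word embeddings.
vec = {n: unit() for n in ["g", "e1", "e2", "p", "x1", "x2", "x3", "x4"]}

# Each fact (a, r, b) is embedded as b - a. The chain g -> e1 -> e2 -> p
# is present among distractor facts that do not connect g to p.
facts = [("g", "e1"), ("e1", "e2"), ("e2", "p"),
         ("x1", "x2"), ("x2", "x3"), ("x3", "x4")]
D = np.stack([vec[b] - vec[a] for a, b in facts], axis=1)  # d x num_facts
target = vec["p"] - vec["g"]

def omp(D, y, n_nonzero):
    """Greedy Orthogonal Matching Pursuit: repeatedly pick the column most
    correlated with the residual, then re-fit least squares on the support."""
    support, residual = [], y.copy()
    for _ in range(n_nonzero):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        w_sup, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ w_sup
    w = np.zeros(D.shape[1])
    w[support] = w_sup
    return w

w = omp(D, target, n_nonzero=3)
chain = sorted(facts[i] for i in np.flatnonzero(np.abs(w) > 0.5))
print(chain)  # [('e1', 'e2'), ('e2', 'p'), ('g', 'e1')]
```

Because the target vector is exactly the sum of the three chain facts, the recovered weights for those facts come out to 1 and all distractors get weight 0, just as described above.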
In cases where such a chain exists, this method should (when the sparse approximation is successful) return a set of facts that constitute the chain. When the chain does not exist, however, the method will return an approximation of the correct links in the path. Because the vectors come from a semantic vector space, such approximations will amount to undefined relations between closely related entities. Such gaps can be considered a kind of associational reasoning.
For example, suppose we want to find a path of relations between Michael Jackson and music. The knowledge base contains, among many others, the following two facts:
(Michael Jackson, is a, songwriter) and (musician, composes, music)
The proposed method would return both facts, even though they don’t strictly form a chain of reasoning, because songwriter and musician are nearby in the semantic space, and so the error in the sum is fairly small. (In some special cases, the error in one gap of the chain will largely cancel out with the error at another gap. When this happens, the system has found an analogous relation; this is discussed in the section on analogical and abductive reasoning below.) This is the core idea we hope to communicate in this paper: that sparse solvers can be used to find deductive chains in a semantic vector space, in a way that allows for analogical and associational connections where appropriate.
4 Propositional Calculus and the Logic of Subsets
| Query | Nearest neighbors |
| near “classical” | classical, classical music, Classical, classical repertoire, Hindustani classical, contemporary, Mohiniattam, sacred choral |
| near “music” | music, classical music, jazz, Music, songs, musicians, tunes |
| near “music - classical” | music, Rhapsody subscription, ringtone, MP3s, Polow, Napster, entertainment, Music, tunes |
| near “music + classical” | classical, music, classical music, jazz, classical repertoire, Hindustani classical, sacred choral, classical guitar |
The system is able to perform deductive logic because it approximately implements propositional calculus as a logic of subsets. (Boole and De Morgan originally formulated propositional logic as a special case of the logic of subsets.) Call U the set of all entities in the semantic vector space. The nearest neighbors of any entity a form a subset A of U. (These are the terms which are semantically near to a.) In a high-dimensional semantic vector space, if a vector is a nearest neighbor of vector a or vector b, it will also usually be a nearest neighbor of the vector a + b. (If a and b are approximately orthogonal unit vectors, then the similarity between either one and a + b will be about 1/sqrt(2) ≈ 0.707. This is much higher than the expected similarity between any two terms selected from the space.) This means that we can treat vector addition as the union operator: the near neighbors of a + b will be the near neighbors of a or of b. In propositional calculus, this is the OR operator, A OR B.
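The neighborhood claim is easy to check numerically. The sketch below uses random unit vectors as a stand-in vocabulary (the indices 0 and 1 playing the roles of a and b are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 300, 1000
# A stand-in vocabulary of approximately orthogonal unit vectors.
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

a, b = V[0], V[1]
union = (a + b) / np.linalg.norm(a + b)  # plays the role of A OR B

sims = V @ union
top2 = sorted(np.argsort(-sims)[:2].tolist())
print(top2)  # [0, 1]: a and b are the two nearest neighbors of a + b
```

The cosine similarity between a (or b) and a + b lands near 1/sqrt(2) ≈ 0.707, while every other vocabulary item stays near zero, which is why the union operation is reliable in high dimensions.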
The vectors near -a are the vectors which are not near to a. So negation of a vector can be treated as the set complement operator: -a corresponds to the complement of A. In propositional calculus, this is the NOT operator, NOT A.
In propositional calculus, A implies B means that either B is true, or A is not true, so it can be rewritten as (NOT A) OR B. In the subset logic, this is the complement of A unioned with B. In the vector space, then, A implies B can be represented as the vector b - a.
In propositional calculus, the modus ponens rule allows us to conclude B from the two facts A and A implies B. In the vector space, a and b - a cancel to give b. In a chain of implication A implies B, B implies C, ..., Y implies Z, all the interior terms cancel, allowing us to conclude that A implies Z. Similarly in the vector space, the vectors (b - a) + (c - b) + ... + (z - y) simplify to the vector z - a. (Notice that addition is used as AND rather than OR when combining a with b - a; see the caption of Table 1 for why this is acceptable. At any rate, the notion of cancelling out with modus ponens still holds.) In this way, the system is able to carry out modus ponens deductive reasoning within the semantic vector space.
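The cancellation itself is exact, not approximate, and can be verified in two lines (the symbols a, b, c, z below are arbitrary stand-in vectors, not entities from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, c, z = (rng.standard_normal(300) for _ in range(4))

# Each implication is a difference vector; all interior terms cancel.
chain_sum = (b - a) + (c - b) + (z - c)
print(np.allclose(chain_sum, z - a))  # True
```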
Propositional calculus is less powerful than predicate calculus. In order to prove a goal of the form r(g, p), one must have, in addition to the triples in the knowledge base, Horn clauses which have r as the conclusion (i.e., the positive literal). If the facts in the knowledge base passed to the solver are limited to those whose relations participate in such Horn clauses, the chains of implication will tend to be more reasonable. In general, using this system as it currently stands requires restricting which predicates are allowed to participate in a solution. Instead of representing a implies b, we could represent the more informative statement r(a, b). Doing this requires using vectors that bind multiple concepts to roles, as in VSA. It is not yet clear how well the analogical or associational properties described below would work in such an architecture, however: it depends on the details of how binding is performed.
5 Analogical and Abductive Reasoning
The ability of distributional semantic vectors such as word2vec to find analogies is not peculiar to how such vectors are trained, but should be an expected property of any system that maps semantically similar concepts to similar high-dimensional vectors. Suppose we are given the following analogy to solve: bear : hiker :: shark : X. To make it simpler, consider contexts representing the ideas woods, sea, predator and tourist, and treat any other contexts as noise. The vector for bear, for instance, is some weighted average of woods (the mean of all vectors related to woods) and predator (the mean of all vectors related to predators), plus some noise. Thus we can rewrite the analogy as (woods + predator) : (woods + tourist) :: (sea + predator) : X.
The vector between bear and hiker is hiker - bear, which is approximately tourist - predator. This is very close to the vector from shark to snorkeler. These two vectors are so similar because the relations between the two pairs of words being connected are so similar. Since the system looks for any vector that will make the sum have as low an error as possible, it could choose the relation vector between bear and hiker to connect the concept shark to the concept snorkeler: the system can make use of analogical relations to complete a chain of argument. (When a direct chain of reasoning is possible, such links won’t happen: the analogy, being inexact, has a higher cost than the direct link.) This makes the system better at handling incompleteness in the knowledge base and more like human reasoning, where newly encountered concepts do not need to be exact matches to those in our memories in order for us to reason about them. In everyday thinking, analogy, association and abduction are frequently used together with deduction.
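This argument can be simulated directly. In the sketch below, each word is built as a sum of synthetic context directions plus noise, exactly as the decomposition above assumes (the context vectors and noise scale are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 300

def ctx():  # an underlying context direction, as a unit vector
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

def noise():
    return 0.02 * rng.standard_normal(d)

woods, sea, predator, tourist = ctx(), ctx(), ctx(), ctx()

# Each word is a blend of its contexts plus noise, as described above.
bear      = woods + predator + noise()
hiker     = woods + tourist  + noise()
shark     = sea   + predator + noise()
snorkeler = sea   + tourist  + noise()

r1 = hiker - bear           # approximately tourist - predator
r2 = snorkeler - shark      # approximately tourist - predator
cos = r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2))
print(cos > 0.8)  # True: the two relation vectors nearly coincide
```

Two random directions in 300 dimensions would have cosine similarity near zero, so the high similarity here comes entirely from the shared tourist - predator structure of the two relations.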
While it is possible to use the raw distributional vectors for terms themselves as entities in the vector space, we can also define other vectors in this space. The fact that the terms in a natural category like mammal tend to already be clustered in the semantic space means that the number of such terms that can be averaged into a category vector is somewhat larger than the results in Experiment 1 would suggest. We could also make use of the analogical properties of the semantic vector space to place other concepts that don’t appear in the corpora, if we know some of their attributes. These techniques are useful when attempting to embed a knowledge base into the semantic vector space, where the concepts in the knowledge base may not be named by a specific English word. (Along the same lines, a more intricate method of locating particular word senses in the vector space has been described in the literature.)
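A category vector of this kind can be sketched as the mean of its members' vectors. In the toy setup below, "members" cluster around a shared direction while distractors are random, an assumption standing in for the real clustering of, say, mammal terms:

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 300, 5

center = rng.standard_normal(d)
center /= np.linalg.norm(center)

# Members of a natural category cluster around a shared direction.
members = center + 0.03 * rng.standard_normal((k, d))
distractors = rng.standard_normal((100, d))
distractors /= np.linalg.norm(distractors, axis=1, keepdims=True)

category = members.mean(axis=0)  # the category vector, e.g. for "mammal"
vocab = np.vstack([members, distractors])
sims = (vocab @ category) / (np.linalg.norm(vocab, axis=1) * np.linalg.norm(category))
top_k = sorted(np.argsort(-sims)[:k].tolist())
print(top_k)  # [0, 1, 2, 3, 4]: the members, not the distractors
```

Because the members share a direction, averaging reinforces the signal and cancels the per-member noise, so the category vector stays close to every member at once.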
6 Ontology Merging
One of the major benefits of using an embedded deduction mechanism is that it simplifies the process of merging ontologies. If we are able to map both ontologies into the semantic vector space, then even if the same concept isn’t mapped to the exact same term, it will be mapped to a nearby term which may be good enough for the chain of reasoning to be found. For example, suppose one ontology contained the statement (bears, eat, grubs) and another contained the statement (insects, live in, dead trees). Neither ontology defines the relation of grubs to insects, but the system would be able to make the connection between bears and dead trees (answering the question “Why is the bear digging in a dead tree?” for example) because of the semantic similarity of grub to insect. Such a method would be especially useful when the ontology has not been hand built. Information extraction methods that extract triples from natural language sources, such as ReVerb, can be used to add facts to the knowledge base, without worrying too much about whether the entities to which triples refer are all expressed in the same way.
7 Answering Questions
The system as described so far has been finding a chain of reasoning connecting two terms: one “given”, and one “to prove.” (Deductive reasoning systems typically use either forward or backward inference. This system uses “middle-out” inference, which doesn’t begin at either end but is a holistic procedure happening all along the chain at once.) However, a knowledge base is usually queried with one or more variables, to find multiple possible chains that answer a query. If the possible answers can be limited to a smaller set, this system can also be used in this way, by having the “to prove” vector be a sum of all of the possible answers. For example, suppose the knowledge base contains the following statements:
(apple, hasColor, red), (apple, hasColor, yellow), (apple, hasColor, green)
and we want to know what colors apples have. We could put in the vector color - apple as the query, where color is a category vector averaging the vectors for many color terms, and the result picks out these three statements as highly relevant:
1.00 (apple, hasColor, red)
0.99 (apple, hasColor, green)
0.72 (apple, hasColor, yellow)
0.08 (cordon bleu, derivedFrom, blue) (Notice that this fourth, less relevant fact also relates a food to a color.)
Notice that the goal vector is a “category vector” as described above. Another way to get a particular type of result is by limiting the type of relations in the portion of the database that is searched. For example, if one wanted to know how B was caused, the search could be limited to those facts in the database related by causal predicates, such as causes, turns into, has side effect, and so forth. One way to do this, if the Horn clauses are known, is to find all relations which participate in a Horn clause that resolves to A causes B.
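The apple-color query can be sketched end to end in a few lines. As before, random unit vectors stand in for real embeddings, and the distractor fact and the set of color terms in the category vector are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 300

def unit():
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

names = ["apple", "red", "yellow", "green", "blue", "lemon", "sour"]
vec = {n: unit() for n in names}

facts = [("apple", "hasColor", "red"),
         ("apple", "hasColor", "yellow"),
         ("apple", "hasColor", "green"),
         ("lemon", "hasTaste", "sour")]
D = np.stack([vec[t] - vec[h] for h, _, t in facts], axis=1)

# Query: a category vector over color terms, minus the given entity.
color = np.mean([vec[c] for c in ["red", "yellow", "green", "blue"]], axis=0)
query = color - vec["apple"]

scores = D.T @ query
ranked = [facts[i] for i in np.argsort(-scores)]
print([f[1] for f in ranked])  # ['hasColor', 'hasColor', 'hasColor', 'hasTaste']
```

Each (apple, hasColor, x) vector shares both the -apple component and part of the color category with the query, so the three color facts score far above the unrelated fact.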
8 Ordering the Chain
The results of the sparse vector decomposition define which triples might participate in the chain, but they are unordered. (In fact, they may form a multistranded rope rather than a chain: the “elastic-net” parameter in LASSO can be used to encourage or discourage finding alternative equally good paths for part or all of the chain.) To arrange them in order, we use the following method. All entities that participate in a triple returned by the solver, as well as the input terms, are added to a complete directed graph. Edges corresponding to relations returned by the solver are given very low weights, while edges not included are weighted based on their distance in the semantic space. Then we find the least costly path from the head input term to the tail. (A slightly more complicated cost function can be used to encourage the lowest-cost path to follow analogical connections as well.) Although the system is capable of coming up with tree-like proofs to multiple entities connected by OR, we have not yet implemented a method for finding least-cost trees.
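A minimal sketch of this ordering step follows, using a pure-Python Dijkstra search. The fact list is hypothetical solver output, and a uniform HIGH cost stands in for the semantic distances the full system would use:

```python
import heapq

# Hypothetical unordered solver output: facts believed to lie on the chain.
solver_facts = [("e2", "p"), ("g", "e1"), ("e1", "e2")]
entities = {"g", "e1", "e2", "p"}

LOW, HIGH = 0.01, 1.0  # solver edges are cheap; every other edge is costly
cost = {(a, b): HIGH for a in entities for b in entities if a != b}
for edge in solver_facts:
    cost[edge] = LOW
# (In the full system HIGH would instead be each pair's semantic distance.)

def cheapest_path(src, dst):
    """Dijkstra over the complete directed graph defined by `cost`."""
    heap = [(0.0, src, [src])]
    seen = set()
    while heap:
        c, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt in entities:
            if nxt not in seen:
                heapq.heappush(heap, (c + cost[(node, nxt)], nxt, path + [nxt]))
    return None

ordered = cheapest_path("g", "p")
print(ordered)  # ['g', 'e1', 'e2', 'p']
```

Because every solver edge is far cheaper than any other edge, the least-cost path threads through exactly the returned facts, recovering their order along the chain.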
9 Experiments
LASSO, OMP and other sparse solvers are not guaranteed to find the optimal solution (which would be an NP-complete problem). Their performance depends on the size, dimensionality, and clustering of the data. We characterized how well LASSO performed for the vectors in our dataset. For all these experiments, we used the 300-dimensional word2vec vectors provided by Mikolov et al. We used L = 20 and lambda = 0.2 for the LASSO parameters.
9.1 Experiment 1
As noted in the section on propositional calculus, it is a curious property of high-dimensional vector spaces that the vector a + b will tend to be closer to a and to b than to other vectors in the space, assuming they are fairly well distributed. However, this property only holds for a few vectors being added together. In Table 1, we added from 1 to 10 randomly chosen term vectors, and found how frequently all of the summed vectors were present among the 20 nearest neighbors of the sum vector, for various dictionary sizes. For larger dictionaries, fewer of the summed terms are found because the dictionary more densely populates the space. LASSO does a better job of recovering the vectors in the sum. Far fewer than 20 vectors are usually chosen by LASSO, which is another big advantage.
9.2 Experiment 2
This experiment was similar to the previous one, but instead of adding term vectors we added fact vectors from the embedded KB of the form b - a. This is a more difficult problem for LASSO to solve because, for example, the vectors for (a, is a, b) and (a, part of, b) would be exactly equal and so unrecoverable except by chance, and there are effectively twice as many entities being added. For large dictionary sizes, even two fact terms could not be reliably found (see Table 3).
9.3 Experiment 3
This experiment measured how often the system was able to find a chain of reasoning linking a given head to a tail known to be reachable in 1 to 7 steps. We used a KB with 906,000 facts, formed of all the first-order facts in Cyc and ConceptNet in which both entities being related could be mapped to a vector in the word2vec space (either with a corresponding English word, or as a category vector).
10 Conclusion and Future Work
We have demonstrated how sparse decomposition methods can be used to find chains of reasoning in a knowledge graph embedded in a distributional vector space. In the future, we hope to evaluate the system on question answering datasets. The performance on longer chains needs to be improved. We would also like to find ways of integrating this method into more comprehensive cognitive architectures. The notion of antonymy in semantic vector spaces also needs a more careful treatment.
-  Baroni, M., Zamparelli, R.: Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. 2010 Conference on Empirical Methods in Natural Language Processing (pp. 1183-1193). ACL. (2010, October).
-  Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O. Translating embeddings for modeling multi-relational data. Advances in neural information processing systems (pp. 2787-2795). (2013).
-  Buchanan, B. G., Shortliffe, E. H.: Rule-Based Expert Systems: The MYCIN Experiments. Addison-Wesley, Reading, MA. (1984).
-  Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., Harshman, R.: Using latent semantic analysis to improve access to textual information. SIGCHI conference (pp. 281-285). ACM. (1988, May).
-  Ellerman, D.: The logic of partitions: introduction to the dual of the logic of subsets. The Review of Symbolic Logic, 3(2), 287-350. (2010).
-  Freitas, A., Curry, E.: Natural language queries over heterogeneous linked data graphs: A distributional-compositional semantics approach. 19th International Conference on Intelligent User Interfaces. ACM. (2014).
-  Gayler, R.: Vector Symbolic Architectures Answer Jackendoff’s Challenges for Cognitive Neuroscience. In Slezak, P., ed.: ICCS/ASCS International Conference on Cognitive Science. CogPrints, Sydney, U. of New South Wales, 133-138. (2003).
-  Grefenstette, E., Sadrzadeh, M. Experimental support for a categorical compositional distributional model of meaning. Conference on Empirical Methods in Natural Language Processing (pp. 1394-1404). ACL. (2011, July).
-  Kanerva, P.: Sparse distributed memory. MIT press. (1988).
-  Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., Fidler, S.: Skip-thought vectors. NIPS (pp. 3294-3302). (2015).
-  Knowlton, B., Morrison, R., Hummel, J., Holyoak, K.: A neurocomputational system for relational reasoning. Trends in cognitive sciences, 16(7), 373-381. (2012).
-  Lee, M., He, X., Yih, W. T., Gao, J., Deng, L., Smolensky, P.: Reasoning in vector space: An exploratory study of question answering. arXiv:1511.06426. (2015).
-  Levy, S. D.: Distributed Representation of Compositional Structure. Juan R. Rabuñal, Julian Dorado, and Alejandro Pazos (eds.), Encyclopedia of Artificial Intelligence. Hershey, Pennsylvania: IGI Publishing. (2008).
-  Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., Dean, J. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems (pp. 3111-3119). (2013)
-  Rocktäschel, T., Riedel, S.: Learning knowledge base inference with neural theorem provers. AKBC, 45-50. (2016).
-  Summers-Stay, D., Voss, C., Cassidy, T.: Using a distributional semantic vector space with a knowledge base for reasoning in uncertain conditions. Biologically Inspired Cognitive Architectures, 16, 34-44. (2016).
-  Turney, P. D. Measuring semantic similarity by latent relational analysis. arXiv preprint cs/0508053. (2005).
-  Levy, S. D., Gayler, R.: Vector Symbolic Architectures: A New Building Material for Artificial General Intelligence. First Conference on Artificial General Intelligence (AGI-08). IOS Press. (2008).
-  Wang, H., Onishi, T., Gimpel, K., McAllester, D.: Emergent Logical Structure in Vector Representations of Neural Readers. arXiv preprint arXiv:1611.07954. (2016).
-  West, R., Gabrilovich, E., Murphy, K., Sun, S., Gupta, R., Lin, D.: Knowledge base completion via search-based question answering. 23rd international conference on World wide web (pp. 515-526). ACM. (2014, April).
-  Widdows, D., Peters, S.: Word vectors and quantum logic: Experiments with negation and disjunction. Mathematics of language, 8(141-154). (2003).
-  Widdows, D., Cohen, T.: Reasoning with vectors: A continuous model for fast robust inference. Logic Journal of IGPL, jzu028. (2014)
-  Yu, M., Dredze, M.: Improving Lexical Embeddings with Semantic Knowledge. ACL (2) (pp. 545-550). (2014, June).
-  Zou, H., Hastie, T. : Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Statistical Methodology, 67(2), 301-320.(2005).