Entity-Relationship Search over the Web

10/08/2018 ∙ by Pedro Saleiro, et al. ∙ Universidade do Porto ∙ The University of Chicago ∙ The University of Nottingham

Entity-Relationship (E-R) Search is a complex case of Entity Search where the goal is to search for multiple unknown entities and relationships connecting them. We assume that an E-R query can be decomposed as a sequence of sub-queries, each containing keywords related to a specific entity or relationship. We adopt a probabilistic formulation of the E-R search problem. When creating specific representations for entities (e.g. context terms) and for pairs of entities (i.e. relationships), it is possible to create a graph of probabilistic dependencies between sub-queries and entity plus relationship representations. To the best of our knowledge this represents the first probabilistic model of E-R search. We propose and develop a novel supervised Early Fusion-based model for E-R search, the Entity-Relationship Dependence Model (ERDM). It uses the Markov Random Field framework to model term dependencies of E-R sub-queries and entity/relationship documents. We performed experiments with more than 800 million entity and relationship extractions from ClueWeb-09-B with FACC1 entity linking. We obtained promising results using 3 different query collections comprising 469 E-R queries, with results showing that it is possible to perform E-R search without using fixed and predefined entity and relationship types, enabling a wide range of queries to be addressed.







1. Introduction

In recent years, we have seen increased interest in using online information sources to find concise and precise information about specific issues, events, and entities rather than retrieving and reading entire documents and web pages. Modern search engines are now presenting entity cards, summarizing entity properties and related entities, to answer entity-bearing queries directly in the search engine result page. Examples of such queries are “Who founded Intel?” and “Works by Charles Rennie Mackintosh”.

Existing strategies for entity search can be divided into IR-centric and Semantic-Web-based approaches. The former usually rely on statistical language models to match and rank co-occurring terms in the proximity of the target entity (Balog et al., 2012b). The latter consists of creating a SPARQL query and using it over a structured knowledge base to retrieve relevant RDF triples (Heath and Bizer, 2011). Neither of these paradigms provides good support for entity-relationship (E-R) retrieval, i.e., searching for multiple unknown entities and relationships connecting them.

Contrary to traditional entity queries, E-R queries expect tuples of connected entities as answers. For instance, “US technology companies contracts Chinese electronics manufacturers” can be answered by tuples such as ⟨Apple, Foxconn⟩, while “Companies founded by disgraced Hollywood producer” expects tuples such as ⟨Miramax, Harvey Weinstein⟩. In essence, an E-R query can be decomposed into a set of sub-queries that specify types of entities and types of relationships between entities.

Recent work in Semantic-Web search tackled E-R retrieval by extending SPARQL to support joins of multiple query results and creating an extended knowledge graph (Yahya et al., 2016). Extracted entities and relationships are typically stored in a knowledge graph. However, it is not always convenient to rely on a structured knowledge graph with predefined and constraining entity types.

In particular, we are interested in transient information sources, such as online news or social media. General purpose knowledge graphs are usually fed with more stable and reliable data sources (e.g. Wikipedia). Furthermore, predefining and constraining entity and relationship types, such as in Semantic Web-based approaches, reduces the range of queries that can be answered and therefore limits the usefulness of entity search, particularly when one wants to leverage free-text.

To the best of our knowledge, E-R retrieval using IR-centric approaches is a new and unexplored research problem within the Information Retrieval research community. One of the objectives of our research is to explore to what degree we can leverage the textual context of entities and relationships, i.e., co-occurring terminology, to relax the notion of an entity or relationship type.

Instead of being characterized by a fixed type, e.g., person, country, place, the entity would be characterized by any contextual term. The same applies to relationships. Traditional knowledge graphs have a fixed schema of relationships, e.g. child of, created by, works for, while our approach relies on contextual terms in the text proximity of every two co-occurring entities in a raw document. Relationship descriptions such as “criticizes”, “hits back”, “meets” or “interested in” would then become searchable. This is expected to significantly reduce the limitations which structured approaches suffer from, enabling a wider range of queries to be addressed.

We assume that an E-R query can be formulated as a sequence of individual sub-queries, each targeting a specific entity or relationship. If we create specific representations for entities (e.g. context terms) as well as for pairs of entities, i.e. relationships, then we can create a graph of probabilistic dependencies between sub-queries and entity/relationship representations. We show that these dependencies can be depicted in a probabilistic graphical model, i.e. a Bayesian network. Therefore, answering an E-R query can be reduced to a computation of factorized conditional probabilities over a graph of sub-queries and entity/relationship documents.

However, it is not possible to compute these conditional probabilities directly from raw documents in a collection. As with traditional entity retrieval, documents serve as proxies for entity (and relationship) representations. It is necessary to fuse information spread across multiple documents. We propose an early fusion approach to create entity and relationship centric document representations. It consists in aggregating context terms of entity and relationship occurrences to create two dedicated indexes, the entity index and the relationship index. Then it is possible to use any retrieval method to compute the relevance score of entity and relationship documents given the E-R sub-queries. Furthermore, we propose the Entity-Relationship Dependence Model, a novel early-fusion supervised model based on the Markov Random Field framework for retrieval. We performed experiments at scale with results showing that it is possible to perform E-R retrieval without using fixed and predefined entity and relationship types, enabling a wide range of queries to be addressed.

2. Related Work

2.1. Entity and Relationship Search

Entity Search differs from traditional document search in the search unit. While document search considers a document as the atomic response to a query, in Entity Search document boundaries are not so important and entities need to be identified based on their occurrence in documents (Adafre et al., 2007). The focus level is more granular, as the objective is to search and rank entities among documents. However, traditional Entity Search does not exploit semantic relationships between terms in the query and in the collection of documents, i.e. if there is no match between query terms and terms describing the entity, relevant entities tend to be missed.

Entity Search has been an active research topic in the last decade, including various specialized tracks, such as the Expert finding track (Chen et al., 2006), the INEX entity ranking track (Demartini et al., 2009), the TREC entity track (Balog et al., 2010) and the SIGIR EOS workshop (Balog et al., 2012a). Previous research faced two major challenges: entity representation and entity ranking. Entities are complex objects composed of different properties and they are mentioned in a variety of contexts through time. Consequently, there is no single definition of the atomic unit (entity) to be retrieved. Additionally, it is a challenge to devise entity rankings that use various entity representation approaches and tackle different information needs.

There are two main approaches for tackling Entity Search: Model 1 or “profile based approach” and Model 2 or “voting approach” (Balog et al., 2006). The “profile based approach” starts by applying NER and NED to the collection in order to extract all entity occurrences. Then, for each entity identified, a meta-document is created by concatenating every passage in which the entity occurs. An index of entity meta-documents is created and a standard document ranking method (e.g. BM25) is applied to rank meta-documents with respect to a given query (Azzopardi et al., 2005; Craswell et al., 2005). One of the main challenges of this approach is the transformation of original text documents into an entity-centric meta-document index, including pre-processing the collection in order to extract all entities and their context.
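The profile-based pipeline above can be sketched in a few lines. This is a toy illustration with invented documents; simple term counting stands in for a proper ranking function such as BM25, and the entity annotations are assumed to come from an upstream NER/NED step:

```python
from collections import defaultdict

def build_meta_documents(docs):
    """docs: list of (passage_text, entities_mentioned) pairs, entity linking assumed done.
    Concatenates every passage an entity occurs in into one meta-document per entity."""
    meta = defaultdict(list)
    for text, entities in docs:
        for e in entities:
            meta[e].append(text)
    return {e: " ".join(passages) for e, passages in meta.items()}

def rank_entities(meta, query):
    """Rank entity meta-documents by raw query-term frequency (BM25 stand-in)."""
    q_terms = set(query.lower().split())
    scores = {e: sum(d.lower().split().count(t) for t in q_terms)
              for e, d in meta.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy collection (invented passages and entity annotations).
docs = [
    ("Gordon Moore co-founded Intel in 1968.", ["Gordon Moore", "Intel"]),
    ("Intel ships new processors.", ["Intel"]),
    ("Gordon Moore formulated Moore's law.", ["Gordon Moore"]),
]
meta = build_meta_documents(docs)
print(rank_entities(meta, "founded Intel")[0])
```

The meta-document index replaces the raw document index; any standard document retrieval model can then be applied unchanged.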

In the “voting approach”, the query is processed as in typical document search to obtain an initial list of documents (Balog et al., 2006; Ru et al., 2005). Entities are extracted from these documents using NER and NED techniques. Then, score functions are calculated to estimate the relation between the captured entities and the initial query, for instance, counting the frequency of occurrence of the entity in the top documents combined with each document's score (relevance to the query) (Balog et al., 2006). Another approach consists in taking into account the distance between the entity mention and the query terms in the documents (Petkova and Croft, 2007).

Recently, there is increasing research interest in Entity Search over Linked Data, also referred to as Semantic Search, due to the availability of structured information about entities and relations in the form of Knowledge Bases (Bron et al., 2013; Zong et al., 2015; Zhiltsov et al., 2015). Semantic Search exploits rich entity-related structure in machine readable RDF format, expressed as triples (entity, predicate, object). There are two types of search: keyword-based and natural language based search (Pound et al., 2012; Unger et al., 2012). Regardless of the search type, the objective is to interpret the semantic structure of queries and translate it to the underlying schema of the target Knowledge Base. Most of the research focuses on interpreting the query intent (Pound et al., 2012; Unger et al., 2012), while others focus on how to devise a ranking framework that deals with similarities between different attributes of the entity entry in the KB and the query terms (Zhiltsov et al., 2015).

Li et al. (Li et al., 2012) were the first to study relationship queries for structured querying of entities over Wikipedia text with multiple predicates. This work used a query language with typed variables, for both entities and entity pairs, that integrates text conditions. It first computes individual predicates and then aggregates multiple predicate scores into a result score. The proposed method to score predicates relies on redundant co-occurrence contexts.

Yahya et al. (Yahya et al., 2016) defined relationship queries as SPARQL-like subject-predicate-object (SPO) queries joined by one or more relationships. The authors cast this problem into a structured query language (SPARQL) and extended it to support textual phrases for each of the SPO arguments, thereby allowing both structured SPARQL-like triples and text to be combined simultaneously. Their TriniT system extended the YAGO knowledge base with triples extracted from ClueWeb using an Open Information Extraction approach (Schmitz et al., 2012).

In the scope of relational databases, keyword-based graph search has been widely studied, including ranking (Yu et al., 2009). However, these approaches do not consider full documents of graph nodes and are limited to structured data. While searching over structured data is precise, it can be limited in various respects. In order to increase recall when no results are returned, and to enable prioritization of results when there are too many, Elbassuoni et al. (Elbassuoni et al., 2009) propose a language-model for ranking results. Similarly, models like EntityRank by Cheng et al. (Cheng et al., 2007) and Shallow Semantic Queries by Li et al. (Li et al., 2012) relax the predicate definitions in the structured queries and, instead, implement proximity operators to bind the instances across entity types. Yahya et al. (Yahya et al., 2016) propose algorithms for the application of a set of relaxation rules that yield higher recall.

Web documents contain term information that can be used to apply pattern heuristics and statistical analysis, often used to infer entities, as investigated by Conrad and Utt (Conrad and Utt, 1994), Petkova and Croft (Petkova and Croft, 2007) and Rennie and Jaakkola (Rennie and Jaakkola, 2005). In fact, early work by Conrad and Utt (Conrad and Utt, 1994) demonstrates a method that retrieves entities located in the proximity of a given keyword. They show that using a fixed-size window around proper names can be effective for supporting search for people and finding relationships among entities. Similar considerations of co-occurrence statistics have been used to identify salient terminology, i.e. keywords to include in the document index (Petkova and Croft, 2007).

Existing approaches to the problem of entity-relationship (E-R) search are limited by pre-defined sets of both entity and relationship types. In this work, we generalize the problem to allow the search for entities and relationships without any restriction to a given set and we propose an IR-centric approach to address it.

2.2. Markov Random Field for Retrieval

The Markov Random Field (MRF) model for retrieval was first proposed by Metzler and Croft (Metzler and Croft, 2005) to model query term and document dependencies. In the context of retrieval, the objective is to rank documents by computing the posterior $P(D|Q)$, given a document $D$ and a query $Q$:

$$P(D|Q) = \frac{P(Q,D)}{P(Q)} \overset{rank}{=} P(Q,D)$$
For that purpose, a MRF is constructed from a graph $G$, which follows the local Markov property: every random variable in $G$ is independent of its non-neighbors given observed values for its neighbors. Therefore, different edge configurations imply different independence assumptions.

Figure 1. Markov Random Field document and term dependencies.

Metzler and Croft (Metzler and Croft, 2005) defined a graph $G$ that consists of query term nodes $q_i$ and a document node $D$, as depicted in Figure 1. The joint probability mass function over the random variables in $G$ is defined by:

$$P_\Lambda(Q, D) = \frac{1}{Z_\Lambda} \prod_{c \in C(G)} \psi(c; \Lambda)$$

where $q_1, \ldots, q_n$ are the query term nodes, $D$ is the document node, $C(G)$ is the set of maximal cliques in $G$, and $\psi(c; \Lambda)$ is a non-negative potential function over clique configurations, parameterized by $\Lambda$. $Z_\Lambda = \sum_{Q,D} \prod_{c \in C(G)} \psi(c; \Lambda)$ is the partition function that normalizes the distribution. It is generally infeasible to compute $Z_\Lambda$, due to the exponential number of terms in the summation, and it is ignored as it does not influence ranking.

The potential functions are defined as compatibility functions between nodes in a clique. For instance, a tf-idf score can be measured to reflect the “aboutness” between a query term $q_i$ and a document $D$. Metzler and Croft (Metzler and Croft, 2005) propose to associate one or more real valued feature functions $f(c)$ with each clique $c$ in the graph. The non-negative potential functions are defined using an exponential form $\psi(c; \Lambda) = \exp[\lambda_c f(c)]$, where $\lambda_c$ is a feature weight, which is a free parameter in the model, associated with feature function $f(c)$. The model allows parameter and feature function sharing across cliques of the same configuration, i.e. same size and type of nodes (e.g. 2-cliques of one query term node and one document node).

For each query $Q$, we construct a graph representing the query term dependencies, define a set of non-negative potential functions over the cliques of this graph and rank documents in descending order of $P_\Lambda(D|Q)$:

$$P_\Lambda(D|Q) \overset{rank}{=} \sum_{c \in C(G)} \lambda_c f(c)$$
Metzler and Croft concluded that given its general form, the MRF can emulate most of the retrieval and dependence models, such as language models (Song and Croft, 1999).

2.2.1. Sequential Dependence Model

The Sequential Dependence Model (SDM) is the most popular variant of the MRF retrieval model (Metzler and Croft, 2005). It defines two clique configurations, represented by the potential functions $\psi(q_i, D; \Lambda)$ and $\psi(q_i, q_{i+1}, D; \Lambda)$. Basically, it considers sequential dependency between adjacent query terms and the document node.

The potential function of the 2-cliques containing a query term node and a document node is represented as $\psi(q_i, D; \Lambda) = \exp[\lambda_T f_T(q_i, D)]$. The clique configuration containing contiguous query terms and a document node is represented by two real valued functions. The first, $f_O$, considers exact ordered matches of the two query terms in the document, while the second, $f_U$, aims to capture unordered matches within fixed window sizes. Consequently, the second potential function is $\psi(q_i, q_{i+1}, D; \Lambda) = \exp[\lambda_O f_O(q_i, q_{i+1}, D) + \lambda_U f_U(q_i, q_{i+1}, D)]$.

Replacing these potential functions in the ranking equation above and factoring out the parameters $\lambda$, the SDM can be represented as a mixture model computed over term, phrase and proximity feature classes:

$$P(D|Q) \overset{rank}{=} \lambda_T \sum_{q_i \in Q} f_T(q_i, D) + \lambda_O \sum_{i=1}^{|Q|-1} f_O(q_i, q_{i+1}, D) + \lambda_U \sum_{i=1}^{|Q|-1} f_U(q_i, q_{i+1}, D)$$

where the free parameters must follow the constraint $\lambda_T + \lambda_O + \lambda_U = 1$. Coordinate Ascent was chosen to learn the optimal values that maximize mean average precision using training data (Metzler and Croft, 2007). Considering $tf_{q,D}$ the frequency of the term(s) in the document $D$ and $cf_{q}$ the frequency of the term(s) in the entire collection $C$, the feature functions in SDM are set as:

$$f_T(q_i, D) = \log\left[\frac{tf_{q_i,D} + \mu \frac{cf_{q_i}}{|C|}}{|D| + \mu}\right]$$

$$f_O(q_i, q_{i+1}, D) = \log\left[\frac{tf_{\#1(q_i, q_{i+1}),D} + \mu \frac{cf_{\#1(q_i, q_{i+1})}}{|C|}}{|D| + \mu}\right]$$

$$f_U(q_i, q_{i+1}, D) = \log\left[\frac{tf_{\#uwN(q_i, q_{i+1}),D} + \mu \frac{cf_{\#uwN(q_i, q_{i+1})}}{|C|}}{|D| + \mu}\right]$$

where $\mu$ is the Dirichlet prior for smoothing, $\#1(q_i, q_{i+1})$ is a function that searches for exact matches of the phrase “$q_i$ $q_{i+1}$” and $\#uwN(q_i, q_{i+1})$ is a function that searches for co-occurrences of $q_i$ and $q_{i+1}$ within a window of N terms (usually 8 terms) across document $D$. SDM has shown state-of-the-art performance in ad-hoc document retrieval when compared with several bigram dependence models and standard bag-of-words retrieval models, across short and long queries (Huston and Croft, 2014).
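A minimal sketch of SDM scoring under these definitions, with simplifications: documents are token lists, and the collection statistic for bigrams is replaced by a small constant prior rather than true ordered/unordered window counts over the whole collection:

```python
import math
from collections import Counter

def f_T(term, doc, coll, mu=2500):
    """Dirichlet-smoothed unigram log-probability (SDM term feature)."""
    cf = coll[term] / max(1, sum(coll.values()))
    return math.log((doc.count(term) + mu * cf) / (len(doc) + mu))

def count_ordered(doc, a, b):
    """Exact ordered matches of the phrase 'a b' (the #1 operator)."""
    return sum(1 for i in range(len(doc) - 1) if doc[i] == a and doc[i + 1] == b)

def count_unordered(doc, a, b, N=8):
    """Co-occurrences of a and b within a fixed window of N terms (the #uwN operator)."""
    return sum(1 for i, t in enumerate(doc)
               if t == a and b in doc[max(0, i - N + 1): i + N])

def sdm_score(query, doc, coll, lambdas=(0.85, 0.1, 0.05), mu=2500):
    lT, lO, lU = lambdas
    score = lT * sum(f_T(q, doc, coll) for q in query)
    for a, b in zip(query, query[1:]):
        cf = 1 / max(1, sum(coll.values()))  # simplification: constant bigram prior
        score += lO * math.log((count_ordered(doc, a, b) + mu * cf) / (len(doc) + mu))
        score += lU * math.log((count_unordered(doc, a, b) + mu * cf) / (len(doc) + mu))
    return score
```

With toy documents, a document containing the query bigram in order outranks one without it, since all three feature classes contribute higher values.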

2.2.2. MRF for Entity Retrieval

The current state-of-the-art methods in ad-hoc entity retrieval from knowledge graphs are based on MRF (Zhiltsov et al., 2015; Nikolaev et al., 2016). The Fielded Sequential Dependence Model (FSDM) (Zhiltsov et al., 2015) extends SDM for structured document retrieval and is applied to entity retrieval from knowledge graphs. In this context, entity documents are composed of fields representing metadata about the entity. Each entity document has five fields: names, attributes, categories, similar entity names and related entity names. FSDM builds individual language models for each field in the knowledge base. This corresponds to replacing SDM feature functions with those of the Mixture of Language Models (Ogilvie and Callan, 2003). The feature functions of FSDM are defined as:

$$\tilde{f}_T(q_i, D) = \log \sum_{j} w_j^T \, \frac{tf_{q_i, D_j} + \mu_j \frac{cf_{q_i, j}}{|C_j|}}{|D_j| + \mu_j}$$

with analogous definitions for $\tilde{f}_O$ and $\tilde{f}_U$, where $\mu_j$ are the Dirichlet priors for each field $j$ and $w_j$ are the weights for each field, which must be non-negative with the constraint $\sum_j w_j = 1$. Coordinate Ascent was used in two stages to learn the $\lambda$ and $w_j$ values (Zhiltsov et al., 2015).

The Parameterized Fielded Sequential Dependence Model (PFSDM) (Nikolaev et al., 2016) extends the FSDM by dynamically calculating the field weights for different query terms. Part-of-speech features are applied to capture the relevance of query terms to specific fields of entity documents. For instance, the NNP feature is positive if query terms are proper nouns, in which case the query terms should be mapped to the names field. The field weight contribution of a given query term $q_i$, or query bigram $q_i, q_{i+1}$, in a field $j$ is a linear weighted combination of features:

$$w_{q_i, j} = \sum_{k} \alpha_{j,k} \, \phi_k(q_i, j)$$

where $\phi_k(q_i, j)$ is the $k$-th feature function of a query unigram for field $j$ and $\alpha_{j,k}$ is its respective weight; for bigrams, $\phi_k(q_i, q_{i+1}, j)$ is the feature function of a query bigram for field $j$, with its respective weight $\beta_{j,k}$. Consequently, PFSDM has $F \cdot U + F \cdot B + 3$ total parameters, where $F$ is the number of fields, $U$ is the number of field mapping features for unigrams, $B$ is the number of field mapping features for bigrams, and the remaining three are the $\lambda$ parameters. Their estimation is performed in a two stage optimization. First, the field mapping parameters are learned separately for unigrams and then for bigrams; this is achieved by setting the corresponding $\lambda$ parameters to zero. In the second stage, the $\lambda$ parameters are learned. Coordinate Ascent is used in both stages.

The ELR model exploits entity mentions in queries by defining a dependency between entity documents and entity links in the query (Hasibi et al., 2016).

3. Modeling Entity-Relationship Retrieval

E-R retrieval is a complex case of entity retrieval. E-R queries expect tuples of related entities as results instead of a single ranked list of entities as it happens with general entity queries. For instance, the E-R query “Ethnic groups by country” is expecting a ranked list of tuples ⟨ethnic group, country⟩ as results. The goal is to search for multiple unknown entities and relationships connecting them.

$Q$: E-R query (e.g. “congresswoman hits back at US president”).
$Q^{E_i}$: Entity sub-query in $Q$ (e.g. “congresswoman”).
$Q^{R_{i-1,i}}$: Relationship sub-query in $Q$ (e.g. “hits back at”).
$D^{E_j}$: Term-based representation of an entity (e.g. $D^{⟨Frederica Wilson⟩}$ = {representative, congresswoman}). We use the terminology representation and document interchangeably.
$D^{R_{jk}}$: Term-based representation of a relationship (e.g. $D^{⟨Frederica Wilson, Donald Trump⟩}$ = {hits, back}). We use the terminology representation and document interchangeably.
$\{Q^E\}$: The set of entity sub-queries in an E-R query (e.g. {“congresswoman”, “US president”}).
$\{Q^R\}$: The set of relationship sub-queries in an E-R query.
$\{D^E\}$: The set of entity documents to be retrieved by an E-R query.
$\{D^R\}$: The set of relationship documents to be retrieved by an E-R query.
$|Q|$: E-R query length, corresponding to the number of entity and relationship sub-queries.
$E_T$: The entity tuple to be retrieved (e.g. ⟨Frederica Wilson, Donald Trump⟩).
Table 1. E-R retrieval definitions.

In this section, we present a definition of E-R queries and a probabilistic formulation of the E-R retrieval problem from an Information Retrieval perspective. Table 1 presents several definitions that will be used throughout this work.

3.1. E-R Queries

E-R queries aim to obtain an ordered list of entity tuples as a result. Contrary to entity search queries where the expected result is a ranked list of single entities, results of E-R queries should contain two or more entities. For instance, the complex information need “Silicon Valley companies founded by Harvard graduates” expects entity-pairs (2-tuples) ⟨company, founder⟩ as results. In turn, “European football clubs in which a Brazilian player won a trophy” expects triples (3-tuples) ⟨club, player, trophy⟩ as results.

Each pair of entities $E_i$, $E_{i+1}$ in an entity tuple is connected by a relationship $R_{i,i+1}$. A complex information need can be expressed in a relational format, which is decomposed into a set of sub-queries that specify types of entities and types of relationships between entities.

For each relationship sub-query there must be two sub-queries, one for each of the entities involved in the relationship. Thus an E-R query $Q$ that expects 2-tuples is mapped into a triple of sub-queries ⟨$Q^{E_1}$, $Q^{R_{1,2}}$, $Q^{E_2}$⟩, where $Q^{E_1}$ and $Q^{E_2}$ are the entity attributes queried for $E_1$ and $E_2$ respectively, and $Q^{R_{1,2}}$ is a relationship attribute describing $R_{1,2}$.

If we consider an E-R query as a chain of entity and relationship sub-queries $Q^{E_1}$, $Q^{R_{1,2}}$, $Q^{E_2}$, …, $Q^{R_{n-1,n}}$, $Q^{E_n}$, and we define the length $|Q|$ of an E-R query as the number of sub-queries, then the number of entity sub-queries must be $\frac{|Q|+1}{2}$ and the number of relationship sub-queries equal to $\frac{|Q|-1}{2}$. Consequently, the size of each entity tuple to be retrieved must be equal to the number of entity sub-queries. For instance, the E-R query “soccer players who dated a top model”, with answers such as ⟨Cristiano Ronaldo, Irina Shayk⟩, is represented as three sub-queries ⟨soccer players, dated, top model⟩.
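The chain structure of sub-queries and the counting argument above can be captured in a small helper. This data structure is illustrative only, not part of the original formulation:

```python
from dataclasses import dataclass

@dataclass
class ERQuery:
    """An E-R query as an interleaved chain: E1, R(1,2), E2, ..., R(n-1,n), En."""
    sub_queries: list  # alternating entity / relationship keyword strings

    def __post_init__(self):
        # A valid chain starts and ends with an entity sub-query, so |Q| is odd.
        assert len(self.sub_queries) % 2 == 1, "chain must start and end with entities"

    @property
    def length(self):  # |Q|: total number of sub-queries
        return len(self.sub_queries)

    @property
    def entity_sub_queries(self):
        return self.sub_queries[0::2]  # (|Q| + 1) / 2 of them

    @property
    def relationship_sub_queries(self):
        return self.sub_queries[1::2]  # (|Q| - 1) / 2 of them

q = ERQuery(["soccer players", "dated", "top model"])
print(q.length, len(q.entity_sub_queries), len(q.relationship_sub_queries))  # 3 2 1
```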

Automatic mapping of terms from an E-R query $Q$ to sub-queries $Q^{E_i}$ or $Q^{R_{i-1,i}}$ is out of the scope of this work and can be seen as a problem of query understanding (Yahya et al., 2012; Pound et al., 2012; Sawant and Chakrabarti, 2013). We assume that the information needs are decomposed into constituent entity and relationship sub-queries using Natural Language Processing techniques or by user input through an interface that enforces the structure $Q^{E_1}$, $Q^{R_{1,2}}$, $Q^{E_2}$, …, $Q^{R_{n-1,n}}$, $Q^{E_n}$.

3.2. Bayesian E-R Retrieval

Our approach to E-R retrieval assumes that we have a raw document collection (e.g. news articles) and that each raw document $d_i$ is associated with one or more entities $E_j$. In other words, documents contain mentions of one or more entities that can be related between them. Since our goal is to retrieve tuples of related entities given an E-R query that expresses entity attributes and relationship attributes, we need to create term-based representations for both entities and relationships. We denote the representation of an entity $E_j$ as $D^{E_j}$.

In E-R retrieval we are interested in retrieving tuples of entities as a result. The number of entities in each tuple can be two, three or more depending on the structure of the particular E-R query. When an E-R query aims to get tuples of more than two entities, we assume it is possible to combine tuples of length two. For instance, we can associate two tuples of length two that share the same entity to retrieve a tuple of length three. Therefore we create representations of relationships as pairs of entities. We denote the representation of a relationship between entities $E_j$ and $E_k$ as $D^{R_{jk}}$.

Considering the example query “Which spiritual leader won the same award as a US vice president?”, it can be formulated in the relational format as ⟨spiritual leader, won, award, won, US vice president⟩. Associating the tuples of length two ⟨Dalai Lama, Nobel Peace Prize⟩ and ⟨Nobel Peace Prize, Al Gore⟩ would result in the expected 3-tuple ⟨Dalai Lama, Nobel Peace Prize, Al Gore⟩.
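Joining length-two tuples on a shared entity, as described above, can be sketched as a simple relational join (toy example):

```python
def join_pairs(pairs_a, pairs_b):
    """Join pairs (x, y) from one relationship with pairs (y, z) from the next,
    on the shared middle entity, producing 3-tuples (x, y, z)."""
    return [(x, y, z) for (x, y) in pairs_a
                      for (y2, z) in pairs_b if y == y2]

won_award = [("Dalai Lama", "Nobel Peace Prize")]
award_won_by = [("Nobel Peace Prize", "Al Gore")]
print(join_pairs(won_award, award_won_by))
# [('Dalai Lama', 'Nobel Peace Prize', 'Al Gore')]
```

Repeating the join extends the result to tuples of arbitrary length, one relationship at a time.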

For the sake of clarity we now consider an example E-R query with three sub-queries ($|Q| = 3$). This query aims to retrieve a tuple of length two, i.e. a pair of entities connected by a relationship. Based on the definition of an E-R query, each entity in the resulting tuple must be relevant to the corresponding entity sub-queries $Q^{E_1}$ and $Q^{E_2}$. Moreover, the relationship between the two entities must also be relevant to the relationship sub-query $Q^{R_{1,2}}$. Instead of calculating a simple posterior $P(D|Q)$ as with traditional information retrieval, in E-R retrieval the objective is to rank tuples based on a joint posterior of multiple entity and relationship representations given an E-R query, such as $P(D^{E_1}, D^{R_{1,2}}, D^{E_2} \mid Q^{E_1}, Q^{R_{1,2}}, Q^{E_2})$ when $|Q| = 3$.

E-R queries can be seen as chains of interleaved entity and relationship sub-queries. We take advantage of the chain rule to formulate the joint probability $P(D^{E_1}, D^{R_{1,2}}, D^{E_2}, Q^{E_1}, Q^{R_{1,2}}, Q^{E_2})$ as a product of conditional probabilities. Formally, we want to rank entity and relationship candidates in descending order of the joint posterior as:

$$P(D^{E_1}, D^{R_{1,2}}, D^{E_2} \mid Q^{E_1}, Q^{R_{1,2}}, Q^{E_2}) \overset{rank}{=} P(D^{R_{1,2}} \mid Q^{R_{1,2}}) \, P(D^{E_1} \mid Q^{E_1}, D^{R_{1,2}}) \, P(D^{E_2} \mid Q^{E_2}, D^{R_{1,2}})$$
We consider conditional independence between entity representations within the joint posterior, i.e., the probability of a given entity representation being relevant given an E-R query is independent of knowing that another entity is relevant as well. As an example, consider the query “action movies starring a British actor”. Retrieving entity representations for “action movies” is independent of knowing that ⟨Tom Hardy⟩ is relevant to the sub-query “British actor”. However, it is not independent of knowing the set of relevant relationships for the sub-query “starring”. If a given action movie is not in the set of relevant entity-pairs for “starring”, it does not make sense to consider it as relevant. Consequently, $P(D^{E_1} \mid Q^{E_1}, D^{R_{1,2}}, D^{E_2}) = P(D^{E_1} \mid Q^{E_1}, D^{R_{1,2}})$.

Since E-R queries can be decomposed into constituent entity and relationship sub-queries, ranking candidate tuples using the joint posterior is rank-proportional to the product of conditional probabilities on the corresponding entity and relationship sub-queries $Q^{E_1}$, $Q^{R_{1,2}}$ and $Q^{E_2}$.

We now consider a longer E-R query aiming to retrieve a triple of connected entities. This query has three entity sub-queries and two relationship sub-queries, thus $|Q| = 5$. As we previously explained, when there is more than one relationship sub-query we need to join entity-pairs relevant to each relationship sub-query that have one entity in common. From a probabilistic point of view this can be seen as conditional dependence on the entity-pairs retrieved for the previous relationship sub-query, i.e. $P(D^{R_{2,3}} \mid Q^{R_{2,3}}, D^{R_{1,2}})$. To rank entity and relationship candidates we need to calculate the following joint posterior:

$$P(D^{E_1}, D^{R_{1,2}}, D^{E_2}, D^{R_{2,3}}, D^{E_3} \mid Q^{E_1}, \ldots, Q^{E_3}) \overset{rank}{=} P(D^{R_{1,2}} \mid Q^{R_{1,2}}) \, P(D^{R_{2,3}} \mid Q^{R_{2,3}}, D^{R_{1,2}}) \, P(D^{E_1} \mid Q^{E_1}, D^{R_{1,2}}) \, P(D^{E_2} \mid Q^{E_2}, D^{R_{1,2}}, D^{R_{2,3}}) \, P(D^{E_3} \mid Q^{E_3}, D^{R_{2,3}})$$
When compared to the previous example, the joint posterior for $|Q| = 5$ shows that entity candidates for $D^{E_2}$ are conditionally dependent on both $D^{R_{1,2}}$ and $D^{R_{2,3}}$. In other words, entity candidates for $E_2$ must belong to the entity-pair candidates for both relationship representations that are connected with $E_2$, i.e. $D^{R_{1,2}}$ and $D^{R_{2,3}}$.

We are now able to make a generalization of E-R retrieval as a factorization of conditional probabilities of a joint probability of entity representations $\{D^E\}$, relationship representations $\{D^R\}$, entity sub-queries $\{Q^E\}$ and relationship sub-queries $\{Q^R\}$. This set of random variables and their conditional dependencies can be easily represented in a probabilistic directed acyclic graph, i.e. a Bayesian network (Pearl, 1985).

In Bayesian networks, nodes represent random variables while edges represent conditional dependencies. The nodes that point to a given node are its parents. Bayesian networks define the joint probability of a set of random variables as a factorization of the conditional probability of each random variable conditioned on its parents. Formally, $P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid pa(X_i))$, where $pa(X_i)$ represents all parent nodes of $X_i$.

Figure 2 depicts the representation of E-R retrieval for different query lengths using Bayesian networks. The graphical representation helps to establish a few guidelines for modeling E-R retrieval. First, each sub-query points to the respective document node. Second, relationship document nodes always point to the contiguous entity representations. Last, when there is more than one relationship sub-query, relationship documents also point to the subsequent relationship document.

Figure 2. Bayesian networks for E-R Retrieval with queries of different lengths.

Once we draw the graph structure for the number of sub-queries in $Q$ we are able to compute a product of conditional probabilities of each node given its parents. Adapting the general joint probability formulation of Bayesian networks to E-R retrieval, we come up with the following generalization:

$$P(\{D^E\}, \{D^R\} \mid \{Q^E\}, \{Q^R\}) \overset{rank}{=} \prod_{i=2}^{n} P(D^{R_{i-1,i}} \mid Q^{R_{i-1,i}}, D^{R_{i-2,i-1}}) \prod_{i=1}^{n} P(D^{E_i} \mid Q^{E_i}, D^{R_{i-1,i}}, D^{R_{i,i+1}})$$

where conditioning terms that refer to non-existent neighboring nodes (e.g. $D^{R_{0,1}}$) are dropped.
We denote $\{D^R\}$ as the set of all candidate relationship documents in the graph and $\{D^E\}$ as the set of all candidate entity documents in the graph. In Information Retrieval it is often convenient to work in log-space, as it does not affect ranking and transforms the product of conditional probabilities into a summation, as follows:

$$\log P(\{D^E\}, \{D^R\} \mid \{Q^E\}, \{Q^R\}) \overset{rank}{=} \sum_{i=2}^{n} \log P(D^{R_{i-1,i}} \mid Q^{R_{i-1,i}}, D^{R_{i-2,i-1}}) + \sum_{i=1}^{n} \log P(D^{E_i} \mid Q^{E_i}, D^{R_{i-1,i}}, D^{R_{i,i+1}})$$
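Scoring one candidate tuple in log-space then amounts to summing log conditional probabilities. A schematic sketch with placeholder probability values (in practice these would come from the retrieval model):

```python
import math

def log_posterior(entity_probs, relationship_probs):
    """Log-space score of one candidate tuple: the sum of log conditional
    probabilities of its entity and relationship documents (values in (0, 1])."""
    return (sum(math.log(p) for p in entity_probs)
            + sum(math.log(p) for p in relationship_probs))

# Two hypothetical candidate pairs for a |Q| = 3 query.
strong = log_posterior([0.9, 0.8], [0.7])   # both entities and the relationship match well
weak = log_posterior([0.1, 0.2], [0.3])     # poor matches throughout
print(strong > weak)  # True
```

Because the logarithm is monotonic, sorting candidates by this sum produces the same ranking as sorting by the product of probabilities.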
In essence, E-R retrieval is an extension, or a more complex case, of object-retrieval where besides ranking objects we need to rank tuples of objects that satisfy the relationship expressed in the E-R query. This requires creating representations of both entities and relationships by fusing information spread across multiple raw documents. We hypothesize that it should be possible to generalize the term dependence models to represent entity-relationships and achieve effective E-R retrieval without entity or relationship type restrictions (e.g. categories) as it happens with the Semantic Web based approaches.

4. Early Fusion

Traditional ad-hoc document retrieval approaches create direct term-based representations of raw documents. A retrieval model (e.g. Language Models) is then used to match the information need, expressed as a keyword query, against those representations. However, E-R retrieval requires collecting evidence for both entities and relationships that can be spread across multiple documents, so it is not possible to create direct term-based representations from single documents. Raw documents serve as proxies connecting queries with entities and relationships.

As described in the previous section, E-R queries can be formulated as a sequence of multiple entity queries $Q^{E_i}$ and relationship queries $Q^{R_{i-1,i}}$. In an Early Fusion approach, each of these queries should match against a previously created term-based representation. Since there are two types of queries, we propose to create two types of term-based representations, one for entities and the other for relationships. This can be thought of as creating two types of meta-documents, $D^{E_j}$ and $D^{R_{jk}}$. A meta-document $D^{E_j}$ is created by aggregating the context terms of the occurrences of the entity $E_j$ across the raw document collection. On the other hand, for each pair of entities $E_j$ and $E_k$ that co-occur close together across the raw document collection, we aggregate the context terms that describe the relationship to create a meta-document $D^{R_{jk}}$.

Our Early Fusion design pattern for E-R retrieval is inspired by Model 1 of (Balog et al., 2006) for single entity retrieval. In our approach we focus on sentence-level information about entities and relationships, although the design pattern can be applied to more complex segmentations of text (e.g. dependency parsing). We rely on Entity Linking methods for disambiguating and assigning unique identifiers to entity mentions in raw documents. We collect entity contexts across the raw document collection and index them in the entity index. The same is done by collecting and indexing entity pair contexts in the relationship index.
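The indexing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the input format, the `build_indexes` helper, and the choice of sentence-wide entity contexts and between-mention relationship contexts are all simplifying assumptions.

```python
from collections import defaultdict
from itertools import combinations

def build_indexes(docs):
    """Build sentence-level entity and relationship context indexes.

    `docs` maps a document id to a list of sentences; each sentence is a
    (tokens, mentions) pair, where `mentions` maps a linked entity id to
    the token positions where it is mentioned.
    """
    entity_index = defaultdict(list)        # entity id -> [(doc_id, context terms)]
    relationship_index = defaultdict(list)  # (e1, e2)  -> [(doc_id, context terms)]
    for doc_id, sentences in docs.items():
        for tokens, mentions in sentences:
            for entity, positions in mentions.items():
                # Entity context: all other terms of the sentence.
                context = [t for i, t in enumerate(tokens) if i not in positions]
                entity_index[entity].append((doc_id, context))
            for e1, e2 in combinations(sorted(mentions), 2):
                # Relationship context: the terms between the two mentions.
                spans = mentions[e1] + mentions[e2]
                between = tokens[min(spans) + 1:max(spans)]
                relationship_index[(e1, e2)].append((doc_id, between))
    return entity_index, relationship_index
```

Sorting the entity pair gives a single canonical key per co-occurring pair, which matches the symmetric treatment of relationships used throughout this work.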

We define the (pseudo) frequency of a term t for an entity meta-document D^{E_j} as follows:

f(t, D^{E_j}) = Σ_{i=1}^{N} f(t, c(E_j, d_i)) · w(E_j, d_i)

where N is the total number of raw documents in the collection, f(t, c(E_j, d_i)) is the term frequency in the context c(E_j, d_i) of the entity E_j in a raw document d_i, and w(E_j, d_i) is the entity-document association weight, which corresponds to the weight of the document d_i in the mentions of the entity E_j across the raw document collection. Similarly, the (pseudo) frequency of a term t for a relationship meta-document D^{R_{jk}} is defined as follows:

f(t, D^{R_{jk}}) = Σ_{i=1}^{N} f(t, c(E_j, E_k, d_i)) · w(R_{jk}, d_i)

where f(t, c(E_j, E_k, d_i)) is the term frequency in the context of the pair of entity mentions corresponding to the relationship R_{jk} in a raw document d_i, and w(R_{jk}, d_i) is the relationship-document association weight. In this work we use binary association weights, indicating the presence/absence of an entity mention in a raw document, and likewise for a relationship. However, other weighting methods can be used.
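A sketch of this (pseudo) frequency computation, assuming contexts have already been collected per entity or entity pair as in an index of (document id, context terms) entries; the `pseudo_frequency` helper and its input format are illustrative:

```python
def pseudo_frequency(term, contexts, weight=lambda doc_id: 1.0):
    """Pseudo frequency of `term` in an entity (or relationship) meta-document.

    `contexts` is the list of (doc_id, context_terms) entries collected for
    one entity or entity pair across the raw collection; `weight` is the
    (entity- or relationship-)document association weight, binary by default.
    """
    return sum(ctx.count(term) * weight(doc_id) for doc_id, ctx in contexts)
```

With the default weight of 1.0 per document, this is simply the total count of the term over all collected contexts.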

The relevance score for an entity tuple T = <E_1, ..., E_n> can then be calculated using the posterior defined in the previous section (equation 18). We calculate the individual conditional probabilities as a product of a retrieval score with an association weight. Formally, we consider:

P(Q^{E_i} | D^{E_i}) ∝ w(E_i, R_{i,i+1}) · score(Q^{E_i}, D^{E_i})
P(Q^{R_{i,i+1}} | D^{R_{i,i+1}}) ∝ score(Q^{R_{i,i+1}}, D^{R_{i,i+1}})

where score(Q^{R_{i,i+1}}, D^{R_{i,i+1}}) represents the retrieval score resulting from the match of the query terms of a relationship sub-query Q^{R_{i,i+1}} against a relationship meta-document D^{R_{i,i+1}}. The same applies to the retrieval score score(Q^{E_i}, D^{E_i}), which corresponds to the match of an entity sub-query Q^{E_i} with an entity meta-document D^{E_i}. Any retrieval model can be used to compute both.

We use a binary association weight w(E_i, R_{i,i+1}) which represents the presence of an entity relevant to a sub-query Q^{E_i} in its contiguous relationships in the Bayesian network, i.e. R_{i-1,i} and R_{i,i+1}, which must be relevant to the sub-queries Q^{R_{i-1,i}} and Q^{R_{i,i+1}}. This entity-relationship association weight is the building block that guarantees that two entities relevant to entity sub-queries that are also part of a relationship relevant to a relationship sub-query will be ranked higher than tuples where just one or none of the entities are relevant to the entity sub-queries. On the other hand, the entity-relationship association weight guarantees that consecutive relationships share one entity between them in order to create triples or 4-tuples of entities for longer E-R queries (n > 2).

The relevance score of an entity tuple T given a query Q is calculated by summing the individual relationship and entity relevance scores for each E_i and R_{i,i+1} in T. We define the score for a tuple T given a query Q as follows:

score(T, Q) = Σ_{i=1}^{n} w(E_i, R_{i,i+1}) · score(Q^{E_i}, D^{E_i}) + Σ_{i=1}^{n-1} score(Q^{R_{i,i+1}}, D^{R_{i,i+1}})
Considering Dirichlet smoothing unigram Language Models (LM), the constituent retrieval scores can be computed as follows:

score(Q^{E_i}, D^{E_i}) = Σ_{t ∈ Q^{E_i}} log( (f(t, D^{E_i}) + μ^E · f(t, C^E) / |C^E|) / (|D^{E_i}| + μ^E) )

score(Q^{R_{i,i+1}}, D^{R_{i,i+1}}) = Σ_{t ∈ Q^{R_{i,i+1}}} log( (f(t, D^{R_{i,i+1}}) + μ^R · f(t, C^R) / |C^R|) / (|D^{R_{i,i+1}}| + μ^R) )

where t is a term of a sub-query Q^{E_i} or Q^{R_{i,i+1}}, and f(t, D^{E_i}) and f(t, D^{R_{i,i+1}}) are the (pseudo) frequencies defined in equations 20 and 21. The collection frequencies f(t, C^E) and f(t, C^R) represent the frequency of the term t in the entity index C^E and in the relationship index C^R, respectively. |D^{E_i}| and |D^{R_{i,i+1}}| represent the total number of terms in a meta-document, while |C^E| and |C^R| represent the total number of terms in a collection of meta-documents. Finally, μ^E and μ^R are the Dirichlet priors for smoothing, which generally correspond to the average document length in a collection.
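The Dirichlet-smoothed scoring above can be sketched as follows; the function name and the dictionary-based index representation are assumptions:

```python
import math

def dirichlet_lm_score(query_terms, meta_doc_tf, doc_len, coll_tf, coll_len, mu):
    """Dirichlet-smoothed unigram LM score of a sub-query against a meta-document.

    `meta_doc_tf` / `coll_tf` map a term to its (pseudo) frequency in the
    meta-document / in the whole meta-document collection (entity or
    relationship index); `mu` is the Dirichlet prior.
    """
    score = 0.0
    for t in query_terms:
        # Smoothed probability of term t under the meta-document's LM.
        p = (meta_doc_tf.get(t, 0) + mu * coll_tf.get(t, 0) / coll_len) / (doc_len + mu)
        score += math.log(p) if p > 0 else float("-inf")
    return score
```

The same function serves both entity and relationship sub-queries, only with different indexes and priors.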

4.1. Association Weights

In Early Fusion there are three association weights: w(E_i, d_j), w(R_{i,i+1}, d_j) and w(E_i, R_{i,i+1}). The first two represent document associations, which determine the weight a given raw document d_j contributes to the relevance score of a particular entity tuple T. The last one is the entity-relationship association, which indicates the strength of the connection of a given entity E_i within a relationship R_{i,i+1}.

In our work we only consider binary association weights, but other methods could be used. Under the binary method we define the weights as follows:

w(E_i, d_j) = 1 if E_i is mentioned in d_j, and 0 otherwise;
w(R_{i,i+1}, d_j) = 1 if E_i and E_{i+1} co-occur in d_j, and 0 otherwise;
w(E_i, R_{i,i+1}) = 1 if E_i is one of the entities of R_{i,i+1}, and 0 otherwise.

Under this approach the weight of a given association is independent of the number of times an entity or a relationship occurs in a document. A more general approach would be to assign real numbers to the association weights depending on the strength of the association (Balog et al., 2012b). For instance, uniform weighting would be proportional to the inverse of the number of documents where a given entity or relationship occurs. Another option would be a TF-IDF approach.
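The binary and uniform weighting schemes can be illustrated with the hypothetical helpers below (the paper itself only evaluates the binary variant):

```python
def binary_weight(assoc_docs):
    """w = 1 iff the raw document mentions the entity (or entity pair)."""
    return lambda doc_id: 1.0 if doc_id in assoc_docs else 0.0

def uniform_weight(assoc_docs):
    """Each of the n associated documents contributes 1/n, i.e. a weight
    proportional to the inverse of the number of documents where the
    entity or relationship occurs."""
    n = len(assoc_docs)
    return lambda doc_id: 1.0 / n if doc_id in assoc_docs else 0.0
```

Either helper returns a weight function that can be plugged into a pseudo-frequency computation over (document id, context) entries.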

4.2. Discussion

This approach is flexible enough to allow using any retrieval method to compute the individual retrieval scores between document and query nodes in an E-R graph structure. When using Language Models (LM) or BM25 as scoring functions, these design patterns can be used to create unsupervised baseline methods for E-R retrieval (e.g. EF-LM, EF-BM25, LF-LM, LF-BM25, etc.).

There is some overhead over traditional document search, since we need to create two dedicated E-R indexes that store entity and relationship meta-documents. The entity index is created by harvesting the context terms in the proximity of every occurrence of a given entity across the raw document collection. This process must be carried out for every entity in the raw document collection. A similar process is applied to create the relationship index. For every two entities occurring close together in a raw document, we extract the text between the two occurrences as a term-based representation of the relationship between them. Once again, this process must be carried out for every pair of entities co-occurring in sentences across the raw document collection.

One advantage of Early Fusion lies in its flexibility: since we create two separate indexes for E-R retrieval, it is possible to combine data from multiple sources in a seamless way. For instance, one could use a well-established knowledge base (e.g. DBpedia) as the entity index and use a specific collection, such as a news collection or a social media stream, for harvesting relationships of a more transient nature.

The challenge inherent to the problem of E-R retrieval is the size of the search space. Although the E-R problem is formulated as a sequence of independent sub-queries, the results of those sub-queries must be joined together. Consequently, we have a multi-dimensional search space in which we need to join results based on shared entities.

This problem becomes particularly hard when sub-queries are short and contain very popular terms. Let us consider “actor” as an entity sub-query: there will be many results for this sub-query, probably thousands. There is a high probability that we will need to process thousands of sub-results before finding one entity that is also relevant to the relationship sub-query. If, at the same time, we have computational power constraints, we will probably apply a strategy of considering just the top k results for each sub-query, which can lead to reduced recall in the case of short sub-queries with popular terms.
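A sketch of this top-k join strategy for a query with two entity sub-queries and one relationship sub-query; the function name, input format, and combination of scores by summation are assumptions:

```python
def join_er_results(entity_hits_1, rel_hits, entity_hits_2, k=1000):
    """Join the top-k results of two entity sub-queries and one relationship
    sub-query on shared entities.

    Each argument maps an entity (or entity pair) to its retrieval score;
    returns (E1, E2) tuples mapped to a summed relevance score.
    """
    # Keep only the top-k results per sub-query to bound the join; this may
    # hurt recall for short sub-queries with popular terms such as "actor".
    def top(hits):
        return dict(sorted(hits.items(), key=lambda kv: -kv[1])[:k])

    e1, rel, e2 = top(entity_hits_1), top(rel_hits), top(entity_hits_2)
    tuples = {}
    for (a, b), r_score in rel.items():
        for x, y in ((a, b), (b, a)):  # relationships are treated as symmetric
            if x in e1 and y in e2:
                tuples[(x, y)] = e1[x] + r_score + e2[y]
    return tuples
```

Only tuples whose relationship is relevant to the relationship sub-query and whose entities are relevant to their respective entity sub-queries survive the join.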

5. Entity-Relationship Dependence Model

In this section we present the Entity-Relationship Dependence Model (ERDM), a novel supervised Early Fusion-based model for E-R retrieval. Recent approaches to entity retrieval (Zhiltsov et al., 2015; Nikolaev et al., 2016; Hasibi et al., 2016) have demonstrated that models based on the Markov Random Field (MRF) framework for retrieval (Metzler and Croft, 2005), which incorporate term dependencies, can improve entity search performance. This suggests that MRFs could also be used to model E-R query term dependencies over entity and relationship documents.

One of the advantages of the MRF framework for retrieval is its flexibility: we only need to construct a graph G representing the dependencies to model, define a set of non-negative potential functions over the cliques of G, and learn the parameter vector Λ in order to score each document by its unique and unnormalized joint probability with Q under the MRF (Metzler and Croft, 2005).

The non-negative potential functions are defined using an exponential form ψ(c; Λ) = exp(λ_c f(c)), where λ_c is a feature weight, a free parameter in the model, associated with the feature function f(c). Learning to rank is then used to learn the feature weights that minimize the loss function. The model allows parameter and feature function sharing across cliques of the same configuration, i.e. the same size and type of nodes (e.g. 2-cliques of one query term node and one document node).
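Because the potentials are exponential, ranking by the unnormalized joint probability reduces to ranking by a weighted sum of clique feature function values; a minimal sketch (names are illustrative):

```python
def mrf_rank_score(feature_values, weights):
    """Rank-equivalent MRF score: the log of the product of exponential
    potentials exp(lambda_c * f(c)) is the weighted sum of the clique
    feature function values, so candidates can be ranked by this sum."""
    return sum(weights[name] * value for name, value in feature_values.items())
```

Learning to rank then amounts to fitting the `weights` vector so that this sum orders relevant tuples above non-relevant ones.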

5.1. Graph Structures

The Entity-Relationship Dependence Model (ERDM) creates an MRF for modeling implicit dependencies between sub-query terms, entities, and relationships. Each entity and each relationship is modeled as a document node within the graph, and edges reflect term dependencies. Contrary to traditional ad-hoc retrieval using MRF (e.g. SDM), where the objective is to compute the posterior of a single document given a query, ERDM allows the computation of a joint posterior of multiple documents (entities and relationships) given an E-R query, which itself consists of multiple sub-queries.

Figure 3. Markov Random Field dependencies for E-R retrieval with two entities and one relationship.
Figure 4. Markov Random Field dependencies for E-R retrieval with three entities and two relationships.

The graph structures of the ERDM for two E-R queries, one with a single relationship and the other with two relationships, are depicted in Figure 3 and Figure 4, respectively. Both graph structures contain two different types of query nodes and document nodes: entity query and relationship query nodes, Q^{E_i} and Q^{R_{i,i+1}}, plus entity and relationship document nodes, D^{E_i} and D^{R_{i,i+1}}. Within the MRF framework, D^{E_i} and D^{R_{i,i+1}} are considered “documents”, but they are not actual real documents, rather objects representing an entity or a relationship between two entities. Unlike real documents, these objects do not have direct and explicit term-based representations. Usually, it is necessary to gather evidence across multiple real documents that mention the given object in order to be able to match it against keyword queries. Therefore, ERDM can be seen as an Early Fusion-based retrieval model. The existence of two different types of documents implies two different indexes: the entity index and the relationship index.

The relationship-specific dependencies of ERDM are found in the 2-cliques formed by one entity document and one relationship document: D^{E_1}–D^{R_{1,2}} and D^{E_2}–D^{R_{1,2}}, plus, for longer queries, D^{E_2}–D^{R_{2,3}} and D^{E_3}–D^{R_{2,3}}. The graph structure does not need to assume any explicit dependence between entity documents given a relationship document: they have an implicit connection through their dependencies with the relationship document. The likelihood of observing an entity document given a relationship document is not affected by the observation of any other entity document.

Explicit dependence between the two entity documents could be used to represent the direction of the relationship between the two entities. To support this dependence, relationship documents would need to satisfy an ordering constraint over the entity pair in the relationship index. Then, we would compute an ordered feature function between the entities in a relationship, similar to the ordered bigram feature function in SDM. In this work, we do not explicitly model asymmetric relationships. For instance, if a user searches for the relationship entity A “criticized” entity B, but it was in fact entity B who criticized entity A, we assume that the entity tuple <entity A, entity B> is still relevant for the information need expressed in the E-R query.
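One simple way to realize this symmetric treatment of relationships is to store each entity pair under a canonical ordering, so that both orientations of a pair resolve to the same relationship meta-document (an illustrative convention, not necessarily the authors'):

```python
def canonical_pair(e1, e2):
    """Store symmetric relationships under a single key so that
    <A, B> and <B, A> hit the same relationship meta-document."""
    return (e1, e2) if e1 <= e2 else (e2, e1)
```

Any lexicographic or id-based total order works, as long as indexing and querying use the same convention.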

ERDM follows the SDM (Metzler and Croft, 2005) dependencies between query terms and documents due to their proven effectiveness in multiple contexts. Therefore, ERDM assumes a dependence between neighboring sub-query terms.
MRF for retrieval requires the definition of the sets of cliques (maximal or non-maximal) within the graph to which one or more feature functions are applied. The sets of cliques in ERDM containing at least one document node are the following:

  • 2-cliques containing an entity document node and exactly one term of an entity sub-query.

  • 3-cliques containing an entity document node and two ordered terms of an entity sub-query.

  • 2-cliques containing a relationship document node and exactly one term of a relationship sub-query.

  • 3-cliques containing a relationship document node and two ordered terms of a relationship sub-query.

  • 2-cliques containing one entity document node and one relationship document node.

  • 3-cliques containing one entity document node and two consecutive relationship document nodes.

The joint probability mass function of the MRF is computed using the set of potential functions over the configurations of the maximal cliques in the graph (Metzler and Croft, 2005). Non-negative potential functions are constructed from one or more real valued feature functions associated with the respective feature weights using an exponential form.

5.2. Feature Functions

ERDM has two types of feature functions: textual and non-textual. Textual feature functions measure the textual similarity between one or more sub-query terms and a document node. Non-textual feature functions measure the compatibility between entity and relationship documents, i.e., whether they share a given entity.

Clique Set Feature Functions Type Input Nodes
and Textual