1.1 Thesis Statement
The ultimate goal of ORM is to track everything that is said on the Web about a given target entity and, consequently, to assess or predict the impact on its reputation. From our perspective, this goal is very hard to achieve for two reasons. The first reason has to do with the difficulty of computationally processing, interpreting and accessing the huge amount of information published online every day. The second reason is inherent to the definition of reputation as being intangible but having tangible outcomes. More specifically, Fombrun and Van Riel (2004) and later Stacks (2010) found a correlation between several indicators, such as reputation or trust, and financial indicators, such as sales or profits. However, this finding does not imply causality, as financial indicators can be influenced by many factors besides stakeholders’ perceived reputation. In conclusion, there is no consensus on how to measure reputation, either intrinsically or extrinsically.
To the best of our knowledge, current ORM is still very limited and naive. The most standard approach consists in counting mentions of entity names and applying sentiment analysis to produce descriptive reports of aggregated entity popularity and overall sentiment. We propose to make progress in ORM by tackling two computational problems: Entity Retrieval and Text Mining (Figure 1.1).
We believe that an ORM platform, besides providing aggregated statistics and trends about entity popularity and sentiment on the news and social media, would benefit from providing entity retrieval capabilities. End users often like to have the flexibility to search for specific information that is not available in predefined charts. However, ORM has some specificities that traditional entity search systems cannot cope with. More specifically, an entity’s reputation is also influenced by the entity’s relationships with other entities.
For instance, the reputation of Apple Inc. was severely damaged by the so-called “Apple Foxconn scandal”. Foxconn was one of the several contractor companies in Apple’s supply chain that was accused of exploiting Chinese workers. Although the facts were not directly concerned with Apple itself, its relationship with Foxconn triggered bad public opinion about Apple. The same happened recently with the “Weinstein sex scandal”, as accusations of sexual harassment aimed at Harvey Weinstein created a wave of damage to companies and personalities associated with the disgraced Hollywood producer.
Therefore, an ORM platform should provide entity-relationship search capabilities. Entity-Relationship (E-R) Retrieval is a complex case of entity retrieval where the goal is to search for multiple unknown entities and the relationships connecting them. Contrary to traditional entity queries, E-R queries expect tuples of connected entities as answers. For instance, “US technology companies contracts Chinese electronics manufacturers” can be answered by tuples such as <Apple, Foxconn>, while “Companies founded by disgraced Hollywood producer” expects tuples such as <Miramax, Harvey Weinstein>. In essence, an E-R query can be decomposed into a set of sub-queries that specify types of entities and types of relationships between entities.
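As a minimal sketch of this decomposition, an E-R query can be held as a chain of entity sub-queries linked by relationship sub-queries. The class and field names below (`ERQuery`, `entity_subqueries`, `relationship_subqueries`) are hypothetical and chosen only for illustration; the sketch also assumes a simple chain in which each relationship connects two consecutive entities.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ERQuery:
    """An E-R query split into entity and relationship sub-queries (illustrative only)."""

    entity_subqueries: List[str]        # one free-text description per expected entity
    relationship_subqueries: List[str]  # one per pair of consecutive entities

    def __post_init__(self) -> None:
        # Assumption of this sketch: n entities in a chain are linked by n - 1 relationships.
        if len(self.relationship_subqueries) != len(self.entity_subqueries) - 1:
            raise ValueError("expected one relationship per pair of consecutive entities")

    def subqueries(self) -> List[str]:
        """Return the sub-queries interleaved as entity, relationship, entity, ..."""
        interleaved: List[str] = []
        for i, entity in enumerate(self.entity_subqueries):
            interleaved.append(entity)
            if i < len(self.relationship_subqueries):
                interleaved.append(self.relationship_subqueries[i])
        return interleaved


# The first example query from the text, decomposed by hand.
query = ERQuery(
    entity_subqueries=["US technology companies", "Chinese electronics manufacturers"],
    relationship_subqueries=["contracts"],
)
print(query.subqueries())
```

Under this representation, an answer to the query is a tuple with one entity per entity sub-query, which matches the tuple-shaped answers described above.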
On the other hand, ORM requires accurate and robust text processing and data analysis methods. Text Mining plays an essential enabling role in developing better ORM. There are several challenges with collecting and extracting relevant entity-centric information from raw text data. It is necessary to filter noisy data; otherwise, downstream processing tasks, such as sentiment analysis, will be compromised. More specifically, it is essential to develop named entity disambiguation approaches that can distinguish relevant text passages from non-relevant ones. Named entities are often ambiguous; for example, the word “bush” is a surface form for two former U.S. presidents, a music band and a shrub. The ambiguity of named entities is particularly problematic in social media texts, where users often mention entities using a single term.
ORM platforms would be even more useful if they were able to predict whether social media users will talk a lot about the target entities. For instance, on April 4th 2016, the UK Prime Minister, David Cameron, was mentioned in the news regarding the Panama Papers story. He did not acknowledge the story in detail on that day. However, the news cycle kept mentioning him in connection with this topic in the following days and his mentions on social media remained very high. He had to publicly address the issue on April 9th, when his reputation had already been severely damaged, blaming himself for not providing further details earlier. Thus, we also want to study the feasibility of using entity-centric knowledge extracted from Social Media and online news to predict real-world survey results, such as political polls.
The work reported in this dissertation aimed to understand, formalize and explore the scientific challenges inherent to the problem of using unstructured text data from different Web sources for Online Reputation Monitoring. We now describe the specific research challenges we proposed to overcome.
Entity-Relationship Retrieval: Existing strategies for entity search can be divided into IR-centric and Semantic-Web-based approaches. The former usually rely on statistical language models to match and rank co-occurring terms in the proximity of the target entity (Balog et al., 2012a). The latter consist in creating a SPARQL query and using it over a structured knowledge base to retrieve relevant RDF triples (Heath and Bizer, 2011). Neither of these paradigms provides good support for entity-relationship (E-R) retrieval.
Recent work in Semantic-Web search tackled E-R retrieval by extending SPARQL to support joins of multiple query results and creating an extended knowledge graph (Yahya et al., 2016). Extracted entities and relationships are typically stored in a knowledge graph. However, it is not always convenient to rely on a structured knowledge graph with predefined and constraining entity types.
In particular, ORM is interested in transient information sources, such as online news or social media. General purpose knowledge graphs are usually fed with more stable and reliable data sources (e.g. Wikipedia). Furthermore, predefining and constraining entity and relationship types, such as in Semantic Web-based approaches, reduces the range of queries that can be answered and therefore limits the usefulness of entity search, particularly when one wants to leverage free-text.
To the best of our knowledge, E-R retrieval using IR-centric approaches is a new and unexplored research problem within the Information Retrieval research community. One of the objectives of our research is to explore to what degree we can leverage the textual context of entities and relationships, i.e., co-occurring terminology, to relax the notion of an entity or relationship type.
Instead of being characterized by a fixed type, e.g., person, country, place, the entity would be characterized by any contextual term. The same applies to the relationships. Traditional knowledge graphs have a fixed schema of relationships, e.g., child of, created by, works for, while our approach relies on contextual terms in the text proximity of every two co-occurring entities in a raw document. Relationship descriptions such as “criticizes”, “hits back”, “meets” or “interested in” would be possible to search for. This is expected to significantly reduce the limitations from which structured approaches suffer, enabling a wider range of queries to be addressed.
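To make the idea of contextual relationship terms concrete, the following sketch collects the terms that appear between every two co-occurring entity mentions in a tokenized document. It is only an illustration under stated assumptions: entity mentions are taken as already recognized, single-token and given by position, and the function name `relationship_contexts` and the `max_gap` window are hypothetical choices rather than the method developed in this thesis.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def relationship_contexts(
    tokens: List[str],
    mentions: List[Tuple[str, int]],
    max_gap: int = 5,
) -> Dict[Tuple[str, str], List[str]]:
    """Collect the terms found between each pair of co-occurring entity mentions.

    tokens   -- the tokenized document
    mentions -- (entity, token position) pairs, assumed to come from prior NER
    max_gap  -- hypothetical limit on how many terms may separate the two mentions
    """
    contexts: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    ordered = sorted(mentions, key=lambda mention: mention[1])
    for i, (left, left_pos) in enumerate(ordered):
        for right, right_pos in ordered[i + 1:]:
            gap = right_pos - left_pos - 1
            # Skip self-pairs, adjacent mentions with no terms, and distant mentions.
            if left != right and 0 < gap <= max_gap:
                contexts[(left, right)].extend(tokens[left_pos + 1:right_pos])
    return dict(contexts)


# Toy sentence built from the example entities used earlier in this chapter.
tokens = ["Apple", "contracts", "Foxconn", "in", "China"]
mentions = [("Apple", 0), ("Foxconn", 2)]
print(relationship_contexts(tokens, mentions))
```

The terms gathered for a pair of entities can then play the role of a free-text relationship description, so that a query term such as “contracts” can be matched without a predefined relationship schema.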
Entity Filtering and Sentiment Analysis: Entity Filtering is a sub-problem of Named Entity Disambiguation (NED) in which we have a named entity mention and we want to classify it as related or not related to the given target entity. This is a relatively easy problem in well-formed texts such as news articles. However, social media texts pose several problems to this task. We are particularly interested in Entity Filtering of tweets and we aim to study a large set of features that can be generated to describe the relationship between a given target entity and a tweet, as well as to explore different learning algorithms to create supervised models for this task.
Sentiment Analysis has been thoroughly studied in the last decade (Giachanou and Crestani, 2016). There have been several PhD theses entirely dedicated to this subject. It is a broad problem with several ramifications depending on the text source and specific application. Within the context of ORM, we will focus on a particular domain: finance. Sentiment Analysis on financial texts has received increased attention in recent years (Nardo et al., 2016). Nevertheless, there are some challenges yet to overcome (Smailović et al., 2014). Financial texts, such as microblogs or newswire, usually contain highly technical and specific vocabulary or jargon, making the development of specific lexical and machine learning approaches necessary.
Text-based Entity-centric Prediction: We hypothesize that for entities that are frequently mentioned on the news (e.g. politicians) it is possible to establish a predictive link between online news and popularity on social media. We cast the problem as a supervised classification task: to decide whether popularity will be high or low based on features extracted from the news cycle. We aim to assess whether online news is a valuable source of information to effectively predict entity popularity on Twitter. More specifically, we want to find out whether online news carries different predictive power depending on the nature of the entity under study and how predictive performance varies with different times of prediction. We propose to explore different text-based features and how particular ones affect the overall predictive power and that of specific entities.
On the other hand, we will study if it is possible to use knowledge extracted from social media texts to predict the outcome of public opinion surveys. The automatic content analysis of mass media in the social sciences has become necessary and possible with the rise of social media and computational power. One particularly promising avenue of research concerns the use of sentiment analysis in microblog streams. However, one of the main challenges consists in aggregating sentiment polarity in a timely fashion that can be fed to the prediction method.
A Framework for ORM: The majority of the work in ORM consists in ad-hoc studies where researchers collect data from a given social network and produce their specific analysis or predictions, which are often unreproducible. The availability of open source platforms in this area is scarce. Researchers typically use specific APIs and software modules to produce their studies. However, there has been some effort among the research community to address these issues through open source research platforms. We therefore aim to create an adaptable text mining framework specifically tailored for ORM that can be reused in multiple application scenarios, from politics to finance. This framework is able to collect texts from online media, such as Twitter, identify entities of interest and classify sentiment polarity and intensity. The framework supports multiple data aggregation methods, as well as visualization and modeling techniques that can be used for both descriptive analytics, such as analyzing how political polls evolve over time, and predictive analytics, such as predicting election results.
1.3 Research Methodology
We adopted distinct research methodologies in the process of developing the research work described in this thesis. The origin of this work was the POPSTAR project. POPSTAR (Public Opinion and Sentiment Tracking, Analysis, and Research) was a project that developed methods for the collection, measurement and aggregation of political opinions voiced in microblogs (Twitter), in blogs and online news. A first prototype of the framework for ORM was implemented and served as the backend of the POPSTAR website (http://www.popstar.pt/). The groundwork concerned with the development of a framework for ORM was carried out in the scope of the project. Therefore, the POPSTAR website served as a use case for validating the effectiveness and adaptability of the framework.
The Entity Filtering and Sentiment Analysis modules of the framework were evaluated using well-known external benchmarks, resulting in state-of-the-art performance. We participated in the RepLab 2013 Filtering Task and evaluated our Entity Filtering method using the dataset created for the competition. One of our submissions obtained the first place at the competition. We also participated in SemEval 2017 Task 5: Fine-grained Sentiment Analysis on Financial Microblogs and News. We were ranked 4th using one of the metrics in sub-task 5.1 (Microblogs).
We performed two experiments regarding text-based entity-centric predictions. For predicting entity popularity on Twitter based on the news cycle, we collected tweets and news articles from Portugal: tweets using the SocialBus Twitter collector, and online news from 51 different news outlets collected by SAPO. We used the number of entity mentions on Twitter as the target variable and we extracted text-based features from the news datasets. Both datasets were aligned in time. We used the same Twitter dataset for studying different sentiment aggregate functions to serve as features for predicting the political polls of a private opinion studies company, Eurosondagem.
Improvements of Entity-Relationship (E-R) retrieval techniques have been hampered by a lack of test collections, particularly for complex queries involving multiple entities and relationships. We created a method for generating E-R test queries to support comprehensive E-R search experiments. Queries and relevance judgments were created from content that exists in a tabular form where columns represent entity types and the table structure implies one or more relationships among the entities. Editorial work involved creating natural language queries based on relationships represented by the entries in the table. We have publicly released the RELink test collection comprising 600 queries and relevance judgments obtained from a sample of Wikipedia List-of-lists-of-lists tables.
We evaluated the new methods proposed for E-R retrieval using the RELink query collection together with two other smaller query collections created by research work in Semantic Web-based E-R retrieval. We used a large web corpus, ClueWeb09-B, containing 50 million web pages, to create E-R retrieval tailored indexes for running our experiments. Moreover, we implemented a demo using a large news collection of 12 million Portuguese news articles, resulting in the best demo award at ECIR 2016.
1.4 Contributions and Applications
This work resulted in the following contributions:
A Text Mining framework that puts together all the building blocks required to perform ORM. The framework is adaptable and can be reused in different application scenarios, such as finance and politics. The framework provides entity-specific Text Mining functionalities that enable the collection, disambiguation, sentiment analysis, aggregation, prediction and visualization of entity-centric information from heterogeneous Web data sources. Furthermore, given that it is built using a modular architecture providing abstraction layers and well defined interfaces, new functionalities can easily be integrated.
Generalization of the problem of entity-relationship search to cover entity types and relationships represented by any attribute and predicate, respectively, rather than a pre-defined set.
A general probabilistic model for E-R retrieval using Bayesian Networks.
Proposal of two design patterns that support retrieval approaches using the E-R model.
Proposal of an Entity-Relationship Dependence model that builds on the basic Sequential Dependence Model (SDM) to provide extensible entity-relationship representations and dependencies, suitable for complex, multi-relation queries.
An Entity-relationship indexing and retrieval approach including learning to rank/data fusion methods that can handle entity and relationships ranking and merging of results.
The proposal of a method and strategy for automatically obtaining relevance judgments for entity-relationship queries.
We make publicly available queries and relevance judgments for the previous task.
Entity Filtering and Financial Sentiment Analysis methods tailored for Twitter that are able to cope with the constraints of short informal texts.
Analysis of the predictive power of online news regarding entity-centric metrics on Twitter, such as popularity or sentiment.
Analysis of how to combine entity-centric knowledge obtained from heterogeneous sources for survey-like prediction tasks.
We believe this work can be useful in a wide range of applications from which we highlight six:
Reputation Management is concerned with influencing and controlling company or individual reputation; consequently, tracking what is said about entities online is one of the main concerns of this area. For instance, knowing whether a given news article will have a negative impact on an entity’s reputation would be crucial for damage control.
Digital Libraries are special libraries comprising a collection of digital objects (e.g. text or images) stored in an electronic media format. They are ubiquitous nowadays, from academic repositories to biomedical databases, law enforcement repositories, etc. We believe the contributions we make to the Entity-Relationship Retrieval research problem can be applied to any digital library, enabling a wide range of new search capabilities.
Fraud Detection, including insider trading detection, is an area where information about entities (individuals and companies) and relationships between entities is very useful to discover hidden relationships and contexts of entities that might represent conflicts of interest or even fraud.
Journalism, or more specifically, computational journalism would benefit from a powerful entity-relationship search tool with which journalists could investigate how entities were previously mentioned on the Web, including online news through time, as well as relationships among entities and their semantics.
Political Science has given a lot of attention to Social Media in recent years due to the sheer amount of people’s reactions and opinions regarding politically relevant events. Being able to analyze the interplay between online news and Social Media from a political entity perspective can be very interesting for political scientists. On the other hand, it is becoming increasingly difficult to obtain poll responses via telephone and it is necessary to start testing alternative approaches.
Social Media Marketing focuses on communicating through social networks with a company’s potential and actual customers. Evaluating the success of a given campaign is a key aspect of this area. Therefore, assessing the volume and polarity of mentions of a given company before and after a campaign would be very useful.
Most of the material of this thesis was previously published in journal, conference and workshop publications:
P. Saleiro, E. M. Rodrigues, C. Soares, E. Oliveira, “TexRep: A Text Mining Framework for Online Reputation Monitoring”, New Generation Computing, Volume 35, Number 4, 2017 (Saleiro et al., 2017a)
P. Saleiro, N. Milic-Frayling, E. M. Rodrigues, C. Soares, “RELink: A Research Framework and Test Collection for Entity-Relationship Retrieval”, 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017) (Saleiro et al., 2017b)
P. Saleiro, N. Milic-Frayling, E. M. Rodrigues, C. Soares, “Early Fusion Strategy for Entity-Relationship Retrieval”, The First Workshop on Knowledge Graphs and Semantics for Text Retrieval and Analysis (KG4IR@SIGIR 2017) (Saleiro et al., 2017c)
P. Saleiro, E. M. Rodrigues, C. Soares, E. Oliveira, “FEUP at SemEval-2017 Task 5: Predicting Sentiment Polarity and Intensity with Financial Word Embeddings”, International Workshop on Semantic Evaluation (SemEval@ACL 2017) (Saleiro et al., 2017e)
P. Saleiro and C. Soares, “Learning from the News: Predicting Entity Popularity on Twitter” in Advances in Intelligent Data Analysis XV (IDA 2016) (Saleiro and Soares, 2016)
P. Saleiro, J. Teixeira, C. Soares, E. Oliveira, “TimeMachine: Entity-centric Search and Visualization of News Archives” in Advances in Information Retrieval: 38th European Conference on IR Research (ECIR 2016) (Saleiro et al., 2016a)
P. Saleiro, L. Gomes, C. Soares, “Sentiment Aggregate Functions for Political Opinion Polling using Microblog Streams” in International C* Conference on Computer Science and Software Engineering (C3S2E 2016) (Saleiro et al., 2016b)
P. Saleiro, S. Amir, M. J. Silva, C. Soares, “POPmine: Tracking Political Opinion on the Web” in IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing (IUCC 2015) (Saleiro et al., 2015a)
P. Saleiro, L. Rei, A. Pasquali, C. Soares, et al., “POPSTAR at RepLab 2013: Name ambiguity resolution on Twitter” in Fourth International Conference of the CLEF initiative (CLEF 2013) (Saleiro et al., 2013a)
1.6 Thesis Outline
In Chapter 2 we discuss work related to this thesis. In Chapter 3 we present a formalization of the problem of E-R retrieval using an IR-centric approach. We provide two design patterns for fusion-based E-R retrieval: Early Fusion and Late Fusion. We end the chapter by introducing a new supervised early fusion-based Entity Relationship Dependence Model (ERDM) that can be seen as an extension of the MRF framework for retrieval adapted to E-R retrieval. In Chapter 4 we describe a set of experiments on E-R retrieval over a Web corpus. First we introduce a new query collection, RELink QC, specifically tailored to this problem. We developed a semi-automatic approach to collect relevance judgments from tabular data and the editorial work consisted in creating E-R queries answered by those relevance judgments. We run experiments using ClueWeb09-B as the dataset and provide evaluation results for the newly proposed methods for E-R retrieval.
Chapter 5 is dedicated to Entity Filtering and Financial Sentiment Analysis. We evaluate our approaches using well-known external benchmarks, namely, RepLab 2013 and SemEval 2017. In Chapter 6, we present two experiments of text-based entity-centric predictions. In the first experiment, we try to predict the popularity of entities on social media using solely features extracted from the news cycle. In the second experiment, we try to assess which sentiment aggregate functions are useful in predicting political poll results.
In Chapter 7, we present a unified framework for ORM. The framework is divided into two major containers: RELink (Entity Retrieval) and TexRep (Text Mining). We present the data flow within the framework and how it can be used as a reference open source framework for research in ORM. We also present some case studies of using this framework. We end this thesis with Chapter 8, which is dedicated to the conclusions.
2.1 Online Reputation Monitoring
The reputation of a company is important for the company itself but also for its stakeholders. More specifically, stakeholders make decisions about the company and its products faster if they are aware of the image of the company (Poiesz, 1989). From the company perspective, reputation is an asset as it attracts stakeholders and it can ultimately represent economic profit (Jones et al., 2000; Fombrun and Van Riel, 2004).
In 2001, Newell and Goldsmith used questionnaire and survey methodologies to introduce the first standardized and reliable measure of credibility of companies from a consumer perspective (Newell and Goldsmith, 2001). There have also been studies that find a correlation between company indicators, such as reputation, trust and credibility, and financial indicators, such as sales and profits (Fombrun and Van Riel, 2004; Stacks, 2010). These studies found that although reputations are intangible they influence tangible assets. Following this reasoning, Fombrun created a very successful measurement framework, named RepTrak (Fombrun, 2006).
A different methodology compared to questionnaires is media analysis (news, TV and radio broadcasts). Typically, the analysis involves consuming and categorizing media according to stakeholder and polarity (positive, negative) towards the company. Recently, Social Media analysis has become an important proxy of people’s opinion, originating the field of Online Reputation Monitoring (Kurniawati et al., 2013). While traditional reputation monitoring is mostly manual, online media offer the opportunity to process, understand and aggregate large streams of facts about a company or individual.
ORM requires some level of continuous monitoring (Kaufmann et al., 2013). It is crucial to detect early the changes in the perception of a company or personality conveyed in Social Media. Online buzz may be good or bad and, consequently, companies must react and address negative trends (Portmann, 2012; Gonzalo, 2016). It also creates an opportunity to monitor the reputation of competitors. In this context, Text Mining plays a key enabling role as it offers methods for deriving high-quality information from textual content (Amigó et al., 2013). For instance, Gonzalo (2016) identifies five different Text Mining research areas relevant to ORM: entity filtering, topic tracking, reputation priority detection, user profiling and automatic reporting/summarization.
Social Media, as a new way of communication and collaboration, influences every stakeholder in society, such as personalities, companies or individuals (Matešić et al., 2010). Social Media users share every aspect of their lives and that includes information about events, news stories, politicians, brands or organizations. Companies have access to all this sharing, which opens new horizons for obtaining insights that can be valuable to them and their online reputation. Companies also invest a big share of their public relations effort on Social Media. Building a strong reputation can take a long time and much effort, but destroying it can take place overnight. Therefore, as the importance of Social Media increased, so did the importance of having powerful tools that deal with this enormous amount of data.
2.1.1 Related Frameworks
The great majority of work in ORM consists in ad-hoc studies and platforms for ORM are usually developed by private companies that do not share internal information. However, there are some open source research projects that can be considered as related frameworks to this work.
Trendminer (Samangooei et al., 2012) is one such platform. It enables real-time analysis of Twitter data, but it has a very simple sentiment analysis based on word counts and lacks the flexibility to support entity-centric data processing. A framework for ORM should be entity-centric, i.e., collect, process and aggregate texts and information extracted from those texts in relation to the entities being monitored.
conTEXT (Khalili et al., 2014) addresses adaptability and reusability by providing a modular interface and allowing plugin components to extend the framework, especially from the perspective of the data sources and text analysis modules. For instance, it does not include a Sentiment Analysis module by default, but one could be plugged in. Nevertheless, conTEXT does not support plugging in aggregation and prediction modules, which makes it unsuitable for ORM. The FORA framework (Portmann, 2012) is specifically tailored for ORM. It creates an ontology based on fuzzy clustering of texts, but it is only concerned with extracting relevant linguistic units regarding the target entities; it does not include automatic sentiment analysis and does not allow new modules to be plugged in.
POPmine (Saleiro et al., 2015b) was the first version of our Text Mining framework for ORM and it was developed specifically in the context of a project in political data science. It comprises a richer set of modules, including cross-media data collection (Twitter, blog posts and online news) and real-time trend analysis based on entity filtering and sentiment analysis modules. In fact, our current version of TexRep, our Text Mining framework for ORM, can be seen as an extension of the POPmine architecture that creates a more general purpose framework for ORM which is not restricted to political analysis. While it would be possible to adapt POPmine’s entity disambiguation and sentiment analysis modules, its aggregations are specific to political scenarios. On the other hand, TexRep allows users to define and plug in custom aggregate functions. Moreover, POPmine has limited user configurations (e.g. it lacks support for pre-trained word embeddings) and does not include predictive capabilities.
2.2 Entity Retrieval and Semantic Search
Information Retrieval deals with the “search for information”. It is defined as the activity of finding relevant information resources (usually documents) that meet an information need (usually a query), from within a large collection of resources of an unstructured nature (usually text) (Manning et al., 2008).
In early boolean retrieval systems, documents were retrieved if the exact query term was present and they were represented as a list of terms (Manning et al., 2008). With the introduction of the Vector Space Model, each term represents a dimension in a multi-dimensional space and, consequently, each document and query is represented as a vector (Salton, 1968). Values of each dimension of the document vector correspond to the term frequency (TF) of the term in the document. Therefore, the ranked list of documents is produced based on their spatial distance to the query vector.
The concept of inverse document frequency (IDF) was later introduced to limit the effect of common terms in a collection (Sparck Jones, 1972). A term that occurs in many documents of the collection has a lower IDF than terms that occur less often. The combination TF-IDF and variants, such as BM25 (Robertson et al., 1995), became commonly used weighting statistics for the Vector Space Model.
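The weighting scheme described above can be illustrated with a short, self-contained sketch. It is one common textbook variant rather than a definitive formulation: it assumes raw term counts for TF, IDF computed as log(N / df), and cosine similarity as the closeness measure between query and document vectors. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def inverse_document_frequency(docs: List[List[str]]) -> Dict[str, float]:
    """IDF as log(N / df); a term present in every document gets weight 0."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; terms unseen in the collection are weighted 0."""
    return {term: count * idf.get(term, 0.0) for term, count in Counter(tokens).items()}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query: List[str], docs: List[List[str]]) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs sorted by decreasing similarity."""
    idf = inverse_document_frequency(docs)
    query_vector = tf_idf(query, idf)
    scores = [(i, cosine(query_vector, tf_idf(doc, idf))) for i, doc in enumerate(docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy collection of three tokenized documents.
docs = [
    ["apple", "supplier", "scandal"],
    ["apple", "pie", "recipe"],
    ["supplier", "audit", "report"],
]
ranking = rank(["apple", "scandal"], docs)
print(ranking)
```

In this toy collection the rarer term "scandal" receives a higher IDF than "apple", so the first document, which contains both query terms, is ranked above the second, which only shares the more common term.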
Recently, it has been observed that when people have focused information needs, entities better satisfy those queries than a list of documents or large text snippets (Pound et al., 2010). This type of retrieval is called Entity Retrieval or Entity-oriented retrieval and includes extra Information Extraction tasks for processing documents, such as Named Entity Recognition (NER) and Named Entity Disambiguation (NED). Entity Retrieval is closely connected with Question Answering (QA), although QA systems focus on understanding the semantic intent of a natural language query and deciding which sentences represent the answer to the user.
Considering the query “British politicians in Panama papers”, the expected result would be a list of names rather than documents related to British politics and the “Panama Papers” news story. There are two search patterns related to Entity Retrieval (Demartini et al., 2010a). First, the user knows the existence of a certain entity and aims to find related information about it. For example, a user may search for product-related information. Second, the user defines a predicate that constrains the search to a certain type of entities, e.g. searching for movies of a certain genre.
Online Reputation Monitoring systems usually focus on reporting statistical insights based on information extracted from Social Media and online news mentioning the target entity. However, this kind of interaction limits the possibility of users to explore all the knowledge extracted about the target entity. We believe Entity Retrieval could enhance Online Reputation Monitoring by allowing free text search over all mentions of the target entity and, consequently, allow users to discover information that descriptive statistical insights might not be able to identify.
Entity Retrieval differs from traditional document retrieval in the retrieval unit. While document retrieval considers a document as the atomic response to a query, in Entity Retrieval document boundaries are not so important and entities need to be identified based on their occurrence in documents (Adafre et al., 2007). The focus level is more granular, as the objective is to search and rank entities among documents. However, traditional Entity Retrieval systems do not exploit semantic relationships between terms in the query and in the collection of documents, i.e. if there is no match between query terms and terms describing the entity, relevant entities tend to be missed.
Entity Retrieval has been an active research topic in the last decade, including various specialized tracks, such as the Expert finding track (Chen et al., 2006), the INEX entity ranking track (Demartini et al., 2009), the TREC entity track (Balog et al., 2010) and the SIGIR EOS workshop (Balog et al., 2012b). Previous research faced two major challenges: entity representation and entity ranking. Entities are complex objects composed of a varying number of properties and are mentioned in a variety of contexts over time. Consequently, there is no single definition of the atomic unit (entity) to be retrieved. Additionally, it is a challenge to devise entity rankings that use various entity representation approaches and tackle different information needs.
There are two main approaches for tackling Entity Retrieval: “profile based approach” and “voting approach” (Balog et al., 2006). The “profile based approach” starts by applying NER and NED to the collection in order to extract all entity occurrences. Then, for each entity identified, a meta-document is created by concatenating every passage in which the entity occurs. An index of entity meta-documents is created and a standard document ranking method (e.g. BM25) is applied to rank meta-documents with respect to a given query (Azzopardi et al., 2005; Craswell et al., 2005). One of the main challenges of this approach is the transformation of original text documents to an entity-centric meta-document index, including pre-processing the collection in order to extract all entities and their context.
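The profile based approach can be sketched as follows. This is a minimal illustration, not the implementation used in the cited works: it assumes NER and NED have already produced (entity, passage) pairs, uses whitespace tokenization, and ranks meta-documents with a basic BM25 whose idf variant and parameters (`k1`, `b`) are assumptions. The names `build_profiles` and `bm25_rank` and the toy mentions are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_profiles(mentions):
    """Concatenate every passage mentioning an entity into one meta-document."""
    profiles = defaultdict(list)
    for entity, passage in mentions:
        profiles[entity].extend(passage.lower().split())
    return dict(profiles)

def bm25_rank(query, profiles, k1=1.2, b=0.75):
    """Rank entity meta-documents against a query with a basic BM25."""
    n = len(profiles)
    avg_len = sum(len(p) for p in profiles.values()) / n
    # Number of meta-documents containing each term.
    df = Counter(t for p in profiles.values() for t in set(p))
    scores = {}
    for entity, tokens in profiles.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # Assumption: non-negative idf variant, log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[entity] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy (entity, passage) pairs, as if produced by NER and NED.
mentions = [
    ("Foxconn", "Foxconn accused of exploiting workers in China"),
    ("Apple", "Apple unveils a new phone"),
    ("Foxconn", "workers protest at Foxconn factory"),
    ("Apple", "Apple supplier audit mentions workers"),
]
ranking = bm25_rank("exploiting workers", build_profiles(mentions))
```

In this toy collection the Foxconn meta-document ranks first because it aggregates both passages that match the query terms.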
In the “voting approach”, the query is processed as typical document retrieval to obtain an initial list of documents (Balog et al., 2006; Ru et al., 2005). Entities are extracted from these documents using NER and NED techniques. Then, score functions are calculated to estimate the relation between the extracted entities and the initial query. For instance, one can count the frequency of occurrence of the entity in the top documents and combine it with each document score (relevance to the query) (Balog et al., 2006). Another approach consists in taking into account the distance between the entity mention and the query terms in the documents (Petkova and Croft, 2007).
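A minimal sketch of the voting approach is given below. It assumes that document retrieval and entity extraction have already been performed, so the inputs are a ranked list of (document, score) pairs and a per-document count of entity mentions. The scoring rule (mention frequency multiplied by document score, summed over the top documents) is one simple instance of the score functions mentioned above; the function name and the placeholder data are hypothetical.

```python
from collections import defaultdict

def vote_entities(ranked_docs, doc_entities, top_k=10):
    """Score entities by summing frequency-weighted scores of the top documents."""
    votes = defaultdict(float)
    for doc_id, doc_score in ranked_docs[:top_k]:
        # Each document "votes" for the entities it mentions.
        for entity, freq in doc_entities.get(doc_id, {}).items():
            votes[entity] += freq * doc_score
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)

# Placeholder retrieval output and entity annotations (made up for illustration).
ranked_docs = [("d1", 2.0), ("d2", 1.5), ("d3", 0.5)]
doc_entities = {
    "d1": {"Politician A": 2, "Law Firm B": 1},
    "d2": {"Politician A": 1},
    "d3": {"Footballer C": 3},
}
entity_ranking = vote_entities(ranked_docs, doc_entities)
```

An entity mentioned often in highly ranked documents accumulates more votes than one mentioned many times in a weakly relevant document.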
Recently, there has been an increasing research interest in Entity Search over Linked Data, also referred to as Semantic Search, due to the availability of structured information about entities and relations in the form of Knowledge Bases (Bron et al., 2013; Zong et al., 2015; Zhiltsov et al., 2015). Semantic Search exploits rich structured entity-related information in machine-readable RDF format, expressed as triples (entity, predicate, object). There are two types of search: keyword-based and natural language based search (Pound et al., 2012; Unger et al., 2012). Regardless of the search type, the objective is to interpret the semantic structure of queries and translate it to the underlying schema of the target Knowledge Base. Most of the research focuses on interpreting the query intent (Pound et al., 2012; Unger et al., 2012), while other work focuses on how to devise a ranking framework that deals with similarities between different attributes of the entity entry in the KB and the query terms (Zhiltsov et al., 2015).
Relationship Queries: Li et al. (2012) were the first to study relationship queries for structured querying of entities over Wikipedia text with multiple predicates. This work used a query language with typed variables, for both entities and entity pairs, that integrates text conditions. It first computes individual predicate scores and then aggregates them into a result score. The proposed method to score predicates relies on redundant co-occurrence contexts.
Yahya et al. (2016) defined relationship queries as SPARQL-like subject-predicate-object (SPO) queries joined by one or more relationships. The authors cast this problem into a structured query language (SPARQL) and extended it to support textual phrases for each of the SPO arguments. It therefore allows combining structured SPARQL-like triples and text simultaneously. The authors extended the YAGO knowledge base with triples extracted from ClueWeb using an Open Information Extraction approach (Schmitz et al., 2012).
In the scope of relational databases, keyword-based graph search has been widely studied, including ranking (Yu et al., 2009). However, these approaches do not consider full documents of graph nodes and are limited to structured data. While searching over structured data is precise, it can be limited in various respects. In order to increase recall when no results are returned and to enable prioritization of results when there are too many, Elbassuoni et al. (2009) propose a language model for ranking results. Similarly, models such as EntityRank by Cheng et al. (2007) and Shallow Semantic Queries by Li et al. (2012) relax the predicate definitions in the structured queries and, instead, implement proximity operators to bind the instances across entity types. Yahya et al. (2016) propose algorithms for applying a set of relaxation rules that yield higher recall.
Entity Retrieval and proximity:
Web documents contain term information that can be used to apply pattern heuristics and statistical analysis, which are often used to infer entities, as investigated by Conrad and Utt (1994), Petkova and Croft (2007) and Rennie and Jaakkola (2005). In fact, early work by Conrad and Utt (1994) demonstrates a method that retrieves entities located in the proximity of a given keyword. They show that using a fixed-size window around proper names can be effective for supporting search for people and finding relationships among entities. Similar considerations of co-occurrence statistics have been used to identify salient terminology, i.e. keywords to include in the document index (Petkova and Croft, 2007).
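The fixed-size window idea can be illustrated with a short sketch. It is a simplification under stated assumptions: entities are single tokens given in a known set (in practice they would come from NER), tokenization is by whitespace, and the window is counted in tokens on each side of the keyword. The function name and the example sentence are hypothetical.

```python
from collections import Counter

def entities_near_keyword(tokens, keyword, entity_names, window=5):
    """Count entity tokens found within `window` positions of each keyword occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in entity_names:
                counts[tokens[j]] += 1
    return counts

# Made-up sentence for illustration.
tokens = ("reports say Foxconn faced a labour scandal while Apple "
          "announced record sales in Cupertino").split()
near = entities_near_keyword(tokens, "scandal", {"Foxconn", "Apple", "Cupertino"}, window=4)
```

Here "Foxconn" and "Apple" fall inside the window around "scandal", while "Cupertino" is too far away to be counted.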
2.2.1 Markov Random Field for IR
In this section we detail the generic Markov Random Field (MRF) model for retrieval and its variation, the Sequential Dependence Model (SDM). As we later show, this model is the basis for our entity-relationship retrieval model.
The Markov Random Field (MRF) model for retrieval was first proposed by Metzler and Croft (2005) to model query term and document dependencies. In the context of retrieval, the objective is to rank documents by computing the posterior $P(D|Q)$, given a document $D$ and a query $Q$:
$$P(D|Q) = \frac{P(Q,D)}{P(Q)}$$
For that purpose, an MRF is constructed from a graph $G$, which follows the local Markov property: every random variable in $G$ is independent of its non-neighbors given observed values for its neighbors. Therefore, different edge configurations imply different independence assumptions.
Metzler and Croft (2005) defined that $G$ consists of query term nodes $q_i$ and a document node $D$, as depicted in Figure 2.1. The joint probability mass function over the random variables in $G$ is defined by:
$$P_\Lambda(Q,D) = \frac{1}{Z_\Lambda} \prod_{c \in C(G)} \psi(c;\Lambda)$$
where $Q = q_1, \dots, q_n$ are the query term nodes, $D$ is the document node, $C(G)$ is the set of maximal cliques in $G$, and $\psi(c;\Lambda)$ is a non-negative potential function over clique configurations. The parameter $Z_\Lambda = \sum_{Q,D} \prod_{c \in C(G)} \psi(c;\Lambda)$ is the partition function that normalizes the distribution. It is generally unfeasible to compute $Z_\Lambda$, due to the exponential number of terms in the summation, and it is ignored as it does not influence ranking.
The potential functions are defined as compatibility functions between nodes in a clique. For instance, a tf-idf score can be measured to reflect the “aboutness” between a query term $q_i$ and a document $D$. Metzler and Croft (2005) propose to associate one or more real valued feature functions with each clique in the graph. The non-negative potential functions are defined using an exponential form $\psi(c;\Lambda) = \exp[\lambda_c f(c)]$, where $\lambda_c$ is a feature weight, which is a free parameter in the model, associated with feature function $f(c)$. The model allows sharing of parameters and feature functions across cliques of the same configuration, i.e. same size and type of nodes (e.g. 2-cliques of one query term node and one document node).
For each query $Q$, we construct a graph representing the query term dependencies, define a set of non-negative potential functions over the cliques of this graph and rank documents in descending order of $P_\Lambda(D|Q)$:
$$P_\Lambda(D|Q) \stackrel{rank}{=} \sum_{c \in C(G)} \lambda_c f(c)$$
Metzler and Croft concluded that, given its general form, the MRF can emulate most retrieval and dependence models, such as language models (Song and Croft, 1999).
2.2.2 Sequential Dependence Model
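The Sequential Dependence Model (SDM) is the most popular variant of the MRF retrieval model (Metzler and Croft, 2005). It defines two clique configurations, represented by the potential functions $\psi(q_i, D; \Lambda)$ and $\psi(q_i, q_{i+1}, D; \Lambda)$. Basically, it considers sequential dependency between adjacent query terms and the document node.
The potential function of the 2-cliques containing a query term node and a document node is represented as $\psi(q_i, D; \Lambda) = \exp[\lambda_T f_T(q_i, D)]$. The clique configuration containing contiguous query terms and a document node is represented by two real valued functions. The first considers exact ordered matches of the two query terms in the document, while the second aims to capture unordered matches within fixed window sizes. Consequently, the second potential function is $\psi(q_i, q_{i+1}, D; \Lambda) = \exp[\lambda_O f_O(q_i, q_{i+1}, D) + \lambda_U f_U(q_i, q_{i+1}, D)]$.
Replacing $\psi$ by these potential functions in Equation 3.38 and factoring out the parameters $\lambda$, the SDM can be represented as a mixture model computed over term, phrase and proximity feature classes:
$$P(D|Q) \stackrel{rank}{=} \lambda_T \sum_{q_i \in Q} f_T(q_i, D) + \lambda_O \sum_{q_i, q_{i+1} \in Q} f_O(q_i, q_{i+1}, D) + \lambda_U \sum_{q_i, q_{i+1} \in Q} f_U(q_i, q_{i+1}, D)$$
where the free parameters $\lambda$ must follow the constraint $\lambda_T + \lambda_O + \lambda_U = 1$. Coordinate Ascent was chosen to learn the optimal $\lambda$ values that maximize mean average precision using training data (Metzler and Croft, 2007). Considering $tf$ the frequency of the term(s) in the document $D$ and $cf$ the frequency of the term(s) in the entire collection $C$, the feature functions in SDM are set as:
$$f_T(q_i, D) = \log \frac{tf_{q_i,D} + \mu \frac{cf_{q_i}}{|C|}}{|D| + \mu}$$
$$f_O(q_i, q_{i+1}, D) = \log \frac{tf_{\#1(q_i,q_{i+1}),D} + \mu \frac{cf_{\#1(q_i,q_{i+1})}}{|C|}}{|D| + \mu}$$
$$f_U(q_i, q_{i+1}, D) = \log \frac{tf_{\#uwN(q_i,q_{i+1}),D} + \mu \frac{cf_{\#uwN(q_i,q_{i+1})}}{|C|}}{|D| + \mu}$$
where $\mu$ is the Dirichlet prior for smoothing, $\#1(q_i, q_{i+1})$ is a function that searches for exact matches of the phrase “$q_i\ q_{i+1}$” and $\#uwN(q_i, q_{i+1})$ is a function that searches for co-occurrences of $q_i$ and $q_{i+1}$ within a window of fixed-N terms (usually 8 terms) across document $D$. SDM has shown state-of-the-art performance in ad-hoc document retrieval when compared with several bigram dependence models and standard bag-of-words retrieval models, across short and long queries (Huston and Croft, 2014).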
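To make the SDM scoring concrete, the sketch below computes the weighted sum of term, ordered-phrase and unordered-window features, each as the log of a Dirichlet-smoothed frequency estimate. It is an illustrative approximation rather than a reference implementation: the weights (0.8, 0.1, 0.1), the prior `mu`, the pairwise counting of unordered matches and the flooring of zero collection frequencies (to avoid log(0)) are all assumptions, and the function names and toy collection are hypothetical.

```python
import math

def count_ordered(doc, a, b):
    """Exact matches of the phrase 'a b' (ordered, adjacent)."""
    return sum(1 for i in range(len(doc) - 1) if doc[i] == a and doc[i + 1] == b)

def count_unordered(doc, a, b, window=8):
    """Simplified unordered match: pairs of positions of a and b closer than `window`."""
    pos_a = [i for i, t in enumerate(doc) if t == a]
    pos_b = [i for i, t in enumerate(doc) if t == b]
    return sum(1 for i in pos_a for j in pos_b if i != j and abs(i - j) < window)

def sdm_score(query, doc, collection, lambdas=(0.8, 0.1, 0.1), mu=2500.0, window=8):
    """Rank-equivalent SDM score of `doc` for a tokenized `query`."""
    lam_t, lam_o, lam_u = lambdas  # assumed weights, summing to 1
    coll_len = sum(len(d) for d in collection)

    def feature(counter):
        tf = counter(doc)
        # Assumption: floor the collection frequency to avoid log(0) for unseen events.
        cf = max(sum(counter(d) for d in collection), 0.5)
        return math.log((tf + mu * cf / coll_len) / (len(doc) + mu))

    score = sum(lam_t * feature(lambda d, q=q: d.count(q)) for q in query)
    for a, b in zip(query, query[1:]):
        score += lam_o * feature(lambda d: count_ordered(d, a, b))
        score += lam_u * feature(lambda d: count_unordered(d, a, b, window))
    return score

# Toy collection (made up for illustration).
collection = [
    "apple foxconn scandal hits supply chain".split(),
    "foxconn denies scandal as apple responds".split(),
    "new phone launch draws crowds".split(),
]
query = ["foxconn", "scandal"]
scores = [sdm_score(query, d, collection, mu=10.0) for d in collection]
```

The first document contains the exact phrase and therefore scores highest; the second matches both terms only within the window; the third matches nothing and is ranked last.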
2.2.3 MRF for Entity Retrieval
The current state-of-the-art methods in ad-hoc entity retrieval from knowledge graphs are based on MRF (Zhiltsov et al., 2015; Nikolaev et al., 2016). The Fielded Sequential Dependence Model (FSDM) (Zhiltsov et al., 2015) extends SDM for structured document retrieval and is applied to entity retrieval from knowledge graphs. In this context, entity documents are composed of fields representing metadata about the entity. Each entity document has five fields: names, attributes, categories, similar entity names and related entity names. FSDM builds individual language models for each field in the knowledge base. This corresponds to replacing SDM feature functions with those of the Mixture of Language Models (Ogilvie and Callan, 2003). The feature functions of FSDM are defined as:
$$f_T(q_i, D) = \log \sum_j w_j^T \frac{tf_{q_i,D_j} + \mu_j \frac{cf_{q_i,j}}{|C_j|}}{|D_j| + \mu_j}$$
$$f_O(q_i, q_{i+1}, D) = \log \sum_j w_j^O \frac{tf_{\#1(q_i,q_{i+1}),D_j} + \mu_j \frac{cf_{\#1(q_i,q_{i+1}),j}}{|C_j|}}{|D_j| + \mu_j}$$
$$f_U(q_i, q_{i+1}, D) = \log \sum_j w_j^U \frac{tf_{\#uwN(q_i,q_{i+1}),D_j} + \mu_j \frac{cf_{\#uwN(q_i,q_{i+1}),j}}{|C_j|}}{|D_j| + \mu_j}$$
where $\mu_j$ are the Dirichlet priors for each field and $w_j$ are the weights for each field, which must be non-negative with constraint $\sum_j w_j = 1$. Coordinate Ascent was used in two stages to learn the $w_j$ and $\lambda$ values (Zhiltsov et al., 2015).
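The fielded term feature can be illustrated with a short sketch: for one query term, a Dirichlet-smoothed estimate is computed per field and the estimates are mixed with non-negative field weights that sum to one before taking the logarithm. Only two of the five fields are used for brevity, and all names, statistics, priors and weights below are hypothetical values chosen for the illustration.

```python
import math

def fsdm_term_feature(term, entity_doc, weights, coll_tf, coll_len, mu):
    """Log of a field-weighted mixture of Dirichlet-smoothed term estimates."""
    mixture = 0.0
    for field, w in weights.items():
        tokens = entity_doc.get(field, [])
        # Background probability of the term in this field across the collection.
        background = coll_tf[field].get(term, 0) / coll_len[field]
        mixture += w * (tokens.count(term) + mu[field] * background) / (len(tokens) + mu[field])
    return math.log(mixture) if mixture > 0 else float("-inf")

# Hypothetical entity document and per-field collection statistics.
entity_doc = {
    "names": ["apple", "inc"],
    "attributes": ["technology", "company", "cupertino"],
}
coll_tf = {"names": {"apple": 5}, "attributes": {"apple": 2}}
coll_len = {"names": 100, "attributes": 1000}
mu = {"names": 10.0, "attributes": 10.0}

name_heavy = fsdm_term_feature("apple", entity_doc, {"names": 0.7, "attributes": 0.3},
                               coll_tf, coll_len, mu)
attr_heavy = fsdm_term_feature("apple", entity_doc, {"names": 0.3, "attributes": 0.7},
                               coll_tf, coll_len, mu)
```

Because the term occurs in the names field of this entity, putting more weight on that field yields a higher feature value, which is the effect the learned field weights are meant to capture.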
The Parameterized Fielded Sequential Dependence Model (PFSDM) (Nikolaev et al., 2016) extends the FSDM by dynamically calculating the field weights for different query terms. Part-of-speech features are applied to capture the relevance of query terms to specific fields of entity documents. For instance, the NNP feature is positive if query terms are proper nouns, in which case the query terms should be mapped to the names field. Therefore, the field weight contributions of a given query term $q_i$ and a query bigram $q_i, q_{i+1}$ in a field $j$ are linear weighted combinations of features:
$$w_{q_i,j} = \sum_k \alpha^U_{j,k} \phi_k(q_i, j)$$
$$w_{q_i,q_{i+1},j} = \sum_k \alpha^B_{j,k} \phi_k(q_i, q_{i+1}, j)$$
where $\phi_k(q_i, j)$ is the feature function of a query unigram for the field $j$ and $\alpha^U_{j,k}$ is its respective weight. For bigrams, $\phi_k(q_i, q_{i+1}, j)$ is the feature function of a query bigram for the field $j$ and $\alpha^B_{j,k}$ is its respective weight. Consequently, PFSDM has $F \cdot U + F \cdot B + 3$ total parameters, where $F$ is the number of fields, $U$ is the number of field mapping features for unigrams, $B$ is the number of field mapping features for bigrams, plus the three $\lambda$ parameters. Their estimation is performed in a two stage optimization. First, the $\alpha$ parameters are learned separately for unigrams and then bigrams. This is achieved by setting the corresponding $\lambda$ parameters to zero. In the second stage, the $\lambda$ parameters are learned. Coordinate Ascent is used in both stages.
The ELR model exploits entity mentions in queries by defining a dependency between entity documents and entity links in the query (Hasibi et al., 2016).
2.3 Named Entity Disambiguation
Given a mention in a document, Named Entity Disambiguation (NED) or Entity Linking aims to predict the entity in a reference knowledge base that the string refers to, or NIL if no such entity is available. Usually the reference knowledge base (KB) includes a set of documents, where each document describes one specific entity. Wikipedia is by far the most popular reference KB (Kulkarni et al., 2009).
Previous research typically performs three steps to link an entity mention to a KB: 1) representation of the mention, i.e. extending the entity mention with relevant knowledge from the background document; 2) candidate generation, i.e. finding all possible KB entries that the mention might refer to, together with their representation; and 3) disambiguation, by computing the similarity between the represented mention and the candidate entities.
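The three steps above can be sketched as a toy pipeline. This is a deliberately simple illustration, not a method from the cited literature: the mention is represented by the bag of words of its context, candidates are generated by alias lookup, and disambiguation uses Jaccard overlap with the KB description. The KB structure, the entries and the function name `link_mention` are hypothetical.

```python
def link_mention(mention, context, kb):
    """Return the best KB entry id for a mention, or 'NIL' if there is no candidate."""
    # 1) Mention representation: the mention plus the words of its context.
    mention_repr = set(context.lower().split()) | {mention.lower()}
    # 2) Candidate generation: KB entries whose aliases include the mention string.
    candidates = [e for e, entry in kb.items() if mention.lower() in entry["aliases"]]
    if not candidates:
        return "NIL"

    # 3) Disambiguation: Jaccard similarity between mention and entry description.
    def similarity(entity):
        desc = set(kb[entity]["description"].lower().split())
        return len(mention_repr & desc) / len(mention_repr | desc)

    return max(candidates, key=similarity)

# Hypothetical two-entry knowledge base.
kb = {
    "Apple_Inc.": {"aliases": {"apple"},
                   "description": "technology company that designs the iphone"},
    "Apple_(fruit)": {"aliases": {"apple"},
                      "description": "edible fruit produced by an apple tree"},
}
linked = link_mention("Apple", "Apple unveils new iphone at company event", kb)
unlinked = link_mention("Foxconn", "Foxconn opens a new factory", kb)
```

In the first call the context words "iphone" and "company" favour the company entry; in the second call no KB entry lists the mention as an alias, so NIL is returned.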
Entity Filtering, or targeted entity disambiguation, is a special case of NED in which there is only one candidate entity, i.e. the entity that is being monitored. There is an increasing interest in developing Entity Filtering methods for Social Media texts, considering their specificities and limitations (Spina et al., 2011; Munoz et al., 2012). These approaches focus on finding relevant keywords for positive and negative cases using co-occurrence, web and collection based features. Another line of work creates topic-centric entity extraction systems where entities belong to a certain topic and are used as evidence to disambiguate the short message given its topic (Christoforaki et al., 2011). Similarly, Hangya and Farkas (2013) create features representing topic distributions over tweets using Latent Dirichlet Allocation (LDA).
The majority of research work in NED is applied to disambiguating entities in reasonably long texts, such as news or blog posts. In recent years, there has been an increasing interest in developing NED methods for Social Media texts, given their specificities and limitations (Cano Basave et al., 2013; Derczynski et al., 2013; Liu et al., 2013; Greenwood et al., 2012). A survey and evaluation of state-of-the-art NER and NED for Tweets concluded that current approaches do not perform robustly on “ill-formed, terse, and linguistically compressed” microblog texts (Derczynski et al., 2015). Some Twitter-specific methods reach F1 measures of over 80%, but are still behind the state-of-the-art results obtained on well-formed news texts.
Social Media texts are too short to provide sufficient information to calculate context similarity accurately (Derczynski et al., 2013; Meij et al., 2012; Greenwood et al., 2012; Liu et al., 2013; Davis et al., 2012). In addition, most state-of-the-art approaches leverage neighboring entities in the documents but, once again, tweets are short and rarely mention more than one or two entities. Most Twitter-specific approaches (Shen et al., 2013; Liu et al., 2013; Davis et al., 2012) extract information from other tweets and disambiguate entity mentions in these tweets collectively. The assumption is that Twitter users are content generators and tend to scatter their interests over many different messages they broadcast, which is not necessarily true (Kwak et al., 2010).
Entity Filtering has also been studied in the context of real-time classification. Davis et al. (2012) propose a pipeline containing three stages. Clearly positive examples are exploited to create filtering rules comprising collocations, users and hashtags. The remaining examples are classified using an Expectation-Maximization (EM) model trained using the clearly positive examples. Recently, Habib and Van Keulen (2016) proposed a hybrid approach in which the authors first query Google to retrieve a set of possible candidate homepages and then enrich the candidate list with text from Wikipedia. They extract a set of features for each candidate, namely, a language model and overlapping terms between tweet and document, as well as URL length and mention-URL string similarity. In addition, a prior probability of the mention corresponding to a certain entity in the YAGO knowledge base (Suchanek et al., 2007) is also used.
Recent work in NED or Entity Linking includes graph based algorithms for collective entity disambiguation, such as TagMe (Ferragina and Scaiella, 2010), Babelfy (Moro et al., 2014) and WAT (Piccinno and Ferragina, 2014). Word and entity embeddings have also been used for entity disambiguation (He et al., 2013; Fang et al., 2016; Moreno et al., 2017). More specifically, Fang et al. (2016) and Moreno et al. (2017) propose to learn an embedding space for both entities and words and then compute similarity features based on the combined representations.
2.4 Sentiment Analysis
In the last decade, the automatic processing of subjective and emotive text, commonly known as Sentiment Analysis, has triggered huge interest from the Text Mining research community (Liu, 2012). A typical task in Sentiment Analysis is text polarity classification, which in the context of this work can be formalized as follows: given a text span that mentions a target entity, decide whether it conveys positive, negative or neutral sentiment towards the target.
With the rise of Social Media, research on Sentiment Analysis shifted towards Twitter. New challenges have arisen, including slang, misspelling, emoticons and poor grammatical structure (Liu, 2012). A number of competitions were organized, such as SemEval (Rosenthal et al., 2015), leading to the creation of resources for research (Mohammad et al., 2013).
There are two main approaches to sentiment polarity classification: lexicon-based, which uses a dictionary of terms and phrases with annotated polarity, and supervised learning, which builds a model of the differences in language associated with each polarity based on training examples. In the supervised learning approach, a classifier is specifically trained for a particular type of text (e.g. tweets about politics). Consequently, it is possible to capture peculiarities of the language used in that context. As expected, this reduces the generality of the model, as it is biased towards a specific domain. Supervised learning approaches require training data. On Twitter, most previous work obtained training data by assuming that emoticons represent the tweet polarity (positive, negative, neutral) (Kouloumpis et al., 2011a), or by using third party software, such as the Stanford Sentiment Analyzer (Bamman and Smith, 2015).
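The lexicon-based approach can be illustrated with a minimal sketch: the polarity of a text is decided by summing the annotated polarity of the lexicon entries it contains. The lexicon below is a hypothetical four-word dictionary, and the whitespace tokenization and the decision rule (sign of the sum) are simplifying assumptions.

```python
def lexicon_polarity(text, lexicon):
    """Classify text as positive, negative or neutral from summed word polarities."""
    score = sum(lexicon.get(tok, 0) for tok in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Hypothetical polarity lexicon (word -> +1 / -1).
lexicon = {"great": 1, "love": 1, "scandal": -1, "terrible": -1}

labels = [
    lexicon_polarity("I love the new phone", lexicon),
    lexicon_polarity("terrible scandal", lexicon),
    lexicon_polarity("the phone was announced today", lexicon),
]
```

The sketch also shows the main limitations of the approach: words outside the lexicon contribute nothing and context (for example negation) is ignored.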
Lexicon-based approaches have been shown to work effectively on conventional text (Liu, 2010) but tend to be ill suited for Twitter data. With the purpose of overcoming this limitation, SentiStrength, an algorithm that uses a human-coded lexicon specifically tailored to Social Media text, was introduced (Thelwall et al., 2012). SentiStrength has become a reference in recent years due to its relatively good and consistent performance on polarity classification of Social Media texts. Nevertheless, it is confined to a fixed set of words and it is context independent.
The recent interest in deep learning led to approaches that use deep learned word embeddings as features in a variety of Text Mining tasks (Bengio, 2013; Mikolov et al., 2013a). In Sentiment Analysis, recent work integrated polarity information of text into the word embedding by extending the probabilistic document model obtained from Latent Dirichlet Allocation (Maas et al., 2011). Others learned task-specific embeddings from an existing embedding and sentences with annotated polarity (Labutov and Lipson, 2013), or learned polarity-specific word embeddings from tweets collected using emoticons (Sun et al., 2014).
2.5 Word Embeddings
The most popular and simple way to model and represent text data is the Vector Space Model (Salton et al., 1975). A vector of features in a multi-dimensional feature space represents each lexical item (e.g. a word) in a document and each item is independent of other items in the document. This makes it possible to compute geometric operations over vectors of lexical items using well established algebraic methods. However, the Vector Space Model faces some limitations. For instance, the same word can express different meanings in different contexts (the polysemy problem) or different words may be used to describe the same meaning (the synonymy problem). Since 2000, a variety of different methods (e.g. LDA (Blei et al., 2003)) and resources (e.g. DBpedia (Auer et al., 2007)) have been developed to try to assign semantics, or meaning, to concepts and parts of text.
Word embedding methods aim to represent words as real valued continuous vectors in a much lower dimensional space when compared to traditional bag-of-words models. Moreover, this low dimensional space is able to capture lexical and semantic properties of words. Co-occurrence statistics are the fundamental information that allows creating such representations. Two approaches exist for building word embeddings. One creates a low rank approximation of the word co-occurrence matrix, such as in the case of Latent Semantic Analysis (Deerwester et al., 1990) and GloVe (Pennington et al., 2014). The other approach consists in extracting internal representations from neural network models of text (Bengio et al., 2003; Collobert and Weston, 2008; Mikolov et al., 2013a). Levy and Goldberg (2014) showed that the two approaches are closely related.
Although word embedding research goes back several decades, it was the recent developments of Deep Learning and the word2vec framework (Mikolov et al., 2013a) that captured the attention of the NLP community. Moreover, Mikolov et al. (2013b) showed that embeddings trained using word2vec models (CBOW and Skip-gram) exhibit linear structure, allowing analogy questions of the form “man:woman::king:?”, and can boost performance of several text classification tasks.
In this context, the objective is to maximize the likelihood that words are predicted given their context. word2vec has two models for learning word embeddings, the skip-gram model (SG) and the continuous-bag-of-words model (CBOW). Here we focus on CBOW. More formally, every word is mapped to a unique vector represented by a column in a projection matrix $W \in \mathbb{R}^{d \times V}$, with $d$ as embedding dimension and $V$ as the total number of words in the vocabulary. Given a sequence of words $w_1, \dots, w_T$, the objective is to maximize the average log probability:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log P(w_t | w_{t+j})$$
where $c$ is the size of the context window and $w_{t+j}$ is a word in the context window of the center word $w_t$. The context vector is obtained by averaging the embeddings of each word $w_{t+j}$, and the prediction of the center word $w_t$ is performed using a softmax multiclass classifier over the whole vocabulary $V$:
$$P(w_t | w_{t+j}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$$
Each $y_i$ is the un-normalized log-probability for each output word $i$. After training, a low dimensionality embedding matrix $E$ encapsulating information about each word in the vocabulary and its surrounding contexts is learned, transforming a one-hot sparse representation of words into a compact real valued embedding vector of size $d$. This matrix can then be used as input to other learning algorithms tailored for specific tasks to further enhance performance.
For large vocabularies it is unfeasible to compute the partition function (normalizer) of the softmax; therefore, Mikolov et al. (2013a) propose to use the hierarchical softmax objective function or to approximate the partition function using a technique called negative sampling. Stochastic gradient descent is usually applied for training the softmax, where the gradient is obtained via backpropagation.
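The CBOW prediction step can be illustrated with a small sketch of the forward pass using the full softmax (no hierarchical softmax or negative sampling, and no training). It averages the embeddings of the context words, computes one un-normalized score per vocabulary word and normalizes the scores. For readability the embeddings are stored one row per word rather than one column per word; the matrices `W_in` and `W_out`, their values and the function name are hypothetical.

```python
import math

def cbow_probabilities(context_ids, W_in, W_out):
    """Softmax distribution over the vocabulary given the ids of the context words."""
    d = len(W_in[0])
    # Context vector: average of the embeddings of the context words.
    h = [sum(W_in[i][k] for i in context_ids) / len(context_ids) for k in range(d)]
    # Un-normalized log-probabilities, one per vocabulary word.
    y = [sum(w_k * h_k for w_k, h_k in zip(row, h)) for row in W_out]
    # Numerically stable softmax (subtracting the max leaves the result unchanged).
    m = max(y)
    exp_y = [math.exp(v - m) for v in y]
    z = sum(exp_y)  # partition function: a sum over the whole vocabulary
    return [e / z for e in exp_y]

# Toy model with a vocabulary of 3 words and embedding dimension 2.
W_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # input embeddings, one row per word
W_out = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]  # output weights, one row per word
probs = cbow_probabilities([0, 2], W_in, W_out)
```

The sum in `z` runs over every vocabulary word, which is the cost that hierarchical softmax and negative sampling are designed to avoid for large vocabularies.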
There are several approaches to generating word embeddings. One can build models that explicitly aim at generating word embeddings, such as Word2Vec or GloVe (Mikolov et al., 2013a; Pennington et al., 2014), or one can extract such embeddings as by-products of more general models, which implicitly compute such word embeddings in the process of solving other language tasks.
One of the issues of recent work in training word embeddings is the variability of the experimental setups reported. For instance, in the paper describing GloVe (Pennington et al., 2014) the authors trained their model on five corpora of different sizes and built a vocabulary of the 400K most frequent words. Mikolov et al. (2013b) trained with an 82K vocabulary, while Mikolov et al. (2013a) trained with a 3M vocabulary. Recently, Arora et al. (2015) proposed a generative model for learning embeddings that tries to provide theoretical justification for nonlinear models (e.g. word2vec and GloVe) and for some hyperparameter choices. The authors evaluated their model using a 68K vocabulary.
The organizers of SemEval 2016-Task 4: Sentiment Analysis in Twitter report that participants either used general purpose pre-trained word embeddings, or trained them from the Tweet 2016 dataset or “from some sort of dataset” (Nakov et al., 2016). However, participants report neither the size of the vocabulary used nor the possible effect it might have on the task specific results.
Recently, Rodrigues et al. (2016) created and distributed the first general purpose embeddings for Portuguese. The gensim implementation of word2vec was used and the authors report results with different values for the parameters of the framework. Furthermore, the authors used experts to translate well established word embedding test sets into Portuguese, which they also made publicly available; we use some of those in this work.
2.6 Predicting Collective Attention
Online Reputation Monitoring systems would be even more useful if they were able to know in advance whether social media users will talk a lot about the target entities or not. In recent years, a number of research works have studied the relationship and predictive behavior of user response to the publication of online media items, such as commenting on news articles, playing Youtube videos, sharing URLs or retweeting patterns (R. Bandari and Huberman, 2012; Yang and J.Leskovec, 2011; M. Tsagkias and Rijke, 2009; He et al., 2014). The first attempt to predict the volume of user comments for online news articles used both metadata from the news articles and linguistic features (M. Tsagkias and Rijke, 2009). The prediction was divided into two binary classification problems: whether an article would get any comments and whether it would get a high or low number of comments. Similarly, other studies found that shallow linguistic features (e.g. TF-IDF or sentiment) and named entities have good predictive power (Gottipati and Jiang, 2012; Louis and Nenkova, 2013).
Research work more in line with ours tries to predict the popularity of news article shares (URL sharing) on Twitter based on content features (R. Bandari and Huberman, 2012). The authors considered the news source, the article’s category, the article’s author, the subjectivity of the language in the article, and the number of named entities in the article as features. Recently, there was a large study of the life cycle of news articles in terms of the distribution of visits, tweets and shares over time across different sections of the publisher (Castillo et al., 2014). Their work was able to improve, for some content types, the prediction of web visits using data from social media after ten to twenty minutes of publication.
Other lines of work focused on temporal patterns of user activities and have consistently identified broad classes of temporal patterns based on the presence of a clear peak of activity (Crane and Sornette, 2008; Lehmann et al., 2012; Romero et al., 2011; Yang and J.Leskovec, 2011). Classes are differentiated by the specific amount and duration of activity before and after the peak. Crane and Sornette (2008) define the endogenous or exogenous origin of events based on being triggered by internal aspects of the social network or external ones, respectively. They find that hashtag popularity is mostly influenced by exogenous factors instead of epidemic spreading. Other work (Lehmann et al., 2012) extends these classes by creating distinct clusters of activity based on the distributions in different periods (before, during and after the peak) that can be interpreted based on the semantics of hashtags. Consequently, the authors applied text mining techniques to semantically describe hashtag classes. Yang and Leskovec (2011) propose a new measure of time series similarity and clustering. The authors obtain six classes of temporal shapes of popularity of a given phrase (meme) associated with a recent event, as well as the ordering of media sources’ contribution to its popularity.
Recently, Tsytsarau et al. (2014) studied the time series of news events and their relation to changes of sentiment time series expressed on related topics on social media. The authors proposed a novel framework using time series convolution between the importance of events and a media response function, specific to media and event type. Their framework is able to predict the time and duration of events as well as their shape through time.
2.7 Political Data Science
Content analysis of mass media has an established tradition in the social sciences, particularly in the study of effects of media messages, encompassing topics as diverse as those addressed in seminal studies of newspaper editorials (Lasswell, 1952), media agenda-setting (McCombs and Shaw, 1972), or the uses of political rhetoric (Moen, 1990), among many others. By 1997, Riffe and Freitag (1997) reported an increase in the use of content analysis in communication research and suggested that digital text and computerized means for its extraction and analysis would reinforce such a trend. Their expectation has been fulfilled: the use of automated content analysis has by now surpassed the use of hand coding (Neuendorf, 2002). The increase in the digital sources of text, on the one hand, and current advances in computation power and design, on the other, are making this development both necessary and possible, while also raising awareness about the inferential pitfalls involved (Hopkins and King, 2010; Grimmer and Stewart, 2013).
One avenue of research that has been explored in recent years concerns the use of social media to predict present and future political events, namely electoral results (Bermingham and Smeaton, 2011; Tumasjan et al., 2010a; Marchetti-Bowick and Chambers, 2012a; Sobkowicz et al., 2012; Livne et al., 2011; Tumasjan et al., 2010b; Gayo-Avello, 2012; O’Connor et al., 2010; Chung and Mustafaraj, 2011), although there is no consensus about methods and their consistency (Metaxas et al., 2011; Gayo-Avello et al., 2011). Gayo-Avello (2013) summarizes the differences between studies conducted so far by stating that they vary in the period and method of data collection, data cleansing and pre-processing techniques, prediction approach and performance evaluation. One particular challenge when using sentiment is how to aggregate opinions in a timely fashion that can be fed to the prediction method. Two main strategies have been used to predict elections: buzz, i.e. the number of tweets mentioning a given candidate or party, and the use of sentiment polarity. Different computational approaches have been explored to process sentiment in text, namely machine learning and linguistic based methods (Pang and Lee, 2008; Kouloumpis et al., 2011b; Nakov et al., 2013). In practice, algorithms often combine both strategies.
Johnson et al. (2012) concluded that, more than predicting elections, social media can be used to gauge sentiment about specific events, such as political news or speeches. Defending the same idea, Diakopoulos and Shamma (2010) studied the global sentiment variation in Twitter messages about an Obama vs McCain political TV debate while it was still happening. Tumasjan et al. (2010b) used Twitter data to predict the 2009 Federal Election in Germany. They stated that “the mere number of party mentions accurately reflects the election result”. Bermingham and Smeaton (2011) correctly predicted the 2011 Irish General Elections, also using Twitter data. Gayo Avello et al. (2011) also tested the share of volume as a predictor in the 2010 US Senate special election in Massachusetts.
On the other hand, several other studies use sentiment as an indicator of poll results. O’Connor et al. (2010) used a sentiment aggregate function to study the relationship between the sentiment extracted from Twitter messages and poll results. They defined the sentiment aggregate function as the ratio between the positive and negative messages referring to a specific political target. They used the sentiment aggregate function as a predictive feature in a regression model, achieving a correlation of 0.80 between its results and the poll results and capturing the important large-scale trends. Bermingham and Smeaton (2011) also included sentiment features in their regression model and introduced two novel sentiment aggregate functions. For inter-party sentiment, they modified the share of volume function to represent the share of positive and negative volume. For intra-party sentiment, they used a log ratio between the number of positive and negative mentions of a given party. Moreover, they concluded that the inclusion of sentiment features augmented the effectiveness of their model. Gayo Avello et al. (2011) introduced a different aggregate function: in a two-party race, all negative messages on one party are interpreted as positive on the other party, and vice-versa.
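The aggregate functions discussed above can be illustrated with a minimal Python sketch. The function names, the toy counts and the add-one smoothing in the log ratio are our own assumptions for illustration; they are not the exact formulations used in the cited studies.

```python
import math

def share_of_volume(mentions):
    """Fraction of all party mentions that each party received (the 'buzz' signal)."""
    total = sum(mentions.values())
    return {party: count / total for party, count in mentions.items()}

def pos_neg_ratio(pos, neg):
    """Ratio of positive to negative messages about one political target."""
    return pos / neg

def log_ratio(pos, neg):
    """Log ratio of positive to negative mentions (add-one smoothing is an assumption)."""
    return math.log((pos + 1) / (neg + 1))

def two_party_positive(pos, neg):
    """Two-party race: negative messages on one party count as positive on the other."""
    a, b = pos.keys()  # assumes exactly two parties
    return {a: pos[a] + neg[b], b: pos[b] + neg[a]}
```

For example, with hypothetical counts `pos = {"A": 5, "B": 3}` and `neg = {"A": 2, "B": 4}`, `two_party_positive(pos, neg)` credits party A with the four negative messages about party B.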
In summary, suggestions for potentially independent (in other words, predictive) metrics appear in a wide variety of forms: the mention share that a party received within all party mentions during a given time-span (Bermingham and Smeaton, 2011; Sanders and Van Den Bosch, 2013; Skoric et al., 2012; Soler et al., 2012; Sang and Bos, 2012; Tumasjan et al., 2010b), the mention share of political candidates (Chen et al., 2012; DiGrazia et al., 2013; Fink et al., 2013; Gaurav et al., 2013; Skoric et al., 2012), the share of positive mentions a party received (Bermingham and Smeaton, 2011; Thapen and Ghanem, 2013), the positive mention share of candidates (O’Connor et al., 2010; Shi et al., 2012; Fink et al., 2013), the share of users commenting on a candidate or party (Sang and Bos, 2012), the share of mentions for a candidate followed by a word indicative of electoral success or failure (Jensen and Anstead, 2013), the relative increase of positive mentions of a candidate (Franch, 2013), or simply a collection of various potentially politically relevant words identified by their statistical relationship with polls or political actors in the past (Beauchamp, 2013; Contractor and Faruquie, 2013; Lampos et al., 2013; Marchetti-Bowick and Chambers, 2012b).
Suggestions for the dependent variable, metrics of political success, show a similar variety. They include the vote share that a party received on election day (Bermingham and Smeaton, 2011; Franch, 2013; Sanders and Van Den Bosch, 2013; Skoric et al., 2012; Soler et al., 2012), the vote share of a party adjusted to include votes only for parties included in the analysis (Tumasjan et al., 2010b), the vote share of candidates on election day (DiGrazia et al., 2013; Fink et al., 2013; Gaurav et al., 2013; Jensen and Anstead, 2013; Skoric et al., 2012), campaign tracking polls (Beauchamp, 2013; Contractor and Faruquie, 2013; Fink et al., 2013; Lampos et al., 2013; O’Connor et al., 2010; Shi et al., 2012; Thapen and Ghanem, 2013), politicians’ job approval ratings (Marchetti-Bowick and Chambers, 2012b; O’Connor et al., 2010), and the number of seats in parliament that a party received after the election (Sang and Bos, 2012).
3.1 Entity-Relationship Retrieval
E-R retrieval is a complex case of entity retrieval. E-R queries expect tuples of related entities as results, instead of the single ranked list of entities returned for general entity queries. For instance, the E-R query “Ethnic groups by country” expects a ranked list of tuples <ethnic group, country> as results. The goal is to search for multiple unknown entities and the relationships connecting them.
|Symbol|Description|
|$Q$|E-R query (e.g. “congresswoman hits back at US president”).|
|$Q^{E_i}$|Entity sub-query in $Q$ (e.g. “congresswoman”).|
|$Q^{R_{i-1,i}}$|Relationship sub-query in $Q$ (e.g. “hits back at”).|
|$D^{E_i}$|Term-based representation of an entity (e.g. <Frederica Wilson> = {representative, congresswoman}). We use the terminology representation and document interchangeably.|
|$D^{R_{i-1,i}}$|Term-based representation of a relationship (e.g. <Frederica Wilson, Donald Trump> = {hits, back}). We use the terminology representation and document interchangeably.|
|$Q^{E}$|The set of entity sub-queries in an E-R query (e.g. {“congresswoman”, “US president”}).|
|$Q^{R}$|The set of relationship sub-queries in an E-R query.|
|$D^{E}$|The set of entity documents to be retrieved by an E-R query.|
|$D^{R}$|The set of relationship documents to be retrieved by an E-R query.|
|$\lvert Q \rvert$|E-R query length corresponding to the number of entity and relationship sub-queries.|
|$T_E$|The entity tuple to be retrieved (e.g. <Frederica Wilson, Donald Trump>).|
In this section, we present a definition of E-R queries and a probabilistic formulation of the E-R retrieval problem from an Information Retrieval perspective. Table 3.1 presents several definitions that will be used throughout this chapter.
3.1.1 E-R Queries
E-R queries aim to obtain an ordered list of entity tuples as a result. Contrary to entity search queries, where the expected result is a ranked list of single entities, results of E-R queries should contain two or more entities. For instance, the complex information need “Silicon Valley companies founded by Harvard graduates” expects entity-pairs (2-tuples) <company, founder> as results. In turn, “European football clubs in which a Brazilian player won a trophy” expects triples (3-tuples) <club, player, trophy> as results.
Each pair of entities $E_{i-1}$, $E_i$ in an entity tuple is connected with a relationship $R(E_{i-1}, E_i)$. A complex information need can be expressed in a relational format, which is decomposed into a set of sub-queries that specify types of entities and types of relationships between entities.
For each relationship sub-query there must be two entity sub-queries, one for each of the entities involved in the relationship. Thus an E-R query $Q$ that expects 2-tuples is mapped into a triple of sub-queries $Q^{E_{i-1}}$, $Q^{R_{i-1,i}}$, $Q^{E_i}$, where $Q^{E_{i-1}}$ and $Q^{E_i}$ are the entity attributes queried for $E_{i-1}$ and $E_i$ respectively, and $Q^{R_{i-1,i}}$ is a relationship attribute describing $R(E_{i-1}, E_i)$.
If we consider an E-R query as a chain of entity and relationship sub-queries $Q^{E_1}, Q^{R_{1,2}}, Q^{E_2}, \ldots, Q^{E_{n-1}}, Q^{R_{n-1,n}}, Q^{E_n}$, and we define the length $|Q|$ of an E-R query as the number of sub-queries, then the number of entity sub-queries must be $\frac{|Q|+1}{2}$ and the number of relationship sub-queries must be $\frac{|Q|-1}{2}$. Consequently, the size of each entity tuple to be retrieved must be equal to the number of entity sub-queries. For instance, the E-R query “soccer players who dated a top model”, with answers such as <Cristiano Ronaldo, Irina Shayk>, is represented as three sub-queries {soccer players, dated, top model}.
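The relation between query length and the number of entity and relationship sub-queries can be checked with a short Python sketch. The helper name `split_er_query` and the list-based encoding of the chain are assumptions made for illustration only.

```python
def split_er_query(chain):
    """Split an interleaved chain [Q^E1, Q^R12, Q^E2, ...] into entity and relationship sub-queries."""
    if len(chain) < 3 or len(chain) % 2 == 0:
        raise ValueError("an E-R query chain needs an odd number (>= 3) of sub-queries")
    entity_subqueries = chain[0::2]        # even positions hold entity sub-queries
    relationship_subqueries = chain[1::2]  # odd positions hold relationship sub-queries
    return entity_subqueries, relationship_subqueries

entities, relationships = split_er_query(["soccer players", "dated", "top model"])
```

For a chain of length three this yields two entity sub-queries and one relationship sub-query, matching $(|Q|+1)/2$ and $(|Q|-1)/2$.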
Automatic mapping of terms from an E-R query $Q$ to sub-queries $Q^{E_i}$ or $Q^{R_{i-1,i}}$ is out of the scope of this work and can be seen as a problem of query understanding (Yahya et al., 2012; Pound et al., 2012; Sawant and Chakrabarti, 2013). We assume that the information needs are decomposed into constituent entity and relationship sub-queries using Natural Language Processing techniques or by user input through an interface that enforces the structure $Q^{E_1}, Q^{R_{1,2}}, Q^{E_2}, \ldots, Q^{E_{n-1}}, Q^{R_{n-1,n}}, Q^{E_n}$.
3.1.2 Modeling E-R Retrieval
Our approach to E-R retrieval assumes that we have a raw document collection $D$ (e.g. news articles) and that each document $D_j$ is associated with one or more entities $E_i$. In other words, documents contain mentions of one or more entities that can be related to each other. Since our goal is to retrieve tuples of related entities given an E-R query that expresses entity attributes and relationship attributes, we need to create term-based representations for both entities and relationships. We denote a representation of an entity $E_i$ as $D^{E_i}$.
In E-R retrieval we are interested in retrieving tuples of entities as a result. The number of entities in each tuple can be two, three or more depending on the structure of the particular E-R query. When an E-R query aims to get tuples of more than two entities, we assume it is possible to combine tuples of length two. For instance, we can associate two tuples of length two that share the same entity to retrieve a tuple of length three. Therefore we create representations of relationships as pairs of entities. We denote a representation of a relationship $R(E_{i-1}, E_i)$ as $D^{R_{i-1,i}}$.
Consider the example query “Which spiritual leader won the same award as a US vice president?”. It can be formulated in the relational format as {spiritual leader, won, award, won, US vice president}. Associating the tuples of length two <Dalai Lama, Nobel Peace Prize> and <Nobel Peace Prize, Al Gore> would result in the expected 3-tuple <Dalai Lama, Nobel Peace Prize, Al Gore>.
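The association of 2-tuples that share an entity can be sketched in Python as a simple join. The function name `join_pairs` and the second, non-matching pair in the usage below are hypothetical and serve only as an illustration.

```python
def join_pairs(left_pairs, right_pairs):
    """Join 2-tuples that share an entity: (a, b) and (b, c) yield the 3-tuple (a, b, c)."""
    by_first = {}
    for b, c in right_pairs:
        by_first.setdefault(b, []).append(c)  # index right-hand pairs by their first entity
    return [(a, b, c) for a, b in left_pairs for c in by_first.get(b, [])]

triples = join_pairs(
    [("Dalai Lama", "Nobel Peace Prize"), ("Some Leader", "Some Other Award")],
    [("Nobel Peace Prize", "Al Gore")],
)
```

Only the pair whose second entity also opens a pair in the second list survives the join, which is the behaviour expected from consecutive relationship sub-queries.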
For the sake of clarity we now consider an example E-R query with three sub-queries ($|Q| = 3$). This query aims to retrieve a tuple of length two, i.e. a pair of entities connected by a relationship. Based on the definition of an E-R query, each entity in the resulting tuple must be relevant to the corresponding entity sub-query in $Q^{E}$. Moreover, the relationship between the two entities must also be relevant to the relationship sub-query in $Q^{R}$. Instead of calculating a simple posterior $P(D \mid Q)$ as with traditional information retrieval, in E-R retrieval the objective is to rank tuples based on a joint posterior of multiple entity and relationship representations given an E-R query, such as $P(D^{E_2}, D^{E_1}, D^{R_{1,2}} \mid Q)$ when $|Q| = 3$.
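E-R queries can be seen as chains of interleaved entity and relationship sub-queries. We take advantage of the chain rule to formulate the joint probability as a product of conditional probabilities. Formally, we want to rank entity and relationship candidates in descending order of the joint posterior $P(D^{E_2}, D^{E_1}, D^{R_{1,2}} \mid Q)$ as:
$$P(D^{E_2}, D^{E_1}, D^{R_{1,2}} \mid Q) = P(D^{E_2} \mid D^{E_1}, D^{R_{1,2}}, Q)\, P(D^{E_1} \mid D^{R_{1,2}}, Q)\, P(D^{R_{1,2}} \mid Q)$$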
We consider conditional independence between entity representations within the joint posterior, i.e., the probability of a given entity representation $D^{E_i}$ being relevant given an E-R query is independent of knowing that another entity representation $D^{E_{i+1}}$ is relevant as well. As an example, consider the query “action movies starring a British actor”. Retrieving entity representations for “action movies” is independent of knowing that <Tom Hardy> is relevant to the sub-query “British actor”. However, it is not independent of knowing the set of relevant relationships for the sub-query “starring”. If a given action movie is not in the set of relevant entity-pairs for “starring”, it does not make sense to consider it as relevant. Consequently, $P(D^{E_2} \mid D^{E_1}, D^{R_{1,2}}, Q) = P(D^{E_2} \mid D^{R_{1,2}}, Q)$.
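Since E-R queries can be decomposed into constituent entity and relationship sub-queries, ranking candidate tuples using the joint posterior $P(D^{E_2}, D^{E_1}, D^{R_{1,2}} \mid Q)$ is rank proportional to the product of conditional probabilities on the corresponding entity and relationship sub-queries $Q^{E_1}$, $Q^{E_2}$ and $Q^{R_{1,2}}$:
$$P(D^{E_2}, D^{E_1}, D^{R_{1,2}} \mid Q) \propto P(D^{E_2} \mid D^{R_{1,2}}, Q^{E_2})\, P(D^{E_1} \mid D^{R_{1,2}}, Q^{E_1})\, P(D^{R_{1,2}} \mid Q^{R_{1,2}})$$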
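We now consider a longer E-R query aiming to retrieve a triple of connected entities. This query has three entity sub-queries and two relationship sub-queries, thus $|Q| = 5$. As we previously explained, when there is more than one relationship sub-query we need to join entity-pairs relevant to each relationship sub-query that have one entity in common. From a probabilistic point of view this can be seen as conditional dependence on the entity-pairs retrieved for the previous relationship sub-query, i.e. $P(D^{R_{2,3}} \mid D^{R_{1,2}}, Q^{R_{2,3}})$. To rank entity and relationship candidates we need to calculate the following joint posterior:
$$P(D^{E_3}, D^{E_2}, D^{E_1}, D^{R_{2,3}}, D^{R_{1,2}} \mid Q) \propto P(D^{E_3} \mid D^{R_{2,3}}, Q^{E_3})\, P(D^{E_2} \mid D^{R_{2,3}}, D^{R_{1,2}}, Q^{E_2})\, P(D^{E_1} \mid D^{R_{1,2}}, Q^{E_1})\, P(D^{R_{2,3}} \mid D^{R_{1,2}}, Q^{R_{2,3}})\, P(D^{R_{1,2}} \mid Q^{R_{1,2}})$$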
When compared to the previous example, the joint posterior for $|Q| = 5$ shows that entity candidates for $D^{E_2}$ are conditionally dependent on both $D^{R_{1,2}}$ and $D^{R_{2,3}}$. In other words, entity candidates for $D^{E_2}$ must belong to entity-pair candidates for both relationship representations that are connected with $E_2$, i.e. $D^{R_{1,2}}$ and $D^{R_{2,3}}$.
We are now able to generalize E-R retrieval as a factorization into conditional probabilities of a joint probability of entity representations $D^{E}$, relationship representations $D^{R}$, entity sub-queries $Q^{E}$ and relationship sub-queries $Q^{R}$. This set of random variables and their conditional dependencies can be easily represented in a probabilistic directed acyclic graph, i.e. a Bayesian network (Pearl, 1985).
In Bayesian networks, nodes represent random variables while edges represent conditional dependencies. All nodes that point to a given node are considered its parents. Bayesian networks define the joint probability of a set of random variables as a factorization of the conditional probability of each random variable conditioned on its parents. Formally, $P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid pa_{X_i})$, where $pa_{X_i}$ represents all parent nodes of $X_i$.
Figure 3.1 depicts the representation of E-R retrieval for different query lengths using Bayesian networks. The graphical representation helps establish a few guidelines for modeling E-R retrieval. First, each sub-query points to the respective document node. Second, relationship document nodes always point to the contiguous entity representations. Last, when there is more than one relationship sub-query, relationship documents also point to the subsequent relationship document.
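Once we draw the graph structure for the number of sub-queries in $Q$, we are able to compute a product of conditional probabilities of each node given its parents. Adapting the general joint probability formulation of Bayesian networks to E-R retrieval, we come up with the following generalization, where $n = (|Q|+1)/2$ is the number of entity sub-queries and conditioning terms with out-of-range indices ($D^{R_{0,1}}$ and $D^{R_{n,n+1}}$) are omitted:
$$P(D^{E}, D^{R} \mid Q) \propto \prod_{i=1}^{n} P(D^{E_i} \mid D^{R_{i-1,i}}, D^{R_{i,i+1}}, Q^{E_i}) \prod_{i=1}^{n-1} P(D^{R_{i,i+1}} \mid D^{R_{i-1,i}}, Q^{R_{i,i+1}})$$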
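We denote $D^{R}$ as the set of all candidate relationship documents in the graph and $D^{E}$ as the set of all candidate entity documents in the graph. In Information Retrieval it is often convenient to work in log-space, as this does not affect ranking and transforms the product of conditional probabilities into a summation. With $n = (|Q|+1)/2$ entity sub-queries, this gives:
$$\log P(D^{E}, D^{R} \mid Q) \overset{rank}{=} \sum_{i=1}^{n} \log P(D^{E_i} \mid D^{R_{i-1,i}}, D^{R_{i,i+1}}, Q^{E_i}) + \sum_{i=1}^{n-1} \log P(D^{R_{i,i+1}} \mid D^{R_{i-1,i}}, Q^{R_{i,i+1}})$$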
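That moving to log-space leaves the ranking unchanged can be verified with a small Python sketch. The candidate tuples and their conditional probabilities are invented toy values, and the function names are assumptions for illustration.

```python
import math

def product_score(factors):
    """Joint posterior (up to a constant) as a product of conditional probabilities."""
    result = 1.0
    for p in factors:
        result *= p
    return result

def log_score(factors):
    """Same quantity in log-space: a sum of log conditional probabilities."""
    return sum(math.log(p) for p in factors)

# Hypothetical candidate tuples, each with three conditional probabilities (|Q| = 3).
candidates = {
    "tuple_a": [0.30, 0.20, 0.50],
    "tuple_b": [0.10, 0.40, 0.60],
    "tuple_c": [0.05, 0.90, 0.10],
}
by_product = sorted(candidates, key=lambda t: product_score(candidates[t]), reverse=True)
by_log = sorted(candidates, key=lambda t: log_score(candidates[t]), reverse=True)
```

Because the logarithm is monotonically increasing, both orderings coincide. The sum of logs is also less prone to numerical underflow when many small probabilities are multiplied.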
We now present two design patterns to compute each conditional probability for every candidate entity and relationship document.
3.2 Design Patterns for Entity-Relationship Retrieval
Traditional ad-hoc document retrieval approaches create direct term-based representations of raw documents. A retrieval model (e.g. Language Models) is then used to match the information need, expressed as a keyword query, against those representations. However, E-R retrieval requires collecting evidence for both entities and relationships that can be spread across multiple documents, so it is not possible to create direct term-based representations. Instead, raw documents serve as a proxy to connect queries with entities and relationships.
Abstractly speaking, entity retrieval can be seen as a problem of object retrieval in which the search process is about fusing information about a given object, such as in the case of verticals (e.g. Google Finance). Recently, Zhang and Balog (2017) presented two design patterns for fusion-based object retrieval.
The first design pattern, Early Fusion, is an object-centric approach where a term-based representation of objects is created early in the retrieval process. First, it creates meta-documents by aggregating term counts across the documents associated with the objects. Later, it matches queries against these meta-documents using standard retrieval methods.
The second design pattern, Late Fusion, is a document-centric approach in which documents relevant to the query are retrieved first and the objects associated with the top documents are ranked later in the retrieval process. These design patterns represent a generalization of Balog’s Model 1 and Model 2 for expertise retrieval (Balog et al., 2006).
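The contrast between the two design patterns can be made concrete with a toy Python sketch. The document collection, the term-count matching function and the top-k cut-off are all assumptions chosen for brevity; they stand in for a real retrieval model and are not the formulation of Zhang and Balog (2017).

```python
from collections import Counter, defaultdict

# Hypothetical toy collection: each raw document has terms and associated objects.
DOCS = [
    {"terms": ["phone", "maker", "supplier"], "objects": ["Apple", "Foxconn"]},
    {"terms": ["phone", "launch"], "objects": ["Apple"]},
    {"terms": ["factory", "supplier"], "objects": ["Foxconn"]},
]

def match(query, term_counts):
    """Toy retrieval score (an assumption): total count of query terms."""
    return sum(term_counts[t] for t in query)

def early_fusion(query, docs):
    """Aggregate term counts into one meta-document per object, then match the query."""
    meta = defaultdict(Counter)
    for doc in docs:
        for obj in doc["objects"]:
            meta[obj].update(doc["terms"])
    return {obj: match(query, counts) for obj, counts in meta.items()}

def late_fusion(query, docs, k=2):
    """Retrieve the top-k documents first, then sum their scores per associated object."""
    scored = sorted(
        ((match(query, Counter(d["terms"])), i) for i, d in enumerate(docs)),
        reverse=True,
    )
    scores = defaultdict(int)
    for score, i in scored[:k]:
        for obj in docs[i]["objects"]:
            scores[obj] += score
    return dict(scores)
```

In this sketch Early Fusion sees all evidence for an object at once, whereas Late Fusion only sees the evidence contained in the top-k retrieved documents, so the two patterns can rank the same objects differently.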
In essence, E-R retrieval is an extension, or a more complex case, of object retrieval where, besides ranking objects, we need to rank tuples of objects that satisfy the relationship expressed in the E-R query. This requires creating representations of both entities and relationships by fusing information spread across multiple raw documents. We propose novel fusion-based design patterns for E-R retrieval that are inspired by the design patterns presented by Zhang and Balog (2017) for single object retrieval. We extend those design patterns to accommodate the specificities of E-R retrieval. We hypothesize that it should be possible to generalize the term dependence models to represent entity-relationships and achieve effective E-R retrieval without entity or relationship type restrictions (e.g. categories), as happens with Semantic Web based approaches.
3.2.1 Early Fusion
The Early Fusion strategy presented by Zhang and Balog (2017) consists in creating a term-based representation for each object under retrieval, i.e., a meta-document containing all terms in the proximity of every object mention across a document collection. As described in the previous section, E-R queries can be formulated as a sequence of multiple entity queries $Q^{E}$ and relationship queries $Q^{R}$. In an Early Fusion approach, each of these queries should be matched against a previously created term-based representation. Since there are two types of queries, we propose to create two types of term-based representations, one for entities and another for relationships.
Our Early Fusion design pattern is similar to Model 1 of Balog et al. (2006). It can be thought of as creating two types of meta-documents, $D^{E}$ and $D^{R}$. A meta-document $D^{E_i}$ is created by aggregating the context terms of the occurrences of $E_i$ across the raw document collection. On the other hand, for each pair of entities $E_{i-1}$ and $E_i$ that co-occur close together across the raw document collection, we aggregate context terms that describe the relationship to create a meta-document $D^{R_{i-1,i}}$.
In our approach we focus on sentence-level information about entities and relationships, although the design pattern can be applied to more complex segmentations of text (e.g. dependency parsing). We rely on Entity Linking methods for disambiguating and assigning unique identifiers to entity mentions in raw documents $D$. We collect entity contexts across the raw document collection and index them in the entity index. The same is done by collecting and indexing entity-pair contexts in the relationship index.
We define the (pseudo) frequency of a term $t$ for an entity meta-document $D^{E_i}$ as follows:
$$f(t, D^{E_i}) = \sum_{j=1}^{N} f(t, E_i, D_j)\, w(E_i, D_j)$$
where $N$ is the total number of raw documents in the collection and $f(t, E_i, D_j)$ is the term frequency in the context of the entity $E_i$ in a raw document $D_j$. $w(E_i, D_j)$ is the entity-document association weight, which corresponds to the weight of the document $D_j$ in the mentions of the entity $E_i$ across the raw document collection. Similarly, the (pseudo) frequency of a term $t$ for a relationship meta-document $D^{R_{i-1,i}}$ is defined as follows:
$$f(t, D^{R_{i-1,i}}) = \sum_{j=1}^{N} f(t, R_{i-1,i}, D_j)\, w(R_{i-1,i}, D_j)$$
where $f(t, R_{i-1,i}, D_j)$ is the term frequency in the context of the pair of entity mentions corresponding to the relationship $R_{i-1,i}$ in a raw document $D_j$ and $w(R_{i-1,i}, D_j)$ is the relationship-document association weight. In this work we use binary association weights indicating the presence or absence of an entity mention in a raw document, and likewise for a relationship. However, other weighting methods can be used.
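The aggregation of context terms into (pseudo) frequencies under binary association weights can be sketched in Python as follows. The per-document dictionary layout, the function name and the example contexts are hypothetical choices made for illustration.

```python
from collections import Counter

def pseudo_frequency(term, entity, contexts):
    """Sum over raw documents of (term frequency in the entity context) * (binary weight)."""
    total = 0
    for doc_contexts in contexts:                    # one dict per raw document
        terms = doc_contexts.get(entity, [])         # context terms of the entity in this document
        weight = 1 if entity in doc_contexts else 0  # binary entity-document association
        total += Counter(terms)[term] * weight
    return total

# Hypothetical entity contexts extracted from three raw documents.
contexts = [
    {"Frederica Wilson": ["congresswoman", "representative"]},
    {"Frederica Wilson": ["congresswoman"], "Donald Trump": ["president"]},
    {"Donald Trump": ["president", "us"]},
]
```

The same function applies unchanged to relationship meta-documents if the dictionary keys are entity pairs instead of single entities. A non-binary weighting scheme would only change how `weight` is computed.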
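The relevance score for an entity tuple $T_E$ can then be calculated using the posterior defined in the previous section (equation 3.6). We calculate the individual conditional probabilities as a product of a retrieval score with an association weight. Formally we consider:
$$P(D^{R_{i,i+1}} \mid D^{R_{i-1,i}}, Q^{R_{i,i+1}}) = score(D^{R_{i,i+1}}, Q^{R_{i,i+1}})\, w(R_{i-1,i}, R_{i,i+1})$$
$$P(D^{E_i} \mid D^{R_{i-1,i}}, D^{R_{i,i+1}}, Q^{E_i}) = score(D^{E_i}, Q^{E_i})\, w(E_i, R_{i-1,i}, R_{i,i+1})$$
where $score(D^{R_{i,i+1}}, Q^{R_{i,i+1}})$ represents the retrieval score resulting from matching the query terms of a relationship sub-query $Q^{R_{i,i+1}}$ against a relationship meta-document $D^{R_{i,i+1}}$. The same applies to the retrieval score $score(D^{E_i}, Q^{E_i})$, which corresponds to the result of matching an entity sub-query $Q^{E_i}$ against an entity meta-document $D^{E_i}$. For computing both scores any retrieval model can be used. Different scoring functions will be introduced below.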
We use a binary association weight $w(E_i, R_{i-1,i}, R_{i,i+1})$, which represents the presence of an entity $E_i$ relevant to a sub-query $Q^{E_i}$ in its contiguous relationships in the Bayesian network, i.e. $R_{i-1,i}$ and $R_{i,i+1}$, which must be relevant to the sub-queries $Q^{R_{i-1,i}}$ and $Q^{R_{i,i+1}}$. This entity-relationship association weight is the building block that guarantees that two entities relevant to sub-queries $Q^{E}$ that are also part of a relationship relevant to a sub-query $Q^{R}$ will be ranked higher than tuples where just one or none of the entities are relevant to the entity sub-queries $Q^{E}$. On the other hand, the entity-relationship association weight $w(R_{i-1,i}, R_{i,i+1})$ guarantees that consecutive relationships share one entity between them, in order to create triples or 4-tuples of entities for longer E-R queries ($|Q| > 3$).
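The relevance score of an entity tuple $T_E$ given a query $Q$ is calculated by summing individual relationship and entity relevance scores for each $Q^{R_{i,i+1}}$ and $Q^{E_i}$ in $Q$. With $n$ entity sub-queries, and with weights that refer to an out-of-range relationship taken as 1, we define the score for a tuple $T_E$ given a query $Q$ as follows:
$$score(T_E, Q) = \sum_{i=1}^{n-1} score(D^{R_{i,i+1}}, Q^{R_{i,i+1}})\, w(R_{i-1,i}, R_{i,i+1}) + \sum_{i=1}^{n} score(D^{E_i}, Q^{E_i})\, w(E_i, R_{i-1,i}, R_{i,i+1})$$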
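Considering unigram Language Models (LM) with Dirichlet smoothing, the constituent retrieval scores can be computed as follows:
$$score_{LM}(D^{R_{i-1,i}}, Q^{R_{i-1,i}}) = \sum_{t \in Q^{R_{i-1,i}}} \log \frac{f(t, D^{R_{i-1,i}}) + \mu^{R}\, \frac{f(t, C^{R})}{|C^{R}|}}{|D^{R_{i-1,i}}| + \mu^{R}}$$
$$score_{LM}(D^{E_i}, Q^{E_i}) = \sum_{t \in Q^{E_i}} \log \frac{f(t, D^{E_i}) + \mu^{E}\, \frac{f(t, C^{E})}{|C^{E}|}}{|D^{E_i}| + \mu^{E}}$$
where $t$ is a term of a sub-query $Q^{E_i}$ or $Q^{R_{i-1,i}}$, and $f(t, D^{E_i})$ and $f(t, D^{R_{i-1,i}})$ are the (pseudo) frequencies defined in equations 3.8 and 3.9. The collection frequencies $f(t, C^{E})$ and $f(t, C^{R})$ represent the frequency of the term $t$ in either the entity index $C^{E}$ or the relationship index $C^{R}$. $|D^{E_i}|$ and $|D^{R_{i-1,i}}|$ represent the total number of terms in a meta-document, while $|C^{E}|$ and $|C^{R}|$ represent the total number of terms in a collection of meta-documents. Finally, $\mu^{E}$ and $\mu^{R}$ are the Dirichlet priors for smoothing, which generally correspond to the average document length in a collection.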
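A Dirichlet-smoothed unigram LM score of this kind can be sketched in Python as below. The function signature and the toy term counts are assumptions for illustration, and the sketch assumes every query term occurs at least once in the collection (otherwise the logarithm is undefined).

```python
import math
from collections import Counter

def dirichlet_lm_score(query_terms, doc_tf, collection_tf, mu):
    """Sum over query terms of log((f(t, D) + mu * f(t, C) / |C|) / (|D| + mu))."""
    doc_len = sum(doc_tf.values())          # |D|: number of terms in the meta-document
    coll_len = sum(collection_tf.values())  # |C|: number of terms in the whole index
    score = 0.0
    for t in query_terms:
        p_coll = collection_tf[t] / coll_len  # collection language model for term t
        score += math.log((doc_tf[t] + mu * p_coll) / (doc_len + mu))
    return score

# Hypothetical relationship index with three terms, and two candidate meta-documents.
collection = Counter({"hits": 2, "back": 2, "praises": 4})
doc_match = Counter({"hits": 1, "back": 1})
doc_other = Counter({"praises": 2})
```

With `mu=2` and the query `["hits", "back"]`, the meta-document that contains the query terms scores higher than the one that relies only on the smoothing mass from the collection.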
3.2.2 Association Weights
Both Early Fusion and Late Fusion share three components: $w(R_{i-1,i}, D_j)$, $w(E_i, D_j)$ and $w(E_i, R_{i-1,i})$. The first two represent document associations, which determine the weight a given raw document contributes to the relevance score of a particular entity tuple $T_E$. The last one is the entity-relationship association, which indicates the strength of the connection of a given entity $E_i$ within a relationship $R_{i-1,i}$.
In our work we only consider binary association weights but other methods could be used. According to the binary method we define the weights as follows: