
Open Domain Suggestion Mining: Problem Definition and Datasets

We propose a formal definition for the task of suggestion mining in the context of a wide range of open domain applications. Human perception of the term suggestion is subjective, and this affects the preparation of hand-labeled datasets for the task of suggestion mining. Existing work either lacks a formal problem definition and annotation procedure, or provides domain and application specific definitions. Moreover, many previously used manually labeled datasets remain proprietary. We first present an annotation study, and based on our observations propose a formal task definition and annotation procedure for creating benchmark datasets for suggestion mining. With this study, we also provide publicly available labeled datasets for suggestion mining in multiple domains.





1 Introduction

Suggestion mining can be defined as the extraction of sentences that contain suggestions from unstructured text. Collecting suggestions is an integral step of any decision making process. A suggestion mining system could extract exact suggestion sentences from a retrieved document, which would enable the user to collect suggestions from a much larger number of pages than they could manually read over a short span of time.

Apart from suggestions that relate to general topics, industrial and other organizational decision makers seek suggestions to improve their brand or organization Jijkoun et al (2010). In this case, consumers or other stakeholders are explicitly asked to provide suggestions. Opinions towards persons, brands, social debates etc. are generally expressed through online reviews, blogs, discussion forums, or social media platforms, and tend to contain expressions of advice, tips, warnings, recommendations etc. Amigó et al (2014). For example, online reviews may contain suggestions for improvements in the product or service (Table 1); and recommendation platforms often ask for specific tips from their users, which are then offered to other users; see Figure 1 for an example from the travel site TripAdvisor.

| Source | Sentence |
|---|---|
| Electronics reviews | I would recommend doing the upgrade to be sure you have the best chance at trouble free operation. |
| Electronics reviews | My one recommendation to creative is to get some marketing people to work on the names of these things |
| Hotel reviews | Be sure to specify a room at the back of the hotel. |
| Twitter | Dear Microsoft, release a new zune with your wp7 launch on the 11th. It would be smart |
| Travel discussion forum | If you do book your own airfare, be sure you don’t have problems if Insight has to cancel the tour or reschedule it |

Table 1: Examples of suggestions from different domains.

State-of-the-art opinion mining systems primarily summarize these opinions as a distribution of positive and negative sentiments by means of sentiment analysis methods Liu (2012), and therefore suggestion mining remains a relatively young area. So far, it has usually been defined as a problem of classifying sentences of a given text into suggestion and non-suggestion classes. Suggestion mining faces challenges similar to those of other newly introduced sentence classification tasks. These include: (1) task formalization and data annotation, (2) understanding sentence level semantics, (3) figurative expressions, (4) long and complex sentences, (5) context dependency, and (6) highly imbalanced class distribution in some domains.

As we will see below, the domains covered previously include hotel reviews, electronics reviews, Twitter, and travel discussion forums, with the majority of studies having focused on collecting suggestions for product improvement using product reviews as a source text. Problem definition and methods remain tailored for their specific application. Mostly rule-based systems have so far been developed, and very few statistical classifiers have been proposed.

Figure 1: Manually provided room tips on TripAdvisor

In this study, we address the following research questions:


RQ1: How do we define suggestions in the context of open domain suggestion mining?

RQ2: How do we prepare benchmark datasets for suggestion mining?

The main contributions of this paper are as follows: (1) Study of layman perception of the term suggestion; (2) Proposal for an empirically driven definition of suggestions; (3) Study of the linguistic properties observed in sentences labeled as suggestion as per our definition; (4) Proposition of an annotation method that employs both layman and expert annotators, with the aim of minimizing annotation time and cost without compromising the quality of datasets; and (5) Development and distribution of benchmark datasets.

In Section 2 we discuss related work on problem definition and datasets for suggestion mining. Section 3 proposes a problem definition based on an annotation study performed with layman annotators. In Section 4 we propose a method to create benchmark datasets for suggestion mining, inspired by our observations and proposed problem definition. Section 5 concludes the paper.

2 Related Work

A common problem definition followed in different publications on suggestion mining is

Given a sentence $s$, predict a label $l$ for $s$, where $l \in \{\text{suggestion}, \text{non-suggestion}\}$.

In this study, we aim to extend this problem definition by formally defining the scope of suggestion and non-suggestion classes in a manner that will suit both open domain and domain specific suggestion mining.

Previous work that attempted to define suggestions did so in two ways. One was to provide a dictionary-like generic definition Wicaksono and Myaeng (2012); Viswanathan et al (2011) such as ‘A sentence made by a person, usually as a suggestion or a guide to action and/or conduct relayed in a particular context.’ The other was to provide an application specific definition of suggestions Ramanand et al (2010); Brun and Hagege (2013); Moghaddam (2015) such as ‘Sentences where the commenter wishes for a change in an existing product or service.’ Although the first category is generic and applies to all domains, the publications listed evaluated suggestion mining on single domains. In our annotation study on several domains, on which we elaborate in the next section, we observe that a generic definition of suggestion leads to higher disagreement among the annotators. On the other hand, where related work provided a domain and use case specific definition, formal annotation guidelines were still missing. Importantly, such definitions cannot be used to define the scope of open domain suggestion mining.

Wicaksono and Myaeng (2012, 2013a, 2013b) performed the extraction of what they refer to as advice-revealing sentences from travel related weblogs and online forums, using supervised learning methods. Their dataset is available to the research community (upon request from the authors). However, no formal annotation guidelines were provided Wicaksono and Myaeng (2013b) during the annotation. The kappa statistic for inter-annotator agreement was 0.76, and only those sentences on which the two annotators agreed were included in the datasets. In order to prepare the dataset, they first filtered out a sample of blog entries by using clue words like suggest, advice, recommend, tip, etc. This approach of filtering the real data may bias the results towards the proposed features, since the authors also used these clue words as features. Another publication that employed supervised learning was Dong et al (2013), who performed suggestion detection on tweets about the Microsoft Windows 7 phone. They did not define suggestions, but mentioned that the objective of collecting suggestions is to improve and enrich the quality and functionality of products, services, and organizations. The dataset prepared by them is also publicly available; however, the formal annotation details are again not available. The authors did not provide inter-annotator agreement scores, and retained only those tweets in the dataset where all the annotators agreed on the label.
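The agreement figures discussed above can be made concrete with a small sketch. It assumes that the reported kappa is Cohen's kappa for two annotators (the cited work may have used a different variant), and the label lists are made-up toy data rather than values from any of the datasets. The last line mirrors the practice of retaining only the items on which both annotators agree.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items that received identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels (hypothetical): "S" = suggestion, "N" = non-suggestion.
ann_1 = ["S", "S", "N", "N", "S", "N", "N", "N"]
ann_2 = ["S", "N", "N", "N", "S", "N", "S", "N"]

kappa = cohen_kappa(ann_1, ann_2)
# Keep only the items on which both annotators agree.
agreed_indices = [i for i, (a, b) in enumerate(zip(ann_1, ann_2)) if a == b]
```

Filtering to `agreed_indices` yields a cleaner dataset, but it also removes exactly the borderline sentences that a deployed classifier would still encounter.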

Datasets from other publications are not available for further investigation, mainly because of the proprietary nature of the data; see Table 2. Moghaddam (2015) performed suggestion mining at the review level, by identifying suggestion-containing reviews among a large number of reviews about the eBay App. They report that 5 human annotators were employed and that 96% agreement was observed in their annotations; only agreed-upon instances were retained in the dataset.

None of the studies that proposed and evaluated rule-based systems for this task Ramanand et al (2010); Viswanathan et al (2011); Brun and Hagege (2013) performed an annotation study or provided inter-annotator agreement scores.

| Publication | Source | Dataset available |
|---|---|---|
| Ramanand et al (2010) | Product reviews | No |
| Viswanathan et al (2011) | Product reviews | No |
| Brun and Hagege (2013) | Product reviews | No |
| Moghaddam (2015) | Hotel reviews | No |
| Wicaksono and Myaeng (2012, 2013a, 2013b) | Travel forum | Yes |
| Dong et al (2013) | Twitter | Yes |

Table 2: An overview of related work and available datasets.

The publications on suggestion mining listed above and summarized in Table 2 did not investigate the definition and scope of suggestions as a research question. They removed the instances on which annotators disagreed from the training and test datasets if manual labeling was performed using the broad definitions as guidelines. Although this may be one way to compensate for a missing annotation study, models trained on such unambiguous datasets may perform worse on real-life data than on the unambiguous test datasets. In this work, we perform a study of different perceptions of the term suggestion, formalize the definition of suggestions, and propose a two-phase annotation method in order to create benchmark datasets.

3 Problem Definition

Existing problem definitions of suggestion mining define the task as predicting suggestion and non-suggestion labels for sentences. While this definition applies to all domains and applications of suggestion mining, the scope of the suggestion and non-suggestion classes should also be defined.

Many linguistic studies have investigated suggestions from an open domain, syntactic, pragmatic, or philosophical perspective. For example, according to Searle (1976), suggestions belong to the group of directive speech acts, which are those in which the speaker’s purpose is to get the hearer to commit him/herself to some future course of action. These studies relate to the classical and standard form of language, and so do their definitions of suggestions. We observe that such definitions are less suitable for suggestion mining. Firstly, in the context of text mining we are dealing with informal text on the web, which often involves a non-standard use of language. Secondly, linguists or expert annotators would be required to annotate datasets as per the definitions in linguistic studies. Thirdly, suggestion mining systems are highly likely to be used by layman users and therefore the predicted suggestions should be aligned with a layman understanding.

Wicaksono and Myaeng (2013a) mentioned a broad and domain independent definition of suggestion as a ‘sentence that contains a suggestion for or guide to an action to be taken in a particular context.’ They simply asked annotators to label advice-revealing sentences in a given dataset. Our annotation study showed that human perception of the term suggestion is subjective, and this affects the preparation of hand-labeled suggestion datasets. We first present a qualitative and quantitative analysis of our annotation study below, and based on our observations propose a formal definition of both suggestion and non-suggestion classes, as well as an annotation approach for the benchmark datasets. In previous work Negi and Buitelaar (2015), we proposed a formal task definition, but it was specific to the use case of mining customer to customer suggestions from online reviews.

3.1 Annotation Study

We study the perception of layman users towards the term suggestion by performing a trial annotation where no formal annotation guidelines were provided, and the annotators were simply asked to choose between the labels suggestion and non-suggestion for a given sentence. We collected a total of 80 sentences, 20 sentences each from 4 domains: hotel reviews, restaurant reviews, a software developers’ suggestion forum, and tweets. We made sure that at least 5 potential suggestion sentences were present in each of these four sets. In the case of tweets, the entire tweet is considered as a single instance instead of a sentence. Each sentence was annotated by 50 layman annotators recruited through Crowdflower (the platform was known as Crowdflower at the time of the study and has since been re-branded as Figure Eight). We also provided the context along with these sentences, where context refers to the source of the sentence, as well as the text of the entire document to which a sentence belongs. For example, context comprises the entire review text in the case of reviews, and the entire post in the case of the suggestion forum.

| Sentence | Domain | Annotators (%) | Category |
|---|---|---|---|
| We did not have breakfast at the hotel but opted to grab breads/pastries/coffee from the many food-outlets in the main stations. | Hotel review | 80 | Tell |
| I recommend going for a Trabi Safari | Hotel review | 100 | Recommend |
| Definitely come with a group and order a few plates to share. | Restaurant review | 52 | Suggest |
| Please provide consistency throughout the entire Microsoft development ecosystem! | Suggestion forum | 70 | Request |
| We need a supplementary fix to the issues faced by existing affected users without resorting them to contacting Microsoft customer support | Suggestion forum | 60 | Need/Necessity |
| RT larrycaring: "every time someone send you hate send them this back" this is v good advice I’ll use it every time antis insult me | Twitter | 84 | Suggest, Appreciation |
| There is a parking garage on the corner of Forbes so its pretty convenient. | Restaurant review | 62 | Inform, Appreciation |

Table 3: Examples of sentences accepted as suggestions by layman annotators. Annotators in each table refers to the percentage of total annotators who labeled a given sentence as a suggestion.

Tables 3 and 4 provide examples of the sentences that were perceived as suggestions and non-suggestions, respectively, by more than 50% of the annotators. Unlike previous studies, we also study the characteristics of sentences labeled as non-suggestions. In order to define the scope of sentences considered as suggestions and non-suggestions by these annotators, we match the labeled sentences with their potential opinion expression category following the categorization defined by Asher et al (2009). These categories are Inform, Assert, Tell, Remark, Think, Guess, Blame, Praise, Appreciation, Recommend, Suggest, Hope, Anger, Astonishment, Love, Hate, Fear, Offence, Sadness/Joy, Bore/Entertain. A sentence with multiple clauses may contain more than one expression. We observe that most of the suggestions fell under the Suggestion, Recommendation, Request, and Need/Necessity categories of opinion expressions. Annotators tended to choose the suggestion label if they inferred a suggestion from the sentence, even if the sentence did not contain the expressions typically used in a suggestion. For example, some sentences with ‘Tell’ expressions, which is a dominant category of expressions in non-suggestions (Table 4), were labeled as suggestions (Table 3). Also, there was high disagreement over many sentences (30%).

| Sentence | Domain | Annotators (%) | Category |
|---|---|---|---|
| This is much safer and far more convenient than hailing taxis from the street. | Hotel review | 70 | Assert |
| There is a very annoying bug in the Windows 10 store that hides apps from listing. | Suggestion forum | 64 | Tell |
| Just returned from a 3 night stay. | Hotel review | 94 | Tell |
| The internet was free but I do think people should only be allowed say 30 mins and its written in a book, thats a fair way I think. | Hotel review | 72 | Tell, Think, Suggest |
| I had so much fun filming all kinds of tips and advice (and some yoga!) yesterday | Twitter | 84 | Inform, Tell, Entertain |

Table 4: Examples of sentences accepted as non-suggestions by layman annotators.

In order to define the scope of suggestions and formulate annotation guidelines for benchmark datasets, one approach is to identify which of the 20 expression types by Asher et al (2009) map to the suggestion and non-suggestion classes. However, identifying suggestions on the basis of the type of opinion expression in a sentence may require a high level of linguistic understanding by the annotators. Also, certain types of expressions are present in both suggestions and non-suggestions.

We then provided some guidelines in a second round of the annotation study. The guidelines asked the annotators to only label sentences where suggestions were explicitly expressed and not inferred. Examples of explicit and implicit expressions of suggestions were also provided. No reduction in disagreement was observed when detailed guidelines were provided. We observed that workers on crowdsourcing platforms tend to ignore or not comprehend the detailed guidelines. On the one hand, crowdsourcing with layman annotators may not be the best suited method for creating suggestion mining benchmark datasets, while on the other hand, trained annotators are either slower (volunteers) or expensive.

Apart from the detailed guidelines, we observed that context plays some role in the judgment of the annotators, since many annotators decided the labels based on the source text and not solely on the given sentence. Therefore, we also performed a round of annotations without providing the context. The following trends were observed:

  • There was a tendency to label sentiment sentences such as They serve really nice breakfast as suggestions when the context was not provided.

  • When the context was provided, and a large number of sentences in the source text were expressing sentiments, annotators tended to choose more explicitly expressed sentences as suggestions.

  • In cases where the sentence did not contain any information about the reviewed entity, implicit suggestions were marked as non-suggestions. For example, for There is a parking garage on the corner of Forbes so its pretty convenient the agreed upon label changed to a non-suggestion when the restaurant review context was not provided.

Based on these observations, we propose an extended problem definition that defines the scope of suggestion and non-suggestion sentences.

3.2 Task Definition Revisited

As we saw in our layman annotation study, context may affect an annotator’s judgment. In the absence of context, different annotators associate different contexts to a candidate sentence. We observe that the following concepts form an integral part of defining a suggestion for suggestion mining.

Surface structure.

Different surface structures Chomsky (1957); Crystal (2011) can be used to express the underlying intention of giving the same suggestion. For example, The nearby food outlets serve fresh local breakfast and are also cheaper and You can also have breakfast at the nearby food outlets which are cheaper and equally good. Linguistic studies in the past studied the way suggestions are expressed (see Table 5) and provided the examples of lexical elements used in the surface structures for suggestions.


Context.

When dealing with specific use cases, context plays an important role in distinguishing a suggestion from a non-suggestion. Context may be present within a given sentence. It can also be a set of values corresponding to different variables that are provided explicitly and in addition to a given sentence. One or more of the following variables can constitute the context:


Domain.

We refer to the source of a text as domain. In the process of dataset annotation, we closely studied some of the domains; Table 8 shows how the distribution of suggestions varies with the domains.

Source text.

The text in the entire source document to which a sentence belongs may also serve as a context, giving an insight into the discourse where the suggestion appeared.

Application or use case.

Suggestions may sometimes be sought only around a specific topic, for example, mining room tips from hotel reviews. Suggestions can also be selectively mined for a certain class of users, for example, mining suggestions for future customers. All non-relevant suggestions in the data may be regarded as non-suggestions in this case. Previous studies on suggestion mining from online reviews operated on this kind of context. For example, only suggestions for improvement were identified from the product reviews, whereas suggestions meant for fellow customers Negi and Buitelaar (2015) were considered as non-suggestions.

We now propose an empirically driven and context-based definition of suggestions. Given that

  • $s$ denotes the surface structure of a sentence,

  • $c$ denotes additional context provided with $s$, where the context can be a set of values corresponding to certain variables, and

  • $a(s, c)$ denotes the annotation agreement for the sentence, and $t$ denotes a threshold value for the annotation agreement,

we write $S(s, c)$ to denote the suggestion function, which is defined as

$$S(s, c) = \begin{cases} \text{Suggestion}, & \text{if } a(s, c) \geq t \\ \text{Non-suggestion}, & \text{if } a(s, c) < t \end{cases}$$

Depending on the choice of $c$, and, hence, on the value of $a(s, c)$, we identify four categories of sentences that a suggestion mining system is likely to encounter.

Explicit suggestions.

Explicit suggestions are sentences for which $S(s, c)$ always outputs Suggestion, whether $c$ is the empty set or not. The suggest, recommend, request, and need/necessity categories of opinion expressions in Table 3 are mostly found in the explicit expressions of suggestions. They are like the direct and conventionalised forms of suggestions as defined by Martínez Flor (2005) (Table 5). It may also be the case that such sentences have a strong presence of context within their surface form, as illustrated by If you do end up here, be sure to specify a room at the back of the hotel.

Explicit non-suggestions.

These are the sentences for which $S(s, c)$ always outputs Non-suggestion, whether $c$ is the empty set or not. For example, Just returned from a 3 night stay.

Implicit suggestions.

These are sentences for which $S(s, c)$ outputs Non-suggestion only when $c$ is the empty set. Typically, implicit suggestions do not possess the surface form of suggestions, but the additional context helps the readers to identify them as suggestions. For example, There is a parking garage on the corner of Forbes, so its pretty convenient is labeled as a suggestion by the annotators when the context is revealed as that of a restaurant review. A sentence such as Malahide is a pleasant village-turned-dormitory-town near the airport can be considered as a suggestion given that it is obtained from a travel discussion thread for Dublin. These kinds of sentences are observed to have a lower inter-annotator agreement than the above two categories.

Implicit non-suggestions.

These are sentences for which $S(s, c)$ outputs Suggestion only when $c$ is the empty set. Typically, an implicit non-suggestion possesses the surface form of a suggestion, but the context leads readers to identify it as a non-suggestion. Such sentences may contain sarcasm. Examples include Do not advertise if you don’t know how to cook appearing in a restaurant review and The iPod is a very easy to use MP3 player, and if you can’t figure this out, you shouldn’t even own one appearing in an MP3 player review.

Based on the above four categories, we can define the scope of the suggestion and non-suggestion classes for open domain suggestion mining. For open domain suggestion mining, we propose to limit the scope of suggestions to the explicit suggestions. Therefore, we set the task definition for open domain suggestion mining to be:

  • Let $s$ be a sentence. If $s$ is an explicit suggestion, assign the label Suggestion. Otherwise, assign the label Non-suggestion.

The proposed categories provide the flexibility to change the scope of classes in a well defined manner, as well as to define context as per the application scenario.
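As a minimal sketch of how the suggestion function and the four categories fit together, the snippet below derives a category from two agreement values for the same sentence: one collected without context and one collected with context. The function names, the 0.5 default threshold, the use of "greater than or equal" for the comparison, and the example agreement values are all assumptions made for illustration.

```python
SUGGESTION = "Suggestion"
NON_SUGGESTION = "Non-suggestion"


def suggestion_function(agreement, threshold=0.5):
    """Label a sentence from its annotation agreement (fraction of annotators
    who chose 'suggestion') and a threshold. The >= comparison is an assumption."""
    return SUGGESTION if agreement >= threshold else NON_SUGGESTION


def sentence_category(agreement_without_context, agreement_with_context, threshold=0.5):
    """Map the two outcomes (empty context vs. provided context) to a category."""
    without_ctx = suggestion_function(agreement_without_context, threshold)
    with_ctx = suggestion_function(agreement_with_context, threshold)
    if without_ctx == SUGGESTION and with_ctx == SUGGESTION:
        return "explicit suggestion"
    if without_ctx == NON_SUGGESTION and with_ctx == NON_SUGGESTION:
        return "explicit non-suggestion"
    if without_ctx == NON_SUGGESTION:
        # Only the added context turns the sentence into a suggestion.
        return "implicit suggestion"
    # Looks like a suggestion on the surface, but the context overrides it.
    return "implicit non-suggestion"


def open_domain_label(category):
    """Open domain task definition: only explicit suggestions are suggestions."""
    return SUGGESTION if category == "explicit suggestion" else NON_SUGGESTION


# Hypothetical agreement values, not taken from the annotation study.
category = sentence_category(agreement_without_context=0.30, agreement_with_context=0.62)
label = open_domain_label(category)
```

Changing `open_domain_label` to also accept implicit suggestions is one way an application with a fixed context could widen the scope of the positive class.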

3.3 Linguistic Observations

Explicit suggestions are often expressed by means of certain lexical cues and grammatical moods. They tend to contain certain keywords and phrases, like suggest, suggestion, recommendation, advice, I suggest, I recommend, etc. Most of the previous work created a hand crafted list of such words and phrases to use them as features with the classifiers. However, not all the suggestions contain these keywords and phrases. Table 6 lists the top 10 unigrams and bigrams in the sentences tagged as advice and suggestion in the dataset used by Wicaksono and Myaeng (2013a) and Dong et al (2013), and some examples which exclude such keyphrases. The ranking is based on the frequency count and unigrams exclude the stopwords.
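To illustrate why a hand-crafted keyword list is an incomplete signal, the following sketch checks sentences against a small cue pattern. The pattern is a hypothetical list built from the keywords named above, not the feature set of any cited system, and the three sentences are taken from Tables 1 and 4.

```python
import re

# Hypothetical cue list derived from the keywords mentioned in the text.
CUE_PATTERN = re.compile(
    r"\b(suggest(?:s|ed|ion|ions)?|recommend(?:s|ed|ation|ations)?|advice|advise)\b",
    re.IGNORECASE,
)


def has_lexical_cue(sentence):
    """Return True if the sentence contains one of the cue words."""
    return CUE_PATTERN.search(sentence) is not None


# A suggestion that contains a cue word (Table 1).
with_cue = has_lexical_cue(
    "I would recommend doing the upgrade to be sure you have the best chance "
    "at trouble free operation."
)
# A suggestion without any cue word (Table 1): the keyword list misses it.
missed = has_lexical_cue("Be sure to specify a room at the back of the hotel.")
# A non-suggestion that contains a cue word (Table 4): a false alarm.
false_alarm = has_lexical_cue(
    "I had so much fun filming all kinds of tips and advice (and some yoga!) yesterday"
)
```

The second and third cases show both failure modes of keyword matching: an imperative suggestion with no cue word, and a cue word inside a sentence that annotators labeled as a non-suggestion.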

| Type | Example |
|---|---|
| Direct | I suggest that you … |
| | I recommend that you … |
| | I advise you to … |
| | My suggestion would be … |
| | Try using … |
| | Don’t try to … |
| Conventionalised forms | Why don’t you …? |
| | How about …? |
| | Have you thought about …? |
| | You can … |
| | You could … |
| | If I were you, I would … |
| Indirect | One thing (that you can do) would be … |
| | Here’s one possibility … |
| | There are a number of options that you … |
| | It would be helpful if you … |
| | It might be better to … |
| | It would be nice if … |

Table 5: Three types of expressions of suggestions as defined by Martínez Flor (2005), with examples of prevalently used surface structures.

| Travel: unigrams | Travel: bigrams | Microsoft tweets: unigrams | Microsoft tweets: bigrams |
|---|---|---|---|
| You | credit card | Microsoft | Windows Phone |
| tour | you will | Windows | Dear Microsoft |
| if | if want | WP7 | Microsoft needs |
| just | http www | phone | Come Microsoft |
| travel | make sure | need | Microsoft make |
| time | Europe board | nokia | Microsoft WP7 |
| like | tour director | make | If Microsoft |
| hotel | post Europe | needs | Microsoft really |
| did | travel tips | apps | really needs |
| need | United States | want | hope Microsoft |

Table 6: Top 10 unigrams and bigrams in the sentences labeled as advice and suggestion in the dataset provided by Wicaksono and Myaeng (2013a) and Dong et al (2013), respectively.
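A frequency ranking of this kind can be sketched with a few lines of Python. The tokenizer, the tiny stopword set, and the toy sentences below are assumptions for illustration; the preprocessing actually used to produce Table 6 (for instance, whether stopwords were removed before forming bigrams) is not specified here.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"a", "to", "in", "be", "you", "the"}


def tokenize(sentence):
    """Lowercase the sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def top_ngrams(sentences, k=10):
    """Return the k most frequent unigrams (stopwords excluded) and bigrams."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        unigram_counts.update(t for t in tokens if t not in STOPWORDS)
        # Bigrams are formed over the unfiltered token sequence (an assumption).
        bigram_counts.update(zip(tokens, tokens[1:]))
    return unigram_counts.most_common(k), bigram_counts.most_common(k)


# Toy sentences (made up), standing in for sentences labeled as suggestions.
toy_sentences = [
    "Be sure to book a tour",
    "Book a tour in advance",
    "Make sure you book early",
]
unigrams, bigrams = top_ngrams(toy_sentences, k=3)
```

Running the same function separately on suggestion and non-suggestion sentences would show which frequent n-grams are actually distinctive for the suggestion class.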


Suggestion expressions often contain what may be referred to as subjunctive and imperative moods (Morante and Sporleder, 2012; Negi and Buitelaar, 2015). Subjunctive mood is a commonly occurring language phenomenon in Indo-European languages, which is typically used in subordinate clauses to express an action that has not yet occurred, in the form of a wish, possibility, necessity etc. Guan (2012). Typical examples include If the Duke were here he would be furious Dudman (1988) and It is my suggestion that the students be sent to Tibet (Guan, 2012). The imperative mood is used to express the requirement that someone perform or refrain from an action. A typical example would be Take an umbrella, Pray everyday Portner (2009). Wicaksono and Myaeng (2012, 2013a, 2013b) and Dong et al (2013) used the imperative mood as a feature with their classifiers. However, subjunctive mood has not been associated with suggestions in previous suggestion mining studies.


In the context of opinion mining, suggestion mining and sentiment analysis can be applied to very similar domains or data sources, for example, reviews, blogs, discussion forums, Twitter etc. Sentiment expressions and suggestion expressions may also appear together in the same sentence, especially in the case of recommendation-type suggestions. In sentiment analysis, text is generally categorized into three or more classes, while in suggestion mining a text is either a suggestion or a non-suggestion. Suggestion expressing texts are not limited to a particular class of sentiment. While sentiment sentences are always defined as subjective, suggestions can be found in both objective and subjective types of text. Therefore, a suggestion bearing sentence may be associated with multiple sentiments.

In the case of reviews, some sentiment analysis benchmark datasets like the SemEval datasets Pontiki et al (2016) exclude text that is not about the opinion target, even though the text is found within the same review. The guidelines from the SemEval 2015 Sentiment Analysis task Pontiki et al (2016) read: ‘Quite often reviews contain opinions towards entities that are not directly related to the entity being reviewed, for example, restaurants/hotels that the reviewer has visited in the past, other laptops or products (and their components) of the same or a competitive brand. Such entities as well as comparative opinions are considered to be out of the scope of SE-ABSA 15.’ In these cases, no opinion annotations were provided. Some of the non-relevant sentences in review datasets are potential suggestions on closely related points of interest. Suggestions in hotel reviews may contain tips and advice about the reviewed entity, as well as suggestions and recommendations about the neighborhood, transportation, and things to do. Similarly, in product reviews, suggestions can be about how to make better use of the product, accessories that go with it, or the availability of better deals, etc. (Table 7). Figure 2 shows the distribution of suggestions and sentiments (expressed towards the hotel) in a hotel review dataset annotated with sentiments by Wachsmuth et al (2014).

| Sentence | Domain | Sentiment | Relevant |
|---|---|---|---|
| One more thing- if you want to visit the Bundestag it is a good idea to book a tour (in English) in advance | Hotel | Neutral | No |
| Be sure to pick up an umbrella (for free) at the concierge if you anticipate rain while sightseeing. | Hotel | Positive | Yes |
| You are going to need to buy new headphones, the stock ones suck | MP3 player | Negative | Yes |
| If you strictly use the lcd and not the view finder , i highly recommend the camera . | Camera | Positive | Yes |
| For those of you who already bought this camera , I suggest you buy a hi-ti dye-sub photo printer | Camera | Neutral | No |
| If only it played stand alone avi files | DVD player | Neutral | Yes |
| Would be really good if they have given an option to stop this auto-focusing | Camera | Neutral | Yes |

Table 7: Examples of suggestions from review datasets for sentiment analysis. Sentiment refers to the sentiment towards the reviewed entity, while Relevant indicates if the suggestion is relevant for calculating the sentiment towards the reviewed entity.

Figure 2: Distribution of sentiment polarities towards the hotel in suggestion and non-suggestion sentences for a hotel review dataset (Wachsmuth et al, 2014).

In this section we answered RQ1, i.e., How do we define suggestions in the context of open domain suggestion mining? We provided a formal definition for suggestions by identifying four categories of sentences, i.e., explicit suggestions, implicit suggestions, explicit non-suggestions and implicit non-suggestions, where the explicit suggestions are defined as suggestions with regard to open domain suggestion mining.

4 Creating Benchmark Datasets for Suggestion Mining

Based on our preliminary annotation study, problem definition and scope of suggestions, we propose a two phase method for manually annotating sentences with the class labels:

Phase 1

This phase is performed using paid crowdsourcing, where each sentence is annotated by multiple layman annotators.

Phase 2

This phase is performed by an expert annotator.

Below, we describe both phases as well as the subsequent creation of multiple datasets for suggestion mining.

4.1 Phase 1: Crowdsourced Annotations

We used Crowdflower (the company was re-branded as Figure Eight after this study) to collect layman annotations.

Job design.

The annotation job was offered to annotators in batches of rows which are referred to as ‘Pages’ (see Figure 3). Annotators were not paid per individual row but per page. In our case, each page had 8 sentences.

Figure 3: A screenshot of how each page of the annotation job appears to the annotators on the Crowdflower platform.

For quality control, before being allowed to perform a job, the annotators were presented with a set of test sentences, which are similar to the actual questions except that we had already provided their answers to the system. We also submitted an explanation of each correct answer. The test questions thus serve two purposes: they test the annotator’s competency and understanding of the job, and they train the annotator for the job.

We submitted 30 test questions for each dataset. Each starting annotator was presented with 10 test questions, and only annotators who scored 70% or more were allowed to proceed with the job. If an annotator passed the test and started the job, the remaining unseen test questions were presented to them in between the regular sentences, without notification. One question out of the 8 on a page was a hidden test question. The accuracy of a contributor on test questions is referred to as the Trust score in a job. If an annotator’s trust score dropped below a certain threshold during the course of the annotation, the system did not allow them to proceed further with the job. We set this threshold to 70%.

In addition to the hidden test questions, a minimum time for each annotator to stay on one page of the job was set. We set this time to 40 seconds (5 seconds on average for each sentence) for our jobs. If annotators appeared to be faster than that, they were automatically removed from the job.
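The quality-control rules described above can be sketched as follows. This is a minimal illustration and not Crowdflower's actual implementation: the `Annotator` class and the functions `take_entry_quiz` and `submit_page` are hypothetical names, and the sketch assumes that the trust score is simply the running accuracy over all test questions seen so far (entry quiz plus hidden test questions).

```python
from dataclasses import dataclass

# Values taken from the job description in the text.
PASS_THRESHOLD = 0.70       # minimum accuracy on test questions
MIN_SECONDS_PER_PAGE = 40   # minimum time an annotator must spend on a page


@dataclass
class Annotator:
    """Hypothetical record of one annotator's test-question history."""
    test_correct: int = 0
    test_seen: int = 0
    active: bool = True

    @property
    def trust(self) -> float:
        # Assumption: trust score = accuracy on all test questions seen so far.
        return self.test_correct / self.test_seen if self.test_seen else 0.0


def take_entry_quiz(annotator: Annotator, answers_correct: list) -> bool:
    """Admit the annotator only if the entry quiz accuracy reaches the threshold."""
    annotator.test_seen += len(answers_correct)
    annotator.test_correct += sum(answers_correct)
    annotator.active = annotator.trust >= PASS_THRESHOLD
    return annotator.active


def submit_page(annotator: Annotator, hidden_test_correct: bool,
                seconds_on_page: float) -> bool:
    """Process one page (regular rows plus one hidden test row)."""
    if not annotator.active:
        return False
    if seconds_on_page < MIN_SECONDS_PER_PAGE:
        # Pages finished faster than the minimum time remove the annotator.
        annotator.active = False
        return False
    annotator.test_seen += 1
    annotator.test_correct += int(hidden_test_correct)
    if annotator.trust < PASS_THRESHOLD:
        # Trust dropped below the threshold during the job.
        annotator.active = False
    return annotator.active
```

For example, an annotator who answers 8 of the 10 entry questions correctly is admitted, remains active after one missed hidden test question (8/11), and is removed after a second miss (8/12).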

We restricted access to annotators from countries where English is a popular language and that are also likely to have a large crowdsourcing workforce. Most of the annotators came from Australia, Canada, Germany, India, Ireland, the United Kingdom, and the USA.

Crowdflower assigns workers experience levels based on the number of times they have successfully passed test questions and performed a job. We only allowed workers with experience level 2 or higher for our annotations. The annotation reward therefore had to be competitive with that of other jobs. We paid 10 cents for each page, i.e., 8 sentences. Since a page takes at least 40 seconds, this amounts to a maximum of $9 per hour, which respected the wage regulations of the country in which the first author resided at the time of writing.
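The hourly figure follows directly from the page price and the minimum page time. A small sketch of the arithmetic, where the function name `max_hourly_pay_usd` is our own and the pay is expressed in cents to avoid rounding noise:

```python
PAY_PER_PAGE_CENTS = 10      # payment per page of 8 sentences
MIN_SECONDS_PER_PAGE = 40    # an annotator cannot finish a page faster than this


def max_hourly_pay_usd(pay_per_page_cents: int, min_seconds_per_page: int) -> float:
    """Upper bound on hourly earnings, reached when every page takes the minimum time."""
    max_pages_per_hour = 3600 / min_seconds_per_page   # 90 pages at 40 s each
    return pay_per_page_cents * max_pages_per_hour / 100


print(max_hourly_pay_usd(PAY_PER_PAGE_CENTS, MIN_SECONDS_PER_PAGE))
```

Annotators who spend longer than the minimum time on a page earn proportionally less per hour, which is why the figure is an upper bound.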

Annotation agreement.

We used Crowdflower to collect multiple judgments per sentence. Crowdflower’s confidence score describes the level of agreement between multiple contributors and, at the same time, the confidence in the validity of the result; we used a threshold confidence score of 0.6. However, a very ambiguous sentence may fail to reach this confidence score even after a large number of workers have answered it. For such cases a maximum number of annotators is set, and no further judgments are collected even if the threshold confidence is not reached. We set this limit to 5 annotators. Sentences that do not pass the confidence threshold of 0.6 are counted as non-suggestions in the final dataset.
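The aggregation rule can be sketched as below. The text does not give the formula behind the confidence score, so this sketch assumes a trust-weighted agreement: the summed trust of the annotators who chose the winning label, divided by the summed trust of all annotators who judged the sentence. The function name `aggregate` and the label strings are hypothetical.

```python
from collections import defaultdict

CONFIDENCE_THRESHOLD = 0.6   # threshold used in the text
MAX_JUDGMENTS = 5            # maximum number of annotators per sentence

SUGGESTION = "suggestion"
NON_SUGGESTION = "non-suggestion"


def aggregate(judgments):
    """Aggregate (label, trust) pairs for one sentence into (final_label, confidence).

    Assumption: confidence is the trust-weighted share of the most supported label.
    Only the first MAX_JUDGMENTS judgments are considered.
    """
    weights = defaultdict(float)
    for label, trust in judgments[:MAX_JUDGMENTS]:
        weights[label] += trust
    total = sum(weights.values())
    if total == 0:
        return NON_SUGGESTION, 0.0
    top_label = max(weights, key=weights.get)
    confidence = weights[top_label] / total
    if confidence < CONFIDENCE_THRESHOLD:
        # Sentences that never reach the threshold count as non-suggestions.
        return NON_SUGGESTION, confidence
    return top_label, confidence
```

Under this assumption, two trusted annotators voting suggestion against one voting non-suggestion yields a confidence of roughly 0.7 and the label suggestion, while an evenly split sentence stays below 0.6 and falls back to non-suggestion.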

4.2 Phase 2: Expert Annotations

As mentioned previously, we follow a two-phase annotation strategy, where one phase (Phase 1) includes the context and the other (Phase 2) excludes the context of a sentence. Our scope of suggestions is limited to sentences that are labeled as suggestions in both phases, i.e., explicit suggestions.

In the crowdsourced annotations (Phase 1), the context was provided to the annotators in the form of domain information and source text. Phase 2 of the annotation is only applied to sentences that were labeled as suggestions in Phase 1. This drastically reduces the number of annotations to be performed in Phase 2. The inter-annotator agreement for Phase 2 was calculated by having two annotators label a subset of sentences for each domain (50 sentences). Cohen’s kappa coefficient was used to measure the inter-annotator agreement. The remainder of the data instances were annotated by only one annotator.
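For reference, Cohen's kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance from their individual label distributions: kappa = (p_o - p_e) / (1 - p_e). A minimal, dependency-free sketch follows; the function name `cohens_kappa` and the example counts are illustrative and are not taken from the datasets.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative 50-sentence subset: 20 joint suggestions, 25 joint
# non-suggestions, and 5 disagreements (made-up counts).
annotator_1 = ["s"] * 20 + ["n"] * 25 + ["s"] * 2 + ["n"] * 3
annotator_2 = ["s"] * 20 + ["n"] * 25 + ["n"] * 2 + ["s"] * 3
print(round(cohens_kappa(annotator_1, annotator_2), 2))
```

In this made-up example the raw agreement is 0.90, while kappa is lower because part of that agreement is expected by chance.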

The following guidelines were provided to the annotators in Phase 2:

  • The intent of giving a suggestion and the suggested action or recommended entity should be explicitly stated in the sentence. “Try the cup cakes at the bakery next door” is a positive example. Other explicit forms of this suggestion could be: “I recommend the cup cakes at the bakery next door” or “You should definitely taste the cup cakes from the bakery next door”. An implicit way of expressing the suggestion could be “The cup cakes from the bakery next door were delicious”.

  • The suggestion should have the intent of benefiting a stakeholder and should not be mere sarcasm or a joke. For example, “If the player doesn’t work now, you can run it over with your car” would not pass this test.

Following are some of the scenarios of conflicting judgments observed in this phase of annotation:

  • In the case of suggestion forums for specific domains, like a software developer forum, domain knowledge is required to distinguish an implicit non-suggestion from an explicit suggestion. Consider, for example, the two sentences “It needs to be an integrated part of the phones functionality, that is why I put it in Framework” and “Secondly, you need to limit the number of apps that a publisher can submit with a particular key word”. The first sentence is a description of already existing functionality and is a context sentence in the original post, while the second is a suggestion for a new feature.

  • There is no concrete mention of what is being advised, as in “It’d be great if you would work on a solution to improve the situation”.

  • A sentence such as “I would go in fall” was annotated as a suggestion in Phase 1, as part of a travel discussion forum, with the likely interpretation “If I were you, I would go in the fall”. However, when viewed without context, it can be perceived as reporting one’s travel plans.

  • At times, there was confusion between information (fact) and suggestion (opinion). For example, “You can get a ticket that covers 6 of the National Gallery sites for only about US$10”.

In the final versions of the datasets prepared by us (see below), the sentences that are labeled as suggestions in Phase 2 of the annotation process are labeled as suggestions, while all other sentences are labeled as non-suggestions.
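The combination of the two phases can be sketched as follows. The function `two_phase_labels` and the callable `expert_annotate` are hypothetical names; the sketch only encodes the rule stated above, namely that Phase 2 is applied solely to Phase 1 suggestions and that a sentence is a suggestion in the final dataset only if both phases agree.

```python
SUGGESTION = "suggestion"
NON_SUGGESTION = "non-suggestion"


def two_phase_labels(sentences, phase1_labels, expert_annotate):
    """Return the final label of each sentence.

    phase1_labels: dict mapping sentence -> crowdsourced label (with context).
    expert_annotate: callable returning the expert label for a sentence
        viewed without context; called only for Phase 1 suggestions.
    """
    final = {}
    for sentence in sentences:
        # Short-circuit: the expert is consulted only if Phase 1 said "suggestion".
        if (phase1_labels[sentence] == SUGGESTION
                and expert_annotate(sentence) == SUGGESTION):
            final[sentence] = SUGGESTION
        else:
            final[sentence] = NON_SUGGESTION
    return final


# Usage with example sentences from the text; the labels here are illustrative.
sentences = [
    "Try the cup cakes at the bakery next door",
    "I would go in fall",
    "The cup cakes from the bakery next door were delicious",
]
phase1 = {sentences[0]: SUGGESTION, sentences[1]: SUGGESTION,
          sentences[2]: NON_SUGGESTION}
expert_calls = []


def expert(sentence):
    expert_calls.append(sentence)
    return SUGGESTION if sentence.startswith("Try") else NON_SUGGESTION


final_labels = two_phase_labels(sentences, phase1, expert)
```

Because only Phase 1 suggestions reach the expert, the number of expert annotations equals the number of crowdsourced suggestions rather than the size of the full dataset.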

4.3 New Datasets for Suggestion Mining

This section lists the manually labeled datasets that were created following Phase 1 and Phase 2 described above.

We apply our two-stage annotation process to create four new datasets for suggestion mining. On top of that we re-use existing datasets, viewing the labels originally provided as Phase 1 annotations of our two-stage annotation process.

Hotel reviews.

Wachsmuth et al (2014) provide a large dataset of hotel reviews from the TripAdvisor website. They segmented the reviews into statements so that each statement has only one sentiment label. Statements are equivalent to sentences, and comprise one or more clauses. These statements have been manually tagged with positive, negative, conflict, and neutral sentiments. We take a smaller subset of these reviews, where each statement is considered as a sentence in our dataset.

Electronics reviews.

Hu and Liu (2004) provide a dataset comprising reviews of different kinds of electronic products obtained from the website of Amazon. The Amazon website collects and displays online reviews of listed products. Hu and Liu split the reviews into sentences; the sentiment of each sentence has been manually tagged.

Travel forum.

The data is obtained from a previous travel forum dataset by Wicaksono and Myaeng (2013a, b). This domain exhibits a wide variety of expressions employed in suggestions, with relatively few grammatical and spelling errors. However, this domain also shows a relatively low inter-annotator agreement.

Software suggestion forum.

The sentences for this dataset were scraped from the Uservoice platform. Uservoice provides customer engagement tools to brands, and therefore hosts dedicated suggestion forums for certain products. The Feedly mobile application forum and the Windows developer forum are openly accessible. A sample of posts was scraped and split into sentences using the Stanford CoreNLP toolkit Klein and Manning (2003). The class ratio in the dataset obtained from suggestion forums is more balanced than in the other domains. Many suggestions are in the form of requests, which is less frequent in other domains. The text contains highly technical vocabulary related to the software being discussed, which may affect classifier performance when this dataset is used for training or evaluation in cross-domain train-test settings, especially when bag-of-words features are employed.

The reasons for choosing these four domains are as follows. Online reviews are a popular target domain for opinion mining, and a number of sentiment-tagged review datasets are available from previous studies; we therefore prepared suggestion mining datasets for electronics (product) and hotel (service) reviews. This also allows us to study the relationship between suggestions and sentence-level sentiments in online reviews. In addition, product reviews were widely used in related work on suggestion mining. The choice of the travel forum dataset was inspired by related work on advice mining Wicaksono and Myaeng (2012). The software suggestion forum was chosen because the review datasets were highly imbalanced for explicit suggestions, while suggestion forums had a better presence of explicitly expressed suggestions. Suggestion forum datasets also support a different use case of suggestion mining than the review datasets, i.e., summarisation of suggestion posts.

In addition to these four new datasets, we re-labeled two existing datasets for suggestion mining. We consider the labels provided in previous work as Phase 1 annotations, and performed Phase 2 annotations on these datasets, i.e., re-labeling of instances that were previously labeled as suggestions.

Travel forum dataset.

Wicaksono and Myaeng (2013a, b) crawled several web forum threads from two well-known travel forums, InsightVacations and Fodors. Originally, an inter-annotator agreement of 0.76 (Cohen’s kappa) was reported for this dataset. The provided dataset comprises only the instances on which both annotators agreed.

Microsoft tweets dataset.

The dataset was initially released by Dong et al (2013). While no inter-annotator agreement was reported by the authors, only those tweets on whose label the annotators agreed were retained in the dataset. In this case, the unit for suggestions is a tweet instead of a sentence. All of the tweets previously labeled as suggestions in the Microsoft tweets dataset were accepted as suggestions in the Phase 2 annotations, and a 100% inter-annotator agreement was observed between the two annotators. This could be because the full tweet is available to the annotators rather than a single sentence, and because hashtags help to reduce ambiguities.

Dataset identifier | Source | Suggestions / Non-suggestions (ratio) | Inter-annotator agreement (Phase 2)
Existing datasets
Microsoft tweets (original, re-tagged) | Twitter | 238/2762 (0.08) | 1.0
Travel forum (original) | InsightVacations, Fodors | 2192/3007 (0.72) | 0.76 Wicaksono and Myaeng (2013a)
Travel train (re-tagged) | InsightVacations, Fodors | 1314/3869 (0.34) | 0.72
New datasets
Hotel train | Tripadvisor | 448/7086 (0.06) | 0.86
Hotel test | Tripadvisor | 404/3000 (0.13) | 0.86
Electronics train | Amazon | 324/3458 (0.09) | 0.83
Electronics test | Amazon | 101/1070 (0.09) | 0.83
Travel test | Fodors | 229/871 (0.26) | 0.72
Software train | Uservoice suggestion forum | 1428/4296 (0.33) | 0.81
Software test | Uservoice suggestion forum | 296/742 (0.39) | 0.81
Table 8: Manually annotated suggestion mining datasets created in this paper.

Table 8 provides the details of the suggestion mining datasets created as part of this work. The datasets are made freely available for non-commercial purposes.

In this section we answered RQ2, i.e., How do we prepare benchmark datasets for suggestion mining? We studied the annotation challenges associated with suggestion mining, and proposed a two-phase annotation method for preparing datasets for open domain suggestion mining. Phase 1 provides context with the sentences to be annotated and is performed by layman annotators using a crowdsourcing platform, while Phase 2 does not reveal the context to the annotators, and is performed by expert annotators. The sentences that were labeled as suggestions in both phases are retained as suggestions, while the rest of the sentences are labeled as non-suggestions.

5 Conclusion

In this paper we have focused on answering two main research questions, viz. how to define suggestions in the context of open domain suggestion mining, and how to create benchmark datasets for suggestion mining.

To inform our annotation effort, we report on a study of the perception of the term suggestion among layman annotators. We map the sentences labeled as suggestions and non-suggestions by layman annotators to predefined categories of expressions that tend to appear in opinion discourse. These categories were thoroughly studied and defined in existing work by Asher et al (2009). Some of the categories of expressions were present in both suggestion and non-suggestion sentences, which is the source of the ambiguities associated with the preparation of benchmark datasets.

Based on these observations, we propose a context-dependent method to define four categories of sentences encountered in a source text: explicit suggestions, implicit suggestions, explicit non-suggestions, and implicit non-suggestions. We also study the possible contexts that can play a significant role in determining whether a sentence should be regarded as a suggestion or not. Based on this theory, we develop benchmark datasets for suggestion mining using a two-phase annotation method. We share a detailed account of the methodology we adopted to prepare benchmark datasets for suggestion mining using a combination of crowdsourced and expert annotators.

Finally, we release new benchmark datasets for suggestion mining, covering five domains, which are publicly available for research purposes.


  • Amigó et al (2014) Amigó E, Carrillo-de Albornoz J, Chugur I, Corujo A, Gonzalo J, Meij E, de Rijke M, Spina D (2014) Overview of RepLab 2014: Author profiling and reputation dimensions for online reputation management. In: CLEF 2014, Springer, no. 8685 in LNCS, pp 307–322
  • Asher et al (2009) Asher N, Benamara F, Mathieu Y (2009) Appraisal of opinion expressions in discourse. Lingvisticæ Investigationes 31(2):279–292
  • Brun and Hagege (2013) Brun C, Hagege C (2013) Suggestion mining: Detecting suggestions for improvement in users’ comments. Research in Computing Science
  • Chomsky (1957) Chomsky N (1957) Syntactic Structures. Mouton and Co., The Hague
  • Crystal (2011) Crystal D (2011) A Dictionary of Linguistics and Phonetics. The Language Library, Wiley
  • Dong et al (2013) Dong L, Wei F, Duan Y, Liu X, Zhou M, Xu K (2013) The automated acquisition of suggestions from tweets. In: AAAI, AAAI Press
  • Dudman (1988) Dudman VH (1988) Indicative and subjunctive. Analysis 48(3):113–122
  • Guan (2012) Guan X (2012) A study on the formalization of english subjunctive mood. Theory and Practice in Language Studies 2(1):170–173
  • Hu and Liu (2004) Hu M, Liu B (2004) Mining and summarizing customer reviews. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’04, pp 168–177
  • Jijkoun et al (2010) Jijkoun V, de Rijke M, Weerkamp W, Ackermans P, Geleijnse G (2010) Mining user experiences from online forums: An exploration. In: NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media: #SocialMedia, ACL
  • Klein and Manning (2003) Klein D, Manning CD (2003) Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, Association for Computational Linguistics, Stroudsburg, PA, USA, ACL ’03, pp 423–430
  • Liu (2012) Liu B (2012) Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies 5(1):1–167
  • Martínez Flor (2005) Martínez Flor A (2005) A theoretical review of the speech act of suggesting: Towards a taxonomy for its use in FLT. Revista Alicantina de Estudios Ingleses 18:167–187
  • Moghaddam (2015) Moghaddam S (2015) Beyond sentiment analysis: mining defects and improvements from customer feedback. In: European Conference on Information Retrieval, Springer, pp 400–410
  • Morante and Sporleder (2012) Morante R, Sporleder C (2012) Modality and negation: An introduction to the special issue. Computational Linguistics 38(2):223–260
  • Negi and Buitelaar (2015) Negi S, Buitelaar P (2015) Towards the extraction of customer-to-customer suggestions from reviews. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, pp 2159–2167
  • Pontiki et al (2016) Pontiki M, Galanis D, Papageorgiou H, Androutsopoulos I, Manandhar S, AL-Smadi M, Al-Ayyoub M, Zhao Y, Qin B, De Clercq O, Hoste V, Apidianaki M, Tannier X, Loukachevitch N, Kotelnikov E, Bel N, Jiménez-Zafra SM, Eryiğit G (2016) SemEval-2016 task 5 : aspect based sentiment analysis. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), Association for Computational Linguistics, pp 19–30
  • Portner (2009) Portner P (2009) Modality. Oxford University Press
  • Ramanand et al (2010) Ramanand J, Bhavsar K, Pedanekar N (2010) Wishful thinking - finding suggestions and ’buy’ wishes from product reviews. In: Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, Association for Computational Linguistics, Los Angeles, CA, pp 54–61
  • Searle (1976) Searle JR (1976) A classification of illocutionary acts. Language in Society 5(1):1–23
  • Viswanathan et al (2011) Viswanathan A, Venkatesh P, Vasudevan BG, Balakrishnan R, Shastri L (2011) Suggestion mining from customer reviews. In: AMCIS 2011 Proceedings - All Submissions, p 155
  • Wachsmuth et al (2014) Wachsmuth H, Trenkmann M, Stein B, Engels G, Palakarska T (2014) A review corpus for argumentation analysis. In: Proceedings of the 15th International Conference on Computational Linguistics and Intelligent Text Processing, Springer, Kathmandu, Nepal, LNCS, vol 8404, pp 115–127
  • Wicaksono and Myaeng (2012) Wicaksono AF, Myaeng SH (2012) Mining advices from weblogs. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM ’12, pp 2347–2350
  • Wicaksono and Myaeng (2013a) Wicaksono AF, Myaeng SH (2013a) Automatic extraction of advice-revealing sentences for advice mining from online forums. In: Proceedings of the Seventh International Conference on Knowledge Capture, ACM, K-CAP ’13, pp 97–104
  • Wicaksono and Myaeng (2013b) Wicaksono AF, Myaeng SH (2013b) Toward advice mining: Conditional random fields for extracting advice-revealing text units. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, ACM, New York, NY, USA, CIKM ’13, pp 2039–2048