With the advent of social media platforms, an increasing number of users address their grievances over these platforms in the form of complaints. Following prior work in pragmatics, a complaint is considered a basic speech act used to express a negative mismatch between expectation and reality. Transportation and its related logistics industries are the backbone of every economy (https://www.entrepreneur.com/article/326552). Many transport organizations rely on complaints gathered via these platforms to improve their services; hence, understanding these complaints is important for (1) linguists, to identify human expressions of criticism, and (2) organizations, to improve their query response time and address concerns effectively.
The presence of inevitable noise and sparse content, along with rephrased and structurally morphed instances of posts, makes the task at hand difficult. Previous works in the domain of complaint extraction have focused on static datasets only; these are not robust to changes in trends, information flow, and linguistic variations. We propose an iterative, semi-supervised approach for the identification of complaint-based tweets that can be replicated for a stream of information flow. We prefer a semi-supervised approach over supervised ones for the following reasons: (a) the task of isolating a training set makes supervised approaches less attractive and impractical, and (b) the imbalance between the subjective and objective classes leads to poor performance.
We aimed to mimic the presence of a sparse/noisy content distribution, mandating the need to curate a novel dataset via specific lexicons. We scraped random posts from a recognized transport forum (https://www.theverge.com/forums/transportation). A pool of uni/bi-grams was created based on tf-idf representations extracted from the posts, which was further pruned by annotators. Querying posts on Twitter with the extracted lexicons led to a collection of tweets. In order to have lexical diversity, we added randomly sampled tweets to our dataset. In spite of the sparse nature of these posts, their lexical characteristics act as information cues.
Figure 1 pictorially represents our methodology. Our approach required an initial set of informative tweets, for which we employed two human annotators to annotate a random sub-sample of the original dataset. Samples were marked as informative or non-informative, discriminated on the following criterion: Is the tweet addressing any complaint or raising grievances about modes of transport or services/events associated with transportation, such as traffic or public/private transport? An example tweet marked as informative: No, metro fares will be reduced ???, but proper fare structure needs to presented right, it’s bad !!!.
We utilized tf-idf for the identification of initial seed phrases from the curated set of informative tweets. Terms having the highest tf-idf scores were matched against the complete dataset and, based on sub-string match, the transport-relevant tweets were identified. Redundant tweets were filtered based on their cosine similarity score. Implicit information indicators were identified based on a domain relevance score, a metric used to gauge the coverage of an n-gram when evaluated against a randomly created pool of posts.
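The exact formula for the domain relevance score is not given in the text; a plausible reading is the ratio of an n-gram's relative frequency in the domain pool to its relative frequency in the random pool, so that terms common in both pools are discounted. The function names and smoothing constant below are assumptions for illustration.

```python
# Hedged sketch of a domain relevance score: assumed here to be the
# ratio of an n-gram's relative frequency in domain posts to its
# relative frequency in a random background pool.

def relative_freq(ngram, posts):
    """Fraction of posts that contain the n-gram as a substring."""
    hits = sum(1 for p in posts if ngram in p)
    return hits / len(posts)

def domain_relevance(ngram, domain_posts, random_posts, smoothing=1e-6):
    """Higher values -> the n-gram is characteristic of the domain."""
    return relative_freq(ngram, domain_posts) / (
        relative_freq(ngram, random_posts) + smoothing
    )

# Toy pools standing in for the curated and random tweet collections.
domain = ["metro fare hike complaint", "bus delayed again", "metro card reader broken"]
background = ["great movie last night", "recipe for pasta", "football score update"]

print(domain_relevance("metro", domain, background) > 1)  # domain term
print(domain_relevance("pasta", domain, background))      # background-only term
```

Under this reading, a threshold on the score (determined experimentally, per the text) separates domain-characteristic n-grams from background noise.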
We collected a pool of randomly sampled tweets from a period different from the data collection period. The rationale behind having such a metric was to discard commonly occurring n-grams, normalized against random noise, and to include the ones of lexical importance. We used terms associated with a high domain relevance score (threshold determined experimentally) as seed phrases for the next set of iterations; the growing dictionary augments the collection process. The process ran until no new lexicons were identified, providing us with the transport-relevant tweets. In order to identify linguistic signals associated with complaint posts, we randomly sampled a set of tweets to be used as the training set, manually annotated with two distinct labels: complaint-relevant and complaint non-relevant. We employed the following features on our dataset.
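The iterative collection loop described above can be sketched as follows: seed phrases select matching tweets by sub-string match, newly extracted high-relevance terms are promoted to seeds, and the loop terminates when the dictionary stops growing. The helper names and the relevance test passed in below are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of the iterative, semi-supervised collection loop:
# seeds select tweets, new relevant n-grams become seeds, and the loop
# stops at a fixed point (no new lexicons identified).

def expand_seeds(tweets, seeds, is_relevant_ngram):
    """Grow the seed dictionary until a fixed point is reached."""
    seeds = set(seeds)
    collected = set()
    while True:
        # 1. Sub-string match: collect tweets containing any current seed.
        new_tweets = {t for t in tweets if any(s in t for s in seeds)}
        collected |= new_tweets
        # 2. Extract candidate unigrams from the collected tweets.
        candidates = {w for t in new_tweets for w in t.split()}
        # 3. Keep candidates passing the (assumed) domain relevance test.
        new_seeds = {c for c in candidates if is_relevant_ngram(c)} - seeds
        if not new_seeds:  # fixed point: no new lexicons found
            return seeds, collected
        seeds |= new_seeds

tweets = ["metro fare hike is unfair", "bus delayed again today", "lovely weather"]
seeds, collected = expand_seeds(
    tweets, {"metro"}, lambda w: w in {"fare", "bus", "delayed"}
)
print(sorted(seeds), len(collected))
```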
Linguistic markers. To capture linguistic aspects of complaints, we utilized Bag of Words, counts of POS tags, and Word2vec clusters.
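A minimal sketch of the first two linguistic markers is given below: a Bag-of-Words matrix and per-tweet POS-tag counts. The POS tagger is stubbed with a tiny lookup table so the example is self-contained; the paper's actual tagger and the Word2vec clustering step are not reproduced.

```python
# Sketch of linguistic markers: Bag of Words plus (stubbed) POS counts.
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

tweets = ["metro fares are bad", "bus is late again"]

# Bag-of-Words features.
bow = CountVectorizer()
bow_matrix = bow.fit_transform(tweets)

# Toy POS lookup standing in for a real tagger (e.g. NLTK or spaCy).
TOY_TAGS = {"metro": "NOUN", "fares": "NOUN", "are": "VERB", "bad": "ADJ",
            "bus": "NOUN", "is": "VERB", "late": "ADJ", "again": "ADV"}

def pos_counts(tweet):
    """Count POS tags per tweet; unknown tokens fall back to 'X'."""
    return Counter(TOY_TAGS.get(tok, "X") for tok in tweet.split())

print(bow_matrix.shape)
print(pos_counts("metro fares are bad"))
```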
Sentiment markers. We used a quantified score based on the ratio of tokens mentioned in the following lexicons: MPQA, NRC, VADER, and Stanford.
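The lexicon-ratio score can be sketched as the fraction of a tweet's tokens that appear in a sentiment lexicon. The tiny word lists below are placeholders for the MPQA/NRC/VADER-style lexicons named above.

```python
# Hedged sketch of the lexicon-ratio sentiment feature: the fraction of
# tokens found in a (placeholder) sentiment word list.
NEGATIVE = {"bad", "terrible", "late", "unfair"}
POSITIVE = {"good", "great", "nice"}

def lexicon_ratio(tweet, lexicon):
    """Ratio of tweet tokens that occur in the given lexicon."""
    tokens = tweet.lower().split()
    return sum(tok in lexicon for tok in tokens) / max(len(tokens), 1)

tweet = "metro fares are bad and service is terrible"
features = {
    "neg_ratio": lexicon_ratio(tweet, NEGATIVE),
    "pos_ratio": lexicon_ratio(tweet, POSITIVE),
}
print(features)
```

One such ratio would be computed per lexicon, yielding one sentiment feature per resource.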
Information-specific markers. These account for a set of handcrafted features associated with complaints; we used the stated markers: (a) Text Metadata, which includes the counts of URLs, hashtags, special symbols, and user mentions, used to enhance retweet impact; (b) Request Identification, where we employed a previously proposed model to identify whether a specific tweet assertion is a request; (c) Intensifiers, where we make use of a feature set derived from the number of words starting with capital letters and the repetition of special symbols (exclamation and question marks) within the same post; (d) Politeness Markers, where we utilize the politeness score of the tweet extracted from an existing model; (e) Pronoun Variation, as pronouns have the ability to reveal or intensify personal involvement; we utilize the frequency of pronoun types using pre-defined dictionaries.
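The metadata and intensifier markers lend themselves to simple regular-expression extraction; a sketch is given below. The feature names are illustrative, and the request-identification and politeness models from the text are external components not reproduced here.

```python
# Sketch of handcrafted information-specific markers: metadata counts
# and intensifier cues extracted with regular expressions.
import re

def handcrafted_features(tweet):
    return {
        "n_urls": len(re.findall(r"https?://\S+", tweet)),
        "n_hashtags": len(re.findall(r"#\w+", tweet)),
        "n_mentions": len(re.findall(r"@\w+", tweet)),
        # Intensifiers: capitalized words and repeated !/? symbols.
        "n_capitalized": sum(w[0].isupper() for w in tweet.split() if w[0].isalpha()),
        "repeated_punct": len(re.findall(r"[!?]{2,}", tweet)),
    }

feats = handcrafted_features(
    "No, metro fares will be reduced ???, but it's bad !!! @metro #fares"
)
print(feats)
```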
From the pool of transport-relevant tweets, we sampled a set of tweets to be used as the testing set. The results, obtained with cross-validation, are reported in Table 1. As the number of iterations increases, the pool of seed phrases gets refined and augments the selection of transport-relevant tweets. The proposed pipeline is tailored to identify complaint-relevant tweets in a noisy scenario.
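The evaluation step can be sketched as k-fold cross-validation of a simple Bag-of-Words classifier, reporting the accuracy and F1 metrics used in Table 1. The toy data, the choice of logistic regression, and the fold count are illustrative assumptions.

```python
# Hedged sketch of the evaluation: cross-validated accuracy and F1 for
# a Bag-of-Words classifier on toy complaint/non-complaint tweets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

tweets = ["metro fares are bad", "bus late again terrible", "awful delays every day",
          "nice ride today", "great service on the metro", "smooth trip home"]
labels = [1, 1, 1, 0, 0, 0]  # 1 = complaint-relevant

model = make_pipeline(CountVectorizer(), LogisticRegression())
scores = cross_validate(model, tweets, labels, cv=3, scoring=["accuracy", "f1"])
print(scores["test_accuracy"].mean(), scores["test_f1"].mean())
```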
Table 1 reflects that the BOW model provided the best results, both in terms of accuracy and F1-score. The best result achieved by a sentiment model was Stanford Sentiment, with the others within the same range, and the linguistic features collectively giving the best performance.
Conclusion and Future Work
In this paper, we presented a novel semi-supervised pipeline along with a novel dataset for the identification of complaint-based posts in the transport domain. The proposed methodology can be extended to other fields by altering the lexicons used for the creation of information cues. There are limitations to this analysis: we do not use neural networks, which mandate a large volume of data. In the future, we aim to identify demographic features for the identification of complaint-based posts on social media platforms.
References
- (2013) No country for old members: user lifecycle and linguistic change in online communities. In Proceedings of the 22nd International Conference on World Wide Web, pp. 307–318.
- (2013) Electronic complaints: an empirical study on British English and German complaints on eBay. Vol. 18, Frank & Timme GmbH.
- (1985) Complaints: a study of speech act behavior among native and nonnative speakers of Hebrew. Tel Aviv University.
- (2017) Multimodal analysis of user-generated multimedia content. Springer.