Text processing is part of everyday life and powers several important utilities, such as machine translation, web search, personal assistants, and user recommendations. Today, social media is one of the largest sources of text. While social media fosters the development of new types of text processing applications, it also brings its own set of challenges due to its informal language.
Text in social media is unstructured and has a more informal and conversational tone than text from conventional media outlets [hownoisy]. For instance, text in social media is rich in abbreviations, hashtags, emojis, and misspellings.
Traditional Natural Language Processing (NLP) tools are designed for formal text and are less effective when applied to informal text from social media [ner_twitter18]. This is why recent research efforts have tried to adapt NLP tools to the social media domain [twitter_pos]. Moreover, methods at the intersection of NLP and machine learning applied to social media have been successful in information extraction [twitter_event], classification [twitter_class], and conversation modeling [unsup_twitter].
The results of previous work are not sufficient for our purposes, for the following reasons: (1) many results rely on access to massive quantities of annotated data, which is not available in our domain; (2) most of the work focuses on Twitter, with little attention to image-sharing platforms like Instagram (Instagram.com); and (3) to the best of our knowledge, no prior assessment of complex, multi-label, hierarchical extraction and classification in social media has been made.
Acquiring annotated data that is accurate and can be used for training text mining models is expensive, especially in a shifting data domain like social media. In this research, we explore the boundaries of text mining methods that can be effective without this type of strong supervision.
Even if we assume that the main research results from Twitter will be useful in our research on Instagram, we should still take into account several important differences between the two domains. The most notable differences are that Instagram is an image-sharing medium while Twitter is a micro-blogging medium, and that Twitter has a character limit per tweet.
In this paper, we focus on the task of extracting fashion attributes from Instagram posts and classifying Instagram posts into clothing categories based on the associated text. The work presented in this paper is part of a larger research project. The project aims to improve the state of the art in fashion recommendation by making use of activity in social media and by using data that crosses multiple domains in the recommendations [shatha_intro]. The text processing methods presented in this paper are meant to be integrated with computer vision models in the project.
Like other consumption-driven industries, the fashion industry has been influenced by the emergence of social media. Social media is receiving progressively more attention from fashion brands and retailers as a source for detecting trends, adapting user recommendations, and marketing [fashion_sm2]. For example, the image-sharing platform Instagram has become a popular medium for fashion branding and community engagement [fashion_article]. This is why extraction and classification of fashion attributes on Instagram is an important task for several modern applications that work with user recommendation and detection of fashion trends.
In addition to hosting images, Instagram contains large volumes of user-generated text. Specifically, an Instagram post can be associated with an image caption written by the author of the post, with comments written by other users, and with "tags" in the image that refer to other users. Despite Instagram being a platform rich in text, little prior work has paid attention to the promising applications of text mining on Instagram. Our case study of Instagram posts in the fashion community revealed that the text often indicates the clothing in the associated image; an example of this is given in Fig. 1. We believe that there is value in the text on Instagram that is currently unutilized. For example, the text on Instagram can be mined and used for predictive modeling and analytics.
Our contributions in this paper include:
An empirical study of Instagram text.
A system for unsupervised extraction of fashion attributes from text on Instagram.
A novel pipeline for multi-label clothing classification of the text associated with Instagram posts using weak supervision and the data programming paradigm [data_prog].
Our empirical study provides one of the few available studies on Instagram text and shows that the text is noisy, that the text distribution exhibits the long-tail phenomenon, and that comment sections on Instagram are often multilingual. Moreover, experimental results demonstrate that the use of word embeddings adds semantic word knowledge that is helpful for information extraction and improves the accuracy compared with a baseline that uses Levenshtein distance. Finally, we train a deep text classifier using weak supervision and data programming. The classifier achieves an score of on the task of clothing prediction of Instagram posts based on the text. The accuracy of the classifier is on par with human performance on the task and beats a baseline that uses majority voting.
The rest of this paper is structured as follows. In Section II we describe related work, and in Section III we present our approach to the problem. We then summarize the experimental setup, followed by the results of our evaluations and our interpretation of them. Lastly, we present our conclusions and suggestions for future research directions.
II Related Work
II-A Unsupervised Information Extraction
In [twitter_event], the authors propose an approach to event extraction and categorization that uses a supervised tagger to identify events in tweets. Next, the extracted events are categorized using latent variable models, which can make use of unlabeled data. Results demonstrate improved accuracy compared with a supervised baseline. Their work resembles ours in that they attempt to classify and extract information from noisy text and try to make use of unlabeled data. However, there are some important differences compared to our setting. In event categorization, the categories are not known a priori, which fits well with the latent variable model approach. In contrast, our extraction problem has a pre-defined set of classes. Moreover, their proposed solution assumes access to an annotated dataset for training a tagger to recognize events in tweets; a corresponding dataset is not available in our domain.
Numerous research efforts have addressed coarse-grained classification in social media using latent variable models [unsup_twitter, ner_twitter18]. These studies differ from our work in two ways. First, most of the work is focused on Twitter. Second, in our research, the goal is complex multi-label extraction, while the aforementioned work targets more general and high-level extraction tasks.
Word embeddings have been shown to be a valuable asset for information extraction. In [ir_we1], the authors evaluate how useful word embeddings are for clinical concept extraction, and in [ir_we2] the utility of word embeddings for named entity recognition on Twitter is evaluated. Both results demonstrate improvements when using word embeddings compared to baseline methods.
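To illustrate why embeddings can add semantic knowledge that string similarity lacks, the following minimal sketch links a token to the closest term in a small ontology, once by Levenshtein distance and once by cosine similarity between word vectors. The ontology terms, the three-dimensional vectors, and the function names are all hypothetical and chosen only for illustration; this is not the extraction system presented in this paper, and a real system would use embeddings trained on a large corpus.

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Hypothetical toy vectors (assumption): real embeddings have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "dress": [0.9, 0.1, 0.0],
    "gown": [0.85, 0.15, 0.05],
    "shoes": [0.1, 0.9, 0.0],
    "sneakers": [0.15, 0.85, 0.1],
}
# Hypothetical ontology terms (assumption).
ONTOLOGY_TERMS = ["dress", "shoes"]

def match_by_edit_distance(token, terms):
    """Baseline: pick the ontology term with the smallest edit distance."""
    return min(terms, key=lambda term: levenshtein(token, term))

def match_by_embedding(token, terms, embeddings):
    """Pick the ontology term whose vector is most similar to the token's vector."""
    if token not in embeddings:
        return None  # no vector available for this token
    return max(terms, key=lambda term: cosine(embeddings[token], embeddings[term]))
```

With these toy vectors, `match_by_embedding("gown", ONTOLOGY_TERMS, TOY_EMBEDDINGS)` selects "dress", whereas `match_by_edit_distance("gown", ONTOLOGY_TERMS)` selects "shoes", because the strings "gown" and "shoes" happen to share a character while "gown" and "dress" share none.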
II-B Text Classification with Weak Supervision
For the task of classifying Instagram text, our research builds primarily on results from supervised machine learning. The success of this paradigm has traditionally been tied to annotated datasets. Notable results in supervised text classification are [kim_cnn] and [verydeep_cnn], both of which differ from our research in that they assume access to a large annotated text corpus for training the classifier.
More recently, weakly supervised approaches have been used for text classification and information extraction. Specifically, the data programming paradigm presented in [data_prog] has achieved promising results. Data programming has been applied to binary and multinomial text extraction and classification tasks [data_prog, snorkel]. To the best of our knowledge, it has been applied neither to multi-label classification tasks nor to social media text.
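As a rough illustration of the idea behind weak supervision, the sketch below writes a few heuristic labeling functions and combines their votes per class with a simple majority vote, which yields one binary decision per clothing category (a multi-label output). The class names, keywords, and function names are hypothetical. Note also that data programming as described in [data_prog] combines labeling functions with a learned generative model rather than a plain vote; the majority vote here is only the simplest possible combiner and is used as an assumption to keep the example short.

```python
from typing import Callable, Dict, List

# Vote values for one class: a labeling function may also abstain.
NEGATIVE, ABSTAIN, POSITIVE = -1, 0, 1

LabelingFunction = Callable[[str], int]

def keyword_lf(keywords: List[str], label: int = POSITIVE) -> LabelingFunction:
    """Build a labeling function that votes `label` if any keyword occurs as a token."""
    def lf(text: str) -> int:
        tokens = text.lower().split()
        return label if any(k in tokens for k in keywords) else ABSTAIN
    return lf

# Hypothetical classes and heuristics (assumption), one list of functions per class.
LABELING_FUNCTIONS: Dict[str, List[LabelingFunction]] = {
    "dresses": [
        keyword_lf(["dress", "gown"]),
        keyword_lf(["#dress"]),
    ],
    "shoes": [
        keyword_lf(["shoes", "sneakers"]),
        keyword_lf(["#shoes", "heels"]),
        keyword_lf(["barefoot"], label=NEGATIVE),
    ],
}

def majority_vote(votes: List[int]) -> int:
    """Return 1 if positive votes outnumber negative votes, else 0."""
    return 1 if sum(votes) > 0 else 0

def weak_labels(text: str) -> Dict[str, int]:
    """Produce one binary weak label per class for a piece of text."""
    return {
        cls: majority_vote([lf(text) for lf in lfs])
        for cls, lfs in LABELING_FUNCTIONS.items()
    }
```

Labels produced this way are noisy, which is why they are typically used to train a downstream classifier rather than being treated as ground truth.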
In text mining, there is a balance between models that rely on domain knowledge and models that rely on annotated training data. In our research, we have experimented with both approaches. In Section III-A we outline how our analysis of the Instagram corpus was performed. Section III-B describes our second contribution, a method for information extraction using an ontology with domain knowledge and word embeddings. Finally, we present the pipeline we used to train a deep text classifier using weak supervision. The code for the implementations is publicly available at https://github.com/shatha2014/FashionRec.
III-A Empirical Study of Instagram Text
Of special interest in our study was to clarify how Instagram text differs from newswire text, since this affects the choice of processing methods. We analyzed a corpus of Instagram posts by measuring the fraction of online-specific tokens, the number of Out-Of-Vocabulary (OOV) words, the number of languages in the corpus, and the text distribution.
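Measurements of this kind can be sketched in a few lines. The example below computes the fraction of online-specific tokens, the number of OOV tokens relative to a reference vocabulary, and the token frequency distribution for a small list of posts. The whitespace tokenization, the definition of "online-specific" (hashtags, user mentions, URLs, and a rough emoji range check), and the function names are assumptions made for illustration and may differ from the definitions used in the study. Language counting is omitted because it requires a language identification model.

```python
import re
from collections import Counter

# Assumed definition of online-specific tokens: hashtags, mentions, and URLs.
ONLINE_PATTERN = re.compile(r"#\w+|@\w+|https?://\S+")

def contains_emoji(token: str) -> bool:
    """Rough heuristic: check for code points in common emoji and symbol blocks."""
    return any(
        0x1F300 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF
        for ch in token
    )

def is_online_specific(token: str) -> bool:
    return bool(ONLINE_PATTERN.fullmatch(token)) or contains_emoji(token)

def corpus_stats(posts, vocabulary):
    """Compute simple corpus statistics using whitespace tokenization."""
    tokens = [tok for post in posts for tok in post.lower().split()]
    online_count = 0
    oov_count = 0
    for tok in tokens:
        if is_online_specific(tok):
            online_count += 1
            continue
        word = tok.strip(".,!?")  # drop surrounding punctuation before lookup
        if word and word not in vocabulary:
            oov_count += 1
    return {
        "online_fraction": online_count / len(tokens) if tokens else 0.0,
        "oov_count": oov_count,
        # Sorted frequencies make a long-tailed distribution easy to inspect.
        "frequencies": Counter(tokens).most_common(),
    }

# Hypothetical example posts and reference vocabulary (assumption).
example_posts = ["Love this dress #ootd @anna", "new shoez 😍"]
example_vocabulary = {"love", "this", "dress", "new"}
stats = corpus_stats(example_posts, example_vocabulary)
```

In this toy input, three of the eight tokens are online-specific (a hashtag, a mention, and an emoji), and the misspelling "shoez" is the only OOV word.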