Even though the usage and popularity of Twitter have stopped growing rapidly and have even dropped in recent years (https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users), it still has a considerable number of loyal users who keep sharing everything from worldwide events to random personal details with their followers. We decided to focus on one kind of personal detail that people share: anything to do with food consumption and related topics.
Several corpora of Latvian tweets exist in prior work, but none of them is both domain-specific and collected over an extensive period of time. Milajevs [milajevs2018language] collected and analysed 1.4 million tweets geo-located in Riga, Latvia, from April 2017 to July 2018, and 60 thousand tweets [milajevs-2017-toward] from November 2016 to March 2017. Pinnis [pinnis2018latvian] collected and analysed 3.8 million tweets of Latvian politicians, companies, media, and users who interacted with these entities from August 2016 to July 2018. There are also several data sets of general sentiment-annotated tweets [peisenieksuses, viksna2018sentiment, pinnis2018latvian] (https://github.com/nicemanis/LV-twitter-sentiment-corpus), amounting to 14,781 tweets in total.
In this paper, we describe the Twitter Eater Corpus (TEC) and analyse its contents. We also provide two sub-corpora: one consisting of question and answer tweets and one with sentiment-annotated tweets. More details can be found in Section 2. In Sections 3.1 and 3.2 we describe question answering and sentiment analysis experiments using our corpus. Finally, we conclude the paper in Section 4.
2 The Twitter Eater Corpus
The corpus consists of tweets that have been collected from October 2011 [rikters2012universalas] until April 2020. They are tracked using 363 keywords, which are various inflections of Latvian words associated with eating, tasting, breakfast, lunch, dinner, etc. The main keywords are shown in Table 1. The words in bold are mostly verbs that describe eating; these were inflected to all usable forms and included in the full keyword list. The rest of the keywords are the top 60 food-related words that were most popular in the first month of collecting the tweets.
Figure 1 illustrates the contents of a single tweet from the TEC in JSON notation. Each tweet consists of primary fields - "tweet_id", "tweet_text", "tweet_author" and "created_at", which will always be present, and optional fields, which depend on the tweet text and metadata. We separate three groups of optional fields: 1) "media_url" and "expanded_url", which contain information about media files from the tweet; 2) "location_name", "location_lng", "location_lat" and "location_country", which specify where the tweet was created; and 3) "food_surface_form", "food_nominative_form", "food_group" and "food_english_translation", which contain semicolon-separated lists of foods or drinks that appear in the tweet.
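To make the record layout concrete, the following minimal sketch builds one tweet with the fields named above and splits the semicolon-separated food fields into one entry per food. The field names come from the description; every value is an invented placeholder (including the date format and URL fields), and the helper `split_food_fields` is a hypothetical name, not part of the released scripts.

```python
import json

# Hypothetical record: field names follow the description above,
# all values are invented placeholders, not real corpus data.
example_tweet = {
    "tweet_id": "0000000000",
    "tweet_text": "Brokastīs ēdu maizi un dzeru tēju",
    "tweet_author": "example_user",
    "created_at": "2015-01-01 08:00:00",  # assumed date format
    "location_name": "Riga",
    "location_country": "Latvia",
    "food_surface_form": "maizi;tēju",
    "food_nominative_form": "maize;tēja",
    "food_group": "6;8",
    "food_english_translation": "bread;tea",
}

FOOD_FIELDS = (
    "food_surface_form",
    "food_nominative_form",
    "food_group",
    "food_english_translation",
)


def split_food_fields(tweet):
    """Return one dict per food mentioned in the tweet.

    Optional fields may be absent, in which case an empty list is returned.
    """
    columns = [tweet[f].split(";") if tweet.get(f) else [] for f in FOOD_FIELDS]
    return [dict(zip(FOOD_FIELDS, values)) for values in zip(*columns)]


foods = split_food_fields(example_tweet)
print(json.dumps(foods, ensure_ascii=False, indent=2))
```

Because the food fields are optional, a tweet without them simply yields an empty list.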
At the beginning of the project approximately 15,000 food and drink words from collected tweets were manually annotated with their respective nominative forms, English translations and food groups according to the food guide pyramid [duston_1992]. The food groups are: bread, cereal, rice, pasta (6); vegetables (5); fruit, berries (4); milk products (3); meat, eggs, fish (2); fats, oils, sweets (1). There are two additional groups for drinks - alcoholic drinks (7) and non-alcoholic drinks (8).
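The group numbering above can be written down as a small lookup table. This is only a sketch: it assumes that the "food_group" field stores these numeric codes as a semicolon-separated string, and `group_names` is a hypothetical helper.

```python
# Food group codes as listed in the text above.
FOOD_GROUPS = {
    1: "fats, oils, sweets",
    2: "meat, eggs, fish",
    3: "milk products",
    4: "fruit, berries",
    5: "vegetables",
    6: "bread, cereal, rice, pasta",
    7: "alcoholic drinks",
    8: "non-alcoholic drinks",
}


def group_names(food_group_field):
    """Translate a semicolon-separated group code string into group names.

    Assumes the field holds numeric codes such as "6;8".
    """
    return [
        FOOD_GROUPS[int(code)]
        for code in food_group_field.split(";")
        if code.strip()
    ]


print(group_names("6;8"))
```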
The corpus is available on GitHub (https://github.com/Usprogis/Latvian-Twitter-Eater-Corpus) in accordance with the content redistribution section of the Twitter Developer Agreement and Policy (https://developer.twitter.com/en/developer-terms/agreement-and-policy). The public release includes tweet IDs along with data fields created within the scope of this project (starting with "location_lng" in Figure 1). The complete version is available upon individual request for research purposes. The repository also includes data processing scripts and details on how to reproduce our experiments.
2.1 Content Overview
The corpus contains 2,275,787 tweets, of which 155,057 contain media information, 165,335 contain location information and 1,297,159 mention foods or drinks. Table 2 shows the 10 most popular foods and drinks from the TEC. From a Latvian consumer perspective (https://enciklopedija.lv/skirklis/4980-nacion%C4%81l%C4%81-virtuve-Latvij%C4%81), it is typical that Latvians mostly drink water, tea, juice and beer, and eat meat, vegetables and fruit. More interesting is the high popularity of sweets such as chocolate, cakes, ice cream and Coca-Cola.
Figure 2 shows the yearly count of collected tweets along with the potential trend (since only part of the year has been collected for 2011 and 2020) and the general popularity of Twitter and Instagram (a competing social network) in Latvia according to Google Trends (https://trends.google.com/trends/explore?hl=en-US&tz=-540&date=2011-10-06+2020-03-14&geo=LV&q=%2Fm%2F0fjd36,%2Fm%2F0289n8t,%2Fm%2F02y1vz,%2Fm%2F0glpjll&sni=3). There was a steady inflow of food tweets up until 2015. After that, the decrease seems to correlate with the overall drop in the popularity of Twitter in Latvia, which in turn appears to move in the opposite direction to the popularity of Instagram in Latvia according to Google Trends.
In Figure 3 we have visualised four of the largest tweet trends of the past years among Latvian-speaking Twitter users. The most recent one, just a month ago, was panic buying of buckwheat due to the COVID-19 pandemic of 2020. Before that came the doubling of butter prices in 2017, the ban on Latvian sprat imports to Russia in 2015, and the horsemeat scandal in 2013. If we look closer at the 2,823 tweets about meat in week 9 of 2013, we can see multiple inflections of the word "horse" along with words like "scandal" and "investigation" among the most common words.
Figure 4 shows a selection of seasonal trends averaged from data between 2012 and 2019. Most trends have one peak zone indicating the part of the year when they are more popular. Examples of this are gingerbread and tangerines in December, and strawberries and ice cream in the summer. We were expecting to see chocolate peak sharply on Valentine’s Day, but while it does peak, the increase is not as pronounced as expected.
2.2 Question - Answer Sub-corpus
We noticed that there are plenty of tweets in our corpus that express questions. To highlight one of the uses of the corpus, we selected a subset of tweets which include at least one typical Latvian question word or phrase (http://valoda.ailab.lv/latval/vidusskolai/SINTAKSE/sint3jaut.htm) along with a question mark. This resulted in 215,233 question tweets. To gather answers for them, we scraped Twitter’s web version (https://github.com/luodaoyi/TwEater), which resulted in 19,871 tweets with at least one reply. Since there were many tweets with multiple answers, we eventually wound up with 42,744 question-answer pairs. We randomly selected subsets of 1,000 and 500 question-answer pairs to use as the development set and evaluation set respectively.
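The selection rule described above (a question word plus a question mark) can be sketched as a simple filter. The word list below is a small illustrative subset of Latvian question words chosen for this example, not the list actually used for the corpus, and multi-word question phrases are not handled; `is_question_tweet` is a hypothetical name.

```python
import re

# Illustrative subset of Latvian question words (an assumption for this
# sketch, not the full list used to build the sub-corpus).
QUESTION_WORDS = {"kas", "ko", "kur", "kad", "kā", "kāpēc", "kādēļ", "cik", "vai"}


def is_question_tweet(text):
    """True if the text has a question mark and at least one question word."""
    if "?" not in text:
        return False
    # \w is Unicode-aware for str patterns, so Latvian letters are kept.
    tokens = re.findall(r"\w+", text.lower())
    return any(token in QUESTION_WORDS for token in tokens)


tweets = [
    "kāpostu tīteņi vai cepelīni?",
    "Es jau paēdu.",
    "Garšīgi?!",
]
questions = [t for t in tweets if is_question_tweet(t)]
print(questions)
```

Matching whole tokens rather than substrings avoids false positives from words that merely contain a question word.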
2.3 Sentiment Annotated Sub-corpus
We manually annotated 5420 tweets, marking them as positive, neutral or negative. This gave us 1631 positive, 2507 neutral and 1282 negative tweets. We further split these into a test set of 250 tweets from each class and a training set.
3.1 Question Answering
Typical question answering systems are trained using paragraphs of text, questions about the paragraphs and answers to those questions [rajpurkar-etal-2016-squad]. Since we only had question-answer pairs, we chose to train an encoder-decoder model in the same way as for machine translation, using questions and answers as the source and target languages respectively. We used Sockeye [Sockeye:17] to train transformer architecture models with the base parameters until they reached convergence on development data.
Our initial experiments using only TEC data produced rather poor answers due to the lack of general-domain training data. To mitigate this, we used the same approach to select question-answer tweets from the Latvian Tweet Corpus [pinnis2018latvian]. This gave us 546,982 additional question-answer pairs to add to our training data.
3.1.1 Data Pre-processing
We performed tokenisation and truecasing using scripts from the Moses Toolkit [Koehn2007Moses:Translation]. We used SentencePiece [kudo2018sentencepiece] to create a shared subword vocabulary of 8000 tokens. We replaced all Twitter-specific @user mentions with @USR and URLs with @URL, as these usually do not contain relevant linguistic data for the model to learn. We also replaced multiple consecutive @USR or @URL tags with a single one and removed them completely if they were either at the start or at the end of the tweet.
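The mention and URL normalisation described above could look roughly like the sketch below. It covers only the tag replacement steps, not the Moses or SentencePiece processing. The regular expressions and the function name `normalise_tweet` are assumptions made for illustration, and the sketch collapses only runs of the same tag, which is one possible reading of the description.

```python
import re

# One pass over both entity types, so an inserted "@URL" tag is never
# re-matched as a user mention. These patterns are simplifying assumptions.
ENTITY_RE = re.compile(r"(https?://\S+)|(@\w+)")
REPEATED_TAG_RE = re.compile(r"(@USR|@URL)(?:\s+\1)+")
LEADING_TAGS_RE = re.compile(r"^\s*(?:(?:@USR|@URL)\s*)+")
TRAILING_TAGS_RE = re.compile(r"\s*(?:(?:@USR|@URL)\s*)+$")


def _to_tag(match):
    """Map a matched URL to @URL and a matched mention to @USR."""
    return "@URL" if match.group(1) else "@USR"


def normalise_tweet(text):
    """Replace mentions and URLs with tags, collapse repeats, trim the edges."""
    text = ENTITY_RE.sub(_to_tag, text)
    text = REPEATED_TAG_RE.sub(r"\1", text)   # "@USR @USR" -> "@USR"
    text = LEADING_TAGS_RE.sub("", text)      # drop tags at the start
    text = TRAILING_TAGS_RE.sub("", text)     # drop tags at the end
    return text.strip()


print(normalise_tweet("@anna @janis paldies! https://t.co/abc https://t.co/def"))
print(normalise_tweet("Vai @anna ēd zupu? http://x.lv garšo"))
```

Tags in the middle of a tweet are kept, since there they may still signal that something was addressed to a user or linked.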
Figure 5 shows several examples of generated answers to the evaluation data questions. There were many hypothesis answers that were not even close to the reference ones but still made some sense in relation to the questions, such as the first two. There were also just as many or even more answers that made no sense at all like the last one.
We performed a small-scale human evaluation of the results by asking 5 annotators to evaluate a random 10% of the evaluation set, marking generated answers as either OK or not good (NG). The evaluators marked 46.40% of the answers as OK. The evaluators had an overall agreement of 66.27% (free-marginal kappa [randolph2005free] of 0.33), which indicates moderate agreement.
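For reference, the free-marginal kappa rescales the observed agreement against a chance level of 1/k for k categories. The sketch below shows the calculation on an invented two-item rating example and then plugs in the reported 66.27% agreement with two categories (OK and NG); the function names are hypothetical.

```python
from collections import Counter


def overall_agreement(ratings):
    """Mean proportion of agreeing rater pairs per item.

    ratings: list of items, each a list with one label per rater.
    """
    total = 0.0
    for item in ratings:
        n = len(item)
        counts = Counter(item)
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        total += agreeing_pairs / (n * (n - 1))
    return total / len(ratings)


def free_marginal_kappa(observed_agreement, num_categories):
    """Free-marginal kappa: chance agreement is fixed at 1 / num_categories."""
    expected = 1.0 / num_categories
    return (observed_agreement - expected) / (1.0 - expected)


# Invented toy ratings from 5 annotators on 2 answers.
toy_ratings = [
    ["OK", "OK", "OK", "OK", "OK"],
    ["OK", "OK", "OK", "NG", "NG"],
]
toy_agreement = overall_agreement(toy_ratings)
toy_kappa = free_marginal_kappa(toy_agreement, 2)

# Reported overall agreement with two categories (OK / NG).
print(round(free_marginal_kappa(0.6627, 2), 2))  # 0.33
```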
| Question: | Bet es jau tevis dēļ visu daru. Cikos jāiet ēst torti? |
| Hypothesis: | Es jau paēdu.: D |
| Reference: | uzzvanīšu kad varēs nākten ciemen. |
| Question: | Man šķūnī nesen piedzima ežuļi. Tāpēc SOS jautājums - ko ēd mazi eži? |
| Hypothesis: | Jā, tas ir ļoti labs jautājums.:) |
| Reference: | tikai nebaro ar krūti, viņi nepanes pienu! |
| Question: | kāpostu tīteņi vai cepelīni? |
3.2 Sentiment Analysis
For the sentiment analysis experiment, we used 3116 of our sentiment-annotated tweets. 250 random tweets from each sentiment class were separated as an evaluation corpus and the remaining 2368 were used for training.
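A per-class hold-out split of this kind can be sketched as follows. The function name `split_by_class`, the fixed random seed and the toy data are assumptions for illustration and do not reproduce the actual split used in the experiments.

```python
import random
from collections import defaultdict


def split_by_class(examples, test_per_class=250, seed=0):
    """Hold out a fixed number of random examples per class for evaluation.

    examples: list of (text, label) pairs.
    Returns (train, test) lists of (text, label) pairs.
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    train, test = [], []
    for label in sorted(by_label):
        items = list(by_label[label])
        rng.shuffle(items)
        test.extend(items[:test_per_class])
        train.extend(items[test_per_class:])
    return train, test


# Invented toy data: 10 placeholder tweets per sentiment class.
toy = [
    (f"tweet {i}", label)
    for label in ("positive", "neutral", "negative")
    for i in range(10)
]
toy_train, toy_test = split_by_class(toy, test_per_class=2)
print(len(toy_train), len(toy_test))
```

Holding out the same number of tweets from each class keeps the evaluation set balanced even when the annotated classes differ in size.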
For sentiment analysis, we performed similar data pre-processing steps as for question answering, except for splitting words into subword units with SentencePiece. We also experimented with stemming (https://github.com/rihardsk/LatvianStemmer) and lemmatizing [Paikens:2007:BalticHLT] words.
shows results of our sentiment analysis experiments. We compared a Python implementation of the Naive Bayes classifier from NLTK [bird2009natural] against Pinnis’ [pinnis2018latvian] implementation of the Perceptron classifier. We also experimented with several combinations of training data sets: TE (our Twitter Eater dataset), MP [pinnis2018latvian], RV [viksna2018sentiment], PE [peisenieksuses] and NI (https://github.com/nicemanis/LV-twitter-sentiment-corpus). We found that the highest classification accuracy, 61.23%, is achieved by using all but the NI data set for training and only stemming all words.
In this paper, we described the creation of a fairly large narrow-domain corpus of Twitter posts related to the topic of eating. We gave some insights into the corpus contents and the various trends that we noticed in the data. We believe that the data would be useful in many linguistic, sociological, behavioural and other research areas.
We experimented with creating a food-related question answering system using one subset of our data and a sentiment analysis system using another subset to highlight potential use-cases of our corpus. While the results did not break new ground, we hope that they inspire related future research.
We would like to thank Mārcis Pinnis for sharing his collected tweet dataset with us as well as running experiments with his model using our data.