The official implementation of the NAACL-HLT 2019 paper "Microblog Hashtag Generation via Encoding Conversation Contexts"
Automatic hashtag annotation plays an important role in content understanding for microblog posts. To date, progress made in this field has been restricted to phrase selection from limited candidates, or word-level hashtag discovery using topic models. Different from previous work considering hashtags to be inseparable, our work is the first effort to annotate hashtags with a novel sequence generation framework via viewing the hashtag as a short sequence of words. Moreover, to address the data sparsity issue in processing short microblog posts, we propose to jointly model the target posts and the conversation contexts initiated by them with bidirectional attention. Extensive experimental results on two large-scale datasets, newly collected from English Twitter and Chinese Weibo, show that our model significantly outperforms state-of-the-art models based on classification. Further studies demonstrate our ability to effectively generate rare and even unseen hashtags, which is however not possible for most existing methods.
Microblogs have become an essential outlet for individuals to voice opinions and exchange information. Millions of user-generated messages are produced every day, far outpacing human reading and understanding capacity. As a result, the current decade has witnessed an increasing demand for effectively discovering gist information from large volumes of microblog texts. To identify the key content of a microblog post, hashtags, user-generated labels prefixed with a “#” (such as “#NAACL” and “#DeepLearning”), have been widely used to reflect keyphrases Zhang et al. (2016, 2018) or topics Yan et al. (2013); Hong et al. (2012); Li et al. (2016). Hashtags can further benefit downstream applications, such as microblog search Efron (2010); Bansal et al. (2015), summarization Zhang et al. (2013); Chang et al. (2013); Wang et al. (2011), and so forth. Despite the widespread use of hashtags, a large number of microblog messages lack any user-provided hashtags; for example, fewer than % of tweets contain at least one hashtag Wang et al. (2011); Khabiri et al. (2012). Consequently, there exists a pressing need to automate the hashtag annotation process for the multitude of posts without human-annotated ones.
|Target post for hashtag generation|
|This Azarenka woman needs a talking to from the umpire her weird noises are totes inappropes professionally. #AusOpen|
|Replying messages forming a conversation|
|[T1] How annoying is she. I just worked out what she sounds like one of those turbo charged cars when they change gear or speed.|
|[T2] On the topic of noises, I was at the NadalTomic game last night and I loved how quiet Tomic was compared to Nadal.|
|[T3] He seems to have a shitload of talent and the postmatch press conf. He showed a lot of maturity and he seems nice.|
|[T4] Tomic has a fantastic tennis brain…|
Most previous work in this field focuses on extracting phrases from target posts Zhang et al. (2016, 2018) or selecting candidates from a pre-defined list Gong and Zhang (2016); Huang et al. (2016); Zhang et al. (2017). However, hashtags often appear in neither the target posts nor the given candidate list. The reasons are twofold. For one thing, microblogs give users large freedom to write whatever hashtags they like. For another, due to the wide range and rapid change of social media topics, a vast variety of hashtags are created daily, making it impossible for a fixed candidate list to cover them. Another line of prior research employs topic models to generate topic words as hashtags Gong et al. (2015); Zhang et al. (2016). Owing to the limitations of most topic models, however, these methods are incapable of producing phrase-level hashtags.
In this paper, we approach hashtag annotation with a novel sequence generation framework. In doing so, we enable phrase-level hashtags beyond the target posts or the given candidates to be created. Here, hashtags are first considered as a sequence of tokens (e.g., “#DeepLearning” as “deep learning”). Then, built upon the success of the sequence-to-sequence (seq2seq) model in language generation Sutskever et al. (2014), we present a neural seq2seq model that generates hashtags in a word-by-word manner. To the best of our knowledge, we are the first to deal with hashtag annotation in a sequence generation architecture.
In processing microblog posts, one major challenge we face is the limited features available to be encoded, mostly caused by the data sparsity exhibited in short and informal microblog posts. (For instance, the eligible length of a post on Twitter or Weibo is capped at a small number of characters.) To illustrate this challenge, Table 1 displays a sample Twitter post tagged with “#AusOpen”, referring to the Australian Open tennis tournament. Given only the short post, it is difficult to understand why it is tagged with “#AusOpen”, not to mention that neither “aus” nor “open” appears in the target post. In such a situation, how shall we generate hashtags for a post with limited words?
To address the data sparsity challenge, we exploit conversations initiated by the target posts to enrich their contexts. Our approach benefits from the fact that most messages in a conversation tend to focus on relevant topics. Content in conversations may hence provide contexts that facilitate the understanding of the original post Chang et al. (2013); Li et al. (2015). The effects of conversation contexts, useful for topic modeling Li et al. (2016, 2018) and keyphrase extraction Zhang et al. (2018), have never been explored for microblog hashtag generation. To show why conversation contexts are useful, we display in Table 1 a conversation snippet formed by some replies to the sample target post. As can be seen, key content words in the conversation (e.g., “Nadal”, “Tomic”, and “tennis”) are useful for reflecting the relevance of the target post to the hashtag “#AusOpen”, because Nadal and Tomic are both professional tennis players. Concretely, our model employs a dual encoder (i.e., two encoders), one for the target post and the other for the conversation context, to capture the representations of the two sources. Furthermore, to capture their joint effects, we employ bidirectional attention (bi-attention) Seo et al. (2016) to explore the interactions between the two encoders' outputs. Afterward, an attentive decoder is applied to generate the word sequence of the hashtag.
In experiments, we construct two large-scale datasets, one from the English platform Twitter and the other from Chinese Weibo. Experimental results on both information retrieval and text summarization metrics show that our model generates hashtags closer to human-annotated ones than all comparison models. For example, our model achieves % ROUGE-1 F1 on Weibo, compared to % for the state-of-the-art classification-based method. Further comparisons with classification-based models show that our model, under a sequence generation framework, can better produce rare and even new hashtags.
To summarize, our contributions are three-fold:
We are the first to approach microblog hashtag annotation with a sequence generation architecture.
To alleviate data sparsity, we enrich context for short target posts with their conversations and employ a bi-attention mechanism for capturing their interactions.
Our proposed model outperforms state-of-the-art models by large margins on two large-scale datasets, constructed as part of this work.
In this section, we describe our framework shown in Figure 1. There are two major modules: a dual encoder to encode both target posts and their conversations with a bi-attention to explore their interactions, and a decoder to generate hashtags.
Formally, given a target post formulated as a word sequence x^p = (x_1^p, ..., x_m^p) and its conversation context formulated as a word sequence x^c = (x_1^c, ..., x_n^c), where m and n denote the number of words in the input target post and its conversation, respectively, our goal is to output a hashtag represented by a word sequence y = (y_1, ..., y_{|y|}). For training instances tagged with multiple gold-standard hashtags, we copy the instances multiple times, each with one gold-standard hashtag, following Meng et al. (2017). All the input target posts, their conversations, and the hashtags share the same vocabulary.
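The instance-copying step described above can be sketched in a few lines; the function name and data layout here are illustrative, not taken from the released code:

```python
def expand_instances(instances):
    """Duplicate each training instance once per gold-standard hashtag,
    so every (post, conversation, hashtag) triple has a single target."""
    expanded = []
    for post, conversation, hashtags in instances:
        for tag in hashtags:
            expanded.append((post, conversation, tag))
    return expanded
```

Training then treats each copy as an independent instance with one reference hashtag, mirroring the practice of Meng et al. (2017) for keyphrase generation.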
To capture representations from both target posts and conversation contexts, we design a dual encoder, composed of a post encoder and a conversation encoder, taking the target post and the conversation context as input, respectively.
For the post encoder, we use a bidirectional gated recurrent unit (Bi-GRU) Cho et al. (2014) to encode the target post, whose embeddings are mapped into hidden states h_1^p, h_2^p, ..., h_m^p. Specifically, h_i^p is the concatenation of the forward hidden state and the backward hidden state for the i-th token:

h_i^p = [\overrightarrow{h_i^p}; \overleftarrow{h_i^p}]
Likewise, the conversation encoder converts the conversation into hidden states h_1^c, h_2^c, ..., h_n^c via another Bi-GRU. The hidden states of the two encoders share the same dimension.
To further distill useful representations from our two encoders, we employ the bi-attention to explore the interactions between the target posts and their conversations. The adoption of bi-attention is inspired by Seo et al. (2016), where the bi-attention was applied to extract query-aware contexts for machine comprehension. Our intuition is that the content concerning the key points in target posts might have their relevant words frequently appearing in their conversation contexts, and vice versa. In general, such content can reflect what the target posts focus on and hence effectively indicate what hashtags should be generated. For instance, in Table 1, names of tennis players (e.g., “Azarenka”, “Nadal”, and “Tomic”) are mentioned many times in both target posts and their conversations, which reveals why the hashtag is “#AusOpen”.
To this end, we first put a post-aware attention on the conversation encoder, with coefficients

\alpha_{ij} = \frac{\exp(f(h_i^p, h_j^c))}{\sum_{j'=1}^{n} \exp(f(h_i^p, h_{j'}^c))}

where the alignment score function f(h_i^p, h_j^c) = h_i^p W_{bi} h_j^c captures the similarity of the i-th word in the target post and the j-th word in its conversation. Here W_{bi} is a weight matrix to be learned. Then, we compute a context vector r^c conveying post-aware conversation representations, where the i-th value is defined as:

r_i^c = \sum_{j=1}^{n} \alpha_{ij} h_j^c
Analogously, a conversation-aware attention on the post encoder is used to capture the conversation-aware post representations, denoted as r^p.
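As a concrete illustration, the two attention directions can be sketched in plain Python. The bilinear score, the matrix shapes, and all variable names here are assumptions for exposition, not the exact released implementation:

```python
import math

def matvec(W, v):
    """Matrix-vector product over nested lists."""
    return [sum(W[r][c] * v[c] for c in range(len(v))) for r in range(len(W))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def bi_attention(Hp, Hc, W):
    """Bilinear bi-attention between post states Hp (m x d) and
    conversation states Hc (n x d); W is a learned d x d matrix.
    Returns post-aware conversation vectors and the reverse direction."""
    # Similarity matrix: S[i][j] = Hp[i]^T W Hc[j]
    S = [[dot(matvec(W, hc), hp) for hc in Hc] for hp in Hp]
    # Post-aware attention over the conversation (normalize each row)
    alpha = [softmax(row) for row in S]
    r_c = [[sum(a[j] * Hc[j][k] for j in range(len(Hc)))
            for k in range(len(Hc[0]))] for a in alpha]
    # Conversation-aware attention over the post (normalize each column)
    cols = [[S[i][j] for i in range(len(Hp))] for j in range(len(Hc))]
    beta = [softmax(col) for col in cols]
    r_p = [[sum(b[i] * Hp[i][k] for i in range(len(Hp)))
            for k in range(len(Hp[0]))] for b in beta]
    return r_c, r_p
```

Each returned vector is a convex combination of the other side's hidden states, so words that score highly in both directions dominate the fused representation.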
Next, to further fuse the representations distilled by the bi-attention on each encoder, we design a merge layer, a multilayer perceptron (MLP) activated by the hyperbolic tangent function:

v^p = \tanh(W^p [h^p; r^c] + b^p), \qquad v^c = \tanh(W^c [h^c; r^p] + b^c)

where W^p, W^c and b^p, b^c are trainable parameters.
Note that either v^p or v^c conveys the information from both posts and conversations, but with a different emphasis. Specifically, v^p mainly retains the contexts of posts with auxiliary information from conversations, while v^c does the opposite. Finally, the vectors v^p and v^c are concatenated and fed into the decoder for hashtag generation.
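A minimal sketch of one such merge step, assuming concatenation followed by a single tanh-activated linear map (the shapes and names are illustrative assumptions):

```python
import math

def merge_layer(h, r, W, b):
    """Merge an encoder state h with its bi-attention context r via an
    MLP with tanh activation: v = tanh(W [h; r] + b)."""
    x = h + r  # list concatenation stands in for vector concatenation
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + bias)
            for row, bias in zip(W, b)]
```

The same map, with separate weights, would be applied on the post side and on the conversation side before the two outputs are concatenated for the decoder.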
Given the representations produced by our dual encoder with bi-attention, we apply an attention-based GRU decoder to generate a word sequence y = (y_1, ..., y_{|y|}) as the hashtag. The probability of generating the hashtag conditioned on a target post and its conversation is defined as:

\Pr(y \mid x^p, x^c) = \prod_{t=1}^{|y|} \Pr(y_t \mid y_{<t}, x^p, x^c)

where y_{<t} refers to (y_1, y_2, ..., y_{t-1}).
Concretely, when generating the t-th word in the hashtag, the decoder emits a hidden state vector s_t and puts a global attention over the merged encoder outputs v_j. The attention aims to exploit indicative representations from the encoder outputs and summarizes them into a context vector c_t defined as:

c_t = \sum_{j} \mathrm{softmax}(g(s_t, v_j)) \, v_j

where g(s_t, v_j) is another alignment function that measures the similarity between s_t and v_j.
Finally, we map the current hidden state s_t of the decoder together with the context vector c_t to a word distribution over the vocabulary via:

\Pr(y_t \mid y_{<t}, x^p, x^c) = \mathrm{softmax}(W_V [s_t; c_t] + b_V)

which reflects how likely a word is to be the t-th word in the generated hashtag sequence. Here W_V and b_V are trainable weights.
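The output layer amounts to a softmax over an affine map of the concatenated decoder state and context vector; a toy sketch (weights and shapes are illustrative):

```python
import math

def word_distribution(s_t, c_t, W_v, b_v):
    """Map decoder state s_t and context vector c_t to a probability
    distribution over the vocabulary via softmax(W_v [s_t; c_t] + b_v)."""
    x = s_t + c_t  # vector concatenation
    logits = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(W_v, b_v)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```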
In training, we minimize the negative log-likelihood over the training set:

\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log \Pr(y^{(i)} \mid x^{p,(i)}, x^{c,(i)})

where N is the number of training instances and \theta denotes the set of all the learnable parameters.
In hashtag inference, based on the produced word distribution at each time step, word selection is conducted using beam search. In doing so, we generate a ranking list of output hashtags, where the top-ranked hashtags serve as our final output.
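Beam search over the decoder's word distributions can be sketched as follows, with `step_fn` standing in for one decoding step of a trained model; the interface and names are assumptions for illustration:

```python
import math

def beam_search(step_fn, vocab, beam_size=5, max_len=3, eos="</s>"):
    """Generate hashtag word sequences with beam search.
    step_fn(prefix) returns a dict {word: prob} for the next word."""
    beams = [([], 0.0)]  # (partial sequence, accumulated log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            probs = step_fn(seq)
            for word in vocab:
                p = probs.get(word, 1e-12)
                candidates.append((seq + [word], score + math.log(p)))
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq[:-1], score))  # strip end marker
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)
    finished.sort(key=lambda x: x[1], reverse=True)
    return finished[:beam_size]
```

The returned list is the ranked hashtag output; the top entries would serve as the model's final predictions.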
|Datasets| # of posts | Avg len of posts | Avg len of convs | Avg len of tags | # of tags per post|
Here we describe how we set up our experiments.
Two large-scale experiment datasets are newly collected from popular microblog platforms: an English Twitter dataset and a Chinese Weibo dataset. The Twitter dataset was built on the TREC 2011 microblog track (https://trec.nist.gov/data/tweets/). To recover the conversations, we used the Tweet Search API to fetch “in-reply-to” relations in a recursive way. The Weibo dataset was collected from January to August 2014 using the Weibo Search API, via searching messages with the trending queries (http://open.weibo.com/wiki/Trends/) as keywords. For gold-standard hashtags, we take the user-annotated hashtags appearing before or after a post as the reference. (Hashtags in the middle of a post are not considered here, as they generally act as semantic elements Zhang et al. (2016, 2018).) The statistics of our datasets are shown in Table 2. We randomly split both datasets into three subsets: %, %, and % of the data correspond to the training, development, and test sets, respectively.
To further investigate how challenging our problem is, we show some statistics of the hashtags in Table 3 and the distributions of hashtag frequency in Figure 2. In Table 3, we observe a large number of distinct hashtags in both datasets. Moreover, Figure 2 indicates that most hashtags appear only a few times. Given such a large and imbalanced hashtag space, hashtag selection from a candidate list, as many existing methods do, might not perform well. Table 3 also shows that only a small proportion of hashtags appear in their posts, their conversations, or either of the two, making it inappropriate to directly extract words from the two sources to form hashtags.
For tokenization and word segmentation, we employed the tweet preprocessing toolkit released by Baziotis et al. (2017) for Twitter, and the Jieba toolkit (https://pypi.python.org/pypi/jieba/) for Weibo. Then, for both Twitter and Weibo, we took the following further preprocessing steps. First, single-character hashtags were filtered out for not being meaningful. Second, generic tags, i.e., links, mentions (@username), and numbers, were replaced with “URL”, “MENTION”, and “DIGIT”, respectively. Third, inappropriate replies (e.g., retweet-only messages) were removed, and the remainder were chronologically ordered to form a sequence serving as the conversation context. Last, a vocabulary was maintained with the most frequent words, for Twitter and Weibo respectively.
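The token replacement steps can be sketched with simple regular expressions; the exact patterns used by the authors' pipeline may differ:

```python
import re

def preprocess(text):
    """Replace generic tokens as described: links -> URL,
    @mentions -> MENTION, numbers -> DIGIT."""
    text = re.sub(r"https?://\S+", "URL", text)
    text = re.sub(r"@\w+", "MENTION", text)
    text = re.sub(r"\d+", "DIGIT", text)
    return text

def keep_hashtag(tag):
    """Filter out single-character hashtags, which carry little meaning."""
    return len(tag) > 1
```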
|Table 4: main comparison results (paired t-test). Higher values indicate better performance.|
For experiment comparisons, we first consider a weak baseline, Random, which randomly ranks hashtags seen in the training data. Two unsupervised baselines are also considered, where words are ranked by latent topics induced with the latent Dirichlet allocation topic model (henceforth LDA), and by their TF-IDF scores (henceforth Tf-Idf). For the latter, we consider n-gram Tf-Idf. Besides, we compare with the supervised models below:
Classifier: We compare with the state-of-the-art model based on classification Gong and Zhang (2016), where hashtags are selected from candidates seen in training data. Here two versions of their classifier are considered, one only taking a target post as input (henceforth Classifier (post only)) and the other taking the concatenation of a target post and its conversation as input (henceforth Classifier (post+conv)).
Generator: A seq2seq generator (henceforth Seq2Seq) Sutskever et al. (2014) is applied to generate hashtags given a target post. We also consider its variant augmented with copy mechanism Gu et al. (2016) (henceforth Seq2Seq-copy), which has proven effective in keyphrase generation Meng et al. (2017) and also takes the post as input. The proposed seq2seq with the bi-attention to encode both the post and its conversation is denoted as Our model for simplicity.
We tune models on the development set via grid search, selecting the hyper-parameters that give the lowest objective loss. The sequence generation models are implemented on the OpenNMT framework Klein et al. (2017). The word embeddings are randomly initialized. For encoders, we employ two layers of Bi-GRU cells, and for decoders, one layer of GRU cells is used; all GRUs share the same hidden size. In learning, we use the Adam optimizer Kingma and Ba (2014). We adopt an early-stop strategy: the learning rate decreases by a fixed decay rate until either it falls below a threshold or the validation loss stops decreasing. Gradients are rescaled whenever their norm exceeds a threshold (gradient clipping); dropout is applied, and training proceeds in mini-batches. In inference, we apply beam search with a fixed beam size and a maximum hashtag sequence length.
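The gradient rescaling mentioned above is standard gradient clipping; a minimal sketch, with the threshold left as a parameter since the paper's exact setting is not recoverable here:

```python
import math

def clip_gradients(grads, max_norm):
    """Rescale a flat list of gradients when their global L2 norm
    exceeds max_norm (standard gradient clipping)."""
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        return [g * scale for g in grads]
    return list(grads)
```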
Popular information retrieval evaluation metrics, F1 scores at K (F1@K) and mean average precision (MAP) Manning et al. (2008), are reported. Different K values were tested for F1@K and showed a similar trend, so only F1@1 and F1@5 are reported. MAP scores are computed over the top-ranked outputs. Besides, as we consider a hashtag as a sequence of words, ROUGE metrics for summarization evaluation Lin (2004) are also adopted. Here, we use ROUGE F1 for the top-ranked hashtag prediction, computed by the open-source toolkit pythonrouge (https://github.com/tagucci/pythonrouge), with the Porter stemmer used for English tweets. For Weibo posts, scores are calculated at the Chinese character level, following Li et al. (2018). We report the average scores over multiple gold-standard hashtags for the ROUGE evaluation.
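For reference, F1@K and average precision over a ranked hashtag list can be computed as follows. This is a sketch of the standard definitions, not the authors' evaluation script:

```python
def f1_at_k(predicted, gold, k):
    """F1@K: compare the top-K ranked predictions against gold hashtags."""
    top_k = predicted[:k]
    hits = len(set(top_k) & set(gold))
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)

def average_precision(predicted, gold, k):
    """AP over the top-K ranked predictions; MAP is its mean over posts."""
    hits, score = 0, 0.0
    for rank, tag in enumerate(predicted[:k], start=1):
        if tag in gold:
            hits += 1
            score += hits / rank
    return score / min(len(gold), k) if gold else 0.0
```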
In this section, we first report the main comparison results in Section 4.1, followed by an in-depth comparative study between classification and sequence generation models in Section 4.2. Further discussions are then presented to analyze our superiority and errors in Section 4.3.
Table 4 reports the main comparison results. For Classifiers, outputs are ranked according to the logits after the final softmax layer. For Extractor, it is unable to produce ranked hashtags, and thus no results are reported for F1@5 and MAP. For LDA, as it cannot generate bigram hashtags, no results are presented for ROUGE-SU4. In general, we have the following observations:
Hashtag annotation is more challenging for Twitter than Weibo. Generally, all models perform worse on Twitter measured by different metrics. The intrinsic reason is the essential language difference between English and Chinese microblogs. English allows higher freedom in writing, resulting in more variety in Twitter hashtags (e.g., abbreviations are prominent like “aus” in “#AusOpen”). For statistical reasons, Twitter hashtags are more likely to be absent in either posts or conversations (Table 3), and have a more severe imbalanced distribution (Figure 2).
Topic models and extractive models are ineffective for hashtag annotation. The poor performance of all baseline models indicates that hashtag annotation is a challenging problem. LDA sometimes performs even worse than Random due to its inability to produce phrase-level hashtags. For extractive models, both Tf-Idf and Extractor fail to achieve good results. This is because most hashtags are absent from the target posts: as Table 3 shows, only % of hashtags on Twitter and % on Weibo appear in target posts. This confirms that extractive models, which rely on word selection from target posts, do not fit the hashtag annotation scenario well. For the same reason, the copy mechanism fails to bring noticeable improvements to the seq2seq generator on both datasets.
Sequence generation models outperform their counterparts. Comparing Generators with the other models, we find that the former uniformly achieve better results, showing the superiority of producing hashtags with a sequence generation framework. Classification models, though representing the state of the art, fall short because they select labels from the large and imbalanced hashtag space (reflected in Table 3 and Figure 2).
Conversations are useful for hashtag generation. Among the sequence generation models, Our model achieves the best performance across all the metrics. This observation indicates the usefulness of bi-attention in exploiting the joint effects of target posts and their conversations, which further helps identify indicative features from both sources for hashtag generation. However, interestingly, incorporating conversations fails to boost the classification performance. The reason why Our model better exploits conversations than Classifier (post+conv) might be that we can attend to indicative features when decoding each word of the hashtag, which is not possible for classification models, as they consider hashtags to be inseparable.
From Table 4, we observe that the classifiers outperform topic models and extractive models by a large margin but exhibit generally worse results than sequence generation models. Here, we present a thorough study to compare hashtag classification and generation. Four models are selected for comparison: two classifiers, Classifier (post only) and Classifier (post+conv), and two sequence generation models, Seq2Seq and Our model. Below, we explore how they perform to predict rare and new hashtags.
According to the hashtag distributions in Figure 2, a large proportion of hashtags appear only a few times in the data. To study how models perform in predicting such hashtags, Figure 3 displays their F1@1 scores in inferring hashtags with varying frequency. The lower F1 scores on less frequent hashtags indicate the difficulty of yielding rare hashtags. The reason probably comes from overfitting caused by the limited data to learn from.
We also observe that sequence generation models achieve consistently better F1@1 scores on hashtags with varying sparsity degree, while classification models suffer from the label sparsity issue and obtain worse results. The better performance of the former might result from the word-by-word generation manner in hashtag generation, which enables the internal structure of hashtags (how words form a hashtag) to be exploited.
To further explore the extreme situation where hashtags are absent in the training set, we experiment to see how models perform in handling new hashtags. To this end, we additionally collect instances tagged with hashtags absent in training data and construct an external test set, with the same size as our original test set. Considering that classifiers will never predict unseen labels, to ensure comparable performance, we only adopt summarization metrics here for evaluation and report ROUGE-1 F1 scores in Table 5.
As can be seen, creating unseen hashtags is a challenging task, and unsurprisingly, all models perform poorly on it. Nevertheless, sequence generation models perform much better on both datasets, e.g., at least 6.5x improvement over classification models on the Weibo dataset. For the Twitter dataset, the improvements are not as large, which confirms again that hashtag annotation on Twitter is more difficult due to noisier data characteristics. In particular, compared to Seq2Seq, our model achieves an additional performance gain in producing new hashtags by leveraging conversations with the bi-attention.
To further analyze our model, we conduct a quantitative ablation study, a qualitative case study, and an error analysis. We then discuss them in turn.
We report the ablation study results in Table 6 to examine the relative contributions of the target posts and the conversation contexts. To this end, our model is compared with its five variants: Seq2Seq (post only), Seq2Seq (conv only), and Seq2Seq (post+conv), which use a standard seq2seq model to generate hashtags from the target posts, the conversation contexts, and their concatenation, respectively; and Our model (post-att only) and Our model (conv-att only), whose decoders only take the post-attended and the conversation-attended representations defined in Eq. (5) and Eq. (6), respectively. The results show that solely encoding target posts is more effective than modeling the conversations alone, but exploring their joint effects can further boost the performance, especially when combined with a bi-attention mechanism over them.
|Table 6 variants: Seq2seq (post only); Seq2seq (conv only); Seq2seq (post + conv); Our model (post-att only); Our model (conv-att only); Our model (full)|
We further present a case study on the target post shown in Table 1, with the top five outputs of several comparison models displayed in Table 7. As can be seen, only our model successfully generates “aus open”, the gold standard. In particular, it not only ranks the correct answer as the top prediction, but also outputs other semantically similar hashtags, e.g., sport-related terms like “bbc football”, “arsenal”, and “murray”. On the contrary, Classifier and Seq2Seq tend to yield frequent hashtags, such as “just saying” and “jan 25”. The baseline models also perform poorly: LDA produces common single words, and Tf-Idf extracts phrases from the target post, which does not contain the gold-standard hashtag.
|Model||Top five outputs|
|LDA||found; stated; excited; card; apparently|
|TF-IDF||inappropes; umpire; woman need; azarenka woman; the umpire|
|Classifier||fail; facebook; just saying; quote; pro choice|
|Seq2seq||fail; jan 25; yr; eastenders; facebook|
|Our model||aus open; bbc football; bbc aus; arsenal; murray|
To analyze why our model obtains superior results in this case, we display the heatmap in Figure 4 to visualize our bi-attention weights. As we can see, the bi-attention identifies the indicative word “Azarenka” in the target post by highlighting its pertinent words in the conversation, e.g., “Nadal” and “tennis”. In doing so, salient words in both the post and its conversations can be unveiled, facilitating the generation of the correct hashtag “aus open”.
Taking a closer look at our outputs, we find that one major type of error comes from outputs that do not match the gold standards despite being close guesses. For example, our model predicts “super bowl” for a post tagged with “#steelers”, a team playing in the Super Bowl. In future work, semantic similarity should be considered in hashtag evaluation. Another primary type of error is caused by non-topic hashtags, such as “#fb” (indicating messages forwarded from Facebook). Such non-topic hashtags cannot reflect any content information from target posts and should be distinguished from topic hashtags in the future.
Our work mainly builds on two streams of previous work — microblog hashtag annotation and neural language generation.
Our work is in the line of microblog hashtag annotation. Some prior work extracts phrases from target posts with sequence tagging models Zhang et al. (2016, 2018). Another popular approach is to apply classifiers and select hashtags from a candidate list Heymann et al. (2008); Weston et al. (2014); Sedhai and Sun (2014); Gong and Zhang (2016); Huang et al. (2016); Zhang et al. (2017). Unlike them, we generate hashtags with a language generation framework, in which hashtags appearing in neither the target posts nor a pre-defined candidate list can be created. Topic models are also widely applied to induce topic words as hashtags Krestel et al. (2009); Ding et al. (2012); Godin et al. (2013); Gong et al. (2015); Zhang et al. (2016). However, these models are usually unable to produce phrase-level hashtags, which ours achieves by generating hashtag word sequences with a decoder.
Our work is also closely related to neural language generation, where the encoder-decoder framework Sutskever et al. (2014) acts as a springboard for many sequence generation models. In particular, we are inspired by keyphrase generation studies for scientific articles Meng et al. (2017); Ye and Wang (2018); Chen et al. (2018, 2019), which incorporate word extraction and generation using a seq2seq model with copy mechanism. However, our hashtag generation task is inherently different from theirs. As we can see from Table 4, it is suboptimal to directly apply keyphrase generation models to our data. The reason mostly lies in the informal language style of microblog users in writing both target posts and hashtags. To adapt our model to microblog data, we explore the effects of conversation contexts on hashtag generation, which has never been studied in prior work.
We have presented a novel framework of hashtag generation via jointly modeling of target posts and conversation contexts. To this end, we have proposed a neural seq2seq model with bi-attention over a dual encoder for capturing indicative representations from the two sources. Experimental results on two newly collected datasets have demonstrated that our proposed model significantly outperforms existing state-of-the-art models. Further studies have shown that our model can effectively generate rare and even unseen hashtags.
This work is supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (No. CUHK 14208815 and No. CUHK 14210717 of the General Research Fund). We thank NAACL reviewers for their insightful suggestions on various aspects of this work.
Hashtag recommendation using attention-based convolutional neural network. In International Joint Conference on Artificial Intelligence.
OpenNMT: Open-source toolkit for neural machine translation. In Association for Computational Linguistics.
Keyphrase extraction using deep recurrent neural networks on Twitter. In Empirical Methods in Natural Language Processing.