China is known for its rich internal Internet ecosystem, where Chinese alternatives to most foreign Internet services flourish. This is due not only to cultural differences that prevent foreign websites from gaining a large market share, but also to stringent government controls that sometimes prevent foreign Internet companies from selling their services, or that block access to them outright.
Sina Weibo, as China’s most popular microblogging platform, is perhaps the most visible face of China’s own internal version of the Internet. It is currently used by over users and, like its foreign counterpart Twitter, which is widely considered to be a proxy for its users’ social lives and interests [2, 3, 4, 5], it has recently started to draw the attention of researchers everywhere [6, 7, 8, 9].
Sina Weibo’s origins date back to but it wasn’t until that it rose to prominence. Since July 2009, Twitter has been blocked in China, leaving national services such as Sina Weibo as the only alternatives. In March 2012, Weibo started requiring its users to associate their profile with their true identity, while still giving users the option to display whichever screen name they wished.
Previous work on topic detection in microblogs is usually designed for pre-selected specific topics [10, 11] or only for short messages with #tags. However, the majority of Tweets are not #tagged, and there is little work on automatic topic detection for microblogs. We first propose a simple ad-hoc algorithm that identifies topics on microblogs without pre-selection and clusters the microblogs without a #tag into the detected topics. More importantly, because it does not assume that a #tag represents a unique topic, our algorithm merges microblogs with the same content but different #tags.
Past studies on Chinese microblogging platforms [7, 9] mainly focused on censorship and analyzed deletion practices on microblogs containing censored keywords. Others compared user behaviors, textual features of posts and temporal dynamics of re-posting, and a manually selected set of categorical events, on Sina Weibo and Twitter. There is little research comparing the collective attention of Chinese microbloggers on a large scale. Here, we take a first step in this direction by proposing an algorithm to model and compare Sina Weibo and Twitter. By contrasting the discussions occurring on these two platforms, we can observe two different versions of Chinese culture: people inside China ( of geotagged Weibos are within China) and those outside ( of geotagged Tweets are located outside China). Despite China’s growing global relevance, and due to the complexity of its language, the number of people outside China learning Chinese as a second language is still very small. A recent study estimates that, in, just over students in American universities were taking Chinese language classes, compared with over studying Spanish, indicating that people who Tweet in Chinese outside of China are likely either Chinese expats or of Chinese heritage. While some analyses have been performed on geographically distributed populations speaking the same language, this combination of technically equivalent services serving populations with a similar cultural background that are isolated from each other is unique. It provides us with the perfect opportunity to study the cultural differences in the virtual world between Chinese speakers inside and outside China.
In summary, we perform a topical comparison of Twitter and Sina Weibo. Our results reveal significant differences in the distribution of social attention across the two platforms, with the most popular topics on Sina Weibo relating to entertainment while the most popular topics on Twitter correspond to cultural or political content.
We use the Sina Weibo dataset from Open Weiboscope Data Access [7, 8]. The dataset contains million Weibo posts (Weibo for short) collected over the full course of . The Twitter dataset used in this study was extracted from the raw Gardenhose feed, an unbiased sample of of the entire Twitter dataset that provides a statistically significant real-time view of all Twitter account activity. To identify Tweets and Weibo in Chinese, we perform language detection using the “Chromium Compact Language Detector”. See for further details. This way, we collected million Tweets and million Weibo in both simplified and traditional Chinese covering the entire year of . The Sina Weibo dataset also includes microblogs that are no longer accessible to the public, either censored or deleted by their authors. Following , we treat as censored those weibos for which the API returns the message “permission denied”. In total, we considered deleted weibos for our study.
Clustering microblogs into topics
weibos with #tags. We build a vocabulary vector space on each microblogging platform from high-frequency words (high TF-IDF score), and cluster similar #tags into a specific topic. For instance, for the Top #tags on Sina Weibo and Twitter, we merge them into and topics respectively. We assign the remaining microblogs, which have no #tags, to the topics closest to them in the vocabulary vector space. To reduce statistical fluctuations, we restrict our study to the Top topics on each platform. In total, weibos are classified into popular topics on Weibo and of tweets discuss popular topics on Twitter. In the remaining part of this section, we briefly describe our algorithm for clustering microblogs into topics.
Preprocessing. We first filter microblogs by removing short URLs and mentions of other users (“@username”). Filtered microblogs in traditional Chinese are then converted to simplified Chinese with the python-jianfan library. Chinese word segmentation is performed using Jieba, and part-of-speech (POS) tagging is performed following . This way, each microblog is represented as a set of words tagged as noun, name, location, organization, time, place word, position word or verb.
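The filtering step can be illustrated with a minimal Python sketch. The regular expressions, the function names and the tag codes below are assumptions made for illustration (the paper does not give its exact rules), and segmentation and POS tagging are left to an external tool such as Jieba rather than reimplemented here.

```python
import re

# Assumed patterns: the paper does not specify its exact filtering rules.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@[\w\-]+")

# Assumed ICTCLAS-style tag codes for the categories listed in the text:
# noun, name, location, organization, time, place word, position word, verb.
CONTENT_TAGS = {"n", "nr", "ns", "nt", "t", "s", "f", "v"}


def strip_urls_and_mentions(text):
    """Remove URLs and @username mentions, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())


def keep_content_words(tagged_words, keep_tags=CONTENT_TAGS):
    """Keep only words whose POS tag is in keep_tags.

    tagged_words is an iterable of (word, tag) pairs, for example the
    output of a segmenter with POS tagging.
    """
    return [word for word, tag in tagged_words if tag in keep_tags]
```

Note that both patterns rely on whitespace to find the end of a URL or a username, so a mention immediately followed by Chinese text with no separator would be removed together with that text. A production filter would need platform-specific rules.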
Vector representation. We merge all microblogs with the same #tag into a single document , and calculate its TF-IDF (term frequency–inverse document frequency) weights. For each , we exclude words shorter than two characters, since a single-character word in Chinese can be noisy and under-representative, and choose the first words with highest frequency and their TF-IDF weights , and its vocabulary vector can be written as , and where is the number of #tags we select.
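As a rough illustration of this representation, the sketch below builds one sparse TF-IDF vector per #tag document. The function name, the `top_k` default and the specific weighting (relative term frequency times the natural log of inverse document frequency) are assumptions; the paper does not state which TF-IDF variant or cutoff it uses.

```python
import math
from collections import Counter


def tfidf_vectors(docs, top_k=3, min_len=2):
    """Build a sparse TF-IDF vector for each hashtag document.

    docs maps a hashtag to the list of words of all microblogs carrying it.
    Returns a dict mapping each hashtag to {word: weight}, restricted to
    its top_k most frequent words of at least min_len characters.
    """
    # Drop single-character words, which the text describes as noisy.
    filtered = {
        tag: [w for w in words if len(w) >= min_len]
        for tag, words in docs.items()
    }
    n_docs = len(filtered)

    # Document frequency: in how many hashtag documents each word occurs.
    df = Counter()
    for words in filtered.values():
        df.update(set(words))

    vectors = {}
    for tag, words in filtered.items():
        tf = Counter(words)
        total = len(words)
        vectors[tag] = {
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in tf.most_common(top_k)
        }
    return vectors
```

With this weighting, a word that occurs in every #tag document receives weight zero, so only words that discriminate between #tags contribute to the distances used later.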
Since several similar #tags likely refer to the same topic, we further cluster #tags into topics using hierarchical clustering. In Figure 1, we show the dendrogram for the Top #tags on the Sina Weibo platform, calculated using cosine distances in the embedding vector space. Interestingly, most clades are simplicifolious, indicating that the distribution of words for each #tag is substantially different from the distribution in the others. We observe a similar dendrogram for the Top #tags on Twitter (figure not shown). Thus, we apply a modified divisive clustering method (Algorithm 1), in which we iteratively divide the largest cluster into a small cluster and a large one, until the size of the small cluster is .
After merging #tags into topics, each topic in the vocabulary vector space is now defined as , and . is now the first words with highest frequency in a topic and their TF-IDF weights . The centroids of the final clusters are taken to represent topics in the vector space of each platform. To classify the remaining microblogs on a platform, we measure the cosine distance between the centroid of a topic , and each microblog , . If is smaller than a threshold , we consider the microblog to be discussing the topic , as shown in Algorithm 2.
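The classification step can be sketched as follows, assuming microblogs and topic centroids are stored as sparse `{word: weight}` dictionaries. This is not a reproduction of Algorithm 2: the function names are hypothetical, the threshold is a free parameter, and assigning each microblog only to its single nearest topic is an assumption.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two sparse {word: weight} vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # convention: an empty vector is maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def assign_topic(microblog, centroids, threshold):
    """Return the nearest topic if it is closer than threshold, else None.

    centroids maps a topic label to its sparse centroid vector.
    """
    best_topic, best_dist = None, float("inf")
    for topic, centroid in centroids.items():
        dist = cosine_distance(microblog, centroid)
        if dist < best_dist:
            best_topic, best_dist = topic, dist
    return best_topic if best_dist < threshold else None
```

For example, with centroids `{"music": {"音乐": 1.0, "歌手": 1.0}, "politics": {"政治": 1.0}}` and an illustrative threshold of 0.5, a microblog `{"音乐": 1.0}` is assigned to `"music"`, while one sharing no words with either centroid is left unassigned.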
To determine the threshold , we measure the distribution of distances between the centroid of a topic and the microblogs inside that topic. About of tweets and of weibos have distances less than to their topical centroid. Meanwhile, if we measure distances from a microblog to the centroids of other topics, on average only about of microblogs outside a topic have distances less than to its centroid. Therefore, we use as our threshold.
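The reasoning behind this choice can be made concrete with a small sketch: for each candidate threshold, compare the fraction of in-topic distances below it with the fraction of out-of-topic distances below it. The function names and the candidate values are hypothetical and are not the values used in the paper.

```python
def fraction_within(distances, threshold):
    """Fraction of distances that are strictly below threshold."""
    distances = list(distances)
    if not distances:
        return 0.0
    return sum(1 for d in distances if d < threshold) / len(distances)


def compare_thresholds(in_topic, out_of_topic, candidates):
    """Tabulate coverage for each candidate threshold.

    Returns a list of (threshold, in_topic_fraction, out_of_topic_fraction).
    A good threshold keeps the first fraction high and the second low.
    """
    return [
        (t, fraction_within(in_topic, t), fraction_within(out_of_topic, t))
        for t in candidates
    ]
```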
Results and Discussion
Our analysis aims to compare topical spaces in Chinese on different microblogging systems. With the centroids identified in the vocabulary vector space defined in the last section, we first calculate the distance between the centroids of the Top topics on the two platforms. We define as the cosine distance in the vocabulary vector space between the centroid of topic on Twitter and topic on Sina Weibo. Figure 2-A shows the cumulative distribution function of the distance for pairs of topical centroids. Surprisingly, only pairs of topics have distance less than . In Figure 2-B, we show the distance between Top topics on Weibo and Top topics on Twitter. Notably, the distance between of pairs of Top topics on the two platforms is larger than , indicating that microbloggers on each platform have significantly different conversation topics and interests.
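The cross-platform comparison described above amounts to computing one distance per pair of centroids and then summarizing those distances with an empirical cumulative distribution. A minimal sketch follows, with hypothetical function names and with the distance function passed in as a parameter (in the paper it is the cosine distance).

```python
def pairwise_distances(centroids_a, centroids_b, dist):
    """Distance between every topic on platform A and every topic on B.

    centroids_a and centroids_b map topic labels to centroid vectors;
    dist is any function taking two vectors and returning a number.
    Returns {(topic_a, topic_b): distance}.
    """
    return {
        (a, b): dist(vec_a, vec_b)
        for a, vec_a in centroids_a.items()
        for b, vec_b in centroids_b.items()
    }


def empirical_cdf(values):
    """Return sorted (x, F(x)) points, F(x) = share of values <= x.

    Tied values produce several points with the same x; the last of
    them carries the correct cumulative share.
    """
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]
```

Passing `pairwise_distances(...).values()` to `empirical_cdf` gives the points of a curve like the one described for Figure 2-A.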
In Table 1, we list the Top topics in Chinese on Sina Weibo and Twitter to illustrate the differences. On Sina Weibo, of the entire dataset can be classified into the top 10 topics ( for the Top ); while on Twitter over all tweets are categorized into the top 10 topics ( for Top ). The microblogs on Sina Weibo focus on entertainment (singers, actors and games) and advertising. In contrast, no commercial advertising appears on Twitter: the last two topics are about games, and all the others correspond to political content.
|Rank|Sina Weibo|in English|%|Twitter|in English|%|
|---|---|---|---|---|---|---|
|1|三国来了|an online game|0.51|陈光诚|||
|3|晚安/早安|good morning/night|0.38|Freetibet|Free Tibet|1.62|
|4|微博客户端|Sina Weibo app|0.36|李旺阳|||
||||0.04|抗暴|Tibetan Uprising Day|0.88|
|7|有奖转发|re-posting to win a prize|0.04|达赖喇嘛|Dalai Lama|0.68|
|10|新版微博|new version of Sina Weibo|0.02|武士朝代|an Android game|0.48|
In the previous section, we classified weibos and tweets into topics within the vocabulary vector space of their own platform. For an unclassified weibo or tweet, we calculate its distance to the centroids on both platforms and assign it to the closest topic. Interestingly, we find that only of tweets correspond to the Top topics on the Sina Weibo platform, and only of weibos discuss the most popular topics on Twitter. Chinese microbloggers speaking the same language on the two platforms thus share little social attention.
We further investigate deleted weibos that were likely censored by checking whether they belong to topics that appear on Twitter. In total, deleted weibos can be classified into the Top topics on Twitter. We re-rank the topics according to the frequency of deleted weibos. The Kendall rank correlation coefficient between the top topics for all tweets and for deleted weibos is , with p-value . In Table 2, we list the Top topics for deleted weibos in Twitter’s vector space. Compared with popular topics on Sina Weibo, the deleted weibos are significantly more likely to discuss political issues.
|Rank|Topic on Twitter|in English|
|---|---|---|
|7|HK71|Hong Kong 1 July march|
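The rank comparison between the topic ordering for all tweets and the ordering for deleted weibos uses the Kendall rank correlation coefficient. The sketch below computes the tau-a variant for two rankings of the same items without ties. Which variant the paper used is not stated, so this choice and the function name are assumptions, and the significance test is omitted.

```python
from itertools import combinations


def kendall_tau_a(rank_x, rank_y):
    """Kendall tau-a between two rankings of the same items (no ties).

    rank_x and rank_y map each item to its rank position. Returns a
    value in [-1, 1]: 1 for identical order, -1 for reversed order.
    """
    items = list(rank_x)
    n = len(items)
    if n < 2:
        raise ValueError("need at least two items to compare rankings")
    concordant = discordant = 0
    for a, b in combinations(items, 2):
        sign = (rank_x[a] - rank_x[b]) * (rank_y[a] - rank_y[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

For instance, rankings `{"a": 1, "b": 2, "c": 3}` and `{"a": 1, "b": 3, "c": 2}` agree on two of the three pairs and disagree on one, giving a coefficient of 1/3.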
The social attention of online users who share a cultural background but live in different countries might differ because of differences in their social environments. In this study, we take the first steps toward understanding such differences.
Sina Weibo is used almost exclusively within China, while Chinese-language use of Twitter occurs almost exclusively outside Chinese borders. By comparing the most popular topics on these two platforms we can, for the first time, observe how the interests of two populations with similar cultural backgrounds differ. Surprisingly, we find that there is very little overlap between the two attention profiles. Weibo users speak mostly about popular culture and games, while Twitter users focus mostly on political issues.
The reasons behind this divergence are hard to discern but can likely be attributed to one of two factors: a lack of interest in political topics within China, or a high degree of self-censorship that prevents Chinese users from discussing politics in public. A small indication in favor of the second hypothesis is the list of topics seen in deleted weibos (see Table 2), which have higher political content. It is worth remarking that our topic detection algorithm still depends on #tags, and some of these #tags may not represent a social topic but rather commercial web or mobile applications. Manual annotation could be included in future work to improve the topic detection results. Another key data point we are missing to fully clarify this question is the number of people who use foreign VPN services to reach Twitter, where the discussion is more politically centered. An analysis of this factor will be the subject of future study. The methodology proposed in this paper can easily be applied to other languages and other online conversation platforms if data are available.
Another possibility to consider when comparing user behavior across two platforms is that technical differences between them play a role. However, the similarity between the two platforms likely minimizes this effect. Indeed, it would be difficult to argue that Twitter is, on technical grounds, any more or less suitable for discussion of the topics listed on the right side of Table than Sina Weibo, or vice versa. A final possibility is simply that different "cultural norms" have emerged on the two platforms, with the Sina Weibo community naturally becoming much more focused on pop culture and entertainment and Twitter becoming more political.
- [1] Q. Gao, F. Abel, G. Houben, and Y. Yu. A comparative study of users’ microblogging behavior on Sina Weibo and Twitter. In User modeling, adaptation, and personalization, pages 88–101. Springer, 2012.
- [2] D. Bamman, J. Eisenstein, and T. Schnoebelen. Gender identity and lexical variation in social media. J. of Sociolinguistics, 18:135, 2014.
- [3] O. Phelan, K. McCarthy, and B. Smyth. Using Twitter to recommend real-time topical news. In RecSys’09, page 385, 2009.
- [4] F. Ciulla, D. Mocanu, A. Baronchelli, B. Gonçalves, N. Perra, and A. Vespignani. Beating the news using social media: the case study of American Idol. EPJ Data Science, 1:8, 2012.
- [5] P.T. Metaxas, E. Mustafaraj, and D. Gayo-Avello. How (not) to predict elections. In IEEE Third International Conference on Social Computing (SocialCom), page 165, 2011.
- [6] A. Rauchfleisch and M.S. Schäfer. Multiple public spheres of Weibo: a typology of forms and potentials of online public spheres in China. Information, Communication & Society, 18:139, 2014.
- [7] K. Fu, C. Chan, and M. Chau. Assessing censorship on microblogs in China: discriminatory keyword analysis and the real-name registration policy. Internet Computing, IEEE, 17(3):42–50, 2013.
- [8] K. Fu and M. Chau. Reality check for the Chinese microblog space: a random sampling approach. PLoS ONE, 8(3):e58356, 2013.
- [9] D. Bamman, B. O’Connor, and N. Smith. Censorship and deletion practices in Chinese social media. First Monday, 17(3), 2012.
- [10] S. Phuvipadawat and T. Murata. Breaking news detection and tracking in Twitter. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, volume 3, pages 120–123. IEEE, 2010.
- [11] G. Li, K. Meng, and J. Xie. An improved topic detection method for Chinese microblog based on incremental clustering. Journal of Software, 8(9):2313–2320, 2013.
- [12] O. Tsur, A. Littman, and A. Rappoport. Efficient clustering of short messages into general domains. In ICWSM, 2013.
- [13] X. Shuai, X. Liu, T. Xia, Y. Wu, and C. Guo. Comparing the pulses of categorical hot events in Twitter and Weibo. In Proceedings of the 25th ACM conference on Hypertext and social media, pages 126–135. ACM, 2014.
- [14] N. Furman, D. Goldberg, and N. Lusin. Enrollments in Languages Other Than English in United States Institutions of Higher Education, Fall 2009. Technical report, Modern Language Association, 2010.
- [15] B. Gonçalves and D. Sánchez. Crowdsourcing dialect characterization through Twitter. PLoS ONE, 9:e112074, 2014.
- [16] J. Ratkiewicz, M. Conover, M. Meiss, B. Gonçalves, S. Patil, A. Flammini, and F. Menczer. Truthy: mapping the spread of astroturf in microblog streams. In WWW, pages 249–252. ACM, 2011.
- [17] Guide to the Twitter API Part 3 of 3: An Overview of Twitter’s Streaming API. http://blog.gnip.com/tag/gardenhose/, 2014.
- [18] M.M. Candless. Chromium Compact Language Detector. http://code.google.com/p/chromium-compact-language-detector/, 2012.
- [19] D. Mocanu, A. Baronchelli, N. Perra, B. Gonçalves, Q. Zhang, and A. Vespignani. The Twitter of Babel: Mapping world languages through microblogging platforms. PLoS ONE, 8(4):e61981, 2013.
- [20] Jianfan. https://code.google.com/p/python-jianfan/, 2013.
- [21] Jieba: Chinese word segmentation module. https://github.com/fxsjy/jieba, 2014.
- [22] T. Zhou. Understanding online community user participation: a social influence perspective. Internet Research, 21(1):67–81, 2011.