Social media and news play an important role in driving the fluctuation of economic indicators and financial markets [1, 2, 3, 4] in a nontrivial fashion. Recently, novel financial markets have emerged in which fiat money (USD, EUR, CNY) is exchanged for cryptocurrencies and vice versa [5, 6]. As of December 2018, cryptocurrencies have a total market capitalization of $120 billion, with more than 250,000 transactions per day. In 2017, Bitcoin was ranked second on the Google Trends list of popular topics in global news. Despite a decrease of interest in cryptocurrencies in 2018 according to Google Trends, the number of daily articles related to cryptocurrencies is still notably high. It is therefore not surprising that the rapid development of cryptocurrencies has attracted increasing attention from news and social media.
The large volume of news articles about cryptocurrencies published daily can make it hard for individuals to filter out relevant information and make informed decisions in this domain. Fortunately, people share and discuss news every day in large quantities on social media platforms, e.g. on Twitter, which is the focus of this paper. Social media can therefore be a good proxy to monitor and track “important” news about cryptocurrencies. Our work is motivated by the hypothesis that high engagement with a news article on Twitter is related to the “importance” of the article.
In this paper, we introduce an online data mining system which connects news articles and the tweets discussing them. We also perform preliminary exploratory data analysis and predictive analytics using machine learning and deep learning. Overall, the contributions of this paper are as follows: (i) We build an online data mining pipeline to extract news articles from discussions on Twitter and collect the tweets associated with the articles. This paired news and tweet data is continuously updated in a cloud database. This data is a rich source for studying public interest and attention on cryptocurrency and the potential effect of social media on the market. (ii) Based on the news and associated tweets collected by the pipeline, we perform exploratory data analysis to characterize news discussion on Twitter. (iii) We apply machine learning and deep learning models to predict the popularity of news articles on Twitter. We aim to predict the number of tweets mentioning an article related to cryptocurrencies, which we consider as a measure of its “importance”.
2 Related work
Many studies have focused on the relationship between social media, news, and other information from the web and financial markets [7, 8, 3]. However, the main focus of our work is modeling and prediction of news popularity via social media. In [9], the authors link a given news article to social media utterances that implicitly reference it through a dedicated query model. Tracking and automatically connecting news articles to Twitter conversations by Twitter hashtags was studied in [10]. In [11], the authors constructed a multi-dimensional feature space derived from an article and used a conventional SVM to predict its popularity. The authors in [12] show how the class of temporal point processes (Hawkes) can be used for predicting retweet dynamics. The authors in [13] propose how to leverage knowledge base information for improving popularity prediction. Starting from the idea that only a small amount of news articles become popular, [14] focused on the subset of the most popular news to rank articles. Finally, [15] formulates article importance prediction as a classification task using SVM.
In this paper, we exploit ensemble machine learning and sequence to sequence (seq2seq) deep learning to study the predictability of crypto news popularity on Twitter in real-time mode. In contrast to others, our analysis is focused on the intraday importance prediction.
3 Data pipeline
The data pipeline consists of a real-time online system, with the following components: Twitter collection, article collection, and tweet-article matching.
The Twitter data collection was implemented using the publicly accessible Twitter streaming API with real-time filtering by a list of cryptocurrency related keywords. The streaming API provides only a random sample of all tweets.
The article data collection is obtained by scraping news from a dynamic set of gazetteer source URLs. The set of gazetteer source URLs is automatically updated by extracting the URLs from the content of downloaded tweets.
The tweet-article matching data is stored in an online document-oriented database instance that contains matchings between news articles and tweets. A matching exists if the tweet explicitly contains the URL of an article.
Before extracting features from the data, we first process the data for further usage. In a first step, we merge some of the matchings together. This is done for two reasons. First, the online database only contains incremental matchings, which have to be merged in order to provide a complete matching of the article to tweets. Second, the raw URLs of the articles can contain query strings, which often carry information unrelated to which article the URL refers to. Hence, by changing those parameters one can obtain arbitrarily many different URLs linking to the same article. Because of this, there are often multiple different article entries in the database or data file for the same article. On the other hand, there are also some websites that use the query string to distinguish between articles. Therefore, we merge the matchings of two articles if they fulfill the following three conditions:
1. The URLs of both articles share the same host as well as the same path.
2. Both articles have the same title.
3. Both articles were published at the same time.
The first condition allows merging articles whose URLs differ only in their query strings, while the last two conditions prevent the system from merging articles that are distinguished by the query string in the URL. While merging the articles, we also remove duplicate entries for the same tweet, which are sometimes present in the database.
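The merging rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's actual code: the record fields `url`, `title`, `published` and the tweet field `id` are hypothetical names, and lower-casing the host is an extra normalization that the text does not mention.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def merge_key(article):
    """Key under which matchings are merged: host, path, title, publication time.

    The query string and fragment of the URL are deliberately ignored.
    Field names are hypothetical.
    """
    parts = urlsplit(article["url"])
    return (parts.netloc.lower(), parts.path, article["title"], article["published"])


def merge_matchings(matchings):
    """Merge incremental matchings of the same article and drop duplicate tweets."""
    merged = {}
    seen_ids = defaultdict(set)
    for matching in matchings:
        key = merge_key(matching["article"])
        entry = merged.setdefault(key, {"article": matching["article"], "tweets": []})
        for tweet in matching["tweets"]:
            if tweet["id"] not in seen_ids[key]:  # skip tweets already stored
                seen_ids[key].add(tweet["id"])
                entry["tweets"].append(tweet)
    return list(merged.values())
```

Two records whose URLs differ only in tracking parameters end up under one key, whereas a site that encodes the article in the query string is still kept apart as long as title or publication time differ.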
According to the publication time in the data, some of the articles were published 2000 years ago or even in the future. These publication dates are clearly wrong.
We, therefore, remove all articles that were published outside of an acceptable time-interval.
Let us describe the data, that we are gathering and then go on to describe how we process this data. We work with three entities in our dataset: news articles, tweets, and matchings, i.e. the relations between articles and tweets. While these concepts are easy to understand intuitively, we specify their attributes here to avoid confusion.
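Table 1: Attributes of a news article.
|Attribute|Description|
|---|---|
|URL|The URL of the article.|
|title|The headline of the article.|
|publication time|The publication timestamp.|
|text|The text body of the article.|

Table 2: Attributes of a tweet.
|Attribute|Description|
|---|---|
|user|The Twitter username.|
|text|The content of the tweet.|
|publication time|The timestamp of the tweet.|
|links|The URLs in the text.|

Table 3: Attributes of a matching.
|Attribute|Description|
|---|---|
|article|An article as described in Tab. 1.|
|tweets|A list of matched tweets.|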
In Fig. 5 we observe that the number of mentions on Twitter saturates generally within the first 24 hours after publication. We, therefore, define our prediction target as the number of mentions during 24 hours after publication. We also found that there is a strong seasonality effect at weekends.
4 Predictive analytics
We aim to explore different machine learning models on the task of predicting the total number of article mentions within a defined time horizon after the article’s publication time. Let $t_s$ and $t_e$ be two timestamps, measured relative to the publication time, with $t_s < t_e$, where we set $t_e$ to 24 hours in this paper. The task of our machine learning models is as follows: Given an article published at time $t_0$ and all tweets published between $t_0$ and $t_0 + t_s$ that mention the article, predict the cumulative number of tweets mentioning the article between time $t_0$ and $t_0 + t_e$. Here $t_s$ is the prediction starting time, which determines how much historical data can be used for the prediction.
4.1 Feature Extraction
For our predictive models we use three sets of features, which we name time series, content and context features.
Time series features
Let $t_0$ be the publication time of an article and $t$ the current time. Let $\Delta t$ be a timestep size (set to 1 hour in our experiments). The $i$-th time series feature $x_i$ is then given by the number of mentions of the article between $t_0$ and $t_0 + i\,\Delta t$, where $t_0 + i\,\Delta t \leq t$. As an example, suppose that $t$ is 3 hours after $t_0$ and the article is mentioned twice, once and three times in hours 1, 2 and 3 since publication, respectively. Then there are 3 time series features: $x_1 = 2$, $x_2 = 3$, $x_3 = 6$. Note that the number of these features is not constant, but depends on the prediction start time $t_s$ as defined in Section 4.
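A minimal sketch of how such features could be computed from raw tweet timestamps is given below. It assumes that the features are cumulative counts since publication, which is what extrapolating them to a total number of mentions suggests; the function and argument names are hypothetical.

```python
from datetime import datetime, timedelta


def time_series_features(published, tweet_times, now, step=timedelta(hours=1)):
    """Cumulative mention counts at the end of each completed timestep.

    Feature i (1-based) counts the tweets posted in
    [published, published + i * step). Only timesteps that have fully
    elapsed at `now` produce a feature.
    """
    n_steps = int((now - published) / step)
    features = []
    for i in range(1, n_steps + 1):
        end = published + i * step
        features.append(sum(1 for t in tweet_times if published <= t < end))
    return features


t0 = datetime(2018, 12, 2, 0, 0)
tweets = [t0 + timedelta(minutes=m) for m in (10, 50, 90, 125, 140, 179)]
print(time_series_features(t0, tweets, now=t0 + timedelta(hours=3)))
```

With two tweets in the first hour, one in the second and three in the third, the sketch returns one feature per elapsed hour.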
Content features
We extract a vector of content features from each article by using a keyword list, which allows the models to learn individual dynamics for articles related to different cryptocurrencies. Each cryptocurrency related concept is represented by a binary feature that is set to 1 if one of the keywords related to the concept is present in the title of the article.
Context features
The number of Twitter mentions might further be related to the publisher of an article. A small number of publishers appear to receive significantly more mentions on Twitter than the others. We count the total number of mentions of each publisher in our training set. We then extract the 10 publishers with the highest numbers of mentions. For each of these publishers, we introduce a binary feature set to 1 if the article was published by the respective publisher.
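The two binary feature sets can be illustrated as follows. The keyword list, the whole-word matching on the title and the record fields `publisher` and `mentions` are assumptions made for this sketch; the keyword list actually used is not reproduced in the text.

```python
import re
from collections import Counter

# Hypothetical keyword list, for illustration only.
CONCEPT_KEYWORDS = {
    "bitcoin": ("bitcoin", "btc"),
    "ethereum": ("ethereum", "eth"),
    "ripple": ("ripple", "xrp"),
}


def content_features(title, concepts=CONCEPT_KEYWORDS):
    """One binary feature per concept: 1 if any of its keywords is a word of the title."""
    words = set(re.findall(r"[a-z0-9]+", title.lower()))
    return [int(any(k in words for k in keywords)) for keywords in concepts.values()]


def top_publishers(train_articles, k=10):
    """The k publishers with the highest total number of mentions in the training set."""
    totals = Counter()
    for article in train_articles:
        totals[article["publisher"]] += article["mentions"]
    return [publisher for publisher, _ in totals.most_common(k)]


def context_features(publisher, top):
    """One binary feature per top publisher: 1 if the article comes from that publisher."""
    return [int(publisher == p) for p in top]
```

The list of top publishers is computed once on the training set and then reused unchanged at prediction time, so that the feature vector keeps a fixed length and meaning.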
4.2 Predictive Models
4.2.1 Baseline model
As a baseline model, we use a linear extrapolation of the last $k$ time series features by fitting a linear function of the time step to the dataset given by $\{(i, x_i) : i = n-k+1, \dots, n\}$. Here $n$ is the number of time series features available for the article. To predict the total number of mentions at the target time, we evaluate the model for the time step $m$ such that $t_0 + m\,\Delta t$ is the target time, where $t_0$ is the publication time and $\Delta t$ the timestep size. The model ignores content and context features. In our experiments, we will choose $k = 3$.
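A small self-contained sketch of such a baseline is shown below. It fits an ordinary least-squares line through the last few (time step, value) pairs and evaluates it at the target step; the function name and the fallback for a single available point are choices made for this illustration.

```python
def linear_extrapolation(features, target_step, k=3):
    """Fit a least-squares line to the last k points (i, x_i) and evaluate it at target_step.

    `features` holds x_1, ..., x_n (so features[0] is x_1).
    """
    if not features:
        raise ValueError("at least one time series feature is required")
    n = len(features)
    steps = list(range(max(1, n - k + 1), n + 1))
    values = features[-len(steps):]
    if len(steps) == 1:
        return float(values[0])  # no slope can be estimated from one point
    mean_i = sum(steps) / len(steps)
    mean_x = sum(values) / len(values)
    slope = sum((i - mean_i) * (x - mean_x) for i, x in zip(steps, values))
    slope /= sum((i - mean_i) ** 2 for i in steps)
    intercept = mean_x - slope * mean_i
    return slope * target_step + intercept


# Five hourly features, prediction for step 24.
print(linear_extrapolation([2, 3, 6, 8, 10], target_step=24))
```

Once the series has flattened out, the fitted slope is close to zero and the prediction is essentially the last observed value.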
4.2.2 Autoregressive models
A common type of time series model is an autoregressive model [16]. An autoregressive model of order $p$ predicts the value $x_{i+1}$ at the next timestep based on the values $x_{i-p+1}, \dots, x_i$ observed at the previous $p$ timesteps. In our experiments, we provide the timestep index $i$ as an additional input to the model. The idea is that the dynamics can be very different a few hours after the publication and shortly before the end of the prediction window. In some experiments, we will further provide content and/or context features to the model. In our case, we have to predict multiple steps into the future. This is achieved by first predicting a single timestep. We then assume that the predicted value is the correct value and use it as an input for the prediction of the next timestep. By recursively applying this strategy, we can predict an arbitrary number of timesteps ahead.
For autoregressive models of order $p$, we generate multiple training samples from each time series $x_1, \dots, x_n$. The first sample uses $x_1, \dots, x_p$ to predict $x_{p+1}$. The second sample predicts $x_{p+2}$ from $x_2, \dots, x_{p+1}$, and so on.
We use two different kinds of autoregressive models. The first one uses a linear model to predict the next timestep and the second one uses the random forest regressor.
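The sample generation and the recursive multi-step strategy can be sketched independently of the regressor. In the snippet below, `predict_next` stands for any fitted one-step model (linear or random forest); passing the step index as the last input is an assumption about the additional input mentioned above, and all names are hypothetical.

```python
def training_samples(series, p):
    """Sliding-window samples for an order-p autoregressive model.

    Each sample is (inputs, target), where inputs are p consecutive values
    followed by the 1-based index of the last of them, and target is the
    value that immediately follows.
    """
    samples = []
    for j in range(len(series) - p):
        window = list(series[j:j + p])
        samples.append((window + [j + p], series[j + p]))
    return samples


def recursive_forecast(predict_next, history, p, target_step):
    """Roll a one-step model forward, feeding each prediction back in as an input.

    `history` must contain at least p observed values.
    """
    values = list(history)
    while len(values) < target_step:
        values.append(predict_next(values[-p:] + [len(values)]))
    return values[target_step - 1]


# Toy one-step model: last observed value plus one mention.
print(training_samples([2, 3, 6, 8], p=2))
print(recursive_forecast(lambda inputs: inputs[-2] + 1, [2, 3, 6], p=2, target_step=6))
```

Because every predicted value is reused as an input, errors can accumulate over the horizon, which is one reason why later prediction start times are easier.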
4.2.3 Random forest regressor
A random forest is an ensemble of decision trees. The total response of the random forest model is the average prediction of all decision trees. In order to increase the variety of the individual decision trees, each tree is trained on a bootstrapped sample from the original dataset and uses only random subsets of the features for each decision. For more details about random forests see [17] or the documentation of the scikit-learn implementation that we use for our experiments (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor).
4.2.4 Sequence-to-sequence model
As described in [18], a sequence-to-sequence model consists of two recurrent neural networks (RNN). The first RNN is called the encoder. This recurrent network receives as inputs all available time series features. The outputs of this model are discarded. The second RNN is called the decoder. Its initial hidden state is given by the final hidden state of the encoder. The first input to the decoder is the last time series feature. In our architecture, the output of the decoder at each timestep serves as input for a fully connected network with one hidden layer that outputs the predicted value for the next timestep. If context or content features are used, those features serve as additional inputs to the fully connected network. The predicted value is then used as the input at the next timestep. As a loss function, we use the sum of squared errors between the predicted values and the real values. Let $x_1, \dots, x_n$ be the input time series and $x_{n+1}, \dots, x_m$ the real continuation. Let $\hat{x}_{n+1}, \dots, \hat{x}_m$ be the predicted values. The loss for this time series is then given by

$$
L = \sum_{i=n+1}^{m} \left( \hat{x}_i - x_i \right)^2 .
$$
Gated recurrent units
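Our sequence-to-sequence model implementation is based on the gated recurrent unit (GRU), a variant of recurrent neural networks (RNN). The following definition of a GRU is taken from [19]. Let $\sigma$ denote the sigmoid function, $\tanh$ the hyperbolic tangent and $\odot$ the element-wise matrix multiplication operator. Let $x_1, \dots, x_T$ be a sequence of inputs (e.g. a time series) and $h_0$ the so-called initial hidden state, usually set to 0. The RNN using GRUs then outputs a sequence $h_1, \dots, h_T$ defined by

$$
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1}\right), \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1}\right), \\
\tilde{h}_t &= \tanh\left(W x_t + U \left(r_t \odot h_{t-1}\right)\right), \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,
\end{aligned}
$$

where $W_z, U_z, W_r, U_r, W, U$ are model parameters learned during training. We use the TensorFlow GRU implementation (https://www.tensorflow.org/api_docs/python/tf/nn/rnn_cell/GRUCell).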
4.3 Evaluation set-up
In this paper, we evaluate our real-time model on 23535 articles published between 2018-12-02 00:00:00 and 2018-12-09 00:00:00 and all tweets mentioning those articles. The corresponding training set contains all articles published before 2018-12-02 00:00:00 with the related tweets. In total, the training set consists of 125248 articles from roughly 35 days.
Because popular articles are rare, we balance the training set. Let $N_{\mathrm{top}} = 0.05\,N$, where $N$ is the total number of articles in the training set. We first sort the training set in descending order by the number of mentions and then keep the top $N_{\mathrm{top}}$ articles with most mentions. We also draw $N_{\mathrm{top}}$ random samples from each of the sets of articles lying between the 75th and 95th percentile and below the 75th percentile. We hence obtain a training set with an equal number of high importance (above 95th percentile), medium importance (between 75th and 95th) and low importance (below 75th) articles.
For the baseline model, we choose a linear extrapolation of the most recent 3 time series features. The linear autoregressive model is evaluated for orders 1, 3 and 5. The random forest autoregressive models are all of order 3 with 50 and 500 estimators. The sequence-to-sequence model has a hidden state size of 200 for the encoder and the decoder. The hidden dense layer consists of 200 units. During training, we drop out 30 percent of the inputs to the dense layer as well as 10 percent of the hidden state passed to the next step of the RNN. The network is trained in an end-to-end fashion, using back-propagation with training batches of size 64 and the Adam optimizer. The model is trained for 30 epochs with a learning rate of 0.001, then for another 30 epochs with a learning rate of 0.0001 and finally for yet another 30 epochs with a learning rate of 0.000001.
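The balancing step can be sketched as follows, assuming the percentile boundaries are taken by rank in the sorted list (so ties at a boundary are split arbitrarily) and that each article is represented by a hypothetical `(article_id, mentions)` pair.

```python
import random


def balance_training_set(articles, seed=0):
    """Return equally many high-, medium- and low-importance articles.

    high:   top 5 percent by number of mentions (all of them are kept)
    medium: random sample from the articles ranked between the top 5 and top 25 percent
    low:    random sample from the remaining 75 percent
    """
    rng = random.Random(seed)
    ranked = sorted(articles, key=lambda a: a[1], reverse=True)
    n_top = len(ranked) // 20      # size of the top 5 percent
    n_upper = len(ranked) // 4     # size of the top 25 percent
    high = ranked[:n_top]
    medium = rng.sample(ranked[n_top:n_upper], n_top)
    low = rng.sample(ranked[n_upper:], n_top)
    return high + medium + low
```

Sampling without replacement keeps each article at most once, so the balanced set contains three groups of identical size.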
The baseline model is only provided the time series features. The autoregressive models and the sequence-to-sequence model are trained using the time series, content and context features.
4.3.3 Evaluation metrics
The goal of our prediction is to extract the most relevant cryptocurrency related articles. Most articles accumulate very few mentions on Twitter. Those articles would heavily influence the performance scores because they represent the vast majority of articles in the validation set. However, the performance on those articles does not represent the performance with respect to our goal. For this reason, our evaluation focuses on the top k articles with most mentions on Twitter.
There are two different properties of the predictions that can be used to assess performance: the accuracy of the predicted number of mentions, and the quality of the induced ranking of articles. We use mean absolute percentage error to measure the former property and normalized discounted cumulative gain to measure the latter.
Mean absolute percentage error (MAPE)
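The MAPE is used to evaluate the quality of the predicted value. It computes by how many percent the predicted value deviates from the actual value on average. The reason for using percentage errors instead of absolute errors lies in the great difference between numbers of mentions of articles. A metric based on absolute errors would most likely be dominated by the very few articles with significantly more mentions. The MAPE is defined as follows

$$
\mathrm{MAPE} = \frac{100\%}{k} \sum_{j=1}^{k} \left| \frac{\hat{y}_j(t_e) - y_j(t_e)}{y_j(t_e)} \right| ,
$$

where $y_j(t_e)$ and $\hat{y}_j(t_e)$ are the observed and predicted numbers of mentions of article $j$, $k$ is the number of evaluated articles, and $t_e$ denotes the 24 hour window from the publication date.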
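As a hedged illustration, the metric restricted to the most mentioned articles could be computed as below. Selecting the top k articles by their observed number of mentions is an assumption of this sketch, and every selected article is assumed to have at least one mention so that the ratio is defined.

```python
def mape_top_k(observed, predicted, k):
    """MAPE in percent over the k articles with the most observed mentions."""
    pairs = sorted(zip(observed, predicted), key=lambda pair: pair[0], reverse=True)[:k]
    return 100.0 * sum(abs(pred - obs) / obs for obs, pred in pairs) / len(pairs)


# Errors of 10 % and 20 % on the two most mentioned articles.
print(mape_top_k([100, 50, 10, 1], [90, 60, 10, 5], k=2))
```

The two rarely mentioned articles do not enter the score, although the relative error on the last one is very large.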
Normalized discounted cumulative gain (NDCG)
The second interesting property of the predictions is the ordering of the articles induced by the predicted values. We could obtain the most important articles from a model that achieves poor performance with respect to the MAPE but yields a good approximation of the ordering of the news. The discounted cumulative gain (DCG) is high if the top k predicted articles achieve a high number of mentions. To compute the DCG, the articles are first ordered by their predicted importance such that $\hat{y}_1 \geq \hat{y}_2 \geq \dots$. Let $y_1, \dots, y_k$ be the observed importance values of the first k articles from this ordered set. $\mathrm{DCG}_k$ is then defined as

$$
\mathrm{DCG}_k = \sum_{i=1}^{k} \frac{y_i}{\log_2(i + 1)} .
$$

$\mathrm{IDCG}_k$ is now defined as the maximal achievable $\mathrm{DCG}_k$, which is computed as the $\mathrm{DCG}_k$ based on an ordering according to the observed instead of the predicted values. The $\mathrm{NDCG}_k$ can be computed as

$$
\mathrm{NDCG}_k = \frac{\mathrm{DCG}_k}{\mathrm{IDCG}_k} .
$$

Hence the maximal achievable $\mathrm{NDCG}_k$ is 1.
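The ranking metric can be sketched in plain Python as follows. The snippet assumes a linear gain (the observed number of mentions itself) with a base-2 logarithmic discount; other gain functions exist, and the one used in the experiments may differ.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of gains given in ranked order."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg_at_k(observed, predicted, k):
    """DCG of the top k articles ranked by prediction, divided by the ideal DCG."""
    ranked = sorted(zip(observed, predicted), key=lambda pair: pair[1], reverse=True)
    by_prediction = [obs for obs, _ in ranked]
    ideal_dcg = dcg(sorted(observed, reverse=True)[:k])
    return dcg(by_prediction[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Only the order of the predicted values matters here: a model that systematically underestimates every article by half still reaches a score of 1.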
We will discuss the overall results of our models and compare their performance in the prediction of the number of mentions and the order of the articles.
We vary the prediction start time to be 5, 10, 15 or 20 hours after publication time while keeping the target prediction time fixed at 24 hours after the publication. For instance, for a start time of 5 hours, the model gets five data points as input, describing the mentions in the first 5 hours. It then predicts the number of mentions after another 19 hours. Similarly, for a start time of 15 hours, the model gets 15 points as input and has to predict 9 hours into the future.
4.4.1 Overall performance
In Fig. 6 we show the experimental results of the baseline model, autoregressive model, random forest autoregressive model and sequence-to-sequence model on the test dataset.
As expected, we see that all models make better predictions, the closer the prediction start time is to the target time. After 15 and 20 hours from the publication time, the baseline model already achieves very good performance with MAPE of and less. As we have seen, the number of mentions in most news articles starts to saturate after hours. Because of this, the linear extrapolation that is performed by the baseline model can be quite accurate at later prediction points. At prediction start times of 15 and 20 hours, the random forest (RF) and sequence-to-sequence (S2S) model achieve a slightly lower MAPE than the baseline.
However, we are more interested in the early prediction start points. Ideally, we want to make an accurate prediction about the popularity of an article as soon as possible after its publication. At prediction start time 5 hours after publication advanced models achieve a significantly lower MAPE than the baseline. RF and S2S model achieve a MAPE around , while the linear model is at about and the baseline at . For predictions starting 10 hours after publication, the baseline and the linear model improve significantly over their performance 5 hours after publication. However, RF and S2S model still achieve a significantly lower MAPE.
Overall, we can say that the RF and S2S models are able to achieve significantly lower error rates close to the publication time of an article. All models are comparable about 20 hours after publication. The RF model achieves the lowest MAPE overall.
It is instructive to look at the NDCG as well. Here we do not see any model being significantly better than the baseline model, which is mainly due to the fact that the baseline model is already very good at predicting the final order of the news articles shortly after publication. It achieves an NDCG of around 0.9 only 5 hours after publication. The S2S model seems to perform worse than the other two models in predicting the order of the articles. This is likely because it has many degrees of freedom in its parameter space; a rank-based regularization might help. However, additional tuning of the objective function of the S2S model is left for future work.
Predicting a rough order of popular news articles seems to be possible by just linearly extrapolating the number of mentions in the first few hours. Improving over that is hard due to the high uncertainty of the predictions of the number of mentions after 24 hours. Looking at the performance of the linear model and the RF, it seems to be feasible to improve over the baseline, but the uncertainty of the predictions remains high.
4.4.2 Effect of order
For all autoregressive models, we have to choose an order. A higher order increases computational cost but potentially also the prediction quality. To measure the effect on the prediction quality, we evaluate the linear autoregressive model on different orders 1, 3 and 5.
The results are shown in Fig. 7. We find that for each prediction start time the MAPE of the order 3 model is significantly lower than the MAPE of the order 1 model. However, increasing the order of the model to 5 does not seem to give a significant error reduction. The NDCG is not significantly different between the different choices of the order.
In view of these results, we choose an order of 3 for all autoregressive models from here on. While we might be able to gain slightly better predictions by choosing an order of more than 3, choosing 3 seems to be a good compromise between model performance and computational cost.
Random forest models provide a number of hyper-parameters such as the number of estimators, the depth of the trees or the size of the leaf nodes. To assess the effect of the number of estimators, we compare a random forest autoregressor with 50 estimators to one with 500 estimators. The results, depicted in Fig. 8, show no significant difference in MAPE or NDCG between the two models. Choosing 500 instead of 50 estimators somewhat decreases the variance of the NDCG. For the random forest models in our other experiments, we therefore use 500 estimators, which is still computationally manageable. Experimenting with other hyperparameters is out of the scope of this work and is left for future work.
4.4.3 Uncertainty prediction
In addition to achieving the best model performance in our experiments, the random forest model also gives us a natural way to quantify the prediction uncertainty. Instead of just calculating the mean of the ensemble predictions, we can calculate percentiles of the predictions to get prediction intervals with coverage. This is shown in two example time series in Fig. 9.
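The idea can be sketched without any library, given the list of predictions made by the individual trees for one article. The 5th and 95th percentiles below are illustrative bounds chosen for this example, not necessarily the interval used in the experiments. With scikit-learn, the per-tree predictions could be collected from the fitted trees in the `estimators_` attribute of a `RandomForestRegressor`.

```python
def percentile(values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of a non-empty list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def prediction_interval(tree_predictions, lower_q=5.0, upper_q=95.0):
    """Return (lower bound, ensemble mean, upper bound) for one article."""
    mean = sum(tree_predictions) / len(tree_predictions)
    return (
        percentile(tree_predictions, lower_q),
        mean,
        percentile(tree_predictions, upper_q),
    )
```

A wide interval signals that the trees disagree about an article, which is useful information when ranking articles by predicted importance.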
4.5 Model Deployment
We use the trained autoregressive model to do online predictions on real-time data. The data extraction server (deployed on Google Cloud) constantly retrieves new tweets and articles and finds the tweet-article matchings, which are saved as a new batch into an online database (MongoDB deployed on Amazon Web Services). We generate the dataset for prediction by querying the database for articles published in the last 24 hours. The queried data is then pre-processed in order to extract time series, content, and context features, as described earlier.
Then, we predict the importance values of new articles at 24 hours after the article’s publication time, using the previously trained model. The prediction is performed every 10 minutes.
The current predicted importance values are visualized in an interactive webpage (http://cryptodatathon.com/ranknews). The web application is developed using the Python-based Flask web development framework (http://flask.pocoo.org/).
In this paper, we introduce an online data mining system relating cryptocurrency news to the tweets discussing them. This data pipeline paves the way for monitoring cryptocurrency news of public interest, identifying and predicting popular news, and tracking public opinions towards cryptocurrencies.
Data exploration on the collected paired news articles and tweets characterized top publishers, top cryptocurrencies discussed on Twitter as well as the lifespan of these news discussions. We also perform preliminary predictive analytics using machine learning and deep learning models. This work is a first step towards providing a prediction system that detects articles that are going to become popular shortly after they are published.
Our current system still needs to observe a few hours of data before making a prediction. For future work, the goal would be to make more accurate predictions within the first hour after publication. This may be possible by exploring different representations of the article content with more advanced NLP models.
N.A.-F. and T.G. are grateful for financial support from the EU Horizon 2020 project SoBigData under grant agreement No. 654024. J.B., R.H. and D.L. are grateful for the support of Professor A. Krause in the Data Science Lab 2018 course at ETH Zurich.
[1] S. Chakraborty, A. Venkataraman, S. Jagabathula, and L. Subramanian, “Predicting socio-economic indicators using news events,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1455–1464, ACM, 2016.
[2] Z. Hu, W. Liu, J. Bian, X. Liu, and T.-Y. Liu, “Listening to chaotic whispers: A deep learning framework for news-oriented stock trend prediction,” in Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 261–269, ACM, 2018.
[3] M. Piškorec, N. Antulov-Fantulin, P. K. Novak, I. Mozetič, M. Grčar, I. Vodenska, and T. Šmuc, “Cohesiveness in financial news and its relation to market volatility,” Scientific Reports, vol. 4, no. 1, 2014.
[4] W. Zhang and S. Skiena, “Trading strategies to exploit blog and news sentiment,” in Fourth Int. Conf. on Weblogs and Social Media (ICWSM), 2010.
[5] T. Guo, A. Bifet, and N. Antulov-Fantulin, “Bitcoin volatility forecasting with a glimpse into buy and sell orders,” in 2018 IEEE International Conference on Data Mining (ICDM), IEEE, 2018.
[6] N. Antulov-Fantulin, D. Tolic, M. Piskorec, Z. Ce, and I. Vodenska, “Inferring short-term volatility indicators from the bitcoin blockchain,” in Complex Networks and Their Applications VII (L. M. Aiello, C. Cherifi, H. Cherifi, R. Lambiotte, P. Lió, and L. M. Rocha, eds.), (Cham), pp. 508–520, Springer International Publishing, 2019.
[7] H. Chen, P. De, Y. J. Hu, and B.-H. Hwang, “Customers as advisors: The role of social media in financial markets,” SSRN Electronic Journal, 2012.
[8] T. G. Andersen, T. Bollerslev, F. X. Diebold, and C. Vega, “Real-time price discovery in global stock, bond and foreign exchange markets,” Journal of International Economics, vol. 73, no. 2, pp. 251–277, 2007.
[9] M. Tsagkias, M. De Rijke, and W. Weerkamp, “Linking online news and social media,” in Proceedings of the fourth ACM international conference on Web search and data mining, pp. 565–574, ACM, 2011.
[10] B. Shi, G. Ifrim, and N. Hurley, “Insight4news: Connecting news to relevant social conversations,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 473–476, Springer, 2014.
[11] R. Bandari, S. Asur, and B. A. Huberman, “The pulse of news in social media: Forecasting popularity,” ICWSM, vol. 12, pp. 26–33, 2012.
[12] R. Kobayashi and R. Lambiotte, “Tideh: Time-dependent hawkes process for predicting retweet dynamics,” in Tenth International AAAI Conference on Web and Social Media, 2016.
[13] H. Dou, W. X. Zhao, Y. Zhao, D. Dong, J.-R. Wen, and E. Y. Chang, “Predicting the popularity of online content with knowledge-enhanced neural networks,” in ACM KDD, 2018.
[14] N. Moniz, L. Torgo, and F. Rodrigues, “Resampling approaches to improve news importance prediction,” in International Symposium on Intelligent Data Analysis, 2014.
[15] I. Arapakis, B. B. Cambazoglu, and M. Lalmas, “On the feasibility of predicting news popularity at cold start,” in International Conference on Social Informatics, 2014.
[16] R. H. Shumway and D. S. Stoffer, Time Series Analysis and Its Applications. Springer International Publishing, 2017.
[17] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
[18] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014.
[19] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv e-prints, vol. abs/1412.3555, 2014.