In the era of information explosion, it is particularly urgent to find valuable information in the ever-growing volume of immediately available content. Attention has become a major limiting factor in the consumption of information. The attention economy relies on a competitive process through which a few items become popular while most are forgotten over time. Popularity prediction has important applications in a wide range of domains, such as decision making concerning recruitment and funding in the scientific community, public opinion monitoring in online social networks, and so on. However, it is arguably very difficult to predict the dynamical popularity of individual items within a complex evolving system. One reason is that the dynamical processes governing individual items appear too noisy to be amenable to quantification [Wang and Barabási2013].
Early research focused on reproducing certain statistical quantities over an aggregation of items [Crane and Sornette2008, Ratkiewicz et al.2010]. These models have been successful in understanding the underlying mechanisms of popularity dynamics. Yet, as they do not provide a way to extract item-specific parameters, they lack predictive power for the popularity dynamics of individual items. In the past several years, researchers began to analyze and model the popularity dynamics of individual items [Matsubara et al.2012, Wang and Barabási2013]. The existing models fall into two main paradigms. One uses networks to model the popularity dynamics and utilizes graph mining techniques to solve the prediction problem [Mcgovern et al.2003, Yu et al.2012, Pobiedina and Ichise2016]. The other prevalent line of research formulates popularity over time as a time series, making predictions by exploiting temporal correlations [Szabo and Huberman2010], by regression [Yan et al.2011], or by fitting the time series with certain classes of stochastic process [Matsubara et al.2012, Bao et al.2013], including counting processes [Vu et al.2011], point processes [Xiao et al.2016], and specific Poisson processes [Shen et al.2014].
The second, time-series, paradigm has gained wide attention in the research community. The Reinforced Poisson Process (RPP) [Shen et al.2014] models stochastic popularity dynamics within a probabilistic framework. RPP with a self-excited Hawkes process [Bao et al.2015, Xiao et al.2016] considers the aging effect and the triggering role of recent citations when predicting the citation count of an individual paper over time. Furthermore, the influence-based self-excited Hawkes process [Bao2016] takes into account the user-specific triggering effect of each forwarding based on endogenous social influence in the microblogging network. However, a major limitation of the parametric forms of these processes is their specialized and restricted expressive capability for arbitrarily distributed data, which tends to be oversimplified or even infeasible for capturing the problem complexity in real applications [Xiao et al.2017].
Recently, Deep Neural Network (DNN) based models, such as the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), have received great attention from both academia and industry. The RNN has been proven to perform extremely well on temporal data series [Sutskever, Vinyals, and Le2014]. The loops in an RNN allow information to persist. However, due to the vanishing gradient problem, the plain RNN fails to handle temporal contingencies in the input/output sequences that span long intervals [Bengio, Simard, and Frasconi1994]. Long short-term memory (LSTM) is proven to be capable of learning long-term dependencies, so an RNN with LSTM units is suitable for handling long-term temporal data series.
Recent studies in the area of popularity dynamics combine deep learning with the point process or Hawkes process. DeepHawkes [Cao et al.2017] leverages end-to-end deep learning to provide an interpretable analogy to the three key factors captured in the Hawkes process. RNNs with LSTM units have been used to model the intensity function of a point process without a specific parametric form [Xiao et al.2017]; this approach models various characteristics of real event data under the assumption of a point process. Without assuming any specific type of generative process, the latest research uses adversarial learning of a neural network process for popularity prediction [Xiao et al.2018].
In the existing studies, all items are treated equivalently in popularity dynamics prediction. However, the popularity difference between items is very large: a few items become popular while most are forgotten over time. Fig. 1 illustrates the citation distribution (the number of papers vs. citation counts) of about two million papers in AMiner [Tang et al.2008]. It is natural to find that not all publications attract equal attention in academia. Nature reports that a few research papers accumulate the vast majority of citations, while most other papers attract only a few [Barabási, Song, and Wang2012]. Obviously, papers with large citation counts need to be given special emphasis in popularity dynamics prediction.
In this paper, we propose a deep learning attention mechanism to model and predict individual-level popularity dynamics. Due to the effectiveness of the RNN with LSTM units, it is used in the proposed model to quantify the hidden mechanisms of the given time series and capture the long-term mechanism of popularity dynamics. It is worth noting that we analyze the interpretability of the model with the four key phenomena confirmed independently in previous studies of long-term popularity dynamics quantification: (1) the intrinsic quality, characterizing the inherent competitiveness of an item against others; (2) the aging effect, capturing the fact that each item’s novelty fades eventually over time; (3) the recency effect, corresponding to the phenomenon that novel items tend to attract more attention; (4) the Matthew effect, documenting the well-known “rich-get-richer” phenomenon. To give emphasis to the history data of highly cited papers, we design a deep learning attention mechanism based on the RNN.
Taking the citation system as an exemplary case, we demonstrate the effectiveness of the proposed prediction model using a dataset peculiar in its longitudinality, spanning many years. Experimental results show that the proposed deep learning attention model consistently outperforms the existing models. The main contributions of this paper are two-fold: (1) we design the deep learning attention model to give emphasis to items with high popularity; (2) we analyze the interpretability of the proposed model with respect to the four key phenomena of long-term popularity dynamics.
The popularity dynamics of an individual item $i$ during a time period $[0, T]$ is characterized by a time-stamped sequence $\{n_i^t\}$, where $n_i^t$ represents the attention received by item $i$ at time $t$. Given the historical citations, the goal is to model the popularity dynamics and predict popularity at any given time.
Popularity. The popularity $n_i^t$ of an item $i$ at time $t$ is defined as the number of attentions received by the item at time $t$; $n_i^t$ is an integer greater than or equal to zero.
The underlying assumption here is that we are concerned with accumulated attentions. Although the aging effect exists in long-term popularity dynamics evaluation, the accumulated attentions make it possible to quantify popularity for different items at different times. Without loss of generality, we take the initial popularity to be $n_i^0 = 0$.
Popularity dynamics. The popularity dynamics of an individual item $i$ can be formalized as the time series $N_i = \{n_i^0, n_i^1, \dots, n_i^T\}$.
The popularity dynamics prediction problem can be formalized as follows.
Input: For each item $i$, the input is the sequence $\{(x_i^t, n_i^t)\}_{t=0}^{T}$, where $x_i^t$ is expressed as a $k$-dimensional feature vector and $n_i^t$ denotes the popularity of the item at time $t$.
Learning: The goal of popularity dynamics prediction is to learn a predictive function $f(\cdot)$ that predicts the popularity of an item after a given time period $\Delta t$. Formally, we have
$$\hat{n}_i^{t+\Delta t} = f(x_i^t) \tag{1}$$
where $\hat{n}_i^{t+\Delta t}$ is the predicted popularity and $n_i^{t+\Delta t}$ is the actual one.
A commonly used prediction function is linear [Yuan et al.2017], that is, $f(x) = w^{\top} x + b$, where $w$ collects the parameters to be estimated from the training data and $b$ is a bias term, which can be absorbed by adding one dimension (of value $b$) to $w$ and one dimension (of value $1$) to $x$. Thus we have the simple form $f(x) = w^{\top} x$. Extensions to nonlinear functions can be made, for example, by using kernel tricks. We will show a performance comparison between the linear function and the nonlinear function in long-term popularity dynamics prediction. Due to its restricted expressive capability, the linear function is oversimplified for capturing the embedded rules of long-term popularity dynamics. The popularity dynamics prediction model we use belongs to the nonlinear case.
Prediction: Based on the learned prediction function, we can predict the popularity level of a given item at a future time; for example, the popularity of item $i$ at time $t+\Delta t$ is given by $f(x_i^t)$.
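The linear baseline with an absorbed bias can be sketched as follows (a minimal illustration; the function and variable names are ours, not from the paper):

```python
import numpy as np

def absorb_bias(w, b):
    """Fold the bias b into the weight vector as one extra dimension."""
    return np.append(w, b)

def predict_linear(w_aug, x):
    """Linear prediction f(x) = w^T x, after appending a constant 1
    to the feature vector so the absorbed bias is applied automatically."""
    return float(w_aug @ np.append(x, 1.0))
```

For instance, with weights $(2, 3)$ and bias $1$, the feature vector $(1, 1)$ yields $2 + 3 + 1 = 6$.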
Popularity Dynamics Prediction
We begin by considering the RNN-LSTM solution for popularity dynamics prediction, and then give emphasis to highly cited papers in the proposed deep learning attention mechanism. We further perform a detailed analysis to understand the interpretability of each component of the model, bridging the gap between prediction and understanding of the four key phenomena confirmed in previous studies of long-term popularity dynamics quantification.
Deep Learning Attention Mechanism
We embed the underlying mechanisms of long-term popularity dynamics in the RNN to produce predictions. Fig. 2 illustrates the proposed long-term popularity dynamics prediction model; Fig. 2(a) gives an overview of the architecture.
Given a time-stamped sequence $\{n_i^t\}$, a $k$-dimensional feature vector needs to be designed as input. The input space of every item with popularity records reflects the intrinsic quality of the item. There are two key components in the architecture: the RNN with LSTM units and the attention model. The LSTM units are arranged in the form of an RNN with multiple layers; the number of layers is set as a conventional number in deep neural networks according to the input scale.
The LSTM unit is used for its popularity and well-known capability for efficient long-range dependency learning [Xiao et al.2017]. We use the LSTM units to capture the aging effect and the Matthew effect in long-term popularity dynamics quantification. Specifically, to enhance the recency effect through the short-term working memory $h_t$, we design an attention model in the framework.
The RNN with LSTM Units. The LSTM units are arranged in the form of an RNN as illustrated in Fig. 2(b). There are four major components in a common LSTM unit: a memory cell, a forget gate $f_t$, an input gate $i_t$, and an output gate $o_t$. The gates are responsible for information processing and storage over arbitrary time intervals. Usually, the outputs of these gates are between $0$ and $1$. A recent study suggests pushing the output values of the gates towards $0$ or $1$, so that the gates are mostly open or closed instead of in a middle state [Li et al.2018]. Although the LSTM units are arranged in the form of an RNN, the vanishing gradient problem is avoided by the introduction of the memory cell. Thus, information can be stored for either short or long time periods in the LSTM unit.
Intuitively, the input gate controls the extent to which a new value flows into the memory cell. A function of the inputs passes through the input gate and is added to the cell state to update it. The following formula for the input gate is used:
$$i_t = \sigma(W_i \, [h_{t-1}, x_t] + b_i) \tag{2}$$
where the matrix $W_i$ collects the weights of the input and recurrent connections, and $\sigma$ represents the Sigmoid function. The values of the vector $i_t$ are between $0$ and $1$. If one of the values of $i_t$ is $0$ (or close to $0$), this input gate is closed and no new information is allowed into the corresponding component of the memory cell at time $t$. If one of the values is $1$, the input gate is open for the new incoming value at time $t$. Otherwise, the gate is in a half-open state.
The forget gate controls the extent to which a value remains in the memory cell. It provides a way to get rid of a previously stored memory value. Here is the formula of the forget gate:
$$f_t = \sigma(W_f \, [h_{t-1}, x_t] + b_f) \tag{3}$$
where $W_f$ and $b_f$ are weights that govern the behavior of the forget gate. Similar to $i_t$, $f_t$ is also a vector of values between $0$ and $1$. If one of the values of $f_t$ is $0$ (or close to $0$), the memory cell should remove the piece of information in the corresponding component of the cell. If one of the values is $1$, the corresponding information will be kept.
Remembering information for long periods of time is practically the default behavior of LSTM. The long-term accumulative influence is formulated as follows:
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{4}$$
where $\odot$ denotes the Hadamard product (the element-wise multiplication of matrices), and the candidate value $\tilde{c}_t$ is calculated as follows:
$$\tilde{c}_t = \tanh(W_c \, [h_{t-1}, x_t] + b_c) \tag{5}$$
That is, the information in the memory cell consists of two parts: the retained old information (controlled by the forget gate) and the new incoming information (controlled by the input gate).
The output gate controls the extent to which the value in the cell is used to compute the output activation of the LSTM unit. The following output function is used:
$$o_t = \sigma(W_o \, [h_{t-1}, x_t] + b_o) \tag{6}$$
The weight matrices and bias vector parameters need to be learned during training. The current working state is updated by the following formula:
$$h_t = o_t \odot \tanh(c_t) \tag{7}$$
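The gate computations above can be sketched in NumPy as a single LSTM step (an illustrative sketch with our own naming conventions; the paper does not specify implementation details):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: W maps the concatenation [h_prev, x_t] to the
    stacked pre-activations of the four gates; b holds the biases."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i_t = sigmoid(z[0*H:1*H])          # input gate
    f_t = sigmoid(z[1*H:2*H])          # forget gate
    o_t = sigmoid(z[2*H:3*H])          # output gate
    c_tilde = np.tanh(z[3*H:4*H])      # candidate memory value
    c_t = f_t * c_prev + i_t * c_tilde # retained old + new information
    h_t = o_t * np.tanh(c_t)           # updated working state
    return h_t, c_t
```

With all weights and biases at zero, every gate outputs 0.5, so the new cell state is exactly half of the previous one — a handy sanity check on the update rule.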
The Attention Model. The artificial attention mechanism, inspired by attention behavior in neuroscience [Itti, Koch, and Niebur1998], has been applied in deep learning for speech recognition, translation, and visual identification of objects [Vaswani et al.2017, Choi et al.2016]. Broadly, attention mechanisms are components of prediction systems that allow the system to sequentially focus on different subsets of the input [Cho, Courville, and Bengio2015]. More specifically, the attention distribution here is generated with content-based attention, so that only a subset of the input information is focused on. The attention function needs to be differentiable, so that we focus everywhere on the input, just to different extents.
The deep learning attention mechanism designed in this paper works as follows: given an input sequence, the aforementioned LSTM units generate hidden states $h_1, \dots, h_T$ to represent the hidden patterns of the input. The output is a summary focusing on the information linked to the input. In this formulation, attention can be seen as producing a fixed-length embedding of the input sequence by computing an adaptive weighted average of the state sequence [Raffel and Ellis2016].
The graphical representation of the attention model is shown in Fig. 2(c). The input and the hidden layer of the LSTM network (an RNN composed of LSTM units) form the input of the attention model. The model first computes the following formula:
$$e_t = \tanh(W_a h_t) \tag{8}$$
where $W_a$ is the weight matrix. An important remark here is that each $e_t$ is computed independently, without looking at the other states $h_{t'}$ for $t' \ne t$ [Raffel and Ellis2016]. Then, each $e_t$ is linked to a Softmax layer, whose function is given by:
$$\alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)} \tag{9}$$
where $\sum_t \alpha_t = 1$, so $\alpha_t$ is the softmax of the hidden states projected on a learned direction. The output $c$ is a weighted arithmetic mean of the hidden states, where the weights reflect the relevance of each $h_t$ to the input. It is calculated by the following formula:
$$c = \sum_{t=1}^{T} \alpha_t h_t \tag{10}$$
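This feed-forward attention pooling can be sketched in NumPy as follows (names are ours; the scoring vector `w_a` stands in for the learned weight matrix of the attention layer):

```python
import numpy as np

def softmax(e):
    e = e - e.max()              # subtract the max for numerical stability
    w = np.exp(e)
    return w / w.sum()

def attention_pool(H_states, w_a):
    """Score each hidden state independently, normalize the scores with
    a softmax, and return the weighted mean of the states."""
    e = np.tanh(H_states @ w_a)  # one scalar score per time step
    alpha = softmax(e)           # attention weights, summing to 1
    return alpha @ H_states      # adaptive weighted average c
```

With a zero scoring vector all scores are equal, so the pooled output reduces to the plain mean of the hidden states; a learned scoring direction shifts weight toward the most relevant steps.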
Finally, the popularity of item $i$ at time $t+\Delta t$ is given by the prediction computed from the attention output $c$.
Key Phenomena in Popularity Dynamics
There are four key phenomena confirmed independently in previous studies of long-term popularity dynamics quantification: the intrinsic quality, the aging effect, the recency effect and the Matthew effect. We analyze the interpretability of the deep learning attention model with these four major phenomena.
We first describe them in detail. The intrinsic quality reflects the perceived novelty and importance of an item [Wang and Barabási2013]; it captures the inherent differences between items. Items with low quality are more likely to be unpopular, and vice versa. The aging effect represents the accumulation process of popularity recession: in the attention economy, aging accounts for the fact that each item’s novelty fades eventually over time. The recency effect indicates that novel items tend to attract more attention. In time series modeling of popularity dynamics, recent items stored in the short-term working state have an advantage over those stored in the long-term memory; therefore, more emphasis needs to be given to new popular items, which are stored in the short-term working memory. The Matthew effect of accumulated advantage is summarized by the “rich-get-richer” phenomenon, i.e., previously accumulated attention triggers more subsequent attention [Crane and Sornette2008]: highly popular items are more visible and more likely to be viewed than others.
Then, we present how the proposed deep learning attention mechanism captures the four key phenomena of long-term popularity dynamics. To the best of our knowledge, we are the first to analyze this interpretability in a deep learning based model for popularity dynamics prediction. The detailed specifications of the four phenomena are formulated as follows:
Intrinsic quality. The intrinsic quality captures the inherent differences between items, accounting for the perceived novelty and importance of an item. In our proposed prediction model, the input space of every item with popularity records reflects its intrinsic quality.
Aging effect. The aging effect, which captures the fact that each item’s novelty fades eventually over time, can be modeled by the forget gate in the LSTM unit. It is formulated as Eq. (3).
Recency effect. We need to give emphasis to the current popularity state of items, which is given by Eq. (7). In the proposed popularity dynamics prediction model, recent items stored in the current working state have an advantage in reading over those stored in the long-term memory. Thus, it is possible to capture the recency effect.
Matthew effect. The memory cell in the LSTM unit takes long-term dependencies into consideration. As shown in Eq. (4), previously accumulated attention stored in the long-term memory triggers more subsequent attention. What’s more, the attention model, which focuses on the most popular part of the time series as Eq. (10) does, also captures the Matthew effect.
In this section, we demonstrate the effectiveness of the proposed popularity dynamics prediction model via deep learning attention mechanism.
Experiments are conducted on a real-world dataset (https://www.aminer.cn/data) extracted from the academic search and mining platform AMiner. We select publications in Computer Science spanning many years, consisting of papers authored by a large number of researchers. In the full citation network contained in this dataset, the vertices are literature papers and the edges are citations.
Baseline Models and Evaluation Metrics
To compare the predictive performance of the proposed deep learning attention mechanism DLAM against other models, we introduce several published models that have been used to model and predict popularity dynamics. Specifically, the comparison methods in our experiments are as follows. RPP [Wang and Barabási2013, Shen et al.2014] incorporates three key ingredients using a reinforced Poisson process: the intrinsic attractiveness, the aging effect, and the reinforcement mechanism. CART and SVR perform better in citation count prediction than LR and KNN [Yan et al.2011].
Two metrics used for evaluating popularity dynamics in [Shen et al.2014, Xiao et al.2016] are also used here: Mean Absolute Percentage Error (MAPE) and Accuracy (ACC). Let $c_i^t$ be the observed citations of paper $i$ up to time $t$, and $\hat{c}_i^t$ be the predicted one. The MAPE measures the average deviation between the predicted and observed citations over all papers. For a dataset of $M$ papers, the MAPE is given by:
$$\mathrm{MAPE} = \frac{1}{M} \sum_{i=1}^{M} \left| \frac{\hat{c}_i^t - c_i^t}{c_i^t} \right|$$
ACC measures the fraction of papers correctly predicted under a given error tolerance $\epsilon$. Specifically, the accuracy of citation prediction over $M$ papers is defined as:
$$\mathrm{ACC} = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}\left[ \left| \frac{\hat{c}_i^t - c_i^t}{c_i^t} \right| \le \epsilon \right]$$
where $\mathbb{1}[\cdot]$ is an indicator function that returns $1$ if the statement is true and $0$ otherwise. We find that our method always outperforms the others regardless of the value of $\epsilon$. In this paper, we set $\epsilon$ following [Xiao et al.2016].
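The two metrics can be sketched directly from their definitions (a minimal illustration; the default tolerance `eps` below is our own placeholder, not the paper's setting):

```python
import numpy as np

def mape(pred, obs):
    """Mean absolute percentage error over all papers."""
    pred, obs = np.asarray(pred, float), np.asarray(obs, float)
    return float(np.mean(np.abs((pred - obs) / obs)))

def acc(pred, obs, eps=0.1):
    """Fraction of papers whose relative error is within the tolerance eps."""
    pred, obs = np.asarray(pred, float), np.asarray(obs, float)
    return float(np.mean(np.abs((pred - obs) / obs) <= eps))
```

For example, predictions of 110 and 90 against true counts of 100 and 100 both have a relative error of 0.1, giving a MAPE of 0.1 and, at a tolerance of 0.1, an ACC of 1.0.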
We found in the experiments that the longer the duration of the training set, the better the long-term prediction performance. In this paper, we fix the training period and then predict the citation counts for each paper for several years after the training period; for example, $t = 1$ denotes the first observation year after the training period. We found that the features with positive contributions are the citation history, the h-index of the paper’s authors, and the level of the publication journal. For the convenience of performance comparison, the input feature used here is the citation history for every sub-window of fixed length. The loss function utilized is MAPE. We use Adadelta [Zeiler2012] as the gradient descent optimization algorithm. The attention layer is fully connected and uses a tanh activation.
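Adadelta adapts per-parameter step sizes from running averages of squared gradients and squared updates [Zeiler2012]. A generic single-parameter sketch of the update rule (not the paper's training code) is:

```python
import numpy as np

def adadelta_minimize(grad_fn, x0, rho=0.95, eps=1e-6, steps=200):
    """Minimize a scalar objective via Adadelta: the step size is the ratio
    of the RMS of past updates to the RMS of past gradients."""
    x = float(x0)
    eg2, edx2 = 0.0, 0.0                 # running E[g^2] and E[dx^2]
    for _ in range(steps):
        g = grad_fn(x)
        eg2 = rho * eg2 + (1 - rho) * g * g
        dx = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * g
        edx2 = rho * edx2 + (1 - rho) * dx * dx
        x += dx
    return x
```

A convenient property, visible in the update, is that no global learning rate needs to be tuned: the scale of the step emerges from the two accumulators.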
Results and Discussion
Results. As shown in Table 1, the proposed DLAM model exhibits the best performance in terms of ACC for all observation years; that is, DLAM consistently achieves higher accuracy than the other models across different observation times. What’s more, the DLAM model also exhibits the best performance in terms of MAPE in all the aforementioned situations. That is, the DLAM model achieves higher accuracy and lower error simultaneously. As shown in Fig. 3(b), the superiority of the DLAM model over the other methods in terms of ACC increases with the number of years after the training period, with a significant improvement compared to RPP. As illustrated in Fig. 3(a), the models used for comparison all achieve an acceptably low error rate, except RPP. This problem can be avoided by RPP with prior [Shen et al.2014], which incorporates a conjugate prior for the fitness parameter. But RPP with prior does not improve the ACC performance; that is to say, our proposed DLAM model also outperforms RPP with prior in terms of ACC.
Effectiveness of the attention model. We remove the attention model from the proposed model to verify its effectiveness. The remainder is the RNN with LSTM units (labeled LT-CCP), which has been proven effective in long-term popularity dynamics prediction. As shown in Fig. 3, DLAM consistently outperforms LT-CCP, improving the ACC and decreasing the MAPE. That is, introducing the attention model improves the prediction ability for popularity dynamics.
Analysis of the citation distribution. We illustrate the actual and predicted citation distributions in Fig. 4. At first glance, it seems that LT-CCP (shown in Fig. 4(a)) fits the data better than DLAM (shown in Fig. 4(b)). In fact, LT-CCP only fits better on papers with few citations; on the contrary, DLAM fits better on the highly cited papers and achieves better overall performance. This is more consistent with practical prediction requirements, since a few papers accumulate the vast majority of citations. It further proves the effectiveness of the attention model.
In long-term popularity dynamics analysis, it is a fact that a few items become popular while most are forgotten over time. In this paper, we present a deep learning attention mechanism to model and predict long-term popularity dynamics. We analyze the interpretability of the model: it incorporates four key ingredients of long-term popularity dynamics, including the intrinsic quality of publications, the aging effect, the recency effect, and the Matthew effect. More importantly, we verify the effectiveness of introducing the attention model in long-term popularity dynamics prediction. Experiments on a large real-world citation dataset demonstrate that our proposed model consistently outperforms the existing prediction models. The results show that the proposed model fits highly cited papers better and achieves the best overall performance.
- [Bao et al.2013] Bao, P.; Shen, H. W.; Huang, J.; and Cheng, X. Q. 2013. Popularity prediction in microblogging network: a case study on sina weibo. In International Conference on World Wide Web, 177–178.
- [Bao et al.2015] Bao, P.; Shen, H. W.; Jin, X.; and Cheng, X. Q. 2015. Modeling and predicting popularity dynamics of microblogs using self-excited hawkes processes. In International Conference on World Wide Web, 9–10.
- [Bao2016] Bao, P. 2016. Modeling and predicting popularity dynamics via an influence-based self-excited hawkes process. In International Conference on Information and Knowledge Management, 1897–1900.
- [Barabási, Song, and Wang2012] Barabási, A.; Song, C.; and Wang, D. 2012. Handful of papers dominates citation. Nature 491(7422):40.
- [Bengio, Simard, and Frasconi1994] Bengio, Y.; Simard, P.; and Frasconi, P. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5(2):157–166.
- [Cao et al.2017] Cao, Q.; Shen, H.; Cen, K.; Ouyang, W.; and Cheng, X. 2017. Deephawkes: Bridging the gap between prediction and understanding of information cascades. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 1149–1158. ACM.
- [Cho, Courville, and Bengio2015] Cho, K.; Courville, A.; and Bengio, Y. 2015. Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia 17(11):1875–1886.
- [Choi et al.2016] Choi, E.; Bahadori, M. T.; Sun, J.; Kulas, J.; Schuetz, A.; and Stewart, W. 2016. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, 3504–3512.
- [Crane and Sornette2008] Crane, R., and Sornette, D. 2008. Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences of the United States of America 105(41):15649–53.
- [Itti, Koch, and Niebur1998] Itti, L.; Koch, C.; and Niebur, E. 1998. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence 20(11):1254–1259.
- [Li et al.2018] Li, Z.; He, D.; Tian, F.; Chen, W.; Qin, T.; Wang, L.; and Liu, T.-Y. 2018. Towards binary-valued gates for robust lstm training. arXiv preprint arXiv:1806.02988.
- [Matsubara et al.2012] Matsubara, Y.; Sakurai, Y.; Prakash, B. A.; Li, L.; and Faloutsos, C. 2012. Rise and fall patterns of information diffusion: model and implications. In International Conference on Knowledge Discovery and Data Mining, 6–14.
- [Mcgovern et al.2003] Mcgovern, A.; Friedland, L.; Hay, M.; Gallagher, B.; Fast, A.; Neville, J.; and Jensen, D. 2003. Exploiting relational structure to understand publication patterns in high-energy physics. ACM SIGKDD Explorations Newsletter 5(2):165–172.
- [Pobiedina and Ichise2016] Pobiedina, N., and Ichise, R. 2016. Citation count prediction as a link prediction problem. Applied Intelligence 44(2):252–268.
- [Raffel and Ellis2016] Raffel, C., and Ellis, D. P. 2016. Feed-forward networks with attention can solve some long-term memory problems. arXiv preprint arXiv:1512.08756.
- [Ratkiewicz et al.2010] Ratkiewicz, J.; Fortunato, S.; Flammini, A.; Menczer, F.; and Vespignani, A. 2010. Characterizing and modeling the dynamics of online popularity. Physical Review Letters 105(15):158701.
- [Shen et al.2014] Shen, H. W.; Wang, D.; Barabási, A.; and Song, C. 2014. Modeling and predicting popularity dynamics via reinforced poisson processes. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 291–297.
- [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.
- [Szabo and Huberman2010] Szabo, G., and Huberman, B. A. 2010. Predicting the popularity of online content. Communications of the ACM 53:80–88.
- [Tang et al.2008] Tang, J.; Zhang, J.; Yao, L.; Li, J.; Zhang, L.; and Su, Z. 2008. ArnetMiner: extraction and mining of academic social networks. In International Conference on Knowledge Discovery and Data Mining, 990–998.
- [Vaswani et al.2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
- [Vu et al.2011] Vu, D. Q.; Asuncion, A. U.; Hunter, D. R.; and Smyth, P. 2011. Dynamic egocentric models for citation networks. In International Conference on Machine Learning, 857–864.
- [Wang and Barabási2013] Wang, D., and Barabási, A. L. 2013. Quantifying long-term scientific impact. Science 342(6154):127–32.
- [Xiao et al.2016] Xiao, S.; Yan, J.; Li, C.; and Jin, B. 2016. On modeling and predicting individual paper citation count over time. In International Joint Conference on Artificial Intelligence, 2676–2682.
- [Xiao et al.2017] Xiao, S.; Yan, J.; Yang, X.; Zha, H.; and Chu, S. M. 2017. Modeling the intensity function of point process via recurrent neural networks. In Thirty-First AAAI Conference on Artificial Intelligence, volume 17, 1597–1603.
- [Xiao et al.2018] Xiao, S.; Yan, J.; Yang, X.; and Zha, H. 2018. Publication popularity modeling via adversarial learning of profile-specific dynamic process. IEEE Access 6:19984–19992.
- [Yan et al.2011] Yan, R.; Tang, J.; Liu, X.; Shan, D.; and Li, X. 2011. Citation count prediction: learning to estimate future citations for literature. In International Conference on Information and Knowledge Management, 1247–1252.
- [Yu et al.2012] Yu, X.; Gu, Q.; Zhou, M.; and Han, J. 2012. Citation prediction in heterogeneous bibliographic networks. In International Conference on Data Mining, 1119–1130.
- [Yuan et al.2017] Yuan, S.; Tao, Z.; Zhu, T.; and Bai, S. 2017. Realtime online hot topics prediction in sina weibo for news earlier report. In International Conference on Advanced Information Networking and Applications, 599–605. IEEE.
- [Zeiler2012] Zeiler, M. D. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.