
Bi-Directional Recurrent Neural Ordinary Differential Equations for Social Media Text Classification

Classification of posts in social media such as Twitter is difficult due to the noisy and short nature of texts. Sequence classification models based on recurrent neural networks (RNN) are popular for classifying posts that are sequential in nature. RNNs assume the hidden representation dynamics to evolve in a discrete manner and do not consider the exact time of the posting. In this work, we propose to use recurrent neural ordinary differential equations (RNODE) for social media post classification which consider the time of posting and allow the computation of hidden representation to evolve in a time-sensitive continuous manner. In addition, we propose a novel model, Bi-directional RNODE (Bi-RNODE), which can consider the information flow in both the forward and backward directions of posting times to predict the post label. Our experiments demonstrate that RNODE and Bi-RNODE are effective for the problem of stance classification of rumours in social media.



1. Introduction

Information disseminated in social media such as Twitter can be useful for addressing several real-world problems like rumour detection, disaster management, and opinion mining. Most of these problems involve classifying social media posts into different categories based on their textual content. For example, classifying the veracity of tweets as True, False, or Unverified allows one to debunk the rumours evolving in social media (Zubiaga et al., 2018a). However, social media text is extremely noisy, with informal grammar, typographical errors, and irregular vocabulary. In addition, the character limit (280 characters) imposed by social media such as Twitter makes it even harder to perform text classification.

Social media text classification, such as rumour stance classification [Footnote 1], can be addressed effectively using sequence labelling models such as long short-term memory (LSTM) networks (Zubiaga et al., 2016; Augenstein et al., 2016; Kochkina et al., 2017; Zubiaga et al., 2018b, a; Dey et al., 2018; Liu et al., 2019; Tian et al., 2020). Though these models consider the sequential nature of tweets, they ignore the temporal aspects associated with the tweets. The time gap between tweets varies a lot, and LSTMs ignore this irregularity in tweet occurrences. They are discrete state space models where the hidden representation changes from one tweet to another without considering the time difference between the tweets. Considering the exact times at which tweets occur can play an important role in determining the label. If the time gap between tweets is large, the corresponding labels may not influence each other, whereas tweets that are close in time can influence each other strongly.

Footnote 1. Rumour stance classification helps to identify the veracity of a rumour post by classifying the reply tweets into different stance classes such as Support, Deny, Question, and Comment (Qazvinian et al., 2011; Zubiaga et al., 2016; Lukasik et al., 2019).

We propose to use recurrent neural ordinary differential equations (RNODE) (Rubanova et al., 2019) and develop a novel approach, bi-directional RNODE (Bi-RNODE), which can naturally consider the temporal information to perform time-sensitive classification of social media posts. The neural ordinary differential equation (NODE) (Chen et al., 2018) is a continuous depth deep learning model that transforms feature vectors in a continuous manner using ordinary differential equation solvers. NODEs bring parameter efficiency and address model selection in deep learning to a great extent. Recurrent NODE (Rubanova et al., 2019) extends NODE to time series, where the hidden states associated with the elements in the sequence are assumed to evolve continuously over time. RNODEs generalize RNNs to consider the temporal information present in the sequence data and allow the hidden representation to change according to this temporal information.

We propose RNODE to perform sequence labeling of posts occurring continuously over time in social media. It can consider the varying inter-arrival times of the posts and update the hidden representation accordingly when classifying the posts. In addition, we propose a novel model, bi-directional RNODE (Bi-RNODE), which considers information not only from the past but also from the future in predicting the label of a post. Here, continuously evolving hidden representations in the forward and backward directions in time are combined and used to predict the post label. We show the effectiveness of the proposed models on the rumour stance classification problem in Twitter using the RumourEval-2019 (Derczynski et al., 2019) dataset. We found that RNODE and Bi-RNODE can improve social media text classification by effectively making use of the temporal information, and that they are better than LSTMs and gated recurrent units (GRU) with temporal features.

2. Background

2.1. Problem Definition

We consider the problem of classifying social media posts into different classes. Let our data set $D$ be a collection of $N$ posts, $D = \{p_i\}_{i=1}^{N}$. Each post $p_i$ is assumed to be a tuple containing details about the post such as the textual content $x_i$ (one can consider other features as well, such as the number of re-posts and reactions), the time of the post $t_i$, and the label associated with the post $y_i$; thus $p_i = (x_i, t_i, y_i)$. Our aim is to develop a sequence classification model which considers the temporal information $t_i$ along with $x_i$ for classifying a social media post. In particular, we consider the rumour stance classification problem in Twitter, where one classifies tweets into different classes such as Support, Query, Deny, and Comment; thus $y_i \in \{\text{Support}, \text{Query}, \text{Deny}, \text{Comment}\}$.

2.2. Neural Ordinary Differential Equations

Neural ordinary differential equations (NODE) (Chen et al., 2018) were introduced as a continuous depth alternative to Residual Networks (ResNets) (He et al., 2016). ResNets use skip connections to avoid vanishing gradient problems when networks grow deeper. The residual block output is computed as $h_{t+1} = h_t + f(h_t, \theta_t)$, where $f$ is a neural network parameterized by $\theta_t$ involving stacked layers with non-linear activation functions, and $h_t$ represents the hidden representation at depth $t$. This update is similar to a step in the Euler numerical technique used for solving ordinary differential equations (ODE) of the following form.

$$\frac{dh(t)}{dt} = f(h(t), t, \theta) \qquad (1)$$

The sequence of residual block operations in ResNets can be seen as a solution to the ODE, with $h(t)$ representing the hidden representation at any time $t$ and the ODE trajectories defined through the neural network $f$. Consequently, NODEs can be interpreted as a continuous equivalent of ResNets, modelling the evolution of hidden representations over time.

For solving an ODE, one can use fixed step-size numerical techniques such as Euler and Runge-Kutta, or adaptive step-size methods like Dopri5 (Dormand and Prince, 1980). Solving an ODE requires one to specify an initial value $h(0)$, after which the value at time $t$ can be computed using an ODE solver, ODESolverCompute($f_\theta$, $h(0)$, $0$, $t$). We can take the initial value $h(0)$ to be the input $x$ or a transformation of $x$ using a downsampling block. The ODE (1) is solved until some end-time $T$ to obtain the final hidden representation $h(T)$. A fully connected neural network (FCNN) transforms the final representation $h(T)$ to the output $y$. For classification problems, cross-entropy loss is used to update the weights of the NODE using back-propagation. For NODE models, efficient back-propagation and gradient computations were proposed using the adjoint sensitivity method (Zhuang et al., 2020; Chen et al., 2018).
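The correspondence between a residual update and an Euler step can be illustrated with a minimal sketch. The dynamics function `f` below is a hypothetical stand-in (a fixed linear decay, not a trained network), chosen only because its exact solution is known; `euler_solve` is an illustrative name, not a function from the paper.

```python
import math

def f(h, t):
    # Hypothetical dynamics standing in for the neural network f: dh/dt = -h.
    return [-v for v in h]

def euler_solve(f, h0, t0, t1, n_steps):
    """Integrate dh/dt = f(h, t) from t0 to t1 with fixed-step Euler."""
    h, t = list(h0), t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        dh = f(h, t)
        # Same form as a residual block, h <- h + f(h), scaled by the step size.
        h = [hv + dt * dv for hv, dv in zip(h, dh)]
        t += dt
    return h

h0 = [1.0, 2.0]
coarse = euler_solve(f, h0, 0.0, 1.0, n_steps=1)    # one residual-style update
fine = euler_solve(f, h0, 0.0, 1.0, n_steps=1000)   # approaches the continuous flow
exact = [v * math.exp(-1.0) for v in h0]            # closed-form solution at t = 1
```

With a single step the update is exactly one residual block; as the number of steps grows, the result approaches the exact ODE solution, which is the sense in which a NODE is a continuous-depth ResNet.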

Figure 1. Architecture details of (a) RNODE and (b) Bi-RNODE.

3. Bi-Directional Recurrent NODE

Popular techniques for sequence classification such as LSTMs consider the sequential nature of the data but, in their standard setting, ignore the temporal features associated with the data. Posts occur at irregular intervals of time, with more posts occurring in certain periods. The influence of consecutive posts might depend on this time gap, with the influence typically decreasing over time. Instead of an LSTM model, which performs a single-step transformation, it is beneficial to use a model where the number of transformations depends on the time gap.

We propose to use recurrent neural ordinary differential equations (RNODE) (Rubanova et al., 2019) to address the drawbacks of RNN-based models in classifying irregularly occurring posts in social media. RNODE was developed for time-series data; it can naturally consider the time associated with the posts and performs the transformations of the hidden representation to reflect it. In RNODE, the transformation of a hidden representation $h_{t_{i-1}}$ at time $t_{i-1}$ to $h_{t_i}$ at time $t_i$ is governed by an ODE similar to (1), with $f$ being a neural network (NN) transformation. Unlike standard LSTMs, where $h_{t_i}$ is obtained from $h_{t_{i-1}}$ as a single NN transformation, RNODE first obtains a hidden representation $h'_{t_i}$ as a solution to (1) at time $t_i$ with initial value $h_{t_{i-1}}$:

$$h'_{t_i} = h_{t_{i-1}} + \int_{t_{i-1}}^{t_i} f(h(t), t, \theta)\, dt$$

As this integral is intractable, RNODE uses a numerical technique (e.g., the Euler method) to obtain the transformation. The number of update steps in the numerical technique is determined by the time gap between the consecutive posts.

The hidden representation $h'_{t_i}$ and the input post $x_i$ at time $t_i$ are passed through a neural network transformation (RNNCell()) to obtain the final hidden representation $h_{t_i}$, i.e., $h_{t_i}$ = RNNCell($h'_{t_i}$, $x_i$). The process is repeated for every element in the sequence. The hidden representations associated with the elements in the sequence are then passed to a neural network (NN()) to obtain the sequence of outputs corresponding to the post labels. Figure 1(a) provides the detailed architecture of the RNODE model.
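The recurrence described above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the dynamics `f`, the cell `rnn_cell`, and the Euler step size `STEP` are all assumptions (fixed functions rather than trained networks), and the timestamps are made-up normalized values.

```python
import math

STEP = 0.0625  # assumed Euler step size; the text does not specify one

def f(h):
    # Hypothetical ODE dynamics (a fixed decay) in place of a trained network.
    return [-0.5 * v for v in h]

def ode_solve(h, t_prev, t_cur, step=STEP):
    """Euler-integrate the hidden state across the gap t_cur - t_prev."""
    gap = t_cur - t_prev
    n_steps = max(1, math.ceil(gap / step))  # larger gaps take more update steps
    dt = gap / n_steps
    for _ in range(n_steps):
        h = [hv + dt * dv for hv, dv in zip(h, f(h))]
    return h, n_steps

def rnn_cell(h_prime, x):
    # Hypothetical cell: elementwise tanh of the evolved state plus the input.
    return [math.tanh(hv + xv) for hv, xv in zip(h_prime, x)]

def rnode(xs, ts, h0, t0=0.0):
    """Return the hidden state after each post and the Euler steps used."""
    h, t_prev, states, steps = list(h0), t0, [], []
    for x, t in zip(xs, ts):
        h_prime, n = ode_solve(h, t_prev, t)  # continuous evolution between posts
        h = rnn_cell(h_prime, x)              # discrete update at the post
        states.append(h)
        steps.append(n)
        t_prev = t
    return states, steps

xs = [[0.2, -0.1], [0.4, 0.3], [-0.3, 0.1]]   # toy post features
ts = [0.125, 0.1875, 0.9375]                  # irregular posting times in [0, 1]
states, steps = rnode(xs, ts, h0=[0.0, 0.0])
```

Here the gaps are 0.125, 0.0625, and 0.75, so the solver takes 2, 1, and 12 Euler steps respectively, whereas an LSTM would apply exactly one transformation per post regardless of the gap.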

Bi-directional RNNs (Schuster and Paliwal, 1997) such as Bi-LSTMs (Graves et al., 2013) have proven successful in many sequence labeling tasks in natural language processing, such as POS tagging (Huang et al., 2015). They use information from the past and the future to predict the label, while standard LSTMs consider only the past. We propose Bi-directional RNODE (Bi-RNODE), which uses the sequence of input observations from the past and from the future to predict the post label at any time $t$. It assumes that the hidden representation dynamics are influenced not only by past posts but also by future posts. Unlike Bi-LSTMs, Bi-RNODE considers the exact times of the posts and their inter-arrival times in determining the transformations of the hidden representations. Bi-RNODE consists of two RNODE blocks, one performing transformations in the forward direction (in the order of posting times) and the other in the backward direction (in the reverse order of posting times). The hidden representations $\overrightarrow{h}_{t_i}$ and $\overleftarrow{h}_{t_i}$ computed by the forward and backward RNODE respectively are aggregated, either by concatenation or averaging, to obtain a final hidden representation, which is passed through an NN to obtain the post labels. Bi-RNODE is useful when a sequence of posts needs to be classified together, and can be restrictive for online classification of individual posts. Algorithm 1 and Figure 1(b) provide an overview of Bi-RNODE for post classification. For Bi-RNODE, an extra neural network is required to compute hidden representations in the backward direction. Training in Bi-RNODE is done in a similar manner to RNODE, with cross-entropy loss and back-propagation to estimate the parameters.

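Initialize: $\overrightarrow{h}_{t_0}$, $t_0$, $f_{\theta}$
if bidirectional:
    Set $X'$ to contain $x_1, \dots, x_N$ in reverse order.
    Set $T'$ to contain $t_1, \dots, t_N$ in reverse order.
    Initialize: $\overleftarrow{h}_{t'_0}$, $t'_0$, $f_{\theta'}$
for $i = 1$ to $N$ do
    $\overrightarrow{h}'_{t_i}$ = ODESolverCompute($f_{\theta}$, $\overrightarrow{h}_{t_{i-1}}$, $t_{i-1}$, $t_i$)
    $\overrightarrow{h}_{t_i}$ = RNNCell($\overrightarrow{h}'_{t_i}$, $x_i$)
    if bidirectional:
        $\overleftarrow{h}'_{t'_i}$ = ODESolverCompute($f_{\theta'}$, $\overleftarrow{h}_{t'_{i-1}}$, $t'_{i-1}$, $t'_i$)
        $\overleftarrow{h}_{t'_i}$ = RNNCell($\overleftarrow{h}'_{t'_i}$, $x'_i$)
end for
if bidirectional:
    $h$ = aggregate($\overrightarrow{h}$, $\overleftarrow{h}$) // concatenate or average
return NN($h$) // return predicted post labels
Algorithm 1. Pseudocode for the RNODE and Bi-RNODE approach to predict class labels. The input data points $\{(x_i, t_i)\}_{i=1}^{N}$ are sorted in increasing order of their timestamps.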
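A runnable toy version of the bidirectional pass may help make the alignment of forward and backward states concrete. Everything named here is hypothetical: `f_fwd`, `f_bwd`, and `cell` are fixed functions standing in for trained networks, the final classifier NN is omitted, and the reversed time axis `t' = t_N - t` is an assumption made so that backward gaps stay non-negative; the source does not state how the reversed timestamps are defined.

```python
import math

def euler(f, h, gap, step=0.0625):
    # Fixed-step Euler; the step count grows with the time gap.
    n = max(1, math.ceil(gap / step))
    dt = gap / n
    for _ in range(n):
        h = [hv + dt * dv for hv, dv in zip(h, f(h))]
    return h

def run_rnode(f, cell, xs, ts, h0):
    """One RNODE pass over posts xs at non-decreasing times ts."""
    h, t_prev, out = list(h0), ts[0], []
    for x, t in zip(xs, ts):
        h = cell(euler(f, h, t - t_prev), x)
        out.append(h)
        t_prev = t
    return out

def f_fwd(h):
    return [-0.5 * v for v in h]      # hypothetical forward dynamics

def f_bwd(h):
    return [-0.25 * v for v in h]     # separate (extra) backward dynamics

def cell(h, x):
    return [math.tanh(a + b) for a, b in zip(h, x)]

def bi_rnode(xs, ts, dim, aggregate="concat"):
    h0 = [0.0] * dim
    fwd = run_rnode(f_fwd, cell, xs, ts, h0)
    # Assumption: reverse the time axis as t' = t_N - t.
    ts_rev = [ts[-1] - t for t in reversed(ts)]
    bwd = run_rnode(f_bwd, cell, xs[::-1], ts_rev, h0)[::-1]  # realign to posts
    if aggregate == "concat":
        return [a + b for a, b in zip(fwd, bwd)]
    return [[(u + v) / 2 for u, v in zip(a, b)] for a, b in zip(fwd, bwd)]

xs = [[0.2, -0.1], [0.4, 0.3], [-0.3, 0.1]]
ts = [0.125, 0.1875, 0.9375]
h_cat = bi_rnode(xs, ts, dim=2)                        # 4 values per post
h_avg = bi_rnode(xs, ts, dim=2, aggregate="average")   # 2 values per post
```

The backward outputs are reversed again before aggregation so that each post's final representation combines a state summarizing its past with a state summarizing its future.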

4. Experiments

To demonstrate the effectiveness of the proposed approaches, we consider the stance classification problem in Twitter and the RumourEval-2019 (Derczynski et al., 2019) data set. This Twitter data set consists of rumours associated with eight events. Each event has a collection of tweets labelled with one of four labels: Support, Query, Deny, and Comment. We picked four events, Charliehebdo, Ferguson, Ottawashooting, and Sydneysiege, to conduct experiments.

Features: For dataset preparation, each data point associated with a tweet includes the text embedding, retweet count, favourites count, punctuation features, sentiment polarity, negative and positive word counts, and the presence of hashtags, user mentions, URLs, and entities from the tweet information. Using pre-trained word2vec vectors (pre-trained on the Google News dataset), each word is represented as an embedding of size 15. The text embedding of the tweet is obtained by concatenating the word embeddings. Each event's data is split into train, validation, and test datasets with the ratio 60:20:20 in the order of the times at which the tweets occurred. Each tweet timestamp is converted to epoch time, and Min-Max normalization is applied over the timestamps associated with each event to keep the duration of the event in the interval $[0, 1]$.
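The timestamp normalization and the chronological 60:20:20 split can be sketched as follows. The function names and the toy epoch values are illustrative assumptions; only the Min-Max scaling and the time-ordered split ratio come from the text.

```python
def min_max_normalize(timestamps):
    """Scale epoch timestamps of one event into [0, 1]."""
    lo, hi = min(timestamps), max(timestamps)
    if hi == lo:
        return [0.0 for _ in timestamps]
    return [(t - lo) / (hi - lo) for t in timestamps]

def chronological_split(items, timestamps, pct=(60, 20, 20)):
    """Split items into train/validation/test in order of time."""
    order = sorted(range(len(items)), key=lambda i: timestamps[i])
    ordered = [items[i] for i in order]
    n = len(ordered)
    n_train = n * pct[0] // 100
    n_val = n * pct[1] // 100
    return (ordered[:n_train],
            ordered[n_train:n_train + n_val],
            ordered[n_train + n_val:])

# Toy event: ten tweets with made-up epoch times, deliberately unsorted.
tweets = list("abcdefghij")
epochs = [1421000000 + 60 * k for k in (3, 0, 7, 1, 9, 2, 5, 8, 4, 6)]
norm = min_max_normalize(epochs)
train, val, test = chronological_split(tweets, norm)
```

Splitting by time rather than at random keeps the test tweets strictly later than the training tweets, which mirrors how a deployed classifier would encounter new posts.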


Figure 2. ROC curves of the models (a) RNODE, (b) LSTM, (c) GRU, (d) Bi-RNODE, (e) Bi-LSTM, and (f) Bi-GRU trained on the Sydneysiege event for the seen event experiment.

4.1. Experimental setup

In real time, new rumours arise and propagate at different time periods. Our experiments are conducted to predict stance of social media posts propagating in seen events as well as unseen events. Here are two experimental setups we conducted on the dataset.

  • Seen Event: Here we train, validate, and test on tweets of the same event. Each event's data is split in a 60:20:20 ratio in order of time. This setup helps in predicting the stance of unseen tweets of the same event.

  • Unseen Event: This setup helps in evaluating performance on an unseen event and in training on a larger dataset. Here we train and validate on three events and test on the remaining event. The last 20% of the data of each training event is set aside for validation. During training, mini-batches are formed only from the posts in each event and are fed to the model in the order they appear in the event.
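The unseen event protocol amounts to a leave-one-event-out split. A minimal sketch, assuming each event is already a time-sorted list of posts; the function name and the toy data are hypothetical.

```python
EVENTS = ["Charliehebdo", "Ferguson", "Ottawashooting", "Sydneysiege"]

def unseen_event_folds(events_data, val_pct=20):
    """Build one fold per held-out event.

    events_data maps an event name to its posts, sorted by posting time.
    """
    folds = []
    for held_out in events_data:
        train, val = {}, {}
        for name, posts in events_data.items():
            if name == held_out:
                continue
            # The last val_pct percent of each training event is validation data.
            cut = len(posts) * (100 - val_pct) // 100
            train[name], val[name] = posts[:cut], posts[cut:]
        folds.append({
            "test_event": held_out,
            "train": train,   # kept per event so mini-batches never mix events
            "val": val,
            "test": events_data[held_out],
        })
    return folds

# Toy data: ten placeholder posts per event.
toy = {name: [f"{name}-{i}" for i in range(10)] for name in EVENTS}
folds = unseen_event_folds(toy)
```

Keeping the training posts grouped by event makes it straightforward to form mini-batches from a single event in posting order, as the setup requires.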

Baselines: We compared the results of our proposed RNODE and Bi-RNODE models with RNN-based baselines such as LSTM (Kochkina et al., 2017), Bi-LSTM (Augenstein et al., 2016), GRU (Cho et al., 2014), and Bi-GRU, and with a Majority baseline that labels every post with the most frequent class. We also use a variant of the LSTM baseline that considers temporal information (Zubiaga et al., 2018b), LSTM-timeGap, where the time gap between consecutive data points is included as part of the input data.

Evaluation Metrics: We consider the standard evaluation metrics such as precision, recall, and F1, and in addition the AUC score to account for the data imbalance. We consider a weighted average of the evaluation metrics to compare the performance of models.
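The effect of weighted averaging can be seen on a majority-class predictor. The sketch below assumes that "weighted average" means weighting each class's metric by its share of the true labels, with precision taken as 0 for classes that are never predicted. The label counts are hypothetical, chosen so that the majority class covers 60.5% of posts, matching the recall reported for the Majority baseline on Charliehebdo (seen event) in Table 1.

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1 over all true classes."""
    support = Counter(y_true)
    n = len(y_true)
    p_w = r_w = f_w = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = tp / predicted if predicted else 0.0  # assumed convention
        recall = tp / count
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = count / n
        p_w += weight * precision
        r_w += weight * recall
        f_w += weight * f1
    return p_w, r_w, f_w

# Hypothetical label distribution with a 60.5% majority class.
y_true = ["Comment"] * 605 + ["Support"] * 200 + ["Query"] * 100 + ["Deny"] * 95
y_pred = ["Comment"] * len(y_true)   # majority baseline
precision, recall, f1 = weighted_prf(y_true, y_pred)
```

For a majority predictor with majority share $p$, this gives weighted recall $p$, weighted precision $p^2$, and weighted F1 $2p^2/(1+p)$, which for $p = 0.605$ is roughly 0.366 precision and 0.456 F1, consistent with that Majority row.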

                      Charliehebdo                     Ferguson                         Ottawashooting                   Sydneysiege
Model         Setup   AUC    F1     Recall  Precision  AUC    F1     Recall  Precision  AUC    F1     Recall  Precision  AUC    F1     Recall  Precision
RNODE         Seen    0.665  0.653  0.674   0.658      0.600  0.591  0.659   0.598      0.638  0.654  0.692   0.670      0.699  0.722  0.730   0.724
RNODE         Unseen  0.638  0.672  0.700   0.721      0.618  0.632  0.677   0.640      0.659  0.651  0.703   0.642      0.661  0.662  0.704   0.638
Bi-RNODE      Seen    0.696  0.659  0.693   0.629      0.595  0.599  0.673   0.641      0.669  0.667  0.692   0.658      0.739  0.769  0.784   0.763
Bi-RNODE      Unseen  0.651  0.697  0.737   0.690      0.615  0.643  0.695   0.635      0.652  0.624  0.662   0.618      0.650  0.650  0.669   0.653
Bi-LSTM       Seen    0.628  0.625  0.679   0.609      0.563  0.599  0.650   0.614      0.622  0.627  0.654   0.622      0.648  0.701  0.716   0.721
Bi-LSTM       Unseen  0.662  0.690  0.717   0.671      0.603  0.623  0.667   0.600      0.650  0.637  0.686   0.622      0.652  0.655  0.680   0.652
Bi-GRU        Seen    0.654  0.643  0.660   0.641      0.588  0.571  0.631   0.625      0.640  0.651  0.686   0.644      0.701  0.739  0.757   0.748
Bi-GRU        Unseen  0.656  0.690  0.724   0.682      0.613  0.634  0.678   0.611      0.648  0.636  0.683   0.610      0.653  0.655  0.690   0.680
LSTM          Seen    0.625  0.600  0.637   0.637      0.567  0.602  0.650   0.611      0.605  0.609  0.635   0.603      0.634  0.689  0.703   0.695
LSTM          Unseen  0.645  0.690  0.728   0.686      0.602  0.611  0.631   0.603      0.630  0.626  0.680   0.627      0.643  0.644  0.669   0.641
GRU           Seen    0.616  0.610  0.647   0.623      0.578  0.588  0.664   0.631      0.591  0.539  0.513   0.574      0.623  0.688  0.725   0.689
GRU           Unseen  0.682  0.695  0.713   0.686      0.614  0.640  0.687   0.623      0.638  0.632  0.683   0.618      0.654  0.665  0.711   0.659
LSTM-timeGap  Seen    0.638  0.631  0.679   0.605      0.565  0.581  0.627   0.590      0.625  0.640  0.679   0.650      0.656  0.667  0.667   0.671
LSTM-timeGap  Unseen  0.652  0.695  0.732   0.696      0.604  0.625  0.673   0.633      0.638  0.638  0.683   0.651      0.632  0.649  0.698   0.655
Majority      Seen    0.500  0.456  0.605   0.366      0.500  0.518  0.654   0.428      0.500  0.485  0.628   0.395      0.500  0.485  0.628   0.395
Majority      Unseen  0.500  0.542  0.673   0.453      0.500  0.528  0.662   0.439      0.500  0.467  0.614   0.377      0.500  0.490  0.632   0.400
Table 1. Performance of all the models on the RumourEval-2019 (Derczynski et al., 2019) dataset. The first and second rows of each model represent the seen event and unseen event experiment results respectively.


Hyperparameters: All the models are trained for 50 epochs with a learning rate of 0.01, the Adam optimizer, dropout (0.2) regularization, a batch size of 50, and the cross-entropy loss function. Different hyperparameters such as the number of neural network layers (1, 2), hidden representation sizes (64, 128), numerical methods (Euler, RK4, Dopri5 for RNODE and Bi-RNODE), and aggregation strategy (concatenation or averaging for Bi-LSTM and Bi-RNODE) are tried for the models, and the best configuration is selected on the validation data for each experimental setup and train/test data split.
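The search space described above is small enough to enumerate exhaustively. The sketch below is one way to lay it out; the dictionary keys and the `search_space` helper are hypothetical names, and the values are the ones listed in the text.

```python
from itertools import product

# Settings shared by every run, as listed in the text.
FIXED = {"epochs": 50, "learning_rate": 0.01, "optimizer": "adam",
         "dropout": 0.2, "batch_size": 50, "loss": "cross_entropy"}

SHARED_GRID = {"num_layers": [1, 2], "hidden_size": [64, 128]}
SOLVERS = ["euler", "rk4", "dopri5"]          # only for RNODE and Bi-RNODE
AGGREGATIONS = ["concat", "average"]          # only for Bi-LSTM and Bi-RNODE

def search_space(model):
    """Enumerate every candidate configuration for a given model name."""
    grid = dict(SHARED_GRID)
    if model in ("RNODE", "Bi-RNODE"):
        grid["solver"] = SOLVERS
    if model in ("Bi-LSTM", "Bi-RNODE"):
        grid["aggregation"] = AGGREGATIONS
    keys = list(grid)
    return [{**FIXED, **dict(zip(keys, values))}
            for values in product(*(grid[k] for k in keys))]

sizes = {m: len(search_space(m))
         for m in ("LSTM", "Bi-LSTM", "RNODE", "Bi-RNODE")}
```

Under this reading, LSTM has 4 candidate configurations, Bi-LSTM 8, RNODE 12, and Bi-RNODE 24, each of which would be trained and scored on the validation split.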

Figure 3. t-SNE plot of (a) RNODE and (b) Bi-RNODE latent representations for the Sydneysiege event.

4.2. Results and Analysis

The results of the seen event and unseen event experiment setups can be found in Table 1, where the first and second rows for each model provide results on the seen event and unseen event setups respectively. We can observe from Table 1 that the highest AUC is obtained by one of the proposed models, RNODE or Bi-RNODE, for every event in the seen event setup, and for every event except Charliehebdo in the unseen event setup. For the seen event setup, Bi-RNODE gives the best result, outperforming the other models for most of the data sets and measures. For the unseen event setup, RNODE gives a better AUC than the baseline models except for the Charliehebdo event, where GRU has the highest AUC. Bi-RNODE results are better than RNODE for Charliehebdo and Ferguson, while they are close to RNODE for Ottawashooting and Sydneysiege. For the seen event experiment on the Sydneysiege event, we plot the ROC curves of all the models in Figure 2. We can observe that the AUC values for Figures 2(a) and 2(d), corresponding to RNODE and Bi-RNODE respectively, are higher than those of LSTM, GRU, Bi-LSTM, and Bi-GRU, with the exception of Bi-GRU, whose AUC is marginally above that of RNODE in Table 1.

The proposed models are computationally and parametrically efficient, where the RNODE (M, in millions) and Bi-RNODE (M) models required fewer parameters when compared to the LSTM (M) and Bi-LSTM (M) models. Visualization of the latent hidden state representations of the proposed models using t-SNE plots (Figures 3(a) and 3(b)) shows that they are capable of separating data points from different classes into different groups. The proposed models learn hidden representations well, with Bi-RNODE learning a better representation than RNODE.

5. Conclusion and Future Work

We proposed RNODE and Bi-RNODE models for sequence classification of social media posts, which naturally consider the temporal information and use it to model the dynamics of hidden representations using an ODE. This makes them more effective than LSTMs for social media, where posts occur at irregular time intervals. The experimental results on the rumour stance classification problem in Twitter support the superior capability of RNODE and Bi-RNODE in performing tweet classification. As future work, we would like to further improve the sequence modelling capability of the proposed models by combining them with conditional random fields.


References

  • I. Augenstein, T. Rocktäschel, A. Vlachos, and K. Bontcheva (2016) Stance detection with bidirectional conditional encoding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 876–885.
  • R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud (2018) Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583.
  • K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103–111.
  • L. Derczynski, G. Gorrell, A. Zubiaga, A. Aker, K. Bontcheva, M. Liakata, and E. Kochkina (2019) RumourEval 2019 data. figshare.
  • K. Dey, R. Shrivastava, and S. Kaushik (2018) Topical stance detection for Twitter: a two-phase LSTM model using attention. In Advances in Information Retrieval, pp. 529–536.
  • J. R. Dormand and P. J. Prince (1980) A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics 6 (1), pp. 19–26.
  • A. Graves, A. Mohamed, and G. Hinton (2013) Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645–6649.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • Z. Huang, W. Xu, and K. Yu (2015) Bidirectional LSTM-CRF models for sequence tagging. ArXiv abs/1508.01991.
  • E. Kochkina, M. Liakata, and I. Augenstein (2017) Turing at SemEval-2017 task 8: sequential approach to rumour stance classification with branch-LSTM. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 475–480.
  • Y. Liu, X. Jin, and H. Shen (2019) Towards early identification of online rumors based on long short-term memory networks. Inf. Process. Manag. 56, pp. 1457–1467.
  • M. Lukasik, K. Bontcheva, T. Cohn, A. Zubiaga, M. Liakata, and R. Procter (2019) Gaussian processes for rumour stance classification in social media. ACM Trans. Inf. Syst. 37 (2).
  • V. Qazvinian, E. Rosengren, D. Radev, and Q. Mei (2011) Rumor has it: identifying misinformation in microblogs. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 1589–1599.
  • Y. Rubanova, R. T. Chen, and D. Duvenaud (2019) Latent ODEs for irregularly-sampled time series. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 5320–5330.
  • M. Schuster and K. K. Paliwal (1997) Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681.
  • L. Tian, X. Zhang, Y. Wang, and H. Liu (2020) Early detection of rumours on Twitter via stance transfer learning. In European Conference on Information Retrieval, pp. 575–588.
  • J. Zhuang, N. Dvornek, X. Li, S. Tatikonda, X. Papademetris, and J. Duncan (2020) Adaptive checkpoint adjoint method for gradient estimation in neural ODE. In International Conference on Machine Learning, pp. 11639–11649.
  • A. Zubiaga, A. Aker, K. Bontcheva, M. Liakata, and R. Procter (2018a) Detection and resolution of rumours in social media: a survey. ACM Computing Surveys (CSUR) 51 (2), pp. 1–36.
  • A. Zubiaga, E. Kochkina, M. Liakata, R. Procter, M. Lukasik, K. Bontcheva, T. Cohn, and I. Augenstein (2018b) Discourse-aware rumour stance classification in social media using sequential classifiers. Information Processing & Management 54 (2), pp. 273–290.
  • A. Zubiaga, E. Kochkina, M. Liakata, R. Procter, and M. Lukasik (2016) Stance classification in rumours as a sequential task exploiting the tree structure of social media conversations. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2438–2448.