CDIL-CNN
Classification of long sequential data is an important Machine Learning task and appears in many application scenarios. Recurrent Neural Networks, Transformers, and Convolutional Neural Networks are three major techniques for learning from sequential data. Among these methods, Temporal Convolutional Networks (TCNs) which are scalable to very long sequences have achieved remarkable progress in time series regression. However, the performance of TCNs for sequence classification is not satisfactory because they use a skewed connection protocol and output classes at the last position. Such asymmetry restricts their performance for classification which depends on the whole sequence. In this work, we propose a symmetric multi-scale architecture called Circular Dilated Convolutional Neural Network (CDIL-CNN), where every position has an equal chance to receive information from other positions at the previous layers. Our model gives classification logits in all positions, and we can apply a simple ensemble learning to achieve a better decision. We have tested CDIL-CNN on various long sequential datasets. The experimental results show that our method has superior performance over many state-of-the-art approaches.
Sequence classification is the task of predicting class labels for sequences. It is of central importance in many applications, such as document classification, genomic analysis, and health informatics. For example, classifying documents into different topic categories is a challenge for library science, especially for modern digital libraries [1]. Genomic classification helps researchers to further understand some diseases [2]. Classifying ECG time series tells whether someone is a healthy person or a patient with heart disease [3].
Machine Learning, especially Deep Learning, has become widely used in end-to-end sequence classification, where a single model learns all steps between the initial inputs and the final outputs. Recurrent Neural Networks (RNNs), Transformers, and Convolutional Neural Networks (CNNs) are three primary techniques for analyzing sequential data.
RNNs use their internal states to process the sequence step by step. Despite success for short sequences, traditional RNNs cannot scale to very long sequences [4]. One reason is that they are challenging to train due to exploding or vanishing gradient problems [5]. In addition, the prediction of each timestep must wait for all its predecessors to complete, which makes RNNs difficult to parallelize. Transformers are a family of models relying on the self-attention mechanism [6, 7]. Their time and memory complexities are quadratic in the input sequence length because they compute pairwise dot-products. Comprehensive approximations are required to reduce the cost [8].
In contrast, CNNs are able to handle very long sequences. A convolutional layer uses sparse connections and no recurrent nodes. Therefore, CNNs are easier to train and parallelize. In addition, dilated convolutions can exponentially enlarge the receptive fields, allowing CNNs to use fewer layers to capture long-term dependencies. For example, Temporal Convolutional Networks (TCNs) have recently shown remarkable performance on sequence regression tasks [9]. However, the performance of TCNs for classification tasks is not satisfactory. TCNs use causal convolutions, which implement a skewed connection protocol. This asymmetric design causes a tendency to focus on the latter part of a sequence.
In this paper, we propose a novel convolutional architecture named Circular Dilated Convolutional Neural Network (CDIL-CNN), which can scale to very long sequences and has superior performance on various classification tasks. Unlike TCNs, we use symmetric convolutions to mix information, and thus every position can receive both earlier and later information from previous layers in a circular manner. Unlike the conventional pyramid-like CNN architecture, every position of the last convolutional layer in our design has an equal chance to receive all information from the whole sequence and gives its own classification logits. A simple average ensemble over these positions then helps our model achieve better accuracy.
We have tested our model on extensive sequence classification tasks, including synthetic data, images, texts, and audio series. Experimental results show that CDIL-CNN outperforms several state-of-the-art models. Our method can accurately and robustly classify across tasks with both short-term and long-term dependencies for very long sequences.
The remainder of the paper is organized as follows. We review some popular models for sequential data and their limitations in Section 2. In Section 3, we present our model, CDIL-CNN, including its connection protocol and network architecture. Experimental tasks and results are provided in Section 4, where we find that the simple convolutional network has superior performance over other models in various scenarios. Finally, we conclude the paper in Section 5.
A sequence of length $N$ is a list of elements $x_1, \dots, x_N$, where $x_n$ is the $D$-dimensional element at the $n$-th position. Given a training set with sequences and their class labels, sequence classification uses the training set to fit a model $f: \mathbb{R}^{N \times D} \to \mathcal{Y}$, where $\mathcal{Y}$ is the space of class labels. The fitted model can then be used to classify new sequences.
Many deep neural networks have been proposed for various sequence classification tasks. RNNs, Transformers, and CNNs are three significant branches for learning from sequential data.
RNNs read and process inputs sequentially. At each timestep, an RNN takes the current sequence element and the hidden state as the input and outputs the next hidden state. The hidden state at a timestep is expected to act as the representation of all its earlier inputs. Because the prediction of each timestep must wait for all its predecessors to complete, the sequential process is difficult to parallelize, which makes it hard for RNNs to handle very long sequences. Moreover, basic RNNs suffer from vanishing and exploding gradient problems, making model training very difficult for long sequences [5]. Gated RNNs, such as Long Short-Term Memory (LSTM) [10] and Gated Recurrent Unit (GRU) [11], have been proposed to relieve the gradient problems. They have many additional gates to regulate the flow of information. The gated RNNs are used in many sequence classification tasks, such as ECG arrhythmia [12] and text [13, 14]. However, they can process only short sequences (about 500-1000 timesteps) [4].
Transformers, a family of models based on the attention mechanism, quantify the interdependence within the sequence elements (self-attention). Originally, attention was used in conjunction with recurrent networks and convolutional networks [15, 16]. Later, the Transformer, an architecture based solely on the attention mechanism, was proposed. The vanilla Transformer computes pairwise dot-products between all sequence elements, which leads to a quadratic complexity w.r.t. the sequence length and makes it infeasible to process very long sequences. Approximated attention methods have been proposed to tackle this problem. Sparse Transformer [17], LogSparse Transformer [18], Longformer [19], and Big Bird [20] use sparse attention mechanisms. Linformer [21] and Synthesizer [22] apply low-rank projection attention. Performer [23], [24], and Random Feature Attention [25] rely on kernel approximation. Reformer [26], Routing Transformer [27], and Sinkhorn Transformer [28] follow the paradigm of re-arranging sequences. However, their approximation quality is questionable. Later in Section 4, we will show that their performance is inferior for long sequence classification.
CNNs are good at processing data that has a grid-like topology. Two-dimensional CNNs achieve great success in computer vision [29, 30, 31, 32], while one-dimensional CNNs are commonly used for sequential data [33, 34, 35]. Among these models, TCNs, which use causal convolutions with skewed connections, attempt to capture the temporal interactions and have been applied to various regression tasks, such as action segmentation and detection [36, 37], lip-reading [38, 39], and ENSO prediction [40]. A comparison of convolutional and recurrent architectures shows that a simple TCN outperforms canonical RNNs across a wide range of sequence modeling tasks [9].
Although TCNs are suitable for long sequence regression, their performance for classification is not satisfactory. In this paper, we propose a new convolutional model, named CDIL-CNN, to overcome the drawbacks of TCNs in long sequence classification. More details are described as follows.
Our model uses symmetric convolutions that can receive both earlier and later information from previous layers. Because no information is allowed to be leaked from future to past in regression tasks, TCN uses causal convolutions that implement a skewed connection protocol, meaning that the output at timestep $t$ can only receive information from timestep $t$ and earlier in previous layers. However, classification tasks do not have this restriction because the classification result depends on the whole sequence. Therefore, symmetric convolutions help our model better capture interactions.
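The difference between the two connection protocols can be illustrated by listing which input positions feed one output position. The following is a minimal sketch, not the authors' code; the function names `causal_taps` and `symmetric_taps` are hypothetical, and positions are plain integers without boundary handling.

```python
def causal_taps(n, kernel_size, dilation):
    """Input positions read by a causal dilated convolution at output position n.

    All taps lie at n or earlier, so no information flows from the future.
    """
    return [n - k * dilation for k in range(kernel_size - 1, -1, -1)]


def symmetric_taps(n, kernel_size, dilation):
    """Input positions read by a symmetric (centered) dilated convolution at n.

    Taps are spread evenly on both sides of n; kernel_size is assumed odd.
    """
    half = (kernel_size - 1) // 2
    return [n + k * dilation for k in range(-half, half + 1)]


print(causal_taps(10, 3, 2))     # [6, 8, 10]
print(symmetric_taps(10, 3, 2))  # [8, 10, 12]
```

With the same kernel size and dilation, the causal variant only looks backwards, while the symmetric variant sees one earlier and one later position.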
Our model also uses increasing dilation sizes with the depth of the network. Dilated convolutions (or atrous convolutions) were originally introduced for dense image prediction, where they helped the model to capture multi-scale information [41, 42, 43, 44, 45]. For 1D CNNs, dilated convolutions are generally used to enlarge the receptive fields [33, 34, 37, 9]. Following these works, we increase the dilation sizes exponentially, i.e., $d_l = 2^{l-1}$, where $d_l$ is the dilation size at the $l$-th convolutional layer. The combination of deep networks and exponentially dilated convolutions enables the receptive fields to expand quickly, which makes our model scalable to very long sequences. For a sequence of length $N$, our model needs only $\mathcal{O}(\log N)$ layers, and hence few links (convolution coefficients), to cover the whole sequence.
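How quickly the receptive field grows can be checked with a few lines of arithmetic. The sketch below assumes kernel size 3 and dilations 1, 2, 4, ... (doubling per layer); the helper names `receptive_field` and `layers_to_cover` are hypothetical and are used only for illustration.

```python
def receptive_field(num_layers, kernel_size=3):
    """Width of the input window seen by one output position after stacking
    num_layers symmetric convolutions with dilations 1, 2, 4, ..."""
    width = 1
    for layer in range(num_layers):
        dilation = 2 ** layer
        # Each layer widens the window by (kernel_size - 1) * dilation positions.
        width += (kernel_size - 1) * dilation
    return width


def layers_to_cover(seq_len, kernel_size=3):
    """Smallest depth whose receptive field is at least seq_len."""
    depth = 0
    while receptive_field(depth, kernel_size) < seq_len:
        depth += 1
    return depth


for depth in (1, 2, 3):
    print(depth, receptive_field(depth))  # 1 3, then 2 7, then 3 15

print(layers_to_cover(4000))  # 11
```

For kernel size 3 the width after $L$ layers is $2^{L+1} - 1$, so doubling the sequence length costs only one extra layer.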
To avoid notational clutter, we start from the $D = 1$ case, where $D$ is the dimension of each sequence element. Let $x^{(l)} \in \mathbb{R}^{N}$ denote a 1-dimensional input sequence of length $N$ at the $l$-th convolutional layer, and let $d_l$ be the dilation size of that layer. The convolutional output at the $n$-th ($n = 1, \dots, N$) position is computed by $y^{(l)}_n = \sum_{k=-(K-1)/2}^{(K-1)/2} w^{(l)}_k \, x^{(l)}_{n + k d_l}$, where the kernel size $K$ is usually an odd number (we used $K = 3$ in all our experiments) and $w^{(l)}_k$ are the convolution coefficients of the $l$-th layer. See Figure 1 for an illustration of 3-layer symmetric dilated convolutions with $K = 3$. It is straightforward to extend the convolution with the bias term and to the $D > 1$ cases.
In traditional CNNs, zero-padding is often used for the boundary positions where the subscripts $n + k d_l$ of their convolved input positions are smaller than 1 or larger than $N$. However, this can cause boundary effects because signals near the boundaries have to be mixed up with zeros and thus have less chance to be forwarded.
We propose a circular protocol to remove the boundary effects. In our model, a signal on one end is no longer convolved with zeros but with signals from the other end. The circular dilated convolutions are shown in Figure 2. The convolutional output becomes
$$y^{(l)}_n = \sum_{k=-(K-1)/2}^{(K-1)/2} w^{(l)}_k \, x^{(l)}_{((n + k d_l - 1) \bmod N) + 1} \qquad (1)$$
Using circular dilated convolutions, our model can connect boundary positions and learn long-term dependencies even in the first layer, unlike the lower layers of traditional CNNs, which focus only on local information. In our design, every position of the last convolutional layer has an equal chance to receive all information of the whole input sequence. Therefore, our model can apply a simple average ensemble over positions.
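The circular protocol described above can be sketched for a single channel in plain Python. This is an illustrative reading of the text, not the authors' implementation: indices are 0-based, the weights are fixed rather than learned, and the function names are hypothetical. A zero-padded variant is included only to make the boundary effect visible.

```python
def circular_dilated_conv(x, weights, dilation):
    """Symmetric dilated 1D convolution with circular (wrap-around) indexing.

    x        : list of floats, one channel of length N
    weights  : list of K floats (K odd), ordered from offset -(K-1)/2 to +(K-1)/2
    dilation : spacing between neighbouring taps
    """
    n_pos = len(x)
    half = (len(weights) - 1) // 2
    out = []
    for n in range(n_pos):
        total = 0.0
        for k, w in zip(range(-half, half + 1), weights):
            # The modulo wraps indices that fall off one end onto the other end.
            total += w * x[(n + k * dilation) % n_pos]
        out.append(total)
    return out


def zero_padded_dilated_conv(x, weights, dilation):
    """Same convolution, but taps outside the sequence read zeros."""
    n_pos = len(x)
    half = (len(weights) - 1) // 2
    out = []
    for n in range(n_pos):
        total = 0.0
        for k, w in zip(range(-half, half + 1), weights):
            idx = n + k * dilation
            if 0 <= idx < n_pos:
                total += w * x[idx]
        out.append(total)
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
taps = [1.0, 1.0, 1.0]
print(circular_dilated_conv(signal, taps, dilation=2))
print(zero_padded_dilated_conv(signal, taps, dilation=2))
```

The two outputs agree in the interior and differ only at the first two and last two positions, where the zero-padded version loses one of its three inputs while the circular version reads from the opposite end.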
We use a simple average ensemble to achieve better performance. RNNs and TCN assume that the last position contains all information of the whole sequence, so the class decision depends only on the last position. In our model, every position of the last convolutional layer can receive all information of the whole sequence. A linear module mapping $\mathbb{R}^{C}$ to the class logits, where $C$ is the number of convolution channels, is applied to each convolutional output position, and each position gives its preliminary class logits. A simple average pooling then acts as the ensemble step and aggregates the individual logits. In the implementation, we can perform the average first to speed up the network, because the linear module and the average pooling commute.
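The claim that the linear module and the average pooling can be swapped follows from linearity, and a small numeric check makes it concrete. The sketch below uses made-up features and weights (3 positions, 2 channels, 3 classes); all names and values are illustrative assumptions.

```python
def linear(vec, weight, bias):
    """Apply a linear layer: one output per row of `weight`, plus a bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def mean_over_positions(rows):
    """Average a list of equal-length vectors element-wise."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


# Toy output of the last convolutional layer: 3 positions, 2 channels each.
features = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
# Toy linear module: 3 classes x 2 channels.
weight = [[1.0, -1.0], [0.5, 2.0], [0.0, 1.0]]
bias = [0.0, 1.0, -1.0]

# Order A: logits at every position, then average the logits.
logits_then_mean = mean_over_positions([linear(f, weight, bias) for f in features])
# Order B: average the features first, then one linear module call.
mean_then_logits = linear(mean_over_positions(features), weight, bias)

print(logits_then_mean)
print(mean_then_logits)
```

Order B applies the linear module once instead of once per position, which is where the speed-up mentioned above comes from.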
Our model also uses residual connections to facilitate the training and to improve the accuracy [46, 47]. A residual block contains a skip connection where the inputs are added to the block outputs. A schematic view of our model is depicted in Figure 3.
We have compared our model with many popular models (including RNNs, Transformers, and CNNs) on various long sequential datasets in three groups of experiments. First, we used a synthetic dataset with increasing sequence lengths to show the scalability of our model. Then, we tested our model on the Long Range Arena benchmark suite, which contains different long-term dependencies. Finally, we tried three time series classification datasets that contain important local information and much noise. All experiments were run on a Linux server with one NVIDIA Tesla V100 GPU with 32 GB of memory. More details are given in the supplemental document.
The adding problem has often been used as a task for sequence models [9, 10, 48, 49]. Each position of an input sequence is a pair of numbers. The first number, called value, is randomly chosen from the interval . The second number is used as a marker. Most markers are 0 except for two randomly selected positions where the markers are 1. Let and denote the two values at the 1-marked positions. We set the learning target to be . Figure 4 shows an example of the adding problem. A prediction is considered correct if . We have used different sequence lengths with , where . For each length, we generated all training, validation, and testing datasets with 2000 instances. We compare our model with several other popular models, and the results are shown in Figure 5.
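To make the data layout of the adding problem concrete, the following sketch generates one instance. Several details are assumptions made only for illustration because they are not recoverable from the text here: values are drawn uniformly from [0, 1), and the target is the plain sum of the two marked values, which may differ from the normalization the authors used. The function name `make_adding_example` is hypothetical.

```python
import random


def make_adding_example(seq_len, rng):
    """Build one adding-problem instance.

    Returns (sequence, target), where sequence is a list of (value, marker)
    pairs and exactly two markers are 1.
    ASSUMPTIONS: values are uniform in [0, 1); target is the plain sum.
    """
    values = [rng.random() for _ in range(seq_len)]
    first, second = rng.sample(range(seq_len), 2)  # two distinct positions
    markers = [0] * seq_len
    markers[first] = 1
    markers[second] = 1
    sequence = list(zip(values, markers))
    target = values[first] + values[second]
    return sequence, target


rng = random.Random(0)
sequence, target = make_adding_example(8, rng)
for value, marker in sequence:
    print(f"{value:.3f}  {marker}")
print("target:", round(target, 3))
```

Because the two marked positions can be anywhere in the sequence, a model has to relate arbitrarily distant positions, which is why the task is used to probe long-range behaviour.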
You can see that our model works very well with almost no error. Even when the sequences are very long (), the error rate of our model is still less than 0.1%. RNNs and Transformers achieve low error rates for short sequences, all less than 5% for and less than 10% for . However, they turn worse quickly as the sequences become longer. Linformer has an error rate of 86.5% for , which is nearly as bad as random guessing. LSTM starts to get wrong for with the error rate of 83.5%. The error rates of Transformer and GRU become 82.55% and 84.9%, respectively, for . Performer runs out of memory at length . TCN performs poorly even when with 80% error rate. These results demonstrate that our model is more scalable than other RNNs, Transformers, and TCN.
Figure 6: Example images and their labels. Panels: Image (ship), Image (airplane), Pathfinder (negative), Pathfinder (positive).
Long Range Arena (LRA) is a public benchmark suite for evaluating model quality in long-context scenarios [50]. The suite consists of different data types, such as images and texts. Many Transformers have been evaluated on the suite [25, 50, 51, 52]. We compared our CDIL-CNN with other models on the following datasets:
Image. This is a 10-class image classification task. The images come from the gray-scale version of CIFAR-10 [53], where pixel intensities (0-255) are treated as categorical values. Two example images and their labels are shown in Figure 6. Every image is flattened to a sequence of length 1024. The task requires the model to learn the 2D spatial relations while using the 1D sequences.
Pathfinder. This is a synthetic image task motivated by cognitive psychology [54, 55]. The task requires the model to make a binary decision whether two highlighted points are connected by a dashed path. Two example images and their labels are shown in Figure 6. Similar to the Image task, every pathfinder image is flattened to a sequence of length 1024 with an alphabet size of 256.
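Why flattening makes the 2D structure harder to see can be shown with a tiny example. The sketch assumes row-major flattening (the text only says the images are flattened), and the helper names are hypothetical.

```python
def flatten_image(image):
    """Row-major flattening of a 2D grid of pixel intensities into a 1D sequence."""
    return [pixel for row in image for pixel in row]


def sequence_index(row, col, width):
    """Position of pixel (row, col) in the row-major flattened sequence."""
    return row * width + col


image = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]
sequence = flatten_image(image)
print(sequence)  # [10, 20, 30, 40, 50, 60, 70, 80, 90]

# Horizontal neighbours stay adjacent, vertical neighbours end up `width` apart.
print(sequence_index(1, 2, 3) - sequence_index(1, 1, 3))  # 1
print(sequence_index(2, 1, 3) - sequence_index(1, 1, 3))  # 3
```

For a 32 x 32 image flattened this way, vertically adjacent pixels are 32 positions apart in the length-1024 sequence, so the model has to recover such relations from distant sequence positions.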
Text. This is a binary sentiment classification task of predicting whether an IMDb movie review is positive or negative [56]. The task considers character-level sequences, which generate longer inputs and make the task more challenging. We use a fixed length of 4000 for every sequence, which is truncated or padded when necessary.
Retrieval. This is a character-level task with the ACL Anthology Network dataset [57]. The task requires the model to process a pair of documents and determine whether they have a common citation. Like the Text task, every document is truncated or padded to the sequence length of 4000, making the total length 8000 for the pair.
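Truncating or padding a character-level input to a fixed length can be sketched as follows. The encoding is an assumption: the sketch uses UTF-8 bytes (which gives at most 256 distinct symbols) and pads with id 0; neither choice is confirmed by the text, and `encode_fixed_length` is a hypothetical name.

```python
def encode_fixed_length(text, length, pad_id=0):
    """Encode text as byte ids and force the result to exactly `length` ids.

    ASSUMPTIONS: byte-level (UTF-8) encoding and padding with `pad_id` at the end.
    """
    byte_ids = list(text.encode("utf-8"))[:length]  # truncate if too long
    return byte_ids + [pad_id] * (length - len(byte_ids))  # pad if too short


short_review = encode_fixed_length("Great film!", 16)
long_review = encode_fixed_length("abcdef", 3)
print(len(short_review), short_review)
print(len(long_review), long_review)
```

Every sequence in a batch then has the same length, which is what a convolutional model with a fixed depth expects.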
Model | Image ($N$ = 1024) | Pathfinder ($N$ = 1024) | Text | Retrieval |
---|---|---|---|---|
Transformer [50] | 42.44 | 71.40 | 64.27 | 57.46 |
Transformer [52] | 38.20 | 74.16 | 65.02 | 79.35 |
Transformer [51] | - | - | 65.35 | 82.30 |
Local Attention [50] | 41.46 | 66.63 | 52.98 | 53.39 |
Sparse Transformer [50] | 44.24 | 71.71 | 63.58 | 59.59 |
Longformer [50] | 42.22 | 69.71 | 62.85 | 56.89 |
Linformer [50] | 38.56 | 76.34 | 53.94 | 52.27 |
Linformer [52] | 37.84 | 67.60 | 55.91 | 79.37 |
Linformer [51] | - | - | 56.12 | 79.37 |
Reformer [50] | 38.07 | 68.50 | 56.10 | 53.40 |
Reformer [52] | 43.29 | 69.36 | 64.88 | 78.64 |
Reformer [51] | - | - | 64.88 | 78.64 |
Sinkhorn Transformer [50] | 41.23 | 67.45 | 61.20 | 53.83 |
Synthesizer [50] | 41.61 | 69.45 | 61.68 | 54.67 |
BigBird [50] | 40.83 | 74.87 | 64.02 | 59.29 |
Linear Transformer [50] | 42.34 | 75.30 | 65.90 | 53.09 |
Performer [50] | 42.77 | 77.05 | 65.40 | 53.82 |
Performer [52] | 37.07 | 69.87 | 63.81 | 78.62 |
Performer [51] | - | - | 65.21 | 81.70 |
Nyströmformer [52] | 41.58 | 70.94 | 65.52 | 79.56 |
Nyströmformer [51] | - | - | 65.75 | 81.29 |
RFA-Gaussian [25] | - | - | 66.0 | 56.1 |
Transformer-LS [51] | - | - | 68.40 | 81.95 |
LSTM | 38.28 | 69.62 | 86.02 | 77.44 |
GRU | 46.63 | 85.28 | 86.47 | 76.92 |
TCN | 44.22 | 86.42 | 60.80 | 77.32 |
CDIL-CNN | 66.91 | 91.70 | 86.78 | 85.36 |
For a fair comparison, we followed the same data preprocessing and training/validation/testing splitting in [50]. We have compared our CDIL-CNN with many other models on LRA tasks. We quoted Transformers’ results from reference papers and ran LSTM, GRU, and TCN for completeness. The results are reported in Table 1.
Our model outperforms all other models on all four tasks. For the two image tasks, our model shows a superior ability to learn spatial relations, even though all images are flattened to 1D sequences. On the Image task in particular, our model (66.91%) improves accuracy by 22.67, 20.28, and 22.69 percentage points over the best Transformer variant (44.24%), the best RNN (46.63%), and TCN (44.22%), respectively. For the two character-level text datasets, our model also outperforms all other models: 86.78% (ours) vs. 86.47% (the runner-up) on Text and 85.36% (ours) vs. 82.30% (the runner-up) on Retrieval.
The UEA & UCR Repository (http://www.timeseriesclassification.com/) consists of various time series classification datasets [58]. Many time series classification problems can be solved by detecting local patterns [59, 60, 61]. These tasks require the model to pick out important local information from long sequences that contain much noise. We compared our CDIL-CNN with other popular models on three audio datasets:
FruitFlies. The dataset comes from an optical sensor that recorded the change in amplitude of an infra-red light as it was occluded by the wings of fruit flies during flight. The dataset contains 17259 training and 17259 testing sequences of length 5000. The task requires the model to classify a sequence as one of three species of fruit fly.
RightWhaleCalls. Right whale calls are difficult to hear due to low-frequency anthropogenic sounds. Up-calls are the most commonly documented right whale vocalization. The task requires the model to decide whether a sequence contains a set of right whale up-calls or not. The training and testing sizes of this dataset are 10934 and 1962, respectively. All sequences have a fixed length of 4000.
MosquitoSound. The dataset represents the wing beats of flying mosquitoes. Both training and testing sets have 139883 instances with sequence length 3750. The task requires the model to classify each sequence into one of six species.
Model | FruitFlies | RightWhaleCalls | MosquitoSound |
---|---|---|---|
Transformer | 55.55 | 71.72 | 30.60 |
Linformer | 82.20 | 70.42 | 62.52 |
Performer | 86.57 | 73.23 | 69.21 |
LSTM | 60.36 | 61.20 | 30.69 |
GRU | 58.19 | 58.54 | 35.39 |
TCN | 90.29 | 87.14 | 85.76 |
CNN | 95.18 | 81.25 | 89.49 |
CDIL-CNN | 97.00 | 92.14 | 91.74 |
We split every original training set into training (70%) and validation (30%) parts, and used the original testing set for testing. We have compared our model with some popular Transformers, RNNs, and CNNs. In particular, considering that these tasks require the model to discover local patterns rather than long-term dependencies, we also ran the traditional CNN which used the dilation size of 1 and zero-padding. The classification results are shown in Table 2.
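The 70%/30% split of an original training set can be sketched as below. Whether the split was random, and with which seed, is not stated, so the shuffling and the `seed` parameter are assumptions; `split_train_validation` is a hypothetical helper.

```python
import random


def split_train_validation(items, train_fraction=0.7, seed=0):
    """Split items into training and validation parts.

    ASSUMPTION: the split is a random shuffle with a fixed seed.
    """
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = round(train_fraction * len(items))
    train = [items[i] for i in indices[:cut]]
    validation = [items[i] for i in indices[cut:]]
    return train, validation


samples = list(range(10))
train_part, validation_part = split_train_validation(samples)
print(len(train_part), len(validation_part))  # 7 3
```

The original testing set is left untouched, so validation data is only used for model selection.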
You can see that our model is the winner on all three tasks with high accuracies of 97.00%, 92.14%, and 91.74%, respectively. Furthermore, we also found that convolutional networks are much better than RNNs and Transformers. For example, all CNNs can achieve more than 85% accuracy on the MosquitoSound task, while the best Transformer variant is just 69.21% and the best RNN is just 35.39%. This phenomenon could result from the inductive bias of CNNs, namely spatial similarity.
We have proposed a novel convolutional model named Circular Dilated Convolutional Neural Network (CDIL-CNN) for sequence classification. Based on the characteristic of very long sequential data, we have used a design that consists of multiple symmetric and circular convolutions with exponential dilation sizes. Therefore, our model can remove boundary effects and enlarge the receptive fields quickly. In this way, every position of the last convolutional layer has an equal chance to receive all information of the whole input sequence. Finally, a simple average ensemble learning is applied to improve the accuracy. Experimental results show that our model has superior performance over all other models on various long sequential datasets.
In the future, we could add other popular modules to our model, such as absolute positional encoding [7], relative positional encoding [62], and conditional positional encoding [63], which could further improve the performance. We could also pre-train our model for few-shot or zero-shot learning, where only a few supervised labels are required in training.
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5457–5466, 2018.
Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150, 2011.
In this group of experiments, we used Mean Squared Error (MSE) as the loss function and the Adam optimizer [64] with the learning rate of 0.001. We trained every model for 200 epochs using the batch size of 40. For RNNs, namely LSTM and GRU, we used 1 layer with a hidden size of 128. For Transformer, Linformer, and Performer, we used 32 dimensions, 4 layers, and 4 heads. In TCN and CDIL-CNN, we used the kernel size of 3 and 32 channels for every convolutional layer. We varied the depth of the CNNs with the sequence length so that the last position of TCN and each position of CDIL-CNN can cover almost the whole sequence. Table 1 gives the model sizes.
Model | ||||||||
---|---|---|---|---|---|---|---|---|
Transformer | 25.99 | 25.99 | 25.99 | 25.99 | 25.99 | 25.99 | 25.99 | 25.99 |
Linformer | 59.78 | 76.16 | 108.93 | 174.47 | 305.54 | 567.68 | 1091.97 | 2140.55 |
Performer | 101.25 | 101.25 | 101.25 | 101.25 | 101.25 | 101.25 | 101.25 | 101.25 |
LSTM | 67.71 | 67.71 | 67.71 | 67.71 | 67.71 | 67.71 | 67.71 | 67.71 |
GRU | 50.82 | 50.82 | 50.82 | 50.82 | 50.82 | 50.82 | 50.82 | 50.82 |
TCN | 16.07 | 19.20 | 22.34 | 25.47 | 28.61 | 31.75 | 34.88 | 38.02 |
CDIL-CNN | 16.07 | 19.20 | 22.34 | 25.47 | 28.61 | 31.75 | 34.88 | 38.02 |
For this group of experiments, we quoted Transformers’ results from reference papers [25, 50, 51, 52] and ran LSTM, GRU, TCN, and CDIL-CNN for comparison. During training, we used a categorical cross-entropy loss function and an Adam optimizer with the learning rate of 0.001. For LSTM and GRU, we used 1 layer with a hidden size of 128. For TCN and CDIL-CNN, we used the kernel size of 3 and 50 channels for every convolutional layer. The depth was decided by the sequence length. All tasks had a vocabulary size of 256 and an embedding dimension of 64. Every model was trained for 50 epochs. More details of RNNs and CNNs are given in Table 2.
Task | Sequence length | Classes | | LSTM size | GRU size | Depth | TCN size | CDIL-CNN size |
---|---|---|---|---|---|---|---|---|
Image | 1024 | 10 | 32 | 117.00 | 92.17 | 9 | 90.64 | 90.64 |
Pathfinder | 1024 | 2 | 256 | 115.97 | 91.14 | 9 | 90.24 | 90.24 |
Text | 4000 | 2 | 32 | 115.97 | 91.14 | 11 | 105.44 | 105.44 |
Retrieval | 8000 | 2 | 256 | 116.74 | 91.91 | 11 | 105.74 | 105.74 |
In this group of experiments, we used the categorical cross-entropy loss function and the Adam optimizer with the learning rate of 0.001. We trained every model for 100 epochs using the batch size of 64. For LSTM and GRU, we used 1 layer with a hidden size of 128. For Transformer, Linformer, and Performer, we used 32 dimensions, 4 layers, and 4 heads. For TCN and CDIL-CNN, we used the kernel size of 3 and 32 channels for every convolutional layer. The depth was decided by the sequence length. For the traditional CNN, we used the same parameter setting as TCN and CDIL-CNN. Therefore, all three CNNs had the same model size for a fair comparison. More details are given in Table 3.
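Task | Sequence length | Classes | Transformer size | Linformer size | Performer size | LSTM size | GRU size | Depth | CNN size (TCN, CNN, CDIL-CNN) |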
---|---|---|---|---|---|---|---|---|---|
FruitFlies | 5000 | 3 | 26.02 | 683.43 | 101.28 | 67.46 | 50.69 | 12 | 34.82 |
RightWhaleCalls | 4000 | 2 | 25.99 | 555.39 | 101.25 | 67.33 | 50.56 | 11 | 31.65 |
MosquitoSound | 3750 | 6 | 26.12 | 523.53 | 101.38 | 67.85 | 51.08 | 11 | 31.78 |