I Introduction
Electrocardiogram (ECG), which records the depolarization-repolarization patterns of the heart's electrical activity over the cardiac cycle, is widely used for monitoring or diagnosing patients' cardiac conditions [1, 2, 3, 4, 5, 6] as well as for identification [7]. The diagnosis is usually made by well-trained and experienced cardiologists, which is expensive and sometimes inconvenient, for instance, when patients must travel to hospitals. Therefore, automatic monitoring and diagnosis systems are in great demand in clinics, community medical centers and home health care programs. Though great advances have been made in ECG filtering, detection and classification in the past decades [8, 9, 10, 11, 4, 12], efficient and accurate ECG classification remains challenging due to disturbing noise, the variety of symptom types and the differences between patients.
Before classification, a preprocessing filtering step is usually needed to remove a variety of noises from the ECG signal, including powerline interference, baseline wander, muscle contraction noise, etc. Traditional approaches like low-pass filters and filter banks can reduce noise but may also introduce artifacts [13]. Combining signal modeling and filtering may alleviate this problem, but such approaches are limited to a single noise type [14, 15]. Recently, different noise removal methods based on the wavelet transform have been proposed, leveraging its strength in multi-resolution signal analysis [16, 17, 18, 19]. For instance, S. Poungponsri and X.H. Yu proposed an adaptive filtering approach based on the wavelet transform and artificial neural networks which can efficiently remove different types of noise [18].
For ECG classification, classical methods usually consist of two sequential modules: feature extraction and classifier training. Hand-crafted features are extracted in the time or frequency domain, including amplitudes, intervals, higher-order statistics, etc. Various methods have been proposed, such as filter banks [20, 21], Principal Component Analysis (PCA) [22, 23, 24, 25, 26], frequency analysis, the wavelet transform (WT) [27, 28, 29, 8, 30, 31] and statistical methods. Classifier models including Hidden Markov Models (HMM) [32, 33], Support Vector Machines (SVM) [34], Artificial Neural Networks (ANN) [8, 9, 27, 35, 11, 36] and mixture-of-experts methods [37] have also been studied. Among them, a large number of methods are based on ANNs due to their better modeling capacity. For example, L.Y. Shyu et al. proposed a method for detecting Ventricular Premature Contraction (VPC) using the wavelet transform and a fuzzy neural network [8]. By using the same wavelet for QRS detection and VPC classification, their method has lower computational complexity. I. Guler and E.D. Ubeyli proposed a combined neural network model for ECG beat classification [9]. Statistical features based on the discrete wavelet transform are extracted and used as the input of first-level networks; subsequent networks are then trained using the outputs of the previous-level networks as input. Unlike previous methods, T. Ince proposed a method which uses a robust and generic ANN architecture and trains a patient-specific model with morphological wavelet transform features and temporal features for each patient [27]. Besides, some approaches combine several hand-crafted features to provide enhanced performance [38, 39]. Though the above methods have achieved good performance, they exhibit some common drawbacks: 1) the hand-crafted features rely on the domain knowledge of experts and must be designed and tested carefully, and the classifier should have appropriate capacity to model such features; 2) the types of ECG signals considered are usually limited or coarse-grained, e.g., 2-5 types. On the one hand, for a new type of ECG pattern, the discriminative power of existing features has to be examined first, and new features may have to be elaborately designed again. On the other hand, their performance for fine-grained classification is still unclear, since it requires features with better discrimination and classifiers with greater modeling capacity.
In the past few years, deep learning based methods, including deep belief networks, deep convolutional neural networks and recurrent neural networks, have been widely used in many research fields and have achieved remarkable performance in tasks such as speech recognition, image classification and object detection. S. Kiranyaz et al. proposed a 1D convolutional neural network for patient-specific ECG classification [4]. They design a simple but effective network architecture and utilize 1D convolutions to process the ECG wave signal directly. B. Pourbabaee et al. utilize deep convolutional neural networks to learn ECG features for screening paroxysmal atrial fibrillation (PAF) patients. Their experimental results demonstrate the representation capability of deep CNNs. Recently, G. Clifford et al. organized the PhysioNet/Computing in Cardiology Challenge 2017 for AF rhythm classification from a short single-lead ECG recording. A large number of real-world ECG samples from patients were collected and labelled, which facilitates research on the challenging AF classification problem. Both hand-crafted feature based methods and deep learning based methods were proposed and reached the top entries [40, 41, 42]. For example, S. Hong et al. proposed an ensemble-classifier based method combining expert features and deep features [40]. T. Teijeiro et al. proposed a method combining two classifiers: the first evaluates the record globally using aggregated values for a set of high-level, clinically meaningful features, and the second is a Recurrent Neural Network fed with the individual features of each detected heartbeat [42]. M. Zabihi et al. proposed a hand-crafted feature extraction and selection method based on a random forest classifier [41].

In this paper, we propose a new deep CNN based method for ECG classification. Different from previous methods: 1) we first transform the original ECG signal into the time-frequency domain by the Short-Time Fourier Transform (STFT); 2) the time-frequency characteristics of each pattern are then learned by a CNN with 2D convolutions; 3) we propose an online decision fusion method to fuse past and current decisions from different models into a more accurate one; 4) we examine the proposed method for fine-grained ECG classification on a synthetic ECG dataset consisting of 20 types of ECG signals. Moreover, we also evaluate its performance on a real-world ECG dataset and compare it with state-of-the-art methods. The experimental results demonstrate the effectiveness and efficiency of the proposed method.
The rest of the paper is organized as follows. In Sect. II, we briefly formulate the ECG classification problem. Then, we present the proposed method in Sect. III, including the Short-Time Fourier Transform, the architecture of the proposed CNN and the online decision fusion method. In Sect. IV, we present experimental results on both a synthetic and a real-world ECG dataset to verify the effectiveness of the proposed method and compare its performance with state-of-the-art methods. Finally, we conclude the paper in Sect. V and point out some directions for future work.
II Problem formulation
Given a set of ECG signals and their corresponding labels, the target of a classification method is to predict the labels correctly. As depicted in Sect. I, it usually consists of two sequential modules: feature extraction and classifier training. Once the classifier is obtained, it can be used to predict labels of unseen samples, i.e., in the testing phase. Mathematically, we denote the set of ECG wave signals as:
$\mathcal{S} = \{(x_i, y_i) \mid i \in \Omega\}$,   (1)
where $x_i = [x_{i1}, \dots, x_{iL}] \in \mathbb{R}^L$ is the $i$-th sample and $L$ is the sample length. $y_i \in \{1, \dots, C\}$ is the category of $x_i$, and $C$ is the number of total categories. $\Omega$ is the index set of all samples. The feature extraction can be described as follows:
$f_i = \phi(x_i; \theta_\phi)$,   (2)
where $f_i \in \mathbb{R}^d$ is the corresponding feature representation of the signal $x_i$. Usually, the feature vector $f_i$ is more compact than the original signal $x_i$, i.e., $d \ll L$. $\phi(\cdot)$ is a mapping function from the original signal space to the feature space, and $\theta_\phi$ denotes the parameters associated with the mapping $\phi$. It is usually determined according to the domain knowledge of experts or by cross-validation. Given the feature representation, a classifier predicts the category as follows:
$\hat{y}_i = \psi(f_i; \theta_\psi)$,   (3)
where $\theta_\psi$ denotes the parameters associated with the classifier $\psi$, and $\hat{y}_i$ is the prediction. Frequently used classifiers include SVM [34], ANN [35, 11, 36], Random Forest [43], HMM [32, 33], DCNN [4], etc. Given the training samples, the training of a classifier can be formulated as an optimization problem over its parameters as follows:
$\theta_\psi^{*} = \arg\min_{\theta_\psi} \sum_{i \in \Omega_{train}} \ell(\psi(f_i; \theta_\psi), y_i)$,   (4)
where $\Omega_{train}$ is the index set of training samples, and $\ell(\cdot, \cdot)$ is a loss function which measures the loss of assigning a prediction $\hat{y}_i$ to a sample with label $y_i$, e.g., the margin loss in the SVM model and the cross-entropy loss in ANN or Random Forest models.

For deep neural network models, feature extraction (learning) and classifier training are integrated in the network architecture as an end-to-end model. The parameters are optimized on the training samples by using the error back-propagation algorithm. Mathematically, it can be formulated as:
$\Theta^{*} = \arg\min_{\Theta} \sum_{i \in \Omega_{train}} \ell(F(x_i; \Theta), y_i)$,   (5)
where $F(\cdot; \Theta)$ is the deep neural network model with parameters $\Theta$. A modern deep neural network architecture, e.g., a DCNN, usually consists of many sequential layers such as convolutional layers, pooling layers, non-linear activation layers and fully-connected layers. Therefore, $F$ is a non-linear mapping function with powerful modeling capacity which can map the original high-dimensional input data to a low-dimensional feature space, where the data are more discriminative, representative and compact.
III The proposed approach
In this paper, we propose a new deep CNN method for fine-grained ECG classification. First, the ECG wave signal is transformed to the time-frequency domain by the Short-Time Fourier Transform. Next, specific DCNN models are trained on training samples of specific lengths. Finally, an online decision fusion method is proposed to fuse past and current decisions from different models into a more accurate one. Figure 1 shows the pipeline of the proposed method. We present the details in the following parts.
III-A Short-Time Fourier Transform
Though wave signals in the original time domain can be used as the input of a DNN to learn features, a time-frequency representation may be a better choice. Inspired by work in the speech recognition area [44, 45], which shows that spectrogram features of speech are superior to MFCC features when used with DNNs, we first transform the original ECG wave signal into the time-frequency domain by the Short-Time Fourier Transform to obtain the ECG spectrogram representation. Mathematically, it can be described as follows:
$X_i(m, \omega) = \sum_{n=-\infty}^{\infty} x_i(n)\, w(n-m)\, e^{-j\omega n}$,   (6)
where $w(\cdot)$ is the window function, e.g., a Hamming window, and $X_i$ is the spectrogram of $x_i$, which has a two-dimensional structure. Figure 2 shows spectrogram examples.
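As a concrete sketch of this step using SciPy, with the window settings reported later in Sect. IV-A1 (Hamming window of length 256, overlap 128); the sinusoidal input here is only a placeholder for a real ECG segment, and the exact spectrogram dimensions fed to the CNN (1x32x4 in Table I) suggest additional cropping or downsampling not detailed here:

```python
import numpy as np
from scipy.signal import stft

# Placeholder 1-s "ECG" segment at the paper's 512 Hz sampling rate.
fs = 512
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1.2 * t) + 0.1 * np.random.randn(fs)

# Hamming window of length 256 with 128-sample overlap (Sect. IV-A1 settings).
f, frames, Z = stft(x, fs=fs, window="hamming", nperseg=256, noverlap=128)
spectrogram = np.abs(Z)  # 2-D time-frequency magnitude representation
print(spectrogram.shape)  # (frequency bins, time frames)
```

The magnitude of the complex STFT output is taken so the network receives a real-valued 2-D array.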
III-B Architecture of the proposed CNN
Table I: Architecture and complexity of the proposed network and the network in [4].

Network | Type | Input Size | Number | Filter | Pad | Stride
Proposed Network | Conv1 | 1x32x4 | 32 | 3x3 | 1 | 1
 | Pool1 | 32x32x4 | - | 4x1 | 0 | (4,1)
 | Conv2 | 32x8x4 | 32 | 3x3 | 1 | 1
 | Pool2 | 32x8x4 | - | 4x2 | 0 | (4,2)
 | Fc3 | 32x2x2 | 64 | - | - | -
 | Fc4 | 64x1x1 | 20 | - | - | -
 | Params | 18,976
 | Complexity(1) | 3.5x10^5
Network in [4] | Conv1 | 1x1x512 | 32 | 1x15 | (0,7) | (1,6)
 | Conv2 | 32x1x86 | 16 | 1x15 | (0,7) | (1,6)
 | Conv3 | 16x1x15 | 16 | 1x15 | (0,7) | (1,6)
 | Pool3 | 16x1x3 | - | 1x3 | 0 | 1
 | Fc4 | 16x1x1 | 10 | - | - | -
 | Fc5 | 10x1x1 | 20 | - | - | -
 | Params | 12,360
 | Complexity | 1.7x10^5

(1) Evaluated in FLOPs, i.e., the number of floating-point multiply-adds.
Since we use the two-dimensional spectrogram as input, we design a deep CNN architecture with 2D convolutions. Specifically, the proposed architecture consists of convolutional layers and two fully-connected layers, with a max pooling layer and a ReLU layer after each of the first convolutional layers and a max pooling layer after the last convolutional layer. Details are shown in Table I. As can be seen, the network is rather lightweight, with 18,976 parameters and 3.5x10^5 FLOPs. We also present the network architecture of [4] as a comparison; its feature channels are kept the same, while the filter sizes and strides are adapted to the data length used in this paper. It can be seen that the proposed network has a comparable number of parameters and computational cost to the one in [4]. As we will show in Sect. IV, the proposed method is computationally efficient and can achieve real-time performance even on an embedded device.

With the spectrogram $X_i$ as input, the DCNN model predicts a probability vector $p_i = [p_{i1}, \dots, p_{iC}]$ for a $C$-category classification problem, subject to $p_{ic} \geq 0$ and $\sum_{c=1}^{C} p_{ic} = 1$. Then the model parameters can be learned by minimizing the cross-entropy loss as follows:

$\Theta^{*} = \arg\min_{\Theta} -\sum_{i \in \Omega_{train}} \sum_{c=1}^{C} t_{ic} \log p_{ic}$,   (7)

where $t_i$ is the one-hot vector representation of the label $y_i$, i.e., $t_{ic} = 1$ if $c = y_i$ and $t_{ic} = 0$ otherwise.
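A minimal PyTorch sketch of the Table I layer list (the paper's implementation uses Caffe; initialization and other training details are assumptions, and this sketch's exact parameter count may differ slightly from the reported 18,976):

```python
import torch
import torch.nn as nn

# Sketch of the Table I architecture: input is a 1x32x4 spectrogram,
# output is a 20-way class score vector.
class SpectrogramCNN(nn.Module):
    def __init__(self, num_classes=20):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),       # Conv1: 1x32x4 -> 32x32x4
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(4, 1), stride=(4, 1)),  # Pool1: -> 32x8x4
            nn.Conv2d(32, 32, kernel_size=3, padding=1),      # Conv2: -> 32x8x4
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(4, 2), stride=(4, 2)),  # Pool2: -> 32x2x2
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 2 * 2, 64),   # Fc3
            nn.ReLU(),
            nn.Linear(64, num_classes),  # Fc4
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SpectrogramCNN()
logits = model(torch.randn(8, 1, 32, 4))  # a batch of 8 spectrograms
print(logits.shape)  # torch.Size([8, 20])
```

Training with the cross-entropy loss of Eq. (7) would then use `nn.CrossEntropyLoss()` on these logits, which applies the softmax normalization internally.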
It should be noted that, given the window function, the width of the spectrogram is related to the length of the ECG wave signal. Given the sampling rate, signals of longer length contain more beats. Usually, a single beat is detected and classified [9]; since more beats carry more information, they can lead to a more accurate result. In this paper, the length of each sample in the synthetic ECG dataset is 16384 points at a sampling rate of 512 Hz, i.e., 32 s. We split each sample into subsamples with the same length of 512 points; therefore, each subsample lasts 1 s and contains one or more beats. It is noteworthy that we do not explicitly extract beats from the raw signal but use it directly as input after the Short-Time Fourier Transform. Then, we train our CNN model on this dataset. Besides, to compare the performance of models for longer samples, we also split each sample into subsamples of different lengths, e.g., 2 s, 4 s, 8 s and 16 s. We use them to train our CNN models correspondingly and refer to these length-specific models by their level, where level $k$ corresponds to subsamples of $2^{k-1}$ s (level 1: 1 s; level 6: the full 32 s). Though training samples of different lengths correspond to spectrograms of different widths, we use the same architecture for all the above models and only change the pooling strides along the columns correspondingly, while keeping the fully-connected layers fixed.
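The splitting into fixed-length subsamples described above might be implemented as follows (the reshape-based implementation and the random placeholder signal are assumptions for illustration):

```python
import numpy as np

# One 32-s recording: 16384 points at 512 Hz, as in the synthetic dataset.
fs = 512
signal = np.random.randn(16384)  # placeholder for a real ECG recording

def split_into_subsamples(x, seconds, fs=512):
    """Return non-overlapping segments of `seconds` duration each."""
    seg_len = int(seconds * fs)
    n = len(x) // seg_len
    return x[: n * seg_len].reshape(n, seg_len)

for sec in (1, 2, 4, 8, 16):
    segs = split_into_subsamples(signal, sec)
    print(sec, segs.shape)  # 1 s -> (32, 512), 2 s -> (16, 1024), ...
```

Each row of the returned array would then be passed through the STFT of Sect. III-A before being fed to the corresponding level-specific model.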
III-C Online decision fusion method
For online testing, as the length of the signal grows, we can test it at different times by using the above models in a sequential manner. As illustrated in Fig. 1, lower-level models make decisions based on local signals of shorter lengths, while higher-level models make decisions based on global signals of longer lengths. These models can be seen as different experts which focus on different volumes of information. Their decisions may be complementary and can probably be fused into a more accurate one. To this end, we propose an online decision fusion method. Mathematically, it can be described as follows:
$\hat{p} = \sum_{k=1}^{K} w_k \frac{1}{n_k} \sum_{j=1}^{n_k} F_k(x_{kj}; \Theta_k)$,   (8)

where $\hat{p}$ is the fusion result and $K$ is the maximum level for a signal of a specific length. $x_{kj}$ is the $j$-th segment of $x$ for the level-$k$ model $F_k$, and $n_k$ is the number of segments at level $k$. For example, when the total length of $x$ at the present moment is 2048, $K$ will be 3, and $n_1, n_2, n_3$ will be 4, 2 and 1, respectively. $w_k$ is the fusion weight for the level-$k$ model, subject to $\sum_{k=1}^{K} w_k = 1$. As can be seen from Eq. (8), the decisions on segments at the same level are treated equally and averaged. This is reasonable since there is no prior knowledge about individual segments and each decision is made by the same model. For decisions from models at different scales, we assign a weight to each of them. In the experimental part, we compare their influence on the final fusion result.

IV Experiments
In this section, we present the experimental results on both synthetic and real-world datasets and compare the proposed method with previous ones. First, we conduct extensive experiments on a synthetic dataset constructed using an ECG simulator to inspect the proposed method for fine-grained ECG classification. Then, we evaluate the performance of the proposed method on a real-world dataset for distinguishing Atrial Fibrillation signals from three other types. For clarity, we present them in the following two subsections, respectively.
First, we present the definitions of the measures used in this paper. The notation for each element of the confusion matrix is defined in Table II. Then, the Accuracy, Sensitivity, Specificity and F1 score can be calculated as follows:

$\text{Accuracy} = \frac{\sum_{i=1}^{C} N_{ii}}{\sum_{i=1}^{C} \sum_{j=1}^{C} N_{ij}}$,   (9)

where $N_{ij}$ is the number of samples which belong to the $i$-th category and are predicted as the $j$-th category, and $C$ is the number of categories.

$\text{Sensitivity}_k = \frac{N_{kk}}{\sum_{j=1}^{C} N_{kj}}$,   (10)

where class $k$ represents one of the symptomatic classes, e.g., RAF, FAF, etc.

$\text{Specificity} = \frac{N_{nn}}{\sum_{j=1}^{C} N_{nj}}$,   (11)

where class $n$ represents the normal class.

$\text{Precision}_k = \frac{N_{kk}}{\sum_{i=1}^{C} N_{ik}}$,   (12)

$\text{F1}_k = \frac{2 \cdot \text{Precision}_k \cdot \text{Sensitivity}_k}{\text{Precision}_k + \text{Sensitivity}_k}$.   (13)
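As a sketch, these measures can be computed from a confusion matrix as follows (the two-class toy matrix and the index of the normal class are assumptions for illustration):

```python
import numpy as np

# Computing the measures of Eqs. (9)-(13) from a confusion matrix N
# (rows: ground-truth categories, columns: predicted categories).
def metrics_from_confusion(N, normal_class=0):
    N = np.asarray(N, dtype=float)
    accuracy = np.trace(N) / N.sum()   # Eq. (9)
    tp = np.diag(N)
    fn = N.sum(axis=1) - tp            # samples of a class predicted elsewhere
    fp = N.sum(axis=0) - tp            # samples wrongly predicted as a class
    sensitivity = tp / (tp + fn)       # Eq. (10), per class
    # Eq. (11): fraction of normal-class samples classified as normal.
    specificity = tp[normal_class] / (tp[normal_class] + fn[normal_class])
    precision = tp / (tp + fp)         # Eq. (12), per class
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (13)
    return accuracy, sensitivity, specificity, f1

N = [[50, 2], [3, 45]]  # toy 2-class confusion matrix (class 0 = normal)
acc, se, sp, f1 = metrics_from_confusion(N)
print(round(acc, 3))  # 0.95
```

With the toy matrix, 95 of 100 samples lie on the diagonal, giving an accuracy of 0.95.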
Table II: Notation of the elements of the confusion matrix, where $N_{ij}$ denotes the number of samples with ground-truth category $i$ that are predicted as category $j$.

 | Predicted Labels
Ground Truth Labels | Class 1 | Class 2 | ... | Class C
Class 1 | $N_{11}$ | $N_{12}$ | ... | $N_{1C}$
Class 2 | $N_{21}$ | $N_{22}$ | ... | $N_{2C}$
... | ... | ... | ... | ...
Class C | $N_{C1}$ | $N_{C2}$ | ... | $N_{CC}$
Sensitivity
Methods | RAF | FAF | AF | SA | AT | ST | PAC | VB | VTr | PVCCI
SVM+FFT | 0.95±0.02 | 0.73±0.01 | 0.77±0.08 | 0.77±0.05 | 0.85±0.02 | 0.78±0.10 | 0.76±0.08 | 0.81±0.04 | 0.71±0.10 | 0.82±0.03
1D CNN [4] | 0.96±0.01 | 0.92±0.04 | 0.84±0.06 | 0.39±0.03 | 0.90±0.01 | 0.96±0.02 | 0.20±0.09 | 0.92±0.03 | 0.19±0.11 | 0.87±0.02
Proposed | 0.97±0.01 | 0.99±0.01 | 0.94±0.01 | 0.71±0.02 | 0.94±0.01 | 0.99±0.01 | 0.59±0.20 | 0.93±0.01 | 0.18±0.19 | 0.89±0.03
SVM+CNN Feature | 0.98±0.01 | 0.99±0.01 | 0.97±0.01 | 0.75±0.01 | 0.97±0.01 | 0.99±0.01 | 0.82±0.04 | 0.96±0.01 | 0.34±0.12 | 0.95±0.01

Sensitivity | Specificity
Methods | VTa | RVF | FVF | AVBI | AVBII | AVBIII | RBBB | LBBB | PVC | N
SVM+FFT | 0.84±0.02 | 0.86±0.03 | 0.78±0.06 | 0.82±0.02 | 0.84±0.05 | 0.76±0.07 | 0.72±0.09 | 0.81±0.05 | 0.76±0.10 | 0.86±0.02
1D CNN [4] | 0.97±0.01 | 0.97±0.02 | 0.96±0.02 | 0.70±0.17 | 0.94±0.03 | 0.68±0.05 | 0.91±0.04 | 0.98±0.01 | 0.68±0.05 | 0.95±0.01
Proposed | 0.99±0.01 | 0.99±0.01 | 0.98±0.01 | 0.77±0.15 | 0.92±0.03 | 0.86±0.04 | 0.95±0.01 | 0.98±0.01 | 0.72±0.05 | 0.96±0.01
SVM+CNN Feature | 1.00±0.00 | 1.00±0.00 | 0.99±0.01 | 0.95±0.01 | 0.98±0.01 | 0.93±0.03 | 0.97±0.01 | 0.98±0.01 | 0.75±0.04 | 0.98±0.01
Table III: Sensitivity and specificity of different methods on the training set. Mean scores and standard deviations (mean±std) are reported.

Sensitivity
Methods | RAF | FAF | AF | SA | AT | ST | PAC | VB | VTr | PVCCI
SVM+FFT | 0.88±0.11 | 0.44±0.14 | 0.50±0.21 | 0.69±0.13 | 0.53±0.27 | 0.56±0.21 | 0.57±0.13 | 0.42±0.29 | 0.38±0.14 | 0.54±0.24
1D CNN [4] | 0.96±0.01 | 0.93±0.04 | 0.83±0.07 | 0.38±0.04 | 0.90±0.02 | 0.95±0.03 | 0.20±0.05 | 0.93±0.02 | 0.15±0.15 | 0.85±0.03
Proposed | 0.98±0.01 | 0.99±0.01 | 0.95±0.03 | 0.72±0.02 | 0.94±0.02 | 0.96±0.03 | 0.62±0.22 | 0.94±0.01 | 0.32±0.23 | 0.91±0.02
SVM+CNN Feature | 0.97±0.01 | 0.98±0.01 | 0.96±0.01 | 0.72±0.02 | 0.95±0.01 | 0.96±0.04 | 0.85±0.03 | 0.96±0.01 | 0.02±0.02 | 0.95±0.03
Proposed (Fusion) | 0.99±0.01 | 1.00±0.00 | 0.99±0.01 | 1.00±0.00 | 1.00±0.00 | 1.00±0.00 | 1.00±0.00 | 1.00±0.00 | 0.99±0.01 | 0.99±0.01

Sensitivity | Specificity
Methods | VTa | RVF | FVF | AVBI | AVBII | AVBIII | RBBB | LBBB | PVC | N
SVM+FFT | 0.73±0.09 | 0.60±0.18 | 0.45±0.32 | 0.50±0.30 | 0.66±0.14 | 0.56±0.21 | 0.46±0.07 | 0.48±0.33 | 0.40±0.24 | 0.60±0.18
1D CNN [4] | 0.96±0.01 | 0.96±0.02 | 0.94±0.04 | 0.50±0.35 | 0.93±0.08 | 0.64±0.07 | 0.89±0.10 | 0.98±0.01 | 0.69±0.05 | 0.95±0.02
Proposed | 0.99±0.01 | 0.98±0.02 | 0.99±0.01 | 0.56±0.30 | 0.93±0.06 | 0.87±0.05 | 0.95±0.02 | 0.98±0.01 | 0.57±0.25 | 0.96±0.02
SVM+CNN Feature | 0.98±0.01 | 0.97±0.02 | 0.98±0.02 | 0.48±0.15 | 0.95±0.05 | 0.89±0.05 | 0.95±0.02 | 0.98±0.01 | 0.57±0.22 | 0.97±0.02
Proposed (Fusion) | 1.00±0.00 | 1.00±0.00 | 0.99±0.01 | 0.86±0.20 | 1.00±0.00 | 0.95±0.05 | 0.98±0.02 | 1.00±0.00 | 0.99±0.01 | 0.99±0.01

Table IV: Sensitivity and specificity of different methods on the test set. Mean scores and standard deviations (mean±std) are reported.
IV-A Experiments on a synthetic ECG dataset

IV-A1 Dataset and parameter settings
To verify the effectiveness of the proposed method, we construct a synthetic dataset using an ECG simulator, which can generate different types of ECG signals under different parameter settings. In this paper, we choose 20 categories of ECG signals: Normal (N), Rough Atrial Fibrillation (RAF), Fine Atrial Fibrillation (FAF), Atrial Flutter (AF), Sinus Arrhythmia (SA), Atrial Tachycardia (AT), Supraventricular Tachycardia (ST), Premature Atrial Contraction (PAC), Ventricular Bigeminy (VB), Ventricular Trigeminy (VTr), Premature Ventricular Contraction Coupling Interval (PVCCI), Ventricular Tachycardia (VTa), Rough Ventricular Fibrillation (RVF), Fine Ventricular Fibrillation (FVF), Atrio-Ventricular Block I (AVBI), Atrio-Ventricular Block II (AVBII), Atrio-Ventricular Block III (AVBIII), Right Bundle Branch Block (RBBB), Left Bundle Branch Block (LBBB) and Premature Ventricular Contraction (PVC). There are 2426 samples in total, about 120 samples per category. Each sample has a maximum length of 16384 points at a sampling frequency of 512 Hz. In the following experiments, we use 3-fold cross-validation to evaluate the proposed method.
Parameters are set as follows. We use a Hamming window of length 256 in the Short-Time Fourier Transform, with an overlap of 128 samples. The CNN model is trained for a total of 20,000 iterations with a batch size of 128. The learning rate is halved from 0.01 down to 6.25x10^-4 every 5,000 iterations. The momentum and the weight decay parameter are set to 0.9 and 5x10^-4, respectively. We implemented the proposed method in Caffe [46]. All experiments are conducted on a workstation with Nvidia GTX Titan X GPUs if not specified otherwise.

IV-A2 Comparisons with previous methods
We compare the performance of the proposed method with previous methods, including an SVM based on the Fourier transform, the pilot deep CNN method in [4] which uses 1D convolutions, and an SVM based on the features learned by the proposed method. We report the sensitivity and specificity scores of the different methods on the training and test sets for all categories, as well as the average classification accuracies. The standard deviations of each index over the 3-fold cross-validation are also reported. Results are summarized in Table III, Table IV and Table V.
Table V: Average classification accuracies (standard deviations) of different methods on the synthetic dataset.

Methods | Training Set | Test Set
SVM+FFT | 0.81 (0.04) | 0.56 (0.12)
1D CNN [4] | 0.83 (0.01) | 0.81 (0.04)
Proposed | 0.88 (0.03) | 0.87 (0.03)
SVM+CNN Feature | 0.93 (0.01) | 0.87 (0.02)
Proposed (Fusion) | - | 0.99 (0.01)
It can be seen that the method in [4] outperforms the traditional method based on FFT features and an SVM classifier. However, it is inferior to the proposed method, which uses the two-dimensional spectrogram as input. By learning time-frequency features from the spectrogram, the proposed method achieves better classification accuracy. In addition, we use the features learned by the proposed method to train an SVM classifier; the corresponding results are denoted as "SVM+CNN Feature". Compared with the SVM with FFT features, the performance of this classifier is significantly boosted, which confirms that the proposed method learns a more discriminative feature representation of the ECG signal. Interestingly, it is marginally better than the proposed CNN model, which in effect uses a linear classifier in its last layer. This is reasonable since a more sophisticated non-linear radial basis function kernel is used in the SVM classifier. But it also shows a tendency toward overfitting, since a larger gain is achieved on the training set.
Moreover, from Table III and Table IV, we can see that the categories SA, PAC, VTr, AVBI and PVC are hard to distinguish. We shed light on the reason by analyzing the learned features through a visualization technique [47, 48, 49] as well as the confusion matrix between categories.
IV-A3 Analysis on learned features and confusion matrix between categories
To show the effectiveness of the proposed method in learning feature representations, we plot the learned features to inspect them visually. We obtain the learned features of all training data by calculating the responses of the penultimate layer. Then, we employ the t-Distributed Stochastic Neighbor Embedding (t-SNE) method proposed in [47, 48, 49] to visualize them. The visualization results for all categories are shown in the upper-left subfigure of Fig. 3. As can be seen, some categories, such as Normal (N), RVF, FVF, ST, RBBB, LBBB, PVCCI, VTa and RAF, are clustered and well separated from the other categories. However, some categories, such as SA, PAC, PVC, VTr, AVBI and AVBIII, are mixed with other categories, as indicated by the red circles. To show this more clearly, we select the categories concerned and plot only their feature visualization results in the subsequent subfigures of Fig. 3. For example, SA tends to be mixed with AT and PVC, and PVC tends to be mixed with PAC and VB. Nevertheless, they are well separated from the Normal category, which implies that the proposed method can predict the Normal category correctly. This is consistent with the high specificity scores in Table III and Table IV.
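A sketch of this inspection with scikit-learn's t-SNE, using random placeholder features in place of the actual penultimate-layer responses (the feature dimension 64 matches Fc3 in Table I; sample count and perplexity are illustrative assumptions):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))   # stand-in for penultimate-layer responses
labels = rng.integers(0, 20, size=200)  # 20 fine-grained categories

# Embed the 64-D features into 2-D for visual inspection.
embedded = TSNE(n_components=2, perplexity=30,
                init="pca", random_state=0).fit_transform(features)
print(embedded.shape)  # (200, 2)
# embedded[:, 0] and embedded[:, 1] can then be scattered, colored by `labels`.
```

On real learned features, well-separated clusters in this 2-D embedding indicate discriminative representations, while overlapping clusters point to confusable categories.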
In addition, we also calculate the confusion matrix between categories. The results are shown in Fig. 4(a), where each row gives the numbers of samples of a given category assigned to each category. This makes clearer which categories tend to be confused with others, e.g., SA, VTr, AVBI and PVC. The results are consistent with the visual inspection results in Fig. 3. The above visual inspection method and the confusion matrix are convenient tools for understanding the confusion between fine-grained categories, which is very helpful for real applications.
IV-A4 Online decision fusion performance
We test the fusion method of Sect. III-C at different levels $K$: 2, 3, 4, 5 and 6. Two kinds of fusion weights are used: a uniform one and one preferring high-level models. The fusion weights in the latter case are calculated according to the following equation:
(14) 
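A minimal sketch of the online fusion rule of Eq. (8): per level, the probability vectors predicted on the segments are averaged, then combined across levels by the fusion weights. The toy predictions and the uniform weights here are illustrative assumptions:

```python
import numpy as np

def fuse_decisions(level_predictions, level_weights):
    """level_predictions: list over levels; entry k is an array of shape
    (n_k segments, num_classes). level_weights must sum to 1."""
    fused = sum(
        w * np.mean(preds, axis=0)  # equal treatment of same-level segments
        for w, preds in zip(level_weights, level_predictions)
    )
    return int(np.argmax(fused)), fused

preds = [np.array([[0.6, 0.4], [0.8, 0.2]]),  # level 1: two short segments
         np.array([[0.7, 0.3]])]              # level 2: one longer segment
label, fused = fuse_decisions(preds, level_weights=[0.5, 0.5])
print(label, fused)  # 0 [0.7 0.3]
```

Replacing the uniform `level_weights` with weights computed from Eq. (14) gives the variant that prefers high-level models.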
Fig. 5 shows the means and standard deviations of the classification accuracies of the proposed fusion methods and the proposed single-scale models, respectively. First, it can be seen that an intermediate-level model achieves the best performance among all the single models at different levels. The reason may be that it makes a trade-off between data length and the number of model decisions. Compared with the level-1 model, its input data length is 16 times larger; compared with the highest-level model, which only makes a single decision on the whole sequence, it can make 4 decisions on four different subsequences, and these decisions can be fused into a more accurate one.
Second, it can be seen that the fusion results are consistently better than the results of the single models, and the performance improves consistently with the growth of the data length. This supports the idea that fusing decisions from different models may lead to a more robust and accurate decision, since these models are trained on samples covering different scopes of the original data. Besides, it can be seen that using non-uniform weights does not provide any advantage over uniform ones. We hypothesize that the non-uniform weighting strategy prefers the model with the largest amount of data and neglects the decisions from the models with smaller amounts of data. Though it is better than the single model, the gains are indeed very marginal; especially at higher levels, the performance is largely dominated by the model at the highest level.
In conclusion, the proposed online decision fusion method with uniform weights at level 6 achieves the best result, i.e., the highest average classification accuracy and the lowest standard deviation. For example, the accuracy is boosted from 87% (single model at level 1) to 99%, and the standard deviation is reduced from 0.03 (single model at level 1) to 0.011. Its sensitivity and specificity indices are shown in Table IV and represent a significant boost over the other methods. The confusion matrix in Fig. 4(b) shows similar results. These experimental results clearly demonstrate the effectiveness of the proposed online decision fusion method.
Table VI: Running times (in ms) of the proposed method at different levels.

Level | Titan X GPU | Titan X CPU | TX2 GPU | TX2 CPU | TX2 (batch size x10) GPU | TX2 (batch size x10) CPU
1 | 0.01 | 0.17 | 0.17 | 0.46 | 0.12 | 0.44
2 | 0.03 | 0.21 | 0.27 | 0.59 | 0.13 | 0.55
3 | 0.05 | 0.26 | 0.37 | 0.80 | 0.14 | 0.72
4 | 0.10 | 0.42 | 1.22 | 1.57 | 0.18 | 1.39
5 | 0.21 | 0.82 | 2.01 | 2.91 | 0.27 | 2.79
6 | 0.33 | 1.33 | 2.73 | 5.38 | 0.56 | 5.66
IV-A5 Computational complexity and running time analysis
As depicted in Table I, the total computational cost of the proposed architecture is only 3.5x10^5 FLOPs. We record the running times of the proposed method in GPU and CPU modes, respectively. Results are shown in Table VI. It can be seen that the running time is only 0.33 ms even when testing the whole sequence (level 6). To further examine the computational efficiency of the proposed method, we test it on the NVIDIA Jetson TX2 embedded board in both GPU and CPU modes. Again, the proposed method achieves real-time speed. Interestingly, the running times in GPU mode and CPU mode are comparable. We hypothesized that enlarging the batch size might take full advantage of the GPU acceleration; indeed, after enlarging the batch size 10 times, the superiority of the GPU mode is significant. Generally, these results imply that the proposed method is very efficient and is promising for integration into a portable ECG monitor with limited computational resources.
IV-B Experiments on a real-world ECG dataset
In addition, we conducted extensive experiments on a real-world ECG dataset which was used as the benchmark dataset of the PhysioNet/Computing in Cardiology Challenge 2017 [50]. The dataset consists of three parts: a training set, a validation set and a test set. The training set contains 8,528 single-lead ECG recordings lasting from 9 s to just over 60 s. The validation set and test set contain 300 and 3,658 ECG recordings of similar lengths, respectively. The ECG recordings were sampled at 300 Hz. Each sample is labelled with one of four categories: Normal rhythm, AF rhythm, Other rhythm and Noisy recording. Some examples of the ECG waveforms in the PhysioNet dataset are shown in Fig. 6. Only the labels of the training and validation sets are publicly available; the labels of the test set are kept private, and the corresponding results had to be submitted to the test server during the challenge. More details can be found in [50, 51].
We train our model on the training set. Scores on both the training set and the validation set are reported and compared with the top entries of the challenge. It is noteworthy that we add two more convolutional layers, one after each of the first and second convolutional layers of the network depicted in Table I. This leads to a network architecture with stronger modelling capacity, which can handle the real-world ECG signals better. The numbers of convolutional filters and the kernel sizes are the same as those of their preceding counterparts. The first fully-connected layer is kept the same, and the output size of the last fully-connected layer is changed to four to match the number of categories. Each sample in the dataset is cropped or duplicated to a length of 16,384 points. All other hyper-parameters are kept the same as in the above subsections if not specified otherwise. We train the model at each level three times with random seeds, and report the average scores and standard deviations.
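The crop-or-duplicate preprocessing might look like the following sketch; tiling the signal to fill the target length is an assumption, as the text only states that samples are cropped or duplicated:

```python
import numpy as np

def fix_length(x, target=16384):
    """Crop long recordings; tile short ones to reach `target` points."""
    x = np.asarray(x)
    if len(x) >= target:
        return x[:target]                 # crop
    reps = int(np.ceil(target / len(x)))
    return np.tile(x, reps)[:target]      # duplicate

print(fix_length(np.arange(20000)).shape)  # (16384,)
print(fix_length(np.arange(3000)).shape)   # (16384,)
```

This keeps every input the same size so that one spectrogram shape, and hence one network input shape, serves the whole dataset.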
IV-B1 Performance of the proposed online fusion model and comparisons with the top entries in the challenge
We report the experimental results on the aforementioned dataset in terms of mean accuracies and F1 scores. To keep consistent with the evaluation protocol in [50, 51], we report the average F1 score over the first three categories. In addition, we also include the average F1 score over all categories. The results are plotted in Fig. 7 and Fig. 8. It can be seen that the best results are achieved at levels 4 and 5 by the proposed online fusion method. Meanwhile, the results of the single models are also competitive, with the best single-model results achieved at level 2. This is consistent with the experimental results in Sect. IV-A4, where neither the lowest-level nor the highest-level model achieves the best results: the intermediate model makes a trade-off between data length and the number of model decisions. The comparison between the proposed method and the top entries of the challenge is shown in Table VII. It can be seen that the proposed methods achieve comparable or better results than the top entries on both the training set and the validation set.
Table VII: Comparison with the top entries in the challenge.

| Rank | Entry | Accuracy (Val) | Accuracy (Train) | F1 score (Val) | F1 score (Train) | F1 score, all categories (Val) | F1 score, all categories (Train) |
|---|---|---|---|---|---|---|---|
| 1 | Teijeiro et al. [42] | – | – | 0.912 | 0.893 | – | – |
| 1 | Datta et al. | – | – | 0.990 | 0.970 | – | – |
| 1 | Zabihi et al. [41] | – | – | 0.968 | 0.951 | – | – |
| 1 | Hong et al. [40] | – | – | 0.990 | 0.970 | – | – |
| 5 | Baydoun et al. | – | – | 0.859 | 0.965 | – | – |
| 5 | Bin et al. | – | – | 0.870 | 0.875 | – | – |
| 5 | Zihlmann et al. | – | – | 0.913 | 0.889 | – | – |
| 5 | Xiong et al. | – | – | 0.905 | 0.877 | – | – |
| – | Proposed (level 4) | 0.992±0.002 | 0.998±0.001 | 0.989±0.002 | 0.996±0.002 | 0.992±0.002 | 0.995±0.003 |
| – | Proposed (fusion, level 4) | 1.0±0.0 | 0.999±0.001 | 1.0±0.0 | 0.994±0.006 | 1.0±0.0 | 0.991±0.009 |
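The challenge-style metric reported above (average F1 over the first three categories, plus an all-category average) can be computed with a minimal from-scratch sketch; the label encoding 0=Normal, 1=AF, 2=Other rhythm, 3=Noisy and the toy predictions are hypothetical, chosen only for illustration:

```python
import numpy as np

def f1_per_class(y_true, y_pred, labels):
    """Per-class F1 scores, returned in the order given by `labels`."""
    scores = []
    for c in labels:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return np.array(scores)

# Hypothetical labels: 0=Normal, 1=AF, 2=Other rhythm, 3=Noisy.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 0, 1, 2, 2, 2, 3, 0])
f1 = f1_per_class(y_true, y_pred, labels=[0, 1, 2, 3])
print(f1[:3].mean())   # challenge-style F1: average over the first three classes
print(f1.mean())       # average F1 over all four categories
```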
IV-B2 Analysis of the learned features
As in Sect. IV-A3, we also visualize the features learned by the model. We obtain the learned features of all samples in the validation set by computing the responses of the penultimate layer, and then employ the t-Distributed Stochastic Neighbor Embedding (t-SNE) method proposed in [47, 48, 49] to inspect them visually. The visualization results for all categories are shown in Fig. 9. As can be seen, the samples in each category are mostly clustered together and separated from the other clusters. Several samples in the “Other rhythm” category lie near the “Normal” and “Noisy” clusters, implying that those samples are either mislabeled, and should be carefully checked, or easily confused with other categories, and should be carefully handled.
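The visualization step can be sketched as follows, using synthetic stand-ins for the penultimate-layer responses; the feature dimensionality and the t-SNE hyperparameters (perplexity, random seed) are illustrative choices, not the ones used in the paper:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for penultimate-layer responses: in practice these would be
# activations collected from the trained network on the validation set.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 64))
                      for c in range(4)])          # 4 synthetic "categories"
labels = np.repeat(np.arange(4), 30)

# Embed the 64-D features into 2-D for visual inspection (e.g. a scatter
# plot colored by `labels`).
embedding = TSNE(n_components=2, perplexity=15,
                 random_state=0).fit_transform(features)
print(embedding.shape)  # (120, 2)
```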
In addition, we plot the spectrograms and the corresponding response maps of several convolutional and pooling layers for each category in Fig. 10. As can be seen, the first convolutional layer acts as a basic feature extractor that strengthens the useful parts of the spectrograms. Features corresponding to low and medium frequencies are then pooled and contribute more to the final classification. From the response maps of the deeper layers, we can see that the proposed network generates strong responses in specific frequency zones and accumulates useful features along the temporal axis. Together with the online fusion process, this enables the proposed network to learn effective and discriminative features and make accurate classifications.
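A minimal sketch of computing the magnitude spectrogram that such a network consumes, assuming a 300 Hz sampling rate (that of the PhysioNet 2017 recordings) and an illustrative 256-sample STFT window; the toy two-tone signal merely stands in for an ECG trace:

```python
import numpy as np
from scipy.signal import stft

fs = 300                      # Hz; PhysioNet/CinC 2017 sampling rate
t = np.arange(0, 10, 1 / fs)  # 10 s of signal
# Toy stand-in for an ECG trace: a slow rhythm plus higher-frequency activity.
x = np.sin(2 * np.pi * 1.2 * t) + 0.3 * np.sin(2 * np.pi * 25 * t)

# Magnitude spectrogram; the window length (nperseg) is an assumption.
f, seg_t, Zxx = stft(x, fs=fs, nperseg=256)
spectrogram = np.abs(Zxx)     # shape: (frequency bins, time frames)
print(spectrogram.shape)
```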
V Conclusion and future work
In this paper, we introduce a novel deep CNN based method for fine-grained ECG signal classification. It learns discriminative feature representations in the time-frequency domain obtained by computing the Short-Time Fourier Transform of the original wave signal. In addition, the proposed online decision fusion method fuses past and current decisions from different models into a more accurate one. Experimental results on a synthetic 20-category ECG dataset and a real-world AF classification dataset demonstrate the effectiveness of the proposed method. Moreover, the proposed method is computationally efficient and promising for integration into a portable ECG monitor with limited computational resources.
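The kind of online decision fusion summarized above can be sketched as follows, assuming plain averaging of class-probability vectors; the paper's exact fusion rule may weight past and current decisions differently:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax over the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def online_fuse(decision_history):
    """Fuse the decisions produced so far into a single probability vector.

    `decision_history` is a list of per-model class-probability vectors;
    plain averaging is an illustrative assumption.
    """
    return np.mean(decision_history, axis=0)

# Three successive 4-class decisions (e.g. from models at different levels).
history = [softmax(np.array([2.0, 0.1, 0.0, -1.0])),
           softmax(np.array([1.5, 0.3, 0.2, -0.5])),
           softmax(np.array([2.2, 0.0, 0.1, -1.2]))]
fused = online_fuse(history)
print(fused.argmax())  # fused prediction: class 0
```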
Future research may include the following two directions: 1) constructing more compact and efficient network architectures to handle complex real-world ECG data; 2) improving the online decision fusion method in a recursive manner that uses not only the past decisions but also the past learned features.
VI Acknowledgement
This work is supported by the Natural Science Foundation of China (61751304), the NSFC-Zhejiang Joint Fund for the Integration of Industrialization and Informatization (U1509203, U1709215) and the Zhejiang Natural Science Foundation of China (LY17F010020).
References
 [1] I. D. Castro, C. Varon, T. Torfs, S. Van Huffel, R. Puers, and C. Van Hoof, “Evaluation of a multi-channel non-contact ecg system and signal quality algorithms for sleep apnea detection and monitoring,” Sensors, vol. 18, no. 2, p. 577, 2018.
 [2] M. Nappi, V. Piuri, T. Tan, and D. Zhang, “Introduction to the special section on biometric systems and applications,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 44, no. 11, pp. 1457–1460, 2014.
 [3] K. A. Sidek, I. Khalil, and H. F. Jelinek, “Ecg biometric with abnormal cardiac conditions in remote monitoring system,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 44, no. 11, pp. 1498–1509, 2014.
 [4] S. Kiranyaz, T. Ince, and M. Gabbouj, “Real-time patient-specific ecg classification by 1-d convolutional neural networks,” IEEE Transactions on Biomedical Engineering, vol. 63, no. 3, pp. 664–675, 2016.
 [5] B. Pourbabaee, M. J. Roshtkhari, and K. Khorasani, “Deep convolution neural networks and learning ecg features for screening paroxysmal atrial fibrillation patients,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2017.
 [6] A. Szczepański and K. Saeed, “A mobile device system for early warning of ecg anomalies,” Sensors, vol. 14, no. 6, pp. 11 031–11 044, 2014.
 [7] Z. Zhao, L. Yang, D. Chen, and Y. Luo, “A human ecg identification system based on ensemble empirical mode decomposition,” Sensors, vol. 13, no. 5, pp. 6832–6864, 2013.
 [8] L.-Y. Shyu, Y.-H. Wu, and W. Hu, “Using wavelet transform and fuzzy neural network for vpc detection from the holter ecg,” IEEE Transactions on Biomedical Engineering, vol. 51, no. 7, pp. 1269–1273, 2004.
 [9] I. Guler and E. D. Ubeyli, “Ecg beat classifier designed by combined neural network model,” Pattern recognition, vol. 38, no. 2, pp. 199–208, 2005.
 [10] S. Mitra, M. Mitra, and B. B. Chaudhuri, “A rough-set-based inference engine for ecg classification,” IEEE Transactions on Instrumentation and Measurement, vol. 55, no. 6, pp. 2198–2206, 2006.
 [11] T. Mar, S. Zaunseder, J. P. Martínez, M. Llamedo, and R. Poll, “Optimization of ecg classification by means of feature selection,” IEEE Transactions on Biomedical Engineering, vol. 58, no. 8, pp. 2168–2177, 2011.
 [12] W. Li, J. Li, and Q. Qin, “Set-based discriminative measure for electrocardiogram beat classification,” Sensors, vol. 17, no. 2, p. 234, 2017.
 [13] Y. Wu, R. M. Rangayyan, Y. Zhou, and S.-C. Ng, “Filtering electrocardiographic signals using an unbiased and normalized adaptive noise reduction system,” Medical Engineering & Physics, vol. 31, no. 1, pp. 17–26, 2009.
 [14] J. Yan, Y. Lu, J. Liu, X. Wu, and Y. Xu, “Self-adaptive model-based ecg denoising using features extracted by mean shift algorithm,” Biomedical Signal Processing and Control, vol. 5, no. 2, pp. 103–113, 2010.
 [15] M. Blanco-Velasco, B. Weng, and K. E. Barner, “Ecg signal denoising and baseline wander correction based on the empirical mode decomposition,” Computers in Biology and Medicine, vol. 38, no. 1, pp. 1–13, 2008.
 [16] V. Bhateja, S. Urooj, R. Mehrotra, R. Verma, A. Lay-Ekuakille, and V. D. Verma, “A composite wavelets and morphology approach for ecg noise filtering,” in International Conference on Pattern Recognition and Machine Intelligence. Springer, 2013, pp. 361–366.
 [17] W. Jenkal, R. Latif, A. Toumanari, A. Dliou, O. El Bcharri, and F. M. Maoulainine, “An efficient algorithm of ecg signal denoising using the adaptive dual threshold filter and the discrete wavelet transform,” Biocybernetics and Biomedical Engineering, vol. 36, no. 3, pp. 499–508, 2016.
 [18] S. Poungponsri and X.-H. Yu, “An adaptive filtering approach for electrocardiogram (ecg) signal noise reduction using neural networks,” Neurocomputing, vol. 117, pp. 206–213, 2013.
 [19] Y. Xu, M. Luo, T. Li, and G. Song, “Ecg signal denoising and baseline wander correction based on ceemdan and wavelet threshold,” Sensors, vol. 17, no. 12, p. 2754, 2017.
 [20] V. X. Afonso, W. J. Tompkins, T. Q. Nguyen, and S. Luo, “Ecg beat detection using filter banks,” IEEE transactions on biomedical engineering, vol. 46, no. 2, pp. 192–202, 1999.
 [21] N. Zeng, Z. Wang, and H. Zhang, “Inferring nonlinear lateral flow immunoassay statespace models via an unscented kalman filter,” Science China Information Sciences, vol. 59, no. 11, p. 112204, 2016.
 [22] F. Castells, P. Laguna, L. Sörnmo, A. Bollmann, and J. M. Roig, “Principal component analysis in ecg signal processing,” EURASIP Journal on Applied Signal Processing, vol. 2007, no. 1, pp. 98–98, 2007.
 [23] V. Monasterio, P. Laguna, and J. P. Martinez, “Multi-lead analysis of t-wave alternans in the ecg using principal component analysis,” IEEE Transactions on Biomedical Engineering, vol. 56, no. 7, pp. 1880–1890, 2009.
 [24] R. J. Martis, U. R. Acharya, K. Mandana, A. K. Ray, and C. Chakraborty, “Application of principal component analysis to ecg signals for automated diagnosis of cardiac health,” Expert Systems with Applications, vol. 39, no. 14, pp. 11 792–11 800, 2012.
 [25] Y. Ozbay, R. Ceylan, and B. Karlik, “A fuzzy clustering neural network architecture for classification of ecg arrhythmias,” Computers in Biology and Medicine, vol. 36, no. 4, pp. 376–388, 2006.
 [26] M. Kallas, C. Francis, L. Kanaan, D. Merheb, P. Honeine, and H. Amoud, “Multiclass svm classification combined with kernel pca feature extraction of ecg signals,” in Telecommunications (ICT), 2012 19th International Conference on. IEEE, 2012, pp. 1–5.
 [27] T. Ince, S. Kiranyaz, and M. Gabbouj, “A generic and robust system for automated patientspecific classification of ecg signals,” IEEE Transactions on Biomedical Engineering, vol. 56, no. 5, pp. 1415–1426, 2009.
 [28] E. Jayachandran et al., “Analysis of myocardial infarction using discrete wavelet transform,” Journal of medical systems, vol. 34, no. 6, pp. 985–992, 2010.
 [29] A. Daamouche, L. Hamami, N. Alajlan, and F. Melgani, “A wavelet optimization approach for ecg signal classification,” Biomedical Signal Processing and Control, vol. 7, no. 4, pp. 342–349, 2012.
 [30] J. Ródenas, M. García, R. Alcaraz, and J. J. Rieta, “Wavelet entropy automatically detects episodes of atrial fibrillation from single-lead electrocardiograms,” Entropy, vol. 17, no. 9, pp. 6179–6199, 2015.
 [31] M. García, J. Ródenas, R. Alcaraz, and J. J. Rieta, “Application of the relative wavelet energy to heart rate independent detection of atrial fibrillation,” Computer methods and programs in biomedicine, vol. 131, pp. 157–168, 2016.
 [32] M. Javadi, R. Ebrahimpour, A. Sajedin, S. Faridi, and S. Zakernejad, “Improving ecg classification accuracy using an ensemble of neural network modules,” PLoS one, vol. 6, no. 10, p. e24386, 2011.
 [33] W. Liang, Y. Zhang, J. Tan, and Y. Li, “A novel approach to ecg classification based upon two-layered hmms in body sensor networks,” Sensors, vol. 14, no. 4, pp. 5994–6011, 2014.
 [34] S. Osowski, L. T. Hoai, and T. Markiewicz, “Support vector machinebased expert system for reliable heartbeat recognition,” IEEE transactions on biomedical engineering, vol. 51, no. 4, pp. 582–589, 2004.
 [35] M. Barni, P. Failla, R. Lazzeretti, A.-R. Sadeghi, and T. Schneider, “Privacy-preserving ecg classification with branching programs and neural networks,” IEEE Transactions on Information Forensics and Security, vol. 6, no. 2, pp. 452–468, 2011.
 [36] J.-S. Wang, W.-C. Chiang, Y.-L. Hsu, and Y.-T. C. Yang, “Ecg arrhythmia classification using a probabilistic neural network with a feature reduction method,” Neurocomputing, vol. 116, pp. 38–45, 2013.
 [37] A. R. Hassan and M. A. Haque, “An expert system for automated identification of obstructive sleep apnea from single-lead ecg using random under sampling boosting,” Neurocomputing, vol. 235, pp. 122–130, 2017.
 [38] J. Oster and G. D. Clifford, “Impact of the presence of noise on rr intervalbased atrial fibrillation detection,” Journal of electrocardiology, vol. 48, no. 6, pp. 947–951, 2015.
 [39] Q. Li, C. Liu, J. Oster, and G. D. Clifford, “Signal processing and feature selection preprocessing for classification in noisy healthcare data,” Machine Learning for Healthcare Technologies, vol. 2, p. 33, 2016.
 [40] S. Hong, M. Wu, Y. Zhou, Q. Wang, J. Shang, H. Li, and J. Xie, “Encase: An ensemble classifier for ecg classification using expert features and deep neural networks,” in Computing in Cardiology (CinC), 2017. IEEE, 2017, pp. 1–4.
 [41] M. Zabihi, A. B. Rad, A. K. Katsaggelos, S. Kiranyaz, S. Narkilahti, and M. Gabbouj, “Detection of atrial fibrillation in ecg handheld devices using a random forest classifier,” 2017.
 [42] T. Teijeiro, C. A. García, D. Castro, and P. Félix, “Arrhythmia classification from the abductive interpretation of short singlelead ecg records,” arXiv preprint arXiv:1711.03892, 2017.
 [43] N. Emanet, “Ecg beat classification by using discrete wavelet transform and random forest algorithm,” in Soft Computing, Computing with Words and Perceptions in System Analysis, Decision and Control, 2009. ICSCCW 2009. Fifth International Conference on. IEEE, 2009, pp. 1–4.
 [44] L. Deng, M. L. Seltzer, D. Yu, A. Acero, A.r. Mohamed, and G. Hinton, “Binary coding of speech spectrograms using a deep autoencoder,” in Eleventh Annual Conference of the International Speech Communication Association, 2010.
 [45] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams et al., “Recent advances in deep learning for speech research at microsoft,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8604–8608.
 [46] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp. 675–678.
 [47] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
 [48] L. Van Der Maaten, “Accelerating t-sne using tree-based algorithms,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 3221–3245, 2014.
 [49] L. Van der Maaten and G. Hinton, “Visualizing nonmetric similarities in multiple maps,” Machine learning, vol. 87, no. 1, pp. 33–55, 2012.
 [50] “The physionet/computing in cardiology challenge 2017,” https://physionet.org/challenge/2017/.
 [51] G. Clifford, C. Liu, B. Moody, L. Lehman, I. Silva, Q. Li, A. Johnson, and R. Mark, “Af classification from a short single lead ecg recording: The physionet computing in cardiology challenge 2017,” Computing in Cardiology, vol. 44, 2017.