Modelling Segmented Cardiotocography Time-Series Signals Using One-Dimensional Convolutional Neural Networks for the Early Detection of Abnormal Birth Outcomes

08/06/2019 ∙ by Paul Fergus, et al. ∙ Liverpool John Moores University

Gynaecologists and obstetricians visually interpret cardiotocography (CTG) traces using the International Federation of Gynaecology and Obstetrics (FIGO) guidelines to assess the wellbeing of the foetus during antenatal care. This approach has raised concerns among professionals with respect to inter- and intra-observer variability, where clinical diagnosis only has a 30% positive predictive value when classifying pathological outcomes. Machine learning models, trained with FIGO and other user-derived features extracted from CTG traces, have been shown to increase positive predictive capacity and minimise variability. This is only possible, however, when class distributions are equal, which is rarely the case in clinical trials where case-control observations are heavily skewed. Classes can be balanced using either synthetic data derived from resampled case training data or by decreasing the number of control instances. However, this introduces bias and removes valuable information. Concerns have also been raised regarding machine learning studies and their reliance on manually handcrafted features. While this has led to some interesting results, deriving an optimal set of features is considered to be an art as well as a science and is often an empirical and time-consuming process. In this paper, we address both of these issues and propose a novel CTG analysis methodology that a) splits CTG time-series signals into n-size windows with equal class distributions, and b) automatically extracts features from time-series windows using a one-dimensional convolutional neural network (1DCNN) and multilayer perceptron (MLP) ensemble. Our proposed method achieved good results using a window size of 200 (Sens=0.7981, Spec=0.7881, F1=0.7830, Kappa=0.5849, AUC=0.8599, and Logloss=0.4791).




1 Introduction

The United Nations Children’s Fund (UNICEF) estimates that 130 million babies are born each year; one million of these will be intrapartum stillbirths and more than three and a half million will die as a result of perinatal complications

[1]. According to a National Health Service (NHS) Resolution report published in 2017, the number of reported live birth deliveries in England in 2015 was 664,777 of which 1137 resulted in death [2]. The report also states that in the same year there were 2,612 stillbirths. In 2016/2017, maternity errors linked to adverse outcomes cost the NHS £1.7bn with the most expensive claims being for avoidable cerebral palsy [2].

According to MBRRACE-UK there has been a steady fall in the rate of stillbirths; however, neonatal deaths have remained largely static [3]. Cardiotocography (CTG) traces are used by obstetricians and midwives to visually inspect and monitor the foetus during the early stages of delivery [4]. CTG devices fitted to the abdomen record the foetal heart rate, which describes the modulation influence provided by the central nervous system: when the foetus is deprived of oxygen, cardiac function is impaired. Its occurrence can be confirmed by cord blood (umbilical artery) metabolic acidosis with a base deficit of more than 15 mmol/L [5].

Clinicians use features defined by the International Federation of Gynaecology and Obstetrics (FIGO) to interpret CTG traces. The FIGO features include the real FHR baseline (RBL), Accelerations, and Decelerations. Short- and Long-Term Variability (STV, LTV) are also used alongside FIGO features. The RBL is calculated as the mean of the signal [6] with the peaks and troughs removed (signals that reside outside the baseline min and max thresholds), as shown in Fig. 1. Peaks and troughs are removed using a virtual baseline (VBL), the mean of the signal including peaks and troughs; signals that deviate from the VBL by more than a threshold number of beats per minute (bpm) are removed.

Fig. 1: Using the FHR signal (Beats per Minute) to calculate the Real Baseline
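The two-pass baseline estimate described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the bpm deviation threshold `dev_bpm` is an assumed parameter, since the exact value is not stated in the text.

```python
import numpy as np

def real_baseline(fhr, dev_bpm=8.0):
    """Two-pass FIGO-style baseline estimate.

    1. The virtual baseline (VBL) is the mean of the raw signal,
       peaks and troughs included.
    2. Samples deviating from the VBL by more than `dev_bpm` beats
       per minute are discarded, and the real baseline (RBL) is the
       mean of what remains.

    `dev_bpm` is an assumed threshold, not a value from the paper.
    """
    fhr = np.asarray(fhr, dtype=float)
    vbl = fhr.mean()                              # virtual baseline
    kept = fhr[np.abs(fhr - vbl) <= dev_bpm]      # trim peaks and troughs
    return kept.mean() if kept.size else vbl
```

For a trace that sits at 140 bpm with brief excursions to 180 and 90 bpm, the VBL is pulled slightly off 140 by the excursions, while the RBL recovers 140 bpm exactly.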

Accelerations and Decelerations are obtained by counting the number of transient increases and decreases from the RBL that exceed a threshold amplitude in bpm and last for 10 s or more [7]. Accelerations indicate adequate blood delivery and are reassuring for medical practitioners. Decelerations result from physiological provocation (compromised oxygenation resulting from uterine contractions). If Decelerations do not recover (the absence of Accelerations), this can indicate the presence of umbilical cord compression, foetal hypoxia or metabolic acidosis [8].
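A sketch of the transient-counting rule: runs of samples that deviate from the RBL by more than an amplitude threshold and persist for at least 10 s. The sampling rate `fs` and amplitude `amp_bpm` are assumed values for illustration, not figures taken from the paper.

```python
import numpy as np

def count_transients(fhr, rbl, fs=4.0, amp_bpm=15.0, min_dur_s=10.0):
    """Count accelerations and decelerations relative to the baseline.

    A transient is a run of consecutive samples deviating from the
    real baseline (RBL) by more than `amp_bpm` beats per minute and
    lasting at least `min_dur_s` seconds.  fs = 4 Hz and
    amp_bpm = 15 bpm are assumed parameters.
    """
    fhr = np.asarray(fhr, dtype=float)
    min_len = int(min_dur_s * fs)
    accels = decels = 0
    for sign in (+1, -1):                       # +1: accelerations, -1: decelerations
        above = sign * (fhr - rbl) > amp_bpm    # boolean mask of excursions
        edges = np.diff(above.astype(int))      # run boundaries in the mask
        starts = np.flatnonzero(edges == 1) + 1
        ends = np.flatnonzero(edges == -1) + 1
        if above[0]:
            starts = np.r_[0, starts]
        if above[-1]:
            ends = np.r_[ends, above.size]
        runs = np.sum((ends - starts) >= min_len)
        if sign > 0:
            accels += runs
        else:
            decels += runs
    return accels, decels
```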

STV is calculated as the average of 2.5-second blocks in the signal, averaged over the duration of the signal, while LTV describes the difference between the minimum and maximum value in a 60-second block, averaged over the duration of the signal. The presence of both STV and LTV indicates normal cardiac function [9]. If STV or LTV decreases or is absent, this could indicate the onset of an adverse pathological outcome [10].

FIGO features represent the morphological structure of the FHR signal and are the visual cues used by clinicians to monitor the foetus. Non-linear features (not easily identifiable through human visual inspection) obtained from the FHR signal have also been considered. These include Root Mean Square (RMS) and Sample Entropy (SampEn), which have been widely adopted in antepartum and intrapartum studies to describe the non-visual patterns contained in the FHR [11], [12]. RMS measures the magnitude of the varying quantity, while SampEn represents the non-linear dynamics and loss of complexity in the FHR, which is an indicator for foetal hypoxia and metabolic acidosis detection [13].
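Both measures have short standard formulations, sketched here; for SampEn, m = 2 and r = 0.2 × std are conventional choices, not values stated in the paper, and template-counting conventions vary slightly across implementations.

```python
import numpy as np

def rms(x):
    """Root mean square: the magnitude of the varying quantity."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.mean(x ** 2))

def sample_entropy(x, m=2, r=0.2):
    """Sample entropy SampEn(m, r) of a 1-D signal (standard
    formulation; m = 2, r = 0.2 * std are conventional, assumed
    choices).  Lower values indicate a more regular signal."""
    x = np.asarray(x, dtype=float)
    r = r * np.std(x)
    def count_matches(mm):
        # all length-mm templates, compared pairwise (no self-matches)
        templ = np.array([x[i:i + mm] for i in range(x.size - mm + 1)])
        count = 0
        for i in range(templ.shape[0] - 1):
            d = np.max(np.abs(templ[i + 1:] - templ[i]), axis=1)  # Chebyshev distance
            count += np.sum(d <= r)
        return count
    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return np.inf
    return -np.log(a / b)
```

A perfectly periodic signal, for example, produces a SampEn close to zero, consistent with "loss of complexity" mapping to low entropy.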

Frequency representations, via the Fast Fourier Transform (FFT) [14] and Power Spectral Density (PSD), have also been widely reported in feature engineering to minimise signal quality variations [15]. In FHR analysis, frequency features have been reported in [16], while in other studies the Peak Frequency (FPeak) derived from the PSD has been used in antepartum and intrapartum analysis to measure variability and normal sympathetic and parasympathetic function [17]. Poincaré plots have also proven to be useful in heart rate variability studies [18], while the box-counting dimension has allowed the dynamics of the FHR to be estimated [19] and the morphological properties of a signal to be measured. Self-affinity measures derived from FHR signals have also been useful in classification tasks for detecting pathological outcomes [20], and Detrended Fluctuation Analysis (DFA) is said to be a good indicator for the presence or absence of self-similarity [21].
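As a concrete illustration of an FPeak-style feature, the sketch below extracts the dominant frequency from a plain FFT periodogram. This is only the simplest special case of the Welch-style PSD estimates used in the cited studies, and fs = 4 Hz is an assumed sampling rate.

```python
import numpy as np

def peak_frequency(x, fs=4.0):
    """Frequency (Hz) of the largest spectral peak of a detrended
    signal, via a plain FFT periodogram.  fs = 4 Hz is an assumed
    sampling rate; proper PSD estimation (e.g. Welch averaging)
    would reduce variance."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                       # detrend: remove the DC component
    psd = np.abs(np.fft.rfft(x)) ** 2      # one-sided periodogram
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    return freqs[np.argmax(psd[1:]) + 1]   # skip the (zeroed) DC bin
```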

One of the fundamental problems with human CTG analysis is poor interpretation and high inter- and intra-observer variability. In many cases, CTG time-series traces are not easy to interpret and require expert knowledge in signal processing. This has made the prediction of neonatal outcomes challenging among healthcare professionals [22]. Computer scientists have investigated this problem using algorithms to automatically process CTG traces and train machine learning models. For example, Warrick et al. [23] developed a system to model FHR and Uterine Contraction (UC) signal pairs to estimate their dynamic relation. The study showed that it was possible to detect approximately half of all pathological cases 1 hour and 40 minutes before delivery with a 7.5% false positive rate. Kessler et al. [24], using ST waveform analysis, were able to demonstrate timely intervention for caesarean section and vaginal deliveries.

In [25], Blix et al. trained and compared several machine learning models: a Decision Tree (DT), an Artificial Neural Network (ANN), and a Discriminant Analysis (DA) model. The ANN classifier produced the best results with an accuracy of 97.78%; however, the Sensitivity and Specificity values were not provided, making accuracy on its own an insufficient performance measure. Ocak et al. evaluated an SVM and a Genetic Algorithm (GA) classifier and reported 99.3% and 100% accuracies for normal and pathological delivery outcomes. Similar results were reported in [26] and [27]. Again, Sensitivity and Specificity values were not provided in these studies. Meanwhile, Menai et al. [28] classified foetal state using a Naive Bayes (NB) classifier with four different feature selection (FS) techniques: Mutual Information, Correlation-based, ReliefF, and Information Gain. The NB classifier in conjunction with ReliefF features produced 93.97%, 91.58%, and 95.79% for Accuracy, Sensitivity, and Specificity, respectively. Karabulut et al. [29] utilised an adaptive boosting (AdaBoost) classifier that produced an accuracy of 95.01%; again, no Sensitivity or Specificity values were provided. Spilka et al. [30] used an RF classifier and latent class analysis (LCA) [31], producing Sensitivity and Specificity values of 72% and 78%, respectively [32]. Generating slightly better results in [18], Spilka et al. attempted to detect perinatal outcomes using a C4.5 decision tree, Naive Bayes, and an SVM. The SVM produced the best results using 10-fold cross-validation, achieving 73.4% for Sensitivity and 76.3% for Specificity.

The main problem with the aforementioned approaches is twofold. First, machine learning algorithms are sensitive to skewed class distributions, which is often the case with data derived from clinical trials where observations are typically normal outcomes [33]. For example, the dataset used in this study contains 552 singleton pregnancy CTG recordings, of which 46 are cases (abnormal birth deliveries) and 506 are controls (normal deliveries). The Synthetic Minority Oversampling Technique (SMOTE) is commonly used to solve this problem [34], where the minority class (case) is oversampled using each minority class record to generate new synthetic records along line segments joining the k minority class nearest neighbours. This forces the decision region of the minority class to become more general and ensures that the classifier creates larger and less specific decision regions, rather than smaller specific ones. However, this approach introduces bias since the new data points are generated from the old ones, which does not introduce much variance to the dataset. In most cases the synthetically generated observations are only marginally different from the original observations. This can be avoided by alternatively removing samples from the majority class (also known as undersampling) such that class distributions are equal. If datasets are small, however, it may not be appropriate to remove valuable information.
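The variance concern can be made concrete with a SMOTE-style interpolation sketch: every synthetic point lies on a line segment between a minority sample and one of its k nearest minority neighbours, so the new points never leave the convex hull of the originals. This is a simplified illustration, not the full SMOTE algorithm.

```python
import numpy as np

def smote_like(minority, k=3, n_new=10, seed=0):
    """SMOTE-style oversampling sketch: each synthetic point lies on
    the segment between a minority sample and one of its k nearest
    minority neighbours, which is why the new points stay close to
    the originals.  A simplified illustration, not full SMOTE."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    out = []
    for _ in range(n_new):
        i = rng.integers(X.shape[0])
        d = np.linalg.norm(X - X[i], axis=1)    # distances to all samples
        nn = np.argsort(d)[1:k + 1]             # k nearest neighbours (excl. self)
        j = rng.choice(nn)
        gap = rng.random()                      # position along the segment
        out.append(X[i] + gap * (X[j] - X[i]))
    return np.array(out)
```

For minority samples at the corners of the unit square, every synthetic point falls inside the square, illustrating why resampling adds little genuine variance.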

Second, expert knowledge is required to extract features from CTG traces and these are application specific. This means handcrafted features are expensive to produce because manually intensive efforts are required to tune machine learning models for automated CTG analysis. The rapid progression of signal processing technologies means there is a need for a general signal analysis framework that can quickly be deployed to accommodate new application requirements.

In this paper, we solve both of these issues using CTG trace segmentation (windowing) to balance class distributions and a one-dimensional convolutional neural network (1DCNN) to automatically learn features from the segmented CTG traces [35]. All windows derived from cases are retained, while windows are randomly sampled from controls such that both class distributions are equal. Features are then automatically learned from all case-control window segments. The learnt feature space in the 1DCNN (based on random uniform kernel initialisation) provides the foundation for pre-weight initialisation for several fully connected MLPs that are subsequently trained for classification tasks. The trained 1DCNN classifiers are evaluated through several experiments and the results are compared with those obtained from an MLP trained with random weight initialisation, a Support Vector Machine, a Random Forest, and a Fisher's Linear Discriminant Analysis classifier.

The main contributions of this paper are therefore twofold. First, the morphological and nonlinear patterns in CTG traces are modelled using a 1DCNN. The benefits provided by this approach are: 1) it does not require manually labelled data (typically a laborious, subjective, and error prone process), 2) it offers a hierarchical learning paradigm to learn low- and high-level features and interactions that are more flexible than those features crafted manually, and 3) since all existing state-of-the-art computerised CTG systems use manual feature selection methods, they generally do not scale well with new data; the proposed CTG framework can therefore be quickly deployed to perform deformable CTG modelling on new CTG modalities or new CTG applications with little to no human intervention. Second, skewed datasets are balanced using a windowing strategy. The benefits of windowing are: 1) synthetic data to balance classes is not required (algorithms are modelled using real data only), and 2) datasets are not biased by the addition of data points that are similar to existing ones, as happens with resampling techniques. The performance of the proposed approach is assessed with 552 singleton pregnancy CTG recordings to demonstrate that the proposed framework achieves better performance than existing state-of-the-art methods that use handcrafted features and oversampling techniques.

The remainder of this paper is organised as follows. Section 2 describes the Materials and Methods used in the study. Results are presented in Section 3 and discussed in Section 4 before the paper is concluded and future work presented in Section 5.

2 Materials and Methods

This section describes the dataset adopted in this study and the steps taken to a) pre-process the data and balance class distributions and b) automatically learn features from n-sized windows with a 1DCNN. The section is concluded with a discussion on the performance metrics adopted to evaluate the machine learning models presented in the results section.

2.1 Data Collection and Preprocessing

Chudáček et al. conducted a comprehensive study between April 2010 and August 2012 to capture intrapartum CTG traces. The recordings were collected from the University Hospital in Brno (UHB) in the Czech Republic by obstetricians with the support of the Czech Technical University (CTU) in Prague. The CTU-UHB database contains 552 CTG recordings for singleton pregnancies with a gestational age greater than 36 weeks, selected from 9164 recordings. The STAN S21/31 and Avalon FM 40/50 foetal monitors were used to acquire the CTG records. The subjects have no known prior development factors (i.e. they are ordinary clean obstetrics cases); the duration of stage-two labour is less than or equal to 30 minutes; foetal heart rate signal quality is greater than 50% in each 30-minute window; and a pH umbilical arterial blood sample is available. The dataset contains 46 caesarean section deliveries; the remaining 506 are ordinary clean vaginal deliveries. Note that in this study the 46 caesarean deliveries are classified as cases due to pH ≤ 7.20 (acidosis), pH between 7.20 and 7.25 (foetal deterioration), or a clinical decision without evidence of pathological outcome measures. Note that the dataset curators do not give a reason why caesarean deliveries were necessary for the 24 subjects where no pathological outcome measures were recorded. Therefore, in this study an assumption is made that the decision to deliver by caesarean was supported by underlying pathological concerns (however, there is no way to validate this). The CTU-UHB database is publicly available from Physionet.

Each recording begins no more than 90 minutes before delivery. CTG records contain the FHR time-series (measured in beats per minute) and the uterine contraction (UC) signal, each sampled at 4 Hz. The FHR was obtained from an ultrasound transducer attached to the abdominal wall. In this study only the FHR signal is considered since it provides direct information about the foetal state. For each of the recordings, noise and missing values are removed using cubic Hermite spline interpolation. The 552 signals are split using several windowing strategies with n-size windows of 100, 200, 300, 400, and 500 data points, respectively. All windowed observations are retained from all case observations, with an equal number of randomly selected segments from all control windows. This class balancing strategy allows the number of case observations to be increased using real data only. Note that most studies reported in this paper address the class skew problem using either over- or under-sampling [36].
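The segment-and-balance strategy can be sketched as follows. Non-overlapping segmentation is an assumption made for this illustration; the paper does not state the window stride.

```python
import numpy as np

def balanced_windows(signals, labels, win=200, seed=0):
    """Split each FHR trace into non-overlapping `win`-sample windows,
    keep every case window, and randomly undersample control windows
    to match.  A sketch of the balancing strategy described above;
    non-overlapping segmentation is an assumption."""
    rng = np.random.default_rng(seed)
    case_w, ctrl_w = [], []
    for sig, lab in zip(signals, labels):
        sig = np.asarray(sig, dtype=float)
        n = sig.size // win
        wins = sig[: n * win].reshape(n, win)
        (case_w if lab == 1 else ctrl_w).extend(wins)
    # randomly keep as many control windows as there are case windows
    ctrl_w = [ctrl_w[i] for i in rng.permutation(len(ctrl_w))[: len(case_w)]]
    X = np.array(case_w + ctrl_w)
    y = np.r_[np.ones(len(case_w)), np.zeros(len(ctrl_w))]
    return X, y
```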

2.2 Feature Learning with One Dimensional Convolutional Neural Network

In contrast to manually extracted features based on input from domain knowledge experts, features in this study are automatically learnt from the data using a one dimensional convolutional neural network (1DCNN) [37]. Windowed CTG traces are input directly to a convolutional layer in the 1DCNN. The convolutional layer detects local features along the time-series signal and maps them to feature maps using learnable kernel filters. Local connectivity and weight sharing is adopted to minimise network parameters and avoid overfitting [38]. Pooling layers are implemented to reduce computational complexity and enable hierarchical data representations [38]. A single convolutional and pooling layer pair along with a fully connected MLP comprising two dense layers and softmax classification output is used to complete the 1DCNN network. The proposed 1DCNN architecture uses one dimensional vectors for kernel filters and feature maps as illustrated in Figure 2.

Fig. 2: One Dimensional Convolutional Neural Network Architecture

The network model is trained by minimising the cost function using feedforward and backpropagation passes. The feedforward pass determines the predicted output from input data. In other words, it constructs a feature map from the previous layer to the next through the current layer until an output is obtained. The input to the convolutional layer is computed from the outputs and kernel filters of the previous layer as follows:

x_k^l = b_k^l + Σ_{i=1}^{N_{l-1}} conv1D(w_{ik}^{l-1}, s_i^{l-1})

where x_k^l and s_i^{l-1} are the input of the convolutional layer and the output of the previous layer, respectively, w_{ik}^{l-1} is the weight kernel filter from the i-th neuron in layer l-1 to the k-th neuron in layer l, conv1D represents the convolutional operation, and b_k^l describes the bias of the k-th neuron in layer l. N_{l-1} defines the number of kernel filters in layer l-1. A ReLU activation function is utilised in this network (empirically this activation function produced the best results) and is defined as:

y_k^l = max(0, x_k^l)

where x_k^l is the intermediate output at the current layer l before down-sampling occurs. The output from the current layer is defined as:

s_k^l = maxpool(y_k^l)

where maxpool represents a max pooling function that reduces the number of parameters, and s_k^l is the output from layer l as well as the input to the next layer l+1. The output from the last pooling layer is flattened and used as the input to a fully connected MLP.
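The feedforward stage described above (convolution, ReLU, max pooling) can be sketched in NumPy. Shapes and sizes here are illustrative, not the paper's configuration; note that `np.convolve` flips its second argument, so pre-reversing the kernel yields the cross-correlation that CNN "convolution" layers actually compute.

```python
import numpy as np

def conv1d_forward(s_prev, kernels, biases, pool=2):
    """One convolution + ReLU + max-pooling stage, mirroring the
    feedforward equations above.  Shapes (illustrative):
      s_prev  : (n_in, length)        outputs s of layer l-1
      kernels : (n_in, n_out, ksize)  weight filters w
      biases  : (n_out,)              biases b
    """
    n_in, length = s_prev.shape
    _, n_out, ksize = kernels.shape
    x = np.zeros((n_out, length - ksize + 1))
    for k in range(n_out):
        for i in range(n_in):
            # np.convolve flips the kernel, so reversing it gives the
            # cross-correlation used by CNN convolution layers
            x[k] += np.convolve(s_prev[i], kernels[i, k][::-1], mode="valid")
        x[k] += biases[k]                                 # add bias b_k
    y = np.maximum(x, 0.0)                                # ReLU
    trim = (y.shape[1] // pool) * pool                    # max pooling, stride = pool
    s = y[:, :trim].reshape(n_out, -1, pool).max(axis=2)
    return s
```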

After acquiring the predicted output ŷ, the prediction error is calculated according to the cost function E using cross entropy with a softmax function, defined as:

E = - Σ_{c=1}^{C} t_c log(ŷ_c)

where t_c and ŷ_c are the target labels and the predicted outputs, respectively, and C is the number of classes in the training set. The role of learning is to optimise the network free parameters and minimise E. The derivatives of the free parameters are obtained and the weights and biases are updated using learning rate η. To prompt rapid convergence, we utilise Adam as an optimisation algorithm and apply weight initialisation. The learning rate is set to 0.0001 for all experiments. The weights and biases in the convolutional layer and fully connected MLP layers are updated using:

w ← w − η ∂E/∂w,    b ← b − η ∂E/∂b
Small learning rates increase the training time; however, they reduce the number of oscillations and generate a lower error value, which are key performance considerations in this study. Rate annealing is applied to guard against the learning rate freezing in a local minimum, while rate decay is applied to control the learning rate change across layers.

Momentum start is set to 0.5, which controls the amount of momentum at the beginning of training; momentum ramp controls the number of training examples over which momentum increases, and momentum stable controls the final momentum value reached after momentum ramp training examples. Model complexity is controlled with a weight decay parameter that is optimised through cross-validation. This ensures that a local optimum is found using small-magnitude parameters to avoid overfitting.

Several experiments were performed to determine the optimal number of neurons and hidden layers required to minimise . Empirically, adding additional convolutional layers did not improve the results obtained. Using 10 input neurons in two hidden layers, and 1 final output node for softmax classification produced the best results. It was found that increasing the number of hidden layers or the number of neurons did not improve performance (this was empirically performed using different activation functions and optimisers).

The 1DCNN was trained over 500 epochs (the results show that this number was sufficient for network convergence). The network free parameters were obtained using the training and validation sets and evaluated with a separate test set comprising unseen data.

2.3 Performance Measures

Sensitivity (or Recall) and Specificity provide the performance metrics for correctly classified case and control instances. Sensitivity describes the true positive rate (cases: pathological deliveries) and Specificity the true negative rate (controls: term deliveries).

The area under the curve (AUC) is utilised to determine the probability that, for each pair of examples, one from each class, the example from the positive class will be ranked highest. This is measured by ranking the estimates of posterior class membership in increasing order. If S_1 is the sum of the ranks of the inferences for test data in the positive class, the AUC is given by:

AUC = (S_1 − n_1(n_1 + 1)/2) / (n_0 n_1)

where n_0 and n_1 are the sample sizes in each class.
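The rank-sum computation of the AUC can be sketched as follows (ties between scores are not handled in this minimal version):

```python
import numpy as np

def rank_auc(scores, labels):
    """AUC via the rank-sum formula: rank all posterior estimates in
    increasing order, sum the ranks of the positive class, and
    normalise by n0 * n1.  Ties are not handled in this sketch."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = scores.argsort()
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, scores.size + 1)   # ranks start at 1
    n1 = int(labels.sum())
    n0 = labels.size - n1
    s1 = ranks[labels == 1].sum()
    return (s1 - n1 * (n1 + 1) / 2) / (n0 * n1)
```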

The F1 score is a measure of a test's accuracy. It considers both the precision p and recall r of the test to compute the score:

F1 = 2pr / (p + r)

where precision (or Positive Predictive Value) describes the proportion of correct predictions among retrieved instances. An F1 score of 1 is best and a score of 0 is worst.

Kappa, on the other hand, is a useful metric for comparing observed accuracy with expected accuracy:

Kappa = (observed accuracy − expected accuracy) / (1 − expected accuracy)

where observed accuracy is the relative observed agreement (identical to accuracy) and expected accuracy is the hypothetical probability of chance agreement. Kappa = 1 is again the best outcome, with a value of 0 indicating no agreement beyond chance.
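A short sketch of this computation from raw predictions, with the expected accuracy derived from the class marginals:

```python
import numpy as np

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement (accuracy) compared against
    the agreement expected by chance from the class marginals."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.union1d(y_true, y_pred)
    po = np.mean(y_true == y_pred)                 # observed accuracy
    pe = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in classes)
    return (po - pe) / (1.0 - pe)
```

A classifier that always predicts the majority class scores kappa = 0 even when its raw accuracy looks respectable, which is exactly why the metric is reported here alongside accuracy.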

Log Loss provides a measure of accuracy for a classifier whereby penalties are imposed on classifications that are false. Logloss is inversely related to accuracy (as one increases, the other decreases). Log loss is calculated by assigning a probability to each class rather than stating only the most likely class. Logloss is defined as:

Logloss = − (1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_ic log(p_ic)

where N is the number of samples, y_ic is a binary indicator for whether instance i belongs to class c, and p_ic is the predicted probability that instance i belongs to class c. For models that classify all instances correctly the Log Loss value will be zero. For misclassifications, the value will be progressively larger.
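For the binary case used in this study, the formula reduces to binary cross-entropy, sketched below; clipping the probabilities away from 0 and 1 (an implementation detail, not from the paper) avoids log(0):

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy: y is the 0/1 indicator, p the predicted
    probability of the positive class.  Probabilities are clipped to
    [eps, 1 - eps] to avoid log(0)."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

An uninformative prediction of 0.5 for a positive instance costs log 2 ≈ 0.693, which is why the Logloss values near 0.5 in the results tables indicate models doing better than chance.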

The Mean Squared Error (MSE) metric measures the average of the squared differences between actual and predicted values over all data points. An MSE value of 0 indicates that the model correctly classifies all class instances. Again, for misclassifications, the MSE will be progressively larger.

3 Experiments

Here we evaluate the CTG trace classification performance of the proposed framework, which uses deep learning for feature selection and windowing for dataset balancing. The experiments are divided into three parts. First, a multi-layer feedforward neural network classification model is trained using raw CTG traces and several windowing strategies. Second, a trained 1DCNN is compared with the trained MLP approach under the same experimental conditions (raw windowed signals). Third, our proposed 1DCNN model is compared with a support vector machine (SVM), a random forest (RF), and a Fisher linear discriminant analysis (FLDA) classifier, again using the same experimental conditions. The performance of each model is measured using Sensitivity, Specificity, F1, Kappa, AUC, Logloss, and MSE metrics. The dataset is split randomly into training (80%), validation (10%), and testing (10%) sets. Our method was implemented in Python with TensorFlow GPU 1.13 and Keras 2.2.4 [40]. All experiments were conducted on a computer with an NVidia GTX1060 GPU, a Xeon processor, and 16 GB of RAM.

3.1 Multi-Layer Feedforward Neural Network

3.1.1 Classifier Performance

In this experiment a single MLP is evaluated using five dense layers (hidden layers) with 10 nodes in each and a softmax layer to classify normal and abnormal birth outcomes. The ReLU activation function is implemented in all layers with dropout equal to 0.5. Adam optimisation is applied with the initial learning rate equal to 0.001. The batch size is set to 32 and training occurs over 500 epochs. Table 1 provides the performance metrics for the training and validation sets. Metric values for window sizes 100, 200, 300, 400, and 500 were obtained and averaged over 500 epochs, respectively.

Training Validation
Window AUC Logloss AUC Logloss
W=100 0.6395 0.6427 0.6578 0.6464
W=200 0.6556 0.6349 0.7138 0.6298
W=300 0.5815 0.6676 0.5638 0.6810
W=400 0.6546 0.6296 0.6781 0.6202
W=500 0.6412 0.6389 0.4578 0.7052
TABLE I: Baseline MLP Training and Validation Results

The best performance on the validation set was achieved with W=200. Figures 3 and 4 provide the logloss and accuracy plots and show that overfitting is appropriately managed. An optimised loss function is adopted to train the models. The AUC plots provide information about early divergence between the training and validation curves. As evidenced in Figs. 3 and 4, learning tends to plateau around 400 epochs.

Fig. 3: Baseline MLP Training and Validation Logloss plot for Window size 200.
Fig. 4: Baseline MLP Training and Validation Accuracy plot for Window size 200.

Table 2 provides the performance metrics obtained using the test set data. Metric values for window sizes 100, 200, 300, 400, and 500 were again obtained and averaged over 500 epochs, respectively. The results using the test set are better than those achieved by the validation set; however, there is significant imbalance between Sensitivity and Specificity values across all window size configurations.

Window Sens Spec F1 Kappa AUC Logloss
W=100 0.3104 0.9194 0.4466 0.2291 0.6913 0.6542
W=200 0.8919 0.5135 0.7500 0.4054 0.7411 0.5831
W=300 0.9359 0.4394 0.7766 0.3898 0.7071 0.6297
W=400 0.4186 0.7901 0.5180 0.2063 0.6208 0.6805
W=500 0.4286 0.8358 0.5347 0.2676 0.6889 0.6272
TABLE II: Baseline MLP Test Set Results

3.1.2 Model Selection

The ROC curve shown in Fig. 5 shows the cut-off values for the false and true positive rates using the test set. In this experiment an MLP model with W=200 produced the best results with an AUC=74%.

Fig. 5: Baseline MLP Test ROC Curves for all Window sizes.

3.2 One-Dimensional Convolutional Neural Network

In the second experiment, the same raw CTG signals are used to model a 1DCNN based on the network configuration described in Fig. 2 and associated 1DCNN network parameter coefficient settings previously discussed.

3.2.1 Classifier Performance

This time several 1DCNN models were trained using all window size configurations. A single convolutional layer with 20 filters and a kernel size half that of the windowing strategy (i.e., 150 for a 300-point window) was empirically found to provide the best results. A ReLU activation function is implemented in the convolutional layer, which is followed by a single max pooling layer and two fully connected dense layers (the first layer contains 10 nodes and the second a single node to classify case and control instances). The nodes in both layers implement the sigmoid activation function.

All models are compiled with a binary cross entropy loss function and the Adam optimiser, with the learning rate set to 0.0001, beta_1 to 0.9, beta_2 to 0.999, epsilon to zero, decay to 0.0, and amsgrad to false. Accuracy and Logloss are used as the evaluation metrics with a batch size of 32 and a training strategy that utilises 500 epochs. Ten percent of the training data is retained for model validation.

Table 3 provides the performance metrics obtained for the training and validation data sets. Again, metric values for the different window size configurations are used and averaged over 500 epochs. The results show that the validation set produced the best results with W=200, based on the highest AUC and lowest Logloss values.

Training Validation
Window AUC Logloss AUC Logloss
W=100 0.6848 0.5812 0.6584 0.6126
W=200 0.7279 0.5345 0.7284 0.5019
W=300 0.7737 0.4600 0.7185 0.5762
W=400 0.7735 0.4750 0.6730 0.6683
W=500 0.7982 0.4628 0.7333 0.6157
TABLE III: Conv1D Training and Validation Results

It can be seen in Fig. 6 that the Logloss value converges to approximately 0.5 after 500 epochs with no significant evidence of overfitting. Fig. 7 supports this and shows that both the training and validation plots are closely aligned after 500 epochs.

Fig. 6: 1D CNN Training and Validation Logloss plot for Window Size 200.

Fig. 7: 1D CNN Training and Validation Accuracy plot for Window size 200.

Table 4 this time shows that the best performance metrics obtained for the test data were produced with W=200. The metric values are higher than those obtained by the validation set and significantly higher than those produced by the MLP models. The Sensitivity and Specificity values are balanced indicating that the model can distinguish reasonably well between case and control observations with equal accuracy.

Window Sens Spec F1 Kappa AUC Logloss
W=100 0.6676 0.6833 0.6741 0.3509 0.7581 0.5817
W=200 0.7981 0.7881 0.7830 0.5849 0.8599 0.4791
W=300 0.7692 0.7273 0.7692 0.4965 0.8199 0.5322
W=400 0.7011 0.6250 0.6854 0.3268 0.7181 0.6767
W=500 0.6774 0.7647 0.7000 0.4434 0.7448 0.6716
TABLE IV: Conv1D Test Set Results

3.2.2 Model Selection

This time Fig. 8 shows that models trained on W=200 and W=300 performed much better than all other window size configurations.

Fig. 8: Baseline CNN Test ROC Curves for Window sizes.

The likely improvement is due to the fact that 1DCNNs are able to extract complex non-linear features (particularly data points with strong relationships) in a way not possible using an MLP alone. It is also likely because pre-initialising weights, rather than assigning them randomly (as typically performed in simple MLP networks), is known to improve network performance.

3.3 Comparison with SVM, RF and FLDA

In the final experiment the 1DCNN results are compared with SVM, RF and FLDA models. The same window configurations are used to model normal and pathological birth CTG traces.

3.3.1 SVM Classifier Performance

In the first comparison, the raw CTG traces are used to train several SVM models. SVMs are trained by fitting a logistic distribution using maximum likelihood to the decision values of several binary classifier models. The same data split strategy used in previous experiments is adopted. A radial kernel function is implemented with tuned gamma and cost parameters set to 0.3333 and 1 respectively (determined empirically).

Table 5 provides the performance metrics for the test set using the same window size configurations discussed previously. The results show that all SVM models perform poorly, with predictions no better than chance.

Window Sens Spec F1 Kappa AUC MSE
W=100 0.5212 0.5202 0.5146 0.0421 0.5207 0.4792
W=200 0.4070 0.5321 0.4380 -0.0606 0.4695 0.5315
W=300 0.5060 0.5901 0.5599 0.0929 0.5480 0.4583
W=400 0.4673 0.5466 0.4763 0.0026 0.5070 0.4977
W=500 0.6779 0.5633 0.6153 0.2372 0.6206 0.3846
TABLE V: SVM Test Set Results

3.3.2 Model Selection

The ROC curves in Fig. 9 show that all plots lie close to the dashed diagonal (random guessing). None of the SVM models performs better than the proposed 1DCNN model.

Fig. 9: SVM Test ROC Curves for all Window sizes.

3.3.3 Random Forest Classifier Performance

A Random Forest (RF) model is trained and evaluated in this second comparison. Breiman’s RF ensemble learning classifier is adopted, which grows 500 decorrelated trees using bootstrapped training samples. RF models are trained using the same window configurations previously discussed. The best model, using W=200, achieved balanced Sensitivity and Specificity values of 0.65 and 0.69 respectively. While performing better than the SVM models, all models failed to improve on the results obtained by the 1DCNN.
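A hedged scikit-learn sketch of this setup (the library and the toy data are our assumptions; only the 500 bootstrapped trees come from the description above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 200))   # toy stand-ins for W=200 windows
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :10] += 1.0            # make the first 10 signal points informative

# 500 trees grown on bootstrapped samples; random feature subsets at each
# split decorrelate the trees, which drives the ensemble's variance reduction.
rf = RandomForestClassifier(n_estimators=500, bootstrap=True,
                            max_features="sqrt", random_state=1)
rf.fit(X, y)
preds = rf.predict(X)
```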

Window Sens Spec F1 Kappa AUC MSE
W=100 0.5600 0.7433 0.6105 0.3049 0.6516 0.3453
W=200 0.6504 0.6890 0.6473 0.3393 0.6697 0.3288
W=300 0.5441 0.7500 0.5967 0.2968 0.6470 0.3472
W=400 0.5862 0.7750 0.6538 0.3579 0.6806 0.3233
W=500 0.5454 0.7968 0.6260 0.3409 0.6711 0.3307
TABLE VI: Random Forest Test Set Results

3.3.4 Model Selection

Figure 10 shows the ROC curves obtained for all trained RF models. Interestingly, all models across the five windowing strategies produced similar results showing that window size had little or no effect on performance.

Fig. 10: RF Test ROC Curves for all Window sizes.

3.3.5 FLDA Classifier Performance

Finally, the FLDA classifier is modelled, which finds a linear combination of features (time-series signal points) defining the direction along which the two classes are best separated. The criterion is the ratio of between-class to within-class variance: the data are projected onto a line, and classification is performed in this one-dimensional space, where the projection maximises the distance between the means of the two classes while minimising the variance within each class.
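This projection can be written down directly: with class means μ₀, μ₁ and within-class scatter matrix S_w, the Fisher direction is w ∝ S_w⁻¹(μ₁ − μ₀). A minimal NumPy sketch (illustrative, not the exact implementation used in the experiments):

```python
import numpy as np

def fisher_direction(X0, X1):
    """Direction w maximising between-class over within-class variance
    of the one-dimensional projections X @ w."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the per-class covariance matrices.
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    return np.linalg.solve(Sw, mu1 - mu0)

rng = np.random.default_rng(2)
X0 = rng.normal(loc=0.0, size=(50, 5))  # toy control windows
X1 = rng.normal(loc=1.0, size=(50, 5))  # toy case windows
w = fisher_direction(X0, X1)
# Classify by thresholding the projection at the midpoint of the class means.
threshold = 0.5 * ((X0 @ w).mean() + (X1 @ w).mean())
```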

Table 7 provides the performance metrics for the test set. The best performing model was trained using W=300, with Sensitivity and Specificity values of 0.71 and 0.76 respectively. This is a strong result given that FLDA is one of the simplest and least computationally expensive machine learning models to implement. Nevertheless, the FLDA models do not outperform the 1DCNN.

Window Sens Spec F1 Kappa AUC MSE
W=100 0.6334 0.6572 0.6464 0.2903 0.6453 0.3549
W=200 0.6990 0.6554 0.6666 0.3521 0.6772 0.3243
W=300 0.7142 0.7666 0.7594 0.4697 0.7404 0.2638
W=400 0.6333 0.7012 0.6705 0.3315 0.6673 0.3353
W=500 0.6176 0.6129 0.6268 0.2302 0.6152 0.3846
TABLE VII: FLDA Test Set Results

3.3.6 Model Selection

Figure 11 shows the ROC curves for all trained FLDA models. Again, like the RF models, all windowing strategies produced similar results.

Fig. 11: FLDA Test ROC Curves for all Window sizes.

4 Discussion

Gynaecologists and obstetricians visually interpret CTG traces using FIGO guidelines to assess the wellbeing of the foetus during antenatal care. This approach has raised concerns among healthcare professionals with regard to inter- and intra-observer variability, where clinicians only positively predict pathological outcomes 30% of the time. Machine learning models trained with features extracted from CTG traces have been shown to improve predictive capacity and minimise variability. However, this is only possible when datasets are balanced, which is rarely the case in data collected from clinical trials.

Concerns have also been raised about the efficacy of FIGO and handcrafted features and their ability to sufficiently describe normal and pathological CTG traces. Feature engineering requires expert knowledge, and the extracted features are often directly tied to the modality and application. This makes handcrafted features expensive to produce, because manually intensive effort is required to tune machine learning models for automated CTG analysis.

Both of these issues were addressed in this paper by a) splitting CTG time-series signals into n-size windows with equal class distributions using real data only, and b) automatically extracting features from time-series windows using a 1DCNN. The former minimises the amount of bias introduced into the analysis phase, and the latter automatically extracts features, removing the need for manual feature engineering. Collectively, we argue, this simplifies the data analysis pipeline and provides a robust, rigorous and scalable platform for automated CTG trace modelling and classification.
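The windowing step in (a) can be sketched as follows. The helper names are ours, and the majority-class subsampling rule is one plausible reading of "equal class distributions using real data only"; the paper describes the balancing only at that level:

```python
import numpy as np

def split_windows(trace, w):
    """Split a 1D trace into non-overlapping windows of length w,
    discarding any incomplete trailing segment."""
    n = len(trace) // w
    return [trace[i * w:(i + 1) * w] for i in range(n)]

def balance_windows(case_windows, control_windows, seed=0):
    """Keep min(case, control) windows per class by randomly subsampling
    the majority class -- real data only, no synthetic resampling."""
    rng = np.random.default_rng(seed)
    k = min(len(case_windows), len(control_windows))
    case_idx = rng.choice(len(case_windows), k, replace=False)
    ctrl_idx = rng.choice(len(control_windows), k, replace=False)
    return ([case_windows[i] for i in case_idx],
            [control_windows[i] for i in ctrl_idx])

# Example with W=200 on toy traces of unequal length.
case_w = split_windows(np.arange(1000.0), 200)   # 5 windows
ctrl_w = split_windows(np.arange(2400.0), 200)   # 12 windows
bal_case, bal_ctrl = balance_windows(case_w, ctrl_w)
```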

The findings presented in this paper support the claims made in the study. Splitting CTG traces into n-size windows is a simple way to balance case-control datasets using real data only, while deep learning derives latent representations that capture important information about normal and abnormal CTG traces. Feature extraction is performed as a single unified process, using a relatively simple 1DCNN to capture subtle nonlinear dependencies in CTG traces that may not easily be detected by human visual inspection alone. Features obtained in CNN structures do not change when presented with small input perturbations, which has the effect of suppressing noise and increasing robustness within the feature extraction process.

Three experiments were presented in this study to evaluate and justify the methodological decisions made. In the first experiment, an MLP using random weight initialisation was evaluated with several window size strategies to provide baseline results. Using a trained MLP model with a window size of 200, it was possible to obtain (Sens=0.8919, Spec=0.5135, F1=0.7500, Kappa=0.4054, AUC=0.7411, and Logloss=0.5831) on the test set. When either decreasing or increasing the window size, results dropped, with the lowest obtained using a window size of 400 (Sens=0.4186, Spec=0.7901, F1=0.5180, Kappa=0.2063, AUC=0.6208, and Logloss=0.6805). Therefore, changing the window size in this study, using the CTU-UHB dataset, had no positive impact on overall performance. More importantly, the MLP configuration was unable to model and predict case and control instances equally well, as indicated by the imbalanced Sensitivity and Specificity values.
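The metrics quoted throughout (Sensitivity, Specificity, F1 and Cohen's Kappa) follow the standard confusion-matrix definitions; a small self-contained sketch with illustrative counts (not taken from our experiments):

```python
def metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, F1 and Cohen's kappa from confusion counts."""
    sens = tp / (tp + fn)                    # true positive rate
    spec = tn / (tn + fp)                    # true negative rate
    prec = tp / (tp + fp)
    f1 = 2 * prec * sens / (prec + sens)
    n = tp + fp + tn + fn
    po = (tp + tn) / n                       # observed agreement
    pe = ((tp + fp) * (tp + fn)
          + (fn + tn) * (fp + tn)) / n ** 2  # agreement expected by chance
    kappa = (po - pe) / (1 - pe)
    return sens, spec, f1, kappa

sens, spec, f1, kappa = metrics(tp=40, fp=10, tn=45, fn=5)
# Kappa corrects raw accuracy (0.85 here) for chance agreement (0.5),
# which is why it is a more honest summary on imbalanced data.
```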

The second experiment introduced the results for the proposed 1DCNN, which automatically extracts features from several CTG window size configurations and uses them to pre-initialise the network weights. The results on the test set showed significant improvements in classification accuracy. The best result was achieved using W=200 (Sens=0.7981, Spec=0.7881, F1=0.7830, Kappa=0.5849, AUC=0.8599, and Logloss=0.4791). The worst result was obtained using W=400 (Sens=0.7011, Spec=0.6250, F1=0.6854, Kappa=0.3268, AUC=0.7181, and Logloss=0.6767). These results were much better than those produced by the MLP: the Sensitivity value was lower, but Specificity, F1, Kappa and AUC all increased, with Logloss decreasing by approximately 10 percent.

The final experiment modelled SVM, RF and FLDA classifiers to determine whether these less complex and less computationally expensive models could outperform the proposed 1DCNN approach. Under the same evaluation criteria, raw CTG traces were used to train the models with window sizes 100, 200, 300, 400, and 500. The results showed that the best performing classifier was the FLDA using W=300 (Sens=0.7142, Spec=0.7666, F1=0.7594, Kappa=0.4697, AUC=0.7404, and MSE=0.2638). The SVM classifier produced the worst results, with the best model using W=500 obtaining (Sens=0.6779, Spec=0.5633, F1=0.6153, Kappa=0.2372, AUC=0.6206, and MSE=0.3846). This was followed by the RF classifier, whose best model used W=200 (Sens=0.6504, Spec=0.6890, F1=0.6473, Kappa=0.3393, AUC=0.6697, and MSE=0.3288). All of the traditional classifiers performed worse than the 1DCNN; however, the FLDA results were relatively close to those produced by the 1DCNN, with Sensitivity approximately 0.08 lower and Specificity approximately 0.02 lower. This result is particularly interesting given that FLDA is a much simpler model to train than a CNN in terms of compute requirements.

The results presented in this study are encouraging. While many other studies based on handcrafted features have reported better results, in many cases it is not clear how those results were obtained, particularly where only accuracy is reported without reference to Sensitivity and Specificity values. In other cases, the good results are likely due to minority data points being oversampled in both the training and test sets, rather than in the training set only. Where this is the case, it introduces bias, and the trained models are unlikely to generalise well to unseen data. In this sense, we regard the work of Spilka et al. [30], [18], who use the same dataset, as a more realistic fit for evaluation purposes, and on these grounds our proposed approach outperforms the approaches they propose.

5 Conclusion

A novel framework for dealing with imbalanced clinical datasets using real data only, via a windowing strategy, is proposed in this paper, together with an automated feature engineering process using a 1DCNN. A dataset containing 552 CTG trace observations (506 controls and 46 cases) was used to train a 1DCNN with a W=200 windowing strategy, which allowed us to obtain (Sens=0.7981, Spec=0.7881, F1=0.7830, Kappa=0.5849, AUC=0.8599, and Logloss=0.4791). Figures 6 and 7 show no significant evidence of overfitting, and Figure 8 shows that our trained models have good predictive capacity.

The gold standard in gynaecology and obstetrics is to screen patients with an increased risk of an abnormal birth outcome. Consequently, this research is timely, as there is a fundamental need to identify patients at high risk of abnormal birth outcomes and stratify those foetuses who would benefit from immediate delivery via operative vaginal delivery or caesarean section. Such procedures would be performed before labour and are not informed by the fetal heart rate; so far, the limited evidence for the fetal heart rate predicting adverse outcomes applies only in labour. An automated CTG trace analysis system would help reduce the morbidity and mortality associated with abnormal deliveries, leading to significant impact within treatment and care.

Nonetheless, a great deal of work is still needed. The results presented in this study are interesting; however, the CTG traces used to train the machine learning models did not contain annotations. This means that clinically relevant data and noise are combined in the feature extraction and modelling processes, and irrelevant data is likely being modelled alongside the key data points representative of normal and abnormal CTG information. Performing signal processing alongside clinicians, to retain only the parts of each CTG trace directly representative of normal and pathological signals, will likely improve the overall predictive capacity of our 1DCNN network.

In future work it may also be interesting to model CTG traces from mothers who have normal vaginal deliveries and implement anomaly detection using stacked autoencoders to identify and triage pregnant mothers with suspected abnormal signs during pregnancy, comparing this with the 1DCNN approach. This would help provide clear groupings and act as a basis for more in-depth analysis of possible abnormal birth outcomes. Making the models accessible using web technologies would also be useful to the research community. Therefore, future work will convert the Keras models to protocol buffer format for web hosting with Flask and online inference.

Overall, the results highlight the benefits of using CTG trace windowing to balance class distributions using real data only and 1DCNNs to automatically extract features from raw CTG traces. This contributes to the instrumentation, measurement and biomedical fields and provides new insights into the use of deep learning algorithms when analysing CTG traces. While work exists in CTG trace analysis, including many of our own studies reported in the literature, to the best of our knowledge the study presented in this paper is the first comprehensive study of its kind that windows CTG traces and uses 1DCNNs to automatically extract features for modelling and classification tasks.


The datasets used for the analyses described in this manuscript were obtained from PhysioNet. Samples for the CTU-UHB Intrapartum Cardiotocography Database were provided by Václav Chudáček, Jiří Spilka, Miroslav Burša, Petr Janků, Lukáš Hruban, Michal Huptych and Lenka Lhotská. The authors would like to express their appreciation for making the data accessible to the research community, which has allowed the study in this paper to be undertaken.


  • [1] Worldometer, “Current World Population,” 2018. [Online; accessed 28-November-2018].
  • [2] NHS Resolution, “Five Years of Cerebral Palsy Claims: A Thematic Review of NHS Resolution Data,” 2017. [Online; accessed 28-November-2018].
  • [3] Sands, “MBRRACE-UK: Mothers and Babies: Reducing Risk through Audits and Confidential Enquiries across the UK,” 2017. [Online; accessed 28-November-2018].
  • [4] P. Olofsson, H. Norén, and A. Carlsson, “New figo and swedish intrapartum cardiotocography classification systems incorporated in the fetal ecg st analysis (stan) interpretation algorithm: agreements and discrepancies in cardiotocography classification and evaluation of significant st events,” Acta Obstetricia et Gynecologica Scandinavica, vol. 97, no. 2, pp. 219–228, 2018.
  • [5] C. East, L. Leader, P. Sheehan, N. Henshall, P. Colditz, and R. Lau, “Intrapartum fetal scalp lactate sampling for fetal assessment in the presence of a nonreassuring fetal heart rate trace,” Cochrane Database of Systematic Reviews, no. 5, 2015.
  • [6] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, “Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals,” Circulation, vol. 101, no. 23, pp. e215–e220, 2000.
  • [7] R. Mantel, H. Van Geijn, F. Caron, J. Swartjes, E. Van Woerden, and H. Jongswa, “Computer analysis of antepartum fetal heart rate: 2. detection of accelerations and decelerations,” International Journal of Bio-Medical Computing, vol. 25, no. 4, pp. 273–286, 1990.
  • [8] H. Murray, “Antenatal foetal heart monitoring,” Best Practice & Research Clinical Obstetrics & Gynaecology, vol. 38, pp. 2–11, 2017.
  • [9] A. J. Camm, M. Malik, J. T. Bigger, G. Breithardt, S. Cerutti, R. J. Cohen, P. Coumel, E. L. Fallen, H. L. Kennedy, R. Kleiger, et al., “Heart rate variability: standards of measurement, physiological interpretation and clinical use. task force of the european society of cardiology and the north american society of pacing and electrophysiology,” 1996.
  • [10] L. Stroux, C. W. Redman, A. Georgieva, S. J. Payne, and G. D. Clifford, “Doppler-based fetal heart rate analysis markers for the detection of early intrauterine growth restriction,” Acta obstetricia et gynecologica Scandinavica, vol. 96, no. 11, pp. 1322–1329, 2017.
  • [11] C. Buhimschi, M. B. Boyle, G. R. Saade, and R. E. Garfield, “Uterine activity during pregnancy and labor assessed by simultaneous recordings from the myometrium and abdominal surface in the rat,” American journal of obstetrics and gynecology, vol. 178, no. 4, pp. 811–822, 1998.
  • [12] J. S. Richman and J. R. Moorman, “Physiological time-series analysis using approximate entropy and sample entropy,” American Journal of Physiology-Heart and Circulatory Physiology, vol. 278, no. 6, pp. H2039–H2049, 2000.
  • [13] M. G. Signorini, A. Fanelli, and G. Magenes, “Monitoring fetal heart rate during pregnancy: contributions from advanced signal processing and wearable technology,” Computational and mathematical methods in medicine, vol. 2014, 2014.
  • [14] M. Romano, M. Cesarelli, P. Bifulco, M. Ruffo, A. Fratini, and G. Pasquariello, “Time-frequency analysis of ctg signals,” Current Development in Theory and Applications of Wavelets, vol. 3, no. 2, pp. 169–192, 2009.
  • [15] M. J. Rooijakkers, S. Song, C. Rabotti, S. G. Oei, J. W. Bergmans, E. Cantatore, and M. Mischi, “Influence of electrode placement on signal quality for ambulatory pregnancy monitoring,” Computational and mathematical methods in medicine, vol. 2014, 2014.
  • [16] J. Van Laar, M. Porath, C. Peters, and S. Oei, “Spectral analysis of fetal heart rate variability for fetal surveillance: review of the literature,” Acta obstetricia et gynecologica Scandinavica, vol. 87, no. 3, pp. 300–306, 2008.
  • [17] P. Melillo, R. Izzo, A. Orrico, P. Scala, M. Attanasio, M. Mirra, N. De Luca, and L. Pecchia, “Automatic prediction of cardiovascular and cerebrovascular events using heart rate variability analysis,” PloS one, vol. 10, no. 3, p. e0118504, 2015.
  • [18] J. Spilka, V. Chudavcek, M. Koucky, L. Lhotska, M. Huptych, P. Janku, G. Georgoulas, and C. Stylios, “Using nonlinear features for fetal heart rate classification,” Biomedical Signal Processing and Control, vol. 7, no. 4, pp. 350–357, 2012.
  • [19] J. Spilka, V. Chudacek, M. Koucky, and L. Lhotska, “Assessment of non-linear features for intrapartal fetal heart rate classification,” in 2009 9th International Conference on Information Technology and Applications in Biomedicine, pp. 1–4, IEEE, 2009.
  • [20] P. Abry, S. G. Roux, V. Chudáček, P. Borgnat, P. Gonçalves, and M. Doret, “Hurst exponent and intrapartum fetal heart rate: Impact of decelerations,” in Proceedings of the 26th IEEE International Symposium on Computer-Based Medical Systems, pp. 131–136, IEEE, 2013.
  • [21] M. Haritopoulos, A. Illanes, and A. K. Nandi, “Survey on cardiotocography feature extraction algorithms for foetal welfare assessment,” in XIV Mediterranean Conference on Medical and Biological Engineering and Computing 2016, pp. 1193–1198, Springer, 2016.
  • [22] S. Rhöse, A. M. Heinis, F. Vandenbussche, J. van Drongelen, and J. van Dillen, “Inter- and intra-observer agreement of non-reassuring cardiotocography analysis and subsequent clinical management,” Acta Obstetricia et Gynecologica Scandinavica, vol. 93, no. 6, pp. 596–602, 2014.
  • [23] P. A. Warrick, E. F. Hamilton, D. Precup, and R. E. Kearney, “Classification of normal and hypoxic fetuses from systems modeling of intrapartum cardiotocography,” IEEE Transactions on Biomedical Engineering, vol. 57, no. 4, pp. 771–779, 2010.
  • [24] J. Kessler, D. Moster, and S. Albrechtsen, “Delay in intervention increases neonatal morbidity in births monitored with cardiotocography and st-waveform analysis,” Acta obstetricia et gynecologica Scandinavica, vol. 93, no. 2, pp. 175–181, 2014.
  • [25] E. Blix, K. G. Brurberg, E. Reierth, L. M. Reinar, and P. Øian, “St waveform analysis versus cardiotocography alone for intrapartum fetal monitoring: a systematic review and meta-analysis of randomized trials,” Acta obstetricia et gynecologica Scandinavica, vol. 95, no. 1, pp. 16–27, 2016.
  • [26] H. Ocak, “A medical decision support system based on support vector machines and the genetic algorithm for the evaluation of fetal well-being,” Journal of medical systems, vol. 37, no. 2, p. 9913, 2013.
  • [27] E. Yılmaz and Ç. Kılıkçıer, “Determination of fetal state from cardiotocogram using ls-svm with particle swarm optimization and binary decision tree,” Computational and mathematical methods in medicine, vol. 2013, 2013.
  • [28] M. E. B. Menai, F. J. Mohder, and F. Al-mutairi, “Influence of feature selection on naïve bayes classifier for recognizing patterns in cardiotocograms,” Journal of Medical and Bioengineering, vol. 2, no. 1, 2013.
  • [29] E. M. Karabulut and T. Ibrikci, “Analysis of cardiotocogram data for fetal distress determination by decision tree based adaptive boosting approach,” Journal of Computer and Communications, vol. 2, no. 09, p. 32, 2014.
  • [30] J. Spilka, G. Georgoulas, P. Karvelis, V. Chudacek, C. D. Stylios, and L. Lhotska, “Discriminating normal from “abnormal” pregnancy cases using an automated fhr evaluation method,” in Hellenic Conference on Artificial Intelligence, pp. 521–531, 2014.
  • [31] D. Rindskopf and W. Rindskopf, “The value of latent class analysis in medical diagnosis,” Statistics in medicine, vol. 5, no. 1, pp. 21–27, 1986.
  • [32] V. Chudacek, J. Spilka, M. Bura, P. Janku, L. Hruban, M. Huptych, and L. Lhotska, “Open access intrapartum ctg database: Stepping stone towards generalization of technical findings on ctg signals,” PLoS ONE (manuscript submitted for publication, 2013), 2013.
  • [33] J. Mathew, C. K. Pang, M. Luo, and W. H. Leong, “Classification of imbalanced data by oversampling in kernel space of support vector machines,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 9, pp. 4065–4076, 2018.
  • [34] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,” Journal of artificial intelligence research, vol. 16, pp. 321–357, 2002.
  • [35] J. Kang, Y.-J. Park, J. Lee, S.-H. Wang, and D.-S. Eom, “Novel leakage detection by ensemble cnn-svm and graph-based localization in water distribution systems,” IEEE Transactions on Industrial Electronics, vol. 65, no. 5, pp. 4279–4289, 2017.
  • [36] P. Fergus, A. Hussain, D. Al-Jumeily, D.-S. Huang, and N. Bouguila, “Classification of caesarean section and normal vaginal deliveries using foetal heart rate signals and advanced machine learning algorithms,” BioMedical Engineering OnLine, vol. 16, no. 1, p. 89, 2017.
  • [37] J. Oh, J. Wang, and J. Wiens, “Learning to exploit invariances in clinical time-series data using sequence transformer networks,” CoRR, vol. abs/1808.06725, 2018.
  • [38] T. Brosch, Y. Yoo, L. Tang, and R. Tam, “Chapter 3 - deep learning of brain images and its application to multiple sclerosis,” in Machine Learning and Medical Imaging (G. Wu, D. Shen, and M. R. Sabuncu, eds.), pp. 69 – 96, Academic Press, 2016.
  • [39] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” CoRR, vol. abs/1603.04467, 2016.
  • [40] F. Chollet et al., “Keras: The python deep learning library,” Astrophysics Source Code Library, 2018.