Teaching a Machine to Diagnose a Heart Disease; Beginning from digitizing scanned ECGs to detecting the Brugada Syndrome (BrS)

08/27/2020 ∙ by Simon Jaxy, et al. ∙ Universität Osnabrück

Medical diagnoses can drastically shape and change a person's life. It is therefore always advisable to collect as much evidence as possible before settling on a diagnosis. Unfortunately, in the case of the Brugada Syndrome (BrS), a rare and inherited heart disease, only one diagnostic criterion exists, namely a typical pattern in the electrocardiogram (ECG). In the following treatise, we ask whether investigating ECG strips with machine learning methods improves the detection of BrS-positive cases and hence the diagnostic process. We propose a pipeline that reads in scanned images of ECGs and, after several processing steps, transforms the captured signals into digital time-voltage data. We then present a long short-term memory (LSTM) classifier that is built on the previously extracted data and makes the diagnosis. The proposed pipeline distinguishes between three major types of ECG images and recreates each recorded lead signal. Features and quality are retained during the digitization of the data, although some encountered issues are not fully resolved (Part I). Nevertheless, the results of the aforesaid program are suitable for further investigation of the ECG by a computational method such as the proposed classifier, which proves the concept and could be the architectural basis for future research (Part II). This thesis is divided into two parts as they belong to the same process but are conceptually different. It is hoped that this work builds a new foundation for computational investigations of the BrS and its diagnosis.


2.1 Introduction

In this part of the thesis, we propose an automated pipeline that addresses the challenge of digitizing ECG signals. The workflow of the pipeline is presented in detail, together with a schematic description. During the process, it proves most effective to separate different types of ECG images into different streams; this distinction is taken up again in the discussion.

First, we state the problem and our objective. Related work is presented subsequently. Then, we propose the architecture of the pipeline in chapter three and evaluate it in chapter four. Finally, as already hinted, a discussion follows.

2.1.1 Problem Statement

The ECG is a physiological method to inspect the heart. It does so by converting the electrical activity of the human heart into a graphical representation, usually printed onto a reference grid. It is often one of the very first methods applied to investigate the cardiovascular system, as different impairments, such as structural heart diseases, ischemic heart diseases, or causes of symptoms that lie outside the heart, can be detected by the ECG [Woudstra2015].

During the ECG recording, the electrical activity is measured against time [Wasilewski2012]. Electrodes are placed onto the chest and limbs and record the change of the electrical potential throughout depolarization and repolarization of the heart. "The sources of the electrical potentials are contractile cardiac muscle cells" [Wasilewski2012, p.1]; hence, the heartbeat causes the electrical activity. In this non-invasive technique, the order and placement of the electrodes are of utmost importance for the accuracy of the recorded signals [Woudstra2015].

For many hospitals, digitized ECG signals are not the standard; they rely on the printed version due to the high cost of modernizing the equipment. Large storage spaces are needed to store patient data, especially in very densely populated regions, e.g., India or China [5478930]. Fast access to the medical records of patients cannot be guaranteed, and hence a call for fast accessible data is evolving [5478930], [Garg2012]. Moreover, the ECG is printed onto a specific thermal paper, which cannot be stored without harming the quality of the signal [Kleaf2015].

The need for the digitization of the ECG is evident. For one, long-term storage of the recorded ECGs would not result in a loss of information. In addition, a fast transfer of the data would be possible when passing information from one medical institute to another. Furthermore, modern computational methods could be applied to investigate the recorded signals. Digitizing the ECG will also lead to a vast increase in the quantity of data available for research purposes, as old ECGs can be recovered and newer ones additionally recorded. This becomes especially relevant when dealing with rare heart conditions for which few known cases exist but records have been gathered over many years in the past.

2.1.2 Objective

Digitizing the ECG is a multi-step process that poses several challenges for the human engineer. The quality of the scan also plays an important role, because the scanned image might be tilted, noisy, etc. Therefore, it is mandatory to design a pipeline that takes these challenges into account.

This pipeline aims to convert the pixel information obtained from the scanned images into a one-dimensional vector that expresses the ECG signal such that it can be easily reconstructed from the vector. Moreover, the goal is to reconstruct every lead that is visible on the sheet. Personal information about the patient, which could also be read from the paper, is not taken into account. The reconstruction is implemented in Python [Python3] 3.7.3 with the help of the libraries OpenCV [opencv_library], Numpy [Numpy], Scipy [2020SciPy-NMeth], Pandas [Pandas] and Matplotlib [matplotlib].

2.2 Related Work

In the following, we present different solutions to the digitization problem as described in the literature, together with their validation methods.

2.2.1 Pipelines

A variety of solutions have been introduced over the last decades, proposing different ways of digitizing ECG signals (the entire stream of the process is called a pipeline). Most of them focus on a human-engineered solution, e.g., see [5478930], [Garg2012], [Kleaf2015]. Waits and Soliman [Waits2017] present an overview of different human-engineered techniques (human-engineered in the sense that every step and every parameter is chosen by a human, as opposed to, e.g., a deep learning method). The aim is always to design a program that is as automated as possible by reducing the amount of manual interference in the process.
Generally, the workflow follows the description by Waits and Soliman [Waits2017]: first, the recorded ECG is scanned and, if necessary, rotation and de-skewing corrections are applied. Then, the background grid is removed. Afterward, different kinds of processing techniques are applied to refine the signal, and the final step is the extraction of the signal. From time to time, an optical character recognition is additionally performed to obtain valuable patient information.


Recently, Fontanarava [Fontanarava2019] introduced a deep learning method for fully automated ECG feature extraction. It proposes the use of three different Convolutional Neural Networks, responsible for layout detection, column-wise signal segmentation, and signal retrieval. This solution is more flexible than a human-engineered one, as it allows for quicker adaptation between different kinds of ECG images, and once the architecture is fully built, the amount of human interference is reduced to a minimum.


For our purposes, we construct a human-engineered solution modeled on the work of Waits and Soliman [Waits2017].

2.2.2 Validation

As described by Waits and Soliman [Waits2017] and further summarized by Fontanarava [Fontanarava2019], no standard technique or metric has yet been established to validate the quality of the recovered ECG signal. There are almost as many comparison methods as there are human-engineered solutions to the task. A fit between the original and recovered signal combined with a correlation between the signals was suggested by Ravichandran et al. [Ravichandran2013], whereas Badilini et al. preferred a least-squares fit analysis [Baldini2005]. Others, e.g., Wang and Mital [Wang1996], rely on visual comparison only.

At the time of writing this thesis, we did not have any already digitized ECGs for the Brugada Syndrome, the case studied here. Therefore, we could not make use of such metrics to verify the quality of our extracted signal. Out of necessity, we decided to rely on graphical comparison as our validation method, which is described in the fourth chapter.

2.3 Pipeline

The available ECG records for detecting BrS are limited and diverse in age, source and image quality. A uniform process for the different kinds of ECGs proved inefficient, as the obtained images differed too much to undergo the same workflow. Therefore, a distinction between three types of images is made. The first type (Type 1) comprises newer ECGs of higher quality with a colored background grid. The second type (Type 2) consists of older ECG images in which the grid and the ink of the signal have the same color and intensity, leading to an almost binary image. Lastly, the third type (Type 3) consists of images that are also binary but have no background grid. Distinguishing between the types results in separate streams that intersect at several procedural steps where only parameters deviate.

The general procedure goes as follows: First, the images are pre-processed to rotate (if necessary) and downsize the image. Then, an intermediate step of obtaining the inner frame image from Type 2 and Type 3 ECGs follows. Next, thresholding is executed to binarize the image and a grid removal is performed. Images are further cleansed from noise and other perturbations. Finally, the signal is extracted, mapped into time-voltage coordinates and effectively upsampled.

Figure 2.1: The three different types of ECG sheets: (a) Type 1, (b) Type 2, (c) Type 3.
Figure 2.2: Digitization pipeline.

2.3.1 Pre-processing

Before the processing starts, the image has to be converted to grayscale. Previously colored pixels are hence mapped to intensity values representing the amount of light of each color [Grayscale2019].
The first step of the pipeline is to rotate the images in case they have not been placed correctly into the scanner. This is important since the retrieved signal should not be tilted but rather represent the originally recorded signal. The idea is that grid edges detected in the image can be combined into lines; by calculating the slopes of the found lines, the necessary rotation angle can be computed. The automatic [Rosebrock2015] Canny edge detection algorithm [Canny:1983:FEL:889502] provides a way to detect the edges in the image. First, a noise reduction is performed to obtain a cleaner image by using a low-pass filter (usually a Gaussian kernel) [Shrivakshan2012]. Then, the edge gradients are calculated by computing the first derivative in the horizontal and vertical direction, since edges occur at locations of steepest slopes [Szeliski2010, p.239].

$$J(\mathbf{x}) = \nabla\big(G_\sigma(\mathbf{x}) * I(\mathbf{x})\big) = \big(\nabla G_\sigma\big)(\mathbf{x}) * I(\mathbf{x}) \qquad (2.1)$$
$$\nabla G_\sigma(x, y) = \left(\frac{\partial G_\sigma}{\partial x}, \frac{\partial G_\sigma}{\partial y}\right)(x, y) = -\frac{(x, y)}{\sigma^2}\, G_\sigma(x, y), \quad G_\sigma(x, y) = \frac{1}{2\pi\sigma^2}\exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \qquad (2.2)$$

The smoothing operation and subsequent gradient calculation [Szeliski2010, p.240] are depicted in 2.1 and 2.2, where $J$ is the gradient of the smoothed image, $G_\sigma$ the Gaussian kernel function and $I$ the original image. $\mathbf{x} = (x, y)$ is the input, $x$ denotes the horizontal coordinate, $y$ stands for the vertical coordinate and $\sigma$ is the width of the Gaussian.
From the equation it follows that the input image is convolved with the horizontal and vertical derivatives of the Gaussian kernel function [Szeliski2010, p.240] to smooth the image and find its edge gradients.

Next, every pixel in the image is tested for whether it is a local maximum in its neighborhood by locating the zero-crossings of the second derivative in the gradient direction [Canny:1983:FEL:889502, p.50]. Lastly, a double thresholding operation (with a low and a high threshold) decides which pixels are edges and which are not. A pixel above the high threshold is taken as immediate output, as is the entire connected segment of the contour which contains the point and which lies above the low threshold [Canny:1983:FEL:889502, p.60], while pixels with an intensity gradient below the low threshold are discarded as non-edges. For our purposes, the zero-parameter implementation established by Rosebrock [Rosebrock2015] proved reasonable, as it facilitated the search for suitable parameters.
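The following is a minimal sketch of the zero-parameter ("automatic") Canny variant described by Rosebrock [Rosebrock2015], not the thesis' exact code: the hysteresis thresholds are derived from the median pixel intensity, so no manual threshold search is needed. The kernel size and the value of sigma are illustrative assumptions.

```python
import cv2
import numpy as np

def auto_canny(gray: np.ndarray, sigma: float = 0.33) -> np.ndarray:
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)    # low-pass filter to suppress noise
    v = float(np.median(blurred))                  # median intensity of the image
    lower = int(max(0, (1.0 - sigma) * v))         # low hysteresis threshold
    upper = int(min(255, (1.0 + sigma) * v))       # high hysteresis threshold
    return cv2.Canny(blurred, lower, upper)        # binary edge map
```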

Using the detected edges, the probabilistic Hough transform (fully described in the OpenCV documentation [cvHough]) is applied in order to find connected lines within the image by searching for their describing parameters. Each edge point $(x, y)$ (previously detected by the edge detection algorithm) can be expressed in polar coordinates $(\rho, \theta)$, where $\rho$, computed via the equation $\rho = x\cos\theta + y\sin\theta$, is the distance of the line from the origin and $\theta$ the angle between its normal and the horizontal axis. A line is found in coordinate space if the curve of one coordinate pair crosses the curve of another coordinate pair in the parameter space $(\rho, \theta)$. An array over $\rho$ (rows) and $\theta$ (columns), the so-called accumulator, is initiated. For each edge point, $\rho$ is computed by plugging the point into the line equation of the parameter space for every $\theta$ value iterated over, and a one is added to the corresponding accumulator cell. This process is repeated for every point on the line in coordinate space. The accumulator cell with the maximum value then indicates where the line lies, with parameters $(\rho, \theta)$.

The summary of the pre-processing steps:

for image in ImageList do
     gray ← Grayscale(image)
     edges ← AutoCanny(gray)
     lines ← ProbabilisticHough(edges)
     angles ← []
     for line in lines do
          angle ← arctan(slope(line)) · 180/π
          if line is vertical then angle ← angle ∓ 90
          append angle to angles
     α ← median(angles)
     image ← Rotate(image, α)
     image ← Downsize(image)
Algorithm 1: Pre-process flow

The probabilistic version differs in that it only considers a random subset of points, which is sufficient for the line detection and thus yields a more computationally efficient algorithm [cvHough].
The algorithm returns the start and end points of each line. A rotation angle can then be determined by calculating the angle of the slope of the line (converted from radians into degrees). Depending on whether a selected line is horizontal or vertical, a shift is added. All angles are saved and then the median angle is chosen.
The entire scheme was tested using 851 randomly rotated images and inspecting the rotation outcome by eye. Of the 851 images, 800 were rotated correctly, yielding an accuracy of 94%.
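A minimal sketch of this rotation estimate, assuming the OpenCV probabilistic Hough transform: line slopes are converted from radians to degrees, near-vertical grid lines are shifted by 90°, and the median angle is used to deskew. All parameter values and the handling of vertical lines are illustrative, not the thesis' exact settings.

```python
import cv2
import numpy as np

def estimate_rotation_angle(edges: np.ndarray) -> float:
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=100,
                            minLineLength=100, maxLineGap=10)
    if lines is None:
        return 0.0
    angles = []
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))  # slope angle in degrees
        if abs(angle) > 45:                               # near-vertical grid line:
            angle -= 90 * np.sign(angle)                  # shift so it measures the tilt
        angles.append(angle)
    return float(np.median(angles))                       # robust against outlier lines

def deskew(gray: np.ndarray, angle: float) -> np.ndarray:
    h, w = gray.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(gray, m, (w, h), flags=cv2.INTER_LINEAR,
                          borderValue=255)                # white background after rotation
```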

Lastly, the images need to be downsized to reduce the computational effort. Type 1 and Type 2 ECGs are shrunk to a quarter of their original size, whereas Type 3 ECGs are reduced to an eighth of their previous size. During this process, pixels are resampled by taking the pixel-to-area relationship [GEOCV] into account, such that a weighted average over the neighborhood in the defined area is calculated and taken as the new output pixel for this area [MCHUGH20].
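A minimal sketch of the downsizing step using OpenCV's area interpolation, which implements exactly this pixel-area resampling. Interpreting "a quarter/an eighth of the size" as per-side scale factors is an assumption here, not confirmed by the thesis.

```python
import cv2
import numpy as np

def downsize(gray: np.ndarray, ecg_type: int) -> np.ndarray:
    factor = 0.125 if ecg_type == 3 else 0.25   # assumed per-side scale factors
    return cv2.resize(gray, None, fx=factor, fy=factor,
                      interpolation=cv2.INTER_AREA)  # weighted average over pixel areas
```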

2.3.2 Inner Frame

The signals of Type 2 and Type 3 ECGs are located inside a black frame, see Figure 2.1. Cropping the inner picture is a good way to retain the quality of the signal, as the black frame interferes with some of the signals during the processing step. Therefore, the first step is to obtain the inner image. Type 1 ECGs do not need to be processed this way, since their frame is in a different color, which makes it possible to erase it simply by a thresholding operation. Thresholding means that a constant (the threshold) is determined that binarizes the image: all pixels whose value lies under the threshold are set to black and every pixel above the threshold is set to white.

The entire process of extracting the inner frame:

for image in ImageList do
     copy ← image
     copy ← MedianBlur(copy, 5×5)
     copy ← copy ∗ SharpeningKernel
     copy ← Threshold(copy)
     copy ← MorphologicalClose(copy)
     contours ← FindContours(copy)
     rect ← contour with the largest rectangular pixel area
     (x, y, w, h) ← BoundingCoordinates(rect) adjusted by a type-specific shift
     image ← Crop(image, x, y, w, h)
Algorithm 2: Extracting the inner frame. Note: ImageList only contains images from Type 2 and Type 3 ECGs. ∗ denotes the convolution operation.

A copy of the original image is made to ensure that the original stays unharmed during this process; all of the following image manipulation techniques are applied to the copy so that the signals are kept as they are.
Finding the frame algorithmically is achieved by finding the contours inside the image. Contours can be seen as curves that join connected points having the same color or intensity [Contours2013]. Before the contour search, some pre-processing steps have to be applied to the copy. First of all, the image is blurred to reduce high-frequency noise. This is achieved by using a median filter, which selects the median value from each pixel's neighborhood [Szeliski2010, p.124], in this case a 5x5 neighborhood. Afterward, the image is sharpened again to strengthen the inked pixels by convolving the blurred image with a sharpening kernel (found empirically, inspired by [ImageKernel]). Then, the copy is binarized using a threshold operation. A morphological closing operation ensures the continuity of connected pixels. (Morphological operations are nonlinear operations performed on binary images: the binary image is first convolved with a binary structuring kernel and a binary output value is then selected depending on the threshold result of the convolution [Szeliski2010, p.126]. The convolution is described by $c = f \otimes s$, where $c$ represents the integer-valued count of 1s (foreground pixels) inside the binary kernel as it is convolved with the image, $f$ stands for the binary image and $s$ for the kernel. Morphological close is then defined as $\mathrm{close}(f,s) = \mathrm{dilate}(\mathrm{erode}(f,s),s)$, with $\mathrm{dilate}(f,s) = \theta(c,1)$ and $\mathrm{erode}(f,s) = \theta(c,S)$, where $S$ is the size of the structuring element and $\theta(\cdot,t)$ the thresholding operation with threshold value $t$, as further described in [Szeliski2010, p.128f].) Finally, the contours can be found. The algorithm we use was first presented by Suzuki and Abe [SUZUKI198532]. It finds borders within the binarized image and follows them to determine the structure of connected pixels. The connectivity of pixels is judged by taking into account either the 4- or 8-neighborhood of the pixel. A pixel is said to be a border pixel if there exists a 0-valued (background) pixel in its respective neighborhood. The image is scanned and at each border pixel the point is labeled and its border is followed until the entire border is found. If two labels meet, one of them consumes the other together with its entire border. After following an entire border, the image scan is continued where it had been interrupted. A hierarchical structure is thus built by labeling found borders from outer to inner borders that are separated by holes (areas in which no labeled pixel exists). Using the contour information, the rectangle with the biggest pixel area and its respective coordinates are found. These coordinates are then used to crop the original (unprocessed) image such that the result is the inner-frame image. However, sometimes the found rectangle is not precise.
In pursuit of a generalized pipeline that works for many images, a coordinate shift has to be applied to ensure that no grid line remains visible. Unfortunately, for some ECGs this leads to a loss of information as some of the signal is cut out. Additionally, Type 2 and Type 3 images require different shifts as their frame placement diverges too strongly. For two of the Type 3 images, the shifting coordinates had to be found manually after applying the procedure described above, since the algorithm did not find the framing borders.
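A minimal sketch of this inner-frame extraction, assuming OpenCV 4 and generic parameters: the copy is blurred, sharpened, binarized and morphologically closed, then the bounding box of the largest contour is used to crop the untouched original with a margin. The sharpening kernel, threshold value and shift shown here are illustrative; the thesis' empirically chosen kernel is not reproduced.

```python
import cv2
import numpy as np

def crop_inner_frame(gray: np.ndarray, shift: int = 5) -> np.ndarray:
    work = cv2.medianBlur(gray, 5)                                  # 5x5 median filter
    sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]],
                       dtype=np.float32)                            # generic sharpening kernel
    work = cv2.filter2D(work, -1, sharpen)
    _, work = cv2.threshold(work, 127, 255, cv2.THRESH_BINARY_INV)  # dark frame -> foreground
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    work = cv2.morphologyEx(work, cv2.MORPH_CLOSE, kernel)          # connect broken frame pixels
    contours, _ = cv2.findContours(work, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return gray[y + shift:y + h - shift, x + shift:x + w - shift]   # crop the original with margin
```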

2.3.3 Thresholding

Following these processing steps, a thresholding operation is performed on all types of ECG images to detach the signals from the rest of the image. The thresholding operation $\theta(p, t)$, where $p$ is a pixel intensity value and $t$ the threshold value, results in binarized images. The value of $t$ differs depending on which type of ECG image is operated on. Binarizing the images ensures that the signals are set as foreground and the rest, e.g., grid lines, is set as background and thus erased.

2.3.4 Contour Filtering

However, the thresholding operation does not result in isolated signals in the case of Type 1 and Type 2 ECGs. In Type 1 images, characters describing which channel is depicted remain and need to be removed, while Type 2 images are contaminated by leftovers of the background grid that could not be fully erased by the thresholding and convolution procedure.
To localize and remove unwanted groups of pixels, we apply the aforementioned contour-finding procedure. This time, an integer value is determined empirically which acts as a threshold representing the minimum number of pixels that need to be enclosed by a detected contour. Contours of characters or leftover grid are filtered out using this process, although the Type 2 ECGs are harmed by it because the grid residuals are not entirely separable from the signal. A trade-off appears in which one has to decide whether to lose parts of the signal or to degrade its quality by including parts of the grid. For the sake of the best quality while still preserving most of the signal, a compromise is reached in which minor parts of the signal are lost while the signal is only minimally tampered with.
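A minimal sketch of this contour filtering, assuming OpenCV 4: connected components whose contour area falls below an empirical threshold (characters, grid residue) are painted out. The `min_area` value is an illustrative placeholder, not the thesis' empirically determined one.

```python
import cv2
import numpy as np

def filter_small_contours(binary: np.ndarray, min_area: int = 50) -> np.ndarray:
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cleaned = binary.copy()
    for contour in contours:
        if cv2.contourArea(contour) < min_area:           # too few enclosed pixels
            cv2.drawContours(cleaned, [contour], -1, color=0,
                             thickness=cv2.FILLED)        # erase the component
    return cleaned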

2.3.5 Signal Extraction

As the goal is to recreate each signal, a way has to be found to fully separate the signals such that the information about each channel can be stored and accessed independently. One of the major problems during this procedure is to isolate overlapping signals, which occur quite frequently due to high peaks within the ECG signal. It is mandatory to reduce the loss in quality caused by the separation process. Therefore, a height coordinate has to be found that leads to the smallest disruption of the signal.

Figure 2.3: For every row, the number of pixels in it is plotted, with peaks indicating where the maxima lie (example of a Type 3 ECG).
Figure 2.4: An inter-peak interval where for every included row the corresponding number of pixels is plotted. The median of all minima is marked (cutoff point).

To do so, the image is mapped onto a one-dimensional vector with one entry per row, each entry representing the number of pixels in that row (Figure 2.3). Performing this transformation yields a pixel distribution over the rows of the image. Peaks, i.e., high pixel counts, indicate the location of a signal, whereas rows without pixels indicate a gap between signals. In order to find a suitable location for separation, each peak in the distribution has to be found under the constraint of a minimum distance to the next peak. Without this constraint, too many peaks are found; as our objective is to find the signals, adding the constraint gives more control.
Next, each interval between the found peaks is investigated. The minimal values in these inter-peak intervals serve as good candidates, and the median coordinate of all found inter-peak minima is taken as the cutoff (Figure 2.4). (The idea of this process was inspired by [Fontanarava2019, p.39f] and [7868769], although a few adjustments have been made.)
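A minimal sketch of this separation step, not the thesis' exact code: the binary image is collapsed into a row-wise pixel count, peaks mark lead positions, and the median of the minima between two adjacent peaks is taken as the cutoff row. The minimum peak distance is an illustrative assumption.

```python
import numpy as np
from scipy.signal import find_peaks

def find_cutoff_rows(binary, min_distance=50):
    row_counts = (binary > 0).sum(axis=1)                  # pixels per row (Figure 2.3)
    peaks, _ = find_peaks(row_counts, distance=min_distance)
    cutoffs = []
    for top, bottom in zip(peaks[:-1], peaks[1:]):
        segment = row_counts[top:bottom]
        minima = np.flatnonzero(segment == segment.min())  # all minima in the interval
        cutoffs.append(top + int(np.median(minima)))       # median minimum (Figure 2.4)
    return cutoffs
```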

The described process is iterated until every cutoff coordinate is found. To further ensure that all signals are captured, the interval from the last peak to the end of the picture is examined; if the enclosed area contains enough pixels (more than a certain threshold), another signal is found that would otherwise have been left out.
Type 1 ECGs have a different structure than Type 2 and Type 3 ECGs. In the upper part of the images, one row consists of four recordings, each from a different channel. These have to be further separated to completely isolate every signal. The most satisfactory results are obtained by cropping each signal using coordinates that fully enclose it. Fortunately, these coordinates are the same for every Type 1 ECG image, so a look-up table is established to provide a good cropping procedure. Subsequently, the images of Type 2 and Type 3 are cleaned from leftover noise, i.e., grid residuals. Again, the aforementioned contour-finding-with-threshold procedure does justice to this task, ensuring a signal that is as clean as possible at this stage.

Finally, each cropped signal is investigated by itself. A column-wise scan is executed in which the extracted signal coordinates are averaged over each column to obtain the inner signal line (as it is performed similarly by Ravichandran et al. [Ravichandran2013]). However, these coordinates give no insight into the real time-voltage values, so they have to be converted.

2.3.6 Coordinate Conversion

The goal of this step is to translate the pixel coordinates into time-voltage information. The previously removed background grid plays an important part here: it gives insight into the speed (25 mm/s) and the gain (10 mm/mV) with which the ECG was recorded. Knowing how many pixels make up a large square of the background grid provides a good estimate for the mapping from pixel coordinates to time-voltage coordinates. Of course, another distinction between the different types of ECG images has to be considered, as the different types possess different grid resolutions.
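A minimal sketch of this mapping under the standard calibration assumption (one large 5 mm grid square corresponds to 0.2 s at 25 mm/s and 0.5 mV at 10 mm/mV). The function name, arguments and the per-type pixel counts are hypothetical; the pixel size of a large square would have to be measured from each grid.

```python
import numpy as np

def to_time_voltage(columns: np.ndarray, rows: np.ndarray,
                    px_per_square: float, baseline_row: float):
    seconds_per_px = 0.2 / px_per_square           # 0.2 s per large square
    mv_per_px = 0.5 / px_per_square                # 0.5 mV per large square
    time = columns * seconds_per_px
    voltage = (baseline_row - rows) * mv_per_px    # image rows grow downwards
    return time, voltage
```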

2.3.7 Upsampling

To increase the number of sampling points and hence the amount of data obtained from the digitizing process, an upsampling of the extracted signal is conducted.

A sample is nothing more than a number specifying the position of the signal (analogous to sampling digital sound signals [MDFTWEB07], https://ccrma.stanford.edu/~jos/mdft/Introduction_Sampling.html, last accessed 19.12.2019) and hence the value the signal has at the time of measurement (sample point). Upsampling has the aim of increasing the number of sample points and thus adding quantity to the data. This is achieved by so-called zero-padding of the signal in its frequency domain. Zero-padding can be understood as "extending a signal [...] with zeros" [MDFTWEB07] (https://ccrma.stanford.edu/~jos/mdft/Zero_Padding.html, last accessed 19.12.2019).

First, the signal is transformed into the frequency domain, as it naturally lies in the time domain; after all, it is a voltage signal recorded over time. The Discrete Fourier Transform (DFT) does justice to this operation, since the Fourier transform "converts a signal that depends on time into a representation that depends on frequency" [Mueller2015, p.39]. (The entire DFT theorem and its derivation are beyond the scope of this thesis; for the sake of completeness, the reader is referred to [Mueller2015] and [MDFTWEB07].)
Following the DFT, the signal is zero-padded. Zero-padding is defined by Smith [MDFTWEB07] (https://ccrma.stanford.edu/~jos/mdft/Zero_Padding.html, last accessed 19.12.2019) as

$$\mathrm{ZeroPad}_{N}(x)(m) \triangleq \begin{cases} x(m), & |m| < M/2 \\ 0, & \text{otherwise,} \end{cases} \qquad (2.3)$$

where $M$ is the old length of the signal, $N$ is the new length of the signal, and the original samples are indexed by $m$ with $|m| < M/2$, i.e., $m = -(M-1)/2, \dots, (M-1)/2$ for $M$ odd and $m = -M/2, \dots, M/2 - 1$ for $M$ even.

The size of the zero-padding is determined by a chosen factor that also determines the upsampling factor in the time domain. The zeros are inserted at exactly half of the frequency length in the frequency domain, corresponding to the folding frequency, i.e., half the sampling rate [MDFTWEB07] (https://ccrma.stanford.edu/~jos/mdft/footnode.html##foot20087, last accessed 06.01.2020, listed as footnote 7.22).
As a final step, an inverse Fourier transform (IDFT) [MDFTWEB07] (https://ccrma.stanford.edu/~jos/mdft/Inverse_DFT.html, last accessed 06.01.2020) is used to map the sequence back to the time domain.

The answer to how zero-padding in the frequency domain results in upsampling in the time domain is given by Smith [MDFTWEB07] (https://ccrma.stanford.edu/~jos/mdft/Periodic_Interpolation_Spectral_Zero.html, last accessed 19.12.2019), who states that zero-padding in the frequency domain corresponds to "periodic interpolation in the time domain" and further postulates the following theorem:

$$\mathrm{PerInterp}_{L}(x) \;\leftrightarrow\; \mathrm{ZeroPad}_{LN}(X), \qquad (2.4)$$

where $X$ denotes the spectrum (DFT) of the length-$N$ signal $x$ and $L$ the interpolation factor. According to the theorem, a periodic interpolation is equivalent to zero-padding the signal in its frequency domain and subsequently transforming back to the time domain.

As the outcome of the upsampling procedure, a signal with an increased sampling rate is obtained, and thus more data to investigate in subsequent steps. The reason for choosing this method for resampling becomes clear when the result is compared with other interpolation techniques.
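A minimal sketch of this upsampling step using Scipy: scipy.signal.resample performs Fourier-domain resampling (DFT, zero-padding of the spectrum, inverse DFT), which is the procedure described above. The upsampling factor shown here is illustrative, since the thesis' chosen factor is not reproduced.

```python
import numpy as np
from scipy.signal import resample

def upsample(signal: np.ndarray, factor: int = 4) -> np.ndarray:
    return resample(signal, factor * len(signal))  # periodic (bandlimited) interpolation
```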

Figure 2.5: Missing pixels
Figure 2.6: Upsampling vs. Cubic Spline Interpolation

Figure 2.6 shows an example of an interpolation process that led to an overshoot of the reconstructing sequence. A cubic spline interpolation was tested; cubic spline interpolation is the smooth approximation of the underlying function by a piecewise polynomial (the full theory is outside the scope of this thesis; see [Burger2015], [Parker1983]).
The overshoot occurred from time to time whenever pixel values were missing due to a too exhaustive cleansing procedure, see Figure 2.5. Thus, the noisy upsampled signal is preferred over the smoother interpolated signal. (It is important to keep in mind that the goal is to feed the extracted data into an LSTM neural network, which has been shown to be effective with noisy data, see [Bizopoulos2019].)

2.3.8 Outcome

The outcome of the upsampling procedure, and of the previous processing steps, is a set of time series of voltage data, each representing one lead. One vector is created per channel and subsequently stored in a data frame.

Figure 2.7: Type 2 ECG digitization process.

2.4 Results

The aforementioned pipeline takes in scans of ECGs and digitizes them such that each lead is reconstructed via an approximation of the signal. The integrity of each lead is preserved in most cases. However, the isolation of a signal is sometimes faulty due to strongly overlapping signals that could not be fully separated, characters that were not fully removable since they overlap with the signal, or, as in the case of Type 2 images, lost signal parts.

Figure 2.8: The successfully reconstructed signal of (a) lead 1 of a Type 1 ECG, (b) lead aVR of a Type 2 ECG and (c) lead 1 of a Type 3 ECG. Note: the line of (a) appears to be fading. This is due to the fact that the two images have a different resolution and (a) is slightly distorted.

Figure 2.8 depicts an example of a successful recovery of a recorded ECG lead for each type. On the left side, the original scanned ECG signals are depicted, and on the right side, the recovered signals are presented. All three image types are correctly captured by the pipeline, as their features are preserved and the waveform is identical.

Figure 2.9: Retrieved signals that contain faults from (a) a Type 1 V4 lead, (b) a Type 2 aVL lead and (c) a Type 3 V4 lead.

Figure 2.9 illustrates examples of faulty recovery. In the case of (a) and (c), the peaks of the signal are deformed by another signal's peaks that could not be separated from the crop; hence, the peak has a sudden drop where the two signals overlap in the column. In addition, images of Type 1 were sometimes affected by leftover characters that could not be fully removed in the processing steps, which also affects the accuracy of the recovered signal (see (a)).
As for Type 2 images, due to their lack of quality it was often not possible for the pipeline to distinguish background and foreground, so that parts of the signal were erased or falsified by residuals that were too strongly intertwined with the signal to be removed in one of the filtering procedures. Hence, the restoration does not capture the original signal.

2.5 Discussion

The proposed pipeline provides a framework for establishing a fully automated digitization process. The distinction between different types of ECG, which the pipeline had to face, results in a workflow that unites several steps but with diverging elements depending on which type is currently processed.
The recovered signals are repeatedly captured in remarkable quality when they are isolated and not too strongly enveloped by the background grid; here, the quality does not differ from the results presented in the literature. Likewise, the quality of the result drops as soon as the signals overlap strongly with each other and a clear separation, taking only the height coordinates into account, is no longer possible, or if the background grid strongly interferes with the signal. The success further depends on the quality of the scanner used and on the care taken while scanning.
Of course, the presented pipeline appears to be tailored specifically to the three distinct types of ECGs encountered. The introduction of a new form of paper ECG would require more inspection and an additional parameter search. However, this procedure is one of the first to investigate distinct types (Fontanarava [Fontanarava2019] uses two distinct types) and to attempt to unify them within the same procedure.
Future work lies in implementing an improved signal separation. One way could be to use CNNs, as presented by Fontanarava [Fontanarava2019], that capture the main components of the signal such that it can be further extracted. For this thesis, it was decided to persist with a human-engineered pipeline. Additionally, an improvement could be made by consulting already digitized ECG signals and their scanned paper versions in order to compare the results of the aforementioned pipeline, creating an optimization process that minimizes a metric between the original signal and its reconstruction.

Overall, the quality of the recovered data is sufficient to be used for the second part of this thesis, in which we will use the data to construct a classifier that is able to distinguish BrS-positive and BrS-negative patients based on their ECG.

2.6 Conclusion

In this part of the treatise, a method was proposed to digitize and recreate ECG signals previously only accessible in paper form. This method encompasses three different types of ECGs such that all signals can be retrieved from each type. Processing steps differed depending on the type and consisted sometimes of an extra step and sometimes only of different parameter settings. The results of the process have been highlighted, and strengths and weaknesses were disclosed. Furthermore, an outlook on future work has been given which could lead to an improved workflow.
The aim of digitization has been reached and the extracted data provides a basis for further investigation.

3.1 Introduction

In the second part of the thesis, we propose the classifier that is built based on the data gathered during Part I and an additional database.

We start by stating the problem and the objective of this part. Thereafter, we acquaint ourselves with various applications of machine learning in cardiology. Then, we describe the fundamentals that underlie the LSTM network. In the subsequent chapter, the data is presented together with necessary preprocessing steps. Following the data chapter, we propose the model together with all the important settings that have to be made. Afterward, we investigate the performance of the model in an evaluation procedure and discuss the results.

3.1.1 Problem Statement

During a lifetime, the heart is supposed neither to rest nor to stop; otherwise the effects might be fatal. However, different kinds of impairments and diseases can impose serious problems on the living organism. Cardiovascular diseases (CVDs) are the number one cause of death globally. In 2016, the World Health Organization (WHO) reported that 17.9 million people died from CVDs, accounting for 31.9% of all global deaths, with infarcts and strokes representing the main causes of death (https://www.who.int/en/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds), last accessed 21.01.2020).

Sudden cardiac arrests (SCAs) are a grave type of CVD that lead to a swift loss of cardiac function and subsequently pose an imminent threat to human life [Albert2018]. If the outcome is fatal, the event is referred to as sudden cardiac death (SCD) [Albert2018].
The Brugada Syndrome (BrS) is a special type of SCA that often results in an SCD. BrS is diagnosed by analyzing the ECG, since it is characterized by abnormalities in the signal [BRUGADA2006S115]. Ever since its first description in 1992 by Pedro and Josep Brugada [Brugada1992], the syndrome has been intensely studied. BrS is associated with a high rate of sudden cardiac death (4% of all sudden deaths, and up to 20% of sudden deaths in patients with structurally normal hearts) and most often causes sudden death in men around the age of 41 years [Charles2005].

Figure 3.1: The typical BrS pattern in a V1 lead. (a) taken from our own database and (b) taken from https://en.ecgpedia.org/images/f/f1/Brugada_ecg_characteristics.svg (last accessed 06.02.2020).

The only diagnostic criterion is characterized by an accentuation of the J wave in the right precordial leads (V1, V2), which results in an ST-segment elevation that is often followed by a negative T wave [Antzelevitch2012], [Brugada1992] (see Figure 3.1). This behavior can be either spontaneous or drug-induced [Wilde2002]. Other criteria exist that hint towards BrS (see [Wilde2002] or [Waks2017]); however, these criteria are based on currently available data and constitute a work in progress that is awaiting confirmatory [...] clinical data [Wilde2002].

In order to make precise decisions in the diagnostic procedure for BrS, new criteria have to be found to distinguish the cases. The call for a classifier able to distinguish between BrS-positive and BrS-negative subjects cannot be ignored. Thus, the challenge is faced, and its course is presented to the reader throughout this second part of the thesis.

3.1.2 Objective

This part describes the construction of a binary classifier that can distinguish and correctly classify BrS-positive from BrS-negative patients solely based on their ECG data. The classifier is built on the time-series data extracted in Part I. It is hoped that this classifier will be able to give an additional opinion when it comes to the diagnosis of this rare heart disease.
Everything described here is implemented in Python [Python3] 3.7.3, using the libraries Pandas [Pandas], Numpy [Numpy], Scikit-Learn [scikit-learn], PyTorch [PYTORCH], Matplotlib [matplotlib] and Seaborn [Seaborn].

3.2 Related Work

Every day, a vast amount of data is generated in the medical field. It stems from vital signs of patients, genetics or diseases and it is collected and processed to evaluate and improve methods or for the creation of new methods.
Machine learning provides the right tools to analyze this tremendous amount of information. The generation of statistical models that can classify substructures within the data or make predictions about future states, and the subsequent evaluation of their performance, leads to more insight into the field. Sophisticated methods thus have the potential to become allies to nurses and doctors, improving the quality of medical aid.
Since, for this thesis, we target a specific heart disease, it is important to inspect the role of machine learning in cardiology focusing on methods that rely on ECG data. In the following, an overview of this field is presented and then a special kind of method, the LSTM, is introduced to lay a foundation of understanding the subsequent chapters.

3.2.1 Machine Learning in Cardiology

The medical field provides a perfect playground for testing various machine learning techniques, as data is constantly generated. However, the data is not always captured cleanly; it can be faulty due to measuring errors. Therefore, working with medical data provides a good opportunity to test techniques on real-life practical data and not only on artificially constructed data sets. ECG signals constitute an example of the data generated in the medical field, and more precisely in cardiology. The periodic signal serves as an indicator of the vital status of the human heart, and its features give an account of different conditions.
Deep learning techniques are well-fitting candidates when it comes to the analysis of and feature detection within ECG signals [Bizopoulos2019]. In particular, the Recurrent Neural Network (RNN) is a suitable device to find structure in temporal data [Bizopoulos2019], [Aggarwal2018, p.273] and has provided good results in arrhythmia detection [Lipton2015], [OH2018278] (the presented cases are based on the LSTM, an improved successor of the vanilla RNN model).
The LSTM has shown promising results in related fields that use time-series data [Liu2014], [Siami-Namini2018]. Recent propositions to apply LSTM networks to medical data resulted in promising outcomes [Lipton2015], [Han-Gyu2017], [OH2018278]. Lipton et al. [Lipton2015] constructed a multi-label LSTM classifier that is fed with raw medical data and not only predicts missing data but also gives a diagnosis via the classification label in the end, while Oh et al. [OH2018278] used a model that combines a CNN with an LSTM to diagnose arrhythmias in ECGs.
Before we come to the presentation of the data and subsequently to the LSTM-architecture that is used to classify ECG signals in the case of the BrS, we first have to take a look at the fundamentals to fully understand the given structure.

3.2.2 LSTM Fundamentals

By using an Artificial Neural Network (ANN), the goal is always to construct a function approximator that, depending on the problem, receives data as input and returns, e.g., an assigned label or a replicated sequence.
The network consists of an input layer with one unit per input feature, several optional hidden layers of arbitrary size, and an output layer whose number of units depends on the specific problem (to illustrate, one output unit is sufficient for a binary classification problem in which the positive class is labeled as 1 and the negative one as 0; the output of the ANN is then usually a probability between 0 and 1). Each layer is connected to the previous one with weights (the weight vector is denoted as $W$). The entire output of the previous layer arrives at a unit of the current layer as a weighted sum, which in turn is fed into an activation function that is often nonlinear, such that the output is again only a single scalar [Johnson2019, p.9]. An arbitrary layer can be seen as a function of the layer that preceded it [Goodfellow-et-al-2016, p.193f]; therefore, our function approximator can be seen as a chain of functions [Goodfellow-et-al-2016, p.164]. In the end, a mapping $\hat{y} = f(x; \theta)$ is learned with parameters $\theta$ that result in the best function approximation [Goodfellow-et-al-2016, p.164].
In some cases, the stream can also have backward directions and thus loops; those networks are called recurrent neural networks [nielsenneural].

$$f(x) = f^{(n)}\big(f^{(n-1)}(\cdots f^{(1)}(x))\big) \qquad (3.1)$$

During the learning phase, the output $\hat{y}$ of the function approximator is compared to a target value $y$ (given a supervised learning task), with which the loss (or cost) function is calculated. To become a good function approximator, the network has to minimize the loss function (the calculated difference between $y$ and $\hat{y}$). It does so by updating the weights $W$ that have been used to compute the final output, using gradient descent optimization, which means that the network makes small updates every iteration in the direction of the steepest descent of the loss function [BackProp95], [Goodfellow-et-al-2016, p.80ff]. Throughout this procedure, the error is backpropagated, meaning that the error of the output is used to calculate inner-network errors affecting all weights of the network (the general backpropagation algorithm is described in [BackProp95], [Goodfellow-et-al-2016]).

The LSTM (the architecture presented here is the one by Gers et al. [Gers] with the additional forget gate), developed by Hochreiter and Schmidhuber [LSTM97], is a more sophisticated neural network that can pass information over time, due to its recurrent nature, and take sequences as input. Its most remarkable feature is a memory cell that can maintain its state over time [Greff15, p.1] and that is passed from time step to time step.

A forward pass of the LSTM is described as follows [Gers], [Olah15], [LSTMWIKI]:
The input $x_t$ that is fed into the network at time step $t$ is first passed through several gates that regulate the information flow into and out of the cell [Greff15, p.1]. The first one is the forget gate, $f_t$, which decides which information will be kept inside the memory block and which information will be discarded.
Next, the relevant data that will be used to update the memory cell is determined. This process is split into two parts. During the first part, it is decided which values of $C_{t-1}$ (the cell state at time step $t-1$) will be updated, by passing the input through an input gate, $i_t$. Then, new candidate values are computed by guiding the input through a candidate gate, $\tilde{C}_t$.
Subsequently, the new cell state $C_t$ is determined by taking into account which values are to be forgotten (determined by $f_t$) and which candidate values will replace which old values (determined by $i_t$).
Lastly, the output $h_t$ is determined by passing the cell state through a $\tanh$ function and taking into account the relevant output elements, determined by an output gate, $o_t$.

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad (3.2)$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad (3.3)$$
$$\tilde{C}_t = \tanh(W_C x_t + U_C h_{t-1} + b_C) \qquad (3.4)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \qquad (3.5)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad (3.6)$$
$$h_t = o_t \odot \tanh(C_t) \qquad (3.7)$$
Figure 3.2: Forward pass of an LSTM network for time step $t$.
Let $x_t$ denote the input at time step $t$, $W_*$ the input weights, $U_*$ the weights of the hidden unit and $b_*$ the respective biases. Further, $\odot$ denotes the element-wise product, $\sigma$ the logistic sigmoid and $\tanh$ the hyperbolic tangent.
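A minimal numpy sketch of a single forward-pass step following the equations in Figure 3.2. The dictionary-based weight layout is an assumption made here for readability, not the thesis' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b are dicts with keys "f", "i", "c", "o" for the four gates.
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])       # forget gate (3.2)
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])       # input gate (3.3)
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])   # candidate values (3.4)
    c_t = f_t * c_prev + i_t * c_tilde                           # new cell state (3.5)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])       # output gate (3.6)
    h_t = o_t * np.tanh(c_t)                                     # new hidden state (3.7)
    return h_t, c_t
```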

Once the input has been fed through the network at the respective time step $t$, it is time to update the network's parameters.
To do so, the error between the target output at time $t$, $y_t$, and the network output, $\hat{y}_t$, is calculated, with the aim of minimizing it during the learning procedure by using a loss function, $\mathcal{L}$.

$$E_t = \mathcal{L}(y_t, \hat{y}_t) \qquad (3.8)$$

After defining the error term, we can compute the error of an arbitrary gating unit $g_t$ at time step $t$ by taking the partial derivative of the error with respect to the specific gate and applying the chain rule (in Leibniz notation, $\frac{dz}{dx} = \frac{dz}{dy}\cdot\frac{dy}{dx}$ whenever a variable $z$ depends on a variable $y$, which itself depends on a variable $x$ [ChainRuleWiki], and we wish to calculate the derivative of $z$ with respect to $x$):

$$\delta g_t = \frac{\partial E_t}{\partial g_t} \qquad (3.9)$$

Now, our goal is to update the weights based on their impact on the calculated error. Therefore, we search for weights that minimize the error, and thus the loss function, by calculating $\frac{\partial E}{\partial W}$.

$$\delta h_t = \Delta_t + \Delta h_{t+1} \qquad (3.10)$$
$$\delta C_t = \delta h_t \odot o_t \odot \big(1 - \tanh^2(C_t)\big) + \delta C_{t+1} \odot f_{t+1} \qquad (3.11)$$
$$\delta \tilde{C}_t = \delta C_t \odot i_t \odot \big(1 - \tilde{C}_t^2\big) \qquad (3.12)$$
$$\delta i_t = \delta C_t \odot \tilde{C}_t \odot i_t \odot (1 - i_t) \qquad (3.13)$$
$$\delta f_t = \delta C_t \odot C_{t-1} \odot f_t \odot (1 - f_t) \qquad (3.14)$$
$$\delta o_t = \delta h_t \odot \tanh(C_t) \odot o_t \odot (1 - o_t) \qquad (3.15)$$
$$\delta x_t = W^{T}\, \delta \mathrm{gates}_t \qquad (3.16)$$
$$\Delta h_{t-1} = U^{T}\, \delta \mathrm{gates}_t \qquad (3.17)$$
Figure 3.3: Backwards error flow, where $\delta \mathrm{gates}_t = [\delta \tilde{C}_t, \delta i_t, \delta f_t, \delta o_t]$.

Figure 3.3 describes such a backward pass (the notation is a mix of the named sources and our own, to stay consistent with the previously used notation) [LSTM97], [Greff15], [chen2016gentle].
First, the error of the output, $\delta h_t$, is calculated using $\Delta_t$, the error at time step $t$, and $\Delta h_{t+1}$, the output difference [LSTMBackpropMedium] computed at the subsequent time step $t+1$ (defined recursively in Figure 3.3). Next, the error $\delta C_t$ is computed as shown by Hochreiter et al. [LSTM97, p.27] and Chen [chen2016gentle, p.8].

$$\delta C_t = \delta h_t\, \frac{\partial h_t}{\partial C_t} + \delta C_{t+1}\, \frac{\partial C_{t+1}}{\partial C_t} \qquad (3.18)$$

This is because the error is not only backpropagated via the output $h_t$ but also via the memory cell $C_t$. The equation given above can now be rewritten as

$$\delta C_t = \delta h_t \odot o_t \odot \big(1 - \tanh^2(C_t)\big) + \delta C_{t+1} \odot f_{t+1} \qquad (3.19)$$

considering the equations given in the forward pass (Figure 3.2).

The gate updates can be computed in a similar fashion (a full derivation of the equations is given in [LSTM97] and [chen2016gentle]). Note that $\delta x_t$, the error for the inputs, is only required when there exists an additional layer below the LSTM (i.e., a second LSTM layer) that requires training [Greff15].

Finally, the parameters are updated:

$$\frac{\partial E}{\partial W} = \sum_{t} \delta \mathrm{gates}_t \otimes x_t \qquad (3.20)$$
$$\frac{\partial E}{\partial U} = \sum_{t} \delta \mathrm{gates}_{t+1} \otimes h_t \qquad (3.21)$$
$$\frac{\partial E}{\partial b} = \sum_{t} \delta \mathrm{gates}_{t+1} \qquad (3.22)$$

where $\otimes$ denotes the outer product of two vectors. Considering a stochastic gradient descent algorithm with a learning rate of $\lambda$, the final parameter update becomes

$$\theta \leftarrow \theta - \lambda\, \frac{\partial E}{\partial \theta}, \qquad \theta \in \{W, U, b\}. \qquad (3.23)$$

The entire process is repeated for every time step until the sequence has been fed in completely.

3.3 Data

In the following, the positive and negative data is presented, together with the necessary preparation steps. The data is not searched for features or normalized; the preparation of the raw data only covers the steps necessary to run the algorithm. We aim for a proof of concept rather than a state-of-the-art solution, since many factors, e.g., the available data, were not sufficiently advanced at the time of writing this thesis.

3.3.1 Positive Training Data

We label positive examples as positive because they are composed of ECG signals taken from patients with diagnosed BrS. Each example consists of a recorded signal derived from its respective lead. Since digital signals were not available for BrS-positive patients, the learner fully depends on the ECG signals recovered during the digitization process of Part I. The layout of the three different types of images led to signals of varying length depending on whether they stem from Type 1, Type 2 or Type 3 ECGs. Furthermore, the quality also varies, since some of the ECGs were not entirely recoverable, as explained in Part I.
At the time of conducting the experiment, 30 positive training instances, each consisting of 3000 data points, are available.

3.3.2 Negative Training Data

Negative training data is negative in the sense that the recorded signals stem from healthy control patients. They were gathered from the PTB Database of Physionet [bousseljot_kreiseler_schnabel_2009], [Physionet]. The database contains 549 records from 290 subjects with a mean age of 57.2. A detailed clinical summary of the patients is attached; however, this information is only used to locate the healthy control patients, and the rest of the information is discarded. ECGs from this database have been recorded in 15 different leads (I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5, V6, VX, VY, VZ) and digitized at 1000 samples per second. In total, the digital ECGs of 52 healthy control patients could be extracted, and their data is used to build the model.
In total, 80 negative training examples are used with a respective length of 3000 per instance.

3.3.3 Data Preparation

For the experiment, we discard signals from leads other than V1 or V2, as they are the primary indicators for BrS (see 3.1.1). Furthermore, three separate models, one for V1, one for V2 and one for V1 & V2 (also referred to as "both" in the figures), are created and tested against each other. The data is randomly shuffled and then split into training, validation and test sets, responsible for training the model, tuning hyperparameters and evaluating the final performance, respectively. At the same time, labels are assigned to each instance, where 1 indicates BrS positive and 0 denotes BrS negative.
Before entering the LSTM system, each data sequence is further split into non-overlapping windows of 500 data points each. This is done to further increase the amount of available data. A final consideration has to be made, as the two data sets (positive and negative instances) are unbalanced in their quantity. We decided to take action against this, as it would otherwise lead to difficulties when evaluating the performance of the classifier (this is also taken up in 3.5.1), by assigning weights to the training instances during training. Positive instances are given a weight according to the class ratio in the training data set, as suggested in the PyTorch documentation (https://pytorch.org/docs/stable/nn.html##bcewithlogitsloss, last accessed 05.02.2020), and thus the loss computation is adjusted as if the dataset were balanced.
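A minimal sketch of these two preparation steps, assuming PyTorch's BCEWithLogitsLoss: each recording is cut into non-overlapping 500-sample windows, and the class imbalance is compensated via the pos_weight argument. The weight value shown is only a placeholder derived from the overall instance counts, not the thesis' actual training-split ratio.

```python
import numpy as np
import torch

def window(signal: np.ndarray, size: int = 500) -> np.ndarray:
    n = (len(signal) // size) * size
    return signal[:n].reshape(-1, size)            # shape: (num_windows, 500)

# ratio of negative to positive training instances (placeholder value)
pos_weight = torch.tensor([80.0 / 30.0])
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```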

3.4 Method

The ANN used throughout the experiments is depicted in Figure 3.4. The input to this network is the raw ECG signal, either extracted via the pipeline described in Part I or taken from the database. The ECG signals constitute time-series data, where each data point represents the input $x_t$ at time $t$. Inputs at time $t$ are forwarded into the LSTM layer, where they are subject to the forward-pass computations described in 3.2.2.

We introduce a dropout layer (see 3.4.1) that randomly (with a chance of 25%) omits neurons; it is placed after the LSTM layer to avoid overfitting to the small data set. Additionally, a linear layer that acts as a bottleneck and reduces the dimensionality down to one, the output dimension, is added after the dropout layer.


In the final step, the output of the network is passed through a sigmoid function such that the final output constitutes the probability of the ECG belonging to class 0 (BrS negative) or class 1 (BrS positive). Subsequently, the loss is calculated using a loss function, which is specified in its own subsection. Then, the network's parameters are updated in accordance with the loss function, operating the backpropagation described in the second half of 3.2.2.
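A minimal PyTorch sketch of the architecture in Figure 3.4: an LSTM layer, dropout with p = 0.25, and a linear bottleneck down to one output. The hidden size and layer count are illustrative assumptions, since the thesis determines such settings via a parameter search; the sigmoid is applied outside the module (e.g., inside BCEWithLogitsLoss).

```python
import torch
import torch.nn as nn

class BrSClassifier(nn.Module):
    def __init__(self, hidden_size: int = 64, num_layers: int = 1):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.dropout = nn.Dropout(p=0.25)
        self.fc = nn.Linear(hidden_size, 1)        # bottleneck to one output unit

    def forward(self, x):                          # x: (batch, 500, 1)
        output, _ = self.lstm(x)
        last = self.dropout(output[:, -1, :])      # hidden state of the last time step
        return self.fc(last)                       # raw logit; sigmoid applied outside
```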

In this chapter, the model settings, such as the chosen loss function, optimizer and dropout, are highlighted. Parameter settings, such as the number of epochs or the chosen learning rate, are part of the results chapter, since they are found via a parameter search according to the criteria of measurement presented in the following chapter.

Figure 3.4: Architecture of the Neural Network used during the experiments

3.4.1 Dropout

Neural networks tend to have many parameters that allow for a variety of different parameter settings, many of which can lead to a good result in the training procedure. However, most of these configurations will perform worse when confronted with the test data than with the training data [hinton2012improving] (a phenomenon called overfitting [hastie_09_elements-of.statistical-learning, p.364]). This is because some neurons within the network depend too much on other neurons of the network [hinton2012improving]. Dropout is used to regularize a neural network during the learning procedure and thus reduce its chance of overfitting to the data [Aggarwal2018, p.189f]. Dropout works by randomly omitting a subsample of the network's neurons with a predefined chance during the presentation of each training example [hinton2012improving]. The idea is that neurons will stop being too reliant on other neurons and thus stop learning co-dependencies [hinton2012improving].

The network presented here possesses a dropout layer that follows right after the output of the LSTM layer. Dropout occurs with a chance of 25%. Additionally, dropout within the LSTM layer is implemented, which applies only if more than one LSTM layer is stacked.

3.4.2 Cost Function

In a supervised learning task, the learner (in our case the LSTM model) generates outputs based on the learned function $\hat{f}$. Given the inputs $X$, outputs are produced in the form $\hat{y} = \hat{f}(X)$. During the training process, the learner is able to modify its input/output relationship in response to the differences $y - \hat{y}$ between the original and generated outputs [hastie_09_elements-of.statistical-learning, p.48]. By an iterative adjustment procedure, it is hoped that $\hat{f}$ becomes close to $f$, the real underlying function.
Imagine that the target outputs $y$ stem from a probability distribution $P$ such that $y = f(X) + \varepsilon$, where $P$ describes the probability distribution of the training instances and $\varepsilon$ denotes a small noise term (the usage of $\varepsilon$ is justified, as for most systems the input-output pairs will not have a deterministic relationship [hastie_09_elements-of.statistical-learning, p.47]). Now, together with our own approximator $\hat{f}$ and the training instances, we can construct the probability distribution $Q$.
The only thing left is to find a way to compare the two distributions such that they will be as close as possible by the end of the training.

(3.24)   $H(P, Q) = -\sum_{x} P(x) \log Q(x)$
(3.25)   $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)$

The equation given above is the cross-entropy. It is composed of the entropy of the variable together with the relative entropy (or Kullback-Leibler distance). The entropy of a variable accounts for the uncertainty of its outcome, whereas the relative entropy is a measure of the distance between two distributions [InformationTheory, p.18]. The latter is used when two probability distributions, $P$ and $Q$, exist over the same variable [Goodfellow-et-al-2016, p.72] and their similarity is measured.
Since the goal for our generated distribution, $Q$, is to become as close as possible to the real target distribution $P$, we strive to minimize the Kullback-Leibler distance and thus equivalently minimize the cross-entropy [Goodfellow-et-al-2016, p.130] (as both terms, entropy and relative entropy, are non-negative, as defined in [InformationTheory]). Therefore, the information-theoretical definition gives a nice intuition of what happens when the cross-entropy is minimized, namely that the dissimilarity between the two distributions $P$ and $Q$ is minimized.

Remarkably, while minimizing the cross-entropy and thus finding the optimal $Q$, we also maximize the likelihood of our parameters $\theta$ (the principle of maximum likelihood assumes that the most reasonable values for $\theta$ are those for which the probability of the observed sample is largest [hastie_09_elements-of.statistical-learning, p.50]) [hastie_09_elements-of.statistical-learning, p.31f], [Goodfellow-et-al-2016, p.129f].
Furthermore, due to our choice of an additional sigmoid layer, the output of the network is already a probability distribution over the two possible classes, 0 and 1. Thus, it is a natural decision to take the cross-entropy function as our cost function.
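Since the network already ends in a sigmoid, the binary cross-entropy can be applied directly to its outputs. A minimal sketch, assuming PyTorch and made-up example values:

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()                    # binary cross-entropy on probabilities

probs   = torch.tensor([0.9, 0.2, 0.7])     # example sigmoid outputs of the network
targets = torch.tensor([1.0, 0.0, 1.0])     # 1 = BrS positive, 0 = BrS negative
loss = criterion(probs, targets)            # averaged cross-entropy over the batch
```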

3.4.3 Optimizer

An optimizer, as the name already suggests, tries to optimize the gradient descent approach by regulating the respective step sizes and computing them individually for the parameters. One of these stochastic gradient descent optimizers is ADAM (derived from adaptive moment estimation), proposed by Kingma and Ba [kingma2014adam]. ADAM computes the adaptive learning rate for different parameters individually by using estimates of the first and second moments of the gradients (moments are quantitative measures of a function that describe the appearance of a distribution [Blitzstein2014, p.267]). One of the attractive properties of ADAM is that at each time step the step size is approximately bounded [kingma2014adam, p.1f] by the initial step size parameter, such that a so-called trust region [kingma2014adam, p.3] is formed around the current parameter values, allowing for an automatic annealing process [kingma2014adam].
The necessary parameters for the algorithm, $\beta_1$, $\beta_2$ and $\epsilon$, are initialized with $0.9$, $0.999$ and $10^{-8}$ as suggested by [kingma2014adam].
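In PyTorch (an assumed framework), this corresponds to constructing the optimizer as follows; `model` stands for the classifier sketched in 3.4 and the learning rate shown here is only a placeholder, since the actual value is determined by the parameter search in 3.5.2.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,    # lr is a placeholder
                             betas=(0.9, 0.999), eps=1e-8)   # defaults from Kingma and Ba

# one training step: reset gradients, backpropagate the loss, update the parameters
optimizer.zero_grad()
loss.backward()
optimizer.step()
```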

3.5 Results

Before we come to the results of the experiment, we first have to clarify how results are measured. Especially with medical data, which is often imbalanced, special precaution has to be taken while analyzing and evaluating, because slightly different outcomes can have entirely different consequences. Therefore, in the following, it is stated why the classic paradigm of accuracy as a performance measure fails in our context and which alternative metric is taken. Then, the results of the experiment are presented based on the introduced metrics.

3.5.1 Evaluation Process

As the goal of this thesis is to ultimately construct a model that can differentiate BrS positive from BrS negative patients, it is of utmost importance that the classifier performs well on medical evaluation methods.
Whenever the classifier decides whether the current instance belongs to a class or not (in the following, the situation for binary classifiers is described; it is possible to extend it to the multi-class case, but this will not be part of this thesis), we can evaluate its choice by comparing the output $\hat{y}$ with the target output $y$.
For our purposes, the accuracy (defined as $\text{Accuracy} = \frac{TP + TN}{P + N}$, where TP stands for the true positives, TN for the true negatives, P for the total amount of positives and N for the total amount of negatives) gives a first impression of the performance of our model, as the ability of the test to distinguish between the relevant states or conditions of the subject (i.e., diagnostic accuracy) is the most basic property of the test as an aid in decision making [Zweig1993ReceiveroperatingC]. For the tests presented in the following, a fixed classification threshold $\theta$ is used for measuring the accuracy value (note that this threshold is picked rather arbitrarily; depending on the classification task, it is very reasonable to consider an expert's opinion, e.g., a physician ought to be careful when it comes to making a diagnosis, and a value slightly below the threshold, which is not sufficient for a positive classification in our case, might give a good reason to conduct another screening; this is very well described in https://www.fharrell.com/post/classification/ [date of last access: 05.02.2020]):

(3.26)   $\hat{y} = \begin{cases} 1 & \text{if } o \geq \theta \\ 0 & \text{otherwise} \end{cases}$

where $o$ denotes the final output of the network (after the last sigmoid layer). Nevertheless, accuracy alone does not give a sufficient account of our performance. Since the data sets are highly imbalanced, good accuracy can be achieved by labeling every instance as negative, which would mean that our model did not learn anything about the positive examples. This would be a worst-case scenario, as our goal is to build a classifier that can detect BrS positive patients.
Therefore, the receiver operating characteristic (ROC) [FAWCETT2006861], [Zweig1993ReceiveroperatingC], [Powers2008] is very important. The ROC is a long-established method for visualizing, organizing and selecting classifiers based on their performance [FAWCETT2006861, p.1] in the medical field, e.g., Zweig et al. [Zweig1993ReceiveroperatingC] or Zou [Zou] (an online library about ROC literature research in medicine), and lately it has been gaining more attention in the domain of machine learning, e.g., [eFerri2002], [Provost2001], [Bradley97]. For the ROC, the true positive rate (sensitivity) is plotted against the false positive rate (1-specificity), which gives an account of the performance of our classifier in terms of the ratio of correctly and incorrectly classified positive instances. Using the ROC, different decision thresholds are calculated and used to test the performance of the classifier [FAWCETT2006861]. Each point in the ROC space stands for a different selected decision threshold, and connecting these yields a continuous curve. Depending on where the curve (or the points) is situated, we can evaluate whether a good classification occurred (upper left area) or a bad one, e.g., close to the diagonal from the bottom left to the upper right, in which case the classifier appears to be making random decisions.
By making use of the true positive rate and false positive rate, and thus ratios as compared to absolute values, the evaluation of the model is insensitive to an imbalance in the class distribution [FAWCETT2006861]. Hence, the ROC helps us to see whether the accuracy of the model is based on the imbalanced data distribution or on its performance.
Calculating the area under the ROC curve yields the AUC (area under the curve), another metric with which we are able to compare the performances of different classifiers, since AUCs are scalar values [Powers2008].
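The following sketch illustrates these metrics with scikit-learn and made-up predictions; the threshold of 0.5 is only a placeholder for the arbitrarily chosen decision threshold mentioned above.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 0, 1])            # ground truth (1 = BrS positive)
y_prob = np.array([0.1, 0.6, 0.8, 0.3, 0.4])  # sigmoid outputs of the classifier

threshold = 0.5                               # placeholder decision threshold
y_pred = (y_prob >= threshold).astype(int)
accuracy = (y_pred == y_true).mean()          # (TP + TN) / (P + N)

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_prob)               # scalar summary of the ROC curve
```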

3.5.2 Parameter Search

Most of the hyperparameters, such as the number of units in the hidden layer or the learning rate, have to be found by a search through the parameter space. To find the best settings, one has to conduct many experiments with different parameter tunings and compare their results. To find suitable parameters, the different models are first trained on the training set and then their performance is tested on the validation data set.
As aforesaid, during the process of writing this thesis, time and resources are very limited. Thus, only a small search is performed and some parameter settings are adjusted based on a search through the literature. In total, experiments are conducted that search among a set of possible numbers of epochs, numbers of hidden units, numbers of stacked LSTM-layers and learning rates for each of the tested features (V1, V2, V1 & V2); a sketch of such a grid search is given below. Greff et al. [Greff15] gave the learning rate a high priority, together with a suggestion to start high and then divide the learning rate each time by a factor of ten. Therefore, it is tested with four different settings, each time dividing by a factor of ten. Next, the number of hidden units follows a suggestion by [LSTMHyperparameters] - made after performing a hyperparameter search over different data sets - that gave the number of hidden units a low priority and suggested a fixed number of units per LSTM-layer (in their experiment, they used a bi-directional LSTM model, which is not subject of this thesis; however, their suggested number of units appears to be reasonable and serves as a guideline for the presented experiments). According to Reimers and Gurevych [LSTMHyperparameters], the number of LSTM-layers stacked on top of each other has a medium priority. The possible numbers of epochs stem firstly from the restricted amount of computational power and secondly from the limited amount of data available for the trials, to avoid overfitting to the training data.
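The search itself can be pictured as a plain grid search such as the following sketch; the grids and the helper `train_and_validate` are hypothetical stand-ins, since the concrete values tested in the thesis are not reproduced here.

```python
from itertools import product

epochs_grid        = [10, 20]                   # hypothetical values
hidden_units_grid  = [32, 64]
layers_grid        = [1, 2]
learning_rate_grid = [1e-1, 1e-2, 1e-3, 1e-4]   # start high, then divide by ten [Greff15]

results = []
for n_epochs, n_hidden, n_layers, lr in product(
        epochs_grid, hidden_units_grid, layers_grid, learning_rate_grid):
    # train_and_validate: train on the training set and evaluate loss/AUC/accuracy
    # on the validation set; it is a placeholder for the actual training routine.
    metrics = train_and_validate(n_epochs, n_hidden, n_layers, lr)
    results.append({"epochs": n_epochs, "hidden": n_hidden,
                    "layers": n_layers, "lr": lr, **metrics})
```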

Figure 3.5: The parameter search for the model based on the V1 lead, where one value is fixed and then (a) the loss (sorted from blue for the lowest to red for the highest total loss) and (b) the AUC (with confidence intervals) are averaged over all remaining model combinations.

The search is conducted as depicted in Figure 3.5. After training the models, the average values for the tested metrics (accuracy, AUC, loss) are calculated for every possible parameter value by holding that parameter value fixed and then averaging over all other possible combinations. To illustrate, when investigating one value of a given parameter, models are constructed for every possible combination of the remaining three parameters and their results are averaged. The reasoning behind this procedure is firstly to speed up the evaluation process and secondly to find the parameter value that seems to have the biggest impact on the performance of the model. Hereby, the AUC on the validation set and the validation loss are the major criteria for choosing a parameter; only then is accuracy taken into consideration.
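Assuming the `results` list from the grid-search sketch above (with hypothetical metric keys `val_loss`, `val_auc` and `val_acc`), this averaging procedure could be expressed with pandas as follows:

```python
import pandas as pd

df = pd.DataFrame(results)   # one row per trained model

# Hold one parameter fixed (here the learning rate) and average the validation
# metrics over all combinations of the remaining parameters.
per_lr = df.groupby("lr")[["val_loss", "val_auc", "val_acc"]].mean()
```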

Figure 3.6: Validation accuracies for the last parameter setting after fixing the prior three.
Figure 3.7: Validation AUCs for the last parameter setting after fixing the prior three.

As can be inferred from Figure 3.6 and Figure 3.7, this procedure is repeated after deciding on the first parameter value, then the second, etc., until the last value for the last parameter is found.
The outcome is three different models, one based on V1 data, one based on V2 data and a third one based on V1 and V2 data, with the following selected settings:

  • V1: epochs = , hidden units = , stacked LSTM-layers = , learning rate = 

  • V2: epochs = , hidden units = , stacked LSTM-layers = , learning rate = 

  • both: epochs = , hidden units = , stacked LSTM-layers = , learning rate = 

3.5.3 Test Results

After the decision for suitable parameter values was made, the chosen models were tested against each other.

Figure 3.8: The ROC-curves of the chosen models plotted against each other.

Figure 3.8 shows the ROC curves of the models after they encountered and classified the data out of the test set. V1 yielded the highest AUC, closely followed by the combination of V1 & V2 and lastly, V2.

Figure 3.9: Confusion matrix of the model that was solely trained on V1 data.
Figure 3.10: Normalized confusion matrix of the model that was solely trained on V1 data.
Figure 3.11: Confusion matrix of the model that was solely trained on V2 data.
Figure 3.12: Normalized confusion matrix of the model that was solely trained on V2 data.
Figure 3.13: Confusion matrix of the model that was trained on V1 & V2 data.
Figure 3.14: Normalized confusion matrix of the model that was trained on V1 & V2 data.

The respective confusion matrices and their normalized versions are depicted in Figures 3.9-3.14 (bear in mind that the confusion matrices are calculated using the classification threshold, which was chosen rather arbitrarily; therefore, they do not give as much information as the ROC plot does). True negatives of the testing procedure are placed in the upper left square, with false positives in the row below. The second column consists of the false negatives on top, the true positives in the middle and the total of instances classified as positive at the bottom. True positives and false negatives together account for the number of positive instances encountered (second row, rightmost square), while false positives and true negatives (first row, rightmost square) yield the number of negative examples.
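A confusion matrix of this kind can be computed, for example, with scikit-learn, reusing the thresholded predictions from the evaluation sketch above; note that scikit-learn's convention (rows: true class, columns: predicted class) may differ from the layout of the plotted figures.

```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)                 # counts per (true, predicted) pair
cm_normalized = cm / cm.sum(axis=1, keepdims=True)    # normalized per true class
```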

Metrics V1 V2 V1 & V2
Val. Total Loss
Val. AUC
Val. ACC
Test AUC
Test ACC
Table 3.1: Results of the selected Models

Table 3.1 presents the validation and test metrics that were computed for the selected models. Note that the decision threshold was set rather arbitrarily and therefore the accuracy value is not fully reliable. Furthermore, the total validation loss of the V1 & V2 model is, of course, higher, because it saw twice the number of ECGs during the validation step.

3.6 Discussion

Throughout the preceding chapters, instructions and their justifications were given for how one can construct a classifier based on the data that was previously gathered via the digitization process of Part I. This classifier was tested and the results were presented in the previous chapter. We will now interpret these results and then come to a final conclusion after proposing future research.

By inspecting Figure 3.8, we can tell that the classifiers seem to make more liberal choices, meaning that they risk a higher false positive rate in order to achieve a higher true positive rate at the same time. In the context of the heart disease, this means that they are quicker to classify BrS-negative ECGs as positive in order to find more BrS-positive ECGs. Analyzing the plot, it appears that they need less evidence to make a positive prediction. Figure 3.9 and Figure 3.10 illustrate nicely that a high amount of false positives (almost 50%) were generated, while no false negatives occurred. Interestingly, the model that is based on the combined data of lead V1 & lead V2 did make predictions that led to false negatives. Thus, it misclassified some true BrS-positive ECGs, even to such an extent that the proportion of true positives captured by the model is lower than the proportion of false negatives. In our medical context, this means that the classifier would not be a good advisor for the physician, as it would miss more BrS-positive patients than it could find.
Table 3.1 summarizes the metrics that were tested on all models. It is interesting to see that for V1 the testing accuracy is exactly at chance level, whereas the AUC is not. Therefore, the chosen decision threshold does not work for this model. The other two models perform slightly better than the V1 model in this sense; however, their accuracies as well as their AUCs decline when comparing validation and test results. Thus, it seems likely that these models overfit to the training data. In comparison, the V1 model shows a relatively marginal decline from validation to test AUC.
If someone had to make a decision about which model to choose, one would be best advised to take the model based on V1 data, because it is better to have some false positives and run more tests with them than to have false negatives when facing such a severe disease.

The results of the experiment give the impression that the presented classifiers are not fully developed yet. They will not be a flawless advisor for physicians when they are about to take their final decision upon a diagnosis.
However, the results are fruitful enough to show that it is possible to prove the concept and that we successfully came closer to our final goal, namely, to construct a classifier that is able to distinguish between BrS positive and BrS negative patients such that it becomes an aid to physicians. It will now be important to consult an expert's opinion and to conduct proper validation. Then, future work will hopefully lead to further improvement.

In future approaches, the goal will be to achieve a ROC plot where the curve sticks to the upper left corner right from the start and stays high throughout the threshold testing. Realizing this will result in a classifier that can act as a valid consultant when making a diagnosis of the BrS. Improvements could be made by considering only data that was sampled either by the means of Part I or directly from digital ECGs. Standardizing the data gathering process will lead to data that is more closely related, such that the model has to make finer distinctions while classifying and, hopefully, will learn this during the training process.
A prolonged and broader parameter search will result in an abundance of different models, giving an account not only of the impact of different parameters but also of their respective optimal settings. Once the computational infrastructure is given, this will help tremendously in finding an improved classifier.
Furthermore, the ECG signals were fed into the classifier as raw time-series data. One could think of various techniques from the domain of signal processing that could shape the data beforehand. Future experiments will show whether such techniques are useful preprocessing tools to achieve better results - for instance, wavelet transforms, which decompose a signal into subparts of itself by repeatedly passing it through low- and high-pass filters, each time dividing the signal further and giving an account of which frequency is present at which time [PolikarWaveLet]. A sketch of such a decomposition is given below.
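As an illustration, a multi-level discrete wavelet decomposition of a single lead could look like the following sketch with PyWavelets; the wavelet family, the decomposition level and the random stand-in signal are assumptions, not choices made in this thesis.

```python
import numpy as np
import pywt

signal = np.random.randn(5000)                         # stand-in for a digitized V1 trace
coeffs = pywt.wavedec(signal, wavelet="db4", level=4)  # [approximation, detail_4, ..., detail_1]
```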
Moreover, after constructing a good model, it will be important to find out what the model has actually learned. This will give further insight into the decision making and may spark the investigation of new methods of diagnostics. Remarkably, van der Westhuizen and Lasenby [van2017visualizing] have recently shown that the hidden neurons in LSTM networks seem to extract representations from the frequency domain similar to wavelet transformations [van2017visualizing, p.8]. Thus, preprocessing the data might not lead to an improvement after all, but this will be another topic of research.
Another promising direction of future research will be to investigate the strategy of Oh et al. [OH2018278], who first made use of a convolutional neural network to extract features from the ECG signal and subsequently used an LSTM network for classification; a rough sketch of such a combination follows below. Again, it will be very interesting to study the decision making of the combined network to extract further information for physicians.
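A rough sketch of such a combination (in PyTorch, with purely illustrative layer sizes, not the architecture of Oh et al.) could look as follows:

```python
import torch.nn as nn

class ConvLSTMSketch(nn.Module):
    """Convolutional feature extraction followed by an LSTM and a sigmoid output."""

    def __init__(self, hidden_size=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(16, hidden_size, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden_size, 1), nn.Sigmoid())

    def forward(self, x):                      # x: (batch, 1, time)
        f = self.features(x)                   # (batch, 16, time // 2)
        out, _ = self.lstm(f.transpose(1, 2))  # (batch, time // 2, hidden_size)
        return self.head(out[:, -1, :])        # probability of the BrS-positive class
```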

Overall, the presented experiments were successful in providing a blueprint for the construction of a classifier regarding the BrS. A proof of concept has been given and enough fuel for future investigations has been provided.

3.7 Conclusion

This second part of the thesis dealt with the possibility of constructing a classifier based on the data gathered during Part I of the dissertation. The aforementioned classifier was built using the LSTM architecture, and its parameters were found by a search through the space of possible parameter values. The final results were presented and interpreted, giving an extensive account of the performance of the presented models. Finally, future work was proposed that will give insight into how the presented model could be further augmented.
Our goal of stating a proof of concept has been successfully reached.

References