## References

## 1 Introduction

EEGs are used in many health care institutions to monitor and record electrical activity in the brain using electrodes placed on the scalp. Analysis of EEG recordings is a crucial first step to making a clinical diagnosis of epilepsy, severity of anesthesia, coma, encephalopathy, and brain death [yamada_practical_2017]. Traditionally, clinicians analyze EEG signals by visual inspection, which is usually time-consuming and can be affected easily by clinicians’ subjectivity [main_compare]

. Thus automatic detection systems based on machine learning algorithms have been developed to evaluate and classify EEG signals

[RAMGOPAL_seizure_2014].Due to the specific properties of EEG signals, such a classification task can be more broadly described as non-stationary signal classification. Traditionally, in non-stationary signal classification, the first step is to extract the useful features and then use classifiers like Support Vector Machine(SVM), K-Nearest Neighbors(KNN), Regression, and so on to differentiate classes

[non-stationary1998, ns_svm_2002, tf_svm_2014, GLCM_2016]. Many techniques have been proposed for EEG signal classification. Hassanpour et al. [hassanpour_timefrequency_2004]use modified B-distribution to characterize low-frequency EEG patterns and apply singular value decomposition (SVD) on the time-frequency distribution matrix to detect seizures in newborns and obtain encouraging results. In

[tfa_dnn], Tzallas et al. extract features using Cohen’s class Time-Frequency Representation (TFR) and power spectrum density, then use Artificial Neural Networks as the classifier to identify epileptic seizures in three benchmark EEG datasets. Boashas et al. [QTFD_2015] use quadratic time-frequency distributions (TFDs) with extended features and matrix decomposition with SVM as the binary classifier to detect newborn EEG abnormalities. In [motion_eeg], the authors use wavelet energy and wavelet entropy as features and KNN as the classifier. In [o.k._time-domain_2019], the authors propose exponential energy as a new feature, and combine it with other commonly used entropy and energy features, then use SVM to classify epileptic EEG signals.In the recent decade, it has been shown that methods based on deep learning can yield better performance than conventional methods

[main_compare, tf_spatial_conv, Gao_Gao_Chen_Liu_Zhang_2020, Gao_Dang_Wang_Hong_Hou_Ma_Perc_2021]. Most of these methods utilize real-valued signals in the algorithms. In [main_compare], the authors use the single-sided amplitude spectrum as input to a CNN. However, in frequency domain, the signals are represented as complex values. Such a complex representation contains the power of a specific wave relative to others (amplitude) and the alignment of different frequency components in time (phase). Using only the amplitude information will lose the phase information, which might be critical for EEG signals. Figure

1 illustrate the core idea of our argument. In Figure 0(a), we can see that these two convolution operations use the same convolution kernel, and all the complex numbers have the same modulus, which is ; however, the convolution results are different. If we only use the modulus in this convolution operation (See Figure 0(b)), we can only get one result, which is . Suppose the phase is the only difference between two signals; using amplitude only as the feature can not differentiate them properly.We, therefore, propose an algorithm that can utilize the features hidden in the phase information. To achieve this goal, we may train a real-valued convolutional neural network whose inputs are the amplitude and the phase as two channels or the real part and the imaginary part of the complex number as two channels. However, the traditional convolution operation contains a linear combination of the channels. It is unclear if such a linear combination is meaningful[complex_patrick_2019].

With these issues in mind, we believe that using the original complex values of DFT as input is a better alternative. In [Hirose_generalization], the authors also mentioned that complex-valued neural networks might be suitable for extracting phase information. Therefore, we exploit a complex-valued neural network for our case, more specifically complex-valued convolutional neural network. In [Guberman_2016_on_complex], a complex-valued convolution neural network was introduced and compared with a real-valued convolutional neural network on the classification tasks. Another notable paper is [crelu]

, where the authors applied deep complex networks to audio-related tasks and achieved promising results; this work also presented batch-normalization and complex weight initialization strategies for complex-valued neural networks. In our experiments, we observe that the performance is not improved if all the layers in the network are complex-valued. The first possible reason is the complex-valued non-linear activation function is task-specific; an inadequate selection of activation function may lead to poor transmission of the information between layers

[complex_patrick_2019, Scardapane_complex_2018, bassey_2021_survey]. The second possible reason is that in the loss function of a complex-valued neural network, one usually needs to calculate the ”distance” between complex numbers and real numbers, which is not well-defined in mathematics.

Therefore, we develop an algorithm that integrates the real-valued and complex-valued neural networks to overcome the difficulties mentioned above. Our approach builds on the network structure of San-Segundo et al. [main_compare]

, which contains two convolutional layers for feature extraction and three fully connected layers for classification. We improve their neural network by adding a crucial complex-valued convolutional layer at the beginning, and immediately taking the modulus as non-linear activation (See Section

2.1). There are four main advantages of our network architecture: 1) The extra complex-valued convolutional layer captures the features in the phase information of complex-valued input; 2) we only need one universal complex-valued activation function, which is taking the modulus; 3) The difficulty in distance calculation between complex numbers and real numbers in the loss function is avoided in our framework; 4) Our network uses about 50% fewer parameters compared with the structure in [main_compare]. Moreover, our framework can achieve higher classification accuracies than the conventional feature selection method and the CNN in

[main_compare] in both the experiments using a benchmark dataset and the simulations.We apply the proposed method to the EEG signal classification problem with discrete Fourier transform (DFT). We present qualitative results on several classification tasks, including binary and multi-class classification on two simulated datasets and one real-world dataset. The simulated datasets consist of synthetic signals of our own design and signals generated based on well-known theories. The real-world dataset is the Ralph-Andrzejak EEG dataset [RalphEEG], which contains 500 EEG recordings from 500 patients.

The rest of this paper is structured as follows: Section 2 contains the methodology in this study, including a description of the discrete Fourier transform (DFT), the framework of our neural network, the backpropagation algorithm, and some training details. Section 3 describes the experiments and the results obtained in simulation and a real-world dataset. Section 4 summaries this paper and discusses the limitations of our method and our future works

The symbols and notations used in this paper are summarized in Table 1

Sets of real, complex numbers | |
---|---|

Imaginary unit | |

Transpose, conjugate, conjugate transpose of | |

Real part of | |

Absolute value, modulus of |

## 2 Methodology

In this section, we present an overview of our methodology, including the framework of our network, a description of discrete Fourier transform, the backpropagation algorithm, and some training details.

### 2.1 Framework

Figure 2 shows our framework. The first layer we use is a complex-valued convolutional layer. Immediately after this layer, we take the modulus (see the part in the dotted line box in Figure 2

), making all the outputs real-valued. After this, we add two real-valued convolutional layers and three fully connected layers with ReLu and max-pooling. Finally, we use softmax as the last activation function and cross-entropy as the loss function.

### 2.2 Discrete Fourier Transform and Its Inverse Transform

We apply discrete Fourier transform (DFT) on the original EEG signals to obtain their representations on the frequency domain and apply inverse discrete Fourier transform (IDFT) to achieve the inverse transform. The formulas of DFT and IDFT are shown below:

(1) | ||||

(2) |

where is the original signal of length , is the Fourier transform of of length , and . This paper implements the DFT and IDFT using MATLAB command and . Because our original signals are all real-valued, their DFTs are conjugate symmetric. We only keep the first half part of the DFTs as the input to our neural network.

### 2.3 Backpropagation

As we can see from Figure 2, our framework contains a complex-valued convolutional layer taking modulus as activation, and after this, all the layers are real-valued. We use the Adam algorithm [adam] with default settings to optimize the kernel and bias in real-valued layers. To optimize the parameters in the complex-valued convolutional layer, we also need the backpropagation algorithm in the complex value. The regular complex derivative only applies to the analytic functions [Scardapane_complex_2018]; however, to obtain the modulus need to evaluate the following the function:

(3) |

which is not analytic. So in backpropagation, the derivative of with respect to can not be calculated with a regular complex derivative. In this case, we need to adopt the Wirtinger derivative.

#### 2.3.1 Wirtinger derivative

The following defines the Wirtinger derivatives:

###### Definition 2.1.

Consider the complex plane . The two operators and are defined by:

(4) |

are referred to as the Wirtinger derivatives [wirtinger_zur_1927].

Wirtinger derivative holds standard rules for differentiation known from real-valued analysis concerning the sum, product, and composition of two functions. Then from equation (4), we can derive another essential property of Wirtinger derivative:

which means we can treat and as independent variables. Then based on the first-order Taylor expansion for multivariable functions, we have:

(5) |

To further derive the backpropagation algorithm based on the Wirtinger derivative, we need the following Corollary.

###### Corollary 2.1.

Derivatives of the conjugate function satisfy the following relationships:

(6) |

In our case, , which means , equation (6) can be further simplified to the following equations:

(7) |

From equations (5) and (7), by omitting the lower order term, we then have:

(8) |

Based on the principles of the gradient descent method, we need to find that maximize . From Cauchy–Schwarz inequality, we know that should have the same direction of . So the direction of the steepest ascent is the direction of (if , from Equation 7, we know that ).

We can give the general form of the backpropagation algorithm for CVNN:

(9) |

where is the set of all the parameters that need to be learned in the complex-valued layers at ’th iteration.

is the learning rate. We adopt automatic differentiation in Tensorflow

[tensorflow2015-whitepaper] to calculate the gradients for complex-valued layers based on the Wirtinger derivative [tensorflow2015-whitepaper]. We use complex Adam [complex_sarroff_2018] as the specific backpropagation algorithm to optimize the parameters.#### 2.3.2 An example of one convolution kernel

Suppose we have an input sequence data and a complex convolution kernel (See Figure 3). is an arbitrary part of , and is the convolution result. Note that here

is not the Hermitian inner product of two complex-valued vectors. After adding the bias term

, can be defined as:(10) |

From the backpropagation for the real-valued CNN, we can get . Then based on the Wirtinger derivative, we need to find and

. Based on the chain rule, we have:

(11) |

Similarly,

(12) |

### 2.4 Other training details

For the real-valued weights and bias, we use Xaiver initialization [xavier]. For the complex-valued weights and bias, we use the Rayleigh distribution to generate the modulus of the complex number (

) and the Uniform distribution

to generate the angle (). Then we can get the initialization for complex-valued parameters by using the formula .## 3 Experiments

In this section, we compare our method against two existing frameworks on the classification task with two simulated datasets. We then apply our approach to a real-world dataset and compare it with several previous works.

### 3.1 Simulation Study

In this study, we simulate EEG signals with two different methods to compare the classification performance between our method and other methods. In the first simulation, we adopt the first-order autoregression (AR(1)) model to generate the amplitude and phase separately. We then use the inverse discrete Fourier transform (IDFT) to obtain the signals on the time domain. The main difference among signals in different classes in this simulation is the phase. We want to use this simulation to prove that our algorithm can efficiently utilize the features in phase. In the second simulation, we adopt classical theory and phase resetting theory to generate event-related potential (ERP) with noise [yeung_detection_2004, yeung_theta_2007], and then we design four classification task (See Table 3). There are two reasons we perform the second simulation: 1) the signals in the second simulation are closer to the real EEG signals, 2) ERP-signal classification is crucial in analyzing human EEG activities and can be a promising tool to study error detection and observational-learning mechanisms in performance monitoring [vasios_classification_2009].

We mainly compare our method with real-valued CNN and the conventional feature selection method in these two classification tasks. For real-valued CNN, the network structure we choose to compare with is the structure used in [main_compare], which contains two convolutional layers and three fully connected layers. To compare with the feature selection method, we choose seven features, which are Shannon entropy [ibrahim_electroencephalography_2018], Renyi entropy [das_discrimination_2016], log-energy entropy [das_discrimination_2016], approximate entropy [approximate_sohn_2006], sample entropy [richman_physiological_2000, li_assessing_2015], fuzzy entropy [simons_fuzzy_2018, chen_measuring_2009, shi_entropy_2017], and exponential energy [o.k._time-domain_2019](See Table 2 for detailed parameter settings.). In both simulations, we applied 6th order Butterworth low pass filter to remove the frequencies over 60 Hz before extracting the features. We try all 127 combinations of these seven features. For each combination of the features, we extract them from the pre-processed signal, the first and second order difference of the pre-processed signal. So the number of features we select is always a multiple of 3. The classifier we choose is the support vector machine(SVM) for the binary classification task and the error-correcting output codes (ECOC) model using SVM binary learners[escalera_decoding_2010, escalera_separability_2009] for the multi-class classification task.

Features | Parameters | Ref. |
---|---|---|

Renyi Entropy | – | |

Approximate Entropy | [richman_physiological_2000] | |

Sample Entropy | [richman_physiological_2000] | |

Fuzzy Entropy | [li_detection_2018] |

: the standard deviation of the input time-series data.

In this simulation study, we use 5-fold cross-validation to obtain accuracy. To alleviate the accuracy variation, we repeat the 5-fold cross-validation ten times for each classification task to get the average and standard deviation of the accuracies. The accuracies presented in this simulation study are based on the results with the highest mean accuracy.

#### 3.1.1 EEG signals generated with AR(1) model

In this section, we simulate the signals using AR(1) model because we assume that neighboring amplitude and phase are not independent. The AR(1) model assumes that the current value depends linearly on its immediately prior value and a stochastic term, which complies with our assumptions. The formula for AR(1) model is shown in Equation (13):

(13) |

where is the present value, is the immediately prior value, is a constant, is the model parameter, and

is the white noise with zero mean and constant variance.

We first simulate the phase information. We know that the phase , so we modify the Equation (13) to ensure the range of is limited. The modified formula is shown in Equation (14):

(14) |

where is a function used to obtain the remainder of divided by . Here, achieves an overall phase shift, and its effect can be understood as a rotation of a signal under the Hilbert transform (See Figure 4). Suppose we have the Hilbert transform of a real-valued signal. In that case, we can plot the analytic form of the signal in three-dimensional Cartesian coordinates with the time axis, the real part axis, and the imaginary part axis. Then we can rotate this analytic form of the signal about the time axis(dashed line in Figure 4 (b), (c)) to achieve the effect of .

From equation (14), we can see that, to simulate the phase, we need to determine three parameters —, and the variance of . Since determines the correlation between the and and the variance of determines the intensity of the noise, we use different as the baseline to generate signals for different classes (See the middle column in Figure 6). Because we want to compare our method with other methods in multi-class classification, we generate signals for five classes by setting as (corresponds to the different angle of rotation in Figure 4). To show the effect of and the variance of for different classes, we let and be ranged from 0.1 to 0.9 with an interval of 0.1. The accuracy table in Figure 7 contains nine by nine values, and each value corresponds to a specific pair of and .

After we obtain the simulated phase, we use unmodified AR(1) model (13) with

to randomly generate the amplitude for different groups. We then multiplied the single-sided amplitude spectrum by the Chi-square distribution to make our simulated signals have the main bandwidth appears in the range of 0 to 70 Hz (see Figure

5), which is close to the bandwidth used by clinical analysis of EEG [beniczky_electroencephalography_2020]. Then we adopt IDFT to obtain the simulated signals on the time domain using the simulated phase and amplitude (see Figure 6). Finally, the simulated signals last for 1.5 seconds with a 200 Hz sampling frequency.Figure 7 shows the classification results using this simulated dataset. As shown in Figure 6(a), it is not surprising that applying CNN on phase-only data can achieve the highest accuracy no matter the value of and . Because based on our simulation, all the differences among different classes are reflected in the phase information, and the random variation in amplitude can be viewed as noise. We also apply CNN and the feature selection method to the simulated signals in the time domain for comparison (see Figure 6(b) and 6(c)). Figure 6(d) shows the result obtained by using our algorithm. Figure 8 shows the accuracy differences between our method and the two methods for comparison. We can see our method outperforms other methods in most cases. We also show the confusion matrices in the lower part of Figure 8. These four confusion matrices are selected when the accuracy difference achieves the largest( and ).

#### 3.1.2 EEG signals generated according to classical theory and phase resetting theory

In this simulation, we generate EEG signals for the classification task according to two main theories: classical theory and phase-resetting theory [yeung_detection_2004, yeung_theta_2007]. The former theory assumes that the ERP signal is buried in the ongoing EEG noise, while the latter theory believes that the events can reset the phase of ongoing oscillations. With each theory, we design two binary classification tasks, referred to as the ”Fixed location” task and the ”Random location” task (see Table 3), to compare our method and other methods.

classical theory | phase resetting theory | |
---|---|---|

Fixed location | Fixed peak location + noise noise only | Fixed phase resetting location + noise no phase resetting + noise |

Random location | Random peak location + noise noise only | random phase resetting location + noise no phase resetting + noise |

In the simulation based on the classical theory, we first randomly generated 10,000 pieces of noise signals with the same power spectrum of human EEG signals using the MATLAB codes downloaded from [EEG_sim_website]. Each piece of noise signal lasts for 2 seconds with a sampling rate of 150 Hz. Then 5000 pieces of these noise signals were randomly selected to add a peak signal. In the experiment of the ”Fixed location,” we added 5 Hz peak signal centered at the middle of each piece of noise (see Figure 9), and in the experiment of the ”Random location,” we change the location of the center of the peak to be uniformly random distribute on the interval of 0 to 2 seconds.

In the simulation guided by phase resetting theory, we randomly generate 5000 signals with phase resetting and 5000 signals without phase resetting. For the phase resetting group, similar to Makinen et al. [makinen_auditory_2005], we generate simulated data by summing four sinusoids with frequencies chosen randomly from the range 4 to 16 Hz. In the experiment of ”Fixed position,” each of these four sinusoids contains phase resetting at the center (see Figure 10). In the experiment of ”Random position,” each of these four sinusoids contains phase resetting at the same random position (uniformly distributed on the interval of 0 to 2 seconds). Then we add randomly generated human EEG background noise to all the signals.

We mainly compare our method with the approach that applies CNN on phase information only, amplitude information only, and the original signal. We also separated the real and imaginary parts of the FFT and applied the exact structure of CNN to each of them. Furthermore, we also apply CNN with the input of real and imaginary parts as two channels. The result shows in Table 4. We can see that our method can outperform others by comparing the accuracy.

Method | classical theory | Phase resetting(with noise) | ||

Fixed | Traditional | Features + classifier | 0.64640.0010 | 0.64030.0012 |

Phase only | 0.97580.0018 | 0.94100.0042 | ||

Amplitude only | 0.94220.0023 | 0.86400.0113 | ||

Original signal | 0.92220.0022 | 0.90250.0021 | ||

CNN | Real part only | 0.98010.0013 | 0.91130.0028 | |

Imaginary part only | 0.63490.0096 | 0.85530.0033 | ||

Real+Imaginary(two channels) | 0.95930.0049 | 0.93050.0017 | ||

Our method | 0.99530.0009 | 0.96790.0103 | ||

Random | Traditional | Features + classifier | 0.65040.0010 | 0.68070.0010 |

Phase only | 0.92730.0029 | 0.90430.0022 | ||

Amplitude only | 0.89230.0056 | 0.76290.0086 | ||

Original signal | 0.87270.0071 | 0.53710.0049 | ||

CNN | Real part only | 0.94880.0020 | 0.88400.0070 | |

Imaginary part only | 0.94310.0027 | 0.86900.0067 | ||

Real+Imaginary(two channels) | 0.91620.0113 | 0.79260.0067 | ||

Our method | 0.99300.0012 | 0.93690.0144 |

### 3.2 Real-World Data

#### 3.2.1 Data Acquisition, Description and Pre-processing

The real-world dataset analyzed here is the Ralph-Andrzejak EEG dataset [RalphEEG], which contains five categories of signals named Z, O, F, N and S, respectively (see Table 5). Each category contains 100 EEG records of about 23.6 seconds with a sampling rate of 173.61 Hz (4097 values per record). Although there are 5 classes in this dataset, the four most common classification tasks for this dataset are Z vs S [o.k._time-domain_2019], S vs NS (NS=Z+O+F+N) [wang_automatic_2017, kumar_classification_2015], Z vs N vs S [Abualsaud_ensemble_2015] and Z vs F vs S [Sadati_2006].

Category | Description |
---|---|

Z | Healthy group, recorded with eye open |

O | Healthy group, recorded with eye closed |

F | Interictal activity, recorded in the epileptogenic zone |

N | Interictal activity, recorded at hippocampal location |

S | Seizure activity |

To obtain a comparable result, we use the pre-processed dataset in [UCI_data_EEG, UCI_data]. In this dataset, the original signals of 23.6 seconds were divided into 23 non-overlapping segments. Each segment contains 178 values(about 1 second). Totally, there are pieces of segmental signals. Figure 11 shows five example signals from each category and their corresponding single-sided amplitude spectrum.

To obtain the input to our neural network, we apply DFT on the original signal and only keep the first half. The right row in Figure 11 shows the corresponding single-sided amplitude spectrum. We also follow the training steps of the 5-fold cross-validation mentioned in [main_compare]. We randomly shuffle the data and divide them into five groups equally. We use one out of five as the test set, and in the other four groups, we choose one as the validation set and the other three as the training set(See Figure 12

). We then train our neural network with these three training sets and use the validation set to choose the number of epochs yielding the highest accuracy. After that, we retrain the neural network on the training sets and the validation set together and use the number of epochs chosen in the previous step. We then apply this final model to the test set to get the classification results. For one round of the above steps, we can get five accuracies and we keep the average of them. Then we repeat this process ten times to get the mean and standard deviation of the accuracies for each classification task. The last row in table

6 shows the results obtained from our method.#### 3.2.2 Result and comparison

We can see from Table 6 that our method can reach the same accuracy (99.8%0.13%) on Z vs S and higher accuracy on Z vs N vs S (97.050.37%) and Z vs F vs S (96.360.63%) compare with the previous results. Another main concern about our method is the number of parameters and memory use. We summarize our neural network framework’s parameters and memory use and the one used in [main_compare] in Table 7. Our model uses 52% fewer parameters and about 15% more memory than the model used in [main_compare]. The additional memory cost happens in the layer of taking the modulus since this layer needs to store the modulus for the use of the next layer.

Methods | Z vs S | S vs NS | Z vs N vs S | Z vs F vs S |
---|---|---|---|---|

Neural Fuzzy Network[Sadati_2006] | – | – | – | 86.00.82% |

Ensemble Classifier[Abualsaud_ensemble_2015] | – | – | 90.00.71% | – |

Local Binary Pattern[kumar_classification_2015] | – | 98.30.24% | – | – |

Multi-Domain Feature Extraction[wang_automatic_2017] | – | 99.00.15% | – | – |

Exponential energy + SVM[o.k._time-domain_2019] | 99.50.20% | – | 91.70.65% | 89.00.74% |

Raw data + DNN [main_compare] | 99.00.29% | 98.80.20% | 89.20.73% | 89.40.73% |

only FFT + DNN [main_compare] | 99.80.13% | 99.40.14% | 96.30.45% | 95.60.48% |

All transform + DNN[main_compare] | 99.80.13% | 99.50.13% | 96.50.44% | 95.70.48% |

Our method | 99.80.13% | 98.80.23% | 97.050.37% | 96.360.63% |

Model in [main_compare] | # Para | Memory | Our method | # Para | Memory |
---|---|---|---|---|---|

Input | 1128=0.128k | Input | 1282=0.256k | ||

Conv | 48 | 12828=2.048k | |||

ABS | 1288=1.024k | ||||

Conv | 192 | 12832 = 4.096k | Conv | 656 | 12816=2.048k |

Pool | 4232 = 1.344k | Pool | 4216=0.672k | ||

Conv | 5152 | 4232 = 1.344k | Conv | 2592 | 4232=1.344k |

Fc1 | 241792 | 128=0.128k | Fc1 | 98688 | 256 = 0.256k |

Fc2 | 4128 | 32=0.032k | Fc2 | 16640 | 32 = 0.032k |

Fc3 | 128 | 3=0.003k | Fc3 | 256 | 3 = 0.003k |

# Total | 251392 | 7.075k | 118880 | 8.163k |

## 4 Discussion

In this paper, we proposed a novel neural network architecture that can capture the phase information in signals by using a complex-valued convolutional layer at the very beginning. In simulations, our framework significantly improves the classification performance compared with other methods; furthermore, our method can reduce the number of parameters and improve the accuracy simultaneously in the experiments for the real-world dataset. Besides, our framework can also be applied to find proper complex-valued filters on the frequency domain without prior knowledge, which is usually a tricky task. Currently, all the input signals to our neural network are relatively short and only contain one channel. In the future, we plan to improve our method such that it can be applied to classify long-term and multi-channel EEG signals.

## Availability of data and materials

Our codes for building neural networks are based on this work [cnn-from-scratch] and are available at: https://github.com/David-Hang-Du/HCVNN-EEG. The codes for simulations can also be found in our GitHub repository.