Visual classification is a computer vision task that takes an image as input and outputs a prediction of the category of the object in the image. It has become one of the core research directions in computer vision and has been widely used in applications such as the localisation, detection, and segmentation of objects [karpathy2016cs231n, He_2019_CVPR]. In recent decades, with the development of Convolutional Neural Networks (CNNs), which are loosely inspired by the human visual architecture, researchers have made great efforts to improve object recognition performance [Girshick_2014_CVPR, 10.1007/978-3-319-10578-9_23, He_2017_ICCV]. However, some researchers are aware that there are still significant differences in the way that humans and current CNNs process visual object information [NIPS2018_7982]. In particular, studies that evaluated CNNs on negative images and compared generalisation towards previously unseen distortions in humans and CNNs [NIPS2018_7982] have further shown that the robustness of CNNs in object recognition is not at the human level.
For human beings, the advantage of the visual exteroceptive sense is distinct. For example, people usually look directly at the objects they want to recognise to make full use of foveal vision. It has always been a challenging issue in cognitive neuroscience to figure out how humans accomplish object recognition effortlessly in everyday life, and how to model the mechanisms or the discriminative feature spaces employed for object categorisation [Grill1988, Beeck10111, Peelen2007The]. Researchers have found that the brain exhibits functions of feature extraction, shape description, and memory matching when it is involved in visual cognitive processes [gazzaniga2006cognitive]. Subsequent studies [Simanova_2010, NORMAN2006424, HANSON2004156] have further revealed that analysing recordings of brain activity, linked with the operating human visual system, can help us understand the representational patterns of visual objects in the cortex. For instance, from hemodynamic responses, information about some particular categories of images (e.g., human faces and objects) was found to be represented primarily in several specialised regions of the cortex [Haynes2005, Ishai9379, COX2003261]. Inspired by the above visual neuroscience investigations, some recent work has addressed visual classification problems by analysing neurophysiological and neuroimaging signals recorded during human visual cognitive processes [Haynes2006Decoding, Kaneshiro2015, spampinato2017, renli2018, renli2020], demonstrating the feasibility of identifying the feature space employed by humans for object image categorisation. However, these studies are still limited to analysing visual brain activity using the raw physiological signals, without extracting a more representative input during the signal preprocessing stage.
In addition, many existing visual classification studies have focused on electroencephalography (EEG)-based visual object discrimination, as discussed above. EEG signals, which feature a high temporal resolution in comparison with other neuroimaging modalities, are generally recorded by electrodes on the surface of the scalp, and have been applied in developing several kinds of brain-computer interface (BCI) classification systems, such as picture, music, and speech recognition [bashivan2015learning, stober2015deep, Carlson_2011]. Interestingly, when a human is exposed to different visual or auditory stimuli, the EEG signals capture diverse evoked potential responses [gazzaniga2006cognitive]. However, the raw EEG waveforms are dominated by the spontaneous potentials of the human brain in its natural state, which makes it difficult to distinguish the hidden event-related information during the visual cognitive process [gazzaniga2006cognitive, DEPASCALIS2004295]. Thus, the event-related potential (ERP) was proposed to identify the real-time evoked response waveforms caused by stimulus events (e.g., specific vision and motion activities). ERPs usually have lower amplitudes than the spontaneous EEG [gazzaniga2006cognitive] and are extracted from EEG segments by averaged superposition over multiple visual trials.
2 Related Work
Decoding image object-related EEG signals for visual classification has been a long-sought objective. For example, the early-stage studies in [philiastides2006neural, Philiastidesbhi130] attempted to classify single-trial EEG responses to photographs of faces and cars. An image classification study in 2015 [Kaneshiro2015] used a comprehensive linear classifier to tackle EEG signals evoked by 6 different object categories, and achieved a classification accuracy of around 40%. Since then, research at the intersection of deep learning and the decoding of human visual cognitive feature spaces has increased significantly. For example, Bashivan’s work [bashivan2015learning] in 2015 transformed EEG into topographical maps for each image and trained a deep recurrent-convolutional neural network to learn robust representations for classification.
Afterwards, Spampinato et al. [spampinato2017] proposed an automated visual classification framework in 2017 that computes EEG features with Recurrent Neural Networks (RNNs) and trains a CNN-based regressor to project images onto the learned EEG features. However, two recent studies in 2018 and 2020 [renli2018, renli2020] raised serious questions about the block design that Spampinato et al. [spampinato2017] employed in EEG data acquisition, in which all stimuli of a specific class are presented together rather than randomly intermixed. In particular, the latest study in 2020 [renli2020] replicated Spampinato’s experiment [spampinato2017] with a rapid-event design and analysed the classification performance on the randomised EEG trials. In addition, we note that a special recurrent neural network structure, the Long Short-Term Memory (LSTM) network, is commonly used in these studies to learn the representations of brain signals, which has shown the feasibility of combining the decoding of human visual activity with deep learning for visual classification.
However, most current machine learning approaches for visual classification neglect the evoked potentials embedded in the spontaneously generated EEG. Even now, it is still difficult for deep learning to recognise distinctive patterns of evoked potentials from the raw waveforms of EEG signals recorded under a visual stimulus, so we assume that excluding visually evoked potentials could be a fundamental cause of an uncertain feature representation space for visual classification, which restricts the improvement of classification accuracy. Only two early-stage studies [Wang_2012, qin2016classifying] preliminarily explored evoked-guided models, extracting ERPs from EEG signals for visual classification and using 4 ERP components (P1, N1, P2a, and P2b) as the inputs of a linear classifier. However, these preliminary studies required multiple spatial or temporal components to be selected during the preprocessing stage and did not apply deep learning to evoked-guided feature representations.
Thus, our work in this study was inspired by two assumptions: (1) the feature representations employed by human brains for visual classification will be more pronounced when learned from the purer ERP, which conveys image stimulus-related information; (2) the multi-dimensional ERPs can be decoded into a one-dimensional representation using an RNN, without requiring pre-selection of spatial or temporal components. One special type of RNN, the LSTM, has a strong capability for recognising long-term and short-term feature representations in time-series EEG signals.
Based on the above two assumptions, we propose the first visual evoked potential-guided deep learning framework, called the ERP-LSTM framework, to learn discriminative representations for visual classification. The ERP-LSTM framework consists of two stages: (1) acquiring the ERP waveforms from multiple EEG trials by averaged superposition; (2) a parallel LSTM network that maps the extracted ERPs into feature representation vectors, followed by an activation layer that classifies the derived vectors into different classes.
3 Our Proposed Framework
An overview of our proposed ERP-LSTM framework is shown in Fig. 1. The framework is separated into two stages for visual classification. In Stage 1, we take the raw EEG signals recorded during the visual experiment and extract ERPs from them to secure the visual stimulus-related signals. In Stage 2, we train an LSTM network to learn the representation space of the ERP sequences, followed by a Softmax classification layer trained to discriminate between the different classes of images.
3.1 Stage 1: ERP Extraction from EEG
The representative features of EEG signals play an essential role in classifying image object categories. The first stage of our proposed framework aims to extract representative visual-related features of ERPs by increasing the signal-to-noise ratio (SNR) of the raw EEG signals through smooth averaging. A number of EEG segments from trials of the same stimulus are averaged into a fused waveform. Specifically, during the averaging process, the consistent features of the segments (the ERPs) are retained, while features that vary across segments are attenuated (refer to the upper left corner of Fig. 1).
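The SNR gain from averaging can be made explicit under the classical additive-noise model. The derivation below is only a sketch: it assumes that each trial is the sum of a trial-invariant evoked component and zero-mean noise that is independent across trials with equal variance, which real EEG satisfies only approximately. The symbols $x_n$, $s$, $\varepsilon_n$, and $\sigma$ are introduced here for illustration and are not part of the framework's notation.

```latex
\begin{align}
x_n(t) &= s(t) + \varepsilon_n(t), \qquad n = 1, \dots, N,
  \qquad \mathbb{E}[\varepsilon_n(t)] = 0, \quad \operatorname{Var}[\varepsilon_n(t)] = \sigma^{2}, \\
\bar{x}(t) &= \frac{1}{N} \sum_{n=1}^{N} x_n(t)
  = s(t) + \frac{1}{N} \sum_{n=1}^{N} \varepsilon_n(t), \\
\operatorname{Var}\!\left[\bar{x}(t)\right] &= \frac{\sigma^{2}}{N}
  \quad \Longrightarrow \quad
  \mathrm{SNR}_{\mathrm{avg}} = \frac{|s(t)|}{\sigma / \sqrt{N}} = \sqrt{N} \, \mathrm{SNR}_{\mathrm{single}} .
\end{align}
```

Under these assumptions, averaging $N = 12$ trials would raise the amplitude SNR by a factor of $\sqrt{12} \approx 3.46$.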
More formally, let $D_i = \{T_{i,1}, T_{i,2}, \dots, T_{i,N}\}$ be the $i$-th subset of the multi-channel temporal EEG signals recorded while one subject is viewing an exemplar image. $N$ is the number of EEG trials to be averaged, so $D_i$ contains $N$ EEG trials, where each trial $T_{i,n}$ is a $c$-channel signal ($c$ is the number of channels).
The averaging process is described by the following formula:
$$e_i = \frac{1}{N} \sum_{n=1}^{N} T_{i,n}$$
where $e_i$ is the ERP sequence averaged from $D_i$.
Let $E$ be the set of extracted multi-channel ERPs, $E = \{e_1, e_2, \dots\}$, which will be the inputs of the LSTM encoder module described in the next subsection to learn discriminative feature representations for visual classification.
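The averaging step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this study: the function name `extract_erp`, the synthetic sine-shaped evoked waveform, the unit-variance Gaussian noise, and the array sizes are all assumptions made for the example.

```python
import math
import random


def extract_erp(trials):
    """Average N trials (each a channels x samples nested list) into one ERP."""
    n_trials = len(trials)
    n_channels = len(trials[0])
    n_samples = len(trials[0][0])
    return [
        [sum(trial[ch][t] for trial in trials) / n_trials for t in range(n_samples)]
        for ch in range(n_channels)
    ]


# Synthetic demo (hypothetical data): every trial contains the same evoked
# waveform plus independent Gaussian noise.
rng = random.Random(0)
n_trials, n_channels, n_samples = 12, 4, 32
evoked = [
    [math.sin(2 * math.pi * t / n_samples) for t in range(n_samples)]
    for _ in range(n_channels)
]
trials = [
    [[evoked[ch][t] + rng.gauss(0.0, 1.0) for t in range(n_samples)]
     for ch in range(n_channels)]
    for _ in range(n_trials)
]
erp = extract_erp(trials)


def rms_error(signal):
    """Root-mean-square deviation of a signal from the noise-free evoked waveform."""
    squared = [
        (signal[ch][t] - evoked[ch][t]) ** 2
        for ch in range(n_channels)
        for t in range(n_samples)
    ]
    return math.sqrt(sum(squared) / len(squared))


single_trial_error = rms_error(trials[0])
erp_error = rms_error(erp)
print(f"RMS error of a single trial: {single_trial_error:.3f}")
print(f"RMS error of the averaged ERP: {erp_error:.3f}")
```

In this synthetic setting the averaged waveform lies closer to the underlying evoked component than any single trial, which is the effect that Stage 1 relies on.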
3.2 Stage 2: Feature Representations and Classification
To further utilise the spatial and temporal information in the extracted ERPs, we applied an LSTM encoder module, shown in the lower part of Fig. 1, which follows Spampinato’s “common LSTM + output layer” architecture [spampinato2017]. The inputs of the encoder are the multi-channel temporal signals, i.e., the ERPs extracted in the previous subsection.
At each time step $t$, the first layer takes as input the vector of all channel values at time $t$; that is, all ERPs from multiple channels are initially fed into the same LSTM layer. After a stack of LSTM layers, a ReLU layer is added to make the encoded representations easier to map to the feature space. The whole LSTM encoder outputs a one-dimensional representation feature for each ERP. After the representation vectors are obtained, a Softmax activation layer is finally connected to classify the LSTM representation features into different visual categories.
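To make the data flow through the encoder concrete, the following pure-Python sketch runs one forward pass of a stacked LSTM, a ReLU, and a Softmax output on a random 124-channel input. It is not the trained model of this study: the weights are random and untrained, and the hidden size, the number of layers, the 32 time samples, the use of the last hidden state as the representation, and the linear projection placed before the Softmax are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Return W @ v + b for a list-of-lists matrix W and list vectors b, v."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LSTMLayer:
    """One LSTM layer with randomly initialised, untrained weights."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # One weight matrix and bias per gate, applied to [x_t, h_{t-1}].
        self.W = {g: random_matrix(hidden_size, input_size + hidden_size, rng) for g in "ifoc"}
        self.b = {g: [0.0] * hidden_size for g in "ifoc"}

    def forward(self, sequence):
        h = [0.0] * self.hidden_size  # hidden state
        c = [0.0] * self.hidden_size  # cell state
        outputs = []
        for x_t in sequence:
            z = list(x_t) + h
            i = [sigmoid(v) for v in affine(self.W["i"], self.b["i"], z)]    # input gate
            f = [sigmoid(v) for v in affine(self.W["f"], self.b["f"], z)]    # forget gate
            o = [sigmoid(v) for v in affine(self.W["o"], self.b["o"], z)]    # output gate
            g = [math.tanh(v) for v in affine(self.W["c"], self.b["c"], z)]  # candidate cell
            c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c, i, g)]
            h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
            outputs.append(h)
        return outputs


def erp_lstm_forward(erp, layers, W_out, b_out):
    # Transpose channels x samples into a sequence of per-time-step channel vectors.
    sequence = [list(step) for step in zip(*erp)]
    for layer in layers:
        sequence = layer.forward(sequence)
    feature = [max(0.0, v) for v in sequence[-1]]  # ReLU on the last hidden state
    return feature, softmax(affine(W_out, b_out, feature))


rng = random.Random(0)
n_channels, n_samples, hidden_size, n_classes = 124, 32, 16, 6  # 32 and 16 are hypothetical
erp = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(n_channels)]
layers = [LSTMLayer(n_channels, hidden_size, rng), LSTMLayer(hidden_size, hidden_size, rng)]
W_out = random_matrix(n_classes, hidden_size, rng)
b_out = [0.0] * n_classes
feature, probabilities = erp_lstm_forward(erp, layers, W_out, b_out)
print(len(feature), [round(p, 3) for p in probabilities])
```

The point of the sketch is the shape of the computation: a channels-by-time ERP is consumed one time step at a time, reduced to a single feature vector, and mapped to one probability per class.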
The LSTM encoder module is evaluated by the cross-entropy loss, which measures the difference between the classes predicted by the network and the ground-truth class labels. The cross-entropy loss is defined as follows:
$$L = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$
where $y$ is the probability distribution vector of the ground-truth class label, $\hat{y}$ is the output probability distribution of the Softmax classification layer, and $K$ is the number of classes. The total loss is propagated back through the neural network to update all of the model’s parameters by gradient descent optimisation.
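As a small worked example of this loss, the sketch below computes the cross-entropy between a one-hot ground-truth vector and a Softmax output for six classes. The logit values are hypothetical and chosen only for illustration, and the clipping constant `eps` is an implementation convenience rather than part of the definition.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a ground-truth distribution and a predicted one."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


logits = [2.0, 0.5, 0.1, -1.0, 0.0, 0.3]  # hypothetical scores for the 6 categories
y_pred = softmax(logits)
y_true = [1, 0, 0, 0, 0, 0]               # one-hot label: the first category is correct

loss = cross_entropy(y_true, y_pred)
uniform_loss = cross_entropy(y_true, [1 / 6] * 6)  # an uninformed prediction

print(f"loss for the example prediction: {loss:.4f}")
print(f"loss for a uniform prediction:   {uniform_loss:.4f}")
```

With a one-hot label, the loss reduces to the negative log-probability assigned to the correct class, so a uniform guess over six classes costs exactly $\ln 6$ and any prediction that favours the correct class costs less.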
In the proposed ERP-LSTM framework, the LSTM encoder module is used for generating feature representations from ERP sequences, followed by a Softmax classification layer to predict the visual classes.
4 The Experiment
4.1 The Dataset
In this study, we used an image dataset, along with EEG signals collected from 10 subjects while they viewed different categories of images [Kaneshiro2015]. The image dataset has a total of 72 photographs in 6 categories: Human Body (HB), Human Face (HF), Animal Body (AB), Animal Face (AF), Fruit Vegetable (FV), and Inanimate Object (IO). Fig. 2 shows sample photographs from the six categories that were used for visual classification.
During the experiment, each subject completed two experimental sessions, and each session contained three blocks. In each block, each of the 72 images was displayed 12 times in random order. Each experimental trial consisted of a single image displayed on the screen for 500 ms, followed by a blank grey screen for 750 ms as an interval. In total, each participant completed 72 trials for each of the 72 images, giving 5,184 trials per subject. Thus, 51,840 trials in total were collected from the 10 subjects. In this study, each trial was labelled with a description of the visual evoked-related event, namely the corresponding image category (which of the 6 categories it belongs to) or image exemplar (which of the 72 images it corresponds to).
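The trial counts above follow directly from the experimental design, as the short calculation below verifies. The variable names are ours, and the presentation time is a derived estimate that assumes back-to-back trials and ignores any breaks between blocks or sessions.

```python
sessions_per_subject = 2
blocks_per_session = 3
repetitions_per_block = 12
n_images = 72
n_subjects = 10

trials_per_image = sessions_per_subject * blocks_per_session * repetitions_per_block
trials_per_subject = trials_per_image * n_images
total_trials = trials_per_subject * n_subjects

stimulus_ms, blank_ms = 500, 750
trial_duration_s = (stimulus_ms + blank_ms) / 1000
presentation_minutes_per_subject = trials_per_subject * trial_duration_s / 60

print(f"trials per image and subject: {trials_per_image}")
print(f"trials per subject:           {trials_per_subject}")
print(f"trials over all subjects:     {total_trials}")
print(f"pure presentation time:       {presentation_minutes_per_subject:.0f} min per subject")
```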
The EEG recordings were collected using unshielded 128-channel EGI HCGSN 110 nets, and preprocessed using a Butterworth filter with cut-off frequencies of 1 Hz and 25 Hz before being temporally downsampled to a final sampling rate of 62.5 Hz (496 ms). The 4 channels at the forefront of the frontal cortex were removed for artifact removal purposes, leaving 124 electrode channels for ERP analysis.
In this study, we randomly segmented the 72 EEG trials of each image into 6 sets, each containing 12 EEG trials. The trials in each set were averaged to extract an ERP sequence with the same image and category label. We thus obtained 6 ERP sequences for each image, which together form the ERP space of all extracted 124-channel ERP sequences. Of note, the ERP space was split into a training set and a testing set in a proportion of 5:1, meaning that 5 of the 6 ERP sequences for each image (about 83%) are kept in the training set and the remaining one (about 17%) is in the testing set. To further evaluate the performance of the classification framework, we performed two types of classification: on a cross-subject basis and on a within-subject basis.
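The segmentation and split for a single image can be sketched as follows. This is an illustration under stated assumptions: the helper names `segment_trials` and `split_sets`, the fixed random seed, and the random choice of which set is held out for testing are ours, since the text does not specify how the held-out ERP sequence is selected.

```python
import random


def segment_trials(trial_indices, n_sets, rng):
    """Randomly partition trial indices into n_sets equally sized, disjoint sets."""
    shuffled = list(trial_indices)
    rng.shuffle(shuffled)
    set_size = len(shuffled) // n_sets
    return [shuffled[k * set_size:(k + 1) * set_size] for k in range(n_sets)]


def split_sets(sets, n_test, rng):
    """Hold out n_test of the sets for testing and keep the rest for training."""
    order = list(range(len(sets)))
    rng.shuffle(order)
    test_ids = set(order[:n_test])
    train = [s for k, s in enumerate(sets) if k not in test_ids]
    test = [s for k, s in enumerate(sets) if k in test_ids]
    return train, test


rng = random.Random(42)
sets = segment_trials(range(72), n_sets=6, rng=rng)  # the 72 trials of one image
train_sets, test_sets = split_sets(sets, n_test=1, rng=rng)

print(f"{len(sets)} sets of {len(sets[0])} trials each")
print(f"{len(train_sets)} sets for training, {len(test_sets)} set for testing")
```

Each set of 12 trial indices would then be averaged into one ERP sequence, so that every image contributes five ERP sequences to training and one to testing.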
5.1 Performance of Six-category Visual Classification
Table 1 presents the classification performance of the basic LSTM using raw EEG (EEG-LSTM) [spampinato2017] and of our proposed ERP-LSTM framework, for both types of classification (cross-subject and within-subject). Our proposed ERP-LSTM framework reached about 66.81% accuracy for cross-subject visual classification and achieved its highest classification accuracy of 89.06% for a single subject (subject 1). Both results outperformed those of the EEG-LSTM framework: the classification accuracy improved by 30.09 percentage points across the 10 subjects, by 53.99 for subject 1, and by 23.46 on average for within-subject classification over subjects 1 to 10.
Our findings suggest that the feature representation space encoded from the extracted ERPs is more discriminative for classifying image objects than that of the raw EEG. We also suppose that the information in the brain signals that is critical for object cognition was not lost during the averaging process. On the contrary, the extracted ERPs retained the spatial and temporal features related to the visual evoked potentials.
5.2 Performance of Exemplar-Level Visual Classification
Here, we further compared the existing frameworks and our proposed ERP-LSTM framework at the exemplar image level. This task no longer uses the categories as classification labels and instead aims to identify the specific image exemplar. Table 2 presents the results of the two existing frameworks, Kaneshiro [Kaneshiro2015] and EEG-LSTM [spampinato2017], and of our framework in identifying the exemplars among 72 classes across all 10 subjects. Our proposed ERP-LSTM framework still achieved a classification accuracy of 27.08% at the exemplar level, outperforming Kaneshiro (14.46%) and EEG-LSTM (7.97%). We also include the results of the six-category classification to give insight into the difference between the easy (category) and hard (exemplar) modes.
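| Classification level | Kaneshiro [Kaneshiro2015] | EEG-LSTM [spampinato2017] | ERP-LSTM (ours) |
|---|---|---|---|
| Categories (6 classes) | 40.68% | 36.72% | 66.81% |
| Exemplars (72 classes) | 14.46% | 7.97% | 27.08% |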
Thus, relative to the existing models, our work indicates that the feature representations decoded from the extracted ERPs are less confusable than those from raw EEG signals, which helps to learn a more discriminative feature space for visual classification. Furthermore, our ERP-LSTM framework also achieved better performance than a recent study in 2020 [renli2020] (in which the reported classification accuracy on 6 categories is 17.1%), although we used a different data source. This suggests that the LSTM network is capable of encoding the ERPs into a representative feature space, owing to the advantages of LSTM networks in handling the temporal dynamics of time-series EEG signals.
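To put the category-level and exemplar-level accuracies on a comparable footing, each reported accuracy can be related to the accuracy of uniform random guessing. The sketch below does this for the figures reported in this section. It assumes that chance level is 1/K for K classes, which holds for the exemplars because every image has the same number of trials, and holds for the categories only if they contain equal numbers of images, which we assume here.

```python
accuracies = {
    "Categories (6 classes)": {"Kaneshiro": 40.68, "EEG-LSTM": 36.72, "ERP-LSTM": 66.81},
    "Exemplars (72 classes)": {"Kaneshiro": 14.46, "EEG-LSTM": 7.97, "ERP-LSTM": 27.08},
}
n_classes = {"Categories (6 classes)": 6, "Exemplars (72 classes)": 72}

times_chance = {}
for level, results in accuracies.items():
    chance = 100.0 / n_classes[level]  # accuracy (%) of uniform random guessing
    for model, accuracy in results.items():
        times_chance[(level, model)] = accuracy / chance
        print(f"{level:24s} {model:10s} {accuracy:6.2f}%  "
              f"({accuracy / chance:5.1f} x chance of {chance:.2f}%)")
```

Viewed this way, the exemplar-level accuracies are lower in absolute terms but sit further above their chance level than the category-level accuracies do.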
In this paper, we proposed an evoked potential-guided deep learning framework, called the ERP-LSTM framework, for visual classification, which is separated into two stages: (1) extracting ERP sequences from multi-trial EEG segments; (2) a parallel LSTM network that encodes a representation feature space for object categorisation and classifies the EEG signal representations. Our proposed ERP-LSTM framework achieved better performance than existing frameworks on the classification of both 6 categories and 72 exemplar images. We believe our findings demonstrate the feasibility of learning representational patterns of visual objects from recordings of brain cortex activity, and that an ERP-LSTM framework can learn characteristic features for visual classification.