## I Introduction

Recent advances in neural engineering have shown that local field potential (LFP) signals collected from the prefrontal cortex of a subject trained to perform a specific set of tasks are a reliable alternative to spike count recordings for designing robust brain-computer interfaces (BCIs) [12, 8, 11, 1]. The application of the theory of non-parametric regression has proven crucial for the success of LFP-based neural decoders. As documented in [2, 1], this has led to the development of a complex spectrum-based feature extraction technique based on the famous Pinsker’s theorem. In contrast with conventional Power Spectral Density (PSD)-based decoders [9], which incorporate only the LFP signal amplitude information, Pinsker’s feature extraction methodology has enabled dramatic performance improvements in decoding motor actions in single-subject studies [1].

One of the most important objectives of neural engineering is to design robust and reliable *cross-subject* BCIs that generalize well over a large population of subjects [3]. Cross-subject BCIs should be trained over a limited number of representative subjects performing the same or similar tasks; nevertheless, they are expected to perform reliably on unseen representatives from the same population. The development of cross-subject BCIs is motivated by the following considerations. First, cross-subject BCIs can be used for quantifying similarities between the parts of the cortex that determine decision-making across different subjects, an important step towards cross-subject BCIs that can be deployed on new subjects from the population without extensive training. Second, when there is insufficient training data to train a reliable decoder for a given representative of a population, cross-subject BCIs should enable and facilitate the decoding of intended actions even in the absence of local training. This can be useful for subjects with impaired or lost motor functions, where training data collected from healthy subjects can be used to alleviate the chronic impairment [5, 10].

At the heart of such BCIs is the cross-subject decoding problem, in which training data collected from one representative subject (henceforth referred to as subject X, or the source) is used to train a reliable and well-performing decoder for another subject (subject Y, or the destination) over a specific set of actions. While the single-subject decoding problem has been extensively studied and is well covered in the existing literature, the cross-subject problem has seldom been addressed. Even in its most basic variant, where the decoder is defined over the same set of actions, cross-subject decoding turns out to be a challenging neurological problem. The main reason for its difficulty is that neural signals are highly variable, and their structural and statistical properties can change dramatically even under slight changes of the recording conditions. In fact, in our studies we have observed that a decoder trained on a data set collected from a given subject in general performs close to a random-choice decoder when applied directly to test data from a different subject, even in the case of identical action spaces. In other words, the feature spaces which represent the space of motor functions and in which the decoding of actions is performed vary across the population.

The last observation implies that the differences between the feature space representations of the action spaces of different subjects should be taken into account when tackling the cross-subject decoding problem. This motivates us to develop a novel pre-processing procedure, specifically tailored to address the aforementioned feature space variability. The technique, referred to as *data centering*, is inspired by the emerging field of transfer learning and is the main focus of this paper. The main underlying idea of data centering is to adapt the feature space of the source subject, where the training data is collected, to the feature space of the destination subject, where the decoding of intended motor actions takes place. The adaptation procedure relies on the notion of *transfer functions*, a concept that formally captures the functional transformation between feature spaces. We propose a simple, data-driven solution for estimating *linear transfer functions*. We then apply this solution to the experiment and data introduced in [9], where two adult macaque monkeys perform memory-driven visual saccades to one of eight peripheral target lights. The results show that when data centering with linear transfer functions is applied over Pinsker’s feature space, the cross-subject decoder exhibits a substantial performance improvement over a random-choice decoder, making data centering a highly promising solution for cross-subject BCIs.

## II Methods

### II-A Description of the Experiment

We begin with a brief overview of the experimental setup [9, 1]. All experimental procedures were approved by the New York University Animal Welfare Committee (UAWC).

Two adult male macaque monkeys (*M. mulatta*), referred to herein as Monkey A and Monkey S, were trained to perform a memory-guided saccade task, see Fig. 1. The setup consists of an LCD monitor on which targets are presented to the animal. The monkey initiates a trial by fixating a central target. Once the monkey has maintained fixation for a baseline period, one of 8 possible peripheral targets (drawn from the corners and edge midpoints of a square centered on the fixation target) is flashed on the screen for 300 ms. For a memory period of random duration ( – s), the monkey must maintain fixation on the central target until it is turned off. This event cues the monkey to make a saccade to the remembered location of the peripheral target. If the monkey holds his gaze within a window of the true target location, the trial is completed successfully. In our analyses, only LFP segments recorded during the memory periods of successful trials were used in the decoding experiments. This epoch of the trial is especially interesting since it reflects information storage and the generation of the resulting motor response.

An electrode array consisting of individually movable electrodes (20 mm travel, Gray Matter Research) was implanted into the prefrontal cortex and used to collect the LFP while the monkeys performed the above task. Signals were recorded at kHz and downsampled to kHz.
The initial position of each electrode at the beginning of the experiment was recessed millimeter within the drive; after penetrating the dura and the pia, action potentials were first recorded at a mean depth (from the initial position) of and millimeters for Monkey A and Monkey S, respectively.
As the experiment progressed, the positions of the individual electrodes (also referred to as channels) were gradually advanced deeper; the mean depth increase step was and microns for Monkey A and Monkey S, respectively.
For a fixed electrode depth configuration, multiple trials were performed, as detailed in Section II-D.
Henceforth, a fixed configuration of electrode positions over which trials are collected is referred to as an *electrode depth configuration (EDC)*.
Each EDC is uniquely described by a -dimensional real vector; each entry in the vector contains the depth position of an individual electrode (in millimeters) with respect to its initial position.

### II-B Feature Extraction via Pinsker’s Theorem

We give a brief, informal digest of the feature extraction technique based on the famous Pinsker’s theorem; for a more rigorous theoretical treatment of the theorem and related concepts, such as Gaussian sequence models and minimax optimality, we refer the interested reader to [7] as well as [2, 1].

#### II-B1 Nonparametric Regression for LFP Data

Let $y_1, \ldots, y_n$ denote the discrete-time LFP signal from an arbitrary channel (i.e., electrode), sampled with frequency $f_s$ during the memory period of the visual saccades experiment, see Section II-A. We use arguments from the theory of nonparametric regression to construct a technique for extracting the relevant features from the LFP data.

We assume that each LFP signal consists of two components: (1) a useful, information-carrying and unknown signal waveform, represented by the function $f$, and (2) a random noise component, modelled as i.i.d. Gaussian noise. The two components add up in the following model:

$$y_i = f(t_i) + \sigma \varepsilon_i, \qquad \varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1), \quad i = 1, \ldots, n, \tag{1}$$

where $f(t_i)$ and $\varepsilon_i$ are the corresponding discrete versions of the waveform and the noise, respectively. Although the actual noise process is neither Gaussian nor i.i.d., it has been shown in previous studies that the assumption provides a good approximation [1]. The unknown function $f$ stores the relevant information from which the intended motor action of the subject can be decoded. As such, each different action, i.e., in our case each different eye movement direction, will yield a different representation in the function space. Moreover, as noted in [2], the signal will also differ across repeated tasks due to a variety of practical and neurological reasons. Hence, it is more accurate to say that each specific motor action, when performed repeatedly, forms a class of functions in the function space; in that case, the neural decoding problem reduces to conventional, multiple-class composite hypothesis testing, where the goal is to design a decoder from LFP data that reliably identifies the function class of the incoming LFP signal.

Next, the decoder, being a function of random and finite LFP sequences, should be consistent. One way to ensure consistency is to take the worst-case probability of error to zero as $n \to \infty$. This motivates the use of minimax-optimal estimators of $f$ from the observations; it can be shown [2, 1] that using minimax-optimal function estimators leads to a consistent decoder as long as the function classes in the function space are well separated.

#### II-B2 Gaussian Sequences and Pinsker’s Theorem

Function estimators are objects of infinite dimension and are therefore hard to handle and use in practice for training. The Gaussian sequence modeling framework provides an alternative approach for obtaining finite-dimensional, asymptotically minimax-optimal estimators of arbitrary functions, as long as these functions satisfy certain smoothness criteria [7]. The key idea is to represent the time-domain LFP data (1) as a sequence of independent Gaussian random variables with different means; this is achieved by transforming the original LFP data using a common orthogonal transformation such as the Fourier basis. As a result, we obtain the equivalent sequence representation of (1):

$$\tilde{y}_k = \theta_k + \frac{\sigma}{\sqrt{n}}\, \tilde{\varepsilon}_k, \qquad \tilde{\varepsilon}_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1). \tag{2}$$

Here, $\tilde{y}_k$, $\theta_k$ and $\tilde{\varepsilon}_k$ are the projections of the vectors $(y_1, \ldots, y_n)$, $(f(t_1), \ldots, f(t_n))$ and $(\varepsilon_1, \ldots, \varepsilon_n)$ onto the $k$-th Fourier basis function [7]. Now, the problem of finding an estimator for the function $f$ in the equivalent sequence space reduces to finding an estimate of the Fourier coefficients $\theta_k$.

Pinsker’s theorem gives an asymptotically minimax-optimal estimator for the Gaussian sequence model, provided that the Fourier coefficients satisfy some predefined criteria [7, 13]. Specifically, if the Fourier coefficients live in (or on) an ellipsoid, i.e., if the sequence $(\theta_k)$ satisfies $\sum_k a_k^2 \theta_k^2 \le C$ with $a_k = k^{\alpha}$ (or, equivalently, if the functions live in a Sobolev space of order $\alpha$), then the asymptotically minimax-optimal estimator of the coefficients is given by the simple linear estimator

$$\hat{\theta}_k = \left(1 - \frac{a_k}{\mu}\right)_{+} \tilde{y}_k. \tag{3}$$

The parameters $\alpha$ and $\mu$ become design parameters, and their values need to be carefully chosen such that the performance of the decoder is optimized.
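To make the shrinkage rule concrete, the following minimal sketch computes the weights of the estimator (3), assuming Sobolev-ellipsoid weights $a_k = k^{\alpha}$; the function names are ours, introduced purely for illustration.

```python
import numpy as np

def pinsker_weights(num_coeffs, alpha, mu):
    """Shrinkage weights w_k = (1 - a_k / mu)_+ with a_k = k^alpha.

    alpha (smoothness order) and mu (ellipsoid parameter) are the two
    design parameters mentioned in the text."""
    k = np.arange(num_coeffs, dtype=float)
    return np.clip(1.0 - k**alpha / mu, 0.0, None)

def pinsker_estimate(y_tilde, alpha, mu):
    """Pinsker-type linear estimator: shrink each empirical coefficient."""
    return pinsker_weights(len(y_tilde), alpha, mu) * y_tilde
```

Coefficients with $a_k \ge \mu$ receive weight zero, so the estimator automatically discards the high-frequency part of the spectrum.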

Inspecting the sequence of weights $(1 - a_k/\mu)_{+}$, we see that the estimator (3) shrinks the observation $\tilde{y}_k$ by the factor $(1 - a_k/\mu)$ if $a_k < \mu$; otherwise, it sets the observation to $0$. Furthermore, it has been shown that the simple truncation estimator of the form

$$\hat{\theta}_k = c_k\, \tilde{y}_k, \qquad c_k = \begin{cases} 1, & k \le k_0, \\ 0, & k > k_0, \end{cases} \tag{4}$$

is also asymptotically minimax-optimal and can be viewed as a special case of Pinsker’s estimator [7]. Here, $(c_k)$ is a vector whose first $k_0$ entries are $1$ and whose remaining entries are $0$; in other words, our finite-dimensional representation of the estimate of $f$ is obtained by simply low-pass filtering the original time-domain sequence to obtain the dominant components of its spectrum. The truncation estimator is also simpler to implement than Pinsker’s estimator, since it introduces only a single design parameter, namely the number of dominant frequency components $k_0$.
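A minimal single-channel sketch of the truncation estimator used as a feature extractor: the LFP trace is transformed with the DFT and only the dc component and the $k_0$ lowest harmonics are kept, stored in rectangular coordinates. The function name and the packing order of the real/imaginary parts are illustrative assumptions.

```python
import numpy as np

def truncation_features(lfp, k0):
    """Truncation estimator as a per-channel feature extractor:
    keep the dc component and the k0 lowest DFT harmonics."""
    coeffs = np.fft.rfft(lfp)      # one-sided spectrum of the trace
    kept = coeffs[:k0 + 1]         # low-pass truncation
    # dc is real, so it contributes one feature; each harmonic two
    return np.concatenate(([kept[0].real], kept[1:].real, kept[1:].imag))
```

For a single channel this yields a feature vector of dimension $1 + 2k_0$, matching the per-channel feature count described in Section II-D.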

One final comment is in order. Pinsker’s theorem has proven to be very useful in limited-data scenarios, especially when the number of trials is not more than an order of magnitude larger than the dimension of the problem; such data sets frequently arise in neuroscientific experiments, where the cost of running experiments and collecting data is high. However, in situations where the amount of available data is relatively large, alternative feature extraction methodologies can also be considered, including non-linear ones such as deep neural networks and autoencoders [4].

### II-C Data Centering for Cross-subject Decoding

The feature spaces of different subjects are in general different. Specifically, the target-conditional distributions representing the same actions generally differ across subjects. As a result, a well-performing and reliable decoder trained on one subject will perform poorly when tested on another subject without adequate pre-processing. Indeed, our preliminary investigations have shown that a well-performing decoder trained on a data set collected from one subject performs no better than a random-choice decoder when applied to test data collected from a different subject.

Our remedy to this problem, which is also our main contribution in this paper, is the data centering procedure, which relies on the core assumption that the target-conditional distributions are functionally related via a deterministic component. These functional relations are captured via transfer functions; adequate modelling and estimation of the transfer functions is critical for the success of data centering and is the focus of the following subsections.

#### II-C1 Transfer Functions

To illustrate the underlying principles of data centering, we fix the following terminology. Let X denote the subject from which the training data is collected (i.e., the source), and let Y denote the subject on which the test data is generated (i.e., the destination). Our goal is to train a decoder, using subject X training data, that can reliably decode the subject Y test data. For simplicity, we assume that the set of actions is identical for both subjects (as in the memory-guided visual saccade experiment described in Section II-A), although this need not be the case in general. Correspondingly, let $\mathbf{x}$ and $\mathbf{y}$ denote the random vectors representing the feature spaces of subjects X and Y, respectively; both $\mathbf{x}$ and $\mathbf{y}$ live in $\mathbb{R}^{p}$, where $p$ is the dimension of the feature space. It should be noted that $\mathbf{x}$ and $\mathbf{y}$ are drawn from a specific class-conditional (we also use target-conditional interchangeably) distribution corresponding to a specific motor action; however, we omit denoting this dependence explicitly to avoid clutter. Data centering borrows from transfer learning by postulating the following assumption, see Fig. 3: the feature space of subject Y can be viewed as a functional transformation of the feature space of subject X. In other words, we can establish the general relationship:

$$\mathbf{y} = g(\mathbf{x}) + \mathbf{w}. \tag{5}$$

Here, the invertible and deterministic mapping $g: \mathbb{R}^{p} \to \mathbb{R}^{p}$ relates the target-conditional distributions of the two subjects and is termed a *transfer function*. In addition to the deterministic component, the model (5) includes a stochastic component, i.e., local noise, represented by the random vector $\mathbf{w}$, which captures the local brain dynamics that further perturbs the feature space of subject Y independently of $\mathbf{x}$.

Now, according to (5), if the transfer functions for each specific action in the action set are known and if the local noise is small, the subject X data can be transferred into the feature space of subject Y, as if it had been generated there in the first place. This is the core idea of the data centering technique: provided that $g$ is known, the training data set from subject X can be transferred to subject Y by simply applying $g$ to it; the decoder trained on the transformed data set can subsequently be used to decode actions intended by subject Y.

#### II-C2 Learning Linear Transfer Functions

As seen from (5), knowledge of the transfer function $g$ is crucial for the success of data centering. Obviously, identifying an adequate model from first principles using neurological reasoning and modelling is a challenging problem, as it is unclear a priori how the action spaces of different subjects in a population map to each other. An alternative way to obtain knowledge of the transfer function is through data-driven modelling and estimation. In other words, one can resort to data collected from both subjects X and Y and estimate the transfer function relying on either a parametric or a non-parametric model. In this work, we restrict our attention to linear transfer functions:

$$g(\mathbf{x}) = \mathbf{A}\mathbf{x}, \tag{6}$$

where $\mathbf{A} \in \mathbb{R}^{p \times p}$ is an invertible square matrix. This might be adequate in situations where the feature vectors follow normal distributions and where the assumption that the statistical nature of the feature space distributions for a specific set of actions is preserved across subjects is reasonably justified. The advantage of the data-driven estimation approach is its versatility; as long as there is enough data, one can fit a transfer function for any pair of subjects. In addition, provided that there are enough representatives of the population, the estimated transfer functions form a manifold which can be further used to analyze the general mapping principles of the decision-making part of the cortex across an ensemble of subjects performing the same or similar sets of tasks.

Estimating models of the form (6) is a well-understood problem in statistical signal processing and dynamical systems. However, in our case there is one critical twist that makes the problem significantly more involved: since we are mapping feature spaces of two different subjects, no causal input-output relationship can be established between different realizations of $\mathbf{x}$ and $\mathbf{y}$ in the corresponding training data sets. In other words, we are unable to tell which specific realization of $\mathbf{y}$ corresponds to a given realization of $\mathbf{x}$. As an alternative, we develop an estimation approach that relies on the first and second moments, i.e., the mean vectors and covariance matrices of the distributions of the class-conditional feature vectors; this leads to a particularly simple estimation approach for the linear model (6). Specifically, we assume that the first and second moments of $\mathbf{x}$ and $\mathbf{y}$ exist and denote them by $\boldsymbol{\mu}_X$, $\boldsymbol{\Sigma}_X$, $\boldsymbol{\mu}_Y$ and $\boldsymbol{\Sigma}_Y$. Furthermore, we assume that the local noise in subject Y is zero mean with unknown covariance matrix $\boldsymbol{\Sigma}_W$; hence, in the model (6) we have two unknown components: the transformation matrix $\mathbf{A}$ and the local noise covariance matrix $\boldsymbol{\Sigma}_W$. In addition, for reasons that will become apparent shortly, we define the whitening matrix $\mathbf{W}_X$ such that $\mathbf{W}_X \boldsymbol{\Sigma}_X \mathbf{W}_X^{\mathsf{T}} = \mathbf{I}$; an example is the standard zero-phase component analysis (ZCA) whitening matrix $\mathbf{W}_X = \boldsymbol{\Sigma}_X^{-1/2}$. Using this notation, it is not difficult to show that the following holds:

$$\boldsymbol{\mu}_Y = \mathbf{A}\boldsymbol{\mu}_X, \tag{7}$$

$$\boldsymbol{\Sigma}_Y = \mathbf{A}\boldsymbol{\Sigma}_X\mathbf{A}^{\mathsf{T}} + \boldsymbol{\Sigma}_W. \tag{8}$$

The maximum number of linearly independent equations in the above system is $p + p(p+1)/2$, whereas the total number of unknown parameters is $p^2 + p(p+1)/2$, with equality only in the case $p = 1$.

In order to solve the above system for the transfer function $\mathbf{A}$ in the presence of the unknown local noise covariance $\boldsymbol{\Sigma}_W$, we make several simplifying but reasonable assumptions. First, we restrict $\mathbf{A}$ to be a diagonalizable matrix. Second, we assume that the local noise is weak, which implies that the matrix $\mathbf{I} + \mathbf{M}$, where $\mathbf{M}$ collects the noise-dependent terms, is diagonally dominated by the identity matrix $\mathbf{I}$; in fact, we will assume that the local noise process is small in comparison with the signal covariance, which allows us to use the first-order approximation of the Neumann expansion

$$(\mathbf{I} + \mathbf{M})^{-1} = \sum_{j=0}^{\infty} (-\mathbf{M})^{j} \approx \mathbf{I} - \mathbf{M}, \tag{9}$$

where the series converges for any $\mathbf{M}$ satisfying $\lVert \mathbf{M} \rVert < 1$, whereas the first-order approximation holds when $\lVert \mathbf{M} \rVert \ll 1$. Third, to obtain a well-posed estimation problem for $\mathbf{A}$ in the presence of the unknown $\boldsymbol{\Sigma}_W$, we need to impose certain structure on $\boldsymbol{\Sigma}_W$ by incorporating some reasonable intuitions; this is achieved via the parametrization $\boldsymbol{\Sigma}_W = \operatorname{diag}(\mathbf{s})$, where $\mathbf{s}$ is an unknown parameter vector with $p$ entries. Finally, we allow $\mathbf{s}$ to have all $p$ degrees of freedom but restrict $\boldsymbol{\Sigma}_W$ itself to be diagonal; with this final restriction, we assume that the local brain dynamics in subject Y acts on each feature independently.

Taking into account all of the assumptions stated above, it is easy to obtain an estimate of the transfer function $\mathbf{A}$. Namely, from (8) and using the assumption that $\mathbf{A}$ is diagonalizable, we obtain

$$\mathbf{A} = \left(\boldsymbol{\Sigma}_Y - \boldsymbol{\Sigma}_W\right)^{1/2}\mathbf{W}_X. \tag{10}$$

Applying the assumption that the local perturbation process in subject Y is small and making use of (9), we obtain

$$\mathbf{A} \approx \boldsymbol{\Sigma}_Y^{1/2}\left(\mathbf{I} - \tfrac{1}{2}\,\boldsymbol{\Sigma}_Y^{-1}\boldsymbol{\Sigma}_W\right)\mathbf{W}_X. \tag{11}$$

Finally, plugging (11) into (7) and assuming that all entries of $\mathbf{W}_X\boldsymbol{\mu}_X$ are non-zero, we obtain an estimate of $\mathbf{s}$ as

$$\hat{\mathbf{s}} = 2\,\operatorname{diag}\!\left(\mathbf{W}_X\boldsymbol{\mu}_X\right)^{-1}\left(\boldsymbol{\Sigma}_Y\mathbf{W}_X\boldsymbol{\mu}_X - \boldsymbol{\Sigma}_Y^{1/2}\boldsymbol{\mu}_Y\right), \tag{12}$$

which is then substituted into (11) to obtain an estimate of the transfer function $\mathbf{A}$. The reader should note that the above estimation approach is applied per target-conditional distribution; hence, since the number of targets, i.e., classes, in the eye movement decoding problem is eight, we need to estimate eight corresponding linear transformation matrices, one for each target-conditional distribution. In light of this, from (11) and (12) we observe that the computation of both $\hat{\mathbf{s}}$ and, subsequently, $\mathbf{A}$ involves inverses of the target-conditional covariance matrices. In a limited data scenario such as the one we are working with, see Section II-D, we need to make sure that the covariance matrices are well-conditioned so that they can be adequately inverted. This might require reducing the dimension in the feature extraction phase by using a lower cut-off frequency, or increasing the number of trials per data set, and thus per target, using the data clustering mechanism from [1]. Alternatively, in cases when neither of these two options is applicable, we can resort to the shared covariance matrix instead, which is computed as a linear combination of the target-conditional covariance matrices with weights given by the empirical target priors [4]; this approach will prove useful later, in Section III-B, when we investigate the performance of data centering on imbalanced data sets, where one or more target distributions are poorly represented, i.e., the number of trials is significantly smaller than the dimension, and where further reduction of the feature space dimension is not possible.

The transfer function learning approach described above has simplicity as its main advantage; using the linear transfer function model (6) and relying on the simplifying assumptions regarding the structure of the transfer function and the local perturbation process, we were able to derive an approximate, closed-form solution for $\mathbf{A}$ using only the means and covariance matrices of the feature vectors $\mathbf{x}$ and $\mathbf{y}$, which can be easily estimated from training data. Furthermore, as we will see later in the results, this simple, linear approach is capable of achieving fairly reliable cross-subject decoding performance. It should be noted, however, that the transfer model might not be linear in general.
In fact, it is more reasonable to assume a more general, non-linear mapping between the feature spaces and to use more sophisticated methodology for estimating the non-linear transfer functions, e.g., deep neural networks; such modelling and estimation approaches to the cross-subject problem are part of our ongoing research.
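As a concrete illustration of the moment-based recipe, the following sketch estimates a linear transfer matrix per target-conditional distribution under the additional simplification that the local noise is negligible ($\boldsymbol{\Sigma}_W \approx \mathbf{0}$), in which case the transfer matrix reduces to recoloring ZCA-whitened source data with the destination covariance. The ridge term and all function names are our own assumptions, not part of the original method.

```python
import numpy as np

def sqrtm_sym(S):
    """Symmetric square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def estimate_transfer(X, Y, ridge=1e-6):
    """Moment-based linear transfer matrix for one target-conditional
    distribution, with the local noise neglected:
        A = Sigma_Y^{1/2} @ Sigma_X^{-1/2}  (whiten X, recolor to Y).
    X, Y: (trials, features) arrays for source and destination subjects.
    A small ridge keeps the covariances invertible in limited-data regimes."""
    d = X.shape[1]
    Sx = np.cov(X, rowvar=False) + ridge * np.eye(d)
    Sy = np.cov(Y, rowvar=False) + ridge * np.eye(d)
    Wx = np.linalg.inv(sqrtm_sym(Sx))       # ZCA whitening matrix of subject X
    return sqrtm_sym(Sy) @ Wx

def center_data(X, A, mu_y):
    """Transfer source trials into the destination feature space:
    match second moments via A and first moments via the mean shift."""
    return (X - X.mean(axis=0)) @ A.T + mu_y
```

A cross-subject decoder would then be trained on `center_data(X, A, mu_y)` instead of the raw subject X trials, one transfer matrix per target class.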

### II-D Data Preparation and Processing

#### II-D1 Available Data Description

The data analyzed in the rest of the paper was first reported in [9]; the reader interested in more details regarding the specific procedure, including a description of the equipment used, is advised to refer to [9]. Multiple trials were performed and collected across several days and recording sessions. As described in Section II-A, over the course of the experiment the positions of the electrodes were gradually advanced into the prefrontal cortex towards the white matter; each unique configuration of electrode depths, described by a -dimensional vector, is referred to as an EDC, see Section II-A. The experiment was performed over a total of and EDCs for Monkey A and Monkey S, respectively, during which multiple trials were collected. The indexing of the EDCs used here reflects the temporal progression of the experiment; namely, the trials for EDC- were collected earlier in time than the trials for EDC- if . The total number of trials collected for each individual EDC is shown in Fig. 4, where the horizontal axis gives the mean depth of each array configuration, computed as a simple mean of the corresponding EDC vector with respect to the initial positions of the electrodes, see Section II-A. From these bar plots it can be noted that for Monkey A the average number of trials per EDC is around , with the exception of EDC- under which a total of trials were collected across recording sessions. Similarly, the average number of trials per EDC for Monkey S amounts to around . Given the dimension of the feature space, which easily surpasses , we conclude that the number of trials for each individual EDC is insufficient to train a reliable decoder.

#### II-D2 Data Clustering

To obtain data sets of sufficient size, we apply a trial clustering algorithm [1] based on the Euclidean proximity of the EDC vectors, using the following reasoning: similar EDCs, i.e., EDCs whose depth vectors are close in the Euclidean sense, generate similar feature spaces with similar target-conditional distributions.
The algorithm operates as follows. First, we define the *clustering window* as the minimum number of trials per EDC data set, i.e., per data cluster. Second, we fix the *concurrent EDC* for which we want to create a data set of sufficient size. Next, we begin appending trials for the concurrent EDC. We first append the trials from the concurrent EDC itself; if the number of trials is less than the clustering window, we continue appending trials from the EDCs that are closest to the concurrent EDC in the Euclidean sense. The algorithm ends when the total number of trials fills up the clustering window.
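The clustering steps above can be sketched as follows; the data layout and names are assumed for illustration.

```python
import numpy as np

def cluster_trials(edc_vectors, trials_per_edc, concurrent, window):
    """Trial clustering by EDC proximity: starting from the concurrent EDC,
    append trials from the Euclidean-nearest EDCs until the clustering
    window (minimum trial count) is filled.

    edc_vectors:    (num_edc, num_electrodes) array of depth vectors
    trials_per_edc: list of per-EDC trial lists (any payload)
    concurrent:     index of the EDC whose data set we are building
    window:         minimum number of trials in the resulting cluster"""
    dists = np.linalg.norm(edc_vectors - edc_vectors[concurrent], axis=1)
    order = np.argsort(dists)          # concurrent EDC comes first (distance 0)
    cluster = []
    for idx in order:
        cluster.extend(trials_per_edc[idx])
        if len(cluster) >= window:
            break
    return cluster
```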

#### II-D3 Feature Vector Formation

Once the data sets of sufficient size across all EDCs have been obtained, they are processed using the feature extraction procedure described in Section II-B. Feature extraction is performed on a per-channel basis, see Fig. 5; namely, for each trial, the LFP signals from each channel are transformed into the frequency domain and low-pass filtered up to some cut-off frequency, as in Fig. 2. The total number of retained (complex) Fourier coefficients as features per channel is (corresponding to the dc component and the lowest frequencies). We use the rectangular coordinate representation to store the complex DFT coefficients via their sine and cosine components; hence, the dimension of the feature space per channel is (the dc component is real, so only one coefficient is stored for it). These spectral representations are then concatenated across the channels, i.e., electrodes, to form one large feature vector of dimension . Note that the dimension can be further reduced by using PCA as in [1], and the performance of the decoder can be additionally optimized over the number of principal modes retained after PCA.
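A sketch of the per-trial feature vector formation, concatenating per-channel low-pass DFT features across electrodes; the channel-major packing order is our assumption.

```python
import numpy as np

def trial_features(lfp_trial, k0):
    """Form the trial feature vector: per-channel DFT features (dc plus the
    k0 lowest harmonics, in rectangular coordinates) concatenated across
    channels. lfp_trial: (channels, samples) array from the memory period."""
    per_channel = []
    for chan in lfp_trial:
        c = np.fft.rfft(chan)[:k0 + 1]   # dc + k0 lowest harmonics
        per_channel.append(np.concatenate(([c[0].real], c[1:].real, c[1:].imag)))
    return np.concatenate(per_channel)   # dimension: channels * (1 + 2 * k0)
```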

#### II-D4 Data-driven Data Centering

For cross-subject studies, the next step is to define the data sets for transfer function estimation, cross-subject training and testing; the procedures are illustrated in Fig. 6. For subject Y, i.e., the destination (either Monkey A or Monkey S), this is simple to do: namely, we randomly pull a holdout test set from each EDC data set. The remaining trials in the EDC data sets are used for estimating the transfer function. For instance, if we use leave-one-out cross-validation, as we do in this work, we randomly pull out a single trial from the subject Y data set, leaving the remaining trials for transfer function estimation. For subject X, i.e., the source (either Monkey S or Monkey A), on the other hand, we form two data subsets using bootstrapping: namely, we define a real positive number which represents the proportion of trials that are randomly pulled from each EDC data set. Note that for subject X, the data set is bootstrapped independently to form the subset used for transfer function estimation and the subset used for training the cross-subject decoder after data centering. This might lead to partial overlap between these two subsets, as some trials will likely be pulled out for both transfer function estimation and cross-subject decoder training; overlap is certain when the proportion is large enough. An alternative to this approach is to split the subject X data set into two disjoint subsets following a predefined ratio (either via bootstrapping or some other technique), one for transfer function estimation (e.g., of the trials) and the other (e.g., the remaining of the trials) for cross-subject decoder training. We have tested both approaches, with the first one providing superior performance over the second for limited data sets; hence, here we only report the cross-subject decoding results using the first approach, where the subject X data set is bootstrapped independently for transfer learning and cross-decoder training.
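A minimal sketch of the two independent random draws from the subject X data set; we assume sampling without replacement at a proportion `beta`, since the exact sampling scheme is an implementation choice not fully specified here.

```python
import numpy as np

def bootstrap_subsets(num_trials, beta, rng):
    """Independently draw two subsets of the subject X EDC data set:
    one for transfer-function estimation, one for cross-subject decoder
    training. beta is the proportion of trials drawn in each pull; the
    two draws are independent, so they may overlap."""
    k = int(round(beta * num_trials))
    est_idx = rng.choice(num_trials, size=k, replace=False)
    train_idx = rng.choice(num_trials, size=k, replace=False)
    return est_idx, train_idx
```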

#### II-D5 Decoding

The final block in the chain is the actual decoder. Due to the limited amount of training data, prior studies have shown that linear discriminant analysis (LDA) is an adequate decoder. In fact, the single-subject studies and recent evaluations, partially reported in [1], have confirmed that LDA outperforms other standard classifiers, such as the quadratic discriminant analysis (QDA) decoder, logistic regression and support vector machines (SVM). The limited amount of data has so far been the main obstacle to applying deep neural networks for classification. Due to its proven robustness for the problem at hand, we also use the LDA decoder in the forthcoming analyses.
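For completeness, a compact sketch of an LDA decoder with a shared (pooled) covariance matrix, which suits the limited-data regime described above; this is a generic textbook implementation written for illustration, not the exact code used in the studies.

```python
import numpy as np

class LDADecoder:
    """Minimal LDA: Gaussian classes sharing one pooled covariance matrix."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.means = np.stack([X[y == c].mean(axis=0) for c in self.classes])
        d = X.shape[1]
        # pooled within-class covariance (shared across classes)
        S = sum(np.cov(X[y == c], rowvar=False) * (np.sum(y == c) - 1)
                for c in self.classes) / (len(y) - len(self.classes))
        self.Sinv = np.linalg.inv(S + 1e-6 * np.eye(d))  # small ridge
        self.priors = np.array([np.mean(y == c) for c in self.classes])
        return self

    def predict(self, X):
        # linear scores: x' S^-1 mu_c - (1/2) mu_c' S^-1 mu_c + log prior
        scores = (X @ self.Sinv @ self.means.T
                  - 0.5 * np.einsum('ij,jk,ik->i', self.means, self.Sinv, self.means)
                  + np.log(self.priors))
        return self.classes[np.argmax(scores, axis=1)]
```

Because the covariance is pooled, the decision boundaries are linear, which is what keeps the parameter count manageable for small trial counts.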

## III Results

In Section III-A, we first look at the benchmark performance when the decoder for each subject is trained using its own data. Section III-B presents the results obtained from the cross-subject studies, drawing preliminary conclusions regarding the achievable cross-subject decoding performance. Last but not least, Section III-C demonstrates a clear benefit of data centering over standard techniques in the case of imbalanced training data sets.

### III-A Single-Subject Analysis: Benchmark Decoding Performance

Fig. 7 shows the performance of the LDA decoder applied after Pinsker’s feature extraction. The performance has been evaluated as a statistical average using leave-one-out cross-validation during which we also optimized the value of , i.e., the number of time-domain LFP samples acquired from the memory period.^{1}^{1}1As previously reported in [1], the first half of the memory period, immediately after the target light is switched off, is the most informative; hence, the optimal value of is between and LFP samples per channel.
We evaluated the performance of the decoder for several values of , i.e., number of retained frequency components after Pinsker’s feature extraction; namely, we varied from (when only the dc component is kept) to . Interestingly, the decoder achieves fairly good decoding accuracy compared to random choice decoder even when . The decoder achieves its best performance for between and ; however, for larger , i.e., larger feature spaces the performance of the decoder deteriorates as can be seen from the curves corresponding to .
The results suggest that virtually all information relevant for the decoding is stored in the dc and the first frequency component, which for sampling frequency of kHz corresponds to cut-off frequency of Hz; the next few frequencies, within the first Hz provide additional degrees of freedom and can further enhance the decoding accuracy, albeit marginally as evident in Fig. 7. We conclude that the most relevant information that determines the dynamics of the decision making process for motor actions is stored in the and bands of the LFP signals [1].
We also note that the above curves can be further optimized using PCA, which we did not pursue in these studies. Specifically, [1] shows that after fine-tuning the number of principal components, the performance of the decoder peaks at and for Monkey A and Monkey S, respectively, when .

### III-B Cross-Subject Analysis

As outlined in the above discussion, the first two DFT coefficients carry virtually all information pertinent to decoding intended motor actions from LFP data. Therefore, for our cross-subject investigations, we fix .

We begin our analysis by applying data centering between the subjects across all recording EDCs and comparing the resulting performance with the benchmark single-subject performance reported in the previous section. The results are shown in Figs. 8 and 9.
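The transfer functions behind data centering are developed earlier in the paper (Section II). As an illustrative first-moment-only simplification (our own sketch, not the authors' estimator), centering can be read as aligning each target-conditional mean of the source subject's features with the corresponding mean of the destination subject:

```python
import numpy as np

def center_source_to_destination(X_src, y_src, X_dst, y_dst):
    """First-moment data centering (illustrative simplification):
    shift each target-conditional cloud of source features so that
    its mean coincides with the destination subject's mean for the
    same target. Returns the re-centered source features."""
    X_out = X_src.astype(float).copy()
    for c in np.unique(y_src):
        mu_src = X_src[y_src == c].mean(axis=0)
        mu_dst = X_dst[y_dst == c].mean(axis=0)
        X_out[y_src == c] += mu_dst - mu_src
    return X_out
```

After this shift, a decoder trained on the re-centered source data operates in the destination subject's feature space, which is the setting evaluated in Figs. 8 and 9.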

Specifically, the two-dimensional color plots in Fig. 8 depict the cross-subject decoding performance for all available EDC data sets; namely, each available EDC data set for subject X (termed the source subject in Fig. 8) is applied to all available data sets for subject Y (the destination subject). The purpose of this analysis is to investigate the performance of data centering for cross-subject decoding across different cortical depths. Fig. 9 compares the cross-subject decoding performance of a single, arbitrarily chosen source EDC from Fig. 8 with the benchmark single-subject decoding performance for the destination.

We draw several important conclusions from these results. First, from both Figs. 8 and 9 we see that the cross-subject decoding performance is consistently dominated by the locally achievable single-subject performance, i.e., by what is obtained when the available data is used for local training instead of transfer learning. Nevertheless, the cross-subject decoding performance is also consistently higher, and often substantially higher, than that of a random choice decoder, peaking at around in Monkey A. To the best of the authors’ knowledge, this work is the first to report such reliable cross-subject decoding accuracy; we attribute this performance to the data centering technique. Next, we observe that several Monkey S EDC data clusters achieve better performance when trained in the Monkey A feature space after centering than when trained on Monkey S itself. This phenomenon is expected, since the target-conditional mean vectors in the Monkey A feature space are better separated, especially at superficial depths near the surface of the prefrontal cortex; hence, we are linearly mapping poorly separated Monkey S data onto the better separated Monkey A feature space. Conversely, Monkey A data transferred to Monkey S suffers a performance degradation, since the class-conditional mean vectors in the Monkey S data are more poorly separated than those in the Monkey A feature space.

### III-C Data Centering for Learning from Imbalanced Data Sets

One of the most important objectives of cross-subject BCIs is restoring lost motor functions in subjects with chronic disabilities; this neurological problem can be partially studied within the framework of learning from imbalanced data sets [6]. As an illustration, we consider the following hypothetical scenario. Specific motor functions and actions, such as eye movement directions, will be equally represented in the feature space of a healthy subject (i.e., subject X); the subject can perform all of the functions under consideration, and therefore a BCI algorithm can easily learn their corresponding feature space representations and reliably decode the subject’s intended actions, as illustrated in two dimensions in Fig. (a). On the other hand, a subject that has partially or completely lost the ability to perform one or more of the considered functions (say, subject Y) will produce an imbalanced feature space in which the lost functions are very poorly represented. Hence, a BCI trained over the imbalanced data set will yield very poor detection of intended actions from the poorly represented functions, as illustrated in Fig. (b). A straightforward alternative would be to use a standard sampling technique for dealing with class imbalance, such as random oversampling or random undersampling via bootstrapping [6].
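The standard rebalancing baselines mentioned above can be sketched as follows. This is a generic illustration of random over- and undersampling (the helper name and signature are our own), not code from the study:

```python
import numpy as np

def rebalance(X, y, rng, mode="oversample"):
    """Standard rebalancing baselines for imbalanced data [6]:
    'oversample' bootstrap-resamples every class up to the majority
    count; 'undersample' randomly subsamples every class down to
    the minority count."""
    counts = {c: int(np.sum(y == c)) for c in np.unique(y)}
    target = max(counts.values()) if mode == "oversample" else min(counts.values())
    idx = []
    for c, n in counts.items():
        cls_idx = np.flatnonzero(y == c)
        # sample with replacement only when we need more trials than exist
        idx.append(rng.choice(cls_idx, size=target, replace=(target > n)))
    idx = np.concatenate(idx)
    return X[idx], y[idx]
```

Such purely statistical resampling reuses only the subject's own scarce trials, which is the limitation that motivates borrowing trials from another subject via data centering.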

Our main question here is whether data centering can leverage the bio-physiological relations and similarities of the LFP data between the two subjects and help solve the imbalanced data problem beyond what is achievable with standard techniques. Specifically, we investigate whether we can exploit training data from the evenly represented feature space of subject X for the under-represented targets in the feature space of subject Y, as depicted in Fig. 11. For illustration purposes, we consider a simplified binary hypothesis testing problem in which we pick only two targets, i.e., eye movement directions; one of the targets is then randomly under-sampled to create an artificial imbalance between the targets. We only consider Monkey S on Monkey A cross-subject decoding with the EDC- data set as destination; the same conclusions apply in the opposite direction.
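The idea of this experiment — borrowing source-subject trials for the under-represented target and centering them into the destination feature space — can be sketched as below. This is our own simplified reading of the setup (mean-shift centering only; names are illustrative), not the authors' code:

```python
import numpy as np

def augment_minority_with_centered_source(X_dst, y_dst, X_src, y_src, minority):
    """Borrow source-subject trials for the under-represented target,
    shift them so their mean matches the (few) available destination
    trials for that target, and append them to the destination
    training set. Returns the augmented features and labels."""
    src = X_src[y_src == minority]
    shift = X_dst[y_dst == minority].mean(axis=0) - src.mean(axis=0)
    X_aug = np.vstack([X_dst, src + shift])
    y_aug = np.concatenate([y_dst, np.full(len(src), minority)])
    return X_aug, y_aug
```

A decoder trained on `X_aug, y_aug` then sees a much better represented minority target than one trained on the destination data alone or on resampled copies of the same few trials.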

To begin with, we consider the eye movement directions ; in this case, the decoding performance when the data is balanced is close to perfect, i.e., close to prediction accuracy. The results are shown in Figs. 12 and 13; we note that the optimal source EDC for the unbalanced class, i.e., the one that maximizes the decoding performance, was found via exhaustive search and is given in the caption of the plots. The histograms in Fig. 12 show the distribution of prediction accuracy over random subsets for the under-represented target class and a between-class imbalance of 100; considering that the average number of trials per target for a data clustering window of is just above , this imbalance ratio amounts to only a few, i.e., or , trials for the under-represented class. We see that if we train an imbalanced decoder, the poorly represented target will rarely be decoded correctly; in other words, the decoder will most of the time assign the test data points to the well-represented class. After data centering, we observe a dramatic shift of the accuracy distribution for the under-represented class, well above a random choice decoder ( prediction accuracy). Interestingly, we also observe an improvement in the accuracy for the well-represented target. We attribute this to the neurological component of the data and the existence of inherent correlations and similarities between the feature spaces of the subjects.

Fig. 13, depicting only the average performance across targets, verifies the above conclusions for a range of imbalance ratios. In addition, data centering outperforms standard random sampling methods for the imbalanced data problem, such as random undersampling of the well-represented class (shown in Fig. 13) and random oversampling via bootstrapping or synthetic sampling with data generation [6]. We also see that as the between-class imbalance is decreased by improving the representation of the under-represented class, the gain from data centering eventually vanishes; in fact, for small between-class imbalances, local decoder training outperforms training with cross-subject centered data for the under-represented class, in line with the conclusions from Fig. 8.

In the setup investigated in Fig. 13, the two movement directions belong to different hemifields. On the other hand, [1, 9] have reported a significant discrepancy in per-target decoding accuracy, depending on whether the eye movement direction was in the ipsilateral or the contralateral hemifield. To assess the impact of the hemifield location of the corresponding eye movements on data centering, we consider several combinations of targets, as detailed in Fig. 14. By comparing the results in Figs. 14(a) and 14(b) with the results in Figs. 14(c) and 14(d), we see that the aforementioned discrepancy between contralateral and ipsilateral directions persists under data centering.

## IV Discussion

We proposed a novel pre-processing technique, referred to as data centering, for adapting feature spaces across different subjects that perform the same tasks. The key ingredients of data centering are the transfer functions, which model the deterministic component of the functional transformations between target-conditional distributions. We also developed a simple yet effective estimation technique for linear transfer functions based on the first and second moments of the target-conditional distributions.
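As a sketch of what a moment-based estimate of a linear transfer function might look like for one target class (a plausible form consistent with matching first and second moments, not necessarily the paper's exact estimator): whiten with the source class moments and re-color with the destination moments.

```python
import numpy as np

def linear_transfer(X_src_c, X_dst_c):
    """Moment-matching estimate of a linear transfer function
    T(x) = A x + b for a single target class: A whitens the source
    class-conditional distribution and re-colors it with the
    destination second moment, and b aligns the means, so that the
    mapped data matches both destination moments."""
    mu_s, mu_d = X_src_c.mean(axis=0), X_dst_c.mean(axis=0)
    S_s = np.cov(X_src_c, rowvar=False)
    S_d = np.cov(X_dst_c, rowvar=False)

    def sqrtm(S):
        # symmetric positive-semidefinite square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

    A = sqrtm(S_d) @ np.linalg.inv(sqrtm(S_s))
    b = mu_d - A @ mu_s
    return A, b
```

By construction, mapping the source class through `T` reproduces the destination class mean exactly and its covariance up to numerical precision, which is the sense in which a linear transfer function can be identified from first and second moments alone.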

We tested the technique on data collected from two macaque monkeys performing memory-guided visual saccades to one of eight target locations. We observed a peak cross-subject decoding performance of at superficial cortical depths; this marks a substantial improvement over a random choice decoder, demonstrating that data centering is essential for the success of cross-subject decoders. Nevertheless, the cross-subject decoding performance remains consistently below the single-subject decoding performance. Hence, we conclude that for balanced data sets, cross-subject decoding does not outperform single-subject decoding.

We also tested the feasibility of data centering for decoding from imbalanced data sets, i.e., when one or more of the class-conditional distributions are under-represented. In particular, we investigated whether data centering can be used to correct the between-class imbalance by bringing in training data from another subject in which the specific classes are well represented. The results reveal the exciting potential of data centering in such scenarios: we observed dramatic performance improvements over standard methods for dealing with imbalanced data sets.

## Acknowledgment

This work was supported by the Army Research Office MURI Contract Number W911NF-16-1-0368.

## References

- [1] (2019-05) Minimax-optimal decoding of movement goals from local field potentials using complex spectral features. Journal of Neural Engineering 16 (4).
- [2] (2018-06) Classification of local field potentials using Gaussian sequence model. In 2018 IEEE Statistical Signal Processing Workshop (SSP), pp. 683–687.
- [3] (2014-05) Restoring sensorimotor function through intracortical interfaces: progress and looming challenges. Nature Neuroscience 15 (8), pp. 313–325.
- [4] (2006) Pattern recognition and machine learning (information science and statistics). Springer-Verlag, Berlin, Heidelberg.
- [5] (2012) A high-performance neural prosthesis enabled by control algorithm design. Nature.
- [6] (2009-09) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21 (9), pp. 1263–1284.
- [7] (2017) Gaussian estimation: sequence and wavelet models.
- [8] (2011-12) Modeling the spatial reach of the LFP. Neuron 72 (5), pp. 859–872.
- [9] (2011) Optimizing the decoding of movement goals from local field potentials in macaque cortex. Journal of Neuroscience 31 (50), pp. 18412–18422.
- [10] (2014-06) Closed-loop decoder adaptation shapes neural plasticity for skillful neuroprosthetic control. Neuron 82 (6), pp. 1380–1393.
- [11] (2018-07) Investigating large-scale brain dynamics using field potential recording: analysis and interpretation. Nature Neuroscience 21 (8), pp. 903–919.
- [12] (2015-06) A high performing brain–machine interface driven by low-frequency local field potentials alone and together with spikes. Journal of Neural Engineering 12 (3), pp. 036009.
- [13] (2008) Introduction to nonparametric estimation. 1st edition, Springer Publishing Company, Incorporated.
