Complex-valued Neural Networks with Non-parametric Activation Functions

02/22/2018 ∙ by Simone Scardapane et al. ∙ University of Stirling

Complex-valued neural networks (CVNNs) are a powerful modeling tool for domains where data can be naturally interpreted in terms of complex numbers. However, several analytical properties of the complex domain (e.g., holomorphicity) make the design of CVNNs a more challenging task than for their real-valued counterparts. In this paper, we consider the problem of flexible activation functions (AFs) in the complex domain, i.e., AFs endowed with sufficient degrees of freedom to adapt their shape given the training data. While this problem has received considerable attention in the real case, only a very limited literature exists for CVNNs, where most activation functions are generally developed in a split fashion (i.e., by considering the real and imaginary parts of the activation separately) or with simple phase-amplitude techniques. Leveraging the recently proposed kernel activation functions (KAFs), and related advances in the design of complex-valued kernels, we propose the first fully complex, non-parametric activation function for CVNNs, which is based on a kernel expansion with a fixed dictionary that can be implemented efficiently on vectorized hardware. Several experiments on common use cases, including prediction and channel equalization, validate our proposal when compared to real-valued neural networks and CVNNs with fixed activation functions.


I Introduction

Over the last years, machine learning techniques have obtained impressive results in a wide range of fields, especially when dealing with supervised problems [lecun2015deep, 8264962, zheng2017video]. The majority of these applications have focused on the case of real-valued data: as an example, most deep learning frameworks in use today can only work with floating point (or integer) numbers. Several applicative domains of interest, however, exhibit data that can be more naturally modeled using complex-valued algebra, from image processing to time-series prediction, bioinformatics, and robotics control (see [hirose2003complex, schreier2010statistical] for a variety of examples). While complex data can immediately be transformed to the real domain by considering the real and imaginary components separately, the resulting loss of phase information gives rise to algorithms that are generally less efficient (or expressive) than alternative methods able to work directly in the complex domain, as evidenced by a large body of literature [mandic2007complex]. Due to this, many learning algorithms have been extended to deal with complex data, including linear adaptive filters [fisher1983complex, schreier2010statistical], kernel methods [bouboulis2011extension, tobar2012novel, boloix2017widely], component analysis [scarpiniti2008generalized], and neural networks (NNs) [georgiou1992complex, kim2003approximation, arjovsky2016unitary, danihelka2016associative, guberman2016complex, trabelsi2017deep, 8109745]. We consider this last class of algorithms in this paper.

Despite the apparent similarity between the real and complex domains, working directly in the latter is challenging because of several non-intuitive analytical properties of complex algebra. Most notably, almost all cost functions involved in the training of complex models require non-analytic (also known as non-holomorphic [bouboulis2011extension]) functions, so that standard complex derivatives cannot be used in the definition of the optimization algorithms. This is why several algorithms defined before the last decade considered optimizing the real and imaginary components separately, resulting in a more cumbersome notation which somewhat hindered their development [leung1991complex]. More recently, this problem has been solved by the adoption of the so-called CR-calculus (or Wirtinger calculus), which allows proper complex derivatives to be defined even when dealing with non-analytic functions [brandwood1983complex, kreutz2009complex], by explicitly considering their dependence on both their arguments and the complex conjugates of those arguments. We describe CR-calculus more in depth in Section II.

When dealing with neural networks, another challenging task concerns the design of a proper activation function in the complex domain. In the real-valued case, the use of the rectified linear unit (ReLU) has been instrumental in the development of truly deep networks [glorot2011deep, maas2013rectifier], and has spurred a wave of further research on the topic, e.g., see [klambauer2017self, ramachandran2017swish] for very recent examples. In the complex case, Liouville's theorem asserts that the only complex functions which are analytic and bounded over the entire complex plane are constants. Because bounded activation functions were the norm before the introduction of the ReLU, many authors in the past favored boundedness over analyticity, most notably in a split organization, wherein the real and imaginary parts of the activations are processed separately [nitta1997extension], or in a phase-amplitude configuration, in which the nonlinearity is applied only to the magnitude component, while the phase component is preserved [georgiou1992complex]. Even extending the ReLU function to the complex domain has been shown to be non-trivial, and several authors have proposed different variations [guberman2016complex, arjovsky2016unitary].

In this paper, we consider the problem of adapting the activation functions in the complex domain. For real-valued NNs, there is a large body of literature pointing to the fact that endowing activation functions with several degrees of freedom can improve the accuracy of the trained networks, ease the flow of the back-propagated gradient, or vastly simplify the design of the network. In the simplest case, we can consider parametric functions having only a few (generally fewer than three) parameters per neuron, such as the parametric ReLU [he2015delving], the S-shaped ReLU [jin2016deep], or the self-normalizing exponential linear unit (SELU) [klambauer2017self]. More generally, we can consider non-parametric activation functions, which can adapt to potentially any shape in a purely data-driven fashion, with a flexibility that can be controlled by the user, and to which standard regularization techniques can be applied. In the real-valued case, a lot of research has been devoted to this topic, including the design of Maxout networks [goodfellow2013maxout], adaptive piecewise linear (APL) units [agostinelli2014learning], spline functions [scardapane2016learning], and the recently proposed kernel activation functions (KAFs) [scardapane2017kafnets]. When dealing with complex-valued NNs (CVNNs), however, only a handful of works have considered adapting the activation functions [scarpiniti2008generalized, trabelsi2017deep], and only in the simplified parametric case, or when working in a split configuration. In this sense, how to design activation functions that can adapt to the training data while remaining simple to implement remains an open question.

Contributions of the paper

In this paper, we significantly extend KAFs [scardapane2017kafnets] in order to design non-parametric activation functions for CVNNs. The basic idea of KAFs is to exploit a kernel expansion at every neuron, in which the elements of the kernel dictionary are fixed beforehand, while the mixing coefficients are adapted through standard optimization techniques. As described in [scardapane2017kafnets], this results in functions that are universal approximators, smooth over their entire domain, and whose implementation can leverage highly vectorized CPU/GPU libraries for matrix multiplication.

Here, we propose two different techniques to apply the general idea of KAFs in the context of CVNNs. In the first case, we use a split combination where the real and the imaginary components are processed by two independent KAFs sharing the same dictionary. In the second case, we leverage recent works on complex-valued reproducing kernel Hilbert spaces [bouboulis2011extension] to redefine the KAF directly in the complex domain, by describing several choices for the kernel function. We show via multiple experimental comparisons that CVNNs endowed with complex-valued KAFs can outperform both real-valued NNs and CVNNs having only fixed or parametric activation functions.

Organization of the paper

In Section II we introduce the basic theoretical elements underpinning optimization in a complex domain and CVNNs. Then, in Section III we summarize research on designing activation functions for CVNNs. The two proposed complex KAFs are given in Section IV (split KAF) and Section V (fully complex KAF). We provide an experimental evaluation in Section VI before concluding in Section VII.

Notation

We denote vectors using boldface lowercase letters, e.g., \mathbf{x}; matrices are denoted by boldface uppercase letters, e.g., \mathbf{X}. All vectors are assumed to be column vectors. A complex number is represented as z = x + iy, where x and y are, respectively, the real part and the imaginary part of the number, and i = \sqrt{-1}. Sometimes, we also use \operatorname{Re}(z) and \operatorname{Im}(z) to denote the real and imaginary parts of z for simplicity. The magnitude and phase of a complex number are given by |z| and \theta_z, respectively, while z^* denotes the complex conjugate of z. Other notation is introduced in the text when appropriate.
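For readers who want to follow along in code, the notation above maps directly onto NumPy's native complex type; the short snippet below is an illustrative aside (not part of the original paper).

```python
import numpy as np

z = 3.0 - 4.0j
print(z.real, z.imag)   # real and imaginary parts: 3.0 -4.0
print(np.abs(z))        # magnitude |z|: 5.0
print(np.angle(z))      # phase theta_z in radians
print(np.conj(z))       # complex conjugate z*: (3+4j)
```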

II Preliminaries

II-A Complex algebra and CR-calculus

We start by introducing the basic theoretical concepts required to define a complex-valued function and to optimize it. We consider scalar functions first, and discuss the multivariate extension later on. Any complex-valued function f(z) of a complex argument z = x + iy can be written as:

f(z) = u(x, y) + i\,v(x, y)   (1)

where u and v are real-valued functions of two arguments. The function f is said to be real-differentiable if the partial derivatives of u and v with respect to x and y are defined. Additionally, the function is called analytic (or holomorphic) if it satisfies the Cauchy-Riemann conditions:

\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y}, \qquad \frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x}   (2)

Only analytic functions admit a complex derivative in the standard sense, but most functions used in practice for CVNNs do not satisfy (2) (e.g., functions with real-valued outputs, for which v(x, y) = 0 everywhere). In this case, CR-calculus [kreutz2009complex] provides a theoretical framework to handle non-analytic functions directly in the complex domain, without the need to switch back and forth between definitions in the complex domain and gradient computations in the real one.

The main idea is to consider f explicitly as a function of both z and its complex conjugate z^*, which we denote as f(z, z^*). If f is real-differentiable, then it is also analytic with respect to z when keeping z^* constant, and vice versa. Thus, we can define a pair of (complex) derivatives as follows [brandwood1983complex, kreutz2009complex]:

\frac{\partial f}{\partial z} \triangleq \frac{1}{2}\left(\frac{\partial f}{\partial x} - i\frac{\partial f}{\partial y}\right)   (3)

\frac{\partial f}{\partial z^*} \triangleq \frac{1}{2}\left(\frac{\partial f}{\partial x} + i\frac{\partial f}{\partial y}\right)   (4)

Everything extends to multivariate functions f(\mathbf{z}, \mathbf{z}^*) of a complex vector \mathbf{z} \in \mathbb{C}^D by defining the cogradient and conjugate cogradient operators:

\nabla_{\mathbf{z}} f \triangleq \left[\frac{\partial f}{\partial z_1}, \ldots, \frac{\partial f}{\partial z_D}\right]^T   (5)

\nabla_{\mathbf{z}^*} f \triangleq \left[\frac{\partial f}{\partial z_1^*}, \ldots, \frac{\partial f}{\partial z_D^*}\right]^T   (6)

where each entry is computed as in (3)-(4).

Then, a necessary and sufficient condition for a point \mathbf{z} to be a stationary point (e.g., a minimum) of f is that either the cogradient or the conjugate cogradient vanishes there, i.e., \nabla_{\mathbf{z}} f = \mathbf{0} or \nabla_{\mathbf{z}^*} f = \mathbf{0} [brandwood1983complex]. CR-calculus inherits most of the standard properties of the real derivatives, including the chain rule and the differential rule, e.g., see [kreutz2009complex]. For the important case where the output of the function is real-valued (as is the case for the loss function when optimizing CVNNs), we have the additional property:

\left(\nabla_{\mathbf{z}} f\right)^* = \nabla_{\mathbf{z}^*} f   (7)

Combined with the Taylor expansion of the function, an immediate corollary of this property is that the direction of steepest ascent of f at a point \mathbf{z} is given by the conjugate cogradient evaluated at that point [kreutz2009complex]. Up to a multiplicative constant term, this result coincides with taking the steepest descent direction with respect to the real derivatives, allowing for a straightforward implementation in most optimization libraries.
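As a concrete check of these identities (an illustrative sketch, not taken from the paper), the following snippet computes the Wirtinger derivatives of the non-analytic, real-valued function f(z) = |z|^2 numerically from the real partial derivatives, and verifies that they equal z^* and z, respectively, so that the conjugate derivative indeed points in the direction of steepest ascent.

```python
import numpy as np

# Non-analytic, real-valued example: f(z) = |z|^2 = z * conj(z)
def f(z):
    return (z * np.conj(z)).real

def wirtinger_derivatives(f, z, eps=1e-6):
    """Numerical Wirtinger derivatives via the real partials (Eqs. (3)-(4)):
    df/dz  = 0.5 * (df/dx - i * df/dy)
    df/dz* = 0.5 * (df/dx + i * df/dy)."""
    dfdx = (f(z + eps) - f(z - eps)) / (2 * eps)            # partial w.r.t. the real part
    dfdy = (f(z + 1j * eps) - f(z - 1j * eps)) / (2 * eps)  # partial w.r.t. the imaginary part
    return 0.5 * (dfdx - 1j * dfdy), 0.5 * (dfdx + 1j * dfdy)

z0 = 1.5 - 0.7j
d_z, d_zconj = wirtinger_derivatives(f, z0)
print(d_z, np.conj(z0))   # df/dz  matches conj(z0)
print(d_zconj, z0)        # df/dz* matches z0 (steepest ascent direction for real-valued f)
```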

II-B Complex-valued neural networks

We now turn our attention to the approximation of multivariate complex-valued functions. A generic CVNN is composed by stacking L layers, alternating linear and nonlinear operations. In particular, the l-th layer is described by the following equation:

\mathbf{h}_l = f\left(\mathbf{W}_l \mathbf{h}_{l-1} + \mathbf{b}_l\right)   (8)

where \mathbf{h}_{l-1} is the input to the layer, \mathbf{W}_l and \mathbf{b}_l are adaptable complex-valued weights (a matrix and a bias vector, respectively), and f(\cdot) is a (complex-valued) activation function applied element-wise, which will be discussed more in depth later on. By definition, \mathbf{h}_0 = \mathbf{x} denotes the input to the network, while \hat{y} = \mathbf{h}_L denotes the final output, which we assume one-dimensional for simplicity. Some results on the approximation properties of this model are given in [kim2003approximation], while [trabelsi2017deep] describes some techniques to initialize the adaptable linear weights in the complex domain.
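A minimal NumPy sketch of the forward pass in (8) is given below; the layer sizes, the initialization scale, and the choice of a split-tanh activation (one of the options reviewed in Section III) are illustrative assumptions rather than the authors' settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_tanh(z):
    """Split activation: a real tanh applied separately to Re(z) and Im(z) (see Eq. (12))."""
    return np.tanh(z.real) + 1j * np.tanh(z.imag)

def init_layer(n_in, n_out, scale=0.1):
    """Complex weights with independent real/imaginary parts (an illustrative initialization)."""
    W = scale * (rng.standard_normal((n_out, n_in)) + 1j * rng.standard_normal((n_out, n_in)))
    b = np.zeros(n_out, dtype=complex)
    return W, b

def forward(x, layers, activation=split_tanh):
    """Forward pass of Eq. (8): alternate complex affine maps and an element-wise activation."""
    h = x
    for k, (W, b) in enumerate(layers):
        h = W @ h + b
        if k < len(layers) - 1:      # keep the output layer linear
            h = activation(h)
    return h

layers = [init_layer(4, 8), init_layer(8, 1)]
x = rng.standard_normal(4) + 1j * rng.standard_normal(4)
y = 0.3 - 0.2j
y_hat = forward(x, layers)
print(np.abs(y - y_hat[0]) ** 2)     # squared loss of Eq. (10)
```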

Given a set of N input/output pairs \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}, we train the CVNN by minimizing a cost function given by:

J(\mathbf{w}) = \sum_{n=1}^{N} l\left(y_n, \hat{y}_n\right)   (9)

where \mathbf{w} collects all the adaptable weights of the network, \hat{y}_n is the network prediction for \mathbf{x}_n, and l(\cdot, \cdot) is a loss function, e.g., the squared loss:

l\left(y_n, \hat{y}_n\right) = \left|y_n - \hat{y}_n\right|^2 = \left(y_n - \hat{y}_n\right)\left(y_n - \hat{y}_n\right)^*   (10)

Following the results described in the previous section, a basic steepest descent approach to optimize (9) is given by the following update equation at the k-th iteration:

\mathbf{w}[k+1] = \mathbf{w}[k] - \eta\,\nabla_{\mathbf{w}^*} J\left(\mathbf{w}[k]\right)   (11)

where \eta is the learning rate. More generally, we can use noisy versions of the gradient obtained by sampling a mini-batch of elements, or accelerate the optimization process by adapting most of the state-of-the-art techniques used for real-valued neural networks [bottou2016optimization]. We can also apply some techniques that are specific to the complex domain. For example, [xu2015convergence], inspired by the theory of widely linear adaptive filters, augments the input \mathbf{x} to the CVNN with its complex conjugate \mathbf{x}^*. Additional improvements can be obtained by replacing the real-valued \eta with a complex-valued learning rate [zhang2016complex], which can speed up convergence in some scenarios.
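The following sketch illustrates the update (11) in the simplest possible setting, a purely linear complex model, for which the conjugate cogradient of the mean squared loss has the closed form used below; for a full CVNN, the same gradients would instead be obtained from an automatic differentiation library (the experiments in this paper use Autograd). All constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: a complex linear model y = w_true^T x + noise.
N, D = 200, 3
X = rng.standard_normal((N, D)) + 1j * rng.standard_normal((N, D))
w_true = np.array([0.5 - 0.2j, -0.3 + 0.8j, 0.1 + 0.1j])
y = X @ w_true + 0.01 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

w = np.zeros(D, dtype=complex)
eta = 0.01
for _ in range(500):
    e = y - X @ w                        # residuals
    # Conjugate cogradient of the mean squared loss (Eqs. (9)-(10)) w.r.t. w*:
    grad_wconj = -(X.conj().T @ e) / N
    w = w - eta * grad_wconj             # steepest descent step of Eq. (11)

print(np.round(w, 3))                    # approaches w_true
```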

III Complex-valued activation functions

As we stated in the introduction, choosing a proper activation function in (8) is more challenging than in the real case because of Liouville's theorem, which states that the only complex-valued functions that are bounded and analytic everywhere are constants. In practice, one therefore needs to choose between boundedness and analyticity. Before the introduction of the ReLU activation [glorot2011deep], almost all activation functions in the real case were bounded. Consequently, initial approaches to designing CVNNs always preferred non-analytic functions in order to preserve boundedness, most commonly by applying real-valued activation functions separately to the real and imaginary parts [nitta1997extension]:

f(z) = g\left(\operatorname{Re}(z)\right) + i\,g\left(\operatorname{Im}(z)\right)   (12)

where z is a generic input to the activation function in (8), and g(\cdot) is some real-valued activation function, e.g., a sigmoid. This is called a split activation function. As a representative example, the magnitude and phase of a split activation are shown in Fig. 1. Early proponents of this approach can be found in [benvenuto1992complex] and [leung1991complex].

Fig. 1: Example of a split activation function as in (12), processing the real and the imaginary parts of the input separately. (a) Magnitude of the output. (b) Phase of the output.
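A split activation as in (12) is a one-liner in NumPy; the following illustrative sketch (with arbitrary inputs) evaluates a split-tanh and a split-ReLU and prints the magnitude and phase of the output, mirroring the two panels of Fig. 1.

```python
import numpy as np

def split_activation(z, g=np.tanh):
    """Split activation of Eq. (12): a real-valued function g applied separately
    to the real and imaginary parts of the input."""
    return g(z.real) + 1j * g(z.imag)

z = np.array([1.0 + 2.0j, -0.5 - 0.3j, 3.0 + 0.0j])
out = split_activation(z)                                    # split-tanh
print(np.abs(out), np.angle(out))                            # magnitude and phase of the output
print(split_activation(z, g=lambda t: np.maximum(t, 0.0)))   # split-ReLU variant
```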

Another common class of non-analytic activation functions are the phase-amplitude (PA) functions popularized by [georgiou1992complex, hirose1992continuous]:

f(z) = \frac{z}{c + |z|/r}   (13)

f(z) = \tanh\left(\frac{|z|}{m}\right)\exp\left(i\,\theta_z\right)   (14)

where \theta_z is the phase of z, while c, r, and m are positive constants which in most cases are set equal to one. PA functions can be seen as the natural generalization of real-valued squashing functions such as the sigmoid, because the output has bounded magnitude but preserves the phase of z.

A third alternative is to use fully-complex activation functions that are analytic and bounded almost everywhere, at the cost of introducing a set of singular points. Among all possible transcendental functions, it is common to consider the complex-valued extension of the hyperbolic tangent, defined as [kim2003approximation]:

\tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}   (15)

which possesses periodic singular points at z = i\left(0.5 + k\right)\pi, with k \in \mathbb{Z}. However, careful scaling of the inputs and of the initial weights allows these singularities to be avoided during training.

Finally, several authors have proposed extensions of the real-valued ReLU function \operatorname{ReLU}(s) = \max(0, s), motivated by the fact that its success in the deep learning literature does not immediately translate to the complex-valued case, where using it in a split function as in (12) results in poor performance [trabelsi2017deep]. [guberman2016complex] propose a complex-valued ReLU as:

f(z) = \begin{cases} z & \text{if } \operatorname{Re}(z) \ge 0 \text{ and } \operatorname{Im}(z) \ge 0 \\ 0 & \text{otherwise} \end{cases}   (16)

i.e., the identity over the first quadrant of the complex plane and zero elsewhere. Alternatively, inspired by the PA functions to maintain the phase of the activation value, [arjovsky2016unitary] propose the following modReLU function:

f(z) = \operatorname{ReLU}\left(|z| + b\right)\exp\left(i\,\theta_z\right) = \begin{cases} \left(|z| + b\right)\dfrac{z}{|z|} & \text{if } |z| + b \ge 0 \\ 0 & \text{otherwise} \end{cases}   (17)

where b is an adaptable (real-valued) parameter defining a radius around the origin within which the output of the function is 0. Another extension, the complex cardioid, is advanced in [virtue2017better]:

f(z) = \frac{1}{2}\left(1 + \cos\left(\theta_z\right)\right) z   (18)

which maintains phase information while attenuating the magnitude based on the phase itself. For real-valued inputs, (18) reduces to the ReLU.
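For reference, the following sketch implements the modReLU and the complex cardioid as described above; the offset b and the small constant guarding the division at the origin are illustrative choices.

```python
import numpy as np

def mod_relu(z, b=-0.1, eps=1e-9):
    """modReLU (Eq. (17)): shrink the magnitude by an adaptable offset b, keep the phase.
    For negative b, the output is zero inside the disc of radius |b|."""
    mag = np.abs(z)
    scale = np.maximum(mag + b, 0.0) / (mag + eps)   # eps avoids division by zero at the origin
    return scale * z

def complex_cardioid(z):
    """Complex cardioid (Eq. (18)): attenuate the magnitude as a function of the phase only."""
    return 0.5 * (1.0 + np.cos(np.angle(z))) * z

z = np.array([0.05 + 0.05j, 1.0 + 1.0j, -1.0 + 0.0j, 2.0 - 0.5j])
print(mod_relu(z))
print(complex_cardioid(z))   # positive reals pass through, negative reals map to zero (ReLU-like)
```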

Note that in all cases these proposed activation functions are fixed or endowed with a very small degree of flexibility (as in (17)). In the following sections we describe a principled technique to design non-parametric activation functions for use in CVNNs.

IV Split kernel activation functions

Our first proposal is a split function as in (12), where we use non-parametric (real-valued) functions for in place of fixed ones. Specifically, we consider the kernel activation function (KAF) proposed in [scardapane2017kafnets], which will also serve as a base for the fully complex-valued proposal of the following section. Here, we introduce the basic elements of the KAF, and we refer to the original paper [scardapane2017kafnets] for a fuller exposition.

The basic idea of a KAF is to model each activation function as a one-dimensional kernel model, where the kernel elements are chosen in a proper way to obtain an efficient back-propagation step. Consider the generic activation function g(s), where s denotes either the real or the imaginary part of the activation value, as in (12). To obtain a flexible shape, we could model g as a linear predictor over a high-dimensional feature transformation of the activation. However, this process becomes infeasible for a large number of feature transformations, and cannot handle infinite-dimensional feature spaces. For feature maps associated to a reproducing kernel Hilbert space with kernel \kappa(\cdot, \cdot),[1] we can instead write an equivalent linear model by exploiting the representer theorem as:

g(s) = \sum_{i=1}^{D} \alpha_i\,\kappa\left(s, d_i\right)   (20)

where \{\alpha_i\}_{i=1}^{D} are the mixing coefficients and \{d_i\}_{i=1}^{D} make up the so-called dictionary of the kernel expansion [hofmann2008kernel, liu2011kernel].

[1] Remember that a function \kappa(\cdot, \cdot) is a valid kernel function if it respects the positive semi-definiteness property, i.e., for any possible choice of \{\alpha_i\}_{i=1}^{D} and \{d_i\}_{i=1}^{D} in (20) we have that:

\sum_{i=1}^{D}\sum_{j=1}^{D} \alpha_i\,\alpha_j\,\kappa\left(d_i, d_j\right) \ge 0   (19)

In the context of a neural network, the dictionary elements cannot be selected a priori, because they would change at every step of the optimization algorithm depending on the distribution of the activation values. Instead, we exploit the fact that we are working with one-dimensional kernels to fix the D elements beforehand, and we only adapt the mixing coefficients during optimization. In particular, we select the elements by sampling D values over the real axis, uniformly around zero. In this way, the value D becomes a hyper-parameter controlling the flexibility of the approach: for larger D we obtain a more flexible method, at the cost of a larger number of adaptable parameters. In general, since the function is only a small component of a much larger neural network, we have found small values of D to be sufficient for most applications. As the number of parameters per neuron can potentially grow without bound depending on the choice of D, we refer to such activation functions as non-parametric.
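A single real-valued KAF neuron can be sketched as follows; the dictionary size, its range, and the bandwidth value are illustrative assumptions (the paper's actual rule-of-thumb for the bandwidth is given later in (23)), and in a real network the mixing coefficients would be trained by back-propagation rather than drawn at random.

```python
import numpy as np

rng = np.random.default_rng(2)

# Dictionary: D points sampled uniformly around zero (size and range are illustrative).
D = 20
d = np.linspace(-2.0, 2.0, D)
delta = d[1] - d[0]
gamma = 1.0 / (2.0 * delta ** 2)   # bandwidth tied to the grid spacing (illustrative; see Eq. (23))

# Mixing coefficients: random here; inside a network they are adapted by back-propagation.
alpha = 0.3 * rng.standard_normal(D)

def kaf(s):
    """Kernel activation function of Eq. (20): fixed dictionary, adaptable mixing coefficients."""
    K = np.exp(-gamma * (np.asarray(s, dtype=float)[..., None] - d) ** 2)   # Gaussian kernel, Eq. (22)
    return K @ alpha

print(kaf([-1.0, 0.0, 1.0]))
```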

We use the same dictionary across the entire neural network, but two different sets of mixing coefficients for the real and imaginary parts of each neuron. Due to this, an efficient implementation of the proposed split-KAF is straightforward. In particular, consider the vector \mathbf{s} containing the (complex) activations of a layer following the linear operations in (8). We build the matrix \mathbf{K}_R by computing all the kernel values between the real parts of the activations and the elements of the dictionary (and similarly for \mathbf{K}_I using the imaginary parts), and we compute the final output of the layer as:

\mathbf{h} = \left(\mathbf{K}_R \circ \mathbf{A}_R\right)\mathbf{1} + i\left(\mathbf{K}_I \circ \mathbf{A}_I\right)\mathbf{1}   (21)

where \circ represents the element-wise (Hadamard) product, \mathbf{A}_R and \mathbf{A}_I are matrices collecting row-wise all the mixing coefficients for the real and imaginary components of the layer, and \mathbf{1} is a vector of ones. If we need to handle batches of elements (or convolutive layers), we only need to slightly modify (21) by adding additional trailing dimensions.
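The vectorized computation in (21) can be sketched as follows; dimensions and parameter values are illustrative, and the bandwidth is simply tied to the grid spacing here (the paper's rule-of-thumb is discussed next).

```python
import numpy as np

def split_kaf_layer(z, d, gamma, A_re, A_im):
    """Split-KAF of Eq. (21): two independent KAFs on Re(z) and Im(z), shared dictionary d.

    z          : (H,) complex activations after the linear part of the layer
    d          : (D,) shared dictionary
    A_re, A_im : (H, D) mixing coefficients for the real/imaginary parts
    """
    K_re = np.exp(-gamma * (z.real[:, None] - d) ** 2)   # (H, D) kernel matrix, real parts
    K_im = np.exp(-gamma * (z.imag[:, None] - d) ** 2)   # (H, D) kernel matrix, imaginary parts
    ones = np.ones(d.shape[0])
    return (K_re * A_re) @ ones + 1j * ((K_im * A_im) @ ones)   # Hadamard products, then row sums

rng = np.random.default_rng(3)
H, D = 8, 20
d = np.linspace(-2.0, 2.0, D)
gamma = 1.0 / (2.0 * (d[1] - d[0]) ** 2)                 # illustrative bandwidth choice
A_re = 0.3 * rng.standard_normal((H, D))
A_im = 0.3 * rng.standard_normal((H, D))
z = rng.standard_normal(H) + 1j * rng.standard_normal(H)
print(split_kaf_layer(z, d, gamma, A_re, A_im).shape)    # (8,)
```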

For all our experiments, we consider the 1D Gaussian kernel defined as:

\kappa\left(s, d_i\right) = \exp\left(-\gamma\left(s - d_i\right)^2\right)   (22)

where \gamma > 0 is called the kernel bandwidth. In the proposed KAF scheme, the values of the dictionary are chosen according to a grid, and as such the optimal bandwidth parameter depends uniquely on the grid resolution. In particular, we use the following rule-of-thumb proposed in [scardapane2017kafnets]:

(23)

where \Delta is the distance between the grid points. In order to provide an additional degree of freedom to our method, we also optimize a single \gamma per layer via back-propagation, after initializing it following (23).

V Fully-complex kernel activation functions

While most of the literature on kernel methods in machine learning has focused on the real-valued case, it is well-known that the original mathematical treatment originated in the complex-valued domain [aronszajn1950theory]. In the context of the kernel filtering literature, techniques to build complex-valued algorithms by separating the real and the imaginary components (as in the previous section) are called complexification methods [bouboulis2011extension]. However, recently several authors have advocated for the direct use of (pure) complex-valued kernels leveraging the complex-valued treatment of RKHSs for a variety of fields, as surveyed in the introduction.

From a theoretical standpoint, defining complex RKHSs and kernels is relatively straightforward. As an example, a one-dimensional complex function \kappa(\cdot, \cdot) is positive semi-definite (PSD) if and only if, for any possible choice of complex coefficients \{\alpha_i\} and points \{z_i\}:

\sum_{i=1}^{D}\sum_{j=1}^{D} \alpha_i\,\alpha_j^*\,\kappa\left(z_i, z_j\right) \ge 0   (24)

where all values are now defined in the complex domain. Any PSD function is then a valid kernel function. Based on this, in this paper we also propose a fully-complex, non-parametric KAF by defining (20) directly in the complex domain, without the need for split functions:

g(z) = \sum_{i=1}^{D^2} \alpha_i\,\kappa\left(z, d_i\right)   (25)

where the mixing coefficients \alpha_i and the dictionary elements d_i are now defined as complex numbers. Note that, in order for the dictionary to provide a dense sampling of the space of complex numbers, we now consider D^2 fixed elements arranged over a regular grid of the complex plane, an example of which is depicted in Fig. 2. Due to this, we now have D^2 adaptable mixing coefficients per neuron, as opposed to 2D in the split case. We counter-balance this by selecting a drastically smaller D (see the experimental section).


Fig. 2: A visual example of sampling the dictionary for the complex-valued KAF, with the elements arranged over a regular grid in the complex plane.

An immediate complex-valued extension of the Gaussian kernel in (22) is given by:

\kappa_{\mathbb{C}}\left(z, z'\right) = \exp\left(-\gamma\left(z - \left(z'\right)^*\right)^2\right)   (26)

where in our experiments the bandwidth hyper-parameter \gamma is selected using the same rule-of-thumb as before and then adapted layer-wise. A complete analysis of the feature space associated to (26) is given in [steinwart2006explicit]. In order to gain some informal understanding, we can write the kernel explicitly in terms of the real and imaginary components of its arguments, z = x + iy and z' = x' + iy':

\kappa_{\mathbb{C}}\left(z, z'\right) = \exp\left(-\gamma\left[\left(x - x'\right)^2 - \left(y + y'\right)^2\right]\right)\exp\left(-2i\gamma\left(x - x'\right)\left(y + y'\right)\right)   (27)

By analyzing the previous expression, we see that the complex-valued Gaussian kernel has several properties which are counter-intuitive if one is used to working with its real-valued restriction. First of all, (26) cannot be interpreted as a standard similarity measure, because it depends on its arguments only via the difference (x - x') and the sum (y + y'). For the same reason, the kernel is not stationary, and it has an additional oscillatory behavior. We refer to Fig. 3 (or to [boloix2017widely, Section IV-A]) for an illustration of the kernel when fixing the second argument.
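The following sketch evaluates a fully complex KAF built on a grid dictionary, taking the complex Gaussian kernel as exp(-gamma*(z - d*)^2), consistently with the dependence on (x - x') and (y + y') noted above; the grid size, range, bandwidth, and random mixing coefficients are illustrative.

```python
import numpy as np

def complex_gaussian_kernel(z, d, gamma):
    """Complex Gaussian kernel as in Eq. (26): note the square of (z - conj(d)), not a magnitude,
    which makes the kernel non-stationary and oscillatory."""
    diff = z[..., None] - np.conj(d)
    return np.exp(-gamma * diff ** 2)

def complex_kaf(z, d, gamma, alpha):
    """Fully complex KAF of Eq. (25): complex mixing coefficients over a grid dictionary."""
    return complex_gaussian_kernel(z, d, gamma) @ alpha

rng = np.random.default_rng(4)
axis = np.linspace(-2.0, 2.0, 8)
d = (axis[:, None] + 1j * axis[None, :]).ravel()                  # D^2 grid points (cf. Fig. 2)
alpha = 0.1 * (rng.standard_normal(d.size) + 1j * rng.standard_normal(d.size))
z = np.array([0.3 - 0.4j, 1.0 + 0.2j])
print(complex_kaf(z, d, gamma=0.5, alpha=alpha))
```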

Fig. 3: Example of the complex Gaussian kernel in (26) when fixing the second argument. Notice the scale of the axes (more details are provided in the text). (a) Real part of the output. (b) Imaginary part of the output.

For these reasons, another extension of the Gaussian kernel to the complex domain is given in [bouboulis2011extension], where the authors propose to build a whole family of complex-valued kernels starting from any real-valued one as follows:

(28)

The new complex-valued kernel is called an independent kernel. By plugging the real-valued Gaussian kernel (22) in the previous expression, we obtain a complex-valued expression that can still be interpreted as a similarity measure between the two points.

Note that several alternative kernels are also possible, many of which are specific to the complex-valued case, a prominent example being the Szegő kernel [bouboulis2011extension]:

\kappa\left(z, z'\right) = \frac{1}{1 - z\left(z'\right)^*}   (29)

VI Experimental evaluation

In this section, we experimentally evaluate the proposed activation functions on several benchmark problems, including channel identification in Section VI-A, wind prediction in Section VI-B, and multi-class classification in the complex domain in Section VI-C. In all cases, we linearly preprocess the real and the imaginary components of the input features to lie in a fixed range. We regularize all parameters with respect to their squared absolute value (which is equivalent to standard L2 regularization applied to the real and imaginary components separately), but we exclude the bias terms and the adaptable parameter b in (17). We select the strength of the regularization term and the size of the networks based on previous literature or on a cross-validation procedure, as described below. For optimization, we use a simple complex-valued extension of the Adagrad algorithm, which computes a per-parameter learning rate weighted by the squared magnitude of the gradients themselves. At each iteration, we construct a mini-batch by randomly sampling elements from the entire training dataset. All algorithms have been implemented in Python using the Autograd library [maclaurin2015autograd].
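One possible form of the complex-valued Adagrad variant described above is sketched below; it is an assumption about the exact update, not the authors' implementation: squared gradient magnitudes are accumulated and used as a per-parameter scaling of the learning rate applied to the conjugate cogradient.

```python
import numpy as np

def complex_adagrad_step(w, grad_wconj, state, eta=0.01, eps=1e-8):
    """One step of a complex Adagrad-like update (illustrative sketch): accumulate squared
    gradient magnitudes and use them to scale the per-parameter learning rate."""
    state += np.abs(grad_wconj) ** 2                  # running sum of |gradient|^2
    w = w - eta * grad_wconj / (np.sqrt(state) + eps)
    return w, state

# Usage on the conjugate cogradient of any loss:
w = np.zeros(3, dtype=complex)
state = np.zeros(3)
grad = np.array([0.1 + 0.2j, -0.3j, 0.05 + 0.0j])
w, state = complex_adagrad_step(w, grad, state)
print(w)
```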

VI-A Experiment 1 - Channel Identification

Fig. 4: Results for the first experiment, expressed in terms of MSE (dB). (a) Circular input signal. (b) Non-circular input signal. A dashed line separates the results of the proposed models.

Our first experiment is a standard benchmark in the complex-valued literature, i.e., a channel identification task [bouboulis2015complex]. The input to the channel is generated as:

(30)

where X(n) and Y(n) are Gaussian random variables, and the parameter ρ determines the circularity of the signal.[2] For one particular value of ρ the input signal is circular, while for other choices it becomes highly non-circular.

[2] A random variable z is circular if z and z·exp(iθ) have the same probability distribution for any angle θ. Roughly speaking, non-circular signals are harder to predict, requiring the use of widely linear techniques when using standard linear filters [bouboulis2011extension].

The output of the channel is computed by first applying a linear filtering operation:

(31)

where:

(32)

for every tap of the filter. Then, the output of the linear filter goes through a memoryless nonlinearity:

(33)

and finally it is corrupted by additive white Gaussian noise in order to obtain the final signal. The variance of the noise is selected to obtain a prescribed signal-to-noise ratio (SNR). The input to the neural network is an embedding of the most recent channel inputs:

(34)

and the network is trained to output the corresponding channel output. We generate a set of samples of the channel, randomly keeping a portion for testing, and we average the results over several independent generations of the dataset. We compare the following algorithms:

  • LIN: a standard linear filter [schreier2010statistical] with complex-valued coefficients.

  • 2R-NN: a real-valued neural network taking as input the real and imaginary parts separately. For the activation functions in the hidden layers, we consider either a standard tanh or ReLUs.

  • C-NN: complex-valued neural networks with fixed activation functions, including a split-tanh, a split-ReLU, the AMP function in (13), and the complex ReLU in (16).

  • ModReLU-NN: a CVNN with adaptable ModReLU activation functions as in (17). In this case, the coefficients b of the neurons are all initialized to the same value and later adapted.

  • Proposed KAF-NN: a CVNN with the split-KAF proposed in Section IV. We empirically select the number of elements in the dictionary, which are sampled uniformly around zero.

  • Proposed C-KAF-NN: a CVNN with the fully complex KAF proposed in Section V. In this case, we test either the complex Gaussian kernel (26) or the independent kernel with the real Gaussian kernel as its base. We empirically select a smaller dictionary size, as discussed in Section V.

All algorithms are trained by minimizing the mean-squared error in (10) on random mini-batches. Following [xu2015convergence], in this scenario we consider a single hidden layer (as more layers are not found to provide significant improvements in performance). The size of the regularization factor is selected empirically. Results in terms of mean squared error (MSE), expressed in dB, are given in Fig. 4, considering either a circular input signal or the more challenging non-circular scenario.

As expected, performance is generally worse in the non-circular case, particularly so for techniques that are not able to exploit the geometry of non-circular complex signals, such as non-widely-linear models and real-valued neural networks. However, the proposed KAF-NN and C-KAF-NN are able to consistently outperform all other methods in both scenarios in a stable fashion. Note that this difference in performance cannot be overcome by increasing the size of the other networks, thus pointing to the importance of adapting the activation functions also in the complex case. Interestingly, the complex Gaussian kernel in (26) results in poor performance, which is remedied by using the independent kernel instead.

VI-B Experiment 2 - Wind prediction

Fig. 5: A plot of the complex-valued wind profile for the initial samples of the wind time-series. (a) Absolute value of the signal. (b) Phase of the signal.

For the second experiment, we consider a real-world dataset for a task of wind prediction [goh2006complex]. The dataset consists of hourly samples of wind intensity collected along two different axes (north axis and east axis). The dataset is provided for three settings of wind regime, namely ‘low’, ‘medium’, and ‘high’, from which we select the highest, being the most challenging one. In order to construct a complex-valued signal, the two samples for each hour are considered as the real and the imaginary components of a single complex number (for more motivation on the use of complex-valued information when dealing with wind forecasting, see [goh2004complex, goh2005nonlinear, 8109745, goh2006complex, kuh2009applications]). A snapshot of the absolute value and phase of the resulting signal is shown in Fig. 5 for the initial samples. We consider the task of predicting both components of the wind several hours ahead, starting from an embedding of the most recent hours of measurements. We select neural networks with a small number of hidden layers (as additional hidden layers are not found to provide gains in performance), and we optimize both the number of neurons and the regularization factor on a held-out validation set. We evaluate the models on the last portion of the time-series, in terms of the coefficient of determination:

R^2 = 1 - \frac{\sum_{n}\left|y_n - \hat{y}_n\right|^2}{\sum_{n}\left|y_n - \bar{y}\right|^2}   (35)

where y_n is the true value, \hat{y}_n is the predicted value, and \bar{y} is the mean of the true values computed from the test set. Positive values of R^2 denote a prediction which is better than chance, with values approaching 1 for an almost-perfect prediction.
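A direct implementation of (35) is shown below; for complex-valued targets the squared errors are taken as squared magnitudes, an assumption consistent with the squared loss (10).

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination of Eq. (35); squared magnitudes handle complex targets."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    ss_res = np.sum(np.abs(y_true - y_pred) ** 2)
    ss_tot = np.sum(np.abs(y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

print(r_squared([1 + 1j, 2 - 1j, 0.5j], [0.9 + 1.1j, 2.1 - 0.9j, 0.4j]))
```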

TABLE I: Results (mean and standard deviation of the coefficient of determination R^2) in the wind prediction task. Best result is highlighted in bold, second-best result underlined. The models compared are: a linear filter (Linear); real-valued NNs (2R-NN (tanh), 2R-NN (ReLU)); CVNNs with fixed or parametric activations (C-NN (split-tanh), C-NN (split-ReLU), C-NN (AMP), C-NN (CReLU), ModReLU-NN); and the proposed CVNNs (KAF-NN, C-KAF-NN, C-KAF-NN (Ind.)).

Results for the experiment are reported in Table I. We can see that, in this scenario as well, the two best results are obtained by the proposed split-KAF and fully complex KAF neurons, which significantly outperform the other models.

VI-C Experiment 3: complex-valued multi-class classification

We conclude our experimental evaluation by testing the proposed algorithms on a multi-class classification problem expressed in the complex domain. Following [bouboulis2015complex], we build the task by applying a two-dimensional fast Fourier transform (FFT) to the images in the well-known MNIST dataset (http://yann.lecun.com/exdb/mnist/), comprising black-and-white images of handwritten digits split into ten classes. We then rank the coefficients of the FFT in terms of significance (by considering their mean absolute value), and keep only the most significant coefficients as input to the models. We compare a real-valued NN taking the real and the imaginary components of the coefficients as separate inputs, a CVNN with modReLU activation functions, and a CVNN employing the proposed split-KAF. All networks have a softmax activation function in their output layer. For the CVNNs, we use the following variation to handle the complex-valued activations:

(36)

All networks are then trained by minimizing the classical regularized cross-entropy formulation with the same optimizer as in the previous sections. We consider networks with three hidden layers, whose regularization term is optimized separately via cross-validation. Results on the MNIST test set are provided in Table II.
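The feature-extraction step described above (2D FFT followed by ranking the coefficients by their mean absolute value) can be sketched as follows; the number of retained coefficients and the random stand-in images are illustrative, since the exact figure used in the experiments is not reproduced here.

```python
import numpy as np

def complex_fft_features(images, n_keep=100):
    """Complex-valued inputs for the classifier: 2D FFT of each image, then keep the n_keep
    coefficients with the largest mean absolute value over the dataset (n_keep is illustrative).

    images : (N, 28, 28) array of grayscale digits in [0, 1]
    """
    F = np.fft.fft2(images)                        # (N, 28, 28) complex spectra
    F = F.reshape(F.shape[0], -1)                  # flatten to (N, 784)
    significance = np.abs(F).mean(axis=0)          # mean |coefficient| across the dataset
    idx = np.argsort(significance)[::-1][:n_keep]  # indices of the most significant coefficients
    return F[:, idx], idx

# Example with random stand-in images (real MNIST digits would be loaded from disk):
X, idx = complex_fft_features(np.random.default_rng(5).random((16, 28, 28)))
print(X.shape, X.dtype)                            # (16, 100) complex128
```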

We see that working in the complex domain results in significantly better performance when compared to working in the real domain. In addition, the proposed split-KAF can consistently obtain a better accuracy in this task than the ModReLU version. We show a representative evolution of the loss function in Fig. 6, where we highlight the first 10000 iterations for readability.

VII Conclusive remarks

In this paper, we considered the problem of adapting the activation functions in a complex-valued neural network (CVNN). To this end, we proposed two different non-parametric models that extend the recently introduced kernel activation function (KAF) to the complex-valued case. The first model is a split configuration, where the real and the imaginary components of the activation are processed independently by two separate KAFs. In the second model, we directly redefine the KAF in the complex domain with the use of fully-complex kernels. We showed that CVNNs with adaptable functions can outperform neural networks with fixed functions in different benchmark problems including channel identification, wind prediction, and multi-class classification. For the fully-complex KAF, the independent kernel generally outperforms a naive complex Gaussian kernel without introducing significantly more complexity.

TABLE II: Results (mean and standard deviation of the test accuracy [%]) in the complex-valued MNIST task. Best result is highlighted in bold. The models compared are: a real-valued NN, a CVNN with ModReLU activations, and a CVNN with the proposed split-KAF.

Several improvements over this framework are possible, most notably by leveraging recent advances in the field of real-valued kernels (e.g., [mansouri2017multiscale]) and of complex-valued kernel regression and classification. One example is the use of pseudo-kernels [boloix2017widely] to handle more efficiently the non-circularity of the signals propagated through the network. More generally, it would be interesting to extend other classes of non-parametric, real-valued activation functions (such as Maxout networks [goodfellow2013maxout] or adaptive piecewise linear units [agostinelli2014learning]) to the complex domain, or to adapt the proposed complex KAFs to other types of NNs, such as convolutive architectures [lecun2015deep, ren2017clustering].

Acknowledgments

The work of Simone Scardapane was supported in part by Italian MIUR, “Progetti di Ricerca di Rilevante Interesse Nazionale”, GAUChO project, under Grant 2015YPXH4W_004. The work of Steven Van Vaerenbergh was supported by the Ministerio de Economía, Industria y Competitividad (MINECO) of Spain under grant TEC2014-57402-JIN (PRISMA). Amir Hussain was supported by the UK Engineering and Physical Science Research Council (EPSRC) grant no. EP/M026981/1.

Fig. 6: Loss function evolution for the three algorithms on the complex-valued MNIST task (detail of the first 10000 iterations).

References