I. Introduction
Over the last few years, machine learning techniques have obtained impressive results in a wide range of fields, especially when dealing with supervised problems [lecun2015deep, 8264962, zheng2017video]. The majority of these applications have focused on the case of real-valued data: as an example, most of the deep learning frameworks in use today can only work with floating point (or integer) numbers. Several application domains of interest, however, exhibit data that can be more naturally modeled using complex-valued algebra, from image processing to time-series prediction, bioinformatics, and robotics control (see [hirose2003complex, schreier2010statistical] for a variety of examples). While complex data can immediately be transformed to a real domain by considering the real and imaginary components separately, the resulting loss of phase information gives rise to algorithms that are generally less efficient (or expressive) than alternative methods able to work directly in the complex domain, as evidenced by a large body of literature [mandic2007complex]. For this reason, many learning algorithms have been extended to deal with complex data, including linear adaptive filters [fisher1983complex, schreier2010statistical], kernel methods [bouboulis2011extension, tobar2012novel, boloix2017widely], component analysis [scarpiniti2008generalized], and neural networks (NNs) [georgiou1992complex, kim2003approximation, arjovsky2016unitary, danihelka2016associative, guberman2016complex, trabelsi2017deep, 8109745]. We consider this last class of algorithms in this paper.

Despite the apparent similarity between the real and complex domains, working directly in the latter is challenging because of several non-intuitive analytical properties of complex algebra. Most notably, almost all cost functions involved in the training of complex models require non-analytic (also known as non-holomorphic [bouboulis2011extension]) functions, so that standard complex derivatives cannot be used in the definition of the optimization algorithms. This is why several algorithms defined before the last decade optimized the real and imaginary components separately, resulting in a more cumbersome notation which somewhat hindered their development [leung1991complex]. More recently, this problem has been solved by the adoption of the so-called CR-calculus (or Wirtinger calculus), which allows proper complex derivatives to be defined even for non-analytic functions [brandwood1983complex, kreutz2009complex], by explicitly considering their dependence on both their arguments and the complex conjugates of those arguments. We describe CR-calculus in more depth in Section II.
When dealing with neural networks, another challenging task concerns the design of a proper activation function in the complex domain. In the real-valued case, the use of the rectified linear unit (ReLU) has been instrumental in the development of truly deep networks [glorot2011deep, maas2013rectifier], and has spun a wave of further research on the topic, e.g., see [klambauer2017self, ramachandran2017swish] for very recent examples. In the complex case, Liouville's theorem asserts that the only complex functions which are analytic and bounded at the same time are constants. Due to the preference for bounded activation functions before the introduction of the ReLU, many authors in the past preferred bounded functions to analytic ones, most notably in a split organization, wherein the real and imaginary parts of the activations are processed separately [nitta1997extension], or in a phase-amplitude configuration, in which the nonlinearity is applied only to the magnitude component, while the phase component is preserved [georgiou1992complex]. Even extending the ReLU function to the complex domain has been shown to be non-trivial, and several authors have proposed different variations [guberman2016complex, arjovsky2016unitary].

In this paper, we consider the problem of adapting
activation functions in the complex domain. For real-valued NNs, there is a large body of literature pointing to the fact that endowing activation functions with several degrees of freedom can improve the accuracy of the trained networks, ease the flow of the backpropagated gradient, or vastly simplify the design of the network. In the simplest case, we can consider parametric functions having only a few (generally fewer than three) parameters per neuron, such as the parametric ReLU [he2015delving], the S-shaped ReLU [jin2016deep], or the self-normalizing exponential linear unit (SELU) [klambauer2017self]. More generally, we can think of nonparametric activation functions, which can adapt to potentially any shape in a purely data-driven fashion, with a flexibility that can be controlled by the user, and to which standard regularization techniques can be applied. In the real-valued case, a lot of research has been devoted to the topic, including the design of Maxout networks [goodfellow2013maxout], adaptive piecewise linear (APL) units [agostinelli2014learning], spline functions [scardapane2016learning], and the recently proposed kernel activation functions (KAFs) [scardapane2017kafnets]. When dealing with complex-valued NNs (CVNNs), however, only a handful of works have considered adapting the activation functions [scarpiniti2008generalized, trabelsi2017deep], and only in the simplified parametric case, or when working in a split configuration. In this sense, how to design activation functions that can adapt to the training data while remaining simple to implement remains an open question.

Contributions of the paper
In this paper, we significantly extend KAFs [scardapane2017kafnets] in order to design nonparametric activation functions for CVNNs. The basic idea of KAFs is to exploit a kernel expansion at every neuron, in which the elements of the kernel dictionary are fixed beforehand, while the mixing coefficients are adapted through standard optimization techniques. As described in [scardapane2017kafnets], this results in functions that are universal approximators, smooth over their entire domain, and whose implementation can leverage highly vectorized CPU/GPU libraries for matrix multiplication.
Here, we propose two different techniques to apply the general idea of KAFs in the context of CVNNs. In the first case, we use a split combination where the real and the imaginary components are processed by two independent KAFs sharing the same dictionary. In the second case, we leverage recent works on complex-valued reproducing kernel Hilbert spaces (RKHSs) [bouboulis2011extension] to redefine the KAF directly in the complex domain, describing several choices for the kernel function. We show via multiple experimental comparisons that CVNNs endowed with complex-valued KAFs can outperform both real-valued NNs and CVNNs having only fixed or parametric activation functions.
Organization of the paper
In Section II we introduce the basic theoretical elements underpinning optimization in a complex domain and CVNNs. Then, in Section III we summarize research on designing activation functions for CVNNs. The two proposed complex KAFs are given in Section IV (split KAF) and Section V (fully complex KAF). We provide an experimental evaluation in Section VI before concluding in Section VII.
Notation
We denote vectors using boldface lowercase letters, e.g., $\mathbf{a}$; matrices are denoted by boldface uppercase letters, e.g., $\mathbf{A}$. All vectors are assumed to be column vectors. A complex number $z$ is represented as $z = x + iy$, where $x$ and $y$ are, respectively, the real part and the imaginary part of the number, and $i = \sqrt{-1}$. Sometimes, we also use $\Re\{z\}$ and $\Im\{z\}$ to denote the real and imaginary parts of $z$ for simplicity. Magnitude and phase of a complex number are given by $|z|$ and $\angle z$, respectively. $z^*$ denotes the complex conjugate of $z$. Other notation is introduced in the text when appropriate.
II. Preliminaries
II-A. Complex algebra and CR-calculus
We start by introducing the basic theoretical concepts required to define a complex-valued function and to optimize it. We consider scalar functions first, and discuss the multivariate extension later on. Any complex-valued function $f(z)$ of a complex argument $z = x + iy$ can be written as:
\[
f(z) = u(x, y) + i\,v(x, y) \tag{1}
\]
where $u(\cdot, \cdot)$ and $v(\cdot, \cdot)$ are real-valued functions of two arguments. The function $f$ is said to be real-differentiable if the partial derivatives of $u$ and $v$ with respect to $x$ and $y$ are defined. Additionally, the function is called analytic (or holomorphic) if it satisfies the Cauchy-Riemann conditions:
\[
\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y}\,, \qquad \frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x} \tag{2}
\]
Only analytic functions admit a complex derivative in the standard sense, but most functions used in practice for CVNNs do not satisfy (2) (e.g., functions with real-valued outputs, for which $v(x, y) = 0$ everywhere). In this case, CR-calculus [kreutz2009complex] provides a theoretical framework to handle non-analytic functions directly in the complex domain, without the need to switch back and forth between definitions in the complex domain and gradient computations in the real one.
The main idea is to consider $f$ explicitly as a function of both $z$ and its complex conjugate $z^*$, which we denote as $f(z, z^*)$. If $f$ is real-differentiable, then it is also analytic with respect to $z$ when keeping $z^*$ constant, and vice versa. Thus, we can define a pair of (complex) derivatives as follows [brandwood1983complex, kreutz2009complex]:
\[
\frac{\partial f}{\partial z} \triangleq \frac{1}{2} \left( \frac{\partial f}{\partial x} - i \frac{\partial f}{\partial y} \right) \tag{3}
\]
\[
\frac{\partial f}{\partial z^*} \triangleq \frac{1}{2} \left( \frac{\partial f}{\partial x} + i \frac{\partial f}{\partial y} \right) \tag{4}
\]
Everything extends to multivariate functions $f(\mathbf{z})$ of a complex vector $\mathbf{z} = [z_1, \ldots, z_Q]^T$ by defining the cogradient and conjugate cogradient operators:
\[
\nabla_{\mathbf{z}} f \triangleq \left[ \frac{\partial f}{\partial z_1}, \ldots, \frac{\partial f}{\partial z_Q} \right]^T \tag{5}
\]
\[
\nabla_{\mathbf{z}^*} f \triangleq \left[ \frac{\partial f}{\partial z_1^*}, \ldots, \frac{\partial f}{\partial z_Q^*} \right]^T \tag{6}
\]
Then, a necessary condition for a point $\mathbf{z}$ to be a minimum of $f$ is that either $\nabla_{\mathbf{z}} f(\mathbf{z}) = \mathbf{0}$ or, equivalently, $\nabla_{\mathbf{z}^*} f(\mathbf{z}) = \mathbf{0}$ [brandwood1983complex]. CR-calculus inherits most of the standard properties of real derivatives, including the chain rule and the differential rule, e.g., see
[kreutz2009complex]. For the important case in which the output of the function is real-valued (as is the case for the loss function when optimizing CVNNs), we have the additional property:
\[
\left( \nabla_{\mathbf{z}} f \right)^* = \nabla_{\mathbf{z}^*} f \tag{7}
\]
Combined with the Taylor expansion of the function, an immediate corollary of this property is that the direction of steepest ascent of $f$ at a point $\mathbf{z}$ is given by the conjugate cogradient operator evaluated at that point [kreutz2009complex]. Up to a multiplicative constant term, this direction coincides with the one obtained from the real derivatives with respect to the real and imaginary parts, allowing for a straightforward implementation in most optimization libraries.
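As a concrete illustration of CR-calculus, the following Python/NumPy sketch (all names and values are illustrative) checks the definition (4) by finite differences for the real-valued cost $f(z) = |z|^2$, whose conjugate cogradient is simply $\partial f / \partial z^* = z$, and then uses that derivative for a few steepest-descent steps as discussed above.

```python
import numpy as np

# Toy real-valued cost f(z) = |z|^2 = z z*, whose Wirtinger derivative is df/dz* = z.
def f(z):
    return np.abs(z) ** 2

def conjugate_cogradient(z):
    return z  # analytical df/dz* for f(z) = z z*

# Finite-difference check of (4): df/dz* = 0.5 * (df/dx + i * df/dy).
z0, eps = 0.3 + 0.8j, 1e-6
numeric = 0.5 * ((f(z0 + eps) - f(z0 - eps)) / (2 * eps)
                 + 1j * (f(z0 + 1j * eps) - f(z0 - 1j * eps)) / (2 * eps))
print(numeric, conjugate_cogradient(z0))   # both approximately 0.3 + 0.8j

# Steepest descent along the negative conjugate cogradient.
z, eta = 1.5 - 0.7j, 0.1
for _ in range(50):
    z = z - eta * conjugate_cogradient(z)
print(f(z))   # approaches 0, the minimum of |z|^2
```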
II-B. Complex-valued neural networks
We now turn our attention to the approximation of multivariate complex-valued functions. A generic CVNN is composed of $L$ layers obtained by alternating linear and nonlinear operations. In particular, the $l$-th layer is described by the following equation:
\[
\mathbf{h}_l = g\left( \mathbf{W}_l \mathbf{h}_{l-1} + \mathbf{b}_l \right) \tag{8}
\]
where $\mathbf{h}_{l-1}$ is the input to the layer, $\mathbf{W}_l$ and $\mathbf{b}_l$ are an adaptable weight matrix and bias vector, and $g(\cdot)$ is a (complex-valued) activation function applied element-wise, which will be discussed more in depth later on. By definition, $\mathbf{h}_0 = \mathbf{x}$ denotes the input to the network, while $\hat{y} = \mathbf{h}_L$ denotes the final output, which we assume one-dimensional for simplicity. Some results on the approximation properties of this model are given in [kim2003approximation], while [trabelsi2017deep] describes some techniques to initialize the adaptable linear weights in the complex domain.
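As a minimal illustration of (8), the following sketch (illustrative names and sizes) implements the forward pass of a single complex-valued layer; the activation is left as a generic argument, since the possible choices for $g(\cdot)$ are discussed in Section III.

```python
import numpy as np

# One CVNN layer as in (8): h_l = g(W_l h_{l-1} + b_l), with complex-valued weights.
def cvnn_layer(h_prev, W, b, g=lambda s: s):
    s = W @ h_prev + b   # complex linear part
    return g(s)          # element-wise (complex-valued) activation

rng = np.random.default_rng(0)
n_in, n_out = 4, 3
h0 = rng.standard_normal(n_in) + 1j * rng.standard_normal(n_in)
W1 = (rng.standard_normal((n_out, n_in)) + 1j * rng.standard_normal((n_out, n_in))) / np.sqrt(n_in)
b1 = np.zeros(n_out, dtype=complex)
print(cvnn_layer(h0, W1, b1))
```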
Given $N$ input/output pairs $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, we train the CVNN by minimizing a cost function given by:
\[
J(\mathbf{w}) = \sum_{n=1}^{N} l\left( y_n, \hat{y}_n \right) \tag{9}
\]
where $\mathbf{w}$ collects all the adaptable weights of the network and $l(\cdot, \cdot)$ is a loss function, e.g., the squared loss:
\[
l\left( y_n, \hat{y}_n \right) = \left| y_n - \hat{y}_n \right|^2 = \left( y_n - \hat{y}_n \right)\left( y_n - \hat{y}_n \right)^* \tag{10}
\]
Following the results described in the previous section, a basic steepest descent approach to optimize (9) is given by the following update equation at the $k$-th iteration:
\[
\mathbf{w}_{k+1} = \mathbf{w}_k - \eta \, \nabla_{\mathbf{w}^*} J\left( \mathbf{w}_k \right) \tag{11}
\]
where $\eta$ is the learning rate. More generally, we can use noisy versions of the gradient obtained by sampling a mini-batch of elements, or accelerate the optimization process by adapting most of the state-of-the-art techniques used for real-valued neural networks [bottou2016optimization]. We can also apply some techniques that are specific to the complex domain. For example, [xu2015convergence], inspired by the theory of widely linear adaptive filters, augments the input to the CVNN with its complex conjugate $\mathbf{x}^*$. Additional improvements can be obtained by replacing the real-valued learning rate $\eta$ with a complex-valued one [zhang2016complex], which can speed up convergence in some scenarios.
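To show how (10) and (11) fit together, the sketch below (synthetic data and sizes are illustrative) trains a purely linear complex model $\hat{y} = \mathbf{w}^T \mathbf{x}$ with mini-batch steepest descent; for this model the conjugate cogradient of the squared loss is $-(y - \mathbf{w}^T \mathbf{x})\,\mathbf{x}^*$, so the update reduces to the classical complex LMS rule.

```python
import numpy as np

# Mini-batch steepest descent (11) on the squared loss (10) for a linear model y_hat = w^T x.
# The conjugate cogradient of |y - w^T x|^2 with respect to w is -(y - w^T x) x*.
rng = np.random.default_rng(1)
n, N, eta, batch = 3, 500, 0.05, 16
w_true = rng.standard_normal(n) + 1j * rng.standard_normal(n)
X = rng.standard_normal((N, n)) + 1j * rng.standard_normal((N, n))
y = X @ w_true

w = np.zeros(n, dtype=complex)
for _ in range(2000):
    idx = rng.choice(N, size=batch, replace=False)            # random mini-batch
    err = y[idx] - X[idx] @ w
    grad_conj = -(X[idx].conj() * err[:, None]).mean(axis=0)  # batch estimate of nabla_{w*} J
    w = w - eta * grad_conj                                   # update (11)
print(np.linalg.norm(w - w_true))   # approaches zero on this noiseless problem
```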
III. Complex-valued activation functions
As we stated in the introduction, choosing a proper activation function in (8) is more challenging than in the real case because of Liouville's theorem, stating that the only complex-valued functions that are bounded and analytic everywhere are constants. In practice, one therefore needs to choose between boundedness and analyticity. Before the introduction of the ReLU activation [glorot2011deep], almost all activation functions in the real case were bounded. Consequently, initial approaches to the design of CVNNs preferred non-analytic functions in order to preserve boundedness, most commonly by applying real-valued activation functions separately to the real and imaginary parts [nitta1997extension]:
\[
g(z) = g_R\left( \Re\{z\} \right) + i \, g_R\left( \Im\{z\} \right) \tag{12}
\]
where $z$ is a generic input to the activation function in (8), and $g_R(\cdot)$ is some real-valued activation function, e.g., a sigmoid. This is called a split activation function. As a representative example, the magnitude and phase of a split activation when varying its input are shown in Fig. 1. Early proponents of this approach can be found in [benvenuto1992complex] and [leung1991complex].
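A split activation of the form (12) is straightforward to implement; the sketch below (illustrative inputs) shows a split-tanh and a split-ReLU applied element-wise to a complex vector.

```python
import numpy as np

# Split construction (12): a real-valued nonlinearity g_R is applied independently
# to the real and imaginary parts of the activation.
def split_activation(z, g_real=np.tanh):
    return g_real(z.real) + 1j * g_real(z.imag)

def split_relu(z):
    return split_activation(z, g_real=lambda s: np.maximum(s, 0.0))

z = np.array([1.0 + 2.0j, -0.5 - 0.3j, 0.2 + 0.0j])
print(split_activation(z))   # split-tanh
print(split_relu(z))         # split-ReLU
```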
Another common class of non-analytic activation functions is given by the phase-amplitude (PA) functions popularized by [georgiou1992complex, hirose1992continuous]:
\[
g(z) = \frac{z}{c + \frac{1}{r}\left| z \right|} \tag{13}
\]
\[
g(z) = \tanh\!\left( \frac{|z|}{m} \right) \exp\left( i \angle z \right) \tag{14}
\]
where $\angle z$ is the phase of $z$, while $c$, $r$, and $m$ are positive constants which in most cases are set equal to one. PA functions can be seen as the natural generalization of real-valued squashing functions such as the sigmoid, because the output has bounded magnitude but preserves the phase of $z$.
A third alternative is to use fully-complex activation functions that are analytic and bounded almost everywhere, at the cost of introducing a set of singular points. Among all possible transcendental functions, it is common to consider the complex-valued extension of the hyperbolic tangent, defined as [kim2003approximation]:
\[
g(z) = \tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)} \tag{15}
\]
which possesses periodic singular points at the imaginary values $z = i\pi\left( k + \tfrac{1}{2} \right)$, with $k \in \mathbb{Z}$. However, careful scaling of the inputs and of the initial weights makes it possible to avoid these singularities during training.
Finally, several authors have proposed extensions of the real-valued ReLU function $g_R(s) = \max(0, s)$, motivated by the fact that its success in the deep learning literature does not immediately translate to the complex-valued case, where using it in a split function as in (12) results in poor performance [trabelsi2017deep]. [guberman2016complex] propose a complex-valued ReLU that passes the activation only when both its components are non-negative:
\[
g(z) = \begin{cases} z & \text{if } \Re\{z\} \ge 0 \;\wedge\; \Im\{z\} \ge 0 \\ 0 & \text{otherwise} \end{cases} \tag{16}
\]
Alternatively, inspired by the PA functions in order to maintain the phase of the activation value, [arjovsky2016unitary] propose the following modReLU function:
\[
g(z) = \operatorname{ReLU}\left( |z| + b \right) \exp\left( i \angle z \right) \tag{17}
\]
where $b$ is an adaptable parameter defining a radius around the origin within which the output of the function is $0$. Another extension, the complex cardioid, is advanced in [virtue2017better]:
\[
g(z) = \frac{1}{2} \left( 1 + \cos\left( \angle z \right) \right) z \tag{18}
\]
which maintains the phase information while attenuating the magnitude based on the phase itself. For real-valued inputs, (18) reduces to the ReLU.
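The sketch below implements the three complex ReLU variants (16)-(18) element-wise on a complex array; the default offset used for $b$ in modReLU is an illustrative value.

```python
import numpy as np

def z_relu(z):
    # (16): pass z only when both its real and imaginary parts are non-negative.
    return np.where((z.real >= 0) & (z.imag >= 0), z, 0.0 + 0.0j)

def mod_relu(z, b=-0.5):
    # (17): zero inside the disc |z| <= -b, otherwise shrink the magnitude and keep the phase.
    mag = np.abs(z)
    return np.maximum(mag + b, 0.0) / np.maximum(mag, 1e-12) * z

def cardioid(z):
    # (18): attenuate the magnitude based on the phase; equals the ReLU on the real axis.
    return 0.5 * (1.0 + np.cos(np.angle(z))) * z

z = np.array([2.0 + 1.0j, -1.0 + 0.5j, 0.1 - 0.1j])
for g in (z_relu, mod_relu, cardioid):
    print(g.__name__, g(z))
```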
Note that in all cases these proposed activation functions are fixed or endowed with a very small degree of flexibility (as in (17)). In the following sections we describe a principled technique to design nonparametric activation functions for use in CVNNs.
IV. Split kernel activation functions
Our first proposal is a split function as in (12), where we use nonparametric (real-valued) functions for $g_R(\cdot)$ in place of fixed ones. Specifically, we consider the kernel activation function (KAF) proposed in [scardapane2017kafnets], which will also serve as the basis for the fully complex-valued proposal of the following section. Here, we introduce the basic elements of the KAF, and we refer to the original paper [scardapane2017kafnets] for a fuller exposition.
The basic idea of a KAF is to model each activation function as a one-dimensional kernel model, where the kernel elements are chosen in a proper way to obtain an efficient backpropagation step. Consider the generic activation function $g_R(s)$, where $s$ denotes either the real or the imaginary part of the activation as in (12). To obtain a flexible shape, we could model $g_R$ as a linear predictor over a high-dimensional feature transformation of the activation. However, this process becomes infeasible for a large number of feature transformations, and cannot handle infinite-dimensional feature spaces. For feature maps associated to a reproducing kernel Hilbert space with kernel $\kappa(\cdot, \cdot)$,^1 the model can instead be written as a kernel expansion over a dictionary of $D$ elements:
\[
g_R(s) = \sum_{i=1}^{D} \alpha_i \, \kappa\left( s, d_i \right) \tag{20}
\]
where $\{\alpha_i\}_{i=1}^{D}$ are the mixing coefficients and $\{d_i\}_{i=1}^{D}$ make up the so-called dictionary of the kernel expansion [hofmann2008kernel, liu2011kernel]. In the context of a neural network, the dictionary elements cannot be selected a priori, because they would change at every step of the optimization algorithm depending on the distribution of the activation values. Instead, we exploit the fact that we are working with one-dimensional kernels to fix the elements beforehand, and we only adapt the mixing coefficients in the optimization step. In particular, we select the elements $d_1, \ldots, d_D$ by sampling $D$ values over the real axis, uniformly around zero. In this way, the value $D$ becomes a hyperparameter controlling the flexibility of the approach: for larger $D$ we obtain a more flexible method, at the cost of a larger number of adaptable parameters. In general, since the function is only a small component of a much larger neural network, we have found relatively small values of $D$ to be sufficient for most applications. As the number of parameters per neuron can potentially grow without bound depending on the choice of $D$, we refer to such activation functions as nonparametric.

^1 Recall that a function $\kappa(\cdot, \cdot)$ is a valid kernel function if it respects the positive semi-definiteness property, i.e., for any possible choice of $\{\alpha_i\}$ and $\{d_i\}$ in (20) we have that $\sum_{i=1}^{D} \sum_{j=1}^{D} \alpha_i \alpha_j \kappa(d_i, d_j) \ge 0$.
We use the same dictionary across the entire neural network, but two different sets of mixing coefficients for the real and imaginary parts of each neuron. Due to this, an efficient implementation of the proposed split-KAF is straightforward. In particular, consider the vector $\mathbf{s}$ containing the (complex) activations of a layer following the linear operations in (8). We build the matrix $\mathbf{K}_r$ by computing all the kernel values between the real parts of the activations and the elements of the dictionary (and similarly for $\mathbf{K}_i$ using the imaginary parts), and we compute the final output of the layer as:
\[
\mathbf{h} = \left( \mathbf{K}_r \circ \mathbf{A}_r \right) \mathbf{1} + i \left( \mathbf{K}_i \circ \mathbf{A}_i \right) \mathbf{1} \tag{21}
\]
where $\circ$ represents the element-wise (Hadamard) product, $\mathbf{A}_r$ and $\mathbf{A}_i$ are matrices collecting row-wise all the mixing coefficients for the real and imaginary components of the layer, and $\mathbf{1}$ is a vector of ones. If we need to handle batches of elements (or convolutional layers), we only need to slightly modify (21) by adding additional trailing dimensions.
For all our experiments, we consider the 1D Gaussian kernel defined as:
\[
\kappa\left( s, d_i \right) = \exp\left( -\gamma \left( s - d_i \right)^2 \right) \tag{22}
\]
where $\gamma$ is called the kernel bandwidth. In the proposed KAF scheme, the values of the dictionary are chosen according to a grid, and as such the optimal bandwidth parameter depends uniquely on the grid resolution. In particular, we use the following rule-of-thumb proposed in [scardapane2017kafnets]:
\[
\gamma = \frac{1}{6 \Delta^2} \tag{23}
\]
where $\Delta$ is the distance between the grid points. In order to provide an additional degree of freedom to our method, we also optimize a single $\gamma$ per layer via backpropagation, after initializing it following (23).
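The layer-wise split-KAF computation (21), together with the Gaussian kernel (22) and the bandwidth rule (23), can be sketched as follows; the dictionary range and all sizes are illustrative, and in a real network the mixing coefficients would be adapted by backpropagation.

```python
import numpy as np

D = 20
d = np.linspace(-3.0, 3.0, D)              # shared dictionary (range is illustrative)
gamma = 1.0 / (6.0 * (d[1] - d[0]) ** 2)   # rule-of-thumb (23)

def gauss_kernel(s, d):
    # s: (n,) real values, d: (D,) dictionary -> (n, D) matrix of kernel values (22)
    return np.exp(-gamma * (s[:, None] - d[None, :]) ** 2)

def split_kaf(s, A_r, A_i):
    # s: (n,) complex activations; A_r, A_i: (n, D) mixing coefficients -> output (21)
    K_r, K_i = gauss_kernel(s.real, d), gauss_kernel(s.imag, d)
    return np.sum(K_r * A_r, axis=1) + 1j * np.sum(K_i * A_i, axis=1)

rng = np.random.default_rng(2)
n = 5
s = rng.standard_normal(n) + 1j * rng.standard_normal(n)
A_r, A_i = 0.1 * rng.standard_normal((n, D)), 0.1 * rng.standard_normal((n, D))
print(split_kaf(s, A_r, A_i))
```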
V. Fully-complex kernel activation functions
While most of the literature on kernel methods in machine learning has focused on the real-valued case, it is well-known that the original mathematical treatment originated in the complex-valued domain [aronszajn1950theory]. In the kernel filtering literature, techniques that build complex-valued algorithms by separating the real and the imaginary components (as in the previous section) are called complexification methods [bouboulis2011extension]. Recently, however, several authors have advocated the direct use of (pure) complex-valued kernels, leveraging the complex-valued treatment of RKHSs in a variety of fields, as surveyed in the introduction.
From a theoretical standpoint, defining complex RKHSs and kernels is relatively straightforward. As an example, a one-dimensional complex function $\kappa(\cdot, \cdot)$ is positive semi-definite (PSD) if and only if:
\[
\sum_{i=1}^{D} \sum_{j=1}^{D} \alpha_i \, \alpha_j^* \, \kappa\left( z_i, z_j \right) \ge 0 \tag{24}
\]
where all values are now defined in the complex domain. Any PSD function is then a valid kernel function. Based on this, in this paper we also propose a fully-complex, nonparametric KAF by defining (20) directly in the complex domain, without the need for split functions:
\[
g(z) = \sum_{i=1}^{D^2} \alpha_i \, \kappa_{\mathbb{C}}\left( z, d_i \right) \tag{25}
\]
where the mixing coefficients $\alpha_i$ are now defined as complex numbers. Note that, in order for the dictionary to provide a dense sampling of the space of complex numbers, we now consider fixed elements $d_i \in \mathbb{C}$ arranged over a regular grid, an example of which is depicted in Fig. 2. Due to this, we now have $D^2$ adaptable mixing coefficients per neuron (one for each point of a $D \times D$ grid), as opposed to $2D$ in the split case. We counterbalance this by selecting a drastically smaller $D$ (see the experimental section).
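The fully-complex KAF (25) can be sketched as follows, using a regular grid of dictionary points and the complex Gaussian kernel introduced next in (26); the grid range and all sizes are illustrative.

```python
import numpy as np

D = 8
axis = np.linspace(-2.0, 2.0, D)                  # illustrative range
d = (axis[:, None] + 1j * axis[None, :]).ravel()  # D^2 dictionary points on a regular grid
gamma = 1.0 / (6.0 * (axis[1] - axis[0]) ** 2)

def complex_gauss_kernel(z, d):
    # Complex Gaussian kernel (26): kappa_C(z, d) = exp(-gamma * (z - d*)^2), for all pairs.
    return np.exp(-gamma * (z[:, None] - np.conj(d)[None, :]) ** 2)

def complex_kaf(z, alpha):
    # Fully-complex KAF (25): z: (n,) activations, alpha: (n, D^2) complex mixing coefficients.
    return np.sum(complex_gauss_kernel(z, d) * alpha, axis=1)

rng = np.random.default_rng(3)
n = 4
z = rng.standard_normal(n) + 1j * rng.standard_normal(n)
alpha = 0.05 * (rng.standard_normal((n, D * D)) + 1j * rng.standard_normal((n, D * D)))
print(complex_kaf(z, alpha))
```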
An immediate complex-valued extension of the Gaussian kernel in (22) is given by:
\[
\kappa_{\mathbb{C}}\left( z, d_i \right) = \exp\left( -\gamma \left( z - d_i^* \right)^2 \right) \tag{26}
\]
where in our experiments the bandwidth hyperparameter $\gamma$ is selected using the same rule-of-thumb as before and then adapted layer-wise. A complete analysis of the feature space associated to (26) is given in [steinwart2006explicit]. In order to gain some informal understanding, we can write the kernel explicitly in terms of the real and imaginary components of its arguments, $z = x + iy$ and $d_i = u + iv$:
\[
\kappa_{\mathbb{C}}\left( z, d_i \right) = \exp\left( -\gamma \left[ (x - u)^2 - (y + v)^2 \right] \right) \left[ \cos\left( 2\gamma (x - u)(y + v) \right) - i \sin\left( 2\gamma (x - u)(y + v) \right) \right] \tag{27}
\]
By analyzing the previous expression, we see that the complex-valued Gaussian kernel has several properties which are counter-intuitive if one is used to working with its real-valued restriction. First of all, (26) cannot be interpreted as a standard similarity measure, because it depends on its arguments only via the terms $(x - u)$ and $(y + v)$. For the same reasons, the kernel is not stationary, and it has an additional oscillatory behavior. We refer to Fig. 3 (or to [boloix2017widely, Section IV-A]) for an illustration of the kernel when fixing the second argument.
For these reasons, another extension of the Gaussian kernel to the complex domain is given in [bouboulis2011extension], where the authors propose to build a whole family of complex-valued kernels starting from any real-valued one as follows:
(28)
The new complex-valued kernel is called an independent kernel. By plugging the real-valued Gaussian kernel (22) into the previous expression, we obtain a complex-valued expression that can still be interpreted as a similarity measure between the two points.
Note that several alternative kernels are also possible, many of which are specific to the complex-valued case, a prominent example being the Szegő kernel [bouboulis2011extension]:
\[
\kappa\left( z, d_i \right) = \frac{1}{1 - z \, d_i^*} \tag{29}
\]
VI. Experimental evaluation
In this section, we experimentally evaluate the proposed activation functions on several benchmark problems, including channel identification in Section VI-A, wind prediction in Section VI-B, and multiclass classification in the complex domain in Section VI-C. In all cases, we linearly preprocess the real and the imaginary components of the input features to lie in a normalized range. We regularize all parameters with respect to their squared absolute value (which is equivalent to standard $\ell_2$ regularization applied on the real and imaginary components separately), but we exclude the bias terms and the parameter $b$ in (17). We select the strength of the regularization term and the size of the networks based on previous literature or on a cross-validation procedure, as described below. For optimization, we use a simple complex-valued extension of the Adagrad algorithm, which computes a per-parameter learning rate weighted by the squared magnitude of the gradients themselves. At each iteration, we construct a mini-batch by randomly sampling elements from the entire training dataset. All algorithms have been implemented in Python using the Autograd library [maclaurin2015autograd].
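Our reading of the complex-valued Adagrad extension mentioned above can be sketched as follows (learning rate, epsilon, and the toy least-squares problem are illustrative): the accumulator stores the squared magnitude of each entry of the conjugate cogradient and scales a per-parameter step size.

```python
import numpy as np

class ComplexAdagrad:
    # Per-parameter learning rates scaled by the accumulated squared gradient magnitudes.
    def __init__(self, eta=0.1, eps=1e-8):
        self.eta, self.eps, self.acc = eta, eps, None

    def step(self, w, grad_conj):
        if self.acc is None:
            self.acc = np.zeros(w.shape)
        self.acc += np.abs(grad_conj) ** 2
        return w - self.eta * grad_conj / (np.sqrt(self.acc) + self.eps)

# Usage on a toy complex least-squares problem.
rng = np.random.default_rng(4)
w_true = np.array([1.0 - 0.5j, 0.3 + 2.0j])
X = rng.standard_normal((200, 2)) + 1j * rng.standard_normal((200, 2))
y = X @ w_true
w, opt = np.zeros(2, dtype=complex), ComplexAdagrad(eta=0.5)
for _ in range(3000):
    err = y - X @ w
    w = opt.step(w, -(X.conj() * err[:, None]).mean(axis=0))
print(np.linalg.norm(w - w_true))   # decreases as training proceeds
```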
VI-A. Experiment 1: Channel identification
Our first experiment is a standard benchmark in the complex-valued literature, i.e., a channel identification task [bouboulis2015complex]. The input to the channel is generated as:
(30)
where the two generating terms are Gaussian random variables and the parameter $\rho$ determines the circularity of the signal.^2 For a particular value of $\rho$ the input is circular, while as $\rho$ approaches either extreme the signal is highly non-circular.

^2 A complex random variable $z$ is circular if $z$ and $z e^{i\theta}$ have the same probability distribution for any angle $\theta$. Roughly speaking, non-circular signals are harder to predict, requiring the use of widely linear techniques when using standard linear filters [bouboulis2011extension].

The output of the channel is computed by first applying a linear filtering operation:
(31)
where the filter coefficients are given by:
(32)
Then, the output of the linear filter goes through a memoryless nonlinearity:
(33)
and finally it is corrupted by additive white Gaussian noise in order to obtain the final signal. The variance of the noise is selected to obtain a prescribed signal-to-noise ratio (SNR). The input to the neural network is an embedding of past channel inputs:
(34)
and the network is trained to output the corresponding channel output. We generate a set of samples of the channel, randomly keeping a portion for testing, and we average the results over different generations of the dataset. We compare the following algorithms:

- LIN: a standard linear filter [schreier2010statistical] with complex-valued coefficients.
- 2R-NN: a real-valued neural network taking as input the real and imaginary parts separately. For the activation functions in the hidden layers, we consider either a standard tanh or ReLUs.
- ModReLU-NN: a CVNN with adaptable ModReLU activation functions as in (17). In this case, the coefficients $b$ of the neurons are all initialized to a common value and later adapted.
- Proposed KAF-NN: a CVNN with the split-KAF proposed in Section IV. We empirically select the number of elements $D$ in the dictionary, sampled uniformly around zero.
All algorithms are trained by minimizing the mean squared error in (10) on random mini-batches. Following [xu2015convergence], in this scenario we consider one hidden layer (as more layers are not found to provide significant improvements in performance). The size of the regularization factor is selected empirically. Results in terms of mean squared error (MSE), expressed in dB, are given in Table 4, considering either a circular input signal or the more challenging non-circular scenario.
As expected, results are generally worse for the non-circular case, especially for techniques that are not able to exploit the geometry of non-circular complex signals, such as non-widely-linear models and real-valued neural networks. However, the proposed KAF-NN and C-KAF-NN are able to consistently outperform all other methods in both scenarios in a stable fashion. Note that this difference in performance cannot be overcome by increasing the size of the other networks, thus pointing to the importance of adapting the activation functions also in the complex case. Interestingly, the complex Gaussian kernel in (26) results in poor performance, which is solved by using the independent one.
VI-B. Experiment 2: Wind prediction
For the second experiment, we consider a real-world dataset for a wind prediction task [goh2006complex]. The dataset consists of hourly samples of wind intensity collected along two different axes (north axis and east axis). The dataset is provided for three wind regimes, namely 'low', 'medium', and 'high', from which we select the highest, being the most challenging one. In order to construct a complex-valued signal, the two samples for each hour are considered as the real and the imaginary components of a single complex number (for more motivation on the use of complex-valued information when dealing with wind forecasting, see [goh2004complex, goh2005nonlinear, 8109745, goh2006complex, kuh2009applications]). A snapshot of the absolute value and phase of the resulting signal is shown in Fig. 5 for the initial portion of the data. We consider the task of predicting both components of the wind for an hour-ahead horizon, starting from an embedding of the most recent hours of measurements. We select neural networks with a small number of hidden layers (as more hidden layers are not found to provide gains in performance), and we optimize both the number of neurons and the regularization factor on a held-out validation set. We test on the last portion of the time series, in terms of the coefficient of determination:
\[
R^2 = 1 - \frac{\sum_{n} \left( y_n - \hat{y}_n \right)^2}{\sum_{n} \left( y_n - \bar{y} \right)^2} \tag{35}
\]
where $y_n$ is the true value, $\hat{y}_n$ is the predicted value, and $\bar{y}$ is the mean of the true values computed from the test set. Positive values of $R^2$ denote a prediction which is better than chance, with values approaching $1$ for an almost perfect prediction.
TABLE I
Results (mean and standard deviation of the coefficient of determination $R^2$) in the wind prediction task. The best result is highlighted in bold and the second-best is underlined.

  Linear:           Linear
  Real-valued NNs:  2R-NN (tanh), 2R-NN (ReLU)
  CVNNs:            C-NN (split-tanh), C-NN (split-ReLU), C-NN (AMP), C-NN (CReLU), ModReLU-NN
  Proposed CVNNs:   KAF-NN, C-KAF-NN, C-KAF-NN (Ind.)

Results for the experiment are reported in Table I. We can see that, also in this scenario, the two best results are obtained by the proposed split-KAF and complex KAF neurons, which significantly outperform the other models.
VI-C. Experiment 3: Complex-valued multiclass classification
We conclude our experimental evaluation by testing the proposed algorithms on a multiclass classification problem expressed in the complex domain. Following [bouboulis2015complex], we build the task by applying a two-dimensional fast Fourier transform (FFT) to the images of the well-known MNIST dataset,^3 comprising black-and-white images of handwritten digits split into ten classes.

^3 http://yann.lecun.com/exdb/mnist/

We then rank the coefficients of the FFT in terms of significance (by considering their mean absolute value), and keep only the most significant coefficients as input to the models. We compare a real-valued NN taking the real and the imaginary components of the coefficients as separate inputs, a CVNN with modReLU activation functions, and a CVNN employing the proposed split-KAF. All networks have a softmax activation function in their output layer. For the CVNNs, we use the following variation of the softmax to handle the complex-valued activations of the output layer:
(36)
All networks are then trained by minimizing the classical regularized cross-entropy formulation with the same optimizer as in the previous sections. We consider networks with three hidden layers, whose regularization term is optimized separately via cross-validation. Results on the MNIST test set are provided in Table II.
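The complex-valued feature construction used in this task (a 2D FFT followed by ranking the coefficients by their mean absolute value) can be sketched as follows; the number of retained coefficients and the random stand-in images are illustrative placeholders for the actual MNIST digits.

```python
import numpy as np

def fft_features(images, n_keep=100, keep_idx=None):
    # images: (N, H, W) real array; returns (N, n_keep) complex features.
    spectra = np.fft.fft2(images)                        # 2D FFT of each image
    flat = spectra.reshape(spectra.shape[0], -1)
    if keep_idx is None:                                 # rank coefficients on this (training) set
        keep_idx = np.argsort(np.abs(flat).mean(axis=0))[::-1][:n_keep]
    return flat[:, keep_idx], keep_idx

rng = np.random.default_rng(5)
train_imgs = rng.random((32, 28, 28))                    # stand-in for MNIST digits
X_train, idx = fft_features(train_imgs)                  # complex inputs + selected indices
X_test, _ = fft_features(rng.random((8, 28, 28)), keep_idx=idx)   # reuse the training ranking
print(X_train.shape, X_train.dtype)
```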
We see that working in the complex domain results in significantly better performance when compared to working in the real domain. In addition, the proposed split-KAF consistently obtains better accuracy on this task than the ModReLU version. We show a representative evolution of the loss function in Fig. 6, where we highlight the first 10000 iterations for readability.
VII. Concluding remarks
In this paper, we considered the problem of adapting the activation functions in a complex-valued neural network (CVNN). To this end, we proposed two different nonparametric models that extend the recently introduced kernel activation function (KAF) to the complex-valued case. The first model is a split configuration, where the real and the imaginary components of the activation are processed independently by two separate KAFs. In the second model, we redefine the KAF directly in the complex domain with the use of fully-complex kernels. We showed that CVNNs with adaptable functions can outperform neural networks with fixed functions on different benchmark problems, including channel identification, wind prediction, and multiclass classification. For the fully-complex KAF, the independent kernel generally outperforms a naive complex Gaussian kernel without introducing significantly more complexity.
TABLE II
Test accuracy [%] on the complex-valued MNIST classification task.

  Real-valued NN
  CVNN (ModReLU)
  CVNN (Proposed split-KAF)
Several improvements over this framework are possible, most notably by leveraging recent advances in the field of real-valued kernels (e.g., [mansouri2017multiscale]) and complex-valued kernel regression and classification. One example is the use of pseudo-kernels [boloix2017widely] to handle more efficiently the non-circularity of the signals propagated through the network. More generally, it would be interesting to extend other classes of nonparametric, real-valued activation functions (such as Maxout networks [goodfellow2013maxout] or adaptive piecewise linear units [agostinelli2014learning]) to the complex domain, or to adapt the proposed complex KAFs to other types of NNs, such as convolutional architectures [lecun2015deep, ren2017clustering].
Acknowledgments
The work of Simone Scardapane was supported in part by Italian MIUR, “Progetti di Ricerca di Rilevante Interesse Nazionale”, GAUChO project, under Grant 2015YPXH4W_004. The work of Steven Van Vaerenbergh was supported by the Ministerio de Economía, Industria y Competitividad (MINECO) of Spain under grant TEC201457402JIN (PRISMA). Amir Hussain was supported by the UK Engineering and Physical Science Research Council (EPSRC) grant no. EP/M026981/1.