Signal Transformer: Complex-valued Attention and Meta-Learning for Signal Recognition

06/05/2021 ∙ by Yihong Dong, et al. ∙ IBM ∙ Carnegie Mellon University

Deep neural networks have been shown to be a useful class of tools for addressing signal recognition problems in recent years, especially for identifying the nonlinear feature structures of signals. However, the power of most deep learning techniques relies heavily on an abundant amount of training data, so the performance of classic neural nets decreases sharply when the number of training samples is small or unseen data are presented in the testing phase. This calls for an advanced strategy, i.e., model-agnostic meta-learning (MAML), which is able to capture the invariant representation of the data samples or signals. In this paper, inspired by the special structure of signals, i.e., the real and imaginary parts contained in practical time-series signals, we propose a Complex-valued Attentional MEta Learner (CAMEL) for the problem of few-shot signal recognition by leveraging attention and meta-learning in the complex domain. To the best of our knowledge, this is the first complex-valued MAML that can find first-order stationary points of general nonconvex problems with theoretical convergence guarantees. Extensive experimental results showcase the superiority of the proposed CAMEL compared with state-of-the-art methods.


1 Introduction

With the recent explosion of deep learning, signal recognition has made some remarkable advances O’Shea et al. (2016); Jas et al. (2017); Song et al. (2020); Dong et al. (2021a). To achieve these advances, a large volume of data is required to obtain satisfactory performance. However, deep learning models trained with traditional supervised learning methods often perform poorly or even fail when only a small amount of data is available or when they need to adapt to unseen or time-varying tasks. In practical signal recognition tasks, the collection and annotation of abundant data are notoriously expensive, especially for some rare but important signals. Another critical challenge is the presence of noise, because the signal data vary across different signal-to-noise ratios (SNRs), and in real-world scenarios, deep neural networks (DNNs) have to adapt to real-time variations in SNRs.

Meta-learning techniques Finn et al. (2017, 2018); Yoon et al. (2018); Zhang et al. (2018); Balaji et al. (2018) seek to resolve the above challenges by learning how to learn, as humans do. We know that humans can effectively utilize prior knowledge and experience to learn new skills rapidly from very few examples. Similarly, the meta-learner is trained on a distribution of homogeneous tasks, with the goal of learning internal features that are broadly applicable to all tasks rather than to a single individual task. Equipped with these sensitive internal features, the meta-learner is able to produce significant improvements in adaptation ability via fine-tuning. Recently, meta-learning has demonstrated promising performance in many fields Liu et al. (2019); Khodadadeh et al. (2019); Xie et al. (2019); Alet et al. (2019); Jerfel et al. (2019); Khodak et al. (2019); Zhou et al. (2019); Rajeswaran et al. (2019); Zhuang et al. (2020); Kong et al. (2020); Denevi et al. (2020); Chen et al. (2020); Yao et al. (2020); Baik et al. (2020); Sitzmann et al. (2020); Harrison et al. (2020); Ji et al. (2020); Boutilier et al. (2020); Confavreux et al. (2020); Goldblum et al. (2020). Please see the supplementary material for detailed related work in Section G. However, in some particular fields, especially signal recognition, existing meta-learning methods generally neglect the prior knowledge of the signals, i.e., temporal information and complex-domain information. For models with insufficient training data, it is crucial to incorporate this prior knowledge.

As such, we take into account the attention mechanism Vaswani et al. (2017) and the complex-valued neural network Trabelsi et al. (2018a); Hirose (2012); Tu et al. (2020) for signal recognition, respectively. Attention mechanisms have been widely adopted in many time-series learning tasks, such as natural language processing. Attention was an integral component of recurrent neural networks (RNNs), long short-term memory Hochreiter and Schmidhuber (1997), and gated recurrent Chung et al. (2014) networks, until the Transformer Vaswani et al. (2017) was proposed. Since then, self-attention has been able to replace RNNs with better performance and parallel computation. Therefore, we adapt the attention mechanism to the signal recognition task. Since signals contain both magnitude and phase, complex numbers are used for their representation. Consequently, complex arithmetic operations are an essential part of signal processing. Intuitively, complex-valued neural networks should be built to address the signal recognition problem. However, to the best of our knowledge, a meta-learning method equipped with attention mechanisms in complex-valued neural networks has not been investigated.

In this paper, we propose a Complex-valued Attentional MEta Learner (CAMEL) for few-shot signal recognition, which generalizes meta-learning and attention to the complex domain. With the help of these novel designs, CAMEL succeeds in capturing more information from the signal data. The prior knowledge assists CAMEL in preventing overfitting and improving its performance. For a better understanding of the proposed architecture, an overview of CAMEL is illustrated in Figure 1. Notice that CAMEL can be applied to any kind of complex-valued data. In extensive experiments against existing meta-learning and few-shot learning methods, the proposed method shows consistently better performance than the state of the art. The effectiveness of each novel component in CAMEL is verified via ablation studies. From the convergence analysis of complex-valued MAML, it is shown that CAMEL is able to find an ε-approximate first-order stationary point for any positive ε after at most O(1/ε²) iterations with second-order information.

Figure 1: The overview of CAMEL.

The code of this paper will be released upon acceptance. Please see the supplementary material for the notation, detailed derivations of the lemmas, and more experimental results.

2 Motivation

Meta-learning is one of the best-suited techniques for solving signal recognition problems because, in the real world, signal annotation is expensive and models need to adapt to changing SNRs, whereas meta-learning has the explicit goal of fast adaptation. To further improve the effectiveness of meta-learning in signal processing applications, we incorporate prior knowledge of the signal data into the model via CAMEL, which generalizes meta-learning with attention to the complex domain so that we are able to extract complex-domain and temporal information from signal data.

However, many so-called complex-valued neural networks treat a complex number as two real numbers, i.e., the real and imaginary parts of the complex number, and design special network structures to recover complex operations using these real numbers. We refer to these special complex-valued neural networks as in-phase/quadrature complex-valued neural networks (IQCVNNs). Although IQCVNNs can deal with complex-valued problems, the networks are essentially still real-valued, since IQCVNNs work without defining complex derivatives and the complex chain rule in back-propagation. We refer to complex-valued neural networks that define complex derivatives and the complex chain rule as complex-derivative complex-valued neural networks (CDCVNNs). It turns out that, compared with IQCVNNs, CDCVNNs can perform complex operations with fewer parameters. To be more specific, we give the following lemmas to show the advantage of CDCVNNs over IQCVNNs with respect to time complexity.

Lemma 1

If a function f is complex analytic, the time complexity of computing the derivative of f in IQCVNNs is twice that of computing the complex derivative of f in CDCVNNs.

Lemma 2

The complex-valued convolutional layer and the complex-valued fully connected layer are complex analytic.

As we know, the convolutional and fully connected layers are the most computationally intensive parts of a neural network. Therefore, although they achieve a similar effect, IQCVNNs far exceed CDCVNNs in terms of the time complexity of back-propagation. In particular, meta-learning requires second-order information of the objective function to guarantee convergence Fallah et al. (2020), which motivates us to implement CDCVNNs. The complex chain rule is the key to implementing CDCVNNs. According to the complex chain rule, we are able to derive the outer-loop update process of CAMEL, which is different from that of MAML.

Complex-valued attention is also necessary for CAMEL to obtain temporal information from signal data. However, in complex-valued attention, we are required to compute the derivative of a mapping from the complex to the real domain, since calculating the similarity coefficients between pairs produces real numbers inside the activation function of the complex-valued neural nets. Given the following lemma, we know that such a function must be non-analytic, since a constant function is useless for identifying the features of the data.

Lemma 3

A function f: ℂ → ℝ is analytic if and only if f is a constant function.

To the best of our knowledge, attention in the complex domain has rarely been studied.¹ Therefore, we here study complex-valued attention and propose CAMEL as presented in the next section.

¹A closely related work is Yang et al. (2020), which proposed a complex transformer and developed attention and an encoder-decoder network operating on complex inputs. However, they utilized eight attention functions to represent complex-valued attention without considering the nonlinear components of attention, such as the softmax and activation functions.

3 Camel

Please see the supplementary material for the definitions of complex derivative, analytic function, and the Cauchy-Riemann equations in Section D.

3.1 Algorithm Design

CAMEL utilizes complex-valued neural networks and attention to provide prior knowledge, i.e., complex-domain and temporal information, to prevent overfitting during training. It resembles its namesake animal, the camel, which stores water and nutrients in its hump to ensure its survival in extreme conditions.

CAMEL updates parameters through back-propagation by the chain rule. However, the traditional chain rule does not apply, because CAMEL is non-analytic.

The chain rule for complex variables  The chain rule is different when the function is non-analytic. For a non-analytic composite function f(g(z)), where z is a complex vector, we can apply the following chain rule:

∂f/∂z = (∂f/∂g)(∂g/∂z) + (∂f/∂g*)(∂g*/∂z)   (1)

where g is a continuous function and g* denotes the conjugate vector of g. Note that if the function g is analytic, the second term equals zero and (1) reduces to the ordinary chain rule. In the case of matrix derivatives, the chain rule can be written as:

∂f/∂Z = (∂W/∂Z)ᵀ (∂f/∂W) + (∂W*/∂Z)ᵀ (∂f/∂W*)   (2)

where f is non-analytic, Z and W = g(Z) are two complex matrices, and (·)ᵀ denotes the transpose of a matrix.
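
For illustration, the short sketch below relies on PyTorch's complex autograd, which implements the same Wirtinger-style calculus as (1) and (2), to minimize a non-analytic real-valued loss of a complex parameter. The loss and learning rate are illustrative choices, not taken from the paper.

```python
import torch

# Hedged sketch: PyTorch's autograd for complex tensors follows Wirtinger
# calculus, i.e., the same complex chain rule used above, so gradient descent
# on a real-valued, non-analytic loss of a complex parameter just works.
target = torch.tensor([1.0 - 2.0j, 0.5 + 0.5j])
z = torch.zeros(2, dtype=torch.complex64, requires_grad=True)

optimizer = torch.optim.SGD([z], lr=0.1)
for _ in range(200):
    optimizer.zero_grad()
    diff = z - target
    loss = (diff * diff.conj()).real.sum()  # |z - target|^2: real-valued, non-analytic in z
    loss.backward()                          # gradients via conjugate Wirtinger derivatives
    optimizer.step()

print(z.detach())  # approaches `target`, confirming the complex update direction
```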

Under (1) and (2), CAMEL is able to update the parameters as expected. Formally, we define the base model of CAMEL to be a complex-valued attentional neural network with complex meta-parameters θ. The goal is to learn a sensitive initialization θ for which the network performs well on the query set of task T_i after a few gradient update steps on the support set of T_i, yielding the adapted parameters θ_i. Here, T_i is a task randomly sampled from the task probability distribution p(T). The update steps above are termed the inner-loop update process, which can be represented as:

θ_i = θ − α ∇_θ L^S_{T_i}(θ)   (3)

where α is a learning rate and ∇_θ L^S_{T_i}(θ) denotes the gradient of the loss on the support set of task T_i. The meta-parameters θ are trained by optimizing the performance of the adapted parameters θ_i. Consequently, the meta-objective is defined as follows:

min_θ F(θ) := E_{T_i ∼ p(T)} [ L^Q_{T_i}(θ_i) ]   (4)

where L^Q_{T_i}(θ_i) denotes the loss on the query set of task T_i after the inner-loop update process. As the underlying distribution p(T) is unknown, evaluating the expectation on the right-hand side of (4) is often computationally prohibitive. Therefore, we can minimize the function with a batch of B tasks drawn independently from p(T), which can be expressed as:

min_θ F̂(θ) := (1/B) Σ_{i=1}^{B} L^Q_{T_i}(θ_i)   (5)

The optimization of the meta-objective is referred to as the outer-loop update process, which can be expressed as:

θ ← θ − β ∇_θ F̂(θ)   (6)

where β denotes the meta learning rate. For each task T_i, define the per-task adapted objective

F_i(θ) := L^Q_{T_i}(θ − α ∇_θ L^S_{T_i}(θ)),   (7)

so that F̂(θ) = (1/B) Σ_{i=1}^{B} F_i(θ).
Lemma 4

For complex meta-parameters θ, we have

(8)

Then, according to (8), the outer-loop update process for the complex meta-parameters θ is

(9)

The complete algorithm description of CAMEL is outlined in Algorithm 1.

Require: The distribution over tasks p(T).
Require: The learning rates α and β.
Ensure: The meta-parameters θ of CAMEL.
  Randomly initialize the meta-parameters θ of CAMEL.
  repeat
     Sample a batch of tasks {T_i} ∼ p(T)
     for each T_i do
        Evaluate ∇_θ L^S_{T_i}(θ) via the complex chain rule (1) and (2).
        Update θ_i = θ − α ∇_θ L^S_{T_i}(θ).
        Set … and …
        Evaluate … and … via the complex chain rule (1) and (2).
     end for
     Update θ via the outer-loop update (9).
  until convergence
Algorithm 1 Pseudocode for CAMEL Update
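
To make the structure of Algorithm 1 concrete, the toy sketch below runs the same inner/outer update for a single complex weight vector on synthetic regression tasks. The task sampler, loss, and hyper-parameters are illustrative assumptions rather than the paper's signal-recognition setup; PyTorch's complex autograd supplies the complex chain rule.

```python
import torch

def make_task():
    """Toy complex regression task: support/query splits from a random linear model."""
    w_true = torch.randn(4, dtype=torch.complex64)
    def split(n):
        x = torch.randn(n, 4, dtype=torch.complex64)
        return x, x @ w_true
    return split(5), split(15)          # (support set, query set)

def loss_fn(w, data):
    x, y = data
    diff = x @ w - y
    return (diff * diff.conj()).real.mean()   # real-valued, non-analytic loss

theta = torch.randn(4, dtype=torch.complex64, requires_grad=True)
alpha, beta, batch = 0.05, 0.01, 4            # inner lr, meta lr, task batch size (assumed)

for _ in range(500):
    meta_grad = torch.zeros_like(theta)
    for _ in range(batch):
        support, query = make_task()
        # inner loop, cf. (3): one adaptation step on the support set
        g = torch.autograd.grad(loss_fn(theta, support), theta, create_graph=True)[0]
        theta_i = theta - alpha * g
        # outer loop: query-set gradient back through the adaptation step (cf. (8)-(9))
        meta_grad += torch.autograd.grad(loss_fn(theta_i, query), theta)[0]
    with torch.no_grad():
        theta -= beta * meta_grad / batch     # meta update, cf. (6)
```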

3.2 Complex-valued Attention

Attention mechanisms are widely used in various areas of deep learning, but attention for the complex domain has rarely been addressed. A significant reason is that attention has to utilize the softmax function to calculate the similarity coefficients, which must be real numbers rather than complex numbers. According to Lemma 3, such a complex-to-real mapping is either a constant function or non-analytic. However, constant functions are useless and discardable in neural networks, while non-analytic functions are not complex-differentiable at arbitrary points in the complex domain. As a result, we resort to the complex gradient vector.

Complex gradient vector  If f(z) is a real-valued function of a complex vector z, then the complex gradient vector is given by Hjørungnes (2011):

∇_z f(z) = 2 ∂f(z)/∂z*   (10)

Complex-valued softmax function  Under (10), we are able to define the generalized complex-valued softmax function as:

softmax_ℂ(z) = softmax(g(z))   (11)

where softmax denotes the softmax function in the real case and g denotes any function that maps complex numbers to real numbers, such as |·| (i.e., the magnitude of the complex number), etc.
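
A minimal sketch of (11) in PyTorch, assuming g(·) = |·| as the complex-to-real map; any other real-valued map of the complex scores drops in the same way.

```python
import torch

def complex_softmax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Generalized complex softmax (11) with g(z) = |z|: real weights from complex scores."""
    return torch.softmax(z.abs(), dim=dim)

scores = torch.randn(2, 4, dtype=torch.complex64)
weights = complex_softmax(scores)   # real-valued, each row sums to 1
```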

Given a complex input matrix X, we can compute the complex matrices Q, K, and V using linear transformations, which are similar to complex-valued fully connected layers. Then the complex-valued attention can be written as:

Attention(Q, K, V) = softmax_ℂ(Q Kᵀ / √d) V   (12)

where softmax_ℂ acts on each row of the matrix and d denotes the row dimension of K, i.e., the scaling factor.
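
As a hedged illustration of (12), the sketch below computes complex-valued attention with the magnitude-based softmax; the plain transpose Q Kᵀ is used to match (12), and shapes are illustrative.

```python
import math
import torch

def complex_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention (12) for complex Q, K, V with real softmax weights."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)   # complex similarity scores
    weights = torch.softmax(scores.abs(), dim=-1)     # complex softmax with g = |.|
    return weights.to(V.dtype) @ V                    # complex-valued output

Q = torch.randn(8, 16, dtype=torch.complex64)
K = torch.randn(8, 16, dtype=torch.complex64)
V = torch.randn(8, 16, dtype=torch.complex64)
out = complex_attention(Q, K, V)   # shape (8, 16)
```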

Complex-valued multi-head attention  Complex-valued multi-head attention allows the model to jointly focus on information from different representations.

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O,  head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)   (13)

where W_i^Q, W_i^K, W_i^V, and W^O are the projection matrices and Concat denotes the concatenation of the input matrices.
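
Building on the single-head sketch above, a hedged version of (13) with explicit complex projection matrices could look as follows; the head count and dimensions are illustrative defaults, not the paper's exact configuration.

```python
import torch

def complex_multihead(X, W_q, W_k, W_v, W_o):
    """Complex multi-head attention (13): per-head projections, attention, concat, output projection."""
    heads = [complex_attention(X @ W_q[i], X @ W_k[i], X @ W_v[i])  # reuses the sketch above
             for i in range(W_q.shape[0])]
    return torch.cat(heads, dim=-1) @ W_o

h, d_model, d_head = 8, 64, 8
scale = d_model ** -0.5
X = torch.randn(128, d_model, dtype=torch.complex64)
W_q = torch.randn(h, d_model, d_head, dtype=torch.complex64) * scale
W_k = torch.randn(h, d_model, d_head, dtype=torch.complex64) * scale
W_v = torch.randn(h, d_model, d_head, dtype=torch.complex64) * scale
W_o = torch.randn(h * d_head, d_model, dtype=torch.complex64) * scale
Y = complex_multihead(X, W_q, W_k, W_v, W_o)   # shape (128, 64)
```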

Complex-valued normalization  Normalization, such as batch normalization Ioffe and Szegedy (2015) and layer normalization Ba et al. (2016), is an important component of neural networks. In particular, batch normalization is commonly employed. However, for a complex vector z, its variance, which has to be computed in normalization, is real. According to Lemma 3, the variance is non-analytic as a function of z. Therefore, in the back-propagation of complex-valued normalization, we have to utilize the complex gradient vector (10). Defining γ as the complex scaling parameter and β as the complex shift parameter, the complex-valued normalization can be expressed as:

CN(z) = γ ⊙ (z − E[z]) / √(Var[z] + ε) + β,  Var[z] = E[(z − E[z])ᴴ (z − E[z])]   (14)

where E and Var denote the expectation and variance, respectively, and (·)ᴴ denotes the conjugate transpose of a vector.
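
The following is a minimal sketch of (14) for a batch of complex feature vectors, assuming the real per-feature variance E[(z − E[z])(z − E[z])*]; it does not implement the full covariance whitening used by some complex batch-normalization variants.

```python
import torch

def complex_batch_norm(z, gamma, beta, eps=1e-5):
    """Complex normalization (14): center, scale by the real per-feature variance, then affine."""
    mu = z.mean(dim=0, keepdim=True)                                    # complex mean
    centered = z - mu
    var = (centered * centered.conj()).real.mean(dim=0, keepdim=True)  # real variance
    return gamma * centered / torch.sqrt(var + eps) + beta

z = torch.randn(32, 16, dtype=torch.complex64)      # batch of 32 complex feature vectors
gamma = torch.ones(16, dtype=torch.complex64)       # complex scaling parameter
beta = torch.zeros(16, dtype=torch.complex64)       # complex shift parameter
out = complex_batch_norm(z, gamma, beta)
```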

Figure 2: The architecture of CAMEL. The embedding block contains complex-valued convolutional layers. The complex-valued convolutional block contains a complex-valued convolutional layer, complex-valued batch normalization, and complex-valued ReLU. The complex-valued attention block contains complex-valued attentions. The complex-valued fully connected block contains complex-valued fully connected layers, complex-valued batch normalization, and complex-valued ReLU.

Complex-valued activation function  The activation function is nonlinear, so it can hardly be analytic. Most of the well-known activation functions, such as Sigmoid, Tanh, and ReLU Goodfellow et al. (2016), are not analytic in the complex domain. In particular, the complex extensions of Sigmoid and Tanh are not bounded, while for complex ReLU the inputs cannot be compared with zero. To this end, the complex-valued activation function can be defined as:

σ_ℂ(z) = σ(Re(z)) + i σ(Im(z))   (15)

where σ denotes the activation function in the real case. In this way, the complex Sigmoid and Tanh are bounded because their real and imaginary parts are bounded. Meanwhile, the complex ReLU is well defined because the real and imaginary parts of the inputs can be compared with zero. However, since the complex-valued activation functions defined above are non-analytic in most cases, the complex chain rule is required for their derivatives.
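
A minimal sketch of (15) with ReLU as the real activation σ; any other real activation can be substituted.

```python
import torch

def complex_relu(z: torch.Tensor) -> torch.Tensor:
    """Complex activation (15): apply the real activation to Re(z) and Im(z) separately."""
    return torch.complex(torch.relu(z.real), torch.relu(z.imag))

z = torch.tensor([1.0 - 2.0j, -0.5 + 0.3j])
out = complex_relu(z)   # negative real and imaginary parts are clipped independently
```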

Please see the supplementary material for detailed complex-valued convolutional layer and complex-valued fully connected layer in Section F.

4 Convergence of CAMEL

In this section, we show the convergence behavior of complex-valued MAML by following the previous work Fallah et al. (2020), which proves the convergence of MAML in the real domain. To analyze complex-valued MAML, we need to utilize notions such as twice continuous differentiability, smoothness, Lipschitz continuity, and the Hessian in the complex domain. Please see the supplementary material for the detailed assumptions, lemmas, and the proof of Theorem 1 in Section H.

Theorem 1

Suppose that Assumptions 1-5 hold. Consider running complex-valued MAML with batch sizes D^S and D^Q, with the step size chosen following the definition in Lemma 5. Then, for any ε > 0, complex-valued MAML finds a solution θ_ε such that

E[‖∇F(θ_ε)‖] ≤ ε   (16)

after running for

O(Δ/ε²)   (17)

iterations, where Δ is defined in Assumption 1, and D^S and D^Q denote the sizes of the support set and query set, respectively.

The result in Theorem 1 demonstrates that after running CAMEL for the number of iterations given in (17), we are able to find a point at which the expected gradient norm satisfies (16).

5 Experiments

We train the model on three datasets. RadioML 2016.10A O’Shea et al. (2016) is a dataset with 220,000 total samples, 20,000 samples for each class and 11,000 samples for each SNR, consisting of inputs X from 11 classes. The 11 classes correspond to 11 modulation types: 8PSK, AM-DSB, AM-SSB, BPSK, CPFSK, GFSK, PAM4, QAM16, QAM64, QPSK, and WBFM. RadioML 2016.04C O’Shea et al. (2016) is a synthetic dataset generated with GNU Radio, consisting of about 110 thousand signals. These samples are uniformly distributed in SNR from -20dB to +20dB and tagged so that we can evaluate performance on specific subsets. In fact, 2016.10A represents a cleaner and more normalized version of the 2016.04C dataset. The third dataset is SIGNAL2020.02 Dong et al. (2021b), whose data is modulated at a rate of 8 samples per symbol with 128 samples per frame, over 20 different SNRs taking even values in [2dB, 40dB].

5.1 Experimental setup

CAMEL is implemented in PyTorch Paszke et al. (2019) with Python on an RTX 3090 GPU, and trained using the Adam optimizer Kingma and Ba (2014). In the classification experiments on the three datasets, RadioML 2016.04C, RadioML 2016.10A, and SIGNAL 2020.02, the default hyper-parameters are as follows: the number of training epochs is 400,000; the meta batch size is 2; the meta-level outer learning rate is 0.001 and the task-level inner update learning rate is 0.1; the number of task-level inner update steps is 5 and the number of update steps for fine-tuning is 10. All of our experiments use the same hyper-parameters as the default setting. We vary the support-set shot number between 1 and 5 to obtain results for the 5-way 1-shot and 5-way 5-shot cases.

5.2 Our Model

First, we study the influence of adding a multi-head self-attention mechanism to this network, which can focus attention on important information. We use multi-head attention with 8 heads. Instead of performing a single attention function with the full-dimensional keys, values, and queries, it is found beneficial to linearly project the queries, keys, and values several times with different, learned linear projections to lower dimensions, perform the attention function in parallel, concatenate the outputs, and project again to obtain the final result Dauphin and Schoenholz (2019). In our experiments, as illustrated in Table 1, the performance is much better with the addition of the multi-head attention mechanism. As the batch size increases, the performance improves at the cost of more computation and time. As a trade-off, we set the batch size to 64 when using multi-head attention. We observe that the model with the attention mechanism achieves notably higher accuracy owing to these improvements.

RADIOML 2016.10A SIGNAL2020.02
Method 1-shot 5-shot 1-shot 5-shot
MAML Finn et al. (2017) 86.57% 94.50% 43.26% 67.77%
MAML+attention 95.80% 97.70% 54.44% 63.33%
MAML+complex 91.40% 96.38% 59.50% 64.00%
SNAIL Mishra et al. (2018) 71.18% 78.48% 35.01% 36.34%
Reptile Nichol et al. (2018) 69.16% 92.32% 55.01% 69.39%
MAML+complex+CT Yang et al. (2020) 96.40% 97.50% 58.40% 69.80%
CAMEL (ours) 97.23%±0.13% 98.22%±0.08% 64.80%±0.10% 74.27%±0.15%
Table 1: CAMEL compared with other meta-learning models on the datasets RADIOML 2016.10A and SIGNAL 2020.02. The method ‘MAML+attention’ indicates adding the multi-head attention mechanism to the original MAML model, ‘MAML+complex’ means constructing a complex-valued neural network in the MAML model, and ‘CT Yang et al. (2020)’ represents the Complex Transformer model using 8 attention functions to represent the complex-valued attention. The ± shows 95% confidence intervals over tasks.

RADIOML 2016.04C
Method 1-shot 5-shot
MAML Finn et al. (2017) 88.93%±0.13% 93.59%±0.62%
MAML+attention 92.12%±0.22% 95.51%±0.05%
MAML+complex 91.65%±0.35% 96.28%±0.53%
SNAIL Mishra et al. (2018) 89.21%±0.75% 96.90%±0.19%
Reptile Nichol et al. (2018) 87.08%±2.88% 92.07%±5.65%
MAML+complex+CT Yang et al. (2020) 93.58%±1.15% 96.52%±0.08%
CAMEL (ours) 96.30%±0.22% 97.51%±0.15%
Table 2: CAMEL compared with other meta-learning models in detail on the dataset RADIOML 2016.04C. The ± shows 95% confidence intervals over tasks. CAMEL outperforms all other meta-learning models listed.

A further study concerns the influence of adding a complex-valued neural network, because we notice that complex numbers can have a richer representational capacity. For signal inputs, using complex numbers can probably capture more useful details than real numbers and could also facilitate noise-robust memory retrieval mechanisms Trabelsi et al. (2018b). We need to deal with several building blocks to construct a complex-valued neural network: the representation of complex numbers, complex gradient vectors, complex weight initialization, complex convolutions, complex-valued activations, complex-valued normalization, and the complex-valued multi-head attention mechanism. These blocks are determined by their own algorithms and the arithmetic of complex numbers. We see from the results in Table 1 and Table 2 that these complex features improve the classification accuracy in both the 5-way 1-shot and 5-way 5-shot cases on different datasets.

In the training process, we set the number of convolution kernels to 128. For the multi-head attention part, we set the source sequence length and output sequence length to 64 and the number of heads to 8. We observe that such complex-valued models are more competitive than their real-valued counterparts. These components build our final model, CAMEL: model-agnostic meta-learning with multi-head attention and a complex-valued neural network. Compared with the other meta-learning models, CAMEL achieves the best classification accuracy.

The Complex Transformer Yang et al. (2020) implements complex attention in another way: it rewrites all complex functions as two separate real functions and computes the multiplication of queries, keys, and values to obtain the complex attention with 8 attention functions having different inputs. We also run SNAIL Mishra et al. (2018), which combines a causal attention operation over the context produced by temporal convolutions, and Reptile Nichol et al. (2018), which uses only first-order derivatives for meta-learning updates. For comparison, Table 1 and Table 2 list the accuracies of several MAML-based models applied to different datasets. The results in these two tables demonstrate that our model CAMEL achieves the state-of-the-art performance among all of them. In particular, some models do not perform well on the tasks from the dataset SIGNAL2020.02, but CAMEL still delivers stable and strong performance on this challenging task. Figure 3 indicates that CAMEL attains the highest accuracy at a relatively fast convergence speed. The results also show that, on these challenging signal classification tasks, the CAMEL model clearly outperforms other meta-learning models in accuracy and stability, as can be seen from the smooth accuracy curves and narrow confidence intervals of CAMEL in both the 1-shot and 5-shot cases.

Figure 3: Comparison of the convergence curves of accuracy between CAMEL and other meta-learning models for classification tasks on the dataset RADIOML 2016.04C. This pair of plots shows the accuracy curves with 95% confidence intervals over the same classification task. Left: 5-way 1-shot learning. Right: 5-way 5-shot learning. The results indicate that our model CAMEL reaches the highest accuracy at a relatively fast convergence rate.

5.3 Ablation study

In this section, we conduct ablation studies on CAMEL in three scenarios, as shown in Table 3. The first scenario uses samples whose SNR ≥ 0, of which 75% are selected as the training set and 25% as the test set. For the second scenario, shown in the column "SNR = 0" in Table 3, we pick samples with SNR = 0 and randomly select 75% of them to form the training set and the remaining 25% as the test set. The third scenario forms the Prediction-Other (P-O) set as follows: pick 5 classes of signal samples (SNR ≥ 0) as set P, and the other 5 classes of samples (SNR ≥ 0) form set O. All samples in set O and 5% of the samples in set P form the training set. The remaining 95% of the samples in set P constitute the test set.

On 3 training and testing sets mentioned above, we construct the MAML model first, and then add some features on it step by step. We add attention components and complex numbers separately and together. From the results we observe that in CAMEL, all the features added on the original MAML model help improve the classification accuracy.

Accuracy SNR ≥ 0 SNR = 0 P-O set
MAML 87.20% 81.64% 89.06%
MAML+attention 93.00% 87.26% 91.90%
MAML+complex 91.10% 91.75% 91.30%
CAMEL (ours) 93.70% 92.10% 96.30%
Table 3: Ablation study on CAMEL in three scenarios. Starting from MAML, we add the multi-head attention mechanism and the use of complex numbers step by step. Our model CAMEL combines MAML, a complex-valued neural network, and the complex-valued multi-head attention component. The full experiment consists of 4 models on 3 training and test sets.

6 Conclusion

In this paper, we have proposed a complex-domain attentional meta-learning framework for signal recognition named CAMEL. CAMEL utilizes complex-valued neural networks and attention to provide prior knowledge, i.e., complex-domain and temporal information, which helps CAMEL improve performance and prevent overfitting. As two byproducts of CAMEL, we have designed complex-valued meta-learning and complex-valued attention, which can be of independent interest. With second-order information, CAMEL is able to find first-order stationary points of general nonconvex problems. Furthermore, CAMEL has achieved state-of-the-art results on several datasets. Finally, the ablation studies in three scenarios have demonstrated the effectiveness of the components of CAMEL.

References

  • F. Alet, E. Weng, T. Lozano-Pérez, and L. P. Kaelbling (2019) Neural relational inference with fast modular meta-learning. In Advances in Neural Information Processing Systems, Vol. 32, pp. . External Links: Link Cited by: §1.
  • J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §3.2.
  • S. Baik, M. Choi, J. Choi, H. Kim, and K. M. Lee (2020) Meta-learning with adaptive hyperparameters. In Advances in Neural Information Processing Systems, Vol. 33, pp. 20755–20765. External Links: Link Cited by: Appendix G, §1.
  • Y. Balaji, S. Sankaranarayanan, and R. Chellappa (2018) Metareg: towards domain generalization using meta-regularization. Advances in Neural Information Processing Systems 31, pp. 998–1008. Cited by: §1.
  • C. Boutilier, C. Hsu, B. Kveton, M. Mladenov, C. Szepesvari, and M. Zaheer (2020) Differentiable meta-learning of bandit policies. In Advances in Neural Information Processing Systems, Vol. 33, pp. 2122–2134. External Links: Link Cited by: §1.
  • Y. Chen, A. L. Friesen, F. Behbahani, A. Doucet, D. Budden, M. Hoffman, and N. de Freitas (2020) Modular meta-learning with shrinkage. In Advances in Neural Information Processing Systems, Vol. 33, pp. 2858–2869. External Links: Link Cited by: Appendix G, §1.
  • J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §1.
  • B. Confavreux, F. Zenke, E. Agnes, T. Lillicrap, and T. Vogels (2020) A meta-learning approach to (re)discover plasticity rules that carve a desired function into a neural network. In Advances in Neural Information Processing Systems, Vol. 33, pp. 16398–16408. External Links: Link Cited by: §1.
  • Y. N. Dauphin and S. Schoenholz (2019) MetaInit: initializing learning by learning to initialize. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. . External Links: Link Cited by: §5.2.
  • G. Denevi, M. Pontil, and C. Ciliberto (2020) The advantage of conditional meta-learning for biased regularization and fine tuning. In Advances in Neural Information Processing Systems, Vol. 33, pp. 964–974. External Links: Link Cited by: §1.
  • Y. Dong, X. Jiang, L. Cheng, and Q. Shi (2021a) SSRCNN: a semi-supervised learning framework for signal recognition. IEEE Transactions on Cognitive Communications and Networking, pp. 1–1. External Links: Document Cited by: §1.
  • Y. Dong, X. Jiang, H. Zhou, Y. Lin, and Q. Shi (2021b) SR2CNN: zero-shot learning for signal recognition. IEEE Transactions on Signal Processing 69 (), pp. 2316–2329. External Links: Document Cited by: §5.
  • A. Fallah, A. Mokhtari, and A. Ozdaglar (2020) On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In International Conference on Artificial Intelligence and Statistics, pp. 1082–1092. Cited by: §2, §4, Lemma 5, Proof 5.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. Cited by: §1, Table 1, Table 2.
  • C. Finn, K. Xu, and S. Levine (2018) Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, Vol. 31. External Links: Link Cited by: §1.
  • M. Goldblum, L. Fowl, and T. Goldstein (2020) Adversarially robust few-shot learning: a meta-learning approach. In Advances in Neural Information Processing Systems, Vol. 33, pp. 17886–17895. External Links: Link Cited by: Appendix G, §1.
  • I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio (2016) Deep learning. Vol. 1, MIT press Cambridge. Cited by: §3.2.
  • J. Harrison, A. Sharma, C. Finn, and M. Pavone (2020) Continuous meta-learning without tasks. In Advances in Neural Information Processing Systems, Vol. 33, pp. 17571–17581. External Links: Link Cited by: Appendix G, §1.
  • A. Hirose (2012) Complex-valued neural networks. Vol. 400, Springer Science & Business Media. Cited by: §1, Assumption 2.
  • A. Hjørungnes (2011) Complex-valued matrix derivatives: with applications in signal processing and communications. Cambridge University Press. Cited by: §3.2, Assumption 2, Proof 5.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §1.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. Cited by: §3.2.
  • M. Jas, T. Dupré la Tour, U. Simsekli, and A. Gramfort (2017) Learning the morphology of brain signals using alpha-stable convolutional sparse coding. In Advances in Neural Information Processing Systems, Vol. 30. External Links: Link Cited by: §1.
  • G. Jerfel, E. Grant, T. Griffiths, and K. A. Heller (2019) Reconciling meta-learning and continual learning with online mixtures of tasks. In Advances in Neural Information Processing Systems, Vol. 32, pp. . External Links: Link Cited by: Appendix G, §1.
  • K. Ji, J. D. Lee, Y. Liang, and H. V. Poor (2020) Convergence of meta-learning with task-specific adaptation over partial parameters. In Advances in Neural Information Processing Systems, Vol. 33, pp. 11490–11500. External Links: Link Cited by: §1.
  • S. Khodadadeh, L. Boloni, and M. Shah (2019) Unsupervised meta-learning for few-shot image classification. In Advances in Neural Information Processing Systems, Vol. 32, pp. . External Links: Link Cited by: Appendix G, §1.
  • M. Khodak, M. F. Balcan, and A. S. Talwalkar (2019) Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, Vol. 32, pp. . External Links: Link Cited by: Appendix G, §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.1.
  • W. Kong, R. Somani, S. Kakade, and S. Oh (2020) Robust meta-learning for mixed linear regression with small batches. In Advances in Neural Information Processing Systems, Vol. 33, pp. 4683–4696. External Links: Link Cited by: §1.
  • S. Liu, A. Davison, and E. Johns (2019) Self-supervised generalisation with meta auxiliary learning. In Advances in Neural Information Processing Systems, Vol. 32, pp. . External Links: Link Cited by: Appendix G, §1.
  • N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel (2018) A simple neural attentive meta-learner. In International Conference on Learning Representations, External Links: Link Cited by: §5.2, Table 1, Table 2.
  • A. Nichol, J. Achiam, and J. Schulman (2018) On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999. Cited by: §5.2, Table 1, Table 2.
  • T. J. O’Shea, J. Corgan, and T. C. Clancy (2016) Convolutional radio modulation recognition networks. In International Conference on Engineering Applications of Neural Networks, pp. 213–226. Cited by: §1, §5.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703. Cited by: §5.1.
  • A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine (2019) Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, Vol. 32, pp. . External Links: Link Cited by: Appendix G, §1.
  • V. Sitzmann, E. Chan, R. Tucker, N. Snavely, and G. Wetzstein (2020) MetaSDF: meta-learning signed distance functions. In Advances in Neural Information Processing Systems, Vol. 33, pp. 10136–10147. External Links: Link Cited by: §1.
  • X. Song, L. Chai, and J. Zhang (2020) Graph signal processing approach to qsar/qspr model learning of compounds. IEEE Transactions on Pattern Analysis and Machine Intelligence (), pp. 1–1. External Links: Document Cited by: §1.
  • C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J. F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, and C. J. Pal (2018a) Deep complex networks. In 6th International Conference on Learning Representations (ICLR), External Links: Link Cited by: §1.
  • C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J. F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, and C. J. Pal (2018b) Deep complex networks. In International Conference on Learning Representations, External Links: Link Cited by: §5.2.
  • Y. Tu, Y. Lin, C. Hou, and S. Mao (2020) Complex-valued networks for automatic modulation classification. IEEE Transactions on Vehicular Technology 69 (9), pp. 10085–10089. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30, pp. . External Links: Link Cited by: §1.
  • Y. Xie, H. Jiang, F. Liu, T. Zhao, and H. Zha (2019) Meta learning with relational information for short sequences. In Advances in Neural Information Processing Systems, Vol. 32, pp. . External Links: Link Cited by: Appendix G, §1.
  • M. Yang, M. Q. Ma, D. Li, Y. H. Tsai, and R. Salakhutdinov (2020) Complex transformer: a framework for modeling complex-valued sequence. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4232–4236. Cited by: §5.2, Table 1, Table 2, footnote 1.
  • H. Yao, Y. Zhou, M. Mahdavi, Z. (. Li, R. Socher, and C. Xiong (2020) Online structured meta-learning. In Advances in Neural Information Processing Systems, Vol. 33, pp. 6779–6790. External Links: Link Cited by: Appendix G, §1.
  • J. Yoon, T. Kim, O. Dia, S. Kim, Y. Bengio, and S. Ahn (2018) Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pp. 7343–7353. Cited by: §1.
  • H. Zhang, X. Liu, D. Xu, and Y. Zhang (2014) Convergence analysis of fully complex backpropagation algorithm based on wirtinger calculus. Cognitive Neurodynamics 8 (3), pp. 261–266. Cited by: Assumption 3.
  • R. Zhang, T. Che, Z. Ghahramani, Y. Bengio, and Y. Song (2018) MetaGAN: an adversarial approach to few-shot learning.. Advances in Neural Information Processing Systems 2, pp. 8. Cited by: §1.
  • S. Zhang, Y. Xia, and J. Wang (2015) A complex-valued projection neural network for constrained optimization of real functions in complex variables. IEEE Transactions on Neural Networks and Learning Systems 26 (12), pp. 3227–3238. Cited by: Assumption 2, Assumption 3.
  • P. Zhou, X. Yuan, H. Xu, S. Yan, and J. Feng (2019) Efficient meta learning via minibatch proximal update. In Advances in Neural Information Processing Systems, Vol. 32, pp. . External Links: Link Cited by: Appendix G, §1.
  • Z. Zhuang, Y. Wang, K. Yu, and S. Lu (2020) No-regret non-convex online meta-learning. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3942–3946. Cited by: §1.

Appendix A Proof of Lemma 1

Lemma 1

If a function f is complex analytic, the time complexity of computing the derivative of f in IQCVNNs is twice that of computing the complex derivative of f in CDCVNNs.

Proof 1

We consider two scenarios: the derivative of a simple analytic function and that of a composite analytic function.
1. For a simple analytic function f, in CDCVNNs, the complex derivative of f with respect to a complex vector z is computed directly, while IQCVNNs consider the derivatives with respect to the real and imaginary parts separately; therefore

(18)

Thus, in this scenario, the time complexity of the derivative of f in IQCVNNs is twice that of the complex derivative of f in CDCVNNs.
2. For a composite analytic function f(g(z)), where g is also complex analytic, in CDCVNNs the complex derivative of f(g(z)) with respect to a complex vector z can be computed according to the complex chain rule.

(19)

Owing to the fact that g is complex analytic, ∂g*/∂z is equal to zero. So, (19) can be simplified to

(20)

where and . Hence, the time complexity of the complex derivative of in CDCVNNs is . However, in IQCVNNs, we have

(21)

where the size of each of the above tensors in (21) is twice that of its complex counterpart in (20). Hence, the time complexity of the derivative of f(g(z)) in IQCVNNs is twice that in CDCVNNs. Note that a composite function of multiple layers can be seen as composite functions of two layers calculated serially. As a result, the time complexity of the complex derivative in CDCVNNs remains half of that of the derivative in IQCVNNs. Hence, the Lemma holds in the scenario of the derivative of a composite analytic function.
To sum up, Lemma 1 is established in both scenarios. This completes the proof.

Appendix B Proof of Lemma 2

Lemma 2

The complex-valued convolutional layer and the complex-valued fully connected layer are complex analytic.

Proof 2

It is obvious that the complex-valued convolutional layer and the complex-valued fully connected layer are linear and continuous. Assume that a linear function f is continuous with respect to a complex vector z; then we can obtain

In a similar way,

Therefore,

Since the function is continuous and satisfies the Cauchy-Riemann equations, the linear function is complex analytic. Hence, the complex-valued convolutional layer and the complex-valued fully connected layer are complex analytic. This completes the proof.

Appendix C Proof of Lemma 3

Lemma 3

A function f: ℂ → ℝ is analytic if and only if f is a constant function.

Proof 3

Assume a function f: ℂ → ℝ is analytic, and write f(z) = u + iv with v ≡ 0 since f is real-valued. Then f has to satisfy the Cauchy-Riemann equations:

∂u/∂x = ∂v/∂y = 0,  ∂u/∂y = −∂v/∂x = 0,

where z = x + iy is the complex input vector. Since the partial derivatives of u are all equal to 0, f is a constant function. This completes the proof.

Appendix D Definition Recall

In this section, we recall the definitions of complex derivative, analytic function, and the Cauchy-Riemann equations.

Complex derivative  Let f(z) = u(x, y) + iv(x, y), where z = x + iy. If f is continuous at a point z₀, we can define its complex derivative as:

f′(z₀) = lim_{Δz→0} [f(z₀ + Δz) − f(z₀)] / Δz   (22)

This is similar to the definition of the derivative of a function of a real variable. In the real case, the existence of the derivative implies that the limits of the difference quotient exist and are equal when the point converges to z₀ from both the left and the right. However, in the complex case, it means that the limits of the difference quotient exist and are equal when the point converges to z₀ from any direction in the complex plane. If a function satisfies this property at a point z₀, we say that the function is complex-differentiable at z₀.

Analytic function  If a function f is complex-differentiable at all points in some domain D, then f is said to be analytic, i.e., f is a complex analytic function, also known as a holomorphic function, in D.

The Cauchy-Riemann equations

The Cauchy-Riemann equations are a pair of real partial differential equations that a complex analytic function needs to satisfy:

∂u/∂x = ∂v/∂y,  ∂u/∂y = −∂v/∂x   (23)

where u and v denote the real and imaginary parts of the complex function, respectively. The necessary and sufficient condition for f to be a complex analytic function in D is that the function is continuous and satisfies the Cauchy-Riemann equations in D.

Appendix E Proof of Lemma 4

Lemma 4

For complex meta-parameters θ, we have

(11)
Proof 4

According to (5), it is obvious that

(24)

Note that, since the loss is real-valued, following the definition of the complex gradient vector (10), we have

(25)

where the second equality holds because the output of the loss is real, the third equality follows from the complex chain rule, and the last equality is given by the definition of the complex gradient vector. Next, according to the inner-loop update process (3), we have

(26)

Similarly,

(27)

Now, using (7), we can write (25) as

(28)

Plugging (28) in (24) yields

(29)

This completes the proof.

Appendix F Complex-valued Neural Networks

Neural networks require back-propagation to update their parameters via first-order derivatives, as do complex-valued neural networks. We would prefer the functions in complex-valued neural networks to be analytic. Define z and b as the complex input vector and the complex bias vector for each function, respectively.

Complex-valued convolutional layer  The complex-valued convolutional layer implements the convolution operation on complex input signals. Define W as the complex convolution kernel. Given W = A + iB, z = x + iy, and b = c + id, since the complex-valued convolutional layer is linear, we are able to compute the real and imaginary parts of its outputs separately as follows:

Re(W ∗ z + b) = A ∗ x − B ∗ y + c   (30a)
Im(W ∗ z + b) = B ∗ x + A ∗ y + d   (30b)

Then, according to (30a) and (30b), the complex-valued convolutional layer can be represented as follows:

Conv_ℂ(z) = (A ∗ x − B ∗ y + c) + i (B ∗ x + A ∗ y + d),

where ∗ denotes the convolution operation in the real case.

Complex-valued fully connected layer  The complex-valued fully connected layer achieves a linear transformation of complex inputs. Define W as the complex weight matrix. Given W = A + iB, z = x + iy, and b = c + id, the real and imaginary parts of the outputs of the complex-valued fully connected layer can be computed as:

Re(Wz + b) = Ax − By + c   (31a)
Im(Wz + b) = Bx + Ay + d   (31b)

Similarly, the complex-valued fully connected layer can be expressed as:

FC_ℂ(z) = (Ax − By + c) + i (Bx + Ay + d)   (32)
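
A hedged PyTorch sketch of (30)-(32), realizing the complex convolution and the complex fully connected layer through real-valued operations on the real and imaginary parts; tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def complex_conv1d(z, W, b):
    """Complex convolution (30): (A + iB) * (x + iy) + (c + id), built from real convolutions."""
    A, B = W.real.contiguous(), W.imag.contiguous()
    x, y = z.real.contiguous(), z.imag.contiguous()
    re = F.conv1d(x, A, b.real.contiguous()) - F.conv1d(y, B)
    im = F.conv1d(x, B, b.imag.contiguous()) + F.conv1d(y, A)
    return torch.complex(re, im)

def complex_linear(z, W, b):
    """Complex fully connected layer (31)-(32), built from real matrix products."""
    A, B = W.real, W.imag
    x, y = z.real, z.imag
    re = x @ A.T - y @ B.T + b.real
    im = x @ B.T + y @ A.T + b.imag
    return torch.complex(re, im)

z = torch.randn(2, 3, 64, dtype=torch.complex64)     # (batch, channels, length)
W = torch.randn(5, 3, 7, dtype=torch.complex64)      # (out_channels, in_channels, kernel)
b = torch.randn(5, dtype=torch.complex64)
out = complex_conv1d(z, W, b)                        # (2, 5, 58), complex-valued
```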

Appendix G Related work

Recently, meta-learning has demonstrated promising performance in many fields. Khodadadeh et al. (2019) proposed an unsupervised algorithm for model-independent meta-learning for classification tasks. The work Liu et al. (2019) proposed a new method that automatically learns appropriate labels for auxiliary tasks. The work Xie et al. (2019) proposed a new meta-learning method to learn heterogeneous point process models from short event sequence data and relational networks. In addition, the work Jerfel et al. (2019) proposed a Dirichlet process mixture for hierarchical Bayesian models over the parameters of arbitrary parametric models. Khodak et al. (2019) built a theoretical framework for the design and understanding of practical meta-learning methods. The authors in Zhou et al. (2019) proposed a meta-learning method based on minibatch proximal updates for learning effective hypothesis transfer.

Moreover, the work Rajeswaran et al. (2019) proposed an implicit MAML algorithm which relies only on the solution to the inner-level optimization. The work Chen et al. (2020) proposed a meta-learning approach that avoids the need for often sub-optimal hand-selection. The work Yao et al. (2020) proposed an online structured meta-learning framework. Additionally, the authors in Baik et al. (2020) proposed a new weight-update rule that greatly enhances the fast adaptation process. The work Harrison et al. (2020) proposed a meta-learning approach via online changepoint analysis that augments meta-learning with a differentiable Bayesian changepoint detection scheme. The work Goldblum et al. (2020) proposed an adversarial querying algorithm for generating adversarially robust meta-learners and thoroughly investigated the causes of adversarial vulnerability.

Appendix H Convergence Analysis

For ease of writing and derivation, in our notation, L^S_{T_i}(θ) represents the loss function on task T_i, L^Q_{T_i}(θ_i) represents the loss on task T_i after the inner-loop update process, and F(θ) represents the meta-objective. By drawing tasks T_i from the task probability distribution p(T), our optimization problem can be rewritten as

min_θ F(θ) = E_{T_i ∼ p(T)} [ L^Q_{T_i}(θ − α ∇_θ L^S_{T_i}(θ)) ]   (33)
Definition 1

A random vector θ_ε is called an ε-approximate first-order stationary point for problem (33) if it satisfies E[‖∇F(θ_ε)‖] ≤ ε.

Then, we formally state our assumptions as below.

Assumption 1

is bounded below, and