As of 2019, an estimated three billion mobile phones are connected to the internet, collectively amassing a staggering amount of information (Lim et al., 2019), and it is highly desirable to make this data available for ever more data-demanding neural networks. Traditional approaches for training neural networks require that all data is collected in one place for training. However, as the data is often sensitive and contains a wealth of private information about the user, centralized data collection is not always realizable or desirable.
Federated learning (FL) provides an approach to learn a centralized model from decentralized data in a privacy-aware manner (McMahan et al., 2016). Local copies of a global model are trained at participating decentralized nodes (or clients) with local data, and the model is then consolidated episodically. While there are privacy concerns about potential data leakage through model updates, federated learning is often preferable to training a model in a centralized manner (Wei et al., 2020). Learning in this decentralized manner does, however, pose new problems for training neural networks, especially when the data available at individual clients is highly heterogeneous as well as sensitive from a privacy perspective. An example of such highly sensitive data is audio collected by personal mobile devices.
A federated learning approach where the data does not leave the person’s phone is particularly well-suited for this scenario. Yet, such data from smartphones is challenging in several respects. Not only might the data be class imbalanced, but even conditioned on the class, the feature distributions can diverge between clients, as people are in different environments with different soundscapes and use different phones to record the data. Li et al. (2019b) noted that while the original federated learning algorithm, FederatedAveraging (FedAvg), will converge under certain strong assumptions, it lacks this guarantee under the more realistic assumption that data distributions differ between clients.
However, certain features of the input, while varying across clients, are likely to be consistent for a single client, as people tend to, e.g., use the same phone for a long time and do not change their voice or speaking patterns substantially. Motivated by this, we propose to learn a local embedding for each client along with the global model, as shown in Fig. 1. We evaluate this approach, which we call the conditional gated activation unit (CGAU), on two classification tasks, one from the audio domain and one from the image domain, covering both balanced and imbalanced data. CGAU outperforms a baseline in both scenarios, showing the usefulness of learning localized features explicitly as opposed to encoding them in the global model.
The current paper presents two contributions. Firstly, we propose the conditional gated activation unit (CGAU), an enhancement for current neural network architectures suited for federated learning that captures features in client-dependent non-IID data. CGAU can be utilized in conjunction with currently used federated learning algorithms such as FedAvg. Secondly, to evaluate CGAU, we present a principled approach to simulate clients with non-IID data for evaluating federated learning. The approach utilizes embeddings from pre-trained networks to simulate clients by finding clusters in the embedding space.
In 2016, McMahan et al. (2017) coined the term federated learning to refer to the training of a model with data that is only available at many distributed devices. This setting is characterized by a few properties that challenge traditional machine learning approaches. For one, the data is typically distributed on a large number of clients. Additionally, since there is a large variance in the behaviour and environment of the typical phone user, it is non-IID. Finally, the data is likely to be unbalanced, i.e. there is a large variance in the number of samples and class distributions at each client.
To combat those issues, they proposed FedAvg, a variant of traditional stochastic gradient descent (SGD). FedAvg consolidates training updates from a large number of different sources with potentially unbalanced training data. Each client has a local copy of the model to be trained. In each round, a fraction of the total clients available computes weight updates of the model on their locally available data. The number of local weight updates before global consolidation is a hyperparameter. After that number of weight updates on the local model, the client sends the local model to the server. The server averages the model weights from all clients into the global model. By only synchronizing the local models after a fixed number of local steps, the communication cost is greatly reduced, making FedAvg feasible for training on distributed clients.
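The round structure described above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the least-squares model, the learning rate, the synthetic data and the function names `local_update` and `fedavg_round` are all assumptions made for illustration. The sketch includes every client in each round and weights each client's model by its number of samples, as in the original FedAvg formulation by McMahan et al.

```python
import numpy as np

def local_update(weights, x, y, lr, num_steps):
    """Run a few full-batch gradient steps on a least-squares objective."""
    w = weights.copy()
    for _ in range(num_steps):
        grad = 2.0 * x.T @ (x @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg_round(global_w, client_data, lr=0.1, num_steps=5):
    """One round: every client trains locally, then the server averages."""
    local_ws, sizes = [], []
    for x, y in client_data:
        local_ws.append(local_update(global_w, x, y, lr, num_steps))
        sizes.append(len(y))
    # Assumption: clients are weighted by their share of the data
    return np.average(np.stack(local_ws), axis=0,
                      weights=np.asarray(sizes, dtype=float))

# Hypothetical toy problem: three clients with differently sized data sets
rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0])
client_data = []
for n in (20, 50, 30):
    x = rng.normal(size=(n, 2))
    client_data.append((x, x @ true_w))

w = np.zeros(2)
for _ in range(100):  # 100 communication rounds
    w = fedavg_round(w, client_data)
```

Only the model weights cross the client boundary in `fedavg_round`; the arrays in `client_data` are never pooled, which is the property that makes the scheme attractive for sensitive data.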
Since McMahan et al. (2017), a number of papers have followed up on this approach, aiming to, e.g., reduce the necessary communication between clients, examine the vulnerability of the model towards adversarial attacks, or deal with non-IID data (Konečnỳ et al., 2016; Sattler et al., 2019; Li et al., 2019b; Bagdasaryan et al., 2018; Nguyen et al., 2020). We focus specifically on approaches dealing with non-IID data, and we will in the following assume a traditional supervised learning task, learning a function that maps an input to an output class. This covers the most common tasks where federated learning is used.
Zhao et al. (2018) showed that federated learning is vulnerable to data being non-IID in the distribution of classes, with the extreme case being that each client only sees one class. They show that the difficulty arises because the model weights from different classes diverge too much before global synchronization. They also show how the weight divergence between clients is bounded by the earth mover’s distance between the class distributions of the clients and the global distribution. The model performance can be partly recovered by sharing a small proportion of the data globally with all clients. In contrast to our work, they only consider differences in the distribution over the classes, not in characteristics of the input data (non-IID feature distributions as opposed to class distributions).
Sattler et al. (2019) proposed a new compression scheme, sparse ternary compression (STC), to reduce the communication necessary in a setting where the class distributions of the individual clients are non-IID.
Recently Li et al. (2019b) further examined the influence of non-IID data on model performance. They show empirically and theoretically that heterogeneous data will slow down convergence of the model to the minimum and established that with non-IID data, the learning rate must be decayed over time for the model to converge to the optimal state.
Peng et al. (2019) aim to mitigate the effect of domain shift heterogeneity by learning invariant features with adversarial reconstruction. They split up the embedding of the input into a domain-specific and a domain-invariant part by minimizing mutual information between the two components. Then only the domain-invariant component is used for classification of the original task, while an additional loss function is placed on a complete reconstruction of the input embedding from both. In this way, the network learns to disentangle the domain-invariant features and becomes more robust towards domain shift.
Finally, Ghosh et al. (2019) propose a solution for training with heterogeneous data distributions on the clients. They propose first finding independent locally optimal solutions for each client. The clients then send their models to the server. The server clusters the clients based on the locally optimal solutions. Subsequently, a traditional FL optimization is run for each client cluster. As in our work, they explicitly consider diversity in the data distributions. However, in contrast to our work, Ghosh et al. (2019) do not work with neural networks, as their approach assumes a (relatively) low-dimensional representation of the learnt algorithm parameters.
3.1 Federated learning with pre-trained networks
Deep learning requires, in general, a large amount of data and computational power for training, and the resulting models are often of considerable size in terms of memory. In settings where resources like computational power, data, and data transfer are limited, e.g., for federated learning on mobile devices, the resource demands of training deep neural networks from scratch can be prohibitively large. Pre-trained neural networks, fine-tuned on the task at hand via transfer learning, present a viable solution to these issues.
We investigate the use of pre-trained networks for classification tasks in a federated learning scheme, where a pre-trained network is used as a “frozen” feature extractor (that is, the pre-trained network is not further trained using federated learning). By using a pre-trained network, we off-load the computational power and data required from the federated learning process, as the feature extractor network can be trained centrally using any available but task-relevant data, or taken from already trained networks from a relevant domain. Additionally, the communication costs are reduced, as the feature extractor does not need to be sent back and forth between rounds of federated learning. The federated learning then only has to solve the problem of interest by learning a (much smaller) classifier on top of the embedding that the pre-trained network produces.
3.2 Client adaptation through conditional gated activation units
Heterogeneous clients with features that are non-IID are an inevitable part of federated learning in realistic settings. Given a set of completely homogeneous clients, anything a model learns based on a particular client would generalize to other clients. With heterogeneous clients, however, some class characteristics for a particular client might, or might not, generalize to other clients. Similarly, some shared class characteristics might be expressed differently at each client. Learning in such heterogeneous settings is more difficult, in part, as the information that can be shared between clients is reduced (e.g. by patterns distinct to a subset of clients, which we will call client-specific expression) and the information that can be shared is obfuscated (the same underlying pattern looks somewhat different at different clients, which we will call client-specific modulation).
We propose the use of a simple architectural component for enabling a (federated learning) model to identify whether global features are expressed at a client and how each client modulates global patterns. Specifically, for the classifiers on top of the pre-trained networks, we use a feed-forward neural network with gated activation units and enable the model to condition the units based on the client. We will refer to these units as conditional gated activation units (CGAU), see Fig. 1. A federated learning algorithm, such as FedAvg, can then be used to optimize the classifier.
The CGAU consists of two parts. In the gated part (orange), an input is processed by filter and gate weights (learnable $W_f$ and $W_g$, respectively, each of size $K \times D$, where $K$ is the number of units and $D$ is the dimensionality of the input), followed by a hyperbolic tangent or a sigmoidal activation function. The conditioning part (green) shifts the responses of the filter ($W_f x$) and gate ($W_g x$) before the activation functions are applied. A simplifying view of the process, to provide some intuition, is that the conditioning of the filtering responses modulates the global features (“what the feature is”), and the conditioning of the gating responses controls the client expression of features (“whether this feature is active”), each through corresponding learnable weights $V_f$ and $V_g$ of size $K \times C$, where $C$ is the number of clients.
In total, the output of the CGAU, $z$, is:
$$z = \tanh(W_f x + V_f h) \odot \sigma(W_g x + V_g h) \qquad (1)$$
where $x$ and $z$ are the input to and output of a layer in the classifier (biases omitted, $\odot$ is the Hadamard product, and $\tanh$ and $\sigma$ are the hyperbolic tangent and sigmoidal activation functions). In the simulated setup, the conditioning is a one-hot encoding of client IDs, captured as $h$. Importantly, in a real federated learning setting, we do not need to share the learnt client conditioning; for a particular client, the one-hot encoding, in essence, selects the dimension of the matrices $V_f$ and $V_g$ that pertains to the client. This conditioning can simply be maintained locally and can be thought of as an additional local, learnable translation of the filtering and gating responses.
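As a rough illustration of the unit described above, the following sketch computes the forward pass of one conditional gated layer for a batch of samples. All names (`cgau_forward`, `W_f`, `W_g`, `V_f`, `V_g`) and the shapes are assumptions chosen for this example, and a row-vector convention is used, so the weight matrices are stored transposed relative to a column-vector formulation. Looking up a row of the conditioning matrix by client index is equivalent to multiplying it by a one-hot client vector.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cgau_forward(x, client_ids, W_f, W_g, V_f, V_g):
    """Conditional gated activation for a batch (biases omitted).

    x:          (batch, D) inputs
    client_ids: (batch,) integer client index of each sample
    W_f, W_g:   (D, K) shared filter and gate weights
    V_f, V_g:   (C, K) per-client shifts of the filter and gate responses
    """
    # Row lookup replaces the product with a one-hot client encoding
    filt = np.tanh(x @ W_f + V_f[client_ids])
    gate = sigmoid(x @ W_g + V_g[client_ids])
    return filt * gate  # element-wise (Hadamard) product

# Hypothetical sizes: 8 input features, 4 units, 3 clients
rng = np.random.default_rng(0)
D, K, C = 8, 4, 3
W_f, W_g = rng.normal(size=(D, K)), rng.normal(size=(D, K))
V_f, V_g = rng.normal(size=(C, K)), rng.normal(size=(C, K))
x = rng.normal(size=(5, D))
client_ids = np.array([0, 1, 2, 0, 1])
z = cgau_forward(x, client_ids, W_f, W_g, V_f, V_g)
```

In a deployment where the conditioning stays local, each client would hold only its own row of `V_f` and `V_g`, while `W_f` and `W_g` would be the parameters averaged by the server.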
Gated activation units with conditioning in this format have been used both in the WaveNet and PixelCNN architectures (Oord et al., 2016; Van den Oord et al., 2016) in a convolutional form, and are based on previous work on gated activation units in, e.g., highway networks (Srivastava et al., 2015) and long short-term memory cells (Hochreiter and Schmidhuber, 1997).
3.3 Simulating clients with non-IID features
While non-IID clients, both in terms of their label distributions and feature distributions, are inevitable in realistic federated learning, most data sets that are commonly investigated in deep learning do not have any inherent client identification for the samples of the data set. One approach to simulate a more realistic scenario in lieu of actual clients is to assign only a subset of all labels to any given client in a set of simulated clients. This approach mimics a scenario in which the label distributions are non-IID across clients (as in, e.g., Zhao et al., 2018).
However, in realistic settings, we might encounter that clients express a particular class differently, which can also be thought of as having features that are (even class conditionally) non-IID. In order to simulate such non-IID client feature distributions, we investigate simulated clients where the samples of a particular class are clustered based on the feature embedding from a pre-trained network.
We simulate clients by splitting the data across a set of nodes. Each node has a disjoint data set available for federated training, such that the node data sets together make up the full data set. For a given data set with a number of classes, we partition the data into training and test data, and embed the available data using a pre-trained network (cf. Section 3.1). We then learn a dimensionality reduction on the training partition of the embedded data through principal component analysis. For simplicity, we use the first two principal components of the training data and project all the available data onto these.
For each class, we then cluster the training data using k-means clustering, thereby obtaining a number of cluster centroids equal to the number of classes times the number of clusters per class. Each of the clients is then assigned a set of centroids such that samples that are the closest to one of a client’s centroids belong to that client. In this manner, we obtain data distributions in the embedding space that are locally clustered and distinct in each client.
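The procedure above can be sketched as follows. This is a simplified illustration rather than the code used in the experiments: it assumes one cluster per client within each class, assumes that cluster index j of every class is given to client j, and fits the projection and the clustering on a single array instead of fitting on a training partition and applying the result to all data. The helper names (`pca_fit`, `kmeans`, `simulate_clients`) are hypothetical.

```python
import numpy as np

def pca_fit(x, n_components=2):
    """Return the mean and the leading principal directions of x."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components].T

def kmeans(x, k, rng, n_iter=50):
    """Plain Lloyd's algorithm; returns a (k, dim) array of centroids."""
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster empties
                centroids[j] = x[labels == j].mean(axis=0)
    return centroids

def simulate_clients(embeddings, labels, n_clients, rng):
    """Assign every sample to a simulated client via per-class clustering."""
    mean, components = pca_fit(embeddings)
    projected = (embeddings - mean) @ components  # first two components
    client_of = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        centroids = kmeans(projected[idx], n_clients, rng)
        # Assumption: centroid j of every class belongs to client j
        dists = np.linalg.norm(
            projected[idx][:, None, :] - centroids[None, :, :], axis=2)
        client_of[idx] = dists.argmin(axis=1)
    return client_of

# Random stand-in for embeddings from a pre-trained network
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 16))
labels = rng.integers(0, 2, size=200)
client_of = simulate_clients(embeddings, labels, n_clients=4, rng=rng)
```

Because every sample is assigned to exactly one nearest centroid, the resulting client data sets are disjoint by construction.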
3.4 Quantifying heterogeneity of client feature distributions
We are interested in understanding how non-IID data distributions (beyond class distributions) affect federated learning. We consider data from $C$ clients, where the data of the $c$'th client consists of input/label pairs $(x, y)$. We can say that the class distribution at each client follows a categorical distribution over the classes. Previous studies have considered how differences in the class distribution, $p(y)$, for different clients affect learning. In contrast, we are interested in how differences of the input data, $p(x)$, affect learning. We consider derived features, or embeddings, as learnt in a pre-trained network as the basis for investigation, partly inspired by the use of the Fréchet Inception Distance (Heusel et al., 2017) for evaluating the performance of generative adversarial networks.
For the purposes of this paper, we define the overall client data heterogeneity as the average distance from any given client to the rest of the clients in their distributions of embeddings. Under a strong assumption of normality, the level of heterogeneity is quantifiable as a distance between multivariate Gaussians, and we consider the Fréchet distance between two distributions (among other names also known as the 2nd Wasserstein distance).
The Fréchet distance between two multivariate Gaussians, $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$, can be determined as:
$$d^2\big(\mathcal{N}(\mu_1, \Sigma_1), \mathcal{N}(\mu_2, \Sigma_2)\big) = \lVert \mu_1 - \mu_2 \rVert_2^2 + \mathrm{Tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\big) \qquad (2)$$
For a given set of $C$ clients, we are interested in measuring the overall level of heterogeneity, and so we quantify this heterogeneity of clients by determining the distance from any given client to all other clients in their embeddings. We determine the sample mean, $\mu_c$, and covariance, $\Sigma_c$, of the embeddings for both the client-specific data (thus assuming that the embeddings of client $c$ follow $\mathcal{N}(\mu_c, \Sigma_c)$), and a pooling of the data from all other clients ($\mu_{\setminus c}$, $\Sigma_{\setminus c}$). We determine the distance using Eq. 2, and then use the average distance as our measure of heterogeneity, $\bar{d}$:
$$\bar{d} = \frac{1}{C} \sum_{c=1}^{C} d^2\big(\mathcal{N}(\mu_c, \Sigma_c), \mathcal{N}(\mu_{\setminus c}, \Sigma_{\setminus c})\big) \qquad (3)$$
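A minimal numerical sketch of this heterogeneity measure is given below. It uses the standard closed form of the squared Fréchet distance between two Gaussians, as popularized by the Fréchet Inception Distance; whether the reported values are squared distances is an assumption of this sketch, as are the function names. The trace of the matrix square root is computed from the eigenvalues of the product of the two covariance matrices, which are real and non-negative for positive semi-definite inputs.

```python
import numpy as np

def frechet_distance_sq(mu1, cov1, mu2, cov2):
    """Squared Fréchet (2-Wasserstein) distance between two Gaussians."""
    diff = mu1 - mu2
    # Tr((cov1 cov2)^(1/2)) is the sum of square roots of the eigenvalues
    # of cov1 @ cov2; clip tiny negative values caused by round-off.
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)

def client_heterogeneity(embeddings, client_of):
    """Average distance from each client to the pooled data of all others."""
    dists = []
    for c in np.unique(client_of):
        own = embeddings[client_of == c]
        rest = embeddings[client_of != c]
        dists.append(frechet_distance_sq(
            own.mean(axis=0), np.cov(own, rowvar=False),
            rest.mean(axis=0), np.cov(rest, rowvar=False)))
    return float(np.mean(dists))

# Hypothetical example: two clients, first identically distributed,
# then with the second client's embeddings shifted away from the first.
rng = np.random.default_rng(0)
emb = rng.normal(size=(400, 3))
client_of = np.repeat([0, 1], 200)
shifted = emb.copy()
shifted[client_of == 1] += 5.0
h_iid = client_heterogeneity(emb, client_of)
h_shifted = client_heterogeneity(shifted, client_of)
```

In this toy setting `h_shifted` is much larger than `h_iid`, mirroring the way the measure is expected to grow as simulated clients become more distinct.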
We use two data sets to evaluate our proposed scheme, one based on audio data and one on image data. We chose the two data sets to cover imbalanced and balanced data as well as binary and multi-class tasks. In both settings we use a pre-trained network that was not trained on the data set at hand. The pre-trained networks are used as feature extractors to provide the features for a classifier. Experiments were carried out with TensorFlow Federated (Abadi et al., 2015).
We investigate the performance of a classifier that utilizes client adaption through conditional gated activation units, as described in Section 3.2. We contrast the effect of client adaption with a standard feed-forward neural network with rectified linear units without conditioning.
We simulated clients with non-IID features by clustering the embedding features in the manner described in Section 3.3. We can control the level of client heterogeneity by shuffling a certain percentage of the sample client assignments. For a shuffling proportion of 0.0, no samples are randomly re-assigned to any of the clients. In this case, we consider the simulated clients to be maximally non-IID (all samples that are the closest to a particular centroid are collected in one client). On the other hand, a shuffling of 1.0 corresponds to completely random assignment of samples to clients.
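The shuffling step can be sketched as follows. The text does not specify how re-assigned samples choose their new client, so this sketch assumes that each selected sample is re-assigned uniformly at random over all clients; the function name `shuffle_assignments` and its parameters are hypothetical.

```python
import random

def shuffle_assignments(client_of, proportion, n_clients, seed=0):
    """Randomly re-assign a given proportion of samples to clients."""
    rng = random.Random(seed)
    shuffled = list(client_of)
    n_shuffle = round(proportion * len(shuffled))
    # Pick distinct samples and draw a new client for each of them
    for i in rng.sample(range(len(shuffled)), n_shuffle):
        shuffled[i] = rng.randrange(n_clients)  # assumption: uniform draw
    return shuffled

# Hypothetical example: 100 samples split evenly between two clients
base = [0] * 50 + [1] * 50
unchanged = shuffle_assignments(base, 0.0, n_clients=2)
partly = shuffle_assignments(base, 0.4, n_clients=2)
random_all = shuffle_assignments(base, 1.0, n_clients=2)
```

With a proportion of 0.0 the cluster-based assignment is returned unchanged, and with 1.0 every sample receives a freshly drawn client, matching the two extremes described above.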
In Experiment I, we investigate data from the Freesound Database (FSD) Kaggle 2018 data set (Fonseca et al., 2018). FSD contains approximately 11k audio clips from 41 different classes (such as laughter, keys jangling, writing, trumpet, and coughing). In Experiment II, we investigate CIFAR-10 (Krizhevsky et al., 2009), a popular image data set featuring 32x32 resolution natural images from ten different classes. Additionally, we illustrate how the conditioning through CGAU changes the neural network solution on an XOR problem in Appendix A.
4.1 Experiment I: audio cough detection
In our investigation of audio data, we construct a binary problem from the FSD by subdividing the labels into a positive class of audio labelled as cough, and a much larger negative class of any other label in FSD. FSD as a whole has a total of 273 examples of the cough label.
We use a pre-trained MobileNet (Howard et al., 2017) called YamNet (maintained by M. Plakal and D. Ellis and available in the TensorFlow AudioSet research repository), trained on the AudioSet data (Gemmeke et al., 2017), to obtain embeddings. Audio inputs are resampled to 16 kHz mono signals and converted to stabilized log-mel spectrograms. YamNet outputs a 1024-dimensional vector for each patch in the log-mel spectrogram, where a patch corresponds to 960 ms of waveform input (only full 960 ms patches are considered and any remainder of the signal is dropped). We obtain an embedding robust to the varying time-length of the FSD audio samples by max-pooling the patch features across the patches (the temporal dimension).
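The pooling step is simple enough to show in a few lines. The sketch below uses random arrays as stand-ins for the per-patch outputs of the feature extractor; no real model is called, and the function name `pool_patches` is hypothetical.

```python
import numpy as np

def pool_patches(patch_embeddings):
    """Max-pool (n_patches, dim) patch features into one (dim,) vector."""
    return np.asarray(patch_embeddings).max(axis=0)

# Stand-ins for two clips of different length (3 and 11 patches)
rng = np.random.default_rng(0)
short_clip = rng.normal(size=(3, 1024))
long_clip = rng.normal(size=(11, 1024))
emb_short = pool_patches(short_clip)
emb_long = pool_patches(long_clip)
```

Both clips end up with an embedding of the same dimensionality regardless of their duration, which is what allows a fixed-size classifier to be trained on top.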
We train a 2-layer classifier with 64 units and 50 % dropout between layers. The classifier is trained using FedAvg (McMahan et al., 2017) with 10 simulated clients, and at each round of federated learning all 10 clients were included. Each client completed 10 steps of gradient descent with batch sizes of 32 samples, or until all client data had been seen once; each client had a data set of variable size, since samples are assigned based on proximity to the cluster centroids. The clients utilized a stochastic gradient descent optimizer. (A momentum and a decay were specified, but at a later stage it was discovered that the optimizer states were erroneously reset at each round, i.e. every 10 steps, effectively resulting in no decay and momenta building over only 10 steps.) We retain the original FSD training and test partitions. While training, we monitor the loss on a held-out sub-partition of 5 % of the training set (a validation partition) by centrally collecting the outcomes on the validation data points at each client. The model (model weights) with the lowest cross-entropy loss on the validation partition across clients after a total of 1000 rounds of federated learning is then used in the final evaluation on the test set.
We determine the average Fréchet distances from any given client’s features to all others (as described in Section 3.4). The results averaged across replicates of the shuffling proportion are shown in Table 1.
We evaluate the performance for shuffling proportions ranging from 0.0 to 1.0 in steps of 0.2, with 12 repetitions of the experiment at each level, both with and without client adaption. Since the data is imbalanced, we measure the model performance in terms of the area under the receiver operating characteristic curve (AUC). The results are shown in Fig. 1(a). We see that a model without client adaption (in blue) performs consistently at about 0.994 across the range of shuffling. The model with client adaption (orange) outperforms this baseline model for the most non-IID clients (lowest levels of shuffling) with an AUC of about 0.998, thus improving performance for the most heterogeneous simulated clients. For more homogeneous clients, we see no discernible difference between the performance of the two models.
Table 1: Average client Fréchet distance at each proportion of shuffled labels.
| Proportion of shuffled labels | 0.0 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 |
| Average client Fréchet distance | 1784.4 | 980.0 | 550.5 | 271.0 | 91.8 | 19.9 |
4.2 Experiment II: image label classification
To show that our approach also works on a balanced and more challenging data set, we show results on CIFAR-10 (Krizhevsky et al., 2009).
We use an Inception architecture pre-trained on ImageNet (Szegedy et al., 2016; Deng et al., 2009); the model is taken from github.com/pytorch/vision/tree/master/torchvision. The embeddings were extracted from the next-to-last layer. To align the CIFAR-10 images with the resolution of ImageNet, we upsampled the images to 299x299 pixels. We randomly partitioned the training data set into 80 % training data and 20 % validation data. The classifier network mirrors the one used in the audio experiment, except that the dimensions were increased to 128 hidden units in each of the two layers. The federated learning scheme is similar to the audio experiment, but utilizes an increased learning rate of 0.01, and all models were trained for 500 epochs. All configurations were run five times with different seeds. Training with CGAU took roughly 85 % longer per epoch in the simulated setting.
We show results for this task in Fig. 1(b). Since the class sizes of CIFAR-10 are balanced, we show model performance in terms of the accuracy. The model with client adaption outperforms the baseline (without client adaption) at all levels of label shuffling. However, it works particularly well when there is a large difference in the training distribution between clients.
In Fig. 1(a) and Fig. 1(b) we see that the models trained with the proposed conditional gated activation unit outperform or perform on par with the baseline model at all levels of diversity between clients. In additional experiments on a toy data set we show that CGAU does enable the network to learn client-specific features and feature expressions (results in Appendix A).
Notably, CGAU is particularly useful when there is a high diversity between clients (no or little label shuffling). This may be more pronounced due to the effects of transfer learning, i.e. fine-tuning an already pre-trained model on the available task-specific data. As a result, the embeddings from the lower layers may be worse at “discounting” client-variant features, making it particularly important to learn client-specific embeddings along with global embeddings.
In preliminary experiments on CIFAR-10, we found that the gain from using CGAUs was not as pronounced when the feature embeddings were extracted from a network pre-trained on CIFAR-10 (and not on ImageNet, as shown in Fig. 1(b)). We theorize that CGAUs are particularly useful when the feature extractor was not trained on the same data distribution, as the feature extractor will not be as adept at extracting invariant features while discarding spurious features in the embedding.
To ensure that low shuffling does correspond to higher diversity between client data distributions, we provide the average Fréchet distance from each individual client to the remaining data in Table 1. We see that the label shuffling rates correlate well with this distance, implying that clustering based on the embedding may be a good alternative, or addition, to class clustering for artificially creating non-IID data sets. In particular, it allows us to measure the impact of data that is non-IID even when conditioned on the classes.
A surprising effect is that applying CGAU results in predictive accuracy increasing for diverse data distributions (low proportion of shuffling). We would expect the accuracy to be decreased for the base model for diverse data distributions as the models diverge between synchronizations and to stay relatively equal for the model with CGAU. Instead the base model performs relatively equal across all levels of label shuffling whereas the CGAU model performs better for a low proportion of label shuffling. We hypothesize that this is due to the model learning the specific data distribution from each client.
While the cough detection problem is highly imbalanced, the classification task is a less challenging problem when using a well-suited pre-trained network. A considerable portion of the audio samples are, e.g., musical instruments with tonal qualities, that are quite straightforward to distinguish from coughs. This is also evident from the experimental results, where even the baseline effectively solves the task with an AUC of about 0.994.
A marked difference in problem complexity between Experiment I and II is evident in the performance difference at homogeneous clients (shuffling of 1.0). The model capacity (if naïvely measured as parameter count) is doubled by the filtering and gating alone when using the CGAU compared to the baseline. This increase in capacity is not beneficial for the homogeneous clients in the audio task, but does increase the test accuracy on the CIFAR-10 problem from about 0.780 to 0.785. The added capacity also increases the computational power needed for each round of learning, potentially exacerbating problems with, e.g., stragglers in real federated learning systems; whether the performance gains of CGAU are worth such trade-offs remains to be investigated.
In our experiments, we worked with a relatively low number of clients (ten clients for both experiments). This was done to ensure that the clustering as described in Section 3.3 resulted in semantically meaningful clusters of data samples. Examining the effect with a larger number of clients is needed, seeing as FL usually works with a (much) larger number of clients. A potential alternative solution may be to use a clustering scheme as suggested in Ghosh et al. (2019) to find clients with shared attributes. However, sharing characteristics specific to a client in a privacy-compliant manner may be a challenge.
Incorporating client-information directly in federated learning models, even if only locally, is a potential opening for attacks on privacy. The client-specific conditioning does not need to be shared, yet any use of conditioning in the manner of the investigated CGAU would need to be evaluated for robustness to attacks.
Previous work has shown that models trained in a federated learning manner converge much more slowly if the data sets used are non-IID between clients (Li et al., 2019a). Since this is an extremely common characteristic when learning from sensitive user data, it presents a serious hindrance to the utilization of federated learning.
We present a simple approach to reduce the impact of local features by learning patterns specific to each client along with the global model. In experiments we show that our approach outperforms the baseline for scenarios with heterogeneous clients. We find evidence that our approach may be particularly beneficial when using a transfer learning approach to extract embeddings.
- TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
- How to backdoor federated learning. arXiv preprint arXiv:1807.00459.
- ImageNet: a large-scale hierarchical image database. In , pp. 248–255.
- General-purpose tagging of Freesound audio with AudioSet labels: task description, dataset, and baseline. arXiv preprint arXiv:1807.09902.
- Audio Set: an ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780.
- Robust Federated Learning in a Heterogeneous Environment.
- GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637.
- Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
- Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.
- Learning multiple layers of features from tiny images.
- On the Convergence of FedAvg on Non-IID Data.
- On the Convergence of FedAvg on Non-IID Data.
- Federated Learning in Mobile Edge Networks: A Comprehensive Survey. IEEE Communications Surveys and Tutorials.
- Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282.
- Communication-Efficient Learning of Deep Networks from Decentralized Data. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017.
- Poisoning attacks on federated learning-based IoT intrusion detection system.
- WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499.
- Federated Adversarial Domain Adaptation. Technical report.
- Robust and Communication-Efficient Federated Learning From Non-i.i.d. Data. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14.
- Highway networks. arXiv preprint arXiv:1505.00387.
- Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
- Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798.
- Federated learning with differential privacy: algorithms and performance analysis. IEEE Transactions on Information Forensics and Security, pp. 1–1.
- Federated Learning with Non-IID Data.
Appendix A CGAU on XOR dataset
We can illustrate the workings of the CGAU by investigating an augmented version of the classic XOR problem. We consider a two-class problem with two features, and . The negative class (red) consists of two clusters for which the features have the same sign, and the positive class (blue) is characterized by having the features of opposite sign. The augmentation to the XOR problem is that we consider the clusters to be from two different clients: client one (“up client”) has only positive , and client two (“down client”) has only negative . Ignoring the clients and solving the problem using a multi-layer perceptron (MLP, with two hidden units) results in a solution of the form shown in Fig. 2(a), whereas a CGAU with one gated unit () results in a solution of the form shown in Fig. 2(b).
The un-conditioned filter and gate outputs are shown in Fig. 2(c) and Fig. 2(d). We see how the filter has learnt a split on the sign of , and the gate is shutting off any information for negative . Looking at Fig. 2(e) and Fig. 2(f), we see how the CGAU has learnt to modulate the up client's filter feature by moving the shift in activation towards more positive , and similarly, we see that it has learnt to shift the gating for the down client towards more negative .
In a sense, this enables the CGAU to solve the problem for the down client using “client-specific expression”, and enables the CGAU to solve the problem for the up client by using “client-specific modulation”; this assertion can be confirmed by turning off the two types of conditioning when making a plot like Fig. 2(b), which is shown in Fig. 2(h) and Fig. 2(g).