Distributed Learning for Time-varying Networks: A Scalable Design

07/31/2021 ∙ by Jian Wang, et al. ∙ HUAWEI Technologies Co., Ltd.

The wireless network is undergoing a trend from "connection of things" to "connection of intelligence". With data spread over the communication networks and computing capability enhanced on the devices, distributed learning has become a hot topic in both the industrial and academic communities. Many frameworks, such as federated learning and federated distillation, have been proposed. However, few of them address obstacles such as the time-varying topology resulting from the characteristics of wireless networks. In this paper, we propose a distributed learning framework based on a scalable deep neural network (DNN) design. By exploiting the permutation equivariance and invariance properties of the learning tasks, DNNs with different scales for different clients can be built up from two basic parameter sub-matrices. Further, model aggregation can also be conducted based on these two sub-matrices to improve the learning convergence and performance. Finally, simulation results verify the benefits of the proposed framework by comparison with several baselines.


I Introduction

Future wireless systems are envisaged to be deeply integrated with artificial intelligence (AI) technologies to provide both communication and AI services [7]. In the past decade, the great progress in three key driving forces, i.e., diverse data sets, advanced algorithms, and powerful computing capability, has played an important role in the success of AI applications [5]. Hence, deploying AI in a particular scenario such as a wireless system always begins with considering three factors: how to collect data, how to process the data, and which device performs the processing. For example, in Internet of Things (IoT) and vehicular communication systems, a large volume of data is collected and analyzed by devices to provide classification, regression, prediction, and decision-making functions for different tasks. In these cases, distributed learning is a proper enabler for AI solutions [1].

A number of distributed learning frameworks have been proposed recently, among which the most popular one is federated learning (FL) [4]. In FL, each client trains a local learning model using its locally collected data and then uploads the local model to a central server. The central server aggregates the multiple local models into a global one and broadcasts it to the clients for the next training round. FL exploits the computing capability of distributed clients and keeps users' data local at the clients without directly exchanging it. A common assumption in FL is that the basic structure of the learning models (both the local ones and the global one) is identical. Although model compression, sparsification, and pruning may alleviate the computation and communication overhead, they must still be based on the same structure.
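To make the FL procedure above concrete, the following minimal Python sketch shows one aggregation round in the FedAvg style of [4]; the flat weight array and the local_train routine are placeholder assumptions for illustration, not the implementation used in this paper.

```python
import numpy as np

def fedavg_round(global_weights, clients, local_train):
    """One illustrative FL round: broadcast, local training, weighted averaging.

    clients: list of (dataset, num_samples) pairs; local_train: any routine that
    returns updated weights of the same shape. Both are placeholders.
    """
    local_models, sizes = [], []
    for dataset, num_samples in clients:
        w = local_train(np.copy(global_weights), dataset)  # start from the broadcast model
        local_models.append(w)
        sizes.append(float(num_samples))
    total = sum(sizes)
    # Aggregate the local models into a global one, weighted by local data volume.
    return sum(w * (n / total) for w, n in zip(local_models, sizes))
```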

On the other hand, distributed clients may need to cooperate with each other to finish a task together. The learning at each client then depends not only on its local data but also on the data and actions of its neighbors. Information exchange is allowed in this case, and the input of the learning model at each client may include both local and neighboring data. The numbers of neighbors of different clients may differ and change over time, which implies that the input of the learning model at each client may be different and time-varying. Retraining a model repeatedly is impractical due to the large overhead of the training phase. Pre-defining a maximum possible number of neighbors and training a correspondingly large model may be a solution. However, the assumption of knowing the maximum possible number of neighbors is not always feasible in practice, and such a model cannot handle the situation where the number of neighbors exceeds the pre-defined one.

In this paper, we propose a novel distributed learning framework that can be adopted in wireless networks with time-varying topology. First, to cope with the different input dimensions of the learning models at the clients, a scalable deep neural network (DNN) structure is designed. Here, similar to previous works [2, 6], the permutation equivariance (PE) and permutation invariance (PI) properties, which can be found in many applications [9], are exploited. Based on this PE/PI prior knowledge, the DNN structure can be simplified and viewed as a combination of two basic blocks. DNNs with different scales can then be easily built up by using different numbers of basic blocks at the clients to handle both local and neighboring data. Moreover, model aggregation at the central server can also be carried out based on the two basic blocks, which makes aggregation over DNNs with different scales possible. We evaluate the proposed framework through simulations and show that, with far fewer model parameters, it achieves accuracy comparable to a fixed large-size design while adapting to topology changes.

The paper is organized as follows. The application scenario considered in this paper is introduced in Section II. Then, in Section III, the proposed distributed learning framework is elaborated. Simulation results and discussions are provided in Section IV. Finally, we summarize the work in Section V.

II System Model

Fig. 1: System model.

As shown in Fig. 1, we consider a wireless network with one central unit, such as a base station (BS), and multiple distributed clients, such as mobile user equipments (UEs). A distributed learning task is assigned to this wireless network, where each distributed client needs to perform the learning based on both its local data and the data from its neighbors. A typical application of this system model is object recognition where different cameras and sensors capture different features of the same object, which can be regarded as a multi-view learning task [8]. This system model can also find applications in IoT systems and wireless sensor networks (WSNs).

We characterize the system by a graph, where the vertices are the distributed clients and the edges indicate the connectivity between clients. Different clients may have different numbers of neighbors, e.g., Client E has only one neighbor, Client C, while Client D has two neighbors, Clients B and C. Moreover, since the network is wireless, the topology of the graph is highly dynamic. Due to wireless link quality and the on/off activity of clients, the graph structure may be time-varying. For example, considering the graph in Fig. 1, Client C has four neighbors, Clients A, B, D, and E, in the first time slot. If Client D departs from the system in the second time slot, the neighbors of Client C change to Clients A, B, and E. The DNNs deployed on the clients need to handle both local and neighboring data, so at least the input dimensions of these DNNs differ. As the graph changes over time, the input dimension of the DNN on a given client may also be time-varying.
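As a small illustration of the graph just described, the neighbor sets below follow the neighbor counts given above (the exact edge set is an assumption inferred from Fig. 1) and show how the input dimension of each client's DNN tracks the time-varying topology.

```python
# Neighbor sets implied by the description of Fig. 1 (an inferred assumption):
topology_slot1 = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D", "E"},
    "D": {"B", "C"},
    "E": {"C"},
}

def drop_client(topology, client):
    """Topology after `client` switches off (e.g., Client D in the second slot)."""
    return {u: nbrs - {client} for u, nbrs in topology.items() if u != client}

topology_slot2 = drop_client(topology_slot1, "D")
# The input of each local DNN covers (number of neighbors + 1) blocks of data.
input_blocks = {u: len(nbrs) + 1 for u, nbrs in topology_slot2.items()}
```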

A server is deployed on the central unit of the system. It can help speed up the training process by aggregating local models from the clients. However, the model aggregation must account for the fact that the scales of the DNNs may differ.

III Distributed Learning with a Scalable DNN Design

III-A PE/PI-based Scalable DNN Design

PE and PI are well-known and widely used properties which can be stated as follows. Given an input vector $\mathbf{x}$ and a function $f(\cdot)$, if for an arbitrary permutation $\pi(\cdot)$ applied to $\mathbf{x}$ the output follows the same permutation, i.e., $f(\pi(\mathbf{x})) = \pi(f(\mathbf{x}))$, we say that the function $f$ is PE with respect to the input $\mathbf{x}$. If the output of $f$ does not change under any permutation of the input, i.e., $f(\pi(\mathbf{x})) = f(\mathbf{x})$, we say that it has the PI property. Many distributed learning tasks have PE and PI properties. For example, in the UE scheduling problem in wireless networks, the scheduling priority of a target UE depends on both its own state, such as its channel condition, and the states of the other UEs; if the states of the UEs are permuted before being input to the scheduler, the output scheduling priorities permute in the same way. The PE and PI properties can also be found in multi-view learning tasks, where permuting the order of the input neighboring data does not affect the output of the learning model.
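A quick numerical check of the two properties, using a per-client elementwise map as a PE example and a sum pooling as a PI example (both chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))      # states of 4 clients, 3 features each
perm = rng.permutation(4)            # an arbitrary permutation of the clients

f_pe = lambda x: np.tanh(x)          # applied per client -> permutation equivariant
assert np.allclose(f_pe(x[perm]), f_pe(x)[perm])

f_pi = lambda x: x.sum(axis=0)       # symmetric pooling -> permutation invariant
assert np.allclose(f_pi(x[perm]), f_pi(x))
```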

Fig. 2: Structure of DNNs with PE/PI property.

To realize the aforementioned function through a DNN with $L$ layers, we can rely on the parameter sharing structure shown in Fig. 2. The input $\mathbf{x} = [\mathbf{x}_1^T, \dots, \mathbf{x}_K^T]^T$ includes the states from $K$ clients, i.e., one target client and its $K-1$ neighbors, and $\mathbf{h}^{(l)}$ is the output vector of the $l$-th hidden layer. If the whole vector $\mathbf{h}^{(L)}$ is considered as the output, the DNN has the PE property: a permutation of the input blocks results in the same permutation of the output blocks. If only the block of $\mathbf{h}^{(L)}$ corresponding to the target client is considered, the DNN has the PI property, because a permutation of the neighboring input blocks does not change this output. The processing of the $l$-th hidden layer can be expressed as

$\mathbf{h}^{(l)} = f\big(\mathbf{W}^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\big),$ (1)

where $\mathbf{h}^{(0)} = \mathbf{x}$, $\mathbf{W}^{(l)}$ is the weight matrix and $\mathbf{b}^{(l)}$ is the bias vector of the $l$-th hidden layer, and $f(\cdot)$ is the activation function. With $\mathbf{h}^{(l-1)} = [\mathbf{h}_1^{(l-1)T}, \dots, \mathbf{h}_K^{(l-1)T}]^T$ and $\mathbf{b}^{(l)} = \mathbf{1}_K \otimes \boldsymbol{\beta}^{(l)}$, where $\mathbf{1}_K$ is a vector with all-one elements and length equal to $K$, Eq. (1) can be rewritten as

$\mathbf{h}_k^{(l)} = f\Big(\textstyle\sum_{j=1}^{K}\mathbf{W}_{k,j}^{(l)}\mathbf{h}_j^{(l-1)} + \boldsymbol{\beta}^{(l)}\Big), \quad k = 1, \dots, K.$ (2)

To guarantee the PE/PI property of the DNN, $\mathbf{W}^{(l)}$ should have the following structure

$\mathbf{W}^{(l)} = \begin{bmatrix} \mathbf{U}^{(l)} & \mathbf{V}^{(l)} & \cdots & \mathbf{V}^{(l)} \\ \mathbf{V}^{(l)} & \mathbf{U}^{(l)} & \cdots & \mathbf{V}^{(l)} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{V}^{(l)} & \mathbf{V}^{(l)} & \cdots & \mathbf{U}^{(l)} \end{bmatrix},$ (3)

where $\mathbf{W}^{(l)}$ consists of two sub-matrices, i.e., $\mathbf{U}^{(l)}$ and $\mathbf{V}^{(l)}$, which is also shown in Fig. 2. Hence, DNNs with different scales can be built up based on these two sub-matrices to handle different input dimensions.
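A minimal sketch of how a layer weight of the form (3) can be assembled from the two sub-matrices for an arbitrary number of input blocks; the toy sizes and the helper name are assumptions.

```python
import numpy as np

def build_layer_weight(U, V, K):
    """Assemble the weight of Eq. (3): U on the block diagonal, V elsewhere.

    U, V: the two trainable sub-matrices (same shape);
    K: number of input blocks, i.e., the target client plus its current neighbors.
    """
    rows = [np.hstack([U if i == j else V for j in range(K)]) for i in range(K)]
    return np.vstack(rows)

U = np.full((2, 2), 1.0)           # toy sub-matrices, just to show the block pattern
V = np.full((2, 2), 0.1)
W4 = build_layer_weight(U, V, 4)   # e.g., a client with three neighbors, as in (4)
```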

Iii-B The Proposed Distributed Learning Framework

Fig. 3: Proposed distributed learning framework.

Relying on the PE/PI-based scalable DNN design, we propose a distributed learning framework as shown in Fig. 3. Each distributed client uses a DNN with the aforementioned structure, whose scale can be adjusted according to the number of neighbors. For example, Client B has three neighbors, hence the parameter matrix of its DNN can be expressed as

$\mathbf{W}_B^{(l)} = \begin{bmatrix} \mathbf{U}^{(l)} & \mathbf{V}^{(l)} & \mathbf{V}^{(l)} & \mathbf{V}^{(l)} \\ \mathbf{V}^{(l)} & \mathbf{U}^{(l)} & \mathbf{V}^{(l)} & \mathbf{V}^{(l)} \\ \mathbf{V}^{(l)} & \mathbf{V}^{(l)} & \mathbf{U}^{(l)} & \mathbf{V}^{(l)} \\ \mathbf{V}^{(l)} & \mathbf{V}^{(l)} & \mathbf{V}^{(l)} & \mathbf{U}^{(l)} \end{bmatrix}.$ (4)

However, Client E has only one neighbor and its parameter matrix is

$\mathbf{W}_E^{(l)} = \begin{bmatrix} \mathbf{U}^{(l)} & \mathbf{V}^{(l)} \\ \mathbf{V}^{(l)} & \mathbf{U}^{(l)} \end{bmatrix}.$ (5)

When the topology of the wireless network changes, the scale of the DNN used at each client can also change accordingly. For example, if Client B departs from the network at the $t$-th time slot, the parameter matrix of the DNN used by Client A, which has two neighbors (Clients B and C), will change from $\mathbf{W}_A^{(l)}$ to $\tilde{\mathbf{W}}_A^{(l)}$, where

$\mathbf{W}_A^{(l)} = \begin{bmatrix} \mathbf{U}^{(l)} & \mathbf{V}^{(l)} & \mathbf{V}^{(l)} \\ \mathbf{V}^{(l)} & \mathbf{U}^{(l)} & \mathbf{V}^{(l)} \\ \mathbf{V}^{(l)} & \mathbf{V}^{(l)} & \mathbf{U}^{(l)} \end{bmatrix}$ (6)

and

$\tilde{\mathbf{W}}_A^{(l)} = \begin{bmatrix} \mathbf{U}^{(l)} & \mathbf{V}^{(l)} \\ \mathbf{V}^{(l)} & \mathbf{U}^{(l)} \end{bmatrix}.$ (7)
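The rescaling in (6)-(7) only changes the number of blocks, not the trained sub-matrices; the compact Kronecker form below is one equivalent way to express it (the toy values of U and V are assumptions).

```python
import numpy as np

U = np.arange(4.0).reshape(2, 2)   # toy "trained" sub-matrices (assumed values)
V = -np.ones((2, 2))

def scale_to(K):
    # The identity picks U on the diagonal blocks, its complement picks V off-diagonal.
    return np.kron(np.eye(K), U) + np.kron(1.0 - np.eye(K), V)

W_A     = scale_to(3)   # cf. Eq. (6): Client A with two neighbors
W_A_new = scale_to(2)   # cf. Eq. (7): after Client B departs
```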

After training the local DNN based on the local and neighboring data, each client uploads the locally trained sub-matrices $\mathbf{U}^{(l)}$ and $\mathbf{V}^{(l)}$ to the central server, where the global sub-matrices $\bar{\mathbf{U}}^{(l)}$ and $\bar{\mathbf{V}}^{(l)}$ are obtained through model aggregation. Any commonly used aggregation algorithm can be adopted here. Although the scales of the DNNs used by the clients may differ from each other, model aggregation can still be performed through the same basic building blocks, i.e., the two sub-matrices. Similar to FL, the model aggregation can make use of data from multiple clients and speed up the whole learning procedure.
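A sketch of the server-side aggregation over the two sub-matrices, using plain (optionally weighted) averaging as one example of the commonly used aggregation algorithms mentioned above:

```python
import numpy as np

def aggregate_submatrices(local_results, weights=None):
    """Aggregate uploaded (U, V) pairs into global sub-matrices by weighted averaging.

    local_results: list of (U_k, V_k) tuples from the clients;
    weights: optional per-client weights, e.g., local sample counts.
    """
    n = len(local_results)
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, float) / np.sum(weights)
    U_bar = sum(wk * Uk for wk, (Uk, _) in zip(w, local_results))
    V_bar = sum(wk * Vk for wk, (_, Vk) in zip(w, local_results))
    return U_bar, V_bar
```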

Suppose $\mathbf{U}^{(l)}$ and $\mathbf{V}^{(l)}$ are of the same size and contain $S$ parameters each. Then the effective parameter matrix $\mathbf{W}^{(l)}$ contains $K^2 S$ parameters, where $K$ is the number of neighbors plus one. Thanks to the parameter sharing, the number of trainable parameters is only $2S$, which is also the communication payload size for local model uploading and global model broadcasting. Compared to directly using a large DNN with $K^2 S$ parameters, the DNN used in the proposed distributed learning framework can save up to $K^2/2$ times computation and communication overhead.
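A worked example of this saving, with an assumed sub-matrix size:

```python
S = 64 * 64                  # assumed number of parameters in each of U and V
for K in (2, 4, 5):          # K = number of neighbors plus one
    full = (K * 64) ** 2     # an unshared weight matrix of the same layer
    shared = 2 * S           # only U and V are trained and exchanged
    print(K, full / shared)  # saving factor K**2 / 2: 2.0, 8.0, 12.5
```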

IV Simulations

To evaluate the performance of the proposed distributed learning framework, we design a test case and show the simulation results in this section.

We consider a classification task on the MNIST dataset [3] in a network with 5 distributed clients. Both the training set with 60000 samples and the test set with 10000 samples are divided into five non-overlapping subsets, each of which is assigned to one of the five clients. To turn the task into a cooperative one, we add noise to the samples, and different noisy versions of the same picture are allocated to the target client and its neighbors. The target client tries to identify the class of the item in the picture by inputting its local noisy version and the neighboring noisy versions of the picture. The original and noisy versions of some sample pictures are shown in Fig. 4.
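A minimal sketch of how the noisy views could be generated; the -10 dB figure follows the caption of Fig. 4, while the normalization and Gaussian noise model are assumptions.

```python
import numpy as np

def make_noisy_views(image, num_views, snr_db=-10.0, rng=None):
    """Create independently noised copies of one picture, one per cooperating client."""
    rng = np.random.default_rng() if rng is None else rng
    x = image.astype(np.float64) / 255.0
    noise_power = np.mean(x ** 2) / (10.0 ** (snr_db / 10.0))
    return [x + rng.normal(0.0, np.sqrt(noise_power), x.shape) for _ in range(num_views)]

# e.g., five noisy views of one (random stand-in) 28x28 picture:
views = make_noisy_views(np.random.randint(0, 256, (28, 28)), num_views=5)
```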

Fig. 4: Sample pictures in MNIST (left: original, right: -10 dB noisy).
Fig. 5: General structure of the used DNN.

For comparison, we consider three baselines. The first one is a single-client scenario with no neighbors and no model aggregation. For the second baseline, cooperation among clients is neglected and each client performs the classification based only on its local data, i.e., one noisy picture, but the local models are aggregated at the server. For the third baseline, a large DNN is used, where five pictures can be processed simultaneously. Obviously, this large DNN suits Client C, which has four neighbors. However, for the other clients with fewer neighbors, zeros are padded so that the input has the same size as five pictures.
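A small sketch of the zero padding used by the third baseline; the stacking convention is an assumption.

```python
import numpy as np

def pad_views(views, max_views=5, img_shape=(28, 28)):
    """Zero-pad the available noisy views so the fixed-size DNN always sees max_views pictures."""
    padding = [np.zeros(img_shape) for _ in range(max_views - len(views))]
    return np.stack(list(views) + padding)   # shape: (max_views, 28, 28)

# e.g., Client E owns only two views (its own plus one neighbor's), so three are zeros:
batch_for_large_dnn = pad_views([np.ones((28, 28)), np.ones((28, 28))])
```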

The proposed framework and the baselines use a similar DNN structure, which includes two convolutional layers with 2-by-2 pooling and ReLU activation, and three fully connected (FC) layers with ReLU activation for the first two FC layers and Softmax activation for the last one. The general structure of the DNN is shown in Fig. 5, where $k$ is a scaling factor that takes different values for the different methods. For the first and second baselines, which do not consider data from neighbors, we set $k = 1$. For the third baseline with a large DNN, $k$ is set to 5. Because the DNN used by the proposed method is built up from the two sub-matrices $\mathbf{U}^{(l)}$ and $\mathbf{V}^{(l)}$, both of which have the same dimensions as the corresponding weight matrices in the first baseline, $k$ is equal to the number of neighbors of the client plus one. It is worth noting that, thanks to the parameter sharing design, the number of trainable parameters of the DNN used by the proposed method is only twice that of the first baseline. Other simulation parameters are summarized in Table I.
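A sketch of the general structure of Fig. 5 in PyTorch (the framework, the channel widths, and the exact places where the scaling factor k enters are assumptions; the paper only specifies the layer types):

```python
import torch.nn as nn

class ScaledNet(nn.Module):
    """Two conv layers with 2x2 pooling and ReLU, then three FC layers (ReLU, ReLU, Softmax)."""

    def __init__(self, k=1, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(k, 8 * k, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8 * k, 16 * k, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * k * 7 * 7, 64 * k), nn.ReLU(),
            nn.Linear(64 * k, 32 * k), nn.ReLU(),
            nn.Linear(32 * k, num_classes), nn.Softmax(dim=1),
        )

    def forward(self, x):        # x: (batch, k, 28, 28), the k stacked noisy views
        return self.classifier(self.features(x))
```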

Parameters                        Values
Number of clients                 5
Data set                          MNIST
Number of training samples        12000 per client
Number of test samples            2000 per client
Learning rate                     0.05
Decaying rate of learning rate    0.99
Training batch size               500
Test batch size                   128
Number of training epochs         20
Testing frequency                 every 2 epochs
TABLE I: Simulation Settings
Fig. 6: The simulated time-varying scenario.

We consider the scenario with a time-varying topology shown in Fig. 6. In the first time slot, the connectivity among clients is the same as in Fig. 1. Then, in the second time slot, Client D is off; hence, the numbers of neighbors of Clients B and C are each decreased by one. With the clients ordered as A, B, C, D, E, the connectivity can be represented by the adjacency matrices

$\mathbf{A}_1 = \begin{bmatrix} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix}$ (8)

and

$\mathbf{A}_2 = \begin{bmatrix} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix}$ (9)

for the first and second time slots, respectively.

Fig. 7: Accuracy of different methods in fixed topology scenario.

The performance comparison among the aforementioned methods is first done with the fixed topology of the first time slot. The accuracy of each method during the training phase is shown in Fig. 7. The first baseline, with only one client considered, performs the worst, while with the help of model aggregation the second baseline performs better than the first, although no data from neighbors is used. With the neighboring data taken into account, both the third baseline and the proposed method show superior performance. Although the proposed method suffers a small performance degradation compared to the third baseline, the third baseline uses a much larger DNN. In the simulated case, the number of parameters of the DNNs used in the third baseline is 1348630, while the number for the proposed method is only 109628. This means the proposed method needs less than one tenth of the communication and computation overhead of the third baseline. Moreover, the maximum possible number of participants in the learning may be hard to pre-define in practice, which also hinders the third baseline from being deployed in real systems.

Method                    Baseline 3 (large DNN)    Proposed method
Number of parameters      1348630                   109628
Time Slot 1   Client A    0.9589                    0.9487
              Client B    0.9668                    0.9593
              Client C    0.9709                    0.9656
              Client D    0.9592                    0.9499
              Client E    0.9366                    0.9208
              Average     0.9585                    0.9489
Time Slot 2   Client A    0.9596                    0.9490
              Client B    0.9577                    0.9493
              Client C    0.9677                    0.9601
              Client E    0.9357                    0.9206
              Average     0.9551                    0.9447
TABLE II: Inference Performance for Time-varying Scenario

We now evaluate the performance in the time-varying scenario. Because the first and second baselines do not consider neighboring data, the time-varying topology does not affect their performance; hence, we only compare the proposed framework with the third baseline. The DNNs are trained for 20 epochs and then fixed for inference over 10 iterations. The results averaged over the 10 iterations are shown in Table II. Both methods work in the time-varying network. The departure of Client D reduces the numbers of neighbors of Clients B and C, so their performance degrades. Again, we emphasize that the third baseline is used here only for comparison; it is hard to deploy in practice because it assumes knowledge of the maximum possible number of participants in the learning.

V Conclusions

In this paper, we proposed a distributed learning framework based on a scalable DNN design to conduct cooperative tasks in wireless networks with time-varying topology. Relying on the PE/PI properties, the parameter matrices of DNNs with different scales are built up from the same two basic sub-matrices. Hence, each distributed client can form its own DNN from these two sub-matrices according to its number of neighbors, and a change in the number of neighbors is handled by rebuilding the DNN accordingly. Aggregation of local models with different scales can also be done at the central server based on these two sub-matrices, which improves the convergence and performance of the global model. Simulations show that the proposed method can effectively handle scenarios with time-varying topology.

References

  • [1] M. Chen, D. Gündüz, K. Huang, W. Saad, M. Bennis, A. V. Feljan, and H. V. Poor (2021) Distributed learning in wireless networks: recent progress and future challenges. arXiv preprint arXiv:2104.02151. Cited by: §I.
  • [2] J. Guo and C. Yang (2020) Structure of deep neural networks with a priori information in wireless tasks. In ICC 2020-2020 IEEE International Conference on Communications (ICC), pp. 1–6. Cited by: §I.
  • [3] Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. External Links: Link Cited by: §IV.
  • [4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. Cited by: §I.
  • [5] S. J. Russell and P. Norvig (2016) Artificial intelligence: a modern approach. Pearson Education Limited. Cited by: §I.
  • [6] J. Wang, C. Xu, R. Li, Y. Ge, and J. Wang (2021) Smart scheduling based on deep reinforcement learning for cellular networks. In IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Workshop on Native-AI Empowered Wireless Networks. Cited by: §I.
  • [7] J. Wang, R. Li, J. Wang, Y. Ge, Q. Zhang, and W. Shi (2020) Artificial intelligence and wireless communications. Frontiers of Information Technology & Electronic Engineering 21 (10), pp. 1413–1425. Cited by: §I.
  • [8] C. Xu, D. Tao, and C. Xu (2013) A survey on multi-view learning. arXiv preprint arXiv:1304.5634. Cited by: §II.
  • [9] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. Smola (2017) Deep sets. arXiv preprint arXiv:1703.06114. Cited by: §I.