I Introduction
Future wireless systems are envisaged to be deeply integrated with artificial intelligence (AI) technologies to provide both communication and AI services [7]. In the past decade, great progress in three key driving forces, i.e., diverse data sets, advanced algorithms, and powerful computing capability, has played an important role in the success of AI applications [5]. Hence, the deployment of AI in a particular scenario such as wireless systems always begins with the consideration of three factors, i.e., how to collect data, how to process data, and on which device to process it. For example, in Internet of Things (IoT) and vehicular communication systems, a large volume of data will be collected and analyzed by devices to provide classification, regression, prediction, and decision-making functions for different tasks. In these cases, distributed learning is a proper enabler for AI solutions [1].
Many distributed learning frameworks have been proposed recently, among which the most popular one is federated learning (FL) [4]. In FL, each client trains a local learning model using its locally collected data, and then uploads the local model to a central server. The central server aggregates multiple local models into a global one and broadcasts it to the clients for the next round of training. FL exploits the computing capability of distributed clients and keeps users' data local at the clients without direct exchange. A commonly used assumption in FL is that the basic structure of the learning models (both the local ones and the global one) is identical. Although model compression, sparsification, and pruning may alleviate the computation and communication overhead, they must be based on the same structure.
On the other hand, distributed clients may need to cooperate with each other to finish a task together. The learning performed by each client depends not only on its local data, but also on the data and actions of its neighbors. Information is allowed to be exchanged in this case, and the input of the learning model on each client may include both local and neighboring data. The number of neighbors may differ between clients and change over time, which implies that the input of the learning model for each client may be different and time-varying. Retraining a model repeatedly is impractical due to the large overhead of the training phase. Predefining a maximum possible number of neighbors and training a correspondingly large model may be a solution. However, the assumption of knowing the maximum possible number of neighbors is not always feasible in practice, and such a model cannot handle the situation where the number of neighbors exceeds the predefined one.
In this paper, we propose a novel distributed learning framework which can be adopted in wireless networks with time-varying topology. First of all, to cope with the different input dimensions of the learning models at the clients, a scalable deep neural network (DNN) structure is designed. Here, similar to previous works [2, 6], the permutation equivalence (PE) and permutation invariance (PI) properties, which can be found in many applications [9], are considered. Based on this PE/PI prior knowledge of the applications, the DNN structure can be simplified and viewed as a combination of two basic blocks. Then, DNNs with different scales can be easily built by using different numbers of basic blocks at the clients to handle both local and neighboring data. Moreover, the model aggregation at the central server can also be carried out on the two basic blocks, which makes model aggregation over DNNs with different scales possible. We evaluate the proposed framework through simulations and show that, with far fewer model parameters, it comes within about one percentage point of the accuracy of a fixed-size design and, unlike that design, does not require the maximum number of neighbors to be known in advance.
The paper is organized as follows. The application scenario considered in this paper is introduced in Section II. Then, in Section III, the proposed distributed learning framework is elaborated. Simulation results and discussions are provided in Section IV. Finally, we summarize the work in Section V.
II System Model
As shown in Fig. 1, we consider a wireless network with one central unit such as a base station (BS) and multiple distributed clients such as mobile user equipments (UEs). A distributed learning task is assigned to this wireless network, where each of the distributed clients needs to perform the learning based on both its local data and the data from its neighbors. A typical application scenario of this system model is object recognition, where different cameras and sensors capture different features of the same object, which can be regarded as a multi-view learning task [8]. This system model can also find applications in IoT systems and wireless sensor networks (WSNs).
We characterize the system by a graph, where the vertices are the distributed clients and the edges show the connectivity between clients. Different clients may have different numbers of neighbors, e.g., Client E has only one neighbor, Client C, while Client D has two neighbors, Client B and Client C. Moreover, since this is a wireless network, the topology of the graph is highly dynamic. Due to the quality of the wireless links and the on/off activities of the clients, the graph structure may be time-varying. For example, considering the graph in Fig. 1, Client C has four neighbors, Clients A, B, D, and E, in the first time slot. If, in the second time slot, Client D departs from the system, the neighbors of Client C will change to Clients A, B, and E. The DNNs deployed on the clients need to handle both local and neighboring data, so at least the input dimensions of these DNNs are different. As the graph changes over time, the input dimension of the DNN on the same client may also be time-varying.
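The neighbor bookkeeping described above can be sketched in a few lines of Python. This is only an illustration: the edge list is inferred from the neighbor lists given in the text (Fig. 1 itself is not reproduced here), and the helper name `neighbor_table` is hypothetical.

```python
# Undirected edges inferred from the neighbor lists in the text
# (an assumption: Fig. 1 itself is not reproduced here).
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"), ("C", "E")]


def neighbor_table(edges, active):
    """Map each active client to the set of its active neighbors."""
    table = {client: set() for client in active}
    for u, v in edges:
        if u in active and v in active:
            table[u].add(v)
            table[v].add(u)
    return table


slot1 = neighbor_table(EDGES, {"A", "B", "C", "D", "E"})
slot2 = neighbor_table(EDGES, {"A", "B", "C", "E"})  # Client D has departed

# Number of data blocks each DNN must accept: its own data plus one per neighbor.
blocks_slot1 = {c: len(n) + 1 for c, n in slot1.items()}
blocks_slot2 = {c: len(n) + 1 for c, n in slot2.items()}

for client in sorted(slot1):
    print(client, sorted(slot1[client]), "->", sorted(slot2.get(client, [])))
```

The per-client block counts are what drive the input dimension of each local DNN, so they change whenever a client joins or leaves.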
A server is deployed on the central unit of the system. It can help speed up the training process by aggregating the local models from the clients. However, the model aggregation must take into account the situation where the scales of the DNNs are different.
III Distributed Learning with a Scalable DNN Design
III-A PE/PI-based Scalable DNN Design
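PE and PI are well-known and widely used properties, which can be elaborated as follows. Given an input vector $\mathbf{x} = [\mathbf{x}_1, \ldots, \mathbf{x}_K]$, if for an arbitrary permutation $\pi$ on $\mathbf{x}$, i.e., $\pi(\mathbf{x}) = [\mathbf{x}_{\pi(1)}, \ldots, \mathbf{x}_{\pi(K)}]$, the output of function $f$ follows the same permutation, i.e., $f(\pi(\mathbf{x})) = \pi(f(\mathbf{x}))$, we say that the function $f$ is PE for the input $\mathbf{x}$. If the output of function $f$ does not change under a permutation of the input $\mathbf{x}$, i.e., $f(\pi(\mathbf{x})) = f(\mathbf{x})$, we say that it has the PI property. Many distributed learning tasks have PE and PI properties. For example, in the UE scheduling problem in wireless networks, the scheduling priority of the target UE depends on both its own state, such as its channel condition, and the states of the other UEs. If the states of the UEs are permuted before being input into the scheduler, the output scheduling priorities of the UEs are permuted in the same way. The PE and PI properties can also be found in multi-view learning tasks, where permuting the order of the input neighboring data may not affect the output of the learning model.
To realize the aforementioned function through a DNN with $L$ layers, we can rely on the parameter sharing structure shown in Fig. 2. The input $\mathbf{x}$ includes states from $K$ clients, i.e., one target client and its $K-1$ neighbors. $\mathbf{h}^{l}$ is the output vector of the $l$-th hidden layer. If $\mathbf{h}^{L}$ is considered as the output, the DNN has the PE property: a permutation of the input $\mathbf{x}$ results in the same permutation of the output $\mathbf{h}^{L}$. If only $\mathbf{h}_{1}^{L}$, i.e., the block of $\mathbf{h}^{L}$ associated with the target client, is considered, the DNN has the PI property because a permutation of the neighbors' inputs will not change this output. The processing of the $l$-th hidden layer can be expressed as
$$\mathbf{h}^{l} = \sigma\left(\mathbf{W}^{l}\mathbf{h}^{l-1} + \mathbf{b}^{l}\right), \tag{1}$$
where $\mathbf{h}^{0} = \mathbf{x}$, $\mathbf{W}^{l}$ is the weight matrix and $\mathbf{b}^{l}$ is the bias vector of the $l$-th hidden layer, and $\sigma(\cdot)$ is the activation function. With $\tilde{\mathbf{W}}^{l} = [\mathbf{W}^{l}, \mathbf{b}^{l}]$ and $\tilde{\mathbf{h}}^{l-1} = [(\mathbf{h}^{l-1})^{T}, 1]^{T}$, i.e., $\mathbf{h}^{l-1}$ extended by a constant all-one entry, Eq. (1) can be rewritten as
$$\mathbf{h}^{l} = \sigma\left(\tilde{\mathbf{W}}^{l}\tilde{\mathbf{h}}^{l-1}\right). \tag{2}$$
To guarantee the PE/PI property of the DNN, $\mathbf{W}^{l}$ should have the following structure
$$\mathbf{W}^{l} = \begin{bmatrix} \mathbf{U}^{l} & \mathbf{V}^{l} & \cdots & \mathbf{V}^{l} \\ \mathbf{V}^{l} & \mathbf{U}^{l} & \cdots & \mathbf{V}^{l} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{V}^{l} & \mathbf{V}^{l} & \cdots & \mathbf{U}^{l} \end{bmatrix}, \tag{3}$$
where $\mathbf{W}^{l}$ consists of two submatrices, i.e., $\mathbf{U}^{l}$ on the diagonal blocks and $\mathbf{V}^{l}$ on the off-diagonal blocks, which is also shown in Fig. 2. Hence, DNNs with different scales can be built from these two submatrices to handle different input dimensions.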
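A minimal numerical sketch of this parameter-sharing structure is given below. It is an illustration under stated assumptions rather than the implementation used in this paper: the two submatrices are called `U` (diagonal blocks) and `V` (off-diagonal blocks), the bias is omitted for brevity, tanh is an arbitrary choice of activation, and all function names are hypothetical. The code assembles the block weight matrix for any number of clients and checks the PE and PI properties on random data.

```python
import math
import random


def matvec(W, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def build_weight(U, V, K):
    """Assemble the (K*M) x (K*N) matrix with U on diagonal blocks, V elsewhere."""
    M, N = len(U), len(U[0])
    W = [[0.0] * (K * N) for _ in range(K * M)]
    for i in range(K):
        for j in range(K):
            block = U if i == j else V
            for r in range(M):
                for c in range(N):
                    W[i * M + r][j * N + c] = block[r][c]
    return W


def layer(U, V, blocks):
    """One hidden layer (tanh, no bias); `blocks` holds K length-N client states."""
    K, M = len(blocks), len(U)
    x = [v for b in blocks for v in b]  # stack the client states
    h = [math.tanh(z) for z in matvec(build_weight(U, V, K), x)]
    return [h[k * M:(k + 1) * M] for k in range(K)]


def close(a, b, tol=1e-12):
    return all(abs(p - q) <= tol for p, q in zip(a, b))


random.seed(0)
M, N, K = 3, 2, 4
U = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(M)]
V = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(M)]
x = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(K)]

out = layer(U, V, x)

# PE: permuting the input blocks permutes the output blocks in the same way.
perm = [2, 0, 3, 1]
out_perm = layer(U, V, [x[p] for p in perm])
pe_holds = all(close(out_perm[i], out[perm[i]]) for i in range(K))

# PI: the target block (index 0) is unchanged when only the neighbors are permuted.
out_pi = layer(U, V, [x[0], x[3], x[1], x[2]])
pi_holds = close(out_pi[0], out[0])

print("PE holds:", pe_holds, "| PI holds:", pi_holds)
```

Because `build_weight` takes `K` as an argument, the same `U` and `V` serve a client with any number of neighbors, which is the property the scalable design relies on.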
III-B The Proposed Distributed Learning Framework
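Relying on the PE/PI-based scalable DNN design, we propose a distributed learning framework as shown in Fig. 3. Each distributed client uses a DNN with the aforementioned structure, whose scale can be adjusted according to the number of neighbors. For example, Client B has three neighbors, hence the parameter matrix of its DNN, built from the diagonal submatrix $\mathbf{U}^{l}$ and the off-diagonal submatrix $\mathbf{V}^{l}$, can be expressed as
$$\mathbf{W}_{B}^{l} = \begin{bmatrix} \mathbf{U}^{l} & \mathbf{V}^{l} & \mathbf{V}^{l} & \mathbf{V}^{l} \\ \mathbf{V}^{l} & \mathbf{U}^{l} & \mathbf{V}^{l} & \mathbf{V}^{l} \\ \mathbf{V}^{l} & \mathbf{V}^{l} & \mathbf{U}^{l} & \mathbf{V}^{l} \\ \mathbf{V}^{l} & \mathbf{V}^{l} & \mathbf{V}^{l} & \mathbf{U}^{l} \end{bmatrix}. \tag{4}$$
However, Client E has only one neighbor and its parameter matrix is
$$\mathbf{W}_{E}^{l} = \begin{bmatrix} \mathbf{U}^{l} & \mathbf{V}^{l} \\ \mathbf{V}^{l} & \mathbf{U}^{l} \end{bmatrix}. \tag{5}$$
When the topology of the wireless network changes, the scale of the DNN used at each client can also change accordingly. For example, if Client B departs from the network at the $t$-th time slot, the parameter matrix of the DNN used by Client A will change from $\mathbf{W}_{A,t-1}^{l}$ to $\mathbf{W}_{A,t}^{l}$, where
$$\mathbf{W}_{A,t-1}^{l} = \begin{bmatrix} \mathbf{U}^{l} & \mathbf{V}^{l} & \mathbf{V}^{l} \\ \mathbf{V}^{l} & \mathbf{U}^{l} & \mathbf{V}^{l} \\ \mathbf{V}^{l} & \mathbf{V}^{l} & \mathbf{U}^{l} \end{bmatrix} \tag{6}$$
and
$$\mathbf{W}_{A,t}^{l} = \begin{bmatrix} \mathbf{U}^{l} & \mathbf{V}^{l} \\ \mathbf{V}^{l} & \mathbf{U}^{l} \end{bmatrix}. \tag{7}$$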
After training the local DNN based on the local and neighboring data, each client $k$ can upload its local training results, i.e., its two submatrices $\mathbf{U}_{k}^{l}$ and $\mathbf{V}_{k}^{l}$, to the central server, where the global submatrices, i.e., $\mathbf{U}^{l}$ and $\mathbf{V}^{l}$, are obtained through model aggregation. All commonly used aggregation algorithms can be adopted here. Although the scales of the DNNs used by the clients may differ from each other, model aggregation is still possible because all of them share the same basic building blocks, i.e., the two submatrices. Similar to FL, the model aggregation can make use of data from multiple clients and speed up the whole learning procedure.
Suppose that the two submatrices $\mathbf{U}^{l}$ and $\mathbf{V}^{l}$ are of the same size $M \times N$. Then the total number of parameters of the DNN is $K^{2}MN$, where $K$ is the number of neighbors plus one. Thanks to the parameter sharing, the number of trainable parameters is only $2MN$, which is also the communication payload size for local model uploading and global model broadcasting. Compared to directly using a large DNN with $K^{2}MN$ parameters, the DNN used in the proposed distributed learning framework can reduce the computation and communication overhead by a factor of up to $K^{2}/2$.
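The aggregation step and the parameter count can be illustrated with a short sketch. Since the text leaves the aggregation algorithm open, the sample-weighted averaging below (in the style of FedAvg [4]) is only one possible choice. The function names, the toy 1-by-2 submatrices, and the per-client sample counts are hypothetical, and both submatrices are assumed to be of size M-by-N, so that a client with K input blocks holds K*K*M*N weights of which only 2*M*N are distinct.

```python
def aggregate(local_blocks, num_samples):
    """Sample-weighted element-wise average of same-shaped submatrices."""
    total = sum(num_samples)
    rows, cols = len(local_blocks[0]), len(local_blocks[0][0])
    return [
        [
            sum(n * blk[r][c] for n, blk in zip(num_samples, local_blocks)) / total
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def parameter_counts(M, N, K):
    """Return (entries of the full block matrix, distinct trainable entries)."""
    return K * K * M * N, 2 * M * N


# Three hypothetical clients, possibly with different numbers of neighbors,
# upload a diagonal submatrix of identical shape (1 x 2 here).
local_U = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]
samples = [12000, 12000, 12000]
global_U = aggregate(local_U, samples)

total, trainable = parameter_counts(M=1, N=2, K=5)
print(global_U, total, trainable, total / trainable)
```

The server never needs to know how many neighbors a client has: only the two fixed-size submatrices are exchanged, so the payload is independent of K.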
IV Simulations
To evaluate the performance of the proposed distributed learning framework, we design a test case and show the simulation results in this section.
We consider a classification task on the MNIST data set [3] in a network with 5 distributed clients. Both the training set with 60000 samples and the test set with 10000 samples are divided into five non-overlapping subsets, each of which is assigned to one of the five clients. To turn the task into a cooperative one, we add noise to the samples, and different noisy versions of the same picture are allocated to the target client and its neighbors. The target client tries to identify the class of the item in the picture by inputting its local noisy version and the neighboring noisy versions of the picture. The original and noisy versions of some sample pictures are shown in Fig. 4.
For comparison, we consider three baselines. The first one is a one-client scenario with no neighbors and no model aggregation. For the second baseline, cooperation among clients is neglected and each client tries to do the classification based only on its local data, i.e., one noisy picture. For the third baseline, a large DNN is used, which can process five pictures simultaneously. Obviously, this large DNN is very suitable for Client C, which has four neighbors. However, for the other clients with fewer neighbors, zeros are padded so that the input has the same size as five pictures.
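The data preparation described above can be sketched as follows. The text only states that noise is added, so the additive Gaussian noise, the noise level `sigma`, the clipping to [0, 1], and the random stand-in for a 28-by-28 picture are all assumptions made for illustration; the function names are hypothetical as well.

```python
import random


def noisy_views(image, num_views, sigma, rng):
    """Return `num_views` independently perturbed copies of a flattened image.

    Additive Gaussian noise with clipping to [0, 1] is an assumption.
    """
    return [
        [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in image]
        for _ in range(num_views)
    ]


def pad_views(views, max_views):
    """Zero-pad to `max_views` views, as needed by a fixed-size large DNN."""
    size = len(views[0])
    return views + [[0.0] * size for _ in range(max_views - len(views))]


rng = random.Random(0)
image = [rng.random() for _ in range(28 * 28)]  # stand-in for one 28x28 picture

# A client with one neighbor sees two noisy versions of the same picture.
views = noisy_views(image, num_views=2, sigma=0.3, rng=rng)

# The large-DNN baseline always expects five pictures, so three are zero-padded.
padded = pad_views(views, max_views=5)
print(len(views), len(padded), len(padded[0]))
```

With the scalable design the padding step is unnecessary, because the DNN is built for exactly as many views as the client currently has.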
The proposed framework and the baselines use a similar DNN structure, which includes two convolution layers with 2-by-2 pooling and ReLU activation, and three fully connected (FC) layers with ReLU activation for the first two FC layers and softmax activation for the last FC layer. The general structure of the DNN is shown in Fig. 5, where $\alpha$ is a scaling factor that takes different values for different methods. For the first and second baselines, which do not consider data from neighbors, we set $\alpha = 1$. For the third baseline with a large DNN, $\alpha$ is set to 5. Because the DNN used by the proposed method is built from two submatrices, i.e., $\mathbf{U}^{l}$ and $\mathbf{V}^{l}$, both of which have the same dimension as the one used in the first baseline, $\alpha$ for Client $k$ is equal to the number of neighbors of Client $k$ plus one. It is worth noting that, thanks to the parameter sharing design, the number of trainable parameters of the DNN used by the proposed method is only twice that of the first baseline. Other simulation parameters are summarized in Table I.

| Parameters | Values |
| --- | --- |
| Number of clients | 5 |
| Data set | MNIST |
| Number of training samples | 12000 per client |
| Number of test samples | 2000 per client |
| Learning rate | 0.05 |
| Decaying rate of learning rate | 0.99 |
| Training batch size | 500 |
| Test batch size | 128 |
| Number of training epochs | 20 |
| Testing frequency | every 2 epochs |
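As a rough illustration of how the training settings listed in the table interact, the sketch below computes a learning-rate schedule and the test points. It assumes the decay factor of 0.99 is applied once per epoch (the table does not say whether decay is per epoch or per step), and the constant and function names are hypothetical.

```python
INITIAL_LR = 0.05
DECAY = 0.99          # assumed to be applied once per epoch
EPOCHS = 20
TEST_EVERY = 2        # test after every second epoch
TRAIN_SAMPLES = 12000
BATCH_SIZE = 500


def learning_rate(epoch):
    """Learning rate for a zero-based epoch index under exponential decay."""
    return INITIAL_LR * DECAY ** epoch


schedule = [learning_rate(e) for e in range(EPOCHS)]
test_epochs = [e for e in range(1, EPOCHS + 1) if e % TEST_EVERY == 0]
batches_per_epoch = TRAIN_SAMPLES // BATCH_SIZE

print(round(schedule[0], 4), round(schedule[-1], 4))
print(test_epochs, batches_per_epoch)
```

Under this assumption the learning rate changes only mildly over the 20 epochs, and each client performs 24 updates per epoch.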
We consider the scenario with a time-varying topology as shown in Fig. 6. In the first time slot, the connectivity among clients is the same as in Fig. 1. Then, in the second time slot, Client D is off; hence, the numbers of neighbors of Client B and Client C are each decreased by one. The connectivity can be represented as
(8) 
and
(9) 
for the first and second time slots, respectively.
The performance of the aforementioned methods is first compared in the fixed topology of the first time slot. The accuracy of the different methods during the training phase is shown in Fig. 7. The first baseline, with only one client considered, performs the worst, while with the help of model aggregation, the second baseline performs better than the first one although no data from neighbors is used. By taking the neighboring data into account, both the third baseline and the proposed method show superior performance. Although the proposed method has a small performance degradation compared to the third baseline, the third baseline uses a much larger DNN. In the simulated case, the number of parameters of the DNNs used in the third baseline is 1348630, while the number of parameters of the DNNs used in the proposed method is only 109628. This means that the proposed method needs less than one tenth (about 8%) of the communication and computation overhead of the third baseline. Moreover, the maximum possible number of participants in the learning may be hard to predefine in practice, which also hinders the third baseline from being deployed in real systems.
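The overhead figures quoted above can be checked with a few lines of arithmetic. The comparison with K*K/2 assumes, as a simplification, that every layer is built from two equally sized submatrices and that K = 5 for the large DNN; the variable names are illustrative.

```python
baseline_params = 1348630   # third baseline (large DNN), from the text
proposed_params = 109628    # proposed method, from the text

ratio = baseline_params / proposed_params      # how many times larger the baseline is
fraction = proposed_params / baseline_params   # share of the baseline's payload

K = 5                        # five pictures handled by the large DNN
upper_bound = K ** 2 / 2     # idealized saving if all layers shared two blocks

print(f"ratio = {ratio:.2f}, fraction = {fraction:.3f}, bound = {upper_bound}")
```

The measured ratio stays slightly below the idealized bound, which is consistent with the bound being an upper limit.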
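| Time slot | Item | Baseline 3 (large DNN) | Proposed method |
| --- | --- | --- | --- |
| | Number of parameters | 1348630 | 109628 |
| Time slot 1 | Client A | 0.9589 | 0.9487 |
| Time slot 1 | Client B | 0.9668 | 0.9593 |
| Time slot 1 | Client C | 0.9709 | 0.9656 |
| Time slot 1 | Client D | 0.9592 | 0.9499 |
| Time slot 1 | Client E | 0.9366 | 0.9208 |
| Time slot 1 | Average | 0.9585 | 0.9489 |
| Time slot 2 | Client A | 0.9596 | 0.9490 |
| Time slot 2 | Client B | 0.9577 | 0.9493 |
| Time slot 2 | Client C | 0.9677 | 0.9601 |
| Time slot 2 | Client E | 0.9357 | 0.9206 |
| Time slot 2 | Average | 0.9551 | 0.9447 |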
We now evaluate the performance in the time-varying scenario. Because the first and second baselines do not consider neighboring data, the time-varying topology does not affect their performance. Hence, we only compare the performance of the proposed framework and the third baseline. The DNNs are trained for 20 epochs and are then fixed for inference over 10 iterations. The results averaged over the 10 iterations are shown in Table II. Both methods can work in the time-varying network. The departure of Client D reduces the numbers of neighbors of Client B and Client C, so the performance of these two clients degrades. Again, we emphasize that the third baseline is used here only for comparison; it is hard to deploy in practice because it assumes that the maximum possible number of participants in the learning is known.
V Conclusions
In this paper, we proposed a distributed learning framework based on a scalable DNN design to conduct cooperative tasks in wireless networks with time-varying topology. Relying on the PE/PI property, the parameter matrices of DNNs with different scales are built from the same two basic blocks. Hence, each distributed client can form its own DNN from these two basic submatrices according to the number of its neighbors. A change in the number of neighbors is handled by rebuilding the DNN accordingly. Aggregation of local models with different scales can also be done at the central server based on these two basic submatrices, which improves the convergence and performance of the global model. Simulations show that the proposed method can effectively handle the scenario with time-varying topology.
References
[1] (2021) Distributed learning in wireless networks: recent progress and future challenges. arXiv preprint arXiv:2104.02151.
[2] (2020) Structure of deep neural networks with a priori information in wireless tasks. In ICC 2020 - 2020 IEEE International Conference on Communications (ICC), pp. 1–6.
[3] (2010) MNIST handwritten digit database.
[4] (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282.
[5] (2016) Artificial intelligence: a modern approach. Malaysia: Pearson Education Limited.
[6] (2021) Smart scheduling based on deep reinforcement learning for cellular networks. In IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Workshop on Native-AI Empowered Wireless Networks.
[7] (2020) Artificial intelligence and wireless communications. Frontiers of Information Technology & Electronic Engineering 21 (10), pp. 1413–1425.
[8] (2013) A survey on multi-view learning. arXiv preprint arXiv:1304.5634.
[9] (2017) Deep sets. arXiv preprint arXiv:1703.06114.