Edge-Cloud Polarization and Collaboration: A Comprehensive Survey

by Jiangchao Yao, et al.

Influenced by the great success of deep learning via cloud computing and the rapid development of edge chips, research in artificial intelligence (AI) has shifted to both computing paradigms, i.e., cloud computing and edge computing. In recent years, we have witnessed significant progress in developing more advanced AI models on cloud servers that surpass traditional deep learning models, owing to model innovations (e.g., Transformers, pretrained families), the explosion of training data and soaring computing capabilities. However, edge computing, especially edge-cloud collaborative computing, is still in its infancy, owing to resource-constrained IoT scenarios where very few algorithms have been deployed. In this survey, we conduct a systematic review of both cloud and edge AI. Specifically, we are the first to set up the collaborative learning mechanism for cloud and edge modeling, with a thorough review of the architectures that enable such a mechanism. We also discuss the potential and practical experiences of some ongoing advanced edge AI topics, including pretraining models, graph neural networks and reinforcement learning. Finally, we discuss the promising directions and challenges in this field.








1 Introduction

Cloud computing concerns the provisioning of computation and memory resources to construct a cost-efficient computing paradigm for numerous applications [duc2019machine]. It has flourished over the past decades and achieved great success in the market, e.g., Amazon EC2, Google Cloud and Microsoft Azure. According to a recent analysis [size2020share], the global cloud computing market was valued at USD 274.79 billion in 2020 and is expected to grow at a compound annual growth rate of 19.1% from 2021 to 2028. Concomitantly, Artificial Intelligence (AI), especially compute-intensive Deep Learning [lecun2015deep], has enjoyed tremendous development with the explosion of cloud computing. Nevertheless, the rapid growth of the Internet of Things (IoT) [ashton2009internet] raises an inevitable issue of transferring data from the edges to the data centers in an unprecedented volume. Specifically, about 850 ZB of data was generated by IoT at the network edge by 2021, while the traffic from worldwide data centers only reached 20.6 ZB [Cisco:2021]. This drives the emergence of a new decentralized computing paradigm, edge computing, which turns out to be an efficient and well-recognized solution to reduce the computational cost and the transmission delay. Similarly, in the algorithmic application layer of edge computing, there is an urgent need to push the AI frontiers to the edges so as to fully unleash the potential of the modeling benefits [zhou2019edge].

The two existing computing paradigms, cloud computing and edge computing, polarize AI algorithms into different directions to fit their physical characteristics. For the former, the corresponding algorithms mainly focus on model performance in terms of generalization [bousquet2003introduction], robustness [hansen2011robustness], fairness [barocas2017fairness] and generation [gregor2015draw, fedus2018maskgan], spanning from computer vision (CV) and natural language processing (NLP) to other industrial applications. To achieve better performance, a large amount of research, from the perspectives of the data, the model, the loss and the optimizer, is devoted to exploring the limit under the assumption of sufficient computing power and storage. For example, the impressive Generative Pre-trained Transformer 3 (GPT-3) [radford2018improving], which has 175 billion parameters and is trained on data at the hundred-billion scale, can produce human-like text. AlphaFold [jumper2021highly], with an elaborate network design for amino acid sequences, is trained with over a hundred GPUs and has made an astounding breakthrough in highly accurate protein structure prediction. Nowadays, cloud computing is continuously spreading AI to various scientific disciplines and impacting our daily lives.

Fig. 1: Layout of the Survey.

However, in terms of edge computing, it is still in its infancy, owing to resource-constrained IoT scenarios where very few algorithms have been deployed. There are several critical constraints on designing AI algorithms that run on IoT devices while maintaining model accuracy. The most critical factor is the processing speed, which determines the applicability of any edge application [Nielsen:1993]. We usually measure it by throughput and latency, which respectively count the rate at which input data is processed and characterize the time interval between a single input and its response. For matrix-intensive computations such as those in deep learning algorithms, FLOPS and multiply-accumulate (MAC) operations are also frequently used as measures. Second, memory such as RAM and cache is a critical resource for building AI edge applications. Machine learning algorithms often take a significant portion of memory during model building for the storage of model parameters and other auxiliary variables. Beyond storage, querying the model parameters for processing and inference is both time-consuming and energy-intensive: a single MAC operation needs three memory reads and one memory write [sze2017efficient]. Thus, power consumption and the corresponding thermal performance are other crucial factors for edge learning, and energy-efficient solutions are welcome to prolong battery lifetime and cut maintenance costs.

Given the above constraints, edge computing polarizes AI into a new research era, namely edge intelligence or edge AI [Wang:2018, Li:2018, jeronimo2017mobile]. (Edge computing originally refers to computing in edge infrastructures such as routers; we here extend it to a broader range of devices including smartphones, and the edge AI we discuss also includes on-device AI.) Contrary to relying entirely on the cloud, edge AI makes the most of the edge resources to gain further AI insights. Many efforts from both academia and industry have been made to deploy mainstream deep learning frameworks on the edge. One direction is to make the model lightweight. For example, MobileNet [howard2017mobilenets] constructs depthwise separable convolutions, which reduce the number of parameters from 29.3 million to 4.2 million and the number of computations by a factor of 8 while only losing 1% accuracy. Furthermore, EfficientNet [Tan:2019] systematically scales up CNN models in terms of network depth, width and input image resolution, achieving state-of-the-art (SOTA) performance with 8 times less memory and 6 times faster inference. FastRNN and its gated version FastGRNN [Kusupati:2018] are proposed for stable training and good accuracy while keeping the model size small on resource-constrained devices. As summarized, FastGRNN has 2-4 times fewer parameters and achieves 18-42 times faster prediction than other leading gated RNN models such as LSTM [Hochreiter:1997] and GRU [Kyunghyun:2014] with the same performance. Major enterprises, such as Google, Microsoft, Intel, IBM, Alibaba and Huawei, have put forth pilot projects to demonstrate the effectiveness of edge AI. These applications cover a wide spectrum of AI tasks, e.g., live video analytics [Ananthanarayanan:2017], cognitive assistance for agriculture [Ha:2014], smart homes [Jie:2017] and industrial IoT [Li:2018].

Despite the distinct characteristics of cloud computing and edge computing, a complete real-world system usually involves their collaboration, from the physical aspects to the algorithmic aspects. We term this cooperation Edge-Cloud collaboration, and some explorations have already been made in both academia and industry, e.g., the privacy-primary collaboration of Federated Learning (FL) [mcmahan2017communication]. Specifically, the federated averaging algorithm [mcmahan2017communication] is the first model aggregation method based on model parameter sharing between the cloud and the edges. Beyond privacy, some works also focus on efficiency. For example, a successful industrial practice is the Taobao EdgeRec system [edgerec]. In EdgeRec, the memory-consuming embedding matrices encoding the attributes are deployed on the cloud, and a lightweight component executes the real-time inference on the edge. Similarly, in [ding2020cloud], a CloudCNN provides soft supervision to each local EdgeCNN for edge training while, simultaneously, the EdgeCNN performs real-time inference interacting with the vision input. Another exploration pursuing extreme personalization is [dccl], which leverages patches to avoid the burden of transmission via edge learning and calibrates the model on the cloud side.
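The federated averaging step mentioned above can be sketched in a few lines. This is a minimal illustration only: the client weights and data sizes below are invented, and real FedAvg aggregates full per-layer tensors after local training rather than flat lists.

```python
def federated_average(client_weights, client_sizes):
    """Aggregate client model parameters by data-size-weighted averaging,
    as in federated averaging: w = sum_k (n_k / n) * w_k."""
    total = sum(client_sizes)
    num_params = len(client_weights[0])
    aggregated = [0.0] * num_params
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            aggregated[i] += (size / total) * w
    return aggregated

# Two hypothetical edge clients with different data volumes.
w_a = [1.0, 2.0]   # client A parameters
w_b = [3.0, 4.0]   # client B parameters
global_w = federated_average([w_a, w_b], client_sizes=[100, 300])
# Client B holds 3x the data, so it dominates: [2.5, 3.5]
```

Only the aggregated parameters travel between edge and cloud, which is what lets FL keep the raw edge data private.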

However, the huge diversity of algorithms across different areas prevents a systematic review from covering the complete edge-cloud scope. For example, the transformer [devlin2019bert, khan2021transformers] conquers various benchmark competitions in CV and NLP, but we have not seen much progress in its application on the edge. Edge computing [zhou2019edge] has quite promising IoT applications in future intelligent life, but without the aid of cloud computing, powerful large models will remain out of reach for most of us. Although FL [kairouz2019advances] is elegant in protecting user privacy, given the heterogeneity and hardware constraints, we still have to explore new solutions to remedy the performance degradation. All these concerns might be trapped within their local areas, and some cross-field works might enlighten us to find a better way. Motivated by this intuition, we conduct a comprehensive and concrete survey of edge-cloud polarization and collaboration. Specifically, we are the first to set up the collaborative learning (CL) mechanism for cloud and edge modeling. To summarize, we organize this survey as follows and in Fig. 1:


  • Section 2 gives an overview of cloud AI;

  • Section 3 discusses edge AI;

  • Section 4 reviews the architectures that enable the collaboration between cloud and edge AI;

  • Section 5 discusses some on-going advanced topics, including pretraining models (PTM), graph neural networks (GNN) and reinforcement learning (RL) models that are deployed on edge;

  • Section 6 reviews the hardware development;

  • We conclude the paper in Section 7 with some possible future directions and challenges.

2 Cloud AI

AI has achieved tremendous development in recent years. Specifically, a plethora of Deep Neural Networks (DNNs) approach, or even outperform, human performance in a range of open competition benchmarks [silver2017mastering, grace2018will]. One main reason for the success of these models is that they benefit from recent advances in cloud computing, i.e., large-scale distributed clusters, which greatly accelerate the training of DNNs. We term them cloud AI. Considering that most real-world AI applications depend on information carriers such as images, text and the web, we limit the review of cloud AI models to the areas of CV, NLP and web services as exemplars.

2.1 Computer Vision

Among various research areas of AI, CV is a longstanding and fundamental field, which allows computers to derive meaningful information from digital images, videos, and other visual inputs. As a representative method in CV, CNN models have achieved the new SOTA performance on a wide range of tasks over the last few decades, e.g., Image Recognition [3lecun1989backpropagation], Object Detection [6guo2021distilling, 7qin2019thundernet, 8kehe2020improvement], Image Segmentation [10hu2021towards, 9cho2021picie, 11qiu2021semantic], and Image Processing [12song2021addersr, 13zamir2021multi, 14fu2021auto, 15li2021image].

Image recognition involves analyzing images and identifying objects, actions, and other elements in order to draw conclusions. In the 1980s, the “neocognitron”, i.e., the predecessor of CNNs, was proposed. Subsequently, LeCun et al. [3lecun1989backpropagation] proposed the CNN operator based on gradient learning and successfully applied it to handwritten digit recognition, with an error rate of less than 1%. In 2012, AlexNet [krizhevsky2017imagenet] won the ImageNet competition and incorporated many enticing techniques such as the ReLU nonlinearity, dropout, and data augmentation. Following AlexNet, deeper structures like VGG-16 [17simonyan2014very] and GoogLeNet [18szegedy2015going] were explored. However, gradient vanishing and exploding problems still limited model depth. Towards this end, ResNet [he2016deep] proposed a skip-connection scheme to form a residual learning framework. As such, ResNet could have up to 152 convolutional layers and became the winner of multiple vision tasks in ILSVRC 2015 and MS COCO 2015.

Object detection is one of the most challenging computer vision tasks, which identifies and locates objects in an image/scene. In recent years, one-stage and two-stage object detectors [20law2018cornernet, 21lin2017focal, 22redmon2016you, 23cai2018cascade, 24ren2015faster] have achieved noticeable improvements. However, these methods rely on deep convolution operations to learn intensive features, resulting in a sharp increase in the cost of computing resources and an apparent decrease in detection speed [6guo2021distilling]. Therefore, how to address these problems and enable real-time detection becomes an important line of research in object detection. Pruning techniques [han2015learning, 41guo2016dynamic, 27alvarez2016learning] compress a large PTM by removing unnecessary connections. Knowledge Distillation (KD) is another effective model compression tool that teaches a compact object detector by mimicking teacher networks [28chen2017learning, 29wang2019distilling].

Image segmentation [9cho2021picie] is another fascinating CV technique that simplifies the representation of an image by separating it into several recognizable segments. Apparently, image segmentation requires tremendous human effort in labeling pixels. Recently, there have been a number of attempts at semantic segmentation without labels. IIC [34ji2019invariant] uses mutual-information-based clustering to output a semantic segmentation probability map over image pixels. AC identifies probabilities of pixels over categories by an autoregressive model and maximizes mutual information across two different “orderings” [35ouali2020autoregressive, 36oord2016conditional]. In addition to generic domains, many experts focus on investigating domain-specific image segmentation techniques such as medical image segmentation [37jha2020doubleu, 38ronneberger2015u, 39zhou2019unet++].

Great success has been achieved in many other CV tasks [Zhang_Tan_Zhao_Yu_Kuang_Jiang_Zhou_Yang_Wu_2020], such as super-resolution [12song2021addersr], image restoration [45pan2021exploiting, 47zhang2020residual], and image generation [ramesh2021zero, zhu2017unpaired, goodfellow2014generative]. For example, DALL-E [ramesh2021zero] trains a 12-billion-parameter autoregressive transformer for zero-shot text-to-image generation, and observes improved generalization on out-of-domain datasets.

2.2 Natural Language Processing

As a longstanding area, NLP has developed into broad sub-areas including tagging, named entity recognition, question answering and machine translation, etc. [manning1999foundations].

POS tagging [pos_tagging] is a traditional task in NLP, which analyzes textual data grammatically and anchors the words of interest by category. It has been widely studied based on rules [pos_tagging], conditional random fields [wallach2004conditional] and DNNs [gui2017part]. With the help of large-scale datasets, DNNs like CharWNN [dos2015boosting] have been developed to achieve a range of SOTA performances. Recently, LSTM and Transformer architectures [akbik2018contextual] have been applied to capture the dynamics among sequences, breaking the records again.

Text or document categorization [joulin2017bag] is a common task for news and forum websites to recognize textual content for information retrieval. Advances mainly leverage convolution architectures and attention to capture the pre-defined class evidence [zhang2017sensitivity, adhikari2019docbert]. A similar task is sentiment analysis, which is explored at different granularities [dos2014deep]. Another widely used technique to extract the key points from long textual data is information extraction [cowie1996information]. Its successful application, the knowledge graph, provides us an elegant view of the relations between objects.
Machine translation (MT) [hutchins1986machine] has achieved great success in movie caption translation, sports games and international political conferences, saving human labor and domain knowledge in translation. Past works can be roughly divided into rule-based [simard2007rule], statistical [koehn2009statistical] and neural machine translation [bahdanau2015neural]. The early attempt in neural MT is the recurrent continuous translation model [kalchbrenner2013recurrent], which leverages an autoregressive mechanism to automatically capture the word ordering, syntax, and meaning of the source sentence. Sutskever et al. [sutskever2014sequence] improved it via an encoder-decoder approach with bidirectional LSTMs to learn long-term dependencies. The following variants mainly target designing more efficient attention modules. More recently, BERT-based neural architectures achieve new SOTA results on a range of MT benchmarks by enlarging the capacity with large-scale corpora [zhu2019incorporating].

Question Answering (QA) [hirschman2001natural], as a way of information retrieval, is commonly deployed in some well-known engines like Microsoft Windows and Apple Siri. Industrial QA usually consists of multiple stages, e.g., query recognition and expansion, answer selection and fine-grained ranking, spanning text, images and videos [toxtli2018understanding]. The mainstream research falls into modeling the QA pairs and the negative sampling, exploring the implicit matching between multiple objects and the answer [antol2015vqa]. A dialogue system can be considered a more complex form of QA, which generates answers over multiple rounds [merdivan2019dialogue]. Compared to single-shot retrieval, the interactive sentence generation is critical.

Text generation is one exciting technique that has emerged with the development of large-scale PTMs such as the GPT series in the recent few years [radford2019language, floridi2020gpt]. Several generation tasks, like poetry generation and story generation, have shown impressive performances that even approach human levels. In order to make the generation more human-like, GAN-style [lin2017adversarial] and VAE-style [serban2017hierarchical] approaches have been explored in PTMs to leverage their potential merits. Beyond totally structure-free generation, some knowledge-enhanced text generation that considers external text hints, constraint-aware and graph-based knowledge has been investigated [yu2020survey, Zhang_Tan_Yu_Zhao_Kuang_Liu_Zhou_Yang_Wu_2020]. Besides, some cross-domain text generation tasks, like visual QA and reading comprehension, are explored to understand multiple modalities [floridi2020gpt, m6].

2.3 Web Services

Recommendation, search and e-advertisement have been successful business paradigms in web services [kosala2000web]. The corresponding cloud AI models, as the core components of these paradigms, are widely explored in many enterprises, including Google, Amazon and Microsoft [sadiku2014cloud].

2.3.1 Search

Web search [broder2002taxonomy] is an important technique to retrieve objects of interest from a huge number of candidates based on a human query. According to the query type, e.g., tag, image or video, different areas of work with similar ideas have been explored. In particular, keyword search has been investigated almost since the emergence of the world wide web [liu2006effective]. More challenges lie in using an image or a video as the query, which requires cloud-based models to consider the efficiency and communication cost of the interplay between the user and the cloud service [datta2008image]. In image search, the model should extract any possible instances in the image to match the candidates [gordo2016deep], and many research works thus focus on how to enhance the deep feature extraction ability. Considering domain bias and noise, some approaches propose to pretrain the model on an in-domain clean subset and then finetune it on the open-domain dataset [liu2021image]. Due to the latency constraints for multimedia data, hashing techniques are explored to accelerate the retrieval [liu2016deep].

2.3.2 Recommendation

Recommender systems [resnick1997recommender] have been widely studied in the last decade and have become an indispensable infrastructure of web services. A recommender system actively selects potential content for users based on preferences captured from their historical traces on websites. The related recommendation methods have progressively improved with the development of collaborative filtering, deep learning and sequential modeling. The early stage mainly focused on user-based collaborative filtering [zhao2010user], item-based collaborative filtering [sarwar2001item] and matrix factorization [koren2009matrix, rao2016preference]. As deep learning achieved great success in CV and NLP, several variants of collaborative filtering combined with DNNs were proposed [cheng2016wide, he2017neural, guo2017deepfm, cui2018variational, yao2017discovering, chen2020towards]. They leverage the non-linear transformations of DNNs to activate high-level semantics for more accurate recommendation. Sequential modeling, as another perspective to model user interests, has been successfully applied to recommender systems [shani2005mdp]. As architectures evolved, several methods based on GRUs [jannach2017recurrent] and attention [kang2018self, zhou2018deep, tan2021sparse, zhang2021cause, Lu_Huang_Zhang_Han_Chen_Zhao_Wu_2021, pan2021click] have achieved remarkable performance in recommender systems.
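As a minimal illustration of the matrix factorization approach mentioned above, the toy sketch below fits latent user and item factors by stochastic gradient descent and then fills in an unobserved rating. The ratings and hyperparameters are invented for illustration; production systems operate on millions of sparse interactions.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.05, reg=0.01, epochs=1000, seed=0):
    """Toy matrix factorization by SGD: approximate r_ui ≈ p_u · q_i."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)  # gradient step with L2 regularization
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

# Observed (user, item, rating) triples; entry (1, 0) is unobserved.
data = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0)]
P, Q = factorize(data, n_users=2, n_items=2)
predicted = sum(P[1][f] * Q[0][f] for f in range(2))  # filled-in missing rating
```

The learned factors reconstruct the observed entries closely, and the dot product of unmatched user/item factors yields the recommendation score.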

2.3.3 Advertisement

The advertisement we review here refers to computational advertising developed for web services [huh2020advancing]. It is an intersection of multiple disciplines, such as advertising, marketing and computer science, that makes money by means of online information propagation. Computational advertising builds on a deep understanding of web data and user preference, while taking the cost and the revenue into account. A range of works study recommendation systems for advertising [schafer2007collaborative], guaranteed-target display advertising [turner2012planning] in a dynamic way, and real-time bidding [schwartz2017customer] by analyzing contextual data. Beyond cost-effective impressions, most algorithms target finding the best strategic planning in complex markets with uncertain revenue, or explore the optimal advertising path by inferring causal relations with incentive bonuses [chu2020inductive]. With the development of AI, data- and algorithm-driven models combined with advertising have drawn more attention. They construct an automatic optimization for the complex marketing environment while still maintaining advertising efficiency compared to traditional advertisement [yun2020challenges].

3 Edge AI

On mobile platforms, the breakthrough of AI has spawned a wealth of intelligent applications, such as virtual assistants [hoy2018alexa] and personalized recommendation [pimenidis2019mobile]. The traditional cloud-based paradigm involves data uploading that may violate user privacy, and it depends heavily on network conditions, which may cause high transmission delays or unavailability. A way to alleviate these problems is to place models partially or fully on mobile devices and make predictions locally, i.e., edge inference. However, edge inference is highly nontrivial, as the computing, storage and energy resources of mobile devices are limited, and many research efforts have been devoted to meeting these constraints. In this section, we review the corresponding techniques to fit models onto the edges, i.e., edge AI. Specifically, the core is to make the model lightweight in terms of both volume and speed, so that the edge can afford the model inference.

3.1 Efficient Network Architecture

3.1.1 Manual Design

There are many works exploring lightweight network architectures. One representative is SqueezeNet [hu2018squeeze], which leverages the iteration of a squeeze layer and an expansion layer for parameter compression. MobileNet [howard2017mobilenets] decomposes the conventional convolution into the composition of a depth-wise convolution and a point-wise convolution. The intuition behind MobileNet is that the low-rank merit of the convolution kernel makes the decomposition approximately equivalent, and thus computation can be accelerated by a two-stage convolution. Similarly, MobileNet-v2 [sandler2018mobilenetv2] shows that high-dimensional features can actually be expressed through compact low-dimensional features, and proposes a new layer unit, the inverted residual with linear bottleneck, to reduce the parameters. ShuffleNet [zhang2018shufflenet] shows that the point-wise convolution in MobileNet is expensive when the input dimension is high; it leverages group convolution together with channel shuffle to reduce the computational cost and the parameter space.
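To make the savings of the depthwise-separable factorization concrete, the small calculation below compares parameter and MAC counts against a standard convolution. The layer sizes are illustrative; for a 3x3 kernel with 128 input and output channels it recovers roughly the 8x computation reduction quoted for MobileNet.

```python
def conv_costs(c_in, c_out, k, h, w):
    """Parameter and multiply-accumulate (MAC) counts for a standard k x k
    convolution vs. the depthwise-separable factorization used in MobileNet,
    assuming stride 1 and an h x w output feature map."""
    standard_params = k * k * c_in * c_out
    standard_macs = standard_params * h * w
    # Depthwise (k x k per input channel) followed by pointwise (1 x 1) convolution.
    separable_params = k * k * c_in + c_in * c_out
    separable_macs = separable_params * h * w
    return standard_params, separable_params, standard_macs, separable_macs

sp, dp, sm, dm = conv_costs(c_in=128, c_out=128, k=3, h=56, w=56)
print(f"params: {sp} -> {dp} ({sp / dp:.1f}x fewer)")   # 147456 -> 17536, ~8.4x
print(f"MACs:   {sm} -> {dm} ({sm / dm:.1f}x fewer)")
```

Because both counts scale with k*k*c_in*c_out versus k*k*c_in + c_in*c_out, the reduction factor approaches k*k when c_out is large, which is why 3x3 layers gain close to an order of magnitude.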

3.1.2 Neural Architecture Search

Compared to manual design, neural architecture search (NAS) enables us to automatically explore efficient network architectures. The classical NAS [zoph2016neural] utilizes an RNN as the controller to generate a sub-network, then performs training and evaluation, and finally updates the parameters of the controller. There are two major challenges in NAS [brock2017smash, zoph2016neural, klein2017fast]. The first comes from the non-differentiable objective [elsken2018efficient, cai2018efficient, liu2018progressive]. Specifically, the performance of the sub-network is non-differentiable, which makes it infeasible to optimize the controller directly. Fortunately, policy gradient methods from RL can be a surrogate to update the controller parameters. The other challenge relates to the computationally expensive pipeline, where each sub-network updated by the controller has to be trained from scratch [zoph2018learning, real2019aging, yu2019evaluating]. Towards this end, efficient neural architecture search (ENAS) proposes the weight-sharing technique, which greatly reduces the search time [pham2018efficient]. Recently, there has been a rapidly growing trend of following ENAS. This line of research can be divided along the perspectives of the search space [elsken2018efficient, zoph2018learning, real2019aging, cai2018path], the search policy and sampling network [cai2018efficient, stanley2019designing, liu2018darts], and performance-aware selection [liu2018progressive, bender2018understanding, yu2019evaluating].
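The structure of the search loop can be illustrated with a deliberately simplified sketch: the learned controller is replaced by random sampling over a discrete search space, and the expensive train-and-evaluate step is replaced by a made-up proxy score. Only the shape of the loop, not the scoring, reflects real NAS.

```python
import random

def proxy_score(arch):
    """Stand-in for the costly train-and-evaluate step: reward depth/width
    and a 3x3 kernel while penalizing parameter count (purely illustrative)."""
    depth, width, kernel = arch
    accuracy_proxy = depth * 0.5 + width * 0.01 + (3 if kernel == 3 else 1)
    cost_penalty = 0.001 * depth * width * kernel * kernel
    return accuracy_proxy - cost_penalty

def random_search(space, n_trials=50, seed=0):
    """Sample candidate architectures and keep the best-scoring one."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(n_trials):
        arch = tuple(rng.choice(choices) for choices in space.values())
        score = proxy_score(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

space = {"depth": [4, 8, 16], "width": [32, 64, 128], "kernel": [3, 5]}
best, score = random_search(space)
```

In real NAS, `random_search` is replaced by a learned controller (updated with policy gradients, as above) or by differentiable relaxations, and `proxy_score` by training each candidate, which is precisely the cost weight sharing tries to amortize.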

3.2 Compression

3.2.1 Knowledge Distillation

KD [hinton2015distilling] is widely used to transfer the knowledge learned from complex models or multiple model ensembles to a lightweight model. According to the type of knowledge to be transferred, KD can be divided into three categories, i.e., response-based KD [hinton2015distilling, zhang2018deep, cho2019efficacy, yang2019training], feature-based KD [romero2014fitnets, zagoruyko2016paying] and relation-based KD [yim2017gift, tung2019similarity].


  • Response-Based KD. It takes the network output as the soft target to teach the student model. It is simple but effective for model compression, and has been widely used in different tasks and applications. The distillation loss based on response knowledge can be expressed as L_Res(z_t, z_s) = KL(p(z_t, T), p(z_s, T)), where z_t and z_s denote the teacher and student logits, p(z, T) is the softmax output softened by a temperature T, and KL(·, ·) is the Kullback-Leibler divergence loss.

  • Feature-Based KD. DNNs are good at learning multiple levels of feature representation by abstraction. Therefore, the output of an intermediate layer, i.e., the feature map, can be used as the knowledge to supervise the training of the student model. FitNet [romero2014fitnets] improved the training of the student model by directly matching the feature maps between teachers and students. Subsequently, a range of other methods have been proposed to follow this paradigm [zagoruyko2016paying, kim2018paraphrasing, heo2019knowledge, passban2020alp, chen2021cross, wang2020exclusivity, 48gan2020bert]. For example, Zagoruyko et al. [zagoruyko2016paying] derived an attention map from the original feature maps to express knowledge.

  • Relation-Based KD. This line of methods explores the relationships between different layers or data samples for distillation. Specifically, to explore the relationships between different feature maps, a flow of solution procedure (FSP) matrix is proposed, which is defined by the Gram matrix between two layers [yim2017gift]. The FSP matrix summarizes the relations between pairs of feature maps; it is calculated using the inner products between features from the two layers. To use the knowledge from multiple teachers, two graphs are formed by respectively using the logits and the features of each teacher model as the nodes [zhang2018better, lee2019graph].
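A minimal sketch of the response-based distillation loss described above, using temperature-softened softmax outputs and the KL divergence; the logits below are invented, and a real training loop would combine this term with the usual cross-entropy on hard labels.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Response-based distillation loss: KL(p_teacher || p_student) between
    temperature-softened output distributions (the 'soft targets' idea)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    return sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))

teacher = [4.0, 1.0, 0.2]
student_good = [3.8, 1.1, 0.3]   # close to the teacher -> small loss
student_bad = [0.2, 1.0, 4.0]    # disagrees with the teacher -> large loss
assert kd_loss(teacher, student_good) < kd_loss(teacher, student_bad)
```

Raising the temperature flattens both distributions, so the student is also penalized for mismatching the teacher's relative confidence on wrong classes, which is where the "dark knowledge" lives.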

3.2.2 Parameter Quantization

Quantization has shown great success in both the training [banner2018scalable] and inference [han2016deep] of DNN models. Specifically, the breakthrough of half-precision and mixed-precision training [micikevicius2017mixed, gupta2015deep, ginsburg2017tensor, courbariaux2014training, banner2018scalable, chmiel2020neural, faghri2020adaptive, li2019additive] has enabled an order of magnitude higher throughput in AI accelerators. However, it has proven very difficult to go below half-precision without significant tuning, and thus most recent quantization research has focused on inference via Parameter Quantization (PQ) [han2016deep]. Currently, PQ methods can be roughly divided into Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ).


  • Quantization-Aware Training (QAT): QAT is a method in which the usual forward and backward passes are performed on the quantized model in floating point, and the model parameters are quantized after each gradient update (similar to projected gradient descent). Performing the backward pass in floating point is important, since accumulating the gradients in quantized precision can result in zero gradients or gradients with high error, especially at low precision [courbariaux2015binaryconnect, gysel2016hardware, gysel2018ristretto, huang2018data, lin2015neural, rastegari2016xnor]. A popular approach to address the non-differentiability of the quantization operator is to approximate its gradient by the so-called Straight-Through Estimator (STE). However, the computational cost of QAT is very high, since retraining the model may take hundreds of epochs, especially for low-bit quantization.

  • Post-Training Quantization (PTQ): An alternative is Post-Training Quantization (PTQ), which performs the quantization and the adjustment of the weights without any fine-tuning [banner2018post, cai2020zeroq, choukroun2019low, fang2020post, garg2021confounding, he2018learning, hubara2020improving, lee2018quantization, zhao2019improving]. In PTQ, all the weight and activation quantization parameters are determined without any retraining of the DNN model, so its overhead is very low and often negligible, and it can be applied in situations where data is limited or unlabeled. However, this often comes at the cost of lower accuracy compared to QAT, especially for low-precision quantization.
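The two ingredients above can be sketched in a few lines: uniform affine ("fake") quantization of a weight tensor, as used in PTQ, plus the trivial STE backward rule used by QAT. The bit-width and the asymmetric rounding scheme are illustrative choices:

```python
import numpy as np

def quantize(w, bits=8):
    """Uniform affine (asymmetric) quantization of a weight tensor.

    Maps w into 2**bits integer levels over [w.min(), w.max()], then
    dequantizes, returning the "fake-quantized" floating-point values.
    """
    qmax = 2 ** bits - 1
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.clip(np.round((w - lo) / scale), 0, qmax)
    return q * scale + lo

def ste_grad(upstream_grad):
    """Straight-Through Estimator: the rounding op has zero gradient almost
    everywhere, so QAT simply passes the upstream gradient through unchanged."""
    return upstream_grad
```

In QAT, `quantize` would be applied in the forward pass while `ste_grad` stands in for the rounding op's derivative in the backward pass; in PTQ, `quantize` is applied once to a trained model.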

(a) Privacy-Primary Collaboration
(b) Efficiency-Primary Collaboration (I)
(c) Efficiency-Primary Collaboration (II)
Fig. 2: Prototypes of Edge-Cloud Collaborative AI. In privacy-primary collaboration (a), the cloud typically cannot access edge-specific private data and employs a simple strategy for edge model aggregation. In efficiency-primary collaboration (b-c), the cloud enjoys a large computation capacity and almost all edge data. As such, large-scale, latency-insensitive training and inference are conducted on the cloud, with or without the help of personalized edge models. The edge can accordingly achieve both high efficiency and high effectiveness by transferring, compressing, and personalizing the cloud model on the edge.

3.2.3 Pruning

Pruning is another way to reduce the parameter space by removing some computation paths in the model. Previous methods in this direction can be categorized into two main categories, one-time pruning and runtime pruning [vysogorets2021connectivity]. For the former, there are three lines of work. One line of methods focuses on pruning after model training [lecun1990optimal, hassibi1993optimal, han2015learning, dong2017learning], aiming to design criteria such as value magnitudes [han2015learning] and second-derivative information [lecun1990optimal] to remove the least salient parameters. Another line of methods focuses on jointly learning the sparse structures with the model parameters to alleviate the performance drop after pruning [narang2017exploring, zhu2017prune, mocanu2018scalable, dettmers2019sparse, evci2020rigging]; these methods explore more efficient strategies to avoid the expensive prune-retrain cycles. The third line aims to prune the network at initialization, which saves resources during training as well [lee2018snip, wang2019picking, tanaka2020pruning]. The Lottery Ticket Hypothesis [frankle2018lottery] demonstrates that dense networks contain sub-networks that can be trained to reach test accuracy comparable with the full network.
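A minimal sketch of one-shot magnitude pruning in the spirit of [han2015learning]; the global threshold rule is simplified, and real pipelines typically prune per layer and retrain afterwards:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest |w|."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude becomes the pruning threshold.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask
```

The surviving mask would then be kept fixed while the remaining weights are fine-tuned to recover accuracy.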

Runtime pruning focuses on dynamic path selection to enable real-time inference under limited computation budgets. For example, Huang et al. [huang2018multi] implement an early-exit mechanism by injecting a cascade of intermediate classifiers throughout a deep CNN. The layers are evaluated one by one in the forward pass, and inference stops early once the CPU time budget is depleted, so that in effect the later layers are pruned. Yu et al. [yu2018slimmable] introduce a slimmable neural network, which adjusts its width on the fly based on the resource constraints. In contrast to depth pruning, reducing the width helps reduce the memory footprint at inference. Besides, some other works reconsider the assumption that all inputs require equal computation, since treating all inputs equally may cause unnecessary resource consumption [teerapittayanon2016branchynet, huang2018multi, bolukbasi2017adaptive, wang2018skipnet, wu2018blockdrop, lin2017runtime]. For inputs that are easy to distinguish, a simpler model might be sufficient. The pioneering works use handcrafted control decisions [teerapittayanon2016branchynet, huang2018multi]. For instance, Huang et al. [huang2018multi] terminate the inference once an intermediate classifier outputs a confidence score exceeding a pre-determined threshold. Subsequent improvements propose to learn a network selection system, which adaptively prunes the full network for each input example. For example, Bolukbasi et al. [bolukbasi2017adaptive] propose an adaptive early-exit strategy by introducing extra classifiers that determine whether the current example should proceed to the next layer. Some other works [wang2018skipnet, wu2018blockdrop, lin2017runtime] utilize RL to learn the dynamic pruning decisions, which allows higher structural variability compared to the early-exit mechanism.
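The confidence-threshold early exit of [huang2018multi] can be sketched as follows; `stages` and `classifiers` are hypothetical stand-ins for the CNN blocks and the intermediate classifier heads:

```python
def early_exit_inference(x, stages, classifiers, threshold=0.9):
    """Evaluate stages one by one; stop once an intermediate classifier's
    confidence exceeds the pre-determined threshold (later stages are
    effectively pruned for this input)."""
    h, probs = x, None
    exit_at = len(stages) - 1
    for i, (stage, clf) in enumerate(zip(stages, classifiers)):
        h = stage(h)          # run the next block
        probs = clf(h)        # intermediate prediction
        if max(probs) >= threshold:
            exit_at = i       # confident enough: skip the remaining stages
            break
    return probs, exit_at
```

Easy inputs thus exit after a few blocks, while hard inputs traverse the full depth.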

4 Edge-Cloud Collaborative AI

Edge-cloud collaborative modeling has received significant interest recently across separate research communities [zhou2019edge]. Most current edge-cloud collaborative modeling can be categorized into two classes: privacy-primary collaboration and efficiency-primary collaboration (cf. Figure 2).

TABLE I: Device-Cloud Collaborative Modeling. Representative methods are compared along the collaboration manner (co-training, co-inference) and the collaboration concerns (privacy, storage efficiency, communication efficiency, personalization): Semantic QA cache [yoon2016device], EdgeRec System [edgerec], Auto-Split [banitalebi2021auto], CoEdge [hu2020coedge], Colla [lu2019collaborative], DCCL [dccl], MC-SF [chen2021mc], FedAvg [mcmahan2017communication], FML [shen2020federated], Personalized FedAvg [jiang2019improving], HyperCluster [mansour2020three], and Federated Evaluation [wang2019federated].

4.1 Privacy-Primary Collaboration: Federated Learning

FL is a machine learning paradigm where multiple entities (clients) collaborate in solving a machine learning problem under the coordination of a central server or service provider [kairouz2019advances]. The collaboration is privacy-primary in the sense that the raw data of each client is stored locally and will not be exchanged or transferred. Typically, FL can be roughly divided into two lines of work, i.e., cross-device and cross-silo. In cross-device FL, there are substantial numbers of devices such as phones, laptops, or IoT devices. In cross-silo FL, by contrast, the participants are organizations where data silos naturally exist. Another categorization given by [yang2019federated] divides FL into horizontal FL, vertical FL, and federated transfer learning, according to how the data of different participants overlaps. In modern FL frameworks, there are several challenges that hinder effective edge-cloud collaboration. In this paper, we focus on two major ones, i.e., the data heterogeneity of different edge devices, and attacks on FL.
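As a reference point for the discussion below, the server-side aggregation of FedAvg [mcmahan2017communication] reduces to a dataset-size-weighted average of the client parameters; a toy sketch where each client model is a flat list of scalar parameters, purely for illustration:

```python
def fedavg(client_weights, client_sizes):
    """One FedAvg aggregation round: average the client models, weighted by
    each client's local dataset size."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[p] * s for w, s in zip(client_weights, client_sizes)) / total
        for p in range(n_params)
    ]
```

In a full round, the server broadcasts the aggregated model back to the clients, each client runs local SGD on its private data, and the cycle repeats; raw data never leaves the edges.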

4.1.1 Statistical Heterogeneity

In real-world applications, data samples are not independent and identically distributed (Non-IID) over numerous devices [li2020federated], resulting in heterogeneous edge models and gradients. Typically, there are five sources of data heterogeneity: 1) Feature skew: the marginal distribution of features may differ between edges. 2) Label skew: the marginal distribution of labels may differ between edges. 3) Same label, different features: the conditional distribution of features given labels may differ between edges. 4) Same features, different labels: the conditional distribution of labels given features may differ between edges. For example, the same symbol may represent "correct" in many countries and "incorrect" in some others (e.g., Japan). And 5) Quantity skew: clients can store drastically varying volumes of data.

Unfortunately, real-world FL datasets present a mix of these phenomena, and there is an urgent need for measures that alleviate statistical heterogeneity in edge-cloud collaboration. In this nascent research area, FL personalization addresses the issue with a two-stage collaboration framework: the first stage collaboratively builds a global model, followed by a customization stage for each client with its private data [sim2019investigation]. Recently, several methods have been proposed to achieve personalization, enhanced by transfer learning, meta learning, and KD techniques. A learning-theoretic framework with generalization guarantees is presented by [mansour2020three]. Wang et al. [wang2019federated] construct different personalization strategies, including the model graph and the training hyperparameters, for different edges. Jiang et al. [jiang2019improving] build connections between FL and meta-learning, and interpret FedAvg as an instance of the popular meta-learning algorithm Reptile [nichol2018first]. Shen et al. [shen2020federated] present an FL method based on KD and transfer learning that allows clients to train their own models independently with local private data. Instead of training one global model, Mansour et al. [mansour2020three] consider device clustering and learn one model per cluster.

4.1.2 Attacks on FL

Current FL protocol designs are vulnerable to attackers inside and outside of the system, putting the data privacy and model robustness at risk. There are two serious threats to FL privacy and robustness [lyu2020privacy]: 1) poisoning attacks against robustness; and 2) inference attacks against privacy.

The impact of poisoning attacks on the FL model is determined by the extent to which the attackers participate in the training, as well as the amount of training data that is poisoned. Model poisoning attacks seek to prevent global model learning [lamport2019byzantine] or to hide a backdoor trigger in the global model [wang2020attack]. These attacks contaminate the local model updates before the updates are uploaded to the server. The Byzantine attack [lamport2019byzantine] is a form of untargeted model poisoning attack that uploads arbitrary and harmful model updates to the server to fool the global model.

Though FL protects data privacy by shielding local data from direct access, it still suffers from privacy risks due to inference attacks. For example, Deep Leakage from Gradients (DLG) [zhu2020deep] presents an optimization approach that can recover the raw images and texts from the gradients shared for model improvement. Existing studies on privacy-preserving FL are often built on traditional privacy-preserving approaches, including: (1) homomorphic encryption [paillier1999public]; (2) Secure Multiparty Computation (SMC) [demmler2015aby]; and (3) differential privacy [dwork2014algorithmic]. To protect against an honest-but-curious adversary, Hardy et al. [hardy2017private] applied FL to partitioned data encrypted with a homomorphic scheme. SMC [yao1982protocols] allows different participants with private inputs to perform a collaborative computation on their inputs without disclosing them to one another. Given the resource limits of mobile devices, privacy-protection solutions are expected to be computationally inexpensive, communication-efficient, and resistant to device failure. Truex et al. [truex2020ldp] present a novel FL system under the protection of the formal local differential privacy (LDP) framework, which perturbs updates according to the local privacy budget while minimizing the overwhelming impact of noise.
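The clip-then-perturb pattern underlying differentially private FL updates can be sketched as below; the clip norm and noise multiplier are illustrative hyperparameters, not values from [truex2020ldp], and a real deployment would calibrate the noise to a formal privacy budget:

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Clip a model update to a bounded L2 norm, then add Gaussian noise,
    so that any single client's contribution is both bounded and masked."""
    rng = rng or np.random.default_rng(0)
    update = np.asarray(update, dtype=float)
    norm = np.linalg.norm(update)
    if norm > clip_norm:
        update = update * (clip_norm / norm)   # bound the sensitivity
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return update + noise
```

Clipping bounds the sensitivity of the aggregate to any single client, which is what makes the Gaussian noise scale meaningful in differential-privacy terms.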

4.2 Efficiency-Primary Collaboration

Although FL is a popular framework that addresses data privacy and governance issues, some real-world applications may not be sensitive to privacy but care more about factors such as the communication budget or personalization. In recent years, some edge-cloud collaborations have emerged in this direction, and we term this paradigm efficiency-primary collaboration (cf. Figure 2(b)-2(c)).

4.2.1 Split-Deployment

The most straightforward collaboration is the split-deployment that separates one complete model into two parts, one placed on the cloud side and the other on the edge side. One exemplar is the Semantic QA cache [yoon2016device], in which inference on the edge side is responsible for the feature encoding and inference on the cloud side outputs the answer to the query. Another successful practice is the Taobao EdgeRec System [edgerec], where the memory-consuming embedding matrices encoding the attributes are deployed on the cloud side and a lightweight component executes the remaining inference on the edge side. Amin et al. [banitalebi2021auto] introduced an Auto-Split solution to automatically split DNN models into two parts, one for the edge and one for the cloud. Similarly, Hu et al. [hu2020coedge] cast the split as a latency-minimum allocation problem and introduced the CoEdge solution. For general models, automatic model partition has been explored. For example, [kang2017neurosurgeon] develops models to estimate the latency and energy consumption of each DNN layer and identify the ideal split point for latency or energy optimization. To further optimize the latency, [li2018jalad, ko2018edge] leverage lossy data compression techniques to reduce the size of the transmitted data. In short, the intuition behind these works is to sufficiently leverage the computing power of the edges to reduce both the burden on the cloud and the communication latency, the key question being how to optimally split the model to guarantee efficiency.
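The split-deployment pattern itself fits in a few lines; the layers and split point below are hypothetical, and in practice the intermediate feature would be (optionally compressed and) sent over the network between the two halves:

```python
def split_model(layers, split_point):
    """Split a sequential model into an edge part and a cloud part."""
    return layers[:split_point], layers[split_point:]

def collaborative_inference(x, edge_part, cloud_part):
    """Edge computes an intermediate feature, 'uploads' it, cloud finishes."""
    h = x
    for layer in edge_part:
        h = layer(h)
    intermediate = h      # this is what would cross the network
    for layer in cloud_part:
        h = layer(h)
    return h
```

Choosing `split_point` is exactly the optimization problem studied by Auto-Split and Neurosurgeon: it trades edge compute against the size (and hence latency) of the intermediate feature to transmit.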

4.2.2 Edge-Centralized Personalization

Another type of edge-cloud collaboration leverages the decentralized advantage of the edge side for personalization by setting up an auxiliary cloud model. For example, Lu et al. [lu2019collaborative] propose a collaborative learning method, COLLA, for user location prediction, which builds a personalized model for each device and allows the cloud and the edges to learn collectively. In COLLA, the cloud model acts as a global aggregator distilling knowledge from multiple edge models. Ding et al. [ding2020cloud] introduce a collaborative framework where a CloudCNN provides soft supervision to each local EdgeCNN for edge training while, simultaneously, the EdgeCNN performs real-time inference interacting with the vision input. Yao et al. [dccl] propose an edge-cloud collaboration framework for recommendation via a backbone-patch decomposition. It greatly reduces the computational burden on the edges and guarantees personalization by introducing a MetaPatch mechanism, and re-calibrates the backbone to avoid local optima via MoMoDistill. Extensive experiments demonstrate its superiority on recommendation benchmark datasets.

4.2.3 Bidirectional Collaboration

A more intensive type of edge-cloud collaboration is bidirectional, where independent models are maintained on each side and exchange interactive feedback during both training and serving. One recent exploration is MC-SF, a slow-fast learning mechanism for edge-cloud collaborative recommendation developed on the Alibaba Gemini Platform [chen2021mc]. In MC-SF, the slow component (the cloud model) helps the fast component (the edge model) make predictions by delivering auxiliary latent representations; conversely, the fast component transfers the feedback from the real-time exposed items to the slow component, which helps better capture the user interests. The intuition behind MC-SF resembles the roles of System I and System II in human cognition [kahneman2011thinking], where System II changes slowly but conducts comprehensive reasoning over the circumstances, while System I perceives quickly to make accurate recognitions [madan2021fast]. The interaction between System I and System II allows prior/privileged information to be exchanged in time to collaboratively meet the requirements of the environment. With the hardware advancement of mobile phones, IoT devices, and edge servers, it will be meaningful to pay attention to such bidirectional collaborative mechanisms at different levels [zhou2019edge].

4.3 Rethinking Collaboration in Classical Paradigms

Edge-cloud collaborative learning can be formulated as a two-stage optimization problem, where we train a model on one side (cloud or edge) and then further optimize it on the other side (edge or cloud). Considering this, we rethink this problem from the perspectives of transfer learning, meta-learning, and causal inference.

4.3.1 Transfer Learning.

In edge-cloud collaborative learning, the data distribution naturally differs from edge to edge and from edge to cloud. Towards this end, heterogeneous transfer learning [Zhang_Qi_Yang_Prisacariu_Wah_Torr_2020, Yeh_Huang_Wang_2014, Wang_Wu_Jia_2017, Ren_Feng_Dai_Yan_2021, Wu_Zhu_Yan_Wu_Zhang_Ng_2021, Tsai_Yeh_Wang_2016, Samat_Persello_Gamba_Liu_Abuduwaili_Li_2017a, Li_Wang_Zhang_Li_Keutzer_Darrell_Zhao_2021] could greatly improve the bidirectional model adaptation between the edges and the cloud. There are roughly two lines of heterogeneous transfer learning work addressing the difference in feature space, i.e., symmetric transformation and asymmetric transformation. Symmetric transformation [Samat_Persello_Gamba_Liu_Abuduwaili_Li_2017a, Wang_Ma_Cheng_Zou_Rodrigues_2018, Yeh_Huang_Wang_2014] aims to learn domain-invariant representations across different domains. [Liu_Zhang_Lu_Lu_2017] addressed unsupervised transfer learning, in which the source domain is mostly labeled while the target domain has no labels. [Tsai_Yeh_Wang_2016] proposed Cross-Domain Landmark Selection (CDLS) as a semi-supervised heterogeneous domain adaptation (HDA) solution. As a counterpart, asymmetric transformation [Zhou_Tsang_Pan_Tan_2014, Kulis_Saenko_Darrell_2011, Feuz_Cook_2015, Xiao_Guo_2015a] aligns the feature space of the source with that of the target. A semi-supervised method for adapting heterogeneous domains, called Semi-Supervised Kernel Matching Domain Adaptation, was proposed in [Xiao_Guo_2015a]. [Wu_Wu_Ng_2022] learns an enhanced feature space by jointly minimizing the information loss and maximizing the domain distribution alignment.

4.3.2 Meta-learning.

Meta-learning is another successful knowledge transfer framework. Different from transfer learning, where models learn by solving tasks in the source domain, meta-learning expects the model to learn how to quickly solve new tasks. Based on meta-learning, on-cloud training (meta-training) could yield models that learn quickly in heterogeneous edge environments (meta-testing). Recently, [Rosenfeld_Rajendran_Simeone_2021] proposed to employ spiking neural networks and meta-learning with streaming data, permitting fast edge adaptation. Based on MAML [finn2017model], MELO [Huang_Zhang_Yang_Qian_Wu_2021] is another work that learns to quickly adapt to new mobile edge computing (MEC) tasks. Another major challenge in edge-cloud collaborative learning is the limited computation capacity of edges. By consolidating meta-learning and model compression, existing studies learn light-weight models that can quickly adapt to edge environments [Zhou_Xu_McAuley_2021, Ye_Zhang_Wang_2021, Pan_Wang_Qiu_Zhang_Li_Huang_2021, Zhang_Wang_Gai_2020]. [Ye_Zhang_Wang_2021] proposes an end-to-end framework that seeks layer-wise compression with meta-learning. [Pan_Wang_Qiu_Zhang_Li_Huang_2021] learns a meta-teacher that is generalizable across domains and guides the student model to solve domain-specific tasks.
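A toy first-order MAML meta-update on a single scalar parameter makes the inner/outer-loop structure explicit; the per-task quadratic losses L_t(theta) = (theta - target_t)^2 and the learning rates are illustrative, not from [finn2017model]:

```python
def fomaml_step(theta, task_targets, inner_lr=0.1, outer_lr=0.05):
    """One first-order MAML meta-update on a scalar parameter.

    Inner loop: one gradient step of task-specific adaptation.
    Outer loop: update theta with the gradient evaluated at the adapted
    parameters (the first-order approximation of MAML).
    """
    meta_grad = 0.0
    for target in task_targets:
        grad = 2.0 * (theta - target)          # inner-loop gradient
        adapted = theta - inner_lr * grad      # task-specific adaptation
        meta_grad += 2.0 * (adapted - target)  # gradient after adaptation
    return theta - outer_lr * meta_grad / len(task_targets)
```

In the edge-cloud setting, the outer loop would run on the cloud over many edges' tasks, while each edge performs only the cheap inner-loop adaptation locally.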

4.3.3 Causal Inference.

There is a substantial and rapidly growing research literature studying causality [Pearl_2009, kuang2020causal] for bias reduction and fairness. Causal theory is essential to edge-cloud collaborative learning for two reasons: 1) on-cloud training is supposed to yield a generalizable model, free from confounding effects and model bias, w.r.t. the heterogeneous edge data distributions; 2) edge training should avoid over-fitting caused by spurious correlations between the input and the outcome. [Rotman_Feder_Reichart_2021] builds connections among model compression, causal inference, and out-of-distribution generalization; with causal effects as the basis, they propose to make decisions on pruning model components for better generalization. More works dive into the intersection of causality and out-of-domain generalization [Teshima_Sato_Sugiyama_2020, Yang_Shen_Chen_Li_2020, Chen_Bhlmann_2020, Yue_Sun_Hua_Zhang_2021, kuang2018stable, yuan2021learning, Yang_Yu_Cao_Liu_Wang_Li_2020, zhang2020devlbert]. [Yue_Sun_Hua_Zhang_2021] proposes to preserve semantics that are discriminative in the target domain by embracing disentangled causal mechanisms and deconfounding. [Yang_Yu_Cao_Liu_Wang_Li_2020] assumes that the relationship between causal features and the class is robust across domains, and adopts the Markov Blanket [Yu_Guo_Liu_Li_Wang_Ling_Wu_2020] for causal feature selection. [kuang2018stable] proposes a causal regularizer to recover the causation between predictors and outcome variables for stable prediction across unknown distributions. [yuan2021learning] designs an instrumental variable (IV) based method for achieving an invariant relationship between predictors and the outcome variable for domain generalization.

5 Advanced Topics

5.1 Edge Pretraining Models

Pretrained models (PTM), also known as foundation models [FoundationModel], such as BERT [devlin2019bert] and GPT-3 [brown2020language], have become an indispensable component of modern AI, due to their versatility and transferability, especially in few-shot or even zero-shot settings. The scale of the foundation models has been growing tremendously recently [megatronlm, gshard, switchtrans, m6, m6-t], as researchers observe that larger models trained on larger corpora generally lead to superior performance on downstream tasks [devlin2019bert, brown2020language]. The large scale, which can range from millions to trillions of parameters, however, raises a critical question as to whether edge intelligence can enjoy the merits of the increasingly powerful foundation models. On the other hand, the unique challenges faced by edge agents, such as the heterogeneity of the deployment environments and the need for small-sample adaptation, constitute an ideal testbed for demonstrating the versatility and transferability of the foundation models. We herein discuss three crucial directions that require exploration before we can unleash the power of the foundation models on edges.

5.1.1 Model Compression

Compressing a large foundation model into a small one is necessary due to the storage and/or network bandwidth constraints of many edge devices such as modern mobile phones. General techniques discussed in previous sections, such as model quantization and weight pruning, can be readily adopted, while there are also techniques specifically tailored for the Transformer architecture commonly adopted by the foundation models. For instance, parameter sharing across layers has been proven effective by ALBERT [Lan2020ALBERT:]. KD [hinton2015distilling] remains the most popular solution. DistilBERT [DistilBERT] uses a subset of the layers of the teacher model to form the student for distillation. TinyBERT [jiao-etal-2020-tinybert] improves upon the vanilla logit-based distillation method [hinton2015distilling] by adding extra losses that align the intermediate states between the teacher and the student. MobileBERT [sun-etal-2020-mobilebert] introduces bottleneck layers to reduce the dense layers' parameters while keeping the projected hidden states' dimension unchanged, for convenient feature map transfer and attention transfer. MiniLM [MiniLM] demonstrates that it is effective to distill the dot products between the attention queries, keys, and values, which imposes no constraints on the hidden size or the number of layers of the student. So far the distilled tiny models usually come with around 10M parameters, and it is unclear whether this size can be further reduced without much performance loss.
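For reference, the vanilla logit-based distillation objective [hinton2015distilling] that DistilBERT and TinyBERT build on can be sketched as a temperature-softened cross-entropy; this is a pure-Python, single-example sketch rather than any library's API:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature softens the distribution."""
    exps = [math.exp(z / temperature) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions.
    Minimizing it pushes the student's logits toward the teacher's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

In practice this term is combined with the ordinary hard-label loss, and methods such as TinyBERT add further losses on intermediate states on top of it.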

5.1.2 Inference Acceleration

Some techniques for reducing the model size, for example reducing the hidden sizes or the number of layers, can also bring faster inference. Yet, as a foundation model typically has tens of layers, an interesting question arises: is it necessary to use all the layers during inference, given that there may be simple samples or easy downstream tasks that do not necessitate the full model? PABEE [zhou2020bert] demonstrates that early exiting, i.e., dynamically stopping inference once the intermediate predictions of the internal classifier layers remain unchanged for a number of steps, is indeed effective with BERT. However, it remains unclear whether early exiting is equally applicable to generative models such as GPT-3 and to text generation tasks.
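The patience-based stopping rule of PABEE can be sketched as follows; `layers` and `classifiers` are hypothetical stand-ins for Transformer blocks and the internal classifier heads:

```python
def pabee_inference(x, layers, classifiers, patience=2):
    """Stop inference once the intermediate prediction stays unchanged for
    `patience` consecutive layers, in the spirit of PABEE [zhou2020bert]."""
    h = x
    prev_pred, streak = None, 0
    for i, (layer, clf) in enumerate(zip(layers, classifiers)):
        h = layer(h)
        pred = clf(h)
        streak = streak + 1 if pred == prev_pred else 1
        prev_pred = pred
        if streak >= patience:
            return pred, i   # prediction has stabilized: exit early
    return prev_pred, len(layers) - 1
```

Unlike a confidence threshold, this rule exits on prediction *stability*, which PABEE argues is more robust to over-confident intermediate classifiers.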

5.1.3 Few-sample and Few-parameter Adaptation

The traditional approach is to fine-tune a PTM on each target downstream task's samples. However, fine-tuning typically involves updating almost all parameters of a foundation model and is not effective enough in terms of sample efficiency [wei2021finetuned]. It can also waste storage and network bandwidth, since each fine-tuned model requires independent resources. Moreover, an edge is typically a few-sample learning environment, for example only consisting of the data associated with one single mobile phone. We thus need to seek a more efficient approach in place of fine-tuning for few-sample and few-parameter adaptation. The recent emergence of prompt-based learning [gao2021making] and instruction tuning [wei2021finetuned] represents a promising direction. These recent methods mine natural language templates for the downstream tasks and use the templates to form input sentences. With a powerful generative language model, such templates can guide the model to output the correct predictions for the downstream tasks, where the model prediction is in the form of natural language as well. These recent paradigms require no parameter updating and can even achieve zero-shot learning in some cases [wei2021finetuned]. However, so far prompt-based learning and instruction tuning focus mostly on large-scale models, not the type of tiny models intended for edges.

5.2 Edge Graph Neural Networks

In recent years, Graph Neural Networks (GNN) have achieved SOTA performance on graph-structured data in various industrial domains [wu2020comprehensive], including CV [yang2018graph, landrieu2018large], NLP [marcheggiani2018exploiting, beck2018graph], traffic [yao2018deep, li2017diffusion], recommender systems [wu2019session], and chemistry [duvenaud2015convolutional]. A GNN can learn high-level embeddings from node features and adjacency relationships, and thus effectively deal with graph-based tasks. The rapid growth of node features and their adjacency information drives the success of GNN, but also poses challenges in integrating GNN into the edge-cloud collaboration framework, such as data isolation, memory consumption, limited samples, and generalization. Recently, some efforts have emerged to address the above problems from the perspectives of FL, quantization, and meta learning. In this part, we briefly review these works.
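The message-passing computation at the heart of a GNN can be sketched as a mean-aggregation layer; GCN-style symmetric normalization and multi-layer stacking are omitted, and the weight matrix is assumed to be learned elsewhere:

```python
import numpy as np

def gnn_layer(adj, features, weight):
    """One mean-aggregation message-passing layer: each node averages its
    neighbors' features (plus its own via a self-loop), then applies a
    linear map and ReLU."""
    n = adj.shape[0]
    adj_hat = adj + np.eye(n)                 # add self-loops
    deg = adj_hat.sum(axis=1, keepdims=True)  # node degrees
    h = (adj_hat @ features) / deg            # mean aggregation
    return np.maximum(h @ weight, 0.0)        # linear transform + ReLU
```

Stacking such layers lets information propagate over multi-hop neighborhoods, which is exactly what makes the adjacency data (and its distribution across edges) so central to the challenges discussed below.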

5.2.1 Federated GNN

To exploit graph data distributed across different edges for training a high-quality graph model, recent researchers have made progress in FL on GNN [jiang2020federated, zhou2020privacy, zheng2021asfgnn, wang2020graphfl, wu2021fedgnn, he2021fedgraphnn, wang2021fl, caldarola2021cluster, chen2021fedgl, he2021spreadgnn, meng2021cross, ni2021vertical, xie2021federated, zhang2021subgraph]. The key idea of FL is to leave the data on the edges and train a shared global model by uploading and aggregating the local updates, e.g., gradients or model parameters, to a central server. Feddy [jiang2020federated] proposes a distributed and secure framework to learn the object representations from multi-device graph sequences in surveillance systems. ASFGNN [zheng2021asfgnn] further proposes a separated-federated GNN model, which decouples the training of GNN into two parts: the message passing part that is done by clients separately, and the loss computing part that is learnt by clients federally. FedGNN [wu2021fedgnn] applies federated GNN to the task of privacy-preserving recommendation. It can collectively train GNN models from decentralized user data and meanwhile exploit high-order user-item interaction information with privacy well protected. FedSage [zhang2021subgraph] studies a more challenging yet realistic case where cross-subgraph edges are totally missing, and designs a missing-neighbor generator with the corresponding local and federated training processes. FL-AGCNs [wang2021fl] considers NAS techniques for GNN in FL scenarios with distributed and private datasets. To alleviate the heterogeneity in graph data, some works, e.g., FedCG [caldarola2021cluster] and GCFL [xie2021federated], leverage clustering to reduce statistical heterogeneity by identifying homogeneous groups, while FedGL [chen2021fedgl] exploits global self-supervision information.
SpreadGNN [he2021spreadgnn] extends federated multi-task learning to realistic serverless settings for GNNs, and utilizes a novel optimization algorithm with a convergence guarantee to solve decentralized multi-task learning problems. In addition to data, graphs can also exist as relationships among clients. For example, CNFGNN [meng2021cross] leverages the underlying graph structure of decentralized data by proposing a cross-node federated GNN. It bridges the gap between modeling complex spatio-temporal data and decentralized data processing by enabling the use of GNN in the FL setting.

5.2.2 Quantized GNN

To systematically reduce GNN memory consumption, GNN-tailored quantization that converts a full-precision GNN to a quantized or binarized GNN can emerge as a solution for resource-constrained edge devices [feng2020sgquant, tailor2020degree, wang2021binarized, bahri2021binary, wang2021bi]. SGQuant [feng2020sgquant] proposes a multi-granularity quantization featured with component-wise, topology-aware, and layer-wise quantization to intelligently compress the GNN features while minimizing the accuracy drop. Degree-Quant [tailor2020degree] performs quantization-aware training on graphs, which results in INT8 models often performing as well as their FP32 counterparts. BGN [wang2021binarized] learns binarized parameters and enables GNNs to learn discrete embeddings. Bi-GCN [wang2021bi] binarizes both the network parameters and the node attributes; it can reduce the memory consumption by about 30x for both the network parameters and the node attributes, and accelerate the inference by about 47x.

5.2.3 GNN with Meta Learning

Recently, several meta learning methods for training GNNs have been proposed to address the limited-samples problem [mandal2021meta]. Most of the existing works [huang2020graph, wang2020graph, wen2021meta, ma2020adaptive, jiang2021structure, buffelli2020meta] adopt the Model-Agnostic Meta-Learning (MAML) algorithm [finn2017model]. The outer loop of MAML updates the shared parameters, whereas the inner loop updates the task-specific parameters for the current task. G-META [huang2020graph] uses local subgraphs to transfer subgraph-specific information and learns transferable knowledge faster via meta gradients with only a handful of nodes or edges in the new task. AMM-GNN [wang2020graph] proposes a graph meta-learning framework for attributed graphs, which leverages an attribute-level attention mechanism to capture the distinct information of each task and thus learns more effective transferable knowledge for meta-learning. MI-GNN [wen2021meta] studies the problem of inductive node classification across graphs and proposes a meta-inductive framework that customizes the inductive model to each graph under a meta-learning paradigm.

5.3 Edge-Cloud Reinforcement Learning

Reinforcement Learning (RL) is a machine learning paradigm that learns a policy through the interaction between an agent and an environment, which is close to the way humans learn. RL provides special capabilities such as trial-and-error learning and long-term optimization, which are often neglected by traditional unsupervised and supervised learning methods [li2018overviewDRL]. Although it seems natural that the cloud-edge architecture fits the classical RL paradigm of an agent and an environment, RL systems trained and acting on the cloud-edge architecture have only been discussed in a limited scope. For example, [kapten:2020] discusses a possible implementation of model updates and subsequent model deployment on edges using WebAssembly. Nevertheless, there are thorough investigations in some special domains, which are summarized as follows.

5.3.1 Federated Reinforcement Learning

There are some works that study FL with an RL system [Zhuo2019FederatedRL, wang2020noniid, yu2021iUDEC]. Similar to FL [yang2019fml], federated RL (FRL) can be divided into two main categories: horizontal FRL (HFRL) and vertical FRL (VFRL) [qi2021federated]. In HFRL, the mobile devices are distributed geographically but face similar tasks. The well-studied topic of distributed RL is close to HFRL, and HFRL can be viewed as a security-enhanced form of distributed RL. For example, [wang2020noniid] studies the non-i.i.d. data issue by deliberately choosing reasonable participating edges during model updates. [yu2021iUDEC] proposes a two-timescale (fast and slow) DQN framework, where FL is adopted at the fast timescale and trained in a distributed manner. In VFRL, edge devices belong to the same group but their feature spaces differ. VFRL is so far less studied than HFRL; FedRL [Zhuo2019FederatedRL] is an example that builds a shared value network and applies Gaussian differentials to the information shared by edges. Besides that, multi-agent RL (MARL) shares many similarities with VFRL; yet VFRL requires modeling a partially observable Markov decision process (MDP), while MARL usually assumes full observability of the system.

There are also other FRL approaches that belong to neither HFRL nor VFRL. For example, [hu2021frs] proposes Federated Reward Shaping (FRS), in which reward shaping is employed to share federated information among agents. Multi-task FRL (MT-FedRL) [anwar2021mtFRL] federates policy-gradient RL across agents via smoothed weighted averaging of the agents' parameters. Both works are built on the server-client architecture.
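At its core, the HFRL server-client loop resembles federated averaging of policy parameters. The sketch below is our own minimal illustration — it averages tabular Q-values on a toy chain MDP rather than the neural policies used in the cited works — showing edges adapting local copies while only parameters travel to the server:

```python
import numpy as np

N_STATES, N_ACTIONS = 4, 2  # chain MDP: reward on reaching the last state

def local_q_update(q, episodes, rng, alpha=0.5, gamma=0.5):
    """A few episodes of tabular Q-learning with a random behavior policy.
    Action 1 moves right, action 0 stays; reaching state 3 pays 1 and ends."""
    for _ in range(episodes):
        s = 0
        for _ in range(20):
            a = int(rng.integers(N_ACTIONS))
            s_next = min(s + 1, N_STATES - 1) if a == 1 else s
            done = s_next == N_STATES - 1
            r = 1.0 if done else 0.0
            target = r if done else r + gamma * q[s_next].max()
            q[s, a] += alpha * (target - q[s, a])
            if done:
                break
            s = s_next
    return q

rng = np.random.default_rng(0)
global_q = np.zeros((N_STATES, N_ACTIONS))

for _ in range(20):  # communication rounds
    # Each edge adapts a local copy; raw experience never leaves the edge.
    local_qs = [local_q_update(global_q.copy(), 5, rng) for _ in range(3)]
    global_q = np.mean(local_qs, axis=0)  # server-side federated averaging

# The averaged Q-table prefers moving right in every non-terminal state.
```

Real HFRL systems replace the Q-table with network weights and add secure aggregation, but the communication pattern is the same.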

5.3.2 RL-Assisted Optimization

There are also substantial studies which use RL as a side system in cooperation with mobile-cloud systems, to deal with optimization issues including online resource allocation [wu2021disa], task scheduling [sheng2021reinforceschedu], workload scheduling [zheng2021rlws], computation offloading [zhao2020cap, chen2018decentralized, dai2019icct5g, qu2020dmroa, hao2020murl, zhan2020vecppo, wang2018carltm], and service migration [park2020rlservmig]. Among these works, applications are implemented on the Internet of Things (IoT) [zhao2020cap, wu2021disa, qu2020dmroa, sheng2021reinforceschedu], 5G networks [dai2019icct5g], telemonitoring [wang2018carltm], or vehicular terminals [zhan2020vecppo]. The basic motivation is that these optimization problems are generally NP-hard and easier to solve with DRL built on an MDP [hao2020murl, qu2020dmroa]. The detailed RL methods applied in these attempts include Q-Learning [wang2018carltm, zheng2021rlws], DQN [zhao2020cap, wu2021disa], REINFORCE [sheng2021reinforceschedu], DDPG [chen2018decentralized], PPO [zhan2020vecppo], and meta-RL [qu2020dmroa]. Some key elements, such as storage space [hao2020murl] or context impact [wang2018carltm], may also be taken into account and used to determine the computation stage of the service.
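To give a flavor of such formulations, the toy sketch below (our own single-step, bandit-style simplification, not taken from any of the cited systems) uses tabular Q-learning with epsilon-greedy exploration to decide whether an arriving task should run locally or be offloaded to the cloud, under assumed latency costs:

```python
import numpy as np

# States: task-size bucket (0 = small, 1 = large).
# Actions: 0 = run locally, 1 = offload to the cloud.
# Assumed costs: local time grows with task size; offloading pays a
# fixed transmission cost but the cloud computes quickly.
def latency(size, action):
    if action == 0:            # local execution
        return 1.0 if size == 0 else 8.0
    return 3.0                 # transmission + fast cloud compute

rng = np.random.default_rng(0)
q = np.zeros((2, 2))           # estimated cost per (state, action)
alpha, eps = 0.1, 0.2

for step in range(2000):
    size = int(rng.integers(2))               # a task arrives
    if rng.random() < eps:                    # epsilon-greedy exploration
        a = int(rng.integers(2))
    else:
        a = int(np.argmin(q[size]))           # pick the lower estimated cost
    # Bandit-style running-average update toward the observed latency.
    q[size, a] += alpha * (latency(size, a) - q[size, a])

# Learned policy: keep small tasks local, offload large ones.
```

The full systems cited above extend this idea to sequential MDPs with queue states, channel conditions, and energy budgets, which is where DQN, DDPG, or PPO become necessary.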

Fig. 3: Key milestones in the development of cloud computing and edge computing hardware.

6 Hardware

The major difference between edge computing and cloud computing is that we have to consider the hardware limitations of edges [murshed2019machine]. Specifically, constraints on computing power [tambe2021edgebert], memory [sze2017efficient], energy consumption [chen2016eyeriss], and bandwidth [wang2018bandwidth] determine the design of algorithms and systems. Fig. 3 enumerates some milestone products in the development of cloud computing and edge computing. In this section, we provide a brief landscape of the variety of edge/cloud hardware in several domains.

6.1 AI Hardware in Cloud Computing

The Graphics Processing Unit (GPU) was originally designed to create images for computer graphics and video game consoles. However, in the early 2010s, researchers found that GPUs could also be used, together with the Central Processing Unit (CPU), to accelerate calculations involving large amounts of data, especially for DNNs [krizhevsky2017imagenet]. The main difference between the CPU and GPU architectures is that the CPU is designed to quickly process a wide range of tasks (measured by the CPU clock speed) but is limited in the concurrency of the tasks it can run, whereas the GPU devotes its transistors to many simpler cores that excel at highly parallel workloads. In the following, we briefly analyze typical CPUs and GPUs.

6.1.1 CPU

The Core i9-7920X, one representative CPU of the Core X-series processors (https://en.wikipedia.org/wiki/List_of_Intel_Core_i9_processors), is among the fastest processors for general computation from Intel. It runs at 3.0 GHz and can turbo up to 4.4 GHz. As the competitor, Advanced Micro Devices (AMD) has also released the Ryzen series of products such as the Ryzen 5 2600 (https://www.amd.com/en/products/cpu/amd-ryzen-5-2600) and the AMD Ryzen 9 3900X (https://www.amd.com/en/products/cpu/amd-ryzen-9-3900x), both for desktops and IoT devices, based on the Zen microarchitecture. The recent AMD Ryzen Threadripper 3990X (https://www.amd.com/en/products/cpu/amd-ryzen-threadripper-3990x), which has the best performance, is the flagship of the Ryzen series, designed especially for edge machine learning.

6.1.2 GPU

Perhaps the most famous GPU series applied in deep learning is from Nvidia. The well-known AlexNet [krizhevsky2017imagenet], the winner of the 2012 ImageNet challenge, used the Nvidia GeForce GTX 580, a high-end graphics card launched by NVIDIA in 2010 that uses a 40 nm process and a 520 mm² chip with 3 billion transistors. Its GPU runs at 772 MHz and its memory at 1002 MHz (4 Gbps effective). Subsequently, NVIDIA released the Tesla K80 and Tesla P100, which further improved computation performance. Recently, the Tesla V100 and Tesla A100 have again caught the eye due to the impressive performance of PTMs [floridi2020gpt, m6, m6-t] in language understanding and image generation tasks. However, the cost has also grown correspondingly, which may limit usage by researchers without adequate grants in academia.

6.1.3 TPU

The Tensor Processing Unit (TPU) is a dedicated integrated circuit developed by Google for processing neural networks. TPUv1 is identified by its high-bandwidth loop, i.e., the core data and computation loop that processes neural network layers fast [Norrie_Patil_Yoon_Kurian_Li_Laudon_Young_Jouppi_Patterson_2021]. Two years later, the next-generation TPUv2 provided fast and cost-effective training for Google services; each TPUv2 board can reach 180 TFLOPS. With TPUv2, we can perform mixed-precision training, with float16 for computation and float32 for accumulation [Wang_Wei_Brooks_2019]. TPUv3 provides significant performance benefits over TPUv2 and fits better for larger-scale network architectures such as deeper ResNets (https://cloud.google.com/tpu/docs/system-architecture-tpu-vm). At Google I/O 2021, Google launched the latest TPUv4, with nearly twice the performance of TPUv3.
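The value of float32 accumulation is easy to demonstrate numerically. The snippet below is a generic illustration (plain NumPy, not TPU-specific code): summing many small half-precision values stalls when the accumulator itself is half precision, because once the running sum is large enough, each small addend falls below half of one representable step and rounds away.

```python
import numpy as np

# 10,000 copies of 0.001 stored in half precision, as activations or
# gradients might be on mixed-precision hardware. True sum: ~10.0.
values = np.full(10_000, 0.001, dtype=np.float16)

# Accumulating in float16: the running sum stalls once the addend
# drops below half a unit-in-the-last-place of the accumulator.
acc16 = np.float16(0.0)
for v in values:
    acc16 = np.float16(acc16 + v)

# Accumulating in float32, as mixed-precision training does.
acc32 = np.float32(0.0)
for v in values:
    acc32 = np.float32(acc32 + np.float32(v))

# acc32 stays close to the true sum (~10.0); acc16 stalls far below it.
```

Keeping the accumulator in float32 while the operands stay in a 16-bit format captures most of the memory and bandwidth savings without this catastrophic loss of small contributions.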

6.2 AI Hardware in Edge Computing

The demand for edge AI grows rapidly due to concerns of bandwidth, privacy, and the compute-transmission balance. Many representative hardware platforms have emerged to meet the requirements of numerous edge AI applications. In summary, there are four critical types of hardware, i.e., VPU, Edge TPU, mobile GPU, and NPU, which we review in the following.

6.2.1 VPU

A vision processing unit (VPU) is an emerging class of microprocessor used in edge AI. It allows the efficient execution of demanding computer vision and edge AI workloads, and achieves a balance between power efficiency and computing performance. One of the most popular examples is the Intel Neural Compute Stick (https://www.intel.com/content/www/us/en/developer/tools/neural-compute-stick/overview.html), which is based on the Intel Movidius Myriad X VPU. This plug-and-play device can be easily attached to edge devices running Linux, Windows, or Raspbian, including the Raspberry Pi and Intel NUC. In terms of machine learning frameworks, it supports TensorFlow, Caffe, Apache MXNet, Open Neural Network Exchange (ONNX), and PyTorch.

6.2.2 Edge TPU

An application-specific integrated circuit (ASIC) is an integrated circuit chip customized for a particular use, rather than intended for general-purpose use. Based on this technique, Google has built the Edge Tensor Processing Unit (Edge TPU) (https://cloud.google.com/edge-tpu) to accelerate ML inference on edges. It is capable of running classical compressed CNNs such as MobileNets V1/V2, MobileNets SSD V1/V2, and Inception V1-4, as well as TensorFlow Lite models, and has been applied to real-world detection and segmentation tasks.

6.2.3 Mobile GPU

Besides the VPU and Edge TPU, enterprises such as Nvidia and Qualcomm have also explored integrating GPUs to accelerate computation on edges. A representative is the NVIDIA Jetson Nano (https://developer.nvidia.com/embedded/jetson-nano), which includes an integrated 128-core Maxwell GPU, a quad-core ARM A57 64-bit CPU, and 4GB LPDDR4 memory, along with support for MIPI CSI-2 and PCIe Gen2 high-speed I/O. Besides, Apple developed the A12 Bionic (https://en.wikipedia.org/wiki/Apple_A12) for the iPhone and iPad, a 64-bit ARM-based system on a chip designed by Apple Inc. It includes dedicated neural network hardware, which Apple calls the “next-generation neural engine”. The A12 is manufactured using a 7-nanometer FinFET process, contains 6.9 billion transistors, and supports 4 GiB LPDDR4X memory. Similarly, HiSilicon released the Kirin 990 5G (https://www.hisilicon.com/en/products/Kirin/Kirin-flagship-chips/Kirin-990-5G), a 64-bit high-performance mobile ARM 5G SoC. More recently, Qualcomm announced the Snapdragon 888 (https://www.qualcomm.com/products/snapdragon-888-5g-mobile-platform) in 2020. Its integrated sixth-generation AI engine combines the digital computing capabilities of the new Hexagon 780 and the GPU, providing an impressive 26 TOPS of computing power.

6.2.4 NPU

The motivation for designing a Neural Processing Unit (NPU) is to find an efficient configuration to control a large number of resources [Park_Lee_Lee_Moon_Kwon_Ha_Kim_Park_Bang_Lim_2021]. In other words, NPU architectures are dedicated to energy-efficient DNN acceleration [Lee_2021]. DNNs typically require a large amount of data for training and demand large memory. To relieve the pressure on off-/on-chip memory bandwidth, NPUs usually rely on data reuse and on skipping unnecessary computations. Many companies have devised their own NPUs, including (but not limited to) DianNao [Chen_Du_Sun_Wang_Wu_Chen_Temam_2014] from Cambricon, the Samsung NPU [Park_Lee_Lee_Moon_Kwon_Ha_Kim_Park_Bang_Lim_2021], and TrueNorth from IBM. For example, the ARM Ethos-N77 (https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-n77) delivers up to 4 TOPS of performance (2048 8-bit MACs), scaling to hundreds of TOPS in multicore deployments. It supports machine learning frameworks such as TensorFlow, TensorFlow Lite, Caffe2, PyTorch, MXNet, and ONNX. TensorFlow and PyTorch also have generic support for deploying models on Android NPUs via Google NNAPI (https://www.tensorflow.org/lite/performance/nnapi).

7 Future Directions and Conclusion

7.1 Challenges

Although edge-cloud collaborative learning is a promising paradigm for a broad range of real-world applications, several unsolved challenges hinder its development. We summarize the main concerns in the following.

7.1.1 Data

To the best of our knowledge, open-source datasets for edge-cloud collaboration are very scarce, which limits exploration in academia. The only such dataset (https://tianchi.aliyun.com/dataset/dataDetail?dataId=109858) was released for mobile edge intelligence in recommendation. The reasons for the scarcity of real-world open-source datasets are twofold. On the one hand, fine-grained features on the edge side, such as users' real-time in-app scrolling, are usually not transmitted to the cloud due to the communication cost and the instant-serving bottleneck. That is to say, we cannot completely understand the characteristics of edge-side data by only training models on the current cloud-based datasets. On the other hand, for interactive scenarios like Gemini [chen2021mc], specific data collection is required to guarantee the distribution consistency between training and test. Simple simulation by decomposing cloud-based datasets cannot recover the real-world scenarios. Therefore, more open-source benchmark datasets contributed to this area would promote both research and industrial development.

7.1.2 Platform

The software platform is critical for exploring edge-cloud collaboration, since it is expensive to construct a collaboration environment and to simulate heterogeneous edges and communication noise. However, well-established platforms that are friendly to a range of algorithmic studies are still lacking. For example, in FL, we have to handle the uncontrollable number of local models uploaded from edge devices in real-world scenarios, which may affect the convergence of training [kairouz2019advances]. Besides, a systematic analysis of model training, deployment, and evaluation is still lacking, which is critical for measuring methods. In the future, it will be quite useful to establish fully functional platforms in both academia and industry.

7.2 Applications

7.2.1 Recommendation

Recommendation systems enable users to find and explore information easily and have become increasingly important in a wide range of online applications such as e-commerce, micro-video portals, and social media sites. Despite the huge success, modern recommender systems still suffer from user-oriented bias/fairness issues [Li_Chen_Fu_Ge_Zhang_2021, Chen_Dong_Wang_Feng_Wang_He_2020], privacy leakage [Shin_Kim_Shin_Xiao_2018, muhammad2020fedfast], and high-latency responses [Freno_Saveski_Jenatton_Archambeau_2015]. For example, centralized on-cloud training will inevitably be biased towards some privileged users, such as active users, resulting in an enlarged performance gap or unfairness among users. Embracing edge-cloud collaborative learning [dccl] opens up possibilities to address these issues. Expressly, edge training (cloud → edge) permits personalization that is free from biased collaborative filtering. Edge models can fully leverage edge features and provide low-latency services, such as dynamic interest modeling and re-ranking. As a counterpart, upon edge models, on-cloud training (edge → cloud) benefits from privileged distillation without access to sensitive user data, achieving both privacy protection and full personalization.

7.2.2 Auto-driving

Essentially, a vehicle entirely controlled by machines without any human input possesses the acclaimed banner of being autonomous. Nowadays, an important aspect of self-driving vehicles integrating with the cloud is the capability of using OTA (Over-The-Air) electronic communications [Khatun_Gla_Jung_2021]. As well as various sensors for detecting the outside world, the vehicle will be equipped with on-board computer processors and electronic-oriented memory technology, and can communicate with the cloud via a communication device [Vaidya_Kaur_Mouftah_2021]. Beyond data communication, edge-cloud collaborative learning techniques will further enhance autonomous driving in safety, functionality, and privacy [Deng_Zhang_Lou_Zheng_Jin_Han_2021]. Initially trained on clouds, edge machine learning models (cloud → edge) can interpret real-time raw data, make decisions based on the derived insights, and learn from the feedback of real-time road conditions. Real-time model adaptation is essential for improving safety and efficiency and for reducing accidents and traffic congestion. Raw data like driving records might contain sensitive content, so edge models can hereby enhance centralized training with privacy.

7.2.3 Games

Video games have soared as one of the most popular ways to spend time. To ensure a seamless gaming experience, existing games struggle to figure out the work division of cloud computing and edge computing [Nguyen_Tran_Thang_2020, Basiri_Rasoolzadegan_2018, Kassir_Veciana_Wang_Wang_Palacharla_2021]. With edge-cloud collaborative learning, we can leverage the advantages of both sides. On the edge side, we can do more personalization and responsive actions that are sensitive to latency. For example, in sandbox games where the gameplay element is to give players a great degree of creativity and freedom on task completion and NPC interaction, personalized AI for NPC dialogue generation and action-taking is an enticing element. Edge training permits such personalization by taking players’ behaviors, decisions, and preferences as input. Meanwhile, we can maintain the interactive feedback of edges and the cloud during both training and inference by transferring latent representations or leaving highly intensive but latency-insensitive training tasks for the cloud. Such a bidirectional collaboration brings more creation and freedom to users, ensures a seamless user experience, and opens up new possibilities for advanced gaming systems, such as the Metaverse [Dionisio_Burns_Gilbert_2013, Lee_Braud_Zhou_Wang_Xu_Lin_Kumar_Bermejo_Hui_2021].

7.2.4 IoT Security

Internet of Things (IoT) security refers to the protection of edges and networks connected to the IoT from malicious attacks [Waheed_He_Ikram_Usman_Hashmi_Usman_2021]. The technology, processes, and regulations necessary to secure IoT devices and networks constitute the IoT security landscape. Most existing IoT systems are vulnerable to attacks [Shahid_2021], where threats in IoT include (but are not limited to) lack of proper data encryption and malicious software. Fortunately, the scheme of edge-cloud collaborative learning can help address some security concerns. On the one hand, collaborative learning transfers model parameters, latent representations, and back-propagated gradients between edge-edge and edge-cloud pairs, avoiding communication attacks on transferred sensitive raw data. On the other hand, in bidirectional collaboration, a trustable cloud model can identify malicious edges and mitigate their negative effects, a task that can be computationally intensive and typically not affordable on edges.

7.3 Conclusion

The rapid development of computing power has driven AI to flourish and evolve into three paradigms: cloud AI, edge AI, and edge-cloud collaborative AI. To comprehensively understand the underlying polarization and collaboration of these paradigms, we systematically review the advancement of each direction and build a complete scope. Specifically, our survey covers broad areas including CV, NLP, and web services powered by cloud computing, and simultaneously discusses the architecture designs and compression techniques that are critical to edge AI. More importantly, we point out the potential collaboration types, ranging from privacy-primary collaboration such as Federated Learning to efficiency-primary collaboration for personalization. We rethink some classical paradigms from the perspective of collaboration that might be extended into edge-cloud collaboration. Some ongoing advanced topics for edge-cloud collaboration are also covered. Finally, we summarize the milestone products of cloud computing and edge computing in recent years, and present future challenges and applications.