MC^2-SF: Slow-Fast Learning for Mobile-Cloud Collaborative Recommendation

09/25/2021 ∙ by Zeyuan Chen, et al. ∙ Microsoft, Hong Kong Baptist University

With the hardware development of mobile devices, it has become possible to build recommendation models on the mobile side that exploit fine-grained features and real-time feedback. In contrast to straightforward mobile-based modeling appended to cloud-based modeling, we propose a Slow-Fast learning mechanism that makes Mobile-Cloud Collaborative recommendation (MC^2-SF) mutually beneficial. Specifically, in MC^2-SF the cloud-based model and the mobile-based model are treated as the slow component and the fast component respectively, according to their interaction frequencies in real-world scenarios. During training and serving, they communicate prior/privileged knowledge to each other to better capture user interests in the candidates, resembling the roles of System I and System II in human cognition. We conduct extensive experiments on three benchmark datasets and demonstrate that the proposed MC^2-SF outperforms several state-of-the-art methods.




1. Introduction

The information explosion on the web greatly drives the development of recommender systems, which automatically search for content, e.g., movies, songs and news, for users based on their interests. In the past years, recommender systems were usually deployed on the cloud server, owing to the large amount of user behavior data and the high demand for computing power. Recently, the rapid development of mobile devices has reshaped the architecture of industrial recommender systems, and building a recommendation phase on mobile devices to utilize fine-grained features and real-time feedback is becoming a trend (Gong et al., 2020; Sun et al., 2020; Yao et al., 2021).

Previous recommendation tends to build the model on the cloud or the device side independently, as illustrated in Figure 1 (a) and (b). The slow component deployed in the cloud enjoys large computing power and rich but delayed user behaviors, which drive the development of a range of deep-learning-based models like SASRec (Kang and McAuley, 2018), DIN (Zhou et al., 2018b) and other complex models (Sun et al., 2019; Li et al., 2020). Regarding the fast component on the mobile side, models are efficient and lightweight to meet the hardware constraints. They usually benefit from real-time feedback, fine-grained features or frequent responses (Gong et al., 2020; Sun et al., 2020) compared to cloud-based models. However, the models of the two sides have no collaboration in training.

Figure 1. The recommendation prototypes. Subfigures (a) and (b) represent independent modeling without relying on the other side, subfigures (c) and (d) represent the slow-centralized and the fast-centralized modeling respectively, and subfigure (e) is our framework.

Recent advances in recommendation start to consider the advantages of the counterpart side. For example, one representative methodology is Federated recommender systems (Yang et al., 2020), which leverage local devices to compute the gradients of the model while simultaneously keeping data private. They actually utilize the computing power of the fast component to serve the slow component, but do not consider the modeling of the fast component. We use Figure 1 (c) to indicate this type of biased collaboration and term it the slow-centralized modeling. Note that the dashed arrows denote pseudo collaboration for the other side. Another representative methodology is Model-Personalized recommender systems (Yao et al., 2021), which leverage the cloud server and data to re-calibrate the backbone model. We can regard it as the reverse counterpart of Federated recommender systems, illustrated in Figure 1 (d). In comparison, we term it the fast-centralized modeling.

Different from the above works, we focus on bidirectional collaboration that benefits both the slow component and the fast component, as shown in Figure 1 (e). Specifically, we propose MC-SF, a Slow-Fast learning mechanism for Mobile-Cloud collaborative recommendation. In MC-SF, the slow component helps the fast component make predictions by delivering auxiliary latent representations; conversely, the fast component transfers the feedback from real-time exposed items to the slow component, which helps better capture user interests. The intuition behind MC-SF resembles the roles of System I and System II in human cognition (Kahneman, 2011), where System II changes slowly but conducts comprehensive reasoning about the circumstances, and System I perceives quickly to make accurate recognitions (Madan et al., 2021). The interaction between System I and System II allows prior/privileged information to be exchanged in time to collaboratively meet the requirements of the environment.

We summarize the contributions of this paper as follows:


  • To the best of our knowledge, we are the first to study the bidirectional collaboration between the cloud-based model and the mobile-based model enabled by recent hardware advances.

  • We introduce a slow-fast learning mechanism, MC-SF, which treats the cloud-based model and the mobile-based model as a slow component and a fast component respectively, between which prior and privileged knowledge are exchanged during training and serving.

  • Extensive experiments on three benchmark datasets demonstrate that the proposed method significantly outperforms state-of-the-art recommendation baselines and show the promise of the bidirectional collaboration.

2. Related Work

2.1. Independent Modeling in Recommendation

In this section, we review the independent modeling on the cloud side and on the mobile side respectively. Regarding cloud-based recommendation models, early explorations fall into collaborative filtering (Wang et al., 2006; Su and Khoshgoftaar, 2009) or the Latent Factor Model (LFM) (Zhang et al., 2013). With the development of deep learning, deep neural networks have been incorporated into recommendation to acquire high-level semantics (Elkahky et al., 2015; Van Den Oord et al., 2013; Wang et al., 2015). For example, DeepFM (Guo et al., 2017) combines the power of factorization machines for recommendation and deep learning for feature learning, yielding a promising improvement in performance. To model user dynamics in sequences, recurrent neural networks (RNNs) are applied to obtain representations of whole user behavior sessions for sequential recommendation (Hidasi et al., 2015). Subsequent work (Quadrana et al., 2017) further investigates a hierarchical version of recurrent neural networks. Moreover, Caser (Tang and Wang, 2018) captures local features in sequential behaviors by means of convolution filters. Some recent studies utilize attention mechanisms to learn behavior sequence representations (Vaswani et al., 2017; Kang and McAuley, 2018; Zhou et al., 2018b), or apply sequential recommendation to specific scenarios (Lian et al., 2020; Chen et al., 2021; Zhang et al., 2021).

With the hardware development of mobile devices, mobile-based modeling has drawn much more attention in industrial scenarios (Satyanarayanan, 2017). For example, knowledge distillation has been widely used to distill a smaller but efficient student network from a teacher network, which suits on-device inference (Kim and Rush, 2016; Zhou et al., 2018a). Compression techniques that jointly leverage weight quantization and distillation for efficiently executing deep models in resource-constrained environments, such as mobile or embedded devices, have also been proposed. CpRec (Sun et al., 2020) employs a generic model-shrinking technique to reduce response time and memory footprint. Other work (Gong et al., 2020) implements a novel recommender system on the device side and addresses the serving concern with a split deployment strategy.

2.2. Biased Collaboration in Recommendation

This line of work can be split into two parts: the slow-centralized modeling and the fast-centralized modeling. One exemplar of the slow-centralized modeling is Federated recommender systems (Zhou et al., 2012), which train the cloud-based model with the aid of distributed local devices. The gradients of the local copies of the centralized deep model are first computed on plenty of devices, and then collected at the server to update the model parameters by federated averaging (FedAvg) or its variants (Anelli et al., 2020; Muhammad et al., 2020). For example, Qi et al. (2020) leveraged the local information of massive users to train an accurate news recommendation model while keeping sensitive data private. To reduce the communication costs of federated learning, structured and sketched updates that compress networks on the device side were introduced (Konečnỳ et al., 2016). One representative work of the fast-centralized modeling is DCCL (Yao et al., 2021), which can be treated as the reverse counterpart of Federated Learning. DCCL leverages the cloud server and data to re-calibrate the backbone model, but only for on-device personalization. Neither the slow-centralized nor the fast-centralized modeling actually achieves the bidirectional mobile-cloud collaboration, which is meaningful under the real-world pipeline of industrial recommender systems.

3. Preliminary

In this section, we formulate the problem of mobile-cloud collaborative recommendation. Let $\mathcal{U}$ denote the user set and $\mathcal{V}$ denote the item set, where $|\mathcal{U}|$ and $|\mathcal{V}|$ are the numbers of users and items respectively. For each user $u$, we define the interacted item sequence $S_u$. From the perspective of collaborative recommendation, $S_u$ on the cloud side is instantiated as $S_u^{s}$ and on the mobile side as $S_u^{f}$ respectively. The sequences on the two sides differ because the features on the mobile side are more real-time and fine-grained, and we cannot leverage overly long sequences, i.e., $|S_u^{f}| \ll |S_u^{s}|$, given the limited computational resources of mobile devices. We give more details in the subsequent sections. Without loss of generality, we define the problem of mobile-cloud collaborative recommendation as follows.

Problem 1 (Mobile-Cloud Collaborative Recommendation).

As aforementioned, we term the cloud-based model the slow component and the mobile-based model the fast component. Given a target user $u$ and a candidate item $v$, the goal of both the slow and the fast components is to learn functions that accurately predict the user interaction, respectively defined as $f_s(u, v; \Theta_s, M_f)$ and $f_f(u, v; \Theta_f, E_s)$. $\Theta_s$ and $\Theta_f$ denote the trainable parameters of the slow component and the fast component, and $M_f$ and $E_s$ denote the interactive features exchanged between them. Traditional recommendation without mobile-cloud collaboration, or only with biased collaboration, does not have $M_f$ and $E_s$. Figure 1 summarizes the difference from previous works.

Figure 2. Architecture of the model MC-SF. The input denotes the sequence of item embeddings. The interacted items generated on the fast side are updated instantly (fast update) and uploaded to the slow component at regular intervals (slow update) in industrial practice, so the input of the slow component updates more slowly than that of the fast component. Owing to its large computing power, the slow component has more parameters and higher computational complexity than the fast component. The collaboration of both sides is achieved by prior/privileged knowledge delivery.

4. Methodology

Following the convention of recent deep-learning-based recommendation methods, we map each item ID to a dense vector and tune it during training. Taking item $i$ as an example, we define the following embedding-lookup function:

$\mathbf{e}_i = \mathbf{E}^\top \mathbf{x}_i,$

where $\mathbf{E} \in \mathbb{R}^{|\mathcal{V}| \times d}$ is a trainable item embedding matrix and $d$ is the embedding dimension. Note that $\mathbf{E}$ is stored in the cloud and only part of it is distributed to the mobile device when requested. $\mathbf{x}_i$ is a one-hot vector for the $i$-th item.
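The lookup above is equivalent to selecting one row of the embedding matrix, which a small NumPy sketch makes concrete (the names `E` and `embedding_lookup`, and the toy sizes, are illustrative, not from the paper's implementation):

```python
import numpy as np

def embedding_lookup(E, item_id):
    """Return the embedding of `item_id`: multiplying E^T by a one-hot
    vector x_i simply selects the corresponding row of E."""
    return E[item_id]

rng = np.random.default_rng(0)
num_items, d = 5, 4                      # toy sizes; the paper fixes d = 32
E = rng.normal(size=(num_items, d))      # trainable item embedding matrix

e_3 = embedding_lookup(E, 3)
x_3 = np.eye(num_items)[3]               # the explicit one-hot formulation
assert np.allclose(e_3, E.T @ x_3)       # both formulations agree
```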

4.1. Independent Slow Component

For the cloud-based modeling, i.e., the slow component, we use $\mathbf{H}^s$ to represent the embeddings of the click sequence $S_u^s$. Deep Interest Network (Zhou et al., 2018b), abbreviated as DIN, is used to model the user representation from $\mathbf{H}^s$ and the candidate item as follows,

(1) $\mathbf{h}^s = \mathrm{DIN}(\mathbf{H}^s, \mathbf{e}_v),$

where $\mathbf{e}_v$ denotes the embedding of the candidate item. We then combine $\mathbf{h}^s$ with $\mathbf{e}_v$ and feed them into an MLP to compute the following click-through rate prediction

$\hat{y}^s = \sigma(\mathrm{MLP}([\mathbf{h}^s; \mathbf{e}_v])).$

Finally, the cross-entropy loss is applied for the training of the parameters in the slow component as follows,

(2) $\mathcal{L}_s = -\sum_{(u,v)} \big( y_{uv} \log \hat{y}^s_{uv} + (1 - y_{uv}) \log (1 - \hat{y}^s_{uv}) \big).$

By far, we have presented a typical cloud-based model without any real-time knowledge intervention from the mobile side.
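The slow-side pipeline (target-aware pooling, MLP scoring, cross-entropy loss) can be sketched in NumPy. The inner-product attention here is a simplified stand-in for DIN's learned attention unit, and all names and sizes are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def target_aware_pool(H, e_v):
    """Simplified stand-in for DIN (Eq. 1): weight each behavior embedding
    by its inner-product affinity with the candidate, then sum."""
    return softmax(H @ e_v) @ H

def ctr_predict(h_u, e_v, W1, W2):
    """Concatenate user and candidate vectors, pass through a tiny MLP."""
    x = np.concatenate([h_u, e_v])
    hidden = np.maximum(0.0, W1 @ x)     # ReLU layer
    return sigmoid(W2 @ hidden)          # scalar CTR score

def cross_entropy(y, p):
    """Per-example binary cross-entropy (Eq. 2 summand)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(1)
d = 4
H = rng.normal(size=(6, d))              # six clicked-item embeddings
e_v = rng.normal(size=d)                 # candidate embedding
W1 = rng.normal(size=(8, 2 * d))
W2 = rng.normal(size=8)

h_u = target_aware_pool(H, e_v)
p = ctr_predict(h_u, e_v, W1, W2)
loss = cross_entropy(1.0, p)             # label y = 1
```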

4.2. Independent Fast Component

For the mobile-based model, i.e., the fast component, we use $\mathbf{H}^f$ to denote the embeddings of the clicked and real-time exposed items in $S_u^f$. Note that we take the exposed items into account because they are also informative, as pointed out in (Gong et al., 2020; Xie et al., 2020). In the fast component, we use two GRUs to separately encode the clicked part and the exposed part of $S_u^f$, termed $\mathrm{GRU}_{clk}$ and $\mathrm{GRU}_{exp}$ respectively, and then fuse them by a target-aware attention mechanism. Formally, the procedure is formulated by the following equations,

(3) $\mathbf{h}^{clk} = \mathrm{GRU}_{clk}(\mathbf{H}^f_{clk}), \quad \mathbf{m} = \mathrm{GRU}_{exp}(\mathbf{H}^f_{exp}), \quad \mathbf{h}^f = \mathrm{Att}(\mathbf{h}^{clk}, \mathbf{m}, \mathbf{e}_v),$

where $\mathbf{h}^{clk}$ can be seen as the user's real-time interests, and $\mathbf{m}$ can be considered the negative impression. Like the slow component, we combine $\mathbf{h}^f$ and $\mathbf{e}_v$ to compute the prediction

$\hat{y}^f = \sigma(\mathrm{MLP}([\mathbf{h}^f; \mathbf{e}_v])).$

Similarly, the cross-entropy loss is applied to train the parameters of the fast component as follows,

(4) $\mathcal{L}_f = -\sum_{(u,v)} \big( y_{uv} \log \hat{y}^f_{uv} + (1 - y_{uv}) \log (1 - \hat{y}^f_{uv}) \big).$

Now we have an independent recommendation model on the mobile side, which exploits more real-time clicked items and feedback from the exposed items to capture user interests.
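The two-encoder fast component can be sketched with a minimal GRU and a softmax fusion over the two states. The fusion form and all parameter names are illustrative assumptions; biases are omitted for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru(X, params, h0=None):
    """Minimal GRU encoder; returns the last hidden state.
    `params` holds the six weight matrices of a standard GRU cell."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    h = np.zeros(Uz.shape[0]) if h0 is None else h0
    for x in X:
        z = sigmoid(Wz @ x + Uz @ h)              # update gate
        r = sigmoid(Wr @ x + Ur @ h)              # reset gate
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
        h = (1 - z) * h + z * h_tilde
    return h

def fuse(h_clk, m, e_v):
    """Target-aware fusion of the two encodings: softmax over their
    affinities with the candidate (a sketch of the attention in Eq. 3)."""
    scores = np.array([h_clk @ e_v, m @ e_v])
    scores = scores - scores.max()
    w = np.exp(scores) / np.exp(scores).sum()
    return w[0] * h_clk + w[1] * m

rng = np.random.default_rng(2)
d = 4                                     # hidden size = embedding size here
make = lambda: tuple(rng.normal(size=(d, d)) for _ in range(6))
clk_params, exp_params = make(), make()

H_clk = rng.normal(size=(5, d))           # clicked-item embeddings
H_exp = rng.normal(size=(3, d))           # real-time exposed-item embeddings
e_v = rng.normal(size=d)

h_clk = gru(H_clk, clk_params)            # real-time interests
m = gru(H_exp, exp_params)                # negative memory
h_f = fuse(h_clk, m, e_v)                 # fast-side user representation
```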

4.3. Interactive Slow Component

As shown by (Gu et al., 2021), real-time exposures that reflect the user's negative impression can bring gains to the model. Based on Eq. (3) in the basic fast component, we can obtain the negative memory $\mathbf{m}$, which memorizes the previously exposed items, and send it to the slow component. With this, the slow component can feed $\mathbf{m}$ into an MLP to generate the presumed response, and re-weight $\mathbf{m}$ by it to form the final feature about the exposure feedback,

(5) $\mathbf{r} = g \cdot \mathbf{m}, \quad g = \sigma(\mathrm{MLP}([\mathbf{m}; \mathbf{e}_v])).$

Note that, during the training phase, the offline optimization of the slow component can affect the update of the re-weighting operation in Eq. (5). After Eq. (5), we concatenate $\mathbf{h}^s$ in Eq. (1), $\mathbf{e}_v$ and $\mathbf{r}$ to compute the exposure-aware prediction

(6) $\hat{y}^s = \sigma(\mathrm{MLP}([\mathbf{h}^s; \mathbf{e}_v; \mathbf{r}])).$

The optimization objective is the same as Eq. (2).
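The re-weighting of the negative memory can be sketched as a scalar gate against the candidate. The concatenation-based gate, the single linear layer standing in for the MLP, and all names are illustrative assumptions, since the paper's exact form is not recoverable here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def exposure_feature(m, e_v, w):
    """Sketch of the re-weighting step: score the negative memory against
    the candidate to get a presumed response g in (0, 1), then scale the
    memory by it to obtain the exposure-feedback feature r."""
    g = sigmoid(np.concatenate([m, e_v]) @ w)   # scalar presumed response
    return g * m                                # exposure-feedback feature

rng = np.random.default_rng(3)
d = 4
m = rng.normal(size=d)          # negative memory uploaded by the fast side
e_v = rng.normal(size=d)        # candidate embedding
w = rng.normal(size=2 * d)      # gate parameters, trained on the cloud side

r = exposure_feature(m, e_v, w)
# The slow side then scores MLP([h_s; e_v; r]) instead of MLP([h_s; e_v]).
```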

4.4. Interactive Fast Component

The fast component is usually a smaller model than the slow component, since the cloud side has sufficient computing and storage resources to enlarge the model. Hence the prior knowledge from the slow component is a useful hint for the fast component. Considering this, we send $\mathbf{h}^s$ and $\mathbf{r}$ from the cloud to the mobile to help the prediction of the fast component. Specifically, we transform them into the space of the GRU hidden states, and use the transformed representations to initialize the states of $\mathrm{GRU}_{clk}$ and $\mathrm{GRU}_{exp}$ in Eq. (3). The procedure is formulated as follows,

(7) $\mathbf{h}^{clk}_0 = \mathbf{W}_1 \mathbf{h}^s, \quad \mathbf{h}^{exp}_0 = \mathbf{W}_2 \mathbf{r},$

where $\mathbf{W}_1$ and $\mathbf{W}_2$ are trainable transformation matrices. Following Eq. (3), we can then obtain the final representation $\mathbf{h}^f$. To enhance the effect of the prior knowledge from the slow component, we also combine $\mathbf{h}^s$ with $\mathbf{h}^f$ and $\mathbf{e}_v$ to compute the prediction score like Eq. (6),

(8) $\hat{y}^f = \sigma(\mathrm{MLP}([\mathbf{h}^f; \mathbf{h}^s; \mathbf{e}_v])).$

Similarly, the optimization objective Eq. (4) is applied.
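The state-initialization step can be sketched as follows; the tanh squashing (to keep the priors in the GRU state range) and the names `W1`/`W2` are illustrative assumptions:

```python
import numpy as np

def init_fast_states(h_s, r, W1, W2):
    """Transform the slow-side interest embedding and exposure feature
    into the GRU hidden-state space and use them as initial states for
    the two fast-side GRUs (a sketch of the transfer step)."""
    h0_clk = np.tanh(W1 @ h_s)   # initial state for the clicked-sequence GRU
    h0_exp = np.tanh(W2 @ r)     # initial state for the exposed-sequence GRU
    return h0_clk, h0_exp

rng = np.random.default_rng(4)
d_s, d_f = 8, 4                  # slow-side and fast-side dimensions (toy)
h_s = rng.normal(size=d_s)       # interest embedding from the cloud
r = rng.normal(size=d_s)         # exposure feature from the cloud
W1 = rng.normal(size=(d_f, d_s))
W2 = rng.normal(size=(d_f, d_s))

h0_clk, h0_exp = init_fast_states(h_s, r, W1, W2)
```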

4.5. Slow-Fast Learning

The complete procedure of MC-SF is summarized in Algorithm 1. Specifically, the slow component first receives the negative memory $\mathbf{m}$ and optimizes based on $\mathbf{m}$ and the relevant input. After training, the slow component generates the representations $\mathbf{h}^s$ and $\mathbf{r}$ and sends them to the fast component to help prediction, which corresponds to the “Interest Embedding” transfer operation shown in Figure 2. Based on $\mathbf{h}^s$ and $\mathbf{r}$, the fast component continues to optimize. As the model is deployed on the mobile side, it can incorporate the corresponding real-time exposed items to generate a new negative memory $\mathbf{m}$. The upload of “Negative Interest” in Figure 2 represents the new negative memory being uploaded to improve the slow component. As such, the collaboration framework iterates continuously.

  while lifecycle do
     Slow:
       Training:
         1. Receive the negative memory m
         2. Optimize the slow component
       Inference:
         1. Generate the representations h^s and r based on the slow input
         2. Send the representations to the fast component
     Fast:
       Training:
         1. Receive the representations h^s and r
         2. Optimize the fast component
       Inference:
         1. Accumulate exposed items and update the negative memory m
         2. If the time threshold is reached: upload m
         3. Else: return to step 1
  end while
Algorithm 1 Slow-Fast Learning
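The alternation in Algorithm 1 can be sketched as a plain Python loop; `DummySlow`/`DummyFast` and their methods are illustrative placeholders for the two components, not the paper's actual API:

```python
class DummySlow:
    """Illustrative stand-in for the cloud-side model."""
    def train(self, m):
        self.m = m                      # optimize with the negative memory
    def summarize(self):
        return "h_s", "r"               # interest embedding + exposure feature

class DummyFast:
    """Illustrative stand-in for the device-side model."""
    def init_negative_memory(self):
        return []
    def train(self, h_s, r):
        self.prior = (h_s, r)           # optimize with the slow-side priors
    def collect_exposures(self):
        return ["exposed_item"]         # accumulate real-time exposures

def slow_fast_loop(slow, fast, rounds=3):
    """Skeleton of Algorithm 1: alternate slow and fast phases, exchanging
    the negative memory (fast -> slow) and the priors (slow -> fast)."""
    m = fast.init_negative_memory()
    for _ in range(rounds):
        slow.train(m)                   # Slow / Training
        h_s, r = slow.summarize()       # Slow / Inference
        fast.train(h_s, r)              # Fast / Training
        m = fast.collect_exposures()    # Fast / Inference: new negative memory
    return m

m = slow_fast_loop(DummySlow(), DummyFast())
```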

5. Experiments

This section first clarifies the experimental setups and then provides comprehensive experimental results, striving to answer the pivotal questions below:


  • What is the performance of MC-SF compared with the state-of-the-art recommendation models?

  • How does the bidirectional collaboration used in MC-SF contribute to the overall performance?

  • How do the hyper-parameters of MC-SF affect the final performance?

5.1. Experimental Setups

This section explains the used datasets, the adopted evaluation protocols, the baselines to be compared, and the model implementations.

Dataset ML-1M Alipay Steam
# Users 6,040 1,031,268 39,795
# Items 3,706 13,932 14,411
# Interactions 1,000,209 228,412,772 17,978,753
Table 1. Statistics of the datasets.

5.1.1. Datasets

To evaluate model performance, we choose two publicly available datasets, ML-1M and Steam, and one industrial dataset, Alipay. We preprocess the datasets to guarantee that each user and each item has at least 20 interactions. The basic statistics of the three datasets are summarized in Table 1. Each dataset is divided into three disjoint parts according to the log timestamps: a slow training phase, a fast training phase, and a testing phase. Specifically, for a sequence containing $n$ items, the testing phase contains the last 5 items, the fast training phase uses a window of interactions immediately preceding the testing phase, and the slow training phase contains the rest of the items. Each component is trained in the corresponding phase with the aid of the other side.
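The three-phase split can be sketched per user as below; the 5-item testing phase follows the paper, while the 20-item fast window is an assumed size (the exact bounds were lost in extraction):

```python
def split_by_time(events, n_test=5, n_fast=20):
    """Split one user's chronologically ordered interactions into the
    slow-training / fast-training / testing phases."""
    test = events[-n_test:]
    fast = events[-(n_test + n_fast):-n_test]
    slow = events[:-(n_test + n_fast)]
    return slow, fast, test

events = list(range(30))                 # 30 interactions, oldest first
slow, fast, test = split_by_time(events)
assert test == [25, 26, 27, 28, 29]      # last 5 items
assert fast == list(range(5, 25))        # the window just before testing
assert slow == [0, 1, 2, 3, 4]           # the rest
```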

5.1.2. Evaluation Protocols

Three widely used metrics are adopted: (1) HR@k (Hit Ratio@k) is the proportion of recommendation lists that have at least one positive item within the top-k positions. (2) NDCG@k (Normalized Discounted Cumulative Gain@k) is a position-aware ranking metric that assigns larger weights to the top positions; the higher the positive items rank, the larger the metric value. (3) MRR (Mean Reciprocal Rank) measures the relative position of the top-ranked positive item and takes value 1 if the positive item is ranked at the first position. HR@k and MRR mainly focus on the first positive item, while NDCG@k considers a wider range of positive items. They are mathematically defined as follows,

$\mathrm{HR}@k = \frac{1}{|\mathcal{U}|}\sum_{(u,v)} \mathbb{1}(R_{u,v} \le k), \quad \mathrm{NDCG}@k = \frac{1}{|\mathcal{U}|}\sum_{(u,v)} \frac{\mathbb{1}(R_{u,v} \le k)}{\log_2(1 + R_{u,v})}, \quad \mathrm{MRR} = \frac{1}{|\mathcal{U}|}\sum_{(u,v)} \frac{1}{R_{u,v}},$

where $R_{u,v}$ is the ranking position of the interaction between user $u$ and item $v$, and $\mathbb{1}(\cdot)$ is the indicator function.
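Under the single-positive-item protocol described above, the three metrics reduce to a few lines (function names are illustrative):

```python
import math

def hr_at_k(ranks, k):
    """Fraction of test cases whose positive item appears in the top-k."""
    return sum(r <= k for r in ranks) / len(ranks)

def ndcg_at_k(ranks, k):
    """With one positive item per list, the ideal DCG is 1, so NDCG
    reduces to the average discounted gain 1/log2(1 + rank)."""
    return sum(1.0 / math.log2(1 + r) for r in ranks if r <= k) / len(ranks)

def mrr(ranks):
    """Average reciprocal rank of the positive item."""
    return sum(1.0 / r for r in ranks) / len(ranks)

ranks = [1, 3, 11]          # rank of the positive item in three test lists
assert hr_at_k(ranks, 10) == 2 / 3
assert ndcg_at_k(ranks, 1) == 1 / 3     # only rank 1 earns full gain
assert mrr(ranks) == (1 + 1 / 3 + 1 / 11) / 3
```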

In the experiments, we list the results w.r.t. HR@1, HR@5, HR@10, NDCG@5, NDCG@10 and MRR.

Regarding negative sampling, we follow the common practice in recommendation studies with implicit feedback (He et al., 2017; Kang and McAuley, 2018). Notably, our testing setting differs from traditional recommendation. For each sequence, traditional recommendation uses the most recent interaction of each user for testing. In our setting, however, the slow and fast components are evaluated differently. The slow component cannot immediately use the sequence in the test set, so it predicts every item in the testing phase based on the sequence in the slow training phase. The fast component, deployed on the mobile side, can instantaneously access the sequence of the testing phase; it therefore leverages all items before the target item to perform the prediction.

Dataset Metric FM NeuMF GRU4Rec DIN DIN+FM DIN+NeuMF DIN+GRU4Rec MC-SF Improv.
ML-1M HR@1 0.5308 0.5345 0.5459 0.5505 0.5508 0.5441 0.5671 0.6105 7.65%
HR@5 0.6619 0.6622 0.6794 0.6868 0.6782 0.6709 0.6880 0.7166 4.16%
HR@10 0.7511 0.7538 0.7615 0.7616 0.7567 0.7618 0.7643 0.7768 1.64%
NDCG@5 0.6150 0.6230 0.6202 0.6207 0.6271 0.6265 0.6285 0.6643 5.70%
NDCG@10 0.6395 0.6471 0.6445 0.6449 0.6508 0.6509 0.6538 0.6837 4.57%
MRR 0.6108 0.6162 0.6177 0.6197 0.6169 0.6216 0.6304 0.6636 5.27%
Alipay HR@1 0.3462 0.2974 0.5756 0.6086 0.6043 0.5395 0.6224 0.6286 1.00%
HR@5 0.5669 0.5130 0.7073 0.7222 0.7221 0.6763 0.7301 0.7456 2.12%
HR@10 0.7407 0.6943 0.8346 0.8474 0.8473 0.8107 0.8545 0.8708 1.91%
NDCG@5 0.4586 0.4050 0.6411 0.6590 0.6631 0.6063 0.6744 0.6852 1.60%
NDCG@10 0.5144 0.4633 0.6820 0.6993 0.7032 0.6496 0.7124 0.7256 1.85%
MRR 0.4623 0.4127 0.6470 0.6640 0.6668 0.6116 0.6804 0.6906 1.50%
Steam HR@1 0.4204 0.3310 0.4492 0.4512 0.4699 0.4150 0.4796 0.4926 2.71%
HR@5 0.5393 0.5145 0.5398 0.5425 0.5679 0.5281 0.5726 0.5853 2.22%
HR@10 0.6054 0.5729 0.6111 0.6098 0.6256 0.5989 0.6419 0.6538 1.85%
NDCG@5 0.4821 0.4372 0.4950 0.4976 0.5202 0.4666 0.5265 0.5389 2.30%
NDCG@10 0.5034 0.4600 0.5178 0.5192 0.5388 0.4854 0.5487 0.5617 2.37%
MRR 0.4858 0.4300 0.5034 0.5055 0.5231 0.4721 0.5355 0.5464 2.04%
Table 2. The results w.r.t. HR@1, HR@5, HR@10, NDCG@5, NDCG@10, and MRR for recommendation on three datasets. The best results in each measure are highlighted in “bold”. Improv. denotes the relative improvement over the second-best results.

5.1.3. Baselines

Some representative sequential recommendation models are considered as the slow component in the experiments:


  • Caser (Tang and Wang, 2018). Caser is a method that combines CNNs and a latent factor model to learn users’ sequential and general representations.

  • SASRec (Kang and McAuley, 2018). SASRec is a well-performed model that heavily relies on self-attention mechanisms to identify important items from a user’s behavior history. These important items affect user representations and finally determine the next-item prediction.

  • DIN (Zhou et al., 2018b). DIN is a popular attention model that captures interests relative to the target item and obtains adaptive interest representations.

We also take the following three well-known lightweight recommendation models as the fast component:


  • FM (Rendle, 2010). This is a benchmark factorization model considering the second-order feature interactions between inputs. Here we treat the IDs of a user and an item as input features.

  • NeuMF (He et al., 2017). This is a pioneering model that combines deep learning with collaborative filtering for general recommendation.

  • GRU4Rec (Hidasi et al., 2015). This is a pioneering model that successfully applies recurrent neural networks to model user sequence for recommendation.

To validate the effectiveness of the slow component as a prior for the fast component, we distribute the ranking list of candidate items from the slow component to the fast component. As such, the fast component produces more accurate recommendation results based on the ranking features from the slow component. Owing to its best performance, validated in Tables 2 and 3, we choose DIN as the slow component. The ad-hoc combinations with the fast components are thus termed DIN+FM, DIN+NeuMF and DIN+GRU4Rec.

Method ML-1M (HR@5 / NDCG@5 / MRR) Alipay (HR@5 / NDCG@5 / MRR) Steam (HR@5 / NDCG@5 / MRR)
MC-SF 0.7166 0.6642 0.6636 0.7456 0.6852 0.6906 0.5853 0.5389 0.5464
SASRec 0.6855 0.6188 0.6166 0.7065 0.6501 0.6601 0.5419 0.4961 0.5028
Caser 0.6837 0.6181 0.6170 0.7024 0.6500 0.6605 0.5393 0.4920 0.4992
Table 3. The performance of different models for the slow component.

5.1.4. Model Implementations

We implement our model in TensorFlow and deploy it on a Linux server with Nvidia Tesla V100 GPUs (16G memory). The model is learned in a mini-batch fashion with a batch size of 256. All methods are optimized by Adam with a learning rate of 5e-4 and the other hyper-parameters kept at their defaults. We add L2 regularization to the loss function with a regularization weight of 1e-4. The embedding size of all relevant models is fixed to 32 for fairness. The number of layers in the MLP is set to 3. To reduce the impact of noise, all results in our experiments are averaged over 3 runs.

5.2. Experimental Results

This section elaborates on the comprehensive experimental results to answer the aforementioned three research questions.

Figure 3. A case study comparing the independent and interactive slow and fast components for one user. The left side presents five items that the user clicked recently and the right side gives the top-5 recommendations from four methods. (“Fast” and “Slow” denote independent components, “S2F” and “F2S” denote interactive components.)

5.2.1. Comparison (Q1)

Table 2 presents the overall performance of our model and all the adopted baselines on the mobile side, from which we have the following observations:


  • FM and NeuMF perform comparably on each dataset. Both models focus on feature interactions and user-item collaborative information, and the remaining differences between them might be attributed to the characteristics of the corresponding datasets.

  • Compared with the other baseline models of the fast component, GRU4Rec achieves the best results on the three datasets. This conforms to expectation, since using only the representations of the user and the item is insufficient and ignores sequential temporal patterns. Standard RNNs are good at modeling sequential dependencies, so the user preference information hidden in behavior sequences can be effectively captured.

  • Compared to basic fast components that do not consider the guidance of the slow components, all fast components with the privileged information from DIN achieve better performance than the independent counterparts.

Tables 2 and 3 present the overall performance of our model and all the adopted baselines on the cloud side.


  • From the table, Caser achieves relatively poor results on the three datasets. The reason might be that convolution filters are not good at capturing sequential patterns.

  • SASRec is a transformer-like recommendation model. Due to the strong power of the attention computation used in transformers, it performs significantly better than the aforementioned models. However, DIN, which models the interaction between the target item and the behavior sequence, achieves the best results.

As shown in Tables 2 and 3, MC-SF achieves consistently better performance than all the baselines. In particular, MC-SF improves over the second-best models w.r.t. NDCG@5 by 5.70%, 1.60%, and 2.30% on ML-1M, Alipay, and Steam, respectively. This is because: (1) MC-SF can distill collaborative knowledge from the slow component to the fast component through generalized interest representations. (2) The introduction of the negative memory significantly improves the collaborative recommendation task, as validated in Figure 4.

Figure 4. The performance of MC-SF compared to DIN.

5.2.2. Ablation Study (Q2)

We further conduct an ablation study to validate the contributions of the key components in MC-SF. Specifically, (1) “DIN” represents using DIN to produce prediction results from the perspective of the slow component. (2) “w/ F2S” means uploading the negative memory from the fast component to the slow component, so that the slow component can combine these features to refine its predictions. (3) “w/ S2F” denotes that the slow component additionally distributes privileged features to the fast component on the basis of “w/ F2S”, proceeding with the collaborative recommendation.

Throughout the result analysis of the ablation study shown in Figure 4, we observe that:


  • “w/ F2S” achieves a clear performance improvement, validating the crucial role of the negative memory from the fast component. This is because learning from negative features leads to more informative user representations that contain both positive and negative information.

  • “w/ S2F” also shows significant improvements, revealing that distributing privileged features and proceeding with the collaborative recommendation is indispensable. The reason might be that the privileged features introduce the long-term user interests that the fast component lacks. Meanwhile, collaborative learning makes it easier for the two sides to reach an equilibrium.

Figure 5. The performance of MC-SF on ML-1M dataset w.r.t different hyper-parameters.

5.2.3. Parameter Sensitivity (Q3)

We here study the performance variation for MC-SF w.r.t. hyper-parameters, and comparative results are shown in Figure 5.

Impact of Embedding Size. We vary the embedding size over a range of values. According to the results, a small embedding size limits the capacity of the model, while a too-large embedding size is hard to learn. Only a proper dimension achieves the best performance.

Impact of Regularization. We vary the regularization coefficient, and the best performance is achieved at an intermediate value. The regularization coefficient is critical to avoid over-fitting: a too-small value cannot constrain the model effectively, while a too-large one may lead to under-fitting.

5.2.4. Case Study

To visualize how the collaborative recommendation works, we present the recommendation results of four methods in Figure 3. According to Figure 3, (1) “Fast” prefers predictions relevant to the recent click sequence, while “Slow” recommends based on the sequence in the cloud server, so the categories of its recommended movies differ somewhat from the five items the user clicked recently. (2) “S2F” mixes the results from “Fast” and “Slow” to some extent, considering both the long-term and the short-term interests. Besides, the children's category does not appear in the recommendation results of “F2S”, which is more relevant to the sequence in the cloud server.

6. Conclusion

This paper studies the mutual benefits of the slow component and the fast component through a Slow-Fast Collaborative Learning framework. The proposed MC-SF transfers prior/privileged knowledge from one side to the other, introducing a bidirectional collaborative learning framework in which both sides benefit. In particular, this work is the first to introduce the slow-fast ideology into recommendation, resembling the roles of System I and System II in human cognition. Comprehensive experiments conducted on two public datasets and one industrial dataset show the promise of MC-SF and the effectiveness of its main components.


  • V. W. Anelli, Y. Deldjoo, T. Di Noia, A. Ferrara, and F. Narducci (2020) FedeRank: user controlled feedback with federated recommender systems. arXiv preprint arXiv:2012.11328.
  • Z. Chen, W. Zhang, J. Yan, G. Wang, and J. Wang (2021) Learning dual dynamic representations in time-slice user-item interaction graphs for sequential recommendation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management.
  • A. M. Elkahky, Y. Song, and X. He (2015) A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the 24th International Conference on World Wide Web, pp. 278–288.
  • Y. Gong, Z. Jiang, Y. Feng, B. Hu, K. Zhao, Q. Liu, and W. Ou (2020) EdgeRec: recommender system on edge in mobile Taobao. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 2477–2484.
  • S. Gu, X. Sheng, Y. Fan, G. Zhou, and X. Zhu (2021) Real negatives matter: continuous training with real negatives for delayed feedback modeling. arXiv preprint arXiv:2104.14121.
  • H. Guo, R. Tang, Y. Ye, Z. Li, and X. He (2017) DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247.
  • X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua (2017) Neural collaborative filtering. In WWW, pp. 173–182.
  • B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk (2015) Session-based recommendations with recurrent neural networks. ICLR.
  • D. Kahneman (2011) Thinking, fast and slow. Macmillan.
  • W. Kang and J. J. McAuley (2018) Self-attentive sequential recommendation. In ICDM, pp. 197–206.
  • Y. Kim and A. M. Rush (2016) Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.
  • J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon (2016) Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.
  • J. Li, Y. Wang, and J. McAuley (2020) Time interval aware self-attention for sequential recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 322–330.
  • D. Lian, Y. Wu, Y. Ge, X. Xie, and E. Chen (2020) Geography-aware sequential location recommendation. In SIGKDD, pp. 2009–2019.
  • K. Madan, R. N. Ke, A. Goyal, B. B. Schölkopf, and Y. Bengio (2021) Fast and slow learning of recurrent independent mechanisms. arXiv preprint arXiv:2105.08710.
  • K. Muhammad, Q. Wang, D. O'Reilly-Morgan, E. Tragos, B. Smyth, N. Hurley, J. Geraci, and A. Lawlor (2020) FedFast: going beyond average for faster training of federated recommender systems. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1234–1242.
  • T. Qi, F. Wu, C. Wu, Y. Huang, and X. Xie (2020) Privacy-preserving news recommendation model learning. arXiv preprint arXiv:2003.09592. Cited by: §2.2.
  • M. Quadrana, A. Karatzoglou, B. Hidasi, and P. Cremonesi (2017) Personalizing session-based recommendations with hierarchical recurrent neural networks. In RecSys, pp. 130–137. Cited by: §2.1.
  • S. Rendle (2010) Factorization machines. In 2010 IEEE International conference on data mining, pp. 995–1000. Cited by: item o.
  • M. Satyanarayanan (2017) The emergence of edge computing. Computer 50 (1), pp. 30–39. Cited by: §2.1.
  • X. Su and T. M. Khoshgoftaar (2009) A survey of collaborative filtering techniques.

    Advances in artificial intelligence

    Cited by: §2.1.
  • F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang (2019) BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management, pp. 1441–1450. Cited by: §1.
  • Y. Sun, F. Yuan, M. Yang, G. Wei, Z. Zhao, and D. Liu (2020) A generic network compression framework for sequential recommender systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1299–1308. Cited by: §1, §1, §2.1.
  • J. Tang and K. Wang (2018) Personalized top-n sequential recommendation via convolutional sequence embedding. In WSDM, Y. Chang, C. Zhai, Y. Liu, and Y. Maarek (Eds.), pp. 565–573. Cited by: §2.1, item o.
  • A. Van Den Oord, S. Dieleman, and B. Schrauwen (2013) Deep content-based music recommendation. In Neural Information Processing Systems Conference (NIPS 2013), Vol. 26. Cited by: §2.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §2.1.
  • H. Wang, N. Wang, and D. Yeung (2015) Collaborative deep learning for recommender systems. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1235–1244. Cited by: §2.1.
  • J. Wang, A. P. De Vries, and M. J. Reinders (2006) Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 501–508. Cited by: §2.1.
  • R. Xie, C. Ling, Y. Wang, R. Wang, F. Xia, and L. Lin (2020) Deep feedback network for recommendation.. In IJCAI, pp. 2519–2525. Cited by: §4.2.
  • L. Yang, B. Tan, V. W. Zheng, K. Chen, and Q. Yang (2020) Federated recommendation systems. In Federated Learning, pp. 225–239. Cited by: §1.
  • J. Yao, F. Wang, K. Jia, B. Han, J. Zhou, and H. Yang (2021) Device-cloud collaborative learning for recommendation. In SIGKDD, pp. 3865–3874. Cited by: §1, §1, §2.2.
  • S. Zhang, D. Yao, Z. Zhao, T. Chua, and F. Wu (2021) CauseRec: counterfactual user sequence synthesis for sequential recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Cited by: §2.1.
  • W. Zhang, J. Wang, and W. Feng (2013) Combining latent factor model with location features for event-based group recommendation. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 910–918. Cited by: §2.1.
  • G. Zhou, Y. Fan, R. Cui, W. Bian, X. Zhu, and K. Gai (2018a) Rocket launching: a universal and efficient framework for training well-performing light net. In Thirty-second AAAI conference on artificial intelligence, Cited by: §2.1.
  • G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, Y. Yan, J. Jin, H. Li, and K. Gai (2018b) Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1059–1068. Cited by: §1, §2.1, §4.1, item o.
  • L. Zhou, S. El Helou, L. Moccozet, L. Opprecht, O. Benkacem, C. Salzmann, and D. Gillet (2012) A federated recommender system for online learning environments. In International Conference on Web-Based Learning, pp. 89–98. Cited by: §2.2.