Secure Your Ride: Real-time Matching Success Rate Prediction for Passenger-Driver Pairs

In recent years, online ride-hailing platforms have become an indispensable part of urban transportation. After a passenger is matched with a driver by the platform, both the passenger and the driver are free to accept or cancel the ride with one click. Hence, accurately predicting whether a passenger-driver pair is a good match is crucial for ride-hailing platforms to devise instant order assignments. However, since the users of ride-hailing platforms consist of two parties, decision-making needs to simultaneously account for the dynamics from both the driver and the passenger sides. This makes it more challenging than traditional online advertising tasks. Moreover, the amount of available data is severely imbalanced across different cities, creating difficulties for training an accurate model for smaller cities with scarce data. Though a sophisticated neural network architecture can help improve the prediction accuracy under data scarcity, an overly complex design will impede the model's capacity to deliver timely predictions in a production environment. In this paper, to accurately predict the matching success rate (MSR) of passenger-driver pairs, we propose the Multi-View model (MV), which comprehensively learns the interactions among the dynamic features of the passenger, driver, trip order, and context. Regarding the data imbalance problem, we further design the Knowledge Distillation framework (KD) to supplement the model's predictive power for smaller cities using the knowledge from cities with denser data, and also to generate a simple model to support efficient deployment. Finally, we conduct extensive experiments on real-world datasets from several different cities, which demonstrate the superiority of our solution.




1 Introduction

With the rapid development of technologies like GPS and mobile Internet, ride-hailing applications such as Didi, Lyft and Uber have prospered by providing drivers and passengers with more convenience. On every ride-hailing platform, when travel orders emerge, services are offered by matching passengers with drivers in a timely manner. Hence, for those platforms, predicting the matching success rate of a candidate passenger-driver pair plays an important role in making efficient order assignments.

Figure 1: The operation interfaces for the passenger (a) and the driver (b) after the platform has responded to the order.

Specifically, once a passenger makes a request for a ride, an order is generated. The passenger is then matched to a driver in the surrounding area according to her/his origin and departure time. Next, as Fig. 1 shows, both the passenger and the driver have a cancellation option in their user interfaces, and either of them can cancel the matching result by clicking the cancel button. This happens for various reasons, e.g., the passenger is concerned about a driver's low rating score, or the driver encounters a no-show passenger. It is common practice for ride-hailing platforms to first infer the success rate of a passenger-driver matching pair and, based on it, assign the order to the most promising driver to ensure a smooth service. An accurate prediction can not only improve the real trading volume, but also provide both passengers and drivers with a better user experience.

In this paper, we term the described problem Matching Success Rate (MSR) prediction. Generally, MSR prediction is more challenging than traditional advertising problems (e.g., Click-Through Rate (CTR) prediction [qu2016product, huang2013learning]) because it is a collective process that involves two kinds of platform users, i.e., passengers and drivers, whose decisions are also highly dependent on the real-time environment. MSR represents a bilateral consensus between passengers and drivers, and their contexts jointly determine the final results. Hence, MSR prediction falls under a more complicated scenario where we need to fully understand both the passengers' and drivers' intentions as well as the rapidly changing contexts around them, such as the instant supply-demand ratio and the traffic condition. In other words, it is challenging to capture all the interactions among these complicated factors. Firstly, people's personalities can influence their decisions, e.g., a driver would not show much patience if the passenger is already more than 5 minutes late. Secondly, MSR is affected by many external factors. For example, a driver would prefer a passenger who has a longer (i.e., more profitable) trip. A passenger tends to cancel orders matched with drivers who are more than 3 km away, but she/he may not do so in a rural area. Therefore, the sophisticated combinatorial effect among users' preferences and the dynamic contexts makes it difficult to learn a precise mapping from these variables to the eventual MSR.

Furthermore, unlike online advertising problems, which are commonly not limited by spatial locations, MSR prediction suffers from severe data imbalance among different cities. Some international metropolises generate millions of orders a day, i.e., thousands of orders per minute, while some small cities see only a few orders over several consecutive hours. On the one hand, models trained for those small cities are subject to inferior performance. On the other hand, compared with fully developed metropolitan markets, small cities have many potential customers that ride-hailing platforms are contending for. As a result, it is especially crucial yet challenging to provide accurate MSR prediction for small cities whose data are highly scarce.

As MSR prediction is a newly emerged problem accompanying the recent prosperity of ride-hailing platforms, potential solutions from existing studies are hardly satisfactory. Some studies [liu2017functional, zhang2017deep, wang2019unified] on traffic prediction propose deep learning-based models in which data imbalance remains unresolved. In this regard, Yao et al. [yao2019learning] and Wang et al. [wang2018crowd] propose meta learning frameworks to transfer knowledge from source cities to target cities for spatiotemporal prediction, but the combinatorial effect among different real-time features is largely overlooked. Notably, some conceptual similarity can also be found in research on user response prediction [luo2020dynamic, guo2017deepfm, juan2016field], where a representative task is to predict the probability that a user will click through a given link/advertisement. Learning feature interactions lies at the core of user response prediction, and methods like the Deep Structured Semantic Model (DSSM) [huang2013learning, shen2014latent, elkahky2015a] and Factorization Machine (FM) based models [guo2017deepfm, he2017neural, lian2018xdeepfm] are capable models for this purpose. Recently, to account for the temporal dynamics of features, there are variants adopting Recurrent Neural Networks (RNN) [palangi2014semantic, song2016multi] and self-attention networks [chen2020sequence] to fully investigate the sequential dependencies while modelling feature interactions. Unfortunately, user response prediction only considers the decision dynamics from the customer side, while the other side is a display item and considered static. This assumption clearly does not hold for MSR prediction, which involves a bidirectional, collective decision process. In addition, these methods are not designed to handle geographically imbalanced data. Moreover, MSR is used to support instant operations like order assignment in a high-throughput environment. Considering that the aforementioned prediction methods are usually complex (e.g., the deep residual network in [zhang2017deep] and the convolutional network in [lian2018xdeepfm]), a lightweight model for MSR prediction is necessary to ensure efficiency for online deployment.

In this paper, to thoroughly capture feature interactions from both passenger and driver sides, we propose the Multi-View model (MV) for MSR prediction. In particular, to fully capture the dynamics of the constantly changing context while modelling feature interactions, we build our MV model upon the Differentiable Neural Computer (DNC) [graves2016hybrid], a variant of neural memory networks [graves2014neural]. In addition, with the external memory matrix coupled with DNC, our MV model can effectively accumulate useful knowledge from cities to improve its prediction accuracy. As MSR has to be inferred in real time, we aim to keep the prediction model simple yet powerful to minimize the latency during online inference. Hence, we further propose a learning scheme based on Knowledge Distillation (KD) [ba2013deep, hinton2015distilling, mishra2017apprentice] to support learning a lightweight model used for real-world deployment. Based on MV, we design a teacher model that utilizes the learned knowledge about data-intensive cities to complement the prediction for the target city. By encouraging the simpler student model to mimic the behavior of the teacher model, the resulting (student) model is capable of delivering quality MSR predictions under high data scarcity.

To the best of our knowledge, this is the first work to identify a novel problem that comes along with the prosperity of ride-hailing platforms, namely passenger-driver matching success rate (MSR) prediction. MSR prediction advances existing user response prediction with the notion of bilateral decision-making, and it can generalize to and benefit a wider range of applications like friend-making websites and online marketplaces. Apart from that, the main technical contributions of this work are summarized as follows:

  • We propose the Multi-View model (MV) to solve the MSR prediction problem, which provides a comprehensive interaction scheme for features from different perspectives in dynamic contexts, and can retain knowledge about a city for future predictions.

  • Coupled with MV, we design a Knowledge Distillation framework (KD) to transfer knowledge from other cities to the lightweight model designed for the target city. This not only mitigates the data imbalance among cities but also results in a compact prediction model for efficient deployment.

  • We conduct extensive experiments on real industry-level datasets, where the results demonstrate the superior performance of our approach in both accuracy and scalability.

The rest of the paper is organized as follows. We first introduce the preliminaries in Section 2. After that, we present the multi-view model in Section 3 and knowledge distillation framework in Section 4. We evaluate the proposed solutions in Section 5. Then we review related work in Section 6. Finally, we conclude this study in Section 7.

2 Preliminaries

In this section, we define some key concepts and then formulate our research problem. We have also provided a list of notations used in this paper in Table I.

2.1 Definitions

Definition 2.1.

Passenger Request. Passengers relying on a ride-hailing application have a different way to acquire a ride compared with the traditional roadside hailing. When these passengers need a ride to somewhere, they will make a request denoted as which includes the passenger ID, timestamp, current location, origin and destination locations, status (i.e., responded or not responded, etc). A passenger request is also called an order on the ride-hailing platform.

Definition 2.2.

Driver Status. Every driver has a tuple to describe his/her states , , which consists of driver ID, timestamp, location, and state (i.e., empty, occupied, on the way to a pick-up, etc.).

Definition 2.3.

Passenger-Driver Pair. Once a passenger request is generated, the ride-hailing platform will match a driver for it as soon as possible according to the information provided and the surrounding drivers’ status. The passenger-driver pair will be constructed and denoted as which means the passenger and driver are paired by the platform at timestamp . The origin and destination locations of the trip are and , respectively. Then the passenger and driver will respond to this matching result via acceptance or cancellation. We use to label this passenger-driver pair. If either of these two participants chooses to cancel this pair then , otherwise we have which means they both accept it.
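To make Definitions 2.1 to 2.3 concrete, the following is a minimal sketch of the three records as Python data classes. All field names, the coordinate encoding and the timestamp unit are hypothetical choices made for illustration; the 0/1 encoding of the label (0 when either side cancels, 1 when both accept) is likewise an assumption consistent with the negative/positive labelling used in the experiments.

```python
from dataclasses import dataclass
from typing import Tuple

# (latitude, longitude); an assumed encoding, not specified in the paper.
Location = Tuple[float, float]


@dataclass(frozen=True)
class PassengerRequest:
    passenger_id: str
    timestamp: int          # seconds since epoch (assumed unit)
    location: Location      # passenger's current location
    origin: Location
    destination: Location
    status: str             # e.g. "responded" or "not_responded"


@dataclass(frozen=True)
class DriverStatus:
    driver_id: str
    timestamp: int
    location: Location
    state: str              # e.g. "empty", "occupied", "to_pickup"


@dataclass(frozen=True)
class PassengerDriverPair:
    request: PassengerRequest
    driver: DriverStatus
    matched_at: int         # timestamp at which the platform paired them


def pair_label(passenger_cancelled: bool, driver_cancelled: bool) -> int:
    """Return 0 if either participant cancels the pair, 1 if both accept."""
    return 0 if (passenger_cancelled or driver_cancelled) else 1


request = PassengerRequest("p1", 1_600_000_000, (39.90, 116.40),
                           (39.90, 116.40), (39.99, 116.31), "responded")
driver = DriverStatus("d7", 1_600_000_000, (39.91, 116.41), "empty")
pair = PassengerDriverPair(request, driver, matched_at=1_600_000_005)
```

A cancellation by either party is enough to mark the pair as unsuccessful, which is what makes the label a bilateral outcome rather than a one-sided response.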

2.2 MSR Prediction

Problem 2.1.

Matching Success Rate (MSR) prediction. For an arbitrary city, we are given a sequence of passenger-driver pairs and a corresponding label sequence . denotes the total number of matching pairs. MSR prediction is a classification task to predict the match success probability of any given passenger-driver pair . In MSR prediction, we need to make predictions for a set of source cities with sufficient data, as well as a set of target cities with sparse data (i.e., for an arbitrary source city and target city, their total numbers of passenger-driver pairs within the same time period satisfy ). It is worth noting that, when predicting the MSR on target cities, we have the additional goal of improving our model's performance by borrowing usable knowledge learned from source cities.

Notation Description
, passenger request and driver status
, , , user ID, timestamp, location, status
, a sequence of passenger-driver pairs and labels
arbitrary source and target city
sets of source and target cities
, , features of the order, passenger and driver for the -th passenger-driver pair
context feature sequence
the context feature at time slot
feature embedding or feature interaction representation
memory matrix at time slot
memory matrices of arbitrary source city and target city
set of source cities’ memory matrices

read, erase and write vectors at time slot

read weight of the -th read operation and write weight at time slot
, E, hidden state of GRU, unit matrix, and interface vector
final representation of the passenger-driver pair
predicted MSR
Table I: Key notations used throughout the paper.

3 The Multi-View Model (MV)

The collective nature and constantly changing context of MSR require our model to account for the complex interactions of various factors in real time. However, unlike CTR prediction models, which mainly focus on designing complicated feature interaction paradigms, MSR prediction also faces severe geographical data imbalance. In light of this, we design a Multi-View model (MV for short), which not only learns an accurate mapping from the interactions among multiple factors to the MSR in real time, but also maintains useful knowledge from source cities via memory matrices. Its structure is shown in Fig. 2. The model consists of four main steps: feature extraction, embedding, interaction and attentive combination. Specifically, we utilize a sequence-based neural memory model for learning the dynamic contexts, which also facilitates the knowledge transfer described in Section 4.


Figure 2: The architecture of the multi-view model. The MV model includes three main steps from the bottom up: feature embedding, multi-view feature interaction modelling, and attentive combination.

3.1 Feature Extraction and Embedding

In this section, we categorize the features into four kinds so that we can achieve a full and interpretable fusion of these features from different perspectives in the following parts.

Given a passenger-driver pair , we extract features from different perspectives: passenger, driver, order and context. The passenger's feature vector includes the passenger's recent behaviors and order history, e.g., the number of his/her canceled orders in the last five minutes, the last day and so on. Similarly, the driver's feature vector consists of the driver's recent behaviors, service types and so on. The order information such as origin, destination and time is important to the MSR of a passenger-driver pair. For instance, a passenger would be more impatient when he/she is in a hurry to get to work during the morning rush hour. Conversely, passengers are willing to wait longer when going out on weekends. Hence, the order information is represented as an order feature vector . Last but not least, a context sequence is abstracted for each passenger-driver pair , which aims to capture the dynamics of the context during the past few time slots. We will provide more details in Section 5.

Then, as Fig. 2 shows, we feed , , into an embedding layer having two steps. Firstly, we project categorical features into embedding vectors and concatenate them with the other continuous and numerical features to reform a complete vector for each perspective. Secondly, we feed each of the three vectors into a dense layer separately and obtain the final embedding vectors (i.e., , , and ) with a unified embedding size for subsequent calculations.
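The two-step embedding layer can be sketched as follows for a single perspective (here, the order). This is an illustrative sketch only: the class name `PerspectiveEmbedder`, the feature names, the dimensions, the random initialization and the ReLU activation are all assumptions, since the text does not specify them.

```python
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def init_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random initialization (an arbitrary choice for this sketch)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def dense(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Affine map followed by ReLU (the activation is an assumption)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weight, bias)]


class PerspectiveEmbedder:
    """Embeds the features of one perspective into a fixed-size vector."""

    def __init__(self, vocab_sizes: Dict[str, int], cat_dim: int,
                 num_numeric: int, out_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # One embedding table per categorical feature.
        self.tables = {name: init_matrix(size, cat_dim, rng)
                       for name, size in vocab_sizes.items()}
        in_dim = cat_dim * len(vocab_sizes) + num_numeric
        self.weight = init_matrix(out_dim, in_dim, rng)
        self.bias = [0.0] * out_dim

    def __call__(self, categorical: Dict[str, int], numeric: Vector) -> Vector:
        # Step 1: look up categorical embeddings and concatenate them
        # with the continuous/numerical features.
        full: Vector = []
        for name, table in self.tables.items():
            full.extend(table[categorical[name]])
        full.extend(numeric)
        # Step 2: project to the unified embedding size with a dense layer.
        return dense(full, self.weight, self.bias)


order_embedder = PerspectiveEmbedder({"start_poi": 5, "product_type": 3},
                                     cat_dim=4, num_numeric=2, out_dim=8)
e_order = order_embedder({"start_poi": 2, "product_type": 1}, [1.2, 0.35])
```

Using one embedder per perspective with the same `out_dim` yields passenger, driver and order embeddings of equal size, which is what the later pairwise interactions require.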

3.2 Context Embedding and Knowledge Building

For each city, we can extract a set of contextual features that carry temporal dynamics (e.g., traffic condition) and are of great importance to the MSR. In MV, the knowledge transferred to target cities is built upon the context embeddings of source cities. Recall that we have extracted a sequence for each passenger-driver pair, which carries the crucial real-time contextual information that heavily impacts the matching results of a passenger-driver pair. In this part, our goal is to learn an effective representation for the dynamic context of the corresponding passenger-driver pair. Firstly, the context representation should encode the influence it would have on the passenger's and driver's decisions. Secondly, we aim to obtain a memory of all kinds of context situations as reference knowledge to help improve future predictions, especially for cities with sparse data. To meet both goals, we design the context embedding component based on DNC [graves2016hybrid], which is an extension of the Neural Turing Machine (NTM) [graves2014neural]. DNC contains two basic components: a neural network controller and an external memory matrix. The controller interacts with the other model components through its input and output, and meanwhile it also reads from and writes to the external memory matrix. A sequential controller in DNC helps capture the temporal dynamics in context features, and the memory matrix it maintains can be utilized for transferring knowledge from source cities to the target one. Besides, DNC's dynamic memory allocation scheme solves the memory overlapping and non-recoverable memory writing problems of NTM, making it a better fit than NTM for our task. We introduce the memory operation and controller network below.

3.2.1 External Memory.

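The external memory is a fixed-size matrix $M \in \mathbb{R}^{N \times W}$. DNC reads from and writes to the memory matrix with the help of read/write weights and an interface vector derived from the controller network. DNC selects locations for reading and writing via weighting vectors of non-negative numbers whose elements sum to at most 1. The complete set of allowed weights over $N$ locations is the non-negative orthant of $\mathbb{R}^N$ with the unit simplex as a boundary (known as the “corner of the cube”):

$$\Delta_N = \left\{ \alpha \in \mathbb{R}^N \;\middle|\; \alpha_i \in [0, 1],\ \sum_{i=1}^{N} \alpha_i \le 1 \right\}$$

The read operation emits a set of $R$ read vectors $\{r_t^1, \dots, r_t^i, \dots, r_t^R\}$ at each time step $t$. Each read vector is a weighted sum of $M_t$'s rows:

$$r_t^i = M_t^\top w_t^{r,i}$$

with read weights $\{w_t^{r,1}, \dots, w_t^{r,i}, \dots, w_t^{r,R}\} \subset \Delta_N$.

The write operation is mediated by a single write weight $w_t^w \in \Delta_N$, which is used in conjunction with an erase vector $e_t \in [0, 1]^W$ and a write vector $v_t \in \mathbb{R}^W$ (both are emitted by the controller) to modify the memory as follows:

$$M_t = M_{t-1} \circ \left(E - w_t^w e_t^\top\right) + w_t^w v_t^\top$$

where $\circ$ denotes element-wise multiplication and E is an $N \times W$ matrix consisting of 1s.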
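The read and write operations can be illustrated with a few lines of plain Python. This is a minimal sketch assuming the standard DNC formulation, in which a read vector is the weighted sum of memory rows, r = Mᵀw, and a write updates the memory as M ∘ (E − w eᵀ) + w vᵀ; the function names are ours and the computation of the weights themselves is left out.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def read_memory(memory: Matrix, read_weight: Vector) -> Vector:
    """Read vector r = M^T w: a weighted sum of the memory rows."""
    width = len(memory[0])
    return [sum(read_weight[n] * memory[n][j] for n in range(len(memory)))
            for j in range(width)]


def write_memory(memory: Matrix, write_weight: Vector,
                 erase: Vector, write_vec: Vector) -> Matrix:
    """Memory update M_t = M_{t-1} * (E - w e^T) + w v^T, element-wise.

    Row n is first scaled by (1 - w[n] * e[j]) in each column j (erase),
    then w[n] * v[j] is added (write).
    """
    return [[memory[n][j] * (1.0 - write_weight[n] * erase[j])
             + write_weight[n] * write_vec[j]
             for j in range(len(erase))]
            for n in range(len(memory))]


# N = 4 memory locations, each of width W = 3, initially empty.
memory = [[0.0, 0.0, 0.0] for _ in range(4)]

# Write entirely to location 1, fully erasing whatever was there.
memory = write_memory(memory,
                      write_weight=[0.0, 1.0, 0.0, 0.0],
                      erase=[1.0, 1.0, 1.0],
                      write_vec=[0.5, -1.0, 2.0])

# Reading with all weight on location 1 recovers the written vector.
read_back = read_memory(memory, [0.0, 1.0, 0.0, 0.0])
```

Because the weights are soft (any point in the allowed simplex), both operations are differentiable, which is what lets the controller learn where to read and write.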

3.2.2 Controller Network.

The controller network can be a recurrent or feedforward network, but a recurrent controller has its own internal memory that can complement the external memory. Hence, while DNC and NTM use the Long Short-Term Memory network (LSTM) [hochreiter1997long], we adopt GRU [cho2014learning], which performs similarly to LSTM but is computationally cheaper [wang2017gated]. As the controller network, GRU receives the set of read vectors $\{r_{t-1}^1, \dots, r_{t-1}^R\}$ from the memory matrix $M_{t-1}$. The read vectors are then concatenated with the current input feature vector $x_t$ as the controller's input $\chi_t = [x_t; r_{t-1}^1; \dots; r_{t-1}^R]$. For convenience, we formulate the GRU in a simple version:

$$h_t = \mathrm{GRU}(\chi_t, h_{t-1})$$

where $\chi_t$, $h_t$ are respectively the input and output vector. The outputs of the controller network are a function of the hidden states, which can be denoted by the following:

$$[\upsilon_t; \xi_t] = W_h [h_t^1; \dots; h_t^L]$$

where $\upsilon_t$ is the output vector at time slot $t$; $\xi_t$ is called the interface vector, which is subdivided to parameterize the memory interactions; $h_t^1, \dots, h_t^L$ are the hidden states generated by all $L$ GRU layers and $W_h$ denotes the trainable weight.

Then, with the final output vector $\upsilon_t$ of the controller network and the current time step's read vectors, the final output of the DNC is defined as:

$$o_t = \upsilon_t + W_r [r_t^1; \dots; r_t^R]$$

where $W_r$ is a weight matrix that transforms the concatenated read vectors to the same space as $\upsilon_t$.
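The wiring of one controller step can be sketched as below. This is a simplified illustration under several assumptions: a single-layer GRU without bias terms, separate (hypothetical) weight matrices `W_out` and `W_iface` for the output and interface vectors, and read vectors for the current step passed in directly rather than being derived from the interface vector as a full DNC would do.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class GRUCell:
    """Single-layer GRU; bias terms are dropped for brevity."""

    def __init__(self, in_dim: int, hid_dim: int, rng: random.Random) -> None:
        self.Wz, self.Wr, self.Wh = (rand_matrix(hid_dim, in_dim, rng) for _ in range(3))
        self.Uz, self.Ur, self.Uh = (rand_matrix(hid_dim, hid_dim, rng) for _ in range(3))

    def step(self, x: Vector, h: Vector) -> Vector:
        # Update gate z and reset gate r.
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(matvec(self.Wh, x), matvec(self.Uh, gated))]
        # Interpolate between the previous state and the candidate state.
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def controller_step(cell: GRUCell, W_out: Matrix, W_iface: Matrix, W_read: Matrix,
                    x_t: Vector, prev_reads: List[Vector], h_prev: Vector,
                    reads_t: List[Vector]) -> Tuple[Vector, Vector, Vector]:
    # Controller input: current features concatenated with last step's reads.
    chi_t = x_t + [v for r in prev_reads for v in r]
    h_t = cell.step(chi_t, h_prev)
    upsilon_t = matvec(W_out, h_t)   # controller output vector
    xi_t = matvec(W_iface, h_t)      # interface vector (drives memory access)
    # Final output: controller output plus projected current read vectors.
    flat_reads = [v for r in reads_t for v in r]
    o_t = [u + c for u, c in zip(upsilon_t, matvec(W_read, flat_reads))]
    return h_t, xi_t, o_t


rng = random.Random(0)
in_dim, width, num_reads, hid_dim, out_dim, iface_dim = 3, 2, 2, 4, 5, 6
cell = GRUCell(in_dim + num_reads * width, hid_dim, rng)
W_out = rand_matrix(out_dim, hid_dim, rng)
W_iface = rand_matrix(iface_dim, hid_dim, rng)
W_read = rand_matrix(out_dim, num_reads * width, rng)
reads = [[0.1, 0.2], [0.3, 0.4]]
h_t, xi_t, o_t = controller_step(cell, W_out, W_iface, W_read,
                                 [1.0, 0.0, -1.0], reads, [0.0] * hid_dim, reads)
```

Running `controller_step` over the context sequence, one time slot at a time, would produce the context representation while the memory matrix accumulates city-level knowledge.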

3.3 Multi-View Feature Interaction and Attentive Combination

One of the challenges in predicting MSR is that the ride-hailing service is a collective process that involves two kinds of platform users, i.e., passengers and drivers, whose decisions are both highly dependent on the real-time environment. Therefore, we propose a two-step mechanism to thoroughly model and selectively fuse the feature interactions from different views.

The MSR of a passenger-driver pair is influenced by both the passenger and driver’s decisions. Meanwhile, the corresponding context and order attributes would affect their final decisions. The same passenger and driver would make different decisions according to different contexts, while under the same context, different passengers and drivers would make different decisions due to different behavioral preferences. As per our discussion, features in different perspectives have interactions with each other, and those features as well as their interactions have different importance to orders under different situations. To facilitate the modeling of such information, we propose our own interaction mechanism which includes two steps: (1) the interactions between the embeddings from different perspectives, (2) the attentive combination of the feature representations and their interaction results.

In the interaction part, we utilize three types of formulations to capture feature interactions, i.e., inner product, element-wise product, and outer product. The three types of interactions are then combined into the final interaction representation. For the features from any two perspectives, their interaction is formulated as follows:


where , represent inner product and outer product, and specially, includes a summation operation along the second dimension to flatten the resulting matrix from the outer product into a vector. After this step, we obtain , , which carry the interaction contexts by crossing different views.
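The three interaction types can be sketched on two small embedding vectors. The way the three results are combined is not recoverable from the text, so plain concatenation is used here purely as an assumption; the function names and the example vectors are ours.

```python
from typing import List

Vector = List[float]


def inner_product(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def elementwise_product(a: Vector, b: Vector) -> Vector:
    return [x * y for x, y in zip(a, b)]


def outer_product_flattened(a: Vector, b: Vector) -> Vector:
    """Outer product a b^T, summed along the second dimension.

    Entry i equals a[i] * sum(b), so the result has the same length as a.
    """
    outer = [[x * y for y in b] for x in a]
    return [sum(row) for row in outer]


def interact(a: Vector, b: Vector) -> Vector:
    """Concatenate the three interaction results (assumed combination)."""
    return ([inner_product(a, b)]
            + elementwise_product(a, b)
            + outer_product_flattened(a, b))


e_passenger = [1.0, 2.0, 3.0]
e_driver = [0.5, -1.0, 2.0]
interaction = interact(e_passenger, e_driver)
# interaction == [4.5, 0.5, -2.0, 6.0, 1.5, 3.0, 4.5]
```

Note that the flattened outer product is not symmetric in its arguments: swapping the two views changes which vector is scaled by the sum of the other.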

Feature representations of different perspectives and their interactions differ in importance to orders under different situations. Hence, we calculate attentive weights between each of them and the order representation, and their weighted results can be represented as:


where represents the element-wise function, , , are query, key and value weight matrices dedicated to each representation , is the dimension of the representation vector, and is the scaling factor to smooth the row-wise output and avoid extremely large values of the inner product, especially when the dimension is very high. Then, the concatenation of all weighted results is taken as the final representation of the features in all perspectives:


Finally, we employ a dense layer to predict the final result of the label which is formulated as:


where w and are trainable parameters in the prediction layer.
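The attentive combination and the prediction layer can be sketched as follows. Several details are assumptions made for illustration: the normalizing function is taken to be a softmax across the candidate representations, a single set of query/key/value matrices is shared by all representations (the text dedicates one set to each), and the dense prediction layer uses a sigmoid output.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(xs: Vector) -> Vector:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_combination(order_repr: Vector, reprs: List[Vector],
                          W_q: Matrix, W_k: Matrix, W_v: Matrix
                          ) -> Tuple[Vector, Vector]:
    """Weight each representation by its scaled dot-product score against
    the order representation, then concatenate the weighted values."""
    d = len(order_repr)
    query = matvec(W_q, order_repr)
    scores = [sum(q * k for q, k in zip(query, matvec(W_k, r))) / math.sqrt(d)
              for r in reprs]
    weights = softmax(scores)
    combined: Vector = []
    for w, r in zip(weights, reprs):
        combined.extend(w * v for v in matvec(W_v, r))
    return weights, combined


def predict_msr(z: Vector, w: Vector, b: float) -> float:
    """Dense layer with sigmoid output, giving a probability in (0, 1)."""
    logit = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1.0 / (1.0 + math.exp(-logit))


d = 3
identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
e_order = [1.0, 0.0, 1.0]
representations = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
weights, z = attentive_combination(e_order, representations,
                                   identity, identity, identity)
y_hat = predict_msr(z, [0.1] * len(z), 0.0)
```

With identity projections, the representation most aligned with the order embedding receives the largest weight, which mirrors the intent of letting the order decide which views matter.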

3.4 Optimizing MV

With the final predicted results and the label , we formulate the loss function as:


where represents the binary cross entropy. All parameters of our model are optimized with Stochastic Gradient Descent (SGD). Specifically, we utilize Adam [kingma2014adam], which is a widely used variant of SGD.
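For reference, binary cross entropy over a batch of labels and predicted probabilities can be computed as in the sketch below. Averaging over the batch and clipping probabilities for numerical safety are assumptions of this illustration, not details stated in the text.

```python
import math
from typing import List


def binary_cross_entropy(labels: List[int], probs: List[float],
                         eps: float = 1e-12) -> float:
    """Mean of -[y log(p) + (1 - y) log(1 - p)] over all samples."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# An uninformative predictor (p = 0.5 everywhere) scores log(2) per sample.
uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])
# Confident and correct predictions score close to zero.
confident = binary_cross_entropy([1, 0], [0.99, 0.01])
```

Since roughly 8% to 14% of the samples in the datasets are negative, the loss is dominated by positive pairs unless class weighting is applied; whether the authors reweight is not stated.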

4 Enhancing MV with Knowledge Distillation (KD)

As MSR prediction aims to support online order assignments, it requires real-time efficiency of the predictive model. However, the MV model by itself is computationally heavy for such high-throughput applications because of the constant writing and reading operations on the memory matrices. In order to facilitate quicker online inference while ensuring strong performance on cities with high data sparsity, we devise a Knowledge Distillation framework (KD for short) which has two main components: a complex teacher model and a lightweight student model. The idea is to let the teacher model, which can access the information stored in the memory, guide the student model, such that the student model used for online inference can mimic the behaviors of the teacher model and provide high prediction accuracy. As Fig. 3 shows, it is similar to a Siamese structure where the two models are independent except for two interactions, where means to calculate the cosine similarity between the two vectors and indicates parameter sharing in the dense layer. The framework enables the teacher model to borrow knowledge from source cities with dense data, so that the prediction accuracy can be maintained when coping with target cities having fewer records. In addition, by encouraging the student to emulate the prediction behaviors of the teacher with shared parameters, we can obtain a student model that is both lightweight and accurate.

Figure 3: The overview of the knowledge distillation framework. Based on MV, we design a teacher model that utilizes the knowledge learned from data-intensive cities, i.e., to complement the prediction for the target city , and we also design a simple student model with fewer parameters to mimic the behavior of the teacher.

4.1 Memory Preparation

As the data distributions are diverse among cities, we train a specialized MV model for each city separately. The source cities own sufficient data to facilitate training a quality prediction model, while the data scarcity of target cities tends to lead to poor prediction performance. One of the aims in this section is to solve this problem via knowledge transfer. Firstly, we train the MV models for source cities in to obtain their memory matrices, which are denoted as a set …, …, . Intuitively, not all source cities' knowledge is of equivalent importance to a target city in , hence we build the teacher model by attentively selecting the memorized information from each source city. For a more effective knowledge transfer, we conduct pretraining for the target city as an initialization for its memory matrix so that the attentive weights between and each source matrix in can be calculated. In this way, a reasonable and effective combination of the memory matrices can be achieved.

4.2 Setting up Teacher and Student Models

Fig. 3 shows the main structures of the teacher and student models, as well as the interactions between them. Note that the teacher and student share the last dense layer, and there is a similarity constraint between their final feature representations, enforcing the student to behave like the powerful teacher model. The teacher model aims to learn as much information as possible so that it can lead to a more accurate result. During the training process of a target city, besides the normal interaction between GRU and the memory matrix, there is a combination procedure. Specifically, the teacher model advances the default MV models by attentively integrating the memory matrices associated with each city and we denote it as for convenience. Assuming that the target city’s memory matrix is w.r.t. sample , then is formulated as:


where , , are query, key and value weight matrices dedicated to each target city, denotes the row-wise function and is the column dimension of the memory matrix.
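One way to picture the attentive integration of memory matrices is the sketch below, where each row of the target city's memory attends over the rows of every source city's memory with a row-wise softmax and a square-root scaling by the column dimension. This is an illustration under stated assumptions: the query/key/value projections are omitted (equivalent to identity matrices), and the attended results are summed onto the target memory; the exact aggregation used by the teacher model is not recoverable from the text.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def row_softmax(m: Matrix) -> Matrix:
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attend(target_mem: Matrix, source_mem: Matrix) -> Matrix:
    """Each target row becomes a convex combination of source rows."""
    d = len(target_mem[0])  # column dimension of the memory matrix
    scores = matmul(target_mem, transpose(source_mem))
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(row_softmax(scaled), source_mem)


def aggregate_memories(target_mem: Matrix, source_mems: List[Matrix]) -> Matrix:
    """Add the attended content of every source memory to the target memory
    (summation is an assumed aggregation rule)."""
    agg = [row[:] for row in target_mem]
    for src in source_mems:
        attended = attend(target_mem, src)
        agg = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(agg, attended)]
    return agg


target_memory = [[0.0, 0.0]]
source_memory = [[1.0, 2.0], [1.0, 2.0]]
teacher_memory = aggregate_memories(target_memory, [source_memory])
```

Because the attention weights depend on the target memory, the pretraining step described in Section 4.1 matters: an uninitialized target memory would attend to all source rows uniformly.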

At this point, it is clear that the teacher model has more parameters than MV (i.e., the teacher has the additional query, key and value weight matrices), which further hinders its practicality for online deployment. Therefore, we devise the student model by stripping away the heavy memory interaction part, marked in Fig. 2 by the orange rectangle with a round-dotted border (i.e., the student omits Eq.(6) and Eq.(7) compared with MV and their corresponding weights are and ) and we denote it as for convenience. Consequently, the space complexity of the student is reduced by the memory interaction weights , and the memory aggregation weights , , compared with the teacher, which provides minimal parameterization and efficient inference in return. It is worth noting that our model design also facilitates multi-thread processing in the production environment, where each city is deployed at a single computation node and the implementation can easily scale up to a large number of cities.

4.3 Optimizing KD

As we can see from Fig. 3, the final outputs of the teacher and student models are and , respectively. Meanwhile, both of them generate a representation vector of all features for each passenger-driver pair , i.e., and . The rationale of knowledge distillation is to make the outputs of the teacher and student models as similar as possible, so we formulate the loss function of the teacher-student framework as follows:


where denotes the cosine similarity between two vectors; , , and refer to weighting factors to prioritize the output of a certain loss function over the other, which are learnable in our model. Similar to MV, all parameters of the knowledge distillation framework are optimized with the variant of SGD, i.e., Adam [kingma2014adam].
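A per-sample version of such a teacher-student objective can be sketched as below. The exact form of the loss is not recoverable from the text, so this sketch assumes a weighted sum of three terms: the teacher's binary cross entropy, the student's binary cross entropy, and a dissimilarity term (one minus the cosine similarity) between the two final representations. The weights `alpha`, `beta` and `gamma` are passed in as fixed numbers here, whereas the paper learns them.

```python
import math
from typing import List

Vector = List[float]


def bce(y: int, p: float, eps: float = 1e-12) -> float:
    """Binary cross entropy for one sample."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def kd_loss(y: int, p_teacher: float, p_student: float,
            z_teacher: Vector, z_student: Vector,
            alpha: float, beta: float, gamma: float) -> float:
    """Weighted sum of teacher loss, student loss and representation mismatch."""
    mismatch = 1.0 - cosine_similarity(z_teacher, z_student)
    return (alpha * bce(y, p_teacher)
            + beta * bce(y, p_student)
            + gamma * mismatch)


# Identical representations contribute no mismatch penalty.
aligned = kd_loss(1, 0.5, 0.5, [1.0, 2.0], [1.0, 2.0], 1.0, 1.0, 1.0)
# Orthogonal representations are penalized by gamma * 1.
orthogonal = kd_loss(1, 0.5, 0.5, [1.0, 0.0], [0.0, 1.0], 1.0, 1.0, 1.0)
```

The similarity term is what pushes the memory-free student towards the teacher's representation, while the shared dense layer ensures that similar representations translate into similar predictions.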

5 Evaluation

In this section, we conduct experiments on industry-level datasets to showcase the effectiveness of our model on MSR prediction. In particular, we aim to answer the following research questions via experiments:

  • Can MV outperform strong baselines on MSR prediction?

  • How does MV benefit from each component of the proposed model structure?

  • How does each key hyperparameter affect the performance of MV?

  • Is the lightweight model learned via KD effective in predicting MSR under high data sparsity?

  • How does each key hyperparameter affect the performance of KD?

  • Is our solution to MSR prediction scalable?

City  #Instance    #N      Size (km²)  Population
BJ    31,062,865   11.45%  16,411      21,536,000
SH    21,132,759   14.41%  6,340.5     24,281,400
SZ    12,792,586   12.16%  1,997.47    13,438,800
CD    12,432,933   11.27%  14,335      16,581,000
DZ    477,156      11.04%  10,356      5,748,500
ZZ    858,138      7.93%   12,600      5,160,000
Table II: A summary of datasets in use. #N denotes the percentage of negative samples among instances.
Figure 4: Statistics of the order rejection rate and the distribution of passengers with different orders. Note that due to security regulations from Didi Chuxing, we are only permitted to showcase the trends within different datasets instead of explicit numbers.
(a) Beijing
(b) Shanghai
(c) Dezhou
(d) Zhangzhou
Figure 5: Heat maps of order rejection rates in different areas of different cities, where a darker color represents a higher rejection rate. Exact numbers are also omitted due to security regulations.
Feature Description
, the number of different historical events during last day/week/month
the number of finished trips during last day/week/month
start POI type
end POI type
product type
estimated pick-up distance
estimated pick-up time
whether it is a peak period
whether it is a holiday
whether it is a hot spot
real-time supply and demand ratio
the cancellation rate during the last ten minutes at the starting point
Table III: Features extracted from our datasets.

5.1 Datasets

We conduct experiments on real-world datasets generated by Didi Chuxing, the largest Chinese ride-hailing platform. We use data collected from 6 cities, namely, Beijing (BJ), Shenzhen (SZ), Shanghai (SH), Chengdu (CD), Dezhou (DZ), and Zhangzhou (ZZ). Table II summarizes the statistics of these datasets, which are randomly drawn from the platform’s daily operations in December 2020. Note that all samples in our datasets are matched orders (i.e., passenger-driver pairs in Definition 2.3), where we label the rejected ones as negative and the fulfilled ones as positive.

To further showcase the characteristics of our experimental datasets, we provide some statistical visualizations in Fig. 4 and Fig. 5. From Fig. 4 we can see three obvious peaks at about 9:00, 18:00 and 24:00 in five cities. During those three major commuting periods, people tend to have less patience waiting for a delayed ride, so the rejection rates are high. The data of Zhangzhou presents a different curve, whose peaks appear much earlier in the morning, which may reflect a specific daily routine of people in this city. As Fig. 4 depicts, it is within our expectation that the order rejection rate rises as the pick-up distance increases. However, when the pick-up distance grows beyond a certain value, the rejection rates in three cities decrease significantly. This is reasonable because people are more willing to wait when nearby vehicles are in short supply. Fig. 5 shows heat maps of order rejection rates in different areas of different cities. The closer an area is to downtown, the higher its order rejection rate. As implied by the visualizations, there are almost no consistent patterns across all cities, while various factors have a complex combinatorial impact on MSR, which greatly motivates our model design.

Based on the scale of available instances, we use BJ, SZ, SH and CD as the source cities and DZ and ZZ as the target cities to facilitate knowledge distillation for MSR prediction under data scarcity. Regarding data sparsity, we conduct a further analysis by using Fig. 4(c) to illustrate the distribution of passengers w.r.t. the number of orders placed. It can be seen that target cities have substantially more passengers with fewer orders, and vice versa. Combining the numbers of instances in different cities (Table II) with the curve in Fig. 4(c) implies severe data sparsity in target cities caused by inactive passengers, which is also tackled by our model design. Table III lists the features we have extracted from four different views (i.e., passenger, driver, order and context) that are used as the input of our model.

5.2 Baselines

To evaluate the performance of our model and framework, we compare against two types of baselines: feature interaction-based methods and knowledge transfer-based methods. First, we compare MV with the following feature interaction-based methods that are designed for user response prediction:

  • DeepFM: DeepFM [guo2017deepfm] is an end-to-end deep model that emphasizes both low- and high-order feature interactions, which combines the power of factorization machines and neural networks for interaction modelling and representation learning.

  • PNN: The Product-based Neural Network (PNN) [qu2016product] utilizes an embedding layer to learn a distributed representation of the categorical data, a product layer to capture inter-field patterns, and fully connected layers to explore high-order feature interactions.

  • DIN: The Deep Interest Network (DIN) [zhou2018deep] is developed and deployed in the display advertising system in Alibaba. DIN represents users’ diverse interests with an interest distribution and designs an attention network structure to selectively activate the related user interests.

  • DCN: The Deep and Cross Network (DCN) [wang2017deep] keeps the benefits of a DNN model, and beyond that, it introduces a novel cross network that is more efficient in learning certain bounded-degree feature interactions. In particular, DCN explicitly applies feature crossing at each layer, requires no manual feature engineering, and adds negligible extra complexity to the DNN model.

  • DSSM: The Deep Structured Semantic Model (DSSM) [huang2013learning] learns representations for both queries and documents in a common low-dimensional space, and then computes their similarity to infer the probability of a matching pair.

  • Wide&Deep: The Wide&Deep [cheng2016wide] model jointly trains wide linear models and deep neural networks to combine the benefits of memorization and generalization for user response prediction.

  • DeepCrossing: DeepCrossing [shan2016deep] stacks multiple residual units upon the concatenation layer of feature embeddings in order to learn deeper cross interactions of features.

We further compare our proposed KD method with knowledge transfer-based methods in terms of their capability of utilizing knowledge transfer to enhance prediction performance on target cities. Existing methods [wang2018crowd, yao2019learning] can transfer knowledge between cities, but their prediction targets focus on spatiotemporal time series data, which makes them unsuitable for our MSR prediction problem. Hence, following [yao2019learning], we conduct experiments by designing two fine-tuning strategies for MV as our knowledge transfer-based peer methods:

  • Single-FT: We train the MV model on every source city and fine-tune the model for all target cities, respectively.

  • Multi-FT: We train the MV model on randomly chosen samples from all source cities and then fine-tune it for all target cities.
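The two fine-tuning strategies can be sketched as follows. This is an illustrative sketch under stated assumptions rather than our production code: a small logistic regression stands in for MV, the data are synthetic, and all names (`train_logreg`, `single_ft`, `multi_ft`, `make_city`) are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_logreg(X, y, w_init=None, epochs=200, lr=0.1):
    """Gradient-descent logistic regression; w_init enables warm-start fine-tuning."""
    w = np.zeros(X.shape[1]) if w_init is None else w_init.copy()
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w


def single_ft(sources, target, **kw):
    """Pretrain on each source city separately, then fine-tune each copy on the target."""
    X_t, y_t = target
    models = {}
    for city, (X, y) in sources.items():
        w_pre = train_logreg(X, y, **kw)
        models[city] = train_logreg(X_t, y_t, w_init=w_pre, **kw)
    return models


def multi_ft(sources, target, n_samples, rng, **kw):
    """Pretrain on samples drawn at random from all source cities, then fine-tune."""
    X_all = np.vstack([X for X, _ in sources.values()])
    y_all = np.concatenate([y for _, y in sources.values()])
    idx = rng.choice(len(y_all), size=min(n_samples, len(y_all)), replace=False)
    w_pre = train_logreg(X_all[idx], y_all[idx], **kw)
    X_t, y_t = target
    return train_logreg(X_t, y_t, w_init=w_pre, **kw)


def make_city(rng, n, w_true):
    """Synthetic stand-in for one city's labelled passenger-driver pairs."""
    X = rng.normal(size=(n, w_true.size))
    y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)
    return X, y


rng = np.random.default_rng(0)
w_true = np.array([1.5, -2.0, 0.5])
sources = {c: make_city(rng, 2000, w_true) for c in ("BJ", "SZ", "SH", "CD")}
target = make_city(rng, 100, w_true)  # a target city has far fewer instances

single_models = single_ft(sources, target, epochs=100)
multi_model = multi_ft(sources, target, n_samples=4000, rng=rng, epochs=100)
```

Single-FT therefore yields one fine-tuned model per source city, whereas Multi-FT yields a single model whose pretraining sees a mixture of all source cities.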

5.3 Experimental Settings

As our problem is a classification task, we evaluate the prediction accuracy with two widely applied metrics, namely the Area Under the ROC Curve (AUC) and the Root Mean Squared Error (RMSE) [chen2020sequence]. In experiments, the ratio between the test set and the training set is 1:9. In the training set, we further hold out its last 10% of the data for validation. We implement our solutions with TensorFlow 1.15 and Python 3.6. The default sizes of embeddings and the memory matrix are set as and , respectively. The batch size is set to .
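The evaluation protocol above can be sketched in a few lines. This is a minimal illustration, assuming that instances are ordered by time and that the test set is taken from the end of the data (the text only states the 1:9 ratio and that the last 10% of the training set is used for validation). The helper names `chronological_split`, `auc` and `rmse` are hypothetical, and the pairwise AUC is written for clarity rather than speed.

```python
import numpy as np


def chronological_split(n_instances):
    """Return (train, validation, test) index arrays for a 1:9 test/train split.

    Assumption: rows are time-ordered and the test block is the final 10%.
    The last 10% of the remaining training block is held out for validation.
    """
    n_test = n_instances // 10
    n_train_val = n_instances - n_test
    n_val = n_train_val // 10
    n_train = n_train_val - n_val
    idx = np.arange(n_instances)
    return idx[:n_train], idx[n_train:n_train_val], idx[n_train_val:]


def auc(y_true, y_score):
    """AUC as the probability that a positive is ranked above a negative.

    Ties count as one half. Quadratic in the number of samples.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))


def rmse(y_true, y_score):
    """Root mean squared error between labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    return float(np.sqrt(np.mean((y_score - y_true) ** 2)))
```

A higher AUC and a lower RMSE both indicate better predictions, which is how the two columns per city in the result tables should be read.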

DeepFM 0.7089 0.3122 0.7116 0.3326 0.7379 0.3385 0.7245 0.2761 0.6834 0.3013 0.6925 0.2658
PNN 0.7283 0.3074 0.7343 0.3295 0.7785 0.3285 0.7519 0.2726 0.6886 0.2984 0.7036 0.2657
DIN 0.7291 0.3048 0.7253 0.3348 0.7617 0.3344 0.7462 0.2742 0.6981 0.2988 0.7072 0.2638
DCN 0.7345 0.3079 0.7327 0.3275 0.7654 0.3307 0.7457 0.2721 0.6922 0.3001 0.7036 0.2629
DSSM 0.7376 0.3048 0.7347 0.3305 0.7743 0.3284 0.7464 0.2731 0.6967 0.3003 0.713 0.2649
Wide&Deep 0.7423 0.3056 0.736 0.3277 0.7791 0.3280 0.7469 0.2718 0.7009 0.2975 0.7121 0.2621
DeepCrossing 0.7487 0.3043 0.7409 0.3285 0.7848 0.3226 0.7536 0.2731 0.6919 0.2992 0.7063 0.2614
MV 0.7562 0.3012 0.7538 0.3228 0.7998 0.3201 0.7650 0.2665 0.7045 0.2961 0.7212 0.2590
MV-S1 0.7454 0.3079 0.7458 0.3269 0.7945 0.3223 0.7583 0.2683 0.6899 0.2994 0.7098 0.2614
MV-S2 0.7458 0.3075 0.7466 0.3274 0.7933 0.3237 0.7574 0.2705 0.6886 0.3001 0.7079 0.2600
MV-S3 0.7449 0.3073 0.7326 0.3307 0.7939 0.3221 0.7601 0.2691 0.6859 0.2998 0.7104 0.2592
MV-S4 0.7524 0.3026 0.7520 0.3239 0.794 0.3219 0.7639 0.2695 0.6877 0.2994 0.7063 0.2622
Table IV: Results of different methods on both source and target cities compared with MV.
Figure 6: Results of MV with different hyperparameter values.

5.4 Effectiveness of MV (RQ1)

We first verify the performance of our MV model. Though we lay more emphasis on accurate prediction for less popular cities (i.e., target cities) with sparser data, we evaluate performance on all cities in this test. The upper part of Table IV shows the results of the effectiveness comparison between MV and all feature interaction-based methods. Note that we have fully tuned all hyperparameters of all baselines with grid search, so the results reflect their best performance. In this regard, we can ensure that the performance gain comes from the model design itself. Also, our MSR prediction task shares a similar goal with user response prediction, where an improvement of the evaluation metrics at the 0.001 level is regarded as significant [han2019all, song2019autoint] owing to the amount of additional revenue it can bring.

Based on the experimental results, we make the following observations:

  • It is obvious that MV outperforms all baselines on both AUC and RMSE significantly and consistently. Among the baselines, PNN performs better than DeepFM, which trivially fuses the outputs of the FM and DNN components. Other feature interaction models with more delicate structures (e.g., DIN, which employs an attention network to capture users’ diverse interests, and DSSM, which learns feature representations from different views) generally perform better than DeepFM and PNN. In addition, while Wide&Deep and DeepCrossing have relatively simple structures, they both show promising performance, which implies their generalizability.

  • Generally, all methods show higher prediction accuracy (AUC) on source cities than on target cities. Moreover, DeepCrossing performs worse than several other baselines on both target cities although it outperforms all other baselines on source cities, which may indicate that this method is not suitable for sparse datasets. Meanwhile, Wide&Deep shows impressive prediction accuracy on target cities.

5.5 Ablation Study on MV (RQ2)

To verify the importance of the key components of MV, we design several variants of it and conduct an ablation study. The lower part of Table IV shows the results of the following 4 MV variants: (1) MV-S1, which removes the interactions between features from different views in Section 3.2; (2) MV-S2, which replaces the attentive aggregation in Section 3.3 with a simple concatenation operation; (3) MV-S3, which replaces the DNC with a simple GRU network without the external memory; and (4) MV-S4, which uses LSTM to replace the GRU in the DNC component. From the results in Table IV, it can be observed that:

  • MV-S3 shows the worst performance overall, and it is also very unstable. The general performance of MV-S2 is the second worst, while MV-S1 and MV-S4 perform slightly better than MV-S2. Accordingly, the DNC component plays the most important role in improving our model’s performance. Meanwhile, the attentive aggregation and the multi-view feature interaction both contribute positively to the prediction accuracy of the full MV model.

  • As can be expected, MV-S4 yields performance similar to MV on all source cities, as it shares the same components with MV except that the GRU in the DNC component is swapped with LSTM. It is worth mentioning that MV-S4 does not perform as well as MV on the target cities. One potential reason is that the memory units within LSTM have more parameters, and the sparse data of target cities is insufficient to support their optimization.

In summary, all the key components are beneficial for our model’s performance, while the DNC contributes the most owing to its external memory.
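The overall ranking of the variants can be checked with simple arithmetic on Table IV. The sketch below copies the AUC columns of the MV rows in table order and averages them over the six cities; averaging AUC across cities is our choice of summary for illustration, not a metric reported in the paper.

```python
import numpy as np

# AUC values copied from Table IV (one value per city, in table order).
auc_by_model = {
    "MV":    [0.7562, 0.7538, 0.7998, 0.7650, 0.7045, 0.7212],
    "MV-S1": [0.7454, 0.7458, 0.7945, 0.7583, 0.6899, 0.7098],
    "MV-S2": [0.7458, 0.7466, 0.7933, 0.7574, 0.6886, 0.7079],
    "MV-S3": [0.7449, 0.7326, 0.7939, 0.7601, 0.6859, 0.7104],
    "MV-S4": [0.7524, 0.7520, 0.7940, 0.7639, 0.6877, 0.7063],
}

# Mean AUC per model, and models sorted from lowest to highest mean AUC.
mean_auc = {name: float(np.mean(values)) for name, values in auc_by_model.items()}
ranking = sorted(mean_auc, key=mean_auc.get)

for name in ranking:
    drop = mean_auc["MV"] - mean_auc[name]
    print(f"{name}: mean AUC = {mean_auc[name]:.4f}, drop vs. MV = {drop:.4f}")
```

Sorted from lowest to highest mean AUC, the order is MV-S3, MV-S2, MV-S1, MV-S4, MV, which agrees with the observations above.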

5.6 Hyperparameter Sensitivity of MV (RQ3)

As Fig. 6 shows, we examine the impact of two key hyperparameters of MV, namely the embedding dimension and the size of the memory matrix. In general, our model is insensitive to the variations of both hyperparameters, especially on larger datasets (i.e., source cities). We analyze the effect of each hyperparameter below:

  • Embedding Size: As Fig. 6 shows, when the embedding size of MV increases, its performance keeps improving slightly on source cities, while a downward trend is observed on target cities. A possible reason is that source cities have richer data samples, hence a larger capacity is required to store all the information. Conversely, if the embedding size is set too large on target cities that have scarce information, it can lead to overfitting, which would harm the accuracy of the model.

  • Memory Size: From Fig. 6, we can see that the performance of MV on source cities also keeps increasing. This is consistent with our explanation for the embedding size on source cities. Compared with the embedding size, the memory size seems to have little influence on the final prediction results. As suggested by the results, a relatively small memory matrix is sufficient for storing all the knowledge acquired from the dataset, which can also help avoid an excessively large model. With the increase of the memory size, the final results are relatively stable.

Method DZ (AUC) DZ (RMSE) ZZ (AUC) ZZ (RMSE)
Single-FT (BJ) 0.7131 0.2945 0.7292 0.2568
Single-FT (SZ) 0.7107 0.2948 0.729 0.2571
Single-FT (SH) 0.7107 0.2946 0.7281 0.2569
Single-FT (CD) 0.7088 0.2944 0.7282 0.2569
Multi-FT 0.7122 0.294 0.7303 0.2566
MV (teacher) 0.7131 0.2947 0.7402 0.2532
MV (student) 0.7136 0.2935 0.7425 0.2526
Table V: Results of different methods on target cities compared with KD.

5.7 Effectiveness of KD (RQ4)

The KD model is designed to cope with the significantly scarcer data in target cities using the knowledge from source cities. Table V compares the performance of all knowledge transfer-based baseline methods and KD. In Table V, we list the performance of both the complex teacher model and the simplified student model trained via KD (the two MV rows, in that order). We summarize our findings below:

  • Combining the results from Table V with Table IV, we can tell that transferring knowledge from source cities to target cities makes an obvious difference in performance compared with merely modelling feature interactions. The superiority of the student model verifies the effectiveness of the KD framework that we have proposed, where it even slightly outperforms the sophisticated teacher model on both target cities.

  • As Table V shows, Single-FT based on the BJ dataset is better than the other three Single-FT baselines. One possible reason is that BJ has the most passenger-driver matching records, which intuitively offer richer knowledge to the target cities during inference. At the same time, Multi-FT performs better than Single-FT in general, indicating that the knowledge from a single source city is hardly sufficient for predicting MSR on target cities.
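The size of the gain from knowledge transfer can be made explicit with a short calculation on the reported numbers. The sketch below assumes that the last two AUC columns of Table IV correspond to DZ and ZZ and that the last row of Table V is the student model; both are readings of the tables rather than statements made in them. The 0.001 threshold is the significance level mentioned in Section 5.4.

```python
SIGNIFICANCE_LEVEL = 0.001  # improvement level regarded as significant (Section 5.4)

# AUC of MV trained on each target city alone (MV row of Table IV; assumed columns).
mv_target_only = {"DZ": 0.7045, "ZZ": 0.7212}

# AUC of the student model trained via KD (last row of Table V; assumed to be the student).
kd_student = {"DZ": 0.7136, "ZZ": 0.7425}

# Absolute AUC gain per target city, rounded to the precision of the tables.
auc_gain = {city: round(kd_student[city] - mv_target_only[city], 4)
            for city in mv_target_only}

for city, gain in auc_gain.items():
    multiple = gain / SIGNIFICANCE_LEVEL
    print(f"{city}: AUC gain = {gain:.4f} ({multiple:.1f}x the 0.001 level)")
```

Under these assumptions, the gain on both target cities is several times larger than the 0.001 level.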

Figure 7: Results of KD with different hyperparameter values.

5.8 Hyperparameter Sensitivity of KD (RQ5)

As KD takes the learned memory matrix of each city’s MV model as its input, the memory size of KD is already determined. Hence, we focus on how different embedding sizes in KD influence the final prediction performance. Fig. 7 depicts the prediction results of KD with different embedding sizes, and we perform experiments for both the teacher and student components. Compared with MV in Section 5.6, KD is more sensitive to the embedding dimension. We provide a detailed analysis of the hyperparameter sensitivity of KD below:

  • Embedding Size of Teacher Model: As Fig. 7 shows, when the embedding size grows, the performance of KD rises at first and then gradually declines. KD performs the best when the teacher embedding size is set to . One possible reason is that the teacher model is fully supported by the comprehensive knowledge from all source cities, so a moderate value of is sufficient for the embedding size. Another potential reason is that, similar to MV in Section 5.6, the scarce data of target cities means that the model does not need a very large embedding size to learn all the information.

  • Embedding Size of Student Model: Fig. 7 shows that the performance of KD improves consistently as the student embedding size increases. The reason it shows a different trend from the teacher embedding size is that the student model is considerably simpler and has far fewer parameters. Therefore, it benefits from a larger embedding size, which adds expressiveness so that the student can fully replicate the behavior of the accurate teacher model.

Figure 8: Scalability analysis results.

5.9 Scalability Analysis of MV and KD (RQ6)

We test the scalability of MV and KD via three groups of experiments, and the results are shown in Fig. 8. In what follows, we present more details on the experimental settings and provide further analysis of our model’s scalability.

  • Firstly, we test the training efficiency of MV by varying the data size used for training. Specifically, we train MV with different amounts of instances, and report the time it takes for the model to converge. We can see from Fig. 8 that the training time of MV closely tracks the linear reference (blue line). This indicates that MV is able to scale to large datasets.

  • Secondly, recall that the other important goal of designing KD is to obtain an efficient and simple model to support online inference while preserving high accuracy. Hence, we compare the time cost of the teacher and student models in KD during inference. As the curves in Fig. 8 show, the student is consistently more efficient at inference than the teacher, and the inference time costs of both models grow linearly.

  • Thirdly, as the MSR prediction model is expected to handle very large datasets in production, we deploy our model on a Spark cluster with parallelism for training. The experiments of this group are conducted under the same data size of instances. The speed-up ratio of MV under different degrees of parallelism is depicted in Fig. 8. It is obvious that the efficiency of parallelism is highly competitive. Moreover, we can see that the speed-up ratio begins to drop after the degree of parallelism increases to , which is a strong efficiency indicator for industry-level deployment.
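The quantities behind this analysis can be written down directly. The sketch below uses the standard definitions of speed-up ratio (serial time divided by parallel time) and parallel efficiency (speed-up divided by the degree of parallelism), and measures how close a time-versus-data-size curve is to linear with the coefficient of determination of a least-squares line. All timings in the example are hypothetical placeholders, not measurements from our experiments, and the function names are our own.

```python
import numpy as np


def speedup_ratio(t_serial, t_parallel):
    """Speed-up of a parallel run relative to the single-worker run."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial, t_parallel, degree):
    """Fraction of the ideal (linear) speed-up achieved at a given parallelism."""
    return speedup_ratio(t_serial, t_parallel) / degree


def linear_fit_r2(x, y):
    """R^2 of a least-squares line through (x, y); values near 1 mean near-linear growth."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residual = y - (slope * x + intercept)
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Hypothetical example: training time (minutes) for growing data sizes (millions).
data_sizes = [1, 2, 4, 8]
train_minutes = [10.0, 20.5, 39.0, 81.0]
print("R^2 of linear fit:", linear_fit_r2(data_sizes, train_minutes))

# Hypothetical example: time per run at different degrees of parallelism.
t_single = 120.0
for degree, t_par in [(2, 62.0), (4, 33.0), (8, 19.0)]:
    print(degree, speedup_ratio(t_single, t_par),
          parallel_efficiency(t_single, t_par, degree))
```

A speed-up curve that flattens as the degree of parallelism grows shows up here as a falling parallel efficiency.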

6 Related Work

6.1 Traffic Prediction

We first introduce relevant research work on traffic prediction problems [wei2016zest, wang2016traffic, tong2017simpler, liu2017functional, zhang2017deep, feng2018deepmove, yao2018deep, yao2019learning, wang2019unified, wang2019origin]. With the fast development of Internet and data warehousing techniques, all kinds of data have become increasingly available for such prediction tasks in the last few years. Some studies [wei2016zest, wang2019unified, tong2017simpler, yao2018deep] propose to use multi-source data to solve passenger demand prediction. Wang and Wei et al. [wei2016zest, wang2019unified] present a combinatorial model to predict the passenger demand in a region at a future time slot; the model captures temporal trends and fuses spatial and other related features to facilitate prediction. In particular, Tong et al. [tong2017simpler] employ multi-source data to synthesize more than 200 million features via feature engineering and feed them into a unified linear regression model to obtain prediction results. Zhang et al. [zhang2017deep] design a spatiotemporal model based on CNN and residual networks to extract information from both spatial and temporal perspectives and predict both the in-flow and out-flow within each urban area. Wang et al. [wang2016traffic] propose a deep learning method with an error-feedback recurrent convolutional neural network (eRCNN) for continuous traffic speed prediction. Liu et al. [liu2017functional] develop a hierarchical station bike demand predictor which analyzes bike demands from the functional zone level to the station level. However, these studies ignore the data imbalance problem among different cities. Yao et al. [yao2019learning] propose a meta-learning framework to transfer knowledge from source cities to target cities, but their model does not account for intricate feature interactions and the efficiency of the prediction model. Moreover, the work of Tong et al. provides us with much inspiration, especially the studies on bipartite graph matching [DBLP:conf/icde/WangTLXXL19], task assignment [DBLP:conf/kdd/ShiTZSLY21, tong2019two, tong2016mobile], and online minimum matching in real-time spatial data [tong2016online]. Though these are also two-sided problems, such methods lack the capability of addressing data imbalance and efficient online inference in the MSR prediction context.

6.2 Deep Feature Interaction Models

As the MSR problem involves the joint effect of multiple dynamic factors, modeling feature interactions is necessary. Recent methods for feature interaction modeling [shan2016deep, qu2018product, zhou2018deep, huang2013learning, shen2014latent, palangi2014semantic, elkahky2015a, Rendle2010factorization, juan2016field, guo2017deepfm, he2017neural, xiao2017attentional, lian2018xdeepfm, chen2020sequence] have gained immense popularity in a wide range of prediction tasks, especially user response prediction in online advertising. In order to project queries and documents into a common low-dimensional space where the relevance of a document given a query is readily computed as the distance between them, Huang et al. [huang2013learning] propose DSSM, which learns representations for both objects and computes their relevance. Similarly structured methods [shen2014latent, palangi2014semantic, elkahky2015a] based on CNN and LSTM are proposed to capture more interaction contexts. Some studies [juan2016field, guo2017deepfm, he2017neural, xiao2017attentional, lian2018xdeepfm, chen2020sequence] focus on improving the interaction modeling schemes based on the original factorization machine (FM) [Rendle2010factorization]. For example, Guo et al. [guo2017deepfm] combine FM and the traditional DNN, which enhances the expressiveness of the model. Xiao et al. [xiao2017attentional] propose a novel attentional factorization machine, which learns the importance of each feature interaction via a neural attention network. Qu et al. [qu2016product, qu2018product] come up with a product-based neural network (PNN), which performs either an inner product or an outer product operation on feature embeddings. Zhou et al. [zhou2018deep] calculate the attention on different features and compute a weighted sum within different feature groups, which enables the model to adjust the weights of input features for accurate prediction. Chen et al. [chen2020sequence] point out that existing FM-based models assume no temporal order in the data and are unable to capture the sequential dependencies or patterns within the dynamic features, so they propose a novel sequence-aware factorization machine for temporal predictive analytics, which models feature interactions by fully investigating the effect of sequential dependencies. Unfortunately, none of the aforementioned methods takes the data imbalance problem into consideration, which greatly hinders a model’s performance under data scarcity.

6.3 Memory Networks and Knowledge Distillation

In order to solve the data imbalance problem, we gain much inspiration from memory networks and knowledge distillation methods [graves2014neural, graves2016hybrid, cheng2016wide, ba2013deep, hinton2015distilling, mishra2017apprentice, tang2018ranking, wang2020next]. One representative technique is memory networks, which are built on the idea of learning from previous experience. Graves et al. [graves2014neural, graves2016hybrid] propose NTM and DNC, which maintain an external memory matrix to store the dynamics of temporal features over time. Then, in order to balance the generalization and memorization of the model, Cheng et al. [cheng2016wide] design the Wide&Deep model, whose wide part is in charge of memorization via a linear model and whose deep part is responsible for generalization. As a popular choice for transfer learning, knowledge distillation has shown promising effectiveness [ba2013deep, hinton2015distilling, mishra2017apprentice, tang2018ranking, wang2020next] in various tasks. Ba et al. [ba2013deep] are the first to put forward the emulation learning scheme, where a complicated teacher model is designed to guide the training of a simple student model without labels, and which has attracted extensive research attention ever since. Hinton et al. [hinton2015distilling] propose to further incorporate the labels of the training data into the loss function to enhance the performance of the learned student model. Knowledge distillation has been proven effective in recent preference mining tasks like recommendation [tang2018ranking, wang2020next], showcasing its potential in MSR prediction.

7 Conclusion

In this work, we define a novel research problem, i.e., Matching Success Rate prediction for passenger-driver pairs, which originates from the real demand of ride-hailing platforms. MSR prediction faces three main challenges. Firstly, as MSR involves a bidirectional decision process of two end-users in a dynamic environment, learning a comprehensive representation for each passenger-driver pair is non-trivial. Secondly, the data imbalance problem is common yet harmful to the prediction accuracy on small cities. Thirdly, as MSR prediction is utilized to support real-time strategic operations, a lightweight yet accurate model is essential. In order to solve MSR prediction, we propose the Multi-View model (MV), which learns the combinatorial effect of features in different views. Then, to tackle data imbalance and guarantee online efficiency, we design the Knowledge Distillation framework (KD), which can not only supplement knowledge for the cities with scarce data, but also generate a simple model to support online applications. Through extensive experiments, we have demonstrated the strength of our solution in both accuracy and scalability.


We thank our colleagues at Didi Chuxing, Kecheng Xu and Haoyu Wang, who helped us conduct data statistics during the first-round revision. This work is supported by the National Key Research & Development Program of China (Grant No. 2016YFB1000103) and the Australian Research Council (Grant No. DP190101985, FT210100624).