SANST: A Self-Attentive Network for Next Point-of-Interest Recommendation

by   Qianyu Guo, et al.
The University of Melbourne

Next point-of-interest (POI) recommendation aims to offer suggestions on which POI to visit next, given a user's POI visit history. This problem has a wide application in the tourism industry, and it is gaining an increasing interest as more POI check-in data become available. The problem is often modeled as a sequential recommendation problem to take advantage of the sequential patterns of user check-ins, e.g., people tend to visit Central Park after The Metropolitan Museum of Art in New York City. Recently, self-attentive networks have been shown to be both effective and efficient in general sequential recommendation problems, e.g., to recommend products, video games, or movies. Directly adopting self-attentive networks for next POI recommendation, however, may produce sub-optimal recommendations. This is because vanilla self-attentive networks do not consider the spatial and temporal patterns of user check-ins, which are two critical features in next POI recommendation. To address this limitation, in this paper, we propose a model named SANST that incorporates spatio-temporal patterns of user check-ins into self-attentive networks. To incorporate the spatial patterns, we encode the relative positions of POIs into their embeddings before feeding the embeddings into the self-attentive network. To incorporate the temporal patterns, we discretize the time of POI check-ins and model the temporal relationship between POI check-ins by a relation-aware self-attention module. We evaluate the performance of our SANST model with three real-world datasets. The results show that SANST consistently outperforms the state-of-theart models, and the advantage in nDCG@10 is up to 13.65



There are no comments yet.


page 1

page 2

page 3

page 4


DAN-SNR: A Deep Attentive Network for Social-Aware Next Point-of-Interest Recommendation

Next (or successive) point-of-interest (POI) recommendation has attracte...

Where to Go Next: A Spatio-temporal LSTM model for Next POI Recommendation

Next Point-of-Interest (POI) recommendation is of great value for both l...

NAIRS: A Neural Attentive Interpretable Recommendation System

In this paper, we develop a neural attentive interpretable recommendatio...

MEANTIME: Mixture of Attention Mechanisms with Multi-temporal Embeddings for Sequential Recommendation

Recently, self-attention based models have achieved state-of-the-art per...

Extracting Attentive Social Temporal Excitation for Sequential Recommendation

In collaborative filtering, it is an important way to make full use of s...

STAN: Spatio-Temporal Attention Network for Next Location Recommendation

The next location recommendation is at the core of various location-base...

Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence

The rapid growth of Location-based Social Networks (LBSNs) provides a gr...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Travel and tourism is a trillion dollar industry worldwide [20]. To improve travel and tourism experiences, many location-based services are built. Next POI recommendation is one of such services that is gaining an increasing interest as more POI check-in data become available. Next POI recommendation aims to suggest a POI to visit next, given a user’s POI visit history. Such a service is beneficial to both the users and the tourism industry, since it alleviates the burden of travel planning for the users while also boosts the visibility of POIs for the tourism industry.

Figure 1: A POI check-in sequence in New York City (dashed line indicates movements across days)

Next POI recommendation is often modeled as a sequential recommendation problem [2, 6, 5], to take advantage of the sequential patterns in POI visits. For example, tourists at New York City often visit Central Park right after The Metropolitan Museum of Art (“the Met”, cf. Fig. 1). If a user has just checked-in at the Met, Central Park is the next POI to recommend.

Recently, self-attentive networks (SAN) [22], a highly effective and efficient sequence-to-sequence learning model, have been introduced to general sequential recommendation problems. The resultant model named SASRec [9] yields state-of-the-art performance in recommending next product or video game to purchase, next movie to watch, etc.

Applying SASRec to make next POI recommendations directly, however, may produce sub-optimal results. This is because SASRec is designed for general recommendation scenarios and focuses only on the sequential patterns in the input sequence. It does not consider any spatial or temporal patterns, which are inherent in POI visit sequences and are critical for POI recommendations.

In terms of spatial patterns, as illustrates in Fig. 1, POI visits demonstrate a clustering effect [2]

. POIs nearby have a much larger probability to be visited consecutively than those far away. This offers an important opportunity to alleviate the data sparsity problem resulted from relying only on historical check-in sequences. In an extreme case, we may recommend Central Park to a user at the Met, even if the two POIs had not appeared in the historical check-in sequences, since both POIs are just next to each other and may be visited together.

In terms of temporal patterns, historical check-ins made at different times ago shall have different impact on the next POI visit. For example, in Fig. 1, the solid arrows represent transitions made within the same day, while the dashed arrow represents a transition made across days. Check-ins at Central Park and the Met may have a strong impact on the check-in at Museum of the New York City, as they together form a day trip. In contrast, the check-in at Museum of the New York City may have little impact on the check-in at Times Square, as they are at different days and may well be different trips. SASRec ignores such time differences and just focuses on the transition probabilities between actions (i.e., check-ins). In SASRec, the impact of the check-in at the Met on the check-in at Central Park may be the same as that of the check-in at Museum of the New York City on the check-in at Times Square. Such an impact modeling strategy may be inaccurate for next POI recommendations.

To address these limitations, in this paper, we introduce self-attentive networks to next POI recommendation and integrate spatial and temporal pattern learning into the model. We name our resultant model the SANST model.

To integrate spatial pattern learning, we learn a spatial embedding for each POI and use it as the input of the self-attentive networks. Our embedding preserves the spatial proximity between the POIs, such that the self-attentive networks can learn not only the sequential patterns between the check-ins but also the spatial patterns between the check-ins. To learn our spatial embedding, we first hash the POIs into a grid where nearby cells are encoded by strings with common prefixes (e.g., following a space-filling curve such as the Z-curve). We then learn character embeddings from the hashed strings corresponding to the POIs using Bi-LSTM [8]. The Bi-LSTM output is then used as the embedding of the POI. Since POIs at nearby cells are encoded by similar strings, they are expected to obtain similar embeddings in our model. Thus, we preserve the spatial proximity patterns in POI check-ins.

To integrate temporal pattern learning, we follow [19] to adapt the attention mechanism. We add a parameter to represent the relative position between two input elements and (i.e., check-ins) in the input sequence. We define the relative position based on the time difference between and instead of the number of other check-ins between and (which was done by [19]). This way, our SANST model represents the temporal patterns in check-ins explicitly and can better learn their impact.

To summarize, we make the following contributions:

  1. We propose a self-attentive network named SANST that incorporates spatial and temporal POI check-in patterns for next POI recommendation.

  2. To incorporate the spatial patterns of POI check-ins, we propose a novel POI embedding technique that preserves the spatial proximity of the POIs. Our technique hashes POIs into a grid where nearby cells are encoded by strings with common prefixes. The POI embeddings are learned from the hashed strings via character embeddings, such that POIs at nearby cells yield similar embeddings.

  3. To incorporate the temporal patterns of POI check-ins, we extend the self-attentive network by adding a parameter to represent the relative time between the check-ins. This enables the network to learn the temporal patterns of the check-ins explicitly.

  4. We study the empirical performance of our SANST model on three real-world datasets. The results show that SANST outperforms state-of-the-art sequential next POI recommendation models and adapted models that combine self-attentive networks with spatial feature learning directly by up to 13.65% in terms of nDCG@10.

2 Related Work

Like many other recommendation problems, POI recommendation has attracted extensive research interests. Earlier studies on this topic use collaborative filtering (CF) techniques, including both user-based CF (UCF) [24] and item-based CF (ICF) [11]. These techniques make recommendations based on either user or item similarity. Factorization models [7, 13, 14] are also widely used, where the user-POI matrix is factorized to learn users’ latent interests towards the POIs. Another stream of studies use probabilistic models [1, 10], which aim to model the mutual impact between spatial features and user interests (e.g., via Gaussian or topic models). Details of these studies can be found in a survey [25] and an experimental paper [16].

Next POI recommendation. In this paper, we are interested in a variant of the POI recommendation problem, i.e., next POI recommendation

(a.k.a. successive POI recommendation). This variant aims to recommend the very next POI for a user to visit, given the user’s past POI check-in history as a sequence. Users’ sequential check-in patterns play a significant role in this problem. For example, a tensor-based model named

FPMC-LR [2] recommends the next POI by considering the successive POI check-in relationships. It extends the

factorized personalized Markov chain

(FPMC) model [18] by factorizing the transition probability with users’ movement constraints. Another model named PRME [6] takes a metric embedding based approach to learn sequential patterns and individual preferences. PRME-G [6] further incorporates spatial influence using a weight function based on spatial distance. POI2Vec [5] also makes recommendations based on user and POI embeddings. To learn the embeddings, it adopts the word2vec model [17]

originated from the natural language processing (NLP) community; users’ past POI check-in sequences form the “word contexts” for training the word2vec model. The POI check-in time is also an important factor that is considered in next POI recommendation models. For example,

STELLAR [30] uses ranking-based pairwise tensor factorization to model the interactions among users, POIs, and time. ST-RNN [15] extends recurrent neural networks (RNN) to incorporate both spatial and temporal features by adding distance-specific and time-specific transition matrices into the model state computation. MTCA [12] and STGN [29] adopt LSTM based models to capture the spatio-temporal information. [26] split a day into time slots (e.g., by hour) to learn the periodic temporal patterns of POI check-ins. LSTPM [21] map a week into time slots. They propose a context-aware long and short-term preference modeling framework to model users’ preferences and a geo-dilated RNN to model the non-consecutive geographical relation between POIs.

Attention networks for recommender systems. Due to its high effectiveness and efficiency, the self-attention mechanism [22] has been applied to various tasks such as machine translation. The task of recommendation makes no exception. SASRec [9] is a sequential recommendation model based on self-attention. It extracts the historical item sequences of each user, and then maps the recommendation problem to a sequence-to-sequence learning problem. AttRec [28] uses self-attention to learn from user-item interaction records about their recent interests, which are combined with users’ long term interests learned by a metric learning component to make recommendations. These models have shown promising results in general sequential recommendation problems, e.g., to recommend products, video games, or movies. However, they are not designed for POI recommendations and have not considered the spatio-temporal patterns in POI recommendations. In this paper, we build a self-attentive network based on the SASRec model to incorporate spatio-temporal patterns of user check-ins. We will compare with SASRec in our empirical study. We omit AttRec since SASRec has shown a better result.

Symbol Description
A set of users
A set of POIs
A set of check-in time
The historical check-in sequence of a user
A historical check-in of a user
The maximum check-in sequence length
The input embedding matrix
A relevant score computed by SASRec
Table 1: Frequently Used Symbols

(a) Transformer layer (Trm)                        (b) SASRec model structure                                     (c) Grid cell ID string embedding

Figure 2: The architecture of our SANST model

3 Preliminaries

We start with basic concepts and a problem definition in this section. We then present the SASRec model [9], based on which our proposed model is built. This model will also be used in our experiments as a baseline. We summarize the frequently used symbols in Table 1.

3.1 Problem Definition

We consider a set of users and a set of POIs . Each user comes with a historical POI check-in sequence = , which is sorted in ascending order of time. Here, denotes the size of the set . Every check-in is a tuple , where is the check-in POI and is the check-in time, respectively.

Given a user and a time , our aim is to predict , i.e., the next POI that will visit at . In our model, we consider check-in times in the day granularity, to alleviate the data sparsity issue.

3.2 SASRec

SASRec is a two-layer transformer network [22]. It models sequential recommendation as a sequence-to-sequence learning problem that translates an input sequence to an output sequence . Here,

is a hyperparameter controlling the input length. The last element in the output,

, is the recommendation output. SASRec can be adopted for next POI recommendation directly, by treating a user’s check-in sequence as the model input. If , only the latest check-ins are kept; if

, the input sequence is padded by 0’s at front.

Fig. 2b illustrates the SASRec model structure. The input sequence of SASRec first goes through an embedding layer to convert every input element (e.g., a POI id ) into a

-dimensional vector


The embeddings are then fed into two stacking transformer layers. In the first transformer layer, as shown in Fig. 2a, the transformer input first goes through a self-attention layer to capture the pairwise dependency between the POI check-ins. Specifically, the embeddings of the input elements are concatenated rowwise to form a matrix . Then, three linear projections on are done using three projection matrices , respectively. These matrices will be learned by the model. The linear projections yield three matrices , , and , respectively. The self-attention for , denoted by , is then computed as follows, where , , and are let be the same:


To endow the model with non-linearity, a point-wise feed-forward network is applied on . Let be the -th output of of self-attention module. Then, the feed-forward network on is computed as:


Here, and are learnable parameters. Layer normalization and dropout are adopted in between these layers to avoid overfitting.

Two transformer layers are stacked to form a deeper network, where the point-wise feed-forward network of the first layer is used as the input as the second layer.

Finally, the prediction output at position is produced by computing the relevance score between a POI and the position- output of the point-wise feed-forward network of the second transformer layer.


Here, denotes the embedding of fetched from the embedding layer. The POIs with the highest scores is returned.

To train the model, the binary cross-entropy loss is used as the objective function:


4 Proposed Model

As discussed earlier, while SASRec can be adapted to make next POI recommendations, a direct adaptation may produce sub-optimal results. This is because SASRec does not consider any spatial or temporal patterns, which are inherent in POI visit sequences and are critical for POI recommendations. In this section, we present our SANST model to address this limitation via incorporating spatial and temporal pattern learning into self-attentive networks. As illustrated in Fig. 2, our SANST model shares the overall structure with SASRec. We detail below how to incorporate spatial and temporal pattern learning into this structure.

4.1 Spatial Pattern Learning

To enable SANST to learn the spatial patterns in POI check-in transitions, we update the embedding of the -th check-in of an input sequence to incorporate the location of the checked-in POI. This way, our SANST model can learn not only the transitions between the POIs (i.e., their IDs) but also the transitions between their locations.

A straightforward approach to incorporate the POI locations is to concatenate the geo-coordinates of the POIs with the SASRec embedding . This approach, however, suffers from the data sparsity problem – geo-coordinates of POIs are in an infinite and continuous space, while the number of POIs is usually limited (e.g., thousands).

To overcome the data sparsity, we discritize the data space with a grid such that POI locations are represented by the grid cell IDs. We learn embeddings for the grid cell IDs (detailed next). Then, given a check-in , we locate the grid cell in which the check-in POI lies. We use the grid cell ID embedding as the spatial embedding of , denoted by . We concatenate (denoted by ) with the SASRec embedding to form a new POI embedding in our SANST model, denoted by :


4.1.1 Grid Partitioning and Grid Cell Encoding

We use a space-filling curve to partition the data space and to number the grid cells. We encode the grid cell numbers into strings and use the encoded strings as the grid cell IDs. The purpose is to generate ID strings such that nearby grid cells have similar ID strings. This way, the cell IDs can preserve the spatial proximity of the corresponding cells. We adapt the GeoHash technique to generate the strings as detailed below.222

Fig. 3 illustrates the steps of the grid partitioning and grid cell encoding scheme. A Z-curve is used in this figure, although other space-filling curves such as Hilbert-curves may be used as well. Suppose that an order- curve is used to partition the grid. Then, there are grid cells, and each curve value is represented by an -bit integer. We break a curve value into segments of bits from the left to the right consecutively (a total of segments). Each segment is then mapped to a character in a size- alphabet. In the figure, we use and an alphabet of characters (i.e., ‘a’, ‘b’, ‘c’, and ‘d’). As the figure shows, using this encoding, nearby grid cells obtain similar ID strings – they share common prefixes by the definition of the curve values. The longer the common prefix is, the nearer the two cells are. In our experiments, we use , and .

Figure 3: Example of grid cell encoding

4.1.2 Grid Cell ID Embedding Learning

Geohash encodes coordinates in a hierarchical way. It can create adjacent cells without a common prefix, e.g., cells ‘cdb’ and ‘dca’ in Fig. 3. We address this problem by a natural language processing approach – we learn an ID string embedding via learning character embeddings using a Bi-LSTM network. As Fig. 2c shows, each character (a randomly initialized -dimensional vector to be learned) of an ID string (e.g., “baca”) is fed into Bi-LSTM. The final output of Bi-LSTM in both directions are caught (i.e., and ), which are used as the ID string embedding (i.e., spatial embedding ) to form our POI check-in embedding in SANST. Since the character embeddings are trained jointly with the entire model, the characters which are adjacent to each other in the grid cells will have similar weights in there embedding vectors. Such as ‘a’ and ‘b’, ‘a’ and ‘c’. Therefore, for the cells labeled with ‘cbd’ and ‘dca’, even though they do not share a common prefix, the adjacency relation between the characters in each layer (e.g., ‘c’ and ‘d’ in the top layer of the hierarchical hash codes) can still be captured. We envision that this spatial embedding method can not only be used in our problem but also other tasks that incorporate spatial information.

4.2 Temporal Pattern Learning

To learn the temporal patterns in POI check-in transitions, we follow [19] to adapt the attention mechanism. We add a parameter to represent the relative position between two check-ins and in the input sequence. We define the relative position based on the time difference between and instead of the number of other check-ins between and (which was done by [19]). This way, we model the temporal patterns explicitly and can better learn their impact. We detail our adaptation next.

In self-attention, each output element

is computed as a weighted sum of linearly transformed input elements, i.e.,

, where denotes the position- input element (e.g., a POI check-in). [19] add an edge between two input elements and to model their relative position in the input sequence. The impact of this edge is learned from two vectors and , and the self-attention equation is updated to:


Here, is computed as:


We adapt and as follows to learn the temporal pattern between and (i.e., the relative position in time):


Here, and represent the temporal label on input sequence at position and , respectively. We compute the temporal label for the -th input element of user as . We then learn relative temporal representations and , where is a hyperparameter that represents the size of the time context window that we examine. We illustrate these concepts in Fig. 4, where each input element is a user check-in .

Figure 4: Temporal pattern learning

4.3 Model Training

We use the same loss function as that of SASRec (i.e., Equation 


) to train SANST. We randomly generate one negative POI for each time step in each sequence in each epoch. The model is optimized by the Adam optimizer.

5 Experiments

We perform an empirical study on the proposed model SANST and compare it with state-of-the-art sequential recommendation and next POI recommendation models.

Dataset users POIs check-ins Time range
Gowalla 10,162 24,250 456,988 02/2009-10/2010
Los Angeles 3,202 4,368 101,327 03/2009-10/2010
Singapore 2,321 5,596 194,108 08/2010-07/2011
Table 2: Dataset Statistics
Method Gowalla Los Angeles Singapore
hit@10 nDCG@10 hit@10 nDCG@10 hit@10 nDCG@10
Baseline FPMC-LR 0.1197 0.0741 0.2347 0.1587 0.1784 0.1017
POI2vec 0.0939 0.0606 0.2370 0.1700 0.2063 0.1425
PRME-G 0.1852 0.1083 0.2367 0.1592 0.1601 0.1049
SASRec 0.2023 0.1209 0.3648 0.2337 0.2245 0.1429
SASRec+WF 0.2096 0.1184 0.3304 0.2006 0.2137 0.1251
SASRec+2KDE 0.1842 0.1113 0.3057 0.1912 0.1900 0.1154
LSTPM 0.1361 0.0847 0.2366 0.1580 0.1777 0.1105
Variants SANS 0.2248 0.1372 0.3891 0.2519 0.2417 0.1491
SANT 0.2028 0.1245 0.3635 0.2264 0.2296 0.1441
Proposed SANST 0.2273 0.1374 0.3941 0.2558 0.2417 0.1531
(+8.44%) (+13.65%) (+8.03%) (+9.46%) (+7.66%) (+7.14%)
Table 3: Summary of Results

5.1 Settings

We first describe our experimental setup, including datasets, baseline models, and model implementation details.

Datasets. We evaluate the models on three real-world datasets: the Gowalla dataset [26], the Los Angeles dataset [3], and the Singapore dataset [26]. Table 2 summarizes the dataset statistics. Following previous studies [6, 4], for each user’s check-in sequence, we take the last POI for testing and the rest for training, and we omit users with fewer than five check-ins.

Baseline models. We compare with six baseline models:

  • FPMC-LR [2]: This is a matrix factorization model that extends the factorized personalized Markov chain (FPMC) model [18] by factorizing the POI transition probability with users’ movement constraints.

  • PRME-G [6]: This is a metric embedding based approach to learn sequential patterns and individual preferences, and it incorporates spatial influence using a weight function based on spatial distance.

  • POI2Vec [5]: This is an embedding based model. It adopts the word2vec model to compute POI and user embeddings. Recommendations are made based on the embedding similarity.

  • LSTPM [21]: This is an LSTM based model. It uses two LSTMs to capture users’ long-term and short-term preferences and a geo-dilated RNN to model the non-consecutive geographical relation between POIs.

  • SASRec [9]: This is the state-of-the-art sequential recommendation model as described in the Preliminaries section.

  • SASRec+WF: We combine SASRec with spatial pattern learning by adopting a weight function [6] to weight the relevance score by the spatial distance to the last POI check-in.

  • SASRec+2KDE: We combine SASRec with spatial pattern learning by adopting the

    two-dimensional kernel density estimation

    (2KDE) model [27]. The 2KDE model learns a POI check-in distribution for a user based on the POI geo-coordinations. We weight the relevance score for a POI in SASRec by the probability of the POI learned by 2KDE.

Model variants. To study the contribution of the spatial and time pattern learning to the overall performance of SANST, we further compare with two model variants: SANS and SANT, which are SANST without time pattern learning and SANST without spatial pattern learning, respectively.

Implementation details. For FPMC-LR, PRME-G, and POI2Vec, we use the code provided by [4] (we could not compare with the model proposed by [4] because they only provided implementations of their baseline models but not their proposed model). For LSTPM, we use the code provided by [21]. For SASRec, we use the code provided by [9], based on which our SANST model and its variants are implemented. The learning rate, regularization, POI embedding dimensionality, and batch size are set to 0.005, 0.001, 50, and 128 for all models, respectively. Other parameters of the baseline models are set to their default values that come with the original paper. For our SANST model and its variants, we use 2 transformer layers, a dropout rate of 0.3, a character embedding size of 20, and the Adam optimizer. We set the maximum POI check-in sequence length to 100, and the time context window size to be 3, 5 and 1 for Gowalla, Los Angeles and Singapore, respectively. We train the models with an Intel(R) Xeon(R) CPU @ 2.20GHz, a Tesla K80 GPU, and 12GB memory.

We report two metrics: hit@10 and nDCG@10. They measure how likely the ground truth POI is in the top-10 POIs recommended and the rank of the ground truth POI in the top-10 POIs recommended, respectively.

5.2 Results

We first report comparison results with baselines and then report the impact of model components and parameters.

Overall performance. Table 3 summarizes the model performance. We see that our model SANST outperforms all baseline models over all three datasets consistently, in terms of both hit@10 and nDCG@10. The numbers in parentheses show the improvements gained by SANST comparing with the best baselines. We see that SANST achieves up to 8.44% and 13.65% improvements in hit@10 and nDCG@10 (on Gowalla dataset), respectively. These improvements are significant as confirmed by -test with . Among the baselines, SASRec outperforms FPMC-LR, POI2Vec PRME-G and LSTPM, which validates the effectiveness of self-attentive networks in making next item recommendations. However, adding spatial features to self-attentive networks with existing methods do not necessarily yield a better model. For example, both SASRec+WF and SASRec+2KDE produce worse results than SASRec for most cases tested. They learn spatial patterns and sequential patterns separately which may not model the correlations between the two factors accurately. Our spatial pattern learning technique differs from the existing ones in its ability to learn representations that preserve the spatial information and can be integrated with sequential pattern learning to form an integrated model. Thus, combining self-attentive networks with our spatial pattern learning technique (i.e., SANS) outperforms SASRec, while adding temporal pattern learning (i.e., SANST) further boosts our model performance.

Ablation study. By examining the results of the two model variants SANS and SANT in Table 3, we find that, while both spatial and temporal patterns may help next POI recommendation, spatial patterns appear to contribute more. We conjecture that this is due to the use of the same discretization (i.e., by day) across the check-in time of all users in SANT. Such a global discretization method may not reflect the impact of time for each individual user accurately. It impinges the performance gain achieved from the time patterns. Another observation is that, when both types of patterns are used (i.e., SANST), the model achieves a higher accuracy comparing with using either pattern. This indicates that the two types of patterns complement each other well.

Model Gowalla Los Angeles Singapore
20 0.1131 0.1177 0.2240 0.2369 0.1289 0.1299
50 0.1265 0.1301 0.2326 0.2409 0.1379 0.1431
100 0.1209 0.1374 0.2337 0.2558 0.1429 0.1531
200 0.1251 0.1391 0.2195 0.2355 0.1446 0.1506
Table 4: Impact of (NDCG@10)

Next, we study the impact of hyperparameters and model structure on the performance of SANST. Due to space limit, we only report results on nDCG@10 in these experiments. Results on hit@10 show similar patterns and are omitted.

Impact of input sequence length . We start with the impact of the input sequence length . As shown in Table 4, SANST outperforms the best baseline SASRec across all values tested. Focusing on our model SANST, as grows from 20 to 100, its nDCG@10 increases. This can be explained by that more information is available for SANST to learn the check-in patterns. When grows further, e.g., to 200, the model performance drops. This is because, on average, the user check-in sequences are much shorter than 200. Forcing a large input length requires padding 0’s at front which do not contribute extra information. Meanwhile, looking too far back may introduce noisy check-in patterns. Thus, a longer input sequence may not benefit the model.

Impact of character embedding dimensionality . We vary the character embedding dimensionality for our spatial pattern learning module from 10 to 30 and report the performance of SANST in Fig. 5a. As the figure shows, our model is robust against the character embedding dimensionality (note the small value range in the -axis. When , the model has the best overall performance across the datasets. This relates to the fact that the size of the character vocabulary used to generate the location hashing strings is 32. If is much smaller than 32, the character embedding may not capture sufficient information from the hashing strings. Meanwhile, if is too large, the information captured by each dimension may become too weak due to data sparsity. Another observation is that the model performs much better on Los Angeles. This is because Los Angeles has the smallest number of POIs and average check-in sequence length per user. Its data space to be learned is smaller than those of the other two datasets.

(a) Character embedding size  (b) Time window size

Figure 5: Impact of character embedding size and time context window size

Impact of time context window size . We vary the time context window size for our temporal pattern learning module from 1 to 5 and report the results in Fig. 5b. We see that SANST is also robust against the time context window size. We find the best time context window size to be strongly correlated to the time span and the average length of the check-in sequences in a dataset. In particular, Los Angeles covers a long time span (20 months) while its average check-in sequence length is the shortest. This means that its check-ins are less dense in the time dimension. Thus, it needs a larger time context window to achieve better performance, which explains for its increasing trend in nDCG@10 as increases towards 5.

Impact of model structure.

The transformer network can be stacked with multiple layers, while the attention network can have multi-heads. We show the impact of the number of transformer network layers

and the number of heads in Table 5. We see better results on Los Angeles with , i.e., two transformer layers, which demonstrates that a deeper self-attention network may be helpful for learning the data patterns. However, deeper networks also have the tendency to overfit. The model performance drops when on the other two datasets. When , the model is worse on all three datasets. We keep and further add more heads to the attention network. As the table shows, adding more heads does not help the model performance. This is opposed to observations in natural language processing tasks where different attention heads are more effective to capture various type of relations (e.g., positional and syntactic dependency relations) [23]. We conjecture that the relations between POIs are simpler (and without ambiguity) than those between words in natural language. Thus, a single-head attention is sufficient for our task.

Structure Gowalla Los Angeles Singapore
=1, =1 0.1426 0.2462 0.1539
=2, =1 (default) 0.1374 0.2558 0.1531
=3, =1 0.1278 0.2416 0.1464
=2, =2 0.1358 0.2516 0.1464
Table 5: Impact of Model Structure (NDCG@10)

6 Conclusions

We studied the next POI recommendation problem and proposed a self-attentive network named SANST for the problem. Our SANST model takes advantage of self-attentive networks for their capability in making sequential recommendations. Our SANST model also incorporates the spatial and temporal patterns of user check-ins. As a result, SANST retains the high efficiency of self-attentive networks while enhancing their power in making recommendations for next POI visits. Experimental results on real-world datasets confirm the superiority of SANST – it outperforms state-of-the-art sequential recommendation models and adapted baseline models that combine self-attentive networks with spatial features directly by up to 13.65% in nDCG@10.


  • [1] C. Cheng, H. Yang, I. King, and M. R. Lyu (2012) Fused matrix factorization with geographical and social influence in location-based social networks. In

    AAAI Conference on Artificial Intelligence (AAAI)

    pp. 17–23. Cited by: §2.
  • [2] C. Cheng, H. Yang, M. R. Lyu, and I. King (2013) Where you like to go next: successive point-of-interest recommendation. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 2605–2611. Cited by: §1, §1, §2, 1st item.
  • [3] E. Cho, S. A. Myers, and J. Leskovec (2011) Friendship and mobility: user movement in location-based social networks. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1082–1090. Cited by: §5.1.
  • [4] Q. Cui, Y. Tang, S. Wu, and L. Wang (2019) Distance2Pre: personalized spatial preference for next point-of-interest prediction. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 289–301. Cited by: §5.1, §5.1.
  • [5] S. Feng, G. Cong, B. An, and Y. M. Chee (2017) Poi2vec: geographical latent representation for predicting future visitors. In AAAI Conference on Artificial Intelligence (AAAI), pp. 102–108. Cited by: §1, §2, 3rd item.
  • [6] S. Feng, X. Li, Y. Zeng, G. Cong, Y. M. Chee, and Q. Yuan (2015) Personalized ranking metric embedding for next new poi recommendation. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 2069–2075. Cited by: §1, §2, 2nd item, 6th item, §5.1.
  • [7] H. Gao, J. Tang, X. Hu, and H. Liu (2013) Exploring temporal effects for location recommendation on location-based social networks. In ACM Conference on Recommender systems (RecSys), pp. 93–100. Cited by: §2.
  • [8] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §1.
  • [9] W. C. Kang and J. Mcauley (2018) Self-attentive sequential recommendation. In IEEE International Conference on Data Mining (ICDM), pp. 197–206. Cited by: §1, §2, §3, 5th item, §5.1.
  • [10] T. Kurashima, T. Iwata, T. Hoshide, N. Takaya, and K. Fujimura (2013) Geo topic model: joint modeling of user’s activity area and interests for location recommendation. In ACM International Conference on Web Search and Data Mining (WSDM), pp. 375–384. Cited by: §2.
  • [11] J. J. Levandoski, M. Sarwat, A. Eldawy, and M. F. Mokbel (2012) Lars: a location-aware recommender system. In IEEE International Conference on Data Engineering (ICDE), pp. 450–461. Cited by: §2.
  • [12] R. Li, Y. Shen, and Y. Zhu (2018) Next point-of-interest recommendation with temporal and multi-level context attention. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 1110–1115. Cited by: §2.
  • [13] X. Li, G. Cong, X. Li, T. N. Pham, and S. Krishnaswamy (2015) Rank-geofm: a ranking based geographical factorization method for point of interest recommendation. In ACM SIGIR Conference on Information Retrieval (SIGIR), pp. 433–442. Cited by: §2.
  • [14] D. Lian, C. Zhao, X. Xie, G. Sun, E. Chen, and Y. Rui (2014) GeoMF: joint geographical modeling and matrix factorization for point-of-interest recommendation. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 831–840. Cited by: §2.
  • [15] Q. Liu, S. Wu, L. Wang, and T. Tan (2016) Predicting the next location: a recurrent model with spatial and temporal contexts. In AAAI Conference on Artificial Intelligence (AAAI), pp. 194–200. Cited by: §2.
  • [16] Y. Liu, T. N. Pham, G. Cong, and Q. Yuan (2017) An experimental evaluation of point-of-interest recommendation in location-based social networks. Proceedings of the VLDB Endowment 10 (10), pp. 1010–1021. Cited by: §2.
  • [17] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. In International Conference on Learning Representations Workshops, Cited by: §2.
  • [18] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme (2010) Factorizing personalized markov chains for next-basket recommendation. In International Conference on World Wide Web (WWW), pp. 811–820. Cited by: §2, 1st item.
  • [19] P. Shaw, J. Uszkoreit, and A. Vaswani (2018) Self-attention with relative position representations. arXiv preprint arXiv:1803.02155. Cited by: §1, §4.2, §4.2.
  • [20] Statista (2018) Global travel and tourism industry - Statistics & Facts. Note: Cited by: §1.
  • [21] K. Sun, T. Qian, T. Chen, Y. Liang, Q. V. H. Nguyen, and H. Yin (2020) Where to Go Next: Modeling Long-and Short-Term User Preferences for Point-of-Interest Recommendation.. In AAAI Conference on Artificial Intelligence (AAAI), pp. To appear. Cited by: §2, 4th item, §5.1.
  • [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5998–6008. Cited by: §1, §2, §3.2.
  • [23] E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov (2019) Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. In Annual Meeting of the Association for Computational Linguistics, pp. 5797–5808. Cited by: §5.2.
  • [24] M. Ye, P. Yin, W. Lee, and D. Lee (2011) Exploiting geographical influence for collaborative point-of-interest recommendation. In ACM SIGIR Conference on Information Retrieval (SIGIR), pp. 325–334. Cited by: §2.
  • [25] Y. Yu and X. Chen (2015) A survey of point-of-interest recommendation in location-based social networks. In Workshops at the AAAI Conference on Artificial Intelligence, pp. 53–60. Cited by: §2.
  • [26] Q. Yuan, G. Cong, Z. Ma, A. Sun, and N. M. Thalmann (2013) Time-aware point-of-interest recommendation. In ACM SIGIR Conference on Information Retrieval (SIGIR), pp. 363–372. Cited by: §2, §5.1.
  • [27] J. Zhang, C. Chow, and Y. Li (2014) Lore: exploiting sequential influence for location recommendations. In ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 103–112. Cited by: 7th item.
  • [28] S. Zhang, Y. Tay, L. Yao, and A. Sun (2018) Next item recommendation with self-attention. arXiv preprint arXiv:1808.06414. Cited by: §2.
  • [29] P. Zhao, H. Zhu, Y. Liu, J. Xu, Z. Li, F. Zhuang, V. S. Sheng, and X. Zhou (2019) Where to go next: a spatio-temporal gated network for next poi recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5877–5884. Cited by: §2.
  • [30] S. Zhao, T. Zhao, H. Yang, M. R. Lyu, and I. King (2016) STELLAR: spatial-temporal latent ranking for successive point-of-interest recommendation. In AAAI conference on artificial intelligence (AAAI), pp. 315–321. Cited by: §2.