1 Introduction and Motivation
With advancements in GIS systems and subsequent availability of high resolution geo-spatial data, location analysis is increasingly becoming common in planning location-aware services, resources and infrastructure [17, 36]. An important problem in this area, termed as the facility location problem, deals with identifying the best locations to set up new services [15, 20]. This has been also studied as optimal location problems [16, 41, 18, 38, 13]. Majority of the existing works, however, assume that the users of the service are static [34, 35, 28]. However, services such as fuel stations, automobile service stations, ATMs, food joints, convenience stores, etc., are widely accessed by mobile users, besides the static users. For example, it is common for many users to make their daily purchases while returning from their offices. Therefore, the placement of such services should also take into account the mobility patterns or the trajectories of its users, rather than simply their static locations. Such services are referred to as trajectory-aware services . A user trajectory is a sequence of spatial points that lie on the path of a user while travelling. Note that trajectories strictly generalize the static users’ scenario because static users can always be modeled as trajectories with a single user location. Thus, trajectories capture user location patterns more effectively and realistically.
Since the budget for setting up a facility or a service location for a trajectory-aware service is typically restricted, the facilities must be placed in a manner that minimizes the inconvenience of its users. The inconvenience of a user on a trajectory (or from a single static location) is defined as the extra distance traveled by her with respect to her regular path (or location) to avail the service.
Next, we show that the inconvenience faced by static users is much higher than that of mobile users. Fig. 1 shows 2 users, and whose homes and offices are located at sites and , respectively. While mobile, they take trajectories and , respectively, as shown in the figure. The road segments are shown in black with the corresponding distances (assumed to be the same on both ways). Suppose a company wants to install a fuel station at a location that minimize the maximum inconvenience faced by any user. Consider the case when we simply factor in the static user locations, , ignoring the trajectories. In this scenario, the inconvenience of a user is the round-trip distance between its static locaton and the fuel station. For instance, if the fuel station is located at , the inconvenience of user is units. Similarly, if the facility is located at either or , then the inconvenience of one of the users is units. Hence the optimal choice to install the facility is or with maximum inconvenience of units.
Next, consider the case, when we factor in the user-trajectories, rather than their static locations. In this case, the candidate facility locations are . If the facility is located at , then the inconvenience of user is units, which is much less than units, as in the case of static locations. More importantly, we observe that if the facility is placed at , then no user needs to travel any extra distance. Hence the optimal facility location is with maximum inconvenience being . Hence, it is much more useful to consider user-trajectories than simply their static locations for placing facilities that minimize user-inconvenience. The existing studies on facility location problems did not address this issue.
For deeper understanding of the problem, we illustrate the next example, shown in Fig. 2, that will be used as the running example in the rest of the paper. There are 5 user trajectories and 4 candidate sites to host a trajectory-aware service. Based on the earlier discussion, for a user on trajectory , if the service is set up at location , her inconvenience is units. If another service location comes up at , her inconvenience reduces to units. Note that the inconvenience is defined using the nearest service location from any point on a user trajectory (or the user location).
Assume that the service-provider wants to set up facilities with the objective to minimize the maximum inconvenience faced by any user. If the facilities are hosted at , the inconvenience of trajectories are units each. For trajectory , the nearest facility is and, therefore, its inconvenience is units. Similarly, for trajectory , the nearest facility is and, thus, its inconvenience is units. Therefore, the maximum inconvenience among all the trajectories due to the selection is units. The maximum inconvenience for all such selections of sites is listed in Fig. 2 under the column . (We will shortly explain the meaning of .) The selection offers the optimal choice with a maximum inconvenience of units. Note that although the most number of trajectories pass through , it is not part of the optimal solution.
Next, suppose the objective is to minimize the average (or equivalently, total) inconvenience over all the user trajectories. The last column in the table lists the total inconvenience for all possible selections. The selection (or ) offers the optimal total inconvenience of units. While the optimal average inconvenience is units, the optimal maximum inconvenience is units.
In many situations, there may be an outlier trajectory due to which minimizing the maximum inconvenience across all the trajectories becomes harder. Thus, the service provider may choose a fraction of the total number of trajectories, over which the maximum inconvenience will be minimized. We call this theuser-fraction, .
Referring to Fig. 2, when , the goal is to minimize the maximum inconvenience over at least trajectories. The values of the maximum inconvenience for all the selections for are listed in the table. Note that the optimal selection for is (or ) which is different from the optimal selection for . Note also that the optimal inconvenience falls to units as compared to units for .
Based on the above discussion on minimizing the user inconvenience, in this paper, we introduce two optimal location problems over user trajectories, namely, MAX-TIPS and AVG-TIPS. Formally, given a set of trajectories , a set of candidate sites , an integer , and user-fraction , the MAX-TIPS problem seeks to report a set of locations that minimizes the maximum inconvenience over any fraction of the trajectories ; the AVG-TIPS problem aims to identify the locations that minimize the average inconvenience faced by any user.
In location analysis literature, there are several works that investigate the facility location problems with the goal of minimizing the maximum and average distance of any user to its nearest facility in the context of static users. These problems are referred to as the -center problem  and the -medoids problem  respectively. We study these problems in the context of user trajectories.
The major challenges in solving these problems are as follows:
Quality: Since both the problems are NP-hard (proved later), optimal algorithms are impractical. Thus, efficient heuristics that offer high quality solutions are of basic necessity.
Scalability: Any basic approach to solve the above problems would typically need to compute and store pairwise distances between the sets of candidate sites and trajectories. However, for any city-scale datasets, this time and storage requirement is prohibitively large. For example, the Beijing dataset  (used in this work) with more than 120,000 trajectories and 250,000 sites would require close to 250 GB of storage. This is unlikely to fit in the main memory and, therefore, multiple expensive random disk seeks are required at run-time. Thus, it is necessary to design solutions that are practical and scalable.
In this paper, we propose efficient heuristics for both MAX-TIPS and AVG-TIPS that overcome the above challenges.
To summarize, our major contributions are as follows:
We introduce two optimal location problems over user trajectories, namely, MAX-TIPS and AVG-TIPS (Sec. 3).
We explain how the proposed solutions can be adapted to situations when there are already some existing services (Sec. 6).
Empirical evaluation on urban scale datasets show that our heuristics are quite effective in terms of quality and efficient in terms of space and running time (Sec. 7).
2 Related Work
The related work is divided into the following main classes.
Clustering Problems: The following clustering problems are related to the current work.
-center Problem: Given a set of points, determine a set of size , referred to as centers, such that the maximum distance of any point in to its nearest center is minimized. This problem was introduced and proved to be NP-hard in . They also proposed a greedy heuristic that offers a factor of approximation. Our proposed MAX-TIPS problem is a generalization of the -center problem as trajectories generalize static users. We have extended the greedy algorithm in  to design a heuristic (with bounded quality guarantees) to solve MAX-TIPS (Sec. 4).
-medoids Problem: Given a set of points, determine a set of size , referred to as medoids, such that the sum of distances of each point in to its nearest medoid is minimized . This problem is also referred to as the -median problem. A polynomial time approximation scheme was proposed in  for two-dimensional Euclidean spaces. The first constant factor algorithm for the -median problem in general metric space, with an approximation ratio of was proposed in . Later,  improved this factor to . Subsequently,  designed a -approximation algorithm that runs in time. Korupolu et al.  proposed a local search based approximation scheme by allowing a constant factor blow-up in . Kolliopoulos et al.  proposed a nearly-linear time algorithm for Euclidean spaces. Arya et al.  improved the approximation bound of local search based algorithm for the metric -medians problem to where is the number of medians swapped simultaneously. Three popular local search based techniques for the -medoids problem are PAM , CLARA  and CLARANS . These schemes are detailed in Sec. 5.2. Most of these works are, however, theoretical and, therefore, fail to scale on large datasets. Moreover, all these algorithms require the distance matrix between each pair of points. This quadratic space and time requirement renders them infeasible for real datasets. Our proposed AVG-TIPS problem generalizes the -medoids problem as explained in Sec. 3. The HCC heuristic to solve AVG-TIPS (Sec. 5.2) builds on the ideas of CLARA and CLARANS. More importantly, to address the challenge of high space overhead, we design sampling techniques to reduce the sizes of the sets and .
Optimal Location Queries: Optimal Location (OL) queries have attracted considerable attention [16, 18, 38, 13, 25, 27]. An OL query has three inputs: a set of existing facilities, a set of users, and a set of candidate sites. The objective is to determine a candidate site to set up a new facility that optimizes an objective function based on the distances between the facilities and the users. Comparing OL query with TIPS, we note that: (a) while fixed user locations are considered for OL queries, TIPS uses trajectories of users, (b) OL queries report only a single optimal location while TIPS reports locations, and (c) unlike OL queries that are solvable in polynomial time, TIPS is NP-hard (it is polynomially solvable only for ). Majority of the earlier works in this area, assume that the underlying space is Euclidean, and only recently has the problem been studied for road networks [38, 13]. Recently, Li et al.  studied OL queries on trajectories over a road network. Two algorithms were proposed to compute the optimal road segment to host a new service. Their work has quite a few limitations and differences when compared with ours: (a) Since a single optimal road segment is reported, their problem is polynomially solvable. (b) Their work identifies the optimal road segment rather than the optimal site. (c) There is no analysis on the quality of the reported optimal road segment, either theoretically or empirically. (d) It is not shown how the reported road segment performs for other established metrics, such as number of new users covered, distance traveled by the users to avail the service, etc. Mustafizur et al.  introduced optimal accessible location query which returns a single optimal location that is best accessible from a given set of trajectories. A tree-based index structure is proposed to answer this query. This problem is similar to TIPS; the key distinguishing feature being that it is polynomially solvable, while TIPS is NP-hard.
Facility Location Queries: Majority of the early works in the area of facility location queries assume that the users are static [15, 20]. Later works did consider human mobility, though [21, 8, 6, 3, 7, 5, 4]. These works, however, assume a flow model to characterize mobility instead of using real trajectories. A fairly comprehensive literature survey is available in . The proposed models are mostly theoretical and are not scalable for real city-scale road networks. In particular, all these approaches require extensive distance computations which leads to large memory overhead and are, hence, infeasible . Hodgson  posed the first facility location problem that minimizes the average user inconvenience and showed that it is a generalization of -medians problem. In , it was shown that the problem does not admit constant factor approximation unless . However, no approximation algorithm was proposed. In contrast, we design and evaluate two heuristics for this problem. Very recently, [30, 26] revisited these problems, and presented a few efficient scalable techniques to solve them. However, the goal therein is to maximize user-coverage, and not minimize user-inconvenience. In fact,  assumed that users of a service would never deviate from their regular paths, which is too restrictive and unrealistic.
3 The TIPS Problem
Consider a road network over a geographical area where denotes the set of road intersections, and denotes the road segments between two adjacent road intersections. To model the direction of the underlying traffic that passes over a road segment, we assume that the edges are directed. Assume a set of candidate sites where a certain service or facility can be set up. The set can be in addition to existing service locations. Without loss of generality, we can augment the vertices to include all the sites. Thus, .
The set of trajectories over is denoted by where each trajectory , is a sequence of locations that the user passes through. Usually any service or facility is used by a mix of static users and mobile users. As the above definition of trajectory allows both single and multiple locations of a single user, it captures both static and mobile users, simultaneously. The trajectories are usually recorded as GPS traces and may contain arbitrary spatial points on the road network. For our purpose, each trajectory is map-matched  to form a sequence of road intersections through which it passes.
The modeling of user trajectories is a generalization over the set of static users. A static user can be considered as a trajectory with only a single node in it, which is the static location of the user.
We also assume that each trajectory belongs to a separate user. However, the framework can be generalized to multiple trajectories belonging to a single user. The union of road intersections that the user passes through in any of her trajectories will be treated as the nodes in her trajectory. In effect, this will minimize her inconvenience from any of her trajectories.
Suppose denotes the shortest road network distance along a directed path from node to , and denotes the shortest distance of a round-trip starting at node , visiting , and returning to , i.e., . In general, , but .
The extra distance travelled by any user on trajectory to avail a service at node , denoted by , is defined as follows: , i.e., it deviates from its trajectory at , reach and then return to another point such that the deviation is minimum.
The round-trip distance between two trajectories and is defined as the minimum pairwise distance between its sites: . Henceforth, the term distance implies round-trip distance unless otherwise mentioned.
It is inconvenient for a user to avail a service if the nearest service location is far off from her trajectory. We define this inconvenience as follows. Given a set of service locations , the inconvenience of a user on trajectory , denoted by , is the extra distance travelled to avail a service at the nearest service location in . Formally, .
Using the above setting, we introduce two novel problems, namely MAX-TIPS and AVG-TIPS, described as follows.
The MAX-TIPS problem aims to report a set of service locations that minimizes the maximum inconvenience over a given user-fraction of the set of trajectories. It is formally stated next.
Problem 1 (Max-Tips)
Given a set of trajectories , a set of candidate sites that can host the services, a positive integer , and a user-fraction (), the MAX-TIPS problem seeks to report a set , that minimizes the maximum inconvenience over any set such that , i.e., it minimizes , where .
Intuitively, as the user-fraction increases, the optimal value of increases because of the need to serve more number of users with the same number of facilities. When , the goal is to serve all the trajectories such that the maximum inconvenience faced by any trajectory is minimized.
The AVG-TIPS problem seeks to identify service locations that minimizes the expected or average inconvenience across all the trajectories. Since the number of trajectories is fixed, minimizing the average inconvenience is equivalent to minimizing the total inconvenience over all the trajectories. Formally, it is stated as follows.
Problem 2 (Avg-Tips)
Given a set of trajectories , a set of candidate sites that can host the services, and a positive integer , the AVG-TIPS problem seeks to report a set , that minimizes the total inconvenience over , i.e., it minimizes , where .
We also consider both the problems, MAX-TIPS and AVG-TIPS, under the additional constraint of existing service facilities in Sec. 6.
We next show that both the TIPS problems are NP-hard.
Theorem 1 (NP-hardness of TIPS)
MAX-TIPS and AVG-TIPS are NP-hard problems.
Since the -center problem is NP-hard  and it reduces to the MAX-TIPS problem with each trajectory being a single user location and , MAX-TIPS is NP-hard as well.
Since the -medoids problem is NP-hard  and it reduces to the AVG-TIPS problem with each trajectory being a single user-location, AVG-TIPS is also NP-hard.
4 Algorithms for MAX-TIPS
Here we present an optimal algorithm and two heuristics to solve the MAX-TIPS problem.
4.1 Optimal Algorithm
We present an optimal solution to the MAX-TIPS problem in the form of an
integer linear program
integer linear program(ILP):
The boolean variable if and only if the site is selected. The boolean variable if and only if the site is selected, and trajectory is served by . The variable captures the maximum inconvenience offered to any trajectory served by the site (Ineq. (3)). The constraint in Ineq (2), along with the objective function in Eq. (1) ensure that the maximum inconvenience is minimized over the set of selected sites. The constraint in Ineq. (4) ensures that at most sites are selected in the answer set. Since monotonically decreases with , therefore, will attain its optimal value only when . The constraint in Ineq. (5) guarantees that each trajectory is served by at most one service location. The constraint in Ineq. (6) ensures that at least trajectories are served. The constraint in Ineq. (7) guarantees that if , then , i.e., no trajectory is served by the site as it is not selected.
The optimal algorithm is impractical except for extremely small datasets. Therefore, we next present a couple of polynomial-time heuristics to solve MAX-TIPS.
4.2 MIF Algorithm
MAX-TIPS problem being a generalization of the -center problem (Sec. 2), the most natural approach to solve MAX-TIPS is to extend the greedy heuristic for the -center problem . We refer to this adaptation as MIF (Most-Inconvenient-First). It iteratively selects a site and the corresponding most inconvenient trajectory in each of the iterations. The details are as follows.
Initially, we set . Let be a set of representative trajectories, that is initialized with some randomly chosen trajectory . Next, the algorithm runs in iterations. Suppose was the last representative trajectory added to the set . Then we choose a candidate site that is nearest to , and add it to the set . If there are multiple such candidate sites, then the tie is broken arbitrarily. The trajectories in are kept sorted based on their distance to the nearest facility in . We choose a trajectory whose rank is the smallest but greater than or equal to in this ordering, and add it to . The above process is repeated until .
Let us evaluate this algorithm on the example in Fig. 2 with and . Suppose the MIF algorithm selects as the first representative trajectory. Since is the nearest site to , is added to . On sorting the trajectories in w.r.t. their distances from , we get the following ordering: . Since is the farthest trajectory from with , it is chosen as the next representative trajectory. Subsequently, the site is added to . The maximum inconvenience offered by this solution is which is also the optimal solution.
Now, consider the same example with and . Assuming to be the first representative trajectory, in this case, the solution is with a sub-optimal maximum inconvenience of units.
4.2.1 Analysis of MIF:
For the purpose of analysis of MIF, we assume that there exists no site in through which no trajectory passes. This assumption is reasonable because for any site, if no user passes through it, it is likely that the site is inaccessible, and hence it should not be a candidate site for facility location.
The analysis of MIF algorithm involves the following trajectory clustering problem, MAX-TRAJCLUS:
Given a set of trajectories , and an integer , find a set of trajectories, such that the distance between any trajectory and its nearest trajectory in is minimized, i.e.,
Let denote a representative trajectory. Let denote the set of trajectories for which is the nearest representative trajectory in . The largest distance between trajectory and any trajectory in is referred to as the radius of cluster and denoted by . The goal of the MAX-TRAJCLUS problem is to minimize the largest radius over the trajectory clusters.
Being a generalization of the -center problem, this problem is NP-hard. The greedy heuristic by  guarantees a factor of approximation for the -center problem on metric spaces. In order to apply this greedy heuristic to the MAX-TRAJCLUS problem, we next verify that the trajectory distances follow metric properties.
The distance between two trajectories and is the minimum pairwise distance between the constituent nodes. Thus, and . Therefore, the set forms a metric space.
The extended algorithm for MAX-TRAJCLUS is described next.
Initially, . We begin by choosing a random trajectory and adding it to the set . In each of the successive iterations, we select a trajectory that is farthest to any of the trajectories in , and add it to the set . At the end of iterations, we get the desired solution .
The next result states its approximation guarantee.
The Farthest-First algorithm offers a factor of 2 approximation for the MAX-TRAJCLUS problem.
Since the set forms a metric space and the Farthest-First algorithm extends the heuristic in  that offers an approximation bound of , the proof follows.
The next result relates MAX-TRAJCLUS and MAX-TIPS.
Let be the optimal maximum inconvenience for the MAX-TIPS problem, be the optimal radius for the MAX-TRAJCLUS problem, and be the largest radius of any cluster returned by the Farthest-First algorithm. Then the following inequality holds: .
The inequality follows directly from Lemma 1.
We prove the left hand side inequality as follows. Let be an optimal solution to the MAX-TIPS problem with maximum inconvenience value . Let us assume that . Let denote the set of trajectories for which is the nearest facility. Thus, for any , and for any trajectory , . Following the assumption that there exists no site in through which no trajectory passes, there must exist a trajectory that passes through . Let us define a representative set of trajectories where . In case there are duplicate trajectories in , i.e., , for some , we swap with any and set . We note that is a feasible solution to the MAX-TRAJCLUS problem. Since there exists a site , such that , we get, . However, since , is a solution to the MAX-TRAJCLUS problem with maximum radius, lower than , which contradicts the optimality of . Thus, .
We next analyze the quality and complexity bounds of MIF.
We observe that for any set of sites , and two sets of trajectories, such that , the maximum inconvenience faced by any trajectory in due to the set is at most the maximum inconvenience faced by any trajectory in due to the set . Therefore, any approximation bound that holds for the MAX-TIPS problem with user-fraction will also hold for . Hence, we next discuss the approximation results only for .
The MIF algorithm for the MAX-TIPS problem offers an approximation bound of for , and for , where is the distance of the longest trajectory and is the optimal cluster radius defined in the MAX-TRAJCLUS problem.
Let be the representative trajectories chosen by the MIF algorithm. Let the maximum radius, . Let be the reported answer set where is the nearest candidate site to the trajectory for .
Since , it follows that each node in is a candidate site. Since is the nearest candidate site to , it must be that .
Now, let us consider the case . Let be the maximum inconvenience due to the selection of the site . Now suppose is the site reported by an optimal algorithm with optimal maximum inconvenience value . Hence, . Therefore, there exists a site , such that for any trajectory , because there exists a path from via to trajectory of distance at most . Thus, the maximum inconvenience due to the site is at most . Hence the approximation bound is .
Now, consider the case . Let be the set of sites returned by the MIF algorithm, with the maximum inconvenience being . Consider the first representative trajectories, , chosen by the MIF algorithm. Since , we get . Now consider an optimal solution with maximum inconvenience . Let denote the set of trajectories served by . Using the pigeon-hole principle, we conclude that at least two trajectories must belong to one of the clusters, say . Since the maximum inconvenience of the solution is , therefore, . Moreover, since , we get . From this, we infer that there must be a site such that and, therefore, . Since the distance of any trajectory is at most , . From this, we conclude that . Using the result of Lemma 2, the approximation bound is .
The time and space complexity of MIF algorithm is and respectively, where is the total number of nodes in the road network, is the maximum number of nodes in any trajectory, and is the total number of trajectories in .
We assume that the number of road segments (edges) is as the road networks are roughly planar. Thus, from any node , the distances to all other nodes in the network can be computed in time, using the Dijkstra’s shortest path algorithm .
We analyze the computation cost of any iteration as follows. Suppose is the last trajectory added to the set in any iteration. Then, for each node (that passes through), the distances are computed to all other nodes in . If the maximum number of nodes in any trajectory is , i.e., , then the above distance computation step takes time. Thus, identifying the nearest candidate site to , say , requires time.
Following this, is added to . Then the distances are computed between and all other trajectories in . To do this, we first compute distances between and all nodes in , which requires time. Assume these distances are indexed on site-ids. Referring to the the definition of distance between a trajectory and site , , in Sec. 3, the distance between and any trajectory can be computed in time, as there at most distance look-ups . Therefore, the distance between and all trajectories in can becomputed in time. If the distance of any trajectory to its nearest facility in is more than that with , then this value is updated. This updation step takes time over all the trajectories. Sorting these updated distances (of each trajectory to its nearest facility in ) takes time. Choosing the next representative trajectory takes time. Summing up all these costs, over each of the total iterations, the total time complexity is .
Next, we analyze the space complexity. The road network and candidate sites can be stored in space. Storing the trajectories in require space. During each iteration of the algorithm, the node-to-node distances are computed, for each node of the previously chosen representative trajectory. Storing these distance values require at most space. As these distance values are no longer used in subsequent iterations, they are discarded at the end of each iteration. Thus, the maintenance overhead of the node-to-node distances over the iterations is . Storing the distances of each trajectory to its nearest facility in require additional space. Storing the sets and require space. Hence, the total space complexity is .
Although MIF offers bounded quality guarantees, it is quite slow. This is because it does not use any pre-computed distances. The next scheme, however, leverages on pre-computed distances, and offers significantly faster response times.
4.3 Algorithm using NetClus
We observe that MIF takes significant computation time for calculating node-to-node distances, and node-to-trajectory distances. This is because we cannot afford to pre-compute and store all pairs node-to-trajectory distances, which is overwhelmingly large. However, an indexing scheme can be used that pre-computes and stores only a small set of node-to trajectory distances.
In this section, we propose a heuristic that uses the NetClus indexing framework
 that was originally designed to solve the following
TOPS problem :
Given a set of trajectories , a set of candidate sites that can host the services, the TOPS problem with query parameters seeks to report the best sites, , that cover maximum number of trajectories. It is assumed that a site covers a trajectory , if and only if , where is referred to as the coverage threshold.
NetClus performs multi-resolution clustering of the nodes in the road network, . NetClus maintains instances of index structures of varying cluster radii. A particular index instance is useful for a particular range of query coverage thresholds. From one instance to the next, the radius is increased by a factor of for some . Assume that the normal range of query coverage threshold is , . Then the total number of index instances is . For each index instance, NetClus maps the trajectories to the sequence of clusters that they pass through.
Intuitively, as the coverage threshold increases, the number of trajectories covered by any set of candidate sites , also increases. This was validated empirically in .
Exploiting the general monotonic behavior of the trajectory coverage with respect to the coverage threshold , we propose the following heuristic to answer the MAX-TIPS problem. Our goal is to identify the smallest value of such that there exists a set of size that covers at least number of trajectories in . To guess this desired value of , we perform a binary search over the range of , i.e., .
The algorithm proceeds in iterations. In each iteration, it computes the value of the coverage threshold, where and denote the current ranges. Initially, and . Next, we compute the TOPS query with the parameters . If the trajectory coverage value (i.e., the number of trajectories covered by the solution set ) is lower than , then the value of is set to , else the value of is set to . Consequently, in the next iteration, we compute the TOPS query with the revised value of . Since this process can continue forever, we stop it when the difference between and falls below the desired precision. Suppose the final iteration executed the TOPS query with parameters and returned the set . Then the answer to the MAX-TIPS problem is also with maximum inconvenience as . We call this algorithm simply NetClus.
Since the monotonicity of the trajectory coverage with respect to the coverage threshold is not theoretically guaranteed, the quality of this heuristic may not be bounded. Empirically, however, it performs the best in terms of both running time and quality (Sec. 7).
Assuming the time required to answer a TOPS query by NetClus to be , since TOPS queries are executed, the total time is .
While the NetClus approach is index-based, MIF is non-index based. In case of NetClus, the distance of the trajectories and sites to their respective cluster centres is pre-computed, thereby making it efficient. For MIF, all the necessary distance computations are performed online and, hence, it is slower.
5 Algorithms for AVG-TIPS
This section presents an optimal algorithm and two heuristics for the AVG-TIPS problem.
5.1 Optimal Algorithm
The following integer linear program solves AVG-TIPS:
The boolean variable if and only if the site is selected. The boolean variable if and only if the site is selected, and it is the nearest facility for trajectory . The objective function in Eq. (10) ensures that the total (equivalently, mean) inconvenience is minimized over the set of selected sites. The constraint in Ineq. (11) ensures that at most sites are selected in the answer set. The constraint in Ineq. (12) guarantees that each trajectory is served by exactly one service location. The constraint in Ineq. (13) guarantees that if , then , i.e., no trajectory is served by the site as it is not selected.
This optimal algorithm is, however, impractical except for extremely small datasets. Therefore, we next present the following heuristics for the AVG-TIPS problem.
5.2 HCC Algorithm
Recall that the AVG-TIPS problem generalizes the -medoids problem (Sec. 2). Our first heuristic, HCC, therefore, builds on the three popular approaches for the -medoids problem, PAM , CLARA , and CLARANS . We first describe these approaches, and then discuss the proposed HCC algorithm.
PAM: The basic idea of PAM is as follows. Given a set of objects, it starts by choosing random objects, called as medoids. Each non-medoid object is assigned to its nearest medoid. The cost of a particular clustering is the sum of the distances of each non-medoid object to its nearest medoid. The PAM algorithm proceeds in iterations. In each iteration, it swaps one of the existing medoids with a non-medoid object such that the cost of the resulting clustering decreases. To realize the swap with minimal cost, it computes the cost of each possible swap which are as many as . This step is computationally very expensive. PAM terminates when it reaches a local minima, i.e., there is no possible swap resulting in a solution with a lower cost.
CLARA: For large datasets, the PAM algorithm is infeasible owing to its high running time. To address this limitation, the CLARA algorithm  was proposed that relies on sampling. The idea is to create random samples of size much smaller than , and execute the PAM algorithm on each of these samples, and return the clustering that offers the minimal cost over all the samples.
CLARANS: To improve the quality of clustering of CLARA, another algorithm called CLARANS was proposed in . The basic idea of CLARANS is to avoid scanning all possible swaps in each iteration of PAM. Essentially, a small fraction of the total possible swaps are scanned and the swap that offers the minimal clustering cost is executed. To increase the robustness, this scheme is repeated a few number of times, and finally the clustering with the minimal cost is reported.
Inspired by the above approaches, we propose a new algorithm, HCC (Hybrid-CLARA-CLARANS), for the AVG-TIPS problem that combines the ideas of sampling (from CLARA) and scanning a small number of swaps (from CLARANS). The basic idea is to consider a few samples of sufficiently small size, and for each sample, to scan a sufficiently small number of swaps. The details are described next.
First, we describe the basic algorithm, and later, discuss how to make it scalable. Given a set of trajectories, and a set of candidate sites, initially, we compute the distances between each pair of trajectory and site. Then we execute the following steps.
Choose a set of random sites in , referred to as medoids. For each trajectory , we maintain a map that stores the nearest facility for in , along with its distance. The total inconvenience of the set is , where . The value can be, thus, computed using the maps.
Subsequently, the algorithm proceeds in iterations. In each iteration, it scans a fraction of the total possible swaps between a medoid and a non-medoid site in , as discussed above, and executes the swap that results in the lowest total inconvenience . Out of the total number of possible swaps, , HCC scans only a fraction, referred to as the swap-fraction, and denoted by . This process terminates when there is no possible swap that leads to a solution with a lower total inconvenience. The algorithm makes trials of the above steps to account for the randomly chosen initial set of medoids. Finally, it returns the set with the lowest total inconvenience achieved over the trials.
Let us see the working of this algorithm for the AVG-TIPS problem on the example shown in Fig. 2 with . We will consider a single trial with swap-fraction . Let us start with the following set of initial medoids . We note that . Since , we examine all possible swaps resulting in the following sets: and . Since is minimal, it executes this swap. Since there is no further possible swap that results in lower total inconvenience, the algorithm terminates, reporting the set as the answer. Incidentally, this happens to be the optimal solution.
This heuristic requires the distance values between each pair of trajectory and site . Computing and storing these distances for large datasets may be infeasible when or is large. Thus, under such circumstances, we propose to sample the set of candidate sites and trajectories, using the following schemes.
5.2.2 Site Sampling:
The sampling technique is based on a simple clustering idea that clusters the set of nodes in the road network and samples at most a single candidate site from each cluster. The details are as follows. A random node (that is not yet clustered) is chosen as the cluster-center of a cluster that consists of all nodes that are not yet clustered and are within some distance threshold from . This process is repeated until every node in is clustered. Finally, a sample is created by selecting at most a single site from each cluster that is closest to the cluster-center, which we refer to as its cluster-representative. If a cluster has no candidate site, then it has no cluster-representative either. As the chosen sites belong to different clusters, they are not expected to be close to each other. Thus, the sampled sites are typically nicely distributed over the road network.
5.2.3 Trajectory Sampling:
Consider any trajectory . For each node , let be the cluster-center of the cluster that contains . Each trajectory is mapped to a set of cluster-centers . This transforms the trajectories into a coarser representation as nodes that are close to each other in the trajectory are likely to fall in the same cluster. This transformed set of trajectories is denoted by .
Following this, the set of trajectories is clustered based on their Jaccard similarity measure, as proposed in . The high level overview of this method is as follows. Suppose, we are required to create a trajectory sample of size . Initially, each trajectory is a cluster by itself with being the cluster-representative. For any two trajectories, and , their Jaccard similarity is given by:
The clustering follows an iterative algorithm where in each iteration it fuses a pair of clusters with cluster-representatives and that have the maximum Jaccard similarity. After fusing, either of the two trajectories, or is deemed as the cluster-representative of the new cluster. The algorithm continues fusing a pair of clusters in each iteration in this manner, until there are exactly clusters. The cluster-representatives of these clusters are mapped back to their original representation as sequence of nodes, and reported as the desired trajectory sample.
Unfortunately, HCC does not have any approximation bound, since the original algorithms, PAM, CLARA and CLARANS do not have any quality guarantees either.
5.3 GREAT Algorithm
We next propose a greedy heuristic called GREAT (GREedy Avg-Tips). Unlike HCC, this algorithm offers bounded quality guarantees.
It is an iterative algorithm that works on the principle of maximizing the marginal gain in each successive iteration. It starts with an empty set . In each iteration , it selects a site such that the total inconvenience of the resulting set, , is minimized. The site is added to the set . The algorithm terminates after iterations.
Similar to HCC, the GREAT algorithm also assumes that the distances between each pair of trajectory and site are pre-computed and available. Also, for each trajectory , we maintain a map that stores the nearest facility for in , along with its distance. Whenever a new site is chosen to be added into , we check whether it would be the nearest facility for , and update it accordingly.
Let us see the working of this algorithm for the AVG-TIPS problem on the example shown in Fig. 2 with . In the first iteration, is selected as it offers the least total inconvenience of units. In the next iteration, (or ) is selected, resulting in optimal total inconvenience of units.
5.3.1 Analysis of GREAT
We, next, state an important property of AVG-TIPS problem, which in turn, helps us to bound the quality of GREAT.
A function defined on any subset of a set is sub-modular if for any pair of subsets , . A function is super-modular if its negative is sub-modular. The following result shows that the function is super-modular.
For any set of candidate sites , the total inconvenience is a non-increasing super-modular function.
Consider any pair of sets such that .
First, we show that is a non-increasing function. Since is a minimum function over the set , it follows that . Thus, . Hence, is a non-increasing function.
To show that is super-modular, following , it is sufficient to show that for any site , the following holds:
Since , it is, therefore, enough to prove that for any trajectory ,
Suppose the site is the nearest facility to
trajectory in the set of sites . There can be
(a) : . Further, since (using the non-increasing property), Ineq. (18) follows.
(b) : Here, . From the non-increasing property of (using the non-increasing property), . Thus, Ineq. (18) follows.
The next result bounds the quality of GREAT.
Let denotes an optimal solution to the AVG-TIPS problem. Let be the solution reported by the GREAT algorithm. Then,
where refers to the initial total inconvenience, i.e., the total inconvenience faced by the trajectories in due to the existing facilities. We assume to be a non-negative constant.
It is easy to see that GREAT returns the optimal solution for as it performs an exhaustive search over all the candidate sites in .
Now, suppose . For any given set of candidate sites , consider a function . Since is a non-increasing super-modular function ( Th. 4), it follows that is a non-decreasing sub-modular function. Further, . It is known that the greedy heuristic offers an approximation bound of for any non-decreasing sub-modular function with . Since is a set that minimizes , from the definition of , it follows that must maximize as is a constant. Let be the set of sites reported by GREAT. Since GREAT essentially mimics the greedy heuristic for non-decreasing sub-modular functions , the same approximation bound is applicable for . Therefore, . From the definition of , we can write .
The next result bounds the complexity of GREAT.
The space and time complexity of the GREAT algorithm are and respectively, where and .
Storing the distance values for each pair of site and trajectory requires space. Further, storing the maps require space. Since , the overall space complexity is .
We next analyze the time complexity. In each iteration of the GREAT algorithm, we compute the total inconvenience for each site . For each site , this computation requires time over all the trajectories. Once a site is added to , the maps are updated in time. Thus, each iteration requires time. The overall time complexity of GREAT, running over iterations is, therefore, .
6 Handling Existing Services
Optimal location queries usually factor in existing service locations while planning for new ones. We next show how the heuristics presented in Sec. 4 and Sec. 5 can be easily adapted to handle an existing set of services, denoted by . Consider MAX-TIPS problem with input parameters .
MIF: The algorithm begins with the following sets initialized as follows:, and . The maps are updated to reflect the nearest facility in