1. Introduction
Today most people spend a significant portion of their time in indoor spaces such as subway systems, office buildings, shopping malls, convention centers, and many other structures. In addition, indoor spaces are becoming increasingly large and complex. For instance, the New York City Subway has 472 stations and contains 245 miles (394 km) of routes (NYCS). In 2017, the subway system delivered over 1.73 billion rides, averaging approximately 5.6 million rides on weekdays (MTA). Therefore, users have a growing demand for launching spatial queries to find friends or Points of Interest (POIs) (wang2017efficient; wang2019scalable) in indoor places. Moreover, users are usually moving around when issuing queries. Thus we need to properly support continuous indoor spatial queries, e.g., reporting nearby friends in a mall while a user is shopping. However, existing spatial query evaluation techniques for outdoor environments (based on either Euclidean distance or network distance) (conf/sigmod/RoussopoulosKV95; journals/tods/HjaltasonS99; conf/vldb/PapadiasZMT03; conf/sigmod/SametSA08; journals/tkde/LeeLZT12) cannot be applied in indoor spaces because these techniques assume that user locations can be acquired from GPS signals or cellular positioning, and this assumption does not hold in covered indoor spaces. Furthermore, indoor spaces are usually modeled differently from outdoor spaces. In indoor environments, user movements are enabled or constrained by topologies such as doors, walls, and hallways.
Radio Frequency Identification (RFID) technologies have become increasingly popular over the last decade, with applications in areas such as supply chain management (journals/cacm/SantosS08), health care (amendola2014rfid), and transportation (jedermann2009spatial). In indoor environments, RFID is mainly employed to support track and trace applications. Generally, RFID readers are deployed in critical locations while objects carry RFID tags. When a tag passes the detection range of a reader, the reader recognizes the presence of the tag and generates a record in the back-end database. However, the raw data collected by RFID readers is inherently unreliable (Sullivan0504; conf/vldb/JefferyGF06), with false negatives caused by RF interference, limited detection range, tag orientation, and other environmental phenomena (conf/mobisys/WelbourneKSBB09). In addition, readers cannot cover all areas of interest because of their high cost or privacy concerns (journals/internet/WelbourneBCGRRBB09). Therefore, we cannot directly utilize RFID raw data to evaluate commonly used spatial query types (e.g., range and kNN) with high accuracy in indoor environments. Several other types of wireless communication technologies, such as WiFi and Bluetooth, have also been employed for indoor positioning (conf/icdcsw/AnastasiBCDGM03; conf/gis/BellJK10). However, each of these technologies has considerable positioning uncertainty. Furthermore, WiFi and Bluetooth are mainly utilized for positioning individual users rather than supporting a centralized indoor location tracking system, and it is too expensive to attach WiFi or Bluetooth transmitters ($5 per device) to monitored objects. Therefore, we focus on RFID in this research.
In this paper, we consider the setting of an indoor environment where a number of RFID readers are deployed in hallways. Each user carries an RFID tag, which can be identified by a reader when the user is within the reader's detection range. Given the history of RFID raw readings from all the readers, we design a system that can efficiently answer indoor spatial queries. We mainly focus on four types of spatial queries: range queries, kNN queries, continuous range queries, and continuous kNN queries.
Bayesian filtering techniques (Arulampalama:tutorial; Maybeck79) can be employed to estimate the state of a system that changes over time using a sequence of noisy measurements made on the system. In this paper we propose Bayesian filtering-based location inference methods, the indoor walking graph model, and the anchor point indexing model for inferring object locations from noisy RFID raw data. On top of the location inference, indoor spatial queries can be evaluated efficiently and with high accuracy by our algorithms. The contributions of this study are as follows:

We design the Bayesian filtering-based location inference methods as the basis for evaluating indoor spatial queries.

We propose two novel models, the indoor walking graph model and the anchor point indexing model, and an RFID-based system for tracking object locations in indoor environments.

Indoor spatial query evaluation algorithms for range, kNN, continuous range, and continuous kNN queries are developed based on the proposed system.

We demonstrate the efficiency and effectiveness of our approach by comparing the performance of our system with the symbolic model-based solution (Yang:indoorknn) through experiments based on real-world data and synthetic data.
The rest of this paper is organized as follows. In Section 2, we survey previous works on indoor object monitoring and spatial queries. Background knowledge of particle filters and the Kalman filter is provided in Section 3. In Section 4, we introduce our Bayesian filtering-based indoor spatial query evaluation system. The experimental validation of our design is presented in Section 5. Section 6 concludes this paper with a discussion of future work.

2. Related Work
In this section, we review previous work related to indoor spatial queries and RFID data cleansing.
2.1. Indoor Spatial Queries
Outdoor spatial queries, e.g., range and kNN queries, have been extensively studied both for Euclidean space (conf/sigmod/RoussopoulosKV95; journals/tods/HjaltasonS99) and road networks (conf/vldb/PapadiasZMT03; conf/sigmod/SametSA08; journals/tkde/LeeLZT12). However, due to the inherent differences in spatial characteristics, indoor spatial queries require different models and cannot directly reuse mature techniques from their outdoor counterparts. Therefore, indoor spatial queries are drawing more and more research attention from industry and academia. For answering continuous range queries in indoor environments, Jensen et al. (Jensen:2009:GMB:1590953.1591000) proposed using the positioning device deployment graph to represent the connectivity of rooms and hallways from the perspective of positioning devices. Basically, entities that can be accessed without being detected by any positioning device are represented by one cell in the graph, and edges connecting two cells represent the positioning device(s) which separate them. Based on the graph, initial query results can be easily processed with the help of an indexing scheme also proposed by the authors (Yang:indoorrange). Query results are returned in two forms: certain results and uncertain results. To reduce the workload of maintaining and updating the query results, Yang et al. further proposed the concept of critical devices: only the ENTER and LEAVE observations of a query's critical devices can affect its results. However, the probability model utilized in Yang's work is very simple: a moving object is uniformly distributed over all the reachable locations constrained by its maximum speed in a given indoor space. This simple probability model is incapable of taking advantage of the moving object's previous movement patterns, such as direction and speed, which would make the location prediction more reasonable and precise. In addition, Yang et al. (Yang:indoorknn) also addressed the problem of kNN queries over moving objects in indoor spaces. Unlike an earlier work (DBLP:dblp_conf/mdm/LiL08), which defines nearest neighbors by the minimal number of doors to pass through, they proposed a novel distance metric, minimum indoor walking distance, as the underlying metric for indoor kNN queries. Moreover, Yang et al. provided the formal definition of the Indoor Probabilistic Threshold kNN Query (PTkNN) as finding a result set of objects whose probability of being in the kNN result is higher than a given threshold probability. Indoor distance-based pruning and probability threshold-based pruning are proposed in Yang's work to speed up PTkNN query processing. Similarly, the paper employs the same simple probabilistic model as in (Yang:indoorrange) and, therefore, has the same deficiencies in probability evaluation. An adaptive cleansing (AC) probabilistic model (zhao2012model) is proposed to achieve object tracking in open spaces, and an RFID data cleaning method that optimizes the overall accuracy and cost is proposed in (gonzalez2007cost). However, (zhao2012model) and (gonzalez2007cost) differ from our event-driven setting because they do not exploit indoor topology. A spatial cleansing model (baba2013spatiotemporal) that utilizes a distance-aware graph to reduce spatial ambiguity in indoor spaces is proposed for RFID data cleansing; however, their method focuses on predicting the actual location among alternative possibilities rather than answering spatial queries. Offline cleaning with subsequence data (fazzinga2014offline) has also been considered, but that method is applicable only when data are stabilized and used for analysis tasks. The main contribution of (fazzinga2014cleaning) is a framework which cleans RFID data by utilizing reachability and travel time limits. Both (fazzinga2014offline) and (fazzinga2014cleaning) suffer from these constraints and cannot be applied to online spatial queries. To employ different methods in different user scenarios, (msnindoor) uses a pre-trained neural network model to classify users into different categories.
2.2. RFID-Based Track and Trace
RFID is a very popular electronic tagging technology that allows objects to be automatically identified at a distance using an electromagnetic challenge-and-response exchange of data (journals/queue/Want04). An RFID-based system consists of a large number of low-cost tags that are attached to objects, and readers which can identify tags without a direct line of sight through RF communications. RFID technologies enable exceptional visibility to support numerous track and trace applications in different fields (conf/percom/YangCZT12), including indoor navigation (wang1; wang2) and indoor trajectory mining (mine1; mine2). However, the raw data collected by RFID readers is inherently noisy and inconsistent (Sullivan0504; conf/vldb/JefferyGF06). Therefore, middleware systems are required to correct readings and provide cleansed data (journals/vldb/JefferyFG08). In addition to the unreliable nature of RFID data streams, another limitation is that, due to the high cost of RFID readers, readers are mostly deployed with disjoint activation ranges in indoor tracking settings.
To overcome the above limitations, RFID data cleansing is a necessary step to produce consistent data for high-level applications. Baba et al. (6916912) proposed a probabilistic distance-aware graph model to handle false negatives in RFID readings. The main limitation is that their generative model relies on a long tracking history to detect and possibly correct RFID readings. Tran et al. (DBLP:yanlei) used a sampling-based method called particle filtering to infer clean and precise event streams from noisy raw data produced by mobile RFID readers. Three enhancements are proposed in their work to make traditional particle filter techniques scalable. However, their work is mainly designed for warehouse settings where objects remain static on shelves, which is quite different from our setting where objects move around in a building. Therefore, Tran's approach of adapting and applying particle filters cannot be directly applied to our settings. Another limitation of (DBLP:yanlei) is that they did not explore further utilization of the output event streams for high-level applications. Chen et al. (haiquan; journals/tkde/Ku12) employed a different sampling method called Markov chain Monte Carlo (MCMC) to infer objects' locations on shelves in warehouses. Their method takes advantage of the spatial and temporal redundancy of raw RFID readings, and also considers environmental constraints such as the capacity of shelves, to make the sampling process more precise. Their work also focuses on warehouse settings; thus it is not suitable for our problem of general indoor settings. The works in (conf/sigmod/ReLBS08; conf/mobisys/WelbourneKLLBBS08; conf/icde/LetchnerRBP09) target settings such as office buildings, which are similar to our problem. They use particle filters in their preprocessing module to generate probabilistic streams, on which complex event queries such as "Is Joe meeting with Mary in Room 203?" can be processed. However, their goal is to answer event queries instead of spatial queries, which is different from the goal of this research. Geng et al. (6655909) also proposed using particle filters for indoor tracking with RFID; however, they assumed a grid layout of RFID readers instead of readers deployed only along the hallways, so their algorithms cannot be applied to our problem.

3. Preliminary
In this section, we briefly introduce the mathematical background of Bayesian filters, including the Kalman filter and particle filters, and location inference based on the two filters. Notations used in this paper are summarized in Table I.
Symbol  Meaning
$q$  An indoor query point
$o_i$  The object with ID $i$
$C$  A set of candidate objects
$D$  A set of sensing devices
$G$  The indoor walking graph
$E$  The edge set of $G$
$N$  The node (i.e., intersection) set of $G$
$P(o_i)$  A probability distribution function for $o_i$ in terms of all possible locations
$a_j$  An anchor point with ID $j$
$N_s$  The total number of particles for an object
$V_{max}$  The maximum walking speed of a person
$D_{max}$  The maximum walking distance of a person during a certain period of time
$UR(o_i)$  The uncertain region of object $o_i$
$f_{min}$  The minimum shortest network distance
$f_{max}$  The maximum shortest network distance
$S(r)$  The size of a given region $r$
$r_i$  The $i$-th RFID reader
$p_t(o_i)$  The probability that object $o_i$ exists at the searched location at time $t$
$P_{total}$  The total probability of all objects in the result set with query $q$ at time $t$
3.1. The Kalman Filter
The Kalman filter is an optimal recursive data processing algorithm which combines a system's dynamics model, known control inputs, and observed measurements to form an optimal estimate of the system state. Note that the control inputs and observed measurements are not deterministic but carry a degree of uncertainty. The Kalman filter works by making a prediction of the future system state, obtaining measurements for that future state, and adjusting its estimate by moderating the difference between the two. The result is a new probability distribution of the system state whose uncertainty is smaller than that of either the prediction or the measurement alone.
To help readers better understand how the Kalman filter works for location estimation, we use a simple example of one-dimensional movement and location estimation. Suppose an object is moving along a horizontal line, and we are interested in estimating the object's location with the Kalman filter. We assume the object's speed can be expressed by $v = u + w$, where $u$ is a constant and $w$ is a Gaussian variable with a mean of zero and variance of $\sigma_w^2$. We also assume the object's initial location at time $t_0$ is also a Gaussian distribution with mean $\mu_0$ and variance $\sigma_0^2$. At a later time $t_1$, just before an observation is made, we get a prediction of the object's location to be a Gaussian distribution with mean and variance:

(1) $\mu_1^- = \mu_0 + u\,(t_1 - t_0)$

(2) $(\sigma_1^-)^2 = \sigma_0^2 + \sigma_w^2\,(t_1 - t_0)$

As indicated by Equation 2, the uncertainty in the predicted location increases with the time span $t_1 - t_0$, since no measurements are made during the time span and the uncertainty in speed accumulates with time.

After the observation at $t_1$ is made, suppose its value turns out to be $z_1$ with variance $\sigma_z^2$. The Kalman filter combines the predicted value with the measured value to yield an optimal estimation with mean and variance:

(3) $\mu_1 = \mu_1^- + K\,(z_1 - \mu_1^-)$

(4) $\sigma_1^2 = (1 - K)\,(\sigma_1^-)^2$

where $K = (\sigma_1^-)^2 / ((\sigma_1^-)^2 + \sigma_z^2)$. The details of deriving Equations 3 and 4 are omitted here, and we refer readers to (Maybeck79) for further details.

As we can see from Equation 3, the optimal estimate $\mu_1$ is the optimal predicted value before the measurement plus a correction term. The variance $\sigma_1^2$ is smaller than either $(\sigma_1^-)^2$ or $\sigma_z^2$. The optimal gain $K$ gives more weight to the better value (the one with lower variance), so that if the prediction is more accurate than the measurement, then $\mu_1^-$ is weighed more; otherwise $z_1$ is weighed more.
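The one-dimensional predict and update cycle above can be sketched in a few lines of Python. The function names are ours; the sample numbers are the ones used in the worked example of Section 4.4.1:

```python
def kalman_predict(mu, var, u, dt, var_w):
    """Predict the location after dt seconds at nominal speed u,
    with speed-noise variance var_w (Equations 1 and 2)."""
    return mu + u * dt, var + var_w * dt

def kalman_update(mu_pred, var_pred, z, var_z):
    """Combine the prediction with a measurement z of variance var_z
    (Equations 3 and 4)."""
    K = var_pred / (var_pred + var_z)  # optimal gain
    return mu_pred + K * (z - mu_pred), (1 - K) * var_pred

# Worked numbers from Section 4.4.1: predicted N(14, 3), observed N(10, 2)
mu, var = kalman_update(14.0, 3.0, 10.0, 2.0)  # mean ~11.6, variance ~1.2
```

Running the update with the predicted N(14, 3) and the observed N(10, 2) reproduces the filtered mean 11.6 and variance 1.2 from the later example.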
3.2. The Particle Filter
A particle filter is a method that can be applied to nonlinear recursive Bayesian filtering problems (Arulampalama:tutorial). The system under investigation is often modeled as a state vector $x_t$, which contains all relevant information about the system at time $t$. The observation $z_t$ at time $t$ is nonlinear in the true system state $x_t$; also, the system evolves from $x_{t-1}$ to $x_t$ nonlinearly. The objective of the particle filter method is to construct a discrete approximation to the probability density function (pdf) $p(x_t \mid z_{1:t})$ by a set of random samples with associated weights. We denote the weight of the $i$-th particle at time $t$ by $w_t^i$, and the $i$-th particle at time $t$ by $x_t^i$. According to the mathematical equations of particle filters (Arulampalama:tutorial), the new weight is proportional to the old weight augmented by the observation likelihood:

(5) $w_t^i \propto w_{t-1}^i \, p(z_t \mid x_t^i)$

Thus, particles which are more likely to cause an observation consistent with the true observation result will gain higher weight than others. After the weights are normalized,

(6) $\hat{w}_t^i = w_t^i \,/\, \sum_{j=1}^{N_s} w_t^j$

the posterior filtered density can be approximated as:

(7) $p(x_t \mid z_{1:t}) \approx \sum_{i=1}^{N_s} \hat{w}_t^i \, \delta(x_t - x_t^i)$
Resampling is a method to address the degeneracy problem in particle filters: with more iterations, only a few particles would have dominant weights while the majority of others would have near-zero weights. The basic idea of resampling is to eliminate low-weight particles, replicate high-weight particles, and generate a new set of particles with equal weights. Our work adopts sampling importance resampling (SIR) filters, which perform the resampling step at every time index.
In our application, particles update their locations according to the object motion model employed in our work. Briefly, the object motion model assumes objects move forward with constant speeds, and can either enter rooms or continue to move along hallways. Weights of particles are updated according to the device sensing model (haiquan) used in this research. An example of applying particle filters to the problem of RFIDbased indoor location inferences can be found in (conf/edbt/YuKSL13).
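One SIR iteration (move, reweight by the observation likelihood, normalize, resample) can be sketched as follows. The motion model and device sensing model are abstracted as caller-supplied functions, and all names are illustrative:

```python
import random

def sir_step(particles, weights, move, likelihood, z):
    """One sampling-importance-resampling iteration.

    move(x) advances a particle under the motion model; likelihood(z, x)
    scores a particle against observation z (Equation 5).
    """
    particles = [move(x) for x in particles]
    weights = [w * likelihood(z, x) for w, x in zip(weights, particles)]
    total = sum(weights)
    if total == 0.0:                     # no particle explains z: reset to uniform
        weights = [1.0] * len(particles)
        total = float(len(particles))
    weights = [w / total for w in weights]              # normalize (Equation 6)
    # Resampling: replicate high-weight particles, drop low-weight ones
    particles = random.choices(particles, weights=weights, k=len(particles))
    return particles, [1.0 / len(particles)] * len(particles)
```

In our setting, `move` would advance a particle along the walking graph and `likelihood` would come from the device sensing model.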
3.3. Query Definitions
Here we define the probabilistic kNN query following the idea of (Yang:indoorknn). In this paper, we use kNN in indoor environments to imply probabilistic kNN.
Definition 3.1 (Probabilistic k Nearest Neighbor Query). Given a set of indoor moving objects $O = \{o_1, o_2, \ldots, o_n\}$, a kNN query issued at time $t$ with query location $q$ returns a result set $R \subseteq O$ of $k$ objects. We denote the probability that object $o_i$ exists at the searched location at time $t$ by $p_t(o_i)$ (where the search depends on the relative distance to $q$), and the total probability of all objects in the result set by $P_{total} = \sum_{o_i \in R} p_t(o_i)$.
Definition 3.2 (Range Query). Given a set of indoor moving objects $O = \{o_1, o_2, \ldots, o_n\}$ and a range $r$, a range query issued at time $t$ with query location $q$ returns a result set $R$ containing the objects located in $r$, together with the respective probability of each $o_i \in R$.
4. System Design
In this section, we introduce the design of an RFID-based indoor range and kNN query evaluation system, which incorporates four modules: the event-driven raw data collector, the query-aware optimization module, the Bayesian filtering-based preprocessing module, and the query evaluation module. In addition, we introduce the underlying framework of two models: the indoor walking graph model and the anchor point indexing model. We elaborate on the function of each module and model in the following subsections.

Figure 1 shows the overall structure of our system design. Raw readings are first fed into and processed by the event-driven raw data collector module, which then provides aggregated readings for each object at every second to the Bayesian filtering-based preprocessing module. Before running the preprocessing module, the reading data may optionally be sent to the query-aware optimization module, which filters out non-candidate objects according to registered queries and objects' most recent readings, and outputs a candidate set $C$ to the Bayesian filtering-based preprocessing module. The preprocessing module cleanses noisy raw data for each object in $C$, stores the resulting probabilistic data in a hash table, and passes the hash table to the query evaluation module. Finally, the query evaluation module answers registered queries based on the hash table that contains the filtered data.
4.1. Event-Driven Raw Data Collector
In this subsection, we describe the event-driven raw data collector, which is the front end of the entire system. The data collector module is responsible for storing RFID raw readings in an efficient way for the subsequent query processing tasks. Considering the characteristics of Bayesian filtering, readings of one detecting device alone cannot effectively infer an object's moving direction and speed, while readings of two or more detecting devices can. We define events in this context as the object either entering (ENTER event) or leaving (LEAVE event) the reading range of an RFID reader. To minimize the storage space for every object, the data collector module only stores readings during the most recent ENTER, LEAVE, and ENTER events, and removes earlier readings. In other words, our system only stores readings from up to the two most recent consecutive detecting devices for every object. For example, if an object was previously identified by readers $r_1$ and $r_2$, readings from $r_1$ and $r_2$ are stored in the data collector. When the object enters the detection range of a new device $r_3$, the data collector records readings from $r_3$ while removing the older readings from $r_1$. The removed readings have a negligible effect on the current prediction.
The data collector module is also responsible for aggregating the raw readings into more concise entries with a time unit of one second. RFID readers usually have a high reading rate of tens of samples per second. However, Bayesian filtering does not need such a high observation frequency; an update frequency of once per second provides sufficient resolution. Therefore, aggregating the raw readings can further save storage without compromising accuracy.
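The collector logic of this subsection (keep readings from only the two most recent detecting devices per tag, aggregated to one-second resolution) might look like the following sketch; the class and identifier names are our own:

```python
from collections import defaultdict, deque

class RawDataCollector:
    """Event-driven collector: for each tag, keep readings from only the
    two most recent detecting readers, aggregated per second."""

    def __init__(self):
        # tag_id -> deque of (reader_id, [timestamps]); maxlen=2 evicts
        # the oldest reader automatically on a new ENTER event
        self.readings = defaultdict(lambda: deque(maxlen=2))

    def on_reading(self, tag_id, reader_id, ts):
        ts = int(ts)  # aggregate sub-second samples to one-second entries
        history = self.readings[tag_id]
        if history and history[-1][0] == reader_id:
            if history[-1][1][-1] != ts:   # skip duplicate same-second samples
                history[-1][1].append(ts)
        else:
            history.append((reader_id, [ts]))  # new ENTER event
```

With `maxlen=2`, the arrival of a third reader's readings automatically discards those of the first, matching the eviction behavior described above.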
4.2. Indoor Walking Graph Model and Anchor Point Indexing Model
This subsection introduces the underlying assumptions and backbone models of our system, which form the basis for understanding subsequent sections. We propose two novel models in our system, indoor walking graph model and anchor point indexing model, for tracking object locations in indoor environments.
4.2.1. Indoor Walking Graph Model
We assume our system setting is a typical office building where the width of hallways can be fully covered by the detection range of sensing devices (which is usually true since the detection range of RFID readers can be as long as 3 meters), and RFID readers are deployed only along the hallways. In this case the hallways can simply be modeled as lines, since from RFID reading results alone, the locations along the width of hallways cannot be inferred. Furthermore, since no RFID readers are deployed inside rooms, the resolution of location inferences cannot be higher than a single room.
Based on the above assumptions, we propose an indoor walking graph model. The indoor walking graph $G$ is abstracted from the regular walking patterns of people in an indoor environment, and can represent any accessible path in the environment. The graph comprises a set of nodes $N$ (i.e., intersections) together with a set of edges $E$, which represent possible routes (i.e., hallways). By restricting object movements to the edges of $G$, we can greatly simplify the object movement model while still preserving the inference accuracy of Bayesian filtering. Also, the distance metric used in this paper, e.g., in kNN query evaluations, can simply be the shortest spatial network distance on $G$, which can be calculated by many well-known spatial network shortest path algorithms (conf/vldb/PapadiasZMT03; conf/sigmod/SametSA08), as shown in Figure 2.
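For illustration, the shortest network distance on a walking graph can be computed with a standard Dijkstra search; the node names and edge weights below are invented:

```python
import heapq

# A toy indoor walking graph: nodes are hallway intersections,
# edge weights are walking distances in meters (layout is illustrative)
G = {
    "n1": {"n2": 10.0},
    "n2": {"n1": 10.0, "n3": 6.0, "n4": 8.0},
    "n3": {"n2": 6.0},
    "n4": {"n2": 8.0},
}

def shortest_network_distance(graph, src, dst):
    """Dijkstra search over the indoor walking graph."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")  # dst not reachable
```

In the full system, query points and anchor points would be snapped onto the nearest graph edge before such a search.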
4.2.2. Anchor Point Indexing Model
The indoor walking graph edges are by nature continuous. To simplify the representation of an object's location distribution on $G$, we propose an effective spatial indexing method: anchor point-based indexing. We define anchor points as a set of predefined points on the edges of $G$ with a uniform distance (such as 1 meter) between adjacent points. Anchor points are discrete location points; for most applications, this generalization avoids a heavy load of unnecessary computation. An example of anchor points is shown in Figure 2, where a triangle represents an anchor point. In Figure 3, the striped circle represents the uncertain region. In essence, the anchor point model is a scheme for discretizing objects' locations. After Bayesian filtering is finished for an object $o$, its location probability distribution is aggregated to discrete anchor points. Specifically, for the Kalman filter, an integration of the object's bell-shaped location distribution between two adjacent anchor points is calculated. For particle filters, suppose $a_j$ is an anchor point with a nonzero number $n_j$ of particles; then the probability that $o$ is at $a_j$ is $P(a_j, o) = n_j / N_s$, where $N_s$ is the total number of particles for $o$.
A hash table, APtoObjHT, is maintained in our system. Given the coordinates of an anchor point $a_j$, the table returns the list of objects and their probabilities at that anchor point: $(o_i, P(a_j, o_i))$. For instance, an entry of APtoObjHT would look like $\langle(8.5, 6.2), \{(o_1, 0.14), (o_2, 0.03), (o_3, 0.37)\}\rangle$, which means that at the anchor point with coordinates (8.5, 6.2), there are three possible objects ($o_1$, $o_2$, and $o_3$) with probabilities of 0.14, 0.03, and 0.37, respectively. With the help of the above anchor point indexing model, the query evaluation module can simply refer to the hash table APtoObjHT to determine objects' location distributions.
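In Python terms, APtoObjHT can be a plain dictionary keyed by anchor coordinates; the entry below mirrors the (8.5, 6.2) example above:

```python
# APtoObjHT maps an anchor point's coordinates to the list of
# (object id, probability) pairs currently located there
APtoObjHT = {}

def update_anchor(ht, coord, obj_id, prob):
    """Append one (object, probability) pair to an anchor point's entry."""
    ht.setdefault(coord, []).append((obj_id, prob))

update_anchor(APtoObjHT, (8.5, 6.2), "o1", 0.14)
update_anchor(APtoObjHT, (8.5, 6.2), "o2", 0.03)
update_anchor(APtoObjHT, (8.5, 6.2), "o3", 0.37)
```

The query evaluation module then answers a query by looking up the anchor points covered by the query region and summing the stored probabilities per object.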
4.3. Query-Aware Optimization Module
To answer every range query or kNN query, a naive approach is to calculate the location probability distribution of every object currently in the indoor setting. However, if query ranges cover only a small fraction of the whole area, a considerable percentage of objects are guaranteed not to be in the result set of any query. We call objects that have no chance of being in any result set "non-candidate objects". The computational cost of running Bayesian filters for non-candidate objects should be saved. In this subsection we present two efficient methods to filter out non-candidate objects for range queries and kNN queries, respectively.
Range Query: to decrease the computational cost, we employ a simple approach based on the Euclidean distance instead of the minimum indoor walking distance (Yang:indoorknn) to filter out non-candidate objects. An example of the optimization process is shown in Figure 3. For every object $o$, its most recent detecting device $d$ and last reading time stamp $t_{last}$ are first retrieved from the data collector module. We assume the maximum walking speed of people to be $V_{max}$. Within the time period from $t_{last}$ to the present time $t_{now}$, the maximum walking distance of a person is $D_{max} = V_{max}(t_{now} - t_{last})$. We define $o$'s uncertain region $UR(o)$ to be a circle centered at $d$ with radius $D_{max}$. The red circle in Figure 3 represents the reading range of a reader. If $UR(o)$ does not overlap with any query range, then $o$ is not a candidate and is filtered out. On the contrary, if $UR(o)$ overlaps with one or more query ranges, we add $o$ to the candidate set $C$. In Figure 3, the only object in the figure should be filtered out since its uncertain region does not intersect with any range query currently evaluated in the system.
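The range-query pruning step can be sketched as follows, assuming rectangular query ranges (the data layout and function names are illustrative; the uncertain region is the circle around the last detecting reader described above):

```python
def circle_overlaps_rect(center, radius, rect):
    """rect = (xmin, ymin, xmax, ymax): clamp the circle center onto the
    rectangle and compare the clamped distance with the radius."""
    cx, cy = center
    xmin, ymin, xmax, ymax = rect
    dx = cx - min(max(cx, xmin), xmax)
    dy = cy - min(max(cy, ymin), ymax)
    return dx * dx + dy * dy <= radius * radius

def range_candidates(objects, queries, t_now, v_max):
    """Keep objects whose uncertain region intersects at least one query.

    objects: {obj_id: (reader_xy, t_last)}; queries: list of rectangles.
    The uncertain region radius is v_max * (t_now - t_last).
    """
    C = set()
    for obj_id, (reader_xy, t_last) in objects.items():
        radius = v_max * (t_now - t_last)
        if any(circle_overlaps_rect(reader_xy, radius, q) for q in queries):
            C.add(obj_id)
    return C
```

Objects outside every query's reach are dropped before any Bayesian filtering is run on them.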
kNN Query: by employing the idea of distance-based pruning in (Yang:indoorknn), we perform a similar distance pruning for kNN queries to identify candidate objects. We use $f_{min}(q, o)$ ($f_{max}(q, o)$) to denote the minimum (maximum) shortest network distance (with respect to the indoor walking graph) from a given query point $q$ to the uncertain region of $o$:

(8) $f_{min}(q, o) = \min_{p \in UR(o)} d_G(q, p), \qquad f_{max}(q, o) = \max_{p \in UR(o)} d_G(q, p)$

where $d_G$ denotes the shortest network distance on $G$. Let $f'$ be the $k$-th minimum of all objects' $f_{max}$ values. If $f_{min}(q, o)$ of an object $o$ is greater than $f'$, then $o$ can be safely pruned, since there exist at least $k$ objects whose entire uncertain regions are definitely closer to $q$ than $o$'s shortest possible distance to $q$. Figure 2 shows an example pruning process for a 2NN query with 3 objects in total: $f'$ equals the second minimum of the three $f_{max}$ values, and since one object's $f_{min}$ is greater than $f'$, that object has no chance to be in the result set of the 2NN query. We run the distance pruning for every kNN query and add possible candidate objects to $C$.
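Given precomputed bounds per object, the distance pruning can be sketched as follows (the input encoding is ours):

```python
def knn_prune(f_bounds, k):
    """Distance-based pruning for a kNN query.

    f_bounds: {obj_id: (f_min, f_max)}, the minimum and maximum shortest
    network distances from the query point to each object's uncertain
    region. An object is kept unless its f_min exceeds the k-th smallest
    f_max, i.e. unless at least k objects are certainly closer.
    """
    fmaxes = sorted(fmax for _, fmax in f_bounds.values())
    f_prime = fmaxes[k - 1] if len(fmaxes) >= k else float("inf")
    return {oid for oid, (fmin, _) in f_bounds.items() if fmin <= f_prime}
```

For the 2NN example with three objects, the object whose lower bound exceeds the second-smallest upper bound is the one pruned.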
Finally, a candidate set $C$ is produced by this module, containing objects that might be in the result set of one or more range queries or kNN queries. $C$ is then fed into the Bayesian filtering-based preprocessing module, which is explained in the next subsection.
4.4. Bayesian Filtering-Based Preprocessing Module
The preprocessing module estimates an object's location distribution according to its two most recent readings, calculates the discrete probability on anchor points, and stores the results in the hash table APtoObjHT. We introduce two preprocessing approaches based on two well-known algorithms in the Bayesian filtering family: the Kalman filter and the particle filter.
4.4.1. Kalman Filter-Based Preprocessing Module
In this section, we extend the basic 1D example of the Kalman filter in Section 3.1 to more complex 2D indoor settings. Due to the irregularity of indoor layouts, the main challenge is that an object's moving path may diverge into multiple paths. For example, in Figure 4, assume an object was detected first by reader $r_1$ at time $t_1$ and then by reader $r_2$ at time $t_2$; it could have entered one or more rooms along the way before proceeding to $r_2$'s range. When we conduct a prediction with the Kalman filter, we need to consider all these possible paths, each of which gives a separate prediction. Algorithm 1 formulates our approach of applying the Kalman filter to estimate objects' locations, which is elucidated in the rest of this subsection with the example in Figure 4.
The Kalman filter algorithm starts by first retrieving the most recent readings for each candidate object from the data collector module. Line 5 of Algorithm 1 stops the Kalman filter from running more than 60 seconds beyond the last active reading, since otherwise its location estimate becomes dispersed over a large area and the filtering result becomes unusable.
We assume objects' speed is a Gaussian variable $v$ with mean $\mu_v$ and standard deviation $\sigma_v$, and that the time an object stays inside a room also follows a Gaussian distribution. We assume that objects rarely enter the same room more than once. There could be several shortest paths from reader $r_1$ to $r_2$. For a specific shortest path, if an object can walk into 0 rooms, 1 room, 2 rooms, ..., or $m$ rooms during $t_1$ to $t_2$, there are $m+1$ different predictions. We calculate the probabilities of these cases respectively from line 6 to line 16. Note that we simplify the computation by replacing $v$ with its mean value $\mu_v$. For example, in Figure 4, the object could enter 0, 1, or 2 rooms while moving before entering $r_2$'s range; therefore, there are 3 predicted distributions, indicated by the 3 curves in Figure 4.
When the observation at $t_2$ is made, we combine the observation with only the reasonable predictions to get a final estimation. By "reasonable", we mean predictions whose pdf has a good portion overlapping with $r_2$'s reading range. For example, in Figure 4, if the threshold on the probability of the object being in $r_2$'s range is 0.05, and the probability of the object having passed through rooms before reaching $r_2$'s range is less than 0.05, the corresponding path is eliminated. This means the predictions for the two paths that enter rooms hardly overlap with $r_2$'s reading range, so we can safely prune them and only consider the rightmost prediction. After pruning, the average of the remaining predictions is used to calculate the object's location estimate at $t_2$ according to Equations 3 and 4. For example, if the distance from $r_1$ to $r_2$ is 10, the observed mean will be 10 with variance 2 (determined by the radius of the reader's detection range). Suppose the predicted mean is 14 and the variance is 3. The gain is then $K = 3/(3+2) = 0.6$. According to Equations 3 and 4, the filtered mean is $14 + 0.6 \times (10 - 14) = 11.6$ and the new variance is $(1 - 0.6) \times 3 = 1.2$.
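To later discretize such a filtered Gaussian onto anchor points, its probability mass between two adjacent anchors can be read off the cumulative distribution function; a sketch, with function names of our choosing:

```python
import math

def normal_cdf(x, mu, var):
    """CDF of a Gaussian with mean mu and variance var."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * var)))

def anchor_probability(d_near, d_far, mu, var):
    """Probability mass assigned to the edge segment between two adjacent
    anchor points at network distances d_near and d_far from the reader."""
    return normal_cdf(d_far, mu, var) - normal_cdf(d_near, mu, var)
```

For instance, with anchors at distances 12 and 16 from the reader, the object's probability on that segment is the CDF difference F(16) - F(12).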
From the latest detection time to the current time, the object can take any possible path going forward. Line 19 uses recursion to enumerate all the possibilities, and line 20 calculates the resulting probability distribution. Suppose the two values involved are 20 and 22.5, giving a difference of 2.5; in line 21, we then arrive at the new variance 1.45. From line 22 to line 26, we calculate the possible objects and the integrals of their probabilities. Suppose we are computing an object’s probability over the segment from one anchor point to the adjacent anchor point in the moving direction on a specific path. If the distance from the first anchor point to the reader is 12, and the distance from the adjacent anchor point to the reader is 16, the integrated probability is the difference between the values at 16 and at 12 of the cumulative distribution function of the distribution calculated in lines 18 to 21. In line 25, we update the hash table APtoObjHT for each anchor point. For example, if there are 3 possible objects for an anchor point, each with an associated probability, we update the hash table with an item recording these (object, probability) pairs. With the aforementioned approach, we can determine the possible objects and their probabilities for each anchor point.

4.4.2. Particle Filter-Based Preprocessing Module
The particle filter method consists of 3 steps: initialization, particle updating, and particle resampling. In the first step, a set of particles is generated and uniformly distributed on the graph edges within the detection range of the detecting reader, and each particle picks its own moving direction and speed as in line 5. In our system, particles’ speeds are drawn from a Gaussian distribution with mean μ m/s and standard deviation σ m/s. In the location updating step in line 9, particles move along graph edges according to their speed and direction, and pick a random direction at intersections; if particles are inside rooms, they stay inside with probability 0.9 and move out with probability 0.1. After location updating, in line 16, the weights of particles are updated according to their consistency with the reading results. In other words, particles within the detecting device’s range are assigned a high weight, while others are assigned a low weight. In the resampling step, particles’ weights are first normalized as in line 18. We then employ the Resampling Algorithm (conf/edbt/YuKSL13) to replicate highly weighted particles and remove lowly weighted particles as in line 19. Lines 23 to 26 discretize the filtered probabilistic data and build the hash table APtoObjHT as described in Section 4.2.
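The three steps can be sketched on a single 1D edge as follows. This is a minimal illustration, not the system’s implementation: the speed parameters, detection radius, and hit/miss weights are assumed values, and the real system moves particles on the indoor walking graph.

```python
import random

def init_particles(n, center, radius):
    # Step 1: particles uniform over the reader's detection range; each picks
    # a random direction and a Gaussian speed (mean/std are assumed values).
    return [{"pos": random.uniform(center - radius, center + radius),
             "vel": random.choice([-1, 1]) * random.gauss(1.0, 0.1),
             "w": 1.0 / n} for _ in range(n)]

def move(particles, dt=1.0):
    # Step 2: location updating along the (1D) edge.
    for p in particles:
        p["pos"] += p["vel"] * dt

def reweight(particles, center, radius, hit=1.0, miss=1e-3):
    # Particles consistent with the reading get a high weight, others a low
    # one; weights are then normalized to sum to 1.
    for p in particles:
        p["w"] = hit if abs(p["pos"] - center) <= radius else miss
    total = sum(p["w"] for p in particles)
    for p in particles:
        p["w"] /= total

def resample(particles):
    # Step 3: replicate highly weighted particles, drop lowly weighted ones.
    picked = random.choices(particles, [p["w"] for p in particles],
                            k=len(particles))
    return [{"pos": c["pos"], "vel": c["vel"], "w": 1.0 / len(particles)}
            for c in picked]
```

A time step then runs `move`, `reweight` against the latest reading, and `resample`, after which particle positions are discretized onto anchor points.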
4.5. Query Evaluation
In this subsection we discuss how to evaluate range and kNN queries efficiently with the filtered probabilistic data in the hash table APtoObjHT. For kNN queries, without loss of generality, the query point is approximated by its nearest point on an edge of the indoor walking graph for simplicity.
4.5.1. Indoor Range Query
To evaluate indoor range queries, the first thought would be to determine the anchor points within the range, then answer the query by returning objects and their associated probabilities indexed by those anchor points. However, with further consideration, we can see that since anchor points are restricted to be only on graph edges, they are actually the 1D projection of 2D spaces; the loss of one dimension should be compensated in the query evaluation process. Figure 5 shows an example of how the compensation is done with respect to two different types of indoor entities: hallways and rooms.
In Figure 5, the query is a rectangle which intersects with both the hallway and a room, but does not directly contain any anchor point. We denote the left part of the query, which overlaps with the hallway, as the hallway part, and the right part, which overlaps with the room, as the room part. We first look at how to evaluate the hallway part. The anchor points which fall within the query’s vertical range are marked red in Figure 5 and should be considered for answering the query. Since we assume there is no differentiation along the width of hallways, objects in hallways can be anywhere along the width with equal probability. With this assumption, the ratio of the width of the query rectangle to the width of the hallway indicates the probability that an object in the hallway, within the vertical range of the query, is actually inside the query region. For example, if an object is in the hallway and in the vertical range of the query with some probability, which can be calculated by summing up the probabilities indexed by the red anchor points, then the probability of this object being in the query region is that value multiplied by the width ratio.
Then we look at the room part of the query. The anchor points within the room represent the whole 2D area of the room, and again we assume objects inside rooms are uniformly distributed. Similar to the hallway case, the ratio of the overlap region’s area to the room’s area is the probability that an object known to be in the room happens to be in the query region. For example, if an object’s probability of being in the room is some value, which can be calculated by summing up its indexed probabilities on all the anchor points inside the room, then its probability of being in the query region is that value multiplied by the area ratio.
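Both compensations reduce to scaling a summed anchor-point probability by a geometric ratio. The sketch below is illustrative (function and parameter names are ours); the anchor-point probabilities would come from APtoObjHT.

```python
def hallway_probability(anchor_probs, query_width, hallway_width):
    # Objects are assumed uniform across the hallway's width, so the summed
    # anchor-point probability is scaled by the width ratio.
    return sum(anchor_probs) * (query_width / hallway_width)

def room_probability(anchor_probs, overlap_area, room_area):
    # Objects are assumed uniform over the room, so the probability of being
    # in the room is scaled by the area ratio of the overlap region.
    return sum(anchor_probs) * (overlap_area / room_area)

# E.g. the query covers half the hallway width: an object with total anchor
# probability 0.5 in the query's vertical range is in the query with prob 0.25.
p = hallway_probability([0.2, 0.3], query_width=1.0, hallway_width=2.0)
```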
Algorithm 3 summarizes the above procedures. In line 15, we define the multiplication operation for the result set, which adjusts the probabilities of all objects in it by the multiplying constant. In line 16, we define the addition operation for the result set: if an (object, probability) pair is to be added, we check whether the object already exists in the result set. If so, we simply add the probability to the object’s existing probability; otherwise, we insert the pair into the result set. For instance, if the result set already contains a pair for an object and another pair for the same object is added, the object’s probability is updated to the sum of the two.
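The two operations can be sketched with a plain dictionary keyed by object ID (names illustrative):

```python
def multiply(result, c):
    # Multiplication: scale every object's probability by a constant.
    return {obj: p * c for obj, p in result.items()}

def add_pair(result, obj, p):
    # Addition: accumulate if the object already exists, else insert the pair.
    result[obj] = result.get(obj, 0.0) + p

result = {"o1": 0.3}
add_pair(result, "o1", 0.2)   # existing object: probabilities are summed
add_pair(result, "o2", 0.4)   # new object: the pair is inserted
result = multiply(result, 0.5)
```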
4.5.2. Indoor NN Query
For indoor kNN queries, we present an efficient evaluation method with statistical accuracy. Unlike previous work (Yang:indoorknn; Cheng:2009:EPT:1516360.1516438), which involves heavy computation and returns multiple result sets for users to choose from, our method is user friendly and returns a relatively small number of candidate objects. It works as follows: starting from the query point, anchor points are searched in ascending order of their distance to it; the search expands by one anchor point per iteration until the sum of the probabilities of all objects indexed by the searched anchor points is no less than k. The result set is a set of (object, probability) pairs whose probabilities sum to at least k, so the number of returned objects is at least k. In the statistical sense, the probability associated with an object in the result set is the probability of that object being in the kNN result of the query point. The indoor kNN query evaluation method used in our work is shown in Algorithm 4.
In Algorithm 4, lines 1 and 2 are initial setups. Line 3 adds two entries to a vector whose elements store the frontier edge segments expanding out from the query point. In the following for loop, line 5 finds the next unvisited anchor point farther away from the query point. If all anchor points on an edge segment have already been searched, lines 6 to 12 remove that segment and add all adjacent unvisited edges of its end node to the vector. Line 13 updates the result set by adding the (object ID, probability) pairs indexed by the current anchor point. In lines 14 to 17, the total probability of all objects in the result set is checked, and if it equals or exceeds k, the algorithm terminates and returns the result set. Note that the stopping criterion of our kNN algorithm does not require emptying the frontier edges in the vector.
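A simplified sketch of the expansion follows; it sorts anchor points by network distance up front instead of maintaining Algorithm 4’s frontier-edge vector, and the anchor-index layout is an assumption of ours.

```python
def knn_candidates(anchor_index, k):
    """anchor_index: list of (distance_to_query, [(obj_id, prob), ...]) entries.
    Expand in ascending distance until the total probability reaches k."""
    result, total = {}, 0.0
    for _, pairs in sorted(anchor_index, key=lambda a: a[0]):
        for obj, p in pairs:
            result[obj] = result.get(obj, 0.0) + p
            total += p
        if total >= k:      # stopping criterion: sum of probabilities >= k
            break
    return result
```

Each returned object carries its accumulated probability of being among the query point’s k nearest neighbors.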
An example kNN query is shown in Figure 6, which is a snapshot of the running status of Algorithm 4. In Figure 6, red arrows indicate the search directions expanding from the query point, and red anchor points indicate the points that have already been searched. Note that one fully searched edge segment has already been removed from the vector, while the newly expanded edges remain in it. The search continues until the total probability of the result set is no less than k.
4.5.3. Continuous Indoor Range Query
In this subsection, we aim to solve the problem of continuous indoor range queries on filtered probabilistic data. To efficiently monitor the result set, we use a concept similar to the “critical devices” of (Yang:indoorrange), which saves considerable computation compared with constantly repeating the snapshot algorithm. We define the critical devices of a query to be only the set of devices whose readings can affect the query result. Our continuous monitoring algorithm is distinct from Yang’s work (Yang:indoorrange) in two aspects: first, we leverage the Indoor Walking Graph to simplify the identification of critical devices; second, our probability updating process is Bayesian filter-based, which is more accurate and very different in nature from Yang’s approach.
To identify critical devices for a range query, we propose an approach consisting of two steps, mapping and searching. For the mapping step, we categorize two different cases:

Case 1: when the whole query range is contained within one room or adjacent rooms, we project the query range through the doors of its end rooms onto the indoor walking graph along hallways. For example, the first query in Figure 7 is fully contained in one room, so it is projected to a point (the red point) on the graph through the room’s door.

Case 2: when the query range overlaps with both rooms and hallways, the endpoints of the mapped edge segment(s) are chosen as whichever makes the covered segment longer between the projected points of the query range’s ends and of the end rooms’ doors. The second query in Figure 7 is an example of this case: it is mapped to an edge segment along the hallway, marked in red. At each end, the room door’s projected point is chosen over the query range end’s projected point because it makes the covered segment longer.
For the searching step, an expansion starting from the mapped endpoint(s) is performed along the indoor walking graph until the activation range of an RFID reader or a dead end is reached.
For the initial evaluation of a query, we adapt the optimization algorithm of the snapshot query in Section 4.3 to fully take advantage of critical devices. For an object to be in the query range, it must have been most recently detected by a critical device or by a device inside the region bounded by the critical devices. Other than this difference in identifying the candidate object set, the initial evaluation algorithm is the same as its snapshot counterpart. After the initial evaluation, we continuously monitor the candidate set by running Bayesian filtering for its members at every time step.
During the lifetime of a query, the candidate set may change as candidates move out of, or non-candidates move into, the region bounded by the critical devices. If a candidate object is detected by a critical device, or its probability of still residing in the bounded region falls to 0, we assume it is moving out and remove it from the candidate set. Conversely, if a non-candidate object enters the detection range of a critical device, we assume it is moving into the bounded region and add it to the candidate set.
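This maintenance rule over one time step can be sketched with sets; the names are ours, and the in-region probabilities would come from the Bayesian filter.

```python
def update_candidates(candidates, detected, in_region_prob):
    """detected: objects read by a critical device this time step;
    in_region_prob: each candidate's probability of still being inside."""
    moved_out = candidates & detected                  # candidates crossing out
    vanished = {o for o in candidates
                if in_region_prob.get(o, 0.0) == 0.0}  # probability fell to 0
    moved_in = detected - candidates                   # non-candidates entering
    return (candidates - moved_out - vanished) | moved_in
```

Note the asymmetry: a reading at a critical device removes an object that was already a candidate but adds one that was not, since the device sits on the boundary of the monitored region.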
The proposed continuous indoor range query is formalized in Algorithm 5. Lines 1 to 6 initialize the critical devices and the candidate set for the query. In line 4 we use a new hash table that maps a device to the objects whose most recent readings are from this device. Lines 9 to 20 update the candidate set according to the readings of critical devices, as well as the objects’ probabilities of presence within the bounded region. Line 21 executes Algorithm 1 or 2 to update the candidate objects’ location probability distributions. Line 22 calculates the result set using Algorithm 3. Note that for Algorithm 3 there is no need to recompute the anchor point set, since it remains unchanged until the query is unregistered from the system.
4.5.4. Continuous Indoor NN Query
Similar to the continuous indoor range query, a method for updating the candidate set of a continuous indoor kNN query is crucial. To reduce the overhead of recomputing the candidate set at every time step, we buffer a certain number of extra candidates, and only recompute the candidate set according to the optimization approach in Section 4.3 when the total number of candidates falls below k.
Recall from Section 4.3 that, by examining the minimum/maximum shortest network distances from the query point to an object’s uncertain region, the snapshot optimization approach excludes objects whose minimum distance exceeds the k-th smallest maximum distance. Note that the candidate set identified by this method contains at least k objects (usually more than k). We extend this snapshot approach to include at least k + Δ candidates, where Δ is a user-configurable parameter. Obviously, Δ represents a trade-off between the size of the candidate set and the recomputation frequency. We accomplish this by calculating the (k + Δ)-th smallest maximum distance among all objects, and using this value as a threshold to cut off non-candidate objects.
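A sketch of the buffered cutoff follows; the `delta` parameter name and the (min, max) tuple layout are our assumptions.

```python
def buffered_candidates(objs, k, delta):
    """objs: dict obj_id -> (min_dist, max_dist) shortest network distances to
    the object's uncertain region. Keeps at least k + delta candidates."""
    # Threshold: the (k + delta)-th smallest maximum distance.
    cutoff = sorted(mx for _, mx in objs.values())[k + delta - 1]
    # Prune objects that cannot be among the k + delta nearest.
    return {o for o, (mn, _) in objs.items() if mn <= cutoff}
```

The k + delta objects with the smallest maximum distances all survive the cut (their minimum distances cannot exceed the cutoff), so the set always holds at least k + delta candidates.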
During continuous monitoring, we need to make sure that the candidate set is updated accordingly as objects move away from or towards the query point. We still use critical devices to monitor candidates, but now the critical devices may change each time the candidate set is recomputed. The identification process of critical devices goes as follows: after calculating the candidate set, a search is performed from the query point along the indoor walking graph to cover all the uncertain regions of candidate objects, until reaching readers (the critical devices) or a dead end. As a result, the critical devices form a bounded region that surely contains at least k candidate objects.
The proposed continuous indoor kNN query is formalized in Algorithm 6. Note that in lines 13 to 16, when the total number of candidates falls below k, we recompute a new candidate set of at least k + Δ objects and identify new critical devices accordingly.
5. Experimental Validation
In this section, we evaluate the performance of the proposed Bayesian filtering-based indoor spatial query evaluation system using both synthetic and real-world data sets, and compare the results with the symbolic model-based solution (Yang:indoorknn). The proposed algorithms are implemented in C++. All the experiments were conducted on an Ubuntu Linux server equipped with an Intel Xeon 2.4GHz processor and 16GB memory. In our experiments, the floor plan, an office setting on the second floor of the Haley Center on Auburn University campus, includes 30 rooms and 4 hallways on a single floor, in which all rooms are connected to one or more hallways by doors. (Our code, data, and the floor plan are publicly available at https://github.com/DataScienceLab18/IndoorToolKit.) A total of 19 RFID readers are deployed in the hallways at a uniform distance from each other. Objects move continuously without stopping, waiting, or making detours.
5.1. Evaluation Metrics

For range queries, we propose cover divergence to measure the accuracy of the query results from the two modules based on their similarity with the true result. Cover divergence evaluates the difference between two probability distributions: the discrete form given in Equation 9 measures the information loss when the predicted distribution is used to approximate the ground-truth distribution. As a result, in the following experiments, smaller cover divergence indicates better accuracy of the results with regard to the ground truth. For instance, if there are 3 objects in the query window at some time stamp, the cover divergence compares their predicted probabilities against the ground-truth probabilities of those 3 objects.

For kNN queries, cover divergence is no longer a suitable metric since the result sets returned by the symbolic model module do not contain object-specific probability information. Instead, we count the hit rates of the results returned by the two modules against the ground truth result set. We only consider the maximum-probability result set generated by the symbolic model module when calculating the hit rate. Given a query point, there is a ground truth set Γ containing the k nearest objects around it at the query time, and the query model returns a predicted result set R of (object, probability) pairs, accumulating nearest neighbors in ascending order of distance from the query point until the total probability reaches k. Hit rate is formally defined in Equation 10. For example, if k = 3, the ground truth set contains 3 objects, and the predicted result hits 2 of them, the hit rate is 0.667.
(9)   CD(P ‖ Q) = Σ_i P(i) log( P(i) / Q(i) ), where P is the ground-truth distribution and Q the predicted distribution
(10)   HitRate = |Γ ∩ R| / k, where Γ is the ground truth set and R the predicted result set
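Assuming the KL-style form of cover divergence and the overlap form of hit rate sketched in the equations above, the two metrics can be computed as follows; the sample values are illustrative.

```python
import math

def cover_divergence(truth, predicted):
    # Information loss when `predicted` approximates the ground-truth
    # per-object probabilities; smaller is better.
    return sum(p * math.log(p / predicted[o]) for o, p in truth.items())

def hit_rate(ground_truth, predicted, k):
    # Fraction of the k true nearest neighbors recovered by the prediction.
    return len(set(ground_truth) & set(predicted)) / k

hr = hit_rate({"a", "b", "c"}, {"a", "b", "d"}, k=3)  # 2 of 3 hit -> ~0.667
```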
In all the following experimental result figures, we use PF, KF, and SM to represent the particle filter-based method, the Kalman filter-based method, and the symbolic model-based method, respectively.
5.2. Synthetic Data Set
The whole simulator consists of six components: the true trace generator, raw reading generator, Bayesian filter module, symbolic model module, ground truth query evaluation module, and performance evaluation module. Figure 8 shows the relationships among the components of the simulation system. The true trace generator is responsible for generating the ground truth traces of moving objects and recording the true location of each object every second. Each object randomly selects its destination and walks along the shortest path on the indoor walking graph from its current location to the destination node. We simulate the objects’ speeds using a Gaussian distribution with mean μ m/s and standard deviation σ m/s. The raw reading generator checks, with a certain probability, whether each object is detected by a reader according to the deployment of readers and the current location of the object. Whenever a reading occurs, the raw reading generator feeds the reading, including detection time, tag ID, and reader ID, to the query evaluation modules (the Bayesian filter module and the symbolic model module). The ground truth query evaluation module forms the basis for evaluating the accuracy of the results returned by the two aforementioned query evaluation modules. The default parameters of all the experiments are listed in Table 2.
Parameters  Default Values
Number of particles  64
Query window size  2%
Number of moving objects  200
k  3
Activation range  2 meters
5.2.1. Effects of Query Window Size
We first evaluate the effects of the query window size on the accuracy of range queries. The window size is measured as a percentage of the total area of the simulation space. At each time stamp, 100 query windows are randomly generated as rectangles, and the results are averaged over 100 different time stamps. As shown in Figure 10, the accuracy of the three methods is not significantly affected by the query window size. However, the cover divergence of the particle filter-based method is lower than that of both the Kalman filter-based and symbolic model-based methods.
5.2.2. Effects of k
In this experiment we evaluate the accuracy of kNN query results with respect to the value of k. We choose 100 random indoor locations as kNN query points and issue queries on these points at 100 different time stamps. As k goes from 2 to 9, we can see in Figure 10 that the average hit rates of the Kalman filter-based and symbolic model-based methods grow slowly. As k increases, the number of objects returned increases as well, resulting in a higher chance of hits. In contrast, the average hit rate of the particle filter-based method is relatively stable with respect to the value of k, and the particle filter-based method always outperforms the other two methods in terms of average hit rate.
5.2.3. Effects of Number of Particles
From the mathematical analysis of particle filters in Section 3.2, we know that if the number of particles is too small, the accuracy of particle filters will degenerate due to insufficient samples. On the other hand, keeping a large number of particles is not a good choice either since the computation cost may become overwhelming, as the accuracy improvement is no longer obvious when the number of particles is beyond a certain threshold. In this subsection, we conduct extensive experiments to explore the effects of the number of particles on query result accuracy in order to determine an appropriate size of the particle set for the application of indoor spatial queries.
As shown in Figure 11, when the number of particles is very small, the particle filter-based method has a lower average hit rate for kNN queries than the other two methods. As the number of particles grows beyond 16, the performance of the particle filter-based method exceeds the other two. For range queries, the particle filter-based method has a lower cover divergence than the other two methods once the number of particles grows beyond 16. However, the performance gain beyond 64 particles diminishes, as accuracy is already high at that point. Figure 12 shows the relationship between runtime and the number of particles: as the number of particles increases, so does the runtime. Therefore, we conclude that in our application the appropriate size of the particle set is around 60, which guarantees good accuracy without costing too much computation.
5.2.4. Effects of Speed of Moving Objects
To justify the assumption about velocity made in this paper, we generate the trajectories of objects with different velocities. In the experiment, we vary the constant moving speed (Yang:indoorrange) of the objects from 0.9 m/s to 1.4 m/s to obtain the ground truth. Figure 13 shows the performance of the three models. The PF model outperforms the other two models at all moving speeds, and the KF model exceeds SM. We obtain the same comparative result as under the default experimental setting (a Gaussian speed distribution with mean μ m/s and standard deviation σ m/s).
5.2.5. Effects of Number of Moving Objects
In this subsection, we evaluate the scalability of our proposed algorithms by varying the number of moving objects from 200 to 1000. All the result data are collected by averaging an extensive number of queries over different query locations and time stamps. Figure 14 shows that the cover divergence of the three methods is relatively stable, while the average hit rate of kNN queries decreases for all the methods. The decrease in kNN hit rate is caused by the increasing density of objects; a finer-resolution algorithm would be required to accurately answer kNN queries. Overall, our solution demonstrates good scalability in terms of accuracy as the number of objects increases.
5.2.6. Effects of Activation Range
In this subsection, we evaluate the effects of the reader’s activation range by varying the range from 50 cm to 250 cm. The results are reported in Figure 15. As the activation range increases, the performance of all three methods improves, because the uncertain regions not covered by any reader essentially shrink. In addition, even when the activation range is small (e.g., 100 cm), the particle filter-based method is still able to achieve relatively high accuracy. Therefore, the particle filter-based method is more suitable than the other two methods when physical constraints limit the readers’ activation ranges.
5.2.7. Continuous Query Performance Evaluation
The previous subsections show the performance of snapshot queries, i.e., queries at a specific time stamp. This subsection demonstrates our algorithms’ performance across a duration of time. The application scenarios are described as follows:

For continuous range queries, a user registers a query window at a start time and unregisters it at an end time. During this interval, we keep updating the user on the objects in the query window whenever a change is detected.

For continuous kNN queries, a user registers a query point on the walking graph (a query point not on the walking graph can be projected to the closest edge of the graph) at a start time and unregisters it at an end time. During this interval, every time there is a change in the nearest neighbor result set, we update the user with the new query result.
We develop two criteria to measure the performance in the above scenarios:
Change Volume: Change volume is defined as the number of changes of objects in the query range between two consecutive time stamps, including departing and arriving objects. Suppose the query range contains three objects at one time stamp; if at the next time stamp one of them has departed and a new object has arrived, the number of changes equals 2. The rationale behind this criterion is that a higher change volume could potentially impair query result accuracy.
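The change volume between consecutive result sets is simply the size of their symmetric difference; the object IDs below are illustrative.

```python
def change_volume(prev_result, curr_result):
    # Departing objects plus arriving objects between two time stamps.
    return len(set(prev_result) ^ set(curr_result))

cv = change_volume({"o1", "o2", "o3"}, {"o1", "o2", "o4"})  # o3 left, o4 came -> 2
```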
Query Duration: Query duration is the interval between the time a user registers a continuous query and the time the user unregisters it. The rationale for this criterion is that the proposed algorithms can be considered stable and reliable if they maintain satisfactory accuracy over a long duration. Figure 16 shows the performance of our proposed algorithms with different numbers of changes. It is clear from the figure that our algorithms’ accuracy is not heavily influenced by the change volume, although there are some fluctuations. Updating the user on the objects in the query window as soon as a change is detected contributes to the stability of performance.
Furthermore, Figure 17 shows the accuracy of our algorithms against the query duration. Once the system is stable, the accuracy of our algorithms is not affected by the duration of query time.
5.3. Real Data Set
In the experiments utilizing real data, 40 objects moved randomly on the second floor of the Haley Center on Auburn University campus; their trajectories were recorded by a camera. The experiments assumed that the RFID readers were located at the designated positions. Once an object on a trajectory enters the detection range of a reader, it is recorded with a specific probability and the hash table APtoObjHT is updated. We evaluate all three models (PF, KF, and SM) with the collected data.
Figure 19 shows the effects of the query window size. The result is not significantly influenced by the query window size when the window size is greater than 0.01. When the query window size is 0.01, the query window cannot cover a whole room or the width of a hallway, and at the same time the number of covered moving objects is small; as a result, the cover divergence is relatively small. As also shown in Figure 19, the hit rate of PF outperforms SM and KF for different values of k. As k goes from 2 to 9, the average hit rates of KF and SM grow slowly, while the hit rate of PF is relatively stable with respect to the value of k. Figure 20 shows the effects of varying the number of particles on the query results. As the number of particles grows beyond 16, the performance of PF exceeds the other two. The reason is that as the number of particles increases, more possible anchor points can represent the position of a specific object, so the algorithm returns more objects. Since KF and SM use no particles, their results are not influenced by the number of particles. Overall, the comparative results on the real data set are consistent with those on the synthetic data set.
6. Conclusion
In this paper, we introduced an RFID and Bayesian filtering-based indoor spatial query evaluation system. In order to evaluate indoor spatial queries over the unreliable data collected by RFID readers, we proposed a Bayesian filtering-based location inference method, the indoor walking graph model, and the anchor point indexing model for cleansing noisy RFID raw data. After the data cleansing process, indoor range and kNN queries can be evaluated efficiently and effectively by our algorithms. We conducted comprehensive experiments using both synthetic and real-world data. The results demonstrate that our solution significantly outperforms the symbolic model-based method in query result accuracy under the assumption that objects move at a constant rate of 1 m/s, without stopping, waiting, or making detours.
For future work, we plan to conduct further analyses of our system with more performance evaluation metrics and object moving trajectory patterns (e.g., people may stop for a while at a certain location, as in a shopping mall setting). In addition, we intend to extend our framework to support more spatial query types such as spatial skyline, spatial joins, and closest-pairs.
7. Acknowledgement
This research has been funded in part by the U.S. National Science Foundation grants IIS1618669 (III) and ACI1642133 (CICI).