LOCATER: Cleaning WiFi Connectivity Datasets for Semantic Localization

04/20/2020 · by Yiming Lin, et al. · University of California, Irvine

This paper explores the data cleaning challenges that arise in using WiFi connectivity data to locate users at semantic indoor locations such as buildings, regions, and rooms. WiFi connectivity data consists of sporadic connections between devices and nearby WiFi access points, each of which may cover a relatively large area within a building. Our system, entitled semantic LOCATion cleanER (LOCATER), postulates semantic localization as a series of data cleaning challenges: first, it treats the problem of determining the AP to which a device is connected between any two of its connection events as a missing value detection and repair problem. It then associates the device with the semantic subregion (e.g., a conference room in the region) by postulating this as a location disambiguation problem. We propose a bootstrapping-based semi-supervised learning method for coarse localization and a probabilistic method for fine-grained localization. We show that LOCATER achieves high accuracy at both the coarse and fine levels. LOCATER offers several benefits over traditional indoor localization approaches since it neither requires active cooperation from users (e.g., downloading code on a mobile device and communicating with the server) nor installing external hardware (e.g., network monitors).


1 Introduction

This paper studies the challenge of cleaning connectivity data collected by WiFi infrastructure to support semantic localization [jia2015soundloc] inside buildings. Semantic localization refers to locating people at a semantic indoor location such as within a building, a floor, a region, and/or a room. WiFi connectivity data refers to association events between a device and the WiFi infrastructure (e.g., access points in buildings). Such association events are readily available in large WiFi infrastructures that may consist of hundreds to thousands of Access Points (APs), e.g., in university campuses, airports, and shopping malls. (When a device associates with an AP to connect to the network, the AP sends an association event to a wireless controller for tasks such as roaming management, load balancing, traffic fairness among clients, and wireless channel interference mitigation. Controllers can collect WiFi association data from APs in real time using SNMP (Simple Network Management Protocol), the most widely used protocol, or using NETCONF [enns2006netconf], a more recent network management protocol. Some infrastructures may alternatively use Syslog [gerhards2009syslog] to gather the operation log, including association events, from APs.) The data consists of observations of the form ⟨mac address, timestamp, wap⟩, which correspond to the MAC address of the connected WiFi-enabled device, the timestamp at which the connection occurred, and the WiFi AP (wap) to which the device connected. Connectivity events of mobile devices (e.g., handhelds, wearables) can be exploited to locate individuals carrying such devices to a region covered by the access points. For example, in Fig. 1(b) an event can lead to the observation that the owner of the device with MAC address 7bfh… was located in the region covered by wap3 (which includes rooms 2059, 2061, 2065, 2066, 2068, 2069, 2072, 2074, 2076, and 2099) at 2019-08-22 13:04:35.

WiFi connectivity can be a powerful technology for indoor localization, especially for applications where localizing individuals to a semantically meaningful location (such as inside/outside a building, on a given floor, in a given region, or in a given room) suffices. Such applications include maintaining an accurate assessment of the occupancy of different parts of the building for HVAC (Heating, Ventilation and Air Conditioning) control [afram2014theory], constructing accurate models of building usage for space planning or other customized services, and even tracking individuals inside large buildings [jensen2009graph, musa2012tracking]. In the context of the recent COVID-19 epidemic, such data can be used to determine regions/areas of high traffic in buildings, to monitor adherence to social distancing, and to build systems that alert people who might have been exposed through possible contact with someone who has been infected.

While several indoor localization technologies exist based on a variety of sensors (e.g., bluetooth beacons, video-based tracing, fingerprint analysis of WiFi signal strength, ultra-wide band signals, inertial sensors) or their fusion to accurately locate people inside buildings, using WiFi connectivity data offers several unique benefits. First, since WiFi infrastructure is ubiquitous in modern buildings, using it for semantic localization does not incur any additional hardware costs either to users or to the owner of the built infrastructure, as would be the case if buildings had to be retrofitted with technologies such as RFID, ultra-wideband (UWB), bluetooth, cameras, etc. [liu2007survey]. Besides (almost) zero cost, another consequence of the ubiquity of WiFi networks is that such a solution applies to all types of buildings: airports, residences, office spaces, university campuses, government buildings, etc. Another key advantage is that localization using WiFi connectivity can be performed passively, without requiring users to install new applications on their smartphones or to actively participate in the localization process. In contrast, active approaches [deak2012survey, priyantha2000cricket] require individuals to download special software/apps and send information to a localization system [deak2012survey], which significantly limits technology adoption. Non-participation by a non-negligible population makes applications that perform aggregate-level analysis (e.g., analysis of space utilization and crowd flow patterns) very difficult to implement. While this limitation of active localization has sparked significant interest in passive localization mechanisms, e.g., [luo2016pallas, youssef2007challenges, xu2013scpl, seifeldin2012nuzzer, musa2012tracking, want1992active, li2015passive], such prior approaches require deploying external hardware capable of monitoring and capturing (or extracting) signals from user devices and detecting environmental changes [luo2016pallas, musa2012tracking] to help locate individuals. Such hardware makes the solution prohibitively costly at large scale and, furthermore, may not be deployable due to physical restrictions of the space. In contrast, WiFi infrastructure is readily available in most commercial buildings and offers relatively untapped potential as a localization framework for applications where semantic localization suffices.

Comparing with state-of-the-art WiFi-based localization systems. Several WiFi-based localization systems are passive and do not require external hardware [kotaru2015spotfi, vasisht2016decimeter]. They rely on CSI (Channel State Information) or RSSI (Received Signal Strength Indicator) to localize devices with decimeter-level median accuracy, e.g., 40 cm for SpotFi [kotaru2015spotfi] and 65 cm for Chronos [vasisht2016decimeter]. The key limitation of such techniques is that they require the WiFi AP to work in monitor mode, where the AP serves as a dedicated WiFi channel sensor that collects WiFi data packets on the channel. Most APs in monitor mode have restrictions on packet transmission [linux-monitor, conrad2012cissp], which means that those APs cannot handle data traffic between client devices and the infrastructure. Even for the limited types of APs whose monitor mode supports normal WiFi association, the localization process affects network traffic [vasisht2016decimeter]. In practice, WiFi networks are normally deployed for communication (i.e., to maximize data throughput and network coverage) rather than localization, and our work provides localization on top of such deployments without interfering with users' network connections.

Challenges in exploiting WiFi connectivity data. Using WiFi data, besides offering new opportunities, raises several significant data cleaning challenges.

  • Missing value detection and repair. Devices might get disconnected from the network even when the users carrying them are still in the space. Depending on the specific device, connectivity events might occur only sporadically and with different periodicity, making prediction more complex. This leads to a missing-value challenge. For instance, in Fig. 1(c) we have raw connectivity data for device 7fbh at times 13:04:35 and 13:18:11; the device's location between these two consecutive timestamps is missing.

  • Location disambiguation. APs cover large regions within a building that might involve multiple rooms, and hence simply knowing which AP a device is connected to may not offer room-level localization. For example, in Fig. 1, the device 3ndb connects to wap2, which covers rooms 2004, 2057, 2059, …, 2068. Such values are too coarse (dirty) for room-level localization. This can be viewed as a location disambiguation challenge.

  • Scalability. The volume of WiFi data is huge: on our campus, with over 200 buildings and 2,000 access points, several million WiFi connectivity tuples are generated per day on average. Thus, the data cleaning technique needs to scale to large datasets.

To address the above challenges, this paper proposes an online location cleaning system, entitled LOCATER, to efficiently clean WiFi connectivity data for room-level localization. LOCATER consists of two main parts: a cleaning engine and a caching engine. The cleaning engine takes as input the WiFi connectivity dataset, metadata (such as room types, explained later), and a query requesting the location of a device at a given time, and outputs the device's location at the room level. The caching engine caches the cleaning results of past queries to speed up the system. Specifically, we make the following contributions.

  • We propose a novel approach to semantic indoor localization by formalizing the challenge as a combination of missing value cleaning and disambiguation problems.

  • We propose a semi-supervised method to resolve the missing value problem and a novel probability-based approach to disambiguate room locations without using labeled data.

  • We design an efficient caching technique to enable LOCATER to answer queries in near real time.

  • We validate our approach in a real-world testbed and deployment. Experimental results show that LOCATER achieves high accuracy and scales well on both real and simulated datasets.

The rest of the paper is structured as follows. Section 2 formalizes the data cleaning challenges. Sections 3 and 4 describe the coarse and fine localization algorithms. Section 5 describes LOCATER's prototype and caching technique. Section 6 reports experimental results. Section 7 discusses related work and Section 8 concludes the paper.

2 Formalizing Data Cleaning Challenges

Variable(s) Definition/Description
buildings; regions; rooms
set of regions inside the building; set of rooms in a region
WiFi APs; devices
WiFi connectivity events
WiFi connectivity events of a device
WiFi connectivity events during a time period
validity interval of an event; set of gaps associated with a device
Table 1: Model variables and shorthand notation.

In this section, we formalize the two data cleaning challenges in LOCATER: 1) missing value detection and repair, and 2) location disambiguation. First, we develop the notation used in the rest of the paper (see Table 1).

Space Model. LOCATER aims to semantically localize users within a specific building at a given time. While LOCATER's space model can easily be generalized to other spatial models, in this paper we partition the space into three granularity levels, from coarse to fine: building, region, and room, as follows:

Building: The coarsest level of localization relates to whether the user is in the building or not. At building granularity, we consider two localization options representing whether the person is inside or outside the building. We call a device that is inside the building an online device, and one that is outside an offline device.

Region: A building contains a set of regions, where each region represents a portion of the building. At region granularity, we consider as localization options all the regions within the building. We consider a region to be the area covered by the network connectivity of a specific WiFi AP [tervonen2016applying] (represented with dotted lines in Fig. 1(a)). Each region is thus related to one and only one AP in the building. In Fig. 1(a), there are four APs and thus four regions. As shown in the figure, regions can, and often do, overlap.

Room: The finest level of localization relates to the specific room a user is located in. A building contains a set of rooms, each identified by a room ID. Furthermore, a region contains a subset of the building's rooms. Hence, at room-level granularity and given a specific region, the localization options are the rooms in that region. Since regions can overlap, a specific room can be part of different regions if its extent intersects multiple regions. For instance, in Fig. 1(a) room 2059 belongs to both the region covered by wap2 and the region covered by wap3.

We consider that rooms in a building have associated metadata that can be leveraged for the localization problem. In particular, we classify rooms into two types: (i) public: shared facilities such as meeting rooms, lounges, kitchens, food courts, etc., that are accessible to multiple users; and (ii) private: rooms typically restricted to or owned by certain users, such as a person's office.

WiFi Connectivity Data Model. Consider the set of WiFi APs in a building and the set of devices, where each device has an associated MAC address that uniquely identifies it. The WiFi connectivity events table has the schema ⟨eid, mac address, timestamp, wap⟩ (as shown in Fig. 1(b)). Each tuple records the id of the specific event logged by an AP, the MAC address of the device, and the time at which the device connected to that AP. We refer to an attribute value of a tuple using dot notation, abbreviated when clear from context.
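For concreteness, a minimal sketch of how such connectivity events could be represented is given below; the class and field names are illustrative and not part of LOCATER.

from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class ConnectivityEvent:
    """One tuple of the connectivity table: (eid, mac address, timestamp, wap)."""
    eid: int             # id of the logged association event
    mac: str             # MAC address of the connected device
    timestamp: datetime  # time at which the connection occurred
    wap: str             # WiFi access point the device connected to

# Example mirroring Fig. 1(b): device 7bfh connected to wap3.
event = ConnectivityEvent(eid=1, mac="7bfh",
                          timestamp=datetime(2019, 8, 22, 13, 4, 35), wap="wap3")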

We further divide the connectivity events based on devices: we consider the subset of connectivity events in which a specific device participated, as well as the subset of connectivity events that occur in a given time period.

Connectivity events occur stochastically even when devices are stationary and/or the signal strength is stable. Events are typically generated when (i) a device connects to a WiFi AP for the first time, (ii) the OS of the device decides to probe the available WiFi APs nearby, or (iii) the device changes its status, etc. Hence, connectivity logs do not contain an event for every instant of time a device is connected to the WiFi AP or located in the space. Because of the sporadic nature of connectivity events, we associate with each event a validity period. The value of the validity period depends on the actual device (in the Appendix we show how to estimate it); see Fig. 2 for some sample connectivity events of a device. We consider an event to be valid for its full validity period if that period does not overlap with subsequent events of the same device; otherwise, its validity interval is truncated at the timestamp of its closest subsequent event (see Fig. 2). While we assume that an event is valid for such a period, there can be portions of time in which no connectivity event is valid in the log for a specific device. We refer to such time periods as gaps. Each gap is associated with begin and end times: the gap bounded by two consecutive connectivity events of a device (see Fig. 2) starts when the validity interval of the earlier event expires and ends at the timestamp of the later event. We will alternatively denote gaps by the timestamps of the connectivity events that bound them. We further define the set of all gaps in the connectivity log of a device.

Figure 2: Connectivity events of a device and their validity.
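The following sketch shows how gaps could be derived from one device's event log given a per-device validity period; the function name and the 10-minute validity value used in the example are illustrative assumptions.

from datetime import datetime, timedelta
from typing import List, Tuple

def compute_gaps(timestamps: List[datetime],
                 validity: timedelta) -> List[Tuple[datetime, datetime]]:
    """Return (begin, end) gaps between consecutive events of one device.

    An event is valid for `validity` after its timestamp, truncated at the
    device's next event; a gap runs from where validity ends to that event.
    """
    gaps = []
    timestamps = sorted(timestamps)
    for earlier, later in zip(timestamps, timestamps[1:]):
        valid_until = min(earlier + validity, later)
        if valid_until < later:             # no valid event covers this span
            gaps.append((valid_until, later))
    return gaps

# Example: two events 14 minutes apart with a 10-minute validity period.
events = [datetime(2019, 8, 22, 13, 4, 35), datetime(2019, 8, 22, 13, 18, 11)]
print(compute_gaps(events, timedelta(minutes=10)))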

We now define the data cleaning challenges that arise in semantic localization using WiFi connectivity data.

Coarse-grained Localization. Consider a query requesting the location of a device at a given time, together with the WiFi connectivity events table. The goal of coarse-level localization is to identify the region in which the device is located at the query time.

If the query time is within the validity interval of some connectivity event of the device, then the device is assumed to be in the region covered by the corresponding AP. Otherwise, if the query time falls in a gap, the coarse-level localization approach needs to determine whether the device is inside or outside the building and, if inside, which region it is located in at that time. The coarse-grained localization problem thus consists of first detecting missing values (identifying gaps in the WiFi connectivity data of the device) and then repairing the missing location values (i.e., identifying the region the device is located in) when the query time falls within a gap.

Fine-grained Localization. Consider a query requesting the location of a device at a given time, and let the output of coarse-level localization be the region the device is located in (note that if the device is outside the building, the problem of fine-grained localization does not arise). The goal of fine-grained localization is to determine the room within that region in which the device is located.

Fine-grained localization can be viewed as a location disambiguation problem wherein the goal is to choose one of the possible rooms in the region where the device is located (based on the output of coarse-level localization).

3 Coarse-Grained Localization

In this section, we discuss how LOCATER addresses the missing value detection and repair associated with the coarse-level localization discussed above. We assume that the query time falls in a gap; otherwise, the coarse-level location of the device is simply the region covered by the AP it is connected to, as discussed above. Determining the device's location within a gap is not a trivial task. LOCATER estimates it by classifying the corresponding gap as an interval in which the device is either outside the building or inside it, in a specific region.

Classification of gaps. To classify gaps, we utilize a semi-supervised learning algorithm combined with bootstrapping techniques. The algorithm takes as input a set of historical connectivity events of a particular device over a time period consisting of a number of past days, a parameter set experimentally (discussed further in Section 6). For each gap in this history, we consider the time and date corresponding to its begin and end timestamps, and likewise the day of the week on which it begins and ends (we assume that gaps do not span multiple days). We extract the following features for each gap and represent them as a vector (a sketch of the feature extraction follows the list):

  • begin and end times: the times at which the gap begins and ends.

  • duration: the duration of the gap (i.e., its end time minus its begin time).

  • day of week: the day of the week on which the gap began (and ended).

  • begin and end regions: the regions associated with the connectivity events at the start and end of the gap.

  • connection density: the average number of logged connectivity events for the device during the same time window of the day, averaged over the days in the historical period.
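A minimal sketch of this feature extraction is given below, assuming a simplified gap representation; the density computation is likewise a simplification rather than LOCATER's exact implementation, and categorical features such as the regions would still need to be encoded before training.

from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class Gap:
    begin: datetime
    end: datetime
    begin_region: str   # region of the event that precedes the gap
    end_region: str     # region of the event that follows the gap

def gap_features(g: Gap, history_events: List[datetime], n_days: int) -> list:
    """Build the feature vector for one gap of one device."""
    in_window = [t for t in history_events
                 if g.begin.time() <= t.time() <= g.end.time()]
    density = len(in_window) / n_days                 # avg events in this daily window
    seconds = lambda t: t.hour * 3600 + t.minute * 60 + t.second
    return [seconds(g.begin), seconds(g.end),         # begin and end times of day
            (g.end - g.begin).total_seconds(),        # duration
            g.begin.weekday(), g.end.weekday(),       # day of week
            g.begin_region, g.end_region,             # bounding regions
            density]                                  # connection density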

LOCATER uses a semi-supervised learning method, combined with bootstrapping techniques, to train two logistic regression classifiers based on such vectors to label gaps: 1) as inside/outside the building, and 2) with a specific region, if inside.

Bootstrapping.

The bootstrapping process labels a gap as inside or outside the building using heuristics that take into consideration the duration of the gap (short gaps suggest inside, long gaps outside). In particular, we set two duration thresholds, a lower and a higher one, such that a gap is labeled as inside if its duration is below the lower threshold and as outside if its duration exceeds the higher threshold (different threshold values are tested in Section 6). If the duration of a gap falls between the two thresholds, we cannot label it as inside or outside using this heuristic, and such gaps are marked as unlabeled. We thus partition the set of gaps of a device into two subsets: labeled and unlabeled. For gaps that are labeled as inside the building, the heuristic further assigns the region in which the device is located by taking into account the start and end regions of the gap, as follows (a code sketch of the full heuristic appears after the list):

  • If the regions at the start and end of the gap are the same, that region is the assigned label (in other words, the device is considered to be in the same region for the entire duration of the gap).

  • Otherwise, we assign as label the region most visited by the device in the connectivity events that overlap with the gap (i.e., whose connection times fall between the gap's begin and end times).
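A sketch of this bootstrap labeling heuristic, with illustrative threshold values (the actual thresholds are parameters evaluated in Section 6):

from collections import Counter

T_LOW = 30 * 60      # assumed lower duration threshold (seconds)
T_HIGH = 8 * 3600    # assumed higher duration threshold (seconds)

def bootstrap_label(gap, overlapping_regions):
    """Return 'outside', a region id, or None (unlabeled) for one gap.

    `overlapping_regions` holds the regions of the device's historical events
    whose connection times fall inside the gap's time window.
    """
    duration = (gap.end - gap.begin).total_seconds()
    if duration >= T_HIGH:
        return "outside"                    # long gap: assume the device left
    if duration <= T_LOW:                   # short gap: assume inside, pick a region
        if gap.begin_region == gap.end_region:
            return gap.begin_region
        if overlapping_regions:
            return Counter(overlapping_regions).most_common(1)[0][0]
        return gap.begin_region             # fallback when no overlapping events exist
    return None                             # ambiguous duration: leave unlabeled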

Semi-supervised learning. We use semi-supervised learning to label the remaining (unlabeled) gaps, as described in Algorithm 1. In particular, for each device, we learn logistic regression classifiers on the labeled gaps (function TrainClassifier in Algorithm 1), which are then used to classify the unlabeled gaps associated with the device. (We assume that connectivity events exist for the device in the historical data considered, as is the case with our dataset. If data for the device does not exist, e.g., if a person enters the building for the first time, we can label such devices based on aggregated location, e.g., the most common label among other devices.)

Algorithm 1 is first executed at the building level to learn a model that classifies whether an unlabeled gap is inside or outside the building. To this end, the set of possible training labels is {inside, outside}. The method Predict returns an array of numbers from 0 to 1, where each number represents the probability of the gap being assigned to one of the labels (all numbers in the array sum to 1), together with the label with the highest probability. A larger variance in the array returned by Predict means that the probability of assigning a certain label to this gap is higher than for other gaps; thus, we use the variance of the array as the confidence value of each prediction. In each outer iteration of the loop (lines 1-11), a logistic regression classifier is first trained on the labeled set. It is then applied to all gaps in the unlabeled set. In each iteration, the gap with the highest prediction confidence is removed from the unlabeled set and added to the labeled set along with its predicted label. The algorithm terminates when the unlabeled set is empty, and the classifier trained in the last round is returned. The same process is then followed to learn a model at the region level for the gaps labeled as inside the building; in this case, the set of labels contains the regions in the building. The output is a classifier that can label a gap with the region where the device might be located.

Input: labeled gap set L, unlabeled gap set U
Output: trained classifier C
1  while U is not empty do
2          C ← TrainClassifier(L);
3          bestGap ← null; bestLabel ← null; bestConf ← −∞;
4          for each gap g in U do
5                  (probs, label) ← Predict(C, g);
6                  conf ← Variance(probs);
7                  if conf > bestConf then
8                          bestConf ← conf;
9                          bestGap ← g; bestLabel ← label;
10         U ← U \ {bestGap};
11         L ← L ∪ {(bestGap, bestLabel)};
12 return C;
Algorithm 1 Semi-supervised learning algorithm.
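For readers who prefer code, a compact sketch of the self-training loop of Algorithm 1 using scikit-learn's logistic regression is given below; it assumes numeric feature vectors and is an illustration rather than LOCATER's exact implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled):
    """Iteratively label the most confident unlabeled gap and retrain."""
    X_l, y_l = list(X_labeled), list(y_labeled)
    X_u = list(X_unlabeled)
    clf = LogisticRegression(max_iter=1000)
    while X_u:
        clf.fit(np.array(X_l), np.array(y_l))
        probs = clf.predict_proba(np.array(X_u))
        conf = probs.var(axis=1)            # variance of class probabilities = confidence
        best = int(conf.argmax())           # most confident unlabeled gap
        X_l.append(X_u.pop(best))
        y_l.append(clf.classes_[probs[best].argmax()])
    clf.fit(np.array(X_l), np.array(y_l))   # classifier trained in the last round
    return clf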

Given the two trained classifiers, for each query in which the associated device falls in a gap, we first use the inside/outside classifier to classify the gap as inside or outside the building. If the gap is classified as outside, the query can be answered directly: the device is outside the building. Otherwise, we further classify the gap using the region classifier to obtain its associated region. The device is then located in that region, and LOCATER performs room-level fine-grained localization as explained in the following section.

4 Fine-Grained Localization

Given a query for which the device has been localized to a region at the query time (e.g., as predicted by the coarse-level localization algorithm), this step determines the specific room in which the device is located at that time.

As shown in Fig. 1(b), events are logged for two devices with MAC addresses 7fbh and 3ndb, respectively. Assume that we aim to identify the room in which device 7fbh was located at 2019-08-22 13:04. Given that it was connected to wap3 at that time, the device should have been located in one of the rooms of that region. These are called the candidate rooms of the device (we omit the remaining candidate rooms 2066, 2068, 2072, 2074, and 2099 for simplicity). The main goal of the fine-grained localization approach is to identify in which candidate room the device was located.

Affinity. LOCATER bases its fine-grained location prediction on the concept of affinity which models relationships between devices and rooms.

  • Room affinity: the affinity between a device and a room (i.e., the chance of the device being located in that room at a given time), given the region in which the device is located at that time.

  • Group affinity: the affinity of a set of devices for a room at a given time (i.e., the chance of all devices in the set being located in that room at that time), given that the queried device is located in a given region at that time.

    Figure 3: Graph view in fine-grained location cleaning.

Note that the concept of group affinity generalizes that of room affinity. While room affinity is a device's conditional probability of being in a specific room given the region it is located in, the group affinity of a set of devices represents the probability of the set of devices being co-located in a specific room at a given time. We differentiate between the two since the methods we use to learn them differ, as discussed in the following section. We first illustrate how affinities affect localization prediction using the example in Fig. 3, which shows a hypergraph representing room and group affinities at a given time. An edge between a device and a room in the figure shows their room affinity; likewise, the hyperedge with the label 0.12 represents a group affinity (affinity computation is discussed in Section 4.1). If, at the query time, the other device is not online (i.e., there are no events associated with it at that time in that region), we can predict that the queried device is in room 2061, since its affinity to 2061 is the highest. On the other hand, if the other device is online at that time, the chance that the queried device is in room 2065 increases due to the group affinity. The location prediction for a device, thus, must account for both the room affinity between the device and rooms and the group affinity between groups of devices and rooms.

Room Probability. Consider the probability that a device is in a given room at a given time. Given a query and the candidate rooms, the goal of the fine-grained location prediction algorithm is to find the room with the maximum such probability at the query time. We develop an algorithm that estimates this probability based on both room and group affinities in Section 4.2. Before discussing the algorithm, we first describe how affinity values are estimated in Section 4.1.

4.1 Affinity Learning

Learning Room Affinity. One of the challenges in estimating room affinity is the potential lack of historical room-level location data for devices; collecting such data would be prohibitively expensive, especially when we consider large spaces with tens of thousands of people/devices, as in our setting. Our approach thus does not assume the availability of room-level localization data that could have been used to train specific models. (Extending our approach to cases where such data is obtainable, at least for a subset of devices, through techniques such as crowd-sourcing is interesting and part of our future work.) Instead, we compute room affinity based on the available background knowledge and space metadata.

To compute room affinity, we associate with each device a set of preferred rooms, e.g., the personal office of the device's owner (space metadata) or the rooms the owner enters most frequently (background knowledge). This set is empty if the device's owner does not have any preferred rooms. If a candidate room is one of the device's preferred rooms, we assign it the highest weight. If it is a public room, we assign it the second highest weight. Finally, if it is a private room (other than a preferred one), we assign it the lowest weight. In general, these weights are assigned so that the preferred-room weight is the highest and the private-room weight the lowest. The influence of different weight combinations is evaluated in Section 6.

We illustrate the assignment of these weights using the graph of our running example. As already pointed out, the device connects to wap3 and is thus in the corresponding region. In addition, its owner's office, room 2061, is the only preferred room, and 2065 is a public room (a meeting room); the remaining rooms in the region are personal offices associated with other devices. Based on Fig. 3, a possible assignment gives room 2061 the highest affinity, room 2065 the second highest, and each of the remaining rooms the same, lowest affinity. Note that since room affinity is not data dependent, we can pre-compute and store it to speed up computation. Furthermore, preferred rooms could be time dependent (e.g., a user is expected to be in the break room during lunch and in their office at other times). Such a time-dependent model would potentially result in more accurate room-level localization if such metadata were available.
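A sketch of one possible weight assignment is given below; the concrete weight values and the normalization over a region's candidate rooms are illustrative assumptions consistent with the example, not values prescribed by the paper.

def room_affinities(candidate_rooms, preferred_rooms, public_rooms,
                    w_pref=0.4, w_pub=0.35):
    """Assign a room-affinity weight to every candidate room of a region.

    Preferred rooms get the highest weight, public rooms the next highest,
    and the remaining (private) rooms share the leftover mass equally.
    """
    special = [r for r in candidate_rooms
               if r in preferred_rooms or r in public_rooms]
    others = [r for r in candidate_rooms if r not in special]
    used = sum(w_pref if r in preferred_rooms else w_pub for r in special)
    w_priv = (1.0 - used) / len(others) if others else 0.0
    return {r: (w_pref if r in preferred_rooms
                else w_pub if r in public_rooms else w_priv)
            for r in candidate_rooms}

# Example in the spirit of Fig. 3: office 2061 is preferred, meeting room 2065 is public.
print(room_affinities(["2059", "2061", "2065", "2066", "2068"],
                      preferred_rooms={"2061"}, public_rooms={"2065"}))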

Learning Group Affinity. Before describing how we compute group affinity, we first define the concept of device affinity, which intuitively captures the probability of a set of devices/users being part of a group and being co-located (and which serves as a basis for computing group affinity). Consider all the connectivity events of the devices in a set. Consider the subset of these events such that, for each event in the subset and for every other device in the set, there exists a connectivity event whose timestamp falls within the validity interval of the first event and in which the other device is connected to the same AP. Intuitively, such an event set represents the times when all the devices in the set are in the same area (since they are connected to the same WiFi AP). We compute the device affinity as the fraction of such intersecting events among all the considered connectivity events.
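A sketch of the device-affinity computation described above, reusing the illustrative event representation from Section 2; taking the first device's events as the denominator is our interpretation of the definition.

from datetime import timedelta

def device_affinity(events_by_device, validity=timedelta(minutes=10)):
    """Fraction of the first device's events during which every other device
    in the group has an event at the same AP within the validity interval."""
    devices = list(events_by_device)
    anchor, others = devices[0], devices[1:]
    hits = 0
    for e in events_by_device[anchor]:
        co_located = all(
            any(e.timestamp <= o.timestamp <= e.timestamp + validity
                and o.wap == e.wap
                for o in events_by_device[other])
            for other in others)
        hits += co_located
    total = len(events_by_device[anchor])
    return hits / total if total else 0.0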

Given device affinity, we can now compute the group affinity of a set of devices for a room at a given time. Let the set of intersecting rooms be the rooms common to the regions to which the devices in the set are connected at that time. (Note that the devices can be connected to different APs at that time and still all be located in the same space, since the regions covered by APs may overlap, as explained in Section 2.) If the candidate room is not one of the intersecting rooms, then the group affinity is zero. Otherwise, to compute the group affinity, we first determine the conditional probability of each device being in the candidate room given that it is in one of the intersecting rooms at that time.

This conditional probability is derived from the room affinities of the candidate room and the other intersecting rooms. We then compute the group affinity as follows:

(1)

Intuitively, group affinity captures the probability of the set of devices being in a given room (based on the room-level affinities of the individual devices), given that the (individuals carrying the) devices are co-located, which is captured by the device affinity.
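One plausible reading of Eq. (1), assuming that each device's conditional room probability is its room affinity renormalized over the intersecting rooms (the exact normalization used in the paper may differ), is the following sketch:

\[
  P\bigl(d \in r \mid d \in R_{\mathrm{int}}\bigr)
      = \frac{af(d, r)}{\sum_{r' \in R_{\mathrm{int}}} af(d, r')},
  \qquad
  GA(D, r, t) \;\approx\; af(D) \prod_{d \in D} P\bigl(d \in r \mid d \in R_{\mathrm{int}}\bigr),
\]

where af(d, r) denotes room affinity, af(D) the device affinity of the set D, and R_int the set of intersecting rooms at time t.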

We explain the notation using the example in Fig. 3(b). Assume a value for the device affinity between the two devices (not shown in the figure), and consider their set of intersecting rooms. We compute each device's conditional probability of being in room 2065 given the intersecting rooms, and combine these probabilities with the device affinity to obtain the group affinity of 0.12 shown as the hyperedge label.

4.2 Localization Algorithm

Given a query and the candidate rooms, we compute the room probability for each candidate room and select the room with the highest probability as the answer to the query. Based on the room and group affinities, we first define the set of neighbor devices of the queried device. A device is a neighbor if: (i) it is online at the query time (inside the building); (ii) it has non-zero group affinity with the queried device for at least one candidate room; and (iii) its region overlaps the region in which the queried device is located. In Fig. 3(b), the other device is a neighbor of the queried device. Essentially, neighbors of a device can influence its location prediction (since they contribute a non-zero group affinity).

Since we always use the concept of neighbor in the context of the queried device, we simplify the notation accordingly. Because processing every neighbor can be computationally expensive, the localization algorithm considers the neighbors iteratively until there is enough confidence that the unprocessed devices will not change the current answer. Let the processed set be the devices that the algorithm has considered so far. We denote the probability of a candidate room being the answer of the query given the processed devices and their locations. (A room being the answer of the query means that the queried device is in that room at the query time.) Using Bayes' rule:

(2)

where we estimate the prior probability of the device being in the candidate room using its room affinity.

We first compute this probability under the simplifying assumption that the locations of any two neighbors are conditionally independent given the room of the queried device. We then consider the case in which multiple neighbor devices may together influence the probability of the queried device being in a specific room.


Figure 4: Graph view in fine-grained location cleaning.

Independence Assumption: Under the conditional independence assumption, the probability of the processed devices' observed locations given the queried device's room factorizes into a product of per-neighbor terms, where each term is the probability of a neighbor being in its observed room given that the queried device is in the candidate room. By definition, each such term is a ratio whose numerator is the corresponding group affinity, and thus:

(3)
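A hedged sketch of Eqs. (2) and (3) in notation of our own choosing: let r range over candidate rooms, D_p be the processed neighbors, and l(d_k) denote the observed room of neighbor d_k. Then

\[
  P(r \mid D_p) = \frac{P(D_p \mid r)\, P(r)}{\sum_{r'} P(D_p \mid r')\, P(r')},
  \qquad
  P(D_p \mid r) = \prod_{d_k \in D_p} P\bigl(l(d_k) \mid r\bigr),
\]

where the prior P(r) is estimated by the room affinity of the queried device and each factor P(l(d_k) | r) is derived from the group affinity of the pair and their room affinities.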

To guarantee that our algorithm determines the answer to the query while processing as few neighbor devices as possible, we compute the minimum, maximum, and expected probability of a candidate room being the answer based on the neighbor devices processed so far. To compute these probabilities, we consider not only the processed devices but also the unprocessed ones. Thus, we consider all possible room locations (given by coarse localization) for the unprocessed devices. We call the set of all possibilities for the locations of these devices the set of possible worlds [possible-world]. For each possible world, we consider the probability of that world and the probability of the candidate room being the answer of the query given the observations of the processed devices and that possible world. We now formally define the expected/maximum/minimum probability of a candidate room over all possible worlds.

Definition 1

Given a query, a region, a set of neighbor devices, a set of processed devices, and a candidate room, the expected probability of the candidate room being the answer of the query is defined as follows:

(4)

In addition, the maximum probability of the candidate room being the answer is defined as:

(5)

The minimum probability is defined similarly.
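In the same spirit, a sketch of Eqs. (4) and (5) consistent with the verbal definitions, with I ranging over the possible worlds for the unprocessed devices:

\[
  P_{\mathrm{exp}}(r) = \sum_{I} P(I)\, P(r \mid D_p, I),
  \qquad
  P_{\max}(r) = \max_{I} P(r \mid D_p, I),
  \qquad
  P_{\min}(r) = \min_{I} P(r \mid D_p, I).
\]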

The algorithm terminates the iterations only if there exists a room whose minimum probability is at least the maximum probability of every other room. However, it is often difficult to satisfy such a strict condition in practice. Thus, we relax it using the following two conditions:

  1. (or )

  2. (or )

In Section 6 we show that these relaxed conditions enable the algorithm to terminate efficiently without sacrificing the quality of the results.
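In the same notation, a sketch of the strict termination test described above:

\[
  \exists\, r^{*} \ \text{such that} \ \forall\, r \neq r^{*}: \quad
  P_{\min}(r^{*}) \;\geq\; P_{\max}(r).
\]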

Next, the key question that arises is: how do we compute these probabilities efficiently? To compute the maximum probability of the device being in a candidate room, we can assume that all unprocessed devices are in that room, as described in the theorem below. (All proofs of theorems are given in the Appendix.)

Theorem 1

Given a set of already processed devices, a candidate room, and the possible world in which all unprocessed devices are in that room, the maximum probability of the candidate room equals its probability computed under that possible world.

Likewise, to compute the minimum probability, we can simply assume that none of the unprocessed devices is in the candidate room. The following theorem states that we can compute the minimum by placing all the unprocessed devices in the room (other than the candidate) in which the queried device has the highest chance of being at the query time (that is, the room with the highest probability of being the answer, other than the candidate).

Theorem 2

Given a set of already processed devices, a candidate room, and the possible world in which all unprocessed devices are in the room (other than the candidate) with the highest probability of being the answer, the minimum probability of the candidate room equals its probability computed under that possible world.

For the expected probability of a candidate room being the answer of the query, we prove that it equals the probability computed using only the processed devices.

Theorem 3

Given a set of independent devices, the set of already processed devices, and the candidate room, the expected probability of the candidate room equals its probability given the processed devices.

Relaxing the Independence Assumption: We next consider relaxing the conditional independence assumption made so far. In this case, we cannot treat each neighbor device independently. Instead, we divide the neighbor devices into several clusters such that every neighbor device in a cluster has non-zero group affinity with the rest of the devices in its cluster, while the group affinity of any pair of devices belonging to different clusters is zero. Fig. 4(b) shows an example with two such clusters. Naturally, the clusters partition the set of neighbor devices. In this case, we assume that each cluster affects the location prediction of the queried device independently.

Thus, the probability factorizes over the clusters. For each cluster, we compute its conditional probability: the probability that all devices in the cluster, together with the queried device, are in the candidate room, which by definition equals the corresponding group affinity. Thus,