Efficient Join Processing Over Incomplete Data Streams (Technical Report)

08/23/2019
by Weilong Ren, et al. (SUNY Canton)

For decades, the join operator over fast data streams has drawn much attention from the database community, due to its wide spectrum of real-world applications, such as online clustering, intrusion detection, sensor data monitoring, and so on. Existing works usually assume that the underlying streams to be joined are complete (without any missing values). However, this assumption may not always hold, since objects from streams may contain missing attributes, due to various reasons such as packet losses, network congestion/failure, and so on. In this paper, we formalize an important problem, namely join over incomplete data streams (Join-iDS), which retrieves joining object pairs from incomplete data streams with high confidence. We tackle the Join-iDS problem in the style of "data imputation and query processing at the same time". To enable this style, we design an effective and efficient cost-model-based imputation method via differential dependencies (DDs), devise effective pruning strategies to reduce the Join-iDS search space, and propose efficient algorithms via our proposed cost-model-based data synopsis/indexes. Extensive experiments have been conducted to verify the efficiency and effectiveness of our proposed Join-iDS approach on both real and synthetic data sets.


1. Introduction

Stream data processing has received much attention from the database community, due to its wide spectrum of real-world applications such as online clustering (Hyde et al., 2017), intrusion detection (Dhanabal and Shantharajah, 2015), sensor data monitoring (Abadi et al., 2004), object identification (Hong et al., 2016), location-based services (Hong et al., 2012), IP network traffic analysis (Fusco et al., 2010), Web log mining (Carbone et al., 2015), moving object search (Yu et al., 2013), event matching (Song et al., 2017a), and many others. In these applications, data objects from streams (e.g., sensory data samples) may sometimes contain missing attributes, for various reasons like packet losses, transmission delays/failures, and so on. It is therefore rather challenging to manage and process such streams with incomplete data effectively and efficiently.

In this paper, we will study the join operator between incomplete data streams (i.e., streaming objects with missing attributes), which has real applications such as network intrusion detection, online clustering, sensor networks, and data integration.

The following example motivates the join over incomplete data streams in the application of network intrusion detection.

Figure 1. The join operator over incomplete data streams for monitoring network intrusion.
router ID | object ID from stream | No. of connections | connection duration (min) | transferred data size (GB)
A | o_1 | 0.1 | 0.1 | 0.1
A | o_2 | 0.2 | 0.1 | 0.2
A | o_3 | 0.4 | 0.3 | -
B | q_1 | 0.2 | 0.1 | 0.1
B | q_2 | 0.2 | 0.2 | 0.2
B | q_3 | 0.3 | 0.3 | 0.2
Table 1. Incomplete data streams, iDS_A and iDS_B, in Figure 1.
Example 1.1.

(Monitoring Network Intrusion) Figure 1 illustrates two critical routers, A and B, in an IP network, from which we collect statistical (log) attributes in a streaming manner, for example, the number of connections, the connection duration, and the transferred data size. In practice, due to packet losses, network congestion/delays, or hardware failures, we may not always obtain all attributes from each router. As an example in Table 1, the transferred data size of router A (object o_3) is missing (denoted as “-”) at timestamp 3. As a result, stream data collected from each router may sometimes contain incomplete attributes.

One critical, yet challenging, problem in the network is to monitor network traffic and detect potential network intrusion. If one router (e.g., A) is under a network intrusion attack, we should quickly identify potential attacks in other routers, like B, at close timestamps, upon which we may take actions to protect network security. In this case, it is very important to conduct the join over (incomplete) router data streams, and monitor similar patterns/behaviors from the two routers (e.g., A and B). The resulting join pairs can be used to effectively detect network intrusion events in routers.

In Example 1.1, the join on incomplete router data streams monitors pairs of (potentially incomplete) objects from the streams whose Euclidean distances are within some user-specified threshold. Due to the incompleteness of objects, it is rather challenging to accurately infer missing attributes, and to effectively calculate the distance between two incomplete objects with missing attributes. For example, as depicted in Table 1, it is not trivial to compute the distance between object o_3 (with missing attribute transferred data size) from router A and any object q_j (for 1 ≤ j ≤ 3) from router B.

Inspired by the example above, in this paper, we formally define the join over incomplete data streams (Join-iDS), which continuously monitors pairs of similar (incomplete) objects from two incomplete data streams with high confidence. In addition to the application of network intrusion detection (as shown in Example 1.1), the Join-iDS problem is also useful for many other real applications, such as sensor data monitoring and data integration.

One straightforward method to solve the Join-iDS problem is to conduct imputation over the data streams, followed by join processing over the two imputed streams. However, this method is not efficient, due to high imputation and join costs, which may not meet the requirements of stream processing (e.g., short response times).

To tackle the Join-iDS problem efficiently and effectively, in this paper, we will propose an effective and adaptive imputation approach to turn incomplete data objects into complete ones, devise cost-model-based imputation indexes and a data synopsis for streams, and design an efficient algorithm to simultaneously handle data imputation and Join-iDS processing.

Differences from Prior Works. While many prior works studied the join operator over complete data streams (Das et al., 2003; Lin et al., 2015) or uncertain data streams (Lian and Chen, 2010, 2009), they all assume that data streams are complete, in the sense that streaming objects do not have any missing attributes. To the best of our knowledge, no previous work has considered the join operator over incomplete data streams (i.e., Join-iDS). To turn incomplete data records into complete ones, one straightforward way is to set the missing attribute values to 0, that is, to ignore the missing attribute values. However, this method may over- or underestimate the distances between objects from data streams and cause wrong join results. Instead, in this paper, we adopt differential dependency (DD) rules (Song and Chen, 2011) to impute the possible values of missing attributes of data objects from incomplete data streams.

Most importantly, in this paper, we will propose efficient Join-iDS processing algorithms to enable the data imputation and join processing at the same time, by designing cost-model-based and space-efficient index structures and efficient pruning strategies.

In this paper, we make the following major contributions:

  1. We formalize a novel and important problem, join over incomplete data streams (Join-iDS), in Section 2.

  2. We propose effective and efficient cost-model-based data imputation techniques via DD rules in Section 3.

  3. We devise effective pruning strategies to reduce the Join-iDS search space in Section 4.

  4. We design an efficient Join-iDS processing algorithm via data synopsis/indexes in Section 5.

  5. We evaluate through extensive experiments the performance of our Join-iDS approach on real/synthetic data in Section 6.

In addition, Section 7 reviews related works on stream processing, differential dependencies, the join operator, and incomplete data management. Section 8 concludes this paper.

2. Problem Definition

In this section, we formally define the problem of the join over incomplete data streams (Join-iDS), which takes into account the missing attributes in the process of the stream join.

2.1. Incomplete Data Stream

We first define two terms, incomplete data stream and sliding window, below.

Definition 2.1.

(Incomplete Data Stream, iDS) An incomplete data stream, iDS, contains an ordered sequence of objects, o_1, o_2, …, o_i, …. Each object o_i arrives at timestamp i, and has d attributes A_k (for 1 ≤ k ≤ d), some of which may be missing, denoted as “-”.

In Definition 2.1, at each timestamp i, an object o_i from incomplete data stream iDS arrives. Each object o_i may be an incomplete object, containing some missing attributes o_i[A_k].

Following the literature on data streams, in this paper, we consider the sliding window model (Ananthakrishna et al., 2003) over the incomplete data stream iDS.

Definition 2.2.

(Sliding Window, W_t) Given an incomplete data stream iDS, an integer w, and the current timestamp t, a sliding window, W_t, contains an ordered set of the w most recent objects from iDS, that is, W_t = {o_(t-w+1), o_(t-w+2), …, o_t}.

In Definition 2.2, the sliding window W_t contains all objects from iDS arriving within the time interval (t-w, t]. To incrementally maintain the sliding window, at a new timestamp (t+1), the new sliding window W_(t+1) can be obtained by adding the newly arriving object o_(t+1) to W_t and removing the old (expired) object o_(t-w+1) from W_t.

Note that the sliding window we adopt in this paper is the count-based one (Ananthakrishna et al., 2003). For other stream models such as the time-based sliding window (Tao and Papadias, 2006) (which allows more than one object to arrive at each timestamp), we can easily extend our problem by replacing each object o_t with the set of objects arriving simultaneously at timestamp t, which we would like to leave as our future work.
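To make the count-based model concrete, the following minimal Python sketch (our own illustrative helper, not code from the paper) maintains a window of size w in which exactly one object arrives, and at most one expires, per timestamp.

```python
from collections import deque

class SlidingWindow:
    """Count-based sliding window of size w (illustrative sketch)."""

    def __init__(self, w):
        self.w = w
        self.objects = deque()

    def slide(self, new_obj):
        """Insert the newly arriving object; return the expired one (if any)."""
        expired = self.objects.popleft() if len(self.objects) == self.w else None
        self.objects.append(new_obj)
        return expired

window = SlidingWindow(w=3)
# None below stands for a missing attribute value ("-" in Table 1).
for t, obj in enumerate([(0.1, 0.1), (0.2, 0.1), (0.4, None), (0.2, 0.2)], start=1):
    expired = window.slide(obj)
    print(f"t={t}: window={list(window.objects)}, expired={expired}")
```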

2.2. Imputation Over iDS

In this paper, we adopt differential dependency (DD) rules (Song and Chen, 2011) as our imputation approach for inferring the missing attributes of incomplete data objects from iDS. By using DD rules, incomplete data streams can be turned into imputed data streams. We would like to leave the consideration of other imputation methods (e.g., multiple imputation (Royston, 2004), editing rules (Fan et al., 2010), relational dependency networks (Mayfield et al., 2010), etc.) as our future work.

Differential Dependency (DD). The differential dependency (DD) (Song and Chen, 2011) reveals correlation rules among attributes in data sets, which can be used for imputing the missing attributes of incomplete objects. As an example, given a table with 3 attributes A, B, and C, a DD rule can be of the form (A → B, {φ[A], φ[B]}), where φ[A] and φ[B] are two distance constraints on attributes A and B, respectively. Assuming φ[A] = [0, ε_A] and φ[B] = [0, ε_B], this DD rule implies that if any two objects o_1 and o_2 have their attribute A values within ε_A-distance of each other (i.e., |o_1[A] − o_2[A]| ≤ ε_A), then their values of attribute B must also be within ε_B-distance of each other (i.e., |o_1[B] − o_2[B]| ≤ ε_B).
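The constraint semantics can be checked mechanically; below is a small Python sketch (the attribute names, the single-determinant rule form, and the interval shapes [0, ε] are our illustrative assumptions).

```python
def dd_holds(r1, r2, eps_a, eps_b):
    """Check the DD rule (A -> B, {phi[A], phi[B]}) on two records,
    assuming phi[A] = [0, eps_a] and phi[B] = [0, eps_b]."""
    if abs(r1["A"] - r2["A"]) <= eps_a:          # premise: A-values within eps_a
        return abs(r1["B"] - r2["B"]) <= eps_b   # then B-values must be within eps_b
    return True  # the rule imposes no constraint when the premise fails

print(dd_holds({"A": 0.10, "B": 0.30}, {"A": 0.12, "B": 0.32}, 0.05, 0.05))  # True
print(dd_holds({"A": 0.10, "B": 0.30}, {"A": 0.12, "B": 0.90}, 0.05, 0.05))  # False
```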

Formally, we give the definition of the DD rule as follows.

Definition 2.3.

(Differential Dependency, DD) A differential dependency (DD) rule is of the form (X → A_k, φ[X ∪ {A_k}]), where X is a set of determinant attributes, A_k is a dependent attribute (A_k ∉ X), and φ[X ∪ {A_k}] is a differential function that specifies the distance constraints, φ[A_j], on attributes A_j ∈ X ∪ {A_k}.

In Definition 2.3, given a DD rule (X → A_k, φ[X ∪ {A_k}]), if two data objects satisfy the differential function φ[X] on the determinant attributes X, then they will have similar values on the dependent attribute A_k (w.r.t. φ[A_k]).

Missing Value Imputation via DDs and Data Repository R. DD rules can achieve good imputation performance even on sparse data sets, since they tolerate differential differences between attribute values (Song and Chen, 2011).

In this paper, we assume that a static data repository R (containing complete objects without missing attributes) is available for imputing missing attributes from the data streams. Given a DD (X → A_k, φ[X ∪ {A_k}]) and an incomplete object o_i with missing attribute A_k, we can obtain possible values of o_i[A_k], by leveraging all complete objects in R satisfying the differential function φ[X] w.r.t. o_i on the attributes in X.

Specifically, any imputed value v of the missing attribute o_i[A_k] is associated with an existence probability v.p, defined as the fraction of complete objects taking value v on attribute A_k among all complete objects in R satisfying the distance constraints φ[X] with o_i on attribute(s) X.
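A minimal sketch of this imputation step (the interface, attribute names, and uniform |r[a] − o[a]| ≤ ε constraints are our assumptions): collect the repository objects within the DD's distance constraints of o_i on X, then convert the observed A_k-values into (value, existence probability) pairs.

```python
from collections import Counter

def impute(o, R, X, Y, eps):
    """Impute missing attribute Y of object o from repository R via a DD
    (X -> Y), assuming distance constraints of the form |r[a] - o[a]| <= eps[a]."""
    close = [r for r in R if all(abs(r[a] - o[a]) <= eps[a] for a in X)]
    counts = Counter(r[Y] for r in close)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()} if total else {}

R = [{"A": 0.10, "B": 0.3}, {"A": 0.12, "B": 0.3}, {"A": 0.11, "B": 0.4}]
o = {"A": 0.10, "B": None}  # B is the missing attribute
print(impute(o, R, X=["A"], Y="B", eps={"A": 0.05}))
# -> {0.3: 0.666..., 0.4: 0.333...}: each candidate value with its existence probability
```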

Imputed Data Stream. With DDs and data repository R, we can turn an incomplete data stream into an imputed data stream, which is defined as follows.

Definition 2.4.

(Imputed Data Stream, pDS) Given an incomplete data stream iDS, DD rules, and a static data repository R, the imputed (uncertain) data stream, pDS, is composed of imputed (probabilistic) objects, o_i^p, obtained by imputing the missing attribute values of incomplete objects o_i ∈ iDS via DDs and R.

Each imputed object o_i^p contains a number of probabilistic instances, s_1, s_2, …, s_m, with existence confidences s_l.p, where the instances are mutually exclusive and satisfy Σ_{l=1}^{m} s_l.p = 1.

In Definition 2.4, each object o_i^p in pDS is complete, containing a set of probabilistic instances s_l. In this paper, for each instance s_l, we calculate its existence probability s_l.p as the product of the confidences of the (imputed) attribute values of s_l.

Possible Worlds Over pDS. We consider possible worlds (Dalvi and Suciu, 2007), pw(W_t^p), over the sliding window, W_t^p, of the imputed data stream pDS, which are materialized instances of the sliding window that may appear in reality.

Definition 2.5.

(Possible Worlds of the Imputed Data Stream, pw(W_t^p)) Given a sliding window W_t^p of an imputed data stream pDS, a possible world, pw(W_t^p), is composed of object instances s_l, such that these instances cover all imputed objects in W_t^p and any two instances come from different imputed objects o_i^p.

The appearance probability, Pr{pw(W_t^p)}, of each possible world can be calculated by:

Pr{pw(W_t^p)} = ∏_{s_l ∈ pw(W_t^p)} s_l.p.    (1)

In Definition 2.5, each imputed object o_i^p contributes one instance to pw(W_t^p), making each possible world a combination of instances from the imputed objects in the sliding window W_t^p.
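The sketch below materializes the possible worlds of a tiny imputed window and evaluates Eq. (1); the two-object window and its instance confidences are made-up values for illustration.

```python
from itertools import product

# Each imputed object is a list of (instance value, existence confidence) pairs.
window = [
    [((0.1, 0.3), 0.6), ((0.1, 0.4), 0.4)],  # o1^p has two probabilistic instances
    [((0.2, 0.2), 1.0)],                      # o2^p is certain
]
for world in product(*window):                # one instance per imputed object
    prob = 1.0
    for _, p in world:
        prob *= p                             # Eq. (1): product of confidences
    print([v for v, _ in world], f"Pr = {prob:.2f}")
```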

2.3. Join Over Incomplete Data Streams

The Join-iDS Problem. Now, we are ready to formally define the join over incomplete data streams (Join-iDS).

Definition 2.6.

(Join Over Incomplete Data Streams, Join-iDS) Given two incomplete data streams, iDS_A and iDS_B, a distance threshold ε, the current timestamp t, and a probabilistic threshold α, the join over incomplete data streams (Join-iDS) continuously monitors pairs of incomplete objects o_i and q_j within sliding windows W_A and W_B, respectively, such that they are similar with probabilities, Pr{Join(o_i, q_j)}, greater than threshold α, that is,

Pr{Join(o_i, q_j)} = Σ_{∀pw(W_A^p)} Σ_{∀pw(W_B^p)} Pr{pw(W_A^p)} · Pr{pw(W_B^p)} · χ(dist(o_i^p, q_j^p) ≤ ε) ≥ α,    (2)

where o_i^p and q_j^p are the instances of the imputed objects o_i and q_j in possible worlds pw(W_A^p) and pw(W_B^p), respectively, dist(·, ·) is the Euclidean distance function, and function χ(z) returns 1 if z is true (or 0, otherwise).

In Definition 2.6, at timestamp t, Join-iDS retrieves all pairs of incomplete objects, (o_i, q_j), whose distances are within threshold ε with Join-iDS probabilities, Pr{Join(o_i, q_j)} (as given by Eq. (2)), greater than or equal to α, where o_i ∈ W_A and q_j ∈ W_B. In particular, the Join-iDS probability, Pr{Join(o_i, q_j)}, in Eq. (2) is given by summing up the probabilities that the object instances o_i^p and q_j^p are within ε-distance over all possible worlds, pw(W_A^p) and pw(W_B^p).

Challenges. There are three major challenges in tackling the Join-iDS problem. First, existing works often assume that objects from data streams are either complete (Das et al., 2003; Lin et al., 2015) or uncertain (Lian and Chen, 2010, 2009), and this assumption may not always hold in practice, due to reasons such as transmission delays or packet losses. Moreover, it is non-trivial to obtain possible values of missing attributes. To the best of our knowledge, no prior work has studied the join operator over incomplete data streams. Thus, we need to design effective and efficient imputation strategies to impute incomplete objects from iDS_A and iDS_B.

Second, it is very challenging to efficiently solve the Join-iDS problem under the possible worlds semantics (Dalvi and Suciu, 2007). The direct computation of Eq. (2) (i.e., materializing all possible worlds of the two incomplete data streams) has exponential time complexity, which is inefficient, or even infeasible. Thus, we need to devise efficient approaches to reduce the search space of our Join-iDS problem.

Third, it is non-trivial to efficiently and effectively process the join operator over data streams with incomplete objects, which involves both data imputation and join processing over the imputed data streams. To efficiently handle the Join-iDS problem, in this paper, we perform data imputation and join processing at the same time. Therefore, we need to propose efficient Join-iDS processing algorithms, supported by effective pruning strategies and indexing mechanisms.

Symbol | Description
iDS_A (or iDS_B) | an incomplete data stream
pDS | an imputed (probabilistic) data stream
W_A (or W_B) | the w most recent objects from stream iDS_A (or iDS_B) at timestamp t
w | the size of the sliding window
pw(W^p) | a possible world of imputed (probabilistic) objects in sliding window W^p
o_i (or q_j) | an (incomplete) object from stream iDS_A (or iDS_B)
o_i^p (or q_j^p) | an imputed probabilistic object of o_i (or q_j) in the imputed stream
R | a static (complete) data repository
lat_{A_k} | an imputation lattice for DDs with dependent attribute A_k
I_{A_k} | an index built over R for imputing attribute A_k
ε-grid | a data synopsis containing objects o_i^p and q_j^p from the streams
S_join | a join set containing object pairs (o_i, q_j)
Table 2. Symbols and descriptions.

2.4. Join-iDS Processing Framework

Algorithm 1 illustrates a framework for Join-iDS processing, which consists of three phases. In the first, pre-computation phase, we offline establish the imputation lattices lat_{A_k} (for 1 ≤ k ≤ d), and build imputation indexes I_{A_k} over a historical data repository R for imputing attribute A_k (lines 1-2). Then, in the imputation and Join-iDS pruning phase, we online maintain a data synopsis, called the ε-grid, over objects o_i (or q_j) from sliding window W_A (or W_B) of each incomplete data stream. In particular, for each expired object o_{t-w} (or q_{t-w}), we remove it from sliding window W_A (or W_B), update the ε-grid, and update the join set, S_join, w.r.t. the expired object (lines 3-6); for each newly arriving object o_t (or q_t), we impute it by traversing the indexes I_{A_k} over R with the help of DD rules (selected via the imputation lattice lat_{A_k}), prune false join candidates via the ε-grid, and insert the imputed object o_t^p (or q_t^p) into the ε-grid (lines 7-9). Finally, in the Join-iDS refinement phase, we calculate the join probabilities between o_t^p (or q_t^p) and each non-pruned object in the ε-grid, and return the join results for all objects o_t (or q_t) (lines 10-11).

Table 2 depicts the commonly-used symbols and their descriptions in this paper.

Input: two incomplete data streams iDS_A and iDS_B, a static (complete) data repository R, the current timestamp t, a sliding window size w, a distance threshold ε, and a probabilistic threshold α
Output: a join result set, S_join, over iDS_A and iDS_B
// Pre-computation Phase
1  offline establish the imputation lattices, lat_{A_k}, based on the detected DDs
2  offline construct the imputation indexes, I_{A_k}, over data repository R
// Imputation and Join-iDS Pruning Phase
3  for each expired object o_{t-w} (or q_{t-w}) at timestamp t do
4      evict o_{t-w} (or q_{t-w}) from W_A (or W_B)
5      update the ε-grid over W_A (or W_B)
6      update the join set S_join
7  for each new object o_t (or q_t) arriving at W_A (or W_B) do
8      traverse index I_{A_k} over R and the ε-grid at the same time, to simultaneously enable DD attribute imputation and join candidate pre-selection
9      insert the data information of o_t^p (or q_t^p) into the ε-grid
// Join-iDS Refinement Phase
10 calculate the join probabilities between o_t^p (or q_t^p) and each candidate in the ε-grid, and add the join pairs into S_join
11 return the join sets, S_join, for all objects o_t (or q_t) as the join results
Algorithm 1 Join-iDS Processing Framework

3. Imputation of Incomplete Objects via DDs

Data Imputation via DDs. In Section 2.2, we discussed how to impute the missing attribute A_k (for 1 ≤ k ≤ d) of an incomplete object o_i by a single DD: X → A_k. In practice, we may encounter multiple DDs with the same dependent attribute A_k: X_1 → A_k, X_2 → A_k, …, and X_n → A_k. In this case, one straightforward way is to combine all these DDs, that is, X_1X_2…X_n → A_k, to impute the missing attribute A_k. By doing this, we may obtain a more selective query range over R, which may lead to not only more precise imputation results, but also reduced imputation cost (i.e., with a smaller query range). However, to enable the imputation, such a combination has two requirements: (1) there should be at least one sample in data repository R that satisfies the distance constraints w.r.t. o_i, and (2) the incomplete object o_i must have complete values on all attributes in X_1 ∪ X_2 ∪ … ∪ X_n. Both requirements may not always hold; thus, alternatively, we need to select a “good” subset of the determinant attribute sets to impute A_k.

Imputation Lattice (lat_{A_k}). We propose an imputation lattice, lat_{A_k} (for 1 ≤ k ≤ d), which stores the combined DDs for all possible subsets of {X_1, X_2, …, X_n}, and can be used for selecting a “good” combined DD rule. In particular, each lattice lat_{A_k} has n levels. Level 1 contains the n original DD rules, with determinant attributes X_1, X_2, …, and X_n; Level 2 has C(n, 2) combined DDs, with determinant attributes such as X_1X_2 and X_1X_3; and so on. Finally, on Level n, there is only one combined DD rule, i.e., X_1X_2…X_n → A_k.

DD Selection Strategy. Given an imputation lattice lat_{A_k}, we select a good DD rule from lat_{A_k} based on two principles. First, DDs on higher levels of lat_{A_k} (e.g., Level n) have stronger imputation power than those on lower levels (e.g., Level 1), since DDs on higher levels of lat_{A_k} tend to yield more accurate imputation results and lower imputation cost. Second, for those DDs, X_i → A_k, on the same level of lat_{A_k}, we offline estimate the expected numbers, E(X_i), of objects in R that can be used for imputation via X_i → A_k. We design a cost model (via the fractal dimension (Belussi and Faloutsos, 1998)) for estimating E(X_i) in Appendix B.1. Since a smaller E(X_i) indicates lower imputation cost and we need at least one sample for imputation, we rank DDs on the same level first in increasing order of E(X_i) (for those expected to return at least one sample), and then in decreasing order of E(X_i) (for the rest).

Given an incomplete object o_i with missing attribute A_k, we traverse the lattice lat_{A_k} from Level n down to Level 1. On each level, we access the DDs in the offline pre-computed order mentioned above. For each DD we encounter, we online estimate the number of samples in R for imputing attribute A_k w.r.t. incomplete object o_i (as given in Appendix B.1). If the expected number of objects for imputation is greater than or equal to 1, we stop the lattice traversal, and use the corresponding DD for the imputation.
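A sketch of the lattice structure follows (our own simplified rendering: a combined rule is represented only by the union of its determinant attribute sets, and the cost-model-based ordering within a level is omitted).

```python
from itertools import combinations

def build_lattice(determinants):
    """Level l holds the DDs obtained by combining l of the n original rules,
    merged here by unioning their determinant attribute sets (simplification)."""
    n = len(determinants)
    return {level: [frozenset().union(*combo)
                    for combo in combinations(determinants, level)]
            for level in range(1, n + 1)}

lat = build_lattice([{"A"}, {"B"}, {"C"}])     # three original rules X1, X2, X3
for level in sorted(lat, reverse=True):        # traversal order: Level n down to 1
    print(f"Level {level}:", [sorted(x) for x in lat[level]])
```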

Our proposed data imputation approaches via DDs are verified to be effective and efficient; the empirical evaluations of their effectiveness and efficiency are illustrated in Sections 6.3 and 6.4, respectively.

4. Pruning Strategies

4.1. Problem Reduction

As given in Eq. (2) of Section 2.3, it is inefficient, or even infeasible, to compute the join probabilities between two (incomplete) objects, o_i and q_j, by enumerating an exponential number of possible worlds. In this subsection, we reduce the calculation of the join probability, Pr{Join(o_i, q_j)}, between o_i and q_j from the possible-world level to the object level, and rewrite Eq. (2) as:

Pr{Join(o_i, q_j)} = Σ_{∀s_l ∈ o_i^p} Σ_{∀s_m ∈ q_j^p} s_l.p · s_m.p · χ(dist(s_l, s_m) ≤ ε),    (3)

where s_l.p and s_m.p are the existence confidences of instances s_l and s_m, respectively, and function χ(·) is given in Definition 2.6.

In Eq. (3), we consider all pairs, (s_l, s_m), of instances s_l ∈ o_i^p and s_m ∈ q_j^p, which is much more efficient than materializing all possible worlds, but still incurs O(m²) cost, where m is the number of instances in an imputed object. Thus, in the sequel, we will design effective pruning rules to accelerate Join-iDS processing.
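Eq. (3) translates directly into a nested loop over instance pairs, as in the sketch below (the instance lists and thresholds are made-up illustration values).

```python
import math

def join_prob(inst_a, inst_b, eps):
    """Eq. (3): sum s_l.p * s_m.p over all instance pairs within eps distance."""
    return sum(pa * pb
               for va, pa in inst_a
               for vb, pb in inst_b
               if math.dist(va, vb) <= eps)

o_p = [((0.1, 0.3), 0.6), ((0.1, 0.4), 0.4)]  # instances of o_i^p: (value, confidence)
q_p = [((0.15, 0.3), 1.0)]                    # q_j^p is certain
print(join_prob(o_p, q_p, eps=0.1))           # 0.6: only the first instance pair joins
```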

Figure 2. Illustration of pruning strategies: (a) object-level pruning; (b) sample-level pruning.

4.2. Pruning Rules

Below, we propose two pruning strategies, object-level and sample-level pruning, to reduce the Join-iDS search space. The latter is used when an object pair cannot be pruned by the former.

Object-Level Pruning. Given two incomplete objects o_i and q_j, our first pruning rule, namely object-level pruning, utilizes the boundaries of the imputed objects o_i^p and q_j^p, and filters out the object pair if their minimum possible distance is greater than the distance threshold ε. Here, the boundary of an imputed object o_i^p (also called its minimum bounding rectangle (MBR)) encloses all instances of o_i^p and has an imputed interval, [o_i^p.A_k^-, o_i^p.A_k^+], for any missing attribute A_k (for 1 ≤ k ≤ d).

Lemma 4.1.

(Object-Level Pruning) Given two incomplete objects, o_i and q_j, from sliding windows W_A and W_B, respectively, if mindist(o_i^p, q_j^p) > ε holds, then the object pair (o_i, q_j) can be safely pruned.

Proof.

Please refer to Appendix A.1. ∎

Figure 2(a) illustrates an example of Lemma 4.1. Intuitively, if mindist(o_i^p, q_j^p) > ε holds, then the two imputed objects (MBRs), o_i^p and q_j^p, are far away from each other, and no instance pair from them can join (i.e., dist(s_l, s_m) > ε always holds). Thus, the object pair (o_i, q_j) can be safely pruned.
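A sketch of the Lemma 4.1 test, with MBRs represented as per-dimension [lo, hi] intervals (the representation and values are illustrative):

```python
import math

def mindist(mbr_a, mbr_b):
    """Minimum possible Euclidean distance between two MBRs."""
    s = 0.0
    for (alo, ahi), (blo, bhi) in zip(mbr_a, mbr_b):
        gap = max(blo - ahi, alo - bhi, 0.0)  # 0 when the intervals overlap
        s += gap * gap
    return math.sqrt(s)

mbr_o = [(0.1, 0.2), (0.3, 0.5)]  # imputed intervals of o_i^p per attribute
mbr_q = [(0.6, 0.7), (0.3, 0.4)]
print(mindist(mbr_o, mbr_q) > 0.3)  # True -> the pair (o_i, q_j) is pruned for eps = 0.3
```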

Sample-Level Pruning. The object-level pruning rule can only filter out object pairs with zero join probability; it cannot prune pairs with non-zero, yet low, Join-iDS probabilities (i.e., 0 < Pr{Join(o_i, q_j)} < α). Thus, we present a sample-level pruning method, which aims to rule out those false alarms with low Join-iDS probabilities, by considering the instances of the imputed objects o_i^p and q_j^p.

Lemma 4.2.

(Sample-Level Pruning) Given two incomplete objects o_i and q_j, and two sub-MBRs, M_1 of o_i^p and M_2 of q_j^p, the object pair, (o_i, q_j), can be safely pruned, if mindist(M_1, M_2) > ε and Pr{M_1} · Pr{M_2} > 1 − α hold, where Pr{M_1} is the summed probability of the instances s_l ∈ o_i^p that fall into sub-MBR M_1 (and the same for Pr{M_2} w.r.t. q_j^p).

Proof.

Please refer to Appendix A.2. ∎

Figure 2(b) shows an example of the sample-level pruning in Lemma 4.2, which considers instances of the imputed objects o_i^p and q_j^p, and uses their sub-MBRs, M_1 and M_2, to enable the pruning, where M_1 (or M_2) is a sub-MBR such that object o_i^p (or q_j^p) falls into M_1 (or M_2) with probability Pr{M_1} (or Pr{M_2}). Intuitively, if mindist(M_1, M_2) > ε and Pr{M_1} · Pr{M_2} > 1 − α hold, then we can prove that the object pair has a low join probability (i.e., Pr{Join(o_i, q_j)} < α), and can be safely pruned.
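The pruning condition itself is a one-line check once Pr{M_1}, Pr{M_2}, and mindist(M_1, M_2) are known; the sketch below encodes our reading of Lemma 4.2 (the bound 1 − Pr{M_1}·Pr{M_2} < α is the assumed reasoning).

```python
def sample_level_prune(pr_m1, pr_m2, mindist_m1_m2, eps, alpha):
    """Lemma 4.2 (sketch): instance pairs inside M1 x M2 cannot join (they are
    farther than eps apart), so the join probability is at most
    1 - pr_m1 * pr_m2; if that bound is below alpha, prune the pair."""
    return mindist_m1_m2 > eps and 1.0 - pr_m1 * pr_m2 < alpha

print(sample_level_prune(0.95, 0.95, mindist_m1_m2=0.5, eps=0.3, alpha=0.2))  # True
```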

5. Join Over Incomplete Data Streams

In this section, we first design a data synopsis over the incomplete data streams and imputation indexes over the data repository R, and then propose an efficient Join-iDS processing algorithm to retrieve the join results via the synopsis/indexes.

Figure 3. Illustration of a 2D ε-grid over incomplete data streams.

5.1. Grid Synopsis and Imputation Indexes

ε-Grid Over Imputed Data Streams. We incrementally maintain a data synopsis, namely the ε-grid, over (imputed) objects o_i^p and q_j^p from sliding windows W_A and W_B, respectively. Specifically, to construct the ε-grid, we divide the data space into equal grid cells with side length ε along each dimension (attribute A_k). Each cell, cel, is associated with two queues, cel.Q_A and cel.Q_B, which sequentially store the imputed objects o_i^p and q_j^p, respectively, that intersect with this cell cel. Each imputed object o_i^p (or q_j^p) contains the following information:

  1. a set of currently accessed MBR nodes in the R*-tree over data repository R for imputation (as will be discussed later in this subsection), or;

  2. a set of instances, s_l (or s_m), in o_i^p (or q_j^p).

Figure 3 illustrates an example of the ε-grid with two attributes A_1 and A_2. The ε-grid divides the 2D data space into 16 (= 4 × 4) cells, each with side length ε. If an imputed object o_i^p (or q_j^p) intersects with a cell cel, then this object is stored in the queue cel.Q_A (or cel.Q_B) pointed to by cell cel.
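A sketch of the cell bookkeeping (2D, unit data space; the per-cell queues Q_A/Q_B follow the description above, everything else is illustrative):

```python
from collections import defaultdict

def cells_for_mbr(mbr, side):
    """Return the (2D) grid cells intersecting an MBR given as [lo, hi] intervals."""
    (xlo, xhi), (ylo, yhi) = mbr
    return [(cx, cy)
            for cx in range(int(xlo / side), int(xhi / side) + 1)
            for cy in range(int(ylo / side), int(yhi / side) + 1)]

grid = defaultdict(lambda: {"Q_A": [], "Q_B": []})  # two queues per cell
for cel in cells_for_mbr([(0.1, 0.3), (0.4, 0.45)], side=0.25):
    grid[cel]["Q_A"].append("o1^p")                 # register an object from stream A
print(dict(grid))                                   # o1^p intersects cells (0, 1) and (1, 1)
```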

Figure 4. Imputation index I_{A_k} over repository R, given a DD (X → A_k, φ).

Imputation Indexes Over Data Repository R. To enable fast imputation, we devise d indexes, I_{A_k} (for 1 ≤ k ≤ d), each of which has the best imputation power for a possibly missing attribute A_k. Specifically, assume that the combined DD rule on Level n of the imputation lattice lat_{A_k} (see Section 3) is X_1X_2…X_n → A_k. Then, we let X = X_1 ∪ X_2 ∪ … ∪ X_n, and construct a variant of the R*-tree (Beckmann et al., 1990) over the attributes X of data repository R.

We divide the complete objects in data repository R into clusters, and insert them into the R*-tree, where the size of each cluster falls within a pre-specified range. We design a specific cost model to select a good cluster set; please refer to Appendix B.2 for details.

Moreover, each node N in the R*-tree stores a histogram, N.H, over the dependent attribute A_k, which summarizes the complete objects under N. Specifically, the domain of A_k is divided into buckets with consecutive intervals, and each bucket records the count of the objects under N whose A_k values fall within its interval.

Figure 4 gives an example of a table with 3 attributes A, B, and C, and two DD rules, A → C and B → C, with dependent attribute C. We construct an index I_C for imputing attribute C, where the combined rule is AB → C and X = {A, B}. In this example, we first put the complete objects into clusters, and then insert these clusters into an R*-tree as leaf nodes. As shown in Figure 4, each node is divided into buckets based on the distribution of the dependent attribute C, where each bucket contains the count of objects and the interval of values on attribute C in that bucket.
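A sketch of such a node histogram (equi-width buckets over a normalized [0, 1] domain are our assumption; the paper's bucket boundaries may differ):

```python
def build_histogram(values, n_buckets, lo=0.0, hi=1.0):
    """Bucket counts plus value intervals over dependent attribute A_k, letting a
    node bound its possible imputed values without touching individual objects."""
    width = (hi - lo) / n_buckets
    buckets = []
    for b in range(n_buckets):
        blo, bhi = lo + b * width, lo + (b + 1) * width
        count = sum(1 for v in values
                    if blo <= v < bhi or (b == n_buckets - 1 and v == hi))
        buckets.append({"interval": (round(blo, 3), round(bhi, 3)), "count": count})
    return buckets

print(build_histogram([0.1, 0.15, 0.4, 0.9], n_buckets=4))
```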

5.2. Join-iDS Processing via ε-Grid

Input: a join set S_join, an ε-grid synopsis, imputation indexes I_{A_k} over R, and new objects o_t and q_t from iDS_A and iDS_B
Output: the dynamically updated S_join and ε-grid
1  remove from the ε-grid those expired objects from streams iDS_A and iDS_B
2  remove from S_join the object pairs containing the expired objects
3  obtain an initial MBR of o_t^p via R*-tree nodes in index I_{A_k} and the DD rules returned by lat_{A_k}
4  if there exists some grid cell cel in the ε-grid with non-empty queue cel.Q_B such that mindist(o_t^p, cel) ≤ ε (via Lemma 4.1) then
5      obtain instances of the imputed object o_t^p by accessing objects in R via indexes I_{A_k}
6      update (shrink) the MBR of o_t^p
7  for each cell cel in the ε-grid with non-empty queue cel.Q_B that cannot be pruned via Lemma 4.1 do
8      if mindist(o_t^p.M_1, cel) ≤ ε (via Lemma 4.2) then
9          for each unchecked object q_j^p in queue cel.Q_B satisfying mindist(o_t^p, q_j^p) ≤ ε do
10             if q_j^p is not completely imputed via indexes I_{A_k} then
11                 impute q_j^p to the instance level via indexes I_{A_k}
12                 update those cells in the ε-grid intersecting with q_j^p
13             if (o_t, q_j) cannot be pruned via Lemma 4.2 then
14                 if Pr{Join(o_t, q_j)} ≥ α via Eq. (3) then
15                     add (o_t, q_j) to S_join
16 for each cell cel intersecting with the MBR of o_t^p do
17     add o_t^p to queue cel.Q_A
18 execute lines 3-17 symmetrically for the new object q_t
Algorithm 2 Join-iDS via ε-grid

Join-iDS via ε-Grid. Denote S_join as a join set that records all join results, (o_i, q_j), between the two incomplete data streams. Algorithm 2 performs object imputation and the join at the same time, and dynamically maintains the join set S_join (and the ε-grid as well).

Deletion of expired objects. At a new timestamp t, Algorithm 2 first removes the expired objects from the ε-grid, and removes the object pairs containing the expired objects from S_join (lines 1-2).

Object imputation and object-level pruning. Given a newly arriving incomplete object o_t, Algorithm 2 retrieves a query range via a DD rule returned by the imputation lattice lat_{A_k} (Section 3), and obtains an initial MBR of o_t^p, by accessing the R*-tree nodes that intersect with the query range via imputation index I_{A_k} (line 3). Then, we check whether there are some cells, cel, in the ε-grid that may match with o_t^p (via Lemma 4.1). In particular, if mindist(o_t^p, cel) ≤ ε holds, we further obtain the instances of the imputed object o_t^p, and update (shrink) the MBR of o_t^p (lines 4-6).

Object imputation and sample-level pruning. Next, for cells cel that cannot be pruned by Lemma 4.1, Algorithm 2 further checks the minimum distance, mindist(o_t^p.M_1, cel), between sub-MBR o_t^p.M_1 and cell cel via the sample-level pruning (Lemma 4.2; lines 7-15). If mindist(o_t^p.M_1, cel) ≤ ε and queue cel.Q_B is non-empty, then we check the minimum distance, mindist(o_t^p, q_j^p), between the imputed object o_t^p and each unchecked object q_j^p in the queue of cell cel (lines 9-15). Note that each object q_j^p may be in one of two imputation states: (1) q_j^p is represented by its MBR, or (2) q_j^p is represented by some samples (instances whose missing attributes are imputed from R). We call the first state “not completely imputed”, and the second one “completely imputed”. If q_j^p is not completely imputed, we impute it completely via indexes I_{A_k}, and update the cells in the ε-grid intersecting with q_j^p (lines 10-12). Given two completely imputed objects o_t^p and q_j^p, if mindist(o_t^p, q_j^p) ≤ ε, we use the sample-level pruning to prune the object pair (please refer to Appendix C for the selection of sub-MBRs M_1 and M_2; line 13). If the object pair still cannot be pruned, we check the join probability, Pr{Join(o_t, q_j)}, via Eq. (3), and add the actual join pairs to S_join (lines 14-15).

Update of the ε-grid with new object o_t. Algorithm 2 then inserts the new object o_t^p into the queues, cel.Q_A, of all cells cel intersecting with the MBR of o_t^p (lines 16-17).

Finally, similar to object o_t, we execute lines 3-17 for a newly arriving object q_t from sliding window W_B (line 18).

Complexity Analysis. The Join-iDS algorithm in Algorithm 2 requires O(N_cel · n_q · m_A · m_B) time per newly arriving object, where m_A and m_B are the average numbers of instances in the imputed objects o_t^p and q_j^p, respectively, N_cel is the number of cells intersecting with the MBR of o_t^p, and n_q is the average number of objects within the queues of these cells.

Discussions on the Extension of Join-iDS to n (> 2) Incomplete Data Streams. We can extend our Join-iDS problem from 2 incomplete data streams to multiple (e.g., n) incomplete data streams iDS_1, iDS_2, …, iDS_n. We only need to update the ε-grid, that is, increase the number of queues in each cell of the ε-grid from 2 to n. Within a cell cel of the ε-grid, each queue, cel.Q_i (1 ≤ i ≤ n), stores objects from its corresponding incomplete data stream iDS_i. With the modified ε-grid, at timestamp t, when a new object o_t arrives, the imputed object o_t^p pushes its join pairs into S_join, by accessing the objects from the other queues in each cell.

6. Experimental Evaluation

Parameters | Values
probabilistic threshold α | 0.1, 0.2, 0.5, 0.8, 0.9
dimensionality d | 2, 3, 4, 5, 6, 10
distance threshold ε | 0.1, 0.2, 0.3, 0.4, 0.5
the number, w, of valid objects in the sliding window | 500, 1K, 2K, 4K, 5K, 10K
the size, |R|, of the data repository R | 10K, 20K, 30K, 40K, 50K
the number of missing attributes | 1, 2, 3
Table 3. The parameter settings.

6.1. Experimental Settings

Real/Synthetic Data Sets. We evaluate the performance of our Join-iDS approach on 4 real and 3 synthetic data sets.

Real data sets. We use Intel lab data (http://db.csail.mit.edu/labdata/labdata.html), UCI gas sensor data for home activity monitoring (http://archive.ics.uci.edu/ml/datasets/gas+sensors+for+home+activity+monitoring), US & Canadian city weather data (https://www.kaggle.com/selfishgene/historical-hourly-weather-data), and S&P 500 stock data (https://www.kaggle.com/camnugent/sandp500), denoted as Intel, Gas, Weather, and Stock, respectively. Intel contains 2.3 million readings, collected from 54 sensors deployed in the Intel Berkeley Research lab from Feb. 28 to Apr. 5, 2004; Gas includes 919,438 samples from 8 MOX gas sensors, and humidity and temperature sensors; Weather contains historical weather (temperature) data for 30 US and Canadian cities during 2012-2017; Stock has historical stock data for all companies found on the S&P 500 index up to Feb. 2018. We extract 4 attributes from each of these 4 real data sets: temperature, humidity, light, and voltage from Intel; temperature, humidity, and the resistance of sensors 7 and 8 from Gas; Vancouver, Portland, San Francisco, and Seattle from Weather; and open, high, low, and close from Stock. We normalize the intervals of the 4 attributes of each data set to [0, 1]. Then, as depicted in Table 4, we detected DD rules for each data set, by considering all combinations of determinant/dependent attributes over the samples in data repository R (Song and Chen, 2011).

Table 4. The DD rules detected for each real data set.