1. Introduction
Stream data processing has received much attention from the database community, due to its wide spectrum of real-world applications such as online clustering (Hyde et al., 2017), intrusion detection (Dhanabal and Shantharajah, 2015), sensor data monitoring (Abadi et al., 2004), object identification (Hong et al., 2016), location-based services (Hong et al., 2012), IP network traffic analysis (Fusco et al., 2010), Web log mining (Carbone et al., 2015), moving object search (Yu et al., 2013), event matching (Song et al., 2017a), and many others. In these applications, data objects from streams (e.g., sensory data samples) may sometimes contain missing attributes, for various reasons like packet losses, transmission delays/failures, and so on. It is therefore rather challenging to manage and process such streams with incomplete data effectively and efficiently.
In this paper, we will study the join operator between incomplete data streams (i.e., streaming objects with missing attributes), which has real applications such as network intrusion detection, online clustering, sensor networks, and data integration.
We have the following motivation example for the join over incomplete data streams in the application of network intrusion detection.
[Table 1. Sample (incomplete) statistical data collected from two routers, with columns: router ID, object ID from stream, No. of connections, connection duration (min), and transferred data size (GB); missing attribute values are denoted by "-".]
Example 1.1 (Monitoring Network Intrusion). Figure 1 illustrates two critical routers in an IP network, from which we collect statistical (log) attributes in a streaming manner, for example, the number of connections, the connection duration, and the transferred data size. In practice, due to packet losses, network congestion/delays, or hardware failures, we may not always obtain all attributes from each router. As an example in Table 1, the transferred data size of one router is missing (denoted as "-") at some timestamp. As a result, stream data collected from each router may sometimes contain incomplete attributes.
One critical, yet challenging, problem in the network is to monitor network traffic and detect potential network intrusion. If one router is under a network intrusion attack, we should quickly identify potential attacks in other routers at close timestamps, so that we may take actions to protect network security. In this case, it is very important to conduct the join over (incomplete) router data streams and monitor similar patterns/behaviors from the two routers. The resulting join pairs can be used to effectively detect network intrusion events in routers.
In Example 1.1, the join on incomplete router data streams monitors pairs of (potentially incomplete) objects from streams whose Euclidean distances are within some user-specified threshold. Due to the incompleteness of objects, it is rather challenging to accurately infer missing attributes and to effectively calculate the distance between two incomplete objects with missing attributes. For example, as depicted in Table 1, it is not trivial to compute the distance between an object with a missing attribute (transferred data size) from one router and any object from the other router.
Inspired by the example above, in this paper, we formally define the join over incomplete data streams (JoiniDS), which continuously monitors pairs of similar (incomplete) objects from two incomplete data streams with high confidences. In addition to the application of network intrusion detection (as shown in Example 1.1), the JoiniDS problem is also useful for many other real applications, such as sensor data monitoring and data integration.
One straightforward method to solve the JoiniDS problem is to conduct the imputation over data streams, followed by join processing over the two imputed streams. However, this method is not efficient, due to high imputation and join costs, and it may not meet the requirements of stream processing (e.g., short response times).
To tackle the JoiniDS problem efficiently and effectively, in this paper, we propose an effective and adaptive imputation approach to turn incomplete data objects into complete ones, devise cost-model-based imputation indexes and a data synopsis for data streams, and design an efficient algorithm to simultaneously handle data imputation and JoiniDS processing.
Differences from Prior Works. While many prior works studied the join operator over complete data streams (Das et al., 2003; Lin et al., 2015) or uncertain data streams (Lian and Chen, 2010, 2009), they all assume that data streams are complete, and that streaming objects do not have any missing attributes. To the best of our knowledge, no previous works considered the join operator over incomplete data streams (i.e., JoiniDS). To turn incomplete data records into complete ones, one straightforward way is to set the missing attribute values to 0, that is, to ignore the missing attributes. However, this method may overestimate (or underestimate) the distance between objects from data streams and cause wrong join results. Instead, in this paper, we adopt differential dependency (DD) rules (Song and Chen, 2011) to impute the possible values of missing attributes of data objects from incomplete data streams.
Most importantly, in this paper, we will propose efficient JoiniDS processing algorithms that enable data imputation and join processing at the same time, by designing cost-model-based, space-efficient index structures and effective pruning strategies.
In this paper, we make the following major contributions:

We formalize a novel and important problem, join over incomplete data streams (JoiniDS), in Section 2.

We propose effective and efficient cost-model-based data imputation techniques via DD rules in Section 3.

We devise effective pruning strategies to reduce the JoiniDS search space in Section 4.

We design an efficient JoiniDS processing algorithm via data synopsis/indexes in Section 5.

We evaluate, through extensive experiments, the performance of our JoiniDS approach on real/synthetic data in Section 6.
2. Problem Definition
In this section, we formally define the problem of the join over incomplete data streams (JoiniDS), which takes into account the missing attributes in the process of the stream join.
2.1. Incomplete data stream
We first define two terms, incomplete data stream and sliding window, below.
Definition 2.1 (Incomplete Data Stream). An incomplete data stream contains an ordered sequence of objects. Each object arrives at some timestamp and has a number of attributes, some of which may be missing, denoted as "-".
In Definition 2.1, at each timestamp, an object from the incomplete data stream arrives. Each object may be an incomplete object, containing some missing attributes.
Following the literature on data streams, in this paper, we consider the sliding window model (Ananthakrishna et al., 2003) over incomplete data streams.
Definition 2.2 (Sliding Window). Given an incomplete data stream, a window size, and the current timestamp, a sliding window contains an ordered set of the most recent objects from the stream.
In Definition 2.2, the sliding window contains all objects from the stream arriving within the most recent time interval. To incrementally maintain the sliding window, at each new timestamp, a new sliding window can be obtained by adding the newly arriving object and removing the old (expired) object.
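The incremental maintenance described above can be sketched in a few lines (a minimal illustration; the function name `slide` and the use of a plain deque of arrival timestamps are our own, not from the paper):

```python
from collections import deque

def slide(window, new_obj, w):
    """Count-based sliding window: append the newly arriving object,
    then evict (and return) the oldest object once the window holds
    more than w objects; return None if nothing expires."""
    window.append(new_obj)
    return window.popleft() if len(window) > w else None

# Feed five objects (identified by their arrival timestamps 0..4)
# through a window of size w = 3.
window = deque()
expired = [slide(window, t, w=3) for t in range(5)]
# The window now holds the 3 most recent objects: 2, 3, 4.
```

Under the count-based model exactly one object enters, and at most one expires, per timestamp, so each window update takes O(1) time.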
Note that the sliding window we adopt in this paper is the count-based one (Ananthakrishna et al., 2003). For other models such as the time-based sliding window (Tao and Papadias, 2006) (allowing more than one object to arrive at each timestamp), we can easily extend our problem by replacing each object with the set of objects arriving simultaneously at that timestamp, which we leave as future work.
2.2. Imputation Over Incomplete Data Streams
In this paper, we adopt differential dependency (DD) rules (Song and Chen, 2011) as our imputation approach for inferring the missing attributes of incomplete data objects from the stream. By using DD rules, incomplete data streams can be turned into imputed data streams. We leave the consideration of other imputation methods (e.g., multiple imputation (Royston, 2004), editing rules (Fan et al., 2010), relational dependency networks (Mayfield et al., 2010), etc.) as future work.
Differential Dependency (DD). A differential dependency (DD) (Song and Chen, 2011) reveals correlation rules among attributes in a data set, which can be used for imputing the missing attributes of incomplete objects. As an example, given a table with 3 attributes A, B, and C, a DD rule can be of the form B → A, with two distance constraints, ε_B and ε_A, on attributes B and A, respectively. This DD rule implies that if any two objects have their attribute B values within distance ε_B of each other, then their values of attribute A must also be within distance ε_A of each other.
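As a small illustration of the DD semantics above (a sketch with invented attribute names B, C, A and invented distance bounds; the helper `satisfies` is ours), checking the determinant side of a DD amounts to verifying per-attribute distance constraints:

```python
def satisfies(constraints, obj1, obj2):
    """Differential-function check: for each attribute a with distance
    bound eps, require |obj1[a] - obj2[a]| <= eps."""
    return all(abs(obj1[a] - obj2[a]) <= eps
               for a, eps in constraints.items())

# A (combined) DD rule {B, C} -> A with bounds 0.1 on B and 0.2 on C.
dd_lhs = {"B": 0.1, "C": 0.2}
o1 = {"A": 0.50, "B": 0.30, "C": 0.40}
o2 = {"A": 0.52, "B": 0.35, "C": 0.55}
lhs_match = satisfies(dd_lhs, o1, o2)  # both determinant constraints hold
```

When `lhs_match` holds, the DD asserts that the two A-values must also be close, which is exactly what makes such rules usable for imputation.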
Formally, we give the definition of the DD rule as follows.
Definition 2.3 (Differential Dependency). A differential dependency (DD) rule is of the form X → A, where X is a set of determinant attributes, A is a dependent attribute (A ∉ X), and a differential function specifies the distance constraints on the attributes in X ∪ {A}.
In Definition 2.3, given a DD rule X → A, if two data objects satisfy the differential function on the determinant attributes X, then they will have similar values on the dependent attribute A.
Missing Value Imputation via DDs and a Data Repository. DD rules can achieve good imputation performance even on sparse data sets, since they tolerate differential differences between attribute values (Song and Chen, 2011).
In this paper, we assume that a static data repository (containing complete objects without missing attributes) is available for imputing missing attributes of objects from data streams. Given a DD rule and an incomplete object, we can obtain possible values of a missing attribute by leveraging all complete objects in the repository that satisfy the differential function w.r.t. the determinant attributes.
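A minimal sketch of this repository-based imputation (the attribute names, the determinant bound, and the frequency-based weighting below are illustrative assumptions, not the paper's exact implementation):

```python
from collections import Counter

def impute_candidates(repo, incomplete, det_constraints, dep_attr):
    """Collect dependent-attribute values from repository objects that
    satisfy the DD's distance constraints on the determinant attributes,
    weighting each candidate value by its relative frequency."""
    matches = [r[dep_attr] for r in repo
               if all(abs(r[a] - incomplete[a]) <= eps
                      for a, eps in det_constraints.items())]
    n = len(matches)
    return {v: c / n for v, c in Counter(matches).items()} if n else {}

# DD rule B -> A with bound 0.05 on B; attribute A of the object is missing.
repo = [{"B": 0.30, "A": 0.5}, {"B": 0.32, "A": 0.5},
        {"B": 0.35, "A": 0.7}, {"B": 0.90, "A": 0.9}]
candidates = impute_candidates(repo, {"B": 0.31, "A": None}, {"B": 0.05}, "A")
# candidates: value 0.5 with probability 2/3, value 0.7 with probability 1/3
```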
Specifically, any imputed value is associated with an existence probability, defined as the fraction of complete objects taking that value on the dependent attribute among all complete objects in the repository that satisfy the distance constraints with the incomplete object on the determinant attribute(s).

Imputed Data Stream. With DDs and the data repository, we can turn incomplete data streams into imputed data streams, defined as follows.
Definition 2.4 (Imputed Data Stream). Given an incomplete data stream, DD rules, and a static data repository, the imputed (uncertain) data stream is composed of imputed (probabilistic) objects, obtained by imputing the missing attribute values of incomplete objects via the DDs and the repository.
Each imputed object contains a number of probabilistic instances with existence confidences, where the instances are mutually exclusive and their confidences sum up to 1.
In Definition 2.4, each object in the imputed data stream is complete, containing a set of probabilistic instances. In this paper, for each instance, we calculate its existence probability as the product of the confidences of the (imputed) attribute values of that instance.
Possible Worlds Over the Imputed Data Stream. We consider possible worlds (Dalvi and Suciu, 2007) over the sliding window of the imputed data stream, which are materialized instances of the sliding window that may appear in reality.
Definition 2.5 (Possible Worlds of the Imputed Data Stream). Given a sliding window of an imputed data stream, a possible world is composed of object instances, such that these instances cover all imputed objects in the window and each instance comes from a distinct imputed object.
The appearance probability, Pr{pw(W)}, of a possible world pw(W) over the sliding window W can be calculated by:

(1)   Pr{pw(W)} = ∏_{o' ∈ pw(W)} o'.p,

where o'.p is the existence confidence of the object instance o'.
In Definition 2.5, each imputed object contributes one instance to the possible world, making each possible world a combination of instances from the imputed objects in the sliding window.
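For tiny windows, the possible-world semantics and the appearance probabilities above can be enumerated directly (a sketch; the representation of instances as (value, confidence) pairs is our own, and real windows make this enumeration exponential, which is precisely what later sections avoid):

```python
from itertools import product

def possible_worlds(imputed_objects):
    """Each imputed object is a list of (instance, confidence) pairs whose
    confidences sum to 1. A possible world picks one instance per object;
    its appearance probability is the product of the picked confidences."""
    worlds = []
    for combo in product(*imputed_objects):
        instances = tuple(inst for inst, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        worlds.append((instances, prob))
    return worlds

o1 = [("x1", 0.6), ("x2", 0.4)]   # an imputed object with two instances
o2 = [("y1", 0.5), ("y2", 0.5)]
worlds = possible_worlds([o1, o2])  # 2 x 2 = 4 possible worlds
total = sum(p for _, p in worlds)   # appearance probabilities sum to 1
```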
2.3. Join Over Incomplete Data Streams
The JoiniDS Problem. Now, we are ready to formally define the join over incomplete data streams (JoiniDS).
Definition 2.6 (Join Over Incomplete Data Streams, JoiniDS). Given two incomplete data streams, a distance threshold ε, a current timestamp, and a probabilistic threshold α, the join over incomplete data streams (JoiniDS) continuously monitors the pairs of incomplete objects, (o_a, o_b), within the two sliding windows, respectively, that are similar with probability greater than α, that is:

(2)   Pr_{JoiniDS}(o_a, o_b) = Σ_{∀pw_a} Σ_{∀pw_b} Pr{pw_a} · Pr{pw_b} · χ(dist(o_a', o_b') ≤ ε) > α,

where o_a' and o_b' are the instances of the imputed objects o_a and o_b in the possible worlds pw_a and pw_b, respectively, dist(·, ·) is the Euclidean distance function, and the indicator function χ(z) returns 1 if condition z holds (or 0, otherwise).
In Definition 2.6, at each timestamp, JoiniDS retrieves all pairs of incomplete objects, (o_a, o_b), from the two sliding windows such that their distance is within threshold ε with JoiniDS probability greater than or equal to α. In particular, the JoiniDS probability is given by summing up the probabilities of the possible worlds in which the object instances o_a' and o_b' are within distance ε of each other.
Challenges. There are three major challenges in tackling the JoiniDS problem. First, existing works often assume that objects from data streams are either complete (Das et al., 2003; Lin et al., 2015) or uncertain (Lian and Chen, 2010, 2009); this assumption may not always hold in practice, due to reasons such as transmission delays or packet losses. Moreover, it is non-trivial to obtain possible values of missing attributes. To the best of our knowledge, no prior work has studied the join operator over incomplete data streams. Thus, we need to specifically design effective and efficient imputation strategies to impute incomplete objects from the two streams.
Second, it is very challenging to efficiently solve the JoiniDS problem under the possible worlds semantics (Dalvi and Suciu, 2007). The direct computation of the JoiniDS probability in Definition 2.6 (i.e., materializing all possible worlds of two incomplete data streams) has exponential time complexity, which is inefficient, or even infeasible. Thus, we need to devise efficient approaches to reduce the search space of our JoiniDS problem.
Third, it is non-trivial to efficiently and effectively process the join operator over data streams with incomplete objects, which involves both data imputation and join processing over the imputed data streams. To efficiently handle the JoiniDS problem, in this paper, we perform data imputation and join processing at the same time. Therefore, we need to propose efficient JoiniDS processing algorithms, supported by effective pruning strategies and indexing mechanisms.
Table 2. Commonly used symbols and their descriptions:

an incomplete data stream
an imputed (probabilistic) data stream
the most recent objects from a stream at the current timestamp (the sliding window)
the size of the sliding window
a possible world of imputed (probabilistic) objects in the sliding window
an (incomplete) object from a stream
an imputed probabilistic object in the imputed stream
a static (complete) data repository
an imputation lattice for DDs with a given dependent attribute
an index built over the data repository for imputing a given attribute
grid: a data synopsis containing imputed objects from both streams
a join set containing the result object pairs
2.4. JoiniDS Processing Framework
Algorithm 1 illustrates a framework for JoiniDS processing, which consists of three phases. In the first, pre-computation, phase, we offline establish the imputation lattices and build imputation indexes over a historical data repository for imputing each possibly missing attribute (lines 1–2). Then, in the imputation and JoiniDS pruning phase, we online maintain a data synopsis, called grid, over objects from the sliding window of each incomplete data stream. In particular, for each expired object, we remove it from the sliding window, update the grid, and update the join set w.r.t. that object (lines 3–6); for each newly arriving object, we impute it by traversing the indexes over the data repository with the help of DD rules (selected via the imputation lattice), prune false join candidates via the grid, and insert the imputed object into the grid (lines 7–9). Finally, in the JoiniDS refinement phase, we calculate the join probabilities between each newly imputed object and each non-pruned object in the grid, and return the join results (lines 10–11).
Table 2 depicts the commonly used symbols and their descriptions in this paper.
3. Imputation of incomplete objects via DDs
Data Imputation via DDs. In Section 2.2, we discussed how to impute a missing attribute of an incomplete object by a single DD. In practice, we may encounter multiple DDs with the same dependent attribute. In this case, one straightforward way is to combine all these DDs (i.e., take the union of their determinant attributes together with all their distance constraints) to impute the missing attribute. By doing this, we may obtain a more selective query range, which may lead to not only more precise imputation results, but also reduced imputation cost (i.e., a smaller query range). However, to enable the imputation, such a combination has two requirements: (1) there should be at least one sample in the data repository that satisfies the distance constraints w.r.t. all determinant attributes, and (2) the incomplete object must have complete values on all determinant attributes. Both requirements may not always hold; thus, we alternatively need to select a "good" subset of the determinant attributes to impute the missing attribute.
Imputation Lattice. We propose an imputation lattice for each possibly missing attribute, which stores the combined DDs over all possible subsets of the determinant attributes, and can be used for selecting a "good" combined DD rule. In particular, each lattice has multiple levels. Level 1 contains the original DD rules, each with a single determinant attribute; Level 2 contains the combined DDs whose determinant sets are 2-subsets of the determinant attributes; and so on. Finally, on the top level, there is only one combined DD rule, whose determinant set contains all the determinant attributes.
DD Selection Strategy. Given an imputation lattice, we select a good DD rule from it based on two principles. First, DDs on higher levels of the lattice have stronger imputation power than those on lower levels, since DDs on higher levels tend to yield more accurate imputation results at lower imputation cost. Second, for DDs on the same level, we offline estimate the expected numbers of repository objects that can be used for imputation via each DD. We design a cost model (via the fractal dimension (Belussi and Faloutsos, 1998)) for this estimation in Appendix B.1. Since a smaller expected number indicates lower imputation cost, while we need at least one sample for the imputation, we rank DDs on the same level in increasing order of their expected numbers.

Given an incomplete object with a missing attribute, we traverse the lattice from the top level down to Level 1. On each level, we access DDs in the offline pre-computed order mentioned above. For each DD we encounter, we online estimate the number of samples available for imputing the missing attribute w.r.t. the incomplete object (as given in Appendix B.1). If the expected number of objects for imputation is greater than or equal to 1, we stop the lattice traversal and use the corresponding DD for the imputation.
4. Pruning Strategies
4.1. Problem Reduction
As discussed in Section 2.3, it is inefficient, or even infeasible, to compute the join probability between two (incomplete) objects by enumerating an exponential number of possible worlds. In this subsection, we reduce the computation of the join probability, Pr_{JoiniDS}(o_a, o_b), between objects o_a and o_b from the possible-world level to the object level, and rewrite it as:

(3)   Pr_{JoiniDS}(o_a, o_b) = Σ_{o_a' ∈ o_a} Σ_{o_b' ∈ o_b} o_a'.p · o_b'.p · χ(dist(o_a', o_b') ≤ ε),

where o_a'.p and o_b'.p are the existence confidences of the instances o_a' and o_b', respectively, and the function χ(·) is given in Definition 2.6.

In Eq. (3), we consider all pairs, (o_a', o_b'), of instances of o_a and o_b, which is much more efficient than materializing all possible worlds, but still incurs O(|o_a| · |o_b|) cost, where |o| denotes the number of instances in an imputed object o. Thus, in the sequel, we will design effective pruning rules to accelerate JoiniDS processing.
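The object-level computation just described can be written down directly (a sketch; the representation of instances as ((coordinates), confidence) pairs and the helper names are ours):

```python
def join_probability(inst_a, inst_b, dist, eps):
    """Object-level join probability: sum p_a * p_b over all pairs of
    instances whose distance is within eps."""
    return sum(pa * pb
               for a, pa in inst_a
               for b, pb in inst_b
               if dist(a, b) <= eps)

def euclid(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

oa = [((0.0, 0.0), 0.7), ((1.0, 1.0), 0.3)]   # instances with confidences
ob = [((0.1, 0.0), 0.5), ((2.0, 2.0), 0.5)]
p = join_probability(oa, ob, euclid, eps=0.5)
# Only the pair ((0,0), (0.1,0)) joins, contributing 0.7 * 0.5 = 0.35.
```

The quadratic number of instance pairs examined here is exactly the cost that the pruning rules of Section 4.2 aim to avoid.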
4.2. Pruning Rules
Below, we propose two pruning strategies, object-level and sample-level pruning, to reduce the JoiniDS search space. The latter is used if an object pair cannot be pruned by the former.
Object-Level Pruning. Given two incomplete objects o_a and o_b, our first pruning rule, namely object-level pruning, utilizes the boundaries of the imputed objects and filters out the object pair if their minimum possible distance is greater than the distance threshold ε. Here, the boundary of an imputed object (also called its minimum bounding rectangle (MBR)) encloses all instances of the object and has an imputed interval for each missing attribute.
Lemma 4.1 (Object-Level Pruning). Given two incomplete objects, o_a and o_b, from the two sliding windows, respectively, if mindist(o_a, o_b) > ε holds, where mindist(·, ·) denotes the minimum distance between the MBRs of the two imputed objects, then the object pair (o_a, o_b) can be safely pruned.
Proof.
Please refer to Appendix A.1. ∎
Figure 2(a) illustrates an example of Lemma 4.1. Intuitively, if mindist(o_a, o_b) > ε holds, then the two imputed objects (MBRs) are far away from each other, and no instance pair from them can join. Thus, the object pair (o_a, o_b) can be safely pruned.
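The minimum-distance test of Lemma 4.1 reduces to per-dimension interval gaps (a sketch; representing MBRs as lists of (low, high) intervals is our own choice):

```python
def min_dist(mbr_a, mbr_b):
    """Minimum possible Euclidean distance between two MBRs, each given
    as a list of per-attribute (low, high) intervals."""
    s = 0.0
    for (la, ha), (lb, hb) in zip(mbr_a, mbr_b):
        gap = max(lb - ha, la - hb, 0.0)  # 0 when the intervals overlap
        s += gap * gap
    return s ** 0.5

def object_level_prune(mbr_a, mbr_b, eps):
    """Lemma 4.1: safe to prune when even the closest possible
    instance pair is farther apart than eps."""
    return min_dist(mbr_a, mbr_b) > eps

pruned = object_level_prune([(0.0, 0.1), (0.0, 0.1)],
                            [(0.8, 0.9), (0.8, 0.9)], eps=0.5)
# The gap is 0.7 per dimension, so min_dist is about 0.99 > 0.5: pruned.
```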
Sample-Level Pruning. The object-level pruning rule cannot filter out object pairs with nonzero, yet low, JoiniDS probabilities. Thus, we present a sample-level pruning method, which aims to rule out such false alarms by considering the instances of the imputed objects o_a and o_b.
Lemma 4.2 (Sample-Level Pruning). Given two incomplete objects o_a and o_b, and two sub-MBRs, S_a and S_b, of their respective MBRs, the object pair (o_a, o_b) can be safely pruned if mindist(S_a, S_b) > ε and 1 − P_a · P_b < α hold, where P_a is the summed probability that instances of o_a fall into the sub-MBR S_a (and similarly P_b w.r.t. S_b).
Proof.
Please refer to Appendix A.2. ∎
Figure 2(b) shows an example of the sample-level pruning in Lemma 4.2, which considers the instances of the imputed objects o_a and o_b, and uses their sub-MBRs, S_a and S_b, to enable the pruning, where S_a (resp., S_b) is a sub-MBR into which instances of o_a (resp., o_b) fall with summed probability P_a (resp., P_b). Intuitively, if both conditions of Lemma 4.2 hold, then we can prove that the object pair has a join probability below α, and it can be safely pruned.
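A sketch of the sample-level test, under an assumed reading of Lemma 4.2: if the sub-MBRs are more than ε apart, any joining instance pair must have at least one instance outside its sub-MBR, so the join probability is bounded by 1 − P_a · P_b, and the pair is pruned once this bound falls below α:

```python
def sample_level_prune(p_a, p_b, sub_mbr_min_dist, eps, alpha):
    """Assumed sample-level test: instances inside the two sub-MBRs
    (holding probability mass p_a and p_b) cannot join when the sub-MBRs
    are more than eps apart, so Pr{join} <= 1 - p_a * p_b; prune when
    this upper bound is below the probabilistic threshold alpha."""
    return sub_mbr_min_dist > eps and (1.0 - p_a * p_b) < alpha

# Sub-MBRs hold 90% of each object's probability mass and are far apart:
# the join probability is at most 1 - 0.81 = 0.19 < alpha = 0.25, so prune.
pruned = sample_level_prune(p_a=0.9, p_b=0.9, sub_mbr_min_dist=0.6,
                            eps=0.5, alpha=0.25)
```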
5. Join over incomplete data streams
In this section, we first design a data synopsis for incomplete data streams and imputation indexes over the data repository, and then propose an efficient JoiniDS processing algorithm that retrieves the join results via the synopsis/indexes.
5.1. Grid Synopsis and Imputation Indexes
Grid Over Imputed Data Streams. We incrementally maintain a data synopsis, namely grid, over the (imputed) objects from the sliding windows of the two streams. Specifically, to construct the grid, we divide the data space into equal cells with the same side length along each dimension (attribute). Each cell is associated with two queues, which sequentially store the imputed objects from the two streams, respectively, that intersect with this cell. Each imputed object contains the following information:

a set of currently accessed MBR nodes in the R-tree built over the data repository for imputation (as will be discussed later in this subsection); or,

a set of instances of the imputed object.
Figure 3 illustrates an example of the grid with two attributes. The grid divides the 2D data space into 16 (= 4 × 4) cells, each with the same side length. If an imputed object intersects with a cell, then this object is stored in a queue pointed to by that cell.
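Mapping an imputed object's MBR to the grid cells it intersects is a simple floor computation (a 2D sketch; the cell indexing and the helper name are ours):

```python
import math

def cells_for_mbr(mbr, side):
    """Return the (i, j) indices of all grid cells of side length `side`
    that a 2-D MBR, given as [(low, high), (low, high)], intersects."""
    ranges = [range(math.floor(lo / side), math.floor(hi / side) + 1)
              for lo, hi in mbr]
    return [(i, j) for i in ranges[0] for j in ranges[1]]

# An MBR spanning two cells along the first attribute, with side length 0.25.
cells = cells_for_mbr([(0.12, 0.30), (0.05, 0.05)], side=0.25)
# The object would be appended to the stream's queue in each of these cells.
```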
Imputation Indexes Over the Data Repository. To enable fast imputation, we devise one imputation index for each possibly missing attribute, such that the index has the best imputation power for that attribute. Specifically, assume that the combined DD rule on the top level of the imputation lattice (see Section 3) has the complete set of determinant attributes. Then, we construct a variant of the R-tree (Beckmann et al., 1990) over these determinant attributes of the data repository.
We divide the complete objects in the data repository into clusters and insert the clusters into the R-tree, where each cluster has a size within a specified range. We design a specific cost model to select a good cluster set; please refer to Appendix B.2 for details.
Moreover, each node in the R-tree stores a histogram over the dependent attribute, which summarizes the complete objects under that node: the domain of the dependent attribute is divided into buckets with consecutive intervals, and each bucket records the number of objects whose dependent-attribute values fall within its interval.
Figure 4 gives an example of a table with 3 attributes and two DD rules sharing the same dependent attribute. We construct an index for imputing this dependent attribute over the two determinant attributes. In this example, we first group the complete objects into clusters, and then insert these clusters into an R-tree as leaf nodes. As shown in Figure 4, each node is associated with buckets based on the distribution of the dependent attribute, where each bucket stores the count of objects and the interval of dependent-attribute values within that bucket.
5.2. JoiniDS Processing via Grid
JoiniDS via Grid. Denote the join set as the set that records all join results between the two incomplete data streams. Algorithm 2 performs the object imputation and the join at the same time, and dynamically maintains the join set (and the grid as well).
Deletion of the expired objects. At each new timestamp, Algorithm 2 removes the expired objects from the grid, and removes the object pairs containing expired objects from the join set (lines 1–2).
Object imputation and object-level pruning. Given a newly arriving incomplete object, Algorithm 2 retrieves a query range via a DD rule returned by the imputation lattice (Section 3), and obtains an initial MBR by accessing the R-tree nodes that intersect with the query range via the imputation index (line 3). Then, we check whether some cells in the grid may match with the new object (via Lemma 4.1). In particular, if a cell cannot be pruned, we further obtain instances of the imputed object and update (shrink) its MBR (lines 4–6).
Object imputation and sample-level pruning. Next, for cells that cannot be pruned by Lemma 4.1, Algorithm 2 further checks the minimum distances between sub-MBRs and the cell via the sample-level pruning (Lemma 4.2; lines 7–15). If the queues of a cell are non-empty, we check the minimum distance between the imputed object and each unchecked object in the queues of that cell (lines 9–15). Note that each object may be in one of two imputation states: (1) the object is represented by its MBR, or (2) the object is represented by some samples (with missing attributes imputed from the data repository). We call the first state "not completely imputed" and the second one "completely imputed". If an object is not completely imputed, we impute it completely via the imputation indexes and update the grid cells intersecting with it (lines 10–12). Given two completely imputed objects, we use the sample-level pruning to prune the object pair (please refer to Appendix C for the selection of the sub-MBRs; line 13). If the object pair still cannot be pruned, we compute the join probability via Eq. (3), and add the actual join pairs to the join set (lines 14–15).
Update of the grid with the new object. Algorithm 2 then inserts the new object into the queues of all cells intersecting with its MBR (lines 16–17).
Finally, similar to the processing of a new object from the first stream, we execute lines 3–17 for a newly arriving object from the other stream's sliding window (line 18).
Complexity Analysis. The time complexity of the JoiniDS algorithm in Algorithm 2 depends on the average numbers of instances in the imputed objects from the two streams, the number of grid cells intersecting with the MBR of a newly arriving object, and the average number of objects within the queues of those cells.
Discussions on the Extension of JoiniDS to Multiple Incomplete Data Streams. We can extend our JoiniDS problem over 2 incomplete data streams to multiple (say m) incomplete data streams. We only need to update the grid, that is, increase the number of queues in each cell of the grid from 2 to m. Within a cell of the grid, each queue stores objects from its corresponding incomplete data stream. With the modified grid, when a new object arrives, the imputed object pushes its join pairs into the join set, by accessing the objects from the queues in each cell.
6. Experimental Evaluation
Parameters  Values

probabilistic threshold  0.1, 0.2, 0.5, 0.8, 0.9
dimensionality  2, 3, 4, 5, 6, 10
distance threshold  0.1, 0.2, 0.3, 0.4, 0.5
the number of valid objects in the sliding window  500, 1K, 2K, 4K, 5K, 10K
the size of the data repository  10K, 20K, 30K, 40K, 50K
the number of missing attributes  1, 2, 3
6.1. Experimental Settings
Real/Synthetic Data Sets. We evaluate the performance of our JoiniDS approach on 4 real and 3 synthetic data sets.
Real data sets. We use the Intel lab data (http://db.csail.mit.edu/labdata/labdata.html), the UCI gas sensor data for home activity monitoring (http://archive.ics.uci.edu/ml/datasets/gas+sensors+for+home+activity+monitoring), the US & Canadian city weather data (https://www.kaggle.com/selfishgene/historicalhourlyweatherdata), and the S&P 500 stock data (https://www.kaggle.com/camnugent/sandp500). The Intel lab data contain 2.3 million records, collected from 54 sensors deployed in the Intel Berkeley Research lab from Feb. 28 to Apr. 5, 2004; the gas sensor data include 919,438 samples from 8 MOX gas sensors and humidity/temperature sensors; the weather data contain historical weather (temperature) data for 30 US and Canadian cities during 2012–2017; the stock data contain historical stock data for all companies on the S&P 500 index up to Feb. 2018. We extract 4 attributes from each of these 4 real data sets: temperature, humidity, light, and voltage from the Intel lab data; temperature, humidity, and the resistance of sensors 7 and 8 from the gas sensor data; the temperatures of Vancouver, Portland, San Francisco, and Seattle from the weather data; and the open, high, low, and close prices from the stock data. We normalize the intervals of the 4 attributes of each data set into [0, 1]. Then, as depicted in Table 4, we detected DD rules for each data set by considering all combinations of determinant/dependent attributes over samples in the data repository (Song and Chen, 2011).
[Table 4. DD rules detected for each of the 4 real data sets.]