Semantrix: A Compressed Semantic Matrix

02/27/2020 ∙ by Nieves R. Brisaboa, et al. ∙ Universidad de Chile, Universidade da Coruña

We present a compact data structure to represent both the duration and length of homogeneous segments of trajectories from moving objects in a way that, as a data warehouse, it allows us to efficiently answer cumulative queries. The division of trajectories into relevant segments has been studied in the literature under the topic of Trajectory Segmentation. In this paper, we design a data structure to compactly represent them and the algorithms to answer the more relevant queries. We experimentally evaluate our proposal in the real context of an enterprise with mobile workers (truck drivers) where we aim at analyzing the time they spend in different activities. To test our proposal under higher stress conditions we generated a huge amount of synthetic realistic trajectories and evaluated our system with those data to have a good idea about its space needs and its efficiency when answering different types of queries.


1 Introduction

In recent works, the need for analyzing trajectories at a higher abstraction level than the one offered by a sequence of GPS points has led to the definition of the concept of semantic trajectories [20, 1, 17]. Basically, the idea is to split the trajectory of a mobile object into segments (segmentation) that are relevant according to some parameter (place, speed, activity, etc.). After that, each segment is labeled with a semantically rich tag (“driving in slow traffic”, “shopping at the mall”, “working in customer facilities”, “working in the office”, “refueling”, etc.) [19, 6, 3, 7]. Semantic trajectories should, in theory, allow us to analyze trajectories at a more relevant abstraction level. However, on the one hand, there is no standard way to represent them and, on the other hand, the most relevant queries need to accumulate the duration/length (size) of the semantically homogeneous segments. Therefore, there is a real need for representing them in such a way that, as in a data warehouse, those queries can be supported efficiently.

The use of semantic trajectories has many applications. For example, it allows us to label the work/activity done by a mobile worker (e.g. a worker who moves/drives to visit customers) in each moment of the working day; or to know the state in which all the cars from a taxi company are (e.g. “traffic jam”, “normal traffic”, “stopped at semaphore”, etc); or to classify a storm in different moments of its evolution. In the context of our work, we deal with a set of trucks that collect organic waste from farms within a large area of Spain.

Semantic trajectories are complex objects. They include spatio-temporal data representing the sequence of GPS points (polyline of the trajectory segment) and the actual time instants in which the mobile object went through the points of the trajectory. They also include a textual tag that identifies the place, activity, speed, or whatever interesting aspect was used to separate a segment of the trajectory from the next segment.

Nowadays, there is no standard (or at least a usual) way to represent those multidimensional complex data. Of course, we could use GIS (Geographic Information Systems) technology to define a trajectory segments table by providing: a semantic trajectory ID, its geometry, the initial and final timestamps, the ID of the trajectory or object the segment belongs to, and the tag of the semantic trajectory. However, that GIS-based solution has two main problems: it uses too much space (as a consequence, the table would not fit into memory), and exploiting those data would be rather inefficient (not only are they on disk, but cumulative values must also be computed at query time, hence leading to a time-consuming solution). Note that the historical set of semantic trajectories from all the workers during each day could become a big data problem. Therefore, defining an effective solution (in terms of space consumption and query time) to store and analyze those data is a relevant problem.

In this work, we tackle a real problem of a truck fleet from a transport company that is interested both in monitoring which activities are being done by each truck driver during a given time period and in gathering activity patterns. We present semantrix, a compact representation for the sequence of tags associated with semantic trajectories (which in our experimental data represent activities) that lets us efficiently answer different relevant types of queries oriented to analyzing the data, particularly focusing on aggregated queries. Our proposal uses compact data structures based on bitmaps and sums-matrices to store the data in a pre-computed way (as is usual in data warehouses).

Note that we do not tackle the problem of labeling the semantic trajectories. In different contexts, such a labeling process can be done either manually by the worker himself or automatically using detection strategies or machine learning. In our case, we followed the automatic approach in [3]. In any case, we assume that trajectories have been previously split into homogeneous pieces and each segment has been correctly labeled. In the same way, we do not deal with the spatial representation of trajectories (or segments of trajectories), as we assume that both the representation of the geometries and (in general) the representation of the cartography are handled with a GIS. Therefore, our problem focuses on how to deal with the labels of the segments (the semantic information of the trajectories) and how long each segment lasts (its size/duration). Note that by knowing that a semantic trajectory from a mobile object lasts from an initial to an ending timestamp we can easily map that label over the corresponding geographic representation of the segment. Recall that each segment in the spatial database has its initial and ending timestamps.

Consequently, the target of our representation is to enable the efficient exploitation of the semantic labels when dealing with queries such as: “how many hours did my workers spend refueling during the last month?”, “how many miles on average did my workers drive to meet customers?”, or “how many of my workers had lunch between 14:00 and 15:00?”. In addition, we can also solve queries about the sequence/patterns of activities performed by a truck driver: e.g. “How many times (or who/when did) the activity driving out of the planned route was performed just after the activity driving in slow traffic on the planned route?”.

2 Basic concepts

In this section, we briefly describe some well-known data structures that make up the basic components of our proposal.

  • Bitvectors. Bitvectors are the basic components of many Compact Data Structures. A bitvector $B[1,n]$ is a sequence of zeroes and ones of length $n$. The following operations are expected to be supported:

    • $rank_1(B, i)$ returns the number of set bits in $B[1,i]$. Alternatively, $rank_0(B, i) = i - rank_1(B, i)$ counts the zeroes.

    • $select_1(B, i)$ returns the position in $B$ where the $i$-th 1 occurs. Therefore, $rank_1(B, select_1(B, i)) = i$.

    These operations can be supported in constant time by using $o(n)$ extra bits [13, 15]. There also exist compressed bitvector representations of $B$ [18, 16, 12] that still support those operations and also permit solving $access(B, i)$, which returns the original value $B[i]$.

  • FM-index. Given a text $T$ built on an alphabet $\Sigma$, the FM-index [9] provides a self-indexed representation of $T$ based on the BWT [4] of $T$ and the use of backward searching for identifying pattern occurrences. It requires space close to that of the compressed text and permits searching for the occurrences of a pattern $P$ in time that depends on the length of $P$ and on $occ$ ($occ$ being the number of occurrences of $P$ within $T$). Several variants of this scheme exist [10, 11, 8, 14] which induce different time/space tradeoffs for the counting, locating, and extracting operations.

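  • Summed Area Tables. Summed Area Tables were first introduced in computer graphics [5] to speed up mipmapping. Given a matrix $M$, for which we want to solve the operation $sum(x_1, y_1, x_2, y_2)$ (the sum of all the cells of $M$ in rows $x_1..x_2$ and columns $y_1..y_2$), the key idea of this approach is to create a new matrix $S$ where all the cells in both row $0$ and column $0$ are set to zero, and any other cell stores the total sum of all the previous cells within $M$ (to the left and up); i.e. $S[x][y] = \sum_{i \le x} \sum_{j \le y} M[i][j]$. An example showing matrices $M$ and $S$ is depicted in Figures 1(a) and 1(b). Using $S$ allows us to solve operation $sum$ in $O(1)$ time as $sum(x_1, y_1, x_2, y_2) = S[x_2][y_2] - S[x_1-1][y_2] - S[x_2][y_1-1] + S[x_1-1][y_1-1]$. Basically, from a geometric point of view, Figure 1(c) shows that to compute $sum(x_1, y_1, x_2, y_2)$ we subtract from $S[x_2][y_2]$ (the sum of all the values in $M[1..x_2][1..y_2]$) both the values in the area depicted with horizontal bars and those in the area depicted with vertical bars; these are the two prefix rectangles $S[x_1-1][y_2]$ (sum of values in $M[1..x_1-1][1..y_2]$) and $S[x_2][y_1-1]$ (sum of values in $M[1..x_2][1..y_1-1]$). Since we are subtracting the sum of values in the area depicted with both vertical and horizontal lines twice ($S[x_1-1][y_1-1]$, the sum of values in $M[1..x_1-1][1..y_1-1]$), we still have to add that value once. Consequently, we obtain $S[x_2][y_2] - S[x_1-1][y_2] - S[x_2][y_1-1] + S[x_1-1][y_1-1]$.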

    Figure 1: Summed Area Tables example.
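The summed area table idea can be illustrated with a minimal Python sketch. The function names `build_sat` and `region_sum`, as well as the 3×3 sample matrix, are hypothetical and chosen only for this illustration; the sketch uses plain Python lists rather than any compact representation.

```python
def build_sat(m):
    """Build a summed area table with an extra zero row and zero column.

    s[i][j] holds the sum of m[0..i-1][0..j-1], so row 0 and column 0 are zero.
    """
    rows, cols = len(m), len(m[0])
    s = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            s[i][j] = m[i - 1][j - 1] + s[i - 1][j] + s[i][j - 1] - s[i - 1][j - 1]
    return s


def region_sum(s, x1, y1, x2, y2):
    """Sum of the cells in rows x1..x2 and columns y1..y2 (1-based, inclusive).

    Four table lookups, independent of the size of the queried region.
    """
    return s[x2][y2] - s[x1 - 1][y2] - s[x2][y1 - 1] + s[x1 - 1][y1 - 1]


# Hypothetical sample data, not taken from Figure 1.
m = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
s = build_sat(m)
print(region_sum(s, 2, 2, 3, 3))  # cells 5, 6, 8, 9
```

The extra zero row and column avoid special cases when the queried region touches the first row or the first column.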

3 Our proposal: Semantrix

In this work, we aim at creating a compressed representation of a set of semantic trajectories/activities in such a way that we can still answer different relevant queries efficiently. In particular, we target aggregated queries. Note that our set of semantic trajectories can be gathered from the movements of several objects/vehicles over time. A rather straightforward (naive) approach would be a solution based on a classic matrix where columns represent a discretization of time, so that each column corresponds to the continuous time interval between two consecutive discretized time instants (e.g. 13:00 - 13:10). The rows represent each of the moving objects under study. Thus, a cell within this matrix contains the identifier of the (most representative) activity performed by a given mobile object during a particular time interval. For example, in the matrix in Figure 3, the car was performing the activity with id 4 from 13:20h to 13:30h.

Figure 2: Naive matrix.
Figure 3: Semantrix structure.

Semantrix structure:

With the aim of improving the previous solution, we have created a new structure named semantrix that represents all the information included in the naive original matrix, and considerably reduces the time of pattern-matching and aggregated queries. This new structure encompasses three vectors: a bitvector, an integer vector, and a vector of matrices. The former two structures permit us to compactly represent the original sequence of activities within the naive matrix. The latter vector keeps one activity matrix for each possible activity so that, for each activity, it handles aggregated information for each vehicle and time interval. Those structures, which are discussed below, are depicted in Figure 3.

  • Representing the naive matrix: the bitvector and the integer vector. Recall that the activity performed by each mobile object during each discretized time interval was stored in the naive matrix. In particular, the $i$-th row keeps the activities of the $i$-th mobile object during each of the time intervals. Those rows can be concatenated to make up a unique sequence of rows, as depicted at the top of Figure 3. Note that since all those rows have the same length (number of time intervals) we retain the same direct-access capabilities as in the naive matrix. Yet, we also have the same space needs. To compactly represent this sequence we use:

    i) A bitvector aligned with the sequence, where we set a 1 each time an activity switch occurs in the sequence; i.e. we set the first bit to 1 and then we set the bit at a position to 1 if the activity at that position differs from the activity at the previous position; we set it to 0 otherwise. Finally, we also set a 1 at the positions where a new row starts, to mark a row/mobile-object switch.

    ii) An integer vector aligned with the ones in the bitvector, which stores the identifiers of the activities from the sequence associated with those ones. Therefore, its $k$-th entry is the activity found in the sequence at the position of the $k$-th one of the bitvector. Note that this vector contains, for each mobile object, a sequence with the identifiers of the activities it performed.

  • Storing aggregated information related to each activity: the vector of activity matrices. We have included a vector of matrices (one per activity) that operates as a kind of classic data warehouse. The goal is to have cumulative information pre-computed to efficiently solve aggregated queries. Thereby, this vector contains one matrix for each possible activity in the system, with the data in each matrix pre-computed as in a Summed Area Table (see Section 2). Figure 3 shows what the cumulative activity matrices for three of the activities in our working example would look like (note that we are not showing the content of the other matrices). By using the $sum$ operation we are able to gather, for example, the number of times an activity was performed during a given time window.
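The construction of the bitvector and the integer vector described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `build_semantrix`, `rank1`, and `activity_at` are hypothetical, the sample matrix is invented, and `rank1` is a linear scan, whereas a real implementation would rely on a constant-time rank structure such as those cited in Section 2.

```python
def build_semantrix(naive):
    """Return (bits, ids) for a naive matrix given as a list of equal-length rows.

    bits has a 1 at every position where the activity changes or a new row starts;
    ids stores the activity identifier associated with each of those ones.
    """
    cols = len(naive[0])
    seq = [act for row in naive for act in row]  # concatenation of the rows
    bits, ids = [], []
    for pos, act in enumerate(seq):
        switch = pos % cols == 0 or act != seq[pos - 1]
        bits.append(1 if switch else 0)
        if switch:
            ids.append(act)
    return bits, ids


def rank1(bits, i):
    """Number of ones in bits[0..i] (0-based, inclusive). Linear-time stand-in."""
    return sum(bits[: i + 1])


def activity_at(bits, ids, cols, obj, t):
    """Activity of mobile object `obj` during time interval `t` (both 0-based)."""
    pos = obj * cols + t
    return ids[rank1(bits, pos) - 1]


# Hypothetical naive matrix: 2 mobile objects, 4 time intervals.
naive = [[1, 1, 4, 4],
         [2, 2, 2, 1]]
bits, ids = build_semantrix(naive)
print(bits, ids)
print(activity_at(bits, ids, 4, 0, 3))
```

Because every row start is marked with a 1, the integer vector never merges the last activity of one mobile object with the first activity of the next one, even when both activities share the same identifier.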

4 Supporting activities-related queries in semantrix

In our scope, we can distinguish three main types of queries. First, there are individual queries that aim at gathering the content of particular cells from the original naive matrix (e.g. Which activity was performed by a given mobile object at a given time instant?, or Which is the list of activities performed by a given mobile object during a given time interval?). There are also queries focusing on detecting whether a given pattern of activities occurred (e.g. How many times was one given activity followed by another given activity?). Finally, we also have to deal with aggregated queries aimed at unraveling the total values hidden within the matrix (e.g. How much time was actually spent by all the mobile objects while performing a given activity during a given time window?). To support these types of operations we use the different structures within semantrix.

  • Individual queries: This kind of query is easily solved using just the bitvector and the integer vector. First, with a $rank$ operation over the bitmap we obtain the position(s) of interest; then this position is used to access the integer vector and retrieve the activity/ies within the particular time window.

  • Pattern queries: For these queries, an FM-index built on top of the integer vector is used. Therefore, we use its self-indexing capabilities to efficiently locate patterns of activities. In particular, to solve the query “How many times was one activity followed by another?” we simply rely on the counting operation over the FM-index of the integer vector, searching for the two-symbol pattern made up of both activity identifiers.

  • Aggregated queries: With the help of the activity matrices and the $sum$ operation, most aggregated queries can be solved in constant time.
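The pattern and aggregated queries above can be sketched as follows. This is a simplified illustration under several assumptions: the function names and the sample matrix are hypothetical; the pattern count uses a plain scan over each object's sequence of activity switches instead of an FM-index; and the sketch explicitly avoids counting a pattern that would span two different mobile objects, which is a choice made here and not a detail stated in the text.

```python
def activity_sat(naive, activity):
    """Summed area table of the 0/1 indicator matrix of one activity."""
    rows, cols = len(naive), len(naive[0])
    s = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            hit = 1 if naive[i - 1][j - 1] == activity else 0
            s[i][j] = hit + s[i - 1][j] + s[i][j - 1] - s[i - 1][j - 1]
    return s


def intervals_spent(s, obj_lo, obj_hi, t_lo, t_hi):
    """Number of (object, interval) cells with the activity; bounds are 0-based, inclusive."""
    return (s[obj_hi + 1][t_hi + 1] - s[obj_lo][t_hi + 1]
            - s[obj_hi + 1][t_lo] + s[obj_lo][t_lo])


def count_followed_by(naive, a, b):
    """Count how many times activity a is immediately followed by activity b.

    Works on the per-object sequence of activity switches (what the integer
    vector stores); a plain scan is used here instead of an FM-index.
    """
    total = 0
    for row in naive:
        runs = [act for k, act in enumerate(row) if k == 0 or act != row[k - 1]]
        total += sum(1 for x, y in zip(runs, runs[1:]) if x == a and y == b)
    return total


# Hypothetical naive matrix: 3 mobile objects, 4 time intervals.
naive = [[1, 1, 4, 4],
         [2, 2, 2, 1],
         [1, 4, 4, 1]]
sat4 = activity_sat(naive, 4)
print(intervals_spent(sat4, 0, 2, 0, 3))  # all objects, whole time range
print(count_followed_by(naive, 1, 4))
```

Multiplying the result of `intervals_spent` by the length of the discretization interval gives the total time spent on the activity within the queried window.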

5 Experimental evaluation

It is worth recalling that the seminal idea for this work arose from a recent project carried out with a local company devoted to the transportation of organic waste. Accordingly, the actual experimental evaluation is now taking place in a real environment. Our system is being used on a daily basis to manage the activities of the trucks from the enterprise. The relevant activities for this company that we have to deal with are:

  • Being at headquarters

  • Working at a customer place

  • Normal transit on planned route

  • Slow transit on planned route

  • Normal transit out of planned route

  • Slow transit out of planned route

  • Taking a break

  • Undefined/unknown activity

  • Inactive

We present experiments comparing our proposal semantrix with other representations and show both the space needs and their performance at query time. Below, we discuss the baseline representations used, we present our test dataset and finally we show the corresponding experimental results.

Representations compared with semantrix: naive, baseline+, and Diff

We have included in our experiments the naive original matrix discussed in Section 3.

Additionally, we have implemented a more elaborate baseline named baseline+ (see Figure 4). It is based on the sequence of all the activities performed, ordered both chronologically and by moving object. It basically consists of the sequence made up of the concatenated rows from the original matrix. Yet, we have also included a set of aggregated sequences to speed up aggregated operations. There is one sequence per activity that gathers all the cumulative data in chronological order. Thereby, individual and pattern queries are solved with the activity sequence, while the cumulative sequences deal with the data warehouse-like queries.

Figure 4: Baseline+ example
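One possible reading of the cumulative sequences in baseline+ is a one-dimensional prefix sum per activity over the concatenated rows. The sketch below follows that reading, which is an assumption of this illustration rather than a confirmed detail of the implementation; the names `cumulative_sequence` and `intervals_spent_baseline` and the sample matrix are hypothetical.

```python
from itertools import accumulate


def cumulative_sequence(naive, activity):
    """Prefix sums of the 0/1 indicator of one activity over the concatenated rows.

    cum[p] is the number of occurrences of the activity in the first p positions.
    """
    flat = [1 if act == activity else 0 for row in naive for act in row]
    return [0] + list(accumulate(flat))


def intervals_spent_baseline(cum, cols, obj_lo, obj_hi, t_lo, t_hi):
    """Aggregated query over objects obj_lo..obj_hi and intervals t_lo..t_hi (0-based, inclusive).

    Each object needs its own prefix-sum difference, so the cost grows with
    the number of queried objects.
    """
    total = 0
    for obj in range(obj_lo, obj_hi + 1):
        start = obj * cols + t_lo
        end = obj * cols + t_hi + 1
        total += cum[end] - cum[start]
    return total


# Hypothetical naive matrix: 3 mobile objects, 4 time intervals.
naive = [[1, 1, 4, 4],
         [2, 2, 2, 1],
         [1, 4, 4, 1]]
cum4 = cumulative_sequence(naive, 4)
print(intervals_spent_baseline(cum4, 4, 0, 2, 1, 2))
```

Under this reading, an aggregated query costs one subtraction per queried mobile object, in contrast with the fixed number of lookups needed on a summed area table.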

Finally, we also implemented a variant of semantrix named Diff, following the ideas presented in [2], where the activity matrices are represented in a slightly more compact way. The idea is to sample some rows and to represent each non-sampled row as the differences between its actual values and those of the closest sampled row. This implies a space/time trade-off.
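The row-sampling idea behind Diff can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes that the closest sample is the nearest sampled row above, the names `compress_rows` and `cell` are hypothetical, and the example table is invented. Plain Python integers do not show the space savings, which would come from storing the smaller difference values with fewer bits.

```python
def compress_rows(s, step):
    """Keep every `step`-th row in full; store the others as differences
    with respect to the nearest sampled row above (assumption of this sketch)."""
    out = []
    for i, row in enumerate(s):
        if i % step == 0:
            out.append(list(row))  # sampled row, stored as is
        else:
            base = s[i - i % step]
            out.append([v - b for v, b in zip(row, base)])
    return out


def cell(comp, step, i, j):
    """Recover the original value s[i][j]; non-sampled rows need one extra access."""
    if i % step == 0:
        return comp[i][j]
    return comp[i - i % step][j] + comp[i][j]


# Hypothetical cumulative matrix (a summed area table with a zero row and column).
s = [[0, 0, 0],
     [0, 1, 3],
     [0, 5, 12],
     [0, 12, 27]]
comp = compress_rows(s, 2)
print(comp)
print(cell(comp, 2, 3, 2))
```

The extra addition needed for non-sampled rows is the time side of the trade-off mentioned above; a larger sampling step stores fewer full rows but makes the stored differences grow.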

Datasets

Since our system has not yet been in use for a relevant amount of time (6 months or more), there are not enough real data to test our proposal in a real environment. Nonetheless, we have generated a synthetic dataset according to the actual constraints and the current existing statistics, where we have recreated realistic information about daily truck activities in the company. We have discretized the time using 5-min intervals, which is a sensible time lapse considering the speed of the trucks. We assume a small company whose trucks work a fixed number of hours every day of the week. Assuming those preconditions and the nine activities discussed above, three datasets with different temporal sizes were created: one month, six months, and one year.

Experiments: space and query time comparison

We have compared the space requirements of the tested techniques. As shown in Figure 5, the original matrix (naive) needs, by far, the least space, as it only stores the activity values within the original matrix.

The others use roughly the same space. Yet, it is worth noting that both Diff (sampling every 4 rows) and baseline+ require less space than semantrix.

Figure 5: Space measurements

To test query performance, we have chosen one query of each type (we have skipped the results for individual queries due to space constraints; those results showed only negligible differences among all the techniques). For pattern queries we used the query “How many times was one given activity followed by another given activity?”, and for aggregated queries, we used the query “How many times were trucks 1, 2, and 3 performing a given activity from 11am to 12pm (12 time intervals, given the 5-min discretization)?”. We have measured average execution times from randomly generated queries on an Intel(R) Core(TM) i7-3820 CPU @ 3.60GHz machine running Debian GNU/Linux 9.9. Our implementations use components from the SDSL Library (https://github.com/simongog/sdsl-lite). The compiler used was g++ 6.3.0.

We can see in Figure 6 that the techniques recreating a data warehouse (semantrix, Diff, and baseline+) are much faster than naive at solving both pattern and aggregation queries. Actually, the naive approach, which must traverse the original matrix, becomes several orders of magnitude slower than both semantrix and baseline+ when solving pattern queries (Figure 6 (left)). The latter two structures, however, obtain similar results, as they both rely on an FM-index to solve pattern queries (with the difference that semantrix needs an additional access step).

Figure 6: Times for pattern queries (left), and times for aggregation queries (right)

For aggregated queries, Figure 6 (right) shows, as expected, that semantrix is clearly the fastest technique. As in the previous experiment, naive needs to explore the whole queried submatrix, whereas the semantrix variants and baseline+ benefit from their aggregated data. In this case, Diff is slower than semantrix.

6 Conclusions and future work

We have analyzed the problem of representing and managing semantic trajectories in a compact and efficient way. We present a data structure named semantrix to handle a semantic data warehouse for the trajectories from mobile objects, and we show how it supports different types of queries. The proposal works on top of the compressed activity sequence (ordered chronologically and by mobile-object identifier) which leans on a bitmap for individual and pattern-matching queries. To improve the resolution of aggregated queries a cumulative matrix for each activity was appended to our structure enabling it to solve most accumulated queries in constant time. We have experimentally evaluated the proposed solution using realistic synthetic data that represent the truck movements of a real company. As a quality proof, it is worth recalling that our system is being used as part of a real company project, solving a real life problem.


Regarding future work, the first step will be to increase the scope of this work in order to represent in a compact way also the geometry of each semantically tagged segment or semantic trajectory. This idea opens a wide new field of possibilities to perform queries combining spatial, temporal, and semantic constraints.

References


  • [1] L. O. Alvares, V. Bogorny, B. Kuijpers, J. A. F. de Macedo, B. Moelans, and A. Vaisman (2007) A model for enriching trajectories with semantic geographical information. In Proc. 15th Annual ACM Int. Symposium on Advances in Geographic Information Systems (GIS), GIS ’07, pp. 22:1–22:8. ISBN 978-1-59593-914-2.
  • [2] N. R. Brisaboa, A. Fariña, D. Galaktionov, T. V. Rodeiro, and M. A. Rodríguez (2018) New structures to solve aggregated queries for trips over public transportation networks. In Proc. 25th String Processing and Information Retrieval (SPIRE), pp. 88–101.
  • [3] N. R. Brisaboa, M. R. Luaces, C. Martínez Pérez, and Á. S. Places (2017) Semantic trajectories in mobile workforce management applications. In Proc. Web and Wireless Geographical Information Systems (W2GIS), pp. 100–115.
  • [4] M. Burrows and D. J. Wheeler (1994) A block-sorting lossless data compression algorithm. Technical report, Digital Equipment Corporation.
  • [5] F. C. Crow (1984) Summed-area tables for texture mapping. ACM SIGGRAPH computer graphics 18 (3), pp. 207–212.
  • [6] S. Dodge, P. Laube, and R. Weibel (2012) Movement similarity assessment using symbolic representation of trajectories. Int. Journal of Geographical Inf. Sci. 26 (9), pp. 1563–1588.
  • [7] R. dos Santos Mello, V. Bogorny, L. O. Alvares, L. H. Z. Santana, C. A. Ferrero, A. A. Frozza, G. A. Schreiner, and C. Renso (2019) MASTER: a multiple aspect view on trajectories. Trans. GIS 23, pp. 805–822.
  • [8] P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro (2007) Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3 (2), pp. 20.
  • [9] P. Ferragina and G. Manzini (2000) Opportunistic data structures with applications. In Proc. 41st Annual Symposium on Foundations of Computer Science (FOCS), pp. 390–398.
  • [10] P. Ferragina and G. Manzini (2001) An experimental study of a compressed index. Inf. Sci. 135 (1-2), pp. 13–28.
  • [11] P. Ferragina and G. Manzini (2005) Indexing compressed text. J. ACM 52 (4), pp. 552–581.
  • [12] A. Golynski, R. Grossi, A. Gupta, R. Raman, and S. S. Rao (2007) On the size of succinct indices. In Proc. 15th Annual European Symposium on Algorithms (ESA), LNCS 4698, pp. 371–382. ISBN 978-3-540-75520-3.
  • [13] G. Jacobson (1989) Space-efficient static trees and graphs. In Proc. 30th IEEE Symposium on Foundations of Computer Science (FOCS), pp. 549–554.
  • [14] V. Mäkinen and G. Navarro (2005) Succinct suffix arrays based on run-length encoding. Nord. J. Comput. 12 (1), pp. 40–66.
  • [15] I. Munro (1996) Tables. In Proc. 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), LNCS 1180, pp. 37–42.
  • [16] D. Okanohara and K. Sadakane (2007) Practical entropy-compressed rank/select dictionary. In Proc. 9th Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 60–70.
  • [17] C. Parent, S. Spaccapietra, C. Renso, G. L. Andrienko, N. V. Andrienko, V. Bogorny, M. L. Damiani, A. Gkoulalas-Divanis, J. A. F. de Macêdo, N. Pelekis, Y. Theodoridis, and Z. Yan (2013) Semantic trajectories modeling and analysis. ACM Comput. Surv. 45 (4), pp. 42:1–42:32.
  • [18] R. Raman, V. Raman, and S. S. Rao (2002) Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 233–242.
  • [19] Z. Yan, D. Chakraborty, C. Parent, S. Spaccapietra, and K. Aberer (2013) Semantic trajectories: mobility data computation and annotation. ACM TIST 4 (3), pp. 49:1–49:38.
  • [20] Z. Yan, C. Parent, S. Spaccapietra, and D. Chakraborty (2010) A hybrid model and computing platform for spatio-semantic trajectories. In Proc. 7th Extended Semantic Web Conference (ESWC), pp. 60–75.