Ephemeral Data Handling in Microservices - Technical Report

04/25/2019 ∙ by Saverio Giallorenzo, et al. ∙ 0

In modern application areas for software systems --- like eHealth, the Internet-of-Things, and Edge Computing --- data is encoded in heterogeneous, tree-shaped data-formats, it must be processed in real-time, and it must be ephemeral, i.e., not persist in the system. While it is preferable to use a query language to express complex data-handling logic, their typical execution engine, a database external from the main application, is unfit in scenarios of ephemeral data-handling. A better option is represented by integrated query frameworks, which benefit from existing development support tools (e.g., syntax and type checkers) and execute within the application memory. In this paper, we propose one such framework that, for the first time, targets tree-shaped, document-oriented queries. We formalise an instantiation of MQuery, a sound variant of the widely-used MongoDB query language, which we implemented in the Jolie language. Jolie programs are microservices, the building blocks of modern software systems. Moreover, since Jolie supports native tree data-structures and automatic management of heterogeneous data-encodings, we can provide a uniform way to use MQuery on any data-format supported by the language. We present a non-trivial use case from eHealth, use it to concretely evaluate our model, and to illustrate our formalism.


1 Introduction

Modern application areas for software systems—like eHealth [orszag2008evidence], the Internet of Things [Baker2017], and Edge Computing [Shi2016]—need to address two requirements: velocity and variety [mehta2004handbook]. Velocity concerns managing high throughput and real-time processing of data. Variety means that data might be represented in heterogeneous formats, complicating their aggregation, query, and storage. Recently, in addition to velocity and variety, it has become increasingly important to consider ephemeral data handling [tene2012big, shein2013ephemeral], where data must be processed in real-time but not persist — ephemeral data handling can be seen as the opposite of dark data [darkdata], which is data stored but not used. The rise of ephemeral data is due to scenarios with heavy resource constraints (e.g., storage, battery) — as in the Internet of Things and Edge Computing — or new regulations that may limit what data can be persisted, like the GDPR [Mostert2015] — as in eHealth. Programming data handling correctly can be time consuming and error-prone with a general-purpose language. Thus, often developers use a query language, paired with an engine to execute them [cheney2013practical]. When choosing the query execution engine, developers can either A) use a database management system (DBMS) executed outside of the application, or B) include a library that executes queries using the application memory. Approach A) is the most common. Since the early days of the Web, programmers integrated application languages with relational (SQL-based) DBMSs for data persistence and manipulation [welling2003php]. This pattern continues nowadays, where relational databases share the scene with new NoSQL [mehta2004handbook] DBMSs, like MongoDB [mongodb] and Apache CouchDB [couchdb], which are document-oriented. Document-oriented databases natively support tree-like nested data structures (typically in the JSON format). 
Since data in modern applications is typically structured as trees (e.g., JSON, XML), this removes the need for error-prone encoding/decoding procedures with table-based structures, as in relational databases. However, when considering ephemeral data handling, the issues of approach A) outweigh its benefits, even if we consider NoSQL DBMSs:

  1. Drivers and Maintenance. An external DBMS is an additional standalone component that needs to be installed, deployed, and maintained. To interact with the DBMS, the developer needs to import in the application specific drivers (libraries, RESTful outlets). As with any software dependency, this exposes the applications to issues of version incompatibility [dependency_hell].

  2. Security Issues. The companion DBMS is subject to weak security configurations [mongodb_security] and query injections, increasing the attack surface of the application.

  3. Lack of Tool Support. Queries to the external DBMS are typically black-box entities (e.g., encoded as plain strings), making them opaque to analysis tools available for the application language (e.g., type checkers) [cheney2013practical].

  4. Decreased Velocity and Unnecessary Persistence. Integration bottlenecks and overheads degrade the velocity of the system. Bottlenecks derive from resource constraints and slow application-DB interactions; e.g., typical database connection pools [visveswaran2000dive] represent a potential bottleneck in the context of high data-throughput. Also, data must be inserted in the database and eventually deleted to ensure ephemeral data handling. Overheads also come in the form of data format conversions (see item 5).

  5. Burden of Variety. The DBMS typically requires a specific data format for communication, forcing the programmer to develop ad-hoc data transformations to encode/decode data in transit (to insert incoming data and returning/forwarding the result of queries). Implementing these procedures is cumbersome and error-prone.

On the other side, approach B) (query engines running within the application) is less well explored, mainly because of the historical bond between query languages and persistent data storage. However, it holds potential for ephemeral data handling. Approach B) avoids issues 1 and 2 by design. Issue 3 is considerably reduced, since both queries and data can be made part of the application language. Issue 4 is also tackled by design: there are fewer resource-dependent bottlenecks and no overhead due to data insertions (there is no DB to populate) or deletions (the data disappears from the system when the process handling it terminates). Data transformation between different formats (item 5) is still an issue here since, due to variety, the developer must convert incoming/outgoing data into/from the data format supported by the query engine. Examples of implementations of approach B) are LINQ [meijer2006linq, cheney2013practical] and CQEngine [cqengine]. While LINQ and CQEngine grant good performance (velocity), variety is still an issue. Those proposals either assume an SQL-like query language or rely on a table-like format, which entails continuous, error-prone conversions between their underlying data model and the heterogeneous formats of the incoming/outgoing data.

Contribution.

Inspired by approach B), we implemented a framework for ephemeral data handling in microservices, the building blocks of software for our application areas of interest. Our framework includes a query language and an execution engine, to integrate document-oriented queries into the Jolie [MGZ14, jolieweb] programming language. The language and our implemented framework are open-source projects (https://github.com/jolie/tquery). Our choice of Jolie comes from the fact that Jolie programs are natively microservices [DGLMMMS17]. Moreover, Jolie has been successfully used to build Internet-of-Things [GGLZ18] and eHealth [wareflo] architectures, as well as Process-Aware Information Systems [montesi2016process], which makes our work directly applicable to our areas of interest. Finally, Jolie comes with a runtime environment that automatically translates incoming/outgoing data (XML, JSON, etc.) into the native, tree-shaped data values of the language (Jolie values for variables are always trees). By using Jolie, developers do not need to handle data conversion themselves, since it is efficiently managed by the language runtime. Essentially, by being integrated in Jolie, our framework addresses issue 5 by supporting variety by construction.

As the main contribution of this paper, in Section 3, we present the formal model, called TQuery, that we developed to guide the implementation of our Jolie framework. TQuery is inspired by MQuery [botoeva18], a sound variant of the MongoDB Aggregation Framework [agg_framework], the most popular query language for NoSQL data handling. The reason behind our formal model is twofold. On the one hand, we abstract away implementation details and reason on the overall semantics of our model: we favoured this top-down approach (from theory to practice) to avoid inconsistent/counter-intuitive query behaviours, which are instead present in the MongoDB Aggregation Framework (see [botoeva18] for details). On the other hand, the formalisation is a general reference for implementors; this motivated the balance we kept in TQuery between formal minimality and technical implementation details (e.g., while MQuery adopts a set semantics, we use a tree semantics). As a second contribution, in Section 2 we present a non-trivial eHealth use case to overview the TQuery operators, by means of their Jolie programming interfaces. The use case is also the first concrete evaluation of MQuery and, in Section 3, we adopt the use case as our running example to illustrate the semantics of TQuery.

2 A Use Case from eHealth

In this section, we illustrate our proposal with an eHealth use case taken from [Vigevano2018], where the authors delineate a diagnostic algorithm to detect cases of encephalopathy. The handling follows the principle of “data never leave the hospital” in compliance with the GDPR [ROSE20141212]. In the remainder of the paper, we use the use case to illustrate the formal semantics of TQuery. Hence, we do not show here the output of TQuery operators, which are reported in their respective subsections in Section 3. While the algorithm described in [Vigevano2018] considers a plethora of clinical tests to signal the presence of the neurological condition, we focus on two early markers for encephalopathy: fever in the last 72 hours and lethargy in the last 48 hours. The corresponding data, body temperature and sleep quality, is collectible by commercially-available smart-watches and smart-phones [bunn2018current]. We report in Listing 1, in a JSON-like format, code snippets exemplifying the two kinds of data structures. At lines 1–2, we have a snippet of the biometric data collected from the smart-watch of the patient. At lines 4–6 we show a snippet of the sleep logs [Thurman2018]. Both structures are arrays, marked [ ], containing tree-like elements, marked { }. At lines 1–2, for each date we have an array of detected temperatures (t) and heart-rates (hr). At lines 4–6, to each year (y) corresponds an array of monthly (M) measures, to a month (m), an array of daily (D) logs, and to a day (d), an array of logs (L), each representing a sleep session with its start (s), end (e) and quality (q).

1 [{date:20181129,t:[37,...],hr:[64,...]},
2  {date:20181130,t:[36,...],hr:[66,...]},...]
3
4 [{y:2018,M:[...,{m:11,D:[{d:29,L:[{s:"21:01",e:"22:12",q:"good"},
5 {s:"22:36",e:"22:58",q:"good"},...]},{d:30,L:[
6 {s:"20:33",e:"22:12",q:"poor"},...]},...]},...]},...]
Listing 1: Snippets of biometric (lines 1–2) and sleep logs (lines 4–6) data.

On the data structures above, we define a Jolie microservice, reported in Listing 2, which describes the handling of the data and the workflow of the diagnostic algorithm, using our implementation of TQuery. The example is detailed enough to let us illustrate all the operators in TQuery: match, unwind, project, group, and lookup. Note that, while in Listing 2 we hard-code some data (e.g., integers representing dates like 20181128) for presentation purposes, we would normally use parametrised variables. In Listing 2, line 1 defines a request to an external service, provided by the HospitalIT infrastructure. The service offers functionality getPatientPseudoID which, given some identifying patientData (acquired earlier), provides a pseudo-anonymised identifier, needed to treat sensitive health data, saved in variable pseudoID. At lines 2–6 (and later at lines 9–17) we use the chaining operator |> to define a sequence of calls, either to external services, marked by the @ operator, or to the internal TQuery library. The |> operator takes the result of the execution of the expression on its left and passes it as the input of the expression on its right. At lines 2–6 we use TQuery operators match and project to extract the recorded temperatures of the patient in the last 3 days/72 hours. At line 2 we evaluate the content of variable credentials, which holds the certificates to let the Hospital IT services access the physiological sensors of a given patient. In the program, credentials is passed by the chaining operator at line 3 as the input of the external call to functionality getMotionAndTemperature. That service call returns the biometric data (Listing 1, lines 1–2) from the SmartWatch of the patient.

While the default syntax of a service call in Jolie is the one with the double pair of parentheses (e.g., at line 1 of Listing 2), thanks to the chaining operator |> we can omit both the input of getMotionAndTemperature (passed by the |> at line 3) and its output (the biometric data exemplified in Listing 1), which is passed to the |> at line 4. At line 4 we use the TQuery operator match to filter all the entries of the biometric data, keeping only those collected in the last 72 hours/3 days (i.e., since 20181128). The result of the match is then passed to the project operator at line 5, which removes all nodes but the temperatures, found under t and renamed to temperatures (this is required by the interface of functionality detectFever, explained below). The projection also includes in its result the pseudoID of the patient, in node patient_id. We finally store (line 6) the prepared data in variable temps (since it will be used both at line 7 and 16). At line 7, we call the external functionality detectFever to analyse the temperatures and check if the patient manifested any fever, storing the result in variable hasFever.

1  getPatientPseudoID@HospitalIT( patientData )( pseudoID );
2  credentials
3  |> getMotionAndTemperature@SmartWatch
4  |> match { date == 20181128 || date == 20181129 || date == 20181130 }
5  |> project { t in temperatures, pseudoID in patient_id }
6  |> temps;
7  detectFever@HospitalIT( temps )( hasFever );
8  if( hasFever ){
9    credentials
10   |> getSleepPatterns@SmartPhone
11   |> unwind  { M.D.L }
12   |> project { y in year, M.m in month, M.D.d in day, M.D.L.q in quality }
13   |> match   { year == 2018 && month == 11 && ( day == 29 || day == 30 ) }
14   |> group   { quality by day, month, year }
15   |> project { quality, pseudoID in patient_id }
16   |> lookup  { patient_id == temps.patient_id in temps }
17   |> detectEncephalopathy@HospitalIT   }
Listing 2: Encephalopathy Diagnostic Algorithm.

After the analysis on the temperatures, if the patient hasFever (line 8), we continue testing for lethargy. To do that, at lines 9–10, we follow the same strategy described for lines 2–3 to pass the credentials to functionality getSleepPatterns, used to collect the sleep logs of the patient from her SmartPhone. Since the sleep logs are nested under years, months, and days, to filter the logs relative to the last 48 hours/2 days, we first flatten the structure through the unwind operator applied on nodes M.D.L (line 11). For each nested node, separated by the dot (.), the unwind generates a new data structure for each element in the array reached by that node. Concretely, the array returned by the unwind operator at line 11 contains all the sleep logs in the shape: [ {y:2018, M:[{m:11, D:[{d:29, L:[{s:"21:01",e:"22:12",q:"good"}]}]}]}, {y:2018, M:[{m:11, D:[{d:29, L:[{s:"22:36",e:"22:58",q:"good"}]}]}]} ] where there are as many elements as there are sleep logs and the arrays under M, D, and L contain only one sleep log. Once flattened, at line 12 we modify the data-structure with the project operator to simplify the subsequent chained commands: we rename the node y to year, we move and rename the node M.m to month (bringing it to the same nesting level as year); similarly, we move M.D.d, renaming it day, and we move M.D.L.q (the quality of the sleep session), renaming it quality. Nodes M.D.L.s and M.D.L.e, not included in the project, are discarded. On the obtained structure, we filter the sleep logs relative to the last 48 hours with the match operator at line 13. At line 14 we use the group operator to aggregate the quality of the sleep sessions recorded in the same day (i.e., grouping them by day, month, and year). Finally, at line 15 we select, through a projection, only the aggregated values of quality (getting rid of day, month, and year) and we include under node patient_id the pseudoID of the patient. 
That value is used at line 16 to join, with the lookup operator, the obtained sleep logs with the previous values of temperatures (temps). The resulting, merged data-structure is finally passed to the HospitalIT services by calling the functionality detectEncephalopathy.

3 TQuery Framework

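In this section, we define the formal syntax and semantics of the operators of TQuery. We begin by defining data trees:

$$t ::= b\,\{\,k_i : a_i\,\}_i \qquad a ::= [t_1, \ldots, t_n]$$

Above, each tree $t$ has two elements. First, a root value $b$, $b \in \mathbb{V} \cup \{\emptyval\}$, where $\mathbb{V}$ is the set of basic values (e.g., strings and integers) and $\emptyval$ is the null value. Second, a set $\{k_i : a_i\}_i$ of one-dimensional vectors, or arrays, containing sub-trees. Each array $a_i$ is identified by a label $k_i$. We write arrays using the standard notation $[t_1, \ldots, t_n]$. We write $k(t)$ to indicate the extraction of the array pointed by label $k$ in $t$: if $k$ is present in $t$ we return the related array, otherwise we return the null array $\alpha$, formally

$$k(b\,\{k_i : a_i\}_i) = \begin{cases} a_j & \text{if } k = k_j \text{ for some } j \\ \alpha & \text{otherwise} \end{cases}$$

We assume the range of arrays to run from the minimum index $1$ to the maximum $\#a$, which we also use to represent the size of the array. We use the standard index notation $a[i]$ to indicate the extraction of the tree at index $i$ in array $a$. If $a$ contains an element at index $i$ we return it, otherwise we return the null tree $\tau$.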
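The definitions above can be mirrored in a few lines of Python. This is only an illustrative sketch and not the Jolie implementation: the names `Tree`, `extract`, `at`, `NULL_ARRAY`, and `NULL_TREE` are hypothetical, `None` is assumed as a stand-in for the null value, the null array, and the null tree, and array indices are assumed to start at 1.

```python
from dataclasses import dataclass, field

NULL_ARRAY = None  # assumed stand-in for the null array
NULL_TREE = None   # assumed stand-in for the null tree


@dataclass
class Tree:
    """A data tree: a root value plus labelled arrays of sub-trees."""
    root: object = None                         # None models the null root value
    arrays: dict = field(default_factory=dict)  # label -> list of Tree


def extract(tree, label):
    """Return the array stored under `label`, or the null array if absent."""
    return tree.arrays.get(label, NULL_ARRAY)


def at(array, i):
    """Return the tree at 1-based index `i`, or the null tree if out of range."""
    if array is not NULL_ARRAY and 1 <= i <= len(array):
        return array[i - 1]
    return NULL_TREE


# One biometric entry, shaped like the data in Listing 1.
entry = Tree(None, {
    "date": [Tree(20181129)],
    "t": [Tree(37)],
    "hr": [Tree(64)],
})
```

Here `extract(entry, "t")` yields the one-element array holding the temperature, while a label that does not occur in the tree yields the null array rather than raising an error.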

Example 1 (Data Structures)

To exemplify our notion of trees, we model the data structures in Listing 1.

[ $\emptyval$ {date:[20181129 {}],t:[37{},...],hr:[64{},...]},
  $\emptyval$ {date:[20181130 {}],t:[36{},...],hr:[66{},...]},...]

[$\emptyval${y:[2018{}],M:[$\emptyval${m:[11{}],D:[
  $\emptyval${d:[29{}],L:[$\emptyval${s:["21:01"{}],e:["22:12"{}],q:["good"{}]},...]},
  $\emptyval${d:[30{}],L:[$\emptyval${s:["20:33"{}],e:["22:12"{}],q:["poor"{}]},...]},
...]},...]},...]

Note that tree roots hold the values in the data structure (e.g., the integer representation of the date 20181128). When root values are absent, we use the null value $\emptyval$.

We define paths to express tree traversal: $p ::= e.p \mid \emptyseq$. Paths are concatenations of expressions $e$, each assumed to evaluate to a tree-label, and the sequence termination $\emptyseq$ (often omitted in examples). The application of a path $p$ to a tree $t$, written $\apt{t}{p}$, returns an array that contains the sub-trees reached by traversing $t$ following $p$. This is aligned with the behaviour of path application in MQuery, which returns a set of trees. In the remainder of the paper, we write $e \Downarrow k$ to indicate that the evaluation of expression $e$ in a path results in the label $k$. Also, both here and in MQuery, paths neglect array indexes: for a given path $e.p'$, such that $e \Downarrow k$, we apply the subpath $p'$ to all trees pointed by $k$ in $t$. We use the standard array concatenation operator $::$, where $[t_1, \ldots, t_n] :: [t_{n+1}, \ldots, t_m] = [t_1, \ldots, t_m]$. We can finally define $\apt{t}{p}$, which either returns an array of trees or the null array $\alpha$ in case the path is not applicable.

In the remainder, we also assume the following structural equivalences

Example 2

Let us see some examples of path-tree application where we assume a tree $t$ = $\emptyval$ { x: [ $\emptyval$ { z: [ 1 {}, 2 {} ] , y: [3 {}] } ] }

$\apt{t}{x.\emptyseq}\ \Rightarrow$ [$\emptyval${z:[1{},2{}],y:[3{}]}]
$\apt{t}{x.z.\emptyseq}\ \Rightarrow$ [1{},2{}]
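The two applications of Example 2 can be reproduced with a small Python sketch. It rests on several assumptions that are not part of the formal model: a tree is encoded as a pair `(root, arrays)`, path expressions are already evaluated to literal labels written as a dot-separated string, `None` stands for both the null value and the null array, and trees lacking a label simply contribute nothing to the concatenation. The helper names `apply_path` and `leaf` are hypothetical.

```python
def apply_path(tree, path):
    """Apply a dot-separated path of labels to a tree.

    Returns the list of sub-trees reached, or None (the null array)
    when none of the traversed trees carries the next label.
    """
    current = [tree]
    for label in path.split("."):
        reached, found = [], False
        for _root, arrays in current:
            if label in arrays:
                found = True
                reached.extend(arrays[label])  # indexes are ignored: visit every element
        if not found:
            return None
        current = reached
    return current


def leaf(value):
    """A tree with a root value and no sub-trees, written value{} in the text."""
    return (value, {})


# t = null { x: [ null { z: [1{}, 2{}], y: [3{}] } ] }
inner = (None, {"z": [leaf(1), leaf(2)], "y": [leaf(3)]})
t = (None, {"x": [inner]})

print(apply_path(t, "x"))    # the one-element array holding the inner tree
print(apply_path(t, "x.z"))  # [(1, {}), (2, {})]
print(apply_path(t, "x.w"))  # None, the path is not applicable
```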

We first present the syntax of TQuery and then dedicate a subsection to the semantics of each operator and to the running examples that illustrate its behaviour. As in MQuery, a TQuery query is a sequence of stages $s$ applied on an array $a$: $a \app s \app \cdots \app s$. The staging operator $\app$ in TQuery is similar to the Jolie chaining operator |>: both evaluate the expression on their left and pass its result as input to the expression on their right. We report in Figure 1 the syntax of TQuery, which counts five stages.

The match operator $\mu_\varphi$ selects trees according to the criterion $\varphi$. Such criterion is either the boolean truth $\mathit{true}$, a condition $p = a$ expressing the equality of the application of path $p$ and the array $a$, a condition $p_1 = p_2$ expressing the equality of the application of path $p_1$ and the application of a second path $p_2$, the existence $\exists p$ of a path $p$, or a combination through the standard logic connectives negation $\neg$, conjunction $\land$, and disjunction $\lor$.

The unwind operator $\omega_p$ flattens an array reached through a path $p$ and outputs a tree for each element of the array.

The project operator $\project_\Pi$ modifies trees by projecting away paths, renaming paths, or introducing new paths, as described in the sequence of elements in $\Pi$, which are either a path $p$ or a value definition $d$ inserted into a path $p$, written $d \rangle p$. Value definitions are either: a boolean value ($\mathit{true}$ or $\mathit{false}$), the application of a path $p$, an array $[d_1, \ldots, d_n]$ of value definitions, a criterion $\varphi$, or the ternary expression $\varphi\,?\,d_1 : d_2$, which, depending on the satisfiability of criterion $\varphi$, selects either value definition $d_1$ or $d_2$.

The group operator $\gamma_{\Gamma ; \Gamma'}$ groups trees according to a grouping condition $\Gamma'$ and aggregates values of interest according to $\Gamma$. Both $\Gamma$ and $\Gamma'$ are sequences of elements of the form $q \rangle p$, where $q$ is a path in the input trees and $p$ a path in the output trees.

The lookup operator $\lambda_{q = a'.r \rangle s}$ joins input trees with trees in an external array $a'$. The trees to be joined are found by matching those input trees whose array found applying path $q$ equals the ones found applying path $r$ to the trees of the external array $a'$. The matching trees from $a'$ are stored in the matching input trees under path $s$.

Figure 1: Syntax of the TQuery

3.1 Match

When applied to an array $a$, match $\mu_\varphi$ returns those elements in $a$ that satisfy $\varphi$. If there is no element in $a$ that satisfies $\varphi$, $\mu_\varphi$ returns an array with no elements (different from the null array $\alpha$). Below, we mark $t \models \varphi$ the satisfiability of criterion $\varphi$ by a tree $t$.

Above, criterion $p_1 = p_2$ is satisfied both when the application of the two paths to the input tree $t$ returns the same array and when both paths do not exist in $t$, i.e., their application coincides on $\alpha$.

Example 3

We report below the execution of the match operator at line 4 of Listing 2. In the example, array $a$ corresponds to the data structure defined at lines 1–2 in Listing 1. First we formalise in TQuery the match operator at line 4: $a \app \mu_{\varphi}$ where $\varphi = (\mathtt{date} = [20181128\{\}]) \lor (\mathtt{date} = [20181129\{\}]) \lor (\mathtt{date} = [20181130\{\}])$.

The match evaluates all trees inside $a$; below we just show that evaluation for

a[1] = $\emptyval\ ${date:[20181129{}],t:[37{},...],hr:[ 64 {},...]}

where we verify if one of the sub-conditions $\mathtt{date} = [20181128\{\}]$, $\mathtt{date} = [20181129\{\}]$, or $\mathtt{date} = [20181130\{\}]$ holds. Each condition is evaluated by applying path date on $a[1]$ and by verifying if the equality with the considered array, e.g., [20181128{}], holds. As a result, we obtain the input array filtered from the trees that do not correspond to the dates in the criterion.
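The filtering of Example 3 can be sketched in Python as follows. The encoding is an assumption made for illustration: trees are pairs `(root, arrays)`, criteria are plain Python predicates, and paths are restricted to a single label to keep the sketch short. The helper names (`match`, `path_equals`, `either`, `entry`) and the sample values are hypothetical.

```python
def match(array, criterion):
    """Keep the trees that satisfy the criterion; an empty list if none does."""
    return [tree for tree in array if criterion(tree)]


def path_equals(label, expected):
    """Criterion: the array under `label` equals `expected` (single-label paths only)."""
    return lambda tree: tree[1].get(label) == expected


def either(*criteria):
    """Criterion: disjunction of the given criteria."""
    return lambda tree: any(c(tree) for c in criteria)


def entry(date, temperature):
    """A biometric entry shaped like the trees of Example 1 (sample values)."""
    return (None, {"date": [(date, {})], "t": [(temperature, {})]})


biometric = [entry(20181127, 38), entry(20181129, 37), entry(20181130, 36)]

last_72_hours = either(
    path_equals("date", [(20181128, {})]),
    path_equals("date", [(20181129, {})]),
    path_equals("date", [(20181130, {})]),
)

recent = match(biometric, last_72_hours)
```

With this sample input the entry dated 20181127 is dropped, and a criterion that no tree satisfies produces an empty list, which in this encoding is distinct from `None` (the stand-in for the null array).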

3.2 Unwind

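To define the semantics of the unwind operator $\omega_p$, we introduce the unwind expansion operator $E(t, a)^k$ (read “unwind $t$ on $a$ under $k$”). Informally, $E(t, a)^k$ returns an array of trees with cardinality $\#a$ where each element has the shape of $t$ except that label $k$ is associated index-wise with the corresponding element in $a$. Formally, given a tree $t$, an array $a$, and a key $k$:

$$E(t, a)^k = [t_1, \ldots, t_{\#a}] \quad \text{where each } t_i \text{ equals } t \text{ except that } k(t_i) = [\,a[i]\,]$$

Then, the formal definition of $\omega_p$ is

$$a \app \omega_p = \begin{cases} E(a[1], k(a[1]) \app \omega_{p'})^k :: \cdots :: E(a[\#a], k(a[\#a]) \app \omega_{p'})^k & \text{if } p = e.p' \text{ and } e \Downarrow k \\ a & \text{if } p = \emptyseq \end{cases}$$

We define the unwind operator inductively over both $a$ and $p$. The induction over $a$ results in the application of the unwind expansion operator over all elements of $a$. The induction over $p$ splits $p$ in the current key $k$ and the continuation $p'$. Key $k$ is used to retrieve the array in the current element of $a$, i.e., $k(a[i])$, on which we apply $\omega_{p'}$ to continue the unwind application until we reach the termination with $\emptyseq$.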

Example 4

We report the execution of the unwind operator at line 11 of Listing 2. The unwind operator unfolds the given input array with respect to a given path $p$ in two directions. The first is breadth, where we apply the unwind expansion operator $E(t, a')^k$ over all input trees $t$ and with respect to the first node $k$ in the path $p$. The second direction is depth, and defines the content of array $a'$ in $E(t, a')^k$, which is found by recursively applying the unwind operator with respect to the remaining path nodes in $p$ ($k$ excluded) over the arrays pointed by node $k$ in each $t$. Let $a$ be the sleep-logs data-structure at lines 4–6 of Listing 1, such that $a = [t_{2018}, \ldots]$ where, e.g., $t_{2018}$ is the tree in $a$ such that $\apt{t_{2018}}{\mathtt{y}} = [2018\{\}]$. The concatenation below is the first level of depth unfolding, i.e., for node M of unwind $\omega_{\mathtt{M.D.L}}$.

$$a \app \omega_{\mathtt{M.D.L}} = E(t_{2018}, \mathtt{M}(t_{2018}) \app \omega_{\mathtt{D.L}})^{\mathtt{M}} :: \cdots$$

To conclude this example, we show the execution of the unwind expansion operator of the terminal node L in path $\mathtt{M.D.L}$, relative to the sleep logs recorded within a day, represented by tree $t_{29}$, i.e., $E(t_{29}, a_{\mathtt{L}})^{\mathtt{L}}$ where $a_{\mathtt{L}} = \mathtt{L}(t_{29})$. For each element of the array pointed by L, e.g., {s:["21:01"{}],e:["22:12"{}],q:["good"{}]}, we create a new structure where we replace the original array associated with the key L with a new array containing only that element. The final result of the unwind operator has the shape:

[$\emptyval$ {y:[2018{}],M:[$\emptyval${m:[11{}],D:[$\emptyval$ {d:[29{}],
      L:[$\emptyval${s:["21:01"{}],e:["22:12"{}],q:["good"{}]}]}]}]},
 $\emptyval$ {y:[2018{}],M:[$\emptyval${m:[11{}],D:[$\emptyval$ {d:[29{}],
      L:[$\emptyval${s:["22:36"{}],e:["22:58"{}],q:["good"{}]}]}]}]},... ]
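The breadth and depth unfolding described in this example can be sketched recursively in Python. This is an illustrative approximation under stated assumptions: trees are pairs `(root, arrays)`, path nodes are literal labels, and a tree that lacks the current label is assumed to produce no output (the text does not state this case). The helper names `unwind`, `leaf`, and `log` are hypothetical, and the sample data follows Listing 1.

```python
def unwind(array, path):
    """Flatten `array` along `path` (a dot-separated string or a list of labels)."""
    labels = path.split(".") if isinstance(path, str) else path
    if not labels:
        return array  # end of the path: nothing left to unfold
    key, rest = labels[0], labels[1:]
    result = []
    for root, arrays in array:           # breadth: every input tree
        elements = arrays.get(key)
        if elements is None:
            continue                     # assumption: trees without the label are dropped
        for element in unwind(elements, rest):  # depth: unfold the remaining path first
            # Same tree, but `key` now points to a one-element array.
            result.append((root, {**arrays, key: [element]}))
    return result


def leaf(value):
    return (value, {})


def log(start, end, quality):
    return (None, {"s": [leaf(start)], "e": [leaf(end)], "q": [leaf(quality)]})


day29 = (None, {"d": [leaf(29)],
                "L": [log("21:01", "22:12", "good"), log("22:36", "22:58", "good")]})
day30 = (None, {"d": [leaf(30)], "L": [log("20:33", "22:12", "poor")]})
month = (None, {"m": [leaf(11)], "D": [day29, day30]})
sleep_logs = [(None, {"y": [leaf(2018)], "M": [month]})]

flat = unwind(sleep_logs, "M.D.L")
```

On this input, `flat` holds one tree per sleep session, and in each of them the arrays under M, D, and L contain exactly one element.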

3.3 Project

We start by defining some auxiliary operators used in the definition of the project. Auxiliary operators $\project_p(a)$ and $\project_p(t)$ formalise the application of a branch-selection over a path $p$. Then, the auxiliary operator $\evalDef(d, t)$ returns the array resulting from the evaluation of a definition $d$ over a tree $t$. Finally, we define the projection of a value (definition) $d$ into a path $p$ over a tree $t$, i.e., $\project_{d \rangle p}(t)$. The projection for a path $p$ over an array $a$ results in an array where we project $p$ over all the elements (trees) of $a$.

The projection for a path $p$ over a tree $t$ implements the actual semantics of branch-selection, where, given a path $e.p'$, $e \Downarrow k$, we remove all the branches in $t$, keeping only $k$ (if present in $t$) and continue to apply the projection for the continuation $p'$ over the (array of) sub-trees under $k$ in $t$ (i.e., $k(t)$).

The operator $\evalDef(d, t)$ evaluates the value definition $d$ over the tree $t$ and returns an array containing the result of the evaluation.

Then, the application of the projection of a value definition $d$ on a path $p$, i.e., $\project_{d \rangle p}(t)$, returns a tree where under path $p$ is inserted the evaluation of $d$ over $t$.

Before formalising the projection, we define the auxiliary tree-merge operator $\merge$, used to merge the result of a sequence of projections $\Pi$.

To conclude, first we define the application of the projection to a tree $t$, i.e., $\project_\Pi(t)$, which merges ($\merge$) into a single tree the result of the applications of projections over $t$

and finally, we define the application of the projection to an array $a$, i.e., $a \app \project_\Pi$, which corresponds to the application of the projection to all the elements of $a$.

Example 5

We report the execution of the project at line 5 of Listing 2. Let $a$ be the array at the end of Example 3, and let $t_{\text{28}}$, $t_{\text{29}}$, $t_{\text{30}}$ be the trees in $a$ such that $t_{\text{28}}$ is the first tree in $a$, relative to date 20181128, $t_{\text{29}}$ the second, and $t_{\text{30}}$ the third

[$t_{\text{28}}$,$t_{\text{29}}$,$t_{\text{30}}$] $\app{\ \project_{\mathtt{t}\rangle\mathtt{temperatures},\ \mathtt{pseudoID}\rangle\mathtt{patient\_id}}} \Rightarrow$
[$\project_{\mathtt{t}\rangle\mathtt{temperatures},\ \mathtt{pseudoID}\rangle\mathtt{patient\_id}}(t_{\text{28}})$,
 $\project_{\mathtt{t}\rangle\mathtt{temperatures},\ \mathtt{pseudoID}\rangle\mathtt{patient\_id}}(t_{\text{29}})$,
 $\project_{\mathtt{t}\rangle\mathtt{temperatures},\ \mathtt{pseudoID}\rangle\mathtt{patient\_id}}(t_{\text{30}})$]

We continue showing the projection of the first element in $a$, $t_{\text{28}}$ (the projection on the other elements follows the same structure)

$\project_{\mathtt{t}\rangle\mathtt{temperatures},\ \mathtt{pseudoID}\rangle\mathtt{patient\_id}}(t_{\text{28}}) \Rightarrow$
$\project_{\mathtt{t}\rangle\mathtt{temperatures}}(t_{\text{28}}) \merge \project_{\mathtt{pseudoID}\rangle\mathtt{patient\_id}}(t_{\text{28}}) \Rightarrow$
$\emptyval$ { temperatures : $\project_{\texttt{t}\rangle\emptyseq}(t_{\text{28}})$ } $\merge$ $\emptyval$ { patient_id : $\project_{\texttt{["xxx"\{\}]}\rangle\emptyseq}(t_{\text{28}})$ }
$=\emptyval$ { temperatures : $\evalDef(\texttt{t},t_{\text{28}})$ } $\merge$ $\emptyval$ { patient_id : $\evalDef($["xxx"{}]$,t_{\text{28}})$ }
$=\emptyval$ { temperatures : $\apt{t_{\text{28}}}{\texttt{t}}$ } $\merge$ $\emptyval$ { patient_id: ["xxx"{}] }
$=\emptyval$ { temperatures : [36{},...], patient_id: ["xxx"{}] }

The result of the projection has the shape

[$\emptyval$ { temperatures:[36{},...], patient_id:["xxx"{}] },
 $\emptyval$ { temperatures:[37{},...], patient_id:["xxx"{}] },
 $\emptyval$ { temperatures:[36{},...], patient_id:["xxx"{}] }]

Example 6

We report the execution of the project at line 12 of Listing 2. Let $a$ be the array at the end of Example 4, and let $t_{\text{2018}}^{\text{1}}$, $t_{\text{2018}}^{\text{2}}$, ... be the trees in $a$ such that $t_{\text{2018}}^{\text{1}}$ is the first tree in $a$ relative to year 2018, $t_{\text{2018}}^{\text{2}}$ the second, and so on.

[$t_{\text{2018}}^{\text{1}}$,$t_{\text{2018}}^{\text{2}}$,...] $\app{\ \project_{\mathtt{y}\rangle\mathtt{year},\ \mathtt{M.m}\rangle\mathtt{month},\ \mathtt{M.D.d}\rangle\mathtt{day},\ \mathtt{M.D.L.q}\rangle\mathtt{quality}}} \Rightarrow$
[$\project_{\mathtt{y}\rangle\mathtt{year},\ \mathtt{M.m}\rangle\mathtt{month},\ \mathtt{M.D.d}\rangle\mathtt{day},\ \mathtt{M.D.L.q}\rangle\mathtt{quality}}(t_{\text{2018}}^{\text{1}})$,
 $\project_{\mathtt{y}\rangle\mathtt{year},\ \mathtt{M.m}\rangle\mathtt{month},\ \mathtt{M.D.d}\rangle\mathtt{day},\ \mathtt{M.D.L.q}\rangle\mathtt{quality}}(t_{\text{2018}}^{\text{2}})$,...]

We continue showing the projection of the first element in $a$, $t_{\text{2018}}^{\text{1}}$ (the projection on the other elements follows the same structure)

$\project_{\mathtt{y}\rangle\mathtt{year},\ \mathtt{M.m}\rangle\mathtt{month},\ \mathtt{M.D.d}\rangle\mathtt{day},\ \mathtt{M.D.L.q}\rangle\mathtt{quality}}(t_{\text{2018}}^{\text{1}}) \Rightarrow$
$\project_{\mathtt{y}\rangle\mathtt{year}}(t_{\text{2018}}^{\text{1}}) \merge
\project_{\mathtt{M.m}\rangle\mathtt{month}}(t_{\text{2018}}^{\text{1}}) \merge
\project_{\mathtt{M.D.d}\rangle\mathtt{day}}(t_{\text{2018}}^{\text{1}}) \merge
\project_{\mathtt{M.D.L.q}\rangle\mathtt{quality}}(t_{\text{2018}}^{\text{1}})$

Finally, we show the unfolding of the first two projections from the left, above, i.e., those for $\mathtt{y}\rangle\mathtt{year}$ and for $\mathtt{M.m}\rangle\mathtt{month}$, and their merge (the remaining ones unfold similarly).

$\project_{\mathtt{y}\rangle\mathtt{year}}(t_{\text{2018}}^{\text{1}}) \merge \project_{\mathtt{M.m}\rangle\mathtt{month}}(t_{\text{2018}}^{\text{1}}) \Rightarrow$
$\emptyval$ { year : $\project_{\texttt{y}\rangle\emptyseq}(t_{\text{2018}}^{\text{1}})$ } $\merge$ $\emptyval$ { month : $\project_{\texttt{M.m}\rangle\emptyseq}(t_{\text{2018}}^{\text{1}})$ }
= $\emptyval$ { year : $\evalDef(\texttt{y},t_{\text{2018}}^{\text{1}})$ } $\merge$ $\emptyval$ { month : $\evalDef(\texttt{M.m},t_{\text{2018}}^{\text{1}})$ }
= $\emptyval$ { year : $\apt{t_{\text{2018}}^{\text{1}}}{\texttt{y}}$ } $\merge$ $\emptyval$ { month : $\apt{t_{\text{2018}}^{\text{1}}}{\texttt{M.m}}$ }
= $\emptyval$ { year : [2018{}] } $\merge$ $\emptyval$ { month : [11{}] }
= $\emptyval$ { year : [2018{}], month : [11{}] }

The result of the projection has the shape

[$\emptyval${year:[2018{}],month:[11{}],day:[29{}],quality:["good"{}]},
 $\emptyval${year:[2018{}],month:[11{}],day:[29{}],quality:["good"{}]},
 $\emptyval${year:[2018{}],month:[11{}],day:[30{}],quality:["poor"{}]},...]
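The projection steps of Examples 5 and 6 can be approximated by the Python sketch below. It is a simplification under explicit assumptions: trees are pairs `(root, arrays)`, each projection element is a hypothetical `(source, target)` pair where `source` is either a dot-separated path or a literal array, `target` is a single label rather than a nested path, unreachable source paths add no branch, and merging is reduced to collecting the branches in one dictionary. The names `apply_path`, `project`, and `leaf` are not part of TQuery, and the literal "xxx" plays the role of the pseudoID as in Example 5.

```python
def apply_path(tree, path):
    """Sub-trees reached by a dot-separated path, or None if not applicable."""
    current = [tree]
    for label in path.split("."):
        reached = [sub for _root, arrays in current for sub in arrays.get(label, [])]
        if not any(label in arrays for _root, arrays in current):
            return None
        current = reached
    return current


def project(array, spec):
    """Build, for every input tree, a new tree holding only the projected branches."""
    projected = []
    for tree in array:
        arrays = {}
        for source, target in spec:
            # A string is read as a path to evaluate; anything else is a literal array.
            value = apply_path(tree, source) if isinstance(source, str) else source
            if value is not None:  # assumption: unreachable paths add no branch
                arrays[target] = value
        projected.append((None, arrays))  # the result has a null root, as in the examples
    return projected


def leaf(value):
    return (value, {})


session = (None, {"s": [leaf("21:01")], "e": [leaf("22:12")], "q": [leaf("good")]})
day = (None, {"d": [leaf(29)], "L": [session]})
month = (None, {"m": [leaf(11)], "D": [day]})
unwound = [(None, {"y": [leaf(2018)], "M": [month]})]

rows = project(unwound, [
    ("y", "year"),
    ("M.m", "month"),
    ("M.D.d", "day"),
    ("M.D.L.q", "quality"),
    ([leaf("xxx")], "patient_id"),  # literal value, as for pseudoID in Example 5
])
```

The single output tree carries only year, month, day, quality, and patient_id; the start and end times under M.D.L.s and M.D.L.e are discarded because no projection element mentions them.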

3.4 Group

The group operator takes as parameters two sequences of paths, separated by a semicolon, i.e., . The first sequence of paths, ranged , is called aggregation set, while the second sequence, ranged , is called grouping set. Intuitively, the group operator first groups together the trees in which have the maximal number of paths in the grouping set whose values coincide. The values in are projected in the corresponding paths . Once the trees are grouped, the operator aggregates all the different values, without duplicates, found in paths from the aggregation set, projecting them into the corresponding paths . We start the definition of the grouping operator by expanding its application to an array . In the expansion below, on the right, we use the series-concatenation operator and the set , element of the power set , to range over all possible combinations of paths in the grouping set. Namely, the expansion corresponds to the concatenation of all the arrays resulting from the application of the group operator on a subset (including the empty and the whole set) of paths in the grouping set.

In the definition of the expansion, we mark the casting of an array to a set (i.e., we keep only unique elements in and lose their relative order). Each returns an array that contains those trees in that correspond to the grouping illustrated above. Formally:

When applied over a set , , considers all combinations of values identified by paths in the trees in . In the formula above, we use the array to refer to those combinations of values. In the definition, we impose that, for each element in in a position , there must be at least one tree in that has a non-null () array under path . Hence, for each combination of values in , builds a tree that i) contains under paths the value (as encoded in the projection query and from the definition of the operator , defined below) and ii) contains under paths , , the array containing all the values found under the correspondent path in all trees in that match the same combination element-path in (as encoded in ). The grouping is valid (as encoded in ) only if we can find (i.e., match ) trees in where i) we have a non-empty value for , ii) there are no paths that are excluded in , and iii) for all paths considered in , the value found under path corresponds to the value in the considered combination . If the previous conditions are not met, returns an empty array . We conclude defining the operator , used above to unfold the set of aggregation paths and the related values contained in , e.g., let then . Its meaning is that, for each path , we project in it the value correspondent to . Formally

Note that for case (i.e., for ), returns the empty path , which has no effect (i.e., it projects the input tree) in the projection in the definition of . Hence, the resulting tree from grouping over will just include (and project over ) those trees in that do not include any value reachable by paths (as indicated by expression in ). Like in MQuery and MongoDB, we allow the omission of paths and in . However, we interpret this omission differently wrt MQuery. There, the values obtained from s with missing s (resp., with missing ) are stored within a default path _id. Here, we intend the omission as an indication of the fact that the user wants to preserve the structure of (resp., ) captured by the structural equivalence below.
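The intuition behind the group operator can be sketched in Python under strong simplifications. The sketch is not the formal definition: trees are modelled as flat dictionaries from a name to an array of plain values, paths are single names, and the function `group`, its parameters `aggregate` and `by`, and the sample `logs` are hypothetical. Instead of enumerating the power set of the grouping paths, it partitions the input by the tuple of values found under the grouping names, using `None` for a missing name; on such flat records this is assumed to yield the same groups.

```python
def group(array, aggregate, by):
    """Group flat trees by the values under `by`, aggregating the values under `aggregate`.

    Both `aggregate` and `by` map a source name to a destination name.
    """
    groups = {}
    for tree in array:
        # One key component per grouping name; None marks a missing or empty array.
        key = tuple(tuple(tree[src]) if tree.get(src) else None for src in by)
        groups.setdefault(key, []).append(tree)

    result = []
    for key, members in groups.items():
        out = {}
        # Project the shared grouping values into their destination names.
        for (src, dst), values in zip(by.items(), key):
            if values is not None:
                out[dst] = list(values)
        # Collect the distinct aggregated values, keeping first-occurrence order.
        for src, dst in aggregate.items():
            seen = []
            for tree in members:
                for value in tree.get(src, []):
                    if value not in seen:
                        seen.append(value)
            if seen:
                out[dst] = seen
        result.append(out)
    return result


# Hypothetical sleep logs; the last one has no "day" and forms its own group.
logs = [
    {"day": [29], "quality": ["good"]},
    {"day": [29], "quality": ["poor"]},
    {"day": [30], "quality": ["good"]},
    {"day": [30], "quality": ["good"]},
    {"quality": ["poor"]},
]

grouped = group(logs, aggregate={"quality": "quality"}, by={"day": "day"})
```

Here the two logs of day 29 collapse into one tree with both quality values, the two logs of day 30 collapse into one tree with a single `"good"`, and the log without a day is kept in a separate group that carries no `day` node.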

Example 7

We report the execution of the group operator at line 14 of creftypecap 2. Let be the result of the projection creftypecap 6, with the exception that has been filtered by the match at line 13 in creftypecap 2 and contains only the sleep logs for days 29 and 3 of month 11 and year 2018.

3.5 Lookup

Informally, the lookup operator joins two arrays, a source and an adjunct , wrt a destination path and two source paths and . The result of the lookup is a new array that has the shape of the source array but where each of its elements has under path those elements in the adjunct array whose values under path equal the values found in under path . Formally

Above, the lookup operator takes as parameters three paths , , and and an array of trees . When applied to an array of trees , it returns (i.e., all of its elements, as returned by the projection under the first parameter ) where each of its elements has under path an array of trees obtained from applying the match () in expression , i.e., following the definition of , the projection under is merged with the result of the projection under . For each element (), matches those trees in for which either i) there is a path and the array reached under equals the array found under or ii) there exists no path (i.e., its application returns the null array ) and also does not exist in (i.e., ).
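The join described above can be sketched in Python as follows. This is a simplified illustration and not the Jolie implementation: trees are flat dictionaries from a name to an array of values, paths are single names, and the function `lookup`, its parameter names and the sample data are hypothetical. A missing name is read as `None`, so a source tree without the source path matches exactly the adjunct trees that also lack their path, as in case ii) above. Attaching an empty array when nothing matches is an assumption of the sketch.

```python
def lookup(source, adjunct, source_path, adjunct_path, destination):
    """Attach to each source tree, under `destination`, the matching adjunct trees."""
    result = []
    for tree in source:
        wanted = tree.get(source_path)  # None when the path does not exist
        # A match needs equal arrays, or both paths missing (None == None).
        matches = [other for other in adjunct if other.get(adjunct_path) == wanted]
        joined = dict(tree)             # keep the shape of the source tree
        joined[destination] = matches   # assumption: empty array if nothing matches
        result.append(joined)
    return result


# Hypothetical data shaped like the arrays of the use case.
sleep = [
    {"quality": ["good", "good"], "patient_id": ["xxx"]},
    {"quality": ["poor", "good"], "patient_id": ["xxx"]},
]
temps = [
    {"temperatures": [36], "patient_id": ["xxx"]},
    {"temperatures": [37], "patient_id": ["xxx"]},
    {"temperatures": [36], "patient_id": ["yyy"]},
]

joined = lookup(sleep, temps, "patient_id", "patient_id", "temps")
```

Each element of `joined` keeps its `quality` and `patient_id` nodes and gains a `temps` node holding the two temperature trees of patient `"xxx"`; the tree of patient `"yyy"` is filtered out, and the input arrays are left unchanged.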

Example 8

We report the execution of the lookup at line 16 of creftypecap 2

where corresponds to the resulting array from the application of the project operator at line 15 of creftypecap 2, which has the shape

$a$ = [ $\emptyval$ { quality:["good"{},"good"{},...], patient_id:["xxx"{}] },
        $\emptyval$ { quality:["poor"{},"good"{},...], patient_id:["xxx"{}] } ]

and where corresponds to the array of temperatures that results from the application of the project at line 5 of creftypecap 2, as shown at the bottom of creftypecap 5. Then, unfolding the execution of the lookup, we obtain the concatenation of the results of two projections, on the only two elements in . The first corresponds to the projection on while the second corresponds to the projection on where

Below, sub-node temps contains the whole array , since all its elements match patient_id.

[$\project_{\emptyseq,\beta_1}(a[1])$::$\project_{\emptyseq,\beta_2}(a[2])$]
= [$\emptyval$ { quality:["good"{},"good"{},...], patient_id:["xxx"{}],
    temps: [
       $\emptyval$ { temperatures:[36{},...], patient_id:["xxx"{}] },
       $\emptyval$ { temperatures:[37{},...], patient_id:["xxx"{}] },
       $\emptyval$ { temperatures:[36{},...], patient_id:["xxx"{}] } ] },
   $\emptyval$ { quality:["poor"{},"good"{},...], patient_id:["xxx"{}],
    temps: [
       $\emptyval$ { temperatures:[36{},...], patient_id:["xxx"{}] },
       $\emptyval$ { temperatures:[37{},...], patient_id:["xxx"{}] },
       $\emptyval$ { temperatures:[36{},...], patient_id:["xxx"{}] } ] } ]

4 Related Work and Conclusion

In this paper, we focus on ephemeral data handling and contrast DBMS-based solutions with query engines integrated within the application memory. We indicate the issues that make DBMS-based solutions unfit for ephemeral data-handling scenarios and propose a formal model, called TQuery, to express document-based queries over common (JSON, XML, …), tree-shaped data structures. TQuery instantiates MQuery [botoeva18], a sound variant of the Aggregation Framework [agg_framework] used in MongoDB, one of the main NoSQL DBMSs for document-oriented queries. We implemented TQuery in Jolie, a language to program native microservices, the building blocks of modern systems where ephemeral data-handling scenarios are becoming more and more common, as in Internet-of-Things, eHealth, and Edge Computing architectures. Jolie offers variety-by-construction, i.e., the language runtime automatically and efficiently handles data conversion, and all Jolie variables are trees. These factors allowed us to separate input/output data formats from the data-handling logic, hence providing programmers with a single, consistent interface to use TQuery on any data format supported by Jolie. In our treatment, we presented a non-trivial use case from eHealth, which provides a concrete evaluation of both TQuery and MQuery, while also serving as a running example to illustrate the behaviour of the TQuery operators.

Regarding related work, we focus on NoSQL systems, which target documents, key/value pairs, or graphs. The NoSQL systems closest to ours are the MongoDB [mongodbwebsite] Aggregation Framework and the CouchDB [couchdbwebsite] query language, which handle JSON-like documents using the JavaScript language and REST APIs. ArangoDB [arangodbwebsite] is a native multi-model engine for nested structures that comes with its own query language, namely the ArangoDB Query Language. Redis [redis] is an in-memory multi-data-structure store that supports strings, hashes, lists, and sets; however, it lacks support for tree-shaped data. We conclude the list of external DB solutions with Google Big Table [chang2008bigtable] and Apache HBase [george2011hbase], NoSQL DB engines used in big data scenarios that address scalability issues and are thus specifically tailored for distributed computing. As argued in the introduction, all these systems are application-external query execution engines and therefore unfit for ephemeral data-handling scenarios.

There are solutions that integrate linguistic abstractions to query data within the memory of an application. One category is represented by Object-Relational Mapping (ORM) frameworks [fussel1997foundations]. However, ORMs rely on some DBMS, as they map objects used in the application to entities in the DBMS for persistence. Similarly, Opaleye [ellis2014opaleye] is a Haskell library providing a DSL that generates PostgreSQL queries. Thus, while being integrated within the application programming tools and executing in-memory, in ephemeral data-handling scenarios ORMs are affected by the same issues as DBMSs. Another solution is LevelDB [leveldbwebsite], which provides both an on-disk and an in-memory storage library for C++, Python, and JavaScript, inspired by Big Table and developed by Google; however, it is limited to key-value data structures and does not natively support tree-shaped data. As cited in the introduction, a solution close to ours is LINQ [meijer2006linq], which provides .NET query operators targeting both SQL tables and XML nested structures. Similarly, CQEngine [cqenginewebsite] provides a library for querying Java collections with SQL-like operators. Neither solution provides automatic data-format conversion, unlike our implementation of TQuery in Jolie.

We are currently empirically evaluating the performance of our implementation of TQuery in application scenarios with ephemeral data handling (Internet-of-Things, eHealth, Edge Computing). The next step is to use those scenarios to conduct a study comparing our solution with other proposals among both DBMSs and in-memory engines, evaluating their impact on performance and the development process. Finally, on the one hand, we can support new data formats in Jolie, which makes them automatically available to our TQuery implementation. On the other hand, expanding the set of available operators in TQuery would allow programmers to express more complex queries over any data format supported by Jolie.

References