Exploiting Information Centric Networking to federate NoSQL Spatial Databases

03/14/2019 ∙ by Andrea Detti, et al.

This paper explores methodologies, challenges and expected advantages related to the use of the Information Centric Network (ICN) technology for federating NoSQL databases. ICN services allow simplifying the design of federation procedures, improving their performance, and providing so-called data-centric security. In this work we present an architecture able to federate NoSQL spatial databases and evaluate its performance by using a real data set within a heterogeneous federation formed by MongoDB and CouchBase database systems.

1 Introduction

Nowadays NoSQL (Not Only SQL) database systems play a central role in information management. They can manage large volumes of read-write operations and can be easily distributed on different servers.

Usually, NoSQL databases store generic objects, such as JSON ones, but they can also implement specific functionality aimed at improving the management of specific objects, such as spatial ones SDBintroduction . A spatial object is characterized by a georeferenced geometry (point, multi-point, polygon, line, etc.) and by a set of properties. For instance, a point of interest (POI) can be represented in a database as a spatial object with a point geometry, whose properties are the information about the class of the POI (historical, shop, etc.), its description and so on. Spatial databases are used for many applications, including Geographic Information Systems (GIS), navigation software, journey planners, IoT, etc. In addition to the features of a traditional database, a spatial database offers the capability to efficiently index and query spatial objects. A typical query could search for objects intersecting an area, close to a given point, or contained in a given area.
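
As an illustration, a POI could be stored as a GeoJSON Feature such as the following sketch (the property names, e.g. poi_type, are only an example and not a prescribed schema):

    {
      "type": "Feature",
      "geometry": { "type": "Point", "coordinates": [12.4924, 41.8902] },
      "properties": {
        "poi_type": "historical",
        "name": "Colosseum",
        "description": "Flavian amphitheatre in the centre of Rome"
      }
    }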

A database system is usually owned by a single entity that assures the reliability of the offered services and of the stored information. Instead, a federated database system is a collection of cooperating but autonomous component database systems sheth1990federated , transparently integrated through a common access interface, as shown in fig. 1. A database federation can foster the development of services needing to access data stored in separate DBSs, which cannot be merged into a single DB. For instance, organizations managing different information sources are often autonomous and willing to share their data only if they retain control of such data.

A real example of federation is the one recently proposed by the European Commission for the provision of EU-wide multimodal travel information services, which requires member states to deploy National Access Points exposing national transport information (traffic status, train schedules, etc.). The federation of these databases, and their capability to appear as a unique DB, would simplify the development of cross-border journey planners, which concurrently need to access data offered by several National Access Points.

Another example concerns smart-city applications, for which the smart behavior of an application often arises from the integration of cross-domain (health, transport, security, etc.) information stored in different databases. Accordingly, the ETSI ISG CIM group is now discussing how to integrate cross-domain context/IoT information; the federation of different DBSs is a possible solution to this issue.

The federation of DBSs poses several challenges including sheth1990federated :

  • Data heterogeneity - the fact that information can be stored with different structures, e.g. different tags to indicate the same concept, different formats, or different semantics.

  • System heterogeneity - the possibility that federated databases have different capabilities, query languages, etc.

  • Efficient and secure communication - the need to design network services so as to have fast and secure create, read, update, delete (CRUD) operations.

In this paper we focus on the third challenge described above. We propose a federated database system based on an Information Centric Network (ICN) jacobson2009networking , which interconnects the different database sites. To some extent, ICN services resemble those of a Content Delivery Network, but with a finer, packet-level, granularity (see sec. 2.1 below for a brief description of ICN).

In this work, we propose to exploit typical ICN services such as routing-by-name, in-network caching and multicast, to efficiently solve geographical queries and to support a global indexing scheme that shortens query latency and reduces DBS load. We also leverage ICN’s data-centric security for assuring provenance and validity of data and signaling.

Moreover, we implemented our proposed ICN solution and tested it by setting up a federation whose sites use heterogeneous NoSQL databases with spatial features, namely MongoDB and CouchBase. We focused on spatial databases, but the proposed approach is general enough to be applicable also to other kinds of objects. To the best of our knowledge, this is the first work that proposes to use ICN services for the federation of NoSQL databases. Its main contributions are:

  • a methodology for federating NoSQL databases using the services of an Information Centric Network;

  • a global indexing strategy based on a greedy adaptive tessellation algorithm that enables query routing;

  • a real implementation based on databases currently used in production systems;

  • a comprehensive performance evaluation considering heterogeneous databases.

Figure 1: Federated Database System (FDBS)

2 Related concepts and works

Figure 2: ICN forwarding engine model and packets

2.1 Information Centric Networks

An ICN is a communication architecture providing users with data items rather than end-to-end communication pipes. The network addresses are hierarchical names (e.g. dbs#1/poi/1502) that do not identify end-hosts but data items.

A data item and its unique name form the so-called named object. A named object is actually a small data unit (e.g., 4kB long) and may contain an entire content (e.g., a document, a video, etc.), or a chunk of it. The names used for addressing the chunks of the same content have a common prefix (e.g., dbs#1/poi/1502) followed by a sequence number identifier (e.g. s1, s2, s3, etc.).

An ICN is formed by nodes logically classified as consumers, producers, and routers. Consumers pull named objects provided by producers, possibly going through intermediate routers. The consumer-to-producer path is labeled as upstream; the reverse path as downstream.

Any node uses the forwarding engine shown in figure 2 and is connected to other nodes through channels, called faces, which can be based on different transport technologies such as Ethernet, TCP/IP sockets, etc.

The data units exchanged in ICN are Interest packets and Data packets. To download a named object, a consumer issues an Interest packet, which contains the object name and is forwarded towards the producer. The forwarding process is referred to as routing-by-name, since the upstream face is selected through a name-based prefix matching based on a Forwarding Information Base (FIB) containing name prefixes, e.g., dbs#1 in fig. 2. The FIB is usually configured by routing protocols, which announce name prefixes rather than IP subnetworks hoque2013nisr . During the Interest forwarding process, the node temporarily keeps track of the forwarded Interest packets in a Pending Information Table (PIT), which stores the name of the requested object and the identifier of the face from which the Interest came (downstream face).

When an Interest reaches a node (producer or an intermediate router) having the requested named object, the node sends back the object within a Data packet, whose header includes the object name. The Data packet is forwarded downstream to the consumer by consuming (i.e., deleting) the information previously left in the PITs, like bread crumbs.

Each forwarding engine can cache the forwarded Data packet to serve subsequent requests of the same object (in-network caching). Usually, the data freshness is loosely controlled by an expiry approach. Any Data packet includes a metainfo field reporting the freshness period specified by the producer, which indicates how long the Data can be stored in the network cache.

The forwarding engine also supports multicast distribution both for Interest and Data packets. Interest multicasting takes place when there are more upstream faces for a given prefix (e.g., index/notify in fig. 2) and the incoming Interest is forwarded to all of them. Data multicasting is implemented as follows: when a node receives multiple Interests for the same object, the engine forwards only the first packet and discards the subsequent ones, appending the identifiers of the arriving downstream faces in the PIT; then, when the requested Data packet arrives, the node forwards a copy of it towards each downstream face contained in the PIT.
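
To make the forwarding behavior concrete, the following Python fragment sketches, under our own simplifying assumptions, the engine of fig. 2: longest-prefix FIB lookup, PIT aggregation of duplicate Interests, Data fan-out towards the pending downstream faces and opportunistic caching. Freshness and signature checks are omitted, and faces are hypothetical objects exposing send_interest/send_data:

    class Forwarder:
        def __init__(self):
            self.fib = {}      # name prefix -> list of upstream faces
            self.pit = {}      # object name -> set of downstream faces
            self.cache = {}    # object name -> Data packet (in-network cache)

        def on_interest(self, name, downstream):
            if name in self.cache:                       # cache hit
                downstream.send_data(name, self.cache[name])
                return
            if name in self.pit:                         # aggregate duplicate Interests
                self.pit[name].add(downstream)
                return
            self.pit[name] = {downstream}
            prefix = max((p for p in self.fib if name.startswith(p)),
                         key=len, default=None)          # longest-prefix match
            if prefix is not None:
                for upstream in self.fib[prefix]:        # Interest (multicast) forwarding
                    upstream.send_interest(name)

        def on_data(self, name, data):
            self.cache[name] = data                      # opportunistic caching
            for face in self.pit.pop(name, set()):       # Data fan-out (multicast)
                face.send_data(name, data)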

ICN security is built on the notion of data-centric security: the content itself is made secure, rather than the connections over which it travels. The ICN security framework provides each entity with a private key and an ICN digital certificate, signed by a trust anchor, and uniquely identified by a name called key-locator ndntrust . Each Data packet is digitally signed by the content owner and includes the key-locator of the digital certificate to be used for signature verification. For access control purposes, Interest packets can be signed too.

An ICN uses receiver-driven flow/congestion control. To download a content formed by many chunks, the consumer sends a flow of Interest packets, one per chunk, and receives the related flow of Data packets. Flow and/or congestion control are implemented on the receiver side by limiting the number of in-flight Interest packets to a given amount (aka pipeline-size), which can be a constant value or a variable one, e.g., controlled by an AIMD scheme salsano2012transport .
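
A minimal sketch of such a receiver-driven download with a constant pipeline size is reported below; the consumer object and the chunk naming (prefix/s1, prefix/s2, ...) are assumptions made only for illustration:

    def fetch_content(consumer, prefix, num_chunks, pipeline_size=8):
        chunks, next_seq, in_flight = {}, 1, set()
        while len(chunks) < num_chunks:
            # keep at most pipeline_size Interests in flight
            while len(in_flight) < pipeline_size and next_seq <= num_chunks:
                consumer.send_interest(f"{prefix}/s{next_seq}")   # one Interest per chunk
                in_flight.add(next_seq)
                next_seq += 1
            seq, data = consumer.wait_data()    # blocks until a Data packet arrives
            in_flight.discard(seq)
            chunks[seq] = data
        return b"".join(chunks[i] for i in range(1, num_chunks + 1))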

2.2 NoSQL spatial databases

Database management systems (DBMS) may be based either on a relational model or on a non-relational model, also called NoSQL. Relational databases, such as MySQL, allow complex SQL queries but are usually hard to distribute across different servers, and performance issues show up when the database size increases.

For large information sets, NoSQL databases are increasingly replacing relational ones, since they can be easily distributed over different servers. This feature, known as horizontal scalability, fits well with cloud environments, where databases are usually deployed. MongoDB, CouchBase, Cassandra and DocumentDB are examples of popular NoSQL databases, with the first two also offering spatial features.

Most NoSQL databases are based on the document-oriented data model, actually a subclass of the key-value store. A document has a unique object identifier, is formed of a structure of key/value couples, and is usually represented as a JSON object. Whereas in a relational database the information is commonly spread over different tables to facilitate its reuse in different views (this feature being known as normalization), a NoSQL document is built as a self-consistent unit for database operations (this feature being known as de-normalization), thus simplifying data distribution and horizontal scaling.

The storage space of a NoSQL database is logically organized in data-sets, i.e., groups of related objects, such as a MongoDB/DocumentDB collection, a Cassandra columns-family, or a CouchBase bucket. Furthermore, the storage space can be physically partitioned over different servers (sharding).

An essential characteristic of a modern database is data indexing. An index is an internal data structure in which references to stored data are sorted according to a given field (column) of the data, named the index key. On the one hand, an index consumes storage space and computational resources to be kept sorted; on the other hand, it drastically accelerates queries related to the indexed data fields, especially when the database size grows. For this reason, indexes should be created only for those fields that are expected to be frequently used in queries.

Spatial databases have specialized index structures that improve the speed of spatial operations. In general, a spatial indexing method partitions the geographical space into regions that can be further decomposed into sub-regions. The resulting region hierarchy forms a tree data structure. The most popular indexing methods are the Grid, the R-Tree, and their variants MicrosoftSQL guttman1984r .

The R-tree method is more efficient in terms of storage consumed by the index structure, but the index may change as a function of the inserted items. Conversely, a grid-based index has the advantage that the structure of the index can be created first, and data added on an ongoing basis without requiring any change to the index structure; however, some nodes of the index can be unused.

2.3 Federated database system

Even though the problem of database federation has been known since the early ’80s, there is not much literature on the topic, probably because, until now, practical use cases have been few. As discussed in the introduction, we believe that the demand for the integration of different heterogeneous databases is increasing, and so the federation of NoSQL databases may have a real application in the near future.

The paper mcleod1980federated is one of the first works focusing on the database federation concept. The authors discuss different design alternatives and a generic sequence of operations that should be performed by a ”Federal controller” to execute a transaction (e.g., a query) through the federation. The controller identifies the target DBSs that should satisfy the transaction, translates the request into their format and then collects the results. We propose a more distributed design and the use of a neutral transaction format between the controller and the remote DBSs. In so doing, the introduction of a new technology does not require changing existing sites. Moreover, we also provide an implementation of the proposed concepts, exploiting ICN, and also consider the issue of global indexing.

In sheth1990federated the Authors present a very comprehensive taxonomy of the types of multi-database systems, mainly focused on the issue of schema integration in the identified cases. In their taxonomy, our system is a tightly coupled one, i.e., the federation is managed by administrators rather than by end users. However, the focus of our work is rather different, because we assume the schema integration problem as solved, and we mainly propose solutions for networking and global indexing issues.

In laurini1998spatial the Author proposes solutions for problems arising from the integration of heterogeneous ”relational” databases with different schemata and spatial object representations (e.g., different projection schemes, geometric discrepancies for boundary objects, etc). Moreover, the paper explores two possible solutions for global indexing based on Peano keys and R-Trees. Our work follows the position of laurini1998spatial regarding the need for a global index but we use an adaptive grid structure, since we believe it is more stable for database operations. We do not consider the schemata integration issues because we are focusing on NoSQL databases that are schema-less. We do not consider the problem of heterogeneity of the spatial representations because we assume GeoJSON as the common format; otherwise, the solutions proposed in laurini1998spatial can be used also in our framework. Finally, our work is much more focused on networking aspects and also includes a performance evaluation.

In dharmasiri2013federated the Authors presented a federation architecture for NoSQL databases, successfully tested with CouchDB, MongoDB, and Cassandra. Differently from us, their system uses classic TCP/IP and query flooding (no global index).

Finally, we point out that in ogb we dealt with ICN and NoSQL databases; however, in that paper we explored how ICN can be used to implement distributed databases, rather than federated ones. The main difference between the two is that in a distributed deployment the data is spread over the available databases of the cluster according to a sharding logic, thus the query routing process does not need any indexing. In a federation, the data distribution cannot be controlled, since the users, rather than a sharding logic, decide where to insert data, and this requires designing a different approach to solve queries. Moreover, the security requirements are also different. Besides, we note that the ICN distributed database that we proposed in ogb can be considered as one of the possible NoSQL DBMSs that can join our proposed federated architecture, like MongoDB, CouchBase, etc.

Figure 3: NoSQL/ICN federated database architecture

3 ICN/NoSQL federated databases

We consider a federation formed by autonomous DBSs, each one with its administrator and users. Every user is registered to a home DBS and can execute Create, Read, Update and Delete (CRUD) operations only for objects stored in its home DBS and through the local DBMS API (fig. 3). The user can access the federation functionality to extend the scope of Read (queries) operations to the overall federation; a federated query searches for matching data through the whole set of federated DBSs as if they were a single one.

Unlike SQL databases, NoSQL databases do not (yet) have a standard query language, and this may potentially hamper their complete federation, because only a subset of functionality can be available in all federation sites. In dharmasiri2013federated the Authors observe that traditional query operations are supported by most NoSQL databases and that it is possible to translate these operations from one language to another. Concerning spatial databases, we verified that most NoSQL ones (MongoDB, CouchBase, CouchDB, etc.) support spatial range queries, which are searches for objects intersecting or contained in a given geographical area and having specific properties. For instance, a spatial range query may be: search all objects included in the polygon P1 and whose property ”POI type” is equal to ”hotel”. In this paper we assume that the types of queries that a user can submit to the federation are spatial range queries only, due to their wide availability.
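
For instance, on a MongoDB site such a spatial range query could be expressed with pymongo as in the following sketch; the collection and property names are our own example, not a schema mandated by the architecture:

    from pymongo import MongoClient

    poi = MongoClient()["federation"]["poi"]       # data-set "POI" on the local DBS
    p1 = {"type": "Polygon",                       # queried area P1 (closed ring, lon/lat)
          "coordinates": [[[12.4, 41.8], [12.6, 41.8], [12.6, 42.0],
                           [12.4, 42.0], [12.4, 41.8]]]}
    hotels = poi.find({
        "properties.poi_type": "hotel",                    # property constraint
        "geometry": {"$geoWithin": {"$geometry": p1}}      # spatial constraint
    })

A 2dsphere index on the geometry field is assumed; $geoIntersects can be used instead of $geoWithin to search for intersecting, rather than contained, objects.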

The storage space of the federation is organized in data-sets, each of which is identified by a unique data-set identifier (did). A federated data-set is actually the union of the homologous data-sets available in the federated DBSs. For instance, assuming that the federation is formed by two DBSs, both having the data-set ”POI”, the union of these local data-sets forms the federated data-set with did=POI. A data-set contains spatial objects structured as generic GeoJSON objects, with a mandatory property called object name (oName). The object name uniquely identifies a version of an object in the federation, thus if the object is updated its object name must be changed.

3.1 Functional Architecture

Figure 3 shows the functional architecture of the federated database. The large gray box contains the federation functions that we are going to discuss.

Federated queries are received by a Federation Front End function (FED-FE) that controls the access rights of the user, executes the requested queries by interacting with the query processor, and sends back the answer to the user. The query processor carries out the query by interacting with local and remote DBSs and collecting their answers. Preliminarily, the query processor uses a global index function to single out the subset of DBSs that might have matching objects, thus reducing the query distribution scope. A DBMS adapter is used to ”translate” the queries generated by the federation functions (query processor, global index, etc.) into the language used by the local DBMS.

The communications among federated DBSs are supported by an Information Centric Network, which is connected to the local DBS through an ICN forwarder. The ICN could be public or private, and it is offered by an ICN service provider. The FIB of the ICN forwarder is automatically configured by an ICN routing agent (e.g., NLSR hoque2013nisr ) that has a peering relationship with the ICN access node (AN) of the provider. (In the absence of an ICN service provider, a P2P deployment is possible by directly peering the ICN forwarders with each other. Alternatively, the federation administrators can deploy a shared ICN node that forwards control and data traffic, whose volume is expected to be modest for DB applications.)

Abbreviation - Description
DBS - Database System; refers to a site of the federation
DBMS - Database Management System; refers to the specific database technology used in a site, e.g. MongoDB, CouchBase, etc.
oName - Unique identifier of an object in the federation; it changes when the object changes
oInterest - Interest packet used to fetch an object through an oName
oData - Data packet used to satisfy an oInterest, containing the requested object
qName - One-time ICN name used to encode a query statement for a remote DBS
qInterest - Interest packet used to send a query statement to a remote DBS by using a qName
qData - Data packet used to satisfy a qInterest, containing the oNames of the objects matching the query statement
vInterest - Interest packet used by a DBS to advertise a new version of its global index information; it is distributed over the multicast prefix index/notify
gInterest - Interest packet used to fetch global index information from a remote DBS
gData - Data packet used to satisfy a gInterest, containing the global index information of a DBS
dbsid - Unique identifier of a DBS of the federation
did - Data-set identifier
Table 1: Abbreviations

3.2 Joining procedure

To join a new DBS to the federation, the DBS administrator obtains from the ICN service provider a unique database identifier (dbsid) and a valid ICN certificate signed by the provider. The (unsigned) certificate of the provider is the security trust anchor of the federation.

The ICN routing agent (fig. 3) establishes a peering relationship with the ICN access node (AN) and announces the prefix dbsid. The routing agent concurrently receives the identifiers of other DBSs of the federation. To support global indexing functionality (discussed later on), the routing agent announces also the prefix {dbsid}/index/data and the multicast prefix index/notify. After the propagation of these routing announcements, any Interest whose name contains one of these prefixes will reach the joined DBS.

To avoid the joining of unauthorized DBSs or the tampering of ICN routing plane information, any routing announcement is signed by the DBS and the signature is verified at the receiving side, i.e., by any other routing agents both in the ICN network and in the remote DBSs.

3.3 Query resolution

Figure 4: Query procedure executed by the query processor

A spatial range query is solved by a query processor as shown in fig. 4. In the following explanation we denote as local query processor the query processor receiving the query from the user, and as remote query processors the query processors of the remote DBSs participating in the query resolution.

Initially, the local query processor selects the DBSs of the federation to which it is worth sending the query. A simple choice, named query flooding, would be to send the query to all the DBSs. Query flooding has the advantage of not requiring any knowledge about what is stored in the databases, but has the drawback of possibly overloading them with queries for which they have no matching data. When the federation grows, it is necessary to be more efficient by adopting query routing strategies that reduce the query scope by identifying the subset of databases that might have matching objects. To this aim, the local query processor inquires the global index function by passing to it the spatial area queried by the user; in turn, the global index function replies with the identifiers of the DBSs having objects in that area (e.g., local dbs#1, dbs#2 and dbs#3 in fig. 4). The local query processor subsequently relays the query to the selected DBSs and eventually collects the query results by using the following ICN procedure.

To relay a query to a remote DBS, a naming function computes an ICN name, called qName, composed of the identifier of the remote DBS (dbsid), a q marker identifying that this is a query name, the data-set identifier (did), the query statement and a random nonce; e.g., dbs#2/q/POI/{query}/1234 is a possible qName. For each remote DBS a qName is computed and an Interest packet (qInterest) is sent out. ICN forwarders route-by-name a qInterest to the remote DBSs by exploiting the database identifier (dbsid) contained in the qName and their FIB entries.

The qInterest is handled by the remote query processor, which extracts the query statement from the qName and thereafter relays the query to the local DBMS. This query is made in such a way as to retrieve only the names (oNames) of the objects matching the query conditions, rather than the whole objects’ information. The resulting list of oNames is packaged in a Data packet, called qData, which is sent back to satisfy the qInterest. An oName is formed by the identifier of the database system (dbsid), the o marker identifying that this is an object name, the data-set identifier (did), and a unique string that identifies a specific version of the object; e.g., dbs#2/o/POI/{_id}-v1 is a possible oName, where {_id} is a unique identifier of the object in the specific DBS, possibly equal to the id natively provided by the DBMS, and v1 is a version number.
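
The following helper functions sketch how such names could be assembled; the exact encoding used in our implementation may differ, and the function names are illustrative only:

    import json, random

    def make_qname(dbsid, did, query):
        nonce = random.randint(0, 2**32 - 1)              # one-time random nonce
        return f"{dbsid}/q/{did}/{json.dumps(query)}/{nonce}"

    def make_oname(dbsid, did, obj_id, version):
        return f"{dbsid}/o/{did}/{obj_id}-v{version}"     # the name changes with the version

    # e.g. make_qname("dbs#2", "POI", {"poi_type": "hotel"}) and
    #      make_oname("dbs#2", "POI", "5c3a0f...", 1)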

When all qData packets coming from remote DBSs are received by the local query processor, the latter has the whole list of oNames of federated objects matching the query condition. These objects are then pulled through parallel oInterest-oData packet exchanges.

In so doing, we are solving a query in two phases: a fetch-names phase and a fetch-objects phase (fig. 4). This may seem a temporal inefficiency, but we have chosen this approach to exploit ICN in-network caching, as discussed hereafter.

3.4 Exploitation of in-network caching

Even though caching can dramatically accelerate query resolution, its usage should be carefully designed in database applications, when it is not acceptable to send back stale data. For this reason, we used a two-phase query resolution strategy and two different ICN caching approaches for qData and oData.

Before devising the two-phase strategy, we had considered a simpler one-phase strategy, made of a qInterest-qData exchange only, in which the returning qData packet merely contained the whole set of matching objects. However, we observed that the result of a query statement may change over time due to object insertions or removals. Consequently, in-network caching could not be used for qData, to avoid the risk of sending back stale information. In addition, even assuming to accept stale information, cache hits would happen only for those limited sets of queries which are exactly equal to each other, e.g., two users searching objects in the same area and with the same property constraint. All this means that the one-phase query resolution strategy cannot be used when the reception of stale data is unacceptable, and that caching might not be effective, due to the natural heterogeneity of query statements.

For this reason, we decided to move forward to the two-phase strategy, which implies that qData contains only the names of matching objects. These names can change over time and thus, again, qData packets cannot be cached. However, ICN caches can be fruitfully and safely used for the subsequent oData packets. Fruitfully, because an oData packet contains a single spatial object, whose cached version can be reused also by heterogeneous queries, e.g., queries with a partial overlap of the queried areas. Safely, because there is no risk of sending back stale data, either after a spatial object update or after a removal. In fact, when an object changes, its object name (oName) is changed as well, and such a new name is sent back in the list inside the qData, making it possible to fetch the updated version of the object in the second phase. Moreover, if the object is deleted, its oName will no longer be included in the qData, thus the deleted object will not be fetched in the second phase.

Finally, to avoid cache poisoning problems bianchi2013check , Interest and Data packets are signed by the sender and verified by the receiver, so as to prevent access to the federated data by unauthorized DBSs and to avoid information manipulation.

3.4.1 Global Indexing

The federation uses a global indexing function based on a grid strategy and on a network synchronization procedure through which every DBS eventually obtains the same version of the index. We chose a grid methodology because it is more stable and reduces the synchronization effort.

The grid regions are squared geographical tiles, aligned with world parallels and meridians (fig. 5). Tiles have L possible resolution levels, from level 0 to level L-1. A tile of level l contains K tiles of level l+1. For instance, in case of L=3, level-0 tiles have a lon/lat size of 1 degree, level-1 tiles have a size of 0.1 degrees, and level-2 tiles have a size of 0.01 degrees.

The index processor (fig. 3) of the i-th DBS tessellates the area covered by the stored spatial objects with a set S_i of non-overlapping tiles, denoted as active tiles. Thus, a tile is active if it intersects at least one stored object. For instance, in fig. 5 we have a DBS storing three POIs. The tessellation has a size of two tiles and is formed by a level-2 and a level-1 active tile. Due to insertions and removals of spatial objects, the set of active tiles forming the tessellation is periodically updated by the index processor.
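
Assuming that tiles are aligned to integer multiples of their size (an assumption on the exact grid origin), the level-l tile containing a lon/lat point can be computed as in the following sketch:

    import math

    def tile_of(lon, lat, level):
        size = 1.0 / (10 ** level)                 # tile side in degrees (1, 0.1, 0.01, ...)
        return (level, math.floor(lon / size), math.floor(lat / size))

    # A tile is active for a DBS if at least one stored object intersects it,
    # e.g. a POI located at (12.4924, 41.8902) intersects the level-2 tile
    # tile_of(12.4924, 41.8902, 2).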

Figure 5: Active tiles of a DBS
Figure 6: Global index synchronization

When the tessellation is updated, the index processor distributes it to the other remote DBSs. In parallel, it receives the tessellations of the remote DBSs. The global index is built by each DBS by merging the local and remote tessellations. Such a synchronization procedure uses the ICN packet exchanges shown in fig. 6, in which DBS #1 has produced an updated tessellation S_1. The procedure exploits ICN multicast and security capabilities as follows. We remind that the routing engine of any DBS announces the multicast prefix index/notify and the unicast prefix {dbsid}/index/data. Periodically (e.g., every 1 s), the index processor of DBS #1 sends an Interest packet (vInterest) with index/notify as name prefix, followed by /dbs#1/version=v, where v is an increasing version number of the local tessellation S_1. The vInterest reaches the ICN network that, in turn, carries out a multicast distribution towards any other node that announces index/notify, i.e., towards any other DBS of the federation. If the receiving DBS #n has an older version of S_1, it fetches the new set by sending an Interest (gInterest) for dbs#1/index/data/version=v. The same gInterest can be sent by other DBSs nearly at the same time, thus triggering ICN multicast distribution for the returning Data message (gData) too, which contains the updated version of S_1. To avoid tampering with the global index, or the acquisition of the index by unauthorized entities, any related Interest and Data packets are signed by the producing entity and verified by the receiving one.
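
In pseudo-Python, the receiving side of this synchronization could be sketched as follows; merge_into_global_index and icn.express_interest are hypothetical placeholders, and signature verification is omitted:

    known_versions = {}        # dbsid -> last tessellation version seen

    def on_vinterest(dbsid, version, icn):
        if version > known_versions.get(dbsid, -1):
            # fetch the updated tessellation; concurrent gInterests from other
            # DBSs for the same name are served by a single multicast gData
            gdata = icn.express_interest(f"{dbsid}/index/data/version={version}")
            known_versions[dbsid] = version
            merge_into_global_index(dbsid, gdata)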

The elements of the global index are stored in a local spatial database, which could be either the same one used for storing the spatial objects of the customers or an additional, faster in-memory database. Each active tile is stored as a spatial object with a squared shape and with the database identifier (dbsid) as a property. Let us assume that a user submits to the system a range query whose requested area is A. To single out which DBSs are to be involved in the query resolution, the query processor submits to the global index function a range query requesting the same area A, thereby receiving from the underlying database the set of intersecting active tiles and, in turn, the set of database identifiers of the DBSs to be contacted for solving the query.
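
A minimal sketch of this lookup, assuming that the active tiles are stored in a MongoDB-like collection with a 2dsphere index and a properties.dbsid field (our own naming), is the following:

    def dbs_for_area(index_coll, area_geojson):
        hits = index_coll.find(
            {"geometry": {"$geoIntersects": {"$geometry": area_geojson}}},
            {"properties.dbsid": 1})
        return {h["properties"]["dbsid"] for h in hits}   # DBSs to contact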

Adaptive tessellation.

Due to the tile heterogeneity, many possible tessellations may exist, and an optimization problem arises as follows. When the query processor inquires the global index, it obtains a list of candidate DBSs; however, some of them could be false-positives, i.e., not actually storing any spatial object intersecting the queried area. This is because an active tile may be larger than the enclosed spatial objects. For instance, in fig. 5 the DBS stores three POIs and advertises two active tiles whose covered area is greater than the POIs’ one. Consequently, it may happen that a query intersects an active tile but not the contained POI, thus generating a false-positive. The consequence of a false-positive is the useless sending of queries to DBSs, wasting network bandwidth and, more importantly, processing capacity. The volume of false-positives is a measure of the accuracy of the global index: the better the accuracy, the lower the number of false-positives.

A tessellation made with the smallest possible tiles, e.g., level-2 tiles in fig. 5, has the pro of providing the highest possible accuracy. However, it has the con of generating the highest number of active tiles to synchronize. To give an idea of the involved numbers, if we exclusively use level-2 tiles for a DBS containing 1/3 of the OpenStreetMap European POIs, the resulting number of active tiles is in the order of hundreds of thousands. Such a high number of active tiles has many drawbacks, including a great number of bytes to be transferred during the synchronization process, a higher probability that insertion or removal operations change the set of active tiles, therefore increasing the synchronization frequency, etc.

Summarizing: smaller tiles provide better accuracy but require a higher synchronization effort. Accordingly, a trade-off shows up and we can model it with the following optimization problem: find the tessellation S with the minimum cost C(S), with the constraint that the number of active tiles |S| is not greater than a given value N_max, i.e.,

    minimize C(S)                (1)
    subject to |S| <= N_max

We define the cost C(S) of a tessellation S as the difference between the area covered by its tiles and the area covered by the tiles of the tessellation S_min obtained by using the smallest possible tiles. The lower the cost, the higher the expected accuracy. Indeed, if we have no constraint on |S|, the best choice (zero cost) in terms of accuracy is to select the tessellation S = S_min. We also define the cost c(t) of a tile t as the difference between the area covered by the tile and the area covered by its children tiles that belong to the minimum tessellation S_min. It follows that the cost of a tessellation is the sum of the costs of its tiles. In golab2015size the authors demonstrate that such a size-constrained weighted set cover problem is NP-hard. Thus, we propose the greedy Algorithm 1.

Initially, the algorithm uses the tessellation S_min to build an L-level tree T, whose nodes are the tiles of S_min plus all their parent tiles up to level-0. This tree may have many disjoint roots, and thus we add a common root node (see fig. 16(a)). Starting from the tree T, the algorithm follows a top-down reduction approach and, at the end of the iterations, the active tiles of the final tessellation are the leaves of the reduced tree.

At each step, the algorithm computes the highest resolution level of the next tile that has to be added to the final tessellation to respect the constraint. Colloquially, the question posed by the algorithm is: ”is it necessary to add a tile of level-0?”; if not, ”is it necessary to add a tile of level-1?”; and so forth. In general, it is necessary to add a new level-l tile if we are not able to respect the constraint by finishing the tessellation with all the remaining (smaller) level-(l+1) tiles, but we are able to respect the constraint by finishing the tessellation with all the remaining level-l tiles.

    N_max = max number of active tiles
    L = number of resolution levels
    T = hierarchical tree of tiles
    S = leaves of T
    l = level of the next tile to be added
    Build T from S_min
    # constraint violation exception
    if (number of level-0 tiles of T > N_max) then
        return S = set of level-0 tiles of T
    end if
    # tessellating iteration
    while (number of leaves of T > N_max) do
        for l = 0 to L-1 do
            if (a new level-l active tile is necessary) then
                Find the new level-l tile t with minimum cost c(t)
                Prune all its children from T
                Break
            end if
        end for
    end while
    return S = leaves of T
Algorithm 1 Constrained Tessellation

When the level l of the next tile has been identified, among the tiles of this level not yet included in the tessellation, the algorithm selects the one with the minimum tile cost c(t), and the related sub-tree is removed from T. Then, the iteration restarts and continues until the number of leaves respects the constraint N_max. There are some exceptional cases in which respecting the constraint is impossible. This happens when even the smallest possible tessellation entirely formed by level-0 tiles has a size greater than N_max. In these cases, such a minimum size tessellation is returned by the algorithm.
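
For concreteness, the following Python sketch implements the top-down greedy reduction described above under our own tile-encoding assumptions (a tile is a (level, ix, iy) triple and every tile is split into k x k children); it is not the code used in our implementation:

    def ancestor(tile, level, k):
        l, ix, iy = tile
        while l > level:                         # climb towards level-0
            l, ix, iy = l - 1, ix // k, iy // k
        return (l, ix, iy)

    def tile_area(level, base_deg=1.0, k=10):
        return (base_deg / k ** level) ** 2      # area of a tile, in squared degrees

    def tile_cost(tile, s_min, k):
        # area of the tile minus the area of its descendants belonging to S_min
        covered = sum(tile_area(t[0], k=k) for t in s_min
                      if ancestor(t, tile[0], k) == tile)
        return tile_area(tile[0], k=k) - covered

    def collapse_size(leaves, level, k):
        # tessellation size if every leaf deeper than `level` is replaced
        # by its level-`level` ancestor
        return len({ancestor(t, min(level, t[0]), k) for t in leaves})

    def constrained_tessellation(s_min, n_max, levels=3, k=10):
        leaves = set(s_min)
        if collapse_size(leaves, 0, k) > n_max:          # constraint violation exception
            return {ancestor(t, 0, k) for t in leaves}
        while len(leaves) > n_max:
            for l in range(levels - 1):                  # try level 0 first (top-down)
                too_big_below = collapse_size(leaves, l + 1, k) > n_max
                fits_here = collapse_size(leaves, l, k) <= n_max
                if too_big_below and fits_here:
                    # among the level-l ancestors of the current leaves, pick the
                    # one with minimum cost and prune its sub-tree
                    cands = {ancestor(t, l, k) for t in leaves if t[0] > l}
                    best = min(cands, key=lambda c: tile_cost(c, s_min, k))
                    leaves = {t for t in leaves
                              if t[0] <= l or ancestor(t, l, k) != best}
                    leaves.add(best)
                    break
        return leaves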

We also tried a simpler bottom-up reduction algorithm. However, we found that the proposed top-down algorithm is more efficient, as discussed in Appendix I.

4 Performance analysis

We implemented the architecture in fig. 3 by using NDN (a specific implementation of ICN) ndn , Java Spring (STS), MongoDB and CouchBase NoSQL DBSs. The neutral format used to express a federated database operation is the JSON query format used by MongoDB, which is properly translated at the receiving side when the local DBMS is different.

We set up a database federation formed of three DBSs, connected to each other by a single ICN node, emulating the network of an ICN provider. We considered two configurations of the federation: in the first configuration, named ”1C+2M”, a site uses CouchBase and two sites use MongoDB. In the second configuration, named ”2C+1M”, we have two CouchBase sites and one MongoDB site. The DBSs and the ICN network node run on different servers connected by an Ethernet switch.

The dataset stored in the federated DBSs is made up of 3 million European POIs, gathered from OpenStreetMap. Regarding the distribution of the data items, we considered two types of locality: random and country-based. In the random case, every POI is stored in a randomly chosen DBS; in the country-based case, each DBS stores the POIs of a specific set of countries. The countries have been grouped so as to have almost the same number of POIs per DBS.

In addition to the permanent storage space provided by the databases, i.e., MongoDB or CouchBase, data items can be opportunistically stored in the caches of ICN forwarders too, whose capacity is set to 256000 Data packets, roughly 256000 POIs. For the global index, we used the 3-level hierarchical grid described in sec. 3.4.1 and each DBS has a default limit N_max on the advertised active tiles equal to 20000.

For the workload we used trials formed by a sequence of 5000 spatial queries with squared areas, evenly submitted to the DBSs of the federation. The area of each query is randomly centered within the European borders. The query arrival process is Poisson. Each result is obtained by averaging 10 trials and all 95% confidence intervals are smaller than 10%.

Figure 7: Stable and unstable system behavior
Figure 8: Maximum query rate vs. query area, random storage

We defined as ”maximum query rate” the highest rate for which the time needed to solve a query has a stable behavior, as illustrated in fig. 7. Fig. 8 shows the maximum range query rate versus the range query area, for different configurations of the federation and for random locality. The rate decreases as the query area increases, because the system takes more time to solve larger queries. Indeed, the resolution of larger queries involves a greater number of DBSs, thus increasing their load, and produces query responses with more POIs, thus requiring longer transmission times.

The two federation configurations, 2C+1M and 1C+2M, result in very similar performance. This is because MongoDB and CouchBase sites provide similar performance individually, thus a different mix of them does not change the outcome. In fact, the figure shows that a system composed of only one MongoDB DBS (1M) and a system composed of only one CouchBase DBS (1C) provide similar results. We point out that the maximum query rate increases when more DBSs are in the federation (e.g., from 1M to 1C+2M), thereby showing the horizontal scalability of the federated system: the more resources are available, the higher the sustainable load.

Figure 9: Maximum query rate vs. query area, 1C+2M
Figure 10: Average query delay vs. query rate, 1C+2M, random storage

Fig. 9 shows the impact of the different locality configurations: random and country-based. We observe that locality has a significant impact for small query areas, e.g., up to 100 km². In these cases, better performance is achieved for higher data locality (country-based), because in this case the query routing mechanism reduces the number of DBSs involved in solving a query, consequently decreasing the DBS load and the response time. This difference fades out as the query area increases, because larger queries involve, anyway, more DBSs: for very large areas all DBSs are involved in both cases, thus resulting in the same performance.

Fig. 10 shows the average query delay versus the query rate in the case of query areas of 100 and 1000 km². The query time increases with the query area and with the query rate. The figure also points out that the system provides stable performance when loaded with a query rate lower than the maximum one, i.e., 125 for 100 km² and 45 for 1000 km². When the query rate gets close to the maximum one, the response time quickly grows and the system becomes unstable.

Figure 11: Maximum query rate vs. query area, with and without ICN caching, 1C+2M, random storage
Figure 12: Maximum query rate vs. query area, with and without (flooding) global indexing, random storage, 1C+2M

Fig. 11 shows the maximum query rate versus the query area in the presence and in the absence of ICN caching. ICN caches reduce the load of the DBSs and the time needed to fetch matching objects. Consequently, the ICN caching functionality improves the sustainable query rate; this benefit is higher for large queries, which imply the exchange of many objects per query.

We now discuss benefits and trade-offs related to the global indexing strategy discussed in section 3.4.1. First of all, we would like to highlight the importance of having a global index, thus justifying the design effort made in this paper. To this end, fig. 12 shows the maximum query rate with and without global indexing. In most cases, the resulting rate is dramatically higher with global indexing. Thus, a well-designed global indexing, enabling an effective query routing, can really make the difference in performance for federated database applications. It is also worth observing that, when the query area becomes very large, the performance with indexing gets close to the flooding one, since all DBSs are likely involved, and as a consequence query routing tends to be useless.

Figure 13: Maximum query rate vs. query area for different values of the number of active tiles and different tessellation, random storage, 1C+2M
Figure 14: Average size of global index advertisement for different values of the number of active tiles, random storage, 1C+2M

Fig. 13 shows the maximum query rate obtained by varying the constraint N_max of the adaptive tessellation, from 1500 up to 50000 tiles. The figure also includes the results obtained by using a simple unconstrained tessellation having all tiles of the same size, for three different configurations of the lat/lon tile size: 1, 0.1 and 0.01 degrees, respectively. We observe that a global index with higher accuracy, i.e., a greater number of tiles, provides considerable improvements only for small queries (100 km²). This is because there are more false-positive events for smaller areas, and thus a higher accuracy can avoid a significant number of them, allowing a higher query rate to be sustained. For larger queries, false-positive events become rare, and therefore the impact of a higher accuracy is lower.

By using the unconstrained tessellation we end up with a global index having about 1500, 60000 and 400000 active tiles per DBS, respectively. These numbers cannot be controlled, since they depend on the stored spatial objects. Conversely, the adaptive tessellation allows such control, by configuring the parameter N_max. By comparing adaptive and unconstrained tessellations, we see that the former reaches the best performance with 10000 tiles. Conversely, in the case of the unconstrained tessellation we need about 60000 tiles to achieve the same result, thus increasing the signaling and processing effort required to maintain the global index.

A measure of this effort is the average size of the announcements (gData packets) made by DBSs to share their index information; this metric is shown in Fig. 14. As expected, the higher the number of tiles, the greater the announcement size. By comparing the case of adaptive tessellation with 10000 tiles and the case of unconstrained tessellation with 60000 tiles, we observe that the former solution reduces the signaling overhead by roughly 80%, while providing the same performance (fig. 13).

Figure 15: Ratio between the total number of queries submitted to the federation and the number of queries received by a single DBS (out of three), random storage, 1C+2M

Finally, fig. 15 shows the ratio between the total number of queries submitted to the federation and the number of queries subsequently received by a single DBS. If we used query flooding, any query would be sent to every DBS, thus making such ratio equal to one. Therefore, the reported ratio is a measure of the load reduction achieved thanks to the global indexing and query routing mechanisms. In the worst case, the use of the global index reduces this ratio to about 50%, and to even lower values when we increase the accuracy of the index, i.e., N_max. The reduction is higher for smaller query areas, where the index accuracy has a greater impact. By comparing adaptive and unconstrained tessellations, we see that the adaptive tessellation provides a lower DBS load for a similar number of tiles.

We conclude this section by pointing out that the obtained numerical results clearly depend on the hardware and on the software code that we have used, since the evaluation is based on a real implementation. Consequently, all the above results are of interest more for the insights that they provide than for their absolute numerical values. (After extensive searches in the literature and on the Web, we did not find software solutions for federating heterogeneous NoSQL databases; for this reason, we have been unable to carry out a comparative analysis.)

5 Conclusions

In this paper, we exploited the ICN concept to federate NoSQL databases. Multicast, caching and data-centric security offered by ICN are indeed useful instruments to cope with communication issues arising from the federation of databases. The results obtained from a practical implementation and with real datasets have shown the ability of ICN to effectively integrate heterogeneous NoSQL databases, while providing efficient query resolution through global indexing and in-network caching.

Even though we focused our work on spatial databases, we argue that the federated architecture that we are proposing and its ICN-based procedures can be used also for databases storing plain objects, even if the global indexing strategy must be changed in that case. To this end, a first possible option is to avoid using a global index and to solve the query with simple query flooding, accepting the consequent performance drawbacks shown in this paper. Another approach is to generalize the index tessellation process as follows. Let us assume that we want to index a specific key of the stored objects, e.g., the surname of a person, for an address book application. We can map the index key (surname) to a 1D hash space and then tessellate it with the same proposed adaptive algorithm, but using 1D segments rather than 2D tiles. In so doing, we would again obtain the benefits of query routing shown in the spatial database case.
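
As a purely illustrative sketch of this generalization (not implemented in this work), an index key could be hashed to a point in [0, 1) and indexed with 1D segments of decreasing size:

    import hashlib

    def key_to_point(key):
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return (h % 10**6) / 10**6                 # point in the 1D hash space [0, 1)

    def segment_of(key, level, k=10):
        size = 1.0 / (k ** level)                  # level-l segments have width 1/k**l
        return (level, int(key_to_point(key) / size))

    # e.g. segment_of("Smith", 2) identifies the level-2 segment that a DBS
    # storing that entry would advertise; coarser segments play the role of
    # larger tiles.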

Acknowledgements

This work is supported in part by the European Commission in the context of the EU-JP ICN2020 and Fed4IoT projects.

References

  • (1) R. H. Güting, An introduction to spatial database systems, The VLDB Journal—The International Journal on Very Large Data Bases 3 (4) (1994) 357–399.
  • (2) A. P. Sheth, J. A. Larson, Federated database systems for managing distributed, heterogeneous, and autonomous databases, ACM Computing Surveys (CSUR) 22 (3) (1990) 183–236.
  • (3) V. Jacobson, D. K. Smetters, J. D. Thornton, M. F. Plass, N. H. Briggs, R. L. Braynard, Networking named content, in: Proceedings of the 5th international conference on Emerging networking experiments and technologies, ACM, 2009.
  • (4) A. Hoque, S. O. Amin, A. Alyyan, B. Zhang, L. Zhang, L. Wang, Nlsr: named-data link state routing protocol, in: Proceedings of the 3rd ACM SIGCOMM workshop on Information-centric networking, ACM, 2013.
  • (5) Y. Yu, A. Afanasyev, D. Clark, V. Jacobson, L. Zhang, et al., Schematizing trust in named data networking, in: Proceedings of the 2nd International Conference on Information-Centric Networking, ACM, 2015, pp. 177–186.
  • (6) S. Salsano, A. Detti, M. Cancellieri, M. Pomposini, N. Blefari-Melazzi, Transport-layer issues in information centric networks, in: Proceedings of the second edition of the ICN workshop on Information-centric networking, ACM, 2012, pp. 19–24.
  • (7) Y. Fang, M. Friedman, G. Nair, M. Rys, A.-E. Schmid, Spatial indexing in Microsoft SQL Server 2008, in: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, ACM, New York, NY, USA, 2008, pp. 1207–1216. doi:10.1145/1376616.1376737.
  • (8) A. Guttman, R-trees: a dynamic index structure for spatial searching, Vol. 14, ACM, 1984.
  • (9) D. McLeod, D. Heimbigner, A federated architecture for database systems, in: Proceedings of the May 19-22, 1980, national computer conference, ACM, 1980, pp. 283–289.
  • (10) R. Laurini, Spatial multi-database topological continuity and indexing: a step towards seamless gis data interoperability, International Journal of Geographical Information Science 12 (4) (1998) 373–402.
  • (11) H. Dharmasiri, M. Goonetillake, A federated approach on heterogeneous nosql data stores, in: Advances in ICT for Emerging Regions (ICTer), 2013 International Conference on, IEEE, 2013, pp. 234–239.
  • (12) A. Detti, N. B. Melazzi, M. Orru, R. Paolillo, G. Rossi, Opengeobase: Information centric networking meets spatial database applications, in: 2016 IEEE Globecom Workshops (GC Wkshps), 2016, pp. 1–7. doi:10.1109/GLOCOMW.2016.7848988.
  • (13) G. Bianchi, A. Detti, A. Caponi, N. Blefari Melazzi, Check before storing: What is the performance price of content integrity verification in lru caching?, ACM SIGCOMM Computer Communication Review 43 (3) (2013) 59–67.
  • (14) L. Golab, F. Korn, F. Li, B. Saha, D. Srivastava, Size-constrained weighted set cover, in: Data Engineering (ICDE), 2015 IEEE 31st International Conference on, IEEE, 2015, pp. 879–890.
  • (15) NDN project, http://named-data.net/.

Appendix I

(a) step 0 - n. leaves = 20, cost = 0
(b) step 1 - n. leaves = 10, cost = 5
(c) step 2 - n. leaves = 7, cost = 5
Figure 16: Constrained tessellation (top-down) with N_max = 7 and L = 3

In this appendix, we report an example of the adaptive tessellation algorithm described in Sec. 3.4.1. In fig. 16, we consider a case where N_max = 7, L = 3, and a tile of level l contains 4 tiles of level l+1. We start from the initial tree T, whose leaves are the set S_min of level-2 tiles covering the database objects. The tile cost is reported inside the circles.

In step 0, we form the tree T with all the level-2 leaves. The resulting number of leaves is 20 and the cost is 0. In step 1, we observe that, to respect the constraint N_max = 7, we surely need to have a level-0 tile in the final tessellation. Indeed, the best reduction that we could obtain by using all level-1 tiles would result in a tree with 8 leaves. Consequently, in step 1 we select the level-0 tile with cost 5 as part of the final set and prune its sub-tree. The resulting number of leaves is 10 and the cost is 5. In the final step 2, we observe that another level-0 tile is not necessary, because if we used all the remaining level-1 tiles we would obtain a tree with 5 leaves, thus respecting the constraint. A level-1 tile is, however, necessary, because we would not respect the constraint by using all the remaining level-2 tiles. Consequently, we select the level-1 tile with cost 0 and remove the related sub-tree. The resulting and final tree has 7 leaves and a cost of 5.

(a) step 8 - n. leaves = 8
(b) step 9 - n. leaves = 5
Figure 17: Constrained tessellation (bottom-up) with N_max = 7 and L = 3

In fig. 17 we report the same tessellation case, but using a more intuitive bottom-up algorithm, which reduces the tree by iteratively selecting the minimum-cost parent node and pruning the related sub-tree. Starting from the same configuration of fig. 16(a), in step 1 the bottom-up algorithm selects the level-1 tile with cost 0. In step 2, the one with cost 1, and so forth. As shown in fig. 17(a), at the end of step 8 all level-2 tiles have been pruned and the resulting tessellation has 8 leaf nodes. Thus, another reduction step is necessary, because the number of leaves must not be greater than N_max = 7. In the final step 9, the algorithm selects the level-0 tile with a cost equal to 5; the resulting tessellation has 5 leaf nodes.

We observe that the top-down algorithm provides a lower cost with respect to the bottom-up one. This is due to the fact that the simpler bottom-up greedy strategy selects nodes of higher levels only after it has made many reductions at lower levels, and these lower-level reductions may turn out to be unnecessary (and costly) when a higher-level reduction is made thereafter. For instance, in fig. 17 the bottom-up algorithm recognizes the need for a level-0 tile only when it has reduced all level-2 tiles. But when the level-0 tile of cost 5 is inserted, many of the reductions previously made on the left part of the tree are no longer necessary. This can be noted by comparing fig. 16(c) and 17(b).