The Impact of Distance on Performance and Scalability of Distributed Database Systems in Hybrid Clouds

07/31/2020
by Yaser Mansouri, et al.

The increasing need for managing big data has led to the emergence of advanced database management systems. There have been increasing efforts aimed at evaluating the performance and scalability of NoSQL and relational databases hosted by either private or public cloud datacenters. However, there has been little work on evaluating the performance and scalability of these databases in hybrid clouds, where the distance between the private and public cloud datacenters can be one of the key factors affecting performance. Hence, in this paper, we present a detailed evaluation of throughput, scalability, and VM size vs. VM number for six modern databases in a hybrid cloud, consisting of a private cloud in Adelaide and Azure-based datacenters in the Sydney, Mumbai, and Virginia regions. Based on the results, first, as the distance between the private and public clouds increases, the throughput of most databases decreases. Second, MongoDB obtains the best throughput, followed by MySQL Cluster, whilst Cassandra exhibits the most fluctuation in throughput. Third, vertical scalability improves the throughput of the databases more than horizontal scalability. Fourth, exploiting bigger VMs rather than more VMs with fewer cores can increase the throughput of Cassandra, Riak, and Redis.


1 Introduction

For about half a century, relational databases have been the dominant solution for storing, retrieving, and managing data [orend2010, Chang2008, Leavitt2010]. However, due to essential requirements for high performance (in this paper, performance and throughput are used interchangeably unless otherwise mentioned) and scalability, NoSQL databases have emerged [Schram2012, Kuznetsov2014]. The explosive growth in the usage of NoSQL databases has led to several efforts aimed at evaluating their performance and scalability. The performance of these databases depends on their features (data model, data replication strategy, consistency mechanism) as well as the hardware underpinning the computing and storage infrastructure utilized [Floratou2012, Parker2013, Cooper2010]. Since the emergence of the cloud computing paradigm, most industrial organizations have deployed and operated "big data storage solutions" on cloud computing infrastructure.

Cloud computing traditionally comes in three models [Bokhari2018, Patidar2012]: public, private, and hybrid. A public cloud provides computing, storage, and networking resources to the general public over the Internet, while a private cloud provisions computing and storage resources from an organization's own cloud infrastructure. A hybrid cloud is a seamless integration of public and private clouds that benefits from the best of both worlds [Rimal2009]. It enables cloud bursting, in which applications initially leverage a private cloud and then burst into a public cloud when the private cloud's resources are insufficient under workload spikes. A hybrid cloud offers its owner business opportunities in terms of security (compliance with the location of sensitive data), availability, reliability, and monetary cost.

The performance of NoSQL and relational databases deployed on public or private cloud datacenters varies as the hardware infrastructure resources and database configuration parameters change. As indicated in the literature [Klein2015, Kuhlenkamp2014, Rabl2012], the more powerful the resources, the higher the throughput and the lower the read and write latencies these databases offer; the value of these performance parameters is directly proportional to the power of the hardware resources. There is a significant amount of literature on the evaluation of NoSQL databases deployed on either public or private cloud datacenters. More specifically, Rabl et al. [Rabl2012] evaluated the performance and scalability of four NoSQL databases (Voldemort: https://www.project-voldemort.com/voldemort/, HBase: https://hbase.apache.org/, Cassandra: http://cassandra.apache.org/, and Redis: https://redis.io/) and two relational databases (MySQL: https://www.mysql.com/ and VoltDB: https://www.voltdb.com/). Similarly, Kuhlenkamp et al. [Kuhlenkamp2014] conducted experiments on two NoSQL databases (i.e., Cassandra and HBase) from scalability and elasticity perspectives. The former work was conducted on a private cloud, the latter on a public cloud. Li et al. [Li2013] investigated the performance of six distributed databases in terms of read, write, delete, and instance initiation over a private cloud. Table 1 summarizes the studies on the evaluation of NoSQL and relational databases deployed on cloud computing infrastructure.

All the above-mentioned studies discuss the throughput/performance and the horizontal and vertical scalability of different NoSQL and relational databases in either private or public clouds, where the distance factor does not come into play. In contrast, the distance between the private and public cloud datacenters in a hybrid cloud can be an important factor that should be considered in the evaluation of distributed databases in terms of performance and scalability. Recently, we evaluated the throughput of distributed databases running in hybrid clouds without considering the distance factor [Mansouri2020].

Thus, the main research question that arises here is: to what extent can the distance between private and public cloud datacenters impact the performance and scalability of distributed database systems? An obvious answer is that the smaller the distance between the public and private cloud datacenters, the smaller the impact on database performance. This solution might not be desirable for all businesses due to differences in their needs, for example, privacy and monetary cost. For instance, a business might select a cheaper cloud datacenter in the East US region rather than one in the Sydney region to deploy its database, trading performance against hardware infrastructure cost. Furthermore, a wide selection of cloud datacenters across the world (e.g., 58 Azure regions worldwide: https://azure.microsoft.com/en-us/global-infrastructure/regions/) provides businesses with many options, in terms of distance, for the public cloud datacenter to pair with their private cloud datacenter to build a hybrid cloud. Therefore, the distance factor comes into play when building a hybrid cloud, as captured in the research question above.

To answer the above question, we exploited an automated hybrid cloud that enabled us to flexibly pair a public datacenter in different regions with a private one [Mansouri2020]. This implementation of a hybrid cloud also allowed us to select the VM size, the number of VMs, the desired database installation, the database cluster configuration, and so on. The selection of a public datacenter in a specific region depends on how much distance is desirable between the private and public cloud datacenters. Since we intended to evaluate the effects of distance on the performance of distributed databases, we chose public cloud datacenters in the Sydney and US East regions as the closest and the farthest, respectively, to a private cloud datacenter located in our lab (University of Adelaide). Moreover, we selected another public datacenter roughly midway between the closest and the farthest datacenters, which led to the choice of a datacenter in the West India (Mumbai) region.

In addition to the above research question, we were also interested in investigating the impact of VM packing on the throughput of distributed databases running in a hybrid cloud. VM packing means deploying fewer VMs with more cores instead of more VMs with fewer cores, such that the total number of cores in both deployments is equal (e.g., 2 VMs with 4 cores each instead of 4 VMs with 2 cores each). This investigation helped us demonstrate how much the latency between VMs in a cloud datacenter can impact the throughput of distributed databases. In other words, the latency between VMs in a single cloud datacenter reduces to the latency between cores in a single VM, which might be an effective means of deploying distributed databases in a hybrid cloud.

There are several choices of distributed databases whose performance can be evaluated for the impact of distance when deployed on a hybrid cloud [Wu2013]. Since NoSQL began thriving in 2011 [Lourenco2015], more than 225 NoSQL databases have emerged (http://nosql-database.org/). Among these databases, some are supported as pre-installed and configured infrastructure components by the well-known cloud providers (e.g., MongoDB: https://docs.microsoft.com/en-us/azure/virtual-machines/linux/install-mongodb and Cassandra: https://docs.microsoft.com/en-us/azure/architecture/best-practices/cassandra). Based on widespread usage and popularity, we selected six databases to benchmark the impact of distance on their performance, scalability, and VM packing: MongoDB (www.mongodb.com), Cassandra (http://cassandra.apache.org/), Riak (https://riak.com/), CouchDB (https://couchdb.apache.org/), Redis (https://redis.io/), and MySQL (https://www.mysql.com/). We investigated the impact of the distance between the cloud datacenters making up a hybrid cloud on the performance and scalability of these databases. In this investigation, we answer the following Research Questions (RQs):

  • RQ1: What is the impact of distance on the performance of widely used NoSQL and SQL databases in hybrid clouds?

  • RQ2: To what extent are NoSQL and SQL databases scalable in hybrid clouds?

  • RQ3: What is the impact of VM packing on the performance of distributed databases in hybrid clouds?

The remainder of this paper is structured as follows. Section 2 reviews the literature with respect to the performance evaluation of NoSQL databases and data-intensive applications on different models of cloud computing. Section 3 provides background on the hybrid cloud model implemented and deployed, as well as the distributed databases under test. Section 4 presents the experimental setup and performance evaluation results, and then discusses our findings. Finally, Section 5 draws some conclusions and identifies future work.

Paper | DBs name | Cloud model | Distance impact | Cloud bursting | Evaluation metrics
Rabl et al. [Rabl2012] | Cassandra, HBase, Redis, Voldemort, MySQL, VoltDB | Private | No | No | Throughput, horizontal and vertical scalability evaluation
Kuhlenkamp et al. [Kuhlenkamp2014] | Cassandra, HBase | Public | No | No | Scalability and elasticity evaluation
Li et al. [Li2013] | MongoDB, RavenDB, CouchDB, Cassandra, Hypertable, Couchbase, MySQL | Private | No | No | Throughput; read, write, delete latency
Klein et al. [Klein2015] | MongoDB, Cassandra, Riak | Public | No | No | Throughput evaluation for different consistency settings
Abramova and Bernardino [Abramova2013] | MongoDB, Cassandra | Private | No | No | A comparison between Cassandra and MongoDB in performance
Cooper et al. [Cooper2008] | Cassandra, HBase, MySQL, PNUTS | Private | No | No | Throughput, scalability, read and write latency
Veen et al. [vanderVeen2012] | MongoDB, Cassandra, PostgreSQL | Private | No | No | Throughput evaluation
Bastiao et al. [Bastiao2014] | MongoDB, CouchDB, Lucene | Private | No | No | Retrieve and insert latency
Lourenco et al. [Lourenco2015] | Cassandra, CouchDB, MongoDB, MS SQL | Private | No | No | Throughput
Mansouri et al. [Mansouri2020] | MongoDB, Cassandra, Riak, Redis, CouchDB, MySQL | Hybrid | No | Yes | Throughput, read and write latency
This work | MongoDB, Cassandra, Riak, Redis, CouchDB, MySQL | Hybrid | Yes | Yes | Throughput, vertical and horizontal scalability, VM packing
Table 1: Comparison of the relevant studies with our work

2 Related Work

We compare our work with state-of-the-art studies in two categories: the performance evaluation of distributed databases deployed on cloud computing, and the impact of distance on the performance of data-intensive applications.

Performance evaluation of distributed databases on cloud computing: Since NoSQL databases began flourishing in 2011, several of them, including MongoDB, Cassandra, Riak, CouchDB, Redis, and HBase, have been the center of studies [Tudorica2011, JingHan2011, Davoudian2018]. As stated in the literature, these databases outperform relational databases in performance and scalability, making them more popular for use within private or public cloud datacenters.

The performance evaluation of NoSQL databases is commonly supported by the Yahoo Cloud Serving Benchmark (YCSB) [Cooper2010], which allows measuring the throughput and latency of read, write, insert, update, delete, and scan operations. Initially, Cooper et al. [Cooper2010] used this benchmark to measure the performance of Cassandra, HBase, MySQL, and PNUTS [Cooper2008]. Later, from 2012 onwards, researchers leveraged this benchmark to compare NoSQL and relational databases deployed on cloud computing from performance and scalability perspectives [Hecht2011].

Abramova et al. [Abramova2013] compared MongoDB and Cassandra in terms of their features and capabilities using YCSB. MongoDB was affected by high workloads, whereas Cassandra seemed to experience performance boosts with increasing amounts of data; also, Cassandra outperformed MongoDB in update operations. Veen et al. [vanderVeen2012] made a comparison between MongoDB and Cassandra and also concluded that MongoDB provides high throughput when deployed on a single server. Klein et al. [Klein2015] evaluated MongoDB and Cassandra with non-default parameter settings to measure throughput and latency for read and write operations. Rabl et al. [Rabl2012] conducted extensive experiments to evaluate the performance and scalability of Cassandra, HBase, Redis, Voldemort, VoltDB, and MySQL. Kuhlenkamp et al. [Kuhlenkamp2014] carried out a large experiment to measure the scalability and elasticity of Cassandra and HBase deployed within a public cloud datacenter. As summarized in Table 1, differently from this work, all these studies (except our own [Mansouri2020]) evaluate the performance of distributed databases on either a private or a public cloud datacenter. Recently, we evaluated the throughput of distributed databases on a hybrid cloud without considering the distance between the private and public clouds involved [Mansouri2020].

Several studies have used NoSQL databases to evaluate their applicability to different IT domains. Bastiao et al. [Bastiao2014] leveraged MongoDB and CouchDB in the healthcare domain. They observed no difference between these two databases in performance, and concluded that NoSQL databases still need improvement. Lourenco et al. [Lourenco2015a] evaluated Cassandra, CouchDB, and MongoDB for a write-intensive application. The results revealed that Cassandra is better than the other NoSQL databases for a four-node setup, while a MS SQL Server running on a single node outperformed all NoSQL contenders under those specific settings. Rith et al. [Rith2014] implemented a layer that translates SQL queries into NoSQL ones for MongoDB and Cassandra, easing the move from relational to non-relational databases. None of these studies has investigated the impact of the distance between the cloud datacenters making up a hybrid cloud on the performance, scalability, and VM packing (i.e., VM number vs. VM flavour) of distributed databases.

Some researchers have evaluated their proposed algorithms, policies, and methods through simulation or implementation on a hybrid cloud. Toosi et al. [Toosi2018] recently configured a hybrid cloud comprising Microsoft Azure and two PC workers to analyze their proposed resource provisioning algorithms. Using the same hybrid cloud configuration, though with different VM sizes, Tuli et al. [Tuli2020] evaluated several resource provisioning and task scheduling algorithms. Calheiros et al. [Calheiros2012] and Vecchiola et al. [Vecchiola2012] used almost similar setups to evaluate an algorithm that leverages dynamic resources to meet deadline constraints for Bag-of-Tasks (BoT) applications. Li et al. [Li2018] designed a cost-aware job scheduling approach based on queuing theory in hybrid clouds. Differently, Loreti et al. [Loreti2015] implemented a software layer on top of a hybrid cloud infrastructure to dynamically deploy and scale virtual clusters. Zhou et al. [zhou2019] proposed an approach to optimize the monetary cost of workflow scheduling under time constraints, and extended it to minimize the execution time of tasks within constrained time and budget. Our work differs from these studies, which implement scheduling algorithms to complete tasks within constrained time and budget for BoT applications in a simulated or implemented hybrid cloud environment.

Impact of distance on the performance of data-intensive applications: User-perceived latency for database operations (e.g., read, write, delete, update, create) is a vital criterion at the database level. The latency of these operations can be impacted by network congestion, overloaded computing infrastructure, and the distance between the location requests are issued from and the location data is stored in. How to reduce network latency has been well studied in the literature, and several straightforward solutions exist, ranging from data replication [Vulimiri2013, Nishtala2013] and request reissuing [Wu2015, Dean2013] to more powerful hardware resources. Obviously, these solutions are more effective if a public cloud datacenter at a suitable distance is selected to pair with the private one, as experimentally observed in [Wu2013]. None of the studies stated above investigated the impact of the distance between the cloud datacenters involved in a hybrid cloud on the performance and horizontal/vertical scalability of distributed databases. This investigation is the main contribution of this study.

3 Distributed Database Systems Evaluated

In this section, we briefly provide an overview of the distributed databases under evaluation in this work, and then discuss how a hybrid cloud has been implemented to evaluate the performance and scalability of these databases.

3.1 Distributed Databases under Evaluation

In this section, we discuss five NoSQL databases and one relational database. The criterion for selecting these databases for our study is their widespread adoption in industry. In the following, we give more details about these databases to aid understanding of the experimental results.

MongoDB is an open-source document-based database that supports horizontal scalability and automatic sharding [Haughian2016]. It also provides full replication and an asynchronous master-slave model for consistency. This implies that writes are only made by the master node, while reads can be conducted both from the master node and from one of the slave nodes. Writes are propagated to the slave nodes by reading from the master's operation log [Haughian2016]. MongoDB offers clients different consistency models by letting them specify whether reads may be served by secondary nodes and how many nodes must confirm a write operation.
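
To make these knobs concrete, the sketch below is a minimal illustration, assuming the pymongo driver, a replica set named rs0, and placeholder host addresses (not the configuration used in our experiments), of how a client chooses where reads are served and how many nodes must acknowledge a write.

```python
# A minimal sketch, assuming the pymongo driver, a replica set named "rs0",
# and illustrative host addresses, of MongoDB's client-side consistency knobs:
# reads routed to slave (secondary) nodes and writes acknowledged by a
# configurable number of nodes.
from pymongo import MongoClient, ReadPreference, WriteConcern

client = MongoClient("mongodb://10.0.0.1:27017/?replicaSet=rs0")

# Reads may be served by secondaries; the master still takes all writes.
db = client.get_database("ycsb",
                         read_preference=ReadPreference.SECONDARY_PREFERRED)

# Require a majority of replica-set members to acknowledge each write
# (stronger than the default w=1).
usertable = db.get_collection(
    "usertable", write_concern=WriteConcern(w="majority", wtimeout=5000))

usertable.insert_one({"_id": "user1", "field0": "x" * 8})
print(usertable.find_one({"_id": "user1"}))
```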

Cassandra is an open-source database based on the ideas behind Google BigTable [Chang2006] and Amazon Dynamo [Sivasubramanian2012]. It uses a column-based data model in which each column consists of a name, a value, and a timestamp, all of which are provided by the client. Consistency is highly tunable according to application requirements, trading latency against consistency. Cassandra operates in a master-master mode [Gudivada2014], which makes horizontal scalability easy to support [Haughian2016]. This mode of operation implies that no node is different from any other, leading to high write throughput [Gudivada2014, Haughian2016] with the help of combining disk persistence with in-memory caching of data. It also supports several partitioning and replication techniques.
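
The per-request nature of this tuning can be illustrated as follows; this is a minimal sketch assuming the DataStax cassandra-driver package and a YCSB-style keyspace and table (illustrative names, not our experimental schema).

```python
# A minimal sketch, assuming the DataStax cassandra-driver package and a
# YCSB-style keyspace/table (illustrative names), of Cassandra's per-request
# tunable consistency: each statement chooses how many replicas must answer.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1", "10.0.0.2"])  # any nodes: all peers are equal
session = cluster.connect("ycsb")

# QUORUM: a majority of the replicas must acknowledge the write.
write = SimpleStatement(
    "INSERT INTO usertable (y_id, field0) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(write, ("user1", "x" * 8))

# ONE: lowest-latency read that may return stale data.
read = SimpleStatement(
    "SELECT field0 FROM usertable WHERE y_id = %s",
    consistency_level=ConsistencyLevel.ONE)
print(session.execute(read, ("user1",)).one())
```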

Riak is an open-source, key-value NoSQL database. It supports a master-less replication architecture without a single point of failure. Riak allows applications to define how many nodes are required to confirm read and write operations, a feature that enables a trade-off between availability and consistency. It supports choosing between eventual (the default) and strong consistency for each data bucket.
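
The bucket-level read/write quorums can be set as in the following minimal sketch, assuming the Python requests package and Riak's HTTP API on its default port 8098 (host address illustrative).

```python
# A minimal sketch, assuming the requests package and Riak's HTTP API on its
# default port 8098 (host address illustrative), of the per-bucket quorum
# tuning described above: n_val replicas, with r and w controlling how many
# nodes must confirm each read and write.
import requests

RIAK = "http://10.0.0.1:8098"

# Three replicas; any two must acknowledge a read or a write (a quorum).
resp = requests.put(f"{RIAK}/buckets/usertable/props",
                    json={"props": {"n_val": 3, "r": 2, "w": 2}})
resp.raise_for_status()

# Store and fetch a key under those bucket settings.
requests.put(f"{RIAK}/buckets/usertable/keys/user1", data=b"x" * 8,
             headers={"Content-Type": "application/octet-stream"})
print(requests.get(f"{RIAK}/buckets/usertable/keys/user1").content)
```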

CouchDB is an open-source database offering a document-oriented approach [Kuznetsov2014] in JSON format. It provides ACID properties at the document level and lock-free read operations through Multi-Version Concurrency Control (MVCC). CouchDB supports both master-master and master-slave replication between different CouchDB instances or on a single instance. It does not support sharding, but provides scaling through asynchronous data replication [Kuznetsov2014]. CouchDB offers eventual consistency and performs conflict resolution by favouring the most recently updated data. CouchDB works well when it can hold the whole dataset in the cluster's RAM, since it is essentially a RAM-based database.
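
The MVCC behaviour can be seen directly over CouchDB's HTTP document API; the following minimal sketch assumes the requests package, CouchDB on its default port 5984, and illustrative credentials and host.

```python
# A minimal sketch, assuming the requests package and CouchDB's HTTP API on
# its default port 5984 (credentials and host illustrative), of the MVCC
# behaviour described above: each update must carry the current revision
# (_rev), and a stale revision yields HTTP 409 instead of a lock.
import requests

COUCH = "http://admin:secret@10.0.0.1:5984"

requests.put(f"{COUCH}/usertable")                        # create the database
requests.put(f"{COUCH}/usertable/user1", json={"field0": "x" * 8})

doc = requests.get(f"{COUCH}/usertable/user1").json()     # includes "_rev"

# The update succeeds only against the latest revision; a concurrent writer
# that won the race leaves us with a 409 Conflict to resolve by re-reading.
doc["field0"] = "y" * 8
resp = requests.put(f"{COUCH}/usertable/user1", json=doc)
if resp.status_code == 409:
    print("conflict: re-read the document and retry")
```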

Redis is an open-source, in-memory data structure store supporting strings, hashes, lists, sets, and sorted sets. It uses master-slave asynchronous replication, where data can be replicated to multiple replica servers. This improves read performance (as requests can be split among the servers) and enables faster recovery when the primary server goes down. Redis offers a highly available in-memory cache to decrease data access latency, increase throughput, and ease the load on NoSQL and relational databases.
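
The master-slave split maps directly onto client code; the following minimal sketch assumes the redis-py package and a replica started with "replicaof 10.0.0.1 6379" (host addresses illustrative).

```python
# A minimal sketch, assuming the redis-py package and a replica configured
# with "replicaof 10.0.0.1 6379" (hosts illustrative), of the master-slave
# split described above: writes go to the master, while reads can be
# offloaded to asynchronously updated (possibly slightly stale) replicas.
import redis

master = redis.Redis(host="10.0.0.1", port=6379)
replica = redis.Redis(host="10.0.0.2", port=6379)

master.set("user1", "x" * 8)     # only the master accepts writes
print(replica.get("user1"))      # reads are spread across replicas
print(master.info("replication")["role"],
      replica.info("replication")["role"])  # 'master' and 'slave'
```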

MySQL Cluster provides shared-nothing clustering and auto-sharding for the MySQL database management system. It internally deploys synchronous replication through a two-phase commit mechanism to ensure that data is written to multiple nodes upon commit. MySQL Cluster automatically creates data node groups from the number of replicas and data nodes specified by the user. Writes are synchronously replicated between the nodes of a group to guarantee durability. However, it replicates data asynchronously between clusters to reduce the effects of network latency by locating data physically closer to a set of users.
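
The node-group arithmetic can be made explicit with a back-of-the-envelope sketch; the function below is illustrative only (not MySQL's actual API) and assumes the documented rule that the number of node groups equals the number of data nodes divided by NoOfReplicas.

```python
# A back-of-the-envelope sketch of how MySQL Cluster derives data node groups
# from the data-node count and the NoOfReplicas setting: node-group count =
# data nodes / replicas, with writes synchronous inside each group. The
# function name and layout are illustrative, not MySQL's actual API.
def node_groups(data_nodes, no_of_replicas=2):
    assert data_nodes % no_of_replicas == 0, "data nodes must divide evenly"
    ids = list(range(1, data_nodes + 1))
    return [ids[i:i + no_of_replicas]
            for i in range(0, data_nodes, no_of_replicas)]

# 4 data nodes with 2 replicas -> 2 groups [[1, 2], [3, 4]]; each partition
# of the data lives, fully replicated, inside exactly one group.
print(node_groups(4, 2))
```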

3.2 Hybrid Cloud Implementation

We implemented an automated hybrid cloud across OpenStack (https://www.openstack.org/) and Microsoft Azure (https://azure.microsoft.com/en-au/). In this implementation, we leveraged the on-demand usage model (also called cloud bursting), where a data-intensive application running on a private cloud datacenter borrows resources (e.g., a VM instance) from a public cloud datacenter. One of the main aspects of this implementation is a secure, robust, cost-free connection between the private and public cloud datacenters. For this purpose, we used WireGuard (https://www.wireguard.com/), a Linux kernel-based VPN tool whose Version 1.0 was recently released as part of kernel 5.6 (https://www.archyde.com/wireguard-vpn-1-0-0-appears-in-linux-5-6-kernel-computer-news/). Using WireGuard rather than a public cloud VPN brings advantages in terms of security, connection reliability, throughput, monetary cost, and inter-portability [Mansouri2020]. A schematic view of the hybrid cloud using WireGuard is illustrated in Fig. 1.
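
The key exchange that underpins such a tunnel can be scripted; the following minimal sketch assumes the wg and wg-quick CLI tools are installed, and the interface name wg0 and config path are illustrative assumptions rather than our exact setup.

```python
# A minimal sketch, assuming the wg and wg-quick CLI tools are installed, of
# generating the WireGuard key material for the two brokers; the interface
# name wg0 and the config path are illustrative assumptions.
import subprocess

def wg_keypair():
    private = subprocess.run(["wg", "genkey"], capture_output=True,
                             text=True, check=True).stdout.strip()
    public = subprocess.run(["wg", "pubkey"], input=private,
                            capture_output=True, text=True,
                            check=True).stdout.strip()
    return private, public

priv, pub = wg_keypair()
print("publish this public key to the peer broker:", pub)

# Once /etc/wireguard/wg0.conf lists the peer's public key and endpoint,
# the tunnel can be brought up:
subprocess.run(["wg-quick", "up", "wg0"], check=True)
```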

As can be seen in Fig. 1, we initially implemented a client/consumer broker in OpenStack and exploited the on-demand model [Mansouri2020], in which the workload might need to expand onto Azure VM instances in the case of workload spikes. Thus, we implemented a server/donor broker in Microsoft Azure. Based on the environment specification, we need to create shared networks/sub-networks that can be connected to and disconnected from the broker networks. Similarly, we might need shared networks in different regions on the Microsoft Azure side; these shared networks should be able to connect to and disconnect from the broker network on the Azure side. We deployed our cluster nodes hosting the distributed databases in the shared subnetworks.

We used Terraform (https://www.terraform.io/), an open-source automation tool, to provision and manage the cloud infrastructure in an automated manner. This tool enabled us to define and create the required infrastructure resources across the OpenStack and Azure cloud datacenters in terms of quantity (e.g., number of VMs) and specification (e.g., VM size). Using this tool, we installed the six distributed databases and configured clusters across the nodes hosting them. Such automation of the hybrid cloud implementation allows us to consistently reproduce the experimental setup and evaluate the distributed databases with minimal human interference.
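
The provision/benchmark/destroy cycle can be driven from a script; the following minimal sketch assumes the terraform CLI is on the PATH and a module that accepts vm_count, vm_size, and azure_region variables (illustrative variable names, not our exact module).

```python
# A minimal sketch, assuming the terraform CLI is on PATH and a module that
# accepts vm_count, vm_size, and azure_region variables (illustrative names),
# of how the provision/benchmark/destroy cycle can run without manual steps.
import subprocess

def deploy(vm_count, vm_size, region):
    subprocess.run(["terraform", "init", "-input=false"], check=True)
    subprocess.run(["terraform", "apply", "-auto-approve", "-input=false",
                    f"-var=vm_count={vm_count}",
                    f"-var=vm_size={vm_size}",
                    f"-var=azure_region={region}"], check=True)

def destroy():
    subprocess.run(["terraform", "destroy", "-auto-approve"], check=True)

# e.g., 4 public-cloud VMs of size Standard_B1ms in Australia East (Sydney)
deploy(4, "Standard_B1ms", "australiaeast")
```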

Figure 1: Hybrid cloud architecture spanning on-premises infrastructure resources and the public cloud datacenter in the East US (Virginia) region, connected by WireGuard [Mansouri2020]

4 Evaluation

We evaluated the performance, horizontal scalability, vertical scalability, and VM packing of six distributed databases deployed on a hybrid cloud spanning a private cloud datacenter and public cloud datacenters in the Sydney, Mumbai, and Virginia regions. (In this study, we report only throughput for the evaluation of performance, horizontal scalability, vertical scalability, and VM packing. We omitted the latency results because the distance between the private cloud and the public cloud in the three regions is reflected in latency; that is, the longer the distance between the two clouds, the higher the latency of operations performed on the databases.) The aim of this evaluation is to investigate the impact of the distance between the cloud datacenters involved in a hybrid cloud on the performance of widely used, modern databases. In the following, we discuss the specification and location of the infrastructure resources, the workload setup, the experimental scenarios, and the results.

4.1 Experiment Setup

Infrastructure resources specification: We leveraged two clusters for our experiments, as depicted in Fig. 1. The OpenStack cluster consists of Linux VMs, each equipped with a 1-core CPU, 2 GB of RAM, and a 10 GB disk. We also set up a cluster on Microsoft Azure consisting of Linux Standard_B1ms instances with 1 vCPU, 2 GB of RAM, and 4 GB of SSD storage. The total number of instances across both clusters is 8, with n nodes/VMs in the private cloud (OpenStack) and 8-n nodes in the Azure cloud. Note that we keep at least one node running on OpenStack in all experiments to comply with the definition of a hybrid cloud. A summary of the deployed infrastructure is listed in Table 2.

Infrastructure resources location: Our private cloud infrastructure hosting OpenStack was located in the CREST Lab at the University of Adelaide, and the public cloud datacenters used are in the Australia East (Sydney), India West (Mumbai), and US East (Virginia) regions. These regions were selected to evaluate the impact of distance on the performance, scalability, and VM packing of distributed databases. Thus, we selected Sydney and Virginia as the closest and the farthest locations to our private infrastructure, at distances of 1,374 km and 16,671 km, respectively. Moreover, we selected one point roughly midway between the closest and the farthest regions (i.e., (1374+16671)/2 ≈ 9,022 km), which led to the selection of the cloud datacenter in the West India (Mumbai) region.

Benchmarking and system under test: We used the Yahoo Cloud Serving Benchmark (YCSB), whose client component runs a set of queries as core workloads [Cooper2010]. YCSB runs six different workloads, as described in Table 3; all workloads use a uniform request distribution. For each experiment, we built clean VM instances and ran the YCSB load phase, which inserts 10 K records into each cluster node. Each record consists of 10 fields of 8 bytes each; thus, a record in the workload is 80 bytes. In all experiments, we used at least one instance in the shared network on OpenStack with the default number of threads (10 threads).
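
The load and run phases above translate into two YCSB invocations; the following minimal sketch assumes a YCSB checkout with the standard bin/ycsb launcher and its MongoDB binding (the same pattern applies to the other databases).

```python
# A minimal sketch, assuming a YCSB checkout with the standard bin/ycsb
# launcher and its MongoDB binding, of the load and run phases used here:
# 10 K records of 10 fields x 8 bytes, uniform request distribution, and
# 10 client threads.
import subprocess

YCSB = "./bin/ycsb"
COMMON = ["-P", "workloads/workloada",
          "-p", "recordcount=10000",
          "-p", "fieldcount=10",
          "-p", "fieldlength=8",
          "-p", "requestdistribution=uniform",
          "-threads", "10"]

# Load phase: insert the records into a freshly built cluster.
subprocess.run([YCSB, "load", "mongodb", "-s"] + COMMON, check=True)
# Run phase: execute the workload and report throughput (ops/sec).
subprocess.run([YCSB, "run", "mongodb", "-s"] + COMMON, check=True)
```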

The system under test consists of six distributed databases installed and configured on both private and public nodes as a single cluster with the default settings [Mansouri2020]. Cluster configuration for MySQL Cluster differs from the NoSQL databases since it requires three types of nodes: we ran the manager and MySQL server nodes on the same VM instance in OpenStack, and the data nodes across the hybrid cloud. The NoSQL databases were configured in a master-slave model where needed; otherwise, all nodes were identical. As previously mentioned, we used Terraform to automate the deployment, destruction, installation, and configuration of the database cluster nodes with minimal human interference. This implementation allowed us to consistently and repeatedly evaluate the distributed databases under the desired configuration parameters in terms of hardware specification, cluster configuration, and so on [Mansouri2020].

Experimental Scenarios: We defined four scenarios, associated with the RQs in the introduction, to evaluate the performance of the six databases under different workloads, as summarized in Table 4 (see the sketch after this paragraph for the resulting configuration sets). RQ1: To evaluate performance in terms of the number of operations per time unit (i.e., throughput), we considered all permutations of nodes that can burst into the public cloud datacenter. Thus, we used the hybrid cluster configurations (8_0), (7_1), ..., (2_6), and (1_7), where the first and second elements of each pair represent the number of nodes in the private and public cloud datacenters, respectively. RQ2.1: To assess horizontal scalability across the hybrid cloud, we fixed one node in the private cloud and varied the number of nodes in the public cloud from 2 to 8 in steps of two. (This setup could also be reversed, fixing one node in the public cloud and varying the number of nodes in the private cloud. We did not investigate the opposite configuration since we intended to explore the impact of the distance between the two datacenters, which is the same in either direction.) RQ2.2: To evaluate vertical scalability across the hybrid cloud, we considered two nodes in the private cloud and one node in the public cloud with 2, 4, or 8 cores. (Azure provides Bs-series VMs with 1, 2, 4, 8, 12, 16, and 20 cores; we selected VMs with 2, 4, and 8 cores to evaluate vertical scalability: https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/#a-series.) Note that we used 3 nodes in this scenario because MySQL and Redis require at least three nodes for cluster configuration. RQ3: In this scenario, we considered one node in the private cloud and three sets of nodes with different core counts in the public cloud: 4 nodes with 2 cores each (4x2), 2 nodes with 4 cores each (2x4), and 1 node with 8 cores (1x8). The aim of this scenario is to evaluate the databases under two cases, fewer VMs with more cores each versus more VMs with fewer cores each, such that the total number of cores in each case is the same (here, 8 cores).
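
As referenced above, the following small helper mirrors these scenario definitions by enumerating the evaluated configurations (pair notation: private_public; packing notation: VMs x cores per VM).

```python
# A small helper, mirroring the scenario definitions above, that enumerates
# the evaluated hybrid cluster configurations (pair notation: private_public;
# packing notation: VMs x cores per VM).
def rq1_configs(total_nodes=8):
    # (8_0), (7_1), ..., (1_7): at least one node stays in the private cloud
    return [(total_nodes - n, n) for n in range(total_nodes)]

def rq21_configs():
    # one private node fixed; 2, 4, 6, and 8 public VMs
    return [(1, n) for n in range(2, 9, 2)]

def rq3_packings(total_cores=8):
    # fewer, bigger VMs vs. more, smaller VMs at a constant 8 cores
    return [(vms, total_cores // vms) for vms in (4, 2, 1)]

print(rq1_configs())   # [(8, 0), (7, 1), ..., (1, 7)]
print(rq21_configs())  # [(1, 2), (1, 4), (1, 6), (1, 8)]
print(rq3_packings())  # [(4, 2), (2, 4), (1, 8)]
```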

System Setup | Private Infrastructure (OpenStack) | Public Infrastructure (Azure)
Instance type | m1.small | Standard_B1ms
CPU | 1 core | 1 core
RAM | 2 GB | 2 GB
Disk | 10 GB HDD | 4 GB SSD
Location | Adelaide | Sydney, Mumbai, Virginia
Table 2: A summary of infrastructure setup
Workload type | Operations | Label
Workload A | 50% Read + 50% Update | Read-intensive
Workload B | 95% Read + 5% Update | Write-intensive
Workload C | 100% Read | Read-only
Workload D | 95% Read + 5% Insert | Read-latest
Workload E | 95% Scan + 5% Insert | Scan
Workload F | 50% Read + 50% Update | Read-Modify-Write (RMW)
Table 3: Core workloads in YCSB
Research Question | Description | Hardware setting
RQ1 | Throughput evaluation | Up to 8 VMs with one core each across the hybrid cloud
RQ2.1 | Horizontal scalability | One VM in OpenStack and up to 8 VMs with one core each in Azure
RQ2.2 | Vertical scalability | 2 VMs in OpenStack and one node with 2, 4, or 8 cores in Azure
RQ3 | VMs number vs. VMs size | One VM in OpenStack, and 4x2, 2x4, or 1x8 VMs in Azure
Table 4: A summary of experimental scenarios associated with the RQs

4.2 Experiment Results

In this section, we report the results for the research questions stated in the Introduction.

4.2.1 Performance Evaluation

Figure 8: Throughput for MongoDB in the Sydney, Mumbai, and Virginia regions for six workloads: (a) read-intensive, (b) write-intensive, (c) read-only, (d) read-latest, (e) scan, and (f) read-modify-write (RMW). Each value n1_n2 on the X axis indicates that the hybrid cloud consists of n1 nodes in the private cloud and n2 nodes in the public cloud.
Figure 15: Throughput for Cassandra in the Sydney, Mumbai, and Virginia regions for six workloads: (a) read-intensive, (b) write-intensive, (c) read-only, (d) read-latest, (e) scan, and (f) read-modify-write (RMW). Each value n1_n2 on the X axis indicates that the hybrid cloud consists of n1 nodes in the private cloud and n2 nodes in the public cloud.

In this section, we answer RQ1: what is the impact of the distance between private and public cloud datacenters on the performance of distributed databases running on a hybrid cloud? To reflect the distance impact, as discussed before, we consider three regions hosting the public cloud datacenter: Sydney, Mumbai, and Virginia. Figs. 8-42 illustrate the throughput of the six distributed databases against the hybrid cluster configurations, labelled with pairs n1_n2, where n1 and n2 are respectively the number of VM instances exploited in the private and public cloud datacenters. For each database and cluster configuration, we used a freshly installed and established database cluster and loaded the data. We refer to the cluster configurations (8_0) and (1_7) as non-bursting and full-bursting, respectively. All pairs except (8_0) are referred to as hybrid cluster configurations. It should be noted that even in the full-bursting setting, we still exploit one VM instance in the private cloud datacenter to comply with the definition of a hybrid cloud.

Fig. 8 shows the throughput for MongoDB in the three regions. For all six workloads, the distance (in this context, "the impact of distance" means the impact of the distance between the private cloud datacenter and the public one in each region (Sydney, Mumbai, Virginia) on performance) has a slight impact on throughput when at least half of the VMs are hosted in the private cloud (i.e., cluster configurations (8_0), (7_1), (6_2), (5_3), and (4_4)). For these hybrid cluster configurations, MongoDB obtained the best throughput for the read-only and read-latest workloads (about 600 ops/sec), followed by the read- and write-intensive workloads (500-600 ops/sec). In contrast, as the number of nodes bursting into the public cloud increases (i.e., cluster configurations (3_5), (2_6), and (1_7)), the distance affects the throughput of MongoDB for almost all workloads in the Virginia and Mumbai regions. For the read-intensive, write-intensive, and read-only workloads with these cluster configurations, the throughput decreases by 30%-40% in Virginia and 5%-20% in Mumbai. The throughput for the scan workload reduces even more (around 45%), especially in Virginia. For the two other workloads (read-latest and RMW), distance has less impact on throughput, particularly in the Mumbai region.

In Fig. 15, the throughput for Cassandra in the three regions is summarized. For the non-bursting hybrid cluster configuration (i.e., (8_0)), Cassandra exhibits different throughput values across the three regions even though all nodes are hosted in the private cloud. This implies a fluctuation in the latency between the broker and shared sub-networks, and in the latency between the shared VMs hosting the Cassandra database nodes. When the hybrid cluster configuration changes from non-bursting (8_0) to bursting ((7_1), ..., (1_7)), distance has a substantial effect on throughput, especially for the read-intensive workload (Fig. 15(a)). As the distance between the private and public cloud datacenters increases (i.e., moving the deployment from Sydney to Mumbai, or from Mumbai to Virginia), the throughput significantly decreases. For the other workloads, although cloud bursting reduces throughput for all hybrid cluster configurations, distance has less impact on Cassandra's performance. It is worth mentioning that Cassandra shows better performance with the (4_4) hybrid cluster configuration than with the other hybrid cluster configurations, especially for the read-only and read-latest workloads in Sydney. The reason might be that the VMs, and consequently the data, are placed almost equally across the two cloud datacenters, which in turn increases the throughput of read operations under the default settings (three replicas and quorum-based consistency).

Figure 21: Throughput for Riak in the Sydney, Mumbai, and Virginia regions for five workloads: (a) read-intensive, (b) write-intensive, (c) read-only, (d) read-latest, and (e) read-modify-write (RMW). Each value n1_n2 on the X axis indicates that the hybrid cloud consists of n1 nodes in the private cloud and n2 nodes in the public cloud.

The throughput for Riak (Riak does not support workload E) is captured in Fig. 21. In comparison to MongoDB and Cassandra, Riak exhibits a more stable performance trend. As the distance increases and the number of nodes bursting into the public cloud datacenter rises, the throughput of Riak decreases. For the read-intensive workload, the throughput halves in the Sydney region when the cluster configuration changes from non-bursting (8_0) to full-bursting (1_7), whilst for the two other regions the throughput decreases by a factor of about 5. For the remaining workloads (Figs. 21(a)-21(e)), the throughput closely follows the reduction trend of the read-intensive workload. In summary, as long as the distance between the private and public cloud datacenters is relatively short (less than 1,374 km, the distance between Adelaide and Sydney), Riak performs effectively if more than half of its database nodes are hosted in the private cloud.

Figure 28: Throughput for CouchDB in the Sydney, Mumbai, and Virginia regions for six workloads: (a) read-intensive, (b) write-intensive, (c) read-only, (d) read-latest, (e) scan, and (f) read-modify-write (RMW). Each value n1_n2 on the X axis indicates that the hybrid cloud consists of n1 nodes in the private cloud and n2 nodes in the public cloud.

Fig. 28 shows the throughput results for CouchDB. Like Riak, CouchDB demonstrates stable throughput for all workloads, though this metric decreases by 25% (RMW workload) to 80% (read-intensive workload) as the hybrid cluster configuration changes from (8_0) to (4_4) in the Sydney region. In this region, CouchDB's throughput also shows a decrement of 23% (read-only workload) to 50% (read-latest and RMW workloads) as the cluster configuration varies from non-bursting to full-bursting. In the Mumbai region, as the cluster configuration changes from non-bursting to bursting, CouchDB suffers the smallest throughput reduction for the read-intensive workload (a factor of about 2.2) and the largest for the read-only workload (a factor of 4). For both the Sydney and Mumbai regions, the throughput of CouchDB initially drops as half of the VMs burst into the public cloud datacenter, and then gradually increases or stays at a constant level when more than half of the VMs are exploited in the public cloud.

Figure 35: Throughput for Redis in the Sydney, Mumbai, and Virginia regions for six workloads: (a) read-intensive, (b) write-intensive, (c) read-only, (d) read-latest, (e) scan, and (f) read-modify-write (RMW). Each value n1_n2 on the X axis indicates that the hybrid cloud consists of n1 nodes in the private cloud and n2 nodes in the public cloud.

Fig. 35 plots the throughput for Redis. Compared to Riak and CouchDB, it shows similar stability in performance as the distance between the private and public cloud datacenters increases. However, for the non-bursting cluster configuration in the Sydney region, Redis outperforms both Riak and CouchDB in throughput for all workloads apart from the read-intensive and scan workloads. As the hybrid cluster configuration changes from non-bursting to bursting, Redis's throughput in the Sydney region is 4 times that in the Mumbai region for all workloads except scan, and up to 5 times that in the Virginia region. This advantage of the Sydney deployment remains fairly stable in comparison to the hybrid cloud deployments in the Mumbai and Virginia regions.

Figure 42: Throughput for MySQL Cluster in the Sydney, Mumbai, and Virginia regions for six workloads: (a) read-intensive, (b) write-intensive, (c) read-only, (d) read-latest, (e) scan, and (f) read-modify-write (RMW). Each value n1_n2 on the X axis indicates that the hybrid cloud consists of n1 nodes in the private cloud and n2 nodes in the public cloud.

Fig. 42 illustrates the throughput of MySQL Cluster with the default settings, where MySQL provides strong consistency among replicas within each data node group. The size of a data node group equals the number of replicas, which is two by default in our experiments. With these default settings [Mansouri2020] and the amount of data uploaded to the data node groups, MySQL achieves high throughput (more than 300 ops/sec) for all workloads except scan in all regions. Thus, MySQL ranks second among the six investigated databases (after MongoDB) in terms of throughput. Regarding MySQL's default settings and performance, the following remarks are worth making: (i) as we observed, MySQL smartly groups data nodes into data node groups based on the distance between nodes, and (ii) for the non-bursting cluster configuration, MySQL's throughput is lower than that of Riak and Redis (see Figs. 21, 35, and 42).

4.2.2 Horizontal Scalability Evaluation

Figure 49: Horizontal scalability of (a) Cassandra, (b) MongoDB, (c) Riak, (d) CouchDB, (e) Redis, and (f) MySQL Cluster in the Sydney, Mumbai, and Virginia regions. The X axis represents the number of VMs deployed in the public cloud datacenter.

This section presents the results for RQ2.1, which concerns the horizontal scalability of the six distributed databases running on a hybrid cloud. In this set of experiments, we investigated the effects of horizontal scalability on the throughput of the distributed databases. (Note that we intended to evaluate horizontal scalability as cloud bursting happens; thus, we added VMs in the public cloud datacenter, not in the private one.) Horizontal scalability means adding more computing nodes to the resource pool. Thus, in this experiment, we varied the number of VMs of size Standard_B1ms (1 vCPU, 2 GB RAM, and 4 GB SSD) in the public cloud datacenter from 2 to 8. In compliance with the definition of a hybrid cloud, we also deployed a small VM instance (1 vCPU, 2 GB RAM, 10 GB HDD) in the private cloud datacenter.

As shown in Fig. 49(a), the throughput of Cassandra drops by half or more for most workloads as the number of VMs increases from 2 to 8. This reduction is smaller in the Mumbai and Virginia regions. Thus, bursting more nodes into a public cloud datacenter does not necessarily improve database performance, because the more VMs there are, the more communication they need with each other to conduct read and write operations in Cassandra (see Section 4.2.4). The throughput of Riak and Redis remains fairly constant as more VMs are exploited in the public datacenter in the Mumbai and Virginia regions. In the Sydney region, Riak's throughput increases slightly for the write-intensive workload, while Redis's throughput decreases (Figs. 49(c) and 49(e)). In the same region, CouchDB's throughput initially drops as the VM count grows from 2 to 4, and then stays at a constant level (see Fig. 49(d)).

In contrast to the four databases discussed above, we observe an increase in the throughput of MongoDB and MySQL in some cases as the number of VMs increases, in all regions (Figs. 49(b) and 49(f)). However, this performance trend is not consistent for either database, since read and write operations transferred over the Wide Area Network (WAN) incur high latency.

Summary: This set of experiments shows that adding more nodes in a public cloud datacenter to conduct read and write operations over the WAN cannot guarantee better throughput, especially for RAM-based databases (e.g., Redis) and quorum-based databases (e.g., Cassandra and CouchDB).

Figure 56: Vertical scalability of (a) Cassandra, (b) MongoDB, (c) Riak, (d) CouchDB, (e) Redis, and (f) MySQL Cluster in the Sydney, Mumbai, and Virginia regions. The X axis represents the number of cores of the VM deployed in the public cloud datacenter.

4.2.3 Vertical Scalability Evaluation

This set of experiments answers RQ2.2, evaluating the vertical scalability of the six distributed databases by varying the number of cores of the VM deployed in the public cloud datacenter in the Sydney, Mumbai, and Virginia regions. In these experiments, we exploited two small VMs in the private infrastructure and one VM with 2, 4, or 8 cores in the public cloud datacenter. Note that we ran three VMs because MySQL and Redis require at least three VMs for cluster configuration.

Fig. 56(a) shows that the throughput of Cassandra increases 2-4 times when the number of cores grows from 2 to 4 for the VM deployed in the Sydney region. Likewise, we observe an increase of 1.5-2 times for the VM running in the Mumbai and Virginia regions. As depicted in Fig. 56(c), selecting a larger VM in the Sydney region raises the throughput of Riak by 30% for the read-only and read-latest workloads. The throughput of Redis and CouchDB remains constant or decreases slightly as a larger VM is exploited in the Mumbai and Virginia regions (Figs. 56(e) and 56(d)). In contrast, selecting a larger VM in the Sydney region raises the throughput of CouchDB by 10%-30% for the read-only and read-latest workloads.

Figs. 56(b) and 56(f) illustrate the effect of exploiting a larger VM on the throughput of MongoDB and MySQL. Apart from scan, both databases gained a throughput increment in the range of 5%-20% in the Sydney region, though this incremental trend was not linear because these two databases are unstable in performance over the WAN. In contrast to MongoDB, MySQL demonstrated a slight decrement of about 5% as the number of cores changed from 4 to 8 in the Virginia region.

Summary: The results from this set of experiments demonstrate that deploying a larger VM in a public cloud datacenter can improve the throughput of all databases except Cassandra for most workloads in the Sydney region. In this region, Cassandra's throughput significantly increased as the number of cores rose from 2 to 4, and then dropped when the number of cores went up from 4 to 8.

4.2.4 VM Packing Evaluation

Figure 63: The effect of VM number vs. VM size on throughput for (a) Cassandra, (b) MongoDB, (c) Riak, (d) CouchDB, (e) Redis, and (f) MySQL Cluster in the Sydney, Mumbai, and Virginia regions. Each value NxC on the X axis represents N VMs with C cores each in the public cloud datacenter.

This section answers RQ3, concerning the performance of the distributed databases as the number and flavour of VMs change. In this set of experiments, we fixed one small VM instance in OpenStack (see Table 2) and varied the number of VM instances (1, 2, or 4) and their flavour (Standard_B2ms (2 cores), Standard_B4ms (4 cores), and Standard_B8ms (8 cores)) in the public cloud datacenter so that for each setting, VM number x cores per VM is a constant value (i.e., 8 cores). This set of experiments helped us determine whether to select a larger VM with more cores or more VM instances with fewer cores each. Thus, as shown in Fig. 63, we have three configuration settings NxC, where N and C respectively represent the number and flavour (cores) of the VMs. (Note that we did not consider the 1x8 setup for Redis and MySQL because these databases require at least 3 VMs/nodes.)

Fig. 63(a) shows that as the number of cores per VM increases from 2 to 8, the throughput of Cassandra is significantly boosted. In the Sydney region, Cassandra's throughput increases by a factor of 8 for read-modify-write, followed by read-intensive (7.2 times) and read-only (7 times). Similarly, for the two other regions, we observe an upward throughput trend, by a factor of at most 4, as larger VMs are deployed for all workloads apart from scan. Like Cassandra, Riak's throughput increases with the number of cores per VM, especially for the read-only and read-latest workloads in the Sydney region. As shown in Fig. 63(c), these two workloads gained a 30% throughput improvement with the increasing number of cores. However, for the two other regions, the throughput of Riak remained fairly constant with the increased number of cores; this might be because the latency over the WAN dominates the latency between VMs in the same cluster. The reason behind this behaviour of Cassandra and Riak may be that both databases leverage a quorum-based mechanism to provide data consistency, which necessitates more communication between VMs. As can be seen in Fig. 63, as the configuration setting varies from 4x2 to 2x4, Redis's throughput improves by at most 10% for all workloads except scan. Thus, Redis performs better when it exploits 2 VMs with 4 cores each rather than 4 VMs with 2 cores each. This is because larger VMs offer more RAM, which suits memory-based databases like Redis.

In contrast to the databases discussed above, Figs. 63(b), 63(d), and 63(f) show that the throughput of MongoDB, CouchDB, and MySQL with the default settings [Mansouri2020] did not vary as the configuration setting changed from 4x2 to 1x8. This implies that these databases do not require heavy communication between VMs to conduct read and write operations; no such communication is needed since MongoDB provides full replication with eventual consistency, CouchDB offers a local quorum-based consistency mechanism, and MySQL supports strong consistency only between VMs in the same data node group. It is worth mentioning that the throughput of MongoDB, CouchDB, and MySQL was also unaffected by configuration changes (4x2 to 1x8) in the Mumbai and Virginia regions.

Summary: The results demonstrate that quorum-based databases (i.e., Cassandra and Riak) and RAM-based databases (i.e., Redis) can effectively leverage larger VMs, in terms of cores and RAM size, rather than more VMs with fewer cores and less RAM, to improve their performance, while the remaining databases cannot exploit larger VMs to improve performance in the Mumbai and Virginia regions.

4.2.5 Lessons Learnt

From the deployment perspective, constructing and deconstructing cluster nodes in a private cloud datacenter takes less time than in a public cloud datacenter, irrespective of its region. The reason might be that a public cloud consists of thousands of racks, each including hundreds of servers, while our private cloud consists of one rack with two servers. Moreover, as reflected in the results, we observed that MongoDB and MySQL require less time to run the YCSB workloads, with far fewer deployment failures.

In terms of throughput under the default settings, MongoDB is the clear winner among the databases tested in this work. A long distance between the private and public cloud datacenters affects its throughput only if more than half of the VMs in the cluster configuration are exploited in the public cloud datacenter. Precisely, for the longest distance (i.e., the Virginia region), the throughput of MongoDB drops to at most half under the full-bursting configuration (1_7) compared to the throughput of the non-bursting configuration (8_0). We believe the reason behind this performance is full replication with eventual consistency, which keeps data close to the data requester (here, the broker VM in OpenStack).

MySQL ranks second on this criterion since it guarantees strong consistency between nodes in the same data node group rather than geographical strong consistency, which would require transferring data over the WAN. Since the size of a data node group equals two (the default number of replicas), data is close to the data requester in all cluster configurations except (1_7). Surprisingly, even for the (1_7) configuration, MySQL achieves almost the same throughput. This might be because the operations are conducted on the local VM (hosting MySQL) in OpenStack, where the data requester and the data host are in the same cloud. It might be difficult for MySQL to achieve such throughput if the clients issuing data requests are geographically distributed, the number of replicas increases, or YCSB+T [Dey2014] is leveraged; YCSB+T introduces dependencies between data items, which require strong consistency in the case of MySQL. The effect of these aspects should be investigated to better understand the behaviour of MySQL over long distances (i.e., Virginia and Mumbai).

Cassandra demonstrates the most fluctuation in throughput, as it is significantly impacted by distance. This evaluation also confirms the findings of [sha2014] for three web-based workloads. However, this database provides the best performance for the read-intensive workload in the Sydney region and exposes high throughput for the read-only and read-latest workloads with the (4_4) hybrid cloud configuration in that region. Based on the obtained results, the default settings for data replication and the consistency mechanism should be adapted to achieve better throughput as the distance between the data requester node/VM and the data host node increases.

The throughput of Riak, Redis, and CouchDB follows a semi-parabolic trend: it drops as the hybrid configuration changes from (8_0) to (4_4), and then gradually rises when more than half of the nodes are exploited in the public cloud in the Sydney region. This demonstrates that the performance of these databases depends on the density of nodes/VMs located in one cloud when the private and public clouds are at a close distance. Comparing the three databases, Redis outperforms Riak, which in turn exposes better throughput than CouchDB for almost all workloads (apart from scan) in the Sydney region. In contrast, in the Mumbai and Virginia regions, the throughput of these databases (especially Redis) is significantly impacted once cloud bursting happens.

With respect to horizontal scalability, the results show that MongoDB and MySQL can improve their performance as more VMs are exploited in the Sydney region, though this incremental trend is not linear. By contrast, in the Sydney region, the throughput of Cassandra drops significantly as the number of VMs in the public cloud grows from 2 to 8, and likewise for Riak. Thus, the results demonstrate that the throughput of Riak and Cassandra, and in some cases of MySQL and MongoDB, does not improve as the number of bursting VMs increases. This might be because the distance does not allow the link between the private and public clouds to be saturated while more data is located far away from the node/VM issuing the operations.

With respect to vertical scalability, MongoDB, MySQL, Riak, and CouchDB demonstrate an increase in throughput, but not a significant one. For Cassandra, the throughput initially goes up as the number of cores changes from 2 to 4, and then drops when the core count increases from 4 to 8. What we learned from this set of experiments is that long distance dominates the vertical scalability of these databases. It seems valuable to investigate their scalability over shorter distances, namely several kilometres rather than hundreds of kilometres.

In the last set of experiments, we evaluated the impact of VM number vs. VM cores (called VM packing) on the performance of the modern distributed databases. This evaluation replaces the latency between VMs in the same cloud with the latency between cores in the same VM. The results demonstrate that fewer VMs with more cores have a significant impact on the performance of Cassandra and, to a lesser extent, Riak in the Sydney region. This might be because these databases require more communication between VMs. VM packing also improves the performance of RAM-based databases such as Redis. In this work, we considered VM packing on the public cloud side since we intended to investigate the impact of distance on it; it would also be useful to explore the impact of VM packing locally (i.e., on the private cloud side) on the performance of distributed databases.

5 Conclusions and Future Work

In this paper, we have conducted an extensive evaluation of the performance, scalability, and VM size vs. VM number trade-offs of six modern and widely used databases (i.e., MongoDB, Cassandra, Riak, CouchDB, Redis, and MySQL Cluster). Unlike previous studies, we evaluated these databases on a hybrid cloud spanning on-premises infrastructure resources and public cloud datacenters in the Sydney, Mumbai, and Virginia regions. The selection of these regions reflects the effect of the distance between a pair of private and public datacenters on the performance and scalability of these databases. We observed that MongoDB ranks first among these databases in throughput since it leverages full replication with an eventual consistency mechanism. MySQL Cluster comes after MongoDB from the performance perspective since it uses strong consistency between data nodes in the same data node group, avoiding geographically strong consistency. In contrast to these two databases, long distance (i.e., the Virginia and Mumbai regions) degrades the performance of the other databases (i.e., Cassandra, Riak, CouchDB, and Redis). At close distance (i.e., the Sydney region), Riak, Redis, and CouchDB show at most a 50% reduction in throughput as half of the VMs (i.e., the (4_4) configuration) burst into the public cloud, and their performance then gradually recovers as more than half of the VMs are deployed in the public cloud.

In our experiments, we observed that adding more VMs in a public cloud datacenter can help MongoDB and MySQL improve their throughput, especially in the Sydney region. In contrast, exploiting larger VM instances in a public cloud datacenter increases the throughput of all the databases except Cassandra for most workloads in the Sydney region. For Cassandra, in all regions, throughput increases as the number of cores rises from 2 to 4, and then drops when the number of cores rises from 4 to 8. Thus, distance also affects the vertical and horizontal scalability of these databases. Furthermore, we found that databases requiring more communication among VMs to conduct operations (e.g., Cassandra and Riak), as well as RAM-based databases (e.g., Redis), benefit more from larger VMs than from more, smaller VMs.

For future work, this research can be extended in several directions. As all the experiments were conducted under default settings, it is worth determining the effect of key parameters such as the number of replicas, the number of shards (where applicable), and different consistency mechanisms on the performance of NoSQL databases in a hybrid cloud. Since our hybrid cloud spans two cloud datacenters with a single source issuing read and write operations, it would be valuable to extend this architecture to be fully distributed in the hardware resources hosting data and issuing operations. Last but not least, rather than bursting either more VMs or larger VMs into a public datacenter, it might be effective to leverage more, larger VMs to improve the throughput of all the distributed databases.

References