In recent years, with the development of big data technology, big data applications have become increasingly popular. For example, in meteorological applications, a meteorological station collects metrics such as temperature, humidity and wind speed along dimensions such as latitude, longitude, altitude and timestamp for weather forecasting. These dimensions are usually sortable and are used for query filtering.
When using Cassandra to handle such data, developers have to collect the potential query patterns and then design the database schema accordingly. For example, in Cassandra, the partition keys in a column family are used to partition data for load balancing and support only equality filters, whereas the clustering keys support both range filters and equality filters. Choosing different columns as partition and clustering keys leads to different query performance. Therefore, users have to know all the potential query patterns before they design the Cassandra schema.
In Cassandra, the order of the clustering key columns heavily impacts query performance. For example, if the columns (altitude, time) are the clustering keys, then queries such as “Find the humidity at some locations where altitude=100Pa and time between [2018.05.01, 2018.05.02]” have low latencies, because the data for a given altitude is sorted by the time column on disk. If the columns (time, altitude) are the clustering keys, then the latencies of the former queries increase, but queries such as “Find the humidity at some locations where altitude in [100Pa, 1000Pa] and time=2018.05.01” have low latencies. This shows that a column family structure organized to accommodate one query efficiently may negatively impact the performance of many other queries. Therefore, designing the database schema according to the query patterns so as to optimize the data serialization on disk is challenging.
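The effect of clustering key order can be illustrated with a small sketch. It is not Cassandra code: it assumes that the rows of one partition are kept as a list sorted by the clustering key tuple, and it counts how many rows a contiguous scan touches between the lowest and highest possible matching keys. The helper `scan_rows`, the integer timestamps and the toy data are hypothetical.

```python
from bisect import bisect_left, bisect_right

def scan_rows(sorted_keys, lower, upper):
    """Count the rows a contiguous scan reads between lower and upper (inclusive)."""
    return bisect_right(sorted_keys, upper) - bisect_left(sorted_keys, lower)

# Toy partition: every (altitude, time) pair for 10 altitudes and 100 timestamps.
altitudes = range(100, 1100, 100)  # 100, 200, ..., 1000
times = range(100)                 # hypothetical integer timestamps 0..99

by_alt_time = sorted((a, t) for a in altitudes for t in times)
by_time_alt = sorted((t, a) for a in altitudes for t in times)

# Query A: altitude = 100 and time in [10, 19].
a_on_alt_time = scan_rows(by_alt_time, (100, 10), (100, 19))
a_on_time_alt = scan_rows(by_time_alt, (10, 100), (19, 100))

# Query B: altitude in [100, 1000] and time = 10.
b_on_alt_time = scan_rows(by_alt_time, (100, 10), (1000, 10))
b_on_time_alt = scan_rows(by_time_alt, (10, 100), (10, 1000))

print(a_on_alt_time, a_on_time_alt)  # 10 91
print(b_on_alt_time, b_on_time_alt)  # 901 10
```

On this toy data each query matches exactly 10 rows, but the scan reads 10 rows under the favorable key order and 91 or 901 rows under the other one.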
In this paper, we rethink the role of the replica mechanism in NoSQL systems and propose a new replica mechanism, called heterogeneous replica, to improve query performance. Currently, replicas are used for data recovery and for load balancing of queries. The traditional replica mechanism requires all replicas to have the same serialized bytes on disk. Our heterogeneous replica mechanism only requires different replicas to hold the same dataset, while their serialized bytes on disk can differ. Different replicas accelerate different query patterns. Meanwhile, the data recovery and load balancing abilities are preserved because all replicas hold the same dataset.
In Cassandra, the serialization of data is determined by the clustering keys of the column family. By modeling the query cost under different column family schemas, we propose the heterogeneous replica construction algorithm (HRCA) to construct the optimal heterogeneous replicas that minimize the average query latency for known query patterns.
In this work, we adopt a general approach that separates the architectural concerns of the write process and availability from heterogeneous replicas. We implemented a shim layer, the heterogeneous replica (HR) engine, on top of Cassandra to verify the efficiency of our method. This enables the HR engine to control the exact heterogeneous replica structure on disk regardless of the write process of the underlying store.
The contributions of this paper are as follows:
We propose a new replica mechanism called heterogeneous replica. The new mechanism gives replicas the ability to accelerate queries while retaining the original data recovery ability.
We propose a cost model for queries on SSTables and formalize the heterogeneous replica construction problem, whose goal is the best query performance for a given workload.
We propose the HRCA algorithm to efficiently construct the optimal disk structures of heterogeneous replicas.
We implement an HR engine on top of Cassandra to verify the effect of our methods. Experiments show that the average query latency can be reduced by orders of magnitude.
The rest of this paper is organized as follows. We first introduce the intuition in Section 2. The problem definition and the solution are introduced in Section 3. In Section 4 we show the architecture of our HR engine on Cassandra. The experimental evaluation is reported in Section 5. Finally, we discuss related studies in Section 6 and conclude the paper in Section 7.
2 Rethink Replica
Modern distributed storage systems, such as GFS [10, 4], Cassandra, HBase, MongoDB and Dynamo, usually use a replica mechanism for data reliability. The replica mechanism copies data several times and stores the replicas on different nodes of the distributed storage system. When the data on one node is lost because of a disk crash, it can still be recovered from the copies on other nodes.
A side benefit of the replica mechanism is that it supports query parallelism and load balancing. Queries on the same data can be evenly routed to the different nodes that store a copy of the data.
With the traditional replica mechanism, all the replicas of a dataset have the same serialized bytes on disk and therefore the same performance for a given query. The left part of Figure 1 shows this case. Both replicas serialize the dataset on disk in alphabetical order. Whichever replica serves a given query, the query latency is the same, leaving load balancing aside.
The drawback is that any one serialization of a replica on disk is friendly to only some queries. For example, consider the two queries in Figure 1: the first selects data that is less than a given value, and the second selects the “blue” data. The latency of the first query is lower than that of the second, no matter which replica serves the two queries. That is, with the traditional replica mechanism, no matter how we optimize the serialization strategy of the dataset on disk, a data serialization that accommodates some queries (e.g., the first) may negatively impact other queries (e.g., the second). This “one size fits all” pattern does not take full advantage of replicas.
Our heterogeneous replica mechanism only requires different replicas to hold the same dataset, while their serialized bytes on disk can differ. In this way, different replicas can handle different queries, so that we make full use of the replicas. The right part of Figure 1 shows this case. One replica serializes data in alphabetical order and the other by color. Then both queries have the minimal latency. At the same time, the data recovery ability is preserved because the replicas hold the same dataset.
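The idea can be sketched with a toy model. The following is a hedged illustration, not the HR engine: it assumes a replica is a list of (name, color) records sorted by one leading column, that a filter on the leading column reads only the matching contiguous run, and that a filter on any other column forces a full scan. All names and data are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical dataset of (name, color) records.
records = [("a", "red"), ("b", "blue"), ("c", "green"), ("d", "blue"),
           ("e", "red"), ("f", "blue"), ("g", "green"), ("h", "red")]

NAME, COLOR = 0, 1

def build_replica(sort_column):
    """Return (sort_column, rows) with rows ordered by sort_column first."""
    other = COLOR if sort_column == NAME else NAME
    rows = sorted(records, key=lambda row: (row[sort_column], row[other]))
    return sort_column, rows

def rows_scanned(replica, query):
    """Rows read to answer query = (column, low, high), bounds inclusive."""
    sort_column, rows = replica
    column, low, high = query
    if column != sort_column:
        return len(rows)  # filter column is not the sort key: full scan
    values = [row[column] for row in rows]  # sorted, since column leads the order
    return bisect_right(values, high) - bisect_left(values, low)

def best_cost(replicas, query):
    """Route the query to the replica that scans the fewest rows."""
    return min(rows_scanned(replica, query) for replica in replicas)

traditional = [build_replica(NAME), build_replica(NAME)]
heterogeneous = [build_replica(NAME), build_replica(COLOR)]

q1 = (NAME, "a", "c")         # names in ["a", "c"]
q2 = (COLOR, "blue", "blue")  # the "blue" data

print(best_cost(traditional, q1), best_cost(traditional, q2))      # 3 8
print(best_cost(heterogeneous, q1), best_cost(heterogeneous, q2))  # 3 3
```

Both heterogeneous replicas contain the same set of records, so either one can be used to rebuild the other after a failure.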
3 Heterogeneous Replica
3.1 Modeling In Cassandra
The heterogeneous replica mechanism is concerned with the data serialization on disk. In this paper we focus on Cassandra, which uses sorted string tables (SSTables [11, 6, 26]) to manage data on disk. An SSTable stores key-value data sorted by key. In Cassandra, given a partition key, the keys sorted in an SSTable are the combinations of the clustering keys. That is, we can control the data serialization in an SSTable by adjusting the clustering keys of the corresponding column family.
Our problem thus becomes how to order the clustering keys in each replica. We define the order of the clustering keys of a column family in a replica as the structure of the replica on disk.
We use $CK = \{c_1, \dots, c_m\}$ to represent the columns that are chosen to be the clustering keys in a column family, where $m$ is the number of columns in $CK$. Given a dataset $D$ with $N$ records, we use $D_{i,j}$ to represent the value of the $i$-th record at the $j$-th column. For each column $c_j$, the distribution function of $c_j$ is

$$F_j(x) = P(D_{i,j} \le x),$$

and the probability density function is $f_j(x)$. $r = (c_{p_1}, \dots, c_{p_m})$ is a permutation of the clustering keys, i.e., $(p_1, \dots, p_m)$ is a permutation of $(1, \dots, m)$. In practice, $r$ is the structure of a replica.
We use $Q = \{q_1, \dots, q_n\}$ to represent the known query workload. Each query pattern $q$ is composed of range filters ($l_j \le c_j \le u_j$) and equality filters ($c_j = v_j$) on the clustering keys $CK$. Besides, we assign a global range filter to every clustering key that does not have any filter, to ensure that every clustering key has a filter. To support these queries, we use ALLOW FILTERING in Cassandra.
Given a query, the time cost mainly depends on the size of the data in SSTables that has to be loaded from disk in Cassandra. For example, Figure 2 shows the data that needs to be loaded from an SSTable when executing a query, in which each key is the combination of the clustering keys. Cassandra traverses from the lower bound (4-5-3) and terminates when it meets the first key (4-8-6) that exceeds the upper bound.
When estimating the number of these rows, we record the index of the first clustering key in $r$ that has a range filter as $i$, so that all keys before the $i$-th key have equality filters. Given a replica with the clustering key permutation $r$, the number of rows to be loaded from disk can be estimated as:

$$Row(r, q) = N \cdot \prod_{j=1}^{i-1} f_{p_j}(v_{p_j}) \cdot \left( F_{p_i}(u_{p_i}) - F_{p_i}(l_{p_i}) \right) \tag{1}$$

This estimate is slightly larger than the actual number of rows shown in Figure 2.
In the query process of Cassandra, the real time cost is dominated by $Row(r, q)$, and the relation between the time cost and $Row(r, q)$ depends on the actual environment of the system. We use a function $g$ to represent this relation:

$$Cost(r, q) = g(Row(r, q)) \tag{2}$$

Given a specific structure of replicas on disk $R = \{r_1, \dots, r_k\}$ and a query $q$, the minimal time cost of the query is as follows:

$$Cost(R, q) = \min_{r \in R} Cost(r, q) \tag{3}$$

Then, the average time cost of $Q$ is:

$$Cost(R, Q) = \frac{1}{|Q|} \sum_{q \in Q} Cost(R, q) \tag{4}$$

Finally, the HRC problem is defined as follows:
(Heterogeneous replica construction problem): Given a query workload $Q$, find the optimal structure of heterogeneous replicas $R^*$ such that the average latency of $Q$ is minimized:

$$R^* = \arg\min_{R} Cost(R, Q) \tag{5}$$
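The cost model above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions that are not part of the model itself: every clustering key is an integer distributed uniformly and independently over `0..domain-1`, the cost function is the identity on the estimated row count, and all function and parameter names are hypothetical.

```python
from fractions import Fraction

def estimate_rows(order, query, n_rows, domain):
    """Estimate the rows scanned on one replica.

    order:  tuple of clustering key names in on-disk sort order
    query:  {key: (low, high)} with inclusive bounds; low == high is an equality filter
    domain: number of distinct values per key (assumed uniform on 0..domain-1)
    """
    selectivity = Fraction(1)
    for key in order:
        low, high = query.get(key, (0, domain - 1))  # no filter: global range
        selectivity *= Fraction(high - low + 1, domain)
        if low != high:  # the first range filter ends the usable sorted prefix
            break
    return float(n_rows * selectivity)

def query_cost(replicas, query, n_rows, domain, cost_fn=lambda rows: rows):
    """The cost of a query is its cost on the cheapest replica."""
    return min(cost_fn(estimate_rows(order, query, n_rows, domain))
               for order in replicas)

def workload_cost(replicas, workload, n_rows, domain):
    """Average query cost over the workload."""
    total = sum(query_cost(replicas, q, n_rows, domain) for q in workload)
    return total / len(workload)

workload = [
    {"a": (5, 5), "b": (10, 19)},  # equality on a, range on b
    {"b": (7, 7), "c": (0, 49)},   # equality on b, range on c
]
same = [("a", "b", "c")] * 3                                   # identical replicas
mixed = [("a", "b", "c"), ("b", "c", "a"), ("c", "a", "b")]    # heterogeneous

print(workload_cost(same, workload, 1_000_000, 100))   # 500500.0
print(workload_cost(mixed, workload, 1_000_000, 100))  # 3000.0
```

With three identical replicas the second query has no usable sorted prefix and scans the whole table, while with the mixed layout each query finds a replica whose leading keys match its equality filters.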
3.2 Replica Construction
In this section, we first analyze the hardness of the HRC problem, which motivates us to devise an HRCA algorithm based on simulated annealing.
The simple way to find the optimal structure is to enumerate all possible structures of replicas in Equation (5). Given the replication factor $k$ and the number of clustering keys $m$, each replica can take any of the $m!$ permutations of the clustering keys, so up to $(m!)^k$ replica layouts have to be considered. This number becomes very large when $k$ or $m$ is large.
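To give a feeling for the size of this search space, the following sketch counts candidate layouts. It relies only on the fact that each replica can independently take any permutation of the clustering keys; the function name and the choice to also report the count with replica order ignored are assumptions made for illustration.

```python
from math import comb, factorial

def layout_count(num_keys, replication_factor):
    """Return (ordered, unordered) counts of candidate replica layouts."""
    per_replica = factorial(num_keys)  # permutations of the clustering keys
    ordered = per_replica ** replication_factor
    # Multisets of size k drawn from per_replica permutations (replica order ignored).
    unordered = comb(per_replica + replication_factor - 1, replication_factor)
    return ordered, unordered

print(layout_count(3, 3))  # (216, 56)
print(layout_count(6, 3))  # (373248000, 62467440)
```

Even with three replicas, moving from 3 to 6 clustering keys raises the number of candidates from a few hundred to hundreds of millions, which is why exhaustive enumeration quickly becomes impractical.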
In HRCA, a specific structure of all heterogeneous replicas on disk corresponds to a state. Users need to provide a query workload and an arbitrary state as the initial state. The main loop (lines 2-8) of Algorithm 1 is the search process of simulated annealing. A ‘good’ state is always accepted, while a ‘bad’ state is accepted probabilistically, which helps the search avoid being trapped in a locally optimal solution. A new state is generated by swapping two clustering keys of one replica in the current state.
The algorithm is called only once, so its running time matters less than the quality of its result. Besides, the algorithm generally converges within ten seconds in our experiments.
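The search described above can be sketched as follows. This is an illustrative simulated annealing loop, not a transcription of Algorithm 1: the cooling schedule, the acceptance rule `exp(-delta / temperature)`, the uniform-column cost estimate and every name and parameter value are assumptions. Each replica needs at least two clustering keys for the swap.

```python
import math
import random

def scanned_fraction(order, query, domain):
    """Fraction of rows scanned, assuming uniform integer keys on 0..domain-1."""
    fraction = 1.0
    for key in order:
        low, high = query.get(key, (0, domain - 1))  # no filter: global range
        fraction *= (high - low + 1) / domain
        if low != high:  # the first range filter ends the sorted prefix
            break
    return fraction

def avg_cost(state, workload, domain):
    """Average over queries of the cost on the cheapest replica."""
    return sum(min(scanned_fraction(order, q, domain) for order in state)
               for q in workload) / len(workload)

def neighbor(state, rng):
    """Swap two clustering keys in one randomly chosen replica."""
    new_state = [list(order) for order in state]
    replica = new_state[rng.randrange(len(new_state))]
    i, j = rng.sample(range(len(replica)), 2)
    replica[i], replica[j] = replica[j], replica[i]
    return [tuple(order) for order in new_state]

def hrca_sketch(initial, workload, domain, temperature=1.0,
                cooling=0.95, min_temperature=1e-3, seed=0):
    rng = random.Random(seed)
    current, current_cost = initial, avg_cost(initial, workload, domain)
    best, best_cost = current, current_cost
    while temperature > min_temperature:
        candidate = neighbor(current, rng)
        candidate_cost = avg_cost(candidate, workload, domain)
        delta = candidate_cost - current_cost
        # Always accept an improvement; accept a worse state with some probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temperature *= cooling
    return best, best_cost

workload = [
    {"a": (5, 5), "b": (10, 19)},
    {"b": (7, 7), "c": (0, 49)},
    {"c": (3, 3), "a": (0, 9)},
]
initial = [("a", "b", "c")] * 3
best, best_cost = hrca_sketch(initial, workload, domain=100)
print(best, best_cost)
```

Because the best state seen so far is tracked separately from the current state, the returned layout is never worse than the initial one, even when the walk accepts worse states along the way.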
4.1 Separation of Concerns
We implemented the heterogeneous replica (HR) engine on top of Cassandra. This architecture decouples the replica mechanism from the data management on disk.
This enables a clean separation of concerns. The underlying Cassandra handles most aspects of data management, including the management of the MemTable and SSTable processes of the LSM-Tree [21, 9]. Hence we do not need to worry about sorting and data serialization on disk or about the compaction strategy of SSTables. The HR engine above it can concentrate on how to construct the optimal structures of heterogeneous replicas.
4.2 HR Engine
The architecture is shown in Figure 3. The HR engine mainly has five modules: request agency, cost evaluator, replica generator, request scheduler and recovery. The engine accepts requests from clients and connects to the underlying database.
All requests from clients are routed by the Request Agency. Clients send requests to the Request Agency layer, which handles all communication with the underlying data store and other clients. Clients are agnostic to the underlying Cassandra.
For a CREATE COLUMN FAMILY request, the Replica Generator module automatically generates the optimal replica structures and allocates the replicas to different nodes using a hash function that we define, which takes the replica id and the partition key as input.
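The exact hash function is not given here, so the following is only one hypothetical way such a placement function could look. It assumes that the partition key is a string, that nodes are numbered `0..num_nodes-1`, and that the replicas of one partition should land on distinct nodes whenever the replication factor does not exceed the number of nodes. The names `node_for` and `write_targets` are invented for this sketch.

```python
import hashlib

def node_for(replica_id, partition_key, num_nodes):
    """Map (replica id, partition key) to a node index."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    base = int.from_bytes(digest[:8], "big") % num_nodes
    # Offsetting by the replica id keeps replicas of one partition on distinct nodes.
    return (base + replica_id) % num_nodes

def write_targets(partition_key, replication_factor, num_nodes):
    """Nodes that must receive a write so that every replica stays complete."""
    return [node_for(replica_id, partition_key, num_nodes)
            for replica_id in range(replication_factor)]

targets = write_targets("station-42", replication_factor=3, num_nodes=6)
print(targets)
```

Using a cryptographic digest rather than Python's built-in `hash` keeps the mapping stable across processes, which matters because every client has to agree on where a replica lives.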
When receiving a read request, the Request Scheduler routes the request to the replica with the lowest latency. The query cost on each replica is calculated by the Cost Evaluator. A write request is resolved by the Write Scheduler and sent to all replicas to maintain data consistency. Sorting is handled by the LSM-Tree write process of each node.
The Recovery module is responsible for data recovery when a node goes down. As the structures of the replicas differ, the original recovery strategy does not apply to heterogeneous replicas. Therefore, we leverage the LSM-Tree write process to recover a replica.
In this section, we first model the cost function in Equation (2). Then we compare the following replica mechanisms under different data sizes, replication factors, and numbers of clustering keys on the TPC-H and simulation datasets:
TR: The traditional replica mechanism with the approximately optimal structure that an expert can give.
HR: The heterogeneous replicas that HRCA generates.
We ran experiments on a Cassandra (version 3.11.0) cluster with 6 nodes. Each node has 2 Intel Xeon E5-2697 CPUs with 36 cores in total, 256GB of memory and a 7200 rpm HDD.
We used two kinds of datasets:
TPC-H Dataset: Considering the column data types and supported queries, we use the orders table of TPC-H, which has 9 columns, as the experiment target. We vary the scale factor of the TPC-H dataset, which results in data sizes ranging from 1.5 million to 7.5 million rows. In this column family, the clustering keys we defined are custkey, orderdate and clerk.
Since the 22 queries in TPC-H are mainly for join operations, which are not the optimization target of our query model, we give two example queries based on business scenarios:
Q1: find the total price of all customers that a specific clerk served on one day. SQL Example: select totalprice from orders where orderdate = ? and clerk = ? and custkey > 0;
Q2: find the total price that a customer spent on a specific clerk’s merchandise over some days. The time ranges of orderdate are randomly generated. SQL Example: select sum(totalprice) from orders where custkey = ? and clerk = ? and orderdate > ? and orderdate < ?;
We generated 500 query instances by replacing the “?” in the SQL templates.
Simulation dataset: We generated simulation datasets, whose data size and clustering key size satisfy: (1) the value scope of each clustering key is ; and (2) the data type of each clustering key is integer, and the values are distributed randomly over the whole data space. The queries we used are randomly generated.
5.1 Cost Modeling
In Equation (2), the cost function depends on the hardware and the configuration of the system. In this experiment, we collect the query time costs under different sizes of the candidate result set in different system environments, using queries on the simulation dataset. The result is shown in Figure 4.
We first evaluate whether the size of a data item impacts the cost function. In the experiment, we change the size of the data item from 50B to 200B by increasing the size of the metric column value. In Figure 4(a), each line represents the cost for a specific data item size. The lines show a roughly linear relationship between the size of the candidate result set and the cost. Besides, the cost does not change significantly as the size of a data item increases from 50 bytes to 200 bytes. Therefore, we do not need to re-model the cost function if only the size of the metric column changes.
We then study how the number of clustering keys impacts the cost function. The number of clustering keys is also related to the size of data items, so we change the number such that the average item size ranges from 50 bytes to 200 bytes, as in the above experiment. As can be seen in Figure 4(b), the cost function is linear for each number of clustering keys, and its slope increases as the number of clustering keys increases. Therefore, the cost function needs to be re-modeled when the number of clustering keys changes.
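Since the measured relationship is roughly linear, the cost function can be re-modeled with an ordinary least-squares line fit. The sketch below shows one way to do this in plain Python. The data points are synthetic placeholders chosen to lie on an exact line, not measurements from our experiments, and the names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic placeholder data: candidate result set size vs. query time in ms.
rows = [1000, 2000, 4000, 8000]
latency_ms = [12.0, 22.0, 42.0, 82.0]

slope, intercept = fit_line(rows, latency_ms)

def cost(row_count):
    """Fitted cost function for a given estimated row count."""
    return slope * row_count + intercept

print(round(slope, 6), round(intercept, 6))  # 0.01 2.0
```

A separate line would be fitted for each number of clustering keys, because the slope changes with it, whereas one fit can be reused when only the metric column size changes.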
5.2 Query Latency and Improvement
We evaluate the average query latency of the two replica mechanisms under different sizes of the TPC-H dataset. The result is shown in Figure 5(a), and the relative performance improvement of HR over TR is shown in Figure 5(d). The query cost of the TR mechanism increases gradually as the data size grows, while that of HR remains almost unchanged. According to the experiment, HR achieves a performance improvement of two orders of magnitude over TR at scale factor 5.
We then evaluate how the replication factor (the number of replicas) impacts the query latency on the simulation dataset. In this experiment, there are 10 million data items in the dataset. The query latencies of the two mechanisms under different replication factors are shown in Figure 5(b). The relative performance improvement of HR over TR is shown in Figure 5(e). The average query latency stays constant in TR but decreases as the replication factor grows in HR. The latency of the two mechanisms is the same when the replication factor is 1, and the time cost of HR drops dramatically when the replication factor is greater than 1. Even two replicas greatly speed up the queries.
We evaluate the impact of the number of clustering keys on the simulation dataset with uniform queries by varying the number of clustering keys. The dataset has 10 million items in total and the replication factor is 3. The time costs of the two replica mechanisms are shown in Figure 5(c), and the relative performance improvement of HR over TR is shown in Figure 5(f). As can be seen, the improvement increases with the number of clustering keys, except for five clustering keys. If there are only 2 or 3 clustering keys, the 3 replicas are not fully utilized. With more clustering keys, the effect of HR is better.
5.3 Write Throughput
We measured the write throughput of TR and HR with 3 replicas on the TPC-H dataset. We load 40, 80 and 120 million rows separately. Table 1 shows that heterogeneous replicas maintain the same write speed as traditional replicas, because we write data asynchronously into the different replicas and the write process of each replica uses the traditional LSM-Tree write strategy. Therefore, heterogeneous replicas do not have a negative impact on write speed.
| Number of rows (million) | Time cost with TR (s) | Time cost with HR (s) |
5.4 Data Recovery
We measure the speed of data recovery when a node fails. We remove the data on the node and call nodetool repair to launch the original data recovery in Cassandra. With 18 million rows of the TPC-H dataset imported, traditional data recovery takes approximately 4 minutes, while our HR engine takes 6 minutes. Considering that node failures occur infrequently, taking a little longer to recover data is acceptable given the tremendous reduction in query latency.
6 Related Work
There are many works [14, 15] that optimize the data structure on disk. For example, a column ordering strategy has been proposed for columnar stores, in which the read speed is improved by adjusting the on-disk order of columns. That work models the disk seek cost on a column store, whereas we model the data that needs to be loaded into memory from SSTables. Besides, it does not consider replicas.
Data duplication and replica mechanisms [25, 19, 24, 27] have been widely used in data management systems, among which full replication [16, 22], for example primary-backup, is a popular approach. The main goal of full replication is to support data availability. One work proposes a data replication and allocation strategy that chooses the optimal replication factor for each partition. Another work increases the number of replicas according to the frequency of query access. In another work, a duplication strategy is proposed for columnar store systems. The above works increase either the number of replicas or the amount of data in each replica. Some even assume that storage is unlimited. In contrast, we adjust the existing replica mechanism and improve query performance without introducing additional disk cost.
Data partitioning methods place different parts of the data on different nodes to achieve load balancing and thereby accelerate queries. We focus on the heterogeneous replica structure inside each partition. Therefore, data partitioning strategies are orthogonal to our work and can be used together with our replica mechanism.
In this paper, we propose a new replica mechanism called heterogeneous replica. The new mechanism gives replicas the ability to significantly reduce the average latency of queries while keeping the data recovery property. Existing approaches to accelerating queries either optimize a limited set of queries by adjusting the structure of the data or duplicate some frequently accessed data. In contrast, we optimize the serialization of the existing replicas on disk and thus introduce no additional disk cost.
To find approximately optimal structures of the heterogeneous replicas, we propose (1) a cost model for SSTables on Cassandra, (2) the formalized heterogeneous replica construction problem, (3) a solution to find the optimal structures of replicas, and (4) the implementation of the HR engine. We believe that our replica mechanism can also be applied to databases other than Cassandra that use replicas.
- [1] https://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_compound_keys_c.html.
- [2] https://www.datastax.com/dev/blog/allow-filtering-explained-2.
- [3] H. Bian and Yan. Wide table layout optimization based on column ordering and duplication. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 299–314. ACM, 2017.
- [4] D. Borthakur et al. Hdfs architecture guide. Hadoop Apache Project, 53, 2008.
- [5] N. Budhiraja, K. Marzullo, F. B. Schneider, and S. Toueg. The primary-backup approach. Distributed systems, 2:199–216, 1993.
- [6] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4, 2008.
- [7] K. Chodorow. MongoDB: The Definitive Guide: Powerful and Scalable Data Storage. O’Reilly Media, Inc., 2013.
- [8] G. DeCandia and Hastorun. Dynamo: amazon’s highly available key-value store. In ACM SIGOPS operating systems review, volume 41, pages 205–220. ACM, 2007.
- [9] C. M. Dong S. Optimizing space amplification in rocksdb. In CIDR, 3(3), 2017.
- [10] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system, volume 37. ACM, 2003.
- [11] I. Grigorik. Sstable and log structured storage: Leveldb, 2014.
- [12] Q. Gui-bin, Z. Er-bin, and J. Yang. Using samba service to realize information sharing. Heilongjiang Meteorology, 28(4):40–41, 2012.
- [13] J. Han and J. Du. Survey on nosql database. In Pervasive computing and applications (ICPCA), 2011 6th international conference on, pages 363–366. IEEE, 2011.
- [14] Y. Huai and S. Ma. Understanding insights into the basic structure and essential issues of table placement methods in clusters. Proceedings of the VLDB Endowment, 6(14):1750–1761, 2013.
- [15] A. Jindal and J. Dittrich. A comparison of knives for bread slicing. Proceedings of the VLDB Endowment, 6(6):361–372, 2013.
- [16] B. Kemme and G. Alonso. Don’t be lazy, be consistent: Postgres-r, a new way to implement database replication. In VLDB, pages 134–143, 2000.
- [17] S. Kirkpatrick, D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, 1983.
- [18] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2):35–40, 2010.
- [19] M. Lei and X. Hong. An on-line replication strategy to increase availability in data grids. Future Generation Computer Systems, 24:85–98, 2008.
- [20] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2):330–339, 2010.
- [21] P. O’Neil, E. Cheng, D. Gawlick, and E. O’Neil. The log-structured merge-tree (lsm-tree). Acta Informatica, 33(4):351–385, 1996.
- [22] C. Plattner, G. Alonso, and M. T. Özsu. Extending dbmss with satellite databases. The VLDB Journal, 17(4):657–682, 2008.
- [23] T. Rabl and H.-A. Jacobsen. Query centric partitioning and allocation for partially replicated database systems. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 315–330. ACM, 2017.
- [24] Y. Saito and M. Shapiro. Optimistic replication. ACM Computing Surveys (CSUR), 37(1):42–81, 2005.
- [25] G. L. Sanders and S. Shin. Denormalization effects on performance of rdbms. In System Sciences, 2001. Proceedings of the 34th Annual Hawaii International Conference on, pages 9–pp. IEEE, 2001.
- [26] P. Shetty and E. Zadok. Building workload-independent storage with vt-trees. In FAST, pages 17–30, 2013.
- [27] R. Van Renesse and F. B. Schneider. Chain replication for supporting high throughput and availability. In OSDI, volume 4, 2004.
- [28] H. Wiki. Hbase: bigtable-like structured storage for hadoop hdfs, 2012.
- [29] H. Xiang-dong, W. Jian-min, G. Si-han, et al. A storage model for large scale multi-dimension data files. Proc of NDBC, 1, 2014.
- [30] H. Zhong, Z. Zhang, and X. Zhang. A dynamic replica management strategy based on data grid. In GCC, 2010 9th, pages 18–23. IEEE, 2010.