To Ship or Not to (Function) Ship (Extended version)

07/30/2018 ∙ by Feilong Liu, et al. ∙ The Ohio State University

Sampling is often used to reduce query latency for interactive big data analytics. The established parallel data processing paradigm relies on function shipping, where a coordinator dispatches queries to worker nodes and then collects the results. The commoditization of high-performance networking makes data shipping possible, where the coordinator directly reads data in the workers' memory using RDMA while workers process other queries. In this work, we explore when to use function shipping or data shipping for interactive query processing with sampling. Whether function shipping or data shipping should be preferred depends on the amount of data transferred, the current CPU utilization, the sampling method and the number of queries executed over the data set. The results show that data shipping is up to 6.5x faster when performing clustered sampling with heavily-utilized workers.




I Introduction

In big data analysis, sampling is often used to reduce query latency for interactive query execution [25]. Current database systems use function shipping in query execution, where the coordinator distributes query plans to the workers for execution and then collects the results from the workers. The cost of function shipping includes the computation cost of executing queries at the workers and the communication cost of transferring results from the workers to the coordinator. In function shipping, the sampling method does not affect the communication cost but does affect the computation cost. For example, random sampling accesses the whole data set, while cluster sampling accesses only part of the data set during query execution.

Commodity clusters are now commonly equipped with fast networks with Remote Direct Memory Access (RDMA) support [3]. RDMA enables user applications to directly access memory in remote machines without involving the operating system kernel and offers higher throughput than TCP/IP sockets [8]. The one-sided memory access provided by RDMA makes data shipping possible. In data shipping, the coordinator uses RDMA Read to read data from the workers and executes the query locally while the workers remain passive. The cost of data shipping includes the computation cost of executing queries at the coordinator and the communication cost of transferring data from the workers to the coordinator. In data shipping, the sampling method affects not only the computation cost but also the communication cost. In cluster sampling, only the sample of the data set is transferred to the coordinator, whereas in random sampling the whole data set is transferred.

In this work, we add the optimization of choosing between function shipping and data shipping to our RDMA-aware system [20]. We discuss the trade-offs between function shipping and data shipping that are afforded by the advent of RDMA and look at how sampling influences this decision. Whether function shipping or data shipping should be preferred depends on the amount of data transferred, the current CPU utilization, the sampling method and the number of queries executed on the data set. The results show that, for both sampling methods, data shipping performs better when computing resources at the workers are limited, improving performance by up to 6.5x.

Fig. 1: As distinct cardinality increases, function shipping becomes expensive due to result set size increase.
Fig. 2: As the computation resources available at worker increases, function shipping gets cheaper.
Fig. 3: As the result set size increases due to more queries being executed, data shipping gets cheaper.

II System Design

We use the traditional single coordinator, multiple workers design. User queries are sent to the coordinator. The coordinator directs the query to all the workers. Each worker can then choose to respond through either data shipping or function shipping. In data shipping, the worker returns the raw data to the coordinator and the coordinator executes the query on the received data. In function shipping, the worker executes the query on its data and returns the result back to the coordinator. A final aggregation to combine the results from all workers is then performed at the coordinator.
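As a minimal sketch of the two response modes described above, the following Python models each worker's partition as an in-memory list of values from a grouping column and runs a grouped count in both modes. This is a toy stand-in under stated assumptions: no RDMA is involved, and the function names are hypothetical rather than part of our system. It illustrates that both modes yield the same answer and differ only in where the work happens and what crosses the network.

```python
from collections import Counter
from typing import Dict, List


def function_shipping(partitions: List[List[int]]) -> Dict[int, int]:
    """Each worker aggregates locally; only partial results are shipped."""
    # Work done at the workers: one partial grouped count per partition.
    partial_results = [Counter(partition) for partition in partitions]
    # Final aggregation at the coordinator over the partial results.
    total: Counter = Counter()
    for partial in partial_results:
        total.update(partial)  # adds counts key by key
    return dict(total)


def data_shipping(partitions: List[List[int]]) -> Dict[int, int]:
    """Workers stay passive; the coordinator pulls raw data and runs the query."""
    # Raw data crosses the network (modeled here as a simple concatenation).
    received = [value for partition in partitions for value in partition]
    # Query execution happens entirely at the coordinator.
    return dict(Counter(received))
```

In this sketch, function shipping moves one small dictionary per worker, while data shipping moves every value; which is cheaper overall depends on the cost terms discussed next.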

II-A Function Shipping vs Data Shipping

In function shipping, the worker executes the query and performs RDMA Write to send the result to the coordinator. In data shipping, the coordinator uses RDMA Read to read the data and executes the query on the received data. The worker is passive in data shipping. The costs of data shipping and function shipping are as follows:

$$COST_{DS} = C_{read} + C_{sample} + C_{exec}^{coord} \qquad (1)$$

$$COST_{FS} = C_{sample} + C_{exec}^{worker} + C_{write} + C_{agg} \qquad (2)$$

where $COST_{DS}$ is the cost of data shipping, $C_{read}$ is the cost of reading data from workers, $C_{sample}$ is the cost of sampling, $C_{exec}^{coord}$ is the cost of executing queries at the coordinator, $COST_{FS}$ is the cost of function shipping, $C_{exec}^{worker}$ is the cost of executing queries at the worker, $C_{write}$ is the cost of writing the result to the coordinator, and $C_{agg}$ is the cost of aggregating results from the workers. When there are multiple workers, the workload is split and distributed across all workers. Increasing the number of workers reduces the data to be processed in each worker, which decreases the sampling cost and the execution cost at the workers in Equation 2, and hence favors function shipping.

While function shipping is the norm, data shipping is preferred when the cost of function shipping $COST_{FS}$ is higher than the cost of data shipping $COST_{DS}$. According to Equation 1 and Equation 2, $C_{write}$ is exclusive to function shipping and depends on the size of the result. A larger result means a higher cost for function shipping, which makes data shipping preferable. Data shipping is also preferred when the workers are heavily loaded and fewer computing resources are available for query execution. The execution cost $C_{exec}^{worker}$ in Equation 2 increases when the workers have fewer computing resources, which increases the function shipping cost $COST_{FS}$. Hence data shipping is preferred in the following cases:

  1. The size of the result is large.

  2. Computation load on the worker is high.
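The decision rule above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the individual cost terms simply add up, the field names are hypothetical, and the numbers in the usage note are made up rather than measured. Ties go to function shipping because it is the default mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostTerms:
    """Per-query cost terms in arbitrary but consistent units (hypothetical names)."""
    read: float         # reading data from the workers (data shipping only)
    sample: float       # sampling
    exec_coord: float   # executing the query at the coordinator
    exec_worker: float  # executing the query at the worker
    write: float        # writing the result to the coordinator
    aggregate: float    # aggregating results from the workers


def cost_data_shipping(c: CostTerms) -> float:
    # Assumption: the terms are additive.
    return c.read + c.sample + c.exec_coord


def cost_function_shipping(c: CostTerms) -> float:
    # Assumption: the terms are additive.
    return c.sample + c.exec_worker + c.write + c.aggregate


def choose_strategy(c: CostTerms) -> str:
    """Prefer data shipping only when function shipping is strictly more expensive."""
    if cost_function_shipping(c) > cost_data_shipping(c):
        return "data shipping"
    return "function shipping"
```

For example, with `CostTerms(read=4.0, sample=1.0, exec_coord=2.0, exec_worker=2.0, write=0.5, aggregate=0.5)` the rule picks function shipping. Raising `write` to 5.0 (a large result) or `exec_worker` to 8.0 (a heavily loaded worker) flips the choice to data shipping, matching the two cases listed above.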

II-B Sampling

Our system uses online sampling to meet interactive latency requirements for large datasets. We support two sampling modes, simple random sampling and cluster sampling [21]. In simple random sampling, every tuple has an equal probability of being included in the sample. In the absence of indexes, this involves accessing every tuple of the dataset. We use Bernoulli sampling semantics for simple random sampling. In cluster sampling, different clusters are chosen randomly and all tuples within a cluster are included in the sample. This avoids accessing every tuple in the dataset. The pros and cons of both sampling strategies are as follows.
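The two sampling modes can be sketched as follows. This is a simplified illustration, not the engine's scan operator: it assumes a cluster is a fixed-size contiguous block of tuples and that each cluster is chosen independently with probability equal to the sampling rate; the function names and the `cluster_size` parameter are hypothetical.

```python
import random
from typing import List, Sequence


def bernoulli_sample(tuples: Sequence[int], rate: float, rng: random.Random) -> List[int]:
    """Simple random sampling with Bernoulli semantics.

    Every tuple is visited and kept independently with probability `rate`.
    """
    return [t for t in tuples if rng.random() < rate]


def cluster_sample(tuples: Sequence[int], rate: float, cluster_size: int,
                   rng: random.Random) -> List[int]:
    """Cluster sampling over fixed-size contiguous blocks (an assumption).

    One random draw is made per cluster; tuples in skipped clusters are never read.
    """
    sample: List[int] = []
    for start in range(0, len(tuples), cluster_size):
        if rng.random() < rate:
            # The whole cluster is included in the sample.
            sample.extend(tuples[start:start + cluster_size])
    return sample
```

Note the difference in access pattern: `bernoulli_sample` makes one random draw per tuple, while `cluster_sample` makes one draw per cluster and touches only the chosen clusters.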

II-B1 Execution Speed

In function shipping, simple random sampling adds a Bernoulli-sampling-based scan operator that accesses the whole data set, whereas cluster sampling accesses only a fraction of the tuples. In data shipping, simple random sampling requires transferring the entire dataset to the coordinator, as the worker lacks the computing resources required to perform sampling. The coordinator then samples the received data. For cluster sampling, the coordinator only accesses a sample of the data, resulting in less network traffic. Thus, cluster sampling is cheaper than simple random sampling for both shipping modes.
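The effect of the shipping mode and the sampling method on network traffic can be summarized in a small helper. This is a back-of-the-envelope sketch under assumptions: the cluster sample is taken to be `rate` times the data size in expectation, protocol overheads are ignored, and the function and parameter names are hypothetical.

```python
def bytes_transferred(mode: str, sampling: str, data_bytes: int,
                      result_bytes: int, rate: float) -> float:
    """Approximate bytes moved from a worker to the coordinator for one query."""
    if mode not in ("function", "data"):
        raise ValueError("mode must be 'function' or 'data'")
    if sampling not in ("random", "cluster"):
        raise ValueError("sampling must be 'random' or 'cluster'")
    if mode == "function":
        # Only the result is shipped, regardless of the sampling method.
        return float(result_bytes)
    if sampling == "random":
        # The coordinator must see every tuple to sample it.
        return float(data_bytes)
    # Cluster sampling: only the chosen clusters are read (expected size).
    return rate * data_bytes
```

With hypothetical values `data_bytes=1000`, `result_bytes=10` and `rate=0.25`, data shipping moves 1000 bytes under random sampling but only 250 under cluster sampling, while function shipping moves 10 bytes in both cases.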

II-B2 Result Quality

Simple random sampling usually results in better sample quality than cluster sampling if the tuples are stored in non-random order. A clustered index stores the data in sorted order. If the GROUP BY or WHERE clause contains any of the clustered index columns in order, cluster sampling can miss entire groups or matching tuples, respectively, causing a large sampling error.
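The following toy experiment illustrates why sorted storage hurts cluster sampling for GROUP BY queries. It is a sketch under assumptions: the table is a single column stored in sorted order, clusters are fixed-size contiguous blocks, and a fixed number of clusters is drawn without replacement; all names and sizes are hypothetical.

```python
import random
from typing import List, Set


def sorted_column(num_groups: int, tuples_per_group: int) -> List[int]:
    """A grouping column stored in sorted order, as under a clustered index."""
    return [g for g in range(num_groups) for _ in range(tuples_per_group)]


def groups_seen_by_cluster_sampling(column: List[int], cluster_size: int,
                                    num_clusters: int, rng: random.Random) -> Set[int]:
    """Pick `num_clusters` contiguous blocks at random and report the groups they contain."""
    starts = list(range(0, len(column), cluster_size))
    chosen = rng.sample(starts, num_clusters)
    return {value for s in chosen for value in column[s:s + cluster_size]}


def groups_seen_by_bernoulli_sampling(column: List[int], rate: float,
                                      rng: random.Random) -> Set[int]:
    """Keep each tuple independently with probability `rate` and report the groups seen."""
    return {value for value in column if rng.random() < rate}
```

With 8 groups of 1000 tuples each and clusters of 100 tuples, every cluster lies inside a single group, so sampling 2 clusters (2.5% of the data) can reveal at most 2 of the 8 groups. Bernoulli sampling at the same 2.5% rate keeps about 25 tuples per group and is overwhelmingly likely to see all 8.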

III Experiments

We extended our RDMA-aware query execution engine Pythia, a prototype open-source in-memory query engine [20], with sampling support. We currently employ a single coordinator and a single worker setup. They each have 512 GB of memory across two NUMA nodes, with each NUMA node having one Intel Xeon E5-2680v4 14-core processor. They are connected by an EDR (100 Gb/s) InfiniBand network.

Our dataset has one table R with billion tuples, with each tuple having two long integers R.a and R.b as attributes. R.a is the primary key, which thereby ranges from 1 to the cardinality of the table, and the distinct cardinality of R.b is varied. We evaluate our system using the SQL query SELECT R.b, COUNT(*) FROM R GROUP BY R.b. In the execution of the SQL query, records with the same value in R.b are aggregated to a single record. Hence the number of records in the result is the same as the number of distinct values of R.b in the data.
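To make the relationship between the distinct cardinality of R.b and the result size concrete, here is a small Python sketch of the query's semantics. It is an illustration only: the 16-byte record size assumes each result record stores R.b and its count as two 8-byte integers, which is an assumption rather than a measured value, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_b(r_b: Iterable[int]) -> Dict[int, int]:
    """Equivalent of SELECT R.b, COUNT(*) FROM R GROUP BY R.b over a column of R.b values."""
    return dict(Counter(r_b))


def estimated_result_bytes(distinct_cardinality: int, bytes_per_record: int = 16) -> int:
    """Result size if each output record holds two 8-byte integers (assumed layout)."""
    return distinct_cardinality * bytes_per_record
```

For the column `[5, 7, 5, 9, 7, 5]`, `count_by_b` returns three records, one per distinct value, so the result grows linearly with the distinct cardinality of R.b while the input size stays fixed.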

III-A Changing Cardinality of Results

In exploring the trade-offs between function and data shipping modes, a natural question to ask is, how does the size of the result affect their response times (Section II-A)?

As the result size is non-deterministic with sampling, we turn off sampling in this experiment. We vary the distinct cardinality of R.b from 1 thousand to 1 billion. At the coordinator, we use all 28 cores for query execution, while the worker uses only 14 cores to simulate additional workload on the worker node. Figure 1 shows that when the result size is less than or equal to 4 MB, function shipping has a lower response time than data shipping. This is because the result transferred in function shipping is small. When the result size is 8 MB or larger, the saving in network traffic shrinks and function shipping has a higher response time than data shipping. Hence, data shipping is preferred when the result size is large.

III-B Changing Load on the Workers

Another question is, how does the load on the worker and the choice of sampling method affect the choice between function shipping and data shipping?

We simulate different loads on the worker by varying the number of available cores from 1 to all 28 cores, while keeping the number of cores at the coordinator fixed at 28. We set the distinct cardinality of R.b to 2, and queries time out after 60 seconds. We use the same sampling rate for cluster sampling and random sampling and compare the two methods. The results are shown in Figure 2. The number of available worker cores has no impact on data shipping due to our use of RDMA. The response time of function shipping decreases as the number of worker cores increases. With 8 or 9 worker cores, data shipping outperforms function shipping for random sampling but underperforms it for cluster sampling. This is because random sampling is more computation intensive and favors data shipping when the worker has limited computing resources. For the same sampling method, data shipping has a lower response time when the number of cores is small and is up to 6.5x faster than function shipping, as the network traffic saved by function shipping is offset by slow query execution at the workers.

How does the performance change if we increase the size of the aggregation result? As discussed in Section II-A, the cost of function shipping increases when the result size increases. Data shipping will then be preferred over a wider range of worker loads, and the crossover point between function shipping and data shipping will move to the right along the horizontal axis in Figure 2.

III-C Executing Multiple Queries

Here, we look at the case where multiple queries are executed at the same time. How does executing multiple queries affect the decision to choose between function and data shipping in presence of sampling?

We use all 28 cores for both the worker and coordinator nodes. The distinct cardinality of R.b is set to 512 million, and the same sampling rate is used for both sampling methods. We run the query multiple times, from 1 to 5. Figure 3 shows that, at an identical sampling rate, random sampling is more expensive than cluster sampling. Within the same sampling method, when running a single query, function shipping is better than data shipping. This is because the size of the result transferred in function shipping is less than the size of the data transferred in data shipping. However, data shipping becomes preferable to function shipping as the number of queries increases in our setup. This is because the total result size grows with the number of queries, while the data transferred in data shipping stays the same.
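The network side of this trade-off can be sketched as simple arithmetic. This is a hedged illustration, not our cost model: it counts transferred bytes only and ignores computation, it assumes every query returns a result of the same size and that the shipped data is read once and reused for all queries, and the function names and numbers are hypothetical.

```python
def function_shipping_bytes(num_queries: int, result_bytes_per_query: int) -> int:
    """Each query ships its own result, so traffic grows with the number of queries."""
    return num_queries * result_bytes_per_query


def data_shipping_bytes(shipped_data_bytes: int, num_queries: int) -> int:
    """The data is read once and reused, so traffic does not depend on num_queries."""
    return shipped_data_bytes


def first_query_count_favoring_data_shipping(shipped_data_bytes: int,
                                             result_bytes_per_query: int) -> int:
    """Smallest number of queries at which function shipping moves strictly more bytes."""
    if result_bytes_per_query <= 0:
        raise ValueError("result_bytes_per_query must be positive")
    return shipped_data_bytes // result_bytes_per_query + 1
```

With a hypothetical 3500 bytes of shipped data and 1000-byte results, function shipping moves fewer bytes for up to 3 queries and more from the 4th query onward.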

IV Related Work

RDMA has been studied in multiple database operations. RDMA has been used to accelerate join execution. Frey et al. [9] build a new join algorithm, cyclo-join, which transfers data using RDMA. Tinnefeld et al. [28] compare different join algorithms over RAMCloud, which is connected by an RDMA-enabled network. Barthels et al. [3] study the radix join algorithm using RDMA to transfer data. Rödiger et al. [26] have designed flow-join, which uses RDMA to deal with skew in join execution. RDMA has also been used to accelerate data shuffling in parallel database systems. Rödiger et al. [27] design a multiplexer which uses RDMA for data transfer. As RDMA provides direct access to remote memory, Mühleisen et al. [24] study the performance of accessing remote memory in database systems; Li et al. [18] use RDMA to directly access the buffer pool in remote nodes. RDMA is also studied in data processing. Lu et al. [22] accelerate Hadoop using RDMA. Dragojević et al. [7] build a distributed computing platform, FaRM, with RDMA. Wu et al. [30] design a graph processing engine over FaRM. Chen et al. [5] and Wei et al. [29] build distributed transaction processing systems using RDMA. Kalia et al. [12] implement RPC with RDMA and use RDMA-enabled RPC for distributed transaction processing. RDMA is also used in key-value stores [11, 23].

Sampling was introduced in the context of databases by Olken [25]. The AQUA system incorporated sampling into a real-world production environment, including support for joins. Different database systems such as SQL Server, DB2, AQUA [1], Turbo-DBO [6], BlinkDB [2] and Quickr [17] have varying degrees of support for sampling. Others have incorporated online sampling in the context of a session [13, 14]. Sampling over joins has also received in-depth attention, as sampling in many-to-many joins has theoretical and practical constraints [4, 15]. Online aggregation [10] introduced the notion of decreasing error during partial query execution. Researchers have also looked at using data cubes and binning to provide scalable interactive visualizations [19, 16]. In contrast, our system is the first to consider the implications of using sampling in the context of RDMA.

V Conclusions and future work

In this paper, we examine how RDMA and fast networks affect query execution strategies for interactive queries with sampling. While function shipping has been the norm, interactive big data analytics should take the amount of data transferred, the CPU utilization, the sampling method and the number of queries executed on the data set into account when choosing a query execution strategy. Looking ahead, one possible direction is to build a cost model which takes these factors into account to predict the cost of different execution strategies and to pick the optimal one.


This material is based upon work supported by the National Science Foundation under grants IIS-1422977, IIS-1527779, CAREER IIS-1453582, CCF-1816577 and CNS-1513120. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


  • [1] S. Acharya et al. The Aqua Approximate Query Answering System. SIGMOD Record, 1999.
  • [2] S. Agarwal, B. Mozafari, A. Panda, H. Milner, et al. BlinkDB: Queries with Bounded Errors and Bounded Response Times on very Large Data. EuroSys, 2013.
  • [3] C. Barthels, S. Loesing, et al. Rack-Scale In-Memory Join Processing Using RDMA. SIGMOD, 2015.
  • [4] S. Chaudhuri et al. On Random Sampling over Joins. SIGMOD, 1999.
  • [5] Y. Chen et al. Fast and General Distributed Transactions Using RDMA and HTM. EuroSys, 2016.
  • [6] A. Dobra, C. Jermaine, et al. Turbo-Charging Estimate Convergence in DBO. VLDB, 2009.
  • [7] A. Dragojević, D. Narayanan, et al. FaRM: Fast Remote Memory. NSDI, 2014.
  • [8] P. W. Frey and G. Alonso. Minimizing the Hidden Cost of RDMA. ICDCS, 2009.
  • [9] P. W. Frey, R. Goncalves, et al. A Spinning Join That Does Not Get Dizzy. ICDCS, 2010.
  • [10] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online Aggregation. SIGMOD, 1997.
  • [11] A. Kalia, M. Kaminsky, and D. G. Andersen. Using RDMA Efficiently for Key-value Services. SIGCOMM, 2014.
  • [12] A. Kalia, M. Kaminsky, et al. FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs. OSDI, 2016.
  • [13] N. Kamat, P. Jayachandran, et al. Distributed and Interactive Cube Exploration. ICDE, 2014.
  • [14] N. Kamat and A. Nandi. A Session-Based Approach to Fast-But-Approximate Interactive Data Cube Exploration. TKDD, 2017.
  • [15] N. Kamat and A. Nandi. A Unified Correlation-based Approach to Sampling Over Joins. SSDBM, 2017.
  • [16] N. Kamat and A. Nandi. InfiniViz: Interactive Visual Exploration using Progressive Bin Refinement. arXiv preprint arXiv:1710.01854, 2017.
  • [17] S. Kandula, A. Shanbhag, A. Vitorovic, M. Olma, et al. Quickr: Lazily Approximating Complex AdHoc Queries in BigData Clusters. SIGMOD, 2016.
  • [18] F. Li, S. Das, M. Syamala, et al. Accelerating Relational Databases by Leveraging Remote Memory and RDMA. SIGMOD, 2016.
  • [19] L. Lins, J. T. Klosowski, and C. Scheidegger. Nanocubes for Real-Time Exploration of Spatiotemporal Datasets. TVCG, 2013.
  • [20] F. Liu, L. Yin, and S. Blanas. Design and Evaluation of an RDMA-aware Data Shuffling Operator for Parallel Database Systems. EuroSys, 2017.
  • [21] S. Lohr. Sampling: Design and Analysis. 2009.
  • [22] X. Lu, N. S. Islam, M. Wasi-Ur-Rahman, et al. High-Performance Design of Hadoop RPC with RDMA over InfiniBand. ICPP, 2013.
  • [23] C. Mitchell, Y. Geng, and J. Li. Using One-sided RDMA Reads to Build a Fast, CPU-efficient Key-value Store. USENIX, 2013.
  • [24] H. Mühleisen et al. Peak Performance: Remote Memory Revisited. DaMoN, 2013.
  • [25] F. Olken. Random Sampling from Databases. 1993.
  • [26] W. Rödiger, S. Idicula, A. Kemper, et al. Flow-Join: Adaptive Skew Handling for Distributed Joins over High-Speed Networks. ICDE, 2016.
  • [27] W. Rödiger, T. Mühlbauer, et al. High-speed Query Processing over High-speed Networks. PVLDB, 2015.
  • [28] C. Tinnefeld, D. Kossmann, et al. Parallel Join Executions in RAMCloud. ICDE, 2014.
  • [29] X. Wei, J. Shi, et al. Fast In-memory Transaction Processing Using RDMA and HTM. SOSP, 2015.
  • [30] M. Wu et al. GraM: Scaling Graph Computation to the Trillions. SoCC, 2015.