Parallel Statistical Computing with R: An Illustration on Two Architectures

by George Ostrouchov, et al.

To harness the full benefit of new computing platforms, it is necessary to develop software with parallel computing capabilities. This is no less true for statisticians than for astrophysicists. The R programming language, perhaps the most popular software environment for statisticians today, has many packages available for parallel computing, and their diversity of approach can be difficult to navigate. Some have attempted to alleviate this problem by designing common interfaces. However, these interfaces offer limited flexibility to the user; additionally, they are often poor abstractions of modern hardware, leading to poor performance. We give a short introduction to two basic parallel computing approaches that closely align with hardware reality, allow the user to understand their performance, and provide sufficient capability to fully utilize multicore and multinode environments. We illustrate both approaches by working through a simple example: fitting a random forest model. Beginning with a serial algorithm, we derive two parallel versions. Our objective is to illustrate the use of multiple cores on a single processor and the use of multiple processors in a cluster computer. We discuss the differences between the two versions and how the underlying hardware is used in each case.





