fplyr: the split-apply-combine strategy for big data in R

05/10/2020 ∙ by Federico Marotta, et al. ∙ Università di Torino

We present fplyr, a new package for the R language to deal with big files. It allows users to easily implement the split-apply-combine strategy for files that are too big to fit into the available memory, without relying on databases or introducing non-native R classes. A custom function can be applied independently to each group of observations, and the results may be either returned or directly printed to one or more output files.


0.1 Introduction

Many fields of science and industry are witnessing an expansion in the volume of data collected by their statistical experiments, and the size of the files to be analysed is growing accordingly. On the other hand, the hardware resources needed to scale up the analyses are not available to every practitioner, and are sometimes even beyond the current technological reach. In particular, it can happen that the size of a file exceeds the total RAM of the machine where the analysis is to be performed. In such cases the R (Rlang) programmer has several tools at his or her disposal, each with its own advantages and limitations. The existing tools can be broadly classified into three families: database-backed, file-backed, and MapReduce. In the next few paragraphs, these three approaches will be briefly reviewed; the purpose, however, is not to compare them, as each is suited to different situations.

To the first family, database-backed, belong packages such as sqldf (grothendieck2012sqldf) and Rdbi, which set up a database behind the scenes and perform SQL queries on it without loading the whole file into memory. While this approach can leverage the speed and efficiency of database routines, it is sometimes necessary to perform operations on the data that are not supported by the database software. For example, while it is possible to use the GROUP BY SQL keyword to aggregate the results of a query according to the value of one field, only a small number of simple ‘aggregating operations’ are supported; the exact number and type of operations can differ according to the database backend, but in general they are not much more sophisticated than summary statistics like AVG, MAX, or STDEV.
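As a minimal illustration of the kind of aggregation that GROUP BY supports, consider the following sqldf query; for brevity it operates on a small in-memory data frame rather than on a file, but the set of available aggregating functions would be the same with a file-backed database.

library(sqldf)

d <- data.frame(id = rep(c("a", "b"), each = 3), x = rnorm(6))
# Only simple aggregates such as AVG, MIN, MAX or COUNT are available here
sqldf("SELECT id, AVG(x) AS mean_x FROM d GROUP BY id")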

A file-backed strategy is employed in the bigmemory (kane2013bigmemory) and ff (adler2014ff) packages, which rely on low-level operating system functions to create a virtual map of the file, once again without loading it into memory. These packages are highly optimised, but their efficiency comes with trade-offs; for instance, bigmemory is restricted to numerical matrices, and with ff only a (wide but) limited set of pre-implemented operations is available. The ffbase package (de2014ffbase) considerably enlarges the scope of action of ff. One potential disadvantage that remains, however, is that ff and related packages implement their own set of classes, rather than using the native ones offered by R.

The iotools package (arnold2015iotools) offers both an efficient way to read a file from disk and a set of functions to parse the file chunk by chunk. These latter functions, such as chunk.apply() and chunk.map(), are reminiscent of the MapReduce paradigm, which gives the name to the last family of our classification. Here, the large file is read piece by piece in such a way that, at any given moment, only a limited number of rows of the file are present in the RAM; at the same time, an arbitrary function can be applied to each ‘chunk’. The results of the processing are then combined and returned.
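As a sketch of this paradigm, the following snippet counts the rows of a hypothetical tab-separated file with iotools: each chunk is parsed into a matrix, its rows are counted, the per-chunk counts are collected with CH.MERGE = c, and the final reduction is a simple sum. The file name is made up for the example.

library(iotools)

per_chunk_rows <- chunk.apply("big_file.txt",
                              function(chunk) nrow(mstrsplit(chunk, sep = "\t")),
                              CH.MERGE = c)
total_rows <- sum(per_chunk_rows)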

In this paper we propose a new approach, implemented in the fplyr package, to deal with big files; it aims to be a simple and user-friendly solution that integrates well with the existing R functions. Nonetheless, like all of the approaches described above, it has a restricted applicability, as it deals only with a particular class of big files: those that are amenable to the split-apply-combine approach (wickham2011split). For example, the famous iris data set (anderson1936species; fisher1936use) would be a good candidate because the ‘Species’ field defines a partition of the data into independent groups of observations, so that a particular function can be applied separately to each species. Another example is that of gene expression files, which are often organised as matrices where each row refers to a gene, and each column reports the expression of that gene in a different individual. In this case, one may apply a function separately to each gene. More specifically, the files on which fplyr operates should be formatted in such a way that consecutive rows contain all the measurements relative to the same subject, and the first field contains the subject IDs. We refer to each group of consecutive rows pertaining to the same subject as a block.

Much as apply() applies a function to each row or column of a matrix, the functions in fplyr apply custom functions to each block of data, independently of all the other blocks. Thus, at its core, this package enables one to mimic the behaviour of the by() function, without requiring that the whole file be loaded into memory. Indeed, since the file is read block by block, fplyr’s functions run with an O(1) space complexity with respect to the size of the file.

As an illustration of the possible usage of the package, suppose that the path to a big file to be processed is stored as a character string in the variable f. Then, the following code computes and returns the summary() of each block:

flply(f, summary)

0.2 Comparison with existing packages

The approach we presented may appear similar to iotools’ chunk-wise processing, but it differs from it in several respects. Most importantly, one of iotools’ chunks can contain measurements about many subjects, while an fplyr block contains each and every measurement about one subject only. In other words, one of iotools’ chunks may contain many of fplyr’s blocks. Moreover, in fplyr, each block is treated as independent of all the others; the aim is to obtain a list of values, one for each block, whereas with iotools the typical aim is to obtain a single value for the whole file.

Similar differences distinguish fplyr from ffbase’s way of performing operations by chunks, the ffdfdply() function. As with iotools, here a chunk can contain measurements from many subjects, and the task of further separating them is left to the user. Moreover, ffdfdply() can only return an "ffdf", the equivalent of a "data.frame" for ff, whereas with fplyr it is possible to return any R object, or even to directly write something to an output file block by block, during the processing.

As previously stated, aggregate operations with database-backed packages are limited to simple operations. If the operation to be applied is more complex, it is still possible to use sqldf, but the analysis must be performed in at least two steps: first the relevant group of observations must be selected, and only then can the custom function be applied. In order to replicate the behaviour of the GROUP BY keyword with arbitrary aggregating functions, a loop should be manually set up where at each iteration a different group of observations is retrieved and analysed; furthermore, all the possible values of the grouping field must be known in advance. fplyr offers an effortless way of doing the same thing.
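For instance, the following sketch applies an arbitrary aggregating function, a trimmed mean, which is not a standard SQL aggregate, to every block in a single call; fin is assumed to hold the path to a file whose first field contains the subject IDs, as described above, and whose remaining fields are numeric.

library(fplyr)

block_trimmed_means <- ftply(fin, function(d, by) d[, lapply(.SD, mean, trim = 0.1)])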

0.3 Implementation

fplyr is mainly built upon two other R packages: iotools and data.table, the former being used to read the file chunk by chunk, the latter providing efficient ways to split the chunk into its constituent blocks, apply the user-specified function to each, combine the results, and (when applicable) write the results back to disk (dowle2019). From this point onwards, for the sake of distinguishing the user-specified function to be applied to each block from other functions, we shall refer to it as FUN. fplyr’s algorithm is, in essence, as follows: one chunk is read with the help of the iotools package, then it is split into its constituent blocks using the by() function, and FUN is applied to each. Once the whole file has been processed, a list is returned where each element corresponds to a block. This algorithm, with additional technicalities, is implemented in the flply() function. However, if the output of FUN is a "data.frame", the data.table package allows us to replace the base R by() function with a faster alternative; this second algorithm is implemented in the ftply() function (see Section 0.5 for an interpretation of the names of the functions in this package). Since the user knows a priori the type of output that FUN returns, he or she can choose which function to use accordingly. Note, however, that the second approach is implemented only as a shortcut, because the same result could be achieved using by() followed by rbind(), albeit much more slowly. In the following paragraphs we shall discuss several aspects of the implementation.
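A brief sketch of this choice (fin is again assumed to hold the path to a suitably formatted file, read without a header so that the columns get the default names V1, V2, V3, ...):

library(fplyr)
library(data.table)

# FUN returns an arbitrary object (here the result of cor.test()):
# use flply(), which collects one result per block in a list.
tests <- flply(fin, function(d) cor.test(d$V2, d$V3))

# FUN returns a "data.table" (here the number of rows of each block):
# use ftply(), which row-binds the per-block results more efficiently.
sizes <- ftply(fin, function(d, by) data.table(n = nrow(d)))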

One of the reasons why iotools is faster than base R functions at reading a file from disk is that it reads the file in bulk as a "raw" vector, and only afterwards does it convert the data into a suitable structure, such as a "data.frame" or a "matrix". For instance, the mstrsplit() function takes a "raw" or "character" vector and arranges it into a matrix, while dstrsplit() returns a "data.frame" instead. Functions like mstrsplit() and dstrsplit() are called formatters in iotools jargon (arnold2015iotools). In fplyr, we defined a new formatter which takes a "raw" vector and returns a "data.table". In particular, it first converts the "raw" vector to a "character" one, and then feeds the "character" vector to data.table’s fread(). (Indeed, not only can fread() take the path to a file as input, but it can also handle character strings directly and cast them into "data.table"s.) By relying on iotools for the low-level reading of the files, fplyr inherits some of its features, such as the ability to read compressed files. Incidentally, benchmarks show that our data.table-based formatter, dtstrsplit(), is even faster than iotools’ built-in formatter (Figure 1).
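The idea behind such a formatter can be sketched as follows; this is an illustrative simplification, not the actual dtstrsplit() implementation, and the function name and its arguments are made up for the example.

library(data.table)

raw_to_dt <- function(raw_chunk, sep = "\t", header = FALSE) {
    # Convert the raw bytes produced by the low-level reader into a single string...
    txt <- rawToChar(raw_chunk)
    # ...and let fread() parse that string directly into a "data.table"
    fread(text = txt, sep = sep, header = header)
}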

Figure 1: Benchmark of the two formatters: fplyr’s, called dtstrsplit, and iotools’, called dstrsplit. The two were used to read the same file (253316 lines, 9.2 MB) 200 times, without performing any other operation. As a reference, the time taken by data.table’s fread() is also reported. The benchmark was performed on an Acer Aspire 5750G, 2.4 GHz Intel Core i5 with 4 cores, 4 GB DDR3 RAM. The code to reproduce the benchmarks is available on the GitHub repository of the package at https://github.com/fmarotta/fplyr.

Once a chunk has been read and formatted as a "data.table", it must be split into its blocks and FUN must be applied to each of them. Depending on the type of FUN’s output, as we mentioned, two different strategies are employed. If FUN returns a "data.table", then we use the following construct:

output <- chunk[, FUN(.SD, .BY, ...), by = blocks]

In this case, FUN must take at least two arguments: when FUN is called, a "data.table" containing one block (except for the first field) is passed to FUN as the first argument, and a "character" vector with the name of the block is passed as the second argument. Additional arguments can be specified by the user, much as in the apply() family. It is up to the user to write FUN so that it satisfies these specifications.

On the other hand, if FUN returns something other than a "data.table", then by() is used and the result is returned as a list where each element corresponds to a block. In this case, FUN must take at least one argument: at evaluation time, the whole block, including the first field, is passed to FUN. Once again, it is up to the user to write or find a function that acts on each block and satisfies this specification, and additional arguments can be passed by the user as in the previous case.

Due to the mutual independence of the blocks, the processing of a file can be easily parallelised (the parallelisation is only possible on *nix operating systems); indeed, all the functions in fplyr support the parallel argument to specify the number of workers to be initialised. In particular, we adopted iotools’ pipeline parallelism (arnold2015iotools), where the master process reads from and writes to the disk sequentially, but each chunk is then relegated to the workers, which process it in parallel as described previously; while the workers are busy, the next chunk is pre-fetched by the master. The sequential reading and writing avoids conflicts (for instance if multiple processes try to write to the same file at the same time), but can result in bottlenecks if the processing of each chunk takes an amount of time that differs too much from the time needed for the reading and writing. Furthermore, some workers can become idle if different chunks require markedly different processing times.
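For example, using the earlier summary() example (f again holds the path to the input file):

# Process the file with four worker processes (on a *nix system)
res <- flply(f, summary, parallel = 4)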

For convenience, we also implemented two additional functions which may be useful when even the result of the processing of the file would not fit into the RAM. In these cases, it is not possible to simply return everything at the end. If the output of FUN is a "data.table", one solution could be to print the resulting data for each block to an output file as soon as it is ready, proceeding to append the results of all the blocks to the same file. This strategy is implemented in the ffply() function. The last function, fmply(), combines the behaviours of ffply() and flply(): for each block, it is possible to write to one or more files and, at the same time, to return an arbitrary R object.

0.4 Examples

To illustrate the usefulness of the package, we shall now discuss some examples. Throughout this section, we assume that the path to the file of interest is stored in a variable called fin, while the path to an output file is stored inside fout.

Tabular data with multiple measurements can be represented in two ways: the long and the wide format (wickham2007reshaping). These two representations are equivalent in that data can always be converted from one to the other, but when the data are in long format, they are ‘tidy’ and easier to analyse (wickham2014tidy), and some R functions require their input to be in this format. However, if some big file is only available in wide format, reshaping it could become a problem due to the scarcity of RAM. Indeed, even if the wide-formatted file fits into the memory, the long one may not. With fplyr the reshaping can be performed block by block:

ffply(fin, fout, function(d, by) melt(d, measure.vars = names(d)))

Here the by argument, which contains the subject ID, was ignored, but in principle it can be used inside an if condition to select only some of the blocks. The d argument contains one whole block without the first column, and it is reshaped using data.table’s melt() function. The first column, containing the IDs, will be automatically added after FUN has been applied.
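For instance, a sketch of such a filter (the ID prefix "chr1" is purely hypothetical, and we assume that, as in data.table grouping, a NULL return simply produces no output rows for that block):

ffply(fin, fout, function(d, by) {
    if (!startsWith(by, "chr1")) return(NULL)  # skip the blocks we are not interested in
    melt(d, measure.vars = names(d))
})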

Sometimes a group of observations must be analysed in many ways, producing different outputs. For instance, when performing a linear regression by blocks, it may be convenient to print the coefficients to an output file, while at the same time returning the full "lm" object for future in-depth inspection. The fmply() function takes as arguments the path to the input file, a vector of paths to the (possibly many) output files, and a function that returns a list of objects. If there are, say, d output files, the first d elements of the list must be "data.table"s (or "data.frame"s) and are printed to the corresponding output files; optionally, FUN can return d+1 elements, in which case the last one is returned by fmply(). In this example, the coefficients are written to fout, and the "lm" objects themselves are returned at the end. In particular, l will be a list where each element is an "lm" object corresponding to one block.

l <- fmply(fin, fout, function(d) {
    lm.fit <- lm(Y ~ ., data = d[, -1])
    # Add the name of the block as the first field
    lm.coef <- as.data.table(cbind(d[1, 1], t(coef(lm.fit))))
    # The coefficients will be printed, the fitted object will be returned
    return(list(lm.coef, lm.fit))
})

Additional examples can be found in the package vignette.

0.5 The names of the functions

Although the name of this package is reminiscent of packages belonging to the tidyverse (wickham2019welcome), fplyr bears no relation to them. We did, however, follow the naming conventions of one of Hadley Wickham’s packages (wickham2011split). All the names consist of two letters followed by ‘ply’: the first letter represents the type of input, the second letter characterises the type of output, and the final ‘ply’ clinches the relation with the existing ‘apply’ family of functions. The first letter is usually ‘f’, because the input is the path to a file. The second letter is ‘l’ if the output is a list, as in flply(); it is ‘t’ if the output is a "data.table", ‘f’ if it is another file, and ‘m’ if it can be multiple things.

0.6 Discussion

We believe that fplyr fills a gap in the landscape of the existing tools to process big files in R. It addresses a problem that in principle could be solved by other packages as well, but only with workarounds. Furthermore, our implementation combines the strengths of two other packages, iotools and data.table, and is therefore reasonably efficient. A variety of features, such as the transparent parallelisation, the ability to read compressed files, and the possibility to specify the maximum number of blocks to read, also make it user-friendly.

There are, however, also some limitations. First and foremost, the file on which fplyr operates must contain observations which can be assigned to several independent ‘subjects’, and the subject IDs can only be in the first field, the reason being that this is also how iotools works. Although it is possible to pre-process the file with *nix command-line tools such as awk and sort to ensure that the IDs are in the first column and that the rows referring to the same subject are adjacent, possible future work could extend the package in order to support a custom field for the subject IDs. A related extension could allow the blocks to be defined by the combined values of two or more columns.

Another possible weakness is the pipeline parallelism algorithm, which, besides not being available on Windows-based operating systems, can cause bottlenecks if the time required to process a block and the time required to read it are on very different scales. Nevertheless, this algorithm has been shown to be more efficient in a variety of common situations (arnold2015iotools).

In summary, the main strength of the package is that it simplifies the task of splitting, applying and combining for files too big to fit in the available memory.

0.7 URLs

The package is developed on GitHub: https://github.com/fmarotta/fplyr

0.8 Acknowledgements

The author would like to thank Prof. Paolo Provero (Università degli Studi di Torino) for his insightful comments and suggestions.

Bibliography