In recent decades, digital data transfer has become available everywhere and to everyone. This growth in digital data drives the need for new data compression techniques and for improvements to existing ones. Run Length Encoding (abbreviated as RLE) is a simple coding scheme that performs lossless data compression. It represents each maximal sequence of consecutive identical symbols of a string as a run, usually denoted by $\sigma^i$, where $\sigma$ is an alphabet symbol and $i$ is its number of repetitions. For example, the string aaaabbaaabbbb consists of the four runs $a^4 b^2 a^3 b^4$. In the standard RLE compression scheme, the number of bits reserved to encode the length of a run is fixed. Each run is encoded by a fixed number of bits storing the binary representation of the length of the run, followed by the binary encoding of the letter of the run (which usually also has some fixed length). Some strings, like aaaabbbb, achieve a very good compression rate because the string contains only two different characters, each of which repeats more than twice. Hence, with 8 bits each for the run length and the letter, its RLE representation $a^4 b^4$ can be stored in 4 bytes, instead of the 8 bytes needed for the original string in ASCII or UTF-8. On the other hand, if the input consists of highly mixed characters with few or no repetitions at all, like abababab, the RLE representation of the string is $a^1 b^1 a^1 b^1 a^1 b^1 a^1 b^1$, which needs 16 bytes with the same field widths. Thanks to its simplicity, RLE is still used in several areas such as fax transmission, where RLE compression is combined with other techniques into Modified Huffman Coding and applied to binary images. As most fax documents are simple texts on a white background, RLE compression is particularly suitable for fax and often achieves good compression ratios.
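The byte counts in the two examples above can be checked with a minimal sketch of byte-level RLE. The function names and the default field widths of 8 bits each are assumptions made for illustration, and runs longer than the length field allows are not handled.

```python
from itertools import groupby


def rle_runs(text):
    """Return the maximal runs of `text` as (symbol, length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(text)]


def rle_size_bytes(text, length_bits=8, symbol_bits=8):
    """Size in bytes of a fixed-width RLE encoding of `text`.

    Assumes every run length fits into `length_bits` bits.
    """
    bits = len(rle_runs(text)) * (length_bits + symbol_bits)
    return (bits + 7) // 8  # round up to whole bytes


print(rle_runs("aaaabbaaabbbb"))
print(rle_size_bytes("aaaabbbb"), rle_size_bytes("abababab"))
```

With these widths, two runs cost 4 bytes and eight runs cost 16 bytes, which is the doubling that the worst case refers to.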
However, RLE also has a major downside: the output can grow considerably when the input string lacks repetitions. Expanding the string to twice its original size is a rather undesirable worst-case behavior for a compression algorithm, so one has to make sure the input data is suited to RLE as a compression scheme. In this work, we present a combination of preprocessing techniques that increases the average compression ratio of the RLE compression scheme on arbitrary input data. The main idea is to consider a bit-wise representation of the data and to read consecutively all bits that have the same position in a byte. We combine this approach with dynamic byte remapping and a Burrows-Wheeler-Scott transform (BWST for short) to increase the average run length on the bit level. We show experimentally that, with the help of such preprocessing, RLE can compress arbitrary files from different corpora. The proposed algorithm is even comparable to the popular compression scheme ZIP, and files suited to regular RLE are compressed even further than with the original method. To unify the measurements, the relative file size after compression is calculated by encoding all files listed in the Canterbury and Silesia Corpus individually. Since most of these improvements, such as permuting the input with a reversible BWST to increase the number of consecutive identical symbols or reading the byte stream in a different order, take considerable time, encoding and decoding speed decrease with increasing preprocessing effort compared to regular RLE.
This work is structured as follows. In the next section, we discuss the literature on RLE after giving some preliminaries. Then, we discuss our proposed technique in more detail and evaluate it in comparison with the standard RLE compression scheme and ZIP v3.0 afterwards.
Throughout this work, we assume $\Sigma$ to be a finite alphabet. A string $w$ is a sequence $c_1, \dots, c_n$ of letters $c_i \in \Sigma$, $1 \le i \le n$. The set of all such sequences is denoted by $\Sigma^*$, which is the free monoid over $\Sigma$, with concatenation as operation and with the empty word $\varepsilon$ as neutral element. In standard text representation, the string $w$ is coded as an array $S_w$ of $n$ blocks of bit-strings, each of size 8, that can be read and written at arbitrary positions, and where the $i$-th block of $S_w$ contains the binary representation of the $i$-th character of $w$. In the following, our algorithm works on a byte alphabet, i.e., 8 bits are assumed to encode one input symbol. For the examples discussed later, this byte alphabet is realized as a UTF-8 encoding. The vertical interpretation, also called Bit-Layers text representation, codes the array $S_w$ as an ordered collection of 8 binary strings of length $n$, $B_7, \dots, B_0$, where the $j$-th binary string $B_j$ is the sequence of bits at position $j$ of the blocks in $S_w$ encoding characters in $w$, in the order in which they appear in $w$, where $j = 0$ refers to the least significant bit. Let $c$ define a compression scheme. For a string $w$, let $m_w$ be the number of bytes in the UTF-8 encoding of $w$. We define the number of bits per symbol (bps) of $w$ under $c$ as $|c(w)| / m_w$, where $|c(w)|$ is the number of bits in the compressed output.
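As a small illustration of the bps measure, the following sketch assumes that bps is the number of compressed bits divided by the number of bytes of the UTF-8 input. The function names are hypothetical.

```python
def bits_per_symbol(compressed_size_bytes, original_size_bytes):
    """Compressed bits spent per input byte (assumed reading of bps)."""
    return 8 * compressed_size_bytes / original_size_bytes


def relative_size_percent(compressed_size_bytes, original_size_bytes):
    """Compressed size as a percentage of the original size."""
    return 100 * compressed_size_bytes / original_size_bytes


# An 8-byte input compressed to 4 bytes uses 4 bits per symbol (50%).
print(bits_per_symbol(4, 8), relative_size_percent(4, 8))
```

Under this reading, bps is simply the relative file size scaled by 8, so a file that does not shrink at all stays at 8 bps.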
3 Combination with other compression methods
Combining different techniques to achieve a better compression rate has already been discussed in other papers, with compression ratios not much worse than the theoretical limit of around 1.5 bps. For example, Burrows and Wheeler used their transform in combination with a Move-to-Front coder and a Huffman coder. Standard compression algorithms such as bzip2 also use a combination of transforms: by default, bzip2 applies an RLE, then a Burrows-Wheeler Transform, followed by a Huffman encoding. Via parameters it is also possible to enable a second run length encoding on the character level between the latter two phases. In contrast to our approach, both RLEs operate on a sequential horizontal byte level and not on a vertical binary level.
Generally, a combined approach would no longer be considered preprocessing, but it clearly has some benefits over encoding regular RLE runs with a fixed size. The fax transmission implementation also uses RLE and Huffman coding together. While the idea of encoding the RLE runs with Huffman codes is already known and analyzed, it is mostly used in a static sense and optimized for special-purpose applications such as fax transmission and DNA sequences [1, 19]. However, the vertical byte reading enables new approaches, even more so in combination with byte remapping, and becomes applicable to more than just binary fax or DNA sequences, yielding longer runs on average for many kinds of data. As our evaluation shows, our technique makes nearly every type of input data suitable for RLE.
4 Proposed technique
The binary representation of an arbitrary string does not usually contain long runs of repeating bits. However, first reading the most significant bits of all bytes, then all second most significant bits, and so on, results in much longer average runs of the same bit value. This is partially explained by ASCII artifacts but also by the small Hamming distance of the binary string representations of most letters, as they all have a value between 65 and 122 in the UTF-8 encoding. This improvement in average run length can be enhanced further by mapping the byte values of the input to lower values in relation to their occurrence probability. To further improve the algorithm, we use a variable length code for encoding the RLE runs instead of a fixed size. This way, the proposed algorithm can compress arbitrary files with a reasonable compression ratio and even improves on regular RLE for files highly suited to the original algorithm.
The proposed technique is depicted in Figure 1. In the first step, the uncompressed byte array is analyzed and, for each byte value, its number of occurrences is counted. In parallel, a bijective Burrows-Wheeler-Scott Transform is applied to the input byte array, which produces a reversible permutation of the input byte array with long repetitions of similar symbols. Afterwards, each byte is remapped, where the most frequent byte values are mapped to the lowest binary values. The resulting byte array is then interpreted vertically: first the most significant bits of all bytes are read, then all second most significant bits, and so on, resulting in long average runs of identical bits. On this representation, a run length encoding is applied and the runs are counted to generate a Huffman tree. Using this tree, the runs are output with a variable length code, together with the mappings needed to decompress the file. Next, we discuss each step of our proposed technique in detail. We will sequentially apply each step to the example input string abraca. The binary UTF-8 interpretation of the example string contains 3 runs each of length 3 and 4, 9 runs of length 2, and 9 runs of length 1, covering 48 bits in total.
4.1 Burrows-Wheeler-Scott Transform
Working with arbitrary data implies starting with an uncompressed byte array, which is analyzed by the static analysis component. All occurrences of each byte value are counted and later used in the byte mapping process. In the meantime, a Burrows-Wheeler-Scott transform (BWST for short) is performed on the same uncompressed byte array, using the C library LibDivSufSort. The BWST is an enhancement of the classical Burrows-Wheeler Transform (BWT), which is used in a variety of compression algorithms. In short, the BWT creates all cyclic rotations of the input string and sorts them lexicographically. As the last symbol of a cyclic rotation is the predecessor of its first symbol, identical symbols are clustered together in the last column of the sorted rotations if the input string contains repetitions, e.g., repeating natural words. Then, the last column of all rotations in this sorting is output. So, in general, the BWT increases the number of subsequent identical symbols.
Here, we use the enhanced BWST, which in contrast to the original BWT requires no additional information: neither start and stop symbols nor the index of the original string in the sorting. Briefly, it does not construct a matrix of all cyclic rotations like the originally proposed BWT; instead, it is computed with a suffix array sorted with DivSufSort, which is the fastest currently known method of constructing the transform, working in linear time and space. Since we do not alter the BWST algorithm and only use an existing library as a building block in our preprocessing pipeline, we refer to the literature for more algorithmic details on BWST. Applying the BWST to the example input string yields a permutation of its bytes, which is the input to the next step.
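The rotation-sorting description of the classical BWT can be sketched as follows. This is the naive textbook construction, not the BWST and not the suffix-array approach used via LibDivSufSort. It needs quadratic memory and, unlike the BWST, must return the index of the original string for inversion. The function name and the input "banana" are illustrative assumptions.

```python
def naive_bwt(text):
    """Classical BWT by explicitly sorting all cyclic rotations.

    Returns the last column of the sorted rotations and the row index
    of the original string, which the classical BWT needs to invert
    the transform. For illustration only.
    """
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    last_column = "".join(rotation[-1] for rotation in rotations)
    return last_column, rotations.index(text)


print(naive_bwt("banana"))  # ('nnbaaa', 3)
```

In the output, the three occurrences of a and the two occurrences of n end up adjacent, which is the clustering effect the later RLE step benefits from.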
4.2 Dynamic Byte Remapping
Next, we apply a dynamic byte remapping to the input data, where the most frequently used bytes are mapped to the lowest values. This way, the values no longer alternate over the whole range of 0 to 255, or between 65 and 122 for regular text, but rather over a smaller subset, and the most frequent ones become the smallest values. Hence, considering only the most significant bits of each byte, the number of consecutive zeros should increase, yielding longer average runs of RLE on a vertical byte reading. Let $\Sigma' \subseteq \Sigma$ be the set of all bytes appearing in the input data. Then, let $f$ be the function on $\Sigma'$ applying the dynamic byte remapping. Considering our example string, the most frequent letter is a, followed by b, c, and r, which appear once each. By fixing an order on $\Sigma'$ we get the byte remapping function $f(a) = 0$, $f(b) = 1$, $f(c) = 2$, and $f(r) = 3$. Applying $f$ to the transformed string replaces each byte by its mapped value.
For huge input files, splitting the input and creating a single map for each block of data should result in lower average values used but also creates some kind of overhead because the mapping has to be stored in the encoded file as well. Applying a single mapping to lower values for the whole file still results in increased runs in the vertically interpreted bytes and is used in our approach.
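A single whole-file mapping of this kind can be sketched as follows. Breaking ties between equally frequent bytes by byte value is an assumption made here to fix an order, and the function names are hypothetical.

```python
from collections import Counter


def build_byte_map(data):
    """Map each occurring byte to a rank: most frequent byte -> 0.

    Ties are broken by byte value (an assumed fixed order).
    """
    counts = Counter(data)
    ordered = sorted(counts, key=lambda b: (-counts[b], b))
    return {byte: rank for rank, byte in enumerate(ordered)}


def remap(data, byte_map):
    """Replace every byte by its rank."""
    return bytes(byte_map[b] for b in data)


def unmap(data, byte_map):
    """Invert `remap` using the stored mapping."""
    inverse = {rank: byte for byte, rank in byte_map.items()}
    return bytes(inverse[b] for b in data)


mapping = build_byte_map(b"abraca")
print(mapping, list(remap(b"abraca", mapping)))
```

After remapping, only the two lowest bit positions of the example bytes can be set, so the six upper bit positions consist of zeros only. The mapping has to be stored with the output so that `unmap` can restore the original values.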
4.3 Vertical Byte Reading
Reading all most significant bits of all bytes, then the second most significant bits of all bytes and so on greatly improves the average run length on a bit level for most types of files as shown in the example below.
Recall the binary UTF-8 interpretation of the example string abraca, with 3 runs each of length 3 and 4, 9 runs of length 2, and 9 runs of length 1 in total. The vertical byte reading codes the string as an ordered collection of 8 binary strings of length $n$, where the $j$-th binary string $B_j$ is the sequence of bits at position $j$ of the bytes in the string, in the order in which they appear, where $j = 0$ refers to the least significant bit. We refer to the concatenated bit vectors $B_7 \dots B_0$ induced by such a representation as the vertical representation of the encoding. Formally, for the binary encoding of the example we have a = 01100001, b = 01100010, c = 01100011, and r = 01110010. Hence, the vertical representation of the string is $B_7 = 000000$, $B_6 = 111111$, $B_5 = 111111$, $B_4 = 001000$, $B_3 = 000000$, $B_2 = 000000$, $B_1 = 011010$, $B_0 = 100111$.
Performing RLE on the consecutive bits of each of the eight bit vectors results in 5 runs of length 6, 2 runs of length 3, 3 runs of length 2, and just 6 runs of length 1, as opposed to the many short runs of the simple interpretation. This is due to the binary similarity between the characters used, as the codes of several of them differ in only one bit. Clearly, a different way of reading the input does not by itself compress the data; instead, it enables a better application of existing compression methods. This approach can also be generalized to alphabets of arbitrary size. By shrinking the alphabet to the code words actually used, the number of bit vectors can be reduced.
Now, let us continue with our toy example and apply the vertical byte reading to the remapped string from the last step. Compared with the above vertical representation of the initial string, the resulting vertical representation highlights the impact of the dynamic byte remapping step.
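The vertical byte reading can be sketched in a few lines. The input abraca is used here as an assumed stand-in for the running example, and the function names are hypothetical. The sketch builds each bit vector as a string for readability, not for speed.

```python
from collections import Counter
from itertools import groupby


def vertical_bit_vectors(data):
    """Return 8 bit strings, most significant bit position first."""
    return ["".join(str((byte >> pos) & 1) for byte in data)
            for pos in range(7, -1, -1)]


def run_length_histogram(bit_strings):
    """Count how often each run length occurs, per bit string."""
    histogram = Counter()
    for bits in bit_strings:
        histogram.update(len(list(group)) for _, group in groupby(bits))
    return dict(histogram)


data = b"abraca"
horizontal = "".join(format(byte, "08b") for byte in data)
print(run_length_histogram([horizontal]))
print(vertical_bit_vectors(data))
print(run_length_histogram(vertical_bit_vectors(data)))
```

For this input the horizontal reading produces 24 runs, while the vertical reading produces 16, five of which span a whole bit vector.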
4.4 Run Length Encoding
Continuing with the example, performing RLE on the consecutive bits of the vertical representation results in 1 run of length 36, 1 of length 5, 1 of length 2, and 5 runs of length 1. In general, the binary RLE simply counts alternating runs of ones and zeros and encodes the length of each run into a fixed length code with a fixed number of bits. Assuming that the first run always consists of zeros and that the maximum run length is determined by the length of the code, we add an artificial run of length 0 to flag a run exceeding the maximum run length or an input starting with 1. This way we can encode any binary string. Some experiments with different default maximum run lengths showed improvements in performance but also revealed some shortcomings. Refining the algorithm to use different maximum run lengths for the different bit vectors did improve, but not solve, the issue of being a very static solution. It is possible to choose maximum run lengths that work more efficiently for a specific file or that are adequate for a range of files, but it is always a trade-off. Eventually, a variable length code for encoding the runs was needed, so the algorithm is combined with another compression method, namely Huffman encoding. The maximum run length is limited to 255 in order to limit the size of the Huffman tree and therefore the average prefix length. This gives us the RLE representation of the example as a sequence of run lengths.
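The alternating binary RLE with the artificial zero-length run can be sketched as follows. This is one possible reading of the scheme described above, not the reference implementation. The function names are hypothetical, and bit strings are used instead of a packed bit stream for clarity.

```python
def binary_rle_encode(bits, max_run=255):
    """Encode a '0'/'1' string as alternating run lengths.

    The first run counts zeros. A run of length 0 is emitted whenever
    the next bit does not match the expected value, i.e. when the
    input starts with '1' or a run was cut off at `max_run`.
    """
    runs = []
    expected = "0"
    i = 0
    while i < len(bits):
        if bits[i] != expected:
            runs.append(0)  # artificial empty run flips the expected bit
        else:
            j = i
            while j < len(bits) and bits[j] == expected and j - i < max_run:
                j += 1
            runs.append(j - i)
            i = j
        expected = "1" if expected == "0" else "0"
    return runs


def binary_rle_decode(runs):
    """Rebuild the bit string from alternating run lengths."""
    out, bit = [], "0"
    for length in runs:
        out.append(bit * length)
        bit = "1" if bit == "0" else "0"
    return "".join(out)


print(binary_rle_encode("1100"))                # [0, 2, 2]
print(binary_rle_encode("0000000", max_run=3))  # [3, 0, 3, 0, 1]
```

A bit string with 36 leading zeros followed by 101000001001 encodes to the run lengths 36, 1, 1, 1, 5, 1, 2, 1, which matches the run statistics reported for the running example.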
4.5 Huffman Encoding of RLE runs
While the RLE is performed with a fixed maximum run length of 255 to limit the number of Huffman codes to be generated, the occurrences of each run length are counted. After the RLE step is finished, a Huffman tree for the runs is generated and each run is encoded with the corresponding prefix-free code of variable length. This further reduces the space required to encode the file, but a representation of the Huffman tree also needs to be persisted in order to reverse the variable length coding. For ease of decoding, a map from each run length to the pair of prefix length and prefix is generated. Finally, the size of the map, followed by the map itself, is written to the stream. For the runs of the running example, the resulting Huffman code yields a final encoded output of 13 bits.
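A Huffman code over run lengths can be sketched with a binary heap as follows. The run sequence used below is an assumed ordering that matches the run statistics of the running example (one run each of length 36, 5, and 2, and five runs of length 1). The function name is hypothetical, and the actual prefixes depend on tie-breaking, while the total encoded length does not.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(symbols):
    """Return a prefix-free code {symbol: bit string} for `symbols`."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tiebreak = count()  # unique counter so dicts are never compared
    heap = [(n, next(tiebreak), {sym: ""}) for sym, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, codes1 = heapq.heappop(heap)
        n2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
    return heap[0][2]


runs = [36, 1, 1, 1, 5, 1, 2, 1]  # assumed order of the example runs
codes = huffman_codes(runs)
encoded = "".join(codes[r] for r in runs)
print(codes, len(encoded))
```

The most frequent run length 1 receives a one-bit prefix and the three rare run lengths share the remaining 8 bits, giving 13 bits for the payload. The stored prefix map is not included in this count.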
The decoding happens in three phases. First, the size of the byte map is parsed to know how many pairs of bytes are expected. In the second phase, the map of Huffman prefixes is parsed and the number of expected pairs is determined. Since each pair in the Huffman map consists of the byte which is mapped, the length of the prefix and the prefix itself, it is easy to decode each mapping from the stream. After both required maps are parsed, the compressed content follows. The following stream is read bit-wise to be able to match any bit sequence of variable length to the related Huffman code and decode it into a number of runs. Reversing RLE from the decoded runs recreates the bit vectors which are written to the output file. Finally, the byte mapping parsed in phase 1 is applied to the file and the bijective BWST is inverted, restoring the original input data.
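The step that turns the decoded bit vectors back into bytes can be sketched as follows. The sketch assumes the eight bit vectors are available as strings ordered from the most to the least significant bit position and rebuilds the whole byte array in memory. The function name is hypothetical.

```python
def bytes_from_bit_vectors(vectors):
    """Rebuild bytes from 8 bit strings (most significant position first)."""
    length = len(vectors[0])
    out = bytearray(length)
    for pos, vector in zip(range(7, -1, -1), vectors):
        for i, bit in enumerate(vector):
            if bit == "1":
                out[i] |= 1 << pos  # set bit `pos` of the i-th byte
    return bytes(out)


vectors = ["000000", "111111", "111111", "001000",
           "000000", "000000", "011010", "100111"]
print(bytes_from_bit_vectors(vectors))  # b'abraca'
```

In the full pipeline, the inverse byte mapping and the inverse BWST would then be applied to this byte array.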
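| file | size [kB] | plain RLE: size [kB] | plain RLE: rel. size [%] | plain RLE: [bps] | proposed: size [kB] | proposed: rel. size [%] | proposed: [bps] | improvement [%] |
|---|---|---|---|---|---|---|---|---|
| average values per file | | - | 337.06 | 26.95 | - | 42.34 | 3.40 | 83.18 |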
To evaluate the effectiveness of the proposed technique, a collection of files from the Canterbury Corpus and the Silesia Corpus (containing medical data) was compressed. All file sizes are given in kB (kilobytes). The relative file sizes after compression are listed in Tables 1 and 2. As another unit of measure, the bps (bits per symbol) is also shown in the tables. Plain RLE on a bit level with a maximum run length of 255, encoded in 8 bits, showed good results on the file ptt5, an ITU-T standard fax showing a black and white picture. This fits our expectations, since RLE was designed for this type of file. On this file, simple RLE achieved a relative file size of 26% compared to the original size, which corresponds to about 2.1 bits per symbol. In contrast, over all files contained in the Canterbury Corpus, the plain bit level RLE increases the file size by a factor of about 3.37 on average (a relative file size of 337.06%).
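| file | size [kB] | ZIP: size [kB] | ZIP: rel. size [%] | ZIP: [bps] | proposed: size [kB] | proposed: rel. size [%] | proposed: [bps] | improvement [%] |
|---|---|---|---|---|---|---|---|---|
| average values per file | | - | 37.48 | 3.00 | - | 39.47 | 3.16 | -7.31 |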
In contrast, our presented technique, consisting of a combination of preprocessing steps and a Huffman encoding of the RLE runs, achieved, with a relative file size of 40.8% on average, comparable results to the state of the art for both corpora. Already suited files, like the file ptt5 from the Canterbury Corpus, were compressed even further than with plain bit level RLE.
For comparison, ZIP v3.0 using a combination of the dictionary technique LZ77 and Huffman codes, is listed. All zip compressions were executed with zip -evr $file. For instance, ZIP achieves an average relative file size of 37.5% on the single files in the Silesia Corpus, where our algorithm achieves 39.5%.
| file type | ZIP: rel. size [%] | ZIP: [bps] | proposed algorithm: rel. size [%] | proposed algorithm: [bps] | improvement [%] |
In a second evaluation, a randomly chosen collection of raw image files and 3D-object files was compressed with the proposed algorithm and with ZIP in version 3.0. The average relative file sizes are listed in Table 3; all files were compressed individually. For large raw picture files like .PPM and .PGM from the Rawzor corpus, as well as a random collection of .DNG files from raw.pixls.us, a higher compression ratio than that obtained by ZIP could be achieved. 3D-object files in the encoding formats .obj, .sty, and .ply are also compressed by our algorithm to a size comparable but inferior to the output produced by ZIP. This shows that, with our approach, run length encoding can become a suitable compression algorithm for more than just palette-based images like fax transmissions.
The implementation is hosted on Bitbucket and released under the MIT license. The source code and the test data can be found in the project repository. All source code is written in Kotlin and runs on any Java virtual machine, but performs best when executed on the GraalVM.
All benchmark tests were performed on a system running Linux Pop OS with a 5.6.0 kernel and an AMD Ryzen 5 2600X six-core processor (12 threads) with a 3.6 GHz base clock and a 4.2 GHz boost clock speed. The system had 16 GB of 3200 MHz RAM, and a Samsung EVO SSD was used for persistent storage.
Encoding is reasonably fast, with a measured 7.1 seconds, but decoding is rather slow, with 16.7 seconds for the whole Canterbury Corpus. Avoiding internal operations and large or complex data structures to hold all the input data, or even collecting the values of the same significance in memory into byte arrays, greatly improved the time performance of the described algorithm. There is still some potential for performance optimization and parallelization. In theory, all 8 runs could be created at the same time by reading the input as a byte stream, which would vastly improve overall encoding speed compared to the currently used library for handling the binary stream. Extracting bit values only by bit shifting operations, instead of relying on an external library, might also improve reading speed. The main reason for the gap between encoding and decoding speed is most likely the repeated writing to the output file: each bit position has to be decoded separately, resulting in up to 8 write accesses to a single byte. This could be resolved by first reconstructing the output in memory and only writing the file to disk once.
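The suggested single-pass extraction with bit shifts can be sketched as follows. This illustrates the idea only and is not the benchmarked Kotlin implementation. The function name is hypothetical, and runs are kept as (bit, length) pairs per bit layer for readability.

```python
def vertical_runs_single_pass(data):
    """Collect the runs of all 8 bit layers in one pass over `data`.

    Layer 0 is the most significant bit. Each layer yields a list of
    (bit value, run length) pairs.
    """
    runs = [[] for _ in range(8)]
    current = [None] * 8  # bit value of the open run per layer
    length = [0] * 8      # length of the open run per layer
    for byte in data:
        for layer in range(8):
            bit = (byte >> (7 - layer)) & 1
            if bit == current[layer]:
                length[layer] += 1
            else:
                if length[layer]:
                    runs[layer].append((current[layer], length[layer]))
                current[layer], length[layer] = bit, 1
    for layer in range(8):  # flush the runs still open at the end
        if length[layer]:
            runs[layer].append((current[layer], length[layer]))
    return runs


for layer_runs in vertical_runs_single_pass(b"abraca"):
    print(layer_runs)
```

Because every layer keeps only its open run, the input never has to be materialized as eight separate bit vectors, and the eight layers could in principle be handled by independent workers.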
8 Conclusions and future work
In conclusion, we demonstrated that, with the help of different preprocessing steps and a different encoding technique, RLE can achieve compression results comparable to modern methods. Not only is there a reasonable compression for every file in the different corpora, which contain a huge variety of data types, but files highly suited to the originally proposed RLE were also compressed even better. With 42.34% on average on files in the Canterbury Corpus, the relative file size after compression of our RLE based technique is only a few percentage points behind commonly used algorithms, e.g., gzip with 31.8% or ZIP with 32.67%, and even slightly better than compress with 43.21%. On raw image files like .PGM, .PPM, or .DNG, where compression is desired to be lossless, our algorithm even achieves significantly better compression ratios than ZIP. Despite the discussed potential for improvement, our implementation demonstrates that the discussed preprocessing steps improve the applicability of RLE to arbitrary input data.
One interesting approach not performed in this scope is the encoding of Huffman codes after a byte-wise RLE instead of a vertical RLE. It was assumed to perform worse than the vertical encoding because there has to be one code for every combination of runs and values, thus very long average Huffman codes are expected. Another idea is the substitution of Huffman encoding by another, more sophisticated method like Asymmetric Numeral Systems . This would most likely further improve compression results at the expense of slower computation.
Acknowledgment: The second author is supported by Deutsche Forschungsgemeinschaft project FE 560/9-1.
-  (2017) Toward a Better Compression for DNA Sequences Using Huffman Encoding. Journal of Computational Biology 24 (4), pp. 280–288. Cited by: §3.
-  (1997) A Corpus for the Evaluation of Lossless Compression Algorithms. See Proceedings of the 7th data compression conference, Storer and Cohn, pp. 201–210. Cited by: §6.
-  (1994) A Block-Sorting Lossless Data Compression Algorithm. Cited by: §3, §4.1.
-  (2020) Bit-Layers Text Encoding for Efficient Text Processing. In SOFSEM (Doctoral Student Research Forum), CEUR Workshop Proceedings, Vol. 2568, pp. 13–24. Cited by: §2, §4.3.
-  Collection of Various RAW Files. Note: https://raw.pixls.us/ Accessed: 2020-10-14 Cited by: §6.
-  (1992) An introduction to computer images. DEscription Language for TAxonomy (DELTA) Newsletter 7 (10), pp. 1–3. Cited by: §1.
-  Silesia Compression Corpus. Silesia University. Note: http://sun.aei.polsl.pl Published by Silesia University; Accessed: 2020-10-15 Cited by: §6.
-  (2013) Asymmetric Numeral Systems as Close to Capacity Low State Entropy Coders. CoRR abs/1311.2540. External Links: Cited by: §8.
-  RLE Preprocessing. Note: Released: 2019; Last Update: 2020-10-19; Version: v1.1 https://bitbucket.org/fierg1/dcc-algorithm/src Cited by: §7.
-  (2017) Dismantling DivSufSort. See Proceedings of the prague stringology conference, Holub and Zdárek, pp. 62–76. Cited by: §4.1.
-  (2012) A Bijective String Sorting Transform. CoRR abs/1201.3077. External Links: Cited by: §4.1, §4.1, §4.
-  J. Holub and J. Zdárek (Eds.) (2017) Proceedings of the prague stringology conference. Department of Theoretical Computer Science, Faculty of Information Technology, Czech Technical University in Prague. Cited by: 10.
-  (1952) A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE 40 (9), pp. 1098–1101. Cited by: §4.5.
-  (1980) International Digital Facsimile Coding Standards. Proceedings of the IEEE 68 (7), pp. 854–867. Cited by: §1, §3.
-  J. Karlgren, J. Tarhio, and H. Hyyrö (Eds.) (2009) String processing and information retrieval, 16th international symposium, SPIRE. Lecture Notes in Computer Science, Vol. 5721, Springer. Cited by: 24.
-  (1963) On Tables of Random Numbers. Sankhyā: The Indian Journal of Statistics, Series A, pp. 369–376. Cited by: §3.
-  M. Koricic, Z. Butkovic, K. Skala, Z. Car, M. Cicin-Sain, S. Babic, V. Sruk, D. Skvorc, S. Ribaric, S. Gros, B. Vrdoljak, M. Mauher, E. Tijan, P. Pale, D. Huljenic, T. G. Grbac, and M. Janjic (Eds.) (2019) 42nd international convention on information and communication technology, electronics and microelectronics, MIPRO. IEEE. Cited by: 29.
-  (2019) IOStreams for Kotlin, library. Bitbucket. Note: Version: 0.33 https://bitbucket.org/akornilov/binary-streams/downloads/binary-streams.pdf Cited by: §7.
-  (2019) GenPress: A Novel Dictionary Based Method to Compress DNA Data of Various Species. See Intelligent information and database systems - 11th asian conference, ACIIDS, Nguyen et al., pp. 385–394. Cited by: §3.
-  (2001) An Analysis of the Burrows—Wheeler Transform. Journal of the ACM (JACM) 48 (3), pp. 407–430. Cited by: §4.1.
-  (2015) LibDivSufSort, Suffix Sorting Algorithm in C. Note: Version: 2.0.1 https://github.com/y-256/libdivsufsort Cited by: §4.1.
-  (2007) Compressed Full-Text Indexes. ACM Computing Surveys 39 (1), pp. 2–61. Cited by: §4.1.
-  N. T. Nguyen, F. L. Gaol, T. Hong, and B. Trawinski (Eds.) (2019) Intelligent information and database systems - 11th asian conference, ACIIDS. Lecture Notes in Computer Science, Vol. 11432, Springer. Cited by: 19.
-  (2009) A Linear-Time Burrows-Wheeler Transform Using Induced Sorting. See String processing and information retrieval, 16th international symposium, SPIRE, Karlgren et al., pp. 90–101. Cited by: §4.1.
-  Rawzor Image Compression Benchmark Test Images. Note: http://www.imagecompression.info/test_images/ Accessed: 2020-10-14 Cited by: §6.
-  (1967) Results of a Prototype Television Bandwidth Compression Scheme. Proceedings of the IEEE 55 (3), pp. 356–364. Cited by: §1, §3, §4.4.
-  bzip2 and libbzip2, version 1.0.8: A Program and Library for Data Compression. Note: https://www.sourceware.org/bzip2/manual/manual.pdf Accessed: 2021-01-13 Cited by: §3.
-  Suggestive Contour Gallery. Note: https://gfx.cs.princeton.edu/proj/sugcon/models/ Accessed: 2020-10-14 Cited by: §6.
-  (2019) Exploring Aspects of Polyglot High-Performance Virtual Machine GraalVM. See 42nd international convention on information and communication technology, electronics and microelectronics, MIPRO, Koricic et al., pp. 1671–1676. Cited by: §7.
-  J. A. Storer and M. Cohn (Eds.) (1997) Proceedings of the 7th data compression conference. IEEE Computer Society. Cited by: 2.