An Analysis into the Performance and Memory Usage of MATLAB Strings

09/26/2021
by   Travis Near, et al.

MATLAB is a mathematical computing environment used by many engineers, mathematicians, and students to process and understand their data. Important to all data science is the managing of textual data. MATLAB supports two textual data containers: (1) cell arrays of characters and (2) string arrays. This research showcases the strengths of string arrays over cell arrays by quantifying their performance, memory contiguity, syntax readability, interface fluidity, and autocomplete capabilities. These results demonstrate that string arrays often run 2x to 40x faster than cell arrays for common string benchmarks, are optimized for data locality by reducing metadata overhead, and offer a more expressive syntax due to their automatic data type conversions and vectorized methods.

1 Introduction

MATLAB has supported text since its inception as a simple array of numeric ASCII values [Moler]. Textual data in programming languages is a simple concept but expands into many implementation challenges regarding its performance and memory layout. This analysis begins by comparing common usages of two different containers of text in MATLAB: cell array and string array. Next, it measures these containers’ performance by running integral string benchmarks: string building, formatting, data type conversions, and concatenation. Beyond performance, this paper also quantifies cell array and string array ease of use and interface conciseness. Then it details the difficult-to-measure metadata allocation of text to shed light on these containers’ internal memory structure. Lastly, this research closes with a glimpse into string compilation and how to beat MATLAB string speed for duration-critical operations.

2 Different text types in MATLAB

2.1 Char vector

Text is composed of characters and each character in MATLAB is two bytes in size. A char vector is a simple horizontal sequence of characters. 'hello', for example, is a char vector. MATLAB denotes char with single quotes.

2.2 Char matrix

A char matrix is an M×N matrix of characters. Unlike with double-precision matrices, sets of characters are rarely perfectly rectangular: almost all collections of text are of differing lengths. The example below demonstrates how to construct a char matrix. Note how the shorter string must be padded with spaces to create a rectangular matrix:

    >> text = ['hello   '; 'everyone']
    text =

      2×8 char array

        'hello   '
        'everyone'

In addition to requiring padding, char matrix is unnatural due to MATLAB’s column-major layout [ColumnMajor]. The 'hello everyone' text above is stored in columns in memory as 'heevlelroy o n e' (odd indexes 1, 3, 5, 7, and 9 hold 'hello', followed by its padding at 11, 13, and 15, while even indexes 2, 4, 6, …, 16 hold 'everyone'). Even though strings in English read left-to-right, data in MATLAB is always stored in vertical columns. Iterating by rows in MATLAB is difficult and yields poor performance due to jumping forward and back across the contiguous data buffer. Given these limitations of char matrix, developers typically use a higher-level container to hold their textual data, such as a cell array of character vectors.
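The column-major ordering described above can be reproduced outside MATLAB. The following is a minimal C++ sketch, not MATLAB's implementation; the helper name column_major is hypothetical. It walks down each column of equal-length rows before moving to the next column, which is the order in which the paragraph above says the characters sit in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Flatten equal-length rows into one buffer in column-major order:
// visit every row of column 0, then every row of column 1, and so on.
// Illustrative helper only; this is not a MATLAB API.
std::string column_major(const std::vector<std::string>& rows)
{
    if (rows.empty()) return {};
    const std::size_t cols = rows.front().size();
    std::string buffer;
    buffer.reserve(rows.size() * cols);
    for (std::size_t c = 0; c < cols; ++c)
        for (const std::string& row : rows)
            buffer += row.at(c); // at() throws if a row is shorter than the first
    return buffer;
}
```

Calling column_major({"hello   ", "everyone"}) returns "heevlelroy o n e", the interleaved sequence quoted above. Reading one row back out of that buffer means touching every second character, which is why row-wise iteration jumps across the buffer.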

2.3 Cell array of character vectors (cellstr)

The classic container to house collections of text in MATLAB is cell array. Although textual data is uniform in type, it is frequently stored in a non-uniform cell array (see §3.5 for data uniformity). A cell array of character vectors is also known as a cellstr (short for cell string). One cell array advantage over char matrix is that it can hold text of differing number of characters. cellstr is created with braces ({}) and single quotes ('). Here is the same 'hello everyone' text stored in a cell array without needing additional padding:

    >> cell_array = {'hello'; 'everyone'}
    cell_array =

      2×1 cell array

        {'hello'   }
        {'everyone'}

2.4 String

In MATLAB, string is a much newer data type compared to char and cell. MATLAB uses double quotes to denote strings (e.g., "world"). In terms of MATLAB syntax, char and string are distinguished only by their quote character: single quote for char and double quote for string. Internally, however, string has a distinct data structure which is conducive to creating powerful string arrays.

2.5 String array

MATLAB introduced string array in R2016b to help supersede cellstr [Cellstr]. Like cell array, string arrays allow holding strings of differing lengths. String arrays are created using brackets ([]) and double quotes ("):

    >> string_array = ["hello"; "everyone"]
    string_array =

      2×1 string array

        "hello"
        "everyone"

Cosmetically, string array and cellstr are similar, but analytically the differences are huge. Section 3 analyzes their differences in execution speed, memory usage, data uniformity, and clarity.

3 Cell array of character vectors vs. string array

The difference between cellstr and string array is best introduced through an example. Consider this exercise which will first be solved using cellstr then later using string array:

Using MATLAB, generate a sequence of the strings "TestResult1", "TestResult2", …, "TestResult1000".

3.1 Cell array solution

Using a cell array of character vectors, here is a MATLAB one-liner to generate the "TestResult#" strings:

    >> c = arrayfun(@(idx) ['TestResult', num2str(idx)], 1:1000, 'UniformOutput', false);

This complex line has much to dissect:

  • arrayfun to expand to 1000 elements

  • num2str to convert each numeric index to its ASCII char counterpart

  • Brackets ([]) to concatenate 'TestResult' with the converted index

  • UniformOutput set to false to create a cell array (char vectors of differing number of elements are not uniform, contrasted with a char matrix which is uniform, see §3.5)

Much of this syntax is unique to MATLAB. Experienced MATLAB users understand these details, but new users who have a Python or non-development background might find them intimidating: What is arrayfun? Why does num2str return char instead of string? What is UniformOutput? …and why don’t I want it?

3.2 String array solution

The MATLAB expression below also creates the "TestResult#" 1000-element sequence in one line of M-code, but this time using double quoted string arrays:

    >> s = "TestResult" + (1:1000);

Unlike with cellstr, this line reads like a natural sentence: no function calls, no concerns over data uniformity, and a familiar string concatenation ‘+’ operator. Aesthetics aside, these two lines can be compared quantitatively as shown in Table 1.

    Metric                 cellstr (c = arrayfun(...))   string array (s = "TestResult" + ...)   String advantage
    Characters of M-code   78 chars                      24 chars                                3.25x shorter
    Duration (sec)         0.01640                       0.0003634                               45x faster
    Bytes                  129,786                       70,096                                  1.85x smaller

Table 1: Performance comparison of string building for cell and string array

This table begins to unlock some of the key differences between cell array and string array:

Character count

78 characters of MATLAB code for cellstr compared to 24 characters for string. Removing the boilerplate code needed for cellstr distills the string expression into its constituent text and numeric elements.

Performance

0.01640 seconds for cellstr vs. 0.0003634 seconds for string, giving string a 45x speedup. Function calls and loops (arrayfun is a loop masquerading as a function call) have significant cost compared to built-in vectorization.

Memory usage

129,786 bytes vs. 70,096 bytes. Understanding why string array requires less memory requires additional explanation:

3.3 Memory layout comparison of cell array and string array

The cellstr in Table 1 is 129,786 bytes while the string array is only 70,096 bytes even though their textual content is identical: these data types differ in how they require metadata. Knowing the byte size of the raw text allows calculating the metadata size that each data type needs (characters in MATLAB are 2 bytes each):

    Vector to analyze: ["TestResult1", "TestResult2", ..., "TestResult1000"].

    ’TestResult’ text = 10 chars * 2 bytes/char * 1000 copies = 20,000 bytes.

Then account for the numeric suffixes of each "TestResult#":

    1-9     = 1 digit  * 2 bytes/char *   9 elements =   18 bytes
    10-99   = 2 digits * 2 bytes/char *  90 elements =  360 bytes
    100-999 = 3 digits * 2 bytes/char * 900 elements = 5400 bytes
    1000    = 4 digits * 2 bytes/char *   1 element  =    8 bytes

Summing these together, the raw data totals 20,000 + 18 + 360 + 5,400 + 8 = 25,786 bytes.

MATLAB can also derive the number of data bytes (25,786) by joining all the discrete strings into one long character vector:

    >> s = "TestResult" + (1:1000);
    >> raw = s.join('').char;
    >> whos raw

      Name      Size       Bytes
      raw       1x12893    25786
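The same total can be cross-checked independently of MATLAB. The sketch below assumes 2 bytes per character, as stated above, and uses a hypothetical helper named raw_text_bytes that simply counts the characters in "TestResult1" through "TestResult<count>".

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes of raw text in "TestResult1" ... "TestResult<count>",
// assuming 2 bytes per character as the text states for MATLAB chars.
// Illustrative helper only.
std::size_t raw_text_bytes(int count)
{
    constexpr std::size_t bytes_per_char = 2;
    const std::string prefix = "TestResult";
    std::size_t chars = 0;
    for (int i = 1; i <= count; ++i)
        chars += prefix.size() + std::to_string(i).size(); // prefix plus digits
    return chars * bytes_per_char;
}
```

raw_text_bytes(1000) counts 12,893 characters and returns 25,786, agreeing with both the hand tally and the whos output.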

Knowing the total bytes and raw data bytes yields the amount of metadata which cellstr and string array need as summarized in Table 2.

    Metric         Total bytes   Raw data bytes   Metadata (total - raw)
    cellstr        129,786       25,786           104,000 (2.34x more bloated)
    string array   70,096        25,786           44,310

Table 2: Comparison of data and metadata sizes of cell array and string array

Based on Table 2, cellstr requires 2.34x (104,000 / 44,310) as much metadata to store the same "TestResult#" array. This metadata difference is primarily due to cellstr needing 1000 mxArrays and string array needing only 1 mxArray.
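The subtraction and ratio behind Table 2 can be written out explicitly. This is only an arithmetic sketch: the byte counts are the measurements reported in Tables 1 and 2, and the constant and function names are made up for illustration.

```cpp
#include <cassert>

// Byte counts reported in the text for the 1000-element "TestResult#" arrays.
constexpr long kRawBytes     = 25786;   // raw text, identical for both containers
constexpr long kCellstrBytes = 129786;  // total size of the cellstr
constexpr long kStringBytes  = 70096;   // total size of the string array

// Metadata is whatever remains after subtracting the raw text.
constexpr long metadata_bytes(long total_bytes, long raw_bytes)
{
    return total_bytes - raw_bytes;
}

constexpr long kCellstrMeta = metadata_bytes(kCellstrBytes, kRawBytes);
constexpr long kStringMeta  = metadata_bytes(kStringBytes, kRawBytes);

// How many times more metadata the cellstr carries than the string array.
constexpr double metadata_ratio()
{
    return static_cast<double>(kCellstrMeta) / static_cast<double>(kStringMeta);
}
```

kCellstrMeta evaluates to 104,000 and kStringMeta to 44,310. Dividing the cellstr metadata by its 1000 elements gives 104 bytes per element.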

3.4 mxArray

mxArray is the fundamental C++ data type for all MATLAB variables. mxArray is implemented as a tagged union [MATLABData]. Tagged unions strike a balance: they minimize byte allocation while maximizing support for distinct data types. This allows the mxArray to be flexible: it supports numeric, text, sparse, struct, cell, etc. The trade-off is that the mxArray is comparatively large due to MATLAB’s language complexity: 104 bytes for each mxArray. This number of bytes (104) can be derived by creating a cell array husk:

    >> c = {[]};
    >> whos c
      Name      Size      Bytes
      c         1x1         104

Cell arrays by definition are non-uniform and thus need one mxArray per element [CellArray]. Therefore, a 1000-length cell array needs 1000 mxArrays, totaling 1000 * 104 bytes = 104,000 bytes of metadata. String arrays, by contrast, are uniform and thus only require one mxArray regardless of the string array size. Its smaller overhead (44,310 bytes, from Table 2) stems from its internal small vector container in C++ which annotates the size of the textual content.
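To make the tagged-union idea concrete, here is a heavily simplified C++ sketch. It is an assumption-laden illustration, not the real mxArray definition: the type Value, the enum Kind, and the helper functions are all hypothetical, and the actual header carries far more fields. It shows only the two points made above: a tag selects which union member is valid, and a cell array pays for one header per element while a uniform array pays for one header in total.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, heavily simplified tagged union. Every name here is
// illustrative; this is not the real mxArray definition.
enum class Kind { Double, Char, Cell };

struct Value
{
    Kind kind;          // the tag: records which union member is in use
    std::size_t numel;  // number of elements the payload holds
    union
    {
        const double*   doubles; // used when kind == Kind::Double
        const char16_t* chars;   // used when kind == Kind::Char (2-byte characters)
        const Value*    cells;   // used when kind == Kind::Cell (one header per element)
    } data;
};

// Read the tag before touching the payload.
std::string kind_name(const Value& v)
{
    switch (v.kind)
    {
        case Kind::Double: return "double";
        case Kind::Char:   return "char";
        case Kind::Cell:   return "cell";
    }
    return "unknown";
}

// Header bytes for n elements: a cell array needs one header per element,
// while a uniform array needs a single header regardless of n.
constexpr std::size_t cell_header_bytes(std::size_t n) { return n * sizeof(Value); }
constexpr std::size_t uniform_header_bytes(std::size_t) { return sizeof(Value); }
```

The size of this toy header depends on the platform and is much smaller than the 104 bytes measured above, but the scaling is the same: cell_header_bytes(1000) is one thousand times uniform_header_bytes(1000).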

3.5 Data uniformity

Data in MATLAB is considered uniform when a function called on each element returns a scalar which can be concatenated into an array [Uniform]. For example, [1], [2], and [3] are all scalars which join into the array [1, 2, 3]. By contrast, the data [1, 2] and ["3"] are not uniform due to differing dimensions (1x2 vs. 1x1) and data types (double vs. string).

cellstr is a non-uniform data container, which allows it to hold textual data of differing sizes. Each element is stored in an independent cell. These mxArrays and their data are chained together in memory, like a linked list or double-ended queue. Although the mxArrays themselves are contiguous in memory, the data buffers that they point to are not, as demonstrated in Figure 1.

Figure 1: Cell array memory layout in MATLAB

Each cell array data pointer points to a disparate memory address. Iterating across a cell array leads to poor data locality and frequent cache-misses, thus degrading performance. Cell arrays are inherently inefficient.

Recommendation: Use string array over cellstr to benefit from a simpler syntax, faster execution, and a more compact memory layout.

4 String concatenation

String concatenation is the operation of fusing smaller strings into a longer string. Most programming languages support this operation. MATLAB has three distinct ways to concatenate text:

  1. The printf family of functions which use format specifiers (ex: sprintf, fprintf, sscanf, …)

  2. horzcat which uses brackets, ex: ['hello, ', name]

  3. String concatenation which uses the ‘+’ operator

String concatenating and formatting are fundamental to managing textual data. This foundational requirement is compared analytically below.

4.1 printf/sprintf

printf uses format specifiers which act as placeholders in the unformatted text. The corresponding variable substitutions are input arguments to the sprintf function. For example,

    >> sprintf('hello, %s!', name);

The textual variable name is substituted for the character format specifier %s to complete the string. Specifying only one input argument to sprintf is simple, but how does substitution scale for multiple inputs? Figure 2 demonstrates a trace of one’s eyeballs while reading a string which makes three variable substitutions.

Figure 2: The flow of a sprintf formatted string.

To follow this string from left to right, one’s eyes repeatedly need to dart to the end of the expression to locate the variable, make a mental map of its format specifier and variable type, then return to the string. Navigating formatted text requires eyeball gymnastics for all but the simplest text, but fortunately there exists a simpler way: string concatenation.

4.2 String concatenation

Figure 3 is a visual trace of the same string above but this time using the ‘+’ (concatenate) operator.

Figure 3: The arrow of a concatenated string.

This flows from left to right. In addition, the format specifiers %s and %d are not needed: MATLAB already knows variable data types via its tagged union mxArray (see §3.4), which makes format specifiers superfluous. This allows MATLAB’s string library to efficiently convert non-string values (such as numerics) into strings automatically (an implicit conversion). These advantages make string concatenation easier to read and write than sprintf, but how do their execution times compare?

4.3 Concatenation performance comparison

Here are three MATLAB code samples, each concatenating identical text in a different way:

    >> sprintf("%d %s", 1, a)
    >> [num2str(1) ' ' a]
    >> 1 + " " + a

All three create the same text content (assume that the variable a equals the character 'a'). These examples test the double and char data types given that they are among MATLAB’s most widely used [MATLABData]. Also, the text is deliberately kept short to focus on the cost of concatenation itself. Table 3 summarizes the performance comparison for each of the three ways to concatenate text.

    Metric            sprintf      [num2str(1) ' ' a]   1 + " " + a   Advantage over sprintf
    Chars of M-code   22 chars     18 chars             11 chars      2x as compact
    Duration (sec)    0.00001375   0.00001227           0.000001693   8.1x faster

Table 3: Comparison of MATLAB text concatenation.

Table 3 clearly indicates that string concatenation has strong advantages over printf for the concatenated text above:

  • Zero function calls (ex: no sprintf) for string concatenation

  • Faster execution: string operators JIT-compile (see §4.4) better than function calls [Performance]

  • Fewer characters of MATLAB code, which often increases code clarity

  • Implicit numeric-to-string conversion (no num2str calls needed)

  • No format specifiers (%d and %s are common enough to memorize but rare specifiers such as %x, %g, and %E likely require documentation [sprintf])

  • No surprise output if a user: (1) accidentally chooses the wrong format specifier (e.g., floating-point instead of integer), or (2) changes the variable data type but forgets to update its specifier

  • Flow left to right as demonstrated by the arrows in Figures 2 and 3

4.4 MATLAB’s Just-In-Time (JIT) compilation

MATLAB code runs through a JIT compiler [ExecutionEngine]. JIT compilation is a hybrid between compilation ahead-of-time (AOT) and interpretation of a language. In terms of speed, generally AOT > JIT > Interpreted. JIT compilation often falls in the performance spectrum between AOT and Interpreted: efficient JIT compilation nears AOT performance, while inefficient JIT is no better than Interpreted.

One of the principles of JIT is that time-to-compile is generally shorter than time-to-execute. This is especially important for tight loops and vectorized instructions. Compiling once and then reusing that optimized bytecode for subsequent executions greatly benefits performance [Loren]. MATLAB launched its new execution engine in R2015a, and it has continued to speed up with every release: MATLAB benchmarks in R2020a are on average 2.18x faster than pre-R2015a [MATLABPerformance]. To circle back to string vs. cell, MATLAB strings have a greater data-to-metadata ratio and are inherently vectorized. This gives JIT compilation more opportunities to optimize string code.

Recommendation: With improved clarity, conciseness, data type conversions, and performance, prefer string concatenation to printf.

5 Autocomplete

Many MATLAB functions offer autocomplete to improve documentation lookup, reduce syntactic errors, and aid in development time. This ease of use is especially of importance to new users beginning to learn the language. MATLAB is frequently used by engineers and researchers who have strong problem-solving backgrounds but are relatively new to software development.

Autocomplete is yet another aspect where string shines over cellstr. To illuminate the difference in autocomplete for these two types, here is another coding exercise:

Retrieve the image number from a list of image names where the number token is separated by an underscore, e.g., "1001_img.jpg" should return "1001".

Starting with string array:

    >> s = ["10_img.jpg", "11_img.jpg"];

To see the autocomplete, type "s." then <TAB>. Figure 4 shows the hinted string functions.

Figure 4: String method autocomplete

After a quick scan, a user finds the function that he or she needs, extractBefore:

    >> s = s.extractBefore("_")

    s =
      1×2 string array

        "10"    "11"

How does autocomplete for cell arrays compare? Here are the same file names, but this time as a cell array of character vectors:

    >> c = {'10_img.jpg', '11_img.jpg'};

As before, type the variable name then <TAB>. Figure 5 shows the lack of helpful results.

Figure 5: Cell array’s empty autocomplete

Why are there no suggestions? This is because MATLAB’s autocomplete for variables works with methods rather than functions. Cell arrays are a container, not a true class, and therefore have no methods. This makes finding the correct function a larger challenge.

5.1 Can MATLAB’s Live Editor help?

The Live Editor is MATLAB’s new and revamped text editor, containing additional features such as block selection, enhanced autocomplete, and formatted text with LaTeX. With MATLAB’s Live Editor, autocomplete appears as soon as a user begins typing. Let’s say that an engineer is looking for string methods beginning with "ex":

Figure 6: Live Editor’s autocomplete

The Live Editor’s popup menu lists four results, all of which are relevant to text processing. New developers who are unfamiliar with MATLAB’s first-party functions immediately find what they are looking for.

For cell array, however, users must begin by typing "ex" rather than scoping into a variable. The Live Editor does not have an opportunity to filter and therefore returns all 200+ candidates as shown in Figure 7.

Figure 7: Live Editor’s unfiltered autocomplete

extractBefore is buried in there somewhere but scanning through hundreds of records is not helpful for anyone, regardless of experience level.

Recommendation: prefer string arrays over cellstr to let autocomplete improve your development efficiency.

6 What if MATLAB string arrays are too slow?

Although MATLAB string arrays are vastly superior to cellstr in performance, they still incur the cost of living in a higher-level language. When additional speedup is needed, consider developing in a compiled ahead-of-time language such as C or C++. Figure 8 illustrates the same two MATLAB performance benchmarks compared with C++: (1) "TestResult#" string array generation and (2) concatenation. Note: cellstr and sprintf in MATLAB are so slow that they cause the C++ bar to disappear and are therefore omitted.

Figure 8: C++ vs. MATLAB duration comparison (lower is better)

6.1 Google Benchmark

Google Benchmark is a C++ library for benchmarking. It is particularly suited for micro-benchmarking a critical function which is then pounded by the framework with many thousands of iterations. The library executes a block of C++ code repeatedly until the duration across iterations becomes statistically stable [GoogleBenchmark].

6.2 TestResult# vector generation using Google Benchmark

Strings are aggressively optimized during compilation and extremely efficient in C++. This makes Google Benchmark the perfect tool for measuring C++ string performance. The C++ TestResultArray() function below creates an array of the same ["TestResult1", "TestResult2", …, "TestResult1000"] strings as benchmarked earlier. It begins by preallocating a vector<string>. Then the function converts each numeric index to string followed by concatenating it to the current element in the string vector.

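    #include <benchmark/benchmark.h>
    #include <cstddef>
    #include <string>
    #include <vector>

    static void TestResultArray(benchmark::State &state)
    {
        for (auto _ : state) // the 'for' loop defines the scope to measure
        {
            constexpr int numel = 1000;
            // Preallocate every element with the common "TestResult" prefix
            std::vector<std::string> v(numel, "TestResult");
            for (std::size_t x = 0; x < v.size(); ++x)
            {
                v[x] += std::to_string(x + 1); // append the 1-based index
            }
        }
    }
    BENCHMARK(TestResultArray); // register the function with the framework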

The above C++ code sample uses a loop, but the algorithmic approach shown below is an interesting coding exercise: it uses std::iota to create an incrementing sequence then std::transform with a lambda expression to convert each integer index to string:

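    constexpr int numel = 1000;
    std::vector<std::string> v(numel, "TestResult");
    std::array<int, numel> num;
    std::iota(num.begin(), num.end(), 1); // num = 1, 2, ..., 1000
    // Pair each index with its string: the lambda appends in place and
    // returns a copy, which std::transform assigns back to the same slot
    std::transform(num.begin(), num.end(), v.begin(), v.begin(),
        [] (int n, std::string &s) { return s += std::to_string(n); });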

The algorithm is about 50% slower than the loop due to visiting each element multiple times. The loop completes in 20,647 nanoseconds on average while the algorithm takes 31,495 nanoseconds as shown in Figure 9.

Figure 9: Google Benchmark output for the TestResult benchmarks
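Before comparing timings, it is worth confirming that the two variants build the same vector. The sketch below is a standalone check that does not depend on Google Benchmark; the function names build_with_loop and build_with_algorithm are hypothetical. Note one deliberate difference from the listing above: the lambda here takes the string by const reference and returns a new string instead of appending in place, so the operation passed to std::transform does not modify its inputs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Loop variant: one pass that appends the 1-based index to each element.
std::vector<std::string> build_with_loop(std::size_t numel)
{
    std::vector<std::string> v(numel, "TestResult");
    for (std::size_t x = 0; x < v.size(); ++x)
        v[x] += std::to_string(x + 1);
    return v;
}

// Algorithm variant: std::iota fills the indices, std::transform joins
// each index with its string and writes the result back into v.
std::vector<std::string> build_with_algorithm(std::size_t numel)
{
    std::vector<std::string> v(numel, "TestResult");
    std::vector<int> num(numel);
    std::iota(num.begin(), num.end(), 1); // 1, 2, ..., numel
    std::transform(num.begin(), num.end(), v.begin(), v.begin(),
        [](int n, const std::string& s) { return s + std::to_string(n); });
    return v;
}
```

Both functions return 1000 elements running from "TestResult1" to "TestResult1000" when called with numel = 1000, so any timing gap between them reflects how the work is done rather than what is produced.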

6.3 String concatenate using Google Benchmark

The C++ StringConcatenate() function concatenates the same string studied in Section 4.3 and is incredibly efficient, taking only 25.1 nanoseconds on average after 24.8 million iterations. It converts the numeric index to string then uses the concatenate operator to build the final string. Additionally, it leverages DoNotOptimize to prevent the compiler from optimizing the local string variable out of existence. This ensures that the concatenation action is accurately being measured [DoNotOptimize].

    static void StringConcatenate(benchmark::State &state)
    {
        for (auto _ : state)
        {
            char a = 'a';
            std::string s = std::to_string(1) + " " + a;
            benchmark::DoNotOptimize(s); // keep the compiler from discarding s
        }
    }

Figure 10: Google Benchmark output for the string concatenate benchmark

7 Hardware and Software

All results were obtained using MATLAB R2020a on Windows 10, Intel® Xeon® W-2133 CPU @ 3.60GHz, 64 GB RAM.

8 Conclusion

Strings are pervasive in data science. Processing volumes of text efficiently is a critical aspect of data analysis. MATLAB provides two textual containers to aid researchers: string array and cell array. Comparing them objectively repeatedly demonstrates the strengths of string over cell. In terms of performance, string arrays are faster than cellstr for almost all string processing benchmarks such as string-building, formatting, concatenation, and implicit data type conversions. By memory footprint, string utilizes approximately half as much metadata to hold identical arrays of text content as compared to cellstr. Additionally, string arrays offer a cleaner, more natural syntax which improves developer usability by reducing syntax errors and code churn. And lastly, autocomplete powerfully supports string arrays which reduces development time. Given this wealth of advantages, MATLAB users should prefer string arrays over cell array of character vectors and proactively replace existing cellstr usages.

References