Scalene: Scripting-Language Aware Profiling for Python

Existing profilers for scripting languages (a.k.a. "glue" languages) like Python suffer from numerous problems that drastically limit their usefulness. They impose order-of-magnitude overheads, report information at too coarse a granularity, or fail in the face of threads. Worse, past profilers—essentially variants of their counterparts for C—are oblivious to the fact that optimizing code in scripting languages requires information about code spanning the divide between the scripting language and libraries written in compiled languages. This paper introduces scripting-language aware profiling, and presents Scalene, an implementation of scripting-language aware profiling for Python. Scalene employs a combination of sampling, inference, and disassembly of byte-codes to efficiently and precisely attribute execution time and memory usage to either Python, which developers can optimize, or library code, which they cannot. It includes a novel sampling memory allocator that reports line-level memory consumption and trends with low overhead, helping developers reduce footprints and identify leaks. Finally, it introduces a new metric, copy volume, to help developers root out insidious copying costs across the Python/library boundary, which can drastically degrade performance. Scalene works for single- or multi-threaded Python code and is precise, reporting detailed information at the line granularity, while imposing modest overheads (26%–53% with full memory and copy-volume profiling, and virtually none for CPU-only profiling).


1 Introduction

General-purpose programming languages can be thought of as spanning a spectrum from systems languages to scripting languages [ousterhout1998scripting]. Systems languages are typically statically-typed and compiled, while scripting languages are dynamically-typed and interpreted. As Table 1 shows, scripting languages share many implementation characteristics, such as unoptimized bytecode interpreters, relatively inefficient garbage collectors, and limited support for threads and signals. (We deliberately exclude JavaScript, which was initially a scripting language; its implementation has evolved to the point where it no longer has much in common with those of other scripting languages, beyond its lack of support for threads.)

Scripting Language | Interpreter | GC algorithm | Threads | Signal limitations
Perl (1987) | AST | Reference counting | Interpreter threads | Delayed
Tcl/Tk (1988) | Bytecode | Reference counting | Interpreter threads (extension) | N/A (no built-in signal support)
Python (1990) | Bytecode | Reference counting (optional mark-sweep via gc) | Serialized (GIL) | Main thread only; delayed
Lua (1993) | Bytecode (register-based) | Incremental mark-sweep | Cooperative only | Delayed
PHP (1994) | Bytecode | Reference counting + backup cycle collector [DBLP:conf/ecoop/BaconR01] | Disabled by default* | Delayed
R (1995) | AST and bytecode | Mark-sweep-compact | N/A (single-threaded) | N/A (no signal handling)
Ruby (1995) | Bytecode | Incremental, generational mark-sweep | Serialized (GIL) | Main thread only; delayed
Table 1: Major scripting language implementations share common implementation characteristics. Next to each language is its first release date. All are dynamically typed; their standard implementations are interpreted and garbage-collected, most with reference counting. All either lack threads or serialize them with a global interpreter lock ("GIL"), and all place severe limits on signal delivery, such as delivering signals only to the main thread and delaying delivery until the interpreter regains control (e.g., after executing a bytecode). *: PHP support for threads is disabled by default, but can be enabled at build time.

This combination of overheads can lead applications in scripting languages to run orders of magnitude slower than code written in systems languages. They also can consume much more space: for example, because of object metadata, an integer consumes 24–28 bytes in most scripting languages. The widespread use of incomplete memory management algorithms like reference counting, which cannot reclaim cycles, only exacerbates the situation. These performance properties combine to make developing efficient code in scripting languages a challenge, but existing profilers for these languages are essentially ports of profilers for systems languages like gprof [DBLP:conf/sigplan/GrahamKM82] or perf, which greatly limits their usefulness.

This paper introduces scripting-language aware profiling, which directly addresses the key challenges of optimizing code written in scripting languages. Because scripting languages are so inefficient, optimizing applications in these languages generally involves moving code into native libraries. Developers thus need to know if bottlenecks reside in the scripting language, which they can optimize, or in native libraries, which they cannot. Because of the significant space overheads that scripting languages impose, developers need both to limit unnecessary memory consumption (by avoiding accidental instantiation of lazily generated objects and by moving memory-intensive code into libraries) and to identify leaks. Finally, they need to identify and eliminate implicit copying across the scripting language/compiled language boundary, which can drastically degrade performance.

We have developed a scripting-language aware profiler for Python called Scalene. We target Python because it is one of the most popular scripting languages according to a variety of rankings [ieeeplrank2019, redmonk-rankings, tiobe-index, stack-overflow-2019-survey]. Large-scale industrial users of Python include Dropbox [python-at-dropbox], Facebook [python-at-facebook], Instagram [python-at-instagram], Netflix [python-at-netflix], Spotify [python-at-spotify], and YouTube [python-at-youtube].

In addition to subsuming the functionality of previous profilers with higher performance, Scalene implements the following novel scripting-aware profiling features:

Profiler | Granularity | Time | Mem Cons. | Scripting-Lang Aware
cProfile [cprofile] | function | real | no | no
Profile [profile] | function | CPU | no | no
pyinstrument [pyinstrument] | function | real | no | no
yappi [yappi] (CPU mode) | function | CPU | no | no
yappi [yappi] (wall-clock mode) | function | real | no | no
line_profiler [line_profiler] | line | real | no | no
pprofile [pprofile] (deterministic) | line | real | no | no
pprofile [pprofile] (statistical) | line | real | no | no
py-spy [py_spy] | line | both | no | no
memory_profiler [memory_profiler] | line | N/A | yes | no
Scalene | line | both | yes | yes
Table 2: Existing Python profilers vs. Scalene. Time indicates whether a profiler measures real (wall-clock) time, CPU time, or both; Mem Cons. indicates whether it profiles memory consumption. (Figure 4 provides detailed performance breakdowns, and Section 5.2 provides other details.) Only Scalene reports scripting-language aware statistics: separate attribution of execution time and memory consumption to Python code or C, a timeline of memory consumption (memory trend), and copy volume in MB/s.
  • Separate Python/C accounting of time and space. Scalene separately attributes both execution time and memory consumption based on whether it stems from Python or native code. Most Python programmers are not able to optimize the performance or memory consumption of native code (which is usually either in the Python implementation or external libraries), so this helps developers focus their optimization efforts on the code they can improve.

  • Fine-grained tracking of memory use over time. Scalene uses a novel sampling memory allocator not only to enable separate accounting of memory consumption to Python vs. native code, but also to efficiently profile memory usage at the line granularity. It produces per-line memory profiles in the form of sparklines (see Figure 1): these are in-line graphs that indicate trends of memory consumption over time, making it easier to track down leaks.

  • Copy volume. Finally, Scalene reports copy volume in megabytes per second for each line of code. This novel metric makes it straightforward to spot inadvertent copying, including copies due to silent coercion or to crossing the Python/library boundary (e.g., accidentally converting numpy arrays into Python arrays or vice versa).

Scalene overcomes a number of technical challenges inherent to the implementation of scripting languages to collect this information with relatively low performance overhead. Scalene outperforms other profilers, in some cases by orders of magnitude, while delivering far more detailed information. Scalene is precise: unlike many existing Python profilers, it performs both memory and CPU profiling at the line granularity. This level of detail can be much more useful than the function-granularity profiles returned by many profilers: unlike in systems languages, where individual lines are often compiled to a few cycles, lines of code in scripting languages are often orders of magnitude more expensive. Our prototype achieves this precision with low overhead. For full memory and copy profiling, it imposes between 26%–53% overhead; for CPU profiling only (separating Python and C execution), it imposes no observable performance penalty (Section 4).

While this paper primarily focuses on Scalene and Python, we believe the techniques it describes depend primarily on implementation details common to almost all scripting languages, and thus should be broadly applicable.

2 Overview of Scalene

(a) Profiling with line_profiler. Traditional CPU profilers often yield little actionable insight.
(b) Profiling with Scalene: before optimization. Line 4’s sawtooth allocation and high copy volume indicate copying due to np.array.
(c) Profiling with Scalene: after optimization. Removing the call to np.array cuts execution time and total memory footprint in half.
Figure 1: Scalene's profiler can effectively guide optimization efforts. Unlike past profilers, Scalene splits time spent and memory consumed in the Python interpreter vs. native libraries, includes average net memory consumption as well as memory usage over time, and reports copy volume. The sawtooth pattern and high copy volume on line 4 in Figure 1(b) indicate unnecessary allocation and copying due to a redundant np.array call. Removing it stabilizes allocation and eliminates copying overhead, leading to a 50% performance improvement and footprint reduction.

This section provides an overview of Scalene’s operation in collecting profile information.

Profiling a Python program with Scalene is a straightforward matter of replacing the call to Python (e.g., python3 app.py becomes scalene app.py). By default, Scalene generates a profile when the program terminates. To support long-running Python applications, Scalene also can be directed via command-line parameters to periodically write profiles to a file.

In addition to providing line-granularity CPU profiles, Scalene breaks out CPU usage by whether it is attributable to interpreted or native code. Its sampling memory allocator—which replaces the default allocator through library interposition—lets it report line-granularity net memory consumption, separately attribute memory consumption to Python or native code, and display trends, in the form of “sparklines” [tufte2006beautiful], which capture memory usage over time. This information makes it easy for developers to identify leaks or unnecessary allocation and freeing. It also reports copy volume in megabytes per second, which can identify unnecessary copying, whether in Python, in native libraries, or across the boundary.

Figure 1 demonstrates how Scalene’s guidance can help developers find inefficiencies and optimize their code. Figure 1(a) shows a profile from a standard Python profiler, line_profiler. The generic nature of past profilers (just tracking CPU time) often fails to yield meaningful insights. Here, it indicates that the line of code is responsible for 100% of program execution, but this fact does not suggest optimization opportunities.

By contrast, Figure 1(b) shows the output of Scalene for the same program. The profile reveals that the line of code in question is unusual: its memory consumption (exclusively in native code) exhibits a distinctive sawtooth pattern. In addition, the line is responsible for a considerable amount of copy volume (almost 600 MB/s). Together, this information tells a familiar tale: copying to a temporary, which is allocated and then promptly discarded. Inspection of this line of code reveals an unnecessary call to np.array (the result of the expression is already a numpy array). Removing that call, as Figure 1(c) shows, reduces both overall memory consumption (shown in the top line of the profile) and total execution time by 50%.
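The benchmark's source is not reproduced in this text; the following hypothetical snippet illustrates the pattern Scalene exposed (the array size is arbitrary): np.random.uniform already returns a numpy array, so wrapping it in np.array allocates and copies a redundant temporary.

```python
import numpy as np

# Redundant: np.random.uniform already returns an ndarray, so np.array
# allocates a second, temporary copy that is promptly discarded
# (the sawtooth pattern and high copy volume Scalene reports).
x = np.array(np.random.uniform(0, 100, size=10**7))

# Equivalent result without the extra allocation and copy:
x = np.random.uniform(0, 100, size=10**7)
```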

In addition to revealing optimization opportunities that other profilers cannot, Scalene is also fast, imposing just 10% overhead for this benchmark. The next section details how Scalene’s implementation simultaneously delivers high precision and generally low overhead (at most 53%).

3 Implementation

Our Scalene prototype runs on Linux (including Windows Subsystem for Linux, version 2) and Mac OS X, for Python versions 3.5 and higher. It is implemented as a combination of a pure Python module and a specialized runtime library written in C++ that replaces key calls by library interposition (that is, LD_PRELOAD on Linux and DYLD_INSERT_LIBRARIES on Mac OS X). Figure 2 presents a diagrammatic overview.

Crucially, Scalene does not depend on any modifications to the underlying CPython interpreter. This approach means that Scalene works unchanged with other implementations of Python like PyPy [DBLP:conf/oopsla/RigoP06]. It also provides evidence that the techniques we develop for Scalene should be portable to other scripting languages without significant changes. Table 3 presents an overview of scripting languages and the features that Scalene relies on.

Exposing scripting-language aware features—without modifying the underlying language—required overcoming a number of technical challenges. This section first explains how Scalene turns the severe limitations on signal delivery (typical of scripting languages) to good effect. It then presents Scalene’s runtime library, which cooperates with the Python-based component to track memory usage, trends over time, and copy volume, all at a line granularity and with low overhead. In the remainder of this section, we focus our discussion specifically on Python, noting where characteristics of Python differ from other scripting languages.

Figure 2: Scalene Overview. Scalene consists of two main components, a Python module and a C++-based runtime system, both depicted in gray. The runtime system is loaded via library interposition. The white components (the code being profiled and the Python interpreter itself) require no modifications.

3.1 Python/C Attribution of CPU Time

Traditional sampling profilers work by periodically interrupting program execution and examining the current program counter. Given a sufficiently large number of samples, the number of samples each program counter receives is proportional to the amount of time that the program was executing. Sampling can be triggered by the passage of real (wall-clock) time, which accounts for CPU time as well as time spent waiting for I/O or other events, or virtual time (the time the application was actually scheduled for execution), which only accounts for CPU time.

While both timer approaches are available in Python (on Linux and Mac OS X systems), directly using sampling is ineffective for Python. As noted previously, nearly all scripting languages impose severe limitations on signal delivery. Typically, as in Python, these signals are delayed until the virtual machine (i.e., the interpreter loop) regains control, often after each opcode. These signals are also only delivered to the main thread.

The result is that no signals are delivered—and thus, no samples accrue—during the entire time that Python spends executing external library calls. It also means that lines of code executing in threads (other than the main thread) are never sampled. In the worst case, sampling can utterly fail. Consider a main thread that spawns child threads and then blocks waiting for them to finish. Because no signals are delivered to the main thread while it is blocking, and because the threads themselves also never receive signals, a naïve sampling profiler could report that no time elapsed. (Note that because of serialization due to the GIL, Python threads are not particularly well suited for parallel code, but they are widely used in servers to manage connections.)
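As a concrete illustration of this failure mode, consider a (hypothetical) program with the following shape; a naive sampling profiler would attribute essentially no time to it, because the main thread blocks without receiving signals and the workers never receive any.

```python
import threading
import time

def worker():
    time.sleep(5)  # stand-in for real work performed in a child thread

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # main thread blocks here; no signals are delivered while it waits
```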

Inferring Time Spent in C Code

Recall that one of the goals of Scalene is to attribute execution time separately, so that developers can identify which code they can optimize (Python code) and which code they generally cannot (C or other native code). An apparently promising approach would be to handle signals, walk the stack, and determine whether the code was invoked by the interpreter as an external function or was within the interpreter itself. However, as we note above, no signals are delivered during native code execution, making such an approach impossible.

Instead, we turn this ostensible limitation to our advantage. We leverage the following insight: any delay in signal delivery corresponds to time spent executing outside the interpreter. That is, if Scalene’s signal handler received the signal immediately (that is, in the requested timing interval), then all of that time must have been spent in the interpreter. If it was delayed, it must be due to running code outside the interpreter, which is the only cause of delays (at least, in virtual time).

To track this time, Scalene uses a clock (either time.process_time() or time.perf_counter()) to record the last time it received a CPU timer interrupt. When it receives the next interrupt, it computes the elapsed time $T$ and compares it to the requested timing interval $q$ (for quantum).

Scalene uses the following algorithm to assign time to Python or C: whenever Scalene receives a signal, it walks the Python stack until it reaches code being profiled (that is, outside of libraries or the Python interpreter itself) and attributes time to the resulting line of code. Scalene maintains two counters for every line of code being profiled: one for Python and one for C (native) code. Each time a line is interrupted by a signal, Scalene increments its Python counter by $q$, the timing interval, and increments its C counter by $T - q$.
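A minimal sketch of this accounting follows. This is not Scalene's actual implementation: the stack walk is simplified to the interrupted frame, and the counter structures and interval value are illustrative.

```python
import signal
import time
from collections import defaultdict

QUANTUM = 0.01  # requested sampling interval q, in CPU seconds

python_time = defaultdict(float)  # line -> seconds attributed to Python
c_time = defaultdict(float)       # line -> seconds attributed to C / native code
last_signal = time.process_time()

def cpu_sample(signum, frame):
    global last_signal
    now = time.process_time()
    elapsed = now - last_signal   # T: time elapsed since the previous signal
    last_signal = now
    # Scalene walks the stack to the first frame of profiled code; here we
    # simply take the interrupted frame's location.
    line = (frame.f_code.co_filename, frame.f_lineno)
    python_time[line] += QUANTUM                 # q goes to the Python counter
    c_time[line] += max(elapsed - QUANTUM, 0.0)  # T - q goes to the C counter

signal.signal(signal.SIGVTALRM, cpu_sample)
signal.setitimer(signal.ITIMER_VIRTUAL, QUANTUM, QUANTUM)
```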

It might seem counterintuitive to update both counters, but as we show below, this approach yields an unbiased estimator. That is, in expectation, the estimates are equivalent to the actual execution times. We first justify this intuitively, and then formally prove it is unbiased.

First, consider a line of code that spends 100% of its time in the Python interpreter. Whenever a signal occurs during execution of that line, it will be delivered almost immediately, meaning that $T \approx q$. Thus, all of its samples ($q$ per signal) accumulate to the Python counter, and essentially nothing ($T - q \approx 0$) to the C counter, yielding an accurate estimate.

Now consider a line that spends 100% of its time executing C code. During that time, no signals are delivered. The longer the time elapsed, the more accurate this estimate becomes. The ratio of time attributed to C code over C plus Python is $(T - q) / ((T - q) + q) = (T - q)/T$, which simplifies to $1 - q/T$. As $T$ approaches infinity, this expression approaches 1 (that is, 100% attributed to C), making it an accurate estimate.

While equality holds only in the limit, the resulting approximation is accurate even for relatively short elapsed times, as long as they are large relative to the sampling interval. Scalene's current sampling interval is 0.01 seconds, so a line that spends one second executing C code would attribute $(1 - 0.01)/1 = 99\%$ of its samples to native code, which is only off by 1%.

Finally, consider the general case where a line spends some fraction $r$ of its time executing C code (and $1 - r$ executing Python code). In expectation, a signal will be delayed (that is, it will arrive while C code is running) with probability $r$, so the expected attribution to C code is $r \cdot (T - q)/T$. As $T$ approaches infinity, this expression approaches $r$, the desired fraction.

To prove that this approach yields an unbiased estimator, we need to show that, in expectation, the estimates equal the actual values. We denote the execution time of the program's Python and C components as $P$ and $C$, respectively. We subscript these with an index (e.g., $P_i$) to denote individual lines of code, so $P = \sum_i P_i$. We use hats to denote estimates, as in $\hat{P}$. Rephrasing formally, to show that these estimates are unbiased, we need to show that $E[\hat{P}_i] = P_i$ and $E[\hat{C}_i] = C_i$.

We first observe that, in expectation, $P_i$ is the proportional fraction of the total execution time $P$ taken by line $i$ (and similarly for $C_i$). By linearity of expectation, it is sufficient to consider the total execution times and show that $E[\hat{P}] = P$ and $E[\hat{C}] = C$.

Call $N$ the total number of samples received by the program; by definition, samples are received only when it is executing Python code. This means that $E[Nq] = P$: the expected running time of Python code is the number of samples times the length of each quantum. Scalene adds $q$ to its estimate of Python execution time whenever it receives a signal, so $\hat{P} = Nq$ and thus $E[\hat{P}] = P$. For C code, Scalene adds the time elapsed waiting for each signal beyond the quantum; the total is the overall elapsed time minus the time accounted for by signals: $\hat{C} = (P + C) - Nq$. We have already shown that $E[Nq] = P$, so $E[\hat{C}] = (P + C) - P = C$.

Simulation study.

To quantify how quickly these estimates converge depending on the ratio of C to Python execution time, we perform a simulation study. The simulator mimics the effect of executing a Python program line by line, spending a random amount of time executing Python code and then a random amount of time running C code. The simulator draws the execution times of the Python and C components of each line of code from a Pareto distribution such that 20% of the code accounts for 80% of the total execution time. It then simulates execution of 100 lines of code for a range of total execution times, with the simulated quantum set at 0.01 seconds (as in Scalene), and attributes time as described above to either Python or C code. At the end of execution, the simulator reports the estimated total time spent in Python code and in C code, along with the simulated “actual” times.
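The following is a simplified sketch of such a simulator; the Pareto shape parameter (1.16, which yields roughly an 80/20 split), the scale factor, and the coarse attribution step are our assumptions, mirroring the scheme described above.

```python
import random

QUANTUM = 0.01  # simulated sampling interval, as in Scalene

def simulate(num_lines=100, shape=1.16, scale=0.05, seed=1):
    rng = random.Random(seed)
    actual_py = actual_c = est_py = est_c = 0.0
    for _ in range(num_lines):
        py = rng.paretovariate(shape) * scale  # time spent interpreting Python
        c = rng.paretovariate(shape) * scale   # time spent in native code
        actual_py += py
        actual_c += c
        samples = int(py / QUANTUM)            # signals land only during Python execution
        est_py += samples * QUANTUM            # each signal adds q to the Python estimate
        est_c += (py + c) - samples * QUANTUM  # remaining elapsed time is attributed to C
    return est_py / actual_py, est_c / actual_c

print(simulate())  # both ratios approach 1.0 as total execution time grows
```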

Figure 3: Simulated execution of Python/C code. This graph validates Scalene’s inference approach to distinguishing Python and C execution times, showing that as execution time increases, the estimated shares of execution time become increasingly accurate.

Figure 3 presents the results of running this simulation 10 times; the x-axis is execution time, and the y-axis is the average ratio of estimated time to simulated time. As predicted, the accuracy of both estimators increases as execution time increases. The simulation shows that the amount of error in both estimates is under 10% after one minute of execution time. Empirically, we find that actual code converges more quickly; we attribute this to the fact that actual Python code does not consist of serialized phases of Python and then C code, but rather that the phases are effectively randomly mixed.

We also evaluate the correlation of the estimated and simulated times using Spearman's $\rho$, which measures whether there is a monotonic relationship between the two; a value of 1 denotes a perfectly monotonic relationship. For 64 seconds of execution, both the Python and the C estimates are directly correlated with their simulated execution times.

Attributing Time Spent in Threads

The approach described above accurately attributes execution time for Python vs. C code in the main thread, but it does not attribute execution time at all for threads, which themselves never receive signals. To handle this, Scalene relies on the following Python features, which are available in other scripting languages: monkey patching, thread enumeration, stack inspection, and bytecode disassembly.

Monkey patching.

Monkey patching refers to the redefinition of functions at runtime, a feature of most scripting languages. Scalene uses monkey patching to ensure that signals are always received by the main thread, even when that thread is blocking. Essentially, it replaces blocking functions like threading.join with ones that always use timeouts. The timeout interval is currently set to Python’s thread quantum (obtained via sys.getswitchinterval()). By replacing these calls, Scalene ensures that the main thread yields frequently, allowing signals to be delivered regularly.

In addition, to attribute execution times correctly, Scalene maintains a status flag for every thread, all initially executing. In each of the calls it intercepts, before Scalene actually issues the blocking call, it sets the calling thread’s status as sleeping. Once that thread returns (either after successfully acquiring the desired resource or after a timeout), Scalene resets the status of the calling thread to executing. Scalene only attributes time to currently executing threads.
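A simplified sketch of this interception follows; Scalene also intercepts other blocking calls and records the sleeping/executing status described above, which this sketch omits.

```python
import sys
import threading

_original_join = threading.Thread.join

def _join_with_timeouts(self, timeout=None):
    # Wait in small increments (one thread quantum at a time) so the main
    # thread keeps returning to the interpreter and can continue to receive
    # signals instead of blocking indefinitely.
    interval = sys.getswitchinterval()
    remaining = timeout
    while self.is_alive():
        _original_join(self, interval)
        if remaining is not None:
            remaining -= interval
            if remaining <= 0:
                return

threading.Thread.join = _join_with_timeouts
```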

Thread enumeration.

When the main thread receives a signal, Scalene introspects on all running threads, invoking threading.enumerate() to collect a list of all running threads; similar logic exists in other scripting languages (see Table 3).

Stack inspection.

Scalene next obtains the Python stack frame from each thread using Python’s sys._current_frames() method. Note that the preceding underscore is just Python convention for a “protected” class method or variable. As above, Scalene walks the stack to find the appropriate line of code for which it will attribute execution time.

Bytecode disassembly.

Finally, Scalene uses bytecode disassembly (via the dis module) to distinguish between time spent in Python vs. C code. Whenever Python invokes an external function, it does so via a bytecode whose textual representation starts with CALL_ (this approach is common to other languages; for example, Lua uses OP_CALL, while Ruby’s is opt_call_c_function). Scalene builds a map of all such bytecodes at startup.

For each running thread, Scalene checks the stack and its associated map to determine if the currently executing bytecode is a call instruction. Because this method lets Scalene know with certainty whether the thread is currently executing Python or C code, there is no need for the inference algorithm described above. If the bytecode is a call, Scalene assigns time to the C counter; otherwise, it assigns it to the Python counter.
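A sketch of this per-thread sampling step is shown below; it is simplified relative to Scalene, which walks further up each stack to reach profiled code.

```python
import dis
import sys
import threading

# Opcodes whose names begin with "CALL": if a thread is stopped on one of
# these, it is currently executing a (potentially native) call.
CALL_OPCODES = {op for name, op in dis.opmap.items() if name.startswith("CALL")}

def sample_all_threads():
    frames = sys._current_frames()        # maps thread id -> topmost stack frame
    for thread in threading.enumerate():
        frame = frames.get(thread.ident)
        if frame is None:
            continue
        opcode = frame.f_code.co_code[frame.f_lasti]  # currently executing bytecode
        in_native_call = opcode in CALL_OPCODES
        yield frame.f_code.co_filename, frame.f_lineno, in_native_call
```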

3.2 Memory Usage

Traditional profilers either report CPU time or memory consumption; Scalene reports both, at a line granularity. It is vital that Scalene track memory both inside Python and out; external libraries are often responsible for a considerable fraction of memory consumption.

To do this, Scalene intercepts all memory allocation related calls (malloc, free, etc.) via its own replacement memory allocator, which is injected before execution begins.

By default, Python relies on its own internal memory allocator for objects 512 bytes or smaller, maintaining a freelist of objects for every multiple of 8 bytes in size. However, if the environment variable PYTHONMALLOC is set to malloc, Python will instead use malloc to satisfy all object requests. Scalene sets this variable accordingly before beginning profiling. Note that some other languages may not make it so straightforward to replace all allocations; for example, while Ruby uses the system malloc to satisfy large object requests, there is no facility for replacing small object allocations. However, most other scripting languages make it simple to redirect all of their allocations (see Table 3).
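For illustration, a profiler launcher can arrange both the environment variable and the library interposition before starting the profiled program; the library path and the app.py target below are placeholders, not Scalene's actual file names.

```python
import os
import subprocess
import sys

env = dict(os.environ,
           PYTHONMALLOC="malloc",                 # route all object requests through malloc
           LD_PRELOAD="/path/to/libscalene.so")   # placeholder path to the interposition library
subprocess.run([sys.executable, "app.py"], env=env)  # app.py: the program to profile
```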

An Efficient Replacement Allocator

Because Python applications can be extremely allocation-intensive, using a standard system allocator for all objects can impose considerable overhead. In our experiments, redirecting all allocations to the default system allocator on Mac OS X can slow down execution by 80%. We viewed this as an unacceptably large amount of overhead, and ended up building a new allocator in C++, with some components drawn from the Heap Layers infrastructure [DBLP:conf/pldi/BergerZM01].

This might at first glance seem unnecessary, since in theory, one could extract the allocator from the Python source code and convert it into a general-purpose allocator. Unfortunately, the existing Python allocator is also not suitable for use as a general malloc replacement. First, the built-in Python allocator is implemented on top of malloc; in effect, making it a general-purpose allocator still would require building an implementation of malloc.

However, the most important consideration, which necessitates a redesign of the algorithm, is that a usable general-purpose allocator replacement needs to be robust to invocations of free on foreign objects. That is, it must reject attempts to free objects which were not obtained via calls to its malloc. This case is not a theoretical concern, but is in fact a near certitude. It can arise not just due to programmer error (e.g., freeing an unaligned object, a stack-allocated object, or an object obtained from an internal allocator), but also because of timing: library interposition does not necessarily intercept all object allocations. In fact, Python invokes free on ten foreign objects, which are allocated before Scalene’s interposition completes. Because re-using foreign objects to satisfy object requests could lead to havoc, a general-purpose allocator needs a fast way to identify foreign objects and discard them (a small leak being preferable to a crash).

We therefore built a general-purpose memory allocator for Scalene whose performance characteristics nearly match those of the Python allocator. At initialization, the Scalene allocator reserves a contiguous range of virtual memory to satisfy small object requests. It also allocates memory for large objects to be aligned to 4K boundaries, and places a magic number (0xDEADBEEF) in each header as a validity check. If objects are outside the contiguous range, not properly aligned, or fail their validity check, Scalene treats them as foreign. We have found this approach to be sufficiently robust to enable it to work on every Python program we have tested.

Otherwise, the internals of the Scalene allocator are similar in spirit to those of the Python allocator; it maintains lists for every size class of a multiple of 16 bytes up to 512 bytes. These point to 4K slabs of memory, with a highly optimized allocation fast path. Large objects are allocated separately, either from a store of 4K chunks, or directly via mmap. In our tests, this allocator significantly closes the performance gap between the system allocator and Python’s internal allocator, reducing overhead from 80% to around 20%. We expect to be able to optimize performance further, especially by avoiding repeated calls to mmap for large object allocation.

Sampling

With this efficient allocator intercepting all allocation requests, we are now in a position to add the key component: sampling.

Allocation-Triggered Sampling:

The Scalene sampling allocator maintains a count of all memory allocations and frees, in bytes. Once either of these crosses a threshold, it sends a signal to the Python process. To allow Scalene to work on Mac OS X, which does not implement POSIX real-time signals, we re-purpose two rarely used signals: SIGXCPU for malloc signals and SIGXFSZ for free signals. Scalene triggers these signals roughly after a fixed amount of allocation or freeing; the interval is currently set to a prime number, intended to reduce the risk of stride behavior interfering with sampling.

Call Stack Sampling:

To track the provenance of allocated objects (that is, whether they were allocated by Python or native code), Scalene also performs call stack sampling. The call stack sampling rate is set as a multiple of the allocation sampling rate. Whenever this threshold number of allocations is crossed, Scalene climbs the stack to determine whether the sampled allocation came from Python or native code.

To distinguish between these two, Scalene relies on the following domain-specific knowledge of Python internals. Python has a wide range of functions that create new Python references, all of which begin with either Py_ or _Py. If Scalene encounters one of these functions as it climbs the stack, the object was by definition allocated by Python, so it increments a count of Python allocations by the requested size. (A few special cases: _PyCFunction allocates memory, but on behalf of a C call, and PyArray is a non-Python call that numpy uses for allocating its own native arrays; Scalene correctly treats both of these as C allocations.)

After walking a maximum number of frames (currently 4), if Scalene has not encountered one of these functions, it concludes that the allocation was due to native code and increments the C allocation counter. When the Scalene allocator eventually sends allocation information to the Python module (described below), it includes the ratio of Python bytes over total allocated bytes. It then resets both allocation counters.

Because resolving function names via dladdr is relatively costly, especially on Mac OS X, Scalene maintains an open-addressed hash table that maps call stack addresses to function names. This hash table is a key optimization: using it reduces Scalene’s overhead by 16% in one of our benchmarks.

Managing Signals:

Because Python does not queue signals, signals can be lost. We thus need a separate channel to communicate with the main process; to do this, we allocate a temporary file with the process-id as a suffix. Scalene appends information about allocations or frees to this file, as well as the fraction of Python allocations.

When Scalene’s signal handler is triggered (in the Python module), it reads the temporary file and attributes allocations or frees to the currently executing line of code in every frame. As with sampling CPU execution, lines of code that frequently allocate or free memory will get more samples. Scalene also tracks the current memory footprint, which it uses both to report maximum memory consumption and to generate sparklines for memory allocation trends (Section 3.3).

One fly in the ointment is that the Python signal handler itself allocates memory. Unlike in C, this allocation is impossible to avoid because the interpreter itself is constantly allocating and freeing memory. However, Scalene again leverages one of Python’s limitations to its advantage: Python’s global interpreter lock ensures that there is no true concurrency inside the interpreter. Therefore, Scalene straightforwardly prevents re-entrant calls by checking a boolean flag to see if it is already in the signal handler; if not, it sets the flag.
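A minimal sketch of this re-entrancy guard follows; process_samples is a hypothetical stand-in for reading the temporary file and attributing samples to lines.

```python
in_handler = False  # no lock needed: the GIL serializes execution inside the interpreter

def process_samples(frame):
    # Hypothetical stand-in: read the temporary file written by the runtime
    # library and attribute the recorded allocations/frees to lines of code.
    pass

def malloc_signal_handler(signum, frame):
    global in_handler
    if in_handler:
        return      # drop re-entrant deliveries caused by the handler's own allocations
    in_handler = True
    try:
        process_samples(frame)
    finally:
        in_handler = False
```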

3.3 Memory Trends

Scalene not only reports net memory consumption per line, but also reports memory usage over time in the form of sparklines, both for the program as a whole and for each individual line. It adds the current footprint (updated on every sampled allocation and free event) to an ordered array of samples for each line of code. The array length is chosen to be a multiple of 3, currently 27. When the array fills, its contents are reduced by a factor of 3, replacing each group of three entries with its median value; after this reduction, footprint samples are again appended to the end of the array. The effect of this approach is to smooth older footprint trends (on the left side of the sparkline) while maintaining higher fidelity for more recent footprints.
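A sketch of this reduction scheme, using the constants described above:

```python
import statistics

MAX_SAMPLES = 27  # a multiple of 3, as in Scalene

def record_footprint(samples, footprint):
    """Append a footprint sample; when the array fills, compress it 3:1 by
    replacing each group of three samples with its median, smoothing older
    history while keeping recent samples at full fidelity."""
    samples.append(footprint)
    if len(samples) >= MAX_SAMPLES:
        samples[:] = [statistics.median(samples[i:i + 3])
                      for i in range(0, len(samples), 3)]
```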

3.4 Copy Volume

Finally, Scalene reports copy volume by line, which it also accomplishes through sampling. The Scalene runtime system interposes on memcpy, which is invoked both for general copying and for copying across the Python/C boundary. As with memory allocations, Scalene triggers a signal (this time, SIGPROF) after a threshold number of bytes has been copied, and it uses the same temporary-file approach to avoid the problem of lost signals. The memcpy sampling rate is set as a multiple of the allocation sampling rate; the ratio of the copy-sampling and allocation-sampling rates has a roughly proportional impact on the number of interrupts. Since copying is almost always immediately preceded by an allocation of the same size and followed by a deallocation, the current setting maintains copy samples at roughly the same rate as allocation samples.

Scripting Language | malloc interposition | Thread enumeration | Stack inspection | Opcode disassembly
Perl | ✓ (1) | threads->list() | Devel::StackTrace | B::Concise
Tcl/Tk | ✓ (2) | not needed | not needed | not needed
Python | ✓ (3) | threading.enumerate() | sys._current_frames() | dis
Lua | ✓ (4) | not needed | not needed | not needed
PHP | ✓ (5) | not needed | not needed | not needed
R |  | not needed | sys.call | disassemble
Ruby | ✓ (6) | Thread.list | caller | RubyVM::InstructionSequence
Table 3: Feature support needed for scripting-language aware profiling, with the corresponding functions/modules, where needed. While Scalene is a Python profiler, it relies on widely available characteristics of scripting language implementations. (1): Perl's default configuration disables its internal allocator (-Dusemymalloc=n). (2): Tcl/Tk's default configuration also disables its internal allocator (-DUSE_TCLALLOC=0). (3): Python's allocator can be redirected by setting the environment variable PYTHONMALLOC=malloc. (4): Lua's allocator can be changed via the function lua_setallocf(). (5): PHP's allocator can be redirected by setting the environment variable USE_ZEND_ALLOC=0. (6): Ruby invokes malloc for objects larger than 512 bytes, but does not provide a facility for interposing on smaller object allocations.

4 Evaluation

We conduct our evaluation on a MacBook Pro (2016) with a 3.3 GHz dual-core Intel Core i7 and 16GB of 2133 MHz DDR3 RAM, running macOS Catalina (version 10.15.4). All C and C++ code is compiled with clang version 11.0, and we use version 3.6.8 of the Python interpreter.

4.1 CPU Profiling Overhead

This section compares the profiling overhead of Scalene to the suite of existing profilers listed in Table 2. To tease apart the impact of the Scalene runtime library, we distinguish the results of Scalene without the library, which we refer to as “Scalene (CPU)” (as it performs CPU profiling only, though still separating Python and C execution time), from “Scalene (full)”, which includes both memory and copy volume tracking. We conservatively choose CPU-intensive applications for these experiments, as they represent the worst case for profiling overheads; overheads are likely to be substantially lower in applications that spend more time in I/O operations.

Benchmarks.

While there is a standard benchmark suite for Python known as pyperformance, most of the included benchmarks are microbenchmarks, running in many cases for less than a second. As these are too short-lived for our purposes, we conduct our evaluation on one of the longest running benchmarks, bm_mdp, which simulates battles in a game and whose core involves topological sorting. This benchmark takes roughly five seconds. We also use as a benchmark an example program used as a basis for profiling in a book on high-performance Python, which we refer to as julia [gorelick2020high, Chapter 2]; this benchmark computes the Julia set (a fractal) and runs for seven seconds. We modify the benchmarks slightly by adding @profile decorators, as these are required by some profilers; we also add code to ignore the decorators when they are not used. In addition, we had to add a call to sys.exit(-1) to force py-spy to generate output. We report the average of three consecutive runs.

Figure 4 provides these results. In general, Scalene (CPU only) imposes virtually no overhead, while the full Scalene imposes between 26% and 53% overhead.

(a) Overhead running the Julia benchmark.
(b) Overhead running the mdp benchmark.
Figure 4: Profiling overheads. Despite collecting far more detailed information, Scalene is competitive with the best-of-breed CPU profilers, imposing no perceivable overhead in its CPU-only version, and between 26%–53% for its full version.

4.2 Memory Profiling Overhead

The profilers we examine include just one memory profiler (memory_profiler). That profiler’s focus is exclusively on memory profiling; that is, it does not track CPU time at all. Like Scalene, memory_profiler works at a line granularity, reporting only average net memory consumption.

We sought to perform an empirical comparison of memory_profiler's performance against Scalene. Unfortunately, memory_profiler is far too slow to be usable. While it runs for simple examples, we forcibly aborted it after it had run for far longer than the baseline; for the Julia benchmark, we allowed it to run for over 2 hours (against a roughly seven-second baseline), but it never completed. In other words, its slowdown is at least three orders of magnitude. By contrast, Scalene delivers fine-grained memory usage information with vastly lower overhead.

4.3 Case Study

(a) Profiling with line_profiler. Line 15 is the clear culprit, but the reason is unclear.
(b) Profiling with Scalene (before optimization). Scalene reveals that line 15 is allocating and freeing memory at a high rate.
Figure 5: Case Study: This small case study illustrates how Scalene can reveal optimization opportunities: in this case, changing a few lines improves performance by over 1000×.

In this section, we report how Scalene can reveal previously-unknown optimization opportunities in actual Python code. This case study is primarily meant to illustrate Scalene’s role in the optimization process, and how it improves on past work. We note that we do not expect most users of Scalene to identify such enormous optimization opportunities.

We examine code presented in the Python documentation for the Decimal arbitrary-precision library to compute exp [exp-recipe]. Running this code on Decimal(3000) takes 12 seconds. A standard line-level profiler (line_profiler) reports that line 15 is the bottleneck: computing the ratio num / fact (Figure 5(a)). However, line_profiler does not provide much insight into why this is the case.

When we run Scalene on this code, we see an entirely different story (Figure 5(b)). Scalene reveals that line 15 is mostly executing in Python, but, most importantly, it shows that this line is, somewhat surprisingly, allocating and freeing objects at a rapid rate. In fact, this single line accounts for 81% of the object allocation activity in the program, all in Python. This fact warranted investigating the num and fact variables: inspecting their values made it clear that both grow large quickly, repeatedly allocating and freeing space for their digits.

To address this (that is, to keep the size of these numbers small), we introduce a variable nf that maintains the ratio num / fact. This change required adding one new variable and one line of code, and deleting two lines. The result was a drop in execution time from 12 seconds to 0.01 seconds: an improvement of over 1000×.
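For reference, the recipe's inner loop and the described change look roughly as follows. This is a sketch based on the decimal documentation recipe; the exact code used in the case study may differ slightly.

```python
from decimal import Decimal, getcontext

def exp(x):
    """Taylor-series exp, as in the decimal documentation recipe."""
    getcontext().prec += 2
    i, lasts, s, fact, num = 0, 0, 1, 1, 1
    while s != lasts:
        lasts = s
        i += 1
        fact *= i
        num *= x
        s += num / fact   # the case study's line 15: num and fact grow large
    getcontext().prec -= 2
    return +s

def exp_optimized(x):
    """Same series, maintaining nf = num / fact incrementally so both
    operands of the division stay small."""
    getcontext().prec += 2
    i, lasts, s, nf = 0, 0, 1, 1
    while s != lasts:
        lasts = s
        i += 1
        nf *= x / i       # replaces the separate num and fact updates
        s += nf
    getcontext().prec -= 2
    return +s

print(exp_optimized(Decimal(3000)))  # the case study input
```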

5 Related Work

5.1 Other Scripting Languages

Table 1 provides a breakdown of previous scripting languages by the characteristics of their standard implementations. All are dynamically-typed languages, and their standard implementations are interpreters. This section describes key features of these scripting languages.

Perl [DBLP:books/lib/WallS92] was designed by Larry Wall and first released in 1987. Unusually, Perl does not use bytecodes, instead using an abstract-syntax-tree-based interpreter. It relies exclusively on a reference-counting garbage collector. Since Perl 5.8, released in 2002, Perl has provided interpreter threads, which comprise separate interpreters per thread; unlike traditional threads, all variables and references are thread-local unless they are explicitly shared [perlthreads]. Also since Perl 5.8, signal delivery is delayed until the interpreter enters a safe state (between opcodes); previously, Perl was signal-unsafe.

Tcl/Tk [DBLP:conf/usenix/Ousterhout90, wiki:tcl] was designed by John Ousterhout; its first release was in 1988. It has used a stack-based bytecode interpreter since version 8.0, released in 1997 [DBLP:conf/tcltk/Lewis96, tcl8.0], replacing its original string-based interpreter. It relies exclusively on a reference-counting garbage collector. Like Perl, Tcl implements a variant of interpreter threads (as an extension) [tcl-threads], with explicit sharing of variables possible via special operators [welch2003practical, Chapter 21]. Unlike the other scripting languages discussed here, core Tcl (as of version 8.6) has no built-in support for signals, though signal handling is available in extensions [tclsignals].

Python [DBLP:conf/tools/Rossum97] was designed by Guido van Rossum and initially released in 1990. Its standard implementation is a stack-based bytecode interpreter. It has a reference-counting garbage collector, but there is also an optional gc module that performs mark-sweep garbage collection. Only one thread at a time can execute in the interpreter, which is protected by a global lock (the global interpreter lock, a.k.a. “the GIL”). Signals are delivered only to the main thread and are delayed until the VM regains control.

Lua [DBLP:journals/spe/IerusalimschyFF96, DBLP:conf/hopl/IerusalimschyFF07] was designed by Roberto Ierusalimschy; its first release was in 1993. Lua's bytecode interpreter is register-based rather than stack-based. Lua has never had reference counting, relying on stop-the-world mark-sweep garbage collection until incremental GC was added in version 5.1, released in 2006. Lua had no threads of any kind until version 4.1; it now has cooperative (non-preemptive) threads. Signals are delayed until the VM regains control.

PHP [wiki:php] was designed by Rasmus Lerdorf and first released in 1994. Its interpreter uses a register-based (three-address) bytecode. It uses reference-counting garbage collection, but added a backup cycle collector based on Bacon and Rajan's synchronous cycle collection algorithm [DBLP:conf/ecoop/BaconR01] in PHP 5.3, released in 2009. PHP's default configuration is NTS (Not Thread Safe); threading can be enabled at build time by turning on ZTS (Zend Thread Safety). Since PHP 7.0, signal delivery has been delayed until the interpreter regains control; unlike other scripting languages, PHP delays delivering signals not just until after the current opcode, but until the VM reaches a jump or call instruction.

R [doi:10.1080/10618600.1996.10474713] was designed by Ross Ihaka and Robert Gentleman; its first release was in 1995. R is a reimplementation of the S programming language, developed in 1976 by John Chambers [the-new-s-language], with the addition of lexical scoping. R has both an AST-based interpreter and a bytecode interpreter (since version 2.13, released in 2011) [bytecode-compiler-for-r]. Since its creation, R has employed a mark-sweep-compact garbage collector. R is single-threaded and has no support for signal handling.

Finally, Ruby [DBLP:books/daglib/0015648] was designed by Yukihiro Matsumoto (a.k.a., “Matz”) and first released in 1995. Originally an abstract-syntax-tree based interpreter, it switched to using a stack-based bytecode interpreter (“YARV” [10.1145/1094855.1094912]) with Ruby 1.9 [wiki:rubymri], released in 2007. Initially, like Lua, it employed a stop-the-world, mark-sweep garbage collector; generational collection was introduced in version 2.0, and incremental garbage collection as of version 2.1. Like Python, Ruby has multiple threads but these are serialized behind a global-interpreter lock. Signals are only delivered to the main thread, and they are queued until the interpreter regains control.

5.2 Existing Python Profilers

Table 2 provides a high-level overview of the features of all of the major Python profilers of which we are aware. All but one are CPU profilers. These profilers fall into two categories: function-granularity and line-granularity. Most are less efficient than Scalene (particularly in its CPU-only mode), notably those that rely on Python's built-in support for profiling and tracing (the setprofile and settrace calls from the sys and threading modules). Some fail to record information accurately for multi-threaded applications. None perform scripting-language aware profiling.

Two of the profilers operate in multiple modes. Like Scalene, yappi can perform either CPU-time or wall-clock profiling. However, yappi's CPU-time mode does not use sampling, making it inefficient and degrading performance substantially. The wall-clock mode is considerably more efficient, though it still imposes noticeable performance penalties. Like yappi, pprofile has two variants: one is deterministic, relying on instrumentation, while the other uses sampling. The sampling version imposes low overhead, but the deterministic version imposes the highest performance penalties of any CPU profiler we study.

5.3 Profilers for Other Scripting Languages

Like previous Python profilers, profilers for other scripting languages are essentially variants of traditional profilers for systems languages; none are scripting-language aware.

Next to Python, Ruby is the scripting language with the most profilers in wide use. Rbspy is an efficient sampling-based profiler that inspired the development of py-spy [rbspy]. Another profiler for Ruby, stackprof, optionally performs object sampling after every so many allocations [stackprof]. Unlike Scalene, this sampling does not integrate with CPU sampling, nor does it perform any scripting-language aware profiling such as separate CPU/memory attribution, tracking memory usage over time, or reporting copy volume. Finally, Ruby also has a MemoryProfiler that precisely tracks object allocations at the line granularity, imposing considerable overheads [MemoryProfiler]. Like stackprof, MemoryProfiler cannot simultaneously perform CPU profiling and memory allocation tracking.

R’s standard profiler is Rprof, a line-granularity sampling-based profiler for CPU and memory consumption; it does not measure CPU time or memory consumed by native code in libraries. Andersen et al. describe feature-specific profiling [andersen2018feature]

, a profiling approach that focuses on attributing costs to specific language features, such as pattern matching or dynamic dispatch. They present an implementation of this profiler for R that uses Rprof’s sampler. Most feature-specific profiling they describe is orthogonal and complementary to scripting-language aware profiling. One use case they describe—identifying when R’s copy-on-write policy fails, resulting in deep copies—would be subsumed by

Scalene

’s copy volume profiling. A previous R profiler, lineprof, also reports the number of vector duplications.

Profilers for other scripting languages are conventional. Profilers for Lua include LuaProfiler [luaprofiler], LuaTrace [luatrace], and Pro-Fi [profi], all function-granularity CPU profilers. Similarly, the standard Tcl profiler is also a function-level profiler. Perl has a variety of profilers, including Devel::DProf (also function granularity), Devel::SmallProf (line granularity), and Devel::FastProf (a faster variant of Devel::SmallProf written in C); the most sophisticated profiler for Perl is Devel::NYTProf, which performs profiling at the file, function, "block", and line granularity [nytprof].

6 Conclusion

This paper introduces scripting-language aware profiling and presents a prototype scripting-language aware profiler for Python called Scalene. Scalene both sidesteps and exploits characteristics of Python, typical of most scripting languages, to deliver actionable information to Python developers. Its pervasive use of sampling, coupled with its runtime system, allows it to capture detailed information with relatively modest overhead. Scalene has been released as open source at https://github.com/emeryberger/scalene.

References