Producing software with precise, repeatable timing is a challenging task. First, the software application itself may have data-dependent processing complexity, such as with data-dependent loops. Second, the execution time of the application on the processor may be affected by the memory hierarchy and the run-time state of the processor. Third, the timing of the execution may be affected by resource contention when several parallel threads share the same processor resource. Among these three problems, the second and the third are most difficult because they are outside of the control of the programmer. In cryptographic applications, data-dependent timing variations may be exploited as timing side-channel leakage, either directly as an effect of data-dependent control flow, or indirectly as an effect of contention on shared processor resources. To avoid timing side-channels, we need data-independent timing.
In this contribution, we propose a programming model that yields these timing characteristics. We contrast our proposal with earlier work towards precise software timing for embedded applications, PRET [DBLP:conf/rtss/LeeRZ17, DBLP:conf/rtas/ZimmerBSL14]. A fundamental idea of PRET is to use instruction scheduling to avoid resource contention in the processor pipeline. By spacing the instructions of timing-critical threads several cycles apart, stall-free execution is achieved in the pipeline. As a consequence, the timing of individual threads is repeatable regardless of the processor state. To ensure overall processor utilization, PRET combines multiple timing-critical threads with time-interleaving and a customized instruction scheduling technique [DBLP:conf/rtas/ZimmerBSL14].
Our insight in this paper is that such time-sensitive threads can also be combined spatially within a processor word, instead of temporally using interleaved instruction streams. The advantage of spatially combining the threads (instead of using time-based interleaving) is that we do not need to adapt the processor for interleaved instruction execution. To implement a spatial arrangement of threads, we organize each thread as a single-bit program, and execute the overall application as a vectorized version of the single-bit program. We emphasize that the proposed model goes beyond software bit-slicing [biham1997fast], which is strictly functional and ignores the control flow and the state within each slice.
To simplify the development of single-bit programs, we adopt a synchronous execution model. A single-bit program is captured as a synchronous Finite State Machine with Datapath (FSMD), and the execution of this program follows a sequential schedule of the bit-operations that define the FSMD. The vectorized form of the single-bit program is then achieved with bitwise instructions over the processor word. The vectorized form is a parallel synchronous program. Since each thread has its own state, each thread executes as an independent FSMD. However, the instruction count for one iteration of the overall program is constant and repeatable, and therefore the execution time of these FSMD threads becomes repeatable too. A prototype implementation of a synthesis tool starts from a Verilog input specification and generates C code with inline assembly optimized for an embedded target. We demonstrate several useful examples of parallel synchronous programming (PSP).
We develop a software execution model that leads to repeatable, data-independent timing. We first define what is meant by repeatable and data-independent timing in software. We then describe software bitslicing, which can offer such timing characteristics for functions (i.e. straight-line stateless programs). Next, we explain how to extend the semantics of software bitslicing from straight-line programs to synchronous FSMD. The result is a Parallel Synchronous Program (PSP).
II-A Desired Timing Properties
Programs written as PSP aim for repeatable timing as well as data-independent timing. The former is useful in real-time embedded software design, while the latter is useful for secure systems design. We motivate and differentiate each property.
Edwards et al. make a distinction between repeatable timing and predictable timing [DBLP:conf/iccd/EdwardsKLLPS09]. Repeatable timing means that every correct execution of a program uses the same timing. Repeatable timing is desired as a property of the program, not of the program running on a specific processor. Repeatable timing is needed in the context of real-time applications when timing jitter is a concern. For example, when a physical sensor must be read from software at a specific sample rate, then the software needs to have repeatable timing. Jitter is typically caused by resource contention and interrupts.
A second relevant domain for PSP is that of secure software. In recent years, a rich collection of attacks have been found to exploit the implementation characteristics of secure software rather than the program logic itself. The best known of these are side-channel attacks and micro-architectural attacks, which rely on precise execution time measurement [DBLP:journals/jce/GeYCH18]. To thwart these attacks, software with (secret-)data-independent timing is needed. This is hard because modern micro-architectures are rife with architectural contention and context-dependent timing. Even if there are no obvious dependencies in the program logic, there may still be hidden dependencies in the micro-architecture. The cryptographic community is well aware of the risk of timing-based side-channel leakage, leading to the design of so-called constant-time software that avoids data-dependencies in the program execution time [DBLP:conf/uss/AlmeidaBBDE16]. The resulting programs are not literally constant-time, but rather they adopt data-independent control flow and memory access patterns.
We argue that software written as a Parallel Synchronous Program (PSP) provides repeatable timing as well as data-independent timing. PSP achieves these properties by combining two concepts: software bitslicing and synchronous FSMD. The following subsections introduce both.
Software bitslicing was originally proposed for high-throughput software implementations [biham1997fast]. In this model of programming, a program is expanded into 1-bit (Boolean) operations as follows. A variable with $n$ bits is distributed over $n$ registers $r_0, \ldots, r_{n-1}$, such that register $r_i$ holds bit $i$. An $m$-bit processor operates as an $m$-way SIMD processor, processing $m$ instances of the $n$-bit variable in parallel, and storing these instances in $n$ registers. Bitsliced programs are Boolean programs written with bitwise logic operations. The rationale of bitslicing is that it guarantees full utilization of the processor word length. The absence of control flow ensures that each iteration through a bitslice function uses the same number of instructions. In addition, the absence of state (memory) in a bitslice function eliminates cache timing effects. For this reason, bitslicing is often applied when developing programs that are constant-time (in the cryptographic sense). However, software bitslicing is insufficient as a general-purpose methodology for software. Because bitslice functions do not have control flow, control operations are typically emulated using non-bitsliced logic surrounding bitsliced expressions. This prevents individual slices from operating as independent threads of control. Bitslice programming essentially applies only to functions. The management of the program state resides outside of the bitslice logic.
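To make the register layout concrete, the following C sketch adds 32 independent instances of an n-bit variable in bitsliced form on a 32-bit word. It is an illustration only: the function name bitslice_add and the ripple-carry structure are assumptions chosen for this example and are not taken from the cited work.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example: bitsliced ripple-carry addition.
 * a[i], b[i] and sum[i] each hold bit i of 32 independent instances,
 * one instance per bit position (lane) of the 32-bit word.
 * Only bitwise operations are applied to the data, so the instruction
 * count does not depend on the operand values. */
void bitslice_add(const uint32_t *a, const uint32_t *b,
                  uint32_t *sum, int nbits)
{
    assert(nbits > 0);          /* nbits is a public parameter, not data */
    uint32_t carry = 0;         /* one carry bit per lane */
    for (int i = 0; i < nbits; i++) {
        uint32_t x = a[i] ^ b[i];
        sum[i] = x ^ carry;                      /* sum bit i, all lanes */
        carry = (a[i] & b[i]) | (x & carry);     /* carry into bit i+1   */
    }
}
```

The loop bound depends only on the word width nbits, not on the data, so every call executes the same sequence of bitwise operations. The final carry is discarded, which gives addition modulo 2^nbits in every lane.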
II-C Synchronous FSMD
We next describe how to introduce control flow and state into software bitslicing. A Boolean (1-bit) program does not offer the concept of an address space or control flow instructions. Therefore, we propose introducing the control flow through the intermediary of a synchronous FSMD. FSMDs are common in digital hardware design, and they are routinely applied in register-transfer-level designs. An FSMD is a synchronous model of computation combining a datapath and a finite-state controller. Computations are done on the datapath under control of the FSM [schaumont2012practical]. Each synchronous clock cycle, the FSM computes a single state transition and selects one or more operations in the datapath. The execution of datapath operations depends on the current state of the FSM, and the state transition conditions in the FSM depend on the current state of the datapath. Conditional control flow is expressed using dataflow-like semantics: the datapath will compute both the true and false case of the control condition, and the correct result will be selected using multiplexing.
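The multiplexing form of conditional control flow can be sketched in C as follows. This is a minimal illustration under our own naming: mux and cond_step are hypothetical helpers, and the two candidate results (XOR and AND) are arbitrary stand-ins for the true and false branches.

```c
#include <stdint.h>

/* Per-lane 2:1 multiplexer: for every bit position where sel is 1 the
 * bit of t is selected, otherwise the bit of f. No branch is taken. */
uint32_t mux(uint32_t sel, uint32_t t, uint32_t f)
{
    return (sel & t) | (~sel & f);
}

/* Hypothetical conditional "if (cond) r = a ^ b; else r = a & b;"
 * evaluated for 32 lanes at once. Both cases are always computed and
 * the condition only selects between the two results. */
uint32_t cond_step(uint32_t cond, uint32_t a, uint32_t b)
{
    uint32_t r_true  = a ^ b;
    uint32_t r_false = a & b;
    return mux(cond, r_true, r_false);
}
```

Because both cases are evaluated in every lane, the instruction sequence is identical whichever way each lane's condition resolves.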
To map a synchronous FSMD into software, we adopt a synchronous execution model as shown in Figure 1. Every loop iteration in this program corresponds to a single clock cycle of the synchronous FSMD model. The software awaits the occurrence of a clock tick to read all inputs and evaluate all outputs concurrently [caspi2007synchronous]. The eval() function in Figure 1 computes the FSM next-state as well as the datapath next-state. The update() function adjusts the current state of the FSM and the datapath to the next. State update is handled synchronously: each state variable in the synchronous FSMD is split into two copies, the current_state and the next_state. This avoids race conditions, and ensures that the program will always compute the same result regardless of how the operations inside eval() are scheduled. The PSP is implemented as N parallel copies of the program of Figure 1, where every thread is tied to the same global clock tick, and each thread is a 1-bit program expressed as a synchronous FSMD.
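The eval()/update() split described above can be sketched in C for N = 32 threads. The sketch is an assumption-laden illustration, not a reproduction of Figure 1: the FSMD is a hypothetical one-bit toggle with an enable input, and tick() is a hypothetical wrapper standing in for one pass through the synchronous loop.

```c
#include <stdint.h>

/* One state bit per thread: bit k of each word belongs to thread k. */
static uint32_t current_state;
static uint32_t next_state;

/* eval(): compute the output and the next state from the inputs and
 * the current state only. The current state is not modified here. */
uint32_t eval(uint32_t enable)
{
    next_state = current_state ^ enable;  /* toggle in enabled lanes */
    return current_state;                 /* output is the current state */
}

/* update(): commit the next state at the clock tick. */
void update(void)
{
    current_state = next_state;
}

/* Hypothetical wrapper: one synchronous clock cycle for all 32 threads. */
uint32_t tick(uint32_t enable)
{
    uint32_t out = eval(enable);
    update();
    return out;
}
```

Since eval() reads only current_state and writes only next_state, the order in which individual state bits are computed inside eval() cannot change the result.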
III Synthesis of Parallel Synchronous Software
We next demonstrate how to create PSP. We first describe the example PSP design of a parallel greatest common divisor program, and next discuss a design flow that synthesizes PSP software from a synchronous FSMD description.
Figure 2a shows the outline of a 4-bit GCD module. We express the functionality of the GCD algorithm as an FSMD model. After a start control pulse, the module reads two 4-bit inputs a and b, and repeatedly subtracts the smaller value from the larger value until they are equal. A done pulse is generated to indicate completion of the algorithm. A two-state control FSM drives the loading of two 4-bit registers a and b and their iterative computation.
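For reference, the subtract-based GCD algorithm described above can be written as a conventional (non-PSP) C function. This sketch is not taken from the paper: gcd_ref and its iteration counter are hypothetical, and the counter is included only to make the data-dependent number of loop iterations visible.

```c
#include <assert.h>

/* Hypothetical reference model of the subtract-based GCD.
 * The number of loop iterations depends on the input values. */
unsigned gcd_ref(unsigned a, unsigned b, unsigned *iterations)
{
    assert(a > 0 && b > 0);   /* the loop does not terminate for zero inputs */
    unsigned n = 0;
    while (a != b) {
        if (a > b)
            a -= b;           /* subtract the smaller from the larger */
        else
            b -= a;
        n++;
    }
    if (iterations)
        *iterations = n;      /* report how many subtractions were needed */
    return a;
}
```

For the 4-bit inputs 12 and 8 the loop runs twice, whereas for 15 and 1 it runs 14 times. This data-dependent trip count is what the FSMD formulation turns into a fixed amount of work per synchronous cycle.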
A PSP version of the GCD algorithm for a 32-bit processor executes 32 parallel copies of the GCD. We create this software by converting the FSMD to a gate-level netlist using logic synthesis. We target a generic technology with a logically complete set of primitive functions (such as AND, OR and NOT) as well as a storage element such as a flip-flop (Figure 2b). The outcome of the logic synthesis is a netlist in terms of logic elements. We then rewrite the netlist as a sequential function by leveling the netlist according to data dependencies from input to output. The logic cells are replaced by bit-wise operations, and the flip-flops are replaced by static (or global) variables. The resulting function declaration is as follows.
gcd_PSP(int a, int b,   // data input
        int q,          // data output
        int start,      // control in
        int* done);     // status out
Each invocation of this function corresponds to a single synchronous iteration (one clock cycle of the synchronous FSMD). An important difference between the circuit of Figure 2a and the PSP function in Figure 2b is the degree of parallelism. The circuit in Figure 2 computes a single GCD, whereas the gcd_PSP function is a software design that computes 32 concurrent GCD algorithms independently, each with its own start and done bits. The inputs and outputs of gcd_PSP are in bitsliced form. For example, the word that holds the second bit of a carries that bit for 32 different inputs. Hence, a call to gcd_PSP needs to transpose the input and output arguments.
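The transposition between conventional and bitsliced form can be sketched as follows. The helper names to_slices and from_slices are hypothetical, and the straightforward double loop is chosen for clarity rather than speed. It is not claimed to be the transposition used in our flow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert 32 values of nbits bits each into
 * bitsliced form. slices[bit] collects bit `bit` of all 32 values,
 * with value k stored in bit position k of the word. */
void to_slices(const uint8_t in[32], uint32_t *slices, int nbits)
{
    assert(nbits > 0 && nbits <= 8);
    for (int bit = 0; bit < nbits; bit++) {
        uint32_t w = 0;
        for (int lane = 0; lane < 32; lane++)
            w |= (uint32_t)((in[lane] >> bit) & 1u) << lane;
        slices[bit] = w;
    }
}

/* Hypothetical helper: inverse conversion, from bitsliced form back
 * to 32 conventional values. */
void from_slices(const uint32_t *slices, uint8_t out[32], int nbits)
{
    assert(nbits > 0 && nbits <= 8);
    for (int lane = 0; lane < 32; lane++) {
        uint8_t v = 0;
        for (int bit = 0; bit < nbits; bit++)
            v |= (uint8_t)(((slices[bit] >> lane) & 1u) << bit);
        out[lane] = v;
    }
}
```

Both loops have fixed bounds, so the transposition itself executes the same operations for all input values.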
An Automated Flow
We implemented a software synthesis flow for PSP that starts from an FSMD description in a Verilog program. An open-source Verilog synthesis tool [yosys] converts the FSMD into a netlist in terms of a generic target technology for Boolean logic and a state element. The target library for logic synthesis is adjusted according to the targeted processor. Table I demonstrates a sample mapping for several embedded processors. The state elements (flip-flops) are mapped to static variables.
The netlist is then converted to software as follows. The netlist is topologically sorted, from the primary inputs and flip-flop outputs to the primary outputs and the flip-flop inputs. Next, each primitive gate is converted to a bitwise operation which is either emulated in C or else added through inline assembly. We rely on the C compiler to create a sequential schedule for the gate netlist that will minimize the register pressure on the processor. The following section applies the automated flow on several examples.
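The shape of the resulting code can be illustrated with a hand-written C sketch for a small hypothetical netlist: a 2-bit counter with an enable input, built from two XOR gates, one AND gate and two flip-flops. The function counter_PSP and all net names are assumptions made for this example. The code is not output of the synthesis tool.

```c
#include <stdint.h>

/* Flip-flops of the netlist become static variables, one bit per lane. */
static uint32_t ff_q0, ff_q1;

/* One synchronous cycle of the hypothetical 2-bit counter, 32 lanes wide.
 * Statements follow a topological order of the netlist: every net is
 * computed only after all of its inputs are available. */
void counter_PSP(uint32_t en, uint32_t *q0, uint32_t *q1)
{
    /* level 0: primary input (en) and flip-flop outputs */
    uint32_t n_q0 = ff_q0;
    uint32_t n_q1 = ff_q1;

    /* level 1 */
    uint32_t n_d0 = n_q0 ^ en;    /* XOR gate: next value of bit 0 */
    uint32_t n_c  = n_q0 & en;    /* AND gate: carry into bit 1    */

    /* level 2 */
    uint32_t n_d1 = n_q1 ^ n_c;   /* XOR gate: next value of bit 1 */

    /* primary outputs: the current count */
    *q0 = n_q0;
    *q1 = n_q1;

    /* flip-flop inputs: commit the next state */
    ff_q0 = n_d0;
    ff_q1 = n_d1;
}
```

Each gate maps to exactly one bitwise C operator here. On a real target, a gate without a matching C operator or instruction would instead be emulated or emitted through inline assembly, as described above.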
Table I: Suitable instructions for PSP on several embedded processors.
| Processor | Suitable instructions for PSP |
|---|---|
| ARM Cortex-M4 | AND, BIC, EOR, MOV, MVN, ORN, ORR |
| RISC-V | AND, OR, XOR |
| MSP430 | AND, BIC, BIS, XOR |
| AVR | AND, COM, EOR, OR |
IV Experimental Results
[Table II: cipher examples. Columns: cipher, block size, key size, rounds, type, PSP, normal, speedup.]
[Table III: results. Column groups: performance and cost, instructions breakdown.]
We analyze our flow and the resulting performance using several examples. We target the MHz ARM Cortex-M4F processor, which comes with the Texas Instruments MSP432P401R Launchpad and implements the ARMv7E-M architecture. Table III summarizes our results. The numbers reported in this table were obtained from code compiled with size optimization (-Os).
The first two examples, GCD and PWM, illustrate the general-purpose nature of PSP as well as its real-time characteristics. For these examples, Table III lists the number of processor clock cycles per synchronous cycle. Computing 32 parallel GCDs thus takes 382 clock cycles per synchronous cycle, i.e., per iteration of the GCD while-loop.
The Pulse Width Modulator (PWM) generates pulses with a fixed period and different duty cycles. The PSP version of this function on a 32-bit architecture can generate 32 pulses with different duty cycles at the same time. Our implementation demonstrates a PWM with 8-bit resolution. The synchronous cycle of our PWM uses 239 ARM cycles, which provides a minimum pulse width of and a period of or .
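The relation between the cycle count and the PWM timing can be written out as follows. This is a sketch under two assumptions that are not stated explicitly above: the symbol $f_{\mathrm{clk}}$ denotes the processor clock frequency, and one PWM period spans $2^{8}$ synchronous cycles because of the 8-bit resolution.

```latex
t_{\min} = \frac{239}{f_{\mathrm{clk}}}, \qquad
T_{\mathrm{PWM}} = 2^{8}\, t_{\min} = \frac{61184}{f_{\mathrm{clk}}}, \qquad
f_{\mathrm{PWM}} = \frac{1}{T_{\mathrm{PWM}}} = \frac{f_{\mathrm{clk}}}{61184}
```

Here $t_{\min}$ is the minimum pulse width (one synchronous cycle), $T_{\mathrm{PWM}}$ is the PWM period and $f_{\mathrm{PWM}}$ is the corresponding PWM frequency.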
The second group of examples is taken from cryptography [beaulieu2015simon, presentcipher, guo2011led, banik2015midori]. Their characteristics are summarized in Table II. SIMON 128/128 is a block cipher with a Feistel structure that consists of repeated calls to the same round encryption routine. We used two different realizations of SIMON, the first one with a bit-parallel datapath and the second one with a bit-serial datapath [aysu2014simon]. In traditional hardware design, bit-serial methodologies are used to minimize area footprint at the expense of throughput. In the PSP execution model of software, we expect the lower gate count of a bit-serial input specification to translate to fewer bitwise operations in the program, and hence to a smaller code footprint. Further, we expect the bit-serial PSP design to have a lower throughput because less computation is done per synchronous clock cycle.
The first part of Table III shows that the models are small enough to fit on a simple embedded architecture. Furthermore, we observe that, as with the corresponding hardware designs, the bit-serial implementation of SIMON is smaller in code size than its bit-parallel counterpart, whereas the bit-parallel version has a higher throughput than the bit-serial version. The second part of Table III shows the overhead of data movement. The overhead values reported are calculated as the number of move instructions (MOV, STR, LDR) divided by the total number of instructions. Data movement accounts for about 45-60% of all instructions, which is expected for a straight-line program. For comparison, the data-moving overhead for a regular (non-bitsliced) implementation of SIMON on NEON in the SUPERCOP benchmark [supercopSimon] is 34%.
We compare our PSP designs of cryptographic ciphers with their available normal implementations in Table II. In the CRYPTREC lightweight project [CRYPTREC2017lightweight], the SIMON-128/128 and Midori-64 ciphers are implemented in software for the RL78 16-bit microcontroller. The throughputs of the PSP implementation of these ciphers in this work are respectively almost and higher. PRESENT-80 is evaluated in the FELICS [dinu2019triathlon] project on ARM Cortex-M3. Even though the implementation of PRESENT-80 in FELICS uses pre-computed keys, still the runtime of our PSP implementation of this cipher plus its key generation is approximately smaller. Furthermore, to show the repeatable-timing property of PSP, we compare the runtime of the PSP and non-PSP implementations of the GCD calculator for 1000 random inputs. As shown in Figure 3, the PSP implementation has a quantized runtime (in steps equal to the runtime of one PSP function call), whereas the runtime of the normal GCD function varies with an average of 580.475 and a standard deviation of 1969.29 clock cycles.
We presented parallel synchronous programming as a high-throughput, fixed-time model of programming, which is beneficial in safety-critical applications. We introduced an automated method for PSP code generation that can be implemented without any dependency on commercial tools. PSP generation can be customized for the target processor by defining custom libraries, which improves performance. Finally, through examples and discussions, we demonstrated the potential of parallel synchronous software.