
Investigating Black-Box Function Recognition Using Hardware Performance Counters

by   Carlton Shepherd, et al.
Royal Holloway, University of London

This paper presents new methods and results for learning information about black-box program functions using hardware performance counters (HPC), where an investigator can only invoke and measure function calls. Important use cases include analysing compiled libraries, e.g. static and dynamic link libraries, and trusted execution environment (TEE) applications. We develop a generic machine learning-based approach to classify a comprehensive set of hardware events, e.g. branch mis-predictions and instruction retirements, to recognise standard benchmarking and cryptographic library functions. This includes various signing, verification and hash functions, and ciphers in numerous modes of operation. Three major architectures are evaluated using off-the-shelf Intel/X86-64, ARM, and RISC-V CPUs. Following this, we develop and evaluate two use cases. Firstly, we show that several known CVE-numbered OpenSSL vulnerabilities can be detected using HPC differences between patched and unpatched library versions. Secondly, we demonstrate that standardised cryptographic functions executing in ARM TrustZone TEE applications can be recognised using non-secure world HPC measurements. High accuracy was achieved in all cases (86.22%–99.83%) under various compilation assumptions. Lastly, we discuss mitigations, outstanding challenges, and directions for future research.





1 Introduction

Modern central processing units (CPUs) support a range of hardware performance counters (HPCs) for monitoring run-time memory accesses, pipeline events (e.g. instructions retired), cache hits, clock cycles, and more. Today’s processors support anywhere from under 10 HPC events on constrained microcontroller units to over 100 on Intel and AMD server chips [19, 3]. Originally intended for optimisation and debugging purposes, HPC events have found a myriad of security applications. Examples include accessing accurate timing sources for cache attacks, e.g. for cryptographic key recovery; intrusion detection [27, 39]; malware detection [34, 1, 14]; maintaining control-flow integrity [43]; and reverse engineering proprietary CPU features [25, 18]. While the security implications of high-resolution HPCs have been acknowledged [29, 4, 5], they remain widely available on commercial platforms.

In this paper, we explore a novel use of HPCs that uses measurements collected and analysed en masse from multiple counters for identifying program function behaviour. A generic machine learning-driven approach is developed where target functions are classified using hardware performance events collected prior to and following their invocation. Calling and measuring exposed functions in this way can avoid time-consuming binary reverse engineering and patching—for example, to instrument precise code triggers—which is a major challenge in related research [32].

To this end, we present the results of a three-part study. §3 presents a preliminary analysis that evaluates the efficacy of identifying functions from a standard benchmarking suite (MiBench [17]) and four widely used cryptography libraries (WolfSSL, Intel’s Tinycrypt, Monocypher, and LibTomCrypt). We show how functions can be identified with 48.29%–83.81% accuracy (unprivileged execution) and 86.22%–97.33% (privileged), depending on the target architecture (X86-64/Intel, ARM Cortex-A, and RISC-V). Further, we examine correlations of HPC pairs to identify potentially redundant features, and use model inspection techniques to explain the decisions of HPC measurement classification models. This is used to identify the most effective HPCs for facilitating our approach’s generalisation to platforms that cannot measure an extensive range of hardware events.
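As an illustration of this kind of redundancy screen, the following sketch flags highly correlated event pairs on synthetic data; the event names, magnitudes, and the 0.95 threshold are illustrative assumptions, not values from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic HPC-style features: total cache misses and total cache
# accesses are made strongly correlated, branch mispredictions are not.
l1_tcm = rng.normal(5000, 500, n)
l1_tca = l1_tcm * 4 + rng.normal(0, 50, n)   # near-duplicate signal
br_msp = rng.normal(300, 40, n)              # independent signal

X = np.column_stack([l1_tcm, l1_tca, br_msp])
names = ["L1_TCM", "L1_TCA", "BR_MSP"]

corr = np.corrcoef(X, rowvar=False)

# Flag any pair whose absolute Pearson correlation exceeds 0.95 as
# potentially redundant; one member of each flagged pair could be dropped.
redundant = [(names[i], names[j])
             for i in range(len(names)) for j in range(i + 1, len(names))
             if abs(corr[i, j]) > 0.95]
print(redundant)
```

Dropping one member of each flagged pair shrinks the measurement burden, which matters on platforms exposing only a handful of programmable counters.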

Next, §4 explores an offensive use case for detecting the presence of security patches for known vulnerabilities, with applications as a reconnaissance method during security evaluations. We show how several CVE-numbered micro-architectural vulnerabilities can be recognised from OpenSSL (libcrypto) compiled libraries with 0.889–0.998 F1-score (89.58%–99.83% accuracy). In §5, we investigate recognising particular cryptographic algorithms within ARM TrustZone trusted execution environment (TEE) trusted applications (TAs). Using OP-TEE—a GlobalPlatform-compliant, open-source TEE platform—and a comprehensive range of functions from the GlobalPlatform TEE Client API [15], we show how a Spy process can classify functions within a victim TA with 95.50% accuracy. Finally, §6 provides a security analysis and discusses potential software- and system-level mitigations.

1.1 Threat Model

We consider scenarios where an attacker aims to identify particular algorithms under execution given only high-level function calls. The attacker is assumed to have limited knowledge of the target function’s actual implementation. This is often the case when analysing software with no debug symbols or function/variable names, and where optimisation and code obfuscation techniques inhibit readability. Example applications are the analysis of compiled shared libraries, e.g. Microsoft Windows dynamic link libraries (.dll) and Linux shared objects (.so). A similar approach is commonly used by TEE applications, e.g. TrustZone TAs, where only high-level functions are exposed to untrusted, non-secure world software [15]. The attacker is assumed to possess the ability to measure the side-effects of function invocation and execution on CPU HPCs. They may be a privileged attacker with the ability to introduce new kernel modules, or to otherwise execute ring 0/kernel-mode code, for configuring and accessing HPCs. The attacker is also assumed to possess oracle access for collecting HPC measurements from idempotent functions, without restrictions on the number of permitted invocations.

1.2 Contributions

This paper presents the following contributions:

  • The design and evaluation of an HPC data-driven, machine learning-based approach for function recognition. We evaluate this using a test-bed of standard cryptographic and non-cryptographic algorithms taken from widely used cryptographic libraries and MiBench [17]. Results from three major architectures are provided from commercially available RISC-V, ARM and X86-64 (Intel) CPUs. We examine the effect of different privilege levels and compilation optimisation assumptions on overall effectiveness. Moreover, a feature importance analysis is conducted for determining a strong minimal set of HPCs for facilitating generalisation.

  • Methods and results of two use cases: (1) detecting unpatched/patched versions of OpenSSL for several CVE-numbered vulnerabilities; and (2) recognising GlobalPlatform cryptographic functions executing within OP-TEE, a GlobalPlatform-compliant ARM TrustZone TEE implementation. This is a passive vector that is difficult to mitigate in software against privileged adversaries. In both cases, experimental results indicate that HPCs can effectively classify and recognise target program functions.

2 Background

This section discusses related literature and presents background information on CPU performance events.

2.1 Related Work

HPCs have found utility in a range of offensive and defensive security applications, particularly as precise measurement sources for mounting and defending against micro-architectural attacks, e.g. from cache timings and speculative execution [39, 2, 20, 27]. A significant body of work has studied their use for malware detection in static and online environments, including rootkit, cryptocurrency miner, and ransomware detection [11, 45, 37, 34, 14, 1]. A general approach involves instrumenting target binaries with procedures to acquire counter measurements from the host CPU’s performance monitoring unit (PMU). Measurements are used to construct bespoke statistical or machine learning models from known malware and benign software samples using supervised and unsupervised methods.

For intrusion detection, Yuan et al. [44] developed Eunomia, which traps and analyses sensitive syscalls during suspicious program execution. A trusted monitor analyses preceding HPC measurements and permits/denies the call to prevent code injection, return-to-libc, and return-oriented programming (ROP) attacks. Xia et al. [43] tackle the problem of control-flow integrity using Intel’s Branch Trace Store (BTS), a buffer within Intel PMUs for storing control transfer events (e.g. jumps, calls, returns, and exceptions). The proposal builds legal sets of target addresses offline with which branch traces of suspect applications are compared at run-time. Payer [27] proposed HexPADS, which examines performance events of target processes for detecting Rowhammer, cache-based side-channels, and cross-VM address space layout randomisation (ASLR) breakages.

In other work, Copos and Murthy [9] proposed InputFinder—a fuzzer that uses HPCs to build valid inputs for closed binaries. Binaries are initially instrumented for measuring the instruction retirement HPC, before executing the program under various inputs. Counter measurements are used to determine control flow changes upon providing valid inputs using differences in the number of executed instructions. Spisak [35] developed a kernel-mode HPC-based rootkit family using PMU interrupts to trap system events, e.g. system calls. It demonstrates that even events from ARM TrustZone TAs can be analysed on some consumer devices using PMU perturbations between secure and non-secure world execution. Malone et al. [24] explored the feasibility of performance counters for static and dynamic software integrity verification. Measurements from six HPCs are presented using four test programs, but effectiveness results are not given using standard evaluation metrics. Later, Wang et al. [41] used similar methods for embedded systems firmware verification. The approaches compare the similarity of measurements taken from different program inputs at installation-time with those collected at run-time.

For reverse engineering, Maurice et al. [25] use per-slice PMU access counter measurements to determine the cache slice assigned to a last-level cache (LLC) complex address for enabling cross-core LLC cache attacks. Helm et al. [18] use HPCs for understanding Intel’s proprietary DRAM mapping mechanism for translating physical addresses to physical memory channels, banks, and ranks. Similarly, the work uses differences in per-channel PMU transfer counters while accessing different known physical addresses.

Contrastingly, we explore a novel application of HPCs to function recognition and vulnerability detection using only measurements collected around invocations. We collect and analyse HPC measurements in bulk using large numbers of available events, which we evaluate using X86-64, ARM and RISC-V platforms; OpenSSL; and a TrustZone TEE.

2.2 Performance Monitoring Units (PMUs)

Performance counters are available on all major CPU architectures within hardware PMUs, enabling the collection of detailed events with negligible overhead. HPCs are configured and accessed through special-purpose registers that update during execution. While different CPU architectures may count the same types of events, their availability and accessibility can differ significantly. We briefly describe the mechanics of PMUs on X86-64, ARM, and RISC-V.

2.2.1 X86

The Intel PMU, introduced on Intel Pentium CPUs, provides non-configurable registers for tracking fixed events and several programmable registers per logical core [19]. Intel Core CPUs support four general-purpose programmable registers and three fixed-function registers tracking elapsed core cycles, reference cycles, and retired instructions. Programmable registers can be configured to simultaneously monitor one of over 100 performance events on the Intel Xeon and Core architectures, such as branch mis-predictions and hits at various cache hierarchy levels [19]. PMUs implement configuration and counter registers as model-specific registers (MSRs), which are accessed using the RDMSR and WRMSR instructions in ring 0 (kernel mode). Alternatively, the RDPMC instruction can be used for accessing PMU counters at lower privilege levels if the CR4.PCE control register bit is set. By default, ring 3 (user mode) access is granted to the monotonic time-stamp counter (TSC) in a 64-bit register using the RDTSC instruction. In addition to precise event counting, event-based sampling is supported for triggering a performance monitoring interrupt after a counter exceeds a configured threshold value (e.g. after a given number of events). AMD CPUs feature minimal functional differences to Intel implementations for counting specific events, but do contain more programmable counters (six vs. four) [27, 3, 19].

2.2.2 Arm

The ARM PMU is a ubiquitous extension for ARM Cortex-A, -M, and -R processors. It supports per-core monitoring of similar high-level events to X86 CPUs, but, generally speaking, fewer events are available to accommodate simpler mobile and embedded systems designs. The ARM Cortex-A53, for example, contains a smaller cache hierarchy with a shared L2 cache at the highest level, precluding the ability to report L3 instruction or data cache measurements [4]. Like X86, fixed-function cycle counters are commonplace, and 2–8 general-purpose counters can be programmed to monitor events using the MRS and MSR instructions (AArch64). PMU registers can be configured to be accessible at any privilege mode (exception level) using the performance monitors control register (PMCR). Typically only kernel-mode (EL1) processes may access PMU registers by default, however. ARM PMUs may also assert nPMUIRQ interrupt signals, e.g. upon counter overflows, which can be routed to an interrupt controller for prioritisation and masking.

2.2.3 Risc-V

The RISC-V Privileged [30] and Unprivileged [29] ISA specifications define separate mechanisms for accessing performance counters in different privilege modes. The Unprivileged ISA specifies 32 64-bit per-core counters accessible in user- and supervisor-modes (U-mode and S-mode), comprising three fixed-function counters for the cycle count (RDCYCLE), real-time clock (RDTIME), and instruction retirements (RDINSTRET). Control and status register (CSR) space is reserved for the remaining 29 vendor-specific 64-bit programmable HPC registers (HPMCOUNTER3–HPMCOUNTER31). The Privileged ISA specifies analogous CSR registers (MCYCLE, MTIME, MINSTRET) that can be accessed only in machine-mode (M-mode), alongside 29 64-bit registers (MHPMCOUNTER3–MHPMCOUNTER31) for generic, vendor-specific programmable HPCs. Note that RISC-V embedded systems are expected to possess only M-, or M- and U-modes [30, 31], while workstations and servers are expected to support supervisor (S-)mode and the forthcoming hypervisor (H-)mode extensions [29].

3 Function Recognition: A Preliminary Study on X86, ARM, and RISC-V

This section presents a foundational study on recognising a large range of non-cryptographic and cryptographic functions using HPC measurements taken before and after invoking target functions. We discuss the methodological approach and the tested functions, before proceeding to implementation challenges and results.

3.1 Overview

We assume two processes shown in Fig. 1: a Spy controlled by the adversary, and a Victim that exposes high-level functions. In practical cases, the Victim will assume the form of a compiled static or shared library with which the Spy is statically or dynamically linked. Our approach then follows two steps: (1) HPC measurements corresponding to each function are assigned labels according to its identifier, which are used as feature vectors for training a classifier. Next, (2) the Spy uses newly measured values and the model hypothesis from (1) to infer the executed function. We evaluate this using three devices with various architectures and capabilities:

Fig. 1: High-level approach.
  • X86-64. Dell Latitude 7410 with 8GB RAM and an Intel i5-10310U: 1.70GHz 64-bit quad-core, eight hyperthreads, and 256kB L1, 1MB L2, and 6MB L3 caches. Ubuntu 20.04 LTS was used with Linux kernel v5.12.

  • ARM. Raspberry Pi 3B+ with 1GB SDRAM and a Broadcom BCM2837 system-on-chip (SoC): 1.4GHz 64-bit quad-core ARM Cortex-A53 CPU, with 32kB L1, 32kB L2, and no L3 cache. Raspbian OS was used, based on Debian 11/Bullseye, with Linux kernel v5.15.

  • RISC-V. SiFive HiFive Rev. B with a FE310-G002 SoC: 320MHz 32-bit single-core CPU with RV32IMAC ISA support, 6kB L1 instruction cache, and 16kB SRAM. Supports privileged (M-) and unprivileged execution (U-mode). SiFive’s Freedom E SDK was used as a hardware abstraction layer for application development.

We used PAPI [38] on our X86-64 and ARM devices, which provides portable HPC measurement acquisition using the Linux perf subsystem. CPUs often expose extremely precise access to specific micro-architectural events, e.g. instruction pipeline, DRAM controller, and proprietary feature events (e.g. Intel TSX), which prevents cross-platform compatibility. To overcome this, PAPI implements micro-architecture-dependent code and abstracts a set of commonly found high-level events, which we used for portability (shown in Table I). PAPI was compiled and linked with our benchmarking application that implemented the functions under test (§3.2) on X86-64 and ARM. For RISC-V, the aforementioned assembly instructions were used for accessing HPC registers (§2.2.3).

Method Description X86-64 ARM RISC-V
RDTSCP Cycle count since a reset.
L1_DCM L1 data cache misses.
L1_ICM L1 instruction cache misses.
L1_TCM L1 total cache misses.
L1_LDM L1 load misses.
L1_DCA L1 data cache accesses.
L1_STM L1 store misses.
L2_DCM L2 data cache misses.
L2_ICM L2 instruction cache misses.
L2_TCM L2 total cache misses.
L2_LDM L2 load misses.
L2_DCR L2 data cache reads.
L2_STM L2 store misses.
L2_DCA L2 data cache accesses.
L2_ICR L2 instruction cache reads.
L2_ICH L2 instruction cache hits.
L2_ICA L2 instruction cache accesses.
L2_TCA L2 total cache accesses.
L2_TCR L2 total cache reads.
L2_TCW L2 total cache writes.
L3_TCM L3 total cache misses.
L3_LDM L3 load misses.
L3_DCA L3 data cache accesses.
L3_DCR L3 data cache reads.
L3_DCW L3 data cache writes.
L3_ICA L3 instruction cache accesses.
L3_ICR L3 instruction cache reads.
L3_TCA L3 total cache accesses.
L3_TCR L3 total cache reads.
L3_TCW L3 total cache writes.
CA_SNP Requests for a cache snoop.
CA_SHR Requests for exclusive access to a shared cache line.
CA_CLN Requests for exclusive access to a clean cache line.
CA_ITV Requests for cache line intervention.
TLB_DM Data TLB misses.
TLB_IM Instruction TLB misses.
PRF_DM Data pre-fetch cache misses.
MEM_WCY Cycles stalled for memory writes.
STL_ICY Cycles with no instruction issue.
FUL_ICY Cycles with maximum instruction issue.
BR_UCN Unconditional branch instructions.
BR_CN Conditional branch instructions.
BR_TKN Conditional branches taken.
BR_NTK Conditional branch instructions not taken.
BR_MSP Conditional branch mispredictions.
BR_PRC Conditional branches correctly predicted.
TOT_INS Instructions completed.
LD_INS Load instructions.
SR_INS Store instructions.
BR_INS Branch instructions.
RES_STL Cycles stalled on any resource.
TOT_CYC Total cycles executed.
LST_INS Load and store instructions executed.
SP_OPS Single-precision floating point operations.
DP_OPS Double-precision floating point operations.
VEC_SP Single precision SIMD instructions.
VEC_DP Double precision SIMD instructions.
RDCYCLE U-mode reference cycle count.
RDINSTRET U-mode instructions retired.
RDTIME U-mode real-time counter.
MCYCLE M-mode cycle counter.
MINSTRET M-mode instructions retired.
MTIME M-mode real-time counter.
  • X86-64: Intel i5-10310U; ARM: ARM Cortex-A53; RISC-V: SiFive FE310.

  • Blue cells indicate user-mode counters; yellow denote privileged counters.

TABLE I: HPC events used on a per-device basis.

3.2 Target Procedures

For data acquisition, we developed a test-bed comprising 64 functions listed in Table II, incorporating the MiBench benchmarking suite with common embedded systems procedures [17], alongside modern cryptographic algorithms. (Note that standard cryptographic libraries tend not to expose separate functions for generating and verifying HMACs and GMACs, which was also reflected in this work.)

For the latter, we used an extensive range of common functions available through WolfSSL in addition to reference implementations of GOST, Speck, and PRINCE. To account for measurement differences from parameter selection, our benchmarking tool used random input buffers of 8B, 32B, 256B, 512B, 1024B, 2048B, and 4096B for encryption, signing, MAC, and hashing algorithms. Random keys were generated for all symmetric algorithms and keyed MACs (e.g. AES in all modes, ChaCha20, PRINCE, DES and HMAC). Similarly, random public-private key pairs and secret and public values were generated for asymmetric and key exchange algorithms respectively (e.g. ECDSA, X25519, ECDH, and RSA). This setup process was excluded from HPC measurements. Fresh initialisation vectors (IVs), counter values and nonces were also generated where applicable (e.g. ChaCha20 and AES-CTR). RSA operations were split approximately equally using a random padding scheme and key length. RSAES-OAEP or RSAES-PKCS#1v1.5 were used for encryption/decryption, and RSASSA-PSS or RSASSA-PKCS#1v1.5 for signing/verification under 1024-, 2048-, 3072-, and 4096-bit key lengths. Similarly, ECDSA and ECDH used a random curve from P-256, P-384, and P-521, while AES used 128-, 192-, and 256-bit key lengths. For ease of implementation, corresponding inverse operations (e.g. verification for signing) were called with the same parameters immediately following the original operation.
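As a sketch of this per-invocation parameter randomisation (the helper name and the AES-128-CBC key/IV sizes are illustrative assumptions, not the paper's harness):

```python
import os
import random

# Input buffer sizes used by the benchmarking tool (bytes).
BUFFER_SIZES = [8, 32, 256, 512, 1024, 2048, 4096]

def fresh_parameters(rng: random.Random) -> dict:
    """Generate fresh randomised parameters for one invocation of a
    symmetric cipher (hypothetical helper; sizes suit AES-128-CBC)."""
    return {
        "plaintext": os.urandom(rng.choice(BUFFER_SIZES)),
        "key": os.urandom(16),   # fresh 128-bit key per invocation
        "iv": os.urandom(16),    # fresh IV per invocation
    }

rng = random.Random(42)
params = fresh_parameters(rng)
print(len(params["plaintext"]), len(params["key"]), len(params["iv"]))
```

Generating parameters outside the measured region, as the text describes, keeps key/IV generation costs out of the recorded HPC values.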

Non-cryptographic Functions
1. Solve Cubics; 2. Integer Sqrt; 3. Angle Convert; 4. Qsort; 5. Dijkstra; 6. PBM Search; 7. FFT; 8. Bsearch; 9. CRC32; 10. BWT; 11. Matrix Mul.; 12. Bit Ops.; 13. Gaussian Elim.; 14. Fibonacci; 15. Euler; 16. Simpson; 17. Root Finding; 18. XOR-Shift; 19. Base64 Encode; 20. Base64 Decode; 21. Entropy.
Cryptographic Functions
22-23. AES-ECB (E+D); 24-25. AES-CBC (E+D); 26-27. AES-CTR (E+D); 28-29. AES-GCM (E+D); 30-31. ChaCha20 (E+D); 32-33. ChaCha20+Poly1305 (E+D); 34-35. Speck (E+D); 36-37. PRINCE (E+D); 38-39. DES (E+D); 40-41. 3DES (E+D); 42-43. GOST (E+D); 44-45. Ed25519 (S+V); 46-47. ECDSA (S+V); 48. X25519; 49. ECDH; 50-53. RSA (E+D, S+V); 54. DH; 55. HMAC; 56. GMAC; 57. Poly1305; 58. MD2; 59. MD4; 60. MD5; 61. SHA-1; 62. SHA-256; 63. SHA-3; 64. BLAKE2.
  • E+D: Encrypt and decrypt. S+V: Sign and verify.

TABLE II: Test-bed target functions.

3.3 Methodology

HPC events from 10,000 instances were collected for each of the 64 procedures in our test-bed. Each instance comprised one execution per event available on the given platform from those in Table I, thus yielding an n-dimensional feature vector (where n is the number of available events) corresponding to the class label of the procedure ID. This required 31.3M executions (X86-64), 8.3M (ARM), and 4.4M (RISC-V). From this, we formed two data sets: A and B, representing events available in unprivileged and privileged mode respectively.

Next, each data set was split into training and test sets using an 80:20 ratio. Z-score normalisation was applied to produce features with zero mean and unit variance, which were used for training nine supervised classifiers, including naïve Bayes (NB), logistic regression (LR), linear discriminant analysis (LDA), decision tree (DT), gradient boosting machines (GBM), random forest (RF), support vector machine (SVM), and a multi-layer perceptron (MLP). Hyper-parameter optimisation was conducted using an exhaustive grid search and k-fold cross-validation (CV), with the best-performing CV classifier used to score the test set. Given the relatively large, balanced, many-class classification problem, conventional classification accuracy was used as the evaluation metric. The process was orchestrated by a Python script using Scikit-learn on a workstation with an Intel i7-6700k CPU (quad-core at 4.0GHz) and 16GB RAM. The data acquisition and training processes were repeated for each device under different GCC optimisation parameters (discussed further in §3.4.5). In total, the data acquisition and training procedures required approximately 14 hours and 30 hours respectively on our hardware.
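The training pipeline above can be sketched with Scikit-learn; the synthetic "HPC" features, the four classes, and the much smaller parameter grid are illustrative stand-ins for the real measurements and search space:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for HPC feature vectors: four "functions" (classes),
# each with a distinct mean event profile plus run-to-run noise.
n_per_class, n_events = 200, 12
X = np.vstack([rng.normal(loc=10 * c, scale=1.0, size=(n_per_class, n_events))
               for c in range(4)])
y = np.repeat(np.arange(4), n_per_class)

# 80:20 train/test split, as in the methodology.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Z-score normalisation followed by a random forest, with a (toy)
# exhaustive grid search under k-fold cross-validation.
pipe = make_pipeline(StandardScaler(),
                     RandomForestClassifier(random_state=0))
grid = GridSearchCV(pipe,
                    {"randomforestclassifier__n_estimators": [10, 50]},
                    cv=5)
grid.fit(X_tr, y_tr)
acc = grid.score(X_te, y_te)
print(f"test accuracy: {acc:.2f}")
```

Placing the scaler inside the pipeline ensures normalisation statistics are fitted only on each CV training fold, avoiding leakage into the held-out folds.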

3.4 Implementation Challenges

Using HPCs requires careful thought to avoid measurement bias and related pitfalls. Some phenomena have been investigated in related literature [10, 42], while others have attracted little attention to date, e.g. cache warming and optimising compilers. We discuss these challenges below.

3.4.1 HPC Accuracy

Run-to-run variations in HPC measurements are a challenge for program analysis on commercial CPUs, which was analysed by Weaver et al. [42], and later by Das et al. [10] in security contexts. Counter measurements are not perfectly replicated between sequential program executions, with 0.5%–2% error observed on standard benchmarks [42]. These differences are due principally to:

  • Event non-determinism. External events, particularly hardware interrupts and page faults [42, 10], can cause small deviations in vulnerable HPC events, e.g. load and store counters, which are unpredictable and difficult to reproduce. Another source of variation is the pipeline effects arising, for instance, from out-of-order execution. This can also perturb absolute counter values, although the effects are negligible if the monitoring of very short instruction sequences is avoided [4].

  • Overcount. CPU implementation differences and errata can overcount events. For example, instructions retired events can be overcounted by exceptions and pseudo-instructions where micro-coded instructions differ from the instructions that are actually executed. Specifically, X87/SSE exceptions, OS lazy floating point handling, and the fldcw, fldenv, frstor, maskmovq, emms, cvtpd2pi, cvttpd2pi, sfence, and mfence instructions are known overcount sources [42].

The effects of non-determinism cannot be entirely eliminated. Rather, we model interactions from many individual measurements (millions of features) as a mitigation against single, run-to-run variations. This assumes the ability to measure large numbers of target functions idempotently, i.e. their behaviour changes insignificantly between invocations, and without restriction regarding the number of permitted calls (see §1.1). Importantly, we do not rely upon exact measurements, but only that the deviations between different function executions confer enough discriminative power for training a classifier. Moreover, we use measurements from multiple counters to maximise the discriminative benefits of disparate event types. This also minimises potential issues of focussing on any single counter (e.g. overcounting biases). We also note that X86 CPUs can measure the number of hardware interrupts—a known source of non-determinism—which was used as a feature (Table I).
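The intuition that aggregating many noisy measurements suppresses run-to-run variation can be illustrated with a quick simulation; the "true" counter value is invented, and the 2% noise bound is taken from the upper end of the error range reported above:

```python
import numpy as np

rng = np.random.default_rng(7)

TRUE_COUNT = 1_000_000   # hypothetical true event count for one function
NOISE = 0.02             # 2% run-to-run variation (upper bound from [42])

# Single measurements deviate by up to ~2%; the mean over many repeated
# invocations deviates far less (error shrinks roughly as 1/sqrt(n)).
runs = TRUE_COUNT * (1 + rng.uniform(-NOISE, NOISE, size=10_000))
mean_err = abs(runs.mean() - TRUE_COUNT) / TRUE_COUNT
print(f"mean relative error over 10k runs: {mean_err:.5f}")
```

A classifier trained on many such measurements therefore sees the stable between-function differences rather than the per-run noise.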

3.4.2 Function Implementation Variations

It is important to discern between multiple implementations of the same algorithm. Statistical models may fail to generalise when presented with HPC events of alternative libraries that, while similar algorithmically, contain implementation deviations that can affect HPC measurements (e.g. API calling conventions). Considering this, we ported additional implementations from Monocypher (for ChaCha20, Poly1305, Ed25519, X25519, BLAKE2), Intel’s Tinycrypt (AES, ECDSA, ECDH, HMAC, SHA-256), and LibTomCrypt (AES, MD2, MD4, MD5, SHA-1, SHA-256, 3/DES, RSA, DH, GMAC, HMAC) on our target platforms. (The choice of these libraries was principally pragmatic, to construct a common test-bed; very few cryptographic libraries, e.g. OpenSSL and GnuTLS, currently offer 32-bit RISC-V MCU support.) For these ‘conflicted’ functions, HPC measurements were taken by cycling through each implementation for each invocation, and mapping the measurement vector to the same label. For instance, HPC measurements from LibTomCrypt and WolfSSL implementations were assigned the same label for MD5 function executions.
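A minimal sketch of this labelling scheme, with hypothetical implementation identifiers, cycles round-robin through the available implementations while always assigning the same class label:

```python
from itertools import cycle

# Hypothetical registry: multiple implementations of the same algorithm
# all map to one class label, and invocations cycle through them.
IMPLEMENTATIONS = {
    "MD5": ["wolfssl_md5", "libtomcrypt_md5"],
    "AES-CBC-ENC": ["wolfssl_aes", "tinycrypt_aes", "libtomcrypt_aes"],
}

def measurement_schedule(func, n_invocations):
    """Return (implementation, label) pairs for n invocations,
    cycling round-robin through the available implementations."""
    impls = cycle(IMPLEMENTATIONS[func])
    return [(next(impls), func) for _ in range(n_invocations)]

sched = measurement_schedule("MD5", 4)
print(sched)
```

Because every implementation maps to one label, the trained model learns the algorithm rather than any single library's calling conventions.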

3.4.3 Cache Warming

Preliminary experiments showed that cache-based events, e.g. L1 and L2 misses, and their correlated values—discussed further in §3.6.1—were markedly higher during initial function executions. This subsided after several executions per function, converging to stability within 3% after an average of 9.7 executions (X86-64), 7.0 (ARM), and 7.3 (RISC-V). We hypothesise that cache warming was partially responsible, which has not been identified as a challenge with using HPCs for security applications in related literature. Within the context of this work, the reliance on large numbers of measurements renders this effect negligible (under 0.5%). However, the reader is warned of potential bias in ‘one-shot’ and related environments where only a limited number of measurements can be acquired.
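A simple convergence check of the kind implied above might discard initial executions until measurements settle within the 3% band; the miss counts and helper below are illustrative, not data from the study:

```python
def warmup_length(measurements, tolerance=0.03):
    """Number of initial executions to discard before a cache-related
    counter settles to within `tolerance` of its steady-state value
    (approximated here by the final measurement). Illustrative sketch."""
    steady = measurements[-1]
    for i, m in enumerate(measurements):
        if abs(m - steady) / steady <= tolerance:
            return i
    return len(measurements)

# Synthetic L1-miss counts: high on cold caches, converging afterwards.
misses = [9400, 4100, 2300, 1450, 1220, 1190, 1185, 1184, 1184, 1184]
print(warmup_length(misses))
```

In 'one-shot' settings such a warm-up cut is unavailable, which is exactly the bias the paragraph above warns about.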

3.4.4 Simultaneous HPC Measurements

Today’s PMUs support only a small number of programmable counter registers with which to monitor events, typically 4–8 depending on the architecture [19, 3]. Consequently, acquiring measurements from several counters necessitates: (1) using a single counter register per execution; (2) batching the collection of HPC measurements, with the batch size equalling the number of supported counter registers; or (3) using software multiplexing provided by some tools, e.g. PAPI, which subdivides counter usage via time-sharing over a large number of performance events. We used (1) as a conservative method at the cost of execution time. Using (2), execution time could be reduced by a constant factor (of the max. supported counters), while (3) minimises execution time at the cost of accuracy [38].
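The execution-count trade-off between strategies (1)-(3) can be sketched as follows; the helper and the event/counter numbers are illustrative, not taken from the paper:

```python
import math

def executions_needed(n_events, n_counters, strategy):
    """Executions needed to cover n_events given n_counters programmable
    registers, under the three collection strategies (sketch only)."""
    if strategy == 1:            # one counter per execution
        return n_events
    if strategy == 2:            # batches of size n_counters
        return math.ceil(n_events / n_counters)
    return 1                     # multiplexing: one, less accurate, run

# E.g. 48 events on a PMU with four programmable counters.
print(executions_needed(48, 4, 1), executions_needed(48, 4, 2))
```

This makes the constant-factor saving of strategy (2) explicit, alongside the single (but multiplexed and therefore approximate) run of strategy (3).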

3.4.5 Optimising Compilers

Modern compilers improve performance and code size at the expense of compilation time and debuggability. Example optimisations include loop unrolling, if-conversions, tail-call optimisation, inline expansion, and register allocation. This is an important consideration given the difficulty of ascertaining a priori the exact parameters under which target software, e.g. a dynamic library, was compiled. Optimisation parameters can have a material effect on HPC measurements for the same program. Choi et al. [8], for instance, showed that if-conversions can reduce branch mis-predictions by 29% on Intel CPUs, which would reflect in branch-related HPCs.

To address this, we compiled the test-bed and collected measurements from test devices under different GCC optimisation options. Flags were used for disabling GCC optimisation (O0, the default setting); code size optimisation (Os), often used for memory-constrained targets, e.g. embedded systems; and maximum optimisation (O3), which sacrifices compile time and memory usage for performance. The reader is referred to the GCC documentation for a comprehensive breakdown of the optimisations utilised for each flag [13]. To emulate environments where compiler optimisations are unknown, we created a mixed data set by concatenating and randomly shuffling measurement feature vectors collected under each optimisation parameter.
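The mixed data set construction, concatenating per-flag feature matrices and shuffling them jointly with their labels, can be sketched as follows (shapes and values are synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)

# Per-optimisation-level measurement matrices (rows: feature vectors)
# and their function labels; the shapes here are illustrative.
per_flag = {flag: (rng.normal(size=(100, 8)), rng.integers(0, 4, 100))
            for flag in ["O0", "Os", "O3"]}

# Mixed set: concatenate all flags' vectors, then shuffle jointly so
# the classifier cannot exploit compilation-order structure.
X = np.vstack([Xf for Xf, _ in per_flag.values()])
y = np.concatenate([yf for _, yf in per_flag.values()])
perm = rng.permutation(len(y))
X_mixed, y_mixed = X[perm], y[perm]
print(X_mixed.shape, y_mixed.shape)
```

Shuffling features and labels with the same permutation keeps each vector paired with its function identifier while removing any ordering by optimisation flag.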

3.5 Results

(a) Privileged HPCs.
(b) Unprivileged HPCs.
Fig. 2: Normalised confusion matrices for X86-64 (X-axis: Predicted value, Y-axis: True value).
O0 49.30 50.24 60.24 59.20 66.31 65.79 61.56 54.44 51.83
A Os 49.22 50.63 58.30 64.46 66.07 66.56 61.87 62.43 54.62
O3 48.48 49.30 59.08 60.27 59.81 49.71 35.57 45.46 47.28
Mixed 37.21 32.80 48.29 48.03 47.67 40.52 26.30 45.39 37.02
O0 98.95 96.47 89.25 99.53 99.70 99.02 94.31 99.04 86.55
B Os 93.03 89.27 87.71 95.89 95.99 97.49 88.29 92.35 90.06
O3 87.55 86.08 82.75 92.01 98.51 83.78 74.44 80.53 84.60
Mixed 84.78 73.95 85.14 95.40 97.33 91.01 80.86 83.78 85.95
  • A: User-mode performance counters only, B: User- and machine-mode performance counters.

TABLE III: Classification accuracy (Intel i5-10310U; in %, best scores in bold).
O0 85.65 87.27 82.34 96.87 96.74 96.81 80.42 83.65 84.01
Os 77.91 77.45 82.70 90.31 88.59 91.32 86.28 84.13 84.30
O3 79.21 82.99 82.85 88.90 91.40 89.43 74.37 87.62 74.82
Mixed 73.97 70.08 80.35 88.96 88.99 90.68 74.00 80.37 69.09
  • A: User-mode performance counters only, B: User- and machine-mode performance counters.

TABLE IV: Classification accuracy (ARM Cortex-A53).
O0 76.09 74.59 78.30 85.55 89.04 89.00 87.10 81.67 80.28
A Os 72.84 73.02 79.61 86.47 83.40 86.91 80.35 77.53 77.54
O3 72.88 74.94 83.99 85.10 87.42 86.80 76.75 74.32 75.02
Mixed 66.87 66.23 83.60 82.16 83.81 75.68 70.07 68.15 67.22
O0 82.83 83.46 82.02 90.87 93.34 91.99 86.57 81.05 83.30
B Os 80.19 78.67 79.44 87.23 90.04 90.03 75.11 72.84 75.56
O3 81.01 76.48 88.29 86.32 86.37 87.41 79.85 74.32 80.20
Mixed 76.74 74.02 69.81 84.59 86.22 86.21 70.03 71.40 70.90
  • A: User-mode performance counters only, B: User- and machine-mode performance counters.

TABLE V: Classification accuracy (SiFive E31 SoC).

Classification results for each architecture and GCC compilation setting are presented in Tables III (X86-64), IV (ARM), and V (RISC-V). In general, test-bed functions could be classified from HPC values with considerable effectiveness, albeit with notable differences with respect to architecture and privilege mode. Worst-case performance occurred where only unprivileged counters were used with the mixed GCC data sets, i.e. 48.29% (X86-64; one HPC) and 83.81% (RISC-V; three HPCs). Mixed compilation parameters generally correlated with a significant accuracy degradation of 2–16% depending on the architecture and privilege mode. In contrast, the best cases corresponded to scenarios where all HPCs were used for specific compilation settings: 97.33%–99.70% (X86-64), 90.68%–96.81% (ARM), and 86.22%–93.34% (RISC-V). We can tentatively conclude that using more HPCs confers greater discriminative power during classification, the availability of which is maximised during privileged execution. Further, tree-based models and ensembles tended to perform best out of all evaluated classifiers (18/20), with RF classifiers the best-performing (11/20).

Confusion matrices were also produced to examine points of weakness across particular classes using the mixed GCC data sets and the best-performing classifiers, shown in Fig. 2 for X86-64 (see Appendix A for ARM and RISC-V). We observe confusion occurring between functions with structural similarities when fewer counters are used (Fig. 2(b)); for instance, the encryption and decryption functions for ChaCha20-Poly1305, DES, 3DES, and AES in certain modes of operation (e.g. CTR and CBC), and RSA encryption, decryption, signing and verification. These errors were largely resolved when more HPCs were available, which markedly improved classification accuracy.
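The row-normalised confusion matrices of Fig. 2 can be produced by dividing each row (true class) by its total, so that entries become per-class recall fractions. A minimal sketch with toy labels:

```python
# Row-normalised confusion matrix: entry [t, p] is the fraction of samples
# of true class t predicted as class p. Labels are toy values.
import numpy as np


def normalised_confusion(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    # Avoid division by zero for classes absent from y_true.
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = normalised_confusion(y_true, y_pred, 3)
print(cm[1, 1])  # class 1 fully recovered -> 1.0
```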

3.6 Feature Importance

An important consideration is the contribution of each HPC feature to the classification process. If only a small set of HPCs is needed to train high accuracy models, then certain implementation difficulties can be avoided. Minimising the number of hardware counters is desirable for two reasons:

  1. Reproducibility: Using all of the counters on a given architecture risks depending on redundant HPCs that are unavailable on other architectures, particularly older and more constrained platforms. ARM and RISC-V micro-controller and IoT SoCs, for instance, contain fewer measurable events relative to workstation- and server-grade X86-64 CPUs [30, 19, 4].

  2. Performance: Removing redundant features, or those with low predictive power, can offer training and classification performance benefits due to the curse of dimensionality. (Dimensionality reduction methods have been applied in the HPC malware classification literature, e.g. principal component analysis [34, 45], but this does not directly reduce the number of HPCs used at source.)

We also note that HPC-based research ubiquitously relies on complex, non-linear models, e.g. random forests and gradient boosting machines, for modelling high-dimensional data [45]. This was also observed in the classification results from §3.5. Unfortunately, these have decision processes that are inherently difficult to interpret, prompting the development of model explanation methods [23, 33].

3.6.1 HPC Correlation Analysis

We investigated correlation relationships between individual events as a first step towards understanding the relative importance of HPCs. It is intuitive that certain events are linearly dependent; for example, the total number of load/store instructions (LST_INS) is directly related to the number of load instructions (LD_INS). Likewise, processes that frequently access temporally and spatially dislocated data will cause cache misses and, thus, more data writes to CPU caches. To study this, we computed correlation matrices over each test-bed device's HPC pairs. Figure 3 presents the pairwise correlations, with each matrix element representing the Pearson correlation coefficient of an HPC tuple.
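A sketch of this pairwise analysis: `np.corrcoef` computes the Pearson coefficient for every pair of counters over all measurement runs. The counters below are synthetic; `lst_ins` is built as `ld_ins + sr_ins` to mimic the linear dependence described in the text:

```python
# Pairwise Pearson correlations over synthetic HPC measurements.
import numpy as np

rng = np.random.default_rng(1)
n_runs = 500
ld_ins = rng.poisson(1000, n_runs).astype(float)   # load instructions
sr_ins = rng.poisson(400, n_runs).astype(float)    # store instructions
lst_ins = ld_ins + sr_ins                          # loads + stores by construction
tot_cyc = rng.poisson(5000, n_runs).astype(float)  # unrelated counter

X = np.vstack([ld_ins, sr_ins, lst_ins, tot_cyc])  # rows = HPCs
corr = np.corrcoef(X)
print(round(corr[0, 2], 2))  # LD_INS vs LST_INS: strongly positive
```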

The results show strong correlations between many HPC pairs. On X86-64, the cycle and time-stamp counters (TOT_CYC and RDTSCP) are 0.95 correlated with the total instruction (TOT_, LD_, SR_, BR_ and LST_INS) and branch counters (BR_TKN, _NTK, and _PRC). A similar pattern was found for ARM. This is intuitively unsurprising considering that larger, longer-running functions will likely contain more run-time branches and memory accesses. Strong correlations are also seen between cache misses in lower cache hierarchy elements and accesses in higher ones; for example, L2 data accesses and L1 data misses (L2_DCA and L1_DCM, 0.96; X86-64). On ARM, L1 instruction cache misses are strongly correlated (0.98) with L2 data accesses (L1_ICM vs. L2_DCA). In the absence of a dedicated L2 instruction cache [4], this indicates that the L2 data cache is used as a de facto instruction cache, echoing existing work on using HPCs to uncover latent SoC properties [25, 18].

It is also seen that cache misses in last-level caches (LLC)—L3 for X86-64 and L2 for ARM (L3_TCM and L2_DCM respectively)—have no correlations with other HPCs. PAPI monitors CPU-level events; the LLC represents the final unit before external memories are accessed. Similar patterns exist for TLB data misses (TLB_DM) on ARM and instruction misses (TLB_IM) on X86-64 and ARM; and the use of vector and floating point operations on X86-64 (SP_OPS, DP_OPS, VEC_SP, VEC_DP), which only have correlations with each other. On RISC-V, strong correlations (0.97) were observed between machine-mode counters and their user-mode counterparts, e.g. RDINSTRET vs. MINSTRET. This is interesting from a security perspective: unprivileged processes can use HPCs with potentially the same power as privileged processes. It also provides insight into why classification results for privileged HPCs were only marginally different (4%) from unprivileged HPCs in §3.5.

(a) X86-64 HPCs.
(b) ARM HPCs.
(c) RISC-V HPCs.
Fig. 3: Correlation matrices for each test-bed device.

3.6.2 Shapley Additive Explanations

While useful for understanding linear dependencies, correlations do not show the extent to which particular HPCs contribute to classification decisions. This requires the use of model inspection techniques to understand their relative importance. To this end, we employed SHAP (SHapley Additive exPlanations), a unified framework for interpreting complex model predictions, which has been applied to Android malware classification [12] and intrusion detection [40]. SHAP uses a co-operative game-theoretic approach for assigning feature importances to a machine learning model prediction function, f, using Shapley values [23]. SHAP explains f's decisions as the sum of the effects of feature subsets being included in a conditional expectation, f_S(x_S) = E[f(x) | x_S] (S being a subset of model features and x the feature vector of the instance to be explained). SHAP scores combine conditional expectations with the Shapley value of a feature i, φ_i, corresponding to its contribution to the payout (prediction), calculated using Eq. 1:

φ_i = Σ_{S ⊆ F∖{i}} [ |S|! (|F| − |S| − 1)! / |F|! ] ( f_{S∪{i}}(x_{S∪{i}}) − f_S(x_S) )    (1)

where |F| is the number of features and F is the set of all input features. The average absolute Shapley values per feature are computed across the entire data set and ordered to find the global feature importances. SHAP values have been shown to be more consistent with human intuition than alternative methods, e.g. LIME [28] and DeepLIFT [33]; SHAP is also model-agnostic and does not suffer from some of the nuances associated with HPC measurements, albeit at greater computational cost. (Other techniques, e.g. mean decrease in Gini impurity for tree models, can be misleading with high-cardinality or continuous features [36], an inherent property of HPCs.) We computed the SHAP values for each HPC on each platform using the best performing classifiers from §3.5 under the mixed GCC data sets.
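Eq. 1 can be evaluated exactly for a model small enough to enumerate. A pedagogical sketch with three features, where the conditional expectation f_S is supplied directly as a dictionary over feature subsets (a toy value function, not a real trained model; practical SHAP implementations approximate this sum):

```python
# Exact Shapley values for a toy three-feature value function.
from itertools import combinations
from math import factorial

F = frozenset({0, 1, 2})

# Toy value function f_S(x_S): model output when only features in S are known.
# Constructed additively: feature 0 contributes 2, feature 1 contributes 1.
f = {frozenset(): 0.0,
     frozenset({0}): 2.0, frozenset({1}): 1.0, frozenset({2}): 0.0,
     frozenset({0, 1}): 3.0, frozenset({0, 2}): 2.0, frozenset({1, 2}): 1.0,
     frozenset({0, 1, 2}): 3.0}


def shapley(i):
    """phi_i = sum over S subset of F\\{i} of |S|!(|F|-|S|-1)!/|F|! * (f(S+i) - f(S))."""
    others = F - {i}
    total = 0.0
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            S = frozenset(S)
            weight = factorial(len(S)) * factorial(len(F) - len(S) - 1) / factorial(len(F))
            total += weight * (f[S | {i}] - f[S])
    return total


phis = {i: shapley(i) for i in F}
print(phis)       # per-feature contributions: feature 0 -> 2, feature 1 -> 1
print(sum(phis.values()))  # efficiency: sums to f(F) - f(empty) = 3.0
```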

The ranked SHAP values are shown in Fig. 4. Some common HPCs have greater impact on classification decisions across our platforms. Specifically, branch-related HPCs comprised the top two ARM HPCs (BR_MSP, BR_INS) and six of the top 10 X86-64 counters (BR_PRC, BR_CN, BR_NTK, BR_TKN, BR_INS, BR_UCN). The RISC-V platform could not measure branch events. Instruction-counting HPCs also ranked highly, representing the top two RISC-V HPCs (RDINSTRET, MINSTRET), three of the top five ARM HPCs (BR_INS, SR_INS, TOT_INS), and three of the top 10 X86-64 HPCs (TOT_INS, BR_INS, SR_INS). Notably, we observe that cache events ranked relatively poorly, particularly TLB data and instruction misses (X86-64 and ARM), and shared LLC events (ARM L2 and X86-64 L3 accesses/misses).

(a) X86-64 HPCs.
(b) ARM HPCs.
(c) RISC-V HPCs.
Fig. 4: Ranked mean absolute SHAP values.

3.6.3 Feature Elimination

SHAP values gauge the relative contribution of features, but do not directly determine the number of features required for high accuracy. §3.6.1 suggested that fewer HPCs may be required to achieve similar accuracy due to linear dependencies between HPCs. In light of this, we investigated how model accuracy fluctuated by systematically including particular hardware counters. The best performing models from §3.5 were retrained using the top k features from the SHAP analysis under the same classifier hyperparameters. Values of k were evaluated in the range 1–10 and compared with using all available HPCs.
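The experiment above can be sketched as follows. The importance ranking, data, and classifier below are all illustrative: a nearest-class-centroid rule stands in for the paper's tuned models, and `imps` is a pretend SHAP ranking, not real measurements:

```python
# Feature-elimination sketch: keep only the top-k columns by (pretend)
# mean |SHAP| value and re-evaluate a simple stand-in classifier.
import numpy as np

rng = np.random.default_rng(2)


def top_k_columns(X, importances, k):
    keep = np.argsort(importances)[::-1][:k]  # indices of top-k features
    return X[:, keep]


def centroid_accuracy(X_tr, y_tr, X_te, y_te):
    """Nearest class centroid: a toy stand-in for the tuned classifiers."""
    classes = np.unique(y_tr)
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    d = ((X_te[:, None, :] - centroids[None]) ** 2).sum(axis=2)
    return (classes[d.argmin(axis=1)] == y_te).mean()


# Toy data: 2 informative features out of 6.
n = 300
y = rng.integers(0, 3, n)
X = rng.normal(size=(n, 6))
X[:, 0] += 3 * y  # informative
X[:, 1] -= 2 * y  # informative
imps = np.array([0.9, 0.7, 0.1, 0.05, 0.02, 0.01])  # pretend SHAP ranking

for k in (1, 2, 6):
    Xk = top_k_columns(X, imps, k)
    acc = centroid_accuracy(Xk[:200], y[:200], Xk[200:], y[200:])
    print(k, round(acc, 2))
```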

Fig. 5: Accuracy using the top SHAP-valued HPCs.

The results are shown in Fig. 5 for all devices under unprivileged/privileged HPCs where applicable. Evidently, using all HPCs bestows some accuracy benefit, but only a subset was necessary to achieve close results. Using all 49 X86-64 counters, for example, delivered a 5% improvement over using the top 10 from Fig. 4(a). The top three HPCs (BR_PRC, BR_CN, TOT_INS) came within 9% of the final accuracy on X86-64, while only two ARM counters were needed to achieve this (BR_MSP, BR_INS). On RISC-V, the accuracy benefit decreased to 5% using three counters (RDINSTRET, MINSTRET, RDCYCLE) versus all seven.

In general, our results indicate that a large range of black-box functions can be classified effectively based on their HPC values—up to 97.33%—across three major CPU platforms. This varies significantly according to privilege mode and counter availability on our chosen platforms. In the coming sections, we explore applications of this generic approach in the domains of vulnerability detection within OpenSSL and function recognition within TEE applications.

4 Identifying Known Security Vulnerabilities in OpenSSL

Cryptographic libraries are often deployed as static or shared libraries in end-user applications. OpenSSL, for instance, is deployed as libssl and libcrypto, with the latter implementing dependent cryptographic algorithms which may be used in isolation. By default, both libraries are compiled and named using major and minor version numbers, e.g. 1.0, 1.1, and 3.0. Incremental sub-versions are also released after remedying known security vulnerabilities, known as lettered releases.

In this use case, we examine the extent to which vulnerable libcrypto lettered versions can be detected using HPC measurements from invoking affected cryptographic functions. Specifically, we examine the extent to which OpenSSL vulnerabilities can be identified using HPC differences in calls to patched and unpatched libcrypto functions. This can be used as an exploratory technique for isolating useful attack vectors in the dependencies of application binaries. Broadly, the attacker is assumed to possess the ability to: (1) instrument the target binary to measure function calls to the dependent library; or (2) compile and link his/her own measurement harness to invoke functions in the dependency.

4.1 Methodology

Our developed approach follows three phases:

  1. Preliminary vulnerability identification. We comprehensively examined the OpenSSL vulnerability disclosure announcements to identify vulnerabilities that cause internal micro-architectural state changes, but whose effects are not immediately observable (e.g. they do not induce a crash, an infinite loop, or particular return values). Vulnerable/patched functions were precisely identified using NIST’s National Vulnerability Database (NVD), which aggregates technical write-ups, third-party advisories, and, importantly, commit-level patch details for CVEs. In total, we successfully identified six vulnerabilities, which may be used for DSA, RSA, and ECDSA private key recovery. The descriptions, CVE numbers, and commit IDs are given in Appendix B.

  2. Collecting labelled data samples. For each vulnerability, we collected HPC measurements of 10,000 executions using all available counters. We measured offending libcrypto functions of the same major and minor version, but from different lettered versions before and after the patch was implemented. For example, for a function patched in v1.1.0f, measurements were taken from the preceding v1.1.0a–e (unpatched) and successive versions v1.1.0f–n (patched). This required compiling and linking our PAPI measurement harness against multiple individual libcrypto lettered versions. In contrast to §3, which considered function recognition as a multi-class problem, vulnerability detection is treated as a binary problem, where the measurement harness assigned binary feature labels (0 = patched, 1 = unpatched).

  3. Model selection and evaluation. Following the same method as §3.3, several models were trained using an 80:20 training-test set ratio, 10-fold cross-validation, and exhaustive grid search for hyper-parameter optimisation. In addition to classification accuracy, precision, recall, and F1-score metrics were employed for evaluating binary classification performance. This is important when considering the unbalanced nature of vulnerability detection in this context. Measurement data sets of vulnerabilities remedied in early OpenSSL lettered versions, e.g. 1.1.0b, will contain far fewer ‘unpatched’ labels than those in later versions, thus necessitating evaluation metrics that consider relevance.
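The binary labelling in phase 2 can be sketched as follows: a lettered version is marked unpatched (1) if its letter precedes that of the fixing release, else patched (0). The version strings are illustrative:

```python
# Label lettered OpenSSL versions relative to the release that fixed a CVE.


def label_version(lettered, patched_in):
    """E.g. label_version('1.1.0c', '1.1.0f') -> 1 (unpatched)."""
    base, letter = lettered[:-1], lettered[-1]
    base_p, letter_p = patched_in[:-1], patched_in[-1]
    assert base == base_p, "only compare within one major.minor version"
    return 1 if letter < letter_p else 0


versions = [f"1.1.0{c}" for c in "abcdefghijklmn"]
labels = [label_version(v, "1.1.0f") for v in versions]
print(labels)  # a-e -> 1 (unpatched), f-n -> 0 (patched)
```

Note the class imbalance this produces: a vulnerability fixed in an early lettered release yields far fewer unpatched labels, which is why precision, recall, and F1-score are reported alongside accuracy in phase 3.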

4.2 Results and Analysis

We applied the methodology to known vulnerabilities within libcrypto using the same target devices from §3.5 (OpenSSL did not support RISC-V at the time of publication). The classification results for each vulnerability are presented in Tables VI and VII for X86-64 and ARM respectively. Our approach identified OpenSSL vulnerabilities with high accuracy: at least 89.5% in all cases, and 93%+ in all but one case, across both architectures. Likewise, precision (0.892–0.997, X86-64; 0.882–0.978, ARM), recall (0.900–0.998, X86-64; 0.873–0.995, ARM) and F1 scores (0.896–0.998, X86-64; 0.889–0.985, ARM) were consistently high, demonstrating effectiveness when identifying feature vectors corresponding to unpatched instances.

It is noteworthy that some differences exist between each identified CVE, particularly CVE-2018-0737, which has significantly lower accuracy (4–5%) than the next worst performing CVE. On closer inspection, we noticed a broad correlation between classification performance and relative patch complexity. This is not surprising: security-related patches with significant code additions and/or deletions will cause deterministic effects in HPC measurements. As an elementary example, additional conditional statements will correlate with branch-related, cycle, and instruction-counting HPCs (e.g. total instructions and CPU cycles). The CVE-2018-0737 patch—the worst performing vulnerability—changed only two lines of code (LOC) between lettered OpenSSL versions for setting constant-time operation flags for RSA primes (commit 6939eab03a6e23d2bd2c3f5e34fe1d48e542e787). LOC is only an illustrative proxy of program complexity; a single line may induce complex control flows, with significant effects on HPCs. Compare this to CVE-2018-0734—the vulnerability exhibiting the greatest detection performance—which made 19 LOC changes, including the declaration of new variables, conditional statements, internal function calls, and more (commit 8abfe72e8c1de1b95f50aa0d9134803b4d00070f).

This raises a research question about the granularity with which counters can reliably detect practically arbitrary code changes. The known effects of non-determinism—see §3.4 and Das et al. [10]—and pipeline execution on measurement noise [4] provide an undetermined lower bound. Yet, this has not been answered in related literature, which we pose as an obvious gap for future research. Notwithstanding, our approach shows that HPC measurements of function calls can detect known vulnerabilities using off-line analysis.

CVE ID Operation Pr. Re. F1 Acc.
CVE-2018-5407 ECC scalar mul. 0.915 0.977 0.945 94.67
CVE-2018-0734 DSA sign 0.997 0.998 0.998 99.83
CVE-2018-0735 ECDSA sign 0.974 0.950 0.962 97.50
CVE-2018-0737 RSA key gen. 0.892 0.900 0.896 90.25
CVE-2016-2178 DSA sign 0.985 0.978 0.981 98.75
CVE-2016-0702 RSA decryption 0.940 0.938 0.939 95.92
  • Pr.: Precision, Re.: Recall, F1: F1-score, Acc.: Accuracy (in %).

TABLE VI: OpenSSL CVE identification results (X86-64).
CVE ID Operation Pr. Re. F1 Acc.
CVE-2018-5407 ECC scalar mul. 0.891 0.975 0.931 93.25
CVE-2018-0734 DSA sign 0.976 0.995 0.985 99.00
CVE-2018-0735 ECDSA sign 0.978 0.910 0.943 96.33
CVE-2018-0737 RSA key gen. 0.882 0.896 0.889 89.58
CVE-2016-2178 DSA sign 0.954 0.938 0.946 96.42
CVE-2016-0702 RSA decryption 0.939 0.873 0.904 93.83
TABLE VII: OpenSSL CVE identification results (ARM).

5 Recognising Cryptographic Algorithms in a Trusted Execution Environment (TEE)

The last section showed how some security vulnerabilities can be detected within compiled OpenSSL (libcrypto) libraries. The general approach is extendable to recognising functions executing within trusted execution environment (TEE) applications. TEE applications, which are designed to be protected from untrusted system software, typically expose high-level APIs for enabling untrusted applications to interact with secure world services [5, 4, 15]. This section examines the application of our approach to cryptographic algorithm recognition within an ARM TrustZone TEE.

5.1 ARM TrustZone and the ARM PMU

ARM TrustZone partitions platform execution into ‘secure’ (SW) and ‘non-secure’ (NS) worlds, with the aim of protecting SW services from kernel-level, non-secure world software attacks. SW execution is isolated via the NS bit, which is set by the secure monitor at the highest ARM exception (privilege) level (EL3), added to cache tags, and propagated through system-on-chip bus transactions, e.g. for accessing sensitive peripheral controllers. Interactions between the normal and secure worlds are conducted using ARM secure monitor calls (SMCs) for entering secure monitor mode. NS world applications invoke TA functions using a pre-defined interface specified by the TA developer, as standardised by the GlobalPlatform Client API [15]. Notably, TEE TAs are usually provisioned in encrypted form and subsequently loaded from flash memory during the device’s secure boot sequence using a firmware-bound key. This occurs before loading any untrusted world binaries, rendering direct inspection of TA binaries tremendously difficult, even from privileged NS world execution [32, 5].

Recall from §2.2.2 that the ARM PMU manages performance events on ARM Cortex-A platforms. Ideally, PMU interrupt events should be suppressed during secure world execution to prevent sensitive micro-architectural state changes from being measurable by malicious non-secure world processes. However, SW PMU events are enabled in pre-release testing environments for TEE debugging and optimisation (categorised as a non-invasive technique under the ARM debugging architecture [26]). Whether or not SW PMU events are enabled can be determined by querying the non-invasive and secure non-invasive flags (NIDEN and SPNIDEN) of the ARM DBGAUTHSTATUS debug register. If NIDEN or SPNIDEN is set, then PMU events are counted in the non-secure and secure worlds respectively.
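Once the DBGAUTHSTATUS value has been read (which itself requires a privileged register access), the check reduces to decoding two 2-bit fields. A minimal sketch of that decode; the field positions follow the ARMv8 layout (non-secure non-invasive at bits [3:2], secure non-invasive at bits [7:6]) and the 0b11 "implemented and enabled" encoding, but these are assumptions that should be checked against the TRM for a given core:

```python
# Illustrative decode of an already-read DBGAUTHSTATUS value. Field
# positions (assumed, per the ARMv8 layout): non-secure non-invasive
# debug at bits [3:2], secure non-invasive debug (SPNIDEN) at bits [7:6].
# A field value of 0b11 means the corresponding non-invasive debug, and
# hence PMU event counting for that world, is implemented and enabled.


def decode_dbgauthstatus(value):
    nsnid = (value >> 2) & 0b11  # non-secure non-invasive debug
    snid = (value >> 6) & 0b11   # secure non-invasive debug
    return {
        "ns_pmu_events_enabled": nsnid == 0b11,
        "sw_pmu_events_enabled": snid == 0b11,
    }


# Example: secure non-invasive debug left enabled (the vulnerable case).
print(decode_dbgauthstatus(0b11001100))
```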

Importantly, Spisak [35] showed that SPNIDEN was still enabled in some consumer devices following release, including the Amazon Fire HD 7” tablet and Huawei Ascend P7. This was followed by Ning and Zhang [26] who examined 11 mobile, IoT and ARM server platforms, showing that only three devices had correctly unset SPNIDEN prior to consumer release. Vulnerable devices included the Huawei Mate 7, Raspberry Pi 3B+, Xiaomi Redmi 6, and the Scaleway ARM C1 Server. The use of insecurely configured ARM PMU events during secure world execution has been exploited on consumer devices for building rootkits using PMU interrupts [35], and cross-world covert channels on an undisclosed Samsung Tizen TV (ARM Cortex-A17, ARMv7) and HiKey board (ARM Cortex-A53, ARMv8) [7].

5.2 Methodology

Using the knowledge that consumer devices may fail to suppress SW PMU interrupts, we developed a test-bed for investigating the extent to which cryptographic algorithms within secure world TAs can be identified from non-secure world processes. We leveraged OP-TEE, an open-source, GlobalPlatform-compliant TEE reference implementation, on our Raspberry Pi 3B+. Two applications were developed, which are illustrated in Fig. 6:

Fig. 6: TEE side-channel set-up using a shared PMU.
  • Spy (non-secure world). A privileged application that invokes TEE functions in the Victim using the GlobalPlatform Client API. The same PAPI test harness used in §3 and §4 measured HPC values immediately before and after TA command invocation, i.e. TEEC_InvokeCommand() in GlobalPlatform Client API nomenclature. The Spy may be unprivileged if the PMU is configured to enable measurements from EL0.

  • Victim (secure world TA). An emulation of a TEE-based key management application that exposes four high-level functions for (1) signing (TA_SIGN), (2) verifying (TA_VERIFY), (3) encrypting (TA_ENCRYPT), and (4) decrypting (TA_DECRYPT) inputs provided by the Spy on demand. These utilised fresh TEE-bound keys generated at random on a per-session basis. (Under the GlobalPlatform TEE architecture, a non-secure client application invokes one or more commands, e.g. signing or decryption, of a target TA within a single session.)

The aim, similar to §3, is to identify the precise algorithm used by the Victim from PMU measurements before and after its invocation by the Spy. The Spy is given only the aforementioned high-level functions for signing, verification, encryption, and decryption. Secure world cryptographic functions are implemented in OP-TEE using LibTomCrypt whose functions are wrapped by the GlobalPlatform Internal Core API [16]. The Victim was developed to call an extensive range of GlobalPlatform Internal Core API functions implemented by the OP-TEE OS core. For calculating performance results, the Victim TA also accepted a given Internal Core API algorithm ID, which was used for labelling HPC vectors. The full list of analysed cryptographic algorithms is given in Table VIII, covering all available modes of operation and padding schemes where applicable.

Using our previously developed PAPI test harness, HPC measurements of 1,000 invocations of each algorithm were collected from the non-secure world. This was repeated for 100 sessions, with the data collected afterwards for off-line analysis (100,000 vectors per algorithm; 3.4M in total). The measurement vectors were labelled with the corresponding GlobalPlatform Internal Core algorithm identifier. As in §3, key sizes were not fixed; a random key size was set during the algorithm’s run-time allocation prior to its execution. It is important to note that OP-TEE TAs are cross-compiled using a GCC-based toolchain, rendering them subject to the compilation biases discussed in §3.4.5. To address this, we evaluated the GCC optimisation flags from §3 in order to assess the effects of different optimisation levels on classification performance.

Method GlobalPlatform Internal Core API ID Key Sizes
DSA TEE_ALG_DSA_SHA1 512, 1024
TEE_ALG_DSA_SHA256 2048, 3072
TEE_ALG_RSASSA_PKCS1_V1_5_SHA512 1024, 2048
TEE_ALG_AES_CTR 128, 192
TABLE VIII: Victim TA algorithms and key sizes (bits).

5.3 Results

After retrieving the data files from the test platform, the same procedure was followed as in the previous sections. The data was divided into training and test sets using an 80:20 ratio before applying exhaustive grid search with 10-fold cross-validation to select the best performing classifier. The final accuracy was calculated using the performance of the best performing cross-validation classifier on the aforementioned test set. Results of this analysis are given in Table IX. The results generally reflect those in §3 and §4: algorithm recognition can be achieved with high accuracy using PMU values perturbed by a secure world TA and measured by a non-secure world application. In the best case, 95.50% classification accuracy (DT) was achieved for the mixed GCC data set, increasing slightly to best cases of 96.13% (RF), 96.02% (GBM), and 97.45% (GBM) for the O0, Os, and O3 data sets respectively.

O0 83.99 90.12 88.07 94.48 96.13 90.76 92.02 88.54 87.90
Os 84.61 87.92 84.68 90.72 94.07 96.02 95.51 85.81 86.75
O3 83.18 92.49 93.19 89.58 93.33 97.45 79.66 94.35 87.71
Mixed 82.89 85.27 92.83 95.50 94.17 86.93 82.55 86.99 90.82
TABLE IX: Classification accuracy (%) of Spy-measured (non-secure world) HPC feature vectors of Victim-executed algorithms (secure world).

6 Evaluation

This section analyses mitigations, limitations, and challenges of the work presented in this paper.

6.1 Analysis and Mitigations

Exploiting PMUs as a side-channel medium has been acknowledged by CPU architecture designers and specification bodies. ARM concedes that counters are a potential side channel for leaking confidential information, and issues secure development guidance for preventing TEEs from perturbing HPC measurements during TA execution [5, 4]. Similarly, the RISC-V Unprivileged specification states that “Some execution environments might prohibit access to counters to impede timing side-channel attacks” [29] (Chapter 10, p. 59). Despite this, PMUs remain widely accessible with privileged access and, as shown by [35] and [26], this accessibility extends to measuring perturbations from TEE TAs on some consumer devices.

If attackers are assumed to possess only user-mode permissions, then certain system-level countermeasures can be deployed. On X86-64, the CR4.TSD and CR4.PCE control registers can be set and unset to prevent the reading of time-stamp (RDTSC) and programmable PMU counter values (RDPMC) respectively. On Linux devices, the kernel.perf_event_paranoid flag can be set to a non-zero value to prevent user-space processes from accessing PMU values through the perf subsystem. Alternatively, perf can be disabled at installation time, removing access to high-resolution CPU events. It is worth stating, however, that performance counters have many legitimate uses, including application benchmarking and debugging.
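The perf_event_paranoid countermeasure can be sketched as follows. The level meanings paraphrase the Linux perf_event_open(2) man page and may vary across kernel versions and distributions (e.g. some Debian kernels add a level 3); treat them as assumptions rather than a definitive reference:

```python
# Sketch: interpret the kernel.perf_event_paranoid setting that gates
# unprivileged access to the perf subsystem. Level meanings paraphrase
# perf_event_open(2); exact semantics vary across kernels/distributions.

LEVELS = {
    -1: "no restrictions for unprivileged users",
    0: "raw tracepoint access disallowed for unprivileged users",
    1: "CPU-wide event access also disallowed",
    2: "kernel profiling also disallowed",
}


def describe(level):
    # Some distributions patch in values above 2 (e.g. a level 3) that
    # disable unprivileged perf entirely.
    if level >= 3:
        return "unprivileged perf access disabled"
    return LEVELS.get(level, "unknown level")


def read_current(path="/proc/sys/kernel/perf_event_paranoid"):
    """Read the live setting; returns None where the file is absent."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except OSError:
        return None


print(describe(2))
```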

Side-channel resistance has been studied extensively in the context of avoiding time- and data-dependent code. Cache-based timing attacks, in particular, have prompted the development of resistant cryptographic algorithms (e.g. preferring bit-sliced AES implementations over those using lookup T-tables [6]). In recent years, attacks that leverage side-effects of transient execution have compounded these issues [20, 2]. From the results in this work, we emphasise that focussing on time-, cache-based, or branch prediction-based side-channel mitigations are insufficient for protecting against the presented methods. Ideally, library and TEE TA developers must consider a larger range of measurable performance events, such as instruction retirements, TLB hits/misses, load/store counts, pre-fetch misses, and memory writes. While certain individual events have been exploited and mitigated, multiple events can be leveraged to bypass countermeasures against any particular micro-architectural side-channel approach.

For software-based mitigations, inspiration can be taken from Li and Gaudiot [21] who posed the challenge of HPC-based Spectre attack detection in the face of ‘evasive’ adversaries. State-of-the-art accuracy of Spectre detection models declined by 30%–40% after introducing instructions that mimicked benign programs. Thus, one countermeasure is to introduce micro-architectural obfuscation in applications by randomly and significantly perturbing the PMU during the execution of sensitive functions. Noise injection was also suggested by Liu et al. [22] for countering ARM cache-based side-channels; however, such methods must be generalised to all events, not only cache-based counters.

We also draw attention to countermeasures when deploying Intel Software Guard eXtensions (SGX) enclaves. SGX applications can be built using the ‘anti side-channel interference’ (ASCI) feature, which suppresses interrupts to Intel PMUs upon entry to production enclaves [19], thus preventing their use as a side channel for inspecting enclave contents. Unless developers explicitly and negligently opt out of ASCI, production enclaves are strengthened significantly against the attacks described in this paper. Likewise, we reiterate best-practice guidelines to device manufacturers to prevent PMU events being raised during secure world execution by securely configuring the PMU control and debug registers upon TEE entry.

6.2 Challenges, Limitations, and Practicability

Using HPCs to classify program functions faces some interesting challenges that were considered outside the scope of this work. Firstly, we investigated programs with few levels of intermediate abstraction: using C programs on a bare-metal microcontroller (RISC-V) or with a single host OS (Debian-based Linux; ARM and X86-64). Instrumenting and measuring programs using HPCs faces difficulties in virtualised environments sharing a single set of CPU counters; for instance, within virtual machines (VMs) and containers with OS-level virtualisation (e.g. Docker). Programs written in interpreted languages and those with just-in-time (JIT) compilation, e.g. Java, also pose known challenges for side-channel analysis [32]. Further research is required to correctly account for the additional noise from multiple users, processors, garbage collectors, etc. on a single PMU.

Secondly, a significant number of possible implementation variations may be encountered in practice on an arbitrary platform. The variations we evaluated were extensive but not exhaustive; for example, hardware implementations, different RNG sources, additional cryptographic libraries (e.g. Crypto++ and Bouncy Castle), and alternative TEE implementations were not evaluated. The scope of this work was not to provide a comprehensive analysis of these possibilities. Rather, we aimed to provide the first investigation into using HPCs for function recognition, alongside a detailed understanding of HPC correlations, relative classification contributions, and applications to vulnerability detection and TA analysis.

Thirdly, in a practical scenario, the generic approaches in §3 and §5 would require classifiers to be trained and transferred to a device under test. TrustZone software binaries, for instance, are typically encrypted and authenticated during secure boot sequences on consumer devices. Moreover, TrustZone OSs are notoriously closed-source and closed-access, preventing ordinary developers from provisioning their own TAs after deployment to acquire ground-truth labels. A future research direction is to explore the efficacy of transfer learning: training a model in a white-box environment, where the device is accessible and samples can be reliably labelled, before transposing it to a black-box environment. Although GlobalPlatform TEE APIs specify a standard set of algorithms, one challenge with this approach is that implementations may differ between the white-box environment and the black-box OEM implementations. While we evidenced in §3 that different library implementations may be classified under a single label, this remains untested in a transfer learning setting, which we defer to future research.
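The train-in-white-box, classify-in-black-box workflow can be sketched with a deliberately simple stand-in model. The paper's actual classifiers are more sophisticated; a nearest-centroid classifier over HPC feature vectors merely illustrates the two phases, and all names and feature values below are hypothetical.

```python
import math
from collections import defaultdict

def train_centroids(samples):
    """White-box phase: `samples` is a list of (label, feature_vector)
    pairs with ground-truth function labels; returns per-label mean
    HPC feature vectors."""
    grouped = defaultdict(list)
    for label, vec in samples:
        grouped[label].append(vec)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in grouped.items()}

def classify(centroids, vec):
    """Black-box phase: assign the label of the nearest centroid
    (Euclidean distance) to an unlabelled measurement."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))

# Toy features: (instructions retired, branch mis-predictions), scaled.
training = [("aes", [120, 4]), ("aes", [118, 6]),
            ("sha256", [300, 40]), ("sha256", [310, 44])]
model = train_centroids(training)
```

The open question raised above is precisely whether such centroids, learned on a white-box device, remain close to the black-box OEM implementation's measurements.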

7 Conclusion

This paper developed, implemented, and evaluated a generic approach to black-box program function recognition. Extending related HPC research, we explored a novel side-channel approach in which micro-architectural events are analysed in bulk using machine learning. We examined this in three parts: (1) a preliminary study classifying functions from widely used benchmarking suites and cryptographic libraries; (2) detecting several known, CVE-numbered vulnerabilities within OpenSSL; and (3) cryptographic function recognition within ARM TrustZone. The approaches achieved 86.22%–99.83% accuracy depending on the target architecture and application. We showed how functions are recognisable in a relatively large, multi-class problem space. Furthermore, we demonstrated how OpenSSL lettered versions containing security vulnerabilities can be identified with high accuracy (0.889–0.998 F1-score; 89.58%–99.83% accuracy). We then presented results from a further use case for recognising functions in a reference open-source ARM TrustZone TEE implementation with high accuracy (95.50%–97.45%) across a comprehensive range of GlobalPlatform TEE API functions.

We posit that focussing only on well-known side-channel vectors (cache accesses, timing differences, and branch predictions) is insufficient for engineering implementations that resist the methods presented in this paper. Attention must be directed to a wider range of micro-architectural events simultaneously, including TLB misses, instruction retirements, clock cycles, and pre-fetch events, rather than the prescribed events popularised in related work. We also re-emphasise best-practice guidelines for configuring PMUs to avoid exposing TEE side-channels to untrusted applications. Further work is needed to address more sophisticated scenarios, e.g. interpreted languages and transferring models between consumer devices, but we present evidence that HPC-based function recognition is effective on today’s major CPU architectures.


This work received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 883156 (EXFILES).


  • [1] M. Alam, S. Bhattacharya, S. Dutta, S. Sinha, D. Mukhopadhyay, and A. Chattopadhyay (2019)

    RATAFIA: ransomware analysis using time and frequency informed autoencoders

    In IEEE Int’l Symposium on Hardware Oriented Security and Trust, Cited by: §1, §2.1.
  • [2] M. Alam, S. Bhattacharya, D. Mukhopadhyay, and S. Bhattacharya (2017) Performance counters to rescue: a machine learning based safeguard against micro-architectural side-channel-attacks. IACR Cryptol. ePrint Arch. 2017, pp. 564. Cited by: §2.1, §6.1.
  • [3] AMD, Inc. (2021) AMD64 architecture programmer’s manual volume 2: system programming. Cited by: §1, §2.2.1, §3.4.4.
  • [4] ARM (2014) ARM Cortex-A53 MPCore processor: technical reference manual. Cited by: §1, §2.2.2, 1st item, item 1, §3.6.1, §4.2, §5, §6.1.
  • [5] ARM (2021) Secure development guidelines. Cited by: §1, §5.1, §5, §6.1.
  • [6] D. J. Bernstein (2005) Cache-timing attacks on AES. Cited by: §6.1.
  • [7] H. Cho, P. Zhang, D. Kim, J. Park, C. Lee, Z. Zhao, A. Doupé, and G. Ahn (2018) Prime+Count: novel cross-world covert channels on ARM TrustZone. In 34th Annual Computer Security Applications Conference, pp. 441–452. Cited by: §5.1.
  • [8] Y. Choi, A. Knies, L. Gerke, and T. Ngai (2001) The impact of if-conversion and branch prediction on program execution on the Intel Itanium processor. In 34th ACM/IEEE Int’l Symposium on Microarchitecture, Cited by: §3.4.5.
  • [9] B. Copos and P. Murthy (2015) Inputfinder: reverse engineering closed binaries using hardware performance counters. In 5th Program Protection and Reverse Engineering Workshop, Cited by: §2.1.
  • [10] S. Das, J. Werner, M. Antonakakis, M. Polychronakis, and F. Monrose (2019) SoK: the challenges, pitfalls, and perils of using hardware performance counters for security. In IEEE Symposium on Security and Privacy, pp. 20–38. Cited by: 1st item, §3.4.1, §3.4, §4.2.
  • [11] J. Demme, M. Maycock, J. Schmitz, A. Tang, A. Waksman, S. Sethumadhavan, and S. Stolfo (2013) On the feasibility of online malware detection with performance counters. ACM SIGARCH Computer Architecture News 41 (3), pp. 559–570. Cited by: §2.1.
  • [12] M. Fan, W. Wei, X. Xie, Y. Liu, X. Guan, and T. Liu (2020) Can we trust your explanations? sanity checks for interpreters in android malware analysis. IEEE Transactions on Information Forensics and Security 16, pp. 838–853. Cited by: §3.6.2.
  • [13] Free Software Foundation, Inc. (2022) Optimize options (using the GNU Compiler Collection). Note: Cited by: §3.4.5.
  • [14] A. Gangwal, S. G. Piazzetta, G. Lain, and M. Conti (2020) Detecting covert cryptomining using HPC. In International Conference on Cryptology and Network Security, pp. 344–364. Cited by: §1, §2.1.
  • [15] GlobalPlatform (2010) TEE Client API (v1.0). Cited by: §1.1, §1, §5.1, §5.
  • [16] GlobalPlatform (2021) TEE Internal Core API (v1.3.1). Cited by: §5.2.
  • [17] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown (2001) MiBench: a free, commercially representative embedded benchmark suite. In 4th Annual IEEE International Workshop on Workload Characterization, pp. 3–14. Cited by: 1st item, §1, §3.2.
  • [18] C. Helm, S. Akiyama, and K. Taura (2020) Reliable reverse engineering of Intel DRAM addressing using performance counters. In 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, Cited by: §1, §2.1, §3.6.1.
  • [19] Intel, Inc. (2022) Intel 64 and IA-32 architectures software developer’s manual combined volumes 3A, 3B, 3C, and 3D: system programming guide. Cited by: §1, §2.2.1, item 1, §3.4.4, §6.1.
  • [20] C. Li and J. Gaudiot (2018) Online detection of spectre attacks using microarchitectural traces from performance counters. In 30th Int’l Symposium on Computer Architecture and High Performance Computing, Cited by: §2.1, §6.1.
  • [21] C. Li and J. Gaudiot (2020) Challenges in detecting an ‘evasive spectre’. IEEE Computer Architecture Letters 19 (1), pp. 18–21. Cited by: §6.1.
  • [22] N. Liu, W. Zang, S. Chen, M. Yu, and R. Sandhu (2019) Adaptive noise injection against side-channel attacks on ARM platform. EAI Endorsed Transactions on Security and Safety 6 (19). Cited by: §6.1.
  • [23] S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, NeurIPS, pp. 4765–4774. Cited by: §3.6.2, §3.6.
  • [24] C. Malone, M. Zahran, and R. Karri (2011) Are hardware performance counters a cost effective way for integrity checking of programs. In 6th ACM Workshop on Scalable Trusted Computing, pp. 71–76. Cited by: §2.1.
  • [25] C. Maurice, N. Le Scouarnec, C. Neumann, O. Heen, and A. Francillon (2015) Reverse engineering Intel last-level cache complex addressing using performance counters. In Int’l Symposium on Recent Advances in Intrusion Detection, Cited by: §1, §2.1, §3.6.1.
  • [26] Z. Ning and F. Zhang (2019) Understanding the security of ARM debugging features. In IEEE Symposium on Security and Privacy, pp. 602–619. Cited by: §5.1, §5.1, §6.1.
  • [27] M. Payer (2016) HexPADS: a platform to detect ‘stealth’ attacks. In International Symposium on Engineering Secure Software and Systems, pp. 138–154. Cited by: §1, §2.1, §2.1, §2.2.1.
  • [28] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) Why should I trust you?: explaining the predictions of any classifier. In 22nd ACM SIGKDD Int’l Conference on Knowledge Discovery and Data Mining, Cited by: §3.6.2.
  • [29] RISC-V Foundation (2019) The RISC-V instruction set manual volume I: unprivileged ISA. Cited by: §1, §2.2.3, §6.1.
  • [30] RISC-V Foundation (2021) The RISC-V instruction set manual volume II: privileged ISA. Cited by: §2.2.3, item 1.
  • [31] C. Shepherd, K. Markantonakis, and G. Jaloyan (2021) LIRA-V: lightweight remote attestation for constrained RISC-V devices. In IEEE Security and Privacy Workshops (SPW), Cited by: §2.2.3.
  • [32] C. Shepherd, K. Markantonakis, N. van Heijningen, D. Aboulkassimi, C. Gaine, T. Heckmann, and D. Naccache (2021-12) Physical fault injection and side-channel attacks on mobile devices: a comprehensive analysis. Computers & Security 111, pp. 102471. External Links: ISSN 0167-4048 Cited by: §1, §5.1, §6.2.
  • [33] A. Shrikumar, P. Greenside, and A. Kundaje (2017) Learning important features through propagating activation differences. In International conference on machine learning, pp. 3145–3153. Cited by: §3.6.2, §3.6.
  • [34] B. Singh, D. Evtyushkin, J. Elwell, R. Riley, and I. Cervesato (2017) On the detection of kernel-level rootkits using hardware performance counters. In ACM Asia Conference on Computer and Communications Security, pp. 483–493. Cited by: §1, §2.1, item 2.
  • [35] M. Spisak (2016) Hardware-assisted rootkits: abusing performance counters on the ARM and x86 architectures. In 10th USENIX Workshop on Offensive Technologies, Cited by: §2.1, §5.1, §6.1.
  • [36] C. Strobl, A. Boulesteix, A. Zeileis, and T. Hothorn (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC bioinformatics 8 (1), pp. 1–21. Cited by: §3.6.2.
  • [37] A. Tang, S. Sethumadhavan, and S. J. Stolfo (2014) Unsupervised anomaly-based malware detection using hardware features. In Int’l Workshop on Recent Advances in Intrusion Detection, Cited by: §2.1.
  • [38] D. Terpstra, H. Jagode, H. You, and J. Dongarra (2010) Collecting performance data with PAPI-C. In Tools for High Performance Computing, Cited by: §3.1, §3.4.4.
  • [39] L. Uhsadel, A. Georges, and I. Verbauwhede (2008) Exploiting hardware performance counters. In 5th Workshop on Fault Diagnosis and Tolerance in Cryptography, pp. 59–67. Cited by: §1, §2.1.
  • [40] M. Wang, K. Zheng, Y. Yang, and X. Wang (2020) An explainable machine learning framework for intrusion detection systems. IEEE Access 8, pp. 73127–73141. Cited by: §3.6.2.
  • [41] X. Wang, C. Konstantinou, M. Maniatakos, and R. Karri (2015) Confirm: detecting firmware modifications in embedded systems using hardware performance counters. In IEEE/ACM Int’l Conference on Computer-Aided Design, Cited by: §2.1.
  • [42] V. M. Weaver, D. Terpstra, and S. Moore (2013) Non-determinism and overcount on modern hardware performance counter implementations. In IEEE Int’l Symposium on Performance Analysis of Systems and Software, Cited by: 1st item, 2nd item, §3.4.1, §3.4.
  • [43] Y. Xia, Y. Liu, H. Chen, and B. Zang (2012) CFIMon: detecting violation of control flow integrity using performance counters. In IEEE Int’l Conference on Dependable Systems and Networks, Cited by: §1, §2.1.
  • [44] L. Yuan, W. Xing, H. Chen, and B. Zang (2011) Security breaches as PMU deviation: detecting and identifying security attacks using performance counters. In 2nd Asia-Pacific Workshop on Systems, pp. 1–5. Cited by: §2.1.
  • [45] B. Zhou, A. Gupta, R. Jahanshahi, M. Egele, and A. Joshi (2018) Hardware performance counters can detect malware: myth or fact?. In ACM Asia Conference on Computer and Communications Security, pp. 457–468. Cited by: §2.1, item 2, §3.6.

Appendix A ARM and RISC-V Confusion Matrices

Additional confusion matrices relevant to §3.5 for ARM and RISC-V function recognition are given in Fig. 7 and Fig. 8 respectively.

Fig. 7: Normalised ARM confusion matrix.

(a) RISC-V (Privileged).
(b) RISC-V (Unprivileged).
Fig. 8: Normalised RISC-V confusion matrices.

Appendix B OpenSSL CVE Descriptions

  • CVE-2018-5407: “ECC scalar multiplication, used in e.g. ECDSA and ECDH, has been shown to be vulnerable to a microarchitecture timing side channel attack. An attacker with sufficient access to mount local timing attacks during ECDSA signature generation could recover the private key.” Fixed in OpenSSL v1.1.0i.

  • CVE-2018-0734: “DSA signature algorithm has been shown to be vulnerable to a timing side channel attack. An attacker could use variations in the signing algorithm to recover the private key.” Fixed in v1.1.1a.

  • CVE-2018-0735: “The OpenSSL ECDSA signature algorithm has been shown to be vulnerable to a timing side channel attack. An attacker could use variations in the signing algorithm to recover the private key.” Fixed in v1.1.0j.

  • CVE-2018-0737: “RSA key generation algorithm has been shown to be vulnerable to a cache timing side channel attack. An attacker with sufficient access to mount cache timing attacks during the RSA key generation process could recover the private key.” Fixed in v1.1.0i.

  • CVE-2016-2178: “Operations in the DSA signing algorithm should run in constant time in order to avoid side channel attacks. A flaw in the OpenSSL DSA implementation means that a non-constant time codepath is followed for certain operations. This has been demonstrated through a cache-timing attack to be sufficient for an attacker to recover the private DSA key.” Fixed in v1.0.2i.

  • CVE-2016-0702: “A side-channel attack was found which makes use of cache-bank conflicts on the Intel Sandy-Bridge microarchitecture which could lead to the recovery of RSA keys. The ability to exploit this issue is limited as it relies on an attacker who has control of code in a thread running on the same hyper-threaded core as the victim thread which is performing decryptions.” Fixed in v1.0.2g.