
A Comprehensive Benchmark Suite for Intel SGX

Trusted execution environments (TEEs) such as Intel SGX facilitate the secure execution of an application on untrusted machines. Sadly, such environments suffer from serious limitations and performance overheads in terms of writing back data to the main memory, their interaction with the OS, and the ability to issue I/O instructions. There is thus a plethora of work that focuses on improving the performance of such environments – this creates the need for a standard, widely accepted benchmark suite (something similar to SPEC and PARSEC). To the best of our knowledge, such a suite does not exist. Our suite, SGXGauge, contains a diverse set of workloads such as blockchain codes, secure machine learning algorithms, lightweight web servers, secure key-value stores, etc. We thoroughly characterize the behavior of the benchmark suite on a native platform and on a platform that uses a library OS-based shimming layer (GrapheneSGX). We observe that the most important metrics of interest are performance counters related to paging, memory, and TLB accesses. There is an abrupt change in performance when the memory footprint starts to exceed the size of the EPC in Intel SGX, and the library OS does not add a significant overhead (roughly ±10%).


1. Introduction

Intel Software Guard Extensions or Intel SGX (intelresearch; intelsgxexplained) has gained popularity in recent years as a way to securely execute an application on a remote, untrusted machine. The security of the application and data within SGX, i.e., confidentiality, integrity, and freshness, is guaranteed by the hardware. The code and data within Intel SGX are even out of the reach of privileged system software such as the operating system and the hypervisor. Recently, Microsoft Azure adopted Intel SGX to provide secure computing in their data centers (azure_sgx; sgx_datacenter).

However, this protection comes at a cost. To ensure its security guarantees, Intel SGX puts certain restrictions on the applications running within it, such as disallowing system calls, as the operating system is not a part of the trusted framework of SGX (intelsgxexplained). Therefore, a few additional and expensive steps are required to enable system call support, which incur additional performance overheads (portorshim; hotcalls). Furthermore, Intel SGX reserves a portion of the main memory for its operations, which is managed by the hardware. However, this reserved memory is limited in size, and any application allocating more than the reserved memory incurs a significant performance overhead (intelsgxexplained; regaining_lost_seconds).

Researchers have focused on alleviating this problem by proposing different mechanisms and workarounds to reduce the overheads (vault; scone; switchless_calls; eleos; hotcalls; regaining_lost_seconds). To show the benefits of their methods, researchers have resorted to manual porting of applications to Intel SGX (portorshim; sgxlapd_sgxnbench). However, porting an application requires significant expertise and development effort (portorshim). Also, the decision of which application to port is generally motivated by the ease of porting, and not necessarily by the gains accrued by doing so. Hence, there is no accepted, standard method for benchmarking SGX-based systems primarily due to the ad hoc nature of workload creation.

The big picture is as follows. The workloads used to evaluate the efficiency of different methods for improving Intel SGX vary across different proposals. Hence, it is not possible to compare the performance gains in one work with those of another in any meaningful manner. Therefore, there is a need for a standard benchmark suite for Intel SGX, much like traditional benchmark suites such as SPEC (spec2017) and PARSEC (parsec).

A benchmark suite needs to thoroughly evaluate all the critical components of Intel SGX, and enable performance comparison by setting a common denominator across different works. Primarily, there are three sources of performance overheads in Intel SGX: encryption/decryption of the data in the reserved secure memory, the cost of accessing operating system services, and the additional time for swapping in data when an application has allocated more memory than the reserved memory (hotcalls; regaining_lost_seconds). Prior works (sgxlapd_sgxnbench; portorshim; sgxometer) in this field propose different benchmark suites (nbench_orig; lmbench) for evaluating Intel SGX. However, they only focus on the first two costs, ignoring the last one, which accounts for the largest performance overhead (regaining_lost_seconds).

We present SGXGauge – a comprehensive benchmark suite for Intel SGX. SGXGauge contains 10 real-world and synthetic benchmarks from different domains that thoroughly evaluate all the critical components of Intel SGX. We use SGXGauge to evaluate Intel SGX in two different modes: ❶ native mode, where we port the benchmarks to Intel SGX, and ❷ shim mode, where we execute benchmarks in an environment where a thin system software layer intercepts the system calls and intercedes with the OS on behalf of the application (graphenesgx; scone; panoply). Such shim layers are also known as library operating systems; they are gaining popularity because they significantly reduce the development time required to run an application on SGX as compared to porting the same application to SGX (portorshim). Our precise list of contributions is as follows.

  1. We present SGXGauge, a benchmark suite for Intel SGX that thoroughly evaluates all of its components.

  2. We stress test the impact of EPC on the performance of applications — a crucial component that is missing from prior work.

  3. We thoroughly evaluate the performance overhead incurred while executing an application with a library operating system.

The rest of the paper is organized as follows: we discuss the required background for the paper in Section 2. In Section 3 we discuss related work and the motivation for the paper. This is followed by a detailed overview of our benchmark suite, SGXGauge, in Section 4. We discuss the evaluation of the benchmarks in Section 5. We finally conclude in Section 6.

2. Background

In this section, we discuss the necessary background for the paper.

2.1. Intel SGX

Intel Software Guard Extensions (SGX) ensures the secure execution of an application either on a local or a remote machine. It guarantees confidentiality, integrity, and freshness of the code and data running within it. Even privileged system software such as the operating system and hypervisor cannot affect its execution.

SGX reserves a part of the system memory for its use at boot time. This reserved memory is known as the Processor Reserved Memory or PRM (intelsgxexplained). Our system supports 128 MB of PRM, and the rest of the discussion in the paper is based on this setting. The PRM is split into two regions that are used to store ❶ SGX metadata and ❷ data/code of user applications, respectively. The latter is called the Enclave Page Cache or EPC. The size of the EPC is 92 MB, although SGX supports applications that require more memory (details in Section 2.2). For every process, SGX creates a trusted execution environment called an enclave (intelsgxexplained).

The operating system cannot access the data within an enclave. However, an enclave still requires the operating system’s support for setting it up, scheduling, context switching, page management, and cleanup. To enable this, the memory management of the enclave is done by the operating system. Just before launching an enclave, the hardware checks the loaded binary for tampering by securely calculating its signature (hash) and matching it with the signature provided by the enclave’s author.

2.2. Enclave Page Cache

The EPC is used to allocate memory for all the applications executing within SGX. The data in the EPC is always in an encrypted form to prevent any snooping by privileged system software such as the operating system. The data is decrypted when brought into the LLC (last-level cache) upon a CPU request. Intel SGX uses dedicated hardware called the Memory Encryption Engine or MEE for encrypting and decrypting the data.

Sadly, the size of the EPC is one of the major limitations of SGX (regaining_lost_seconds). A typical modern application generally has a working set that is more than 92 MB (working_set; working_set2). In such cases, SGX transparently evicts pages from the EPC to the untrusted memory, albeit in an encrypted form, to make space for the new data. When an application tries to access an evicted page, an EPC fault is raised, and SGX brings the page back to the EPC (vault).

An EPC fault is an expensive operation. SGX encrypts and calculates the MAC (encrypted hash) of a page prior to eviction. When the page is brought back, it needs to be decrypted and integrity-checked before its use. Our experiments found that evicting a page from the EPC takes, on average, 12,000 cycles.

2.3. Enclave Transitions: Ecalls and Ocalls

The security provided by SGX comes at a cost. For security reasons, SGX puts several restrictions on an executing enclave – notably, it cannot make any system calls (intelsgxexplained).

In the Intel SGX framework, the operating system is an untrusted entity, and hence, system calls are restricted. To make a system call, an enclave first exits the secure region by calling an OCALL (outside call) function. After this, it makes the system call, collects its results, and returns to the secure region. Similarly, an application in the unsecure region can call a function within an enclave by calling an ECALL (enclave call) function.
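
For reference, such an interface is typically declared in an EDL (Enclave Definition Language) file when the Intel SGX SDK is used; the following is a minimal illustrative sketch with hypothetical function names:

    enclave {
        trusted {
            /* ECALL: callable from the unsecure region, runs inside the enclave. */
            public void ecall_process([in, size=len] uint8_t* buf, size_t len);
        };

        untrusted {
            /* OCALL: callable from inside the enclave, runs in the unsecure region,
               e.g., to issue a system call on the enclave's behalf. */
            void ocall_write([in, size=len] uint8_t* buf, size_t len);
        };
    };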

During a transition from the secure region to the unsecure region, the TLB entries of the enclave are flushed due to security concerns (intelsgxexplained). When execution returns to the enclave, the TLB entries have to be populated again. While adding an entry to the TLB, if the entry points to an EPC page, it is first verified. For this purpose, SGX maintains a table called the Enclave Page Cache Map or EPCM (intelsgxexplained) in the secure region. The EPCM contains one entry for every page in the EPC. For each EPC page, the EPCM tracks its owner and the virtual address for which the page was allocated. These values are checked when a TLB entry for the corresponding page is added (see Figure 1).

Frequent enclave transitions affect the performance of an application due to context switches, TLB misses, and cache pollution. The authors of HotCalls (hotcalls) show that calling an enclave function typically costs 17,000 cycles.

Figure 1. Figure showing the relation between the address space, the EPC, and the EPCM.

2.4. Library Operating Systems

Intel provides a software development kit (Intel SGX SDK (intelsgxsdk)) to ease application development for SGX. However, porting or writing an application for SGX still requires an in-depth knowledge of the workings of SGX and significant development effort (portorshim).

Due to this, many researchers have opted for executing an application using a shim layer built on top of Intel SGX – also known as a library operating system or LibOS. A library operating system can execute an unmodified binary on Intel SGX, thus saving the high cost and effort of porting the application. Scone (scone), GrapheneSGX (graphenesgx), and Panoply (panoply) are some examples of such systems. Researchers have reported that it can take up to 3–4 months to port an application (portorshim); this is also in line with our observations. Moreover, the task of verifying correct execution for all corner cases takes even more time. This is precisely why such shim layers are inevitably used and are fast becoming an inseparable part of the Intel SGX stack. Even though they have their share of performance overheads, the sheer reduction in development and verification effort makes them a necessary part of many deployments. No benchmark suite for Intel SGX can be oblivious to them.

3. Related Work and Motivation

In this section, we discuss the related work in this area and the motivation for SGXGauge.

3.1. Related Work

Limited work has been done in this area, mainly due to the limitations of the Intel SGX framework and the engineering effort required to port an application to SGX.

3.1.1. LMbench-SGX

In their work Port-or-Shim (portorshim), the authors ported a part of LMbench (lmbench) to Intel SGX and compared its performance against a shimmed version running within a library operating system; they used GrapheneSGX for their evaluations. They point out that porting LMbench to SGX took months – and that too only after removing certain features from it (portorshim) – whereas running a shimmed version of LMbench on GrapheneSGX (graphenesgx) took a week of effort. They specifically focus on the cost of encryption/decryption and enclave transitions. They intentionally avoided EPC faults by ensuring that the amount of memory allocated to the benchmarks is less than the size of the EPC (92 MB). They report that the performance of LMbench-SGX (the ported version of LMbench) and the version that runs within the library OS GrapheneSGX is the same – this raises the question of whether porting was worth the effort.

3.1.2. Nbench-SGX

Apart from this, another work (sgxlapd_sgxnbench) proposed a method to prevent side-channel attacks in Intel SGX. The authors ported Nbench (nbench_orig) to SGX to evaluate the effectiveness of their solution. However, the working set of the benchmarks was small and only limited analyses were performed.


Both LMbench-SGX and Nbench-SGX (the ported versions) are single-threaded benchmark suites (portorshim; sgxlapd_sgxnbench). LMbench-SGX mainly focuses on memory bandwidth and system call latencies. Nbench-SGX mostly contains CPU-intensive workloads and is designed to check the efficiency of the integer and floating-point operations of a CPU. Our suite is far more comprehensive in terms of its coverage (evaluated in Section 5).

The authors of sgxometer (sgxometer) also point out the issues with using Nbench-SGX for Intel SGX evaluation. They propose a CMake-based framework to build SGX and non-SGX applications from the same source. Apart from this, there are other proposals (sgxperf; teemon; teeperf; sgxtop) that describe methods to collect statistics about an executing secure application. This information can help a developer debug a secure application or improve its performance. However, these are not benchmark suites; rather, they focus on efficiently profiling an executing enclave.

3.2. Motivation

Here, we discuss the motivation for SGXGauge. The system setup for experiments used here is listed in Table 3.

3.2.1. Experiment: Stressing the EPC

Figure 2. Allocating beyond the EPC size increases the overhead. The baseline is a Vanilla (non-SGX) setting with the same input size. For EPC evictions the baseline is the Low setting.

The limited amount of EPC memory is one of the biggest challenges in SGX (regaining_lost_seconds; intelsgxexplained). Due to its small size, EPC faults are a common event. Multiple instances of an enclave with a small memory footprint may also cause a number of EPC faults. This is because an enclave, prior to its execution, is loaded completely into the EPC to verify its contents (everything_sgx_virtual; intelsgxexplained).

As can be seen in Figure 2, on crossing the EPC boundary the number of dTLB misses increases by 91×, page walk cycles by more than 124×, and EPC evictions by 100× as compared to when the amount of allocated memory is less than the EPC size. Hence, analyzing the impact of the EPC size on performance is crucial – a fact completely ignored by LMbench-SGX (portorshim) and Nbench-SGX (sgxlapd_sgxnbench).
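
As a rough illustration of how such a stress experiment can be constructed (a hypothetical sketch, not the exact code behind Figure 2), an ECALL can simply touch one byte per page of an enclave-heap buffer whose size is varied around the 92 MB EPC limit:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical enclave function: touches one byte per 4 KB page of a
     * trusted-heap buffer. Once `bytes` exceeds the usable EPC (~92 MB),
     * the accesses start triggering EPC evictions and reloads. */
    void ecall_touch_pages(size_t bytes)
    {
        const size_t page = 4096;
        volatile uint8_t *buf = malloc(bytes);   /* enclave heap, backed by the EPC */
        if (!buf)
            return;
        for (size_t i = 0; i < bytes; i += page)
            buf[i] ^= 1;                         /* one write per page */
        free((void *)buf);
    }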

3.2.2. Experiment: Execution of multi-threaded benchmarks

Figure 3. The latency of the Lighttpd server increases by up to 7× with the number of concurrent accesses while running in SGX, as compared to a Vanilla (non-SGX) execution.

An application leverages the multiple cores of a modern system by using threads to speed up its operation. Intel SGX does not support thread creation inside the secure region; however, numerous threads can perform an ECALL and execute the same function using the same global enclave ID (sgx_threads). The overhead due to Intel SGX can change drastically based on the number of threads making the ECALL. As shown in Figure 3, the latency of Lighttpd increases with the number of threads (by up to 7×). Hence, it is crucial to capture the execution characteristics in this setting as well. Nbench-SGX and LMbench-SGX do not contain any multi-threaded benchmarks.

3.2.3. Experiment: Library Operating System

As noted in prior work (portorshim; sgxlapd_sgxnbench) and also observed by us, shimming an application is much easier than porting an application to Intel SGX – in terms of both development and verification effort. We believe that in the future a library operating system will be the primary way to execute an application on Intel SGX. Hence, it is essential to understand the behavioral changes between ported and shimmed applications. Port-or-Shim (portorshim) also focuses on this problem, but with benchmarks that have a small working set (70 MB). Our observations are more comprehensive and also differ. As shown in Figure 4, the impact of a library operating system depends on the characteristics of the application and thus needs to be rigorously studied.

Figure 4. A library operating system may affect the performance of an application in a positive or negative manner, depending on the characteristics of the application.

3.2.4. Experiment: Real-world benchmarks

Real-world applications exhibit different phases during their execution. A typical pattern is that an application will read some data from the file system, process it, and then store the results. Micro-benchmarks such as Nbench (nbench_orig) lack this phase-change behavior and thus do not represent a real-world scenario (details in Section 5).
Let us summarize.


  1. Existing benchmark suites for Intel SGX (portorshim; sgxlapd_sgxnbench) are ported versions of decades-old benchmarks that were designed to evaluate CPU performance; they are not well suited to SGX.

  2. Using multiple threads changes the overheads of Intel SGX; hence, it is necessary to also include multi-threaded benchmarks.

  3. The performance impact of a library operating system depends on the characteristics of the application executing within it. Thus, there is a need for further study.

  4. Modern applications show different phases during their execution. Therefore, it is necessary to assess Intel SGX using real-world benchmarks.

4. SGXGauge Benchmark Suite

The most important challenge was to find an appropriate set of workloads that need to be executed in a secure environment. This problem has a degree of subjectivity. Nonetheless, we followed standard practice and restricted ourselves to workloads that have been used by highly cited works on SGX in the recent past. We found the following workloads: blockchain-related codes (blockchain_sgx; blockchain_sgx2; blockchain_sgx3; blockchain_sgx4; blockchain_sgx5), protecting key-value pairs (keyvalue_sgx; keyvalue_sgx2; keyvalue_sgx3; keyvalue_sgx4), securing databases (database_sgx; database_sgx2; database_sgx3; database_btree_sgx4), protecting keys (password_sgx; password_sgx2), securing machine learning models (ml_sgx; ml_sgx2; ml_sgx3), protecting network routing tables (hashtable_sgx), securing communication (hash_sgx_signal), graph traversals (graphtraversal_sgx; graphtraversal_sgx_2), protecting web servers (hotcalls; scone), and HPC workloads (sgxl_gups_xsbench).

The next task was to refine the set of workloads and choose an appropriate subset. There are three main sources of overheads in Intel SGX: encryption/decryption, enclave transitions, and EPC faults (see Section 2). While selecting workloads for SGXGauge, our primary aim was to ensure complete coverage of all the Intel SGX components. First, we selected some of the most commonly used workloads such as OpenSSL (openssl; sgx_ssl) and Lighttpd (graphenesgx; hotcalls; occlum). We then analyzed them and identified where they fall short in terms of stressing the SGX components. For example, both OpenSSL and Lighttpd do not stress the CPU much. Hence, to fulfill this criterion, we selected the Blockchain workload, which is CPU-intensive and multi-threaded. However, while it stresses the CPU, it does not use a lot of memory. To ensure that both components are stressed, we opted for an HPC workload, XSBench, which was used by prior work (sgxl_gups_xsbench) for similar purposes.

For selecting workloads that exclusively stress the EPC, we selected the following from prior work: B-Tree, BFS, HashJoin, and PageRank. Each of them has a different data access pattern. B-Tree is commonly used in databases for efficient lookups and has also been used in an Intel SGX setting (database_btree_sgx4). BFS is used in protecting the control flow graph of an application (graphtraversal_sgx_2; flaas). While B-Tree aims to minimize the number of nodes it visits, BFS aims to visit all nodes efficiently. HashJoin performs a number of hash table probing operations, which are at the core of many systems (hashtable_sgx; hash_sgx_signal). PageRank (pagerank_wiki) is a widely used workload for link analysis. SVM is a machine learning (ML) workload that is CPU- and memory-intensive; it runs multiple iterations over the same input data, a typical pattern of ML workloads. We also discarded some workloads such as Redis (redis), Fourier transform (rodinia), License Managers (license3j), GUPS (gups), Nginx (nginx_paper), etc., because they were similar to other workloads that were already chosen.

4.1. Evaluation Modes and Input Settings

We execute the workloads in SGXGauge in different modes and with different input settings to gain a better understanding of the workings of Intel SGX. Table 1 lists the different execution modes and input settings used in the paper.

Execution Modes
Vanilla: An application executing without Intel SGX support.
Native: An application executing within Intel SGX after it is ported to the SGX framework.
LibOS: An application executing with Intel SGX in shimmed mode – with the support of a LibOS (GrapheneSGX).
Input Settings
Low: memory footprint smaller than the EPC size, wherever applicable
Medium: memory footprint comparable to the EPC size, wherever applicable
High: memory footprint larger than the EPC size, wherever applicable
Table 1. Conventions used in the paper for discussion.

S.No. | Workload | Modes evaluated | Property | Low (< EPC) | Medium (≈ EPC) | High (> EPC)
1. | Blockchain (libcatena) | Vanilla, Native, LibOS | CPU/ECALL-intensive | Blocks: 3 | Blocks: 5 | Blocks: 8
2. | OpenSSL (openssl) | Vanilla, Native, LibOS | Data-intensive | File size: 76 MB | File size: 88 MB | File size: 151 MB
3. | BTree (btree) | Vanilla, Native, LibOS | Data/CPU-intensive | Elements: 1 M | Elements: 1.5 M | Elements: 2 M
4. | HashJoin (hashjoin) | Vanilla, Native, LibOS | Data/CPU-intensive | Data table size: 61 MB | Data table size: 91 MB | Data table size: 122 MB
5. | BFS (ligra) | Vanilla, Native, LibOS | Data-intensive | Nodes: 70 K, Edges: 909 K | Nodes: 100 K, Edges: 1.3 M | Nodes: 150 K, Edges: 1.9 M
6. | PageRank (ligra) | Vanilla, Native, LibOS | Data-intensive | Nodes: 4,500, Edges: 10.1 M | Nodes: 4,750, Edges: 11.2 M | Nodes: 5,000, Edges: 12.5 M
7. | Memcached (memcached) | Vanilla, LibOS | Data/ECALL-intensive | Records: 50 K, Operations: 800 K | Records: 100 K, Operations: 800 K | Records: 200 K, Operations: 800 K
8. | XSBench (xsbench) | Vanilla, LibOS | CPU-intensive | Points: 53 K, Lookups: 100 | Points: 88 K, Lookups: 100 | Points: 768 K, Lookups: 100
9. | Lighttpd (lighttpd) | Vanilla, LibOS | ECALL-intensive | Requests: 50 K, Threads: 16 | Requests: 60 K, Threads: 16 | Requests: 70 K, Threads: 16
10. | SVM (svm) | Vanilla, LibOS | Data/CPU-intensive | Rows: 4,000, Features: 128 | Rows: 6,000, Features: 128 | Rows: 10,000, Features: 128
Table 2. Description of the workloads in SGXGauge along with the specific settings used in the paper.

4.2. Workloads’ Description

4.2.1. Blockchain (libcatena):

A blockchain is a distributed ledger that does not require any central authority for its management. A blockchain is essentially a linked list of blocks. A block has a payload portion (the contents of the block) and the hash of the contents of the previous block in the chain. Given that the chain is immutable, these hashes have a degree of finality and permanence. We vary the number of blocks in the chain to create different inputs for the workload (see Table 2). The hash computation is the sensitive operation; hence, this operation is offloaded to Intel SGX. This function is called by many threads from the unsecure region, resulting in many ECALLs.
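
The untrusted side of this pattern looks roughly as follows (a hypothetical sketch using the SDK's untrusted runtime; the enclave file name, the generated header enclave_u.h, and the ECALL ecall_hash_block() are illustrative names, not the benchmark's actual code):

    #include <pthread.h>
    #include "sgx_urts.h"
    #include "enclave_u.h"   /* generated untrusted proxies, e.g., ecall_hash_block() */

    static sgx_enclave_id_t eid;

    static void *worker(void *arg)
    {
        /* Every thread enters the same enclave through the same ECALL;
         * each entry/exit flushes the enclave's TLB entries. */
        ecall_hash_block(eid, (int)(long)arg);
        return NULL;
    }

    int main(void)
    {
        sgx_launch_token_t token = {0};
        int updated = 0;
        pthread_t tid[16];

        sgx_create_enclave("enclave.signed.so", 1 /* debug */, &token, &updated, &eid, NULL);
        for (long i = 0; i < 16; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < 16; i++)
            pthread_join(tid[i], NULL);
        sgx_destroy_enclave(eid);
        return 0;
    }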

4.2.2. OpenSSL (sgx_ssl; sgxometer):

OpenSSL is a library that provides developers access to cryptographic primitives. Our workload uses Intel SGX-SSL (sgx_ssl) to read encrypted data from an input file and decrypt it within SGX. Then, it performs a small compute-intensive task based on the contents of the decrypted file. Finally, it encrypts the generated output and saves it in the untrusted file system. This workload stresses the mechanisms that copy data from the unsecure memory region to the EPC, and it also stresses EPC paging if the input file size is larger than the EPC size.

4.2.3. B-Tree (database_btree_sgx4)

The B-Tree data structure is used for an efficient organization of data, specifically in database management systems. This enables efficient lookup in large databases and applications of a similar nature – a crucial feature in today’s “big-data” world. This workload creates a B-Tree consisting of a certain number of elements and performs multiple find operations on a randomly generated set of keys. This workload is also designed to stress the EPC and the paging system.

4.2.4. HashJoin (hash_sgx_signal; hashtable_sgx):

The hash-join algorithm is used in modern databases to implement “equi-join” (equijoin_condition). It has two phases: build and probe. Given two data tables, it first builds a hash table from the rows of the first table, and then probes it using the rows of the second table. We vary the size of the first table and, in effect, the memory- and compute-intensive nature of the workload.
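
A minimal sketch of the two phases (illustrative only; the benchmark's actual data layout and hash function differ):

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct { uint32_t key; uint32_t payload; } row_t;
    typedef struct bucket { row_t row; struct bucket *next; } bucket_t;

    /* Build phase: insert every row of table R into a chained hash table. */
    static bucket_t **build(const row_t *R, size_t n, size_t nbuckets)
    {
        bucket_t **ht = calloc(nbuckets, sizeof(*ht));
        for (size_t i = 0; i < n; i++) {
            bucket_t *b = malloc(sizeof(*b));
            b->row = R[i];
            b->next = ht[R[i].key % nbuckets];
            ht[R[i].key % nbuckets] = b;
        }
        return ht;
    }

    /* Probe phase: for every row of table S, look up matching keys from R. */
    static size_t probe(bucket_t **ht, size_t nbuckets, const row_t *S, size_t m)
    {
        size_t matches = 0;
        for (size_t i = 0; i < m; i++)
            for (bucket_t *b = ht[S[i].key % nbuckets]; b; b = b->next)
                if (b->row.key == S[i].key)
                    matches++;
        return matches;
    }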

4.2.5. Breadth-First Search (BFS) (graphtraversal_sgx; sabel_bfs):

This workload is a port of the well-known breadth-first search implementation from the Rodinia benchmark suite (rodinia). The input to the workload is an undirected graph. It first reads the input graph into the EPC and then traverses all the connected components in the graph. This is a primarily memory- and compute-intensive workload. The computation overhead is a function of the number of nodes in the graph; the degree of each node is at least 3.

4.2.6. Page Rank (sgxl_gups_xsbench):

PageRank is used to rank web pages based on the popularity of the pages that point to them. The input to the workload is a connected directed graph represented in the adjacency-list format with an out-degree of at least 1. The workload loads the graph into the EPC and builds an adjacency matrix of pages with a default initial rank for all of them. The workload then uses the number of out-links of each page, its previous rank, and the weights of its neighboring pages to assign a new rank. This is repeated a fixed number of times.
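
For reference, the classical PageRank update that such workloads iterate is the following (with damping factor d, commonly 0.85, and N pages; the benchmark's exact constants and weighting may differ):

    PR_{t+1}(p) = \frac{1 - d}{N} + d \sum_{q \in \mathrm{in}(p)} \frac{PR_t(q)}{|\mathrm{out}(q)|}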

4.2.7. Memcached (scone; memcached; hotcalls)

Memcached is an in-memory key-value store. It is used in production servers to cache hot data in memory. We use the popular YCSB (ycsb) workload to evaluate the performance of Memcached. YCSB first populates Memcached with a specified amount of data and then performs a specified set of (read or write) operations on those key-value pairs.

4.2.8. XSBench (xsbench; sgxl_gups_xsbench)

XSBench is a key computational kernel of the Monte Carlo neutron transport algorithm over a set of “nuclides” and “grid-points” (xsbench). We vary the number of grid points to generate different input sizes for the workload (see Table  2).

4.2.9. Lighttpd (lighttpd; hotcalls; graphenesgx):

Lighttpd is a lightweight web server that is optimized for concurrent accesses; the server, however, runs on a single thread. Our workload hosts a web page of size 20 KB (similar to (hotcalls)). We use the ab tool, which is a part of the Apache suite (ab_apache), to make a certain number of requests to the Lighttpd server using concurrent threads (see Table 2).

4.2.10. Support Vector Machine (SVM) (libsvm)

SVM is a popular machine learning technique that classifies input data by projecting it into a higher-dimensional space and then using a linear combination of separating functions. We implemented SVM using libSVM (libsvm), a library for using SVMs from C/C++ code.
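
A minimal sketch of how the libSVM C API is typically driven (illustrative; the parameter values are assumptions, and the actual workload's data loading and settings differ):

    #include "svm.h"   /* libSVM header */

    /* Hypothetical training step: `prob` is assumed to already hold the
     * input rows (arrays of struct svm_node) and their labels. */
    static double train_and_classify(struct svm_problem *prob)
    {
        struct svm_parameter param = {
            .svm_type    = C_SVC,
            .kernel_type = RBF,
            .gamma       = 1.0 / 128,   /* 1 / number of features (128 in Table 2) */
            .C           = 1.0,
            .cache_size  = 100,         /* MB */
            .eps         = 1e-3,
        };

        struct svm_model *model = svm_train(prob, &param);
        double label = svm_predict(model, prob->x[0]);   /* classify the first row */
        svm_free_and_destroy_model(&model);
        return label;
    }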

4.3. Porting to Intel SGX

SGXGauge contains 10 benchmarks. We have ported 6 of these to execute natively on Intel SGX (Native mode). The other 4 are real-world benchmarks, which are evaluated in LibOS mode using GrapheneSGX (details in Section 5). For these benchmarks, the engineering and verification effort of creating a native SGX port was prohibitive, and the benefits were not clear. Table 1 summarizes the execution modes and the three input settings (with different memory footprints): Low, Medium, and High.

While porting an application to Intel SGX, the ideal case is to run the entire application within an enclave. However, this is not always possible due to the restrictions imposed by Intel SGX. In this case, a crucial function is typically moved to the enclave and is accessed via an ECALL. This is the standard practice (glamdring; enclavedom).

We follow this approach while porting the applications. We completely ported OpenSSL, BFS, PageRank, B-Tree, and HashJoin to Intel SGX. However, Blockchain uses multiple threads to speed up the hash-finding process, and Intel SGX does not support the creation of threads within an enclave (sgx_threads). Nevertheless, multiple threads from the untrusted region can call the same ECALL function. Hence, for Blockchain, we moved the hash function inside Intel SGX; it is called by different threads of the main application, which runs in the untrusted region.

4.4. Running on GrapheneSGX

To execute a binary on GrapheneSGX, we first need to define a “manifest” file. The manifest file contains the binary's location, the list of required libraries, and the required input files. Parameters such as the enclave size and the number of threads to be used are also listed here. GrapheneSGX then processes this file and calculates the hashes of all the required input files, which are verified at execution time.
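
For illustration, a stripped-down manifest for a workload such as Lighttpd might look as follows (key names and syntax vary across GrapheneSGX versions, and the paths are placeholders):

    loader.exec = file:/usr/sbin/lighttpd
    loader.preload = file:$(GRAPHENEDIR)/Runtime/libsysdb.so

    sgx.enclave_size = 4G
    sgx.thread_num = 16

    sgx.trusted_files.conf = file:lighttpd.conf
    sgx.allowed_files.htdocs = file:htdocs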

5. Evaluation

Here, we discuss the performance of workloads in SGXGauge under different execution modes and with different input settings (see Table 1). We focus on the insights that are not listed in prior work (sgxometer; everything_sgx_virtual; portorshim; sgx_performance), or where our observations differ from theirs.

Hardware Settings
CPU: Xeon E-2186G, 3.80 GHz (1 socket, 6 cores, 2-way HT)
DRAM: 32 GB; Disk: 1 TB (HDD)
Caches: L1: 384 KB, L2: 1536 KB, L3: 12 MB
System Settings
Linux kernel: 5.9; ASLR: off; GCC: 9.3.0
DVFS: fixed frequency (performance); Transparent Huge Pages: never
SGX Settings
PRM: 128 MB; Driver: 2.11; SDK version: 2.13
GrapheneSGX Settings
Enclave size: 4 GB; Threads: 16; Internal memory: 64 MB
Table 3. System configuration.

5.1. Experimental Setup

The details of our evaluated system can be seen in Table 3. We use the GrapheneSGX library operating system (graphenesgx) for our experiments in LibOS mode. Since GrapheneSGX is under constant development, we found that the performance of the code in the master branch is significantly better than that of its official release (v1.1). Hence, we used the code from the master branch of their GitHub repository (commit ID: adf6269218dfa80aed276d57121a98e7b13b0f4e).

5.1.1. Instrumenting Intel SGX

In order to instrument SGX-related events, we added instrumentation code directly to the Intel SGX driver. This approach has also been used in prior work (sgxtop; teemon). We identified crucial functions within the driver code that are called during different SGX events, such as sgx_do_fault() (the page fault handling function). Note that these functions do not execute within the secure world and thus can be easily instrumented. We report the latencies of SGX's crucial functions in Appendix A.
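
The kind of instrumentation involved is sketched below (hypothetical: the counter names and the wrapper are ours, and the real signature of sgx_do_fault() depends on the driver version):

    /* Hypothetical counters added around the driver's EPC fault handler.
     * Driver code runs outside the enclave, so RDTSC is usable here. */
    #include <linux/atomic.h>
    #include <linux/mm.h>
    #include <asm/msr.h>

    static atomic64_t epc_fault_count;
    static atomic64_t epc_fault_cycles;

    static int instrumented_sgx_do_fault(struct vm_area_struct *vma, unsigned long addr)
    {
        u64 start = rdtsc();
        int ret = sgx_do_fault(vma, addr);   /* existing handler; signature is illustrative */

        atomic64_inc(&epc_fault_count);
        atomic64_add(rdtsc() - start, &epc_fault_cycles);
        return ret;
    }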

Native Mode w.r.t. Vanilla (6 workloads)
Setting | Overhead | dTLB misses | Walk cycles | Stall cycles | LLC misses | EPC evictions
Low | 2.0 | 8.38 | 29.7 | 2.5 | 1.8 | 21.5 K
Medium | 3.0 | 14.6 | 57.0 | 5.3 | 2.0 | 49.6 K
High | 3.4 | 17.48 | 59.1 | 6.4 | 3.0 | 79.6 K
LibOS Mode w.r.t. Vanilla (10 workloads)
Low | 2.03 | 40.6 | 517 | 114 | 24 | 796 K
Medium | 3.13 | 59.7 | 724 | 146 | 18.5 | 1,792 K
High | 3.7 | 44.0 | 113 | 12.7 | 15.5 | 2,255 K
LibOS Mode w.r.t. Native (6 workloads)
Low | 1.03 | 3.3 | 5.1 | 8.3 | 9.3 | 75
Medium | 1.03 | 2.7 | 4.0 | 7.9 | 9.2 | 68
High | 0.9 | 2.0 | 3.0 | 5.9 | 7.2 | 45
Table 4. Overhead in system-related events. The average number of EPC evictions is reported when compared with the Vanilla mode. The overhead refers to the performance overhead (run time).

5.2. Evaluation Plan

We take the following approach for evaluation.

  • Native mode performance: We analyze the impact of Intel SGX on the applications executing natively on it for different input sizes.

  • LibOS mode performance: We study the overheads that are introduced while executing on Intel SGX using a library operating system.

  • Native mode vs. LibOS mode: We compare the performance of the Native and LibOS execution modes.

Table 4 shows an overview of the evaluation results. The geometric mean value is computed across at least 10 executions (we found this to be sufficient).

(a) Runtime overhead in Native mode.
(b) EPC evictions in Native mode.
Figure 5. Performance impact of SGX on applications in Native mode for different input sizes.
(a) Statistics for GrapheneSGX for an “empty” workload.
(b) Figure showing the overhead in the execution for the LibOS setting.
(c) Figure showing total EPC page reloads for the LibOS setting.
(d) The latency of Lighttpd improves when using the switchless mode
Figure 6. Performance impact of GrapheneSGX on workloads in SGXGauge.

5.3. Native mode Performance

Here, we evaluate the performance overhead of running an application in the Native mode as compared to the Vanilla mode with different input sizes. As shown in Figure 5(a), the performance overhead increases as we go from the Low to the Medium setting, and increases further from the Medium to the High setting.

As shown in Figure 5(b), the total number of EPC evictions increases when the input size is increased from Low to Medium, and increases further when it is increased from Medium to High. As already explained in Section 2, the TLB entries of an enclave are flushed before a transition to the unsecure region due to security reasons. Hence, as we increase the input size from Low to Medium and then to High, the number of dTLB misses increases, and with it the total number of walk cycles and, consequently, the total number of stall cycles.

Summary: As we approach the EPC size (Low to Medium), there is a sudden rise in all the paging- and TLB-related performance counters. However, going beyond the EPC size (Medium to High) does not affect the performance to the same extent. We present a detailed discussion of all the workloads in Appendix B and the impact of the counters on the workloads' performance in Appendix C.

5.4. LibOS Mode Performance

Here, we evaluate the performance impact of GrapheneSGX.

5.4.1. GrapheneSGX Overhead

We first characterize the overhead of just GrapheneSGX using an “empty” (return 0;) workload. As shown in Figure 6(a), in this setup, GrapheneSGX performs 300 ECALLs, 1000 OCALLs, and 1000 AEX exits. During this time, the total number of EPC evictions is 1 M. However, out of these 1 M evicted EPC pages, only 700 pages (2 MB) are loaded back.

The reason for the unusually high number of EPC evictions is the enclave size property, which is set to 4 GB. This is because, prior to executing an enclave, SGX completely loads it into the EPC to calculate its signature (everything_sgx_virtual). Doing so for a 4 GB enclave causes 1 M EPC faults (1 M × 4 KB = 4 GB). Lowering the value of the “enclave-size” property reduces the EPC evictions but worsens the performance by up to 4×, even for workloads with a small memory footprint such as Blockchain. All these EPC evictions happen at the beginning of the execution, i.e., while initializing GrapheneSGX. We do not count this time in the execution time of a workload running on it (see Appendix D).

Performance: As shown in Figure 6(b), the performance overhead increases while going from Low to Medium, and again while going from Medium to High. As shown in Figure 6(c), the total number of EPC load-backs (pages brought back to the EPC from the untrusted memory) increases when the input size grows from Low to Medium, and further when it grows from Medium to High. The total number of dTLB misses also increases with the input size; consequently, so do the total number of walk cycles and, in turn, the total number of stall cycles.

Summary: Similar to Native mode, as we approach the EPC size (Low to Medium), there is a sudden increase in all the TLB and paging-related performance counter values. However, going beyond the EPC size (Medium to High) does not affect the performance as much. We discuss the I/O related overheads with GrapheneSGX in Appendix E.

5.5. Native Mode vs LibOS Mode

Here, we compare the performance of the Native and LibOS modes (see Table 4). We observe that as we increase the input size, the performance overhead of the LibOS mode as compared to the Native mode starts decreasing. The total number of dTLB misses comes down when the input is increased from Low to Medium and from Medium to High. In the same setting, the total walk cycles come down by 21% and 25%, stall cycles by 4% and 25%, LLC misses by 1% and 21%, and EPC evictions by 9% and 33%, when we increase the input size from Low to Medium and from Medium to High, respectively.

Summary: On increasing the workload size, the overhead of GrapheneSGX starts decreasing, and eventually, approaches that of the Native mode.

5.6. Switchless Mode

To reduce the cost of an OCALL, Intel SGX supports a switchless mode of operation where it leverages the multiple cores of a modern system to make an OCALL without exiting the enclave – thus preventing a TLB flush. In this case, a set of threads (proxy threads) on dedicated cores is used to handle the OCALLs. Here, the parameters of an OCALL and other relevant data are sent to a proxy thread running on another core using an unsecure shared memory channel. The proxy thread reads the request and performs the operation. Once the operation is finished, the results are written to the shared memory region; these results are subsequently read by the enclave that issued the request. This is a standard pattern and is used to hide the overheads of system calls in regular operating systems as well.
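
A simplified sketch of this producer/consumer pattern is shown below (hypothetical; real implementations such as the SDK's switchless library and GrapheneSGX's exitless OCALLs add batching, sleeping/wake-up, and multiple request slots):

    #include <stdatomic.h>
    #include <string.h>

    /* A single-slot request channel placed in unsecure shared memory. */
    typedef struct {
        atomic_int state;        /* 0 = empty, 1 = request ready, 2 = response ready */
        int        request_no;   /* which untrusted service/system call is wanted */
        char       args[256];
        long       result;
    } switchless_slot_t;

    long do_untrusted_request(int no, const char *args);   /* hypothetical helper */

    /* Enclave side: post a request and spin until the proxy answers,
     * without leaving the enclave (no OCALL, hence no TLB flush). */
    static long enclave_request(switchless_slot_t *slot, int no, const void *args, size_t len)
    {
        memcpy(slot->args, args, len);
        slot->request_no = no;
        atomic_store(&slot->state, 1);
        while (atomic_load(&slot->state) != 2)
            ;   /* spin; a real implementation would back off or sleep */
        atomic_store(&slot->state, 0);
        return slot->result;
    }

    /* Untrusted proxy thread pinned to a dedicated core. */
    static void proxy_loop(switchless_slot_t *slot)
    {
        for (;;) {
            while (atomic_load(&slot->state) != 1)
                ;   /* poll for work */
            slot->result = do_untrusted_request(slot->request_no, slot->args);
            atomic_store(&slot->state, 2);
        }
    }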

We configured GrapheneSGX to use 8 cores for handling OCALL requests from enclaves. For Lighttpd, this reduces the total number of dTLB misses by 60%, thus improving the latency by 30% as compared to the default implementation of OCALLs (see Figure 6(d)).

6. Conclusion

We introduced SGXGauge, a benchmark suite for Intel SGX that captures a holistic view of the performance of applications running in such TEEs – this includes the impact of the EPC memory. SGXGauge contains diverse benchmarks that affect different components of SGX. We also performed an evaluation of the performance of SGX in LibOS mode and showed that there is a marked difference in behavior as the memory footprint crosses the EPC size limit.

References

Appendix A Intel SGX Latencies

Instrumenting an application running in the secure mode within SGX is a non-trivial task due to the constraints imposed by Intel SGX. RDTSC instructions (rdtsc; rdtsc_micro), which are generally used to measure the cycles taken by an operation, are not allowed in SGX. Fortunately, the Intel SGX driver code does not execute in Intel SGX and thus can be easily instrumented.

We measured the latencies of the core Intel SGX operations: allocating a page (sgx_alloc_page()), evicting a page (sgx_ewb()), loading back a page (sgx_eldu()), and handling a page fault in SGX (sgx_do_fault()). We use the ftrace tool (ftrace) for this purpose. SGX uses the EWB instruction to evict a page from the EPC and the ELDU instruction to load it back. While evicting an EPC page, first its MAC is calculated and then the page is encrypted. While loading back, the page is decrypted and its integrity is checked using the MAC (vault).

These functions are highly optimized, with latencies in the range of a few microseconds. Figure 7 reports the mean of 40K+ samples. The latency of evicting an EPC page is higher than that of loading it back. SGX evicts pages in batches of typically 16 pages; however, during a fault, a single page is loaded back.

Figure 7. Latency of core operations in Intel SGX.

Appendix B Native Mode Performance

Here, we discuss the overheads of the 6 workloads from SGXGauge that execute in the Native mode. We use the change in the most relevant hardware performance counters for this discussion. A heat map of these counters is shown in Figure 8.

(a) Blockchain
(b) OpenSSL
(c) B-Tree
(d) HashJoin
(e) BFS
(f) PageRank
Figure 8. Overheads for the workloads when executing in the Native mode w.r.t the Vanilla mode.

b.1. Blockchain

As seen in Figure 8(a), while executing in the Native mode, the total number of dTLB misses is about 2000× more than in the Vanilla mode. This is because, in our implementation of Blockchain, the hash function is protected inside an enclave. It is called via ECALLs from many threads that are executing in the unsecure region. This results in many enclave transitions and thus many TLB flushes. The TLB entries have to be repopulated after every ECALL by page table walks; hence, we see a similar increase in the number of walk cycles.

With 16 threads, there are 3,133 K ECALLs in the Low setting, 4,831 K in the Medium setting, and 8,944 K in the High setting.

b.2. OpenSSL

In OpenSSL, the number of EPC evictions increases from 389 K to 433 K to 721 K as we increase the input size from Low to Medium to High. Due to this, in the High setting, the total number of enclave exits increases, thus increasing the total number of dTLB misses by 131× and walk cycles by 196× w.r.t. the Vanilla mode.

b.3. B-Tree

In B-Tree, the total number of EPC evictions increases from 79 K to 116 K to 305 K as we move from the Low to the Medium and then to the High input setting. However, here the total number of dTLB misses increases by only 2.2×, and only in the High setting. This is because the total number of dTLB misses is dominated by the number of page faults caused by the workload. To serve a page fault, an enclave performs an asynchronous exit (AEX), which also causes a TLB flush. The total number of page faults increases from 3× in the Low setting to 7.5× in the High setting, and the total LLC misses increase from 1× in the Low setting to 6.4× in the High setting.

b.4. HashJoin

In HashJoin, on increasing the input size we observe an increase in almost all of the performance counters. Most notably, in the High input setting the total number of page faults and dTLB misses increase by 246× and 140× over the Vanilla mode, respectively. This is due to the characteristics of the workload: a typical hash-join operation incurs many cache misses and stall cycles (hashjoin_perf).

b.5. Bfs

In BFS, the total number of page faults increases by 3× as compared to the Vanilla mode. However, we do not observe a large impact when increasing the input size. This is because of the inherent locality in the workload.

b.6. PageRank

In PageRank, we observe a decrease in the total number of walk cycles on increasing the input size. This is because dTLB misses also go down with an increasing input size. The main reason for this is the nature of the workload: in the Vanilla mode (not shown in the figure), the number of dTLB misses increases by 3.6× when we increase the input size from the Low to the High setting. Hence, the nature of the workload dominates the total number of misses, hiding the extra misses caused by SGX.

Appendix C Counter Impact on Performance

Intel SGX provides a way to execute an application securely on a remote machine, although with some limitations. Researchers are working on developing methods to circumvent these limitations, and different solutions might affect the components of Intel SGX differently. Here, we provide a generic approach for developers to select the right benchmarks from SGXGauge as per their requirements.

As pointed out in prior work and observed in our experiments, when a benchmark reserves more memory than the EPC size, it suffers a slowdown. Using more memory than the EPC may impact the total number of dTLB misses, LLC misses, walk cycles, stall cycles, and EPC evictions. We rank these metrics in order of importance for each of the workloads in SGXGauge. We use linear regression (multiple_linear_regression) for this purpose: it predicts the execution time given these metrics as input and, while doing so, assigns coefficients to the metrics. The magnitude of these coefficients is correlated with the importance of the corresponding metric in determining the execution time (see Table 5).
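
Concretely, for each workload the fitted model has the following form (assuming the counters are normalized so that the coefficient magnitudes are comparable); the counters are then ranked by |β_i|:

    T \approx \beta_0 + \beta_1\,\mathrm{WalkCycles} + \beta_2\,\mathrm{StallCycles} + \beta_3\,\mathrm{PageFaults} + \beta_4\,\mathrm{dTLBMisses} + \beta_5\,\mathrm{LLCMisses} + \beta_6\,\mathrm{EPCEvictions}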

Workloads Walk cycles Stall cycles Page faults dTLB misses LLC misses EPC evictions
Native mode
Blockchain 0.33 0.32 0.01 0 0.32 0
OpenSSL 0.08 0.12 0.17 0 0.21 0.14
BTree 0.27 0.11 0.22 0.05 0.22 0.11
HashJoin 0.21 0.16 0.21 0.03 0.21 0.21
BFS 0.10 0.17 0.18 0.21 0.09 0.22
PageRank 0.44 0.54 0.05 0.33 0.65 0.04
LibOS mode
Memcached 0.03 0.04 0.09 0.15 0.13 0.09
XSBench 0.16 0.17 0.17 0.18 0.16 0.17
Lighttpd 0.18 0.19 0 0.09 0.26 0
SVM 0.09 0.60 0.27 0.09 0.31 0.03
Table 5. The regression coefficient of each hardware performance counter for each workload; the counter with the largest coefficient is the most important one for determining that workload's performance. LLC refers to the last-level cache.

We can conclude that, most of the time, paging- and TLB-related counters are the ones most correlated with performance. LLC misses are an important factor mainly in OpenSSL.

Appendix D GrapheneSGX start-up overhead

Here, we discuss the overhead in initializing GrapheneSGX.

Figure 9. Figure showing the performance counter values for EPC page allocation, eviction, and load-back during the execution phase of B-Tree in the Native mode (N-) and LibOS mode (G-).

Intel SGX verifies the signature of an enclave prior to its execution. To do so, it loads the entire enclave into the EPC. In SGX v1, a heap size greater than the EPC size was not allowed, as that would not allow SGX to load the complete enclave into the EPC. Since SGX v2, a heap size greater than the EPC is allowed; SGX transparently evicts and loads pages as required.

Figure 9 shows the allocation, eviction, and loading back of EPC pages in the Native and LibOS modes for a representative workload, B-Tree. This pattern remains the same across the other workloads as well. SGX first calculates the signature of the enclave, which causes the initial EPC evictions. Note that EPC pages are allocated after the verification is done. After that, the EPC access pattern of GrapheneSGX is the same as that of the Native mode.

Intel recommends setting the enclave size as per the maximum requirement of the application. However, in our experiments we observed additional overheads when setting a lower value of the enclave size in LibOS mode. This is related to how GrapheneSGX initializes the enclave. We thus used an enclave size of 4 GB for all our experiments. We do not count the GrapheneSGX start-up time in the workload execution time while calculating the overheads in Section 5, mainly because this is a one-time activity and a workload can run for a very long time after its enclave is initialized. Also note in Figure 9 that after the initialization phase the gray (GrapheneSGX) and black (Native) lines converge (same behavior).

Appendix E What About I/O?

As mentioned before, SGX does not support system calls, notably file system calls. An enclave needs to rely on an OCALL to read or write a file on the file system. In this case, by default, the data is transferred in plaintext, and it is the responsibility of the developer to protect it via encryption. SGX has a sealing feature, where data can be “sealed”, i.e., encrypted using a platform-dependent hardware-derived key; the Intel SDK provides a trusted sealing library for this purpose (intelsgxexplained). The sealed data can only be “unsealed” or decrypted on the same platform, and optionally, it can be configured to be decrypted only by the same enclave that encrypted it.
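
For reference, a minimal sketch of sealing a buffer with the SDK's trusted sealing library (illustrative; error handling and the MRENCLAVE/MRSIGNER policy choice are omitted):

    #include <stdint.h>
    #include <stdlib.h>
    #include "sgx_tseal.h"

    /* Seal `plain` (plain_len bytes) with the platform/enclave-bound key.
     * The returned buffer can safely be written to the untrusted file
     * system via an OCALL. */
    static sgx_sealed_data_t *seal_buffer(const uint8_t *plain, uint32_t plain_len,
                                          uint32_t *sealed_len)
    {
        *sealed_len = sgx_calc_sealed_data_size(0, plain_len);
        sgx_sealed_data_t *sealed = malloc(*sealed_len);
        if (!sealed)
            return NULL;

        if (sgx_seal_data(0, NULL,                /* no additional MAC text */
                          plain_len, plain,
                          *sealed_len, sealed) != SGX_SUCCESS) {
            free(sealed);
            return NULL;
        }
        return sealed;
    }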

Library operating systems support file system operations by transparently capturing these calls and handling them either via an OCALL or via a parallel proxy thread executing on a different core. However, as this is not covered by Intel SGX constructs, a naive implementation will still write the data in plaintext to the file system, essentially leaking it. GrapheneSGX supports transparently encrypting files before they are written to the file system; this feature is known as the protected file system or PF (graphenepf) mode. However, as shown in Figure 10, this feature is not optimized, and the performance of an I/O-intensive application can suffer significantly when PF is used.

(a) Read performance
(b) Write performance
(c) ECALL overhead
(d) OCALL overhead
Figure 10. The I/O overhead with GrapheneSGX (S-G) and GrapheneSGX with protected files (S-P). Iozone: reading and writing 1 GB of data with 4 M blocks.

We use the popular file system benchmark Iozone (iozone) to evaluate the performance of the GrapheneSGX PF system. We compare this against the Vanilla mode and the LibOS mode without the protected file setting. LibOS incurs an overhead of 33% and 36% compared to the Vanilla mode for read and write operations, respectively. The overhead increases further for both read and write operations when the protected files mode is enabled. The main reason for this is the increase in the number of ECALLs (see Figure 10(c)) and OCALLs (see Figure 10(d)).

The PF mode needs to be optimized to make it practical for production-quality systems.