SESAME: Software defined Enclaves to Secure Inference Accelerators with Multi-tenant Execution

07/14/2020 ∙ by Sarbartha Banerjee, et al. ∙ 0

Hardware-enclaves that target complex CPU designs compromise both security and performance. Programs have little control over micro-architecture, which leads to side-channel leaks, and then have to be transformed to have worst-case control- and data-flow behaviors and thus incur considerable slowdown. We propose to address these security and performance problems by bringing enclaves into the realm of accelerator-rich architectures. The key idea is to construct software-defined enclaves (SDEs) where the protections and slowdown are tied to an application-defined threat model and tuned by a compiler for the accelerator's specific domain. This vertically integrated approach requires new hardware data-structures to partition, clear, and shape the utilization of hardware resources; and a compiler that instantiates and schedules these data-structures to create multi-tenant enclaves on accelerators. We demonstrate our ideas with a comprehensive prototype – Sesame – that includes modifications to compiler, ISA, and microarchitecture to a decoupled access execute (DAE) accelerator framework for deep learning models. Our security evaluation shows that classifiers that could distinguish different layers in VGG, ResNet, and AlexNet, fail to do so when run using Sesame. Our synthesizable hardware prototype (on a Xilinx Pynq board) demonstrates how the compiler and micro-architecture enables threat-model-specific trade-offs in code size increase ranging from 3-7 % and run-time performance overhead for specific defenses ranging from 3.96% to 34.87% (across confidential inputs and models and single vs. multi-tenant systems).

1 Introduction

Hardware-enclaves are key to confidential computing [confidential-computing-consortium] – where users can push their private data into a box that is invisible to privileged software, co-resident processes, and even attackers with physical access to the pins of the chip [aegis, xom, bonsai-merkle-tree, ascend, Phantom, SGX, sanctuary, Sanctum, keystone]. Confidential computing can provide a trustworthy foundation where users can safely work with healthcare, financial, or business data while organizations can offload compliance enforcement [gdpr, ccpa] to services that provide hardware root of trust.

Enclaves today however offer a Hobson’s choice. A user can pick a design like Intel SGX that hardwires a very specific threat model for general-purpose CPUs – and pay with performance-overheads [sgx-is-slow, Vault] and the risk of side-channel breaches [Foreshadow, RIDL, Fallout, attackSurvey] – or be left with the default unprotected execution inside virtual machines. Crucially, confidential computing is limited to general-purpose cores while accelerator-rich architectures [TPU, TVM] worsen the gap between non-secure and enclaved execution for important classes of programs like deep learning and graph computing[Graphicianado, gui2019survey].

In this paper, we propose to bring confidential computing to domain-specific secure accelerators, retaining most of the performance benefits of using an accelerator (compared to CPUs) and also reducing the attack surface by closing side-channels. The key insight is to enable software-defined enclaves – where software can construct enclaves that are customized for a specific threat-model and program domain – in order to optimize performance without leaking secrets. Threat models are only known at deployment-time, and hard-wiring one into hardware means that every confidential program pays the price of security against threats it may not care about – for example, the cost of integrity checks or oblivious main memory accesses in a secure data-center facility, or the price of obfuscating code when it might be public. Similarly, general-purpose program execution has to be obfuscated assuming worst-case control- and data-flows and uncontrolled micro-architectural side-effects [raccoon, escort], which incurs significant slowdowns. Software-defined enclaves can tune the slowdowns from security to scale gracefully with threat models while driving security and performance optimizations down from algorithms to bits and gates.

Specifically, we introduce Sesame, a software-defined enclave framework for multi-tenant machine learning inference accelerators that are tightly coupled to a CPU (e.g., Arm Ethos-N NPUs [ArmEthos-N, Arm-ML]). We focus on decoupled access-execute (DAE) architectures as an example accelerator that is popular across deep learning, graph, and other data-driven domains across cloud and edge devices. Sesame takes deep learning models such as VGG, ResNet, and AlexNet as input and produces as output an auto-tuned program and a multi-tenant hardware design that enforces computational non-interference between security domains. Crucially, a user can express their threat model in terms of (a) the program model and/or the user inputs being secret, and (b) whether the attacker’s visibility includes on-chip or off-chip signals. Sesame translates all concurrent users’ threat models into a lattice of security labels and ensures computational non-interference among all labels – (informally) ensuring that secret inputs/model information from one user does not leak via on-chip and/or off-chip signals to other users (including the cloud provider or hardware-owner).

Sesame includes defense mechanisms across the layers of the accelerator stack. At the synthesizable hardware (RTL) level, we introduce two new data-structures that the Sesame compiler can instantiate and tune for the current threat model and program: (1) private queue that software can partition based on a schedule – these queues prevent contention-channel leaks and replace queues that are ubiquitous in hardware designs between pipelined stages and arbiters of shared hardware; (2) traffic shaper that decouples observation channels – where attackers infer secrets from signals coming out of a software-defined enclave – from secret variables. While partitioning and shaping are generic strategies that have been applied from networking [vuvuzela] to hardware [camouflage], Sesame’s contribution is to enable compiler/synthesis tools to overlay non-secure RTL designs with private queues and shapers and re-use the data movement logic to create flexible enclaves. Further, for traffic shapers that can be software-configured, their RTL implementation can be far simpler than (e.g.) hardware-only shapers designed specifically to obfuscate main memory traffic [camouflage].

At the compiler level, Sesame’s auto-tuning phase generates a tiling schedule that maximizes performance while ensuring that the computation and memory access schedule has no secret information. The code-generator then annotates instructions appropriately to obfuscate observation channels like execution variability and based on the threat model the driver turns on defenses like private queues and traffic shaping to quash contention and external observation channels.

Enabling a constant-time mode for execution units (like GEMM and ALU) is the last piece required to construct multi-tenant enclaves – our prototype system’s baseline GEMM and ALU units happen to be constant-time already and hence we do not use this feature.

To summarize, we make the following contributions:

  • Sesame introduces software-defined enclaves for domain specific accelerators, enabling enclave software to be confidential against co-tenants and infrastructure providers.

  • A detailed vulnerability analysis of on-chip accelerators, highlighting gaps that need defenses in order to construct a range of accelerator enclaves.

  • A cross-layer design including new hardware data-structures to shape traffic and to share queues (in addition to using standard base-bound techniques to partition on-chip state); exposed via ISA extensions to software; that a compiler uses to generate code for a user-specified threat model.

  • An end-to-end implementation of a deep-learning accelerator (using a baseline VTA [VTA] system) where Sesame compiles and synthesizes six workloads – including ResNet, VGG, and AlexNet models – to a Xilinx FPGA.

  • A security evaluation that shows a classifier that could distinguish different layers in each model (the first step towards reversing models and weights) is foiled by Sesame; and a performance evaluation in terms of code-size, slowdown, and area cost of Sesame across a range of system configurations and threat models. For example, slowdown varies from 3.96% to 34.87% under increasingly confidential and contention-heavy settings.

We will open source our hardware and compiler contributions to spur further research into software-defined enclaves for DAE-applications beyond deep learning.  


2 Background

2.1 Baseline Secure Platform

Our baseline consists of a user on a client system who engages one or more services deployed by an infrastructure provider on a cloud system. The infrastructure provider can expose accelerator resources in standard units (e.g., small/medium/large instances on Amazon’s F1 cloud) that the user’s Sesame compiler can generate code for. Alternatively, a high level model may be communicated securely to the cloud which is then compiled in a sandbox to ensure no information is leaked. The binary generated is securely transferred to the provisioning service which copies it into accelerator memory or uses it as a bitstream to configure an FPGA.

To program an accelerator, the provisioning service has to create a trusted channel to establish a root of trust on the accelerator. This entails the following assumptions about such a platform:

  • The platform provides an attestation service to assure the client that:

    • All hardware on the platform is provided by a trusted manufacturer. This ensures the platform is free from hardware trojans; the accelerator bitstream hash is checked during secure boot.

    • The platform is capable of deploying a trusted execution environment (TEE). This includes the ability to isolate an application from the OS, hypervisor and other privileged/non-privileged software on both the CPU and main memory. Further, this TEE can be extended to provide similar isolation guarantees for the system memory used by the accelerator.

    • The ML compiler framework and driver are sourced from a trusted software provider and can be deployed to run in a TEE.

    Attestation protocols remain the same whether a ‘secure processor’ implements a RISC CPU (like Aegis [aegis] or Sanctum [Sanctum]) or application-specific accelerator logic. Attestation authenticates the platform (whose identity is vouched for by a public-key certificate authority) on which the user can execute a workload.

  • The host platform and the client have the ability to set up a secure channel to communicate sensitive information.

  • Confidential data stored in main memory can be protected using encryption [AMD-SME, AWS-Graviton2, Intel-TME].

  • Any key management systems that handle cryptographic keys to either set up communication channels or provision memory encryption for the accelerator are in the trusted code base.

2.2 Baseline Accelerator Architecture

VTA [VTA] is a DAE [DAE] machine built on the TVM [TVM] software stack to provide a domain-specific end-to-end solution to neural network applications. It is built on a Xilinx Zynq-7000 series FPGA. Users write their model and explicitly schedule the computation into a high-level CISC representation. This is then converted into low-level accelerator instructions by a code-generator running in an FPGA co-processor. The accelerator programmed in the FPGA then uses these instructions and data to perform inference.

3 Vulnerability Analysis

No. | Vulnerability | Exploit | Threat Model: PM | Threat Model: PI | Defences
1 | Adversary has access to communication channel between edge device and cloud | Tampering of model/input in flight when being communicated to the accelerator | | | Remote attestation of cloud system and establishing a secure channel using TLS
2 | Unauthorized access by privileged process on host CPU | Tampering of input/model in memory | | | Application isolation using a TEE framework like Aegis, Sanctum etc.
3 | Shared access to system memory carve-out by all tenants on accelerator and their host processes on CPU | Adversary deploys an accelerator task to access victim’s model parameters / input | | | TEE provides memory isolation between different tenants.
4 | Compiler runs on the host system and has access to the model | Model privacy compromised | | | Compiler is attested, made part of the trusted compute base and runs in a TEE.
5 | Model execution termination channel observation | Model parameters may be leaked due to runtime variability | | | Scheduler allocates time slice increments at a granularity identified by model provider.
6 | Adversary has access to reading system memory | Secret data confidentiality violated | | | Data encrypted by AES128 or QARMA128
7 | Adversary has ability to observe memory bandwidth characteristics | Model topology and layer size leak | | | Data independent distribution of memory traffic shape
8 | Shared access to dependency queues in accelerator | Loading corrupt data & performing computation before data ready | | | Partitioned Queues
9 | Shared access to instruction queues in accelerator | Instruction queue occupation serves as a covert channel | | | Partitioned Queues with base bound check
10 | Shared access to scratchpad in accelerator | Reading raw secret data on-chip by distrustful tenants | | | Base bound check for each process and ZEROIZE after secret data becomes dead
11 | Shared access to execution units in accelerator | Sniffing execution output from the execution unit pipeline | | | Spatial partitioning of GEMM units with private operand buffers
12 | Variable time GEMM unit execution | Model parameters / input data values leak | | | Disabling data driven optimizations through GEMM_C instruction
13 | Adversary has ability to modify system memory | Secret model weights filter maps can be tampered | | | Data MAC authentication [MorphCount, mee, bonsai-merkle-tree]
14 | Adversary has ability to observe address access patterns on memory bus | Memory access pattern attacks [ReverseCNN] | | | Invisimem [Invismem], PathORAM [PathORM]
Table 1: Vulnerabilities, exploits and defence mechanisms. Vulnerabilities 1-5 are addressed in the baseline platform, 6-11 are Sesame contributions and 12-14 can be composed with Sesame. PM = Private Model, PI = Private Input.

We systematize the vulnerabilities that a machine-learning-as-a-service (MLaaS) deployment may be subject to in Table 1. Vulnerabilities (1-5) are addressed in our baseline secure platform model. Vulnerabilities (6-11) correspond to the digital side channels arising within the accelerator that Sesame addresses. In this section we take a closer look at these vulnerabilities and their applicability under different threat model variants relevant to Sesame.

Memory bandwidth snooping: Modern chips commonly include high-precision memory bandwidth performance counters for performance debugging purposes. These may also be used by a cloud provider for ensuring quality of service (QoS) across multiple tenants using the same platform. However, these performance counters can end up leaking side-channel information. Figure 1 shows the memory traffic of all the convolution layers of VGG16. Each layer utilizes a different amount of memory bandwidth based on tile size and number of tiles. Figure 1 also zooms in on layer 5, which consists of 32 tiles. The bandwidth trace leaks the number of tiles and the kernel size of the weight tensor. In Section 8.1, we demonstrate that a classifier can detect all the layer boundaries based on changes in traffic shape and bandwidth. Upon successful leakage of the structure of the model, the attacker can craft custom inputs and devise an attack to determine the weights [scnn, ReverseCNN]. Interestingly, this attack can be effected solely by observing bandwidth variations using performance counters even when the data and address buses are protected. To the best of our knowledge, our work is the first to make this observation, and we design Sesame to defend against such observation channel attacks (Section 7.1).
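
To illustrate why a coarse bandwidth counter is enough to expose layer boundaries, the following is a minimal sketch. It is not the classifier evaluated in Section 8.1: the threshold heuristic, the function name `layer_boundaries`, and the trace values are all assumptions made up for illustration.

```python
def layer_boundaries(trace, rel_change=0.25):
    """Return sample indices where bandwidth jumps by more than
    `rel_change` relative to the previous sample (hypothetical heuristic)."""
    boundaries = []
    for i in range(1, len(trace)):
        prev, cur = trace[i - 1], trace[i]
        if prev > 0 and abs(cur - prev) / prev > rel_change:
            boundaries.append(i)
    return boundaries


# Synthetic counter readings: three "layers" with different steady bandwidths.
unshaped = [100] * 4 + [220] * 3 + [60] * 5
# The same execution as seen through a constant-rate shaper.
shaped = [220] * len(unshaped)

print(layer_boundaries(unshaped))  # boundaries are visible
print(layer_boundaries(shaped))    # nothing to detect
```

A trace with a secret-independent shape gives this kind of detector nothing to key on, which is the property the traffic shaper aims for.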

Shared dependency queues: The load, compute, and store units of the DAE accelerator are controlled through dependency queues that are shared among units. An attacker can corrupt control dependencies of the victim program – e.g., by triggering execution units before data loading – and violate read-after-write (RAW) dependencies. Sesame addresses this through dependency queue partitioning (Section 7.2.1).
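
The effect of partitioning can be pictured with a small software model. This is only a sketch under assumed names (`PartitionedQueue`, per-tenant slot counts); the actual private queues are RTL structures whose interface is not described here.

```python
from collections import deque


class PartitionedQueue:
    """Fixed-capacity queue split into per-tenant partitions (illustrative)."""

    def __init__(self, partition_sizes):
        # partition_sizes: {tenant_id: number of slots reserved for that tenant}
        self._capacity = dict(partition_sizes)
        self._slots = {tenant: deque() for tenant in partition_sizes}

    def push(self, tenant, token):
        queue = self._slots[tenant]
        if len(queue) >= self._capacity[tenant]:
            return False  # back-pressure is felt only by this tenant
        queue.append(token)
        return True

    def pop(self, tenant):
        queue = self._slots[tenant]
        return queue.popleft() if queue else None


dep_queue = PartitionedQueue({"victim": 2, "attacker": 1})
dep_queue.push("attacker", "fake_load_done")
blocked = dep_queue.push("attacker", "another_token")   # attacker's partition is full
accepted = dep_queue.push("victim", "load_done")         # victim is unaffected
```

Because a tenant only pops tokens from its own partition, a co-tenant can neither inject a dependency token that triggers the victim's compute early nor learn the victim's occupancy from its own back-pressure.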

Shared access to scratchpads: Shared access to the scratchpad can help an attacker read out stale secrets belonging to the victim by computing on a scratchpad region without loading any data after a victim’s execution. Sesame uses partitioned scratchpads, private data zeroization logic, and instruction base/bound check to address this threat (Section 7.2.2).
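
A minimal software model of the base/bound check and zeroization described above is sketched below. The class and method names are hypothetical, and the real checks are performed in hardware on scratchpad addresses carried by instructions.

```python
class Scratchpad:
    """Toy scratchpad with per-tenant base/bound checks (illustrative only)."""

    def __init__(self, size):
        self.mem = [0] * size
        self.regions = {}  # tenant -> (base, bound), bound exclusive

    def assign(self, tenant, base, bound):
        self.regions[tenant] = (base, bound)

    def _check(self, tenant, addr):
        base, bound = self.regions[tenant]
        if not (base <= addr < bound):
            raise PermissionError(f"{tenant}: address {addr} outside [{base}, {bound})")

    def write(self, tenant, addr, value):
        self._check(tenant, addr)
        self.mem[addr] = value

    def read(self, tenant, addr):
        self._check(tenant, addr)
        return self.mem[addr]

    def zeroize(self, tenant):
        # Clear the region before it can be handed to another tenant.
        base, bound = self.regions.pop(tenant)
        for addr in range(base, bound):
            self.mem[addr] = 0


spad = Scratchpad(8)
spad.assign("victim", 0, 4)
spad.assign("attacker", 4, 8)
spad.write("victim", 1, 42)          # secret value
try:
    spad.read("attacker", 1)         # rejected by the bound check
    leaked = True
except PermissionError:
    leaked = False
spad.zeroize("victim")               # secret is dead, clear it
spad.assign("attacker", 0, 4)        # region is reused by another tenant
stale = spad.read("attacker", 1)     # reads zero, not the old secret
```

The bound check stops concurrent reads of a live region, while zeroization stops reads of stale data once the region is reassigned.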

Shared instruction queues: Shared instruction queues help the attacker perform cyclic attacks like prime-and-probe by inserting instructions between victim executions, creating a covert channel to leak data. Sesame defends against this threat with private instruction queues (Section 7.3).

Shared access to execution units: Shared execution units can lead to cross-tenant contention attacks in a multi-tenant accelerator. Sesame schedules execution units among different tenants based on their program phases.

Variable time GEMM/ALU unit: Since ML weights are typically sparse, data-driven optimizations lead to execution-time variability that leaks sensitive model/input data during spatial sharing. Sesame generates constant-time execution instructions for operations on private data (Section 7.4).

Figure 1: Different layers in the VGG16 network utilize different memory bandwidth, while bursts within each layer show the number of tiles. Panels: memory bandwidth of layers in VGG16; read bandwidth of the 5th layer in VGG16.

Vulnerabilities that require integrity protection (13) and those that employ the address bus side channel (14) assume the attacker to have physical access to the system. The Sesame prototype excludes these vulnerabilities since these are composable with our case-studies on confidentiality vs. on-chip attacks (by co-tenants and privileged software) and are not required in physically secured data-centers. Recent commercial products include encrypted DRAM without integrity protection [AMD-SME, AWS-Graviton2, Intel-TME] for such use-cases. Nevertheless, solutions for integrity or address-buses [MorphCount, Vault, Phantom, PathORM] can be composed with the Sesame prototype by extending the memory controller while our defenses address vulnerabilities that arise within the accelerator.

3.1 Sesame threat model variants

Our threat model is defined based on three parameters. Sesame defenses are configured based on settings of these three parameters, which are discussed below:

  • Multi-Tenant execution modes: Cloud services deploy large accelerators capable of hosting multiple ML models simultaneously. We observe that there can be two modes of sharing accelerator resources across multiple tenants.

    • Temporal Sharing: This corresponds to a scenario where a single tenant rents the entire accelerator for a given duration. In this mode, the attack surface is restricted to observation/contention channels outside of the accelerator during job execution. After the execution is done, an attacker can read out secrets by extracting stale scratchpad data with use-after-free attacks.

    • Spatial Sharing: In this mode, multiple mutually distrusting model inference jobs run concurrently in a single accelerator. The attacker can observe/contend for the memory bandwidth channel, shared scratchpad, dependency queues, shared buffer and execution units.

  • Private/secret model: This parameter specifies whether model confidentiality is required and addresses scenarios where the ML service provider deploys its service on a third-party cloud provider for economies of scale.

  • Private/secret user input: This parameter specifies whether the user input is confidential.

The threat model column in Table 1 identifies which vulnerabilities are relevant with respect to the latter two of the three threat model parameters described above. Note that vulnerabilities 8-12 apply only to spatial sharing mode while all others are applicable to both spatial and temporal sharing modes. Further, in the threat model where both model and input are private, the vulnerability list is the union of the individual lists.

Analog side channels like electromagnetic radiation and information leakage caused by accelerator power and temperature variations are orthogonal to our system. Denial-of-Service (DoS) attacks by an attacker who either compromises the accelerator management software or tampers with the network between the CPU and the client or between the accelerator and the CPU are also outside of our threat model. We also do not protect against DoS attacks by a malicious kernel executing in the accelerator that causes contention in shared resources. The FPGA bitstream hash is checked during co-processor secure boot, the bitstream is assumed to be free of hardware trojans, and runtime reconfiguration is disabled.

4 SESAME Overview

This section describes the overall architecture of Sesame to address new vulnerabilities (shaded blue in Table 1) that arise when an accelerator is shared by mutually distrusting tenants co-executing on untrusted infrastructure – i.e., to build a multi-tenant accelerator enclave. These vulnerabilities can be categorized under information leaks through observation-only channels (such as memory traffic or scratchpad access control) and contention-driven (e.g., on-chip queues, scratchpad and execution units) channels – closing these with minimal slowdown requires both compiler and hardware support. In this section, we look at the overall secure accelerator architecture and highlight hardware structures that need shaping and partitioning.

Figure 2: Sesame design: a baseline DAE architecture extended with private queue partitioning, shaping, and access control blocks.

To bootstrap Sesame, a user authenticates the accelerator hardware and firmware using remote attestation protocols and a public-key certificate authority [Sanctum]. Users can compile their models on Sesame using host-side enclaves or remotely in a machine not exposed to attackers – in both cases, a user transfers a Sesame binary to the accelerator and triggers execution until completion or for a fixed duration (if the end-to-end timing channel is within the threat model). The binary along with the input data is loaded by the secure platform into DRAM, and configuration values specific to the threat model requirements are programmed into accelerator registers.

Sesame hardware comprises three main primitives. The first component – private queues – secures shared queues that are used pervasively in hardware micro-architecture to decouple or pipeline function units. Private queues enable the accelerator software to partition shared queues across security domains and prevent information leaks through queue contention. The second component – a traffic shaper – closes leaks through all signals that come out of an enclave (i.e., observation channels) by dynamically shaping the attacker-visible trace to look like a secret-independent distribution. The third component – secure compilation passes – takes as input programmer annotations to mark parts of the code and data as secret and generates accelerator instructions that minimize overheads from obfuscation-related data movement and cleanup (e.g., zeroing out scratchpads). In addition, Sesame includes logic to partition on-chip scratchpad memory, implement bound-checks for scratchpad accesses, and clear out scratchpad space holding dead private data. The key design principle is that private queues, traffic shapers, partitioned on-chip memory, and zeroization hardware can be composed arbitrarily. E.g., for small enclaves, shaping egress signals obviates the need to partition downstream queues, while large enclaves may be constructed by partitioning everywhere on-chip and using shapers exclusively for off-chip traffic.
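
The traffic shaper idea can be sketched with a toy constant-rate model. This is an assumption-laden illustration, not the Sesame RTL: the function name `shape`, the per-window request counts, and the use of dummy requests to fill unused slots are all hypothetical choices.

```python
def shape(arrivals, rate):
    """Toy constant-rate shaper.

    arrivals[t] is the number of real memory requests produced in window t.
    Every window issues exactly `rate` requests: queued real requests first,
    dummy requests for the remainder. Returns (observed, genuine) per window.
    """
    backlog = 0
    observed, genuine = [], []
    t = 0
    while t < len(arrivals) or backlog > 0:
        if t < len(arrivals):
            backlog += arrivals[t]
        real = min(backlog, rate)
        backlog -= real
        observed.append(rate)   # what an observer of the bus sees
        genuine.append(real)    # hidden from the observer
        t += 1
    return observed, genuine


observed, genuine = shape([5, 0, 1, 0], rate=2)
```

In this model the per-window trace is independent of the arrival pattern, but the total trace length still depends on the total volume of requests. That residual leak corresponds to the termination channel, which the text handles by running for a fixed duration when end-to-end timing is within the threat model.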

Figure 2 shows the components in Sesame hardware. The baseline DAE architecture includes a load unit to load inputs and weights from memory into the scratchpad – Sesame extends this with support to zero out memory and place bounds checks on scratchpad accesses to prevent buffer overflow. The store unit similarly writes back outputs and intermediate values into memory and includes a scratchpad. The compute unit performs the matrix arithmetic and GEMM computations that form the bulk of deep learning models – Sesame requires both ALUs and GEMM units to support a constant-time mode where the execution time is independent of data value. The compute unit is connected to both load and store units via dependency queues – Sesame modifies these queues to be configurable as private queues. All traffic to memory is via a Request unit – Sesame adds logic to shape the memory trace to this unit. The request unit also has load, instruction, and store queues that a Sesame user can configure as private queues in spatial sharing mode.

5 Programming Model:

Figure 3: Code generation example for various user-specified threat model invoking only the required secure hardware widgets

5.1 Instruction-set extensions:

We introduce the following instructions to the baseline architecture:

  • LOAD_E <spad_range>,<dram_range>: Load secret data from DRAM that doesn’t need traffic shaping but still needs to be decrypted.

  • LOAD_S(E) <spad_range>,<dram_range>: Load data from DRAM with the traffic shaped read channel. Data decryption needed for the E variant.

  • GEMM_C <out>,<inp1>,<inp2>: This instruction disables all data-driven optimizations of the GEMM unit.

  • ALU_C <out>,<inp>: This instruction performs constant-time ALU operations. It prevents leaks through ReLU and clipping units. The input can also be an immediate value.

  • ZEROIZE <spad_range>: This instruction zeroizes a portion of the scratchpad address range. It adds a dependency to other instructions that use conflicting regions of the scratchpad.

  • STORE_E <dram_range>,<spad_range>: Encrypt and store output from the scratchpad to DRAM.

  • STORE_S(E) <dram_range>,<spad_range>: Store data through a shaped memory write channel. Encryption needed by the E variant.
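
As a rough illustration of how a code generator might choose among these variants, the sketch below maps two per-tensor flags to opcode names. The function `select_opcodes` and its flags are hypothetical, and the unsuffixed names (LOAD, GEMM, ALU, STORE) are assumed to be the baseline opcodes.

```python
def select_opcodes(private, shaped):
    """Map security flags to opcode names (illustrative policy only).

    private: the data is secret, so it is encrypted in DRAM and computed on
             in constant-time mode.
    shaped:  memory-bandwidth observation is in the threat model, so the
             transfer goes through the traffic-shaped channel.
    """
    if shaped:
        suffix = "_SE" if private else "_S"
    elif private:
        suffix = "_E"
    else:
        suffix = ""
    return {
        "load": "LOAD" + suffix,
        "gemm": "GEMM_C" if private else "GEMM",
        "alu": "ALU_C" if private else "ALU",
        "store": "STORE" + suffix,
    }


print(select_opcodes(private=True, shaped=False))
print(select_opcodes(private=False, shaped=True))
```

The second call covers data that is itself public but whose transfer pattern could reveal something secret, which is the case the unencrypted shaped variants appear intended for.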

5.2 High-level code Pragmas:

Sesame supports application-level annotations to identify secret data structures and specify the execution mode. The software-defined threat model described in Section 3.1 is propagated down to generate instructions and accelerator configurations through pragmas in high-level code. We describe below the pragmas used to specify this information.

  • allocateS: This pragma, as illustrated in Figure 3, is an enhanced version of the baseline pragma that allocates scratchpad and enables DMA transfer from DRAM. The annotation identifies secret data structures for the following purposes:

    • Data structures thus identified are stored in encrypted format in the memory and code generation unit appropriately annotates load/store instructions.

    • It directs the code generation unit to generate zeroize instructions at scheduling boundaries to the scratchpad locations that hold these data structures and computation results generated from them.

    • Computational instructions operating on such structures are annotated to operate in constant time mode.

    • Lastly, for threat scenarios where the traffic shaper has been enabled, this pragma helps annotate the load/store instructions that need the bandwidth defenses of the traffic shaper.

  • execmode: It informs the driver whether the application is sharing the accelerator with other tenants and needs in-accelerator resource partitioning.

  • queuedepth: The driver requests the accelerator scheduler to reserve private queues for both instructions and dependencies between the DAE components. This prevents contention attacks by an untrusted tenant co-executing in the accelerator in spatial sharing mode.

  • spadsize: The driver programs the value specified by this pragma into tenant-private configuration space. The scheduler reserves regions of input, weight, accumulator and output scratchpad based on the size specified by this value.

  • bandwidth: Memory-mapped registers corresponding to the traffic shaper are programmed with the constant bandwidth specified by this pragma for both read and write channels. Traffic generated by LOAD_S or LOAD_SE instructions is shaped to this bandwidth specification.
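
The pragma values that reach the driver can be pictured as a per-tenant configuration record. The sketch below is an assumption: the record layout, the value spellings "temporal" and "spatial", and the widget names are invented for illustration and do not describe the actual driver interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TenantConfig:
    execmode: str                     # "temporal" or "spatial" (assumed spellings)
    queuedepth: int = 0               # private queue slots to reserve
    spadsize: int = 0                 # scratchpad region size to reserve
    bandwidth: Optional[int] = None   # shaped bandwidth; None means shaper off


def enabled_widgets(cfg):
    """Return the set of hardware defenses this configuration turns on."""
    if cfg.execmode not in ("temporal", "spatial"):
        raise ValueError(f"unknown execmode: {cfg.execmode!r}")
    widgets = set()
    if cfg.execmode == "spatial":
        # Co-tenants are present, so on-chip resources are partitioned.
        widgets |= {"private_queues", "scratchpad_partition"}
    if cfg.bandwidth is not None:
        widgets.add("traffic_shaper")
    return widgets


solo = enabled_widgets(TenantConfig(execmode="temporal"))
shared = enabled_widgets(
    TenantConfig(execmode="spatial", queuedepth=8, spadsize=4096, bandwidth=512)
)
```

Keeping the threat model in one record like this is one way to make sure each defense is enabled only when the corresponding pragma asks for it.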

5.3 Code transformations

We envision the cloud provider providing either the entire accelerator (temporal sharing) or part of the accelerator (spatial sharing) to each tenant. Sesame enables users to create enclaves that are tailored to their threat model. Figure 3 shows how a user specifies (1) the execution environment to be temporally shared (single tenant at a time) or spatially shared (by multiple tenants simultaneously), and (2) whether their model and/or inputs are private (secret) or public. Sesame then handles (3) accelerator compute execution and (4) tenant teardown, clearing out secrets after execution completes. Each of these stages involves (a) a high-level user program description, (b) instruction or configuration metadata generation, and (c) hardware widgets (e.g., queues and shapers) to isolate computation. Phase 1 creates the execution environment by enabling private queues and reserving the required resources. Phase 2 enables the user to define private/public variables to determine the layer threat model, using which Sesame enables the hardware privacy widgets. Phase 3 performs constant-time computation on partitioned execution units for private variables and regular execution for public operands. A tiled loop nest iterates over height, width, and channel tiles. Convolution operations use the GEMM units while activation, pooling, and batch normalization use the ALU units. Finally, the result is stored back in phase 4 and the scratchpad state is zeroized to clear private data from on-chip resources.
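
Phases 3 and 4 can be sketched as a tiled loop nest that emits an instruction stream and then tears down private state. This is a deliberately simplified illustration: `emit_layer` and the tile naming are hypothetical, one load/compute/store is emitted per tile (a real schedule would accumulate across channel tiles), and the unsuffixed opcode names are assumed baseline opcodes.

```python
from itertools import product


def emit_layer(tiles_hwc, private):
    """Emit a flat instruction list for one convolution layer.

    tiles_hwc: (height_tiles, width_tiles, channel_tiles).
    """
    if private:
        load, gemm, store = "LOAD_E", "GEMM_C", "STORE_E"
    else:
        load, gemm, store = "LOAD", "GEMM", "STORE"

    program = []
    # Phase 3: iterate over height, width, and channel tiles.
    for h, w, c in product(*(range(n) for n in tiles_hwc)):
        tile = f"tile[{h},{w},{c}]"
        program += [(load, tile), (gemm, tile), (store, tile)]
    # Phase 4: clear private scratchpad state once the layer is done.
    if private:
        program.append(("ZEROIZE", "layer_scratchpad"))
    return program


secret_prog = emit_layer((2, 1, 2), private=True)
public_prog = emit_layer((1, 1, 1), private=False)
```

Only the private variant pays for constant-time compute, encryption, and the trailing zeroize, which matches the goal of tying overhead to the declared threat model.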

6 Compiler Passes

The Sesame design enhances the entire hardware/software stack for domain-specific accelerators to support the different threat models described in Section 3.1. This enables a user to get the best possible performance within the isolation guarantees of the chosen threat model. In this section we discuss how user-specified security requirements filter through the various compiler passes and translate into setting up the required security constructs on the accelerator, either through metadata for engaging hardware widgets or through instruction generation. Figure 4 illustrates the phases involved in this process: (1) The TVM compiler performs graph transformations on the ML front end, parses security pragmas, and tags private variables for information flow tracking to drive threat-model-specific transformations. (2) The auto-tuning phase performs tile-size exploration. (3) Traffic shaping and zeroize optimizations are applied to the optimized tiling schedule. (4) Finally, after config validation and code generation, the binary, data, and config are shipped to the driver for deployment.

Figure 4: Sesame compiler passes play a key role in state-space exploration shaping and zeroization. Security passes are tightly coupled with user level threat model definitions

We describe how specific compiler phases are adapted to work with the additional security information specified above:

  • Autotuning tiles: We use the AutoTVM[AutoTVM] framework to search for the optimal tile configuration under threat-model and resource constraints before accelerator deployment. Tile optimization chooses the best configuration for each layer of the model and ensures runtime tenant isolation.

  • Resource bill of materials (BOM): The application's resource requirements, including memory bandwidth and execution mode, are partially extracted from the high-level code. The scratchpad, private queue, and execution tile requirements are extracted from the auto-tuning phase. This compiler pass rejects illegal resource allocations, which prevents attackers from creating resource contention and runtime errors.

  • Information flow tracking: Any intermediate results generated from sensitive data structures (variables) are marked as private. Scratchpad allocations of such variables are zeroized before de-allocation. This pass also warns the user if secret data is mistakenly marked as public at layer boundaries. Explicit dataflow tracking helps identify kernel schedules and variable liveness durations for precise allocation and zeroization.

  • Zeroize optimization: The Sesame compiler takes advantage of explicit kernel scheduling to reuse private scratchpad regions across different tiles without zeroizing them between tiles, clearing them only once at the layer boundary. This greatly reduces the binary size and scratchpad zero writes, as shown in Figure 8.

  • Memory traffic shaper optimizations: Memory traffic shaping is a software-hardware co-design. This pass protects private ML models.
    Burst size equalization with padding: The input and weight tensors are split into multiple equal-sized bursts. The tensor edges are padded to make each tensor a multiple of the burst size. Data is laid out such that no burst crosses a DRAM row-buffer boundary. The padding parameters are embedded in Sesame load instructions, similar to VTA. Each burst is converted to an AXI INCR transaction by the DMA engine.
    Memory bank conflict prevention: Certain tiling configurations of input and weight tensors can cause bank conflicts, which the DMA engine reports during the autotuning phase. Since tiling on the channel, height, and width axes is unique for each convolution layer, a data layout transformation ensures streaming data loads in the inner loops, and any loads that would cause bank conflicts are rearranged to a different bank.

  • Code generation: Appropriate variants of LOAD/STORE instructions are generated for DMA transfers between memory and scratchpad. Compute instructions such as GEMM and ALU are annotated for constant-time operation to protect sensitive data structures. ZEROIZE instructions are generated to leave no trace after private data execution.

  • Configuration generation: Execution configuration is generated and tenant-private memory mapped registers are written to invoke threat-model specific isolation hardware widgets.
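
The burst size equalization step in the list above can be illustrated with a minimal Python sketch. It is not the Sesame compiler pass: the function names, the 64-byte burst size, and the 2048-byte row-buffer size are assumptions chosen for illustration. It pads a transfer up to a whole number of bursts and relies on burst-aligned base addresses so that no burst straddles a row boundary.

```python
def split_into_bursts(base_addr, nbytes, burst_bytes=64, row_bytes=2048):
    """Split a tensor transfer into equal-sized bursts, padding the tail.

    burst_bytes and row_bytes are illustrative values, not Sesame's.
    Returns (bursts, pad_bytes), where bursts is a list of (address, length).
    """
    if nbytes <= 0:
        raise ValueError("nbytes must be positive")
    if row_bytes % burst_bytes != 0:
        raise ValueError("row size must be a multiple of the burst size")
    if base_addr % burst_bytes != 0:
        raise ValueError("base address must be burst-aligned")
    # Round the transfer up to a whole number of bursts.
    padded = -(-nbytes // burst_bytes) * burst_bytes
    bursts = [(base_addr + off, burst_bytes)
              for off in range(0, padded, burst_bytes)]
    return bursts, padded - nbytes


def crosses_row(addr, length, row_bytes=2048):
    """True if the burst [addr, addr + length) spans two DRAM rows."""
    return addr // row_bytes != (addr + length - 1) // row_bytes


# A 150-byte tile that starts near the end of a row.
bursts, pad = split_into_bursts(base_addr=1920, nbytes=150)
```

Because every burst starts on a burst-size boundary and the row size is a multiple of the burst size, alignment alone is enough to keep each burst within one row in this simplified model.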

7 Implementation

We now discuss the proof-of-concept implementation of the memory traffic shaper, the partitioning of the various queues, and the hardware scheduler, which together support multi-tenant enclaves.

7.1 Memory Traffic Shaper

The memory traffic shaper ensures secret-independent data transfers from DRAM to the accelerator for applications with a secret model. It masks secret-dependent read/write bandwidth variations with a shaped trace at a software-programmed bandwidth for the lifetime of secret data computation and transfer.
Compiler assumptions: The load/store instructions are already a multiple of the burst size, and the data loaded by the driver causes no bank conflicts, since these are resolved in the compiler as explained in Section 6.
Traffic shaper micro-architecture: Figure 5 shows the hardware components of the traffic shaper. A real transaction queue is filled with incoming real requests for each tenant. To provide a model-size-independent trace, the DMA engine always produces fixed-size transaction bursts. A single load instruction is split into multiple equal-sized bursts by the DMA engine, and the rate at which bursts are issued determines the overall bandwidth. Moreover, to hide computation and data dependencies at runtime, the shaper includes a fake transaction generator that produces transactions to free memory banks. A full signal from the real transaction FIFO prevents the load module from generating further memory requests. The same hardware logic is present for the write channel.

The traffic shaper has tenant-private configuration registers. The Shaper_en register of each tenant (identified by tenant_id) enables the fake transaction generator and restricts the transaction generator to constant-sized transactions. The bandwidth register programs the per-tenant timer to ensure bandwidth QoS, and the addr_range register confines the fake transaction generator to the tenant's memory-mapped address range to prevent access-control violations.
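
The shaping behavior described above can be approximated by a small cycle-level model. This is a sketch under simplifying assumptions, not the RTL: the function name, the `period` parameter (standing in for the timer programmed by the bandwidth register), and the arrival format are hypothetical, and FIFO back-pressure and address-range checks are omitted. On every timer tick exactly one burst is issued, real if one is queued and fake otherwise, so the issue times do not depend on when real requests arrive.

```python
from collections import deque


def shape_traffic(real_arrivals, n_slots, period=4):
    """Issue one fixed-size burst every `period` cycles.

    real_arrivals maps a cycle number to the number of real bursts
    enqueued in that cycle. Returns a list of (cycle, kind) tuples.
    """
    queue = deque()
    issued = []
    for cycle in range(n_slots * period):
        # Real requests from the load module enter the real transaction queue.
        for _ in range(real_arrivals.get(cycle, 0)):
            queue.append(cycle)
        # The timer fires at a constant rate regardless of queue occupancy.
        if cycle % period == 0:
            if queue:
                queue.popleft()
                issued.append((cycle, "real"))
            else:
                issued.append((cycle, "fake"))
    return issued


issued = shape_traffic({0: 2, 9: 1}, n_slots=4, period=4)
issue_cycles = [cycle for cycle, _ in issued]
```

An observer who sees only `issue_cycles` gets the same sequence for any arrival pattern, which is the property the shaper aims for on the memory bus.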

Figure 5: Memory Traffic Shaper micro-architecture

7.2 Partitioning Resources

The following resources are partitioned inside the accelerator for spatial multi-tenancy to deter on-chip attacks:

7.2.1 Dependency queues

The DAE dependency queues are partitioned four ways in our implementation to support multiple tenants. This split prevents a tenant from corrupting, or contending on, the control dependencies of a co-executing tenant. The tenant ID is used to redirect queue reads and writes to the tenant-private queue. The queue depth comes from a configuration register and is validated during the resource BOM compiler phase.

7.2.2 Scratchpad

Sesame has four scratchpads – Weight, Accumulator, Input, and Output – with capacities listed in Table 2. Each scratchpad is split into regions whose tenant ownership is maintained in a scheduler data structure called the scratchmap. Address base-and-bound logic checks tenant ownership in the scratchmap before each access, keeping tenants isolated. Zeroizer logic clears private data and tenant ownership at scratchpad-region granularity during teardown.
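
A software model can make the ownership check and teardown concrete. The sketch below is an assumption-laden illustration rather than the hardware design: the class and method names are hypothetical, and region sizes are arbitrary. It keeps a per-region owner table (playing the role of the scratchmap), rejects accesses by non-owners, and zeroizes both data and ownership when a tenant is torn down.

```python
class Scratchpad:
    """Toy model of a region-partitioned scratchpad with an ownership map."""

    def __init__(self, n_regions, region_words):
        self.region_words = region_words
        self.data = [0] * (n_regions * region_words)
        self.owner = [None] * n_regions  # scratchmap: region -> tenant id

    def allocate(self, tenant, region):
        if self.owner[region] is not None:
            raise PermissionError("region already owned")
        self.owner[region] = tenant

    def _check(self, tenant, addr):
        # Bound check first, then ownership check for the enclosing region.
        if not 0 <= addr < len(self.data):
            raise IndexError("address out of range")
        if self.owner[addr // self.region_words] != tenant:
            raise PermissionError("tenant does not own this region")

    def write(self, tenant, addr, value):
        self._check(tenant, addr)
        self.data[addr] = value

    def read(self, tenant, addr):
        self._check(tenant, addr)
        return self.data[addr]

    def teardown(self, tenant):
        # Clear data and ownership for every region the tenant holds.
        for region, owner in enumerate(self.owner):
            if owner == tenant:
                start = region * self.region_words
                self.data[start:start + self.region_words] = [0] * self.region_words
                self.owner[region] = None
```

In this model a later tenant that is allocated the same region can only ever read zeros, mirroring the teardown guarantee described above.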

7.2.3 Execution units

The GEMM and ALU execution units are spatially partitioned into four 8x8 units with private load/store buffers to eliminate execution unit contention in the multi-tenant scenario. This partitioning is useful for smaller kernels co-executing on large accelerators like TPU[TPU] and for cloud accelerator deployments of Sesame.

7.3 Scheduler

Sesame has a hardware scheduler block consisting of the instruction queue and data structures that track each tenant’s resource occupancy and guarantee runtime computational non-interference. It houses tenant maps for the dependency queues, scratchpads, execution tiles, and instruction queue, providing complete on-chip separation between tenants. It also communicates with the driver to launch a new tenant and sets a configuration register to signal application teardown.

7.4 Constant time execution units

The Sesame ISA supports constant-time execution instructions (e.g., gemm_c, alu_c), but our PoC implementation only includes an 8-bit DSP implementation of the execution units. The performance headroom for such optimizations is limited for quantized ML inference.

7.5 Private data encryption

To estimate the performance overhead of encryption, the Sesame PoC implementation emulates QARMA128 and AES128[AES] by adding a cipher-specific delay to every 128-bit private data access. These delay numbers are taken from prior work[RWC_Avanzi].
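
As a rough illustration of this emulation approach, the added latency scales with the number of 128-bit blocks touched. The helper below is a hypothetical sketch: `delay_per_block` is a placeholder parameter, and the per-cipher delay values used by the prototype are not reproduced here.

```python
def encryption_delay_cycles(private_bytes, delay_per_block, block_bits=128):
    """Total emulated delay for accessing `private_bytes` of private data.

    delay_per_block is a placeholder for the per-cipher delay; the values
    used in the prototype are not assumed here.
    """
    block_bytes = block_bits // 8
    # Partial blocks still cost one full block access.
    blocks = -(-private_bytes // block_bytes)
    return blocks * delay_per_block
```

For example, a 100-byte private access spans seven 128-bit blocks, so it is charged seven times the per-block delay in this model.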

8 Evaluation

The Sesame proof-of-concept (PoC) implementation enhances the VTA[VTA] hardware baseline with a constant traffic shaper to mask memory bus traffic, partitioned on-chip accelerator resources such as dependency queues and scratchpads for tenant isolation, a zeroization module to clear private data, address base/bound logic for access checks, and a hardware scheduler for runtime tenant isolation management. The prototype runs on a Xilinx Pynq-Z1 board with a dual-core Arm Cortex-A9 co-processor and an accelerator prototype in the FPGA fabric, with specifications listed in Table 2.

| Component | Specification |
| --- | --- |
| Processor | Dual arm CortexA9 @ 667MHz |
| DRAM | 512MB DDR3 @ 533MHz |
| FPGA frequency | Zynq 7020 @ 100MHz |

| Accelerator | Temporal | Spatial |
| --- | --- | --- |
| Weight Scratchpad | 2MB | 512KB |
| Input Scratchpad | 256KB | 64KB |
| Output Scratchpad | 256KB | 64KB |
| Acc Scratchpad | 512KB | 128KB |
| GEMM units | 256 | 64 |
| Memory bandwidth | 400MB/s | 100MB/s |
Table 2: System Specifications with per-tenant temporal and spatial sharing resource allocation

To simplify the implementation, the resource BOM, which includes the traffic shaper bandwidth, the data size of each layer, and so on, is extracted manually after the compiler autotuning phase and fed to the runtime driver as command-line parameters. The driver bootstraps the accelerator by creating config, code, and data memory-mapped address spaces as shown in Figure 2. The config address space houses live resource availability status and tenant-private metadata regions for configuration loading. The proof-of-concept implementation supports up to four tenants. The accelerator driver begins by querying the resource availability registers; to prevent over-subscription, it schedules a new tenant and creates a tenant ID only if adequate resources are available. The driver then loads instructions into the tenant-private instruction region and input/model values into the data region. After that, the driver waits for the results by polling a configuration register.
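
The admission step of the driver can be sketched as a simple resource ledger. This is a minimal model under stated assumptions, not the actual driver: the class name, method names, and resource keys are hypothetical, while the capacities are the temporal and spatial columns of Table 2. A tenant is admitted only if a tenant slot is free and every requested resource fits in what remains.

```python
MAX_TENANTS = 4  # the proof-of-concept supports up to four tenants


class Driver:
    """Toy admission control that refuses over-subscription."""

    def __init__(self, total):
        self.free = dict(total)
        self.tenants = {}  # tenant_id -> granted resources

    def launch(self, request):
        """Return a new tenant ID, or None if the request cannot be met."""
        if len(self.tenants) >= MAX_TENANTS:
            return None
        if any(amount > self.free.get(name, 0)
               for name, amount in request.items()):
            return None
        tenant_id = next(i for i in range(MAX_TENANTS) if i not in self.tenants)
        for name, amount in request.items():
            self.free[name] -= amount
        self.tenants[tenant_id] = dict(request)
        return tenant_id

    def teardown(self, tenant_id):
        # Return the tenant's resources to the free pool.
        for name, amount in self.tenants.pop(tenant_id).items():
            self.free[name] += amount


# Capacities from Table 2 (temporal column) and a per-tenant spatial request.
total = {"weight_kb": 2048, "input_kb": 256, "output_kb": 256,
         "acc_kb": 512, "gemm_units": 256, "bandwidth_mb_s": 400}
quarter = {name: amount // 4 for name, amount in total.items()}
```

With these numbers, four `quarter` requests fit exactly and a fifth is refused until one of the running tenants is torn down.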

In this section, we evaluate the hardware prototype by running ImageNet inference on six trained deep learning networks taken from the MXNET modelzoo[MxNet]. The models are quantized to support 8-bit operation. The image classification models chosen are VGG11, VGG16, AlexNet, ResNet18, ResNet34, and ResNet50. We first assess the security provided by the traffic shaper against memory bandwidth side-channel attacks, then examine instruction binary size, and finally compare performance for all threat models in the spatial and temporal execution modes. The four bars of each plot are for (1) Public Input Public Model; (2) Private Input Public Model; (3) Public Input Private Model; (4) Private Input Private Model.

8.1 Security Evaluation

8.1.1 Memory traffic shaper

In this section we analyze how effectively the traffic shaping primitive discussed in Section 7.1 protects against the memory bandwidth side-channel attack from Section 1. To show that bandwidth variation is a problem and to validate the traffic shaper, a bandwidth measurement widget is synthesized into the FPGA bitstream. The widget counts the data bytes transferred on the AXI read and write channels to report memory bandwidth. Figure 6 shows the memory traffic pattern before and after shaping for the six workloads. Both read and write bandwidth in the unshaped trace are leaky and reveal kernel size and layer boundary information. The figure shows a single run, but each benchmark is run fifty times and the median is constant for every sample. Nevertheless, a two-stage classifier is designed to attack both the unshaped and the shaped traces as follows:

Figure 6: Comparison between real and shaped memory traffic. Top figure shows the real read/write traffic of each network while the bottom one plots the shaped traffic.
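| Traffic | Metric | AlexNet | AlexNet | VGG11 | VGG11 | VGG16 | VGG16 | ResNet18 | ResNet18 | ResNet34 | ResNet34 | ResNet50 | ResNet50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | easy: 3 | all: 4 | easy: 5 | all: 6 | easy: 8 | all: 11 | easy: 22 | all: 23 | easy: 24 | all: 36 | easy: 50 | all: 52 |
| Unshaped | precision | 1 | 1 | 1 | 1 | 1 | 1 | 0.69 | 0.64 | 0.66 | 0.72 | 0.33 | 0.33 |
| Unshaped | recall | 0.75 | 1 | 0.83 | 1 | 0.73 | 1 | 0.96 | 1 | 0.67 | 1 | 0.96 | 1 |
| Shaped | precision | 0.03 | 0.03 | 0.01 | 0.01 | 0.0027 | 0.00011 | NA | NA | NA | NA | NA | NA |
| Shaped | recall | 0.75 | 1 | 0.83 | 1 | 0.73 | 1 | NA | NA | NA | NA | NA | NA |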
Table 3: Precision and recall when identifying layer boundaries for each network. The second row in this table lists the number of the boundaries between two consecutive layers that are of different types (easy) and the total number of layer boundaries (all). The precision and recall are calculated when the attacker detects all boundaries while introducing false positives (if any). NA indicates that the attacker fails to identify the boundaries while introducing over false positives.
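
The recall values in Table 3 follow directly from the boundary counts: when the attacker finds only the easy boundaries, recall is the easy count divided by the total count. The short check below reproduces that arithmetic from the counts in the table. The helper names are hypothetical, and `implied_false_positives` is only a back-of-the-envelope estimate, since the reported precision values are rounded.

```python
# (easy, all) layer-boundary counts per network, taken from Table 3.
BOUNDARIES = {
    "AlexNet": (3, 4),
    "VGG11": (5, 6),
    "VGG16": (8, 11),
    "ResNet18": (22, 23),
    "ResNet34": (24, 36),
    "ResNet50": (50, 52),
}


def recall(true_positives, total_boundaries):
    """Fraction of real boundaries that the attacker detects."""
    return true_positives / total_boundaries


def implied_false_positives(true_positives, precision):
    """False positives implied by precision = TP / (TP + FP)."""
    return true_positives / precision - true_positives


easy_recall = {net: round(recall(easy, total), 2)
               for net, (easy, total) in BOUNDARIES.items()}
```

Low precision at full recall means that the true boundaries are buried among many false candidates, which is what makes the shaped trace hard to segment.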
| Classifier | AlexNet | VGG11 | VGG16 | ResNet18 | ResNet34 | ResNet50 | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Execution time only | 1 | 1 | 0.958 | 0.896 | 0.851 | 0.824 | 0.826 |
| SVM w/ (w/o) DWT | 1 | 1 | 1 | 1 | 1 | 0.811 | 0.927 |
| MLP w/ (w/o) DWT | 1 | 1 | 1 | 1 | 1 (0.986) | 0.868 (0.849) | 0.949 (0.934) |
| CNN w/ (w/o) DWT | 1 | 1 | 1 | 1 | 1 | 0.877 (0.830) | 0.953 (0.934) |

Table 4: Layer type classification accuracy using unshaped traffic assuming perfect layer boundary detection. The first row shows the accuracy using only the termination timing of the layers for classification. It serves as a baseline accuracy as long as the attacker can identify the layer boundaries. Rows 2-4 show the accuracy of three classifiers using the bandwidth trace with and without the frequency-domain signal computed from the discrete wavelet transform (DWT).

Layer Boundary Detector: Prior work [ReverseCNN, DeepSniffer] demonstrated that the read-after-write (RAW) pattern on the address trace reveals layer boundaries accurately; our attacker instead uses bandwidth variation on the read/write channels to detect layer boundaries. We first use the RAW dependency pattern to identify potential boundaries (the end of each write transaction) on the unshaped trace. We then compare the memory bandwidth within a fixed time window before and after each potential boundary. Statistics such as total read/write data, median and peak bandwidth, and standard deviation, as well as frequency-domain signals computed using the discrete wavelet transform (DWT), are used. For the shaped trace, since the write bus is constantly exercised, the RAW activity is invisible to the attacker. Instead, we model an attacker who profiles offline the termination timing of all possible layer configurations with termination timing protection turned off. At run time, the attacker uses combinations of termination timings to enumerate potential layer boundary candidates. This constrains the maximum number of false positives. At each boundary candidate, the attacker again compares the memory bandwidth before and after the candidate using the same statistics. Varying the thresholds for this determination lets the attacker trade off aggressiveness in boundary identification against triggering false positives.

Table 3 shows the precision and recall numbers for both unshaped and shaped memory traffic on our evaluation model suite. The second row in the table shows the total number of layer boundaries as well as the number of easily identifiable boundaries (adjacent layers utilize different memory bandwidth) from the unshaped traffic. Without the traffic shaper in place, the classifier can detect easy boundaries with 100% precision for AlexNet, VGGs and ResNet18. Note that for ResNet models, the residual layers are very short, which makes the boundaries in between them hard to detect even without traffic shaper. With the traffic shaper in place, the precision drops down to as low as 0.01% for VGG16. NA in the table indicates that our modeled attacker fails to identify every layer boundary in three ResNet models, while triggering over false positives (precision ).

Layer Type Classification: Compared to one feature vector per kernel in DeepSniffer [DeepSniffer], fine-grained observations enable attackers to model each layer as a time series of bandwidth information. In addition to the memory-bandwidth-related features used in DeepSniffer, we include frequency-domain signals (DWT) to capture IFM tile load/store memory bandwidth signatures. Our training data consists of bandwidth traces from different potential layer configurations. We test the victim's memory traffic time series using three classifiers: Support Vector Machine (SVM), Multilayer Perceptron (MLP), and Convolutional Neural Network (CNN), with and without frequency-domain signals. Similar to [DeepSniffer], SVM and MLP use one feature vector per layer, while CNN uses a sequence of feature vectors for classification, enabled by fine-grained observation. Table 4 shows the layer-type classification accuracy with unshaped traffic data after the layer boundaries are identified. The first row shows the classification accuracy using only the execution length of each layer. This is a baseline accuracy for any attacker with knowledge of the layer boundaries (execution timing of layers). Rows 2-4 show the accuracy of the three evaluated layer-type classifiers. From the execution-time-based classifier to the bandwidth-based classifiers, the accuracy jumps from 84% to 93% on average. From SVM to CNN, accuracy increases with increasing classifier complexity. In addition, including frequency-domain signals improves the accuracy, as different tile size configurations result in different compute/IO ratios.

However, after traffic shaping is applied, the classifiers are not able to distinguish features among different layer types. The resulting accuracy is similar to that of the baseline attacker who knows only the layer termination times. But since the layer boundaries of the shaped trace are undetectable (Table 3), we conclude that Sesame seals the memory read/write bandwidth channel.

8.1.2 Partitioned logic

The partitioning and access control hardware is first validated at the RTL level with SystemVerilog tests. Runtime validation is done by modifying a tenant binary after code generation so that it accesses an unauthorized scratchpad location. The access was blocked on both the read and write channels of the BRAM. Moreover, a scratchpad read after tenant teardown returned zero.

8.2 Compiler Evaluation

8.2.1 Tile optimization performance

Figure 7 illustrates the variability of the performance overhead for the threat model where both input and model are private. In temporal sharing, all accelerator resources are used by one tenant; in spatial sharing, four tenants share the accelerator resources equally. Each benchmark is run with 800 different tile combinations across layers, and the overhead is compared against baseline VTA. Tile optimization helps maximize utilization of the available resources, and the best tile configurations change with the resource BOM. It also results in better utilization of the traffic shaper bandwidth for a private model.

Figure 7: Performance overhead variation for design space exploration of tile size configuration

8.2.2 Zeroize Optimization

Figure 8 shows the reduction in the size of scratchpad regions that need to be zeroized compared to the naive case of zeroizing all private data held by the scratchpad. Kernel scheduling helps reuse private scratchpad regions across multiple tiles that share the same security level. Zeroizing regions only before loading public data reduces the dynamic zeroize instruction count by 14% to 18%, depending on the threat model. Since each instruction clears variable-sized BRAM regions, the right plot shows the number of BRAM bytes zeroized. The reduction is between 8.5% and 16.7% across the different threat models. The reduction in the private-weight threat model is higher for VGG11, VGG16, and AlexNet than for the ResNets because of their larger kernel sizes. The private model (bar 2) shows a higher reduction than the private input (bar 1), and the threat model with both private model and private input (bar 3) is closer to the weight reduction percentage because the model weights have more channels than the IFM/OFM.
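
The effect of the optimization on instruction count can be seen with a toy model of a single scratchpad region that receives a sequence of tiles. The sketch rests on simplifying assumptions: the function name and the boolean tile encoding are hypothetical, the naive policy is taken to be one zeroize after every private tile, and the optimized policy zeroizes only before public data is loaded over private data, plus once at teardown.

```python
def count_zeroize(tiles, optimized):
    """Count zeroize operations for one scratchpad region.

    tiles is a sequence of booleans, True when the tile loaded into the
    region holds private data.
    """
    count = 0
    holds_private = False
    for is_private in tiles:
        # Naive: always clear leftover private data before the next load.
        # Optimized: clear it only when the incoming tile is public.
        if holds_private and (not optimized or not is_private):
            count += 1
        holds_private = is_private
    if holds_private:
        count += 1  # final clear at teardown
    return count


schedule = [True, True, True, False, True, True]
naive = count_zeroize(schedule, optimized=False)
reduced = count_zeroize(schedule, optimized=True)
```

The savings in this model depend entirely on how often consecutive tiles share a security level, so it should not be read as predicting the percentages in Figure 8.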

Figure 8: Dynamic instruction count and number of zeroized BRAM bytes normalized wrt zeroizing every private data for different threat models

8.2.3 Code Generation Instruction Mix

Figure 9 shows the normalized instruction mix of the six benchmarks for the different threat models. The increase in binary size is solely due to the addition of zeroize instructions. The load/store instruction variants change with the threat model. Load_e instructions are generated for loading private input when the model is public, while a private model needs a read/write traffic shaper and uses load_se/store_se instructions. When the model is private, public input may be stored unencrypted in DRAM and uses load_s, as its traffic needs to be shaped to hide access patterns. When both model and input are private, all read accesses are performed using load_se instructions. Internal scratchpad locations holding secret data are cleared with zeroize instructions, whose number varies with the threat model. Each zeroize instruction clears a different amount of scratchpad. Even though the number of zeroize instructions is higher for private input, the latency is higher for private model because zeroize clears a larger model scratchpad region.
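
The load variants described above can be summarized as two independent flags: traffic is shaped whenever the model is private, and an operand is encrypted whenever that operand itself is private. The selector below is a sketch of that reading. The function name is hypothetical, and the plain `load` returned for public operands with a public model is an assumption, since the text does not name the unprotected variant.

```python
def select_load(operand, private_input, private_model):
    """Pick a load mnemonic for an operand under a given threat model."""
    if operand not in ("input", "weight"):
        raise ValueError("operand must be 'input' or 'weight'")
    # All traffic is shaped when the model is private.
    shaped = private_model
    # An operand is encrypted in DRAM only when it is itself private.
    encrypted = private_input if operand == "input" else private_model
    suffix = ("s" if shaped else "") + ("e" if encrypted else "")
    return "load_" + suffix if suffix else "load"
```

For instance, `select_load("input", False, True)` yields `load_s`: the public input stays unencrypted, but its traffic is still shaped because the model is private.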

Figure 9: Instruction mix for different threat models

8.3 Performance/FPGA Utilization Results

8.3.1 FPGA Utilization Overhead

Table 5 lists the percentage of FPGA resources used by each design component. The scheduler uses distributed RAM to store tenant ownership of scratchpad partitions and private queues. The additional configuration registers increase register and LUT usage. The memory traffic shaper uses BRAM for tenant-private load/store queues. Distributed RAM is used in the DMA engine to partition loads and to store bank information for pending transactions. Glue logic and the timer account for the increase in registers and LUTs. For the partitioned resources, the baseline queue size is split into multiple sections, and the access control glue logic accounts for the area increase.

| Component | Baseline | Scheduler | Memory Traffic Shaper | Partitioned Resources | SESAME Prototype |
| --- | --- | --- | --- | --- | --- |
| Logic LUT | 46.97% | 9.57% (1.2x) | 2.56% (1.05x) | 2.4% (1.05x) | 61.5% (1.3x) |
| Register | 19.5% | 11.26% (1.6x) | 6.24% (1.3x) | 3.6% (1.2x) | 40.76% (2.1x) |
| BRAM | 92.14% | 1.6% (1.02x) | 3.22% (1.04x) | 0% (0) | 96.96% (1.05x) |
| Distributed RAM | 11.51% | 10.27% (1.9x) | 3.3% (1.29x) | 0% (0) | 23.08% (2.178x) |
| DSP | 100% | 0% (1x) | 0% (1x) | 0% (1x) | 100% (1x) |
Table 5: FPGA Component Utilization

8.3.2 Performance Overhead

In this section we study the performance overheads of providing security under the various threat models, in both temporal and spatial sharing modes, versus a non-secure baseline in which both model and input are public. Each of the other three threat models has two bars: one for the memory encryption defense with the QARMA-128 block cipher and the other with AES-128. The private input setting adds zeroize instruction overhead on the input, accumulator, and output scratchpads, along with input padding. The private model setting further adds overhead from the memory traffic shaper, with a bandwidth of 400 MB/s, and from zeroization of the weight scratchpad. Figure 10(a) illustrates that for temporal sharing mode performance overheads range from to with QARMA128 and to with AES128 encryption.

(a) Temporal Sharing
(b) Spatial Sharing
Figure 10: Performance Overhead of Two Multi-Tenant Scenarios. Temporal sharing where multiple tenants share the accelerator in a time-multiplexed manner. Spatial sharing where resources are evenly shared among four tenants. Overheads are compared to a non-secure public input public model scenario. Encryption overhead from either QARMA-128 or AES-128 is added on top of each threat model.

For spatial sharing (Figure 10(b)), four tenants equally share all accelerator resources, including the partitioned scratchpads, private queues, execution units, and memory traffic bandwidth. In other words, each inference task is given one fourth of the resources allocated in temporal sharing mode, as shown in Table 2. Thus, with spatial sharing, because the non-secure baseline is already resource constrained, the relative overheads of providing security are slightly lower. The overhead ranges from to with QARMA128 and to with AES128 encryption. The spatial sharing security overhead is lower for models with larger layers, such as AlexNet and VGG, because limited memory bandwidth is the primary bottleneck in the baseline.

9 Related Work

ML inference accelerators [eie, fpga_CNN, acc_survey, VTA, dnnweaver, brainwave] extract performance using specialized hardware like DAE[DAE] and systolic array designs[systolic] with domain-specific optimizations[deepcompression]. On the other hand, secure TEEs focus on general-purpose cores [SGX, bastion_sgx, Sanctum, TZ, keystone] and recently GPUs [Graviton, telekine] and FPGAs [hetee] (specifically, on securing the PCIe interface between CPU and GPUs/FPGAs). Sesame is complementary and proposes software-defined enclaves to construct accelerator-TEEs that are tightly coupled on chip with general-purpose cores.

While partitioning[Sanctum, cachePartition, CNNpartition] and shaping[camouflage, mitts] are generic strategies that have been extensively studied from networking to CPU designs, Sesame turns them into hardware/RTL modules – private queues and shapers – to be synthesized based on the threat model and tuned based on the program phase information. As a result, Sesame’s shaper unit is simpler than memory-controller shapers [camouflage] because it relies on software to learn traffic distribution and configure it. Sesame is also more secure than Camouflage [camouflage] by not assuming that an attacker is limited to observing only coarse-grained signals – an on-chip co-tenant or privileged software can observe Sesame enclave outputs at arbitrary granularity.

A software-configurable Sesame framework is composable with other physical protection units necessary for TEEs, such as encryption/integrity blocks[mee] and memory access pattern confidentiality using address encryption [invisimem, obfusmem] or ORAM[PathORM, Phantom]. These memory protection schemes can be exposed to software by extending the Sesame ISA [mto]. The compiler’s auto-tuning framework can similarly be extended to extract performance through state-space exploration, e.g., by taking advantage of streaming patterns for integrity checks. Sesame enables prior work that uses CPU-based TEEs for confidential ML [Slalom, chiron, Myelin] to use accelerator-TEEs instead.

Beyond hardware-based TEEs, cryptographic approaches to privacy in machine learning have also been widely studied. Approaches based on homomorphic encryption (HE) and garbled circuits (GC), such as Cryptonets[Cryptonets], Securenets[Securenets], SecureML[Secureml], and xonn[Xonn_Raizi], provide confidentiality guarantees for user data without trusting server-side hardware, but are orders of magnitude slower than non-secure execution and do not hide the model. Finally, we observe that protecting against adversarial input attacks (DNNGuard[dnnguard]) is orthogonal to our design and threat model.

10 Conclusion

Sesame brings confidential computing to accelerators and introduces software-defined enclaves – where the slowdown scales gracefully with the threat model and program phase regularity. The key innovation is to define new hardware modules that replace ubiquitous queues with private queues and add traffic shapers at enclave egress – together, these modules can form a secure data-plane for accelerators beyond DAE. While extending Sesame workloads to secure graph programs is a near-term task, bringing these ideas back into general-purpose CPUs and incorporating these modules into hardware verification tools would be the longer-term wins. We hope to spur further research into secure and performant enclaves by sharing our code and benchmark suite with the research community.

References