Integrated circuits (ICs) are indispensable components for a diverse set of real-world applications including healthcare systems, smart home devices, industrial equipment, and machine learning accelerators [chen2016eyeriss, chen2009wireless]. Because digital circuits are deployed in security-critical tasks, their vulnerabilities may result in severe outcomes. The design and manufacturing processes of contemporary ICs are typically outsourced to (untrusted) third parties. Such a supply chain structure raises hardware security concerns, such as sensitive information leakage, performance degradation, and copyright infringement [tehranipoor2011introduction, colombier2014survey]. Malicious hardware modifications, a.k.a. Hardware Trojan (HT) attacks [tehranipoor2010survey, bhunia2014hardware], may occur at each stage of the IC supply chain.
There are two main components in an HT attack: Trojan trigger and payload. The HT trigger is a control signal that determines when the malicious activity of the HT shall be activated. The Trojan payload is the actual effect of circuit malfunctioning, which depends on the purpose of the adversary, e.g., stealing private information or producing incorrect outputs [tehranipoor2010survey]. The attacker intends to design a stealthy HT that remains dormant during functional testing and evades possible detection techniques. As such, the HT trigger is typically derived from rare activation conditions, which are easier for the attacker to hide.
To alleviate the concerns about malicious hardware modifications, a line of research has focused on developing effective HT detection methods. Existing HT detection techniques can be categorized into two classes based on the underlying mechanisms: (i) Side-Channel Analysis (SCA) and (ii) Logic Testing. SCA-based HT detection exploits the fact that the presence of an HT in the victim circuit changes its physical parameters (e.g., timing, power, and electromagnetic radiation), and can thus be revealed by side-channel information [liu2014hardware, lin2009trojan]. As a consequence, SCA-based approaches can detect non-functional HTs, but they may have high false alarm rates when detecting small HTs due to operational and physical silicon variation as well as measurement noise. Logic testing-based techniques intend to activate the stealthy Trojan trigger by generating diverse test patterns [chakraborty2009mero, nourian2018hardware, saha2015improved]. The main challenge of logic testing-based HT detection is to increase the trigger coverage with a small number of test patterns.
In this paper, we aim to simultaneously address three challenges of logic testing-based HT detection: effectiveness, efficiency, and scalability. To this end, we propose AdaTest, the first automated, adaptive, reinforcement learning-based test pattern generation (TPG) framework for HT detection with hardware accelerator design. Figure 1 demonstrates the high-level usage of AdaTest to inspect whether any hardware Trojans are inserted in the circuit under test (CUT). AdaTest takes the netlist of the CUT and user-defined parameters as its inputs, and returns a set of test vectors with high reward values as its output.
The AdaTest framework consists of two main phases: (i) Circuit profiling. Given the circuit netlist, we first characterize each node in the CUT from two perspectives: the transition probability and the SCOAP testability measures. These two properties are used to identify rare nodes and quantify the fitness of each node, respectively. (ii) Adaptive test pattern generation. AdaTest proposes an innovative reward function for test vectors using the following information: the number of times that each rare node is triggered, the SCOAP testability measure of the rare nodes, and the graph-level distance of the circuit (represented as a directed acyclic graph) between the state induced by this test input and those induced by the historical ones. In each iteration, AdaTest gradually expands the test set by generating candidate test inputs and selecting the ones that have high reward values. AdaTest provisions a flexible trade-off between trigger coverage and test generation time. To enable a hardware-assisted solution, we further design an optimized architecture for AdaTest’s implementation to reduce the hardware overhead. More specifically, the AdaTest architecture pipelines the computation in online TPG and deploys circuit emulation to accelerate reward evaluation.
AdaTest opens a new axis for the growing research in hardware security by exploring the idea of reinforcement learning (RL) and adaptive test pattern generation. The adaptive nature of AdaTest ensures that the quality (measured by our reward function) of our dynamic test set always improves over iterations as new test inputs are added to the test set. Furthermore, AdaTest is generic and can be easily extended for other hardware security problems, such as logic verification, efficient ATPG, functional testing, and built-in self-test. For example, the concept of RL and adaptive test pattern generation presented in AdaTest can be used in an efficient ATPG application where the RL reward function is designed to reflect the goal of the ATPG (such as fault coverage of considered fault models).
Organization. Section 2 introduces preliminary knowledge and related works on Hardware Trojan and its detection, as well as reinforcement learning. Section 3 discusses the challenges of HT detection and the overall workflow of AdaTest framework. Section 4 presents our test pattern generation algorithm that combines RL and adaptive sampling for fast exploitation. Section 5 demonstrates our domain-specific architecture design of AdaTest. Section 6 provides a comprehensive performance evaluation of AdaTest on various circuit benchmarks and comparison with prior works on logic testing-based HT detection. Section 7 concludes the paper.
2 Preliminaries and Backgrounds
2.1 Hardware Trojan Attacks
The security of third-party SoCs has raised increasing concern due to the contemporary outsourcing-based supply chain. Hardware Trojans are malicious circuit modifications inserted in the circuit to perform a pre-defined adversarial task (‘payload’), e.g., circuit malfunction or private information leakage, when the control signal (‘trigger’) is activated. Figure 2 shows an example HT design where a logic-AND gate and an XOR gate are used as the trigger and payload, respectively. The payload flips the output signal when the trigger is activated, thus disturbing the desired behavior of the original circuit.
The collaborative nature of the supply chain also means that HTs may be inserted by different parties at different stages of the IC lifecycle. For instance, the untrusted IP provider, the circuit designer, or the manufacturing party might insert HTs in the circuit. Hardware Trojans shall remain dormant in most cases to evade functional testing and HT detection, while they should be activated by the trigger to execute the attack. For this purpose, stealthy HTs are designed with two main considerations: (i) Rare conditions are used to construct the trigger signal; (ii) The HT is placed in a non-critical path to minimize its impact on side channels (delay, power, electromagnetic emission, etc.).
2.2 Hardware Trojan Detection
Previous HT detection techniques can be categorized into two broad types: destructive and non-destructive methods. Destructive detection schemes perform de-packaging and de-layering on the manufactured IC to reverse engineer its design layout, and are thus prohibitively expensive [el2015integrated]. Non-destructive HT detection includes two types: run-time monitoring and test-time detection. Run-time approaches monitor the IC throughout its entire operational lifecycle with the goal of detecting Trojans that pass other detection methods, providing the ‘last line of defense’. There are two classes of test-time HT detection techniques. We detail each type as follows:
(i) Side-channel Analysis. SCA-based Trojan detection methods explore the influence of the inserted HT on a particular measurable physical property, such as the supply current, power consumption, or path delay. These physical traces can be considered as the ‘fingerprint’ of the circuit and allow the defender to detect both parametric and functional Trojans [liu2015concurrent, liu2014hardware]. Parametric Trojans modify the wires and/or logic in the original circuit while functional Trojans add/delete transistors or gates in the original chip [wang2008hardware, karri2012trojan, moein2015attribute]. However, SCA-based HT detection has two limitations: (i) It cannot detect a small HT that causes a negligible impact on the physical side-channel; (ii) The extracted circuit fingerprint is susceptible to manufacturing variation and measurement noise, thus it might incur high false alarm rates.
(ii) Logic Testing. Compared to the side-channel based approaches, logic testing methods can only detect functional Trojans. However, they yield reliable results under process variation and measurement noise. The main challenge of developing a practical and effective logic testing technique for HT detection is the inordinately large space of possible Trojan designs that the adversary can explore. Since the HT trigger is derived from a very rare condition that is unknown to the defender, attempting to stimulate the stealthy Trojan with a limited number of test inputs is difficult. Existing logic testing methods generate test patterns using simple heuristics, and thus cannot ensure high trigger coverage on complex circuits. Also, such heuristic-driven test generation approaches are inefficient (long test generation time) and unscalable to large benchmarks [chakraborty2009mero, bhunia2014hardware, tehranipoor2010survey].
Besides SCA and logic testing, other HT detection techniques have also been explored. For instance, FANCI [waksman2013fanci] presents a Boolean functional analysis method to identify suspicious wires that are nearly unused in the circuit. For this purpose, FANCI introduces a concept called ‘control value’ to characterize the influence of a specific wire on other wires. The wires with small control values are flagged as suspicious. However, the wire-wise control value computation in FANCI is unscalable on large circuits. VeriTrust [zhang2015veritrust] suggests a verification method to detect HT trigger inputs by examining the verification corners. Therefore, VeriTrust is agnostic to the HT implementation styles.
Prior works on logic testing have explored various heuristics to improve trigger coverage while reducing the test generation time. Conceptually similar to the ‘N-detection test’ in stuck-at automatic test pattern generation (ATPG), MERO [chakraborty2009mero] leverages random test vectors and mutates them until each rare node in the circuit is individually triggered at least $N$ times. Such a simple detection heuristic results in unsatisfactory trigger coverage, particularly for Trojans that are hard to activate. To overcome the limitation of MERO, [saha2015improved] proposes to use genetic algorithms (GA) and Boolean Satisfiability (SAT) to produce test inputs that excite regular rare nodes and internal hard-to-trigger nodes, respectively. As a result, [saha2015improved] achieves a higher trigger coverage than MERO, but it is inefficient due to its long test generation time. TRIAGE [nourian2018hardware] further improves GA-based test generation by devising a more appropriate ‘fitness’ function that incorporates the controllability and observability factors of rare nodes. However, the GA nature of TRIAGE limits its efficiency for test input space exploration and the resulting test set might be unnecessarily large. TGRL [pan2021automated] suggests training a machine learning model for test pattern generation that combines rare signal stimulation with controllability/observability analysis. Although TGRL claims to explore reinforcement learning, its test pattern generation pipeline (Alg. 3 in [pan2021automated]) does not involve the sequential decision making of standard RL techniques. Instead, TGRL learns an ML model via stochastic gradient descent for TPG.
2.3 Reinforcement Learning
Reinforcement learning [kaelbling1996reinforcement, wiering2012reinforcement, sutton2018reinforcement] is a machine learning technique that is capable of solving complex problems in various domains. RL works sequentially in an environment by taking an action, evaluating its reward, and adjusting the following actions accordingly. In particular, an RL paradigm involves an agent that observes the environment and takes actions to maximize the reward determined by the problem of concern [sutton2018reinforcement, mnih2013playing]. Figure 3 shows the interaction between the agent and the environment in the RL paradigm.
We introduce the key concepts in an RL system below:
Action Space. The action space is a set of possible moves that the agent can take to change to a new state. For example, in a video game, an action can be running left/right, or jumping high/low.
Environment. The environment takes the agent’s current state and action as input, and returns the reward and the next state as the output. Depending on the problem domain, the environment might be a set of physical laws or chemical reaction rules that process the actions and establish the corresponding outcomes.
State. A state is a concrete and instantaneous situation in which the agent finds itself. This can be an instant configuration, a particular place and a moment that puts the agent in connection with other influential objects in the environment, such as opponents or awards. It is noteworthy that a state needs to contain all information to ensure the system satisfies the Markov property [pardo2018time].
Observations. The agent can obtain observations (emission of states) from the environment. In particular, the observation is a (stochastic) function of the state.
Reward. The reward is a numerical value that evaluates the fitness (success or failure) of an agent’s actions in a given state. From a given state, an agent takes actions in the environment and acquires the new state as well as the reward from the environment. A cumulative reward is defined as the summation of discounted rewards: $G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$, where $r_{t+k}$ is the reward received $k$ steps after time $t$. The discount factor $\gamma$ ($\gamma \in [0, 1]$) tunes the importance of future rewards for the current state. The key idea of RL is to find a series of actions that maximize the expected cumulative reward.
Policy. The policy of an RL algorithm is typically defined within the context of a Markov decision process [sutton2018reinforcement]. Given the state information, the policy is the suggested action that the agent shall take in order to obtain a high reward.
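To make the notion of a discounted cumulative reward concrete, the following minimal sketch evaluates it for a finite list of rewards. The function name `discounted_return` and the truncation to a finite episode are illustrative assumptions and are not part of AdaTest.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence.

    Assumption: the episode is finite, so the infinite discounted sum is
    truncated at len(rewards).
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate from the last reward backwards: G_k = r_k + gamma * G_{k+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Rewards 1, 0, 2 with gamma = 0.5 give 1 + 0.5 * 0 + 0.25 * 2.
example_return = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

A smaller discount factor makes the agent short-sighted, since later rewards contribute less to the sum.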
Our objective is to develop an adaptive test pattern generation framework for logic testing with high Trojan coverage and small test set size. Therefore, AdaTest belongs to the test-time detection category introduced in Section 2.2. We choose RL over other machine learning techniques (e.g., neural networks) since the reward-oriented and progressive nature of RL makes it appealing for our goal. Furthermore, to reduce the complexity of RL, AdaTest integrates adaptive sampling to prioritize test patterns that provide more useful information for HT detection.
3 AdaTest Overview
In this section, we first discuss the limitations of prior works on Hardware Trojan detection and our motivation (Section 3.1), then introduce our assumptions and threat model for AdaTest framework (Section 3.2). We demonstrate the overall workflow of AdaTest test pattern generation technique in Section 3.3. AdaTest is a hardware-friendly framework and we present our architecture design in Section 5.
3.1 Motivation and Challenges
Prior works have advanced logic testing-based Trojan detection using various techniques [chakraborty2009mero, saha2015improved, nourian2018hardware]. We discuss the limitations of these detection schemes below.
MERO. Inspired by the traditional ‘N-detect’ test used in stuck-at ATPG, MERO [chakraborty2009mero] generates random test vectors to activate each rare node (identified as nodes with transition probability smaller than the threshold $\theta$) to the corresponding rare value at least $N$ times. MERO has three main disadvantages: (i) Triggering all rare nodes $N$ times might be very time-consuming or even impractical; (ii) It yields low trigger coverage for hard-to-trigger Trojans; (iii) It only explores a small number of test vectors in the entire possible space due to its bit mutation and test vector selection policy.
ATPG based on GA+SAT. The paper [saha2015improved] combines genetic algorithms and SAT in test pattern generation for HT detection. While it improves the trigger coverage compared to MERO, [saha2015improved] has two constraints: slow test set generation and large memory footprint.
TRIAGE. The paper [nourian2018hardware] proposes TRIAGE, which integrates the benefits of MERO and [saha2015improved]. TRIAGE leverages the SCOAP testability parameters to devise the fitness function of the GA for HT detection. However, the evolutionary nature of GA means that TRIAGE might be ‘trapped’ in the vicinity of a local optimum, thus exploring only a small portion of the full test input space.
We present AdaTest as a holistic solution to address the limitations of the previous works. To this end, we identify three main challenges of developing an efficient and effective logic testing-based HT detection technique as follows:
(C1) High trigger coverage. The test vector set shall yield a high trigger coverage rate to ensure that the probability of activating the stealthy Trojan is large. This property is critical for the effectiveness criterion of HT detection.
(C2) Efficient test generation. The runtime overhead of test pattern generation shall be reasonable while attaining a high trigger coverage. For hardware-assisted security, this implies that a test set with smaller size is preferred. This requirement assures the efficiency and practicality of the HT detection method, particularly on large circuits.
(C3) Scalable to large benchmarks. The runtime consumed by the test pattern generation technique shall not scale exponentially with the size of the examined circuit.
AdaTest tackles the above challenges using an adaptive, RL-based input space exploration approach. Furthermore, we provide architecture design for AdaTest-based TPG in Section 5 to enable hardware-assisted security. We empirically corroborate the superior performance of AdaTest compared to the above counterparts in Section 6.
3.2 Threat Model
As shown in the example HT design of Figure 2, an HT consists of two parts: trigger and payload. AdaTest is applicable to both combinational and sequential circuits: one can unroll sequential circuits into combinational ones and apply AdaTest for test pattern generation. Without loss of generality, we assume that the adversary uses a logic-AND gate as the Trojan trigger that takes a subset of rare nodes as its inputs. An XOR gate is used to flip the value of the payload node when the trigger is activated (i.e., each of the trigger nodes has a logical value ‘1’).
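The assumed Trojan structure can be mimicked in a few lines of Python. This is only an illustrative model of the AND-trigger and XOR-payload described above; the function names are hypothetical and single-bit integer signals are assumed.

```python
def and_trigger(trigger_values):
    """Trigger output is 1 only when every trigger node carries logical 1."""
    return int(all(value == 1 for value in trigger_values))


def xor_payload(original_value, trigger):
    """XOR payload: pass the original bit through, or flip it when trigger is 1."""
    return original_value ^ trigger


def trojaned_output(original_value, trigger_values):
    """Value seen on the payload node of the infected circuit."""
    return xor_payload(original_value, and_trigger(trigger_values))


# Dormant Trojan: one trigger node is 0, so the payload node is unchanged.
dormant = trojaned_output(1, [1, 0, 1])
# Activated Trojan: all trigger nodes are 1, so the payload node is flipped.
activated = trojaned_output(1, [1, 1, 1])
```

Because the trigger is a conjunction of rare values, the flip is only observable for the few inputs that set all trigger nodes at once, which is what makes the test generation problem hard.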
We make the following assumptions about the AdaTest framework:
(i) The defender knows the netlist of the circuit under test. We assume the party that executes logic testing has the netlist description of the circuit to be examined. This netlist can be obtained by performing de-packaging, de-layering, and imaging [torrance2009state, meade2016netlist, li2012reverse, fyrbiak2017hardware] on the physical circuit. While hardware obfuscation techniques such as camouflaging [li2017provably, yasin2016camoperturb, shakya2019covert, shamsi2019impossibility] and logic encryption [yasin2015improving, yasin2017evolution, xie2018anti, tan2020benchmarking] could make the trigger design of the Trojan harder to identify, we consider the scenario where the circuit under test is not encrypted in our threat model since this setting is also used in previous Trojan detection papers [chakraborty2009mero, pan2021automated, shakya2017benchmarking, yang2020survey].
(ii) The defender can observe the ‘indication signal’ when the Trojan is activated. We assume the defender can observe certain manifestations of the hidden Trojan when it is activated. In particular, we assume the defender knows the correct response of the CUT to a given test input and observes the primary outputs of the CUT for comparison. Note that AdaTest is compatible with techniques that increase manifestation signals (e.g., test point insertion).
3.3 Global Flow
Figure 4 illustrates the global flow of AdaTest. We discuss the threat model in Section 3.2. AdaTest framework consists of two stages: (i) Circuit profiling phase (offline) that computes the transition probabilities and SCOAP testability parameters of the netlist; (ii) Adaptive RL-based test set generation phase (online) that progressively identifies test vectors with high reward values.
Phase I: Circuit Profiling. This stage includes the following:
(1) Compute Transition Probabilities. Given the netlist of the circuit under test, AdaTest first computes the transition probability of each internal node in the netlist. In particular, we use the method in [salmani2011novel] and assume that each primary input has an equal probability of taking a logical value of 0 and 1. We make this assumption about the primary input values since previous Trojan detection papers [salmani2011novel, xiao2016hardware, bhunia2014hardware, li2016survey] use the same assumption when computing the transition probability. Mathematically, the transition probability of a node is computed as $P_{trans} = p\,(1-p)$, where $p$ is the probability that the node takes the logical value 0. $P_{trans}$ of each node is then compared with a pre-defined threshold to identify the rare nodes. Identifying rare nodes is important for HT detection since the defender does not know the exact set of trigger nodes used by the attacker. As such, the activation status of rare nodes provides guidance to generate test inputs that are likely to trigger the stealthy Trojan.
(2) Compute SCOAP Testability Parameters. Controllability and observability are important testability characteristics of a digital circuit. More specifically, ‘controllability’ describes the ability to establish a specific node to 0 or 1 by setting the primary inputs. ‘Observability’ defines the capability of determining the value of a node by controlling the circuit’s inputs and observing the outputs. The testability parameters are useful for Trojan detection since they allow AdaTest to distinguish the quality of different rare nodes.
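As a rough illustration of the profiling step, the sketch below propagates signal probabilities through a toy netlist and flags rare nodes by thresholding. Several things here are assumptions made for the example: the netlist format, the gate set, the independence of gate inputs (reconvergent fanout is ignored), and the use of the product of the 0- and 1-probabilities as the transition probability.

```python
# Hypothetical toy netlist in topological order: (output, gate type, inputs).
NETLIST = [
    ("n1", "AND", ["a", "b"]),
    ("n2", "AND", ["n1", "c"]),
    ("n3", "OR", ["a", "d"]),
    ("n4", "NOT", ["n3"]),
]
PRIMARY_INPUTS = ["a", "b", "c", "d"]


def signal_probabilities(netlist, primary_inputs):
    """Probability of logical 1 per node, assuming independent gate inputs."""
    p_one = {name: 0.5 for name in primary_inputs}  # unbiased primary inputs
    for out, gate, ins in netlist:
        if gate == "AND":
            prob = 1.0
            for name in ins:
                prob *= p_one[name]
        elif gate == "OR":
            prob_all_zero = 1.0
            for name in ins:
                prob_all_zero *= 1.0 - p_one[name]
            prob = 1.0 - prob_all_zero
        elif gate == "NOT":
            prob = 1.0 - p_one[ins[0]]
        else:
            raise ValueError(f"unsupported gate type: {gate}")
        p_one[out] = prob
    return p_one


def find_rare_nodes(netlist, primary_inputs, threshold):
    """Map each rare internal node to its rare value (the less likely bit)."""
    p_one = signal_probabilities(netlist, primary_inputs)
    rare = {}
    for out, _, _ in netlist:
        p_trans = p_one[out] * (1.0 - p_one[out])
        if p_trans < threshold:
            rare[out] = 1 if p_one[out] < 0.5 else 0
    return rare


rare_nodes = find_rare_nodes(NETLIST, PRIMARY_INPUTS, threshold=0.15)
```

In this toy circuit only `n2` falls below the threshold, since it is 1 with probability 0.125; the threshold value 0.15 is chosen purely for the example.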
Phase II: Adaptive RL-based test pattern generation. After the CUT is profiled offline in Phase 1, AdaTest performs adaptive test input generation as shown in the bottom of Figure 4. We outline each step as follows:
(1) Initialize Test Set. AdaTest first generates an initial test vector set that is used in the later steps. A naive way to do so is random initialization, which may not be optimal for HT detection. To improve the trigger coverage in the later runs, AdaTest employs SAT to find a number of test inputs that activate a subset of rare nodes. We call this method ‘smart initialization’ and empirically corroborate its effectiveness in Section 6.1.
(2) Generate Candidate Test Inputs. In each iteration of AdaTest’s adaptive test vector generation, we first produce a sufficient number of candidate test input patterns that might improve the detection performance when added to the current test set. AdaTest deploys random test generation for this purpose.
(3) Evaluate Reward Function. AdaTest applies the candidate test inputs on the examined circuit and collects the observations, i.e., the netlist status represented as a directed acyclic graph (DAG). We incorporate the transition probabilities and the SCOAP testability parameters from Phase 1 as well as a novel DAG-level diversity measure to define our reward function.
(4) Adaptive Sampling to Update Test Set. Inspired by the selection step in genetic algorithms, we design an adaptive sampling module that picks ‘high-quality’ test patterns for fast and efficient input space exploration. In particular, after computing the reward value of each test input in the candidate test vectors, AdaTest selects the ones with the highest scores and appends them to the current test set.
At the end of each iteration, AdaTest checks the termination condition and decides whether or not the progressive test generation process shall continue.
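The four online steps and the termination check can be summarized in a short loop. The sketch below is a simplified reading of the flow and not the authors' implementation: the reward is passed in as a callable, the termination condition is reduced to a fixed iteration budget, the initial set is supplied by the caller instead of a SAT solver, and all names and default parameters are hypothetical.

```python
import random


def adaptive_tpg(num_inputs, reward_fn, initial_set,
                 num_candidates=32, top_k=4, max_iters=5, seed=0):
    """Grow a test set by repeatedly sampling candidates and keeping the best.

    reward_fn(test_set, vector) scores one candidate against the current set.
    """
    rng = random.Random(seed)
    test_set = [tuple(vector) for vector in initial_set]   # step (1)
    for _ in range(max_iters):                             # termination: budget
        candidates = [                                     # step (2)
            tuple(rng.randint(0, 1) for _ in range(num_inputs))
            for _ in range(num_candidates)
        ]
        ranked = sorted(candidates,                        # step (3)
                        key=lambda vector: reward_fn(test_set, vector),
                        reverse=True)
        test_set.extend(ranked[:top_k])                    # step (4)
    return test_set


def toy_reward(test_set, vector):
    """Placeholder reward: smallest Hamming distance to any stored vector."""
    return min(sum(a != b for a, b in zip(vector, stored)) for stored in test_set)


tests = adaptive_tpg(num_inputs=8, reward_fn=toy_reward, initial_set=[(0,) * 8])
```

The `top_k` and `max_iters` parameters are where a trade-off between test set size, generation time, and coverage would be exposed in such a loop.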
Performance Metrics. We use effectiveness and efficiency as two main metrics to assess the performance of a Trojan detection scheme. In particular, we measure the effectiveness from two aspects: trigger coverage and Trojan coverage (i.e. detection rate). The efficiency property is measured by the test set generation time and test set size. AdaTest, for the first time, provides the trade-off between effectiveness and efficiency by adaptively generating a set of test patterns with evolving quality over time. The quantitative analysis of the above metrics is demonstrated in Section 6.
4 AdaTest Algorithm Design
The key to ensuring a high probability of Trojan detection using logic testing is to generate a test set that can drive the circuit to diverse states and, in particular, activate the rare nodes in the circuit. To this end, AdaTest leverages three important characteristics of the circuit: the transition probabilities, the SCOAP testability measures, and the DAG-level diversity. In particular, AdaTest employs an RL-driven test pattern generation approach that uses the above three properties to progressively generate test inputs. Inspired by the selection stage in genetic algorithms, we integrate an adaptive sampling module that progressively expands the current test set (used as historical information) with high-quality test patterns. This response-adaptive design is beneficial for the statistical search of the HT trigger in the circuit input space, and thus improves the efficiency of AdaTest’s RL-based pipeline. We detail the two main phases of AdaTest shown in Figure 4 in the remainder of this section.
4.1 Circuit Profiling
Alg. 1 outlines the steps of the circuit profiling phase in AdaTest. This stage obtains two informative properties of the circuit: the transition probabilities and testability measures. In particular, we use random testing and logic simulation to estimate the transition probability of each node in the netlist. To further investigate the rewards of different rare nodes, AdaTest also computes the SCOAP parameters of the nodes using the technique in [goldstein1980scoap].
AdaTest’s circuit profiling stage characterizes the static reward properties of the circuit in terms of the transition probabilities of rare nodes and testability measures. We call these two properties ‘static’ since they are independent of the circuit input for a given circuit netlist. As such, our profiling phase can be performed offline. The above two properties are indispensable for the reward computation step in Phase 2 of AdaTest since: (i) Transition probabilities and rare nodes shed light on the potential trigger nodes exploited by the malicious adversary. The defender knows that a subset of rare nodes is used to design the stealthy Trojan but has no knowledge of the exact trigger set. As such, rewarding the activation of rare nodes encourages the test vectors to stimulate the possible HT. (Note that the Trojan activation condition is equivalent to knowledge of the exact trigger set and both are assumed to be unknown to the defender.) (ii) Testability parameters provide more fine-grained information about the quality of individual rare nodes in the context of HT detection. One can compare the fitness of two test inputs by counting and comparing the number of activated rare nodes corresponding to each test vector. However, such a naive counting mechanism neglects the intrinsic difference between the quality of individual rare nodes. In principle, a rare node with higher controllability and observability shall be assigned higher reward values. As such, AdaTest integrates the SCOAP testability measures to quantify the reward of each activated rare node.
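For readers unfamiliar with SCOAP, the following sketch computes the combinational controllability values CC0 and CC1 for a toy netlist using the standard SCOAP rules for AND, OR, and NOT gates. The netlist format and names are hypothetical, observability is omitted for brevity, and how AdaTest maps these numbers to a reward is not shown here.

```python
# Hypothetical toy netlist in topological order: (output, gate type, inputs).
TOY_NETLIST = [
    ("n1", "AND", ["a", "b"]),
    ("n2", "AND", ["n1", "c"]),
    ("n3", "NOT", ["n2"]),
]
TOY_INPUTS = ["a", "b", "c"]


def scoap_controllability(netlist, primary_inputs):
    """Return (cc0, cc1): the SCOAP cost of setting each node to 0 and to 1."""
    cc0 = {name: 1 for name in primary_inputs}  # primary inputs cost 1
    cc1 = {name: 1 for name in primary_inputs}
    for out, gate, ins in netlist:
        if gate == "AND":
            # Output 0 needs any one input at 0; output 1 needs all inputs at 1.
            cc0[out] = min(cc0[name] for name in ins) + 1
            cc1[out] = sum(cc1[name] for name in ins) + 1
        elif gate == "OR":
            # Output 0 needs all inputs at 0; output 1 needs any one input at 1.
            cc0[out] = sum(cc0[name] for name in ins) + 1
            cc1[out] = min(cc1[name] for name in ins) + 1
        elif gate == "NOT":
            cc0[out] = cc1[ins[0]] + 1
            cc1[out] = cc0[ins[0]] + 1
        else:
            raise ValueError(f"unsupported gate type: {gate}")
    return cc0, cc1


cc0, cc1 = scoap_controllability(TOY_NETLIST, TOY_INPUTS)
```

In SCOAP, larger values indicate nodes that are harder to set; here `n2` costs 5 to set to 1 but only 2 to set to 0, which matches the intuition that a deep AND chain rarely evaluates to 1.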
4.2 Adaptive RL-based Test Pattern Generation
AdaTest deploys a progressive, reinforcement learning-driven algorithm for efficient and effective test input space exploration with the goal of HT detection. Section 2.3 introduces the basic concepts of RL. We discuss how we map the Trojan detection problem to the RL paradigm as follows.
AdaTest’s RL Formulation of Trojan Detection:
State. The objective of AdaTest is to adaptively generate test patterns with high effectiveness for Trojan detection in an iterative manner. As such, AdaTest defines a state as the current test set in the present iteration.
Action Space. Recall that an action transforms the agent into a new state, which is the new test set according to our definition of the state above. Therefore, a feasible action for AdaTest is to identify a set of new test input vectors in each iteration that improves the quality of HT detection when added to the current test set.
Environment. For HT detection, the netlist of the circuit can be considered as the environment that takes the current state and the action, and returns the reward value.
Observations. The agent makes the observation of the environment before reward computation. For Trojan detection problems, we model the DAG formed by the values of all nodes in the netlist given a specific input vector as an observation of the circuit state.
Reward. The definition of the reward function directly reflects the objective of the problem that one aims to solve. As such, for the task of logic testing-based HT detection, AdaTest designs a composite reward function to encourage the generation/exploration of test inputs that facilitate the excitation of the potential HT.
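The mathematical definition of AdaTest’s dynamic reward function is given in the equation below:
$$R(S_t, T_t, \mathcal{N}_{rare}, \mathcal{P}_{SCOAP}) = \lambda_1 \cdot R_{rare} + \lambda_2 \cdot R_{SCOAP} + \lambda_3 \cdot R_{DAG}$$
Here, $S_t$ and $T_t$ are the current test set (i.e., the state) and the newly generated test inputs in iteration $t$, respectively. $\mathcal{N}_{rare}$ and $\mathcal{P}_{SCOAP}$ are the set of rare nodes and the SCOAP testability parameters identified in Phase 1 (static attributes). The hyper-parameters $\lambda_1$, $\lambda_2$, $\lambda_3$ determine the relative weighting of the three reward terms: the rare node reward $R_{rare}$, the SCOAP testability reward $R_{SCOAP}$, and the DAG diversity reward $R_{DAG}$. The reward function characterizes the fitness of the specific test inputs $T_t$ while considering the current test set $S_t$. Evaluating the reward value of $T_t$ in the context of the historical test patterns ($S_t$) makes AdaTest’s RL framework adaptive and intelligent.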
We detail how each term in AdaTest’s reward function is designed below. Inspired by the ‘N-detect’ test, the first term of the composite reward function aims to activate each rare node in the circuit at least $N$ times. To this end, we define the rare node reward as follows:
where $A_i^{t}$ is the number of times that the rare node $r_i$ is activated to its rare value up to the $t$-th iteration.
The second term of the composite reward function leverages the SCOAP parameters computed in Phase 1 to encourage the stimulation of rare nodes with high controllability and observability. Given the current test set $S_t$, we can obtain the set of activated rare nodes (which is a subset of the full rare node set). The SCOAP testability reward is then computed as follows:
Here, $CC(r)$ denotes the controllability of setting the rare node $r$ to its corresponding rare value. More specifically, $CC(r)$ shall be converted to $CC0(r)$ or $CC1(r)$ depending on the rare value of the node $r$. $CO(r)$ denotes the observability of the rare node $r$.
Besides leveraging the static attributes identified in Phase 1 to define the rare node reward and the SCOAP testability reward , AdaTest further explores the graph-level diversity extracted from the circuit netlist. In particular, AdaTest identifies the dynamic fitness property, i.e., the DAG-level diversity that is jointly determined by the circuit netlist and the test vector set. Such a DAG-level distance serves as a dynamic fitness measure since it is input-aware. Recall that AdaTest leverages an RL paradigm and considers the value assignments of all nodes when given the netlist and a specific test input as the observation. We use the graph representation of the circuit to abstract the observed netlist status. To facilitate the computation, AdaTest flattens the DAG to an ordered sequence based on the circuit level information. The distance between the two transformed DAG sequences is used as the DAG-level diversity measure. To summarize, we define the DAG diversity reward as follows:
Here, denotes the flattened ordered sequence of the DAG obtained when applying the test inputs to the circuit . The diversity measurement function computes the normalized pairwise distance of the flattened DAGs using the Hamming distance metric. Since the DAG sequence of the circuit is binary-valued (0 or 1), AdaTest employs the XOR function as an efficient implementation of the Hamming distance function. It is worth noting that this graph reward is aware of historical test inputs (), thus providing guidance to select new inputs that stimulate different internal node structures in the context of the current test inputs .
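The DAG diversity measure described above can be sketched as follows, assuming each flattened DAG is a list of 0/1 node values of equal length and that the reward is the mean Hamming distance to all historical sequences divided by the sequence length. The function names and this particular normalization are assumptions for illustration.

```python
from typing import List, Sequence


def hamming_xor(a: Sequence[int], b: Sequence[int]) -> int:
    """Hamming distance of two binary sequences, computed with XOR."""
    if len(a) != len(b):
        raise ValueError("flattened DAG sequences must have equal length")
    return sum(x ^ y for x, y in zip(a, b))


def dag_diversity_reward(candidate: Sequence[int], history: List[Sequence[int]]) -> float:
    """Assumed normalization: average distance to history, scaled to [0, 1]."""
    if not history or not candidate:
        return 0.0
    total = sum(hamming_xor(candidate, past) for past in history)
    return total / (len(history) * len(candidate))
```

A candidate whose node values duplicate an existing test vector contributes zero distance for that pair, so it scores lower than one that drives the internal nodes to a new configuration.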
Policy. The policy component of an RL algorithm suggests actions to achieve a high reward given the current state. Recall that AdaTest defines the state and the action space as the current set of test vectors and the expansion with the new test patterns, respectively. Therefore, the policy module of AdaTest selects the most suitable test pattern candidates and adds them to the resulting test set (lines 5 and 6 in Alg. 2).
Algorithm 2 outlines the procedure of our adaptive test set generation framework. We emphasize that AdaTest does not require explicit training on a training set, which is typically required by machine learning models (e.g., gradient descent-based training). The RL nature enables AdaTest to search for distinguishing test inputs with the guidance of the composite reward. This makes our detection method fundamentally different from TGRL [pan2021automated], which still trains an ML model for test pattern generation. We discuss how AdaTest leverages the RL paradigm formulated above to achieve logic testing-based HT detection in the remainder of this section.
Smart Initialization. Recall that the intuition of logic testing-based Trojan detection is to encourage the generation of test inputs that activate diverse combinations of rare nodes to their corresponding rare values. Random test vectors are unlikely to yield a high trigger coverage, especially on large circuits. To exploit the above intuition, AdaTest leverages SAT to generate the initial test set (line 1 in Algorithm 2) such that it is able to activate diverse rare nodes specified by the defender. We empirically validate the advantage of our smart initialization as opposed to the random variant in Section 6.1. It is worth noting that while the defender can identify rare nodes in the circuit by thresholding the transition probabilities, it might be infeasible to find an input that stimulates all rare nodes to their rare values. Therefore, AdaTest tries to generate test patterns that stimulate different combinations of rare nodes for Trojan detection.
Generate Candidate Test Patterns. AdaTest progressively identifies test inputs that are suitable for HT detection using an iterative approach. To this end, AdaTest first generates a sufficient number of candidate test vectors at the beginning of each iteration (line 3 in Alg. 2). These candidates are responsible for exploring the test input space and aim to find solutions with high rewards. In our experiments, we adopt an adaptive sampling method to generate candidate test patterns at each iteration. In particular, the sampling weights for the test vectors in the initial set are uniformly assigned at iteration 0. In other words, at iteration 0, we perform a uniform sampling to generate candidate test patterns. Then the sampling weights of test vectors at iteration will be updated based on the normalized reward values evaluated at iteration . Test vectors with higher reward values receive higher sampling weights, which in turn increases the probability that they are included in the generated set . The adaptive sampling method allows us to optimize test pattern generation by favoring test patterns with higher reward values, thus improving convergence.
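The weight update described above can be illustrated with a short sketch. It assumes non-negative rewards, normalizes them so they sum to one, and falls back to uniform weights when no reward information is available (as at iteration 0). The helper names are hypothetical, and how sampled vectors are further perturbed into new candidates is not covered here.

```python
import random
from typing import List, Sequence


def normalized_weights(rewards: Sequence[float]) -> List[float]:
    """Turn non-negative rewards into sampling weights that sum to 1."""
    if not rewards:
        return []
    total = sum(rewards)
    if total <= 0:
        # No reward signal yet: uniform sampling, as at iteration 0.
        return [1.0 / len(rewards)] * len(rewards)
    return [r / total for r in rewards]


def sample_candidates(pool: Sequence, weights: Sequence[float], k: int,
                      rng: random.Random) -> list:
    """Draw k vectors (with replacement) in proportion to their weights."""
    return rng.choices(pool, weights=weights, k=k)
```

Passing an explicitly seeded `random.Random` instance keeps the sampling reproducible across runs.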
Evaluate Reward Function. The definition of reward is task-specific. Since our objective is to generate test patterns that stimulate the circuit (particularly the rare nodes) to different states for Trojan detection, AdaTest designs an innovative composite reward function as shown in Equation (4.2). In each iteration, the reward values of the candidate test inputs are evaluated (line 4 of Alg. 2). Our compound reward function captures informative features that are beneficial for HT detection from three aspects: the number of times that each rare node is activated (), the SCOAP testability measures that quantify the fitness of different rare nodes (), and the graph-level diversity between the current test inputs and historical ones ().
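The sketch below shows one way to combine the three reward terms, assuming the composite reward is a weighted sum controlled by three hyper-parameters. The weighted-sum form and the default weights are assumptions for illustration and are not taken from Equation (4.2).

```python
from typing import Sequence, Tuple


def composite_reward(r_rare: float, r_scoap: float, r_dag: float,
                     weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Assumed weighted sum of the rare node, SCOAP, and DAG diversity terms."""
    w_rare, w_scoap, w_dag = weights
    return w_rare * r_rare + w_scoap * r_scoap + w_dag * r_dag


def evaluate_candidates(terms: Sequence[Tuple[float, float, float]],
                        weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> list:
    """Score every candidate given its three precomputed reward terms."""
    return [composite_reward(a, b, c, weights) for a, b, c in terms]
```

Scaling the weights so that the three terms fall in a comparable range keeps any single term from dominating the ranking of candidates.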
Adaptive Sampling to Update Test Set. Recall that in AdaTest’s RL paradigm, the current test set represents the ‘state’ variable. After obtaining the reward value of each candidate test input in from Step 3, AdaTest updates the state by selecting the subset of with the highest reward values and adding it to the current test set . This step is conceptually similar to the selection stage in genetic algorithms. With the domain-specific definition of reward, AdaTest adaptively samples high-quality test patterns from the randomly generated candidate test inputs, thereby facilitating fast exploration of the circuit input space for HT detection.
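The state update can be sketched as a top-k selection over the candidate rewards. The name `step` for the number of candidates kept per iteration is an assumption, and ties are broken arbitrarily by candidate index.

```python
import heapq
from typing import List, Sequence


def expand_test_set(test_set: List, candidates: Sequence, rewards: Sequence[float],
                    step: int) -> List:
    """Return the test set extended with the `step` highest-reward candidates."""
    if len(candidates) != len(rewards):
        raise ValueError("one reward per candidate is required")
    # Rank (reward, index) pairs so the candidates themselves are never compared.
    ranked = heapq.nlargest(step, zip(rewards, range(len(candidates))))
    return test_set + [candidates[i] for _, i in ranked]
```

For example, with rewards `[0.2, 0.9, 0.5]` and `step=2`, the second and third candidates are appended, in that order.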
Check Termination Condition. AdaTest’s adaptive test set generation terminates if any of the following three conditions is satisfied: (i) of all rare nodes are activated for at least times and all rare nodes are activated at least once (line 8 in Alg. 2); (ii) The maximal number of iterations is reached (line 2 in Alg. 2); (iii) The current test set activates the hidden Trojan, i.e., all involved trigger nodes are activated to their corresponding rare values by (line 2 in Alg. 2). Note that we include termination condition (iii) since our threat model assumes that the defender can observe the manifestation of an activated Trojan.
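The three termination conditions can be expressed as a single predicate. In this sketch the parameter names (`fraction`, `n_detect`, `max_iters`) are hypothetical stand-ins for the thresholds in the text, and `counts` maps each rare node to its activation count so far.

```python
from typing import Dict


def should_terminate(counts: Dict[str, int], n_detect: int, fraction: float,
                     iteration: int, max_iters: int, trojan_activated: bool) -> bool:
    """Return True if any of the three stopping conditions holds."""
    # Conditions (ii) and (iii): iteration budget exhausted or Trojan observed.
    if trojan_activated or iteration >= max_iters:
        return True
    if not counts:
        return False
    # Condition (i): enough nodes reached the target and every node was hit once.
    enough = sum(1 for c in counts.values() if c >= n_detect)
    all_once = all(c >= 1 for c in counts.values())
    return all_once and enough >= fraction * len(counts)
```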
Discussion. As summarized in Alg. 2, our reinforcement learning approach does not require model training. Instead, we progressively generate the set of test vectors using adaptive sampling given the particular circuit with the goal of maximizing the RL rewards for Trojan detection. From this perspective, our RL-based detection tool generates a specific test set for the circuit under test. However, AdaTest is generic in the sense that it is agnostic to the circuit structure and can be applied to other different circuits (i.e., re-applying AdaTest to other circuits does not require any model training since we do not incorporate neural networks in our RL detection pipeline shown in Alg. 2).
5 AdaTest Architecture Design
Beyond the novel test generation algorithm discussed in Section 4, we design a domain-specific system-on-chip (DSSoC) architecture of AdaTest for its practical deployment. The bottleneck of AdaTest implementation is the computation of the test input’s reward according to Equation (4.2). Given the rare node set and SCOAP testability measures of the circuit from offline circuit profiling (Algorithm 1), the online reward evaluation of a new test input involves three terms as shown in Equation (4.2): identifying the rare nodes stimulated by (for ), obtaining the SCOAP values corresponding to each active rare node (for ), and computing the DAG-level graph distance (for ). Note that the third component requires us to obtain the DAG with node value assignments when applying the test input on the circuit . This information is also sufficient to compute the first two reward terms. Therefore, the main task for AdaTest’s on-chip implementation is to obtain the value-assigned DAG for a new test input on the circuit ().
To accelerate circuit evaluation, AdaTest deploys circuit emulation on the programmable hardware to obtain the response . Furthermore, AdaTest constructs the customized auxiliary circuitry automatically to pipeline each computation stage and reduce the runtime overhead. We design an optimized DSSoC architecture of AdaTest for efficient implementation of our adaptive TPG method outlined in Algorithm 2.
5.1 Architecture Overview
The overall hardware architecture of AdaTest’s online test pattern generation is shown in Figure 5 (a). AdaTest leverages an Algorithm/Software/Hardware co-design approach to accelerate the test input search process shown in Figure 4 (Phase 2). More specifically, AdaTest maps the netlist of the circuit under test () with the auxiliary part to the FPGA and performs circuit evaluation to obtain the circuit’s response () to the test input . We make this design decision to develop the hardware accelerator for AdaTest since acquiring the circuit’s response from a configured FPGA (circuit emulation) is significantly faster than the same process running on a host CPU (software simulation). In addition, AdaTest parallelizes the computation of circuit emulation and pipelines each step of the RL process. AdaTest performs reward computation of the candidate test inputs and adaptive sampling in an online fashion to minimize data communication between the off-chip memory and the FPGA.
Note that we do not include a random number generator (RNG) in our architecture design. Instead, AdaTest stores a set of random numbers pre-computed on the CPU using the inherent variation of the operating system. This design choice has two benefits: (i) The hardware overhead of a True RNG is non-trivial and not desired; (ii) Random numbers generated on the CPU typically feature stronger randomness compared to those generated on the FPGA. The results of circuit emulation are used for computing the reward values of test inputs using Eqn. (4.2) during reward evaluation. Rare node evaluation and DAG distance computation in reward evaluation are parallelized by accommodating multiple Computing Engines (CEs) in AdaTest’s design. We also partition the workload evenly across the CEs offline.
After accumulating the reward for each candidate test input, our adaptive sampling selects the ones with the highest rewards. This selection process is equivalent to sorting the candidate test inputs by their reward values. Therefore, AdaTest includes a sorting engine that permutes the key indices based on their corresponding rewards. We implement a lightweight sorting engine based on the ‘even-odd sort’ algorithm [chen1978simplified] for adaptive sampling, incurring a runtime overhead that is linear in the candidate test set size.
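To illustrate why an even-odd sorting engine has linear latency, the sketch below implements odd-even transposition sort in software, permuting candidate indices by descending reward. In hardware, all compare-exchange operations within one phase can run in parallel, so the n phases translate to linear time; this sequential Python version is only a behavioral model, and the function name is hypothetical.

```python
from typing import List, Sequence


def odd_even_sort_indices(rewards: Sequence[float]) -> List[int]:
    """Return candidate indices ordered by descending reward.

    Odd-even transposition sort: n phases alternate between even-indexed
    and odd-indexed neighbor pairs, which suffices to sort n elements.
    """
    n = len(rewards)
    idx = list(range(n))
    for phase in range(n):
        start = phase % 2  # even phase: pairs (0,1),(2,3)...; odd: (1,2),(3,4)...
        for i in range(start, n - 1, 2):
            if rewards[idx[i]] < rewards[idx[i + 1]]:
                idx[i], idx[i + 1] = idx[i + 1], idx[i]
    return idx
```

Sorting indices rather than the test vectors themselves matches the description of a sorting engine that permutes key indices, since only small keys need to move between compare-exchange stages.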
It is worth noting that AdaTest does not deploy a central control unit to coordinate the computation flow. Instead, each design component in Figure 5 (a) follows a trigger-based control mechanism [parashar2013triggered]. In particular, each module is controlled by the status flag from its previous computation stage. For example, the adaptive sampling module (i.e., the sorting engine) in AdaTest begins to operate when the accumulation of the reward value is detected as completed. Our trigger-based control flow simplifies the control logic while satisfying the data dependency between different components in Figure 4. We detail the design of AdaTest’s circuit emulation and auxiliary circuitry as follows.
5.2 AdaTest Circuit Emulation
We empirically observe from AdaTest’s software implementation that circuit evaluation (i.e., obtaining ) dominates the execution time. Motivated to address the high latency issue of evaluating a circuit netlist on CPU, we propose to use circuit emulation to improve AdaTest’s efficiency. The first step of circuit emulation is to rewrite the netlist of the circuit under test () such that the values of internal nodes can be recorded by registers. The rewritten circuit is then connected with the auxiliary circuitry and mapped onto FPGA. In this way, we can emulate the response of the target circuit for any test input by directly applying it on the circuit and collecting the corresponding values in the registers. The collected signal values are used to compute the three reward terms in Equation (4.2).
Furthermore, AdaTest optimizes the latency of hardware evaluation by storing the emulation results in a ping-pong buffer (i.e., two buffers denoted A and B) and decoupling it from other hardware components as shown in Figure 5 (a). More specifically, the reward computing engine (CE) calculates the reward of the candidate test input using the data from buffer A. In the meantime, the emulator acquires the states of given the next input and stores the results into buffer B.
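The ping-pong scheme can be modeled in software as two slots whose roles swap after every input. The sketch below is a sequential behavioral model (on the FPGA the write and the read happen concurrently); the class, the method names, and the `emulate` and `reward` callables are hypothetical.

```python
from typing import Callable, List, Sequence


class PingPongBuffer:
    """Two slots: one is written by the emulator while the other is read."""

    def __init__(self) -> None:
        self._slots = [None, None]
        self._write = 0  # index of the slot currently owned by the emulator

    def write(self, data) -> None:
        self._slots[self._write] = data

    def read(self):
        return self._slots[self._write ^ 1]

    def swap(self) -> None:
        self._write ^= 1


def run_pipeline(inputs: Sequence, emulate: Callable, reward: Callable) -> List:
    """Overlap emulation of input i+1 with reward evaluation of input i."""
    buf = PingPongBuffer()
    rewards = []
    for step, test_input in enumerate(inputs):
        buf.write(emulate(test_input))          # emulator fills its slot
        if step > 0:
            rewards.append(reward(buf.read()))  # CE consumes the previous result
        buf.swap()
    if len(inputs) > 0:
        rewards.append(reward(buf.read()))      # drain the last emulation result
    return rewards
```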
5.3 AdaTest Reward Computing Engine
Pipeline with Early Starting. Our architecture design aims to maximize the overlapping time between each execution stage of AdaTest to increase the throughput of TPG. As shown in Figure 6, the ping-pong buffer enables pipelined execution of hardware emulation and reward evaluation. Furthermore, reward evaluation and adaptive sampling can be pipelined across different iterations. We can see from Figure 6 that the next epoch can start circuit emulation and reward evaluation as soon as the previous epoch begins to generate its new test inputs. As such, the latency of candidate test input generation can be hidden by circuit emulation and reward evaluation.
Scalable Reward Computing Engine. Once circuit emulation finishes for the current input , AdaTest begins to calculate the reward of this test input using Equation (4.2). From the hardware perspective, the reward terms and are computed by accumulating the number of activated rare nodes and the corresponding SCOAP values from the circuit , and the reward is computed by accumulating the Hamming distance (i.e., XOR) between the values in the current DAG () and the historical ones (). Independence between different groups of wire signals typically exists in circuits. AdaTest leverages this property by distributing the computation involving independent groups of nodes to different reward computing engines as shown in Figure 5 (b). As such, each CE stores a subset of the DAG nodes’ values in the associated DAG buffer. The accumulation of the final reward score completes when the last CE finishes reward computing.
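The distribution of the Hamming distance accumulation across CEs can be sketched as below. For simplicity the nodes are split into contiguous, evenly sized slices, whereas the design described above groups independent wire signals; the function names and the slicing rule are assumptions.

```python
from typing import List, Sequence, Tuple


def partition(n_nodes: int, n_ces: int) -> List[Tuple[int, int]]:
    """Split node indices into n_ces contiguous, near-equal [lo, hi) ranges."""
    base, extra = divmod(n_nodes, n_ces)
    bounds, start = [], 0
    for ce in range(n_ces):
        size = base + (1 if ce < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def distributed_hamming(current: Sequence[int], past: Sequence[int], n_ces: int) -> int:
    """Each CE XORs its own slice; the partial sums are then accumulated."""
    partial = [
        sum(a ^ b for a, b in zip(current[lo:hi], past[lo:hi]))
        for lo, hi in partition(len(current), n_ces)
    ]
    return sum(partial)
```

Because XOR accumulation is associative, the total is independent of how the nodes are assigned to engines, which is what allows the workload to be partitioned offline.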
We investigate AdaTest’s performance for Hardware Trojan detection on various benchmarks, including ISCAS’85 [hansen1999unveiling], MCNC [mcnc], and ISCAS’89 [iscas89]. The statistics of the evaluated benchmarks are summarized in Table 1. To apply AdaTest on sequential circuits in the ISCAS’89 benchmark, we unroll the circuit for two time frames and convert it to a combinational one [arora2004enhancing, yuan2015sequential]. Note that the unrolling process duplicates the combinational logic blocks, thus increasing the effective circuit size for Trojan detection. The transition probability () threshold for rare nodes is set to for ISCAS’85 and MCNC benchmarks. As for two ISCAS’89 circuits, we use such that the number of rare nodes is at the same level as the previous two benchmarks. The identification results are shown in the last column of Table 1. To compare the performance of AdaTest and other logic testing-based Trojan detection methods, we use trigger coverage and Trojan coverage as the metrics to quantify detection effectiveness. To characterize detection efficiency, we use the number of test vectors and the detection runtime as the metrics. We empirically show that AdaTest achieves a higher Trojan detection rate with shorter runtime overhead compared to the counterparts in the rest of this section.
Experimental Setup. Adhering to our threat model defined in Section 3.2, we first design the HT and insert it into each benchmark listed in Table 1. We use a logic-AND gate as the Trojan trigger and select three rare nodes with rare value 1 as the inputs. To fully characterize the performance of AdaTest, we devise various HTs for each circuit (i.e., using different combinations of rare nodes as the trigger) and repeat the insertion for times. Our Trojaned benchmarks include ‘hard-to-trigger’ HTs with activation probabilities around (e.g., ). To compare the performance of AdaTest with prior works, we re-implement MERO [chakraborty2009mero] and TRIAGE [nourian2018hardware] in Python based on the methodology described in the respective papers. Our experiments are performed on an Intel Xeon E5-2650 v4 processor with 14.5 GiB of RAM.
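The AND-gate trigger and a rough estimate of its activation probability can be sketched as follows. The estimate multiplies the per-node rare-value probabilities, which assumes the trigger nodes are statistically independent; this rarely holds exactly in real circuits, so the result is only a first-order approximation. The function names and the numeric values in the usage note are illustrative.

```python
from math import prod
from typing import Dict, Sequence


def trigger_fires(node_values: Dict[str, int], trigger_nodes: Sequence[str]) -> bool:
    """AND-gate trigger: fires only if every trigger node is at rare value 1."""
    return all(node_values[n] == 1 for n in trigger_nodes)


def activation_probability(rare_probs: Sequence[float]) -> float:
    """Independence assumption: product of the per-node rare-value probabilities."""
    return prod(rare_probs)
```

For instance, three trigger nodes that each take their rare value with probability 0.1 give an estimated activation probability of about 0.001 under this assumption.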
MERO Configuration. We use the parameter selection strategy suggested in MERO [chakraborty2009mero] for re-implementation. Particularly, we set the size of random patterns to 2,500. The hyper-parameter of MERO is (desired number of times that each rare node shall be activated). A large value of achieves a higher detection rate while resulting in a larger test set [chakraborty2009mero]. We use in the experiments since this is the value suggested by MERO [chakraborty2009mero].
TRIAGE Configuration. We use a population size of 100 and select 20 test inputs with the highest fitness score in each generation. The probability of crossover and mutation is set to 0.9 and 0.05, respectively. The termination condition in TRIAGE [nourian2018hardware] is used to evolve the test patterns.
AdaTest Configuration. In AdaTest’s circuit profiling stage, we use the Testability Measurement Tool [scoaptool] to compute the SCOAP parameters. The SAT-based smart initialization step of AdaTest’s Phase 2 is performed using the pycosat library [pycosat]. Our framework is developed in Python language and does not require extensive hyper-parameter tuning. To ensure the three reward terms in Equation (4.2) have comparable values within the range of , we set the hyper-parameters to , , . The candidate test size and the step size in Algorithm 2 are set to and for all benchmarks, respectively. We use the percentage threshold to identify rare nodes and set the target activation times to . The maximal iteration time is set to .
According to the performance metrics in Section 3.3, we use the trigger coverage (percentage of trigger nodes identified by the test set) and the Trojan coverage (i.e., detection rate) to quantify the effectiveness of HT detection. Meanwhile, we measure the test set generation time and test set size of each technique for efficiency comparison. To obtain an accurate and comprehensive performance measurement, we design different HTs for each benchmark in Table 1 while fixing the number of trigger nodes to . Each set of devised HTs is inserted into the circuit independently. We run AdaTest detection on each Trojaned circuit for times. The trigger and Trojan coverage for each benchmark are computed as the average value over runs.
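The two effectiveness metrics can be computed from the emulated circuit responses as sketched below. Each response is assumed to be a mapping from node name to logic value for one test vector, and every trigger is assumed to use rare value 1, as in the experimental setup; the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def trigger_coverage(responses: List[Dict[str, int]], trigger_nodes: Sequence[str]) -> float:
    """Fraction of trigger nodes driven to 1 by at least one test vector."""
    hit = sum(1 for n in trigger_nodes if any(r[n] == 1 for r in responses))
    return hit / len(trigger_nodes)


def trojan_detected(responses: List[Dict[str, int]], trigger_nodes: Sequence[str]) -> bool:
    """A Trojan is detected if one vector activates all its trigger nodes at once."""
    return any(all(r[n] == 1 for n in trigger_nodes) for r in responses)


def trojan_coverage(responses: List[Dict[str, int]], trojans: List[Sequence[str]]) -> float:
    """Fraction of inserted Trojans activated by the test set (detection rate)."""
    return sum(trojan_detected(responses, t) for t in trojans) / len(trojans)
```

The distinction matters: a test set can reach full trigger coverage for a Trojan while never activating all of its trigger nodes simultaneously, in which case the Trojan remains undetected.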
| Circuit | Method | # test vectors | Runtime (s) | Trigger coverage | Trojan coverage |
6.1 Detection Effectiveness
We assess the detection performance of AdaTest, MERO, and TRIAGE using the aforementioned experimental setup. Figure 7 compares the Trojan coverage of the three HT detection techniques on different benchmarks. One can see that our framework achieves uniformly higher detection rates across various circuits. The superior HT detection performance of AdaTest is derived from our definition of adaptive, context-aware reward functions in Equation (4.2).
We use two metrics to quantitatively compare the effectiveness of different HT detection techniques: trigger coverage rate and Trojan detection rate. Note that AdaTest determines that a Hardware Trojan is present in the circuit if the set of test patterns generated using Alg. 2 results in Trojan activation when the test inputs are applied on the circuit. Therefore, our detection method does not have any false positives and we focus on evaluating the detection rates (which correspond to the false negative rate). Table 2 summarizes the HT detection results of three different methods on the benchmarks in Table 1. The trigger coverage and Trojan coverage results are shown in the last two columns of Table 2. It can be seen that AdaTest achieves the highest Trojan coverage while requiring the shortest test generation time across most of the benchmarks. More specifically, AdaTest achieves an average of and Trojan coverage improvement over MERO [chakraborty2009mero] and TRIAGE [nourian2018hardware], respectively. The superior HT detection performance of our logic testing-based approach is derived from the diverse test patterns found by AdaTest’s adaptive RL-driven input space exploration technique (see Section 4.2). We not only encourage the activation of rare nodes and differentiate their qualities using SCOAP testability parameters, but also explicitly characterize the graph-level distance of the CUT status under different test stimuli.
We measure the dynamic rare node coverage versus the number of executed iterations to validate the time-evolving property of the AdaTest framework. Figure 8 shows the coverage results of AdaTest with random initialization and SAT-based smart initialization on the benchmark. We can make two observations from Figure 8: (i) AdaTest consistently improves the rare node coverage over time (with either initialization method); (ii) SAT-based smart initialization improves the convergence speed of AdaTest, thus reducing our test set generation time. The first observation corroborates the efficacy of our RL-based progressive test pattern generation method. The second observation reveals the importance of proper initialization for fast convergence of RL exploration. Note that a shorter convergence time (i.e., a smaller number of iterations in Algorithm 2) indicates a smaller test set returned by AdaTest, which is beneficial to reduce the test generation time for higher detection efficiency.
6.2 Detection Efficiency
We characterize the efficiency of AdaTest for logic testing-based HT detection using two metrics: the test set size (space efficiency), and the test set generation time (runtime efficiency). The quantitative efficiency measurements of three HT detection methods are shown in the third and fourth columns of Table 2. AdaTest reduces the test set size by an average of 2.04× and 155.04× compared to MERO and TRIAGE across all benchmarks, respectively. The reduction of test set size has two benefits: (i) A smaller test set features a lower memory footprint; (ii) For on-chip test pattern generation, a smaller test set suggests a shorter test generation time.
Figure 9 compares the required test generation time of AdaTest, MERO, and TRIAGE to achieve the coverage results on various benchmarks in Table 2. Note that we use log-scale for the vertical axis since the range of runtime is diverse across different circuits. We can observe that AdaTest is the most efficient HT detection method among the three and it also achieves high Trojan coverage (last column of Table 2). More specifically, AdaTest achieves an average test generation speedup of 366.26× and 0.63× compared to MERO [chakraborty2009mero] and TRIAGE [nourian2018hardware], respectively. Note that although the runtime of TRIAGE is smaller, its Trojan detection rate is lower than that of AdaTest.
6.3 AdaTest Architecture Evaluation
The resource utilization of AdaTest depends on the input length and the circuit size. We report the resource utilization results of the evaluated benchmarks in Table 3. Figure 10 shows that the AdaTest architecture achieves approximately linear speedup w.r.t. the number of CEs. Our hardware design can be scaled up by adding more reward computing engines to parallelize the circuit emulation process, as AdaTest’s computation bottleneck is reward evaluation of the test patterns. Nevertheless, the speedup saturates when the number of CEs is sufficiently high. AdaTest broadcasts the wire values of the circuit response (given a test input) to all CEs via a shared data bus. Each CE scans the DAG buffer and obtains the broadcast wire values to compute the corresponding reward. Therefore, increasing the number of CEs does not lead to extra wire delay. However, more CEs imply a higher overhead during reward accumulation.
| 14.9 (0.5) | 25.5 (0.6) | 61.1 (3.5) | 267.9 (26.1) |
| 4,440 (80) | 5,743 (160) | 6,717 (317) | 12,943 (1,190) |
In this paper, we present a holistic solution to Hardware Trojan detection using adaptive, reinforcement learning-based test pattern generation. To formulate logic testing-based HT detection as an RL problem, we design an innovative reward function to characterize the quality of a test pattern from both static and dynamic aspects. AdaTest progressively expands the test set by iteratively identifying test input vectors with high reward values. AdaTest integrates adaptive sampling to identify and encourage high-reward test patterns, thus accelerating our RL-based input space exploration. We devise AdaTest using a Software/Hardware co-design approach. In particular, we develop a domain-specific system-on-chip architecture for efficient hardware implementation of AdaTest. Our architecture optimizes reward evaluation via circuit emulation and pipelines the computation of AdaTest. We perform extensive evaluations of AdaTest on various benchmarks and compare its performance with two counterparts, MERO and TRIAGE. Empirical results corroborate that AdaTest achieves superior effectiveness, efficiency, and scalability for HT detection compared to prior works. Since AdaTest is a generic test pattern generation framework, we plan to investigate its performance on other hardware security problems such as logic verification and built-in self-test in our future work.