The ISA specifies semantic and syntactic characteristics of a practically functionally complete set of machine instructions. The ISA specifies semantic and syntactic characteristics of a practically functionally complete set of machine instructions. Modern ISAs are not necessarily mathematically functionally complete, but provide sufficient expressiveness for practical algorithms. For software layers, the ISA defines the underlying machine – as capable as the variety of algorithmic tasks the composition of its building blocks, instructions, can express. For hardware layers, the ISA rather acts as a behavioral design specification for the machine organization. Accordingly, the ISA governs both the functional completeness and complexity of a machine design.
This paper makes the case for an alternative, single-ISA heterogeneous computing platform, AISC, which can reduce the ISA complexity, and thereby improve energy efficiency, on a per compute engine (be it a core or an accelerator) basis, without compromising the functional completeness of the overall platform. The distintinctive feature of AISC is that each compute engine supports a different subset of the very same instruction set. Such per compute engine ISA subsets may be disjoint or overlapping. An ISA subset may not be functionally complete, but the union of the (per compute engine) subsets renders platform-wide a functionally complete single ISA. Therefore, software layers perceive AISC as a single-ISA machine. Tailoring the microarchitecture of each compute engine to the subset of the ISA that it supports results in less complex, more energy efficient compute engines, without compromising the overall functional completeness of the machine.
When it comes to the design of a feasible AISC platform, many questions arise, the most critical being:
Which subset of the ISA should each compute engine support, by construction?
How to guarantee that each sequence of instructions scheduled to execute on a given compute engine only spans the respective subset of the ISA (with potential accuracy loss)? More specifically, how to map instruction sequences to the compute engines?
How to orchestrate migration of code sequences from one compute engine to another within the course of computation, as different application phases may exhibit different degrees of tolerance to noise?
How to keep the potential accuracy loss bounded?
Starting with the first and most basic question, approximation along two dimensions can set the ISA subset per compute engine:
Breadth approximation simplifies instructions by reducing complexity (e.g., precision) on a per instruction basis. To be more specific, the subset of the ISA a compute engine implements in this case would selectively contain lower complexity (e.g., lower precision) instructions, by construction. Well-studied precision reduction approaches [21, 2, 15, 19, 14, 12, 8, 17] are directly applicable in this context. Reducing the operand width often enables simplification in the corresponding arithmetic operation, in addition to a more efficient utilization of the available communication bandwidth for data (i.e., operand) transfer.
Depth approximation excludes complex and less frequently used instructions from the subset.
The combination of the two dimensions, Breadth + Depth , is also possible: In this case, the compute engine concerned would be able to approximately emulate complex and less frequently used instructions (that its ISA subset does not contain) by a sequence of simpler instructions.
AISC trades computation accuracy for the complexity (and thereby, energy efficiency) on a per compute engine basis. At the same time, as the entire platform still supports the full-fledged ISA, instruction sequences not prone to approximation can still run at full accuracy. AISC can also be regarded as an aggressive variant of architectural core salvaging  or ultra-reduced instruction set coprocessors , where actual hardware faults impair a compute engine’s capability to implement a subset of its ISA (and all compute engines support the same ISA by design). These lines of studies detail how to achieve full-fledged functional completeness under the hardware-fault-induced loss of support for a subset of instructions. AISC, on the other hand, features compute engines with approximate, i.e., incomplete or restricted, ISAs by construction. Without loss of generality, such incomplete or restricted ISAs within an AISC platform may be due to errors or simply enforced by design. The latter applies for the following discussion.
In the following, Section 2 details a proof-of-concept AISC implementation and practical limitations; Sections 3 and 4 quantify the complexity vs. energy efficiency trade-off as induced by AISC; Section 5 compares and contrasts AISC with related work; and Section 6 concludes the paper by summarizing our findings.
2 Proof-of-Concept Implementation
Let us start with a motivating example. Fig. 1 shows how the (graphic) output of a typical noise-tolerant application, SRR (see Table LABEL:tab:apps), changes for representative Depth, Breadth, and Breadth + Depth techniques. Section 3 provides experimental details. Fig. (a)a captures the output for the baseline for comparison, Native execution, which excludes any approximation. The accuracy loss remains barely visible but still varies across different techniques. Let us next take a closer look at the sources of this diversity.
2.1 Depth Techniques
Depth encompasses all techniques that orchestrate dropping of complex and less frequently used instructions. The key question is how to pick the instructions to drop. A more general version of this question, which instructions to approximate under AISC, already applies to AISC techniques across all dimensions, but the question becomes more critical for Depth. As the most aggressive in our bag of tricks, Depth can incur significant loss in accuracy. The targeted recognition-mining-synthesis (RMS) applications can tolerate errors in data-centric phases as opposed to control . Therefore, confining instruction dropping to data-flow can help limit the incurred accuracy loss.
2.2 Breadth Techniques
Without loss of generality, we experimented with three approximations to reduce operand widths: DPtoSP, DP(SP)toHP, and DP(SP)toINT.
Under the IEEE 754 standard, 32 (64) bits express a single (double) precision floating point number: one bit specifies the sign; 8 (11) bits, the exponent; and 23 (52) bits the mantissa, i.e., the fraction. For example, represents a single-precision floating number. DPtoSP is a bit discarding variant, which omits 32 least-significant bits of the mantissa of each double-precision operand of an instruction, and keeps the exponent intact. DP(SP)toHP comes in two flavors. DPtoHP omits 48 least-significant bits of the mantissa of each double-precision operand of an instruction, and keeps the exponent intact; SPtoHP, 16 least-significant bits of the mantissa of each single-precision operand of an instruction. Fig. (c)c captures an example execution outcome under DPtoHP. DP(SP)toINT also comes in two flavors. DPtoINT (SPtoINT) replaces double (single) precision instructions with their integer counterparts, by rounding the floating point operand values to the closest integer.
2.3 Breadth + Depth Techniques
The proof-of-concept AISC design features two representative Breadth + Depth techniques: MULtoADD and DIVtoMUL. MULtoADD converts multiplication instructions to a sequence of additions. MULtoADD picks the smaller of the factors as the multiplier, which determines the number of additions. MULtoADD rounds floating point multipliers to the closest integer number. DIVtoMUL converts division instructions to multiplications. DIVtoMUL first calculates the reciprocal of the divisor, which next gets multiplied by the dividend to render the end result. In our proof-of-concept implementation based on the x86 ISA, the reciprocal instruction has 12-bit precision. DIVtoMUL12 uses this instruction. DIVtoMUL.NR, on the other hand, relies on one iteration of the Newton-Raphson method 
to increase the precision of the reciprocal to 23 bits. DIVtoMUL12 can be regarded as an approximate version of DIVtoMUL.NR, eliminating the Newton-Raphson iteration, and hence enforcing a less accurate estimate of the reciprocal (of only 12 bit precision). Fig.(d)d captures an example execution outcome under DIVtoMUL.NR. Algorithm 1 provides an example sequence of instructions to emulate division according to DIVtoMUL.NR, where one reciprocal (RCPSS), 3 multiplication (MULSS), one addition (ADDSS), and one subtraction (SUBSS) instruction substitute a division. DIVtoMUL12 omits the iteration of the Newton-Raphson method (lines 3-6) from Algorithm 1. As opposed to DIVtoMUL.NR, DIVtoMUL12 keeps the 12-bit accuracy of the RCPSS instruction. Hence, it becomes mathematically equivalent to omitting 11 bits of the mantissa.
3 Evaluation Setup
caption = Benchmarks deployed.,
mincapwidth = label = tab:apps,
maxwidth = doinside = ,
pos = !htbp,
footerwidth = .95star
Application & Domain & Input & Output
K-means (KM) & Clustering & Edge features Corel Corporation DB (100 entries) & Cluster assignments
Latent Dirichlet Allocation (LDA) & Topic Modeling & 500 documents, 6097 terms; variat. inference:
20 itr./10-6 error; variat. EM: 100 itr./10-5 error & Estimation Model
Principal Component Analysis (PCA) & Feature Selection & Handwritten digit (1593 instances, 256 ATR) & Column- & Row-Projections
Restricted Boltzmann Machine (RBM) & Deep Learning & Netflix DB (100 train users x 100 movies
x 20 loops; 100 test users x 100 movies) & Suggestions for
users/movies (100 x 100 matrix)
Super-Resolution Reconstruction (SRR) & Computer Vision & “EIA" Dataset (16 frames, 64 x 64) & Reconstructed Image (256 x 256)
Single Value Decomposition (SVD3) & Feature Selection & KOS Press (500x500 matrix) & Decomposition matrices
We analyze a representative mix of RMS benchmarks from Cortex Suite , as tabulated in Table LABEL:tab:apps, compiled with GCC 4.8.4 with -O1 on an Intel® Core™ i5 3210M machine. As we perform manual transformations on the code, high optimization levels hinder the task; we resort to -O1 for our proof-of-concept and leave for future work more exploration on compiler optimizations. For each application, we focus on the main kernels (i.e., region of interest) where the actual computation takes place. Throughout the evaluation, we map the region of interest of an application in its entirety to an incomplete-ISA (AISC) compute engine, irrespective of potential changes in noise tolerance within the course of its execution. We use ACCURAX metrics  to quantify the accuracy-loss. We prototype AISC techniques from Section 2 on Pin 2.14 , which is based on the x86 ISA.
Our energy model follows , which is based on bare-metal measurements for x86 processors.
The model distinguishes between fixed and variable components of energy consumption.
The variable component captures changes in activity.
Beyond activity, the ratio of fixed to variable components depends on technology and microarchitecture.
For example, on Intel Penryl, 56% of the energy consumption is due to the fixed component, while on Haswell this percentage increases to 75% .
AISC techniques can affect both, the fixed and the variable component.
Savings in the variable component come from the operand width reduction under Breadth, and from the instruction dropping under Depth.
The fixed component can also decrease if the microarchitecture exploits AISC for a less complex pipeline front-end (fetch + decode) design.
In the evaluation, we will only report anticipated energy savings in the variable component.
The overall impact of the variable component on energy savings depends on the ratio of the fixed and variable components.
We represent energy in terms of number of instructions EPI (energy per instruction).
We conservatively assume that the EPI of an integer instruction equals the EPI of a 32-bit floating point arithmetic instruction, and scale the EPI values for 64-bit and 16-bit operations according to
EPI (energy per instruction). We conservatively assume that the EPI of an integer instruction equals the EPI of a 32-bit floating point arithmetic instruction, and scale the EPI values for 64-bit and 16-bit operations according to.
Fig. 2 shows the impact on instruction mix and count, as characterized by a group of bars for each benchmark. The first bar corresponds to the Native execution outcome, excluding any AISC technique. The rest of the bars capture execution under Breadth, Breadth + Depth, and Depth. The height of each bar captures the relative change in the instruction count with respect to the native outcome. The stacks in each bar depict the instruction mix, considering three categories: Critical indicates control-flow instructions such as accesses to the stack; Integer, integer data transfer and arithmetic; FP (Floating Point), floating point data transfer and arithmetic. AISC excludes Critical instructions, as approximating or removing them might alter the control flow and prevent successful program termination. The proof-of-concept AISC implementation does not approximate Integer instructions to avoid corrupting pointer arithmetic, which may in turn prevent successful termination. We further categorize FP instructions by the size of the operands: 64b(it), 32b, and 16b.
To understand the corresponding implications for accuracy, Fig. 3 shows, on a per benchmark basis, the trade-off space of energy versus accuracy loss. Each point corresponds to the outcome under one AISC technique. We report both energy and accuracy loss relative to the native execution outcome, which excludes AISC techniques. Trade-off spaces do not include non-terminating executions and executions that render more than 40% increase in energy consumption. On the normalized energy axis, any point above 1 is unfeasible. Fig. (h)h shows the output randomization results for each benchmark , which serve as a proxy for (close to) worst case accuracy loss.
Next, we examine the proof-of-concept AISC techniques from Section 2 in more detail.
4.1 Breadth Techniques
We observe that Breadth in terms of DPtoSP, DP(SP)toHP, and DP(SP)toINT, can reduce the instruction count significantly for benchmarks PCA and SVD3 (Fig. 2). These benchmarks adapt iterative refinement; under bit discarding due to Breadth, they reach the convergence criteria earlier. In this case, only two experiments fail to terminate, marked by a cross in Fig. 2: DP(SP)toINT in KM and DP(SP)toHP in LDA. Significant changes in instruction count mainly stem from early or late convergence.
Breadth, in terms of DPtoSP and DP(SP)toHP, provides large energy reductions (20% and 37% on average) accompanied by a modest accuracy loss (less than 10%) for most of the applications – except PCA and SRR, where accuracy loss reaches 78.1% and 34.6%, respectively; and LDA where the experiments failed to terminate due to bit discarding. Even the very aggressive DP(SP)toINT works for LDA, ME, and SVD3, where the accuracy loss becomes 18.9%, 13.0%, and 4.6%, respectively. The energy reduction for LDA (98%) and SVD3 (81%) is significant, as the number of iterations to convergence gets drastically reduced: LDA reduces the number of iterations by 97.3%; SVD3, by 97.9%. KM is the only application that does not survive DP(SP)toINT. For the rest of the benchmarks, the accuracy loss becomes comparable to the accuracy of a randomly generated output (which captures close-to-worst-case accuracy loss ), therefore, likely not acceptable (Fig. (h)h).
4.2 Depth Techniques
Depth comes in two flavors. First, we selectively remove all the floating point division instructions of the binary (DropDIV). This would be equivalent to removing the floating point division instruction from the ISA, without providing any substitute (as opposed to Breadth + Depth techniques). Dropping division instructions does not affect the termination of the experiments, although PCA and SVD3 do not reach convergence. As shown in Fig. 2, KM and LDA show a significant reduction in the executed instructions under DropDIV, which translates into an energy reduction of 15% and 90%, respectively, with an accuracy loss of 28% and 13.62%. For ME, RBM, and SRR, the instruction count remains practically the same, while the accuracy loss of the outputs reaches the loss of completely random outputs (as indicated Fig. (h)h).
As a more aggressive Depth approach, we randomly delete static (arithmetic) FP instructions. For each static instruction, we base the dropping decision on a pre-defined threshold t. We generate a random number r in the range , and drop the static instruction if r remains below t. We experiment with threshold values between 1% and 10%. We observe three distinctive behavior:
SVD3 and PCA do not tolerate Depth; experiments either fail to terminate, or render an invalid output/excessive accuracy loss.
ME, RBM, and SRR can tolerate dropping, but the outcome highly depends on the instructions dropped. Fig. 4 illustrates three different SRR outcomes for =3%: dropping 1 static instruction (Fig. (b)b); 3 static instructions (Fig. (c)c); 5 static instructions (Fig. (d)d); the native output is also shown for comparison (Fig. (a)a). SRR has 477 static instructions, out of which around 80 are FP. The dropped static instructions translate into dropping 16 million dynamic instructions in Fig. (b)b; 245 million in Fig. (c)c; and 255 million in Fig. (d)d, respectively. The numeric accuracy metric, SSIM  becomes 17.1% (Fig. (b)b), 30.2% (Fig. (c)c), and 48.2% (Fig. (d)d). To compensate for the missing instructions, SRR executes additional iterations, increasing the dynamic instruction count.
For KM and LDA some experiments fail and some survive with varying accuracy loss.
4.3 Breadth + Depth Techniques
Under Breadth + Depth, we observe that MULtoADD (Section 2) significantly increases the instruction count and/or prevents successful termination (LDA and SVD3) – except for RBM, where most of the multiplications involve very small operands between 0 and 1. In any case, we did not observe any improvement on the energy vs. accuracy trade-off space, and, therefore, we excluded MULtoADD from Figs. 2 and 3.
DIVtoMUL variants also increase instruction count, although this increase is only significant for LDA, PCA, and SVD3, where more iterations are run to compensate for the precision reduction, as we did not alter the convergence criteria. Breadth + Depth – in terms of DIVtoMUL.NR and DIVtoMUL12 – does not show an energy advantage, mostly due to our conservative energy modeling (we assume that all FP instructions of a given precision have the same EPI). KM, ME, and SRR have a very small percentage of division operations, accordingly, DIVtoMUL variants have minimal impact. Eliminating the Newton-Raphson iteration under DIVtoMUL.NR only increases the accuracy loss in RBM. PCA and SVD3 experience a significant energy increase under DIVtoMUL12 (when compared to DIVtoMUL.NR) due to the increasing number of iterations until convergence to meet the convergence criteria.
5 Related Work
Lopes et al. perform a chronological analysis of several x86 applications and operating systems and show that 30% of the instructions were rarely used or become unused over time . ISA extensions have been proposed in the context of approximation [7, 9]. In , the authors describe an ISA extension to provide approximate operations and storage. Kamal et al. extend the ISA with custom instructions for embedded devices, and allow approximation on those custom instructions by not meeting timing requirements, as the results might be still good enough . Instead of extending the ISA, we explore the possibilities of reducing the ISA complexity, by reducing the set of instructions and/or the size of the operands.
Finally, it is worth referring to the work from Schkufza et al. . Their objective is the automatic generation of different floating point kernels, applying different optimization in an stochastic way. In detail, for each computation kernel they apply a series of transformations that include changes in the opcodes, the operands, swapping instructions, and even dropping computations. Contrary to AISC, however, their goal is to obtain aggressive optimization of high performance applications running on conventional hardware.
6 Conclusion & Discussion
The Instruction Set Architecture (ISA) bridges the gap between software and hardware layers of the system stack. ISA grows with the addition of new extensions in Depth (addition of new instructions) and Breadth (wider instructions, in the case of CISC machines) directions. This growth increases the complexity of the fetch and decode hardware (front-end), an already power hungry part of the pipeline, and critical for performance, as it feeds the back-end with instructions. Besides, execution units (back-end) and their control policies also become more complex.
Conventional ISAs are unaware of the intrinsic error tolerance of many emerging algorithms. AISC turns around this assumption to reduce hardware complexity, and, thereby, energy consumption. Our proof-of-concept analysis revealed that, in its restricted form – where the region of interest of an application is mapped in its entirety to an incomplete-ISA compute engine, irrespective of potential changes in noise tolerance within the course of its execution – AISC can cut energy by up to 37% at around 10% accuracy loss.
We find that the energy vs. accuracy trade-off is very sensitive to application convergence characteristics. Therefore, an efficient AISC implementation should carefully examine convergence criteria. For example, under Depth, applications that tolerate instruction dropping either compensate for the missing instructions by executing more to meet the convergence criteria, or exhibit a very large accuracy loss.
The presented proof-of-concept implementation does not track data dependencies beyond instruction boundaries. Operand width reduction under Breadth or Breadth + Depth is confined to the input and output operands of the respective instructions. Transitively adjusting the precision of the variables which depend on or determine such input and output operands may substantially increase energy savings.
The most critical design aspect is how instruction sequences should be mapped to restricted-ISA compute engines, and how such sequences should be migrated from one engine to another within the course of execution, to track potential temporal changes in algorithmic noise tolerance. While fast code migration is not impossible, if not orchestrated carefully, the energy overhead of fine-grain migration can easily become prohibitive. Therefore, a break-even point in terms of migration frequency and granularity exists, past which AISC may degrade energy efficiency.
-  I. Akturk, et al. On quantification of accuracy loss in approximate computing. In 12th Annual Workshop on Duplicating, Deconstructing and Debunking (WDDD), 2015.
-  V. K. Chippa, et al. Scalable effort hardware design. IEEE Trans. on Very Large Scale Integration (VLSI) Systems, 22(9):2004–2016, Sept. 2014.
-  H. Cho, et al. ERSA: Error resilient system architecture for probabilistic applications. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 31(4):546–558, April 2012.
-  J. W. Choi, et al. A roofline model of energy. In IEEE 27th Int. Symp. on Parallel Distributed Processing, pages 661–672, 2013.
-  K. Czechowski, et al. Improving the energy efficiency of big cores. In 41st Annual Int. Symp. on Computer Architecture, pages 493–504, 2014.
-  M. Ercegovac et al. Digital Arithmetic. Morgan Kaufmann Series in Computer Architecture and Design. Morgan Kaufmann, 2004.
-  H. Esmaeilzadeh, et al. Architecture support for disciplined approximate programming. In 17th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 301–312, 2012.
-  A. Jain, et al. CPSA: Compute precisely store approximately. In Workshop on Approximate Computing Across the Stack, 2016.
-  M. Kamal, et al. Improving efficiency of extensible processors by using approximate custom instructions. In Design, Automation & Test in Europe Conf. Exhibition, pages 1–4, 2014.
-  B. C. Lopes, et al. SHRINK: Reducing the isa complexity via instruction recycling. In 42nd Annual Int. Symp. on Computer Architecture, pages 311–322, 2015.
-  C.-K. Luk, et al. Pin: Building customized program analysis tools with dynamic instrumentation. In ACM SIGPLAN Conf. on Programming Language Design and Implementation, pages 190–200, 2005.
-  T. Moreau, et al. Approximating to the last bit. In Workshop on Approximate Computing Across the Stack, 2016.
-  M. D. Powell, et al. Architectural core salvaging in a multi-core processor for hard-error tolerance. In ISCA, 2009.
-  C. Rubio-González, et al. Precimonious: Tuning assistant for floating-point precision. In Int. Conf. on High Performance Computing, Networking, Storage and Analysis, pages 27:1–27:12, 2013.
-  A. Sampson, et al. EnerJ: Approximate data types for safe and general low-power computation. In 32nd ACM SIGPLAN Conf. on Programming Language Design and Implementation, pages 164–174, 2011.
-  E. Schkufza, et al. Stochastic optimization of floating-point programs with tunable precision. In 35th ACM SIGPLAN Conf. on Programming Language Design and Implementation, pages 53–64, 2014.
-  M. Stephenson, et al. Bidwidth analysis with application to silicon compilation. In ACM SIGPLAN Conf. on Programming Language Design and Implementation, pages 108–120, 2000.
-  S. Thomas, et al. CortexSuite: A synthetic brain benchmark suite. In IEEE Int. Symp. on Workload Characterization, pages 76–79, 2014.
-  Y. F. Tong, et al. Minimizing floating-point power dissipation via bit-width reduction. In Power-Driven Microarchitecture Workshop, 1998.
-  D. Wang, et al. Reliable computing with ultra-reduced instruction set coprocessors. IEEE Micro, 34(6), 2014.
-  T. Y. Yeh, et al. Fool me twice: Exploring and exploiting error tolerance in physics-based animation. ACM Trans. on Graphics, 29(1):5:1–5:11, Dec. 2009.