I Introduction and Motivation
For decades, the ever-increasing computational requirements of high-performance computing (HPC) have been met by scaling contemporary technologies, which are now reaching the atomic level. However, the power density of silicon nano-electronics limits their applicability to future exascale computing [5, 2], motivating research into alternative technologies. Evolved from rapid single flux quantum (RSFQ) technology, superconductive circuits that promise ultra-low switching energy of J and clock frequencies exceeding 25 GHz have become a promising beyond-CMOS technology.
Various 8-bit SFQ microprocessors have been developed in the last two decades, including a bit-serial microprocessor with eight 1-bit serial ALU blocks (FLUX-1), a bit-serial CORE1 processor, and a bit-serial SCRAM2 asynchronous microprocessor. More specifically, the arithmetic logic unit (ALU), a critical part of a microprocessor, has attracted significant research attention in RSFQ. Recently, Tang et al. proposed a 16-bit bit-sliced ALU because the earlier serial and 2-/4-/8-bit bit-sliced ALUs compute too slowly for 32-/64-bit processors. As the ALU bit-width increases, its gate-level pipelined nature forces an increase in latency, and efficiently utilizing this deeply pipelined architecture becomes more difficult.
To improve pipeline utilization, we propose a block-skewed ALU architecture, called qBSA, inspired by the use of skewed datapaths in asynchronous CMOS design. Our proposed architecture uses eight 4-bit ALU blocks skewed in time, reduces the delay of the data feedback loop, and enables each block to start computing a dependent operation as soon as its own output is ready. The choice of 4-bit blocks keeps the latency of the 32-bit adder relatively low while requiring fewer Josephson junctions (JJs) than higher bit-width blocks would need. We simulated the design using the MIT LL 100 μA/μm² SFQ5ee RSFQ cell library to demonstrate its functional correctness. We also estimated its impact on the instructions per cycle (IPC) of a RISC-V processor.
II Proposed 32-bit Block-Skewed Architecture
In this section we describe the logic design of our 32-bit qBSA. We divided the design into eight 4-bit blocks, as shown in Fig. 1(a). Because of its low latency and its simple carry-lookahead circuit with only one feed-forward signal (), we adopted the Sklansky prefix-tree adder to design each 4-bit block, as illustrated in Fig. 1(b). Notice that the carry () is needed to compute the carry out (, ) and sum () only after five pipeline stages. We leverage this fact and start computing the sum and carry of more significant blocks before the of the less significant blocks are evaluated. Note that we use to quickly feed the input carry of the next 4-bit ALU block and delay it by one stage to provide the final . The feedback path from the output of each block back to its input (through a multiplexer) enables less significant blocks to start accepting and computing their next data-dependent inputs as soon as the previous corresponding output is ready, thereby avoiding a wait for the entire 32-bit result. This staggers the computation start times of the different blocks, making the datapath skewed, and better utilizes the gate-level pipelined nature of SFQ. In particular, this reduces the initiation interval (II) for back-to-back data-dependent operations, defined as the number of clock cycles separating the starts of two consecutive data-dependent operations.
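The staggered start times described above can be illustrated with a minimal timing sketch. It takes the eight blocks from this section and the 8-stage block depth and 15-stage full-ALU depth reported in the gate-level simulation, and it assumes (this is not stated explicitly in the text) that consecutive blocks are skewed by exactly one clock cycle, which is the value that makes eight 8-stage blocks finish in 15 cycles. The names `block_schedule`, `SKEW`, and the constants are hypothetical.

```python
from typing import List, Tuple

NUM_BLOCKS = 8    # eight 4-bit blocks (from the text)
BLOCK_DEPTH = 8   # pipeline depth of one 4-bit block, in clock stages (from the text)
SKEW = 1          # ASSUMPTION: each block starts one cycle after its less significant neighbor


def block_schedule(op_index: int, ii: int = BLOCK_DEPTH) -> List[Tuple[int, int]]:
    """Return (start_cycle, done_cycle) for every block, for the op_index-th
    operation in a chain of back-to-back data-dependent operations."""
    schedule = []
    for k in range(NUM_BLOCKS):
        start = k * SKEW + op_index * ii   # block k lags block 0 by k cycles
        schedule.append((start, start + BLOCK_DEPTH))
    return schedule


first = block_schedule(0)
second = block_schedule(1)
print("full 32-bit result of op 0 ready at cycle", first[-1][1])
print("block 0 starts dependent op 1 at cycle", second[0][0])
```

Under these assumptions, every block begins the dependent operation in the same cycle in which its own previous output becomes available, rather than waiting for the most significant block to finish.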
|Parameter||Data Dependency||Pipeline stages|
We used Verilog models of the MIT LL 100 μA/μm² SFQ5ee cell library to design and simulate qBSA in the Xilinx Vivado 2017.4 tool. Note that in our simulated waveforms, a signal transition (high to low or vice versa) represents the presence of an SFQ pulse, and no transition represents its absence.
III-A Gate-Level Simulation
Fig. 2 shows a typical waveform generated through gate-level simulation of the proposed 32-bit ALU. Notice that after the first output is available, the skewed datapath of the qBSA makes back-to-back data-dependent outputs available after the pipeline depth of a 4-bit ALU block (8 clock stages) instead of the pipeline depth of the entire 32-bit ALU (15 clock stages). Thus the initiation interval of our proposed qBSA is 1.5x and 2x shorter than those of the recently proposed 32-bit Ladner-Fischer ALU (32LFA) and 4-bit bit-sliced ALU (4BSA), respectively.¹
¹ For both the 32LFA and 4BSA, we added a 1-clock delay for the MUX stage to their actual stage delays to perform 32-bit data-dependent operations, obtaining IIs of 12 and 16, respectively.
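The 1.5x and 2x figures follow directly from the initiation intervals quoted above (8 for qBSA, 12 for 32LFA, 16 for 4BSA). The short check below makes the arithmetic explicit; the dictionary and function names are illustrative only.

```python
# Initiation intervals (in clock cycles) for data-dependent 32-bit operations,
# as quoted in the text and its footnote.
INITIATION_INTERVAL = {"qBSA": 8, "32LFA": 12, "4BSA": 16}


def ii_ratio(baseline: str, proposed: str = "qBSA") -> float:
    """How many times longer the baseline's II is than the proposed design's."""
    return INITIATION_INTERVAL[baseline] / INITIATION_INTERVAL[proposed]


for name in ("32LFA", "4BSA"):
    print(f"{name}: II is {ii_ratio(name)}x that of qBSA")
```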
III-B Performance Evaluation: Instructions Per Cycle
To quantify the benefit of our proposed design, we estimated the impact on IPC for a set of benchmarks on a generic qBSA-based RISC-V processor with in-order commitment (qBSP). We compared the obtained IPC to those of 32LFA-based (32LFP) and 4BSA-based (4BSP) processors. In particular, the IPC of a benchmark with total number of instructions and total NOPs needed to resolve dependencies is as follows:
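As a minimal sketch of this definition, the helper below assumes the natural reading that IPC is the number of useful instructions divided by the total number of issue slots (instructions plus inserted NOPs). The function and parameter names are hypothetical and do not come from the paper.

```python
def ipc(n_instructions: int, n_nops: int) -> float:
    """Estimated IPC under the assumed reading: useful instructions divided by
    all issue slots, where every NOP occupies one slot."""
    if n_instructions <= 0:
        raise ValueError("a benchmark needs at least one instruction")
    if n_nops < 0:
        raise ValueError("the NOP count cannot be negative")
    return n_instructions / (n_instructions + n_nops)


# Illustrative numbers only: 100 instructions that need 25 NOPs.
print(ipc(100, 25))
```

With no NOPs the estimate is 1.0, the ceiling for a single-issue in-order pipeline under this model.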
We estimate the IPC using a script that reads benchmark files generated through Spike, a RISC-V sodor core instruction set architecture (ISA) simulator, analyzes the dependencies, and estimates the number of NOPs required. We assume all processor components are block-skewed and consume inputs and generate outputs in block-skewed fashion. In particular, Equations 2 and 3 recursively define the number of NOPs required before each instruction and its final position considering the added NOPs.
Here, the functions S(i,m) and I(i,m) provide the instruction type and the original index, respectively, of the instruction that creates the m-th source operand of instruction i. is the latency of the instruction which creates the source register of instruction .
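The recursive NOP estimate described here can be sketched as follows. This is not the authors' script: it assumes a single-issue, in-order pipeline in which an instruction may issue no earlier than the final position of each producer plus that producer's latency, with the ALU latency taken to be the initiation interval of the ALU in use. The tuple layout and the names `insert_nops`, `program`, and `latency` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# (instruction type, original indices of the instructions producing its sources)
Instr = Tuple[str, Sequence[int]]


def insert_nops(program: List[Instr],
                latency: Dict[str, int]) -> Tuple[List[int], List[int]]:
    """Return (nops_before, position): the NOPs inserted ahead of each
    instruction and its final issue slot once those NOPs are counted."""
    nops_before: List[int] = []
    position: List[int] = []
    for i, (_, producers) in enumerate(program):
        # Earliest slot with no dependency: right after the previous instruction.
        next_slot = position[i - 1] + 1 if i > 0 else 0
        ready = next_slot
        for p in producers:
            producer_type = program[p][0]
            # Wait until the producer's result is available.
            ready = max(ready, position[p] + latency[producer_type])
        nops_before.append(ready - next_slot)
        position.append(ready)
    return nops_before, position


# Illustrative program: instruction 1 depends on 0; instruction 3 on 1 and 2.
program = [("alu", []), ("alu", [0]), ("load", []), ("alu", [1, 2])]
nops, pos = insert_nops(program, {"alu": 8, "load": 1})
print(nops, pos, sum(nops))
```

Re-running the same program with an ALU latency of 12 or 16 gives the corresponding NOP counts for the 32LFA and 4BSA baselines under the same assumptions.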
Our experiments explore a range of latencies for non-ALU data-dependent operations, but within each individual experiment we assume, for simplicity, that all non-ALU operations have the same integer latency. As two examples, Fig. 3 shows the IPC improvement of qBSP over 32LFP and 4BSP for assumed non-ALU latencies of 1 and 10.
The gate-level pipelined nature of RSFQ makes keeping the pipelines full a difficult micro-architectural challenge, especially in the presence of data-dependent operations. This paper proposes a block-skewed ALU to reduce the average pipeline initiation interval and estimates its impact on an ideal RSFQ processor. Averaged across multiple benchmarks with a simple dependency model, block skewing improves IPC by between 1.2x and 1.37x compared to a processor based on a 32-bit Ladner-Fischer ALU and by between 2.93x and 4x compared to a processor based on a 4-bit bit-sliced ALU. Our future work includes evaluating the benefits of block skewing on other processor components, the impact of different block sizes, and refinements of our model of instruction dependencies.
-  (2015) 80-GHz operation of an 8-bit RSFQ arithmetic logic unit. In 2015 15th International Superconductive Electronics Conference (ISEC), pp. 1–3. Cited by: §I.
-  (2011) The future of microprocessors. Communications of the ACM 54 (5), pp. 67–77. Cited by: §I.
-  (2013) 8-bit asynchronous sparse-tree superconductor RSFQ arithmetic-logic unit with a rich set of operations. IEEE Transactions on Applied Superconductivity 23 (3), pp. 1700104–1700104. Cited by: §I.
-  (2001) FLUX chip: design of a 20-GHz 16-bit ultrapipelined RSFQ processor prototype based on 1.75-μm LTS technology. IEEE Transactions on Applied Superconductivity 11 (1), pp. 326–332. Cited by: §I.
-  (2011) Dark silicon and the end of multicore scaling. In Computer Architecture (ISCA), 2011 38th Annual International Symposium on, pp. 365–376. Cited by: §I.
-  (2011) 8-bit asynchronous wave-pipelined RSFQ arithmetic-logic unit. IEEE Transactions on Applied Superconductivity 21 (3), pp. 847–851. Cited by: §I.
-  (2012) 20 GHz operation of an asynchronous wave-pipelined RSFQ arithmetic-logic unit. Physics Procedia 36, pp. 59–65. Cited by: §I.
-  (2008) Bit-serial single flux quantum microprocessor core. IEICE Transactions on Electronics 91 (3), pp. 342–349. Cited by: §I.
-  (1991) RSFQ logic/memory family: a new Josephson-junction technology for sub-terahertz-clock-frequency digital systems. IEEE Transactions on Applied Superconductivity 1 (1), pp. 3–28. Cited by: §I.
-  (2001) Width-adaptive data word architectures. In Proceedings 2001 Conference on Advanced Research in VLSI. ARVLSI 2001, pp. 112–129. Cited by: §I.
-  (2007) Design and implementation of a fully asynchronous SFQ microprocessor: SCRAM2. IEEE Transactions on Applied Superconductivity 17 (2), pp. 478–481. Cited by: §I.
-  Windows Phone Central (Ed.)(Website) External Links: Cited by: §III-B.
-  (1960) Conditional-sum addition logic. IRE Transactions on Electronic Computers (2), pp. 226–231. Cited by: §II.
-  (2018) Logic design of a 16-bit bit-slice arithmetic logic unit for 32-/64-bit RSFQ microprocessors. IEEE Transactions on Applied Superconductivity 28 (4), pp. 1–5. Cited by: §I, §III-A.
-  (2015) 4-bit bit-slice arithmetic logic unit for 32-bit RSFQ microprocessors. IEEE Transactions on Applied Superconductivity 26 (1), pp. 1–6. Cited by: §I, §III-A.
-  (2016) 4-bit bit-slice arithmetic logic unit for 32-bit RSFQ microprocessors. IEEE Transactions on Applied Superconductivity 26 (1), pp. 1–6. Cited by: §I.
-  (2013) Experimental investigation of energy-efficient digital circuits based on eSFQ logic. IEEE Transactions on Applied Superconductivity 23 (3), pp. 1301505. Cited by: §I.