The semiconductor industry has achieved unprecedented growth over the last six decades, owing to its incessant drive to fulfill the prophecy of Moore’s Law scaling. Moore’s Law continued to provide value to the semiconductor industry as cost per transistor fell with shrinking feature sizes. However, as transistor scaling approaches physical limits and the cost of lithography and patterning rises, the industry is transitioning to design-technology and system-technology co-optimization (STCO) paradigms, where added value is achieved through heterogeneous integration of different technologies targeted towards specific end-applications. 2.5D and 3D stacking techniques are key enablers of this new paradigm.
3D integration is a broad term encompassing technologies that enable vertical integration of more than one layer of active transistors and interconnects, with the goal of increasing compute density. Integrated circuit (IC) designs with natural redundancy and regularity in 2D can be extended or stacked in the 3rd dimension with relative ease. CMOS image sensors, DRAM memories, and NAND Flash memories are all examples of this type of IC; these products have already adopted 3D integration and succeeded in high-volume markets.
However, adoption of 3D stacking for logic applications has been limited to advanced packaging techniques. Here, functionally complete chips, commonly referred to as chiplets, are stacked using package bumping technologies. The stacking configuration could be 2.5D, wherein chiplets are assembled in 2D but interconnected through an underlying substrate (e.g., a silicon interposer) or a redistribution layer (RDL), e.g., fan-out RDL. Alternatively, the stacking configuration could be 3D, e.g., package-on-package (PoP), wherein packaged DRAM dies are stacked on an ASIC die, or two or more compute dies stacked using through-silicon-via (TSV) and micro-bump technology. A discussion of advanced 3D packaging using bumping technologies is out of the scope of this paper.
The trajectory of current adoption of 3D stacking technologies points towards finer-pitch 3D connectivity in the form of die-stacking or sequential 3D integration, which we refer to as high-density 3D integration. High-density 3D integration techniques open the possibility of designing systems where functional units are partitioned and co-designed across separate 3D stacked tiers. The advantages of such 3D integration are multi-fold:
- Systems can utilize the 3rd dimension to bring functional blocks closer, reducing interconnect delay and power.
- Large-die SoCs can be partitioned into smaller dies, improving yield and hence reducing cost.
- Dies from different process nodes or technologies (e.g., non-volatile memory) can be integrated together, enabling heterogeneous integration and more flexible product migration to advanced nodes, further reducing cost.
However, fully realizing these advantages will require re-designing system architectures to exploit compute and memory densities that differ from what decades of 2D integration have made familiar; this is both a primary challenge and an opportunity. The rest of the paper provides an overview of high-density 3D stacking technologies, state-of-the-art physical design studies, and the associated challenges. The paper concludes with a motivation for 3D-aware architecture exploration that breaks the traditional silos of the semiconductor design ecosystem.
II High-density 3D technologies
Current adoption of 3D stacking is mainly in the packaging domain, and 3D connection density is limited by bump pitches of approximately 40 µm. However, wafer-level and die-level stacking technologies such as hybrid bonding allow precision alignment of wafers, resulting in 3D connection pitches of 10 µm or less. At these 3D connection pitches, SoC functional unit partitioning becomes feasible. The 3D integration roadmap is shown in Fig. 1 as a plot of connection pitch versus connection density, which highlights the orders-of-magnitude higher 3D connection counts that become feasible as we transition from package-level bumping technologies to hybrid wafer bonding techniques.
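The relationship between connection pitch and connection density can be illustrated with a short calculation. The sketch below is not the model behind Fig. 1; it assumes an idealized square grid of connections with pitches given in micrometres, and the 1 µm entry is a hypothetical value added only to show the trend.

```python
def connections_per_mm2(pitch_um: float) -> float:
    """Connections per mm^2 for an idealized square grid with the given pitch (um)."""
    connections_per_mm = 1000.0 / pitch_um  # 1 mm = 1000 um
    return connections_per_mm ** 2


# Pitches in micrometres; the last entry is a hypothetical finer pitch.
example_pitches_um = {
    "package-level bumps": 40.0,
    "hybrid bonding": 10.0,
    "hypothetical finer pitch": 1.0,
}

for label, pitch_um in example_pitches_um.items():
    density = connections_per_mm2(pitch_um)
    print(f"{label:>26s}: pitch {pitch_um:5.1f} um -> {density:,.0f} connections/mm^2")
```

Under this idealization, density scales with the inverse square of the pitch, so moving from a 40 µm to a 10 µm pitch raises the density from 625 to 10,000 connections per mm², a 16x increase.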
Another flavor of high-density 3D is monolithic or sequential 3D integration, where two or more active device layers and interconnects are sequentially processed using standard lithography tools. The 3D connection pitch is limited by the alignment of lithography stepper tools, enabling sub-100nm pitches, i.e., metal via pitches at advanced process nodes. However, this technology faces challenges with the incompatibility of BEOL and FEOL processing temperatures for silicon-based transistors. Alternative approaches using materials and devices that do not require high-temperature processing, such as carbon nanotube field-effect transistors (CNFET) and resistive non-volatile memories (RRAM), have been proposed for monolithic 3D integration. These technologies have seen slow but steady progress in experimental demonstrations.
III 3D Cost-saving
Any semiconductor technology promising to augment Moore’s Law scaling will be required to pass the litmus test of cost scaling. 3D die-stacking technologies achieve this in a manner similar to the 2.5D chiplet approach, i.e., by implementing the functionality of a large monolithic die in smaller dies interconnected in 3D. Compared to the 2.5D chiplet approach, 3D die-stacking can achieve significantly higher connectivity and lower latencies, improving performance and power and thereby adding value.
The value of 3D versus 2D depends on die size, which in turn depends on the technology node. The trade-off here concerns the time to market and risk reduction of going to market with an early N-node 3D solution versus an N+1 node 2D solution. Fig. 2 models a scenario in which die cost is plotted against total die area for an early process ramp of a representative 5nm technology and compared to a relatively mature 7nm process technology. Due to higher costs and worse defect densities in early ramp, the 5nm die cost for the same area is higher than at 7nm. The different arrows show the cost trade-offs of scaling to a 5nm process versus implementing a 3D solution in 7nm or implementing a heterogeneous 3D system comprising a mix of 5nm and 7nm dies, targeting an example area of 500 mm², representative of a multi-core high-performance system. A conventional technology shrink gives a 13% cost reduction, while a heterogeneous 3D solution of stacked 5nm and 7nm dies doubles the cost benefit to a 26% lower die cost, and a 3D solution at 7nm gives a 32% lower cost. Breaking an SoC into logic and memory layers, where the memory layers are repairable, is another of the many possible embodiments of 3D that could have varying degrees of benefit per product.
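The yield argument behind splitting a large die can be sketched with a textbook cost model. The code below is a minimal illustration, not the model used for Fig. 2: it assumes a Poisson yield model and a common dies-per-wafer approximation, ignores bonding and test costs, and uses hypothetical values for wafer cost, defect density, and wafer diameter. It therefore does not reproduce the percentages quoted above.

```python
import math


def gross_dies_per_wafer(die_area_mm2: float, wafer_diameter_mm: float = 300.0) -> float:
    """Common approximation: wafer area / die area, minus an edge-loss term."""
    radius = wafer_diameter_mm / 2.0
    return (math.pi * radius ** 2 / die_area_mm2
            - math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_area_mm2))


def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """Poisson yield model Y = exp(-A * D0), with the area converted to cm^2."""
    return math.exp(-(die_area_mm2 / 100.0) * defects_per_cm2)


def cost_per_good_die(die_area_mm2: float, wafer_cost: float, defects_per_cm2: float) -> float:
    """Wafer cost divided by the expected number of good dies per wafer."""
    good_dies = (gross_dies_per_wafer(die_area_mm2)
                 * poisson_yield(die_area_mm2, defects_per_cm2))
    return wafer_cost / good_dies


# Hypothetical inputs chosen only for illustration.
WAFER_COST = 10_000.0    # arbitrary currency units per wafer
DEFECT_DENSITY = 0.1     # defects per cm^2
TOTAL_AREA_MM2 = 500.0   # total silicon area of the system

monolithic_2d = cost_per_good_die(TOTAL_AREA_MM2, WAFER_COST, DEFECT_DENSITY)
# Two-tier stack: two dies of half the area each, same process node.
stacked_3d = 2 * cost_per_good_die(TOTAL_AREA_MM2 / 2, WAFER_COST, DEFECT_DENSITY)

print(f"2D monolithic die cost : {monolithic_2d:8.2f}")
print(f"3D two-die silicon cost: {stacked_3d:8.2f}")
print(f"relative saving        : {100 * (1 - stacked_3d / monolithic_2d):.1f}%")
```

The saving in this sketch comes from two effects: smaller dies yield better under a fixed defect density, and they waste less area at the wafer edge. Adding bonding cost, bonding yield, and test cost would reduce the benefit.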
For 3D cost savings to be realized as modeled, testing for known-good-die (KGD) is required prior to 3D assembly; the savings are therefore only applicable to die-to-wafer stacking scenarios. Since high-density 3D can have connections at sub-10 µm pitches, direct probing of every 3D connection prior to assembly is non-trivial and can be expensive. Novel design-for-test (DFT) techniques for 3D stacking need to be developed that allow testing each die for ’goodness’ prior to assembly. There are active efforts to standardize DFT methodologies for 3D in the form of the IEEE P1838 standard.
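The importance of KGD testing can be seen by comparing the expected cost of a good stack with and without pre-assembly test. The following is a simplified sketch under stated assumptions: tests are assumed to have perfect coverage, every die (good or bad) incurs the test cost, and all numeric values are hypothetical.

```python
from typing import Sequence


def cost_per_good_stack_blind(die_costs: Sequence[float],
                              die_yields: Sequence[float],
                              bond_cost: float,
                              bond_yield: float) -> float:
    """Untested dies are bonded; the stack is good only if every die and the bond are good."""
    stack_yield = bond_yield
    for die_yield in die_yields:
        stack_yield *= die_yield
    return (sum(die_costs) + bond_cost) / stack_yield


def cost_per_good_stack_kgd(die_costs: Sequence[float],
                            die_yields: Sequence[float],
                            test_cost_per_die: float,
                            bond_cost: float,
                            bond_yield: float) -> float:
    """Every die is tested first (perfect coverage assumed); only good dies are bonded."""
    good_die_cost = sum((cost + test_cost_per_die) / die_yield
                        for cost, die_yield in zip(die_costs, die_yields))
    return (good_die_cost + bond_cost) / bond_yield


# Hypothetical two-die stack.
costs = [100.0, 100.0]
yields = [0.8, 0.8]

blind = cost_per_good_stack_blind(costs, yields, bond_cost=10.0, bond_yield=0.98)
kgd = cost_per_good_stack_kgd(costs, yields, test_cost_per_die=2.0,
                              bond_cost=10.0, bond_yield=0.98)
print(f"blind stacking: {blind:.2f} per good stack")
print(f"KGD stacking  : {kgd:.2f} per good stack")
```

Without KGD selection, the die yields multiply, so a single bad die scraps the whole stack; with KGD selection, each die's yield loss is paid only once, at the cost of the additional test step.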
Assuming design and test challenges are addressed, additional cost savings could be achieved through 3D STCO instead of trying to reduce cost per transistor through large-die splitting. As an example, a 3D-optimized N-core system could potentially perform as well as a 2D M-core system (where N < M) due to improved bandwidth and connectivity.
Detailed cost modeling of monolithic 3D designs has been presented by Gitlin et al. and Ku et al. The primary yield improvement in monolithic 3D comes from the fact that the critical area for defect densities can be reduced by approximately 2X in monolithic 3D wafer processing. Considering different scenarios, these works have found that monolithic 3D can enable cost savings compared to 2D designs, especially for large die areas.
IV 3D physical design
3D design has been explored extensively in the past few decades based on through-silicon-via (TSV) technology assumptions. High-density 3D design explorations that enable design partitioning at a block or gate level have been challenging, mainly because of the lack of EDA tools to implement such designs. Fig. 3 (a) shows the current state-of-the-art 3D physical design flow supported by EDA tools. Today’s 3D-IC designs are predicated on the assumption that functionally complete systems would be stacked in 3D. Hence, current tools do not support any automated 3D partitioning or cross-tier placement, timing, or routing optimizations. Each 3D tier is designed and optimized separately, and cross-tier connections are only analyzed and verified during the final ’sign-off’ stage.
A 3D-aware EDA tool, especially one supporting high-density 3D technologies, would enable partitioning and optimized synthesis, floorplanning, placement, clock tree synthesis, and routing of 3D tiers inherently in the design flow. Since these capabilities do not exist today, significant research efforts have been made to enable 3D physical implementation using 2D EDA tools. In a recent paper, we presented our efforts on co-optimization of gate placement across 3D tiers using commercial EDA tools, for a 3D-partitioned Arm Cortex-A microprocessor. The important steps of the flow and how it differs from conventional 2D design methodology are described in Fig. 3 (b). Multi-tier co-placement utilizes a commercial EDA placement optimization engine to mimic the behavior of a 3D-aware placement engine.
Results shown in Fig. 4 highlight the efficacy of the multi-tier co-optimization flow. In this plot, path-length versus the number of cells in a path is plotted, color-coded by timing slack. Path-length refers to the sum of the pin-to-pin half-perimeter wirelengths of all net connections in a design timing path. Number of cells denotes the count of two-pin equivalent gates in the timing path. For the 3D design, multi-tier co-placement efficiently places logic blocks in close proximity to each other in the 3rd dimension and significantly reduces the number of long, failing, timing-critical paths. Additionally, the overall path-length distribution is tighter than in the 2D case. This structural improvement in the design directly translates to a performance improvement of up to 12% or a power reduction of up to 40%, approaching that of a modern Moore’s Law process node and highlighting the importance of 3D-aware tool flows.
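The path-length metric described above can be sketched in a few lines. This is an illustrative reading of the definition rather than the flow's actual implementation: each net connection on the path is reduced to a driver pin and a sink pin, its half-perimeter wirelength is the Manhattan distance between them, and the vertical distance of a tier crossing is ignored. All coordinates are hypothetical.

```python
from typing import List, Tuple

Pin = Tuple[float, float]   # (x, y) location in micrometres
Net = Tuple[Pin, Pin]       # (driver pin, sink pin) of one connection on the path


def pin_to_pin_hpwl(driver: Pin, sink: Pin) -> float:
    """Half-perimeter of the bounding box of two pins, i.e. their Manhattan distance."""
    return abs(driver[0] - sink[0]) + abs(driver[1] - sink[1])


def path_length(nets: List[Net]) -> float:
    """Sum of pin-to-pin half-perimeter wirelengths over all connections of a timing path."""
    return sum(pin_to_pin_hpwl(driver, sink) for driver, sink in nets)


# Hypothetical three-stage timing path placed in 2D.
path_2d = [((0.0, 0.0), (120.0, 0.0)),
           ((125.0, 0.0), (125.0, 90.0)),
           ((130.0, 90.0), (260.0, 90.0))]

# The same hypothetical path after placement across two tiers of half the footprint.
path_3d = [((0.0, 0.0), (60.0, 0.0)),
           ((65.0, 0.0), (65.0, 45.0)),
           ((70.0, 45.0), (130.0, 45.0))]

print(f"2D path-length: {path_length(path_2d):.1f} um")
print(f"3D path-length: {path_length(path_3d):.1f} um")
```

Collecting this value for every timing path, together with the number of cells on the path and its slack, gives the kind of data that a scatter plot like Fig. 4 summarizes.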
A key challenge in realizing 3D design implementations is designing a robust power delivery network and managing thermal dissipation. For the same design implemented in 2D versus a 2-tier 3D stack, the 3D design occupies a smaller 2D footprint, potentially 50% of the original design. However, the 3D design draws similar power through a smaller number of package bumps in the reduced footprint, increasing the current drawn per bump, as described in Fig. 5. This directly translates to higher power density as well. These constraints require careful floor-planning of the 3D tiers to avoid power and thermal hotspots, as well as a robust power delivery network design. 3D systems may need more expensive packaging and cooling solutions to handle the power density increase, partially offsetting the underlying advantages of 3D stacking. Numerous solutions have been proposed to mitigate power delivery and thermal challenges in 3D designs, and this is an area of active research.
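The effect of the reduced footprint on bump current and power density can be made concrete with a first-order estimate. The sketch below assumes a uniform bump grid, a fixed fraction of bumps carrying supply current, and unchanged total power; the power, voltage, footprint, bump pitch, and supply fraction are hypothetical values, not data from Fig. 5.

```python
def supply_bump_count(footprint_mm2: float, bump_pitch_um: float,
                      supply_fraction: float) -> float:
    """Approximate number of supply bumps on a uniform grid over the footprint."""
    bumps_per_mm = 1000.0 / bump_pitch_um
    return footprint_mm2 * bumps_per_mm ** 2 * supply_fraction


def current_per_bump_a(power_w: float, vdd_v: float, footprint_mm2: float,
                       bump_pitch_um: float, supply_fraction: float) -> float:
    """Average supply current per bump in amperes, assuming even current sharing."""
    total_current_a = power_w / vdd_v
    return total_current_a / supply_bump_count(footprint_mm2, bump_pitch_um, supply_fraction)


def power_density_w_per_mm2(power_w: float, footprint_mm2: float) -> float:
    """Power per unit footprint area."""
    return power_w / footprint_mm2


# Hypothetical design: same power and supply voltage, footprint halved by 2-tier stacking.
POWER_W, VDD_V = 50.0, 0.8
BUMP_PITCH_UM, SUPPLY_FRACTION = 150.0, 0.25
FOOTPRINT_2D_MM2 = 100.0
FOOTPRINT_3D_MM2 = FOOTPRINT_2D_MM2 / 2

i_2d = current_per_bump_a(POWER_W, VDD_V, FOOTPRINT_2D_MM2, BUMP_PITCH_UM, SUPPLY_FRACTION)
i_3d = current_per_bump_a(POWER_W, VDD_V, FOOTPRINT_3D_MM2, BUMP_PITCH_UM, SUPPLY_FRACTION)
pd_2d = power_density_w_per_mm2(POWER_W, FOOTPRINT_2D_MM2)
pd_3d = power_density_w_per_mm2(POWER_W, FOOTPRINT_3D_MM2)

print(f"current per bump: 2D {i_2d * 1e3:.1f} mA, 3D {i_3d * 1e3:.1f} mA")
print(f"power density   : 2D {pd_2d:.2f} W/mm^2, 3D {pd_3d:.2f} W/mm^2")
```

With the footprint halved and everything else held constant, both the average current per bump and the power density double in this estimate.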
V 3D Architecture
High-density 3D poses the question of whether we can fundamentally re-think design micro-architecture to take advantage of 3D stacked tiers. An area of focus for 3D micro-architecture research has been the goal of breaking the von Neumann bottleneck, i.e., bringing larger-capacity, high-bandwidth memory closer to compute. Fig. 6 shows a plot of the mean IPC (instructions per clock) improvement from running the SPEC benchmark suite on an Arm big and LITTLE CPU design in gem5 with larger-capacity and lower-latency L1 and L2 caches. Significant IPC benefits can be seen from larger-capacity, low-latency memory access for general-purpose CPUs. This concept has been extended to large-scale systems, where solutions that stack DRAM dies over CMOS logic compute chips and 3D network-on-chip (NoC) systems have been proposed. A recent work proposes a monolithic 3D solution integrating compute, caches, and random-access memories that does not require any off-chip memory access, essentially enabling orders-of-magnitude higher energy efficiency. Besides addressing the logic-memory bottleneck, Gopireddy and Torrellas presented a 3D micro-architecture study of designing vertical processors using monolithic 3D technology, wherein all critical stages of a superscalar out-of-order CPU are partitioned across 3D tiers, achieving significant performance improvement at lower energy dissipation.
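Why larger and faster caches raise IPC can be illustrated with a first-order analytical model. This is a textbook stall-cycle approximation, not the gem5 methodology behind Fig. 6: it assumes an in-order view in which every miss stalls the pipeline fully, and all miss rates, latencies, and the base CPI are hypothetical.

```python
def estimated_ipc(base_cpi: float,
                  mem_refs_per_instr: float,
                  l1_miss_rate: float,
                  l2_latency_cycles: float,
                  l2_miss_rate: float,
                  dram_latency_cycles: float) -> float:
    """First-order IPC estimate: base CPI plus average memory stall cycles per instruction."""
    # Average stall per memory reference: an L1 miss pays the L2 latency,
    # and the fraction of those that also miss in L2 pays the DRAM latency.
    stall_per_ref = l1_miss_rate * (l2_latency_cycles + l2_miss_rate * dram_latency_cycles)
    cpi = base_cpi + mem_refs_per_instr * stall_per_ref
    return 1.0 / cpi


# Hypothetical 2D baseline cache hierarchy.
baseline = estimated_ipc(base_cpi=0.5, mem_refs_per_instr=0.3,
                         l1_miss_rate=0.05, l2_latency_cycles=12.0,
                         l2_miss_rate=0.30, dram_latency_cycles=200.0)

# Hypothetical 3D-stacked hierarchy: larger caches (lower miss rates) and a closer L2.
stacked = estimated_ipc(base_cpi=0.5, mem_refs_per_instr=0.3,
                        l1_miss_rate=0.04, l2_latency_cycles=8.0,
                        l2_miss_rate=0.20, dram_latency_cycles=200.0)

print(f"baseline IPC: {baseline:.3f}")
print(f"stacked IPC : {stacked:.3f}")
print(f"improvement : {100 * (stacked / baseline - 1):.1f}%")
```

The model overstates stalls for out-of-order cores, which overlap misses with useful work, so it should be read as a qualitative illustration of the trend rather than a prediction of the gains shown in Fig. 6.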
These works point to significant gains possible with high-density 3D stacking technologies. The magnitude of the gains depends on the workload (compute-bound versus memory-bound) and on whether 3D integration effectively relieves existing 2D bottlenecks. Conventionally, system architecture and micro-architectural explorations abstract out physical details in favor of cycle-accurate design behavior. This abstraction has worked well in the era of traditional Moore’s Law scaling. However, this approach makes it challenging to assess realizable gains for new 3D stacked architectures, since the underlying 3D technology and physical design constraints have a significant impact on achievable gains. Designing next-generation high-performance general-purpose computing systems in 3D requires extensive effort and co-optimization of system and CPU architectural exploration in the context of physical design.
This paper presents an overview of high-density 3D integration technologies and their potential to improve performance, power, and cost, essentially augmenting Moore’s Law scaling. 3D-optimized architectures could potentially enable higher gains. There is an industry-wide effort to address 3D manufacturing and design challenges in the form of 3D-aware EDA tools, robust power delivery, thermals, and test of known-good-die. As high-density 3D manufacturing technologies and physical design methodologies mature, it is time to revisit 3D-optimized architecture research with strong cross-abstraction collaboration between technologists, circuit designers, and computer architects.
-  G. Moore. Cramming more components onto integrated circuits. Electronics, April 1965.
-  G. Yeric. Moore’s law at 50: Are we planning for retirement? In 2015 IEEE International Electron Devices Meeting (IEDM), pp. 1.1.1–1.1.8, Dec 2015.
-  R. Fontaine. The State-of-the-Art of Smartphone Imagers. In 2019 International Image Sensor Workshop (IISW), 2019.
-  H. Jun et al. HBM (High Bandwidth Memory) DRAM Technology and Architecture. In 2017 IEEE International Memory Workshop (IMW), pp. 1–4, May 2017.
-  S. Venkatesan and M. Aoulaiche. Overview of 3D NAND Technologies and Outlook Invited Paper. In 2018 Non-Volatile Memory Technology Symposium (NVMTS), pp. 1–5, Oct 2018.
-  C. Tseng et al. InFO (Wafer Level Integrated Fan-Out) Technology. In 2016 IEEE 66th Electronic Components and Technology Conference (ECTC), pp. 1–6, May 2016.
-  Foveros - Intel - Wikichip. URL: https://en.wikichip.org/wiki/intel/foveros.
-  M. Chen et al. System on Integrated Chips (SoIC™) for 3D Heterogeneous Integration. In 2019 IEEE 69th Electronic Components and Technology Conference (ECTC), pp. 594–599, May 2019.
-  P. Batude et al. 3DVLSI with CoolCube process: An alternative path to scaling. In 2015 Symposium on VLSI Technology (VLSI Technology), pp. T48–T49, June 2015.
-  M. M. Sabry Aly et al. The N3XT Approach to Energy-Efficient Abundant-Data Computing. Proceedings of the IEEE, 107(1):19–48, Jan 2019.
-  G. Hills et al. Modern microprocessor built from complementary carbon nanotube transistors. Nature, 572(7771):595–602, 2019.
-  E. J. Marinissen et al. IEEE Std P1838: DfT standard-under-development for 2.5D-, 3D-, and 5.5D-SICs. In 2016 21th IEEE European Test Symposium (ETS), pp. 1–10, May 2016.
-  D. Gitlin et al. Cost model for monolithic 3D integrated circuits. In 2016 IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S), pp. 1–2, Oct 2016.
-  Bon Woong Ku et al. How much cost reduction justifies the adoption of monolithic 3D ICs at 7nm node? In 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–7, Nov 2016.
-  S. Panth et al. Shrunk-2-D: A Physical Design Methodology to Build Commercial-Quality Monolithic 3-D ICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 36(10):1716–1724, Oct 2017.
-  K. Chang et al. Cascade2D: A design-aware partitioning approach to monolithic 3D IC with 2D commercial tools. In 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–8, Nov 2016.
-  X. Xu et al. Enhanced 3D Implementation of an Arm® Cortex®-A Microprocessor. In 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pp. 1–6, July 2019.
-  M. Scheuermann et al. Thermal analysis of multi-layer functional 3D logic stacks. In 2016 IEEE International 3D Systems Integration Conference (3DIC), pp. 1–4, Nov 2016.
-  K. Chang et al. System-Level Power Delivery Network Analysis and Optimization for Monolithic 3-D ICs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 27(4):888–898, April 2019.
-  N. Binkert et al. The Gem5 Simulator. SIGARCH Comput. Archit. News, 39(2):1–7, August 2011.
-  T. Carlson and M. Facchini. 3D Stacking of DRAM on Logic, pp. 187–210. Springer US, Boston, MA, 2011.
-  I. Akgun et al. Network-on-Chip Design Guidelines for Monolithic 3D Integration. IEEE Micro, pp. 1–1, 2019.
-  B. Gopireddy and J. Torrellas. Designing Vertical Processors in Monolithic 3D. In Proceedings of the 46th International Symposium on Computer Architecture, ISCA ’19, pp. 643–656, New York, NY, USA, 2019. ACM.