Bit matrix compression is a key operation in efficient computer arithmetic. It was already the general concept behind the fast multiplier suggested by Wallace [1] and the multiplier schemes discussed by Dadda [2] as well as the carry shower circuits that were proposed by Foster and Stockton [3] and further discussed by Swartzlander [4]. Besides multiplication and population counting, matrix compression can naturally subsume fused operations such as multiply-accumulate [5] or the computation of complete dot products. The dot product by itself is a key operation in applications such as digital filters [6] and neural networks [7]. The latter work specifically advocates binarized neural networks for their efficient implementation on FPGAs. In this approach, the dot product summation degenerates into a high-fanin population count, which was the motivation to evaluate this work in the context of matrices with extreme aspect ratios.
The early, well-established use cases already exemplify the ubiquity of bit matrix reduction and the diversity of shapes of the input matrices. On the one hand, multiplication processes a matrix of partial bit products, which assumes a skewed shape due to the increasing numerical weight of the multiplier bits. The population count operation behind the carry shower circuits, on the other hand, has to process a single column of equally weighted bits. Both of these use cases are illustrated in Fig. 1.
Ultimately, both of these use cases desire to compute the arithmetic sum of all properly weighted input bits as a binary number. This goal is achieved through two distinguishable steps: (1) the matrix compression down to a height of two rows by making a massively parallel use of elementary bit counters, and (2) the carry-propagate addition to obtain the conventional binary result. At the interface between these steps, the result is available in what is called a carry-save representation, i.e., its total is distributed over two addends. Higher-level operations fusing multiple compression steps would typically try to copy this representation directly to the input matrix of the next compression so as to avoid the carry-propagate addition for all but the conclusive computation step. This approach warrants a significant speedup as the logarithmic latencies of efficient implementations of both the matrix compression and the final carry-propagate addition are on the same order of magnitude.
The matrix compression is implemented by reduction elements typically referred to as parallel counters. The most basic of these counters is the full adder reducing three inputs of the same weight to a two-digit binary number:
$$a_0 + a_1 + a_2 = 2\,c + s$$
where $c$ corresponds to the carry and $s$ to the sum output. Applying full adders in parallel to three matrix rows as shown in Fig. 2 reduces them to two rows with the same additive value. This 3-to-2 compression, also known as carry-save addition, is performed within a single full-adder delay independent of the width of the rows. More sophisticated counters can give rise to higher compression ratios such as the 4-to-2 adders thoroughly reviewed by Kornerup [8] or the 5-to-2 compressor described by Kwon et al. [5]. Besides full adders, they rely on instances of what is called a generalized parallel counter (GPC) whose input bits may already have different numerical weights. In particular, their compressor also uses such elements. The notation $(p_{n-1},\dots,p_0;q)$ defines the right-aligned numbers $p_i$ of input bits of weight $2^i$ and the number $q$ of output bits, which the counter processes while maintaining this invariant over the total sum:
$$\sum_{i=0}^{n-1} 2^i \sum_{j=1}^{p_i} a_{i,j} = \sum_{k=0}^{q-1} 2^k s_k$$
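The 3-to-2 compression described above can be sketched in a few lines of Python. This is only a behavioral illustration: the function name `carry_save_add` and the encoding of each matrix row as a non-negative integer are assumptions made for this example and are not taken from the described implementation.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """Compress three rows into two rows with the same additive value."""
    # Sum output of each per-column full adder: parity of the three bits.
    sum_row = a ^ b ^ c
    # Carry output of each full adder: majority of the three bits,
    # shifted left because a carry has twice the weight of its column.
    carry_row = ((a & b) | (a & c) | (b & c)) << 1
    return sum_row, carry_row


if __name__ == "__main__":
    s, c = carry_save_add(0b1011, 0b0110, 0b1101)
    # The two result rows keep the total of the three input rows.
    assert s + c == 0b1011 + 0b0110 + 0b1101
```

All bit positions are processed by independent bitwise operations, which mirrors the observation that the delay of this step does not depend on the row width.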
The rather rigid notion of reducing the number of complete rows is less suitable for irregular matrix shapes. In fact, it was already broken up by Dadda [2], who applied full and half adders for the reduction of a multiplication matrix exactly to the bit columns where this was needed to reach the targeted row count of the current reduction step. This flexible, goal-oriented placement of counters can still be considered the state of the art. Heuristics and ILP solvers have been used to optimize such compression solutions for various input matrix shapes [9, 10].
In the remainder of this paper, we first give an overview of the related work before establishing criteria for the counter evaluation and defining a suitable and systematically completed set of parallel bit counters for our FPGA implementation. We then discuss the conclusive carry-propagate addition and its integration with the preceding matrix compression to define an efficient greedy construction of a matrix summation implementation. Finally, we evaluate the generic synthesizable VHDL implementation of our approach targeting a concrete Xilinx Zynq device using Vivado.
II Related Work
The scheduling of counters to build a compressor depends naturally on the selection of available modules. It is the backing technology that defines which counters can be implemented most efficiently. A discussion of the choices for ASICs was composed by Verma and Ienne [11]. FPGA-targeted counters have been most prominently proposed by Parandeh-Afshar et al. [12, 13, 9] as well as Kumm and Zipf [14, 10]. As this paper focuses on the construction of compressors within a modern Xilinx FPGA fabric, it will heavily build on the work of these latter two groups.
A heuristic for constructing compressors for Altera devices was proposed by Parandeh-Afshar et al. in 2008 [12]. They used a single-pass heuristic that proceeds from the least-significant to the most-significant bit position and selects, from a set of parallel counters, the most efficient one that still fits into the work remaining for the compression step. The compression goal was a matrix of, at most, three rows. This relaxed goal definition exploits the fact that ternary adders map well onto modern FPGA architectures. It also has the tremendous benefit that half adders can be avoided altogether. Half adders only have a reshaping function and do not reduce the number of bits in the matrix. As shown in Fig. 3, they must be used to reshape an almost completed two-row matrix in parallel so that it can accommodate just one more carry efficiently. This pressure disappears with a goal of three rows.
In their follow-up work [13], Parandeh-Afshar et al. start considering mapping counters to the broader structural context of an Altera Adaptive Logic Module (ALM) rather than assuming an undifferentiated pool of lookup tables (LUTs). This enables them to exploit the carry-chain links between adjacent LUT stages for fast and yet more capable counters. Finally [9], they tie individual counters together by merging the carry output of one module with the carry input of another into one LUT stage. While this, indeed, reduces LUT usage, it also creates unwieldy structures that severely limit the mobility of individual counters during the logic placement, which complicates the routing optimization to be performed by the tools. Last but not least, this work also looks into a generalization for Xilinx architectures.
Kumm and Zipf [14], on the other hand, proposed counter designs that are a natural fit for the slices of four LUTs that are found in Xilinx architectures. Like Parandeh-Afshar et al., they aim at a three-row compression goal. While they provide convincing delay estimates for using their proposed counters, they do not use them for a sharp cut in the selection of implementation modules. Besides an extended counter selection, they also propose to consider 4:2 adders in the construction of a compressor. In a subsequent work [10], they substitute the heuristic compressor construction by an integer linear programming (ILP) optimization. While they can demonstrate a consistent reduction of the LUT usage and the number of compression stages, the ILP running time remains prohibitive for all but desperate workflows.
Targeting Xilinx devices, the work by Kumm and Zipf is the natural foundation we build on. We adopt their useful counters but also decompose, classify and generalize them to construct a systematic selection of modules, from which the implementation can pick the most suitable and capable instances. While we also use their efficiency metric, we complement it with the additional metrics of strength and slack so as to enable a directed selection of the most beneficial counters in the compressor construction. Performing an in-workflow construction at the time of the RTL synthesis, we rely on a heuristic that is closely related to the one by Parandeh-Afshar et al.
III Counter Evaluation
For the evaluation of the building blocks of a compressor, the generalized parallel counters, we derive several performance metrics based on their physical properties. The goal of this evaluation is to firstly define the selection of counters that is to be used in the compressor construction and secondly to prioritize among them as long as multiple choices are technically feasible.
As a first criterion, we will use the estimated counter delay for a hard exclusion of candidates. So as to ensure roughly balanced bit delays after each compression step, counters are not allowed to add any extra signal paths over the general-purpose routing network beyond what is needed to feed their inputs and to forward their outputs. Counters are, however, allowed to grow beyond a single LUT by using slice-internal signal paths, in particular, the carry chain. The delay on the carry chain links is negligible in comparison to general-purpose routing paths. We only ensure that a counter is constrained to a slice, which corresponds to a maximum of four LUTs. While all the GPCs collected by Kumm and Zipf also fit into the bounds of a slice [14], quite a few of them feature secondary carry signals over the general-purpose routing in parallel to the carry-chain link. We exclude these counters explicitly.
In terms of physical dimension, the total number of counter inputs, outputs and the occupied area in terms of LUTs are of interest. Given the GPC $(p_{n-1},\dots,p_0;q)$, we use:
$$I = \sum_{i=0}^{n-1} p_i \;\text{(inputs)}, \qquad O = q \;\text{(outputs)}, \qquad L \;\text{(occupied LUTs)}$$
Performance metrics are derived from these physical characteristics.
The efficiency of the generalized parallel counter is the quotient of its achieved reduction of the number of bit signals and the number of LUTs it occupies:
$$E = \frac{I - O}{L}$$
This notion of efficiency was previously used by Kumm and Zipf. It reflects signal reduction achieved in relation to the hardware investment. Giving preference to more efficient counters will optimize the constructed result with respect to silicon area.
The strength of the generalized parallel counter is the ratio of its input bit count vs. its output bit count:
$$S = \frac{I}{O}$$
The strength metric captures the asymptotic height reduction of a large bit matrix when exclusively using a specific counter in a single compression step. Giving preference to stronger counters emphasizes a small number of compression steps as a construction goal.
The (arithmetic) slack of the generalized parallel counter captures the share of the numeric range representable by the output bits that cannot be used:
$$A = 1 - \frac{1 + \sum_{i=0}^{n-1} 2^i\, p_i}{2^q}$$
The slack captures the coding inefficiency of a counter's output. It is counterproductive for reducing the number of bits in the matrix. Accumulating slack within a compression network may even result in phantom carries. Refer to the compression depicted by Fig. 4. While the two-row result looks as if it might produce a carry into a further bit position, this is, in fact, not possible because the maximum numerical value of the original input is too small. This misconception suggested by the dot diagram is created by the half adder (a $(2;2)$-counter), which cannot produce a result value of three and hence leaves one out of four possible outputs unused. Note that the slack of a functionally correct counter is never negative as this would imply that there are large input totals that cannot be recoded into the available output bits. (Counters with a negative slack may prove useful when positive slack has accumulated in the computation. For instance, a half adder that has no arithmetic chance to produce a carry degenerates into a plain XOR gate for the sum output. This gate can be viewed as a simple $(2;1)$-counter with an obvious negative slack. This pathway is not investigated any closer in this paper.) Giving preference to counters with no or, at least, smaller slack minimizes the chance to introduce phantom signals to the constructed compressor.
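The three metrics can be evaluated mechanically from a counter's shape. The following sketch assumes the common GPC notation $(p_{n-1},\dots,p_0;q)$ with the input counts listed from the most to the least significant column; the function name `gpc_metrics` and its parameters are illustrative choices, and the LUT count has to be supplied by the caller because it depends on the technology mapping.

```python
from fractions import Fraction


def gpc_metrics(inputs_per_column, outputs, luts):
    """Return (efficiency, strength, slack) of a GPC as exact fractions.

    inputs_per_column lists the input bit counts from the most to the
    least significant column, as in the (p_{n-1}, ..., p_0; q) notation.
    """
    n = len(inputs_per_column)
    total_inputs = sum(inputs_per_column)
    # Largest input total: every input bit set, weighted by its column.
    max_total = sum(p << (n - 1 - k) for k, p in enumerate(inputs_per_column))
    efficiency = Fraction(total_inputs - outputs, luts)
    strength = Fraction(total_inputs, outputs)
    # Share of the 2**outputs output codes that can never occur.
    slack = 1 - Fraction(max_total + 1, 2 ** outputs)
    return efficiency, strength, slack


if __name__ == "__main__":
    # Full adder, (3;2), in a single LUT.
    print(gpc_metrics((3,), 2, 1))
    # Half adder, (2;2); the LUT count of 1 is an assumption here.
    print(gpc_metrics((2,), 2, 1))
```

For the half adder, the computed slack is 1/4, which matches the one unused output code out of four discussed above. A two-input, one-output XOR "counter" yields a negative slack of -1/2.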
Kumm and Zipf have proposed several counters that map perfectly into the slice structure found in modern Xilinx devices since generation 5. While not identified as such, most of them are actually instances of a more general concept that composes those 4-column counters from the 2-column atoms shown in Fig. 5 through Fig. 7. Any two of these atoms can be combined arbitrarily into a slice to form nine different counters. Both constituting atoms are exclusively connected through the carry chain. The initial carry chain input at the lower significant atom can be used to input an additional bit of weight one. Note that this is physically not possible for the $(0,6)$-atom since the LUT bypass used by this atom conflicts with driving an external carry input. All of the counters constructed this way produce a five-bit binary number as a single-row result.
The performance metrics of these composable whole-slice counters are summarized by Tab. I. In the top left, the combination of two $(2,2)$-atoms with an additional carry input essentially yields a four-bit ripple-carry adder (RCA). It reduces the number of active bits by one for each invested LUT, hence its efficiency of $1$. While an arbitrarily wide RCA would have a strength of $2$, the limitation of the accepted carry paths to a slice limits it to $9/5 = 1.8$. Both efficiency and strength grow systematically as one of the weight-2 inputs is replaced by two weight-1 inputs to yield the $(1,4)$- and the $(0,6)$-atoms. They complete the set of reasonable 2-column atoms, which are allowed to contribute a maximum numeric value of 6 in addition to the input on the carry chain. The remaining alternative of a $(3,0)$-atom is reasonably subsumed by a single-LUT implementation of a full adder.
Note that the advantage of the $(0,6)$-atom is impacted in the low-significant position where a structural resource hazard within the slice prevents the functional utilization of an additional carry input. With only a constant zero available on the carry input, there are no improvements in terms of efficiency and strength over the $(1,4)$-atom in this position. Rather, arithmetic slack is introduced as the output code space cannot be utilized completely. For the illustration of the strength metric, observe that the exclusive use of whole-slice counters with twelve inputs and five outputs on a large bit matrix will deflate twelve input rows to five output rows, hence a strength of $12/5 = 2.4$.
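The composition of whole-slice counters from two-column atoms can be enumerated as a small sketch. It assumes that the three atoms are $(2,2)$, $(1,4)$ and $(0,6)$, which is the set of two-column shapes with a maximum value of 6 that remains once the $(3,0)$ alternative is excluded. The names `ATOMS` and `compose_slice_counters` are hypothetical and only serve this illustration.

```python
from itertools import product

# Two-column atoms as (weight-2 inputs, weight-1 inputs); assumed set.
ATOMS = [(2, 2), (1, 4), (0, 6)]


def compose_slice_counters():
    """Enumerate the whole-slice counters built from two atoms.

    Returns a list of (shape, outputs) with the shape given from the
    most to the least significant column.
    """
    counters = []
    for high, low in product(ATOMS, repeat=2):
        # The low atom may take one extra weight-1 bit on the carry-chain
        # input, except for (0, 6), whose LUT bypass blocks that input.
        carry_in = 0 if low == (0, 6) else 1
        shape = (high[0], high[1], low[0], low[1] + carry_in)
        counters.append((shape, 5))
    return counters


if __name__ == "__main__":
    for shape, outputs in compose_slice_counters():
        inputs = sum(shape)
        max_total = 8 * shape[0] + 4 * shape[1] + 2 * shape[2] + shape[3]
        print(shape, "strength", inputs / outputs, "max total", max_total)
```

Every composed counter stays within the five-bit output range, since the high atom contributes at most 24 and the low atom together with the carry input at most 7.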
There is one more whole-slice counter that is adopted from Kumm and Zipf and cannot be decomposed into the atoms extracted above: the -counter. We discard some of their other proposals:
- all counters such as the -counter that impose an additional routing delay by driving carry-like signals over the general-purpose routing rather than the carry chain, and
- the - and -counters that they have mapped to the carry chain, tying them to the lower significant half of the slice without being able to utilize the higher significant part.
While complex, slice-based counters promise high values in strength and efficiency, they are typically too bulky for achieving a covering of the bit matrix that is as exhaustive as possible. For this purpose, more flexible, smaller LUT-based counters are needed. The most elementary among these is the full adder, i.e. a $(3;2)$-counter. Using both of its outputs, this can be implemented within a single LUT. The -counter is adopted from Brunie et al. [15]. It occupies three LUTs and is very effective for reducing the height of a singular peak column. Finally, we propose the novel -counter depicted in Fig. 8. It utilizes only two LUTs to do the work of three full adders within a single logic level. As shown, the sum and the carry computations of one of these full adders are distributed among the two LUTs and merged with the exclusively local full adders. Implementing two 5-input functions within every involved LUT utilizes them fully. The performance figures of all of these floating counters whose LUTs can be placed freely are listed in Tab. II. Note that the -counter competes very favorably in this group.
IV Final Carry-Propagate Addition
The ultimate result of the parallel compression by counters must finally undergo a carry-propagate addition to obtain the total as a conventional binary number. The traditional two-row compression goal implied by Wallace [1] and Dadda [2] can be relaxed to three rows if ternary adders are well supported on the targeted hardware. This was exploited by Parandeh-Afshar et al. as well as Kumm and Zipf. We will go one step further and define an even more flexible compression goal.
The ternary adder implementation on a Xilinx carry chain requires a secondary carry that cannot utilize a direct fast carry-chain link. As shown in Fig. 9, the computation of this secondary carry at a bit position does not depend on the secondary carry arriving from the preceding position. Thus, no lengthy and slow combinational path is created but rather only one extra intermediate routing delay is implied in total. This additional delay is roughly equivalent to introducing a separate compression stage by parallel full adders that would achieve a reduction from three down to two rows as well. However, the ternary adder offers an increased functional density.
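The independence of the secondary carry from its own predecessor can be made explicit in a behavioral model. The sketch below is an assumption-laden illustration rather than the LUT mapping of Fig. 9: the function name `ternary_add`, the unsigned operands and the use of the bitwise majority as the secondary carry are choices made for this example.

```python
def ternary_add(a: int, b: int, c: int, width: int) -> int:
    """Add three unsigned width-bit numbers bit by bit."""
    result = 0
    primary = 0         # carry travelling on the fast carry chain
    secondary_in = 0    # secondary carry generated one position below
    for i in range(width):
        x, y, z = (a >> i) & 1, (b >> i) & 1, (c >> i) & 1
        # Secondary carry of this position: depends on local inputs only.
        secondary_out = (x & y) | (x & z) | (y & z)
        # The chain stage adds the local parity, the secondary carry from
        # the previous position and the primary carry.
        total = (x ^ y ^ z) + secondary_in + primary
        result |= (total & 1) << i
        primary = total >> 1
        secondary_in = secondary_out
    # Both carries leaving the top position have weight 2**width.
    return result + ((primary + secondary_in) << width)
```

Only `primary` forms a sequential dependency across bit positions in this model. Each `secondary_out` is a function of the three local input bits alone, so all secondary carries could be computed in parallel before the chain is evaluated.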
Instead of targeting a fixed compression goal as the input to the carry-propagation stage, we allow a flexible goal that is inferred in the process of the right-to-left greedy heuristic for counter placement. This process will not further consider the placement of counters in a column if its effective height, i.e. its remaining input bits plus the outputs from counters previously placed in this compression step, can be processed by the carry-propagate stage. The acceptable bit height depends on the previous history of column heights. In the very first, least-significant column, a height of four bits is acceptable as the input can be re-purposed as a fourth input. Unfortunately, and compete for the same link to the general-purpose routing network so that a fifth input fails on a structural hazard.
Legend: CP = Bit Copy, FA = Full Adder, TE = Ternary Element.
Behind a ternary adder element, the next column will be considered done if it is no higher than three bits. Counters are placed aggressively if the height target has not yet been reached. This way, the resulting column height may drop below the acceptable height. This allows benefits to be passed on to the next column. For instance, assume the rightmost column to have a height of 5, and a counter is scheduled for its compression. In the next compression stage, no more counter will be placed into this column. Due to the fact that it does not even leave a carry for the final addition, the weight-2 column now only needs to be compressed to a height of 4. Similarly, full adders that do not produce a secondary carry will be used if the number of remaining bits permits. See Fig. 10 for an example of a ragged carry-propagate input that is acceptable by the proposed flexible approach.
Tab. III and Alg. 1 illustrate the interaction between the counter scheduling and the flexible carry-propagate stage. Carry-propagate modules are selected on the basis of the effective column height and the carries received through the carry propagation itself as shown by Tab. III. If the effective column height does not fit any of the available elements, the parallel compression by counters has to continue and counters are to be placed starting at the determined anchor position. Pipeline registers may be scheduled to buffer the active bit signals after any compression stage. The corresponding request must currently be posed by the designer who specifies the stage count, which is expected to meet timing within the targeted constraints.
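A strongly simplified version of the greedy counter scheduling can be sketched as follows. It deliberately departs from the approach described here in several ways, all of which are assumptions of this example: it uses a fixed goal of three rows instead of the flexible goal of Tab. III, it ignores anchor positions and pipelining, and it only knows two floating counters, a $(3;2)$ full adder and a $(6;3)$ counter. All identifiers are hypothetical.

```python
FULL_ADDER = ((3,), 2)   # (inputs per column, LSB first), number of outputs
SIX_THREE = ((6,), 3)


def compress_stage(heights, counters, goal):
    """Run one greedy compression stage from the least significant column.

    heights[i] is the bit count of column i (index 0 is least significant).
    """
    span = max(max(len(i), o) for i, o in counters)
    remaining = list(heights) + [0] * span   # input bits not yet consumed
    produced = [0] * len(remaining)          # counter outputs of this stage
    for col in range(len(heights)):
        while remaining[col] + produced[col] > goal:
            for inputs, outputs in counters:  # in order of preference
                if all(remaining[col + k] >= p for k, p in enumerate(inputs)):
                    for k, p in enumerate(inputs):
                        remaining[col + k] -= p
                    for k in range(outputs):
                        produced[col + k] += 1
                    break
            else:
                break  # no counter fits; leave the rest to the next stage
    result = [r + p for r, p in zip(remaining, produced)]
    while result and result[-1] == 0:
        result.pop()
    return result


def compress(heights, counters, goal=3):
    """Repeat stages until every column is at most goal bits high."""
    stages = 0
    while max(heights) > goal:
        heights = compress_stage(heights, counters, goal)
        stages += 1
    return heights, stages


if __name__ == "__main__":
    # 128-bit population count: a single column of 128 equally weighted bits.
    print(compress([128], [SIX_THREE, FULL_ADDER]))
```

The order of the `counters` list plays the role of the sorted signature array: sorting it by efficiency, by strength or by their product changes which counter is tried first in each column. The loop terminates because the full adder fits every column that exceeds the goal at the start of a stage, so each stage removes at least one bit from the matrix.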
V Experimental Evaluation
The described algorithm was implemented directly in synthesizable VHDL code. The available counters are characterized by signature records stored in an array that is sorted with respect to the preferred performance metric for the purpose of counter scheduling. The concrete schedule comprising the counter placements and the composition of the conclusive carry-propagate addition is computed by a designated function, which takes an array of the initial column heights as its input. Its output is the representation of the schedule in a flat integer_vector, which directs the actual module instantiations within a for-generate loop.
We have used the described algorithm together with the proposed selection of counters and the suggested final adder construction to implement several matrix summations. The analyzed use cases range from high, single-column population counts over high, dual-column inputs to a $16 \times 16$-bit multiplication matrix for reference. A multiplier would typically not be implemented within the general LUT-based FPGA fabric but would rather utilize the hardwired $25 \times 18$-bit multipliers of the DSP48E blocks.
All measurements were obtained using Vivado 2016.4 targeting an XC7Z045-FFG900-2 device. This device is found on the ZC706 evaluation board, which was also recently used by Umuroglu et al. [7]. Delay measurements are taken from a purely combinational implementation without added pipeline stages in a register sandwich that determines the timing constraint. Timing targets are defined in a process of nesting intervals until a failed and an accomplished timing goal are reached that are no more than 0.1 ns apart. The last met timing is reported. Area results for the summation are extracted from the hierarchical utilization report. For each use case, three schemes of the counter selection for the construction of the compression stage have been evaluated: precedence with respect to (a) efficiency, (b) strength, and (c) their product. The arithmetic slack was used as the last decision criterion only.
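The interval nesting used to determine the reported delay amounts to a bisection over the clock period. The following sketch assumes a hypothetical callback `meets_timing(period_ns)` that would wrap one complete tool run, and it assumes that timing closure is monotonic in the period; neither the function names nor the starting bounds come from the described flow.

```python
def find_min_period(meets_timing, failing, passing, resolution=0.1):
    """Narrow [failing, passing] until both ends are at most resolution apart.

    meets_timing(period_ns) is a hypothetical callback that runs the tool
    flow with the given clock period and reports whether timing was met.
    failing must be a period known to fail, passing one known to pass.
    """
    while passing - failing > resolution:
        middle = (failing + passing) / 2
        if meets_timing(middle):
            passing = middle   # timing met: try a tighter target
        else:
            failing = middle   # timing failed: relax the target
    return passing  # last timing target that was met
```

With a stand-in oracle such as `lambda p: p >= 3.37`, the returned period lies at most 0.1 ns above the true threshold. In practice, implementation tools are not perfectly monotonic, so the result is best read as an estimate.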
Fig. 11 depicts the determined combinational delays of the different matrix summations. The general tendency that higher or wider input matrices demand more time for computation is obvious. However, note that the strength-driven construction produces an unexpectedly slow solution for the input matrix of two 128-bit columns. Its schedule, indeed, differs significantly as it is dominated by - rather than the -counter more frequently employed by the other approaches. Nonetheless, in the overall comparison the strength metric tends to produce the fastest solutions.
[Tab. IV: counter usage per input matrix; columns: Matrix, Efficiency / Product, Strength. Table body not recoverable.]
The soft $16 \times 16$-multiplier matrix appears to produce an extraordinarily long delay with all solution approaches, especially when realizing that it also only has a total of 256 input bits. It suffers from its wide result, which is computed within a final 32-bit carry-propagate adder. As can be seen in Tab. IV, all computed solutions are also heavily dominated by whole-slice counters with only a few interspersed floating ones. This suggests that the delay implications even of short monolithic carry chains should be investigated more thoroughly in the future. Note that the schedules optimized for efficiency and for the efficiency-strength product were, in fact, identical in all the presented use cases.
The area consumption of the summation solutions is shown in Fig. 12. It clearly shows the tradeoffs imposed by the counter selection on different use cases in comparison to the delay figures. While the summation of the multiplication matrix is the slowest, it also has the most compact solution among the cases with 256 input bits. Here, the greater efficiency of the whole-slice counters takes effect. Also note that the strength-driven selection has to pay an area premium for achieving a certain speed gain over the other approaches.
Using the combined area-delay product shown in Fig. 13 as the quality metric, the differences between the selection approaches diminish further. There is no clear winner, and a preference towards area efficiency or speed should be selected explicitly so that the most appropriate solution can be constructed.
For a concrete practical reference, recall the population count synthesized by Umuroglu et al. for FINN [7]. The population count is one of the key operations in their binarized neural network application. Their HLS implementation of a 128-bit instance, however, is pipelined so as to meet the 200 MHz clock target and occupies a total of 376 LUTs. All compressors generated by our efficient greedy approach achieve this goal on the same device in a single cycle of a 200 MHz clock with only slightly more than a quarter of the resources. This suggests that providing such a critical operation as a builtin function of the HLS compiler deserves serious consideration.
VI Conclusions
This paper has described a VHDL-implemented generic matrix summation module that can be used universally for the implementation of operations as different as population counting, dot product computation or integer multiplication. The implementation is backed by a set of parallel counters that has been derived and extended from previous works by Parandeh-Afshar et al. and Kumm and Zipf. The proposed approach further features a novel flexible interface between the parallel matrix compression and the conclusive carry-propagate addition. The complete approach has been implemented specifically targeting modern Xilinx devices. It has been shown that the underlying runtime-efficient greedy construction of the matrix summation makes this operation a valuable candidate for a builtin function of high-level synthesis.
References
[1] C. S. Wallace, “A suggestion for a fast multiplier,” IEEE Transactions on Electronic Computers, vol. EC-13, no. 1, pp. 14–17, Feb 1964.
[2] L. Dadda, “Some schemes for parallel multipliers,” Alta Frequenza, vol. 34, pp. 349–356, 1965.
[3] C. C. Foster and F. D. Stockton, “Counting responders in an associative memory,” IEEE Transactions on Computers, vol. C-20, no. 12, pp. 1580–1583, Dec 1971.
[4] E. E. Swartzlander, “Parallel counters,” IEEE Transactions on Computers, vol. C-22, no. 11, pp. 1021–1024, Nov 1973.
[5] O. Kwon, K. Nowka, and E. E. Swartzlander, “A 16-bit by 16-bit MAC design using fast 5:3 compressor cells,” Journal of VLSI signal processing systems for signal, image and video technology, vol. 31, no. 2, pp. 77–89, 2002.
[6] S. Mirzaei, A. Hosangadi, and R. Kastner, “FPGA implementation of high speed FIR filters using add and shift method,” in International Conference on Computer Design, Oct 2006, pp. 308–313.
[7] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, “FINN: A framework for fast, scalable binarized neural network inference,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2017), ser. FPGA. New York, NY, USA: ACM, Feb 2017, pp. 65–74.
[8] P. Kornerup, “Reviewing 4-to-2 adders for multi-operand addition,” in International IEEE Conference on Application-Specific Systems, Architectures, and Processors, 2002, pp. 218–229.
[9] H. Parandeh-Afshar, A. Neogy, P. Brisk, and P. Ienne, “Compressor tree synthesis on commercial high-performance FPGAs,” ACM Trans. Reconfigurable Technol. Syst., vol. 4, no. 4, pp. 39:1–39:19, Dec 2011.
[10] M. Kumm and P. Zipf, “Pipelined compressor tree optimization using integer linear programming,” in 24th International Conference on Field Programmable Logic and Applications (FPL 2014). IEEE, Sep 2014, pp. 1–8.
[11] A. K. Verma and P. Ienne, “Automatic synthesis of compressor trees: Reevaluating large counters,” in Design, Automation Test in Europe Conference Exhibition (DATE 2007), Apr 2007, pp. 1–6.
[12] H. Parandeh-Afshar, P. Brisk, and P. Ienne, “Efficient synthesis of compressor trees on FPGAs,” in Asia and South Pacific Design Automation Conference, Mar 2008, pp. 138–143.
[13] ——, “Exploiting fast carry-chains of FPGAs for designing compressor trees,” in International Conference on Field Programmable Logic and Applications (FPL 2009), Aug 2009, pp. 242–249.
[14] M. Kumm and P. Zipf, “Efficient high speed compression trees on Xilinx FPGAs,” in MBMV 2014 - Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen, M. M. Jürgen Ruf, Dirk Allmendinger, Ed. Cuvillier Verlag, Feb 2014.
[15] N. Brunie, F. de Dinechin, M. Istoan, G. Sergent, K. Illyes, and B. Popa, “Arithmetic core generation using bit heaps,” in International Conference on Field programmable Logic and Applications (FPL 2013), Sep 2013, pp. 1–8.