Neural networks form the backbone of current technologies such as face recognitionYang et al. (2015), text comprehension Kadlec et al. (2016) and speech emulation Pascual et al. (2017). The recent successes of neural networks can be attributed to the algorithmic advances made possible in part due to the availability of powerful computational devices and of large datasets to train the neural networks. GPUs are the mainstream devices to train them. However, most embedded applications do not require powerful GPUs, given that only the inference task needs to be implemented with pre-trained networks. Therefore, hardware accelerators built on low power devices such as FPGAs, embedded GPUs and embedded processors are preferred to efficiently run these neural networks in real-time. Lacey et al. (2016)
Implementing neural networks on embedded devices poses a set of challenges such as power consumption, latency and adaptability. A vanilla neural network can be designed to achieve the best test set accuracy without much considerations for power consumption and memory requirement. These networks use 32-bit floating point representations that require expensive MAC operations and memory read operations. Methods such as quantization of weights and activations address these challenges by reducing the data precision down to 4 bits Zhuang et al. (2018)
for the Imagenet dataset and sometimes to the extreme extent of single bit binary values for smaller datasets such as CIFAR-10 and SVHNCourbariaux et al. (2016). Surprisingly, quantization may not be detrimental to the the accuracy of the network as it can provide a form of regularization similar to dropout Dahl et al. (2013), which helps to better generalize on the testset. Similarly, pruning weights reduces memory reads and MAC operations while maintaining the accuracy. Such methods can actually lower the computation effort but they are still founded on the simulation of a neural network with a target architecture that is not specifically adapted to the target algorithms.
Field Programmable Gate Arrays (FPGA) are intrinsically closer to the architecture of a neural network because they embed millions of small computing elements, such as Look-Up Tables (LUT), that can be interconnected to form a large network of computing elements. Nevertheless, their inputs are binary signals while neural networks usually handle higher precision data. Binary Neural Networks Courbariaux et al. (2016)
open a way to very efficient FPGA implementations. In such networks, binary multiplications are implemented with simple XNOR gates while the overall sum (accumulation) is computed using a popcountSun,Chenfan and del (2016) and a final comparison of this MAC operation produces the binary output activation. However, each neuron may have a large number of inputs (1000 in the case of ResNet implementations He et al. (2016)) while FPGA computing resources can handle just a few inputs per LUT. Modern FPGAs have 4-6 input bits and 1-2 output bits per LUT. This mismatch leads to inefficient architecture spanning multiple LUTs for each operation when implemented on a FPGA.
In this work, we build networks of tiny binary neurons and map them to FPGA LUTs, thereby providing an optimized way to implement equivalent MAC operations while achieving near state of the art accuracy. We name this architecture and its associated building algorithm PoET-BiN. It is very power efficient since most of the processing is done as Look-Up operations in the binary domain and does not require floating point multipliers, adders or external memory accesses. Our LUT-based architecture combines Decision Trees (DTs) and weighted sums of binary classifiers. A major motivation to use LUT-based DTs is because it eliminates memory reads and can be implemented using simple logic gates, thus considerably saving power. On the other hand, DTs alone are weak classifiers. Hence, we modify the inherently weak classifiers to solve complex non-linear classification tasks consuming a fraction of the power compared to other implementations. The following four contributions are proposed to combine DTs and weighted sums of binary signals to best exploit embedded hardware resources in the context of neural network implementations :
A modified DT training algorithm to better handle a fixed number of inputs LUTs.
The Reduced Input Neural Circuit (RINC) : A LUT-based architecture founded on modified DTs and the hierarchical version of the well known Adaboost algorithm to efficiently implement a network of binary neurons.
A sparsely connected output layer for multiclass classification.
The PoET-BiN architecture consisting of multiple RINC modules and a sparsely connected output layer.
Automatic VHDL code generation of the PoET-BiN architecture for FPGA implementation.
Section 2 details the proposed PoET-BiN architecture, followed by the experimental setup in Section 3. The results and comparison with state-of-the-art methods are detailed in section 4. Finally, the related works and conclusions are presented in Section 5 and 6 respectively.
2 The PoET-BiN Architecture
In this section we detail the principal contribution, multi-level RINC architecture for binary feature representation (§2.1) and also the sparsely connected output layer for multiclass classification (§2.2), who together constitute PoET-BiN.
2.1 Multi-level RINC Architecture
The RINC is a network of tiny binary neurons with limited inputs. In a traditional neural network, each neuron can have up to 4096 neurons as seen in VGG Networks Simonyan and Zisserman (2014). This impedes efficient hardware implementation as it leads to numerous and interdependent logical circuits that adversely affect the power consumption, speed and area of the architecture. In our architecture, the number of single-bit inputs to each neuron is limited, usually less than 8. This poses a major challenge to choose the best inputs among the ones available and classifying the data based on only these selected inputs.
We follow an approach inspired by DTs to implement a binary neuron ( §2.1.1). The size of the DTs is limited by the number of LUT inputs. DTs are inherently weak classifiers and boosting techniques are used to group weak classifiers thus forming stronger classifiers. The Adaboost algorithm is one of the most widely used boosting algorithm. In section 2.1.2, we detail our LUT-based implementation of the Adaboost algorithm. Still, the LUT-based algorithm is limited by the number of weak classifiers that can be grouped together. To further enhance our classifiers, we introduce a hierarchical Adaboost algorithm, where the number of DTs increases exponentially with every level. At each level, all the operations are designed to exactly fit in a single LUT, thereby optimizing power and area efficiency. The hierarchical algorithm is detailed in section 2.1.3.
2.1.1 RINC-0 : Modified Decision Tree Algorithm
A binary neuron only has two possible outputs, so all possible input combinations can be classified into two groups. Hence, each binary neuron is a binary classifier that can be implemented as an Input vs Output table for all the possible input combinations in LUTs (Fig.1). Hardcoding such tables in a LUT or a memory is feasible when the number of inputs is limited. With each added input the complexity increases by a factor of
. In most cases, we need to model a binary neuron with more inputs than can be accommodated in a single LUT. Therefore, hardcoding the input-output relation is not a viable option. This demands an algorithm to choose the best inputs from the input set so as to fit the implementation in a single LUT. We use a greedy approach inspired by DTs to choose the best inputs from the available set of inputs. There are many other traditional machine learning classifier such as Support Vector MachinesBurges (1998) et al. (1992). However, DTs provide a distinctive edge as they can be easily and efficiently implemented in LUTs Abdelsalam et al. (2018).
The original DT algorithm Quinlan (1986) is limited either by the depth or by the number of nodes, which often leads to under utilization of LUTs, because the LUTs used to implement the DTs are neither constrained by the depth or nodes, rather by the number of distinct inputs. Hence, we propose a modified DT algorithm that attempts to optimize DTs for a given number of inputs. Off-the-shelf DT classifiers are built one node at a time. Each node is associated with a feature that divides the input feature space and minimizes the entropy of the DT. Contrarily, we train DTs layer-wise. Hence, all nodes in the same level of the DT have the same features. This divides the input feature space into (where is the number of inputs to a LUT) sub-spaces and increases the capacity of the DT for a given number of input features. Moreover, the leaf nodes, which contain the possible output (0 or 1) for each combination, can be selected in time. The resulting LUT-based DT is named as Reduced Input Neural Circuit (RINC-0), where-”0” signifies the level, which is further explained in the following sections. With this modified DT approach we have an increased capacity RINC-0 architecture that can fit exactly into one LUT and is limited by the number of input features.
Algorithm 1 details the training algorithm for the RINC-0 module. The training dataset contains examples of dimensions each. Our goal is to choose the best features from the available features that best classify the data, or, in other words, reduce the entropy. We train a level-based DT approach where we choose the best feature () from all features that have not been used before, that reduces the entropy of the entire level to the largest extent. The best feature is appended to the Used feature array(). The label array () contains the class label for each leaf node. A leaf node is assigned a class that has the highest number of training examples that end up in that leaf node. These class labels form the output column of the LUT and the best features form the input indices to the LUT.
Fig. 1 illustrates a DT and its equivalent LUT. The red and green arrows at each node represent the path taken when the feature at the node is 0 or 1, respectively. This implementation of a RINC-0 module is versatile and not limited to LUTs alone. The approach can also be implemented in memory blocks as well, as we only have to store a table with precomputed output values for each combination of input values. Since the memory size is computed as the base-2 exponential of the number of inputs, a 30-input LUT already requires one gigabit of data.
Technically, implementing a N-input LUT is efficient only for small values of N (typically under 12 inside an FPGA). In any case, it is completely unrealistic to implement a LUT for a binary circuit that has more than 40 inputs, which is still far less than the number of inputs in a typical neuron. In order to increase the number of inputs taken into account, we can build several RINC-0 DTs and combine them at higher level by applying a boosting algorithm as described in the next sub-section.
2.1.2 RINC-1: Boosting the MAT units
Even with the modified DT training approach, the RINC-0 modules have low capacity and cannot predict the output of the large binary neuron with sufficient accuracy. Hence, we increase the capacity of the weak DTs by grouping them together using a boosting algorithm. One of the most common boosting algorithms is Adaboost Rätsch et al. (2001), where each weak classifier (a LUT in this case) is trained sequentially and focuses on the misclassified examples of the previous classifier. Each classifier is assigned a weight (), where x indicates the corresponding RINC-0 module and ranges from 0 to . The output of each classifier is multiplied with its respective weight and added. Finally, this weighted sum is thresholded and the binary output is obtained. The architecture is detailed in the MAT unit shown in Fig. 2. Each MAT unit theoretically requires multiplications and additions. Nevertheless, since each MAT module consists of input bits and one output bit, it can also be implemented in a LUT where we pre-compute the 1-bit output for all possible inputs combinations. This MAT operation can now be performed as a single look-up operation. Thanks to the addition of the Adaboost layer, the number of inputs to the overall architecture has increased from to . We denote this module as RINC-1, where-”1” is the number of Adaboost levels.
However, the number of inputs to the LUT-based implementation of the MAT module is still limited. Hence, it is not possible to group more than weak classifiers. To overcome this issue, we propose a hierarchical Adaboost algorithm, which is detailed in the following sub-section.
2.1.3 RINC-L: Hierarchical Adaboost Algorithm
In FPGAs with 6-input LUTs, even with inputs, the RINC-1 module can accommodate only 36 inputs which is highly insufficient. Hence, we propose a hierarchical Adaboost algorithm to increase the capacity. Firstly, we build a RINC-1 module with DTs. We also call it a subgroup in this architecture. In each subgroup, the weights are represented as where y indicates the index of a RINC-0 module and x indicates the sub-group index, as shown in Fig. 3. This subgroup is considered a weak classifier. Using the Adaboost algorithm, we construct up to subgroups. Each subgroup is assigned a weight (), where x indicates the sub-group number. Hence, this creates two levels of Adaboost, one within the subgroup and second across subgroups. The binary outputs of the subgroups are multiplied, added and thresholded in a MAT module. Again this module can be implemented as a LUT. We can observe from Fig. 3 that adding another level of Adaboost increases the number of RINC-0 modules exponentially, thus accommodating inputs. In a hierarchical Adaboost algorithm with levels and inputs per LUT, there are RINC-0 modules and Look-Up based MAT modules. Thus,
Algorithm 2 details the hierarchical Adaboost algorithm. We create groups of DTs together. Each DT is associated with a weight that is multiplied with the corresponding output of the DT and thresholded. Now we consider these groups of DTs as a weak classifier and assign a new weight to each group. We further build such groups of DTs and assign a weight to each of them according to the Adaboost algorithm. Again, a MAT module is required to group these sub-groups and it is implemented as a LUT. This can be viewed as 2 levels of Adaboost. Considering this RINC-2 as weak classifier, we can build sub-groups of RINC-2 classifier and group them together using a MAT module to build a RINC-3 architecture. Similarly, further levels can be added to build a RINC-L architecture.
2.2 Binary to Multiclass Classification
The RINC-L modules can be used to implement any size of a binary neuron in the network. However, RINC-L being a binary classifier, it cannot be used directly for multiclass classification. Traditionally, multiclass classification using DTs has been solved using two approaches, Multiclass DTs Holmes et al. (2002) and One-vs-all classifications Rifkin and Klautau (2004). Modifying the RINC architecture as Multiclass DT makes it expensive to implement in hardware. In Multiclass DTs, the leaf nodes refer to either one of the classes, where is the number of classes. This requires output of each DT to be represented over bits. Therefore, the RINC-0 and MAT modules cannot be confined to a single LUT. This effect would cascade over the entire architecture and would make it less efficient. Also, we do not consider one-vs-all classifications using Binary DTs as there is a large drop in accuracy between each individual binary classifier and the overall multi-class classifier designed by comparing the confidences Abdelsalam et al. (2018). Moreover, the one-vs-all classification requires a confidence comparison circuit that consumes more resources.
On the other hand, fully connected layers have been successful in multiclass classification compared to DTs. Hence, we formulate a combined approach with our RINC architecture and a fully connected layer to overcome the multiclass challenge. We preserve the output fully connected layer, while replacing the hidden layers in the classifier with our RINC-L architecture. We then adapt the output fully connected layer to work in tandem with the RINC-L architecture for multiclass classification. In a way we break the task of the classifier into two parts, first the binary classification of the hidden layer representations (§2.2.1) and then the multiclass classification of the output layer (§2.2.2).
2.2.1 Replacing the hidden layers
We use a back to front approach where we start replacing the binary neurons in the network with our RINC-L architecture from the last layers and progressively move towards the initial layers. Given sufficient capacity of the RINC-L modules, it is possible to replace multiple layers of the neural network with a single RINC-L module. In order to minimize the number of RINC-L modules required, we add a fully connected layer with binary sigmoid activation Kwan (1992) after the last hidden layer. We call this fully connected layer, intermediate layer. It consists of neurons, where is the number of classes in the multiclass classification. Typically, the value of ranges from 6 to 8 and the value of is 10 for the MNIST, CIFAR-10 and SVHN datasets. Hence, the intermediate layer has 60-80 neurons for these datasets. Thus, the intermediate layer with binary sigmoid activation can be viewed as a set of binary neurons. This enables us to train a RINC-L module to emulate a binary neuron representation in the intermediate layer. Similarly, binary sigmoid activation can be introduced in the earlier layers and can be replaced with RINC-L modules. Since there are fewer neurons in the intermediate layer than in the hidden layers, it takes fewer resources to train a RINC-L classifier for each of the neuron in the intermediate layer. On the other hand, this restricts the representation space. Hence, the hyper parameter must be chosen carefully to balance the trade-off between accuracy and resources.
2.2.2 Sparsely connected output layer
The outputs of the RINC-L modules (emulating the intermediate layer representation) are connected to the output layer. However, this output layer needs to be optimized for LUT based implementation. Firstly, we modify the output layer to be sparsely connected to the intermediate layer. Each neuron in the output layer is connected to only neurons of the intermediate layer as shown in Fig. 4. Hence, each neuron depends on inputs and therefore can be implemented as a single Look-Up operation. Also, the output layer can be seen as a small set of input fully connected layers stacked in parallel. Since the output layer is trained as a fully connected network, it inherits all the properties of neural networks to classify multiclass data effectively. The output layer is separately retrained with RINC-L outputs to adapt the weights of the output layer. The retraining of the sparsely connected output layer for multiclass classification adapts the weights to the RINC-L binary hidden layer representations.
The output layer activation is not binary. Nevertheless, since it is sparsely connected to the previous layer, it can still be efficiently implemented with LUTs. Each output neuron is quantized to q-bits precision, thus requiring LUTs per output neuron. Therefore, the output layer can be implemented using LUTs. This is negligible as compared to the resources required by the RINC-L modules. Thus, with the help of few more LUTs we achieve higher accuracy on multiclass classification than using costly multiclass DTs or One-Vs-All classifiers. This final architecture with multiple RINC-L modules and q-bit quantized output layer implemented as LUTs make the announced PoET-BiN.
3 Experimental Setup
In this section we detail how we train the proposed architecture and test its performance with various datasets.
We developed the workflow shown in Fig. 5 to train the RINC modules starting from a vanilla CNN network. Firstly, we use a pretrained full precision CNN as our base architecture (Vanilla network). The base architecture for each dataset is mentioned in Table 1
. FE refers to the feature extractor which consists of convolutional, maxpooling and activation layers. In the vanilla network, the features are represented with full precision. However, RINC modules can only be trained on binary features. Hence, we replace the ReLU with binary sigmoid activation after the last convolutional layer to obtain the binary features. This is represented by the Bin act. module in the Binary Feature Representation Network in Fig.5. Further, an intermediate layer and a binary sigmoid activation are added after the last hidden layer. This forms the teacher network. Then, we replace all the hidden layers and the intermediate layer in the classifier using our RINC architecture which is the student architecture in our work. Finally, the output layer is retrained with the RINC outputs. The output layer activations are quantized to bits for efficient hardware implementation. In our test it was observed that when , the loss in accuracy was quite significant as compared to the original floating point implementation. On the other hand, with the loss in accuracy was minimal. In the case when the accuracies were similar to that of 8-bit quantization but it requires twice the amount of LUTs as compared to 8-bit quantization. Hence we use 8-bit quantized output layer.
The architecture hyperparameters are listed in Table1
and explained in detail in the following sub-sections for each dataset. We use similar architecture to other recent stat-of-the-art implementations. We use techniques such as batch normalizationIoffe and Szegedy (2015), exponentially decreasing learning rate, squared hinge loss Rosasco et al. (2004) and ADAM optimizer Kingma and Ba (2015)
in all the vanilla networks. Also, we do not retrain with the validation set. We do not use any image augmentation techniques except for padding in CIFAR-10.
As seen in Table 1, we use the LeNet architecture for the MNIST dataset. Using two convolutional layers of convolutions and two pooling layers of size , we transform the feature space to 512 binary features. The classifier portion consists of only one hidden fully connected layer of 512 neurons with ReLU activation and an output layer of 10 neurons. We use 8-input LUTs . Hence, the intermediate layer contains neurons. We train a 2-level RINC (RINC-2) module with 32 DTs for each neuron in the intermediate layer. Therefore each RINC-2 module selects a maximum of 256 ( = ) features from the available 512 binary features. These predicted outputs of the RINC-2 modules are used to retrain the final bit quantized output layer.
The CIFAR-10 training procedure is similar to the one used for MNIST, but for a bigger network, i.e. a VGG-11 architecture with 8 convolution layers and 3 fully connected layers. The full precision implementation proposed by Kuang Kuang (2018) is used as reference. The convolutional layers transform the input to a binary feature space of 512 features. There are 2 hidden fully connected layers with 4096 neurons each. To augment the capacity of RINC modules, we use input LUTs () and 40 DTs for each of the neuron in the intermediate layer. The intermediate layer consists of binary neurons. The predicted outputs of the RINC-2 modules are used to retrain the output layer which is quantized to 8 bits.
SVHN is implemented with an architecture similar to the one used for CIFAR-10, except that LUTs have 6 inputs (), leading to 36 DTs per neuron with 2 hierarchical levels (RINC-2) of Adaboost. We also use the extra dataset from the SVHN dataset for training.
4 Results and Discussions
We analyse the accuracy, power consumption and latency of PoET-BiN for the MNIST, CIFAR-10 and SVHN datasets.
4.1 Classification Accuracy
We report four sets of accuracies for each dataset in Table 2. Firstly, we report the accuracy of the vanilla network (), followed by the accuracy with binary sigmoid activation after the last convolutional layer to obtain the binary features (
). Then we report the accuracy after binarizing the intermediate layer (). This forms the teacher network. Finally, we replace the classifier portion of the teacher network with the RINC classifiers and quantize final layer whose accuracy is reported as (). This helps isolate and study the effect of each modification. We report the best accuracy achieved over different sets of hyper-parameters such as number of DTs and LUT size for RINC modules. Fig. 5 illustrates the progressive modifications with relevant accuracies.
In Table 2 , we observe a drop in accuracy of approximately 0.3% for MNIST, 1.9% for the CIFAR-10 dataset and 1.1% for the SVHN datasets between the vanilla and teacher network (). This is expected as we restrict the feature space by using binary representations. This teacher network is used to train our RINC classifiers and quantized sparsely connected output layer. This results in a further dip in accuracy of for MNIST and for SVHN. An interesting observation in the case of CIFAR-10 is that by replacing the fully connected layers with PoET-BiN, the accuracy improves by 1.5% for CIFAR-10. This anomaly could be due to better generalization as a result of the noise injected into the system due to the inaccuracies in intermediate layer prediction by the RINC modules. Similar observations were seen in Dropconnect Wan et al. (2013).
Now that we have obtained the final accuracy of the PoET-BiN architecture, it is necessary to have a fair comparison with other architectures present in the literature. We choose three starkly different architectures, namely BinaryNet Courbariaux et al. (2016), POLYBiNN Abdelsalam et al. (2018) and Neural Decision Forest (NDF) Kontschieder et al. (2015)
. BinaryNet is a quantized CNN approach, while POLYBiNN is a complete Decision Tree approach and NDF is a hybrid mixture of both with differentiable DTs. To ensure fairness we use the same feature extractor across all architectures, and change the classifier portion of the architecture. We used our Python implementation for BinaryNet and POLYBiNN. While, Jing’s Pytorch implementation of NDFJing (2018) was adapted for the comparative analysis. From Table 2, we can see that our architecture performs the best in the case of CIFAR-10 and second best in case of SVHN. Though the NDF architecture performs better than PoET-BiN on MNIST and SVHN, it is not optimized for hardware implementations.
For MNIST, the significant reduction in accuracy can be overcome by increasing the number of RINC classifiers. In the MNIST architecture, rather than training the RINC classifiers to predict the intermediate layer outputs, we can train a RINC classifier for each of the neuron in the only hidden layer in M1 architecture. This results in 512 RINC-2 modules. Retraining the fully connected output layer with the 512 RINC classifier outputs results in an accuracy of 98.62% that is more closer to that of NDF. However, this implementation consumes more resources. Therefore, we do not consider this accuracy. However, it proves the versatility of the RINC architecture in implementing binary neurons. We do not implement similar architectures for SVHN and CIFAR-10, as they have significantly more neurons in the last hidden layer (2048 for SVHN and 4096 for CIFAR-10), requiring long training times.
Another important observation to be noted in Table 2 is that the PoET-BiN architecture performs better than off-the-shelf DTs used in POLYBiNN across all datasets, in spite of them having significantly more nodes in each DT. This can be attributed to our hierarchical training algorithm and unique binary to multiclass classification technique.
One of the most important metrics apart from accuracy for hardware implementations of neural networks is power. We compare the power consumed by the PoET-BiN architecture to the classifier portion of the vanilla neural network and quantized neural networks. We implement our PoET-BiN architecture based classifier on Spartan-6 45-nm FPGA from Xilinx to calculate the power consumption. Typically, the PoET-BiN architecture used for the three datasets consists of thousands of LUTs. It is quite cumbersome to write a HDL script for such a large implementation. Hence, we developed a Python script to generate the VHDL code automatically from the trained LUTs. The retrained final layer is also implemented on the FPGA automatically by our script. To report the power of this architecture, it has to fit in the FPGA. The Spartan-6 FPGA has 276 input/output ports but our architecture has 512 features. Hence the inputs are provided through a shift register with a single input. This enables us to fit the architecture into the target FPGA. However, this also adds some logic and signal power which is 4 mW in the case of MNIST and CIFAR-10 and 6 mW in the case of SVHN. The power consumed by the PoET-BiN architecture (subtracting the power consumed in the shift registers) are reported in Table 3. The outputs generated by the FPGA and those generated by PyTorch are verified in the testbench, that is automatically produced by another Python script.
|(at 62.5 Mhz)||clock||logic||signal||IO|
|Multiplication (16 bits)||0.001||0.001||0.000||0.020||0.036||0.058|
|Addition (16 bits)||0.001||0.000||0.001||0.024||0.036||0.062|
|Multiplication (32 bits)||0.002||0.001||0.001||0.035||0.037||0.076|
|Addition (32 bits)||0.001||0.000||0.002||0.048||0.037||0.088|
We compare the power consumed to that of the vanilla neural network and quantized neural networks. Most of the hardware implementations of neural network in the literature provide the power consumed for the entire network including the convolutional layers. It is difficult to accurately estimate the power consumed in the fully connected layers from this data. Hence we use a bottom-up approach to estimate the power of the classifier portion of these networks. Mathematical operations (multiplication and addition) and memory fetching operations consume most of the power in the fully connected layers. Power consumption of memory fetching operations depends on the evaluation platform, memory type and other factors. Hence, it is quite difficult to estimate the power required for memory fetching operations accurately without an actual implementation. On the other hand, we can estimate the power required for the mathematical operations. First, we implement a single multiplication and an addition on the same FPGA. Table4 provides the power consumption for a single multiplication and an addition operation. The multiplication is implemented with a Digital Signal Processor (DSP) block in the FPGA, which consumes less power compared to a LUT-based implementation. The addition operation is implemented with LUTs and dedicated carry chains. We use IP cores provided by Xilinx to implement the multiplication and addition. Now, that we know the power consumed by each operation, we calculate the total number of multiplications and additions in each of the fully connected layers of the vanilla architecture Table 5. From these data we estimate the power consumed in the classifier portion of the vanilla and quantized neural networks except binary quantized networks.
In the case of binary quantized neural network, each multiplication or addition operation consumes an insignificant amount of power. Hence, we estimate the power consumption by implementing a binary neuron. Each binary neuron consists of multiple binary multiplications (XNOR operation) followed by a tree structured adder Hoe et al. (2011) and a comparator. We then multiply this value with the number of neurons in the classifier portion of the respective network for each dataset to estimate the total power consumption of classifier portion of each network. In the case of MNIST, each binary neuron consumes of logic + signal power. However, this includes the power consumed by two shift registers that are used to feed in the values for the inputs and weights. Each shift register consumes of power that needs to be subtracted from the total power. Hence, the power consumed by the binary neuron portion alone is . There are 522 binary neurons in the classifier portion of the M1 architecture. Therefore the total dynamic power consumed by all the binary neurons is . This value is multiplied by the time period of the clock () to obtain the energy shown in Table 6. Similarly, we estimate the energy in the classifier portion of binary quantized networks for the other datasets.
Such method to estimate the power has the advantage of considering the same target device for all the estimations as illustrated in Table 4. Other works use the metrics proposed by Horowitz in Horowitz (2014) but we have not been able to find a fair estimation of a small LUT. Moreover, the power analyzer gives a detailed report on the power distribution in the FPGA. The total power can be coarsely divided into static and dynamic power. The static power, as the name suggests, is constant for a given FPGA device. The dynamic power can be further sub-divided into clock, logic, signal and IO power. The clock and IO power are also constant for a given FPGA device at a given frequency of operation. Hence, the actual energy involved in the computation of a combinational function is only concerned by the logic and signal columns of Table 4. Therefore, we only use these values to estimate the power of a given architecture.
Along with power, energy is also an important metric to be taken into consideration. The energy is calculated in Table 6 from the power values mentioned in Table 3 and Table 4. To calculate the energy value we use the time period of the clock. Our PoET-BiN classifier requires single cycle to implement the inference. For the SVHN dataset, we use a RINC-2 classifier with that is easily implementable on Xilinx LUTs as they support 6-input LUTs. Hence, we use a 100 MHz clock for the RINC-2 classifier implementation for SVHN. On the other hand, MNIST and CIFAR-10 require RINC-2 classifiers with . As each 8-input LUT requires four input LUTs, the critical path increases. Therefore, we use a slower 62.5 MHz clock. We can increase the frequency of the implementation by pipelining the architecture, but this will lead to more power consumption due to the extra registers. Hence, we stick to single cycle implementations. Using these information, the energy results are calculated and detailed in Table 6.
We observe that the PoET-BiN architecture as compared to a full precision vanilla network consumes times less energy in the case of MNIST, almost for CIFAR-10 and in the case of SVHN. Even in the case of 16-bits quantized network, the PoET-BiN architecture consumes almost less energy in MNIST, in CIFAR-10 and in the case of SVHN. Comparing the PoET-BiN to 1-bit quantization (binary), we observe our architecture consumes less energy in the case of the MNIST dataset, less energy for the CIFAR-10 dataset and less in the case the SVHN dataset.
Actually, these values are the worst case scenario of power reduction since we do not consider the power required for memory fetching operations in vanilla and quantized neural networks. These operations are more power intensive than multiplication operations Horowitz (2014). On the other hand, the PoET-BiN architecture does not need any memory access operations. Hence, in reality our architecture is even more power efficient than what we report here.
4.3 Latency and Area
Apart from accuracy and power, embedded application are often time critical and demand low latency. Even with a single cycle implementation, the PoET-BiN architecture has a low latency. From Table 7, we can observe that the latency is 5.85 ns in the case of SVHN, while for CIFAR-10 and MNIST it is 9.48 ns and 9.11 ns respectively. This translates to a throughput of up to images per second in the case of SVHN and images per second in the case of MNIST and CIFAR-10.
The PoET-BiN architecture stands out from other inference techniques such as quantization or pruning in this regard as our implementation is focused on making each sub operation fit a single LUT. This yields a highly optimized architecture in terms of area as seen in Table 7. Especially in the case of SVHN, the PoET-BiN architecture with and 2 levels (RINC-2) requires 2660 LUTs. This can be verified with manual calculations as follows. Firstly, each RINC-0 module requires a single LUT. Then, a RINC-1 module with , consists of 6 RINC-0 modules and a MAT module, hence requiring LUTs. A RINC-2 module consists of 6 RINC-1 module and a MAT module, thus requiring LUTs. Sixty such RINC-2 modules are required to emulate the intermediate layer, therefore consuming LUTs. The final output layer consists of 10 neurons whose values are quantized to 8 bits. Hence each neuron in the output layer requires 8 LUTs. Therefore, LUTs are required to implement the classifier architecture for SVHN. This is the exact count given by the Xilinx synthesizer as well. As there are no overlaps between inputs in each DT, the Xilinx synthesizer cannot further simplify the design. This supports the idea that our training algorithm produces a highly efficient implementation.
In the case of MNIST and CIFAR-10 we use 32 and 40 DTs per RINC-2 module, respectively, with . Since Xilinx LUTs have a maximum of 6 inputs, each input LUT requires four input Xilinx LUTs. Sometimes, this conversion results in redundancy and the synthesizer removes a few LUTs that do not affect the result. Further analysis reveals that most of the LUTs removed by the synthesizer are MAT modules. This is because some DTs have a very low weight assigned by the Adaboost algorithm, which finally does not affect the result of the MAT operation. This suggests that it is possible to further improve our training algorithm. Such opportunity is predominately visible in the case of CIFAR-10 where approximately of the LUTs are removed by the synthesizer producing a smaller architecture in spite of having more DTs per RINC-2 module as compared to the PoET-BiN implementation of MNIST.
5 Related Work
In the literature, there have been various implementations of DTs optimized for embedded hardware Abdelsalam et al. (2018); Narayanan et al. (2007). In these implementations, tackling complex classification tasks inevitably leads to deeper and more DTs. This necessitates the use of external memory to store the intermediate results. More recently, Abdelsalam et al. Abdelsalam et al. (2018) implemented Binary DTs using simple AND-OR gates achieving 97 % accuracy on the MNIST dataset. However the DTs were implemented using off-the-shelf DT libraries, leading to generic trees that were not completely optimized for the underlying LUT-based hardware. Zhou & Feng Zhou and Feng (2017) proposed a multi-level DT architecture where the output of the DTs of the preceding layer are used as inputs to the next layer. They achieved promising results on the MNIST dataset but could only achieve 68% on CIFAR-10. Also, there have been few attempts to combine neural networks and DTs. Kontschieder et al. first coined the term neural decision forests Kontschieder et al. (2015). They proposed differentiable DTs which could be trained along with convolutional layers to achieve 93.6% top-5 accuracy on Imagenet. However, they used stochastic rounding and computations at each of the decision nodes making it unfavourable for hardware implementation.
This work introduced PoET-BiN, an architecture and its associated building algorithm, that is optimized to fit the LUTs or memory blocks in embedded systems such as FPGAs. We replaced the classifier portion of various networks with our architecture to achieve accuracies similar to the ones obtained with full precision implementations. This is because, around half of the network weights are present in the FC layers for the network models considered. We need to reduce both the computations and memory instructions to reduce the overall power of the network. Moreover, a memory fetch instruction consumes more power than a MAC operation. Hence, reducing the power in the FC layers is all the more necessary we reduce the energy consumption for the classifier portion up to six orders of magnitude compared to a floating point implementations and up to three orders of magnitude when compared to recent binary quantized neural networks. This is due to the fact that all the arithmetic operations are replaced by small LUTs on binary signals. Also, our work is quite versatile. CNNs, RNNs, LSTMs and almost every NN uses MAC operations. By binarizing certain activations, we can directly implement our RINC modules in these NNs.
In future work, we will implement the convolutional layers with RINC modules and extend them to bigger datasets such as Imagenet. Another avenue of research would be to use differentiable DTs, thus training the DTs and convolutional layers together to obtain better feature representation. This may result is higher accuracies as the entire network is trained together instead of training layerwise.
- POLYBiNN: a scalable and efficient combinatorial inference engine for neural networks on FPGA. In Design and Architectures for Signal and Image Processing, pp. 19–24. Cited by: §2.1.1, §2.2, §4.1, §5.
A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2 (2), pp. 121–167. External Links: Cited by: §2.1.1.
- Binarized neural networks: training deep neural networks with weights and activations constrained to + 1 or -1. arXiv preprint: 1602.02830. Cited by: §1, §1, §4.1.
Improving deep neural networks for LVCSR using rectified linear units and dropout. In Acoustics, Speech and Signal Processing, Vol. , pp. 8609–8613. External Links: Cited by: §1.
- Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §1.
- Design and characterization of parallel prefix adders using fpgas. In 2011 IEEE 43rd Southeastern Symposium on System Theory, Vol. , pp. 168–172. External Links: Cited by: §4.2.
- Multiclass alternating decision trees. In European Conference on Machine Learning, pp. 161–172. Cited by: §2.2.
- Computing’s energy problem (and what we can do about it). In International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Vol. , pp. 10–14. External Links: Cited by: §4.2, §4.2.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning - Volume 37, ICML’15, pp. 448–456. External Links: Cited by: §3.
- An implementation of the deep neural decision forests in pytorch. https://github.com/jingxil/Neural-Decision-Forests. Cited by: §4.1.
- Text understanding with the attention sum reader network. In Computational Linguistics (Volume 1: Long Papers), pp. 908–918. External Links: Cited by: §1.
- ADAM: A method for stochastic optimization. In International Conference on Learning Representations, External Links: Cited by: §3.
- Deep neural decision forests. In International Conference on Computer Vision, pp. 1467–1475. Cited by: §4.1, §5.
- Train CIFAR10 with PyTorch. https://github.com/kuangliu/pytorch-cifar. Cited by: §3.2.
Simple sigmoid-like activation function suitable for digital hardware implementation. Electronic Letters 28 (15), pp. 1379–1380. External Links: Cited by: §2.2.1.
- Deep learning on fpgas: past, present, and future. arXiv preprint arXiv:1602.04283. Cited by: §1.
An analysis of bayesian classifiers.
Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI’92, pp. 223–228. External Links: Cited by: §2.1.1.
- Interactive presentation: an FPGA implementation of decision tree classification. In Design, Automation and Test in Europe, pp. 189–194. Cited by: §5.
- SEGAN: speech enhancement generative adversarial network. arXiv preprint :1703.09452. Cited by: §1.
- Induction of decision trees. Machine Learning 1 (1), pp. 81–106. External Links: Cited by: §2.1.1.
- Soft margins for Adaboost. Machine learning 42 (3), pp. 287–320. Cited by: §2.1.2.
- In defense of one-vs-all classification. Journal of machine learning research 5 (Jan), pp. 101–141. Cited by: §2.2.
Are loss functions all the same?. Neural Computation 16 (5), pp. 1063–1076. Cited by: §3.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.1.
- Revisiting popcount operations in cpus/gpus. In The International Conference for High Performance Computing, Networking, Storage and Analysis poster. Cited by: §1.
- Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pp. 1058–1066. Cited by: §4.1.
- From facial parts responses to face detection: a deep learning approach. In International Conference on Computer Vision, pp. 3676–3684. Cited by: §1.
- Deep forest: towards an alternative to deep neural networks. arXiv preprint:1702.08835. Cited by: §5.
Towards effective low-bitwidth convolutional neural networks. In Computer Vision and Pattern Recognition, pp. 7920–7928. Cited by: §1.