Neural networks are critical drivers of new technologies such as computer vision, speech recognition, and autonomous systems. As more data have become available, the size and complexity of NNs have risen sharply, with modern NNs containing millions or even billions of trainable parameters [1, 2]. These massive NNs come with the cost of large computational and storage demands. The current state of the art is to train large NNs on GPUs in the cloud, or on similar programmable processors with multiply-accumulate accelerators, a process that can take days to weeks even on powerful GPUs [1, 2, 3]. Once trained, the model can be used for inference, which is less computationally intensive and is typically performed on more general-purpose processors (i.e., CPUs). It is increasingly desirable to run inference, and even some re-training, on embedded processors which have limited resources for computation and storage. In this regard, model reduction has been identified as a key to NN acceleration by several prominent researchers. This is generally performed post-training to reduce the memory required to store the model for inference – e.g., methods for quantization, compression, and grouping parameters [6, 7, 8, 9].
Decreasing the time, computation, storage, and energy costs for training and inference is therefore a highly relevant goal. In this paper we present two compatible methods towards this end goal: (i) a method for introducing sparsity in the connection patterns of NNs, and (ii) a flexible hardware architecture that is compatible with training and inference-only operation and supports the proposed sparse NNs. Our approach to sparsifying a NN is extremely simple and results in a large reduction in storage and computational complexity both in training and inference modes. Moreover, this method is not tied to the hardware acceleration and provides the same benefits for training and inference in software under the current paradigm. The hardware architecture is massively parallel, but not tightly coupled to a specific NN architecture (i.e., not tied to the number of nodes in a layer). Instead, the architecture allows for maximum throughput for a given amount of circuit resources.
Our approach to making a NN sparse is to specify a sparse set of neuron connections prior to training and to hold this pattern fixed throughout training and inference. We refer to this method of simply excluding some fixed set of connections in the NN as pre-defined sparsity. There are several methods in the literature related to sparse NNs, but most do not reduce the computation and storage complexity associated with training, which is a primary goal of this work. One related concept is dropout, where selected edges in the NN are not processed during some steps of the training process, but the final result is a FC NN for inference. Another set of approaches targets producing a sparse NN for inference, but uses FC NNs during training. Among these are pruning and trimming methods that post-process the trained NN to produce a sparse NN for inference mode [11, 12, 13]. As mentioned above, other methods have been proposed for reducing the complexity of performing inference on a trained FC NN, such as quantization, compression, and grouping parameters [6, 7, 8, 9]. Other research has suggested learning sparsity during training by starting from a fully-connected NN and using a cost regularizer that promotes sparsity in the trained model. Note that none of these methods substantially reduces the complexity of training; they instead target inference models that have lower complexity. One method aimed at reducing both training and inference complexity is using NNs with structured, but not sparse, weight matrices [15, 16]. Finally, we note that several authors have very recently proposed pre-defined sparse NNs [17, 18, 19] independently of our published work [20, 21, 22].
Because specialized hardware is typically faster and more energy efficient than GPUs and CPUs, there exists a large body of literature on NN hardware acceleration. The vast majority of this addresses only inference given a trained model [23, 9, 24, 25, 26], with few works addressing hardware-accelerated training. Some prior work, for example, targets a specific NN size – i.e., the logic and memory architecture is tied to the number of neurons in a layer.
We propose an architecture that supports training, but can be simplified for inference-only mode, and is flexible to the NN size. This is particularly attractive for FPGA implementations. Specifically, the proposed architecture produces the maximum throughput on a given FPGA for a given NN and can therefore support various sized NNs on various sized FPGAs. This is accomplished by an edge-based processing architecture that can process several edges in a given layer in parallel; we refer to the number of edges processed in parallel as the degree of parallelism. A given FPGA can support some largest degree of parallelism, and NNs with more edges will simply take more clock cycles to process. (Footnote 1: We use the terms ‘connection’ and ‘edge’ interchangeably, as we do with ‘node’ and ‘neuron’. Also, the term ‘cycle’ will mean ‘clock cycle’, unless otherwise stated.)
Our edge-based architecture is inspired by architectures proposed for iterative decoding of modern sparse-graph-based error correction codes (i.e., Turbo and LDPC codes) (cf. [28, 29]). In particular, for a given processing task, there are logic units to perform the task and memories to store the quantities associated with the task. A challenge with this architecture, shared between the decoding and NN applications, is that, in order to achieve high throughput without memory duplication, the parallel memories must be accessed in two manners: natural order and interleaved order. In natural order, each computation unit is associated with one memory and accesses the elements of that memory sequentially. For interleaved order access, the computational units must access the memories such that no memory is accessed more than once in a cycle. Such an addressing pattern is called clash-free, and this property ensures that no memory contention occurs so that no stalls or wait states are required. For modern codes, the clash-free property of the memories is ensured by defining clash-free interleavers (i.e., permutations), or clash-free parity check matrices. In the context of NNs, this clash-free property is tied to the connection patterns between layers of neurons.
In addition to parallelism in edge processing within a given layer, our architecture is pipelined across layers. Thus, each layer has its own degree of parallelism, selected to set the number of cycles required to process a layer to a constant – i.e., larger layers have a larger degree of parallelism so that the computation time of all layers is the same. The NN is divided into pipeline stages across its layers so that a given NN input is processed in the time it takes to complete the processing of the edges in a single layer. Furthermore, the three operations associated with training – FF, BP, and UP – are performed in parallel. The architecture may be simplified to perform only inference by eliminating the logic and memory associated with BP and UP. Furthermore, while the architecture supports the reduced-complexity sparse NNs, it is also compatible with traditional FC networks. Interestingly, very recent work proposed pipelining across layers for an inference-only accelerator, as well as a scalable edge-based architecture for training, independently of our published work [20, 21]. Neither of these other recent works, however, takes advantage of pre-defined sparsity in the network.
In Section II we provide motivation for and simple examples of the effectiveness of pre-defined sparsity. In Section III the hardware architecture is described in detail, including defining a class of simple clash-free connection patterns with low address generation complexity. Section IV contains a detailed simulation study of pre-defined sparsity in NNs based on four different classification datasets – MNIST handwritten digits, Reuters news articles, TIMIT speech corpus, and CIFAR-100 images. We identify a set of trends or design guidelines in this section as well. This section also demonstrates that the simple, hardware-compatible clash-free connection patterns provide performance on par with or better than that of randomly connected sparse patterns. Finally, in Section V we consider the issue of whether pre-defining the structured sparse patterns causes a significant performance loss relative to other sparse methods having a similar number of parameters. We find that there is no significant performance degradation and therefore our hardware architecture can provide training and inference performance commensurate with state-of-the-art sparsity methods.
II Structured Pre-Defined Sparsity
II-A Definitions, Notation, and Background
An $L$-layer MLP has $N_i$ nodes in the $i$-th layer, described collectively by the neuronal configuration $N_{net} = (N_0, N_1, \dots, N_L)$, where layer 0 is the input layer. We use the convention that layer $i+1$ is to the ‘right’ of layer $i$. There are $L$ junctions between layers, with junction $i$ connecting the $N_{i-1}$ nodes of its left layer $i-1$ with the $N_i$ nodes of its right layer $i$.
We define pre-defined sparsity as simply not having all $N_{i-1} N_i$ edges present in junction $i$. Furthermore, we define structured pre-defined sparsity so that for a given junction $i$, each node in its left layer has fixed out-degree – i.e., $d_i^{out}$ connections to its right layer, and each node in its right layer has fixed in-degree – i.e., $d_i^{in}$ connections from its left layer. FC NNs have $d_i^{out} = N_i$ and $d_i^{in} = N_{i-1}$ with $N_{i-1} N_i$ edges in the junction, while a sparse NN has at least one junction with fewer than this number of edges. The number of edges (or weights) in junction $i$ is given by $|W_i| = N_{i-1} d_i^{out} = N_i d_i^{in}$. The density of junction $i$ is measured relative to FC and denoted as $\rho_i = |W_i| / (N_{i-1} N_i)$. The structured constraint implies that the number of possible $\rho_i$ values is equal to the GCD of $N_{i-1}$ and $N_i$, as shown in Appendix A. The overall density is
$$\rho_{net} = \frac{\sum_{i=1}^{L} |W_i|}{\sum_{i=1}^{L} N_{i-1} N_i}.$$
Thus, specifying $N_{net}$ and the out-degree configuration $d_{net}^{out} = (d_1^{out}, \dots, d_L^{out})$ determines the density of each junction and the overall density.
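The density definitions above reduce to a few lines of arithmetic. The following is a minimal sketch, not taken from the paper's code: the function names and the example configuration (800, 100, 10) are hypothetical, chosen only to illustrate how the out-degrees fix each junction density and the overall density, and why the number of admissible out-degrees equals the GCD of the two layer sizes.

```python
from math import gcd

def valid_out_degrees(n_left, n_right):
    """Out-degrees for which n_left * d_out is a multiple of n_right,
    so that every right neuron gets the same integer in-degree."""
    step = n_right // gcd(n_left, n_right)
    return list(range(step, n_right + 1, step))

def junction_density(n_left, n_right, d_out):
    """Edges in the junction divided by the edges of its FC version."""
    edges = n_left * d_out
    if edges % n_right != 0:
        raise ValueError("out-degree does not give an integer in-degree")
    return edges / (n_left * n_right)

def overall_density(n_net, d_out_net):
    """Total edges over all junctions relative to the FC network."""
    edges = sum(n_net[i] * d for i, d in enumerate(d_out_net))
    fc_edges = sum(n_net[i] * n_net[i + 1] for i in range(len(n_net) - 1))
    return edges / fc_edges

# Hypothetical configuration: sparse first junction, FC second junction.
n_net = (800, 100, 10)
d_out_net = (20, 10)
print(junction_density(800, 100, 20))      # density of junction 1
print(overall_density(n_net, d_out_net))   # overall density
print(len(valid_out_degrees(12, 8)))       # equals gcd(12, 8)
```

Because the first junction usually dominates the edge count, the overall density in such a configuration stays close to the density of that junction.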
We will also consider random pre-defined sparsity, where connections are distributed randomly given preset junction densities, without constraints on in- and out-degrees. In Sec. IV-B we show that random pre-defined sparsity is undesirable at low densities because it may result in unconnected neurons.
The standard equations for FC NNs are well-known . For a NN using structured pre-defined sparsity, only the weights corresponding to connected edges are stored in memory and used in computation. This leads to the modified equations (2a)–(7a), where subscripts denote layer/junction numbers, single superscripts denote neurons in a layer, and double superscripts denote (right neuron, left neuron) in a junction. The FF processing proceeds left-to-right and computes the activations and associated derivatives
for each layer by applying an activation function to a linear combination of biases, junction weights, and preceding layer activations.
Note that (4a) is used in training, but is not required in inference mode. The BP computation is done only in training and computes a sequence of error values from right to left, starting from the derivatives of the loss function at the output layer. Finally, the UP step applies stochastic gradient updates to the weights and biases.
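For reference, the sparse FF, BP, and UP computations can be sketched as follows. This is standard backpropagation restricted to the connected edges, written in assumed notation rather than copied from the paper: $a$, $a'$, $h$, $\delta$, $b$, $W$, $\sigma$, $C$, and the learning rate $\eta$ are our symbols, $k_f$ indexes the neurons connected to neuron $j$, and the output activation is assumed to be elementwise.

```latex
% FF: activations and their derivatives for right neuron j of junction i
h_i^{(j)} = \sum_{f=1}^{d_i^{in}} W_i^{(j,k_f)} \, a_{i-1}^{(k_f)} + b_i^{(j)}, \qquad
a_i^{(j)} = \sigma\!\left(h_i^{(j)}\right), \qquad
a_i'^{(j)} = \sigma'\!\left(h_i^{(j)}\right)

% BP: error values, computed right to left
\delta_L^{(j)} = \frac{\partial C}{\partial a_L^{(j)}} \, a_L'^{(j)}, \qquad
\delta_i^{(j)} = a_i'^{(j)} \sum_{f=1}^{d_{i+1}^{out}} W_{i+1}^{(k_f,j)} \, \delta_{i+1}^{(k_f)}

% UP: stochastic gradient update, only for connected pairs (j,k)
b_i^{(j)} \leftarrow b_i^{(j)} - \eta \, \delta_i^{(j)}, \qquad
W_i^{(j,k)} \leftarrow W_i^{(j,k)} - \eta \, a_{i-1}^{(k)} \, \delta_i^{(j)}
```

The only change relative to the FC case is the range of the sums: each sum runs over the in-degree or out-degree of the junction instead of over the full adjacent layer, which is where the savings in computation and storage come from.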
II-B Motivation and Preliminary Examples
Pre-defined sparsity can be motivated by inspecting the histogram of trained weights in a FC NN. Previous efforts have studied such statistics [3, 38], but not for individual junctions. Fig. 1 shows weight histograms for each junction in both a 2-junction and 4-junction FC NN trained on the MNIST dataset. Note that many of the weights are zero or near-zero after training, especially in the earlier junctions. This motivates the idea that some weights in these layers could be set to zero (i.e., the edges excluded). Even with this intuition, it is unclear whether one can pre-define a set of weights to be zero and let the NN learn around this pre-defined sparsity constraint. Fig. 1(c) and (h) show that this is in fact possible: they plot classification accuracy as a function of the overall density for structured pre-defined sparsity. Since the computational and storage complexity is directly proportional to the number of edges in the NN, operating at an overall density of, for example, 50% results in a 2X reduction in complexity both during training and inference. Detailed numerical experiments in Section IV build on these simple examples. However, before we proceed to those results, it is important to consider a hardware architecture that can support structured pre-defined sparsity and the additional clash-free constraints placed on the connection patterns, so that these can be considered in the studies in Section IV.
III Hardware Architecture
In this section we describe the proposed flexible hardware architecture outlined in the Introduction. The overall architectural view is captured by Fig. 2: sub-figure (a) shows parallel edge processing within a junction with degree of parallelism 3, (b) shows clash-free memory access, and (c) shows junction pipelining and parallel processing of the three operations – FF, BP, UP. Fig. 2(a)-(b) use a small toy junction. Fig. 2(a) shows that the blue edges are processed in parallel in one cycle, while the pink edges are processed in parallel during the next cycle. Fig. 2(b) shows how the FF processing logic units access the memories in natural and interleaved order. As described in detail in Sec. III-B, the interleaved order access may represent reading of the activations of the left layer and the natural order access may correspond to writing the computed activations of the right layer. On the next cycle, the remaining memory locations (i.e., the white cells) will be accessed. Note that this illustrates a clash-free connection pattern since each of the memories is accessed no more than once in each cycle – i.e., one hit per column on each access.
The junction-based operation in Fig. 2(b) is repeated for each junction in a pipeline. For example, for the FF pipeline, while the first stage is processing one input vector on junction 1, the second stage is processing the preceding input vector on junction 2. The degree of parallelism for each junction is selected so that the processing time for any operation (FF/BP/UP) is the same for each junction. Thus the throughput, i.e., the frequency of processing input samples, is determined by the time taken to perform a single operation in a single junction.
In summary, the architecture is (i) edge-based and not tied to a specific number of nodes in a layer, (ii) flexible in that the amount of logic is determined by the degree of parallelism which trades size for speed, and (iii) fully pipelined for the parallel operations associated with NN training. Also note that the architecture can be specialized to perform only inference by removing the logic and memory associated with the BP and UP operations, and the computation in (4a).
A key concern when implementing NNs on hardware is the large amount of storage required. Several characteristics regarding memory requirements guided us in developing the proposed architecture. Firstly, since weight memories are the largest, their number should be minimized. Secondly, having a few deep memories is more efficient in terms of power and area than having many shallow memories. Thirdly, throughput should be maximized without duplicating memories, hence the need for clash-free connection patterns.
In Sec. III-A, we describe the junction pipelining design, which attempts to minimize weight storage resources. The memory organization within a junction is described in Sec. III-B, and is designed to minimize the number of memories for a given degree of parallelism. Finally, clash-free access conditions are developed in Sec. III-B and III-C, and a simple method for implementing such patterns is given in Sec. III-C.
III-A Junction pipelining and Operational parallelism
Our edge-based architecture is motivated by the fact that all three operations – FF, BP, UP – use the same weight values for computation. Since $z_i$ edges are processed in parallel in a single cycle, where $z_i$ is the degree of parallelism of junction $i$, the time taken to complete an operation in junction $i$ is $C_i = |W_i| / z_i$ cycles, with $|W_i|$ the number of edges in the junction. The degree of parallelism configuration $z_{net} = (z_1, \dots, z_L)$ is chosen to achieve $C_i = C$ for every junction $i$. This allows efficient junction pipelining since each operation takes exactly $C$ cycles to be completed for each input in each junction, which we refer to as a junction cycle. This determines throughput. (Footnote 2: During hardware implementation, a few extra cycles may be needed to flush the pipeline, so that the junction cycle slightly exceeds $|W_i| / z_i$. These extra cycles are also balanced across junctions to achieve efficient pipelining. This was the case, for example, in our initial implementation.)
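The balancing rule can be illustrated with a short calculation. The sketch below is ours, not the paper's: the function name, the edge counts, the 50-cycle junction cycle, and the 100 MHz clock are all hypothetical values chosen for illustration, and pipeline flush cycles are ignored.

```python
def balanced_parallelism(edge_counts, junction_cycle):
    """Degree of parallelism per junction so that every junction takes
    exactly `junction_cycle` clock cycles (edges / parallelism)."""
    z_net = []
    for edges in edge_counts:
        if edges % junction_cycle != 0:
            raise ValueError("edge count must be a multiple of the junction cycle")
        z_net.append(edges // junction_cycle)
    return z_net

# Hypothetical two-junction network with 16000 and 1000 edges.
edge_counts = [16000, 1000]
junction_cycle = 50                      # assumed target, in clock cycles
z_net = balanced_parallelism(edge_counts, junction_cycle)
print(z_net)                             # larger junction gets more parallelism

clock_hz = 100e6                         # assumed clock frequency
print(clock_hz / junction_cycle)         # inputs processed per second
```

The trade-off is visible directly: halving the junction cycle doubles throughput but also doubles the degree of parallelism, and hence the logic and number of memories, in every junction.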
The following analyzes Fig. 2(c) in more detail for an example NN with two junctions. While a new training input is being loaded into the input layer, junction 1 is processing the FF stage for the previous input and computing the activations of its right layer. Simultaneously, junction 2 is processing FF and computing cost via cost derivatives for an earlier input. It is also doing BP on a still earlier input to compute the error values of the hidden layer, as well as updating (UP) its parameters from a finished computation. Simultaneously, junction 1 is performing UP using the error values from the finished BP results of an older input. This results in operational parallelism in each junction, as shown in Fig. 3, and in a combined speedup compared to doing one operation at a time for a single input. (Footnote 3: Note that BP does not occur in the first junction because there are no error values to be computed for the input layer.)
Notice from Fig. 3 that there is only one weight memory bank, which is accessed for all three operations. However, UP in junction 1 needs access to the left activations of the input being updated, as per the weight update equation (8a). Because newer inputs enter the pipeline before that UP takes place, multiple left activation memory banks are needed to store the activations of all inputs in flight, i.e., a queue-like structure. Similarly, UP in junction 2 needs queued banks for each of its left activation and activation derivative memories, covering the inputs from the one whose values are being read to the one whose values are being computed and written. There also need to be 2 banks for all error value memories – 1 for reading and the other for writing. Thus junction pipelining requires multiple memory banks, but only for the layer parameters (activations, activation derivatives, and error values), not for weights. The number of layer parameters is insignificant compared to the number of weights for practical networks. This is why pre-defined sparsity leads to significant storage savings, as quantified in Table I for the circled FC point vs the sparse point from Fig. 1(c). Specifically, memory requirements are reduced by 3.9X in this case. Furthermore, the computational complexity, which is proportional to the number of weights for a MLP, is reduced by 4.8X. For this example, these complexity reductions come at the cost of some degradation in classification accuracy. (Footnote 4: This is achieved by making the weight memory dual-port, while the activation and activation derivative memories are single-ported. The error value memories are also dual-ported due to the exact manner in which we implemented this architecture on FPGA; see the description of that implementation for full details.)
|Parameter||Expression||Count (FC)||Count (sparse)|
III-B Memory organization
For the purposes of memory organization, edges are numbered sequentially from top to bottom on the right side of the junction. Other network parameters, such as activations, activation derivatives, and error values, are numbered according to the neuron numbers in their respective layer. Consider Fig. 4 as an example, where junction $i$ is flanked by $N_{i-1} = 12$ left neurons with out-degree $d_i^{out} = 2$ and $N_i = 8$ right neurons, leading to $|W_i| = 24$ edges and in-degree $d_i^{in} = 3$. The three weights connecting to right neuron 0 are numbered 0, 1, 2; the next three connecting to right neuron 1 are numbered 3, 4, 5, and so on. A particular right neuron connects to some subset of left neurons of cardinality $d_i^{in}$.
Each type of network parameter is stored in a bank of memories. The example in Fig. 4 uses degree of parallelism $z_i = 4$, i.e., 4 weights are accessed per cycle. We designed the weight memory bank to have the minimum number of memories to prevent clashes, i.e., $z_i$, and their depth equals the junction cycle $C_i = |W_i| / z_i = 6$. Weight memories are read in natural order – 1 row per cycle (shown in same color).
Right neurons are processed sequentially due to the weight numbering. The number of right neuron parameters of a particular type needing to be accessed in a cycle is bounded, which constrains the design of the right memory bank in order to prevent clashes. (Footnote 5: This does not limit most practical designs; see Appendix B.) For FF in Fig. 4, for example, cycles 0 and 1 finish computation of the activations of right neurons 0 and 1 respectively, while cycle 2 finishes computing the activations of both right neurons 2 and 3. For BP or UP, everything remains the same except for the right memory accesses. Now the error values of right neurons 0 and 1 are used in cycle 0, those of neurons 1 and 2 in cycle 1, and those of neurons 2 and 3 in cycle 2. Thus the maximum number of right neuron parameters ever accessed in a cycle is 2.
Since edges are interleaved on the left, in general, the $z_i$ edge processing logic units will need access to $z_i$ parameters of a particular type from layer $i-1$. So all the left memory banks have $z_i$ memories, each of depth $N_{i-1} / z_i$, which are accessed in interleaved order. For example, after $N_{i-1} / z_i = 3$ cycles, $N_{i-1} = 12$ edges have been processed. We require that each of these edges be connected to a different left neuron, which eliminates the possibility of duplicate edges. This completes a sweep, i.e., one complete access of the left memory bank. Since each left neuron connects to $d_i^{out}$ edges, $d_i^{out}$ sweeps are required to process all the edges, i.e., each left activation is read $d_i^{out}$ times in the whole junction cycle. The reader can verify that $N_{i-1} / z_i$ cycles per sweep multiplied by $d_i^{out}$ sweeps results in $C_i$ total cycles, i.e., one junction cycle.
III-C Clash-free connection patterns
We define a clash as attempting to perform a particular operation more than once on the same memory at the same time, which would stall processing. The idea of clash-freedom is to pre-define a pattern of connections and degree-of-parallelism values such that no operation in any junction of the NN results in a clash. Sec. III-B described how the degree of parallelism should be chosen to prevent clashes in the weight and right memory banks. (Footnote 6: For single-ported memories, attempting two reads or two writes or a read and a write in the same cycle is a clash. For simple dual-ported memories with one port exclusively for reading and the other exclusively for writing, a read and a write can be performed in the same cycle. Attempting to perform two reads or two writes in the same cycle is a clash.)
This subsection analyzes the left memory banks, which are accessed in interleaved order. Their memory access pattern should be designed so as to prevent clashes. Additionally, the following properties are desired for practical clash-free patterns. Firstly, it should be easy to find a pattern that gives good performance. Secondly, the logic and storage required to generate the left memory addresses should be low complexity.
We generate clash-free patterns by initially specifying the left memory addresses to be accessed in cycle 0 using a seed vector with one entry per left memory. Subsequent addresses are cyclically generated, i.e., each address is incremented every cycle and wraps around at the memory depth. Considering Fig. 4 as an example, the left memories have depth 3, so cycles 3–5 access the same left neurons as cycles 0–2.
We found that this technique results in a large number of possible connection patterns, as discussed in Appendix C. Randomly sampling from this set results in performance comparable with non-clash-free NNs, as shown in Sec. IV-B. Finally, our approach only requires storing the seed vector and using incrementers to generate subsequent addresses. This approach is similar to methods used in modern coding to allow parallel processing and memory accesses, cf. [28, 30, 29]. Other techniques to generate clash-free connection patterns are discussed in Appendix C.
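The seed-vector scheme can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the function name, the seed (1, 0, 2, 2), and the assumed storage layout (left neuron n stored in memory n mod z at address n div z) are all assumptions, and the junction sizes are the Fig. 4 values inferred from the text.

```python
from collections import Counter

def clash_free_pattern(n_left, n_right, d_out, z, seed):
    """Generate (right neuron, left neuron) edges from a seed vector.

    Assumed layout: left neuron n lives in memory n % z at address n // z.
    Logic unit m always reads memory m, so no memory is hit twice per cycle.
    """
    depth = n_left // z                  # addresses per left memory
    total_edges = n_left * d_out
    d_in = total_edges // n_right
    edges = []
    for cycle in range(total_edges // z):
        for m in range(z):
            address = (seed[m] + cycle) % depth   # incrementer with wrap-around
            left = address * z + m
            edge_number = cycle * z + m           # weights read in natural order
            right = edge_number // d_in           # edges numbered by right neuron
            edges.append((right, left))
    return edges

edges = clash_free_pattern(n_left=12, n_right=8, d_out=2, z=4, seed=(1, 0, 2, 2))
print([left for _, left in edges[:4]])            # left neurons read in cycle 0
print(Counter(left for _, left in edges))         # out-degree of each left neuron
print(Counter(right for right, _ in edges))       # in-degree of each right neuron
```

With these parameters every left neuron ends up with the same out-degree and every right neuron with the same in-degree, so the generated pattern is both clash-free and structured. Only the seed needs to be stored; everything else is produced by the incrementers.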
III-D Batch size
It is common in training of NNs to use minibatches. For a given batch size, the UP operation in (7a) is performed only once per batch by using the average over the gradients of all inputs in the batch. Our architecture performs an UP for every input and therefore may be viewed as having batch size one. However, the processing in our architecture differs from a typical software implementation with batch size one due to the pipelined and parallel operations. Specifically, in our architecture, FF and BP for the same input use different weights, as implied by Fig. 2(c). In results not presented here, we found no performance degradation due to this variation from the standard backpropagation algorithm. There is considerable ambiguity in the literature regarding ideal batch sizes [41, 42], and we found that our current network architecture performed well in our initial hardware implementation. However, if a more conventional batch size is desired, the UP logic can be removed from the junction pipeline and the UP operation performed once per batch. This would eliminate some arithmetic units at the cost of increased storage for accumulating intermediate values from (7a).
III-E Special Case: Processing a FC junction
Fig. 5 shows the FC version of the junction from Fig. 4, which has 96 edges to be accessed and operated on. This can be done keeping the same junction cycle by increasing the degree of parallelism to 16, i.e., using more hardware. On the other hand, if hardware resources are limited, one can keep the same degree of parallelism of 4 and pay the price of a longer junction cycle of 24, as shown in Fig. 5. This demonstrates the flexibility of our architecture.
Note that FC junctions are clash-free in all practical cases for the following reasons. Firstly, the left memory accesses are in natural order just like the weights, which ensures that no more than one element is accessed from each memory per cycle. Secondly, in all practical cases the degree of parallelism does not exceed the number of left neurons, as discussed in Appendix B, and the in-degree equals the number of left neurons for FC junctions. This means that at most one right neuron is processed in a cycle, so clashes will never occur when accessing the right memory bank. (Footnote 7: In Fig. 5, for example, a right neuron finishes processing only once every few cycles.)
Note that compared to Fig. 4, the weight memories in Fig. 5 are deeper since the junction cycle has increased from 6 to 24. However, the left layer memories remain the same size since the number of left neurons and the degree of parallelism are unchanged, but the left memory bank is accessed more times since the number of sweeps has increased from 2 to 8. Also note that even if cycle 0 (blue) accesses some other clash-free subset of left neurons, the connection pattern would remain unchanged. This implies that different memory access patterns do not necessarily lead to different connection patterns, as discussed further in Appendix C.
IV Observed Trends of Pre-Defined Sparsity
This section analyzes trends observed when experimenting with several different datasets via software simulations. We intend the following four trends to provide guidelines on designing pre-defined sparse NNs.
1. Hardware-compatible, clash-free, pre-defined sparse patterns perform at least as well as other pre-defined sparse patterns (i.e., random and structured) (Sec. IV-B).
2. The performance of pre-defined sparsity is better on datasets that have more inherent redundancy (Sec. IV-C).
3. Junction density should increase to the right: junctions closer to the output should generally have more connections than junctions closer to the input (Sec. IV-D).
4. Larger and more sparse NNs are better than smaller and denser NNs, given the same number of layers and trainable parameters. Specifically, ‘larger’ refers to more hidden neurons (Sec. IV-E).
The remainder of this section first describes the datasets we experimented on, and then examines these trends in detail.
IV-A Datasets and Experimental Configuration
Unless otherwise noted, the parameters and configurations listed below were used for all presented results.
MNIST handwritten digits
We rasterized each input image into a single layer of 784 features, i.e., the permutation-invariant format. No data augmentation was applied. (Footnote 8: On certain occasions we added 16 input features which are always trivially 0 so as to get 800 features for each input. This leads to easier selection of different sparse network configurations.)
Reuters RCV1 corpus of newswire articles
The classification categories are grouped in a tree structure. We used preprocessing techniques similar to prior work to isolate articles which fell under a single category at the second level of the tree. We finally obtained 328,669 articles in 50 categories, split into validation, test, and training sets. The original data has a list of token strings for each story; for example, a story on finance would frequently contain the token ‘financ’. We chose the most common 2000 tokens and computed counts for each of these in each article. Each count was then transformed to form the final 2000-dimensional feature vector for each input.
TIMIT speech corpus
TIMIT is a speech dataset of 16 kHz audio commonly used in ASR. A modern ASR system has three major components: (i) preprocessing and feature extraction, (ii) acoustic model, and (iii) dictionary and language model. A complete study of an ASR system is beyond the scope of this work. Instead we focus on the acoustic model, which is typically implemented using a NN. The input to the acoustic model is feature vectors and the output is a probability distribution on phonemes (i.e., speech sounds). For our experiments, we used 25 ms speech frames with 10 ms shift and computed a feature vector of 39 MFCCs for each frame. We used the complete training set (462 speakers), together with validation samples from 50 speakers and test samples from 118 speakers. We used a phoneme set of size 39.
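As a quick check of the framing arithmetic, 25 ms frames with a 10 ms shift at 16 kHz correspond to 400-sample windows advanced by 160 samples. The sketch below counts frames for an utterance under the assumption that no padding is applied and partial trailing frames are dropped; the function name and that convention are ours, not stated in the paper.

```python
def frame_count(num_samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Number of full frames in an utterance (no padding assumed)."""
    frame_len = sample_rate * frame_ms // 1000   # samples per frame
    shift = sample_rate * shift_ms // 1000       # samples between frame starts
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // shift

# One second of 16 kHz audio.
print(frame_count(16000))
```

Each frame then yields one 39-dimensional MFCC vector, i.e., one input sample for the acoustic model.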
Our setup for CIFAR-100 consists of a CNN followed by a MLP. The CNN has 3 blocks and each block has 2 convolutional layers with window size 3x3 followed by a max pooling layer of pool size 2x2. The number of filters for the six convolutional layers is (60, 60, 125, 125, 250, 250). This results in a total of approximately one million trainable parameters in the convolutional portion of the network. Batch normalization is applied before activations. The output from the 3rd block, after flattening into a vector, has 4000 features. Typically dropout is applied in the MLP portion; however, we omitted it there since pre-defined sparsity is an alternate form of parameter reduction. Instead we found that a dropout probability of 0.5 applied to the convolutional blocks improved performance. No data augmentation was applied.
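The quoted parameter count and feature size can be verified with a short calculation. The sketch below assumes 32x32 RGB inputs, 'same' padding so that only pooling changes the spatial size, one bias per filter, and no batch normalization parameters in the count; these assumptions and the function names are ours.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of one k x k convolutional layer."""
    return k * k * c_in * c_out + c_out

def cnn_summary(filters, in_channels=3, side=32, k=3):
    """Total conv parameters and flattened output size.

    Assumes 'same' padding and a 2x2 max pool after every second layer.
    """
    total, c_in = 0, in_channels
    for index, c_out in enumerate(filters):
        total += conv_params(c_in, c_out, k)
        c_in = c_out
        if index % 2 == 1:        # end of a two-layer block
            side //= 2
    return total, side * side * c_in

params, features = cnn_summary((60, 60, 125, 125, 250, 250))
print(params)     # roughly one million
print(features)   # size of the vector fed to the MLP
```

Under these assumptions the flattened output is a 4x4 map with 250 channels, which matches the 4000 features stated above.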
For each dataset, we performed classification using one-hot labels and measured accuracy on the test set as a performance metric. We also calculated the top-5 test set classification accuracy for CIFAR-100. (Footnote 9: The NN in a complete ASR system would be a ‘soft’ classifier and feed the phoneme distribution outputs to a decoder to perform ‘hard’ final classification decisions. Therefore for TIMIT, we computed another performance metric called TPC, measured as KL divergence between predicted test output probability distributions of sparse vs the respective FC case. Performance results obtained using TPC were qualitatively very similar to test accuracy and are not shown here.)
We found the optimal training configuration for each FC setup by doing a grid search using validation performance as a metric. This resulted in choosing ReLU activations for all layers except for the final softmax layer. The initialization proposed by He et al. worked best for the weights, while for biases, we found that a nonzero constant initial value worked best in all cases except for Reuters, for which zeroes worked better. The Adam optimizer was used with all parameters set to default, except for the decay parameter, which we tuned for best results. We used a batch size of 1024 for TIMIT and Reuters since the number of training samples is large, and 256 for MNIST and CIFAR.
All experiments were run for 50 epochs of training, and regularization was applied as an L2 penalty to the weights. To maintain consistency, we kept most hyperparameters the same when sparsifying the network, but reduced the L2 penalty coefficient with increasing sparsity. This was done because sparse NNs have fewer trainable parameters and are less prone to overfitting. We ran each experiment at least five times to average out randomness and we show the 90% CIs for each metric as shaded regions (this also holds for the results in Fig. 1(c,h)). In addition to the results shown, we developed a dataset of Morse code symbol sequences and investigated pre-defined sparse NNs on it. While these results are excluded for brevity, they are consistent with the trends described in this section, and can be found in our work on Morse code datasets (S. Dey, K. M. Chugg, and P. A. Beerel, ICCCNT 2018).
IV-B Comparison of Pre-Defined Sparse Methods
Table II shows performance on different datasets for three methods of pre-defined sparsity: a) the most restrictive and hardware-friendly clash-freedom, b) structured, and c) random. For the clash-free case, we experimented with different settings to simulate different hardware environments:
Reuters: One junction cycle is 50 cycles for all the different densities. This is because we scale accordingly, i.e., a more powerful hardware device is used for each NN as increases.
CIFAR-100 and MNIST: These simulate cases where hardware choice is limited, such as a high-end, a mid-range and a low-end device being available. Thus three different values are used for CIFAR-100 depending on .
TIMIT: We keep constant for different densities. Junction cycle length varies from 90 cycles for to 810 for . This shows that when limited to a single low-end hardware device, denser NNs can be processed in longer time by simply changing .
Table II confirms that hardware-friendly clash-free pre-defined sparse architectures do not lead to any statistically significant performance degradation. We also observed that random pre-defined sparsity performs poorly for very low density networks, as shown by the blue values. This is possibly because there is non-negligible probability of neurons getting completely disconnected, leading to irrecoverable loss of information.
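The disconnection argument for random pre-defined sparsity can be made concrete with a short calculation. The sketch below assumes that the random method picks a fixed number of edges uniformly at random, without replacement, from all possible edges of a junction; the text does not specify the sampling scheme, and the junction size used in the example is hypothetical.

```python
from math import comb


def p_isolated(n_left, n_right, n_edges, side="left"):
    """Probability that one given neuron on `side` receives no edge at all.

    Assumes n_edges distinct edges are drawn uniformly at random from the
    n_left * n_right possible edges of the junction.
    """
    total = n_left * n_right
    # Number of possible edges that touch the neuron in question.
    own = n_right if side == "left" else n_left
    return comb(total - own, n_edges) / comb(total, n_edges)


# Hypothetical junction with 800 left and 100 right neurons at 1% density.
edges = (800 * 100) // 100
p_left = p_isolated(800, 100, edges, side="left")
p_right = p_isolated(800, 100, edges, side="right")
print(p_left, p_right)
```

In this hypothetical junction roughly a third of the left neurons would be expected to have no outgoing edge at all, whereas a structured pattern with fixed out-degree and in-degree cannot disconnect any neuron.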
IV-C Dataset Redundancy
Many machine learning datasets have considerable redundancy in their input features. For example, one may not need information from the 800 input features of MNIST to infer the correct image class. We hypothesize that pre-defined sparsity takes advantage of this redundancy, and will be less effective when the redundancy is reduced. To test this, we changed the feature vector for each dataset as follows. For MNIST, PCA was used to reduce the feature count to the least redundant 200. For Reuters, the number of most frequent tokens considered as features was reduced from 2000 to 400. For TIMIT, we both reduced and increased the number of MFCCs by 3X to 13 and 117, respectively. Note that the latter increases redundancy. For CIFAR-100, a source of redundancy is the depth of the CNN, which extracts features and discriminates between classes before the MLP performs final classification. In other words, the CNN eases the burden of the MLP. So a way to reduce redundancy and increase the classification burden of the MLP is to lessen the effectiveness of the CNN by reducing its depth. Accordingly, we used a single convolutional layer with 250 filters of window size followed by a max pooling layer. This results in the same number of features, 4000, at the input of the MLP as the original network, but has reduced redundancy for the MLP.
Classification performance results are shown in Fig. 6 as a function of . For MNIST and CIFAR-100, the performance degrades more sharply with reducing for the nets using the reduced redundancy datasets. To explore this further, we recreated the histograms from Fig. 1 for the reduced redundancy datasets, i.e., a FC NN with training on MNIST after PCA. We observed a wider spread of weight values, implying less opportunity for sparsification (i.e., fewer weights were close to zero). Similar trends are less discernible for Reuters and TIMIT, however, reducing redundancy led to worse performance overall.
The results in Fig. 6 further demonstrate the effectiveness of pre-defined sparsity in greatly reducing network complexity with negligible performance degradation. For example, even the reduced redundancy problems perform well when operating with half the number of connections. For CIFAR in particular, FC performs worse than an overall MLP density of around 20%. Thus, in addition to reducing complexity, structured pre-defined sparsity may be viewed as an alternative to dropout in the MLP for the purpose of improving classification performance.
IV-D Individual junction densities
The weight histograms in Fig. 1 indicate that later junctions, particularly the junction closest to the output, have a wide spread of weight values. This suggests that a good strategy for reducing would be to use lower densities in earlier junctions – i.e., . This is demonstrated in Fig. 7 for the cases of MNIST, CIFAR-100 and Reuters, each with junctions in their MLPs. Each curve in each subfigure is for a fixed , i.e., reducing across a curve is done solely by reducing . For a fixed , the performance improves as increases. For example, the circled points in Reuters both have , but the starred point with has approximately better test accuracy than the pentagonal point with . The trend clearly holds for MNIST and is also discernible for CIFAR-100.
We further observed that this trend (i.e., should hold) is related to the redundancy inherent in the dataset and may not hold for datasets with very low levels of redundancy. To explore this, results analogous to those in Fig. 7 are presented in Fig. 8 for TIMIT, but with varying sized MFCC feature vectors – i.e., datasets corresponding to larger feature vectors will contain more redundancy. The results in Fig. 8(c) are for 117 dimensional MFCCs and are consistent with the trend in Fig. 7. However, for a MFCC dimension of 13, this trend actually reverses – i.e., the junction 1 should have higher density. This is shown in Fig. 8(b), where each curve is for a fixed . This reversed trend is also observed for the case of 39 dimensional feature vectors, considered in Fig. 8(a), where . Due to this symmetric neuronal configuration, for each value of on the x-axis in Fig. 8(a), the two curves have complementary values of and () – e.g., the two curves at have pairs of and . We observe that the curve for is generally worse than the curve for , which indicates that junction 1 should have higher density in this case.
Fig. 8(d) depicts the results for Reuters with the feature vector size reduced to 400 tokens. While junction 2 is still more important (as in Fig. 7(c) for the original Reuters dataset), notice the circled star-point at the very left of the curve. This point has very low . Unlike Fig. 7(c), it crosses below the other curves, indicating that it is more important to have higher density in the first junction with this less redundant set of features. We observed a similar, but less prominent, trend in MNIST PCA when the feature dimension was reduced to 200.
In summary, if an individual junction density falls below a certain value, referred to as the critical junction density, it will adversely affect performance regardless of the density of other junctions. This explains why some of the curves cross in Fig. 8. The critical junction density is much smaller for earlier junctions than for later junctions in most datasets with sufficient redundancy. However, the critical density for earlier junctions increases for datasets with low redundancy.
IV-E ‘Large and sparse’ vs ‘small and dense’ networks
We observed that when keeping the total number of trainable parameters the same, sparser NNs with larger hidden layers (i.e., more neurons) generally performed better than denser networks with smaller hidden layers. This is true as long as the larger NN is not so sparse that individual junction densities fall below the critical density, as explained in Sec. IV-D. While the critical density is problem-dependent, it is usually low enough to obtain significant complexity savings above it. Thus, ‘large and sparse’ is better than ‘small and dense’ for many practical cases, including NNs with more than one hidden layer (i.e., ).
Fig. 9 shows this for networks having one and three hidden layers trained on MNIST. For the three layer network, all hidden layers have the same number of neurons. Each solid curve shows classification performance vs for a particular , while the black dashed curves with identical markers are configurations that have approximately the same number of trainable parameters. As an example, the points with circular markers (with a big blue ellipse around them) in Fig. 9(b) all have the same number of trainable parameters and indicate that the larger, more sparse NNs perform better. Specifically, the network with and corresponding to performs significantly better than the FC network with , and other smaller and denser networks, despite each having trainable parameters. Increasing the network size further to , and reducing to to fix the number of trainable parameters at , leads to performance degradation. This is because this was achieved by setting , which appears to be below the critical density.
Fig. 10 summarizes the analogous experiment on Reuters with similar conclusions. Both subfigures are for the same results with the x-axis split into higher and lower density range (on log scale), to show more detail. Observe that the trend of ‘large and sparse’ being better than ‘small and dense’ holds for subfigure (a), but reverses for (b) since densities are very low (the black dashed curves have positive slope instead of negative). This is due to the critical density effect.
Fig. 11(a) shows the result for the same experiment on TIMIT with four hidden layers (footnote 11). The trend is less clearly discernible, but it exists. Notice how the black dashed curves have negative slopes at appreciable levels of , indicating ‘large and sparse’ being better than ‘small and dense’, but high positive slopes at low , indicating the rapid degradation in performance as density is reduced beyond the critical density. This is exacerbated by the fact that TIMIT with 39 MFCCs is a dataset with low redundancy, so the effects of very low are better observed.
Footnote 11: We also performed experiments on TIMIT with one hidden layer and Reuters with two hidden layers. Results were similar to those shown and are omitted for brevity.
Fig. 11(b) for the MLP portion of CIFAR-100 shows similar results as TIMIT, but on a log x-scale for more clarity. As noted in Sec. IV-C, the best performance for a given occurs at an overall density less than . It appears that for any for CIFAR-100, peak performance occurs at around – overall MLP density. In experiments not shown here, we obtained similar results for the reduced redundancy net with a single convolutional layer.
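The equal-parameter comparison used throughout this subsection can be set up as sketched below. This is an illustration under assumptions: biases are ignored, a single overall density is applied to the whole MLP, and the layer sizes and function names are hypothetical rather than taken from the experiments.

```python
def fc_weight_count(layer_sizes):
    """Number of weights in the fully connected version of an MLP."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def density_for_budget(layer_sizes, weight_budget):
    """Overall density giving `weight_budget` weights, or None if the
    budget exceeds what the fully connected network can hold."""
    fc_weights = fc_weight_count(layer_sizes)
    if weight_budget > fc_weights:
        return None
    return weight_budget / fc_weights


# Fix the budget to that of a small FC network, then grow the hidden layer.
budget = fc_weight_count((800, 25, 10))
configs = {
    hidden: density_for_budget((800, hidden, 10), budget)
    for hidden in (25, 50, 100, 200)
}
print(configs)
```

Each entry pairs a hidden layer size with the overall density that keeps the weight count fixed. Whether the larger, sparser entries actually perform better still depends on every junction staying above its critical density, as discussed above.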
V Comparison to Other Sparse NN Methods
Numerical results in Sec. IV showed that hardware-compatible clash-free connection patterns performed as well as structured and random pre-defined sparse connections. In this section, we compare clash-free patterns against two sparsity approaches that are less constrained than the structured pre-defined sparsity considered in Sec. IV. In particular, both approaches remove the constraint of regular degree – i.e., these approaches yield sparse NNs that have varying and selected to optimize classification performance.
V-A Attention-based Preprocessed Sparsity
Attention-based methods have been used on object recognition and image captioning to achieve better performance with fewer parameters and less computation. We simplify this idea by computing the variance of input features as attention and setting the out-degree of the neurons of the input layer based on this value. Specifically, the feature variances are quantized into three levels, and input neurons with higher attention are assigned more connections than those with lower attention. For the neurons in later layers, we use uniform out-degree and in-degree.
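A minimal sketch of the variance-based assignment described above is given here. The tercile-based quantization, the three out-degree values, and the function name are assumptions made for illustration; the text only states that variances are quantized into three levels and that higher attention receives more connections.

```python
from statistics import pvariance


def attention_out_degrees(samples, degrees=(2, 4, 6)):
    """Assign an out-degree to every input feature from its variance.

    samples: list of equal-length feature vectors.
    degrees: hypothetical out-degrees for low, medium and high attention.
    """
    n_features = len(samples[0])
    variances = [pvariance([s[j] for s in samples]) for j in range(n_features)]
    # Rank features by variance and split the ranking into three levels.
    order = sorted(range(n_features), key=lambda j: variances[j])
    out_degrees = [0] * n_features
    for rank, j in enumerate(order):
        level = min(3 * rank // n_features, 2)
        out_degrees[j] = degrees[level]
    return out_degrees


# Toy data: feature j takes the values 0 and j, so variance grows with j.
toy = [[0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5]]
print(attention_out_degrees(toy))
```

In practice the three out-degree values would be chosen so that the total edge count of the first junction matches the target density.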
V-B Learning Structured Sparsity during Training
While the method in Sec. V-A obtains a non-uniform neuron out-degree for the first layer, it only considers the properties of the dataset and not the learning process. We also compared against the method of LSS, which learns a good sparse connection pattern during training. This method prunes the connections during training by using a sparsity-promoting penalty function as part of the objective function. Example penalty functions include L1 and L1/L2, used in Lasso and group-Lasso, respectively. During training, the optimizer minimizes a balancing objective comprising the loss function (footnote 12), the regularizer, and a sparsity-promoting penalty function, where the penalty coefficients control the density of each junction. Increasing a penalty coefficient decreases the density of the corresponding junction; however, obtaining a specific density requires experimental tuning of the coefficients. In the results presented in this section, we used L1 as the element-wise sparsity-promoting penalty function and L2 as the regularizer. Note that, in contrast to the attention-based method and the structured pre-defined sparsity approach, LSS is not a pre-defined sparsity method. Instead, training in LSS begins with a FC network, which means that training complexity is similar to that of a FC NN. At the end of the LSS training process, weights with absolute value below a threshold are set to zero to achieve the target density.
Footnote 12: Here the loss function is written as depending on all of the trainable parameters in the network, as opposed to the output layer activations and ground truth labels as done in Sec. II-A. This emphasizes that the loss function can promote sparsity by driving some edge weights to zero.
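The two ingredients of the LSS procedure described above, a penalized objective and a final magnitude threshold, can be sketched as follows. This is an illustration under assumptions, not the method's reference implementation: the objective is taken to be the loss plus an L2 term plus a per-junction L1 term, the threshold is chosen as the magnitude of the last weight to keep, and all names are hypothetical.

```python
def lss_objective(loss, junction_weights, l1_coeffs, l2_coeff):
    """loss + l2_coeff * sum(w^2) + sum_i l1_coeffs[i] * sum(|w| in junction i)."""
    l2_term = sum(w * w for ws in junction_weights for w in ws)
    l1_term = sum(
        coeff * sum(abs(w) for w in ws)
        for coeff, ws in zip(l1_coeffs, junction_weights)
    )
    return loss + l2_coeff * l2_term + l1_term


def prune_to_density(weights, target_density):
    """Zero the smallest-magnitude weights so that about
    target_density * len(weights) weights survive."""
    keep = round(target_density * len(weights))
    if keep == 0:
        return [0.0] * len(weights)
    threshold = sorted((abs(w) for w in weights), reverse=True)[keep - 1]
    # Ties at the threshold are all kept, so density can slightly overshoot.
    return [w if abs(w) >= threshold else 0.0 for w in weights]


trained = [0.5, -0.01, 0.3, 0.02, -0.7, 0.0]
print(prune_to_density(trained, 0.5))
print(lss_objective(1.0, [[1.0, -2.0], [3.0]], [0.1, 0.2], 0.01))
```

Because the whole FC network is trained before the threshold is applied, this approach saves complexity only at inference time, which is the contrast with pre-defined sparsity drawn above.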
V-C Performance comparison
Fig. 12 compares performance versus of different sparse NNs on MNIST, Reuters, and TIMIT. The individual density of each junction with the attention-based preprocessed sparse method is set to be identical to the density of each junction using the clash-free pre-defined sparse method. However, the density of the nets using the LSS method can be tuned only with the penalty coefficients. We tuned these to approximately match the density of the other methods (footnote 13).
Footnote 13: This is why values of the green curves do not perfectly align with the pre-defined sparsity curves.
The LSS method performs best among all sparse methods, which is to be expected as it is the least constrained and also discovers a good sparse connection pattern during training. However, the performance with clash-free pre-defined sparsity is near that of the attention-based and LSS methods – i.e., within in terms of test accuracy at . We conclude that even though the clash-free patterns are highly structured and pre-defined, there is no significant performance degradation when compared to advanced methods for producing sparse models by exploiting specific properties of the dataset or learning sparse patterns during training.
VI Conclusions and Future Work
In this work we proposed a new technique for complexity reduction of neural networks – pre-defined sparsity – in which a fixed sparse connection pattern is enforced prior to training and held fixed during both training and inference. We presented a hardware architecture suited to leverage the benefits of structured pre-defined sparsity, capable of parallel and pipelined processing. The architecture can be used for both training and inference modes, and supports networks of arbitrary density, including conventional fully-connected ones. Flexibility is afforded by the degree of parallelism, which trades hardware complexity for speed. We also presented simple methods for clash-free memory access and showed that they achieve performance on par with the best known methods for obtaining sparse MLPs.
Using extensive numerical experiments, we identified trends which help in designing pre-defined sparse networks. Firstly, it is better to allocate connections in a structured manner rather than randomly. Secondly, for most datasets with high redundancy, earlier junctions can be made more sparse. Thirdly, it is better to have more neurons in the hidden layers, and then sparsify aggressively to keep the number of edges low and reduce complexity.
As motivated in the Introduction, the rapidly growing complexity associated with modern NNs is a major challenge. Pre-defined sparsity is a simple method to help address this challenge, as is acceleration with custom hardware. Interesting areas for future research include analytical approaches to justify the trends observed in this work and improving our initial hardware implementation, described in S. Dey et al. (ReConFig 2018). It is also interesting to consider extending the methods introduced herein to convolutional layers and recurrent architectures. Finally, truly speeding up the training process by orders of magnitude would allow a more extensive search over NN architectures and therefore a better understanding of the largely empirical process of NN design.
Appendix A Structured Pre-Defined Sparsity Constraints
In our structured pre-defined sparse network, , the density of junction , cannot be arbitrary, since , where and are natural numbers satisfying the equation . Therefore, the number of possible values is the same as the number of values satisfying the structured pre-defined sparsity constraints:
where denotes the set of natural numbers.
The smallest value of which satisfies is , and other values are its integer multiples. Since is upper bounded by , the total number of possible is . Thus, the set of possible is
As a concrete example, consider a NN with . The number of possible densities of the junctions are determined by and . Therefore, the sets of junction densities are
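The counting argument of this appendix can be illustrated with a short sketch. It relies only on the fact that, with regular out-degree and in-degree, both ways of counting the edges of a junction must agree. The names `n_left`, `n_right`, `d_out` and `d_in`, and the example layer sizes (800, 100, 10), are assumptions introduced for illustration.

```python
from fractions import Fraction
from math import gcd


def possible_densities(n_left, n_right):
    """All junction densities allowed by regular out-degree and in-degree.

    Edge counting gives n_left * d_out == n_right * d_in, so d_out must be a
    multiple of n_right // gcd(n_left, n_right) and cannot exceed n_right.
    """
    g = gcd(n_left, n_right)
    return [Fraction(k, g) for k in range(1, g + 1)]


def degrees_for(n_left, n_right, density):
    """Return (d_out, d_in) for an allowed density."""
    d_out = density * n_right
    d_in = density * n_left
    assert d_out.denominator == 1 and d_in.denominator == 1
    return int(d_out), int(d_in)


layers = (800, 100, 10)  # hypothetical network
for n_left, n_right in zip(layers, layers[1:]):
    options = possible_densities(n_left, n_right)
    print(n_left, n_right, len(options), float(options[0]))
```

For these hypothetical sizes the first junction admits 100 distinct densities in steps of 1% and the second admits 10 in steps of 10%, so the constraint is mild as long as adjacent layer sizes share a large common divisor.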
Appendix B Hardware Architecture Constraints
The depth of left memories in our hardware architecture is . Thus should preferably be an integral multiple of . This is not a burdening constraint since the choice of is independent of network parameters and depends on the capacity of the device. In the unusual case that this constraint cannot be met, the extra cells in memories can be filled with dummy values such as 0.
There are also 2 conditions placed on the values to eliminate stalls in processing: for all layers , (i) , and (ii) . Using the definitions from Sec. II-A, (i) is equivalent to . Then, (ii) can be equivalently written as
which needs to be satisfied . In practice, it is desirable to design to be an integer so that an integral number of right neurons finish processing every cycle. This simplifies hardware implementation by eliminating the need for additional storage, for example, of the intermediate activation values during FF. In this case, (13) reduces to , which is always true.
For non-integral , there are two cases. If , (13) reduces to . On the other hand, if , there is no bound on the right hand side of (13). In general, note that (13) becomes a burdening constraint only if is large, and and are both desired to be small. This corresponds to earlier junctions being denser than later, which is typically not desirable according to the observations in Sec. IV-D, or to very limited hardware resources. We thus conclude that (13) is not a limiting constraint in most practical cases.
Appendix C Clash-Free Patterns
Specifying , , and for junction in a clash-free structured pre-defined sparse NN does not uniquely define a connection pattern (unless it is FC). This section discusses the number of possible left memory access patterns for such a junction . Note that the total number of possible memory access patterns for the complete NN is .
When , which is expected to be true for practical cases of implementing sparse NNs on powerful hardware devices, is also equal to the number of possible connection patterns , which is the key quantity of interest. This is because if , at least one right neuron is completely processed in some cycle. Thus, changing the left memory access pattern will change the left neurons to which that right neuron connects, thereby changing the connection pattern. This one-to-one correspondence results in .
For the case of , a FC junction provides an example where . Specifically, in this case as there is only one way to fully connect all neurons, but there are many clash-free memory access patterns, as shown in the following equations (14)-(16).
We now discuss various types of clash-freedom, and arising from each:
Type 3: In this technique, the constraint of cyclically accessing the left memories is also eliminated. Instead, any cycle can access any cell from each of the memories. This means that storing is not enough, the entire sequence of memory accesses needs to be stored as a matrix . In Fig. 13(c) for example, for sweep 0. Every sweep would also have a different , resulting in:
A technique that can be applied to all the types of clash-freedom is memory dithering, which is a permutation of the memories (i.e., the columns) in a bank. This permutation can change every sweep, as shown in Fig. 13(d). Memory dithering incurs an additional address computation storage cost because of the permutation, but increases by a factor . If is an integer, an integral number of cycles are required to process each right neuron. Since a cycle accesses all memories, dithering has no effect and . On the other hand, if is an integer greater than 1, the effects of dithering on connectivity patterns are only observed when switching from one right neuron to the next within a cycle. This results in
for types 2 and 3, and the exponent is omitted for type 1 since the access pattern does not change across sweeps.
When either of or does not perfectly divide the other, an exact value of is hard to arrive at since some proper or improper fraction of right neurons are processed every cycle. In such cases, is upper-bounded by .
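The Type 3 access pattern and memory dithering discussed above can be sketched in code. This is an illustration under stated assumptions rather than the hardware's address generator: one sweep is taken to read every cell of every left memory exactly once, one cell per memory per cycle, and all function names are hypothetical.

```python
import random


def type3_sweep(num_memories, depth, rng):
    """One sweep of a Type-3-style pattern as a depth x num_memories matrix.

    Row t holds the cell address read from each memory in cycle t. Every
    column is an arbitrary permutation of the cell addresses, so the whole
    matrix has to be stored.
    """
    columns = [rng.sample(range(depth), depth) for _ in range(num_memories)]
    return [[columns[m][t] for m in range(num_memories)] for t in range(depth)]


def dither(num_memories, rng):
    """A permutation of the memories (columns), redrawn every sweep."""
    return rng.sample(range(num_memories), num_memories)


def is_valid_sweep(sweep, memory_order):
    """Check clash-freedom and that every cell is read exactly once."""
    num_memories, depth = len(memory_order), len(sweep)
    seen = set()
    for row in sweep:
        reads = [(m, row[m]) for m in memory_order]
        if len({m for m, _ in reads}) != num_memories:
            return False  # two lanes read the same memory in one cycle
        seen.update(reads)
    return len(seen) == num_memories * depth


rng = random.Random(0)
sweep = type3_sweep(num_memories=4, depth=3, rng=rng)
order = dither(4, rng)
print(is_valid_sweep(sweep, order))
```

The stored matrix is what makes Type 3 the most flexible and the most expensive option in terms of address storage, while dithering adds only one permutation of the memories per sweep.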
Table III compares the count of possible left memory access patterns and associated storage cost for computing memory addresses for types 1–3, with and without memory dither. The junction used is the same as in Fig. 4, except is raised to 12 such that becomes 2 and allows us to better show the effects of memory dithering.
[Table III. Columns: Type; Memory; Storage Cost to Compute]
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Advances in Neural Information Processing Systems 25 (NIPS), 2012, pp. 1097–1105.
-  A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, and B. Catanzaro, “Deep learning with COTS HPC systems,” in Proc. 30th Int. Conf. Machine Learning (ICML), vol. 28, 2013, pp. III–1337–III–1345.
-  S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural networks,” in Proc. Advances in Neural Information Processing Systems 28 (NIPS), 2015, pp. 1135–1143.
-  N. P. Jouppi, C. Young, N. Patil et al., “In-datacenter performance analysis of a tensor processing unit,” in 2017 ACM/IEEE 44th Annu. Int. Symp. Computer Architecture (ISCA), June 2017.
-  C. Szegedy, W. Liu, Y. Jia et al., “Going deeper with convolutions,” in IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
-  Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, “Compressing deep convolutional networks using vector quantization,” in arXiv preprint arXiv:1412.6115, 2014.
-  W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Compressing neural networks with the hashing trick,” in Proc. 32nd Int. Conf. Machine Learning (ICML), 2015.
-  S. Han, H. Mao, and W. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” in Proc. Int. Conf. Learning Representations (ICLR), 2016.
-  S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” in Proc. 43rd Int. Symp. Computer Architecture (ISCA), 2016.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
-  J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-neuron-free deep neural network computing,” in Proc. 2016 ACM/IEEE 43rd Annu. Int. Symp. Computer Architecture (ISCA), 2016, pp. 1–13.
-  B. Reagen, P. Whatmough, R. Adolf et al., “Minerva: Enabling low-power, highly-accurate deep neural network accelerators,” in Proc. 2016 ACM/IEEE 43rd Annu. Int. Symp. Computer Architecture (ISCA), 2016, pp. 267–278.
-  A. Aghasi, A. Abdi, N. Nguyen, and J. Romberg, “Net-trim: Convex pruning of deep neural networks with performance guarantee,” in Proc. Advances in Neural Information Processing Systems 30 (NIPS), 2017.
-  W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Proc. Advances in Neural Information Processing Systems 29 (NIPS), 2016, pp. 2074–2082.
-  V. Sindhwani, T. Sainath, and S. Kumar, “Structured transforms for small-footprint deep learning,” in Proc. Advances in Neural Information Processing Systems 28 (NIPS), 2015, pp. 3088–3096.
-  S. Wang, Z. Li, C. Ding, B. Yuan, Y. Wang, Q. Qiu, and Y. Liang, “C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2018.
-  A. Bourely, J. P. Boueri, and K. Choromonski, “Sparse neural network topologies,” arXiv preprint arXiv:1706.05683, 2017.
-  A. Prabhu, G. Varma, and A. M. Namboodiri, “Deep expander networks: Efficient deep networks from graph theory,” arXiv preprint arXiv:1711.08757, 2017.
-  D. C. Mocanu, E. Mocanu, P. Stone, P. H. Nguyen, M. Gibescu, and A. Liotta, “Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science,” Nature Communications, vol. 9, 2018.
-  S. Dey, Y. Shao, K. M. Chugg, and P. A. Beerel, “Accelerating training of deep neural networks via sparse edge processing,” in Proc. 26th Int. Conf. Artificial Neural Networks (ICANN). Springer, Sep 2017, pp. 273–280.
-  S. Dey, P. A. Beerel, and K. M. Chugg, “Interleaver design for deep neural networks,” in Proc. 51st Asilomar Conf. Signals, Systems, and Computers, Oct 2017, pp. 1979–1983.
-  S. Dey, K. Huang, P. A. Beerel, and K. M. Chugg, “Characterizing sparse connectivity patterns in neural networks,” in Proc. 2018 Information Theory and Applications Workshop (ITA), Feb 2018, pp. 1–9.
-  Y. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
-  Y. Ma, N. Suda, Y. Cao, S. Vrudhula, and J. Seo, “ALAMO: FPGA acceleration of deep learning algorithms with a modularized RTL compiler,” Integration, the VLSI Journal, 2018.
-  S. Zhang, Z. Du, L. Zhang et al., “Cambricon-X: An accelerator for sparse neural networks,” in Proc. 2016 49th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), 2016, pp. 1–12.
-  N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao, “Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks,” in Proc. 2016 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays. ACM, 2016, pp. 16–25.
-  T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in Proc. 19th Int. Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 2014, pp. 269–284.
-  G. Masera, G. Piccinini, M. R. Roch, and M. Zamboni, “VLSI architectures for turbo codes,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 7, no. 3, pp. 369–379, 1999.
-  T. Brack, M. Alles, T. Lehnigk-Emden et al., “Low complexity LDPC code decoders for next generation standards,” in Design, Automation & Test in Europe Conf. & Exhibition (DATE). IEEE, 2007, pp. 1–6.
-  S. Crozier and P. Guinand, “High-performance low-memory interleaver banks for turbo-codes,” in Vehicular Technology Conf., 2001. VTC 2001 Fall. IEEE VTS 54th, vol. 4. IEEE, 2001, pp. 2394–2398.
-  F. Sun, C. Wang, L. Gong, C. Xu, Y. Zhang, Y. Lu, X. Li, and X. Zhou, “A high-performance accelerator for large-scale convolutional neural networks,” in 2017 IEEE Int. Symp. Parallel and Distributed Processing with Applications and 2017 IEEE Int. Conf. Ubiquitous Computing and Communications (ISPA/IUCC), Dec 2017, pp. 622–629.
-  C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou, “DLAU: A scalable deep learning accelerator unit on FPGA,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 3, pp. 513–517, March 2017.
-  Y. LeCun, C. Cortes, and C. J. Burges, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/.
-  D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, “RCV1: A new benchmark collection for text categorization research,” Journal of machine learning research, vol. 5, pp. 361–397, Apr 2004.
-  J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, “TIMIT acoustic-phonetic continuous speech corpus,” https://catalog.ldc.upenn.edu/LDC93S1.
-  A. Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, University of Toronto, 2009.
-  I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
-  J. Yosinski and H. Lipson, “Visually debugging restricted Boltzmann machine training with a 3D example,” in Proc. 29th Int. Conf. Machine Learning (ICML), 2012.
-  N. H. E. Weste and D. M. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, 4th ed. Pearson, 2010.
-  S. Dey, D. Chen, Z. Li, S. Kundu, K. Huang, K. M. Chugg, and P. A. Beerel, “A highly parallel FPGA implementation of sparse neural network training,” in Proc. 2018 Int. Conf. Reconfigurable Computing and FPGAs (ReConFig), Dec 2018, expanded pre-print version available at https://arxiv.org/abs/1806.01087.
-  P. Goyal, P. Dollár, R. B. Girshick et al., “Accurate, large minibatch SGD: training ImageNet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
-  D. Masters and C. Luschi, “Revisiting Small Batch Training for Deep Neural Networks,” arXiv preprint arXiv:1804.07612, 2018.
-  G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
-  K.-F. Lee and H.-W. Hon, “Speaker-independent phone recognition using hidden Markov models,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 11, pp. 1641–1648, Nov 1989.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. IEEE Int. Conf. Computer Vision (ICCV), 2015, pp. 1026–1034.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learning Representations (ICLR), 2014.
-  S. Dey, K. M. Chugg, and P. A. Beerel, “Morse code datasets for machine learning,” in Proc. 9th Int. Conf. Computing, Communication and Networking Technologies (ICCCNT), Jul 2018, pp. 1–7.
-  J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” arXiv preprint arXiv:1412.7755, 2014.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. 32nd Int. Conf. Machine Learning (ICML), 2015, pp. 2048–2057.
-  M. R. Osborne, B. Presnell, and B. Turlach, “A new approach to variable selection in least squares problems,” in IMA Journal of Numerical Analysis, 2000.
-  R. Jenatton, J. Audibert, and F. R. Bach, “Structured variable selection with sparsity-inducing norms,” Journal of Machine Learning Research, vol. 12, pp. 2777–2824, 2011.