Unifying and Merging Well-trained Deep Neural Networks for Inference Stage

05/14/2018, by Yi-Min Chou et al., Academia Sinica

We propose a novel method to merge convolutional neural nets for the inference stage. Given two well-trained networks that may have different architectures and handle different tasks, our method aligns the layers of the original networks and merges them into a unified model by sharing the representative codes of weights. The shared weights are further re-trained to fine-tune the performance of the merged model. The proposed method effectively produces a compact model that can run the original tasks simultaneously on resource-limited devices. As it preserves the general architectures and leverages the co-used weights of well-trained networks, substantial training overhead is avoided, shortening the system development time. Experimental results demonstrate satisfactory performance and validate the effectiveness of the method.


1 Introduction

Research on deep neural networks has made rapid progress in recent years. Deep networks have been successfully applied to a wide range of artificial intelligence (AI) applications, including computer vision, speech processing, natural language processing, and bioinformatics. To handle various tasks, we usually design different network models and train them separately on task-specific datasets so that each behaves well for its particular purpose. However, practical AI applications often need to handle multiple tasks simultaneously, leading to a high demand for computation resources in both the training and inference stages. Therefore, how to effectively integrate multiple network models into a single system is a crucial problem for successful AI applications.

This paper tackles the problem of merging multiple well-trained (known-weights) feed-forward networks and unifying them into a single, compact one. The original networks, whose architectures may not be identical, may take input from a single source or from multiple sources. After unification, the merged network should be capable of handling the original tasks while being more condensed than the collection of original models. Our approach (NeuralMerger) contains two phases:

Alignment and encoding phase: First, we align the architectures of the neural network models and encode the weights such that they are shared among the networks. The purpose is to unify the weights so that the filters of different neural networks can be co-used.

Fine-tuning phase: Second, we fine-tune the merged model with partial or all training data (calibration data). A method following the concept of distilling dark knowledge of neural networks in [HintonDistilling14] is employed in this phase.

Figure 1: Two tasks accomplished by feed-forward networks, where model A (or B) consists of $c_A$ (or $c_B$) convolution layers and $f_A$ (or $f_B$) fully-connected layers, respectively. Our NeuralMerger unifies the two models into a single one consisting of jointly encoded convolution and fully-connected layers for model inference; E-Conv and E-FC refer to the jointly encoded convolution and fully-connected layers, respectively.

Neural network models may have very different topologies. Currently, this study focuses on merging feed-forward networks; merging networks with loops remains future work. A modern feed-forward network consists of several kinds of layers, including convolution, pooling, and fully-connected layers, and is generally referred to as a convolutional network (CNN). When merging two CNNs, our approach aligns same-type layers (convolution with convolution; fully-connected with fully-connected) into pairs. The layers in a pair are merged into a single layer that shares a common weight codebook through the proposed encoding scheme. The codebooks in the merged model can be further trained via the back-propagation algorithm; the merged model can thus be fine-tuned to improve performance.

Figure 2: Example of merging three models, ZF, VGG-avg, and LeNet into a single one for the inference stage via our NeuralMerger.

Motivation of Our Study: Merging existing neural networks has great potential for real-world applications. To tackle multiple recognition tasks in a single system based on either a single or various signal sources, a typical approach is to design a new model and train it on the union of the datasets of these tasks, e.g., [DBLP:journals/corr/KaiserGSVPJU17, DBLP:journals/corr/AytarVT17]. Such "learn-them-all" approaches train a single complex model to handle multiple tasks simultaneously. However, two issues may arise. First, it is hard to choose in advance a neural-net architecture suitable for learning all the tasks well; hence, a trial-and-error process is required to find a suitable architecture. Second, learning from a random initialization with large training data of different types can be demanding. To tackle these issues, bridging layers are added among the original models to form a joint model for multi-task learning in [RN274_multitask]. However, the increased complexity and size of the joint model hinder its deployment on resource-limited or edge devices in the inference stage.

As many models trained for various tasks are now available, a practical way to integrate different functionalities into a system is to leverage these individual-task models. In this paper, we introduce an approach that merges different neural networks by removing the co-redundancy among their filters or weights. The proposed NeuralMerger can take advantage of existing well-trained models. Our approach merges them by finding and sharing the representative codes of weights; the shared codes can still be refined by learning. To our knowledge, this is the first study on merging known-weights neural nets into a more compact model. Because our approach compresses the networks through weight sharing and redundancy removal, it is useful for deep-learning embedded systems and edge computing in the inference stage.

Overview of Our Approach: When merging two different CNN models A and B, the output is a CNN model consisting of jointly encoded convolution (E-Conv) and fully-connected (E-FC) layers. An overview of our approach is illustrated in Fig. 1, and an example of merging three models via our approach is given in Fig. 2.

Contributions of this paper are summarized as follows:

(1) Given well-trained CNN models, the introduced NeuralMerger can merge them to handle multiple tasks even when the models are different. The merging process preserves the general architectures of the well-trained networks and removes their redundancy. It avoids the cumbersome design and trial-and-error process required by the learn-them-all approaches.

(2) The proposed method produces a more compact model that handles the original tasks simultaneously. The compact model consumes less computational time and storage than the compound of the original networks, and thus has great potential to fit in low-end systems.

2 Related Work

To simultaneously achieve various tasks with a single neural-net model, a typical way is to increase the output nodes (for multiple tasks) of a pre-chosen neural-net structure and train it from a random initialization. In [DBLP:journals/corr/KaiserGSVPJU17], the MultiModel architecture is introduced, which accepts input data of different dimensions (images, sound waves, and text) and converts them into a unified representation. In [DBLP:journals/corr/AytarVT17], a deep CNN leverages massive synchronized data (sound and sentences paired with images) to learn an aligned representation. Nevertheless, as mentioned earlier, learn-them-all approaches require cumbersome training effort and intensive inference computation.

Compressing a neural network is an active research direction for deploying compact models on resource-limited embedded systems. To reduce the representation, binary weights and bit-wise operations are used in [hubara2016binarized] and [rastegari2016xnor]. Han et al. [Han16] introduce a three-stage pipeline that prunes redundant network connections, quantizes weights with a codebook, and Huffman-encodes the weights and indices to reduce the storage required by a CNN. Quantized CNN (Q-CNN) [Wu16] is proposed to address both the speed and compression issues; it splits the input space of a layer and applies vector quantization to each subspace. Researchers also prune filters and feature maps to directly reduce the computational cost [molchanov2016pruning, li2016pruning, he2017channel]. Transferring the learned knowledge from a large network to a small network is another important direction toward effective network compression. In [HintonDistilling14], distilling the knowledge of an ensemble model into a smaller model is introduced.

Instead of compressing a single network, the goal of this study is to merge multiple networks simultaneously. Besides, our method can restore the performance of the jointly compressed models by fine-tuning them with training samples.

3 Deep Model Integration

Assume that model A (or B) consists of $c_A$ (or $c_B$) convolution (Conv) layers followed by $f_A$ (or $f_B$) fully connected (FC) layers. Let $c = \min(c_A, c_B)$ and $f = \min(f_A, f_B)$. In our approach, a correspondence $\{(\alpha(i), \beta(i))\}_{i=1}^{c}$ is established between the Conv layers for the alignment of the two models, where $\alpha$ is a strictly increasing mapping from $\{1, \dots, c\}$ to $\{1, \dots, c_A\}$ and $\beta$ is a strictly increasing mapping from $\{1, \dots, c\}$ to $\{1, \dots, c_B\}$. Likewise, a correspondence is also established between the FC layers for $i = 1, \dots, f$.
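The text leaves the choice of the increasing mappings open; the Python sketch below shows one simple way such a correspondence could be produced. The even-spacing heuristic, the function name align_layers, and the example layer counts are our own illustrations, not the authors' procedure.

```python
import numpy as np

def align_layers(num_merged, num_layers):
    """Return a strictly increasing mapping {1..num_merged} -> {1..num_layers}.

    The correspondence only has to be strictly increasing; spreading the merged
    indices evenly over the deeper model is merely one plausible heuristic.
    """
    return [int(round(x)) for x in np.linspace(1, num_layers, num_merged)]

# Example: model A has 5 Conv layers and model B has 3, so c = min(5, 3) = 3.
c_A, c_B = 5, 3
c = min(c_A, c_B)
alpha = align_layers(c, c_A)    # e.g., [1, 3, 5]
beta = align_layers(c, c_B)     # e.g., [1, 2, 3]
pairs = list(zip(alpha, beta))  # Conv-layer pairs to be merged into E-Conv layers
print(pairs)                    # [(1, 1), (3, 2), (5, 3)]
```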

In our method, the merged layers have to be of the same type (Conv or FC). Given two layers, one in model A and the other in model B, the principle to merge them is finding a set of (fewer) exemplar codewords that represent the weights of the layers with small quantization errors. The layers are thus jointly compressed for redundancy removal. Below, we first consider unifying the Conv layers, and then the FC layers.

3.1 Merging Convolution Layers

Assume that a Conv layer in model A and another in model B are to be merged. The layer of model A has an input volume of size $w_A \times h_A \times d_A$, where $w_A \times h_A$ is the spatial size and $d_A$ is the depth (number of channels). The input volume is convolved with $n_A$ convolution kernels, where the size of each kernel is $k_A \times k_A \times d_A$. The output of the Conv layer in model A is thus a volume of size $w_A \times h_A \times n_A$ (without loss of generality, assume that padded convolutions are used).

Likewise, similar notations apply to the respective layer in model B. An input volume of size $w_B \times h_B \times d_B$ is convolved with $n_B$ convolution kernels of size $k_B \times k_B \times d_B$. The output volume of that layer is of size $w_B \times h_B \times n_B$.

We aim to jointly encode the convolution coefficients. As there are $n_A$ (or $n_B$) convolution kernels in the layers of A (or B), we hope to find a new set of exemplars, fewer than $n_A + n_B$, to express the original ones so that the models are fused and the redundancy between them is removed. To this end, a viable way is to perform vector quantization (such as k-means clustering) on the convolution kernels and find a smaller number of codewords to jointly represent the kernels compactly. However, it is hard to make this method practicable because the kernel dimensions could be inconsistent (i.e., $k_A \neq k_B$ or $d_A \neq d_B$).

To address this issue, we unify the different convolution kernels by using spatially $1 \times 1$ convolutions, so that merging CNNs with convolution kernels of different sizes becomes attainable. In the following, we first review the operations in a convolution layer and then show how to separate the dimensions so that different layers are unified and jointly encoded.

3.1.1 Operations in Convolution Layer

The operations in a Conv layer of a CNN are reviewed as follows. Suppose $I$ is the input volume (i.e., a 3D tensor) to a Conv layer and $O$ is the output volume. Assume that $n$ convolution kernels of size $k \times k \times d$ are applied to the layer, denoted as

$K^1, K^2, \dots, K^n$.   (1)

Then, the $j$-th channel output is obtained as $O^j = I * K^j$, the volume convolution of $I$ and $K^j$, and the output is the concatenation of $O^1, \dots, O^n$,

$O = [O^1, O^2, \dots, O^n]$.   (2)

Let $I_c$ and $K^j_c$ respectively be the $c$-th channel of $I$ and $K^j$. The volume convolution is formed by summing the 2D-convolution results of the channels:

$O^j = I * K^j = \sum_{c=1}^{d} I_c \circledast K^j_c$,   (3)

where $\circledast$ denotes the 2D convolution operator.

In CNNs, various $k$ and $d$ (e.g., spatial sizes of $3 \times 3$, $5 \times 5$, or $7 \times 7$) are used in existing networks. Particularly, when $k = 1$, the volume convolution with a kernel of size $1 \times 1 \times d$ is often referred to as a $1 \times 1$ convolution.
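To make the notation concrete, the following minimal NumPy sketch evaluates Eqs. 2 and 3 for a toy input. The shapes, variable names, and the use of SciPy's correlate2d (deep-learning "convolution" is typically cross-correlation) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.signal import correlate2d   # CNN "convolution" is cross-correlation

def volume_conv(I, K):
    """I: (h, w, d) input volume; K: (k, k, d) kernel -> one (h, w) output channel."""
    d = I.shape[-1]
    # Eq. 3: sum the per-channel 2D convolutions; 'same' mimics padded convolution
    return sum(correlate2d(I[:, :, c], K[:, :, c], mode="same") for c in range(d))

I = np.random.randn(8, 8, 3)                              # input volume
K = [np.random.randn(3, 3, 3) for _ in range(4)]          # n = 4 kernels
O = np.stack([volume_conv(I, Kj) for Kj in K], axis=-1)   # Eq. 2: channel concatenation
print(O.shape)                                            # (8, 8, 4)
```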

3.1.2 Kernel Decomposition in Spatial Directions

In the above, the volume convolution is computed as a spatially sliding operation (2D convolution) followed by a channel-wise summation along the depth direction. In this section, we show that, no matter what $k$ and $d$ are, it can be equivalently represented by $1 \times 1$ convolutions via decomposing the kernel along the spatial directions as follows.

Given the kernel $K$, let $K(p, q, c)$ specify its entry at the spatial location $(p, q)$ of the $c$-th channel. In particular, let $K_{(p,q)}$ stand for the $1 \times 1 \times d$ sub-kernel at $(p, q)$; e.g., a kernel of spatial size $5 \times 5$ consists of 25 kernels of spatial size $1 \times 1$, $K_{(p,q)}$.

Following this notation, we decompose a $k \times k \times d$ volume convolution into multiple $1 \times 1$ convolutions and combine them with shift operators. Without loss of generality, we assume that the spatial size of the kernel is an odd number $k = 2r + 1$ and index the spatial locations with $p, q \in \{-r, \dots, r\}$. The $j$-th channel output can then be equivalently represented as

$O^j = \sum_{p=-r}^{r} \sum_{q=-r}^{r} S_{(p,q)}\big( I * K^j_{(p,q)} \big)$,   (4)

where $I * K^j_{(p,q)}$ ($p, q \in \{-r, \dots, r\}$) are the $1 \times 1$ convolutions depicted above, and $S_{(p,q)}$ is the shift operator,

$S_{(p,q)}(V)(x, y, c) = V(x + p, y + q, c)$,   (5)

with $(x, y)$ the spatial location and $c$ the channel index. Hence, for all $j$, the $j$-th channel output of the volume convolution can be decomposed as the shifted sum of $1 \times 1$ convolutions via Eq. 4. Then, the output $O$ is obtained via the concatenation in Eq. 2.

To address the issue caused by dimension mismatch in merging two convolution layers, we then propose to take the $1 \times 1$-convolution representation for both layers. Hence, a kernel in model A is decomposed into $k_A^2$ convolutions of size $1 \times 1 \times d_A$ and that in model B is decomposed into $k_B^2$ convolutions of size $1 \times 1 \times d_B$. The kernels are thus unified to be $1 \times 1$ in the spatial domain no matter whether $k_A$ equals $k_B$.
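The decomposition of Eq. 4 can be checked numerically. The toy sketch below rebuilds one output channel from shifted $1 \times 1$ convolutions and compares it with a direct padded cross-correlation; the helper names and conventions (e.g., the sign of the shift) are our assumptions.

```python
import numpy as np

def conv_1x1(I, kvec):
    """1 x 1 convolution: an inner product along the depth at every spatial site."""
    return np.tensordot(I, kvec, axes=([2], [0]))       # (h, w, d) x (d,) -> (h, w)

def shift(plane, p, q):
    """S_{(p,q)}: S(V)(x, y) = V(x + p, y + q), zero beyond the borders."""
    h, w = plane.shape
    out = np.roll(np.roll(plane, -p, axis=0), -q, axis=1)
    if p > 0: out[h - p:, :] = 0
    if p < 0: out[:-p, :] = 0
    if q > 0: out[:, w - q:] = 0
    if q < 0: out[:, :-q] = 0
    return out

h, w, d, k = 8, 8, 3, 5
r = (k - 1) // 2
I = np.random.randn(h, w, d)
K = np.random.randn(k, k, d)                            # one kernel K^j

# Eq. 4: sum of shifted 1 x 1 convolutions, one per spatial offset (p, q)
O_dec = sum(shift(conv_1x1(I, K[p + r, q + r, :]), p, q)
            for p in range(-r, r + 1) for q in range(-r, r + 1))

# Direct padded cross-correlation for comparison
O_ref = np.zeros((h, w))
for x in range(h):
    for y in range(w):
        for p in range(-r, r + 1):
            for q in range(-r, r + 1):
                if 0 <= x + p < h and 0 <= y + q < w:
                    O_ref[x, y] += I[x + p, y + q, :] @ K[p + r, q + r, :]
print(np.allclose(O_dec, O_ref))                        # True
```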

3.1.3 Kernel Separation along Depth Direction

Next, we seek to jointly express the $n_A k_A^2 + n_B k_B^2$ spatially $1 \times 1$ convolutions by a compact representation so that the two layers are co-compressed, where $n_A$ and $n_B$ are the numbers of kernels of the layers in A and B, respectively. Though the subspace dimensions in the spatial domain are now consistent, they are still inconsistent in the depth direction ($d_A$ vs. $d_B$) and thus difficult to cluster jointly. To address this problem, we simply separate each kernel into non-overlapping $1 \times 1 \times s$ sub-kernels along the depth direction. As the convolution in CNNs is summation-based in the depth direction, we divide the kernel into vectors of dimension $s$,

$K^j_{(p,q)} = \big[ K^{j,(1)}_{(p,q)}; K^{j,(2)}_{(p,q)}; \dots \big]$,   (6)

where $K^{j,(t)}_{(p,q)}$ (of size $1 \times 1 \times s$) is the $t$-th segment of the original kernel. The output in Eq. 4 then becomes

$O^j = \sum_{p=-r}^{r} \sum_{q=-r}^{r} S_{(p,q)} \Big( \sum_{t} I^{(t)} * K^{j,(t)}_{(p,q)} \Big)$,   (7)

where $I^{(t)}$ is the $t$-th sub-volume of the input, for $t = 1, \dots, m_A$ (model A) or $t = 1, \dots, m_B$ (model B). Specifically, a spatially $1 \times 1$ kernel is segmented into $m_A = \lceil d_A / s \rceil$ (or $m_B = \lceil d_B / s \rceil$) sub-kernels of dimension $s$ in model A (or B), where the last segment is padded with zeros if necessary.

Let $m = \min(m_A, m_B)$. There are then $n_A k_A^2 + n_B k_B^2$ kernels of size $1 \times 1 \times s$ for each segment $t$, $t = 1, \dots, m$. To jointly represent the kernels of both layers, we use $M$ codewords ($M < n_A k_A^2 + n_B k_B^2$) in the dim-$s$ space to encode the convolution coefficients compactly. We run the k-means algorithm with various initializations on these vectors and select the result yielding the least representation error to produce the codewords (i.e., the cluster centers of k-means), for $t = 1, \dots, m$. (For the remaining segments, $t > m$, we also use $M$ codewords to encode the $n_A k_A^2$ or $n_B k_B^2$ dim-$s$ vectors in the respective subspaces if $m_A > m$ or $m_B > m$.)

Figure 3: Illustration of merging the Conv layers of two models having kernels of spatial size $k_A \times k_A$ and $k_B \times k_B$, respectively; each layer is divided into 2 segments. They are decomposed into spatially $1 \times 1$ kernels, and the kernels in every segment are clustered via k-means clustering to build a codebook. The $1 \times 1$ convolutions are pre-computed on the codebook, and a lookup table is built to index the results.
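A possible sketch of this joint codebook construction is given below. It assumes kernels are stored as NumPy arrays of shape (k, k, d, n) and uses scikit-learn's KMeans in place of the multi-restart k-means in the text; the function names and example layer sizes are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def depth_segments(kernels, s):
    """Split (k, k, d, n) kernels into the 1x1xs sub-vectors of each segment t."""
    k, _, d, n = kernels.shape
    m = int(np.ceil(d / s))
    padded = np.zeros((k, k, m * s, n))
    padded[:, :, :d, :] = kernels                       # zero-pad the last segment
    # segment t -> array of shape (n * k * k, s): every 1x1xs sub-kernel of segment t
    return [padded[:, :, t * s:(t + 1) * s, :].transpose(3, 0, 1, 2).reshape(-1, s)
            for t in range(m)]

def joint_codebooks(kernels_A, kernels_B, s, M, restarts=5):
    """One codebook (M x s) per shared segment, clustering A's and B's vectors jointly."""
    segs_A, segs_B = depth_segments(kernels_A, s), depth_segments(kernels_B, s)
    books = []
    for t in range(min(len(segs_A), len(segs_B))):
        vectors = np.vstack([segs_A[t], segs_B[t]])     # joint vector set for segment t
        km = KMeans(n_clusters=M, n_init=restarts).fit(vectors)
        books.append(km.cluster_centers_)               # M codewords of dimension s
    return books

# Example: a 5x5x32x64 layer from A and a 3x3x48x96 layer from B, with s = 8, M = 128
books = joint_codebooks(np.random.randn(5, 5, 32, 64),
                        np.random.randn(3, 3, 48, 96), s=8, M=128)
print(len(books), books[0].shape)                       # 4 codebooks, each (128, 8)
```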

3.1.4 E-Conv Layer and Weights Co-use

The merged convolution layer (called the E-Conv layer) is a newly formed layer in which the weights are co-used among the convolution kernels. Denote the codewords in the $t$-th subspace by $\{c^t_1, \dots, c^t_M\}$. We then replace each dim-$s$ kernel at the spatial site $(p, q)$ in subspace $t$ (namely, $K^{j,(t)}_{(p,q)}$) with $c^t_{\phi(j,p,q,t)}$, its closest codeword in the dim-$s$ space, where $\phi(\cdot)$ is the code-assignment mapping. Eq. 7 is then simplified as

$O^j \approx \sum_{p=-r}^{r} \sum_{q=-r}^{r} S_{(p,q)} \Big( \sum_{t} I^{(t)} * c^t_{\phi(j,p,q,t)} \Big)$.   (8)

Because the number of codewords $M$ is smaller than the total number of kernel vectors, Eq. 8 can be executed more efficiently by computing the $1 \times 1$ convolutions of the codewords first,

$T^t_i = I^{(t)} * c^t_i, \quad i = 1, \dots, M, \;\; t = 1, \dots, m$,   (9)

and then storing the results in a lookup table at run-time. The run-time operation of convolution is thus replaced by table indexing. Hence, the convolution kernels of the two models A and B are representationally shared in a compact codebook, and the computation time is saved. An illustration of the E-Conv layer is given in Fig. 3.

When fewer codewords are chosen, the amount of stored convolution coefficients is reduced from the original kernel entries of both layers to the $M \times s$ entries of each codebook plus the assignment indices. The merged model is thus co-compressed, as it consumes less storage than the total required by the two original convolution layers. As for the computational speed, each $1 \times 1 \times s$ convolution at run-time is replaced with a table-indexing operation, and only the $M$ codeword convolutions per segment need to be computed, shared by all output channels. Since a table lookup is far cheaper than a $1 \times 1 \times s$ convolution, the speedup in terms of complexity becomes higher when the codewords are fewer or the subspace dimension $s$ is larger.
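The following sketch illustrates how an E-Conv layer could evaluate one output channel at inference time: the codeword convolutions of Eq. 9 are computed once and reused through index lookups and shifted additions as in Eq. 8. The dense index array, the function name, and the shapes are assumptions for illustration.

```python
import numpy as np

def e_conv_channel(I_segments, codebooks, assign, r):
    """Reconstruct one output channel O^j of an E-Conv layer.

    I_segments: list of m input sub-volumes, each (h, w, s)
    codebooks:  list of m arrays, each (M, s), shared with the other model
    assign:     (k, k, m) array of codeword indices, i.e., phi(j, p, q, t) for this j
    r:          kernel radius, k = 2r + 1
    """
    h, w, _ = I_segments[0].shape
    # Eq. 9: lookup tables T[t][:, :, i] = I^(t) * c^t_i, computed once per segment
    T = [np.tensordot(I_t, C_t, axes=([2], [1]))        # each of shape (h, w, M)
         for I_t, C_t in zip(I_segments, codebooks)]
    O = np.zeros((h, w))
    for p in range(-r, r + 1):                          # Eq. 8: shifted sums of
        for q in range(-r, r + 1):                      # looked-up table entries
            plane = sum(T[t][:, :, assign[p + r, q + r, t]] for t in range(len(T)))
            shifted = np.zeros((h, w))
            shifted[max(0, -p):min(h, h - p), max(0, -q):min(w, w - q)] = \
                plane[max(0, p):h + min(0, p), max(0, q):w + min(0, q)]
            O += shifted
    return O

# Toy usage: m = 2 segments, s = 4, M = 16 codewords, 3x3 kernel (r = 1)
segs = [np.random.randn(8, 8, 4) for _ in range(2)]
books = [np.random.randn(16, 4) for _ in range(2)]
idx = np.random.randint(0, 16, size=(3, 3, 2))
print(e_conv_channel(segs, books, idx, r=1).shape)      # (8, 8)
```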

3.1.5 Derivatives of the Merged Layer

Besides condensing and unifying the convolution operations, the E-Conv layer is differentiable, so end-to-end back-propagation learning remains realizable. However, evaluating the derivatives directly on the table-lookup structure would be hard because the indices are not continuous. Hence, the table is used only in the inference stage of our approach. For learning, we slightly change the form of Eq. 8 to derive the gradients of $O^j(x, y)$ (the output at the spatial location $(x, y)$ of channel $j$) with respect to the codewords. From Eq. 8,

$O^j(x, y) = \sum_{p=-r}^{r} \sum_{q=-r}^{r} \sum_{t} \big\langle I^{(t)}(x + p, y + q), \, c^t_{\phi(j,p,q,t)} \big\rangle$,   (10)

where $\langle \cdot, \cdot \rangle$ is the inner product and $I^{(t)}(x + p, y + q)$ is the dim-$s$ depth sub-vector of the input at that location. Let $C^t$ be the matrix whose columns are the $M$ dim-$s$ codewords of the $t$-th segment. Let $h_{j,p,q,t}$ be the one-hot vector whose $i$-th entry is $1$ if $i = \phi(j, p, q, t)$ and $0$ otherwise. Then, the codeword $c^t_{\phi(j,p,q,t)}$ in Eq. 8 can be replaced by $C^t h_{j,p,q,t}$. Hence, the derivative of the E-Conv layer is obtained as

$\dfrac{\partial O^j(x, y)}{\partial C^t} = \sum_{p=-r}^{r} \sum_{q=-r}^{r} I^{(t)}(x + p, y + q) \, h_{j,p,q,t}^{\top}$.   (11)

As $C^t$ is the matrix consisting of the codewords of the $t$-th segment, the codewords can then be fine-tuned via these gradients for learning.
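As a small illustration of why the one-hot formulation makes the codebook trainable, the TensorFlow sketch below evaluates Eq. 10 at a single location and lets automatic differentiation produce the codebook gradient of Eq. 11 (up to a transpose, since codewords are stored as rows here). The shapes and index values are made up.

```python
import tensorflow as tf

M, s = 16, 4
codebook = tf.Variable(tf.random.normal([M, s]))       # C^t with codewords as rows
indices = tf.constant([3, 7, 3, 1])                    # phi(j, p, q, t) for 4 taps
inputs = tf.random.normal([4, s])                      # I^(t)(x + p, y + q) per tap

with tf.GradientTape() as tape:
    one_hot = tf.one_hot(indices, M)                   # the h_{j,p,q,t} vectors
    taps = tf.matmul(one_hot, codebook)                # C^t h: the selected codewords
    out = tf.reduce_sum(inputs * taps)                 # Eq. 10 evaluated at one (x, y)
grad = tape.gradient(out, codebook)                    # sums of the I^(t) h^T terms,
print(grad.shape)                                      # cf. Eq. 11; shape (16, 4)
```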

3.2 Merging Fully-connected Layers

The volume input to an FC layer is generally reshaped to a vector. Let $x$ be the input vector of an FC layer and $y$ be its output. Then $y = W x$, where $W$ is the weight matrix of the FC layer.

Unlike Conv layers, which have sliding operations, all the operations in an FC layer are summation-based. Thus, given the two weight matrices of models A and B, namely $W_A$ and $W_B$, we simply divide them into length-$s$ segments along the row direction. The dim-$s$ weight vectors in the same segment are then clustered via the k-means algorithm, and $M$ codewords are found. In this way an E-FC layer is built as well, and it is easy to show that the E-FC layer is also differentiable.
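A compact sketch of the E-FC construction and forward pass under the same codebook scheme follows. Only the segments shared by both models are handled, the zero-padding choice and names are ours, and scikit-learn's KMeans again stands in for the clustering step.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_efc(W_A, W_B, s, M):
    """Jointly quantize two FC weight matrices; returns codebooks and index maps."""
    def segments(W):
        cols = -(-W.shape[1] // s) * s                  # pad input dim to a multiple of s
        Wp = np.zeros((W.shape[0], cols)); Wp[:, :W.shape[1]] = W
        return Wp.reshape(W.shape[0], -1, s)            # (rows, m, s)
    SA, SB = segments(W_A), segments(W_B)
    m = min(SA.shape[1], SB.shape[1])
    books, idx_A, idx_B = [], [], []
    for t in range(m):
        vecs = np.vstack([SA[:, t, :], SB[:, t, :]])    # joint dim-s vectors, segment t
        km = KMeans(n_clusters=M, n_init=3).fit(vecs)
        books.append(km.cluster_centers_)
        idx_A.append(km.labels_[:SA.shape[0]]); idx_B.append(km.labels_[SA.shape[0]:])
    return books, idx_A, idx_B

def efc_forward(x, books, idx, s):
    """y_r = sum_t <x^(t), c^t_{idx[t][r]}>: one table lookup per segment and row."""
    xs = [x[t * s:(t + 1) * s] for t in range(len(books))]
    tables = [C @ x_t for C, x_t in zip(books, xs)]     # (M,) inner products per segment
    rows = len(idx[0])
    return np.array([sum(tables[t][idx[t][r]] for t in range(len(books)))
                     for r in range(rows)])
```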

3.3 End-to-end Fine Tuning

As both the E-Conv and E-FC layers are differentiable, once their codebooks are constructed, we can fine-tune the entire model on partial or all training data through end-to-end back-propagation learning. The training data used here are referred to as calibration data, and the fine-tuning process is called calibration training in our work. Codebooks are also used for weight quantization in the single-model compression approach of [Wu16]. However, unlike our work, in which the codebooks of all layers are tunable in an end-to-end manner, only the codebook of a single layer can be tuned at a time (with the other layers fixed) in [Wu16], making its learning process less efficient at seeking better solutions. Besides, our approach merges and jointly compresses multiple models, instead of only a single model.

Two error terms are combined for minimization in our calibration training. One is the classification (or regression) loss used in the original models A and B. The other is the layer-wise output mismatch error: when applying an input to model A (or B), the output of every layer in the merged model should be close to the output of the associated layer in A (or B), and a norm of the difference is used to measure this error. We use a framework (TensorFlow [abadi2016tensorflow]) to implement the calibration training.
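A minimal sketch of how the two calibration-loss terms could be combined is shown below. The squared-error mismatch, the weighting factor lam, and the function name are our assumptions, since the text only states that a task loss and a layer-wise mismatch term are combined.

```python
import tensorflow as tf

def calibration_loss(labels, logits, merged_feats, teacher_feats, lam=1.0):
    """labels/logits: the task A (or B) classification head;
    merged_feats / teacher_feats: lists of per-layer outputs to be matched."""
    task = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    mismatch = tf.add_n([tf.reduce_mean(tf.square(m - t))
                         for m, t in zip(merged_feats, teacher_feats)])
    return task + lam * mismatch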

In the inference stage, to make the approach generalizable to edge devices that may not contain GPUs, we use the CPU and the OpenBLAS library [xianyi2012model, wang2013augem] to implement the merged model with the associated codebooks. To make fair comparisons, the Conv layers of the individual models compared with ours are realized via unrolling convolution [chellapilla2006high, anwar2017structured], which converts the volume convolution into a single matrix product and is commonly used as an efficient implementation of the Conv layer. Our code will be publicly available at GitHub (https://github.com/ivclab/NeuralMerger).
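For reference, the unrolling-convolution baseline works roughly as in the NumPy sketch below: input patches are unrolled into a matrix so the whole Conv layer reduces to a single matrix product executable by a BLAS routine. This is the textbook im2col scheme, not the authors' OpenBLAS code.

```python
import numpy as np

def im2col_conv(I, kernels):
    """I: (h, w, d); kernels: (n, k, k, d) -> output (h, w, n), zero padding."""
    n, k, _, d = kernels.shape
    r = (k - 1) // 2
    h, w, _ = I.shape
    Ipad = np.pad(I, ((r, r), (r, r), (0, 0)))
    # Each row of `cols` is one unrolled k*k*d patch centered at an output location.
    cols = np.stack([Ipad[x:x + k, y:y + k, :].ravel()
                     for x in range(h) for y in range(w)])        # (h*w, k*k*d)
    W = kernels.reshape(n, -1).T                                   # (k*k*d, n)
    return (cols @ W).reshape(h, w, n)                             # one matrix product

out = im2col_conv(np.random.randn(8, 8, 3), np.random.randn(4, 3, 3, 3))
print(out.shape)   # (8, 8, 4)
```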

4 Experiments

In this section, we present the results of several experiments conducted to verify the effectiveness of our approach.

Merging Sound and Image Recognition CNNs

The first experiment is to merge two CNNs of heterogeneous signal sources: image and sound. Although the sources are different, the same network (LeNet [LeCun98]) is used. It is applied to the datasets of two tasks:

Sound20 (https://github.com/ivclab/Sound20): This dataset contains 20 classes of sounds recorded from animals and instruments, with 16,636 training samples and 3,727 testing samples. It is constructed using the Animal Sound Data [keoghmonitoring] and the Instrument Data [Juliani16]. The raw signals are converted to 2D spectrograms and then resized to a fixed image size for classification.

Fashion-MNIST dataset [xiao2017/online]: This dataset contains a collection of 60,000 training and 10,000 testing images of size 28 × 28 in 10 classes of clothing styles.

Para. Conv1 Conv2 Fc1
ACCU 1/64 8/128 8/128
LIGHT 1/64 32/128 8/64
Table 1: The per-layer parameters for the 'ACCU' and 'LIGHT' settings in merging the two LeNet models (the classification layers are not co-compressed).
Para. Compr. Speedup Sound Image
ACCU 10.4 1.3 -0.06% 0.68%
LIGHT 15.3 1.8 0.76% 1.40%
Table 2: Compression ratio, speedup, and accuracy drop of the merged sound-and-image model. The speedup ratio is measured on an Intel(R) Xeon(R) E5-2640 v4 CPU in single-thread mode.