1 Introduction
Light convolutional deep neural networks (LCNNs) are attractive in resource-limited conditions because they deliver high performance at low cost. Some of the state-of-the-art LCNNs in computer vision are ShuffleNet ([20], [13]), IGC (Interleaved Group Convolutions, [19]) and IGCV3 (Interleaved Low-Rank Group Convolutions, [15]). A noticeable feature of their design is the presence of permutations for channel shuffling in between separable convolutions. These permutations, however, are handcrafted by designers outside of network training. A natural question is whether the permutations can be learned like network weights so that they are optimized based on training data. An immediate difficulty is that, unlike weights, permutations are highly discrete and incompatible with the stochastic gradient descent (SGD) methodology, which is continuous in nature. To overcome this challenge, we introduce an exact Lipschitz continuous nonconvex penalty that can be incorporated in SGD to approximate permutations at high precision. Consequently, exact permutations are obtained by simple rounding at the end of network training with a negligible drop in classification accuracy. To be specific, we shall work with the ShuffleNet architecture ([20], [13]). Our approach extends readily to other LCNNs.
Related Work. Permutation optimization is a long-standing problem arising in operations research and graph matching, among other applications [7, 3]. Well-known examples are linear and quadratic assignment problems [16]. Graph matching is a special case of the quadratic assignment problem, and can be formulated over the set $\mathcal{P}_N$ of $N \times N$ permutation matrices as: $\min_{\pi \in \mathcal{P}_N} \|A - \pi B \pi^T\|_F^2$, where $A$ and $B$ are the adjacency matrices of graphs with $N$ vertices, and $\|\cdot\|_F$ is the Frobenius norm. A popular way to handle $\mathcal{P}_N$ is to relax it to the Birkhoff polytope $\mathcal{D}_N$, the convex hull of $\mathcal{P}_N$, leading to a convex relaxation. The problem may still be nonconvex due to the objective function. The explicit realization of $\mathcal{D}_N$ is the set of doubly stochastic matrices $\mathcal{D}_N = \{M \in \mathbb{R}^{N \times N} : M\mathbf{1} = \mathbf{1},\ M^T\mathbf{1} = \mathbf{1},\ M \ge 0\}$, where $\mathbf{1}$ is the all-ones vector in $\mathbb{R}^N$. An approximate yet simpler way to treat $\mathcal{D}_N$ is through a classical first-order matrix scaling algorithm, e.g. the Sinkhorn method [14]. Though in principle such an algorithm converges, the cost can be quite high when iterating many times, which causes a bottleneck effect [11]. A nonconvex and more compact relaxation of $\mathcal{P}_N$ is by a sorting network [11], which maps a box into a manifold that sits inside $\mathcal{D}_N$ and contains $\mathcal{P}_N$. Yet another method is the path following algorithm [18], which seeks solutions under concave relaxations of $\mathcal{P}_N$ by solving a linear interpolation of convex-concave problems (starting from the convex relaxation). None of the existing relaxations are exact.
Contribution. Our nonconvex relaxation is a combination of a matrix $\ell_{1-2}$ penalty function and the set $\mathcal{D}_N$ of doubly stochastic matrices. The $\ell_{1-2}$ penalty (the difference of the $\ell_1$ and $\ell_2$ norms) has been proposed and found effective in selecting sparse vectors under nearly degenerate linear constraints [5, 17]. The matrix version is simply a sum of $\ell_{1-2}$ over all row and column vectors. We prove that the penalty is zero when applied to a matrix in $\mathcal{D}_N$ if and only if the matrix is a permutation matrix. Thanks to the $\mathcal{D}_N$ constraint, the penalty function is Lipschitz continuous (almost everywhere differentiable). This allows the penalty to be integrated directly into SGD for learning permutations in LCNNs. As shown in our experiments on the CIFAR-10 and ImageNet data sets, the distance to the set of permutation matrices turns out to be remarkably small at the end of network training, so that a simple rounding has a negligible effect on the validation accuracy. We also found that the convex relaxation by $\mathcal{D}_N$ alone fails to capture good permutations for LCNNs. To the best of our knowledge, this is the first time permutations have been successfully learned on deep CNNs to improve on handcrafted permutations.
Outline. In section 2, we introduce our exact permutation penalty, and prove the closeness to permutation matrices when the penalty values are small, as observed in the experiments. We also present the training algorithm, which combines thresholding and matrix scaling to approximate the projection onto $\mathcal{D}_N$ for SGD. In section 3, we analyze three permutation optimization problems to show the necessity of our penalty. In a 2-layer neural network regression model with shortcut (identity map), convex relaxation does not give the optimal permutation even with additional rounding, while our penalty can. In section 4, we show experimental results on the consistent improvement of auto-shuffle over handcrafted shuffle on both the CIFAR-10 and ImageNet data sets. The conclusion is in section 5.
2 Permutation, Matrix Penalty and Exact Relaxation
The channel shuffle operation in ShuffleNet [20, 13] can be represented as multiplying the feature map in the channel dimension by a permutation matrix $M$. A permutation matrix is a square binary matrix with exactly one entry of one in each row and each column and zeros elsewhere. In the ShuffleNet architecture [20, 13], $M$ is preset by the designers and will be called “manual”. In this work, we propose to learn an automated permutation matrix $M$ through network training, hence removing the human factor in its selection towards a more optimized shuffle. Since permutation is discrete in nature and too costly to enumerate, we propose to approach it by adding a matrix generalization of the $\ell_{1-2}$ penalty [5, 17] to the network loss function in stochastic gradient descent (SGD) based training.
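To make the matrix view of channel shuffling concrete, the following minimal sketch builds the index form of a group-wise "reshape, transpose, flatten" shuffle and the corresponding permutation matrix. It assumes that this is the manual shuffle in question; the helper names (`manual_shuffle_indices`, `permutation_matrix`, `matvec`) and the 6-channel, 3-group toy setting are illustrative choices, not taken from the paper.

```python
def manual_shuffle_indices(num_channels, groups):
    """Index form of a group-wise shuffle: view the channels as a
    (groups, channels_per_group) grid, transpose it, and flatten."""
    assert num_channels % groups == 0
    per_group = num_channels // groups
    # Output position j * groups + i takes input channel i * per_group + j.
    return [i * per_group + j for j in range(per_group) for i in range(groups)]


def permutation_matrix(indices):
    """0/1 matrix M with M[k][indices[k]] = 1, so that (M x)[k] = x[indices[k]]."""
    n = len(indices)
    return [[1 if c == indices[r] else 0 for c in range(n)] for r in range(n)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def is_permutation_matrix(matrix):
    """Binary entries with exactly one 1 in each row and each column."""
    binary = all(v in (0, 1) for row in matrix for v in row)
    rows_ok = all(sum(row) == 1 for row in matrix)
    cols_ok = all(sum(col) == 1 for col in zip(*matrix))
    return binary and rows_ok and cols_ok


idx = manual_shuffle_indices(num_channels=6, groups=3)
M = permutation_matrix(idx)
channels = ["a0", "a1", "b0", "b1", "c0", "c1"]  # three groups of two channels
shuffled = [channels[i] for i in idx]
```

In this toy case each output group receives one channel from every input group, which is the mixing effect the shuffle is meant to provide; a learned shuffle would replace `idx` by whatever permutation training selects.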
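Specifically, for $M = (m_{ij}) \in \mathbb{R}^{N \times N}$, the proposed continuous matrix penalty function is
$$P(M) := \sum_{i=1}^{N}\Big[\sum_{j=1}^{N}|m_{ij}| - \Big(\sum_{j=1}^{N} m_{ij}^2\Big)^{1/2}\Big] + \sum_{j=1}^{N}\Big[\sum_{i=1}^{N}|m_{ij}| - \Big(\sum_{i=1}^{N} m_{ij}^2\Big)^{1/2}\Big] \tag{1}$$
in conjunction with the doubly stochastic constraint:
$$m_{ij} \ge 0,\ \forall (i,j); \qquad \sum_{i=1}^{N} m_{ij} = 1,\ \forall j; \qquad \sum_{j=1}^{N} m_{ij} = 1,\ \forall i. \tag{2}$$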
Remark 1.
Remark 2.
Thanks to (2), we see that the penalty function $P$ is actually Lipschitz continuous in $M$, as $\sum_{j} m_{ij}^2 \ge 1/N$, $\forall i$, and $\sum_{i} m_{ij}^2 \ge 1/N$, $\forall j$.
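As a short worked step (a sketch added for clarity, assuming only that $M = (m_{ij})$ is an $N \times N$ matrix satisfying the doubly stochastic constraint (2)), the Cauchy-Schwarz inequality shows why the $\ell_2$ norm of each row stays bounded away from zero, so that its gradient is well defined and bounded:

```latex
% Assumes M satisfies the doubly stochastic constraint (2).
\begin{align*}
1 = \Big(\sum_{j=1}^{N} m_{ij}\cdot 1\Big)^{2}
  &\le N \sum_{j=1}^{N} m_{ij}^{2}
  \quad\Longrightarrow\quad
  \Big(\sum_{j=1}^{N} m_{ij}^{2}\Big)^{1/2} \ge \frac{1}{\sqrt{N}},
  \qquad \forall i, \\
\frac{\partial}{\partial m_{ij}} \Big(\sum_{k=1}^{N} m_{ik}^{2}\Big)^{1/2}
  &= \frac{m_{ij}}{\big(\sum_{k=1}^{N} m_{ik}^{2}\big)^{1/2}} \in [0, 1].
\end{align*}
```

The same argument applies to each column with the roles of $i$ and $j$ exchanged.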
Theorem 1.
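A square matrix $M$ is a permutation matrix if and only if $P(M) = 0$ and the doubly stochastic constraint (2) holds.
Proof.
($\Rightarrow$) Since a permutation matrix consists of columns (rows) with exactly one entry of 1 and the rest being zeros, each term inside the outer sums of $P(M)$ equals zero, and clearly (2) holds.
($\Leftarrow$) By the Cauchy-Schwarz inequality,
$$\sum_{j=1}^{N} |m_{ij}| \ge \Big(\sum_{j=1}^{N} m_{ij}^2\Big)^{1/2}, \quad \forall i,$$
with equality if and only if the row-wise cardinality is 1:
$$|\{j : m_{ij} \neq 0\}| = 1, \quad \forall i. \tag{3}$$
This is because the mixed product terms like $|m_{ij}|\,|m_{ij'}|$ ($j \neq j'$) in $\big(\sum_{j} |m_{ij}|\big)^2$ must all be zero to match $\sum_{j} m_{ij}^2$. This only happens when equation (3) is true. Likewise,
$$\sum_{i=1}^{N} |m_{ij}| \ge \Big(\sum_{i=1}^{N} m_{ij}^2\Big)^{1/2}, \quad \forall j,$$
with equality if and only if $|\{i : m_{ij} \neq 0\}| = 1$, $\forall j$. In view of (2), $M$ is a permutation matrix. ∎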
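The statement of Theorem 1 can be checked numerically on small matrices. The sketch below is an illustration only: it evaluates the sum of ($\ell_1$ norm minus $\ell_2$ norm) over all rows and columns, and the function names and the three test matrices are assumptions made for this example.

```python
import math


def l1_minus_l2(vector):
    """Difference of the l1 and l2 norms of a vector (nonnegative for any input)."""
    return sum(abs(v) for v in vector) - math.sqrt(sum(v * v for v in vector))


def matrix_penalty(matrix):
    """Sum of l1_minus_l2 over all rows and all columns of a square matrix."""
    rows = matrix
    cols = list(zip(*matrix))
    return sum(l1_minus_l2(r) for r in rows) + sum(l1_minus_l2(c) for c in cols)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cyclic = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
uniform = [[1.0 / 3.0] * 3 for _ in range(3)]  # doubly stochastic, not a permutation

penalty_identity = matrix_penalty(identity)
penalty_cyclic = matrix_penalty(cyclic)
penalty_uniform = matrix_penalty(uniform)
```

Both permutation matrices give a penalty of exactly zero, while the uniform doubly stochastic matrix, which is as far from a permutation as possible, gives $6(1 - 1/\sqrt{3}) \approx 2.54$.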
The nonnegativity constraint in (2) is maintained throughout SGD by thresholding $m_{ij} \to \max(m_{ij}, 0)$. The normalization conditions in (2) are applied sequentially, once per SGD iteration; hence they are not strictly enforced. In theory, if the column normalization (divide each column by its sum) and row normalization (divide each row by its sum) are iterated sufficiently many times, the resulting matrices converge to one satisfying (2). This is known as the Sinkhorn process or RAS method [14], a first-order method to approximately solve the so-called matrix scaling problem (MSP). Simply stated, the MSP for a given nonnegative real matrix $A$ is to scale its rows and columns (i.e. multiply each by a nonnegative constant) to realize the prescribed row sums and column sums. The approximate MSP is: given a tolerance $\epsilon > 0$, find positive diagonal matrices that, applied on the left and right of $A$, realize the prescribed row and column sums to within $\epsilon$. For a historical account of the MSP and a summary of various algorithms to date, see [2] and Table 1 therein. The RAS method is an alternating minimization procedure with convergence guarantees. Each iteration of the RAS method costs $O(m)$, $m$ being the number of nonzero entries in $A$. If the entries of $A$ are polynomially bounded (which is the case during network training due to the continuous nature of SGD), the RAS method converges within the iteration bound of [6], and the resulting total complexity bound holds up to logarithmic factors. Improvements of the complexity bounds via minimizing the log of capacity and higher-order methods can be found in [2]. However, for our study the first-order method [14] suffices for two reasons. One is that it is computationally cheap; the other is that the error in the matrix scaling step can be compensated by the network weight adjustment during SGD. In fact, we did not find much benefit in iterating the RAS method more than once in terms of enhancing validation accuracy. This is quite different from solving the MSP as a stand-alone task.
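The thresholding and alternating normalization described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: unit row and column sums are the target, one sweep means a column normalization followed by a row normalization, and the function names and the $3 \times 3$ starting matrix are made up for the example.

```python
def threshold_nonnegative(matrix):
    """Clip negative entries to zero."""
    return [[max(v, 0.0) for v in row] for row in matrix]


def normalize_columns(matrix):
    """Divide each column by its sum."""
    col_sums = [sum(col) for col in zip(*matrix)]
    return [[v / s for v, s in zip(row, col_sums)] for row in matrix]


def normalize_rows(matrix):
    """Divide each row by its sum."""
    return [[v / sum(row) for v in row] for row in matrix]


def ras_sweeps(matrix, sweeps=1):
    """Threshold once, then alternate column and row normalization."""
    m = threshold_nonnegative(matrix)
    for _ in range(sweeps):
        m = normalize_rows(normalize_columns(m))
    return m


def max_sum_violation(matrix):
    """Largest deviation of any row or column sum from 1."""
    row_err = max(abs(sum(row) - 1.0) for row in matrix)
    col_err = max(abs(sum(col) - 1.0) for col in zip(*matrix))
    return max(row_err, col_err)


start = [[0.9, 0.3, -0.1], [0.2, 0.8, 0.4], [0.1, 0.2, 0.7]]
one_sweep = ras_sweeps(start, sweeps=1)
many_sweeps = ras_sweeps(start, sweeps=200)
```

After a single sweep the rows sum to one but the columns only approximately so, which mirrors the remark that the constraints are not strictly enforced during training; repeating the sweep drives the violation towards zero.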
The multiplication by $M$ can be embedded in the network as a $1 \times 1$ convolution layer with $M$ initialized as the absolute value of a random Gaussian matrix. After each weight update, we threshold the weights to be nonnegative, normalize the rows to unit sums, then repeat on the columns. Let $L$ be the network loss function. The training minimizes the objective function:
$$f(w, M) := L(w, M) + \lambda \sum_{k=1}^{K} P(M_k), \tag{4}$$
where $P$ is the penalty (1), $K$ is the total number of “channel shuffles”, the $M_k$’s are abbreviated as $M$, $w$ is the network weight, and $\lambda$ is a positive parameter. The training algorithm is summarized in Algorithm 1. The $\ell_1$ term in the penalty function has a standard subgradient, and the $\ell_2$ term is differentiable away from zero, which is guaranteed in Algorithm 1 due to the continuity of SGD and the normalization in columns and rows. The parameter $\lambda$ is chosen so as to balance the contributions of the two terms in (4) and drive the $M_k$’s close to permutation matrices.
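The update just described (a subgradient step on loss plus penalty, followed by thresholding and one row and column normalization) can be sketched on a toy matrix. This is not the paper's Algorithm 1: the data-dependent loss gradient is replaced by a zero placeholder, and the names `training_step` and `threshold_and_normalize`, the step size `lr=0.1` and `lam=1.0` are assumptions chosen for illustration only.

```python
import math


def penalty(matrix):
    """Sum of (l1 norm - l2 norm) over all rows and all columns."""
    def l1_minus_l2(vec):
        return sum(abs(v) for v in vec) - math.sqrt(sum(v * v for v in vec))
    return sum(map(l1_minus_l2, matrix)) + sum(map(l1_minus_l2, zip(*matrix)))


def penalty_subgradient(matrix):
    """Entrywise subgradient of the penalty; needs nonzero rows and columns."""
    row_norms = [math.sqrt(sum(v * v for v in row)) for row in matrix]
    col_norms = [math.sqrt(sum(v * v for v in col)) for col in zip(*matrix)]

    def sign(v):
        return (v > 0) - (v < 0)

    return [[2 * sign(v) - v / row_norms[i] - v / col_norms[j]
             for j, v in enumerate(row)]
            for i, row in enumerate(matrix)]


def threshold_and_normalize(matrix):
    """Clip to nonnegative, then normalize rows and columns once each."""
    m = [[max(v, 0.0) for v in row] for row in matrix]
    m = [[v / sum(row) for v in row] for row in m]
    col_sums = [sum(col) for col in zip(*m)]
    return [[v / s for v, s in zip(row, col_sums)] for row in m]


def training_step(matrix, loss_grad, lam, lr):
    """One step on loss + lam * penalty, followed by the approximate projection."""
    pen_grad = penalty_subgradient(matrix)
    updated = [[v - lr * (g + lam * p) for v, g, p in zip(row, g_row, p_row)]
               for row, g_row, p_row in zip(matrix, loss_grad, pen_grad)]
    return threshold_and_normalize(updated)


M = [[0.6, 0.4], [0.4, 0.6]]
grad_at_start = penalty_subgradient(M)
zero_loss_grad = [[0.0, 0.0], [0.0, 0.0]]  # placeholder: no data term in this toy run
history = [penalty(M)]
for _ in range(30):
    M = training_step(M, zero_loss_grad, lam=1.0, lr=0.1)
    history.append(penalty(M))
```

In this toy run the penalty alone pushes the relaxed matrix to the identity, where both the penalty and its subgradient vanish; in actual training the loss gradient would determine which permutation is selected.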
We shall see that the penalty indeed gets smaller and smaller during training. Here we show a theoretical bound on the distance to the set of permutation matrices when the penalty $P(M)$ is small and (2) holds approximately.
Theorem 2.
Let the dimension of a nonnegative square matrix be fixed. If , , and the doubly stochastic constraints are satisfied to , then there exists a permutation matrix such that .
Proof.
It follows from that
implying that:
(5) 
On the other hand for :
(6) 
Let , at any . It follows from (6) that
and from (5) that
Hence each row of is close to a unit coordinate vector, with one entry near 1 and the rest near 0. Similarly from , and , we deduce that each column of is close to a unit coordinate vector, with one entry near 1 and the rest near 0. Combining the two pieces of information above, we conclude that is close to a permutation matrix. ∎
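The practical consequence of this closeness can be illustrated with entrywise rounding at the cutoff 1/2. The sketch below is an assumption-laden toy example (the two $3 \times 3$ matrices and the helper names are invented for illustration): a matrix whose rows and columns are each near a unit coordinate vector rounds to a permutation matrix, whereas a doubly stochastic matrix far from any permutation does not.

```python
def round_entries(matrix, cutoff=0.5):
    """Entrywise rounding: 1 above the cutoff, 0 otherwise."""
    return [[1 if v > cutoff else 0 for v in row] for row in matrix]


def is_permutation_matrix(matrix):
    """Binary entries with exactly one 1 in each row and each column."""
    binary = all(v in (0, 1) for row in matrix for v in row)
    ones_per_row = all(sum(row) == 1 for row in matrix)
    ones_per_col = all(sum(col) == 1 for col in zip(*matrix))
    return binary and ones_per_row and ones_per_col


def max_entry_gap(a, b):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(x - y) for r1, r2 in zip(a, b) for x, y in zip(r1, r2))


near_permutation = [[0.02, 0.97, 0.01],
                    [0.98, 0.01, 0.01],
                    [0.00, 0.02, 0.98]]
uniform = [[1.0 / 3.0] * 3 for _ in range(3)]  # doubly stochastic, far from any permutation

rounded_good = round_entries(near_permutation)
rounded_bad = round_entries(uniform)
gap = max_entry_gap(near_permutation, rounded_good)
```

The second case shows why the doubly stochastic constraint alone is not enough: without a small penalty value, entrywise rounding may not return a permutation at all.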
The learned nonnegative matrix will be called a relaxed shuffle and will be rounded to the nearest permutation matrix to produce a final auto shuffle. Strictly speaking, this “rounding” involves finding the orthogonal projection onto the set of permutation matrices, a problem called the linear assignment problem (LAP); see [1] and references therein. The LAP can be formulated as a linear program over the doubly stochastic matrices, i.e. under constraints (2), and is solvable in polynomial time [1]. As we shall see later in Table 3, the relaxed shuffle comes amazingly close to an exact permutation in network learning. Hence, it turns out to be unnecessary to solve the LAP exactly; a simple rounding will do. AutoShuffleNet units adapted from ShuffleNet v1 [20] and ShuffleNet v2 [13] are illustrated in Figs. 22.
3 Permutation Problems Unsolvable by Convex Relaxation
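The doubly stochastic matrix condition (2) is a popular convex relaxation of permutation. However, it is not powerful enough to enable auto-shuffle learning, as we shall see later. In this section, we present examples from permutation optimization to show the limitation of the convex relaxation (2), and how our proposed penalty (1) can strengthen (2) to retrieve permutation matrices.
Let us recall the graph matching (GM) problem, see [16, 11, 1, 12, 18] and references therein. The goal is to align the vertices of two graphs so as to minimize the number of edge disagreements. Given a pair of $n$-vertex graphs $G_A$ and $G_B$, with respective $n \times n$ adjacency matrices $A$ and $B$, the GM problem is to find a permutation matrix $\pi$ to minimize $\|A\pi - \pi B\|_F^2$. Let $\Pi$ be the set of all $n \times n$ permutation matrices, solve
$$\pi^* := \arg\min_{\pi \in \Pi} \|A\pi - \pi B\|_F^2. \tag{7}$$
By the algebraic identity:
$$\|A\pi - \pi B\|_F^2 = \|A\|_F^2 + \|B\|_F^2 - 2\,\mathrm{tr}(A \pi B^T \pi^T), \qquad \pi \in \Pi,$$
the GM problem (7) is the same as:
$$\pi^* = \arg\min_{\pi \in \Pi} \mathrm{tr}(-A \pi B^T \pi^T), \tag{8}$$
a quadratic assignment problem (QAP). The general QAP for two real square matrices $A$ and $B$ is [16, 11]:
$$\pi^* = \arg\min_{\pi \in \Pi} \mathrm{tr}(A \pi B^T \pi^T). \tag{9}$$
The convex relaxed GM is:
$$\pi^* := \arg\min_{\pi \in \mathcal{D}_n} \|A\pi - \pi B\|_F^2, \tag{10}$$
where $\mathcal{D}_n$ denotes the set of $n \times n$ matrices satisfying the doubly stochastic constraint (2).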
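For small graphs, the relation between the Frobenius-norm form of graph matching and its trace (QAP) form can be checked by brute force. The sketch below assumes the standard formulation $\|A\pi - \pi B\|_F^2 = \|A\|_F^2 + \|B\|_F^2 - 2\,\mathrm{tr}(A\pi B^T\pi^T)$ for permutation matrices $\pi$; the two 3-vertex graphs and all function names are illustrative assumptions, and enumeration is only feasible for very small $n$.

```python
from itertools import permutations


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_sq(a):
    return sum(v * v for row in a for v in row)


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def permutation_matrix(perm):
    n = len(perm)
    return [[1 if c == perm[r] else 0 for c in range(n)] for r in range(n)]


def gm_objective(A, B, P):
    """Squared Frobenius norm of A P - P B."""
    AP, PB = matmul(A, P), matmul(P, B)
    return frobenius_sq([[x - y for x, y in zip(r1, r2)] for r1, r2 in zip(AP, PB)])


def trace_form(A, B, P):
    """||A||_F^2 + ||B||_F^2 - 2 tr(A P B^T P^T)."""
    t = trace(matmul(matmul(A, P), matmul(transpose(B), transpose(P))))
    return frobenius_sq(A) + frobenius_sq(B) - 2 * t


def brute_force_match(A, B):
    """Exhaustive search over all permutation matrices."""
    candidates = (permutation_matrix(p) for p in permutations(range(len(A))))
    return min(candidates, key=lambda P: gm_objective(A, B, P))


A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # path graph, centre at vertex 1
B = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # same graph, centre at vertex 0
best = brute_force_match(A, B)
best_value = gm_objective(A, B, best)
```

Since the two graphs are isomorphic, the best permutation attains objective zero. The factorial growth of the search space is what motivates continuous relaxations in the first place.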
As an instance of general QAP, let us consider problem (7) in case for two real matrices:
If , then:
and equals ( ):
Example 1: We show that by selecting:
which is convex on and has minimum at . The convex relaxed matrix solution is:
however, the permutation matrix solution to GM problem (7) is the identity matrix at .
In the spirit of objective function (4), let us minimize:
or equivalently minimize (after skipping some additive constants in ):
An illustration of is shown in Fig. 3. The minimal point moves from the interior of the interval when (dashed line, top curve) to the end point 1 as increases to 1 (linestar, middle curve) and remains there as further increases to 2 (linecircle, bottom curve). So for , is recovered with our proposed penalty. ∎
Example 2: Consider the adjacency matrix of an undirected graph with 2 nodes and 1 edge (with a loop at node 1). An edge adds 1 and a loop adds 2 to an adjacency matrix.
Then:
So . The regularized objective function (modulo additive constants) is:
with . In view of:
two possible interior critical points are:
(11) 
Since , if , the second equality in (11) is ruled out. Comparing with , we see that the global minimal point does not occur at if
Hence if , minimizing recovers . ∎
In Fig. 4, we show that two minimal points of occur in the interior of when , and transition to , at . When , or , the second equality in (11) cannot hold, becomes convex with a unique minimal point at .
Remark 3.
We refer to [12] on certain correlated random Bernoulli graphs where . On the other hand, there is a class of friendly graphs [1] where . Existing techniques to improve convex relaxation on GM and QAP include approximate quadratic programming, sorting networks and path following based homotopy methods [16, 11, 18]. Our proposed penalty (1)(2) appears more direct and generic. A detailed comparison will be left for a future study.
Remark 4.
In example 1 above, if the convex relaxed is rounded up to 1, then . In example 2 (Fig. 4), the two interior minimal points at , after rounding down (up), become zero or one. So convex relaxation with the help of rounding still recovers the exact permutation. We show in example 3 below that convex relaxation still fails after rounding (to 1 if the number is above 1/2, to 0 if the number is below 1/2).
Example 3: We consider the twolayer neural network model with one hidden layer [10]. Given , the forward model is the following function:
where
is the ReLU activation function,
is the input random vector drawn from a probability distribution,
is the weight matrix, is the identity matrix. Consider a twolayer teacher network with weight matrixWe train the student network with doubly stochastic constraint on using the loss:
Let ,
We write the loss function as
(12) 
where for , is defined as
Define then and
For simplicity, let
obey uniform distribution on
. For , , , equals(13) 
We have
(14) 
(15) 
where . The last term in (12) is a constant:
(16) 
Consider a special case when , , and . By (14)(16), the loss function equals
Let When , , Fig. 5 (top left) shows has minimal points in the interior of . A permutation matrix that minimizes can be achieved by rounding the minimal points. However, when , , (Fig. 5, top right), rounding the interior minimal point of gives the wrong permutation matrix at . At , the regularization selects the correct permutation matrix.
Remark 5.
4 Experiments
We relax the shuffle units in ShuffleNet v1 [20] and ShuffleNet v2 [13] and perform experiments on the CIFAR-10 [8] and ImageNet [4, 9] classification datasets. The accuracy of the auto shuffles is evaluated after the relaxed shuffles are rounded.
On CIFAR10 dataset, we set the penalty parameter . All experiments are randomly initialized with learning rate linearly decaying from . We train each network for epochs. We set weightdecay , momentum and batch size . The experiments are carried out on a machine with single Nvidia GeForce GTX 1080 Ti GPU. In Table 1, we see that auto shuffle consistently improves by as much as 1% on manual shuffle in both v1 and v2 models of ShuffleNet.
Table 1. Classification accuracy (%) on CIFAR-10 with manual and auto shuffle.
Networks                Manual   Auto
ShuffleNet v1 (g=3)     90.55    91.76
ShuffleNet v1 (g=8)     90.06    91.26
ShuffleNet v2 (1×)      91.90    92.81
ShuffleNet v2 (1.5×)    92.56    93.22
On ImageNet dataset, we set the penalty parameter . For each experiment, the training process includes two training cycles: the first cycle is randomly initialized with learning rate starting at and the second one is resumed from the first one with learning rate starting at . Each cycle consists of epochs and the learning rate decays linearly. We set weightdecay , momentum and batch size . The experiments are carried out on a machine with 4 Nvidia GeForce GTX 1080 Ti GPUs. In Table 2, we see that auto shuffle again consistently improves on manual shuffle for both v1 and v2 models.
Table 2. Classification accuracy (%) on ImageNet with manual and auto shuffle.
Networks                Manual   Auto
ShuffleNet v1 (g=3)     65.50    65.62
ShuffleNet v1 (g=8)     65.18    65.76
ShuffleNet v2 (1×)      67.46    67.69
ShuffleNet v2 (1.5×)    70.58    70.60
The permutation matrix of the first shuffle unit in ShuffleNet v1 (g=3) is a matrix of size , which can be visualized in Fig. 7 (manual, left) along with an auto shuffle (right). The dots (blanks) denote locations of 1’s (0’s). The auto shuffle looks disordered while the manual shuffle is ordered. However, the inference cost of auto shuffle is comparable to manual shuffle since the shuffle is fixed and stored after training.
The accuracy drop due to rounding to produce auto shuffle from relaxed shuffle is indicated by relative error in Table 3. On CIFAR10 dataset, negligible drop is observed for ShuffleNet v1. Interestingly, rounding even gained accuracy for ShuffleNet v1 on ImageNet dataset.
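Table 3. Relative error in accuracy caused by rounding the relaxed shuffle to the auto shuffle (+ indicates a gain in accuracy after rounding).
Dataset     Networks                Relative Error
CIFAR-10    ShuffleNet v1 (g=3)     0
CIFAR-10    ShuffleNet v1 (g=8)     0
CIFAR-10    ShuffleNet v2 (1×)      2.15E-4
CIFAR-10    ShuffleNet v2 (1.5×)    1.07E-3
ImageNet    ShuffleNet v1 (g=3)     +6.10E-5
ImageNet    ShuffleNet v1 (g=8)     +3.04E-5
ImageNet    ShuffleNet v2 (1×)      0
ImageNet    ShuffleNet v2 (1.5×)    2.83E-5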
The penalty of ShuffleNet v1 (g=3) is plotted in Fig. 8. As the penalty decays, the validation accuracy of auto shuffle (after rounding) becomes closer to relaxed shuffle (before rounding), see Fig. 9.
To demonstrate the significance of the regularization, we also tested auto shuffle with various on ShuffleNet v1 (g=3). Table 4 shows that the accuracy drops much after the relaxed shuffle is rounded. We plot the stochastic matrix of the first shuffle unit of the network at and respectively in Fig. 10. The penalty is large when is relatively small, indicating that the stochastic matrices learned are not close to optimal permutation matrices.
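Table 4. Accuracy (%) of relaxed and auto shuffle and the penalty value for various $\lambda$ on ShuffleNet v1 (g=3).
$\lambda$   0         1E-5      1E-4      5E-4      1E-3
relaxed     90.00     90.18     90.48     91.45     91.76
auto        10.00     38.18     11.37     71.50     91.76
penalty     3.37E3    1.59E3    4.95E2    3.13E1    5.07E-2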
5 Conclusion
We introduced a novel, exact and Lipschitz continuous relaxation for permutation and for learning channel shuffling in ShuffleNet. We showed through a regression problem of a 2-layer neural network with shortcut that convex relaxation fails even with additional rounding, while our relaxation is precise. We plan to extend our work to the auto-shuffling of other LCNNs and to hard permutation problems in the future.
6 Acknowledgement
The work was partially supported by NSF grant IIS-1632935.
References
[1] Y. Aflalo, A. Bronstein, and R. Kimmel. On convex relaxation of graph isomorphism. Proc. National Academy Sci., 112(10):2942–2947, 2015.
[2] Zeyuan Allen-Zhu, Yuanzhi Li, Rafael Oliveira, and Avi Wigderson. Much faster algorithms for matrix scaling. 58th Annual IEEE Symposium on Foundations of Computer Science, pages 890–901, 2017.
[3] R. Burkard. The quadratic assignment problem. In: Handbook of Combinatorial Optimization, pages 2741–2814, 2013.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. ImageNet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
[5] Ernie Esser, Yifei Lou, and Jack Xin. A method for finding structured sparse solutions to nonnegative least squares problems with applications. SIAM J. Imaging Sciences, 6:2010–2046, 2013.
[6] B. Kalantari, I. Lari, F. Ricca, and B. Simeone. On the complexity of general matrix scaling and entropy minimization via the RAS algorithm. Mathematical Programming, 112(2):371–401, 2008.
[7] T. Koopmans and M. Beckmann. Assignment problems and the location of economic activities. Econometrica, 25:53–76, 1957.
[8] Alex Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Conference on Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
[10] Y. Li and Y. Yuan. Convergence analysis of two-layer neural networks with ReLU activation. arXiv preprint 1705.09886v2, 2017.
[11] C. Lim and S. Wright. A box-constrained approach for hard permutation problems. Proceedings of the 33rd International Conference on Machine Learning, 10 pages, 2016.
[12] V. Lyzinski, D. Fishkind, M. Fiori, J. Vogelstein, C. Priebe, and G. Sapiro. Graph matching: Relax at your own risk. arXiv preprint 1405.3133, 2014.
[13] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet v2: Practical guidelines for efficient CNN architecture design. arXiv preprint arXiv:1807.11164, 2018.
[14] Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2):876–879, 1964.
[15] K. Sun, M. Li, D. Liu, and J. Wang. IGCV3: Interleaved low-rank group convolutions for efficient deep neural networks. arXiv preprint 1806.00178v1, 2018.
[16] J. Vogelstein, J. Conroy, V. Lyzinski, L. Podrazik, S. Kratzer, E. Harley, D. Fishkind, R. Vogelstein, and C. Priebe. Fast approximate quadratic programming for graph matching. PLoS ONE, 10(4), 2015.
[17] Penghang Yin, Yifei Lou, Qi He, and Jack Xin. Minimization of $\ell_{1-2}$ for compressed sensing. SIAM J. Sci. Computing, 37(1):A536–A563, 2015.
[18] M. Zaslavskiy, F. Bach, and J. Vert. A path following algorithm for the graph matching problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31:2227–2242, 2009.
[19] T. Zhang, G.-J. Qi, B. Xiao, and J. Wang. Interleaved group convolutions. CVPR, pages 4373–4382, 2017.
[20] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.