Fast, Accurate and Lightweight Super-Resolution with Neural Architecture Search

01/22/2019 · Xiangxiang Chu et al. · Xiaomi

Deep convolutional neural networks demonstrate impressive results in the super-resolution domain. A large body of research concentrates on improving peak signal-to-noise ratio (PSNR) by using ever deeper layers, which is unfriendly to resource-constrained devices. Pursuing a trade-off between the restoration capacity and the simplicity of a model remains non-trivial. Recently, more contributions have been devoted to this balance, and our work focuses on improving it further with automatic neural architecture search. In this paper, we handle super-resolution with a multi-objective approach and propose an elastic search method covering both macro and micro aspects, based on a hybrid controller combining an evolutionary algorithm and reinforcement learning. Quantitative experiments show that the models generated by our method are highly competitive with, and even dominate, most state-of-the-art super-resolution methods at different levels of FLOPS.


1 Introduction and Related Works

As a classical task in computer vision, single image super-resolution (SISR) aims to restore a high-resolution image from a degraded low-resolution one, which is known as an ill-posed inverse problem. Most recent work on SISR has shifted to deep learning, surpassing other SISR algorithms by large margins [Dong et al.2014, Kim et al.2016a, He et al.2016, Ahn et al.2018].

Nonetheless, these human-designed models are tedious to fine-tune or to compress. Meanwhile, neural architecture search has produced dominant models for classification tasks [Zoph and Le2016, Zoph et al.2017]. Following this trend, a recent work [Chu et al.2019] sheds light on the SISR task with a reinforced evolutionary search method, achieving results that outperform some notable networks, including VDSR [Kim et al.2016a].

In this paper, we dive deeper into the SISR task with elastic neural architecture search, achieving results comparable to CARN and CARN-M [Ahn et al.2018] (our models are released at https://github.com/falsr/FALSR). Our main contributions can be summarized in the following four aspects:

  • releasing several fast, accurate and lightweight super-resolution architectures and models, which are highly competitive with recent state-of-the-art methods,

  • performing elastic search by combining the micro and macro space at the cell level to boost capacity,

  • formulating super-resolution as a constrained multi-objective optimization problem and applying a hybrid model-generation method to balance exploration and exploitation,

  • producing high-quality models that can meet various requirements under given constraints within a single run.

Figure 1: Neural architecture for super-resolution: a feature extractor, a chain of cell blocks, and subpixel upsampling (the arrows denote skip connections).

2 Pipeline Architecture

Like most NAS approaches, our pipeline contains three principal components: an elastic search space, a hybrid model generator, and a model evaluator based on incomplete training. Each is explained in detail in the following sections.

Similar to [Lu et al.2018, Chu et al.2019], we also apply NSGA-II [Deb et al.2002] to solve the multi-objective problem. Our work differs from them by using a hybrid controller and a cell-based elastic search space that enables both macro and micro search.

We take three objectives into account for the super-resolution task,

  • quantitative metric to reflect the performance of models (PSNR),

  • quantitative metric to evaluate the computational cost of each model (mult-adds),

  • number of parameters.

In addition, we consider the following constraints (a short sketch of this formulation follows the list),

  • minimal PSNR for practical visual perception,

  • maximal mult-adds regarding resource limits.
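To make this formulation concrete, the following Python sketch (ours, not the authors' code; the threshold values are placeholders) encodes a candidate by its three objectives, applies the two constraints as a hard filter, and checks Pareto dominance:

    from dataclasses import dataclass

    @dataclass
    class Candidate:
        psnr: float       # restoration quality, to be maximized (dB)
        mult_adds: float  # computational cost, to be minimized (in G)
        params: float     # number of parameters, to be minimized (in K)

    # Illustrative constraint thresholds; the paper does not publish specific values here.
    MIN_PSNR = 37.0        # minimal PSNR for practical visual perception
    MAX_MULT_ADDS = 600.0  # maximal mult-adds regarding resource limits

    def feasible(c: Candidate) -> bool:
        """A candidate must satisfy both hard constraints before entering the Pareto ranking."""
        return c.psnr >= MIN_PSNR and c.mult_adds <= MAX_MULT_ADDS

    def dominates(a: Candidate, b: Candidate) -> bool:
        """a dominates b if it is no worse on every objective and strictly better on at least one."""
        no_worse = a.psnr >= b.psnr and a.mult_adds <= b.mult_adds and a.params <= b.params
        better = a.psnr > b.psnr or a.mult_adds < b.mult_adds or a.params < b.params
        return no_worse and better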

3 Elastic Search Space

Our search space is designed to perform both micro and macro search. The former chooses promising cells within each cell block and can be viewed as a feature-extraction selector. The latter searches the backbone connections between cell blocks, combining features at selected levels. In addition, we use one cell block as our minimum search element for two reasons: design flexibility and broad representational capacity.

Typically, the super-resolution task can be divided into three sub-procedures: feature extraction, nonlinear mapping, and restoration. Since most deep learning approaches concentrate on the second part, we design our search space to describe the mapping while fixing the others. Figure 1 depicts our main flow for super-resolution. Thus, a complete model contains a predefined feature extractor (a 2D convolution with 32 3×3 filters), cell blocks drawn from the micro search space joined by connections from the macro search space, and subpixel-based upsampling and restoration (our upsampling contains a 2D convolution with 32 3×3 filters, followed by a 3×3 convolution with one filter of unit stride).
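For illustration, here is a minimal PyTorch sketch of the fixed parts of this pipeline, with the searched cell blocks abstracted as an nn.ModuleList. This is our reading of the text (single-channel input, 32-channel features throughout, and a ×2 pixel shuffle placed between the two upsampling convolutions), not the released implementation:

    import torch
    from torch import nn

    class SRSkeleton(nn.Module):
        """Fixed skeleton: feature extractor -> searched cells -> subpixel upsampling (x2)."""

        def __init__(self, cells: nn.ModuleList, in_channels: int = 1):
            super().__init__()
            self.extractor = nn.Conv2d(in_channels, 32, kernel_size=3, padding=1)
            self.cells = cells                                # searched nonlinear mapping
            self.pre_shuffle = nn.Conv2d(32, 32, kernel_size=3, padding=1)
            self.shuffle = nn.PixelShuffle(2)                 # 32 channels -> 8 at 2x resolution
            self.restore = nn.Conv2d(8, 1, kernel_size=3, stride=1, padding=1)

        def forward(self, x):
            f = self.extractor(x)
            for cell in self.cells:                           # macro skip connections omitted here
                f = cell(f)
            return self.restore(self.shuffle(self.pre_shuffle(f)))

    # Usage sketch with identity placeholders standing in for searched cells:
    model = SRSkeleton(nn.ModuleList([nn.Identity() for _ in range(7)]))
    out = model(torch.randn(1, 1, 48, 48))                    # -> torch.Size([1, 1, 96, 96])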

3.1 Cell-Level Micro Search Space

For simplicity, all cell blocks share the same cell search space $S$. Specifically, the micro search space comprises the following elements:

  • convolutions: 2D convolution, grouped convolution with groups in {2, 4}, inverted bottleneck block with an expansion rate of 2,

  • channels: {16, 32, 48, 64},

  • kernels: {1, 3},

  • in-cell residual connections: {with, without},

  • repeated blocks: {1, 2, 4}.

Therefore, the size of the micro space for $n$ cell blocks is $|S|^n$.
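As a sanity check on the size of $S$, the choices listed above can be enumerated directly (the operator names are ours; $n = 7$ matches the cell count of the FALSR architectures shown later):

    from itertools import product

    CONV_TYPES = ["conv2d", "groupconv_g2", "groupconv_g4", "invbottleneck_e2"]
    CHANNELS = [16, 32, 48, 64]
    KERNELS = [1, 3]
    RESIDUAL = [True, False]       # in-cell residual connection or not
    REPEATS = [1, 2, 4]            # repeated blocks per cell

    S = list(product(CONV_TYPES, CHANNELS, KERNELS, RESIDUAL, REPEATS))
    print(len(S))                  # 4 * 4 * 2 * 2 * 3 = 192 cell choices
    print(len(S) ** 7)             # |S|^n, the micro space size for n = 7 cell blocks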

3.2 Intercell Macro Search Space

The macro search space defines the connections among different cell blocks. Specifically, for the $i$-th cell block $b_i$ (here $i$ starts at 1), there are $2^{n-i}$ choices of connections to build the information flow from the input of $b_i$ to its following cell blocks. Furthermore, we use $c_i^j$ to represent the path from the input of $b_i$ to $b_j$; we set $c_i^j = 1$ if there is a connection between them, otherwise 0. Therefore, the size of the macro space for $n$ cell blocks is $2^{\sum_{i=1}^{n}(n-i)} = 2^{n(n-1)/2}$. In summary, the size of the total space is $|S|^n \cdot 2^{n(n-1)/2}$.
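Under this reading (connections only run forward from the input of a block to later blocks), the space sizes can be checked in a couple of lines:

    def macro_space_size(n: int) -> int:
        # One binary choice c_i^j for every ordered pair i < j.
        return 2 ** (n * (n - 1) // 2)

    def total_space_size(n: int, cell_choices: int) -> int:
        return (cell_choices ** n) * macro_space_size(n)

    print(macro_space_size(7))        # 2**21 connection patterns for n = 7
    print(total_space_size(7, 192))   # combined micro + macro search space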

4 Model Generator

Our model generator is a hybrid controller involving both reinforcement learning (RL) and an evolutionary algorithm (EA). The EA part handles the iterative process, while RL is used to strengthen exploitation. To be specific, the iteration is controlled by NSGA-II [Deb et al.2002], which contains four sub-procedures: population initialization, selection, crossover, and mutation. To avoid verbosity, we only cover our variations to NSGA-II.
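The overall loop can be summarized by the following schematic sketch (ours); parent selection, the mutation-strategy probabilities and all operator bodies are placeholders, and the paper's crowding-distance tournament is shown here as plain random pairing:

    import random

    def evolve(pop_size, generations, init_model, evaluate, crossover,
               mutate_random, mutate_rws, mutate_rl, nsga2_select):
        """NSGA-II-style outer loop with a hybrid (EA + RL) mutation step."""
        population = [init_model() for _ in range(pop_size)]
        scores = [evaluate(m) for m in population]
        for _ in range(generations):
            offspring = []
            while len(offspring) < pop_size:
                parent_a, parent_b = random.sample(population, 2)
                child = crossover(parent_a, parent_b)
                # A categorical choice of mutation strategy (weights are illustrative).
                mutate = random.choices([mutate_random, mutate_rws, mutate_rl],
                                        weights=[0.4, 0.3, 0.3])[0]
                offspring.append(mutate(child))
            offspring_scores = [evaluate(m) for m in offspring]
            # Non-dominated sorting + crowding distance keep the best pop_size models.
            population, scores = nsga2_select(population + offspring,
                                              scores + offspring_scores, pop_size)
        return population, scores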

4.1 Model Meta Encoding

One model is denoted by two parts: forward-connected cells and their information connections. We use the indices of operators from the operator set $S$ to encode the cells, and a nested list to depict the connections. Namely, given a model with $n$ cells, its corresponding chromosome can be depicted by $(M, C)$, where $M$ and $C$ are defined as follows,

$M = (m_1, m_2, \ldots, m_n), \quad m_i \in \{1, 2, \ldots, |S|\}$    (1)

$C = (c^1, c^2, \ldots, c^{n-1}), \quad c^i = (c_i^{i+1}, c_i^{i+2}, \ldots, c_i^{n}), \quad c_i^j \in \{0, 1\}$    (2)
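A direct transcription of this encoding (using 0-based indices; the field names are ours):

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Chromosome:
        cells: List[int]              # M: one operator index in {0, ..., |S|-1} per cell
        connections: List[List[int]]  # C: connections[i][k] in {0, 1} is the link from the
                                      #    input of block i+1 to block i+2+k (1-based b_i, b_j)

    def validate(ch: Chromosome, num_operators: int) -> None:
        n = len(ch.cells)
        assert all(0 <= m < num_operators for m in ch.cells)
        assert len(ch.connections) == n - 1
        for i, row in enumerate(ch.connections):
            # The i-th group lists the links from the input of block i+1 to blocks i+2 .. n.
            assert len(row) == n - 1 - i and set(row) <= {0, 1}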

4.2 Initialization

We begin with an initial population in which we emphasize the diversity of cells. In effect, to generate a model, we randomly sample a cell from $S$ and repeat it $n$ times. In case the population size is larger than $|S|$, models are sampled arbitrarily without repeating cells.

As for connections, we first sample a connection pattern from a categorical distribution and then, within each category, pick uniformly at random. To formalize, the connections are built based on the following rules,

(3)
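A sketch of this initialization, reusing the Chromosome encoding from Section 4.1; since Equation 3 is not reproduced above, the three connection categories and their uniform filling below are our assumptions:

    import random

    def init_chromosome(n: int, num_operators: int) -> Chromosome:
        # Micro part: sample one operator and repeat it n times, so that cell diversity
        # is spread across the population rather than within a single model.
        op = random.randrange(num_operators)
        cells = [op] * n
        # Macro part: pick a connection category first, then fill the pattern uniformly.
        category = random.choice(["none", "dense", "random"])   # assumed categories
        connections = []
        for i in range(n - 1):
            row_len = n - 1 - i
            if category == "none":
                row = [0] * row_len
            elif category == "dense":
                row = [1] * row_len
            else:
                row = [random.randint(0, 1) for _ in range(row_len)]
            connections.append(row)
        return Chromosome(cells=cells, connections=connections)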

4.3 Tournament Selection

We calculate the crowding distance as noted in [Chu and Yu2018] to render a uniform distribution of our models, and we apply tournament selection to control the evolutionary pressure.

4.4 Crossover

To encourage exploration, single-point crossovers are performed simultaneously in both the micro and the macro space. Given two models $(M_a, C_a)$ and $(M_b, C_b)$, a new chromosome can be generated as,

$(M_{new}, C_{new}) = \big( (m_{a,1}, \ldots, m_{a,p}, m_{b,p+1}, \ldots, m_{b,n}),\ (c_a^{1}, \ldots, c_a^{q}, c_b^{q+1}, \ldots, c_b^{n-1}) \big)$    (4)

where $p$ and $q$ are the chosen crossover positions for the micro and macro genes, respectively. Informally, the crossover procedure contributes more to exploitation than to exploration.
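A minimal sketch of this crossover, again on the Chromosome encoding above (assuming at least three cells so that both crossover points are meaningful):

    import random

    def single_point_crossover(a: Chromosome, b: Chromosome) -> Chromosome:
        """Single-point crossover applied independently to the micro and macro genes."""
        n = len(a.cells)
        p = random.randrange(1, n)                                   # micro crossover point
        q = random.randrange(1, len(a.connections)) if len(a.connections) > 1 else 0
        cells = a.cells[:p] + b.cells[p:]
        connections = [row[:] for row in a.connections[:q]] + \
                      [row[:] for row in b.connections[q:]]
        return Chromosome(cells=cells, connections=connections)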

4.5 Mutation

We again apply a categorical distribution to balance exploration and exploitation.

4.5.1 Exploration

To encourage exploration, we combine random mutation with roulette wheel selection (RWS). Since we treat super-resolution as a multi-objective problem, FLOPS and the number of parameters are two objectives that can be evaluated as soon as the meta encodings are available. In particular, we also sample from a categorical distribution to determine the mutation strategy, i.e. random mutation, or mutation by roulette wheel selection on either FLOPS or the number of parameters. Formally,

(5)

Whenever we mutate a model by RWS, we keep its connections $C$ unchanged. Since each cell shares the same operator set $S$, we perform RWS on $S$ for $n$ times to generate a new $M$. Strictly speaking, it is intractable to execute a complete RWS over the micro space (involving $|S|^n$ models); instead, it can be approximated by RWS over $S$ itself (involving only $|S|$ basic operators). Besides, we scale FLOPS and the number of parameters logarithmically before RWS.
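A sketch of the RWS mutation over $S$; the paper states only that FLOPS and parameter counts are log-scaled before RWS, so weighting the wheel by the inverse of the log-scaled cost (favoring cheaper operators) is our assumption:

    import math
    import random

    def rws_mutate_cells(ch: Chromosome, op_costs: list) -> Chromosome:
        """Roulette-wheel mutation of the micro genes only; the connections C stay unchanged.

        op_costs[k] is the FLOPS (or parameter count) of operator k, assumed > 1 so that
        the log-scaled weights stay positive.
        """
        weights = [1.0 / math.log(cost) for cost in op_costs]
        new_cells = random.choices(range(len(op_costs)), weights=weights, k=len(ch.cells))
        return Chromosome(cells=new_cells, connections=[row[:] for row in ch.connections])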

4.5.2 Exploitation

To enhance exploitation, we apply a reinforcement-driven mutation.

We use a neural controller to mutate, as shown in Figure 2. Specifically, the embedding features of all cells are concatenated and injected into three fully-connected layers to generate the connections. The last layer has one neuron per candidate connection, and its output represents the connection decisions.

Figure 2: Controller network. For each of the $n$ cells, an LSTM step (the first initialized with a zero cell and zero state) produces a softmax over operators from which the cell is sampled and embedded as the next input; the cell embeddings are then concatenated and fed through fully-connected layers to output the connections.

The network parameters can be partitioned into two groups, $\theta_{cell}$ and $\theta_{conn}$. The probability of selecting operator $s_i$ for cell $i$ is $p(s_i; \theta_{cell})$, and for connection $c_j$ we have $p(c_j; \theta_{conn})$. Thus, the gradient can be calculated as follows:

$\nabla J(\theta) = \sum_{i=1}^{n} \nabla_{\theta_{cell}} \log p(s_i; \theta_{cell})\, R_{cell} + \sum_{j} \nabla_{\theta_{conn}} \log p(c_j; \theta_{conn})\, R_{conn}$    (6)

In Equation 6, $R_{cell}$ and $R_{conn}$ are the discounted accumulated rewards, computed with a fixed discount parameter.
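A minimal PyTorch sketch of this policy-gradient update; the controller is reduced here to bare logit tables (the real one uses the per-cell LSTMs, embeddings and fully-connected layers of Figure 2), and the names, shapes and rewards are illustrative:

    import torch

    def reinforce_loss(cell_log_probs, conn_log_probs, reward_cell, reward_conn):
        """Surrogate loss: minimizing it ascends the policy gradient of Equation 6."""
        return -(cell_log_probs.sum() * reward_cell + conn_log_probs.sum() * reward_conn)

    n, m, num_ops = 7, 21, 192
    cell_logits = torch.zeros(n, num_ops, requires_grad=True)   # stands in for theta_cell
    conn_logits = torch.zeros(m, requires_grad=True)            # stands in for theta_conn

    cell_dist = torch.distributions.Categorical(logits=cell_logits)
    conn_dist = torch.distributions.Bernoulli(logits=conn_logits)
    cells = cell_dist.sample()          # one operator index per cell
    conns = conn_dist.sample()          # one 0/1 decision per candidate connection

    loss = reinforce_loss(cell_dist.log_prob(cells), conn_dist.log_prob(conns),
                          reward_cell=1.0, reward_conn=1.0)
    loss.backward()                     # gradients w.r.t. cell_logits and conn_logits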

5 Evaluator

The evaluator calculates the scores of the models generated by the controller. In the beginning, we attempted to train an RNN regressor to predict model performance from data collected in previous pipeline executions, but its validation error was too high for it to be useful. Instead, each model is trained for a relatively short time, which is enough to roughly differentiate the candidates. At the end of this incomplete training, we evaluate the mean squared error on the test datasets.
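Since incomplete training scores models by mean squared error while the final comparison uses PSNR, it is worth noting that the two are monotonically related; for images normalized to [0, 1]:

    import math

    def psnr_from_mse(mse: float, max_val: float = 1.0) -> float:
        """PSNR in dB for a given mean squared error (higher is better)."""
        return 10.0 * math.log10((max_val ** 2) / mse)

    print(round(psnr_from_mse(1e-3), 2))   # an MSE of 1e-3 corresponds to 30.0 dB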

6 Experiments

6.1 Setup

In our experiment, about 10k models are generated in total, with a population of 64 per iteration. Executing the pipeline once takes less than 3 days on 8 Tesla V100 GPUs. We use DIV2K as our training set.

During incomplete training, each model is trained with a batch size of 16 for 200 epochs. We apply the Adam optimizer to minimize the loss between the generated high-resolution images and their ground truth. The learning rate is kept unchanged at this stage.

As for the full training, we choose 4 models with large crowding distances on the Pareto front between mean squared error and mult-adds generated at the incomplete-training stage. These models are trained on the DIV2K dataset for 24000 epochs with a batch size of 16, which takes less than 1.5 days. Moreover, the standard deviation of the weights is initialized as 0.02 and the biases as 0.


Model  Mult-Adds  Params  Set5 (PSNR/SSIM)  Set14 (PSNR/SSIM)  B100 (PSNR/SSIM)  Urban100 (PSNR/SSIM)
SRCNN [Dong et al.2014] 52.7G 57K 36.66/0.9542 32.42/0.9063 31.36/0.8879 29.50/0.8946
FSRCNN [Dong et al.2016] 6.0G 12K 37.00/0.9558 32.63/0.9088 31.53/0.8920 29.88/0.9020
VDSR [Kim et al.2016a] 612.6G 665K 37.53/0.9587 33.03/0.9124 31.90/0.8960 30.76/0.9140
DRCN [Kim et al.2016b] 17,974.3G 1,774K 37.63/0.9588 33.04/0.9118 31.85/0.8942 30.75/0.9133
LapSRN [Lai et al.2017] 29.9G 813K 37.52/0.9590 33.08/0.9130 31.80/0.8950 30.41/0.9100
DRRN [Tai et al.2017a] 6,796.9G 297K 37.74/0.9591 33.23/0.9136 32.05/0.8973 31.23/0.9188
SelNet [Choi and Kim2017] 225.7G 974K 37.89/0.9598 33.61/0.9160 32.08/0.8984 -
CARN [Ahn et al.2018] 222.8G 1,592K 37.76/0.9590 33.52/0.9166 32.09/0.8978 31.92/0.9256
CARN-M [Ahn et al.2018] 91.2G 412K 37.53/0.9583 33.26/0.9141 31.92/0.8960 31.23/0.9194
MoreMNAS-A [Chu et al.2019] 238.6G 1,039K 37.63/0.9584 33.23/0.9138 31.95/0.8961 31.24/0.9187
FALSR-A (ours) 234.7G 1,021K 37.82/0.9595 33.55/0.9168 32.12/0.8987 31.93/0.9256
FALSR-B (ours) 74.7G 326k 37.61/0.9585 33.29/0.9143 31.97/0.8967 31.28/0.9191
FALSR-C (ours) 93.7G 408k 37.66/0.9586 33.26/0.9140 31.96/0.8965 31.24/0.9187
Table 1: Comparison with state-of-the-art methods on the ×2 super-resolution task.

Figure 3: The model FALSR-A (comparable to CARN). From input to output: feature extraction → conv_f64_k3_b4_isskip → conv_f48_k1_b1_isskip → conv_f64_k3_b4_isskip → conv_f64_k3_b4_isskip → conv_f64_k3_b4_isskip → conv_f64_k1_b4_noskip → conv_f64_k3_b4_isskip → sub-pixel upsampling.

Figure 4: The model FALSR-B (comparable to CARN-M). From input to output: feature extraction → invertBotConE2_f16_k3_b1_isskip → invertBotConE2_f48_k1_b2_isskip → conv_f16_k1_b2_isskip → invertBotConE2_f32_k3_b4_noskip → conv_f64_k3_b2_noskip → groupConG4_f16_k3_b4_noskip → conv_f16_k3_b1_isskip → sub-pixel upsampling.

6.2 Comparisons with State-of-the-Art Super-Resolution Methods



Figure 5: FALSR A, B, C (shown in salmon) vs. others (light blue).

After being fully trained, our models are compared with the state-of-the-art methods on the commonly used test datasets for super-resolution (see Table 1 and Figure 5). To be fair, we only consider models with comparable FLOPS; therefore, very deep and large models such as RDN [Zhang et al.2018b] and RCAN [Zhang et al.2018a] are excluded here. We choose PSNR and SSIM as metrics by convention [Hore and Ziou2010]. The comparisons are made on the ×2 task. Note that all mult-adds are measured with respect to the same input resolution.

At a comparable level of FLOPS, our model FALSR-A (Figure 3) outperforms CARN [Ahn et al.2018]. In addition, it dominates DRCN [Kim et al.2016b] and MoreMNAS-A [Chu et al.2019] on all three objectives across the four datasets. Moreover, it achieves higher PSNR and SSIM with fewer FLOPS than VDSR [Kim et al.2016a], DRRN [Tai et al.2017a] and many others.

For a more lightweight version, our model FALSR-B (Figure 4) dominates CARN-M: with fewer FLOPS and fewer parameters, it scores equal to or higher than CARN-M. Besides, its architecture is appealing, with a connection complexity lying between that of residual and dense connections. This suggests that a dense connection is not always the optimal way to transmit information: useless features from lower layers can make it harder for higher layers to restore the super-resolution result.

Another lightweight model, FALSR-C (not drawn due to space constraints), also outperforms CARN-M. It uses relatively sparse connections (8 in total). We conclude that this sparse flow works well with the selected cells.

Figure 6: Multiple objectives of the models during evolution (two panels).

Figure 7 shows the qualitative results against other methods.

6.3 Discussions

6.3.1 Cell Diversity

Our experiments show that good cell diversity helps to achieve better results for super-resolution, as it does for classification tasks [Hsu et al.2018]. In fact, we trained several models built from repeated identical blocks; they underperform the models with diverse cells. We speculate that different types of cells can handle input features more effectively than monotonous ones.

6.3.2 Optimal Information Flow

Perhaps, given current training techniques, dense connections are not optimal in most cases. In principle, a dense connection has the capacity to cover all non-dense configurations; in practice, however, it is usually difficult to train a model to ignore useless information.

6.3.3 Good Assumption?

Super-resolution differs from feature-extraction domains such as classification in that details need to be restored at the pixel level. Therefore, it rarely applies downsampling operations to reduce the feature dimensions, which makes it more time-consuming than classification tasks such as CIFAR-10.

Regarding the time budget, we use incomplete training to differentiate models. This strategy works well under an implicit assumption: models that perform better when fully trained are, with high probability, also among the better ones under incomplete training. Luckily, most deep learning tasks share this property. For the rest, models must be trained as fully as possible.

Figure 7: Qualitative results in comparison with other methods. Each row shows, from left to right: ground truth, CARN, FALSR-A, CARN-M, FALSR-B, FALSR-C. The image indices in Urban100 (rows from top to bottom) are 11, 59, 59, 66.

7 Conclusions

To sum up, we presented a novel elastic NAS method that incorporates both micro and macro search, handling neural architectures at multiple granularities. The results are exciting, as our generated models dominate the newest state-of-the-art SR methods. Unlike human-designed and single-objective NAS models, our method generates models of different flavors in a single run, ranging from fast and lightweight to relatively large and more accurate. It therefore offers a feasible way for engineers to compress existing popular human-designed models, or to design architectures at various levels for constrained devices.

Our future work will focus on training a model regressor, which estimates the performance of models, to speed up the pipeline.

References