1 Introduction
Neural networks (NNs) have become a default solution for many problems because of their high performance. However, wider adoption of NNs requires not only high accuracy but also high computational efficiency. Researchers compress, search, or jointly search and compress architectures, aiming for more computationally efficient solutions [26]. This problem is just as relevant for super-resolution (SR) architectures, which are often deployed on mobile devices.
In this paper, we choose to combine NAS and quantization techniques to search for efficient, quantization-friendly models for SR. The NAS problem is hard because we must either define a differentiable NAS procedure or use discrete optimization in a high-dimensional space of architectures. The problem is even more challenging for quantization-aware NAS because quantization is a non-differentiable operation; therefore, optimizing quantized models is more difficult than optimizing full-precision models. An additional technical challenge arises for SR, as Batch Norm (BN) in SR models hurts final performance [18], but training models without BN is much slower. Finally, we must define an appropriate search space given the numerous recent advances in quantized SR architectures, taking into account that the size of the discrete search space grows exponentially with each new part of the architecture, making the optimization problem harder.
Our contributions:

We propose the first end-to-end approach to NAS for mixed-precision quantization of SR architectures.

To search for robust, quantization-friendly architectures, we approximate the model degradation caused by quantization with Quantization Noise (QN) instead of directly quantizing model weights during the search phase. For sampling QN, we follow the procedure proposed in [7]. This reparametrization provides the differentiability crucial for differentiable NAS and is up to 30% faster than quantizing weights directly.

We design a specific search space for SR models. The proposed search space is simpler than the current SOTA, Trilevel NAS [24], and leads to an about 5 times faster search procedure. We show that search space design is as important as the search method and argue that the community should pay more attention to it.

Quantization-aware NAS with Search Against Noise (SAN) yields a better trade-off between quality (measured in PSNR) and efficiency (measured in BitOps) than uniform and mixed-precision quantization of fixed architectures. Thus, joint quantization-aware NAS is a better choice than performing quantization and NAS separately.
2 Related works
Differentiable NAS (DNAS)
[23, 17, 11, 24] is a differentiable method of selecting a directed acyclic subgraph (DAG) from an over-parameterized supernet. An example of such a selection is shown in Figure 1. Each node represents a feature map (an input, an output, or an intermediate layer); edges are operations between those nodes. During the search procedure, we aim to assign an importance weight to each edge and then select a subgraph using the edges with the highest importance weights.
The weight assignment can be done in several ways. The main idea of DNAS is to update the importance weights with gradients of a loss function that is also parameterized by the supernet weights. Consequently, hardware constraints are easy to introduce as an extension of the initial loss function. DNAS has proven efficient for finding computationally optimized models. FBNet [23] focuses on optimizing FLOPs and latency, mainly for classification problems. AGD [11] and Trilevel NAS [24] further extend resource-constrained NAS to the super-resolution (SR) problem for full-precision models.
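The edge-mixing idea behind DNAS can be sketched as follows. This is a minimal NumPy illustration under our own naming; the toy operations stand in for the convolutions of a real search space and are not the paper's implementation:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Candidate operations on one supernet edge (placeholders standing in for
# the convolutions of the real search space).
ops = [lambda x: x, lambda x: 2.0 * x, lambda x: x ** 2]

def mixed_edge(x, alphas):
    """Supernet edge: a convex combination of all candidate operations,
    weighted by softmax-normalized importances (differentiable in alphas)."""
    w = softmax(alphas)
    return sum(wi * op(x) for wi, op in zip(w, ops))

def discretize(alphas):
    """After the search, keep only the operation with the largest importance."""
    return ops[int(np.argmax(alphas))]

x = np.array([1.0, 2.0])
alphas = np.array([0.1, 2.0, 0.3])   # learned jointly with supernet weights
y_soft = mixed_edge(x, alphas)       # used during the search
y_hard = discretize(alphas)(x)       # used after discretization
```

Because `mixed_edge` is differentiable in `alphas`, the importances can be updated by gradient descent together with the supernet weights; sparsity regularization is then needed to make `y_soft` and `y_hard` nearly coincide at the end of the search.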
Quantization-aware DNAS.
DNAS can be employed to search for architectures with desired properties. In OQAT [19], the authors perform quantization-aware NAS with uniform quantization. They show that architectures found with a quantization-aware search perform better when quantized than architectures found without accounting for quantization. However, uniform quantization is less flexible and leads to suboptimal solutions compared to mixed-precision quantization (MPQ), where each operation and activation has its own bit width. This idea was explored in EdMIPS [5]: MPQ in EdMIPS is a NAS procedure where all operations are fixed and the search is over quantization levels. On the one hand, MPQ-NAS is a natural extension of EdMIPS and OQAT; on the other hand, joint optimization of quantization bits and operations has a high computational cost. Additionally, the instability of optimization with quantized weights was highlighted in DiffQ [7], whose authors use QN to perform MPQ on existing architectures.
These two problems make joint optimization of an over-parameterized supernet with mixed-precision bits a challenging task. We propose a procedure that is both effective and relatively robust: the supernet turns the problem into a continuous one, and quantization noise makes solving it fast and stable.
Search space design
is crucial for achieving good results. A search space should be flexible and contain known best-performing solutions; with a good design, even random search can be a reasonable method. AGD [11] applies NAS to the SR problem: the authors search for (1) a cell, a block that is repeated several times, and (2) the kernel size, along with other hyperparameters such as the number of input and output channels. Trilevel NAS [24] extends this work by adding (3) a network level that optimizes the position of the upsampling layer. Both articles enlarge the search space, making it more flexible but also trickier to search in, possibly leading to suboptimal local solutions. In our work, we choose a simpler search space consisting of: (1) a cell block, where the search is performed over operations within the block, and (2) quantization levels, different for each operation (for quantization-aware search). We show that this design leads to architectures with performance similar to Trilevel NAS [24] while the search is much faster.
Sparsification for differentiable architecture search
was discussed in several works [3, 21, 6, 25, 24, 27]. The problem arises because operations in a supernet co-adapt, so a subgraph selected from the supernet depends on all the operations left in it. We can make the problem better suited for NAS by enforcing sparsification of the graph, with most of the importances for a particular connection between nodes being close to zero.
The sparsification strategy depends on the graph structure of the final model. In our work, we use the Single-Path strategy (one possible edge between two nodes; see Appendix C). For the Single-Path strategy, a node's output is a weighted sum of features, as can be seen in Figure 1, and the co-adaptation problem becomes obvious: second-layer convolutions are trained on a weighted sum of features, but after discretization (selecting a subgraph), only one source of features remains. Therefore, sparsification of the importance vector is necessary. In BATS [3], sparsification is achieved via a scheduled softmax temperature. Entropy regularization was proposed in Discretization-Aware search [21]. In [6], the authors propose an ensemble of Gumbels to sample sparse architectures for the Mixed-Path strategy, and in [25], Sparse Group Lasso (SGL) regularization is used. In ISTA-NAS [27], the authors treat sparsification as a sparse coding problem. Trilevel NAS [24] proposes a sorted Sparsestmax. In our work, we use entropy regularization [21] to disentangle the final model from the supernet.
3 Methodology
We follow the basic definitions provided in the previous section in the description of our approach, which has three parts. We start with (i) subsection 3.1, which describes the space of considered architectures. The procedure for searching for an architecture in this space, including the loss function used, is given in (ii) subsection 3.2. Finally, in (iii) subsection 3.3 we provide details on the reparametrization with quantization noise.
3.1 Search space design
We design our search space taking into account recent results in this area and, in particular, the SR quantization challenge [14], which consisted in manually designing quantization-friendly architectures.
We combine most of these ideas in the search space depicted in Figure 3. The deterministic part of our search space includes the upsampling layer in the tail block of the architecture, the number of channels in convolutions, and the SR-specific AdaDM [18] block. The AdaDM block is used only in quantization-aware search. The variable part comprises the quantization bit values and the operations within the head, body, upsample, and tail blocks. We perform all experiments with 3 body blocks unless specified otherwise. An additional parallel convolutional layer in the body block is used to increase the representation power of quantized activations.
Batch Norm for SR and modification of AdaDM.
Variation in a signal is crucial for identifying small details. In particular, the residual features' standard deviation shrinks after normalization layers, and SR models with BN perform worse [18]. On the other hand, training an over-parameterized supernet without BN can be challenging. The authors of [18] propose to rescale the signal after BN based on its variation before the BN layers. Empirically, we found that AdaDM with the second BN removed improves the overall performance of quantized models. The original AdaDM block and our modification are depicted in Figure 4. All the residual blocks in our search design have the modified AdaDM part: the body block, all the repeated layers within the body block, and the tail block.

3.2 The search procedure
We consider the selection of both blocks and quantization bits during NAS: we assign a separate importance value to each element of the Cartesian product of possible operations and quantization bit widths. Search and training are performed as two independent steps.
During the search, we alternately update the supernet's weights $w$ and the edge importances $\alpha$. Two different subsets of the training data are used to calculate the loss function and the derivatives for updating $w$ and $\alpha$, similar to [11]. Hardware constraints and entropy regularization are applied as additional terms in the loss function for updating $\alpha$. To calculate the output of the $l$-th layer, we weight the outputs of the separate edges according to the importance values of each operation-and-bit pair:
$$x_{l+1} = \sum_{i} \alpha_i^{(l)}\, o_i^{(l)}(x_l), \qquad (1)$$

where $\sum_i \alpha_i^{(l)} = 1$ and all $\alpha_i^{(l)} \ge 0$.
The final architecture is then derived by choosing, for each layer, the single operator with the maximal $\alpha_i^{(l)}$ among the ones for this layer. Finally, we train the obtained architecture from scratch.
To optimize $\alpha$, we compute the following loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \beta(t)\,\mathcal{L}_{\mathrm{HW}} + \gamma(t)\,\mathcal{L}_{\mathrm{ent}},$$

where $\beta(t)$ and $\gamma(t)$ are regularization constants that depend on the iteration $t$; $\mathcal{L}_{\mathrm{rec}}$ is the distance between the high-resolution and restored images, $\mathcal{L}_{\mathrm{HW}}$ is the hardware constraint, and $\mathcal{L}_{\mathrm{ent}}$ is the entropy loss that enforces sparsity of the vector $\alpha$. The last two losses are defined in the two subsections below.
3.2.1 Hardware constraint regularization
The hardware constraint is proportional to the number of floating-point operations (FLOPs) for full-precision models and to the number of quantized operations (BitOps) for mixed low-precision models. FLOPs are computed from the input image size and the properties of a convolutional layer: kernel size, number of channels, stride, and number of groups. We use the same number of bits for weights and activations in our setup, so BitOps can be computed as $\mathrm{BitOps} = \mathrm{FLOPs} \cdot b^2$, where $b$ is the number of bits. The corresponding hardware part of the loss is then the expected cost of the supernet:

$$\mathcal{L}_{\mathrm{HW}} = \sum_{l} \sum_{i} \alpha_i^{(l)}\, \mathrm{BitOps}\big(o_i^{(l)}\big), \qquad (2)$$
where $l$ runs over the supernet's blocks or layers, each consisting of several operations; the layer-wise structure is presented in Figure 1. We normalize $\mathcal{L}_{\mathrm{HW}}$ by its value at initialization under a uniform assignment of $\alpha$, as the scale of the unnormalized hardware constraint is very large.
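The expected-cost regularizer described above can be sketched as follows. This is a NumPy illustration under our own naming (`conv_flops`, `hw_loss` are hypothetical helpers); the paper's exact FLOPs counting and normalization may differ in details:

```python
import numpy as np

def conv_flops(h, w, c_in, c_out, k, stride=1, groups=1):
    """Multiply-accumulate count of a convolution on an h x w input
    (a standard estimate; the paper's exact counting may differ)."""
    return (h // stride) * (w // stride) * (c_in // groups) * c_out * k * k

def bitops(flops, bits):
    # Same bit width for weights and activations, hence BitOps = FLOPs * b^2.
    return flops * bits ** 2

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def hw_loss(alphas_per_layer, costs_per_layer):
    """Expected BitOps of the supernet under the importance weights,
    normalized by its value at a uniform assignment of alphas."""
    def expected(alphas):
        return sum(softmax(a) @ c for a, c in zip(alphas, costs_per_layer))
    uniform = [np.zeros_like(a) for a in alphas_per_layer]  # softmax -> uniform
    return expected(alphas_per_layer) / expected(uniform)

# One layer with two candidate edges: the same 3x3 conv at 4 or 8 bits.
f = conv_flops(32, 32, 36, 36, 3)
costs = [np.array([bitops(f, 4), bitops(f, 8)], dtype=float)]
loss = hw_loss([np.array([3.0, 0.0])], costs)  # mass on the cheap 4-bit edge
```

Since the loss is an expectation under the softmax of $\alpha$, it stays differentiable and shifting importance mass onto cheaper edges lowers it below its normalized value of 1.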
3.2.2 Sparsity regularization
After the architecture search, the model keeps only one edge between each pair of nodes. Let us denote by $\alpha$ the vector of all importances that correspond to edges connecting a particular pair of nodes; they cover different operations and different bit widths. At the end of the search, we want $\alpha$ to be close to a one-hot vector: one component close to 1 and all remaining components close to 0. We found that an entropy loss on $\alpha$ works best in our setting.
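A standard choice matching this description is the Shannon entropy of the softmax-normalized importances. The sketch below is our own minimal formulation; the exact form used in [21] and in the paper may differ:

```python
import numpy as np

def entropy_loss(alphas):
    """Shannon entropy of the softmax-normalized importances; minimizing it
    drives the distribution toward a one-hot vector (one surviving edge)."""
    e = np.exp(alphas - alphas.max())
    p = e / e.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

spread = entropy_loss(np.array([0.0, 0.0, 0.0]))   # ~log(3): no preference
sharp = entropy_loss(np.array([10.0, 0.0, 0.0]))   # ~0: nearly one-hot
```

Adding this term to the search loss penalizes spread-out importances, so discretization at the end of the search changes the network as little as possible.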
3.3 Quantization
Our aim is to find quantization-friendly architectures that perform well after quantization. (i) To obtain a trained and quantized model, we perform Quantization-Aware Training (QAT) [15]: during training, (a) we approximate quantization with QN; (b) compute gradients for the quantized weights; (c) update the full-precision weights. (ii) Then, a model with the found architecture is trained from scratch. Below we provide details for these steps.
3.3.1 Quantization-Aware Training
Let us consider the following one-layer neural network (NN),

$$y = \sigma\big(F_W(x)\big), \qquad (4)$$

where $\sigma$ is a non-linear activation function and $F_W$ is a function parametrized by a tensor $W$. In (4), $F_W$ is a linear function, but it can also be a convolution. To decrease the computational complexity of the network, we replace expensive floating-point operations with low-bit-width ones. Quantization is applied to both the weights $W$ and the activations.

3.3.2 Quantization-Aware Search with Shared Weights (SW)
To account for subsequent quantization during the search phase, we perform a quantization-aware search similar to QAT [15]. One way to do so is to quantize model weights and activations during the search phase in the same way as during training.
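A common way to realize this is "fake quantization" with a straight-through estimator (STE), which is how the non-differentiable rounding is usually handled in QAT. The sketch below is a generic illustration with a simple uniform symmetric quantizer, not the paper's exact scheme:

```python
import numpy as np

def quantize_uniform(w, bits):
    """Uniform symmetric quantization to 2^bits levels (one common scheme;
    the paper does not pin down the exact quantizer used)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax if np.abs(w).max() > 0 else 1.0
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

# Forward pass uses the quantized copy; the backward pass passes the
# gradient straight through to the full-precision weights (STE).
w = np.array([0.31, -0.72, 0.05])         # full-precision weights (kept)
w_q = quantize_uniform(w, 4)              # used in the forward pass
grad_wrt_wq = np.array([0.1, -0.2, 0.3])  # toy upstream gradient
grad_wrt_w = grad_wrt_wq                  # STE: d(quantize)/dw approximated by 1
```

The rounding error of each weight is bounded by half the quantization step, which is the quantity that the quantization-noise view of Section 3.3.3 models directly.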
To improve computational efficiency, we can quantize the weights of identical operations with different bit widths instead of keeping separate weights for each bit width; this idea was studied in [5]. Let $f_l(x_{l-1}; W_l)$ denote the output of the $l$-th layer with input $x_{l-1}$ and parameters $W_l$. With shared weights,

$$f_l(x_{l-1}) = \Big(\sum_{b} \alpha_b\, Q_b(W_l)\Big) * x_{l-1}, \qquad (6)$$

where $Q_b$ quantizes to $b$ bits. The effectiveness of SW can be seen from (6): mixing the quantized copies before a single convolution requires fewer convolutional operations and less memory to store the weights.
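The shared-weights trick can be illustrated with a linear layer standing in for a convolution: the same weight tensor is quantized at every candidate bit width and the copies are mixed before a single operation is applied. This is a sketch under our own naming (`sw_layer` is hypothetical), with a simple uniform quantizer standing in for the actual one:

```python
import numpy as np

def quantize_uniform(w, bits):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax if np.abs(w).max() > 0 else 1.0
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

def sw_layer(x, W, alphas, bit_choices=(4, 8)):
    """Shared-weights search layer: one weight tensor W is quantized at every
    candidate bit width, the copies are mixed by the (softmax-normalized)
    importances, and a single operation is applied -- rather than one
    operation per bit width."""
    e = np.exp(alphas - alphas.max())
    p = e / e.sum()
    W_mix = sum(pi * quantize_uniform(W, b) for pi, b in zip(p, bit_choices))
    return W_mix @ x   # a linear op standing in for a convolution
```

By linearity, applying one operation to the mixed weights gives the same result as mixing the outputs of one operation per bit width, but with a single multiply instead of one per candidate.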
3.3.3 Quantization-Aware Search Against Noise
To further improve the computational efficiency and performance of the search phase, we introduce SAN. The model degradation caused by weight quantization is equivalent to adding quantization noise $\epsilon$: the quantized weights are $W_q = W + \epsilon(W)$, and (6) becomes

$$f_l(x_{l-1}) = \Big(W_l + \sum_b \alpha_b\, \epsilon_b(W_l)\Big) * x_{l-1}. \qquad (7)$$

This procedure (i) does not require weight quantization and (ii) is differentiable, unlike quantization. Here, $\epsilon$ is a function of $W$ because it depends on $W$'s shape and the magnitude of its values. Given the quantization noise, we can run forward and backward passes for our network more efficiently, similarly to the reparametrization trick.
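A minimal sketch of (7), again with a linear layer standing in for a convolution. For illustration we sample uniform noise with the quantization step of each candidate bit width; the paper follows the sampling procedure of DiffQ [7], and `san_layer` is our own hypothetical naming:

```python
import numpy as np

def san_layer(x, W, alphas, bit_choices=(4, 8), rng=None):
    """Search Against Noise: instead of quantizing W, perturb it with
    pseudo quantization noise whose scale matches the quantization step of
    each candidate bit width, mixed by the (softmax-normalized) importances.
    Uniform noise is used here for illustration only."""
    rng = np.random.default_rng(0) if rng is None else rng
    e = np.exp(alphas - alphas.max())
    p = e / e.sum()
    noisy = W.astype(float).copy()
    for pi, b in zip(p, bit_choices):
        delta = 2 * np.abs(W).max() / (2 ** b - 1)   # quantization step
        noisy = noisy + pi * rng.uniform(-delta / 2, delta / 2, size=W.shape)
    return noisy @ x   # a linear op standing in for a convolution
```

No rounding happens in the forward pass, so the expression stays differentiable in both $W$ and $\alpha$, and no quantized weight copies need to be computed or stored.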
Adding quantization noise is similar to adding independent uniform variables on $[-\Delta/2, \Delta/2]$, where $\Delta$ is the quantization step [22]. For the noise sampling, however, we use the procedure of [7], as it performs slightly better than the uniform distribution.

4 Results
We provide the code for our experiments here.
4.1 Evaluation protocol
For all experiments, we consider the following setup unless stated otherwise. The number of body blocks is set to 3. As the training dataset, we use the DIV2K dataset [1]. As the test datasets, we use Set14 [28], Set5 [2], Urban100 [13], and Manga109 [12], with scale factor 4. In the main body of the paper, we present results on Set14; the results for the other datasets are presented in the Appendix. For training, we use RGB images. For PSNR computation, we use only the Y channel, as in [11, 24]. FLOPs and BitOPs are evaluated for fixed image sizes.
Search space.
For the full-precision search, we use ten different possible operations as candidates for a connection between two nodes. For the quantization-aware search, we limit the number of operations to obtain a search space of reasonable size. For quantization, we consider two options for both activations and weights: 4 or 8 bits.
A search space analysis and more technical details are provided in Appendix D.
4.2 Different number of body blocks
A straightforward way to improve model performance is to increase the number of layers. We study how our method scales by performing the search with different numbers of body blocks: 1, 3, and 6. The three configurations are presented in Figure 5. We observe that increasing the number of blocks improves final performance and increases the number of BitOPs for architectures found with the highest hardware regularization: each configuration is slightly shifted to the right.
4.3 QuantNAS and quantization of fixed architectures
We compare QuantNAS with (1) uniform quantization and (2) mixed-precision quantization of two existing architectures, ESPCN [20] and SRResNet [16]. For uniform quantization, we use LSQ [10] and HWGQ [4]. For mixed-precision quantization, we use EdMIPS [5]. Our EdMIPS setup matches the original one, where the search assigns different bit widths to weights and activations; in QuantNAS, by contrast, the bit widths for activations and weights are the same.
In Figure 5, we compare our procedure with the two architectures quantized by EdMIPS. QuantNAS finds architectures with better PSNR/BitOps trade-offs within the range where BitOps values overlap. The performance gain is especially notable between quantized ESPCN and our approach.
We note that due to computational limits, our search is bounded in terms of the number of layers. Therefore, we can’t extend our results beyond SRResNet in terms of BitOps to provide a more detailed comparison.
From Table 1, we can see that QuantNAS finds architectures with a better PSNR/BitOps trade-off than uniform quantization techniques. We compare within the same range of BitOPs values: 8 bits for ESPCN and 4 bits for SRResNet.
Model | GBitOPs | PSNR | Method
SRResNet | 23.3 | 27.88 | LSQ (4, uniform)
SRResNet | 23.3 | 27.42 | HWGQ (4, uniform)
ESPCN | 2.3 | 27.26 | LSQ (8, uniform)
ESPCN | 2.3 | 27.00 | HWGQ (8, uniform)
SRResNet | 19.4 | 27.919 | EdMIPS (4, 8)
Ours | 4.6 | 27.814 | QuantNAS (4, 8)
Ours | 9.3 | 27.988 | QuantNAS (4, 8)
4.4 Time efficiency of QuantNAS
We measure the average training time for the three considered quantization approaches: without weight sharing, with weight sharing (SW), and with quantization noise (ours, QuantNAS, or briefly SAN).
We run the same experiment for different numbers of searched quantization bit widths. Figure 7 shows the advantage of our approach in training time; the advantage grows with the number of searched bit widths. On average, we get up to a 30% speedup.
4.5 Ablation studies
4.5.1 Adaptive Deviation Modulation
We start with comparing the effect of AdaDM [18]
and Batch Normalization on two architectures randomly sampled from our search space. In Table
2, we can see that both original AdaDM and Batch Normalization hurt the final performance, while AdaDM with our modification improves PSNR scores. In Figure 6, we observe that architectures found with AdaDM are better in terms of both PSNR and BitOPs. Interestingly, we did not notice any improvement with AdaDM for fullprecision models. Our best full precision model in Table 2 was obtained without AdaDM.Model  Model M1  Model M2 

Without Batch Norm | 27.55 | 28
With Batch Norm | 27 | 27.16
Original AdaDM | 27.33 | 27.84
Our AdaDM 
4.5.2 Entropy regularization
We consider three settings to compare QuantNAS with and without entropy regularization: (A) reduced search space, SGD optimizer; (B) full search space, Adam [8] optimizer; (C) reduced search space, Adam [8] optimizer. All experiments were performed with full-precision search. For the full and reduced search spaces, we refer to Appendix D.1. We perform the search without the hardware penalty to isolate the effect of the entropy penalty.
Quantitative results for entropy regularization are given in Table 3: it improves performance in terms of PSNR in all experiments.
Figure 8 shows the dynamics of operation importances for joint NAS with quantization for 4 and 8 bits; 4-bit edges are depicted with dashed lines. Only two layers are shown: the first layer of the head (HEAD) block and the skip (SKIP) layer of the body block. With entropy regularization, the most important operation is evident from its importance value; without it, there is no clearly dominant operation. Thus, our search procedure has two properties: (a) the input to the following layer is mostly produced by a single operation of the previous layer; (b) the architecture at the final search epochs is very close to the architecture obtained by selecting the one operation per layer with the highest importance value.
Training settings | w/o Entropy | w/ Entropy

A | 27.99 / 111 | / 206
B | 28.00 / 30 | / 19
C | 27.92 / 61 | / 321
Method | GFLOPs | PSNR | Search cost

SRResNet | 166.0 | 28.49 | Manual
AGD | 140.0 | 28.40 | 1.8 GPU days
Trilevel NAS | 33.3 | 28.26 | 6 GPU days
Our FP best | | 28.22 | 1.2 GPU days
4.5.3 Comparison with existing full-precision NAS approaches for SR
Here we examine the quality of our procedure for full-precision NAS. We did not use the AdaDM block for the full-precision search.
The results are in Table 4. Our search procedure achieves results comparable with Trilevel NAS [24] with a noticeably simpler search design and about 5 times faster search. The best-performing full-precision architecture was found with a nonzero hardware penalty; it is depicted in Appendix Figure 15.
5 Limitations
For our NAS procedure, the general NAS limitation applies: the computational demand of jointly optimizing many architectures is high. The search procedure takes hours to finish on a single TITAN RTX GPU. Moreover, obtaining the full Pareto frontier requires running the same experiment multiple times.
In Figure 6, all rightmost points (within one experiment/color) have a zero hardware penalty. This clearly shows that a limited search space creates an upper bound on the top model performance; therefore, our results do not fall within the same BitOps range as SRResNet.
We found that our search design is sensitive to hyperparameters. In particular, the optimal coefficients for the hardware penalty and entropy regularization can vary across search settings, and we expect the optimal values to be connected to each other and to the search space size. Different strategies or search settings require different hardware penalty values; applying the same set of values across settings might not be the best option, but it is not straightforward to determine them beforehand.
6 Conclusion
To the best of our knowledge, we are the first to deeply explore NAS with mixed-precision search for Super-Resolution (SR) tasks.
We proposed QuantNAS, a method for obtaining computationally efficient and accurate SR architectures via joint NAS and mixed-precision quantization.
Our method outperforms alternatives due to (1) a specifically tailored search space design; (2) the differentiable SAN procedure; (3) our adaptation of AdaDM; and (4) entropy regularization to avoid co-adaptation in supernets during the differentiable search.
Experiments on standard SR tasks demonstrate the high quality of our search: it leads to better solutions than mixed-precision quantization of popular SR architectures with EdMIPS [5]. Moreover, our search is up to 30% faster than the shared-weights approach.
References

[1] (2017) NTIRE 2017 challenge on single image super-resolution: dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 126–135.
[2] (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding.
[3] (2020) BATS: binary architecture search. ECCV 2020.
[4] (2017) Deep learning with low precision by half-wave Gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5918–5926.
[5] (2020) Rethinking differentiable search for mixed-precision neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2349–2358.
[6] (2019) DATA: differentiable architecture approximation. In Conference on Neural Information Processing Systems.
[7] (2021) Differentiable model compression via pseudo quantization noise. arXiv:2104.09987.
[8] (2015) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.
[9] (2014) Learning a deep convolutional network for image super-resolution. ECCV.
[10] (2019) Learned step size quantization. arXiv:1902.08153.
[11] (2020) AutoGAN-Distiller: searching to compress generative adversarial networks. ICML.
[12] (2016) Manga109 dataset and creation of metadata. In Proceedings of the 1st International Workshop on Comics Analysis, Processing and Understanding, pp. 1–5.
[13] (2015) Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197–5206.
[14] (2021) Real-time quantized image super-resolution on mobile NPUs, Mobile AI 2021 challenge: report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2525–2534.
[15] (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713.
[16] (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690.
[17] (2019) DARTS: differentiable architecture search. ICLR.
[18] (2021) AdaDM: enabling normalization for image super-resolution. arXiv:2111.13905.
[19] (2020) Once quantized for all: progressively searching for quantized efficient models. arXiv:2010.04354.
[20] (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883.
[21] (2021) Discretization-aware architecture search. Pattern Recognition.
[22] (1996) Statistical theory of quantization. IEEE Transactions on Instrumentation and Measurement, 45(2), pp. 353–361.
[23] (2019) FBNet: hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[24] (2021) Trilevel neural architecture search for efficient single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[25] (2020) Neural architecture search as sparse supernet. arXiv:2007.16112.
[26] (2019) Deep learning for single image super-resolution: a brief review. IEEE Transactions on Multimedia, 21(12), pp. 3106–3121.
[27] (2020) ISTA-NAS: efficient and consistent neural architecture search by sparse coding. In Conference on Neural Information Processing Systems.
[28] (2010) On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pp. 711–730.
Appendix
Appendix A Technical details
During the search phase, we consider architectures with a fixed number of 36 channels for each operation, unless the channel count is changed by the operation's properties. The search is performed for 20 epochs. For updating supernet weights, we use the following hyperparameters: batch size 16, initial learning rate (lr) 1e-3, cosine learning rate schedule, SGD with momentum 0.9, and weight decay 3e-7. For updating the alphas, we use a fixed lr of 3e-4 and no weight decay.
During the training phase, the obtained architecture is optimized for 30 epochs with the following settings: batch size 16, initial lr 1e-3, lr scheduler, and weight decay 3e-7.
A.1 Entropy schedule
For entropy regularization, we gradually increase the regularization value according to Figure 9; for the first two epochs the regularization is zero. The entropy term is multiplied by an initial coefficient and a coefficient factor. The initial coefficients are 1e-3 and 1e-4 for the full-precision and the quantization-aware search experiments, respectively.
Appendix B Scaling SR models with initial upsampling
To maintain good computational efficiency, it is common for recent SR models to operate on downsampled images and then upsample them with upsampling layers. This idea was first introduced in ESPCN [20]. Since then, few works have explored SR models on initially upscaled images.
Therefore, we were interested in how this approach scales in terms of quality and computational efficiency given arbitrarily many layers. Results are presented in Figure 10. We start with one fixed block, similar to our body block in Figure 3, and then increase the count by one each time. We compare our results with SRResNet [16] and SRCNN [9]. As we can see, SRResNet [16], which operates on downscaled images, yields better results at the same computational complexity.
Appendix C Singlepath search space
There are several ways to select a directed acyclic subgraph from a supernet. DARTS [17] uses the Multi-Path strategy: one node can have several input edges. Such a strategy makes the search space significantly larger. In the Single-Path strategy, each searchable layer in the network can choose only one operation from the layer-wise search space (Figure 1). It has been shown in FBNet [23] that the simpler Single-Path approach yields results comparable to the Multi-Path approach for classification problems. Since it also aligns better with the SR search design, we use the Single-Path approach in our work.
Appendix D Search space analysis
First, we compare two search spaces for quantization by randomly sampling and then training architectures from them; results are presented in Figure 11. We observe that one search space performs slightly better than the other; the search space shown in green is described in Appendix E.2. Additionally, we observe that our neural architecture search significantly outperforms random sampling.
Second, we observe that the first layer is the most sensitive to quantization levels, as can be seen in Figure 12. With low-bit quantized weights in the first layer, we lose a significant portion of the information in the initial signal. However, once the number of channels is increased, low-bit quantized weights can be applied in the following layers with less model degradation. Interestingly, our approach almost always allocates higher bit widths to the first and last layers, e.g., in Figure 16 and in Figure 11 (green).
Third, we perform a regression analysis where architectures are features and final scores are target variables; one-hot encoding is used to transform architectures into features. Two different search spaces were used, with about 20 trained models each: (1) architectures sampled randomly, with a higher variance of the target variable; (2) architectures found with our final procedure, with a smaller variance of the target variable. We note that with only 20 observations we cannot derive reliable insights, but we can still observe reasonable trends. Results are presented in Figure 13.

D.1 Different search spaces
For a detailed description of the operations, we refer to our code.
Appendix E Search space
We have a fixed number of channels equal to 36 for all the layers unless specified.

Head: two layers; the first layer processes three-channel images;

Body: a repeatable cell with a skip connection and three layers;

Skip: a single convolutional layer within the body cell connecting its input and output (not to be confused with a conventional skip operation);

Upsample: one layer before pixel shuffle; this layer increases the number of channels, and the pixel-shuffle operation outputs a 3-channel image;

Tail: two layers; the final layer outputs images with 3 channels.
E.1 Search space for full-precision experiments
Possible operations blockwise:

Head 8 operations: simple 3x3, simple 5x5, growth2 5x5, growth2 3x3, simple 3x3 grouped 3, simple 5x5 grouped 3, simple 1x1 grouped 3, simple 1x1;

Body 7 operations: simple 3x3, simple 5x5, simple 3x3 grouped 3, simple 5x5 grouped 3, decenc 3x3 2, decenc 5x5 2, simple 1x1 grouped 3;

Skip 4 operations: decenc 3x3 2, decenc 5x5 2, simple 3x3, simple 5x5;

Upsample 12 operations: conv 5x1 1x5, conv 3x1 1x3, simple 3x3, simple 5x5, growth2 5x5, growth2 3x3, decenc 3x3 2, decenc 5x5 2, simple 3x3 grouped 3, simple 5x5 grouped 3, simple 1x1 grouped 3, simple 1x1;

Tail 8 operations: simple 3x3, simple 5x5, growth2 5x5, growth2 3x3, simple 3x3 grouped 3, simple 5x5 grouped 3, simple 1x1 grouped 3, simple 1x1;
E.2 Reduced search space for quantization experiments
Possible operations blockwise:

Head 5 operations: simple 3x3, simple 5x5, simple 3x3 grouped 3, simple 5x5 grouped 3;

Body 4 operations: conv 5x1 1x5, conv 3x1 1x3, simple 3x3, simple 5x5;

Skip 3 operations: simple 1x1, simple 3x3, simple 5x5;

Upsample 4 operations: conv 5x1 1x5, conv 3x1 1x3, simple 3x3, simple 5x5;

Tail 3 operations: simple 1x1, simple 3x3, simple 5x5;
Conv 5x1 1x5 and conv 3x1 1x3 are depthwise-separable convolutions. For a description of the operations, we refer to our code.
Appendix F Results on other datasets
In Figure 14, we provide quantitative results obtained on the different test datasets: Set14 [28], Set5 [2], Urban100 [13], and Manga109 [12], with scale factor 4.
In Figure 17, we provide visual results for quantized and full-precision models.