Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer


Abstract

We propose a semi-supervised network for wide-angle portraits correction. Wide-angle images often suffer from skew and distortion caused by perspective distortion, which is especially noticeable in face regions. Previous deep learning based approaches require ground-truth correction flow maps as training guidance. However, such labels are expensive and can only be obtained manually. In this work, we propose a semi-supervised scheme that can consume unlabeled data in addition to labeled data for further improvement. Specifically, our semi-supervised scheme takes advantage of the consistency mechanism, with several novel components such as direction and range consistency (DRC) and regression consistency (RC). Furthermore, our network, named Multi-Scale Swin-Unet (MS-Unet), is built upon the multi-scale swin transformer block (MSTB), which can learn both local-scale and long-range semantic information effectively. In addition, we introduce a high-quality unlabeled dataset with rich scenarios for training. Extensive experiments demonstrate that the proposed method is superior to state-of-the-art methods and other representative baselines.

1 Introduction

The wide-angle camera, which captures memorable moments filled with more people and scenery, has become an ongoing trend in smartphones. However, all wide-angle images suffer from perspective distortion: straight lines near the image edges become curved, and human faces look unnatural, as shown in Fig. 1 (a).

To solve these problems, some classic calibration-based methods Pavić, Schönefeld, and Kobbelt (2006); Carroll, Agarwala, and Agrawala (2010); Du, Hu, and Martin (2013); Tehrani, Majumder, and Gopi (2016) were proposed, and the curved straight lines in the background can be corrected. However, these conventional methods fail to correct distorted faces, and can even have a negative impact on them. Recent approaches Shih et al. (2019); Tan et al. (2021) correct both distorted portraits and lines simultaneously. In particular, the first supervised CNN-based method was proposed by Tan et al. Tan et al. (2021) and obtained satisfactory results. In Tan's method, two networks were designed to remove the distortion of the background and the portraits respectively, achieving smooth transitions between the perspective-rectified background and the stereographic-rectified faces. Since there was no suitable dataset for training, they built a high-quality dataset of labeled training samples. Nevertheless, some drawbacks still limit further improvement of their work. First, a well-generalized network needs a large amount of training data covering rich scene types, and the high labeling cost makes it unrealistic to enlarge the labeled dataset. Second, in their two-stage network, the lines in the background must be corrected before the faces are processed, which introduces redundancy. Third, although CNNs have achieved excellent performance, they are limited by their receptive fields and cannot model long-range semantic interactions well due to the locality of the convolution operation.

Figure 1: An example of our method. (a) the original wide-angle image with curved lines and distorted faces. (b) result by the proposed semi-supervised method, both lines and faces are corrected.

Motivated by the above drawbacks, we attempt to leverage unlabeled images to train a transformer-based network for wide-angle portraits correction. Specifically, we develop a novel network, dubbed Multi-Scale Swin-Unet (MS-Unet), which is built upon the multi-scale swin transformer block (MSTB) to learn both local-scale and long-range information. Instead of the two-stage learning scheme of Tan et al. (2021), the MS-Unet learns the correction flow maps from the distorted image to the normal image directly. Furthermore, we design a semi-supervised strategy, consisting of DRC and RC, to make full use of both labeled and unlabeled data by introducing a surrogate task (segmentation) into MS-Unet. To verify the effectiveness of the semi-supervised scheme, we collect 5,000 unlabeled distorted images from different phones and shooting scenes. Extensive experiments demonstrate the superior performance of our proposed method. Fig. 1 (b) shows a visual result.

In summary, our main contributions are:

  • To the best of our knowledge, we propose the first semi-supervised learning strategy for wide-angle portraits correction, which dramatically reduces the amount of labeled training data required.

  • We develop a novel transformer-based network called MS-Unet, built upon the MSTB, to exploit both local-scale and long-range semantic information.

  • We provide a high-quality unlabeled dataset that can be used for semi-supervised wide-angle portraits correction.

2 Related Works

2.1 Wide-Angle Portraits Correction

Early wide-angle portraits correction methods relied on traditional algorithms. For example, stereographic projection Svardal, Olsen, and Andersen (2003) can produce very natural portraits, but the lines in the background remain curved. Shih et al. Shih et al. (2019) proposed a mesh-based algorithm that strikes a balance between straight-line and portrait correction effects. Nevertheless, it requires the camera parameters and a portrait segmentation as input, leading to a complicated procedure. Recently, Tan et al. Tan et al. (2021) proposed a two-stage deep neural network for wide-angle portraits correction, which solves the problem without camera parameters or portrait segmentation. Unfortunately, this fully supervised method is limited by the amount of labeled data, which requires high-cost manual screening and processing. For these reasons, we propose a semi-supervised transformer method, which can expand the training data at low cost and effectively improve portraits correction.

2.2 Visual Transformer

The transformer Vaswani et al. (2017) has been widely used in natural language processing (NLP). Inspired by its outstanding achievements, researchers have recently applied transformers to the computer vision field Han et al. (2020); Khan et al. (2021). More impressively, Liu et al. Liu et al. (2021) proposed an excellent hierarchical transformer architecture called Swin Transformer, which is built upon a shifted window partitioning mechanism. It has achieved advanced performance on various vision tasks, including image classification, object detection, and semantic segmentation. Cao et al. Cao et al. (2021) devised a U-shaped transformer network called Swin-Unet, which focuses on medical image segmentation and achieves great results. Building on these works, we propose a new transformer network that is suitable for wide-angle portraits correction.

2.3 Deep Semi-Supervised Learning

Deep semi-supervised learning provides a practical and effective way to fully utilize a mixed dataset containing labeled and unlabeled images. It has been widely used in image classification Xu et al. (2017); Xie et al. (2020), semantic segmentation Xiao et al. (2018); Babakhin, Sanakoyeu, and Kitamura (2019), machine translation He et al. (2016); Cheng (2019), crowd counting Liu et al. (2020); Meng et al. (2021), text classification Karamanolakis, Hsu, and Gravano (2019); Li et al. (2019); Lee, Ko, and Han (2021), and so on. These works have proved that semi-supervised learning can indeed improve network accuracy. Therefore, we also propose a semi-supervised learning method to break the constraint on the amount of labeled data in wide-angle portraits correction.

Figure 2: (a) The overview of our proposed Multi-Scale Swin-Unet (MS-Unet), which is built upon the multi-scale swin transformer block (MSTB). The network mainly consists of encoder, decoder, bottleneck and skip fusion blocks (SFB). (b) The architecture of two successive MSTBs. The primary difference between them is the windowing configurations (window partition and shifted window partition). (c) The detailed architecture of SFB.
Figure 3: The pipeline of the semi-supervised wide-angle portraits correction framework, which consists of direction and range consistency (DRC) and regression consistency (RC). For an unlabeled image sent to the siamese network, the estimated segmentation masks and correction flow maps are used to compute the DRC and RC unsupervised losses. The unlabeled data flow is marked in black; the labeled data flow is marked in red.

3 Method

In this section, we introduce the semi-supervised transformer for wide-angle portraits correction. Built upon Swin-Unet Cao et al. (2021), we construct a network called Multi-Scale Swin-Unet (MS-Unet), which contains four major parts: encoder, bottleneck, decoder, and skip fusion blocks. As shown in Fig. 2 (a), the network takes a single distorted image as input and produces horizontal and vertical correction flow maps as intermediate outputs. The distorted image is then projected into a normal image by these maps. Furthermore, we train our proposed MS-Unet through a semi-supervised scheme realized by direction and range consistency (DRC) and regression consistency (RC). The diagram of the semi-supervised scheme is illustrated in Fig. 3.
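To make the flow-map formulation concrete, the sketch below shows one plausible way to warp a distorted image with a pair of predicted correction flow maps. The backward-mapping convention (each output pixel samples from its own location plus a per-pixel displacement) and the function name are illustrative assumptions, not the authors' exact implementation.

```python
import cv2
import numpy as np

def apply_correction_flow(image, flow_x, flow_y):
    """Warp a distorted image with per-pixel flow maps (illustrative sketch).

    Assumes flow_x / flow_y hold, for every output pixel, the horizontal /
    vertical displacement (in pixels) of its sampling location in the input.
    """
    h, w = image.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow_x).astype(np.float32)  # source x-coordinate per pixel
    map_y = (grid_y + flow_y).astype(np.float32)  # source y-coordinate per pixel
    return cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```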

3.1 Multi-Scale Swin-Unet (MS-Unet)

Architecture Overview.

Our fundamental idea is to introduce the local-scale information into transformers, so that the features with local-scale and long-range information can be produced for accurate wide-angle portraits correction. Hence, we develop the MS-Unet, which is derived from Swin-Unet Cao et al. (2021), to solve the above problems. As shown in Fig. 2 (a), the MS-Unet is divided into four major parts: encoder, decoder, bottleneck and skip fusion blocks.

For the encoder, the input of size $H \times W \times 3$ is split into non-overlapping patches of size $4 \times 4$. A linear embedding block is then applied to these patches to form features with dimension $C$. In addition, there are four hierarchical stages in the encoder and bottleneck, and each of them mainly contains two successive MSTBs and a patch merging block. Specifically, the patch merging block is responsible for increasing the dimension and down-sampling, while the MSTB is designed for local-scale and long-range information extraction. Inspired by U-Net Ronneberger, Fischer, and Brox (2015), a symmetric architecture is deployed as the decoder. Contrary to the encoder, the decoder adopts a patch expanding block for up-sampling and only one MSTB in each stage. Note that the outputs of each stage are fused with shallow features from the encoder via the SFB before they are fed into the next stage, so the spatial information lost during down-sampling is complemented. Eventually, the final output features from the decoder share the same resolution as the input, and a linear projection layer is employed to produce the horizontal and vertical correction flow maps.

Overall, there are two primary differences between MS-Unet and Swin-Unet. First, as the core unit of Swin-Unet, the swin transformer block ignores the importance of local-scale information, which causes some objects (e.g., faces of different sizes) to remain distorted after correction. Second, directly employing the skip connection may not be the optimal scheme for fusing hierarchical features owing to their differences. To alleviate these issues, we leverage the MSTB as the basic unit of our MS-Unet to integrate local-scale and long-range information, and we design the simple yet efficient SFB to replace the skip connection, as sketched below.
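The sketch below traces the U-shaped data flow described above under a minimal, assumed configuration. Plain convolutions stand in for the MSTBs, patch merging/expanding blocks, and SFBs so the tensor shapes can be followed end to end; all widths and stage counts here are illustrative rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class MSUnetSketch(nn.Module):
    """U-shaped data flow of MS-Unet (sketch). Convolutions stand in for the
    MSTBs, patch merging/expanding blocks, and SFBs described in the paper."""
    def __init__(self, dim=96):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, 4, stride=4)                    # patch embedding
        self.enc1 = nn.Conv2d(dim, dim * 2, 3, stride=2, padding=1)    # stage + merging
        self.enc2 = nn.Conv2d(dim * 2, dim * 4, 3, stride=2, padding=1)
        self.bottleneck = nn.Conv2d(dim * 4, dim * 4, 3, padding=1)
        self.dec2 = nn.ConvTranspose2d(dim * 4, dim * 2, 2, stride=2)  # patch expanding
        self.fuse2 = nn.Conv2d(dim * 4, dim * 2, 1)                    # SFB stand-in
        self.dec1 = nn.ConvTranspose2d(dim * 2, dim, 2, stride=2)
        self.fuse1 = nn.Conv2d(dim * 2, dim, 1)
        self.up = nn.ConvTranspose2d(dim, dim, 4, stride=4)            # back to input size
        self.head = nn.Conv2d(dim, 2, 1)                               # 2 flow channels

    def forward(self, x):                      # x: (B, 3, H, W), H and W divisible by 16
        e0 = self.embed(x)                     # (B, C, H/4, W/4)
        e1 = self.enc1(e0)                     # (B, 2C, H/8, W/8)
        b = self.bottleneck(self.enc2(e1))     # (B, 4C, H/16, W/16)
        d2 = self.fuse2(torch.cat([self.dec2(b), e1], 1))   # skip fusion at H/8
        d1 = self.fuse1(torch.cat([self.dec1(d2), e0], 1))  # skip fusion at H/4
        return self.head(self.up(d1))          # (B, 2, H, W) correction flow maps
```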

Multi-Scale Swin-Transformer Block (MSTB).

We integrate the dense connection module (DCM) into the MSTB for local multi-scale information extraction. Fig. 2 (b) presents two successive MSTBs. Each MSTB contains a DCM, layer normalization (LN), multi-head self-attention (MSA), skip connections, and a multi-layer perceptron (MLP). Window partitioning (WP) and shifted window partitioning (SWP) are used in the two successive MSTBs, respectively.

When features $X \in \mathbb{R}^{H \times W \times C}$ are fed into the MSTB, they pass through two parallel branches to compute the query $Q$, key $K$, and value $V$ as the input of MSA. In the left branch, the features are split into non-overlapping windows of size $M \times M$ by (S)WP. The features are flattened and reshaped as $X_w \in \mathbb{R}^{N \times M^2 \times C}$, where $N = HW/M^2$. Then a fully connected layer is applied to obtain the query $Q \in \mathbb{R}^{N \times M^2 \times d}$, where $d = C/k$ and $k$ is the head number. In the right branch, the features first pass through the DCM to extract local-scale information. Inspired by Huang et al. (2017), the DCM consists of two $1 \times 1$ convolution layers and three depthwise separable convolution layers with different dilation rates. Specifically, the $1 \times 1$ convolution layers are employed to change the feature dimension. Each depthwise separable convolution layer receives the features from all preceding layers as input:

$$y_l = f_l\big([y_0, y_1, \ldots, y_{l-1}]\big), \qquad (1)$$

where $[\cdot]$ denotes the concatenation operation and $f_l$ is the $l$-th depthwise separable convolution layer. Then we apply the same operations as in the left branch to the features from the DCM to generate $X'_w$. The key $K$ and value $V$ are obtained from $X'_w$ through fully connected layers. Afterwards, the MSA can be calculated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V, \qquad (2)$$

where $B$ refers to the learnable relative position bias.
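A minimal PyTorch sketch of the DCM is given below. The dilation rates (1, 2, 3) and the internal growth width are assumptions; the paper states only that two dimension-changing convolutions and three depthwise separable convolutions with different dilation rates are densely connected as in Eq. (1).

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=dilation,
                                   dilation=dilation, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class DCM(nn.Module):
    """Dense connection module (sketch). Dilation rates and the growth
    width are assumptions, not the paper's exact hyper-parameters."""
    def __init__(self, channels, growth=32, dilations=(1, 2, 3)):
        super().__init__()
        self.reduce = nn.Conv2d(channels, growth, 1)   # 1x1 conv: shrink dimension
        layers, in_ch = [], growth
        for d in dilations:
            layers.append(DepthwiseSeparableConv(in_ch, growth, d))
            in_ch += growth                            # dense: inputs accumulate
        self.layers = nn.ModuleList(layers)
        self.expand = nn.Conv2d(in_ch, channels, 1)    # 1x1 conv: restore dimension

    def forward(self, x):
        feats = [self.reduce(x)]
        for layer in self.layers:
            # Eq. (1): each layer consumes the concatenation of all earlier outputs.
            feats.append(layer(torch.cat(feats, dim=1)))
        return self.expand(torch.cat(feats, dim=1))
```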

Skip Fusion Block (SFB).

As mentioned above, the crucial differences among features from different hierarchical stages are ignored when directly adopting the skip connection. Hence, we specifically design the simple yet efficient skip fusion block (SFB) to replace it. As shown in Fig. 2 (c), before the encoder features $X_e$ and the decoder features $X_d$ from the corresponding stages are sent to the next stage of the decoder, they pass through the SFB to form new features $X_f$ with dimension $C$. The whole calculation process is defined as follows:

$$X_f = P^{-1}\Big(\phi\big(P\big([X_e, X_d]\big)\big)\Big), \qquad (3)$$

where $P(\cdot)$ is dimension permuting, $[\cdot,\cdot]$ refers to the concatenation, and $\phi(\cdot)$ is a 1D convolution layer.
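The following is a minimal sketch of the SFB under the common (B, L, C) token layout; the 1D-convolution kernel size is an assumption, since the paper specifies only the permute-concatenate-convolve structure of Eq. (3).

```python
import torch
import torch.nn as nn

class SFB(nn.Module):
    """Skip fusion block (sketch). Token layout (B, L, C) and kernel size
    are assumptions."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(2 * dim, dim, kernel_size=1)  # fuse 2C -> C channels

    def forward(self, x_enc, x_dec):
        # (B, L, 2C) -> (B, 2C, L) so channels become the 1D-conv axis.
        x = torch.cat([x_enc, x_dec], dim=2).permute(0, 2, 1)
        return self.conv(x).permute(0, 2, 1)  # back to (B, L, C)
```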

3.2 Semi-supervised Learning Algorithm

Although the MS-Unet can boost performance significantly, training a superior portraits correction network depends heavily on high-quality labeled data. However, obtaining labeled data is not easy owing to its extremely high cost. To reduce the cost of labeling, we train our proposed MS-Unet through a semi-supervised scheme, which allows us to obtain better results with fewer labeled images. In our problem setting, we have a set of unlabeled images denoted as $\mathcal{D}_u = \{x^u_i\}_{i=1}^{N_u}$ and a set of labeled images $\mathcal{D}_l = \{(x^l_i, y^l_i)\}_{i=1}^{N_l}$, where $y^l_i$ represents the labels. We mix these images and use them to train our network through the semi-supervised method composed of DRC and RC.

Direction and Range Consistency (DRC).

Many existing methods have shown that network performance can be further improved by introducing approximate surrogate tasks Gao, Wang, and Li (2019); Meng et al. (2021). Thus we construct a surrogate task (segmentation) in the MS-Unet to predict whether the flow map value $f(p)$ meets a given direction and range. Specifically, the prediction target of the segmentation task is defined as follows:

$$s(p) = \begin{cases} 0, & f(p) < -\tau \\ 1, & |f(p)| \le \tau \\ 2, & f(p) > \tau \end{cases} \qquad (4)$$

where $s$ denotes the segmentation mask, $p$ is the pixel position in the mask or flow map, and $\tau$ is the predefined threshold, whose value is fixed in our experiments.

This design is mainly motivated by four aspects: 1) Each pixel of the correction flow map is directional; the network can learn the direction variation of each pixel through the segmentation mask, which helps it better understand portraits correction. 2) By expressing a variation range with a shared threshold, the segmentation mask makes the network learn regional consistency, so the estimated correction flow maps become smoother. 3) The ground-truth segmentation mask can be generated directly from the existing correction labels without additional cost, as sketched below. 4) A loss function can be constructed between the portraits correction and segmentation tasks, which makes it feasible to introduce unlabeled data in our semi-supervised scheme.
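A small sketch of this mask generation is shown below, following the three-class reading of Eq. (4) reconstructed above (itself an assumption). Because the mask is computed directly from a flow map, the same helper applies to ground-truth labels and to estimated flows alike.

```python
import numpy as np

def flow_to_mask(flow, tau):
    """Convert a correction flow map into a direction-and-range mask,
    per the three-class reading of Eq. (4) above (an assumption):
    0 below -tau, 1 within [-tau, tau], 2 above tau."""
    mask = np.ones_like(flow, dtype=np.int64)  # default: within range
    mask[flow < -tau] = 0                      # negative direction, out of range
    mask[flow > tau] = 2                       # positive direction, out of range
    return mask
```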

The learning strategy of DRC is shown in Fig. 3. For labeled data, the segmentation task is supervised by the label transformed from the ground-truth flow map. The unlabeled images are used by the network to learn direction and range consistency. We give the details of the semi-supervised loss in Section 3.3.

Regression Consistency (RC).

Besides the DRC, we also introduce regression consistency (RC) to improve the network's robustness. Fig. 3 illustrates the details of RC. Specifically, from an unlabeled image $x^u$ we obtain two different images $x^u_1$ and $x^u_2$ with various augmentation methods (e.g., noise, smoothing, and sharpening). Many previous works have shown that a robust network should produce the same predictions for differently perturbed versions of the same image. Therefore, we expand the MS-Unet into a shared-weight siamese structure: the images $x^u_1$ and $x^u_2$ are fed into the two branches respectively, and a consistency loss is established between their outputs, as sketched below. The detailed loss implementation of RC is given in Section 3.3.
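The sketch below shows one plausible way to build the two perturbed views; the specific perturbations (additive noise and a box blur) are stand-ins for the noise/smoothing/sharpening augmentations mentioned above.

```python
import torch
import torch.nn.functional as F

def make_siamese_views(x):
    """Create two differently perturbed views of an unlabeled batch for RC.
    The exact augmentations are assumptions; the paper lists noise,
    smoothing, and sharpening as examples."""
    noisy = x + 0.05 * torch.randn_like(x)                      # additive-noise view
    kernel = torch.full((x.shape[1], 1, 3, 3), 1.0 / 9.0,
                        device=x.device)                        # 3x3 box-blur kernel
    smooth = F.conv2d(x, kernel, padding=1, groups=x.shape[1])  # smoothed view
    return noisy, smooth
```

Both views are then passed through the shared-weight MS-Unet, and the consistency losses of Section 3.3 tie the two sets of predictions together.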

3.3 Loss Function

In practice, the MS-Unet is optimized with the supervised losses (flow map regression and the segmentation task) on the labeled data $\mathcal{D}_l$, and with the semi-supervised losses on the unlabeled data $\mathcal{D}_u$.

Supervised Loss.

Our supervised loss is composed of three parts: the mask-based loss $\mathcal{L}_{m}$, the mask-based sobel loss $\mathcal{L}_{sobel}$, and the cross-entropy loss $\mathcal{L}_{ce}$. The detailed definitions are described as follows:

$\mathcal{L}_{m}$: In our method, we introduce a weighted mask, where the weight in the portrait region is always larger than in the background, so that the network pays more attention to distorted portraits. Eq. 5 gives the definition of this loss:

$$\mathcal{L}_{m} = \big\lVert W \odot (F_{gt} - F_{est}) \big\rVert_1, \qquad (5)$$

where $F_{gt}$ and $F_{est}$ represent the ground-truth and estimated flow maps, respectively, and $W$ denotes the weighted mask.

$\mathcal{L}_{sobel}$: In portraits correction, object edges directly affect the overall visual quality of the corrected image. Therefore, we introduce the sobel loss, which can be expressed as follows:

$$\mathcal{L}_{sobel} = \big\lVert W \odot \big(S_x * F_{gt} - S_x * F_{est}\big) \big\rVert_1 + \big\lVert W \odot \big(S_y * F_{gt} - S_y * F_{est}\big) \big\rVert_1, \qquad (6)$$

where $S_x$ and $S_y$ denote the horizontal and vertical sobel kernels, respectively, and $*$ is the convolution operation.

$\mathcal{L}_{ce}$: To supervise the mask generated by the segmentation task, we convert the ground-truth flow map into a mask label and deploy the cross-entropy loss. The loss function is defined as follows:

$$\mathcal{L}_{ce} = -\sum_{p} \bar{s}(p)\,\log s(p), \qquad (7)$$

where $\bar{s}$ is the ground-truth mask converted from the flow map and $s$ refers to the estimated mask. To sum up, the training loss for a labeled image is:

$$\mathcal{L}_{sup} = \mathcal{L}_{m} + \lambda_{1}\,\mathcal{L}_{sobel} + \lambda_{2}\,\mathcal{L}_{ce}, \qquad (8)$$

where $\lambda_{1}$ and $\lambda_{2}$ are hyper-parameters, both set to 10 in our experiments.
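As a concrete reading of Eqs. (5)-(8), the sketch below combines the three supervised terms. The L1 form of the mask-weighted and sobel terms is an assumption; the 10/10 weights follow the paper.

```python
import torch
import torch.nn.functional as F

# Sobel kernels for the horizontal / vertical gradients of the flow maps.
SOBEL_X = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def grad(x, kernel):
    """Per-channel image gradient via a depthwise convolution."""
    k = kernel.repeat(x.shape[1], 1, 1, 1).to(x.device)
    return F.conv2d(x, k, padding=1, groups=x.shape[1])

def supervised_loss(flow_pred, flow_gt, mask_logits, mask_gt, weight_mask,
                    lam_sobel=10.0, lam_ce=10.0):
    """Sketch of Eq. (8): weighted L1 + sobel edge loss + cross-entropy.
    The L1 form of the first two terms is an assumption; the 10/10
    weights follow the paper."""
    l_m = torch.mean(weight_mask * torch.abs(flow_pred - flow_gt))
    l_sobel = sum(torch.mean(weight_mask *
                             torch.abs(grad(flow_pred, k) - grad(flow_gt, k)))
                  for k in (SOBEL_X, SOBEL_Y))
    l_ce = F.cross_entropy(mask_logits, mask_gt)   # segmentation surrogate task
    return l_m + lam_sobel * l_sobel + lam_ce * l_ce
```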

Semi-Supervised Loss.

For an unlabeled image, we construct the unsupervised loss based on the segmentation task and the flow map task to guide the prediction consistency of the network. The unsupervised loss contains two parts: the DRC loss $\mathcal{L}_{DRC}$ and the RC loss $\mathcal{L}_{RC}$.

$$\mathcal{L}_{unsup} = \mathcal{L}_{DRC} + \mathcal{L}_{RC} = \Big(\ell\big(s_{1}, \bar{s}(F_{1})\big) + \ell\big(s_{2}, \bar{s}(F_{2})\big)\Big) + \big\lVert F_{1} - F_{2} \big\rVert_{1}, \qquad (9)$$

where $F_1$ and $F_2$ are the estimated flow maps from the two branches of the siamese network, $\bar{s}(F_1)$ and $\bar{s}(F_2)$ refer to the segmentation masks converted from $F_1$ and $F_2$, $s_1$ and $s_2$ indicate the segmentation outputs of the siamese network, and $\ell(\cdot,\cdot)$ is the cross-entropy.
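The sketch below gives one plausible implementation of Eq. (9) for a single flow channel. Treating the flow-derived mask as a cross-entropy target and using an L1 term for RC are assumptions consistent with the reconstruction above.

```python
import torch
import torch.nn.functional as F

def drc_rc_loss(flow_a, flow_b, mask_logits_a, mask_logits_b, tau):
    """Sketch of Eq. (9) for one flow channel. DRC ties each branch's
    predicted segmentation to the mask derived from its own estimated flow
    (three classes, as in Eq. (4)); RC ties the two flows together. The
    cross-entropy / L1 choices are assumptions."""
    def flow_to_mask(flow):                 # torch version of the Eq. (4) reading
        mask = torch.ones_like(flow, dtype=torch.long)
        mask[flow < -tau] = 0               # negative direction, out of range
        mask[flow > tau] = 2                # positive direction, out of range
        return mask

    l_drc = (F.cross_entropy(mask_logits_a, flow_to_mask(flow_a.detach())) +
             F.cross_entropy(mask_logits_b, flow_to_mask(flow_b.detach())))
    l_rc = torch.mean(torch.abs(flow_a - flow_b))
    return l_drc + l_rc
```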

Figure 4: Qualitative results of different wide-angle portraits correction methods.

4 Experiments

4.1 Implementation Details

Datasets.

Following the existing method Tan et al. (2021), we conduct extensive experiments on the wide-angle dataset of Tan et al. (2021), captured from different smartphones. The dataset provides a training set and a testing set; for each image, several kinds of labels are provided, including the face mask, correction flow maps, and landmarks. In addition, we collect another 5,000 images through different smartphones (including Samsung Note 10, Xiaomi 11, vivo X23 and vivo iQOO) as the unlabeled set.

Training Details.

In the training stage, we train the MS-Unet via a two-step schedule. In the first step, we train only the correction flow map predictor for 200 epochs. Then we introduce the surrogate task and the semi-supervised method to further improve the performance of the network; in this step, the network continues training for additional epochs. For both steps, we use Adam to optimize our model with a fixed initial learning rate and weight decay. In addition, we adopt data augmentation methods (e.g., zooming, sharpening, smoothing) to enrich the diversity of training samples. All experiments are performed on a GeForce RTX 2080Ti GPU.
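The two-step schedule can be summarized as in the sketch below. The optimizer hyper-parameters and loader interfaces are hypothetical placeholders, not the paper's exact settings.

```python
import torch

def train_two_step(model, labeled_loader, mixed_loader, epochs_stage2):
    """Two-step schedule (sketch). Optimizer settings are assumptions."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
    for epoch in range(200):                   # step 1: supervised flow regression
        for img, flow_gt, w_mask in labeled_loader:
            loss = torch.mean(w_mask * torch.abs(model(img) - flow_gt))
            opt.zero_grad()
            loss.backward()
            opt.step()
    for epoch in range(epochs_stage2):         # step 2: surrogate task + semi-supervised
        for batch in mixed_loader:
            # supervised_loss on labeled samples plus drc_rc_loss on
            # unlabeled samples (see the sketches in Section 3.3).
            ...
    return model
```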

Evaluation Metrics.

We use the same evaluation metrics (LineAcc and ShapeAcc) as Tan et al. (2021) to evaluate the performance of our proposed method. More specifically, LineAcc evaluates the curvature variation of the marked lines and is defined as follows:

$$\mathrm{LineAcc} = \frac{1}{N}\sum_{i=1}^{N} \mathcal{C}\big((x_i, y_i), (\hat{x}_i, \hat{y}_i)\big), \qquad (10)$$

where $\mathcal{C}(\cdot,\cdot)$ denotes the similarity between the slopes of the two lines, $N$ is the number of uniformly sampled points in each line, and $(x_i, y_i)$ and $(\hat{x}_i, \hat{y}_i)$ indicate the coordinates of the corresponding points in the reference and distorted images.

ShapeAcc evaluates the face similarity between the corrected image and the reference image. Based on face landmarks, ShapeAcc is defined as follows:

$$\mathrm{ShapeAcc} = \frac{1}{K}\sum_{j=1}^{K} \mathcal{S}\big(l_j, \hat{l}_j\big), \qquad (11)$$

where $\mathcal{S}(\cdot,\cdot)$ is the similarity between the corrected and target faces, $K$ is the number of fixed sampled points in each face, and $l_j$ and $\hat{l}_j$ are the corresponding face landmarks in the corrected and reference images.

4.2 Ablation Study

To verify the influence of different factors on our proposed method, we conduct ablation experiments on Tan's dataset Tan et al. (2021) and our unlabeled dataset. In particular, we consider the structure of the correction network, the semi-supervised strategy, and the number of unlabeled samples.

Effect of the Correction Network.

We first explore how the proposed modules (i.e., MSTB, SFB) affect network performance. Specifically, we use Swin-Unet as our baseline and evaluate three different networks: 1) Baseline: directly employ Swin-Unet to predict correction flow maps; 2) Baseline+MSTB: based on 1), the MSTB replaces the swin transformer block to integrate local-scale information; 3) Baseline+MSTB+SFB (MS-Unet): the SFB is added to effectively fuse the hierarchical features. Table 1 presents the results. We observe that performance boosts significantly with the addition of each module. In particular, when both MSTB and SFB are added, the full MS-Unet achieves the best LineAcc (66.825) and ShapeAcc (97.491). These experiments demonstrate that MSTB indeed promotes the network to extract more complementary information, which boosts the correction ability dramatically, while SFB provides a better feature fusion strategy than skip connections.

Index Baseline MSTB SFB LineAcc ShapeAcc
1) ✓ - - 66.514 97.460
2) ✓ ✓ - 66.763 97.487
3) ✓ ✓ ✓ 66.825 97.491
Table 1: Ablations on the structure of the proposed MS-Unet.

Effect of the Semi-Supervised Strategy.

Several experiments are conducted to evaluate the impact of our proposed semi-supervised scheme, using both labeled and unlabeled data. In each experiment, the network is first trained with the supervised scheme for 200 epochs before the semi-supervised strategy is deployed, and then training continues for additional epochs.

We first simply add the surrogate task (segmentation) to the network and continue to train the two-task MS-Unet without any unlabeled images. The second row of Table 2 reports the results. Compared with the MS-Unet with only the correction task, adding the surrogate task yields a slight improvement (LineAcc: 66.871 vs. 66.825, ShapeAcc: 97.493 vs. 97.491). This indicates that the learning ability of MS-Unet can be further improved via the segmentation task. Then the DRC is applied on top of the segmentation task, and the third row of Table 2 lists the result. Compared with the two-task MS-Unet, adding DRC further improves the estimation accuracy of the correction flow maps, especially the LineAcc (from 66.871 to 67.154). Besides, the effect of RC is evaluated in the fourth row of Table 2; the result also outperforms the single-task MS-Unet trained only with the supervised scheme. Among all settings, the MS-Unet attains the best result (LineAcc: 67.209, ShapeAcc: 97.500) when both DRC and RC are employed during semi-supervised training.

Index Seg DRC RC LineAcc ShapeAcc
1) - - - 66.825 97.491
2) ✓ - - 66.871 97.493
3) ✓ ✓ - 67.154 97.494
4) - - ✓ 66.848 97.497
5) ✓ ✓ ✓ 67.209 97.500
Table 2: Performance comparison of different semi-supervised strategies. 'Seg' indicates the segmentation task without direction and range consistency.

Effect of the Number of Unlabeled Samples.

We examine the influence of the number of unlabeled images on network performance using Tan's dataset Tan et al. (2021) and our unlabeled data. We vary the number of unlabeled images from 0 to 5,000 while keeping the number of labeled images fixed. The results are listed in Table 3, which shows that our MS-Unet trained with the semi-supervised strategy consistently outperforms the fully-supervised scheme. Meanwhile, we can conclude that, within a certain range, the performance of the MS-Unet improves as the number of unlabeled images increases.

Index Number of unlabeled images LineAcc ShapeAcc
1) 0 66.871 97.493
2) 1000 66.929 97.493
3) 2000 66.999 97.494
4) 3000 67.105 97.497
5) 4000 67.155 97.496
6) 5000 67.209 97.500
Table 3: The impact of the number of unlabeled images.

4.3 Comparison with Other Methods

Comparison with Other Wide-Angle Portraits Correction Methods.

We also compare our proposed method with previous state-of-the-art methods on Tan's and Google's test sets. Table 4 shows that our method obtains the highest metric scores on both test sets. The visual comparisons in Fig. 8 also confirm the metric results. Note that the perspective projection can correct lines but seriously distorts faces. Shih et al. Shih et al. (2019) and Tan et al. Tan et al. (2021) try to seek the optimal trade-off between the faces and the background. Unfortunately, several bent structures still exist in the background, and a few faces remain distorted. In our results, the faces are more natural and the corrected structures in the background are satisfactory. Both quantitative and qualitative results verify the superior performance of our method.

Method Tan’s test set Google’s test set
LineAcc ShapeAcc LineAcc ShapeAcc
ShihShih et al. (2019) 66.143 97.253 61.551 97.464
TanTan et al. (2021) 66.784 97.490 64.650 97.499
Ours 67.209 97.500 66.098 97.512
Table 4: Quantitative results on two wide-angle portraits correction test sets.

Fig. 5 depicts the results of our method and the portraits correction algorithms built into smartphones (i.e., Xiaomi 11 Ultra and iPhone 12). We can observe serious stretching of portraits in the iPhone 12 result. Although the Xiaomi 11 Ultra result improves on the distorted image, there are still slight deformations on the face and curved lines in the background. Our method shows better results: the face is natural while the lines in the background are corrected.

Figure 5: Visual comparison between our method and some wide-angle portraits correction methods from smartphones.

Comparison with Other Computer Vision Methods.

At present, few works employ deep learning for wide-angle portraits correction due to its difficulty. Based on the correction flow maps, the correction task is transformed into a pixel-level regression problem, which is closely related to computer vision tasks such as crowd counting Gao, Wang, and Li (2019); Liu et al. (2020) and semantic segmentation Chen et al. (2018); Cao et al. (2021). Hence, we introduce several efficient networks from these fields to predict the correction flow maps. All the networks are trained with the fully supervised scheme, and Table 5 shows the results. Notably, our proposed MS-Unet surpasses all the other methods. The primary reason is that CNN-based networks focus on learning local-scale information while transformer-based networks concentrate more on building long-range information. Combining both advantages, the MS-Unet captures multi-scale information for more accurate estimation of the correction flow maps.

Method LineAcc ShapeAcc
RefineNet Lin et al. (2017) 66.348 97.449
CSRNet Li and Zhang (2018) 65.967 97.469
Deeplab v3+ Chen et al. (2018) 66.200 97.482
Swin-Unet Cao et al. (2021) 66.514 97.460
HRNet Sun et al. (2019) 66.748 97.477
Ours 66.825 97.491
Table 5: Results of different networks on wide-angle portraits correction.

Besides, we also train these networks with our semi-supervised method. The results in Table 6 demonstrate the generalization ability of our semi-supervised scheme.

Method Fully-Supervised Semi-Supervised
 LineAcc ShapeAcc LineAcc ShapeAcc
RefineNet Lin et al. (2017) 66.348 97.449 66.569 97.455
CSRNet Li and Zhang (2018) 65.967 97.469 66.236 97.471
Deeplab v3+ Chen et al. (2018) 66.200 97.482 66.565 97.487
Swin-Unet Cao et al. (2021) 66.514 97.460 66.859 97.469
HRNet Sun et al. (2019) 66.748 97.477 66.805 97.491
Ours 66.825 97.491 67.209 97.500
Table 6: Effectiveness of the proposed semi-supervised scheme on different networks.

5 Conclusion

In this paper, we develop a novel semi-supervised wide-angle portraits correction method using a multi-scale transformer. To capture both local-scale and long-range information, we design the MS-Unet, which takes the MSTB as its core unit. We also design the SFB and add it to our network to integrate hierarchical features effectively. Furthermore, we develop a semi-supervised scheme to train our network; by combining DRC and RC, we overcome the limitations of labeled data and make full use of unlabeled data. In addition, four kinds of smartphones are adopted to collect the unlabeled data. Extensive experimental results show that our proposed method outperforms existing advanced methods and is well suited for practical wide-angle portraits correction applications.

References

A Additional Implementation Details

A.1 Image Scaling

In our experiments, the original distorted images from smartphones are too large to be used directly for training. Therefore, we scale all images to a uniform size and send the resized images to our MS-Unet for training. Consequently, the output correction flow maps share the same size as the resized images. Afterwards, the correction flow maps are resized back to the size of the original images, and then they are used to correct the original distorted images into normal images. This scaling policy lets our MS-Unet run at a lower complexity, which makes it feasible to deploy on smartphones.
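A minimal sketch of this policy is shown below; `model_predict` and the placeholder training size are hypothetical. Note that when the flow maps are resized, the displacement values must be rescaled along with the spatial dimensions.

```python
import cv2
import numpy as np

def correct_full_resolution(model_predict, image, train_size=(512, 512)):
    """Run the network at a fixed low resolution, then resize the flow maps
    back to the original size before warping (sketch; model_predict and
    train_size are hypothetical placeholders)."""
    h, w = image.shape[:2]
    small = cv2.resize(image, train_size)
    flow_x, flow_y = model_predict(small)        # low-resolution flow maps
    flow_x = cv2.resize(flow_x, (w, h)) * (w / train_size[0])  # rescale x-shifts
    flow_y = cv2.resize(flow_y, (w, h)) * (h / train_size[1])  # rescale y-shifts
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow_x).astype(np.float32)
    map_y = (grid_y + flow_y).astype(np.float32)
    return cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```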

A.2 Correction Flow Map Segmentation Mask

In Tan's method Tan et al. (2021), two sub-networks implement the wide-angle portraits correction: the LineNet produces perspective projection flow maps to flatten the distorted image, while the ShapeNet predicts face correction flow maps to correct the flattened image into a normal image. To reduce this structural redundancy, we design the one-stage MS-Unet to generate correction flow maps that project the distorted image into the normal image directly.

Fig. 6 shows some images corrected by our MS-Unet, together with the corresponding horizontal and vertical correction flow maps. In general, the darker a region in the flow map, the stronger the correction it requires. We can observe that the flow maps pay most attention to the distorted faces and the corners of the distorted image. Fig. 6 also shows the corresponding segmentation mask of each flow map, which is used to assist our semi-supervised scheme.

B Introduction to the Unlabeled Data

We construct an unlabeled dataset containing 5,000 distorted images with various scenes and smartphones. This dataset makes it possible to train the MS-Unet with our proposed semi-supervised scheme.

Table 7 shows the distribution of our unlabeled data by the number of people and the orientation. The samples are captured with 4 types of smartphones (Samsung Note 10, Xiaomi 11, vivo X23 and vivo iQOO) with wide-angle lenses of different distortion modules. Each smartphone covers various scenes in both horizontal and vertical orientation, and the number of people in each image ranges from 1 to 3, corresponding to Scene 1-H through Scene 3-V in Table 7. We also report the number of samples per scene, as well as its percentage of the whole dataset. Notably, the number of people is evenly distributed across both smartphones and scenes.

Besides, our unlabeled dataset contains a variety of complex scenes; some samples are shown in Fig. 7. These images are captured in different shooting configurations, including different numbers of people, shooting angles, and shooting distances, as shown in Fig. 7. Different shooting configurations are combined with different orientations, covering various types of distortion.

C Comparison Results

C.1 Comparison with Other Methods

We show more visual comparisons with perspective undistortion, Shih's results Shih et al. (2019), and Tan's results Tan et al. (2021) in Fig. 8. We mark the obvious differences in the corrected images with red boxes. In general, our proposed method strikes a better balance between correcting the straight lines and the distorted faces. Moreover, it keeps a more natural transition between the faces and the background.

C.2 Comparison with Other Phones

In addition, Fig. 9 shows more visual results compared with the corrected wide-angle portrait images produced by iPhone 12 and Xiaomi 11. We can clearly observe that our proposed method is superior to the two commercial solutions; especially in the regions marked with red boxes, the corrected faces are more natural.

C.3 Future Research Directions

Although our proposed method corrects distorted images well in many scenes, its performance still needs improvement in some cases. As shown in Fig. 10, we further evaluate the effectiveness of our method and explore the room for further enhancement. Fig. 10 (a) shows two unsatisfactory examples where the feet are close to the corners of the images: correcting the corners inevitably overstretches the feet. Similarly, the body in Fig. 10 (b) is also overstretched due to the strong correction at the corners. The main reason is that the current ground-truth flow maps only focus on the correction of distorted faces but ignore other body parts, such as feet and arms. In the future, we will extend our work to correct the distortion of the whole body, which would solve the problems shown in Fig. 10 and make the corrected images more natural.

Scene 1-H Scene 1-V Scene 2-H Scene 2-V Scene 3-H Scene 3-V Total
Xiaomi 11 226 194 229 224 198 189 1,260 (25.20%)
Samsung Note10 224 219 214 209 199 189 1,254 (25.08%)
vivo X23 231 195 198 238 211 196 1,269 (25.38%)
vivo iQOO 198 218 213 196 191 201 1,217 (24.34%)
Total 879 (17.58%) 826 (16.52%) 854 (17.08%) 867 (17.34%) 799 (15.98%) 775 (15.50%) 5,000
Table 7: Unlabeled dataset distribution over different camera modules with different numbers of people and orientations. 'H' indicates horizontal orientation, and 'V' refers to vertical orientation.
Figure 6: Visualization results of our proposed method. From left to right: (a) the corrected images, (b) the horizontal correction flow map, (c) the vertical correction flow map, (d) the horizontal segmentation mask, (e) the vertical segmentation mask.
Figure 7: Samples of different scenes in our unlabeled dataset. Five different shooting scenes with various people and backgrounds are shown.
Figure 8: Qualitative results of different wide-angle portraits correction methods. Note that the obvious differences in the corrected images are marked with red boxes.
Figure 9: Visual comparison between our method and the wide-angle portraits correction methods of two other smartphones. We mark the obvious differences in the corrected images with red boxes.
Figure 10: Some corrected images that need further improvement.