Deep Joint Encryption and Source-Channel Coding: An Image Privacy Protection Approach

Joint source and channel coding (JSCC) has achieved great success thanks to the introduction of deep learning (DL). Compared with traditional separate source-channel coding (SSCC) schemes, the advantages of DL-based JSCC (DJSCC) include higher spectrum efficiency, higher reconstruction quality, and relief from the "cliff effect". However, it is difficult to couple encryption-decryption mechanisms with DJSCC, in contrast with traditional SSCC schemes, which hinders the practical use of this emerging technology. To this end, this paper proposes a novel method called DL-based joint encryption and source-channel coding (DJESCC) for images, which can successfully protect the visual information of a plain image without significantly sacrificing image reconstruction performance. The core idea is to use a neural network to perform image encryption, converting the plain image into a visually protected one while accounting for its interaction with DJSCC. During the training stage, the proposed DJESCC method learns: 1) deep neural networks for image encryption and image decryption, and 2) an effective DJSCC network for image transmission in the encrypted domain. Compared with perceptual image encryption methods followed by DJSCC transmission, the DJESCC method achieves much better reconstruction performance and is more robust to ciphertext-only attacks.


I Introduction

The modular design principle based on Shannon's separation theorem [cover1999elements] is the cornerstone of modern communications and has enjoyed great success in the development of wireless communications. However, the separation theorem's assumptions of unlimited codeword length, delay, and complexity do not hold in real wireless environments, rendering separate source-channel coding (SSCC) sub-optimal. Moreover, for time-varying channels, when the channel quality falls below the target channel quality, SSCC cannot decode any information due to the collapse of channel coding; when the channel quality exceeds the target quality, separate coding cannot further improve the reconstruction quality. This is the famous "cliff effect" [skoglund2006hybrid], which increases the cost of SSCC in wireless transmission. Over the past years, joint source-channel coding (JSCC) has been demonstrated theoretically to have a better error exponent than SSCC for discrete memoryless source-channel pairs [gallager1968information, csiszar1980joint, zhong2007joint, zhong2007error], motivating the development of various JSCC designs over the years [belzer1995joint, heinen2005transactions, cai2000robust, guionnet2004joint].

More recently, deep learning (DL) based approaches have been proposed for source coding [toderici2015variable, balle2016end, minnen2018joint, minnen2020channel], channel coding [o2017introduction, xu2019performance, jiang2020learn, jiang2019turbo], and JSCC [bourtsoulatze2019deep, kurka2021bandwidth, kurka2020deepjscc, xu2021wireless]. Compared with SSCC schemes (e.g., JPEG/JPEG2000 for image coding and LDPC for channel coding), the DL-based JSCC (DJSCC) scheme designed in [bourtsoulatze2019deep] achieves better image restoration quality, especially in the low signal-to-noise ratio (SNR) regime. To adapt to variable bandwidth and to exploit channel output feedback in real wireless environments, schemes for DJSCC with adaptive-bandwidth image transmission and with channel output feedback are proposed in [kurka2021bandwidth] and [kurka2020deepjscc], respectively. However, all the aforementioned schemes are trained and deployed under the same channel condition (a single SNR) to ensure optimality, demanding multiple trained networks to cover a range of SNRs, which leads to considerable storage requirements in transceivers. To overcome this problem, [xu2021wireless] proposed a single DJSCC network that adapts to a wide range of SNR conditions to meet the memory limits of devices in real wireless scenarios. In summary, by using a data-driven approach, DJSCC reduces the difficulty of coding design in traditional JSCC and balances performance against storage requirements. These DL-enabled benefits make DJSCC methods easier to deploy than traditional JSCC schemes.

Fig. 1: The DJSCC based wireless communication system.

Yet, to protect information privacy and confidentiality, one must also couple encryption and decryption mechanisms with DJSCC-based wireless communication systems, as illustrated in Fig. 1. The image owner intends to transmit a plain image (an unencrypted image) to the image recipient through a network consisting of an untrusted wired network and a wireless transmission service. To protect the visual information of the plain image, the image owner encrypts the plain image into an encrypted image (a visually protected image) before handing it to the wireless service provider. The encrypted image is then transmitted by the wireless service provider via DJSCC. After the DJSCC wireless transmission, the corrupted encrypted image (the image decoded by the DJSCC decoder) is forwarded to the image recipient through the untrusted network. The image recipient decrypts the corrupted encrypted image to recover the plain image. Even if the encrypted image or the decoded image is leaked or stolen during the wired-network transmission, the visual information of the plain image cannot be acquired directly.

Fig. 2: Different strategies for protecting image privacy in SSCC and DJSCC. (a) Encrypting the data encoded by the source encoder before the channel encoder in SSCC; (b) encrypting the image source before the source encoder in SSCC; (c) encrypting the image source before the source encoder in DJSCC.

In SSCC, there are two strategies to safeguard image privacy: 1) encrypting the data encoded by the source encoder before the channel encoder, as shown in Fig. 2(a), and 2) encrypting the image source before the source encoder, as shown in Fig. 2(b) [chuman2019encryption]. The first strategy, known as compression-then-encryption (CtE), applies the source encoder to compress the image source into binary data, and then uses an encryption method (e.g., the Data Encryption Standard [diffie1977special], the Advanced Encryption Standard [daemen2001reijndael], or Rivest-Shamir-Adleman [rivest1978method]) to encrypt the binary data into ciphertext. The second strategy, called encryption-then-compression (EtC), fits the typical scenario where the image provider cares only about protecting image privacy while the telecommunications provider has an overriding interest in improving spectrum efficiency. Following the EtC strategy, various perceptual image encryption methods have been developed that transform an original image into a visually protected image [hui2002arnold, yano2002image, zou2004new, zhong2016image, mathews2011image, prasad2011chaos, maniccam2004image, el2011good, acharya2009image, pareek2006image, jun2009image, guan2005chaos, tong2008image, kanso2012novel, ferreira2015privacy, johnson2004compressing, liu2010efficient, kang2013performing, zhou2014designing, zhang2011lossy, chuman2019encryption, maekawa2018privacy, tanaka2018learnable].

To protect image privacy in DJSCC transmission, the encryption module should be placed in front of the deep joint source-channel encoder, as shown in Fig. 2(c), which is akin to the position of the encryption module in Fig. 2(b). However, a major issue is that EtC changes the visual structure of the plain image, degrading DJSCC transmission. To the best of our knowledge, no prior work addresses the encryption problem within the DJSCC framework. In [tanaka2018learnable, sirichotedumrong2019pixel, sirichotedumrong2019privacy, sirichotedumrong2021gan, ito2021image], different encryption methods are developed for DNN-based classification tasks. In addition, homomorphic encryption (HE) [acar2018survey] is a promising privacy-preserving computation method, and some HE-based methods have been applied in the DL domain [aono2018privacy, gilad2016cryptonets, hesamifard2017cryptodl, xu2019cryptonn]. However, the high computational complexity, the loss of accuracy caused by replacing nonlinear activation functions with polynomials, and the large ciphertext expansion when combining HE with DNNs are obstacles to the use of HE in DJSCC.

In this paper, we design a DL-based joint encryption and source-channel coding method that generates a visually protected image suitable for DJSCC transmission with high reconstruction performance. The inspiration for the proposed method originates from image transformation, illustrated in Fig. 3. Image transformation has been successfully applied in image compression (e.g., JPEG [wallace1992jpeg], JPEG2000 [rabbani2002jpeg2000]) and image processing (e.g., image classification and semantic segmentation [xu2020learning]) to facilitate subsequent operations. Although the transformed image contains the information of the plain image, little visual information can be perceived after the transformation. Compared with existing perceptual image encryption methods, the proposed method not only protects image privacy but also achieves better reconstruction performance. Another advantage is that the proposed method is more robust against ciphertext-only attacks than existing perceptual image encryption methods. Moreover, its fully convolutional architecture makes it flexible enough to handle images of different sizes without loss of visual protection performance.

Fig. 3: Illustration of image transformation. (a) The plain image, (b) The transformed image using discrete cosine transform (DCT).

The rest of this paper is organized as follows. Section II presents related work on deep joint source-channel coding, perceptual image encryption, and image transformation. Then, the proposed method is presented in Section III. In Section IV, the proposed method is evaluated on low-resolution and high-resolution datasets, respectively. Visual security of the proposed method is evaluated via ciphertext-only attacks in Section V. Finally, Section VI concludes this paper.

II Related Work

II-A Deep Joint Source-Channel Coding

As shown in the lower part of Fig. 1, DJSCC follows the end-to-end autoencoder architecture [goodfellow2016deep], which replaces the source encoder and the channel encoder (with/without modulation) with a deep joint source-channel encoder (DJSCE) at the transmitter, and similarly replaces the corresponding source decoder and channel decoder (with/without demodulation) with a deep joint source-channel decoder (DJSCD).

The initial DJSCC work proposed a recurrent neural network (RNN) based DJSCE and DJSCD for text transmission over binary erasure channels [farsad2018deep]. Since then, DJSCC has attracted increasing interest, especially for image compression and transmission. In [bourtsoulatze2019deep], fully convolutional neural networks (FCNNs) are used for the DJSCE and DJSCD, and are shown to outperform SSCC methods, especially in the low-SNR regime. Furthermore, the DJSCC method in [bourtsoulatze2019deep] degrades gracefully in communication scenarios suffering from large channel estimation errors, which are associated with the well-known "cliff effect" in SSCC.

In wireless communication systems, the transmission bandwidth is dynamically allocated according to user requirements and system load. DJSCC-l, proposed in [kurka2021bandwidth], can transmit progressively in layers: when the available bandwidth is limited, the codewords of the first few layers are transmitted to the receiver to reconstruct the transmitted image at lower quality; when the available bandwidth increases, the codewords of the residual layers are transmitted and combined with the earlier codewords to reconstruct the transmitted image at higher quality.

Once channel output feedback is available, DJSCC-f, proposed in [kurka2020deepjscc], can further improve the reconstruction quality by exploiting the channel output feedback. The difference between DJSCC-l and DJSCC-f is that DJSCC-f not only aggregates the layered transmission information at the receiver but also jointly processes the transmitted image and the processed channel output feedback at the transmitter. Compared with DJSCC without channel output feedback, DJSCC with either noisy or noiseless channel output feedback achieves considerable performance gains. However, [bourtsoulatze2019deep, kurka2020deepjscc, kurka2021bandwidth] must train multiple networks over a range of SNRs and select one of them depending on the actual SNR condition to remain optimal, which places a heavy burden on device storage.

Inspired by the resource assignment strategy in traditional JSCC, [xu2021wireless] introduces SNR feedback to DJSCC and proposes a general DJSCC method named Attention DL-based JSCC (ADJSCC). ADJSCC dramatically reduces the storage overhead while maintaining similar performance by using attention mechanisms. In addition, [choi2019neural] proposed a DJSCC scheme based on maximizing the mutual information between the source and the noisy received codeword for the binary erasure channel and the binary symmetric channel. [saidutta2019joint] and [saidutta2019joint2] model their DJSCC systems via a variational autoencoder and manifold variational autoencoders for a Gaussian source, respectively. However, none of the aforementioned works is designed to guarantee information privacy protection.

II-B Perceptual Image Encryption

Perceptual image encryption aims to transform a plain image into an encrypted image, which downgrades visual quality to protect the original perceptual information.

One can change image visual information along two dimensions: the space dimension and the pixel dimension. Space-based encryption re-arranges (scrambles) the pixel positions within the image. There are numerous image scrambling algorithms, e.g., Arnold [hui2002arnold], Baker transformation [yano2002image], Fibonacci transformation [zou2004new], magic square [zhong2016image], RGB scramble [mathews2011image], chaos scramble [prasad2011chaos], and SCAN pattern [maniccam2004image]. Pixel-based encryption changes only the pixel values to protect visual information; popular methods include DES [el2011good], the Hill algorithm [acharya2009image], chaotic logistic maps [pareek2006image], and cellular automata [jun2009image]. Combining space-based and pixel-based encryption makes the encryption more robust; well-known techniques include the Arnold and Chen's chaotic system based method [guan2005chaos], the compound chaotic sequence based method [tong2008image], and the 3D chaotic map based method [kanso2012novel]. However, the aforementioned encryption methods do not take into account the effect of subsequent operations such as source or channel coding.
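The two dimensions above can be illustrated with a toy sketch (not any of the cited algorithms): space-based encryption as a keyed permutation of pixel positions, and pixel-based encryption as an XOR with a keyed pseudo-random stream. The function names and the use of NumPy's seeded generator are illustrative assumptions.

```python
import numpy as np

def space_encrypt(img, key):
    # Space-based encryption: scramble pixel positions with a keyed permutation.
    rng = np.random.default_rng(key)
    perm = rng.permutation(img.size)
    return img.flatten()[perm].reshape(img.shape), perm

def space_decrypt(enc, perm):
    # Invert the permutation to restore the original pixel positions.
    flat = np.empty(enc.size, dtype=enc.dtype)
    flat[perm] = enc.flatten()
    return flat.reshape(enc.shape)

def pixel_encrypt(img, key):
    # Pixel-based encryption: XOR each pixel with a keyed pseudo-random stream.
    # XOR is its own inverse, so calling this again with the same key decrypts.
    rng = np.random.default_rng(key)
    stream = rng.integers(0, 256, size=img.shape, dtype=np.uint8)
    return img ^ stream
```

In practice the two are composed (scramble, then mask pixel values) to resist attacks that exploit either pixel statistics or spatial structure alone.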

Considering the compression task in EtC systems, stream-cipher methods using a pseudo-random key generator followed by Slepian-Wolf coding or resolution-progressive compression have been used for lossless compression [johnson2004compressing, liu2010efficient]. To further improve the compression ratio, several schemes have been proposed for lossy compression: pixel-based image encryption followed by lossy scalable compression [kang2013performing], image encryption via prediction-error clustering and random permutation followed by context-adaptive arithmetic coding [zhou2014designing], pseudorandom-permutation-based image encryption followed by orthogonal-transformation-based lossy compression [zhang2011lossy], and block-scrambling-based image encryption followed by JPEG lossy compression [chuman2019encryption]. However, when the final signal processing task changes from image reconstruction to, e.g., image classification, the methods designed for EtC systems no longer work.

DL has led to state-of-the-art performance in various image processing tasks, motivating the recent application of deep learning techniques to encrypted images. The methods proposed in [tanaka2018learnable, sirichotedumrong2019pixel, sirichotedumrong2019privacy, sirichotedumrong2021gan, ito2021image] are designed for DL-based classification. Tanaka's method [tanaka2018learnable] is a hybrid encryption method that adds an adaptation network in front of the DNN. The pixel-based image encryption methods with/without key management [sirichotedumrong2019pixel, sirichotedumrong2019privacy] produce images that can be fed directly into DNNs for classification. DL-based encryption methods, e.g., generative adversarial networks (GANs) and DNNs, are proposed for image encryption in [sirichotedumrong2021gan] and [ito2021image], respectively. Both [sirichotedumrong2021gan] and [ito2021image] employ image transformation (transforming the image from one domain to another) to protect visual information. However, the image encryption methods designed for DL-based classification are still not fit for DJSCC: the encrypted images for classification retain only the semantic information relevant to the image class while discarding pixel-level information, which degrades performance on the image reconstruction task, i.e., DJSCC.

III Deep Joint Encryption and Source-Channel Coding

Our motivation is to protect the visual information of the plain image without significantly sacrificing image reconstruction performance in DJSCC. Conventional image transformations, e.g., the discrete cosine transform (DCT) and the discrete wavelet transform (DWT), facilitate subsequent operations and simultaneously generate a visually protected image. However, [jiang2018end, mishra2021wavelet, xu2020learning] demonstrate that conventional image transformations hardly match DNNs unless the DNN structures are modified. In this section, a DL-based joint encryption and source-channel coding (DJESCC) method is proposed for protecting the visual information of plain images in DJSCC transmission.

III-A System Model

Fig. 4: The system model of the proposed DL based joint encryption and source-channel coding method.

Consider a visually protected DJSCC transmission system as shown in the lower part of Fig. 4. A plain image is represented by $\boldsymbol{x} \in \mathbb{R}^{W \times H \times C}$, where $\mathbb{R}$ denotes the set of real numbers and $W$, $H$, and $C$ denote the width, height, and number of channels of an image, respectively. The encryption network transforms the plain image into a visually protected image. This encryption process can be expressed as:

$$\boldsymbol{x}_e = f_e(\boldsymbol{x}; \boldsymbol{\theta}_e), \qquad (1)$$

where $f_e(\cdot\,; \boldsymbol{\theta}_e)$ represents an encryption deep neural network parameterized by the set of parameters $\boldsymbol{\theta}_e$. Note that the plain image $\boldsymbol{x}$ and the visually protected image $\boldsymbol{x}_e$ have the same size.

Then the visually protected image is encoded by the joint source-channel encoder as:

$$\boldsymbol{z} = f_{sc}(\boldsymbol{x}_e; \boldsymbol{\theta}_{sc}) \in \mathbb{C}^{k}, \qquad (2)$$

where $\mathbb{C}$ denotes the set of complex numbers, $k$ represents the size of the channel input symbols, and $f_{sc}(\cdot\,; \boldsymbol{\theta}_{sc})$ represents a joint source-channel encoder parameterized by the set of parameters $\boldsymbol{\theta}_{sc}$. The encoded complex-valued vector $\boldsymbol{z}$ represents the transmitted signal at the transmitter. The real parts and the imaginary parts of $\boldsymbol{z}$ are treated as the in-phase components I and quadrature components Q of the transmitted signal, respectively. Due to the average power constraint at the transmitter, $\frac{1}{k}\boldsymbol{z}^{\ast}\boldsymbol{z} \le 1$ must be satisfied, where $\boldsymbol{z}^{\ast}$ is the complex conjugate transpose of $\boldsymbol{z}$.
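The average power constraint is typically enforced by a normalization layer at the encoder output. A minimal NumPy sketch (the function name is illustrative; the paper's implementation uses a Keras layer) scales the complex symbol vector so that its average power is exactly one:

```python
import numpy as np

def power_normalize(z):
    # Enforce the average power constraint (1/k) z^H z <= 1 by scaling
    # the complex channel-input symbols to unit average power.
    k = z.size
    return z * np.sqrt(k / np.vdot(z, z).real)  # vdot gives z^H z
```

Because the scaling is differentiable, gradients flow through it during end-to-end training.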

The transmitted signal is corrupted by the wireless channel. We adopt the well-known AWGN model (by applying equalization at the receiver, a flat-fading channel can also be represented as an AWGN model, although the noise then has a different distribution) given by:

$$\hat{\boldsymbol{z}} = \boldsymbol{z} + \boldsymbol{n}, \qquad (3)$$

where $\hat{\boldsymbol{z}}$ is the channel output and $\boldsymbol{n}$ denotes the additive noise modeled by $\boldsymbol{n} \sim \mathcal{CN}(\boldsymbol{0}, \sigma^{2}\boldsymbol{I})$, where $\sigma^{2}$ represents the average noise power and $\mathcal{CN}(\cdot,\cdot)$ denotes a circularly symmetric complex Gaussian distribution.
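The AWGN channel of Eq. (3) can be sketched as follows, assuming unit average signal power after power normalization so that the noise power follows directly from the SNR in dB (function name and the seeded-generator argument are illustrative):

```python
import numpy as np

def awgn_channel(z, snr_db, seed=None):
    # z_hat = z + n, with n ~ CN(0, sigma^2 I).
    # sigma^2 is set from the SNR, assuming unit average signal power.
    rng = np.random.default_rng(seed)
    sigma2 = 10 ** (-snr_db / 10)
    # Split the complex noise power equally between real and imaginary parts.
    n = np.sqrt(sigma2 / 2) * (rng.standard_normal(z.shape)
                               + 1j * rng.standard_normal(z.shape))
    return z + n
```

During training, `snr_db` would be drawn from the training SNR distribution rather than fixed.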

In turn, the channel output symbols are decoded by the joint source-channel decoder as:

$$\hat{\boldsymbol{x}}_e = g_{sc}(\hat{\boldsymbol{z}}; \boldsymbol{\phi}_{sc}), \qquad (4)$$

where $g_{sc}(\cdot\,; \boldsymbol{\phi}_{sc})$ represents a joint source-channel decoder parameterized by the set of parameters $\boldsymbol{\phi}_{sc}$. The decoded image $\hat{\boldsymbol{x}}_e$ has the same size as the visually protected image $\boldsymbol{x}_e$.

Similarly to the encryption step, the decryption network converts the decoded image into the decrypted image as follows:

$$\hat{\boldsymbol{x}} = g_d(\hat{\boldsymbol{x}}_e; \boldsymbol{\phi}_d), \qquad (5)$$

where $g_d(\cdot\,; \boldsymbol{\phi}_d)$ represents a decryption deep neural network parameterized by the set of parameters $\boldsymbol{\phi}_d$, and the decrypted image $\hat{\boldsymbol{x}}$ is an estimate of the plain image. The bandwidth ratio is defined as $R = k/n$, where $n = W \times H \times C$ is the source size (i.e., the image size) and $k$ is the channel bandwidth (i.e., the channel input size).

III-B The Proposed Method

In sharp contrast with DJSCC methods [bourtsoulatze2019deep, kurka2020deepjscc, kurka2021bandwidth, xu2021wireless], we require our DJESCC method to address two issues:

1) To hide visual information of a plain image.

2) To extract effective features from the visually protected image for subsequent DJSCC transmission.

The classical full-reference metric of image similarity is the peak signal-to-noise ratio (PSNR) between the original image and the restored image, defined as:

$$\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}^{2}}{\mathrm{MSE}} \;\, \text{dB}, \qquad (6)$$

where MAX is the maximum possible value of the image pixels and MSE is the mean square error between the original image and the restored image. Although PSNR is not always consistent with quality as perceived by the human visual system, its simplicity and low computational cost make it widely used in the field of image processing [mohammadi2014subjective].
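Eq. (6) translates directly into a few lines of NumPy (for 8-bit images, MAX = 255):

```python
import numpy as np

def psnr(original, restored, max_val=255.0):
    # PSNR = 10 * log10(MAX^2 / MSE), Eq. (6).
    diff = original.astype(np.float64) - restored.astype(np.float64)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Identical images give an infinite PSNR (MSE = 0), so in practice the metric is reported only for distorted reconstructions.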

However, as illustrated in Fig. 5, PSNR is not a good metric for assessing how well visual information is hidden. Fig. 5(b) has a lower PSNR than Fig. 5(c), yet the visual information (e.g., the birds and the leaf) is more easily identified in Fig. 5(b) than in Fig. 5(c). To solve this problem, a feature extraction network is employed to measure the effect of hiding visual information. Feature extraction was initially used to measure the similarity between two images [johnson2016perceptual] and was later used to measure the difference between two images [ito2021image].

Fig. 5: PSNR comparison. (a) Plain image; (b) inverted image (the intensity values of the plain image are subtracted from 255); (c) shuffled image (the intensity values of the plain image are randomly shuffled in the spatial and channel dimensions).

In the training stage, features of the plain image $\boldsymbol{x}$, the encrypted image $\boldsymbol{x}_e$, and the decoded image $\hat{\boldsymbol{x}}_e$ are extracted by the feature extraction network in Fig. 4. The feature loss between the plain image and the encrypted image is expressed as:

$$\mathcal{L}_{fe} = \frac{1}{W_f H_f C_f} \left\| \boldsymbol{F} - \boldsymbol{F}_e \right\|_2^2, \qquad (7)$$

and the feature loss between the plain image and the decoded image is expressed as:

$$\mathcal{L}_{fd} = \frac{1}{W_f H_f C_f} \left\| \boldsymbol{F} - \hat{\boldsymbol{F}}_e \right\|_2^2, \qquad (8)$$

where $\boldsymbol{F}$, $\boldsymbol{F}_e$, and $\hat{\boldsymbol{F}}_e$ are the features of the plain image $\boldsymbol{x}$, the encrypted image $\boldsymbol{x}_e$, and the decoded image $\hat{\boldsymbol{x}}_e$ extracted by the feature extraction network, and $W_f$, $H_f$, and $C_f$ denote the width, height, and number of channels of an extracted feature, respectively. The reconstruction loss between the plain image and the decrypted image is expressed as:

$$\mathcal{L}_{rec} = \frac{1}{W H C} \left\| \boldsymbol{x} - \hat{\boldsymbol{x}} \right\|_2^2, \qquad (9)$$

where $\boldsymbol{x}$ and $\hat{\boldsymbol{x}}$ represent the intensity values of the plain image and the decrypted image, respectively.

Different from image-to-image translation tasks [sirichotedumrong2021gan, johnson2016perceptual], which minimize the feature loss during training, the proposed method maximizes $\mathcal{L}_{fe}$ and $\mathcal{L}_{fd}$ to hide visual information in the encrypted image $\boldsymbol{x}_e$ and the decoded image $\hat{\boldsymbol{x}}_e$. The total loss used to train the proposed method is:

$$\mathcal{L} = \mathcal{L}_{rec} - \lambda_e \mathcal{L}_{fe} - \lambda_d \mathcal{L}_{fd}, \qquad (10)$$

where $\lambda_e$ and $\lambda_d$ are the weights of $\mathcal{L}_{fe}$ and $\mathcal{L}_{fd}$, respectively. Under a given bandwidth ratio $R$, DJESCC learns the parameters of the encryption network, the deep joint source-channel encoder, the joint source-channel decoder, and the decryption network by minimizing the end-to-end distortion as follows:

$$(\boldsymbol{\theta}_e^{\ast}, \boldsymbol{\theta}_{sc}^{\ast}, \boldsymbol{\phi}_{sc}^{\ast}, \boldsymbol{\phi}_d^{\ast}) = \arg\min_{\boldsymbol{\theta}_e, \boldsymbol{\theta}_{sc}, \boldsymbol{\phi}_{sc}, \boldsymbol{\phi}_d} \mathbb{E}_{p(\boldsymbol{x}, \hat{\boldsymbol{x}})} \, \mathbb{E}_{p(\sigma^2)} \left[ \mathcal{L} \right], \qquad (11)$$

where $(\boldsymbol{\theta}_e^{\ast}, \boldsymbol{\theta}_{sc}^{\ast}, \boldsymbol{\phi}_{sc}^{\ast}, \boldsymbol{\phi}_d^{\ast})$ are the optimal parameters, $p(\boldsymbol{x}, \hat{\boldsymbol{x}})$ represents the joint probability distribution of the plain image $\boldsymbol{x}$ and the decrypted image $\hat{\boldsymbol{x}}$, and $p(\sigma^2)$ represents the probability distribution of the average noise power $\sigma^2$. Note that a distribution over the average noise power, rather than a fixed average noise power, is adopted because of storage-overhead considerations and the difficulty of acquiring the average noise power at the image owner/recipient. In addition, the empirical average, instead of the statistical average, is used in the training stage. During the training stage, the proposed DJESCC method learns: 1) an effective way to hide the visual information of the plain image, 2) an image domain from which features are easily extracted for the subsequent DJSCC transmission, 3) an effective DJSCC transmission method, and 4) an effective way to reconstruct the plain image.

After the DJESCC network has been trained, the encryption network and the decryption network are securely distributed to the image owner and the image recipient, respectively, via a security protocol, e.g., the Secure Sockets Layer (SSL) protocol. The deep joint source-channel encoder and the deep joint source-channel decoder are distributed to the DJSCC transmission service provider.

In the test stage, the plain image is first converted to the encrypted image by the image owner using Eq. (1). Then the encrypted image is sent to the DJSCC transmission service provider. DJSCC transmission is executed using Eq. (2) and Eq. (4), yielding the decoded image. The DJSCC transmission service provider sends the decoded image to the image recipient, who uses Eq. (5) to decrypt it. The test stage is illustrated in the lower part of Fig. 4.

IV Experimental Results

Fig. 6: The architecture of the DJSCC network[bourtsoulatze2019deep] adopted in this paper.
Fig. 7: The architecture of the encryption and decryption networks adopted in this paper.

Considering the generality of the DJESCC method, there are multiple architecture choices for the encryption network, the DJSCC encoder, the DJSCC decoder, and the decryption network. To demonstrate the potential of our proposed method, the original DJSCC network architecture of [bourtsoulatze2019deep] is chosen in the subsequent experiments. It is worth noting that our proposed method can also be applied to extensions of the original DJSCC network architecture.

The DJSCC architecture proposed in [bourtsoulatze2019deep] is shown in Fig. 6, whose encoder and decoder networks are adopted in our DJESCC method. The DJSCC encoder consists of a normalization layer, five alternating convolutional and PReLU layers, a reshape layer, and a power normalization layer. The DJSCC decoder consists of a reshape layer, five alternating transposed convolutional and activation layers (i.e., four PReLU layers and one sigmoid layer), and a denormalization layer. The normalization layer converts the input image with pixel values in [0, 255] to an image with pixel values in [0, 1], and the denormalization layer performs the opposite operation. The notation attached to each convolutional/transposed-convolutional layer in Fig. 6 specifies its number of filters, filter size, and down-/up-sampling stride. The power normalization layer enforces the average power constraint at the transmitter. The number of channels of the last convolutional layer in the DJSCC encoder determines the bandwidth ratio of the proposed method. The architecture shown in Fig. 7, a shallow version of U-Net [ronneberger2015u], is employed for both the encryption network and the decryption network. In the training stage, the parameters of the feature extraction network are fixed.

TensorFlow [abadi2016tensorflow] and its high-level API Keras are used to implement the proposed DJESCC method (source code is available at: https://github.com/alexxu1988/DJESCC). The proposed method is trained with the SNR uniformly distributed over [0, 20] dB. The following experiments run on a Linux server with twelve octa-core Intel(R) Xeon(R) Silver 4110 CPUs and sixteen GTX 1080Ti GPUs. Each experiment was assigned six CPU cores and one GPU.

We first evaluate the proposed method on the CIFAR-10 dataset, which consists of 60000 color images in 10 classes, with 6000 images per class. Note that the goal of our proposed method is to generate visually protected images for untrusted transmission channels and to reconstruct the plain image at the receiver, so the class label of each image is not used in the following experiments. The training and test datasets contain 50000 and 10000 images, respectively. We use part of VGG16 [simonyan2014very] with batch normalization, a classical network for classification tasks, as the feature extraction network, pretrained on CIFAR-10. All networks were trained for 500 epochs using the Adam optimizer. Once learning stagnated for 10 epochs, the learning rate was reduced by a factor of 10. The performance of the DJESCC networks was evaluated at specific SNRs in [0, 20] dB on the CIFAR-10 test dataset. To alleviate the randomness caused by the wireless channel, each image in the CIFAR-10 test dataset is transmitted 10 times. PSNR is used to evaluate the reconstruction performance between the plain image and the decrypted image.

Fig. 8: Performance of DJESCC and DJSCC trained on CIFAR-10 training dataset and evaluated on CIFAR-10 test dataset with R=1/6.
Fig. 9: Visually protected images generated by the DJESCC method. The image in the first column is the plain image. The images in the second column are the encrypted images transformed by the image owner. The images in the third column are the decoded images decoded by the DJSCC decoder for dB. The images in the last column are the decrypted images transformed by the image recipient for dB. PSNR and structural similarity index (SSIM) are given under the images. (a) , (b) , (c) , (d) .

Fig. 8 compares the reconstruction performance of the proposed method with different loss weights (e.g., ) at bandwidth ratio . The reconstruction performance of the DJSCC without visual protection, i.e., is also plotted. The reconstruction performance is gradually decreased with the increase of and , as the training promotes protection of visual information instead of reconstruction of the images. Fig. 9 shows the visualization of the plain image, the encrypted images transformed by the image owner, the decoded images decoded by the DJSCC transmission service provider, and the decrypted images transformed by the image recipient for dB with different loss weights and . The plain image comes from CIFAR-10 test dataset. Fig. 9(a) shows that even though the feature losses between the plain image and the encrypted/decoded image are not used, the visual information of the plain image is partially protected in the encrypted image and the decoded image. The outline of the dog can be vaguely identified in the encrypted image in Fig. 9(a), and less visual information is captured in the decoded images than in the encrypted image. With the increase in loss weights and , the visual protection in the encrypted image is enhanced, and the corresponding visual pattern in the decoded image is also changed. In Fig. 9(b) with the loss weights , the outline of the dog can be vaguely identified in the encrypted image. The encrypted image decays to regular black and white lattice with no visual information of the plain image in Fig. 9(c) with the loss weights . As the loss weights increase to , the left part of the encrypted image contains alternated red and blue stripes and the right part of it contains twisted black and white lattice, which is more irregular than the encrypted image in Fig. 9(d). In Fig. 9(b)(c)(d), all of the decoded images with different SNRs can protect visual information of the plain image. 
The proposed method with achieves a better trade-off between reconstruction performance and visual protection performance than the other weight settings.
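The role of the loss weights can be made concrete with a toy sketch. The following is a minimal numpy illustration, not the paper's exact objective: the function name `djescc_loss`, the feature tensors, and the weight names `lam_e`/`lam_d` are hypothetical stand-ins for the (elided) weighted combination of the reconstruction loss and the feature losses between the plain image and the encrypted/decoded images.

```python
import numpy as np

def djescc_loss(x, x_hat, f_plain, f_enc, f_dec, lam_e, lam_d):
    """Hypothetical training objective (a sketch, not the paper's formula).

    Reconstruction MSE plus two weighted "protection" terms: the farther
    the encrypted/decoded features are from the plain-image features, the
    smaller the total loss, so larger lam_e / lam_d push the network to
    hide visual information at the cost of reconstruction quality.
    """
    mse = np.mean((x - x_hat) ** 2)
    prot_e = -np.mean((f_plain - f_enc) ** 2)  # reward feature distance
    prot_d = -np.mean((f_plain - f_dec) ** 2)
    return mse + lam_e * prot_e + lam_d * prot_d
```

With `lam_e = lam_d = 0` the objective reduces to plain DJSCC reconstruction, matching the unprotected baseline in Fig. 8.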

Since our proposed method is the first work that constructs visually protected images for DJSCC transmission, we compare it with two perceptual image encryption methods, i.e., the learnable image encryption (LE) method [tanaka2018learnable] and the pixel-based image encryption (PE) method [sirichotedumrong2019privacy], both designed for the image classification task. The classification accuracies of ResNet-20 [he2016deep] with the LE method and the PE method are 87.02% and 86.99% on the CIFAR-10 test dataset, respectively [ito2021image].

Fig. 10: Performance of DJESCC, DJSCC_LE, and DJSCC_PE evaluated on the CIFAR-10 test dataset. DJESCC is trained on the CIFAR-10 training dataset.
Fig. 11: Comparison of the visually protected images generated by the DJESCC, DJSCC_LE, and DJSCC_PE methods with R=1/6. The image in the first column is the plain image. The images in the second column are the encrypted images transformed by the image owner. The images in the third column are the images decoded by the DJSCC decoder at dB with the different methods. The images in the last column are the decrypted images transformed by the image recipient at dB with the different methods. PSNR and SSIM values are given under the images. (a) DJESCC with , (b) DJSCC_LE, (c) DJSCC_PE.

Fig. 10 compares the DJESCC method with the LE-based DJSCC (DJSCC_LE) method and the PE-based DJSCC (DJSCC_PE) method for bandwidth ratios R=1/6 and R=1/12. The performance of the DJSCC_PE with R=1/6 increases only slightly, to around 11.5 dB, as the SNR increases from 0 dB to 20 dB, which is much lower than that of the DJESCC with and R=1/6. Although the DJSCC_PE with R=1/6 outperforms the DJSCC_LE with R=1/6, its performance is still 5 dB lower than that of the DJESCC at dB, and the gap between the DJSCC_PE and the DJESCC with widens further as the SNR increases. Comparing the methods with R=1/12 shows similar results. As R increases from 1/12 to 1/6, the DJESCC achieves larger gains than the DJSCC_PE and the DJSCC_LE. Fig. 11 shows the corresponding visual performance of the DJESCC, DJSCC_LE, and DJSCC_PE with dB and R=1/6. Unlike the black-and-white lattice characteristic of the encrypted image produced by the DJESCC method, the encrypted images of the DJSCC_LE and DJSCC_PE methods appear as noisy images. However, the pixels in the regions of these encrypted images corresponding to the dog in the plain image show less randomness than the pixels elsewhere. The decoded images of the DJSCC_LE and DJSCC_PE methods are similar to their corresponding encrypted images. The decrypted images of the DJSCC_LE method at dB are disturbed by noisy pixels, while the decrypted images of the DJSCC_PE method at dB look blurred.
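The dB figures above are PSNR values. For reference, PSNR between two images can be computed as follows; this is the standard definition, not code from the paper:

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mse = np.mean((x - y) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

A 5 dB PSNR gap thus corresponds to roughly a 3.2x difference in mean squared error.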

V Visual Security of The Proposed Method

As described in Section I, network transmission is not secure. The visually protected images (the encrypted image generated by the image owner and sent to the DJSCC provider through the Internet, and the decoded image processed by the DJSCC provider and sent to the image recipient) may be leaked or stolen in the wired network. Hence the ciphertext-only attack (CoA), in which the attacker has access to the decoded image and wishes to recover the plain image, must be considered when evaluating the visual security of the proposed method. In this paper, we focus on two CoAs: the feature reconstruction (FR) attack and the generative adversarial network based (GAN-based) attack.

V-a FR Attack

Many visual protection methods process the plain image patch-wise (e.g., the LE method operates on the pixels within a 4×4 block) or pixel-wise (e.g., the PE method processes pixel values without changing their positions). Based on this fact, the FR attack proposed in [chang2020attacks] is employed to reconstruct the edge information of plain images from the encrypted images. The FR attack is described in Algorithm 1, where denotes the floor function.

Input: the 8-bit visually protected image , leading bit
Output: the 8-bit restored image
1:  the 8-bit RGB image is divided into pixels; represents the -th pixel
2:  for i = 0 : 1 :  do
3:     for j = 0 : 1 : 2 do
4:        if  then
5:           
6:        end if
7:     end for
8:  end for
9:  assemble  to form 
10: return 
Algorithm 1: The FR attack method [chang2020attacks]
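As a rough executable companion to Algorithm 1, the sketch below assumes that the elided test on line 4 compares the leading bit floor(v/2^7) of each 8-bit channel value with the chosen leading bit b, and that the elided update on line 5 replaces the value with its complement 255 − v. Both are assumptions about the FR attack of [chang2020attacks], since the exact expressions were lost in extraction.

```python
import numpy as np

def fr_attack(img, b=1):
    """Hedged sketch of the FR (feature reconstruction) attack.

    For every 8-bit channel value whose leading bit differs from b,
    substitute its bitwise complement (255 - v); this undoes a per-pixel
    leading-bit/negative transform and can expose edge structure.
    """
    img = np.asarray(img, dtype=np.int64)
    lead = img // 128  # floor(v / 2^7): the leading bit of each value
    out = np.where(lead != b, 255 - img, img)
    return out.astype(np.uint8)
```

Applied channel-wise over the whole image, this corresponds to the double loop over pixels i and channels j in Algorithm 1.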

V-B GAN-based Attack

The GAN-based attack adopted in [sirichotedumrong2020visual] has shown state-of-the-art attack performance for CoAs. As illustrated in Fig. 12, the generator tries to generate, from encrypted images, images that fool the discriminator, while the discriminator aims to distinguish whether an image fed to it originates from the plain image dataset or from the decrypted images. In the training stage, the training dataset containing plain images is divided into and of the same size. is encrypted by some encryption method. Both and are used to alternately train the generator and the discriminator. Note that paired images are not required in the training stage. As training progresses, the generator gradually produces images that look more and more like plain images from the encrypted images. In other words, the generator decrypts the encrypted images and recovers the visual information.
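The alternating training described above corresponds to the standard unpaired GAN objective. In the formulation below, the symbols $G$, $D$, $\mathrm{Enc}$, and the dataset halves $\mathcal{D}_1$, $\mathcal{D}_2$ are our own notation (the originals were lost in extraction); the generator maps encrypted images toward the plain image distribution:

\[
\min_{G}\max_{D}\;
\mathbb{E}_{x \sim \mathcal{D}_1}\big[\log D(x)\big]
+ \mathbb{E}_{x \sim \mathcal{D}_2}\big[\log\big(1 - D(G(\mathrm{Enc}(x)))\big)\big]
\]

Because $\mathcal{D}_1$ and $\mathcal{D}_2$ are disjoint halves of the training set, no plain/encrypted image pairs are needed, matching the unpaired setting noted above.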

Fig. 12: GAN-based attack.

The generators proposed in [sirichotedumrong2020visual] are used as the generators of the GAN-based attack for the DJSCC_LE method and the DJSCC_PE method, except that a sigmoid activation replaces the tanh activation in the last layer. The transformation architecture shown in Fig. 7 is employed as the generator for the DJESCC method. The discriminator contains three convolutional layers with kernel size 2, stride 2, channel numbers 16, 16, and 32, and leaky ReLU activations, followed by a dense layer with sigmoid activation. The GAN-based attack networks were trained for 600 epochs using the Adam optimizer with a learning rate of 0.0001 and a batch size of 64.

V-C Visual Performance of CoAs

We employ the STL-10 dataset in the experiments for a better visual comparison. Fully convolutional DJSCC architectures can handle images of different sizes in the test phase [bourtsoulatze2019deep], [xu2021wireless]. The DJESCC model is trained on CIFAR-10, which is unknown to the attacker. The LE method and the PE method are sensitive to the image size, so the DJSCC_LE and DJSCC_PE methods were trained on the STL-10 dataset with the SNR uniformly distributed over [0, 20] dB. The FR attack can be applied directly to the cipher images (the encrypted image and the decoded image generated by the DJESCC, DJSCC_LE, or DJSCC_PE), whereas the GAN-based attack must first be trained before being applied to the cipher images. The GAN-based attack was trained and tested on the STL-10 dataset for the DJESCC, DJSCC_LE, and DJSCC_PE, respectively.

The visual information reconstructed by the FR attack and the GAN-based attack with =10 dB is shown in Fig. 13. When attacking the encrypted images with the FR attack, the edge information can be successfully reconstructed against the PE method and vaguely reconstructed against the LE method. However, the black image generated by the FR attack against the DJESCC method shows that the FR attack cannot reconstruct any edge information from the encrypted image generated by the DJESCC method. Similar results are observed when attacking the decoded images with the FR attack, except that the edge information cannot be reconstructed against the PE method because of the low DJSCC transmission quality of the DJSCC_PE method. The GAN-based attack can successfully recover visual information from the encrypted and decoded images protected by the DJSCC_LE method and from the encrypted image protected by the DJSCC_PE method, while little visual information is recovered from the encrypted and decoded images protected by the DJESCC method. Compared with the DJSCC_LE and DJSCC_PE methods, the visually protected images (i.e., the encrypted image and the decoded image) generated by the DJESCC method are more robust against both the FR attack and the GAN-based attack.

Fig. 13: Visual information reconstructed from the encrypted image and the decoded image by the FR attack and the GAN-based attack at =10 dB. The image in the first column is the plain image. The images in the second column are the encrypted image, the FR attack result for the encrypted image, and the GAN-based attack result for the encrypted image. The images in the third column are the decoded image, the FR attack result for the decoded image, and the GAN-based attack result for the decoded image. PSNR and structural similarity index (SSIM) values are given under the images. (a) DJESCC with , (b) DJSCC_LE, (c) DJSCC_PE.

Vi Conclusion

Inspired by image transformation, we have proposed a novel DJESCC method. By applying end-to-end training, the proposed DJESCC method learned two DNNs to transform the images: one transformed the plain image domain into the encrypted image domain for encryption, and the other transformed the DJSCC decoded image domain back to the plain image domain for decryption. In addition, the proposed DJESCC method simultaneously learned an effective DJSCC transmission method in the encrypted domain during the training stage.

Compared with perceptual image encryption methods (e.g., the LE method and the PE method) for DJSCC, the proposed DJESCC method has shown much better reconstruction performance. In addition, the experimental results demonstrated that the proposed DJESCC method is robust to the two types of CoA, i.e., the FR attack and the GAN-based attack. It is worth noting that the proposed DJESCC method is a general method for protecting the visual information of the plain image transmitted via DJSCC. With appropriate modifications, the proposed mechanism can be applied to various DJSCC architectures, e.g., the DJSCC-l [kurka2021bandwidth], the DJSCC-f [kurka2020deepjscc], and the ADJSCC [xu2021wireless].

References