I Introduction
CORONAVIRUS disease 2019 (COVID-19) has caused an ongoing pandemic that has significantly affected everyone's life since it was first reported, with hundreds of thousands of deaths and millions of infections emerging in over 200 countries [1, 2]. As indicated by the World Health Organization (WHO), due to its highly contagious nature and the lack of corresponding vaccines, the most effective way to control the spread of COVID-19 is to maintain social distancing and carry out contact tracing. Hence, early and fast diagnosis of COVID-19 has become essential to control further spreading, so that patients can be hospitalized and receive proper treatment in time.
Since the emergence of COVID-19, reverse transcription polymerase chain reaction (RT-PCR), a viral nucleic acid detection method based on gene sequencing, has been the accepted standard for COVID-19 detection [3]. However, because of the low accuracy of RT-PCR and the limited supply of test kits in many hyperendemic regions or countries, it is challenging to rapidly detect every individual affected by COVID-19 [4, 5]. Therefore, alternative testing methods that are faster and more reliable than RT-PCR are urgently needed to combat the disease.
Since most COVID-19 positive patients were diagnosed with pneumonia, radiological examinations can help detect and assess the disease. Recently, chest computed tomography (CT) has been shown to be efficient and reliable for real-time clinical diagnosis of COVID-19, outperforming RT-PCR in terms of accuracy. Moreover, several deep learning based methods have been proposed for COVID-19 detection using chest CT images [6, 7, 8, 9]. For example, an adaptive feature selection approach was proposed in [10] for COVID-19 detection based on a trained deep forest model. In [11], an uncertainty vertex-weighted hypergraph learning method was designed to distinguish COVID-19 from community acquired pneumonia (CAP) using CT images. However, the routine use of CT, which relies on expensive equipment, takes considerably more time than X-ray imaging and places a massive burden on radiology departments. Compared to CT, X-rays can significantly speed up disease screening, and have hence become a preferred method for disease diagnosis.
Accordingly, deep learning based methods for detecting COVID-19 with chest X-ray (CXR) have been developed and shown to achieve accurate and speedy detection [12, 13]. For instance, COVID-Net [14], a tailored convolutional neural network trained on an open source dataset, was proposed for the detection of COVID-19 cases from CXR. Oh et al. [15] proposed a novel probabilistic gradient-weighted class activation map to enable infection segmentation and detection of COVID-19 on CXR images. Fig. 1 shows three samples from the COVIDx dataset [14], which contains three different classes: normal, pneumonia and COVID-19. However, due to the similar pathological information between pneumonia and COVID-19 in the early stage, the CXR samples may have latent features distributed near the category boundaries, which can easily be misclassified by the hyperplane learned from the limited training data. Moreover, to the best of our knowledge, most existing methods for COVID-19 detection are designed to extract lower-dimensional latent representations, which may not fully capture the statistical characteristics of complex distributions (i.e., non-Gaussian distributions). Furthermore, quantifying uncertainty in COVID-19 detection remains a major and challenging task for doctors, especially in the presence of noise in the training samples (i.e., label noise and image noise).
To address the above problems, we propose a novel deep network architecture, referred to as RCoNet, for robust COVID-19 detection. It contains the following three modules: Deformable mutual Information Maximization (DeIM), Mixed High-order Moment Feature (MHMF) and Multi-expert Uncertainty-aware Learning (MUL):

The Deformable mutual Information Maximization (DeIM) module estimates and maximizes the mutual information (MI) between the input data and the learned high-level representations, which pushes the model to learn discriminative and compact features. We employ deformable convolution layers in this module, which are able to explore disentangled spatial features and mitigate the negative effect of similar samples across different categories.

The Mixed High-order Moment Feature (MHMF) module, inspired by [16], fully explores the benefits of using a mix of high-order moment statistics to better characterize the feature distributions in medical imaging.

The Multi-expert Uncertainty-aware Learning (MUL) module creates multiple parallel dropout networks, each of which can be treated as an expert, to derive a multi-expert diagnosis similar to clinical practice, which improves the prediction accuracy. MUL also quantifies the prediction uncertainty by obtaining the variance in prediction across different experts.

The experimental results show that our proposal achieves state-of-the-art performance in terms of most metrics, both on the open source COVIDx dataset of 15134 original CXR images and on the same dataset in a noisy setting.
The remainder of this paper is organized as follows. In Section II, we review related works on mutual information estimation and uncertainty learning. In Section III, after an overview of our proposed approach, we discuss the main components of RCoNet. In Section IV, we compare our proposed architecture with existing deep learning based methods on a publicly available dataset of CXR images, and on the same dataset under noisy conditions. We also conduct extensive experiments to demonstrate the benefits of DeIM, MHMF and MUL on the performance of the system. Finally, we conclude this paper in Section V.
II Background and Related Works
In this section, we introduce related works on mutual information estimation and uncertainty learning that lay the foundation of this paper.
II-A Mutual Information Estimation
Mutual information (MI), a fundamental concept in information theory, is widely applied to unsupervised feature learning for quantifying the correlation between random variables. MI has been exploited in a wide range of domains and tasks, including biomedical sciences [17], blind source separation (BSS, e.g., independent component analysis [18]), feature selection [19, 20] and causal inference [21]. For example, the object tracking task considered in [22] was treated as a problem of optimizing the mutual information between features extracted from a video with most color information removed and those from the original full-color video. Closely related work presented in [23] considered learning representations to predict cross-modal correspondence by maximizing MI between features from the multi-view encoders and the content of the held-out view. Moreover, Mutual Information Neural Estimation (MINE), proposed in [24], learns a general-purpose estimator of the MI between continuous variables based on dual representations of the KL-divergence, which is scalable, flexible and, most crucially, trainable via back-propagation. Based on MINE, our proposal estimates and maximizes the MI between the CXR image inputs and the corresponding latent representations to improve diagnosis performance.
II-B Uncertainty in Deep Learning
Aiming to combat the significant negative effects of uncertainty in deep neural networks, uncertainty learning has attracted considerable research attention, as it facilitates reliability assessment and helps solve risk-based decision-making problems [25, 26, 27]. In recent years, various frameworks have been proposed to characterize the uncertainty in the model parameters of deep neural networks, referred to as model uncertainty, which arises from the limited size of training data [28, 29] and can be reduced by collecting more training data [26, 30, 31]. Meanwhile, another kind of uncertainty in deep learning, referred to as data uncertainty, measures the noise inherent in the given training data, and hence cannot be eliminated by having more training data [32]. To combat these two kinds of uncertainty, many works on various computer vision tasks, e.g., face recognition [25], semantic segmentation [33], object detection [34] and person re-identification [35], have introduced deep uncertainty learning to improve the robustness of the deep learning model and the interpretability of the discriminant. For the face recognition task in [26], an uncertainty-aware probabilistic face embedding (PFE) was proposed to represent face images as distributions by utilizing data uncertainty. Exploiting the advantages of Bayesian deep neural networks, one recent study [36] leveraged model uncertainty for the analysis and learning of face representations. To our knowledge, our proposal is the first work that utilizes high-order moment statistics and multiple expert networks to estimate uncertainty for COVID-19 detection using CXR images.
III Method
In this section, we introduce the novel RCoNet for robust COVID-19 detection, which incorporates Deformable mutual Information Maximization (DeIM), Mixed High-order Moment Feature (MHMF) and Multi-expert Uncertainty-aware Learning (MUL), as illustrated in Fig. 2. Here $k$ is the number of levels of moment features that are combined in MHMF, and $s$ is the number of expert networks in MUL, which will be further clarified in the sequel. The CXR images are first processed by DeIM, which consists of a stack of deformable convolution layers, extracting discriminative features. The compact features are then fed into the MHMF module to generate high-order moment latent features, reducing the negative effects caused by similar images. The proposed MUL utilizes the learned high-order features to generate the final diagnoses.
III-A Deformable Mutual Information Estimation and Maximization
Due to the similarity between COVID-19 and pneumonia in the latent space, we propose Deformable mutual Information Maximization (DeIM) to extract discriminative and informative features, reducing the negative influence caused by the lack of distinctiveness in the deep features. In particular, we train the model by maximizing the mutual information between the input and the corresponding latent representation.
We use a stack of five convolutional stages, as shown in Fig. 2, to encode inputs into latent representations, which is denoted by a differentiable parametric function $E_\psi$:
$$E_\psi : \mathcal{X} \rightarrow \mathcal{Y}, \tag{1}$$
where $\psi$ denotes the set of all the trainable parameters in these layers, and $\mathcal{X}$ and $\mathcal{Y}$ denote the input and output spaces, respectively.
The detailed architecture of each convolutional stage is presented in Fig. 2. Each stage consists of several convolutional layers, each followed by a batch normalization layer. Note that we employ deformable convolutional layers, which can better extract spatial information of the irregular infected area compared to conventional convolutional layers. More specifically, a regular convolution operates on a predefined rectangular grid from an input image or a set of input feature maps, while a deformable convolution operates on a deformable grid in which each grid point is moved by a learnable offset. For example, the receptive grid $\mathcal{R}$ of a regular convolution with kernel size $3 \times 3$ is fixed and can be given by:
$$\mathcal{R} = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}, \tag{2}$$
while, for deformable convolution, the receptive grid is moved by the learned offsets $\Delta p_n$ and the output is given as follows:
$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n), \tag{3}$$
where $y(p_0)$ denotes the value at location $p_0$ on the output feature map $y$, $p_n$ enumerates the locations in $\mathcal{R}$, $w(p_n)$ represents the weight at location $p_n$ of the kernel, and $x(\cdot)$ is the value at the given location on the input feature map. We can see that with the introduction of the offsets $\Delta p_n$, the receptive grid is no longer fixed to be a rectangle, and instead is deformable.
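As an illustration only, the sampling rule of a deformable convolution at a single output location can be sketched in plain Python. The function names `bilinear_sample` and `deformable_conv_at` are hypothetical, the offsets are passed in by hand rather than predicted by a learned layer, and the use of bilinear interpolation with zero padding for fractional locations is an assumption that is not stated in the text above.

```python
import math

def bilinear_sample(fmap, row, col):
    """Sample a 2-D feature map at a fractional location; points outside the map read 0."""
    h, w = len(fmap), len(fmap[0])
    r0, c0 = math.floor(row), math.floor(col)
    value = 0.0
    for r in (r0, r0 + 1):
        for c in (c0, c0 + 1):
            if 0 <= r < h and 0 <= c < w:
                # Bilinear weight shrinks linearly with the distance to the grid point.
                weight = (1 - abs(row - r)) * (1 - abs(col - c))
                value += weight * fmap[r][c]
    return value

# Regular 3x3 receptive grid, listed in row-major order.
GRID = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

def deformable_conv_at(fmap, kernel, p0, offsets):
    """Sum over the grid of kernel weight times the input sampled at p0 + p_n + offset_n."""
    total = 0.0
    for (dr, dc), (off_r, off_c) in zip(GRID, offsets):
        w = kernel[dr + 1][dc + 1]
        total += w * bilinear_sample(fmap, p0[0] + dr + off_r, p0[1] + dc + off_c)
    return total

# Toy input (made up for this sketch): a 4x4 map with values 4*row + col.
fmap = [[4.0 * r + c for c in range(4)] for r in range(4)]
ones = [[1.0] * 3 for _ in range(3)]
regular = deformable_conv_at(fmap, ones, (1, 1), [(0.0, 0.0)] * 9)
shifted = deformable_conv_at(fmap, ones, (1, 1), [(0.5, 0.5)] * 9)
```

With all offsets set to zero the function reduces to a regular convolution at that location, which is a convenient sanity check.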
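We optimize $E_\psi$ by maximizing the mutual information between the input and the output, i.e., $I(X; Y)$, where $Y = E_\psi(X)$. Computing the precise mutual information requires knowledge of the probability density functions (PDFs) of $X$ and $Y$, which are intractable to obtain in practice. To overcome this issue, Mutual Information Neural Estimation (MINE) proposed in [24] estimates mutual information by using a lower bound based on the Donsker-Varadhan representation [37] of the KL-divergence:
$$I(X;Y) = D_{KL}(\mathbb{J}\,\|\,\mathbb{M}) \ge \hat{I}^{(DV)}_{\omega}(X;Y) = \mathbb{E}_{\mathbb{J}}[T_\omega(x,y)] - \log \mathbb{E}_{\mathbb{M}}[e^{T_\omega(x,y)}], \tag{4}$$
where $\mathbb{J}$ represents the joint probability of $X$ and $Y$, i.e., $\mathbb{J} = P(X,Y)$, and $\mathbb{M}$ denotes the product of the marginal probabilities of $X$ and $Y$, $\mathbb{M} = P(X)P(Y)$. $T_\omega$ denotes a global discriminator modeled by a neural network with parameters $\omega$, which is trained to maximize $\hat{I}^{(DV)}_{\omega}$ to approximate the actual mutual information. Hence, we can simultaneously estimate and maximize $I(X; E_\psi(X))$ by maximizing $\hat{I}^{(DV)}_{\omega}$:
$$(\hat{\omega}, \hat{\psi}) = \arg\max_{\omega,\psi} \hat{I}^{(DV)}_{\omega}(X; E_\psi(X)). \tag{5}$$
Since the encoder and the mutual information estimator are optimized simultaneously with the same objective function, we can share some layers between them, and replace $T_\omega$ with $T_{\omega,\psi}$ to account for this fact.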
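To make the Donsker-Varadhan bound concrete, the following minimal sketch evaluates it from two lists of discriminator scores: one computed on matched (joint) pairs and one on mismatched (product-of-marginals) pairs. The function name `dv_lower_bound` and the score values are hypothetical, and the discriminator itself is not modeled here.

```python
import math

def dv_lower_bound(joint_scores, marginal_scores):
    """Mean score on joint pairs minus the log of the mean exponentiated score on marginal pairs."""
    mean_joint = sum(joint_scores) / len(joint_scores)
    # log-mean-exp, computed stably by factoring out the largest score.
    m = max(marginal_scores)
    mean_exp = sum(math.exp(t - m) for t in marginal_scores) / len(marginal_scores)
    return mean_joint - (m + math.log(mean_exp))

# A discriminator that scores matched pairs higher yields a positive estimate.
informative = dv_lower_bound([2.0, 2.0], [0.0, 0.0])
# A constant discriminator carries no information and yields zero.
uninformative = dv_lower_bound([1.0, 1.0], [1.0, 1.0])
```

In training, this quantity would be maximized with respect to both the discriminator and the encoder parameters by gradient ascent.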
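Since we are primarily interested in maximizing the mutual information rather than estimating its precise value, we can alternatively use a Jensen-Shannon MI estimator (JSD) [38], which offers a more interpretable trade-off:
$$\hat{I}^{(JSD)}_{\omega,\psi}(X; E_\psi(X)) = \mathbb{E}_{\mathbb{P}}\big[-\mathrm{sp}(-T_{\omega,\psi}(x, E_\psi(x)))\big] - \mathbb{E}_{\mathbb{P}\times\tilde{\mathbb{P}}}\big[\mathrm{sp}(T_{\omega,\psi}(x', E_\psi(x)))\big], \tag{6}$$
where $x$ is an input sample of an empirical probability distribution $\mathbb{P}$, $x'$ denotes a fake sample from a distribution $\tilde{\mathbb{P}}$, $T_{\omega,\psi}$ is the discriminator, and $\mathrm{sp}(z) = \log(1 + e^{z})$ is the softplus function. This estimator is illustrated by the DeIM block shown in Fig. 2, which takes the latent representation $E_\psi(x)$, the input sample $x$ and the fake sample $x'$ as input, and outputs the difference between the outputs of the two softplus operations as the estimate of MI.
Another alternative MI estimator is the Noise-Contrastive Estimator (NCE) [39], which is defined as:
$$\hat{I}^{(NCE)}_{\omega,\psi}(X; E_\psi(X)) = \mathbb{E}_{\mathbb{P}}\Big[T_{\omega,\psi}(x, E_\psi(x)) - \mathbb{E}_{\tilde{\mathbb{P}}}\Big[\log \sum_{x'} e^{T_{\omega,\psi}(x', E_\psi(x))}\Big]\Big]. \tag{7}$$
Our experiments show that the NCE estimator outperforms the JSD estimator in some cases, but the two perform quite similarly most of the time.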
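The two alternative estimators can be sketched from precomputed discriminator scores as follows. This is a sketch under stated assumptions rather than the implementation used in the experiments: the function names are hypothetical, and the NCE variant includes the positive score in the normalizing sum, which is a common convention but is not specified in the text above.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def jsd_mi_estimate(pos_scores, neg_scores):
    """Mean of -softplus(-T) on real pairs minus mean of softplus(T) on fake pairs."""
    pos = sum(-softplus(-t) for t in pos_scores) / len(pos_scores)
    neg = sum(softplus(t) for t in neg_scores) / len(neg_scores)
    return pos - neg

def nce_mi_estimate(pos_scores, neg_scores_per_sample):
    """Mean of T(real pair) minus log-sum-exp over the real pair and its fake pairs."""
    total = 0.0
    for pos, negs in zip(pos_scores, neg_scores_per_sample):
        candidates = [pos] + list(negs)  # assumption: positive is part of the sum
        m = max(candidates)
        log_sum_exp = m + math.log(sum(math.exp(c - m) for c in candidates))
        total += pos - log_sum_exp
    return total / len(pos_scores)

# With an uninformative discriminator (all scores zero) both estimates sit at their baseline.
jsd_baseline = jsd_mi_estimate([0.0], [0.0])
nce_baseline = nce_mi_estimate([0.0], [[0.0, 0.0, 0.0]])
```

Both quantities increase as the discriminator assigns higher scores to real pairs and lower scores to fake pairs, which is the property exploited when they are maximized.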
The existing works [40] that implement these estimators use some latent representation of $x$, which is then merged with some randomly generated features to obtain "fake" samples whose distribution $\tilde{\mathbb{P}}$ matches the data distribution $\mathbb{P}$, i.e., $\tilde{\mathbb{P}} = \mathbb{P}$. In contrast, we use the samples from other categories as the "fake" samples, i.e., $\tilde{\mathbb{P}} \neq \mathbb{P}$, instead. For example, if the input is a pneumonia sample, then the fake sample is either a normal or a COVID-19 sample. We note that this can push the learned encoder to derive more distinguishable features for samples from different categories.
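The choice of fake samples described above can be sketched as a small sampling helper. The name `sample_fake` and the toy label list are hypothetical; how fake samples are actually batched during training is not described in the text.

```python
import random

def sample_fake(index, labels, rng=random):
    """Return the index of a sample whose label differs from labels[index]."""
    candidates = [j for j, lab in enumerate(labels) if lab != labels[index]]
    if not candidates:
        raise ValueError("need at least two categories to draw a fake sample")
    return rng.choice(candidates)

labels = ["normal", "pneumonia", "COVID-19", "pneumonia"]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
fake_for_pneumonia = labels[sample_fake(1, labels, rng)]  # "normal" or "COVID-19"
```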
III-B Mixed High-order Moment Feature
The presence of image noise and label noise in CXR datasets may cause the latent representations generated by deep neural networks to be scattered over the entire feature space. To deal with this issue, [25, 26, 35] represent each image as a Gaussian distribution defined by a mean (a standard feature vector) and a variance. However, the deep features of the CXR samples considered in this paper typically follow a complex, non-Gaussian distribution [41, 42], which cannot be fully captured by its first-order (mean) or second-order (variance) statistics.
We seek a better combination of different orders of statistics to more precisely characterize the latent representation of the CXR images. We illustrate the moment features of different orders [16] in Fig. 3, where we plot 350 data points sampled from a distribution that combines three different Gaussian distributions. We can observe that the high-order moment features are more expressive of the statistical characteristics than the low-order ones. More specifically, they capture the shape of the cloud of samples more accurately. Therefore, we include the Mixed High-order Moment Feature (MHMF) module in the proposed model, as shown in Fig. 2, which outputs a combination of high-order moment features with the latent representation as input. This will potentially solve the scattering problem and, more importantly, capture the subtle differences between CXR images of similar categories, i.e., pneumonia and COVID-19 in our case.
We show how to obtain the complicated high-order moment feature in the following. Denote the $r$th order moment feature by $\phi_r(x)$, where $x$ denotes a latent feature map of dimension $C \times H \times W$. Many recent works adopt the Kronecker product to compute high-order moment features [42]. However, calculating the Kronecker product of high-dimensional feature maps is computationally intensive, and hence infeasible for real-world applications. Inspired by [43, 44, 45], we approximate $\phi_r(x)$ by exploiting random projectors, which rely on certain factorization schemes such as Random Maclaurin [46]. We use convolution kernels as the random projectors to estimate the expectations of high-order moment features. That is,
$$\phi_r(x) \approx W_1 x \odot W_2 x \odot \cdots \odot W_r x, \tag{8}$$
where $\odot$ represents the Hadamard (elementwise) product, and $W_1, \ldots, W_r$ are convolution kernels with random weights.
Note that Random Maclaurin produces an estimator that is independent of the input distribution, which causes the estimated high-order moments to contain non-informative high-order moment components. We eliminate these components by learning the weights of the projectors, i.e., the convolution kernels, from the data. Also note that taking the Hadamard product of a number of random projectors may cause the estimated high-order moment features to be similar to the low-order ones. To solve this problem, we estimate the high-order moments recursively instead:
(9) 
Since different order moments capture different informative statistics, we design the MHMF module to keep the estimated moments of different levels of order, as shown in Fig. 2, the output of which is given as:
(10) 
Hence, the output of MHMF is rich enough to capture the complicated statistics and to produce discriminative features for inputs of different categories.
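The idea behind the Random Maclaurin factorization mentioned in this subsection can be checked numerically on a toy example: the product of $r$ independent random sign projections of a vector is a feature whose expected inner product reproduces the $r$th order polynomial similarity. The sketch below is not the MHMF module itself. It uses fixed Rademacher projectors on plain vectors instead of learned convolution kernels on feature maps, all names are hypothetical, and the expectation is computed exactly by enumerating every projector, which is feasible only for tiny inputs.

```python
from itertools import product

def projected_moment(x, projectors):
    """Product of r projections <w_j, x>, one per projector."""
    value = 1.0
    for w in projectors:
        value *= sum(wi * xi for wi, xi in zip(w, x))
    return value

def expected_feature_product(x, y, order):
    """Exact average of Z(x) * Z(y) over all choices of `order` Rademacher projectors."""
    signs = list(product((-1.0, 1.0), repeat=len(x)))
    total, count = 0.0, 0
    for projectors in product(signs, repeat=order):
        total += projected_moment(x, projectors) * projected_moment(y, projectors)
        count += 1
    return total / count

x, y = [1.0, 2.0], [0.5, 1.0]            # made-up vectors with inner product 2.5
second = expected_feature_product(x, y, 2)  # matches 2.5 ** 2
third = expected_feature_product(x, y, 3)   # matches 2.5 ** 3
```

In practice only a finite number of random projectors is drawn, so the identity holds approximately, which is one motivation for learning the projector weights from data as described above.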
III-C Multi-expert Uncertainty-aware Learning
The MHMF module, as described in Section III-B, generates mixed high-order moment features of each sample in the latent space, which we aim to further exploit to derive compact and disentangled information for COVID-19 detection. Meanwhile, quantifying uncertainty in disease detection is important for understanding the confidence level of computer-based diagnoses. Motivated by clinical practice, we present a novel neural network in this section, referred to as Multi-expert Uncertainty-aware Learning (MUL), which takes in the mixed high-order moment features and outputs the prediction and a quantification of the diagnostic uncertainty caused by the noise in the data.
The structure of the Multi-expert Uncertainty-aware Learning module is shown in Fig. 2. It consists of multiple dropout layers that process the output from MHMF in parallel, each of which, together with the following fully connected layers, can be regarded as an expert for COVID-19 detection. We note that each dropout layer uses a different mask, which results in a different subset of the latent information being kept, while the following fully connected layers share the same weights across different experts. The masks for the dropout layers are generated randomly at each iteration during training, but are fixed at inference time. We denote the input-output function of each expert by $f_i(\cdot)$, $i = 1, \ldots, s$, where $s$ is the total number of experts. Hence, the classification loss of the $i$th expert is given as follows:
$$\mathcal{L}_i = \frac{1}{N} \sum_{n=1}^{N} \ell\big(f_i(\Phi(E_\psi(x_n))),\, y_n\big), \tag{11}$$
where $N$ represents the total number of labeled CXR samples, $y_n$ denotes the one-hot representation of the class label of the $n$th sample $x_n$, $n = 1, \ldots, N$, and we recall that $\Phi(\cdot)$ denotes the MHMF operation given in Eq. (10) and $E_\psi(\cdot)$ is the preprocessing step on the CXR samples. Note that the total number of COVID-19 cases is much smaller than that of non-COVID cases, i.e., normal and pneumonia cases. This imbalance in the dataset leads to a high ratio of false-negative classifications. To mitigate this negative effect, we employ a weighted cross-entropy given as follows:
$$\ell(\hat{y}, y) = -\sum_{m=1}^{M} \beta_m\, y^{(m)} \log \hat{y}^{(m)}, \tag{12}$$
where $M$ is the total number of classes, $y^{(m)}$ is the $m$th element of $y$, and $\hat{y}^{(m)}$ denotes the corresponding prediction. $\beta_m$ represents the weight that controls how much the error on class $m$ contributes to the loss, $m = 1, \ldots, M$. Finally, the loss of the whole MUL module is derived by averaging the loss values of all the experts:
$$\mathcal{L}_{MUL} = \frac{1}{s} \sum_{i=1}^{s} \mathcal{L}_i. \tag{13}$$
We use the variance of the classification loss with regard to the average loss to quantify the uncertainty, denoted by $\sigma^2$, which is given as:
$$\sigma^2 = \frac{1}{s} \sum_{i=1}^{s} \big(\mathcal{L}_i - \mathcal{L}_{MUL}\big)^2. \tag{14}$$
The proposed MUL module improves the diagnostic accuracy, as the final prediction combines the results from multiple experts, and also mitigates the negative effects caused by the noise in the data by introducing the dropout layers. Moreover, our experiments reveal that the more experts the MUL module contains, the faster the system converges during training.
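The aggregation performed by MUL can be sketched from the experts' predicted class probabilities: a class-weighted cross-entropy per expert, the average over experts as the module loss, and the spread of the expert losses around that average as the uncertainty. This is a minimal sketch under assumptions: the function names, the toy probabilities and the class weights are made up, and the population variance (division by the number of experts) is one possible convention.

```python
import math

def weighted_cross_entropy(probs, target, class_weights):
    """Class-weighted cross-entropy for one sample with a one-hot target."""
    return -sum(b * y * math.log(p)
                for b, y, p in zip(class_weights, target, probs) if y > 0)

def mul_loss_and_uncertainty(expert_probs, targets, class_weights):
    """Return (mean loss over experts, variance of the expert losses around that mean)."""
    expert_losses = []
    for probs_per_sample in expert_probs:  # one list of predictions per expert
        losses = [weighted_cross_entropy(p, t, class_weights)
                  for p, t in zip(probs_per_sample, targets)]
        expert_losses.append(sum(losses) / len(losses))
    mean_loss = sum(expert_losses) / len(expert_losses)
    variance = sum((l - mean_loss) ** 2 for l in expert_losses) / len(expert_losses)
    return mean_loss, variance

targets = [[0, 0, 1]]                      # one sample whose true class is the third one
weights = [1.0, 1.0, 2.0]                  # hypothetical weights favoring the rare class
agree = [[[0.25, 0.25, 0.5]], [[0.25, 0.25, 0.5]]]
disagree = [[[0.25, 0.25, 0.5]], [[0.1, 0.1, 0.8]]]
loss_agree, var_agree = mul_loss_and_uncertainty(agree, targets, weights)
loss_disagree, var_disagree = mul_loss_and_uncertainty(disagree, targets, weights)
```

When all experts agree the variance is zero, and it grows as their predictions diverge, which matches the interpretation of the variance as a disagreement measure.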
III-D Training
The whole architecture of RCoNet is presented in Fig. 2, where the CXR images are first processed by a stack of deformable convolution layers, then transformed to high-order moment latent features by the MHMF module, which are then fed to the MUL module to generate the final diagnoses. The loss used to optimize RCoNet is given as follows:
$$\mathcal{L} = \mathcal{L}_{MUL} - \lambda\, \hat{I}(X; E_\psi(X)), \tag{15}$$
where $\mathcal{L}_{MUL}$ is the prediction loss given by Eq. (13), and $\hat{I}(X; E_\psi(X))$ denotes the mutual information between the input and the latent representation estimated by either Eq. (6) or Eq. (7). $\lambda$ is a positive hyperparameter that governs how much $\mathcal{L}_{MUL}$ and $\hat{I}(X; E_\psi(X))$ contribute to the total loss. During training, the trainable parameters of the whole system are updated iteratively to minimize $\mathcal{L}$, which jointly minimizes the prediction loss, thus improving the accuracy, and maximizes the mutual information $\hat{I}(X; E_\psi(X))$.
IV Experiments and Results
IV-A Dataset
We use a public chest X-ray dataset, referred to as COVIDx, to evaluate the proposed model, which is published by the authors of COVID-Net [14]. This dataset contains a total of 13975 CXR images from 13870 patients of 3 classes: (a) normal (no infections); (b) pneumonia (non-COVID-19 pneumonia); (c) COVID-19. It contains samples from five open source data repositories (https://github.com/lindawangg/COVID-Net/blob/master/docs/COVIDx.md). Three random CXR samples of these three classes are shown in Fig. 1. To reduce the negative effect caused by extremely unbalanced training samples, i.e., the very limited number of COVID-19 positive cases compared to the other two categories, we further include other open-source CXR datasets from https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data. Following [14, 47], the dataset is finally divided into 13624 training and 1510 test samples. The numbers of samples from different categories used for training and testing are summarized in Table I. Moreover, we also adopt various data augmentation techniques to generate more COVID-19 training samples, such as flipping, translation and rotation by five randomly chosen angles, to tackle the data imbalance issue so that the proposed model can learn an effective mechanism for detecting COVID-19.
Number of Patients Per Class
Data  Normal  Pneumonia  COVID-19  Total Patients
Train  7966  5451  207  13624
Test  885  594  31  1510
IV-B Evaluation Metrics
In our experiments, we use the following six metrics to evaluate the COVID-19 detection performance of different approaches:

Accuracy (ACC): the proportion of images that are correctly identified. $ACC = \frac{TP + TN}{TP + TN + FP + FN}$.

Sensitivity (SEN): the ratio of the positive cases that have been correctly detected to all the positive cases. $SEN = \frac{TP}{TP + FN}$.

Specificity (SPE): the ratio of the negative cases that have been correctly classified to all the negative cases. $SPE = \frac{TN}{TN + FP}$.

Balance (BAC): the mean value of SEN and SPE. $BAC = \frac{SEN + SPE}{2}$.

Positive Predictive Value (PPV): the ratio of correctly detected positive cases to all cases that are detected to be positive. $PPV = \frac{TP}{TP + FP}$.

F1-score (F1): combines precision (PPV) and sensitivity into a balanced average. $F1 = \frac{2 \cdot PPV \cdot SEN}{PPV + SEN}$.
$TN$, $TP$, $FN$, and $FP$ represent the total number of true negatives, true positives, false negatives, and false positives, respectively.
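The six metrics can be computed directly from the four confusion counts, as in the short sketch below. The function name `detection_metrics` and the counts are made up for illustration, and treating a three-class problem as one class against the rest when counting positives and negatives is an assumption.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute ACC, SEN, SPE, BAC, PPV and F1 from confusion counts."""
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    ppv = tp / (tp + fp)
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SEN": sen,
        "SPE": spe,
        "BAC": (sen + spe) / 2,
        "PPV": ppv,
        "F1": 2 * ppv * sen / (ppv + sen),
    }

# Hypothetical counts: 50 positive and 50 negative test cases.
metrics = detection_metrics(tp=40, tn=45, fp=5, fn=10)
```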
IV-C Compared Methods
We compare the proposed RCoNet with the following five existing deep learning methods for COVID-19 detection:

PbCNN [15]: A patch-based convolutional neural network with a relatively small number of trainable parameters.

COVID-Net [14]: A tailored deep convolutional neural network that uses a projection-expansion-projection design pattern.

DenseNet-121 [48]: A densely connected convolutional network that connects each layer to every other layer in a feed-forward fashion.

ReCoNet [47]: A residual image-based COVID-19 detection network that exploits a CNN-based multi-level preprocessing filter block and a multi-task learning loss.
Training Data  Clean  Noise  Total
Normal  7170  796 (Pneumonia+COVID-19)  7966
Pneumonia  4906  545 (COVID-19+Normal)  5451
COVID-19  187  20 (Pneumonia+Normal)  207
Method  ACC (%)  SEN (%)  SPE (%)  BAC (%)  PPV (%)  F1 (%)  Param (M)  FLOPs (G)
PbCNN [15]  88.90±1.63  85.90±1.69  96.40±2.10  91.15±1.31  88.65±1.52  87.37±2.14  11.60  -
COVID-Net [14]  95.10±1.34  91.37±1.37  95.76±2.04  93.57±0.89  94.73±0.97  93.20±0.85  117.4  15.10
DenseNet-121 [48]  97.40±1.67  96.08±0.88  97.23±1.01  96.66±1.21  96.05±1.00  96.74±1.04  7.61  5.59
CoroNet [49]  95.00±1.58  96.90±1.57  97.50±1.93  97.20±1.07  95.00±1.03  95.60±0.95  33.00  -
ReCoNet [47]  97.48±1.05  97.39±1.67  97.53±1.28  97.46±0.87  97.17±0.76  97.43±0.59  2.52  7.68
RCoNet  96.12±0.33  95.71±0.41  96.38±0.29  96.05±0.20  95.86±0.62  95.91±0.56  .73  7.61
RCoNet  96.78±0.57  96.48±0.69  96.91±0.74  96.70±0.34  96.94±0.53  96.63±0.58  6.74  7.70
RCoNet  97.46±0.43  97.25±0.79  97.62±0.40  97.44±0.82  97.59±0.91  97.35±0.38  6.75  7.79
RCoNet  97.89±0.53  97.33±0.45  98.24±0.39  97.79±0.62  97.93±0.74  97.61±0.48  6.77  7.91
RCoNet  97.50±0.62  97.76±0.87  97.18±0.63  97.47±0.73  97.10±0.91  97.63±0.71  6.77  8.00
IV-D Implementation
We implement our RCoNet using the PyTorch library and apply ResNeXt [50] as the backbone network. We train the model using the Adam optimizer with weight decay. All the experiments are run on an NVIDIA GeForce GTX 1080Ti GPU. We set the batch size to 8 and resize all images to a fixed resolution. The hyperparameter $\lambda$ in the loss function given in Eq. (15) is tuned within a fixed range. The drop rate of each dropout layer in the MUL module is randomly chosen from a predefined set. The loss weight for each category, which is used to calculate the weighted sum of the loss as given in Eq. (12), is set separately for the normal, pneumonia and COVID-19 samples, according to the number of training samples in each. We adopt 5-fold cross-validation: we randomly divide the training set into five equal-size subsets and train the model five times, each time using a different four subsets for training and the remaining one for validation. We also evaluate our proposed model with different numbers of moment orders $k$ for the MHMF module and different numbers of experts $s$.
To evaluate the performance of the proposed model in the presence of label noise, we derive a noisy dataset from the given dataset in the following way: we randomly select a given percentage of training samples in each category, and assign wrong labels to these samples. In particular, to ensure that the fake COVID-19 samples are fewer than the real ones, we assign the COVID-19 label to selected normal and pneumonia samples in such a way that the number of normal and pneumonia samples assigned the COVID-19 label equals the number of COVID-19 samples assigned either the normal or the pneumonia label. We show a realization of the derived noisy dataset when the percentage of fake samples is set to 10% in Table II.
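One possible reading of the label-noise procedure described above is sketched below. The function name `inject_label_noise`, its parameters and the toy label counts are hypothetical, and details such as how the relabeled non-COVID samples are split between the normal and pneumonia classes are assumptions, since the text only fixes their total number. The sketch also assumes exactly three classes.

```python
import random

def inject_label_noise(labels, percent, covid="COVID-19", seed=0):
    """Flip `percent` percent of the labels in each class, balancing COVID-19 relabels."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    others = [c for c in classes if c != covid]
    noisy = list(labels)
    picked = {}
    for c in classes:
        idx = [i for i, lab in enumerate(labels) if lab == c]
        picked[c] = rng.sample(idx, len(idx) * percent // 100)
    # Selected COVID-19 samples receive one of the other two labels.
    for i in picked.get(covid, []):
        noisy[i] = rng.choice(others)
    # The same number of selected non-COVID samples receive the COVID-19 label;
    # the remaining selected samples are swapped to the other non-COVID class.
    budget = len(picked.get(covid, []))
    pool = [i for c in others for i in picked[c]]
    rng.shuffle(pool)
    for n, i in enumerate(pool):
        if n < budget:
            noisy[i] = covid
        else:
            noisy[i] = rng.choice([c for c in others if c != labels[i]])
    return noisy

# Toy label list (made up): 100 normal, 60 pneumonia, 20 COVID-19 samples.
labels = ["normal"] * 100 + ["pneumonia"] * 60 + ["COVID-19"] * 20
noisy = inject_label_noise(labels, percent=10)
```

Using an integer percentage with floor division avoids floating-point rounding surprises when computing how many samples to select per class.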
Noise (%)  Method  ACC (%)  SEN (%)  SPE (%)
10  PbCNN [15]  83.22  81.98  89.01
10  COVID-Net [14]  91.03  87.94  90.62
10  DenseNet-121 [48]  91.97  87.94  92.17
10  CoroNet [49]  89.45  88.74  90.06
10  ReCoNet [47]  91.63  90.82  91.16
10  RCoNet  92.78  92.21  93.51
10  RCoNet  92.98  93.39  93.12
10  RCoNet  92.01  91.41  92.76
20  PbCNN [15]  78.42  75.90  80.29
20  COVID-Net [14]  82.51  82.77  81.95
20  DenseNet-121 [48]  82.16  81.01  82.21
20  CoroNet [49]  82.33  81.10  81.89
20  ReCoNet [47]  83.26  82.72  83.17
20  RCoNet  84.18  84.56  85.79
20  RCoNet  84.30  84.01  85.99
20  RCoNet  84.34  83.96  85.21
30  PbCNN [15]  67.76  66.47  70.61
30  COVID-Net [14]  71.98  70.13  71.55
30  DenseNet-121 [48]  72.74  72.36  72.96
30  CoroNet [49]  71.87  72.02  71.54
30  ReCoNet [47]  73.26  72.53  73.11
30  RCoNet  74.56  74.20  75.54
30  RCoNet  74.69  74.51  76.94
30  RCoNet  74.88  74.37  75.21
IV-E Results and Discussions
Performance on Clean Data: The numerical results on the clean dataset without any artificial noise added are shown in Table III. The results are presented in the form of mean ± variance of each metric over five independent experiments. We can see that RCoNet achieves notable performance improvement over the comparison methods in terms of most metrics considered, including ACC, SPE, BAC, PPV and F1 score. We note that the performance of RCoNet on individual metrics can be further improved with a different set of $k$ and $s$. For instance, the last RCoNet configuration in Table III achieves better SEN and F1 score than the configuration with the highest ACC. The higher ACC and F1 score validate that RCoNet is able to obtain latent features, i.e., the mixed moment features of different levels of order, that maintain inter-class separability and intra-class compactness better than other models. Note that RCoNet can reach a higher SEN than all other methods, which is particularly important for COVID-19 detection, since successfully detecting COVID-19 positive cases is the key to controlling the spread of this highly contagious disease. Moreover, it can be observed that RCoNet has a smaller variance compared to the others, which demonstrates the robustness and stability of our model.
We also evaluate the complexity of the proposed model in terms of the number of parameters and the computational cost, i.e., floating-point operations (FLOPs), which are presented in Table III. It can be observed that the proposed model has far fewer parameters than several existing methods, except ReCoNet. However, we note that the FLOPs of RCoNet are quite close to those of ReCoNet, which means that these two models take a similar amount of time to diagnose COVID-19 from CXR images. We can also observe that increasing $k$ and $s$, i.e., the number of mixed moment features and the number of experts in MUL, causes only a small, or even negligible, increase in the number of parameters and FLOPs, which suggests that we can improve the performance of the proposed model by optimizing $k$ and $s$ without concern about a significant increase in complexity.
Performance on Noisy Data: We further compare the proposed model to the existing ones when there is noise present in the training dataset. We generate three noisy training datasets in the aforementioned way from the clean dataset, with 10%, 20% and 30% of the samples assigned wrong labels, respectively. The results, averaged over five independent experiments, are presented in Table IV. It can easily be seen that the more fake samples we add, the more the performance of all the methods degrades. Note that the proposed RCoNet still achieves state-of-the-art results in all considered cases with different percentages of noisy samples in the training dataset. Moreover, the performance gain over the existing methods slightly increases with the ratio of noisy samples, verifying that our model is more robust to the noise. Note that the extreme case of 30% noisy samples leads to great performance degradation for all the models. In practice, the percentage of label noise is usually smaller than in this extreme case. We present the confusion matrices in Fig. 4 to summarize the prediction accuracy for different categories. We can observe that, even with a very limited number of COVID-19 samples, our model still maintains high accuracy in detecting COVID-19 cases, even in the presence of noisy samples.
Uncertainty Estimation: One remarkable advantage of our model is the ability to quantify the uncertainty in the final prediction, which is crucial for COVID-19 detection. This is done by obtaining the variance in the output of different experts in MUL, as described in Section III-C. The larger the variance is, the more the different experts disagree with each other and, hence, the more uncertain the model is about the final prediction. We present two CXR samples in Fig. 6, including the predictions and the corresponding uncertainty levels given by RCoNet. We can see that the correctly classified CXR image has a low uncertainty level about its prediction, i.e., 0.0094, while the misclassified CXR sample has a high uncertainty level, i.e., 0.4792, which suggests that an alternative way of diagnosis should be sought to correct this prediction. This greatly improves the reliability of the prediction by RCoNet, and reduces the chance of misdiagnosis. We also show in Fig. 7 the average uncertainty levels of RCoNet trained on clean and noisy datasets with different ratios of noisy samples. It can be observed that the uncertainty level increases almost linearly with the percentage of noisy samples in the dataset, which highlights the negative impact of noise on model training.
IV-F Analysis
In this section, we numerically analyse the benefits of the three key modules of RCoNet, i.e., DeIM, MHMF and MUL.
Table V: Accuracy (%) of RCoNet on the COVIDx dataset for different values of k and s.
RCoNet   s=1    s=2    s=3    s=4    s=5    s=6    s=7
k=1      95.4   95.7   95.9   96.1   96.1   96.0   95.8
k=2      96.3   96.4   96.6   96.8   96.8   96.7   96.4
k=3      97.2   97.2   97.3   97.5   97.4   97.3   97.3
k=4      97.4   97.6   97.8   97.9   97.9   97.7   97.5
k=5      97.2   97.3   97.3   97.4   97.5   97.5   97.3
k=6      96.8   97.0   97.0   97.1   97.0   96.9   96.9
Effectiveness of DeIM: We use the t-SNE method [51] to visualize the latent features generated by the bottleneck layers of the baseline model, i.e., ResNeXt, of RCoNet, and of three variants of RCoNet: (a) RCoNetD, a model that contains only DeIM; (b) RCoNetM, a model that contains only MUL; (c) RCoNetDM, a model that contains DeIM and MUL but not MHMF. The visualizations are presented in Fig. 5. Comparing the latent feature distribution of the baseline model in Fig. 5(a) with that of RCoNetD in Fig. 5(b), we can see that introducing DeIM leads to better class separation in the latent space.
Effectiveness of MHMF: We can observe in Fig. 5(a)-(d) that the latent features of the COVID19 samples generated by the models without MHMF always lie around the category boundary and are not well separated from those of some pneumonia samples. In contrast, the latent feature distributions in Fig. 5(e)-(h), derived by the models with MHMF, show clear separability between the different categories, which implies that MHMF can extract discriminative features. We also report in Table V the accuracy of RCoNet, trained and tested on the COVIDx dataset, for different values of k, i.e., the number of levels of the moment features to be mixed, and s, i.e., the number of experts. We can observe that, for a given value of s, the accuracy first increases with k but decreases once k is larger than 4. This demonstrates that including more levels of moment features can improve the model performance. However, overly high-order moments may lead to performance degradation, possibly because these features are not useful for COVID19 detection.
Effectiveness of MUL: From Table V, we observe that, for a given value of k, the accuracy first increases with s but saturates at around s = 4 or 5. This implies that having more experts in MUL can increase the prediction accuracy, but a large number of experts is not necessary.
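The trends read off Table V can be checked directly from its entries. The sketch below is not part of the experimental code; it only re-encodes the table (rows indexed by k, columns by s, as in the table header) and looks up the best-performing settings. The helper names `best_settings` and `best_k_per_s` are hypothetical.

```python
# Accuracy (%) from Table V: keys are k = 1..6, list positions are s = 1..7.
TABLE_V = {
    1: [95.4, 95.7, 95.9, 96.1, 96.1, 96.0, 95.8],
    2: [96.3, 96.4, 96.6, 96.8, 96.8, 96.7, 96.4],
    3: [97.2, 97.2, 97.3, 97.5, 97.4, 97.3, 97.3],
    4: [97.4, 97.6, 97.8, 97.9, 97.9, 97.7, 97.5],
    5: [97.2, 97.3, 97.3, 97.4, 97.5, 97.5, 97.3],
    6: [96.8, 97.0, 97.0, 97.1, 97.0, 96.9, 96.9],
}


def best_settings(table):
    """Return the highest accuracy and every (k, s) pair that attains it."""
    best = max(acc for row in table.values() for acc in row)
    pairs = [
        (k, s)
        for k, row in table.items()
        for s, acc in enumerate(row, start=1)
        if acc == best
    ]
    return best, pairs


def best_k_per_s(table):
    """For each column s, return the k with the highest accuracy."""
    num_s = len(next(iter(table.values())))
    return {
        s: max(table, key=lambda k: table[k][s - 1])
        for s in range(1, num_s + 1)
    }


print(best_settings(TABLE_V))
print(best_k_per_s(TABLE_V))
```

For every value of s the best row is k = 4, and the overall maximum of 97.9 is reached at both s = 4 and s = 5, which matches the rise-then-fall pattern along k and the saturation along s.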
Parameter Sensitivity and Convergence: We evaluate how sensitive the model performance, in terms of accuracy, is to the value of . We show in Fig. 8 the average accuracy over five independent experiments of RCoNet trained on datasets with different ratios of noisy samples. As we can see, a larger , which means that the prediction loss, i.e., , contributes less to the total loss, does not necessarily degrade the accuracy. This means that maximizing the mutual information between the input and the latent features can preserve useful information within the latent features, thus improving the prediction accuracy. We also show the learning curves of the different models in Fig. 9, which indicate that RCoNet converges slightly faster than the others, including COVIDNet, ReCoNet and CoroNet.
V Conclusions
In this paper, we proposed a novel deep network model, named RCoNet, for robust COVID19 detection, which contains three key components, i.e., Deformable mutual Information Maximization (DeIM), Mixed High-order Moment Feature (MHMF) and Multi-expert Uncertainty-aware Learning (MUL). DeIM simultaneously estimates and maximizes the mutual information between the input data and the latent representations to obtain category separability in the latent space. MHMF overcomes the limited expressive capability of low-order statistics by combining both low-order and high-order moment features to extract more informative and discriminative features. MUL generates the final diagnosis and the uncertainty estimate by combining the outputs of multiple parallel dropout networks, each acting as an expert. We numerically validated that the proposed RCoNet, trained on either the public COVIDx dataset or a noisy version of it, outperforms the existing methods in terms of all the metrics considered. We note that these three modules can be easily integrated into other frameworks for different tasks.
References
 [1] K. Zhang, X. Liu, J. Shen, Z. Li, Y. Sang, X. Wu, Y. Zha, W. Liang, C. Wang, K. Wang et al., “Clinically applicable ai system for accurate diagnosis, quantitative measurements, and prognosis of covid19 pneumonia using computed tomography,” Cell, 2020.
 [2] Z. Han, B. Wei, Y. Hong, T. Li, J. Cong, X. Zhu, H. Wei, and W. Zhang, “Accurate screening of covid19 using attention based deep 3d multiple instance learning,” IEEE Transactions on Medical Imaging, 2020.

 [3] X. Mei, H.C. Lee, K.y. Diao, M. Huang, B. Lin, C. Liu, Z. Xie, Y. Ma, P. M. Robson, M. Chung et al., “Artificial intelligence–enabled rapid diagnosis of patients with covid19,” Nature Medicine, pp. 1–5, 2020.
 [4] W. Xie, C. Jacobs, J.P. Charbonnier, and B. van Ginneken, “Relational modeling for robust and efficient pulmonary lobe segmentation in ct scans,” IEEE Transactions on Medical Imaging, 2020.
 [5] X. Ouyang, J. Huo, L. Xia, F. Shan, J. Liu, Z. Mo, F. Yan, Z. Ding, Q. Yang, B. Song et al., “Dualsampling attention network for diagnosis of covid19 from community acquired pneumonia,” IEEE Transactions on Medical Imaging, 2020.
 [6] H. X. Bai, R. Wang, Z. Xiong, B. Hsieh, K. Chang, K. Halsey, T. M. L. Tran, J. W. Choi, D.C. Wang, L.B. Shi et al., “Ai augmentation of radiologist performance in distinguishing covid19 from pneumonia of other etiology on chest ct,” Radiology, p. 201491, 2020.
 [7] A. A. Ardakani, A. R. Kanafi, U. R. Acharya, N. Khadem, and A. Mohammadi, “Application of deep learning technique to manage covid19 in routine clinical practice using ct images: Results of 10 convolutional neural networks,” Computers in Biology and Medicine, p. 103795, 2020.
 [8] H. Kang, L. Xia, F. Yan, Z. Wan, F. Shi, H. Yuan, H. Jiang, D. Wu, H. Sui, C. Zhang et al., “Diagnosis of coronavirus disease 2019 (covid19) with structured latent multiview representation learning,” IEEE transactions on medical imaging, 2020.
 [9] D.P. Fan, T. Zhou, G.P. Ji, Y. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, “Infnet: Automatic covid19 lung infection segmentation from ct images,” IEEE Transactions on Medical Imaging, 2020.
 [10] L. Sun, Z. Mo, F. Yan, L. Xia, F. Shan, Z. Ding, W. Shao, F. Shi, H. Yuan, H. Jiang et al., “Adaptive feature selection guided deep forest for covid19 classification with chest ct,” arXiv preprint arXiv:2005.03264, 2020.
 [11] D. Donglin, S. Feng, Y. Fuhua, X. Liming, M. Zhanhao, D. Zhongxiang, S. Fei, L. Shengrui, W. Ying, S. Ying, H. Miaofei, G. Yaozong, S. He, G. Yue, and S. Dinggang, “Hypergraph learning for identification of covid19 with ct imaging,” 2020.
 [12] Z. Y. Zu, M. D. Jiang, P. P. Xu, W. Chen, Q. Q. Ni, G. M. Lu, and L. J. Zhang, “Coronavirus disease 2019 (covid19): a perspective from china,” Radiology, p. 200490, 2020.
 [13] M. Siddhartha and A. Santra, “Covidlite: A depthwise separable deep neural network with white balance and clahe for detection of covid19,” arXiv preprint arXiv:2006.13873, 2020.
 [14] L. Wang and A. Wong, “Covidnet: A tailored deep convolutional neural network design for detection of covid19 cases from chest xray images,” arXiv preprint arXiv:2003.09871, 2020.
 [15] Y. Oh, S. Park, and J. C. Ye, “Deep learning covid19 features on cxr using limited training data sets,” IEEE Transactions on Medical Imaging, 2020.
 [16] E. Pauwels and J. B. Lasserre, “Sorting out typicality with the inverse moment matrix sos polynomial,” in Advances in Neural Information Processing Systems, 2016, pp. 190–198.
 [17] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens, “Multimodality image registration by maximization of mutual information,” IEEE transactions on Medical Imaging, vol. 16, no. 2, pp. 187–198, 1997.
 [18] A. Hyvärinen and E. Oja, “Independent component analysis: algorithms and applications,” Neural networks, vol. 13, no. 45, pp. 411–430, 2000.
 [19] N. Kwak and C.H. Choi, “Input feature selection by mutual information based on parzen window,” IEEE transactions on pattern analysis and machine intelligence, vol. 24, no. 12, pp. 1667–1671, 2002.
 [20] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of maxdependency, maxrelevance, and minredundancy,” IEEE Transactions on pattern analysis and machine intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
 [21] A. J. Butte and I. S. Kohane, “Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements,” in Biocomputing 2000. World Scientific, 1999, pp. 418–429.
 [22] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy, “Tracking emerges by colorizing videos,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 391–408.
 [23] R. Arandjelovic and A. Zisserman, “Look, listen and learn,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 609–617.

 [24] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, “Mutual information neural estimation,” in International Conference on Machine Learning, 2018, pp. 531–540.
 [25] J. Chang, Z. Lan, C. Cheng, and Y. Wei, “Data uncertainty learning in face recognition,” arXiv preprint arXiv:2003.11339, 2020.
 [26] Y. Shi and A. K. Jain, “Probabilistic face embeddings,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6902–6911.
 [27] A. Kendall, V. Badrinarayanan, and R. Cipolla, “Bayesian segnet: Model uncertainty in deep convolutional encoderdecoder architectures for scene understanding,” arXiv preprint arXiv:1511.02680, 2015.
 [28] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight uncertainty in neural networks,” arXiv preprint arXiv:1505.05424, 2015.
 [29] Y. Gal, “Uncertainty in deep learning,” University of Cambridge, vol. 1, p. 3, 2016.

 [30] D. J. MacKay, “A practical bayesian framework for backpropagation networks,” Neural computation, vol. 4, no. 3, pp. 448–472, 1992.
 [31] R. M. Neal, Bayesian learning for neural networks. Springer Science & Business Media, 2012, vol. 118.
 [32] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” in Advances in neural information processing systems, 2017, pp. 5574–5584.
 [33] S. Isobe and S. Arai, “Deep convolutional encoderdecoder network with model uncertainty for semantic segmentation,” in 2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications (INISTA). IEEE, 2017, pp. 365–370.
 [34] J. Choi, D. Chun, H. Kim, and H.J. Lee, “Gaussian yolov3: An accurate and fast object detector using localization uncertainty for autonomous driving,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 502–511.
 [35] T. Yu, D. Li, Y. Yang, T. M. Hospedales, and T. Xiang, “Robust person reidentification by modelling feature uncertainty,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 552–561.
 [36] U. Zafar, M. Ghafoor, T. Zia, G. Ahmed, A. Latif, K. R. Malik, and A. M. Sharif, “Face recognition with bayesian convolutional networks for robust surveillance systems,” EURASIP Journal on Image and Video Processing, vol. 2019, no. 1, p. 10, 2019.
 [37] M. D. Donsker and S. S. Varadhan, “Asymptotic evaluation of certain markov process expectations for large time, i,” Communications on Pure and Applied Mathematics, vol. 28, no. 1, pp. 1–47, 1975.
 [38] S. Nowozin, B. Cseke, and R. Tomioka, “fgan: Training generative neural samplers using variational divergence minimization,” in Advances in neural information processing systems, 2016, pp. 271–279.
 [39] M. U. Gutmann and A. Hyvärinen, “Noisecontrastive estimation of unnormalized statistical models, with applications to natural image statistics,” Journal of Machine Learning Research, vol. 13, no. Feb, pp. 307–361, 2012.
 [40] P. Bachman, R. D. Hjelm, and W. Buchwalter, “Learning representations by maximizing mutual information across views,” in Advances in Neural Information Processing Systems, 2019, pp. 15 509–15 519.
 [41] J. Xu, P. Ye, Q. Li, H. Du, Y. Liu, and D. Doermann, “Blind image quality assessment based on high order statistics aggregation,” IEEE Transactions on Image Processing, vol. 25, no. 9, pp. 4444–4457, 2016.
 [42] C. Chen, Z. Fu, Z. Chen, S. Jin, Z. Cheng, X. Jin, and X.S. Hua, “Homm: Higherorder moment matching for unsupervised domain adaptation,” arXiv preprint arXiv:1912.11976, 2019.
 [43] P. Jacob, D. Picard, A. Histace, and E. Klein, “Metric learning with horde: Highorder regularizer for deep embeddings,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6539–6548.

 [44] H. Jégou and O. Chum, “Negative evidences and cooccurences in image retrieval: The benefit of pca and whitening,” in European conference on computer vision. Springer, 2012, pp. 774–787.
 [45] M. Opitz, G. Waltner, H. Possegger, and H. Bischof, “Bierboosting independent embeddings robustly,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5189–5198.
 [46] P. Kar and H. Karnick, “Random feature maps for dot product kernels,” pp. 583–591, 2012.
 [47] S. Ahmed, M. H. Yap, M. Tan, and M. K. Hasan, “Reconet: Multilevel preprocessing of chest xrays for covid19 detection using convolutional neural networks,” medRxiv, 2020.

 [48] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
 [49] A. I. Khan, J. L. Shah, and M. M. Bhat, “Coronet: A deep neural network for detection and diagnosis of covid19 from chest xray images,” Computer Methods and Programs in Biomedicine, p. 105581, 2020.
 [50] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” pp. 5987–5995, 2017.
 [51] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” in International conference on machine learning, 2014, pp. 647–655.