1 Introduction
Accurate classification of early colon lesions plays an important role in diagnosis and treatment [25]. Colorectal cancer is usually diagnosed at an advanced stage because early clinical symptoms are subtle, resulting in a high mortality rate [10, 5]. In clinical practice, colonoscopy is the most commonly used method to diagnose colorectal lesions [1, 17]. However, manual lesion classification is generally time-consuming and potentially subjective. Automated classification of colorectal lesions from colonoscopy images is therefore critical in clinical analysis because it: 1) helps physicians determine the type of colonic disease; 2) supports formulating the most appropriate treatment options; 3) shortens the duration of colonoscopy [12].
Existing research has achieved progress in classifying colon diseases [26, 2], but brightness imbalance and positional variability remain intractable challenges (Fig. 1(a)). The uneven illumination from the endoscope probe induces apparent color and brightness differences even in normal images, degrading classification performance and hindering model generalization. Some studies rely on data augmentation to resolve brightness imbalance [15]. Wei et al. designed a color exchange operation that generates images of various colors to force the model to focus more on target shape and structure [23]. However, the large differences in brightness and color of endoscopic images prevent data augmentation from covering all distributions, resulting in poor performance in some distinct regions. On the other hand, the positional variability of intestinal lesions is reflected in their appearance in various regions of the lumen wall. As a result, it is difficult for a network to learn all positional changes from small datasets, and it is prone to overfitting [11]. Several works applied transfer learning to alleviate the positional variability issue by acquiring features from large datasets
[13, 22]. Nonetheless, different feature distributions between the source and target domains in transfer learning can lead to insufficient adaptability during network training.

Frequency learning has great potential owing to its brightness insensitivity and displacement invariance, which are recognized to improve the ability to discriminate colon diseases (Fig. 1(a)) [18, 24]. The DC component (a single value) in the spectrum represents the average brightness of the image [3]. This breaks the correlation between image content and brightness, prompting the model to focus more on target shape and structure. The magnitude map has displacement invariance: it remains unchanged under spatial shifts of the time-domain image, which avoids the influence of changes in lesion location. Moreover, the phase map provides contour and structural information of the image. However, the spectrum obtained by directly applying the Fourier transform to the whole image contains only global information, which limits the model's ability to learn local lesion features. Besides, phase information is lost when a real-valued network learns only from the magnitude spectrum (Fig. 1(b)).
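The two spectral properties above can be verified in a few lines. The following NumPy sketch (ours, for illustration only; not part of the FFCNet implementation) checks that the DC term equals the pixel sum and that the magnitude spectrum is unchanged by a circular spatial shift:

```python
import numpy as np

# Illustrative check of the two spectral properties discussed above.
rng = np.random.default_rng(0)
img = rng.random((8, 8))

spec = np.fft.fft2(img)
# 1) The DC component equals the sum of all pixels,
#    i.e. the average brightness times the number of pixels.
assert np.isclose(spec[0, 0].real, img.sum())

# 2) The magnitude spectrum is invariant to a circular spatial shift:
#    shifting only multiplies the spectrum by a phase factor.
shifted = np.roll(img, shift=(3, 2), axis=(0, 1))
assert np.allclose(np.abs(spec), np.abs(np.fft.fft2(shifted)))
print("DC = pixel sum; |F| unchanged under shift")
```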
In this study, we propose a novel frequency learning framework (FFCNet) for colon disease classification. Our work makes the following contributions: 1) Our method is one of the first to study automatic classification of colon diseases across the whole process (normal, polyps, adenomas, cancers). This four-level classification helps colonoscopists determine the type of lesion accurately and advances the clinical diagnosis of early colorectal cancer. 2) For the first time, we propose a framework that can be trained directly in the frequency domain by combining complex convolutional networks with frequency learning. The convolution kernels, blocks, and architectures in our complex network are modified into complex operations, enabling direct learning of the complex-valued spectrum and thus avoiding the loss of phase features caused by real-valued network operations. Moreover, spectral brightness insensitivity and displacement invariance resolve the uneven brightness and positional variability of time-domain images. 3) We present a novel image patch scrambling module embedded in FFCNet to generate local spectrograms. Spectral patches provide local lesion features, giving the model stronger discriminative ability against inter-class similarity. In addition, through random shuffling, spectral patches appearing at different locations in the image reveal long-range information to the model. 4) The proposed method is competitive against well-known CNN architectures in our experiments. This work also prompts discussion of how classical CNN architectures can exploit spatial and frequency features to improve performance on real-world problems.
2 Methodology
FFCNet (Fig. 2) is composed of a patch scrambling module and a frequency-domain complex network. The patch scrambling module (Sect. 2.1) obtains a complex spectrogram by slicing the time-domain image, applying the Discrete Fourier Transform (DFT), and then scrambling the patches, which effectively aggregates local information and improves the learning of non-local information. The frequency-domain complex network (Sect. 2.2) handles complex numbers within the original network architecture. Specifically, replacing convolution, ReLU, and batch normalization (BN) with complex convolution, complex ReLU, and complex BN enables the network to compute on the complex spectrum and extract richer feature information.
2.1 Patch Shuffling Module (PSM)
The proposed PSM transforms time-domain images into frequency-domain representations through image dicing, the DFT, and random shuffling. Without dicing, the network learns only global information, because a single point in the frequency domain affects the entire image; this neglects the local features necessary for colon classification. Hence, as shown in Fig. 2(b), we perform the DFT after slicing images to guide the network toward recognizable local features. In addition, the scrambled spectral patches further improve the model's long-distance feature learning.
Given an input image I, we first uniformly partition it into an n × n grid of patches, where R_{ij} denotes an image patch with horizontal and vertical indices i and j, respectively (1 ≤ i ≤ n, 1 ≤ j ≤ n). After the dicing, each patch is transformed to the frequency domain. For each image patch, denoted f(x, y), of size h × w, the DFT is computed according to the following expression:

F(u, v) = \sum_{x=0}^{h-1} \sum_{y=0}^{w-1} f(x, y)\, e^{-j 2\pi \left( \frac{ux}{h} + \frac{vy}{w} \right)} \quad (1)
Finally, the spectral patches are randomly shuffled with probability p. Since the neat arrangement of the spectrum has been corrupted, the classification network must find discriminative regions and identify small inter-class differences in order to recognize the randomly arranged spectral patches.

Summarized advantages: Our PSM compensates for both local and long-range features of the image while preserving the advantages of the frequency domain. In addition, the local spectrogram has a smaller numerical range than the original spectrogram, which improves gradient convergence and speeds up model training.
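The steps above can be sketched in PyTorch as follows. This is our illustration, not the authors' code: the function name, the grid size n, and the use of `torch.fft` are assumptions.

```python
import torch

def patch_shuffle_dft(img, n=4, p=0.3, generator=None):
    """Sketch of the Patch Shuffling Module (PSM): dice the image into an
    n x n grid, apply a 2-D DFT to each patch, then randomly permute the
    patches with probability p. Names and defaults are illustrative."""
    c, h, w = img.shape
    ph, pw = h // n, w // n
    # dice into (n*n, c, ph, pw) patches
    patches = img.unfold(1, ph, ph).unfold(2, pw, pw)        # (c, n, n, ph, pw)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(n * n, c, ph, pw)
    spectra = torch.fft.fft2(patches)                        # per-patch complex spectrum
    if torch.rand((), generator=generator) < p:              # shuffle with probability p
        perm = torch.randperm(n * n, generator=generator)
        spectra = spectra[perm]
    # reassemble the n x n grid of spectral patches
    grid = spectra.reshape(n, n, c, ph, pw).permute(2, 0, 3, 1, 4)
    return grid.reshape(c, h, w)

spec = patch_shuffle_dft(torch.rand(3, 400, 400), n=4, p=0.3)
print(spec.shape, spec.dtype)
```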
2.2 Frequency-domain Complex Network
The proposed frequency learning network directly learns complex-valued spectrograms through complex operations during training. The backbone adopts ResNet [7], with the internal operations replaced by complex sub-components (complex convolution, complex ReLU, complex BN). Each residual block consists of two 3×3 convolutional layers and one skip connection. The proposed network thus combines the advantages of frequency features with the rich expressive power of complex operations.
Complex convolution. To perform an operation equivalent to traditional real-valued 2D convolution in the complex domain, the real and imaginary parts of the complex matrix in the spectrogram are fed into the network separately. Meanwhile, two sets of convolution kernels, a and b, are used to simulate the real and imaginary parts of the complex convolution kernel W = a + ib. For a complex input c + id, the complex convolution can be expressed as:

(a + ib) \ast (c + id) = (a \ast c - b \ast d) + i(a \ast d + b \ast c) \quad (2)

where a, b, c, d are all real-valued and \ast denotes convolution.
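Eq. (2) can be realized with two real-valued convolution layers. The sketch below is our illustration of that decomposition (class and attribute names are assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Sketch of the complex convolution in Eq. (2): two real conv layers
    act as the real part a and imaginary part b of a complex kernel
    W = a + ib, so W * (c + id) = (a*c - b*d) + i(a*d + b*c)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, **kwargs):
        super().__init__()
        self.conv_a = nn.Conv2d(in_ch, out_ch, kernel_size, **kwargs)  # real kernel a
        self.conv_b = nn.Conv2d(in_ch, out_ch, kernel_size, **kwargs)  # imaginary kernel b

    def forward(self, z):                      # z: complex tensor (N, C, H, W)
        c, d = z.real, z.imag
        real = self.conv_a(c) - self.conv_b(d)
        imag = self.conv_a(d) + self.conv_b(c)
        return torch.complex(real, imag)

z = torch.complex(torch.rand(1, 3, 8, 8), torch.rand(1, 3, 8, 8))
out = ComplexConv2d(3, 16, padding=1)(z)
print(out.shape)  # torch.Size([1, 16, 8, 8])
```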
Complex ReLU. The neural network relies on the ReLU function to introduce nonlinearity and promote sparsity. The ReLU function sets all negative values in a matrix to zero and leaves the remaining values unchanged. The complex ReLU applies ReLU to the real and imaginary parts separately and sums the results. Complex ReLU satisfies the Cauchy-Riemann equations when both the real and imaginary parts are strictly positive or strictly negative [21]. The specific formula is as follows:

\mathbb{C}\text{ReLU}(z) = \text{ReLU}(\Re(z)) + i\, \text{ReLU}(\Im(z)) \quad (3)
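Eq. (3) is a one-liner in PyTorch; the sketch below (ours, for illustration) applies ReLU to each part and recombines:

```python
import torch

def complex_relu(z):
    """Complex ReLU of Eq. (3): apply ReLU to the real and imaginary
    parts separately, then recombine them into a complex tensor."""
    return torch.complex(torch.relu(z.real), torch.relu(z.imag))

z = torch.tensor([complex(-1.0, 2.0), complex(3.0, -4.0)])
out = complex_relu(z)   # negative real/imag components clipped to zero
```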
Complex BN. BN is often employed to accelerate learning in neural networks by pulling the distribution of each layer's inputs back to a standard normal distribution with mean 0 and variance 1. For complex numbers, simply translating and scaling to mean 0 and variance 1 is unreasonable: such normalization does not ensure that the variances of the real and imaginary parts are equal, so the resulting distribution can be elliptical, possibly with high eccentricity. Hence, we treat each complex value as a two-dimensional vector when changing the data distribution.
Given a batch input x, we compute the mean E[x] and the covariance matrix V; the normalized \tilde{x} is expressed as:

\tilde{x} = V^{-1/2} (x - E[x]) \quad (4)
where V is the covariance matrix and E[x] is the mean of the data. V is a 2 × 2 matrix represented as

V = \begin{pmatrix} \mathrm{Cov}(\Re(x), \Re(x)) & \mathrm{Cov}(\Re(x), \Im(x)) \\ \mathrm{Cov}(\Im(x), \Re(x)) & \mathrm{Cov}(\Im(x), \Im(x)) \end{pmatrix}

where \Re(x) and \Im(x) represent the real and imaginary parts of x, respectively. Similar to real-valued normalization, learnable reconstruction parameters \gamma and \beta are introduced to restore the feature distribution to be learned by the network. The difference is that the shift parameter \beta is a complex parameter with two learnable components (real and imaginary), while the scaling parameter \gamma is a 2 × 2 positive semidefinite matrix with only three degrees of freedom, i.e., three learnable components. The complex BN is defined as:

\text{BN}(\tilde{x}) = \gamma \tilde{x} + \beta \quad (5)
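The whitening in Eq. (4) reduces to inverting the square root of a symmetric 2×2 matrix, for which a closed form exists. The sketch below is our illustration under that closed form, omitting the learnable \gamma and \beta of Eq. (5) for brevity:

```python
import torch

torch.manual_seed(0)

def complex_batch_norm(z, eps=1e-5):
    """Illustrative complex BN (Eq. (4)): centre the batch, then whiten the
    (real, imag) pair with V^{-1/2}, the inverse square root of the 2x2
    covariance matrix V. The learnable gamma/beta of Eq. (5) are omitted."""
    x = z.real - z.real.mean()
    y = z.imag - z.imag.mean()
    # entries of the 2x2 covariance matrix V (eps keeps it positive definite)
    vrr = (x * x).mean() + eps
    vii = (y * y).mean() + eps
    vri = (x * y).mean()
    # closed-form inverse square root of a symmetric 2x2 matrix:
    # sqrt(V) = (V + s*I) / t with s = sqrt(det V), t = sqrt(tr V + 2s)
    s = torch.sqrt(vrr * vii - vri * vri)
    t = torch.sqrt(vrr + vii + 2 * s)
    inv = 1.0 / (s * t)
    wrr, wii, wri = (vii + s) * inv, (vrr + s) * inv, -vri * inv
    return torch.complex(wrr * x + wri * y, wri * x + wii * y)

# anisotropic complex batch -> whitened to roughly unit circular variance
z = torch.complex(torch.randn(1024) * 3 + 1, torch.randn(1024) * 0.5 - 2)
out = complex_batch_norm(z)
```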
Summarized advantages: Our complex CNN implements a full range of frequency-domain analysis, learning both magnitude and phase information in a unified architecture. Moreover, complex convolution, complex ReLU, and complex BN retain the optimization benefits of complex representations and further improve the expressive power of frequency features.
3 Experiments and Results
Experiment protocol. 1) Datasets. The study included 3568 standard white-light endoscopic images: 865 normal, 843 polyps, 896 adenomas, and 964 cancers. We randomly split the dataset into training, validation, and testing sets in a 6:2:2 ratio. 2) Settings. We use PyTorch 1.1 [14] to perform all experiments on NVIDIA TITAN X (Pascal) GPU machines and evaluate our proposed method on the widely used ResNet18 classification backbone. Input images are resized to a fixed size of 400×400. Random horizontal and vertical flips were applied for data augmentation. During training, we trained all networks with the SGD optimizer, a learning rate of 0.1, and a mini-batch size of 32 for 600 epochs. The probability p of patch shuffling was set to 0.3. During testing, data augmentation and patch scrambling were disabled; the diced spectrum of the original image was fed into the complex classification network for the final prediction.
We evaluate the classification performance using four metrics: Accuracy, Precision, Recall, and F1-score. More details are in our Supplementary Material.

Table 1. Quantitative comparison with classical and frequency-based classification methods (%).

Method             Accuracy  Precision  Recall  F1-score
ResNet [7]         81.89     81.96      81.89   81.91
MobileNet [8]      81.11     81.14      81.11   81.07
EfficientNet [20]  84.44     84.76      84.44   84.55
DenseNet [9]       84.86     84.86      84.86   84.80
GoogLeNet [19]     85.42     85.72      85.42   85.52
CoAtNet [4]        85.93     86.08      85.93   85.94
Fast [3]           81.57     81.86      81.57   81.60
GFNet [16]         84.81     84.92      84.81   84.86
KSpace [6]         83.28     83.42      83.28   83.24
FFCNet (ours)      86.35     86.61      86.35   86.44
Comparative experiments show the superiority of our FFCNet: Comparison with several classical classification methods shows great potential for application in colonoscopy classification scenarios. Compared with the other methods (Table 1), FFCNet achieves the highest accuracy (86.35%), precision (86.61%), recall (86.35%), and F1-score (86.44%). In particular, the accuracy of FFCNet is 4.46% higher than that of the backbone network ResNet, indicating that frequency features provide a significant improvement to the architecture.
Comparisons with frequency, complex, and transformer networks also indicate the superiority of our FFCNet: a joint CNN-and-transformer network (CoAtNet, 85.93%), a CNN with added frequency modules (Fast, 81.57%), a transformer with added frequency modules (GFNet, 84.81%), and another complex network (KSpace, 83.28%). FFCNet outperforms them without requiring pretraining, indicating that the designed model is more efficient.

The confusion matrix results demonstrate the superiority of FFCNet in discriminating inter-class similarity. Our network obtains excellent performance in all categories (Fig. 3). Moreover, it achieves the highest accuracy on the two categories that are hardest to distinguish, polyps and adenomas. This finding is attributed to our PSM guiding the network to acquire disease-specific information by learning local and long-range features.
Ablation experiments demonstrate the contribution of the proposed modules: Fig. 4(a) quantitatively reports the average accuracy of the ablation experiments, demonstrating that each sub-module contributes to the performance improvement. We first build a baseline model using magnitude maps with ResNet and gradually incorporate each of the sub-modules discussed in Section 2 into it. Adding the PSM alone supplements local information and improves accuracy over both the baseline and the baseline with the complex network. Incorporating our proposed complex network into the baseline model also improves accuracy, verifying the ability of complex networks to mine phase information. The last two columns validate the importance of random shuffling, which significantly improves classification accuracy by learning long-distance information.
Hyperparameter experiments analyze the superiority of the architecture: We evaluate the effect of the patch number and the random scrambling probability on the network (Fig. 4(b), (c)). The slicing operation brings in local image information, so the accuracy of the network gradually increases as the number of patches grows. However, if the patches are too small, the advantages of frequency-domain features are destroyed and model performance drops. Likewise, random shuffling allows the network to learn long-distance information, yet it also increases the difficulty of learning. Therefore, we shuffle patches with a certain probability to stabilize network performance.
4 Conclusion
In this study, we propose a novel method to advance colon disease classification in colonoscopy images from a frequency-domain perspective. The proposed framework, FFCNet, introduces complex convolution operations, enabling the network to operate directly on complex spectra to obtain rich texture features and eliminate the influence of brightness imbalance. Furthermore, our patch scrambling algorithm preprocesses the spectrogram so that the network can learn both long-range and local information. Finally, we compare the performance of the framework with other methods. The results show that our frequency-domain complex framework is competitive with time-domain models in diagnosing colon diseases.
Acknowledgements
This work was supported by the National Key R&D Program Project (2018YFA0704102).
References
[1] Bibbins-Domingo, K., Grossman, D.C., Curry, S.J., Davidson, K.W., Epling, J.W., García, F.A., Gillman, M.W., Harper, D.M., Kemper, A.R., Krist, A.H., et al.: Screening for colorectal cancer: US Preventive Services Task Force recommendation statement. JAMA 315(23), 2564–2575 (2016)
[2] Carneiro, G., Pu, L.Z.C.T., Singh, R., Burt, A.: Deep learning uncertainty and confidence calibration for the five-class polyp classification from colonoscopy. Medical Image Analysis 62, 101653 (2020)
[3] Chi, L., Jiang, B., Mu, Y.: Fast Fourier convolution. Advances in Neural Information Processing Systems 33, 4479–4488 (2020)
[4] Dai, Z., Liu, H., Le, Q.V., Tan, M.: CoAtNet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems 34, 3965–3977 (2021)
[5] Elbediwy, A., Vincent-Mistiaen, Z.I., Spencer-Dene, B., Stone, R.K., Boeing, S., Wculek, S.K., Cordero, J., Tan, E.H., Ridgway, R., Brunton, V.G., et al.: Integrin signalling regulates YAP and TAZ to control skin homeostasis. Development 143(10), 1674–1687 (2016)
[6] Han, Y., Sunwoo, L., Ye, J.C.: k-space deep learning for accelerated MRI. IEEE Transactions on Medical Imaging 39(2), 377–386 (2019)
[7] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
[8] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
[9] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4700–4708 (2017)
[10] Ladabaum, U., Dominitz, J.A., Kahi, C., Schoen, R.E.: Strategies for colorectal cancer screening. Gastroenterology 158(2), 418–432 (2020)
[11] Liu, X., Guo, X., Liu, Y., Yuan, Y.: Consolidated domain adaptive detection and localization framework for cross-device colonoscopic images. Medical Image Analysis 71, 102052 (2021)
[12] Mármol, I., Sánchez-de-Diego, C., Pradilla Dieste, A., Cerrada, E., Rodriguez-Yoldi, M.J.: Colorectal carcinoma: a general overview and future perspectives in colorectal cancer. International Journal of Molecular Sciences 18(1), 197 (2017)
[13] Misawa, M., Kudo, S.e., Mori, Y., Cho, T., Kataoka, S., Yamauchi, A., Ogawa, Y., Maeda, Y., Takeda, K., Ichimasa, K., et al.: Artificial intelligence-assisted polyp detection for colonoscopy: initial experience. Gastroenterology 154(8), 2027–2029 (2018)
[14] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019)
[15] Qadir, H.A., Shin, Y., Solhusvik, J., Bergsland, J., Aabakken, L., Balasingham, I.: Toward real-time polyp detection using fully CNNs for 2D Gaussian shapes prediction. Medical Image Analysis 68, 101897 (2021)
[16] Rao, Y., Zhao, W., Zhu, Z., Lu, J., Zhou, J.: Global filter networks for image classification. Advances in Neural Information Processing Systems 34, 980–993 (2021)
[17] Rex, D.K., Boland, C.R., Dominitz, J.A., Giardiello, F.M., Johnson, D.A., Kaltenbach, T., Levin, T.R., Lieberman, D., Robertson, D.J.: Colorectal cancer screening: recommendations for physicians and patients from the US Multi-Society Task Force on Colorectal Cancer. Gastroenterology 153(1), 307–323 (2017)
[18] Stuchi, J.A., Boccato, L., Attux, R.: Frequency learning for image classification. CoRR abs/2006.15476 (2020)
[19] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1–9 (2015)
[20] Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning. pp. 6105–6114. PMLR (2019)
[21] Trabelsi, C., Bilaniuk, O., Serdyuk, D., Subramanian, S., Santos, J.F., Mehri, S., Rostamzadeh, N., Bengio, Y., Pal, C.J.: Deep complex networks. CoRR abs/1705.09792 (2017)
[22] Wang, Y., Feng, Z., Song, L., Liu, X., Liu, S.: Multi-classification of endoscopic colonoscopy images based on deep transfer learning. Computational and Mathematical Methods in Medicine 2021 (2021)
[23] Wei, J., Hu, Y., Zhang, R., Li, Z., Zhou, S.K., Cui, S.: Shallow attention network for polyp segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 699–708. Springer (2021)
[24] Xu, K., Qin, M., Sun, F., Wang, Y., Chen, Y.K., Ren, F.: Learning in the frequency domain. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1740–1749 (2020)
[25] Zhang, R., Zheng, Y., Mak, T.W.C., Yu, R., Wong, S.H., Lau, J.Y., Poon, C.C.: Automatic detection and classification of colorectal polyps by transferring low-level CNN features from nonmedical domain. IEEE Journal of Biomedical and Health Informatics 21(1), 41–47 (2016)
[26] Zhang, R., Zheng, Y., Poon, C.C., Shen, D., Lau, J.Y.: Polyp detection during colonoscopy using a regression-based convolutional neural network with a tracker. Pattern Recognition 83, 209–219 (2018)