I Introduction
Computer vision applies a sequence of operations to identify the characteristics of an image so that the image can be perceived and analyzed. It is used extensively to solve a wide variety of problems in remote sensing, spectroscopy, medical imaging, image processing, meteorology, astronomy, and microscopy. Image recognition is an active research field in computer vision, for which researchers predict that the global market will reach billion by 2021 [ImageRec74:online].
The application of deep learning approaches to image enhancement, restoration, and morphing [he2016deep, krizhevsky2012imagenet] has significantly improved the extraction of data (or features) from images. Moreover, Convolutional Neural Networks (CNNs) have been extremely successful in image processing. However, CNNs are computationally expensive because of the cost of computing convolutions. By transforming images and CNN kernels into frequency space using the Fast Fourier Transformation (FFT), a convolution becomes a single elementwise multiplication. Hence, FFT is widely used for image convolution. However, performing the FFT itself incurs an overhead, so this type of transformation reduces the convolution cost only when the convolution is large. Recently, the convolutions of CNNs have been optimized using FFT [vasilyev2015cnn] and adapted for mobile and embedded systems. Here, we apply FFT to the input data only and demonstrate the performance improvement in detecting objects.
Convolutional Networks [lecun1989backpropagation] have existed for some time, but the required size of the training sets and of the networks restricted their wide adoption. On the other hand, their deep variants have shown outstanding performance in visual object recognition tasks [krizhevsky2012imagenet] through supervised training.
Object identification algorithms typically use extracted features and learning algorithms to recognize instances of an object category. Currently, most object detection models specify where an object is located in an image with a bounding box [redmon2017yolo9000, he2017mask].
This study identifies object positions using an FFT-based convolutional neural network that locates objects by image segmentation rather than with a bounding box. One of the major issues with using a CNN on an image dataset is the training time. This study focuses on lesion detection in an image dataset using the UNet model and aims to improve the training time by adding a Fast Fourier Transformation layer to the UNet model. This paper proposes to accomplish this by using a Fast Fourier convolution with a fully convolutional neural network. The proposed model detects objects in an image, utilizing the scaling properties of FFT to improve the convolution time. The key contributions of this paper are as follows:

Introducing a Fast Fourier Transformation-based deep learning algorithm (i.e., FFT-based CNN) in the context of object recognition for optimization purposes,

Implementing FFT on a UNet architecture and capturing key features, and

Evaluating the proposed Fast Fourier Transformation-based CNN through a set of experiments using a bioimage dataset in the domain of health analytics.
The rest of the paper is organized as follows: we review the existing work on object identification in Section II and then explain the ideas and theories used in our implementation in Section III. Section IV presents the methodology and the generic algorithm of the proposed approach. Section V describes the experiments and the results. We conclude the paper in Section VI.
II Related Work
To incorporate contextual information for lesion recognition, Gao et al. [gao2015automatic] proposed an approach based on a fusion of CNNs (Convolutional Neural Networks) and RNNs (Recurrent Neural Networks). Shen et al. [shen2015automatic] implemented three CNNs to address the problem of object recognition. While these approaches target object identification in 2D images, Setio et al. [setio2016pulmonary] proposed a multi-stream CNN for object classification in 3D images.
For quantitative analysis of clinical characteristics, the bioimage needs to be segmented to identify the pixels that represent the object of interest. Ronneberger et al. [RFB15a] proposed UNet for biomedical image segmentation. It is a refined architecture developed from the fully convolutional neural network and is used for fast and precise segmentation of images. It is named after the shape of its architecture, which resembles the letter “U”. The UNet has been further extended [cciccek20163d] to 3D image segmentation. VNet [milletari2016v] is another variant of UNet for 3D images. RNNs are also used for image segmentation [xie2016spatial]. Stollenga et al. [stollenga2015parallel] implemented a 3D LSTM-RNN with a convolutional layer. Andermatt et al. [andermatt2016multi] proposed a gated RNN for segmenting 3D bioimages. Chen et al. [chen2016combining] combined an RNN with a UNet-like architecture for segmentation. Convolutional Neural Networks have also been used in many other application domains, such as mapping genome data into images [tavakoli2020seq2image].
III Theoretical Background
This section discusses the two major techniques that we used for the convolution and image processing: (1) Fast Fourier Transformation and (2) UNet.
III-A Fast Fourier Transformation
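In image processing, FFT is applied to an image to decompose it into real and imaginary components, representing the image in the frequency domain [goodmanIntroductionFourierOptics2017]. The FFT of a 2D image [muthyalampalli2009implementation] can be calculated using Equations (1) and (2):

$$F(u,v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y)\, e^{-j 2\pi \left( \frac{ux}{M} + \frac{vy}{N} \right)} \tag{1}$$

$$f(x,y) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F(u,v)\, e^{\,j 2\pi \left( \frac{ux}{M} + \frac{vy}{N} \right)} \tag{2}$$

where $f(x,y)$ represents the pixel at position $(x,y)$, whereas $F(u,v)$ is the function that represents the image in the frequency domain pertaining to the position $u$ and $v$, $M \times N$ represents the dimension of the image, and $j$ is $\sqrt{-1}$ [goodmanIntroductionFourierOptics2017].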
In our work, we used a TensorFlow-based implementation of the Fast Fourier Transform to produce a transformed feature map in the Fourier domain, which is then provided to UNet for object segmentation.
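As an illustration of what such a transform computes, the following minimal sketch evaluates the standard 2D discrete Fourier transform and its inverse directly from their textbook definitions, assuming an unnormalized forward transform and an inverse scaled by 1/(MN). The function names `dft2` and `idft2` and the sample image are hypothetical; this is not the TensorFlow routine used in our implementation, and a library FFT computes the same quantities far more efficiently.

```python
import cmath

def dft2(image):
    """Forward 2D DFT of a list-of-rows image, evaluated term by term."""
    m, n = len(image), len(image[0])
    return [[sum(image[x][y] * cmath.exp(-2j * cmath.pi * (u * x / m + v * y / n))
                 for x in range(m) for y in range(n))
             for v in range(n)]
            for u in range(m)]

def idft2(spectrum):
    """Inverse 2D DFT, scaled by 1/(M*N) so that idft2(dft2(f)) recovers f."""
    m, n = len(spectrum), len(spectrum[0])
    return [[sum(spectrum[u][v] * cmath.exp(2j * cmath.pi * (u * x / m + v * y / n))
                 for u in range(m) for v in range(n)) / (m * n)
             for y in range(n)]
            for x in range(m)]

# Hypothetical 2x3 grayscale image used only for illustration.
image = [[1.0, 2.0, 0.0],
         [0.0, 3.0, 1.0]]
spectrum = dft2(image)      # complex values: real and imaginary components
restored = idft2(spectrum)  # matches `image` up to floating-point error
```

The zero-frequency coefficient `spectrum[0][0]` equals the sum of all pixel values, and the round trip through `idft2` returns the original image, which is the property that allows an inverse transform layer to restore the spatial representation.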
III-B UNet Architecture
Ronneberger et al. proposed the UNet architecture [RFB15a], which consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. This network can be trained end-to-end using very few images and outperforms sliding-window convolutional networks such as OverFeat [sermanet2013overfeat].
In Figure 1, the architecture of UNet is presented, where each blue box is a multichannel feature map. The number of channels is indicated on top of each box. White boxes represent copied feature maps. The arrows denote the various operations within the network.
IV Methodology
This section presents the architecture of the proposed model along with the developed algorithm. This work implements the UNet [RFB15a] with a Fast Fourier Transform-based neural network (NN). The NN uses the Fast Fourier Transform for the convolution between the image and the kernel (i.e., mask). We trained the NN with a labeled dataset [ljosa2012annotated], which consists of synthetic cell images and masks. Each pixel of an image is classified as either being part of a cell or not. The image is a matrix that has values ranging between and (RGB). The mask (i.e., kernel) is a 2D matrix with white and black values. Each pixel in the predicted mask represents the probability that the pixel is part of a cell. The output of the prediction model is clustered to identify the object locations in a 2D environment.
IV-A Model Overview
The proposed architecture is built upon the UNet, which is a fully convolutional network. The UNet is modified to yield better segmentation in medical imaging. The proposed model consists of a contracting path and an expansion path, which together form a U-shaped architecture. The contracting path comprises a fully connected convolutional network and includes repeated application of convolutions, each of which is followed by an exponential linear unit (ELU) and a max-pooling operation. The contracting path decreases the spatial information while increasing the feature information. The expansive path combines the feature and spatial information through a sequence of up-convolutions and concatenations with high-resolution features from the contracting path. The up-convolution in this experiment is a deconvolution operation, which is used for upsampling. In this model, we incorporate dropout in every block to avoid overfitting. Figure 2 displays the architecture of the proposed model. It consists of:

Input Layer.
This layer takes the input image and performs Fast Fourier convolution by applying the Keras-based FFT function [chollet2015keras]. The input layer is composed of:
A lambda layer with Fast Fourier Transform,

A 3x3 Convolution layer and activation function, and

A lambda layer with Inverse Fast Fourier Transform.


Contracting Path. The contracting path consists of four blocks, where each block is a group of operations bound together logically. In this part of the architecture, each block entails:

Two Convolution layers with activation function, and

A Max pooling layer.
This path applies convolution repeatedly. Each convolution is followed by an exponential linear unit (ELU) and a max pooling operation with stride 2 for downsampling. Here, at every max pooling step the feature channels are doubled.


Bottleneck. The bottleneck is the connection between contracting and expanding paths. It consists of two convolutional layers with dropout:

Two Convolution layers with activation functions, and

A Convolution layer.


Expanding. Every step in the expansive path consists of an upsampling of the feature map, followed by a convolution to reduce the number of features. A concatenation is then applied with the corresponding feature map from the contracting path, after which two convolutions with ELU are applied. The path consists of four blocks, each of which comprises:

A deconvolution layer with stride 2,

A concatenation with the corresponding cropped feature map from the contracting path,

Two 3x3 convolution layers with activation functions,

An output layer with convolutional layer, and

At the final layer, a convolution layer is used to map each feature vector to the input class.
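The bookkeeping implied by the components above can be sketched as a small shape trace. This is an illustrative sketch only: the helper `unet_shapes`, the 128x128 input size, and the 16 base channels are hypothetical, and the sketch assumes padded convolutions that preserve the spatial size, an input size divisible by 2 at every pooling step, and a halving of channels at every deconvolution.

```python
def unet_shapes(height, width, base_channels, depth=4):
    """Return (height, width, channels) after each stage of a U-shaped network."""
    shapes = []
    h, w, c = height, width, base_channels
    for _ in range(depth):
        shapes.append((h, w, c))           # output of one contracting block
        h, w, c = h // 2, w // 2, c * 2    # stride-2 pooling, then channels double
    shapes.append((h, w, c))               # bottleneck between the two paths
    for _ in range(depth):
        h, w, c = h * 2, w * 2, c // 2     # stride-2 deconvolution, channels halve
        shapes.append((h, w, c))           # output of one expanding block
    return shapes

# Hypothetical configuration: 128x128 input, 16 channels in the first block.
shapes = unet_shapes(128, 128, 16)
for index, shape in enumerate(shapes, start=1):
    print(f"stage {index}: {shape}")
```

With four contracting blocks, one bottleneck, and four expanding blocks, the trace contains nine stages, and the last expanding stage returns to the spatial size of the first contracting stage, which is what makes the concatenation of corresponding feature maps possible.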

IV-B Process Flow
Figure 3 shows the process flow of the proposed approach. After importing the training dataset of images and masks for training the neural network, we resized the input images to limit their dimensions. A TensorFlow spectral function is then applied to the images to compute the 2D Fast Fourier Transform and the Inverse Fast Fourier Transform. This TensorFlow-based FFT layer is added to the neural network. The convolution of the input image and the mask is then implemented using a 2D convolutional layer, which performs elementwise multiplications between the Fourier transforms of the input and the mask. Subsequently, an Inverse Fourier Transform layer is applied to retain the original dimensions.
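The identity that motivates this flow, namely that an elementwise product in the frequency domain corresponds to a convolution in the spatial domain, can be checked numerically. The sketch below is a hedged, dependency-free illustration rather than the layer used in our model: the helper names and the 3x3 values are hypothetical, the transform is evaluated directly from its definition instead of with an FFT routine, and the convolution is circular (wrap-around), so the kernel is assumed to have been padded to the shape of the image.

```python
import cmath

def dft2(a, sign=-1):
    """2D DFT for sign=-1, or the unscaled inverse DFT for sign=+1."""
    m, n = len(a), len(a[0])
    return [[sum(a[x][y] * cmath.exp(sign * 2j * cmath.pi * (u * x / m + v * y / n))
                 for x in range(m) for y in range(n))
             for v in range(n)]
            for u in range(m)]

def circular_conv2(image, kernel):
    """Direct circular 2D convolution; kernel must have the image's shape."""
    m, n = len(image), len(image[0])
    return [[sum(image[p][q] * kernel[(x - p) % m][(y - q) % n]
                 for p in range(m) for q in range(n))
             for y in range(n)]
            for x in range(m)]

def fourier_conv2(image, kernel):
    """Same convolution via transform, elementwise product, inverse transform."""
    m, n = len(image), len(image[0])
    fi, fk = dft2(image), dft2(kernel)
    product = [[fi[u][v] * fk[u][v] for v in range(n)] for u in range(m)]
    back = dft2(product, sign=+1)
    return [[back[x][y].real / (m * n) for y in range(n)] for x in range(m)]

# Hypothetical 3x3 image and Laplacian-like kernel of the same shape.
image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
kernel = [[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]
direct = circular_conv2(image, kernel)
via_fourier = fourier_conv2(image, kernel)
max_error = max(abs(direct[x][y] - via_fourier[x][y])
                for x in range(3) for y in range(3))
```

The two results agree up to floating-point error. The direct version needs a full double sum per output pixel, whereas the frequency-domain version needs only one multiplication per coefficient once the transforms are available, which is where the saving for large convolutions comes from.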
The purpose of pooling in the contraction path is to reduce the size of the feature maps so that the network has fewer parameters. Through pooling, the NN retains the most relevant features (the maximum-valued pixels) from each region and discards the remaining pixels. The output of this layer is provided to a flattening layer, which flattens the feature structure into a single long feature vector to be used by the output layer for object identification.
In this model, we incorporate dropout in every block to avoid overfitting. Figure 2 displays the architecture of the proposed model.
IV-C Drop Out
Dropout is the technique of ignoring randomly selected nodes during training to prevent overfitting in neural networks; the selected nodes are “dropped out” at random. For this experiment, dropout is applied to each block in the architecture. In Figure 3, the dropout rate is set as for the blocks c1, c2, c8 and c9 and for the blocks c3, c4, c5, c6 and c7.
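The mechanism can be sketched in a few lines. The following is a minimal illustration that assumes the common "inverted dropout" convention, in which the surviving activations are rescaled by 1/(1 - rate) during training so that their expected value is unchanged. The function name, the rate of 0.2, and the activation values are hypothetical and are not the settings of our blocks.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate` and rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    # rng.random() is uniform on [0, 1), so a value survives with probability `keep`.
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)          # fixed seed so the illustration is repeatable
activations = [1.0] * 10        # hypothetical block output
dropped = dropout(activations, rate=0.2, rng=rng)
unchanged = dropout(activations, rate=0.0, rng=rng)
```

Every entry of `dropped` is either zero or the original value divided by 0.8, and a rate of zero leaves the activations untouched. At inference time no nodes are dropped.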
IV-D Training, Validation, and Testing
The model is trained on the labeled dataset, where each image has a mask. The mask identifies the position of the object of interest in the main image. Additionally, a validation set is used during model fitting for an unbiased evaluation of the model trained on the training dataset. The model is evaluated on the validation data to compute the model metrics and to guard against overfitting. In this experiment, the validation data consist of images and masks, where each mask is compared with the prediction to compute the intersection over union (IoU) score.
After training the model, a scaled test set is used for an unbiased performance evaluation. The proposed model predicts the test masks from the test image dataset; each mask encodes the positions of the objects in the image. The test mask is resized, and local maxima are calculated on it to capture the high-intensity points. Further, the DBSCAN clustering algorithm is applied to aggregate the local maxima in order to capture the object locations and the number of objects in the image [ester_densitybased_1996].
V Case Study
This section presents the results of a case study and an implementation of the proposed FFT-based neural network algorithm. In this study, the accuracy of the result from the CNN-based model is compared to the result of a non-CNN-based system.
V-A Experimental Procedure and Data Collection
Out of the total simulated cell microscopy images available in the BBBC005v1 [ljosa2012annotated] dataset, the training set comprises images and masks, while the test set comprises images. The validation set comprises images and masks from the training set and is used to check the accuracy of the model during fitting. The model uses these processed datasets and predicts the mask, which is then further processed by resizing and calculating the local maxima. The processed image is clustered in order to identify the positions and the number of objects in the image.
V-B Results
The experiments are performed in and epochs on FFT-based and non-FFT-based algorithms. For the sake of space, we discuss the results for epochs only. The output obtained from the training is processed and clustered to identify the objects in the test image. For all the discussions of results, Figure 4 is used as one of the test images.
The images shown in Figures 4(a) and 4(b) are the output results when Figure 4 is used as the test image. The mask prediction obtained using the FFT-based algorithm is presented in Figure 4(a) and Figure 4(b), where Figure 4(a) is the predicted mask and Figure 5(a) is the unsampled prediction. The prediction obtained from the model without FFT is shown in Figure 4(b), and its unsampled prediction is Figure 5(b).
Table I reports the performance metrics obtained for the model in epochs with and without FFT, respectively. The evaluation metric used in this experiment is Intersection over Union (IoU), or the Jaccard index, a metric for measuring the accuracy of detecting objects. The average IoU obtained when the FFT-based neural network is used is 0.83, whereas the average IoU obtained when FFT is not employed is 0.52. As the results suggest, the accuracy of object detection is much higher when FFT is used.

FFT | Avg Valid. IoU | DICE Value | Mean IoU | CV. Val Loss | CV. Val mean IoU
With | 0.83 | 0.8392 | 0.8747 | 0.8753 | 0.8748
W/o | 0.52 | 0.5631 | 0.6815 | 0.6025 | 0.6815
The loss function used in this experiment is the DICE value [dice1945measures], which assesses the quality of the segmentation. The DICE value obtained with the FFT-based algorithm is 0.8392 and without it is 0.5631, showing that a better similarity is achieved with FFT. The mean IoU obtained for the final iteration is 0.8747 for the FFT-based algorithm and 0.6815 for the non-FFT-based algorithm.
For the algorithm with FFT, the cross-validation loss obtained is 0.8753, and for the non-FFT-based algorithm it is 0.6025. The cross-validation mean IoU is 0.8748 and 0.6815 for the FFT-based and non-FFT-based algorithms, respectively.
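For reference, the two overlap measures reported here can be computed from a pair of binary masks as in the following minimal sketch. It assumes masks that have already been thresholded and flattened into sequences of 0 and 1. The function names and the four-pixel example are hypothetical, and returning 1.0 for two empty masks is a convention chosen for this sketch.

```python
def iou(pred, truth):
    """Intersection over Union (Jaccard index) of two flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    return intersection / union if union else 1.0

def dice(pred, truth):
    """Dice coefficient: twice the overlap divided by the total foreground size."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    return 2 * intersection / total if total else 1.0

# Hypothetical four-pixel masks: one pixel overlaps, three are covered in total.
predicted = [1, 1, 0, 0]
reference = [1, 0, 1, 0]
iou_score = iou(predicted, reference)    # 1 / 3
dice_score = dice(predicted, reference)  # 2 * 1 / (2 + 2) = 0.5
```

The two measures are monotonically related by Dice = 2 IoU / (1 + IoU), so they rank segmentations identically while differing in scale.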
Figures 6(a) and 6(b) illustrate the plots of loss and mean IoU obtained in the training and validation phases with FFT and without FFT, respectively. As the log-loss plots illustrate, the log-loss values approach and , respectively, with and without FFT, indicating a reduction in loss values. Moreover, the plots show that the log-loss values are very stable and consistent when the FFT-based approach is used, whereas the log-loss values fluctuate between epochs when FFT is not applied.
In the mean IoU plot, the x-axis shows the number of epochs and the y-axis shows the mean IoU value; the graph plots the mean IoU obtained for each epoch. Both the FFT-based and non-FFT-based approaches show similar trends. However, as the scale of the y-axis indicates, the mean IoU is much higher when FFT is used than when it is not.
V-C Generate Local Maxima and Clustering Using DBSCAN
The input image for this experiment is the mask predicted by the model for the test image (Figure 4(a)). The resulting image, Figure 4(a), is resized and used as the input. The local maxima are calculated based on the brightness of the pixels in the resized image, Figure 8, which helps in identifying peaks using the peak local maxima function from scikit-image [scikitimage].
In this experiment, Figure 8 (left) is used as the input image to detect the local maxima by applying a maximum filter to the original image; Figure 8 (middle) is the image after the maximum filter is applied. This helps in accentuating the high-intensity regions of the image. The local maxima function is then applied to Figure 8 (middle). The resulting image is Figure 8 (right), where the high-intensity pixels are identified using local maxima.
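The idea behind this step can be illustrated without any imaging library. The sketch below is a simplified stand-in and not the scikit-image routine used in our experiments: a pixel is reported as a peak when it equals the maximum of its neighbourhood and exceeds a brightness threshold. The function names, the neighbourhood radius, the threshold of 0.5, and the 5x5 probability mask are all hypothetical.

```python
def maximum_filter(image, radius=1):
    """Replace each pixel by the maximum over its (2*radius+1)^2 neighbourhood."""
    rows, cols = len(image), len(image[0])
    filtered = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # The window is clipped at the image border.
            window = [image[i][j]
                      for i in range(max(0, r - radius), min(rows, r + radius + 1))
                      for j in range(max(0, c - radius), min(cols, c + radius + 1))]
            row.append(max(window))
        filtered.append(row)
    return filtered

def local_maxima(image, radius=1, threshold=0.5):
    """Coordinates of pixels that equal their neighbourhood maximum and exceed threshold."""
    filtered = maximum_filter(image, radius)
    return [(r, c)
            for r, row in enumerate(image)
            for c, value in enumerate(row)
            if value == filtered[r][c] and value > threshold]

# Hypothetical predicted mask with two bright spots.
mask = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.2, 0.8, 0.2],
    [0.0, 0.0, 0.0, 0.2, 0.0],
]
peaks = local_maxima(mask)  # [(1, 1), (3, 3)]
```

The threshold keeps flat background regions, where every pixel trivially equals its neighbourhood maximum, from being reported as peaks.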
The local maxima identified in the output image, Figure 8 (right), are then clustered using DBSCAN so that the number of clusters gives the number of objects. Figure 9 shows the result of applying DBSCAN clustering to Figure 8. The clustering algorithm identified 20 clusters in the image.
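How density-based clustering turns peak coordinates into an object count can be sketched as follows. This is a compact, quadratic-time illustration of the DBSCAN idea [ester_densitybased_1996], not the library implementation used in our experiments. The function name, the parameter values `eps=1.5` and `min_samples=2`, and the sample coordinates are hypothetical, and a point is counted as its own neighbour.

```python
import math

def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1               # noise for now; may become a border point
            continue
        cluster += 1                     # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # former noise reachable from a core point
            if labels[j] is not None:
                continue                 # already assigned; do not expand again
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is also a core point, so keep growing
    return labels

# Hypothetical peak coordinates: two tight groups and one isolated point.
peaks = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10), (30, 30)]
labels = dbscan(peaks, eps=1.5, min_samples=2)
object_count = len(set(labels) - {-1})
```

Here the two groups receive the labels 0 and 1 and the isolated point is marked as noise, so `object_count` is 2. The number of clusters, excluding noise, is what is read off as the number of objects.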
VI Conclusion and Future Work
In this work, an FFT-based convolutional neural network was successfully trained to identify the objects present in an image. The proposed FFT-based convolutional neural network improved the training time during convolution by reducing the number of steps needed to learn the features and train the network. By comparing the Intersection over Union (IoU) scores, we were able to determine the efficiency of the detection. The FFT was able to improve the training time during convolution from  ms/step to  ms/step. We compared the FFT-based model with the non-FFT-based model and observed an improvement in the IoU scores. In our work, for 100 epochs, the IoU score of the FFT-based convolution was and that of the non-FFT-based model was . The comparison of IoU scores implies that the FFT-based CNN has better accuracy than the non-FFT CNN.
In conclusion, FFT-based convolution implemented in a CNN architecture can improve the convolution time and accuracy. This work can be extended by comparing the model using multiple metrics and by comparing the current model with many different object detection models. It can also include comparing the effects of the model on multiple datasets with varied objects. This study can be further improved by finding the object position in a 3D environment to identify its position more precisely.
These results are promising and warrant further investigation. The discrete Fourier transform can be implemented as a single-layer NN [velik_discrete_2008], so it would seem that UNet would be capable of learning this operation if it were useful for segmentation. Why UNet is unable to transform or utilize the frequency domain of an image on its own is unclear, but it is clear that performing the FFT at the outset is advantageous for the learning process.
Acknowledgment
This research work is supported by the National Institutes of Health (NIH) under Grant No. R15GM120669. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.