Fast Fourier Transformation for Optimizing Convolutional Neural Networks in Object Recognition

This paper proposes a Fast Fourier Transformation-based U-Net (a refined fully convolutional network) that performs image convolution in neural networks. Leveraging the Fast Fourier Transformation, it reduces the cost of image convolution in Convolutional Neural Networks (CNNs) and thus the overall computational cost. The proposed model identifies object information in images. We apply the Fast Fourier Transform algorithm to an image dataset to obtain more accessible information about the image data before segmenting the images with the U-Net architecture. More specifically, we implement the FFT-based convolutional neural network to improve the training time of the network. The proposed approach was applied to the publicly available Broad Bioimage Benchmark Collection (BBBC) dataset. Our model reduced the training time during convolution from 600-700 ms/step to 400-500 ms/step. We evaluated the accuracy of our model using the Intersection over Union (IoU) metric, which showed significant improvement.





I Introduction

Computer vision applies a sequence of operations to identify, perceive and analyze the characteristics of an image. It is used extensively to solve a wide variety of problems in remote sensing, spectroscopy, medical imaging, image processing, meteorology, astronomy and microscopy. Image recognition is an exciting research field in computer vision, for which researchers predict that the global market will reach billion by 2021 [ImageRec74:online].

The application of deep learning approaches for image enhancement, restoration and morphing [he2016deep, krizhevsky2012imagenet] has significantly improved the extraction of data (or features) from images. Moreover, Convolutional Neural Networks (CNNs) have been extremely successful in image processing. However, CNNs are computationally expensive because of the cost of their convolutions. By transforming images and CNN kernels into frequency space using the Fast Fourier Transformation (FFT), a convolution becomes a single element-wise multiplication. Hence, FFT is widely used for image convolution. However, performing the FFT itself has an overhead cost, so this type of transformation reduces the convolution cost only when the convolution is large. Recently, the convolutions of CNNs have been optimized using FFT [vasilyev2015cnn] and are being adapted for mobile and embedded systems. Here, we apply FFT to the input data only and demonstrate the performance improvement in detecting objects.
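The trade-off between FFT overhead and convolution size can be illustrated with a rough operation count. The sketch below is only an order-of-magnitude estimate under stated assumptions: a direct convolution is counted as one multiply-add per kernel element per pixel, and the FFT route is counted as three transforms (image, kernel, inverse) of about n log2 n operations each plus one element-wise product. The function names and the cost model are illustrative and are not taken from the paper.

```python
import math

def direct_conv_ops(height, width, kernel):
    """Multiply-adds for a direct 2-D convolution with a kernel x kernel filter."""
    return height * width * kernel * kernel

def fft_conv_ops(height, width):
    """Rough cost of FFT-based convolution (assumed model, constant factors ignored):
    two forward FFTs, one inverse FFT, and one element-wise product."""
    n = height * width
    return int(3 * n * math.log2(n) + n)

# Hypothetical 256 x 256 image with three kernel sizes.
for k in (3, 9, 15):
    d = direct_conv_ops(256, 256, k)
    f = fft_conv_ops(256, 256)
    print(f"kernel {k}x{k}: direct={d}, fft={f}, fft cheaper: {f < d}")
```

Under this simplified model the FFT route only pays off once the kernel is large relative to log2 of the image size, which is consistent with the remark that the transformation is efficient only for large convolutions.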

Convolutional Networks [lecun1989backpropagation] have existed for some time, but the required size of the training sets and of the networks restricted their wide adoption. Their deep variants, on the other hand, have shown outstanding performance in visual object recognition tasks [krizhevsky2012imagenet] through supervised training.

Object identification algorithms typically use extracted features and learning algorithms to recognize instances of an object category. Currently, most object detection models specify where an object is located in an image with a bounding box [redmon2017yolo9000, he2017mask].

This study attempts to identify object positions using an FFT-based convolutional neural network that detects the position by image segmentation rather than with a bounding box. One of the major issues with using CNNs for image datasets is the training time. This study focuses on lesion detection in image datasets using the U-Net model and aims at improving the training time by adding a Fast Fourier Transformation layer to the U-Net model. This paper proposes to accomplish this by using Fast Fourier convolution with a fully convolutional network. The proposed model detects objects in an image, utilizing the scaling properties of FFT to improve the convolution time. The key contributions of this paper are as follows:

  • Introducing a Fast Fourier Transformation-based deep learning algorithm (i.e., FFT-based CNN) in the context of object recognition for optimization purposes,

  • Implementing FFT on a U-Net architecture and capturing key features, and

  • Evaluating the proposed Fast Fourier Transformation-based CNN through a set of experiments using a bio-image dataset in the domain of health analytics.

The rest of the paper is organized as follows: we review the existing work in object identification in Section II and then explain the ideas and theories used in our implementation in Section III. Section IV presents the methodology and the generic algorithm of the proposed approach. Section V describes the experiments and the results. We conclude the paper in Section VI.

II Related Work

To incorporate contextual information for lesion recognition, Gao et al. [gao2015automatic] proposed an approach based on a fusion of CNNs (Convolutional Neural Networks) and RNNs (Recurrent Neural Networks). Shen et al. [shen2015automatic] implemented three CNNs to address the problem of object recognition. While these approaches target object identification in 2D images, Setio et al. [setio2016pulmonary] proposed a multi-stream CNN for object classification in 3D images.

For quantitative analysis of clinical characteristics, the bio-image needs segmentation to identify the pixels that represent the object of interest. Ronneberger et al. [RFB15a] proposed U-Net for biomedical image segmentation. It is a refined architecture developed from the fully convolutional neural network and is used for fast and precise segmentation of images. It is named after the shape of its architecture, which resembles the letter “U”. The U-Net has been further extended [cciccek20163d] for 3D image segmentation. V-Net [milletari2016v] is another variant of U-Net for 3D images. RNNs are also used for image segmentation [xie2016spatial]. Stollenga et al. [stollenga2015parallel] implemented a 3D LSTM-RNN with a convolutional layer. Andermatt et al. [andermatt2016multi] proposed a gated RNN for segmenting 3D bio-images. Chen et al. [chen2016combining] combined an RNN with a U-Net-like architecture for segmentation. Convolutional Neural Networks have also been used in many other application domains, such as mapping genome data into images [tavakoli2020seq2image].

III Theoretical Background

This section discusses the two major techniques that we used for the convolution and image processing: (1) Fast Fourier Transformation and (2) U-Net.

III-A Fast Fourier Transformation

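In image processing, FFT is applied on an image to decompose it into real and imaginary components, representing the image in the frequency domain [goodmanIntroductionFourierOptics2017]. The FFT of a 2D image [muthyalampalli2009implementation] and its inverse can be calculated using Equations (1) and (2):

$$F(x, y) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f(m, n)\, e^{-j 2\pi \left( \frac{x m}{M} + \frac{y n}{N} \right)} \qquad (1)$$

$$f(m, n) = \frac{1}{MN} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} F(x, y)\, e^{j 2\pi \left( \frac{x m}{M} + \frac{y n}{N} \right)} \qquad (2)$$

where $f(m, n)$ represents the pixel at position $(m, n)$, whereas $F(x, y)$ is the function to represent the image in the frequency domain pertaining to the position $x$ and $y$, $M \times N$ represents the dimension of the image, and $j$ is $\sqrt{-1}$ [goodmanIntroductionFourierOptics2017].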
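As a check on the definition, the following minimal sketch evaluates the standard 2-D discrete Fourier transform, F(x, y) = Σ_m Σ_n f(m, n) e^{-j2π(xm/M + yn/N)}, by direct summation and compares it with NumPy's FFT. NumPy is used here only as a stand-in for illustration; the function name `dft2_naive` and the random test image are assumptions, not part of the original implementation.

```python
import numpy as np

def dft2_naive(image):
    """2-D DFT by direct evaluation of the double sum over all pixels."""
    image = np.asarray(image, dtype=float)
    M, N = image.shape
    m = np.arange(M).reshape(M, 1)   # row index of each pixel
    n = np.arange(N).reshape(1, N)   # column index of each pixel
    out = np.zeros((M, N), dtype=complex)
    for x in range(M):
        for y in range(N):
            phase = np.exp(-2j * np.pi * (x * m / M + y * n / N))
            out[x, y] = np.sum(image * phase)
    return out

rng = np.random.default_rng(0)
f = rng.random((4, 6))                        # small hypothetical grayscale image
F = dft2_naive(f)
assert np.allclose(F, np.fft.fft2(f))         # the FFT computes the same values
assert np.allclose(np.fft.ifft2(F).real, f)   # the inverse recovers the pixels
```

The direct sum needs on the order of (MN)^2 operations, whereas the FFT produces the same coefficients in roughly MN log(MN) operations.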

In our work, we used a TensorFlow-based implementation of the Fast Fourier Transform to produce a transformed feature map in the Fourier domain, which is then provided to U-Net for object segmentation.

III-B U-Net Architecture

Ronneberger et al. proposed the U-Net architecture [RFB15a], which consists of a contracting path to capture the context and a symmetric expanding path that enables precise localization. This network is trained end-to-end using very few images and outperformed sliding-window convolutional networks such as OverFeat [sermanet2013overfeat].

Figure 1: Architecture of U-Net [RFB15a] on lowest resolution of pixels.

Figure 1 presents the architecture of U-Net, where each blue box is a multi-channel feature map. The number of channels is indicated on top of each box. White boxes represent copied feature maps. The arrows denote the various operations within the network.

IV Methodology

This section presents the architecture of the proposed model along with the developed algorithm. This work implements the U-Net [RFB15a] with a Fast Fourier Transform-based NN. The neural network uses the Fast Fourier Transform for the convolution between the image and the kernel (i.e., mask). We trained the NN with a labeled dataset [ljosa2012annotated], which consists of synthetic cell images and masks. Each pixel in the image is classified as either being part of a cell or not. The image is a matrix of RGB values. The mask (i.e., kernel) is a 2-D matrix with white and black values. Each pixel in the predicted mask represents the probability that the pixel is part of a cell. The output of the prediction model is clustered to identify the object location in a 2-D environment.

IV-A Model Overview

The proposed architecture is built upon the U-Net, which is a fully convolutional network. The U-Net is modified to yield better segmentation in medical imaging. The proposed model consists of a contracting path and an expansive path, which together form a U-shaped architecture. The contracting path comprises a fully convolutional network and includes repeated application of convolutions, each of which is followed by an Exponential Linear Unit (ELU) and a max-pooling operation. The contracting path decreases the spatial information while increasing the feature information. The expansive path combines the feature and spatial information through a sequence of up-convolutions and concatenations with high-resolution features from the contracting path. The up-convolution in this experiment is a de-convolution operation, which is used for up-sampling. In this model, we incorporate dropout in every block to avoid overfitting. Figure 2 displays the architecture of the proposed model.

Figure 2: Architecture of the proposed model.

It consists of:
  1. Input Layer.

    This layer takes the input image and performs Fast Fourier convolution by applying the Keras-based FFT function [chollet2015keras]. The input layer is composed of:

    1. A lambda layer with Fast Fourier Transform

    2. A 3x3 Convolution layer and activation function, and

    3. A lambda layer with Inverse Fast Fourier Transform.

  2. Contracting Path. The Contracting layer consists of four blocks where each block is a group of operations bound together logically. In this part of the architecture, each block entails:

    1. Two Convolution layers with activation function, and

    2. A Max pooling layer.

    This path applies convolution repeatedly. Each convolution is followed by an Exponential Linear Unit (ELU) and a max pooling operation with stride 2 for down-sampling. Here, at every max pooling step the feature channels are doubled.

  3. Bottleneck. The bottleneck is the connection between contracting and expanding paths. It consists of two convolutional layers with dropout:

    1. Two Convolution layers with activation functions, and

    2. A Convolution layer.

  4. Expanding Path. Every step in the expansive path consists of an up-sampling of the feature map, followed by a convolution to reduce the number of features. A concatenation is applied with the corresponding feature map from the contracting path. Furthermore, two convolutions with ELU are applied. The path consists of 4 blocks, each of which comprises:

    1. A de-convolution layer with stride 2,

    2. A concatenation with the corresponding cropped feature map from the contracting path,

    3. Two 3x3 convolution layers with activation functions,

    4. An output layer with convolutional layer, and

    5. At the final layer, a convolution layer is used to map each feature vector to the input class.
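The two operations that follow each convolution in the contracting path can be sketched in a few lines. The code below is a minimal NumPy illustration, not the Keras layers used in the paper: `elu` implements the Exponential Linear Unit, `max_pool_2x2` implements stride-2 max pooling, and `contracting_shapes` traces how the spatial size halves while the channel count doubles over four blocks. The starting size of 128 x 128 pixels with 16 channels is a hypothetical choice made only for this example.

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential Linear Unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0)))

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on an (H, W, C) array with even H and W."""
    h, w, c = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2, c)
    return blocks.max(axis=(1, 3))

def contracting_shapes(height, width, channels, blocks=4):
    """Shape after each down-sampling step: spatial size halves, channels double."""
    shapes = []
    for _ in range(blocks):
        height, width, channels = height // 2, width // 2, channels * 2
        shapes.append((height, width, channels))
    return shapes

# Hypothetical input size, chosen only for illustration.
print(contracting_shapes(128, 128, 16))
```

Because pooling keeps only the largest value in each 2 x 2 region, every block discards three quarters of the spatial positions while the additional channels retain more feature information.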

IV-B Process Flow

Figure 3 shows the process flow of the proposed approach. After importing the training dataset with images and masks for training the neural network, we resized the input images to limit their dimensions. Then a TensorFlow spectral function is applied to the images to compute the 2-D Fast Fourier Transform and the Inverse Fast Fourier Transform. This TensorFlow-based FFT layer is added to the neural network. The convolution of the input image and mask is then implemented using a 2-D convolutional layer, which performs element-wise multiplications between the Fourier transforms of the input and the mask. Subsequently, an Inverse Fourier Transform layer is applied to restore the original dimensions.
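The transform, multiply, and inverse-transform sequence can be sketched outside of any deep learning framework. The example below is a NumPy illustration under stated assumptions and is not the TensorFlow layer used in the paper: the kernel is zero-padded to the image size, both are transformed, the spectra are multiplied element-wise, and the inverse FFT returns an output with the original height and width. The result is a circular convolution, which is checked against a direct shift-and-sum reference; both function names are hypothetical.

```python
import numpy as np

def fft_conv2d(image, kernel):
    """Circular 2-D convolution: FFT both inputs, multiply element-wise, invert."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kernel_padded = np.zeros_like(image)
    kh, kw = kernel.shape
    kernel_padded[:kh, :kw] = kernel            # zero-pad the kernel to the image size
    product = np.fft.fft2(image) * np.fft.fft2(kernel_padded)
    return np.fft.ifft2(product).real           # same height and width as the input

def direct_circular_conv2d(image, kernel):
    """Reference: the same circular convolution computed by shifting and summing."""
    image = np.asarray(image, dtype=float)
    out = np.zeros_like(image)
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            out += kernel[i, j] * np.roll(image, shift=(i, j), axis=(0, 1))
    return out

rng = np.random.default_rng(1)
img = rng.random((8, 8))       # hypothetical 8 x 8 image
ker = rng.random((3, 3))       # hypothetical 3 x 3 kernel
assert np.allclose(fft_conv2d(img, ker), direct_circular_conv2d(img, ker))
```

A linear (non-circular) convolution would additionally require padding the image before the transform; that step is omitted here to keep the sketch short.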

The purpose of pooling in the contracting path is to reduce the size of the feature maps so that the network has fewer parameters. The NN therefore retains the most relevant features (max-valued pixels) from each region and discards the remaining pixels. The output of this layer is provided to a flattening layer, which flattens the feature structure into a single long feature vector to be used by the output layer for object identification.


IV-C Dropout

Dropout is the technique of randomly ignoring (“dropping out”) selected nodes during training to prevent overfitting in neural networks. For this experiment, dropout is applied to each block in the architecture. In Figure 3, one dropout rate is used for the blocks c1, c2, c8 and c9 and another for the blocks c3, c4, c5, c6 and c7.
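The mechanism can be made concrete with a minimal NumPy sketch of inverted dropout, a common formulation in which surviving activations are rescaled during training so that no rescaling is needed at inference. The rate of 0.2 used below is a hypothetical value for illustration only, and the function is not the Keras layer used in the experiments.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during training
    and rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones(1000)
y = dropout(x, rate=0.2, rng=rng)          # hypothetical rate, for illustration
print("fraction dropped:", np.mean(y == 0.0))
```

At inference time (`training=False`) the activations pass through unchanged.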

IV-D Training, Validation, and Testing

The model is trained on the labeled dataset, where each image has a mask. The mask helps to identify the position of the object of interest in the main image. Additionally, a validation set is used during model fitting for an unbiased evaluation of the model trained on the training dataset. The model is evaluated against the validation data to calculate the model metrics and to avoid overfitting. In this experiment, the validation data consists of images and masks, where each mask is compared with the prediction to compute the Intersection over Union (IoU) score.

After training the model, a scaled test set is used for an unbiased performance evaluation. The proposed model predicts the test mask from the test image dataset; the mask encodes the positions of the objects in the image. The test mask is resized, and local maxima are calculated on it to capture the high-intensity points. Further, the DBSCAN clustering algorithm is applied to aggregate the local maxima in order to capture the object locations and the number of objects in the image [ester_density-based_1996].

Figure 3: Process flow of the proposed architecture.

V Case Study

This section presents the results of a case study and the implementation of the proposed FFT-based neural network algorithm. In this study, the accuracy of the FFT-based CNN model is compared with that of the same model without FFT.

V-A Experimental Procedure and Data Collection

The simulated cell microscopy images available in the BBBC005v1 dataset [ljosa2012annotated] are divided into a training set of images and masks and a test set of images. The validation set comprises images and masks taken from the training set and is used to check the accuracy of the model during fitting. The model uses these processed datasets and predicts the mask, which is then further processed by resizing and calculating the local maxima. The processed image is clustered in order to identify the positions and the number of objects in the image.

V-B Results

The experiments are performed with two epoch settings on the FFT-based and non-FFT-based algorithms. Because of space limits, we discuss the results for 100 epochs only. The output obtained from the training is processed and clustered to identify the objects in the test image. For all the discussions on results, Figure 4 is used as one of the test images.

Figure 4: Original image.

The images shown in Figures 5 and 6 are the output results when Figure 4 is used as the test image. For the FFT-based algorithm, Figure 5(a) is the predicted mask and Figure 6(a) is the unsampled prediction. For the model without FFT, the predicted mask is shown in Figure 5(b) and the unsampled prediction in Figure 6(b).

(a) With FFT.
(b) Without FFT.
Figure 5: Mask predicted.
(a) With FFT.
(b) Without FFT.
Figure 6: Unsampled prediction.

Table I reports the performance metrics obtained for the model at 100 epochs with and without FFT. The evaluation metric used in this experiment is Intersection over Union (IoU), also known as the Jaccard index, which measures the accuracy of detecting objects. The average validation IoU obtained with the FFT-based neural network is 0.83, whereas without FFT it is 0.52 (Table I). As the results suggest, the accuracy of object detection is much higher when FFT is used.
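For reference, the IoU of two binary masks is the number of pixels set in both masks divided by the number of pixels set in at least one of them. The following is a minimal NumPy sketch; the function name and the convention of returning 1.0 for two empty masks are assumptions made for this example and do not describe the metric implementation used in the experiments.

```python
import numpy as np

def iou(pred_mask, true_mask):
    """Intersection over Union (Jaccard index) of two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return np.logical_and(pred, true).sum() / union

# Tiny hypothetical masks: one pixel in common, two pixels in the union.
pred = np.array([[1, 1], [0, 0]])
true = np.array([[1, 0], [0, 0]])
print(iou(pred, true))
```

A predicted probability mask would first need to be thresholded to obtain the binary mask that this function expects.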

FFT     Avg Valid. IoU    DICE Value    Mean IoU    CV Val. Loss    CV Val. Mean IoU
With    0.83              -0.8392       0.8747      -0.8753         0.8748
W/o     0.52              -0.5631       0.6815      -0.6025         0.6815
Table I: Metric and cost function values for the experiment.

The loss function used in this experiment is the DICE value [dice1945measures], which assesses the quality of the segmentation. As reported in Table I, the DICE value obtained is -0.8392 with the FFT-based algorithm and -0.5631 without it, showing that FFT achieves a better similarity. The Mean IoU for the final iteration is 0.8747 for the FFT-based algorithm and 0.6815 for the non-FFT-based algorithm.

For the FFT-based algorithm, the cross-validation loss obtained is -0.8753, and for the non-FFT-based algorithm it is -0.6025 (Table I). The cross-validation mean IoU is 0.8748 and 0.6815 for the FFT-based and non-FFT-based algorithms, respectively.
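The negative loss values in Table I are consistent with a loss defined as the negated Dice coefficient, although the exact definition is not given in the text, so this is an assumption. Under that assumption, a minimal NumPy sketch of a soft Dice coefficient and the corresponding loss is shown below; the smoothing term `eps` and the function names are illustrative.

```python
import numpy as np

def dice_coefficient(pred, true, eps=1e-7):
    """Soft Dice coefficient 2|A∩B| / (|A| + |B|) for masks with values in [0, 1]."""
    pred = np.asarray(pred, dtype=float).ravel()
    true = np.asarray(true, dtype=float).ravel()
    intersection = np.sum(pred * true)
    return (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)

def dice_loss(pred, true):
    """Negated Dice coefficient, so that minimizing the loss maximizes overlap."""
    return -dice_coefficient(pred, true)

# Tiny hypothetical masks: intersection 1, mask sizes 2 and 1.
pred = np.array([[1, 1], [0, 0]])
true = np.array([[1, 0], [0, 0]])
print(dice_coefficient(pred, true), dice_loss(pred, true))
```

For binary masks the Dice coefficient D and the IoU J are related by D = 2J / (1 + J), so a higher Dice value always corresponds to a higher IoU.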

Figures 7(a) and 7(b) illustrate the plots for loss and mean IoU obtained in the training and validation phases with FFT and without FFT, respectively. As the log-loss plots illustrate, the loss values decrease during training both with and without FFT. Moreover, the plots show that the log-loss values are very stable and consistent when the FFT-based approach is used, whereas the log-loss values fluctuate between epochs when FFT is not applied.

In the mean IoU plot, the x-axis shows the number of epochs and the y-axis shows the mean IoU value obtained for each epoch. Both the FFT-based and non-FFT-based approaches show similar trends. However, as the scale of the y-axis indicates, the mean IoU is much higher when FFT is utilized than when it is not.

(a) With FFT.
(b) Without FFT.
Figure 7: Log-loss and mean IoU metric plots for 100 epochs.

V-C Generating Local Maxima and Clustering Using DBSCAN

The input image for this experiment is the mask predicted by the model for the test image (Figure 5(a)). This image is resized and used as the input. The local maxima are calculated based on the brightness of the pixels in the resized image (Figure 8), which helps in identifying peaks using the peak local maxima function from scikit-image [scikit-image].

In this experiment, Figure 8 (left) is used as the input image to detect the local maxima. A maximum filter is first applied to the original image, giving Figure 8 (middle); this accentuates the high-intensity regions of the image. The local maxima function is then applied to Figure 8 (middle). The resulting image is Figure 8 (right), where the high-intensity pixels are identified using local maxima.
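The idea behind the maximum filter and the peak detection can be sketched with NumPy alone. This is a simplified stand-in for the scikit-image routine and not the code used in the experiments: a pixel is reported as a local maximum when it equals the maximum of its neighbourhood and exceeds a threshold. The window size of 3, the threshold, and the toy image are hypothetical values for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def maximum_filter(image, size=3):
    """Replace each pixel by the maximum of its size x size neighbourhood (odd size)."""
    image = np.asarray(image, dtype=float)
    r = size // 2
    padded = np.pad(image, r, mode="constant", constant_values=-np.inf)
    windows = sliding_window_view(padded, (size, size))
    return windows.max(axis=(-2, -1))

def local_maxima(image, size=3, threshold=0.0):
    """Coordinates (row, col) of pixels that equal their neighbourhood maximum
    and are brighter than `threshold`."""
    image = np.asarray(image, dtype=float)
    peaks = (image == maximum_filter(image, size)) & (image > threshold)
    return np.argwhere(peaks)

# Toy 7 x 7 image with two bright spots; (1, 2) is dimmer than its neighbour (1, 1).
img = np.zeros((7, 7))
img[1, 1], img[1, 2], img[5, 4] = 5.0, 4.0, 3.0
print(local_maxima(img, size=3, threshold=0.5))
```

Note that a flat plateau of equal values yields one peak per plateau pixel in this simplified version, which is one reason the detected peaks are aggregated by clustering afterwards.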

Figure 8: Original image, Original image with maximum filter, Image with local peak maxima.

The local maxima identified in Figure 8 (right) are then clustered using DBSCAN, so that the number of clusters gives the number of objects. Figure 9 shows the result of applying DBSCAN clustering to Figure 8. The clustering algorithm identified 20 clusters in the image.
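To show how the number of clusters translates into an object count, the following is a minimal from-scratch DBSCAN written in NumPy. It is a sketch for illustration only; a library implementation would normally be used, and the `eps` and `min_samples` values and the toy peak coordinates below are hypothetical rather than the settings used in the experiments.

```python
import numpy as np

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]  # includes i itself
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_samples:
            continue                      # not a core point; stays noise unless reached later
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster       # core or border point joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # expand only through core points
        cluster += 1
    return labels

# Toy peak coordinates: two tight groups and one isolated point.
peaks = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [50, 50]]
labels = dbscan(peaks, eps=1.5, min_samples=2)
n_objects = labels.max() + 1              # number of clusters = estimated object count
print(labels, n_objects)
```

Nearby peaks that belong to the same bright region fall into one cluster, while isolated points are treated as noise and do not contribute to the object count.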

Figure 9: DBSCAN cluster.

VI Conclusion and Future Work

In this work, an FFT-based convolutional neural network was successfully trained to identify the objects present in an image. The proposed FFT-based convolutional neural network improved the training time during convolution by reducing the number of steps needed to learn the features and train the network. By comparing the Intersection over Union (IoU) scores, we were able to determine the efficiency of the detection. The FFT improved the training time during convolution from 600-700 ms/step to 400-500 ms/step. We compared the FFT-based model with the non-FFT-based model and observed an improvement in the IoU scores. At 100 epochs, the IoU score of the FFT-based model was higher than that of the non-FFT-based model (Table I). The comparison of IoU scores implies that the FFT-based CNN has better accuracy than the non-FFT CNN.

In conclusion, FFT-based convolution implemented in a CNN architecture can improve the convolution time and accuracy. This work can be extended by comparing the model using multiple metrics and by comparing the current model with other object detection models. It can also include comparing the effects of the model on multiple datasets with varied objects. This study can be further improved by finding the object position in a 3D environment to identify its position more precisely.

These results are promising and warrant further investigation. The discrete Fourier transform (DFT) can be implemented as a single-layer NN [velik_discrete_2008], so it would seem that U-Net would be capable of learning this operation if it were useful for segmentation. Why U-Net is unable to transform or utilize the frequency domain of an image on its own is unclear, but it is clear that performing the FFT at the outset is advantageous for the learning process.
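The observation that the transform fits in a single layer can be illustrated in one dimension: the discrete Fourier transform is a fixed linear map, so a bias-free linear layer with real-valued cosine and sine weights and no activation reproduces it exactly. The sketch below is a NumPy illustration under that assumption; the function names are hypothetical and the construction is not taken from the cited work.

```python
import numpy as np

def dft_layer_weights(n):
    """Real-valued weights of a bias-free linear layer that outputs [Re(X), Im(X)]."""
    k = np.arange(n).reshape(n, 1)      # output frequency index
    t = np.arange(n).reshape(1, n)      # input sample index
    angle = 2.0 * np.pi * k * t / n
    return np.vstack([np.cos(angle), -np.sin(angle)])   # shape (2n, n)

def dft_layer(x):
    """Apply the layer: a single matrix-vector product with no activation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    out = dft_layer_weights(n) @ x
    return out[:n] + 1j * out[n:]       # recombine real and imaginary halves

rng = np.random.default_rng(2)
signal = rng.random(16)                  # hypothetical real-valued input
assert np.allclose(dft_layer(signal), np.fft.fft(signal))
```

The weight matrix has 2n x n entries, so the layer costs on the order of n^2 operations, whereas the FFT computes the same output in about n log n operations.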


This research work is supported by the National Institutes of Health (NIH) under Grant No. R15GM120669. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.