Food Image Recognition by Using Convolutional Neural Networks (CNNs)

12/03/2016 ∙ by Yuzhen Lu, et al. ∙ Michigan State University

Food image recognition is one of the promising applications of visual object recognition in computer vision. In this study, a small-scale dataset consisting of 5822 images of ten categories was compiled, and a five-layer CNN was constructed to recognize these images. The bag-of-features (BoF) model coupled with a support vector machine was first tested for comparison, resulting in an overall accuracy of 56%. Data expansion techniques were applied to increase the number of training images, which raised the accuracy to more than 90% and prevented the overfitting that occurred in the CNN trained without data expansion. Further improvement is within reach by collecting more images and optimizing the network architecture and relevant hyper-parameters.




I Introduction

Due to the widespread use of low-cost imaging devices such as smartphone cameras, a growing number of computer vision applications are being developed for automatic object recognition, among which recognition of food images has recently gained much attention [1, 2, 3, 4]. Nowadays, people, especially diabetes patients, are increasingly cautious about their diet for the sake of their health. Food image recognition provides a simple means to estimate dietary caloric intake and evaluate eating habits, by using cameras to keep track of food consumption.

In recent years, convolutional neural networks (CNNs) have enjoyed great popularity as a means of image classification since Krizhevsky et al. [5] won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2012 competition. A CNN, as a variant of the standard deep neural network (DNN), is characterized by a special network architecture consisting of alternating convolutional and pooling layers that extract and combine local features from a two-dimensional input. Compared to conventional approaches based on hand-crafted feature extraction, a CNN is advantageous because it learns features from images adaptively. Training a CNN for image classification, however, involves learning a large number of parameters, so researchers usually need a large-scale dataset such as that of the ILSVRC, which contains more than one million images [5, 7]; collecting such a dataset is not a trivial task. One simple way to deal with this situation is to apply a CNN model that has been pre-trained on large-scale image data [1, 8], an approach known as transfer learning. Alternatively, one can algorithmically expand the existing training data, e.g., by applying affine transformations (translation, rotation, scaling, etc.) to the raw images [9, 10].

This study aimed to apply CNNs, aided by data expansion techniques, to a small-scale ten-class food image dataset. For comparison, a conventional approach based on a bag-of-features (BoF) model combined with a linear support vector machine (SVM) was also employed for food image recognition. Experimental results demonstrated the superior performance of the CNN and the effectiveness of data expansion techniques in training on small-scale data.

II Experimentation

II.1 Database

A total of 5822 color images, representing ten classes of food items, were collected from ImageNet. The number of images in each category is summarized in TABLE 1.

Category Apple Banana Broccoli Burger Egg Frenchfry Hotdog Pizza Rice Strawberry
Image number 1050 310 327 519 626 296 639 1248 352 455
Table 1: Image number for each category of food items

These images varied widely in quality and size. Some have a neat, uniform background while others have a cluttered background. Prior to image analysis, all the images were down-sampled to a fixed resolution and then randomly divided into training and test sets at a 4:1 ratio, resulting in 4654 training and 1168 test images.
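The split sizes can be checked against TABLE 1. The sketch below assumes, without confirmation from the text, that the 4:1 split was applied category by category with the test share rounded up; under that assumption the totals come out to the reported 4654 and 1168. The name `split_sizes` is illustrative only.

```python
# Class counts copied from TABLE 1.
CLASS_COUNTS = {
    "Apple": 1050, "Banana": 310, "Broccoli": 327, "Burger": 519,
    "Egg": 626, "Frenchfry": 296, "Hotdog": 639, "Pizza": 1248,
    "Rice": 352, "Strawberry": 455,
}


def split_sizes(counts, test_parts=1, total_parts=5):
    """Return (train, test) totals for a per-class split.

    Assumption: each class contributes ceil(n * test_parts / total_parts)
    test images. The rounding rule is a guess, not taken from the paper.
    """
    test = 0
    for n in counts.values():
        test += -(-n * test_parts // total_parts)  # integer ceiling
    return sum(counts.values()) - test, test


train_total, test_total = split_sizes(CLASS_COUNTS)
print(train_total, test_total)  # 4654 1168
```

A purely random split of all 5822 images at 4:1 would give roughly 1164 test images, so the per-class rounding is one plausible explanation for the slightly larger test set.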

II.2 BoF

The BoF model has been extensively used for image classification. In this method, an image is treated as a collection of orderless descriptors extracted from local patches, which are quantized into discrete visual words and represented by a compact histogram [11]. Scale-invariant feature transform (SIFT) descriptors, which are designed to be robust to changes in perspective, scale, occlusion and illumination [12], are among the most robust and popular feature descriptors. In this study, the BoF model with SIFT descriptors was used to extract features that were then fed to a linear SVM for image classification; this approach was implemented with the VLFeat library [13].
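The quantization step of the BoF model can be illustrated with a minimal sketch. It is not the VLFeat pipeline used in the study: the vocabulary is given by hand rather than learned by clustering, the descriptors are toy 2-D points (real SIFT descriptors are 128-dimensional), and the function names are hypothetical.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to a descriptor (squared Euclidean)."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(vocabulary):
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bof_histogram(descriptors, vocabulary):
    """L1-normalised histogram of visual-word counts for one image."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts) or 1  # avoid dividing by zero for an empty image
    return [c / total for c in counts]


# Toy data: three visual words and four local descriptors.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 1.1), (0.2, 0.1), (0.1, 0.9)]
print(bof_histogram(descriptors, vocabulary))  # [0.5, 0.25, 0.25]
```

The resulting fixed-length histogram is what a linear SVM would receive as the feature vector for the image, regardless of how many local descriptors the image produced.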

II.3 CNN

The architecture of the CNN used in this study is schematically illustrated in FIG. 1. The network has four layers of hidden neurons (three convolutional-pooling and one fully-connected), apart from a final layer of output neurons (the input is not counted as a layer). The input neurons represent the RGB values of the down-sampled image. The first convolutional-pooling layer uses a local receptive field (also known as a convolutional kernel) with a stride length of 1 pixel to extract 32 feature maps, followed by a max pooling operation; the second and third convolutional-pooling layers apply their own local receptive fields, resulting in 64 and 128 feature maps, respectively, with the other parameters unchanged. The fourth layer is a fully-connected layer with 128 rectified linear unit (ReLU) neurons, and the output layer has 10 softmax neurons that correspond to the ten categories of food. The three convolutional-pooling layers also use ReLU activation functions.

Figure 1: Schematic of the architecture of a convolutional neural network model.
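To make the layer stack concrete, the sketch below traces how the spatial size of the feature maps shrinks through three convolutional-pooling stages. Only the feature-map counts (32, 64, 128) come from the text. The 128-pixel input, 3x3 kernels, 2x2 pooling and the absence of padding are hypothetical placeholders chosen for illustration, and `trace_feature_maps` is an illustrative name.

```python
def trace_feature_maps(input_size, layers, pool_size):
    """Spatial size and depth after each convolution + max-pooling stage.

    Assumes square inputs, stride-1 'valid' convolutions (no padding) and
    non-overlapping pooling; these are illustrative choices.
    """
    size, shapes = input_size, []
    for kernel_size, n_maps in layers:
        size = size - kernel_size + 1   # stride-1 valid convolution
        size = size // pool_size        # non-overlapping max pooling
        shapes.append((size, n_maps))
    return shapes


# Hypothetical sizes: 128-pixel input, 3x3 kernels, 2x2 pooling.
shapes = trace_feature_maps(128, [(3, 32), (3, 64), (3, 128)], pool_size=2)
print(shapes)  # [(63, 32), (30, 64), (14, 128)]

final_size, final_maps = shapes[-1]
print(final_size * final_size * final_maps)  # 25088 inputs to the dense layer
```

Under these placeholder sizes, most of the trainable parameters would sit in the connection between the last pooling output and the 128-unit fully-connected layer, which is also where the higher dropout rate is applied.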

The network was trained with the stochastic gradient descent (SGD) algorithm and a cross-entropy cost function. Dropout, which randomly eliminates a portion of neurons from the network, was used to reduce possible overfitting. Dropout rates of 0.25 and 0.5 were set for the third convolutional-pooling layer and the fully-connected layer, respectively. Training a CNN model requires selecting a set of hyper-parameters, among which the learning rate is the most critical one affecting training performance. A fixed learning rate over the entire training process seems suboptimal, since it does not take account of the dynamic training behavior. Here, a dynamically updated learning rate was used, which was an exponential function of the cost C (the training loss), with its constant parameter set to 0.001 through trial and error. This learning rate schedule is directly tied to the training performance. At the initial stage, the training loss was large, resulting in a large learning rate that sped up training; gradually, the learning rate decreased with the loss, which helped avoid overshooting the best result.

Given the limited images available, affine transformations including rotation, translation and scaling were applied to expand the training data. Upper bounds were set for these transformations, within which each image was randomly transformed, substantially increasing the training data size. FIG. 2 shows an example of a raw image and expanded images.
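The loss-dependent learning rate can be sketched as follows. The exact formula is not recoverable from the text, so the form used here, the constant 0.001 multiplied by the exponential of the training loss, is an assumption that merely matches the description (large loss, large learning rate). The names `ETA0` and `learning_rate` are illustrative.

```python
import math

ETA0 = 0.001  # value reported in the text; its role as a multiplier is assumed


def learning_rate(loss, eta0=ETA0):
    """Assumed schedule: eta = eta0 * exp(loss). The exact form is a guess."""
    return eta0 * math.exp(loss)


# The rate shrinks towards eta0 as the training loss falls.
for loss in (2.3, 1.0, 0.3):
    print(f"loss={loss:.1f} -> lr={learning_rate(loss):.5f}")
```

Under this assumed form the learning rate never drops below 0.001, and a loss near 2.3 (about the cross-entropy of a uniform guess over ten classes) gives a rate roughly ten times larger.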

Figure 2: The raw fruit image (right) and expanded images (left).
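The random affine expansion described above can be sketched by sampling a rotation, a scale and a translation within preset upper bounds and composing them into a single 2x3 matrix. The bounds and the 128-pixel image size in the usage lines are hypothetical, since the values used in the study are not given here, and the function names are illustrative. Resampling the pixels with the matrix would be left to an image library.

```python
import math
import random


def random_affine(width, height, max_rotation_deg, max_shift_frac, max_zoom, rng):
    """Sample a 2x3 affine matrix: rotate and scale about the centre, then shift.

    All bounds are hypothetical; the text does not state the values used.
    """
    angle = math.radians(rng.uniform(-max_rotation_deg, max_rotation_deg))
    scale = rng.uniform(1.0 - max_zoom, 1.0 + max_zoom)
    tx = rng.uniform(-max_shift_frac, max_shift_frac) * width
    ty = rng.uniform(-max_shift_frac, max_shift_frac) * height
    cx, cy = width / 2.0, height / 2.0
    a, b = scale * math.cos(angle), scale * math.sin(angle)
    # p' = s * R * (p - c) + c + t, written out as a 2x3 matrix
    return [
        [a, -b, cx - a * cx + b * cy + tx],
        [b, a, cy - b * cx - a * cy + ty],
    ]


def apply_affine(matrix, x, y):
    """Map one point (x, y) through a 2x3 affine matrix."""
    (a, b, c), (d, e, f) = matrix
    return a * x + b * y + c, d * x + e * y + f


rng = random.Random(0)  # fixed seed so the sketch is repeatable
matrix = random_affine(128, 128, max_rotation_deg=20, max_shift_frac=0.1,
                       max_zoom=0.2, rng=rng)
print(apply_affine(matrix, 64, 64))  # where the image centre lands
```

Drawing a fresh matrix for every image in every epoch means the network rarely sees exactly the same pixels twice, which is how the expansion increases the effective training set size.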

The CNN model was trained using the deep learning packages Keras and Theano; the implementation code is available online.

III Results and Discussion

The BoF combined with SVM resulted in overall accuracies (the ratio of the number of correctly recognized images to the total number of images) of 68% and 56% for the training and test images, respectively. TABLE 2 presents the confusion matrix on the test images and the recognition rates (i.e., the average of the true positive and true negative rates) for all the categories.

Table 2: Confusion matrix and recognition rates by BoF on the test images
Category    Apple  Banana  Broccoli  Burger   Egg  Frenchfry  Hotdog  Pizza  Rice  Strawberry
Apple         178       1         0       4     8          0       2      6     5           7
Banana          2      43         1       2     4          4       5      2     1           1
Broccoli        1       0        28       2     0          3       2     24     1           4
Burger          5       0         2      72     2          2       7     12     1           1
Egg            20       1         2       7    75          1       6      6     6           1
Frenchfry       1       4         3       4     1         21       6     16     1           0
Hotdog          5       8         4      10     4          9      76     10     2           1
Pizza           5       0         4       5     1          1       7    221     3           4
Rice            6       1         1       2     4          1       1     18    35           1
Strawberry     11       1         2       1     1          0       1     29     1          45
R.R.         0.89    0.82      0.71    0.83  0.79       0.67    0.78   0.87  0.74        0.73

  • R.R. denotes the recognition rate (i.e., the average of the true positive and true negative rates).

Apple and pizza gave the two highest recognition rates, mainly because these two categories had the largest numbers of training images, while french fry and broccoli were the two hardest categories, a large share of which were misclassified into another distinct class. The BoF-based approach did not perform as well as expected, possibly because SIFT descriptors were extracted only from grayscale images and because the parameters for descriptor quantization were not fully optimized.

The CNN was first implemented without data expansion. FIG. 3 shows the accuracy and loss curves during training, with the maximum number of training epochs set to 100. A large gap between training and test performance emerged after 10 epochs, indicating overfitting. The highest accuracy on the test images was 74%, corresponding to a training accuracy of 95%. These results were much better than those obtained by the BoF approach, which confirmed the superiority of the CNN.

Figure 3: Accuracy (left) and loss (right) curves for the convolutional neural network model without data augmentation. Training was stopped early because of no improvement over a preset period (20 epochs in this study).
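The early-stopping rule mentioned in the caption of FIG. 3 can be sketched as a patience counter. The sketch assumes that test accuracy is the monitored quantity, as the caption of FIG. 5 suggests; the function name and the toy accuracy trace are hypothetical.

```python
def early_stop_epoch(test_accuracies, patience=20):
    """Return the 1-based epoch at which training would stop.

    Stops once `patience` consecutive epochs pass without a new best test
    accuracy; returns the last epoch if that never happens.
    """
    best, epochs_since_best = float("-inf"), 0
    for epoch, acc in enumerate(test_accuracies, start=1):
        if acc > best:
            best, epochs_since_best = acc, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                return epoch
    return len(test_accuracies)


# Toy trace: the best value is reached at epoch 3 and never beaten.
trace = [0.50, 0.60, 0.70] + [0.65] * 30
print(early_stop_epoch(trace, patience=20))  # 23
```

With a patience of 20, a run capped at 100 epochs ends before the cap whenever the best test accuracy is reached by epoch 79 or earlier.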

Then, the CNN was trained with the expanded image data. As illustrated in FIG. 4, two significant improvements were achieved through almost trivial data transformations. First, the test accuracy rose to as much as 87% within 100 epochs; second, the overfitting observed above was eliminated. Data expansion substantially increases the effective size of the training data, thus improving training performance and helping the model generalize well.

Figure 4: Accuracy (left) and loss (right) curves for convolutional neural network models based on data augmentation.

According to the accuracy and loss curves, it seemed that the CNN model could be further improved by training for more epochs. Hence, three further training schemes with 200, 400 and 600 epochs were implemented. FIG. 5 shows their accuracy and loss curves. The test accuracy did increase further, but not by a large margin. After 200 epochs the training accuracy started to exceed the test accuracy, and the gap between them tended to widen. Training the CNN for 400 epochs resulted in the highest test accuracy of more than 90%, while training for 600 epochs only increased the training accuracy, not the test accuracy, which seemed to level off at around 90%. This suggested that overfitting was likely to escalate and that the test accuracy may not be increased further simply by training for more epochs. While data augmentation is an effective method for improving CNN performance, it does not overshadow the importance of manually collecting more training images.

Figure 5: Accuracy (left) and loss (right) curves for convolutional neural network models with training epochs set to 200, 400 and 600 (from top to bottom), where early stopping could occur because of no accuracy improvement on test images.

TABLE 3 presents the confusion matrix of the best CNN model on the test images and the corresponding recognition rates for each food category. Compared with TABLE 2, the CNN model improved the overall accuracy from 56% to 90% and the average recognition rate from 78% to 94%.

Table 3: Confusion matrix and recognition rates by CNN on the test images
Category    Apple  Banana  Broccoli  Burger   Egg  Frenchfry  Hotdog  Pizza  Rice  Strawberry
Apple         193       6         1       0     1          0       2      1     0           6
Banana          4      49         0       0     4          2       3      0     0           0
Broccoli        0       0        64       0     0          0       1      1     0           0
Burger          1       0         0      87     0          1       9      6     0           0
Egg             3       1         0       2   110          2       5      1     2           0
Frenchfry       0       2         0       0     0         53       1      4     0           0
Hotdog          1       2         0       5     0          3     109      8     0           0
Pizza           0       0         0       3     0          1       6    239     0           1
Rice            0       0         0       1     0          0       1      5    64           0
Strawberry      3       0         0       0     0          0       0      0     0          88
R.R.         0.95    0.89      0.98    0.91  0.93       0.94    0.91   0.96  0.95        0.98

  • R.R. denotes the recognition rate (i.e., the average of the true positive and true negative rates).
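The summary figures for the CNN can be recomputed from the confusion matrix in TABLE 3. The sketch assumes that rows are true classes and columns are predicted classes; that orientation is not stated in the text, but it reproduces the R.R. row. The function names are illustrative.

```python
LABELS = ["Apple", "Banana", "Broccoli", "Burger", "Egg",
          "Frenchfry", "Hotdog", "Pizza", "Rice", "Strawberry"]

# TABLE 3 (CNN, test images); rows assumed true classes, columns predictions.
CNN_CONFUSION = [
    [193, 6, 1, 0, 1, 0, 2, 1, 0, 6],
    [4, 49, 0, 0, 4, 2, 3, 0, 0, 0],
    [0, 0, 64, 0, 0, 0, 1, 1, 0, 0],
    [1, 0, 0, 87, 0, 1, 9, 6, 0, 0],
    [3, 1, 0, 2, 110, 2, 5, 1, 2, 0],
    [0, 2, 0, 0, 0, 53, 1, 4, 0, 0],
    [1, 2, 0, 5, 0, 3, 109, 8, 0, 0],
    [0, 0, 0, 3, 0, 1, 6, 239, 0, 1],
    [0, 0, 0, 1, 0, 0, 1, 5, 64, 0],
    [3, 0, 0, 0, 0, 0, 0, 0, 0, 88],
]


def overall_accuracy(matrix):
    """Correctly recognised images divided by all images."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def recognition_rates(matrix):
    """Average of true-positive and true-negative rates, one value per class."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    rates = []
    for k in range(n):
        tp = matrix[k][k]
        actual = sum(matrix[k])                          # images truly of class k
        predicted = sum(matrix[i][k] for i in range(n))  # images labelled k
        fp = predicted - tp
        negatives = total - actual
        rates.append((tp / actual + (negatives - fp) / negatives) / 2)
    return rates


rates = recognition_rates(CNN_CONFUSION)
print(round(overall_accuracy(CNN_CONFUSION), 3))  # 0.904
print([round(r, 2) for r in rates])
# [0.95, 0.89, 0.98, 0.91, 0.93, 0.94, 0.91, 0.96, 0.95, 0.98]
print(round(sum(rates) / len(rates), 2))          # 0.94
```

Because the true-negative rate is computed over the other nine classes pooled together, it is close to 1 for every category, so the recognition rate is always well above the per-class true-positive rate.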

IV Conclusions

This study reported on the application of CNNs to a small-scale ten-class food image dataset. A five-layer CNN model was constructed and achieved a best test accuracy of 74%, which was better than the 56% achieved by the BoF approach. The CNN model, however, suffered from noticeable overfitting due to limited training data. This issue was addressed by expanding the training data through various affine transformations, which also considerably increased the overall test accuracy to more than 90%. Training the CNNs for different numbers of epochs showed limited room for improving test accuracy. Further improvements can be anticipated from collecting more training data and/or optimizing the architecture and hyper-parameters of the network, rather than from increasing the training epochs in the current framework, which could be liable to overtraining.


  • (1) Yanai, K., Kawano, Y., 2015. Food image recognition using deep convolutional network with pre-training and fine-tuning, in Multimedia and Expo Workshops (ICMEW), 2015 IEEE International Conference on.
  • (2) Kagaya, H., Aizawa, K., Ogawa, M., 2014. Food detection and recognition using convolutional neural network, in Proc. of ACM International Conference Multimedia, pp. 1085-1088.
  • (3) Liu, C., Cao, Y., Luo, Y., Chen, G., Vokkarane, V., Ma, Y., 2016. DeepFood: Deep Learning-Based Food Image Recognition for Computer-Aided Dietary Assessment, in International Conference on Smart Homes and Health Telematics, pp. 37-48.
  • (4) Hou, L., Wu, Q., Sun, Q., Yang H., Li, P., 2016. Fruit Recognition Based On Convolution Neural Network, in 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), pp. 18-22.
  • (5) Krizhevsky, A., Sutskever I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems.
  • (6) Lecun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition, in Proceedings of the IEEE 86(11): 2278-2324.
  • (7) Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
  • (8) Oquab, M., Bottou, L., Laptev, I., Sivic, J., 2014. Learning and transferring mid-level image representations using convolutional neural networks, in Proc. of IEEE Computer Vision and Pattern Recognition.
  • (9) Simard, P.Y., Steinkraus, D., Platt, J.C., 2003. Best practices for convolutional neural networks applied to visual document analysis, in Proc. 7th Intl. Conf. Document Analysis and Recognition, pp. 958-962.
  • (10) Lu, Y., Lu, R., 2018. Detection of surface and subsurface defects of apples using structured-illumination reflectance imaging with machine learning algorithms. Transactions of the ASABE 61(6), 1831-1842.
  • (11) Csurka, G., Dance, C.R., Fan, L., Willamowski, J., Bray, C., 2004. Visual categorization with bags of keywords, in Workshop on Statistical Learning in Computer Vision, ECCV.
  • (12) Lowe, D.G., 2004. Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60(2): 91-110.
  • (13) Vedaldi, A., Fulkerson, B., 2008. VLFeat: An open and portable library of computer vision algorithms, in Proceedings of the 18th ACM international conference on Multimedia, pp.1469-1472.