Computer vision-based food calorie estimation: dataset, method, and experiment

05/22/2017 ∙ by Yanchao Liang, et al. ∙ East China University of Science and Technology

Computer vision has been introduced to estimate calories from food images. However, current food image datasets do not contain volume and mass records of foods, which leaves calorie estimation incomplete. In this paper, we present a novel food image dataset with volume and mass records of foods, together with a deep learning method for food detection, to enable complete calorie estimation. Our dataset includes 2978 images; every image comes with annotations for each food, volume and mass records, and a calibration reference. To estimate the calories of food in the proposed dataset, a deep learning method using Faster R-CNN is first applied to detect the food. The experimental results show that our method is effective for estimating calories and that our dataset contains adequate information for calorie estimation. Our dataset is the first released food image dataset that can be used to evaluate computer vision-based calorie estimation methods.




1 Introduction

Obesity is a medical condition in which excess body fat has accumulated to the extent that it may have a negative effect on health. People are generally considered obese when their Body Mass Index (BMI) is over 30 kg/m². High BMI is associated with an increased risk of diseases such as heart disease and type 2 diabetes[1]. Unfortunately, more and more people meet the criteria for obesity[2]. The main cause of obesity is the imbalance between the amount of food intake and the energy consumed by the individual. Therefore, to lose weight in a healthy way, as well as to maintain a healthy weight, daily food intake must be measured. However, current obesity treatment techniques require the patient to record all food intake each day. In most cases, patients have trouble estimating their food intake because of self-denial of the problem, lack of nutritional information, the manual process of writing this information down (which is tiresome and easily forgotten), and other reasons. Obesity treatment requires patients to eat healthy food and reduce their daily calorie intake, which means they must calculate and record the calories in their food every day. Computer vision-based measurement methods have been introduced to estimate calories directly from images, using a calibration object and food information, and obese patients have benefited considerably from these methods.

In recent years, many computer vision-based methods have been proposed to estimate calories[3, 4, 5, 6]. Among these methods, the accuracy of the estimation result is determined by two main factors: the object detection algorithm and the volume estimation method. For object detection, classification algorithms such as the Support Vector Machine (SVM)[7] are generally used to recognize the food’s type. For volume estimation, the calibration of the food and the volume calculation are the two key issues. For example, when a circular plate[3] is used as the calibration object, it is detected by ellipse detection, and the volume of the food is estimated by applying a corresponding shape model. Another example uses a person’s thumb as the calibration object: the thumb is detected by color space conversion[8], and the volume is estimated by simply treating the food as a column. However, the skin of the thumb is not a stable cue, and there is no guarantee that every person’s thumb can be detected. Involving human assistance[4] can improve the accuracy of estimation but takes more time, which makes it less useful for obesity treatment. Once the food’s volume is known, its calories are calculated by looking up its density in a food density table[9] and its energy in a nutrition table. Although the methods mentioned above have been used to estimate calories, their accuracy still needs to be improved in two respects: using widely available calibration objects and using more effective object detection algorithms.

The measurement methods above need a corresponding image dataset to train and test the object detection algorithm. Several food image datasets[10, 11, 12] have been created so far. Food-101[10] contains a large number of fast food images. Because the images in Food-101 do not include a calibration object as a reference, estimating calories is impossible with this dataset. The Pittsburgh Fast-food Image Dataset (PFID)[11] includes still images and videos of fast foods. The dataset named FOODD[12] comprises 3000 food images taken under different shooting conditions. Although these datasets can be used to train and test object detection algorithms, they are hard to use for calorie estimation because they do not provide volume and mass as a reference. A food image dataset for calorie estimation is still needed.

In this paper, a dataset named ECUST Food Dataset (ECUSTFD) and a novel calorie estimation method based on deep learning are proposed. ECUSTFD provides each food’s volume and mass records and uses the One Yuan (RMB) coin, which is widely available in daily life, as the calibration object. Our calorie estimation method takes two images as input: a top view and a side view of the food. Each image includes a calibration object, which is used to estimate the image’s scale factor. The food(s) and the calibration object are detected by an object detection method called Faster R-CNN, and each food’s contour is obtained by applying the GrabCut algorithm. After that, we estimate each food’s volume and calories.

The main contributions of this paper are listed as follows:

  1. Proposing the first released image dataset with volume and mass records for food calorie estimation.

  2. Proposing a complete and effective calorie estimation method.

2 Database Generation

2.1 Dataset Description

A suitable food dataset is necessary for assessing calorie estimation methods, which is why we created ECUSTFD.

As shown in Figure 1, ECUSTFD contains 19 kinds of food: apple, banana, bread, bun, doughnut, egg, fried dough twist, grape, lemon, litchi, mango, mooncake, orange, peach, pear, plum, qiwi, sachima, and tomato. The total number of food images is 2978. The number of images and the number of objects for each type are shown in Table 1. For a single food portion, we took several pairs of images with smartphones; each pair contains a top view and a side view of the food. Each image contains exactly one One Yuan coin as the calibration object and no more than two foods. If there are two foods in the same image, they are of different types. We provide two datasets for researchers: one with the original images and one with resized images. Each image in the resized dataset is smaller than 1000×1000 pixels.

Figure 1: ECUSTFD Sample Images
Food type           Number of images  Number of objects  Density (g/cm³)  Energy (kcal/g)
apple                            296                 19             0.78             0.52
banana                           178                 15             0.91             0.89
bread                             66                  7             0.18             3.15
bun                               90                  8             0.34             2.23
doughnut                         210                  9             0.31             4.34
egg                              104                  7             1.03             1.43
fried dough twist                124                  7             0.58            24.16
grape                             58                  2             0.97             0.69
lemon                            148                  4             0.96             0.29
litchi                            78                  5             1.00             0.66
mango                            220                 10             1.07             0.60
mix                              108                 14                /                /
mooncake                         134                  6             0.96            18.83
orange                           254                 15             0.90             0.63
peach                            126                  5             0.96             0.57
pear                             166                  6             1.02             0.39
plum                             176                  4             1.01             0.46
qiwi                             120                  8             0.97             0.61
sachima                          150                  5             0.22            21.45
tomato                           172                  4             0.98             0.27
Table 1: ECUSTFD

For each image, our dataset also provides the following information:

  1. Annotation. Our dataset provides bounding boxes for each object in every image. We only annotated the resized images, because the original images exceed our screen resolution and are hard to annotate. For example, we offer two bounding boxes for an image of a single apple: one for the apple and another for the calibration object.

  2. Mass. We provide the mass of each food. The mass is obtained with an electronic scale.

  3. Volume. Because only volume, not mass, can be estimated from food images, we also provide volume information as a reference. The volume is measured with the drainage (water displacement) method. Owing to the limitations of the measuring cup, the volumes measured in ECUSTFD are not as reliable as the masses.

  4. Density and Energy. To estimate calories, each food’s density and energy information must be provided. The density is calculated from the volume and mass information collected in ECUSTFD. For each kind of food, the energy is obtained from a nutrition table.
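
The relationship between the mass, volume, density, and energy records described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the units (grams, cubic centimetres, kcal per gram) are assumed, and the sample apple measurements are invented so that they reproduce the apple density listed in Table 1.

```python
def density_g_per_cm3(mass_g: float, volume_cm3: float) -> float:
    """Density from a weighed mass and a water-displacement volume."""
    if volume_cm3 <= 0:
        raise ValueError("volume must be positive")
    return mass_g / volume_cm3


def calories_kcal(volume_cm3: float, density: float, energy_kcal_per_g: float) -> float:
    """Calories = volume x density x energy per unit mass."""
    mass_g = volume_cm3 * density
    return mass_g * energy_kcal_per_g


# Hypothetical apple: 195 g on the scale, 250 cm^3 by displacement.
rho = density_g_per_cm3(195.0, 250.0)
# 0.52 is the apple energy value from Table 1 (unit assumed to be kcal/g).
kcal = calories_kcal(250.0, rho, 0.52)
print(f"density={rho:.2f} g/cm^3, calories={kcal:.1f} kcal")
```

At estimation time only the volume comes from the images; the density and energy values are looked up per food type.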

2.2 Shooting Conditions

We took into consideration the important factors that affect the accuracy of estimation results: camera, lighting, shooting angle, displacement, calibration object, and food type.

  1. Camera. We used an iPhone 4s and an iPhone 7 to take the photos. For the same scene, images taken with different cameras may differ because of differences in camera hardware and image processing algorithms. For most images in ECUSTFD, the size of an image taken by the iPhone 4s is 2448×3264 and the size of an image taken by the iPhone 7 is 3024×4032.

  2. Lighting. As people can eat at the table at any time, the photos in our dataset were taken under different lighting conditions. Some photos were taken in a dark environment, with or without flash.

  3. Shooting angles. When taking a top view, the shooting angle is almost 0 degrees from the table; when taking a side view, the shooting angle is almost 90 degrees from the table.

  4. Displacement. The position of the food in an image is not fixed: the food can be placed anywhere as long as it is captured completely by the camera. The same applies to the calibration object. In most cases the food is placed on a red or white plate; in other cases it is placed directly on the dining table.

  5. Calibration object. We chose the One Yuan coin as the dataset’s calibration object because it is easy to obtain in daily life. The diameter of the One Yuan coin is 25.0 mm, as shown in Figure 2. The coin can be detected by the Hough Transform[13] or by deep learning methods.

    Figure 2: Two Sides of One Yuan Coin
  6. Food type. To make volume and mass easy to obtain, we only chose foods that are big enough, stable, and not prone to deformation. The volume of a small food, such as a peanut, is hard to measure, and the measurement would show a large error relative to its real volume. Every food in our dataset is whole. We photograph a whole apple rather than a sliced apple, which makes it easy to measure volume and weight. In reality, the calorie content of an apple with skin is higher than that of the same apple without skin.
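
Because the physical diameter of the calibration coin is known, a detected coin gives a scale factor that converts pixels to real-world length. The sketch below shows one way this could be computed from a coin bounding box. It assumes the stated diameter of 25.0 is in millimetres; the function name, the box layout, and the `width_only` option (for views where the coin appears foreshortened) are illustrative assumptions rather than details from the paper.

```python
from typing import Tuple

COIN_DIAMETER_MM = 25.0  # One Yuan coin diameter, as stated in the text

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def mm_per_pixel(coin_box: Box, width_only: bool = False) -> float:
    """Scale factor (mm per pixel) from a coin bounding box.

    Assumption: the coin's apparent diameter is the mean of the box width
    and height, or the width alone when the coin looks like a flat ellipse.
    """
    x_min, y_min, x_max, y_max = coin_box
    width = x_max - x_min
    height = y_max - y_min
    if width <= 0 or height <= 0:
        raise ValueError("bounding box must have positive width and height")
    diameter_px = width if width_only else (width + height) / 2.0
    return COIN_DIAMETER_MM / diameter_px


top_scale = mm_per_pixel((100, 200, 150, 250))                    # 50 px wide and tall
side_scale = mm_per_pixel((300, 400, 340, 410), width_only=True)  # 40 px wide
print(top_scale, side_scale)  # 0.5 0.625
```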

2.3 Access

ECUSTFD is a free public food image dataset. The dataset with original images and no annotations is publicly available online. The small image dataset, including annotations and volume and mass information, is available online as well. Instructions are provided on those websites.

3 Calorie Estimation Method

3.1 Calorie Estimation Method Based On Deep Learning

Before presenting the experimental results, we briefly introduce our calorie estimation method.

Our goal is to help obese people calculate the calories they get from food. Figure 3 shows the flowchart of the proposed system. To estimate calories, the user takes a top view and a side view of the food with a smartphone before eating. Each image used for estimation must include a One Yuan coin. For the top view, we use deep learning algorithms to recognize the types of food and apply image segmentation to identify each food’s contour in the photo. The side view is processed in the same way. Then the volume of each food is calculated based on the calibration objects in the images. Finally, the calories of each food are obtained by looking up the density table and the nutrition table.

Figure 3: Calorie Estimation Flowchart

To obtain better results, we use Faster Region-based Convolutional Neural Networks (Faster R-CNN)[14] to detect objects and GrabCut[15] as the segmentation algorithm.

3.2 Object Detection With Deep Learning Methods

We do not use a semantic segmentation method such as Fully Convolutional Networks (FCN)[16]; instead, we use Faster R-CNN. Faster R-CNN is a framework based on deep convolutional networks. It includes a Region Proposal Network (RPN) and an Object Detection Network[14]. Given an RGB image as input, it outputs a series of bounding boxes, and a class is predicted for each bounding box.
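
The detector output described above, a set of labeled bounding boxes, has to be separated into the calibration object and the foods before later steps can use it. The following sketch illustrates that bookkeeping only; it does not run Faster R-CNN. The `Detection` record, the `"coin"` label, and the 0.5 score threshold are hypothetical choices made for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Detection(NamedTuple):
    label: str                               # predicted class name
    score: float                             # detector confidence in [0, 1]
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)


def split_detections(
    detections: List[Detection],
    coin_label: str = "coin",
    min_score: float = 0.5,
) -> Tuple[Optional[Detection], List[Detection]]:
    """Return (highest-scoring coin or None, list of food detections)."""
    kept = [d for d in detections if d.score >= min_score]
    coins = [d for d in kept if d.label == coin_label]
    foods = [d for d in kept if d.label != coin_label]
    best_coin = max(coins, key=lambda d: d.score) if coins else None
    return best_coin, foods


detections = [
    Detection("apple", 0.97, (120, 80, 420, 390)),
    Detection("coin", 0.99, (450, 300, 500, 350)),
    Detection("banana", 0.30, (10, 10, 60, 40)),  # below threshold, dropped
]
coin, foods = split_detections(detections)
print(coin.label, [f.label for f in foods])  # coin ['apple']
```

An image pair without a detected coin cannot be scaled, so a caller would need to handle the `None` case explicitly.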

3.3 Image Segmentation

Before estimating volume, we first segment the content of each bounding box. GrabCut is an image segmentation approach based on optimization by graph cuts[15]. GrabCut requires the user to draw a bounding box around the object; such boxes can be provided by Faster R-CNN. Although asking the user to label foreground and background colors can give better results, we avoid it so that our system can complete calorie estimation without user assistance. For each bounding box, we obtain a precise contour after applying the GrabCut algorithm. Then we can estimate each food’s volume and calories.
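
To make the box-based initialization concrete, the sketch below builds the starting label mask that a rectangle-initialized GrabCut works from: pixels outside the box are definite background, and pixels inside are probable foreground to be refined by the graph-cut iterations. It is a pure-Python illustration that does not run GrabCut itself. The label values 0 and 3 match the constants OpenCV uses for its GrabCut masks; the function name and the half-open box convention are assumptions.

```python
from typing import List, Tuple

GC_BGD = 0     # definite background
GC_PR_FGD = 3  # probable foreground


def init_mask_from_box(
    height: int, width: int, box: Tuple[int, int, int, int]
) -> List[List[int]]:
    """Initial GrabCut-style mask for a box (x_min, y_min, x_max, y_max).

    The box is treated as half-open: x_max and y_max are excluded.
    """
    x_min, y_min, x_max, y_max = box
    mask = [[GC_BGD] * width for _ in range(height)]
    for y in range(max(0, y_min), min(height, y_max)):
        for x in range(max(0, x_min), min(width, x_max)):
            mask[y][x] = GC_PR_FGD
    return mask


mask = init_mask_from_box(4, 5, (1, 1, 4, 3))
for row in mask:
    print(row)
```

In practice the detector's food box would play the role of `box`, and a library implementation would then refine the probable-foreground region into the food contour.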

3.4 Volume Estimation And Calorie Calculation

From the One Yuan coin detected in the top view, the true size of a pixel is known. Similarly, the coin in the side view gives the actual size of a pixel in the side view. We then use different formulas to estimate the volume of each food. Once the volume is known, the food’s calories are obtained by looking up the related tables.
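
The text does not spell out the volume formulas, so the following is only an illustrative sketch of how two scaled views can be combined. It assumes the food is roughly ellipsoidal, that the top view supplies length and width in pixels, and that the side view supplies height in pixels; the function name and all input values are hypothetical, and this should not be read as the formula used in the experiments.

```python
import math


def ellipsoid_volume_cm3(
    length_px: float,
    width_px: float,
    height_px: float,
    top_mm_per_px: float,
    side_mm_per_px: float,
) -> float:
    """Volume of an ellipsoid whose three axes are measured in two views."""
    a = length_px * top_mm_per_px    # axis lengths in millimetres
    b = width_px * top_mm_per_px
    c = height_px * side_mm_per_px
    volume_mm3 = math.pi / 6.0 * a * b * c
    return volume_mm3 / 1000.0       # 1 cm^3 = 1000 mm^3


# A food 160 px long and wide in the top view (0.5 mm/px) and
# 200 px tall in the side view (0.4 mm/px): an 80 mm sphere.
volume = ellipsoid_volume_cm3(160, 160, 200, 0.5, 0.4)
print(round(volume, 1))  # 268.1
```

Multiplying the resulting volume by the density and energy values for the detected food type gives the calorie estimate.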

4 Experiment

In this section, we present the volume estimation results on our food image dataset. The food and fruit images are divided into training and test sets. To avoid using training images to estimate volumes, the images of the two sets are selected in order rather than at random.
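
An ordered split of this kind could look like the sketch below, which keeps all images of one food object on the same side of the split. The record layout, the file names, and the 0.5 training fraction are hypothetical; the actual image counts used in the experiments are the ones reported in Figure 4.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (object_id, image_name), in acquisition order


def ordered_split_by_object(
    records: List[Record], train_fraction: float
) -> Tuple[List[Record], List[Record]]:
    """Put the leading objects in the training set and the rest in the test set."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    object_ids: List[str] = []
    for object_id, _ in records:
        if object_id not in object_ids:
            object_ids.append(object_id)   # preserve first-seen order
    cut = int(len(object_ids) * train_fraction)
    train_ids = set(object_ids[:cut])
    train = [r for r in records if r[0] in train_ids]
    test = [r for r in records if r[0] not in train_ids]
    return train, test


records = [
    ("apple001", "apple001_top.jpg"), ("apple001", "apple001_side.jpg"),
    ("apple002", "apple002_top.jpg"), ("apple002", "apple002_side.jpg"),
    ("apple003", "apple003_top.jpg"), ("apple003", "apple003_side.jpg"),
    ("apple004", "apple004_top.jpg"), ("apple004", "apple004_side.jpg"),
]
train, test = ordered_split_by_object(records, train_fraction=0.5)
print(sorted({obj for obj, _ in train}), sorted({obj for obj, _ in test}))
```

Splitting on object boundaries keeps each top and side pair together, so a volume is never estimated from a pair that was partly seen during training.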

Figure 4: Image Number in Experiment

The numbers of training and test images used for Faster R-CNN are listed in Figure 4. After Faster R-CNN is well trained, we use the pairs of test images that Faster R-CNN recognizes correctly to estimate volumes. In other words, test images that Faster R-CNN fails to identify or misidentifies are discarded. The numbers of images used in the volume estimation experiments are also shown in Figure 4. The code can be downloaded from the project website. We use mean error to evaluate the volume estimation results. The mean error ME is defined as:

$$ME_i = \frac{1}{N_i} \sum_{j=1}^{N_i} \frac{v_j - V_j}{V_j} \qquad (1)$$

In Equation 1, for food type $i$, $2N_i$ is the number of images Faster R-CNN recognizes correctly. Since we use two images to calculate one volume, the number of estimated volumes for the $i$th type is $N_i$. $v_j$ is the estimated volume for the $j$th pair of images of food type $i$, and $V_j$ is the corresponding reference volume for the same pair of images.
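
The mean error for one food type can be computed as in the sketch below. It assumes the error is the signed relative difference between each estimated volume and its reference volume, averaged over the image pairs; the function name and sample values are hypothetical.

```python
from typing import Sequence


def mean_error(estimated: Sequence[float], reference: Sequence[float]) -> float:
    """Mean signed relative error between estimated and reference volumes."""
    if len(estimated) != len(reference) or len(estimated) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((v - ref) / ref for v, ref in zip(estimated, reference))
    return total / len(estimated)


# Three image pairs of one food type, volumes in cm^3 (invented values).
me = mean_error([110.0, 90.0, 120.0], [100.0, 100.0, 100.0])
print(f"{me:.3f}")  # 0.067
```

Because the error is signed under this assumption, overestimates and underestimates can cancel, so a small mean error does not by itself imply small per-image errors.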

Volume estimation results are shown in Figure 5. For most types of food in our experiment, the estimated volumes are close to the reference volumes. The mean error between the estimated volume and the true volume does not exceed 20%, except for banana, grape, and mooncake. For some food types, such as orange, our estimation result is very close to the true value. The poor estimation results for grape are due to the way we measured the grapes’ volumes. When measuring them, we did not cover the grapes with plastic wrap but put them into the water directly, so the volumes we estimated are far from the reference values because of the gaps between the grapes. Overall, our estimation method is feasible.

Figure 5: Volume Estimation Results

5 Conclusion

In this paper, we presented a dataset called ECUSTFD. A main feature of ECUSTFD is that each food’s volume and mass records are provided. We also reported experimental results obtained with deep learning algorithms on ECUSTFD.


  • [1] Wei Zheng, Dale F. Mclerran, Betsy Rolland, Xianglan Zhang, Manami Inoue, Keitaro Matsuo, Jiang He, Prakash Chandra Gupta, Kunnambath Ramadas, and Shoichiro Tsugane, “Association between body-mass index and risk of death in more than 1 million asians,” New England Journal of Medicine, vol. 364, no. 8, pp. 719–29, 2011.
  • [2] Mariachiara Di Cesare, James Bentham, Gretchen A. Stevens, Bin Zhou, Goodarz Danaei, Yuan Lu, Honor Bixby, Melanie J. Cowan, Leanne M. Riley, and Kaveh Hajifathalian, “Trends in adult body-mass index in 200 countries from 1975 to 2014: a pooled analysis of 1698 population-based measurement studies with 19·2 million participants,” Lancet, vol. 387, no. 10026, pp. 1377–1396, 2016.
  • [3] W. Jia, H. C. Chen, Y. Yue, Z. Li, J. Fernstrom, Y. Bai, C. Li, and M. Sun, “Accuracy of food portion size estimation from digital pictures acquired by a chest-worn camera,” Public Health Nutrition, vol. 17, no. 8, pp. 1671–81, 2014.
  • [4] Zhou Guodong, Qian Longhua, and Zhu Qiaoming, “Determination of food portion size by image processing,” 2008, pp. 119–128.
  • [5] Y. Bai, C. Li, Y. Yue, W. Jia, J. Li, Z. H. Mao, and M. Sun, “Designing a wearable computer for lifestyle evaluation,” in Bioengineering Conference, 2012, pp. 93–94.
  • [6] Parisa Pouladzadeh, Pallavi Kuhad, Sri Vijay Bharat Peddi, Abdulsalam Yassine, and Shervin Shirmohammadi, “Mobile cloud based food calorie measurement,” pp. 1–6, 2014.
  • [7] Johan AK Suykens and Joos Vandewalle, “Least squares support vector machine classifiers,” Neural processing letters, vol. 9, no. 3, pp. 293–300, 1999.
  • [8] Gregorio Villalobos, Rana Almaghrabi, Parisa Pouladzadeh, and Shervin Shirmohammadi, “An image procesing approach for calorie intake measurement,” in IEEE International Symposium on Medical Measurements and Applications Proceedings, 2012, pp. 1–5.
  • [9] FAO/INFOODS, U. Ruth Charrondière, David Haytowitz, and B. Stadlmayr, “Fao/infoods density database version 1.0,” 2012.
  • [10] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool, “Food-101–mining discriminative components with random forests,” in European Conference on Computer Vision. Springer, 2014, pp. 446–461.
  • [11] Mei Chen, Kapil Dhingra, Wen Wu, Lei Yang, Rahul Sukthankar, and Jie Yang, “Pfid: Pittsburgh fast-food image dataset,” in IEEE International Conference on Image Processing, 2009, pp. 289–292.
  • [12] Parisa Pouladzadeh, Abdulsalam Yassine, and Shervin Shirmohammadi, “Foodd: An image-based food detection dataset for calorie measurement,” in International Conference on Multimedia Assisted Dietary Management, 2015.
  • [13] Dimitrios Ioannou, Walter Huda, and Andrew F Laine, “Circle recognition through a 2d hough transform and radius histogramming,” Image and vision computing, vol. 17, no. 1, pp. 15–26, 1999.
  • [14] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  • [15] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake, “Grabcut: Interactive foreground extraction using iterated graph cuts,” in ACM transactions on graphics (TOG). ACM, 2004, vol. 23, pp. 309–314.
  • [16] Jonathan Long, Evan Shelhamer, and Trevor Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.