Deep Cooking: Predicting Relative Food Ingredient Amounts from Images

09/26/2019 ∙ by Jiatong Li, et al. ∙ Rutgers University ∙ Samsung

In this paper, we study the novel problem of not only predicting ingredients from a food image, but also predicting the relative amounts of the detected ingredients. We propose two prediction-based models using deep learning that output sparse and dense predictions, coupled with important semi-automatic multi-database integrative data pre-processing, to solve the problem. Experiments on a dataset of recipes collected from the Internet show the models generate encouraging experimental results.




1. Introduction

Increased awareness of the impact of food consumption on health and lifestyle has given rise to novel data-driven food analysis systems, e.g., (Beijbom et al., 2015; Meyers et al., 2015), whose goal is to alleviate the challenge of tracking daily food intake. Many of these systems use data modalities such as images to seamlessly extract information related to the food item that was consumed, often the identity of the meal or its ingredients, or even its caloric value. While these systems frequently claim to predict the energy intake, they base these predictions on standard energy tables of standardized ingredients (e.g., USDA). The estimation of the food amount, a highly challenging and often ambiguous task, is delegated to the users themselves. Even the systems that aim to predict a fine-grained ingredient-based representation of the food item, e.g., (Salvador et al., 2017) and (Chen and Ngo, 2016), do not consider the problem of predicting the ingredients’ amounts or the relative contributions of the food items in each dish. However, these amounts are paramount for estimating the correct energy value of the meal. A small amount of high-fat food might not be a major health risk factor, while an unhealthy ingredient in a dominant amount may lead to health problems.

In this paper, we study the novel problem of predicting the relative amount of each ingredient in a food item from images. To our knowledge, this is the first work that analyzes the amount of each ingredient in detail. The problem is both interesting and challenging. It is interesting because we can analyze the nutrients of a food in detail and how each ingredient in the food contributes to health. Some ingredients can be replaced with their low-calorie counterparts and some can be replaced with vegan substitutes according to users’ dietary choices. We can also modify the amounts to create healthier foods. It is challenging because the shape and color of even the same ingredient can exhibit large visual differences due to diverse ways of cutting and cooking (Chen and Ngo, 2016). Analyzing the amounts of ingredients may also suffer from occlusion. We attack the problem with deep learning models which output the amount of each ingredient.

The contributions of this paper are:

  • We propose a novel and challenging problem: analyzing the relative amounts of each ingredient from a food image.

  • We propose prediction-based models using deep learning to solve the problem.

2. Related Work

As food plays an essential part in our life, there has been a lot of research focusing on food classification, cross-modal retrieval, ingredient analysis and volume estimation.

Food Classification. Initially, hand-crafted features and traditional classifiers were used to classify food images. (Bossard et al., 2014) uses random forests and SVMs. (Beijbom et al., 2015) uses a bag-of-visual-words approach with an SVM classifier to classify restaurant-specific food and calculate the calories of a meal from a given restaurant. The traditional methods are outperformed by methods using deep learning features or directly using deep learning. (Ciocca et al., 2017) uses an SVM classifier with CNN features. (Wang et al., 2015) also uses deep learning features for classification. (Singla et al., 2016), (Chen et al., 2017) and (Mezgec and Koroušić Seljak, 2017) train deep learning models for food recognition.

Cross-Modal Retrieval. Given an image, the task is to retrieve its recipe from a collection of test recipes or, in the other direction, to retrieve the corresponding image given a recipe. (Salvador et al., 2017) finds a joint embedding of recipes and images for the image-recipe retrieval task. (Marin et al., 2019) expands the Recipe1M dataset (Salvador et al., 2017) from 800k images to 13M. However, they expand it with images from web search using recipe titles as queries, so it is likely that the web images no longer share the same ingredients or instructions as the original ones. They also annotate amount or nutrition information for 50k recipes, while we annotate 250k recipes with our method, 80k of which have images. (Chen et al., 2018) uses attention to find the embedding for cross-modal retrieval. (Wang et al., 2019) proposes ACME, which learns cross-modal embeddings by imposing modality alignment using an adversarial learning strategy and imposing cross-modal translation consistency.

Ingredient Analysis. The task is to predict which ingredients are present given a food image. (Chen and Ngo, 2016) uses multi-task deep learning and a graph modeling ingredient co-occurrences. (Salvador et al., 2019) predicts ingredients as sets and generates cooking instructions by attending to both the image and its inferred ingredients simultaneously. Unfortunately, previous work discards quantity information, which is needed to estimate the nutritional content of a dish.

Volume Estimation. Estimating food amounts, especially absolute amounts, is not a novel problem in itself. Some methods focus on multi-view 3D reconstruction while others focus on single-view volume estimation. Multi-view reconstruction dates back to 1994 (Shashua and Navab, 1994). Even a more recent work using two views (Dehais et al., 2017) uses only traditional computer vision techniques. (Zheng et al., 2018) uses a simple contour-matching method, but the performance might come from an easy dataset. Although (Liang and Li, 2017) uses deep learning, it is only for object detection, and they use a simple method to calculate volume.

As for single view methods, (Meyers et al., 2015) tries to recognize the contents of a meal from a single image then predicts the calories for home-cooked foods. Their method is based on deep learning for classification and depth estimation to calculate volume and calories. (Fang et al., 2018) uses GANs but it requires densely annotated datasets. (Ege et al., 2019) reviews three of their existing systems and proposes two novel systems, either using size-known reference objects or foods, including rice grains, or using special mobile devices, like built-in inertial sensors and stereo cameras.

Estimating relative amounts, on the other hand, does not require reference objects or special devices. Our methods estimating relative ingredient amounts can be applied to food images on the internet, where camera information and size-known reference objects are not available.

Furthermore, previous work calculates the amount of a food as a whole, while in our work we go into detail and estimate the amount of each ingredient using deep learning models.

3. Methods

Let the amounts of ingredients of a recipe be $\mathbf{y} = (y_1, \dots, y_N)$, where $N$ is the total number of ingredients and $y_i \ge 0$ is the amount of the $i$-th ingredient in grams, with $y_i = 0$ when the ingredient is not present in the recipe. Suppose $c > 0$ is a constant; then $c\mathbf{y}$ also represents the same recipe. Therefore we assume that $\mathbf{y}$ is normalized such that $\sum_{i=1}^{N} y_i = C$, where $C$ is a constant. $y_i$ can then be interpreted as the proportion of the $i$-th ingredient: there are $y_i$ grams of the ingredient in every $C$ grams of total ingredients.

Let the amount prediction of an image corresponding to the recipe be $\hat{\mathbf{y}} = (\hat{y}_1, \dots, \hat{y}_N)$, normalized such that $\sum_{i=1}^{N} \hat{y}_i = C$.
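The normalization described above can be illustrated with a minimal sketch. The function name `normalize_amounts`, the parameter `total` (standing in for the normalization constant) and the four-ingredient vocabulary are assumptions made for illustration, not part of the original method description.

```python
import numpy as np

def normalize_amounts(grams, total=1.0):
    """Rescale a non-negative amount vector so that its entries sum to `total`."""
    grams = np.asarray(grams, dtype=float)
    if np.any(grams < 0):
        raise ValueError("ingredient amounts must be non-negative")
    s = grams.sum()
    if s == 0:
        raise ValueError("recipe has no ingredient with a known amount")
    return grams * (total / s)

# Hypothetical 4-ingredient vocabulary: flour, sugar, butter, salt.
recipe = [200.0, 100.0, 100.0, 0.0]
doubled = [400.0, 200.0, 200.0, 0.0]  # the same recipe scaled by 2

proportions = normalize_amounts(recipe)             # entries sum to 1
per_kilogram = normalize_amounts(doubled, 1000.0)   # grams per 1000 g of ingredients
print(proportions, per_kilogram)
```

A recipe and any scaled copy of it map to the same normalized vector, which is why only relative amounts are needed.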

We propose two methods. The first outputs dense amount predictions, which users can threshold according to their applications. The second outputs sparse amount predictions, just like real recipes, which usually use only a few ingredients.

3.1. Dense Method

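When $C = 1$, both the ground truth $\mathbf{y}$ and the prediction $\hat{\mathbf{y}}$ can be viewed as probability distributions. If the last layer of a neural network is activated by softmax, the prediction is naturally normalized to $\sum_i \hat{y}_i = 1$. Therefore we use softmax to activate the last fully connected layer of a neural network which predicts $\hat{\mathbf{y}}$ given the image $\mathbf{x}$. The loss function to minimize is the cross entropy of $\mathbf{y}$ and $\hat{\mathbf{y}}$, $-\sum_{i} y_i \log \hat{y}_i$.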
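As a rough illustration of the dense method, the following NumPy sketch applies softmax to a vector of raw scores and evaluates the cross entropy against a normalized ground-truth vector. The scores and the four-ingredient target are made-up values; in practice these operations would run inside a deep learning framework on the last fully connected layer of the network.

```python
import numpy as np

def softmax(logits):
    """Map raw scores to a strictly positive vector that sums to 1."""
    z = logits - np.max(logits)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(target, pred, eps=1e-12):
    """Cross entropy -sum_i target_i * log(pred_i); eps guards against log(0)."""
    return float(-np.sum(target * np.log(pred + eps)))

# Hypothetical raw network outputs for a 4-ingredient vocabulary.
logits = np.array([2.0, 1.0, 0.5, -1.0])
# Ground-truth proportions (normalized to sum to 1).
target = np.array([0.5, 0.25, 0.25, 0.0])

pred = softmax(logits)
loss = cross_entropy(target, pred)
print(pred, loss)
```

Because softmax never outputs an exact zero, every ingredient receives some mass, which is what makes this output dense.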


3.2. Sparse Method

ReLU activation sets negative values to 0, which introduces sparsity in the output, and the L1 loss also encourages sparsity. Here, the last fully connected layer of the neural network is activated by ReLU and the loss function is the L1 distance between the ground truth $\mathbf{y}$ and the prediction $\hat{\mathbf{y}}$, $\sum_i |y_i - \hat{y}_i|$.
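The sparse method can be sketched in the same way. The raw scores below are invented for illustration, and the sketch only shows the activation and the loss, not the training loop.

```python
import numpy as np

def relu(x):
    """Set negative entries to zero and keep positive entries unchanged."""
    return np.maximum(x, 0.0)

def l1_distance(target, pred):
    """Sum of absolute differences between two amount vectors."""
    return float(np.sum(np.abs(target - pred)))

# Hypothetical raw outputs of the last fully connected layer.
raw = np.array([0.6, -0.2, 0.3, -0.05])
target = np.array([0.5, 0.25, 0.25, 0.0])

pred = relu(raw)                  # negative scores become exact zeros
loss = l1_distance(target, pred)  # 0.1 + 0.25 + 0.05 + 0.0
print(pred, loss)
```

Unlike softmax, ReLU can produce exact zeros, so an ingredient with a zero output is simply treated as absent.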

4. Experiments

4.1. Dataset

We use data from the Recipe1M dataset (Salvador et al., 2017). Because the fraction slash used on recipe websites is not an ASCII character, it is missing from the dataset after preprocessing. We therefore re-scrape recipe ingredients from well-structured websites, where the quantity and ingredient names are already parsed.

Unit Extraction. We define 3 types of units:

  • Basic units, including volume, weight and counting units (e.g. box). We define 52 of them and list them in Appendix A.

  • Size modified units. Usually a counting unit after large/big/medium/small (e.g. small scoop of ice cream). We treat "large/big" as a multiplier of 1.2, "medium" as 1.0 and "small" as 0.8.

  • Numbers and basic units combined in parentheses (e.g. 1 (15 1/4 ounce) box of cake mix).

We only keep a recipe if, for every ingredient in the recipe, the words between the quantity and the ingredient name belong to the units we defined. If an ingredient in a recipe does not have a quantity field (e.g. some salt), we ignore that ingredient in the recipe. This step only removes 6% of the re-scraped recipes.
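The first two unit types can be sketched as a small lookup. The size multipliers come from the text; the unit subset, the function name `parse_unit`, and the choice to treat a size word followed directly by the ingredient name as a count are assumptions for illustration. Parenthesized units are left out of this sketch.

```python
# Multipliers for size words, as given in the text.
SIZE_MULTIPLIER = {"large": 1.2, "big": 1.2, "medium": 1.0, "small": 0.8}

# Illustrative subset only; the full list of basic units is in Appendix A.
BASIC_UNITS = {"cup", "teaspoon", "tablespoon", "ounce", "lb", "g",
               "box", "scoop", "clove", "can"}

def parse_unit(words):
    """Map the words between quantity and ingredient name to (multiplier, unit).

    Returns None when the words are not a recognised unit, in which case
    the recipe would be discarded by the filtering rule.
    """
    words = [w.lower() for w in words]
    if not words:
        return 1.0, "no_unit"          # e.g. "2 apples"
    multiplier = 1.0
    if words[0] in SIZE_MULTIPLIER:
        multiplier = SIZE_MULTIPLIER[words[0]]
        words = words[1:]
        if not words:
            return multiplier, "no_unit"  # assumption: e.g. "2 large eggs"
    if len(words) == 1 and words[0] in BASIC_UNITS:
        return multiplier, words[0]
    return None

print(parse_unit(["small", "scoop"]))   # size modified unit
print(parse_unit(["heaping", "cup"]))   # unrecognised word, so None
```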

Canonical Ingredient Construction. Recipe1M contains about 16k unique ingredients and the top 4k ingredients account for an average coverage of 95%, which is calculated by $\frac{1}{M} \sum_{j=1}^{M} \frac{m_j}{n_j}$, where $M$ is the number of recipes, $n_j$ is the number of ingredients in the $j$-th recipe and $m_j$ is the number of ingredients both in the $j$-th recipe and the top 4k ingredients. The 4k ingredients are further reduced to 1.4k by first merging the ingredients with the same name after a stemming operation and then semi-automatically fusing other ingredients. Ingredients are fused if they:

  • Are close together in Word2vec (Mikolov et al., 2013) embedding space, which is trained on the titles, ingredients and instructions of Recipe1M.

  • Share two or three words.

  • Are mapped to the same item in Nutritionix.

A human annotator accepts the proposed merger. We only keep recipes with at least 80% coverage. After the two steps, our data contains 460k recipes in total.
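The coverage statistic and the 80% filter can be written down directly. This is a minimal sketch: the function names, the three toy recipes and the toy vocabulary are invented, and ingredients are assumed to be already mapped to canonical names.

```python
def coverage(ingredients, vocabulary):
    """Fraction of a recipe's ingredients that appear in `vocabulary`."""
    if not ingredients:
        return 0.0
    return sum(ing in vocabulary for ing in ingredients) / len(ingredients)

def average_coverage(recipes, vocabulary):
    """Mean of the per-recipe coverage over all recipes."""
    return sum(coverage(r, vocabulary) for r in recipes) / len(recipes)

def filter_recipes(recipes, vocabulary, min_coverage=0.8):
    """Keep only recipes whose coverage is at least `min_coverage`."""
    return [r for r in recipes if coverage(r, vocabulary) >= min_coverage]

# Toy data, purely illustrative.
vocabulary = {"flour", "sugar", "butter", "salt", "rice", "onion", "garlic"}
recipes = [
    ["flour", "sugar", "butter", "salt", "egg"],  # coverage 4/5
    ["rice", "saffron"],                          # coverage 1/2
    ["onion", "garlic"],                          # coverage 2/2
]

print(average_coverage(recipes, vocabulary))
print(filter_recipes(recipes, vocabulary))
```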

Unit Conversion. The amount numbers are meaningless without converting them to the same unit. In the dataset, one ingredient can be represented by volume, weight and counting units, for example, 2 onions, 1 cup chopped onion, 1 lb onions. We convert the units to grams using Nutritionix and the USDA Food Database. The details of the mapping can be found in Appendix B. Some unit conversions like "1 packet of sugar" are defined by neither website, and we only keep recipes where 80% of the ingredients and their corresponding units are converted. Next we vectorize each recipe into two vectors: $\mathbf{y}$ for the amount value, described in Section 3, and $\mathbf{r}$ for the amount range, used if the amounts are not exactly given. $r_i$ corresponds to the amount range of the $i$-th ingredient. For "24-25 ounce cheese_ravioli", the value is 24.5 ounces, or 686 grams, and the range is 1 ounce, or 28 grams. Our data contains 250k recipes after unit conversion and 80k of them have images.
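The ravioli example can be reproduced with a short sketch. The helper `parse_quantity` is hypothetical, and the factor of 28 grams per ounce is the rounded value implied by the numbers in the example; fractions such as "15 1/4" are not handled here.

```python
import re

# Rounded conversion factor implied by the example in the text (assumption).
GRAMS_PER_OUNCE = 28.0

def parse_quantity(text):
    """Return (value, range) for a quantity such as '24-25' or '2'.

    For a range 'a-b' the value is the midpoint and the range is b - a;
    for a single number the range is 0.
    """
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*", text)
    if m:
        low, high = float(m.group(1)), float(m.group(2))
        return (low + high) / 2.0, high - low
    return float(text), 0.0

value_oz, range_oz = parse_quantity("24-25")
value_g = value_oz * GRAMS_PER_OUNCE  # 24.5 oz -> 686 g
range_g = range_oz * GRAMS_PER_OUNCE  # 1 oz -> 28 g
print(value_g, range_g)
```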

4.2. Compared Methods

Retrieval-Based Method: One method of predicting the ingredients given a food image is cross-modal recipe retrieval, which outputs the ingredients and the corresponding amounts of the retrieved recipe. We use the model in (Chen et al., 2018). For a fair comparison, the model is trained only with ingredients and images. Titles and instructions are not included, as the prediction-based models do not use titles or instructions during training. No amount information is used when training the retrieval model. Following (Salvador et al., 2017), the top 1 recipe among 1000 randomly selected recipes is retrieved. We try two settings: one includes the ground truth recipe in the 1000 recipes and the other does not. The two settings only differ by 1 recipe.

Prediction-Based Methods: We use a ResNet-50 (He et al., 2016) pre-trained on UPMC (Wang et al., 2015) and replace the last layer with ingredient amount prediction. Softmax: The dense method in Section 3.1. As recipe ingredients are sparse, the dense output is thresholded to the top 10 predictions and renormalized. L1: The sparse method in Section 3.2. Both models are fine-tuned with the Adam optimizer with learning rate . The batch size is 64.
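The thresholding of the dense output can be sketched as follows. The function name `keep_top_k` and the toy prediction are assumptions; the input is assumed to be a softmax output, so it is strictly positive and sums to 1.

```python
import numpy as np

def keep_top_k(pred, k=10):
    """Zero out all but the k largest entries and renormalize to sum to 1."""
    pred = np.asarray(pred, dtype=float)
    if k >= pred.size:
        return pred / pred.sum()
    keep = np.argpartition(pred, -k)[-k:]  # indices of the k largest entries
    sparse = np.zeros_like(pred)
    sparse[keep] = pred[keep]
    return sparse / sparse.sum()

# Toy dense prediction over 4 ingredients, thresholded to the top 2.
dense = np.array([0.4, 0.3, 0.2, 0.1])
sparse = keep_top_k(dense, k=2)
print(sparse)
```

Changing `k` is how a user would trade off the number of reported ingredients against coverage of the ground truth.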

4.3. Evaluation and Results

We report 4 evaluation metrics: the first two evaluate ingredient detection and the last two evaluate amounts and calories. An ingredient is viewed as detected if its predicted amount is non-zero, and as absent if its predicted amount is 0.

  • Recall of ground truth ingredients: # common ingredients between ground truth and predicted over # ground truth.

  • IOU: # common over # the union of ground truth and predicted.

  • L1 Error: First normalize both the ground truth range vector and the predicted vector to $C = 1000$ (grams per kilogram of total ingredients). The ground truth range vector is normalized according to the ground truth amount vector. If the predicted amount falls outside the range, the error in that dimension is the difference between the amounts. The error is zero otherwise. The $\ell_1$ norm of the error is reported.

  • Relative Calorie Error (RCE): First normalize both the ground truth vector and the predicted vector. Suppose the energy of the $i$-th ingredient is $e_i$; the ground truth calorie can be estimated as $E = \sum_i e_i y_i$ and the predicted calorie as $\hat{E} = \sum_i e_i \hat{y}_i$. The error $|\hat{E} - E| / E$ is reported.
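The four metrics can be sketched in a few lines of NumPy. Several details are assumptions rather than statements from the text: the range is treated as the full width of an interval centred on the ground-truth value, the relative calorie error divides by the ground-truth calorie, and the energy densities and amounts below are invented.

```python
import numpy as np

def recall_and_iou(y_true, y_pred):
    """Detection metrics; an ingredient counts as present if its amount is non-zero."""
    true_set = set(np.flatnonzero(y_true).tolist())
    pred_set = set(np.flatnonzero(y_pred).tolist())
    common = len(true_set & pred_set)
    return common / len(true_set), common / len(true_set | pred_set)

def l1_error(y_true, y_pred, y_range, total=1000.0):
    """L1 error after normalizing to `total`, ignoring errors inside the range."""
    scale = total / y_true.sum()
    t = y_true * scale
    r = y_range * scale                  # range follows the ground-truth scale
    p = y_pred * (total / y_pred.sum())
    diff = np.abs(p - t)
    outside = diff > r / 2.0             # assumption: range is a full width
    return float(diff[outside].sum())

def relative_calorie_error(y_true, y_pred, kcal_per_gram):
    """|E_pred - E_true| / E_true on normalized amount vectors."""
    e_true = float(kcal_per_gram @ (y_true / y_true.sum()))
    e_pred = float(kcal_per_gram @ (y_pred / y_pred.sum()))
    return abs(e_pred - e_true) / e_true

# Hypothetical vocabulary: flour, sugar, butter, oil.
y_true = np.array([500.0, 250.0, 250.0, 0.0])
y_range = np.array([0.0, 120.0, 0.0, 0.0])   # sugar amount given as a range
y_pred = np.array([0.5, 0.3, 0.0, 0.2])
kcal = np.array([3.6, 3.9, 7.2, 8.8])        # illustrative energy densities

recall, iou = recall_and_iou(y_true, y_pred)
l1 = l1_error(y_true, y_pred, y_range)
rce = relative_calorie_error(y_true, y_pred, kcal)
print(recall, iou, l1, rce)
```

Under these assumptions, two recipes with no ingredient in common give an L1 error of 2000, which is consistent with the value reported for the difficult example in Table 2.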

The results are shown in Table 1. The 80k recipes with images are randomly split into training, validation and test sets of 48k, 16k and 16k recipes respectively. The results are on the test set. Numbers are "mean (standard deviation over test set)". Up arrows indicate that higher is better and down arrows indicate that lower is better.

Method             Recall ↑      IOU ↑         L1 ↓                RCE ↓
Retrieval w/o gt   0.26 (0.23)   0.16 (0.16)   1613.65 (469.17)    1.00 (38.47)
Retrieval w/ gt    0.33 (0.31)   0.24 (0.29)   1468.48 (653.69)    0.94 (38.47)
L1                 0.32 (0.19)   0.21 (0.15)   1498.05 (476.83)    0.87 (8.07)
Softmax            0.35 (0.31)   0.17 (0.11)   1433.61 (444.22)    0.74 (10.53)
Table 1. Results of the methods

The results show that there is no significant difference between the performance of prediction-based methods and retrieval with the ground truth recipe included, while they all outperform retrieval without ground truth in terms of recall and L1 error. Compared with retrieval with ground truth, prediction-based methods have lower standard deviations in IOU, L1 and RCE, indicating more robust performance in general. This is reasonable, as retrieval-based methods can be affected by the collection of recipes available to the retrieval system.

Relative calorie error tends to have a larger standard deviation because low-calorie recipes are sensitive to calorie differences. Figure 1 shows the histogram of relative calorie errors in log scale and the recipe with the largest relative calorie error, 4725.32. Most recipes have small relative calorie errors while there are some outliers. The error of about 88% of the test recipes is less than 1. The calorie of the recipe with the largest RCE after normalizing the sum of ingredients is about and the calorie of the retrieved recipe is about . The system retrieves the salad dressing recipe probably because both recipes are liquid.

(a) Coffee foot soak
(b) Almond Salad dressing
Figure 1. Analysis of relative calorie error. Left to right: histogram of relative calorie error, ground truth image of the recipe with the largest RCE, retrieved image, ground truth amounts, retrieved amounts. The amounts are normalized to .

When thresholded to the top 10 predictions and renormalized, the Softmax method performs well in terms of amounts and calories. Furthermore, the threshold can be adjusted according to users’ needs, so the method produces encouraging results while remaining flexible.

We also show some easy, average and difficult examples for the models in terms of the evaluation metrics. We notice that even when the methods perform poorly quantitatively in some difficult test cases, they still produce reasonable ingredient and amount combinations. The results are shown in Table 2 and Figure 2.

Example    Method     Recall  IOU    L1       RCE
Easy       Retrieval  0.57    0.33   502.74   0.09
           Softmax    0.42    0.21   837.13   0.25
           L1         0.57    0.44   450.37   0.16
Average    Retrieval  0.38    0.30   1569.38  0.06
           Softmax    0.25    0.125  1537.12  0.45
           L1         0.25    0.18   1588.95  0.12
Difficult  Retrieval  0       0      2000     0.87
           Softmax    0.14    0.06   1932.74  0.20
           L1         0       0      2000     0.45
Table 2. Quantitative evaluation of easy, average and difficult examples.
(a) Original pound cake
(b) Super soft snickerdoodle cookies
(c) Buttermilk cornmeal waffles
(d) Tippaleivät-May Day fritters
(e) Fruity rice krispie treats/squares-kids no bake
(f) Jagacida(Jag)-beans and rice from Cape Verde
Figure 2. Easy, average and difficult examples. Top row: easy, middle row: average, bottom row: difficult. Columns from left to right: ground truth image, retrieved image, ground truth amounts, retrieved amounts, Softmax results, L1 results. The amounts are normalized to .

In the easy and average examples, all three methods produce reasonable results. The retrieval system gets foods from the same high-level category, baked goods, and the prediction-based models capture main ingredients like flour, sugar and butter well. The amount of salt is also accurately predicted. In the difficult example, the ground truth uses puffed rice cereal and the similar ingredient rice is successfully retrieved, although the taste is incorrect: the ground truth is sweet while the retrieved recipe is savory. Softmax successfully predicts the presence of margarine, and the overall recipe produced is reasonable, as the ground truth image looks like baked goods. L1 successfully predicts a similar ingredient, butter, with nearly the same amount. Even when the quantitative evaluations are bad, qualitatively the models still give robust predictions.

5. Conclusion

We studied a novel problem: given a food image, predict the relative amount of each ingredient needed to prepare the observed food item. Our experiments show that deep model-based prediction methods produce reasonable ingredient and amount predictions; even in the presence of challenging test examples, the methods are still able to yield robust qualitative results.

The problem opens up interesting avenues for future work. First, prediction-based methods leverage amount information during training while the retrieval-based method does not. Nevertheless, there is no significant difference in performance of prediction-based methods and retrieval with ground truth. Further research is needed for improving the performance on this problem. Second, we can leverage amount information in retrieval based methods. Finally, we can adjust the ingredients and the corresponding amounts according to users’ dietary needs to generate novel food images.


  • Beijbom et al. (2015) Oscar Beijbom, Neel Joshi, Dan Morris, Scott Saponas, and Siddharth Khullar. 2015. Menu-match: Restaurant-specific food logging from images. In Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on. IEEE, 844–851.
  • Bossard et al. (2014) Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. 2014. Food-101–mining discriminative components with random forests. In European Conference on Computer Vision. Springer, 446–461.
  • Chen and Ngo (2016) Jingjing Chen and Chong-Wah Ngo. 2016. Deep-based ingredient recognition for cooking recipe retrieval. In Proceedings of the 2016 ACM on Multimedia Conference. ACM, 32–41.
  • Chen et al. (2018) Jing-Jing Chen, Chong-Wah Ngo, Fu-Li Feng, and Tat-Seng Chua. 2018. Deep Understanding of Cooking Procedure for Cross-modal Recipe Retrieval. In 2018 ACM Multimedia Conference on Multimedia Conference. ACM, 1020–1028.
  • Chen et al. (2017) Xin Chen, Yu Zhu, Hua Zhou, Liang Diao, and Dongyan Wang. 2017. ChineseFoodNet: A large-scale Image Dataset for Chinese Food Recognition. arXiv preprint arXiv:1705.02743 (2017).
  • Ciocca et al. (2017) Gianluigi Ciocca, Paolo Napoletano, and Raimondo Schettini. 2017. Food Recognition: A New Dataset, Experiments, and Results. IEEE J. Biomedical and Health Informatics 21, 3 (2017), 588–598.
  • Dehais et al. (2017) Joachim Dehais, Marios Anthimopoulos, Sergey Shevchik, and Stavroula Mougiakakou. 2017. Two-view 3d reconstruction for food volume estimation. IEEE transactions on multimedia 19, 5 (2017), 1090–1099.
  • Ege et al. (2019) Takumi Ege, Yoshikazu Ando, Ryosuke Tanno, Wataru Shimoda, and Keiji Yanai. 2019. Image-Based Estimation of Real Food Size for Accurate Food Calorie Estimation. In 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR). IEEE, 274–279.
  • Fang et al. (2018) Shaobo Fang, Zeman Shao, Runyu Mao, Chichen Fu, Edward J Delp, Fengqing Zhu, Deborah A Kerr, and Carol J Boushey. 2018. Single-view food portion estimation: learning image-to-energy mappings using generative adversarial networks. In 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 251–255.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Liang and Li (2017) Yanchao Liang and Jianhua Li. 2017. Computer vision-based food calorie estimation: dataset, method, and experiment. arXiv preprint arXiv:1705.07632 (2017).
  • Marin et al. (2019) Javier Marin, Aritro Biswas, Ferda Ofli, Nicholas Hynes, Amaia Salvador, Yusuf Aytar, Ingmar Weber, and Antonio Torralba. 2019. Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images. IEEE transactions on pattern analysis and machine intelligence (2019).
  • Meyers et al. (2015) Austin Meyers, Nick Johnston, Vivek Rathod, Anoop Korattikara, Alex Gorban, Nathan Silberman, Sergio Guadarrama, George Papandreou, Jonathan Huang, and Kevin P Murphy. 2015. Im2Calories: towards an automated mobile vision food diary. In Proceedings of the IEEE International Conference on Computer Vision. 1233–1241.
  • Mezgec and Koroušić Seljak (2017) Simon Mezgec and Barbara Koroušić Seljak. 2017. NutriNet: a deep learning food and drink image recognition system for dietary assessment. Nutrients 9, 7 (2017), 657.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
  • Salvador et al. (2019) Amaia Salvador, Michal Drozdzal, Xavier Giro-i Nieto, and Adriana Romero. 2019. Inverse cooking: Recipe generation from food images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 10453–10462.
  • Salvador et al. (2017) Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, and Antonio Torralba. 2017. Learning cross-modal embeddings for cooking recipes and food images. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3020–3028.
  • Shashua and Navab (1994) Amnon Shashua and Nassir Navab. 1994. Relative affine structure: Theory and application to 3D reconstruction from perspective views. In CVPR, Vol. 94. 483–489.
  • Singla et al. (2016) Ashutosh Singla, Lin Yuan, and Touradj Ebrahimi. 2016. Food/non-food image classification and food categorization using pre-trained googlenet model. In Proceedings of the 2nd International Workshop on Multimedia Assisted Dietary Management. ACM, 3–11.
  • Wang et al. (2019) Hao Wang, Doyen Sahoo, Chenghao Liu, Ee-peng Lim, and Steven CH Hoi. 2019. Learning Cross-Modal Embeddings with Adversarial Networks for Cooking Recipes and Food Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 11572–11581.
  • Wang et al. (2015) Xin Wang, Devinder Kumar, Nicolas Thome, Matthieu Cord, and Frederic Precioso. 2015. Recipe recognition with large multimodal food dataset. In Multimedia & Expo Workshops (ICMEW), 2015 IEEE International Conference on. IEEE, 1–6.
  • Zheng et al. (2018) Xin Zheng, Yifei Gong, Qinyi Lei, Run Yao, and Qian Yin. 2018. Multi-view Model Contour Matching Based Food Volume Estimation. In International Conference on Applied Human Factors and Ergonomics. Springer, 85–93.

Appendix A List of Basic Units

  • Volume Units: pint, liter, gallon, teaspoon, tablespoon, cup, dash, quart, fluid_ounce, ml, pinch

  • Weight Units: kg, lb, g, ounce

  • Counting Units: packet, handful, links, sheet, big, package, bunch, clove, leaf, jar, medium, strip, envelope, stick, large, drop, piece, small, container, bottle, head, scoop, of, stalk, glass, sprig, bag, inch, loaf, can, cm, ears, no_unit, dozen, box, slice, squares

"no_unit" means the ingredient name directly follows the quantity, which is an expression for counting (e.g. 2 apples).

Appendix B Mapping ingredients to Nutritionix and USDA

We pass the list of our ingredients to Nutritionix natural language tagging and get JSON-formatted responses. The mapping of the ingredient is the "ITEM" field in the response and the unit conversion is the "ALT_MEASURES" field. We then filter out ingredients with no mappings in Nutritionix and ingredients with multiple mappings (e.g. "butter flavored shortening" is mapped to both "butter" and "shortening"). For the rest of the results, we first check if the mapping is a singular or plural form of the ingredient and get a list of ingredients that are not mapped exactly to themselves. We refine these three groups of results manually:

  • No result: First we try to query the ingredients in USDA Food database to get the corresponding NDB ID as mapping and the conversion tables. If there is no result in USDA, we look for a synonym for the ingredient and query with Nutritionix to get the mapping and conversion table.

  • Multiple mappings: First we try to select one of the mappings to match the ingredient. If none of the mappings match the ingredient, we query the ingredient in USDA.

  • Mismatch: First we check if the mapping is correct. If it is incorrect (e.g. "pea shoots" is mapped to "pea"), we query the ingredient in USDA.

We then pass the mappings to Nutritionix nutrients to get the calories per gram from the "serving_weight_grams" and "nf_calories" fields. If the mappings are from USDA, we query the USDA Food database to get calories per gram.
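The calories-per-gram computation from the two response fields can be sketched as below. Only the field names come from the text; the helper name and the numbers in the example dictionary are made up for illustration, and no actual API request is shown.

```python
def calories_per_gram(item):
    """Energy density from a nutrient record with the two fields named above."""
    grams = item["serving_weight_grams"]
    if not grams:
        raise ValueError("serving weight missing or zero")
    return item["nf_calories"] / grams

# Hypothetical response fragment; the numbers are illustrative, not real data.
butter = {"nf_calories": 100.0, "serving_weight_grams": 14.0}
print(calories_per_gram(butter))
```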