In the kitchen, we increasingly rely on instructions from cooking websites: recipes. A cook with a predilection for Asian cuisine may wish to prepare chicken curry, but may not know all necessary ingredients apart from a few basics. These users with limited knowledge cannot rely on existing recipe generation approaches that focus on creating coherent recipes given all ingredients and a recipe name Kiddon et al. (2016). Such models do not address issues of personal preference (e.g. culinary tastes, garnish choices) and incomplete recipe details. We propose to approach both problems via personalized generation of plausible, user-specific recipes using user preferences extracted from previously consumed recipes.
Our work combines two important tasks from natural language processing and recommender systems: data-to-text generation Gatt and Krahmer (2018) and personalized recommendation Rashid et al. (2002). Our model takes as user input the name of a specific dish, a few key ingredients, and a calorie level. We pass these loose input specifications to an encoder-decoder framework and attend on user profiles (learned latent representations of recipes previously consumed by a user) to generate a recipe personalized to the user’s tastes. We fuse these ‘user-aware’ representations with decoder output in an attention fusion layer to jointly determine text generation. Quantitative (perplexity, user-ranking) and qualitative analyses of user-aware model outputs confirm that personalization indeed assists in generating plausible recipes from incomplete ingredients.
While personalized text generation has seen success in conveying user writing styles in the product review Ni et al. (2017); Ni and McAuley (2018) and dialogue Zhang et al. (2018) spaces, we are the first to consider it for the problem of recipe generation, where output quality is heavily dependent on the content of the instructions—such as ingredients and cooking techniques.
To summarize, our main contributions are as follows:
We explore a new task of generating plausible and personalized recipes from incomplete input specifications by leveraging historical user preferences (our source code and appendix are at https://github.com/majumderb/recipe-personalization);
We release a new dataset of 180K+ recipes and 700K+ user reviews for this task;
We introduce new evaluation strategies for generation quality in instructional texts, centering on quantitative measures of coherence. We also show qualitatively and quantitatively that personalized models generate high-quality and specific recipes that align with historical user preferences.
2 Related Work
Large-scale transformer-based language models have shown surprising expressivity and fluency in creative and conditional long-text generation Vaswani et al. (2017); Radford et al. (2019). Recent works have proposed hierarchical methods that condition on narrative frameworks to generate internally consistent long texts Fan et al. (2018); Xu et al. (2018); Yao et al. (2018). Here, we generate procedurally structured recipes instead of free-form narratives.
Recipe generation belongs to the field of data-to-text natural language generation Gatt and Krahmer (2018), which sees other applications in automated journalism Leppänen et al. (2017), question-answering Agrawal et al. (2017), and abstractive summarization Paulus et al. (2018), among others. Kiddon et al. (2015); Bosselut et al. (2018b) model recipes as a structured collection of ingredient entities acted upon by cooking actions. Kiddon et al. (2016) impose a ‘checklist’ attention constraint emphasizing hitherto unused ingredients during generation. Yang et al. (2017) attend over explicit ingredient references in the prior recipe step. Similar hierarchical approaches that infer a full ingredient list to constrain generation will not help personalize recipes, and would be infeasible in our setting due to the potentially unconstrained number of ingredients (from a space of 10K+) in a recipe. We instead learn historical preferences to guide full recipe generation.
A recent line of work has explored user- and item-dependent aspect-aware review generation Ni et al. (2017); Ni and McAuley (2018). This work is related to ours in that it combines contextual language generation with personalization. Here, we attend over historical user preferences from previously consumed recipes to generate recipe content, rather than writing styles.
Our model’s input specification consists of: the recipe name as a sequence of tokens, a partial list of ingredients, and a caloric level (high, medium, low). It outputs the recipe instructions as a token sequence. To personalize output, we use the historical recipe interactions of a user.
Encoder: Our encoder has three embedding layers: a vocabulary embedding, an ingredient embedding, and a caloric-level embedding. Each token in the recipe name is embedded via the vocabulary embedding; the embedded token sequence is passed to a two-layered bidirectional GRU (BiGRU) Cho et al. (2014), which outputs hidden states for the name tokens. Similarly, each of the input ingredients is embedded via the ingredient embedding, and the embedded ingredient sequence is passed to another two-layered BiGRU to output ingredient hidden states. The caloric level is embedded via the caloric-level embedding and passed through a projection layer to generate a calorie hidden representation.
Ingredient Attention: We apply attention Bahdanau et al. (2015) over the encoded ingredients to use encoder outputs at each decoding time step. We define an attention-score function over a key and a query, with trainable weights, a bias, and a normalization term. At each decoding time step, we calculate the ingredient context by attending over the ingredient hidden states.
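To make the attention step concrete, the following is a minimal pure-Python sketch. The additive score form `tanh(w . (key + query) + b)` with a softmax normalizer is an assumption in the style of Bahdanau et al. (2015), not necessarily the exact function used in this work; the names `score` and `attend` and the toy vectors are hypothetical.

```python
import math

def score(key, query, w, b):
    """Unnormalized additive score: tanh(w . (key + query) + b) (assumed form)."""
    return math.tanh(sum(wi * (ki + qi) for wi, ki, qi in zip(w, key, query)) + b)

def attend(keys, query, w, b):
    """Normalize scores over all keys and return (weights, context vector)."""
    raw = [math.exp(score(k, query, w, b)) for k in keys]
    z = sum(raw)  # normalization term
    weights = [r / z for r in raw]
    dim = len(keys[0])
    # Context is the attention-weighted sum of the keys.
    context = [sum(a * k[d] for a, k in zip(weights, keys)) for d in range(dim)]
    return weights, context

# Toy example: three encoded ingredients, one decoder hidden state as the query.
ingredient_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.5, -0.5]
weights, context = attend(ingredient_states, decoder_state, w=[0.3, 0.7], b=0.0)
```

In this sketch the encoded ingredients act as keys and the decoder hidden state acts as the query, so the context changes at every decoding step.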
Decoder: The decoder is a two-layer GRU whose hidden state is conditioned on the previous hidden state and on the input token from the original recipe text. We project the concatenated encoder outputs to form the initial decoder hidden state.
To bias generation toward user preferences, we attend over a user’s previously reviewed recipes to jointly determine the final output token distribution. We consider two different schemes to model preferences from user histories: (1) recipe interactions, and (2) techniques seen therein (defined in Section 4). Rendle et al. (2009); Quadrana et al. (2018); Ueda et al. (2011) explore similar schemes for personalized recommendation.
Prior Recipe Attention: We obtain the set of prior recipes for a user, where each recipe can be represented either by an embedding from a recipe embedding layer or by an average of its name tokens embedded by the vocabulary embedding. These representations are used in the ‘Prior Recipe’ and ‘Prior Name’ models, respectively. We attend over the k most recent prior recipes to account for temporal drift of user preferences Moore et al. (2013).
Given a recipe representation (whose size is the recipe- or vocabulary-embedding size, depending on the recipe representation), the prior recipe attention context is calculated by attending over the representations of these prior recipes.
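The two ingredients of this step, selecting the most recent prior recipes and representing a recipe by its averaged name-token embeddings, can be sketched as follows. This is an illustrative sketch only: the function names, the `(recipe_id, timestamp)` tuple layout, and the toy embedding table are assumptions, not part of the released code.

```python
def most_recent(interactions, k):
    """Return the recipe ids of the k most recent interactions, newest first.

    `interactions` is a list of (recipe_id, timestamp) pairs (assumed layout).
    """
    ordered = sorted(interactions, key=lambda pair: pair[1], reverse=True)
    return [recipe_id for recipe_id, _ in ordered[:k]]

def average_name_embedding(tokens, embedding):
    """Represent a recipe by the mean of its name-token embeddings."""
    vectors = [embedding[t] for t in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

history = [("r1", 2010), ("r2", 2015), ("r3", 2012)]
recent = most_recent(history, k=2)

toy_embedding = {"chicken": [1.0, 0.0], "curry": [0.0, 1.0]}
name_vec = average_name_embedding(["chicken", "curry"], toy_embedding)
```

The resulting vectors would then serve as the keys that the decoder attends over, in the same way as the encoded ingredients.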
Prior Technique Attention: We calculate prior technique preference (used in the ‘Prior Tech’ model) by normalizing co-occurrence between users and techniques seen in their prior recipes, to obtain a preference vector. Each technique is embedded via a technique embedding layer. Prior technique attention is calculated over these technique embeddings.
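A preference vector of this kind can be sketched by counting how often each technique appears in a user's prior recipes and normalizing the counts. The normalization used here (counts divided by their sum) is an assumption, since the exact scheme is not recoverable from the text; `technique_preferences` is a hypothetical name.

```python
from collections import Counter

def technique_preferences(user_recipes, techniques):
    """Normalized user-technique co-occurrence over a fixed technique list.

    `user_recipes` holds, for each prior recipe, the techniques it mentions.
    Returns one weight per technique; weights sum to 1 when any technique occurs.
    """
    counts = Counter()
    for recipe_techniques in user_recipes:
        counts.update(t for t in recipe_techniques if t in techniques)
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in techniques]

technique_list = ["bake", "boil", "pour"]
prefs = technique_preferences([["bake", "pour"], ["bake"]], technique_list)
```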
Attention Fusion Layer: We fuse all contexts calculated at each decoding time step, concatenating them with the decoder GRU output and the previous token embedding. We then calculate the token probability from this fused representation and maximize the log-likelihood of the generated sequence conditioned on input specifications and user preferences. Figure 1 shows a case where the Prior Name model attends strongly on previously consumed savory recipes to suggest the usage of an additional ingredient (‘cilantro’).
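The fusion step amounts to concatenation followed by a projection to vocabulary logits and a softmax. The sketch below illustrates that pattern under stated assumptions: the concatenation order, the single linear projection, and all names and toy weights are hypothetical and not taken from the released model.

```python
import math

def fuse(decoder_output, prev_token_emb, contexts):
    """Concatenate decoder output, previous token embedding, and all contexts."""
    fused = list(decoder_output) + list(prev_token_emb)
    for context in contexts:
        fused.extend(context)
    return fused

def token_distribution(fused, weight_rows, bias):
    """Linear projection to vocabulary logits, followed by a stable softmax."""
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weight_rows, bias)]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy example: ingredient context of size 1, prior-recipe context of size 2,
# and a vocabulary of three tokens.
fused = fuse([0.2, 0.1], [1.0], [[0.5], [0.3, 0.4]])
probs = token_distribution(fused, [[0.1] * 6, [0.2] * 6, [0.0] * 6], [0.0, 0.0, 0.0])
```

Training would then maximize the log of the probability assigned to each gold token under this distribution.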
4 Recipe Dataset: Food.com
We collect a novel dataset of 230K+ recipe texts and 1M+ user interactions (reviews) over 18 years (2000-2018) from Food.com (available at https://www.kaggle.com/shuyangli94/food-com-recipes-and-user-interactions). Here, we restrict to recipes with at least 3 steps, and at least 4 and no more than 20 ingredients. We discard users with fewer than 4 reviews, giving 180K+ recipes and 700K+ reviews, with splits as in Table 1.
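The filtering rules and the sparsity statistic reported in Table 1 can be sketched as below. The thresholds come from the text; the order of the two filters, the dictionary layout of a recipe, and the function names are assumptions made for illustration.

```python
from collections import Counter

def filter_recipes(recipes):
    """Keep recipes with >= 3 steps and between 4 and 20 ingredients."""
    return {
        rid: r for rid, r in recipes.items()
        if len(r["steps"]) >= 3 and 4 <= len(r["ingredients"]) <= 20
    }

def filter_interactions(interactions, kept_recipes, min_reviews=4):
    """Drop reviews of removed recipes, then users with too few reviews."""
    kept = [(u, r) for u, r in interactions if r in kept_recipes]
    per_user = Counter(u for u, _ in kept)
    return [(u, r) for u, r in kept if per_user[u] >= min_reviews]

def sparsity(interactions):
    """Ratio of unobserved (user, recipe) pairs to all possible pairs."""
    observed = set(interactions)
    users = {u for u, _ in observed}
    items = {r for _, r in observed}
    return 1.0 - len(observed) / (len(users) * len(items))

toy_recipes = {
    "a": {"steps": ["s"] * 3, "ingredients": ["i"] * 4},
    "b": {"steps": ["s"] * 2, "ingredients": ["i"] * 5},   # too few steps
    "c": {"steps": ["s"] * 5, "ingredients": ["i"] * 21},  # too many ingredients
    "d": {"steps": ["s"] * 3, "ingredients": ["i"] * 20},
    "e": {"steps": ["s"] * 4, "ingredients": ["i"] * 6},
}
toy_reviews = [("u1", "a"), ("u1", "d"), ("u2", "a"),
               ("u2", "b"), ("u3", "a"), ("u3", "e")]
kept_recipes = filter_recipes(toy_recipes)
# min_reviews is lowered to 2 only to keep the toy data small.
kept_reviews = filter_interactions(toy_reviews, kept_recipes, min_reviews=2)
```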
|Split||# Users||# Recipes||# Actions||Sparsity (ratio of unobserved actions to all possible actions)|
|Model||BPE PPL||BLEU-1||BLEU-4||ROUGE-L||D-1 (%)||D-2 (%)||UMA||MRR||PP (%)|
|Input||Name: Pomberrytini; Ingredients: pomegranate-blueberry juice, cranberry juice, vodka; Calorie: Low|
|Gold||Place everything except the orange slices in a cocktail shaker. Shake until well mixed and well chilled. Pour into martini glasses and float an orange slice in each glass.|
|Enc-Dec||Combine all ingredients. Cover and refrigerate. Serve with whipped topping.|
|Prior Tech||Combine all ingredients. Store in refrigerator. Serve over ice. Enjoy!|
|Prior Recipe||Pour the ice into a cocktail shaker. Pour in the vodka and vodka. Add a little water and shake to mix. Pour into the glass and garnish with a slice of orange slices. Enjoy!|
|Prior Name||Combine all ingredients except for the ice in a blender or food processor. Process to make a smooth paste and then add the remaining vodka and blend until smooth. Pour into a chilled glass and garnish with a little lemon and fresh mint.|
Our model must learn to generate from a diverse recipe space: in our training data, the average recipe length is 117 tokens with a maximum of 256. There are 13K unique ingredients across all recipes. Rare words dominate the vocabulary: 95% of words appear fewer than 100 times, accounting for only 1.65% of all word usage. As such, we perform Byte-Pair Encoding (BPE) tokenization Sennrich et al. (2016); Radford et al. (2018), giving a training vocabulary of 15K tokens across 19M total mentions. User profiles are similarly diverse: 50% of users have consumed at most 6 recipes, while 10% of users have consumed more than 45 recipes.
We order reviews by timestamp, keeping the most recent review for each user as the test set, the second most recent for validation, and the remainder for training (sequential leave-one-out evaluation Kang and McAuley (2018)). We evaluate only on recipes not in the training set.
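The sequential leave-one-out protocol can be sketched as follows. This is a minimal illustration: the `(user, recipe, timestamp)` layout and the function name are assumptions, and the sketch relies on every user having at least three reviews (the filtered dataset guarantees at least four).

```python
from collections import defaultdict

def leave_one_out(reviews):
    """Split (user, recipe, timestamp) reviews into train/valid/test.

    Per user: most recent review -> test, second most recent -> validation,
    the rest -> training. Held-out recipes seen in training are dropped.
    """
    by_user = defaultdict(list)
    for user, recipe, timestamp in reviews:
        by_user[user].append((timestamp, recipe))
    train, valid, test = [], [], []
    for user, items in by_user.items():
        items.sort()  # oldest first
        train += [(user, r) for _, r in items[:-2]]
        valid.append((user, items[-2][1]))
        test.append((user, items[-1][1]))
    seen = {r for _, r in train}
    valid = [(u, r) for u, r in valid if r not in seen]
    test = [(u, r) for u, r in test if r not in seen]
    return train, valid, test

toy = [("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 3), ("u1", "d", 4),
       ("u2", "a", 1), ("u2", "e", 2), ("u2", "b", 3), ("u2", "f", 4)]
train, valid, test = leave_one_out(toy)
```

In the toy data, the validation review of `u2` is dropped because recipe `b` already appears in the training set.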
We manually construct a list of 58 cooking techniques from 384 cooking actions collected by Bosselut et al. (2018b); the most common techniques (bake, combine, pour, boil) account for 36.5% of technique mentions. We approximate technique adherence via string match between the recipe text and technique list.
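Technique adherence via string matching can be approximated with a few lines of Python. Matching on whole lowercase words is an assumption (the text only says "string match"), and the four-item technique list is just the most common techniques named above rather than the full list of 58.

```python
import re

def techniques_in(text, techniques):
    """Return the techniques that appear as whole words in the recipe text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [t for t in techniques if t in words]

common = ["bake", "combine", "pour", "boil"]
found = techniques_in("Combine all ingredients. Pour into martini glasses.", common)
```

Whole-word matching misses inflected forms such as "baked"; a stemmer or prefix match would be a natural refinement.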
5 Experiments and Results
For training and evaluation, we provide our model with the first 3-5 ingredients listed in each recipe. We decode recipe text via top-k sampling Radford et al. (2019), choosing a value of k that produces satisfactory results. We use the same hidden size for both the encoder and decoder. Embedding dimensions for vocabulary, ingredient, recipe, techniques, and caloric level are 300, 10, 50, 50, and 5 (respectively). For prior recipe attention, we set the number of attended recipes to the 80th percentile of the number of user interactions. We use the Adam optimizer Kingma and Ba (2015) with a learning rate annealed at a decay rate of 0.9 Howard and Ruder (2018). We also use teacher-forcing Williams and Zipser (1989) in all training epochs.
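For readers unfamiliar with top-k sampling, a minimal sketch follows: at each step only the k most probable tokens are kept, and one of them is drawn in proportion to its probability. The function name, the toy distribution, and the value of k used here are illustrative assumptions.

```python
import random

def top_k_sample(probs, k, rng=random):
    """Sample a token index from the k most probable entries of `probs`."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]  # choices() renormalizes these weights
    return rng.choices(top, weights=weights, k=1)[0]

probs = [0.10, 0.40, 0.05, 0.30, 0.15]
rng = random.Random(0)  # seeded only to make the example repeatable
samples = {top_k_sample(probs, 3, rng) for _ in range(200)}
```

With k set to 1 the procedure reduces to greedy decoding; larger values trade determinism for diversity.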
In this work, we investigate how leveraging historical user preferences can improve generation quality over strong baselines in our setting. We compare our personalized models against two baselines. The first is a name-based Nearest-Neighbor model (NN). We initially adapted the Neural Checklist Model of Kiddon et al. (2016) as a baseline; however, we ultimately use a simple Encoder-Decoder baseline with ingredient attention (Enc-Dec), which provides comparable performance and lower complexity. All personalized models outperform the baseline in BPE perplexity (Table 2), with Prior Name performing the best. While our models exhibit comparable performance to the baseline in BLEU-1/4 and ROUGE-L, we generate more diverse (Distinct-1/2: percentage of distinct unigrams and bigrams) and acceptable recipes. BLEU and ROUGE are not the most appropriate metrics for generation quality: a ‘correct’ recipe can be written in many ways with the same main entities (ingredients). As BLEU-1/4 capture structural information via n-gram matching, they are not correlated with subjective recipe quality. This mirrors observations from Baheti et al. (2018); Fan et al. (2018).
We observe that personalized models make more diverse recipes than baseline. They thus perform better in BLEU-1 with more key entities (ingredient mentions) present, but worse in BLEU-4, as these recipes are written in a personalized way and deviate from gold on the phrasal level. Similarly, the ‘Prior Name’ model generates more unigram-diverse recipes than other personalized models and obtains a correspondingly lower BLEU-1 score.
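The Distinct-n diversity metric can be sketched as the percentage of n-grams that are unique. Computing it at the corpus level over whitespace tokens is an assumption; the text does not state whether the metric is averaged per recipe or which tokenization is used.

```python
def distinct_n(texts, n):
    """Percentage of distinct n-grams among all n-grams in `texts`."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return 100.0 * len(unique) / total if total else 0.0

d1 = distinct_n(["a b a b"], 1)  # 2 distinct unigrams out of 4
d2 = distinct_n(["a b a b"], 2)  # 2 distinct bigrams out of 3
```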
Qualitative Analysis: We present sample outputs for a cocktail recipe in Table 3, and additional recipes in the appendix. Generation quality progressively improves from generic baseline output to a blended cocktail produced by our best performing model. Models attending over prior recipes explicitly reference ingredients. The Prior Name model further suggests the addition of lemon and mint, which are reasonably associated with previously consumed recipes like coconut mousse and pork skewers.
Personalization: To measure personalization, we evaluate how closely the generated text corresponds to a particular user profile. We compute the likelihood of generated recipes using identical input specifications but conditioned on ten different user profiles—one ‘gold’ user who consumed the original recipe, and nine randomly generated user profiles. Following Fan et al. (2018), we expect the highest likelihood for the recipe conditioned on the gold user. We measure user matching accuracy (UMA)—the proportion where the gold user is ranked highest—and Mean Reciprocal Rank (MRR) Radev et al. (2002) of the gold user. All personalized models beat baselines in both metrics, showing our models personalize generated recipes to the given user profiles. The Prior Name model achieves the best UMA and MRR by a large margin, revealing that prior recipe names are strong signals for personalization. Moreover, the addition of attention mechanisms to capture these signals improves language modeling performance over a strong non-personalized baseline.
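Both personalization metrics follow directly from the rank of the gold user among the ten candidate profiles. The sketch below assumes the gold user's likelihood is stored first in each list and that ties are resolved in the gold user's favor; both conventions are illustrative choices.

```python
def rank_of_gold(likelihoods, gold_index=0):
    """1-based rank of the gold user's likelihood (ties favor the gold user)."""
    gold = likelihoods[gold_index]
    better = sum(1 for i, v in enumerate(likelihoods)
                 if i != gold_index and v > gold)
    return 1 + better

def uma_and_mrr(all_likelihoods):
    """User matching accuracy and mean reciprocal rank over all test cases."""
    ranks = [rank_of_gold(l) for l in all_likelihoods]
    uma = sum(r == 1 for r in ranks) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    return uma, mrr

# Two toy test cases with three candidate users each (log-likelihoods).
uma, mrr = uma_and_mrr([[-1.0, -2.0, -3.0], [-2.0, -1.0, -3.0]])
```

In the toy data the gold user ranks first once and second once, so UMA is 0.5 and MRR is 0.75.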
Recipe Level Coherence: A plausible recipe should possess a coherent step order, and we evaluate this via a metric for recipe-level coherence. We use the neural scoring model from Bosselut et al. (2018a) to measure recipe-level coherence for each generated recipe. Each recipe step is encoded by BERT Devlin et al. (2019). Our scoring model is a GRU network that learns the overall recipe step ordering structure by minimizing the cosine similarity of recipe step hidden representations presented in the correct and reverse orders. Once pretrained, our scorer calculates the similarity of a generated recipe to the forward and backwards ordering of its corresponding gold label, giving a score equal to the difference between the former and latter. A higher score indicates better step ordering (with a maximum score of 2). Table 4 shows that our personalized models achieve average recipe-level coherence scores of 1.78-1.82, surpassing the baseline at 1.77.
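The final scoring step, the difference between two cosine similarities, can be sketched as below. The plain vectors stand in for the scorer's learned recipe representations, which this sketch does not attempt to reproduce; the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def coherence_score(generated, gold_forward, gold_backward):
    """Similarity to the forward gold ordering minus similarity to the reverse."""
    return cosine(generated, gold_forward) - cosine(generated, gold_backward)

best = coherence_score([1.0, 0.0], [1.0, 0.0], [-1.0, 0.0])
neutral = coherence_score([1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
```

Because each cosine similarity lies in [-1, 1], the score is bounded by 2, reached when the generated recipe matches the forward ordering and is opposite to the reversed one.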
Recipe Step Entailment: Local coherence is also crucial to a user following a recipe: subsequent steps must be logically consistent with prior ones. We model local coherence as an entailment task: predicting the likelihood that a recipe step follows the preceding one. We sample several consecutive (positive) and non-consecutive (negative) pairs of steps from each recipe. We train a BERT Devlin et al. (2019) model to predict the entailment score of a pair of steps separated by a [SEP] token, using the final representation of the [CLS] token. The step entailment score is computed as the average of scores for each set of consecutive steps in each recipe, averaged over every generated recipe for a model, as shown in Table 4.
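The pair construction and the two-level averaging can be sketched as follows. For clarity the sketch enumerates all pairs instead of sampling them, and `score_fn` is a hypothetical stand-in for the fine-tuned BERT entailment model.

```python
def step_pairs(steps):
    """Return (positive, negative) step pairs for one recipe.

    Positives are consecutive steps; negatives are all other ordered pairs.
    """
    n = len(steps)
    positives = [(steps[i], steps[i + 1]) for i in range(n - 1)]
    negatives = [(steps[i], steps[j]) for i in range(n) for j in range(n)
                 if j != i and j != i + 1]
    return positives, negatives

def recipe_entailment(steps, score_fn):
    """Average entailment score over the consecutive step pairs of a recipe."""
    positives, _ = step_pairs(steps)
    return sum(score_fn(a, b) for a, b in positives) / len(positives)

def model_entailment(recipes, score_fn):
    """Average of per-recipe scores, skipping single-step recipes."""
    scored = [recipe_entailment(s, score_fn) for s in recipes if len(s) > 1]
    return sum(scored) / len(scored)

def toy_score(a, b):
    """Toy scorer: a pair is 'entailed' when the steps are in alphabetical order."""
    return 1.0 if a < b else 0.0

pos, neg = step_pairs(["a", "b", "c"])
avg = model_entailment([["a", "b", "c"], ["b", "a", "c"]], toy_score)
```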
Human Evaluation: We presented 310 pairs of recipes for pairwise comparison Fan et al. (2018) (details in appendix) between baseline and each personalized model, with results shown in Table 2. On average, human evaluators preferred personalized model outputs to baseline 63% of the time, confirming that personalized attention improves the semantic plausibility of generated recipes. We also performed a small-scale human coherence survey over 90 recipes, in which 60% of users found recipes generated by personalized models to be more coherent and preferable to those generated by baseline models.
|Model||Recipe Level Coherence||Recipe Step Entailment|
In this paper, we propose a novel task: to generate personalized recipes from incomplete input specifications and user histories. On a large novel dataset of 180K recipes and 700K reviews, we show that our personalized generative models can generate plausible, personalized, and coherent recipes preferred by human evaluators for consumption. We also introduce a set of automatic coherence measures for instructional texts as well as personalization metrics to support our claims. Our future work includes generating structured representations of recipes to handle ingredient properties, as well as accounting for references to collections of ingredients (e.g. “dry mix”).
Acknowledgements. This work is partly supported by NSF #1750063. We thank all reviewers for their constructive suggestions, as well as Rei M., Sujoy P., Alicia L., Eric H., Tim S., Kathy C., Allen C., and Micah I. for their feedback.
References
- Agrawal et al. (2017). VQA: visual question answering. IJCV 123 (1), pp. 4–31.
- Bahdanau et al. (2015). Neural machine translation by jointly learning to align and translate. In ICLR.
- Baheti et al. (2018). Generating more interesting responses in neural conversation models with distributional constraints. In EMNLP.
- Bosselut et al. (2018a). Discourse-aware neural rewards for coherent text generation. In NAACL-HLT.
- Bosselut et al. (2018b). Simulating action dynamics with neural process networks. In ICLR.
- Cho et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
- Fan et al. (2018). Hierarchical neural story generation. In ACL.
- Gatt and Krahmer (2018). Survey of the state of the art in natural language generation: core tasks, applications and evaluation. J. Artif. Intell. Res. 61, pp. 65–170.
- Gu et al. (2016). Incorporating copying mechanism in sequence-to-sequence learning. In ACL.
- Howard and Ruder (2018). Universal language model fine-tuning for text classification. In ACL.
- Kang and McAuley (2018). Self-attentive sequential recommendation. In ICDM.
- Kiddon et al. (2015). Mise en place: unsupervised interpretation of instructional recipes. In EMNLP.
- Kiddon et al. (2016). Globally coherent text generation with neural checklist models. In EMNLP.
- Kingma and Ba (2015). Adam: A method for stochastic optimization. In ICLR.
- Leppänen et al. (2017). Data-driven news generation for automated journalism. In INLG.
- Moore et al. (2013). Taste over time: the temporal dynamics of user preferences. In ISMIR.
- Ni et al. (2017). Estimating reactions and recommending products with generative models of reviews. In IJCNLP.
- Ni and McAuley (2018). Personalized review generation by expanding phrases and attending on aspect-aware representations. In ACL.
- Paulus et al. (2018). A deep reinforced model for abstractive summarization. In ICLR.
- Quadrana et al. (2018). Sequence-aware recommender systems. In UMAP.
- Radev et al. (2002). Evaluating web-based question answering systems. In LREC.
- Radford et al. (2018). Improving language understanding by generative pre-training.
- Radford et al. (2019). Language models are unsupervised multitask learners.
- Rashid et al. (2002). Getting to know you: learning new user preferences in recommender systems. In IUI.
- Rendle et al. (2009). BPR: bayesian personalized ranking from implicit feedback. In UAI.
- See et al. (2017). Get to the point: summarization with pointer-generator networks. In ACL.
- Sennrich et al. (2016). Neural machine translation of rare words with subword units. In ACL.
- Ueda et al. (2011). User’s food preference extraction for personalized cooking recipe recommendation. In SPIM.
- Vaswani et al. (2017). Attention is all you need. In NIPS.
- Williams and Zipser (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1 (2), pp. 270–280.
- Xu et al. (2018). A skeleton-based model for promoting coherence among sentences in narrative story generation. In EMNLP.
- Yang et al. (2017). Reference-aware language models. In EMNLP.
- Yao et al. (2018). Plan-and-write: towards better automatic storytelling. CoRR abs/1811.05701.
- Zhang et al. (2018). Personalizing dialogue agents: I have a dog, do you have pets too?. In ACL.
Food.com: Dataset Details
Our raw data consists of 270K recipes and 1.4M user-recipe interactions (reviews) scraped from Food.com, covering a period of 18 years (January 2000 to December 2018). See Table 5 for dataset summary statistics, and Table 6 for sample information about one user-recipe interaction and the recipe involved.
|# Recipes||# Users||# Reviews||Sparsity (%)|
|ingredients||[‘butterscotch chips’, ‘chinese noodles’, ‘salted peanuts’]|
|Input||Name: Chicken Bell Pepper Chili Weight Watchers; Ingredients: olive oil, red bell pepper, green bell pepper, yellow bell pepper; Calorie: Medium|
|Gold||In a large nonstick dutch oven or saucepan, heat the oil. Saute the bell peppers and onion until softened, about 5 minutes. Add the chicken. Cook, breaking apart with a wooden spoon, until no longer pink, 5 - 7 minutes. Sprinkle with the flour, chili powder, cumin, cocoa powder, and cayenne. Cook, stirring briskly and constantly, 1 minute. Stir in the tomatoes and vinegar. Bring to a boil. Reduce the heat and simmer, stirring frequently, until thick, 30 - 40 minutes. Stir in the chickpeas. Simmer until heated through, about 5 minutes.|
|Enc-Dec||Heat oil in a large, heavy pot over medium heat. Add chicken and cook until browned on both sides, about 10 minutes, or until browned; drain. Add onion and garlic and saute for 2 minutes. Add chicken broth, chicken broth, and corn starch. Bring to a slow boil, reduce heat, cover and simmer for 30 minutes. Remove from heat and add the tomatoes and corn. Simmer until chicken is cooked through, about 10 minutes. Serve over rice or noodles.|
|Prior Tech||Heat oil in large saucepan over medium - high heat. Add chicken and cook until browned. Add bell pepper and onions, cook 2 minutes, stirring frequently. Add chicken broth. Cover, simmer, stirring occasionally, for 10 minutes or until vegetables are tender, stirring occasionally. Add chicken, cover, and cook 10 more minutes or until chicken is cooked through. Remove from heat. Stir in remaining ingredients. Season with salt and pepper to taste. Serve over rice and top with additional shredded cheese and cilantro.|
|Prior Recipe||Heat oil in large dutch oven over medium - high. Add chicken, bell peppers, bell peppers and onion. Saute 2 - 3 minutes. Add broth and tomato paste; stir. Bring to a boil. Reduce heat and simmer, covered, for 10 minutes or until rice is tender, stirring often. Stir in chicken. Cover and simmer for 10 minutes. Stir in parsley. Serve over rice. Garnish with cilantro.|
|Prior Name||Heat the oil in a large skillet over medium - high heat. Add the peppers and onions. Cook, stirring, until the vegetables are soft and beginning to soften, about 5 minutes. Stir in the tomatoes, corn, corn, and corn. Bring to a boil. Reduce heat to low, cover, and simmer for 10 minutes. Add the chicken and cook for another 10 minutes or until the vegetables are tender. Stir in the cilantro and serve.|
|Input||Name: Cinna Nilla Waffles; Ingredients: flour, egg, milk, vegetable oil, sugar; Calorie: Medium|
|Gold||Heat waffle iron. Beat eggs in large bowl with hand beater until fluffy. Beat in remaining ingredients until smooth. Pour batter into waffle iron and cook for approximately 5 minutes. Serve immediately or cool to pack into freezer bags and freeze.|
|Enc-Dec||Mix the flour, baking soda, and salt in a bowl. In a separate bowl, whisk together the milk, oil, eggs and sugar. Stir the flour mixture into the egg mixture, and continue to mix well. Add the flour to the egg mixture. Mix well and pour into a greased waffle iron. Cook for 2 minutes, remove from heat and serve.|
|Prior Tech||In a medium bowl mix flour, eggs, and milk until combined. Add the dry ingredients and stir until just combined and do not mix. Heat griddle over medium heat, add the oil, oil, and cook the pancakes until golden brown and cooked through. Serve with a little milk or cream. Enjoy|
|Prior Recipe||In a mixing bowl, whisk together the eggs, milk, oil, sugar, vanilla, salt and vanilla. Cover and let the mixture stand in the fridge for about 1 hour. Spoon batter into waffle iron and close the grill.|
|Prior Name||Preheat waffle iron. Beat together the eggs, milk and oil until well blended, add the vanilla and mix well with a mixer. Fold in flour, baking powder, and cinnamon. Spread 1 / 2 the mixture in a greased waffle iron. Bake until golden brown, about 15 minutes per side. Sprinkle with powdered sugar and serve warm.|
We prepared a set of 15 pairwise comparisons per evaluation session, and collected 930 pairwise evaluations (310 per personalized model) over 62 sessions. For each pair, users were given a partial recipe specification (name and 3-5 key ingredients), as well as two generated recipes labeled ‘A’ and ‘B’. One recipe is generated from our baseline encoder-decoder model and one recipe is generated by one of our three personalized models (Prior Tech, Prior Name, Prior Recipe). The order of recipe presentation (A/B) is randomly selected for each question. A screenshot of the user evaluation interface is given in Figure 2. We ask the user to indicate which recipe they find more coherent, and which recipe best accomplishes the goal indicated by the recipe name. A screenshot of this survey interface is given in Figure 3.