1 Introduction
Many office workers spend most of their working days using pro-oriented software applications. These applications are often powerful, but complicated. This complexity may overwhelm and confuse novice users, and even expert users may find some tasks time-consuming and repetitive. We want to use machine learning and statistical modeling to help users manage this complexity.
Fortunately, modern software applications collect large amounts of data from users with the aim of providing them with better guidance and more personalized experiences. A photo-editing application, for example, could use data about how users edit images to learn what kinds of adjustments are appropriate for what images, and could learn to tailor its suggestions to the aesthetic preferences of individual users. Such suggestions can help both experts and novices: experts can use them as a starting point, speeding up tedious parts of the editing process, and novices can quickly get results they could not have otherwise achieved.
Several models have been proposed for predicting and personalizing user interaction in different software applications.
These existing models are limited in that they either propose only a single prediction or are not readily personalized. Multimodal predictions are important in cases where, given an input from the user, there could be multiple possible suggestions from the application. (We mean “multimodal” in the statistical sense, i.e., coming from a distribution with multiple maxima, rather than in the human-computer-interaction sense, i.e., having multiple modes of input or output.) For instance, in photo editing/enhancement, a user might want to apply different kinds of edits to the same photo depending on the effect he or she wants to achieve. A model should therefore be able to recommend multiple enhancements that cover a diverse range of styles.
In this paper, we introduce a framework for multimodal prediction and personalization in software applications. We focus on photo-enhancement applications, though our framework is also applicable to other domains where multimodal prediction and personalization are valuable. Figure 1 demonstrates our high-level goals: we want to learn to propose diverse, high-quality edits, and we want to be able to personalize those proposals based on users’ historical behavior.
Our modeling and inference approach is based on the variational autoencoder (VAE) of Kingma and Welling (2013) and a recent extension of it, the structured variational autoencoder (SVAE) of Johnson et al. (2016). Along with our new models, we develop approximate inference architectures that are adapted to our model structures. We apply our framework to three different datasets (collected from novice, semi-expert, and expert users) of image features and user edits from a photo-enhancement application and compare its performance qualitatively and quantitatively to various baselines. We demonstrate that our model outperforms other approaches.
2 Background and related work
In this section, we first briefly review the frameworks (VAEs and SVAEs) that our model is built upon; next, we provide an overview of the available models for predicting photo edits and summarize their pros and cons.
2.1 Variational autoencoder (VAE)
The VAE, introduced in Kingma and Welling (2013), has been successfully applied to various models with continuous latent variables and a complicated likelihood function (e.g., a neural network with nonlinear hidden layers). In these settings, posterior inference is typically intractable, and even approximate inference may be prohibitively expensive to run in the inner loop of a learning algorithm. The VAE allows this difficult inference to be amortized over many learning updates, making each learning update cheap even with complex likelihood models.
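As an instance of such models, consider modeling a set of $N$ i.i.d. observations $y = \{y_n\}_{n=1}^{N}$ with the following generative process: $z_n \overset{\text{iid}}{\sim} p(z)$ and $y_n \sim p_\gamma(y_n \mid z_n)$, where $z_n$ is a latent variable generated from a prior $p(z)$ (e.g., $\mathcal{N}(0, I)$) and the likelihood function $p_\gamma(y_n \mid z_n)$ is a simple distribution whose parameters can be a complicated function of $z_n$. For example, $p_\gamma(y_n \mid z_n)$ might be $\mathcal{N}(y_n; \mu(z_n; \gamma), \Sigma(z_n; \gamma))$, where the mean and the covariance depend on $z_n$ through a multilayer perceptron (MLP) richly parameterized by weights and biases $\gamma$. See Figure 2(a) for the graphical model representation of this generative process.

In the VAE framework, the posterior density $p_\gamma(z \mid y)$ is approximated by a recognition network $q_\phi(z \mid y)$, which can take the form of a flexible conditional density model such as an MLP parameterized by $\phi$. To learn the parameters of the likelihood function $\gamma$ and the recognition network $\phi$, the following lower bound on the marginal likelihood is maximized with respect to $\gamma$ and $\phi$:

$$\mathcal{L}_{\mathrm{VAE}}(\gamma, \phi) = \mathbb{E}_{q_\phi(z \mid y)}\left[\log p_\gamma(y \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid y) \,\|\, p(z)\right).$$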
To compute a Monte Carlo estimate of the gradient of this objective with respect to $\phi$, Kingma and Welling (2013) propose a reparameterization trick for sampling from $q_\phi(z \mid y)$ by first sampling from an auxiliary noise variable and then applying a differentiable map to the sampled noise. This yields a differentiable Monte Carlo estimate of the expectation with respect to $\phi$. Given the gradients, the parameters are updated by stochastic gradient ascent.

2.2 Structured variational autoencoder (SVAE)

Johnson et al. (2016) extend the VAE inference scheme to latent graphical models with neural network observation distributions. This SVAE framework combines the interpretability of graphical models with the flexible representations found by deep learning. For example, consider a latent Gaussian mixture model (GMM) with nonlinear observations:

$$s_n \overset{\text{iid}}{\sim} \pi, \qquad z_n \mid s_n, \{(\mu_k, \Sigma_k)\}_{k=1}^{K} \sim \mathcal{N}(\mu_{s_n}, \Sigma_{s_n}), \qquad y_n \mid z_n, \gamma \sim \mathcal{N}\left(\mu(z_n; \gamma), \Sigma(z_n; \gamma)\right).$$
Note that the nonlinear observation model for each data point resembles that of the VAE, while the latent variable model for $z_n$ is a GMM (see Figure 2(b)). This latent GMM can represent explicit latent cluster assignments while also capturing complex non-Gaussian cluster shapes in the observations.
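To simplify the SVAE notation, we consider a general setting in which we denote the global parameters of a graphical model by $\theta$ and the local latent variables by $z = \{z_n\}_{n=1}^{N}$. Furthermore, we assume that $p(\theta)$ and $p(z \mid \theta)$ are a conjugate pair of exponential family densities with sufficient statistics $t_\theta(\theta)$ and $t_z(z)$. We continue to use $\gamma$ to denote the parameters of a potentially complex, nonlinear observation likelihood $p(y \mid z, \gamma)$. Using a mean field family distribution $q(\theta)\,q(z)$ for approximating the posterior $p(\theta, z \mid y)$, the variational lower bound (VLB) can be written as:

$$\mathcal{L}(\eta_\theta, \eta_z, \gamma) = \mathbb{E}_{q(\theta)\,q(z)}\left[\log \frac{p(\theta)\, p(z \mid \theta)\, p(y \mid z, \gamma)}{q(\theta)\, q(z)}\right],$$

where $\eta_\theta$ and $\eta_z$ are the parameters of the variational distributions $q(\theta)$ and $q(z)$ respectively.

Due to the non-conjugate likelihood function $p(y \mid z, \gamma)$, standard variational inference methods cannot be applied to the latent GMM. To solve this problem, the SVAE replaces the non-conjugate likelihood with a recognition model that generates conjugate evidence potentials. We can then define a surrogate objective with conjugacy structure:

$$\hat{\mathcal{L}}(\eta_\theta, \eta_z, \phi) = \mathbb{E}_{q(\theta)\,q(z)}\left[\log \frac{p(\theta)\, p(z \mid \theta)\, \exp\{\psi(z; y, \phi)\}}{q(\theta)\, q(z)}\right],$$

where the potentials $\psi(z; y, \phi)$ have a conjugate form to $p(z \mid \theta)$. By choosing $\eta_z$ to optimize this surrogate objective, writing $\eta_z^{*}(\eta_\theta, \phi) = \arg\max_{\eta_z} \hat{\mathcal{L}}(\eta_\theta, \eta_z, \phi)$, the SVAE objective is then $\mathcal{L}_{\mathrm{SVAE}}(\eta_\theta, \gamma, \phi) = \mathcal{L}(\eta_\theta, \eta_z^{*}(\eta_\theta, \phi), \gamma)$, which can be shown to lower bound the variational inference objective in eq. 2.2. As in the stochastic variational inference (SVI) algorithm of Hoffman et al. (2013), there is a simple expression for the natural gradient of this objective with respect to the variational parameters with conjugate priors; the gradients w.r.t. other variational parameters, such as those parameterizing neural networks, can be computed using the reparameterization trick.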
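As a rough illustration of the reparameterization trick mentioned above, the following sketch estimates the gradient of a Gaussian expectation by pushing standard normal noise through a differentiable map. The toy integrand $f(z) = z^2$, the function name, and the sample size are assumptions chosen so that the estimate can be compared against a known analytic gradient; none of them come from the models in this paper.

```python
import random

def reparam_grad_estimate(mu, sigma, num_samples=200_000, seed=0):
    """Monte Carlo gradient of E_{z ~ N(mu, sigma^2)}[z^2] w.r.t. (mu, sigma).

    z is written as the differentiable map z = mu + sigma * eps with
    eps ~ N(0, 1), so d f(z)/d mu = f'(z) and d f(z)/d sigma = f'(z) * eps.
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # auxiliary noise, independent of (mu, sigma)
        z = mu + sigma * eps        # differentiable map applied to the noise
        df_dz = 2.0 * z             # derivative of the toy integrand f(z) = z^2
        grad_mu += df_dz            # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

g_mu, g_sigma = reparam_grad_estimate(mu=1.5, sigma=0.5)
# For comparison: E[z^2] = mu^2 + sigma^2, so the exact gradient
# is (2 * mu, 2 * sigma).
```

In a VAE the same idea is applied with the recognition network producing the mean and scale, and automatic differentiation replacing the hand-written chain rule.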
2.3 Related work on the prediction of photo edits
There are two main categories of models, parametric and nonparametric, that have been used for prediction of photo edits:
Parametric methods
These methods approximate a parametric function by minimizing a squared (or a similar) loss. The loss is typically squared distance in Lab color space, which more closely approximates human perception than RGB space (Sharma and Bala, 2002). This loss is reasonable if the goal is to learn from a set of consistent, relatively conservative edits. But when applied to a dataset of more diverse edits, a model that minimizes squared error will tend to predict the average edit. At best, this will lead to conservative predictions; in the worst case, the average of several good edits may produce a bad result.
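The averaging effect described above can be seen in a small sketch. The slider values below are hypothetical and stand for two groups of editors who push a single slider in opposite directions; the grid search stands in for any squared-error minimizer.

```python
def mse(prediction, targets):
    """Mean squared error of a single constant prediction."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

# Hypothetical slider values: half the edits darken, half brighten.
edits = [-0.8, -0.8, -0.8, 0.8, 0.8, 0.8]

# Search a grid of candidate predictions in [-1, 1] for the MSE minimizer.
candidates = [i / 100 for i in range(-100, 101)]
best = min(candidates, key=lambda c: mse(c, edits))

# Distance from the MSE-optimal prediction to the closest observed edit.
gap_to_nearest_edit = min(abs(best - e) for e in edits)
```

Here the minimizer sits at the mean of the two modes, far from every edit that any user actually made.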
Bychkovsky et al. (2011) collect a dataset of 5000 photos enhanced by 5 different experts; they identify a set of features and learn to predict the user adjustments after training on the collected dataset. They apply a number of regression techniques such as LASSO and Gaussian-process regression and show their proposed adjustments can match the adjustments of one of the 5 experts. Their method only proposes a single adjustment and the personalization scheme that they suggest requires the user to edit a set of selected training photos.
Yan et al. (2016) use a deep neural network to learn a mapping from an input photo to an enhanced one following a particular style; their results show that the proposed model is able to capture the nonlinear and complex nature of this mapping. They also incorporate semantic awareness in their model, so their model can predict the adjustments based on the semantically meaningful objects (e.g., human, animal, sky, etc.) in the photo. This method also only proposes a single style of adjustment.
Jaroensri et al. (2015) propose a technique that can predict an acceptable range of adjustments for a given photo. The authors crowdsourced a dataset of photos with various brightness and contrast adjustments, and asked the participants to label each edited image as “acceptable” or “unacceptable”.
From this labeled dataset they learn a support vector machine classifier that can determine whether an adjustment is acceptable or not. They use this model to predict the acceptable range of edits by first sampling from the parameter space and then using their learned model to analyze each sample. Although their model is able to propose a range of edits to the user, it requires a balanced, human-labeled training set of “acceptable” and “unacceptable” images. Since the number of bad edits may grow exponentially with the dimensionality of the adjustment space, they mostly limit their study to two-dimensional brightness and contrast adjustments.
Nonparametric methods
These methods are typically able to propose multiple edits or some uncertainty over the range of adjustments.
Lee et al. (2015) propose a method that can generate a diverse set of edits for an input photograph. The authors have a curated set of exemplar images in various styles. They use an example-based style-transfer algorithm to transfer the style from an exemplar image to an input photograph. To choose the right exemplar image, they do a semantic similarity search using features that they have learned via a convolutional neural network (CNN). Although their approach can recommend multiple edits to a photo, their edits are destructive; that is, the user is not able to customize the model’s edits.
Koyama et al. (2016) introduce a model for personalizing photo edits based only on the history of edits by a single user. The authors use a self-reinforcement procedure in which, after every edit by a user, they 1) update the distance metric between the user’s past photos, 2) update a feature vector representation of the user’s photos, and 3) update an enhancement preference model based on the feature vectors and the user’s enhancement history. This model requires data collection from a single user and does not benefit from other users’ information.
2.4 Related multimodal prediction models
Traditional neural networks using mean squared error (MSE) loss cannot naturally handle multimodal prediction problems, since MSE is minimized by predicting the average response. Neal (1992) introduces stochastic latent variables to the network and proposes training Sigmoid Belief Networks (SBN) with only binary stochastic variables. However, this model is difficult to train, and it can only make piecewise-constant predictions and is therefore not a natural fit to continuous-response prediction problems.
Bishop (1994) proposes mixture density networks (MDN), which are more suitable for continuous data. Instead of using stochastic units, the model directly outputs the parameters of a Gaussian mixture model. That is, some of the network outputs are used as mixing weights and the rest provide the means and variances of the mixture components. The complexity of MDNs’ predictive distributions is limited by the number of mixture components; if the optimal predictive distribution cannot be well approximated by a relatively small number of Gaussians, then an MDN may not be an ideal choice.
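To make the MDN output layer concrete, here is a minimal sketch of the log-likelihood of a scalar target under the mixture that such a network emits. The split of the outputs into `logits`, `means`, and `log_vars`, and the use of log-variances, are illustrative assumptions rather than details taken from Bishop (1994).

```python
import math

def mdn_log_likelihood(y, logits, means, log_vars):
    """Log-density of a scalar y under the Gaussian mixture an MDN outputs."""
    # A softmax over the logits gives the mixing weights (computed in log space).
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    log_terms = []
    for logit, mean, log_var in zip(logits, means, log_vars):
        log_weight = logit - log_norm
        log_gauss = -0.5 * (math.log(2 * math.pi) + log_var
                            + (y - mean) ** 2 / math.exp(log_var))
        log_terms.append(log_weight + log_gauss)
    # Log-sum-exp over components gives the mixture log-density.
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

# Two equally weighted, narrow components centered at -1 and +1.
logits, means, log_vars = [0.0, 0.0], [-1.0, 1.0], [math.log(0.01)] * 2
ll_at_mode = mdn_log_likelihood(1.0, logits, means, log_vars)
ll_at_average = mdn_log_likelihood(0.0, logits, means, log_vars)
```

Training an MDN amounts to minimizing the negative of this quantity, summed over the training targets, with respect to the network weights that produce the three output groups.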
Tang and Salakhutdinov (2013) add deterministic hidden variables to SBNs in order to model continuous distributions. The authors showed improvements over the SBN; nevertheless, training the stochastic units remained a challenge due to the difficulty of doing approximate inference on a large number of discrete variables.
Dauphin and Grangier (2015) propose a new class of stochastic networks called linearizing belief networks (LBN). LBN combines deterministic units with stochastic binary units multiplicatively. The model uses deterministic linear units which act as multiplicative skip connections and allow the gradient to flow without diffusion. The empirical results show that this model can outperform standard SBNs.
3 Models
Given the limitations of the available methods for predicting photo edits (described in Section 2.3), our goal is to propose a framework in which we can: 1) recommend a set of diverse, parametric edits based on a labeled dataset of photos and their enhancements, 2) categorize the users based on their style and type of edits they apply, and finally 3) personalize the enhancements based on the user category. We focus on the photo-editing application in this paper, but the proposed framework is applicable to other domains where users must make a selection from a large, richly parameterized design space where there is no single right answer (for example, many audio processing algorithms have large numbers of user-tunable parameters).
Our framework is based on VAEs and their extension SVAEs, and follows a mixture-of-experts design (Murphy, 2012, Section 11.2.4). We first introduce a conditional VAE that can generate a diverse set of enhancements to a given photo. Next, we extend the model to categorize the users based on their adjustment style. Our model can provide interpretable clusters of users with similar style. Furthermore, the model can provide personalized suggestions by first estimating a user’s category and then suggesting likely enhancements conditioned on that category.
3.1 Multimodal prediction with conditional Gaussian mixture variational autoencoder (CGM-VAE)
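Given a photo, we are interested in predicting a set of edits. Each photo is represented by a feature vector $x_n$ and its corresponding edits $y_n$ are represented by a vector of slider values (e.g., contrast, exposure, saturation). We assume that there are $K$ clusters of possible edits for each image. To generate the sliders $y_n$ for a given image $x_n$, we first sample a cluster assignment $s_n$ and a set of latent features $z_n$ from its corresponding mixture component $\mathcal{N}(\mu_{s_n}, \Sigma_{s_n})$. Next, conditioned on the image and $z_n$, we sample the slider values. The overall generative process for the slider values $\{y_n\}_{n=1}^{N}$ conditioned on the input images $\{x_n\}_{n=1}^{N}$ is

$$s_n \sim \pi, \qquad z_n \mid s_n, \{(\mu_k, \Sigma_k)\}_{k=1}^{K} \sim \mathcal{N}(\mu_{s_n}, \Sigma_{s_n}), \qquad y_n \mid x_n, z_n, \gamma \sim \mathcal{N}\left(\mu(z_n, x_n; \gamma), \Sigma(z_n, x_n; \gamma)\right),$$

where $\mu(z_n, x_n; \gamma)$ and $\Sigma(z_n, x_n; \gamma)$ are flexible parametric functions, such as MLPs, of the input image features $x_n$ concatenated with the latent features $z_n$. Summing over all possible values for the latent variables $s_n$ and $z_n$, the marginal likelihood $p(y_n \mid x_n) = \sum_{s_n} \int p(y_n, z_n, s_n \mid x_n)\, dz_n$ yields complex, multimodal densities for the image edits $y_n$.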
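The generative process described in this section can be sketched as ancestral sampling. Everything concrete below is an assumption for illustration: the toy decoder with random weights stands in for the trained MLP, covariances are taken to be diagonal, the observation noise is fixed, and the dimensions (3 image features, 2 latent features, 3 clusters) are arbitrary; only the 11 slider outputs echo the datasets used later.

```python
import math
import random

def make_toy_decoder(in_dim, hidden_dim, out_dim, rng):
    """Hypothetical stand-in for the MLP: random weights, one tanh hidden layer."""
    W1 = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    W2 = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(out_dim)]

    def decoder(inputs):
        hidden = [math.tanh(sum(w * v for w, v in zip(row, inputs))) for row in W1]
        mean = [math.tanh(sum(w * h for w, h in zip(row, hidden))) for row in W2]
        std = [0.05] * out_dim  # fixed noise level, an assumption
        return mean, std

    return decoder

def sample_edit(x, pi, mus, sigmas, decoder, rng):
    """Draw (cluster, latent features, sliders) for one image feature vector x."""
    s = rng.choices(range(len(pi)), weights=pi)[0]              # cluster assignment
    z = [rng.gauss(m, sd) for m, sd in zip(mus[s], sigmas[s])]  # latent features
    mean, std = decoder(z + x)                                  # MLP input is [z, x]
    y = [rng.gauss(m, sd) for m, sd in zip(mean, std)]          # slider values
    return s, z, y

rng = random.Random(0)
x = [0.2, -0.1, 0.4]                         # toy image features
pi = [0.5, 0.3, 0.2]                         # mixing weights for 3 clusters
mus = [[-2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]  # component means
sigmas = [[0.3, 0.3]] * 3                    # component standard deviations
decoder = make_toy_decoder(in_dim=2 + 3, hidden_dim=8, out_dim=11, rng=rng)
samples = [sample_edit(x, pi, mus, sigmas, decoder, rng) for _ in range(5)]
```

Repeated calls for the same image can land in different clusters, which is what allows the model to propose several distinct edits for one photo.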
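The posterior $p(s, z \mid x, y)$ is intractable. We approximate it with variational recognition models as

$$p_\gamma(s, z \mid x, y) \approx q_{\phi_s}(s \mid x, y)\, q_{\phi_z}(z \mid x, y, s). \tag{1}$$

Note that this variational distribution does not model $s$ and $z$ as independent. For $q_{\phi_s}(s \mid x, y)$, we use an MLP with a final softmax layer, and for $q_{\phi_z}(z \mid x, y, s)$, we use a Gaussian whose mean and covariance are the output of an MLP that takes $s$, $x$, and $y$ as input.

Given this generative model and variational family, to perform inference we maximize a variational lower bound on $\log p_\gamma(y \mid x)$, writing the objective as

$$\mathcal{L}(\theta, \gamma, \phi) = \mathbb{E}_{q(s, z \mid x, y)}\left[\log p_\gamma(y \mid x, z)\right] - \mathrm{KL}\left(q(s, z \mid x, y) \,\|\, p_\theta(s, z)\right),$$

where $\theta$ collects the mixture parameters $\pi$ and $\{(\mu_k, \Sigma_k)\}_{k=1}^{K}$, and $\phi = (\phi_s, \phi_z)$.

By marginalizing over the latent cluster assignments $s$, the CGM-VAE objective can be optimized using stochastic gradient methods and the reparameterization trick as in Section 2.1. Marginalizing out the discrete latent variables is not computationally intensive since $s$ and $y$ are conditionally independent given $z$, $p_\theta(s, z)$ is cheap to compute relative to $p_\gamma(y \mid x, z)$, and we use a relatively small number of clusters. However, with a very large discrete latent space, one could use alternate approaches such as the Gumbel-Max trick of Maddison et al. (2016).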
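One way to read the marginalization over cluster assignments is that the bound splits into a weighted sum over clusters, with the discrete part handled exactly and only the continuous latent features sampled. The sketch below follows that reading under several assumptions: diagonal Gaussians, a caller-supplied `log_lik` function standing in for the decoder likelihood, and hypothetical argument names throughout. It is not the authors' implementation.

```python
import math
import random

def kl_diag_gauss(m_q, sd_q, m_p, sd_p):
    """KL(N(m_q, diag(sd_q^2)) || N(m_p, diag(sd_p^2)))."""
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(m_q, sd_q, m_p, sd_p)
    )

def cgmvae_elbo(q_s, q_z, pi, prior_z, log_lik, rng, num_samples=1):
    """Lower bound with the discrete cluster assignment summed out exactly.

    q_s     : probabilities of the recognition model over clusters
    q_z     : per-cluster (mean, std) of the Gaussian recognition model
    pi      : prior mixing weights
    prior_z : per-cluster (mean, std) of the prior mixture components
    log_lik : callable z -> log-likelihood of the observed sliders given z
    """
    elbo = 0.0
    for k, w in enumerate(q_s):
        if w == 0.0:
            continue  # clusters with zero weight contribute nothing
        (m_q, sd_q), (m_p, sd_p) = q_z[k], prior_z[k]
        # Reparameterized Monte Carlo estimate of the expected log-likelihood.
        exp_ll = 0.0
        for _ in range(num_samples):
            z = [m + sd * rng.gauss(0.0, 1.0) for m, sd in zip(m_q, sd_q)]
            exp_ll += log_lik(z) / num_samples
        elbo += w * (exp_ll - kl_diag_gauss(m_q, sd_q, m_p, sd_p))
        elbo -= w * math.log(w / pi[k])  # this cluster's share of KL(q(s) || pi)
    return elbo

rng = random.Random(0)
pi = [0.3, 0.7]
prior_z = [([0.0, 0.0], [1.0, 1.0]), ([2.0, 2.0], [1.0, 1.0])]
constant_ll = lambda z: -1.25  # toy likelihood that ignores z

# Recognition model equal to the prior: both KL terms vanish.
elbo_matched = cgmvae_elbo(pi, prior_z, pi, prior_z, constant_ll, rng)
# Recognition model shifted away from the prior: the bound gets looser.
shifted = [([1.0, 0.0], [1.0, 1.0]), ([2.0, 2.0], [0.5, 0.5])]
elbo_shifted = cgmvae_elbo([0.5, 0.5], shifted, pi, prior_z, constant_ll, rng)
```

The cost of the loop grows linearly in the number of clusters, which is why exact summation is practical only when that number is small.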
Figure 3 (parts a and b) outlines the graphical model structures of the CGM-VAE and its variational distributions $q$.
3.2 Categorization and personalization
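In order to categorize the users based on their adjustment style, we extend the basic CGM-VAE model to a hierarchical model that clusters users based on the edits they make. While the model in the previous section considered each image-edit pair in isolation, we now organize the data according to $U$ distinct users, using $x_{un}$ to denote the $n$th image of user $u$ and $y_{un}$ to denote the corresponding slider values. $N_u$ denotes the number of photos edited by user $u$. As before, we assume a GMM with $K$ components $\{(\mu_k, \Sigma_k)\}_{k=1}^{K}$ and mixing weights $\pi$ to model the user categories.

For each user $u$ we sample a cluster index $s_u$ to indicate the user’s category, then for each photo $n \in \{1, \dots, N_u\}$ we sample the latent attribute vector $z_{un}$ from the corresponding mixture component:

$$s_u \sim \pi, \qquad z_{un} \mid s_u, \{(\mu_k, \Sigma_k)\}_{k=1}^{K} \overset{\text{iid}}{\sim} \mathcal{N}(\mu_{s_u}, \Sigma_{s_u}).$$

Finally, we use the latent features $z_{un}$ to generate the vector of suggested slider values $y_{un}$. As before, we use a multivariate normal distribution with mean and variance generated from an MLP parameterized by $\gamma$:

$$y_{un} \mid x_{un}, z_{un}, \gamma \sim \mathcal{N}\left(\mu(z_{un}, x_{un}; \gamma), \Sigma(z_{un}, x_{un}; \gamma)\right).$$

For inference in the CGM-SVAE model, our goal is to maximize the following VLB:

$$\mathcal{L}(\eta_\theta, \eta_s, \eta_z, \gamma) = \mathbb{E}_{q(\theta)\,q(s)\,q(z)}\left[\log \frac{p(\theta)\, p(s \mid \theta)\, p(z \mid s, \theta)\, p(y \mid x, z, \gamma)}{q(\theta)\, q(s)\, q(z)}\right].$$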
To optimize this objective, we follow a similar approach to the SVAE inference framework described in Section 2.2. In the following we define the variational factors and the recognition networks that we use.
Variational factors
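For the local variables $z$ and $s$, we restrict $q(z_{un})$ to be normal with natural parameters $\eta_{z_{un}}$ and we have $q(s_u)$ in the categorical form with natural parameter $\eta_{s_u}$. As in the CGM-VAE, we marginalize over cluster assignments at the user level.

For a dataset of $U$ users, the VLB factorizes as follows:

$$\mathcal{L} = \mathbb{E}_{q(\theta)}\left[\log \frac{p(\theta)}{q(\theta)}\right] + \sum_{u=1}^{U} \mathbb{E}_{q}\left[\log \frac{p(s_u \mid \theta)}{q(s_u)} + \sum_{n=1}^{N_u} \log \frac{p(z_{un} \mid s_u, \theta)\, p(y_{un} \mid x_{un}, z_{un}, \gamma)}{q(z_{un})}\right].$$

Figure 3 (parts c and d) outlines the graphical model structures of the CGM-SVAE and its variational distributions $q$.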
To adapt the recognition network used in the local inference objective (eq. 2.2) to our model structure, we write

$$\psi(s_u; x_{un}, y_{un}, \phi) = \left\langle r(x_{un}, y_{un}; \phi),\, t_s(s_u) \right\rangle, \tag{2}$$

where $t_s(s_u)$ denotes the one-hot vector encoding of the mixture component index $s_u$. That is, for each user image $x_{un}$ and corresponding set of slider values $y_{un}$, the recognition network $r(x_{un}, y_{un}; \phi)$ produces a potential over the user’s latent mixture component $s_u$. These image-by-image guesses are then combined with each other and with the prior to produce the inferred variational factor on $s_u$.

This recognition network architecture is both natural and convenient. It is natural because a powerful enough $r$ can set $\psi(s_u; x_{un}, y_{un}, \phi) = \log p(y_{un} \mid x_{un}, s_u)$, in which case $q(s_u) = p(s_u \mid \{x_{un}, y_{un}\}_{n=1}^{N_u})$ and there is no approximation error. It is convenient because it analyzes each image-edit pair independently, and these evidence potentials are combined in a symmetric, exchangeable way that extends to any number of user images $N_u$.
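The way per-image potentials are pooled into a single factor over a user's category can be sketched as follows. The function name and the toy potential vectors are hypothetical; in the model the potentials would be produced by the recognition network, and the prior term would come from the current variational factor on the mixing weights.

```python
import math

def user_cluster_posterior(log_prior, potentials):
    """Categorical factor proportional to exp(log_prior + sum of potentials)."""
    num_clusters = len(log_prior)
    logits = list(log_prior)
    for r in potentials:  # one vector of per-cluster scores per image-edit pair
        for k in range(num_clusters):
            logits[k] += r[k]
    m = max(logits)       # subtract the max before exponentiating for stability
    unnorm = [math.exp(l - m) for l in logits]
    total = sum(unnorm)
    return [u / total for u in unnorm]

log_prior = [math.log(0.5), math.log(0.3), math.log(0.2)]
potentials = [[0.1, 1.2, -0.3], [0.0, 0.9, 0.2], [-0.5, 1.5, 0.1]]

q_none = user_cluster_posterior(log_prior, [])               # no images observed
q_all = user_cluster_posterior(log_prior, potentials)        # three images
q_perm = user_cluster_posterior(log_prior, potentials[::-1]) # same images, reversed
```

Because the potentials enter only through their sum, the result does not depend on the order of the images and the same code handles any number of them, including none.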
4 Experiments
We evaluate our models and several strong baselines on three datasets. We focus on the photo editing software Adobe Lightroom. The datasets that we use cover three different types of users that can be roughly described as 1) casual users who do not use the application regularly, 2) frequent users who have more familiarity with the application and use it more frequently, and 3) experts who have more experience in editing photos than the other two groups. We split each of the three datasets into 80% training, 10% validation, and 10% test sets.
Datasets
The casual users dataset consists of 345000 images along with the slider values that a user has applied to the image in Lightroom. There are 3200 users in this dataset. Due to privacy concerns, we only have access to the extracted features from a convolutional neural network (CNN) applied to the images. Hence, each image in the dataset is represented by a 1024-dimensional vector. For the possible edits to the image, we only focus on 11 basic sliders in Lightroom. Many common editing tasks boil down to adjusting these sliders. The 11 basic sliders have different ranges of values, so we standardize them to all have a range between $-1$ and 1 when training the model.
The frequent users dataset contains 45000 images (in the form of CNN features) and their corresponding slider values. There are 230 users in this dataset. These users generally apply more changes to their photos compared to the users in the casual group.
Finally, the expert users dataset (Adobe-MIT5k, collected by Bychkovsky et al. (2011)) contains 5000 images and edits applied to these images by 5 different experts, for a total of 25000 edits.
We augment this dataset by creating new images after applying random edits to the original images. To generate a random edit from a slider, we add uniform noise from a range of 10% of the total range of that slider. Given the augmented set of images, we extract the “FC7” features of a pretrained VGG-16 network (Simonyan and Zisserman, 2014) and use the 4096-dimensional feature vector as a representation of each image in the dataset. After augmenting the dataset, we have 15000 images and 75000 edits in total. As with the other datasets, we only focus on the basic sliders in Adobe Lightroom.
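The slider preprocessing described for these datasets might look like the sketch below. Several details are assumptions, since the text does not pin them down: the target interval of the rescaling is taken to be [-1, 1], the uniform noise is taken to be centered at zero with a total width of 10% of the slider's range, and noisy values are clipped back into the valid range. The function names and the example slider range are hypothetical.

```python
import random

def rescale_slider(value, lo, hi):
    """Map a slider value from its native range [lo, hi] to [-1, 1] (assumed target)."""
    return 2.0 * (value - lo) / (hi - lo) - 1.0

def jitter_slider(value, lo, hi, frac=0.10, rng=random):
    """Add zero-centered uniform noise whose total width is frac * (hi - lo)."""
    half_width = 0.5 * frac * (hi - lo)
    noisy = value + rng.uniform(-half_width, half_width)
    return min(hi, max(lo, noisy))  # clipping to the slider range is an assumption

rng = random.Random(0)
lo, hi = -100.0, 100.0              # hypothetical native range of one slider
original = 20.0
augmented = [jitter_slider(original, lo, hi, rng=rng) for _ in range(1000)]
scaled = [rescale_slider(v, lo, hi) for v in augmented]
```

If the intended noise interval is instead plus or minus 10% of the range, doubling `frac` gives that variant.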
Baselines
We compare our model for multimodal prediction with several models: a multilayer perceptron (MLP), a mixture density network (MDN), and a linearizing belief network (LBN). The MLP is trained to predict the mean and variance of a multivariate Gaussian distribution; this model will demonstrate the limitations of even a strong model that makes unimodal predictions. The MDN and LBN, which are specifically designed for multimodal prediction, are other baselines for predicting multimodal densities. Table 1 summarizes our quantitative results.

We use three different evaluation metrics to compare the models. The first metric is the predictive log-likelihood computed over a held-out test set of each dataset. Another metric is the Jensen-Shannon divergence (JSD) between normalized histograms of marginal statistics of the true sliders and the predicted sliders. Figure 4 shows some histograms of these marginal statistics for the casual users.

Finally, we use the mean squared error in the CIELAB color space between the expert-retouched image and the model-proposed image. We use the CIELAB color space as it is more perceptually linear than RGB. We only calculate this error for the experts dataset (test set) since that is the only dataset with available retouched images. To compute this metric, we first apply the predicted sliders from the models to the original image and then convert the generated RGB image to a LAB image. For reference, the difference between white and black in CIELAB is 100, and photos with no adjustments result in an error of 10.2. Table 1 shows that our model outperforms the baselines across all these metrics.
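For concreteness, the JSD between two histograms can be computed as in the sketch below. The use of natural logarithms (so the divergence is bounded by log 2) and the example bin counts are assumptions; the text does not state the logarithm base or the binning used in the experiments.

```python
import math

def normalize(hist):
    """Turn non-negative bin counts into a probability vector."""
    total = sum(hist)
    return [h / total for h in hist]

def kl(p, q):
    """KL divergence between discrete distributions, skipping empty bins of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(hist_p, hist_q):
    """Jensen-Shannon divergence between two histograms over the same bins."""
    p, q = normalize(hist_p), normalize(hist_q)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture of the two distributions
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

true_counts = [5, 20, 40, 25, 10]       # hypothetical histogram of true slider values
predicted_counts = [8, 22, 35, 25, 10]  # hypothetical histogram of predicted values
score = jsd(true_counts, predicted_counts)
```

Unlike the KL divergence, the JSD stays finite when one histogram has empty bins that the other does not, which makes it convenient for comparing empirical histograms.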
Hyperparameters
For the CGM-VAE model, we choose the dimension of the latent variable from {2, 20} and the number of mixture components from the set {3, 5, 10}. For the remaining hyperparameters see the supplementary materials.
Tasks
In addition to computing the predictive log-likelihood and JSD over the held-out test sets for all three datasets, we consider the following two tasks:
1) Multimodal prediction: We predict different edits applied to the same image by the users in the experts dataset. Our goal is to show that the CGM-VAE is able to capture different styles from the experts.
2) Categorizing the users and adapting the predictions based on users’ categories: We show that the CGM-SVAE model, by clustering the users, makes better predictions for each user. We also illustrate how inferred user clusters differ in terms of edits they apply to similar images.
4.1 Multimodal predictions
To show that the model is capable of multimodal predictions, we propose different edits for a given image in the test subset of the experts dataset. To generate these edits, we sample from different cluster components of our CGM-VAE model trained on the experts dataset. For each image we generate 20 different samples and align these samples to the experts’ sliders. Of the 5 experts in the dataset, 3 propose a more diverse set of edits than the others; hence, we only align our results to those three to show that the model can reasonably capture a diverse set of styles.
For each image in the test set, we compare the predictions of the MLP, LBN, MDN, and CGM-VAE with the edits from the 3 experts. For the MLP (and also the MDN), we draw 20 samples from the Gaussian (mixture) distribution with parameters generated from the MLP (MDN). For the LBN, since the network has stochastic units, we directly sample 20 times from the network. We align these samples to the experts’ edits and find the LAB error between the expert-retouched image and the model-proposed image.
To report the results we average across the 3 experts and across all the test images. The LAB error in Table 1 indicates that the CGM-VAE model outperforms the other baselines in terms of predicting expert edits. Some sample edit proposals and their corresponding LAB errors are provided in Figure 5. This figure shows that the CGM-VAE model can propose a diverse set of edits that is reasonably close to those of experts. For further examples see the supplementary material.
Dataset  Casual  Frequent  Expert  

Eval. Metric  LL  JSD  LL  JSD  LL  JSD  LAB 
MLP  
LBN  
MDN  
CGM-VAE 
4.2 Categorization and personalization
Next, we demonstrate how the CGM-SVAE model can leverage the knowledge from a user’s previous edits and propose better future edits. For the users in the test sets of all three datasets, we use between 0 and 30 image-slider pairs to estimate the posterior of each user’s cluster membership. We then evaluate the predictive log-likelihood for 20 other slider values conditioned on the images and the inferred cluster memberships.
Figure 6 depicts how adding more image-slider combinations can generally improve the predictive log-likelihood. The log-likelihood is normalized by subtracting off the predictive log-likelihood computed given zero images. The effect of adding more images is shown for 30 different sampled users; the overall average for the test dataset is also shown in the figure. To compare how various datasets benefit from this model, the average values from the 3 datasets are overlaid. According to this figure, the frequent users benefit more than the casual users and the expert users benefit the most. (To apply the CGM-SVAE model to the experts dataset, we split the image-slider combinations from each of the 5 experts into groups of 50 image-slider pairs and pretend that each group belongs to a different user. This way we can have more users to train the CGM-SVAE model. However, this means the same expert may have some image-slider pairs in both train and test datasets. The significant advantage gained in the experts dataset might be due in part to this way of splitting the experts. Note that there are still no images shared across train and test sets.)
To illustrate how the trained CGM-SVAE model proposes edits for different user groups, we use a set of similar images in the experts dataset and show the predicted slider values for those images. Figure 7 shows how the inferred user groups edit two groups of similar images. This figure provides further evidence that the model is able to propose a diverse set of edits across different groups; moreover, it shows each user group may have a preference over which slider to use. For more examples see the supplementary material.
5 Conclusion
We proposed a framework for multimodal prediction of photo edits and extended the model to make personalized suggestions based on each user’s previous edits. Our framework outperforms several strong baselines and demonstrates the benefit of having interpretable latent structure in VAEs. Although we only applied our framework to data from photo editing applications, it can be applied to other domains where multimodal prediction, categorization, and personalization are essential. Our proposed models could be extended further by assuming more complicated graphical model structure, such as admixture models, instead of the Gaussian mixture model that we used. Furthermore, the categories learned by our model can be utilized to gain insights about the types of users in the dataset.
References
 Bishop [1994] C. M. Bishop. Mixture density networks. 1994.
 Bychkovsky et al. [2011] V. Bychkovsky, S. Paris, E. Chan, and F. Durand. Learning photographic global tonal adjustment with a database of input/output image pairs. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 97–104. IEEE, 2011.
 Dauphin and Grangier [2015] Y. N. Dauphin and D. Grangier. Predicting distributions with linearizing belief networks. arXiv preprint arXiv:1511.05622, 2015.
 Hoffman et al. [2013] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 Jaroensri et al. [2015] R. Jaroensri, S. Paris, A. Hertzmann, V. Bychkovsky, and F. Durand. Predicting range of acceptable photographic tonal adjustments. In Computational Photography (ICCP), 2015 IEEE International Conference on, pages 1–9. IEEE, 2015.
 Johnson et al. [2016] M. J. Johnson, D. Duvenaud, A. B. Wiltschko, S. R. Datta, and R. P. Adams. Composing graphical models with neural networks for structured representations and fast inference. In Neural Information Processing Systems, 2016.
 Kingma and Welling [2013] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Koyama et al. [2016] Y. Koyama, D. Sakamoto, and T. Igarashi. Selph: Progressive learning and support of manual photo color enhancement. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 2520–2532. ACM, 2016.
 Lee et al. [2015] J.-Y. Lee, K. Sunkavalli, Z. Lin, X. Shen, and I. S. Kweon. Automatic content-aware color and tone stylization. arXiv preprint arXiv:1511.03748, 2015.
 Maddison et al. [2016] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Murphy [2012] K. P. Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
 Neal [1992] R. M. Neal. Connectionist learning of belief networks. Artificial intelligence, 56(1):71–113, 1992.
 Sharma and Bala [2002] G. Sharma and R. Bala. Digital color imaging handbook. CRC press, 2002.
 Simonyan and Zisserman [2014] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 Tang and Salakhutdinov [2013] Y. Tang and R. R. Salakhutdinov. Learning stochastic feedforward neural networks. In Advances in Neural Information Processing Systems, pages 530–538, 2013.
 Yan et al. [2016] Z. Yan, H. Zhang, B. Wang, S. Paris, and Y. Yu. Automatic photo adjustment using deep neural networks. ACM Transactions on Graphics (TOG), 35(2):11, 2016.