The need to edit photographs has existed for as long as there has been photography. Cameras are inherently limited in capturing real world lighting and colors, and users compensate for these limitations through later editing. Digital image processing programs, such as Photoshop, made image editing more accessible. However, novice users often find that they need significant training to carry out their desired edits, as illustrated on web sites such as Reddit’s Photoshop Request forum (https://www.reddit.com/r/PhotoshopRequest/) and Zhopped (http://zhopped.com/), where users submit image edit requests. They often communicate their editing needs using ordinary, non-technical language, such as:
There is a spot on my wedding dress. Can someone please remove it. Please!
He just passed away. He’d want his obituary photo to look phenomenal, but I think the lighting on his face is bad. Can someone fix that for me please?
We aim to develop a software tool that will assist all users in achieving their image editing goals by interpreting and executing natural language image edit requests. This tool will allow users to manipulate images independently, without the assistance of an expert user and without learning technical vocabulary. Our work is a first step towards developing such a tool. We build on our previous work, the Edit Me corpus [Manuvinakurike et al.2018], a data set of written requests to edit real world images, and its related annotation framework. We contribute to the data set by resolving previous annotation discrepancies and completing the unlabeled annotations. All utterances were annotated using the framework developed in Edit Me, which maps requests to executable commands in image editing software. We then implemented a two-level system, in which the first level classifies the action in an utterance and the second level identifies the properties related to that action.
2 Image and Edit Request Research
2.1 Previous Research
Research combining vision and language includes systems for visual question answering [Antol et al.2015], visual storytelling [Huang et al.2016], generating questions about an image [Mostafazadeh et al.2016], and question-answer interactions grounded in information shown in an image [Mostafazadeh et al.2017]. Previous work on understanding image descriptions [Kulkarni et al.2013] is essential to our work, as illustrated in the forum examples above (lighting on his face). Our work also draws on work to identify visual references (on my wedding dress) [Paetzel et al.2015, de Vries et al.2016]. [Laput et al.2013] developed a mobile interface that lets users edit images through spoken language; it is the only known previous work on image editing. They employed a rule-based system; we expand on this work by using machine learning to handle a larger variety of natural language image editing utterances and utterance structures.
We utilize our Edit Me corpus of 9101 text edit requests (44727 word tokens), which was created using Amazon Mechanical Turk (https://mturk.com) crowd-workers (called turkers for the rest of this work) [Manuvinakurike et al.2018]. The elicited requests showed wide and challenging variation in the vocabulary, utterance structure, and domain knowledge used to accomplish similar editing outcomes. For vocabulary, similar but distinct terms were used to request similar actions, such as crop, cut out, and delete to alter the dimensions of an image. While the majority of requests began with an imperative verb, for example Crop the left side of the image, other prevalent utterances used modal verbs or phrased the request as a comment, such as The image is blurry. A lack of domain knowledge led to some ambiguous cases, such as “zoom in”, which could indicate bringing a portion of the photo into closeup but could also indicate a request to crop (e.g., Zoom in on the zebra).
2.3 Framework for Image Edits
Our annotation framework serves as an intermediary language that interprets requests in terms of editing software functionality.
An Image Edit Request (IER) contains an action that could be completed by an image editing program. IERs are composed of at most one action, and zero or more related entities. The framework maps an explicit or implicit word or phrase to one of 18 actions: adjust, delete, crop, add, replace, apply, zoom, rotate, transform, move, clone, select, swap, undo, merge, redo, other, and scroll. While actions provide a first level of understanding of an IER, entities complete the interpretation of how an action should be applied. The framework supports five types of entities: attribute, modifier/value, object, region, and intention.
The framework’s flexibility permits multiple annotations of the same entity type in a single IER. It also supports an utterance having no entities, which occurred in 3% of the instances in the data set. A unique feature of the framework is that the same word in an utterance can have multiple labels, or one label's span can be a subset of another's. In the example Increase the saturation, the word increase is annotated both as an adjust action and as a modifier/value entity.
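As an illustration of the framework described above, the following minimal sketch represents an IER as one optional action plus zero or more entity spans. The class and field names (`Span`, `IER`, `tokens`) are assumptions made for this example and are not taken from the Edit Me annotation tooling; the attribute label on saturation is likewise assumed for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The 18 actions and five entity types listed in the framework description.
ACTIONS = {
    "adjust", "delete", "crop", "add", "replace", "apply", "zoom", "rotate",
    "transform", "move", "clone", "select", "swap", "undo", "merge", "redo",
    "other", "scroll",
}
ENTITY_TYPES = {"attribute", "modifier/value", "object", "region", "intention"}


@dataclass(frozen=True)
class Span:
    """A labeled token span; `end` is exclusive. (Hypothetical helper type.)"""
    label: str
    start: int
    end: int


@dataclass
class IER:
    """At most one action and zero or more entities over the same tokens."""
    tokens: List[str]
    action: Optional[Span] = None
    entities: List[Span] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.action is not None and self.action.label not in ACTIONS:
            raise ValueError(f"unknown action: {self.action.label}")
        for entity in self.entities:
            if entity.label not in ENTITY_TYPES:
                raise ValueError(f"unknown entity type: {entity.label}")


# "Increase" carries two labels at once: the adjust action and a
# modifier/value entity. The attribute span on "saturation" is an assumption.
ier = IER(
    tokens=["Increase", "the", "saturation"],
    action=Span("adjust", 0, 1),
    entities=[Span("modifier/value", 0, 1), Span("attribute", 2, 3)],
)
```

Storing spans rather than per-token tags keeps overlapping and nested labels representable, which a single tag per token cannot do.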
The annotation scheme was previously tested for inter-rater reliability on a sample of 600 utterances (results shown in Table 1) [Manuvinakurike et al.2018]. The highest agreement was attained for the action types; agreement on entities was lower but still well above chance. However, for some entities, namely modifier/value, agreement borders on chance level.
|IER vs. comment||0.28||0.53||0.35|
Due to the low levels of agreement, we determined that the annotations should be redone by an annotator with additional training and support before computational modeling could be attempted. Furthermore, not all of the corpus had been annotated by highly trained annotators; part of it was annotated by turkers, who showed even lower levels of inter-annotator agreement (see [Manuvinakurike et al.2018] for these scores). All utterances were thus annotated by our annotator, who was trained by reviewing and discussing the annotations and annotator disagreements in the previous labels.
The most common action represented in the data set was adjust (comprising 44% of actions in the corpus). The framework was designed for interactive dialogue, not only the single instance IERs elicited in the crowd-sourced corpus. As a result, actions like undo, redo, select, merge, and scroll are infrequent or do not occur in the corpus. For entities, attribute occurred in 56% of the annotated utterances; modifier/value was labeled in 32% of IERs; object was labeled in 30% of utterances; region was annotated in 60% of the IERs; and intention occurred in 29% of the relabeled data.
The corpus of annotated utterances was first filtered for executable actions. In this stage, we removed all utterances without an IER, such as in This image should have been taken with a Nikon. Utterances with an other action were also filtered out (0.01% of the corpus). IERs labeled other contained some level of ambiguity that made the requested edit impossible to execute, such as Clean up the pavement. It is unclear if the user would be satisfied by the pavement being edited to all be a uniform color, or perhaps by deleting foliage on the pavement. We leave the investigation of this particular action for future work.
To prepare the model input, GloVe [Pennington et al.2014] was selected to map the IERs to vectors. Annotated entity sequences were converted to BIO (beginning-inside-outside) sequences. For example, the utterance Crop the image, annotated as [IER : [ACTION-CROP : crop ] [LOCATION : the image ] ], would become O, B-LOCATION, I-LOCATION. Nested entities, such as in Add a warmer hue, where “warmer” is labeled with both the attribute and value labels, presented important considerations for the BIO encoder, since nesting with a high degree of depth is possible. Nested entities account for 4% of all the entities in the corpus, so a nesting depth that occurred in the test set might not have occurred in the training set. Both models investigated in this work often fail when encountering a novel nested entity beyond the depth seen in training. However, using the innermost entity would allow the image editing system to still respond to a novel multi-level nested utterance, albeit with an incomplete outcome. As all labels carry the same importance, and there was no annotation order rule for nested labels, we expected that performance would be similar for any depth of nesting ultimately used. For these reasons, we arbitrarily chose to use the innermost label.
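The BIO conversion with the innermost-label rule can be sketched as follows. This is only one possible reading of "innermost": each token takes the label of the shortest span covering it. The function name, the span tuple layout, and the tie-breaking rule (the last listed span wins when two spans have equal length) are assumptions for this example, not the encoder used in our experiments.

```python
from typing import List, Tuple

# (label, start, end) over token indices, with `end` exclusive. Hypothetical layout.
Span = Tuple[str, int, int]


def to_bio_innermost(n_tokens: int, spans: List[Span]) -> List[str]:
    """Convert possibly nested entity spans to one BIO tag per token."""
    tags = ["O"] * n_tokens
    # Paint longer (outer) spans first so shorter (inner) spans overwrite them.
    ordered = sorted(spans, key=lambda s: s[2] - s[1], reverse=True)
    for label, start, end in ordered:
        for i in range(start, end):
            tags[i] = ("B-" if i == start else "I-") + label
    # Repair pass: an inner span may cut an outer span in two, leaving an
    # I- tag that no longer follows a tag with the same label.
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            previous = tags[i - 1] if i > 0 else "O"
            if previous[2:] != tag[2:]:
                tags[i] = "B-" + tag[2:]
    return tags


# "Crop the image": the action word is outside the entity sequence.
flat = to_bio_innermost(3, [("LOCATION", 1, 3)])

# "Add a warmer hue" with an assumed nesting: ATTRIBUTE over "warmer hue",
# VALUE over "warmer" only. These spans are illustrative, not corpus data.
nested = to_bio_innermost(4, [("ATTRIBUTE", 2, 4), ("VALUE", 2, 3)])
```

Here `flat` is `["O", "B-LOCATION", "I-LOCATION"]`, matching the example in the text, and `nested` is `["O", "O", "B-VALUE", "B-ATTRIBUTE"]`: the outer label survives only on the tokens the inner span does not cover, which is the incomplete outcome mentioned above.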
Finally, fixed sets of utterances for training and testing were created by randomly selecting utterances from the corpus. For actions, the training set contained 4958 utterances (75% of the corpus) and the test set contained 1584 utterances. The entities data was split into training (80% of the corpus), validation (10%), and testing (10%).
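A fixed random split of this kind can be sketched as below. The function name, the seed, and the rounding rule are assumptions chosen for the example, so the sketch is not expected to reproduce the exact set sizes reported above.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_corpus(items: Sequence[T], fractions: Sequence[float],
                 seed: int = 0) -> List[List[T]]:
    """Shuffle once with a fixed seed, then cut into consecutive parts."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a fixed split
    parts, start, cumulative = [], 0, 0.0
    for fraction in fractions[:-1]:
        cumulative += fraction
        end = round(cumulative * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    parts.append(shuffled[start:])  # the last part takes the remainder
    return parts


# Entities: 80% training, 10% validation, 10% testing (toy corpus of 100 ids).
train, validation, test = split_corpus(range(100), (0.8, 0.1, 0.1))
# Actions: 75% training, 25% testing.
action_train, action_test = split_corpus(range(100), (0.75, 0.25))
```

Seeding a private `random.Random` instance keeps the split reproducible without touching the global random state.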
4.2 Structure of the Model
Our predictive model is composed of two levels. The first level classifies only actions in an IER. The results of the classification process are then passed to a second level which detects sequences of only the entities. We propose that splitting the model will encourage filtering out IERs with ambiguous executable actions, thus preventing these utterances from being processed further.
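The control flow of the two levels can be sketched as follows, assuming that filtering happens on the predicted action label. The function `interpret`, the `NON_EXECUTABLE` set, and the two toy stand-in models are hypothetical and exist only to make the sketch runnable; they are not the classifiers evaluated in this work.

```python
from typing import Callable, List, Optional, Tuple

ActionClassifier = Callable[[List[str]], str]
EntityTagger = Callable[[List[str]], List[str]]

# Labels that stop processing after the first level (assumed for this sketch).
NON_EXECUTABLE = {"other"}


def interpret(tokens: List[str],
              classify_action: ActionClassifier,
              tag_entities: EntityTagger) -> Optional[Tuple[str, List[str]]]:
    """Level 1 predicts the action; level 2 runs only for executable actions."""
    action = classify_action(tokens)
    if action in NON_EXECUTABLE:
        return None  # ambiguous request: never reaches the entity tagger
    return action, tag_entities(tokens)


def toy_classifier(tokens: List[str]) -> str:
    """Stand-in for the action model: looks only at the first word."""
    first = tokens[0].lower()
    return first if first in {"crop", "rotate", "add"} else "other"


def toy_tagger(tokens: List[str]) -> List[str]:
    """Stand-in for the entity model: tags all words after the first as one span."""
    if len(tokens) < 2:
        return ["O"] * len(tokens)
    return ["O", "B-LOCATION"] + ["I-LOCATION"] * (len(tokens) - 2)


executable = interpret("Crop the image".split(), toy_classifier, toy_tagger)
ambiguous = interpret("Clean up the pavement".split(), toy_classifier, toy_tagger)
```

With these stand-ins, `executable` is `("crop", ["O", "B-LOCATION", "I-LOCATION"])` and `ambiguous` is `None`, so the ambiguous request is dropped before entity detection.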
as the backend against three baseline algorithms: Support Vector Machine (SVM), Logistic Regression, and Random Forest. All baseline machine learning models were implemented in Python using Scikit Learn (http://scikit-learn.org/stable/).
In the second level, which detects entities in an IER, we compared Conditional Random Fields (CRF) with default parameters (namely, the L-BFGS training algorithm) in Scikit Learn as a baseline against a state-of-the-art model, BiLSTM-CRF [Lample et al.2016]. BiLSTM-CRF combines a bidirectional LSTM with a CRF model. Previous experiments indicated the ability of this model to improve upon the limitations of CRFs by constraining the independence of output labels via the LSTM component. We utilized the default parameters for BiLSTM-CRF.
The best F1 scores for this task were achieved by the LSTM and SVM models (Table 2). One concern was that the highly skewed data set would present problems, namely that all utterances would be classified as the majority class, the adjust action. As the confusion matrix in Figure 2 attests, however, the LSTM model (shown on the top) correctly classified minority classes with high accuracy; for example, the rotate action was correctly classified most of the time despite its low frequency of occurrence in the data. The SVM algorithm (confusion matrix shown on the bottom of Figure 2) was not as robust at correctly classifying the majority label, but did perform better than LSTM at predicting the three next largest classes (add, crop, delete).
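The majority-class concern can be made concrete with per-class F1 computed from a confusion matrix. The sketch below uses made-up counts for a degenerate classifier that always predicts adjust; the numbers are illustrative only and are not results from our experiments, and the function name is an assumption.

```python
from typing import Dict, List


def per_class_f1(confusion: List[List[int]], labels: List[str]) -> Dict[str, float]:
    """confusion[i][j] counts items whose true label is i and predicted label is j."""
    scores = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fp = sum(row[k] for row in confusion) - tp  # column total minus diagonal
        fn = sum(confusion[k]) - tp                 # row total minus diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


# Made-up counts: 90 adjust and 10 rotate utterances, all predicted as adjust.
labels = ["adjust", "rotate"]
confusion = [[90, 0],
             [10, 0]]
scores = per_class_f1(confusion, labels)
accuracy = sum(confusion[k][k] for k in range(len(labels))) / 100
macro_f1 = sum(scores.values()) / len(scores)
```

In this toy case accuracy is 0.9 while the F1 for rotate is 0, which is why the confusion matrix, rather than a single aggregate score, is the useful check on a skewed data set.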
For entities, we experimented with producing only the innermost entity as well as nested entities. Table 3 gives the results for correctly translating an utterance into a sequence of executable entities.
|Only innermost entities||0.66||0.73|
The results indicate that the state-of-the-art algorithm, BiLSTM-CRF, performs substantially better than the baseline CRF model for innermost entities. This indicates that the constraints induced by the BiLSTM component of the BiLSTM-CRF have a meaningful effect on the sequencing capability of the overall model.
6 Conclusions and future work
This paper provided first steps towards automated image editing communicated through natural language. We contributed to the Edit Me corpus by annotating the remainder of the corpus and by re-annotating utterances with previous annotation disagreement. We also evaluated a two-level system to classify actions and sequence entities in an edit request. We determined that the SVM model performed as well as LSTM for classifying actions, and that BiLSTM-CRF performs better at sequencing of only the innermost label of nested entities than the baseline learning algorithm. In future work, we plan to investigate a joint model that can predict both actions and entities. In addition, the two-level action and entities model will be applied to image editing dialogues to explore transfer learning. Finally, in many cases, entities require further parsing before being fully executable. We leave it to future work to parse vague entities.
7 Bibliographical References
- [Antol et al.2015] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., and Parikh, D. (2015). VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, Santiago, Chile.
- [de Vries et al.2016] de Vries, H., Strub, F., Chandar, S., Pietquin, O., Larochelle, H., and Courville, A. (2016). GuessWhat?! visual object discovery through multi-modal dialogue. arXiv preprint arXiv:1611.08481.
- [Huang et al.2016] Huang, T.-H. K., Ferraro, F., Mostafazadeh, N., Misra, I., Agrawal, A., Devlin, J., Girshick, R., He, X., Kohli, P., Batra, D., Zitnick, C. L., Parikh, D., Vanderwende, L., Galley, M., and Mitchell, M. (2016). Visual storytelling. In Proceedings of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT), pages 1233–1239, San Diego, California, USA.
- [Kulkarni et al.2013] Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S., Choi, Y., Berg, A. C., and Berg, T. L. (2013). Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903.
- [Lample et al.2016] Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. (2016). Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pages 260–270.
- [Laput et al.2013] Laput, G. P., Dontcheva, M., Wilensky, G., Chang, W., Agarwala, A., Linder, J., and Adar, E. (2013). Pixeltone: a multimodal interface for image editing. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2185–2194. ACM.
- [Manuvinakurike et al.2018] Manuvinakurike, R., Brixey, J., Bui, T., Chang, W., Kim, D. S., Artstein, R., and Georgila, K. (2018). Edit me: A corpus and a framework for understanding natural language image editing. In Proceedings of LREC 2018, Miyazaki, Japan.
- [Mostafazadeh et al.2016] Mostafazadeh, N., Misra, I., Devlin, J., Mitchell, M., He, X., and Vanderwende, L. (2016). Generating natural questions about an image. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 1802–1813, Berlin, Germany.
- [Mostafazadeh et al.2017] Mostafazadeh, N., Brockett, C., Dolan, B., Galley, M., Gao, J., Spithourakis, G., and Vanderwende, L. (2017). Image-grounded conversations: multimodal context for natural question and response generation. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), Taipei, Taiwan.
- [Paetzel et al.2015] Paetzel, M., Manuvinakurike, R., and DeVault, D. (2015). “So, which one is it?” The effect of alternative incremental architectures in a high-performance game-playing agent. In Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 77–86, Prague, Czech Republic.
- [Pennington et al.2014] Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.