Sketching is an intuitive way to represent visual signals. With a few sparse strokes, humans can understand a sketch and envision the photo it depicts. Additionally, unlike photos, which are rich in color and texture, sketches are easy to edit because strokes are simple to modify. We aim to synthesize photos that preserve the structure of scene sketches while delivering the low-level visual style of reference photos.
Unlike previous works [15, 24, 31] that synthesize photos from categorical object-level sketches, our goal of taking scene-level sketches as input poses additional challenges. 1) Lack of data. There is no training data available for our task due to the complexity of scene sketches: not only are scene sketches themselves scarce, but the lack of paired scene sketch-photo datasets makes supervised learning from one modality to the other intractable. 2) Complexity of scene sketches. A scene sketch usually contains many objects of diverse semantic categories with complicated spatial organization and occlusions. Isolating objects, synthesizing object photos, and compositing them together does not work well and is hard to generalize. For one, detecting objects in sketches is hard due to their sparse structure. For another, one may encounter objects outside the seen categories, and the composition itself can make the synthesized photo unrealistic.
We propose to alleviate these issues via 1) a standardization module and 2) disentangled representation learning.
For the lack of data, we propose a standardization module that converts input images to a standardized domain: edge maps. Edge maps can be considered synthetic sketches due to their high similarity to real sketches. With standardization, readily available large-scale photo datasets can be used for training by converting them to edge maps. Additionally, during inference, sketches of various individual styles are also standardized, narrowing the gap between training and inference.
For the complexity of scene sketches, we learn disentangled holistic content and low-level style representations from photos and sketches by encouraging only the content representations of photo-sketch pairs to be similar. By definition, content representations encode the holistic semantic and geometric structure of a sketch or photo, while style representations encode low-level visual information such as color and texture. A sketch can depict similar content to a photo while containing no color or texture information. By factoring out colors and textures, the model can learn scene structures directly from large-scale photo datasets and transfer that knowledge to sketches. Additionally, combining the content representation of a sketch with the style representation of a reference photo can decode a realistic photo that depicts similar content to the sketch and shares a similar style with the reference photo. This is the underlying mechanism of the proposed reference-guided scene sketch to photo synthesis approach. Note that disentangled representations have been studied previously for photos [34, 27]; we extend the concept to sketches.
As exemplified in Fig.1, beyond photo synthesis from scene sketches, our model also enables controllable photo editing by allowing users to directly modify strokes of the corresponding sketch. The process is easy and fast, as strokes are flexible to modify compared with photo editing via segmentation maps proposed in previous works [22, 15, 25, 27]. Specifically, the standardization module first converts a photo to a sketch. Users can modify strokes of the sketch and synthesize a newly edited photo with our model. Additionally, the style of the photo can be modified with another reference photo as guidance.
We summarize our contributions as follows: 1) We propose an unsupervised scene sketch to photo synthesis framework. We introduce a standardization module that converts arbitrary photos to standardized edge maps, enabling a vast amount of real photos to be utilized during training. 2) Our framework facilitates controllable manipulation of photo synthesis by editing scene sketches, with more plausibility and simplicity than previous approaches. 3) Technically, we propose novel designs for scene sketch to photo synthesis, including shared content representations that enable knowledge transfer from photos to sketches, and model fine-tuning with sketch-reference-photo triplets for improved performance.
2 Related Work
Conditional Generative Models.
Previous approaches generate realistic images by conditioning generative adversarial networks on a given user input. More recent methods extend this to multi-domain and multi-modal settings [23, 13, 4], facilitating numerous downstream applications including image inpainting [14, 28], photo colorization [41, 20], and texture and geometry synthesis [42, 10]. However, naively adopting this framework for our problem is challenging due to the absence of paired data in which sketches and photos are aligned. We address this by projecting arbitrary sketches and photos into an intermediate representation and generating pseudo-paired data to learn in an unsupervised setting.
Disentangled Representations. Style and content disentanglement was studied prior to the surge of deep learning models, showing that low-level style such as texture can be modeled as statistics of an image. Deep generative models [34, 16, 27, 21] have also achieved success in photo style transfer through such disentanglement. We extend the disentanglement idea to sketches and show its application in photo synthesis.
Sketch to Photo Synthesis. Following the seminal SketchyGAN, several efforts have been made to synthesize photos [8, 24, 37] or reconstruct 3D shapes [35, 5, 36] from sketches. They, however, mainly focus on categorical single-object sketches without substantial background clutter, and thus struggle with complicated scene-level sketches.
Scene sketch to photo synthesis is limited by the lack of data. SketchyScene is the only scene sketch dataset with object segmentation and corresponding cartoon images. However, its sketches are manually composited from multiple object sketches with reference to a cartoon image; such composite sketches have a large domain gap to real scene sketches drawn with reference to a real scene. This composition idea has strongly shaped how researchers approach photo synthesis: existing methods detect objects in composite sketches, generate individual object photos along with a background image, and combine them together. Holistic scene structures are ignored, and the photo composition leads to artifacts and unrealism. In contrast, we learn holistic scene structures from massive photo datasets and transfer the knowledge to sketches.
Deep Image Editing. Aided by powerful generative models, previous works edit photos by modifying an extracted latent vector. Typically, they sample the desired latent vector from a fixed distribution according to a user's semantic control, or let a user spatially annotate a region-based semantic layout [27, 26]. DeepFaceDrawing enables users to sketch progressively for face image synthesis. Our work differs in that we allow users to directly edit strokes of a complicated scene sketch, thus enabling much more fine-grained editing.
3 Method
As illustrated in Fig.2, our framework mainly consists of two components: domain standardization and reference-guided photo synthesis. For standardization (details in Section 3.1), input photos and sketches are converted to standardized edge maps, which bypasses the lack-of-data issue. The second part is reference-guided photo synthesis (details in Section 3.2), where synthesized photos are generated based on input sketches and style reference photos.
3.1 Domain Standardization
Due to the lack of paired sketch-photo datasets, supervised photo synthesis from sketches is intractable. We adopt a similar idea to prior work, converting inputs to a standardized domain; learning from such a domain was shown to outperform learning directly from unprocessed inputs.
As shown in Fig.2 (left), standardization can be considered data preprocessing and differs between training and inference. During training, we collect a large-scale photo dataset of a specific category, e.g., indoor scenes. Each photo is converted to a standardized edge map with an off-the-shelf deep-learning-based edge detector. During inference, the input is instead a sketch; we use the same edge detector to convert it to an edge map. Fig.3 depicts examples of photos, sketches, and their corresponding edges. The standardized edge maps have small domain discrepancies. In addition to narrowing the domain gap between training and test data, standardization at inference time also narrows the gap between individual sketching styles (e.g., stroke width), as similarly observed in prior work. Since edge maps serve as a proxy for real sketches, we slightly abuse terminology hereinafter: "synthetic sketches" (or simply "sketches") may refer to standardized edge maps.
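To illustrate the standardization idea, the sketch below maps any grayscale image (photo or sketch) into a common binary edge domain. It uses a simple Sobel filter as a stand-in for the deep edge detector used in the paper; the filter choice and threshold are our assumptions, not the authors'.

```python
import numpy as np

def sobel_edge_map(img, thresh=0.2):
    """Convert a grayscale image (H, W, values in [0, 1]) to a binary edge map.

    A stand-in for the off-the-shelf deep edge detector: both photos and
    sketches are projected into the same standardized edge domain.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()  # horizontal gradient
            gy[i, j] = (patch * ky).sum()  # vertical gradient
    mag = np.hypot(gx, gy)
    mag /= max(mag.max(), 1e-8)           # normalize to [0, 1]
    return (mag > thresh).astype(float)   # binarize: 1 = edge stroke
```

In the full pipeline, every training photo is run through the (learned) detector once as preprocessing, and the same detector is applied to user sketches at inference.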
3.2 Reference-Guided Photo Synthesis
Previous works [27, 34] show that photos can be encoded into two disentangled representations: content and style. We extend the concept to sketches and show that they can likewise be disentangled. Preserving the content representation while replacing the sketch style with a real photo's style representation generates a realistic synthesized photo.
The module is trained in two stages. 1) The disentangled representation encoding stage learns content and style representations from images via auto-encoding. 2) We further fine-tune the model with sketch-reference-photo triplets, with a regularization loss to guarantee synthesis quality. Our model is inspired by and builds on prior art in disentangled representation learning and style transfer, with novel designs for scene sketch to photo synthesis.
Disentangled Representation Encoding. Fig.4 depicts the pipeline of the disentangled representation encoding stage. Denote an input photo and its corresponding edge map as $(x, e)$, the encoder as $E$, the decoder as $G$, and the discriminator as $D$. The encoder maps the inputs to content and style representation pairs, $(c_x, s_x) = E(x)$ and $(c_e, s_e) = E(e)$. From the encoded representations, the decoder reconstructs the photo $\hat{x} = G(c_x, s_x)$ and its edge map $\hat{e} = G(c_e, s_e)$. The auto-encoder encourages the reconstructed pair to be similar to the input pair via the following reconstruction loss in $\ell_1$-norm:
$$\mathcal{L}_{rec} = \|\hat{x} - x\|_1 + \|\hat{e} - e\|_1.$$
Since the photo and the edge map depict the same content, we ask their content representations to be similar in $\ell_1$-norm:
$$\mathcal{L}_{cont} = \|c_x - c_e\|_1.$$
Further, an adversarial GAN loss is required to train the discriminator for realistic reconstructions:
$$\mathcal{L}_{GAN} = \mathbb{E}\left[\log D(x) + \log\bigl(1 - D(\hat{x})\bigr)\right].$$
The final loss is $\mathcal{L} = \mathcal{L}_{rec} + \lambda_{cont}\mathcal{L}_{cont} + \lambda_{GAN}\mathcal{L}_{GAN}$, where $\lambda_{cont}$ and $\lambda_{GAN}$ are both set to 0.5.
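The stage-1 objective can be sketched as follows, assuming toy array-valued inputs and codes (variable names, shapes, and the externally computed adversarial term are illustrative; the actual model operates on network feature maps):

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference, standing in for the L1 losses."""
    return np.abs(a - b).mean()

def stage1_loss(x, e, x_rec, e_rec, c_x, c_e, adv_loss,
                lam_cont=0.5, lam_gan=0.5):
    """Total stage-1 loss: reconstruction + shared-content + adversarial.

    x, e         : input photo and its edge map
    x_rec, e_rec : reconstructions decoded from (content, style) codes
    c_x, c_e     : content codes of the photo and of the edge map
    adv_loss     : scalar generator-side adversarial loss (computed elsewhere)
    """
    rec = l1(x_rec, x) + l1(e_rec, e)   # reconstruction term
    cont = l1(c_x, c_e)                 # photo and edge share content
    return rec + lam_cont * cont + lam_gan * adv_loss
```

The shared-content term is what lets structure knowledge learned from photos transfer to sketches.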
Fine-Tuning with Sketch-Reference-Photo Triplets. Fig.5 depicts the pipeline of the fine-tuning stage. Denote the sketch, the reference photo, and the output synthesized photo as $x_s$, $x_r$, and $y$, respectively. With the model pre-trained in the representation learning stage, the encoder can encode content and style representations of both sketches and photos. The output image is generated by the decoder from the content representation of the sketch and the style representation of the reference:
$$y = G(c_{x_s}, s_{x_r}).$$
As the model has been pre-trained for encoding content and style representations, it has a good starting point for synthesizing photos from sketches. To ensure the output image has similar content to the sketch and a similar style to the reference, we enforce the following regularization loss on content and style representations in $\ell_1$-norm, where $(c_y, s_y) = E(y)$:
$$\mathcal{L}_{reg} = \|c_y - c_{x_s}\|_1 + \|s_y - s_{x_r}\|_1.$$
Additionally, the adversarial GAN loss $\mathcal{L}_{GAN}$ is required. The final loss is $\mathcal{L} = \mathcal{L}_{reg} + \lambda_{GAN}\mathcal{L}_{GAN}$, where $\lambda_{GAN}$ is set to 0.5 in this work.
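The stage-2 regularization can be sketched in the same toy-array style (names and the externally computed adversarial term are illustrative): the synthesized photo is re-encoded, and its content code is pulled toward the sketch's while its style code is pulled toward the reference's.

```python
import numpy as np

def finetune_loss(c_out, c_sketch, s_out, s_ref, adv_loss, lam_gan=0.5):
    """Stage-2 loss: the output photo's content code should match the
    sketch's, and its style code should match the reference photo's.

    c_out, s_out : content/style codes of the re-encoded output photo
    c_sketch     : content code of the input sketch
    s_ref        : style code of the reference photo
    adv_loss     : scalar generator-side adversarial loss (computed elsewhere)
    """
    l1 = lambda a, b: np.abs(a - b).mean()
    reg = l1(c_out, c_sketch) + l1(s_out, s_ref)  # regularization term
    return reg + lam_gan * adv_loss
```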
4 Experimental Results
4.1 Network Architectures and Training Details
Network Architectures. Images are fed to the encoder to obtain content and style representations. First, images pass through 4 down-sampling residual blocks to obtain an intermediate representation. This intermediate representation is fed to one convolution layer to obtain the spatial content representation, and to another two convolution layers to obtain the style vector. The decoder consists of 4 up-sampling residual blocks. The style representation is injected into the decoder convolution layers with the weight modulation technique described in StyleGAN2. The discriminator is the same as that of StyleGAN2.
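The weight modulation used to inject the style vector follows StyleGAN2's modulate/demodulate scheme; a minimal NumPy rendition is below (layer shapes are illustrative, and real implementations fuse this into the convolution for efficiency):

```python
import numpy as np

def modulated_conv_weight(w, style, eps=1e-8):
    """StyleGAN2-style weight modulation and demodulation.

    w     : conv weight, shape (out_ch, in_ch, kh, kw)
    style : per-input-channel scales derived from the style vector, shape (in_ch,)
    """
    # Modulate: scale each input channel of the weight by the style.
    w_mod = w * style[None, :, None, None]
    # Demodulate: renormalize so each output filter has unit L2 norm,
    # preventing the style scale from exploding activation magnitudes.
    demod = 1.0 / np.sqrt((w_mod ** 2).sum(axis=(1, 2, 3)) + eps)
    return w_mod * demod[:, None, None, None]
```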
Hyper-Parameters and Training Schedules. For representation encoding, the initial learning rate is 2e-3 and we use the Adam optimizer. For fine-tuning, we start from the pre-trained model; the training schedule stays the same, with the initial learning rate set to 4e-4. The entire training for the 3D-Front indoor scene dataset takes 7 days on 4 V100 GPUs.
Baselines. We follow the released code and settings of all baseline methods and retrain them on the datasets used in this paper. Some baselines [27, 24, 32, 18] work only on photos, not sketches. For these, we use gray-scale images as a proxy to ensure photo synthesis quality: we first train a sketch to gray-scale photo model whose input is a standardized sketch, and the generated gray-scale photo is then used to train a gray-scale to color photo model with the same settings as the baseline. SpliceViT and DTP are designed for test-time optimization and are not trained on the entire dataset. All other baselines are trained on the same dataset as the proposed method with a similar number of iterations.
4.2 Datasets
We train on the following scene photo datasets: 1) 3D-Front Indoor Scene consists of 14,761 training and 5,479 validation photos, rendered with Blender from synthetic indoor scenes including bedrooms and living rooms. Photos are resized to 286 and randomly cropped to 256 during training. 2) LSUN Church consists of 126,227 photos of outdoor churches; we randomly sample 25,255 photos as the validation set. Photos are resized to 286 and randomly cropped to 256 during training. 3) GeoPose3K Mountain Landscape has 3,114 mountain landscape photos, of which 623 are randomly sampled for validation. Training photos are resized to 572 and randomly cropped.
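The resize-then-random-crop augmentation used for all three datasets can be sketched as follows (the helper name and RNG seeding are our own; resizing to 286 is assumed to happen beforehand):

```python
import numpy as np

def random_crop(img, size=256, rng=None):
    """Randomly crop an (H, W, C) image to (size, size, C).

    Mirrors the training augmentation: photos are first resized (e.g., to 286)
    and then a random square patch is taken each iteration.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)   # upper bound is exclusive
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]
```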
For evaluation, we collect a Scene Sketch Evaluation Set. For each category (indoor scenes, mountain and church), we collect 50 sketches from the Internet, respectively. The sketches are collected with an intention to cover various sketching styles, e.g. different levels of line width, geometric distortion, use of shading, etc.
4.3 Representation Encoding
With effectively learned representations, the model can reconstruct photos or sketches with high quality. We evaluate reconstruction performance with LPIPS.
Table 1a reports the LPIPS distance between reconstructed and input photos and synthetic sketches for our stage-1 model and StyleGAN2. Fig.6 depicts several examples of inputs and reconstructions. Our representation encoding model achieves slightly better reconstruction than StyleGAN2, indicating the learned content and style representations are adequate and ready for further fine-tuning with sketch-reference-photo triplets.
4.4 Photo Synthesis
We evaluate the photo synthesis performance of our method and baselines in terms of photo-realism. We calculate the Fréchet inception distance (FID) between the synthesized photo set and the training photo set for each category (Table 1b). Our method outperforms all baselines under the FID metric, with SAE being the second best. Fig.7 depicts synthesis results of our method and the baselines. Note that SpliceViT and DTP are designed for test-time optimization and were not trained on the full dataset, which puts them at a disadvantage relative to the other methods. Style2Paints is designed to synthesize paintings rather than realistic photos; we nevertheless include it as one of the few works that study synthesis from scene sketches. As for whether the content of the output photo matches the input sketch and whether the style matches the reference photo, we provide a human perceptual evaluation in Section 4.5.
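FID compares two Gaussians fitted to Inception features of the real and synthesized sets: $d^2 = \|\mu_1 - \mu_2\|^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1\Sigma_2)^{1/2})$. A minimal NumPy sketch of the distance itself is below (feature extraction with an Inception network is omitted; function names are our own):

```python
import numpy as np

def psd_sqrt(m):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def fid(mu1, cov1, mu2, cov2):
    """Fréchet distance between two Gaussians fitted to feature sets."""
    s1 = psd_sqrt(cov1)
    # Tr(sqrtm(cov1 @ cov2)) equals Tr(sqrtm(s1 @ cov2 @ s1)),
    # and the latter form is symmetric, so eigh applies.
    covmean = psd_sqrt(s1 @ cov2 @ s1)
    diff = mu1 - mu2
    return diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean)
```

For identical distributions the distance is zero; shifting one mean by a vector of length 5 with identity covariances gives exactly 25.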
We also provide more visualization of our synthesis results of indoor scenes, churches and mountains in Fig.8.
4.5 Human Perceptual Study
We conduct a human perceptual study to evaluate the realism of synthesized photos and whether they match the desired contents and styles. Due to limited resources, we only evaluate our method and SAE, the second best-performing synthesis method.
We create a survey consisting of three parts: photorealism, content matching with sketches and style matching with reference photos. As guidance to the participants, we state our research purpose at the beginning of the survey. For each part, a detailed description and an example question with answers and explanations are provided for the participant’s reference. The order of our results, baseline results, and real images are randomly shuffled in the survey to minimize the potential bias from the participant. Each part consists of 13 questions, with one question being a bait question with an obvious answer. The bait question is designed to check if the participant is paying attention and if the answers are reliable. There are in total 51 participants, with 1 being ruled out due to failing one of the bait questions. Thus we finally collect 1,950 valid human judgments.
To evaluate the photorealism, we randomly select synthesized photos of ours and SAE evenly from three categories. Both methods use the same input sketch and reference photo. For each synthesized photo, we use Google’s search by image feature to find the most similar real photo and ask participants which one they think looks more like a real photo. We then calculate the percentage of participants being fooled. Note that the fooling rate of random guessing is 50%. Table 2a reports the fooling rate of our method and SAE. Ours is 27% higher than SAE. Specifically, for churches and mountains, ours achieves a fooling rate over 44%: the generated photos are almost indistinguishable from real photos.
To evaluate if the synthesized photos match the content of the input sketch, we show participants an input sketch and two synthesized results from our method and SAE, and ask them to pick one that has the most similar content as the sketch. Table 2b reports the preference rate of ours over SAE. We achieve 82% on average preference rate, well outperforming the baseline.
To evaluate if the synthesized photos match the style of the reference photo, we show participants a reference photo and two synthesized results from our method and SAE, and ask them to pick the one that has the most similar style to the reference photo. Table 2c reports the preference rate of ours over SAE. We achieve a 75% average preference rate, well outperforming the baseline.
4.6 Photo Editing Through Sketch
As depicted in Fig.11, given an input photo, we convert it to a standardized edge map (which we refer to as a sketch for simplicity). Users can add and remove strokes to edit the photo; the figure also shows the possibility of sequential editing. We evaluate photo editing performance on the indoor scene validation dataset; the FID of edited images to the training set is 69.2. One limitation is that content in unmodified regions of a given photo may not be well preserved, as the edited photo is generated solely from the edge map.
4.7 Analysis and Ablation Studies
Analysis of Style Representations. We visualize the learned content and style representations of photos and sketches using t-SNE in Fig.9: style representations of sketches and photos are well separated, while their content representations are not separable. This verifies the premise of the method: content representations of sketches and photos can be shared, while their style representations differ. Thus, combining the content representation of a sketch with the style representation of a photo can decode a realistic synthesized photo.
Style Interpolation. We study whether the reference style can be a combination of the styles of two different reference images $x_1$ and $x_2$ with style representations $s_1$ and $s_2$. The combined representation is $s = \alpha s_1 + (1 - \alpha) s_2$, where $\alpha \in [0, 1]$. By adjusting $\alpha$, we synthesize photos with a combined style from both reference images. Fig.10 depicts examples of mountain sketch to photo synthesis with combined styles from two different reference images: as $\alpha$ varies, the synthesized photos interpolate continuously from winter to summer, and from afternoon to dusk.
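The blending itself is a simple convex combination of the two style vectors (the function name is our own); feeding each blended vector to the decoder alongside the sketch's content code yields the interpolation sequence:

```python
import numpy as np

def interpolate_style(s1, s2, alpha):
    """Convex combination of two style vectors.

    alpha = 1.0 reproduces s1's style, alpha = 0.0 reproduces s2's,
    and intermediate values blend the two references.
    """
    return alpha * s1 + (1.0 - alpha) * s2

# Sweep alpha to produce a sequence of styles, e.g., winter -> summer.
def style_sweep(s1, s2, steps=5):
    return [interpolate_style(s1, s2, a) for a in np.linspace(1.0, 0.0, steps)]
```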
Table 3: FID under four ablation settings: no fine-tune, fine-tune + style loss, fine-tune + content loss, and fine-tune + all losses.
Fine-Tuning Model. One novel aspect of this work is fine-tuning with sketch-reference-photo triplets. We evaluate whether fine-tuning is necessary by removing the fine-tuning stage. As reported in Table 3, removing fine-tuning worsens FID by 2.4.
Content and Style Regularization Loss. We study whether the regularization loss at the fine-tuning stage is effective, examining the content loss and the style loss separately. As reported in Table 3, removing the content regularization loss worsens FID by 1.5, and removing the style loss worsens it by 0.6. This verifies the effectiveness of the proposed regularization loss.
5 Conclusion
We propose a reference-guided framework for photo synthesis from scene sketches. We first convert all input photos and sketches to standardized edge maps, allowing the model to learn in an unsupervised setting without real sketches or sketch-photo pairs. Subsequently, the standardized input and the reference image are disentangled into content and style components to synthesize a new hybrid image that preserves the content of the standardized input while transferring the style of the reference image. Extensive experiments demonstrate that our method can generate and edit realistic photos from a user's scene sketch with a reference photo as style guidance, surpassing previous approaches on three benchmarks.
A major insight of this work is that we learn to synthesize scene structures directly from the vast amount of readily available photos, rather than synthesizing and combining individual objects. Rather than suffering the accumulated errors of sketch-based object detection, per-object photo synthesis, and spatial composition, we treat scene sketches as a whole and learn holistic structures for photo synthesis.
One limitation is that the deep-learning-based standardization step may eliminate strokes that reflect scene details, or misinterpret strokes as textures. Future work could study a sketch-to-edge standardization process that preserves higher fidelity of the sketch. Another limitation lies in sketch-based photo editing: unchanged regions of a given photo may not be well preserved, because the model takes the sketch as its only input. Future work could improve performance by taking the original photo into consideration.
Acknowledgements: This research was supported, in part, by BAIR-Amazon Commons and AWS. We thank Yubei Chen for helpful discussions. We thank Tian Qin for providing some scene sketches used in the study. We thank Li Tang, Lu Yuan, Martin Zhai, Xingchen Liu, Karl Hillesland, Amin Kheradmand, Nasim Souly, Charlotte Wang, Valerie Moss and other anonymous participants in our human perceptual study.
References
- GeoPose3K: mountain landscape dataset for camera pose estimation in outdoor environments. Image and Vision Computing 66, pp. 1–14.
- (2020) DeepFaceDrawing: deep generation of face images from sketches. ACM Transactions on Graphics (TOG) 39 (4), pp. 72–1.
- (2018) SketchyGAN: towards diverse and realistic sketch to image synthesis. pp. 9416–9425.
- (2020) StarGAN v2: diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8188–8197.
- (2018) 3D sketching using multi-view deep volumetric prediction. Proceedings of the ACM on Computer Graphics and Interactive Techniques 1 (1), pp. 1–22.
- (2021) 3D-FRONT: 3D furnished rooms with layouts and semantics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10933–10942.
- (2020) SketchyCOCO: image generation from freehand scene sketches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5174–5183.
- (2019) Interactive sketch & fill: multiclass sketch-to-image translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1171–1180.
- (2014) Generative adversarial nets. Advances in Neural Information Processing Systems 27.
- (2017) Interactive example-based terrain authoring with conditional generative adversarial networks. ACM Transactions on Graphics (TOG) 36 (6), pp. 1–13.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30.
- Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189.
- (2017) Globally and locally consistent image completion. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–14.
- (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
- (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410.
- (2020) Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119.
- (2021) Deep translation prior: test-time training for photorealistic style transfer. arXiv preprint arXiv:2112.06150.
- (2015) Adam: a method for stochastic optimization. In ICLR (Poster).
- (2016) Learning representations for automatic colorization. In European Conference on Computer Vision, pp. 577–593.
- (2020) DRIT++: diverse image-to-image translation via disentangled representations. International Journal of Computer Vision 128 (10), pp. 2402–2417.
- (2021) EditGAN: high-precision semantic image editing. Advances in Neural Information Processing Systems 34.
- (2019) Few-shot unsupervised image-to-image translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10551–10560.
- (2020) Unsupervised sketch to photo synthesis. In European Conference on Computer Vision, pp. 36–52.
- (2021) SDEdit: guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations.
- (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2337–2346.
- Swapping autoencoder for deep image manipulation. Advances in Neural Information Processing Systems 33, pp. 7198–7211.
- (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544.
- (2020) Dense extreme inception network: towards a robust CNN model for edge detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1923–1932.
- (2000) A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision 40 (1), pp. 49–70.
- (2021) Encoding in style: a StyleGAN encoder for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2287–2296.
- (2022) Splicing ViT features for semantic appearance transfer. arXiv preprint arXiv:2201.00424.
- Visualizing data using t-SNE. Journal of Machine Learning Research 9 (11).
- (2020) StyleGAN2 distillation for feed-forward image manipulation. In European Conference on Computer Vision, pp. 170–186.
- (2020) 3D shape reconstruction from free-hand sketches. arXiv preprint arXiv:2006.09694.
- (2018) Unsupervised learning of 3D model reconstruction from hand-drawn sketches. In Proceedings of the 26th ACM International Conference on Multimedia, pp. 1820–1828.
- (2022) Adversarial open domain adaptation for sketch-to-photo synthesis. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1434–1444.
- (2015) LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365.
- (2021) User-guided line art flat filling with split filling mechanism. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595.
- (2016) Colorful image colorization. In European Conference on Computer Vision, pp. 649–666.
- (2018) Non-stationary texture synthesis by adversarial expansion. arXiv preprint arXiv:1805.04487.
- (2016) Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pp. 597–613.
- (1998) Filters, random fields and maximum entropy (FRAME): towards a unified theory for texture modeling. International Journal of Computer Vision 27 (2), pp. 107–126.
- (2018) SketchyScene: richly-annotated scene sketches. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 421–436.