Constraint-Based Visual Generation

07/16/2018 ∙ by Giuseppe Marra, et al. ∙ Università di Siena

In the last few years, the systematic adoption of deep learning for visual generation has produced impressive results that, among other factors, benefit from the massive exploration of convolutional architectures. In this paper, we propose a general approach to visual generation that combines learning capabilities with logic descriptions of the target to be generated. The generation process is regarded as a constraint satisfaction problem, where the constraints describe a set of properties that characterize the target. Interestingly, the constraints can also involve logic variables, and all of them are converted into real-valued functions by means of t-norm theory. We use deep architectures to model the involved variables and propose a computational scheme in which the learning process carries out the satisfaction of the constraints. We present some examples in which the theory can naturally be applied, including the modeling of GANs and auto-encoders, and report promising results on problems involving the generation of handwritten characters and face transformations.


1 Introduction

Generative Adversarial Networks (GANs) goodfellow2014generative have achieved impressive results in image generation. By taking inspiration from the Turing test, a generator function is asked to fool a discriminator function which, in turn, tries to distinguish real samples from generated ones. GANs are known to generate very realistic images when trained properly.

A special generation task is image-to-image translation, which learns to map each image from an input domain into an image in a (possibly different) output domain. In most real-world domains, there are no pairs of examples showing how to translate an image into a corresponding one in another domain, yielding the so-called UNsupervised Image-to-image Translation (UNIT) problem. In a UNIT problem, two independent sets of images belonging to two different domains (e.g. cats-dogs, male-female, summer-winter, etc.) are given, and the task is to translate an image from one domain into the corresponding image in the other domain, even though no paired examples showing this mapping exist. Unfortunately, estimating a joint distribution of the images in the two domains from the distributions in the original single domains is known to have infinitely many possible solutions. Therefore, one possible strategy consists in mapping pairs of corresponding images to the same latent space using auto-encoders and then learning to reconstruct an image from its representation in the latent space. Combining auto-encoders with GANs has been proposed in rosca2017variational ; li2017alice, and outstanding results on image translation have been reported by zhu2017unpaired ; liu2016coupled ; liu2017unsupervised.

This paper proposes a general approach to visual generation and translation that combines learning capabilities with logic descriptions of the images to be generated. The generation problem is translated into a constraint satisfaction problem, where each constraint forces the generated image to have some predefined feature. A main advantage of this approach is to decouple the logic description level from the generative models: the logic layer is architecture agnostic, so any deep generative model can be plugged into it. In particular, expressing the task using logic knowledge makes it easy to extend the involved classes to additional translation categories, and yields a learning scheme that is easier to understand. The translations are then interleaved and jointly learned using the constraints generated by the framework, which makes it possible to obtain truly realistic images on different translation types.

The integration of learning and logic reasoning has been studied in the past few years, but no framework has emerged as a generic interface layer. For example, Minervini et al. minervini2017adversarial correct the inconsistencies of an adversarial learner, but the employed methodology is limited in scope and defined ad hoc for the task. A fuzzy generalization of First-Order Logic (FOL) is used both by Hu et al. hu2016harnessing and by Logic Tensor Networks serafini2016learning to integrate logic and learning, but both approaches are limited to universally quantified FOL clauses with specific forms. Another line of research rocktaschel2015injecting ; demeester2016lifted attempts to use logical background knowledge to improve the embeddings for Relation Extraction; these works too are based on ad-hoc solutions that lack a common declarative mechanism that can be easily reused. Markov Logic Networks (MLN) richardson2006markov and Probabilistic Soft Logic (PSL) kimmig2012short ; bach2015hinge are two probabilistic logics whose parameters are trained to determine the strength of the available knowledge in a given universe. MLN and PSL, with their corresponding implementations, have received a lot of attention, but they provide only a shallow integration with the underlying learning processes working on the low-level sensory data: a low-level learner is trained independently, then frozen and stacked with the AI layer providing a higher-level inference mechanism. The framework proposed in this paper instead directly improves the underlying learner, while also providing the higher-level integration with logic. TensorLog cohen2016tensorlog is a recent framework that reuses the deep-learning infrastructure of TensorFlow (TF) to perform probabilistic logical reasoning. However, TensorLog is limited to reasoning and does not allow optimizing the learners while performing inference.

This paper utilizes a novel framework, called CLARE (Constrained Learning and Reasoning Environment; URL hidden for blind review), which is a TensorFlow abadi2016tensorflow environment based on a declarative language for integrating prior knowledge into machine learning. The proposed language generalizes frameworks like Semantic Based Regularization diligenti2012bridging ; diligenti2015semantic to any learner trained using gradient descent. The presented declarative language provides a uniform platform to face both learning and inference tasks by requiring the satisfaction of a set of rules on the domain of discourse. The mechanism provides a tight integration of learning and logic, as any computational graph can be bound to a FOL predicate. The experimental section shows how to formulate an image-to-image translation task using logic, including adversarial tasks with cycle consistency. The declarative approach makes it easy to interleave and jointly learn an arbitrary number of translation tasks.

The paper is organized as follows. Section 2 introduces the framework, describes its declarative nature and delineates how first-order logic (FOL) formulas can be converted into a learning model. Section 3 shows how to formalize the image-to-image translation problem in the proposed framework. The application of the framework to a male-to-female image translation task is presented in Section 4. Finally, some conclusions are drawn in Section 5.

2 Constrained Learning and Reasoning

In this paper, we consider a unified framework where both learning and inference tasks can be seen as constraint satisfaction problems. In particular, the constraints are assumed to be expressed by First-Order Logic (FOL) formulas and implemented in CLARE (Constrained Learning And Reasoning Environment), a software environment we developed that automatically converts FOL expressions into TensorFlow computational graphs.

Given a set of task functions to be learned, the logical formalism allows expressing high-level statements about the outputs of such functions. For instance, given a certain dataset, if any pattern has to belong to either a class $A$ or a class $B$, we may impose that $\forall x \; A(x) \lor B(x)$ has to hold true, where $A$ and $B$ denote two classifiers. As shown in the rest of this section, there are several ways to convert FOL into real-valued functions. Exploiting the fuzzy generalization of FOL originally proposed by Novak novak1987first, any FOL knowledge base is translated into a set of real-valued constraints by means of fuzzy logic operators. A t-norm fuzzy logic hajek1998 can be used to transform these statements into algebraic expressions, where a t-norm is a commutative, monotone, associative $[0,1]$-valued operation that models the logical AND. Assuming that the logical negation $\lnot x$ is converted by means of $1 - x$, the algebraic semantics of the other connectives is determined by the choice of a certain t-norm. Different t-norm fuzzy logics have been proposed in the literature, and Table 1 reports the algebraic operations (for the implication we only report the algebraic translation for the case $x > y$; otherwise it is equal to $1$) corresponding to the three fundamental continuous t-norm fuzzy logics: Gödel, Łukasiewicz and Product logic. In the following, we will indicate by $\Phi(X, f)$ the algebraic translation of a certain logical formula involving the task functions collected in a vector $f$, and by $X$ the available training data.

Connective          Gödel           Łukasiewicz           Product
$x \land y$         $\min(x, y)$    $\max(0, x + y - 1)$  $x \cdot y$
$x \lor y$          $\max(x, y)$    $\min(1, x + y)$      $x + y - x \cdot y$
$x \Rightarrow y$   $y$             $1 - x + y$           $y / x$

Table 1: The fundamental continuous t-norm logics and their algebraic semantics.
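For concreteness, the operations of Table 1 can be rendered as simple algebraic functions over truth degrees; the sketch below is our own NumPy illustration and does not reflect CLARE's internal implementation:

```python
import numpy as np

# Truth degrees are scalars or arrays of values in [0, 1].
def neg(x):
    # Negation is shared by all three logics: NOT(x) = 1 - x.
    return 1.0 - x

# Logical AND under the three fundamental continuous t-norms of Table 1.
def and_godel(x, y):
    return np.minimum(x, y)

def and_lukasiewicz(x, y):
    return np.maximum(0.0, x + y - 1.0)

def and_product(x, y):
    return x * y

# The OR (t-conorm) and residuated implication of the Product logic.
def or_product(x, y):
    return x + y - x * y

def implies_product(x, y):
    # Equal to 1 when x <= y, y / x otherwise (guarded against division by zero).
    return np.where(x <= y, 1.0, y / np.maximum(x, 1e-12))
```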

The constraints are aggregated over a set of data by means of the FOL quantifiers. In particular, the universal and existential quantifiers can be seen as a logic AND and OR applied over all the groundings of the data, respectively. Therefore, different quantifier translations can be obtained depending on the selection of the underlying t-norm. For example, for a given logic expression $E(f(x))$ using the function outputs as atoms, the product t-norm defines:

$\forall x \; E(f(x)) \;\longrightarrow\; \Phi(X, f) = \prod_{x \in X} E(f(x))$   (1)

where $X$ collects the available samples of groundings, with $X_i$ denoting the sample available for the $i$-th task function $f_i$.

In the same way, the expression of the existential quantifier when using the Gödel t-norm becomes the maximum of the expression over the domain of the quantified variable:

$\exists x \; E(f(x)) \;\longrightarrow\; \Phi(X, f) = \max_{x \in X} E(f(x))$

Once the translations of the quantifiers are defined, they can be arbitrarily nested and combined into more complex expressions.
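Continuing the illustration, the quantifier translations reduce a vector of grounded truth degrees to a single degree; again an illustrative NumPy sketch, not the framework's code:

```python
import numpy as np

def forall_product(truth_degrees):
    # Universal quantifier under the Product t-norm: the AND over all
    # groundings of the quantified variable becomes a product.
    return np.prod(truth_degrees)

def exists_godel(truth_degrees):
    # Existential quantifier under the Goedel t-norm: the OR over all
    # groundings becomes a maximum.
    return np.max(truth_degrees)

# Example: degrees of E(f(x)) evaluated on a sample X of three groundings.
degrees = np.array([0.9, 0.8, 0.95])
print(forall_product(degrees))  # 0.684
print(exists_godel(degrees))    # 0.95
```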

The conversion of formulas into real-valued constraints is carried out automatically by the proposed framework: CLARE takes as input the expressions defined using a declarative language and builds the corresponding constraints once the conversion functions to be exploited have been selected.

This framework is very general and accommodates learning from examples as well as the integration with FOL knowledge. In general terms, the learning scheme we propose can be formulated as the minimization of the following cost function:

$C(f) = \sum_{h=1}^{H} \lambda_h \, L\big(\Phi_h(X, f)\big)$   (2)

where $\lambda_h$ denotes the weight for the $h$-th logical constraint and the function $L$ represents any monotonically decreasing transformation of the constraints, conveniently chosen according to the problem under investigation. In particular, in this paper we exploit the following mappings:

(a) $L(\Phi) = 1 - \Phi$,   (b) $L(\Phi) = -\log(\Phi)$   (3)

When the mapping defined in Equation 3-(b) is applied to a universally quantified formula as in Equation 1, it yields the following constraint:

$L\big(\Phi(X, f)\big) = -\sum_{x \in X} \log E(f(x))$

which corresponds to a generalization of the cross-entropy loss, commonly used to fit the supervised data in deep learners, to generic fuzzy-logic expressions.
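As a sanity check of this derivation, the following sketch (our own illustration) applies the $-\log$ mapping to the product aggregation of Equation 1, turning the product of degrees into a sum of negative logs:

```python
import numpy as np

def universal_product_loss(truth_degrees, eps=1e-7):
    # L(prod_x E(f(x))) = -log(prod_x E(f(x))) = -sum_x log(E(f(x))).
    # When E(f(x)) is the predicted degree of a supervised target,
    # this is exactly the cross-entropy over the sample.
    return -np.sum(np.log(np.clip(truth_degrees, eps, 1.0)))

degrees = np.array([0.9, 0.8, 0.95])
print(universal_product_loss(degrees))  # ~0.380
```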

3 Generative Learning with Logic

This section shows how the discriminative and generative parts of an image-to-image translation system can be formulated by merging logic and learning, yielding a setup that is easier to understand and to extend.

Let us assume we are given a set of images $I$. A translation framework has two components. First, a set of generator functions $g_j$, which take an input image representation and generate a corresponding image in an output domain, depending on the semantics given to the task. Second, a set of discriminator functions $D_i$ determining whether an input image belongs to class $i$ (i.e. stating whether an image has a given property or not). Interestingly, all learnable FOL functions can be interpreted as generator functions, and all learnable FOL predicates as discriminator functions.

The discriminators can be trained by providing some examples in the original domains as:

$\forall x \; S_i(x) \Rightarrow D_i(x)$

where $S_i$ is a given function returning true if and only if an image is a positive example for the $i$-th discriminator. These constraints transfer the knowledge provided by the supervisions (i.e. the $S_i$) into the discriminators, which play a similar role. However, the functions $D_i$ are differentiable and can therefore be exploited to train the generator functions.

To this end, assuming that a given function $g_i$ has to generate a pattern with a certain property, we can force the corresponding discriminator function for that property to positively classify the output $g_i(x)$. Therefore, assuming that the semantics of the $i$-th generator is to generate images of class $i$, this can typically be expressed by a rule taking the form:

$\forall x \; D_i(g_i(x))$

The GAN framework implements a special case of these constraints, in which a generated image is required to be classified as realistic, i.e. the required property is similarity with real images.
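Under the Product t-norm and the $-\log$ mapping of Equation 3-(b), a rule of this form yields a familiar non-saturating generator objective. A minimal sketch, where d_of_g stands for the degrees $D_i(g_i(x))$ computed on a batch (illustrative names, not the framework's API):

```python
import numpy as np

def generator_constraint_loss(d_of_g, eps=1e-7):
    # Translation of "forall x: D_i(g_i(x))": the generator is pushed to
    # produce samples the discriminator classifies as having the property.
    return -np.sum(np.log(np.clip(d_of_g, eps, 1.0)))
```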

Cycle consistency zhu2017unpaired is also commonly employed to impose that, by translating an image from one domain to another and then translating it back, we should recover the input image. Cycle consistency further restricts the number of possible translations. Still assuming that the semantics of the $i$-th generator is to generate images of class $i$, this can naturally be formulated in logic as:

$\forall x \; D_j(x) \Rightarrow \big(g_j(g_i(x)) = x\big)$

where $g_i$ and $g_j$ implement inverse translations. Clearly, in complex problems, the chain of functions appearing in these constraints can be longer.

As a toy example, consider a task in which we are asked to learn two generative functions, next and previous, which, given an image of a digit, produce an image of the next and of the previous digit, respectively. In order to give each image a next and a previous digit within the chosen set of digits 1, 2 and 3, a circular mapping is used, such that 1 is the next digit of 3 and 3 is the previous digit of 1.

The functions next and previous are implemented by convolutional neural networks. An RBF network with a 3-unit softmax output layer is used to implement the one, two and three discriminators, bound to the three outputs of the network, respectively. The RBF model, by constructing closed decision boundaries, encourages the generated images to resemble the input ones. Finally, let isOne, isTwo and isThree be three given functions, defined on the input domain, returning true only if an image depicts a 1, a 2 or a 3, respectively. They play the role of the supervision functions $S_i$ in the general description.

The idea behind this task is to learn generative functions without giving them any direct supervision, simply by requiring that the generation is consistent with the classification performed by some jointly learned classifiers.

The problem can be described by the following constraints to learn the discriminators:

$\forall x \; isOne(x) \Rightarrow one(x)$,  $\forall x \; isTwo(x) \Rightarrow two(x)$,  $\forall x \; isThree(x) \Rightarrow three(x)$

and by

$\forall x \; one(x) \Rightarrow two(next(x))$,  $\forall x \; two(x) \Rightarrow three(next(x))$,  $\forall x \; three(x) \Rightarrow one(next(x))$
$\forall x \; one(x) \Rightarrow three(previous(x))$,  $\forall x \; two(x) \Rightarrow one(previous(x))$,  $\forall x \; three(x) \Rightarrow two(previous(x))$

to express that the generation functions are constrained to return images that are correctly recognized by the discriminators. Finally, the cycle-consistency constraints for the digit generators can be expressed by (see the sketch after these constraints):

$\forall x \; previous(next(x)) = x$,  $\forall x \; next(previous(x)) = x$
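Putting the pieces together, the toy task can be compiled into a single cost as in Equation 2. The sketch below is a hypothetical Python rendering under the Product t-norm and the $-\log$ mapping, with placeholder callables standing in for the convolutional generators, the RBF discriminators and the image similarity:

```python
import numpy as np

def neg_log(t, eps=1e-7):
    return -np.log(np.clip(t, eps, 1.0))

def toy_task_cost(x1, x2, x3, one, two, three, nxt, prev, eq):
    """x1, x2, x3: batches of images supervised as 1s, 2s and 3s;
    one/two/three: discriminators returning degrees in [0, 1];
    nxt/prev: the next/previous generators; eq: differentiable similarity."""
    cost = 0.0
    # Discriminator constraints, e.g. forall x: isOne(x) => one(x).
    cost += neg_log(one(x1)).sum() + neg_log(two(x2)).sum() + neg_log(three(x3)).sum()
    for x, d_next, d_prev in [(x1, two, three), (x2, three, one), (x3, one, two)]:
        # Generator constraints, e.g. forall x: one(x) => two(next(x)).
        cost += neg_log(d_next(nxt(x))).sum() + neg_log(d_prev(prev(x))).sum()
        # Cycle consistency: previous(next(x)) = x and next(previous(x)) = x.
        cost += neg_log(eq(prev(nxt(x)), x)).sum() + neg_log(eq(nxt(prev(x)), x)).sum()
    return cost
```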

We test this idea on the set of handwritten digit images obtained by extracting only the 1, 2 and 3 digits from the MNIST dataset. The above constraints have been expressed in CLARE, and the model computational graphs have been bound to the corresponding predicates.

Figure 1: An example of the trained generative functions. The pictures in the first column are the input images. The pictures in the second and third columns show the outputs of the next and previous functions, respectively, computed on the input image.

Figure 1 shows an example of image translation using this schema, where the image on the left is an original MNIST image and the two images on the right are the outputs of the next and previous generators.

In more complex examples, the images in different domains are typically required to share a common latent space. Let $e$ be an encoding function mapping an image into a latent space $Z$; this encoding function must be jointly learned during the learning phase. In this special case, the generators are re-defined as decoder functions taking as input the latent representation of the images, namely $g_i : Z \to I$. The auto-encoding constraints can be expressed in FOL as follows:

$\forall x \; S_i(x) \Rightarrow \big(g_i(e(x)) = x\big)$
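A minimal sketch of the corresponding auto-encoding penalty, assuming encoder, decoder and eq are differentiable callables returning degrees in [0, 1] (illustrative names, not the framework's actual binding):

```python
import numpy as np

def neg_log(t, eps=1e-7):
    return -np.log(np.clip(t, eps, 1.0))

def autoencoding_loss(x_batch, encoder, decoder, eq):
    # Translation of "forall x: S_i(x) => (g_i(e(x)) = x)", grounded on a
    # batch of class-i images; the equality atom is the similarity eq.
    z = encoder(x_batch)
    return neg_log(eq(decoder(z), x_batch)).sum()
```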

In the following section, we show a real image-to-image translation task implementing the general setup described in this section, including auto-encoders, GANs and cycle consistency. The declarative nature of the formulation makes it very easy to add an arbitrary number of translation problems and to learn them jointly.

4 Experiments on Image Translation

UNIT translation tasks assume that no pairs of examples showing how to translate an image into a corresponding one in another domain are available. Combining auto-encoders with GANs is the state-of-the-art solution for tackling UNIT generation problems zhu2017unpaired ; liu2016coupled ; liu2017unsupervised. In this section, we show how this state-of-the-art adversarial setting can be naturally described and extended within the proposed logical and learning framework. Furthermore, we show how the logical formulation allows a straightforward extension of the application to a greater number of domains.

The CelebFaces Attributes dataset liu2015faceattributes was used to evaluate the proposed approach; its celebrity face images are labeled with various attributes: gender, hair color, smiling, eyeglasses, etc. Images are defined as 3D pixel tensors with values in the $[0, 1]$ interval. The first two dimensions are the width and height coordinates, while the last dimension indexes the RGB channels.

In particular, we used the Male attribute to divide the dataset into the two input categories, namely male and female images. In the following, $male(x)$ and $female(x)$ (such that $male(x) = \lnot female(x)$) are two given predicates holding true if and only if an image is tagged with the male or the female tag, respectively. Let $e$ be an encoding function mapping images into the latent domain $Z$. The encoders are implemented as multilayer convolutional neural networks with resblocks he2016deep, leaky-ReLU activation functions and instance normalization at each layer (see liu2017unsupervised for a detailed description of the architecture). The generative functions $g_M$ and $g_F$ map vectors of the domain $Z$ into images. These functions are implemented as multilayer transposed convolutional neural networks (also called “deconvolutions”) with resblocks, leaky-ReLU activation functions and instance normalization at each layer. To implement the shared latent space assumption, $g_M$ and $g_F$ share the parameters of the first layer.

The functions $D_M$ and $D_F$ are trained to discriminate whether an image is real or has been generated by the $g_M$ and $g_F$ generator functions, respectively. For example, if $x_m$ and $x_f$ are two images such that $male(x_m)$ and $female(x_f)$ hold true, then $D_M(x_m)$ should return $1$, while $D_M(g_M(e(x_f)))$ should return $0$.

The architectures of the models implementing these functions are replicated from state-of-the-art models zhu2017unpaired ; liu2016coupled ; liu2017unsupervised. All these works show that convolutional models used in conjunction with resblocks and instance normalization make it possible to obtain truly realistic, high-definition images.

The problem can be described as follows. First, we look at the logical constraints the encoding and generation functions need to satisfy. We ask the encoder and the generator of the same domain to be circular, i.e. to map an input image into itself, as in the auto-encoding scheme proposed by Liu et al. liu2017unsupervised:

$\forall x \; male(x) \Rightarrow \big(g_M(e(x)) = x\big)$   (4)
$\forall x \; female(x) \Rightarrow \big(g_F(e(x)) = x\big)$   (5)

where the equality operator comparing two images in Equations 4 and 5 is bound to a continuous and differentiable function computing a pixel-by-pixel similarity between the images, defined as $eq(a, b) = 1 - \frac{1}{P} \sum_{j=1}^{P} |a_j - b_j|$, where $a_j$ and $b_j$ are the $j$-th pixels of the images $a$ and $b$, and $P$ is the total number of pixels.
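A direct rendering of this similarity for batches of images (a sketch under the stated definition; the framework may bind the equality operator differently):

```python
import numpy as np

def eq_images(a, b):
    # Pixel-by-pixel similarity: 1 minus the mean absolute difference.
    # For pixel values in [0, 1], identical images score 1 and maximally
    # different images score 0.
    # a, b: batches shaped (batch, height, width, channels).
    return 1.0 - np.mean(np.abs(a - b), axis=(1, 2, 3))
```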

Cycle consistency is also imposed, as described in the previous section:

$\forall x \; male(x) \Rightarrow \big(g_M(e(g_F(e(x)))) = x\big)$   (6)
$\forall x \; female(x) \Rightarrow \big(g_F(e(g_M(e(x)))) = x\big)$   (7)

where the same equality operator is used to compare the images.

Finally, the generated images must fool the discriminators, so that they will be detected as real ones:

$\forall x \; male(x) \Rightarrow D_F(g_F(e(x)))$   (8)
$\forall x \; female(x) \Rightarrow D_M(g_M(e(x)))$   (9)

On the other hand, the discriminators must correctly separate real images from generated ones, which is enforced by the satisfaction of the following constraints:

$\forall x \; male(x) \Rightarrow D_M(x) \land \lnot D_F(g_F(e(x)))$   (10)
$\forall x \; female(x) \Rightarrow D_F(x) \land \lnot D_M(g_M(e(x)))$   (11)

Using logical constraints allows us to give a clean and easy formulation of the adversarial setting. Indeed, the adversarial constraints can be interpreted like all the other constraints exploiting classification functions: they force a generation function to generate samples that are categorized in the desired class by the corresponding discriminator. Moreover, the decoupling between the models used to implement the functions, which can be inherited from the previous literature, and the description of the problem makes it really straightforward to extend or transfer this setting.

We implemented this mixed logical and learning task in CLARE. The Product t-norm was selected to define the underlying fuzzy logic problem. This choice is particularly suited to the task because, as shown earlier, it yields a cross-entropy loss on the output of the discriminators, which is the loss used to train these models in their original setup. The $e$, $g_M$ and $g_F$ functions are trained to satisfy the constraints defined in Equations 4-9, while $D_M$ and $D_F$ are trained to satisfy Equations 10 and 11. Weight learning was performed using the Adam optimizer with a fixed learning rate.
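Schematically, optimization alternates gradient steps on the two groups of constraints, as in standard adversarial training. The sketch below is a hypothetical TensorFlow 2 rendering; the model objects, the two loss helpers and the learning-rate value are all placeholders, not the paper's actual configuration:

```python
import tensorflow as tf

gen_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)   # placeholder rate
disc_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)  # placeholder rate

def train_step(x_male, x_female, e, g_m, g_f, d_m, d_f,
               generator_constraints, discriminator_constraints):
    # e, g_m, g_f are updated on the losses of Equations 4-9.
    gen_vars = (e.trainable_variables + g_m.trainable_variables
                + g_f.trainable_variables)
    with tf.GradientTape() as tape:
        gen_loss = generator_constraints(x_male, x_female, e, g_m, g_f, d_m, d_f)
    gen_opt.apply_gradients(zip(tape.gradient(gen_loss, gen_vars), gen_vars))

    # d_m, d_f are updated on the losses of Equations 10-11.
    disc_vars = d_m.trainable_variables + d_f.trainable_variables
    with tf.GradientTape() as tape:
        disc_loss = discriminator_constraints(x_male, x_female, e, g_m, g_f, d_m, d_f)
    disc_opt.apply_gradients(zip(tape.gradient(disc_loss, disc_vars), disc_vars))
```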

Figure 2: Face gender translation: male to female. The top row shows input male images; the bottom row shows the corresponding generated female images.
Figure 3: Face gender translation: female to male. The top row shows input female images; the bottom row shows the corresponding generated male images.
Figure 4: Face translation: male/female to eyeglasses. The top row shows input male/female images; the bottom row shows the corresponding generated faces with eyeglasses.

The software and configuration files can be downloaded at a URL hidden for the blind review. Figures 2 and 3 show some male-to-female and female-to-male translations, respectively.

Given this setting, the integration of a third domain into the overall problem becomes straightforward. Let $eyeglasses(x)$ be a given predicate holding true if and only if an image is tagged with the eyeglasses tag in the dataset. Let $g_E$ be the corresponding generator and $D_E$ the corresponding discriminator, implemented with the same network architectures described above. The addition of this third class simply requires adding the following constraint for the generators:

$\forall x \; D_E(g_E(e(x)))$

and the following constraints to learn the discriminator for the new class:

$\forall x \; eyeglasses(x) \Rightarrow D_E(x)$
$\forall x \; \lnot eyeglasses(x) \Rightarrow \lnot D_E(g_E(e(x)))$

Figure 4 shows some examples of the original face images, and the corresponding generated images of the faces with added eyeglasses.

5 Conclusions

This paper presented a new general approach to visual generation that combines logic descriptions of the target to be generated with deep neural networks. The proposed theory relies on the principle of discovering parsimonious solutions of visual constraint satisfaction problems. Its most distinguishing property is the flexibility of describing new generation problems by simple logic descriptions, which makes it possible to attack very different problems. Instead of looking for specific hand-crafted cost functions, the proposed approach offers a general scheme for their construction that arises from t-norm theory. Moreover, interleaving different image translation tasks allows us to accumulate a knowledge base that can dramatically facilitate the construction of new translation tasks. The experimental results show the flexibility of the proposed approach, which makes it possible to deal with realistic face translation tasks, as well as with pure generation tasks, like the one shown on MNIST.

References