Generative Adversarial Networks (GANs) goodfellow2014generative have achieved impressive results in image generation. By taking inspiration from the Turing test, a generator function is asked to fool a discriminator function which, in turn, tries to distinguish real samples from generated ones. GANs are known to generate very realistic images when trained properly.
A special generation task is image-to-image translation, which learns to map each image from an input domain into an image in a (possibly different) output domain. In most real-world domains, there are no pairs of examples showing how to translate an image into a corresponding one in another domain, yielding the so-called UNsupervised Image-to-image Translation (UNIT) problem. In a UNIT problem, two independent sets of images belonging to two different domains (e.g. cats-dogs, male-female, summer-winter, etc.) are given, and the task is to translate an image from one domain into a corresponding image in the other domain, even though no paired examples showing this mapping exist. Unfortunately, estimating the joint distribution of the images in the two domains from the two single-domain distributions is known to admit infinitely many solutions. Therefore, one possible strategy consists in mapping pairs of corresponding images to the same latent space using auto-encoders, and then learning to reconstruct an image from its representation in latent space. Combining auto-encoders with GANs has been proposed in rosca2017variational ; li2017alice, and outstanding results on image translation have been reported by zhu2017unpaired ; liu2016coupled ; liu2017unsupervised.
This paper proposes a general approach to visual generation and translation that combines learning capabilities with logic descriptions of the images to be generated. The generation problem is translated into a constraint satisfaction problem, where each constraint forces the generated image to have some predefined feature. A main advantage of this approach is that it decouples the logic description level from the generative models: the logic layer is architecture agnostic, so any generator model based on deep learning can be plugged under it. In particular, expressing the task using logic knowledge makes it easy to extend the involved classes to additional translation categories, and yields a learning scheme that is easier to understand. The translations are then interleaved and jointly learned using the constraints generated by the framework, which allows to obtain truly realistic images on different translation types.
The integration of learning and logic reasoning has been studied in the past few years, but no framework has emerged as a generic interface layer. For example, Minervini et al. minervini2017adversarial correct the inconsistencies of an adversarial learner, but the employed methodology is limited in scope and defined ad hoc for the task. A fuzzy generalization of First Order Logic (FOL) is used both by Hu et al. hu2016harnessing and by Logic Tensor Networks serafini2016learning to integrate logic and learning, but both approaches are limited to universally quantified FOL clauses with specific forms. Another line of research rocktaschel2015injecting ; demeester2016lifted attempts to use logical background knowledge to improve the embeddings for Relation Extraction; these works are also based on ad-hoc solutions that lack a common declarative mechanism that can be easily reused. Markov Logic Networks (MLN) richardson2006markov and Probabilistic Soft Logic (PSL) kimmig2012short ; bach2015hinge are two probabilistic logics whose parameters are trained to determine the strength of the available knowledge in a given universe. MLN and PSL, together with their corresponding implementations, have received a lot of attention, but they provide only a shallow integration with the underlying learning processes working on the low-level sensorial data: a low-level learner is trained independently, then frozen and stacked with the AI layer providing a higher-level inference mechanism. The framework proposed in this paper instead directly improves the underlying learner, while also providing the higher-level integration with logic. TensorLog cohen2016tensorlog is a recent framework that reuses the deep-learning infrastructure of TensorFlow (TF) to perform probabilistic logical reasoning. However, TensorLog is limited to reasoning and does not allow to optimize the learners while performing inference.
This paper utilizes a novel framework, called CLARE (Constrained Learning and Reasoning Environment) (URL: hidden for blind review), which is a TensorFlow abadi2016tensorflow environment based on a declarative language for integrating prior knowledge into machine learning. The proposed language generalizes frameworks like Semantic Based Regularization diligenti2012bridging ; diligenti2015semantic to any learner trained using gradient descent. The presented declarative language provides a uniform platform to face both learning and inference tasks by requiring the satisfaction of a set of rules on the domain of discourse. The presented mechanism provides a tight integration of learning and logic, as any computational graph can be bound to a FOL predicate. The experimental section shows how to formulate an image-to-image translation task using logic, including adversarial tasks with cycle consistency. The declarative approach makes it easy to interleave and jointly learn an arbitrary number of translation tasks.
The paper is organized as follows. In Section 2 we introduce the framework, describe its declarative nature and delineate how first-order logic (FOL) formulas can be converted into a learning model. Section 3 shows how to formalize the image-to-image translation problem in the proposed framework. The application of the framework to a male-to-female image translation task is presented in Section 4. Finally, some conclusions are drawn in Section 5.
2 Constrained Learning and Reasoning
In this paper, we consider a unified framework where both learning and inference tasks can be seen as constraint satisfaction problems. In particular, the constraints are assumed to be expressed by First-Order Logic (FOL) formulas and are implemented in CLARE (Constrained Learning And Reasoning Environment), a software environment we developed that automatically converts FOL expressions into TensorFlow computational graphs.
Given a set of task functions to be learned, the logical formalism allows to express high-level statements among the outputs of such functions. For instance, given a certain dataset, if any pattern $x$ has to belong to either a class $A$ or a class $B$, we may impose that $A(x) \vee B(x)$ has to hold true, where $A$ and $B$ denote two classifiers. As shown in the following of this section, there are several ways to convert FOL into real-valued functions. Exploiting the fuzzy generalization of FOL originally proposed by Novak novak1987first, any FOL knowledge base is translated into a set of real-valued constraints by means of fuzzy logic operators. A t-norm fuzzy logic hajek1998 can be used to transform these statements into algebraic expressions, where a t-norm is a commutative, monotone, associative $[0,1]$-valued operation that models the logical AND. Assuming to convert the logical negation of a truth degree $a$ by means of $1-a$, the algebraic semantics of the other connectives is determined by the choice of a certain t-norm. Different t-norm fuzzy logics have been proposed in the literature and we report in Table 1 the algebraic operations corresponding to the three fundamental continuous t-norm fuzzy logics: Gödel, Łukasiewicz and Product logic. (For the implication $a \Rightarrow b$, Table 1 only reports the algebraic translation for the case $a > b$; otherwise it is equal to $1$.)
In the following, we will indicate by $\Phi(f, X)$ the algebraic translation of a certain logical formula involving the task functions collected in a vector $f$, and by $X$ the available training data.
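As a concrete illustration, the t-norm conjunctions collected in Table 1 can be sketched as plain functions on truth degrees in $[0,1]$; a minimal sketch, with function names that are ours and not part of the framework:

```python
def goedel_and(a, b):
    # Gödel t-norm: the minimum of the two truth degrees.
    return min(a, b)

def lukasiewicz_and(a, b):
    # Łukasiewicz t-norm: max(0, a + b - 1).
    return max(0.0, a + b - 1.0)

def product_and(a, b):
    # Product t-norm: plain multiplication of truth degrees.
    return a * b

def neg(a):
    # Logical negation converted as 1 - a, as assumed in the text.
    return 1.0 - a
```

All three operations agree with classical Boolean conjunction on the crisp values 0 and 1 and differ only in how they interpolate in between.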
The constraints are aggregated over a set of data by means of FOL quantifiers. In particular, the universal and existential quantifiers can be seen as a logic AND and OR applied over each grounding of the data, respectively. Therefore, different quantifiers can be obtained depending on the selection of the underlying t-norm. For example, for a given logic expression $\varphi$ using the function outputs as atoms, the Product t-norm defines:
\[ \forall x\; \varphi(f(x)) \;\longrightarrow\; \prod_{x \in X_i} \Phi\big(f(x)\big) \]
where $X_i$ denotes the available sample for the $i$-th task function $f_i$.
In the same way, the expression of the existential quantifier when using the Gödel t-norm becomes the maximum of the expression over the domain of the quantified variable:
\[ \exists x\; \varphi(f(x)) \;\longrightarrow\; \max_{x \in X_i} \Phi\big(f(x)\big) \]
Once the translations of the quantifiers are defined, they can be arbitrarily nested and combined into more complex expressions.
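The two aggregations just described can be sketched directly on a list of truth degrees obtained by grounding a formula on the data (pure Python, our naming):

```python
def forall_product(truth_degrees):
    # Universal quantifier under the Product t-norm:
    # the AND of the formula over every grounding becomes a product.
    result = 1.0
    for t in truth_degrees:
        result *= t
    return result

def exists_goedel(truth_degrees):
    # Existential quantifier under the Gödel t-norm:
    # the OR over all groundings becomes a maximum.
    return max(truth_degrees)
```

In the actual framework these reductions are built as TensorFlow graph operations rather than Python loops, but the semantics is the same.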
The conversion of formulas into real-valued constraints is carried out automatically in the framework we propose. Indeed, CLARE takes as input the expressions defined using a declarative language and builds the constraints once we decide the conversion functions to be exploited.
This framework is very general and it accommodates learning from examples as well as the integration with FOL knowledge. In general terms, the learning scheme we propose can be formulated as the minimization of the following cost function:
\[ C(f) \;=\; \sum_{h} \lambda_h \, L\big(\Phi_h(f, X)\big) \]
where $\lambda_h$ denotes the weight for the $h$-th logical constraint and the function $L$ represents any monotonically decreasing transformation of the constraints, conveniently chosen according to the problem under investigation. In particular, in this paper we exploit the mapping $L(t) = -\log(t)$, which corresponds to a generalization to generic fuzzy-logic expressions of the cross-entropy loss, commonly used to force the fitting of the supervised data for deep learners.
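Combining the $-\log$ mapping with the Product-t-norm universal quantifier makes the cross-entropy connection concrete: the loss of a universally quantified constraint decomposes into a sum of per-grounding negative log-likelihoods. A minimal sketch (names ours; `eps` guards against $\log 0$):

```python
import math

def constraint_loss(truth_degree, eps=1e-12):
    # Monotonically decreasing map L(t) = -log(t) from a truth
    # degree in (0, 1] to a non-negative loss.
    return -math.log(max(truth_degree, eps))

def forall_product_loss(truth_degrees, eps=1e-12):
    # -log of the product of truth degrees equals the sum of the
    # per-grounding losses, i.e. a cross-entropy-style loss.
    return sum(-math.log(max(t, eps)) for t in truth_degrees)
```

A fully satisfied constraint (truth degree 1) contributes zero loss, and the loss grows without bound as any grounding approaches falsity.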
3 Generative Learning with Logic
This section shows how the discriminative and generative parts of an image-to-image translation system can be formulated by merging logic and learning, yielding a setup that is easier to understand and to extend.
Let us assume to be given a set of images $X$. A translation framework has two components. First, a set of generator functions $g_i$, which take as input an image representation and generate a corresponding image in the output domain determined by the semantics given to the task. Second, a set of discriminator functions $D_i$, each determining whether an input image belongs to a given class (i.e. stating whether or not an image has a given property). Interestingly, all learnable FOL functions can be interpreted as generator functions and all learnable FOL predicates can be interpreted as discriminator functions.
The discriminators can be trained by providing some examples in the original domains as:
\[ \forall x\;\; S_i(x) \Leftrightarrow D_i(x) \]
where $S_i$ is a given function returning true if and only if an image is a positive example for the $i$-th discriminator. These constraints transfer the knowledge provided by the supervisions (i.e. the $S_i$) into the discriminators, which play a similar role. However, the $D_i$ functions are differentiable, and can therefore be exploited to train the generator functions.
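Under the Product t-norm with the $-\log$ mapping of Section 2, this supervision constraint reduces to a standard binary cross-entropy fit of the discriminator; a sketch under those assumptions (names hypothetical: `d` stands for a differentiable discriminator, `is_positive` for the given supervision function):

```python
import math

def discriminator_supervision_loss(images, is_positive, d, eps=1e-12):
    # Translation of "forall x: is_positive(x) <-> d(x)" with the
    # Product t-norm and L(t) = -log(t): binary cross-entropy over the data.
    loss = 0.0
    for x in images:
        p = d(x)
        truth = p if is_positive(x) else 1.0 - p
        loss += -math.log(max(truth, eps))
    return loss

# Toy usage: a "discriminator" that agrees with the supervision everywhere.
is_pos = lambda x: x > 0
perfect_d = lambda x: 1.0 if x > 0 else 0.0
```

With a perfectly consistent discriminator the loss vanishes; any disagreement with the supervision contributes a positive penalty.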
To this end, assuming that a given function $g_i$ has to generate a pattern with a certain property, we can force the corresponding discriminator function for that property to positively classify $g_i(x)$. Therefore, assuming that the semantics of the $i$-th generator is to generate images of class $i$, this can be typically expressed by a rule taking the form:
\[ \forall x\;\; D_i\big(g_i(x)\big) \]
By requiring that a given image should or should not be classified as realistic, the GAN framework implements a special case of these constraints, where the required property is the similarity with real images.
Cycle consistency zhu2017unpaired is also commonly employed to impose that, by translating an image from one domain to another and then translating it back, we should recover the input image. Cycle consistency further restricts the number of possible translations. Still assuming that the semantics of the $i$-th generator is to generate images of class $i$, this can be naturally formulated using logic as:
\[ \forall x\;\; g_i\big(g_j(x)\big) = x \]
Clearly, in complex problems, the chain of functions involved in these constraints can be longer.
As a toy example, we show a task in which we are asked to learn two generative functions, next and previous, which, given an image of a digit, produce an image of the next and of the previous digit, respectively. In order to give each image a next and a previous digit in the chosen set, a circular mapping was used, so that the first digit of the set is the next of the last one and, vice versa, the last digit is the previous of the first one.
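The circular mapping over the chosen digit set can be sketched as follows (the concrete digits are left unspecified here, and the helper name is ours):

```python
def make_circular_maps(classes):
    # Builds "next"/"previous" class maps that wrap around, so that
    # every class has both a next and a previous one.
    n = len(classes)
    nxt = {c: classes[(i + 1) % n] for i, c in enumerate(classes)}
    prv = {c: classes[(i - 1) % n] for i, c in enumerate(classes)}
    return nxt, prv
```

Composing the two maps in either order returns the starting class, which is exactly the cycle-consistency property imposed on the generators below.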
The functions next and previous are implemented by convolutional neural networks. An RBF network with a softmax output layer is used to implement the three discriminators, bound to the three outputs of the network. The RBF model, by constructing closed decision boundaries, encourages the generated images to resemble the input ones. Finally, three given functions, defined on the input domain, return true only if an image depicts the corresponding digit of the chosen set; they play the role of the supervision functions in the general description.
The idea behind this task is to learn the generative functions without giving any direct supervision to them, but by simply requiring that the generation is consistent with the classification performed by some jointly learned classifiers.
The problem can be described by the following constraints to learn the discriminators:
\[ \forall x\;\; S_d(x) \Leftrightarrow D_d(x), \qquad d = 1, 2, 3 \]
together with
\[ \forall x\;\; D_d(x) \Rightarrow D_{d+1}\big(\mathit{next}(x)\big), \qquad \forall x\;\; D_d(x) \Rightarrow D_{d-1}\big(\mathit{previous}(x)\big) \]
(with the indices wrapping around circularly) to express that the generation functions are constrained to return images which are correctly recognized by the discriminators. Finally, the cycle-consistency constraints for the digit generators can be expressed by:
\[ \forall x\;\; \mathit{previous}\big(\mathit{next}(x)\big) = x, \qquad \forall x\;\; \mathit{next}\big(\mathit{previous}(x)\big) = x \]
We test this idea on a set of handwritten digit images, obtained by extracting only the three chosen digit classes from the MNIST dataset. The above constraints have been expressed in CLARE and the model computational graphs have been bound to the predicates.
Figure 1 shows an example of image translation using this schema, where the image on the left is an original MNIST image and the two images on the right are the outputs of the next and previous generators.
In more complex examples, the images in different domains are typically required to share a common latent space. Let $e$ indicate an encoding function mapping an image into a latent space; this encoding function must be jointly learned during the learning phase. In this special case, the generators are re-defined as decoder functions taking as input the latent representation of the images, namely $g_i(e(x))$. The auto-encoding constraints can be expressed using FOL as follows:
\[ \forall x\;\; g_i\big(e(x)\big) = x \]
In the following section, we show a real image-to-image translation task applying the general setup described in this section, including auto-encoders, GANs and cycle consistency. The declarative nature of the formulation makes it very easy to add an arbitrary number of translation problems and allows to learn them jointly.
4 Experiments on Image Translation
UNIT translation tasks assume that there are no pairs of examples showing how to translate an image into a corresponding one in another domain. Combining auto-encoders with GANs is the state-of-the-art solution for tackling UNIT generation problems zhu2017unpaired ; liu2016coupled ; liu2017unsupervised . In this section, we show how this state-of-the-art adversarial setting can be naturally described and extended by the proposed logical and learning framework. Furthermore, we show how the logical formulation allows a straightforward extension of this application to a greater number of domains.
The CelebFaces Attributes dataset liu2015faceattributes was used to evaluate the proposed approach; it contains celebrity face images labeled with various attributes (gender, hair color, smiling, eyeglasses, etc.). Images are defined as 3D pixel tensors with values belonging to a normalized real interval. The first two dimensions index the width and height coordinates, while the last dimension indexes the RGB channels.
In particular, we used the Male attribute to divide the entire dataset into the two input categories, namely male and female images. In the following, male and female (such that $\forall x\; \mathit{male}(x) \Leftrightarrow \lnot\mathit{female}(x)$) are two given predicates holding true if and only if an image is tagged with the male or the female tag, respectively. Let $e$ be an encoding function mapping images into a latent domain $Z$. The encoder is implemented as a multilayer convolutional neural network with resblocks he2016deep (see liu2017unsupervised for a detailed description of the architecture). The generative functions $g_m$ and $g_f$ map vectors of the latent domain $Z$ into images. These functions are implemented as multilayer transposed convolutional neural networks (also called “deconvolutions”) with resblocks, leaky-ReLU activation functions and instance normalization at each layer. To implement the shared latent space assumption, $g_m$ and $g_f$ share the parameters of their first layer.
The functions $D_m$ and $D_f$ are trained to discriminate whether an image is real or has been generated by the $g_m$ and $g_f$ generator functions, respectively. For example, if $x$ and $y$ are two images such that $\mathit{male}(x)$ and $\mathit{female}(y)$ hold true, then $D_m(x)$ should return $1$, while $D_m\big(g_m(e(y))\big)$ should return $0$.
The architectures of the models implementing these functions are replicated from state-of-the-art models zhu2017unpaired ; liu2016coupled ; liu2017unsupervised. All these papers show that the use of convolutional models in conjunction with resblocks and instance normalization makes it possible to obtain truly realistic and high-definition images.
The problem can be described as follows. First, we look at the logical constraints that the encoding and generation functions need to satisfy. We ask the encoder and the generator of the same domain to be circular, that is, to map an input image into itself, as in the auto-encoding scheme proposed by Liu et al. liu2017unsupervised:
\[ \forall x\;\; \mathit{male}(x) \Rightarrow g_m\big(e(x)\big) = x \qquad\quad \forall x\;\; \mathit{female}(x) \Rightarrow g_f\big(e(x)\big) = x \]
where the equality operator comparing two images in equations 4 and 5 is bound to a continuous and differentiable function computing a pixel-by-pixel similarity between the images, defined as
\[ 1 - \frac{1}{P}\sum_{j=1}^{P} \big| x_j - y_j \big| \]
where $x_j$ and $y_j$ are the $j$-th pixels of the $x$ and $y$ images and $P$ is the total number of pixels.
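A minimal sketch of such a pixel-wise equality operator, here on flat pixel lists with values assumed normalized to $[0,1]$ (the function name is ours):

```python
def pixel_similarity(img_a, img_b):
    # Differentiable stand-in for equality between two images:
    # 1 minus the mean absolute per-pixel difference.
    # Returns 1.0 exactly when the two images are identical
    # (assuming pixel values normalized to [0, 1]).
    assert len(img_a) == len(img_b) and img_a
    n = len(img_a)
    return 1.0 - sum(abs(a - b) for a, b in zip(img_a, img_b)) / n
```

Because the similarity is 1 for identical images and decreases smoothly with every pixel-level mismatch, it can be used as the truth degree of the equality atom inside any of the constraints above.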
Cycle consistency is also imposed, as described in the previous section:
\[ \forall x\;\; \mathit{male}(x) \Rightarrow g_m\big(e(g_f(e(x)))\big) = x \qquad\quad \forall x\;\; \mathit{female}(x) \Rightarrow g_f\big(e(g_m(e(x)))\big) = x \]
where the same equality operator is used to compare the images.
Finally, the generated images must fool the discriminators, so that they are detected as real ones:
\[ \forall x\;\; \mathit{female}(x) \Rightarrow D_m\big(g_m(e(x))\big) \qquad\quad \forall x\;\; \mathit{male}(x) \Rightarrow D_f\big(g_f(e(x))\big) \]
On the other hand, the discriminators must correctly separate real images from generated ones, which is enforced by the satisfaction of the following constraints:
\[ \forall x\;\; \mathit{male}(x) \Rightarrow D_m(x) \wedge \lnot D_f\big(g_f(e(x))\big) \qquad\quad \forall x\;\; \mathit{female}(x) \Rightarrow D_f(x) \wedge \lnot D_m\big(g_m(e(x))\big) \]
Using logical constraints allows us to give a clean and easy formulation of the adversarial setting. Indeed, the adversarial constraints can be interpreted like all the other constraints which exploit classification functions: they force a generation function to generate samples that are categorized in the desired class by the corresponding discriminator. Moreover, the decoupling between the models used to implement the functions, which can be inherited from the previous literature, and the description of the problem makes it really straightforward to extend or transfer this setting.
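Under the Product t-norm with the $-\log$ mapping, the two sides of this adversarial game reduce to the familiar GAN losses; a sketch under those assumptions (names ours: `d_fake` is the discriminator score on a generated image, `d_real` the score on a real one):

```python
import math

def generator_fooling_loss(d_fake, eps=1e-12):
    # From the constraint "generated images must be classified as real":
    # -log D(g(e(x))), minimized by the generator.
    return -math.log(max(d_fake, eps))

def discriminator_side_loss(d_real, d_fake, eps=1e-12):
    # From "real images are positive and generated ones are negative":
    # -log D(x) - log(1 - D(g(e(x)))), minimized by the discriminator.
    return -math.log(max(d_real, eps)) - math.log(max(1.0 - d_fake, eps))
```

The two losses pull the discriminator score on generated images in opposite directions, which is exactly the alternating adversarial optimization described in the experimental setup below.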
We implemented this mixed logical and learning task using CLARE. The Product t-norm was selected to define the underlying fuzzy logic problem. This t-norm is particularly suited for this task because, as shown earlier, it defines a cross-entropy loss on the output of the discriminators, which is the loss used to train these models in their original setup. The $e$, $g_m$ and $g_f$ functions are trained to satisfy the constraints defined in Equations 4–9, while $D_m$ and $D_f$ are trained to satisfy Equations 10 and 11. Weight learning for the models was performed using the Adam optimizer with a fixed learning rate.
Given this setting, the integration of a third domain into the overall problem becomes straightforward. Let glasses be a given predicate holding true if and only if an image is tagged with the eyeglasses tag in the dataset. Let $g_e$ be the corresponding generator and $D_e$ the corresponding discriminator, implemented with the same network architectures described above. The addition of this third class simply requires to add the following constraints for the generators:
\[ \forall x\;\; \mathit{glasses}(x) \Rightarrow g_e\big(e(x)\big) = x \qquad\quad \forall x\;\; \lnot\mathit{glasses}(x) \Rightarrow D_e\big(g_e(e(x))\big) \]
together with the corresponding constraints to learn the discriminator for the new class:
\[ \forall x\;\; \mathit{glasses}(x) \Rightarrow D_e(x) \qquad\quad \forall x\;\; \lnot\mathit{glasses}(x) \Rightarrow \lnot D_e\big(g_e(e(x))\big) \]
Figure 4 shows some examples of the original face images, and the corresponding generated images of the faces with added eyeglasses.
5 Conclusions
This paper presented a new general approach to visual generation that combines logic descriptions of the target to be generated with deep neural networks. The proposed theory relies on the principle of discovering parsimonious solutions of visual constraint satisfaction problems. Its most distinguishing property is the flexibility of describing new generation problems through simple logic descriptions, which makes it possible to attack very different problems. Instead of looking for specific hand-crafted cost functions, the proposed approach offers a general scheme for their construction that arises from t-norm theory. Moreover, interleaving different image translation tasks allows us to accumulate a knowledge base that can dramatically facilitate the construction of new translation tasks. The experimental results show the flexibility of the proposed approach, which makes it possible to deal with realistic face translation tasks, as well as with pure generation tasks, like the one shown on MNIST.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, and Shakir Mohamed. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.
-  Chunyuan Li, Hao Liu, Changyou Chen, Yuchen Pu, Liqun Chen, Ricardo Henao, and Lawrence Carin. Alice: Towards understanding adversarial learning for joint distribution matching. In Advances in Neural Information Processing Systems, pages 5501–5509, 2017.
-  Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
-  Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 469–477. Curran Associates Inc., 2016.
-  Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
-  Pasquale Minervini, Thomas Demeester, Tim Rocktäschel, and Sebastian Riedel. Adversarial sets for regularising neural link predictors. arXiv preprint arXiv:1707.07596, 2017.
-  Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318, 2016.
-  Luciano Serafini and Artur S d’Avila Garcez. Learning and reasoning with logic tensor networks. In AI*IA, pages 334–348, 2016.
-  Tim Rocktäschel, Sameer Singh, and Sebastian Riedel. Injecting logical background knowledge into embeddings for relation extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1119–1129, 2015.
-  Thomas Demeester, Tim Rocktäschel, and Sebastian Riedel. Lifted rule injection for relation embeddings. arXiv preprint arXiv:1606.08359, 2016.
-  Matthew Richardson and Pedro Domingos. Markov logic networks. Machine learning, 62(1):107–136, 2006.
-  Angelika Kimmig, Stephen Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. A short introduction to probabilistic soft logic. In Proceedings of the NIPS Workshop on Probabilistic Programming: Foundations and Applications, pages 1–4, 2012.
-  Stephen H Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. Hinge-loss markov random fields and probabilistic soft logic. arXiv preprint arXiv:1505.04406, 2015.
-  William W Cohen. Tensorlog: A differentiable deductive database. arXiv preprint arXiv:1605.06523, 2016.
-  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
-  Michelangelo Diligenti, Marco Gori, Marco Maggini, and Leonardo Rigutini. Bridging logic and kernel machines. Machine learning, 86(1):57–88, 2012.
-  Michelangelo Diligenti, Marco Gori, and Claudio Saccà. Semantic-based regularization for learning and inference. Artificial Intelligence, 2015.
-  Vilém Novák. First-order fuzzy logic. Studia Logica, 46(1):87–109, 1987.
-  Petr Hájek. Metamathematics of Fuzzy Logic. Kluwer, 1998.
-  Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.