Synthesizing Visual Illusions Using Generative Adversarial Networks

11/21/2019 ∙ by Alexander Gomez Villa, et al. ∙ Universitat Pompeu Fabra

Visual illusions are a very useful tool for vision scientists, because they allow them to better probe the limits, thresholds and errors of the visual system. In this work we introduce the first ever framework to generate novel visual illusions with an artificial neural network (ANN). It takes the form of a generative adversarial network, with a generator of visual illusion candidates and two discriminator modules, one for the inducer background and another that decides whether or not the candidate is indeed an illusion. The generality of the model is exemplified by synthesizing illusions of different types, and validated with psychophysical experiments that corroborate that the outputs of our ANN are indeed visual illusions to human observers. Apart from synthesizing new visual illusions, which may help vision researchers, the proposed model has the potential to open new ways to study the similarities and differences between ANN and human visual perception.


1 Introduction

A prevalent view in vision science, pioneered by the works of Attneave and Barlow in the mid-twentieth century [2, 5], is the efficient representation principle. It states that the organization of the visual system in general, and neural responses in particular, is tailored to the statistics of the images the individual typically encounters, so that visual information is encoded in the most efficient way, making optimal use of limited biological resources. This ecological approach to vision science has proven extremely successful, being the only framework able to predict the functional properties of neurons from a simple theory. There is broad agreement that this coding strategy is the result of an evolutionary process, a view shared by more recent and very popular theories [6, 16, 22].

Figure 1: A canonical visual illusion (brightness contrast [9]). The squares have the same gray value, but they are perceived as being different.
Figure 2: First row: targets (squares, rings, and squares) on a neutral background. Second row: visual illusions generated by our framework. Note that the values of the targets are exactly the same in the top and bottom rows.

The direct study of visual perception is an extremely challenging open problem, and for this reason most psychophysical research focuses on perceptual limits, thresholds, and errors in order to shed light on the behaviour of the visual system. A visual illusion (VI) is an image stimulus that induces a visual percept which is not consistent with the visual information that can be physically measured in the scene. An example VI can be seen in Fig. 1, which shows a canonical contrast illusion [9]: the center squares have the exact same gray value, and therefore send the same light intensity to our eyes (as a measurement with a photometer could attest), but we perceive the gray square over the white background as being darker than the gray square over the black background. This VI, like all VIs, is an image whose statistics do not correspond to those typically found in natural images, so our perception of it suffers from errors in the (otherwise optimal) coding strategy. In fact, many illusions have been explained as by-products of optimal information transmission or error minimization in statistically unusual scenarios [3, 24]. Thus, VIs allow vision scientists to devise and test new vision models in their search for a better understanding of the rules that govern visual perception.

Since 2018, a handful of works have observed that artificial neural networks (ANNs) trained on natural images can also be “fooled” by VIs, in the sense that their response to an image that is a VI for a human is (qualitatively) the same as that of humans, and therefore inconsistent with the actual physical values of the light stimulus. This has been shown for VIs of very different types: motion [32], brightness and color [17], and completion [21].

In this article we move in the opposite direction and propose a general framework that makes it possible to generate novel visual illusions with an ANN; that is, the ANN generates images that now “fool” humans. In particular, we propose a framework that takes the form of a Generative Adversarial Network (GAN) [19]. The generality of the model is exemplified by synthesizing VIs of different types and with different configurations of the GAN, and validated with psychophysical experiments that corroborate that the outputs of our framework are indeed visual illusions to human observers. To the best of our knowledge, this is a completely novel idea. Some examples of our results are shown in Figure 2.

This paper focuses on presenting a framework for the generation of novel visual illusions, showing the capability of our model to help vision researchers. But the potential impact of our proposed approach goes well beyond that: it could allow us to study how close particular ANNs or vision models are to modelling visual perception, as we will discuss. Our code and models will be publicly available soon.

2 Phantasmagoria: A framework to generate visual illusions

Figure 3: Phantasmagoria: A framework to generate visual illusions

Our goal in this work is to propose a framework that synthesizes new images producing a visual illusion for human observers. Using a generative adversarial network (GAN) approach, in principle we would just need two components: a generator of images that are candidates to produce a VI, and a discriminator that ensures that each candidate actually is a VI. The problem with this approach is that if there is a particular candidate that produces a visual illusion with a considerably stronger effect than the rest of the candidates, then the generator will end up generating many replicas of this VI. In other words, the approach will fall into mode collapse, turning the framework into a synthesizer of the same VI over and over again. This is the reason why we need a third component that pushes the generator to synthesize a variety of VI candidates that also follow the properties of a particular image set.

The framework we propose (shown in Figure 3) is therefore composed of three components: a candidate generator (CG), a background discriminator (BD) and an illusion discriminator (ID). The generator synthesizes a candidate background (inducer) that is fed to the background discriminator (blue branch in Figure 3); at the same time, the targets (i.e. the regions where the illusion should happen) are pasted over the generated candidate (green branch) before being fed to the illusion discriminator. The background discriminator judges whether the candidate conforms to a certain image type we impose; in our case, this is equivalent to deciding whether the candidate is a real instance of the training dataset (composed of textures or natural images, for example). In contrast, the illusion discriminator plays the role of a replicant of the human visual system, in the sense that it assesses whether there is an illusion, i.e. whether it “sees” the two targets pasted on the candidate background in a different way. The cost function for the candidate generator combines, with weights, the outputs from both discriminators (BD and ID), and therefore allows the CG to learn how to generate images that jointly produce a visual illusion and belong to a certain class.
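To make the data flow concrete, the following Python sketch spells out one evaluation of the generator objective for the pipeline just described. All module names and the paste_targets helper are illustrative placeholders standing in for the modules in Figure 3, not the authors' code.

```python
# A minimal sketch of one generator objective evaluation in the proposed
# framework. All callables (candidate_generator, background_discriminator,
# illusion_discriminator) and the paste_targets helper are assumptions.
import numpy as np

def paste_targets(inducer, target, positions):
    """Paste two identical target patches onto the generated inducer background."""
    stimulus = inducer.copy()
    for (row, col) in positions:
        stimulus[row:row + target.shape[0], col:col + target.shape[1]] = target
    return stimulus

def generator_objective(noise, candidate_generator, background_discriminator,
                        illusion_discriminator, target, positions,
                        w_illusion=1.0, w_background=1.0):
    inducer = candidate_generator(noise)                  # candidate background
    realism = background_discriminator(inducer)           # blue branch: "looks like the dataset?"
    stimulus = paste_targets(inducer, target, positions)  # green branch: paste identical targets
    illusion = illusion_discriminator(stimulus)           # "are the two targets 'seen' differently?"
    # The CG is trained so that both scores are high (written here as a score to maximize).
    return w_illusion * illusion + w_background * realism
```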

2.1 BD: Background discriminator

The role of the background discriminator is to force the candidate generator to synthesize inducers belonging to the desired dataset type. The absence of the BD module, or its poor training, would allow the CG to fall into mode collapse and to generate what can be considered the “trivial solution” (this will be discussed in Section 6). During the supervised training of the BD, both real images from the dataset and images generated by the CG are fed into the BD together with their corresponding labels. In this training scenario the BD outputs a discrepancy value, computed as a distance between the predicted probability that an input image belongs to the desired dataset and the known answer of whether or not it actually does.
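A hedged sketch of this supervised BD update, assuming a Keras binary classifier `bd` and the binary cross-entropy loss mentioned in Appendix B (the optimizer and batch handling are our assumptions):

```python
# Sketch of one supervised BD update: real dataset images are labeled 1,
# generated inducers 0, and the "discrepancy" is the binary cross-entropy
# between the predicted probabilities and those labels.
import numpy as np
import tensorflow as tf

def bd_training_step(bd, real_images, generated_inducers, optimizer):
    images = np.concatenate([real_images, generated_inducers], axis=0)
    labels = np.concatenate([np.ones(len(real_images)),
                             np.zeros(len(generated_inducers))])[:, None]
    with tf.GradientTape() as tape:
        probs = bd(images, training=True)
        loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(labels, probs))
    grads = tape.gradient(loss, bd.trainable_variables)
    optimizer.apply_gradients(zip(grads, bd.trainable_variables))
    return float(loss)
```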

2.2 ID: Illusion discriminator

The illusion discriminator consists of two components. The first one is a visual task solver (VTS), designed to perform a particular task that the human visual system (HVS) also achieves, and which therefore provides us with an emulation of the human visual response to the stimuli. The second component quantifies the degree of illusion present in the response given by the VTS, i.e. it acts as a perceptual quantifier (PQ). Thus, the PQ depends entirely on the type of VI to be generated and on the selected VTS. Examples of a VTS could be a CNN trained to do denoising, as in [17], or a vision model specifically designed to replicate the HVS, such as the ODOG model [8]. Let us note that this ID module is the key part of our proposed framework: by selecting an ID with responses very close to those of the HVS, the framework will generate VIs that almost always fool humans. Finally, this module is the part of the framework that mostly needs to be adapted in order to generate different types of illusions.

2.3 CG: Candidate generator

The aim of the candidate generator is to sample images that fool both the BD and the ID. Its representation power and the sampled space are key to generating the illusion, since the inducers produced will be limited by the synthesis capability of the generator. In our proposed framework we sample the space with the input noise vector, producing different inducers (i.e. different versions of the desired illusion). In order to have a rich variety of inducers, it is recommended to pre-train the CG and BD modules (inside the dashed line in Fig. 3) in a generative adversarial network fashion. When this pre-training is not performed, the generator risks falling into mode collapse. A discussion on this is presented in Section 6.

2.4 Framework loss function

As discussed above, a good candidate to be a synthesized VI should balance the formation of the VI effect in the targets with the richness of the inducer background. To do so, our loss function for the candidate generator weights the outputs from the ID and the BD modules using two scalars, as seen in Figure 3. This approach resembles what is found in many classical models broadly used in computer vision and image processing [18, 15]. In these classical models the loss function commonly balances two terms: one directly linked with the task to be solved, and another ensuring that the solution fulfills some a priori constraints; in our case these correspond to the ID and the BD modules, respectively.
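Written out, and calling the two scalar weights $\lambda_{\mathrm{ID}}$ and $\lambda_{\mathrm{BD}}$ (these symbol names are our own, since the paper's notation is not reproduced in this version), the candidate generator loss is the weighted sum

$$\mathcal{L}_{\mathrm{CG}} \;=\; \lambda_{\mathrm{ID}}\,\mathcal{L}_{\mathrm{ID}} \;+\; \lambda_{\mathrm{BD}}\,\mathcal{L}_{\mathrm{BD}},$$

where $\mathcal{L}_{\mathrm{ID}}$ penalizes a weak illusion effect between the targets and $\mathcal{L}_{\mathrm{BD}}$ penalizes inducers that do not look like images from the chosen dataset.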

3 Generators of Visual Illusions

In this section we detail several instances of the Phantasmagoria framework to generate visual illusions of specific types. In particular we focus on Lightness VIs (LVIs), Color VIs (CoVIs), and Contrast VIs (CrVIs). In each subsection we explain the choices for the CG, BD and ID modules required to generate each type of illusion.

In all the instances we first pre-train the CG and BD in a GAN fashion using either the DTD [12] or the CIFAR10 [23] dataset. Then, both the CG and BD networks are fine-tuned within the whole framework until no significant change is observed in the loss function of either of them. Note that the ID block is never trained in this work. The two loss weights are adjusted to produce a high variety of illusions (as explained in Section 6).
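A high-level outline of this two-phase training is sketched below; all per-step update functions and step counts are placeholders under our assumptions, not the authors' training script.

```python
# Two-phase training outline: (1) pretrain CG + BD as a plain GAN on the
# chosen dataset (DTD or CIFAR10), (2) fine-tune CG and BD with the full
# framework while the ID stays frozen.
def train_framework(bd_update, cg_update_gan, cg_update_full,
                    sample_real_batch, sample_noise,
                    pretrain_steps=5000, finetune_steps=2000):
    # Phase 1: standard GAN pretraining of CG and BD.
    for _ in range(pretrain_steps):
        bd_update(sample_real_batch(), sample_noise())
        cg_update_gan(sample_noise())
    # Phase 2: fine-tune with the weighted ID + BD objective; the ID is never trained.
    for _ in range(finetune_steps):
        bd_update(sample_real_batch(), sample_noise())
        cg_update_full(sample_noise())
```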

3.1 Phantasmagoria-LVI: Lightness VI Generator

Lightness visual illusions (LVIs) are a broadly studied topic in the vision science community. Their importance rests on their simplicity (there is only the lightness channel) and on the importance that this lightness channel has in our perception [30]. In this work we focus on LVIs consisting of images that include two targets of the same shape and lightness intensity, but in which these two targets are perceived differently because of the inducer image background.

In this instance, the CG module is a convolutional neural network (CNN) that receives a batch of random noise vectors and generates a batch of candidate images. The CNN is composed of two fully-connected layers followed by two convolutional layers, and the input is upscaled before each convolutional layer. ReLU activation functions are used after each layer except the output one, in which a sigmoid activation function is applied. The CG is pretrained in a GAN fashion together with just the BD module. The BD module is also a CNN; it receives a batch of images and outputs the probability of each image belonging to the desired dataset. This network is composed of three convolutional layers followed by two fully-connected layers. After every layer the activation function is a Leaky ReLU, except for the output layer, which uses a sigmoid function. A max pooling operation is applied after each convolutional layer. Several choices for the training dataset can be made; in Sections 4.1 and 4.2 we show the results obtained when using a database of textures [12] or natural images like CIFAR-10 [23].
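For concreteness, a tf.keras sketch of CG and BD networks with this layer structure is given below. The paper's exact layer widths, filter sizes and channel counts are not recoverable from this version of the text, so every number here is an illustrative assumption (the true architectures are depicted in Appendix A).

```python
# Illustrative tf.keras sketch of the CG and BD architectures described above.
# All sizes and channel counts are placeholder assumptions.
import tensorflow as tf
from tensorflow.keras import layers

NOISE_DIM = 100   # assumed latent dimension
IMG_SIZE = 32     # assumed inducer resolution (single channel for the LVI case)

def build_candidate_generator():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(NOISE_DIM,)),
        layers.Dense(256, activation="relu"),                       # 1st fully-connected layer
        layers.Dense(8 * 8 * 64, activation="relu"),                # 2nd fully-connected layer
        layers.Reshape((8, 8, 64)),
        layers.UpSampling2D(),                                      # upscale before 1st conv
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.UpSampling2D(),                                      # upscale before 2nd conv
        layers.Conv2D(1, 3, padding="same", activation="sigmoid"),  # inducer in [0, 1]
    ])

def build_background_discriminator():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 1)),
        layers.Conv2D(32, 5, padding="same"), layers.LeakyReLU(), layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding="same"), layers.LeakyReLU(), layers.MaxPooling2D(),
        layers.Conv2D(128, 3, padding="same"), layers.LeakyReLU(), layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128), layers.LeakyReLU(),
        layers.Dense(1, activation="sigmoid"),                      # P(image is from the real dataset)
    ])
```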

As explained in Section 2, the ID module is crucial and should be directly related to the LVI that we want to generate. In this case, for the visual task solver role we train a small CNN to perform a restoration problem (joint deblurring and denoising), inspired by the recent work of Gomez-Villa et al. [17]. We call this network RestoreNet. We use the same architecture proposed there: input and output layers of the same size, one hidden convolutional layer with 8 channels, and sigmoid activation functions. The last convolutional layer works as the output layer and hence has 3 channels. We also test our framework with a different choice of visual task solver: the ODOG model [8], a very successful model specifically designed to replicate human vision. Since these VTS choices receive larger input images than those produced by the CG, the generated images are upscaled using nearest-neighbor interpolation.

Regarding the perceptual quantifier part of the ID, it must measure whether there is an illusion and how strong it is. In this case, we opted to point-wise subtract the intensity of the central area of one target from the intensity of the same area of the other target. Please note that we can choose which target will be seen as lighter simply by selecting from which target we subtract the other.
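As an illustration, a minimal NumPy version of this lightness quantifier could look as follows; the target coordinates and patch size are assumptions for the example.

```python
# Lightness perceptual quantifier sketch: mean intensity of the central area
# of the right target minus that of the left target, computed on the VTS output.
import numpy as np

def lightness_pq(vts_output, left_center, right_center, half=4):
    """Positive values mean the right target is 'seen' as lighter than the left one."""
    def central_patch(img, center):
        row, col = center
        return img[row - half:row + half, col - half:col + half]
    right = central_patch(vts_output, right_center).mean()
    left = central_patch(vts_output, left_center).mean()
    return right - left  # swap the operands to make the other target appear lighter
```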

3.2 Phantasmagoria-CoVI: Color VI Generator

The color visual illusions (CoVIs) that we synthesize are closely related to the previously described LVIs. They are images with two targets of the same shape and color that are perceived as having different colors because of the inducer image background.

The CG and BD modules are the color versions of the CNNs described in the previous section, i.e. the only changes are that the output layer of the CG network has 3 channels and that the BD inputs are now color images. For the visual task solver of the ID module we choose a recent denoising CNN proposed by Zhang et al. [33], which according to [17] replicates color illusions well. The perceptual quantifier used here is a point- and channel-wise subtraction of the image values of the central area of one target from those of the same area of the other target. In order to choose a desired perceived color, each channel must be tailored manually. For instance, if the desired output is a reddish right target (with respect to the left target), a possible perceptual quantifier is one that maximizes the difference between the targets in the red channel (right minus left) and leaves the green and blue channels as a free choice for the model.
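A sketch of this channel-wise quantifier for the “redder right target” example is given below; the coordinates, patch size and RGB channel ordering are assumptions.

```python
# Color perceptual quantifier sketch for a "redder right target": maximize the
# red-channel difference (right minus left) over the central areas of the two
# targets; the other channels are left unconstrained in this simple variant.
import numpy as np

def color_pq_redder_right(vts_output, left_center, right_center, half=4, red_channel=0):
    def central_patch(img, center):
        row, col = center
        return img[row - half:row + half, col - half:col + half, :]
    right_red = central_patch(vts_output, right_center)[..., red_channel].mean()
    left_red = central_patch(vts_output, left_center)[..., red_channel].mean()
    return right_red - left_red  # larger value -> right target should be perceived as redder
```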

Note that color VIs are highly complex, and there are still ongoing discussions on how to categorize them [28]. Hence, if the target is not neutral gray, the framework may suggest solutions that saturate the color of one target or that only change its luminance. To alleviate this, we ask the framework to synthesize a simple perception of color, that is, to make one target appear of one particular color. Therefore, mixed classes of color illusions (saturation contrast or color contrast) are to be expected.

3.3 Phantasmagoria-CrVI: Contrast VI Generator

Contrast illusions are different in nature from lightness and color ones. They arise from the fact that in humans the presence of certain patterns in the scene (previous in time, or close in space) leads to changes in the response of texture sensors [29, 14] and motion sensors [26]. For instance, in the texture aftereffect [7, 4], parts of stimuli with physically stationary contrast seem to fade out after prolonged exposure to localized high-contrast patterns of similar frequency and orientation. In the case of texture induction, patterns with similar frequencies and orientation strongly reduce the response of sensors tuned to similar patterns [10].

In this work we focus on the case of texture induction. In particular, we focus on CrVIs consisting of images that include two targets with the same spatial frequency and orientation, but in which these two targets are perceived differently because of the inducer image background.

In this instance the CG and BD modules are the same CNNs described in Section 3.1. Regarding the ID, we use RestoreNet as the VTS. For the perceptual quantifier, we measure the Michelson contrast [25] in the central area of both targets and maximize the difference between them. As a result, we ask the framework to generate an inducer that highlights, i.e. makes more “visible” (with higher contrast), one of the patterns with respect to the other.
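The corresponding quantifier can be sketched directly from the Michelson contrast definition, C = (L_max − L_min)/(L_max + L_min); as before, the patch coordinates are illustrative assumptions.

```python
# Contrast perceptual quantifier sketch: Michelson contrast of each target's
# central area in the VTS output, maximizing (right minus left) so the right
# pattern appears more "visible".
import numpy as np

def michelson_contrast(patch):
    l_max, l_min = patch.max(), patch.min()
    return (l_max - l_min) / (l_max + l_min + 1e-8)  # small epsilon avoids division by zero

def contrast_pq(vts_output, left_center, right_center, half=8):
    def central_patch(img, center):
        row, col = center
        return img[row - half:row + half, col - half:col + half]
    return (michelson_contrast(central_patch(vts_output, right_center))
            - michelson_contrast(central_patch(vts_output, left_center)))
```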

4 Qualitative Results

Luminance and color VIs can be further subdivided into assimilation and contrast ones. In assimilation, the target intensity is attracted by the inducer (e.g. a gray target looks brighter if it is surrounded by a white inducer). Contrast is the opposite effect: the target is repelled from the inducer (e.g. a gray target looks darker if it is surrounded by a white inducer). In order to include the most varied set of examples, we show how our framework handles both of these sub-types for the luminance case. In the case of color, we focus on showing how our framework generates color contrast illusions independently of the starting color of the given targets. Finally, we also show how our framework generates contrast illusions (in particular texture induction) for targets of different orientations.

We recommend the reader to look at each visual illusion in isolation (for example, by covering any other illusion located close to the one of interest). Also, let us note that all the visual illusions are presented in isolation in the supplementary material, together with a larger set of results.

4.1 Lightness illusions

Our results were obtained by selecting RestoreNet as the visual task solver and considering different training datasets for the background discriminator (DTD [12] and CIFAR10 [23]).

The results for the lightness contrast illusions are shown in Figure 4. Three different shapes for the targets (squares, rings, and bars) were selected. Also, for these images, we selected that the right target should appear brighter than the left one. We can see how our framework is able to generate contrast VIs under all these different setups. Regarding the shapes, our method performs slightly worse for the case of the bars, due to their large height. In terms of the training dataset, we can see how it affects the background: a more structured pattern appears in the background inducer when the BD is trained using CIFAR10, probably caused by the object categories present in this dataset.

Figure 5 shows the result of our framework for a lightness assimilation illusion. Please note that assimilation VIs depend strongly on spatial frequency, and thus on the distance at which they are viewed; they may turn into contrast VIs if the spatial frequency (or, equivalently, the viewing distance) is diminished [20, 13]. We have done our best to choose the size of this figure so as to provide the best experience of the visual illusion. This said, in case the reader does not perceive the illusion (the perception of visual illusions varies between observers), we recommend zooming out of the figure to better accommodate its frequency.

Finally, Figure 6 shows the result of our framework for the two different VTS choices explained in Section 3.1. In this case, the training dataset was DTD and the selected shape was the square, therefore aiming for a lightness contrast illusion. Please note that in these images the square selected as the brightest was randomized, in order to use these images in the psychophysical experiment explained in Section 5. This said, we can clearly see that both VTS choices are able to obtain different VIs (although not all of them with the same strength).

Figure 4: Results for the lightness contrast illusions using different training datasets and shapes, and RestoreNet as the VTS. In all the cases the right target was selected to be brighter than the left one.
Figure 5: Results for the Lightness assimilation illusion using different training datasets, and RestoreNet as the VTS.
Figure 6: Results for the lightness inductions using the DTD dataset and square targets for the two different VTS choices. These images are also a subset of those used in the psychophysical experiment; for this reason, the square selected as the brightest was randomized in these images.

4.2 Color illusions

As explained in Section 3.2, in this case we have selected the denoising CNN proposed by Zhang et al. [33] as the VTS. We have considered the DTD as the training dataset for the background discriminator, and squares as the target shape. Results are shown in Figure 7. The three columns of this figure show the color contrast illusion obtained by starting from either a blue, a yellow or a red target color, and thus this figure shows that our framework is not tied to any particular region of the color space.

In more detail, the three cases behave as imposed by the loss function: in the first column the right target is perceived as redder than the left one, in the second column the right target is perceived as yellower, and in the third column the right target is perceived as bluer.

Figure 7: Results for the color illusions using DTD as the training dataset for the BD and different target colors. From left to right, our cost imposed the perception of the right target to be: redder, yellower, bluer.

4.3 Contrast illusions

For the contrast illusions we have selected CIFAR10 as the training dataset for the background discriminator; we chose CIFAR10 because the DTD dataset does not contain enough high-frequency training data to synthesize this type of illusion. For the visual task solver we have selected RestoreNet. In terms of the targets, we have selected patterns of three different orientations, and for these images the right target was selected to appear with higher contrast than the left one. Results for these illusions are shown in Figure 8. We can see that for two of the orientations the visual illusions fulfill our condition (i.e. the right target appears with higher contrast than the left one). However, for the remaining orientation, even if the framework performs as expected (i.e. selecting a higher frequency for the surround of the left pattern and a lower frequency for the surround of the right pattern), the effect is not strong enough to provoke an illusion for a human observer.

Figure 8: Results for the contrast illusions using CIFAR10 as the training dataset for the BD and different orientation patterns.

5 Psychophysical tests

We have run a psychophysical test to evaluate the ability of our framework to fool human observers. Observers were shown the lightness visual illusions created by Phantasmagoria-LVI (using squares as targets) and were asked to select the lighter square, having three options to choose from: left, right, or center (in case they were not able to perceive any difference). The experiment was performed on a calibrated AOC I2781FH LCD monitor. Observers sat at 50 cm from the screen, so that the target grey squares subtended 1.5 degrees of visual angle. We considered the Describable Textures Dataset (DTD) [12] for training the background discriminator module. For the visual task solver in the illusion discriminator module we considered the two choices mentioned in Section 3.1: RestoreNet [17] and the ODOG model [8]. For each VTS choice we selected output images (100 images in total) by randomly picking, from batches at different iterations of the framework training, images that were considered to be an illusion by the perceptual quantifier module, in order to ensure the diversity of the generated candidates. These images were randomized both in terms of the methods and in terms of the side on which the lighter square was expected to appear. Six example images (3 for each of the IDs tried: first row ODOG, second row RestoreNet) are shown in Figure 6. The full set of images used in this experiment is shown in the supplementary material.

Ten observers took part in the experiment. All of them had normal or corrected-to-normal color vision, and none of them was an author of this paper. A first interpretation of the results is presented in Table 1, where the average selection of each option for the full set, as well as for each ID, is shown. As we can see, in a large majority of the cases (roughly 70%) the human observers perceived the illusion that was generated by the ID. To obtain a more statistically significant result, we also recast the experiment in terms of the Thurstone Case V Law of Comparative Judgement [31]. To this end, we divided the answers into those reporting the generated illusion and those not reporting it; the second category contains both the cases where an observer did not see any illusion and those where an observer selected the opposite direction for the illusion. Results for this analysis are shown in Figure 9. As we can see, the generated illusion is perceived with statistical significance in all the cases.

              Opposite illusion   No illusion   Correct illusion
All                0.0770            0.2280          0.6950
ODOG               0.0720            0.2160          0.7120
RestoreNet         0.0820            0.2400          0.6780
Table 1: Results of the psychophysical experiment: average proportion of trials in which each option was selected.
Figure 9: Thurstone Case V results for our experiment. We can see that the illusions are perceived with statistical significance in all the cases.
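For readers who want to reproduce this style of analysis, the sketch below shows one plausible way to map the proportions of Table 1 to Thurstone Case V scale values (inverse normal CDF of the proportion reporting the generated illusion), with an approximate confidence interval. This is our own back-of-the-envelope reading of the analysis, not the authors' exact procedure, and the per-condition trial count is an assumption.

```python
# Rough Thurstone Case V style reading of Table 1: the proportion of trials in
# which the generated illusion was reported is mapped through the inverse
# normal CDF; a delta-method 95% CI is attached. n_trials is an assumption.
import numpy as np
from scipy.stats import norm

def thurstone_scale(p_correct, n_trials):
    z = norm.ppf(p_correct)                                   # Case V scale value
    se = np.sqrt(p_correct * (1 - p_correct) / n_trials) / norm.pdf(z)
    return z, (z - 1.96 * se, z + 1.96 * se)

for name, p in [("All", 0.695), ("ODOG", 0.712), ("RestoreNet", 0.678)]:
    scale, ci = thurstone_scale(p, n_trials=500)              # assumed trial count
    print(f"{name}: scale={scale:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```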

6 Discussion and limitations

One of the key points of the proposed framework, as mentioned in Section 2, is the right balance between the two discriminator modules, the background discriminator (BD) and the illusion discriminator (ID). We have studied the behavior of the proposed framework for a specific instance of Phantasmagoria-LVI in which we chose RestoreNet as the visual task solver and the DTD dataset [12] for the BD. Finding the right balance between the weights of the ID and the BD modules produces a rich variety of VI candidates (see the top group of images in Figure 10), where we show some randomly selected images generated after 500 and 800 iterations of the learning process. If the weight of the BD module is too low (as shown in the middle group of images in Figure 10), the generated candidates will be very similar to each other, since these solutions are mainly governed by the ID module. These candidates already strongly resemble what we previously referred to as the canonical solution for lightness VIs (see Figure 1). The “convergence” to this canonical solution is even clearer if we observe the extreme case of our framework: when we ignore the BD module (i.e. we set its corresponding weight to 0) and we use a generator that has not been pretrained to generate images of a certain type. In this extreme case, in a few iterations the network falls into mode collapse, generating images that are exactly identical within every batch (as shown in the bottom group of images in Figure 10). The images produced by the Phantasmagoria-LVI framework in mode collapse are nothing but a version of the canonical solution, which produces a very strong illusion in human observers.

Figure 10: Parametric study of the Phantasmagoria-LVI framework using RestoreNet as the visual task solver.

The study above reveals a general limitation of the proposed framework: there is no general rule to choose the best balance between the two discriminator modules. This adds to a problem shared by most GAN approaches, namely the selection of parameters such as the learning rates of the generator and the discriminator. Altogether, this implies that obtaining good solutions requires a certain degree of fine-tuning of the model parameters.

The previous discussion also has implications for the way the data for the psychophysical experiment was generated. Since the aim of our work is to synthesize new VIs, and hence to avoid falling into this mode collapse (which we know produces a VI for human observers), we have to tune the framework parameters to obtain a good trade-off between a sufficiently rich variety of candidates and the strength of the effect they provoke. The images used for the experiment are a consequence of this compromise, and were randomly picked among the generated candidates instead of being manually chosen to maximize the effect on the observers. The fact that, even under these conditions, the observers perceived the expected VI in roughly 70% of the cases is a very remarkable finding. Another by-product of the need to manually tune the parameters in order to find this balance is that we cannot use the results from the psychophysical tests to directly compare the two VTS instances used (RestoreNet and ODOG) in terms of how close each is to being a good “replicant”: we have no way to ensure that the selected weighting parameters are optimal for each method, in the sense that they would provoke the strongest VIs in human observers. Let us note, though, that obtaining a rule to fix these parameters would open the door to casting our framework as a proxy to assess how close a given model is to modelling human visual perception, therefore becoming a really useful tool for vision researchers.

Besides the above-mentioned topics, there are other important open problems that mainly concern the illusion discriminator (ID) and are intimately related to well-known open problems in the vision science community. The first one is finding a vision model that correctly replicates human visual perception in most scenarios. This is directly related to the visual task solver (VTS) part of our ID module, which, in the absence of such a “perfect model”, we substitute with CNNs or one of the existing vision models. The second one is the need to quantitatively evaluate the strength of the illusion after the stimulus passes through the VTS. There is a well-known lack of quantitative measurements in the study of VIs in the vision science literature, where qualitative assessments are typically performed. This forces us to propose ways to measure these effects in this work, and therefore implies the need to further study the impact of these choices on the generation of VIs.

Finally, let us remark that the possibility of using a CNN as the VTS part of the illusion discriminator opens the still unexplored possibility of allowing the VTS to be trained during the framework optimization process. This raises further questions regarding what the best training would be for improving the generation of visual illusions.

7 Conclusions

We have introduced Phantasmagoria, the first ever framework to generate novel visual illusions using artificial neural networks. Our framework follows a GAN structure, with a generator of inducer candidates for a visual illusion and two discriminator modules: one that ensures that the candidate belongs to a desired type of images, and another that decides whether or not the candidate indeed provokes an illusion. We have shown the generality of our framework by synthesizing illusions of different shapes and types, namely lightness, color and contrast visual illusions. Furthermore, we have corroborated the validity of our approach with psychophysical experiments confirming that most of the visual illusions produced by our framework fool human observers in the same way.

Further work should start with a deeper study of the parameters governing the model, which could open the door to exciting new avenues of research connecting human visual perception and artificial neural networks. The extension of this work to other types of illusions should also be considered. For example, we can consider the works of Watanabe et al. [32] and Kim et al. [21], who found CNNs replicating motion and completion illusions respectively; both of these CNNs seem to be good candidates for our visual task solver module.

Acknowledgements

This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement number 761544 (project HDR4EU) and under grant agreement number 780470 (project SAUCE), and by the Spanish government and FEDER Fund, grant ref. PGC2018-099651-B-I00 (MCIU/AEI/FEDER, UE). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

References

  • [1] M. Abadi and et al (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from tensorflow.org External Links: Link Cited by: Appendix B.
  • [2] F. Attneave (1954) Some informational aspects of visual perception.. Psychological review 61 (3), pp. 183. Cited by: §1.
  • [3] H. Barlow (1990) Vision: coding and efficiency. C. Blakemore (Ed.), Cited by: §1.
  • [4] H. Barlow (1990) A theory about the functional role and synaptic mechanism of visual after-effects. Vision: Coding and Efficiency, pp. 363–375. Cited by: §3.3.
  • [5] H. B. Barlow et al. (1961) Possible principles underlying the transformation of sensory messages. Sensory communication 1, pp. 217–234. Cited by: §1.
  • [6] H. Barlow (2001) Redundancy reduction revisited. Network: computation in neural systems 12 (3), pp. 241–253. Cited by: §1.
  • [7] C. Blakemore and F. W. Campbell (1969) Adaptation to spatial stimuli.. The Journal of physiology 200 (1), pp. 11P–13P. Cited by: §3.3.
  • [8] B. Blakeslee and M. E. McCourt (1999) A multiscale spatial filtering account of the white effect, simultaneous brightness contrast and grating induction. Vision Research 39 (26), pp. 4361 – 4377. External Links: ISSN 0042-6989, Document, Link Cited by: §2.2, §3.1, §5.
  • [9] E. Brücke (1865) Über Ergänzungs- und Contrastfarben. Wiener Sitzungsberichte 51. Cited by: Figure 1, §1.
  • [10] J. R. Cavanaugh, W. Bair, and J. A. Movshon (2002) Nature and interaction of signals from the receptive field center and surround in macaque v1 neurons. Journal of neurophysiology 88 (5), pp. 2530–2546. Cited by: §3.3.
  • [11] F. Chollet et al. (2015) Keras. Note: https://keras.io Cited by: Appendix B.
  • [12] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi (2014) Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). Cited by: §3.1, §3, §4.1, §5, §6.
  • [13] C. Fach and L. T. Sharpe (1986) Assimilative hue shifts in color gratings depend on bar width. Perception and Psychophysics 40 (6), pp. 412–418. Cited by: §4.1.
  • [14] J. M. Foley and C. Chen (1997) Analysis of the effect of pattern adaptation on pattern pedestal effects: a two-process model. Vision research 37 (19), pp. 2779–2788. Cited by: §3.3.
  • [15] D. A. Forsyth and J. Ponce (2002) Computer vision: a modern approach. Prentice Hall Professional Technical Reference. Cited by: §2.4.
  • [16] K. Friston (2009) The free-energy principle: a rough guide to the brain?. Trends in cognitive sciences 13 (7), pp. 293–301. Cited by: §1.
  • [17] A. Gomez-Villa, A. Martin, J. Vazquez-Corral, and M. Bertalmio (2019) Convolutional neural networks can be deceived by visual illusions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12309–12317. Cited by: §1, §2.2, §3.1, §3.2, §5.
  • [18] R. C. Gonzalez, R. E. Woods, et al. (2002) Digital image processing. Prentice hall Upper Saddle River, NJ. Cited by: §2.4.
  • [19] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, Cambridge, MA, USA, pp. 2672–2680. External Links: Link Cited by: §1.
  • [20] H. Helson (1963) Studies of anomalous contrast and assimilation. Journal of the Optical Society of America 53 (1). Cited by: §4.1.
  • [21] B. Kim, E. Reif, M. Wattenberg, and S. Bengio (2019) Do neural networks show gestalt phenomena? an exploration of the law of closure. arXiv preprint arXiv:1903.01069. Cited by: §1, §7.
  • [22] F. A. Kingdom (2011) Lightness, brightness and transparency: a quarter century of new ideas, captivating demonstrations and unrelenting controversy. Vision research 51 (7), pp. 652–673. Cited by: §1.
  • [23] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Cited by: §3.1, §3, §4.1.
  • [24] V. Laparra and J. Malo (2015) Visual aftereffects and sensory nonlinearities from a single statistical framework. Front. Human Neurosci. 9, pp. 557. External Links: Link, Document Cited by: §1.
  • [25] A. A. Michelson (1995) Studies in optics. Courier Corporation. Cited by: §3.3.
  • [26] M. Morgan, C. Chubb, and J. Solomon (2006) Predicting the motion after-effect from sensitivity loss. Vision Research 46 (15), pp. 2412–2420. Cited by: §3.3.
  • [27] Oriented difference of Gaussians implementation. Note: https://github.com/computational-psychology/lightness_models (accessed 2019-11-21). Cited by: Appendix B.
  • [28] X. Otazu, C. A. Parraga, and M. Vanrell (2010) Toward a unified chromatic induction model. Journal of Vision 10 (12), pp. 5–5. Cited by: §3.2.
  • [29] J. Ross and H. D. Speed (1991) Contrast adaptation and contrast masking in human vision. Proceedings of the Royal Society of London. Series B: Biological Sciences 246 (1315), pp. 61–70. Cited by: §3.3.
  • [30] A. G. Shapiro and D. Todorovic (2016) The oxford compendium of visual illusions. Oxford University Press. Cited by: §3.1.
  • [31] L. L. Thurstone (1927) A law of comparative judgment.. Psychological review 34 (4), pp. 273. Cited by: §5.
  • [32] E. Watanabe, A. Kitaoka, K. Sakamoto, M. Yasugi, and K. Tanaka (2018) Illusory motion reproduced by deep neural networks trained for prediction. Frontiers in psychology 9, pp. 345. Cited by: §1, §7.
  • [33] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang (2017-07) Beyond a gaussian denoiser: residual learning of deep cnn for image denoising. Trans. Img. Proc. 26 (7), pp. 3142–3155. External Links: ISSN 1057-7149, Link, Document Cited by: §3.2, §4.2.

Appendix A CNN architectures from Phantasmagoria-LVI/CoVI/CrVI

In this section we depict the CNN architectures considered in the paper for the specific instances of the Phantasmagoria framework used to synthesize lightness, color and contrast visual illusions (Phantasmagoria-LVI/CoVI/CrVI, respectively). Fig. A.1 shows the background discriminator network, Fig. A.2 depicts the candidate generator network, and Fig. A.3 details the visual task solver (part of the illusion discriminator module), which we denote RestoreNet.

Figure A.1: Architecture of the background discriminator CNN
Figure A.2: Architecture of the candidate generator CNN
Figure A.3: Architecture of the visual task solver (RestoreNet) CNN

Appendix B Implementation details

All our CNNs were trained using the Keras [11] and TensorFlow [1] frameworks. We used the following loss functions: binary cross-entropy for the background discriminator network, and mean squared error for the candidate generator and the visual task solver. The maximum number of epochs was set to 100, and we used a batch size of 32. The stopping criterion of the framework varies depending on the values of the two weighting parameters (please see Section 6 of the paper for more details).
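These details can be summarized in a small configuration sketch; the optimizer choice and learning rates are assumptions, as they are not stated here.

```python
# Illustrative training configuration matching the implementation details above.
import tensorflow as tf

BATCH_SIZE = 32
MAX_EPOCHS = 100

bd_loss = tf.keras.losses.BinaryCrossentropy()  # background discriminator
cg_loss = tf.keras.losses.MeanSquaredError()    # candidate generator
vts_loss = tf.keras.losses.MeanSquaredError()   # visual task solver (pre-trained)

bd_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)  # assumed
cg_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)  # assumed
```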

The oriented difference of Gaussians (ODOG) model used was a TensorFlow port of a publicly available Python implementation [27].

Our code and models will be publicly available soon.

Appendix C Psychophysical test images

Figures C.1 and C.2 show the full set of images (50 per method) that were used in the psychophysical test (see Section 5 of the paper). The square selected as the brightest was randomized in these images for experimental purposes.

Figure C.1: Visual illusions generated using RestoreNet as the visual task solver. The square selected as the brightest was randomized in these images for experimental purposes.
Figure C.2: Visual illusions generated using ODOG as the visual task solver. The square selected as the brightest was randomized in these images for experimental purposes.

Appendix D Visual illusions in isolation and additional results

Figures D.1 to D.14 show the different images presented in the paper in isolation, allowing the reader to better appreciate the effects provoked by the visual illusions. Additionally, Figures D.15 to D.23 present several new visual illusions (different from the main paper results). The expected visual illusion is explained in each figure caption.

Please remember that, as is always the case in visual illusions, both targets have exactly the same pixel values.

Figure D.1: Right target is asked to be lighter. LVI generated using RestoreNet as VTS and the DTD as database for the BD.
Figure D.2: Right target is asked to be lighter. LVI generated using RestoreNet as VTS and the DTD as database for the BD.
Figure D.3: Right target is asked to be lighter. LVI generated using RestoreNet as VTS and the DTD as database for the BD.
Figure D.4: Right target is asked to be lighter. LVI generated using RestoreNet as VTS and CIFAR10 as database for the BD.
Figure D.5: Right target is asked to be lighter. LVI generated using RestoreNet as VTS and CIFAR10 as database for the BD.
Figure D.6: Right target is asked to be lighter. LVI generated using RestoreNet as VTS and CIFAR10 as database for the BD.
Figure D.7: Right target is asked to be reddish. CoVI generated using RestoreNet as VTS and the DTD as database for the BD.
Figure D.8: Right target is asked to be yellower. CoVI generated using RestoreNet as VTS and the DTD as database for the BD.
Figure D.9: Right target is asked to be bluer. CoVI generated using RestoreNet as VTS and the DTD as database for the BD.
Figure D.10: Right target is asked to be lighter. LVI generated using RestoreNet as VTS and the DTD as database for the BD.
Figure D.11: Right target is asked to be lighter. LVI generated using RestoreNet as VTS and CIFAR10 as database for the BD.
Figure D.12: Right target is asked to be highlighted (higher contrast). CrVI generated using RestoreNet as VTS and CIFAR10 as database for the BD.
Figure D.13: Right target is asked to be highlighted (higher contrast). CrVI generated using RestoreNet as VTS and CIFAR10 as database for the BD.
Figure D.14: Right target is asked to be highlighted (higher contrast). CrVI generated using RestoreNet as VTS and CIFAR10 as database for the BD.
Figure D.15: Right target is asked to be reddish. CoVI generated using RestoreNet as VTS and the DTD as database for the BD.
Figure D.16: Right target is asked to be reddish. CoVI generated using RestoreNet as VTS and the DTD as database for the BD.
Figure D.17: Right target is asked to be bluer. CoVI generated using RestoreNet as VTS and the DTD as database for the BD.
Figure D.18: Right target is asked to be bluer. CoVI generated using RestoreNet as VTS and the DTD as database for the BD.
Figure D.19: Right target is asked to be yellower. CoVI generated using RestoreNet as VTS and the DTD as database for the BD.
Figure D.20: Right target is asked to be yellower. CoVI generated using RestoreNet as VTS and the DTD as database for the BD.
Figure D.21: Right target is asked to be lighter. LVI generated using RestoreNet as VTS and the DTD as database for the BD.
Figure D.22: Right target is asked to be lighter. LVI generated using RestoreNet as VTS and the DTD as database for the BD.
Figure D.23: Right target is asked to be lighter. LVI generated using RestoreNet as VTS and CIFAR10 as database for the BD.