An adult laughs 18 times a day on average. A good sense of humor is related to communication competence [14, 15], helps raise an individual’s social status and popularity [19, 29], and helps attract compatible mates [8, 10, 39]. Humor in the workplace improves camaraderie and helps workers cope with daily stresses and loneliness. fMRI studies of the brain reveal that humor activates the components of the brain that are involved in reward processing. This probably explains why we actively seek to experience and create humor.
Despite the tremendous impact that humor has on our lives, the lack of a rigorous definition of humor has hindered humor-related research in the past [4, 50]. While verbal humor is better understood today [45, 48], visual humor remains unexplored. As vision and AI researchers we are interested in the following question – what content in an image causes it to be funny? Our work takes a step in the direction of building computational models for visual humor.
Computational visual humor is useful for a number of applications: to create better photo editing tools, smart cameras that pick the right moment to take a (funny) picture, recommendation tools that rate funny pictures higher (say, to post on social media), video summarization tools that summarize only the funny frames, automatically generating funny scenes for entertainment, identifying and catering to personalized humor, etc. As AI systems interact more with humans, it is vital that they understand subtleties of human emotions and expressions. In that sense, being able to identify humor can contribute to their common sense.
Understanding visual humor is fraught with challenges such as having to detect all objects in the scene, observing the interactions between objects, and understanding context, which are currently unsolved problems. In this work, we argue that, by using scenes made from clipart [1, 2, 17, 25, 26, 54, 61, 62], we can study visual humor without having to wait for these detailed recognition problems to be solved. Abstract scenes are inherently densely annotated (e.g., all objects and their locations are known), and so enable us to learn the fine-grained semantics of a scene that cause it to be funny. In this paper, we collect two datasets of abstract scenes that facilitate the study of humor at both the scene-level (fig:raccoon_party, fig:dog_dinner) and the object-level (fig:rats_steal_before, fig:rats_steal_after). We propose a model that predicts how funny a scene is using semantic visual features of the scene such as occurrence of objects, and their relative locations. We also build computational models for a particular source of humor, i.e., humor due to the presence of objects in an unusual context. This source of humor is explained by the incongruity theory of humor, which states that a playful violation of the subjective expectations of a perceiver causes humor. E.g., fig:dog_dinner is funny because our expectation is that people eat at tables and dogs sit in pet beds, and this is violated when we see the roles of people and dogs swapped.
The scene-level Abstract Visual Humor (AVH) dataset contains funny scenes (fig:raccoon_party, fig:dog_dinner) and unfunny scenes with human ratings for funniness of each scene. Using the ground truth rating, we demonstrate that we can reliably predict a funniness score for a given scene. The object-level Funny Object Replaced (FOR) dataset contains scenes that are originally funny (fig:rats_steal_before) and their unfunny counterparts (fig:rats_steal_after). The unfunny counterparts are created by humans by replacing objects that contribute to humor such that the scene is not funny anymore. The ground truth of replaced objects is used to train models to alter the funniness of a scene – to make a funny scene unfunny and vice versa. Our models outperform natural baselines and ablated versions of our system in quantitative evaluation. They also demonstrate good qualitative performance via human studies.
Our main contributions are as follows:
We collect two abstract scene datasets consisting of scenes created by humans which are publicly available.
The scene-level Abstract Visual Humor (AVH) dataset consists of funny and unfunny abstract scenes (subsec:AVH). Each scene also contains a brief explanation of the humor in the scene.
The object-level Funny Object Replaced (FOR) dataset consists of funny scenes and their corresponding unfunny counterparts resulting from object replacement (subsec:FOR).
We analyze the different sources of humor techniques depicted in the AVH dataset via human studies (subsec:humor_techniques).
We learn distributed representations for each object category which encode the context in which an object naturally appears, i.e., in an unfunny setting (subsec:features).
We model two tasks to demonstrate an understanding of visual humor:
Predicting how funny a given scene is (subsec:task1_results).
Automatically altering the funniness of a given scene (subsec:task2_funny_unfunny).
To the best of our knowledge, this is the first work that deals with understanding and building computational models for visual humor.
2 Related Work
Humor Theories. Humor has been a topic of study since the time of Plato, Aristotle, and Bharata. Over the years, philosophical studies and psychological research have sought to explain why we laugh. There are three theories of humor that are popular in contemporary academic literature. According to the incongruity theory, a perceiver encounters an incongruity when expectations about the stimulus are violated. The two-stage model of humor further states that the process of discarding prior assumptions and reinterpreting the incongruity in a new context (resolution) is crucial to the comprehension of humor. Superiority theory suggests that the misfortunes of others, which reflect our own superiority, are a source of humor. According to the relief theory, humor is the release of pent-up tension or mental energy. Feelings of hostility, aggression, or sexuality that are expressed bypassing any societal norms are said to be enjoyed.
Previous attempts to characterize the stimuli that induce humor have mostly dealt with linguistic or verbal humor, e.g., the script-based semantic theory of humor and its revised version, the general theory of verbal humor.
Computational Models of Humor. A number of computational models have been developed to recognize language-based humor, e.g., one-liners, sarcasm, and knock-knock jokes. Other work in this area includes exploring features of humorous texts that help detection of humor, and identifying the set of words or phrases in a sentence that could contribute to humor.
Some computational humor models that generate verbal humor are JAPE, which is a pun-based riddle generating program, HAHAcronym, which is an automatic funny acronym generator, and an unsupervised model that produces “I like my X like I like my Y, Z” jokes. While the above works investigate detection and generation of verbal humor, in this work we deal purely with visual humor.
Recent works predict the best text to go along with a given (presumably funny) raw image such as a meme or a cartoon. In addition, Radev et al. develop unsupervised methods to rank funniness of captions for a cartoon. They also analyze the characteristics of the funniest captions. Unlike our work, these works do not predict whether a scene is funny or which components of the scene contribute to the humor.
Buijzen and Valkenburg analyze humorous commercials to develop and investigate a typology of humor. Our contributions are different as we study the sources of humor in static images, as opposed to audiovisual media. To the best of our knowledge, ours is the first work to study visual humor in a computational framework.
Human Perception of Images. A number of works investigate the intrinsic characteristics of an image that influence human perception, e.g., memorability, popularity, visual interestingness, and virality. In this work, we study what content in a scene causes people to perceive it as funny, and explore a method of altering the funniness of a scene.
Learning from Visual Abstraction.
Visual abstractions have been used to explore high-level semantic scene understanding tasks like identifying visual features that are semantically important [61, 63], learning mappings between visual features and text, learning visually grounded word embeddings, modeling fine-grained interactions between pairs of people, and learning (temporal and static) common sense [17, 26, 54]. In this work, we use abstract scenes to understand the semantics in a scene that cause humor, a problem that has not been studied before.
We introduce two new abstract scenes datasets – the Abstract Visual Humor (AVH) dataset (subsec:AVH) and the Funny Object Replaced (FOR) dataset (subsec:FOR) using the interfaces described in subsec:interface. The AVH dataset (subsec:AVH) consists of both funny and unfunny scenes along with funniness ratings. The FOR dataset (subsec:FOR) consists of funny scenes and their altered unfunny counterparts. Both the datasets are made publicly available on the project webpage.
3.1 Abstract Scenes Interface
Abstract scenes enable researchers to explore high-level semantics of a scene without waiting for low-level recognition tasks to be solved. We use the clipart interface (www.github.com/VT-vision-lab/abstract_scenes_v002) developed by Antol et al., which allows for indoor and outdoor scenes to be created. The clipart vocabulary consists of 20 deformable human models, 31 animals in various poses, and around 100 objects that are found in indoor (e.g., chair, table, sofa, fireplace, notebook, painting) and outdoor (e.g., sun, cloud, tree, grill, campfire, slide) scenes. The human models span different genders, races, and ages with 8 different expressions. They have limbs that are adjustable to allow for continuous pose variations. This, combined with the large vocabulary of objects, results in diverse scenes with rich semantics. fig:teaser (Top Row) shows scenes that AMT workers created using this abstract scenes interface and vocabulary. Additional details, example scenes, and a sample of clipart objects are available on the project webpage.
3.2 Abstract Visual Humor (AVH) Dataset
This dataset consists of funny and unfunny scenes created by AMT workers, facilitating the study of visual humor at the scene level.
Collecting Funny Scenes. We collect 3.2K scenes via AMT by asking workers to create funny scenes that are meaningful, realistic, and that other people would also consider funny. This is to encourage workers to refrain from creating scenes with inside jokes or catering to a very personalized form of humor. A screenshot of the interface used to collect the data is available on the project webpage. We provide a random subset of the clipart vocabulary to each worker out of which at least 6 clipart objects are to be used to create a scene. In addition, we also ask the worker to give a brief description of why the scene is funny in a short phrase or sentence. We find that this encourages workers to be more thoughtful and detailed regarding the scene they create. Note that this is different from providing a caption to an image since this is a simple explanation of what the worker had in mind while creating the scene. Mining this data may be useful to better understand visual humor. However, in this work we focus on the harder task of understanding purely visual humor and do not use these explanations.
We also use an equal number (3.2K) of abstract scenes from prior work, which are realistic, everyday scenes. We expect most of these scenes to be mundane (i.e., not funny).
Labeling Scene Funniness. Anyone who has tried to be funny knows that humor is a subjective notion. A well-intending worker may create a scene that other people do not find very funny. We obtain funniness ratings for each scene in the dataset from 10 different workers on AMT who do not see the creator’s explanation of funniness. The ratings are on a scale of 1 to 5, where 1 is not funny and 5 is extremely funny. We define the funniness score Fi of a scene i, as the average of the 10 ratings for the scene. We found 10 ratings to be sufficient for good inter-human agreement. Further analysis is provided on the project webpage.
By plotting a distribution of these scores, we determine the optimal threshold that best separates scenes that were intended to be funny (i.e., workers were specifically asked to create a funny scene) and other scenes (i.e., the everyday scenes, where workers were not asked to create funny scenes). We label all scenes whose Fi is above this threshold as funny and all scenes with a lower Fi as unfunny. This re-labeling results in 522 unintentionally funny scenes (i.e., everyday scenes which were determined to be funny), and 682 unintentionally unfunny scenes (i.e., well-intentioned worker outputs which were deemed not funny by the crowd).
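The scoring and relabeling steps above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names are hypothetical, and "best separates" is assumed here to mean the cut-off with the highest agreement with the intended labels, searched over midpoints between adjacent observed scores.

```python
def funniness_score(ratings):
    """Average of the worker ratings (each on the 1-5 scale) for one scene."""
    return sum(ratings) / len(ratings)


def best_threshold(scores, intended_funny):
    """Return (threshold, agreement) for the cut-off that best matches intent.

    Assumption: candidate thresholds are midpoints between adjacent distinct
    scores, and a scene is labeled funny when its score is above the threshold.
    """
    distinct = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = sum((s > t) == y for s, y in zip(scores, intended_funny))
        acc = correct / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Toy data: two everyday scenes and two scenes intended to be funny.
scores = [1.2, 1.5, 2.8, 3.1]
intended = [False, False, True, True]
threshold, agreement = best_threshold(scores, intended)
labels = [s > threshold for s in scores]
```

Scenes whose label disagrees with the creator's intent correspond to the "unintentionally funny" and "unintentionally unfunny" groups described in the text.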
In total, this dataset contains 6,400 scenes (3,028 funny scenes and 3,372 unfunny scenes). We randomly split these scenes into train, val, and test sets having 60%, 20%, and 20% of the scenes, respectively. We refer to this dataset as the AVH dataset.
Humor Techniques. To better understand the different sources of humor in our dataset, we collect human annotations of the different techniques that are used to depict humor in each scene. We create a list of humor techniques that are motivated by existing humor theories, based on patterns that we observe in funny scenes, and the audio-visual humor typology by Buijzen et al.: person doing something unusual, animal doing something unusual, clownish behavior (i.e., goofiness), too many objects, somebody getting hurt, somebody getting scared, and somebody getting angry.
We choose a subset of 200 funny scenes from the AVH dataset. We show each of these scenes to 10 different AMT workers and ask them to choose all the humor techniques that are depicted. Our options also included none of the above reasons, which also prompted workers to briefly explain what other unlisted technique depicted in the scene made it funny. However, we observe that this option was rarely used by workers. This may indicate that most of our scenes can be explained well by one of the listed humor techniques. fig:humor_techniques shows the top voted images corresponding to the 4 most popular techniques of humor. We find that the techniques that involve animate objects – animal doing something unusual and person doing something unusual are voted higher than any other technique by a large margin. For 75% of the scenes, at least 3 out of 10 workers picked one of these two techniques. We observe that this unusualness or incongruity is generally caused by objects occurring in an unusual context in the scene.
Introducing or eliminating incongruities can alter the funniness of a scene. An elderly person kicking a football while simultaneously skateboarding (fig:before_after, bottom) is incongruous and hence considered funny. However, when the person is replaced by a young girl, this is not incongruous and hence not funny. Such incongruities that can alter the funniness of a scene serve as our motivation to collect the Funny Object Replaced dataset, which we describe next.
3.3 Funny Object Replaced (FOR) Dataset
Replacing objects in a scene is a technique to manipulate incongruities (and hence funniness) in a scene. For instance, we can change funny interactions (which are unexpected by our common sense) to interactions that are normal according to our mental model of the world. We use this technique to collect a dataset which consists of funny scenes and their altered unfunny counterparts. This enables the study of humor in a scene at the object-level.
We show funny scenes from the AVH dataset and ask AMT workers to make the least number of replacements in the scene to render the originally funny scene unfunny. The motivation behind this is to get a precise signal of which objects in the scene contribute to humor and what they can be replaced with to reduce/eliminate humor, while keeping the underlying structure of the scene the same. We ask workers to replace an object with another object that is as similar as possible to the first object and keep the scene realistic. This helps us understand the fine-grained semantics that cause a specific object category to contribute to humor. There could be other ways to manipulate humor, e.g., by adding, removing, or moving objects in a scene, but in our work we employ only the technique of replacing objects. We find that this technique is very effective in altering the funniness of a scene. Our interface did not allow people to add, remove, or move the objects in the scene. A screenshot of the interface used to collect this dataset is available on the project webpage.
For each of the 3,028 funny scenes in the AVH dataset, we collect object-replaced scenes from 5 different workers resulting in 15,140 unfunny counterpart scenes. As a sanity check, we collect funniness ratings (via AMT) for 750 unfunny counterpart scenes. We observe that they indeed have an average Fi of 1.10, which is smaller than that of their corresponding original funny scenes (whose average Fi is 2.66). fig:before_after shows two pairs of funny scenes and their object-replaced unfunny counterparts. We refer to this dataset as the FOR dataset.
Given the task posed to workers (altering a funny scene to make it unfunny), it is natural to use this dataset to train a model to reduce the humor in a scene. However, this dataset can also be used to train flipped models that can increase the humor in a scene as shown in subsubsec:task2_unfunny_funny.
We propose and model two tasks that we believe demonstrate an understanding of some aspects of visual humor:
Predicting how funny a given scene is.
Altering the funniness of a scene.
The models that perform the above tasks are described in subsec:task1 and subsec:task2, respectively. The features used in the models are described first (subsec:features).
Abstract scenes are trivially densely annotated which we use to compute rich semantic features. Recall that our interface allows two types of scenes (indoor and outdoor) and our vocabulary consists of 150 object categories. We compute both scene-level and instance-level features.
Object embedding (150-d) is a distributed representation that captures the context in which an object category usually occurs. We learn this representation using a word2vec-style continuous Bag-of-Words model. The model tries to predict the presence of an object category in the scene, given the context provided by other instances of objects in the scene. Specifically, in a scene, given 5 (randomly chosen) instances, the model tries to predict the object category of the 6th instance. We train the single-layer (150-d) neural network with multiple 6-item subsets of instances from each scene. The network is trained using Stochastic Gradient Descent (SGD) with a momentum of 0.9. We use 11K scenes (that were not intended to be funny) from a previously collected dataset to train the model. Thus, we learn representations of objects occurring in natural contexts which are not funny. A visualization of the object embeddings is available on the project webpage.
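To make the construction of training examples concrete, the sketch below draws 6-item subsets of instances from a scene and splits each into a 5-instance context and a target category. The function name, the number of subsets per scene, and the handling of scenes with fewer than 6 instances are assumptions for illustration; the text does not specify them.

```python
import random


def cbow_samples(scene_instances, num_samples, rng, subset_size=6):
    """Draw (context, target) pairs from one scene.

    scene_instances: list of object-category indices, one entry per instance.
    Each sample picks `subset_size` instances at random without replacement;
    the first five act as context and the last one is the category to predict.
    Scenes with fewer than `subset_size` instances are skipped (an assumption).
    """
    if len(scene_instances) < subset_size:
        return []
    samples = []
    for _ in range(num_samples):
        subset = rng.sample(scene_instances, subset_size)
        samples.append((subset[:-1], subset[-1]))
    return samples


rng = random.Random(0)
scene = [3, 17, 17, 42, 8, 99, 5]  # hypothetical category indices
samples = cbow_samples(scene, num_samples=4, rng=rng)
```

Each pair would then be fed to the embedding network, with the context categories as input and the target category as the label.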
Local embedding (150-d) For each instantiation of an object in the scene, we compute a weighted sum of object embeddings of all the other instances in the scene. The weight of every other instance is its inverse square-root distance w.r.t. the instance under consideration.
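A minimal sketch of the local embedding follows. It reads "inverse square-root distance" as a weight of 1/sqrt(d), where d is the Euclidean distance between two instances; that reading, the small `eps` guard for coincident instances, and all names are assumptions rather than details given in the text.

```python
import math


def local_embedding(index, positions, categories, object_embeddings, eps=1e-6):
    """Weighted sum of the object embeddings of all other instances.

    positions: (x, y) per instance; categories: category index per instance;
    object_embeddings: list mapping category index -> embedding vector.
    """
    dim = len(object_embeddings[0])
    out = [0.0] * dim
    x0, y0 = positions[index]
    for j, (x, y) in enumerate(positions):
        if j == index:
            continue  # skip the instance under consideration
        dist = math.hypot(x - x0, y - y0)
        weight = 1.0 / math.sqrt(max(dist, eps))
        emb = object_embeddings[categories[j]]
        for k in range(dim):
            out[k] += weight * emb[k]
    return out


# Toy 2-d embeddings instead of 150-d ones, to keep the arithmetic visible.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
positions = [(0.0, 0.0), (4.0, 0.0), (0.0, 16.0)]
categories = [0, 1, 2]
vec = local_embedding(0, positions, categories, embeddings)
```

In the toy scene, the instance 4 units away gets weight 0.5 and the one 16 units away gets weight 0.25, so nearby objects dominate the local context.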
Cardinality (150-d) is a Bag-of-Words representation that indicates the number of instances of each object category that are present in the scene.
Location is a vector of the horizontal and vertical coordinates of every object in the scene. When multiple instances of an object category are present, we consider the location of the instance closest to the center of the scene.
Scene Embedding (150-d) is the sum of object embeddings of all objects present in the scene.
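The scene-level features described above can be sketched as below. Several details are assumptions made only for illustration: absent categories are encoded as (0, 0) in the location vector, the scene embedding sums over instances rather than over distinct categories, and the function names are hypothetical.

```python
def cardinality(categories, vocab_size):
    """Count of instances per object category (Bag-of-Words style)."""
    counts = [0] * vocab_size
    for c in categories:
        counts[c] += 1
    return counts


def location(categories, positions, vocab_size, center):
    """(x, y) per category, taken from the instance closest to the center."""
    closest = {}
    for c, (x, y) in zip(categories, positions):
        d = (x - center[0]) ** 2 + (y - center[1]) ** 2
        if c not in closest or d < closest[c][0]:
            closest[c] = (d, x, y)
    feat = [0.0] * (2 * vocab_size)  # absent categories stay at (0, 0)
    for c, (_, x, y) in closest.items():
        feat[2 * c], feat[2 * c + 1] = x, y
    return feat


def scene_embedding(categories, object_embeddings):
    """Sum of the object embeddings of the instances in the scene."""
    dim = len(object_embeddings[0])
    out = [0.0] * dim
    for c in categories:
        for k in range(dim):
            out[k] += object_embeddings[c][k]
    return out


categories = [0, 2, 2]
positions = [(1.0, 1.0), (9.0, 9.0), (5.0, 4.0)]
embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
card = cardinality(categories, vocab_size=3)
loc = location(categories, positions, vocab_size=3, center=(5.0, 5.0))
emb = scene_embedding(categories, embeddings)
```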
4.2 Predicting Funniness Score
We train a Support Vector Regressor (SVR) that predicts the funniness score, Fi, for a given scene. The model regresses to the Fi computed from ratings given by AMT workers (described in subsubsec:funniness_score) on scenes from the AVH dataset (subsec:AVH). We train the SVR on the scene-level features (described in subsec:features) and perform an ablation study.
4.3 Altering Funniness of a Scene
We learn models to alter the funniness of a scene – from funny to unfunny and vice versa. Our two-stage pipeline involves:
1. Detecting objects that contribute to humor.
2. Identifying suitable replacement objects for those detected in step 1, to make the scene unfunny (or funny), while keeping it realistic.
We train a multi-layer perceptron (MLP) on scenes from the FOR dataset to make a binary prediction on each object instance in the scene – whether it should be replaced to alter the funniness of a scene or not. The input is a 300-d vector formed by concatenating object embedding and local embedding features. The MLP has two hidden layers comprising 300 and 100 units, respectively, to which ReLU activation is applied. The final layer has 2 neurons and is used to perform binary classification (replace or not) using cross-entropy loss. We train the model using SGD with a base learning rate of 0.01 and momentum of 0.9. We also trained a model with skip-connections that considers the predictions made on other objects when making a prediction on a given object. However, this did not result in significant performance gains.
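The layer sizes in this paragraph can be made concrete with a dependency-free forward pass. This is only a sketch of the architecture's shape: the weights are random rather than trained, the initialization scheme is an assumption, and the SGD training loop is omitted.

```python
import math
import random

LAYER_SIZES = [300, 300, 100, 2]  # input, two hidden layers, binary output


def init_layers(sizes, rng):
    """Random weights for illustration only; a real model would be trained."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def dense(x, layer):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def replace_probability(object_emb, local_emb, layers):
    """Probability that the instance should be replaced (class index 1)."""
    h = list(object_emb) + list(local_emb)  # 150-d + 150-d = 300-d input
    for layer in layers[:-1]:
        h = [max(0.0, a) for a in dense(h, layer)]  # ReLU on hidden layers
    return softmax(dense(h, layers[-1]))[1]


def cross_entropy(prob_replace, label):
    """Cross-entropy for one instance; label is 1 (replace) or 0 (keep)."""
    p = prob_replace if label == 1 else 1.0 - prob_replace
    return -math.log(max(p, 1e-12))


layers = init_layers(LAYER_SIZES, random.Random(0))
p = replace_probability([0.1] * 150, [0.2] * 150, layers)
loss = cross_entropy(p, label=1)
```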
We train an MLP to perform a 150-way classification to predict potential replacer objects (from the clipart vocabulary), given an object predicted to be replaced in a scene. The model’s input is a 300-d vector formed by concatenating local embedding and object embedding features. The classifier has 3 hidden layers of 300 units each, with ReLU non-linearities. The output layer has 150 units over which we compute soft-max loss. We train the model using SGD with a base learning rate of 0.1, momentum of 0.9, and a dropout ratio of 0.5. The label for an instance is the index of the replacer object category used by the worker. Due to the large diversity of viable replacer objects that can alter humor in a scene, we also analyze the top-5 predictions of this model. We train two models – one on funny scenes, and another on their unfunny counterparts from the FOR dataset. Thus, we learn models to alter the funniness in a scene in one direction – funny to unfunny or vice versa. Although we could train the pipeline end-to-end, we train each stage separately so that we can evaluate them separately and isolate their errors (for better interpretability).
We discuss the performance of our models in the two visual humor tasks of:
Predicting how funny a given scene is (subsec:task1_results)
Altering funniness of a scene (subsec:task2_funny_unfunny).
We discuss the quantitative results of our model in altering a funny scene to make it unfunny in subsubsec:task2_funny_unfunny, and vice versa in subsubsec:task2_unfunny_funny. In subsec:human_eval, we report qualitative results through human studies.
5.1 Predicting Funniness Score
This section presents performance of the SVR (subsec:task1) that predicts the funniness score Fi of a scene.
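Metric. We use average relative error to quantify our model’s performance, computed as follows:
\[ \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert \hat{F}_i - F_i \rvert}{F_i} \]
where $N$ is the number of test scenes, $F_i$ is the ground-truth funniness score for test scene $i$, and $\hat{F}_i$ is the score predicted for it.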
Baseline. The baseline model always predicts the average funniness score of the training scenes.
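The metric and the baseline can be sketched in a few lines, assuming that relative error means the absolute difference between predicted and ground-truth scores divided by the ground-truth score. The function names and toy numbers are illustrative only.

```python
def average_relative_error(predicted, ground_truth):
    """Mean of |prediction - truth| / truth over the test scenes."""
    total = sum(abs(p - g) / g for p, g in zip(predicted, ground_truth))
    return total / len(ground_truth)


def mean_baseline(train_scores, num_test):
    """Always predict the average funniness score of the training scenes."""
    avg = sum(train_scores) / len(train_scores)
    return [avg] * num_test


test_truth = [2.0, 4.0]
model_error = average_relative_error([2.0, 3.0], test_truth)
baseline_error = average_relative_error(mean_baseline([1.0, 3.0], 2), test_truth)
```

Because ratings start at 1, the ground-truth score in the denominator is never zero.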
Model. As shown in tab:reg, we observe that our model trained using combinations of different scene-level features (described in subsec:features) performs better than the baseline model. We see that Location features perform slightly better than Cardinality. This makes sense because Location features also have occurrence information. The Embedding does not have location information and hence does worse. Due to some redundancy (all features have occurrence information), combining them does not improve performance.
|Features|Avg. Rel. Err.|
|---|---|
|Avg. Prediction Baseline|0.3151|
|Embedding + Cardinality + Location|0.2400|
5.2 Altering Funniness of a Scene
We discuss the performance in the tasks of identifying objects in a scene that contribute to humor (subsec:task1) and replacing those objects with other objects to reduce (or increase) humor (subsec:task2).
5.2.1 Predicting Objects to be Replaced
We train this model to detect object instances that are funny in the scene. It makes a binary prediction of whether each instance should be replaced or not.
Metric. Along with accuracy (% of correct predictions, i.e., Acc.), we also report average class-wise accuracy (i.e., Avg. Cl. Acc.) to determine the performance of our model for this task. As the data is skewed, with the majority class being not-replace, we require our model to perform well both class-wise and as a whole.
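The difference between the two metrics is easiest to see on skewed toy data. The sketch below (hypothetical names, made-up labels) computes both and shows that always predicting the majority class scores well on accuracy but only 50% class-wise.

```python
def accuracy(pred, gold):
    """Fraction of instances predicted correctly."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def average_classwise_accuracy(pred, gold):
    """Mean of the per-class accuracies, so each class counts equally."""
    per_class = []
    for c in sorted(set(gold)):
        idx = [i for i, g in enumerate(gold) if g == c]
        per_class.append(sum(pred[i] == c for i in idx) / len(idx))
    return sum(per_class) / len(per_class)


# 0 = do not replace (majority), 1 = replace.
gold = [0, 0, 0, 0, 1]
always_keep = [0, 0, 0, 0, 0]
acc = accuracy(always_keep, gold)
cl_acc = average_classwise_accuracy(always_keep, gold)
```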
Priors. We always predict that an instance should not be replaced. We also compute a stronger baseline that replaces an object if it is replaced at least T% of the time in training data. T was set to 20 based on the validation set.
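The stronger prior-based baseline can be sketched as follows. The data layout (pairs of category and a replaced flag), the function names, and the choice to leave unseen categories unreplaced are assumptions; only the 20% threshold comes from the text.

```python
from collections import defaultdict


def replacement_rates(train_instances):
    """Fraction of training instances of each category that were replaced.

    train_instances: iterable of (category, was_replaced) pairs.
    """
    seen = defaultdict(int)
    replaced = defaultdict(int)
    for category, was_replaced in train_instances:
        seen[category] += 1
        replaced[category] += int(was_replaced)
    return {c: replaced[c] / seen[c] for c in seen}


def predict_replace(category, rates, threshold=0.20):
    """Replace an instance if its category is replaced at least T of the time.

    Categories never seen in training default to "do not replace".
    """
    return rates.get(category, 0.0) >= threshold


train = [("rat", True), ("rat", False)] + [("sofa", False)] * 9 + [("sofa", True)]
rates = replacement_rates(train)
```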
Anomaly Detection. From the scene embedding, we subtract the object embedding of the object under consideration. We then compute the cosine similarity of the resultant scene embedding with the object embedding. Objects with the least similarity with the scene are the anomalous objects in the scene. This is similar to finding the odd-one-out given a group of words. Objects that have a cosine similarity less than a threshold T with the scene are predicted as anomalous objects and are replaced. A modification to this baseline is to replace the K objects that are least similar to the scene. Based on performance on the validation set, T and K are determined to be 0.8 and 4, respectively.
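A compact sketch of this odd-one-out baseline is shown below, using toy 2-d embeddings in place of the learned 150-d ones. The function names are hypothetical, and the scene embedding is assumed to be the sum over instances.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def context_similarities(categories, object_embeddings):
    """Similarity of each instance to the scene with that instance removed."""
    dim = len(object_embeddings[0])
    scene = [sum(object_embeddings[c][k] for c in categories) for k in range(dim)]
    sims = []
    for c in categories:
        rest = [s - e for s, e in zip(scene, object_embeddings[c])]
        sims.append(cosine(rest, object_embeddings[c]))
    return sims


def anomalous_by_threshold(sims, threshold=0.8):
    """Instances whose similarity to the rest of the scene is below T."""
    return [i for i, s in enumerate(sims) if s < threshold]


def anomalous_top_k(sims, k=4):
    """The K instances that are least similar to the rest of the scene."""
    return sorted(range(len(sims)), key=lambda i: sims[i])[:k]


embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
categories = [0, 1, 0, 2]  # the last instance does not fit the others
sims = context_similarities(categories, embeddings)
flagged = anomalous_by_threshold(sims)
```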
Model. tab:model1 compares the performance of our model with the baselines described above. We observe that the baseline based on priors performs better than anomaly detection. This is perhaps not surprising because the prior-based baseline is supervised in the sense that it relies on statistics from the training dataset of which objects tend to get replaced. On the other hand, anomaly detection is completely unsupervised since it only captures the context of objects in normal scenes. Our approach performs better than the baseline approaches in identifying objects that contribute to humor.
On average, we observe that our model replaces 3.67 objects for a given image as compared to an average of 2.54 objects replaced in the ground truth. This bias to replace more objects ensures that a given scene becomes significantly less funny than the original scene. We observe that the model learns that in general, animate objects like humans and animals are potentially stronger sources of humor compared to inanimate objects. It is interesting to note that the model also learns fine-grained detail, e.g., to replace older people playing outdoors (which may be considered funny) with younger people (fig:funny2unfunnyr, top row).
5.2.2 Making a Scene Unfunny
Given that an object is predicted to be replaced in the scene, the model has to also predict a suitable replacer object. In this section, we discuss the performance of the model in predicting these replacer objects. This model is trained and evaluated using ground truth annotations of objects that are replaced by humans in a scene. This helps us isolate performance between predicting which objects to replace and predicting suitable replacers.
In order to evaluate the performance of the model on the task of replacing funny objects in the scene to make it unfunny, we use the top-5 metric (similar to ImageNet), i.e., if any of our 5 most confident predictions match the ground truth, we consider that as a correct prediction.
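The top-5 metric can be sketched as below. The function name and toy scores are illustrative; `k` is left as a parameter so the same code covers the top-5 setting used here.

```python
def top_k_accuracy(scores_per_example, labels, k=5):
    """Fraction of examples whose true label is among the k highest scores."""
    hits = 0
    for scores, label in zip(scores_per_example, labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Toy scores over a 3-category vocabulary for two instances.
scores = [[0.1, 0.5, 0.4], [0.7, 0.2, 0.1]]
labels = [2, 0]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```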
Priors. Every object is replaced by one of its 5 most frequent replacers in the training set.
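A sketch of this prior-based replacer baseline, assuming the training data is available as (original category, replacer category) pairs; the names and toy pairs are hypothetical.

```python
from collections import Counter, defaultdict


def frequent_replacers(train_pairs, k=5):
    """For each original category, its k most frequent replacers in training."""
    counts = defaultdict(Counter)
    for original, replacer in train_pairs:
        counts[original][replacer] += 1
    return {
        original: [replacer for replacer, _ in counter.most_common(k)]
        for original, counter in counts.items()
    }


pairs = [("rat", "cat"), ("rat", "cat"), ("rat", "dog"), ("raccoon", "boy")]
top_replacers = frequent_replacers(pairs)
```

A prediction counts as correct under the top-5 metric if the replacer chosen by the worker appears in the list for that object's category.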
Anomaly Detection. We subtract the embedding of the object that is to be replaced from the scene embedding. The 5 objects from the clipart vocabulary that are most similar (in the embedding space) to this resultant scene embedding are the ones that contextually fit in.
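The replacer variant of the anomaly-detection baseline can be sketched as follows, with toy 2-d embeddings. Cosine similarity is assumed as the similarity measure, and excluding the replaced object's own category from the candidates is an assumption made here, not something the text states.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def contextual_replacers(categories, replaced_index, object_embeddings, k=5):
    """Categories most similar to the scene once the replaced object is removed."""
    dim = len(object_embeddings[0])
    scene = [sum(object_embeddings[c][d] for c in categories) for d in range(dim)]
    removed = categories[replaced_index]
    context = [s - e for s, e in zip(scene, object_embeddings[removed])]
    candidates = [c for c in range(len(object_embeddings)) if c != removed]
    candidates.sort(key=lambda c: cosine(context, object_embeddings[c]), reverse=True)
    return candidates[:k]


embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
categories = [0, 0, 2]  # the third instance is the one to be replaced
suggestions = contextual_replacers(categories, 2, embeddings, k=2)
```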
|Method|Avg. Cl. Acc.|Acc.|
|---|---|---|
|Priors (do not replace)|50%|79.86%|
|Priors (object’s tendency to be replaced)|73.13%|71.5%|
|Anomaly detection (threshold distance)|62.16%|58.30%|
|Anomaly detection (top-K objects)|63.01%|64.31%|
Model. We observe that the performance trend in tab:model2 is similar to that observed in the previous section (subsec:predicting_objs_replaced), i.e., our model performs better than priors, which performs better than anomaly detection. By qualitative inspection, we find that our top prediction is intelligent, but lazy. It eliminates humor in most scenes by choosing to replace objects contributing to humor with other objects that blend well into the background. By relegating an object to the background, it is rendered inactive and hence cannot contribute to humor in the scene. E.g., the top prediction is frequently plant in indoor scenes and butterfly in outdoor scenes. The 2nd prediction is both intelligent and creative. It effectively reduces humor while also ensuring diversity of replacer objects. Subsequent predictions from the model tend to be less meaningful. Qualitatively, we find the 2nd most confident prediction to be the best compromise.
|Method|Top-5 Acc.|
|---|---|
|Priors (top 5 GT replacers)|24.53%|
|Anomaly detection (object that fits into scene)|7.69%|
Full pipeline. fig:funny2unfunnyr shows qualitative results from our full pipeline (predicting objects to replace and predicting their replacers) using the 2nd predictions made by our model.
5.2.3 Making a Scene Funny
We train our full pipeline model used in subsubsec:task2_funny_unfunny on scenes from the FOR dataset to perform the task of altering an unfunny scene to make it funny. Some qualitative results are shown in fig:unfunny2funny.
5.3 Human Evaluation
We conducted two human studies to evaluate our full pipeline:
Absolute: We ask 10 workers to rate the funniness of the scene predicted by our model on a scale of 1-5. We then compare this with the Fi of the input funny scene.
Relative: We show 5 workers the input scene and the predicted scene (in random order) and ask them to indicate which scene is funnier.
Funny to unfunny.
As expected, the output scenes from our model are less funny than the input funny scenes on average. The average Fi of the input funny test scenes is 2.69. This is 1.05 points higher than the output unfunny scenes whose average Fi is 1.64. Unsurprisingly, in relative evaluation, workers find our output scenes to be less funny than the input funny scenes 95% of the time.
Unfunny to funny. During absolute evaluation, we find that the average Fi of scenes made funny by our model is 2.14. This is a relatively high score, considering that the average Fi score of the corresponding originally funny scenes that were created by workers is 2.69. Interestingly, the relative evaluation can be perceived as a Turing test of sorts, where we show workers the model’s output funny scene and the original funny scene created by workers. 28% of the time, workers picked the model’s scenes to be funnier.
Humor is a subtle and complex human behavior. It has many forms ranging from slapstick, which has a simple physical nature, to satire, which is nuanced and requires an understanding of social context. Understanding the entire spectrum of humor is a challenging task. It demands perception of fine-grained differences between seemingly similar scenarios. E.g., a teenager falling off his skateboard (such as in America’s Funniest Home Videos, www.afv.com) could be considered funny but an old person falling down the stairs is typically horrifying. Due to these challenges some people even consider computational humor to be an AI-complete problem [6, 22].
While understanding fine-grained semantics is important, it is interesting to note that there exists a qualitative difference in the way humor is perceived in abstract and real scenes. Since abstract scenes are not photorealistic, they afford us suspension of reality. Unlike real images, the content depicted in an abstract scene is benign. Thus, people are likely to find the depiction funnier. In our everyday lives, we come across a significant amount of humorous content in the form of comics and cartoons to which our computational models of humor are directly applicable. They can also be applied to learn semantics that can extend to photorealistic images, as demonstrated by Antol et al.
Recognizing funniness involves violation of our mental model of how the world ought to be. In verbal humor, the first few lines of the joke (set-up) build up the world model and the last line (punch line) goes against it. It is unclear what forms our mental model when we look at images. Is it our priors about the world around us formed from our past experiences? Is it because we attend to different regions of the image when we look at it and gradually build an expectation of what to see in the rest of the image? These are some interesting questions regarding visual humor that remain unanswered.
In this work, we take a step towards understanding and predicting visual humor. We collect two datasets of abstract scenes which enable the study of humor at different levels of granularity. We train a model to predict the funniness score of a given scene. We also explore the different sources of humor depicted in the funny scenes via human studies. We train models using incongruity-based humor to alter a scene’s funniness. The models learn that, in general, animate objects like humans and animals contribute more to humor than inanimate objects. Our model outperforms a strong anomaly detection baseline, demonstrating that detecting humor involves something more than just anomaly detection. In human studies of the task of making an originally funny scene unfunny, humans find our model’s output to be less funny 95% of the time. In the task of making a normal scene funny, our evaluation can be interpreted as a Turing test of sorts. Scenes made funny by our model were found to be funnier 28% of the time when compared with the original funny scenes created by workers. Note that a model whose scenes are as funny as those created by humans would be picked 50% of the time. We hope that addressing the problem of studying visual humor using abstract scenes, together with the two datasets that are made public, will stimulate further research in this new direction.
Acknowledgements. We thank the anonymous reviewers for their valuable comments and suggestions. This work was supported in part by the Paul G. Allen Family Foundation via an award to D.P. DB was partially supported by the National Science Foundation CAREER award, the Army Research Office YIP award, and an Office of Naval Research grant N00014-14-1-0679. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government or any sponsor. We thank Xinlei Chen for his work on earlier versions of the clipart interface.
Overview Of Appendix
In the following appendix we provide:
Inter-human agreement on funniness ratings in the Abstract Visual Humor (AVH) dataset.
Details of the model architecture used to learn object embeddings and visualizations of its embeddings.
A sample of objects from the abstract scenes vocabulary.
Examples of scenes from our datasets.
Analysis of occurrences of different object types in scenes from our datasets.
The user interfaces used to collect scenes for the AVH and Funny Object Replaced (FOR) datasets.
Appendix I: Inter-human Agreement
In this section, we describe our experiment to determine inter-human agreement in funniness ratings of scenes. The Abstract Visual Humor (AVH) dataset contains 3,028 funny scenes and 3,372 unfunny scenes that were created by Amazon Mechanical Turk (AMT) workers. The funniness of each scene in the dataset is rated by 10 different workers on a scale of 1-5. We define the funniness score of a scene as the average of all ratings for that scene. In this section, we investigate the extent to which people agree regarding the funniness of a scene.
Perception of an image differs from one person to another. Moran et al. treat humor appreciation by people as a personality characteristic. We investigate to what extent people agree on how funny each scene in our dataset is. We split the votes we received for each scene into two groups, keeping each individual worker’s ratings in the same group to the extent possible. We compute the funniness score of each scene across workers in each group. We then compute Pearson’s correlation between the scores of the two groups. fig:interhumanAg shows a plot of Pearson’s correlation (y-axis) vs. the number of workers (x-axis). We can see that inter-human agreement increases as we increase the number of workers in a group and that the trend is gradually saturating. This indicates that ratings from 10 workers are sufficient to compute a reliable funniness score.
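The split-half procedure described above can be sketched as follows. This is a minimal illustration rather than the code used for the experiment: the function names `pearson` and `split_half_agreement` are hypothetical, and for simplicity the sketch assigns ratings to the two groups at random per scene instead of keeping each worker's ratings in the same group.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def split_half_agreement(ratings, group_size, seed=0):
    """Correlate funniness scores computed from two disjoint groups of ratings.

    ratings: one list of worker ratings (1-5) per scene.
    group_size: number of ratings per group (at most half the ratings per scene).
    Assumption: ratings are assigned to groups at random for each scene.
    """
    rng = random.Random(seed)
    scores_a, scores_b = [], []
    for scene_ratings in ratings:
        picked = rng.sample(scene_ratings, 2 * group_size)
        scores_a.append(mean(picked[:group_size]))   # funniness score, group A
        scores_b.append(mean(picked[group_size:]))   # funniness score, group B
    return pearson(scores_a, scores_b)
```

Calling `split_half_agreement` with `group_size` ranging from 1 to 5 would give the kind of agreement-versus-group-size curve that the plot describes.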
We observed that the standard deviation among ratings from 10 different workers is 1.09 for funny scenes and 0.73 for unfunny scenes. I.e., people agree more on scenes that are clearly not funny than on ones that are funny, matching our intuition that humor is subjective, while the lack thereof is not.
Appendix II: Object Embeddings
In this section, we describe our model that learns embeddings for clipart objects and present visualizations of these embeddings. We learn distributed representations for each object category in the abstract scenes vocabulary using a word2vec-style continuous Bag-of-Words model. During training, subsets of 6 objects are sampled from all of the objects present in a scene and the model tries to predict one of the objects, given the other 5. Each object is assigned a 150-d vector, which is randomly initialized. The vectors corresponding to the 5 context objects are projected to an embedding space via a single layer whose parameters are shared between the 5 objects. This (randomly initialized) layer consists of 150 hidden units without a non-linearity after it. The sum of these 5 object projections is used to compute a softmax over the 150 classes in the object vocabulary. Using the correct label (i.e., the object category of the 6th object), the cross-entropy loss is computed and backpropagated to learn all network parameters. The model is trained using Stochastic Gradient Descent with a base learning rate of 0.0001 and a momentum update of 0.9. The learning rate was reduced by a factor of two after each epoch. A diagram of the model can be seen in fig:embed_model.
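To make the architecture concrete, the forward pass and loss for a single training sample might look like the sketch below. It is written in plain Python for readability and is not the implementation used in the paper. The helper names (`init_matrix`, `matvec`, `cbow_loss`, `learning_rate`), the initialization range, and the separate output weight matrix `out` that maps the summed projection to class scores are all assumptions. The dimensions and the learning-rate schedule follow the text, and backpropagation with momentum is omitted.

```python
import math
import random

VOCAB = 150  # number of object categories (from the text)
DIM = 150    # embedding size and number of hidden units (from the text)


def init_matrix(rows, cols, rng, scale=0.01):
    """Random initialization; the uniform range is an assumption."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cbow_loss(embed, proj, out, context_ids, target_id):
    """Cross-entropy loss for predicting the target object from context objects."""
    summed = [0.0] * len(proj)
    for obj in context_ids:
        # The same linear projection (no non-linearity) is shared by all context objects.
        hidden = matvec(proj, embed[obj])
        summed = [s + h for s, h in zip(summed, hidden)]
    logits = matvec(out, summed)  # one score per object category
    # Numerically stable log-softmax.
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_z - logits[target_id]


def learning_rate(epoch, base=0.0001):
    """Base rate from the text, halved after each epoch."""
    return base * 0.5 ** epoch
```

For example, with `embed`, `proj`, and `out` created by `init_matrix`, the call `cbow_loss(embed, proj, out, [1, 2, 3, 4, 5], 6)` gives the loss for predicting object 6 from five context objects.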
The context provided by the 5 objects ensures that the representations learnt reflect the relationships between objects. I.e., objects that are semantically related tend to have similar representations. We learn the normal embeddings (i.e., the object embedding instance-level features from the main paper) from 11K scenes collected by Antol et al. As these scenes were not intended to be humorous, the relationships captured in the embeddings are the ones that occur naturally in the abstract scenes world.
fig:embeddings (left) is a t-SNE  visualization of the normal embeddings for the 75 most frequent objects in unfunny scenes. In fig:embeddings (right), we also visualize humor embeddings, which were not used as features but provide us with insights. These are learnt from the 3,028 funny scenes in the AVH dataset.
We observe that the normal embeddings encode a notion of which object categories occur in similar contexts. We also observe that closely placed objects in the normal embedding space have semantically similar meanings. For instance, humans are clustered together around coordinates (10, -7). Interestingly, dog and puppy (coordinates (10, -5)) are placed together and furniture like chair, bookshelf, armchair, etc. are placed together (coordinates (10, 5)). This follows from the distributional hypothesis, which states that words which occur in similar contexts tend to have similar meanings [16, 21].
In contrast, in the humor embeddings, visualized in fig:embeddings (right), we see that objects that are close in the embedding space may be semantically very different. For instance, dog and wine glass are placed together at coordinates (0, 0). These are placed far apart (at opposite ends) in the normal embedding. However, in the humor embedding, these two categories are extremely close to each other, even closer than semantically similar categories like two breeds of dogs. We hypothesize that this is because our dataset contains funny scenes consisting of dogs with wine glasses, e.g., fig:dogDinner. It is interesting to note that background objects that do not contribute to humor in a scene are also placed together. For example, chair, couch, and window are placed together in the humor embedding as well (coordinates (4, 5)).
The understanding of semantically similar object categories that can occur in a context, represented by the normal embeddings, can be interpreted as a person’s mental model of the world. The humor embeddings capture deviations or incongruities from this normal view that might cause humor.
Appendix III: Abstract Scenes Vocabulary
The abstract scenes interface developed by Antol et al.  consists of 20 deformable humans, 31 animals in different poses, and about 100 objects that can be found in indoor scenes (e.g., couch, picture, doll, door, window, plant, fireplace) or outdoor scenes (e.g., tree, pond, sun, clouds, bench, bike, campfire, grill, skateboard). In addition to the 8 different expressions available for humans, the ability to vary the pose of a human at a fine-grained level enables these abstract scenes to effectively capture the semantics of a scene. The large clipart vocabulary (of which only a fraction is shown to a worker during creation of a scene) ensures diversity in the scenes being depicted. A subset of objects from our Abstract Scenes vocabulary is shown in fig:abstractVocab.
Appendix IV: Example Scenes
In this section, we present examples of scenes that were created using the abstract scenes interface. fig:exampleScenesAVH depicts a spectrum of scenes from the AVH dataset in ascending order of funniness score. These scenes were created by AMT workers using the interface presented in fig:createFunny.
fig:exampleScenesFOR shows originally funny scenes (left) and their unfunny counterparts (right) from the FOR dataset. AMT workers created the counterparts by replacing as few objects as possible in the originally funny scene so that the resulting scene is not funny anymore. A screenshot of the interface that was used to create the unfunny counterparts is shown in fig:replaceObjects.
Appendix V: Object Type Occurrences
In this section, we first analyze the occurrence of each object type in funny and unfunny scenes. We then analyze the most commonly cooccurring object types in funny scenes as compared to unfunny scenes.
Distribution of Object Types. We analyze the distribution of object types in funny and unfunny scenes across all scenes in our dataset. We compute the frequency of appearance of each object type in funny and unfunny scenes. We use this to compute the probability of a scene being funny, given that an object is present in the scene, which is shown in blue in fig:probFunny. Since we have more unfunny scenes than funny scenes, we use normalized counts.
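One plausible reading of the normalized-count computation is sketched below. The exact normalization is not spelled out in the text, so the formula here is an assumption: each object's count is divided by the number of scenes in its class before forming the ratio. The function name `prob_funny_given_object` is hypothetical.

```python
from collections import Counter


def prob_funny_given_object(funny_scenes, unfunny_scenes):
    """Estimate P(funny | object present) for every object type.

    Each scene is a collection of object-type names. Counts are normalized
    by the number of scenes in each class (an assumed normalization) so that
    the larger unfunny set does not dominate the estimate.
    """
    funny_counts = Counter(obj for scene in funny_scenes for obj in set(scene))
    unfunny_counts = Counter(obj for scene in unfunny_scenes for obj in set(scene))
    probs = {}
    for obj in funny_counts.keys() | unfunny_counts.keys():
        f = funny_counts[obj] / len(funny_scenes)      # fraction of funny scenes with obj
        u = unfunny_counts[obj] / len(unfunny_scenes)  # fraction of unfunny scenes with obj
        probs[obj] = f / (f + u)
    return probs
```

Under this normalization, an object that appears in the same fraction of funny and unfunny scenes gets a probability of 0.5, regardless of how many scenes each class contains.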
We observe that the humans that appear most often in funny scenes are elderly people. This is probably because a number of scenes in our dataset depict old men behaving unexpectedly (e.g., dancing or playing in the park, as shown in fig:oldPeopleFun), which is funny.
Interestingly, we also observe that in general, animals appear more frequently in funny scenes. Animals like mouse, rat, raccoon and bee appear in funny scenes significantly more than they do in unfunny scenes. Other objects having a strong bias towards appearing in funny scenes include wine bottle, pen, scissors, tape, game and beehive. Thus, we see that certain object types have a tendency to appear in funny scenes. A possible reason for this is that these objects are involved in funny interactions, or are intrinsically funny, and hence contribute to humor in these scenes.
Funny Cooccurrence Matrix. We populate two object cooccurrence matrices, F and U, corresponding to funny scenes and unfunny scenes, respectively. Each element in F and U corresponds to the count of the cooccurrence of a pair of objects across all funny and unfunny scenes, respectively. To enable the study of the types of cooccurrences that contribute to humor, we compute the probability of a scene being funny, given that a pair of objects cooccur in the scene. This is shown in fig:coOccur for the 100 most probable combinations that exist in a funny scene. Please note that repeated entries for an object type (e.g., dog) correspond to slightly different versions (e.g., breeds) of the same object type. An interesting set of object pairs that are present in funny scenes are rat appearing alongside kitten, cat, stool, and dog. Another interesting set of combinations is raccoon cooccurring with bee, hamburger, basket, and wine glass. We observe that this matrix captures interesting and unusual combinations of objects that appear together frequently in funny scenes.
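The cooccurrence analysis can be illustrated with the sketch below. The exact expression for the probability is not recoverable from the text, so the ratio F / (F + U) on raw pair counts is an assumption, as are the function names `cooccurrence_counts`, `prob_funny_given_pair`, and `top_pairs`.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(scenes):
    """Count how often each unordered pair of object types cooccurs in a scene."""
    counts = Counter()
    for scene in scenes:
        for pair in combinations(sorted(set(scene)), 2):
            counts[pair] += 1
    return counts


def prob_funny_given_pair(funny_scenes, unfunny_scenes):
    """Assumed estimate: F[pair] / (F[pair] + U[pair]) for each observed pair."""
    F = cooccurrence_counts(funny_scenes)
    U = cooccurrence_counts(unfunny_scenes)
    return {pair: F[pair] / (F[pair] + U[pair]) for pair in F.keys() | U.keys()}


def top_pairs(probs, k=100):
    """Return the k pairs most indicative of a funny scene (ties broken by name)."""
    return sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

With raw counts, rare pairs that occur only in funny scenes receive a probability of 1, so in practice a minimum-count threshold or smoothing may be needed before ranking.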
Appendix VI: User Interfaces
In this section, we present the user interfaces that were used to collect data from AMT. fig:createFunny shows a screenshot of the user interface that we used to collect funny scenes. Objects in the clipart library (on the right in the screenshot) can be dragged on to any part of the empty canvas shown in the figure. The pose, flip (i.e., lateral orientation), and size of all objects can be changed once they are placed in the scene. In the case of humans, one of 8 expressions must be chosen (initially humans have blank faces) and fine-grained pose adjustments are required.
fig:replaceObjects shows the interface that we used to collect object-replaced scenes for our FOR dataset. We showed workers an originally funny scene and asked them to replace objects in that scene so that the scene is not funny anymore. On clicking an object in the original scene, the object gets highlighted in green. A replacer object can then be chosen from the clipart library (displayed on the right in the screenshot). Objects that are replaced in the original scene show up in the empty canvas below. At any point, to undo a replacement, a user can click on the object in the canvas below and the corresponding object will be placed at its original position in the scene. The interface does not allow for the movement or the removal of objects.
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In ICCV, 2015.
-  S. Antol, C. L. Zitnick, and D. Parikh. Zero-Shot Learning via Visual Abstraction. In European Conference on Computer Vision, ECCV, 2014.
-  Aristotle and R. McKeon. The Basic Works of Aristotle. Modern Library, 2001.
-  S. Attardo. Linguistic theories of humor. Walter de Gruyter, 1994.
-  Bharata-Muni and M. Ghosh. Natya shastra (with English translations). 1951.
-  K. Binsted, B. Bergen, D. O’Mara, S. Coulson, A. Nijholt, O. Stock, C. Strapparava, G. Ritchie, R. Manurung, and H. Pain. Computational humor. IEEE Intelligent Systems, 2006.
-  K. Binsted and G. Ritchie. Computational rules for generating punning riddles. Humor: International Journal of Humor Research, 1997.
-  E. R. Bressler, R. A. Martin, and S. Balshine. Production and appreciation of humor as sexually selected traits. Evolution and Human Behavior, 2006.
-  M. Buijzen and P. M. Valkenburg. Developing a typology of humor in audiovisual media. Media Psychology, 2004.
-  D. M. Buss. The evolution of human intrasexual competition: Tactics of mate attraction. Journal of Personality and Social Psychology, 1988.
-  D. Davidov, O. Tsur, and A. Rappoport. Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. In Conference on Computational Natural Language Learning, 2010.
-  L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.
-  A. Deza and D. Parikh. Understanding image virality. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2015.
-  R. L. Duran. Communicative adaptability: A measure of social communicative competence. Communication Quarterly, 1983.
-  R. L. Duran. Communicative adaptability: A review of conceptualization and measurement. Communication Quarterly, 1992.
-  J. R. Firth. A synopsis of linguistic theory. Blackwell, 1957.
-  D. F. Fouhey and C. L. Zitnick. Predicting object dynamics in scenes. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2014.
-  S. Freud. The Joke and Its Relation to the Unconscious. Penguin, 2003.
-  J. D. Goodchilds, J. Goldstein, and P. McGhee. On being witty: Causes, correlates, and consequences. The Psychology of Humor: Theoretical Perspectives and Empirical Issues, 1972.
-  M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool. The interestingness of images. In Proceedings of the IEEE International Conference on Computer Vision, 2013.
-  Z. S. Harris. Distributional structure. Word, 10(2-3):146–162, 1954. Reprinted in J. A. Fodor and J. J. Katz (eds.), Readings in the Philosophy of Language.
-  M. M. Hurley, D. C. Dennett, and R. B. Adams. Inside jokes: Using humor to reverse-engineer the mind. MIT Press, 2011.
-  P. Isola, D. Parikh, A. Torralba, and A. Oliva. Understanding the intrinsic memorability of images. In NIPS, 2011.
-  A. Khosla, A. Das Sarma, and R. Hamid. What makes an image popular? In International Conference on World Wide Web, 2014.
-  S. Kottur, R. Vedantam, J. M. Moura, and D. Parikh. Visual word2vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes. 2015.
-  X. Lin and D. Parikh. Don’t just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2015.
-  A. Mahapatra and J. Srivastava. Incongruity versus incongruity resolution. In Proceedings of the 2013 International Conference on Social Computing, 2013.
-  R. A. Martin and N. A. Kuiper. Daily occurrence of laughter: Relationships with age, gender, and Type A personality. Humor, 1999.
-  P. E. McGhee. Chapter 5: The contribution of humor to children’s social development. Journal of Children in Contemporary Society, 1989.
-  A. P. McGraw and C. Warren. Benign violations making immoral behavior funny. Psychological Science, 2010.
-  R. Mihalcea. The multidisciplinary facets of research on humour. In International Workshop on Fuzzy Logic and Applications, 2007.
-  R. Mihalcea and S. Pulman. Characterizing humour: An exploration of features in humorous texts. Computational Linguistics and Intelligent Text Processing, 2007.
-  R. Mihalcea and C. Strapparava. Making computers laugh: Investigations in automatic humor recognition. In EMNLP, 2005.
-  T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 2013.
-  D. Mobbs, M. D. Greicius, E. Abdel-Azim, V. Menon, and A. L. Reiss. Humor modulates the mesolimbic reward centers. Neuron, 2003.
-  J. M. Moran, M. Rain, E. Page-Gould, and R. A. Mar. Do I amuse you? Asymmetric predictors for humor appreciation and humor production. Journal of Research in Personality, 2014.
-  M. P. Mulder and A. Nijholt. Humor Research: State of the Art. University of Twente, Centre for Telematics and Information Technology, 2002.
-  B. I. Murstein and R. G. Brust. Humor and interpersonal attraction. Journal of Personality Assessment, 1985.
-  S. Petrovic and D. Matthews. Unsupervised joke generation from big data. In ACL, 2013.
-  Plato, E. Hamilton, and H. Cairns. The Collected Dialogues of Plato, Including the Letters. Pantheon Books, 1961.
-  B. Plester. Healthy humour: Using humour to cope at work. New Zealand Journal of Social Sciences Online, 2009.
-  D. Radev, A. Stent, J. Tetreault, A. Pappu, A. Iliakopoulou, A. Chanfreau, P. de Juan, J. Vallmitjana, A. Jaimes, and R. Jha. Humor in collective discourse: Unsupervised funniness detection in the new yorker cartoon caption contest. arXiv preprint arXiv:1506.08126, 2015.
-  P. Rinck. Magnetic Resonance in Medicine: The Basic Textbook of the European Magnetic Resonance Forum. 8th edition, 2014.
-  W. Ruch, S. Attardo, and V. Raskin. Toward an empirical verification of the general theory of verbal humor. Humor: International Journal of Humor Research, 1993.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.
-  P. Salovey, A. J. Rothman, J. B. Detweiler, and W. T. Steward. Emotional states and physical health. American Psychologist, 2000.
-  A. Salvatore and V. Raskin. Script theory revisited: Joke similarity and joke representation model. Humor: International Journal of Humor Research, 1991.
-  D. Shahaf, E. Horvitz, and R. Mankoff. Inside jokes: Identifying humorous cartoon captions. In SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.
-  G. Sinicropi. La struttura della parodia— avvero: Bradamante in arli. Strumenti Critici Torino, 1981.
-  O. Stock and C. Strapparava. HAHAcronym: A computational humor system. In ACL, 2005.
-  J. M. Suls. A two-stage model for the appreciation of jokes and cartoons: An information-processing analysis. The Psychology of Humor: Theoretical Perspectives and Empirical Issues, 1972.
-  J. Taylor and L. Mazlack. Computationally recognizing wordplay in jokes. Proceedings of CogSci, 2004.
-  R. Vedantam, X. Lin, T. Batra, C. L. Zitnick, and D. Parikh. Learning common sense through visual abstraction. In ICCV, 2015.
-  W. Y. Wang and M. Wen. I can has cheezburger? A nonparanormal approach to combining textual and visual information for predicting and generating popular meme descriptions. In NAACL, 2015.
-  M. B. Wanzer, M. Booth-Butterfield, and S. Booth-Butterfield. Are funny people popular? An examination of humor orientation, loneliness, and social attraction. Communication Quarterly, 1996.
-  K. K. Watson, B. J. Matthews, and J. M. Allman. Brain activation during sight gags and language-dependent humor. Cerebral Cortex, 2007.
-  Wikipedia. Humor, November 2015.
-  Wikipedia. Theories of humor, April 2016.
-  D. Yang, A. Lavie, C. Dyer, and E. Hovy. Humor recognition and humor anchor extraction. 2015.
-  C. L. Zitnick and D. Parikh. Bringing semantics into focus using visual abstraction. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2013.
-  C. L. Zitnick, D. Parikh, and L. Vanderwende. Learning the visual interpretation of sentences. In ICCV, 2013.
-  C. L. Zitnick, R. Vedantam, and D. Parikh. Adopting abstract images for semantic scene understanding. PAMI, 2014.