Code for reproducing our analysis in the paper titled: Image Cropping on Twitter: Fairness Metrics, their Limitations, and the Importance of Representation, Design, and Agency
Twitter uses machine learning to crop images, where crops are centered around the part predicted to be the most salient. In fall 2020, Twitter users raised concerns that the automated image cropping system on Twitter favored light-skinned over dark-skinned individuals, as well as concerns that the system favored cropping women's bodies instead of their heads. In order to address these concerns, we conduct an extensive analysis using formalized group fairness metrics. We find systematic disparities in cropping and identify contributing factors, including the fact that cropping based on the single most salient point can amplify the disparities. However, we demonstrate that formalized fairness metrics and quantitative analysis on their own are insufficient for capturing the risk of representational harm in automatic cropping. We suggest the removal of saliency-based cropping in favor of a solution that better preserves user agency. For developing a new solution that sufficiently addresses concerns related to representational harm, our critique motivates a combination of quantitative and qualitative methods that include human-centered design.
Automated image cropping (or smart image cropping) (adobesmartcrop) is the task of cropping an image, given a viewport dimension or aspect ratio (width/height), so that the image fits the viewport or aspect ratio while ensuring that its most relevant or interesting parts are within the viewport. The idea of automated image cropping (chen2016automatic; chen2017quantitative) has been present in the industry since at least 1997 (bollman1999automatic).
Image cropping has had applications in several fields. It has a rich history in cinematography and broadcasting, where film footage's aspect ratio is changed to match the display aspect ratio. Cropping an image to show its most interesting part is useful and often done manually in artistic settings. In recent website design, image cropping is used to develop responsive images that do not distort the aesthetics of the website or app (twittersaliencyblog); a pre-defined image on a website or app is resized to fit devices with different dimensions (mozilla_image).
The modern automation of the cropping process is useful for several reasons. First, automation significantly reduces human burden when the throughput of images to be processed is very high. For example, automated image cropping is used by many platforms, such as Facebook, Instagram, Twitter, and Adobe, to show user-submitted images, which are very large in size, in conjunction with various degrees of user control (instagramcrop; adobesmartcrop; twittersaliencyblog; facebookvideocrop). Second, many modern platforms operate on multiple types of devices requiring varying aspect ratios, increasing the number of crops needed even for a single image; again, automation reduces the human burden of the multiple crops required. In addition, automated image cropping can help individuals identify important content in an image faster than with no crop (Bongwon2003), and can produce thumbnails more desirable to users than shrinking the whole image to fit the viewport (Xie2006BrowsingLP).
However, automating image cropping can also result in errors and other undesirable outcomes. Often, automation is implemented by machine learning (ML) systems that are designed to ensure that the average error is low, but do not account for the disparate impact of cases where the error is not uniformly distributed across demographic groups (buolamwini2018; objectrecbias). Additionally, there have been many calls for a more critical evaluation of the underlying values and assumptions embedded in machine learning and human-computer interaction (HCI) systems (dotan2019value; bardzell; borning2012next; friedman2002value; friedman2008value; van2011feminist).
Our work focuses on one automated image cropping system, namely Twitter's saliency-based image cropping model, which automatically crops images that users submit on Twitter to show image previews at different aspect ratios across multiple devices. The system employs a supervised machine learning model, trained on existing saliency data, to predict saliency scores over any given input image. Saliency scores are meant to capture the "importance" of each region of the image. Given the saliency scores, the model selects the crop by trying to center it around the most salient point, with some shifting to stay within the original image dimensions as needed. See Figure 1 for an example. While the Twitter platform enjoys several benefits of automated image cropping as mentioned earlier, several concerns exist around its usage. We summarize the concerns as follows:
Unequal treatment of different demographics. In September 2020, a series of public Tweets claimed that Twitter image cropping crops to favor lighter-skinned individuals when lighter- and darker-skinned individuals are presented in one image (Tweets 1307115534383710208, 1307427332597059584, 1307440596668182528; see Figure 1 for an example of such an image), spurring further online responses (twitterresponse; Tweet 1307777142034374657), news (guardiannews), and articles (davidarticle). However, as one user noted (Tweet 1307427207489310721), one limitation of people posting individual Tweets to test cropping behavior is that a single specific example is not enough to conclude there is systematic disparate impact. In order to address this, a Twitter user performed an independent experiment (vinayarticle) of 92 trials comparing lighter- and darker-skinned individuals using the Chicago Faces Dataset (chicagoface). Because of the small size of the experiment, concerns can remain today about whether the Twitter model systematically favors lighter-skinned individuals over darker-skinned ones, or more generally favors certain demographics, such as men over women.
Male gaze. Another concern arises when cropping emphasizes a woman's body instead of her head (Tweet 1010288281769082881). Such mistakes can be interpreted as an algorithmic version of the male gaze, a term used for the pervasive depiction of women as sexual objects for the pleasure of, and from the perspective of, heterosexual men (mulvey1989visual; korsmeyer2004feminist).
Lack of user agency. The automation of cropping does not include user agency in dictating how images should be cropped. This has the potential to cause representational harm to the user or the person present in the photo, since the resulting crop may change the interpretation of a user’s post in a way that the user did not intend.
The questions we try to answer in this work are motivated from the above concerns, and can be summarized as follows:
To what extent, if any, does Twitter's image cropping have disparate impact, i.e., does it systematically favor cropping to people along racial or gender lines?
What are some of the factors that may cause systematic disparate impact of the Twitter image cropping model?
What are other considerations (besides systematic disparate impacts) in designing image cropping on platforms such as Twitter? What are the limitations of using quantitative analysis and formalized fairness metrics for surfacing potential harms against marginalized communities? Can they sufficiently address concerns of representational harm, such as male gaze?
What are some of the alternatives to saliency-based image cropping that provide good user experience and are socially responsible? How do we evaluate such alternatives in a way that minimizes potential harms?
Our contribution can be summarized as follows, following the research questions in their order:
We provide a quantitative fairness analysis showing the level of disparate impact of the Twitter model over racial and gender subgroups. (We note the limitations of using race and gender labels, including that those labels can be too limiting to how a person wants to be represented and do not capture the nuances of race and gender; we discuss the limitations more extensively in Section 6.) We also perform additional variants of the experiments and an experiment on images showing the full human body to gain insights into the causes of disparate impact and male gaze.
Using the results of and observations from the fairness analysis, we identify several potential contributing factors to disparate impact. Notably, selecting an output based on the single point with the highest predicted score (argmax selection) can amplify disparate impact in predictions, not only in automated image cropping but in machine learning in general.
We qualitatively evaluate automated image cropping, discussing concerns including user agency and representational harm. We argue that using formalized group fairness metrics is insufficient for surfacing concerns related to representational harm for automated image cropping.
We give alternative solutions, with a brief discussion of their pros and cons. We emphasize the importance of representation, design, user control, and a combination of quantitative and qualitative methods for assessing potential harms of automated image cropping and evaluating alternative solutions.
In Section 2, we give background and related work on Twitter's saliency-based image cropping and on fairness and representational harm in machine learning. We give the four contributions in Sections 3-5, in their respective order. We list limitations and future directions in Section 6 and conclude our work and contributions in Section 7.
Algorithmic decision making and machine learning have the potential to perpetuate and amplify racism, sexism, and societal inequality more broadly. Design decisions pose unique challenges in this regard, since design "forges both pathways and boundaries in its instrumental and cultural use" (noble). Prior frameworks integrating ethics and design include value sensitive design, which advocates for a combination of conceptual, empirical, and technical analysis to ensure social and technical questions are properly integrated (borning2012next; friedman2002value; friedman2008value). Similarly, feminist HCI methods advocate for a simultaneous commitment to moral and scientific agendas, connection to critical feminist scholarship, and co-construction of research agendas and value-laden goals (bardzell). Feminist frameworks are also useful for exposing the normative assumptions surrounding user agency and the relationship between subjects and technology (van2011feminist). Another commonly used framework is human-centered design, which aims to center human dignity, rights, needs, and creativity (buchanan2001human; gasson2003human; gill1990summary). Human-centered design arose out of a critique of user-centered approaches, which gasson2003human argues fail to question the normative assumptions of technology because of a "goal directed focus on the closure of predetermined, technical problems". costanza2018design's design justice builds on previous frameworks but emphasizes community-led design and the importance of intersectionality and participatory methods for conceptualizing user needs across different marginalized identities.
This work is meant to contribute to a broader conversation about how to promote ethical human-computer interaction within the technology industry, which will require open communication between industry, academia, government, and the wider public to solve, as well as an acknowledgment of the responsibility of companies to be held accountable for the societal effects of their products. In order to promote transparency and accountability to users, we strive to create a partnership between social media platforms and users where users interface with social media while maintaining control of their content and identities on the platform. We also recognize it is critical to attend to the perspectives and experiences of marginalized communities not only because “it empowers a comparatively powerless population to participate in processes of social control, but it is also good science, because it introduces the potential for empirically derived insights harder to acquire by other means” (bardzell).
Designing ethical ML systems presents several unique challenges. Because ML systems can pick up on spurious correlations, they can make seemingly nonsensical errors that lack common sense, and it may be hard for designers to bridge the ML and human perspectives and to anticipate problems (dove2017ux). For image processing, one key challenge in validating and communicating such problems is the lack of high-quality datasets for fairness analysis (buolamwini2018), especially for industry practitioners (andrus2021we). One key focus of ethical machine learning has been developing formal notions of fairness and quantifying algorithmic bias. A further challenge is the lack of a universal formalized notion of fairness that can be easily applied to machine learning models; rather, different fairness metrics imply different normative values and have different appropriate use cases and limitations (barocas-hardt-narayanan; narayanan2018translation). (This list of challenges is non-exhaustive; notably, there is a broad body of work on challenges related to AI ethics in industry, including ethics-washing, regulation, corporate structure and incentives, and power, not included in the discussion here.)
benjamin introduces the notion of coded exposure, which describes how people of color suffer simultaneously from under- and overexposure in technology. In her work, she demonstrates that "some technologies fail to see blackness, while other technologies render Black people hyper visible" (benjamin). This "over representation or unequal visibility" is not limited to images but extends to text as well; e.g., Mishra2020 report that names from certain demographics are less likely to be identified as person names by major text processing systems. crawford2017trouble details different types of representational harm, including stereotyping, denigration, recognition, and under-representation. Beginning with the development of color photography, the Shirley cards used by Kodak to standardize film exposure methods were taken of white women, causing photos of darker-skinned people to be underexposed and not show up clearly in photos (benjamin). More recently, buolamwini2018 demonstrated that widely used commercial facial recognition technology is significantly less accurate for dark-skinned females than for light-skinned males. Simultaneously, in the United States, facial recognition has rendered communities of color hypervisible, since it is routinely used to surveil them (benjamin; devich2020defund). Similarly, facial recognition is being used in China to surveil Uighurs as part of a broader project of Uighur internment, cultural erasure, and genocide (uighurprofile), and facial recognition tools have been specifically developed to identify Uighurs (ethnicitydetection). objectrecbias demonstrates that images from ImageNet, the dataset widely used across a range of image processing tasks, come primarily from Western and wealthy countries, which is a contributing factor for why commercial object recognition systems perform poorly on items from low-income and non-Western households.
These examples illustrate how technologies that are presented as objective, scientific, and neutral actually encode and reproduce unequal societal treatment of different groups of people, both in their development and deployment. Additionally, they underscore the highly contextual nature of representational harm (crawford2017trouble); not all exposure is positive (as in the case of surveillance technology or stereotyping, for example), and representational harm poses a unique challenge to marginalized communities who have faced repeated challenges in maintaining their privacy and in advocating for positive representations in the media. Although representational harm is difficult to formalize due to its cultural specificity, it is crucial to address since it is commonly the root of disparate impact in resource allocation (crawford2017trouble). For instance, ads on search results for names perceived as Black are more likely to yield results about arrest records, which can affect people's ability to secure a job (crawford2017trouble; sweeney2013discrimination). Although allocative harm may be easier to quantify than representational harm, it is critical that the ML ethics community works to understand and address issues of representational harm as well (crawford2017trouble). This work views the problem of automatic image cropping through the lens of representational harm, with the goal of not reproducing or reinforcing systems of subordination for marginalized communities.
Twitter's image cropping algorithm (twittersaliencyblog) relies on a machine learning model trained to predict saliency. Saliency is the extent to which humans tend to gaze at a given area of an image; it was proposed as a good proxy for identifying the most important part of an image to center a crop around, since importance is relative and difficult to quantify (saliencycrop). Locations with high predicted saliency (salient regions) include humans, objects, text, and high-contrast backgrounds (see examples in (huang2015salicon) and Figures 1, 5, and 6 for the Twitter model). The Twitter algorithm finds the most salient point, and a set of heuristics then creates a suitable crop centered around that point for a given aspect ratio. We note that other automated cropping algorithms exist that utilize the full saliency map of an image to identify the best crop, e.g., by covering the maximum number of salient points or maximizing the sum of saliency scores (DBLP:journals/corr/ChenHCTCC17).
Saliency prediction can be implemented with deep learning, a technique that has significantly increased prediction performance, though gaps remain compared to human performance, along with limitations such as a lack of robustness to noise (borji2018saliency). Twitter's saliency model builds on the deep learning architecture DeepGaze II (deepgazeii), which leverages feature maps pretrained on object recognition. In order to reduce the computational cost so that the model can be used in production, Twitter employs a combination of Fisher pruning (fisherpruning) and knowledge distillation (hinton2015distilling). The model is trained on three publicly available external datasets (salicon; MIT1003; cat2000). See fisherpruning for more precise details of the model architecture.
Cropping is conducted as follows:
1. For a given image, the image is discretized into a grid of points, and each grid point is associated with a saliency score predicted by the model.
2. The image, the coordinates of the most salient point, and a desired aspect ratio are passed as inputs to a cropping algorithm. This is repeated for each aspect ratio needed to show the image on multiple devices.
3. If the saliency map is almost symmetric horizontally (decided using a threshold on the absolute difference in values across the middle vertical axis), then a center crop is performed irrespective of the aspect ratio.
4. Otherwise, the cropping algorithm tries to ensure that the most salient point, or the focal point, is within the crop at the desired aspect ratio. This is done by cropping only one dimension (either width or height) to achieve the desired aspect ratio.
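The steps above can be sketched in code. This is a minimal illustration under stated assumptions, not Twitter's implementation: the symmetry threshold, the grid-to-pixel scaling, and the shifting logic are all assumed for the sketch.

```python
import numpy as np

def crop_around_focal_point(img_h, img_w, saliency, target_ratio, sym_threshold=0.1):
    """Sketch of the cropping heuristic: center-crop when the saliency map is
    nearly left/right symmetric, otherwise crop one dimension so the most
    salient (focal) point stays inside the crop.
    Returns (top, left, crop_h, crop_w) in pixel coordinates."""
    # Step 3: symmetry check -- compare the map against its horizontal mirror.
    asym = np.abs(saliency - saliency[:, ::-1]).max()
    # Steps 1-2: most salient grid point, scaled to pixel coordinates.
    focal_r, focal_c = np.unravel_index(np.argmax(saliency), saliency.shape)
    fy = int(focal_r * img_h / saliency.shape[0])
    fx = int(focal_c * img_w / saliency.shape[1])
    # Step 4: crop only one dimension to reach the desired aspect ratio.
    if img_w / img_h > target_ratio:
        crop_h, crop_w = img_h, int(img_h * target_ratio)   # trim width
    else:
        crop_h, crop_w = int(img_w / target_ratio), img_w   # trim height
    if asym < sym_threshold:
        # near-symmetric map: plain center crop, irrespective of focal point
        top, left = (img_h - crop_h) // 2, (img_w - crop_w) // 2
    else:
        # center on the focal point, shifting as needed to stay in the image
        top = min(max(fy - crop_h // 2, 0), img_h - crop_h)
        left = min(max(fx - crop_w // 2, 0), img_w - crop_w)
    return top, left, crop_h, crop_w
```

For a 100x200 image with a square target and the saliency peak near the right edge, the crop is shifted right so the focal point remains inside; a uniform (symmetric) map falls back to a center crop.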
In this section, we answer the first research question on whether Twitter’s image cropping has disparate impact on racial or gendered lines by performing an analysis measuring disparate impact based on the fairness notion of demographic parity. We first define demographic parity, and then describe the dataset, experiment methodology, and results.
Automatic image cropping was developed so that the most interesting and noticeable part of an image is displayed for users to see. However, in an image with multiple people, sometimes it is impossible to find a crop with the desired aspect ratio that will include all people. This puts the model in a difficult position since there is no "ideal" solution. In handling these cases, what is a fair way to determine who should be cropped in and out? One central concern is that the model should not systematically crop out any demographic group, such as dark-skinned individuals. Although many competing definitions of fairness have been proposed (barocas-hardt-narayanan; narayanan2018translation), one notion of fairness that may seem suitable for this context is demographic parity (also referred to as independence or statistical parity). The intuition behind using demographic parity is that the model should not favor representing one demographic group over another, so in cases where the model is forced to choose between two individuals, the rates at which they are cropped out should be roughly equal. For our purposes, given a set of images each depicting two people, one from group $A$ and one from group $B$, the auto-cropping model has disparate impact (or a lack of demographic parity for group $B$ with respect to group $A$) if

$$\Pr[S \in A] > \frac{1}{2} + \epsilon,$$

where $S \in A$ denotes that the most salient point (the focal point for the crop) is on the person from group $A$, the groups represent protected-group status (feldman2015; barocas-hardt-narayanan), $\epsilon$ is a slack value, and $\Pr[S \in A]$ is the probability that, when two persons, one from $A$ and one from $B$, are given, the auto-cropping model places the most salient point on the person from $A$. Note that we may swap $A$ and $B$ to ensure representation for the other class as well.
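As an illustration, this disparate-impact condition can be estimated from the outcomes of pairwise trials. The sketch below uses a normal approximation to the binomial for a two-sided 95% interval; the function and argument names (`favored_a`, `trials`) are ours, not from the paper's code.

```python
import math

def disparate_impact(favored_a: int, trials: int, epsilon: float = 0.0):
    """Estimate Pr[S in A] from pairwise trials in which group A held the
    most salient point `favored_a` times out of `trials`, and flag disparate
    impact when the estimate exceeds 1/2 + epsilon.
    Returns (estimate, 95% CI half-width, flag)."""
    p_hat = favored_a / trials
    # normal approximation to the binomial proportion
    half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat, half_width, p_hat > 0.5 + epsilon

p, hw, flag = disparate_impact(5400, 10000)  # e.g. A favored in 5400 of 10000 trials
```

With 10000 trials, as used in the experiments below, the half-width of the interval is roughly one percentage point, so deviations of a few points from 0.5 are statistically meaningful.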
We use the Wikidata Query Service (wikiquery) to create our WikiCeleb dataset (collected through this query in November 2020), consisting of images and labels of celebrities, i.e., individuals who have an ID in the Library of Congress names catalogue (the Library of Congress Name Authority File (NAF) provides authoritative data for names of persons: https://id.loc.gov/authorities/names.html) in Wikidata. Given that the ancestry and gender identity of celebrities (especially those identified by the Library of Congress Name Authority File) is generally known, using these manually curated labels is less concerning than automatically generating sensitive labels with a model, which may exacerbate disparities through label errors. If an individual has more than one image, gender, or ethnicity, we sample one uniformly at random. We filter out one offensive occupation (those whose occupation is pornographic actor, Q488111, https://www.wikidata.org/wiki/Q488111, are excluded to avoid sensitive images) and any images from before 1950 to avoid grey-scale images. We note the limitations of using race and gender labels, including that those labels can be too limiting to how a person wants to be represented and do not capture the nuances of race and gender. We discuss these limitations more extensively in Section 6.
WikiCeleb consists of images of individuals who have Wikipedia pages, obtained through the Wikidata Query Service. It contains 4073 images of individuals labeled with their gender identity (Property:P21, https://www.wikidata.org/wiki/Property:P21) and ethnic group (Property:P172, https://www.wikidata.org/wiki/Property:P172) as recorded on Wikidata. We restrict our query to only the male (Q6581097, https://www.wikidata.org/wiki/Q6581097) and female (Q6581072, https://www.wikidata.org/wiki/Q6581072) genders, as the data for other gender identities is very sparse for our query. Ethnic group annotations may be more fine-grained than racial groupings, so we map from ancestry to coarser racial groupings as defined by the US census (censusrace) (e.g., Japanese to Asian). We discard ethnicities with small (<40) samples. For the small subset (<5%) of individuals with multiple labeled ethnicities, we sample one ethnicity uniformly at random. Finally, we split the data into subgroups based on race and gender, and drop subgroups with fewer than 100 samples. This results in four subgroups: Black-Female, Black-Male, White-Female, and White-Male, of size 621, 1348, 213, and 606, respectively.
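The subgroup construction described above can be sketched as follows. The ancestry-to-census mapping shown is illustrative only (the paper's actual mapping table is not reproduced here); the thresholds (40 and 100) are from the text.

```python
from collections import Counter

# Hypothetical ancestry-to-census mapping; illustrative, not the paper's table.
CENSUS_MAP = {"Japanese": "Asian", "African American": "Black", "Irish": "White"}

def build_subgroups(records, min_ethnicity=40, min_subgroup=100):
    """Sketch of the WikiCeleb subgroup construction: drop rare ethnicities
    (<40 samples), map the rest to coarser census race groups, then drop
    race-gender subgroups with fewer than 100 samples.
    `records` is a list of (gender, ethnicity) pairs."""
    eth_counts = Counter(e for _, e in records)
    kept = [(g, CENSUS_MAP[e]) for g, e in records
            if eth_counts[e] >= min_ethnicity and e in CENSUS_MAP]
    sub_counts = Counter(kept)
    return {k: v for k, v in sub_counts.items() if v >= min_subgroup}
```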
We perform analysis on all six pairs among the four subgroups (Black-Female, Black-Male, White-Female, and White-Male). For each pair, we sample one image independently and uniformly at random from each of the two groups and attach the two images horizontally (padding with black background when the images have different heights). (We have also performed an experiment attaching two images vertically and inspected several of the saliency map predictions; we found the difference between the saliency maps of individual images under horizontal and vertical attachment to be negligible.) We run the saliency model on the attached image and determine which image the maximum saliency point lies on. The group whose image contains the maximum saliency point is favored (by the saliency model over the other group) in this sampled pair. We repeat the process 10000 times for each pair of groups and record the fraction of times (i.e., the probability) that the first group is favored.
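A single trial of this procedure can be sketched as below, with `saliency_model` standing in for the actual pruned model (the interface is a hypothetical one; images are grayscale arrays for simplicity).

```python
import numpy as np

def favored_side(img_a, img_b, saliency_model):
    """One trial of the pairwise experiment: attach two images horizontally,
    padding the shorter one with black (zeros), and report which side holds
    the maximum-saliency point. `saliency_model` maps an HxW array to an
    HxW array of saliency scores."""
    h = max(img_a.shape[0], img_b.shape[0])
    pad = lambda im: np.pad(im, ((0, h - im.shape[0]), (0, 0)))  # black padding
    attached = np.concatenate([pad(img_a), pad(img_b)], axis=1)
    scores = saliency_model(attached)
    _, col = np.unravel_index(np.argmax(scores), scores.shape)
    return "A" if col < img_a.shape[1] else "B"
```

Repeating this over sampled image pairs and averaging the outcomes yields the favoring probabilities reported below.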
For each of the six pairs of subgroups, we report the probability that the first group is favored. The results are in Figure 2, also represented differently to show racial and gender disparate impact more clearly. We observe that the model strongly favors female over male, and favors white over Black to a smaller degree. The model's gender and race preferences are stronger among the favored intersecting subgroups (white or female) than among the others (Black or male).
Figure 2: The probabilities that the model favors the first subgroup compared to the second across six comparisons. The bars and red labels are (signed) distances from 0.5, which corresponds to demographic parity. A positive sign indicates the first group is favored, and a negative sign indicates the second group is favored. The further from 0.5, the bigger the disparate impact. The error bars are two-sided 95% confidence intervals.
In addition, we perform variants of the experiments as follows.
Images in the dataset vary in size. In this experiment variant, we scale all images to a fixed height of 256 pixels while fixing their aspect ratios. The results, shown in Figure 4, are similar, indicating that scaling of images has no significant effect.
One question we may ask is whether the disparate impact in saliency prediction is intrinsic to the images themselves, or is contributed by attaching them together. In this variant, we run the saliency model on individual images first, then sample one image from each subgroup in the pair, and select the image with the higher maximum saliency score. The results are in Figure 4, showing that attaching images indeed changes disparate impact across race but not gender. In particular, the model's original race preference for white when attaching images in the male subgroup is flipped to a preference for Black when images are not attached, and in the female subgroup the preference for white is diminished.
In this section, we discuss the male gaze concern, i.e., the claim that saliency-based models are likely to pay attention to a woman's body or exposed skin. In order to study this assumption, we randomly selected 100 images per gender from the WikiCeleb dataset which 1) have a height/width ratio of at least 1.25 (this increases the likelihood of the image containing a full body), and 2) have more than one salient region (salient regions are distinctive regions of high saliency identified by segmenting the saliency map; they were identified using the regionprops algorithm in the scikit-image library, https://scikit-image.org/docs/dev/auto_examples/segmentation/plot_label.html; this increases the likelihood of the image having a salient point other than the head). We found that no more than 3 out of 100 images per gender had the crop not on the head. The crops not on heads were due to high predicted saliency scores on parts of the image such as a number on the jersey of a sports player or a badge. These patterns were consistent across genders. For example, see Figure 5, where the crop is positioned on the body of the individual, and the closeups of the salient regions confirm that the crop is focused on the jersey number while the head is still a significant salient region. Our analysis also replicates when we used smaller targeted samples (10 per gender) depicting exposed skin (e.g., sleeveless tops for males and females; bare chest for males and low-cut tops for females). Our findings again reveal high saliency on the head or on the jersey of individuals (see bottom figures in Figure 5). Finally, we spot-checked the same images with another publicly available model from the Gradio App (kroner2020contextual), which crops based on the largest salient region; here too the results replicate.
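The segmentation step can be sketched with scikit-image's `regionprops`, which the analysis used; the 0.5 relative threshold below is an assumed value for illustration, not the one from the analysis.

```python
import numpy as np
from skimage.measure import label, regionprops

def salient_regions(saliency, rel_threshold=0.5):
    """Segment a saliency map into distinct salient regions: threshold the
    map relative to its maximum, label connected components, and return the
    (centroid, peak score) of each region, sorted by peak score."""
    mask = saliency >= rel_threshold * saliency.max()
    labeled = label(mask)  # connected-component labelling
    regions = regionprops(labeled, intensity_image=saliency)
    regions.sort(key=lambda r: r.max_intensity, reverse=True)
    return [(r.centroid, r.max_intensity) for r in regions]
```

An image qualifies for the analysis above when this returns more than one region; inspecting the centroids shows whether the secondary regions fall on the head, the body, or features like jersey numbers.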
A more rigorous large-scale analysis could be conducted to ascertain the effect size, but this preliminary analysis suggests that the models we studied may not explicitly encode male gaze as part of their saliency prediction, and that the perception of male gaze might be due to the model picking up on other salient parts of the image, such as jersey numbers (or other text) and badges, which are likely to be present in the body area. Hence, such features likely confound with body-focused crops in the rare cases where they happen.
We consider the additional impact of ML models that use the argmax approach for inference when applied to social settings. Given that many machine learning models are probabilistic (i.e., they model a distribution p(y|x) over every possible output y for a given input x), they should be utilized probabilistically as well: for inference, one should sample from the output distribution p(y|x) as opposed to always using the most probable y. In a general setting, using the most probable y suffices, as this gives the least error when x is sampled from the training data distribution. However, a serious concern arises when the outputs y have social impact, e.g., choosing one person over another in a prediction. Disparate impact is compounded when these outputs enter a social system in which their reach follows a power-law distribution, and, as a consequence, the chosen y comes to appear as ground truth for the given x simply because of the deterministic inference. The agents in the social system do not see all possible realizations of the outputs y for x, and can assume false facts, or "stereotypes", of x. More research using user studies can be conducted to validate this theory. For popular images where one individual is highlighted more, we can expect a rich-get-richer phenomenon where the audience of the platform will identify the selected individual with the image (and its associated event) more than other, equally deserving individuals present in the image.
In our image cropping setting, this issue manifests whenever the image has multiple almost equally likely salient points (see Figure 6).
This in fact occurs in images with multiple individuals, where the saliency scores on different individuals usually differ only slightly. The use of argmax allows only a single point to be selected, giving the impression that all other points are not salient. On a social media platform, if multiple users upload the same image, all of them will see the same crop corresponding to the most salient point. This effect is compounded by the power-law distribution of image shares, where certain images, either controversial or from popular accounts, are re-shared across the platform disproportionately often compared to other images, thereby amplifying the effect. As we can see in Figure 6, selecting the second-best salient point shifts the crop from the bottom part of the image to the top, highlighting the argmax limitation in the presence of competing scores. Each slice of the bottom plot in Figure 6 represents a row of saliency scores in the image, and we observe nearly identical scores for the top and bottom halves. Furthermore, the inability of argmax to express more than one salient region applies even if the competing scores are exactly equal: the model can still only select one point, and as long as the tie-breaking algorithm is deterministic, the selected point will always remain the same. This limitation suggests a possible fix based on sampling from the model, or on allowing the user to select the best among several candidate salient regions. We discuss this further in Section 5.
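The contrast between deterministic argmax selection and probabilistic sampling can be sketched concretely. The following is a minimal illustration (our own sketch, not the production cropping code), assuming only a 2D array of saliency scores:

```python
import numpy as np

def argmax_focal_point(saliency):
    """Deterministic: always return the single most salient pixel."""
    return tuple(int(i) for i in np.unravel_index(np.argmax(saliency), saliency.shape))

def sampled_focal_point(saliency, rng):
    """Probabilistic: sample a pixel with probability proportional to its score."""
    p = saliency.ravel() / saliency.sum()
    idx = rng.choice(saliency.size, p=p)
    return tuple(int(i) for i in np.unravel_index(idx, saliency.shape))

# Two almost equally salient regions: argmax always picks the same one,
# while sampling surfaces both in proportion to their scores.
saliency = np.zeros((4, 4))
saliency[0, 1] = 0.51   # "top" region, slightly more salient
saliency[3, 2] = 0.49   # "bottom" region, almost as salient

rng = np.random.default_rng(0)
print(argmax_focal_point(saliency))                        # always (0, 1)
picks = [sampled_focal_point(saliency, rng) for _ in range(1000)]
top_share = sum(pick == (0, 1) for pick in picks) / len(picks)
print(round(top_share, 2))                                 # close to 0.51
```

With argmax, every re-share of the image produces the identical crop; with sampling, each of the two competing regions appears roughly in proportion to its score, avoiding the winner-take-all effect described above.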
One possible explanation of the disparate impact is that the model favors high contrast, which is more strongly present in lighter skin on dark backgrounds or darker eyes on lighter skin, and in female heads, which have higher image variability (russell2009sex) (see below). The following observations give some support for this explanation. We note that plausible explanations related to contrast or variability do not excuse disparate impact.
First, attaching the image affects the saliency prediction (Figure 4 vs. Figure 2), implying that the model does look at the overall image before assigning saliency scores. We manually observe that the majority of images in WikiCeleb have dark backgrounds, suggesting that the model's favoring of white individuals may be explained by lighter skin tones having higher contrast against the overall image. (This potential explanation is also in line with an informal experiment (vinayarticle) using white and Black individuals on a plain white background. In that case, Black individuals are favored over white, flipped as expected now that the background is white.)
Second, on several occasions the most salient point is at a person's eye. We manually observe that eye colors are mostly on the darker side, suggesting that the model's favoring of lighter skin tones results from the stronger contrast between darker eyes and the lighter skin surrounding them. (Other preliminary results that may support this explanation come from comparing whites and Blacks with Asians: Asians are favored over both Blacks and whites. That Asians are favored over whites may be due to higher contrast in Asian eyes: Asians and whites do not differ significantly in skin tone, but Asians have significantly darker eyes overall. However, we do not have a large number of samples of Asian images, and hence we omit this result from the main body of the paper.)
Third, female heads have higher variability than male heads, which may be partly attributed to make-up being more commonly used by women than men (russell2009sex). This observation is consistent with the explanation that higher contrast increases saliency scores and with the observed favoring of female over male images.
For each image in each subgroup, we run the saliency model and record its maximum and median saliency scores. We then aggregate these statistics for each subgroup and plot their distributions in Figure 7. We find that the disparate impact across gender corresponds to the distribution of maximum saliency scores for females being skewed to the right compared to males, while the median distributions for the two genders show no such skew. Thus, the higher maximum saliency in a subgroup comes from higher variability in saliency scores rather than from higher overall saliency, suggesting that higher variability in female images may lead to higher variability in saliency scores, contributing to the disparate impact in our results.
Figure 7: Empirical cumulative distribution functions (ECDFs) of maximum (left) and median (right) saliency scores across all images of Black (top) and White (bottom) individuals, separated by gender. ECDF curves further to the left correspond to distributions that take smaller values overall; the wider the gap between curves, the larger the difference in values.
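The per-image statistics and ECDFs described above can be sketched as follows. The synthetic saliency maps are illustrative only, chosen to show how higher per-image variability shifts the distribution of maxima to the right without shifting the medians:

```python
import numpy as np

def saliency_stats(saliency_maps):
    """Per-image maximum and median saliency scores."""
    maxes = np.array([m.max() for m in saliency_maps])
    medians = np.array([np.median(m) for m in saliency_maps])
    return maxes, medians

def ecdf(values):
    """Sorted values and cumulative fractions, ready for an ECDF plot."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Two synthetic "subgroups" with the same mean saliency but different
# per-image variability (scores clipped to [0, 1] like real saliency maps).
rng = np.random.default_rng(0)
low_var = [np.clip(rng.normal(0.5, 0.05, (8, 8)), 0, 1) for _ in range(200)]
high_var = [np.clip(rng.normal(0.5, 0.20, (8, 8)), 0, 1) for _ in range(200)]

max_lo, med_lo = saliency_stats(low_var)
max_hi, med_hi = saliency_stats(high_var)
print(max_hi.mean() > max_lo.mean())              # True: maxima shift right
print(abs(med_hi.mean() - med_lo.mean()) < 0.05)  # True: medians comparable
```

Because argmax cropping keys only on the maximum, the higher-variability group "wins" the crop more often even though neither group is more salient on the whole, mirroring the pattern in Figure 7.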
In this section, we answer the third research question, on other important aspects besides systematic disparate impact to consider in designing image cropping. Our analysis thus far has revealed systematic disparities in the behavior of the image cropping model under the notion of demographic parity. However, even in an ideal world where the model exhibited no demographic-parity violations under this framework, there are deeper concerns. Although the analysis above demonstrated some aspects of model behavior leading to disparate impact, including unforeseen consequences of using argmax and the potential effect of contrast on saliency-based cropping, formalized fairness metrics have inherent limitations: they cannot sufficiently elucidate all the relevant harms of automatic image cropping.
The primary concern associated with automated cropping is representational harm, which is multi-faceted, culturally situated (crawford2017trouble; benjamin), and difficult to quantify (crawford2017trouble). The choice of whom to crop is laden with highly contextual and nuanced social ramifications that the model cannot understand and that an analysis using demographic parity cannot surface. For example, consider how the interpretation of cropping a photo of two people, one darker-skinned and one lighter-skinned, could differ depending on whether the context presents them as software engineers or as criminals. For marginalized groups, clumsy portrayals and missteps that remove positive in-group representation are especially painful, since there are frequently very few other positive representations to draw on, which unfortunately places disproportionate emphasis on the few surviving positive examples (disclosure). We recognize the inherent limitations of formalized fairness metrics, which are unable to capture the historically specific and culturally contextual meaning inscribed onto images when they are posted on social media.
Similar limitations exist for the formal analysis attempting to address issues of male gaze. Although we find no evidence that the saliency model explicitly encodes male gaze, in cases where the model crops out a woman's head in favor of her body due to a jersey or lettering, the chosen crop still runs the risk of representational harm. In these instances, it is important to remember that users are unaware of the details of the saliency model and the cropping algorithm. Regardless of the underlying mechanism, when an image cropped to a woman's body area is viewed on social media, the historical hyper-sexualization and objectification of women's bodies mean the image runs the risk of being instilled with the notion of male gaze. This risk is, of course, not universal and is context dependent; however, it underscores the fact that the underlying problem with automatic cropping is that it cannot understand such context and should not so strongly dictate the presentation of women's bodies on the platform. Framing concerns about automated image cropping purely in terms of demographic parity fails to question the normative assumption of saliency-based cropping: the notion that for any image there is a "best" (or at the very least "acceptable") crop that can be predicted from human eye-tracking data. Machine learning based cropping is fundamentally flawed because it removes user agency and restricts users' expression of their own identity and values, instead imposing a normative gaze about which part of the image is considered the most interesting.
Twitter is an example of an environment where the risk of representational harm is high, since Twitter is used to discuss social issues and sensitive subject matter. In addition, Tweets can potentially be viewed by millions of people, meaning mistakes can have far-reaching impact. Given this context (automatic image cropping could be more appropriate in more limited contexts that deal with less sensitive content, smaller audiences, less variability in the types of photos cropped, etc.), we conclude that automatic cropping based on machine learning is unable to provide a satisfactory solution that fully addresses concerns about representational harm and user agency, since it places unnecessary limits on the identities and representations users choose for themselves when they post images. Simply aiming for demographic parity ignores the nuanced and complex history of representational harm and the importance of user agency.
Table 1: Possible alternatives to Twitter's saliency-based image cropping.

- Argmax saliency-based cropping (Twitter model)
- Sampling saliency-based cropping: sample the focal point with probability equal to its predicted saliency score
- Averaging saliency-based cropping: use the average weighted by predicted saliency scores, or the average over the top salient points, as the focal point
- Providing the most salient points and letting the user pick one of those points to crop around
- Focal point selection: let the user pick the focal point to crop around
- No cropping: only add background padding to images to fit desired aspect ratios
In this section, we answer the fourth research question, on alternatives to Twitter's saliency-based image cropping. In order to sufficiently mitigate the risk of representational harm on Twitter, ML-based image cropping may be replaced in favor of a solution that better preserves user agency. Removing cropping where possible and displaying the original image is ideal, although photos of non-standard sizes that are very long or wide pose challenging edge cases. Solutions that preserve the creator's intended focal point, without requiring the user to tediously define a crop for each aspect ratio of each photo, are desirable. In Table 1, we present a list of such solutions. (Twitter released one image cropping solution (https://twitter.com/dantley/status/1390040111228723200) to a small group of users; it is similar to the "No cropping" alternative in Table 1.) The list is non-exhaustive, and a solution combining multiple approaches is also possible. In evaluating trade-offs between solutions, we observe the same tensions raised in early debates weighing the value of user control against the value of saving users' time and attention by automating decisions (shneiderman1997direct), although these two approaches are not necessarily mutually exclusive (dove2017ux). Solutions that amplify users' productivity, decreasing the attention needed to generate satisfactory crops as much as possible while maintaining a sense of user control, are ideal (shneiderman1997direct). Hybrids of these solutions are also possible: for example, perform saliency-based cropping, but ask the user to confirm the crop if the top few salient points are spread across the picture, and allow users to specify the focal point if they wish. Each of these solutions may present its own trade-offs compared to the original fully automated image cropping, and their exact utility and trade-offs require further investigation.
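The "No cropping" alternative in Table 1 amounts to padding an image to the target aspect ratio instead of discarding pixels. A minimal sketch of this idea (our own illustration, not Twitter's released implementation):

```python
import numpy as np

def pad_to_aspect(img, target_w_over_h, fill=0):
    """Pad (never crop) an H x W [x C] image to a target aspect ratio,
    keeping the original content centered. `fill` is the padding value."""
    h, w = img.shape[:2]
    if w / h < target_w_over_h:
        # Too narrow for the target ratio: grow the width.
        new_h, new_w = h, int(round(h * target_w_over_h))
    else:
        # Too wide (or exact): grow the height.
        new_h, new_w = int(round(w / target_w_over_h)), w
    out = np.full((new_h, new_w) + img.shape[2:], fill, dtype=img.dtype)
    top, left = (new_h - h) // 2, (new_w - w) // 2
    out[top:top + h, left:left + w] = img
    return out

# A tall 400x200 image padded to 16:9 keeps every original pixel.
img = np.ones((400, 200, 3), dtype=np.uint8)
padded = pad_to_aspect(img, 16 / 9)
print(padded.shape)                 # (400, 711, 3)
print(padded.sum() == img.sum())    # True: no content lost
```

Unlike saliency-based cropping, this approach makes no prediction about which person or region matters; the trade-off is wasted display area for images far from the target ratio, one of the tensions user studies would need to evaluate.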
In order to properly assess the potential harms of technology and develop a novel solution, our analysis reaffirms the following design implications that have been previously discussed within the ML and HCI communities:
Co-construction of the core research activities and value-oriented goals (bardzell; friedman2002value; friedman2008value)
The high-level goal of this work is to understand the societal effects of automatic cropping and ensure that it does not reproduce the subordination of marginalized groups, which has informed our recommendation to remove saliency-based cropping and influenced our decision to leverage group fairness metrics in combination with a qualitative critique.
The utility of combining qualitative and quantitative methods (bardzell; friedman2002value; friedman2008value)
As argued in Section 4, although it is tempting to view concerns about image cropping purely in terms of group fairness metrics, such an analysis in isolation does not adequately address concerns of representational harm.
The importance of centering the experience of marginalized peoples (hanna2020towards; bardzell; costanza2018design)
Understanding the potential severity of representational harm requires an understanding of how representational harm has historically impacted marginalized communities and how it reinforces systems of subordination that lead to allocative harm (crawford2017trouble; sweeney2013discrimination; noble), as discussed in Section 2.2.
Integration of insights from critical race and feminist studies into analysis (hanna2020towards; bardzell)
In this work, we leverage the concept of male gaze to understand potential harms, a longstanding feminist concept that has been used to critique the representation of women in the arts, film, etc. (korsmeyer2004feminist; mulvey1989visual). In addition, in Section 6, we also draw strongly on previous work in critical race theory, feminist science, and sociology in discussing the positionality, limitations and risks of using standardized racial and gender taxonomies.
Increased collaboration between ML practitioners and designers in developing ethical technology (yang2017role)
Better integration of critical theory and design methodologies is critical to help expose the underlying normative values and assumptions of machine learning systems. Our critique motivates an emphasis on design and user control in assessing the potential harms of image cropping, which have been long standing topics of research within the HCI community (shneiderman1997direct). In order to properly evaluate the trade-offs of the various solutions presented in Table 1 and develop a viable alternative, we recommend additional user studies.
In developing ethical technologies, moving from a fairness/bias framing to a discussion of harms
As we have illustrated here with respect to automatic image cropping, and as others have noted (barocas-hardt-narayanan; corbett2018measure), formalized notions of fairness often fail to surface all potential harms in the deployment of technology. The fairness framing is often too narrowly scoped, as it only considers differences of treatment within the algorithm and fails to incorporate a holistic understanding of the socio-technical system in which the technology is deployed (selbst2019fairness; kasy2021fairness). By centering our analysis on a discussion of potential harms, we are better equipped to address issues of stereotyping and male gaze, both of which are highly dependent on historical context.
In this section, we discuss limitations of our work and potential future work to address some of those limitations.
Standardized racial and gender labels can be too limiting for how a person wants to be represented and do not capture the nuances of race and gender; for example, they are potentially problematic or even disrespectful for mixed-race or non-binary individuals. Additionally, the conceptualization of race as a fixed attribute poses challenges due to its inconsistent conceptualization across disciplines and its multidimensionality (hanna2020towards; scheuerman2020we); these problems are especially relevant given that we use labels from Wikidata (Section 3.1), which is curated by a diverse group of people. Given ethnic group labels from Wikidata, we refer to the US census race categories to standardize and simplify these categories into light- and dark-skinned in order to conduct our analysis.
Following the suggestions in (scheuerman2020we), we also wish to give a brief description of the sociohistorical context and positionality of our racial and gender annotations to better elucidate their limitations. Wikidata primarily features celebrities from the US and the Western world. Using Wikidata images and the US census as a reference presents a very US-centric conception of race. Many critical race scholars also define race in terms of marked or perceived difference, rather than traceable ancestry in itself (haslanger2000gender; hirschman2004origins; scheuerman2020we). Additionally, people of shared ancestry or the same racial identity can still look very different, making racial categories based on ancestry not always a suitable attribute to relate to images.
For future directions, one alternative to racial identity is skin tone, as used in buolamwini2018, which relates more directly to the colors in the images. (We were not yet able to use the Gender Shades dataset (buolamwini2018) at the time of publication due to a licensing issue.) Using a more fine-grained racial and gender taxonomy may alleviate some concerns related to the black/white binary (perea1998black) and the gender binary, although one challenge with more fine-grained labels is the need for sufficient sample sizes. In addition, a more fundamental critique of racial and gender taxonomies is that they risk reifying racial and gender categories as natural and inherent rather than socially constructed (hanna2020towards; fields2014racecraft; benthall2019racial). The use of a standardized racial and gender taxonomy here is not meant to essentialize or naturalize the construction of race and gender, or to imply that race is solely dependent on ancestry, but rather to study the impact of representational harm on historically marginalized populations (noble; hanna2020towards).
The plausible explanations for disparate impact in Section 3.6 are only suggestive, not conclusive, and there may be other underlying causes. For example, the maximum saliency scores across subgroups (Figure 7), whose disparities result in the systematic disparate impact, combine facial and background regions of images. While maximum saliency points usually lie on heads, they occasionally fall on graphics such as text or company logos on backgrounds or clothing, as in our discussion of male gaze in Section 3.4. These small subsets of images may appear to be favored by the model because of their demographics, when in reality the favor is due to their backgrounds. Such non-facial salient regions may also contribute to the differences in median saliency scores shown in Figure 7. In addition, we did not study whether the human gaze annotations in the training datasets are themselves a contributing factor in the model's disparate impact.
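The decomposition suggested above, separating the facial and background contributions to an image's maximum saliency, could be sketched as follows. The boolean face mask here is a hypothetical input (in practice it would come from a face detector, which we did not apply in this analysis):

```python
import numpy as np

def region_max_saliency(saliency, face_mask):
    """Split an image's maximum saliency into its facial and background
    components, given a boolean mask marking the (hypothetical) face region."""
    face_max = float(saliency[face_mask].max()) if face_mask.any() else 0.0
    bg_max = float(saliency[~face_mask].max()) if (~face_mask).any() else 0.0
    return face_max, bg_max

# Toy example: the global maximum sits on a background "logo", not the face,
# so the image's high score says nothing about its subject's demographics.
saliency = np.full((6, 6), 0.1)
face_mask = np.zeros((6, 6), dtype=bool)
face_mask[1:3, 1:3] = True
saliency[1, 1] = 0.6    # face region
saliency[5, 5] = 0.8    # background graphic

face_max, bg_max = region_max_saliency(saliency, face_mask)
print(bg_max > face_max)   # True: the background drives this image's maximum
```

Aggregating face_max and bg_max separately per subgroup would indicate how much of the disparity in Figure 7 is attributable to faces versus backgrounds.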
More experiments and progress in explainable ML are needed to evaluate the factors in the datasets that potentially contribute to model disparity, and to more clearly explain the observed systematic disparate impact. This is an important aspect, as apparent bias can often be explained by more systematic factors in the presence of higher-quality datasets and more robust analysis (see (Mishra2018SelfCite)).
The pros and cons of the recommended solutions in Section 5 are provided as starting points for consideration. More user and design studies are needed to identify the best cropping method, or a hybrid among those, as well as other details of the chosen method, such as selecting any required parameters.
Twitter's saliency-based image cropping algorithm automatically crops images to different aspect ratios by centering crops around the most salient area, i.e., the area predicted to hold a human's gaze. The use of this model raised concerns that Twitter's cropping system favors cropping light-skinned over dark-skinned individuals and favors cropping women's bodies over their heads. At first glance, it may seem that the risk of harm from automated image cropping can be articulated purely as a fairness concern quantifiable with formalized fairness metrics. We perform a fairness analysis to evaluate demographic parity (or the lack thereof, i.e., disparate impact) of the model across race and gender. We observe disparate impact and outline possible contributing factors; most notably, cropping based on the single most salient point can amplify small disparities across subgroups.
However, there are limitations to using formalized fairness metrics in assessing the harms of technologies. Regardless of the statistical results of a fairness analysis, the model presents a risk of representational harm, in that users do not have the choice to represent themselves as intended. Because representational harm is historically specific and culturally contextual, formalized fairness metrics such as demographic parity are insufficient on their own for surfacing potential harms related to automatic image cropping. For example, even if Twitter's cropping system does not systematically favor cropping women's bodies, concerns of male gaze persist due to the historical hyper-sexualization and objectification of women's bodies, which influence the interpretation of images posted on social media. We enumerate alternative approaches to saliency-based image cropping and discuss possible trade-offs. Our analysis motivates a combination of quantitative and qualitative methods, including human-centered design and user experience research, in evaluating alternative approaches.
We want to thank Luca Belli, Jose Caballero, Rumman Chowdhury, Neal Cohen, Moritz Hardt, Ferenc Huszar, Ariadna Font Llitjós, Nick Matheson, Umashanthi Pavalanathan, and Jutta Williams for reviewing the paper.