
What's in a Decade? Transforming Faces Through Time

10/13/2022
by   Eric Ming Chen, et al.

How can one visually characterize people in a decade? In this work, we assemble the Faces Through Time dataset, which contains over a thousand portrait images from each decade, spanning the 1880s to the present day. Using our new dataset, we present a framework for resynthesizing portrait images across time, imagining how a portrait taken during a particular decade might have looked had it been taken in other decades. Our framework optimizes a family of per-decade generators that reveal subtle changes that differentiate decades, such as different hairstyles or makeup, while maintaining the identity of the input portrait. Experiments show that our method is more effective in resynthesizing portraits across time than state-of-the-art image-to-image translation methods, as well as attribute-based and language-guided portrait editing models. Our code and data will be available at https://facesthroughtime.github.io


1 Introduction

What would photographs of ourselves look like if we had been born fifty, sixty, or a hundred years ago? What would Charlie Chaplin look like if he were active in the 2020s instead of the 1920s? Such is the conceit of popular diversions like old-time photography, where we imagine ourselves as we might have looked in an anachronistic time period like the Roaring Twenties. However, while many methods for editing portraits have recently been devised on the basis of powerful generative techniques like StyleGAN [karras2020analyzing, Karras2020ada, shen2020interfacegan, 10.1145/3447648, Wu2021StyleSpaceAD, alaluf2021matter, or2020lifespan], little attention has been paid to the problem of automatically translating portrait imagery in time while preserving other aspects of the person portrayed. This paper addresses this problem, producing results like those shown in our figures.

To simulate such a “time travel” effect, we must be able to model and apply the characteristic features of a certain era. Such features may include stylistic trends in clothing, hair, and makeup, as well as the imaging characteristics of the cameras and film of the day. In this way, translating imagery across time differs from standard portrait editing effects that typically manipulate well-defined semantic attributes (e.g., adding or removing a smile or modifying the subject’s age). Further, while large amounts of data capturing these attributes are readily available via datasets like Flickr-Faces-HQ (FFHQ) [karras2019style], diverse and high-quality imagery spanning the history of photography is comparatively scarce.

Figure 1: Random samples from five decades (the 1880s, 1910s, 1940s, 1970s, and 1990s) of the Faces Through Time dataset.

In this work, we take a step towards transforming images of people across time, focusing on portraits. We introduce Faces Through Time (FTT), a dataset containing thousands of images spanning fourteen decades from the 1880s to the present day. Faces Through Time is derived from the massive public catalog of freely-licensed images and annotations available through the Wikimedia Commons project. The extensive biographic depth available on Wikimedia Commons, as well as its organization into time-based categories, enables associating images with accurate time labels. In comparison to previous time-stamped portrait datasets, FTT is sourced from a wider assortment of images capturing notable identities varying in age, nationality, pose, etc.; a well-known prior dataset, The Yearbook Dataset [ginosar2015century], contains only US yearbook photos. In our work, we demonstrate FTT's applicability for synthesis tasks. More broadly, the dataset enables exploration of a variety of analysis tasks, such as estimating time from images, understanding photographic styles across time, and discovering fashion trends (see Section 2). Figure 1 shows random samples from five different decades of FTT.

To transform portraits across time, we build on the success of Generative Adversarial Networks (GANs) for synthesizing high-quality facial images. In particular, we finetune the popular StyleGAN2 [karras2020analyzing, Karras2020ada] generator network (trained on FFHQ) on our dataset. However, rather than modeling the entire image distribution of our dataset using a single StyleGAN2 model, we train a separate model for each decade. We introduce a method to align and map a person's image across the latent generative spaces of the fourteen different decades (see Figure 3). Furthermore, we discover a remarkable linearity in each model's generator weights, allowing us to fine-tune images with vector arithmetic on the model weights. This sets our approach apart from the many prior works that search for editing directions within a single StyleGAN2 model [shen2020interfacegan, shen2021closed, Wu2021StyleSpaceAD, 10.1145/3447648, alaluf2021matter, Patashnik_2021_ICCV]. We find that by using multiple StyleGAN2 models in this way, our method is more expressive than these existing approaches. In addition, the per-decade separation of models provides a class structure that is useful for style transfer.

We demonstrate results for a variety of individuals from different backgrounds and captured during different decades. We show that prior methods struggle on our problem setting, even when trained directly on our dataset. We also perform a quantitative evaluation, demonstrating that our transformed images are of high quality and resemble the target decades, while preserving the input identity. Our approach enables an Analysis by Synthesis scheme that can reveal and visualize differences between portraits across time, enabling a study of fashion trends or other visual cultural elements (related to, for instance, a New York Times article that discusses how artists use digital tools to imagine what George Washington would have looked like had he lived today).

In summary, our key contributions are:

  • Faces Through Time, a large, diverse and high-quality dataset that can serve a variety of computer vision and graphics tasks,

  • a new task—transforming faces across time—and a method for performing this task that uses unique vectorial transformations to modify generator weights across different models,

  • and quantitative and qualitative results that demonstrate that our method can successfully transform faces across time.

2 Related Work

2.1 Image analysis across time

While common image datasets such as ImageNet [imagenet] and COCO [coco] do not explicitly use time as an attribute, those that do show unique characteristics. Here we focus on image datasets that feature people.

Datasets. The Yearbook Dataset [ginosar2015century] is a collection of 37,921 front-facing portraits of American high school students from the 1900s to the 2010s. The authors design models to predict when a portrait was taken, and also analyze the prevalence of smiles, glasses, and hairstyles across different eras. The IMAGO Dataset [IMAGO] contains over 80,000 family photos from 1845 to 2009, labeled with "socio-historical classes" such as free-time or fashion. Hsiao and Grauman [cultureclothing2021] collect news articles and vintage photos and build a dataset that captures fashion trends in the 20th century. Our dataset contains portraits from a much wider age range (the Yearbook Dataset focuses on high school students), from diverse geographic areas (the IMAGO Dataset focuses on Italian families), and exhibiting rich variation in occupation and style (not just fashion images).

Analysis. A standard task for images with temporal information is to predict when they were taken, i.e., the date estimation problem. Müller-Budack et al. [whenwastaken] train two GoogLeNet models, one for classification and one for regression, to predict the date of a photo. Salem et al. [salem2016face2year] train different CNNs for date estimation using the face, torso, and patches of a portrait image. Other rich information can also be learned from temporal image collections: in StreetStyle and GeoStyle [matzen2017streetstyle, mall2019geostyle], a worldwide set of images taken between 2013 and 2016 was analyzed to discover spatio-temporal fashion trends and events, and in [cultureclothing2021], topic models and clustering are used to discover trends in news and vintage photos. Unlike these prior works, which focus on analyzing temporal characteristics in the data, we work on the more challenging task of modifying such characteristics.

2.2 Portrait editing

Editing of face attributes has been extensively studied since before the deep learning era. A classic approach is to fit a 3D morphable model [blanz1999morphable, 3dmm-review] to a face image and edit attributes in the morphable model space. Other methods that draw on classic vision approaches include Transfiguring Portraits [kemelmacher2016transfiguring], which can render portraits in different styles via image search, selection, alignment, and blending. Given the recent success of StyleGAN (v1 [karras2019style], v2 [karras2020analyzing], and v3 [StyleGANv3]) in high-quality face synthesis and editing, many works focus on editing portrait images using pre-trained StyleGAN models. In these frameworks, a photo is mapped into a code in one of StyleGAN's latent spaces [ganinversion]. Feeding the StyleGAN generator with a modified latent code yields a modified portrait [shen2020interfacegan, Wu2021StyleSpaceAD]. To find the latent code of an input image, one can directly optimize the code so that StyleGAN can reconstruct the input [Abdal2019Image2StyleGANHT, wulff2020improving], or train a feed-forward network such as pSp [richardson2021encoding] or e4e [tov2021designing] that directly predicts the latent code. We adopt an optimization-based procedure to obtain better facial details.

Once a latent code of an image is obtained, portrait image editing can be done in the latent space with a pre-trained StyleGAN generator. Directions for change of viewpoint, aging, and lighting of faces can be found by PCA in the latent space [harkonen2020ganspace], or from facial attributes supervision [shen2020interpreting]. Shen and Zhou [shen2021closed] find that editing directions are encoded in the generators’ weights and can be obtained by eigen decomposition. Collins et al. [instyle] perform local editing of portraits by mixing layers from reference and target images. Alternatively, the StyleGAN generator can also be modified for portrait editing. StyleGAN-nada [gal2021stylegannada] and StyleCLIP [Patashnik_2021_ICCV] use CLIP [Radford2021LearningTV] to guide editing on images with target attributes. Toonify [Pinkney2020ResolutionDG] uses layer-swapping to obtain a new generator from models in different domains. Similar to StyleAlign [Wu2021StyleAlignAA], we obtain a family of generators by finetuning a common parent model on different decades. The style change of a face is achieved by obtaining the latent code of the input image and feeding it into a modified target StyleGAN generator using PTI [roich2021pivotal]. Our method is conceptually simple and doesn’t require exploring the latent space.

Age editing on portraits is related to our task, since both involve modifying temporal aspects of an image. In the work of Or-El et al. [or2020lifespan], age is represented as a latent code and applied to the decoder network of the face generator; an identity loss is used to preserve identities across ages. Alaluf et al. [alaluf2021matter] design an age encoder that takes a portrait and a target age, produces a style code modification on pSp-coded styles, and generates a new portrait with the target age. In contrast, our editing aims to change the decade in which a photo was taken without altering the subject's age.

GANs can also be used to recolor historic photographs [zhang2017real, deoldify, Luo-Rephotography-2021]. In particular, Time-Travel Rephotography [Luo-Rephotography-2021] uses a StyleGAN2 model to project historic photos into a latent space of modern high-resolution color photos with modern imaging characteristics. Rather than focusing solely on low-level characteristics like color, our method alters a diverse collection of visual attributes, such as facial hair and make-up styles. Moreover, our method can transform images across a wide range of decades, instead of learning binary transformations between “old” and “modern” as in [Luo-Rephotography-2021].

3 The Faces Through Time Dataset

Our Faces Through Time (FTT) dataset features 26,247 images of notable people from the 19th to 21st centuries, with roughly 1,900 images per decade on average. It is sourced from Wikimedia Commons (WC), a crowdsourced and open-licensed collection of 50M images.

We automatically curate data from WC to construct FTT (Figure 1) as follows: (1) The "People by name" category on WC contains 407K distinct people identities. We query each identity's hierarchy of people-centric subcategories (similar to [whoswaldo]) and organize retrieved images by identity. (2) We use a Faster R-CNN model [renNIPS15fasterrcnn, jiang2017face, 2017arXiv170804370R] trained on the WIDER Face dataset [yang2016wider] as a face detector. For each detected face, 68 facial landmarks are found using the Deep Alignment Network [kowalski2017deep], and alignment is applied given these landmarks, following the FFHQ dataset [karras2019style]. (3) We devise a clustering method based on clique-finding in face similarity graphs to group faces by identity (see appendix); this resolves ambiguities in photos that feature multiple people. (4) We gather additional samples without biographic information from the "19th Century Portrait Photographs of Women" and "20th Century Portrait Photographs of Women" categories; these make up about 15% of the dataset.

We leverage image metadata, identity labels, and biographic information available in WC to further assist in balancing and filtering our data. We discard any photos without time labels or taken before 1880, and sample a subset of 3,000 faces each for the 2000s and 2010s decades (which tend to feature many more images than other decades) to maintain a roughly balanced distribution of images across decades. For images where the identity is known, we further filter by keeping only images where the identity is between 18 and 80 years old (comparing the image timestamp with the identity's birth year). We also estimate face pose using Hopenet [Ruiz_2018_CVPR_Workshops] and remove images whose yaw or pitch exceeds a fixed threshold (in degrees). After these automated collection and filtering steps, we manually inspected the entire dataset and removed images with clearly incorrect dates, images that were not cropped properly, images that were duplicates of other identities, and images featuring objectionable content. This manual pass removed a portion of the assembled data.
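To make the filtering step concrete, the following is a minimal sketch of the age and pose filters described above. The record fields, the pose threshold value, and the helper names are illustrative assumptions rather than the exact implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FaceRecord:
    photo_year: int
    birth_year: Optional[int]   # None if the identity is unknown
    yaw: float                  # head pose, e.g., estimated with Hopenet
    pitch: float

def keep_face(rec: FaceRecord, max_angle: float = 30.0) -> bool:
    """Return True if a face passes the dataset filters.
    The 30-degree pose cutoff is a placeholder value."""
    if rec.photo_year < 1880:
        return False
    if rec.birth_year is not None:
        age = rec.photo_year - rec.birth_year
        if not (18 <= age <= 80):
            return False
    # Discard strongly non-frontal faces.
    return abs(rec.yaw) <= max_angle and abs(rec.pitch) <= max_angle

records = [FaceRecord(1923, 1891, 5.2, -3.0), FaceRecord(1875, None, 0.0, 0.0)]
kept = [r for r in records if keep_face(r)]
```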

The total number of samples from each decade in our curated dataset, demographic and other biographic distributions, and further implementation details can be found in our supplementary material. We create train and test splits by randomly selecting 100 images per decade as a test set, with the remaining images used for training. Samples from the dataset are shown in Figure 1.

4 Transforming Faces Across Time

Given a portrait image from a particular decade, our goal is to predict what the same person might look like across various decades ranging from 1880 to 2010. The key challenges are: (1) maintaining the identity of the person across time, while (2) ensuring the result fits the natural distribution of images of the target decade in terms of style and other characteristics. We present a novel two-stage approach that addresses these challenges. Figure 2 shows an overview of our approach.

First, rather than training a single generative model that covers all decades (e.g., [or2020lifespan]), we train a family of StyleGAN models, one for each decade (Section 4.1, Figure 3). These are obtained by fine-tuning the same parent model, resulting in a set of child models whose latent spaces are roughly aligned [Wu2021StyleAlignAA], as described in Section 4.1. The alignment ensures that providing the same latent code to different models results in portraits with similar high-level properties, such as pose. At the same time, the resulting images exhibit the unique characteristics of each decade. Given a real portrait from a particular decade, it is first inverted into the latent space of the corresponding model, and the resulting latent code can then be fed into the model of any desired target decade. Our approach makes it unnecessary to search for editing directions in the latent space (e.g., [shen2020interfacegan, harkonen2020ganspace]).

Next, to better fit the identity of the input individual, we apply single-image finetuning of the family of per-decade StyleGAN generators (Section 4.2). Specifically, we introduce Transferable Model Tuning (TMT), a modified PTI (Pivotal Tuning Inversion) [roich2021pivotal] procedure, to obtain an adjustment to the weights of the source decade generator, and apply the resulting adjustment to the target generator(s). This input-specific adjustment is done in the generator's parameter space, enabling us to better preserve the input individual's identity while maintaining the style and characteristics of the target decade. We now describe these two stages in more detail.

Figure 2: Overview of our method. Left (Learning Decade Models): we first train a family of StyleGAN models, one for each decade, using adversarial losses and an identity loss on a blended face, which resembles the parent model in its colors. Right (Single-image Refinement): each real image is projected onto a vector on the decade manifold (the 1960s in the example above). We learn a refined generator and transfer the learned offset to all models (this process is visualized in Figure 4). To better encourage the refined model to preserve facial details, we mask the input image and apply all losses in a weighted manner (further described in the text).
Figure 3: We finetune a family of decade generators (child models) from an FFHQ-trained parent model. While each generator captures unique styles, the generated images from the same latent code are aligned in terms of high-level properties such as pose.

4.1 Learning coarsely-aligned decade models

We are interested in learning a family of StyleGAN2-ADA [Karras2020ada] generators {G_d}, each of which maps a latent vector w to an RGB image. For each decade d, we finetune a separate StyleGAN model with weights initialized from an FFHQ-pretrained model. We call the FFHQ-pretrained model the parent model G_p, and the network finetuned for decade d the child model G_d. Consistent with the findings in prior work [Wu2021StyleAlignAA], we observe that the collection of generators exhibits semantic alignment of faces generated from the same latent code w: they share similar face poses and shapes. However, various fine facial characteristics such as eyes and noses, which are important for recognizing a person, often drift from one another (as evident in Figures 3 and 7).

To better preserve identity across decades when finetuning each child model, we add an identity loss to the standard StyleGAN objective. Specifically, we measure the cosine similarity between ArcFace [deng2019arcface] embeddings of images generated by the parent G_p and by the child G_d:

L_ID(w) = 1 − cos( R(G_p(w)), R(G_d(w)) ),    (1)

where R denotes the ArcFace embedding.

A similar loss is used in [richardson2021encoding]. However, since the ArcFace model was only trained on modern-day images, we found that this raw identity loss performed poorly on historical images due to the domain gap. To solve this issue, we use a blended image instead of the original child image G_d(w). We create the blend using layer swapping [Pinkney2020ResolutionDG] to mix G_d and G_p at different spatial resolutions: we combine the coarse layers of the child model G_d with the fine layers of the parent model G_p. By doing so, we "condition" our input image for the identity loss by making its colors more similar to the image generated by the parent, and thus more similar to the distribution of the images used to train the ArcFace model. Figure 2 (left) shows that the blended (middle) image retains the structure of the 1900s image, but its colors better resemble those of a modern-day photo. In addition, this technique restricts the identity loss to focus on layers which generally control head shape, position, and identity. Note that this blended image is only used to compute the loss and does not appear in the transformed results.
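The following is a minimal PyTorch-style sketch of this identity loss with a layer-swapped blend. The generator interfaces, the `layer_resolution` helper, and the resolution at which coarse and fine layers are split are assumptions for illustration only, not the paper's exact implementation.

```python
import copy
import torch
import torch.nn.functional as F

def blend_generators(child, parent, layer_resolution, swap_resolution=64):
    """Layer swapping (sketch): copy the parent's fine (high-resolution) layers
    into a copy of the child, so coarse structure comes from the decade model
    while colors resemble modern, FFHQ-like photos. `layer_resolution` is a
    hypothetical helper mapping a parameter name to its synthesis resolution."""
    blended = copy.deepcopy(child)
    with torch.no_grad():
        target_state = blended.state_dict()
        for name, param in parent.state_dict().items():
            if layer_resolution(name) > swap_resolution:
                target_state[name].copy_(param)
    return blended

def identity_loss(arcface, parent, blended_child, w):
    """1 - cosine similarity between ArcFace embeddings of the parent image and
    the blended child image generated from the same latent code w (cf. Eq. 1)."""
    emb_parent = F.normalize(arcface(parent(w)), dim=-1)
    emb_child = F.normalize(arcface(blended_child(w)), dim=-1)
    return (1.0 - (emb_parent * emb_child).sum(dim=-1)).mean()
```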

Figure 4: Visualization of TMT offsets. We obtain offset vectors Δθ_1 (for the input image in row 1) and Δθ_2 (row 2) for the weights of the source decade generator and apply them to every target decade generator. On the left we use PCA to visualize the convolutional parameters of all target generators in 2D. Each dot represents the weights of a single generator, colored according to decade, with edges connecting adjacent decades. We illustrate the offset vectors optimized for the two input images (colored in gray and red) and the corresponding transformed images for three different decades. For each decade we show images before (left) and after (right) applying TMT. Adding these offsets has the effect of improving identity preservation.
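A sketch of how the 2D embedding of generator weights in Figure 4 could be reproduced: flatten each decade generator's convolutional parameters into a single vector and project all of them with PCA. The selection of parameters by name is an assumption about the underlying StyleGAN implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def flatten_conv_weights(generator) -> np.ndarray:
    """Concatenate all convolutional parameters of a generator into one vector."""
    chunks = [p.detach().cpu().numpy().ravel()
              for name, p in generator.named_parameters() if "conv" in name]
    return np.concatenate(chunks)

def project_generators(generators: dict) -> dict:
    """Map each decade generator to a 2D point via PCA over its weight vector."""
    decades = sorted(generators)
    weights = np.stack([flatten_conv_weights(generators[d]) for d in decades])
    points = PCA(n_components=2).fit_transform(weights)
    return dict(zip(decades, points))
```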

4.2 Single-image refinement and transferable model tuning

As previously described, we first train the family of aligned StyleGAN models with randomly sampled latent codes and our collections of real per-decade images. Given these coarsely-aligned per-decade models, we take a single real face image as input and aim to generate a set of faces across various decades. In order to better preserve the identity of the input image across these decades, we introduce Transferable Model Tuning (TMT). TMT is inspired by PTI [roich2021pivotal], a procedure for optimizing a model's parameters to better fit an input image after GAN inversion; TMT extends PTI from a single generator model to a family of models. Our TMT procedure produces a set of face images in which the identity is preserved in the presence of changing style over time (Figure 4).

Specifically, given an input image x from decade s, we first obtain its latent code w* using the projection method from StyleGAN2, inverting x into the latent space of G_s(·; θ_s), where θ_s is the vector of all parameters in G_s. We only work with child models in this stage. As the first step of TMT, we fix the obtained latent code w* and optimize over the parameters θ to obtain a new model with parameters θ*_s (Figure 2, right):

θ*_s = argmin_θ L( G_s(w*; θ), x ).    (2)

Tuning in the parameter space of the generators, instead of only working in the latent space as in previous work [Nitzan2020FaceID], allows us to better fit the facial details of the input individual, such as eyes and expression. Treating θ_s and θ*_s as vectors in the parameter space Θ, this tuning can be thought of as applying an offset Δθ = θ*_s − θ_s to the original model.

We found that this offset Δθ is surprisingly transferable from the TMT-tuned source generator to all other decade generators in the parameter space Θ. Concretely, to obtain the style-transformed face of x in any target decade t, we simply apply:

x_t = G_t( w*; θ_t + Δθ ),    (3)

where G_t is the generator for decade t and θ_t is the vector of all its parameters. We visualize the learned TMT offsets for two different input images in Figure 4. As illustrated in the figure, these offsets greatly improve the identity preservation of synthesized portraits.
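A minimal sketch of Eq. (3): compute the offset Δθ as the difference between the tuned and original source-decade parameters, then add it to every target decade generator (keeping ToRGB layers frozen, per Section 4.2). Generator handles and parameter names follow standard PyTorch state_dict conventions and are assumptions.

```python
import copy
import torch

def parameter_offset(tuned_gen, source_gen):
    """Delta-theta = theta_s* - theta_s, stored per parameter tensor."""
    tuned = tuned_gen.state_dict()
    return {name: tuned[name] - p for name, p in source_gen.state_dict().items()}

@torch.no_grad()
def apply_tmt_offset(target_gen, offset):
    """Return a copy of a target decade generator with theta_t + delta applied.
    ToRGB layers are skipped (kept frozen), as described in the text."""
    transformed = copy.deepcopy(target_gen)
    params = transformed.state_dict()
    for name, delta in offset.items():
        if "torgb" in name.lower():
            continue
        params[name] += delta
    return transformed

# Usage sketch (names hypothetical): transfer the offset learned on the source
# decade to every decade generator and synthesize from the inverted code w_star.
# delta = parameter_offset(tuned_source_gen, generators[source_decade])
# outputs = {d: apply_tmt_offset(g, delta)(w_star) for d, g in generators.items()}
```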

Intuitively, Δθ "refines" the parameters of the generator family for the single input face image so as to reconstruct better facial details. We hypothesize that the found offsets mainly focus on improving identity. Since the coarsely-aligned family of generators share similar weights responsible for identity-related features, these offsets are easily transferable. While most prior works, e.g., [shen2020interfacegan, harkonen2020ganspace, Wu2021StyleSpaceAD], modify images using linear directions in various latent spaces of StyleGAN, we are the first to apply a linear offset in the generator parameter space to a collection of generators. We hope our work will inspire future investigations into the linear properties of GAN parameter spaces. An analysis of the effects of applying TMT to a family of models can be found in the supplementary material.

As demonstrated in Figure 2 (right), to focus the loss computation on facial details instead of hair and background, we apply masks to images before calculating the losses. We use a DeepLab segmentation network [deeplabv3plus2018] trained on CelebAMask-HQ photos [or2020lifespan, CelebAMask-HQ]. Empirically, we determine it is best to apply a weight of 1.0 on the face, 0.1 on the hair, and 0.0 elsewhere. We put a small weight on the hair to reconstruct it accurately, as it does contribute to the image's stylization; however, we do not want to prioritize it over facial features. In addition, we find that it is best to keep StyleGAN's ToRGB layers frozen; otherwise, color artifacts are introduced into the generators. We follow the objectives introduced in [roich2021pivotal] and minimize a perceptual (LPIPS) loss and an L2 reconstruction loss. In addition, we add another identity loss to further enhance identity preservation in the generated images.
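A small sketch of the weighted masking described above, assuming a segmentation map with integer class labels; the class indices are hypothetical.

```python
import torch

def region_weights(seg, face_id=1, hair_id=2):
    """Build a per-pixel weight map from a segmentation of shape (B, H, W):
    1.0 on the face, 0.1 on the hair, 0.0 elsewhere (weights from Section 4.2)."""
    weights = torch.zeros_like(seg, dtype=torch.float32)
    weights[seg == face_id] = 1.0
    weights[seg == hair_id] = 0.1
    return weights

def masked_l2(pred, target, weights):
    """Weighted reconstruction loss; `weights` broadcasts over the channel dim."""
    return (weights.unsqueeze(1) * (pred - target) ** 2).mean()
```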

Using our two-stage approach, for each portrait, we obtain a set of faces that maintain the identity as well as demonstrate diverse styles across decades.

5 Results and Evaluation

We conduct extensive experiments on the Faces Through Time test set. We compare our approach to several state-of-the-art techniques across a variety of metrics that quantify how well transformed images depict their associated target decades and to what extent the identity is preserved. We also present an ablation study to examine the impact of the different components of our approach. Additional uncurated results and visualizations of the full test set (1,400 individuals) are in the supplementary material.

5.1 State-of-the-art Image Editing Alternatives

As no prior works directly address our task, we adapt commonly used image editing models to our setting and perform comprehensive comparisons.

Image-to-image translation. Unpaired image-to-image translation models learn a mapping between two or more domains. We train a StarGAN v2 [Choi2020StarGANVD] model on our dataset, where decades are domains. Results on another model, DRIT++ [Lee2020DRITDI], are in the supplementary material.

Attribute-based editing. We consider each decade as a facial attribute and compare against recent works performing attribute-based edits. While many attributes in prior work are binary [shen2020interfacegan, harkonen2020ganspace], our decade attribute has multiple classes. By comparison, age-based transformation is more similar to our problem setting, as age is often broken up into bins [or2020lifespan]. We compare against SAM [alaluf2021matter], a recent age-based transformation technique that also operates in the StyleGAN space.

Language-guided editing. Weakly supervised methods that leverage powerful image-text representations (e.g., CLIP [Radford2021LearningTV]) have demonstrated impressive performance in portrait editing. We compare against: (i) StyleCLIP [Patashnik_2021_ICCV], which learns latent directions in StyleGAN's latent space for a given text prompt, and (ii) StyleGAN-nada (StyleNADA) [gal2021stylegannada], which modifies the weights of the StyleGAN generator based on textual inputs. For both models, we used the text prompt "A person from the [XYZ]0s", where [XYZ]0s is one of FTT's 14 decades (1880s-2010s). For StyleCLIP, we use models trained on both FFHQ and FTT. Because StyleGAN-nada is designed for out-of-domain changes, we examine how well it can modify the generator from the FFHQ space to various decades in our dataset, using 100 FFHQ photos. We compare these two baselines to how well our model can transform FFHQ images to each of the 14 decades.

Method | FID ↓ | KMMD ↓ | DCA ↑ | DCA±1 ↑ | DCA±2 ↑ | ID ↑

FFHQ
StyleCLIP | 254.39 | 1.87 | 0.08 | 0.18 | 0.36 | 0.99
StyleNADA | 312.06 | 2.03 | 0.10 | 0.30 | 0.38 | 0.96
Ours | 69.46 | 0.43 | 0.50 | 0.81 | 0.91 | 0.93

FTT
StarGAN v2 | 68.05 | 0.40 | 0.38 | 0.75 | 0.89 | 0.97
SAM | 96.52 | 0.72 | 0.51* | 0.85* | 0.89* | 1.00
StyleCLIP | 108.25 | 0.85 | 0.07 | 0.21 | 0.36 | 1.00
Ours | 66.98 | 0.40 | 0.47 | 0.78 | 0.90 | 0.99
Table 1: Quantitative Evaluation.

We compare performance against SOTA techniques on FFHQ (top three rows) and on our test set (bottom four rows). Our method outperforms others in terms of most metrics. *Note that SAM uses the decade classifier during training and therefore the DCA metric is skewed in this case, as we further detail in the text.

Figure 5: Qualitative Results. Above we compare results generated by baselines and our technique. The red box indicates the inversion of the original input. We observe that our approach allows for significant changes across time while best preserving the input identity. While SAM and StarGAN are able to stylize images, these changes are mostly limited to color. StyleCLIP struggles to generate meaningful changes. Please refer to the supplementary material for qualitative results on the full test set.

5.2 Metrics

Visual quality. We use the standard FID [Heusel2017GANsTB, Seitzer2020FID] metric as well as a Kernel Maximum Mean Discrepancy (KMMD) metric [wang2020minegan]. As image quality varies across decades, we compute scores between real and edited portraits separately for each decade and then average over all decades. Because FID can be unstable on smaller datasets [chong2019effectively], similar to prior work [noguchi2019image, wang2020minegan] we measure KMMD on Inception [Szegedy2016RethinkingTI] features. Experimentally, we find that these two scores are highly sensitive to an image's background; therefore, we compute the scores on images cropped to 160×160 pixels.
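For reference, a sketch of a kernel MMD computation on extracted Inception features; the Gaussian kernel and its bandwidth are assumptions here (the paper follows [wang2020minegan]), and the features would come from the 160×160 crops described above.

```python
import torch

def kernel_mmd(feat_real, feat_fake, sigma=10.0):
    """Biased estimate of MMD^2 between two feature sets of shape (n, d) and
    (m, d), using a Gaussian kernel. Kernel choice and bandwidth are assumptions."""
    def gram(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return (gram(feat_real, feat_real).mean()
            + gram(feat_fake, feat_fake).mean()
            - 2 * gram(feat_real, feat_fake).mean())
```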

Decade style. We evaluate how well the generated samples capture the style of the target decade using an EfficientNet-B0 classifier [Tan2019EfficientNetRM] that we trained separately. Using this classifier, we define the Decade Classification Accuracy (DCA). Following prior work [gaudette2009evaluation], we report three metrics: DCA, DCA±1, and DCA±2, where DCA±n measures the accuracy within a tolerance of n decades.
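A sketch of the DCA metric with tolerance, assuming decades are encoded by their starting year (e.g., 1940 for the 1940s).

```python
import numpy as np

def decade_classification_accuracy(pred_decades, true_decades, tol=0):
    """DCA with tolerance: a prediction counts as correct if it is within
    `tol` decades of the ground truth (tol = 0, 1, 2 in the paper)."""
    pred = np.asarray(pred_decades)
    true = np.asarray(true_decades)
    return float(np.mean(np.abs(pred - true) <= tol * 10))

# Example: predicting 1950 for a 1940s image counts under tol=1 but not tol=0.
print(decade_classification_accuracy([1950], [1940], tol=1))  # 1.0
```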

Identity preservation. We use the Amazon Rekognition service to measure how well a person's identity has been preserved in generated portraits. Its CompareFaces operation outputs a similarity score between two faces. As a metric, we report ID, the fraction of successful identity comparisons, where a comparison is considered successful if its similarity score is above an empirically chosen threshold.
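A sketch of this identity check using the Amazon Rekognition CompareFaces API via boto3; the similarity threshold shown is a placeholder, since the paper's exact value is not reproduced here.

```python
import boto3

def identity_preserved(src_bytes: bytes, gen_bytes: bytes,
                       threshold: float = 80.0) -> bool:
    """Compare the input portrait with a transformed portrait via Rekognition.
    `threshold` (on the 0-100 similarity scale) is an assumed placeholder."""
    client = boto3.client("rekognition")
    resp = client.compare_faces(
        SourceImage={"Bytes": src_bytes},
        TargetImage={"Bytes": gen_bytes},
        SimilarityThreshold=0.0,   # return all matches; we threshold ourselves
    )
    matches = resp.get("FaceMatches", [])
    return bool(matches) and matches[0]["Similarity"] >= threshold
```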

Figure 6: StyleGAN-nada Results. Although StyleGAN-nada [gal2021stylegannada] produces some style changes across decades, all images are transformed similarly. For example, all 1980s portraits adopt a frizzy hairstyle.

5.3 Results

We present test set performance for all methods in Table 1 and qualitative results in Figure 5. Results on the full test set (1,400 samples) are provided via an interactive viewer as part of our supplementary material, and we show additional results on StyleGAN-nada's effect on FFHQ input images in Figure 6. Our method performs well numerically in terms of all metrics. Although StarGAN also performs well in FID and KMMD, its style changes are mostly limited to color; there are few changes to makeup, hair, and beards, whereas our model can perform such changes, and with fewer artifacts. As illustrated in Figure 6, modifications from StyleGAN-nada are more caricature-like than realistic; as a result, StyleGAN-nada performs poorly with respect to the FID, KMMD, and DCA metrics. For StyleCLIP and SAM, identity preservation is near 1.0 because their changes are generally limited to color. Nonetheless, our model still successfully matches input and transformed images in most cases for both FFHQ and FTT images (see the ID column of Table 1), while generating significant changes in style. While SAM performs best with respect to DCA, we believe this is because SAM used the same classifier as a loss function during training. In fact, in Figure 5, SAM demonstrates little change across decades; we suspect the classifier leads SAM to overfit to noise instead of truly changing an image's style.

From our results we can discern interesting details that provide insight into style trends across time. For example, as illustrated in the top left example in Figure 5, the individual adopts a bob haircut, a style popularized by Irene Castle and strongly associated with the flapper culture of the Roaring Twenties. Later on, we notice more contemporary hairstyles, and in the 2010s, women tend to adopt longer hairstyles. Despite these generalizations, the portraits remain well-conditioned on the input, reflecting an individual's identity and aspects of their personal style. For instance, the individual on the left with the long mustache maintains facial hair across the decade transformations, although in very different styles. In the bottom left example, we observe glasses that change style over time. Not only does our model generate realistic transformations, but it also captures the nostalgia of various time periods.

Figure 7: Ablations. We see significant improvement in terms of identity preservation after adding the identity loss and TMT during training. The masking procedure alleviates artifacts caused by other regions in the image (such as the hat in the example above) by focusing the model’s attention on facial details and allowing for larger modifications in other regions. See Table 2 for descriptions of the ablation labels.

5.4 Ablations

We present ablations of the components of our approach in Table 2 and Figure 7. Specifically, we train five ablated models: (1) without the identity loss (for learning decade models), (2) without the blended image obtained using layer swapping, (3) without TMT, (4) without the identity loss (during TMT), and (5) without masking the images. As shown in the first row of the table, our baseline method already captures a decade's style well. However, the images are not aligned with respect to an input's identity, which is reflected in its low ID score. As a result, the identity losses (during decade-model training and during TMT) and TMT itself are necessary for identity preservation. In addition, we find that using masks during TMT reduces artifacts in generated images, as it allows for an accurate inversion in the facial region and larger modifications in other regions of the image; empirically, this reduces noise and improves decade classification. While the FID, KMMD, and DCA scores remain similar across the ablations, our full model shows a strong improvement in ID, which is the main objective of the identity losses and our proposed TMT stage. We also experimentally find that the W+ space [richardson2021encoding] is less well aligned across our decade models than the W space. More details are in the supplementary material.

Configuration | FID ↓ | KMMD ↓ | DCA ↑ | DCA±1 ↑ | DCA±2 ↑ | ID ↑
w/o identity loss (decade models) | 69.51 | 0.45 | 0.49 | 0.79 | 0.92 | 0.61
w/o layer swapping (LS) | 69.36 | 0.47 | 0.51 | 0.82 | 0.92 | 0.63
w/o TMT | 68.18 | 0.45 | 0.50 | 0.82 | 0.92 | 0.72
w/o identity loss (TMT) | 67.32 | 0.39 | 0.46 | 0.78 | 0.89 | 0.95
w/o Mask | 67.08 | 0.38 | 0.46 | 0.77 | 0.89 | 0.99
Ours (full) | 66.98 | 0.40 | 0.47 | 0.78 | 0.90 | 0.99
Table 2: Ablation study evaluating the effect of the identity loss used while learning decade models, the identity loss used during TMT, the blended image obtained with layer swapping (LS), TMT, and masking the images during TMT (Mask).

6 Ethical Discussion

Face datasets—and the tasks that they enable, such as face recognition—have been subject to increasing scrutiny, and recent work has shed light on the potential harms of such data and tasks [bias1, bias2]. With awareness of these issues, our dataset was constructed with attention to ethical questions. The images in our dataset are freely-licensed and provided through a public catalog, and we will include the source and license for each image in our dataset. As part of our terms of use, we will only provide our dataset for academic use under a restrictive license. Furthermore, our dataset does not contain identity information (and only includes one face per identity), and therefore cannot readily be used for facial recognition. Nonetheless, our dataset does inherit biases that are present in Wikimedia Commons. For instance, the data is gender imbalanced, containing an uneven ratio of male to female samples (according to the binary gender labels available on Wikimedia Commons, which are annotated by Wikipedia contributors). While such biases can be mitigated by balancing the data for training and evaluation purposes, we plan to continue gathering more diverse data to address this underlying bias. For additional details on various features of our dataset, please refer to the accompanying datasheet [gebru2021datasheets].

There are also ethical considerations relating to the risks of using portrait editing for misinformation. Our task is perhaps less sensitive in this regard, since our explicit goal is to create fanciful imagery that is clearly anachronistic. That said, any results from such technology should be clearly labeled as imagery that has been modified.

7 Conclusion

We present a dataset and method for transforming portrait images across time. Our Faces Through Time dataset spans diverse geographical areas, age groups, and styles, allowing one to capture the essence of each decade via generative models. By learning a family of generators and efficient tuning offsets, our two-stage approach allows for significant style changes in portraits, while still preserving the appearance of the input identity. Our evaluation shows that our approach outperforms state-of-the-art face editing methods. It also reveals interesting style trends existing in various decades. However, our method still has limitations. As with any data-driven technique, our results are affected by biases that exist in the data. For instance, females with short hair are less common at the beginning of the 20th century, which may yield gender inconsistencies when transforming a short-haired modern female face to these early decades, including unexpected changes in visual features often associated with gender. In the future, we plan to explore methods that can improve consistency, perhaps by devising a way to jointly optimize models for different decades that better enforces consistency among them. Finally, we envision that future uses of our data could go beyond the synthesis tasks we consider in our work, and explore the combination of both analysis and synthesis.

8 Acknowledgements

This work was supported in part by the National Science Foundation (IIS-2008313).

Appendix

In this document, we present implementation details and additional results that supplement the main paper. Section A reports details about our Faces Through Time dataset. Section B provides details about the adaptation of state-of-the-art image editing alternatives, as well as additional results and analysis. Section C provides implementation details such as training procedures and hardware specifications. A datasheet for our dataset is provided separately. Note that we also provide an interactive viewer that demonstrates our results, as well as those obtained using alternative techniques, on the full test set (along with a lighter viewer that only displays results for 100 random test samples).

Appendix A Faces Through Time Dataset Details

Faces Through Time contains in total 26,247 portraits over 14 decades from the 1880s to the 2010s. In Table 3, we report the number of distinct identities per decade. Since our dataset contains only one image per identity, these numbers also correspond to the number of images per decade. We include a histogram of the image resolutions for all decades in Figure 8; images from 2000 onward tend to have higher resolutions than those from 1880-2000.

By associating the identity in each portrait with the biographic information on Wikimedia Commons, we show demographic information for the Faces Through Time dataset in a set of histograms. Figure 9 shows the 50 most common citizenships and occupations. For a systematic review of the characteristics of the Faces Through Time dataset, please refer to the datasheet.

Decade | 1880 | 1890 | 1900 | 1910 | 1920 | 1930 | 1940 | 1950 | 1960 | 1970 | 1980 | 1990 | 2000† | 2010†
Identities | 525 | 842 | 2049 | 3052 | 2042 | 1710 | 1337 | 1638 | 2611 | 1916 | 1382 | 928 | — | —
Table 3: Number of identities per decade in the Faces Through Time dataset. Entries marked † (the 2000s and 2010s) were trimmed by subsampling, as described in Section 3. Note that our dataset contains one image per identity, and therefore these numbers also correspond to the number of images per decade.
Figure 8: Resolution of aligned facial images (left: 1880-2000; right: 2000-2020). A histogram of each image's resolution (measured in pixels) is shown above. For images from 1880-2000, we only use images above a minimum resolution threshold; due to the abundance of digital images after 2000, we use a higher minimum resolution for the 2000s and 2010s. Note that these resolutions correspond to the images after alignment and cropping are applied to extract normalized facial regions, and not to the original images found on Wikimedia Commons.
Figure 9: 50 most common occupations and citizenships. Identities may have multiple, or no, associated occupations and citizenships. There are 981 occupations not shown, each with 146 or fewer associated identities. Similarly, there are 209 citizenships not shown, each with 49 or fewer associated identities. Citizenship generally refers to historical nations; for example, individuals from what would be considered modern-day China are categorized into the Qing dynasty, the Republic of China (1912-1949), the People's Republic of China, etc.

a.1 Assembly: Face Clustering

We elaborate here on our face clustering method (see Sec. 3 in the main paper). Although one could naively associate all faces in an identity's photos with that identity, a given photo may contain multiple faces or may not picture the target identity at all. We therefore approach face assignment as a clustering problem, where we separately group the faces associated with each person p. Let F_p be the full set of faces extracted from images in p's category. We wish to cluster F_p into groups of people and identify the group corresponding to the actual identity. We propose a hybrid semi-supervised clustering method for this problem, based on the following observations:

  1. The number of people represented in a set F_p is unconstrained, so it is difficult to partition such sets with parametric techniques, such as k-means, which assume a known number of clusters.

  2. A cluster is most likely to represent a single person if all face pairs are similar.

  3. If multiple faces appear in the same image, it is very unlikely that these faces belong to the same identity. As a corollary, given two images, each face in one of the images should only belong to the same identity as at most one face in the other image.

Maximal Clique Clustering. We construct a graph G and formulate face grouping as a graph clustering problem. Each face in F_p corresponds to a node in G, and for each face we compute a FaceNet embedding using OpenFace [amos2016openface]. By observation (2), our goal is then to find subsets of faces such that every pair is a positive verification pair, i.e., the distance between their embeddings is below a threshold τ. Hence, we add an edge to G for such pairs.

However, we can further constrain this construction with observation (3). Let I_1 and I_2 be a pair of images, containing faces denoted by the sets A and B. We observe that the corresponding subgraph of G induced by A ∪ B must be bipartite, and that each node should have at most one edge into the other set in order to represent correct identity relations. For each pair of images, we therefore consider candidate edges between nodes in increasing order of embedding distance, adding an edge only if the distance is below τ and no edge adjacent to either endpoint already exists within that image pair. Prior work shows that an appropriate choice of τ is effective in practice [amos2016openface], which we corroborated in our evaluation.

We then search this graph for cliques, i.e., subgraphs with an edge between every pair of nodes (specifically, maximal cliques, which cannot be further extended). We apply the Bron-Kerbosch algorithm [bron1973algorithm] to (relatively) efficiently enumerate all maximal cliques (in general, this is an NP-complete problem [karp1972reducibility]). We sort the resulting set of cliques in decreasing order of size, successively selecting those without nodes in any previously selected clique, until none remain. The selected cliques are our final clusters of F_p.
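Below is a sketch of the graph construction and maximal-clique clustering described above, using networkx's Bron-Kerbosch implementation (`find_cliques`). The distance threshold value is an assumption.

```python
import networkx as nx
import numpy as np

def cluster_faces(embeddings, image_of, tau=0.9):
    """Sketch of clique-based face clustering. `embeddings` is a list of FaceNet
    vectors, `image_of[i]` is the source image id of face i, and `tau` is the
    verification threshold (the value here is an assumed placeholder)."""
    n = len(embeddings)
    graph = nx.Graph()
    graph.add_nodes_from(range(n))
    # Candidate edges in increasing order of embedding distance.
    pairs = sorted(
        (np.linalg.norm(np.asarray(embeddings[i]) - np.asarray(embeddings[j])), i, j)
        for i in range(n) for j in range(i + 1, n))
    for dist, i, j in pairs:
        if dist >= tau or image_of[i] == image_of[j]:
            continue  # faces in the same photo cannot share an identity (obs. 3)
        # Allow at most one edge between the faces of any two images.
        if any(image_of[k] == image_of[j] for k in graph.neighbors(i)):
            continue
        if any(image_of[k] == image_of[i] for k in graph.neighbors(j)):
            continue
        graph.add_edge(i, j)
    # Greedily select disjoint maximal cliques, largest first (Bron-Kerbosch).
    clusters, used = [], set()
    for clique in sorted(nx.find_cliques(graph), key=len, reverse=True):
        if used.isdisjoint(clique):
            clusters.append(sorted(clique))
            used.update(clique)
    return clusters
```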

We further purify clusters by removing outlier faces. Similar to the approach in MF2 [nech2017level], we first create a vector whose elements are the mean pairwise embedding distance for each face in the cluster. We then compute the median absolute deviation (MAD) of this vector. A face is considered an outlier if its mean pairwise distance deviates from the median by more than a fixed multiple of the MAD.
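A sketch of this MAD-based purification step, with the cutoff multiplier as a placeholder since the paper's value is not reproduced here.

```python
import numpy as np

def remove_outliers(embeddings, cluster, c=3.0):
    """Drop faces whose mean pairwise distance deviates from the cluster median
    by more than c * MAD. The constant c is an assumed placeholder."""
    if len(cluster) < 3:
        return list(cluster)
    pts = np.stack([embeddings[i] for i in cluster])
    dists = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)  # pairwise
    mean_dist = dists.sum(axis=1) / (len(cluster) - 1)            # exclude self
    med = np.median(mean_dist)
    mad = np.median(np.abs(mean_dist - med))
    keep = np.abs(mean_dist - med) <= c * max(mad, 1e-8)
    return [face for face, ok in zip(cluster, keep) if ok]
```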

Since nearly all identities are annotated with a number of ground-truth reference images (i.e., their prominent Wikipedia article image or images listed in WikiData), we can simply assign the cluster containing the largest number of these reference images to that identity. If no references are available, we instead assume the largest cluster of faces is the person in question. By visual inspection of 200 identities, we found over 98% accuracy between clusters and their automatically assigned identity labels. We primarily attribute this to having reference images for over 86% of identities.

Appendix B Method Details and Additional Results

In this section, we include details and results that were omitted from the main paper due to the page limit. Note that these are supplementary to the main results and are not to be viewed as significant new results.

b.1 Detailed adaptation of state-of-the-art methods

StarGAN v2 [Choi2020StarGANVD]. StarGAN is well-suited for our decade translation task because it scales to multiple domains. We ran the training code found in their official code repository. We used the CelebA-HQ [CelebAMask-HQ] configuration.

DRIT++ [Lee2020DRITDI]. DRIT++ is a state-of-the-art image-to-image translation framework known for generating diverse representations. We train a model using the code in their official code repository. In practice, DRIT++ scales poorly to our problem because the model can only be trained on two domains at a time. 91 models are needed to create transformations across all 14 decades. To visualize results, we ran the model on several pairs of decades. Evaluation shows that the model suffers from poor image quality compared to other baselines, as illustrated in Figure 13.

SAM [alaluf2021matter]. SAM has shown impressive realism with regard to age transformation. We ran the training code found in its official code repository, training SAM on top of a StyleGAN model trained on images from all decades in our dataset. During training, SAM searches for decade-transformation directions within the model's latent space. To adapt SAM to our task, we replace its age regression network with a decade classification network.

StyleCLIP [Patashnik_2021_ICCV]. Guided by CLIP [Radford2021LearningTV], StyleCLIP uses a text prompt to transform an image in StyleGAN's latent space. Similarly to SAM, we run StyleCLIP on a StyleGAN model trained on all of our dataset's images. Although StyleCLIP presents three different training schemes, we used the latent mapper approach, since the authors claim it is best suited for complex attributes. We ran the training code found in its official code repository. For evaluation, we set the target prompt to "A person from the [decade]s", where [decade] is one of the 14 decades in FTT (1880-2010).

StyleGAN-nada [gal2021stylegannada]. Because StyleGAN-nada is designed for out of domain changes, we started with images from FFHQ and projected them to decades in the 20th century. We used the same text prompt that we used for StyleCLIP. We ran the training code found in their official code repository. We also experimented with using exemplar images, and found that the generated images suffer from a lack of style diversity and are entangled with the identity of the exemplars, meaning that synthesized images adopted facial characteristics of the exemplar images.

b.2 Decade-classification performance

In our evaluation, we use an EfficientNet-B0 classification network trained on the Faces Through Time dataset to calculate the DCA scores. For reference, a confusion matrix for this classifier on the test set is shown in Figure 10; the classifier's average accuracy over all decades, and the extent to which its confusion falls within a tolerance of one decade, can be read from the matrix.

Figure 10: Test classification accuracy on the Faces Through Time test set. Rows indicate the ground truth decade and columns indicate the predicted decade.

b.3 Additional comparisons

Comparisons between our model and state-of-the-art alternatives on the full test set of Faces Through Time are provided separately using an interactive viewer. Consistent with the results presented in the main paper, our method outperforms alternatives in terms of image quality and style changes, while preserving the identity of the input images.

We also present results on CelebAHQ [CelebAMask-HQ], a dataset of recognizable celebrities, in Figure 11.

Figure 11: Results on CelebAHQ. Above we show results on celebrities from CelebAHQ. We see that the celebrities remain recognizable throughout the transformations.

b.4 W space vs. W+ space

As mentioned in the main paper, our method uses a projection into the W latent space of a StyleGAN model to invert an image. Figure 12 shows a comparison to the alternative W+ space. We find that inverting images into the W+ space creates more artifacts during training, which are often amplified after TMT.

Figure 12: Face inversion and transformation results using the W vs. W+ space. Above we compare before-TMT and after-TMT results obtained using the W space, which we adopt in our work, with results obtained using the W+ space. As demonstrated above, results with the W+ space yield various artifacts, which are often amplified after TMT.

b.5 Additional ablation results

Figure 14 shows additional examples of ablations on essential components in our approach. Consistent with the results in the main paper, our full model with all components enabled has the best performance compared to other variants.

b.6 Analysis of TMT offsets

Figure 15 shows the effects of applying the proposed TMT offsets to portraits. In general, we find that the TMT offsets are distinguishable from directions between arbitrary pairs of decade generators. This agrees with our intuition that offsets learned by fine-tuning a generator on an identity should be independent from the style changes between decades.
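The comparison in Figure 15 can be sketched as follows, assuming each generator's parameters have been flattened into a single vector: the TMT offset is compared against decade-to-decade directions via cosine similarity.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two flattened parameter vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def offset_vs_decade_directions(theta_source, theta_tuned, thetas_by_decade, source):
    """Compare the TMT offset (theta_tuned - theta_source) with every
    decade-transformation direction (theta_t - theta_source)."""
    tmt_offset = theta_tuned - theta_source
    return {t: cosine(tmt_offset, theta_t - theta_source)
            for t, theta_t in thetas_by_decade.items() if t != source}
```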

Figure 13: DRIT++ Results. We show additional qualitative results obtained using DRIT++ and our method. Compared to our method, DRIT has trouble reconstructing high quality images. In addition, most of the changes from DRIT are limited to color.
Figure 14: Additional Ablation Results. These results further show the improvement in identity preservation obtained by adding the identity loss, layer swapping, and TMT during training, and the benefit of incorporating the TMT identity loss and the masking procedure.

Appendix C Training and implementation details

Learning Decade Models. All models were trained for 645k iterations on a single Nvidia RTX 3090. The codebase is derived from the official stylegan2-ada-pytorch repository, and we train with the paper256 configuration. We use the PyTorch implementation of InsightFace with a ResNet-100 backbone for the identity loss, which is added to the generator phase of StyleGAN training with a fixed weight. We also use a regularization weight that we empirically found worked best for the dataset.

Single-image Refinement. We modified the Pivotal Tuning Inversion (PTI) code base for our task, adding a DeepLab [deeplabv3plus2018] mask to the input images during the generator tuning phase of PTI. From experimentation, we found that the L2 loss had little effect on the quality of images; most of the inversion tuning is guided by the LPIPS [zhang2018perceptual] loss. Because of this, we set a small LPIPS threshold during training. Inference takes 2-3 minutes per image.

Figure 15: For the two examples shown, we compare the cosine similarity between the offset vectors learned by TMT (colored in orange above) and the vectors corresponding to decade transformations (colored in green above). We notice that the cosine similarity is centered around zero, implying that in StyleGAN's parameter space the TMT offset has little correlation with the decade-transformation directions. In contrast, when comparing two decade-transformation offsets with each other, we see that the vectors have high similarity. This agrees with our qualitative results, where we observe that the TMT offset improves identity preservation in each decade independently, without sacrificing the style of the target decade.