Multiple-Identity Image Attacks Against Face-based Identity Verification

06/20/2019 · by Jerone T. A. Andrews, et al. · UCL

Facial verification systems are vulnerable to poisoning attacks that make use of multiple-identity images (MIIs)---face images stored in a database that resemble multiple persons, such that novel images of any of the constituent persons are verified as matching the identity of the MII. Research on this mode of attack has focused on defence by detection, with no explanation as to why the vulnerability exists. New quantitative results are presented that support an explanation in terms of the geometry of the representation spaces used by the verification systems. In the spherical geometry of those spaces, the angular distance distributions of matching and non-matching pairs of face representations are only modestly separated, approximately centred at 40-60 and 90 degrees, respectively. This is sufficient for open-set verification on normal data but provides an opportunity for MII attacks. Our analysis considers ideal MII algorithms, demonstrating that, if realisable, they would deliver faces roughly 45 degrees from their constituent faces, thus classed as matching them. We study the performance of three methods for MII generation---gallery search, image space morphing, and representation space inversion---and show that the latter two realise the ideal well enough to produce effective attacks, while the former could succeed but only with an implausibly large gallery to search. Gallery search and inversion MIIs depend on having access to a facial comparator for optimisation, but our results show that these attacks can still be effective when attacking disparate comparators; securing a deployed comparator is thus an insufficient defence.


I Introduction

Biometric identifiers, such as face images, are commonly employed to verify the identity of individuals—capitalising on the distinctiveness of the identifier, its stability of appearance, and its fixed linkage with the identity. Recent advances in deep learning have enabled the development of algorithms that compare live and stored reference face images to automate this verification [1, 2, 3, 4, 5]. Increasingly such algorithms are being used for access control (e.g. international border crossing, banking access, and facility entry) [6].

An adversary attempting to defeat a face-based identity verification system has three options [7]: (i) avoid the system; (ii) disguise the live face; or (iii) poison the stored face-identity pairs. This paper is concerned with poisoning attacks.

Consider a database of stored face images. The simplest form of poisoning attack is to contrive an incorrect pairing: say the face image of A paired with the identity of B. This allows individual A to pass as individual B, which can serve either the aim of pretending to be B or that of avoiding notice as A. Achieving this incorrect pairing should be difficult though, as generally A and B will look different. Furthermore, great care is typically taken when establishing the pairings in the database—for example, by requiring trusted individuals to attest that the stored image is a good likeness of the person with whom it is to be paired.

A subtle way to bypass the safeguards against poisoning makes use of multiple-identity images (MIIs), which are face images that resemble more than one person. Assuming, for now, that MIIs are possible, we explain how they would be used. Suppose A wishes to access some facility without their identity being detected. A works with an accomplice B to prepare an MII resembling both A and B, and arranges for it to be stored in the database paired with the identity of B. This storage could be achieved by creating a new entry in the database or by updating an old one for B—in either case the safeguards against poisoning are circumvented since the MII resembles B. Having poisoned the database, A will now be able to access the facility undetected, as their live face will match the stored MII and they will be verified as B.

Verification systems are used in a variety of scenarios with differing restrictions on the capture of reference and live images. The most restricted type is exemplified by passport photographs, which can be captured only under constrained conditions of pose, illumination and expression. At an intermediate level are systems that use face photographs captured by a webcam or mobile phone. The least restricted type are walk-past systems, which collect face images without subject participation. In this paper, we examine the issue of MIIs where the verification system is trained for use in the unconstrained setting.

The aims of this paper are: (i) to explain why MIIs are possible in principle; (ii) to show that MIIs can be realised well enough to be used practically; and (iii) to show that securing a face comparator does not prevent these attacks.

Our analyses will make use of the representation spaces that are used by automated methods of face verification. For the first aim, we will investigate the distribution of faces within a representation space. For the second, we will present one method that constructs MIIs directly in image space, and a second method that constructs MIIs first in representation space and then in image space. For the third, we will show that the representation spaces of different comparators are sufficiently similar to permit the transfer of MIIs.

In Section II we give an overview of related work. In Section III we introduce the datasets and face comparators that we experiment with. In Section IV we offer an explanation as to why face comparators are theoretically vulnerable to MIIs, and consider the behaviour and performance of a hypothetical ideal method for MII generation. Subsequently, in Section V, we outline three methods for either finding, constructing, or synthesising MIIs—showing that real MIIs can be generated that are sufficiently close to ideal MIIs to be effective attacks. In Section VI we evaluate the effectiveness of generated MIIs, as well as the transferability of MIIs to novel comparators. In Section VII we discuss the inherent vulnerability of face comparators to MIIs in light of our findings; we conclude in Section VIII.

II Background

In this section, we provide an overview of input attack generation and input attack detection.

II-A Input Attack Generation

Here we review two distinct paradigms for input attack generation: image space and representation space. Both were originally developed for innocuous tasks in the creative industries, but have subsequently been re-purposed for MII generation.

II-A1 Image Space

Image morphing is the process of generating a sequence of photorealistic interpolations between two images [8]. Seminally, [9] demonstrated the vulnerability of two commercially available face verification systems to MII attacks generated using morphing. The morphs in that case were constructed by geometrically warping [8] and then colour interpolating two face images. Attacks of this type have been shown to be highly effective [10], even deceiving human experts [11].

The warping step is crucial, and reliant on precisely identified common landmarks. This approach to face morphing thus works best when both images are frontally aligned, but even then there are often cues to manipulation. Several works have attempted to improve the visual quality of facial morphs, in particular, by: manual replacement and movement of correspondence points [12]; manual retouching [9]; splice morphing [13]; the restrictive selection of similar input images [14]; and Poisson image editing [15]. Although successful in creating an image that matches two identities, at present image space morphing still tends to leave artefacts that cue the manipulation [16, 12].

II-A2 Representation Space

A very different approach to image generation utilises algorithms that can synthesise a realistic image from a representation space encoding. Significantly, [17, 18] found that when inverting deep representations, several layers preserve accurate image-specific information. For example, [19] utilised a deconvolutional neural network, and adversarial training, to generate high-quality images given high-level representations. Similarly, [20] proposed to leverage approximate models of facial appearance for training a generative model under adversarial supervision.

With attribute-conditioned image modification in mind (e.g. changing the age or gender of a face), invertible conditional generative adversarial networks [21] have also been proposed. They task an encoder network with learning an inverse mapping from an image to both its latent and conditional attribute representation, thus permitting the modification of images by varying the conditional information. More recently, [22] proposed an encoder-decoder scheme that forces the latent representation to be invariant to the attributes. However, [23] argue that decoupling the attribute values and the salient image information is detrimental to attribute editing, since the attributes represent the traits of face images. Therefore, rather than enforcing invariance of the latent representation to the attributes, [23] proposed an attribute classification constraint on the generated images, such that editing is wholly localised to the attributes one wishes to alter. Nevertheless, a principal issue with face editing is the lack of permanence with respect to the underlying identity of the original face image. Consequently,  [24] recently proposed an identity-preserving conditional generative adversarial network for face ageing synthesis, which ensures that the high-level representation of a synthesised aged face remains close to the representation of the original face.

Profiting from these substantial advances, it was recently shown that MIIs could be synthesised [25] by following the approach of [19]. However, [26] conjectured that the attacks optimised in [25], for a specific face comparator, were unlikely to generalise to dissimilar face comparators.

II-B Input Attack Detection

Unsurprisingly, the vulnerability of face comparators, in particular to image space face morphing, has fostered new research into targeted defences [27, 28, 29, 16, 30], which can be considered a specific problem within the field of general image tampering detection [31]. Broadly speaking, the defences are either no-reference or differential [26]. No-reference methods process single images, e.g. an image submitted for enrolment to be stored, whereas differential methods compare an image captured by a trusted source to its supposed corresponding stored database image.

Whilst we briefly review detection approaches below (a comprehensive survey on face recognition systems under morphing attacks can be found in [26]), each proposed method suffers from the same underlying problem: they generalise poorly when the training and testing distributions differ [32, 33]. In other words, they attempt to detect a specific cue using a supervised training approach; if an attack avoids presenting this cue, then it will go undetected, even if the MIIs present other clear cues to their nature.

II-B1 No-reference Detection

Texture descriptors based on hand-crafted shallow representations (for example, Local Binary Patterns (LBP), Binarised Statistical Image Features (BSIF), the Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), and Histograms of Oriented Gradients (HOG)) have been shown to be effective for detecting image tampering, particularly when combined with a supervised classifier [30, 34, 35, 28]. However, the descriptors used are typically of an extremely high dimensionality. Consequently, [27] proposed deep supervised neural networks trained to detect face morphs, which result in more manageable and expressive representations.

Amongst other approaches, [16] analysed changes in the noise patterns present within an image, since atypical changes imply that an image was captured by multiple devices—i.e. that the image is likely to be a composite of multiple face images. Others have proposed to detect JPEG double-image compression [13, 36], premised on compression being an indicator of manipulation. In [12], face morphing was found to affect the physical validity of skin illumination.

II-B2 Differential Detection

Far fewer works concentrate on differential detection. In [29], a de-morphing strategy was proposed, whereby a live image is subtracted from its stored version in an attempt to invert the morphing process, such that if the stored image is an MII, then the subtraction reveals the faces of two distinct individuals. However, the method depends heavily on the conditions in which the live image was captured [26]. Differently, [37] compare landmark positions from live captured face images to stored versions, measuring the angles between corresponding aligned landmark position vectors. Nevertheless, due to high intra-class variation, the method cannot reliably detect signs of manipulation.

III Preliminaries

To begin, we introduce the datasets and face comparators that we experiment with, and end by formally defining a measure of MII attack success.

III-A Datasets

We use three disjoint face image datasets, namely VGGFace2 (VGGF2) [4], Color FERET (C-FERET) [38], and Flickr-Faces-HQ (FFHQ) [39].

VGGF2 [4] is an unconstrained face image dataset, consisting of 3.31M loosely cropped JPG images of 9,131 persons, hence roughly 362 face images per person. The face images were downloaded from Google Image Search, and vary in pose, illumination, size and quality. The intra-class variation of these images is representative of the variability that will be encountered by a verification system deployed in an unconstrained environment.

C-FERET [38] is a constrained, high-quality face image dataset gathered in a semi-controlled environment, consisting of labelled colour and greyscale PPM images. The face images were collected over multiple sessions between 1993 and 1996, and vary in yaw angle, eyewear, facial hair, hairstyle, and ethnicity. The intra-class variation of these images is much smaller than that of VGGF2.

FFHQ [39] is an unconstrained, high-quality face image dataset, consisting of 70K aligned and cropped unlabelled colour PNG images. The images were downloaded from Flickr, and vary with regard to apparent age, hairstyle, facial hair, ethnicity, eyewear, headwear, apparel, and image background.

Pre-processing

For all images, face alignment was performed using the dlib [40] HOG-based face detector and a facial landmark predictor model [41]. The process retains only the most confident detection per face image. If a face was not detected, then the image was discarded. For a detected face, the landmark predictor utilised an ensemble of regression trees to estimate the positions of 68 facial landmarks, which were then used to transform each face to a canonical alignment. Alignment ensures that: (i) the faces are centred; (ii) the eye centroids lie on a horizontal line; and (iii) the faces are similar in scale. Next, we resized the images to a common resolution and rescaled their pixel values to a common range. For C-FERET, we discarded all face images with a pose label corresponding to a nonzero yaw angle and any persons with fewer than two face images. For each person with more than two face images we kept only two.
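As an illustration of this pre-processing, the sketch below aligns a single image using dlib and OpenCV. It is a minimal sketch, not the paper's exact pipeline: the 68-point shape predictor file and the 224-pixel output size are our assumptions.

```python
# Minimal face-alignment sketch (detect, take the most confident face, level the eyes, resize).
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

def align_face(image_bgr, out_size=224):  # out_size is an assumption; the paper's value is not given here
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    rects, scores, _ = detector.run(gray, 1)
    if len(rects) == 0:
        return None  # no face detected: the image would be discarded, as described above
    best = rects[int(np.argmax(scores))]  # keep only the most confident detection
    pts = np.array([[p.x, p.y] for p in predictor(gray, best).parts()], dtype=np.float32)
    left_eye, right_eye = pts[36:42].mean(axis=0), pts[42:48].mean(axis=0)  # 68-point eye indices
    d = right_eye - left_eye
    angle = float(np.degrees(np.arctan2(d[1], d[0])))        # rotate so the eye centroids are horizontal
    centre = tuple(((left_eye + right_eye) / 2).tolist())
    M = cv2.getRotationMatrix2D(centre, angle, 1.0)
    levelled = cv2.warpAffine(image_bgr, M, (image_bgr.shape[1], image_bgr.shape[0]))
    return cv2.resize(levelled, (out_size, out_size))        # centring and cropping details omitted
```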

In summary, we were left with: (i) the VGGF2 train set partition (VGGF2-train); (ii) the VGGF2 evaluation set partition (VGGF2-eval); (iii) the FFHQ images; and (iv) 1,986 C-FERET images of 993 persons. Examples of the pre-processed face images are shown in Figure 1; and Figure 2 compares the intra-class variability of VGGF2 and C-FERET.

Fig. 1: Examples of the pre-processed aligned face images: VGGFace2 (top row); Color FERET (middle row); and FFHQ (bottom row).
(a) VGGF2
(b) C-FERET
Fig. 2: Examples depicting the difference in intra-class variation between the VGGF2 and C-FERET datasets. Each column in Figures 2(a) and 2(b) shows two images (one considered a reference image and the other a live image) from a unique identity. The two images per identity clearly vary much more for VGGF2 than for C-FERET.

Dataset Usage

Note that all MIIs were constructed for face image pairs sampled from the C-FERET dataset. We used VGGF2-eval for face verification threshold determination, and VGGF2-train as the gallery for the gallery search attack (Section V-A), which uses a real image (sampled from VGGF2-train) as the MII. FFHQ was used solely by the representation space MII attack method, for learning how to generate face images given a representation space encoding.

III-B Face Comparators

In this work we employ three publicly available pre-trained softmax-induced facial recognition neural networks: SENet-128 [42], SENet-256 [42], and LtNet [5]. (SENet-128 and SENet-256 are available at https://github.com/ox-vgg/vgg_face2, listed under the names SE-ResNet-50-128D and SE-ResNet-50-256D, respectively; LtNet is available at https://github.com/AlfredXiangWu/LightCNN, listed under the name LightCNN-29v2.) All three networks output representations that can be used for face comparisons.

SENet [42] integrates squeeze-and-excitation (SE) blocks into a standard 50-layer residual network (ResNet) [43] architecture. SE units dynamically recalibrate channel-wise feature responses by explicitly modelling the relationships between channels, resulting in greater overall representational power. Both SENet-128 and SENet-256 were first pre-trained on the MS-Celeb-1M [44] dataset, and then fine-tuned on the original VGGF2 [4] dataset using a softmax-based loss. At this point, SENet-128 and SENet-256 had exactly the same weights.

A final stage of training was performed by projecting from the penultimate layer to either 128 dimensions (SENet-128) or 256 dimensions (SENet-256), and then fine-tuning the entire model on the same classification task. Hence SENet-128 and SENet-256 have different weights at every layer.

LightCNN [5] is a convolutional architecture consisting of 29 layers, utilising ResNet-inspired residual blocks, but using Max Feature Map (MFM) activation functions rather than ReLU [45] nonlinearities, implementing a variant of Maxout [46]. MFM enforces a sparse relationship between consecutive layers by combining two feature maps and outputting their element-wise maximum. Moreover, no batch normalisation (BN) is used within the residual blocks. The penultimate layer containing the face representation is fully-connected with an MFM activation, as opposed to a global average pooling layer, and yields 256-dimensional representations. The network was trained on the CASIA-WebFace [47] and MS-Celeb-1M [44] datasets with a fully-connected classification layer over the training identities. We denote this model as LtNet.

III-B1 Face Representations

Consider a facial recognition neural network trained over a set of known identities. Ignoring the classification layer, one can take some intermediate layer as a representation of facial appearance. Following normal practice, we will use the layer immediately before the classification layer as giving an encoding of the input image into a representation space. Again following normal practice, we normalise the vector of outputs from this layer, so that the network performs a mapping f : X → S^{n−1}, where X is the image space, n is the number of units in the final layer, and S^{n−1} is the unit (n−1)-sphere. For compactness we denote the representation of an image x as f(x).

III-B2 Representation Distances

Given two face representations, f(x_1) and f(x_2), we use angular distance to quantify their visual disparity, which for unit-length vectors is defined as:

d(x_1, x_2) = \tfrac{180}{\pi}\,\arccos\!\big( f(x_1)^\top f(x_2) \big) \qquad (1)
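As a concrete illustration, Equation (1) can be computed directly from two L2-normalised embeddings. The sketch below assumes NumPy; the random vectors in the example merely stand in for comparator outputs.

```python
import numpy as np

def angular_distance(r1, r2):
    """Angular distance, in degrees, between two unit-length face representations (Eq. 1)."""
    cos = float(np.clip(np.dot(r1, r2), -1.0, 1.0))  # clip guards against rounding just outside [-1, 1]
    return float(np.degrees(np.arccos(cos)))

# Example with random unit vectors standing in for comparator outputs.
r1 = np.random.randn(128); r1 /= np.linalg.norm(r1)
r2 = np.random.randn(128); r2 /= np.linalg.norm(r2)
print(angular_distance(r1, r2))  # random high-dimensional unit vectors land near 90 degrees
```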

III-B3 Performance Metrics

Let P and N denote the sets of face representation pairs that are matching (same identity) and non-matching (different identities), respectively. The evaluation of a face comparator is typically performed by determining a distance threshold t for the binary classification of samples drawn from P and N.

The true acceptance rate (TAR) for t is defined as the fraction of P that have a distance less than t. Conversely, the false acceptance rate (FAR) is defined as the fraction of N that have a distance less than t. A two-dimensional receiver operating characteristic (ROC) curve depicts the relationship between the FAR and TAR as t varies, with the area under the curve (AUROC) equal to the probability that a random matching pair has a lower angular distance than a random non-matching pair. Furthermore, the strength of a face comparator is typically assessed by computing the TAR at some specific FAR.
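A minimal sketch of threshold determination follows; the toy matching and non-matching distance values are illustrative stand-ins, not the measured distributions.

```python
import numpy as np

def threshold_at_far(nonmatch_dists, far):
    """Angular-distance threshold t whose FAR on the non-matching distances is (approximately) `far`."""
    return float(np.quantile(np.asarray(nonmatch_dists), far))

def tar_at_threshold(match_dists, t):
    """Fraction of matching pairs accepted, i.e. with distance below t."""
    return float(np.mean(np.asarray(match_dists) < t))

# Toy example: distances (in degrees) standing in for the P and N sets.
rng = np.random.default_rng(0)
N_dists = rng.normal(90, 5, 100_000)   # non-matching pairs, roughly centred at 90 degrees
P_dists = rng.normal(55, 10, 100_000)  # matching pairs, modestly closer than orthogonal
t = threshold_at_far(N_dists, 1e-3)    # threshold giving FAR = 0.1%
print(t, tar_at_threshold(P_dists, t))
```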

III-B4 Comparator Performance

To assess the performance of SENet-128, SENet-256 and LtNet, we utilise the VGGF2-eval dataset, which consists of faces of persons unused during the training of any of the comparators. For threshold determination, we compute the TAR at several FARs—using all possible image pair combinations—which results in a set of five thresholds. The probability density plots in Figure 3 show the angular distance of pairs sampled from either P or N, with the thresholds of each comparator overlaid. According to Table I, the angular distances between pairs, in particular those sampled from N, exhibit a positive linear correlation across the face comparators. Thus, to a degree, each comparator is performing in a roughly similar manner, although the arrangement of the face representations may differ from comparator to comparator.

TABLE I: Pearson correlation coefficients between angular distances according to the three face comparators (SENet-128, SENet-256 and LtNet).
(a) SENet-128
(b) SENet-256
(c) LtNet
Fig. 3: Angular distance probability density plots of face representation pairs in P (pale blue) and N (dark blue), where the pairs are sampled from VGGF2-eval. Overlaid are five different thresholds t (vertical lines), based on the TAR at five different FARs.

Table II reports the performance, showing that the three comparators perform at similar levels. Following FRONTEX (the European Agency for the Management of Operational Cooperation at the External Borders of the Member States of the European Union) guidelines, it is recommended that deployed facial verification systems in automated border control scenarios employ a threshold that gives a FAR of 0.1%. We will primarily focus on this threshold, denoted t, throughout the remainder of this work. At this threshold, LtNet attains the highest TAR of the three comparators, which implies that LtNet is the strongest comparator.

TABLE II: Performance summary (in %) of the three face comparators on the hold-out VGGF2-eval set: AUROC and TAR at a range of FARs. Italics indicate the best performance per column.

III-C Performance of a Multiple-Identity Image Attack

Consider an adversary (A) and accomplice (B), reference images x_A^ref and x_B^ref of them, and live images x_A^live and x_B^live, where the reference images are used to generate the MII, and the live images are those captured at (for example) the entrance to a facility, to be compared against the MII. Let G be a method to generate MIIs, so that m = G(x_A^ref, x_B^ref) is the attack image. The attack will be successful if and only if

d(m, x_B^{live}) \le t \qquad (2)

and

d(m, x_A^{live}) \le t, \qquad (3)

one condition representing the MII being accepted as a good likeness suitable for database storage, and the other representing identity verification at the facility. Using an MII angular distance defined as

D(m) = \max\big\{ d(m, x_A^{live}),\; d(m, x_B^{live}) \big\}, \qquad (4)

the success condition can be rewritten as:

D(m) \le t. \qquad (5)

We define the overall success rate of G as the fraction of successful attacks for randomly chosen pairs of distinct individuals.
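This success criterion maps directly onto code. The sketch below reuses angular_distance from the earlier sketch; f (image to representation) and generator (the MII method G) are placeholders for whichever comparator and attack are being evaluated.

```python
def mii_distance(f, mii_img, live_a, live_b):
    """Worst-case angular distance of the MII to the two live images (Eq. 4)."""
    r = f(mii_img)
    return max(angular_distance(r, f(live_a)), angular_distance(r, f(live_b)))

def success_rate(f, generator, pairs, t):
    """Fraction of (adversary, accomplice) pairs whose generated MII satisfies Eq. (5)."""
    hits = 0
    for (ref_a, live_a), (ref_b, live_b) in pairs:  # each person contributes a reference and a live image
        mii = generator(ref_a, ref_b)
        hits += mii_distance(f, mii, live_a, live_b) <= t
    return hits / len(pairs)
```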

IV Why Are Multiple-Identity Images Possible?

We now consider the behaviour and performance of a hypothetical ideal method for MII generation. In later sections we will examine how well existing methods of MII generation realise this ideal. Breaking the problem in two like this will help in understanding why it is that MII attacks are possible.

The ideal MII attack method G* needs to minimise the MII distance

D(m) = \max\big\{ d(m, x_A^{live}),\; d(m, x_B^{live}) \big\} \qquad (6)

to maximise its rate of success; and it must do this with reference only to x_A^ref and x_B^ref, without direct knowledge of x_A^live and x_B^live, which do not yet exist at the time the MII is generated. The best that can be done is to assume that x_A^live ≈ x_A^ref and x_B^live ≈ x_B^ref, and thus aim to minimise the approximately equal MII distance

D^{ref}(m) = \max\big\{ d(m, x_A^{ref}),\; d(m, x_B^{ref}) \big\}. \qquad (7)

This proxy objective is easy to solve: G* must generate an image whose representation is the spherical midpoint of the representations of the reference images, i.e.

f\big(G^*(x_A^{ref}, x_B^{ref})\big) = \frac{f(x_A^{ref}) + f(x_B^{ref})}{\big\lVert f(x_A^{ref}) + f(x_B^{ref}) \big\rVert}, \qquad (8)

which gives us:

D^{ref}\big(G^*(x_A^{ref}, x_B^{ref})\big) = \tfrac{1}{2}\, d(x_A^{ref}, x_B^{ref}). \qquad (9)

Observe that the MII distances in this equation can be computed without actually implementing the ideal MII generator G*.

We have computed the MII distances for the ideal generator G*. In steps: for each identity we select two distinct images, a reference image x^ref and a live image x^live; we randomly pair individuals A and B to act as adversary and accomplice; for each pairing we compute the spherical midpoint of f(x_A^ref) and f(x_B^ref), the representation of the ideal MII based on the reference images; using that midpoint we compute the ideal MII distance as the maximum of its angular distances to f(x_A^live) and f(x_B^live).
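For concreteness, the midpoint of Equation (8) and the relation of Equation (9) can be checked numerically; the sketch reuses angular_distance from the earlier sketch and uses random unit vectors as stand-ins for representations.

```python
import numpy as np

def spherical_midpoint(r1, r2):
    """Unit vector halfway (in angle) between two unit-length representations (Eq. 8)."""
    m = r1 + r2
    return m / np.linalg.norm(m)

# Sanity check of Eq. (9): the midpoint sits at half the pair's angular distance from each endpoint.
rng = np.random.default_rng(1)
r1 = rng.normal(size=256); r1 /= np.linalg.norm(r1)
r2 = rng.normal(size=256); r2 /= np.linalg.norm(r2)
mid = spherical_midpoint(r1, r2)
print(angular_distance(r1, r2))                              # ~90 degrees for random unit vectors
print(angular_distance(mid, r1), angular_distance(mid, r2))  # both ~45 degrees, i.e. half the pair distance
```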

The distribution of ideal MII distances is shown in Figure 4 for attacks against all three face comparators, and when the reference and live images are drawn either from VGGF2 or from C-FERET. Each plot also shows the distribution of angular distances between matching (P) and non-matching (N) pairs from the same dataset. Additionally, the plots show the distribution of halved non-matching distances, which, from Equation (9), the ideal MII distances should approximately equal. Finally, the plots show the threshold t that MII distances need to be below for an attack to succeed. There is much to see in Figure 4:

  • In all plots the halved non-matching distances (dashed) are fully below the threshold. Thus if an attack could achieve these distances it would always be successful.

  • The ideal MII distances (red) are always greater than the halved non-matching distances, since the MIIs are at the midpoints of the reference images, but are compared to the live images.

  • For C-FERET images, for which intra-class variation is small, the ideal MII distances are only slightly larger than the halved non-matching distances, so the great majority of ideal attacks would succeed.

  • For VGGF2 images, for which intra-class variation is larger, the ideal MII distances are substantially larger than the halved non-matching distances, but still much less than the non-matching distances, resulting in lower attack success rates.

In summary: if an adversary can ensure that reference and live images are very similar, then MII attacks will very likely succeed if the generator used is near ideal. The attack exploits the fact that a comparator computes distances between face images that are a compound of real face differences and incidental differences due to ageing, hairstyle, etc. An adversary can take advantage of this by constructing an MII which minimises incidental differences from its reference images and whose real difference from its reference images is only half the normal distance for a pair of mismatched faces.

It remains to be explained why an MII that is half the normal mismatch distance from its reference images is so clearly below the threshold distance; equivalently, why non-matching distances are mostly above the threshold and halved non-matching distances mostly below it. The answer is that the non-matching distribution is tight, and the position of the threshold (determined by the desired FAR) is necessarily in its left tail. Change either fact and MIIs would not succeed as often.

(a) SENet-128; VGGF2
(b) SENet-256; VGGF2
(c) LtNet; VGGF2
(d) SENet-128; C-FERET
(e) SENet-256; C-FERET
(f) LtNet; C-FERET
Fig. 4: Angular distance probability density plots of the MII distances for an ideal attack method (red); the distributions of angular distances between matching (pale blue) and non-matching (dark blue) pairs; and the distribution of non-matching angular distances divided by two, corresponding to the case where the live images equal the reference images (dashed grey). Figures 4(a)-(c) use the unconstrained VGGF2 dataset; Figures 4(d)-(f) use the constrained C-FERET dataset of frontally aligned images. Overlaid is the threshold t (vertical line) based on the TAR at FAR 0.1%, computed using matching and non-matching pairs sampled from VGGF2-eval; this threshold is used for all plots.

On the narrowness of the non-matching distance distribution, observe that it is centred around 90°, the angular distance for orthogonal points. This indicates that face representations are very widely distributed on the unit sphere S^{n−1}. Indeed, it can be shown that, for points uniformly distributed on S^{n−1}, the angular distances are approximately normally distributed with mean 90° and standard deviation of roughly (180/π)/√n degrees. The standard deviations of the non-matching distances for the three comparators are slightly larger, indicating that face representations are not fully uniformly distributed over S^{n−1}, but are sufficiently uniform to give the tight distribution observed.
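A quick numerical check of this claim about uniformly distributed points (using no face data, only random vectors of the comparators' dimensionalities):

```python
import numpy as np

# Monte Carlo check: angular distances between uniformly random points on the unit (n-1)-sphere
# concentrate around 90 degrees with standard deviation roughly (180/pi)/sqrt(n) degrees.
rng = np.random.default_rng(0)
for n in (128, 256):
    x = rng.normal(size=(100_000, n))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # normalised Gaussians are uniform on the sphere
    y = rng.normal(size=(100_000, n))
    y /= np.linalg.norm(y, axis=1, keepdims=True)
    ang = np.degrees(np.arccos(np.clip(np.sum(x * y, axis=1), -1.0, 1.0)))
    print(n, ang.mean(), ang.std(), np.degrees(1.0 / np.sqrt(n)))  # empirical mean/std vs. predicted std
```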

On the position of the threshold, observe that setting it lower would substantially impact the TAR, leading to unusable systems. This is the case because the comparators have achieved an acceptable FAR/TAR trade-off with matching distances that are only modestly below orthogonal, rather than near zero. In plain terms, the current generation of comparators work because the representations of different identity faces are always close to orthogonal, and the representations of same identity faces are only slightly closer than orthogonal. This leaves them vulnerable to attack, as an MII can be much closer to its constituent faces than orthogonal.

V Generating Multiple-Identity Images

Here we outline three methods for either finding, constructing, or synthesising MIIs. Our aim is to show that real MIIs can be generated that are sufficiently close to ideal MIIs that they are effective attacks.

V-A Finding Multiple-Identity Images by Gallery Search

The simplest form of attack uses a real face image as MII, chosen from a gallery to be as close as possible to both reference images.

Formally, let x_A^ref and x_B^ref be the reference images of the adversary and accomplice, and let Γ be a gallery of face images to which they have access. We define the gallery search MII (GS-MII) generator as picking the image x in Γ such that max{ d(x, x_A^ref), d(x, x_B^ref) } is minimised, i.e. the gallery image whose representation is as close as possible to both reference representations. Note that this attack makes use of representation distances, so the adversary needs access to a comparator to compute them—and this comparator may be the same as or different to the one attacked.
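A direct (unoptimised) sketch of this search is given below, reusing angular_distance from the earlier sketch; in practice the gallery representations would be pre-computed and the search vectorised.

```python
import numpy as np

def gallery_search_mii(f, gallery_imgs, ref_a, ref_b):
    """Pick the gallery image whose representation minimises the worst-case angular distance
    to the two reference representations (a GS-MII)."""
    ra, rb = f(ref_a), f(ref_b)
    best_img, best_dist = None, np.inf
    for img in gallery_imgs:
        r = f(img)
        d = max(angular_distance(r, ra), angular_distance(r, rb))
        if d < best_dist:
            best_img, best_dist = img, d
    return best_img, best_dist
```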

V-B Constructing Multiple-Identity Images in Image Space

Face morphing is the image space process of geometric warping and then colour interpolating two distinct face images into a single composite [9, 32].

We use the following standard process to construct image space MIIs (IS-MIIs). Given a pair of frontally aligned face images x_A^ref and x_B^ref, an IS-MII is constructed as follows: (i) corresponding facial landmark position vectors are determined in x_A^ref and x_B^ref using the dlib landmark detector [41] (the detector yields 68 facial landmarks; in addition, we define evenly spaced landmarks on the image boundary); (ii) Delaunay triangulation [48] is performed on the average of the two sets of landmark position vectors; (iii) an affine transformation is applied to the image within each triangle to transform x_A^ref and x_B^ref to the averaged positions, which results in two warped images; and (iv) the two warped images are averaged pixel-wise to give an IS-MII. Note that this method for generating MIIs does not require use of a comparator.
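The sketch below illustrates this pipeline with dlib and OpenCV. It is an illustrative approximation rather than the paper's exact implementation: the 68-point predictor file, the choice of eight boundary points, and the per-triangle warping strategy are our assumptions.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

def landmarks(img):
    """68 dlib landmarks plus 8 boundary points (the boundary count is our choice, not the paper's)."""
    rect = detector(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), 1)[0]
    pts = np.array([[p.x, p.y] for p in predictor(img, rect).parts()], dtype=np.float32)
    h, w = img.shape[:2]
    border = np.array([[0, 0], [w // 2, 0], [w - 1, 0], [0, h // 2], [w - 1, h // 2],
                       [0, h - 1], [w // 2, h - 1], [w - 1, h - 1]], dtype=np.float32)
    return np.vstack([pts, border])

def is_mii(img_a, img_b):
    """Image-space MII: warp both faces onto the averaged landmark positions, then average pixels."""
    pa, pb = landmarks(img_a), landmarks(img_b)
    avg = (pa + pb) / 2.0
    h, w = img_a.shape[:2]
    subdiv = cv2.Subdiv2D((0, 0, w, h))                    # Delaunay triangulation of averaged landmarks
    for p in avg:
        subdiv.insert((float(p[0]), float(p[1])))
    out = np.zeros_like(img_a, dtype=np.float32)
    for tri in subdiv.getTriangleList().reshape(-1, 3, 2):
        if tri.min() < 0 or tri[:, 0].max() >= w or tri[:, 1].max() >= h:
            continue                                       # skip triangles touching Subdiv2D's virtual vertices
        idx = [int(np.argmin(np.sum((avg - v) ** 2, axis=1))) for v in tri]  # vertex -> landmark index
        dst = avg[idx].astype(np.float32)
        mask = np.zeros((h, w), np.uint8)
        cv2.fillConvexPoly(mask, dst.astype(np.int32), 1)
        for img, src in ((img_a, pa[idx]), (img_b, pb[idx])):
            M = cv2.getAffineTransform(src.astype(np.float32), dst)
            warped = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_REFLECT)
            out[mask == 1] += 0.5 * warped[mask == 1]      # colour-interpolate the two warped images
    return np.clip(out, 0, 255).astype(np.uint8)
```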

V-C Synthesising Multiple-Identity Images via Representation Space

Assuming a face comparator has learnt disentangled high-level representations that encode the identity of a face image, then in principle we can learn an inverse mapping. That is, we can learn to generate face images from abstract face representations. Having learnt such a mapping, we can synthesise MIIs by first constructing the ideal midpoint of the reference image representations, and then synthesising a face image corresponding to that midpoint.

Concisely, let g be a deconvolutional decoder, i.e. a learnt mapping g : S^{n−1} → X, and let r̄ denote the spherical midpoint of f(x_A^ref) and f(x_B^ref), as given by Equation (8). We define a synthesised MII as g(r̄), and will refer to it as a representation space MII (RS-MII).

V-C1 Model

To this end, we learn the decoder g, which is tasked with reproducing a face image x given its representation f(x), where f is a face comparator with fixed weights. Inspired by [19], we learn g using a combination of loss functions:

\mathcal{L} = \lambda_{pix}\,\mathcal{L}_{pix} + \lambda_{adv}\,\mathcal{L}_{adv} + \lambda_{feat}\,\mathcal{L}_{feat}, \qquad (10)

where the \lambda are weights. For the autoencoding pixel-wise loss \mathcal{L}_{pix}, we utilise the mean absolute deviations:

\mathcal{L}_{pix} = \mathbb{E}_{x}\big[\, \lVert x - g(f(x)) \rVert_1 \,\big]. \qquad (11)

Whilst the pixel-wise loss enforces low-frequency correctness, it generally fails to produce satisfactory results in terms of high-frequency features. Therefore, we also learn an image patch convolutional discriminator (PatchGAN) [49], denoted h, which attends to localised high-frequency details. The PatchGAN moves across an image, classifying patches as being either real or reconstructed. The discriminator and the decoder are trained simultaneously. This process forces the decoder to attempt to produce reconstructions indistinguishable from real data. Concretely, the discriminator parameters are learnt by minimising a least squares loss [50]:

\mathcal{L}_{dis} = \mathbb{E}_{x}\big[ (h(x) - 1)^2 \big] + \mathbb{E}_{x}\big[ h(g(f(x)))^2 \big], \qquad (12)

with the decoder trained to minimise:

\mathcal{L}_{adv} = \mathbb{E}_{x}\big[ (h(g(f(x))) - 1)^2 \big]. \qquad (13)

Lastly, the feature loss \mathcal{L}_{feat} ensures that the representation of a decoded image matches the representation of the original image:

\mathcal{L}_{feat} = \mathbb{E}_{x}\big[\, \lVert f(x) - f(g(f(x))) \rVert_2^2 \,\big]. \qquad (14)
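A condensed PyTorch sketch of one training step under this objective is given below; the network modules, optimisers, and λ weights are placeholders rather than the paper's settings, and the comparator f is assumed to be frozen.

```python
import torch
import torch.nn.functional as F

def training_step(f, g, h, opt_g, opt_h, x, lam_pix=1.0, lam_adv=1.0, lam_feat=1.0):
    """One step for decoder g and PatchGAN discriminator h (Eqs. 10-14); lambda weights are placeholders."""
    with torch.no_grad():
        r = f(x)                                   # representations from the frozen comparator
    x_hat = g(r)                                   # reconstruction from the representation

    # Discriminator: least squares loss, real patches -> 1, reconstructed patches -> 0 (Eq. 12).
    pred_real, pred_fake = h(x), h(x_hat.detach())
    loss_h = F.mse_loss(pred_real, torch.ones_like(pred_real)) + \
             F.mse_loss(pred_fake, torch.zeros_like(pred_fake))
    opt_h.zero_grad(); loss_h.backward(); opt_h.step()

    # Decoder: pixel (Eq. 11), adversarial (Eq. 13) and feature (Eq. 14) terms.
    pred_fake_g = h(x_hat)
    loss_pix = F.l1_loss(x_hat, x)
    loss_adv = F.mse_loss(pred_fake_g, torch.ones_like(pred_fake_g))
    loss_feat = F.mse_loss(f(x_hat), r)            # gradients flow through f into g; f's weights stay fixed
    loss_g = lam_pix * loss_pix + lam_adv * loss_adv + lam_feat * loss_feat
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return float(loss_g), float(loss_h)
```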

V-C2 Implementation

The encoder network is fixed and is defined as SENet.

The decoder g is built from blocks, each consisting of a ConvTranspose-BN-ReLU layer followed by a Conv-BN-ReLU layer. After the last block, a Conv-Tanh layer is applied, producing a 3-channel output, so that g maps an n-dimensional representation to a full-resolution colour face image.

With respect to the discriminator h, we utilise a PatchGAN composed of Conv-BN-LeakyReLU layers, which classifies overlapping image patches as either real or reconstructed.

V-C3 Model Training

The weights in Equation (10) were set empirically. Training was performed on the FFHQ dataset, split into training and validation partitions. The training set was augmented by on-the-fly horizontal mirroring and, for the discriminator, image pixels were normalised to a fixed range. The Adam solver was used with a fixed learning rate and mini-batches, and the model was trained for several epochs.

To stabilise training, the discriminator was updated using a history of previously reconstructed images, as opposed to only the most recent batch. For inputs to the discriminator, we also used additive Gaussian white noise, whose magnitude was linearly decayed to zero over the initial epochs of training.

V-D Multiple-Identity Image Methods: A Comparison

There are certain caveats associated with each MII attack method.

  • GS-MIIs rely on access to a gallery and a face comparator, in order to find the best gallery image for a given reference pair. For evaluation, we define the gallery as the entire VGGF2-train set of face images, and assume that the adversary has access to the SENet comparator.

  • IS-MIIs require frontally aligned face images, such that corresponding landmarks between the two reference images can be found.

  • RS-MIIs depend on having access to a face comparator, so as to train a decoder. In our evaluation we assume that the adversary has access to the SENet comparator for this purpose.

In Figure 5, we show examples of MIIs generated using the previously described methods, where the reference pairs are sampled from the C-FERET dataset of frontally aligned face images. Qualitatively speaking, the GS-MIIs tend to resemble the adversary and accomplice at only a rudimentary level, and are sometimes far from satisfactory, in particular when the pair differ in ethnicity (e.g. column 2), gender (e.g. columns 4, 5 and 15) or both (e.g. columns 7, 11 and 12). The overarching benefit of GS-MIIs is that they are real images, and therefore free from visual artefacts. In contrast, since IS-MIIs and RS-MIIs combine the two reference images when constructing the MII, the MIIs have much better visual similarity to both constituent faces. However, IS-MIIs suffer from visual ghosting artefacts, and RS-MIIs lack detail at the transition between the face and the background.

Fig. 5: Examples of random GS-MII (second row), IS-MII (third row) and RS-MII (fourth row) attacks given two non-matching reference images (top row and bottom row). Each column denotes a unique attack given one reference pair.

VI Effectiveness of Multiple-Identity Images

Here we evaluate whether generated MIIs are close enough to ideal MIIs to make effective attacks. We also evaluate how much the effectiveness of generated MIIs depends on whether the comparator that is being attacked is the same as the one used to generate the MIIs.

VI-A Experiment

We split the C-FERET dataset of frontally aligned images into two disjoint sets (reference and live), each consisting of 993 face images of 993 unique persons. Each method for creating MIIs is applied to the same randomly sampled unique pairings from the reference set. We score the effectiveness of a generated MII m by the MII distance, i.e.

D(m) = \max\big\{ d(m, x_A^{live}),\; d(m, x_B^{live}) \big\}. \qquad (15)

The overall score for an MII generator is the fraction of these distances which are less than or equal to some threshold.

GS-MIIs and RS-MIIs require a comparator for their operation, and we fix that to be one of the SENet models; IS-MIIs do not make use of a comparator. We evaluate the MIIs as attacks against the same SENet model (matched comparator mode), the other SENet model (similar comparator mode), and LtNet (mismatched comparator mode).

VI-B Analysis and Results

In Figure 6, we compare the distributions of MII distances for MIIs generated using the different methods, as well as the effect on the distributions as we vary the comparator attacked. The figure also shows the distributions of MII distances for ideal MIIs (as discussed in Section IV). Clearly all forms of generated MII fail to realise the optimal MII attack, but the large amount of mass still below the threshold indicates that they are successful more often than not. Table III gives the exact MII success rates for a range of threshold values, but we concentrate on the threshold corresponding to a FAR of 0.1% on normal data.

(a) SENet-128
(b) SENet-256
(c) LtNet
Fig. 6: Angular distance probability density plots of the MII distance (Equation 15) for GS-MIIs (pale blue), IS-MIIs (medium blue) and RS-MIIs (dark blue), as well as the MII distances for an ideal attack method (red). Overlaid is the threshold t (vertical line) based on the TAR at FAR 0.1%.

The weak performance of GS-MIIs is evident and expected: in matched comparator mode only a small fraction of MII attacks succeed, and in similar and mismatched modes this drops below 4%.

TABLE III: Performance summary of the three MII attacks (GS, IS, RS) against each comparator, at a range of thresholds. Performance corresponds to the percentage of constructed MIIs that would be successfully verified against both of their corresponding live images at the given threshold. Italics indicate the best performance per comparator.

IS-MIIs perform much better than GS-MIIs. Against the SENets less than half of IS-MII attacks succeed, whilst against LtNet about two-thirds succeed. Note that since IS-MIIs do not make use of a comparator in their construction, these differences in performance, depending on the comparator attacked, are not due to matched vs. similar vs. mismatched mode but reflect hidden differences in the vulnerability of the networks. It is noteworthy that these differences are not apparent from their performance on normal data, which was similar (see Table II).

RS-MIIs also perform much better than GS-MIIs. There is no clear winner comparing RS- and IS-MIIs. For matched and similar mode RS performs distinctly better than IS (77% and 71% vs. 37% and 40%), but for mismatched mode slightly worse (54% vs. 66%).

VII Discussion

We now discuss issues arising especially with regards to improvement; first from the adversary’s perspective, then from the defender’s.

VII-A Improved Multiple-Identity Image Attacks

VII-A1 Reducing Detectability

GS-MIIs utilise real images that undergo no form of manipulation, and are therefore only detectable based on an MII being insufficiently similar to the appearance of the adversary and accomplice.

Contrastingly, there are clear and obvious cues of manipulation for IS- and RS-MIIs. The former suffer from visual ghosting artefacts; the latter lack detail at the transition between the foreground (face region) and background. Clearly, more visually faultless MIIs can be constructed in both cases. For example, rather than having the generator operate on the entire image, which includes identity-irrelevant non-facial content such as the background, an MII can be generated by segmenting the image into facial and non-facial regions, such that only the facial area is manipulated before being recombined with the background of one of the constituent images.

With specific regard to IS-MIIs, which rely on the precise specification of common image features for warping, advancements in automated landmark localisation (e.g. utilising a multi-task cascaded deep convolutional neural network [51]) are likely to offer improvements over the shallow landmark predictor used in this work, thus reducing ghosting. Nonetheless, generating photorealistic morphed images totally free from visual artefacts is still a challenging task. Heterogeneous, and intertwined, factors of variation stemming from pose, skin colour, hair style, etc. impact the realism of the generated interpolations. That is, slight errors in landmark positioning inevitably give rise to ghosting, and these errors are common when a (shallow or deep) landmark predictor is learnt on a demographically biased dataset [52]. Therefore, combining a deep learning based landmark predictor (trained on an unbiased dataset) with both splice morphing [13] (considering only the facial region) and Poisson image editing [15] (smooth transition of low frequency details) appears to be the way forward. Note that, regardless of improvements, IS-MIIs assume that the structural relationship between face images holds, i.e. that each face image is frontal, which is restrictive. Consequently, to generate IS-MIIs where this condition does not hold, an additional module will be required—e.g. a deep generative model capable of generating a novel (frontal) view of the face [53]. If, however, a deep generative neural network is going to be employed, then it brings into question the utility of the entire IS-MII procedure, since one could instead employ RS-MIIs, which are capable of generating MIIs regardless of the pose of the reference images (when trained on face images where pose is a factor of variation).

With specific regard to RS-MIIs, an image can be modelled as the composite of a foreground F and a background B via a matting equation, i.e. x = M ⊙ F + (1 − M) ⊙ B, where M is an occlusion matrix that determines the visibility of background pixels. For instance, [54, 20] utilise a layered conditional generative model that generates images based on disentangled representations, in which separate factors capture the foreground and background variation. This was shown to give visually pleasing results in the tasks of attribute-conditioned image reconstruction and completion. One negative, however, is that the method necessitates that the foreground layer is observable during learning, i.e. the method in its current form does not work unsupervised. Alternatively, the foreground can be extracted utilising a copy-pasting generative adversarial network recently proposed by [55], where objects are segmented (wholly unsupervised) from a given input image. The generator learns to discover an object in an image by compositing it into a different image, with the aim of fooling the discriminator into classifying the resulting image as real.

VII-A2 Reducing the Gap to Ideal MIIs

As shown in Section VI, the MII construction methods evaluated produce suboptimal MIIs. This suggests that the studied attacks can be made stronger, so as to fulfil their theoretical potential.

At present, the simplest solution for an adversary is to ensure that the reference and live images are very similar, so as to increase the likelihood of an MII succeeding (assuming that the generator used is near ideal). However, this is clearly not always possible, especially as the time between the capture of the reference images and the capture of the live images increases.

With respect to GS-MIIs, we do not show the dependency of their success rate on the gallery size, but the larger the gallery, the better the match that can be found. We have examined this and estimate that the gallery would need to contain on the order of billions of faces in order to achieve 50% success. This renders GS-MIIs somewhat impractical, albeit far from impossible.

By considering the behaviour and performance of a hypothetical ideal method for MII generation, an adversary could examine how well their method of MII generation realises this ideal. In particular, this can be done for IS- and RS-MII methods. Firstly, IS-MIIs can be made more ideal by optimising methods for landmarking, warping and interpolation with the ideal MII distance distribution in mind. Secondly, RS-MIIs can be made more ideal by perhaps employing a lower level intermediate representation that retains more image content, e.g. a convolutional layer, since the learnt filters in earlier layers are likely to be more similar across comparators than later layers which specialise to their task and training data.

It should be noted that we do not currently know to what extent the failures of photorealism stochastically move IS- and RS-MIIs away from their ideals. Speculatively, this may have an effect; improving MII photorealism would therefore not only reduce their detectability, but could also help to close the current gap between MIIs as they stand and their ideals.

VII-A3 Choosing the Right Accomplice

The ideal scenario for an adversary-accomplice pairing does not necessitate that the two are initially visually similar in any regard. Throughout this paper, we represented this scenario by generating MIIs based on randomly sampled image pairs. However, recall the scenario outlined in Section I, where individual A wished to access some facility without their identity being detected. If A works with a random collaborator B to generate, for instance, an IS-MII, then they would have roughly a two-thirds chance of success attacking LtNet (Section VI). If, however, A seeks out a collaborator who, to some degree, shares a resemblance, then the chance of success can be improved. For example, constraining B to be within the half of the population who are most visually similar to A (left of the median line in Figure 7), the success rate increases further (lower-left quadrant of Figure 7 as a proportion of the points left of the median line). Clearly, further constraints would see further improvements.
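Figure 7's quadrant computation amounts to conditioning the success rate on the reference-image similarity; a small sketch of that selection (with hypothetical distance arrays) is:

```python
import numpy as np

def constrained_success_rate(ref_dists, mii_dists, t):
    """Success rate when the adversary recruits only from the half of candidate accomplices
    most similar to them, i.e. reference-image distance below the median (cf. Figure 7)."""
    ref_dists, mii_dists = np.asarray(ref_dists), np.asarray(mii_dists)
    similar = ref_dists < np.median(ref_dists)       # points left of the median line
    return float(np.mean(mii_dists[similar] <= t))   # fraction of those points below the threshold line
```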

Fig. 7: Each point corresponds to a pair of identities A and B. The horizontal axis is the angular distance between their reference images according to the adversary's comparator (SENet). The vertical axis is the MII distance according to the comparator being attacked, which here is LtNet (mismatched with the adversary). Points below the threshold line are successful attacks against LtNet. Points to the left of the median line correspond to adversary-accomplice pairings that are fairly similar in appearance according to SENet.

VII-B Improved Multiple-Identity Image Defences

VII-B1 Detecting MIIs

Both our IS-MIIs and RS-MIIs have clear artefacts that a system could be trained to detect; but it is very doubtful whether such systems would be able to deal with improved versions of the MII generators. An arms race is possible, but it is unlikely that regulated commercial systems would be able to keep up with the nimbleness of unregulated adversarial development. This is a consequence of the fact that most detection pipelines employ supervised machine learning algorithms, which have good generalisation ability when the training and testing datasets are sampled from the same distribution. Using deep learning methods, for example, super-human performance [56] can be achieved at recognising objects, places and people in consumer photography. This is possible because the statistics of this data are stable over time. In security applications, such as identity verification, the statistics of the rare set (MIIs) cannot be assumed to be stable, as adversaries seek to innovate and improve their methods. Bluntly, with sufficient examples of MIIs as they exist in 2019, a supervised learning system that detects MIIs until 2020 might be possible, but it would likely fail in 2021.

Anomaly (or novelty) detection systems, however, that utilise unsupervised learning, trained only on the normal dataset, would not have this inevitable obsolescence. Anomaly detection learns the range and dimensions of normal variation, so that suspicious deviations from this can be detected. For instance, a latent space autoregression framework [57] based on memory and surprisal is viable, since an anomalous event can be expressed in terms of one's capability to remember it (e.g. sparse dictionary learning), as well as the degree to which the event is surprising (e.g. in terms of probability). We hypothesise that anomaly detection can be as effective at detecting known signs as a supervised approach, but will also be able to detect signs of improved MII attack methods.

VII-B2 Securing the Comparator

Our results in Section VI highlight the following: (i) poisoning attacks optimised for one comparator can be very effective when transferred to a disparate comparator that was initially trained in the exact same way (prior to fine-tuning); and (ii) transferring attacks to completely different comparators (feature domains) with dissimilar architectures and training data can be effective—contrary to the assertion made in [26]. Furthermore, considering the attacks were randomly generated, LtNet exhibits a substantial vulnerability to RS-MIIs optimised for SENet. Nevertheless, the results indicate that IS-MIIs are the more fruitful attack when one does not have access to a proxy face comparator that is similar to the comparator one wishes to attack, i.e. when the analogue of optimising RS-MIIs for one SENet model in order to attack the other is unavailable.

The transferability of IS-MIIs and RS-MIIs should not come as a surprise, since the angular distances of matching and non-matching pairs, across the face comparators, exhibit a positive linear correlation (recall Table I). Thus, to a degree, each comparator is performing in a roughly similar manner—as one would expect of discriminatively induced networks. Although the arrangement of the face representations may differ from comparator to comparator, if an MII is sufficiently similar to both constituent identities in one feature domain, then it is likely to also be similar in another. Therefore, securing a comparator is an insufficient defence to poisoning attacks.

VII-B3 Improved Comparators

Face comparators are known to be vulnerable to MII poisoning attacks [9], which is reinforced by our results. Research on this mode of attack has focused on defence [27, 28, 29, 16, 30], with little to no explanation given as to why face comparators are vulnerable. Two aspects of our analysis suggest directions for improvement.

First is the observation that some comparators are better than others against IS-MIIs (which do not make use of a comparator), even though they have similar performance on normal data. Understanding why may guide future development.

Second is what we observed with respect to the matching and non-matching distance distributions in Figure 3, i.e. the lack of clear separation between the two. Comparators are typically learnt by minimising a supervised softmax-based loss [4, 5], with abstract high-level intermediate representations extracted for the purposes of open-set verification. The discriminative ability of the representations is a by-product of the training process, and is not explicitly enforced, thus resulting in decision boundaries that do not directly maximise class separability beyond what is necessary for discrimination. However, the popularity of these approaches stems from the ease of their training, since, for example, more direct metric learning approaches [2, 1], which directly embed images into Euclidean space based on relative relationships amongst inputs (minimising intra-class variance and maximising inter-class variance), can be difficult to train: (i) the constraints are typically exponentially large in number, but highly redundant, thus leading to slow convergence; and (ii) mining informative and non-redundant constraints to alleviate (i) is difficult and can lead to training instabilities.

To overcome some of the issues outlined above, improved softmax-based losses have been proposed. For example, those with the objective of: (i) penalising both the distance between a sample and its class centre, in representation space, as well as enforcing inter-class separability [58]; (ii) penalising the distance between the centroid of each class [59]; (iii) maximising the margin between the farthest intra-class sample and its nearest inter-class neighbour [60]; or (iv) employing angular-based margins [3, 61, 62], e.g. directly optimising the geodesic distance margin on the hypersphere, so as to enforce a larger margin between classes [63]. Nonetheless, our distributional hypothesis still stands for such learning paradigms. The primary difference is that direct methods may lead to lower verification thresholds, since the distributions of angular distances of matching and non-matching pairs are expected to have greater separation, thus potentially leading to the rejection of a greater number of MIIs. However, since inter-class separability is explicitly enforced, it becomes increasingly likely that two random face images of different persons will be orthogonal, which by implication means their optimal MII is almost surely at an angular distance of roughly 45° from each. Therefore, as long as the threshold employed lies to the right of the distribution of angular distances attained using idealised MIIs (due to insufficiently reducing the intra-class variance), centred around 45°, a comparator remains theoretically vulnerable to randomly constructed poisoning attacks.

Patently, the key to mitigating randomised poisoning attacks lies in reducing intra-class variance, especially in light of the fact that algorithms for constructing MII attacks will most likely continue to improve (reaching their theorised potential). One solution would be to indefinitely contract samples from the same class during comparator training [64, 65]—by way of contrastive pairwise metric learning. That is, matching pairs continue to contribute to the loss, in an attempt to drive their angular distances to zero—beyond the point needed to differentiate them from dissimilarly labelled samples. However, this method of learning is known to cause overfitting. Consequently, a balance must be struck between robustness to MIIs and generalisation ability for open-set verification. An adapted solution would be to ensure that the distribution of matching distances is well separated from the distribution of ideal MII distances, as opposed to only the distribution of non-matching distances, employing for instance a histogram loss [66].

VIII Conclusion

Face comparators are known to be vulnerable to MII poisoning attacks, which is reinforced by our results. However, research on this mode of attack has focused on defence, with little to no explanation given as to why face comparators are vulnerable. In contrast, we provided an intuitive view on the role of the face representation spaces used for verification, arguing that the principal cause of the vulnerability is that representations of different identity faces are always close to orthogonal, while the representations of same identity faces are only modestly closer than orthogonal. This is sufficient for open-set verification on normal data but provides an opportunity for MII attacks. Importantly, by considering the behaviour and performance of a hypothetical ideal method for MII generation, we were able to examine how well existing MII generators realise theoretically optimal MIIs—permitting one to establish the vulnerability of a specific comparator to MIIs. In addition, we showed that the transference of MIIs from one comparator to another is made possible by the representation spaces of dissimilar comparators being sufficiently similar; as such, securing a comparator is an insufficient defence.

It is unclear whether MIIs can be completely defended against—although it is unlikely that any of our generated MIIs (in Figure 5) would be mistaken for the constituent identities by a human expert, we are not confident that this would still be true for improved generation systems that get closer to the ideal.

References

  • [1] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
  • [2] O. M. Parkhi, A. Vedaldi, A. Zisserman et al., “Deep face recognition,” in BMVC, vol. 1, no. 3, 2015, p. 6.
  • [3] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2017, p. 1.
  • [4] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “Vggface2: A dataset for recognising faces across pose and age,” in 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018).   IEEE, 2018, pp. 67–74.
  • [5] X. Wu, R. He, Z. Sun, and T. Tan, “A light cnn for deep face representation with noisy labels,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2884–2896, 2018.
  • [6] A. K. Jain, K. Nandakumar, and A. Ross, “50 years of biometric research: Accomplishments, challenges, and opportunities,” Pattern Recognition Letters, vol. 79, pp. 80–105, 2016.
  • [7] N. K. Ratha, J. H. Connell, and R. M. Bolle, “Enhancing security and privacy in biometrics-based authentication systems,” IBM systems Journal, vol. 40, no. 3, pp. 614–634, 2001.
  • [8] G. Wolberg, “Image morphing: a survey,” The visual computer, vol. 14, no. 8, pp. 360–372, 1998.
  • [9] M. Ferrara, A. Franco, and D. Maltoni, “The magic passport,” in Biometrics (IJCB), 2014 IEEE International Joint Conference on.   IEEE, 2014, pp. 1–7.
  • [10] U. Scherhag, R. Raghavendra, K. B. Raja, M. Gomez-Barrero, C. Rathgeb, and C. Busch, “On the vulnerability of face recognition systems towards morphed face attacks,” in 2017 5th International Workshop on Biometrics and Forensics (IWBF).   IEEE, 2017, pp. 1–6.
  • [11] D. J. Robertson, A. Mungall, D. G. Watson, K. A. Wade, S. J. Nightingale, and S. Butler, “Detecting morphed passport photos: a training and individual differences approach,” Cognitive research: principles and implications, vol. 3, no. 1, p. 27, 2018.
  • [12] C. Seibold, A. Hilsmann, and P. Eisert, “Reflection analysis for face morphing attack detection,” arXiv preprint arXiv:1807.02030, 2018.
  • [13] A. Makrushin, T. Neubert, and J. Dittmann, “Automatic generation and detection of visually faultless facial morphs.” in VISIGRAPP (6: VISAPP), 2017, pp. 39–50.
  • [14] J. P. Vyas, M. V. Joshi, and M. S. Raval, “Automatic target image detection for morphing,” Journal of Visual Communication and Image Representation, vol. 27, pp. 28–43, 2015.
  • [15] P. Pérez, M. Gangnet, and A. Blake, “Poisson image editing,” ACM Transactions on graphics (TOG), vol. 22, no. 3, pp. 313–318, 2003.
  • [16] L. Debiasi, U. Scherhag, C. Rathgeb, A. Uhl, and C. Busch, “Prnu-based detection of morphed face images,” in 2018 International Workshop on Biometrics and Forensics (IWBF).   IEEE, 2018, pp. 1–7.
  • [17] A. Mahendran and A. Vedaldi, “Understanding deep image representations by inverting them,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5188–5196.
  • [18] A. Dosovitskiy and T. Brox, “Inverting visual representations with convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4829–4837.
  • [19] ——, “Generating images with perceptual similarity metrics based on deep networks,” in Advances in neural information processing systems, 2016, pp. 658–666.
  • [20] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras, “Neural face editing with intrinsic image disentangling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5541–5550.
  • [21] G. Perarnau, J. Van De Weijer, B. Raducanu, and J. M. Álvarez, “Invertible conditional gans for image editing,” arXiv preprint arXiv:1611.06355, 2016.
  • [22] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer et al., “Fader networks: Manipulating images by sliding attributes,” in Advances in Neural Information Processing Systems, 2017, pp. 5967–5976.
  • [23] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “Arbitrary facial attribute editing: Only change what you want,” arXiv preprint arXiv:1711.10678, vol. 1, no. 3, 2017.
  • [24] Z. Wang, X. Tang, W. Luo, and S. Gao, “Face aging with identity-preserved conditional generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7939–7947.
  • [25] N. Damer, A. M. Saladié, A. Braun, and A. Kuijper, “Morgan: Recognition vulnerability and attack detectability of face morphing attacks created by generative adversarial network,” in 2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS).   IEEE, 2019, pp. 1–10.
  • [26] U. Scherhag, C. Rathgeb, J. Merkle, R. Breithaupt, and C. Busch, “Face recognition systems under morphing attacks: A survey,” IEEE Access, vol. 7, pp. 23 012–23 026, 2019.
  • [27] C. Seibold, W. Samek, A. Hilsmann, and P. Eisert, “Detection of face morphing attacks by deep learning,” in International Workshop on Digital Watermarking.   Springer, 2017, pp. 107–120.
  • [28] A. Asaad and S. Jassim, “Topological data analysis for image tampering detection,” in International Workshop on Digital Watermarking.   Springer, 2017, pp. 136–146.
  • [29] M. Ferrara, A. Franco, and D. Maltoni, “Face demorphing,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 4, pp. 1008–1017, 2018.
  • [30] R. Raghavendra, K. B. Raja, and C. Busch, “Detecting morphed face images,” in Biometrics Theory, Applications and Systems (BTAS), 2016 IEEE 8th International Conference on.   IEEE, 2016, pp. 1–7.
  • [31] G. K. Birajdar and V. H. Mankar, “Digital image forgery detection using passive techniques: A survey,” Digital Investigation, vol. 10, no. 3, pp. 226–245, 2013.
  • [32] U. Scherhag, C. Rathgeb, and C. Busch, “Performance variation of morphed face image detection algorithms across different datasets,” in 2018 International Workshop on Biometrics and Forensics (IWBF).   IEEE, 2018, pp. 1–6.
  • [33] L. Spreeuwers, M. Schils, and R. Veldhuis, “Towards robust evaluation of face morphing detection,” in 2018 26th European Signal Processing Conference (EUSIPCO).   IEEE, 2018, pp. 1027–1031.
  • [34] U. Scherhag, C. Rathgeb, and C. Busch, “Towards detection of morphed face images in electronic travel documents,” in 2018 13th IAPR International Workshop on Document Analysis Systems (DAS).   IEEE, 2018, pp. 187–192.
  • [35] C. Kraetzer, A. Makrushin, T. Neubert, M. Hildebrandt, and J. Dittmann, “Modeling attacks on photo-id documents and applying media forensics for the detection of facial morphing,” in Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security.   ACM, 2017, pp. 21–32.
  • [36] T. Neubert, A. Makrushin, M. Hildebrandt, C. Kraetzer, and J. Dittmann, “Extended stirtrace benchmarking of biometric and forensic qualities of morphed face images,” IET Biometrics, vol. 7, no. 4, pp. 325–332, 2018.
  • [37] U. Scherhag, D. Budhrani, M. Gomez-Barrero, and C. Busch, “Detecting morphed face images using facial landmarks,” in International Conference on Image and Signal Processing.   Springer, 2018, pp. 444–452.
  • [38] P. J. Phillips, H. Moon, P. Rauss, and S. A. Rizvi, “The feret evaluation methodology for face-recognition algorithms,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.   IEEE, 1997, pp. 137–143.
  • [39] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” arXiv preprint arXiv:1812.04948, 2018.
  • [40] D. E. King, “Dlib-ml: A machine learning toolkit,” Journal of Machine Learning Research, vol. 10, no. Jul, pp. 1755–1758, 2009.
  • [41] V. Kazemi and J. Sullivan, “One millisecond face alignment with an ensemble of regression trees,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1867–1874.
  • [42] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
  • [43] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [44] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A dataset and benchmark for large-scale face recognition,” in European Conference on Computer Vision.   Springer, 2016, pp. 87–102.
  • [45] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.
  • [46] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” arXiv preprint arXiv:1302.4389, 2013.
  • [47] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,” arXiv preprint arXiv:1411.7923, 2014.
  • [48] B. Delaunay, “Sur la sphère vide,” Bulletin of Academy of Sciences of the USSR, pp. 793–800, 1934.
  • [49] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.
  • [50] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, “Least squares generative adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2794–2802.
  • [51] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
  • [52] D. McDuff, R. Cheng, and A. Kapoor, “Identifying bias in ai using simulation,” arXiv preprint arXiv:1810.00471, 2018.
  • [53] R. Huang, S. Zhang, T. Li, and R. He, “Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2439–2448.
  • [54] X. Yan, J. Yang, K. Sohn, and H. Lee, “Attribute2image: Conditional image generation from visual attributes,” in European Conference on Computer Vision.   Springer, 2016, pp. 776–791.
  • [55] R. Arandjelović and A. Zisserman, “Object discovery with a copy-pasting gan,” arXiv preprint arXiv:1905.11369, 2019.
  • [56] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
  • [57] D. Abati, A. Porrello, S. Calderara, and R. Cucchiara, “Latent space autoregression for novelty detection,” in International Conference on Computer Vision and Pattern Recognition, 2019.
  • [58] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European conference on computer vision.   Springer, 2016, pp. 499–515.
  • [59] Y. Wu, H. Liu, J. Li, and Y. Fu, “Deep face recognition with center invariant loss,” in Proceedings of the on Thematic Workshops of ACM Multimedia 2017.   ACM, 2017, pp. 408–414.
  • [60] J. Deng, Y. Zhou, and S. Zafeiriou, “Marginal loss for deep face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 60–68.
  • [61] F. Wang, J. Cheng, W. Liu, and H. Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
  • [62] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, “Cosface: Large margin cosine loss for deep face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5265–5274.
  • [63] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” arXiv preprint arXiv:1801.07698, 2018.
  • [64] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer, “Discriminative learning of deep convolutional feature point descriptors,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 118–126.
  • [65] S. Chopra, R. Hadsell, Y. LeCun et al., “Learning a similarity metric discriminatively, with application to face verification,” in CVPR (1), 2005, pp. 539–546.
  • [66] E. Ustinova and V. Lempitsky, “Learning deep embeddings with histogram loss,” in Advances in Neural Information Processing Systems, 2016, pp. 4170–4178.