Contextual Face Recognition with a Nested-Hierarchical Nonparametric Identity Model

11/19/2018 ∙ by Daniel Coelho de Castro, et al. ∙ Microsoft ∙ Imperial College London

Current face recognition systems typically operate via classification into known identities obtained from supervised identity annotations. There are two problems with this paradigm: (1) current systems are unable to benefit from often abundant unlabelled data; and (2) they equate successful recognition with labelling a given input image. Humans, on the other hand, regularly perform identification of individuals completely unsupervised, recognising the identity of someone they have seen before even without being able to name that individual. How can we go beyond the current classification paradigm towards a more human understanding of identities? In previous work, we proposed an integrated Bayesian model that coherently reasons about the observed images, identities, partial knowledge about names, and the situational context of each observation. Here, we propose extensions of the contextual component of this model, enabling unsupervised discovery of an unbounded number of contexts for improved face recognition.




1 Introduction

Face identification can be decomposed into two sub-problems: recognition and tagging. Here we understand recognition as the unsupervised task of matching an observed face to a cluster of previously seen faces with similar appearance (disregarding variations in pose, illumination etc.), which we refer to as an identity. Humans routinely operate at this level of abstraction to recognise familiar faces: even when people’s names are not known, we can still tell them apart. Tagging, on the other hand, refers to putting names to faces, i.e. associating string literals to known identities.

An important aspect of social interactions is that, as an individual continues to observe faces every day, they encounter some people much more often than others, and the total number of distinct identities ever met tends to increase virtually without bound. Additionally, we argue that human face recognition does not happen in isolation: situational contexts (e.g. ‘home’, ‘work’, ‘gym’) constitute strong cues for the groups of people a person expects to meet (Fig. 1).

With regards to tagging, in daily life we very rarely obtain named face observations: acquaintances normally introduce themselves only once, and not repeatedly whenever they are in our field of view. In other words, humans are naturally capable of semi-supervised learning, generalising sparse name annotations to all observations of the corresponding individuals, while additionally reconciling naming conflicts due to noise and uncertainty.

We recently introduced a unified Bayesian model which reflects all the above considerations on identity distributions, context-awareness and labelling (Fig. 2) (Castro and Nowozin, 2018). Our nonparametric identity model effectively represents an unbounded population of identities, while taking contextual co-occurrence relations and sparse noisy labels into account.

In this preliminary work, we extend the context model of Castro and Nowozin (2018) in two ways: we explore its limit with an unbounded number of contexts, uncovering a rich nonparametric structure, and we lay the foundations for incorporating environmental cues (such as timestamps and geographical locations of frames) in our model to improve unsupervised context discovery and prediction.

2 Background

We begin by reviewing the face identification framework presented in Castro and Nowozin (2018), consisting of four main components: a context model (which we extend in Section 3), an identity model, a face model, and a semi-supervised label model.

Figure 1: Context-aware model of identities (Castro and Nowozin, 2018)




Figure 2: Overview of the proposed probabilistic model

2.1 Context Model

Data is assumed to be collected in frames, i.e. photo album or video stills, which are run through some off-the-shelf face detector. This produces $N$ observations, grouped into the $M$ frames via an indicator $f_n = m$ for each observation $n$ in frame $m$. Context is therefore naturally shared among all face detections in each frame. We model context as a discrete latent variable, representing categories of situations in which a subject may find herself: e.g. home, work, gym.

We assume the context indicators $c_m \in \{1, \dots, C\}$ for each frame $m$, where $C$ is some fixed number of distinct contexts, are independently distributed according to probabilities $\omega = (\omega_1, \dots, \omega_C)$, which themselves follow a Dirichlet prior:

$$\omega \sim \mathrm{Dir}(\gamma), \qquad c_m \mid \omega \sim \mathrm{Cat}(\omega), \quad m = 1, \dots, M,$$

where $M$ is the total number of frames. In our simulation experiments in Castro and Nowozin (2018), we used a symmetric Dirichlet prior, setting $\gamma = (\gamma_0 / C, \dots, \gamma_0 / C)$.
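As an illustration, the finite context prior can be simulated in a few lines. This is a minimal sketch rather than the authors' implementation: the function names, the seed and the hyperparameter values are hypothetical, and the symmetric prior is assumed to split a total concentration `gamma0` evenly over the contexts.

```python
import random


def sample_dirichlet(gamma, rng):
    """Draw from Dir(gamma) by normalising independent Gamma(gamma_c, 1) variates."""
    draws = [rng.gammavariate(g, 1.0) for g in gamma]
    total = sum(draws)
    return [d / total for d in draws]


def sample_contexts(num_frames, num_contexts, gamma0, rng):
    """Sample context proportions, then one context indicator per frame."""
    # Assumed symmetric prior: every context gets the pseudo-count gamma0 / C.
    gamma = [gamma0 / num_contexts] * num_contexts
    omega = sample_dirichlet(gamma, rng)
    # Each frame's context is an independent categorical draw given omega.
    contexts = rng.choices(range(num_contexts), weights=omega, k=num_frames)
    return omega, contexts


rng = random.Random(0)  # fixed seed, only to make the sketch reproducible
omega, contexts = sample_contexts(num_frames=10, num_contexts=3, gamma0=3.0, rng=rng)
print(omega, contexts)
```

All face detections in frame `m` would then inherit `contexts[m]`, which is what couples the identities that appear together.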

2.2 Identity Model

In a daily-life scenario, an increasing number of unique identities will tend to appear as more faces are observed, i.e. we do not expect a user to run out of new people to meet. Moreover, some people are likely to be encountered much more often than others. Since a Dirichlet process (DP) (Ferguson, 1973) displays properties that mirror all of the above phenomena (Teh, 2010), it is a sound choice for modelling the distribution of identities.

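Furthermore, the assumption that all people can potentially be encountered in any context, but with different probabilities, is naturally captured by a hierarchical Dirichlet process (HDP) (Teh et al., 2006). Making use of the context model, we define one DP per context $c$, each with concentration parameter $\alpha_c$ and sharing the same global DP as a base measure. This hierarchical construction thus produces context-specific distributions over a common set of identities. Such a nonparametric model is additionally well suited for an open-set identification task, as it can elegantly estimate the prior probability of encountering an unknown identity.

To each of the $N$ face detections is associated a latent identity indicator variable, $z_n$. Letting $\pi_0$ denote the global identity distribution and $(\pi_c)_{c=1}^{C}$ the context-specific identity distributions, we can write the generative process as

$$\pi_0 \sim \mathrm{GEM}(\alpha_0), \qquad \pi_c \mid \pi_0 \sim \mathrm{DP}(\alpha_c, \pi_0), \quad c = 1, \dots, C, \qquad z_n \mid f_n = m, c_m, (\pi_c)_c \sim \mathrm{Cat}(\pi_{c_m}), \quad n = 1, \dots, N,$$

where $f_n$ is the frame of observation $n$, $c_m$ is the context of frame $m$, and $\mathrm{GEM}(\alpha_0)$ is the DP stick-breaking distribution, $\pi_{0i} = \beta_i \prod_{j<i} (1 - \beta_j)$, with $\beta_i \sim \mathrm{Beta}(1, \alpha_0)$ and $i = 1, 2, \dots$ (Sethuraman, 1994; Pitman, 2006). We additionally define a hierarchical GEM distribution, $\mathrm{HGEM}(\alpha_c, \pi_0)$, such that $\pi_c \mid \pi_0 \sim \mathrm{HGEM}(\alpha_c, \pi_0)$, with $\pi_{ci} = \beta_{ci} \prod_{j<i} (1 - \beta_{cj})$ and $\beta_{ci} \sim \mathrm{Beta}\big(\alpha_c \pi_{0i},\, \alpha_c (1 - \sum_{j \le i} \pi_{0j})\big)$ (Teh et al., 2006, Eq. (21)).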
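The two stick-breaking constructions can be sketched as follows. This is only an illustration under stated assumptions: the infinite sequences are truncated at a hypothetical level `truncation`, the function and variable names are invented here, and the hierarchical step follows the Beta parametrisation of Teh et al. (2006, Eq. (21)).

```python
import random


def stick_breaking(alpha0, truncation, rng):
    """Truncated GEM(alpha0): pi_i = beta_i * prod_{j<i} (1 - beta_j)."""
    weights, remaining = [], 1.0
    for _ in range(truncation):
        beta = rng.betavariate(1.0, alpha0)
        weights.append(beta * remaining)
        remaining *= 1.0 - beta
    # `remaining` is the total mass of all identities beyond the truncation.
    return weights, remaining


def hierarchical_stick_breaking(alpha_c, pi0, tail0, rng):
    """Context-specific weights over the same atoms as the global weights pi0.

    Uses beta_ci ~ Beta(alpha_c * pi0_i, alpha_c * (1 - sum_{j<=i} pi0_j)).
    """
    # tails[i] = 1 - sum_{j<=i} pi0_j, accumulated from the end so it stays positive.
    tails, tail = [], tail0
    for p in reversed(pi0):
        tails.append(tail)
        tail += p
    tails.reverse()

    weights, remaining = [], 1.0
    for p, t in zip(pi0, tails):
        beta = rng.betavariate(alpha_c * p, alpha_c * t)
        weights.append(beta * remaining)
        remaining *= 1.0 - beta
    return weights, remaining


rng = random.Random(1)  # fixed seed, only to make the sketch reproducible
pi0, tail0 = stick_breaking(alpha0=2.0, truncation=20, rng=rng)
pi_home, tail_home = hierarchical_stick_breaking(alpha_c=5.0, pi0=pi0, tail0=tail0, rng=rng)
print(round(tail0, 4), round(tail_home, 4))
```

The leftover masses `tail0` and `tail_home` play the role of the prior probability of meeting a not-yet-seen identity, globally and within one context respectively, which is what makes the model suitable for open-set identification.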

2.3 Face Model

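We assume that the observed features of the $n$th face, $x_n$, arise from a parametric family of distributions, $F_X$. The parameters of this distribution, $\theta_i^*$, drawn from a prior, $H_X$, are unique for each identity $i$ and are shared across all face feature observations of the same person:

$$\theta_i^* \sim H_X, \quad i = 1, 2, \dots, \qquad x_n \mid z_n, \theta^* \sim F_X(\theta_{z_n}^*), \quad n = 1, \dots, N,$$

where $z_n$ is the identity indicator of observation $n$. As a consequence, the marginal distribution of faces is given by an infinite mixture model (Antoniak, 1974): $p(x_n \mid \theta^*, \pi) = \sum_{i=1}^{\infty} \pi_i \, F_X(x_n \mid \theta_i^*)$, where $\pi$ denotes the identity weights of the context in which $x_n$ was observed.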

In face recognition applications, it is typically more convenient and meaningful to extract a compact representation of face features than to work directly in a high-dimensional pixel space. For the experiments reported in Castro and Nowozin (2018), we used embeddings produced by a pre-trained neural network (Amos et al., 2016). We chose isotropic Gaussian mixture components for the face features, $\mathcal{N}(\mu, \sigma^2 I)$, with an empirical Gaussian–inverse gamma prior for their means, $\mu$, and variances, $\sigma^2$.
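To make the mixture view concrete, the sketch below evaluates the posterior over a finite set of candidate identities for one face embedding, using isotropic Gaussian components. It is an illustrative simplification: the function names, the two-dimensional toy embeddings and the fixed component parameters are hypothetical, and the prior over means and variances is not modelled.

```python
import math


def isotropic_gaussian_logpdf(x, mean, variance):
    """Log-density of N(mean, variance * I) evaluated at x."""
    dim = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (dim * math.log(2.0 * math.pi * variance) + sq_dist / variance)


def identity_posterior(x, weights, params):
    """p(z = i | x) proportional to weights[i] * N(x; mean_i, variance_i * I)."""
    log_terms = [
        math.log(w) + isotropic_gaussian_logpdf(x, mean, variance)
        for w, (mean, variance) in zip(weights, params)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_terms)
    unnormalised = [math.exp(t - top) for t in log_terms]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Two toy identities with equal context-specific weights.
weights = [0.5, 0.5]
params = [([0.0, 0.0], 1.0), ([3.0, 3.0], 1.0)]
posterior = identity_posterior([0.1, -0.2], weights, params)
print(posterior)
```

In the full model the weights would come from the identity distribution of the frame's context, so the same embedding can be attributed to different identities depending on where it was observed.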


2.4 Label Model

We expect to work with only a small number of user-labelled observations. Building on the cluster assumption for semi-supervised learning (Chapelle et al., 2006, Sec. 1.2.2), we attach a label variable (a name) to each cluster (identity), here denoted $y_i^*$. Since the number of distinct labels will tend to increase without bounds as more data is observed, we adopt a further nonparametric prior on these identity-wide labels, $G_Y \sim \mathrm{DP}(\lambda, H_Y)$, using some base probability distribution $H_Y$ over the countable but unbounded label space (e.g. strings). (One could instead consider a Pitman–Yor process if power-law behaviour seems more appropriate than the DP’s exponential tails (Pitman and Yor, 1997).) In Castro and Nowozin (2018) we defined $H_Y$ over a rudimentary language model. Lastly, the observed labels, $\tilde{y}_n$, are assumed potentially corrupted through some noise process, $F_Y$. Let $\mathcal{L}$ denote the set of indices of the labelled data. We then have

$$G_Y \sim \mathrm{DP}(\lambda, H_Y), \qquad y_i^* \mid G_Y \sim G_Y, \quad i = 1, 2, \dots, \qquad \tilde{y}_n \mid z_n, y^* \sim F_Y(y_{z_n}^*), \quad n \in \mathcal{L}.$$

All concrete knowledge we have about the random label prior $G_Y$ comes from the set of observed labels, $\tilde{y}_{\mathcal{L}}$. Crucially, we can easily marginalise out $G_Y$ (Teh, 2010), obtaining a tractable predictive label distribution.

According to the proposed noise model, an observed label, $\tilde{y}_n$, agrees with its identity’s assigned label, $y_{z_n}^*$, with a fixed probability. Otherwise, it is assumed to come from a modified label distribution, in which we delete $y_{z_n}^*$ from $G_Y$ and renormalise it. Here we use $G_Y$ in the error distribution instead of $H_Y$ to reflect that a user is likely to mistake a person’s name for another known name, rather than for an arbitrary random string.
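The label model can be illustrated with two small functions: the predictive probability of a label once the DP label prior has been marginalised out, and the likelihood of an observed label under the noise process. This is a sketch under assumptions, not the authors' code: the names `lam` (DP concentration), `eps` (probability of a labelling error) and `base_prob` are hypothetical, and a finite dictionary of known names stands in for the label distribution used in the error term.

```python
from collections import Counter


def predictive_label_prob(label, assigned_labels, lam, base_prob):
    """Predictive probability of `label` with the DP label prior integrated out.

    Labels already assigned to identities are reused in proportion to their
    counts; with weight `lam` a label comes from the base distribution.
    """
    counts = Counter(assigned_labels)
    n = len(assigned_labels)
    return (counts[label] + lam * base_prob(label)) / (n + lam)


def noisy_label_likelihood(observed, true_label, label_probs, eps):
    """Likelihood of an observed label given the identity's true label.

    With probability 1 - eps the observation is correct; otherwise it is drawn
    from `label_probs` with the true label removed and the rest renormalised.
    """
    if observed == true_label:
        return 1.0 - eps
    rest = 1.0 - label_probs.get(true_label, 0.0)
    return eps * label_probs.get(observed, 0.0) / rest


# Toy example: three known names and a uniform base probability of 0.25.
label_probs = {"Ann": 0.5, "Bob": 0.3, "Cat": 0.2}
p_wrong = noisy_label_likelihood("Bob", "Ann", label_probs, eps=0.1)
p_ann = predictive_label_prob("Ann", ["Ann", "Ann", "Bob"], lam=1.0, base_prob=lambda s: 0.25)
print(p_wrong, p_ann)
```

Because the error term is restricted to known names, a mislabelled observation is far more likely to carry another acquaintance's name than a random string, matching the intuition stated above.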

3 Extended Context Model

The context framework employed in Castro and Nowozin (2018) assumes a finite collection of pre-specified contexts and is fully supervised: an explicit context label is observed with each frame. This simplified scenario was adopted as a proof of concept, yet is admittedly unrealistic.

3.1 Unbounded Contexts

As reviewed in Section 2.1, the original context model had a finite prior. A natural extension of this model is to take its limit as $C \to \infty$, while tying the values of all context-wise concentration hyperparameters ($\alpha_c \equiv \alpha$), which results in a Dirichlet process (Neal, 2000). In particular, up to a reordering of the contexts, the prior on context proportions, $\omega$, becomes $\omega \sim \mathrm{GEM}(\gamma_0)$, where $\gamma_0$ is the total concentration of the symmetric Dirichlet prior.
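With the context proportions integrated out, the infinite limit induces a Chinese restaurant process over frames, which gives a simple way to simulate an unbounded number of contexts. The following is a minimal sketch; the function name and the concentration value are hypothetical, and contexts are numbered in order of first appearance.

```python
import random


def sample_context_crp(num_frames, gamma0, rng):
    """Assign frames to contexts sequentially via a Chinese restaurant process."""
    assignments, counts = [], []
    for _ in range(num_frames):
        # Existing context k is chosen proportionally to its frame count,
        # a brand-new context proportionally to gamma0.
        k = rng.choices(range(len(counts) + 1), weights=counts + [gamma0])[0]
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments, counts


rng = random.Random(2)  # fixed seed, only to make the sketch reproducible
assignments, counts = sample_context_crp(num_frames=50, gamma0=1.5, rng=rng)
print(len(counts), counts)
```

The number of contexts is no longer fixed in advance: it grows slowly with the number of frames, and frequently visited contexts attract further frames.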

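This transformation has interesting theoretical and practical implications: the resulting structure is a nested-hierarchical Dirichlet process. (This is related to the dual-HDP described in Wang et al. (2009) and the single-entity model of Agrawal et al. (2013), for example, although these works tended to focus on textual topic modelling.) As before, at the top level we have the global identity distribution, $G_0$, over face parameters and labels, and the context-specific identity distributions, $G_c$, follow a DP with $G_0$ as a base measure:

$$G_0 \sim \mathrm{DP}(\alpha_0, H),$$

$$G_c \mid G_0 \sim \mathrm{DP}(\alpha, G_0), \tag{10}$$

where $H$ denotes the base measure over face parameters and labels. This is a prototypical example of a hierarchical DP (HDP) (Teh et al., 2006). If $G_0 = \sum_{i=1}^{\infty} \pi_{0i} \, \delta_{(\theta_i^*, y_i^*)}$, we can write $G_c = \sum_{i=1}^{\infty} \pi_{ci} \, \delta_{(\theta_i^*, y_i^*)}$, i.e. the same atoms with context-specific weights.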

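Now, the nonparametric distribution of contexts implies wrapping the bottom level of the HDP, Eq. 10, as base for another DP, to form a nested DP (Blei et al., 2010; Rodríguez et al., 2008):

$$Q \mid G_0 \sim \mathrm{DP}\big(\gamma_0, \mathrm{DP}(\alpha, G_0)\big), \qquad G_m \mid Q \sim Q, \quad m = 1, \dots, M, \tag{11}$$

where $Q$ is a random distribution over identity distributions and $G_m$ is the identity distribution of frame $m$. This construction inherits desirable properties from both elements: the hierarchy ensures that all frame-wise identity distributions, $G_m$, have the same support, and nesting produces clusters of frames with shared identity weights (i.e. contexts).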
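The nested-hierarchical structure can be simulated by combining both ingredients: frames are clustered into contexts by a Chinese restaurant process, and each new context receives its own identity weights drawn around the shared global weights. The sketch below is an illustration under assumptions: the global distribution is truncated to a finite number of identities (so that a DP draw around it reduces to a Dirichlet draw), and all names and hyperparameter values are hypothetical.

```python
import random


def gem(alpha0, truncation, rng):
    """Truncated stick-breaking weights; the last entry holds the leftover mass."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        beta = rng.betavariate(1.0, alpha0)
        weights.append(beta * remaining)
        remaining *= 1.0 - beta
    weights.append(remaining)
    return weights


def dp_over_finite_base(alpha, base, rng):
    """Draw from DP(alpha, base) for a finite discrete base: Dir(alpha * base)."""
    draws = [rng.gammavariate(alpha * b, 1.0) for b in base]
    total = sum(draws)
    return [d / total for d in draws]


def sample_nested_hdp(num_frames, gamma0, alpha0, alpha, truncation, rng):
    """Sample global weights, per-context weights and a context for each frame."""
    pi0 = gem(alpha0, truncation, rng)  # global identity weights (top level)
    context_weights, counts, frame_context = [], [], []
    for _ in range(num_frames):
        # Nesting: pick an existing context or open a new one.
        k = rng.choices(range(len(counts) + 1), weights=counts + [gamma0])[0]
        if k == len(counts):
            counts.append(0)
            # Hierarchy: a new context reweights the same global identities.
            context_weights.append(dp_over_finite_base(alpha, pi0, rng))
        counts[k] += 1
        frame_context.append(k)
    return pi0, context_weights, frame_context


rng = random.Random(3)  # fixed seed, only to make the sketch reproducible
pi0, context_weights, frame_context = sample_nested_hdp(
    num_frames=12, gamma0=1.0, alpha0=3.0, alpha=5.0, truncation=15, rng=rng
)
print(frame_context)
```

Every entry of `context_weights` has the same length as `pi0` (shared support), while frames with the same value in `frame_context` share one weight vector, mirroring the two properties noted above.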

3.2 Environmental Cues

While a purely identity-driven unsupervised context model may be able to disentangle co-occurrence patterns given enough data, we believe that environmental cues—such as timestamp and GPS coordinates of an acquired frame, if available—could considerably facilitate context discovery and prediction, in turn improving inference about identities.

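Let us define $e_m$ as the environmental measurements available for frame $m$, $F_E$ a likelihood family parametrised by $\phi$, and $H_E$ a prior distribution for such parameters. Plugging $\mathrm{DP}(\alpha, G_0) \times H_E$ as base measure for the nested DP in Eq. 11, we can write

$$Q \mid G_0 \sim \mathrm{DP}\big(\gamma_0, \mathrm{DP}(\alpha, G_0) \times H_E\big), \qquad (G_m, \phi_m) \mid Q \sim Q, \qquad e_m \mid \phi_m \sim F_E(\phi_m), \quad m = 1, \dots, M,$$

so that each context carries both its identity weights and the parameters of its environmental measurements.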
Some preliminary ideas for a spatial model include a ‘geodetic’ Fisher distribution or a tangential Gaussian (Straub et al., 2015), while a temporal model would have to accommodate recurring and occasional contexts, potentially adopting a Cox process formalism (Cox, 1955).

4 Conclusion

In this work, we reviewed the fully Bayesian treatment of the face identification problem introduced in Castro and Nowozin (2018). Each component of our proposed approach was motivated by human intuition about face recognition and tagging in daily social interactions, such that our principled identity model can account for context-specific probabilities of meeting an unbounded population.

We further proposed a nonparametric extension of the context model enabling unbounded context discovery, and discussed some of its theoretical implications in terms of nested-hierarchical nonparametric structures. Finally, we briefly examined how available environmental cues could be integrated into the model to replace the simplified supervised setting.


This work was partly supported by CAPES, Brazil (BEX 1500/2015-05).


  • Agrawal et al. (2013) Agrawal, P., Tekumalla, L. S., and Bhattacharya, I. (2013). Nested hierarchical Dirichlet process for nonparametric entity-topic analysis. In Machine Learning and Knowledge Discovery in Databases – ECML PKDD 2013, volume 8189 of LNCS, pages 564–579. Springer, Berlin, Heidelberg.
  • Amos et al. (2016) Amos, B., Ludwiczuk, B., and Satyanarayanan, M. (2016). OpenFace: A general-purpose face recognition library with mobile applications. Technical Report CMU-CS-16-118, CMU School of Computer Science.
  • Antoniak (1974) Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174.
  • Blei et al. (2010) Blei, D. M., Griffiths, T. L., and Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):7.
  • Castro and Nowozin (2018) Castro, D. C. and Nowozin, S. (2018). From face recognition to models of identity: A Bayesian approach to learning about unknown identities from unsupervised data. In Computer Vision – ECCV 2018, volume 11206 of LNCS, pages 745–761. Springer. Extended version with supplement: arXiv:1807.07872.
  • Chapelle et al. (2006) Chapelle, O., Schölkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT Press.
  • Cox (1955) Cox, D. R. (1955). Some statistical methods connected with series of events. Journal of the Royal Statistical Society: Series B (Methodological), 17(2):129–164.
  • Ferguson (1973) Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230.
  • Neal (2000) Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265.
  • Pitman (2006) Pitman, J. (2006). Combinatorial Stochastic Processes, volume 1875 of Lecture Notes in Mathematics. Springer-Verlag, Berlin. Lectures from the 32nd Summer School on Probability Theory held in Saint-Flour, July 7–24, 2002, with a foreword by Jean Picard.

  • Pitman and Yor (1997) Pitman, J. and Yor, M. (1997). The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2):855–900.
  • Rodríguez et al. (2008) Rodríguez, A., Dunson, D. B., and Gelfand, A. E. (2008). The nested Dirichlet process. Journal of the American Statistical Association, 103(483):1131–1154.
  • Sethuraman (1994) Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4(2):639–650.
  • Straub et al. (2015) Straub, J., Chang, J., Freifeld, O., and Fisher III, J. W. (2015). A Dirichlet process mixture model for spherical data. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS 2015), volume 38 of PMLR, pages 930–938. PMLR.
  • Teh (2010) Teh, Y. W. (2010). Dirichlet process. In Sammut, C. and Webb, G. I., editors, Encyclopedia of Machine Learning, pages 280–287. Springer US.
  • Teh et al. (2006) Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
  • Wang et al. (2009) Wang, X., Ma, X., and Grimson, W. E. L. (2009). Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(3):539–555.