1 Introduction
Face identification can be decomposed into two subproblems: recognition and tagging. Here we understand recognition as the unsupervised task of matching an observed face to a cluster of previously seen faces with similar appearance (disregarding variations in pose, illumination etc.), which we refer to as an identity. Humans routinely operate at this level of abstraction to recognise familiar faces: even when people’s names are not known, we can still tell them apart. Tagging, on the other hand, refers to putting names to faces, i.e. associating string literals to known identities.
An important aspect of social interactions is that, as an individual continues to observe faces every day, they encounter some people much more often than others, and the total number of distinct identities ever met tends to increase virtually without bounds. Additionally, we argue that human face recognition does not happen in an isolated environment, but situational contexts (e.g. ‘home’, ‘work’, ‘gym’) constitute strong cues for the groups of people a person expects to meet (Fig. 2).
With regard to tagging, in daily life we very rarely obtain named face observations: acquaintances normally introduce themselves only once, and not repeatedly whenever they are in our field of view. In other words, humans are naturally capable of semi-supervised learning, generalising sparse name annotations to all observations of the corresponding individuals, while additionally reconciling naming conflicts due to noise and uncertainty.
We recently introduced a unified Bayesian model which reflects all the above considerations on identity distributions, context-awareness and labelling (Fig. 2) (Castro and Nowozin, 2018). Our nonparametric identity model effectively represents an unbounded population of identities, while taking contextual co-occurrence relations and sparse noisy labels into account.
In this preliminary work, we extend that context model in two ways: we explore its limit with an unbounded number of contexts, uncovering a rich nonparametric structure, and we lay the foundations for incorporating environmental cues (such as timestamps and geographical locations of frames) in our model to improve unsupervised context discovery and prediction.
2 Background
We begin by reviewing the face identification framework presented in Castro and Nowozin (2018), consisting of four main components: a context model (which we extend in Section 3), an identity model, a face model, and a semi-supervised label model.
2.1 Context Model
Data is assumed to be collected in frames, i.e. photo album or video stills, which are run through some off-the-shelf face detector. This produces $N$ observations, grouped into the $M$ frames via an indicator $f_n = m$ for each observation $n$ in frame $m$. Context is therefore naturally shared among all face detections in each frame. We model context as a discrete latent variable, representing categories of situations in which a subject may find herself: e.g. home, work, gym.
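We assume the context indicators $c_m \in \{1, \dots, C\}$ for each frame $m$, where $C$ is some fixed number of distinct contexts, are independently distributed according to probabilities $\boldsymbol{\omega}$, which themselves follow a Dirichlet prior:
$$\boldsymbol{\omega} \sim \mathrm{Dir}(\boldsymbol{\gamma}), \qquad c_m \mid \boldsymbol{\omega} \sim \mathrm{Cat}(\boldsymbol{\omega}), \quad m = 1, \dots, M, \tag{1}$$
where $M$ is the total number of frames. In our simulation experiments in Castro and Nowozin (2018), we used a symmetric Dirichlet prior, setting $\boldsymbol{\gamma} = (\gamma_0 / C, \dots, \gamma_0 / C)$.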
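As a rough illustration of this context prior, the sketch below draws context proportions from a symmetric Dirichlet and then one context indicator per frame. The function name `sample_contexts`, the parameter `gamma0` and the choice of splitting the total concentration evenly across contexts are assumptions made for this example, not details taken from the original experiments.

```python
import numpy as np

def sample_contexts(num_frames, num_contexts, gamma0=1.0, seed=0):
    """Draw context proportions and per-frame context indicators.

    Assumption: symmetric Dirichlet with total concentration gamma0,
    i.e. each of the C components gets gamma0 / C.
    """
    rng = np.random.default_rng(seed)
    gamma = np.full(num_contexts, gamma0 / num_contexts)
    omega = rng.dirichlet(gamma)  # context proportions
    # One categorical draw per frame; contexts are coded 0..C-1.
    contexts = rng.choice(num_contexts, size=num_frames, p=omega)
    return omega, contexts

omega, contexts = sample_contexts(num_frames=20, num_contexts=3)
print("context proportions:", np.round(omega, 3))
print("frame contexts:", contexts)
```

With a small total concentration the draw tends to put most of the mass on a few contexts, which is consistent with a subject spending most of their time in a handful of situations.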
2.2 Identity Model
In a daily-life scenario, an increasing number of unique identities will tend to appear as more faces are observed, i.e. we do not expect a user to run out of new people to meet. Moreover, some people are likely to be encountered much more often than others. Since a Dirichlet process (DP) (Ferguson, 1973) displays properties that mirror all of the above phenomena (Teh, 2010), it is a sound choice for modelling the distribution of identities.
Furthermore, the assumption that all people can potentially be encountered in any context, but with different probabilities, is well captured by a hierarchical Dirichlet process (HDP) (Teh et al., 2006). Making use of the context model, we define one DP per context $c$, each with concentration parameter $\alpha_c$ and sharing the same global DP as a base measure. This hierarchical construction thus produces context-specific distributions over a common set of identities. Such a nonparametric model is additionally well suited for an open-set identification task, as it can elegantly estimate the prior probability of encountering an unknown identity.
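To each of the $N$ face detections is associated a latent identity indicator variable, $z_n$. Letting $\boldsymbol{\pi}_0$ denote the global identity distribution and $(\boldsymbol{\pi}_c)_{c=1}^{C}$ the context-specific identity distributions, with $f_n$ the frame of detection $n$ and $c_m$ the context of frame $m$, we can write the generative process as
$$\boldsymbol{\pi}_0 \sim \mathrm{GEM}(\alpha_0), \tag{2}$$
$$\boldsymbol{\pi}_c \mid \boldsymbol{\pi}_0 \sim \mathrm{DP}(\alpha_c, \boldsymbol{\pi}_0), \quad c = 1, \dots, C, \tag{3}$$
$$z_n \mid f_n = m, \mathbf{c}, (\boldsymbol{\pi}_c)_c \sim \mathrm{Cat}(\boldsymbol{\pi}_{c_m}), \quad n = 1, \dots, N, \tag{4}$$
where $\mathrm{GEM}(\alpha_0)$ is the DP stick-breaking distribution, $\pi_{0i} = \beta_i \prod_{j<i} (1 - \beta_j)$, with $\beta_i \sim \mathrm{Beta}(1, \alpha_0)$ and $i = 1, \dots, \infty$ (Sethuraman, 1994; Pitman, 2006). We additionally define a hierarchical GEM distribution, $\mathrm{HGEM}(\alpha_c, \boldsymbol{\pi}_0)$, such that $\pi_{ci} = \beta_{ci} \prod_{j<i} (1 - \beta_{cj})$, with $\beta_{ci} \sim \mathrm{Beta}\bigl(\alpha_c \pi_{0i},\, \alpha_c (1 - \sum_{j \le i} \pi_{0j})\bigr)$ (Teh et al., 2006, Eq. (21)).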
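The two stick-breaking constructions can be sketched numerically as follows. This is only an illustration under a finite truncation (the last stick is closed so the weights sum to one), which is an assumption of the example and not part of the model. The function names `sample_gem` and `sample_hierarchical_gem` are hypothetical.

```python
import numpy as np

def sample_gem(alpha0, truncation, rng):
    """Truncated stick-breaking draw of global identity weights."""
    betas = rng.beta(1.0, alpha0, size=truncation)
    betas[-1] = 1.0  # close the stick so the truncated weights sum to one
    # Stick length remaining before each break: prod_{j<i} (1 - beta_j).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining

def sample_hierarchical_gem(alpha_c, pi0, rng):
    """Context-specific weights over the same atoms as pi0
    (hierarchical stick-breaking in the style of Teh et al., 2006)."""
    tail = 1.0 - np.cumsum(pi0)  # global mass left after each atom
    betas = np.ones_like(pi0)    # last stick stays closed at 1
    for i in range(len(pi0) - 1):
        # Clip to keep Beta parameters strictly positive (numerical guard).
        a = max(alpha_c * pi0[i], 1e-12)
        b = max(alpha_c * tail[i], 1e-12)
        betas[i] = rng.beta(a, b)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining

rng = np.random.default_rng(0)
pi0 = sample_gem(alpha0=2.0, truncation=20, rng=rng)
pi_c = sample_hierarchical_gem(alpha_c=5.0, pi0=pi0, rng=rng)
print(np.round(pi0[:5], 3), np.round(pi_c[:5], 3))
```

Larger values of `alpha_c` keep the context-specific weights closer to the global ones, while smaller values let a context concentrate on a few identities.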
2.3 Face Model
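We assume that the observed features of the $n$-th face, $x_n$, arise from a parametric family of distributions, $F_X$. The parameters of this distribution, $\theta_i$, drawn from a prior, $H_X$, are unique for each identity $i$ and are shared across all face feature observations of the same person:
$$\theta_i \sim H_X, \quad i = 1, \dots, \infty, \qquad x_n \mid z_n, \boldsymbol{\theta} \sim F_X(\theta_{z_n}), \quad n = 1, \dots, N. \tag{5}$$
As a consequence, the marginal distribution of faces is given by an infinite mixture model (Antoniak, 1974): $x_n \mid f_n = m, \mathbf{c}, \boldsymbol{\theta}, (\boldsymbol{\pi}_c)_c \sim \sum_{i=1}^{\infty} \pi_{c_m i}\, F_X(\theta_i)$.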
In face recognition applications, it is typically more convenient and meaningful to extract a compact representation of face features than to work directly in a high-dimensional pixel space. For the experiments reported in Castro and Nowozin (2018), we used embeddings produced by a pre-trained neural network (Amos et al., 2016). We chose isotropic Gaussian mixture components for the face features ($F_X$), with an empirical Gaussian–inverse gamma prior for their means and variances ($H_X$).
2.4 Label Model
We expect to work with only a small number of user-labelled observations. Building on the cluster assumption for semi-supervised learning (Chapelle et al., 2006, Sec. 1.2.2), we attach a label variable (a name) to each cluster (identity), here denoted $y_i$. Since the number of distinct labels will tend to increase without bounds as more data is observed, we adopt a further nonparametric prior on these identity-wide labels, $H_Y \sim \mathrm{DP}(\lambda, L)$, using some base probability distribution $L$ over the countable but unbounded label space (e.g. strings). (One could instead consider a Pitman–Yor process if power-law behaviour seems more appropriate than the DP's exponential tails; Pitman and Yor, 1997.) In Castro and Nowozin (2018) we defined $L$ over a rudimentary language model. Lastly, the observed labels, $\tilde{y}_n$, are assumed potentially corrupted through some noise process, $F_Y$. Let $\mathcal{L}$ denote the set of indices of the labelled data. We then have
$$H_Y \sim \mathrm{DP}(\lambda, L), \tag{6}$$
$$y_i \mid H_Y \sim H_Y, \quad i = 1, \dots, \infty, \tag{7}$$
$$\tilde{y}_n \mid z_n, \mathbf{y} \sim F_Y(y_{z_n}), \quad n \in \mathcal{L}. \tag{8}$$
All concrete knowledge we have about the random label prior $H_Y$ comes from the set of observed labels, $\tilde{\mathbf{y}}_{\mathcal{L}}$. Crucially, we can easily marginalise out $H_Y$ (Teh, 2010), obtaining a tractable predictive label distribution, $\widehat{H}_Y$.
According to the proposed noise model, an observed label, $\tilde{y}_n$, agrees with its identity's assigned label, $y_{z_n}$, with a fixed probability. Otherwise, it is assumed to come from a modified label distribution, in which we delete $y_{z_n}$ from $\widehat{H}_Y$ and renormalise it. Here we use $\widehat{H}_Y$ in the error distribution instead of $L$ to reflect that a user is likely to mistake a person's name for another known name, rather than for an arbitrary random string.
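The label mechanism described above can be sketched in a few lines. The predictive distribution here is the standard DP predictive rule (base measure mixed with the empirical counts of previously seen labels). Whether the original implementation counts labels in exactly this way is an assumption. The base distribution `toy_base_prob`, the concentration `lam` and the agreement probability `p_correct` are hypothetical stand-ins, not the language model or settings used in Castro and Nowozin (2018).

```python
from collections import Counter

def toy_base_prob(label, stop=0.2, alphabet_size=26):
    """Hypothetical base distribution over lowercase strings:
    geometric length, uniform characters (sums to one over all strings)."""
    return stop * ((1.0 - stop) / alphabet_size) ** len(label)

def predictive_label_prob(label, seen_labels, lam, base_prob):
    """DP predictive probability of `label` after marginalising the random prior."""
    counts = Counter(seen_labels)
    return (lam * base_prob(label) + counts[label]) / (lam + len(seen_labels))

def observed_label_prob(observed, true_label, seen_labels, lam, base_prob,
                        p_correct=0.9):
    """Noise model: agree with the identity's label with probability p_correct,
    otherwise draw from the predictive distribution with the true label
    deleted and the remainder renormalised."""
    if observed == true_label:
        return p_correct
    p_obs = predictive_label_prob(observed, seen_labels, lam, base_prob)
    p_true = predictive_label_prob(true_label, seen_labels, lam, base_prob)
    return (1.0 - p_correct) * p_obs / (1.0 - p_true)

seen = ["alice", "alice", "bob"]
print(predictive_label_prob("alice", seen, 1.0, toy_base_prob))
print(observed_label_prob("bob", "alice", seen, 1.0, toy_base_prob))
print(observed_label_prob("zed", "alice", seen, 1.0, toy_base_prob))
```

In this toy run, mislabelling "alice" as the known name "bob" is far more probable than mislabelling her as the unseen string "zed", which is the behaviour the error distribution is meant to capture.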
3 Extended Context Model
The context framework employed in Castro and Nowozin (2018) assumes a finite collection of prespecified contexts and is fully supervised: an explicit context label is observed with each frame. This simplified scenario was adopted as a proof of concept, yet is admittedly unrealistic.
3.1 Unbounded Contexts
As reviewed in Section 2.1, the original context model had a finite prior. A natural extension of such a model is to take its limit as $C \to \infty$, while tying the values of all context-wise concentration hyperparameters ($\alpha_c = \alpha$), which results in a Dirichlet process (Neal, 2000). In particular, up to a reordering of the contexts, the prior on context proportions, $\boldsymbol{\omega}$, becomes $\mathrm{GEM}(\gamma_0)$.
This transformation has interesting theoretical and practical implications: the resulting structure is a nested-hierarchical Dirichlet process. (This is related to the dual-HDP described in Wang et al. (2009) and the single-entity model of Agrawal et al. (2013), for example, although these works tended to focus on textual topic modelling.) As before, at the top level we have the global identity distribution, $G_0$, over face parameters and labels, and the context-specific identity distributions, $G_c$, follow a DP with $G_0$ as a base measure:
$$G_0 \sim \mathrm{DP}(\alpha_0, H_X \times H_Y), \tag{9}$$
$$G_c \mid G_0 \sim \mathrm{DP}(\alpha, G_0), \tag{10}$$
a prototypical example of a hierarchical DP (HDP) (Teh et al., 2006). If $G_0 = \sum_{i=1}^{\infty} \pi_{0i}\, \delta_{(\theta_i, y_i)}$, we can write $G_c = \sum_{i=1}^{\infty} \pi_{ci}\, \delta_{(\theta_i, y_i)}$.
Now, the nonparametric distribution of contexts implies wrapping the bottom level of the HDP, Eq. 10, as base for another DP, to form a nested DP (Blei et al., 2010; Rodríguez et al., 2008):
$$Q \sim \mathrm{DP}\bigl(\gamma_0, \mathrm{DP}(\alpha, G_0)\bigr), \tag{11}$$
$$G_m \mid Q \sim Q, \quad m = 1, \dots, M. \tag{12}$$
This construction inherits desirable properties from both elements: the hierarchy ensures that all frame-wise identity distributions, $(G_m)_{m=1}^{M}$, have the same support, and nesting produces clusters of frames with shared identity weights (i.e. contexts).
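To make these two properties concrete, here is a minimal truncated simulation of the nested-hierarchical structure. It is a sketch under stated assumptions: both the number of identities and the number of contexts are truncated to finite values, and the context-specific weights are drawn as Dirichlet($\alpha \boldsymbol{\pi}_0$), which is what a DP over a finite set of atoms reduces to. All function names and hyperparameter values are hypothetical.

```python
import numpy as np

def stick_breaking(concentration, truncation, rng):
    """Truncated GEM draw; the last stick is closed so weights sum to one."""
    betas = rng.beta(1.0, concentration, size=truncation)
    betas[-1] = 1.0
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining

def sample_nested_hdp(num_frames, alpha0=2.0, alpha=5.0, gamma0=1.0,
                      max_identities=15, max_contexts=8, seed=0):
    rng = np.random.default_rng(seed)
    # Top level: global identity weights shared by every context.
    pi0 = stick_breaking(alpha0, max_identities, rng)
    # Nesting level: proportions of the (truncated) set of contexts.
    omega = stick_breaking(gamma0, max_contexts, rng)
    # Hierarchy level: one weight vector per context over the same atoms.
    # The clip is a numerical guard against vanishing Dirichlet parameters.
    context_weights = np.stack([
        rng.dirichlet(np.maximum(alpha * pi0, 1e-8))
        for _ in range(max_contexts)
    ])
    # Each frame picks a context and inherits its identity weights.
    frame_contexts = rng.choice(max_contexts, size=num_frames, p=omega)
    frame_weights = context_weights[frame_contexts]
    return pi0, omega, frame_contexts, frame_weights

pi0, omega, frame_contexts, frame_weights = sample_nested_hdp(num_frames=30)
print("distinct contexts used:", np.unique(frame_contexts))
```

Every row of `frame_weights` is defined over the same set of identities (shared support), and frames assigned to the same context have identical rows (shared weights).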
3.2 Environmental Cues
While a purely identity-driven unsupervised context model may be able to disentangle co-occurrence patterns given enough data, we believe that environmental cues (such as timestamp and GPS coordinates of an acquired frame, if available) could considerably facilitate context discovery and prediction, in turn improving inference about identities.
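Let us define $e_m$ as the environmental measurements available for frame $m$, $F_E$ a likelihood family parametrised by $\xi$, and $H_E$ a prior distribution for such parameters. Plugging $\mathrm{DP}(\alpha, G_0) \times H_E$ as base measure for the nested DP in Eq. 11, we can write
$$Q \sim \mathrm{DP}\bigl(\gamma_0, \mathrm{DP}(\alpha, G_0) \times H_E\bigr), \qquad (G_m, \xi_m) \mid Q \sim Q, \qquad e_m \mid \xi_m \sim F_E(\xi_m). \tag{13}$$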
Some preliminary ideas for a spatial model include a ‘geodetic’ Fisher distribution or a tangential Gaussian (Straub et al., 2015), while a temporal model would have to accommodate recurring and occasional contexts, potentially adopting a Cox process formalism (Cox, 1955).
4 Conclusion
In this work, we reviewed the fully Bayesian treatment of the face identification problem introduced in Castro and Nowozin (2018). Each component of our proposed approach was motivated by human intuition about face recognition and tagging in daily social interactions, such that our principled identity model can account for context-specific probabilities of meeting an unbounded population.
We further proposed a nonparametric extension of the context model enabling unbounded context discovery, and discussed some of its theoretical implications in terms of nested-hierarchical nonparametric structures. Finally, we briefly examined how available environmental cues could be integrated into the model to replace the simplified supervised setting.
Acknowledgments
This work was partly supported by CAPES, Brazil (BEX 1500/201505).
References
 Agrawal et al. (2013) Agrawal, P., Tekumalla, L. S., and Bhattacharya, I. (2013). Nested hierarchical Dirichlet process for nonparametric entity-topic analysis. In Machine Learning and Knowledge Discovery in Databases – ECML PKDD 2013, volume 8189 of LNCS, pages 564–579. Springer, Berlin, Heidelberg.
 Amos et al. (2016) Amos, B., Ludwiczuk, B., and Satyanarayanan, M. (2016). OpenFace: A general-purpose face recognition library with mobile applications. Technical Report CMU-CS-16-118, CMU School of Computer Science.
 Antoniak (1974) Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174.
 Blei et al. (2010) Blei, D. M., Griffiths, T. L., and Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):7.
 Castro and Nowozin (2018) Castro, D. C. and Nowozin, S. (2018). From face recognition to models of identity: A Bayesian approach to learning about unknown identities from unsupervised data. In Computer Vision – ECCV 2018, volume 11206 of LNCS, pages 745–761. Springer. Extended version with supplement: arXiv:1807.07872.
 Chapelle et al. (2006) Chapelle, O., Schölkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT Press.
 Cox (1955) Cox, D. R. (1955). Some statistical methods connected with series of events. Journal of the Royal Statistical Society: Series B (Methodological), 17(2):129–164.
 Ferguson (1973) Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230.
 Neal (2000) Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265.

 Pitman (2006) Pitman, J. (2006). Combinatorial Stochastic Processes, volume 1875 of Lecture Notes in Mathematics. Springer-Verlag, Berlin. Lectures from the 32nd Summer School on Probability Theory held in Saint-Flour, July 7–24, 2002, with a foreword by Jean Picard.
 Pitman and Yor (1997) Pitman, J. and Yor, M. (1997). The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2):855–900.
 Rodríguez et al. (2008) Rodríguez, A., Dunson, D. B., and Gelfand, A. E. (2008). The nested Dirichlet process. Journal of the American Statistical Association, 103(483):1131–1154.
 Sethuraman (1994) Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4(2):639–650.

 Straub et al. (2015) Straub, J., Chang, J., Freifeld, O., and Fisher III, J. W. (2015). A Dirichlet process mixture model for spherical data. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS 2015), volume 38 of PMLR, pages 930–938. PMLR.
 Teh (2010) Teh, Y. W. (2010). Dirichlet process. In Sammut, C. and Webb, G. I., editors, Encyclopedia of Machine Learning, pages 280–287. Springer US.
 Teh et al. (2006) Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
 Wang et al. (2009) Wang, X., Ma, X., and Grimson, W. E. L. (2009). Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(3):539–555.