Demographic characteristics of an individual, such as age, gender, and marital status, are used in many real-world applications. For example, one of the first decisions taken when designing a marketing campaign is the target population, which is defined by, among other things, demographic characteristics (Jansen and Solomon, 2010). Content and user experience in many applications are modulated by user characteristics, aiming to provide the best possible version to each individual. Another family of applications for which demography is crucial is insurance; for instance, demographic characteristics are required for life expectancy estimation, and specific demographic variables such as gender and marital status are also relevant for car insurance, house insurance, and others.
However, these two examples are fundamentally different. When filling in an insurance form, one must provide all the required information, or else the application will be rejected. On the other hand, marketing and advertising professionals often have partial demographic data at best. Although people are often asked to provide their personal information, the provider may have no effective way to enforce disclosure of full and accurate demographic details. Furthermore, many users are reluctant to give this data due to privacy concerns. Another complication with demographic data is keeping it up to date. Many life events change demographic status, so even data that was accurate when acquired may no longer be so. Demanding that all users keep their information up to date is possible only for mandatory actions such as filling in a government form.
The above discussion suggests that inferring demographics from available data is of great value. Specifically, for financial applications with access to much of a user’s financial data, it is important to be able to model the user and infer demographic attributes based on that data. Our first and foremost motivating application is to personalize and deliver more relevant content in a better, user-suited experience. In addition, we discuss the application of the proposed method to detecting changes in a user’s profile triggered by life events, and to the related task of preventing identity theft.
In the current application, the data available in real time is bank transaction data. For some users we also have partial demographic variables available, as well as other data. A minority of the users provide full demographic information. In this work we focus on predicting demographics based solely on the sequences of transactions, augmented by relational information. This data is guaranteed to exist for all active users. The main contribution of this paper is a novel method for fusing multiple sequences of categorical information, together with structured relational data, for end-to-end multi-task prediction of demographic characteristics.
2. Related Work
User modeling for demographic prediction has long been a topic of interest both in academia and in industry. Methods have been proposed for inferring demographics based on browsing behavior and search queries (Hu et al., 2007; Bi et al., 2013; Culotta et al., 2015), social links (Zhuang et al., 2006; Dong et al., 2014), and mobile behavior (Ying et al., 2012; Malmi and Weber, 2016). These works, as well as other older works, apply hand-crafted features for the predictive modeling.
Our method draws inspiration from a previous line of work (Wang et al., 2016), in which a method is proposed to learn a user representation for multi-target prediction of demographic attributes, based on a single sequence of supermarket purchases. Our method can be seen as a generalization to multiple sequences, combined with auxiliary relational information and a deep representation component.
We formulate the demographics prediction problem in terms of multi-task learning based on heterogeneous inputs. Inputs take two forms: sequences of categorical variables, and numerical information. Outputs are all assumed to be categorical. Formally, we are interested in learning functions of the form:
$$f : S_1^* \times \dots \times S_n^* \times \mathbb{R}^m \to T_1 \times \dots \times T_k$$
where $S_1, \dots, S_n$ are finite sets representing categorical information (such as the identity of the merchant a purchase was made from), and likewise the finite sets $T_1, \dots, T_k$ are categorical demographic information to be predicted (such as education). The notation $S^*$ is used to denote a finite sequence of elements of the set $S$.
We adopt an embedding approach, where we represent each of the variable-length sequences as a constant-size vector. This is done using an aggregation function (such as averaging) over the embedding vectors representing the individual elements (see the blue unit at the bottom left of Figure 1). Much like aggregating word vectors in NLP tasks, this approach transforms the unstructured sequence problem into a constant-size vector representation amenable to further processing.
For each of the input sequences $s_i \in S_i^*$, we define an element embedding function:
$$e_i : S_i \to \mathbb{R}^{d_i}$$
and the sequence $s_i = (s_{i,1}, \dots, s_{i,\ell_i})$ of elements of $S_i$ is then represented as:
$$E_i(s_i) = g\big(e_i(s_{i,1}), \dots, e_i(s_{i,\ell_i})\big)$$
where $g$ is the aggregation function.
In practice, since we assume nothing about the embedding functions, and would rather rely on the data and classification task in order to learn a meaningful representation of the sequence elements, the functions $e_i$ are represented non-parametrically as matrices $W_i \in \mathbb{R}^{d_i \times |S_i|}$. The $j$-th column of $W_i$ is thus the embedding of the $j$-th element of the set $S_i$.
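The aggregation step can be illustrated with a minimal sketch, assuming averaging as the aggregation function. The function name `embed_sequence`, the table `merchant_embeddings`, and all values are hypothetical and chosen only for illustration; a dictionary from element to vector stands in for the columns of an embedding matrix, which in the actual model would be learned rather than fixed.

```python
from typing import Dict, List, Sequence


def embed_sequence(seq: Sequence[str],
                   table: Dict[str, List[float]],
                   dim: int) -> List[float]:
    """Average the embedding vectors of the elements of a variable-length sequence."""
    if not seq:
        # Assumption (not from the paper): an empty sequence maps to the zero vector.
        return [0.0] * dim
    total = [0.0] * dim
    for element in seq:
        vector = table[element]
        for k in range(dim):
            total[k] += vector[k]
    return [value / len(seq) for value in total]


# Hypothetical 2-dimensional embeddings for three merchants.
merchant_embeddings = {
    "grocer": [1.0, 0.0],
    "cinema": [0.0, 1.0],
    "pharmacy": [1.0, 1.0],
}

# Sequences of different lengths map to vectors of the same size.
short = embed_sequence(["grocer", "cinema"], merchant_embeddings, dim=2)
longer = embed_sequence(["grocer", "grocer", "pharmacy", "cinema"],
                        merchant_embeddings, dim=2)
print(short)   # [0.5, 0.5]
print(longer)  # [0.75, 0.5]
```

Whatever the number of transactions, the output has the embedding dimension, which is what makes the result usable as a fixed-size input to later layers.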
Following the embedding of the sequences, the derived representations are concatenated together with the auxiliary numerical information to form the raw user representation. The final deep user representation is obtained by feeding the raw representation into a fully connected neural network. All demographic fields are then inferred from this layer using softmax classification layers (see Figure 1 for a description of the full end-to-end architecture).
The full model is then trained on the combined (categorical cross-entropy) losses of the individual targets. This combined end-to-end approach allows the learned user representation to combine and share information from all the prediction tasks, towards a unified view of the user as a whole.
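The forward pass and combined loss described above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: it assumes a single fully connected layer with a ReLU activation and an unweighted sum of the per-target cross-entropy losses, neither of which is specified in the text, and all names (`multitask_loss`, `heads`, the feature values) are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]  # one row of weights per output unit


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Affine map producing one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def softmax(logits: Vector) -> Vector:
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multitask_loss(seq_embeddings: List[Vector],
                   numeric: Vector,
                   hidden_w: Matrix,
                   hidden_b: Vector,
                   heads: Dict[str, Tuple[Matrix, Vector]],
                   labels: Dict[str, int]) -> float:
    # Raw user representation: sequence embeddings concatenated with numeric features.
    raw = [v for emb in seq_embeddings for v in emb] + list(numeric)
    # Deep user representation (assumption: one ReLU layer).
    deep = relu(dense(raw, hidden_w, hidden_b))
    # Assumption: the combined loss is the unweighted sum of per-target cross-entropies.
    loss = 0.0
    for target, (w, b) in heads.items():
        probs = softmax(dense(deep, w, b))
        loss += -math.log(probs[labels[target]])
    return loss


seq_embeddings = [[0.5, 0.5], [0.75, 0.5]]  # two aggregated sequence embeddings
numeric = [2.0]                             # hypothetical account-count feature
hidden_w = [[0.0] * 5 for _ in range(3)]    # 5 raw inputs -> 3 hidden units
hidden_b = [0.0] * 3
heads = {
    "gender": ([[0.0] * 3 for _ in range(2)], [0.0] * 2),     # 2 classes
    "education": ([[0.0] * 3 for _ in range(4)], [0.0] * 4),  # 4 classes
}
labels = {"gender": 1, "education": 2}

# With all-zero weights every head predicts a uniform distribution,
# so the loss equals log(2) + log(4).
loss = multitask_loss(seq_embeddings, numeric, hidden_w, hidden_b, heads, labels)
print(loss)
```

Because every head reads the same `deep` vector, gradients from all targets would flow into the shared layer and the embeddings, which is the mechanism by which information is shared across tasks.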
We further propose that fine-tuning the model with respect to each of the targets could benefit the individual target performance. In the fine-tuning experiments (section 4) the model is first trained in the multi-task setting described above. Then, all but a single softmax output are removed together with the associated weights, and the partial network is fine-tuned using the categorical cross-entropy loss of the remaining target.
[Table 1. Columns: gender | marriage status | household adults | household children | education level | residential status]
In this section we present experiments comparing our method to standard baselines, and demonstrating the ability of our proposed method to model users for the purpose of demographics prediction. We further investigate the effect of model attributes such as embedding size and network depth, and test the idea of fine-tuning with respect to individual targets.
4.1. Experimental Setup
Experiments are conducted on real-world user data collected on a large scale by a US-based financial management software product. A total of users were sampled based on recent activity and the availability of at least partial demographic information. Of these, the target demographic fields (gender, marital status, number of adults in the household, number of children in the household, education level, and residential status) were all available for users. These were selected as the sample for the experiments in the current study.
The features used for demographic modeling are of two types. The first type is sequence information related to bank transactions, and includes the category of purchases as well as the merchant names. The second type is structured data on the type and number of financial accounts held by the user. Moreover, for some of the users, partial demographic information is available, as well as other relevant structured data. However, for the purpose of the current work, in order to investigate the extent to which these data contain the demographic information, we ignore this partial data and restrict ourselves to the worst-case scenario, which requires prediction of the full demographic profile.
The sample population is heavily skewed toward male over female users (Figure 3). Even more notable is the near absence of the divorced marital status. These and other limitations are inherent to this sort of user data, and limit (at least to a certain extent) the ability to use the provided labels for supervised learning in a straightforward way.
A further complication often results from self-selection bias in the identity of users who provide demographic information. This is less of an issue for the current data, because the existence or absence of demographic records is related to the version of the product used, rather than to a choice made by the users themselves.
4.1.2. Methods and Baselines
We compare the proposed method to the following baselines:
- arg-max: for each target, always predict the category with the highest proportion. This is a sanity-check baseline giving the accuracy available to a classifier with no feature information.
- stacking: raw features are stacked horizontally and a logistic regression is applied. Sequences are represented as a distribution over the elements they comprise.
- PCA per sequence type: for each sequence, PCA is applied to the user-category proportion matrix, and the first 50 components are kept. These are stacked horizontally together with the relational input data and a logistic regression is applied as before. This baseline represents a simple linear approach to sequence embedding.
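The feature-free baseline and the sequence-as-distribution representation used by the other two baselines can be sketched as follows. This is a minimal illustration with made-up labels and categories; the function names are hypothetical, and the logistic regression and PCA steps are deliberately omitted.

```python
from collections import Counter
from typing import List, Sequence


def argmax_baseline_accuracy(train_labels: Sequence[str],
                             test_labels: Sequence[str]) -> float:
    """Accuracy of always predicting the most frequent training category."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    return sum(label == majority for label in test_labels) / len(test_labels)


def proportion_features(seq: Sequence[str],
                        vocabulary: Sequence[str]) -> List[float]:
    """Represent a sequence as the distribution over the elements it contains."""
    if not seq:
        # Assumption: a user with no transactions gets an all-zero row.
        return [0.0] * len(vocabulary)
    counts = Counter(seq)
    return [counts[item] / len(seq) for item in vocabulary]


# Made-up labels: the majority class in the training set is "male".
train = ["male", "male", "female", "male"]
test = ["male", "female", "male", "male"]
accuracy = argmax_baseline_accuracy(train, test)
print(accuracy)  # 0.75

# Made-up purchase categories for one user.
vocabulary = ["food", "travel", "rent"]
features = proportion_features(["food", "food", "travel", "food"], vocabulary)
print(features)  # [0.75, 0.25, 0.0]

# Stacking: proportions concatenated with hypothetical account-count features.
stacked = features + [2.0, 1.0]
print(stacked)  # [0.75, 0.25, 0.0, 2.0, 1.0]
```

One row of `proportion_features` per user gives the user-category proportion matrix; the stacking baseline feeds such rows directly to a linear classifier, while the PCA baseline first projects them onto a smaller number of components.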
Two variants of the proposed method are tested:
- fine-tuning: the full model is trained as before. Then, for each output, the sub-network pertaining to that classification task is fine-tuned (i.e. all other outputs are discarded, producing effectively a model with a single output layer).
In both cases, all models are trained for up to 500 epochs, with an early stopping criterion. In practice, convergence is achieved after a few dozen epochs at most.
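The text does not specify the early stopping criterion. One common choice, shown here purely as an assumption, is to stop once the validation loss has failed to improve for a fixed number of consecutive epochs; the `patience` value and the synthetic loss curve below are hypothetical.

```python
from typing import Callable, Tuple


def train_with_early_stopping(run_epoch: Callable[[int], float],
                              max_epochs: int = 500,
                              patience: int = 10) -> Tuple[int, float]:
    """Run up to `max_epochs` epochs and stop once the validation loss has not
    improved for `patience` consecutive epochs.

    `run_epoch(epoch)` trains for one epoch and returns the validation loss.
    Returns (number of epochs run, best validation loss).
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch in range(max_epochs):
        val_loss = run_epoch(epoch)
        epochs_run = epoch + 1
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run, best_loss


def fake_epoch(epoch: int) -> float:
    """Synthetic validation curve: improves for 20 epochs, then plateaus."""
    return max(100 - 4 * epoch, 20) / 100


epochs_run, best_loss = train_with_early_stopping(fake_epoch)
print(epochs_run, best_loss)  # 31 0.2
```

With this synthetic curve the loop stops after 31 of the 500 allowed epochs, consistent in spirit with convergence after a few dozen epochs.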
First, we compare the proposed method to the aforementioned baselines for demographic information prediction (results are summarized in Table 1). Overall, the proposed method outperforms the baselines in all demographic fields. The margin varies substantially, from several percent in gender and marital status prediction to under one percent for education level.
Somewhat surprisingly, model fine-tuning with respect to each of the individual targets does not improve upon the multi-target model results. Furthermore, for out of the targets, fine-tuning leads to slightly diminished results. We suggest that the multi-target model objective may provide some additional regularization, and hence reduce over-fitting, an effect which is then destroyed by fine-tuning with a single target.
Next, we turn to the effect of some of the major design parameters of the proposed model. With respect to the embedding size of sequence elements (Figure 2), the major improvement in classification performance in all the demographics prediction tasks is obtained when increasing from to . Minor additional improvement is obtained with size embeddings. Results in Table 1 are given for a model with an embedding size of for each of the two sequences.
With respect to the depth of the fully connected sub-network moving from user representations to deep user representations (see Figure 1), only a minor improvement is achieved when increasing from to layers. This implies that the majority of the representational work is done by the sequence embedding mechanism, and further processing of this initial user representation is redundant to a large extent. We note that using a larger amount of structured information (red bar in Figure 1) could further benefit the deep user representation. Results in Table 1 are given for a model with a depth of in the fully connected part.
In this paper we introduce an end-to-end embedding approach for user representation and demographic predictions based on multiple sequences describing attributes of purchases, and augmented by auxiliary relational data. Experiments on a large real-world dataset demonstrate the superiority of this method relative to standard baseline alternatives.
Beyond personalization, robust user representation holds promise in two additional use cases, both of which relate to change in user behavior. The first, life-event detection, is an important issue in many applications where user behavior changes, or is expected to change, following events such as marriage. The second, detection of compromised accounts, is important to security and the prevention of identity theft in online accounts. Future research will focus on these domain-specific applications. An additional benefit of user representation as described here is as a privacy-maintaining user model, which avoids the need to use highly sensitive data such as user transaction information for the aforementioned purposes.
- Bi et al. (2013) Bin Bi, Milad Shokouhi, Michal Kosinski, and Thore Graepel. 2013. Inferring the demographics of search users: Social data meets search queries. In Proceedings of the 22nd international conference on World Wide Web. ACM, 131–140.
- Culotta et al. (2015) Aron Culotta, Nirmal Ravi Kumar, and Jennifer Cutler. 2015. Predicting the Demographics of Twitter Users from Website Traffic Data. In AAAI. 72–78.
- Dong et al. (2014) Yuxiao Dong, Yang Yang, Jie Tang, Yang Yang, and Nitesh V Chawla. 2014. Inferring user demographics and social strategies in mobile social networks. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 15–24.
- Hu et al. (2007) Jian Hu, Hua-Jun Zeng, Hua Li, Cheng Niu, and Zheng Chen. 2007. Demographic prediction based on user’s browsing behavior. In Proceedings of the 16th international conference on World Wide Web. ACM, 151–160.
- Jansen and Solomon (2010) Bernard J Jansen and Lauren Solomon. 2010. Gender demographic targeting in sponsored search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 831–840.
- Malmi and Weber (2016) Eric Malmi and Ingmar Weber. 2016. You Are What Apps You Use: Demographic Prediction Based on User’s Apps. In ICWSM. 635–638.
- Wang et al. (2016) Pengfei Wang, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng. 2016. Your cart tells you: Inferring demographic attributes from purchase data. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ACM, 173–182.
- Ying et al. (2012) Josh Jia-Ching Ying, Yao-Jen Chang, Chi-Min Huang, and Vincent S Tseng. 2012. Demographic prediction based on users mobile behaviors. Mobile Data Challenge (2012).
- Zhuang et al. (2006) Dong Zhuang, Benyu Zhang, Heng Zhang, Jeremy Tantrum, Teresa B Mah, Hua-Jun Zeng, Zheng Chen, and Jian Wang. 2006. Demographic prediction using a social link network. (Sept. 26 2006). US Patent App. 11/535,160.