Customers’ demographic information is a valuable asset for many companies because it is useful for making strategic decisions. When developing their marketing strategy, most companies use demographic segmentation to decide which market to target [11]. For example, Apple’s iPhone mainly targets relatively young customers (age) with enough purchasing power (income). As the importance of successful recommender systems grows, numerous methods have used demographic information to improve the quality of these systems and to solve the cold start problem [6, 18].
The methods mentioned above can be used successfully only when sufficient demographic information is available. However, it is difficult for most companies to collect demographic information about their customers. Following a series of data breaches, customers are reluctant to give sensitive information to companies. As Wang et al. [19] pointed out, in real-world data only partial demographic attributes are known for a great number of users, and some users have no attributes at all.
The aforementioned methods use demographic information to determine purchase propensity. If demographic information carries information about purchase propensity, then purchasing history can conversely be used to predict demographic information. For example, Wang et al. [19] focused on predicting the demographic attributes of customers from their purchasing histories. Resheff et al. [17] used a similar approach but employed a model with a deeper structure.
Although these works raised interesting questions and made meaningful progress, they have some limitations. First, although demographic attribute prediction is naturally a multi-task problem, previous works focused on learning a shared user representation from user transactions. By sharing a representation, models can learn general patterns and trends of users. For example, users who purchase products popular in the older age group are more likely to be married. However, in multi-task learning, it is also important that the model is designed to learn task-specific features. No existing work has achieved a balance between learning general features and task-specific features. Second, existing works assume that all transactions are equally important for demographic prediction, which is not true in most settings. Intuitively, purchasing multiple cosmetic items can be a clue that the user is female, whereas daily necessities do not provide much information for inferring a user’s gender. Some transactions contribute more to predicting a demographic attribute than others, and their importance varies across tasks. Additionally, although the existing models improved prediction accuracy, none of them adopted an interpretable model structure. In real-world business, understanding how a model arrives at its predictions provides useful insights into consumer behavior.
To address the limitations mentioned above, we propose the Embedding Transformation Network with Attention (ETNA) to predict demographic attributes from transaction data. In ETNA, we share embeddings at the bottom of the model structure and convert them to task-specific embeddings using a simple linear transformation. This transformation selectively captures the appropriate features for each task from the shared information. To focus more on informative transactions than on irrelevant ones, we apply an attention mechanism to each task. The experimental results demonstrate that our model is more accurate than existing models. Furthermore, by analyzing and visualizing the attention weights, we can identify which types of transactions contribute more to predicting an attribute. Moreover, we release a new benchmark dataset for demographic prediction in a retail business scenario.
To summarize, our work makes the following contributions:
- We propose the Embedding Transformation Network with Attention (ETNA), which learns task-specific features from shared information and automatically identifies the transactions that are more informative for predicting each attribute.
- The experimental results show that our model is more accurate than existing models. Using the scoring mechanism of our attention network, we conducted a qualitative analysis and obtained results that provide insight into consumer behavior.
- We developed and released a new benchmark dataset for demographic prediction in a retail business scenario, which can be used for future research.
The remainder of the paper is organized as follows. After discussing related work in Section 2, we provide the task description in Section 3. In Section 4, we describe our proposed ETNA model in detail. The experimental results and qualitative analysis are presented in Section 5. Finally, we conclude our paper in Section 6.
2 Related Work
2.1 Demographic prediction
Numerous approaches have been proposed for predicting the demographic attributes of a user from various types of data. Hu et al. [7] used the browsing history of users. They found that search behavior can vary depending on gender and age, and they achieved meaningful prediction accuracy. With the advent of the big data era, several works have used social network data and search queries to infer demographic attributes [2, 4]. Location and mobile application usage data have also been used to predict demographic attributes [15, 22].
Some works have used purchasing history to predict demographic attributes. Wang et al. [19] used a large-scale dataset from a Chinese retailer, which was originally designed for a recommendation task. The dataset is publicly available but has no metadata (e.g., item descriptions, categories) of the kind needed to interpret a model. Resheff et al. [17] also used the purchasing history of users, but their data is not available to the public. To the best of our knowledge, there is no publicly accessible dataset that contains both metadata and user demographic information. To understand how our model predicts demographic attributes for real-world business, we used transaction data with its metadata. We make our transaction data freely available for public use and future research on demographic prediction. A detailed description of our dataset is provided in Section 5.
2.2 Multi-Task Learning
Multi-task learning has been widely used in various domains such as natural language processing [14], speech recognition [5], and other domains [16].
Despite its success in various tasks, some challenges remain in multi-task learning. The underlying assumption of multi-task learning is that features learned from similar tasks are helpful for each task. We must therefore decide the type and amount of information to be shared. There is a trade-off between sharing general information and capturing task-specific features. When we focus too much on general information by sharing parameters too rigidly, we lose task-specific signals. Conversely, helpful signals from other related tasks may be missed if models focus on only a single task. One promising approach to this problem is to assume that the task parameters within a group lie in a low-dimensional subspace. Based on this assumption, Kumar et al. [9] argued that each task parameter vector can be obtained as a linear combination of a finite number of underlying basis latent vectors. Inspired by their work, we apply a simple transformation to the shared representation to obtain task-specific representations.
2.3 Attention Mechanism
The attention mechanism has been used and proven successful in various tasks. It was first proposed to capture the most relevant information at each step in a machine translation task [1]. Since the attention mechanism can selectively focus on more informative data, it is also used in other natural language processing tasks such as question answering [10] and text classification [20]. Many other domains, such as recommendation and computer vision, have also adopted the attention mechanism [3, 12].
Researchers use the attention mechanism not only to improve performance, but also to make models more interpretable. In some domains such as medical diagnosis, designing a model that can be interpreted by a human is as important as achieving high accuracy [21]. Furthermore, by analyzing the decision-making process of a model, we can obtain useful insights into the domains of interest [13].
3 Problem Formalization
In this section, we formalize the demographic prediction task in a retail business scenario. Let $\mathcal{D} = [(X_1, Y_1), \ldots, (X_N, Y_N)]$ be a list of all data samples, where $N$ is the number of samples. As each sample corresponds to an individual user, $X_i$ and $Y_i$ are all the historical transactions and the demographic attributes of the $i$-th user, respectively. $Y_i$ can also be viewed as a list of labels, one for each task, $Y_i = [y_{i,1}, \ldots, y_{i,K}]$, where $K$ is the number of attributes. The number of possible classes for the $k$-th attribute is $C_k$. The transaction history $X_i = [x_{i,1}, \ldots, x_{i,T_i}]$ can be either an ordered or an unordered list of transactions depending on the dataset, where $T_i$ is the length of the $i$-th user’s history.
In a real-world scenario, we might have full, partial, or no information about the demographic attributes of users. Our goal is to predict all the missing attributes in the dataset. We consider the two problem settings used in [19].
Partial Label Prediction addresses the situation in which users have provided only some of their demographic attributes (partially observed), so that a company wants to infer the remaining unknown attributes. Let $X = [X_1, \ldots, X_N]$ be the set of users’ transaction histories and $Y^{o} = [Y^{o}_1, \ldots, Y^{o}_N]$ be the users’ partially observed demographic attributes. Given $X$ with $Y^{o}$, the objective is to learn a function $f$ that predicts the unknown attributes $Y^{u} = [Y^{u}_1, \ldots, Y^{u}_N]$. Note that $Y = [Y_1, \ldots, Y_N] = Y^{o} \cup Y^{u}$.
New User Prediction is the task of predicting demographic attributes for new users. Given $X^{\mathrm{train}}$ with partially or fully observed attributes $Y^{\mathrm{train}}$, the objective is to learn a function $f$ that predicts demographic attributes for new users. The new users’ transaction histories are $X^{\mathrm{test}}$ and the corresponding labels are $Y^{\mathrm{test}}$. Note that, unlike partial label prediction, where $X$ is used as the input for both the training and test sets, $X$ is split into $X^{\mathrm{train}}$ for the training set and $X^{\mathrm{test}}$ for the test set, which implies $X^{\mathrm{train}} \cap X^{\mathrm{test}} = \emptyset$.
4 Our Approach
In this section, we explain our Embedding Transformation Network with Attention (ETNA) in detail. First, we learn embeddings of transactions, which are shared across all tasks, in a Shared Embedding Layer. The embeddings are learned for each item, or for the company of each purchased item, depending on the dataset. For our dataset, we learn embeddings for each company. To obtain task-specific representations, we use an Embedding Transformation Layer that converts the shared vector space into a task-specific vector space for each task using a simple linear transformation. To encode relationships between transactions and demographic attributes, we use an attention mechanism in a Task-specific Attention Layer. Finally, we obtain prediction values for each class from a Prediction Layer. The overall model architecture is depicted in Figure 1(c).
Note that, given a user’s transaction history $X_i = [x_{i,1}, \ldots, x_{i,T_i}]$, the task is to predict all of the user’s unknown attributes $Y^{u}_i = [y_{i,1}, \ldots, y_{i,K}]$. In the partial label prediction problem, the number of attributes to be predicted is less than or equal to $K$, but to simplify the description we fix the number of unknown attributes to $K$. We also replace $T_i$ with $T$ and omit the subscript $i$.
4.1 Shared Embedding Layer
Given a transaction history $X = [x_1, \ldots, x_T]$, where each transaction is represented as the index of an item or company, the shared embedding layer maps every transaction to a $d$-dimensional vector. This operation yields $E = [e_1, \ldots, e_T]$ with $e_t \in \mathbb{R}^{d}$. In multi-task learning, the underlying assumption is that features learned from similar tasks are helpful for each task. Accordingly, in the shared embedding layer, we learn embeddings that are generally informative, and these embeddings are shared globally across all tasks.
As shown in Figure 1, all our baseline models have embeddings that are shared across all tasks. From the shared embeddings of transactions, they obtain a single user representation that is used to predict all attributes. However, with this structure, models cannot capture task-specific features. In the next subsection, we introduce the embedding transformation layer to solve this problem.
4.2 Embedding Transformation Layer
While the shared embedding layer captures features that are globally informative in all tasks, the embedding transformation layer is responsible for capturing task-specific features that are not shared across tasks. The embedding transformation layer changes the vector space of shared embeddings into a vector space of task-specific features. The embedding transformation is computed as follows:

$$ e^{k}_{t} = W^{k} e_{t}, \quad t = 1, \ldots, T \tag{1} $$

where $k \in [1, \ldots, K]$ is the index of a task. The matrix $W^{k} \in \mathbb{R}^{d' \times d}$ is composed of trainable parameters. The $d$-dimensional shared embeddings are converted to task-specific embeddings of size $d'$. As $W^{k}$ is optimized only for the $k$-th task, the converted embeddings form a task-oriented vector space.
The linear transformation in this layer can also be viewed as a mapping function from globally shared embeddings to task-specific embeddings. By obtaining embeddings in a task-specific vector space based on shared features, our model can learn both general and more task-specific information.
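The two layers described so far can be illustrated with a minimal sketch in plain Python. The function names (`matvec`, `transform`), the toy dimensions, and the random initialization are assumptions made for illustration and are not taken from the released code; in the actual model the matrices would be trained jointly with the rest of the network.

```python
import random

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def transform(history, shared_embeddings, task_matrix):
    """Look up each transaction's shared embedding, then map it to one task's space."""
    return [matvec(task_matrix, shared_embeddings[idx]) for idx in history]

random.seed(0)
num_companies, d, d_task, num_tasks = 5, 4, 3, 3  # toy sizes (assumed)

# Shared embedding table: one d-dimensional vector per company, used by every task.
shared_embeddings = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_companies)]

# One (d_task x d) matrix per task; each would be updated only by its own task's loss.
task_matrices = [
    [[random.gauss(0, 1) for _ in range(d)] for _ in range(d_task)]
    for _ in range(num_tasks)
]

history = [0, 3, 3, 1]  # one user's transactions as company indices
task_views = [transform(history, shared_embeddings, W) for W in task_matrices]
```

Here `task_views[k]` holds the same four transactions expressed in the vector space of task `k`, so the same purchase can look different to, say, the gender task and the age task.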
4.3 Task-Specific Attention Layer
Note that some transactions are more strongly associated with certain demographic attributes than others. To model this, we adopt a task-specific attention mechanism. The transformed embeddings $E^{k} = [e^{k}_1, \ldots, e^{k}_T]$ are fed into a task-specific attention layer. We obtain the attention weights for each task by:

$$ s^{k} = \sigma\left(E^{k} w^{k}_{a} + b^{k}_{a}\right) \tag{2} $$

$$ \alpha^{k}_{t} = \frac{\exp(s^{k}_{t})}{\sum_{j=1}^{T} \exp(s^{k}_{j})} \tag{3} $$

where $\sigma$ is a non-linear activation. $w^{k}_{a}$ and $b^{k}_{a}$ are trainable parameters that are not shared across tasks. In Equation 2, the addition symbol denotes a broadcasting operation. The attention score is computed based on the similarity between the transformed embeddings and $w^{k}_{a}$. The distribution of the attention weights describes the importance that our model assigns to each transaction.
Equations (2) and (3) are the form typically used to calculate attention weights with a dot product. However, our dataset contains repeated transactions, so applying Equations (2) and (3) directly to the raw history is not effective: when each un-normalized value is divided by a sum that includes the weights of all repeated transactions, the distribution of attention weights is flattened, reducing the variation in attention scores. Thus, to make the attention mechanism more discriminating, we convert the list of transactions into a unique set.
Finally, we calculate the weighted sum over all transformed embeddings as follows:

$$ u^{k} = \sum_{t=1}^{T} \alpha^{k}_{t} e^{k}_{t} \tag{4} $$

where $u^{k}$ denotes the user representation for the $k$-th task, which means that each user is represented differently for each task. Figure 2 illustrates how ETNA obtains task-specific user representations.
The task-specific attention layer not only improves performance by utilizing useful data signals, but also makes our model more interpretable. We can check the attention weight distribution to see the signals that the model has focused more on. Further analysis on attention mechanism is presented in Section 5.6.
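The effect of repeated transactions on the attention weights can be seen in a small sketch. The code below is an illustration under assumptions: `tanh` stands in for the unspecified non-linear activation, the embeddings and parameters are hand-picked toy values, and the helper names (`attend`, `unique_in_order`) are hypothetical rather than taken from the released implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(embeddings, w, b):
    """Return (attention weights, weighted-sum user representation) for one task."""
    # Score = activation(dot(embedding, w) + b); the scalar b is shared by all positions.
    scores = [math.tanh(sum(wi * ei for wi, ei in zip(w, e)) + b) for e in embeddings]
    weights = softmax(scores)
    dim = len(embeddings[0])
    user = [sum(a * e[j] for a, e in zip(weights, embeddings)) for j in range(dim)]
    return weights, user

def unique_in_order(history):
    """Drop repeated transactions while keeping first-occurrence order."""
    return list(dict.fromkeys(history))

# Toy task-specific embeddings for three companies (assumed values).
task_embeddings = {0: [2.0, 0.0], 1: [0.0, 0.1], 2: [-1.0, 0.5]}
w, b = [1.0, 0.0], 0.0

history = [1, 1, 1, 1, 0, 2]         # company 1 is purchased repeatedly
deduped = unique_in_order(history)   # each company appears once

raw_weights, _ = attend([task_embeddings[i] for i in history], w, b)
set_weights, user_repr = attend([task_embeddings[i] for i in deduped], w, b)

# Attention given to the high-scoring company 0, with and without repeats.
weight_raw = raw_weights[history.index(0)]
weight_set = set_weights[deduped.index(0)]
```

In this toy case the repeated, low-scoring company absorbs much of the softmax mass, so `weight_raw` is noticeably smaller than `weight_set`; removing duplicates restores the contrast between informative and uninformative transactions.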
4.4 Prediction Layer
Finally, we obtain the predicted probability for the $k$-th demographic attribute of a given user by:

$$ \hat{y}^{k} = \mathrm{softmax}\left(W^{k}_{p} u^{k}\right) \tag{5} $$

where $W^{k}_{p}$ is a trainable parameter. The parameter $W^{k}_{p}$ is responsible for converting the user representation for each task into predictions through a linear transformation.
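The goal of demographic prediction is to infer all demographic attributes of users from their transaction histories. Now, we recover the subscript $i$. For $N$ users, we minimize the sum of the negative log-likelihoods defined as:

$$ \mathcal{L}(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} \log \hat{y}^{k}_{i}[y_{i,k}] + \lambda \lVert \theta \rVert_2^2 \tag{6} $$

where $\hat{y}^{k}_{i}[y_{i,k}]$ is the probability predicted for the true class of the $k$-th attribute of the $i$-th user, $\theta$ denotes all trainable parameters, and $\lambda$ is the weight decay coefficient, which is one of the hyperparameters. As mentioned in Section 4.2 and Section 4.3, the total loss in Equation 6 is calculated by summing the losses of each task; thus, the gradients only flow within the scope of the corresponding tasks.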
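The prediction layer and the summed per-task loss can be sketched as follows. This is a simplified illustration, not the authors' implementation: the weight decay term is omitted, unobserved labels are represented as `None` and skipped (an assumption about how the partial label setting could be handled), and all names and toy values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(user_repr, pred_matrix):
    """Class probabilities for one task: softmax of a linear map of the user representation."""
    logits = [sum(w * u for w, u in zip(row, user_repr)) for row in pred_matrix]
    return softmax(logits)

def user_loss(user_reprs, pred_matrices, labels):
    """Sum of per-task negative log-likelihoods for one user (no weight decay)."""
    loss = 0.0
    for u, W, y in zip(user_reprs, pred_matrices, labels):
        if y is None:  # label not available for training, so this task adds no loss
            continue
        loss += -math.log(predict(u, W)[y])
    return loss

# One task-specific representation per task (toy values).
user_reprs = [[0.5, -1.0], [0.2, 0.3]]
# Untrained (all-zero) prediction matrices give uniform class probabilities.
pred_matrices = [
    [[0.0, 0.0], [0.0, 0.0]],              # first task: 2 classes
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],  # second task: 3 classes
]

loss_all = user_loss(user_reprs, pred_matrices, labels=[1, 2])
loss_partial = user_loss(user_reprs, pred_matrices, labels=[1, None])
```

Because each task has its own representation and prediction matrix, the terms of the sum touch disjoint task-specific parameters, which is why a task's loss only updates that task's own layers above the shared embeddings.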
5 Experiments
5.1 Dataset
We use a transaction dataset collected by a Korean multi-vendor loyalty program provider. When customers use services or purchase items at a company contracted with this provider, a certain portion of the price is saved as points in the database. Customers can use these points like cash at all the stores that are enrolled in the program. The dataset consists of the purchasing histories of 56,028 users and contains the gender, age, and marital status (demographic attributes) of all the users. A total of 494 companies participate in the program. In our dataset, transactions are recorded as [user ID, company ID, purchased amount] triplets. Although company names are hidden due to privacy issues, the industrial categories of all companies are provided so that we can further analyze the behavior of our model. To the best of our knowledge, our dataset is the first public dataset containing both transaction metadata and demographic information. We made the dataset publicly available at https://github.com/dmis-lab/demographic-prediction. The statistics of our dataset are summarized in Table 1.
As described in Section 3, we conducted experiments in two different problem settings. In the partial label prediction problem, our goal is to predict the unknown attributes of users while our model is trained with the observed attributes. All the users in our dataset have all demographic attributes. For the partial label prediction setting, we therefore randomly set certain attributes as observed ($Y^{o}$) and used them in the training phase. We used observed ratios from 10% to 90% with a step of 10%. When the observed ratio is 50%, each attribute of a user has a 50% chance of being observed and used in training. The remaining unknown attributes are used in the evaluation. To minimize noise due to randomness, we created 10 different splits for each observed ratio. We averaged the results over the 10 splits and report them in this paper.
For the experiments on new user prediction, we split our dataset into non-overlapping sets. We chose an 8:1:1 ratio for the training, validation, and test splits, which results in 44,822 users for training and 5,603 users each for validation and testing.
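Both experimental setups can be reproduced in outline with a few lines of Python. This is a sketch under assumptions: the helper names, the seed, and the use of rounding to obtain the split sizes are illustrative choices, and the released code may construct its splits differently.

```python
import random

def observed_mask(num_users, num_attributes, ratio, rng):
    """Mark each (user, attribute) label as observed with probability `ratio`."""
    return [[rng.random() < ratio for _ in range(num_attributes)] for _ in range(num_users)]

def split_users(user_ids, rng, train=0.8, valid=0.1):
    """Shuffle users and cut them into disjoint train/validation/test lists."""
    ids = list(user_ids)
    rng.shuffle(ids)
    n_train = round(len(ids) * train)
    n_valid = round(len(ids) * valid)
    return ids[:n_train], ids[n_train:n_train + n_valid], ids[n_train + n_valid:]

rng = random.Random(0)

# New user prediction: 8:1:1 split over the 56,028 users in the dataset.
train_ids, valid_ids, test_ids = split_users(range(56028), rng)

# Partial label prediction: a 50% observed ratio over 3 attributes (toy user count).
mask = observed_mask(num_users=1000, num_attributes=3, ratio=0.5, rng=rng)
observed_fraction = sum(sum(row) for row in mask) / (1000 * 3)
```

Repeating `observed_mask` with ten different seeds per ratio would correspond to the ten splits that are averaged in the partial label experiments.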
5.2 Evaluation Metrics
We employ several metrics appropriate to evaluate our model. These metrics are widely used in demographic prediction tasks.
Hamming Loss (HL) is a metric that measures how often predictions are incorrectly classified given a set of labels, and is defined as:

$$ \mathrm{HL} = \frac{1}{\sum_{i=1}^{N} \mathbb{1}(K_i \neq 0)} \sum_{i:\,K_i \neq 0} \frac{|\hat{Y}_i \,\Delta\, Y_i|}{K_i} $$

where $\Delta$ is the symmetric difference and $|\cdot|$ denotes the size of the set inside the operation. $\mathbb{1}(\cdot)$ is an indicator function. $K_i$ is the number of attribute labels to be predicted for the $i$-th user and is always equal to $K$ in new user prediction. However, $K_i$ varies in partial label prediction, where labels are randomly observed. When all attributes of a user are observed, $K_i$ can be equal to 0. Hence, we calculate the metric only over the users whose $K_i$ is not 0.
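A minimal sketch of this metric is given below. It assumes that each mispredicted attribute counts once, that attributes excluded from evaluation are marked with `None`, and that users with nothing to predict are skipped; the function name and the toy data are illustrative.

```python
def hamming_loss(predictions, targets):
    """Average fraction of wrongly predicted attributes per evaluated user.

    targets[i][k] is None when attribute k of user i is not evaluated
    (for example, because it was observed during training).
    """
    per_user = []
    for pred, true in zip(predictions, targets):
        evaluated = [(p, t) for p, t in zip(pred, true) if t is not None]
        if not evaluated:  # nothing to predict for this user
            continue
        wrong = sum(1 for p, t in evaluated if p != t)
        per_user.append(wrong / len(evaluated))
    return sum(per_user) / len(per_user)

predictions = [[0, 2, 1], [1, 0, 0], [0, 0, 0]]
targets = [[0, 1, 1], [None, 0, 1], [None, None, None]]
hl = hamming_loss(predictions, targets)
```

In this example the first user has one error out of three attributes, the second has one error out of two evaluated attributes, and the third user is skipped entirely.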
F1 Score, or F-measure, is widely used as a complement to accuracy. The F1 score is usually calculated as the harmonic mean of precision (P) and recall (R):

$$ F1 = \frac{2 \cdot P \cdot R}{P + R} $$
Precision and recall are formulated differently depending on the type of F1 score: micro, macro, or weighted. In our experiments, we do not use micro F1 since the goal of demographic prediction is to predict all combinations of attributes, not individuals. The precision and the recall of macro F1 (macF1) and weighted F1 (wF1) are computed as follows:

$$ P = \sum_{l \in L} w_l \frac{TP_l}{TP_l + FP_l}, \qquad R = \sum_{l \in L} w_l \frac{TP_l}{TP_l + FN_l} $$

where $L$ is the set of all label combinations to be predicted, and $TP_l$, $FP_l$, and $FN_l$ are the numbers of true positives, false positives, and false negatives for label $l$. When employing macro F1, $w_l = 1/|L|$; otherwise, $w_l = N_l / \sum_{l' \in L} N_{l'}$, where $N_l$ is the number of instances of label $l$. The macro F1 is an appropriate measure when every class is considered as important as the others. One characteristic of the macro F1 is that it is highly influenced by the minor classes. On the other hand, the weighted F1 assigns a higher weight to the larger classes by weighting each class by its number of instances. We report both metrics to compare the models.
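The difference between the two averaging schemes is easiest to see on an imbalanced toy example. The sketch below follows the description in this section, averaging per-class precision and recall first and then taking their harmonic mean; the function name and the data are assumptions for illustration.

```python
from collections import Counter

def averaged_f1(y_true, y_pred, average="macro"):
    """F1 computed from class-averaged precision and recall ('macro' or 'weighted')."""
    labels = sorted(set(y_true))
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    precision = recall = 0.0
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        p_l = tp / (tp + fp) if tp + fp else 0.0
        r_l = tp / (tp + fn) if tp + fn else 0.0
        # Macro: every class counts equally. Weighted: classes count by their size.
        weight = 1 / len(labels) if average == "macro" else support[label] / len(y_true)
        precision += weight * p_l
        recall += weight * r_l
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A majority-class predictor on an 8:2 imbalanced attribute.
y_true = ["F"] * 8 + ["M"] * 2
y_pred = ["F"] * 10

mac_f1 = averaged_f1(y_true, y_pred, average="macro")
w_f1 = averaged_f1(y_true, y_pred, average="weighted")
```

The majority-class predictor is penalized much more under macro averaging, because the minority class it never predicts contributes half of the averaged precision and recall.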
Table 2: Experimental results on partial label prediction (observed ratio 50%) and new user prediction.
5.3 Baseline Models
To verify the effectiveness of our work, we compare our models with four baseline models. Descriptions of our models and the baseline models are given below.
- POP: POP is a simple model that always predicts the majority class of each attribute for every user. POP is used in [19] as a baseline model that ignores users’ characteristics.
- JNE (Joint Neural Embedding) [19]: In JNE, each company has its own latent vector. JNE maps all transactions in a user’s history to latent vectors. These vectors are average-pooled and then fed into a linear prediction layer for each task. The total loss is the sum of the per-task losses, as in Equation 6.
- SNE (Structured Neural Embedding) [19]: SNE has a structure very similar to that of JNE. The only difference between JNE and SNE is that the loss function of SNE is designed to be effective for multi-task learning problems. In SNE, the loss is computed via a log-bilinear model with structured predictions and labels (combinations of attributes are considered). The model architectures of JNE and SNE are illustrated in Figure 1(a) and (b).
- ETN (Embedding Transformation Network): ETN is our proposed model with only the embedding transformation layer included. The output of the embedding transformation layer is fed directly into the prediction layer.
- ETNA (Embedding Transformation Network with Attention): ETNA is our proposed model with all components included. The difference between ETN and ETNA is whether the task-specific attention layer is included.
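Two of the simpler ingredients in the list above, the majority-class prediction of POP and the average pooling used by JNE to form a single user representation, can be sketched as follows. The function names and toy inputs are hypothetical and only meant to make the baselines concrete.

```python
from collections import Counter

def fit_pop(train_labels):
    """Majority class of one attribute; POP predicts it for every user."""
    return Counter(train_labels).most_common(1)[0][0]

def average_pool(embeddings):
    """Single user representation as the element-wise mean of transaction embeddings."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

majority_gender = fit_pop(["F", "F", "M"])
pooled = average_pool([[1.0, 2.0], [3.0, 4.0]])
```

Average pooling treats every transaction as equally important and yields one representation for all tasks, which is exactly the behavior that the embedding transformation and attention layers are meant to relax.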
5.4 Experimental Settings
We implemented all the baseline models as described in [19]. We searched for the best-performing hyperparameters for our models and for the baseline models separately. We set the size of embedding to 100 and the weight decay coefficient to . We use the Adam optimizer [8], with a mini-batch size of 64 and a learning rate of . We apply early stopping based on the weighted F1, evaluated after every epoch. The training process for partial label prediction (50%) takes roughly 30 minutes on a single Titan X (Pascal) GPU and requires approximately 5GB of memory. The experiments are implemented in PyTorch; the source code is available at https://github.com/dmis-lab/demographic-prediction.
5.5 Results
Table 2 shows the experimental results on the partial label prediction problem with an observed ratio of 50% for each attribute and on the new user prediction problem. We also include results on the partial label prediction problem for different observed ratios in Figure 3. Our baselines take different approaches to obtaining user representations. Based on our experimental results, we have three findings about how the different approaches behave.
(1) As JNE and SNE learn task-oriented representations, both models perform better than SVD. However, SNE does not outperform JNE as significantly as in the results reported in [19]. In our experiment on new user prediction, JNE even outperforms SNE. As mentioned before, JNE and SNE are very similar, especially in how they obtain user representations and predict attributes from them. Changing the objective function had a minimal effect in our experiments.
(2) In all the experiments, our models outperform all the baselines. As we have emphasized throughout this paper, the ability to learn task-specific features is also important in multi-task learning problems. Although both JNE and ETN predict each attribute using separate prediction parameters, JNE’s capability is limited because it uses a single user representation. On the other hand, ETN utilizes task-specific user representations obtained from embeddings that are optimized only for individual tasks. The attention mechanism improves the performance of our model even further by focusing on important signals in a task-specific manner. ETNA achieves the best accuracy in all our experimental settings.
(3) To demonstrate the impact of embedding transformation in the demographic prediction task, we include the experimental results of models with different sharing structures in Table 3. SEP is a model with the loss function defined in Equation 6 but with a separate embedding for each task, which means that SEP does not share any features across tasks. As our experimental results demonstrate, JNE and SEP obtain very similar performance. Each model has its own strength (sharing information or capturing task-specific features, respectively), but without the ability to balance these two properties they cannot achieve better accuracy. Our ETN model outperforms both models in all experiments. The results show that obtaining task-specific user representations with embedding transformation is simple but effective.
We also conducted experiments on the Beiren dataset used in [19]. However, we could not reproduce the results reported in [19]. Our model obtained slightly, but not significantly, higher scores than JNE and SNE. Based on our results and analysis, the item-level transaction data in Beiren does not contain enough information to predict demographic attributes. The experimental results on the Beiren dataset are provided in the Appendix.
5.6 Visualization of Task-Specific Attention
To further analyze the impact of the attention mechanism in our model, we visualize the weights calculated by the attention mechanism. We obtained attention scores for each company. As the attention mechanism is independent for each task, each company is assigned a different weight for each task. We picked examples that provide insights into customer behavior from the 20 companies with the highest attention scores in each task. Based on the attention weights from our model, we draw a heatmap in Figure 4.
The first thing to notice in this example is that a duty-free brand obtained the highest attention in the gender prediction task. As cosmetics and perfumes are the most popular products in duty-free stores, we might say that females purchase more actively there, as they are generally more interested in those items. The fact that the same brand received a relatively lower score in the age prediction task also accords with our intuition.
In age prediction, ETNA gave high attention scores to insurance and automobile-related services. Intuitively, older people are more likely to own a car and are also more interested in protecting themselves from possible risks.
An interesting point here is that some brands obtained high attention scores in multiple tasks, whereas other brands received attention only in a specific task. For example, a home electronic appliance company was assigned high scores in all tasks. However, the attention scores of health food and contraceptive brands are very different for each task.
Some results were not expected before the analysis. Although a high score for a cosmetic brand in the gender prediction task accords with common sense, we did not expect the cosmetic brand to receive a high score in the marital status prediction task. By analyzing such unexpected facts, we might gain a deeper understanding of customer behavior.
It is noteworthy that interpretations of why our model attended to certain companies can vary. There is no special way to generate a unique explanation. Our goal is to provide business managers with a tool to understand customer behavior and to establish strategy in a data-driven way.
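One simple way to turn per-user attention weights into per-company scores for such a heatmap is sketched below. How the scores were actually aggregated is not specified here, so averaging over users is an assumption, and the function names, company labels, and weights are purely illustrative.

```python
from collections import defaultdict

def mean_attention_per_company(user_histories, user_weights):
    """Average the attention weight each company received across users (one task)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for history, weights in zip(user_histories, user_weights):
        for company, weight in zip(history, weights):
            totals[company] += weight
            counts[company] += 1
    return {company: totals[company] / counts[company] for company in totals}

def top_k(scores, k):
    """Companies with the k highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy attention weights for one task and two users (histories already de-duplicated).
histories = [["duty_free", "grocery"], ["duty_free", "insurance", "grocery"]]
weights = [[0.8, 0.2], [0.5, 0.4, 0.1]]

company_scores = mean_attention_per_company(histories, weights)
best = top_k(company_scores, k=2)
```

Running the same aggregation once per task and stacking the resulting score vectors gives a task-by-company matrix that can be rendered as a heatmap.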
6 Conclusion
In this paper, we studied the demographic prediction task in a retail business scenario as a multi-task learning problem. We proposed the Embedding Transformation Network with Attention (ETNA), which transforms shared embeddings into task-specific embeddings and detects the more important signals with an attention mechanism. We demonstrated ETNA’s improved accuracy over existing state-of-the-art demographic prediction models. We analyzed attention weights to understand customer behavior in a data-driven way. We also released our dataset, collected by a multi-vendor loyalty service provider.
This work was supported by the National Research Foundation of Korea (NRF-2017R1A2A1A17069645, NRF-2017M3C4A7065887).
References
- [1] D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473, (2014).
- [2] B. Bi, M. Shokouhi, M. Kosinski, and T. Graepel, Inferring the demographics of search users: Social data meets search queries, in Proceedings of the 22nd international conference on World Wide Web, ACM, 2013, pp. 131–140.
- [3] J. Chen, H. Zhang, X. He, L. Nie, W. Liu, and T.-S. Chua, Attentive collaborative filtering: Multimedia recommendation with item-and component-level attention, in Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, ACM, 2017, pp. 335–344.
- [4] A. Culotta, N. R. Kumar, and J. Cutler, Predicting the demographics of twitter users from website traffic data., in AAAI, 2015, pp. 72–78.
- [5] L. Deng, G. Hinton, and B. Kingsbury, New types of deep neural network learning for speech recognition and related applications: An overview, in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, IEEE, 2013, pp. 8599–8603.
- [6] J. Gupta and J. Gadge, Performance analysis of recommendation system based on collaborative filtering and demographics, in Communication, Information & Computing Technology (ICCICT), 2015 International Conference on, IEEE, 2015, pp. 1–6.
- [7] J. Hu, H.-J. Zeng, H. Li, C. Niu, and Z. Chen, Demographic prediction based on user’s browsing behavior, in Proceedings of the 16th international conference on World Wide Web, ACM, 2007, pp. 151–160.
- [8] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, (2014).
- [9] A. Kumar and H. Daume III, Learning task grouping and overlap in multi-task learning, arXiv preprint arXiv:1206.6417, (2012).
- [10] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher, Ask me anything: Dynamic memory networks for natural language processing, in International Conference on Machine Learning, 2016, pp. 1378–1387.
- [11] C.-F. Lin, Segmenting customer brand preference: demographic or psychographic, Journal of Product & Brand Management, 11 (2002), pp. 249–268.
- [12] J. Lu, J. Yang, D. Batra, and D. Parikh, Hierarchical question-image co-attention for visual question answering, in Advances In Neural Information Processing Systems, 2016, pp. 289–297.
- [13] L. Luo, X. Ao, F. Pan, J. Wang, T. Zhao, N. Yu, and Q. He, Beyond polarity: Interpretable financial sentiment analysis with hierarchical query-driven attention., in IJCAI, 2018, pp. 4244–4250.
- [14] M.-T. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser, Multi-task sequence to sequence learning, arXiv preprint arXiv:1511.06114, (2015).
- [15] E. Malmi and I. Weber, You are what apps you use: Demographic prediction based on user’s apps., in ICWSM, 2016, pp. 635–638.
- [16] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert, Cross-stitch networks for multi-task learning, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3994–4003.
- [17] Y. S. Resheff and M. Shahar, Fusing multifaceted transaction data for user modeling and demographic prediction, arXiv preprint arXiv:1712.07230, (2017).
- [18] L. Safoury and A. Salah, Exploiting user demographic attributes for solving cold-start problem in recommender system, Lecture Notes on Software Engineering, 1 (2013), pp. 303–307.
- [19] P. Wang, J. Guo, Y. Lan, J. Xu, and X. Cheng, Your cart tells you: Inferring demographic attributes from purchase data, in Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, ACM, 2016, pp. 173–182.
- [20] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, Hierarchical attention networks for document classification, in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1480–1489.
- [21] Z. Zhang, Y. Xie, F. Xing, M. McGough, and L. Yang, Mdnet: A semantically and visually interpretable medical image diagnosis network, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6428–6436.
- [22] Y. Zhong, N. J. Yuan, W. Zhong, F. Zhang, and X. Xie, You are where you go: Inferring demographic attributes from location check-ins, in Proceedings of the eighth ACM international conference on web search and data mining, ACM, 2015, pp. 295–304.