Introduction
Recent years have witnessed the rapid development of technologies related to enterprise management, which can help organizations keep up with the continuously changing business world. Along this line, a crucial demand is to build effective strategies for company profiling, an analytical process that results in an in-depth understanding of a company’s fundamental characteristics and can therefore serve as an effective way to gain vital information about the target company and acquire business intelligence. With the help of profiling, a wide range of applications can be enabled, including organization risk management [Martin and Rice2007], enterprise integration [Hollocks et al.1997], and company benchmarking [Knuf2000, Alling2002, Seong Leem et al.2008, Kerschbaum2008, Zhu et al.2016].
In the past decades, traditional approaches to company profiling have relied heavily on the availability of rich financial information about the company, such as financial reports and SEC filings, which may not be readily available for many private companies. Recently, with the growing prevalence of online employment services, such as Glassdoor, Indeed, and Kanzhun, a new paradigm has been enabled for obtaining a variety of company information anonymously from (former) employees via the reviews, ratings and salaries of specific job positions. This, in turn, raises the question of whether it is possible to develop company profiles from an employee’s perspective. For example, we can help companies to identify their advantages and disadvantages, and to predict the expected salaries of different job positions at rival companies.
However, the heterogeneous nature of this public information imposes significant challenges on discovering typical patterns of companies during profiling. To this end, in this paper we propose a model named Company Profiling based Collaborative Topic Regression (CPCTR), which formulates a joint optimization framework for learning the latent patterns of companies and collaboratively models both the textual information (e.g., reviews) and the numerical information (e.g., salaries and ratings). With the identified patterns, including the positive/negative opinions and the latent variable that influences salary, we can effectively carry out opinion analysis and salary prediction for different companies. Finally, we conduct extensive experiments on a real-world data set. The results show that our algorithm provides a comprehensive interpretation of company characteristics and a more effective salary prediction than other baselines. In particular, by analyzing the results obtained by CPCTR, many meaningful patterns and interesting discoveries can be observed; for example, welfare and technology are the typical pros of Baidu, while those of Tencent are training and learning.
Related Work
The related work of this paper can be grouped into two categories, namely topic modeling for opinion analysis and matrix factorization for prediction.
Probabilistic topic models are capable of grouping semantically coherent words into human-interpretable topics. Archetypal topic models include probabilistic Latent Semantic Indexing (pLSI) [Hofmann1999] and Latent Dirichlet Allocation (LDA) [Blei, Ng, and Jordan2003]. Many extensions have been proposed based on these standard topic models, such as the author-topic model [RosenZvi et al.2004], the correlated topic model (CTM) [Blei and Lafferty2005], and the dynamic topic model (DTM) [Blei and Lafferty2006]. Among them, numerous works focus on opinion analysis, especially on the aspect-based opinion mining task [Vivekanandan and Aravindan2014, Zhu et al.2014]. Moreover, a few works have attempted to combine ratings and review texts when performing opinion analysis [Ganu, Elhadad, and Marian2009, Titov and McDonald2008, McAuley and Leskovec2013]. However, none of them considers the pros and cons texts during the opinion modeling process, which is one of our major concerns in the company profiling task.
Matrix factorization is a family of methods that is widely used for prediction. The intuition behind it is to obtain a better data representation by projecting the data into a latent space. Singular Value Decomposition (SVD) [Golub and Reinsch1970] is a classic matrix factorization method for rating prediction, which gives low-rank approximations by minimizing the sum-squared distance. However, since real-world data sets are often sparse, SVD does not perform well in practice. To address this, several probabilistic matrix factorization methods have been proposed [Marlin2003, Marlin and Zemel2004, Salakhutdinov and Mnih2007, Zeng et al.2015]. Probabilistic Matrix Factorization (PMF) [Salakhutdinov and Mnih2007] is a representative one and has been popular in industry. However, in our salary prediction scenario, we need to model the rating matrix and the review text information simultaneously, a requirement that neither SVD nor PMF can meet. Therefore, we develop a joint optimization framework to integrate the textual information (e.g., reviews) and numerical information (e.g., salaries and ratings) by extending Collaborative Topic Regression (CTR) [Wang and Blei2011] for effective salary prediction.
Preliminaries
In this section, we introduce some preliminaries used throughout this paper, including data description and problem definition.
Data Description
In this paper, we aim to leverage the data collected from online employment services for company profiling. To facilitate the understanding of our data, we show a page snapshot of Indeed (http://www.indeed.com/) in Figure 1. Specifically, each company has a number of reviews posted by its (former) employees, each of which contains the poster’s job position (e.g., software engineer), textual information about the advantages and disadvantages of the company, and a rating score ranging from 1 to 5 to indicate the preference of the employee towards this company. Moreover, the salary range of each job position is also included for each company.
Problem Statement
Suppose we have a set of companies and a set of job positions. For each company, there are many reviews referring to it. In each review, we have its reviewer’s role (i.e., the reviewer’s job position), a rating, and two independent textual segments (positive opinion and negative opinion). Moreover, we have the average salary for each job position. For simplicity, we group reviews by their job positions and form two word lists that represent the positive opinion and the negative opinion for a specific job position, respectively.
Our problem is how to discover the latent representative patterns of job-company pairs. To be more specific, there are two major tasks in this work: 1) how to learn positive and negative opinion patterns for each job position; and 2) how to use the latent patterns to predict job salaries for each job-company pair.
Thus, we propose a model, CPCTR, for jointly modeling the numerical information (i.e., rating and salary) and the review content information. To be more specific, we use a probabilistic topic model to mine review content information and use matrix factorization to handle numerical information. In terms of review content information, the positive and negative opinion patterns are represented by sets of opinion-related topics. Besides, each job-company pair has a topic pattern indicating its probability over these topics. In terms of numerical information, we use a low-dimensional representation derived from numerical information, such as salary and rating, to represent each job position and combine it with the topic pattern of the job-company pair to model the latent relationship among them.
Our model is thus a combination of probabilistic topic modeling and matrix factorization, similar to CTR. However, unlike CTR, which only learns a global topic-word distribution and a topic proportion for each item, our model can learn two kinds of job-related topic-word patterns: a positive topic-word distribution and a negative topic-word distribution. Moreover, in contrast with CTR, which cannot incorporate both rating and salary information into one optimization model, our method can model these two numerical values and utilize the learned opinion patterns for more precise salary prediction. Thus, our model leads to a more comprehensive interpretation of company profiling and provides a collaborative view from opinion modeling to salary prediction.
Technical Details
In this section, we formally introduce the technical details of our model CPCTR.
Model Formulation
As mentioned above, our model CPCTR is a Bayesian model which combines topic modeling with matrix factorization. The graphical representation of CPCTR is shown in Figure 2. To facilitate understanding, we examine the model from two sides.
On one side, we model each job-company pair with a latent topic vector whose dimension equals the number of topics. In probabilistic topic modeling, a job position can be represented by two latent matrices, i.e., the positive opinion topics and the negative opinion topics, both defined over the vocabulary. For the n-th word in a positive review of a job-company pair, we assume there is a latent variable indicating the word’s corresponding topic. To be more specific, given the topic vector of the pair, this latent variable follows a multinomial distribution parameterized by that vector. Meanwhile, the word is considered to be drawn from the multinomial distribution given by the corresponding positive opinion topic. A similar process can be conducted for the negative review.
On the other side, we conduct matrix factorization for salary prediction. In matrix factorization, we represent job positions and job-company pairs in a shared latent low-dimensional space, i.e., each job position is represented by two latent vectors, which indicate the influences of the job position over salary and rating, respectively. Similarly, each job-company pair is represented by a latent vector, which indicates the joint influence of the pair over the numeric rating and salary values. Here, we assume that the two job-position vectors follow Gaussian distributions. The latent vector of a job-company pair is derived from its topic vector by adding an offset, which also follows a Gaussian distribution. Therefore, this shared representation is the key point by which we jointly model both content and numerical information.
We form the prediction of the salary value of a specific job-company pair through the inner product between the latent representations of the job position and the pair, i.e.,
(1) 
Note that in our model, we first group reviews, ratings, and salary information by the posters’ job-company pair. We then calculate the average rating and the average salary, and aggregate the reviews into one single document, for each job-company pair. The complete generative process of our model is demonstrated in Algorithm 1. In the following, we leverage the Bayesian approach for parameter learning.
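The grouping step described above can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code; the record fields (`job`, `company`, `pros`, `cons`, `rating`, `salary`) and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_pair(reviews):
    """Group raw reviews by (job, company) and build one record per pair."""
    grouped = defaultdict(list)
    for r in reviews:
        grouped[(r["job"], r["company"])].append(r)

    pairs = {}
    for key, items in grouped.items():
        pairs[key] = {
            # All positive (negative) segments of a pair form one document.
            "pos_doc": [w for r in items for w in r["pros"]],
            "neg_doc": [w for r in items for w in r["cons"]],
            "avg_rating": mean(r["rating"] for r in items),
            "avg_salary": mean(r["salary"] for r in items),
        }
    return pairs

# Hypothetical toy records, not taken from the data set.
reviews = [
    {"job": "software engineer", "company": "A", "pros": ["welfare", "technology"],
     "cons": ["overtime"], "rating": 4, "salary": 8000},
    {"job": "software engineer", "company": "A", "pros": ["training"],
     "cons": ["overtime", "management"], "rating": 3, "salary": 9000},
    {"job": "product manager", "company": "A", "pros": ["culture"],
     "cons": ["promotion"], "rating": 5, "salary": 10000},
]
pairs = aggregate_by_pair(reviews)
```

Each resulting record carries the two documents used by the topic-model side and the two averages used by the matrix-factorization side.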
Parameter Learning
In the above generative process, we denote mathematical notations as follows. , , , , , , , . The joint likelihood of data, i.e., , , , , and the latent factors , , , , , under the full model is
(2) 
For learning the parameters, we develop an EM-style algorithm to learn the maximum a posteriori (MAP) estimation. Maximization of the posterior is equivalent to maximizing the complete log likelihood of , , , , , , and , given , and ,
(3) 
Here, we employ a coordinate ascent (CA) approach to alternately optimize the latent factors {, , } and the simplex variables as topic proportion. For , and , we follow in a similar fashion as for basic matrix factorization [Hu, Koren, and Volinsky2008]. Given the current estimation of , taking the gradient of with respect to , , and setting it to zero leads to
(4)  
(5)  
(6) 
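Setting the gradient of a regularized squared-error objective to zero gives a closed-form update of the kind used in basic matrix factorization. The sketch below shows that generic update only; it is not a reconstruction of Equations (4) to (6), it omits any per-entry weights the full model may use, and the names `ridge_update`, `solve_linear` and `lam` are assumptions made for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def ridge_update(vectors, targets, lam):
    """Return argmin_u sum_i (t_i - u . v_i)^2 + lam * ||u||^2.

    The zero-gradient condition is (V^T V + lam * I) u = V^T t.
    """
    k = len(vectors[0])
    gram = [[sum(v[i] * v[j] for v in vectors) + (lam if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(v[i] * t for v, t in zip(vectors, targets)) for i in range(k)]
    return solve_linear(gram, rhs)

# Toy example: two pair vectors and their (normalized) target values.
u = ridge_update([[1.0, 0.0], [0.0, 1.0]], [2.0, 3.0], lam=1.0)
```

In a coordinate-ascent loop, one such solve would be run per latent vector while the other factors are held fixed.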
Given , and , we then apply a variational EM algorithm described in LDA [Blei, Ng, and Jordan2003] to learn the topic proportion . We first define and , and then we separate the items that contain and apply Jensen’s inequality,
(7) 
where . In the E-step, the optimal variational multinomial and satisfy
(8)  
(9) 
The gives a tight lower bound of . Similar to CTR [Wang and Blei2011], we use projection gradient [Bertsekas1999] to optimize . Coordinate ascent can be applied to optimize the remaining parameters , , , and . Then, following the same M-step for topics in LDA [Blei, Ng, and Jordan2003], we optimize and as follows,
(10)  
(11) 
where we denote as an arbitrary term in the vocabulary set.
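For orientation, the standard LDA-style updates that this procedure follows can be sketched as below: the E-step sets each word's variational multinomial proportional to the topic proportion times the topic-word probability, and the M-step re-estimates each topic from the expected word counts. This is the textbook form under assumed names (`theta`, `beta`, `phi`), not a transcription of Equations (8) to (11); in CPCTR the same pattern would be applied separately to the positive and the negative documents.

```python
def e_step(theta, beta, doc):
    """phi[n][k] is proportional to theta[k] * beta[k][word_n]."""
    phi = []
    for w in doc:
        weights = [theta[k] * beta[k][w] for k in range(len(theta))]
        total = sum(weights)
        phi.append([x / total for x in weights])
    return phi

def m_step(docs, phis, num_topics, vocab_size, smoothing=1e-12):
    """beta[k][w] is proportional to the expected count of word w under topic k."""
    counts = [[smoothing] * vocab_size for _ in range(num_topics)]
    for doc, phi in zip(docs, phis):
        for n, w in enumerate(doc):
            for k in range(num_topics):
                counts[k][w] += phi[n][k]
    return [[c / sum(row) for c in row] for row in counts]

# Toy example: 2 topics, vocabulary of 2 word ids, one document.
theta = [0.5, 0.5]
beta = [[0.9, 0.1], [0.1, 0.9]]
doc = [0, 0, 1]
phi = e_step(theta, beta, doc)
new_beta = m_step([doc], [phi], num_topics=2, vocab_size=2)
```

The small `smoothing` constant is only there to avoid division by zero for topics that receive no mass.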
Discussion on Salary Prediction
After all the optimal parameters are learned, the CPCTR model can be used for salary prediction by Equation 1. In this task, the rating values and review content of the predicted job-company pair are available, but no salary information of the pair is available. To obtain the topic proportion for the predicted job-company pair, we optimize the objective given in the Parameter Learning section.
In particular, we only focus on the task of salary prediction, although rating prediction can be conducted in a similar way. Since reviews are always accompanied by ratings, ratings should be regarded as part of the opinion information. Therefore, in this work we treat the ratings as a complement to reviews for opinion mining, and as side information for salary prediction.
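A minimal sketch of the inner-product prediction follows. It assumes, as described in the model formulation, that the pair's latent vector is its topic proportion plus an offset; the argument names and the default zero offset are illustrative choices, not the paper's notation.

```python
def predict_salary(job_vector, topic_proportion, offset=None):
    """Predict a (normalized) salary as the inner product u . v.

    v is built as topic_proportion + offset; with no learned offset
    available, the topic proportion alone is used (an assumption).
    """
    if offset is None:
        offset = [0.0] * len(topic_proportion)
    pair_vector = [t + o for t, o in zip(topic_proportion, offset)]
    return sum(u * v for u, v in zip(job_vector, pair_vector))

# Hypothetical 2-dimensional example.
s_plain = predict_salary([0.2, 0.4], [0.5, 0.5])
s_offset = predict_salary([0.2, 0.4], [0.5, 0.5], offset=[0.1, -0.1])
```

Because salaries are normalized into [0, 1] before training, a prediction would be mapped back to the original scale by inverting that normalization.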
Experimental Results
In this section, we first give a short parameter sensitivity discussion to show the robustness of our model and then evaluate the salary prediction performance of CPCTR on a real-world data set against several state-of-the-art baselines. Finally, we empirically study the pros and cons of each job-company pair learned from employees’ reviews.
Experimental Setup
Data Sets.
Kanzhun (http://www.kanzhun.com/) is one of the largest online employment websites in China, where members can review companies, assign numeric ratings from 1 to 5, and post their own salary information. Thus, Kanzhun provides an ideal data source for experiments on company profiling and salary prediction. The data set used in our experiments consists of 934 unique companies, each of which contains at least one of 1,128 unique job positions in total, i.e., for a specific company, the average salary and rating of at least one job have been included. Moreover, the data set contains 4,682 average salaries over all job-company pairs (the matrix has a sparsity of 99.6%). The average rating and average salary in our data set are 3.32 and 7,565.21, respectively. For each review, we extracted the advantages and disadvantages, then grouped reviews by job position and formed one document for each job-company pair. In particular, we removed stop words and single words, filtered out words that appear in fewer than one document or in more than 90% of documents, and then chose only the 10,000 most frequent words as the vocabulary, which yielded a corpus of 580K negative words and 652K positive words. Finally, we converted the documents into the bag-of-words format for model learning.
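The vocabulary construction described above can be sketched roughly as follows. The thresholds mirror the text, but several details are assumptions: "single words" is interpreted here as single-character tokens, ties in frequency are broken alphabetically, and the function names are hypothetical.

```python
from collections import Counter

def build_vocabulary(docs, stop_words, min_df=1, max_df_ratio=0.9, max_size=10000):
    """Map the most frequent surviving words to integer ids."""
    doc_freq = Counter()   # number of documents containing each word
    term_freq = Counter()  # total occurrences of each word
    for doc in docs:
        tokens = [w for w in doc if w not in stop_words and len(w) > 1]
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    kept = [w for w in term_freq
            if min_df <= doc_freq[w] <= max_df_ratio * n_docs]
    kept.sort(key=lambda w: (-term_freq[w], w))
    return {w: i for i, w in enumerate(kept[:max_size])}

def to_bag_of_words(doc, vocab):
    """Convert a token list into {word_id: count}, dropping unknown words."""
    return dict(Counter(vocab[w] for w in doc if w in vocab))

# Toy corpus with hypothetical tokens.
docs = [["the", "pay", "is", "good", "pay"],
        ["the", "overtime", "is", "bad"],
        ["the", "pay", "a"]]
vocab = build_vocabulary(docs, stop_words={"is"})
bow = to_bag_of_words(docs[0], vocab)
```

In this toy corpus "the" is dropped because it appears in every document, and "a" is dropped as a single-character token.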
Baseline Methods.
To evaluate the salary prediction performance of CPCTR, we chose three state-of-the-art benchmark methods for comparison: PMF [Salakhutdinov and Mnih2007], Regularized Singular Value Decomposition of data with missing values (RSVD, https://github.com/alabid/PySVD), and Collaborative Topic Regression (CTR) [Wang and Blei2011].
Evaluation Metrics.
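We used two widely used metrics, i.e., root Mean Square Error (rMSE) and Mean Absolute Error (MAE), for measuring the prediction performance of different models. Specifically, we have

$$\mathrm{rMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(s_i - \hat{s}_i\right)^2}, \qquad (12)$$

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|s_i - \hat{s}_i\right|, \qquad (13)$$

where $s_i$ is the actual salary of the $i$-th job-company pair, $\hat{s}_i$ is its salary prediction, and $N$ is the number of test instances.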
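The two metrics can be computed directly from paired lists of actual and predicted salaries. The sketch below uses the standard definitions of rMSE and MAE; the function and variable names are illustrative.

```python
import math

def rmse(actual, predicted):
    """Root mean square error over paired values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error over paired values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical normalized salaries and predictions.
actual = [0.2, 0.4, 0.6]
predicted = [0.2, 0.5, 0.4]
print(rmse(actual, predicted), mae(actual, predicted))
```

Since rMSE squares each error before averaging, it penalizes a few large misses more heavily than MAE does, which is why the two metrics can rank methods differently.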
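Table 1: Prediction performance of different methods (best result in bold, runner-up in italic).

| Fold | Metric | CPCTR | PMF | RSVD | CTR |
| --- | --- | --- | --- | --- | --- |
| 1 | rMSE | **0.0528** | *0.0561* | 0.0608 | 0.0670 |
| 1 | MAE | **0.0347** | *0.0356* | 0.0419 | 0.0433 |
| 2 | rMSE | *0.0530* | **0.0518** | 0.0592 | 0.0597 |
| 2 | MAE | *0.0346* | **0.0332** | 0.0401 | 0.0394 |
| 3 | rMSE | *0.0506* | **0.0499** | 0.0595 | 0.0621 |
| 3 | MAE | *0.0328* | **0.0322** | 0.0413 | 0.0414 |
| 4 | rMSE | **0.0680** | *0.0703* | 0.0743 | 0.0815 |
| 4 | MAE | **0.0345** | *0.0365* | 0.0425 | 0.0472 |
| 5 | rMSE | **0.0479** | *0.0514* | 0.0543 | 0.0609 |
| 5 | MAE | **0.0332** | *0.0354* | 0.0407 | 0.0433 |
| Average | rMSE | **0.0545** | *0.0559* | 0.0616 | 0.0662 |
| Average | MAE | **0.0340** | *0.0346* | 0.0413 | 0.0429 |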
Experimental Settings.
In our experiments, we used 5-fold cross-validation. For every job position that was posted by at least 5 companies, we evenly split its job-company pairs (average rating/salary values) into 5 folds. We iteratively considered each fold as the test set and the others as the training set. Job positions that were posted by fewer than 5 companies were always put into the training set. As a result, all job positions in the test set also appear in the training set, which guarantees the in-matrix scenario for the CTR model in prediction. For each fold, we fitted a model to the training set and tested it on the within-fold jobs for each company. Note that each company has a different set of within-fold jobs. Finally, we obtained the predicted salaries and evaluated them on the test set.
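The splitting rule above can be sketched as follows. This is one plausible implementation under stated assumptions: pairs are unique `(job, company)` tuples, assignment within a job is a seeded shuffle followed by round-robin, and the function name `split_folds` is hypothetical.

```python
import random
from collections import defaultdict

def split_folds(pairs, n_folds=5, seed=0):
    """Return a list of (train, test) splits over (job, company) pairs."""
    by_job = defaultdict(list)
    for job, company in pairs:
        by_job[job].append((job, company))

    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    always_train = []
    for job, items in by_job.items():
        if len(items) < n_folds:
            # Rare jobs never enter a test set.
            always_train.extend(items)
            continue
        rng.shuffle(items)
        for i, item in enumerate(items):
            folds[i % n_folds].append(item)

    splits = []
    for k in range(n_folds):
        train = always_train + [p for i, f in enumerate(folds) if i != k for p in f]
        splits.append((train, folds[k]))
    return splits

# Toy example: one job posted by 10 companies, one by only 3.
pairs = [("engineer", f"c{i}") for i in range(10)] + [("designer", f"c{i}") for i in range(3)]
splits = split_folds(pairs)
```

Every job that appears in a test fold has at least four of its pairs in the matching training set, which is what keeps prediction in the in-matrix setting.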
The parameter settings of the different methods are as follows. For all methods, we set the number of latent factors to and the maximum iterations for convergence as . For probabilistic topic modeling in CTR and CPCTR, we set the parameters . For CTR, we used five-fold cross-validation to find that , , and provides the best performance. For our model CPCTR, we chose the parameters by using grid search on held-out predictions. As a default setting for CPCTR, we set , , , , . More detailed discussions about the parameter sensitivity of our model will be given in the following subsection. Additionally, for convenience of parameter selection, we used the min-max method to normalize all rating/salary values into the [0, 1] range.
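Min-max normalization rescales values linearly so that the smallest becomes 0 and the largest becomes 1. A minimal sketch follows; the handling of a constant input (all values mapped to 0.0) is an assumption, since the text does not discuss that case.

```python
def min_max_normalize(values):
    """Linearly rescale values into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: no spread to rescale (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical salaries.
normalized = min_max_normalize([2000, 6000, 10000])
```

The minimum and maximum would typically be kept so that predictions can be mapped back to the original salary scale.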
Parameter Sensitivity
In our model, the content parameter controls the contribution of review content information to model performance and the rating parameter balances the contribution of rating information to model performance. In the left plot of Figure 3, we vary the content parameter and rating parameter from to to study the effect on the performance of salary prediction, and the average performance within five-fold cross-validation is shown in this plot. First, we can see that CPCTR shows good prediction performance when and , and achieves the best prediction accuracy when and , which is the default setting for CPCTR. Next, to facilitate comparison, we shrink the range of and into [1, 1e+05] and show the right plot of Figure 3. From this plot, we can see that almost all cases of CPCTR outperform the other state-of-the-art baselines, except for . The results show small and negligible fluctuation with varied and , and CPCTR becomes insensitive to these two parameters.
Performance of Salary Prediction
We show the prediction performance of different methods in Table 1. Note that the best results are highlighted in bold and the runners-up are denoted in italic. From the results, we observe that CPCTR achieves the best average prediction performance under 5-fold cross-validation and outperforms the other baselines in three folds. This is in great contrast to CTR, which shows poor prediction performance in all five folds. The reason is that, although CTR can integrate textual information for salary prediction, it cannot utilize the rating information and does not explicitly model the positive/negative topic-word distributions. Among traditional collaborative filtering methods, PMF consistently outperforms RSVD in all five folds, which demonstrates the effectiveness of probabilistic methods on prediction tasks. Based on the above analysis, CPCTR can be regarded as a more comprehensive and effective framework for company profiling that can integrate review opinions and ratings for salary prediction.
Empirical Study of Opinion Profiling
Here, we apply CPCTR to carry out opinion analysis for different companies based on employees’ reviews. The objective is to effectively reveal the pros and cons of companies, which helps with competitor benchmarking.
To illustrate the effectiveness of learning job-position-level topic-word distributions, we list 3 positive topics and 3 negative topics of the job position Software Engineer inferred by CPCTR, as shown in Figure 4. Each topic is represented by its 5 most probable words. It can be seen that our method provides an effective interpretation of the latent job position pattern and that these topics accurately capture the common semantics of the job position Software Engineer across the whole market. We can see some interesting positive/negative topic patterns. For positive topics, topic 0 is about job environment, topic 1 is about flexible work time, and topic 2 is about technology atmosphere. For negative topics, topic 0 is about overtime, topic 1 is about prospects and promotion, and topic 2 is about opportunity and welfare.
We also compared the pros and cons among BAT, which is the abbreviation of the three largest and most representative Chinese Internet companies, i.e., Baidu, Alibaba and Tencent. Specifically, we present the pros and cons with the most probable words appearing in the learned topics for each company, given the job position Software Engineer, in Figure 5. As can be seen, the topics for each job-company pair can effectively capture the specific characteristics of each company. For instance, the typical pros of Baidu are welfare and technology, while those of Tencent are training and learning and those of Alibaba are culture and atmosphere. Interestingly, employees of all three companies chose overtime as a con, and the management of Tencent seems to be a typical con.
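Representing a topic by its most probable words amounts to sorting one row of the topic-word distribution. A small sketch, with a hypothetical vocabulary and probabilities that are not taken from the learned model:

```python
def top_words(topic_word_probs, vocabulary, n=5):
    """Return the n words with the highest probability under one topic."""
    ranked = sorted(range(len(vocabulary)), key=lambda i: -topic_word_probs[i])
    return [vocabulary[i] for i in ranked[:n]]

# Hypothetical vocabulary and one topic's word probabilities.
vocabulary = ["overtime", "welfare", "training", "culture", "promotion", "salary"]
topic = [0.05, 0.30, 0.25, 0.10, 0.20, 0.10]
summary = top_words(topic, vocabulary, n=3)
```

Applying this once per row of the positive and negative topic matrices yields word lists of the kind shown in Figures 4 and 5.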
Concluding Remarks
In this paper, we proposed a model CPCTR for company profiling, which can collaboratively model the textual information and numerical information of companies. A unique perspective of CPCTR is that it formulates a joint optimization framework for learning the latent patterns of companies, including the positive/negative opinions of companies and the latent topic variable that influences salary from an employee’s perspective. With the identified patterns, both opinion analysis and salary prediction can be conducted effectively. Finally, we conducted extensive experiments on a real-world data set. The results showed that our model provides a comprehensive interpretation of company characteristics and a more effective salary prediction than the baselines.
Acknowledgments
This work was partially supported by NSFC (71322104, 71531001, 71471009, 71490723), National High Technology Research and Development Program of China (SS2014AA012303), National Center for International Joint Research on E-Business Information Processing (2013B01035), and Fundamental Research Funds for the Central Universities.
References
 [Alling2002] Alling, E. 2002. Method and system for facilitating multi-enterprise benchmarking activities and performance analysis. US Patent App. 10/137,218.
 [Bertsekas1999] Bertsekas, D. 1999. Nonlinear Programming. Athena Scientific.
 [Blei and Lafferty2005] Blei, D. M., and Lafferty, J. D. 2005. Correlated topic models. In Proceedings of the 18th International Conference on Neural Information Processing Systems, NIPS’05, 147–154. Cambridge, MA, USA: MIT Press.

 [Blei and Lafferty2006] Blei, D. M., and Lafferty, J. D. 2006. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, 113–120. New York, NY, USA: ACM.
 [Blei, Ng, and Jordan2003] Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3:993–1022.
 [Ganu, Elhadad, and Marian2009] Ganu, G.; Elhadad, N.; and Marian, A. 2009. Beyond the stars: Improving rating predictions using review text content. In 12th International Workshop on the Web and Databases, WebDB 2009, Providence, Rhode Island, USA, June 28, 2009.
 [Golub and Reinsch1970] Golub, G. H., and Reinsch, C. 1970. Singular value decomposition and least squares solutions. Numer. Math. 14(5):403–420.
 [Hofmann1999] Hofmann, T. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’99, 50–57. New York, NY, USA: ACM.
 [Hollocks et al.1997] Hollocks, B. W.; Goranson, H. T.; Shorter, D. N.; and Vernadat, F. B. 1997. Assessing Enterprise Integration for Competitive Advantage—Workshop 2, Working Group 1. Berlin, Heidelberg: Springer Berlin Heidelberg. 96–107.
 [Hu, Koren, and Volinsky2008] Hu, Y.; Koren, Y.; and Volinsky, C. 2008. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining, 263–272.
 [Kerschbaum2008] Kerschbaum, F. 2008. Building a privacy-preserving benchmarking enterprise system. Enterprise Information Systems 2(4):421–441.
 [Knuf2000] Knuf, J. 2000. Benchmarking the lean enterprise: Organizational learning at work. Journal of Management in Engineering 16(4):58–71.
 [Marlin and Zemel2004] Marlin, B., and Zemel, R. S. 2004. The multiple multiplicative factor model for collaborative filtering. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, 73–. New York, NY, USA: ACM.
 [Marlin2003] Marlin, B. 2003. Modeling user rating profiles for collaborative filtering. In Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS’03, 627–634. Cambridge, MA, USA: MIT Press.
 [Martin and Rice2007] Martin, N. J., and Rice, J. L. 2007. Profiling enterprise risks in large computer companies using the leximancer software tool. Risk Management 9(3):188–206.
 [McAuley and Leskovec2013] McAuley, J., and Leskovec, J. 2013. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys ’13, 165–172. New York, NY, USA: ACM.

 [RosenZvi et al.2004] Rosen-Zvi, M.; Griffiths, T.; Steyvers, M.; and Smyth, P. 2004. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI ’04, 487–494. Arlington, Virginia, United States: AUAI Press.
 [Salakhutdinov and Mnih2007] Salakhutdinov, R., and Mnih, A. 2007. Probabilistic matrix factorization. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, 1257–1264. USA: Curran Associates Inc.
 [Seong Leem et al.2008] Seong Leem, C.; Wan Kim, B.; Jung Yu, E.; and Ho Paek, M. 2008. Information technology maturity stages and enterprise benchmarking: an empirical study. Industrial Management & Data Systems 108(9):1200–1218.
 [Titov and McDonald2008] Titov, I., and McDonald, R. 2008. A joint model of text and aspect ratings for sentiment summarization. In Proceedings of ACL-08: HLT, 308–316. Columbus, Ohio: Association for Computational Linguistics.
 [Vivekanandan and Aravindan2014] Vivekanandan, K., and Aravindan, J. S. 2014. Aspect-based opinion mining: A survey. International Journal of Computer Applications 106(3):21–26.
 [Wang and Blei2011] Wang, C., and Blei, D. M. 2011. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’11, 448–456. New York, NY, USA: ACM.
 [Zeng et al.2015] Zeng, G.; Zhu, H.; Liu, Q.; Luo, P.; Chen, E.; and Zhang, T. 2015. Matrix factorization with scale-invariant parameters. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, 4017–4024.
 [Zhu et al.2014] Zhu, C.; Zhu, H.; Ge, Y.; Chen, E.; and Liu, Q. 2014. Tracking the evolution of social emotions: A time-aware topic modeling perspective. In 2014 IEEE International Conference on Data Mining, ICDM 2014, Shenzhen, China, December 14-17, 2014, 697–706.
 [Zhu et al.2016] Zhu, C.; Zhu, H.; Xiong, H.; Ding, P.; and Xie, F. 2016. Recruitment market trend analysis with sequential latent variable models. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2016, San Francisco, CA, USA, August 13-17, 2016, 383–392.