With the advent of the Internet and the growing amount of information available on it, people increasingly look for information online. This has created several challenges: users struggle to find exactly what they are looking for, and researchers struggle to keep up to date on information of whose existence they may be unaware.
For many years, scientists have communicated their achievements and discoveries through research papers published in appropriate journals or conferences. Established scientists, and especially newcomers, often face the dilemma of choosing an appropriate conference for their work. Every scientific conference and journal is inclined towards a particular field of research, and there is a vast multitude of them for any particular field. Choosing an appropriate venue is vital, as it helps authors reach the right audience and improves their chance of getting the paper published.
In order to address this problem, we aim to build a recommender system that recommends the most appropriate publication venues for an author. This system is particularly useful to budding researchers who have very little knowledge about the research world and also to experienced researchers by saving a lot of their time and effort.
In this work, we aim to approach this problem in the settings of dimensionality reduction and topic modeling. We propose three different methods to recommend conferences for researchers to submit their paper based on the content of the paper and the social network of the authors: two of them involving content-analysis and the third one involving social network of the authors. Our approach is empirically evaluated using a dataset of recent ACM conference publications and compared with existing methods such as content-based filtering, collaborative filtering and hybrid filtering with promising results.
However, there are several challenges that need to be addressed. We list these challenges along with the corresponding claims of our work.
Challenges: We face several challenges when working in this domain, as illustrated.
In recent times, dimensionality reduction methods such as SVD and PCA have become widely popular in recommender systems. The use of Correspondence Analysis (CA) has not been explored as much in the literature. In some recent works, PCA, which can only be applied to continuous data, has been applied to tabulated discrete data. How do we remedy this defect?
In all the previous work done related to our problem, only a model using the social network of the authors has been employed. Content analysis of the papers in consideration, to the best of the authors’ knowledge, has never been attempted. Just using the network of authors, without even looking at the paper, is not sufficient to decide where the paper should go to. How do we incorporate content into our work?
Suggesting conferences to new authors is particularly difficult. An author who has not published any paper before does not have a social network, so current systems would yield a poor recommendation. Will considering the content of the paper lead to better results?
Constructing matrices in higher dimensional spaces, as in our case, invites a large amount of redundancy and hence, the relationship between the two attributes in consideration is not obtained with clarity. How can this problem be tackled?
For the second method in our work, we construct a Paper Words matrix and a Words Conference matrix, whose entries indicate the frequency of occurrence of a word in a paper and in the papers published in a conference, respectively. For the process of recommendation, we compose the two matrices to obtain a Paper Conference matrix, on which we apply CA. However, the entries of the composed matrix are not guaranteed to be frequencies. How can we make sure this question does not arise?
Main Claims: We use the abstracts of the papers in consideration for content analysis. The challenges raised above are systematically addressed as follows:
In our work, we deal with tabulated discrete data. From the data collected, we construct matrices such as Paper Words, Paper Conference and Words Conference. Each entry of these matrices represents a frequency and thus forms the basis of our methods in applying dimensionality reduction techniques. We remedy the continuous data conundrum with the use of Correspondence Analysis (CA). By reducing the matrices to lower dimensional subspaces using CA, we obtain the relationship between the two entities with clarity, without having to use PCA. This makes the approach all the more meaningful.
As suggested, relying only on the network of the authors is not sufficient to obtain a good conference recommendation. We bring the content of the paper into our work to build a better model, which to the best of the authors’ knowledge has not been explored before in the literature. Since the essence of the entire paper is contained within its abstract, we build the content matrices using just the abstracts of the various papers. We employ term frequency-inverse document frequency (tf-idf) to generate the matrices of important keywords from the abstracts. In two of the methods, we construct Paper Words and Words Conference matrices using this technique.
As noted above, recommending conferences to authors with no prior social network is problematic. This problem does not arise in content analysis, since we are not concerned with the author’s social network: relying on the content of the abstract alone, we recommend suitable conferences. In our experiments on suggesting conferences to new authors, we observed that this method far outperforms the one relying only on the author’s social network.
The essence of the relationship between the attributes in a table is best captured in lower dimensional subspaces. Thus, when reducing the dimension of the matrices using CA, we essentially discard the redundant information while retaining the parts that are responsible for the relationships. As an added bonus, the reduced dimension increases the efficiency of the methods.
In order to avoid such a confusion, our third method does not compose the two matrices. Instead a linear transformation is defined between the two spaces after reduction of dimension. In essence, after constructing the Paper Words and Words Conference matrices, we apply CA to each of them to reduce their dimension and then define a linear transformation from one subspace to the other for the process of recommendation.
Key tasks of the methods: The key tasks of each of the proposed methods are listed as follows:
Method 1: Involving the use of the social network of the authors.
We construct the Author Conference matrix, in which each row corresponds to a particular author and each entry represents the number of times that author has published in a given conference.
We apply CA on this matrix to obtain principal column co-ordinates corresponding to the conferences. Using this, we obtain the principal row co-ordinates corresponding to the authors, whose paper needs a conference recommendation.
The conference nearest to the obtained author cluster in the bi-plot is recommended as the most suitable conference.
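The steps of Method 1 can be illustrated with a minimal sketch. It assumes CA is computed through the SVD of the standardized residuals, that the "author cluster" is summarized by the centroid of the authors' principal coordinates, and that nearness is Euclidean distance in the reduced space. The function names and the toy counts are hypothetical and are not taken from the experiments.

```python
import numpy as np

def ca_principal_coords(N, k=2):
    """Correspondence analysis of a count matrix N via SVD.

    Returns row and column principal coordinates in the first k dimensions.
    """
    P = N / N.sum()                          # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)      # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = U[:, :k] * sv[:k] / np.sqrt(r)[:, None]
    cols = Vt.T[:, :k] * sv[:k] / np.sqrt(c)[:, None]
    return rows, cols

def recommend_conference(N, author_ids, k=2):
    """Index of the conference closest to the centroid of the given authors."""
    rows, cols = ca_principal_coords(N, k)
    centroid = rows[author_ids].mean(axis=0)
    return int(np.argmin(np.linalg.norm(cols - centroid, axis=1)))

# Toy Author x Conference publication counts (hypothetical data).
author_conf = np.array([[5, 0, 1],
                        [4, 1, 0],
                        [0, 6, 1],
                        [1, 5, 0],
                        [0, 1, 7]], dtype=float)

# Authors 0 and 1 co-write a paper; both have mostly published in conference 0.
print(recommend_conference(author_conf, [0, 1]))
```

The sketch requires every author and every conference to have at least one publication, since zero masses would lead to division by zero.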
Method 2: Considering the content of the paper and composition of matrices.
We construct a Paper Words matrix and a Words Conference matrix, whose entries indicate the frequency of occurrence of a word in a paper and in the papers published in a conference, respectively.
Then, we compose these two matrices and apply CA to obtain the principal column co-ordinates corresponding to the conferences.
We obtain the principal row co-ordinates of the paper in need of a recommendation by computing its tf-idf vector, composing it with the Words Conference training matrix, and subsequently applying CA.
The conference nearest to the paper in the bi-plot is recommended as the most suitable one.
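A rough sketch of Method 2 follows. It assumes that "composing" the matrices means taking their matrix product, and that a new paper is placed in the map as a supplementary row, i.e. its profile is multiplied by the column standard coordinates. Both are our reading of the description rather than a confirmed implementation, and all names and counts are hypothetical.

```python
import numpy as np

def column_coords(N, k=2):
    """Column standard and principal coordinates of a count matrix N (CA via SVD)."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    _, sv, Vt = np.linalg.svd(S, full_matrices=False)
    standard = Vt.T[:, :k] / np.sqrt(c)[:, None]
    return standard, standard * sv[:k]

# Hypothetical training data: 4 papers, 4 words, 3 conferences.
paper_words = np.array([[3, 2, 0, 0],
                        [2, 3, 1, 0],
                        [0, 0, 4, 1],
                        [0, 1, 0, 4]], dtype=float)
words_conf = np.array([[6, 0, 1],
                       [5, 1, 0],
                       [0, 7, 1],
                       [1, 0, 8]], dtype=float)

paper_conf = paper_words @ words_conf        # composed Paper x Conference matrix
col_std, col_prin = column_coords(paper_conf)

def recommend(word_vector):
    """Nearest conference for a new paper given its word-weight vector."""
    row = word_vector @ words_conf           # compose with Words x Conference
    profile = row / row.sum()
    point = profile @ col_std                # supplementary row principal coordinates
    return int(np.argmin(np.linalg.norm(col_prin - point, axis=1)))

print(recommend(np.array([0.0, 0.0, 4.0, 0.0])))
```

Note that the entries of `paper_conf` are sums of products of counts, which illustrates why they can no longer be read directly as frequencies.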
Method 3: Considering the content of the paper and a linear transformation.
We construct the Paper Words and Words Conference matrices as before, but instead of composing them, we reduce them to lower dimensional subspaces individually using CA.
Then, we define a linear transformation from the reduced paper space to the reduced conference space.
This linear transformation enables us to take a paper, in need of recommendation, to the space of conferences and suggest a conference closest to it.
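The text above does not say how the linear transformation is obtained, so the following is only one plausible reading: the map is fitted by ordinary least squares so that the reduced coordinates of each training paper land near the reduced coordinates of the conference in which it appeared. The fitting rule, the label vector `paper_label` and all data are assumptions made for illustration.

```python
import numpy as np

def ca(N, k):
    """Return row principal, column standard and column principal coordinates."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_prin = U[:, :k] * sv[:k] / np.sqrt(r)[:, None]
    col_std = Vt.T[:, :k] / np.sqrt(c)[:, None]
    return row_prin, col_std, col_std * sv[:k]

# Hypothetical training data.
paper_words = np.array([[3, 2, 0, 0],
                        [2, 3, 1, 0],
                        [0, 0, 4, 1],
                        [0, 1, 0, 4]], dtype=float)
words_conf = np.array([[6, 0, 1],
                       [5, 1, 0],
                       [0, 7, 1],
                       [1, 0, 8]], dtype=float)
paper_label = np.array([0, 0, 1, 2])   # conference of each training paper (assumed)

k = 2
paper_xy, word_std, _ = ca(paper_words, k)   # papers in the reduced paper space
_, _, conf_xy = ca(words_conf, k)            # conferences in the reduced conference space

# Assumed fitting rule: least-squares T with paper_xy @ T ~ conf_xy[paper_label].
T, *_ = np.linalg.lstsq(paper_xy, conf_xy[paper_label], rcond=None)

def recommend(word_vector):
    """Map a new paper into the conference space and return the nearest conference."""
    profile = word_vector / word_vector.sum()
    point = profile @ word_std               # new paper as a supplementary row
    mapped = point @ T
    return int(np.argmin(np.linalg.norm(conf_xy - mapped, axis=1)))

print(recommend(np.array([0.0, 0.0, 4.0, 0.0])))
```

Because the two matrices are reduced separately, no composed Paper Conference matrix is ever formed, which is the point of the third method.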
1.1 Organization of the Paper
The paper is organized as follows. Section 2 gives an overview of the works related to the problem at hand. We formulate the problem and build on the main techniques used in the experiments in Sections 3 and 4 respectively. Section 5 details the datasets and tools used. The technical approaches used and the experimental results obtained with their implications are discussed in Sections 6 and 7 respectively. We finally present and draw conclusions with remarks, exploring possibilities and scopes of future work in Section 8.
2 Related Works
The field of recommender systems, being recent, has been a hot topic for researchers in the last few years. A lot of work has been done in exploring different algorithms and techniques to aid in building systems that can make intelligent suggestions to consumers. There has been work in content-based filtering, collaborative filtering, knowledge-based recommender systems and data mining areas such as classification, clustering, association rule mining and dimensionality reduction.
2.1 Collaborative Filtering
There have been many collaborative systems developed in academia and industry. Algorithms for collaborative recommendations can be grouped into two general classes: memory-based (or heuristic-based) and model-based. Memory-based algorithms essentially are heuristics that make rating predictions based on the entire collection of items previously rated by the users. That is, the value of the unknown rating for a given user and item is usually computed as an aggregate of the ratings of some other (usually the most similar) users for the same item.
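As an illustration of the memory-based idea, the sketch below predicts an unknown rating as a similarity-weighted average of other users' ratings for the same item. Cosine similarity over co-rated items is only one of several possible choices, and the users, items and ratings are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    common = [i for i in u if i in v]
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = sqrt(sum(u[i] ** 2 for i in common))
    nv = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def predict(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        w = cosine(ratings[user], r)
        num += w * r[item]
        den += abs(w)
    return num / den if den else None

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"a": 5, "b": 3},
    "bob":   {"a": 5, "b": 3, "c": 4},
    "carol": {"a": 1, "b": 5, "c": 2},
}
print(predict(ratings, "alice", "c"))
```

The prediction for alice lies between bob's and carol's ratings of item "c" and is pulled towards bob's, whose past ratings match hers exactly.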
There have been several model-based collaborative recommendation approaches proposed in the literature. These include a collaborative filtering method in a machine learning framework, where various machine learning techniques (such as artificial neural networks) coupled with feature extraction techniques (such as singular value decomposition, an algebraic technique for reducing the dimensionality of matrices) are used. There have also been statistical models like the Bayesian model and several algorithms for estimating parameters like K-means clustering and Gibbs sampling. More recently, a significant amount of research has been done in trying to model the recommendation process using more complex probabilistic models. Some probabilistic modeling techniques for recommender systems include Markov decision processes, probabilistic latent semantic analysis and a combination of multinomial mixture and aspect models using generative semantics of Latent Dirichlet Allocation.
Among the latest developments, techniques have been proposed to combine model-based and memory-based approach using probabilistic approaches. For example, 1) using an active learning approach to learn the probabilistic model of each user’s preferences and 2) using the stored user profiles in a mixture model to calculate recommendations.
2.2 Content-based Filtering
Content-based systems are designed mostly to recommend text-based items and the content in these systems is usually described with keywords. For example, a content-based component of the Fab [balabanovic1997fab] system, which recommends Web pages to users, represents Web page content with the most important words. Similarly, the Syskill & Webert system [pazzani1997learning] represents documents with the most informative words. The “importance” (or “informativeness”) of a word in a document is determined with some weighting measure that can be defined in several different ways.
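One widely used weighting measure is tf-idf, which is also the measure employed in this work. The sketch below uses raw term counts and an unsmoothed logarithmic inverse document frequency; other variants exist, and the sample abstracts are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document (raw tf times log idf)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each word.
    df = Counter(w for words in tokenized for w in set(words))
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(words).items()}
            for words in tokenized]

# Hypothetical abstracts.
abstracts = [
    "matrix factorization for rating prediction",
    "topic models for text",
    "matrix methods for text retrieval",
]
weights = tf_idf(abstracts)
print(weights[0])
```

A word such as "for", which occurs in every abstract, receives weight zero, while a word unique to one abstract receives the largest weight.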
2.3 Dimensionality Reduction
It is very common for recommender systems to use data with many features, i.e. a very high-dimensional space. Despite the provision for many features, sparsity of the feature vectors is frequently a problem. This has many implications for clustering and outlier detection, where the notions of density and distance between points become less meaningful. This is often called the curse of dimensionality. Dimensionality reduction techniques thus play an important role in these cases by helping to transform the original vectors into those in lower dimensional subspaces.
Xun Zhou et al. [zhou2012personalized] propose a scalable algorithm for recommendations called Incremental ApproSVD, which combines the incremental SVD algorithm with the ApproSVD (Approximating the SVD) proposed in one of their earlier works, where they use ApproSVD to generate personalized recommendations. This has been shown to outperform the traditional incremental SVD algorithm when run on the MovieLens and Flixster datasets.
As the Netflix Prize competition demonstrated, matrix factorization models are superior to classic nearest-neighbour techniques for producing product recommendations. Yehuda Koren [bell2007bellkor] proposed the BellKor Solution to the Netflix Grand Prize. The baseline predictors were improved, an extension of the neighbourhood model that addresses temporal dynamics was introduced, a new Restricted Boltzmann Machine (RBM) model was used with superior accuracy by conditioning the visible units, and finally a new blending algorithm based on Gradient Boosted Decision Trees (GBDT) was introduced. Gabor Takacs and Istvan Pilaszy [takacs2008unified] propose a hybrid approach that combines an improved Matrix Factorization (MF) method with the NSVD1 approach, popularized by Paterek [paterek2007improving], resulting in a very accurate factor model. Further, they propose a unification of the factor models and neighbourhood-based approaches, which improves the performance. Having run their method on the Netflix Prize dataset, they report a very low RMSE, which outperforms all published single methods in the literature.
Osman and Ismail [osmanli2011using] used tag similarity techniques in SVD-based recommender systems. To improve the recommendation quality, content information about the items in the form of user-given tags is used. To adapt tags to the normal SVD algorithm, they reduce the three-dimensional matrix ⟨user, item, tag⟩ to three two-dimensional matrices: ⟨user, item⟩, ⟨user, tag⟩ and ⟨item, tag⟩. These matrices are used to perform the SVD recommendation. This has been shown to increase the performance.
Goldberg et al. [goldberg2001eigentaste] proposed an approach to use PCA in the context of an online joke recommendation system. Their system, known as Eigentaste, starts from a standard matrix of user ratings of items. They then select their gauge set by choosing the subset of items for which all users had a rating. This new matrix is then used to compute the global correlation matrix, to which a standard 2-dimensional PCA is applied. Manolis and Konstantinos [vozalis2007using] [vozalis2007recommender] propose an algorithm called PCA-Demog, which applies PCA for demographically enhanced prediction generation. The proposed filtering algorithm applies PCA to user ratings and demographic data. Along with the algorithm, possible ways of combining it with different sources of filtering data are also discussed. In one of their other works, they also describe the application of SVD to Item-based Collaborative Filtering. They describe two algorithms: the first uses SVD in order to reduce the dimension of the active item’s neighbourhood, while the second initially enhances Item-based Filtering with demographic information and then applies SVD at various points of the filtering procedure.
Sarwar et al. [sarwar2000application] describe two different ways to use SVD in the context of RS. First, SVD can be used to uncover latent relations between customers and products. In order to accomplish this goal, they first fill the zeros in the user-item matrix with the item average rating and then normalize by subtracting the user average. This matrix is then factored using SVD, and the resulting decomposition can be used, after some trivial operations, directly to compute the predictions. The other approach is to use the low-dimensional space resulting from the SVD to improve neighborhood formation for later use in a NN approach. As described by Sarwar et al. [sarwar2002incremental] in one of their other works, one of the big advantages of SVD is that there are incremental algorithms to compute an approximated decomposition. This allows new users or ratings to be accepted without having to recompute the model that had been built from previously existing data. The same idea was later extended and formalized by Brand into an online SVD model, where he used these methods to model data streams describing tables of consumer/product ratings, where fragments of rows and columns arrive in random order and individual table entries are arbitrarily added, revised, or retracted at any time.
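The first SVD-based procedure described above (fill with item averages, subtract user averages, factor, reconstruct) can be sketched as follows. This is a simplified illustration rather than the exact procedure of the cited work: we assume that a zero marks a missing rating, that user averages are taken over rated entries only, and that predictions are the rank-k reconstruction with the user average added back.

```python
import numpy as np

def svd_predict(R, k):
    """Rank-k SVD predictions for a user x item matrix R (0 marks a missing rating)."""
    rated = R > 0
    item_avg = R.sum(axis=0) / np.maximum(rated.sum(axis=0), 1)
    user_avg = (R.sum(axis=1) / np.maximum(rated.sum(axis=1), 1))[:, None]
    filled = np.where(rated, R, item_avg)        # fill gaps with item averages
    U, s, Vt = np.linalg.svd(filled - user_avg, full_matrices=False)
    # Keep the k largest singular triplets and add the user average back.
    return user_avg + (U[:, :k] * s[:k]) @ Vt[:k]

# Hypothetical ratings (4 users x 4 items, 0 = unrated).
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
print(np.round(svd_predict(R, k=2), 2))
```

With k equal to the full rank the reconstruction simply returns the filled matrix, so any generalization comes from choosing a smaller k.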
Several hybrid approaches of Collaborative Filtering (CF) and Content-based Filtering (CB) have been proposed to increase the accuracy of recommendations. A major drawback of these was that the two techniques were most often executed independently. Panagiotis Symeonidis [symeonidis2008content] proposed a Content-based Dimensionality Reduction method for recommendations, wherein a feature profile for a user is constructed based on both collaborative and content features. Latent Semantic Indexing (LSI) is then applied to reveal the dominant features of the user. Recommendations are then provided according to this dimensionally-reduced feature profile. This method has been shown to outperform well-known CF, CB and hybrid approaches.
Zanker et al. [zanker2008evaluating] evaluate the use of different recommender systems for the purpose of tourism. They have considered different methods: Correspondence Analysis, Click-stream Sequence Analysis and Contingency Analysis.
There have been advances in text-based recommender systems too as described:
Scientists depend on literature search to find prior work that is relevant to their research ideas. In this context, Steven Bethard and Dan Jurafsky [bethard2010should] introduce a retrieval model for literature search that incorporates a wide variety of factors important to researchers, and learns the weights of each of these factors by observing citation patterns. They introduce features like topical similarity and author behavioral patterns, and combine these with features from related work like citation count and recency of publication. They present an iterative process for learning weights for these features that alternates between retrieving articles with the current retrieval model, and updating model weights by training a supervised classifier on these articles. In a similar context, Lee Giles et al. [giles1998citeseer] built CiteSeer, an autonomous citation indexing system, which indexes academic literature in electronic format. Published research papers on the World Wide Web, increasing in quantity daily, are often poorly organized and often exist in non-text forms (e.g. PostScript). Due to this, significant amounts of time and effort are commonly needed to find interesting and relevant publications on the Web. CiteSeer, being a Web-based information agent, helps alleviate this problem by assisting the user in the process of performing a scientific literature search.
Concept-based document recommendation for CiteSeer authors is explored by Kannan Chandrasekaran et al. [chandrasekaran2008concept]. They present a novel way of representing the user profiles as trees of concepts and an algorithm for computing the similarity between the user profiles and document profiles using a tree-edit distance measure. This has been shown to outperform a traditional vector-space model. Another way of recommending documents is using the implicit social network of researchers, as proposed by Cheng Chen et al. [chen2008implicit].
Bela Gipp and Jordan Beel [gipp2009citation] propose an approach for identifying similar documents that can be used to assist scientists in finding related work. The approach, called Citation Proximity Analysis (CPA), is a further development of co-citation analysis that additionally considers the proximity of citations to each other within an article’s full text. The underlying idea is that the closer citations are to each other, the more likely it is that they are related. The CPA-based approach has been shown to have higher precision, with the possibility of identifying related sections within documents, compared to existing approaches like bibliographic coupling, co-citation analysis or keyword-based approaches.
Qi He et al. [he2011citation] propose an approach for automatic recommendation of citations for a manuscript without author supervision. This reduces user burden, as the input to the system is just a query manuscript (without a bibliography), and the system automatically finds locations where citations are needed. They have shown the effectiveness of their approach with an extensive empirical evaluation using the CiteSeerX data set. In one of their other works, they further propose a context-aware citation recommender system [he2010context], which helps in citing good candidates at different local contexts in the paper.
Ming Zhang et al. [zhang2008paper] present a recommender for scientific literature based on semantic concept similarity computed from collaborative tags. User profiles and item profiles are represented by these semantic concepts, and neighbour users are selected using collaborative filtering. Then, a content-based filtering approach is used to generate a recommendation list from the papers these neighbour users tagged. Onur et al. [kuccuktuncc2012direction] also address a similar problem of recommending papers on academic networks, but here they use a direction-aware (in the sense that it can be tuned to find either recent or traditional papers) citation analysis. Cai-Nicolas Ziegler et al. [ziegler2005improving] propose a method to diversify personalized recommendation lists in order to reflect the user’s complete spectrum of interests. They achieve this by introducing an intra-list similarity metric to assess the topical diversity of recommendation lists and then reducing the intra-list similarity, thereby diversifying the topics. This has been shown to improve user satisfaction.
Specific to our problem of recommending conferences to authors, not much has been done in the dimensionality reduction space. The work done by H. Luong et al. [luong2012publication] makes use of the social network of the authors. For every author of the paper in need of a conference recommendation, the various conferences are given weights based on the author’s social network. These are then combined and the conference with the highest weight is suggested for the paper. Zaihan Yang and Brian D. Davidson [yang2012venue] provide a collaborative filtering-based recommender system that can provide venue recommendations to researchers. Here, papers are represented by both content (using topics, requiring LDA) and stylometric features. Eric Medvet et al. [medvetpublication], in their work, address the same problem but make use of only the title and abstract of the paper. They propose different approaches where they match the topics of a scientific paper with those of the possible publication venues for that paper.
We make use of Correspondence Analysis in our work, building a model out of the content of the abstracts thereby leading to a more meaningful conference recommender system for papers. We employ dimensionality reduction techniques to further reduce the noise and result in better recommendations. In this way, our approaches are different from those that are previously attempted: dimensionality reduction techniques, to the best of our knowledge, have not been explored very extensively in the text-based recommendation space. Therein lies our novelty and contribution to the machine learning literature.
2.4 Data Mining Techniques
Decision trees may be used in a model-based approach for a RS. One possibility is to use content features to build a decision tree that models all the variables involved in the user preferences. Bouza et al. [bouza2008semtree] use this idea to construct a Decision Tree using semantic information available for the items. The tree is built after the user has rated only two items. The features for each of the items are used to build a model that explains the user ratings. They use the information gain of every feature as the splitting criteria. Another option to use Decision Trees in a RS is to use them as a tool for item ranking. The use of Decision Trees for ranking has been studied in several settings and their use in a RS for this purpose is fairly straightforward [cheng2009decision].
A rule-based system can be used to improve the performance of a RS by injecting partial domain knowledge or business rules. Gutta et al. [gutta2000tv] implemented a rule-based RS for TV content. In order to do so, they first derived a Decision Tree that is then decomposed into rules for classifying the programs.
Bayesian classifiers are particularly popular for model-based RS. They are often used to derive a model for content-based RS. However, they have also been used in a CF setting. Miyahara and Pazzani [miyahara2000collaborative] implement a RS based on a Naive Bayes classifier. In order to do so, they define two classes: like and don’t like. Experiments show that this model performs better than a correlation-based CF.
Support Vector Machines have recently gained popularity for their performance and efficiency in many settings. SVMs have also shown promising recent results in RS. Kang and Yoo [yoo2007svm], for instance, report on an experimental study that aims at selecting the best preprocessing technique for predicting missing values for an SVM-based RS.
Xue et al. [xue2005scalable] present a typical use of clustering in the context of a RS by employing the K-means algorithm as a pre-processing step to help in neighborhood formation.
Cho et al. [cho2002personalized] combine Decision Trees and Association Rule Mining in a web shop RS. In their system, association rules are derived in order to link related items. The recommendation is then computed by intersecting association rules with user preferences. They look for association rules in different transaction sets such as purchases, basket placement, and click-through. They also use a heuristic for weighting rules coming from each of the transaction sets. Purchase association rules, for instance, are weighted higher than click-through association rules.
Recently, several approaches involving natural language processing [iyer2019event, iyer2019unsupervised, iyer2019heterogeneous, iyer2017detecting, iyer2019machine, iyer2017recomob, iyer2019simultaneous], machine learning [li2016joint, iyer2016content, honke2018photorealistic, iyer2018transparency, li2018object] and numerical optimizations [radhakrishnan2016multiple, iyer2012optimal, qian2014parallel, gupta2016analysis, radhakrishnan2018new] have also been used in the visual and language domains.
3 Problem Statement
Selection of a publishing venue for a research work is an arduous task. With a huge number of venues to choose from, researchers may find it difficult to identify the appropriate conference for their paper. Hence, we try to automate the process of filtering and ordering the conferences. Let $P$ be the set of papers to be published, $C$ be the set of conferences, and $u$ be the utility function such that $u: P \times C \rightarrow R$, where $R$ is a totally ordered set. Then, for each paper $p \in P$, we need to find the conference $c^{*} \in C$ that maximizes the utility, i.e. $c^{*} = \arg\max_{c \in C} u(p, c)$.
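The problem statement amounts to an argmax over conferences for a fixed paper. The sketch below makes this concrete with a deliberately simple, hypothetical utility (the number of keywords shared by the paper and the venue); the venue names and keyword sets are invented, and the actual utility used by the methods in this paper is a distance in a CA-reduced space.

```python
def recommend(paper, conferences, utility):
    """Return the conference that maximizes utility(paper, conference)."""
    return max(conferences, key=lambda c: utility(paper, c))

# Hypothetical venues described by keyword sets.
venue_keywords = {
    "conf_A": {"database", "query", "index"},
    "conf_B": {"recommender", "ranking", "retrieval"},
    "conf_C": {"graphics", "rendering"},
}

def keyword_overlap(paper, conference):
    """Toy utility: number of keywords shared by the paper and the venue."""
    return len(paper & venue_keywords[conference])

paper = {"recommender", "ranking", "matrix"}
print(recommend(paper, venue_keywords, keyword_overlap))
```

Any function that returns values from a totally ordered set can be passed as `utility`, so sorting the conferences by the same key would give the full ordering rather than only the best venue.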
4 Preliminary Concepts
4.1 Correspondence Analysis
This section describes Correspondence Analysis in detail. The following section describes the theoretical aspect and the section after that discusses the computational details.
Correspondence analysis (CA) is a multivariate statistical technique applied to categorical data, usually in the form of a contingency table, rather than to continuous data as in the case of PCA. It represents the row and column categories graphically, thereby allowing for a comparison of their correspondences or associations at a category level. CA tries to identify components in the reduced dimension that maximize the relations among the variables, while PCA tries to find components that maximize the variability.
A contingency table is a type of table in a matrix format that displays the frequency distribution of the variables. An example of one is shown in Table 1. A contingency table is usually associated with a grand total, the total number of entities represented in the table, and marginal totals, which are the row sums and column sums.
Correspondence analysis basically tries to find any possible relation between the categorical variables. Contingency tables with more variables are possible, but they become difficult to visualize. So, only the analysis of contingency tables with two variables is described here (the grid can be of any size).
The most basic concept in CA is that of a profile, which is a set of frequencies divided by their total. For a given contingency table, we can have row and column profiles. The objective of CA is to be able to visualize these profiles and the relationships among them (for example the relation between hair colour and eye colour in Table 1), by projecting them onto a subspace of low dimensionality that best fits the profiles while minimizing the loss of information. Since the objective of finding low-dimensional best-fitting subspaces coincides with the objective of finding low-rank matrix approximations by least-squares, the SVD forms the backbone of CA. On a side note, CA is symmetrical in nature, i.e. column analysis and row analysis yield the same results.
It so happens that row profiles of dimension $n$ (meaning that the contingency table in consideration has $n$ columns), on being plotted in $n$ dimensions, lie in an $(n-1)$-dimensional space, because the entries of each profile sum to 1. This means that for a table with $3$ columns, the rows (after being divided by the row total to get profiles) lie in a $2$-D space, which is a plane. A similar situation arises with column profiles too: considering there are $m$ rows in the table, the column profiles lie in an $(m-1)$-dimensional subspace. Since the analysis is symmetrical, it can be observed that both the columns and rows have to lie in a $\min(m-1, n-1)$-dimensional subspace.
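The dimensionality argument can be checked numerically on a small example. The sketch below builds row and column profiles for a hypothetical 4 x 3 contingency table and verifies that the row profiles, once centered at the average profile, span only two dimensions, i.e. one less than the smaller of the number of rows and columns. The table values are invented.

```python
import numpy as np

# Hypothetical 4 x 3 contingency table.
table = np.array([[20, 10, 5],
                  [10, 15, 10],
                  [5, 10, 25],
                  [8, 8, 8]], dtype=float)

row_profiles = table / table.sum(axis=1, keepdims=True)   # each row sums to 1
col_profiles = table / table.sum(axis=0, keepdims=True)   # each column sums to 1
col_masses = table.sum(axis=0) / table.sum()              # average row profile

# Every row profile sums to 1, so the centered profiles satisfy one linear
# constraint and can span at most (number of columns - 1) dimensions.
centered = row_profiles - col_masses
rank = np.linalg.matrix_rank(centered, tol=1e-10)
print(rank)
```

An explicit tolerance is passed to `matrix_rank` so that rounding error in the near-zero singular value does not inflate the rank.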
For a contingency table with $m$ rows and $n$ columns, each column category can be thought of as a “pure” row profile, i.e. its distribution in the other column categories is $0$. So, the row profile vector of the second column category can be thought of as $[0, 1, 0, \dots, 0]$. These points form the vertices of the $\min(m-1, n-1)$-dimensional subspace that the row profiles lie in. Upon reducing the dimensions of both the row profiles and vertices using SVD, so that minimum information is lost by finding the principal axes, the representation can often be brought down to a 2-D plot where the relationships are easily visualized.
The coordinates of the row profiles in the reduced subspace are called row principal coordinates and those of the vertices are called column standard coordinates. A similar explanation can be given for column principal coordinates and row standard coordinates. Algorithms to compute these are given in Section 4.1.3.
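Having computed the principal and standard coordinates, the original information can be recovered with some loss. This is called the reconstitution formula:
$$p_{ij} = r_i c_j \left(1 + \sum_{k=1}^{K} \sqrt{\lambda_k}\, \phi_{ik} \gamma_{jk}\right)$$
where
$p_{ij}$ are the relative proportions $n_{ij}/n$, $n$ being the grand total
$r_i$ and $c_j$ are the row and column masses respectively
$\lambda_k$ is the $k$-th principal inertia
$\phi_{ik}$ and $\gamma_{jk}$ are the row and column standard coordinates respectively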
In the summation, there are as many terms as there are dimensions in the data matrix, which has been shown to be equal to one less than the number of rows or columns, whichever is smaller. Retaining fewer dimensions than this leads to some loss of information. We have to choose the number of retained dimensions appropriately to minimize the loss.
In order to visualize the relationships between the categories, the coordinates of the rows and columns are plotted in a 2-D map called a biplot. There are different kinds of plots. In an asymmetrical plot, the principal coordinates of the rows (or columns) and the standard coordinates of the other are plotted. This is the most common representation for measuring distances between the points and hence the relationships between them. Another kind of plot is called symmetrical, where both the rows and the columns are depicted using the same kind of coordinates: principal or standard.
4.1.2 Theoretical Development
4.1.2.1 Pearson’s Chi-square Test for Independence
The Chi-square test is intended to test how likely it is that an observed distribution is due to chance. It is also called a “goodness of fit” statistic, because it measures how well the observed distribution of data fits with the distribution that is expected if the variables are independent. A Chi-square test is designed to analyze categorical data, i.e. the data has been counted and divided into categories, and will not work with parametric or continuous data.
The Chi-square test basically tests the null hypothesis that the variables are independent. The test compares the observed data to a model that distributes the data according to the expectation that the variables are independent. Wherever the observed data do not fit the model, the evidence that the variables are dependent becomes stronger, which counts against the null hypothesis. So, a Chi-square test allows us to assess how plausible it is that two categorical variables (for example, an attendance state and an outcome state) are completely independent. The Chi-square test only assesses independence; it does not give any details about the nature of the relationship between the variables. However, once the test indicates that the two variables are related, other methods can be used to explore their interaction in more detail.
To test the null hypothesis, we need to construct a model which estimates how the data should be distributed if our hypothesis of independence is correct. We build the required model by making use of the marginal and grand totals. The estimated value $E_{ij}$ for each cell $(i, j)$ is given by
\[ E_{ij} = \frac{\left(\sum_{j'} O_{ij'}\right)\left(\sum_{i'} O_{i'j}\right)}{n} \]
i.e. the product of the $i$-th row total and the $j$-th column total, divided by the grand total $n$.
This way, we get a table similar to the observed table, except that in this case, the variables are assumed to be independent. This table is used to test the null hypothesis by computing the Chi-square statistic, as follows:
\[ X^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
where $I$ represents the number of rows in the table and $J$ represents the number of columns in the table. $O_{ij}$ stands for the $(i, j)$-th entry of the observed table and $E_{ij}$ stands for the expected/estimated value of the $(i, j)$-th entry of the model table assuming independence.
Now, having calculated the Chi-square statistic, the number carries little meaning until we determine the p-value. The p-value is the probability of obtaining a test statistic result at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.
With the Chi-square value and the degrees of freedom, the p-value can be calculated. The degrees of freedom give the number of entries in the grid that are actually independent. For a Chi-square grid, the degrees of freedom can be said to be the number of cells that need to be filled, given the totals in the margins, before the rest of the grid can be filled using a formula that depends on the marginal totals and the values in the cells filled earlier. Thus, for a Chi-square grid, the degrees of freedom are $(I-1)(J-1)$, where $I$ and $J$ represent the number of rows and columns in the table respectively.
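The computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the experiments: the function name `chi_square_independence` and the 2 x 2 table of counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def chi_square_independence(observed):
    """Return (statistic, degrees_of_freedom, expected) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Sum of (observed - expected)^2 / expected over all cells.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Hypothetical counts, made up for illustration only.
table = [[10, 20], [30, 40]]
statistic, dof, expected = chi_square_independence(table)
print(statistic, dof)
```

The sketch stops at the statistic and the degrees of freedom. If SciPy is available, the p-value can then be obtained from the chi-squared survival function, for example `scipy.stats.chi2.sf(statistic, dof)`.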
Consider an $I \times J$ two-way contingency table $\mathbf{N}$, where the $(i, j)$-th cell entry is given by $n_{ij}$ for $i = 1, \ldots, I$ and $j = 1, \ldots, J$. Let the grand total of $\mathbf{N}$ be $n$ and the correspondence matrix, or matrix of relative frequencies, be $\mathbf{P}$, so that the $(i, j)$-th cell entry is $p_{ij} = n_{ij}/n$ and $\sum_{i}\sum_{j} p_{ij} = 1$. Define the $i$-th row marginal proportion by $p_{i\cdot} = \sum_{j} p_{ij}$ and define the $j$-th column marginal proportion by $p_{\cdot j} = \sum_{i} p_{ij}$.
4.1.2.2 Pearson’s Ratio
The aim of correspondence analysis, like many multivariate data analytic techniques, is to determine scores which describe how similar or different responses to two or more variables are.
If we consider a model of complete independence between rows and columns of the table, then
\[ p_{ij} = p_{i\cdot}\, p_{\cdot j} \]
But this complete independence is almost never satisfied. So, we introduce a constant $\alpha_{ij}$ such that the new relation becomes
\[ p_{ij} = \alpha_{ij}\, p_{i\cdot}\, p_{\cdot j} \]
As can be seen, if $\alpha_{ij} = 1$ for all $i = 1, \ldots, I$ and $j = 1, \ldots, J$, then complete independence in the model is observed. Since complete independence is seldom observed, the elements for which $\alpha_{ij} \neq 1$ can be identified by calculating
\[ \alpha_{ij} = \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}} \]
which is known as the Pearson ratio.
Using the Pearson ratio, the Pearson Chi-square statistic can be written as
\[ X^2 = n \sum_{i=1}^{I} \sum_{j=1}^{J} p_{i\cdot}\, p_{\cdot j}\, (\alpha_{ij} - 1)^2 \]
Under the null hypothesis of independence, this statistic has (asymptotically) a chi-squared distribution with $(I-1)(J-1)$ degrees of freedom.
A property of the Pearson chi-squared statistic is that as the sample size $n$ increases, so too does the statistic. This can hinder tests of association in contingency tables. To overcome this problem, simple correspondence analysis considers $X^2/n$, which is referred to as the total inertia of the contingency table, to describe the level of association, or dependence, between two categorical variables. By decomposing the total inertia, important sources of information that help describe this association can be identified. Most commonly, SVD is used to decompose the Pearson ratio.
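4.1.2.3 Using Singular Value Decomposition
Classically, simple correspondence analysis is conducted by performing a singular value decomposition (SVD) on the Pearson ratio. The method of SVD, also referred to as the “Eckart-Young” decomposition, is the most common tool used to decompose the Pearson ratio. For the analysis of contingency tables, the Pearson ratio may be decomposed into components by
\[ \alpha_{ij} = \sum_{m=0}^{M} a_{im} \lambda_m b_{jm} \tag{8} \]
where $M = \min(I, J) - 1$ is the maximum number of dimensions required to graphically depict the association between the row and column responses. For example, for Table 1 only $M = \min(I, J) - 1$ dimensions are required to graphically depict all of the association between the hair and eye colour of the children classified in Caithness. However, for a simple interpretation of this association, generally only the first two dimensions are used to construct such a graphical summary.
The vector $\mathbf{a}_m = (a_{1m}, \ldots, a_{Im})^T$ is the $m$-th row singular vector and is associated with the $I$ row categories. Similarly, $\mathbf{b}_m = (b_{1m}, \ldots, b_{Jm})^T$ is the $m$-th column singular vector and is associated with the $J$ column categories. The elements of the vector $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_M)$ are real and non-negative; they are the first $M$ non-trivial singular values and are arranged in descending order so that
\[ 1 = \lambda_0 \geq \lambda_1 \geq \cdots \geq \lambda_M \geq 0 \]
These singular values can also be calculated by
\[ \lambda_m = \sum_{i=1}^{I} \sum_{j=1}^{J} p_{ij}\, a_{im} b_{jm} \]
while the singular vectors have the property
\[ \sum_{i=1}^{I} p_{i\cdot}\, a_{im} a_{im'} = \delta_{mm'}, \qquad \sum_{j=1}^{J} p_{\cdot j}\, b_{jm} b_{jm'} = \delta_{mm'} \tag{11} \]
where $\delta_{mm'} = 1$ if $m = m'$ and $0$ otherwise.
We use the fact that $\lambda_0 = 1$, $a_{i0} = 1$ and $b_{j0} = 1$ to rewrite equation (8) as
\[ \alpha_{ij} = 1 + \sum_{m=1}^{M} a_{im} \lambda_m b_{jm} \]
By using the orthogonality properties of $\mathbf{a}_m$ and $\mathbf{b}_m$ from equation (11), the total inertia can be written in terms of singular values such that
\[ \frac{X^2}{n} = \sum_{m=1}^{M} \lambda_m^2 \]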
For Table 1, we obtain , , so that . So, the first axis explains of the total variation that exists in the table, while the second axis explains of this variation. Thus, considering just these two axes accounts for of the total variation in Table 1. So, we can safely ignore the component, i.e. corresponding to without much loss. This way, we are reducing the dimensions.
CA is based on fairly straightforward, classical results in matrix theory. The central result is the singular value decomposition (SVD), which is the basis of many multivariate methods such as principal component analysis, canonical correlation analysis, all forms of linear biplots, discriminant analysis and metric multidimensional scaling. Here, matrix-vector notation is used because it is more compact [greenacre2007correspondence].
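Let $\mathbf{N}$ denote the $I \times J$ data matrix with positive row and column sums. For notational simplicity, the matrix is first converted to the correspondence matrix $\mathbf{P}$ by dividing $\mathbf{N}$ by its grand total $n = \sum_i \sum_j n_{ij}$.
\[ \mathbf{P} = \frac{1}{n} \mathbf{N} \]
The following notation is used:
Row and column masses:
\[ r_i = \sum_{j=1}^{J} p_{ij}, \qquad c_j = \sum_{i=1}^{I} p_{ij}, \qquad \text{i.e. } \mathbf{r} = \mathbf{P}\mathbf{1}, \quad \mathbf{c} = \mathbf{P}^T \mathbf{1} \]
Diagonal matrices of row and column masses:
\[ \mathbf{D}_r = \mathrm{diag}(\mathbf{r}), \qquad \mathbf{D}_c = \mathrm{diag}(\mathbf{c}) \]
All subsequent results are given in terms of these relative quantities $\mathbf{P}$, $\mathbf{r}$ and $\mathbf{c}$, whose elements add up to $1$ in each case.
4.1.3.1 Basic Computational Algorithm
The computational algorithm to obtain coordinates of the row and column profiles with respect to principal axes, using the singular value decomposition (SVD), is as follows:
Step 1: Calculate the matrix $\mathbf{S}$ of standardized residuals
\[ \mathbf{S} = \mathbf{D}_r^{-1/2} (\mathbf{P} - \mathbf{r}\mathbf{c}^T) \mathbf{D}_c^{-1/2} \]
Step 2: Calculate the SVD of $\mathbf{S}$
\[ \mathbf{S} = \mathbf{U} \mathbf{D}_\alpha \mathbf{V}^T, \qquad \mathbf{U}^T\mathbf{U} = \mathbf{V}^T\mathbf{V} = \mathbf{I} \]
where $\mathbf{D}_\alpha$ is the diagonal matrix of (positive) singular values in descending order:
\[ \alpha_1 \geq \alpha_2 \geq \cdots \]
Step 3: Standard coordinates $\boldsymbol{\Phi}$ of rows
\[ \boldsymbol{\Phi} = \mathbf{D}_r^{-1/2} \mathbf{U} \]
Step 4: Standard coordinates $\boldsymbol{\Gamma}$ of columns
\[ \boldsymbol{\Gamma} = \mathbf{D}_c^{-1/2} \mathbf{V} \]
Step 5: Principal coordinates $\mathbf{F}$ of rows:
\[ \mathbf{F} = \mathbf{D}_r^{-1/2} \mathbf{U} \mathbf{D}_\alpha = \boldsymbol{\Phi} \mathbf{D}_\alpha \]
Step 6: Principal coordinates $\mathbf{G}$ of columns:
\[ \mathbf{G} = \mathbf{D}_c^{-1/2} \mathbf{V} \mathbf{D}_\alpha = \boldsymbol{\Gamma} \mathbf{D}_\alpha \]
Step 7: Principal inertias $\lambda_k$:
\[ \lambda_k = \alpha_k^2, \qquad k = 1, \ldots, K, \quad \text{where } K = \min(I-1, J-1) \]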
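The seven steps above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the implementation used for the experiments: the function name `correspondence_analysis`, the dictionary keys and the 3 x 3 table of counts are all hypothetical, and the table is assumed to have strictly positive row and column sums.

```python
import numpy as np


def correspondence_analysis(table):
    """Basic CA of a table of counts with positive row and column sums."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()              # correspondence matrix
    r = P.sum(axis=1)            # row masses
    c = P.sum(axis=0)            # column masses

    # Step 1: standardized residuals (p_ij - r_i c_j) / sqrt(r_i c_j).
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)

    # Step 2: SVD, keeping the K = min(I, J) - 1 non-trivial dimensions.
    U, alpha, Vt = np.linalg.svd(S, full_matrices=False)
    K = min(N.shape) - 1
    U, alpha, V = U[:, :K], alpha[:K], Vt[:K, :].T

    # Steps 3 and 4: standard coordinates of rows and columns.
    Phi = U / np.sqrt(r)[:, None]
    Gamma = V / np.sqrt(c)[:, None]

    # Steps 5 and 6: principal coordinates of rows and columns.
    F = Phi * alpha
    G = Gamma * alpha

    # Step 7: principal inertias.
    return {"Phi": Phi, "Gamma": Gamma, "F": F, "G": G, "inertias": alpha ** 2}


# Hypothetical counts, made up for illustration only.
counts = np.array([[20, 5, 3], [4, 15, 6], [2, 7, 18]])
ca = correspondence_analysis(counts)
print(ca["inertias"] / ca["inertias"].sum())  # share of inertia per axis
```

The sum of the principal inertias equals the total inertia $X^2/n$ of the table, which gives a quick sanity check on any implementation.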
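4.1.3.2 Transition equations between rows and columns
The left and right singular vectors are related linearly, for example by multiplying the SVD on the right by $\mathbf{V}$: $\mathbf{S}\mathbf{V} = \mathbf{U}\mathbf{D}_\alpha$. Expressing such relations in terms of the principal and standard coordinates gives the following variations of the same theme, called transition equations:
Principal as a function of standard (barycentric relationships)
\[ \mathbf{F} = \mathbf{D}_r^{-1} \mathbf{P} \boldsymbol{\Gamma}, \qquad \mathbf{G} = \mathbf{D}_c^{-1} \mathbf{P}^T \boldsymbol{\Phi} \tag{25} \]
Principal as a function of principal
\[ \mathbf{F} = \mathbf{D}_r^{-1} \mathbf{P} \mathbf{G} \mathbf{D}_\alpha^{-1}, \qquad \mathbf{G} = \mathbf{D}_c^{-1} \mathbf{P}^T \mathbf{F} \mathbf{D}_\alpha^{-1} \]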
4.1.3.3 Supplementary Points
Supplementary rows/columns are those entries that are added to the original table. In many cases, we may require their principal/standard coordinates. The transition equations can be used to situate the supplementary points on the map. This way, we can compute the coordinates for the supplementary points using the already computed coordinates for the original table.
For example, given a supplementary column point with values in $\mathbf{h}$ ($I \times 1$), divide by its total $\mathbf{1}^T\mathbf{h}$ to obtain the column profile $\tilde{\mathbf{h}} = \mathbf{h} / (\mathbf{1}^T\mathbf{h})$, and then use the profile transposed as a row vector in the second equation of (25), for example, to calculate the coordinates $\mathbf{g}$ of the supplementary column:
\[ \mathbf{g} = \tilde{\mathbf{h}}^T \boldsymbol{\Phi} \]
In the proposed methods, we required the principal coordinates of the supplementary rows (which are the rows of the test matrix). The steps taken to obtain those, from the trained model, are:
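Step 1: $\mathbf{N}_{\text{test}}$ is our test matrix
Step 2: We obtain the correspondence matrix $\mathbf{P}_{\text{test}}$ of $\mathbf{N}_{\text{test}}$ by normalizing the entries of the matrix with its grand total $n_{\text{test}}$
\[ \mathbf{P}_{\text{test}} = \frac{1}{n_{\text{test}}} \mathbf{N}_{\text{test}} \]
Step 3: We obtain the row masses $\mathbf{r}_{\text{test}} = \mathbf{P}_{\text{test}} \mathbf{1}$ for $\mathbf{P}_{\text{test}}$
Step 4: The diagonal matrix corresponding to $\mathbf{r}_{\text{test}}$, $\mathbf{D}_{r,\text{test}} = \mathrm{diag}(\mathbf{r}_{\text{test}})$, is obtained
Step 5: From the trained model, we have the column standard coordinates for the training matrix $\mathbf{N}_{\text{train}}$. Let us call that column standard coordinate matrix $\boldsymbol{\Gamma}$. Then, using the transition equations given in equation (25), we obtain the principal coordinates for the test matrix (supplementary rows), $\mathbf{F}_{\text{test}}$, as
\[ \mathbf{F}_{\text{test}} = \mathbf{D}_{r,\text{test}}^{-1} \mathbf{P}_{\text{test}} \boldsymbol{\Gamma} \]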
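The five steps above can be sketched with NumPy as follows. The sketch is illustrative only: the function names `column_standard_coordinates` and `supplementary_row_coordinates` and both tables of counts are hypothetical, and every test row is assumed to have a positive total (a row of zeros has no profile, so its mass cannot be inverted).

```python
import numpy as np


def column_standard_coordinates(train_table):
    """Column standard coordinates (Gamma) of the training table."""
    N = np.asarray(train_table, dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    K = min(N.shape) - 1                      # non-trivial dimensions
    return Vt[:K, :].T / np.sqrt(c)[:, None]


def supplementary_row_coordinates(test_table, Gamma):
    """Principal coordinates of supplementary rows via the transition equation."""
    N_test = np.asarray(test_table, dtype=float)  # Step 1
    P_test = N_test / N_test.sum()                # Step 2
    r_test = P_test.sum(axis=1)                   # Step 3
    D_r_inv = np.diag(1.0 / r_test)               # Step 4, already inverted
    return D_r_inv @ P_test @ Gamma               # Step 5


# Hypothetical counts, made up for illustration only.
train = np.array([[20, 5, 3], [4, 15, 6], [2, 7, 18]])
test = np.array([[40, 10, 6], [1, 1, 8]])

Gamma = column_standard_coordinates(train)
F_test = supplementary_row_coordinates(test, Gamma)
print(F_test)
```

Because only row profiles enter the formula, the first test row, which is twice the first training row, lands on the same point as that training row.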
Using the procedures mentioned in this section, CA has been implemented and used for the experiments conducted. For more details about the procedures and implementations, one can refer to [greenacre2007correspondence].
4.2 Topic Modeling
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: “dog” and “bone” will appear more often in documents about dogs, “cat” and “meow” will appear in documents about cats, and “the” and “is” will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document’s balance of topics is.
The model being looked at is the Latent Dirichlet Allocation (LDA), which is a bag-of-words model.
4.2.1 Bag-of-words Model
This model is a simplifying assumption used in natural language processing and information retrieval wherein, a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. From a data modeling point of view, the bag-of-words model can be represented by a co-occurrence matrix of documents and words as illustrated in Figure 1. Just as a text consists of words, a multimedia document can be thought to consist of sensory words, thus allowing them a bag-of-words representation too. This model is widely used in document classification and modeling. When a Naive Bayes classifier is applied to text, for example, the conditional independence assumption leads to the bag-of-words model. Other methods of document modeling that use this model include the Latent Dirichlet Allocation.
4.2.2 Latent Dirichlet Allocation (LDA)
LDA is a generative probabilistic model that generates a document using a mixture of topics [blei2003latent]. It assumes a generative probabilistic model in which documents are represented as random mixtures over latent topics, where each topic is characterized by a probability distribution over words. An illustration of the assumption in the LDA model is depicted in Figure 2.
LDA [blei2003latent] assumes the following generative process for each document $\mathbf{w}$ in a corpus $D$:
Choose $N \sim \text{Poisson}(\xi)$
Choose $\theta \sim \text{Dir}(\alpha)$
For each of the $N$ words $w_n$:
Choose a topic $z_n \sim \text{Multinomial}(\theta)$
Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$
A visualization of LDA is given in Figure 3.
4.2.2.1 Understanding LDA with an example
Suppose we have the following sentences:
I ate a banana and spinach smoothie for breakfast
I like to eat broccoli and bananas.
Chinchillas and kittens are cute.
My sister adopted a kitten yesterday.
Look at this cute hamster munching on a piece of broccoli.
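Latent Dirichlet allocation is a way of automatically discovering the topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like
Sentences 1 and 2: 100% Topic A
Sentences 3 and 4: 100% Topic B
Sentence 5: 60% Topic A, 40% Topic B
Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, ... (at which point, we could interpret Topic A to be about food)
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, ... (at which point, we could interpret Topic B to be about cute animals)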
The question, of course, is: how does LDA perform this discovery?
In more detail, LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion: when writing each document, we
Decide on the number of words N the document will have (say, according to a Poisson distribution).
Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have the two food and cute animal topics above, you might choose the document to consist of 1/3 food and 2/3 cute animals.
Generate each word in the document by:
First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
Then using the topic to generate the word itself (according to the topic’s multinomial distribution). For instance, the food topic might output the word “broccoli” with 30% probability, “bananas” with 15% probability, and so on.
Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.
As an example, according to the above process, when generating some particular document $D$, we might
Decide that D will be 1/2 about food and 1/2 about cute animals.
Pick 5 to be the number of words in D.
Pick the first word to come from the food topic, which then gives you the word “broccoli”.
Pick the second word to come from the cute animals topic, which gives you “panda”.
Pick the third word to come from the cute animals topic, giving you “adorable”.
Pick the fourth word to come from the food topic, giving you “cherries”.
Pick the fifth word to come from the food topic, giving you “eating”.
So the document generated under the LDA model will be “broccoli panda adorable cherries eating” (note that LDA is a bag-of-words model). Now, how to infer the parameters of the LDA model, given a set of documents, is described in the next section.
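The generative story above can be sketched with the Python standard library. This is a toy illustration under stated assumptions: the function names, the two topic-word distributions and the Dirichlet parameters are hypothetical, and the document length is fixed by the caller rather than drawn from a Poisson distribution.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw a topic mixture from a Dirichlet via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alphas, num_words, rng):
    """Generate one bag of words following the LDA generative process."""
    names = list(topics)
    mixture = sample_dirichlet(alphas, rng)      # topic mixture of the document
    words = []
    for _ in range(num_words):
        # Pick a topic according to the document's topic mixture.
        topic = rng.choices(names, weights=mixture)[0]
        # Pick a word according to that topic's word distribution.
        vocab = list(topics[topic])
        probs = [topics[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return mixture, words


# Hypothetical topic-word probabilities, made up for illustration only.
topics = {
    "food": {"broccoli": 0.30, "bananas": 0.15, "cherries": 0.25, "eating": 0.30},
    "cute animals": {"panda": 0.40, "adorable": 0.30, "kitten": 0.30},
}
rng = random.Random(0)
mixture, document = generate_document(topics, [1.0, 1.0], num_words=5, rng=rng)
print(mixture, document)
```

Since word order plays no role in the model, the generated list is best read as an unordered bag of words.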
So now suppose we have a set of documents. We have chosen some fixed number of topics, $K$, to discover, and want to use LDA to learn the topic representation of each document and the words associated with each topic. In short, we want to perform inference on this generative model. Several techniques, such as the EM algorithm and Gibbs sampling, can be used for this purpose. For our experiments, we have used collapsed Gibbs sampling:
Go through each document, and randomly assign each word in the document to one of the topics.
Notice that this random assignment already gives us both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).
So to improve on them:
For each document $d$
Go through each word $w$ in $d$
And for each topic $t$, compute two things: 1) $p(t \mid d)$ = the proportion of words in document $d$ that are currently assigned to topic $t$, and 2) $p(w \mid t)$ = the proportion of assignments to topic $t$ over all documents that come from this word $w$. Reassign $w$ a new topic, where we choose topic $t$ with probability $p(t \mid d) \cdot p(w \mid t)$ (according to our generative model, this is essentially the probability that topic $t$ generated word $w$, so it makes sense that we resample the current word’s topic with this probability).
In other words, in this step, we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
After repeating the previous step a large number of times, we will eventually reach a roughly steady state where our assignments are pretty good. So, we can use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).
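The sampling loop described above can be sketched in plain Python. This is a toy sketch, not the C++ implementation used in the experiments: the function name `collapsed_gibbs_lda`, the tokenized example documents and the smoothing hyperparameters `alpha` and `beta` are assumptions introduced here. The smoothing terms keep every topic's probability positive, which the simplified description above leaves out.

```python
import random


def collapsed_gibbs_lda(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; alpha and beta are assumed smoothing values."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]    # words in doc d assigned to t
    topic_word = [{} for _ in range(num_topics)]    # times word w is assigned to t
    topic_total = [0] * num_topics                  # all assignments to t

    # Random initial assignment of every word to a topic.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            t = rng.randrange(num_topics)
            doc_assignments.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] = topic_word[t].get(w, 0) + 1
            topic_total[t] += 1
        assignments.append(doc_assignments)

    topic_ids = list(range(num_topics))
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current word's assignment from the counts.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                # Weight of topic t: p(t | d) * p(w | t), both smoothed.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(w, 0) + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in topic_ids
                ]
                new = rng.choices(topic_ids, weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] = topic_word[new].get(w, 0) + 1
                topic_total[new] += 1

    # Smoothed topic mixture of each document.
    doc_mixtures = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    return doc_mixtures, topic_word, assignments


# Hypothetical tokenized documents, loosely based on the example sentences.
docs = [
    ["banana", "spinach", "smoothie", "breakfast"],
    ["broccoli", "banana", "eat"],
    ["chinchilla", "kitten", "cute"],
    ["sister", "kitten", "adopted"],
    ["cute", "hamster", "munching", "broccoli"],
]
doc_mixtures, topic_word, assignments = collapsed_gibbs_lda(docs, num_topics=2)
print(doc_mixtures)
```

On a corpus this small the recovered topics vary with the seed, so the output should be read as a demonstration of the mechanics rather than of topic quality.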
4.3 Term Frequency-Inverse Document Frequency
One of the best-known measures for specifying keyword weights in Information Retrieval is the term frequency/inverse document frequency (TF-IDF) measure, which is defined as follows. Assume that $N$ is the total number of documents that can be recommended to users and that keyword $k_i$ appears in $n_i$ of them. Moreover, assume that $f_{i,j}$ is the number of times keyword $k_i$ appears in document $d_j$. Then, $TF_{i,j}$, the term frequency (or normalized frequency) of keyword $k_i$ in document $d_j$, is defined as
\[ TF_{i,j} = \frac{f_{i,j}}{\max_z f_{z,j}} \]
where the maximum is computed over the frequencies $f_{z,j}$ of all keywords $k_z$ that appear in the document $d_j$. However, keywords that appear in many documents are not useful in distinguishing between a relevant document and an irrelevant one. Therefore, the measure of inverse document frequency ($IDF_i$) is often used in combination with simple term frequency ($TF_{i,j}$). The inverse document frequency for keyword $k_i$ is usually defined as
\[ IDF_i = \log \frac{N}{n_i} \]
Then, the TF-IDF weight for keyword $k_i$ in document $d_j$ is defined as
\[ w_{i,j} = TF_{i,j} \times IDF_i \]
and the content of document $d_j$ is defined as
\[ \text{Content}(d_j) = (w_{1,j}, \ldots, w_{k,j}) \]
which is a vector of weights.
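The weighting scheme above can be sketched in plain Python. The function name `tf_idf` and the two tiny tokenized documents are hypothetical, the natural logarithm is an assumption (the definition does not fix a base), and every document is assumed to be non-empty.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (vocabulary, weights); weights[j][i] is the weight of keyword i in document j."""
    num_docs = len(documents)
    counts = [Counter(doc) for doc in documents]
    # n_i: number of documents in which keyword i appears.
    doc_freq = Counter(keyword for c in counts for keyword in c)
    vocabulary = sorted(doc_freq)

    weights = []
    for c in counts:
        max_freq = max(c.values())  # most frequent keyword in this document
        weights.append([
            (c[k] / max_freq) * math.log(num_docs / doc_freq[k])
            for k in vocabulary
        ])
    return vocabulary, weights


# Hypothetical tokenized abstracts, made up for illustration only.
documents = [["ca", "svd", "svd"], ["lda", "topic", "svd"]]
vocabulary, weights = tf_idf(documents)
print(vocabulary)
print(weights)
```

A keyword that occurs in every document, such as "svd" here, receives weight zero, which is exactly the discounting effect the inverse document frequency is meant to provide.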
We use TF-IDF, LDA and CA in our experiments: TF-IDF and LDA to represent content, and CA to reduce the dimensions.
5 Data Set and Tools Used
5.1 Data Used
Techniques based on the network analysis of authors and the content analysis of publications have been explored for the purposes of recommendation. Each of the following subsections describes the data collected and the techniques/tools applied to the data. For uniformity, we have used the publications in ACM conferences over the years 2008 to 2010. The selected conferences include
SIGBED - Special Interest Group on Embedded Systems
    CASES - Compilers, Architecture, and Synthesis for Embedded Systems
    CODES + ISSS - International Conference on Hardware/Software Codesign and Systems Synthesis
    EMSOFT - International Conference on Embedded Software
    SENSYS - Conference On Embedded Networked Sensor Systems
SIGDA - Special Interest Group on Design Automation
    DAC - Design Automation Conference
    DATE - Design, Automation, and Test in Europe
    ICCAD - International Conference on Computer Aided Design
    SBCCI - Annual Symposium On Integrated Circuits And System Design
SIGIR - Special Interest Group on Information Retrieval
    CIKM - International Conference on Information and Knowledge Management
    JCDL - ACM/IEEE Joint Conference on Digital Libraries
    SIGIR - Research and Development in Information Retrieval
    WWW - World Wide Web Conference Series
SIGPLAN - Special Interest Group on Programming Languages
    GPCE - Generative Programming and Component Engineering
    ICFP - International Conference on Functional Programming
    OOPSLA - Conference on Object-Oriented Programming Systems, Languages, and Applications
    PLDI - Programming Language Design and Implementation
Altogether there are 16 conferences, drawn from 4 special interest groups. SIGBED is the special interest group on embedded systems and accepts contributions related to embedded computer systems, including software and hardware. SIGDA is the special interest group on design automation; it accepts contributions on the design and automation of complex systems on chip. SIGIR accepts contributions related to any aspect of Information Retrieval (IR) theory and foundation, techniques and applications. SIGPLAN is the special interest group on programming languages and accepts contributions related to the design, implementation and principles of programming languages.
5.1.1 Co-Author Network
We have downloaded the DBLP database, which contains the conference proceedings. This database contains the XML records of all the publications. Each record contains its publication information such as: author names, publication venue, title, year, and the DOI (Digital Object Identifier) of the publication. We extracted these attributes and generated a co-author network. Each node in the co-author network represents an author and each edge represents the co-authorship between the author nodes.
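The construction of the co-author network can be sketched in plain Python. This sketch swaps the graph database for in-memory dictionaries purely for illustration; the function name `build_coauthor_network` and the author lists are hypothetical, and parsing of the DBLP XML records is assumed to have happened already.

```python
from collections import Counter
from itertools import combinations


def build_coauthor_network(publications):
    """Return (nodes, edges); edges maps an author pair to the number of joint papers."""
    nodes = set()
    edges = Counter()
    for authors in publications:
        unique = sorted(set(authors))        # one node per distinct author
        nodes.update(unique)
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1               # undirected edge, stored as a sorted pair
    return nodes, edges


# Hypothetical author lists, one per publication record.
publications = [
    ["A. Author", "B. Author"],
    ["A. Author", "B. Author", "C. Author"],
    ["D. Author"],
]
nodes, edges = build_coauthor_network(publications)
print(len(nodes), dict(edges))
```

Storing each pair in sorted order keeps the graph undirected, and the edge weight records how often two authors have published together.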
The ACM site provides abstracts for all the publications on its website. In order to perform content analysis, we crawled the ACM site and extracted the abstracts over the years 2008 to 2010 from the above mentioned conferences. We extracted a total of about 5447 abstracts published in these conferences and used them for content-based recommendations.
5.2 Tools Used
5.2.1 Neo4j Graph Database
For constructing the co-author network, the Neo4j graph database has been used. It is an open-source graph database. The Python bindings were used to interact with the database.
5.2.2 MySQL Database
We relied on MySQL to store information on publications, such as year, DOI and venue.
5.2.3 Programming Languages
Latent Dirichlet Allocation (LDA) was written in C++. All the other applied methods were written in Python and R.
6 Technical Approach
Different approaches can be taken to the problem of recommending conferences to authors. An outline of each idea is provided and its pitfalls, if any, are mentioned. Unlike most commercial recommender systems, such as those recommending books or movies, this system carries a higher personal stake for its users. If a paper is rejected by a conference suggested by our system, it is highly unlikely that the author will use the system again. This is not the case with book or movie recommenders. So, there is little room for error or low accuracy.
Some previous work on this has been done by H. Luong et al. [luong2012publication] who have recommended conferences to authors using the social network i.e. the co-author network with the same dataset. Exploring the possibility of using CA has not been attempted before.
We have implemented a total of six methods for this application and have done a comprehensive evaluation of the results. Three of the methods use Correspondence Analysis and three of them do not. The first method uses the author-conference relation without taking into account the content of the paper. The next two methods use the content along with an application of CA to arrive at the results. The abstracts of the papers are used for content analysis. This is reasonable because the abstract summarizes the essence of the paper.
The last three methods are respectively: Content-based filtering, Collaborative filtering and Hybrid filtering. Content-based filtering and Hybrid filtering use the content of the paper but none of these methods employ CA.
Content is obtained in two ways: term frequency-inverse document frequency (tf-idf) and topics. LDA has been used for the latter. For each content-method, number of topics used: , , , , and . However, only results for topics are displayed in the evaluation, due to there being a very vast multitude of results and it would be too cumbersome to list all of them. Number of words used in tf-idf:
. For computing the resultant conferences, three methods of similarity have been used: euclidean distance, cosine similarity and pearson correlation.
In all the methods, set of papers have been used for training and papers have been used for testing. There are a total of papers for the years , for and for . There are a total of conferences.
The various similarity metrics used in the experiments are given below:
The Euclidean distance between two data points $x$ and $y$ is
\[ d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} \]
where $n$ is the number of attributes and $x_k$ and $y_k$ are the $k$-th attributes of the data points $x$ and $y$, respectively.
In the cosine similarity measure, items are considered as $n$-dimensional document vectors and their similarity is measured as the cosine of the angle that they form between them. Thus, if the cosine measure is close to $1$, i.e. the angle between the two vectors is close to $0$, the items are considered to be very similar.
\[ \cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} \]
where $\cdot$ indicates the vector dot product and $\|x\|$ is the norm of vector $x$. Equivalently, it is the dot product of the two vectors after each has been normalized by its $L_2$ norm.
Correlation between items can also measure their similarity, a linear relationship in this case. Although several correlation coefficients can be used, the most commonly used one is the Pearson correlation. Given the covariance $\Sigma(x, y)$ of data points $x$ and $y$ and their standard deviations $\sigma_x$ and $\sigma_y$, we compute the Pearson correlation using:
\[ \text{Pearson}(x, y) = \frac{\Sigma(x, y)}{\sigma_x \, \sigma_y} \]
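The three measures can be sketched in plain Python. The function names are hypothetical, and the vectors are assumed to have equal length, non-zero norm (for cosine) and non-zero variance (for Pearson).

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared attribute differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson_correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centred_x = [a - mean_x for a in x]
    centred_y = [b - mean_y for b in y]
    # The 1/n factors of covariance and standard deviations cancel,
    # so Pearson is the cosine of the mean-centred vectors.
    return cosine_similarity(centred_x, centred_y)


print(euclidean_distance([0, 0], [3, 4]))
print(cosine_similarity([1, 0], [0, 1]))
print(pearson_correlation([1, 2, 3], [2, 4, 6]))
```

Note that the Euclidean measure is a distance, so smaller values mean more similar items, whereas for the other two measures larger values mean more similar items.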
6.1 Using Authors-Conferences Matrix
6.1.1 Data Construction
From the data collected in the DBLP database, we construct the author-conference matrix as shown in Figure 4, where each row represents a single author. Here represents the number of times author has published in conference . We construct two such matrices: one training, say and the other a test matrix, say . The training matrix is constructed from papers (a total of ) and the test matrix is constructed from the papers (a total of ). There are a total of conferences.
6.1.2 Applied Method
The algorithm followed is given in the following steps:
We compute the standardized residual matrix of the training matrix as described in section 4.1.3.1.
We then obtain the coordinate matrices (both standard and principal, for rows and columns) after decomposing the standardized residual matrix using SVD.
Using the test matrix as a supplementary row matrix, we compute its principal coordinates using the standard column coordinates of the training matrix.
The rows of the supplementary test matrix represent individual authors. So, to recommend a conference for a paper, which may be written by multiple authors, we take all the authors of that particular paper and compute their similarity (Euclidean distance/cosine/Pearson) with each of the conferences. For this purpose, we use the principal coordinates of the authors and the principal coordinates of the conferences.
We sort the conferences in decreasing order of the sum of their similarity to all the authors of the paper under consideration. Maximizing similarity means minimizing Euclidean distance, maximizing cosine similarity or maximizing Pearson correlation.
We then get a ranked list of recommendations for each paper.
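The ranking step of the procedure above can be sketched in plain Python. Everything named here is hypothetical: the function `rank_conferences`, the conference labels and the 2-D principal coordinates are made up for illustration. Negative Euclidean distance is used as the similarity so that a larger score always means a better match.

```python
import math


def negative_euclidean(x, y):
    """Similarity derived from Euclidean distance: closer points score higher."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def rank_conferences(author_coords, conference_coords, similarity):
    """Rank conferences by their summed similarity to all authors of a paper."""
    scores = {
        name: sum(similarity(author, coords) for author in author_coords)
        for name, coords in conference_coords.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical principal coordinates of a paper's two authors.
authors = [(1.0, 0.1), (0.8, -0.2)]
# Hypothetical principal coordinates of three conferences.
conferences = {"CONF_A": (1.0, 0.0), "CONF_B": (0.0, 1.0), "CONF_C": (-1.0, 0.0)}

ranking = rank_conferences(authors, conferences, negative_euclidean)
print(ranking)
```

Any of the three similarity measures can be passed in as `similarity`, provided it is oriented so that larger values indicate a closer match.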
This method has several drawbacks. For one, all new authors (new to these conferences) are recommended the same conference. Thus, this approach fails if the author has no publication history. Also, it does not capture the essence of the problem, because we are recommending without even looking at the content of the paper in question. Thus, we need to look at the content of the paper as well in order to make better and more appealing recommendations.
Here, we have considered each row to be a single author. A row can also be changed to comprise multiple authors, i.e. those who have co-authored a paper. In this case, there will be more entries in the matrix and it will also be more sparse. Even in this case, the same limitations as above apply and, in addition, the sparsity in some sense weakens the “meaningfulness” of the relation between the authors and conferences. Applying a dimensionality reduction technique like SVD or CA will bring the data to a lower-dimensional subspace, which will capture the essence of the relation better and yield a less sparse representation.
6.2 Composition of Papers-Words/Topics and Words/Topics-Conferences Matrices
6.2.1 Data Construction
A way to remedy the defect in the previous method is to look at the content of the papers, abstracts in particular, as they capture the essence of the paper. From the data collected, we can construct a paper $\times$ words/topics matrix and a words/topics $\times$ conferences matrix, as shown in Figure 5. We construct three matrices in total: two for training and one for testing. The two training matrices, paper $\times$ words/topics and words/topics $\times$ conferences, are constructed from the training papers. We also construct a test matrix, paper $\times$ words/topics, from the test papers, which contains all the papers that need a recommendation. We write “word/topic” because the content is represented in both ways.
Here, is the number of times author has used the word in all of his considered publications. is the number of times word has been used in the conference in total, i.e. considering all the papers that have been accepted in conference , all of them combined use the word , number of times. We generate the conference matrix by computing the centroid from those entries of the paper matrix which corresponds to this particular conference.
6.2.2 Applied Method
The algorithm followed is given in the following steps:
We multiply the two training matrices (paper $\times$ words/topics and words/topics $\times$ conferences) to obtain their product. The result is a paper $\times$ conference matrix.
We compute the standardized residual matrix of this product as described in section 4.1.3.
We then obtain the coordinate matrices (both standard and principal, for rows and columns) after decomposing the standardized residual matrix using SVD.
After this, we multiply the test matrix with the words/topics $\times$ conferences training matrix to obtain a test paper $\times$ conference matrix.
Using this matrix as a supplementary row matrix, we compute its principal coordinates using the standard column coordinates of the training product.
Then, for each test paper, we compute its similarity with each of the conferences and sort the result.
We then get a ranked list of recommendations for each paper.
In this method, we multiply the author-words and words-conference matrices and apply CA after that, to recommend a conference to an author. But, this may not capture the relations between the authors and conferences well. An alternative would be to reduce the author-words matrix and the words-conference matrix individually first. Then, defining a transformation from the first subspace to the other might help capture the relations better, which is the next method.
Instead of words, a paper can also be represented in terms of topics. This is more meaningful because if a paper is about information retrieval but does not contain much IR jargon, then the chances of an IR conference being recommended for this paper are low. Capturing the topics addresses that problem.
6.3 Using Linear Transformation between the reduced-dimensional subspaces
6.3.1 Data Construction
The dataset is constructed in the same way as in the previous method.
The main difference between this method and the previous one is that, since direct multiplication of the matrices may not capture the relations very well, we reduce each of the matrices (author-words and words-conferences) to a lower-dimensional subspace and then define a transformation from one to the other, i.e. from the reduced author-words space to the reduced words-conferences space. This is illustrated in Figure 6. We expect this to give a better view of the relations between authors and conferences and hence lead to a better recommender system.
6.3.2 Applied Method
The algorithm followed is given in the following steps:
We compute the principal coordinates for each of the two training matrices separately.
We then define a linear transformation from the reduced paper-space to the reduced conference-space as follows:
For each training paper, the transformation matrix should map it to the exact conference that it was published in.
So, we collect the principal coordinate vectors of all the training papers into a paper coordinate matrix, and then construct a matching conference coordinate matrix as follows: its i-th row is the principal coordinate vector of the conference in which the i-th training paper was published.
Both matrices therefore have one row per training paper; the number of columns of each equals the dimension of its reduced subspace (paper-space and conference-space, respectively).
The linear transformation matrix T is then calculated as the product of the pseudo-inverse of the paper coordinate matrix with the conference coordinate matrix. T has as many rows as the paper-space has dimensions and as many columns as the conference-space has dimensions.
Now, using the paper test matrix as a supplementary row matrix, we compute its principal coordinates using the standard column coordinates of the paper training decomposition. The resulting matrix has one row per test paper and one column per paper-space dimension.
We now need to transform this set of coordinates to the conference space, where we can find the similarity easily. This is achieved by multiplying the principal coordinate matrix of the supplementary rows (test papers) with the transformation matrix T. The resultant matrix has one row per test paper and one column per conference-space dimension.
Then, for each paper in the transformed space, we compute the similarity with each of the conferences and sort the result.
We then get a ranked list of recommendations for each paper.
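The fitting and mapping steps above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the names `paper_train`, `conf_train`, `paper_test` and `fit_transformation` are hypothetical, the numbers are invented, and the pseudo-inverse is computed through the normal equations, which equals the pseudo-inverse only when the paper coordinate matrix has full column rank.

```python
# Toy sketch of a least-squares map between two reduced spaces.
# All names and numbers are hypothetical; pure Python, no external libraries.

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a @ x = b for a square invertible a (Gauss-Jordan, partial pivoting)."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_transformation(paper_coords, conf_coords):
    """Least-squares T such that paper_coords @ T approximates conf_coords.

    Assumes paper_coords has full column rank, so that
    pinv(P) @ C == inv(P^T P) @ P^T @ C.
    """
    pt = transpose(paper_coords)
    return solve(matmul(pt, paper_coords), matmul(pt, conf_coords))

# One row per training paper: coordinates in a 2-dimensional paper-space.
paper_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
# Row i: coordinates (3-dimensional conference-space) of the conference
# in which training paper i was published.
conf_train = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [1.0, 3.0, 3.0], [2.0, 5.0, 3.0]]

T = fit_transformation(paper_train, conf_train)  # 2 x 3 matrix

# Map a test paper (already projected as a supplementary row) to conference-space.
paper_test = [[1.0, 2.0]]
mapped = matmul(paper_test, T)
```

The mapped rows live in the conference-space, so they can be compared directly with the conference coordinates using any of the similarity measures.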
This approach is interesting because, to the best of our knowledge, defining a transformation between reduced subspaces has never been used in the recommender systems literature. In this case, to recommend, we take a paper from its reduced space to the conference space through a linear transformation.
6.4 Content-based Filtering
6.4.1 Data Construction
The dataset is constructed in the same way as in the previous method. After construction, we require only two matrices, namely the paper-words test matrix and the words-conferences training matrix.
6.4.2 Applied Method
This is a very simple method and the algorithm followed is given in the following steps:
For each paper in the test matrix, we compute the similarity with each conference vector in the training matrix. As before, this is done in three ways: Euclidean distance, cosine similarity and Pearson correlation.
So, for each paper we get a list of similar conferences. We sort them by their similarities and return the result.
We then get a ranked list of recommendations for each paper.
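The steps above can be sketched as follows. This is an illustrative toy, not the code used for the experiments: the conference vectors, the four-word vocabulary and the function names are hypothetical, and turning the Euclidean distance into a similarity via 1 / (1 + d) is an assumption made here so that all three measures can be sorted in the same direction.

```python
import math

def euclidean_similarity(a, b):
    # Assumption: convert distance to a similarity so that larger is better.
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pearson_similarity(a, b):
    # Pearson correlation is the cosine of the mean-centred vectors.
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])

def rank_conferences(paper, conference_vectors, similarity):
    """Return conference names sorted from most to least similar."""
    scores = {name: similarity(paper, vec) for name, vec in conference_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical word-weight vectors over a 4-word vocabulary.
conference_vectors = {
    "SIGIR": [3.0, 1.0, 0.0, 0.0],
    "SIGMOD": [0.0, 1.0, 3.0, 0.0],
    "CHI": [0.0, 0.0, 1.0, 3.0],
}
test_paper = [2.0, 1.0, 0.0, 0.0]

rankings = {
    "euclidean": rank_conferences(test_paper, conference_vectors, euclidean_similarity),
    "cosine": rank_conferences(test_paper, conference_vectors, cosine_similarity),
    "pearson": rank_conferences(test_paper, conference_vectors, pearson_similarity),
}
```

Because no model is fitted, the only cost at recommendation time is one similarity computation per conference.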
This algorithm is a memory-based technique, in contrast to a model-based technique which involves fitting a statistical model over the data and inferring suitable parameters. Although the method is very simple, it gives a very high accuracy compared to the other methods.
6.5 Collaborative Filtering
6.5.1 Data Construction
For this method, we construct a paper-conference matrix, similar to the one in the first method. The difference is that each row of this matrix represents a paper (which may have multiple authors). We construct two such matrices: one for training and one for testing. For this method, we consider the papers to be the users and the conferences to be the items. So, our objective is to recommend items to the users. The basic idea is to find users similar to the one for whom a recommendation is required and then recommend based on what the similar users like.
6.5.2 Applied Method
The algorithm followed is given in the following steps [segaran2007programming]:
For each paper in the test matrix, we compute the similar papers from the training matrix. Each paper is a vector with one entry per conference (representing the frequencies of the conferences), and similarity is calculated using the previous metrics.
We now have to compute the scores for each of the conferences (in other words, items). We do this as follows:
We consider each column of the training matrix. We can think of this as representing the ratings given by different users to this particular item.
We then take the ranked list of similar users (papers) to the paper in consideration, and multiply the similarity score of each paper with the corresponding rating of the item.
In more technical terms, we compute the inner product of the similarity vector with the conference column vectors (each conference in turn).
For a given conference, this inner product is the sum of the similarity-weighted ratings; we then normalize it by the total similarity score (the sum of the similarity vector).
Normalization is needed because users who have not rated the item in question contribute nothing to the inner product. Without it, an item rated by all the users could end up with the maximum score simply because more terms are summed. So, we normalize to make the scores comparable.
We now sort the normalized scores for each conference to obtain a ranked list of recommendations.
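The scoring steps above can be sketched as follows. This is a minimal illustration under assumptions: the function name `score_conferences`, the conference labels and the numbers are hypothetical, and the similarity vector is taken as given (it would come from one of the three similarity measures).

```python
def score_conferences(similarities, train_matrix):
    """Similarity-weighted, normalized score for every conference.

    similarities[i]   : similarity of the query paper to training paper i
    train_matrix[i][j]: entry of training paper i for conference j
    """
    total = sum(similarities)  # normalizer: sum of the similarity vector
    n_conferences = len(train_matrix[0])
    scores = []
    for j in range(n_conferences):
        # Inner product of the similarity vector with conference column j.
        weighted = sum(s * row[j] for s, row in zip(similarities, train_matrix))
        scores.append(weighted / total if total else 0.0)
    return scores

# Hypothetical data: 4 training papers (rows), 3 conferences (columns).
conferences = ["SIGIR", "SIGMOD", "CHI"]
train_matrix = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
similarities = [0.9, 0.6, 0.3, 0.2]

scores = score_conferences(similarities, train_matrix)
ranking = [name for _, name in sorted(zip(scores, conferences), reverse=True)]
```

In this toy case the two most similar training papers were both published in the first conference, so it receives the largest normalized score.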
This algorithm is also a memory-based technique, in contrast to a model-based technique. This method is commonly used in e-commerce websites for rating-based recommendations.
6.6 Hybrid Filtering
6.6.1 Data Construction
The data construction for this method is very similar to that in collaborative filtering, with a few additions. We construct the paper-conference matrix, as before. In addition, we also construct a paper-words training matrix and a paper-words test matrix. The method itself is very similar to the collaborative filtering algorithm; the only difference is that it also brings in content analysis, through the paper-words matrices.
6.6.2 Applied Method
The algorithm followed is given in the following steps:
For each paper in the test matrix, we compute the similar papers from the training matrix. Unlike last time, each paper is now a vector of word weights (in the case of tf-idf) or of topic weights (in the case of topics). Thus, the content of the paper is being brought in, as opposed to a plain conference-frequency vector. Similarity of each test paper with each of the training papers is calculated using the previous metrics.
The rest of the recommendation process is identical to that in collaborative filtering. This way, we have combined content-based filtering and collaborative filtering.
This algorithm is also a memory-based technique, in contrast to a model-based technique.
7 Evaluation and Results
In this section, we detail the evaluation procedures and discuss the results obtained. We have used a total of seven metrics to evaluate the performance of the algorithms described above. They were applied to the ranked list of recommendations generated by the above methods:
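Mean Precision at $k$ (MP): The mean Precision at $k$ for a set of queries is defined as the mean of the Precision at $k$ values for each of those queries. Precision at $k$, $P(k)$, is defined as:
$$P(k) = \frac{\text{number of relevant results among the top } k}{k}$$
Mean Recall at $k$ (MR): The mean Recall at $k$ for a set of queries is defined as the mean of the Recall at $k$ values for each of those queries. Recall at $k$, $R(k)$, is defined as:
$$R(k) = \frac{\text{number of relevant results among the top } k}{\text{total number of relevant results}}$$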
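Mean Average Precision at $k$ (MAP): Mean average precision at $k$ for a set of queries is the mean of the average precision at $k$ values for each of those queries:
$$\mathrm{MAP} = \frac{1}{|Q|}\sum_{q=1}^{|Q|} \mathrm{AP}(q)$$
where $|Q|$ is the number of queries and $\mathrm{AP}(q)$ is the average precision for the $q$-th query. Average precision is defined as:
$$\mathrm{AP} = \frac{\sum_{i=1}^{k} P(i)\,\mathrm{rel}(i)}{\text{number of relevant documents}}$$
where $\mathrm{rel}(i)$ is an indicator function equaling 1 if the item at rank $i$ is a relevant document, zero otherwise, and $P(i)$ is the precision at $i$.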
Mean Normalized Discounted Cumulative Gain at $k$ (MNDCG): Discounted Cumulative Gain (DCG) at $k$ sums the relevance scores of the top $k$ results, each discounted logarithmically by its rank. DCG uses a graded scale of relevance, which allows us to express preferences among the predicted results. Consider an ideal ordering of the predicted results, which yields the maximum DCG at $k$; we call this the ideal DCG (IDCG). The normalized DCG (NDCG) is the ratio of the obtained DCG to the ideal DCG:
$$\mathrm{NDCG}_k = \frac{\mathrm{DCG}_k}{\mathrm{IDCG}_k}$$
This always yields a value between 0 and 1. The mean normalized DCG for a set of queries is then the mean of the NDCG values for each of those queries.
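Mean Reciprocal Rank (MRR): The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries $Q$:
$$\mathrm{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\mathrm{rank}_i}$$
where $\mathrm{rank}_i$ is the rank of the first correct answer for the $i$-th query.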
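Mean F-Measure at $k$ (MF-M): The mean F-measure at $k$ for a set of queries is the mean of the F-measures at $k$ for each of those queries. The F-measure at $k$ is the harmonic mean of the precision at $k$, $P(k)$, and the recall at $k$, $R(k)$:
$$F(k) = \frac{2\,P(k)\,R(k)}{P(k) + R(k)}$$
This is the balanced F-score, where the weights of precision and recall in the harmonic mean are equal. Uneven weights are also possible.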
Mean R-Precision (MR-P): The mean R-Precision for a set of queries is the mean of the R-Precision values for each of those queries. R-Precision is defined as the Precision at $R$, where $R$ is the number of relevant documents. At this position, the precision and recall values become equal.
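The per-query measures listed above can be sketched in a few lines. This is an illustrative sketch rather than the evaluation code used for the experiments: the function names, the toy ranking and the cutoff `k = 5` are assumptions, and the DCG shown uses the common discount of one over the base-2 logarithm of (rank + 1) with linear gain, which is one of several variants in use. The mean versions are obtained by averaging each function over all queries.

```python
import math

def precision_at_k(ranked, relevant, k):
    return sum(1 for item in ranked[:k] if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def f_measure_at_k(ranked, relevant, k):
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    return 2 * p * r / (p + r) if p + r else 0.0

def average_precision_at_k(ranked, relevant, k):
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)

def reciprocal_rank(ranked, relevant):
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def r_precision(ranked, relevant):
    return precision_at_k(ranked, relevant, len(relevant))

def dcg(gains_in_rank_order):
    # Assumed variant: gain / log2(rank + 1).
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains_in_rank_order, start=1))

def ndcg_at_k(ranked, gains, k):
    """gains maps an item to its graded relevance; missing items score 0."""
    obtained = dcg([gains.get(item, 0.0) for item in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return obtained / ideal if ideal else 0.0

# Toy query: the single relevant conference "A" is returned at rank 2.
ranked = ["B", "A", "C", "D", "E"]
relevant = {"A"}
k = 5

p = precision_at_k(ranked, relevant, k)
r = recall_at_k(ranked, relevant, k)
f = f_measure_at_k(ranked, relevant, k)
ap = average_precision_at_k(ranked, relevant, k)
rr = reciprocal_rank(ranked, relevant)
rp = r_precision(ranked, relevant)
ndcg = ndcg_at_k(ranked, {"A": 1.0}, k)
```

With a single relevant item, precision at the cutoff is simply recall divided by the cutoff, which is why the two can differ widely for the same ranking.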
For the experiments, the cutoff rank of the measures above is fixed, so these measures are evaluated considering only the top-ranked portion of the returned results. For the purpose of calculating the metrics, we have defined relevant conferences in two cases:
A predicted conference is relevant if it is the same as the actual conference the paper was originally published in (we have that information from the data set). For computing DCG in this case, the relevant conference (which is the original conference) is given a higher relevance score than all the rest.
A predicted conference is relevant if it belongs to the Special Interest Group (SIG) of the actual conference the paper was originally published in. For computing DCG in this scenario, the original conference is given the highest score, the other conferences in the SIG are given a lower score as they are considered to be partially relevant, and the rest of the conferences get the lowest score.
For calculating similarity to determine the ranking of the retrieved results, we have used three different metrics as previously mentioned: Euclidean distance, cosine similarity and Pearson correlation.
Earlier it was explained that the dimension of the lower-dimensional subspace of a matrix is bounded by the minimum of its row and column dimensions. Since we have far fewer conferences than papers, this minimum is always determined by the number of conferences. Although the experiments were evaluated for more than one subspace dimension, due to lack of space and the vast multitude of results, we only show the results for one subspace dimension. In the case of the third method (Linear Transformation), we reduce two matrices independently using CA and hence each can be reduced to a subspace of a different dimension. So, for that method, we show the results for two settings of the pair of dimensions: the dimension of the subspace that the Conference x Words/Topics matrix is reduced to and the dimension of the subspace that the Paper x Words/Topics matrix is reduced to.
The experimental parameters used for LDA are given in Table 2. For tf-idf, words were used.
| Parameter | Value |
|---|---|
| Number of Iterations | 1000 |
| Number of Topics | 400 |
| Number of Training Papers | 3572 |
| Number of Test Papers | 1875 |
Table 2: Experimental parameters used for LDA
For displaying the results of the experiments, the following conventions are used:
MAP: Mean Average Precision at $k$
MNDCG: Mean Normalized Discounted Cumulative Gain at $k$
MRR: Mean Reciprocal Rank
MR-P: Mean R-Precision
MF-M: Mean F-Measure at $k$
MP: Mean Precision at $k$
MR: Mean Recall at $k$
Using the above conventions, the evaluations of the experiments are given below:
7.1 Method 1: Using Author-Conference Matrix
Here, we present the results for the first method. In this case, evaluation has been conducted with two matrices. The first matrix is the one constructed from the test dataset. The second matrix is a null matrix (all entries are 0). The second matrix is required for testing because many authors are common to the training and test sets, and it is highly likely that an author who has published in a certain conference would prefer to publish in it again. Hence, the first matrix gives very high accuracy. The only way to really put the method to the test is to consider a new paper, which has not been published in any of the conferences mentioned, and then recommend. This is why we considered a null matrix. The results are given below:
Case 1: Using the test matrix. The results are displayed in Table 3.
| Metrics | Euclid (Actual) | Euclid (SIG) | Cosine (Actual) | Cosine (SIG) | Pearson (Actual) | Pearson (SIG) |
|---|---|---|---|---|---|---|
| MAP | 0.9483 | 0.6308 | 0.9483 | 0.6308 | 0.9483 | 0.6308 |
| MNDCG | 0.9613 | 0.8339 | 0.9613 | 0.8339 | 0.9613 | 0.8339 |
| MRR | 0.9484 | 0.9961 | 0.9484 | 0.9961 | 0.9484 | 0.9961 |
| MR-P | 0.9050 | 0.6517 | 0.9050 | 0.6517 | 0.9050 | 0.6517 |
| MF-M | 0.3328 | 0.5805 | 0.3328 | 0.5805 | 0.3328 | 0.5805 |
| MP | 0.1997 | 0.5225 | 0.1997 | 0.5225 | 0.1997 | 0.5225 |
| MR | 0.9985 | 0.6531 | 0.9985 | 0.6531 | 0.9985 | 0.6531 |
Table 3: Results for Method 1: Considering the test matrix to be built from 2010 papers
Case 2: Using the null test matrix. The results are displayed in Table 4.
| Metrics | Euclid (Actual) | Euclid (SIG) | Cosine (Actual) | Cosine (SIG) | Pearson (Actual) | Pearson (SIG) |
|---|---|---|---|---|---|---|
| MAP | 0.2072 | 0.2253 | 0.2072 | 0.2253 | 0.2072 | 0.2253 |
| MNDCG | 0.3042 | 0.3292 | 0.3042 | 0.3292 | 0.3042 | 0.3292 |
| MRR | 0.2548 | 0.4232 | 0.2548 | 0.4232 | 0.2548 | 0.4232 |
| MR-P | 0.0196 | 0.3311 | 0.0196 | 0.3311 | 0.0196 | 0.3311 |
| MF-M | 0.2013 | 0.3940 | 0.2013 | 0.3940 | 0.2013 | 0.3940 |
| MP | 0.1208 | 0.3546 | 0.1208 | 0.3546 | 0.1208 | 0.3546 |
| MR | 0.6040 | 0.4433 | 0.6040 | 0.4433 | 0.6040 | 0.4433 |
Table 4: Results for Method 1: Considering the test matrix to be a zero (null) matrix
As can be seen from the above results, when the input is a null matrix, the method performs poorly: for the actual conference, MAP drops from 0.9483 to 0.2072.
7.2 Method 2: Composition of Paper-Words/Topics and Words/Topics-Conference Matrices
Here we present the results for the second method, which composes two matrices and reduces the dimension. We have two cases: one using tf-idf matrices and one using topic matrices. The results for both are given below:
Case 1: Using tf-idf representation ( words). . The results are displayed in Table 5.
| Metrics | Euclid (Actual) | Euclid (SIG) | Cosine (Actual) | Cosine (SIG) | Pearson (Actual) | Pearson (SIG) |
|---|---|---|---|---|---|---|
| MAP | 0.5800 | 0.7124 | 0.5937 | 0.7778 | 0.5820 | 0.7616 |
| MNDCG | 0.6573 | 0.7213 | 0.6755 | 0.7571 | 0.6648 | 0.7452 |
| MRR | 0.5943 | 0.8475 | 0.6041 | 0.8545 | 0.5933 | 0.8507 |
| MR-P | 0.3781 | 0.7205 | 0.3829 | 0.7888 | 0.3696 | 0.7686 |
| MF-M | 0.2956 | 0.6910 | 0.3061 | 0.7477 | 0.3036 | 0.7356 |
| MP | 0.1773 | 0.6219 | 0.1836 | 0.6729 | 0.1821 | 0.6620 |
| MR | 0.8869 | 0.7774 | 0.9184 | 0.8412 | 0.9109 | 0.8276 |
Table 5: Results for Method 2: Using tf-idf matrices
Case 2: Using topic representation ( topics). . The results are displayed in Table 6.
| Metrics | Euclid (Actual) | Euclid (SIG) | Cosine (Actual) | Cosine (SIG) | Pearson (Actual) | Pearson (SIG) |
|---|---|---|---|---|---|---|
| MAP | 0.3433 | 0.4801 | 0.3880 | 0.5820 | 0.3818 | 0.5715 |
| MNDCG | 0.4112 | 0.4861 | 0.4600 | 0.5476 | 0.4531 | 0.5383 |
| MRR | 0.3818 | 0.6330 | 0.4191 | 0.6614 | 0.4136 | 0.6584 |
| MR-P | 0.1957 | 0.5068 | 0.2261 | 0.5898 | 0.2218 | 0.5824 |
| MF-M | 0.2058 | 0.5025 | 0.2259 | 0.5662 | 0.2229 | 0.5534 |
| MP | 0.1235 | 0.4522 | 0.1355 | 0.5096 | 0.1337 | 0.4981 |
| MR | 0.6176 | 0.5653 | 0.6778 | 0.6370 | 0.6688 | 0.6226 |
Table 6: Results for Method 2: Using topic matrices
Here, it is observed that using tf-idf representation for content outperforms its topic counterpart.
7.3 Method 3: Using Linear Transformation
Here, we present the results for the third method, which employs a linear transformation between reduced subspaces. We have a total of four cases: the tf-idf and topic matrices, each with two different settings of the two subspace dimensions. The results for all the cases are given below:
Case 1: Using tf-idf representation ( words), , . The results are displayed in Table 7.
| Metrics | Euclid (Actual) | Euclid (SIG) | Cosine (Actual) | Cosine (SIG) | Pearson (Actual) | Pearson (SIG) |
|---|---|---|---|---|---|---|
| MAP | 0.3054 | 0.2566 | 0.4981 | 0.7506 | 0.4933 | 0.6998 |
| MNDCG | 0.3803 | 0.3920 | 0.5824 | 0.7003 | 0.5687 | 0.6748 |
| MRR | 0.3520 | 0.5595 | 0.5179 | 0.8318 | 0.5190 | 0.8339 |
| MR-P | 0.1882 | 0.3428 | 0.2986 | 0.7178 | 0.3034 | 0.7149 |
| MF-M | 0.2049 | 0.3984 | 0.2785 | 0.7259 | 0.2643 | 0.6744 |
| MP | 0.1229 | 0.3586 | 0.1671 | 0.6533 | 0.1586 | 0.6070 |
| MR | 0.6149 | 0.4482 | 0.8357 | 0.8166 | 0.7930 | 0.7588 |
Table 7: Results for Method 3: Using tf-idf matrices
Case 2: Using tf-idf representation ( words), , . The results are displayed in Table 8.
| Metrics | Euclid (Actual) | Euclid (SIG) | Cosine (Actual) | Cosine (SIG) | Pearson (Actual) | Pearson (SIG) |
|---|---|---|---|---|---|---|
| MAP | 0.5002 | 0.5347 | 0.5598 | 0.7857 | 0.5520 | 0.7766 |
| MNDCG | 0.5714 | 0.6117 | 0.6465 | 0.7504 | 0.6419 | 0.7457 |
| MRR | 0.5225 | 0.8354 | 0.5715 | 0.8917 | 0.5627 | 0.8876 |
| MR-P | 0.3354 | 0.5834 | 0.3397 | 0.7669 | 0.5627 | 0.8876 |
| MF-M | 0.2618 | 0.5442 | 0.3013 | 0.7454 | 0.3032 | 0.7461 |
| MP | 0.1571 | 0.4898 | 0.1808 | 0.6709 | 0.1819 | 0.6715 |
| MR | 0.7856 | 0.6122 | 0.9040 | 0.8386 | 0.9098 | 0.8394 |
Table 8: Results for Method 3: Using tf-idf matrices
Case 3: Using topic representation ( topics), , . The results are displayed in Table 9.
| Metrics | Euclid (Actual) | Euclid (SIG) | Cosine (Actual) | Cosine (SIG) | Pearson (Actual) | Pearson (SIG) |
|---|---|---|---|---|---|---|
| MAP | 0.4087 | 0.5203 | 0.4265 | 0.6133 | 0.4262 | 0.6063 |
| MNDCG | 0.4708 | 0.5347 | 0.4932 | 0.5779 | 0.4934 | 0.5771 |
| MRR | 0.4416 | 0.6827 | 0.4548 | 0.6914 | 0.4543 | 0.6945 |
| MR-P | 0.2522 | 0.5326 | 0.2565 | 0.6178 | 0.2581 | 0.6097 |
| MF-M | 0.2188 | 0.5245 | 0.2305 | 0.5870 | 0.2314 | 0.5870 |
| MP | 0.1313 | 0.4721 | 0.1383 | 0.5283 | 0.1388 | 0.5283 |
| MR | 0.6565 | 0.5901 | 0.6917 | 0.6604 | 0.6944 | 0.6604 |
Table 9: Results for Method 3: Using topic matrices
Case 4: Using topic representation ( topics), , . The results are displayed in Table 10.
| Metrics | Euclid (Actual) | Euclid (SIG) | Cosine (Actual) | Cosine (SIG) | Pearson (Actual) | Pearson (SIG) |
|---|---|---|---|---|---|---|
| MAP | 0.4525 | 0.5531 | 0.4614 | 0.6325 | 0.4566 | 0.6219 |
| MNDCG | 0.5152 | 0.5724 | 0.5259 | 0.6038 | 0.5210 | 0.5988 |
| MRR | 0.4816 | 0.7179 | 0.4871 | 0.7156 | 0.4836 | 0.7192 |
| MR-P | 0.2965 | 0.5596 | 0.2970 | 0.6356 | 0.2938 | 0.6201 |
| MF-M | 0.2343 | 0.5511 | 0.2394 | 0.6021 | 0.2376 | 0.5985 |
| MP | 0.1405 | 0.4960 | 0.1436 | 0.5419 | 0.1426 | 0.5386 |
| MR | 0.7029 | 0.6200 | 0.7184 | 0.6774 | 0.7130 | 0.6733 |
Table 10: Results for Method 3: Using topic matrices
As can be observed, just like the second method, the tf-idf representation overall outperforms the topic representation. In some cases, however, the topic representation outperforms its counterpart, for example with the Euclidean distance metric in Tables 7 and 9.
7.4 Method 4: Content-based Filtering
Here, we present the results for content-based filtering. This method does not use any dimensionality reduction technique and is memory-based, in contrast to model-based methods, which fit a statistical model to the data and use the inferred parameters to determine the results. There are two cases here too: using tf-idf and topic representations.
Case 1: Using tf-idf representation ( words). The results are displayed in Table 11.
| Metrics | Euclid (Actual) | Euclid (SIG) | Cosine (Actual) | Cosine (SIG) | Pearson (Actual) | Pearson (SIG) |
|---|---|---|---|---|---|---|
| MAP | 0.6477 | 0.7530 | 0.6637 | 0.7758 | 0.6622 | 0.7762 |
| MNDCG | 0.7200 | 0.7718 | 0.7380 | 0.7920 | 0.7367 | 0.7916 |
| MRR | 0.6557 | 0.9034 | 0.6695 | 0.9166 | 0.6683 | 0.9156 |
| MR-P | 0.4544 | 0.7516 | 0.4613 | 0.7790 | 0.4597 | 0.7797 |
| MF-M | 0.3112 | 0.7191 | 0.3191 | 0.7413 | 0.3187 | 0.7421 |
| MP | 0.1867 | 0.6472 | 0.1914 | 0.6672 | 0.1912 | 0.6679 |
| MR | 0.9338 | 0.8090 | 0.9573 | 0.8340 | 0.9562 | 0.8349 |
Table 11: Results for Method 4: Using tf-idf matrices
Case 2: Using topic representation ( topics). The results are displayed in Table 12.
| Metrics | Euclid (Actual) | Euclid (SIG) | Cosine (Actual) | Cosine (SIG) | Pearson (Actual) | Pearson (SIG) |
|---|---|---|---|---|---|---|
| MAP | 0.4371 | 0.5913 | 0.4385 | 0.5947 | 0.4215 | 0.5955 |
| MNDCG | 0.5138 | 0.5879 | 0.5157 | 0.5900 | 0.4975 | 0.5779 |
| MRR | 0.4650 | 0.7222 | 0.4657 | 0.7206 | 0.4494 | 0.6973 |
| MR-P | 0.2586 | 0.6042 | 0.2592 | 0.6073 | 0.2464 | 0.6077 |
| MF-M | 0.2481 | 0.5880 | 0.2494 | 0.5916 | 0.2421 | 0.5876 |
| MP | 0.1489 | 0.5292 | 0.1496 | 0.5324 | 0.1452 | 0.5288 |
| MR | 0.7445 | 0.6616 | 0.7482 | 0.6656 | 0.7264 | 0.6610 |
Table 12: Results for Method 4: Using topic matrices
As we can see from the above results, tf-idf representation again outperforms the topic representation.
7.5 Method 5: Collaborative Filtering
Here, we present the results for collaborative filtering. This method does not use any dimensionality reduction techniques and is memory-based, just like content-based filtering. This method does not use content, rather just uses similarity measures to determine the recommendations. The results are displayed in Table 13.
7.6 Method 6: Hybrid Filtering
Here, we present the results for hybrid filtering, which is a hybrid of content-based filtering and collaborative filtering. This method does not use any dimensionality reduction techniques and is also memory-based. This method combines the good qualities of both content-based filtering and collaborative filtering. The results are displayed in Tables 14 and 15.
Case 1: Using tf-idf representation ( words). The results are displayed in Table 14.
| Metrics | Euclid (Actual) | Euclid (SIG) | Cosine (Actual) | Cosine (SIG) | Pearson (Actual) | Pearson (SIG) |
|---|---|---|---|---|---|---|
| MAP | 0.0566 | 0.1083 | 0.1036 | 0.1190 | 0.1037 | 0.1191 |
| MNDCG | 0.0734 | 0.1380 | 0.1497 | 0.1860 | 0.1499 | 0.1872 |
| MRR | 0.1449 | 0.3930 | 0.1867 | 0.3223 | 0.1865 | 0.3231 |
| MR-P | 0.0192 | 0.1773 | 0.0213 | 0.1817 | 0.0213 | 0.1813 |
| MF-M | 0.0414 | 0.1765 | 0.0972 | 0.2388 | 0.0974 | 0.2417 |
| MP | 0.0248 | 0.1589 | 0.0583 | 0.2149 | 0.0584 | 0.2176 |
| MR | 0.1242 | 0.1986 | 0.2917 | 0.2686 | 0.2922 | 0.2720 |
Table 14: Results for Method 6: Using tf-idf matrices
Case 2: Using topic representation ( topics). The results are displayed in Table 15.
| Metrics | Euclid (Actual) | Euclid (SIG) | Cosine (Actual) | Cosine (SIG) | Pearson (Actual) | Pearson (SIG) |
|---|---|---|---|---|---|---|
| MAP | 0.0519 | 0.0984 | 0.1241 | 0.1577 | 0.1006 | 0.1486 |
| MNDCG | 0.0720 | 0.1270 | 0.1892 | 0.2388 | 0.1477 | 0.2032 |
| MRR | 0.1334 | 0.3109 | 0.1945 | 0.3237 | 0.1813 | 0.3097 |
| MR-P | 0.0192 | 0.1773 | 0.0218 | 0.2688 | 0.0245 | 0.2362 |
| MF-M | 0.0449 | 0.1765 | 0.1308 | 0.3229 | 0.0983 | 0.2798 |
| MP | 0.0269 | 0.1589 | 0.0785 | 0.2906 | 0.0589 | 0.2518 |
| MR | 0.1349 | 0.1986 | 0.3925 | 0.3633 | 0.2949 | 0.3148 |
Table 15: Results for Method 6: Using topic matrices
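Here, unlike in the previous methods, the tf-idf representation does not consistently outperform its topic counterpart: with cosine similarity, the topic representation scores higher on every metric in Tables 14 and 15.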
8 Conclusions and Future Work
Although each of the above methods has its own merits, from the results obtained we observe the following:
The content-based methods proposed easily beat popular methods like collaborative filtering. This shows that for this system, considering content is vital. Computing similarities with content in hybrid filtering also did not prove to be very helpful, as the remainder of the procedure is identical to collaborative filtering.
The first method, involving just the conference-frequencies and not the content, is seen to perform poorly when it comes to recommending for new authors. The very high accuracy when considering the test matrix can be attributed to the fact that authors tend to publish in conferences where they have published before. Our proposed content-based methods involving CA work equally well with old/new authors because only the content of the paper is taken into consideration and not their prior publication counts in the conferences.
Content-based filtering is seen to outperform the CA-based methods. This may be attributed to the fact that there is a certain amount of information loss during the dimensionality reduction phase, while content-based filtering utilizes the “pure” raw content.
In the results obtained, using tf-idf for content generally proved to be better than using topics. This may be due to considering a much larger number of words in the tf-idf representation than topics in its topic counterpart. Also, the method of generating the topic matrices may have influenced the results.
Lastly, we observe that cosine similarity proves to be the best measure to calculate the similarities.
8.2 Future Work
We can improve the accuracy of the content-analysis techniques by considering more attributes of the papers, such as keywords, which may help the recommendations. Improvements might also be seen if we incorporate the network information of the authors along with the content of the paper into the recommender system. The citation information of the papers can also serve as a good feature.