1. Introduction
A citation network is a weighted or unweighted directed graph in which the directed edges represent citations and the edge weights (if any) represent the number of citations between two nodes (e.g., papers, journals, authors). The impact of the work done by an entity/node is commonly measured using the impact factor. The impact factor of a journal for a particular year is the total number of citations received in that year by the articles published in the journal during the preceding two years, divided by the total number of articles published in that journal during the preceding two years (impactfactor). It is widely known, and we also explain below, that anomalous citations can be used to manipulate impact factors purposefully. Since the Journal Impact Factor (JIF) is used to determine the importance of a journal’s contributions in its own domain, it is essential to spot anomalies in citation networks in order to understand whether citations are being misused to manipulate the JIF and what the reasons behind such misuse are.
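To make the definition concrete, the following minimal sketch computes the two-year impact factor and shows how extra citations shift it. The function name and the numbers are hypothetical and chosen only for illustration.

```python
def impact_factor(citations_to_window: int, articles_in_window: int) -> float:
    """Two-year impact factor: citations received this year by articles from the
    preceding two years, divided by the number of articles published in those years."""
    if articles_in_window <= 0:
        raise ValueError("the journal published no articles in the two-year window")
    return citations_to_window / articles_in_window


# Hypothetical journal: 200 articles in the window, 500 citations to them.
baseline = impact_factor(500, 200)
# The same journal after 100 additional citations to the same articles.
inflated = impact_factor(500 + 100, 200)
```

In this toy case, 100 additional citations raise the value from 2.5 to 3.0, which is why a concentrated burst of citations from a single source matters.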
Anomalous citations can be divided into two major categories:

Self-citations: These are citations made by entities, i.e., journals or authors, to themselves, which can, in certain cases, lead to a disproportionate increase in the JIF of the entity (chakraborty2018good).

Citation Stack: This refers to citations made by an entity, i.e., a journal or an author, to another entity, which can, in certain cases, be anomalously high and deviate from the usual behaviour of the entity (chakraborty2018good).
fowler2007does showed that the more one cites oneself (self-citation), the more others cite them. ioannidis2015generalized classified self-citation into different types: direct, co-author, collaborative, and coercive-induced self-citation. They further raised concerns over how inappropriate self-citations can affect the impact factor, and suggested the use of citation indices that are more robust to collusion through inappropriate self-citations. Bartneck2010DetectingHM discussed how authors can inflate the h-index by manipulating self-citations. SelfCitation, in their study analyzing the effect of excessive self-citation, argued for a new, more transparent metric to help curb excessive self-citations, which may negatively affect our ability to analyze the true impact and contributions of scientific research in its domain. However, limited work has been done on analyzing citation stacks.
In 2011, Thomson Reuters (ThomsonReuter) (currently known as Clarivate Analytics) suspended 33 journals from its Journal Citation Report due to a very high rate of self-citations, which contributed as much as 90% of the JIF of those journals. Therefore, anomalous citation detection is an important task in bibliometrics.
Our Contributions: In this paper, we study a citation network at the journal level, define the notion of anomalies for scientific citation networks, and detect the anomalous journal pairs with focus on the anomalies from one entity to another (citation stack), along with a confidence score indicating the correctness of our prediction. We also categorize them into different types, and figure out possible reasons for such anomalies. The major contributions of this paper are mentioned below:

We present a novel approach to detect anomalies in a journal-level scientific citation network, and compare the results with existing graph anomaly detection algorithms.

Due to the lack of ground truth for citation anomalies, we introduce JoCAD, a novel dataset consisting of synthetically injected citation anomalies (Section 3.2), and use it to evaluate our methodology and compare the performance of our method with state-of-the-art graph anomaly detection methods.

Our model is able to predict the anomalous citation pairs with 100% precision and an 86% F1 score (Section 5.2).

We further categorize the detected anomalies into various types and identify possible reasons, to develop a deeper understanding of the root causes.

We interpret our results using a case study, wherein our results resemble the citation and Scimago Journal & Country Rank (SJR) rating-change charts (Section 5.5).

We further design an interactive web portal, ‘Journal Citations Analysis Tool’, which can process any citation network and show journal-level anomalous citations. It helps users analyze the temporal citation pattern of a given journal (Section 6).
Reproducibility: The code and the datasets are available at: http://bit.ly/anomalydata.
2. Related Work
The methodology used in this paper is closely related to techniques for anomaly detection in complex networks and to journal-level studies of citation networks. Both topics are well studied; however, to the best of our knowledge, anomaly detection in journal-level citation networks has not been studied yet.
2.1. Anomaly Detection in Complex Networks
Anomaly detection in networks is a widely explored area. Noble proposed a SUBDUE-based method for anomaly detection in graphs. Hodge2004 wrote a survey of machine learning and statistical techniques for outlier detection, and categorized outlier detection algorithms into three major types, namely, unsupervised clustering, supervised classification, and semi-supervised recognition. The anomalies themselves have been categorized into two broad types (chen): white-crow anomalies, which can be easily identified as outliers, and in-disguise anomalies, which try to hide themselves and are thus difficult to detect. moon presented OutRank, an algorithm to detect white-crow anomalies by assigning a similarity score to objects. Eberle2007MiningFS detected in-disguise anomalies, which they called “structural anomalies”.
We present two methods that cover both kinds of anomalies mentioned above. The box plot method in Section 4.2 detects white-crow anomalies, as it identifies a high number of citations between journals (an easily noticeable feature) after bucketing the journals according to their sizes. The time-series method in Section 4.2 detects in-disguise anomalies, as it analyzes citations over the years and detects deviations from the usual behaviour (which cannot be noticed from an overall study of the data).
Anomalous groups in a citation network can be seen as communities with high citation activity with each other. Sun2005NeighborhoodFA presented algorithms for community detection in bipartite graphs. They introduced two kinds of functions for the purpose of community detection and outlier detection. The Oddball algorithm was proposed for unsupervised anomaly detection in weighted graphs, wherein each node is treated as an ego, and the neighbourhood network around it is studied as the egonet (oddball). Noble detected anomalies in graph-based data by categorizing them in two ways: anomalous substructures and anomalous subgraphs. They claimed that these anomalous substructures and subgraphs are extremely rare compared to “usual” substructures and subgraphs. Chandola:2009:ADS:1541880.1541882 presented a survey which classified anomalies into three types:

Point Anomalies: These are individual data points that are anomalous with respect to the entire data. For example, consider the real-life scenario of the price of a stationery item. If the cost of the item suddenly rises to twice its previous value, the new cost is an example of a point anomaly.

Contextual Anomalies: If a data instance is anomalous only in a particular context, then it is a contextual anomaly. It is also known as a conditional anomaly. Huber designed a method to process event streams from technical systems in order to detect anomalies in a timely manner, thus helping prevent huge industrial losses.

Collective Anomalies: If individual data points are not anomalous on their own but their collective occurrence is, then they are called collective anomalies.
In our study, we focus on the collective anomalies.
The standard anomaly detection algorithms for graphs could not be used for detecting pairwise anomalies in a citation network because the pre-existing graph algorithms could not pick up the subtle signal strengths of the anomalies at the journal level. An alternative way of modelling this problem is to use supervised machine learning techniques to learn a suitable feature representation that can help in detecting anomalous behaviour among journals, authors, papers, etc. However, due to the lack of a ground-truth dataset of considerable size with sufficient examples of anomalous behaviour of the above entities, machine learning algorithms could not be applied. We empirically show that existing graph anomaly detection methods do not work well on the problem under consideration (Section 5.2). This motivated us to study the detection of pairwise anomalies.
2.2. Journal-level Study of Citation Networks
Hummon1989ConnectivityIA modelled the connectivity in citation networks as an exhaustive search problem and applied a depth-first-search variant to quantify the similarity in the network. They focused on the similarity of edges. Zhang presented a clustering algorithm which calculated the similarity between adjacent nodes and formed clusters based on that score.
The study of citation networks is a well-explored topic. It dates back to 1965, when solla studied the citation network and made inferences about the nature of the references of a scientific paper with respect to its field and length. Various facets of citation networks have been studied, the most widely explored being the network of papers and their outgoing citations. Osareh studied citation networks and their applications. Zhao:2008:CPC:1458082.1458125 studied the evolution of heterogeneous citation networks.
Researchers have also focused on the citation interactions among authors using citation analysis. Prabha carried out a pilot study on the reasons behind the citations in scientific papers, and found that many of the citations made by a paper are not essential for the results it produces. Cronin2002 studied paper reference patterns with respect to the authors of the paper. They defined an author’s citation identity as the authors they cite in their work, and citation image as the authors who cite them. They argued that the reference styles of authors are a form of watermark for their work.
One important application of anomaly detection in citation networks is to find possible collusion among entities. cartel came up with an algorithm to detect citation cartels (journal). It was a community detection algorithm focusing on anomalous pairs of citations. The authors also discussed the possibility of some anomalous pairs being cases of citation cartels. Claiming whether a citation is a result of possible collusion or not is a non-trivial task, and requires an in-depth examination of every citation. In this paper, we focus only on the anomalies in the citation graph, and we do not try to claim any intentions behind those anomalous citations.
3. Datasets
For the purpose of this study, we consider two different datasets. One of the major challenges while devising the experiments was the lack of ground truth for anomalous citations in the real-world dataset. We thus introduce a journal citation anomaly dataset (called JoCAD), which consists of synthetically injected anomalies (see Section 3.2). We use it to test our model and report the final accuracy in terms of the F1 score (Section 5.2). We also use the Microsoft Academic Search Dataset, a real-world citation dataset, to carry out a case study on the anomalous pairs of journals detected by our model (Section 5.5).
3.1. Microsoft Academic Search Dataset
The Microsoft Academic Search (MAS) Dataset (chakraborty2018good; MAS) consists of 1.6 million publications and 1 million authors related to the Computer Science domain. It has various metadata information about a paper, including the year of publication, keywords of the paper, the references of the paper, the specific field (AI, algorithms, etc.) the paper is targeted towards, and the abstract of the paper. The dataset spans from 1970 to 2014. This dataset has been extensively used in the past for bibliographic analysis (chakraborty2015categorization; chakraborty2018universal).
We preprocessed the dataset according to the needs of our model. We removed the metadata which were not useful for the study. These included keywords, abstract, and the related fields of the paper. Table 1 shows statistics of the dataset.
Table 1: Statistics of the MAS dataset.
Entity                Number
No. of papers         1.6 million
No. of citations      6 million
No. of journals       1,800
No. of journal pairs  8,500
3.2. JoCAD: Our Synthetically Injected Anomaly Dataset
For anomaly detection in citation graphs, the ground truth is not well defined. Hence, we created a synthetic dataset into which we injected anomalies to serve as the ground truth in our further experiments. The synthetic data generation process is motivated by Hayat2017.
Our synthetic dataset contains 100 journals, each having a journal index between 0 and 99, and contains citation information for the journals over 20 years (2000-2020).
We assign the number of papers in each journal as a random number in a predefined range. Similarly, we assign the citation count between each pair of journals as a random number in a predefined range, which we consider the normal citation behaviour of the journal pairs.
For injecting anomalies, we list the intuitive properties of an anomalous pair of journals. These are a sudden spike in the number of citations from one journal to another, or from both journals to each other, in some random year; or a gradual increase from one journal to another, or from both journals to each other, which results in an anomalously high number of citations over all the years. Using these methods with reasonable randomness, we create a total of 110 anomalies. The final dataset consists of 110 anomalous pairs out of a total of 10,000 pairs of journals.
The types of anomalies we inject into the dataset can be broadly classified into five types:

If the number of citations from a citing journal to a cited journal in a particular year is significantly more than the normal citation behaviour of journal pairs, i.e., there is a sudden spike in the number of citations from the citing journal to the cited journal in that year, then the citation from the citing journal to the cited journal is an instance of Type 1 anomaly.

If the number of citations from the citing journal to the cited journal increases significantly over successive years, with the difference in the number of citations between any two of those years being greater than the expected difference in the number of citations for any journal pair, then the citation from the citing journal to the cited journal is an instance of Type 2 anomaly.

If the number of citations from one journal to the other in a given year is significantly higher than the average number of citations between any journal pair, and the same holds for the citations in the reverse direction, then both the citations from the first journal to the second and vice versa are instances of Type 3 anomaly.

If the number of citations from one journal to another in a given year is greater than or equal to double the number of citations received by one of the two journals from the other in the previous year, then the citation from the citing journal to the cited journal is an instance of Type 4 anomaly.

If the number of citations from the citing journal to the cited journal in a given year is greater than or equal to double the number of citations from the citing journal to the cited journal in the previous year, then the citation from the citing journal to the cited journal is an instance of Type 5 anomaly.
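The injection of spike and gradual-increase anomalies can be sketched roughly as follows. This is an illustrative sketch, not the generator used to build JoCAD: the normal range (0 to 10), the spike factor, the step size, and all function names are assumptions.

```python
import random


def make_normal_series(rng, years, low=0, high=10):
    """Yearly citation counts for one ordered journal pair under 'normal' behaviour."""
    return [rng.randint(low, high) for _ in range(years)]


def inject_spike(series, rng, factor=10, high=10):
    """Sudden spike: one random year gets far more citations than the normal range."""
    out = list(series)
    year = rng.randrange(len(out))
    out[year] = high * factor
    return out, year


def inject_gradual(series, step=25):
    """Gradual increase: every year adds a larger offset than the year before.
    With step > (high - low) the resulting series is strictly increasing."""
    return [count + step * i for i, count in enumerate(series)]


rng = random.Random(0)  # fixed seed so the toy data is reproducible
normal = make_normal_series(rng, years=20)
spiked, spike_year = inject_spike(normal, rng)
gradual = inject_gradual(normal)
```

Both-sided variants would apply the same functions to the two directions of a journal pair.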
4. Methodology
In this section, we first introduce the problem definition, followed by the proposed methodology.
4.1. Problem Definition
Given a set of papers as the input to the system, where each paper has four attributes (journal name, authors, citations, and the year of publication), the task is to predict the journal-to-journal anomalies with a confidence score for each prediction, classify each anomaly into its type, and suggest possible reasons for its existence.
4.2. Proposed Methodology
We use a two-fold approach (as shown in Figure 1) to classify the citation anomalies: box plot analysis and time-series analysis.
4.2.1. Box Plot Bucket Analysis:
We start by clustering the journals into buckets based on similar behaviour in terms of the total number of papers published in the journal. For each journal in our database, we define the usual behaviour of the given journal by analyzing the behaviour of all the journals lying in the same bucket as the given journal, i.e., all the journals having similar number of papers as that of the given journal, and then use the box plot method to predict the anomalies.
For each possible bucket, we pair each of the papers in the current bucket with each of the papers in the bucket of the given journal. We refer to the set of all these paper-paper pairs as a grid. We now detect the anomalies using a box plot. To identify outlier behaviour, we calculate the first quartile and the third quartile of the data points corresponding to the journal-journal citation counts in the selected grid. (The first quartile is the 25th percentile of the data and the third quartile is the 75th percentile; the second quartile is the median of the distribution.) We define anomalous journal-journal pairs as the data points which lie outside the range [first quartile - (1.5 × interquartile range), third quartile + (1.5 × interquartile range)], where interquartile range = third quartile - first quartile.
The anomalies detected by the box plot method are static anomalies as these anomalies are based on the total number of citations between two journals over all the years.
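As an illustration of the box plot test, the sketch below flags citation counts lying outside the quartile fences. It assumes the conventional multiplier of 1.5 on the interquartile range and the "inclusive" quantile method of Python's statistics module; the function name and the toy counts are hypothetical.

```python
import statistics


def box_plot_outliers(counts, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [c for c in counts if c < lower or c > upper]


# Toy journal-journal citation totals within one grid (assumed data).
grid_counts = [3, 5, 4, 6, 5, 7, 4, 5, 6, 60]
outliers = box_plot_outliers(grid_counts)
```

Because the fences are computed per grid, the same citation total can be normal for a pair of large journals and anomalous for a pair of small ones.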
4.2.2. Time-series Analysis:
Once we have the pairs declared anomalous by the box plot, we analyze each of them with time-series analysis to confirm whether there exists some year during which there was a sudden change in their usual citation behaviour. The anomalies detected by the time-series analysis method are dynamic anomalies, as they are susceptible to change every year.
To define the “usual behaviour”, we use the empirical rule. The empirical rule, also called the “68-95-99.7 rule” (pukelsheim1994three), is a statistical rule that states the percentage of values of a Gaussian distribution that lie within bands of width 2, 4, and 6 standard deviations centred on the mean, i.e., within one, two, and three standard deviations on either side of it. According to the rule, the band of width 2 standard deviations covers 68.27% of the values, the band of width 4 standard deviations covers 95.45% of the values, and the band of width 6 standard deviations covers 99.73% of the values.
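The percentages quoted by the rule can be reproduced from the Gaussian error function. The short sketch below is only a numerical check; the function name is ours.

```python
import math


def coverage_within(k: float) -> float:
    """Fraction of a Gaussian lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# Percentage covered within 1, 2, and 3 standard deviations on either side.
coverage = {k: round(100 * coverage_within(k), 2) for k in (1, 2, 3)}
```

Here the keys 1, 2, and 3 correspond to bands of total width 2, 4, and 6 standard deviations.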
To check the usual behaviour for a pair of journals for a given year, we list the yearly number of citations from one journal to the other up to that year, and calculate the mean and standard deviation of this distribution. Then, as stated by the empirical rule, 99.73% of the data is expected to lie within the band of width 6 standard deviations centred on the mean (three standard deviations on either side). Hence, if the number of citations from one journal to the other falls outside this range, the time-series analysis flags it as anomalous.
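A minimal sketch of this yearly check is given below. It assumes that the history for a year consists of the counts from strictly earlier years, that at least three earlier years are needed before a spread is estimated, and that the population standard deviation is used; none of these details are fixed by the description above, and the names are hypothetical.

```python
import statistics


def is_unusual(history, current, k=3.0, min_history=3):
    """True if `current` lies outside mean +/- k*std of the earlier yearly counts."""
    if len(history) < min_history:
        return False  # too little history to estimate a spread
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return abs(current - mean) > k * std


def anomalous_years(yearly_counts, k=3.0):
    """Indices of years whose count breaks from the behaviour seen before them."""
    return [
        i for i in range(len(yearly_counts))
        if is_unusual(yearly_counts[:i], yearly_counts[i], k)
    ]


# Toy citation counts from one journal to another over nine years (assumed data).
counts = [4, 5, 6, 5, 4, 6, 5, 40, 6]
flagged = anomalous_years(counts)
```

In this toy series only the year with 40 citations is flagged; the following year is not, because the spike itself widens the estimated spread.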
To understand the different types of unusual behaviour, we extend the existing notions of synchronous and diachronous citations (macro) and use synchronous citations to refer to the citations from a journal to another journal, and diachronous citations to refer to the citations received by a journal.
For a given pair of journals, we check four types of behaviours for each year:

One-sided synchronous citations: This consists of detecting unusual behaviour in the outgoing citations from one journal to the other, for each side; i.e., for a pair of journals, the above test is performed separately on the outgoing citations in each of the two directions, for each year as stated above.

One-sided diachronous citations: This consists of detecting unusual behaviour in the incoming citations from one journal to the other, for each side; i.e., for a pair of journals, the above test is performed separately on the incoming citations in each of the two directions, for each year as stated above.

Double-sided synchronous citations: This consists of detecting unusual behaviour in the outgoing citations from one journal to the other for both journals simultaneously; if both sides lie in the outlier range, the pair is said to have double-sided synchronous behaviour.

Double-sided diachronous citations: This consists of detecting unusual behaviour in the incoming citations from one journal to the other for both journals simultaneously; if both sides lie in the outlier range, the pair is said to have double-sided diachronous behaviour.
Checking both synchronous and diachronous citation behaviours is essential because a particular number of citations from one journal to another can be in the normal range for the citing journal but not for the cited journal, or vice versa. This can happen in many instances. For example, an unpopular journal may suddenly increase its citations to a very popular journal to an unusually high level, or a very popular journal may suddenly increase its citations to an unpopular journal. In both cases, only one side of the one-sided anomalies will be evident from the distribution-based check described above.
4.2.3. Reasons for Possible Anomalies:
We categorize the detected anomalies according to five possible reasons for their occurrence, namely, many-many anomaly, many-one anomaly, one-many anomaly, one-one anomaly, and previous-author collaboration. For the first four, we supplement a metric that captures how crowded the citation graph is at each end (the sender journal and the receiver journal); for example, a many-many anomaly is one where the graph is highly crowded at both ends.
To do this, we define the “normal behaviour” of the citations given by a paper to a journal by first treating the citation counts from a paper to a journal as the data points of a Gaussian distribution; then, using the empirical rule, we define the normal behaviour as the range covering nearly 70% of the data points. Now consider all the papers in the citing (sender) journal, and count the number of those papers whose citation count to the cited (receiver) journal is at least the normal behaviour. Dividing this by the total number of papers published in the sender journal, we get the percentage of papers (vertices in the citation network) on the sender’s side which is crowded in terms of the citation count they produce for the receiver journal, denoted as ‘sender-percentage’.
Next, we define the “normal behaviour” of the citations received by a paper from a journal by treating the citation counts from a journal to a paper as the data points of a Gaussian distribution, and again use the empirical rule to define the normal behaviour of the data points. As in the previous case, we consider all the papers in the receiver journal, and count the number of those papers which are cited by the sender journal at least as often as the normal behaviour. Dividing this by the total number of papers in the receiver journal, we get the percentage of papers (vertices) on the receiver’s side which is crowded in terms of the citation count they receive from the sender journal, denoted as ‘receiver-percentage’.
Now we are ready to categorize an anomaly as many-many, many-one, one-many, or one-one based on the following criteria:

Many-many Anomaly: If both the sender-percentage and the receiver-percentage are high (above the chosen threshold), then it is the case of a many-many anomaly, since both the receiver’s and the sender’s sides are extremely crowded.

Many-one Anomaly: If the sender-percentage is high and the receiver-percentage is low, then it is the case of a many-one anomaly, since the receiver’s side is not crowded while the sender’s side is extremely crowded.

One-many Anomaly: If the sender-percentage is low and the receiver-percentage is high, then it is the case of a one-many anomaly, since the sender’s side is not crowded while the receiver’s side is extremely crowded.

One-one Anomaly: If both the sender-percentage and the receiver-percentage are low, then it is the case of a one-one anomaly, since neither the sender’s nor the receiver’s side is crowded.
Figure 2 diagrammatically shows the four reasons listed above.
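The categorization can be sketched as below. This is a sketch under stated assumptions: the "normal behaviour" level is taken as the mean plus one standard deviation of all per-paper counts, and the 50% cut-off separating "many" from "one" is a hypothetical placeholder, not a value taken from our experiments. All function names are illustrative.

```python
import statistics


def normal_level(all_counts):
    """Assumed 'normal behaviour' level: mean plus one standard deviation."""
    return statistics.fmean(all_counts) + statistics.pstdev(all_counts)


def crowded_percentage(journal_counts, level):
    """Percentage of a journal's papers whose citation count reaches `level`."""
    crowded = sum(1 for c in journal_counts if c >= level)
    return 100.0 * crowded / len(journal_counts)


def categorize(sender_pct, receiver_pct, threshold=50.0):
    """Map the two percentages to many-many, many-one, one-many, or one-one.
    `threshold` is a hypothetical cut-off."""
    sender = "many" if sender_pct >= threshold else "one"
    receiver = "many" if receiver_pct >= threshold else "one"
    return f"{sender}-{receiver}"


# Toy per-paper citation counts (assumed data).
level = normal_level([0, 0, 0, 0, 10])
sender_pct = crowded_percentage([7, 8, 1, 9], level)
category = categorize(sender_pct, receiver_pct=10.0)
```

With these toy numbers, three of the four sender papers reach the level while the receiver side is sparse, so the pair falls in the many-one category.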
To explain the reasons for the anomalies, we also look into previous collaborations between the authors of the publications and count the number of such occurrences between the journal-journal pairs.
In particular, we check whether, at the time of publishing a paper, any of its authors had previously collaborated with any of the authors of a paper it cites. If so, we increment a counter. We do so for the paper pairs consisting of every paper in the sender journal and all the papers it cites in the receiver journal, and thus calculate the total number of previous-author collaborations.
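The counting step can be sketched as follows. The data layout is an assumption made for illustration: each citation is a tuple of the citing year, the citing authors, and the cited authors, and `collaborations` maps an unordered author pair to the earliest year in which they co-authored a paper.

```python
from itertools import product


def collaborated_before(authors_a, authors_b, year, collaborations):
    """True if some author pair (one from each paper) co-authored before `year`."""
    for a, b in product(authors_a, authors_b):
        first_year = collaborations.get(frozenset((a, b)))
        if first_year is not None and first_year < year:
            return True
    return False


def count_previous_author_collaborations(citation_pairs, collaborations):
    """Count citing/cited paper pairs linked by an earlier co-authorship."""
    return sum(
        1 for year, citing_authors, cited_authors in citation_pairs
        if collaborated_before(citing_authors, cited_authors, year, collaborations)
    )


# Toy data with hypothetical author names.
collaborations = {frozenset(("ana", "bo")): 2001}
pairs = [
    (2005, ["ana", "cy"], ["bo"]),  # ana and bo co-authored in 2001: counted
    (2000, ["ana"], ["bo"]),        # citation predates the collaboration: not counted
    (2005, ["cy"], ["dee"]),        # no joint history: not counted
]
total = count_previous_author_collaborations(pairs, collaborations)
```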
4.3. Confidence Score
Our model finally assigns a confidence score to every anomaly detected, which states how confident the model is that the pair of journals is anomalous in a given year (ConfScore). A confidence score is first assigned to each anomaly by the box plot method, then by the time-series method, and finally the two scores are combined to give the final confidence score. We use the tanh function for calculating the individual confidence scores from the two methods, box plot and time-series. (tanh is a hyperbolic function which gives a value ranging from -1 to 1 and, for only positive inputs, from 0 to 1.)
One intuition is that if our method states a pair to be anomalous, we are at least somewhat sure of it being anomalous. Following this intuition, we rescale the confidence scores given by the individual methods to a narrower range with a raised lower bound, instead of the full output range of the function. Another intuition is that, in the time-series method, a both-sided anomaly is a more unusual event than a single-sided anomaly, because a both-sided anomaly in a particular year means that both journals went beyond their normal behaviour in citing each other. We therefore give a both-sided anomaly a higher minimum confidence, and scale its score into a correspondingly narrower range.
For combining the scores, we choose to take the average of the two scores. This is done because both the methods have equal importance in the detection of the anomaly. Figure 3 shows the histogram of the confidence scores of anomalies detected from the MAS dataset. We analyze the most anomalous pair with confidence score 0.93 and find that it is logical for the pair to be anomalous (see case studies in Section 5.5).
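The scoring scheme can be sketched as below. The lower bounds used here (0.5 for a single method and 0.75 for a both-sided anomaly) and the notion of a non-negative "raw strength" fed to tanh are assumptions made for this illustration only; the function names are hypothetical.

```python
import math


def rescale(value, low, high=1.0):
    """Map a value in [0, 1] linearly onto [low, high]."""
    return low + (high - low) * value


def method_score(raw_strength, floor):
    """tanh squashes a non-negative strength into [0, 1); rescale lifts it to [floor, 1)."""
    return rescale(math.tanh(raw_strength), floor)


def final_confidence(box_strength, ts_strength, both_sided,
                     floor=0.5, both_sided_floor=0.75):
    """Average of the box plot score and the time-series score."""
    box = method_score(box_strength, floor)
    ts = method_score(ts_strength, both_sided_floor if both_sided else floor)
    return (box + ts) / 2


weakest_single = final_confidence(0.0, 0.0, both_sided=False)
weakest_double = final_confidence(0.0, 0.0, both_sided=True)
typical = final_confidence(2.0, 1.0, both_sided=False)
```

Under these assumed bounds, even the weakest both-sided anomaly scores higher than the weakest single-sided one, which mirrors the second intuition above.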
5. Experiments
5.1. Comparison of Time-series Analysis with SJR ratings
To evaluate the performance of the time-series method for anomalous behaviour detection, we use the SJR ratings to check for changes in the ratings of journals flagged as anomalous in particular years. Scimago Journal & Country Rank, better known as SJR, is a journal rating system (SJR). It uses data similar to what is used by the h-index, and gives year-wise ranks to journals. It is a well-recognised rating system for journals and conferences.
We scraped the SJR ratings over the years for all the journals present in the MAS dataset from the SJR portal. For pairs of journals, a sudden increase in citations to each other mostly coincides with a change in the SJR rating of the citation-receiving journal in the next two years. The case study of the most anomalous pair of journals also supports this result (see Section 5.5).
5.2. Experimental Results
For anomaly detection in citation graphs, there is no well-defined ground truth or proper evaluation metric. Hence, to evaluate the performance of our method, we created a synthetic dataset into which we injected anomalies (see Section 3). These were treated as the ground truth in our experiments, and we report our results based on them. We use precision, recall, and F1 score for the final evaluation.
Baseline Methods: Due to the lack of any existing baseline for the given task, we use an existing graph anomaly detection algorithm and each of our two component methods, namely, the box plot method and the time-series method, as baselines for evaluating our proposed model.
We used K-means clustering, one of the popular graph anomaly detection algorithms, but it could detect only 38 anomalous pairs out of the total of 110. The detected pairs are those with a large number of citations on both sides, i.e., to and from one journal to another over the years. These are clearly visible when the citations on both sides are summed, similar to the box plot method. However, the K-means clustering algorithm is unable to detect other types of anomalous pairs, and thus has a lower recall and F1 score compared to our method.
Table 2: Comparison of the baselines and our model on JoCAD.
Method       Precision (%)  Recall (%)  F1 Score (%)
K-means      100            34.5        51.3
Box plot     59.19          93.63       72.53
Time-series  86.73          77.27       81.73
Our model    100            75.45       86.01
Table 2 shows the comparative results of the baselines and our model. The F1 score of our model comes out to be 86.01%, which is higher than that of the baselines. Also, the precision of our model comes out to be 100%, which means that all the pairs stated anomalous with the agreement of both the methods are actually anomalous.
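The F1 score is the harmonic mean of precision and recall. The sketch below recomputes it for three rows of Table 2 as a sanity check; the helper name is ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs taken from Table 2.
rows = {
    "Box plot": (59.19, 93.63),
    "Time-series": (86.73, 77.27),
    "Our model": (100.0, 75.45),
}
f1 = {name: round(f1_score(p, r), 2) for name, (p, r) in rows.items()}
```

Because the harmonic mean is dominated by the smaller of the two values, a method with perfect precision still needs reasonable recall to score well.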
5.3. Results on the MAS Dataset
We ran the model on the real-world MAS dataset. The histogram of confidence scores of the anomalies found in the dataset is shown in Figure 3 (explained in Section 4.3). Out of a total of 8.5 thousand journal pairs, a total of 328 anomalies are found. The number of unique pairs of anomalous journals (irrespective of the year of anomaly) is 230, and the total number of journals involved in any anomaly is 103. We observe that the occurrence of anomalies is rare (only 3.90%), and the occurrence of both-sided anomalies is extremely rare (only 4 cases out of 8.5 thousand pairs), as shown in Table 3.
Table 3: Anomalies detected in the MAS dataset.
Type                 Number  Percentage of total pairs
Single (one-sided)   324     3.85%
Double (both-sided)  4       0.04%
Total                328     3.90%
We plot the fraction of publications that are anomalous in each year, i.e., the ratio of the number of anomalous publications in a year to the total number of publications in that year. The temporal frequency of anomalies is shown in Figure 4. We can infer that the relative year-wise frequency of anomalous journals decreases over the years. This decrease might be due to the huge boost given to academic research in Computer Science, which led to an increase in the total number of journals and may thus have lowered the ratio over time. Another reason could be the more stringent rules and regulations imposed in the scientific community, which could have led to a decrease in anomalous activity amongst journals.
Next, we plot the trends with respect to the size of the anomalous journals, defined in terms of the number of papers published in them. Figure 5 shows the decline in anomalous activity as the size of a journal increases. This might be because a bigger journal has a wider variety of papers, research domains, and authors. Another reason might be that journals which have gained recognition and prestige over time are likely to receive a high number of novel, high-quality papers worth publishing. Thus, prestigious and renowned journals may be larger, and larger journals are less likely to be anomalous.
5.4. Human Annotators
Two human annotators were asked to independently label the anomalies detected by our method on the MAS dataset as anomalous or non-anomalous. (The annotators were experts in the data mining domain, aged between 25 and 35 years.) The annotation was done on the basis of the number of citations between the journals over the years, and relevant metrics such as the average number of citations between the journals over all the years and the popularity of both journals. They judged the popularity of journals based on factors like impact factor and cite score.
Each annotator annotated all 328 anomalous pairs. They placed 38 pairs and 25 pairs, respectively, in the non-anomalous category. The inter-annotator agreement was measured using Cohen’s kappa coefficient.
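For reference, Cohen’s kappa compares the observed agreement between two annotators with the agreement expected by chance. The sketch below computes it on toy labels; these labels are invented for illustration and are not the annotations collected in this study.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels: 1 = anomalous, 0 = non-anomalous.
annotator_1 = [1, 1, 1, 1, 0, 0, 1, 1, 1, 0]
annotator_2 = [1, 1, 1, 0, 0, 0, 1, 1, 1, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 8 of 10 items, but since chance agreement is already 0.58, kappa is only about 0.52.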
5.5. Case Study
We conduct a case study for the anomalous pair with the highest confidence in the MAS dataset. The pair consists of the journal IEEE Transactions on Information Theory and the conference IEEE International Symposium on Information Theory, and the confidence score of being an anomaly is . IEEE has a professional society, the Information Theory Society (itsoc), under whose umbrella lie its main journal, IEEE Transactions on Information Theory (IEEE TIT), and its flagship conference, the IEEE International Symposium on Information Theory. The symposium is a venue for researchers to meet and discuss work done in the field of information theory. The work can be previously published or unpublished.
As the work can be previously published, and the conference is part of the same society as the journal, many authors of papers published in the journal could attend the conference to present their work. Hence, it is natural for the two to have numerous citations to each other over the years. Given this, there can be two reasons for a citation:

A paper has already been published in the journal and the authors were invited to the conference to present and discuss their work.

A paper was not accepted by the journal and the authors presented their work at the conference. Later, another researcher built on that work and cited it in a paper that was then published in the journal.
Not every citation need stem from the journal and the conference being part of the same society. There would be many instances where a paper that was presented at the conference, but not published in the journal, cited a paper previously published in the journal.
The charts in Figure 6 show the number of citations from the Symposium on Information Theory to IEEE TIT, the change in the SJR rating of IEEE TIT over the years, the number of citations from IEEE TIT to the Symposium on Information Theory, and the change in the SJR rating of the Symposium on Information Theory over the years.
A clear resemblance between the citation charts and the SJR rating-change charts (scimagojr.com) for both the journal and the conference shows that the findings of the time-series analysis correspond to the changes in the SJR rating of IEEE TIT. As stated in the introduction, the JIF of a journal for a given year depends on the citations received by the articles it published during the preceding two years. A sudden spike can be seen in the citations from the conference to the journal in 2003. A spike can then be seen in the SJR rating of the journal in 2004, which suggests that the increased citation count contributed to an increase in the rating of the journal.
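One simple way to check such a lagged relationship is to locate the largest year-over-year jump in each series and compare the two years. The sketch below does this under stated assumptions: the helper `largest_jump_year` is hypothetical, the series are assumed to cover consecutive years, and the numbers are invented solely to mirror the 2003/2004 pattern described above. They are not the values plotted in Figure 6, and this is not the time-series method used in the paper.

```python
def largest_jump_year(series):
    """Return the year with the largest increase over the preceding year.

    `series` maps year -> value and is assumed to cover consecutive years.
    """
    years = sorted(series)
    if len(years) < 2:
        raise ValueError("need at least two years to measure a jump")
    jumps = {curr: series[curr] - series[prev]
             for prev, curr in zip(years, years[1:])}
    return max(jumps, key=jumps.get)


# Hypothetical values chosen only to mirror the described pattern.
citations_conf_to_journal = {2001: 40, 2002: 42, 2003: 95, 2004: 90}
sjr_of_journal = {2001: 1.0, 2002: 1.1, 2003: 1.1, 2004: 1.8}

citation_spike = largest_jump_year(citations_conf_to_journal)
rating_spike = largest_jump_year(sjr_of_journal)
lag = rating_spike - citation_spike
print(citation_spike, rating_spike, lag)
```

A positive lag means the rating spike follows the citation spike. On its own this is only suggestive, since a single pair of spikes does not establish that one caused the other.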
6. A Web Portal
We use the journal-level citation network created from the MAS dataset to develop a portal which allows the user to find the anomalies corresponding to a journal for all possible years. It provides an interactive graphical interface which visually depicts the anomalies in a citation network. The user can enter the details of a journal, based on which the portal provides a year-wise analysis of the anomalies. A screenshot of the portal is presented in Figure 7.
For all journals in our dataset which form an anomalous pair with the journal queried by the user, we display the year-wise anomalies in an anomaly graph. This graph depicts each journal as a node, while the citations between journals are represented by edges. The size of a node is proportional to the number of papers in the journal, i.e., a larger node indicates that the journal has published a higher number of papers. The thickness of an edge depicts the number of citations between the two journals, i.e., a thicker edge between two journals indicates a higher number of citations between them. The number on top of each node represents its journal index, which is mapped to the respective journal name in our dataset.
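The visual encoding described above can be sketched as a proportional scaling of paper counts to node sizes and of citation counts to edge widths. This is an illustrative sketch, not the portal's implementation: the function `proportional_sizes`, the maximum size and width constants, and the toy counts are all assumptions.

```python
def proportional_sizes(counts, max_size):
    """Scale positive counts so that the largest maps to `max_size`."""
    peak = max(counts.values())
    if peak <= 0:
        raise ValueError("counts must contain a positive value")
    return {key: max_size * value / peak for key, value in counts.items()}


# Hypothetical inputs: journal index -> paper count, and
# (citing index, cited index) -> citation count.
papers = {0: 1200, 1: 300}
citations = {(0, 1): 50, (1, 0): 200}

node_sizes = proportional_sizes(papers, max_size=40.0)
edge_widths = proportional_sizes(citations, max_size=8.0)

# A plain description of the graph that a plotting layer could consume.
anomaly_graph = {
    "nodes": [{"index": j, "size": s} for j, s in node_sizes.items()],
    "edges": [{"source": u, "target": v, "width": w}
              for (u, v), w in edge_widths.items()],
}
print(anomaly_graph)
```

Scaling against the largest count keeps the sizes comparable within one graph; a different normalization might be preferable when counts span several orders of magnitude.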
A beta version of the portal is live and can be accessed here: https://journalcitationsanalysistool.herokuapp.com/.
7. Conclusion
In this paper, we presented a novel model to detect journal-level anomalous pairs. The model also gives a confidence score by analyzing both static and dynamic anomalies using a box-plot bucketing method and time-series analysis. We also curated JoCAD, a novel dataset which consists of synthetically injected citation anomalies. We further ran our model on two datasets, namely JoCAD and the Microsoft Academic Dataset, and used them to evaluate our methodology. Our method achieved an F1 score of 86% on JoCAD when compared with standard graph anomaly detection methods.
We interpreted our results in a case study using the Microsoft Academic Dataset, wherein our results resemble the citation and Scimago Journal & Country Rank (SJR) rating-change charts. We experimentally showed a high similarity between the time-series trends predicted by our method and the time-series trends reflected by the SJR. We further designed an interactive web portal, ‘Journal Citations Analysis Tool’, which, given the citation network as input, shows the journal-level anomalous citations and helps users analyze the temporal citation pattern of a given journal.
Future work in this direction can include an in-depth study of the reasons behind the detected anomalies, as well as the development of methods for detecting citation cartels. We acknowledge that, while curating the JoCAD dataset, we may have introduced human bias, which can be reduced in the future. We would also like to explore anomalous citation behavior in other domains. Other relevant future work can take into account the textual context of a citation and calculate its similarity with the content of the paper being cited. This would help us understand the relevance of the citation and thus predict anomalous citations better.
Acknowledgement
T. Chakraborty would like to acknowledge the support of the Ramanujan Fellowship, DST (ECR/2017/00l691), and the Infosys Centre of AI, IIIT-Delhi, India.