Classification of abrupt changes along viewing profiles of scientific articles

by Ana C. M. Brito, et al.

With the expansion of electronic publishing, a new dynamics of scientific article dissemination was initiated. Nowadays, many works are widely disseminated even before publication, in the form of preprints. Another important new element concerns the views of published articles. Thanks to the availability of such data from some journals, such as PLoS ONE, it became possible to investigate how scientific works are viewed over time, often before the first citations appear. This provides the main theme of the present work. More specifically, our research was motivated by preliminary observations that view profiles over time tend to be piecewise linear. A methodology was then delineated to identify the main segments in the view profiles, which allowed several related measurements to be derived. In particular, we focused on the inclination and length of each subsequent segment. Basic statistics indicated that the inclination can vary substantially along subsequent segments, while the segment lengths were more stable. A complementary joint statistical analysis, considering pairwise correlations, provided further information about the properties of the views. In order to better understand the view profiles, we performed a multivariate statistical analysis, including principal component analysis and hierarchical clustering. The results suggest that a portion of the polygonal views are organized into clusters or groups. These groups were characterized in terms of prototypes indicating the relative increase or decrease along subsequent segments. Four distinct models were then developed for representing the observed segments. It was found that models incorporating joint dependencies between the properties of the segments provided the most accurate results among the considered alternatives.




I Introduction

Science can be understood as a social activity, conceived and applied by humans. As a consequence, communication plays a critical role in scientific development, allowing important results to be disseminated and used. Interchange is important not only between scientists working on related fields, but also between those deriving results and those applying these results. In the beginnings of science, communication proceeded mostly in terms of letters (see e.g. peat2002certainty ), which were exchanged between scientists in order to share their most recent results. Letters gave rise to proceedings, journals and, more recently, World Wide Web-based dissemination. The study of how scientific articles are read and cited is of great importance because such knowledge provides insights about the efficiency at which science is disseminated.

Many of the existing studies in scientometrics, the area aimed at studying how science unfolds, consider citations as the main indicator of usage and interest of scientific articles. Until recently, this was one of the few available objective measurements of scientific dissemination waltman2016review ; 2012three ; bollen2005toward . Yet, with the introduction of the Internet and the WWW, other statistics became available, such as the number of views, shares and downloads of articles published online. Indeed, before the Internet and the WWW, it was very difficult to count how many times a journal or article was taken from the shelves and read. The availability of these new indicators paved the way to many interesting investigations in scientometrics, motivating the new area of altmetrics (e.g. sud2014evaluating ).

Among the new scientific indicators, the number of visualizations has some particularly interesting features. First and foremost, it takes place at a relatively high speed, involving little delay: once published online, a work starts being viewed almost immediately. In contrast, the first citation of a work can take months or even years to appear. Given that visualizations tend to be faster, they can provide insights about current trends, allowing predictions to be made. Visualizations also tend to take place in larger numbers than citations, therefore providing a potentially more complete sample that can lead to more accurate statistical analysis. Studying visualizations is also intrinsically important as a means to better understand their relationship with citations. In particular, views can be considered as precursors to citations: a much viewed article is a good candidate for being highly cited.

One of the few limitations intrinsic to visualizations is that they provide a somewhat weaker indication of the use of the knowledge in the visualized work. Indeed, some visualizations can be the consequence of actions of Web crawlers or surveys, without a direct implication that the reported knowledge has been somehow transferred or applied. All in all, views-based scientometrics has potential for contributing substantially to our knowledge about how scientific information is disseminated.

Being mostly based on online publication, and providing several statistics, the PLoS ONE journal represents a good resource for performing scientometric/altmetric studies focusing on views as main indicators. In particular, the number of views of each article is provided along time in a month-by-month fashion. Figure 1 illustrates some visualization profiles for 6 randomly chosen articles.

Figure 1: Profiles of 6 randomly chosen PLoS ONE articles showing, in each case, the cumulative number of views along successive months. Interestingly, several of these profiles present approximate polygonal shape, with sharp turns followed by relatively straight segments.

Remarkably, a preliminary analysis of these curves indicates some surprising features. First, the total number of views tends to increase with constant speed in most cases, giving rise to relatively straight segments. What could be the mechanisms behind such a remarkable trend? Second, sharp changes in the cumulative number of views, reflected in the abrupt change of the slope of the curves, are often observed at given times for some of the articles. As a consequence, some of the viewing profiles resemble polygonal, or piecewise linear, curves. This is even more interesting than the former effect, because it shows that some sort of events occur that are capable of substantially changing the visibility of a given article along time. The investigation of these phenomena can contribute to understanding how a given work is viewed by the scientific community along time, allowing predictions of its possible citations.

A more systematic analysis of the above reported effects constitutes the main objective of the current work. More precisely, viewing profiles are obtained for almost every article published in PLoS ONE. These signatures are then analysed by using state of the art numerical methods, namely segmented regression (e.g. muggeo2003estimating ), capable of fitting contiguous straight line segments along each signature, to match its original profile. The slope and transition points of these segments, corresponding to abrupt slope changes, can then be identified and used for statistical analysis and derivation of statistical models. Indeed, the current approach focuses on the derivation of these models as the means for trying to understand the properties of the considered visualization profiles.

The proposed models are based on variables describing the inclination angle and the extension of each subsequent linear-piecewise portion. Conditional densities are estimated from the experimental data, allowing the generation of synthetic profiles. We consider independent densities as well as Markov-1 univariate and multivariate dependencies between variables. Then, by comparing the congruence between the original and synthetic profiles, we can estimate which of the considered models tend to be more accurate. The so-obtained model is then studied in order to identify and understand possible mechanisms producing the observed profiles.

Several interesting results are reported. First, the linear-piecewise approximation was found to be more congruent with the real-world visualization profiles than with synthetic profiles obtained from uniformly random events. This supports the hypothesis that the real-world profiles tend to present a polygonal structure, therefore being potentially well-represented and modeled by using this type of curve. The several parameters observed for each set of visualizations, considering the same number of segments, revealed two relatively well-defined clusters corresponding to two main types of polygonal profiles exhibiting sequences of varying slopes. For instance, in the case of profiles containing 3 segments, one of the detected clusters is characterized by a relatively low initial slope followed by a higher slope, which is then reduced in the final segment. The other type of cluster presented the opposite structure. Among the several types of statistical models considered in this work, we found that the use of conditional densities defined on joint pairs of previous parameters, more specifically by predicting the slope and interval at each current time in terms of the previous instances considered jointly, led to the most accurate results. This suggests that each segment along the polygonal structures can be, to a good extent, estimated from the previous segment, corresponding to a memory-1 Markov dynamics that may underlie the respective real-world effects.

This article starts by presenting the adopted data as well as the concepts and methods used for its analysis. Then, we present the results of the polygonal fitting, including a comparison with control synthetic profiles obtained from a uniform statistical distribution. The analysis of the obtained features is presented next, including the identification of two main types of visualization profiles. Then, statistical models are developed and applied as a means of predicting and better understanding the considered profiles. The article concludes by identifying its main contributions as well as suggesting avenues for future investigations.

II Related Work

Altmetrics indicators have been employed to quantify the quality of some aspects of science sud2014evaluating ; bornmann2014altmetrics ; galligan2013altmetrics . Some analyses account for the importance of articles huang2018altmetric and others for measuring characteristics of scientists lariviere2013bibliometrics ; ioannidis2014bibliometrics . Here, we study an altmetrics measurement aimed at counting how many times a paper was visualized.

The Altmetric Attention Score (AAS) has been used to measure the importance of papers huang2018altmetric . This index measures mentions of papers in social media (e.g., Twitter and Facebook). The authors of huang2018altmetric found correlations between AAS and the number of citations for some journals. More specifically, by considering papers obtained from PLoS journals, the authors measured the Spearman correlation between a normalized AAS and a normalized count of citations. Interestingly, in the case of Medicine articles, this correlation was not found. In another study that considered highly cited papers, correlations were found in the comparison between metrics obtained from social networks and the number of citations thelwall2013altmetrics .

Another important piece of information that can be measured is the difference between the behavior of new and old articles. Due to the fast evolution of computer-related areas, computer science researchers give a lot of importance to conference papers. By considering papers of journals and conferences, thelwall2019mendeley compared the number of citations with Mendeley reads. They found that the number of Mendeley reads and the number of citations are correlated for both journals and conferences. However, in the case of old conference papers, a similar correlation was not found. An alternative way of quickly transmitting scientific results is by publishing papers on arXiv, which is a preprint repository. In a study that took into account early citations shuai2012scientific , the authors compared arXiv downloads and Twitter mentions and found that there is a correlation between tweets and downloads. Furthermore, both are correlated with early citations.

One particularly important measurement is the number of views, for which some distinct aspects have been analyzed. For instance, the number of views can be linearly correlated with the age of a paper priem2012altmetrics , in which older articles tend to have more views. Furthermore, in de2015relationship , the authors investigated the relationship between the number of views of an article and its mentions on Twitter. More specifically, their study suggests that views obtained from tweets are not related to citations. Other scholars also analyzed the numbers of article views and downloads, and they found that the number of downloads is much more correlated with citations than views wang2014attention . In this paper, we also explore the number of paper views. However, here we focus our analysis on the dynamics of paper views along time.

III Materials and Methods

III.1 The Dataset

We extracted information from all the papers published in PLoS ONE before 2017. For that, we employed a semi-automatic extraction of paper metadata. The resulting dataset contains information about the number of views per month of each article published in the journal, as well as other social media features, including the number of tweets and shares. The latter properties still need some post-processing and further validation before use. For that reason, here we focus the analyses only on the number of views over time, which yields the set of view profiles studied in this work.

III.2 Curve characterization

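Following muggeo2003estimating , a breakpoint is modeled as two straight lines joined at point $\psi$, that is,

$$y = \beta_0 + \beta_1 x + \delta\,(x - \psi)_+ , \tag{1}$$

where $(x - \psi)_+ = (x - \psi)\, I(x > \psi)$, with $I(A) = 1$ if condition $A$ is true and $0$ otherwise. $\psi$ is the position of the breakpoint, $\beta_1$ is the slope of the line before the breakpoint and $\beta_1 + \delta$ is the slope after the breakpoint. Therefore, $\delta$ is the difference in slopes between the two lines. This is illustrated in Figure 2.

Figure 2: Example of segmented line with one breakpoint.

The term $(x - \psi)_+$ can be rewritten as

$$(x - \psi)_+ \approx (x - \psi^{(0)})_+ - (\psi - \psi^{(0)})\, I(x > \psi^{(0)}), \tag{2}$$

which represents a first order Taylor expansion of $(x - \psi)_+$ at $\psi^{(0)}$. By defining $U = (x - \psi^{(0)})_+$ and $V = -I(x > \psi^{(0)})$, Equation 1 can be rewritten as

$$y = \beta_0 + \beta_1 x + \delta\, U + \gamma\, V, \qquad \gamma = \delta\,(\psi - \psi^{(0)}). \tag{3}$$

Thus, the piecewise linear curve is represented as a linear combination between variables $x$, $U$ and $V$. Representing as $x_i$ and $y_i$ the values measured for, respectively, the $x$ and $y$ variables, we want to find $\beta_1$, $\delta$ and $\psi$ (together with the intercept $\beta_0$) which minimize the sum of squared residuals for the model in Equation 3. Muggeo muggeo2003estimating showed that this can be done according to the following procedure:

  1. Set an initial value $\psi^{(0)}$ for $\psi$,

  2. Calculate $U_i$ and $V_i$ for each data point $x_i$,

  3. Fit the model in Equation 3 using linear regression,

  4. Improve the breakpoint estimation according to $\hat{\psi} = \psi^{(0)} + \hat{\gamma} / \hat{\delta}$,

  5. Repeat 1-4, taking $\hat{\psi}$ as the new $\psi^{(0)}$, until convergence.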
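The iterative procedure above can be sketched in a few lines of Python. This is only an illustrative sketch for a single breakpoint, not the `segmented` R implementation used in this work: the helper names (`solve_linear`, `least_squares`, `fit_one_breakpoint`), the intercept term, the tolerance and the iteration cap are assumptions, and no safeguard is included for breakpoints that drift outside the data range.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def least_squares(rows, y):
    """Ordinary least squares through the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)


def fit_one_breakpoint(x, y, psi0, max_iter=50, tol=1e-8):
    """Iterate steps 2-4 starting from the initial breakpoint guess psi0."""
    psi = psi0
    for _ in range(max_iter):
        rows = []
        for xi in x:
            above = 1.0 if xi > psi else 0.0
            u = (xi - psi) * above   # U = (x - psi0)_+
            v = -above               # V = -I(x > psi0)
            rows.append([1.0, xi, u, v])
        b0, b1, delta, gamma = least_squares(rows, y)
        new_psi = psi + gamma / delta   # breakpoint update of step 4
        converged = abs(new_psi - psi) < tol
        psi = new_psi
        if converged:
            break
    return b0, b1, delta, psi
```

On noise-free data with a single slope change, the estimate moves to the intersection of the two fitted lines at each iteration, which is why only a few iterations are typically needed in this simple setting.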

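In case of multiple breakpoints, the process is similar. The breakpoints are modeled as

$$y = \beta_0 + \beta_1 x + \sum_{k=1}^{K} \delta_k\,(x - \psi_k)_+ , \tag{4}$$

where $K$ is the number of breakpoints and $\psi_k$ is the position of the $k$-th breakpoint. $\delta_k$ represents the difference in slopes between segments $k$ and $k+1$. Equation 4 can be rewritten as

$$y = \beta_0 + \beta_1 x + \sum_{k=1}^{K} \delta_k U_k + \sum_{k=1}^{K} \gamma_k V_k , \tag{5}$$

with $U_k = (x - \psi_k^{(0)})_+$, $V_k = -I(x > \psi_k^{(0)})$ and $\gamma_k = \delta_k\,(\psi_k - \psi_k^{(0)})$, where $\psi_k^{(0)}$ is the current estimate of the $k$-th breakpoint and $I(\cdot)$ is the indicator function. The parameters of Equation 5 are found by least squares regression, and the breakpoint estimates are updated as $\hat{\psi}_k = \psi_k^{(0)} + \hat{\gamma}_k / \hat{\delta}_k$.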

The segmented regression requires the number of breakpoints to be known a priori. Muggeo and Adelfio muggeo2010efficient defined a procedure for finding an appropriate number of breakpoints. First, the aforementioned segmented regression method is applied using a large number of candidate breakpoints. Then, during the optimization procedure, a breakpoint is removed if the estimated slope difference at that breakpoint is close to zero, which indicates that the slopes of the two adjacent segments are too similar, or if two breakpoints are too close to each other. Next, breakpoints that do not significantly contribute to the residual are removed using the least-angle regression algorithm efron2004least .

For our analysis, the segmented regression was applied using the segmented package in the R language. The first point of each viewing profile was removed since it represents the number of article views along the first calendar month when the article was published. Thus, if the article was published at the end of the month, it will have an unreasonably low number of views for that month.

III.3 Statistical Model

To better understand the linear-piecewise nature observed in the studied view profiles, we propose a set of simple statistical models. The models progressively incorporate more information about the data. In particular, they take into account more relationships among the parameters recovered from approximating the view profiles with the segmented regression algorithm.

We start the analysis by applying the segmentation algorithm to all the curves, resulting in a set of piecewise linear curves. Each curve is given in terms of the breakpoints and slopes that approximate the original visualization profile. By considering only the curves with a fixed number of segments, we extract the piecewise parameters given by the algorithm: $l_i$, representing the length of the $i$-th segment between two breakpoints, and $\alpha_i$, the inclination of that segment.
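As a rough illustration of this extraction step, the sketch below converts the breakpoints and slopes of one fitted curve into segment lengths and inclination angles. It rests on assumptions not stated in the text: the length is taken as the horizontal extent of each segment on a normalized time axis, the inclination is taken as the arctangent of the slope in degrees, and the function name `segment_parameters` is hypothetical.

```python
import math


def segment_parameters(breakpoints, slopes, t_start=0.0, t_end=1.0):
    """Return (lengths, angles_deg) for one fitted piecewise-linear profile.

    Assumes the time axis spans [t_start, t_end] and that there is exactly
    one more slope than there are breakpoints.
    """
    if len(slopes) != len(breakpoints) + 1:
        raise ValueError("expected one more slope than breakpoints")
    edges = [t_start, *sorted(breakpoints), t_end]
    # Horizontal extent of each segment between consecutive breakpoints.
    lengths = [b - a for a, b in zip(edges, edges[1:])]
    # Inclination of each segment, as the angle of its slope in degrees.
    angles = [math.degrees(math.atan(s)) for s in slopes]
    return lengths, angles
```

For instance, a curve with a single breakpoint at 0.25 and slopes 2.0 and 1.0 would yield two lengths, 0.25 and 0.75, and two angles, the second being 45 degrees.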

One of our goals is to produce synthetic profiles according to statistical models providing support to understand the real-world profiles. Here, we propose four types of models: null model, independent distribution model, Markov-1 univariate, and multivariate models.

For the models, we consider that the segment inclinations $\alpha_i$ and lengths $l_i$ are random variables described by marginal, joint and conditional probabilities, and employ the conditional probability equation, as follows

$$P(A \mid B) = \frac{P(A, B)}{P(B)}. \tag{6}$$

In our simplest model, which we call null model, $\alpha_i$ and $l_i$ are generated following uniform distributions, each drawn within a fixed range. The independent distribution model takes into account independent densities. For samples obtained for a variable $X$, the probability density function (PDF) of the variable is estimated using kernel density estimation with a Gaussian kernel. The PDFs of the $\alpha_i$ and $l_i$ variables were estimated from the articles data and used for generating synthetic values.
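The independent distribution model can be illustrated with a minimal Gaussian kernel density sketch. The bandwidth choice (Silverman's rule of thumb) and the function names are assumptions made for illustration, since the bandwidth selection is not specified in the text. Sampling from a Gaussian KDE amounts to picking one observed value at random and perturbing it with Gaussian noise whose standard deviation is the bandwidth.

```python
import math
import random
import statistics


def silverman_bandwidth(samples):
    """Rule-of-thumb bandwidth (an assumed choice, not taken from the paper)."""
    n = len(samples)
    return 1.06 * statistics.stdev(samples) * n ** (-1 / 5)


def kde_pdf(samples, h, x):
    """Gaussian kernel density estimate evaluated at x."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)


def kde_sample(samples, h, rng):
    """Draw one synthetic value from the Gaussian KDE of the samples."""
    return rng.gauss(rng.choice(samples), h)
```

A usage example would call `kde_sample(observed_lengths, h, random.Random(0))` repeatedly, once per segment, with `observed_lengths` standing for the hypothetical list of measured segment lengths.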

In the Markov-1 univariate model, given a sequence of random variables $X_1, X_2, \ldots$, the values of the next state $X_{i+1}$ are drawn from the univariate conditional probabilities given by the current state $X_i$. More specifically, starting from an initial independent distribution $P(X_1)$, the first state is generated. The next state is drawn from $P(X_{i+1} \mid X_i)$. To estimate $P(X_{i+1} \mid X_i)$, both the inclination and length parameters obtained for the data were partitioned into bins.

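Finally, the Markov-1 multivariate model uses the combined joint probabilities for the inclination $\alpha_i$ and length $l_i$ to calculate $\alpha_{i+1}$ and $l_{i+1}$. More specifically, the initial independent joint distribution $P(\alpha_1, l_1)$ is used to generate the first states ($\alpha_1$ and $l_1$ values), then the distributions $P(\alpha_{i+1} \mid \alpha_i, l_i)$ and $P(l_{i+1} \mid \alpha_i, l_i)$ are employed to calculate the next corresponding states.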
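One possible way to realize the Markov-1 multivariate model is sketched below, working directly on bin indices. The binning helper, the data layout (each profile as a list of (angle, length) pairs) and the function names are assumptions for illustration. The conditional distributions are represented empirically: for each joint state, the sketch stores the list of observed next bins and samples from it, which is equivalent to sampling from the binned conditional frequencies.

```python
import random
from collections import defaultdict


def to_bin(value, lo, hi, n_bins):
    """Map a value in [lo, hi] to a bin index in 0..n_bins-1."""
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)


def fit_markov1(profiles, n_bins, angle_range, length_range):
    """profiles: list of profiles, each a list of (angle, length) pairs."""
    first_states = []
    next_angle = defaultdict(list)
    next_length = defaultdict(list)
    for segs in profiles:
        states = [
            (to_bin(a, *angle_range, n_bins), to_bin(l, *length_range, n_bins))
            for a, l in segs
        ]
        first_states.append(states[0])
        for cur, nxt in zip(states, states[1:]):
            # Both next values are conditioned on the joint current state.
            next_angle[cur].append(nxt[0])
            next_length[cur].append(nxt[1])
    return first_states, next_angle, next_length


def sample_profile(model, n_segments, rng):
    """Generate a sequence of (angle_bin, length_bin) states."""
    first_states, next_angle, next_length = model
    state = rng.choice(first_states)
    out = [state]
    while len(out) < n_segments:
        if state not in next_angle:  # state never observed with a successor
            break
        state = (rng.choice(next_angle[state]), rng.choice(next_length[state]))
        out.append(state)
    return out
```

The univariate variant follows the same pattern with a single variable as the state. Converting bin indices back to continuous values (for example by sampling uniformly within the bin) is left out of this sketch.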

IV Results and Discussion

In this Section, we discuss the adherence of the real data to segmented lines in Section IV.1. The basic statistics of the real data are discussed in Section IV.2. In Section IV.3, we analyze the correlation between the parameters of the segmented lines describing the evolution of paper visualizations. In Section IV.4, we present a discussion on the clustering behavior of the segmented view profiles. Finally, in Section IV.5, we provide an evaluation of the models aimed at reproducing the behavior of paper visualizations along time.

IV.1 Segmented regression adherence

The first step of the analysis is the application of the segmented regression algorithm to all the obtained view profiles. In order to do so, each view profile was normalized to the range $[0, 1]$ in both the time and number of views axes. Therefore, the cumulative number of views of an article for the most recent month in the dataset (October of 2016) is always 1. A preliminary analysis by visual inspection suggested that most of the profiles contain only a few linear segments. In order to check which curves best adhere to a structure of segmented lines, we computed the Root Mean Square Error (RMSE) hyndman2006another between the curves derived from the algorithm and the original data. Figure 3 shows the distribution of the RMSE for the considered view profiles.

Figure 3: RMSE distribution for the segmented regressions of the visualization profiles. The red line indicates the threshold RMSE value; only the curves on the left were selected to be used in the experiments. With the established threshold, roughly 80% of all visualization curves were analyzed.
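The normalization and the RMSE-based adherence test used in this subsection can be sketched as follows. The min-max form of the normalization and the function names are assumptions, and the threshold is left as a parameter because its value is not reproduced here.

```python
import math


def normalize(values):
    """Min-max normalization to [0, 1] (assumed form of the normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def rmse(observed, fitted):
    """Root mean square error between a profile and its segmented fit."""
    n = len(observed)
    return math.sqrt(sum((o - f) ** 2 for o, f in zip(observed, fitted)) / n)


def adheres(observed, fitted, threshold):
    """True when the fit error is below the chosen RMSE threshold."""
    return rmse(observed, fitted) < threshold
```

Because both axes are normalized before fitting, RMSE values are comparable across articles with very different total numbers of views, which is what makes a single global threshold meaningful.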

To test if the curves adhere to the segmented regression model, we chose a conservative threshold based on the obtained RMSE values. More specifically, we selected only the curves for which the regression resulted in an RMSE value lower than this threshold (shown as a red line in Figure 3). Figure 4 shows two examples of view profiles (green dots), characterized by a higher (a) and lower (b) adherence to the model, and their respective piecewise curves (red segments) together with the breakpoints (gray vertical lines). About 80% of the curves have been found to pass the RMSE test and were selected for further analyses.

(a) Example of a valid profile for segmented regression (RMSE = 0.006).
(b) Example of an invalid profile for segmented regression (RMSE = 0.0142).
Figure 4: Examples of visualization profiles and corresponding linear-piecewise curves. The red lines show the segmented regression result, the green points show the original values, and the gray vertical lines mark the breakpoints.

In order to validate the hypothesis that real-world profiles tend to present a polygonal structure, control synthetic profiles were generated. For each original profile, a synthetic one with the same number of points was created. Synthetic views were generated following a discrete uniform distribution for the number of views an article received in a month, with upper limit $N$, where $N$ is the largest number of views among all articles in the real data. When the segmented method was applied to the synthetic profiles with the same parameters, we found an adherence similar to that of the original profiles: about 78% of the curves passed the RMSE test. Despite this, there is a significant difference between the angle distributions of the original and synthetic piecewise curves (see Figure S1 of the Supplementary Information (SI)). The latter is centered on 45-degree values. The differences between the synthetic profiles and the real curves mean that the observed view curves for PLoS ONE articles cannot be explained by this simple stochastic process.
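A control profile of this kind could be generated as in the sketch below. The lower limit of zero for the monthly views and the function name are assumptions. Since the expected monthly increment is constant, the cumulative curve is close to a straight diagonal after normalization, which is consistent with synthetic angles concentrating around 45 degrees.

```python
import random


def synthetic_profile(n_points, max_views, rng):
    """Cumulative views built from uniformly random monthly views.

    Monthly views are drawn from a discrete uniform distribution on
    0..max_views (the lower limit of 0 is an assumption).
    """
    cumulative, total = [], 0
    for _ in range(n_points):
        total += rng.randint(0, max_views)
        cumulative.append(total)
    return cumulative
```

The resulting list can then be normalized and passed through the same segmented regression and RMSE test as the real profiles.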

IV.2 Basic Statistics

We analyzed some basic statistics of the considered view curves, i.e., those that passed the segmented regression test described in Sections III.2 and IV.1. Figure 5 shows the lifetime and cumulative views distributions of the view curves. The lifetime is the number of years since a certain article was published (up to 2016), and its distribution is shown in Figure 5(a). The distribution of the number of views per paper is shown in Figure 5(b). As expected, we also found a moderate correlation between lifetime and number of views (see Figure S5 of the SI).

(a) Lifetime distribution
(b) Views distribution
Figure 5: Distributions of (a) lifetime (in years) and (b) number of views of the selected visualization profiles (after removing outliers). For a given profile, lifetime is the period from its publication to the moment the data was collected, and views refer to the total paper views along its lifetime.

Most of the curves passing the RMSE test have been found to be modeled best by five linear segments. The proportions of curves found to be described by 2, 3, 4 and 5 intervals are 0.1%, 4.5%, 26.4% and 69.0%, respectively. This is not a surprise, since the segmentation algorithm chooses the best number of segments, and the tendency is to have more segments to better adhere to the original data. We also found that the lifetime average (and standard deviation) seems to increase with the number of segments of the curves, as shown in Figure S2 of the SI. This effect indicates that, as could be expected, the intricacy of the view profiles tends to increase with time.

From the segmented patterns, we extracted the parameters describing the curves according to the proposed segmentation method, more specifically, the angles $\alpha_i$ and segment lengths $l_i$. The respective parameter distributions are shown in Figure S4 of the SI. We found that the angles tend to become smaller along the consecutive segments, possibly reflecting the loss of visibility with time. The distributions of segment lengths become narrower with time, meaning that the monthly views of articles tend to change more rapidly as the article gets older.

IV.3 Joint Distributions

An important step to understanding how subsequent segments are related is the analysis of the joint distributions among their parameters. In particular, we focus our analysis on the bivariate distributions of the angles $\alpha$ and lengths $l$ for subsequent segments.

Figure 6 shows the joint density plots for the lengths $l_i$ and $l_{i+1}$ of subsequent segments and their respective Pearson correlations, $\rho$, which displayed moderate negative values. In general, the first segment is small, as illustrated in Figure 6(a), which can be an effect of the time of response, given by outside factors, such as dissemination on the web, conferences, advertisements in magazines, posts on social networks, media coverage, the popularity of the topic, and citations from other papers. Furthermore, similar outcomes were found for the angles $\alpha_i$ and $\alpha_{i+1}$ (see Figure S3 of the SI). In this case, we observe correlations that are not particularly strong. Two groups seem to be present in all the plots. A more detailed analysis of the groups in the profiles is performed in Section IV.4.

(a) $\rho$ = -0.54
(b) $\rho$ = -0.47
(c) $\rho$ = -0.46
(d) $\rho$ = -0.44
Figure 6: Joint density plots and corresponding marginal probabilities of $l_i$ and $l_{i+1}$ for profiles with five segments. $\rho$ is the Pearson correlation.

We also considered the relationships between the angles $\alpha$ and lengths $l$ at subsequent segments. The ellipse plots shown in Figure 7 represent the correlations between the different parameter combinations of subsequent segments. In more detail, the colors and inclination of the ellipses indicate the sign of the correlations: negative correlations are plotted in blue and positive ones in red. The larger the correlation magnitude, the stronger the color tone and the more elongated the ellipse. We found stronger correlations between $\alpha$ and $l$ in the initial segments compared to those found for the subsequent segments. In general, within a given linear segment $i$, the correlations are negative: high inclination angles occur jointly with short segments, and lower inclination angles tend to occur for longer segments. In contrast, between the length of one segment and the angle of the next, the correlation is positive. This result indicates that short segments tend to be followed by low inclination angles and long segments by high inclination angles. A possible explanation of this effect could be related to a sudden increase in visualizations following the dissemination of the paper. In general, this surge of interest is not sustained along time.

Figure 7: Correlations between the variables of the piecewise curves shown as ellipses. The Pearson correlation coefficients were calculated by considering only the curves with five segments.
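The pairwise Pearson coefficients summarized in Figure 7 can be computed as in the following sketch. The layout of each curve as one flat parameter vector (all angles followed by all lengths) and the function names are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_matrix(curves):
    """curves: list of parameter vectors, one per curve, all the same size.

    Returns the matrix of Pearson correlations between every pair of
    parameters, computed across curves.
    """
    columns = list(zip(*curves))
    return [[pearson(a, b) for b in columns] for a in columns]
```

Each off-diagonal entry of the resulting matrix corresponds to one ellipse in a plot such as Figure 7, with the sign giving the color and the magnitude giving the elongation.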

IV.4 Clustering Analysis

Before proposing a stochastic model to reproduce the observed view profiles, it is interesting to check if we can find different patterns of view profiles that could be understood as clusters. This type of analysis can give us insights regarding the proposed model. More specifically, knowing about the existence of groups, it is possible to create separate models incorporating the particularities of each group. For that, a clustering analysis was performed on the measured parameters of the segmented curves. Groups were obtained by running a Hierarchical Clustering algorithm murtagh2014ward with the Ward linkage criterion ward1963hierarchical and considering Euclidean distances. We set the number of clusters to three since other values led to less defined groups. For that analysis, each curve is represented by its set of segment parameters, namely the angles and lengths of its segments.

The number of parameters defining the curves is at least four, corresponding to the simplest case in which only one breakpoint exists. Thus, for visualization purposes, we employed a Principal Component Analysis (PCA) projection to reduce the dimensionality of the set of parameters gewers . The panels in Figure 8 show the clusters of curves for each of the adopted numbers of segments. In each of these scatterplots, each point represents a curve in the projected space. The marginal distributions help to distinguish between the overlapping groups. The blue and orange groups are more separated and well-defined, while the red group tends to be distributed between the other two groups. The first two principal components account for 97.6% of the total variance in Figure 8(a), 75.86% in Figure 8(b), 58.86% in Figure 8(c), and 43.24% in Figure 8(e). In the case of 2 and 3 segment clusters, two principal components are enough to explain most of the variance of the data. However, for 4 and 5 segments, the third principal component is also needed. This component is shown in Figures 8(d) and 8(f) for these cases.

(a) The 2-segment profiles clusters.
(b) The 3-segment profiles clusters.
(c) The 4-segment profiles clusters (PCA1 and PCA2).
(d) The 4-segment profiles clusters (PCA1 and PCA3).
(e) The 5-segment profiles clusters (PCA1 and PCA2).
(f) The 5-segment profiles clusters (PCA1 and PCA3).
Figure 8: Scatter plots showing the clusters of the visualization profiles and the respective marginal distributions. The points correspond to the PCA projection of the angle and length of the segments of the profiles.

In general, the obtained clusters tend to overlap more as the number of segments increases. The marginal distributions are more separated when the curves have two or three segments in the PCA, their peaks being considerably distinct (Figures 8(a) and 8(b)). When the curves have four or five segments, the distributions become flatter and indicate a larger overlap among the groups (Figure 8(d)). One possible explanation for the change in the clustering structure is the progressive increase of visualization profile types with the number of segments.

In order to better understand the identified groups, we generated averaged curves representing each of the clusters (shown in Figure 9). Given a cluster of visualization profiles, its corresponding average curve is generated by taking the average of all the curves belonging to that cluster. In that figure, the dashed and solid lines represent the average curves and the curves approximated by the algorithm, respectively. In the following, we identify the two main types of profiles. The first type begins with a relatively low slope, then it changes to a higher value, and decreases along its final portion (one or two times). In the second type, the slopes always decrease. Figures 9(b) and 9(c) present these two types of profiles for curves with three and four segments. Figure 9(a) also shows well-defined groups, but it is a simple case where the two-segment lengths can differentiate them.

(a) Average curves of the 2-segment profile clusters.
(b) Average curves of the 3-segment profile clusters.
(c) Average curves of the 4-segment profile clusters.
(d) Average curves of the 5-segment profile clusters.
Figure 9: Average curves obtained for the clusters of visualization profiles. The x-axis represents time, and the y-axis represents the cumulative number of views over time. Dashed and solid lines respectively represent the average curves and the curves approximated by the algorithm. The groups are colored following the same pattern employed in Figure 8. In order to avoid mixing curves with very different lifetimes, only those with lifetimes between 5 and 7 years are used to generate the averaged curves.

IV.5 Models Adherence

In this section, we check whether the proposed models can reproduce the joint distributions of the curve parameters (segment angle and length), as expressed in terms of their principal component projections, shown in Figure 10. Observe that each row in this figure corresponds to one of the four considered models (see Section III.3). The marginal densities are also depicted along the respective axes. For each of the four types of models, the cases corresponding to each of the three identified clusters were adjusted separately, and then combined when obtaining the principal component projection.

Figure 10: Comparison of the original profiles and the synthetic profiles produced according to the considered models. The three columns correspond respectively to the original data distributions along the PCA axes, the synthetic distributions, and the difference between them. Note that the differences in density at the center of plot (c) are very small compared to the differences caused by the peaks of the original profiles. Note also that, for each model, a new PCA projection is obtained, since it incorporates both the real and the synthetic data.

Then, the obtained density surfaces were compared by calculating the absolute point-to-point differences between the original and synthetic 2D histograms of the PCA data (third column in Figure 10), and then adding all these values into a single error parameter $E$, computed as

$$E = \sum_{i,j} \left| O_{i,j} - S_{i,j} \right|,$$

where $O$ and $S$ are the surfaces corresponding to the original and synthetic data, respectively, and the indices $i$ and $j$ run over the histogram bins. Lower values of $E$ indicate more accurate models.
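As a rough illustration of this error measure, the sketch below builds two normalized 2D histograms on the same grid and sums their absolute bin-wise differences. The helper names (`histogram2d`, `absolute_error`), the number of bins, the value range and the toy points are all assumptions made for the example; they are not taken from the paper.

```python
def histogram2d(points, bins, lo, hi):
    """Normalized 2D histogram of (x, y) points on a bins x bins grid over [lo, hi]."""
    grid = [[0.0] * bins for _ in range(bins)]
    width = (hi - lo) / bins
    for x, y in points:
        # Clamp indices so that points on or beyond the borders fall in the edge bins
        i = min(max(int((x - lo) / width), 0), bins - 1)
        j = min(max(int((y - lo) / width), 0), bins - 1)
        grid[i][j] += 1.0 / len(points)
    return grid

def absolute_error(original, synthetic):
    """Sum of absolute bin-wise differences between two equally sized histograms."""
    return sum(
        abs(o - s)
        for row_o, row_s in zip(original, synthetic)
        for o, s in zip(row_o, row_s)
    )

# Toy PCA projections: the "bad" synthetic data concentrates in a bin the real data never visits
real = [(0.1, 0.1), (0.1, 0.1), (0.9, 0.9), (0.9, 0.9)]
bad = [(0.1, 0.9)] * 4

h_real = histogram2d(real, bins=2, lo=0.0, hi=1.0)
h_bad = histogram2d(bad, bins=2, lo=0.0, hi=1.0)

error_same = absolute_error(h_real, h_real)
error_bad = absolute_error(h_real, h_bad)
```

Because both histograms are normalized to sum to one, the error in this sketch ranges from 0 (identical densities) to 2 (completely disjoint densities).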

As expected, the Null model resulted in the worst approximation of the curves (as seen in Figure 10(c)), and the quality of the models increases as additional statistical information is incorporated. The independent distribution model, shown in Figure 10 (d-f), approximates the original profiles better, even though it does not consider conditional probabilities (i.e. memory). However, this model is unable to capture the medium-scale details of the original distribution, and it also fails to preserve the cluster structure.

The models based on Markov-1 take into account not only the distributions of the individual parameters but also the dependence between the parameters of consecutive segments. The Markov-1 univariate synthetic profiles resulted in better approximations of the real view profiles (Figure 10 g-i). Finally, the Markov-1 multivariate model produced the synthetic profiles most similar to their original counterparts (Figure 10 j-l). The bivariate distribution of angle and length provides the best approximation of the original data among the considered models, as it yields the smallest error value. This suggests that not only the angles of previous segments but also their lengths are important inputs for predicting the parameters of the next segments.
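The idea of a first-order (Markov-1) dependence between consecutive segments can be sketched as below. This is a simplified illustration, not the estimation procedure of Section III.3: it assumes that each segment has already been discretized into a hashable state (for the multivariate case, a hypothetical `(angle_bin, length_bin)` tuple), and it samples the next state from the empirically observed successors of the current one.

```python
import random
from collections import defaultdict

def fit_markov1(profiles):
    """Collect empirical first states and transitions between consecutive segments.

    profiles: list of profiles, each a list of hashable segment states,
              e.g. (angle_bin, length_bin) tuples (the binning is an assumption).
    """
    first_states = []
    transitions = defaultdict(list)
    for profile in profiles:
        first_states.append(profile[0])
        for prev, nxt in zip(profile, profile[1:]):
            transitions[prev].append(nxt)
    return first_states, transitions

def sample_profile(first_states, transitions, n_segments, rng):
    """Sample a synthetic profile; stops early if a state has no observed successor."""
    state = rng.choice(first_states)
    profile = [state]
    while len(profile) < n_segments:
        successors = transitions.get(state)
        if not successors:
            break
        # Repeated entries in `successors` make frequent transitions more likely
        state = rng.choice(successors)
        profile.append(state)
    return profile

# Toy observed profiles with coarse angle/length bins
observed = [
    [("low", "short"), ("high", "long"), ("low", "long")],
    [("high", "short"), ("low", "long")],
]

rng = random.Random(0)
first_states, transitions = fit_markov1(observed)
synthetic = sample_profile(first_states, transitions, n_segments=3, rng=rng)
```

Using only the angle bin as the state gives a univariate variant, while the tuple state makes the next angle and length depend jointly on the previous angle and length.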

All in all, the obtained results confirm to a good extent the initial hypothesis that the view profiles are neither trivial nor random, and that they have an intrinsic structure that was progressively captured by modeling approaches taking into account additional information. This suggests that there are interesting real-world effects and mechanisms underlying the observed types of structure. In particular, there are memory effects and time dependencies, in the sense that the properties of one segment tend to be correlated with those of subsequent segments. These effects could hypothetically be related to tendencies such as a brief surge of interest, e.g. caused by media dissemination, being followed by a longer period of lower inclination.

V Concluding Remarks

Science is inherently a collaborative endeavor. Therefore, the speed of dissemination of new ideas has a relevant impact on the development of novel theories and experiments. Traditionally, the impact of papers has been studied in terms of the number of citations, but since the World Wide Web became the main medium for publishing papers, the development of new data aggregation tools has led to the definition of many alternative metrics [26]. One of the simplest of such metrics is the number of page views. Measuring page views is relatively simple and can usually be done with arbitrary granularity (hourly, daily, monthly, etc.). Compared to citations, the number of views also tends to respond with a much shorter delay to important events such as publication and conference presentation.

Here we studied the monthly number of views for articles published in the PLoS ONE journal. A key observation regarding the number of views of the articles in the PLoS ONE dataset was used in the analysis: articles tend to display periods of relatively constant number of monthly views, with sharp changes in views between such periods. This hypothesis was investigated throughout the work by considering the cumulative number of article views. If the hypothesis is true, the cumulative number of views should be correctly represented by a piece-wise linear function.
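The link between constant monthly views and a piece-wise linear cumulative curve can be illustrated with a small, purely hypothetical example: two periods with constant monthly views produce a cumulative curve made of two straight segments whose slopes equal the respective monthly rates.

```python
from itertools import accumulate

# Hypothetical monthly views: 3 months at 10 views/month, then 3 months at 2 views/month
monthly_views = [10, 10, 10, 2, 2, 2]

# Cumulative number of views at the end of each month
cumulative = list(accumulate(monthly_views))

# Month-to-month increments of the cumulative curve recover the local slopes
slopes = [b - a for a, b in zip([0] + cumulative, cumulative)]

# A breakpoint occurs wherever the slope changes between consecutive months
breakpoints = [m for m in range(1, len(slopes)) if slopes[m] != slopes[m - 1]]
```

In real data the monthly counts are noisy, so the breakpoints cannot be read off exactly as in this toy case and have to be estimated by regression.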

A segmented least squares regression methodology [19] was applied to identify breakpoints between linear segments in the cumulative article views. The length and angle of each segment were then measured and used as parameters of four models for generating synthetic article view profiles. The models took into account progressively more information about the profiles, so as to allow the identification of the most relevant properties.
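Once the breakpoints are available, the segment parameters can be derived in a few lines. The sketch below is an assumption-laden illustration: it takes the segment length to be the time span of the segment and the angle to be the arctangent of its slope, without any axis normalization, which may differ from the exact definitions used in the methodology. The function name `segment_parameters` and the example breakpoints are hypothetical.

```python
import math

def segment_parameters(breakpoints):
    """Return (angle, length) for each segment of a piece-wise linear curve.

    breakpoints: list of (time, cumulative_views) points, including both endpoints.
    Assumption: length is the time span and angle is atan(slope), in radians.
    """
    params = []
    for (x0, y0), (x1, y1) in zip(breakpoints, breakpoints[1:]):
        angle = math.atan2(y1 - y0, x1 - x0)
        length = x1 - x0
        params.append((angle, length))
    return params

# Toy profile: 12 months growing by 1 view/month, then 24 months without new views
points = [(0, 0), (12, 12), (36, 12)]
params = segment_parameters(points)
```

Since angles depend on the scale of both axes, in practice the time and view axes would need to be normalized consistently before angles from different articles are compared.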

Several interesting results were obtained. It was verified that the segmented regression led to a lower RMSE for the real profiles than for synthetic profiles generated from a model with randomly generated numbers of monthly views. This indicates that representing the cumulative number of article views by a piece-wise linear function leads to a relatively low regression error, so the profiles can be modeled by linear segments. Another important result was the observation of two main types of visualization profiles, one corresponding to a relatively low initial slope followed by a higher slope, and another presenting only slopes that decrease with time. Better interpreting these groups requires additional metadata, which is left as a future development of this work. Regarding the synthetic models, it was found that curves generated from angles and lengths sampled independently, with the same distributions as the real data, led to profiles that approximated the real profiles well. Taking into account conditional probabilities between subsequent segments, and between angle and length, led to improved models.

For future developments, additional metadata about the papers can be taken into account in the analysis. In particular, it would be interesting to investigate how the view dynamics change according to the subject area and the authors' institutions. It would also be interesting to associate social network data with the observed profiles. For instance, one could verify whether the identified breakpoints correlate with messages about the article published by power users [10] in a social network. It is also worth investigating how bibliometric networks can affect view patterns over time [7, 6]. Finally, we could also analyze whether other factors, such as the visibility of authors and topics, affect the patterns of view profiles of papers [5, 16, 1, 18].


This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001. C. H. Comin thanks FAPESP (grant no. 18/09125-4) for financial support. F. N. Silva acknowledges CAPES and FAPESP (grant no. 15/08003-4). H. F. de Arruda acknowledges FAPESP for sponsorship (grants 2018/10489-0 and 2019/16223-5). D. R. Amancio thanks FAPESP (grant no. 16/19069-9) and CNPq (grant no. 304026/2018-2). L. da F. Costa thanks CNPq (grant no. 307085/2018-0) and NAP-PRP-USP for support. This work has been supported also by the FAPESP grant 15/22308-2.


  • [1] D. R. Amancio. Comparing the topological properties of real and artificially generated scientific manuscripts. Scientometrics, 105(3):1763–1779, 2015.
  • [2] D. R. Amancio, O. N. Oliveira Jr, and L. F. Costa. Three-feature model to reproduce the topology of citation networks and the effects from authors’ visibility on their h-index. Journal of Informetrics, 6(3):427–434, 2012.
  • [3] J. Bollen, H. Van de Sompel, J. A. Smith, and R. Luce. Toward alternative metrics of journal impact: A comparison of download and citation data. Information processing & management, 41(6):1419–1440, 2005.
  • [4] L. Bornmann. Do altmetrics point to the broader impact of research? an overview of benefits and disadvantages of altmetrics. Journal of Informetrics, 8(4):895–903, 2014.
  • [5] E. A. Corrêa Jr, F. N. Silva, L. F. Costa, and D. R. Amancio. Patterns of authors contribution in scientific manuscripts. Journal of Informetrics, 11(2):498–510, 2017.
  • [6] L. F. Costa. Learning about knowledge: A complex network approach. Physical Review E, 74(2):026103, 2006.
  • [7] H. F. de Arruda, F. N. Silva, L. F. Costa, and D. R. Amancio. Knowledge acquisition: A complex networks approach. Information Sciences, 421:154–166, 2017.
  • [8] J. C. de Winter. The relationship between tweets, citations, and article views for PLoS ONE articles. Scientometrics, 102(2):1773–1779, 2015.
  • [9] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, et al. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
  • [10] G. Eysenbach. Can tweets predict citations? Metrics of social impact based on Twitter and correlation with traditional metrics of scientific impact. Journal of Medical Internet Research, 13(4):e123, 2011.
  • [11] F. Galligan and S. Dyas-Correia. Altmetrics: Rethinking the way we measure. Serials review, 39(1):56–61, 2013.
  • [12] F. L. Gewers, G. R. Ferreira, H. F. de Arruda, F. N. Silva, C. H. Comin, D. R. Amancio, and L. F. Costa. Principal component analysis: A natural approach to data exploration. arXiv, 1804.02502(1):1–33, 2018.
  • [13] W. Huang, P. Wang, and Q. Wu. A correlation comparison between altmetric attention scores and citations for six PLoS journals. PLoS ONE, 13(4):1–15, 2018.
  • [14] R. J. Hyndman and A. B. Koehler. Another look at measures of forecast accuracy. International journal of forecasting, 22(4):679–688, 2006.
  • [15] J. Ioannidis, K. W. Boyack, H. Small, A. A. Sorensen, and R. Klavans. Bibliometrics: Is your most cited work your best? Nature News, 514(7524):561, 2014.
  • [16] V. Larivière, N. Desrochers, B. Macaluso, P. Mongeon, A. Paul-Hus, and C. R. Sugimoto. Contributorship and division of labor in knowledge production. Social Studies of Science, 46(3):417–435, 2016.
  • [17] V. Larivière, C. Ni, Y. Gingras, B. Cronin, and C. R. Sugimoto. Bibliometrics: Global gender disparities in science. Nature News, 504(7479):211, 2013.
  • [18] C. Lu, Y. Bu, X. Dong, J. Wang, Y. Ding, V. Larivière, C. R. Sugimoto, L. Paul, and C. Zhang. Analyzing linguistic complexity and scientific impact. Journal of Informetrics, 13(3):817–829, 2019.
  • [19] V. M. Muggeo. Estimating regression models with unknown break-points. Statistics in medicine, 22(19):3055–3071, 2003.
  • [20] V. M. Muggeo and G. Adelfio. Efficient change point detection for genomic sequences of continuous measurements. Bioinformatics, 27(2):161–166, 2010.
  • [21] F. Murtagh and P. Legendre. Ward’s hierarchical agglomerative clustering method: which algorithms implement ward’s criterion? Journal of classification, 31(3):274–295, 2014.
  • [22]
  • [23] F. D. Peat. From certainty to uncertainty: The story of science and ideas in the twentieth century. Joseph Henry Press, 2002.
  • [24] J. Priem, H. A. Piwowar, and B. M. Hemminger. Altmetrics in the wild: Using social media to explore scholarly impact. arXiv preprint arXiv:1203.4745, 2012.
  • [25] X. Shuai, A. Pepe, and J. Bollen. How the scientific community reacts to newly submitted preprints: Article downloads, Twitter mentions, and citations. PLoS ONE, 7(11), 2012.
  • [26] P. Sud and M. Thelwall. Evaluating altmetrics. Scientometrics, 98(2):1131–1143, 2014.
  • [27] M. Thelwall. Mendeley reader counts for US computer science conference papers and journal articles. Quantitative Science Studies, pages 1–13, 2019.
  • [28] M. Thelwall, S. Haustein, V. Larivière, and C. R. Sugimoto. Do altmetrics work? Twitter and ten other social web services. PLoS ONE, 8(5), 2013.
  • [29] L. Waltman. A review of the literature on citation impact indicators. Journal of Informetrics, 10(2):365–391, 2016.
  • [30] X. Wang, C. Liu, Z. Fang, and W. Mao. From attention to citation, what and how does altmetrics work? arXiv preprint arXiv:1409.4269, 2014.
  • [31] J. H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American statistical association, 58(301):236–244, 1963.

Supplementary Information

Figure S1: Angle distributions of real-world (green) and control (red) synthetic profiles. The distributions are clearly different, and a higher deviation in angles is observed for the real data.
(a) Lifetime distribution for 2-segment profiles.
(b) Lifetime distribution for 3-segment profiles.
(c) Lifetime distribution for 4-segment profiles.
(d) Lifetime distribution for 5-segment profiles.
Figure S2: Distributions of lifetime grouped by profiles with the same number of segments. Lifetime average and standard deviation are shown at the upper right corner.
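(a) Pearson correlation = -0.08
(b) Pearson correlation = -0.36
(c) Pearson correlation = -0.26
(d) Pearson correlation = 0.09
Figure S3: Joint density plots and corresponding marginal probabilities of the segment parameters for profiles with five segments. The value given for each panel is the Pearson correlation.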
Figure S4: Distribution of the segment parameters among all the obtained curves.
Figure S5: Bivariate distribution of lifetime and views. The Pearson correlation coefficient between these two variables is .