Quantification and Analysis of Scientific Language Variation Across Research Fields

by Pei Zhou, et al.

Quantifying differences in terminologies from various academic domains has been a longstanding problem yet to be solved. We propose a computational approach for analyzing linguistic variation among scientific research fields by capturing the semantic change of terms based on a neural language model. The model is trained on a large collection of literature in five computer science research fields, for which we obtain field-specific vector representations for key terms, and global vector representations for other words. Several quantitative approaches are introduced to identify the terms whose semantics have drastically changed, or remain unchanged across different research fields. We also propose a metric to quantify the overall linguistic variation of research fields. After quantitative evaluation on human annotated data and qualitative comparison with other methods, we show that our model can improve cross-disciplinary data collaboration by identifying terms that potentially induce confusion during interdisciplinary studies.




I Introduction

Language use varies among people with different backgrounds. In scientific literature, linguistic variation is common across scientific fields: scholars with different backgrounds often use the same term to express entirely different meanings. In this paper, we mainly consider different research fields within one subject (Computer Science), such as Natural Language Processing (NLP) and Computer Networks and Communication (Comm), instead of different subjects. This is because research fields within one subject usually share more terms, whose semantic changes may lead to ambiguity [1]. Consider the term alignment, which often refers to the matching of signals in the field of computer communications but is more typically related to the translation of words or sentences in NLP research. Other representative examples include embedding, semantic, and grid. Given this issue, quantitative analysis of the linguistic variation of scientific terms provides a clearer understanding of how concepts are expressed in different scientific fields and reduces confusion in interdisciplinary communication. Moreover, such computational methods can also reveal how the overall language trends of research fields diverge.

Fig. 1: Visualization of scientific term embeddings using our model after projecting on two dimensional space.

While computational methods for analyzing linguistic variation have attracted much attention recently, past work mainly focused on the geographic, temporal, and social aspects of language [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. Studies on these aspects mainly identify, over the entire vocabulary, the word representations that change dramatically along the corresponding aspect. However, existing work has not tackled the field variation of language that modifies the meanings of scientific terms across research fields, nor has it used these terms to reveal the general inter-field divergence of language.

In this paper, we propose a computational method to analyze the linguistic variation in scientific research fields based on the quantification of semantic changes of crucial scientific terms. We extend the neural language model [12, 13] with a partially-localized mechanism to capture field-specific embeddings for frequent key terms and preserve the global representations for other general words. Fig. 1 shows the embedded word vectors for key terms in five domains after reducing the dimension of the embeddings, where we can see clear separation between domains. We also propose metrics to quantify the linguistic variation of scientific fields based on the semantic distances of key terms. The proposed approach hence helps keep track of the variation of term expressions and provides evidence for speculations about the diversity of language usage across different research areas.

II Modeling

We begin with the formalization of the corpus $C$, which is a sequence of words. We use $F$ to denote the set of fields. $C$ is partitioned into disjoint field-specific corpora, i.e., $C = \bigcup_{f \in F} C_f$. We use $V$ to denote the vocabulary of words, for which $V_g$ is the global vocabulary, and $V_t$, which is disjoint with $V_g$, is the vocabulary of field-specific terms, where $V_t = \bigcup_{f \in F} V_f$. In practice, $V_f$ can be predefined by selecting the most frequent terms from the titles of papers in field $f$. Note that for every pair of fields $f_1, f_2 \in F$, $t^{f_1}$ and $t^{f_2}$ represent one term $t$ that occurs in two field-specific corpora and thus has two different embeddings. We establish two disjoint vocabularies in order to localize embeddings for key terms and keep generic embeddings for other sentence elements. For a word $w$, $c(w)$ is the context of $w$, i.e., the $m$ words on either side of $w$, where $m$ is half of the context size, and each word in $c(w)$ belongs to either $V_g$ or $V_t$.
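One way to realize the two disjoint vocabularies in practice is to rewrite each key term of a field-specific corpus into a field-tagged token before training, so that a standard embedding trainer allocates one vector per (term, field) pair while every other word keeps a single shared vector. The sketch below illustrates this idea under stated assumptions: the `term@field` tag format, the helper name `localize_tokens`, and the toy sentences are hypothetical and are not taken from the paper.

```python
from typing import Iterable, List, Set


def localize_tokens(tokens: Iterable[str], field: str, field_terms: Set[str]) -> List[str]:
    """Tag key terms with their field; leave all other words untouched.

    Tagged terms end up with one embedding per field, while untagged words
    share a single global embedding across all field-specific corpora.
    """
    return [f"{tok}@{field}" if tok in field_terms else tok for tok in tokens]


# Toy field-specific sentences (hypothetical, for illustration only).
nlp_sentence = "we learn the alignment of words".split()
comm_sentence = "we estimate the alignment of signals".split()
key_terms = {"alignment"}  # stands in for a field-specific term vocabulary

nlp_localized = localize_tokens(nlp_sentence, "NLP", key_terms)
comm_localized = localize_tokens(comm_sentence, "Comm", key_terms)

print(nlp_localized)
print(comm_localized)
```

After this preprocessing, `alignment@NLP` and `alignment@Comm` are distinct vocabulary entries, whereas words such as `the` remain shared.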

II-A Neural Language Model

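We extend the CBOW model [12] to capture both global embeddings of words and field-specific embeddings of scientific terms. The training objective of the model is to learn word vector representations that maximize the log likelihood of each word or term $w$ given its context $c(w)$, summed over the field-specific corpora $C_f$ of all fields $f \in F$:

$$J = \sum_{f \in F} \sum_{w \in C_f} \log P\big(w \mid c(w)\big),$$

for which the conditional likelihood of word $w$ over a field-specific context $c(w)$ is defined as:

$$P\big(w \mid c(w)\big) = \frac{\exp\big(\mathbf{v}_w^\top \mathbf{v}_{c(w)}\big)}{\sum_{w' \in V} \exp\big(\mathbf{v}_{w'}^\top \mathbf{v}_{c(w)}\big)},$$

where $\mathbf{v}_w$ is the embedding of word $w$, the sum in the denominator runs over the whole vocabulary $V$, and the context $c(w)$ is represented as the mean of the contained word embeddings, i.e., $\mathbf{v}_{c(w)} = \frac{1}{|c(w)|} \sum_{w' \in c(w)} \mathbf{v}_{w'}$. Following [12], we adopt batched negative sampling to obtain a computationally efficient approximation of the log likelihood.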
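To make the likelihood concrete, the following minimal sketch computes a softmax conditional likelihood with the context represented as the mean of its word embeddings, together with a negative-sampling surrogate loss. It is an illustration under assumptions rather than the paper's implementation: it uses a single embedding table (typical CBOW implementations keep separate input and output tables), and the function names and toy 2-d vectors are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def conditional_likelihood(target: str, context: List[str], emb: Dict[str, Vector]) -> float:
    """Full-softmax P(target | context), with the context as a mean embedding."""
    context_vec = mean_vector([emb[w] for w in context])
    scores = {w: dot(v, context_vec) for w, v in emb.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    normalizer = sum(math.exp(s - top) for s in scores.values())
    return math.exp(scores[target] - top) / normalizer


def log_sigmoid(x: float) -> float:
    return -math.log1p(math.exp(-x))


def negative_sampling_loss(target: str, context: List[str],
                           negatives: List[str], emb: Dict[str, Vector]) -> float:
    """Negative-sampling surrogate for -log P(target | context)."""
    context_vec = mean_vector([emb[w] for w in context])
    loss = -log_sigmoid(dot(emb[target], context_vec))
    for neg in negatives:
        loss -= log_sigmoid(-dot(emb[neg], context_vec))
    return loss


# Toy embeddings (hypothetical values); "alignment@NLP" is a localized term.
emb = {
    "alignment@NLP": [1.0, 0.2],
    "word": [0.9, 0.1],
    "sentence": [0.8, 0.3],
    "signal": [0.1, 1.0],
}
context = ["word", "sentence"]
probs = {w: conditional_likelihood(w, context, emb) for w in emb}
loss = negative_sampling_loss("alignment@NLP", context, ["signal"], emb)
print(probs)
print(loss)
```

The full softmax is shown only because it matches the likelihood directly; with a realistic vocabulary, the negative-sampling loss is what makes training tractable.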

II-B Quantification Approaches

In this section, we present several methods to quantify the linguistic variation for terms and research fields.

II-B1 Term Variation

We consider both Cosine Similarity:

$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}^\top \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|}$$

and Jaccard Similarity for quantitative analysis. To get the Jaccard Similarity, we find the $k$ most similar words or terms, ranked by cosine similarity, for a term $t^f$ in $V_f$, based on which the variation of the term between two fields $f_1$ and $f_2$ is calculated as a Jaccard Distance. This metric aims to capture the second-order similarity of the term across fields. We consider the set of most similar words in order to capture the latent semantics of the term:

$$d_J(t^{f_1}, t^{f_2}) = 1 - J\big(N_k(t^{f_1}), N_k(t^{f_2})\big),$$

where $N_k(t^f)$ denotes the set of the $k$ words most similar to $t^f$ by cosine similarity, and $J$ is the Jaccard Index, which measures the similarity of two sets as the ratio of their intersection to their union. The intersection is calculated by aggregating the cosine similarities of the words that appear in both $N_k(t^{f_1})$ and $N_k(t^{f_2})$, and the union is calculated by summing up the cosine similarities of the words that appear in either set.

In practice, a higher $k$ measures the variation of a term more comprehensively by considering the semantic distances of more related words in each field, at a higher computational cost.
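The neighborhood-based comparison can be sketched as follows. This is a simplified illustration, not the paper's exact metric: it uses the plain set-based Jaccard distance and omits the cosine-similarity weighting of intersection and union described above, it restricts neighbor candidates to a list of shared global words so that the two neighbor sets are comparable, and all names and toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Set

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_neighbors(query: str, candidates: List[str],
                    emb: Dict[str, Vector], k: int) -> Set[str]:
    """Return the k candidates with the highest cosine similarity to the query."""
    ranked = sorted(candidates,
                    key=lambda w: cosine_similarity(emb[query], emb[w]),
                    reverse=True)
    return set(ranked[:k])


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |A intersect B| / |A union B| (unweighted; defined as 0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Toy embeddings (hypothetical values): one term localized in two fields,
# plus four shared global words that serve as neighbor candidates.
emb = {
    "alignment@NLP": [1.0, 0.1],
    "alignment@Comm": [0.1, 1.0],
    "translation": [0.9, 0.2],
    "sentence": [0.8, 0.3],
    "signal": [0.2, 0.9],
    "antenna": [0.1, 0.8],
}
global_words = ["translation", "sentence", "signal", "antenna"]

nlp_neighbors = top_k_neighbors("alignment@NLP", global_words, emb, k=2)
comm_neighbors = top_k_neighbors("alignment@Comm", global_words, emb, k=2)
variation = jaccard_distance(nlp_neighbors, comm_neighbors)
print(nlp_neighbors, comm_neighbors, variation)
```

In this toy setting the two neighbor sets do not overlap, so the term receives the maximum variation; a term used consistently across fields would share most neighbors and score close to zero.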

II-B2 Field Variation

We quantify the overall linguistic variation of research fields based on the field-specific term vocabularies $V_f$. For each term $t^{f_1}$ in the vocabulary $V_{f_1}$ of a first field $f_1$, we first find the set $N_k(t^{f_1})$ of its $k$ most similar words. Then the subset of $N_k(t^{f_1})$ that contains terms of the second field $f_2$ is obtained. Based on that subset, we aggregate a term-to-field semantic similarity measure, which takes into account the frequency of each word $w$ in the subset, i.e., the number of occurrences of $w$ in the corresponding corpus divided by the total number of words in that corpus.

Then the semantic distance between two fields is defined based on the aggregated term-to-field similarities for all terms in $V_{f_1}$, for which we apply normalized exponential scaling [14] with a normalizing constant to amplify the differences between the measures for different fields.
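Two ingredients named in this subsection are easy to illustrate in isolation: the relative frequency of a word in a corpus and a normalized exponential scaling of raw scores. The sketch below shows only these two building blocks and does not attempt to reproduce how they are combined into the field distance. Treating "normalized exponential scaling" as a softmax-style normalization is an assumption, and the function names and toy numbers are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def relative_frequencies(tokens: List[str]) -> Dict[str, float]:
    """Occurrences of each word divided by the total number of words."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def normalized_exponential(scores: Dict[str, float]) -> Dict[str, float]:
    """Exponentiate the scores and divide by a normalizing constant.

    Assumption: softmax-style scaling, which enlarges the gaps between
    scores while keeping their order and making them sum to one.
    """
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {key: math.exp(value - top) for key, value in scores.items()}
    normalizer = sum(exps.values())
    return {key: value / normalizer for key, value in exps.items()}


# Toy corpus and toy raw scores (hypothetical values).
freq = relative_frequencies("the grid of the cache".split())
raw_scores = {"CV": 1.0, "Arch": 3.0, "Comm": 2.0}
scaled = normalized_exponential(raw_scores)
print(freq)
print(scaled)
```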

III Evaluation

In this section, we evaluate the proposed approach for analyzing term-level and field-level linguistic variation on a large collection of scientific literature. We collect human annotations on term variation and compare the results of our model with those of separately trained CBOW and GEODIST [3].

III-A Dataset and Model Configuration

We train our model on research papers in five fields of computer science: NLP, Computer Vision (CV), Databases and Data Mining (DBDM), Computer Architecture (Arch), and Computer Networks and Communication (Comm). We select around ten mainstream conferences and journals for each field based on the Google Scholar Publication Ranking List (https://scholar.google.com/citations?view_op=top_venues&hl=en&vq=eng), and collect the papers published in these venues in the past seven years. The corpora contain the plaintext contents of over 56K academic papers.

We populate the field-specific term vocabularies using the article titles: stop words are removed and the 200 most frequent terms are selected for each field. Since a title should reflect the main topic of its paper, this collection naturally contains a set of very popular scientific terms. We set the dimensionality of the embeddings to 100, the context size to 24, and the negative sample size to 5.
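The term selection described above can be sketched in a few lines. The values in `CONFIG` are the ones stated in the text; everything else is an assumption made for illustration, including the whitespace tokenization, the tiny stop-word list, the helper name `select_field_terms`, and the toy titles.

```python
from collections import Counter
from typing import Iterable, List, Set

# Hyperparameters as stated in the text.
CONFIG = {
    "terms_per_field": 200,
    "embedding_dim": 100,
    "context_size": 24,
    "negative_samples": 5,
}

# Illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS: Set[str] = {"a", "an", "the", "of", "for", "and", "in", "on", "with", "to"}


def select_field_terms(titles: Iterable[str], top_n: int,
                       stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Return the top_n most frequent non-stop-word tokens found in the titles."""
    counts = Counter(
        token
        for title in titles
        for token in title.lower().split()
        if token not in stop_words
    )
    return [term for term, _ in counts.most_common(top_n)]


# Toy titles (hypothetical, not from the paper's corpus).
nlp_titles = [
    "Neural Machine Translation with Attention",
    "Word Alignment for Machine Translation",
    "A Survey of Word Embedding",
]
top_terms = select_field_terms(nlp_titles, top_n=3)
all_terms = select_field_terms(nlp_titles, top_n=CONFIG["terms_per_field"])
print(top_terms)
```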

III-B Variation of Terms

Term          Avg. variation   Highest variation (fields)
alignment     0.9215           0.9634 (NLP-Comm)
relations     0.8923           0.9654 (NLP-Arch)
translation   0.8905           0.9773 (NLP-Arch)
mapping       0.8819           0.9817 (NLP-Arch)
embedding     0.8780           0.9533 (NLP-Comm)
embed         0.8675           0.9785 (NLP-Arch)
pattern       0.8670           0.9804 (CV-Arch)
feature       0.8621           0.9667 (NLP-Arch)
grid          0.8509           0.9445 (NLP-CV)
semantic      0.8396           0.9679 (NLP-Arch)
TABLE I: Terms with most significant semantic change.

Term           Avg. variation
linux          0.4721
cryptography   0.5283
multicore      0.5309
imaging        0.5352
intel          0.5452
multiprocess   0.5512
wireless       0.5566
hardware       0.5618
telecom        0.5639
server         0.5664
TABLE II: Terms with least semantic change.

Term   Most Similar Words Per Field
TABLE III: Most similar words of some terms in Table I and II.

To evaluate the semantic change of a term $t$, we calculate the variation between the field-specific embeddings of this term in two fields, i.e., $t^{f_1}$ and $t^{f_2}$. Specifically, we compute the variation of one term for all ten pairs of the five fields and average these distances to quantify the overall variation of the term. We set the neighborhood size $k$ to 10,000 during the evaluation, which aggregates semantics from a large neighborhood of each term. Table I shows the ten terms with the most significant overall semantic change, together with the pair of fields between which the change is largest, while Table II shows the ten terms with the least semantic change. The model detects that the meanings of terms like alignment and embedding vary considerably across fields, whereas those of terms like hardware and server remain consistent. Table III presents the most similar words of some terms from Table I and Table II in each of the five fields. Notably, terms with a high variation score indeed have very different meanings and usages across different fields.
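The per-term summary reported in the tables (an average over all field pairs plus the pair with the largest variation) can be sketched as below. The five field names come from the text; the helper name and the per-pair scores are hypothetical placeholders, since the actual scores come from the trained embeddings.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, Tuple

FIELDS = ["NLP", "CV", "DBDM", "Arch", "Comm"]
FIELD_PAIRS = list(combinations(FIELDS, 2))  # 5 choose 2 = 10 unordered pairs


def summarize_term_variation(
    pair_variation: Dict[Tuple[str, str], float]
) -> Tuple[float, Tuple[str, str], float]:
    """Return (average variation, field pair with the highest variation, that value)."""
    average = mean(pair_variation.values())
    top_pair = max(pair_variation, key=pair_variation.get)
    return average, top_pair, pair_variation[top_pair]


# Hypothetical per-pair variation scores for a single term.
toy_scores = {pair: 0.80 + 0.01 * i for i, pair in enumerate(FIELD_PAIRS)}
average, top_pair, top_score = summarize_term_variation(toy_scores)
print(len(FIELD_PAIRS), average, top_pair, top_score)
```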

III-C Comparison on Human Annotated Data

Methods            Pearson correlation   nDCG at rank 30   nDCG at rank 50
Separate CBOW–JS   0.492                 0.753             0.839
GEODIST–Cosine     -0.488                0.115             0.406
Our Model–Cosine   0.777                 0.815             0.858
Our Model–JS       0.779                 0.817             0.858
TABLE IV: Comparison with Human Annotations.

To quantitatively evaluate the term variation found by our model, we randomly choose 50 terms and instruct a group of 30 Computer Science PhD students to annotate them as semantically varied (+1) or not (-1) across the five research fields. We make sure each term is annotated by at least five PhD students, sum their annotations up, and divide the sum by 5 to get the annotated term variation. We collect all annotations for the 50 terms and compare them with the corresponding variation returned by different methods, as shown in Table IV. As a baseline, we consider a CBOW model trained separately on each of the five domains, with the variation calculated using the Jaccard Similarity described above. We also run the GEODIST model introduced by [3] on the corpus for comparison. We then test our model using both Cosine Similarity and Jaccard Similarity as similarity measures. For metrics, we use Pearson Correlation [15] and Normalized Discounted Cumulative Gain (nDCG) at rank 10 and 50 [16]. The results show that our model performs better than the baselines on all three metrics. The improvement is largest for Pearson Correlation (0.779 versus 0.492 for the separately trained CBOW baseline), which indicates that the calculated variation correlates linearly with the ground truth annotation. GEODIST fails to differentiate the semantic variation of the selected terms across different groups of literature, and thus obtains a negative correlation and low nDCG scores. We hypothesize that this is because, in the research field corpus, the language style of individual papers in the same field varies greatly, which makes it a noisier corpus than the Google Book Ngrams corpus used in [3]. Our model works well with the noisy data and also outperforms the separately trained CBOW baseline. We also observe that using Jaccard Similarity slightly improves the results (a Pearson Correlation of 0.779 versus 0.777), which suggests that second-order similarity metrics are arguably better than first-order metrics.
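The evaluation protocol can be sketched with standard-library Python. The annotation aggregation follows the text (sum of +1/-1 labels divided by 5). The rest is assumed for illustration: how gains are defined for nDCG is not specified, so this sketch maps the annotated scores from [-1, 1] to [0, 1], and the labels, model scores, and function names are hypothetical.

```python
import math
from typing import Dict, List


def annotated_variation(labels: List[int]) -> float:
    """Sum the +1/-1 labels of one term and divide by 5, as described in the text."""
    return sum(labels) / 5


def pearson(x: List[float], y: List[float]) -> float:
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ndcg_at_k(predicted: List[float], gains: List[float], k: int) -> float:
    """nDCG@k: rank items by predicted score, discount gains by log2(rank + 1)."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)[:k]
    dcg = sum(gains[i] / math.log2(rank + 2) for rank, i in enumerate(order))
    ideal = sorted(gains, reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical annotations (five labels per term) and hypothetical model scores.
labels: Dict[str, List[int]] = {
    "alignment": [1, 1, 1, 1, 1],
    "server": [-1, -1, -1, -1, -1],
    "grid": [1, 1, 1, -1, 1],
}
model_scores = {"alignment": 0.92, "server": 0.57, "grid": 0.80}

terms = list(labels)
annotated = [annotated_variation(labels[t]) for t in terms]
predicted = [model_scores[t] for t in terms]
gains = [(score + 1) / 2 for score in annotated]  # assumption: shift to [0, 1]

correlation = pearson(predicted, annotated)
ndcg = ndcg_at_k(predicted, gains, k=3)
print(annotated, correlation, ndcg)
```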

III-D Variation of Fields

Fig. 2: Heatmap of field distances.

We now analyze the overall linguistic variation of different research fields based on the field-level semantic distance, for which we set $k$ to 10,000. The results are shown in the heatmap of Fig. 2, in which darker colors indicate larger overall semantic changes (i.e., a higher distance). The results indicate that language usage in NLP and CV is the most similar. This is explainable because NLP and CV are both AI-related fields and share many research techniques. For DBDM, the corresponding linguistic variation from other fields is more significant. This can be explained by the fact that, on the one hand, data mining tasks employ many statistical techniques shared with AI research, and a portion of modern database research in distributed scenarios shares much common knowledge with computer and network systems. On the other hand, DBDM still differs considerably from the other fields. Another observation is that language usage is most dissimilar between the fields of NLP and CV on one side and the fields of Arch and Comm on the other. This conclusion agrees with the evaluation results for terms in Table I, where the distances between these fields are the highest for terms with high semantic change. This is because AI research topics are often considered distant from computer system-related research fields like Arch and Comm, which explains why the measures are significant from any of the former to any of the latter. Besides, we find that the semantic change between Arch and Comm is also relatively significant.

IV Related Work

In this section, we discuss related work on computational methods for linguistic variation. Recent work has paid much attention to using neural language models for corresponding analysis, which are respectively based on two lines of word representation mechanisms.

Diffusion-based. The diffusion-based mechanism categorizes the corpora by different context scenarios and obtains differentiated representations for words in each context scenario. Such methods have been used to analyze how the semantics of words vary across time periods or social groups [5, 4, 10, 2]. Our work is close to these methods, but instead of differentiating all words, we employ a partially-localized mechanism that localizes embeddings for a portion of important terms while keeping universal representations for other generic words. This is because our task naturally focuses on measuring the change of frequent scientific terms, which contribute most to the linguistic variation among scientific research fields.


Bias-based. Other work adopts the bias-based representation mechanism, which represents all words uniformly and overlays a scenario-specific bias vector on words that appear in the corresponding context scenario. For example, [3] captures the biases between UK and US English based on the corresponding literature, and [6] induces such biases on social media corpora that are tagged with more fine-grained geo-locations. Similarly, a user-specific bias is utilized by [9]. Because each context scenario often applies the same bias to all of its words, we prefer the diffusion-based representations: our task requires the semantic changes to be captured distinctively for each term in the same field.

In addition, we provide a quantitative evaluation for the task of analyzing scientific language variation across domains. Previous work has mainly shown qualitative analysis of geographic or temporal language variation. By comparing with human annotations, we show that our model can capture semantic change in cross-domain data. Moreover, we leverage the word-level quantification to measure the overall semantic differences of research fields.

V Conclusion

In this paper, we introduced a computational model for analyzing scientific linguistic variation across research fields. The model was trained on a large collection of literature from five computer science research fields to obtain partially-localized representations for terms. A series of metrics was provided to quantify the semantic change at both the term and field level. We evaluated the term variation found by our model by comparing it with human-annotated data and several baselines, and showed that our model captures the term variation most accurately. We believe that automatically detected term variation can help reduce confusion during interdisciplinary communication and support better cross-domain data collaboration.


  • [1] R. Kittredge and J. Lehrberger, Sublanguage: Studies of language in restricted semantic domains.   Walter de Gruyter, 1982.
  • [2] T. Hu, R. Song, P. Ding et al., “A world of difference: Divergent word interpretations among people,” in ICWSM, 2017.
  • [3] V. Kulkarni, B. Perozzi, and S. Skiena, “Freshman or fresher? quantifying the geographic variation of language in online social media.” in ICWSM, 2016, pp. 615–618.
  • [4] V. Kulkarni, R. Al-Rfou, B. Perozzi, and S. Skiena, “Statistically significant detection of linguistic change,” in Proceedings of the 24th International Conference on World Wide Web.   International World Wide Web Conferences Steering Committee, 2015, pp. 625–635.
  • [5] Y. Kim, Y.-I. Chiu, K. Hanaki, D. Hegde, and S. Petrov, “Temporal analysis of language through neural language models,” ACL 2014, p. 61, 2014.
  • [6] D. Bamman, C. Dyer, N. A. Smith et al., “Distributed representations of geographically situated language,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.
  • [7] D. Bamman, J. Eisenstein, and T. Schnoebelen, “Gender identity and lexical variation in social media,” Journal of Sociolinguistics, vol. 18, no. 2, pp. 135–160, 2014.
  • [8] B. O’Connor, J. Eisenstein, E. P. Xing et al., “Discovering demographic language variation,” in MLSC, 2010.
  • [9] Z. Zeng, Y. Yin, Y. Song, and M. Zhang, “Socialized word embeddings,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence.   AAAI Press, 2017, pp. 3915–3921.
  • [10] J. Eisenstein, B. O’Connor, N. A. Smith, and E. P. Xing, “Diffusion of lexical change in social media,” PloS one, vol. 9, no. 11, p. e113114, 2014.
  • [11] ——, “A latent variable model for geographic lexical variation,” in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing.   Association for Computational Linguistics, 2010, pp. 1277–1287.
  • [12] T. Mikolov, K. Chen, G. Corrado et al., “Efficient estimation of word representations in vector space,” ICLR Workshops Track, 2013.
  • [13] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
  • [14] O. Mondain-Monval, A. Espert, P. Omarjee, J. Bibette, F. Leal-Calderon, J. Philip, and J.-F. Joanny, “Polymer-induced repulsive forces: Exponential scaling,” Physical review letters, vol. 80, no. 8, p. 1778, 1998.
  • [15] F. Galton, “Regression towards mediocrity in hereditary stature.” The Journal of the Anthropological Institute of Great Britain and Ireland, vol. 15, pp. 246–263, 1886.
  • [16] K. Järvelin and J. Kekäläinen, “Cumulated gain-based evaluation of ir techniques,” ACM Transactions on Information Systems (TOIS), vol. 20, no. 4, pp. 422–446, 2002.
  • [17] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of machine Learning research, vol. 3, no. Jan, pp. 993–1022, 2003.