
Determining Standard Occupational Classification Codes from Job Descriptions in Immigration Petitions

by Sourav Mukherjee, et al.

Accurate specification of standard occupational classification (SOC) code is critical to the success of many U.S. work visa applications. Determination of correct SOC code relies on careful study of job requirements and comparison to definitions given by the U.S. Bureau of Labor Statistics, which is often a tedious activity. In this paper, we apply methods from natural language processing (NLP) to computationally determine SOC code based on job description. We implement and empirically evaluate a broad variety of predictive models with respect to quality of prediction and training time, and identify models best suited for this task.





I Introduction

The process of obtaining U.S. work visas has become increasingly difficult in recent years. Data from the U.S. Citizenship and Immigration Services (USCIS) [1] show that between Fiscal Year (FY) 2015 and FY 2019, the approval rate of H-1B visa petitions declined substantially. In response to some (though not all) visa petitions, the USCIS requests further information by issuing a Request For Evidence (RFE) [2]. Per [1], the percentage of petitions that resulted in RFEs increased dramatically between FY 2015 and FY 2019, reaching 40.2% in FY 2019. Further, the percentage of petitions with RFEs that were approved decreased significantly during this time frame, reaching a low of 62.4% in FY 2018. In FY 2020, approval rates showed some improvement but were still lower than those in FY 2015; similarly, the RFE issuance rate decreased somewhat but was still higher than that in FY 2015 [3]. We note that the issuance of an RFE, the petitioner’s response, and the subsequent review of that response by the USCIS add significant delays even for petitions that are ultimately approved. Per the USCIS, the most common reason for RFE issuance is the petitioner (employer) failing to establish that the position is a specialty occupation [2] (please refer to [4] and [5] for explanations of this term). Therefore, accurate characterization of job positions is critical to the successful and timely completion of visa petitions.

The U.S. Bureau of Labor Statistics (BLS) has created the Standard Occupational Classification (SOC) [6] to categorize jobs into 867 occupational categories. Each category is denoted by an SOC code. The mapping of SOC codes to categories is given in [7]. To minimize the chances of a visa petition getting delayed (due to an RFE) or denied, it is important that the petitioner specify the SOC code that best describes the duties associated with the position. Typically, this requires careful reading of the job description and SOC code definitions to find the best match. Although SOC codes are organized hierarchically to facilitate search, the process is tedious and results in an enormous repetitive workload for immigration law firms.

In this paper, we focus on the problem of algorithmically determining SOC codes based on job descriptions. Applying techniques from natural language processing (NLP), we build a variety of predictive models that accept free form textual descriptions of job duties as input, and yield SOC code as output. Using real world data, we empirically evaluate these models with respect to quality of prediction and training time, and identify models that are best suited for this task.

The rest of this paper is organized as follows. Section II reviews related work. In Section III, we describe our approach. Section IV presents our experimental evaluation, whose results are interpreted in Section V. Finally, Section VI summarizes the paper and discusses future directions.

II Related Work

Machine learning methods have been applied in the past to various problems in the legal domain [8, 9, 10], such as outcome forecasting [11, 12, 13, 14, 15], document discovery [16, 17], document categorization [18, 19, 20, 21, 22, 23, 24] and legal drafting [25, 26, 27].

Application of machine learning methods to immigration law is a much newer area of research. The problem of predicting outcomes of refugee claims has been considered in [28, 29]. In contrast, our paper focuses on work visa applications as opposed to refugee claims, and seeks to programmatically select SOC codes as opposed to predicting case outcomes.

In [30], two problems related to work visa applications are considered, namely, categorization of supporting documents of visa petitions, and drafting responses in reaction to Requests For Evidence (RFE). Our work is different from [30] in that we focus on identifying SOC codes programmatically in an effort to proactively reduce the chances of RFE issuance.

Interestingly, application of natural language processing to determine SOC codes has been studied in the epidemiological context. Specifically, the SOCcer (Standardized Occupation Coding for Computer-assisted Epidemiologic Research) model [31] predicts SOC code based on industry, job title, and job tasks. Our work is different from SOCcer in the following ways. First, while SOCcer is trained and evaluated using health-related datasets, we focus on data related to work visa petitions. Second, while SOCcer uses an ensemble of classifiers, three of which are based on job title, one on industry, and one on task, we seek to predict SOC code using description (i.e., tasks and responsibilities) alone. This is due to our observation that in work visa related data, job titles do not map to SOC codes in a consistent way and that the number of distinct SOC codes associated with an industry such as the software industry is huge. Third, unlike SOCcer, we benchmark a broad variety of models and compare them in terms of accuracy and training time. Finally, our benchmarking includes two different text vectorization approaches, namely sparse vectorization (using TF-IDF n-grams [32]) and dense vectorization (using doc2vec [33], a neural network).

The next section describes our approach in detail.

III Methodology

We begin by formally defining our problem.

III-A Problem Statement


Input:

  1. Σ: a finite alphabet. Σ⁺ denotes the set of all non-empty strings over Σ. In this paper, we focus on strings that are job descriptions expressed as free form text.

  2. L: a finite set of labels. In this paper, SOC codes are treated as labels.

  3. D = {(x₁, y₁), …, (xₘ, yₘ)}: a labeled dataset of size m, where each xᵢ ∈ Σ⁺ is a job description, and yᵢ ∈ L is its corresponding SOC code.

Output: A function f: Σ⁺ → L which maps a job description x ∈ Σ⁺ to an SOC code f(x) ∈ L, such that f minimizes the expected error with respect to some loss function.
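In standard statistical-learning notation, this objective can be written as follows (a sketch; ℓ denotes the chosen loss function):

```latex
f^{*} \;=\; \operatorname*{arg\,min}_{f \,:\, \Sigma^{+} \to L} \;\; \mathbb{E}_{(x,\, y)}\!\left[\, \ell\big(f(x),\, y\big) \,\right]
```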

From a pragmatic standpoint, we want such a function to be available as a web service (i.e., web API) which accepts a request containing a description x and produces a response containing the predicted SOC code f(x).

III-B Approach

Our approach may be described as a sequence of steps as follows.

III-B1 Text Vectorization

Since a majority of machine learning algorithms assume inputs to be real valued vectors, predictive modeling based on text often requires vectorizing the text, i.e., computing real valued vector representations of the text. We consider two different vectorization techniques, which are as follows.

TF-IDF n-grams

An n-gram (n ≥ 1) is a sequence of n tokens. Given N (N ≥ 1), a corpus of text in Σ⁺ can be used to compute the vocabulary of all n-grams where 1 ≤ n ≤ N. Subsequently, any string s ∈ Σ⁺ may be represented as a vector of counts, i.e., term frequencies (TF) of the n-grams present in s. Such a vector representation of a string is typically sparse, i.e., most of its components are zero, since most n-grams in the vocabulary are typically absent from it. To offset the effect of highly frequent n-grams with little semantic value, the vectors are weighted by inverse document frequencies (IDF), resulting in TF-IDF n-gram representations. While TF-IDF representations have been found to achieve high accuracy in text categorization [32], the high dimensionality of the sparse vectors generally entails high computational costs for training predictive models.
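As a minimal sketch (the corpus and parameters below are illustrative, not the paper's), scikit-learn's TfidfVectorizer produces exactly this kind of sparse TF-IDF n-gram representation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of job descriptions (illustrative only).
docs = [
    "design and develop software applications",
    "develop and test software systems",
    "prepare financial reports and audit accounts",
]

# Unigrams and bigrams (N = 2), weighted by inverse document frequency.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# X is a sparse matrix: one row per document, one column per n-gram,
# with most entries zero.
print(X.shape[0])  # 3
```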


doc2vec

An alternative approach that addresses the issue of dimensionality consists of using neural architectures for vectorizing words [34] and strings [33], using contextual similarity to predict semantic similarity. The resulting representations are known as word embeddings and document embeddings, respectively, and the above neural architectures are referred to as word2vec and doc2vec, respectively. Embeddings computed by word2vec and doc2vec are typically of lower dimensionality compared to TF-IDF n-gram representations. Therefore, such embeddings are considered dense vector representations. Since job descriptions are strings of arbitrary length, we use doc2vec to compute dense vector representations of such descriptions.

III-B2 Predictive Modeling

For each type of vectorization, we train a set of standard classifiers for predicting SOC code, namely, k-nearest neighbors (KNN), Gaussian naïve Bayes (GNB), logistic regression (LR), linear support vector machine (LinearSVC), support vector machine with radial basis function kernel (SVC-RBF), decision tree (DT), and random forest (RF).
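This classifier suite can be sketched with scikit-learn as follows (the hyperparameters shown are common defaults, not necessarily those used in the paper):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# One instance per classifier family named above; each is trained once per
# vectorization (sparse TF-IDF n-grams or dense doc2vec embeddings).
classifiers = {
    "KNN": KNeighborsClassifier(),
    "GNB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
    "SVC-RBF": SVC(kernel="rbf"),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(n_estimators=100),
}
print(len(classifiers))  # 7
```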

III-B3 Evaluation and Model Selection

To evaluate the models, we use k-fold cross validation. The dataset is first divided into k slices (or folds) of (roughly) equal size. In each round of cross validation, a different slice is held out for testing while the remaining k − 1 slices are used for training. Several metrics are recorded in each round. At the end of k rounds of training and testing, these metrics are averaged and reported. These scores help identify the models best suited to the problem.
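This evaluation loop can be sketched with scikit-learn's cross_validate (synthetic data stands in for the vectorized job descriptions; the model and metrics match those used later in the paper):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for vectorized job descriptions with 3 labels.
X, y = make_classification(n_samples=200, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)

# 10-fold cross validation, recording several metrics per round.
scores = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=10,
    scoring=["accuracy", "precision_macro", "recall_macro", "f1_macro"],
)

# Average each metric over the 10 rounds.
mean_accuracy = scores["test_accuracy"].mean()
```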

III-B4 Deployment

Once a model has been selected, we deploy it as a web service which can accept a POST request whose body contains a job description in free form text and produce a response containing the predicted SOC code.
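The core of such a service can be sketched framework-agnostically; predict_soc_code, the JSON field names, and the returned code below are hypothetical stand-ins, not the paper's implementation:

```python
import json

def predict_soc_code(description: str) -> str:
    """Stand-in for the trained model's prediction function (hypothetical)."""
    return "15-1132"

def handle_post(body: bytes) -> bytes:
    """Core of the web service: parse the POST body containing a free-form
    job description, run the model, and return the predicted SOC code."""
    request = json.loads(body)
    soc_code = predict_soc_code(request["description"])
    return json.dumps({"soc_code": soc_code}).encode()

response = handle_post(b'{"description": "designs and develops software"}')
print(response)  # b'{"soc_code": "15-1132"}'
```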

The next section presents our empirical evaluation.

IV Evaluation

IV-A Dataset

Our dataset consists of 46,999 labeled instances, where each instance corresponds to a visa petition. For every instance, the relevant attributes include job title, job description, company name, SOC code (normalized) (which we will refer to as simply SOC code), and SOC occupation which is a moniker of the SOC code. We exclude company name from the model since we have found it to be irrelevant to the predictive task; moreover, the predictive model should be able to generalize to all companies. We exclude job title from the model as well because we have found many instances of the same job title being associated with different SOC codes in this dataset, suggesting that job title does not consistently map to SOC code. Therefore, we use job description as the only input to our models. Since SOC occupation is simply a moniker of SOC code, we use SOC code as the only output of our models.

It is worth noting that the distribution of SOC codes in this dataset is uneven. While the dataset includes abundant examples of the most common categories, less frequent codes may not have sufficient instances. To build a predictive model that is accurate for a majority of use cases, we focus on the 5 most frequent codes, which results in a dataset with 32,262 instances.
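Restricting a labeled dataset to its most frequent labels can be sketched with the standard library (the toy data and the choice of k below are illustrative, not the paper's):

```python
from collections import Counter

# Toy labeled instances (job description, SOC code); illustrative only.
data = [
    ("dev", "15-1132"), ("dev", "15-1132"), ("analyst", "15-1121"),
    ("analyst", "15-1121"), ("qa", "15-1199"), ("rare", "11-3021"),
]

# Keep only instances whose SOC code is among the k most frequent codes.
k = 2
top_codes = {code for code, _ in Counter(y for _, y in data).most_common(k)}
filtered = [(x, y) for x, y in data if y in top_codes]
print(len(filtered))  # 4
```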

IV-B Experimental Setup

Our experiments are implemented using Python 3 as the programming language, in interactive notebooks hosted on the Databricks platform. Other standard libraries used include Scikit-learn [35] for sparse vector representations and training classifiers, Gensim for doc2vec [36], Numpy [37] for numerical computations, Pandas [38] for tabular data processing, and Matplotlib [39] for plotting. We use Managed MLflow for deployment.

We have implemented and benchmarked 14 classifiers, 7 of which are based on the TF-IDF n-gram representation, while the rest are based on the doc2vec representation. These are compared in terms of training time, accuracy, precision, recall, and f1 score [40]. The values of these metrics are averaged over 10-fold cross validation and reported.

IV-C Hyperparameters

All hyperparameters used in this evaluation are manually tuned. Automatic parameter tuning is outside the scope of this paper and left as future work.

IV-C1 Vector Representation

For the TF-IDF n-gram representation, n-grams that occur in fewer than 10% of the instances or in greater than 90% of the instances are ignored. The resulting sparse vectors have a dimensionality of 858. For doc2vec, we use a dimensionality of 100.
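These frequency cutoffs correspond to the min_df and max_df parameters of scikit-learn's TfidfVectorizer; a toy sketch (the corpus is ours, for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# n-grams appearing in <10% or >90% of documents are ignored,
# mirroring the frequency cutoffs described above.
docs = [
    "software developer role",
    "software engineer role",
    "software analyst position",
]
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=0.1, max_df=0.9)
vectorizer.fit(docs)

# "software" occurs in all three documents (>90%), so it is dropped.
print("software" in vectorizer.vocabulary_)  # False
```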

IV-C2 Predictive Modeling

For k-nearest neighbor classifiers, we use a manually tuned number of neighbors. For random forest classifiers, we use an ensemble of 100 estimators.

IV-D Experimental Results

IV-D1 Accuracy

We measure accuracy as the fraction of predictions that are correct. Figure 1 compares the accuracies of the models being evaluated.

Fig. 1: Accuracy scores of SOC Code predictors.

IV-D2 Precision

In a binary classification problem, precision is defined as the fraction of all positive predictions that are correct. Since our problem involves more than two classes, we report the macro average, i.e., the average of precision scores measured with respect to each SOC code in the dataset [41]. Figure 2 shows the macro average precision scores of the models.

Fig. 2: Precision (macro average) of SOC Code predictors.

IV-D3 Recall

In a binary classification problem, recall is defined as the fraction of all positive instances that are correctly predicted as positive. Since our problem involves more than two classes, we report the macro average, i.e., the average of recall scores measured with respect to each SOC code in the dataset [42]. Figure 3 shows the macro average recall scores of the models.

Fig. 3: Recall (macro average) of SOC Code predictors.

IV-E F1 Score

In a binary classification problem, f1 score is defined as the harmonic mean of precision and recall. Since our problem involves more than two classes, we report the macro average, i.e., the average of f1 scores measured with respect to each SOC code in the dataset [43]. Figure 4 shows the macro average f1 scores of the models.
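All four metrics are available in scikit-learn; a toy sketch with hypothetical labels, using the macro averaging described above:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Toy multi-class labels (illustrative); macro averaging computes each
# metric per SOC code and then averages the per-code scores.
y_true = ["15-1132", "15-1121", "15-1132", "13-2011", "15-1121"]
y_pred = ["15-1132", "15-1132", "15-1132", "13-2011", "15-1121"]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="macro")
rec = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
print(acc)  # 0.8
```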

Fig. 4: F1 score (macro average) of SOC Code predictors.

IV-F Training Time

Finally, the time taken (in seconds) to train each model (averaged over 10-fold cross validation, as with all the other metrics) is shown in Figure 5.

Fig. 5: Training time of SOC Code predictors.

In the next section, we interpret these results.

V Discussion

Figure 1 shows that the TF-IDF n-gram based support vector classifier with radial basis function (SVC-RBF) achieves the highest classification accuracy of 0.813031. We also note that regardless of whether the text vectorization is sparse (TF-IDF n-grams) or dense (doc2vec), SVC-RBF and random forest classifiers achieve high accuracy in the neighborhood of 0.8.

Figures 1, 3, and 4 further indicate that the TF-IDF n-gram based SVC-RBF achieves the highest cross validation scores with respect to accuracy, recall, and f1 score. However, Figure 2 shows that the doc2vec based random forest achieves the highest precision score.

These results demonstrate that while support vector classifiers with radial basis functions and random forest classifiers are suitable models for SOC code prediction, the choice of representation (sparse vs. dense) may depend on the metric of highest importance.

Figure 5 shows that the high accuracy of the TF-IDF n-gram based SVC-RBF comes at the cost of high training time, which is significantly greater than that of all the other models considered in this study. On the other hand, the doc2vec based SVC-RBF requires much lower training time and yet achieves comparable accuracy. We note that the dimensionality of the sparse vectors is 858 while that of the dense vectors is 100, which is likely a contributing factor to this disparity in training time. We further observe that random forest classifiers, whether based on sparse or dense vectors, can be trained even more quickly while still achieving comparable accuracy.

Therefore, in a real world deployment, the choice of model may depend on the trade-off between training time and accuracy. Consider a scenario where, once an initial model has been deployed, more accurate models are trained in the background as more training data become available over time, allowing the web service to switch to such models when they are substantially more accurate. If there are time constraints on the initial deployment, random forest or doc2vec based SVC-RBF would provide a highly accurate model more quickly. Subsequently, if there are no time constraints on switching to newer models, then the TF-IDF n-gram based SVC-RBF may be preferable for later deployments. The next section concludes the paper.

VI Conclusion

Accurate determination of Standard Occupational Classification (SOC) codes is critical to the success and timely completion of U.S. work visa applications. In this paper, we have applied machine learning to reduce the repetitive workload of SOC code selection. Using methods from natural language processing, we have trained a variety of predictive models for determining SOC code based on job description. Using real world data, we have benchmarked these models with respect to quality of prediction and training time. Our results indicate that our approach yields highly accurate models that may be trained and deployed within reasonable timelines.

Several useful extensions of this work are possible. For example, the functionality of the models may be expanded to return a list of suggested SOC codes ranked by some confidence metric. Another improvement would be to incorporate statistical significance tests (e.g., Student’s t-test) into the model comparison process.
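One way to sketch the ranked-suggestion extension (the data, model choice, and helper function below are illustrative assumptions, not part of this paper's implementation) is via a classifier's predicted class probabilities:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy vectorized data with three SOC-code labels (illustrative only).
X = np.random.RandomState(0).rand(60, 5)
y = np.repeat(["15-1132", "15-1121", "13-2011"], 20)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def top_k_soc_codes(clf, x, k=3):
    """Return the k most probable SOC codes with their probabilities,
    ranked from most to least confident."""
    probs = clf.predict_proba([x])[0]
    order = np.argsort(probs)[::-1][:k]
    return [(clf.classes_[i], probs[i]) for i in order]

suggestions = top_k_soc_codes(clf, X[0], k=2)
```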


  • [1] U.S. Citizenship and Immigration Services, “I-129 - Petition for a Nonimmigrant Worker Specialty Occupations (H-1B) by Fiscal Year, Month, and Case Status: October 1, 2014 - September 30, 2020,” accessed: September 1, 2021.
  • [2] ——, “Understanding Requests for Evidence (RFEs): A Breakdown of Why RFEs Were Issued for H-1B Petitions in Fiscal Year 2018,” accessed: September 1, 2021.
  • [3] Berry Appleman & Leiden LLP, “H-1B approval rates ticked up in FY2020, but remained historically low,”, accessed: September 1, 2021.
  • [4] U.S. Citizenship and Immigration Services, “6.5 H-1B specialty occupations,”, accessed: September 1, 2021.
  • [5] U.S. Bureau of Labor Statistics, “Characteristics of H-1B Specialty Occupation Workers Fiscal Year 2014 Annual Report to Congress October 1, 2013 – September 30, 2014,”, accessed: September 1, 2021.
  • [6] U.S. Bureau of Labor Statistics, “Standard Occupational Classification,”, accessed: September 1, 2021.
  • [7] ——, “2010 SOC Definitions,”, accessed: September 2, 2021.
  • [8] H. Surden, “Machine learning and law,” 89 Wash. L. Rev. 87, 2014. [Online]. Available:
  • [9] N. Bansal, A. Sharma, and R. K. Singh, “A review on the application of deep learning in legal domain,” in Artificial Intelligence Applications and Innovations, J. MacIntyre, I. Maglogiannis, L. Iliadis, and E. Pimenidis, Eds. Cham: Springer International Publishing, 2019, pp. 374–381.
  • [10] D. Faggella, “Ai in law and legal practice – a comprehensive view of 35 current applications,”, accessed: September 1, 2021.
  • [11] T. W. Ruger, P. T. Kim, A. D. Martin, and K. M. Quinn, “The supreme court forecasting project: Legal and political science approaches to predicting supreme court decisionmaking,” Columbia Law Review, vol. 104, no. 4, pp. 1150–1210, 2004. [Online]. Available:
  • [12] A. D. Martin, K. M. Quinn, T. W. Ruger, and P. T. Kim, “Competing approaches to predicting supreme court decision making,” Perspectives on Politics, vol. 2, no. 4, pp. 761–767, 2004. [Online]. Available:
  • [13] D. Katz, I. Bommarito, and J. Blackman, “A general approach for predicting the behavior of the supreme court of the united states,” PLOS ONE, vol. 12, 12 2016.
  • [14] N. Aletras, D. Tsarapatsanis, D. Preotiuc-Pietro, and V. Lampos, “Predicting judicial decisions of the european court of human rights: a natural language processing perspective,” PeerJ Comput. Sci., vol. 2, p. e93, 2016.
  • [15] M. Medvedeva, M. Vols, and M. Wieling, “Using machine learning to predict decisions of the european court of human rights,” Artificial Intelligence and Law, vol. 28, pp. 237–266, 2019.
  • [16] E. Yang, D. A. Grossman, O. Frieder, and R. Yurchak, “Effectiveness results for popular e-discovery algorithms,” in Proceedings of the 16th edition of the International Conference on Artificial Intelligence and Law, ICAIL 2017, London, United Kingdom, June 12-16, 2017, J. Keppens and G. Governatori, Eds.   ACM, 2017, pp. 261–264. [Online]. Available:
  • [17] G. V. Cormack and M. R. Grossman, “Evaluation of machine-learning protocols for technology-assisted review in electronic discovery,” Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, 2014.
  • [18] M. A. Lemley and J. H. Walker, “Intellectual property litigation clearinghouse: Data overview,” Kauffman Symposium on Entrepreneurship and Innovation Data, 2007.
  • [19] F. Wei, H. Qin, S. Ye, and H. Zhao, “Empirical study of deep learning for text classification in legal document review,” in IEEE International Conference on Big Data, Big Data 2018, Seattle, WA, USA, December 10-13, 2018, N. Abe, H. Liu, C. Pu, X. Hu, N. K. Ahmed, M. Qiao, Y. Song, D. Kossmann, B. Liu, K. Lee, J. Tang, J. He, and J. S. Saltz, Eds.   IEEE, 2018, pp. 3317–3320. [Online]. Available:
  • [20] N. C. Silva, F. Braz, T. E. de Campos, A. B. S. Guedes, D. B. Mendes, D. A. Bezerra, D. B. Gusmao, F. B. S. Chaves, G. G. Ziegler, L. H. Horinouchi, M. U. Ferreira, P. H. Inazawa, V. H. D. Coelho, R. V. C. Fernandes, F. H. Peixoto, M. S. M. Filho, B. P. Sukiennik, L. Rosa, R. Silva, T. A. Junquilho, and G. Carvalho, “Document type classification for Brazil’s supreme court using a convolutional neural network,” in ICoFCS-2018, 2018.
  • [21] S. Undavia, A. Meyers, and J. Ortega, “A comparative study of classifying legal documents with neural networks,” in Proceedings of the 2018 Federated Conference on Computer Science and Information Systems, FedCSIS 2018, Poznań, Poland, September 9-12, 2018, ser. Annals of Computer Science and Information Systems, M. Ganzha, L. A. Maciaszek, and M. Paprzycki, Eds., vol. 15, 2018, pp. 515–522. [Online]. Available:
  • [22] Q. Lu, J. G. Conrad, K. Al-Kofahi, and W. Keenan, “Legal document clustering with built-in topic segmentation,” in Proceedings of the 20th ACM Conference on Information and Knowledge Management, CIKM 2011, Glasgow, United Kingdom, October 24-28, 2011, C. Macdonald, I. Ounis, and I. Ruthven, Eds.   ACM, 2011, pp. 383–392. [Online]. Available:
  • [23] L. O. de Colla Furquim and V. L. S. de Lima, “Clustering and categorization of brazilian portuguese legal documents,” in Computational Processing of the Portuguese Language - 10th International Conference, PROPOR 2012, Coimbra, Portugal, April 17-20, 2012. Proceedings, ser. Lecture Notes in Computer Science, H. de Medeiros Caseli, A. Villavicencio, A. J. S. Teixeira, and F. Perdigão, Eds., vol. 7243.   Springer, 2012, pp. 272–283. [Online]. Available:
  • [24] R. K. V and K. Raghuveer, “Article: Legal documents clustering using latent dirichlet allocation,” International Journal of Applied Information Systems, vol. 2, no. 6, pp. 27–33, May 2012, published by Foundation of Computer Science, New York, USA.
  • [25] J. Sprowl, P. Balasubramanian, T. Chinwalla, M. W. Evens, and H. Klawans, “An expert system for drafting legal documents,” in American Federation of Information Processing Societies: 1984 National Computer Conference, 9-12 July 1984, Las Vegas, Nevada, USA, ser. AFIPS Conference Proceedings, vol. 53.   AFIPS Press, 1984, pp. 667–673. [Online]. Available:
  • [26] K. D. Betts and K. R. Jaep, “The dawn of fully automated contract drafting: Machine learning breathes new life into a decades-old promise,” 15 Duke Law & Technology Review, pp. 216–233, 2017.
  • [27] S. Miller, “Benefits of artificial intelligence: what have you done for me lately?”, accessed: September 1, 2021.
  • [28] M. Dunn, L. Sagun, H. Sirin, and D. Chen, “Early predictability of asylum court decisions,” in Proceedings of the 16th edition of the International Conference on Artificial Intelligence and Law, ICAIL 2017, London, United Kingdom, June 12-16, 2017, J. Keppens and G. Governatori, Eds.   ACM, 2017, pp. 233–236. [Online]. Available:
  • [29] D. L. Chen and J. Eagel, “Can machine learning help predict the outcome of asylum adjudications?” in Proceedings of the 16th edition of the International Conference on Artificial Intelligence and Law, ICAIL 2017, London, United Kingdom, June 12-16, 2017, J. Keppens and G. Governatori, Eds.   ACM, 2017, pp. 237–240. [Online]. Available:
  • [30] S. Mukherjee, T. Oates, V. DiMascio, H. Jean, R. Ares, D. Widmark, and J. Harder, “Immigration document classification and automated response generation,” in 20th International Conference on Data Mining Workshops, ICDM Workshops 2020, Sorrento, Italy, November 17-20, 2020, G. D. Fatta, V. S. Sheng, A. Cuzzocrea, C. Zaniolo, and X. Wu, Eds.   IEEE, 2020, pp. 782–789. [Online]. Available:
  • [31] D. E. Russ, K. Y. Ho, J. S. Colt, K. R. Armenti, D. Baris, W. H. Chow, F. Davis, A. Johnson, M. P. Purdue, M. R. Karagas, K. Schwartz, M. Schwenn, D. T. Silverman, C. A. Johnson, and M. C. Friesen, “Computer-based coding of free-text job descriptions to efficiently identify occupations in epidemiological studies,” Occup Environ Med, vol. 73, no. 6, pp. 417–424, Jun 2016.
  • [32] T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in Machine Learning: ECML-98, 10th European Conference on Machine Learning, Chemnitz, Germany, April 21-23, 1998, Proceedings, ser. Lecture Notes in Computer Science, C. Nedellec and C. Rouveirol, Eds., vol. 1398.   Springer, 1998, pp. 137–142. [Online]. Available:
  • [33] Q. V. Le and T. Mikolov, “Distributed representations of sentences and documents,” in Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, ser. JMLR Workshop and Conference Proceedings, vol. 32, 2014, pp. 1188–1196. [Online]. Available:
  • [34] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013.
  • [35] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [36] R. Řehůřek and P. Sojka, “Software Framework for Topic Modelling with Large Corpora,” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.   Valletta, Malta: ELRA, May 2010, pp. 45–50,
  • [37] S. van der Walt, S. C. Colbert, and G. Varoquaux, “The numpy array: A structure for efficient numerical computation,” Computing in Science Engineering, vol. 13, no. 2, pp. 22–30, March 2011.
  • [38] W. McKinney, “Data structures for statistical computing in python,” 9th Python in Science Conference, pp. 51 – 56, 2010.
  • [39] J. D. Hunter, “Matplotlib: A 2d graphics environment,” Computing in Science Engineering, vol. 9, no. 3, pp. 90–95, May 2007.
  • [40] scikit-learn, “3.3. Metrics and scoring: quantifying the quality of predictions,”, accessed: September 1, 2021.
  • [41] ——, “sklearn.metrics.precision_score,”, accessed: September 1, 2021.
  • [42] ——, “sklearn.metrics.recall_score,”, accessed: September 1, 2021.
  • [43] ——, “sklearn.metrics.f1_score,”, accessed: September 1, 2021.