Qualitative Research support tools in Python
Grounded theory (GT) is a qualitative research method for building theory grounded in data. GT uses textual and numeric data and follows various stages of coding or tagging data for sense-making, such as open coding and selective coding. Machine Learning (ML) techniques, including natural language processing (NLP), can assist researchers in the coding process. Triangulation is the process of combining various types of data. ML can facilitate deriving insights from numerical data for corroborating findings from the textual interview transcripts. We present an open-source python package (QRMine) that encapsulates various ML and NLP libraries to support coding and triangulation in GT. QRMine enables researchers to use these methods on their data with minimal effort. Researchers can install QRMine from the python package index (PyPI) and can contribute to its development. We believe that the concept of computational triangulation will make GT relevant in the realm of big data.
Qualitative Research (QR) is undergoing a paradigmatic transformation with the increasing popularity of big data, machine learning and artificial intelligence [wiedemann2013opening]. Qualitative research methods such as Grounded Theory (GT), though heavily reliant on data, have a predominantly interpretive and subjective world view [glaser1967discovery]. The subjectivity in QR is its strength, although researchers from the quantitative domain see it as a weakness. Qualitative researchers have traditionally eschewed computational techniques and tools that are used for objective data analysis.
Natural language processing (NLP) and some of the numerical machine learning (ML) techniques can give researchers insights on qualitative and quantitative data without being too intrusive into the philosophical assumptions of interpretivism. We call this process computational triangulation (CT). However, these techniques may be technically challenging for social science researchers to use without a background in computer programming. In this article, we introduce QRMine [eapenbr2019qrmine] (pronounced Karmine), a python package that helps to reduce this technical barrier. The theoretical basis of CT will be discussed elsewhere.
QRMine is an open-source python package that wraps some of the popular NLP and ML libraries into an easy to use command-line tool. QRMine aligns with the philosophical assumptions and the traditional stages of coding in GT. The numerical ML techniques help in the computational triangulation of numerical data to corroborate emergent insights from textual qualitative data.
GT is a qualitative research method with an emphasis on “generating theory grounded in data that has been systematically collected and analysed” [strauss1990basics]. Some of the characteristics of GT that differentiate it from other qualitative methods are constant comparison by simultaneous collection and analysis of data, theoretical sampling [glaser1970theoretical] based on the emergent theory, and its emphasis on the ‘theoretical sensitivity’ [glaser1978advances2] or the insight of the researcher. The aim of GT is to converge towards a theory that adequately explains the phenomenon under study.
GT has been gaining traction due to the recent emphasis on data. However, researchers from the quantitative domain may consider GT chaotic and regard the theories that emerge from such studies as having limited validity, reliability and credibility [wasserman2009problematics]. The interpretive and subjective nature of GT makes analysis non-reproducible which is considered by some researchers as its weakness.
Coding is the process of tagging various types of data, to define what each segment of the data is about, to help in sensemaking and to converge towards analytic interpretations [charmaz2006constructing]. Coding is challenging for new researchers as it requires a lot of skill and experience to identify core concepts in the data and to identify their relationships with each other.
Coding in GT comprises several stages that may vary according to the specific GT tradition followed by the researcher [walker2006grounded]. In the open coding stage, data, often in the form of textual interview transcripts, are analyzed line-by-line to identify commonly occurring concepts or categories. Researchers are expected to employ a ‘constant comparison’ process during this step. Axial coding is the process of identifying relationships among the open codes in the form of ‘properties’ and ‘dimensions’ [walker2006grounded]. Selective coding identifies core categories that can represent a group of codes. Natural language processing (NLP) can assist the researcher in the coding process, especially when using a large corpus of text [nelson2017computational].
NLP is a field of ML and Artificial Intelligence (AI) concerned with the process of analyzing and interpreting large amounts of natural language data. Modern NLP relies on ML algorithms and probabilistic models. Some of the commonly used tasks in NLP can be directly applied to coding in GT. Lemmatization and named entity recognition (NER) can be used to identify and count commonly occurring concepts in a large corpus of text [nadeau2007survey]. This makes NER useful for identifying GT categories during the open coding stage.
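As a toy illustration of this idea (not QRMine's implementation, which relies on spaCy lemmatization and NER), frequently occurring content terms in a transcript can be surfaced as candidate open-coding categories:

```python
# Toy sketch: surface frequent content terms as open-coding category hints.
# A real pipeline would use lemmatization, stop-word lists and NER instead
# of the naive length-based filter below.
from collections import Counter
import re

def candidate_categories(text, top_n=3, min_len=4):
    """Return the top_n most frequent 'content' tokens as category hints."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # crude stand-in for lemmatization and stop-word removal
    content = [t for t in tokens if len(t) >= min_len]
    return Counter(content).most_common(top_n)

transcript = (
    "Patients worry about insulin. Insulin costs keep rising, "
    "and patients skip doses when costs rise."
)
print(candidate_categories(transcript))
```

Even this crude count pulls out ‘patients’, ‘insulin’ and ‘costs’ as recurring concepts; QRMine does the same kind of surfacing with linguistically informed tooling.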
Parts of speech tagging [voutilainen2003part] can be used to identify subject-verb-object triads, a useful technique for supporting axial coding. Topic modelling is useful during the stage of selective coding to identify representative concepts for a set of categories. A structured methodology for applying the various NLP methods for coding in grounded theory is called computational grounded theory (CGT) [nelson2017computational]. GT triangulates many other types of data in addition to text for theory building.
Triangulation refers to the use of multiple methods and data sources in qualitative research to develop a comprehensive understanding of phenomena [patton1990sampling]. Though triangulation can involve multiple investigators, methods, theories and sources, we emphasize the simultaneous triangulation of qualitative and quantitative data, which is the most challenging. The aim of triangulation is to corroborate findings from one stream of data with others. However, there is no universally accepted methodology for combining data streams [jick1979mixing].
Though a mixed-method multi-paradigm view is needed for combining numerical data with qualitative data, we posit that machine learning methods can be instrumental in including numerical data, especially big data, in qualitative enquiries without epistemological contradictions. QRMine applies NLP and ML to both textual and numerical data, helping researchers to derive complementary insights from both.
Generally, the triangulation of insights from numerical data is based on inferential statistics. Inferential statistical methods have a positivist ontology that is difficult to reconcile with a qualitative study without making it a multi-paradigmatic mixed-method study. Machine learning techniques (although numerical at the core) can be non-deterministic with a subjective interpretive worldview that aligns with qualitative research.
Another challenge in triangulation is the volume of big data and the sparsity of multidimensional data that make inferences difficult to deduce. The curse of dimensionality[donoho2000high] makes conventional statistical tests irrelevant in such situations. ML techniques can be useful in these circumstances to derive insights.
Next, we describe the design and usage of QRMine.
QRMine is an open-source python package that provides an easy to use wrapper around NLP and ML packages. QRMine makes coding of textual data and deriving insights from numerical data less challenging for non-technical researchers.
The text analysis functions use the textacy [dewilde2017textacy] and Spacy [honnibal2015spacy] python packages for tokenizing and part-of-speech tagging. In GT, it is recommended to ’Code for action’ during the open coding phase [charmaz2006constructing]. Hence QRMine treats repeating verbs as the categories in open coding. The coding dictionary is created by identifying adjacent concepts (properties) and their adjectives and adverbs (dimensions).
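The coding-dictionary scheme above can be sketched as follows. This is a simplified illustration rather than QRMine's actual code: verbs become candidate categories, neighbouring nouns become properties, and adjectives/adverbs become dimensions, with the part-of-speech tags supplied by hand here instead of coming from Spacy:

```python
# Toy sketch of a GT coding dictionary: verbs -> categories,
# adjacent nouns -> properties, adjectives/adverbs -> dimensions.
from collections import defaultdict

def coding_dictionary(tagged_tokens):
    """tagged_tokens: list of (word, pos) pairs in sentence order."""
    coding = defaultdict(lambda: {"properties": set(), "dimensions": set()})
    for i, (word, pos) in enumerate(tagged_tokens):
        if pos != "VERB":
            continue
        # inspect the immediate neighbours of each verb
        for j in (i - 1, i + 1):
            if 0 <= j < len(tagged_tokens):
                w, p = tagged_tokens[j]
                if p == "NOUN":
                    coding[word]["properties"].add(w)
                elif p in ("ADJ", "ADV"):
                    coding[word]["dimensions"].add(w)
    return dict(coding)

# hand-tagged example sentence: "patients skip doses when costs rise sharply"
tagged = [("patients", "NOUN"), ("skip", "VERB"), ("doses", "NOUN"),
          ("when", "ADV"), ("costs", "NOUN"), ("rise", "VERB"),
          ("sharply", "ADV")]
print(coding_dictionary(tagged))
```

The sketch treats every verb as a category; QRMine additionally requires the verbs to repeat across the corpus before treating them as open codes.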
Some of the other packages used by QRMine are as follows: VaderSentiment is used for the sentiment analysis of text [chauhan2018twitter, srinivasa2018natural]. Click is used to implement command-line inputs and to format outputs [click4qrmine]. Other packages used are imbalanced-learn for oversampling rare events, and mlxtend [raschka2018mlxtend].
Appropriate defaults are set for most parameters. Users can use command-line options to perform most of the analysis. The output is useful to derive insights from data. QRMine does not support the interactive coding of data. Many commercial software packages such as NVivo and Dedoose [lieber2013dedoose] are available if interactive coding is needed.
QRMine is not yet tested on real data. We hope that the research community will use it, report issues, and help us develop this open-source tool collaboratively.
QRMine is hosted at https://github.com/dermatologist/nlp-qrmine. The package is available from the python package index (PyPI.org) and can be installed using pip [see below].
QRMine depends on the spacy English language model, which is not available on PyPI. This can be installed after the previous step as shown below:
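Assuming the PyPI package name qrmine and the standard spacy model-download command (the exact model name is an assumption; check the QRMine documentation), the two installation steps look like:

```shell
# install QRMine from PyPI
pip install qrmine

# install the spacy English language model
# (the model name en_core_web_sm is an assumption)
python -m spacy download en_core_web_sm
```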
The various modules of QRMine package can be imported into any other python code or jupyter notebook as below.
The ReadData module provides functions for importing data, Qrmine implements the NLP functions, and MLQRMine implements the ML functions. The modules are described in the documentation [eapenbr2019qrmine].
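A sketch of such an import, based on the module names mentioned above (ReadData, Qrmine, MLQRMine); the exact import path should be confirmed against the package documentation:

```python
# requires `pip install qrmine`; module names taken from the text above
from qrmine import ReadData, Qrmine, MLQRMine

data = ReadData()   # import transcripts or csv files
qr = Qrmine()       # NLP functions: coding, topics, sentiment
ml = MLQRMine()     # ML functions: nnet, svm, kmeans, knn, pca
```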
The text input file is specified by the -i flag. Multiple input files can be supplied at the command-line with repeating -i flags. The results are directed to stdout by default, but can be sent to a file using the -o flag.
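Using the -i and -o flags described above (the command name qrmine is an assumption based on the package name), a typical invocation might be:

```shell
# analyze two transcripts and write the report to a file
qrmine -i interview1.txt -i interview2.txt -o report.txt
```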
QRMine can import individual documents or interview transcripts in a single text file with a ‘<break>TOPIC</break>’ tag separating topics or sections [see example below]. Multiple tags of the same type are supported. This is useful when the transcript includes a conversation with many participants as follows:
First interview with John.
Any number of lines with the transcribed text
<break>Interview_John</break>
Second interview with Jane.
More text.
<break>Interview_Jane</break>
Additional comments by John. Shows that the tag can be repeated.
<break>Interview_John</break>
Numeric data is supplied as a single csv file with the identifier as the first column, followed by the independent variables, and the dependent variable as the last column. The identifier in the first column can be text and can be used to link to documents or transcripts, while all other columns should be numeric [See example below].
index, obesity, bmi, exercise, fbs, has_diabetes
1, 0, 29, 1, 89, 1
2, 1, 32, 0, 92, 0
......
The package can be installed from the python repository with pip install qrmine. The word vectors for spacy have to be separately installed [spacyvector]. QRMine installs as a command line script and can be invoked directly. All command-line flags are documented [eapenbr2019qrmine].
For example, command-line options are available to show the top ten categories (for open coding) and to generate the coding dictionary (for axial coding). Another option lists the top three topics and assigns the documents to these topics.
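A sketch of such invocations follows. The flag names --cat, --codedict, --topics and --assign are assumptions for illustration; consult the QRMine documentation for the actual option names:

```shell
# top ten categories for open coding (flag names hypothetical)
qrmine -i interview.txt --cat 10

# coding dictionary for axial coding
qrmine -i interview.txt --codedict

# top three topics, with documents assigned to them
qrmine -i interview.txt --topics 3 --assign
```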
The coding dictionary, topics and topic assignments can be created from the entire corpus using the respective command-line options. Categories (concepts), summary and sentiment can be viewed for the entire corpus or for particular topics specified using the --titles flag. Sentence-level sentiment output is possible with the --sentence flag. Filtering documents based on sentiment, titles or categories is possible for further analysis, using the --filters or -f option.
The script below demonstrates how to find the sentiment of two segments of the text with the titles P5 and P7.
This is how a coding dictionary can be generated from documents having a positive sentiment.
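Hedged examples built from the flags named above (--titles and --filters are described in the text; the --sentiment and --codedict option names, and the way multiple titles are passed, are assumptions):

```shell
# sentiment of the two segments titled P5 and P7
# (--sentiment and repeated --titles are hypothetical)
qrmine -i transcripts.txt --titles P5 --titles P7 --sentiment

# coding dictionary restricted to documents with a positive sentiment
qrmine -i transcripts.txt -f positive --codedict
```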
The numeric functions supported are: neural network (--nnet), which displays the accuracy of the model after a certain number of epochs; support vector machine (--svm), producing the confusion matrix as output; k-means clustering (--kmeans), showing the cluster assignments; k-nearest neighbours (--knn), displaying the k nearest rows to a specified row; and principal component analysis (--pca), showing the factors. Many of the ML functions, such as the neural network, take a second argument (-n) as shown below.
The -n argument represents the number of epochs in nnet, the number of clusters in kmeans, the number of factors in pca, and the number of neighbours in KNN. KNN also takes the --rec or -r argument to specify the record [see below].
Variables from the csv can be selected using --titles (defaults to all).
The first variable will be ignored (index) and the last should be the dependent variable (DV).
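Sketches of the numeric invocations follow. The ML flags and the -n/-r arguments are taken from the text above; how the csv file is supplied is an assumption (a --csv flag is used here for illustration):

```shell
# neural network, reporting accuracy after 10 epochs
qrmine --csv data.csv --nnet -n 10

# k-means clustering with three clusters
qrmine --csv data.csv --kmeans -n 3

# the three nearest neighbours of record 5
qrmine --csv data.csv --knn -n 3 -r 5
```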
QRMine does not currently support graphical visualization of results, which may be informative in analyses such as KNN. The command-line interface with its various options may be daunting for new users. Converting transcript files to a text only format with topic tags may be time consuming. QRMine does not support data cleaning and may be sensitive to missing data. Methods for explicitly connecting qualitative codes to quantitative data points have not been implemented yet.
In the future, we expect to build visualization methods for displaying the relationship between concepts. A simple user interface may be useful. We plan to provide additional methods for qualitative to quantitative linking and for finding association rules.
Machine learning is traditionally seen as an extension of statistics with a quantitative mindset. ML is useful in inductive research to derive insights from textual and numerical data. It also provides tools for combining insights from multi-modal data which is important for theory building in areas such as social sciences.
Computational Grounded Theory (CGT) is a concept introduced to leverage NLP for coding in GT research [nelson2017computational]. QRMine extends CGT into computational triangulation (CT), utilizing NLP and ML techniques for linking textual and numeric data while following the traditional coding methods of GT.
The lack of tools that make NLP and ML accessible to researchers is a challenge that impacts the wider acceptance of these techniques. We provide an open-source python package that can be used in the context of jupyter notebooks [kluyver2016jupyter] to analyze data without much coding. We hope to improve on the current version by introducing more functions that augment researchers’ abilities to derive insights from data. We urge the open-source community to contribute to the code, and potential users to report issues on our repository [eapenbr2019qrmine] so that we can fix them.