One of the most difficult challenges faced by science educators is preparing a curriculum that facilitates the learning process in a structured, planned, and productive manner. The goal of a science curriculum is to educate students who can investigate, question, participate in collaborative projects, and communicate effectively. The expected improvements for students are articulated in a curriculum as learning outcomes Zorluolu2019AnalyzeOT. Learning outcomes are used to track, measure, and evaluate the standards and quality of education that students receive at educational institutions Attia2021BloomsTA. These learning outcomes can also be used to identify the level of any individual student, and various measurement and evaluation studies are therefore incorporated to determine the level of individual learning outcomes.
Exam evaluation is critical for determining how well students understand the course material. The objectivity and scientific relevance of the questions developed for exams must therefore be scrutinized in order to guarantee that students' learning outcomes are tracked and judged effectively. One of the relevant scientific techniques for analyzing this is Bloom's Taxonomy Anderson2000ATF, which is well known among educators around the world. Examinations should take into account the difficulty levels that correspond to the basic objectives and course outcomes in conventional frameworks like Bloom's taxonomy.
Dr. Benjamin Bloom, an educational psychologist, developed Bloom's Taxonomy in 1956. Its goal was to encourage higher-order thinking, such as analyzing and evaluating, instead of rote memorization of information Adesoji2018BloomTO. Bloom's taxonomy is divided into three categories: cognitive (mental skills), affective (emotional areas or attitude), and psychomotor (physical skills). Our study focuses on the cognitive domain, which involves knowledge and intellectual skill development. Researchers have recently demonstrated a growing interest in automatic assessment based on the cognitive domain of Bloom's Taxonomy Abduljabbar2015ExamQC; Mohammed2018QuestionCB; Yahya2019SwarmIA. The majority of previous research focused on question classification within a specific domain, and methods for classifying questions according to Bloom's taxonomy across multiple domains are lacking Sangodiah2017TAXONOMYBF. This work therefore seeks to establish a question classification method based on the cognitive domain of Bloom's taxonomy. The hierarchical order of levels in the cognitive domain is: Knowledge, Comprehension, Application, Analysis, Synthesis, and Evaluation. The first three levels are categorised as lower-order thinking, whilst the latter three are considered higher-order thinking.
The primary aim of this study is to assess the utility and efficacy of Bloom's Taxonomy as a framework for establishing course learning outcomes, optimizing curricula, and evaluating various educational programs. In this paper, we propose BloomNet, a novel transformer-based model that incorporates both linguistic and semantic information for the classification of Bloom's course learning outcomes. We also examine the generalisation capability of BloomNet on new distributions, because train and test distributions are usually not identically distributed. Evaluation datasets rarely represent the entire distribution, and the test distribution often drifts over time QuioneroCandela2009DatasetSI, resulting in train-test discrepancies. Due to these discrepancies, models can face unexpected conditions at test time. Therefore, models should be able to detect and generalise to out-of-distribution (OOD) examples.
In most NLP evaluations, the train and test samples are assumed to be independent and identically distributed (IID). Large pretrained transformer models can achieve high performance on a variety of tasks in the IID scenario Wang2018GLUEAM. However, high IID accuracy does not always imply OOD robustness. Furthermore, because pretrained Transformers rely largely on spurious cues and annotation artifacts Gururangan2018AnnotationAI; Cai2017PayAT that OOD instances are less likely to feature, their OOD robustness is unclear. Hence, we examine the robustness of BloomNet and the other experimented models, such as CNNs, LSTMs, and pretrained transformers.
The contributions of our research can be summarized as follows:
We propose a transformer-based model, BloomNet, that can distinguish between six different cognitive levels of Bloom’s taxonomy (Knowledge, Comprehension, Application, Analysis, Synthesis and Evaluation).
We implement, train and evaluate multiple models to perform comparative analysis.
We evaluate the experimented models for OOD generalization, and we observe that the pretrained transformers (RoBERTa and DistilRoBERTa Liu2019RoBERTaAR; Sanh2019DistilBERTAD), along with the proposed model, have better generalization capability than the other models.
We perform an ablation study to assess the contribution of various components in the proposed model.
The remainder of the paper is organized as follows. Section 2 and Section 3 describe the previous work and our methodology, respectively. Section 4 delves into the experiments and results, and the final section discusses the conclusion and possible future directions.
We implement, train, and evaluate all the considered models as well as BloomNet as described in Section 4.3. We intend to release all the code, trained weights, datasets, and configurations upon acceptance.
2 Related Work
Text classification is an important NLP research area with numerous applications. A number of scholars have concentrated on automatic text classification. In recent years, classification of exam questions for the cognitive domain of Bloom’s taxonomy has received a lot of attention. Previous works have used different features and methods for text classification. Some of these works are discussed in this section.
In Chang2009AutomaticAB, an online examination system is created that supports automatic Bloom's taxonomy analysis of test questions. The researchers introduce fourteen keywords for analyzing questions, each associated with a specific cognition level. The experiment is conducted on 288 test items, and 75% accuracy is achieved for the "Knowledge" cognition level.
A. Swart and M. Daneti Swart2019AnalyzingLO analyzed the learning outcomes of an electronics fundamentals module (at two universities) using Bloom's Taxonomy. To identify the proportion of each cognition level, the verbs of each learning outcome were matched to specific verbs in Bloom's taxonomy. This reflected the balance between theory and practice in the cognitive development of electrical engineering students. The consistency of the findings at the two universities demonstrated that students could blend theory and practice, since around 40 percent of their outcomes were at the higher cognitive levels.
Another work classified exam questions for the cognitive domain of Bloom's Taxonomy using TFPOS-IDF and pretrained word2vec features. To classify the questions, the extracted features are fed to three distinct classifiers, i.e. logistic regression, K-nearest neighbour, and support vector machine. For the experiment, they employ two datasets, one with 141 questions and the other with 600 questions. The three classifiers achieve weighted averages of 71.1%, 82.3%, and 83.7% on the first dataset, and 85.4%, 89.4%, and 89.7% on the second.
Adidah Lajis et al. Lajis2018ProposedAF proposed a framework for assessing students' programming skills, with the cognitive domain of Bloom's taxonomy serving as its foundation. According to the findings, Bloom's taxonomy could be used as a basis for grading students, who would be judged on their ability using the taxonomy. The authors also suggested that the taxonomy be used as an evaluation framework rather than a learning framework.
Based on their domain knowledge, teachers and accreditation organizations manually classify course learning outcomes (CLOs) and questions into the distinct levels of the cognitive domain. This is time-consuming and usually leads to errors due to human bias. As a result, this technique must be automated. Several scholars have sought to automate this process through the use of natural language processing and machine learning techniques Haris2012ARA; Jayakodi2015AnAC; Osadi2017EnsembleCB; Kowsari2019TextCA. Deep learning has recently exhibited impressive results when compared to traditional machine learning methods, particularly in the field of text classification Minaee2020DeepLT.
For text classification tasks, several neural models that automatically represent text as embeddings have been developed, such as CNNs, RNNs, graph neural networks, and a variety of attention models, including hierarchical attention networks and self-attention networks. The majority of previous efforts on Bloom's taxonomy have used either traditional machine learning approaches or representative deep neural models such as RNNs and LSTMs. In this research, we propose a transformer-based approach for performing text classification according to cognitive domains. Transformers Vaswani2017AttentionIA provide significantly better parallelization than RNNs, allowing for efficient (pre-)training of very large language models and enhanced performance.
3 Methodology
In this section we discuss the methodology of our research. Our model is inspired by gupta-etal-2021-lesa and yang-etal-2016-hierarchical. In BloomNet, we encapsulate linguistic information along with a generic input representation, and we also explicitly model word-level attention. In the following sections we describe each component of our model (shown in Figure 1) in detail.
Notation: We denote the current input as a set of tokens $X = \{x_1, x_2, \dots, x_n\}$, where $n$ is the number of tokens in the input. We define an encoder model as $f: X \rightarrow h$, where $h$ is a dense representation of the input. We define our final classifier as $g: h \rightarrow \hat{y}$, where $\hat{y}$ is the softmax output and its size is equal to the number of classes in our data.
3.1 Representation Model
The representation model, or language encoder, is the main component of BloomNet; it produces contextualized embeddings devlin-etal-2019-bert; Pennington2014GloVeGV for the text input. For this we use the pretrained RoBERTa Liu2019RoBERTaAR model from the Hugging Face model hub wolf-etal-2020-transformers. The reason we use RoBERTa instead of other widely used counterparts such as BERT devlin-etal-2019-bert or DistilBERT Sanh2019DistilBERTAD is that it has seen much more data during pretraining compared to its predecessors, which results in increased robustness to subpopulation as well as distribution shift. We feed the tokenized input to the RoBERTa model and use the CLS token as the input representation: $h_{LE} = \text{RoBERTa}(X)_{[CLS]}$.
3.2 Linguistic Encapsulation
Work by gupta-etal-2021-lesa shows that explicit encapsulation of linguistic information increases model performance on the claim detection task; inspired by this, we also explicitly encapsulate linguistic information in BloomNet. We use POS (part-of-speech) and NER (named entity recognition) information coming from models trained on the POS and NER tasks, respectively. We freeze the POS and NER models during training so that their weights do not change and they retain their linguistic information. We use the CLS token representations of these models, defined as follows: $h_{POS} = \text{POS}(X)_{[CLS]}$ and $h_{NER} = \text{NER}(X)_{[CLS]}$.
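Freezing the linguistic encoders amounts to disabling gradient updates on their parameters. A minimal PyTorch sketch (the helper name is ours, not from the paper):

```python
import torch.nn as nn

def freeze(encoder: nn.Module) -> nn.Module:
    """Disable gradient updates so the encoder keeps its pretrained
    (linguistic) weights throughout BloomNet's training."""
    for param in encoder.parameters():
        param.requires_grad = False
    return encoder
```

Applied once to the POS and NER encoders before training, this ensures the optimizer never updates them.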
3.3 Hierarchical Word Attention and Classification
Inspired by yang-etal-2016-hierarchical, we use word-level attention to obtain a better dense representation of the input. For this we use a GRU cho-etal-2014-learning and apply word-level attention to its output. As a result, we get a vector from this module as an input representation, which we use along with the other information for classification. We denote this as follows: $h_{WA} = \text{Attention}(\text{GRU}(X))$.
Finally, we get four different representations coming from the different components; we fuse this information using concatenation and feed it to a linear classification model. We write the concatenation as: $h = [h_{LE}; h_{POS}; h_{NER}; h_{WA}]$.
The classification module can be represented as: $\hat{y} = \text{softmax}(W h + b)$.
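Putting the four components together, the forward pass can be sketched as follows. This is an illustrative PyTorch skeleton, not the released implementation: the three pretrained encoders are replaced by stand-in linear layers, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class BloomNetSketch(nn.Module):
    """Sketch of BloomNet's fusion architecture: three encoder [CLS]-style
    vectors (generic, POS, NER) plus a GRU word-attention vector are
    concatenated and classified over the six Bloom cognitive levels."""

    def __init__(self, hidden=768, num_classes=6):
        super().__init__()
        # Stand-ins for the pretrained RoBERTa / POS / NER encoders.
        self.generic = nn.Linear(hidden, hidden)
        self.pos = nn.Linear(hidden, hidden)
        self.ner = nn.Linear(hidden, hidden)
        # Word-level attention over GRU outputs.
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(4 * hidden, num_classes)

    def forward(self, token_embs):              # (batch, seq, hidden)
        cls = token_embs[:, 0]                  # [CLS]-position embedding
        h_le = self.generic(cls)
        h_pos = self.pos(cls)
        h_ner = self.ner(cls)
        out, _ = self.gru(token_embs)
        weights = torch.softmax(self.attn(out), dim=1)  # word-level attention
        h_wa = (weights * out).sum(dim=1)
        fused = torch.cat([h_le, h_pos, h_ner, h_wa], dim=-1)
        return self.classifier(fused)           # logits over the six levels
```

The concatenation and linear head mirror the fusion and classification equations above; the real model would feed token IDs through the actual pretrained encoders instead of precomputed embeddings.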
4 Experiments and Results
4.1 Datasets
We use two open-domain datasets to evaluate the proposed approach. The first dataset was proposed in Yahya2012BloomsTC and comprises 600 open-ended questions. The second dataset was compiled from a variety of websites, publications, and previous research Haris2015BloomsTQ, and contains 141 open-ended questions. Both datasets are annotated with the six categories (Knowledge, Comprehension, Application, Analysis, Synthesis, and Evaluation). Table 1 illustrates the label distribution for both datasets. The questions in these two datasets come from a variety of fields of study, including chemistry, literature, biology, the arts, and computer science, among others.
4.2 Baselines
In this section, we describe the various baseline models that we use for comparative analysis. The models are arranged in the order of their performance.
4.2.1 Very Deep CNN (VDCNN)
VDCNN Schwenk2017VeryDC learns a hierarchical representation of a sentence with a deep stack of convolutions and max-pooling of size 3, operating on a character-level representation of the text. VDCNNs are substantially deeper than previously published convolutional neural networks; this was the first CNN model to demonstrate the "advantage of depth" in the field of NLP.
4.2.2 RNN and LSTM
RNN-based models view text as a sequence of words and are designed to capture word dependencies and text structure for text classification. RNNs 10.5555/553011 can memorise the local structure of a word sequence, but they struggle with long-range dependencies. Long Short-Term Memory (LSTM) Sari2020TextCU is the most popular variant of the RNN, created to capture long-term dependencies. Vanilla RNNs suffer from the vanishing gradient problem; LSTMs resolve this issue by using a memory cell that remembers values across arbitrary time intervals.
4.2.3 Hierarchical Attention Networks (HAN)
Hierarchical Attention Networks (HAN) Yang2016HierarchicalAN collect relevant tokens from sentences and aggregate their representations with the help of an attention mechanism. The same approach is used to retrieve relevant sentence vectors, which are then used in the classification task.
4.2.4 CNN
While RNNs are trained to detect patterns over time, CNNs Kim2014ConvolutionalNN are trained to recognise patterns over space. RNNs work well for NLP tasks like POS tagging or QA that require understanding of long-range semantics, whereas CNNs are good at recognising local and position-invariant patterns LeCun1998GradientbasedLA, such as key phrases expressing a specific emotion or topic. As a result, CNNs have become one of the most common text classification models.
4.2.5 Recurrent CNN (RCNN)
In contrast to the CNN, the Recurrent CNN Girshick2014RichFH comprises a bi-directional recurrent structure that captures greater contextual information from word representations. This is followed by a max-pooling layer responsible for extracting the key features for text classification.
4.2.6 Seq2Seq with Attention
Sequence-to-sequence models Bahdanau2015NeuralMT have been deployed in tasks such as machine translation, text summarization, and image captioning. A Seq2Seq model comprises an encoder, a decoder, and an attention layer; the encoder compiles the input into a context vector, which is passed to the decoder to produce the desired output sequence. The primary idea behind the attention mechanism is to avoid learning a single vector representation for each sentence and instead attend to specific input vectors based on attention weights.
4.2.7 Self-Attention
Self-attention is a type of attention that allows a model to learn the relationships between the words in a sentence; it is used in various NLP tasks and in Transformers Vaswani2017AttentionIA. Although CNNs are less sequential than RNNs, the computational cost of capturing relationships between words in a sentence grows with sentence length, much as with RNNs. Transformers get around this constraint by using self-attention to compute an "attention score" for each word in a sentence or document in parallel, modelling the influence each word has on the others.
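The core computation can be sketched as scaled dot-product self-attention; the query, key, and value projections are omitted here for brevity, so each word attends to every other word directly:

```python
import torch

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (seq, d) word representations. Returns the same shape, with each
    position re-expressed as an attention-weighted mix of all positions,
    computed for all words in parallel (unlike an RNN)."""
    d = x.shape[-1]
    scores = x @ x.transpose(-2, -1) / d ** 0.5  # pairwise attention scores
    weights = torch.softmax(scores, dim=-1)      # each row sums to 1
    return weights @ x
```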
4.2.8 TF-IDF Random Forest
Random Forest (RF) models Xue2015ResearchOT are made up of a collection of decision trees trained on random feature subsets. The model's predictions are obtained via a majority vote over all the trees in the forest. RF classifiers are simple to apply to text classification of high-dimensional noisy data. TF-IDF (Term Frequency-Inverse Document Frequency) ref1 is a commonly used approach for converting text into a numeric representation that a machine learning algorithm can use; the TfidfVectorizer weights each word's count in a document, discounted by how common the word is across the corpus.
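A TF-IDF + Random Forest pipeline of this kind is straightforward with scikit-learn. The toy questions and labels below are our own illustrative examples, not the paper's data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy questions, labelled with Bloom cognitive levels.
questions = [
    "Define the term photosynthesis.",
    "Explain how photosynthesis works in your own words.",
    "Design an experiment to measure the rate of photosynthesis.",
    "Define the law of conservation of energy.",
]
labels = ["Knowledge", "Comprehension", "Synthesis", "Knowledge"]

# TfidfVectorizer turns each question into a weighted term vector;
# the forest then votes over decision trees fit on those features.
clf = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(questions, labels)
prediction = clf.predict(["Define the term osmosis."])[0]
```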
4.2.9 DistilRoBERTa
DistilRoBERTa Liu2019RoBERTaAR; Sanh2019DistilBERTAD is distilled from the RoBERTa-base model and contains around half as many parameters as the BERT model. It is based on the same training process as DistilBERT, and it maintains 95 percent of BERT's performance on the GLUE language understanding benchmark Wang2018GLUEAM.
4.2.10 RoBERTa
RoBERTa (Robustly Optimized BERT Pretraining Approach) Liu2019RoBERTaAR is a more robust version of BERT trained on much more data. Careful tuning of the hyper-parameters improved the results and performance of the model significantly. To boost performance over BERT, RoBERTa also modified the training procedure and architecture; these modifications include removing next-sentence prediction and dynamically changing the masking pattern during pretraining.
4.3 Experimental Setup
We use HuggingFace Transformers wolf-etal-2020-transformers, PyTorch NEURIPS2019_bdbca288, PyTorch-Lightning falcon2019pytorch, and Scikit-learn JMLR:v12:pedregosa11a for model implementation, training, and evaluation. We train all of the models with k-fold (k = 5) cross-validation and report mean and standard deviation across the folds. We observe that the text in the datasets is relatively short, so we use a maximum sequence length of 128. For LSTM-like models, we employ hidden size = 768, number of layers = 4, and dropout = 0.10 throughout the experiments. We use Adam Kingma2015AdamAM as the optimizer and cross-entropy as the objective function. We use different learning rates for different models depending on how they are initialized; for BloomNet we use a learning rate of 2e-5. We train all models for 50 epochs with batch size = 32 and early stopping to prevent overfitting. We do not change the shared hyper-parameters across the models so that the comparison is as fair as possible, and we do not conduct comprehensive hyper-parameter searches due to computational constraints.
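The 5-fold protocol with mean-and-std reporting can be sketched as follows (shown with synthetic data and a scikit-learn classifier for brevity; in the paper, each fold trains a full neural model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic stand-in data for illustration only.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Train on 4 folds, evaluate on the held-out fold.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

mean_acc, std_acc = np.mean(scores), np.std(scores)  # reported per model
```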
4.4 Results
We evaluate the following baselines to compare against the performance of our proposed model, BloomNet: VDCNN Schwenk2017VeryDC, LSTM Sari2020TextCU, HAN Yang2016HierarchicalAN, CNN Kim2014ConvolutionalNN, RCNN Girshick2014RichFH, Seq2Seq-Attention Bahdanau2015NeuralMT, Self-Attention Vaswani2017AttentionIA, Random Forest Xue2015ResearchOT, DistilRoBERTa Liu2019RoBERTaAR, and RoBERTa Liu2019RoBERTaAR. We find that BloomNet outperforms all the considered baselines on both datasets, demonstrating its superior performance for this text classification task. Table 2 reports the performance of BloomNet and all the baselines.
Each model is trained on Dataset1 and evaluated on both datasets. Dataset1 is used to evaluate IID performance, while Dataset2 is used to test generalisation capability (OOD performance) by assessing the models on a new distribution that they do not encounter during training. We observe that, in comparison to the baseline models, BloomNet is less vulnerable to distribution shift.
[Table 2: rows for VDCNN (Very Deep Convolutional Neural Network), LSTM (Long Short-Term Memory), HAN (Hierarchical Attention Network), CNN (Convolutional Neural Network), RCNN (Recurrent Convolutional Neural Network), Seq2Seq-Attention, Random Forest TF-IDF (Term Frequency-Inverse Document Frequency), DistilRoBERTa (distilled from RoBERTa), and RoBERTa (Robustly Optimized BERT Pretraining Approach); numeric values omitted.]
Table 2: Mean and standard deviation of the results obtained over 5 folds. BloomNet performs significantly better (p < 0.004) than RoBERTa. Bold shows the best performance. All models are trained and evaluated on Dataset1 (hence IID); OOD evaluation is performed on Dataset2.
4.4.1 Comparative Analysis
As seen in Table 2, BloomNet outperforms the baseline models, achieving the highest accuracy and Macro-F1 scores on both Dataset1 (IID) and Dataset2 (OOD).
In addition, we made several notable observations while evaluating the baselines. Surprisingly, TF-IDF ref1 encoded text with a random forest performs better than several strong baselines like LSTM, HAN, CNN, and RCNN; it is the third-best-performing baseline in terms of accuracy and Macro-F1 on both Dataset1 (IID) and Dataset2 (OOD).
We also observe that attention-based models like Seq2Seq-Attention and Self-Attention show better classification performance than the vanilla models (VDCNN, CNN, LSTM, and RCNN). Further, the pretrained Transformer models DistilRoBERTa and RoBERTa achieve superior performance over all the other considered baselines, with RoBERTa being the best-performing baseline in terms of accuracy and Macro-F1 on both Dataset1 (IID) and Dataset2 (OOD).
4.4.2 Out-of-distribution Generalisation
To assess OOD robustness, we evaluate the models on new data not seen during training. We observe that OOD and IID performance are linearly correlated: models that do not perform well on IID data, such as VDCNN and LSTM, also perform poorly on OOD data. Pretrained transformers have been shown to be robust to distribution shift hendrycks-etal-2020-pretrained; ramesh-kashyap-etal-2021-analyzing, but in our case we notice a significant performance drop between IID and OOD data across all the pretrained transformer-based models in our experiment, as we do for the other models as well. We hypothesise that this is caused by a large discrepancy between the IID and OOD data.
4.5 Ablation Study
[Table 3, partial row: + WA + POS-NER — 87.50 | 87.23]
Our proposed model BloomNet has three main components: 1. the representation model or language encoder, 2. the linguistic encapsulation module, and 3. the hierarchical word attention module. We conduct an ablation study to assess the contribution of these components. First, we remove the word attention module from BloomNet, train it like the other models with the same configuration, and report the resulting IID and OOD accuracy. We then remove the linguistic encapsulation block and train the model as before; BloomNet without linguistic encapsulation yields similar IID performance but performs better on OOD data. If we remove both components, word attention as well as linguistic encapsulation, BloomNet is the same as the RoBERTa Liu2019RoBERTaAR baseline. The results of the ablation are stated in Table 3.
Limitations: We propose a novel transformer-based model named BloomNet, which has three language encoders: two for linguistic encoding, named the POS encoder and NER encoder, and one generic encoder (we use RoBERTa). Because of these three large transformer-based language encoders, the proposed model is compute- and memory-heavy, which makes it cumbersome to deploy in production. To assess the generalization capability of the models, we evaluate them on a different distribution that they do not see during training. We do not quantify the shift between the IID and OOD data, and we restrict ourselves to evaluation only, as investigating the cause of the performance drop on OOD data is beyond the scope of this study. The datasets used in our work are relatively small, with 600 and 141 samples in Dataset1 and Dataset2, respectively. Although we perform cross-validation and report mean and standard deviation, we expect performance to change on bigger datasets; for the same reason, we do not train models on Dataset2.
Ethical Considerations: We are well aware of the societal implications of deploying large language models: they can carry unintended bias against marginalized groups, and the models themselves can play a significant role in amplifying those biases. We do not see any immediate misuse of our work, but further research in this area could lead to the development of systems such as automated scoring, which can have a disproportionately detrimental impact on marginalized groups.
6 Conclusion and Future Work
We propose a novel transformer-based model, BloomNet, that captures linguistic and semantic information to classify course learning outcomes according to the different cognitive domains of Bloom's Taxonomy. BloomNet outperforms all the baseline models analyzed in this study in terms of both performance and generalization capability. Interestingly, we observe that carefully processed text with TF-IDF encoding outperforms numerous strong baselines like CNN, RNN, and attention-based models. We also observe that pretrained Transformers generalize to OOD examples surprisingly well. Overall, we apply a state-of-the-art Natural Language Processing (NLP) model to a relatively new task, and we believe it opens up new research directions for NLP in the education domain. Similar to previous domain-oriented NLP efforts, such as NLP4Health, NLP4Programming, LegalNLP, and so on, we believe NLP4Education has the potential to improve existing systems for the mutual benefit of the community and society in general.