1.1 Problem Statement and Motivations
Students’ academic performance is important not only for the students themselves but also for higher education institutions (HEIs), which use it as a basis for measuring the success of their educational programs. A variety of metrics can be used to measure student performance, and dropout rates are notable among them: a reduction in dropout rates can indicate an improvement in students’ academic performance. However, according to published figures, one-fifth of first-year undergraduates across Australia drop out of their degrees each year [shipley2019]. In Brazil, it is estimated that only 62.4% of university enrolments result in an undergraduate degree [Sales2016]. These statistics are concerning for a country’s development, and dropping out can affect students’ lives financially, mentally, and professionally. As a consequence, HEIs and researchers have shown an increasing interest in prediction systems that identify students at risk of dropping out [Ruben2019]. For example, Manrique et al. created multiple feature sets from student data to predict whether a student would drop out by using machine learning algorithms [Ruben2019]; also, Mujica et al. pointed out that path analysis was useful for predicting student dropouts based on affective and cognitive variables [Mujica2019].
According to Vergel et al., curriculum design may affect students’ performance and retention. A careful design is required to "understand the role of curriculum design in dropout rates and unmask how the political dynamics embedded in the curriculum influence students’ retention" [Vergel2018]. Moreover, Drennan, Rohde, Johnson, and Kuennen claimed that students’ academic performance in a course is related to their performance in its prerequisite courses [Drennan2002], [Johnson2006]. The curriculum therefore plays a crucial role both in improving student performance and in decreasing dropout rates.
The main motivation for this research is to improve students’ academic performance based on the analysis of the curriculum and of student academic performance in different courses. Identifying prerequisites for a course is key to enhancing the student experience, improving curriculum design and increasing graduation rates.
The main objective of this study is to apply Artificial Intelligence and Semantic technologies to analyse the curriculum and enable changes that may improve student academic performance.
The goal of this research project is threefold:
Predict students’ performance in university courses using grades from previous semesters;
Model a course semantic representation and calculate the similarity among courses; and, finally,
Identify the sequence between two similar courses.
To accomplish these goals, a systematic review was carried out to identify which technologies are being used in CS curriculum analysis and design.
The main contributions of this thesis can be divided into:
An approach for student dropout prediction using Genetic Algorithm (GA) and Long Short-Term Memory (LSTM);
A comprehensive systematic review to understand how Semantic Web and Natural Language Processing technologies have been applied to curriculum analysis and design in Computer Science; and, finally,
An analysis of a Computer Science curriculum using the SemRefD distance proposed by Manrique et al. [Ruben2019] and BERT proposed by Devlin et al. [Jacob2018].
1.4 Report Outline
Section 2 presents the background information on dropout prediction and text analysis. Section 3 covers the methodology used in this work, including a systematic review and general techniques. Section 4 presents the experiments, results and discussion and, finally, Section 5 concludes this report with future work directions.
Artificial Intelligence refers to the development of machines or computers that simulate or emulate the functions of the human brain. The functions emulated differ according to the area of study. Prolog, for example, is a logic programming language in which knowledge is expressed as facts and rules, so that a system can derive relevant conclusions from a set of statements. Intelligent agents are another example. Unlike traditional agents, intelligent agents are designed to take actions that are optimized to achieve a specific goal, and they make decisions based on their perception of the environment and their internal rules. This study explores four growing fields: Machine Learning, Deep Learning, Natural Language Processing, and Semantic Web.
Machine Learning is a field of study that uses algorithms to detect patterns in data and to make predictions or useful decisions. Machine learning algorithms can be applied in a wide range of scenarios, but each performs best on particular kinds of data, so a variety of algorithms is used to build models that suit different use cases. Using machine learning algorithms, we predict students’ performance in university courses based on their performance in previous courses.
Deep Learning refers to neural networks with multiple layers of perceptrons. Neural networks try to imitate the activity of the human brain, albeit far from matching its capabilities, which enables them to learn from large volumes of data. While a network with a single layer can already make approximate predictions, additional hidden layers can help to refine and optimize it for accuracy. Recurrent Neural Networks (RNNs) are a special type of neural network that combines the information from previous time steps to generate updated outputs.
Natural Language Processing (NLP) is a field of Artificial Intelligence concerned with enabling computers to analyse, interpret, and generate human language. In this work, NLP techniques are used to analyse the text that describes courses.
The Semantic Web is an extension of the World Wide Web built on standards set by the W3C (World Wide Web Consortium). These standards aim to make information on the World Wide Web machine-readable. To accomplish this goal, a series of communication standards were created that enable developers to describe concepts and entities, as well as their classification and relationships. The Resource Description Framework (RDF) and the Web Ontology Language (OWL) have enabled developers to create systems that store and use complex knowledge bases known as knowledge graphs.
2.2 Related Work
This section discusses tools, frameworks, datasets, semantic technologies, and approaches used for curriculum design in Computer Science. Note that we present the methodology used to carry out a systematic review, but for brevity and adequacy we only present the most relevant works and sections. The complete systematic review will be submitted to a conference in Computer Science in Education as an outcome of this thesis.
This systematic review was conducted following the methodology defined by Kitchenham and Charters [Kitchenham2007]. The method is composed of three main steps: planning, conducting and reporting.
The planning stage helps to identify existing research in the area of interest as well as build the research question. Our research question was created based on the PICOC method [Kitchenham2007] and used to identify keywords and corresponding synonyms to build the search string to find relevant related works. The resulting search string is given below.
("computer science" OR "computer engineering" OR "informatics") AND ("curriculum" OR "course description" OR "learning outcomes" OR "curricula" OR "learning objects") AND ("semantic" OR "ontology" OR "linked data" OR "linked open data")
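As a minimal sketch of how such a string can be assembled from the keyword groups, the snippet below joins the synonyms of each group with OR and the groups with AND. The function name `build_search_string` and the list layout are assumptions made for illustration; they are not part of the review protocol.

```python
# Keyword groups copied from the search string above.
KEYWORD_GROUPS = [
    ["computer science", "computer engineering", "informatics"],
    ["curriculum", "course description", "learning outcomes", "curricula", "learning objects"],
    ["semantic", "ontology", "linked data", "linked open data"],
]


def build_search_string(groups):
    """Join synonyms with OR inside a group and join the groups with AND."""
    clauses = []
    for group in groups:
        quoted = " OR ".join(f'"{term}"' for term in group)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


print(build_search_string(KEYWORD_GROUPS))
```

Keeping the groups as data makes it easy to adapt the string to the slightly different query syntaxes of individual digital libraries.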
To include a paper in this systematic review, we defined the following inclusion criteria: (1) papers must have been written in English; (2) papers must be exclusively related to semantic technologies, computer science and curriculum; (3) papers must have 4 or more pages (i.e., full research papers); (4) papers must be accessible online; and, finally, (5) papers must be scientifically sound, present a clear methodology and conduct a proper evaluation for the proposed method, tool or model.
Figure 2.1 shows the paper selection process in detail. Initially, 4,510 papers were retrieved in total. We used the ACM Digital Library (https://dl.acm.org/), IEEE Xplore Digital Library (https://ieeexplore.ieee.org/), Springer (https://www.springer.com/), Scopus (https://www.scopus.com/), ScienceDirect (https://www.sciencedirect.com/) and Web of Science (https://www.webofscience.com/) as digital libraries. The Springer digital library returned a total of 3,549 studies. The large number of papers returned by the Springer search mechanism led us to develop a simple crawling tool to help us in the process of including/rejecting papers in our systematic review. All the information returned by Springer was collected and stored in a relational database. After that, we were able to correctly query the Springer database and select the relevant papers for this systematic review.
We also applied the forward and backward snowballing method to this systematic review to identify relevant papers that were not retrieved by our query string. The backward snowballing method was used to collect relevant papers from the references of the final list of papers in Phase 1, whereas the forward snowballing method was used to collect the papers that cited these papers [Wohlin2014]. Google Scholar (https://scholar.google.com/) was used in the forward snowballing method. In total, 37 studies were identified as relevant; most of the studies were published in the last few years, which shows the increasing relevance of this topic.
In the following sections we present the most relevant works that inspired the proposed approaches reported in this thesis.
Protégé is a popular open-source editor and framework for ontology construction, which several researchers have adopted to design curricula in Computer Science [Tang2013], [Wang2019], [Nuntawong2017], [Karunananda2012], [Hedayati2016], [Saquicela2018SimilarityDA], [Maffei2016], and [Vaquero2009]. Despite its wide adoption, Protégé still presents limitations in terms of the manipulation of ontological knowledge [Tang2013]. As an attempt to overcome this shortcoming, Asoke et al. [Karunananda2012] developed a curriculum design plug-in called OntoCD. OntoCD allows curriculum designers to customise curricula by loading a skeleton curriculum and a benchmark domain ontology. OntoCD was evaluated by developing a Computer Science degree program using benchmark domain ontologies developed in accordance with the guidelines provided by IEEE and ACM. Moreover, Adelina and Jason [Tang2013] used Protégé to develop SUCO (Sunway University Computing Ontology), an ontology-specific Application Programming Interface (API) for curriculum management systems. They claim that, in response to the shortcomings of the Protégé platform, SUCO shows a greater ability to manipulate and extract knowledge and functions effectively if the ontology is processed as an eXtensible Markup Language (XML) document.
Other specific ontology-based tools for curriculum management have also been developed. CDIO [Liang2012] is an example of such a tool. CDIO was created to automatically adapt a given curriculum according to teaching objectives and teaching content based on the constructed ontology. A similar approach was used by Maffei et al. [Maffei2016] to model the semantics behind functional alignments in order to design, synthesize, and evaluate functional alignment activities. In Mandić’s study, the author presented a software platform (http://www.pef.uns.ac.rs/InformaticsTeacherEducationCurriculum) for comparing chosen curricula for information technology teachers [Mandic2018]. In Hedayati’s work, the authors used the curriculum Capability Maturity Model (CMM), a taxonomical model for describing an organization’s level of capability in the domain of software engineering [Paulk1993]. An ontology-driven model for analyzing the development process of the vocational ICT curriculum in the context of the culturally sensitive curriculum in Afghanistan is used as a reference model [Hedayati2016].
There are several datasets used in curriculum design studies in Computer Science. The open-access CS2013 (https://cs2013.org/) dataset is the result of the joint development of a computing curriculum sponsored by the ACM and IEEE Computer Society [Piedra2018]. The CS2013 dataset has been used in several studies [Piedra2018], [Aeiad2016], [Nuntawong2016], [Nuntawong2017], [Karunananda2012], [Hedayati2016], and [Fiallos2018] to develop ontologies or as a benchmark curriculum in similarity comparison between computer science curricula.
Similar to CS2013, the Thailand Qualification Framework for Higher Education (TQF: HEd) was developed by the Office of the Thailand Higher Education Commission to be used by all higher education institutions (HEIs) in Thailand as a framework to enhance the quality of course curricula, including the Computer Science curriculum. TQF: HEd was used as the guideline for ontology development in the following studies [Nuntawong2017], [Nuntawong2016], [Hao2008], and [Nuntawong2015].
Other studies use self-created datasets (e.g., [Wang2019], [Maffei2016], [Hedayati2016], and [Fiallos2018]). Specifically, in Wang’s work, the Ontology System for the Computer Course Architecture (OSCCA) was proposed based on a dataset created using course catalogs from top universities in China as well as network education websites [Wang2019]. In Maffei’s study, the authors tested and evaluated the proposal on the Engineering program at KTH Royal Institute of Technology in Stockholm, Sweden [Maffei2016]. In Guberovic et al.’s work, the dataset used for comparing courses comes from the Faculty of Electrical Engineering and Computing, which is compared with universities from the United States of America [Guberovic2018]. In Fiallos’s study, the author not only adopted CS2013 for modeling domain ontologies but also collected the core courses of Computational Sciences at Escuela Superior Politécnica del Litoral (ESPOL, https://www.espol.edu.ec/) for semantic similarity comparison [Fiallos2018].
2.2.4 Languages, Classes and Vocabulary
RDF is used as the design standard for data interchange in the following studies [Piedra2018], [Nuntawong2017], and [Saquicela2018SimilarityDA]. In particular, Saquicela et al. [Saquicela2018SimilarityDA] generated curriculum data in the RDF format, creating and storing data in a repository when the ontological model has been defined and created.
OWL is an extension of RDF that adds additional vocabulary and semantics to the classic framework [McGuinness2004]. OWL is used in many studies [Piedra2018], [Adrian2020], [Mandic2018], [Wang2019], [Maffei2016], and [Vaquero2009] for representing and sharing knowledge on the Web (https://www.w3.org/OWL/). Apart from OWL, only two studies used XML (https://www.w3.org/standards/xml/) due to implementation requirements of the research ([Tang2013] and [Hao2008]).
Body of Knowledge (BoK), a subclass of OWL, is a complete set of concepts, terms and activities, which can represent the accepted ontology for a professional domain [Piedra2018]. BoK has become a common development method in many studies [Piedra2018], [Nuntawong2017], [Nuntawong2016], [Hao2008], [Karunananda2012], [Tapia2018], [Chung2014], and [Nuntawong2015]. In Piedra’s study [Piedra2018], the BoK defined in the CS2013 ontology was viewed as a to-be-covered description of the content and a curriculum to implement this information. Similarly, Nuntawong et al. [Nuntawong2017] applied the BoK, which is based on the ontology of TQF: HEd, to conduct the ontology mapping.
The Library of Congress Subject Headings (LCSH, https://www.loc.gov/aba/cataloging/subject/) is a managed vocabulary maintained by the Library of Congress. LCSH terminology is a BoK that contains more than 240,000 topical subject headings and offers equivalence, hierarchical, and associative relationships between headings. In Adrian’s study, the authors create an ontology based on LCSH and the Faceted Application of Subject Terminology (https://www.oclc.org/en/fast.html), a completely enumerative faceted subject terminology schema derived from LCSH, to assess the consistency of an academic curriculum and apply it to an Information Science curriculum [Adrian2020].
Knowledge Area (KA), also a subclass of OWL, is an area of specialization such as Operating Systems or Algorithms. The relationship between BoK and KA is built in various ways in the studies. For example, in Piedra’s study [Piedra2018], each BoK class contains a set of KAs. In contrast, Nuntawong et al. [Nuntawong2017] considered KA as the superclass of BoK. KA classification was proposed in Orellana et al.’s study [Orellana2018]. In that paper, the curricula are classified into the KAs defined by UNESCO (https://whc.unesco.org/en/), which defines 9 main areas and 24 subareas of knowledge. To do this, they convert the curricula to a vector space and then process them using traditional supervised approaches such as support vector machines and k-nearest neighbors [Orellana2018]. Once the KAs are classified, similarity measurement can be applied more easily.
2.2.5 Curriculum Design Analysis and Approaches
One development approach found is extraction and interrelationship analysis. NLTK (https://www.nltk.org/) is deployed to segment raw data into terms, various algorithms are used to extract and analyse the interrelationships of items, and an ontology is then constructed using a framework such as Protégé, as in [Piedra2018], [Wang2019], and [Tapia2018].
Text Mining, also known as text analysis, has been used to find interesting patterns in HEIs’ curricula [Orellana2018]. Using Text Mining approaches, keywords can be extracted from both retrieved documents and course materials for further comparison and analysis [Kawintiranon2016]. In the next section, the application of Text Mining approaches to Computer Science curricula will be elaborated.
In the context of curriculum similarity measurement, Gomaa and Fahmy [H.Gomaa2013] define string-based, corpus-based, knowledge-based and hybrid similarity measures. String-based measures compute the distance between strings (words), which can be compared by characters or by terms. Corpus-based measures capture the semantic similarity of terms and phrases based on information gained from a corpus. Knowledge-based measures use word networks formed of synsets, such as WordNet, to compare the meaning of words. Among the reviewed studies, the corpus-based approach was applied in [Orellana2018], [Pawar2018], [Aeiad2016] and [Fiallos2018], and a knowledge-based similarity approach is proposed in [Nuntawong2015].
String-based similarity between terms was measured in many studies [Orellana2018], [Pawar2018], [Adrian2020], [Seidel2020], [Wang2019], and [Saquicela2018SimilarityDA]. Orellana et al. [Orellana2018] used cosine similarity between terms to acquire the level of similarity between two course descriptions. Adrian and Wang used the same approach to measure the similarity [Adrian2020], [Wang2019].
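To make the cosine measure concrete, the following is a minimal sketch that compares two course descriptions through simple word-count vectors. The tokenisation, the function names and the three course descriptions are assumptions made for illustration; the cited studies may build their term vectors differently (for example with TF-IDF weighting).

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lower-case the text and count word occurrences (a simple bag of words)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical course descriptions, invented for illustration.
course_a = "Introduction to algorithms and data structures"
course_b = "Advanced algorithms and graph data structures"
course_c = "History of modern architecture"

print(cosine_similarity(term_vector(course_a), term_vector(course_b)))
print(cosine_similarity(term_vector(course_a), term_vector(course_c)))
```

The score is 1 for descriptions with identical term distributions and 0 for descriptions that share no terms, which is why a threshold on this value can be used to flag similar courses.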
Pawar and Mago utilized Bloom’s taxonomy to measure the similarity of sentence pairs in the learning outcomes [Pawar2018]. In another paper, Saquicela et al. used K-means, an unsupervised clustering algorithm, to calculate the similarity among course contents [Saquicela2018SimilarityDA].
Corpus-based similarity measurement was proposed in these studies [Orellana2018], [Pawar2018], [Adrian2020], and [Fiallos2018]. In Orellana et al.’s study, the topics are extracted and processed by Latent Semantic Analysis (LSA). Through this process, the terms and documents are located within their context, and the most relevant documents (Wikipedia articles and curricula) are retrieved by applying a similarity threshold.
Knowledge-based (semantic) similarity measurement was proposed in [Nuntawong2016] and [Pawar2018]. In Nuntawong’s paper [Nuntawong2017], the authors designed a curriculum ontology and defined ontology mapping rules based on the semantic relationships that can be established between curricula. An ontological system was then built by converting the input curriculum data to an ontology. The BoK in the KAs from TQF: HEd and the course descriptions were then retrieved and compared to WordNet. Finally, the semantic similarity values were calculated using an extended version of Wu and Palmer’s algorithm [Wu1994]. To calculate the semantic similarity between words, Pawar and Mago used synsets from WordNet. The method simulates supervised learning through the use of corpora. Additionally, they used the max similarity algorithm implemented in NLTK to determine the sense of the words [Pawar2018].
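For intuition, the basic Wu and Palmer measure scores two concepts by the depth of their lowest common ancestor relative to their own depths: sim(a, b) = 2 * depth(lcs) / (depth(a) + depth(b)). The sketch below applies it to a small hand-made taxonomy. The taxonomy and all function names are hypothetical, the root is assumed to have depth 1, and this is the plain measure rather than the extended variant used in the cited work.

```python
# Hypothetical toy taxonomy: each concept maps to its parent (None for the root).
PARENT = {
    "computer science": None,
    "algorithms": "computer science",
    "systems": "computer science",
    "sorting": "algorithms",
    "graph algorithms": "algorithms",
    "operating systems": "systems",
}


def path_to_root(concept):
    """Return the chain of concepts from the given one up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth of a concept, counting the root as depth 1 (an assumption)."""
    return len(path_to_root(concept))


def lowest_common_subsumer(a, b):
    """Deepest concept that is an ancestor of (or equal to) both a and b."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_a:
            return node
    raise ValueError("concepts do not share a root")


def wu_palmer(a, b):
    """Basic Wu and Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("sorting", "graph algorithms"))
print(wu_palmer("sorting", "operating systems"))
```

Concepts under the same parent score higher than concepts that only meet at the root, which is the property exploited when comparing curriculum terms through WordNet.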
One important component of curricula is the Learning Outcomes (LOs), which define what students are expected to learn by taking a course. Bloom’s Taxonomy has six layers in a hierarchical structure (Remembering, Understanding, Applying, Analysing, Evaluating, and Creating) [Lasley2013]. Semantic technologies are used along with Bloom’s Taxonomy to calculate the similarity of learning outcomes between courses. Pawar and Mago [Pawar2018] propose a semantic similarity metric that uses WordNet (https://wordnet.princeton.edu/) to generate a score for comparing LOs based on Bloom’s taxonomy. Similarly, Mandić [Mandic2018] proposed a taxonomic structural similarity between curricula by applying the revised Bloom’s taxonomy, which adjusts the cognitive levels of learning.
This section partially presented some of the works found in the literature that used semantic technologies to help university stakeholders design curricula for Computer Science. The following chapters present the approaches and analysis carried out using previous works as inspiration.
3.1 Dropout Prediction
This section presents an approach to predicting dropout. We illustrate the entire dropout prediction workflow followed by the description of the dataset used and the corresponding pre-processing measurements. We then present an SVM-based genetic algorithm (GA) for feature selection and our LSTM approach for dropout prediction.
3.1.1 Procedure Overview
As illustrated in Figure 3.1, the entire dropout prediction workflow is composed of four steps. Details of each step will be explained in the following sections.
Briefly, the first step of dropout prediction is responsible for data pre-processing, including obtaining the datasets. This step uses data wrangling and machine learning (ML) techniques. The second step is responsible for feature selection and uses the pre-processed data output by the first step. Steps 3 and 4 are merged and consist of training and testing the Long Short-Term Memory (LSTM) and Fully Connected (FC) neural network.
3.1.2 Data Pre-processing
Before introducing the data pre-processing, we introduce the data set used for this experiment. Figure 3.2 presents a few instances of the dataset used.
The dataset used in this thesis has been used in previous research [Manrique2019] and is provided by a Brazilian university. It contains 248,730 academic records of 5,582 students enrolled in six distinct degrees from 2001 to 2009. The dataset is in Portuguese, and we translate the main terms for better understanding.
The dataset contains 32 attributes in total: cod_curso (course code); nome_curso (course name); cod_hab (degree code); nome_hab (degree name); cod_enfase (emphasis code); nome_enfase (emphasis name); ano_curriculo (curriculum year); cod_curriculo (curriculum code); matricula (student identifier); mat_ano (student enrolment year); mat_sem (semester of the enrolment); periodo (term); ano (year); semestre (semester); grupos (group); disciplina (discipline/course); semestre_recomendado (recommended semester); semestre_do_aluno (student semester); no_creditos (number of credits); turma (class); grau (grades); sit_final (final status (pass/fail)); sit_vinculo_atual (current status); nome_professor (professor name); cep (zip code); pontos_enem (national exam marks); diff (difference between semesters and students performance); tentativas (number of attempts in a course); cant (previous course); count (count); identificador (identifier); nome_disciplina (course name).
Most of the data is related to the status of a student in a given course and degree. The "semestre" attribute is related to the semester in which a student took a course and has previously been used to create a time series of students [Ruben2019]. The attribute "sit_vinculo_atual" takes 12 enrolment statuses, three of which ("DESLIGADO", "MATRICULA EM ABANDONO", "JUBILADO") represent the dropout status. Grades are scaled from 0 to 10 inclusive, where 10 is the maximum grade. The dataset is anonymous and the identifiers do not allow re-identification. The students’ ids are encoded with dummy alphanumeric codes such as "aluno1010" ("student1010"). The dataset was profiled by year and degree and is presented in Figure 3.2(a) - 3.5.
|Data Cleaning||outliers/duplicates correction|
|Data Validation||inappropriate data identification|
|Data Enrichment||data enhancement; missing values filling|
|Data Normalisation||data re-scale|
To pre-process the data, we first split and reorganised the dataset. We split the dataset into three parts: the CSI ("Information Systems"), ADM ("Management"), and ARQ ("Architecture") datasets. After that, we removed duplicate data, resolved data inconsistencies and removed outliers. For example, in the "grau" (grades) attribute, a few marks were greater than 10, which is invalid under the university marking scheme, and they were therefore removed. Some attributes were also removed, such as "grupos", which was irrelevant and essentially a copy of another attribute ("disciplina"). Table 3.1 presents the information on the data pre-processing step.
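The cleaning rules described here can be sketched in a few lines of plain Python. The rows below are hypothetical and only reuse attribute names from the dataset description; the helper names `clean_records` and `drop_attributes` are assumptions, and the sketch treats any grade outside the 0 to 10 scale as invalid.

```python
def clean_records(records):
    """Drop exact duplicates and rows whose grade falls outside the 0-10 scale."""
    seen = set()
    cleaned = []
    for row in records:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        if not 0 <= row["grau"] <= 10:
            continue  # invalid under the 0-10 marking scheme
        cleaned.append(row)
    return cleaned


def drop_attributes(records, names):
    """Return copies of the rows without the listed attributes."""
    return [{k: v for k, v in row.items() if k not in names} for row in records]


# Hypothetical rows using attribute names from the dataset description.
rows = [
    {"matricula": "aluno1", "disciplina": "D1", "grupos": "D1", "grau": 7.5},
    {"matricula": "aluno1", "disciplina": "D1", "grupos": "D1", "grau": 7.5},
    {"matricula": "aluno2", "disciplina": "D1", "grupos": "D1", "grau": 12.0},
]
cleaned = drop_attributes(clean_records(rows), {"grupos"})
print(cleaned)
```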
Another technique used was data validation, where inappropriate data is identified and removed. As part of the data pre-processing step, we also converted categorical data to numerical data. For this step we used LabelEncoder, a utility provided by the Sklearn library (https://scikit-learn.org/).
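As a rough illustration of what label encoding does, the sketch below maps each distinct category to an integer in sorted order of the categories. It is a plain-Python stand-in written for this example, not the library class itself, and the category values are hypothetical.

```python
def fit_label_encoding(values):
    """Map each distinct category to an integer, in sorted order of the categories."""
    return {label: index for index, label in enumerate(sorted(set(values)))}


def encode(values, mapping):
    """Replace every category by its integer code."""
    return [mapping[v] for v in values]


# Hypothetical category values, invented for illustration.
categories = ["CSI", "ADM", "ARQ", "ADM"]
mapping = fit_label_encoding(categories)
print(mapping)
print(encode(categories, mapping))
```

The integer codes carry no ordinal meaning; they only give the learning algorithms a numerical representation of each category.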
For the dropout status attribute ("sit_vinculo_atual"), the instances were replaced by 0 (dropout) and 1 (enrolled). For data imputation, we employed Random Forest (RF), a popular machine learning algorithm that constructs multiple decision trees during training. RF handles imputation better than other algorithms regardless of whether the data is normal or non-normal, linear or non-linear [Pantanowitz2009], [Shah2014], and [Hong2020]. Specifically, columns are imputed in order from the one with the fewest missing values to the ones with more; the missing values in the other columns are temporarily filled with 0, and RF is then run iteratively until every missing value has been handled. The procedure of imputation is shown in Algorithm 1. Finally, the data is grouped by the "course" and "semester" attributes.
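Two of the steps above are simple enough to sketch directly: the binary encoding of the status attribute and the order in which columns are visited during imputation. The sketch assumes that every status other than the three dropout statuses counts as enrolled and that a missing value is represented by `None`; the function names and example rows are hypothetical, and the Random Forest model itself is not reproduced here.

```python
# The three statuses that the dataset description identifies as dropout.
DROPOUT_STATUSES = {"DESLIGADO", "MATRICULA EM ABANDONO", "JUBILADO"}


def encode_status(status):
    """Return 0 for a dropout status and 1 otherwise (treated as enrolled)."""
    return 0 if status in DROPOUT_STATUSES else 1


def imputation_order(rows, columns):
    """Order the incomplete columns from fewest to most missing values."""
    missing = {c: sum(1 for r in rows if r[c] is None) for c in columns}
    return sorted((c for c in columns if missing[c] > 0), key=lambda c: missing[c])


# Hypothetical rows; None marks a missing value.
rows = [
    {"grau": 7.0, "pontos_enem": None, "no_creditos": 4},
    {"grau": None, "pontos_enem": None, "no_creditos": 4},
    {"grau": 5.5, "pontos_enem": 610.0, "no_creditos": 2},
]
print(imputation_order(rows, ["grau", "pontos_enem", "no_creditos"]))
print([encode_status(s) for s in ["JUBILADO", "OUTRO"]])  # "OUTRO" is a placeholder status
```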
The final step normalizes all the input data except the "sit_vinculo_atual" and "semestre" attributes with the z-score method, as z-score normalization can yield better model performance than other techniques [Imron2020], [Chris2003]. The z-score of a value $x$ from a set of values is computed from the mean of the set ($\mu$) and its standard deviation ($\sigma$). The formula is presented in what follows.

$$z = \frac{x - \mu}{\sigma} \tag{3.1}$$
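A small worked example of z-score normalisation is given below. It is a sketch that assumes the population standard deviation (dividing by the number of values), since the text does not state which variant was used; the function name and the sample values are illustrative only.

```python
import math


def z_scores(values):
    """Normalise values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(v - mean) / std for v in values]


# Hypothetical grades: the mean is 5 and the standard deviation is 2.
print(z_scores([2, 4, 4, 4, 5, 5, 7, 9]))
```

After this transformation each attribute has mean 0 and standard deviation 1, so attributes measured on different scales contribute comparably during training.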
The last pre-processing step applied to the dataset is the data enrichment step. As Table 3.1 shows, the number of instances in the CSI dataset is very small and not sufficient for training/testing purposes. We therefore used the Synthetic Minority Oversampling Technique (SMOTE) to increase the number of instances of the CSI dataset. Briefly, SMOTE generates new instances based on real ones: it creates the new instances from the minority class in order to balance the dataset.
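The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration rather than the implementation used in the experiments; the function signature, the value of `k`, and the sample points are assumptions.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority sample
    and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class samples with two numerical features.
minority_points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
print(smote(minority_points, n_new=3))
```

Because every synthetic point lies on a segment between two real minority samples, the new instances stay inside the region already occupied by the minority class.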
3.1.3 Feature Selection
After the data is pre-processed, feature selection is applied to remove irrelevant features, which reduces training time and increases computational efficiency. This section therefore introduces a Support Vector Machine (SVM)-based Genetic Algorithm (GA).
In GA, optimization strategies are implemented based on a simulation of the evolution of a species by natural selection. Table 3.2 lists popular GA operators for creating optimal solutions and solving search problems based on biologically inspired principles such as mutation, crossover, and selection. The application of GA to feature selection is popular and has been shown to improve model performance in [Babatunde2014], [Huang2007], and [Leardi2000]. The whole procedure is displayed in Figure 3.6. In this thesis, GA-based feature selection was performed with the following set of techniques. First, a population of 1,000 individuals was randomly generated as the initial solutions, followed by binary encoding, in which 1 marks a feature as chosen and 0 marks a column that will not be chosen. The next step is to calculate the fitness of each individual in the population by running SVM on the three datasets (ADM, CSI, and ARQ). SVM is a supervised machine learning technique that is most commonly employed for classification purposes, and it performs well in the fitness calculation for feature selection [TAO2019323]. With a linear kernel, it is defined as a linear classifier with maximum margin in the feature space, and maximizing the margin results in a convex quadratic programming problem.

Consider a training dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$ with $x_i \in \mathbb{R}^n$, where $m$ is the number of samples in the dataset and $n$ is the number of features. In this experiment, $y_i$ is the binary dropout label of sample $i$; following the usual convention, the two classes are written as $y_i \in \{-1, +1\}$ in the derivation. The target is to obtain a decision boundary that separates the samples into different areas. The separator in two-dimensional space is a straight line and can be written as formula 3.2, where $x^{(1)}$ and $x^{(2)}$ are the two coordinates of a point $x$:

$$w_1 x^{(1)} + w_2 x^{(2)} + b = 0 \tag{3.2}$$

Once the separator is mapped to the $n$-dimensional space, it becomes a separating hyperplane. It can be written as formula 3.3:

$$w^{T} x + b = 0 \tag{3.3}$$

where $w = (w_1, \ldots, w_n)$ is the normal vector that determines the direction of the hyperplane and $b$ is the displacement term that determines the distance between the hyperplane and the origin. Once the vector $w$ and the bias $b$ are determined, the dividing hyperplane is denoted as $(w, b)$. The distance $r$ from a given point $x$ in the sample space to the hyperplane is given by formula 3.4:

$$r = \frac{|w^{T} x + b|}{\lVert w \rVert} \tag{3.4}$$

where $\lVert w \rVert$ is the 2-norm, which is defined as in formula 3.5:

$$\lVert w \rVert = \sqrt{\sum_{j=1}^{n} w_j^{2}} \tag{3.5}$$

Suppose the hyperplane $(w, b)$ classifies the training dataset correctly, so that for every $(x_i, y_i) \in D$ the following holds:

$$\begin{cases} w^{T} x_i + b \geq +1, & y_i = +1 \\ w^{T} x_i + b \leq -1, & y_i = -1 \end{cases} \tag{3.6}$$

The vectors closest to the hyperplane, for which equality holds in formula 3.6, are named "support vectors". The sum of the distances to the hyperplane of two support vectors from different classes is called the "margin", which is defined in formula 3.7:

$$\gamma = \frac{2}{\lVert w \rVert} \tag{3.7}$$

Obtaining the separating hyperplane with maximum margin means finding the parameters $w$ and $b$ that comply with the constraint in formula 3.6, such that the optimal hyperplane classifies all the points in $D$ correctly, namely:

$$\max_{w, b} \frac{2}{\lVert w \rVert} \quad \text{s.t.} \quad y_i (w^{T} x_i + b) \geq 1, \quad i = 1, 2, \ldots, m \tag{3.8}$$
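The quantities in this derivation can be checked numerically for a fixed hyperplane. The sketch below does not train an SVM; it only evaluates the point-to-hyperplane distance |w·x + b| / ||w||, the margin 2 / ||w||, and the constraint y(w·x + b) ≥ 1 for a hand-picked hyperplane. The labels are assumed to be -1 and +1, and the values of `w`, `b` and the samples are hypothetical.

```python
import math


def norm(w):
    """2-norm of the weight vector."""
    return math.sqrt(sum(wi * wi for wi in w))


def decision_value(w, b, x):
    """Value of w.x + b for a point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_hyperplane(w, b, x):
    """Distance from x to the hyperplane w.x + b = 0."""
    return abs(decision_value(w, b, x)) / norm(w)


def margin(w):
    """Width of the margin, 2 / ||w||."""
    return 2 / norm(w)


def satisfies_constraints(w, b, samples):
    """Check y * (w.x + b) >= 1 for every labelled sample (x, y)."""
    return all(y * decision_value(w, b, x) >= 1 for x, y in samples)


# Hypothetical hyperplane and samples; (1, 1) and (2, 2) act as support vectors.
w, b = [1.0, 1.0], -3.0
samples = [([1.0, 1.0], -1), ([2.0, 2.0], 1), ([0.0, 0.0], -1), ([3.0, 3.0], 1)]

print(satisfies_constraints(w, b, samples))
print(margin(w))
print(distance_to_hyperplane(w, b, [1.0, 1.0]) + distance_to_hyperplane(w, b, [2.0, 2.0]))
```

In this example the two support-vector distances add up to the margin, which is the relationship the maximisation problem exploits.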
The fitness computation with SVM under 5-fold cross-validation, a technique that helps to prevent overfitting, is followed by proportional selection of the individuals. The process is similar to a roulette wheel: individuals with higher fitness ratings have a greater chance of being selected. The formula of this process is shown in equation 3.9:

$$P(c_i) = \frac{f(c_i)}{\sum_{j=1}^{N} f(c_j)} \tag{3.9}$$

where $N$ is the total number of chromosomes in the population (initially 1,000), $P(c_i)$ is the probability of chromosome $c_i$ being selected, and $f(c_i)$ is the fitness of $c_i$, a floating-point value greater than 0.
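Proportional (roulette-wheel) selection can be sketched with the standard library alone. The population labels and fitness values below are invented for illustration, and the function names are assumptions; in the described workflow the fitness would come from the cross-validated SVM.

```python
import random


def selection_probabilities(fitness):
    """Probability of each individual: its fitness divided by the total fitness."""
    total = sum(fitness)
    return [f / total for f in fitness]


def roulette_select(population, fitness, n, rng):
    """Draw n individuals with probability proportional to fitness."""
    probabilities = selection_probabilities(fitness)
    return rng.choices(population, weights=probabilities, k=n)


# Hypothetical population of three chromosomes with positive fitness values.
population = ["c1", "c2", "c3"]
fitness = [0.2, 0.5, 0.3]
rng = random.Random(0)
print(selection_probabilities(fitness))
print(roulette_select(population, fitness, 5, rng))
```

Sampling is done with replacement, so a very fit chromosome can be selected several times while a weak one may not be selected at all.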
After selection, uniform crossover was performed on each pair of selected parents to produce offspring. Specifically, each gene of the offspring is selected at random from the corresponding genes of the two parent chromosomes, as shown in Figure 3.7. Note that this method yields only one offspring and that the crossover rate is 0.75. Reproduction is followed by mutation, in which certain genes are changed at random according to a set rate, which in this thesis is 0.002. The final step is the termination criterion: when the maximum number of generations is reached, that is, when the population has evolved for 100 generations, an optimal feature subset is produced as the output. The entire workflow is illustrated in Figure 3.8.
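Uniform crossover and bit-flip mutation on binary chromosomes can be sketched as below, using the crossover rate of 0.75 and the mutation rate of 0.002 stated in the text. The function names are hypothetical, and the fallback of copying the first parent when crossover is not applied is an assumption, since the text does not say what happens in that case.

```python
import random

CROSSOVER_RATE = 0.75
MUTATION_RATE = 0.002


def uniform_crossover(parent_a, parent_b, rng):
    """Build one offspring by picking each gene from either parent at random."""
    return [rng.choice((ga, gb)) for ga, gb in zip(parent_a, parent_b)]


def mutate(chromosome, rng, rate=MUTATION_RATE):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]


def reproduce(parent_a, parent_b, rng):
    """Apply crossover with probability CROSSOVER_RATE, then mutation."""
    if rng.random() < CROSSOVER_RATE:
        child = uniform_crossover(parent_a, parent_b, rng)
    else:
        child = list(parent_a)  # assumption: otherwise the first parent is copied
    return mutate(child, rng)


# Hypothetical parents: 1 means the feature is selected, 0 means it is not.
rng = random.Random(0)
parent_a = [1, 1, 1, 1, 0, 0]
parent_b = [0, 0, 1, 1, 1, 1]
print(reproduce(parent_a, parent_b, rng))
```

Positions where both parents agree are always inherited unchanged by crossover, so only mutation can alter them.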
|Initialization||initialization of population acquisition||to generate a set of solutions||Randomly generated initialization|
|Encoding||representation of the individuals||to convey the necessary information||Binary Encoding|
|Fitness||the degree of health of individuals||to evaluate the optimality of a solution||Support Vector Machine (SVM)|
|Crossover||parents are chosen for production||to determine which solutions are to be preserved||Uniform Crossover|
|Selection||solutions are selected for production||to select the individuals with better adaptability||Proportional Selection|
|Mutation||a gene is deliberately changed||to maintain diversity in the population set||mutation rate setting-up|
|Termination||a process stops the evolution||to terminate and output the optimal outcomes||Maximum generations|
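The crossover and mutation operators summarized above can be sketched on binary-encoded chromosomes as follows. The rates are those stated in the text; the function names are hypothetical, and the handling of a pair for which crossover is not triggered (copying the first parent) is an assumption, since the text does not specify it.

```python
import random

CROSSOVER_RATE = 0.75  # crossover rate stated in the text
MUTATION_RATE = 0.002  # mutation rate stated in the text

def uniform_crossover(parent_a, parent_b, rng):
    """Return one offspring; each gene is copied from a randomly chosen parent."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]

def mutate(chromosome, rng, rate=MUTATION_RATE):
    """Flip each binary gene independently with probability `rate`."""
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]

def reproduce(parent_a, parent_b, rng):
    """Produce one child from two parents via crossover followed by mutation."""
    if rng.random() < CROSSOVER_RATE:
        child = uniform_crossover(parent_a, parent_b, rng)
    else:
        # Assumption: without crossover, the first parent is copied unchanged.
        child = list(parent_a)
    return mutate(child, rng)

rng = random.Random(42)
parent_a = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = feature kept, 0 = feature dropped
parent_b = [0, 1, 0, 1, 0, 1, 0, 1]
print(reproduce(parent_a, parent_b, rng))
```

Where both parents agree on a gene, uniform crossover always preserves it, so only the positions on which the parents differ are actually randomized.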
3.1.4 Training and Test
After feature selection, three subsets with optimal features have been generated, one for each of the three datasets used in this study. In this step, the model is trained on the processed data in order to predict dropout.
To begin with, long short-term memory (LSTM), a special type of recurrent neural network (RNN), is used as the training model in this study. Unlike traditional feedforward neural networks, it has feedback connections, and unlike traditional RNNs, it has a gating scheme; together these enable the LSTM to learn and retain information over time sequences. Human learning activities can likewise be matched to the gating scheme of the LSTM. In this study, the LSTM also simulates the influence of past exams on current performance. Furthermore, the LSTM is able to handle the vanishing gradient problem. Several previous studies have relied on LSTM, for example Deep Knowledge Tracing to predict student performance in quizzes [Chris2015] and Human Activity Recognition based on the combination of LSTM and a convolutional neural network (CNN) [Xia2020].
As illustrated in Figure 3.9, the LSTM consists of the components listed in Table 3.3. The input at each time step comprises three elements: $x_t$, $h_{t-1}$, and $c_{t-1}$. As for the outputs, $h_t$ and $c_t$ are produced by the LSTM. Note that there are 8 weight parameters $W$ in total, 4 associated with the hidden state and the other 4 with the input. Moreover, 4 biases $b$ are used in one unit. All $W$ and $b$ are initialized randomly at first and are then adjusted by back-propagation. To prevent exploding gradients, the gradients are clipped when they reach the chosen threshold, which is 1.01.
|Forget Gate||activation vector with sigmoid|
|Candidate layer||activation vector with Tanh|
|Input Gate||activation vector with sigmoid|
|Output Gate||activation vector with sigmoid|
|Hidden state||hidden state vector|
|Memory state||cell state vector|
In one pass of the LSTM over a sequence from the dataset of Section 3.1.3, the first step is to determine which information will be forgotten, as formula 3.10 illustrates. This step is executed by the forget gate, whose sigmoid activation produces an output between 0 and 1.
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \qquad (3.10)$$
where $t$ is between 0 and 31 inclusive, which indicates that there are 32 time steps per student, representing the information across 8 semesters with 4 courses each semester. For students who dropped out midway or have not finished all the semesters, the time series sequences are padded so that the total number of time steps is the same. $x_t$ is the input, and the size of the initial input ($x_0$) equals the number of features in the input dataset. Furthermore, there are two hidden layers and the hidden size is 50, which determines the size of $h_t$ and $c_t$.
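The padding step described above can be sketched as follows. The text does not state the padding value or whether padding is added at the start or the end of a sequence, so zero-padding at the end is an assumption here, and the function name is hypothetical.

```python
TIME_STEPS = 32  # 8 semesters x 4 courses per semester, as stated in the text

def pad_sequence(steps, n_features, pad_value=0.0, length=TIME_STEPS):
    """Pad (or truncate) a list of per-step feature vectors to a fixed length.

    Assumption: missing time steps are filled with `pad_value` at the end.
    """
    padded = [list(step) for step in steps[:length]]
    missing = length - len(padded)
    padded.extend([pad_value] * n_features for _ in range(missing))
    return padded

# A hypothetical student with 5 recorded time steps and 2 features per step.
student = [[1.0, 2.0] for _ in range(5)]
sequence = pad_sequence(student, n_features=2)
print(len(sequence), sequence[4], sequence[5])
```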
The second step is to decide what new information will be stored in the cell state. First, the input gate produces the update values $i_t$; then a tanh layer creates a vector of new candidate values $\tilde{c}_t$. These values are combined to update the state. The formulas are shown in 3.11 and 3.12:
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \qquad (3.11)$$
$$\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \qquad (3.12)$$
The next step is to update $c_{t-1}$ and create the new cell state $c_t$. According to formula 3.13, the previous outputs $f_t$, $i_t$, and $\tilde{c}_t$ are combined by element-wise operations to obtain the final cell state $c_t$, which will be the input of the next step.
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad (3.13)$$
The final step is to output the hidden state. As shown in formulas 3.14 and 3.15, similar to steps 1 and 2, $h_{t-1}$ and $x_t$ are passed as inputs through the output gate with sigmoid, which decides which parts of the state are output. The cell state is then passed through a tanh layer and combined with the gate output to obtain the hidden state $h_t$, which is used as the input at time step $t+1$ or by another neural network.
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \qquad (3.14)$$
$$h_t = o_t \odot \tanh(c_t) \qquad (3.15)$$
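The four steps above can be traced in a single LSTM time step written in plain Python. This is an illustrative sketch of the standard LSTM cell with 8 weight matrices and 4 biases, not the code used in the experiments; the parameter layout (`params[gate] = (W_x, W_h, b)`) and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def affine(W_x, W_h, b, x, h):
    """Compute W_x x + W_h h + b with list-based matrices and vectors."""
    return [
        sum(wx * xi for wx, xi in zip(W_x[r], x))
        + sum(wh * hi for wh, hi in zip(W_h[r], h))
        + b[r]
        for r in range(len(b))
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps 'f', 'i', 'c', 'o' to (W_x, W_h, b)."""
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]          # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]          # input gate
    c_tilde = [math.tanh(z) for z in affine(*params["c"], x, h_prev)]  # candidate values
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]          # output gate
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    h = [ot * math.tanh(ci) for ot, ci in zip(o, c)]
    return h, c

# Tiny example: 3 input features, hidden size 2, all parameters set to zero.
def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

params = {gate: (zeros(2, 3), zeros(2, 2), [0.0, 0.0]) for gate in "fico"}
h, c = lstm_step([1.0, 2.0, 3.0], [0.1, 0.2], [1.0, -2.0], params)
print(h, c)
```

With all parameters at zero, every gate outputs 0.5 and the candidate values are 0, so the new cell state is simply half of the previous one; this makes the role of the forget gate easy to see in isolation.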
In this study, after a series of hyper-parameter tuning runs, the final output hidden state is used as the input of a 3-layer fully connected (FC) neural network to predict dropout. The hidden size of the FC network is 128. Sigmoid is applied in the first layer, and ReLU is applied in the output layer. As for the loss, Mean Squared Error (MSE) is used, as illustrated in formula 3.16:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (3.16)$$
where $n$ is the number of samples, $y_i$ is the observed value, and $\hat{y}_i$ is the predicted value.
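For concreteness, the MSE loss can be computed as in the short sketch below. The labels and predictions are made-up values, with 1 standing for dropout and 0 for non-dropout purely for illustration.

```python
def mean_squared_error(observed, predicted):
    """Mean of the squared differences between observed and predicted values."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / len(observed)

# Hypothetical labels and model outputs (illustrative values only).
observed = [1.0, 0.0, 1.0, 0.0]
predicted = [0.5, 0.5, 1.0, 0.0]
print(mean_squared_error(observed, predicted))
```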
After the loss is computed, back-propagation is executed to adjust the weights, using Adam as the optimizer. Furthermore, to prevent overfitting, a dropout layer with a dropout probability of 0.7 is applied to the outputs of each LSTM layer except the last. Figure 3.10 visualizes one training pass.
3.2 Curriculum Semantic Analysis
Identifying dropout is an important step towards understanding why students fail and what can be done to increase graduation rates. We argue that the sequence of courses taken by students may influence student attrition. This section presents an approach to ordering courses based on their descriptions. The idea is that, in the future, it will be possible to correlate dropout rates, the sequence of courses taken by students, and the course content/description.
We divide this section into two parts. The first part introduces a method to measure the similarity between course pairs using Bidirectional Encoder Representations from Transformers (the so-called BERT language model). The second part orders course pairs by employing Semi-Reference Distance (SemRefD) [Manrique2019].
3.2.1 Procedure Overview
As Figures 3.11 and 7.2 show, the first stage is to utilize BERT to conduct the similarity measurement. It involves three steps: dataset acquisition, sentence embedding, and similarity measurement. The second stage uses the same dataset: entities are extracted first, followed by concept comparison, and finally the sequence between courses is identified.
3.2.2 Similarity Measurement
To compare the course descriptions, it is necessary to encode the contextual data into vectors so that they can be compared semantically. To achieve this, we make use of BERT for sentence embedding.
BERT, proposed by Google [Jacob2018], is a pre-training model that uses a multi-head attention based Transformer to learn contextual relations between words or sentences. The Transformer was also proposed by Google [Ashish2017]. It consists of an encoder, which processes the input bidirectionally, and a decoder. Each encoder layer contains a scaled dot-product attention layer and a feedforward neural network layer. Self-attention uses a matrix representation to calculate the scores and the final output embedding in one step, as formula 3.17 illustrates:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (3.17)$$
where $K$ and $V$ form the key-value pairs of the input, with keys of dimension $d_k$, and $Q$ stands for the query. The output is a weighted sum of the values.
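A single head of scaled dot-product attention can be written in a few lines of plain Python, as sketched below. Matrices are represented as lists of rows, and the query, key, and value matrices are tiny made-up examples rather than BERT embeddings.

```python
import math

def softmax(row):
    """Softmax over a list of scores, shifted by the maximum for stability."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V for list-of-rows matrices."""
    d_k = len(K[0])
    output = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output

# Illustrative values: one query, two key-value pairs.
Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(scaled_dot_product_attention(Q, K, V))
```

A query that is equally similar to all keys receives the plain average of the values, while a query strongly aligned with one key receives almost exactly that key's value.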
Compared with computing the attention once, the multi-head mechanism used in BERT runs the scaled dot-product attention several times in parallel and creates separate embedding matrices that are combined into one final output embedding, as formula 3.18 shows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O} \qquad (3.18)$$
where $\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are separate weight matrices.
BERT contains 12 stacked encoders in the base version and 24 stacked encoders in the large version. Transformer encoders read the entire sequence of words at once, instead of reading the text input sequentially as directional models do. With this characteristic, the model is able to determine a word's context based on all of its surrounding information, as shown in Figure 3.13.
To start with, this study uses the ANU Course & Program Website (https://programsandcourses.anu.edu.au/) as its dataset. Furthermore, this study focuses on Computer Science courses; hence, only courses whose code begins with "COMP" are considered. A tokenizer segments the full text into sentences, and BERT is then applied to encode each sentence, which yields the sentence vectors. Next, the cosine similarity between the vectors is computed, where a higher similarity corresponds to a smaller distance, as equation 3.19 specifies:
$$\cos(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|} \qquad (3.19)$$
where $u$ and $v$ are sentence vectors generated by BERT. The output lies within [-1, 1] and depicts the degree of contextual similarity.
To obtain the overall contextual similarity between two courses, the average similarity among their sentences is calculated. After the processes above, the course similarity is obtained.
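The similarity computation can be sketched as follows. The sentence vectors here are short made-up lists standing in for BERT embeddings, and averaging over all cross-course sentence pairs is an assumption, since the text only states that the average similarity among sentences is taken.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def course_similarity(sentences_a, sentences_b):
    """Average cosine similarity over all sentence pairs of two courses.

    Assumption: every sentence of one course is paired with every sentence
    of the other course.
    """
    sims = [cosine_similarity(u, v) for u in sentences_a for v in sentences_b]
    return sum(sims) / len(sims)

# Hypothetical 2-dimensional "sentence vectors" for two courses.
course_a = [[1.0, 0.0], [0.0, 1.0]]
course_b = [[1.0, 0.0]]
print(course_similarity(course_a, course_b))
```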
3.2.3 Prerequisite Identification
Although the similarity between courses has been computed, the order between highly similar courses remains uncertain; hence, it is necessary to identify the prerequisite dependency (PD) between two highly similar courses.
PD is a relation between two concepts where, in education, the prerequisite concept should be taught first. For instance, Binary Tree and Red-black Tree are two concepts in the data structures field of Computer Science, and the latter should be introduced after the former. By measuring the prerequisite relationships between courses, the curriculum can be analyzed as a whole.
To begin with, the same dataset and pre-processing techniques as in the similarity measurement are used. Subsequently, the entities behind the text are extracted with a tool named TextRazor. TextRazor is an NLP-based API developed to segment text and capture conceptual terms. Then a technique called Semi-Reference Distance (SemRefD) is used to measure the semantic PD between the entities of two courses in the DBpedia (https://www.dbpedia.org/) Knowledge Graph (KG).
A KG is also known as a semantic network: it represents a network of real-world entities, which in this study are concepts, and illustrates the relationships between them. DBpedia is one of the main KGs on the Semantic Web and covers a wide variety of topics, which makes it suitable for courses from various fields of study. Moreover, it accommodates many different semantic properties, which allows flexible connections between multiple types of concepts. When a given concept is used as a query in DBpedia, two lists of candidate concepts are produced: the direct list and the neighbor list. The direct list contains the concepts that share a category with the given concept. The neighbor list expands the candidate list by adding concepts linked to the target through non-hierarchical paths up to a given number of hops [Ruben2019]. The path length parameter determines the maximum length of the path between the target concept and the farthest candidate concept to consider; in this study, it is set to 1. After the candidate lists are acquired, SemRefD is applied to compute the degree of prerequisite in the next step.
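SemRefD was presented by Manrique et al. [Ruben2019] based on Reference Distance (RefD), which was proposed by Chen et al. [Liang2018]. RefD takes two concepts, denoted $c_A$ and $c_B$, as input and is defined in formula 3.20:
$$\mathrm{RefD}(c_A, c_B) = \frac{\sum_{j} i(c_j, c_B)\, w(c_j, c_A)}{\sum_{j} w(c_j, c_A)} - \frac{\sum_{j} i(c_j, c_A)\, w(c_j, c_B)}{\sum_{j} w(c_j, c_B)} \qquad (3.20)$$
where the sums run over the candidate concepts $c_j$. The indicator function $i(c_j, c_A)$ indicates whether there is a relationship between $c_j$ and $c_A$, and the weighting function $w(c_j, c_A)$ weights the importance of $c_j$ to $c_A$. The values of RefD range from -1 to 1. According to Figure 3.14, $c_B$ is more likely to be a prerequisite for $c_A$ if the value is closer to 1.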
RefD does not take into account the semantic properties of DBpedia when determining whether two concepts have a prerequisite dependency, whereas SemRefD does. In the weighting function, the common neighbor concepts in the KG hierarchy are considered, while in the indicator function, the property paths between the target concept and the related concepts are considered [Ruben2019].
As a result, all concepts from the two courses are compared and the scores are summed to reveal their order, that is, whether A is a prerequisite of B or B is a prerequisite of A.
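To make the reference-distance idea concrete, the sketch below computes plain RefD with equal weights on a tiny hand-made link graph and then sums the scores over the concept pairs of two courses. It is not SemRefD: the DBpedia-based weighting and indicator functions are replaced by simple stand-ins, the graph is invented for illustration, and all function names are hypothetical.

```python
# Toy link graph: each concept maps to the set of concepts it links to.
# These links are invented for illustration and are not DBpedia data.
LINKS = {
    "Red-black tree": {"Binary tree", "Tree rotation"},
    "Tree rotation": {"Binary tree"},
    "Binary tree": {"Tree (data structure)"},
    "Tree (data structure)": set(),
}

def refers_to(concept, target):
    """Indicator stand-in: 1 if `concept` links to `target`, else 0."""
    return 1 if target in LINKS.get(concept, set()) else 0

def reference_share(source, target):
    """Equal-weight share of the concepts related to `source` that refer to `target`."""
    related = LINKS.get(source, set())
    if not related:
        return 0.0
    return sum(refers_to(c, target) for c in related) / len(related)

def ref_d(a, b):
    """RefD(a, b) in [-1, 1]; positive values suggest b is a prerequisite of a."""
    return reference_share(a, b) - reference_share(b, a)

def course_prerequisite_score(concepts_a, concepts_b):
    """Sum of RefD over all concept pairs drawn from two courses."""
    return sum(ref_d(a, b) for a in concepts_a for b in concepts_b)

print(ref_d("Red-black tree", "Binary tree"))
print(course_prerequisite_score(["Red-black tree"], ["Binary tree"]))
```

In this toy graph, half of the concepts related to Red-black tree refer to Binary tree while nothing related to Binary tree refers back, so the score is positive, in line with the example that Binary Tree should be taught before Red-black Tree.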
4.1 Experimental Environment
The experiments are conducted on the devices listed in Table 4.1, together with their hardware. Specifically, the paper retrieval process of the systematic review is completed on the Apple Macbook Pro. The rest of the experiments are conducted on the server.
|Model||CPU||GPU||Memory||Hard disk size||System|
|Server||Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz||Tesla V100-DGXS-32GB*4||251GB||10TB||Ubuntu 18.04.6|
|Apple Macbook Pro||M1 chip||8-core GPU||16GB||512GB||macOS Monterey 12.0.1|
4.2 Dropout Prediction
This section contains two subsections. The evaluation criteria are presented first, followed by the results of feature selection and dropout prediction.
4.2.1 Evaluation Metrics
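Dropout prediction is evaluated by accuracy, precision, recall, and F1 score based on the confusion matrix shown in Figure 4.1. Specifically, $TP$, $TN$, $FP$, and $FN$ are defined as below:
True Positive ($TP$): the model predicted a positive outcome and the prediction is correct.
True Negative ($TN$): the model predicted a negative outcome and the prediction is correct.
False Positive ($FP$): the model predicted a positive outcome and the prediction is incorrect.
False Negative ($FN$): the model predicted a negative outcome and the prediction is incorrect.
To be specific, accuracy is the proportion of correctly predicted observations among all samples. Precision is the ratio of correctly predicted positive samples to all samples predicted as positive. Recall depicts the sensitivity of the model as the ratio of correctly predicted positive observations to all actual positive observations. The F1 score measures the overall performance of the model as the harmonic mean of precision and recall. The equations are shown in formulas 4.1 - 4.4 below:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4.1)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (4.2)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (4.3)$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (4.4)$$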
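The four metrics can be computed directly from the confusion-matrix counts, as in the sketch below. The counts are made-up numbers for illustration and are not results from this study; returning 0.0 when a denominator is zero is a convention chosen here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts (illustrative values only).
metrics = classification_metrics(tp=6, tn=2, fp=2, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this example recall is perfect while precision is lower, the same pattern of a model that leans towards positive predictions.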
4.2.2 Experimental Results and Discussions
As mentioned in Section 3.1.2, the ADM, ARQ, and CSI datasets are used to evaluate the proposed models in terms of feature selection and dropout prediction. The first part covers feature selection, which was proposed in Section 3.1.3, followed by dropout prediction, as described in Section 3.1.4.
4.2.2.1 Feature Selection
The results of feature selection are listed in Table 4.2. After 100 generations of evolution in a population of 1,000 individuals, the final optimal individual for ARQ performs best among the datasets in terms of accuracy (90.61%) and F1 score (95.03%), with the fewest dropped features, indicating that this individual fits the environment better than the others. By comparison, CSI has the lowest performance on the evaluation metrics. Precision is lower than recall across the datasets used, which indicates that the model is biased towards predicting positive rather than negative and emphasises the importance of balanced datasets. The ratio of negative samples in CSI (24%), ADM (13%), and ARQ (10%) supports this inference. Thus, applying the model to a balanced dataset may improve its performance. As for ADM, its recall is nearly the same as that of ARQ and is the best recall score in this experiment. Overall, the results show that this model is suitable for this experiment.
|Dataset||Accuracy(%)||Precision(%)||Recall(%)||F1 score(%)||No. dropped features|
4.2.2.2 Dropout Prediction
Likewise, this experiment uses the same datasets for dropout prediction. As dropout prediction is one of the main objectives of this project, the performance of the proposed model is important. As illustrated in Figures 4.3 - 4.7, the training outcomes reveal the capacity of the model, whose end of the abscissa is the . Furthermore, the performance of the model on the test set further validates the suitability of the model for the datasets, as shown in Table 4.3.
According to the accuracy and loss during training, we can observe that the model always converges after 10 epochs. The accuracy on the ADM and ARQ datasets starts below 10%, falls even lower, and then rises close to the top. For ARQ, the curve is similar to some extent. After investigation, several possible reasons for this abnormal curve were identified. For instance, the dropout rate is high, so the model drops well-functioning neurons, or the batch size is large. In this study, the time step is fixed, as mentioned in Section 3.1.4, which means that the batch size cannot be controlled. In addition, after hyper-parameter tuning, the high dropout rate (0.7) gives the best results of the model, which shows that selecting the current dropout rate is a trade-off. Thus, this study chose to keep the better performance by using the current dropout rate.
In terms of loss, after convergence is reached, the loss curve repeats each time the time step is reset. Apart from this repetition, we also observed an unstable curve within one epoch of training, which is named multiple descent [Lin2020]. The reasons behind this anomaly may vary. For example, suppose there are two minima: when the distance between them is very small and the learning rate is not small enough, the optimizer can step across one local minimum and eventually arrive at the other. This phenomenon can also be caused by the datasets [Lin2020].
With respect to the performance of the proposed model on the test set, the model performs well on ADM and ARQ, where the best accuracy reaches 92.83% and 97.65% respectively. Notably, the accuracy on ARQ improves on the result of Manrique's previous work (95.2%) by 2.45 percentage points [Ruben2019]. Finally, we identified a potential improvement for the model: adopting dynamic time steps instead of fixed steps in order to make full use of the dataset. Overall, the model proposed by this study is suitable for the current datasets.
|Dataset||Avg. acc (10 iter.)(%)||Avg. acc (100 iter.)(%)||Avg. acc (200 iter.)(%)||Top. acc(%)|
4.3 Curriculum Semantic Analysis
4.3.1 Similarity Measurement
After the sentences were encoded by BERT and the average similarity between courses was computed, the results were visualized as heat maps, from Figure 4.8, which shows the comparison of all the courses with similarities ranging from 0.8265 to 1.0, to Figures 4.10 - 4.12 below.
In the comparison of 1000-level courses, we can see that COMP1110 (Structured Programming) has the closest average distance to the rest of the courses. In contrast, COMP1600 (Foundations of Computing) is the farthest from the others. Interpreting this from the course content, COMP1110, like COMP1100 (Programming as Problem Solving), is one of the most fundamental programming courses among the 1000-level courses, whereas COMP1600 focuses on the mathematical perspective.
With respect to the comparison of 2000-level courses, COMP2100 (Software Design Methodologies) has the closest average distance to the rest of the courses. On the contrary, COMP2560 (Studies in Advanced Computing R & D) is the farthest from the others. To sum up, the overall similarity between 2000-level courses is lower than that between 1000-level courses, which indicates that curriculum differentiation has begun to appear.
As regards 3000-level and 4000-level courses, this trend becomes more pronounced, which also aligns with real situations: as students' knowledge grows, the course content deepens and is divided into specializations, such as Data Science and Machine Learning. In the meantime, students also have the interest to dive deeper, which helps them make use of their strengths in their area of interest and in turn improves their academic performance. Finally, we found automatic evaluation to be an obstacle, which is a common limitation among the studies in the systematic review [Piedra2018], [Pata2013], [Yu2007], [Saquicela2018SimilarityDA] and [Tapia2018].
4.3.2 Prerequisite Identification
As presented in Section 4.3.1, the similarity between courses has been computed; thus, it is necessary to identify the prerequisite relationship between two similar courses.
For this study, we selected three courses, COMP1100, COMP1110, and COMP2100, to conduct this experiment, as they have high similarity scores in the previous stage and are all programming-oriented courses. The result is shown in Figure 4.13.
Between COMP1100 and COMP2100, whose similarity is 0.9479, the prerequisite score is 13.17. Based on the rule of thumb in the experiment, this score is very high, indicating a strong prerequisite relationship between these two courses.
Similarly, COMP1110 and COMP2100 also have a very high score of 12.06, indicating that COMP1110 is also one of the prerequisites of COMP2100.
Regarding COMP1100 and COMP1110, the similarity is 0.9543 and the score is 4.4, which reveals that although there is a strong bond between these two courses, their prerequisite relationship is not as strong as the relationship of either course with COMP2100.