I Introduction
*The first two authors have contributed equally to this work.

The traditional teaching methods that are widely used at universities for science, technology, engineering, and mathematics (STEM) courses do not take the different abilities of learners into account. Instead, they provide learners with a fixed set of textbooks and homework problems. This disregard for learners' prior background knowledge, pace of learning, preferences, and learning goals in the current education system can cause tremendous pain and discouragement for those who do not keep pace with such an inflexible system [chen2008intelligent, brusilovsky2003course, chen2005personalized, hubscher2000logically, weber1997user]. Hence, e-learning methods have received considerable attention in an effort to personalize the learning process by providing learners with optimal and adaptive curriculum sequences. Over the years, many web-based tools have emerged that adaptively recommend problems to learners based on courseware difficulty. These tools tune the difficulty level of the recommended problems and push learners forward by gradually increasing the difficulty of recommended problems on a specific concept. The disadvantage of such methods is that they do not take concept continuity and mixtures of concepts into account, but focus on the difficulty level of single concepts. Note that a learner who knows every individual concept does not necessarily have the ability to bring all of them together to solve a realistic problem involving a mixture of concepts. As a result, the recommender system needs to know the similarity/dissimilarity of problems on mixtures of concepts to respond to learners' performance more effectively, as described in the next paragraph; this is something that is missing in the literature and needs more attention.
Since it is difficult for humans to determine a similarity score that is consistent across a large enough training set, it is not feasible to simply apply supervised methods to learn a similarity score for problems. In order to take difficulty, continuity, and mixtures of concepts into account in the similarity score used by a personalized problem recommender system for adaptive practice, we propose a numerical representation of problems on mixtures of concepts equipped with a similarity measure, namely the cosine similarity of the problem representations. By virtue of vector representations (problem embeddings) that capture the similarity of problems on both single concepts and mixtures of concepts, a learner's performance on one problem can be projected onto other problems. As we show in this paper, creating a proper problem representation that captures the mathematical similarity of problems is a challenging task, on which baseline text representation methods and their refined versions fail to work. Although the state-of-the-art methods for phrase/sentence/paragraph representation do a great job for general purposes, their shortcoming in our application is that they capture lexical and semantic similarity of words, which is of little use when dealing with text on mathematics or any other specialized topic. The words, or even the subject-related keywords, of problems are not fully informative and cannot produce a good embedding of math problems on their own; as a result, the similarity of two problems is not highly correlated with the wording of the problems. Hence, baseline methods perform poorly on the problem similarity detection test in problem recommender applications.
We find that instead of words or even subject-related keywords, the conceptual ideas behind the problems determine their identity. The conceptual particles (concepts) of problems are mostly not directly mentioned in the problem wording, but their footprints can be found in the problems. Since problem wording does not capture the similarity of problems, we propose an alternative hierarchical approach called Prob2Vec, consisting of an abstraction step and an embedding step. The abstraction step projects a problem onto a set of concepts. The idea is that there exists a concept space of reasonable dimension $n$, with $n$ ranging from tens to a hundred, that can represent a much larger variety of problems, of order $2^n$. Each variety can be sparsely inhabited, with some concept combinations having only one problem. This is because making problems is itself a creative process: the more innovative a problem, the less likely it is to have exactly the same concept combination as another problem. The explicit representation of problems using concepts also enables state-dependent similarity computation, which we will explore in future work. The embedding step constructs a vector representation of the problems based on concept co-occurrence. Similar to sentence embedding, it captures not only the common concepts between problems, but also the continuity among concepts. The proposed Prob2Vec algorithm achieves 96.88% accuracy on a problem similarity test, where human experts are asked to label the relative similarity among each triplet of problems. In contrast, the best of the existing methods [arora2016simple], which directly applies sentence embedding, achieves 75% accuracy. It is interesting that Prob2Vec is able to distinguish very fine-grained differences among problems, as the problems in some triplets are highly similar to each other, and only humans with extensive training in the subject are able to identify their relative order of similarity. The problem embedding obtained from Prob2Vec has been used in the recommender system of an e-learning tool for an undergraduate probability course for four semesters, with successful results for hundreds of students, especially benefiting minorities, who tend to be more isolated in the current education system.
In addition, the sub-problem of concept labeling in the abstraction step is interesting in its own right. It is a multi-label problem suffering from dimensionality explosion, as there can be as many as $2^n$ problem types. This results in two challenges: First, there are very few problems of some types, hence a direct classification over $2^n$ classes suffers from a severe lack of data. Second, per-concept classification suffers from an imbalance of training samples and needs a very small per-concept false positive rate in order to achieve a reasonable per-problem false positive rate. We propose pre-training the neural network with negative samples (negative pre-training) and find that our approach outperforms an approach similar to one-shot learning [fei2006one], in which the neural network is pre-trained on the classification of other concepts to get a warm start on the classification of the concept of interest (transfer learning).
I-A Related Work
Embedding applications: The success of the simple and low-cost word embedding technique using the well-known neural network (NN) based word embedding method Word2Vec by [NIPS2013_5021, mikolov2013efficient], compared to expensive natural language processing methods, has motivated researchers to use embedding methods in many other areas. As examples, Doc2Vec [lau2016empirical], Paper2Vec [tian2017paper2vec, ganguly2017paper2vec], Gene2Vec [Gene2Vec], Graph2Vec [narayanan2017graph2vec], Like2Vec, Follower2Vec, and many more share the same techniques originally proposed for word embedding, with modifications based on their domains, e.g. see [ganguly2017paper2vec, grover2016node2vec, barkan2016item2vec, narayanan2016subgraph2vec, dhingra2016tweet2vec, ma2016app2vec, ring2017ip2vec, dong2017metapath2vec, ribeiro2017struc2vec, kavousi2019estimating, chen2017doctag2vec, chung2016unsupervised, niu2015topic2vec, vasile2016meta, lin2017establishment, ristoski2016rdf2vec, melamud2016context2vec, shi2018mvn2vec, liu2018multi]. In this work, we propose a new application, problem embedding, for personalized problem recommendation.

Word/Phrase/Sentence/Paragraph embedding:
The prior works on word embedding include learning a distributed representation for words by [bengio2003neural], multi-task training of a convolutional neural network using weight-sharing by [collobert2008unified], the continuous Skip-gram model by [NIPS2013_5021], and the low-rank representation of the word-word co-occurrence matrix by [deerwester1990indexing, pennington2014glove]. Previous works on phrase/sentence/paragraph embedding include vector composition operationalized in terms of additive and multiplicative functions by [mitchell2008vector, mitchell2010composition, blacoe2012comparison], uniform averaging for short phrases by [NIPS2013_5021], supervised and unsupervised recursive autoencoders defined on syntactic trees by [socher2011dynamic, socher2014grounded], training an encoder-decoder model to reconstruct the surrounding sentences of an encoded passage by [kiros2015skip], modeling sentences by long short-term memory (LSTM) neural networks by [tai2015improved] and convolutional neural networks by [blunsom2014convolutional], and the weighted average of word vectors in addition to a modification using Principal Component Analysis (PCA)/Singular Value Decomposition (SVD) by [arora2016simple]. The simple-but-tough-to-beat method proposed by [arora2016simple] beats all the previous methods and is the baseline in text embedding.

The rest of the paper is outlined as follows. The description of the data set for which we do problem embedding, in addition to our proposed Prob2Vec method for problem embedding, is presented in Section II. Negative pre-training for NN-based concept extraction is proposed in Section III. Section IV describes the setup of the similarity detection test for problem embedding, evaluates the performance of our proposed Prob2Vec method versus baselines, and presents the results on the negative pre-training method. Section V concludes the paper with a discussion of opportunities for future work.
II Prob2Vec: Concept-Based Problem Embedding
Consider a set of problems $\{P_1, P_2, \dots, P_N\}$ for an undergraduate probability course, where each problem can be on a single concept or a mixture of concepts from the set of all concepts $C = \{c_1, c_2, \dots, c_n\}$. Note that these concepts are different from the keywords of problem wordings and are not originally included in the problems; rather, labeling problems with concepts is a contribution of this work, proposed for achieving a proper problem representation. Instead, problems are made of words from a vocabulary $W$; i.e. $P_i \subseteq W$ for $1 \le i \le N$. In the following subsection, we propose the Prob2Vec method for problem embedding that uses an automated rule-based concept extractor, which relieves reliance on human labeling and annotation of problem concepts.
II-A Prob2Vec Problem Embedding
As shown in Section IV, using the set of words, or even a subset of keywords, to represent problems, text embedding baselines fail to achieve high accuracy in the similarity detection task on triplets of problems. In the keyword-based approach, all redundant words of a problem are ignored and the subject-related and informative words, such as binomial, random variable, etc., are kept. However, since the concepts behind problems are not necessarily mapped to the ordinary and mathematical words used in problems, even the keyword-based approach fails to work well in the similarity detection task explained in Section IV. As an alternative, we propose a hierarchical method, consisting of abstraction and embedding steps, that generates a precise embedding for problems and is completely automated. The block diagram of the proposed Prob2Vec method is shown in Figure 1.

Abstraction step: Similarity among mathematical problems is not captured by the wording of problems; instead, it is determined by the abstraction of the problems. Learners who have difficulty solving mathematical problems mostly lack the ability to perform abstraction and relate problems to the appropriate concepts. Instead, they try to remember procedure-based rules, fit problems to them, and use their memory to solve the problems, which does not work for hard problems on mixtures of concepts. We observe the same pattern in problem representation; i.e. problem statements do not necessarily determine their identity; instead, the abstraction of problems, by mapping them to representative concepts, moves problem embedding from lexical similarity to conceptual similarity. The concepts of a problem are mostly not mentioned directly in its text, but there can be footprints of the concepts in the problem. A professor and two experienced teaching assistants were asked to formulate rule-based mappings from footprints to concepts for the automation of concept extraction. As an example, the rule for labeling a problem with the concept "nchoosek" is the presence of any of the regular expressions `\\choose`, `\\binom`, or `\\frac\{\s*\w+!\s*\}\{\s*\w+!\s*\\times\s*\w+!\s*\}` in its text. By applying the rule-based concept extractor to problems, we obtain a new representation of problems in concept space instead of word space; i.e. $P_i^c \subseteq C$ for $1 \le i \le N$.

Embedding step:
A method similar to the Skip-gram model in Word2Vec is used for concept embedding. The high-level insight of Skip-gram is that a neural network with a single hidden layer is trained, where its output relates to how likely each concept is to co-occur in a problem with the input concept. As an example, if the concept "lawoftotalprobability" of a problem is the input of the neural network, we expect the neural network to assign higher likelihood to the concept "conditionalprobability" than to unrelated concepts like "Poissonprocess". However, the neural network is not used for this prediction task; rather, the goal is to use the weights of the hidden layer for embedding. Recall the set of all concepts $C$ with $|C| = n$, where a problem is typically labeled with a few of them. Consider a one-hot coding from concepts to real-valued vectors of size $n$ used for training the neural network, where the element corresponding to a concept is one and all other elements are zero. We consider $d$ neurons in the hidden layer with no activation functions (so the embedded concept vectors have $d$ features) and $n$ neurons in the output layer that form a softmax regression classifier. In order to clarify the input-output pairs of the neural network, assume a problem has the set of concepts $C_P \subseteq C$. The neural network is trained on all ordered pairs $(c, c')$ with $c, c' \in C_P$ and $c \ne c'$, where the one-hot code of the first element of a pair is the input and the one-hot code of the second element is the output of the neural network in the training phase. Hence, for this problem the neural network is trained over $|C_P|(|C_P| - 1)$ training pairs. This way, the neural network learns the statistics from the number of times a pair is fed into it; e.g., the neural network is probably fed with more training pairs of ("lawoftotalprobability", "conditionalprobability") than the pair ("lawoftotalprobability", "Poissonprocess"). Note that during the training phase, the input and output are one-hot vectors representing the input and output concepts, but after training, given a one-hot input vector, the output of the neural network is a probability distribution over the set of all concepts. Finally, since input concepts are coded as one-hot vectors, the rows of the hidden layer weight matrix, which is of size $n$ by $d$, are the concept vectors (concept embedding) that we are really after. Denoting the embedding of concept $c$ by $E(c)$, the problem embedding $E(P)$ for a problem $P$ with concept set $C_P$ is obtained as follows:

$$E(P) = \frac{1}{|C_P|} \sum_{c \in C_P} \frac{E(c)}{f_c}, \qquad (1)$$

where $f_c$ is the frequency of concept $c$ in our data set of problems. Each concept embedding is scaled down by its corresponding frequency to penalize concepts with high frequency, which are not as informative as low-frequency ones. For example, the concept "pmf" is less informative than the concept "MLparameterestimation". Given the problem embedding in Equation (1), the similarity between two problems $P_i$ and $P_j$ is defined by the cosine similarity $\text{sim}(P_i, P_j) = \frac{\langle E(P_i), E(P_j) \rangle}{\|E(P_i)\|_2 \, \|E(P_j)\|_2}$, where $\|\cdot\|_2$ is the two-norm.
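Once the concept vectors are trained, the embedding step of Equation (1) reduces to a few lines. The following sketch (the function and variable names are ours, not from the paper's code) aggregates trained concept vectors into a problem embedding using the inverse-frequency scaling described above, and compares problems by cosine similarity:

```python
import numpy as np

def problem_embedding(concepts, concept_vecs, concept_freq):
    """Average the embeddings of a problem's concepts, scaling each
    concept vector down by its corpus frequency so that rare (more
    informative) concepts dominate the result, as in Equation (1)."""
    vecs = [concept_vecs[c] / concept_freq[c] for c in concepts]
    return np.mean(vecs, axis=0)

def cosine_similarity(u, v):
    """Cosine similarity between two problem embeddings."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

With `concept_vecs` taken from the rows of the trained hidden-layer weight matrix and `concept_freq` counted over the data set, recommendation then amounts to ranking candidate problems by `cosine_similarity` against problems the learner has attempted.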
Remark. We choose a rule-based concept extractor for the abstraction step over supervised/unsupervised classification methods for concept extraction for two main reasons. First, there is a limited number of problems, as few as a couple, for most concepts, which makes any supervised classification method inapplicable due to lack of training data. Second, there are $n$ concepts, so there can potentially be $2^n$ categories of problems, which makes classification challenging for any supervised or unsupervised method. Consider the maximum number of concepts in a problem to be $m$; then there are on the order of $n^m$ categories of problems. Even if we assume an acceptable number of problems for each of the categories, the per-concept false positive rate needs to be extremely small for the number of false positives per category to remain negligible. Given that there are $n$ concepts, utilizing a supervised or unsupervised approach to achieve such a low false positive rate is not feasible. Exploiting the rule-based classifier for problem labeling, however, we achieve low average false positive and false negative rates across all concepts, where problem concepts annotated by experts are considered to be the ground truth. Although high accuracy is observed in the similarity detection test when utilizing problem concepts annotated by experts, those concepts are not necessarily the globally optimal ground truth, nor the only one. Thinking of increasing accuracy in the similarity detection task as an optimization problem, there is not necessarily a unique global optimum for problem concepts that leads to good performance. Hence, the rule-based labels not having a very low false negative rate does not necessarily mean that they are far from a locally/globally optimal set of problem concepts. In fact, the rule-based extracted concepts achieve a high accuracy on the similarity test, as mentioned in Section IV.
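As a sketch of how such rule-based extraction can be automated, the snippet below applies a small table of footprint regexes to a problem's LaTeX source. The "nchoosek" patterns follow the rule quoted above; the "conditionalprobability" pattern and the table layout are our own illustrative assumptions, not the paper's actual rule set:

```python
import re

# Hypothetical rule table: each concept maps to regex patterns whose
# presence in a problem's LaTeX source labels the problem with it.
CONCEPT_RULES = {
    "nchoosek": [
        r"\\choose",
        r"\\binom",
        r"\\frac\{\s*\w+!\s*\}\{\s*\w+!\s*\\times\s*\w+!\s*\}",
    ],
    # Illustrative rule of our own: expressions like P(A | B).
    "conditionalprobability": [r"P\s*\(\s*\w+\s*\|\s*\w+\s*\)"],
}

def extract_concepts(problem_text):
    """Return the set of concepts whose footprint patterns appear."""
    return {
        concept
        for concept, patterns in CONCEPT_RULES.items()
        if any(re.search(p, problem_text) for p in patterns)
    }
```

Each problem is thus mapped from word space to concept space by a single pass over the rule table.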
III Negative Pre-Training for Imbalanced Data Sets
For the purpose of problem embedding, Prob2Vec as discussed in Section II-A, with a rule-based concept extractor, has acceptable performance. Here, an NN-based concept extractor is proposed that can complement the rule-based version, but we mainly study it to propose our novel negative pre-training method for reducing false negative and false positive ratios in concept extraction with an imbalanced training data set. Negative pre-training outperforms a method similar to one-shot learning (transfer learning) as a data-level classification algorithm for tackling imbalanced training data sets.
The setup of the concept extractor using neural networks, without any snooping of human knowledge, is presented first; then we propose some tricks for reducing false negative and false positive ratios. The neural network used for concept extraction has an embedding layer of linear perceptrons, followed by two layers of perceptrons with sigmoid activation functions, and an output layer with a single perceptron with a sigmoid classifier. For words common to GloVe and our data set, the embedding layer is initialized by GloVe [pennington2014glove], but for words in our data set that are not in GloVe, the weights are initialized uniformly at random. The embedding size and the widths of the two subsequent layers are fixed, and the output layer has a single perceptron indicating whether a concept is present in the input problem. Note that a separate neural network is trained for each concept. The problem with training a single neural network for all concepts is the imbalanced number of positive and negative samples per concept. A concept is only present in a few of the problems, so having a single neural network means that too many negative samples for each concept are fed into it, dramatically increasing false negatives.

In the following, some techniques for training the above naive NN-based concept extractor are presented that considerably reduce FN and FP ratios compared to using down-sampling, which is a standard approach for training on imbalanced data sets. The main challenge in obtaining low FN and FP is that as few as a handful of problems are labeled with a given concept, which poses a challenge to neural network training with positive and negative samples.


Negative pre-training: Only a few of the problems are labeled with any specific concept, e.g. the concept "hypothesisMAP". Hence, few positive samples and many negative samples are available in our data set for training an NN-based extractor for a specific concept. A neural network obviously cannot be trained on an imbalanced set where all negative samples are mixed with the few positive ones, or FN increases dramatically. Instead, we propose two phases of training for a concept $c$. Consider the sets $S_c^+$ and $S_c^-$ of positive and negative samples for concept $c$, where $|S_c^+| \ll |S_c^-|$. Partition the negatives as $S_c^- = S_{c,1}^- \cup S_{c,2}^-$, where $|S_{c,2}^-| = |S_c^+|$. In the first phase, the neural network is pre-trained on the pure set of negative samples $S_{c,1}^-$, and the trained network is used as a warm start for the second phase. In the second phase of training, the neural network is trained on the balanced mixture of positive and negative samples $S_c^+ \cup S_{c,2}^-$. Utilizing negative pre-training, we take advantage of the negative samples in training, and not only does FN not increase, but we get overall lower FN and FP compared to down-sampling. Due to the curse of dimensionality, the neural network learns a good portion of the structure of the negative samples in the first phase of negative pre-training, which provides a warm start for the second phase.

One-shot learning (transfer learning) [fei2006one]: In the first phase of training, the neural network is first trained on the classification of bags of problems with equal numbers of negative and positive samples of concepts that are not of interest. Then, the trained neural network is used as a warm start in the second training phase for the classification of the concept of interest on a balanced set.

Word selection: Due to the limited number of positive training samples, a neural network cannot tune a large number of parameters to find the important features of problems. Moreover, as a rule of thumb, the fewer parameters the neural network has, the less vulnerable it is to overfitting and the faster it converges to an acceptable classifier. To this end, an expert Teaching Assistant (TA) was asked to select the informative words out of the full vocabulary originally used as input to the neural network, a process that took less than an hour. The redundant words in problems are omitted and only the selected words related to probability are kept in each problem, which shrinks the embedding matrix and inputs more informative features to the neural network. In Section IV, it is shown that the FP and FN ratios are noticeably reduced by this trick, which is an indication that the selected words are more representative than the original ones. These selected words have also been used for problem embedding in the modified versions of the baselines in Section IV, as evidence that even keyword-based versions of embedding baselines do not capture the similarity of problems.
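To make the negative pre-training recipe concrete, here is a minimal sketch on a toy logistic-regression classifier standing in for the paper's multi-layer network; the split of the negative set mirrors the two training phases described above. All names are ours, and the simple gradient-descent trainer is an illustrative assumption:

```python
import numpy as np

def train_logreg(X, y, w=None, lr=0.1, epochs=200):
    """Full-batch gradient-descent logistic regression; passing `w`
    lets the second phase warm-start from the first phase's weights."""
    X = np.hstack([X, np.ones((len(X), 1))])          # append bias column
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))              # sigmoid outputs
        w = w - lr * X.T @ (p - y) / len(y)           # gradient step
    return w

def negative_pretrain(X_pos, X_neg):
    """Phase 1: pre-train on a pure negative set (labels all zero).
    Phase 2: fine-tune on a balanced positive/negative mixture,
    warm-started from the phase-1 weights."""
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(X_neg))
    X_bal_neg = X_neg[idx[: len(X_pos)]]              # negatives for phase 2
    X_rest = X_neg[idx[len(X_pos):]]                  # negatives for phase 1
    w = train_logreg(X_rest, np.zeros(len(X_rest)))   # phase 1
    X2 = np.vstack([X_pos, X_bal_neg])
    y2 = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_bal_neg))])
    return train_logreg(X2, y2, w=w)                  # phase 2, warm start

def predict(w, X):
    """Sigmoid scores in [0, 1]; above 0.5 means concept present."""
    X = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-X @ w))
```

The point of the sketch is the data flow, not the model: the abundant negatives are consumed in phase 1 instead of being discarded, and phase 2 sees a balanced set.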
IV Experiments and Results
For the evaluation of different problem embedding methods, a ground truth on the similarity of problems is needed. To this end, four TAs were asked to select random triplets of problems, say $(A, B, C)$, and order them so that problem $B$ is more similar to $A$ than problem $C$ is; i.e. $\text{sim}(A, B) > \text{sim}(A, C)$, where $\text{sim}(\cdot, \cdot)$ is the similarity between the two input problems. Finally, a head TA brought the results into consensus and chose the final set of triplets of problems. Note that the set of all problems is divided into modules, where each module is on a specific topic, e.g. hypothesis testing, the central limit theorem, and so on. The three problems of a triplet are chosen to be in the same module, so they are already on the same topic, which makes the similarity detection task challenging. As evidence, the histogram of the similarity gap $\text{sim}(A, B) - \text{sim}(A, C)$ over the triplets of problems, according to expert annotation of problem concepts and Skip-gram-based problem embedding, is shown in Figure 2. It should be noted that this problem embedding is empirically shown to have the highest accuracy in our similarity detection test. The expert annotation of problem concepts was done by an experienced TA unaware of the problem embedding project, so no bias was introduced into the concept labeling process. The similarity gap histogram clearly depicts that similarity detection for the triplets of problems is challenging due to the skewness of the triplets toward the first bins.

Prob2Vec is compared with different baseline text embedding methods in terms of the accuracy in determining which of the second and third problems of a triplet is more similar to the first one. The experimental results are reported in Table I. The baselines are mainly categorized into three groups: (1) GloVe-based problem embedding, derived by taking the uniform (or word-frequency-weighted) average of GloVe word embeddings, where the average can be taken over all words of a problem or over some representative words of the problem. (2) The method of [arora2016simple], which removes the first singular vector from the GloVe-based problem embedding, where that singular vector corresponds to syntactic information and common words. (3) SVD-based problem embedding, which has the same hierarchical approach as Prob2Vec, but where the concept embedding in the second step is done based on an SVD decomposition of the concept co-occurrence matrix [levy2015improving]. The details of the baseline methods can be found in the Appendix. The numbers of errors that the best-performing method in each of the above categories makes in different bins of the similarity gap are shown in Figure 2. For example, the best GloVe-based method makes six errors on the triplets whose similarity gap falls in the smallest bin. According to Table I, the best GloVe-based method takes the uniform average of the embeddings of selected words.
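The evaluation itself is simple to state in code: given any embedding and the expert-ordered triplets, accuracy is the fraction of triplets on which cosine similarity agrees with the expert ordering. A minimal sketch (names are ours):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_accuracy(embed, triplets):
    """Fraction of expert-ordered triplets (A, B, C) -- meaning B is
    judged more similar to A than C is -- on which the embedding
    agrees, i.e. cosine(A, B) > cosine(A, C)."""
    correct = sum(
        cosine(embed[a], embed[b]) > cosine(embed[a], embed[c])
        for a, b, c in triplets
    )
    return correct / len(triplets)
```

The same harness applies unchanged to Prob2Vec and to every baseline; only the `embed` dictionary changes.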
[Table I: Accuracy on the similarity detection test. Rows: Prob2Vec and the SVD-based variants SVD_sub, SVD_shifted, SVD_cds, SVD_wandc, and SVD_eig, each evaluated with annotated concepts and with rule-based concepts; GloVe-based baselines over all words and over selected words, with uniform and weighted averaging. Numeric entries were lost in extraction.]
Table II: The four concepts most similar to a given concept, with cosine similarities.

Concept | 1 | 2 | 3 | 4
proberror | probmiss 0.9766 | probfalse 0.9699 | hypothesisMAP 0.9661 | hypothesisML 0.9122
mutuallyexclusive | partition 0.9078 | event 0.9056 | setcomplement 0.8610 | setintersection 0.8270
[Table III: FN and FP ratios for five concepts under down-sampling, negative pre-training, one-shot learning, word selection, and their combinations. Numeric entries were lost in extraction.]
Interesting patterns of concept continuity and similarity are observed in the Prob2Vec concept embedding, two of which are shown in Table II. As another example, it is observed that the concept most similar to functionRV is CDF, where functionRV refers to finding the distribution of a function of a random variable. Based on our observations during office hours over multiple years, most students have no clue where to start on problems involving functionRV, and are told to start by finding the CDF of the function of the random variable. It is worth noting that NN-based concept embedding can capture such a relation between concepts in seconds with a small number of training samples, while a student trained over a whole semester at university is often still unsure where to start. We further observe that the MLparameterE concept is most closely related to the concept differentiation, where MLparameterE refers to maximum likelihood (ML) parameter estimation. Again, students do not grasp this relation for a while, and they need a lot of training to associate MLparameterE with differentiation of the likelihood of the observation to find the ML parameter estimate. As another example, Bayesformula is most similar to lawoftotalprobability, and the list goes on.
A comparison of negative pre-training, one-shot learning, word selection, down-sampling, and combinations of these methods applied to the training process of the NN-based concept extractor is presented in Table III for five concepts. In order to find the empirical false negative and false positive ratios for each combination of methods in Table III, training and cross-validation are done over multiple rounds on different random training and test samples, and the false negative and positive ratios are averaged over the rounds. As an example, the training and test sets are randomly selected in each round for the negative pre-training method, and the false negative and positive ratios of the trained neural network on the test instances are then averaged. Employing the combination of word selection and negative pre-training substantially reduces the false negative and false positive ratios compared to the naive down-sampling method. For some concepts, the combination of word selection, one-shot learning, and negative pre-training results in slightly lower false negative and positive ratios than the combination of word selection and negative pre-training. However, investigating the whole table, one finds that word selection and negative pre-training are the main causes of the reduced false negative and positive ratios. It is of interest that the NN-based approach can reduce the FN ratio for the concept "event" well below that of rule-based concept extraction at a comparable FP ratio.
V Conclusion and Future Work
A hierarchical embedding method called Prob2Vec for subject-specific text is proposed in this paper. Prob2Vec is empirically shown to outperform baselines by a wide margin (96.88% versus 75% accuracy) in a properly validated similarity detection test on triplets of problems. The Prob2Vec embedding vectors for problems have been used in the recommender system of an e-learning tool for an undergraduate probability course for four semesters. We also propose negative pre-training for training with imbalanced data sets to decrease false negatives and false positives. As future work, we plan on using graphical models along with problem embedding vectors to more precisely evaluate the strengths and weaknesses of students on single concepts and mixtures of concepts, and thus perform problem recommendation more efficiently.
References
Appendix A Details of Experimental Setting
The proposed Prob2Vec method is compared with the following text embedding methods.


Glove: Referring to the literature on text embedding, the following are probably the most primitive approaches for problem embedding, denoted by $E(P)$ for a problem $P$:

$$E(P) = \frac{1}{|P|} \sum_{w \in P} v_w, \qquad (2)$$
$$E(P) = \frac{1}{\sum_{w \in P} f_w} \sum_{w \in P} f_w \, v_w, \qquad (3)$$

where $v_w$ and $f_w$ are the GloVe embedding and the frequency of word $w$ in our data set of problems, respectively. We can either consider all words of a problem, or use the selected subject-related keywords discussed in Section III, in the above embedding formulas. The selected keywords are logical choices since we get lower false positives and negatives in concept extraction by using these keywords instead of all words, as shown in Section IV.

Arora et al. [arora2016simple]: As an improvement on the previous raw method, [arora2016simple] find that the top singular vectors of text embeddings seem to roughly correspond to syntactic information and common words. Hence, they propose to exclude the first singular vector from the sentence (problem) embedding in the following way. Given word embeddings $v_w$ computed by one of the popular methods, where we use GloVe, the problem embedding $E(P)$ for a problem $P$ is computed as

$$E(P) = \frac{1}{|P|} \sum_{w \in P} \frac{a}{a + f_w} v_w, \qquad E(P) \leftarrow E(P) - u u^\top E(P),$$

where $u$ is the first principal component of the matrix of all problem embeddings and $a$ is a hyperparameter, which is claimed to result in the best performance for $a$ roughly between $10^{-4}$ and $10^{-3}$. We tried different values for $a$ inside this interval and outside of it, and found different values to work best for our data set when using all words versus when using selected words.
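A compact numpy sketch of this baseline — smooth-inverse-frequency weighting followed by removal of the first principal component, in the spirit of [arora2016simple] — with names of our own choosing (`word_freq` is assumed to hold relative frequencies):

```python
import numpy as np

def sif_embedding(problems, word_vecs, word_freq, a=1e-3):
    """Smooth-inverse-frequency weighted average of word vectors,
    followed by projecting out the first principal component of the
    stacked problem embeddings."""
    E = np.array([
        np.mean([a / (a + word_freq[w]) * word_vecs[w] for w in p], axis=0)
        for p in problems
    ])
    # The first right-singular vector approximates the common
    # discourse/syntax direction; remove each row's projection onto it.
    u = np.linalg.svd(E, full_matrices=False)[2][0]
    return E - np.outer(E @ u, u)
```

Each row of the returned matrix is one problem's embedding, ready for the cosine-similarity triplet test.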

SVD: Using the same hierarchical approach as Prob2Vec, the concept embedding in the second step can be done with an SVD-based method instead of the Skip-gram method, as follows. Recall that the concept dictionary is denoted by $C = \{c_1, \dots, c_n\}$, where each problem is labeled with a subset of these concepts. Let $N_{i,j}$ for $i \ne j$ denote the number of co-occurrences of concepts $c_i$ and $c_j$ in the problems of the data set; i.e. there are $N_{i,j}$ problems that are labeled with both $c_i$ and $c_j$. The co-occurrence matrix $N = [N_{i,j}]$ is obviously symmetric with diagonal elements equal to zero. By defining $N_i = \sum_{j} N_{i,j}$ and $N_{\text{tot}} = \sum_{i,j} N_{i,j}$, the matrix $M$ is constructed as the positive pointwise mutual information matrix

$$M_{i,j} = \max\left( \log \frac{N_{i,j} \, N_{\text{tot}}}{N_i \, N_j}, \, 0 \right).$$

The SVD decomposition of the matrix $M$ is $M = U \Sigma V^\top$, where $U$ and $V$ are orthogonal and $\Sigma$ is a diagonal matrix. Denote the embedding size of concepts by $d$, and let $U_d$ be the first $d$ columns of $U$, $\Sigma_d$ be the diagonal matrix with the first $d$ diagonal elements of $\Sigma$, and $V_d$ be the first $d$ columns of $V$. The following are different variants of SVD-based concept embedding [levy2015improving]:

eig: Embeddings of concepts are given by the rows of the matrix $U_d \Sigma_d$, each of embedding length $d$.

wandc: Let $W = U_d \Sigma_d^{1/2}$ and $C = V_d \Sigma_d^{1/2}$. Embeddings of concepts are given by the rows of $W + C$.

sub: The rows of $U_d$ are the embeddings of concepts.

shifted: The matrix $M$ is defined in a slightly different way in this variant, as the shifted positive PMI

$$M_{i,j} = \max\left( \log \frac{N_{i,j} \, N_{\text{tot}}}{N_i \, N_j} - \log k, \, 0 \right).$$

We choose a fixed shift $k$ in our evaluations and calculate $U_d$ and $\Sigma_d$ based on the above matrix as before. The rows of $U_d \Sigma_d^{1/2}$ are the embeddings of concepts.

cds: The matrix $M$ is defined in a slightly different way in this variant, using context distribution smoothing. Let $\alpha = 0.75$ and let $\hat{p}_\alpha(c_j) = N_j^{\alpha} / \sum_{j'} N_{j'}^{\alpha}$ be the smoothed context distribution. The matrix $M$ is then defined as

$$M_{i,j} = \max\left( \log \frac{N_{i,j} / N_{\text{tot}}}{(N_i / N_{\text{tot}}) \, \hat{p}_\alpha(c_j)}, \, 0 \right).$$

Note that the matrix $M$ is not necessarily symmetric in this case. Deriving $U_d$ and $\Sigma_d$ as before, the embeddings of concepts are given by the rows of $U_d \Sigma_d^{1/2}$.
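The common core of these SVD variants — building a positive-PMI-style matrix from co-occurrence counts and truncating its SVD — can be sketched as follows. We show a $U_d \Sigma_d^{1/2}$ weighting, one common choice from [levy2015improving]; the other variants are obtained by changing the matrix construction or the final weighting. The function name is ours:

```python
import numpy as np

def ppmi_svd_embedding(N, d):
    """Concept embeddings from a symmetric, zero-diagonal co-occurrence
    count matrix N: build a positive-PMI matrix, take its rank-d SVD,
    and return U_d * sqrt(Sigma_d) as the concept vectors."""
    row = N.sum(axis=1, keepdims=True)      # co-occurrence count per concept
    total = N.sum()                         # total co-occurrence mass
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(N * total / (row * row.T))
    M = np.maximum(np.nan_to_num(pmi, neginf=0.0), 0.0)   # positive PMI
    U, S, _ = np.linalg.svd(M)
    return U[:, :d] * np.sqrt(S[:d])        # one row per concept
```

Swapping in the shifted or smoothed matrix constructions above only changes the line that builds `pmi`.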
