1 Introduction
In this paper we introduce the problem of predicting algorithm classes for programming word problems (PWPs). A PWP is a problem written in natural language that can be solved with a computer program. These problems generally map to one or more classes of algorithms that can be used to solve them; binary search, disjoint-set union, and dynamic programming are some examples. Our aim is to automatically map programming word problems to the relevant classes of algorithms, and we approach this by treating it as a classification task.
Programming word problems A programming word problem (PWP) requires the solver to design a correct and efficient program. Correctness and efficiency are checked against test cases provided by the problem writer. A PWP usually consists of three parts – the problem statement, a well-defined input and output format, and time and memory constraints. An example PWP can be seen in Figure 1.
Solving PWPs is difficult for several reasons. First, the problems are often embedded in a narrative, that is, they are described as quasi-real-world situations in the form of short stories or riddles. The solver must first decode the intent of the problem, or understand what the problem is, and then apply their knowledge of algorithms to write a solution program. Second, the solution program must be efficient with respect to the given time and memory constraints. As a result, the algorithm required to solve a particular problem depends not only on the problem statement but also on the constraints: two different algorithms, for example linear search and binary search, may both generate the correct output, but only one of them may satisfy the time and memory constraints. With the growing popularity of these problems, various competitions such as ACM-ICPC and Google Code Jam have emerged. Additionally, several companies, including Google, Facebook, and Amazon, use PWPs to evaluate the problem-solving skills of candidates for software-related jobs (McDowell, 2016). Consequently, as noted by Forišek (2010), programming problems have been becoming more difficult over time. To solve a PWP, humans use information from all its parts, not just the problem statement. Thus, we predict algorithms from the entire text of a PWP. We also try to identify which parts of a PWP contribute the most towards predicting algorithms.
Significance of the Problem Many interesting real-world problems can be solved and optimised using standard algorithms. Time spent grocery shopping can be optimised by posing it as a graph traversal problem (Gertin, 2012). Arranging and retrieving items such as mail, or books in a library, can be done more efficiently using sorting and searching algorithms. Solving problems with algorithms can be scaled by using computers, transforming the algorithms into programs. A program is an algorithm that has been customised to solve a specific task under a specific set of circumstances using a specific language. Converting textual descriptions of such real-world problems into algorithms, and then into programs, has largely been a human endeavour. An AI agent that could automatically generate programs from natural language problem descriptions could greatly accelerate technological advancement by quickly providing efficient solutions to such real-world problems. A subsystem that could identify algorithm classes from natural language would significantly narrow the search space of possible programs, and would therefore be a useful component of, or likely even part of, such an agent. Building a system to predict algorithms from programming word problems is thus potentially an important first step toward an automatic program-generating AI. More immediately, such a system could help people improve their algorithmic problem-solving skills for software job interviews, competitive programming, and other uses.
To the best of our knowledge, this task has not been addressed in the literature before, and hence no standard dataset is available for it. We generate and introduce new datasets by extracting problems from Codeforces (codeforces.com), a sport programming platform. We release the datasets and our experiment code (link hidden for the double-blind review).
Contribution The major contributions of this paper are: (1) four datasets of programming word problems – two multiclass datasets (each problem belongs to exactly one class) with 5 and 10 classes, and two multilabel datasets (each problem belongs to one or more classes) with 10 and 20 classes; (2) an evaluation of various multiclass and multilabel classifiers that predict classes for programming word problems on our datasets, along with a human baseline.

We define our problem more clearly in section 2. We then explain our datasets – their generation and format, along with a human evaluation – in section 3. We describe the models we use for multiclass and multilabel classification in section 4. We delineate our experiments, models, and evaluation metrics in section 5. We report our classification results in section 6. We analyse some dataset nuances in section 7. Finally, we discuss related work and the conclusion in sections 8 and 9 respectively.

Table 1: Statistics of the multiclass datasets.

Dataset | Size | Vocab | Classes | Avg. words | Class percentage
CFMC5 | 550 | 9326 | 5 | 504 | greedy: 20%, implementation: 20%, data structures: 20%, dp: 20%, math: 20%
CFMC10 | 1159 | 14691 | 10 | 485 | implementation: 34.94%, dp: 12.42%, math: 11.38%, greedy: 10.44%, data structures: 9.49%, brute force: 5.60%, geometry: 4.22%, constructive algorithms: 5.52%, dfs and similar: 3.10%, strings: 2.84%
Table 2: Statistics of the multilabel datasets.

Dataset | Size | Vocab | N classes | Avg. len | Label card. | Label density | Label subsets
CFML10 | 3737 | 28178 | 10 | 494 | 1.69 | 0.169 | 231
CFML20 | 3960 | 29433 | 20 | 495 | 2.1 | 0.105 | 808
2 Problem Definition
The focus of this paper is the problem of mapping a PWP to one or more classes of algorithms. A class of algorithms is a set containing more specific algorithms; for example, breadth-first search and Dijkstra's algorithm belong to the class of graph algorithms. A PWP can be solved using one of the algorithms in the class it is mapped to. Problems on the Codeforces platform have tags that correspond to classes of algorithms.
Thus, our aim is to find a tagging function f_ML that maps a PWP string p to a set of tags Y ⊆ T, where T is the set of all tags. We also consider another variant of the problem: for the PWPs that have only one tag, we focus on finding a tagging function f_MC that maps a PWP string p to a single tag y ∈ T. We approximate f_ML and f_MC by training models on data.
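As a minimal illustration, the two tagging functions described in this section have the following shapes. The type aliases and the toy keyword tagger below are our own illustrative names, not part of the paper's system:

```python
from typing import Callable, Set

Tag = str
PWP = str  # the full problem text

# Hypothetical type aliases for the two tagging functions:
MultilabelTagger = Callable[[PWP], Set[Tag]]  # problem -> one or more tags
MulticlassTagger = Callable[[PWP], Tag]       # single-tag problem -> its tag

def keyword_tagger(problem: PWP) -> Set[Tag]:
    """Toy multilabel tagger using naive keyword matching (illustration only)."""
    tags = set()
    if "shortest path" in problem:
        tags.add("graphs")
    if "maximum" in problem or "minimum" in problem:
        tags.add("greedy")
    return tags
```

A learned tagger replaces the hand-written rules with a model trained on labelled problems, but has the same input/output contract.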
3 Dataset
3.1 Data Collection
We collected the data from a popular sport programming platform called Codeforces. Codeforces was founded in 2010 and now has over 43,000 active registered participants (http://codeforces.com/ratings/page/219). We first collected a total of 4300 problems from this platform. Each problem has associated tags, and most problems have more than one tag. These tags correspond to the algorithm or class of algorithms that can be used to solve that particular problem. The tags for a problem are given by the problem writer and can be edited only by high-rated (expert) contestants who have solved the problem. Next, we performed basic filtering on the data – removing problems with non-algorithmic tags, problems with no tags assigned to them, and problems whose statements were not extracted completely. After this filtering, we were left with 4019 problems and 35 different tags. This forms the Codeforces dataset. Its label (tag) cardinality – the average number of labels per problem – is 2.24. Since the Codeforces dataset is the first dataset generated for a new problem, we select different subsets of it with differing properties, to check whether classification models are robust to different variations of the problem.
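For concreteness, the label cardinality just mentioned (and the label density reported in Table 2) can be computed directly from the per-problem tag sets. A minimal sketch; the function names are ours:

```python
def label_cardinality(tag_sets):
    """Average number of tags per problem."""
    return sum(len(tags) for tags in tag_sets) / len(tag_sets)

def label_density(tag_sets, n_labels):
    """Label cardinality normalised by the number of distinct labels."""
    return label_cardinality(tag_sets) / n_labels
```

For example, three problems tagged {dp}, {dp, math}, and {greedy, math, dp} have a label cardinality of 2.0, and a label density of 0.2 over a 10-label set.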
3.2 Multilabel Datasets
We found that a large number of tags had a very low frequency, so we removed those problems and tags from the Codeforces dataset as follows. First, we listed the 20 most frequently occurring tags, ordered by decreasing frequency; the last tag in this list had a frequency of 98, in other words, 98 problems had that tag. Next, for each problem, we removed the tags that were not in this list. Finally, all problems that had no tags left were removed.
This led to the Codeforces Multilabel-20 (CFML20) dataset, which has 20 tags. We used the same procedure with the 10 most frequently occurring tags to get the Codeforces Multilabel-10 (CFML10) dataset. CFML20 retains 98.53 percent (3960 problems) of the original dataset, and its label (tag) cardinality only drops from 2.24 to 2.21. CFML10, on the other hand, retains 92.9 percent of the problems with a label cardinality of 1.69. Statistics for both multilabel datasets are given in Table 2.
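The filtering procedure described above can be sketched in a few lines. This is our own reconstruction of the steps, not the paper's released code:

```python
from collections import Counter

def filter_to_top_k_tags(problems, k):
    """problems: list of (text, set_of_tags) pairs. Keep only the k most
    frequent tags; drop problems left with no tags (mirrors the CFML
    dataset construction described above)."""
    counts = Counter(tag for _, tags in problems for tag in tags)
    top = {tag for tag, _ in counts.most_common(k)}
    filtered = []
    for text, tags in problems:
        kept = tags & top          # intersect with the top-k tag list
        if kept:                   # discard problems with no tags left
            filtered.append((text, kept))
    return filtered
```

Running it with k=20 or k=10 on the full Codeforces dataset would yield CFML20 or CFML10 respectively.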
3.3 Multiclass Datasets
To generate the multiclass datasets, we first extracted the problems from the CFML20 dataset that have only one tag; there were about 1300 such problems. From those, we selected the problems whose tags occur in the list of the 10 most common tags. These problems form the Codeforces Multiclass-10 (CFMC10) dataset, which contains 1159 examples. We found that CFMC10 has a class (tag) imbalance, so we also built a balanced dataset, Codeforces Multiclass-5 (CFMC5), in which the prior class (tag) distribution is uniform. CFMC5 has five tags, each with 110 problems. To build it, we first extracted the problems whose tags are among the five most common tags; the fifth most common tag occurs 110 times, so we sampled 110 random problems for each of the other four tags, giving a total of 550 problems. Statistics for both multiclass datasets are given in Table 1.
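The balanced-dataset construction can be sketched as below. Again, this is our own reconstruction of the sampling step, with illustrative names:

```python
import random
from collections import Counter

def make_balanced_multiclass(problems, k, per_class, seed=0):
    """problems: list of (text, tag) pairs, each with exactly one tag.
    Keep the k most common tags and sample per_class problems for each,
    mirroring the CFMC5 construction (110 problems for 5 tags)."""
    counts = Counter(tag for _, tag in problems)
    top_tags = [tag for tag, _ in counts.most_common(k)]
    rng = random.Random(seed)
    dataset = []
    for tag in top_tags:
        pool = [p for p in problems if p[1] == tag]
        dataset.extend(rng.sample(pool, min(per_class, len(pool))))
    return dataset
```

With k=5 and per_class=110 this yields a uniformly distributed 550-problem dataset like CFMC5.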
3.4 Dataset Format
Each problem in the datasets follows the same format (refer to Figure 1 for an example problem). The header contains the problem title and the time and memory constraints for a program running on the problem's test cases. The problem statement is the natural language description of the problem, framed as a real-world scenario. The input and output format describes the input to, and the output from, a valid solution program. It also contains constraints on the size of the inputs (for example, the maximum size of an input array, or the maximum of the input values). The tags associated with the problem are the algorithm classes that we try to predict from the above information.
3.5 Class Categories in the Dataset
The classes for PWPs can be divided into two categories. Problem category classes tell us what broad class of problem the PWP belongs to; for instance, math and strings are two such classes. Solution category classes tell us what kind of algorithm can solve a particular PWP; for example, a PWP of class dp or binary search needs a dynamic programming or binary search based algorithm to solve it.
Problem category PWPs are easier to classify because, in some cases, simple keyword matching may suffice (an equation in the problem is a strong indicator that the problem is of the math type), whereas solution category PWPs require a deeper understanding of the problem.
The assignment of the CFML20 classes to the problem and solution categories is given in the supplementary material.
3.6 Human Evaluation
In this section, we evaluate and analyze the performance of an average competitor on the task of predicting an algorithm for a PWP. The tags for a given PWP are added by its problem setter or by other high-rated contestants who have solved it. Our test participants were recent computer science graduates with some experience in algorithms and competitive programming. We gave 5 participants the problem text along with all the constraints and the input and output format. We also provided them with a list of all the tags and a few example problems for each tag. We randomly sampled 120 problems from the CFML20 dataset and split them into two parts of 20 and 100 problems respectively. The 20 problems were given along with their tags, to familiarize the participants with the task. For the remaining 100 problems, the participants were asked to predict one or more tags for each problem. We chose to sample the problems from the CFML20 dataset as it is the closest to the real-world scenario of predicting algorithms for solving problems. We find that there is some variation in the accuracy achieved by different participants, with the highest F1 micro score being 11 percent greater than the lowest (see supplementary material for more details). The F1 micro score averaged over all 5 participants was 51.8, while the averaged F1 macro score was 42.7. These results are not surprising: this task is like any other problem-solving task, and people achieve different results depending on their proficiency. This shows that the problem is hard even for humans with a computer science education.
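The F1 micro and macro scores used here (and throughout the results) can be computed from per-label counts. A self-contained sketch, equivalent to the standard definitions; the function name is ours:

```python
def f1_micro_macro(y_true, y_pred, labels):
    """y_true, y_pred: lists of tag sets, one per problem.
    Returns (F1 micro, F1 macro), both in percent."""
    per_label_f1 = []
    TP = FP = FN = 0
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if lab in t and lab in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if lab not in t and lab in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if lab in t and lab not in p)
        denom = 2 * tp + fp + fn
        per_label_f1.append(2 * tp / denom if denom else 0.0)
        TP, FP, FN = TP + tp, FP + fp, FN + fn
    denom = 2 * TP + FP + FN
    micro = 100 * (2 * TP / denom if denom else 0.0)   # pool counts over labels
    macro = 100 * sum(per_label_f1) / len(per_label_f1)  # average per-label F1
    return micro, macro
```

Micro-averaging weights every label occurrence equally, while macro-averaging weights every label equally, which is why the two scores differ on imbalanced tag distributions.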
4 Classification Models
To test how well our problem fits the text classification paradigm, we apply to it some standard text classification models from the recent literature.
4.1 Multiclass Classification
To approximate the optimal tagging function (see section 2), we use the following models.
Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM) Wang and Manning (2012) proposed several simple and effective baselines for text classification. An MNB is a naive Bayes classifier for multinomial models. An SVM is a discriminative hyperplane-based classifier (Hearst et al., 1998). These baselines use unigrams and bigrams as features. We also try applying TF-IDF weighting to these features.

Multilayer Perceptron (MLP) An MLP is a class of artificial neural network that uses backpropagation for training in a supervised setting (Rumelhart et al., 1986). MLP-based models are standard text classification baselines (Glorot et al., 2011).

Convolutional Neural Network (CNN) We also train a CNN-based model, similar to the one used by Kim (2014), to classify the problems. We use the model both with and without pretrained GloVe word embeddings (Pennington et al., 2014).

CNN ensemble Hansen and Salamon (1990) introduced neural network ensemble learning, in which many neural networks are trained and their predictions combined; such ensembles show greater generalization ability and predictive power. We train five CNN networks and combine their predictions using majority voting.
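The MNB and SVM baselines with unigram+bigram features can be expressed as short scikit-learn pipelines. A sketch under the assumption that scikit-learn is used (the paper mentions scikit-learn in section 5); the toy texts and labels are ours:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Unigram + bigram bag-of-words baselines in the spirit of Wang and Manning (2012).
mnb = Pipeline([
    ("bow", CountVectorizer(ngram_range=(1, 2))),
    ("clf", MultinomialNB()),
])
svm_tfidf = Pipeline([
    ("bow", CountVectorizer(ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer()),  # optional TF-IDF weighting of the n-grams
    ("clf", LinearSVC()),
])

# Toy training data standing in for PWP texts and their tags.
texts = ["find shortest path in graph", "sort the array of numbers",
         "shortest path between cities", "sort numbers in ascending order"]
labels = ["graphs", "sorting", "graphs", "sorting"]
mnb.fit(texts, labels)
svm_tfidf.fit(texts, labels)
```

In the actual experiments the pipelines would be fit on the PWP datasets under cross validation, with the vectorizer and classifier parameters grid-searched as described in section 5.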
4.2 Multilabel Classifiers
To approximate the multilabel tagging function (see section 2), we apply the following augmentations to the models described above.
Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM) To apply these models to the multilabel case, we use the one-vs-rest (or one-vs-all) strategy. This strategy trains a single binary classifier for each class, with the samples of that class as positives and all other samples as negatives (Bishop, 2006).
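The one-vs-rest strategy has a direct scikit-learn form, sketched below. This assumes scikit-learn (mentioned in section 5); the toy data and variable names are ours:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

texts = ["shortest path in a graph", "sort the array",
         "sort edges then find shortest path", "sort numbers quickly"]
tags = [{"graphs"}, {"sorting"}, {"graphs", "sorting"}, {"sorting"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)  # binary indicator matrix, one column per tag

clf = Pipeline([
    ("bow", CountVectorizer()),
    ("ovr", OneVsRestClassifier(LinearSVC())),  # one binary SVM per tag
])
clf.fit(texts, Y)
pred = clf.predict(["sort the array quickly"])  # one 0/1 row per input
```

`mlb.inverse_transform(pred)` maps the binary rows back to tag sets.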
Multilayer Perceptron (MLP) Nam et al. (2014) use MLP-based models for multilabel text classification. We use similar models, but with the MSE loss instead of the cross-entropy loss.
Convolutional Neural Network (CNN) For multilabel classification we use a CNN-based feature extractor similar to the one used by Kim (2014). The output is passed through a sigmoid activation function, σ(x) = 1/(1 + e^{-x}), and the labels with a corresponding activation greater than 0.5 are predicted (Liu et al., 2017). As in the multiclass case, we train the model both with and without pretrained GloVe word embeddings (Pennington et al., 2014).

CNN ensemble We train five CNNs and add their output linear activation values. We pass this sum through a sigmoid function and predict the labels (tags) with activation greater than 0.5.
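The ensemble prediction rule just described (sum the pre-sigmoid outputs, squash, threshold at 0.5) can be sketched directly. The function names are ours:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ensemble_predict(member_logits, tag_names, threshold=0.5):
    """member_logits: one list of per-tag linear (pre-sigmoid) outputs per
    ensemble member. Sum the outputs per tag, apply a sigmoid to the sum,
    and keep the tags whose activation exceeds the threshold."""
    summed = [sum(vals) for vals in zip(*member_logits)]
    return {tag for tag, s in zip(tag_names, summed) if sigmoid(s) > threshold}
```

Summing logits before the sigmoid means a tag is predicted only when the members collectively push its score positive, which is what makes the ensemble more conservative than any single network.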
5 Experiment setup
All hyperparameter tuning experiments were performed with 10-fold cross validation. For the non-neural-network-based methods, we first vectorize each problem using a bag-of-words vectorizer, scikit-learn's (Pedregosa et al., 2011) CountVectorizer. We also experiment with TF-IDF features for each problem. In the multiclass case, we use the LIBSVM (Chang and Lin, 2001) implementation of the SVM classifier and grid search over different kernels. However, the LIBSVM implementation does not support the one-vs-rest strategy (which trains one classifier per class), only one-vs-one (which trains a classifier for every pair of classes). This becomes prohibitively slow, so we use the LIBLINEAR (Fan et al., 2008) implementation for the multilabel case. For hyperparameter tuning, we applied a grid search over the parameters of the vectorizers, classifiers, and other components; the exact parameters tuned can be seen in our code repository. For the neural-network-based methods, we tokenize each problem using the spaCy tokenizer (Honnibal and Montani, 2017). We only include words appearing 2 or more times in the vocabulary and replace rarer words with a special UNK token. Our CNN architecture is similar to that used by Kim (2014). The batch size is 32. We apply 512 one-dimensional convolution filters of sizes 3, 4, and 5 to each problem. The rectifier, ReLU(x) = max(0, x), is used as the activation function. We concatenate the filter outputs, apply global max-pooling, and follow with a fully-connected layer with output size equal to the number of classes. We build this model with the PyTorch framework (Paszke et al., 2017). For the word embeddings we use two approaches – a vanilla PyTorch trainable embedding layer, and 300-dimensional GloVe embeddings (Pennington et al., 2014). The networks were initialized using the Xavier method (Glorot and Bengio, 2010) at the beginning of each fold. We use the Adam optimization algorithm (Kingma and Ba, 2014), as we observe that it converges faster than vanilla stochastic gradient descent.
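The CNN architecture described above can be sketched in PyTorch as follows. The hyperparameters (512 filters of widths 3, 4, 5; 300-dimensional embeddings) come from the text; the class name and remaining details are our assumptions:

```python
import torch
import torch.nn as nn

class KimCNN(nn.Module):
    """Sketch of the Kim (2014)-style text CNN described in this section:
    word embeddings, 1-D convolutions of widths 3/4/5 with ReLU, global
    max-pooling, and a fully-connected output layer (one logit per class)."""

    def __init__(self, vocab_size, n_classes, emb_dim=300, n_filters=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, kernel_size=k) for k in (3, 4, 5))
        self.fc = nn.Linear(3 * n_filters, n_classes)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)    # (batch, emb_dim, seq_len)
        # ReLU then global max-pool over the sequence dimension, per filter size.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))   # raw logits, one per class
```

For multiclass training the logits would feed a softmax cross-entropy loss; for the multilabel variant they would instead pass through the per-label sigmoid thresholding of section 4.2.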
6 Results
Table 3: Multiclass classification results (accuracy and weighted F1, 10-fold cross validation).

Classifier | CFMC5 Acc | CFMC5 F1 W | CFMC10 Acc | CFMC10 F1 W
CNN Random | 25.0 | 22.1 | 35.2 | 19.2
MNB | 47.6 | 47.5 | 43.9 | 37.4
SVM BoW | 49.3 | 49.1 | 47.9 | 43.2
SVM TF-IDF | 47.8 | 47.6 | 45.7 | 41.2
MLP | 47.8 | 47.6 | 49.3 | 46.2
CNN | 61.7 | 61.3 | 54.7 | 51.3
CNN Ensemble | 62.7 | 62.2 | 53.5 | 50.5
CNN GloVe | 62.2 | 61.3 | 54.5 | 51.4
Table 4: Multilabel classification results. TWE stands for trainable word embeddings initialized with a normal distribution. All results were obtained with 10-fold cross validation. CNN Random refers to a CNN trained on a random labelling of the dataset.

Classifier | CFML10 Hamming loss | CFML10 F1 micro | CFML10 F1 macro | CFML20 Hamming loss | CFML20 F1 micro | CFML20 F1 macro
CNN Random TWE | 0.2158 | 15.98 | 9.39 | 0.1207 | 12.07 | 4.02
MNB BoW | 0.1706 | 30.57 | 25.73 | 0.1067 | 29.67 | 23.41
SVM BoW | 0.1713 | 36.08 | 31.09 | 0.1056 | 34.93 | 30.70
SVM BoW + TF-IDF | 0.1723 | 38.20 | 33.68 | 0.1059 | 38.55 | 34.70
MLP BoW | 0.1879 | 39.13 | 34.92 | 0.1167 | 38.12 | 31.37
CNN TWE | 0.1671 | 39.20 | 32.59 | 0.1023 | 38.44 | 30.38
CNN Ensemble TWE | 0.1703 | 45.32 | 38.93 | 0.1093 | 42.75 | 37.29
CNN GloVe | 0.1676 | 39.22 | 33.77 | 0.1052 | 37.56 | 30.29
Human | – | – | – | – | 51.8 | 42.7
6.1 Multiclass Results
The best performing classifier on the CFMC5 dataset, the CNN ensemble, achieves a classification accuracy of 62.7%. The highest accuracy on the CFMC10 dataset is achieved by the CNN classifier without pretrained embeddings. For all multiclass classification results, refer to Table 3. We observe that CNN-based classifiers outperform the other classifiers – MLP, MNB, and SVM – on both the CFMC5 and CFMC10 datasets. Since these are the first learning results on the task of algorithm prediction for PWPs, we also train a CNN classifier on a random labelling of the dataset; the results are given in the CNN Random row. To obtain this random labelling we randomly shuffle the existing problem-to-tag mapping, which keeps the class distribution of the dataset unchanged. All classifiers significantly outperform this random baseline. We also observe that the classification accuracy is not the same for every class: on the CFMC5 dataset, the highest per-class accuracy (see Figure 2) is for data structures, at 90%, while the lowest is for greedy, at 40%.
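The label-shuffling baseline described above can be sketched as follows. This is our own reconstruction of the step; the function name is ours:

```python
import random

def shuffle_labels(dataset, seed=0):
    """Return a copy of (text, label) pairs with the labels randomly
    permuted across problems. The class distribution is preserved exactly,
    but any text-label association is destroyed."""
    texts, labels = zip(*dataset)
    labels = list(labels)
    random.Random(seed).shuffle(labels)
    return list(zip(texts, labels))
```

A classifier trained on the shuffled pairs can only learn the class prior, so its score is the floor that any real model must beat.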
6.2 Multilabel Results
CNN-based classifiers give the best results on the CFML10 and CFML20 datasets. The best F1 micro and macro scores on the CFML10 dataset, 45.32 and 38.93 respectively, were obtained by the CNN Ensemble model; for complete results see Table 4. The best performing model on the CFML20 dataset was also the CNN ensemble. As in the multiclass case, we train a CNN model on a randomly shuffled labelling of both the CFML10 and CFML20 datasets, and find that all classifiers significantly outperform it. The human-level F1 micro and macro scores on a subset of the CFML20 dataset were 51.2 and 40.5. In comparison, our best performing classifier on the CFML20 dataset, the CNN Ensemble, achieved F1 micro and macro scores of 42.75 and 37.29 respectively. Thus, our best classifiers trail average human performance by about 8.45 and 3.21 points on F1 micro and F1 macro respectively.
7 Analysis
7.1 Experiments with various subsets of the problem
As described in section 1, a PWP consists of three components – the problem statement, the input and output format, and the time and memory constraints. We seek to answer the following questions: Does one component contribute more to the accuracy than the others? Does the contribution of different components vary with the problem class? To address these questions, we split each problem into two parts – 1) the problem statement, and 2) the input and output format together with the time and memory constraints – and train an SVM and a CNN on each part independently.
Multiclass PWP component analysis We measure classifier accuracies on the CFMC5 dataset; we choose it over the other multiclass dataset because it has a balanced class distribution. The classifiers perform quite well given only the input and output format and the time and memory constraints – the best classifier reaches an accuracy of 56.4 percent, only 5.3 percent lower than the CNN trained on the whole problem. Classification using only the problem statement gives worse results, with an accuracy of 45.2 percent for the best classifier, the CNN (16.5 percent lower than the CNN trained on the whole problem). Complete results are given in Table 5. We also see that per-class performance varies with the training input: the class dp performs better when training on the problem statement, whereas the other classes perform much better on the format and constraints. For each class except greedy, we see an additive trend – accuracy improves when both parts are combined. Refer to Figure 2 for more details.
Table 5: F1 micro and macro scores for CNN classifiers trained on different parts of the problems, broken down by solution category, problem category, and all classes.

Dataset | Features | Classifier | Soln. F1 Mi | Soln. F1 Ma | Prob. F1 Mi | Prob. F1 Ma | All F1 Mi | All F1 Ma
CFMC5 | only statement | CNN | 42.73 | 46.14 | 51.32 | 64.35 | 46.13 | 45.20
CFMC5 | only i/o | CNN | 44.24 | 51.73 | 74.73 | 81.31 | 56.42 | 55.41
CFMC5 | all prob | CNN | 54.24 | 59.91 | 71.36 | 78.32 | 61.71 | 61.32
CFML20 | only statement | CNN | 30.83 | 17.32 | 38.64 | 41.82 | 33.59 | 28.34
CFML20 | only i/o | CNN | 34.63 | 19.59 | 44.49 | 44.34 | 38.44 | 30.38
CFML20 | all prob | CNN | 34.39 | 19.23 | 45.36 | 44.02 | 39.20 | 32.59
Multilabel partial problem results We also tabulate classifier accuracies on the CFML20 dataset when training only on the format and constraints, or only on the problem statement. We observe trends similar to the multiclass partial-problem experiments: classifiers are more accurate when trained only on the format and constraints than when trained only on the problem statement, and again, accuracy improves when both parts are combined. Refer to Table 5 for more details.
7.2 Problem category and Solution category results
We find that PWPs of the solution category are harder to classify correctly than PWPs of the problem category (Table 5). For instance, consider the row for the CFMC5 dataset with the "all prob" features: the score for the solution category is 54.24%, compared to 71.36% for the problem category. This trend holds for both the CFMC5 and CFML20 datasets, and for each of the different PWP feature subsets. Despite the difficulty, the classification scores for the solution category are still significantly better than random.
8 Related Work
Our work is related to three major topics of research: math word problem solving, program synthesis, and text document classification.
Math word problem solving In recent years, many models have been built to solve different kinds of math word problems. Some models solve only arithmetic problems (Hosseini et al., 2014), while others solve algebra word problems (Kushman et al., 2014). Some recent solvers handle a wide range of pre-university-level math word problems (Matsuzaki et al., 2017; Hopkins et al., 2017). Wang et al. (2017) and Mehta et al. (2017) have built deep neural network based solvers for math word problems.

Program synthesis Work on converting natural language descriptions to code falls under the research areas of program synthesis and natural language understanding, and is still in its nascent stage. Zhong et al. (2017) worked on generating SQL queries automatically from natural language descriptions. Lin et al. (2017) worked on automatically generating bash commands from natural language descriptions. Iyer et al. (2016) worked on summarizing source code. Sudha et al. (2017) use a CNN-based model to classify the algorithm used in a programming problem from its C++ code; our model tries to accomplish this task from the natural language problem description instead. Gulwani et al. (2017) is a comprehensive treatise on program synthesis.

Document classification The problem of classifying a programming word problem in natural language is similar to the task of document classification. The current state-of-the-art approach for single-label classification is a hierarchical attention network based model (Yang et al., 2016), which has been improved using transfer learning (Howard and Ruder, 2018). Other approaches include a recurrent convolutional neural network (Lai et al., 2015) and the fastText model (Joulin et al., 2016), which uses bag-of-words features and a hierarchical softmax. Nam et al. (2014) use a feedforward neural network with a binary cross-entropy loss per label to perform multilabel document classification. Kurata et al. (2016) leverage label co-occurrence to improve multilabel classification. Liu et al. (2017) use a CNN-based architecture to perform extreme multilabel classification.

9 Conclusion
We introduced the new problem of predicting algorithm classes for programming word problems. For this task we generated four datasets – two multiclass (CFMC5 and CFMC10), having 5 and 10 classes respectively, and two multilabel (CFML10 and CFML20), having 10 and 20 classes respectively. Our classifiers fall short of the human score by only about 9 percent. We also ran experiments showing that increasing the size of the training dataset improves accuracy (see supplementary material). These problems are much harder than high-school math word problems, as they require a good knowledge of various computer science algorithms and the ability to reduce a problem to these known algorithms; even our human evaluation shows that trained computer science graduates only reach an F1 of 51.8. Based on these results, we conclude that algorithm class prediction is compatible with, and can be solved using, text classification.
References
 Bishop (2006) Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). SpringerVerlag, Berlin, Heidelberg.
 chung Chang and Lin (2001) Chih chung Chang and ChihJen Lin. 2001. Libsvm: a library for support vector machines.
 Fan et al. (2008) RongEn Fan, KaiWei Chang, ChoJui Hsieh, XiangRui Wang, and ChihJen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874.
 Forišek (2010) Michal Forišek. 2010. The difficulty of programming contests increases. In International Conference on Informatics in Secondary SchoolsEvolution and Perspectives, pages 72–85. Springer.
 Gertin (2012) Thomas Gertin. 2012. Maximizing the cost of shortest paths between facilities through optimal product category locations. Ph.D. thesis.

Glorot and Bengio (2010)
Xavier Glorot and Yoshua Bengio. 2010.
Understanding the difficulty of training deep feedforward neural
networks.
In
In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10). Society for Artificial Intelligence and Statistics
. 
Glorot et al. (2011)
Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011.
Domain adaptation for largescale sentiment classification: A deep learning approach.
In Proceedings of the 28th international conference on machine learning (ICML11), pages 513–520.  Gulwani et al. (2017) Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, et al. 2017. Program synthesis. Foundations and Trends® in Programming Languages, 4(12):1–119.
 Hansen and Salamon (1990) L. K. Hansen and P. Salamon. 1990. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001.
 Hearst et al. (1998) Marti A. Hearst, Susan T Dumais, Edgar Osuna, John Platt, and Bernhard Scholkopf. 1998. Support vector machines. IEEE Intelligent Systems and their applications, 13(4):18–28.
 Honnibal and Montani (2017) Matthew Honnibal and Ines Montani. 2017. spacy 2: Natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Hopkins et al. (2017)
Mark Hopkins, Cristian PetrescuPrahova, Roie Levin, Ronan Le Bras, Alvaro
Herrasti, and Vidur Joshi. 2017.
Beyond sentential semantic parsing: Tackling the math sat with a
cascade of tree transducers.
In
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
, pages 795–804.  Hosseini et al. (2014) Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523–533.
 Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model finetuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 328–339.

Iyer et al. (2016)
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016.
Summarizing source code using a neural attention model.
In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2073–2083.  Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
 Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Kurata et al. (2016) Gakuto Kurata, Bing Xiang, and Bowen Zhou. 2016. Improved neural networkbased multilabel classification with better initialization leveraging label cooccurrence. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 521–526.
 Kushman et al. (2014) Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 271–281.
 Lai et al. (2015) Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In AAAI, volume 333, pages 2267–2273.

 Lin et al. (2017) Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, Luke Zettlemoyer, and Michael D Ernst. 2017. Program synthesis from natural language using recurrent neural networks. University of Washington Department of Computer Science and Engineering, Seattle, WA, USA, Tech. Rep. UW-CSE-17-03-01.
 Liu et al. (2017) Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. 2017. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–124. ACM.
 Matsuzaki et al. (2017) Takuya Matsuzaki, Takumi Ito, Hidenao Iwane, Hirokazu Anai, and Noriko H Arai. 2017. Semantic parsing of pre-university math problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2131–2141.
 McDowell (2016) Gayle Laakmann McDowell. 2016. Cracking the Coding Interview: 189 Programming Questions and Solutions. CareerCup, LLC.
 Mehta et al. (2017) Purvanshi Mehta, Pruthwik Mishra, Vinayak Athavale, Manish Shrivastava, and Dipti Sharma. 2017. Deep neural network based system for solving arithmetic word problems. Proceedings of the IJCNLP 2017, System Demonstrations, pages 65–68.
 Nam et al. (2014) Jinseok Nam, Jungi Kim, Eneldo Loza Mencía, Iryna Gurevych, and Johannes Fürnkranz. 2014. Large-scale multi-label text classification—revisiting neural networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 437–452. Springer.
 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
 Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.
 Rumelhart et al. (1986) David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning internal representations by error propagation. In David E. Rumelhart and James L. Mcclelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, pages 318–362. MIT Press, Cambridge, MA.
 Sudha et al. (2017) S Sudha, A Arun Kumar, M Muthu Nagappan, and R Suresh. 2017. Classification and recommendation of competitive programming problems using CNN. In International Conference on Intelligent Information Technologies, pages 262–272. Springer.
 Wang and Manning (2012) Sida Wang and Christopher D. Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, ACL ’12, pages 90–94, Stroudsburg, PA, USA. Association for Computational Linguistics.
 Wang et al. (2017) Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 845–854.
 Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.
 Zhong et al. (2017) Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.
Appendix A
a.1 Experiments with limited training data
We examine how the size of the training data affects classifier performance by training a CNN classifier on 25, 50, 75, and 100 percent of the CFML20 dataset. As expected, the performance of the classifier improves as the training data grows: the F1 micro and macro scores increase, and the Hamming loss decreases. For the F1 scores, higher is better, while for Hamming loss, lower is better. See Figure 3.
a.2 Evaluation Metrics
a.3 Multi-class: Accuracy
Accuracy is the percentage of examples whose label is predicted correctly. Note that for single-label multi-class classification, the micro-averaged F1 score is equal to the accuracy.
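The equivalence of micro-averaged F1 and accuracy in the single-label multi-class setting can be verified directly: every prediction contributes exactly one true positive (if correct) or one false positive plus one false negative (if wrong), so pooled precision and recall both equal accuracy. A minimal sketch, using made-up toy labels that are not from the paper's dataset:

```python
# Toy single-label multi-class example (illustrative labels, not real data).

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def micro_f1(y_true, y_pred, classes):
    # Pool TP/FP/FN across all classes, then compute one F1.
    tp = sum(t == p == c for c in classes for t, p in zip(y_true, y_pred))
    fp = sum(p == c != t for c in classes for t, p in zip(y_true, y_pred))
    fn = sum(t == c != p for c in classes for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = ["dp", "greedy", "dp", "graphs"]
y_pred = ["dp", "dp", "dp", "graphs"]
print(accuracy(y_true, y_pred))                               # 0.75
print(micro_f1(y_true, y_pred, ["dp", "greedy", "graphs"]))   # 0.75
```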
a.4 Multi-class: Macro-averaged F1 score
The macro-averaged F1 score is computed by first computing the F1 score for each class independently and then averaging these per-class scores. This metric treats all classes as equal, independent of their frequency in the test set.
a.5 Multi-class: Weighted macro-averaged F1 score
The weighted macro-averaged F1 score is computed by first computing the F1 score for each class independently and then taking a weighted average of the per-class scores, where each class is weighted by its support (the number of true instances of that class in the test set).
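The difference between the two macro variants can be seen on a small imbalanced example. The following sketch (illustrative labels, not from the paper's data) shows that weighting by support pulls the average toward the F1 of the frequent class:

```python
# Toy example contrasting macro vs. weighted macro F1 (illustrative labels).

def per_class_f1(y_true, y_pred, c):
    tp = sum(t == p == c for t, p in zip(y_true, y_pred))
    fp = sum(p == c and t != c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0

y_true = ["dp", "dp", "dp", "greedy"]   # "dp" has support 3, "greedy" has 1
y_pred = ["dp", "dp", "greedy", "greedy"]
classes = ["dp", "greedy"]

f1 = {c: per_class_f1(y_true, y_pred, c) for c in classes}
macro = sum(f1.values()) / len(classes)                      # unweighted mean
support = {c: sum(t == c for t in y_true) for c in classes}
weighted = sum(f1[c] * support[c] for c in classes) / len(y_true)
print(macro, weighted)   # weighted is higher: "dp" (F1=0.8) dominates
```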
a.6 Multi-label: Hamming loss
Hamming loss is the fraction of incorrectly predicted labels, i.e., the number of wrong label predictions divided by the total number of example-label pairs in the dataset.
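With label sets represented as binary indicator matrices of shape (n_examples, n_labels), the Hamming loss is simply the fraction of positions where prediction and truth disagree. A minimal sketch with made-up indicator rows (not real data):

```python
# Toy multi-label Hamming loss over binary indicator matrices.

def hamming_loss(y_true, y_pred):
    wrong = sum(t != p
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    total = len(y_true) * len(y_true[0])
    return wrong / total

# Each row holds indicators over three hypothetical labels, e.g. (dp, greedy, graphs).
y_true = [[1, 0, 0], [0, 1, 1]]
y_pred = [[1, 1, 0], [0, 1, 0]]
print(hamming_loss(y_true, y_pred))  # 2 wrong out of 6 positions -> 0.333...
```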
a.7 Multi-label: Micro-averaged F1 score
The micro-averaged F1 score is the F-measure computed on the pooled prediction matrix: the true positives, false positives, and false negatives are summed across all labels/classes, and the F-measure is then calculated from these pooled counts.
a.8 Multi-label: Macro-averaged F1 score
The macro-averaged F1 score is calculated by computing the F1 score for each label separately and then averaging the label-wise F1 scores.
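The two multi-label averaging schemes differ only in where the averaging happens: micro pools the TP/FP/FN counts across labels before computing a single F1, while macro computes one F1 per label and then averages. A sketch with toy indicator matrices (not from the paper's data):

```python
# Toy multi-label micro vs. macro F1 over binary indicator matrices.

def counts(y_true, y_pred, j):
    tp = sum(t[j] == p[j] == 1 for t, p in zip(y_true, y_pred))
    fp = sum(p[j] == 1 and t[j] == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn

y_true = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]
n_labels = 3

per_label = [counts(y_true, y_pred, j) for j in range(n_labels)]

# Micro: sum counts across labels first, then compute one F1.
TP, FP, FN = (sum(c[i] for c in per_label) for i in range(3))
micro_f1 = 2 * TP / (2 * TP + FP + FN)

# Macro: one F1 per label, then an unweighted average.
macro_f1 = sum(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
               for tp, fp, fn in per_label) / n_labels
print(micro_f1, macro_f1)
```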
Appendix B Human accuracy
We conducted a human study with 5 participants on the CFML20 dataset. Each participant is a recent graduate in computer science and a frequent competitive programmer. The results are shown in Table 6.
Classifier     | F1 micro | F1 macro
Human 1        | 56.3     | 42.3
Human 2        | 46.1     | 38.7
Human 3        | 51.1     | 40.6
Human 4        | 48.4     | 42.8
Human 5        | 57.3     | 49.1
Human Average  | 51.8     | 42.7

Table 6: Human performance on the 20-multi subset.
Appendix C Class categories in CFML20
c.1 Problem category
The following classes belong to the Problem category: probabilities, geometry, combinatorics, number theory, strings, trees, graphs, math, data structures.
c.2 Solution category
The following classes belong to the Solution category: dsu, binary search, dfs and similar, constructive algorithms, brute force, greedy, dp, bitmask, two pointers, sortings, implementation.