Predicting Algorithm Classes for Programming Word Problems

by Vinayak Athavale, et al.
IIIT Hyderabad

We introduce the task of algorithm class prediction for programming word problems. A programming word problem is a problem written in natural language, which can be solved using an algorithm or a program. We define classes of various programming word problems which correspond to the class of algorithms required to solve the problem. We present four new datasets for this task, two multiclass datasets with 550 and 1159 problems each and two multilabel datasets having 3737 and 3960 problems each. We pose the problem as a text classification problem and train neural network and non-neural network-based models on this task. Our best performing classifier gets an accuracy of 62.7 percent for the multiclass case on the five class classification dataset, Codeforces Multiclass-5 (CFMC5). We also do some human-level analysis and compare human performance with that of our text classification models. Our best classifier has an accuracy only 9 percent lower than that of a human on this task. To the best of our knowledge, these are the first reported results on such a task. We make our code and datasets publicly available.




1 Introduction

In this paper, we introduce the problem of predicting algorithm classes for programming word problems (PWPs). A PWP is a problem written in natural language which can be solved using a computer program. These problems generally map to one or more classes of algorithms, which are used to solve them. Binary search, disjoint-set union, and dynamic programming are some examples. Our aim is to automatically map programming word problems to the relevant classes of algorithms. We approach this problem by treating it as a classification task.

Programming word problems A programming word problem (PWP) requires the solver to design correct and efficient programs. Correctness and efficiency are checked by test cases provided by the problem writer. A PWP usually consists of three parts: the problem statement, a well-defined input and output format, and time and memory constraints. An example PWP can be seen in Figure 1.

Solving PWPs is difficult for several reasons. One reason is that the problems are often embedded in a narrative, that is, they are described as quasi real-world situations in the form of short stories or riddles. The solver must first decode the intent of the problem, or understand what the problem is. Then the solver needs to apply their knowledge of algorithms to write a solution program. Another reason is that the solution programs must be efficient with respect to the given time and memory constraints. A consequence of this is that the algorithm required to solve a particular problem depends not only on the problem statement, but also on the constraints. For example, two different algorithms, such as linear search and binary search, may both generate the correct output, but only one of them may abide by the time and memory constraints. With the growing popularity of these problems, various competitions like ACM-ICPC and Google CodeJam have emerged. Additionally, several companies including Google, Facebook, and Amazon evaluate problem-solving skills of candidates for software-related jobs (McDowell, 2016) using PWPs. Consequently, as noted by Forišek (2010), programming problems have been becoming more difficult over time. To solve a PWP, humans get information from all its parts, not just the problem statement. Thus, we predict algorithms from the entire text of a PWP. We also try to identify which parts of a PWP contribute the most towards predicting algorithms.
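The remark that constraints, and not only the statement, determine the required algorithm can be illustrated with a small sketch. The two functions below are illustrative and not taken from the paper's code: both return the same answer on a sorted list, but one needs a number of comparisons linear in the input size while the other needs a logarithmic number, which is the kind of difference a tight time limit exposes.

```python
from bisect import bisect_left


def linear_search(sorted_values, target):
    """Scan the elements one by one: O(n) comparisons per query."""
    for index, value in enumerate(sorted_values):
        if value == target:
            return index
    return -1


def binary_search(sorted_values, target):
    """Halve the search range at each step: O(log n) comparisons per query."""
    index = bisect_left(sorted_values, target)
    if index < len(sorted_values) and sorted_values[index] == target:
        return index
    return -1


# A sorted list of 100,000 even numbers; both searches agree on the answer.
values = list(range(0, 200_000, 2))
print(linear_search(values, 199_998), binary_search(values, 199_998))
```

With many queries on a large input, only the logarithmic version is likely to fit within a typical contest time limit, even though both are correct.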

Significance of the Problem Many interesting real-world problems can be solved and optimised using standard algorithms. Time spent grocery shopping can be optimised by posing it as a graph traversal problem Gertin (2012). Arranging and retrieving items like mail, or books in a library can be done more efficiently using sorting and searching algorithms. Solving problems using algorithms can be scaled by using computers, transforming the algorithms into programs. A program is an algorithm that has been customised to solve a specific task under a specific set of circumstances using a specific language. Converting textual descriptions of such real-world problems into algorithms, and then into programs has largely been a human endeavour. An AI agent that could automatically generate programs from natural language problem descriptions could greatly increase the rate of technological advancement by quickly providing efficient solutions to the said real-world problems. A subsystem that could identify algorithm classes from natural language would significantly narrow down the search space of possible programs. Consequently, such a subsystem would be a useful feature for, or likely be even part of, such an agent. Therefore, building a system to predict algorithms from programming word problems is potentially an important first step toward an automatic program generating AI. More immediately, such a system could serve as an application to help people in improving their algorithmic problem-solving skills for software job interviews, competitive programming, and other uses.

To our knowledge, this task has not been addressed in the literature before. Hence, there is no standard dataset available for this task. We generate and introduce new datasets by extracting problems from Codeforces, a sport programming platform. We release the datasets and our experiment code (link hidden for the double-blind review).

Contribution The major contributions of this paper are:

  • Four datasets on programming word problems: two multiclass datasets (each problem belongs to only one class) having 5 and 10 classes, and two multilabel datasets (each problem belongs to one or more classes) having 10 and 20 classes.
  • Evaluation of various multiclass and multilabel classifiers that predict classes for programming word problems on our datasets, along with a human baseline.

We define our problem more clearly in section 2. Then we explain our datasets, including their generation, format, and human evaluation, in section 3. We describe the models we use for multiclass and multilabel classification in section 4. We delineate our experiments, models, and evaluation metrics in section 5. We report our classification results in section 6. We analyse some dataset nuances in section 7. Finally, we discuss related work and the conclusion in sections 8 and 9 respectively.

| Dataset | Size | Vocab | Classes | Avg. words | Class percentage |
|---|---|---|---|---|---|
| CFMC5 | 550 | 9326 | 5 | 504 | greedy: 20%, implementation: 20%, data structures: 20%, dp: 20%, math: 20% |
| CFMC10 | 1159 | 14691 | 10 | 485 | implementation: 34.94%, dp: 12.42%, math: 11.38%, greedy: 10.44%, data structures: 9.49%, brute force: 5.60%, geometry: 4.22%, constructive algorithms: 5.52%, dfs and similar: 3.10%, strings: 2.84% |

Table 1: Dataset statistics for multiclass datasets. CFMC5 has 550 problems with a balanced class distribution. CFMC10 has 1159 problems and has a class imbalance. CFMC5 is a subset of CFMC10. Red classes belong to the solution category; blue classes belong to the problem category.

| Dataset | Size | Vocab | N classes | Avg. len | Label card. | Label den. | Label subsets |
|---|---|---|---|---|---|---|---|
| CFML10 | 3737 | 28178 | 10 | 494 | 1.69 | 0.169 | 231 |
| CFML20 | 3960 | 29433 | 20 | 495 | 2.1 | 0.105 | 808 |

Table 2: Dataset statistics for multilabel datasets. The problems of the CFML10 dataset are a subset of those in the CFML20 dataset.
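The multilabel statistics in Table 2 can be computed as in the following minimal sketch. It assumes the usual definitions: label cardinality is the average number of tags per problem, label density is the cardinality divided by the number of classes, and "label subsets" is read here as the number of distinct tag combinations that occur. The function names and the toy data are illustrative, not from the paper's repository.

```python
def label_cardinality(tag_sets):
    """Average number of tags per problem."""
    return sum(len(tags) for tags in tag_sets) / len(tag_sets)


def label_density(tag_sets, num_classes):
    """Label cardinality normalised by the number of classes."""
    return label_cardinality(tag_sets) / num_classes


def distinct_label_subsets(tag_sets):
    """Number of different tag combinations that occur (assumed meaning)."""
    return len({frozenset(tags) for tags in tag_sets})


# Toy data: four problems over three classes.
toy_tags = [{"dp"}, {"dp", "math"}, {"greedy", "math"}, {"dp"}]
print(label_cardinality(toy_tags))       # 6 tags over 4 problems
print(label_density(toy_tags, 3))
print(distinct_label_subsets(toy_tags))
```

Under these definitions the table entries are mutually consistent: 1.69 / 10 = 0.169 for CFML10 and 2.1 / 20 = 0.105 for CFML20.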

2 Problem Definition

The focus of this paper is the problem of mapping a PWP to one or more classes of algorithms. A class of algorithms is a set containing more specific algorithms. For example, breadth-first search, and Dijkstra’s algorithm belong to the class of graph algorithms. A PWP can be solved using one of the algorithms in the class it is mapped to. Problems on the Codeforces platform have tags that correspond to the class of algorithms.

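Thus, our aim is to find a tagging function, $f^{*}: S \to \mathcal{P}(T)$, which maps a PWP string, $s \in S$, to a set of tags, $\{t_1, t_2, \ldots\} \in \mathcal{P}(T)$, where $S$ is the set of PWP strings and $T$ is the set of tags. We also consider another variant of the problem. For the PWPs that only have one tag, we focus on finding a tagging function, $f_1^{*}: S \to T$, which maps a PWP string, $s \in S$, to a tag, $t \in T$. We approximate $f_1^{*}$ and $f^{*}$ by training models on data.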

3 Dataset

3.1 Data Collection

We collected the data from a popular sport programming platform called Codeforces. Codeforces was founded in 2010, and now has over 43000 active registered participants. We first collected a total of 4300 problems from this platform. Each problem has associated tags, with most of the problems having more than one tag. These tags correspond to the algorithm or class of algorithms that can be used to solve that particular problem. The tags for a problem are given by the problem writer and can be edited only by high-rated (expert) contestants who have solved the problem. Next, we performed basic filtering on the data: we removed the problems which had non-algorithmic tags, problems with no tags assigned to them, and problems whose problem statement was not extracted completely. After this filtering, we got 4019 problems with 35 different tags. This forms the Codeforces dataset. The label (tag) cardinality (average number of labels/tags per problem) was 2.24. Since the Codeforces dataset is the first dataset generated for a new problem, we select different subsets of this dataset with differing properties. This is to check if classification models are robust to different variations of the problem.

3.2 Multilabel Datasets

We found that a large number of tags had a very low frequency. Hence, we removed those tags, and the problems left without tags, from the Codeforces dataset as follows. First, we got the list of the 20 most frequently occurring tags, ordered by decreasing frequency. We observed that the 20th tag in this list had a frequency of 98, in other words, 98 problems had this tag. Next, for each problem, we removed the tags that are not in this list. After that, all problems that did not have any tags left were removed.

This led to the formation of the Codeforces Multilabel-20 (CFML20) dataset, which has 20 tags. We used the same procedure for the 10 most frequently occurring tags to get the Codeforces Multilabel-10 (CFML10) dataset. CFML20 retains 98.53 percent (3960 problems) of the problems of the original dataset, and the label (tag) cardinality only reduces from 2.24 to 2.21. CFML10, on the other hand, retains 92.9 percent of the problems, with a label (tag) cardinality of 1.69. Statistics about both these multilabel datasets are given in Table 2.
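The filtering procedure described above can be sketched as follows. The data layout (a dictionary from a problem identifier to its list of tags) and the function name `keep_top_k_tags` are assumptions made for illustration; the released code may organise this differently.

```python
from collections import Counter


def keep_top_k_tags(problems, k):
    """Keep only the k most frequent tags, then drop problems left with no tags.

    `problems` maps a problem id to its list of tags (hypothetical layout).
    """
    counts = Counter(tag for tags in problems.values() for tag in tags)
    top_tags = {tag for tag, _ in counts.most_common(k)}

    filtered = {}
    for name, tags in problems.items():
        kept = [tag for tag in tags if tag in top_tags]
        if kept:  # problems with no remaining tags are removed
            filtered[name] = kept
    return filtered


# Toy data: "dp" occurs 3 times, "math" twice, the other tags once.
toy_problems = {
    "A": ["dp", "math"],
    "B": ["dp"],
    "C": ["geometry"],
    "D": ["math", "greedy"],
    "E": ["dp"],
}
print(keep_top_k_tags(toy_problems, k=2))
```

Calling the same function with k=20 and k=10 on the full tag lists would correspond to the CFML20 and CFML10 constructions respectively.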

3.3 Multiclass Datasets

To generate the multiclass datasets, first, we extracted the problems from the CFML20 dataset that only had one tag. There were about 1300 such problems. From those, we selected the problems whose tags occur in the list of the 10 most common tags. These problems formed the Codeforces Multiclass-10 (CFMC10) dataset, which contains 1159 examples. We found that the CFMC10 dataset has a class (tag) imbalance. We therefore also made a balanced dataset, Codeforces Multiclass-5 (CFMC5), in which the prior class (tag) distribution is uniform. The CFMC5 dataset has five tags, each having 110 problems. To make CFMC5, we first extracted the problems whose tags are among the five most common tags. The fifth most common tag occurs 110 times. We sampled 110 random problems for each of the other four tags to give a total of 550 problems. Statistics about both the multiclass datasets are given in Table 1.
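A minimal sketch of the balanced sampling step, under the same assumed data layout (problem id mapped to a list of tags). The function name, the fixed seed, and the toy data are illustrative; the sketch takes the most common single-tag classes and samples from each as many problems as the smallest of them contains.

```python
import random
from collections import defaultdict


def build_balanced_multiclass(problems, num_classes, seed=0):
    """Sample an equal number of single-tag problems for the most common tags."""
    by_tag = defaultdict(list)
    for name, tags in problems.items():
        if len(tags) == 1:  # multiclass datasets use single-tag problems only
            by_tag[tags[0]].append(name)

    # Most common tags first; the smallest of them fixes the per-class size.
    largest = sorted(by_tag, key=lambda tag: len(by_tag[tag]), reverse=True)[:num_classes]
    per_class = min(len(by_tag[tag]) for tag in largest)

    rng = random.Random(seed)
    return {tag: sorted(rng.sample(by_tag[tag], per_class)) for tag in largest}


toy_single_tag = {
    "p1": ["dp"], "p2": ["dp"], "p3": ["dp"], "p4": ["dp"],
    "p5": ["math"], "p6": ["math"], "p7": ["math"],
    "p8": ["greedy"], "p9": ["greedy"],
    "p10": ["dp", "math"],  # ignored: more than one tag
}
balanced = build_balanced_multiclass(toy_single_tag, num_classes=2)
print({tag: len(names) for tag, names in balanced.items()})
```

In the paper's setting, the per-class size would be 110 (the frequency of the fifth most common tag) and the number of classes would be 5.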

3.4 Dataset Format

Each problem in the datasets follows the same format (refer to Figure 1 for an example problem). The header contains the problem title, and the time and memory constraints for a program running on the problem test cases. The problem statement is the natural language description of the problem framed as a real-world scenario. The input and output format describe the input to, and the output from, a valid solution program. They also contain constraints that will be put on the size of inputs (for example, max size of input array, max size of 2 input values). The tags associated with the problem are the algorithm classes that we are trying to predict using the above information.

3.5 Class Categories in the Dataset

The classes for PWPs can be divided into two categories: Problem category classes tell us what kind of broad class of problem the PWP belongs to. For instance, math, and string are two such classes. Solution category classes tell us what kind of algorithm can solve a particular PWP. For example, a PWP of class dp or binary search would need a dynamic programming or binary search based algorithm to solve it.

Problem category PWPs are easier to classify because, in some cases, simple keyword mapping may lead to the classification (an equation in the problem is a strong indicator that a problem is of math type). For solution category PWPs, in contrast, a deeper understanding of the problem is required.

The classes belonging to the problem and solution categories for CFML20 are mentioned in the supplementary material.

3.6 Human Evaluation

In this section, we evaluate and analyze the performance of an average competitor on the task of predicting an algorithm for a PWP. The tags for a given PWP are added by its problem setter or other high-rated contestants who have solved it. Our test participants were recent computer science graduates with some experience in algorithms and competitive programming. We gave 5 participants the problem text along with all the constraints, and the input and output format. We also provided them with a list of all the tags and a few example problems for each tag. We randomly sampled 120 problems from the CFML20 dataset and split them into two parts, containing 20 and 100 problems respectively. The 20 problems were given along with their tags to familiarize the participants with the task. For the remaining 100 problems, the participants were asked to predict the tags (one or more) for each problem. We chose to sample the problems from the CFML20 dataset as it is the closest to a real-world scenario of predicting algorithms for solving problems. We find that there is some variation in the scores achieved by different participants, with the highest F1 micro score being 11 percent greater than the lowest (see supplementary material for more details). The F1 micro score averaged over all 5 participants was 51.8 while the averaged F1 macro score was 42.7. The results are not surprising since this task is like any other problem solving task, and people would get different results based on their proficiency. This shows us that the problem is hard even for humans with a computer science education.

4 Classification Models

To test the compatibility of our problem with the text classification paradigm, we apply some standard text classification models from recent literature to it.

4.1 Multiclass Classification

To approximate the optimal single-tag tagging function (see section 2), we use the following models.

Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM) Wang and Manning (2012) proposed several simple and effective baselines for text classification. An MNB is a naive Bayes classifier for multinomial models. An SVM is a discriminative hyperplane-based classifier Hearst et al. (1998). These baselines use unigrams and bigrams as features. We also try applying TF-IDF to these features.
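The unigram and bigram features, and their TF-IDF weighting, can be sketched in plain Python as below. This is not the paper's feature pipeline: whitespace tokenisation and the smoothed inverse document frequency, idf = ln((1 + N) / (1 + df)) + 1, are assumptions chosen as one common variant, and the function names are illustrative.

```python
import math
from collections import Counter


def ngram_counts(text):
    """Count unigrams and bigrams of a whitespace-tokenised, lower-cased text."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


def tfidf(documents):
    """Weight raw n-gram counts by a smoothed inverse document frequency."""
    counts = [ngram_counts(doc) for doc in documents]
    n_docs = len(documents)
    # Iterating a Counter yields each distinct term once per document.
    doc_freq = Counter(term for doc_counts in counts for term in doc_counts)
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in doc_counts.items()} for doc_counts in counts]


toy_docs = ["find the shortest path", "find the maximum sum"]
weights = tfidf(toy_docs)
print(weights[0]["find"], weights[0]["shortest path"])
```

Terms that appear in every document, such as "find" here, receive the lowest weight, while terms specific to one problem are weighted up.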

Multi-layer Perceptron (MLP) An MLP is a class of artificial neural network that uses backpropagation for training in a supervised setting Rumelhart et al. (1986). MLP-based models are standard for text classification baselines Glorot et al. (2011).

Convolutional Neural Network (CNN) We also train a Convolutional Neural Network (CNN) based model, similar to the one used by Kim (2014), to classify the problems. We use the model both with and without pre-trained GloVe word-embeddings Pennington et al. (2014).

CNN ensemble Hansen and Salamon (1990) introduce neural network ensemble learning, in which many neural networks are trained and their predictions combined. These neural network systems show greater generalization ability and predictive power. We train five CNN networks and combine their predictions using the majority voting system.
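Majority voting over several models can be sketched as follows. The predictions are toy values rather than outputs of trained CNNs, and the function name is illustrative. Ties (possible with five voters and more than two classes) are resolved here in favour of the label voted by the earliest model in the list, which is an assumption of this sketch and not something the paper specifies.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model class predictions by taking the most voted label.

    `predictions_per_model[m][i]` is the label model m predicts for problem i.
    """
    ensemble = []
    for votes in zip(*predictions_per_model):
        ensemble.append(Counter(votes).most_common(1)[0][0])
    return ensemble


# Toy predictions of five models on three problems.
toy_predictions = [
    ["dp", "math", "greedy"],
    ["dp", "greedy", "greedy"],
    ["math", "math", "greedy"],
    ["dp", "math", "dp"],
    ["dp", "dp", "greedy"],
]
print(majority_vote(toy_predictions))
```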

4.2 Multilabel Classifiers

To approximate the multilabel tagging function (see section 2), we apply the following augmentations to the models described above.

Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM) For applying these models to the multilabel case, we use the one-vs-rest (or, one-vs-all) strategy. This strategy involves training a single classifier for each class, with the samples of that class as positive samples and all other samples as negatives Bishop (2006).
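The one-vs-rest decomposition can be sketched as below: each class gets its own binary target vector, on which any binary classifier could then be trained. The helper names and toy data are illustrative. The second helper counts how many binary classifiers each strategy trains, n for one-vs-rest against n(n-1)/2 for one-vs-one.

```python
def one_vs_rest_targets(tag_sets, classes):
    """Build one binary target vector per class: 1 if the problem has the tag."""
    return {label: [1 if label in tags else 0 for tags in tag_sets] for label in classes}


def classifier_counts(num_classes):
    """Number of binary classifiers trained by each decomposition strategy."""
    return {
        "one_vs_rest": num_classes,
        "one_vs_one": num_classes * (num_classes - 1) // 2,
    }


toy_tag_sets = [{"dp"}, {"dp", "math"}, {"greedy"}]
print(one_vs_rest_targets(toy_tag_sets, ["dp", "math", "greedy"]))
print(classifier_counts(20))
```

For the 20 classes of CFML20 this gives 20 classifiers under one-vs-rest and 190 under one-vs-one.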

Multi-layer Perceptron (MLP) Nam et al. (2014) use MLP-based models for multilabel text classification. We use similar models, but use the MSE loss instead of the cross-entropy loss.

Convolutional Neural Network (CNN) For multilabel classification we use a CNN based feature extractor similar to the one used in Kim (2014). The output is passed through a sigmoid activation function, $\sigma(x) = \frac{1}{1 + e^{-x}}$. The labels which have a corresponding activation greater than 0.5 are considered predicted Liu et al. (2017). Similar to the multiclass case, we train the model both with and without pre-trained GloVe Pennington et al. (2014) word-embeddings.

CNN ensemble We train five CNNs and add their output linear activation values. We pass this sum through a sigmoid function and consider the labels (tags) with activation greater than 0.5.
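The multilabel ensemble rule (sum the linear outputs, apply a sigmoid, threshold at 0.5) can be sketched as follows. The logits are toy numbers and the function names are illustrative; no trained network is involved.

```python
import math


def sigmoid(x):
    """Logistic function, mapping a real-valued activation to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def ensemble_predict(logits_per_model, tags, threshold=0.5):
    """Sum each tag's linear activations over the models, then threshold the sigmoid."""
    summed = [sum(values) for values in zip(*logits_per_model)]
    return [tag for tag, total in zip(tags, summed) if sigmoid(total) > threshold]


toy_tags = ["dp", "math", "greedy"]
toy_logits = [
    [1.2, -0.5, -2.0],  # model 1
    [0.3, 0.4, -1.0],   # model 2
    [-0.2, 0.6, -0.5],  # model 3
]
print(ensemble_predict(toy_logits, toy_tags))
```

Since the sigmoid exceeds 0.5 exactly when its input is positive, the rule amounts to selecting the tags whose summed activation is greater than zero.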

5 Experiment setup

All hyperparameter tuning experiments were performed with 10-fold cross validation. For the non-neural network-based methods, we first vectorize each problem using a bag-of-words vectorizer, scikit-learn's CountVectorizer Pedregosa et al. (2011). We also experiment with TF-IDF features for each problem. In the multiclass case, we use the LIBSVM Chang and Lin (2001) implementation of the SVM classifier and we grid search over different kernels. However, the LIBSVM implementation is not compatible with the one-vs-rest strategy (complexity $O(n)$, where $n$ is the number of classes), but only with the one-vs-one strategy (complexity $O(n^2)$). This becomes prohibitively slow and thus, we use the LIBLINEAR Fan et al. (2008) implementation for the multilabel case. For hyperparameter tuning, we applied a grid search over the parameters of the vectorizers, classifiers, and other components. The exact parameters tuned can be seen in our code repository. For the neural network-based methods, we tokenize each problem using the spaCy tokenizer Honnibal and Montani (2017). We only use words appearing 2 or more times in building the vocabulary and replace the words that appear fewer times with a special UNK token. Our CNN network architecture is similar to that used by Kim (2014). The batch size used is 32. We apply 512 one-dimensional convolution filters of size 3, 4, and 5 on each problem. The rectifier, $f(x) = \max(0, x)$, is used as the activation function. We concatenate these filters, apply a global max-pooling followed by a fully-connected layer with output size equal to the number of classes. We use the PyTorch framework Paszke et al. (2017) to build this model. For the word embedding we use two approaches: a vanilla PyTorch trainable embedding layer and a 300-dimensional GloVe embedding Pennington et al. (2014). The networks were initialized using the Xavier method Glorot and Bengio (2010) at the beginning of each fold. We use the Adam optimization algorithm Kingma and Ba (2014) as we observe that it converges faster than vanilla stochastic gradient descent.
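The vocabulary rule described in this section (keep words that appear at least twice, map the rest to UNK) can be sketched as below. The token string `<unk>`, the function names, and the use of pre-split token lists in place of the spaCy tokenizer are assumptions made to keep the example self-contained.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical spelling of the special unknown-word token


def build_vocab(tokenized_problems, min_count=2):
    """Map each word seen at least `min_count` times to an integer id."""
    counts = Counter(token for tokens in tokenized_problems for token in tokens)
    vocab = {UNK: 0}
    for token, count in counts.items():
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace out-of-vocabulary words by the UNK id."""
    return [vocab.get(token, vocab[UNK]) for token in tokens]


toy_corpus = [["find", "the", "path"], ["find", "the", "sum"]]
toy_vocab = build_vocab(toy_corpus)
print(toy_vocab)
print(encode(["find", "the", "cycle"], toy_vocab))
```

The resulting integer sequences are what an embedding layer would consume before the convolution filters are applied.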

6 Results

| Classifier | CFMC5 Acc | CFMC5 F1 W | CFMC10 Acc | CFMC10 F1 W |
|---|---|---|---|---|
| CNN Random | 25.0 | 22.1 | 35.2 | 19.2 |
| MNB | 47.6 | 47.5 | 43.9 | 37.4 |
| SVM BoW | 49.3 | 49.1 | 47.9 | 43.2 |
| SVM TFIDF | 47.8 | 47.6 | 45.7 | 41.2 |
| MLP | 47.8 | 47.6 | 49.3 | 46.2 |
| CNN | 61.7 | 61.3 | 54.7 | 51.3 |
| CNN Ensemble | 62.7 | 62.2 | 53.5 | 50.5 |
| CNN GloVe | 62.2 | 61.3 | 54.5 | 51.4 |

Table 3: Classification accuracy for single label classification. Note that all results were obtained on 10-fold cross validation. CNN Random refers to a CNN trained on a random labelling of the dataset. F1 W stands for weighted macro F1-score.
| Classifier | CFML10 hamming loss | CFML10 F1 micro | CFML10 F1 macro | CFML20 hamming loss | CFML20 F1 micro | CFML20 F1 macro |
|---|---|---|---|---|---|---|
| CNN Random TWE | 0.2158 | 15.98 | 9.39 | 0.1207 | 12.07 | 4.02 |
| MNB BoW | 0.1706 | 30.57 | 25.73 | 0.1067 | 29.67 | 23.41 |
| SVM BoW | 0.1713 | 36.08 | 31.09 | 0.1056 | 34.93 | 30.70 |
| SVM BoW + TF-IDF | 0.1723 | 38.20 | 33.68 | 0.1059 | 38.55 | 34.70 |
| MLP BoW | 0.1879 | 39.13 | 34.92 | 0.1167 | 38.12 | 31.37 |
| CNN TWE | 0.1671 | 39.20 | 32.59 | 0.1023 | 38.44 | 30.38 |
| CNN Ensemble TWE | 0.1703 | 45.32 | 38.93 | 0.1093 | 42.75 | 37.29 |
| CNN GloVe | 0.1676 | 39.22 | 33.77 | 0.1052 | 37.56 | 30.29 |
| Human | - | - | - | - | 51.8 | 42.7 |

Table 4: Classification accuracy for multi-label classification. TWE stands for trainable word embeddings initialized with a normal distribution. Note that all results were obtained on 10-fold cross validation. CNN Random refers to a CNN trained on a random labelling of the dataset.

6.1 Multiclass Results

We see that the classification accuracy of the best performing classifier, CNN ensemble, for the CFMC5 dataset is 62.7%. The highest accuracy for the CFMC10 dataset was achieved by the CNN classifier which does not use any pretrained embeddings. For all the multiclass classification results refer to table 3. We observe that CNN-based classifiers perform better than the other classifiers (MLP, MNB, and SVM) for both the CFMC5 and CFMC10 datasets. Since these are the first learning results on the task of algorithm prediction for PWPs, we train a CNN classifier on a random labelling of the dataset. The results are given in the row called CNN Random. To obtain this random labelling we shuffle the current mapping from problem to tag randomly. This ensures that the class distribution of the datasets remains the same. We see that all the classifiers significantly outperform the CNN trained on the randomly labelled dataset. We also observe that the classification accuracy is not the same for every class. We get the highest accuracy (see Fig. 2) for the class data structures, at 90%, while the lowest accuracy is for the class greedy, at 40%. These results are on the CFMC5 dataset.
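The random-labelling control can be sketched as a permutation of the label list: every label is kept, so the class distribution is unchanged, but its association with a particular problem is destroyed. The function name and seed are illustrative assumptions.

```python
import random
from collections import Counter


def shuffle_labels(labels, seed=0):
    """Return a random permutation of the labels, preserving class frequencies."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


toy_labels = ["dp"] * 4 + ["math"] * 3 + ["greedy"] * 3
random_labels = shuffle_labels(toy_labels)
print(Counter(toy_labels) == Counter(random_labels))
```

A classifier trained on such labels has nothing to learn from the text, so its score serves as a floor against which the real classifiers are compared.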

6.2 Multilabel Results

We see that CNN-based classifiers give the best results for the CFML10 and CFML20 datasets. The best F1 micro and macro scores for the CFML10 dataset were 45.32 and 38.93 respectively. These were obtained by the CNN Ensemble model. For complete results see table 4. The best performing model on the CFML20 dataset was also the CNN ensemble. As we did in the multiclass case, we train a CNN model on the randomly shuffled labelling for both the CFML10 and CFML20 datasets. We find that all the classifiers significantly outperform the model trained on a shuffled labelling. The human-level F1 micro and macro scores on a subset of the CFML20 dataset were 51.2 and 40.5. In comparison, our best performing classifier on the CFML20 dataset, CNN Ensemble, got F1 micro and macro scores of 42.75 and 37.29 respectively. We see that the performance of our best classifier trails average human performance by about 8.45 and 3.21 points on F1 micro and F1 macro scores respectively.
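The three multilabel metrics reported in Table 4 can be computed as in the sketch below, which uses the standard definitions on sets of tags. Scoring a class with no true and no predicted instances as F1 = 0 is an assumption of this sketch, as are the function name and the toy data; the paper reports F1 values as percentages, while the sketch returns fractions.

```python
def multilabel_scores(true_sets, pred_sets, classes):
    """Return (hamming loss, micro F1, macro F1) for set-valued predictions."""
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    wrong_slots = 0
    for truth, pred in zip(true_sets, pred_sets):
        wrong_slots += len(truth ^ pred)  # labels present in exactly one of the sets
        for c in classes:
            if c in truth and c in pred:
                tp[c] += 1
            elif c in pred:
                fp[c] += 1
            elif c in truth:
                fn[c] += 1

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    hamming = wrong_slots / (len(true_sets) * len(classes))
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return hamming, micro, macro


toy_classes = ["dp", "math", "greedy"]
toy_true = [{"dp"}, {"dp", "math"}, {"greedy"}]
toy_pred = [{"dp"}, {"math"}, {"math"}]
print(multilabel_scores(toy_true, toy_pred, toy_classes))
```

Micro F1 pools the counts over all classes, so frequent tags dominate it, whereas macro F1 averages per-class scores and is pulled down by rare tags that are never predicted correctly, such as "greedy" in the toy data.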

7 Analysis

7.1 Experiments with various subsets of the problem

As described in section 1, a PWP consists of three components – the problem statement, input and output format, and time and memory constraints. We seek to answer the following questions. Does one component contribute to the accuracy more than any other? Does the contribution of different components vary over the problem class? We performed some experiments to address these questions. We split the problem into two parts – 1) the problem statement, and 2) the input and output format, and time and memory constraints. We train an SVM, and a CNN on these two components independently.

Figure 2: Confusion matrices for different parts of the problem on CFMC5. Whole problem text (left), only format and constraints information (center), and only problem statement (right). Performance on the whole problem is the highest, followed by only format and constraints information. Performance across different classes (except greedy) is additive, which shows that features extracted from both the parts are of importance.

Multiclass PWP component analysis We find classifier accuracies on the CFMC5 dataset. We choose the CFMC5 dataset out of the two multiclass datasets because it has a balanced class distribution. We find that the classifiers perform quite well on only the input and output format, and time and memory constraints – the best classifier getting an accuracy of 56.4 percent (only 5.3 percent lower than the accuracy of CNN with the whole problem). Classification using only the problem statement gives worse results than using the format and constraints, with a classification accuracy of 45.2 percent for the best classifier CNN (16.5 percent lower than the accuracy of a CNN trained on the whole problem). Complete results are given in table 5. We also see that the performance across different classes varies when trained on different inputs. We find that the class dp performs better when trained on the problem statement, whereas the other classes perform much better on the format and constraints. For each class except greedy, we see an additive trend – the accuracy is improved by combining both these features. Refer to figure 2 for more details.

| Dataset | Features | Classifier | Soln. category F1 Mi | Soln. category F1 Ma | Prob. category F1 Mi | Prob. category F1 Ma | All F1 Mi | All F1 Ma |
|---|---|---|---|---|---|---|---|---|
| CFMC5 | only statement | cnn | 42.73 | 46.14 | 51.32 | 64.35 | 46.13 | 45.20 |
| CFMC5 | only i/o | cnn | 44.24 | 51.73 | 74.73 | 81.31 | 56.42 | 55.41 |
| CFMC5 | all prob | cnn | 54.24 | 59.91 | 71.36 | 78.32 | 61.71 | 61.32 |
| CFML20 | only statement | cnn | 30.83 | 17.32 | 38.64 | 41.82 | 33.59 | 28.34 |
| CFML20 | only i/o | cnn | 34.63 | 19.59 | 44.49 | 44.34 | 38.44 | 30.38 |
| CFML20 | all prob | cnn | 34.39 | 19.23 | 45.36 | 44.02 | 39.20 | 32.59 |

Table 5: Performance on different categories of PWPs for different parts of the PWPs. The rows with "only statement" features use only the problem description part of the PWP, the rows with "only i/o" use only the I/O and constraint information, and "all prob" use the entire PWP. The results under the "Soln. category" column are of those problems that belong to the solution category, those under "Prob. category" belong to the problem category, and those under "All" are for all the PWPs. So, for example, the F1 Micro score using only I/O and constraint information for solution category problems of CFML20 is 34.63. Note that for CFMC5, F1 Mi (F1 Micro) is the same as accuracy, and the F1 Ma (F1 Macro) score is a weighted Macro F1-score.

Multilabel partial problem results We also tabulate the classifier scores on the CFML20 dataset when training only on the format and constraints, and only on the problem statement. Here, we observe trends similar to those of the multiclass partial problem experiments. We find that classifiers are more accurate when trained only on the format and constraints than only on the problem statement. Again, the accuracy is improved by combining both these features. Refer to table 5 for more details.

7.2 Problem category and Solution category results

We find that correctly classifying PWPs of the solution category is harder than correctly classifying PWPs of the problem category (table 5). For instance, consider the row corresponding to the CFMC5 dataset and the "all prob" feature. The accuracy for the solution category is 54.24% as compared to 71.36% for the problem category. This trend holds for both the CFMC5 and CFML20 datasets and also when using different features of the PWPs. In spite of the difficulty, the classification scores for the solution category are significantly better than random.

8 Related Work

Our work is related to three major topics of research: math word problem solving, text document classification, and program synthesis.

Math word problem solving In recent years, many models have been built to solve different kinds of math word problems. Some models solve only arithmetic problems Hosseini et al. (2014), while others solve algebra word problems Kushman et al. (2014). There are some recent solvers which solve a wide range of pre-university level math word problems Matsuzaki et al. (2017), Hopkins et al. (2017). Wang et al. (2017) and Mehta et al. (2017) have built deep neural network based solvers for math word problems.

Program synthesis Work related to the task of converting natural language descriptions to code comes under the research areas of program synthesis and natural language understanding. This work is still in its nascent stage. Zhong et al. (2017) worked on generating SQL queries automatically from natural language descriptions. Lin et al. (2017) worked on automatically generating bash commands from natural language descriptions. Iyer et al. (2016) worked on summarizing source code. Sudha et al. (2017) use a CNN based model to classify the algorithm used in a programming problem using the C++ code. Our model tries to accomplish this task by using the natural language problem description. Gulwani et al. (2017) is a comprehensive treatise on program synthesis.

Document classification The problem of classifying a programming word problem in natural language is similar to the task of document classification. The current state-of-the-art approach for single label classification is to use a hierarchical attention network based model (Yang et al., 2016). This model is improved by using transfer learning Howard and Ruder (2018). Other approaches include a Recurrent Convolutional Neural Network based approach Lai et al. (2015) or the fasttext model Joulin et al. (2016), which uses bag-of-words features and a hierarchical softmax. Nam et al. (2014) use a feed-forward neural network with binary cross entropy per label to perform multilabel document classification. Kurata et al. (2016) leverage label co-occurrence to improve multilabel classification. Liu et al. (2017) use a CNN based architecture to perform extreme multilabel classification.

9 Conclusion

We introduced a new problem of predicting the algorithm classes for programming word problems. For this task we generated four datasets: two multiclass (CFMC5 and CFMC10), having 5 and 10 classes respectively, and two multilabel (CFML10 and CFML20), having 10 and 20 classes respectively. Our classifiers fall short of the human score by only about 9 percent. We also did some experiments which show that increasing the size of the training dataset improves the accuracy (see supplementary material). These problems are much harder than high school math word problems as they require a good knowledge of various computer science algorithms and an ability to reduce a problem to these known algorithms. Even our human analysis shows that trained computer science graduates only get an F1 of 51.8. Based on these results, we see that algorithm class prediction is compatible with and can be solved using text classification.


  • Bishop (2006) Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.
  • Chang and Lin (2001) Chih-Chung Chang and Chih-Jen Lin. 2001. LIBSVM: a library for support vector machines.
  • Fan et al. (2008) Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874.
  • Forišek (2010) Michal Forišek. 2010. The difficulty of programming contests increases. In International Conference on Informatics in Secondary Schools-Evolution and Perspectives, pages 72–85. Springer.
  • Gertin (2012) Thomas Gertin. 2012. Maximizing the cost of shortest paths between facilities through optimal product category locations. Ph.D. thesis.
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10). Society for Artificial Intelligence and Statistics.
  • Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 513–520.
  • Gulwani et al. (2017) Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, et al. 2017. Program synthesis. Foundations and Trends® in Programming Languages, 4(1-2):1–119.
  • Hansen and Salamon (1990) L. K. Hansen and P. Salamon. 1990. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001.
  • Hearst et al. (1998) Marti A. Hearst, Susan T Dumais, Edgar Osuna, John Platt, and Bernhard Scholkopf. 1998. Support vector machines. IEEE Intelligent Systems and their applications, 13(4):18–28.
  • Honnibal and Montani (2017) Matthew Honnibal and Ines Montani. 2017. spacy 2: Natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing. To appear.
  • Hopkins et al. (2017) Mark Hopkins, Cristian Petrescu-Prahova, Roie Levin, Ronan Le Bras, Alvaro Herrasti, and Vidur Joshi. 2017. Beyond sentential semantic parsing: Tackling the math sat with a cascade of tree transducers. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 795–804.
  • Hosseini et al. (2014) Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523–533.
  • Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 328–339.
  • Iyer et al. (2016) Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2073–2083.
  • Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
  • Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kurata et al. (2016) Gakuto Kurata, Bing Xiang, and Bowen Zhou. 2016. Improved neural network-based multi-label classification with better initialization leveraging label co-occurrence. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 521–526.
  • Kushman et al. (2014) Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 271–281.
  • Lai et al. (2015) Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In AAAI, volume 333, pages 2267–2273.
  • Lin et al. (2017) Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, Luke Zettlemoyer, and Michael D Ernst. 2017. Program synthesis from natural language using recurrent neural networks. University of Washington Department of Computer Science and Engineering, Seattle, WA, USA, Tech. Rep. UW-CSE-17-03-01.
  • Liu et al. (2017) Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. 2017. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–124. ACM.
  • Matsuzaki et al. (2017) Takuya Matsuzaki, Takumi Ito, Hidenao Iwane, Hirokazu Anai, and Noriko H Arai. 2017. Semantic parsing of pre-university math problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2131–2141.
  • McDowell (2016) Gayle Laakmann McDowell. 2016. Cracking the Coding Interview: 189 Programming Questions and Solutions. CareerCup, LLC.
  • Mehta et al. (2017) Purvanshi Mehta, Pruthwik Mishra, Vinayak Athavale, Manish Shrivastava, and Dipti Sharma. 2017. Deep neural network based system for solving arithmetic word problems. Proceedings of the IJCNLP 2017, System Demonstrations, pages 65–68.
  • Nam et al. (2014) Jinseok Nam, Jungi Kim, Eneldo Loza Mencía, Iryna Gurevych, and Johannes Fürnkranz. 2014. Large-scale multi-label text classification—revisiting neural networks. In Joint european conference on machine learning and knowledge discovery in databases, pages 437–452. Springer.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. In NIPS-W.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In EMNLP.
  • Rumelhart et al. (1986) David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning internal representations by error propagation. In David E. Rumelhart and James L. Mcclelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, pages 318–362. MIT Press, Cambridge, MA.
  • Sudha et al. (2017) S Sudha, A Arun Kumar, M Muthu Nagappan, and R Suresh. 2017. Classification and recommendation of competitive programming problems using cnn. In International Conference on Intelligent Information Technologies, pages 262–272. Springer.
  • Wang and Manning (2012) Sida Wang and Christopher D. Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, ACL ’12, pages 90–94, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Wang et al. (2017) Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 845–854.
  • Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.
  • Zhong et al. (2017) Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.

Appendix A Appendix

a.1 Experiments with limited training data

To see how dataset size affects classifier performance, we trained a CNN classifier on 25, 50, 75, and 100 percent of the CFML20 dataset. As expected, performance improves as the size of the training data increases: the F1 micro and macro scores increase, and the hamming loss decreases. For the F1 scores, higher is better, while for hamming loss lower is better. See Figure 3.

Figure 3: Variation of F1 micro, F1 macro, and hamming loss when models are trained on different percentages of the CFML20 dataset. Note that the scale for both F1 scores is given on the left and the scale for hamming loss is given on the right.
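The subsampling procedure behind this experiment is not described in detail, so the following is only a minimal sketch of one reasonable setup: shuffle the training set once and take nested prefixes, so that each larger subset contains the smaller ones. The function name `nested_subsets`, the fixed seed, and the toy data are assumptions made for illustration.

```python
import random


def nested_subsets(examples, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Return {fraction: subset}, where each subset is a prefix of one shuffled copy.

    Using prefixes of a single shuffle (an assumption, not the authors' stated
    procedure) guarantees that the 25% subset is contained in the 50% subset,
    and so on, so the curves differ only in the amount of data.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return {f: shuffled[: int(round(f * len(shuffled)))] for f in fractions}


# Hypothetical stand-in for the training split: 100 problem ids.
toy_train = list(range(100))
subsets = nested_subsets(toy_train)
for fraction, subset in subsets.items():
    print(fraction, len(subset))
```

A classifier would then be trained once per subset and evaluated on the same held-out test set each time.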

a.2 Evaluation Metrics

a.3 Multiclass: Accuracy

Accuracy is the percentage of labels correctly predicted. Note that for multiclass classification the micro-averaged F1 score is equal to the accuracy.
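The equality between accuracy and micro-averaged F1 in the single-label case can be checked on a small example. The sketch below uses plain Python and hypothetical toy labels (not drawn from the datasets). Every wrong prediction counts as one false positive for the predicted class and one false negative for the gold class, so the pooled precision and recall both equal the accuracy.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose single predicted class matches the gold class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, FN over all classes, then compute F1 once."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c and t != c:
                fp += 1
            elif p != c and t == c:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)


# Toy gold and predicted labels (hypothetical).
gold = ["dp", "greedy", "math", "dp", "implementation", "greedy"]
pred = ["dp", "math", "math", "greedy", "implementation", "greedy"]

print(accuracy(gold, pred), micro_f1(gold, pred))
```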

a.4 Multiclass: Macro-averaged F1 score

Macro-averaged F1 score is computed by first computing the F1 score for each class independently and then averaging all the F1 scores. This metric treats all the classes as equal, independent of their frequency in the test set.

a.5 Multiclass: Weighted macro-averaged F1 score

Weighted macro-averaged F1 score is computed by first computing the F1 score for each class independently and then averaging all the F1 scores, with each class weighted by its support (the number of test examples belonging to that class).
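The difference between the plain and the support-weighted macro average shows up clearly on an imbalanced toy example. The following is a minimal sketch in plain Python. The labels are hypothetical, and assigning an F1 of 0 to a class with no true positives, false positives, or false negatives is a convention assumed here.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {class: F1}, computed one-vs-rest for every class seen in gold or predictions."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0  # assumed convention for empty classes
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by the number of gold examples per class."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Imbalanced toy data (hypothetical): "dp" dominates, "math" is never predicted.
gold = ["dp", "dp", "dp", "dp", "greedy", "math"]
pred = ["dp", "dp", "dp", "greedy", "greedy", "dp"]

print(per_class_f1(gold, pred))
print(macro_f1(gold, pred), weighted_f1(gold, pred))
```

Here the rare class "math" pulls the macro average down sharply, while the weighted average stays closer to the score of the frequent class.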

a.6 Multilabel: Hamming loss

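Hamming loss is the proportion of misclassified labels in the dataset, that is, the fraction of example-label pairs for which the predicted value (present or absent) differs from the gold value.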
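As an illustration, the sketch below computes Hamming loss under its standard label-wise definition: the number of wrongly predicted label slots divided by the total number of label slots (examples times labels). The toy indicator matrices are hypothetical, and it is an assumption here that the reported scores follow this standard definition.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label slots predicted incorrectly.

    y_true, y_pred: lists of equal-length 0/1 indicator vectors, one per example.
    """
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    )
    return wrong / total


# Three toy problems with four candidate labels each (hypothetical).
gold = [[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 0, 0]]
pred = [[1, 0, 1, 1], [0, 1, 0, 0], [0, 1, 0, 1]]

print(hamming_loss(gold, pred))  # 3 wrong slots out of 12
```

Note that an example with one wrong label out of four contributes only a quarter of an error, which is why Hamming loss is more forgiving than exact-match accuracy.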

a.7 Multilabel: Micro-averaged F1 score

This is the F-measure computed over the whole prediction matrix. The individual true positives, false positives, and false negatives are summed across all labels/classes, and the F-measure is then calculated from these totals.

a.8 Multilabel: Macro-averaged F1 score

Macro-averaged F1 score is calculated by computing the F1 score for each of the labels, then averaging the label wise F1 scores.
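The two multilabel averages can be contrasted on a small indicator matrix. The following sketch is plain Python with hypothetical toy data. A label with no true positives, false positives, or false negatives is assigned an F1 of 0, which is an assumed convention.

```python
def label_counts(y_true, y_pred, j):
    """Return (tp, fp, fn) for label column j of two 0/1 indicator matrices."""
    tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed convention for empty labels


def multilabel_micro_f1(y_true, y_pred):
    """Sum TP, FP, FN over all labels, then compute a single F1."""
    counts = [label_counts(y_true, y_pred, j) for j in range(len(y_true[0]))]
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return f1_from_counts(tp, fp, fn)


def multilabel_macro_f1(y_true, y_pred):
    """Compute F1 per label, then take the unweighted mean."""
    n_labels = len(y_true[0])
    scores = [f1_from_counts(*label_counts(y_true, y_pred, j)) for j in range(n_labels)]
    return sum(scores) / n_labels


# Three toy problems with four candidate labels each (hypothetical).
gold = [[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 0, 0]]
pred = [[1, 0, 1, 1], [0, 1, 0, 0], [0, 1, 0, 1]]

print(multilabel_micro_f1(gold, pred), multilabel_macro_f1(gold, pred))
```

In this example the third label is never present in the gold data but is predicted once, so its F1 of 0 lowers the macro average more than the micro average.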

Appendix B Human accuracy

We conducted a human study with 5 participants on the CFML20 dataset. Each participant is a recent graduate in computer science and a frequent competitive programmer. The results are shown in Table 6.

                 20multi subset
Classifier       F1 micro    F1 macro
Human 1          56.3        42.3
Human 2          46.1        38.7
Human 3          51.1        40.6
Human 4          48.4        42.8
Human 5          57.3        49.1
Human Average    51.8        42.7

Table 6: Human accuracy on a 100 sized subset of the CFML20 dataset. HL is the hamming loss.
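As a quick consistency check, the Human Average row can be reproduced as the arithmetic mean of the five participants' scores. The sketch below copies the per-participant values from Table 6. The variable names are chosen for illustration only.

```python
# (F1 micro, F1 macro) per participant, copied from Table 6.
human_scores = {
    "Human 1": (56.3, 42.3),
    "Human 2": (46.1, 38.7),
    "Human 3": (51.1, 40.6),
    "Human 4": (48.4, 42.8),
    "Human 5": (57.3, 49.1),
}

n = len(human_scores)
micro_avg = sum(micro for micro, _ in human_scores.values()) / n
macro_avg = sum(macro for _, macro in human_scores.values()) / n

print(round(micro_avg, 1), round(macro_avg, 1))
```

Rounded to one decimal place, both means agree with the Human Average row.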

Appendix C Classes classification in CFML20

c.1 Problem category

The following classes belong to the problem category: probabilities, geometry, combinatorics, number theory, strings, trees, graphs, math, data structures

c.2 Solution Category

The following classes belong to the solution category: dsu, binary search, dfs and similar, constructive algorithms, brute force, greedy, dp, bitmask, two pointers, sortings, implementation
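For analysis, this two-way split of the 20 classes can be kept as a simple lookup table. The sketch below lists the class names exactly as given above. The names `CATEGORY_OF` and `split_by_category` are hypothetical helpers introduced for illustration, not part of the released code.

```python
PROBLEM_CATEGORY = [
    "probabilities", "geometry", "combinatorics", "number theory", "strings",
    "trees", "graphs", "math", "data structures",
]
SOLUTION_CATEGORY = [
    "dsu", "binary search", "dfs and similar", "constructive algorithms",
    "brute force", "greedy", "dp", "bitmask", "two pointers", "sortings",
    "implementation",
]

# Map every class name to the category it belongs to.
CATEGORY_OF = {
    **{name: "problem" for name in PROBLEM_CATEGORY},
    **{name: "solution" for name in SOLUTION_CATEGORY},
}


def split_by_category(labels):
    """Group one problem's labels into problem-category and solution-category classes."""
    groups = {"problem": [], "solution": []}
    for label in labels:
        groups[CATEGORY_OF[label]].append(label)
    return groups


# Example: a hypothetical problem tagged with three of the 20 classes.
print(split_by_category(["graphs", "dfs and similar", "trees"]))
```

The two lists together contain 9 + 11 = 20 distinct names, matching the 20 classes of CFML20.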