1 Introduction
In recent years, as education and information technology have converged, online education has boomed and the number of online practice problems has surged. Efficiently organizing and managing these test resources, and realizing intelligent processes such as question recommendation, rapid test-paper assembly, and adaptive testing, is becoming increasingly critical in this field. Automatic tagging of knowledge points for practice problems is the basis for managing question banks and improving the automation and intelligence of education. First, automatic tagging can assist or completely replace manual tagging, effectively reducing teachers’ workload and improving tagging efficiency. Second, automatic tagging can reduce individual bias caused by subjective factors and improve tagging accuracy. Therefore, studying the automatic tagging of knowledge points is of great practical importance.
As one of the primary subjects in K-12 education, mathematics deserves particular research attention. Math tests measure students’ mastery of knowledge, and a math problem usually covers several different categories of knowledge points. As shown in Figure 1, the different knowledge points of a question form a hierarchical relationship. In addition, math texts are usually concise and contain various mathematical symbols with implicit logic and correlations. Moreover, mathematical language is rigorous: changing a single word may describe the opposite result. Traditional text classification methods are difficult to apply directly to math texts while meeting the accuracy requirement. Therefore, we need to solve the following issues:

Classification of texts containing mathematical formulas

Classification of multi-label texts whose labels are correlated
To this end, we propose the LABS model, which is based on label-semantic attention and multi-label smoothing combining textual features, to improve the automatic tagging of knowledge points for math problems. The main contributions of this paper are as follows:

Mathematical objects such as formulas are treated as a whole, and their features are extracted with a neural network to enrich the semantic features of the text.

The novel LABS model predicts the knowledge points of math problems better than the traditional BiLSTM model.

A real-world, open-source dataset is established for this research. The dataset consists of high school math questions and their corresponding knowledge points; the questions contain both textual information and mathematical expressions.
2 Related Work
The automatic tagging of knowledge points for math problems is essentially multi-label classification of math texts. A math text is generally shorter than 200 characters, so math texts are short compared with general texts. The key to short text classification is overcoming the data sparsity that prevents traditional models from extracting sufficient semantic features. Some studies extend the original text with the help of external knowledge bases [1, 2, 3, 4], while others directly use the original features of words, such as n-grams [5] and word embeddings [6, 7], to extend the short text. Recently, Li et al. [8] investigated the effect of fusing statistical information of text with semantic features on short text classification. In addition, Xiao et al. [9] used the relationship between labels and texts to build a label-specific document representation, which improved the classification effect of the model. However, these methods target texts in general domains and cannot be directly applied to classifying math texts.
There are two main approaches to multi-label classification: one converts multi-label classification tasks into binary or multi-class classification tasks [10, 11, 12, 13], and the other handles multi-label classification by extending specific classification algorithms [14, 15, 16, 17]. However, labels are correlated, and the difficulty of multi-label classification lies in how to deal with such correlations. Some recent work [18, 19] converts multi-label classification into a sequence generation problem using the Seq2Seq model, which uses neural networks to learn label sequences from text sequences and thus learn the correlations among labels. However, this method requires prior knowledge of the label ordering and suffers from exposure bias, which is not conducive to practical applications. In addition, the label distribution itself implies the relationships among labels, and Guo et al. [20] let the model learn a simulated label distribution to replace the one-hot representation of labels, which achieved good results on multi-class classification. However, there is no corresponding study on multi-label text classification.
3 LABS Model
3.1 Problem
In this paper, the automatic tagging of knowledge points for math problems is formulated as a multi-label classification problem for text containing mathematical formulas, defined as follows:
Definition 1.
Let $x = \{t_1, t_2, \ldots, t_n\}$ denote the sequence of a math question, where $n$ is the length of the sequence and each token $t_i$ is either a word or a mathematical expression. For each input sequence $x$, there is a corresponding knowledge-point vector $y = (y_1, \ldots, y_m)$ as output, where $m$ is the total number of knowledge points and $y_i \in \{0, 1\}$ indicates whether the $i$-th knowledge point is relevant. The problem is thus described as $f: x \to y$, i.e., given questions and their corresponding knowledge points, the goal is to train a classifier $f$ that assigns the most relevant knowledge points to upcoming new questions.
3.2 Solution
To solve the above problem, we propose a model named LABS (Label Attention, Basic, Label Smoothing) in this paper. The architecture of the model is shown in Figure 2.
The LABS model is designed to study the effects of label-semantic attention and multi-label smoothing combining textual features on knowledge point tagging. It consists of three main components:
3.2.1 Text representation of math questions
The Basic module provides the vector representation of a math question. For example, the math question shown in Figure 1 includes both general textual content and mathematical objects such as “$\{a_n\}$”. If the latter is treated like the former, the mathematical objects will be split like text, yielding word sequences such as “{”, “an”, “}”. Such a representation destroys the implicit logic and relationships in mathematical formulas, and no useful information can be extracted from them. If the information in mathematical formulas cannot be fully utilized, it is even harder to tag math problems effectively, since their texts are brief and concise. For this reason, the LABS model treats mathematical objects such as formulas as a whole, parsing and embedding them with the TangentCFT method [21], and uses neural networks to extract features of mathematical objects to improve the effectiveness of the classifier.
The model uses a BiLSTM network to encode the context of the text in both the forward and backward directions, thus fusing the contextual semantics of both directions into the text representation. For the input sequence $x$ of the math question, the hidden state at time step $t$ is determined by both the input at this time step and the hidden state at the previous time step, as shown in Equation (1):
(1) $\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(x_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(x_t, \overleftarrow{h}_{t+1})$
where $x_t$ is the word or formula vector of the input text at time step $t$, and $\overrightarrow{h}_t, \overleftarrow{h}_t \in \mathbb{R}^{d}$ are the forward and backward hidden vectors, respectively ($d$ is the dimension of the hidden layer).
The final text encoding produced by the BiLSTM is shown in Equation (2), where $n$ is the length of the sequence:
(2) $\overrightarrow{H} = [\overrightarrow{h}_1, \ldots, \overrightarrow{h}_n], \qquad \overleftarrow{H} = [\overleftarrow{h}_1, \ldots, \overleftarrow{h}_n], \qquad H = [\overrightarrow{H}; \overleftarrow{H}]$
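The bidirectional encoding of Equations (1) and (2) can be sketched as follows. This is a minimal NumPy illustration, using a simple tanh-RNN cell as a stand-in for the LSTM cell; the dimensions and random weights are illustrative only, not the paper's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_pass(X, Wx, Wh, reverse=False):
    """One directional pass; a tanh-RNN stand-in for an LSTM cell."""
    T = X.shape[0]
    h = np.zeros(Wh.shape[0])
    states = []
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        # h_t depends on the current input and the previous hidden state
        h = np.tanh(Wx @ X[t] + Wh @ h)
        states.append(h)
    if reverse:
        states.reverse()  # restore chronological order
    return np.stack(states)  # (T, hidden_dim)

d_in, d_h, T = 8, 4, 5
X = rng.normal(size=(T, d_in))  # embedded words/formulas of one question

# Separate weights for the forward and backward passes
Wx_f, Wh_f = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
Wx_b, Wh_b = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

H_fwd = rnn_pass(X, Wx_f, Wh_f)                # forward context
H_bwd = rnn_pass(X, Wx_b, Wh_b, reverse=True)  # backward context
H = np.concatenate([H_fwd, H_bwd], axis=1)     # fused representation, (T, 2*d_h)
```

A production implementation would simply use a framework's bidirectional LSTM layer; the point here is only that each position ends up with both left-to-right and right-to-left context.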
3.2.2 Labelsemantic attention
The LA module is responsible for learning the importance weights of words and formulas in the math text using label-semantic attention. The knowledge points of a problem have specific semantics corresponding to the math text, yet conventional attention mechanisms rarely use label information to guide classification. Therefore, label-semantic attention is proposed: it uses the semantic information of the knowledge points to learn the importance weights of words and formulas in the math text and to guide the neural network in extracting meaningful information from the text, thus enhancing the text representation.
The label matrix $C \in \mathbb{R}^{m \times d}$ (where $m$ is the number of labels) is treated as a trainable matrix whose weights are dynamically updated during training. In this paper, the similarity between labels and text is calculated by the dot product of their vectors and then passed through the sigmoid function, as shown in Equation (3):
(3) $\overrightarrow{A} = \sigma(\overrightarrow{H} C^{\top}), \qquad \overleftarrow{A} = \sigma(\overleftarrow{H} C^{\top})$
where $\overrightarrow{A}, \overleftarrow{A} \in \mathbb{R}^{n \times m}$ are the similarities between the forward text and the labels and between the backward text and the labels, respectively.
Eventually, the text representation $M$ enhanced by label-semantic attention is obtained as shown in Equation (4):
(4) $M = \overrightarrow{A}^{\top} \overrightarrow{H} + \overleftarrow{A}^{\top} \overleftarrow{H}$
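A sketch of the label-semantic attention computation in Equations (3) and (4), under the same illustrative NumPy setup (the hidden states and the trainable label matrix are random stand-ins; all dimensions are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
T, d_h, m = 5, 4, 3                 # sequence length, hidden size, label count
H_fwd = rng.normal(size=(T, d_h))   # forward hidden states from the BiLSTM
H_bwd = rng.normal(size=(T, d_h))   # backward hidden states
C = rng.normal(size=(m, d_h))       # trainable label embedding matrix

# Equation (3): dot-product similarity between each token and each label,
# squashed into (0, 1) by the sigmoid
A_fwd = sigmoid(H_fwd @ C.T)        # (T, m), forward attention weights
A_bwd = sigmoid(H_bwd @ C.T)        # (T, m), backward attention weights

# Equation (4): label-aware text representation, one vector per label
M = A_fwd.T @ H_fwd + A_bwd.T @ H_bwd   # (m, d_h)
```

Each row of M is a weighted sum of the hidden states, where the weights encode how strongly each token relates to that particular knowledge point.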
3.2.3 Multilabel smoothing combining textual features
The LS module is responsible for constructing soft labels and optimizing the classifier using the relationship between the math question and its knowledge points. In this paper, the knowledge points of the question are regarded as a trainable matrix. Textual features are integrated into the label distribution through the relationship between the math text and the knowledge points. Multi-label smoothing is then applied to the classifier, and the loss function is modified to improve the classification effect of the model.
In this paper, the sigmoid activation function is used in the last layer to get the multi-label prediction for the input text, as shown in Equation (5):
(5) $\hat{y} = \sigma(W v + b)$
where $v$ is the text vector, $W$ and $b$ are the weight matrix and bias of the output layer, and $\hat{y} \in [0, 1]^{m}$ is the predicted label distribution.
Usually, to avoid the overfitting and overconfidence caused by the multi-hot encoded label vector, label smoothing (LS) is used as a regularization technique. However, the traditional LS method only adds random noise in each dimension without considering the correlations between labels, so its improvement of the model is limited. For this reason, based on the LCM (Label Confusion Model) proposed by Guo et al. [20], we incorporate textual features into the label representation and learn, in real time, a label distribution better than multi-hot encoding for multi-label smoothing (MLS), further improving the classification of the model.
MLS treats the label matrix $E^{L} \in \mathbb{R}^{m \times d}$ as a trainable matrix and calculates a confusion distribution $c$ reflecting the similarity between the labels and the text according to Equation (6):
(6) $c = \mathrm{softmax}(E^{L} v)$
This confusion distribution is then used to adjust the original multi-hot encoding, as shown in Equation (7):
(7) $y^{s} = \mathrm{softmax}(\alpha y + c)$
where $y$ is the multi-hot encoding, which is combined with $c$ through the hyperparameter $\alpha$ and then normalized to obtain the final simulated label distribution $y^{s}$. The Kullback–Leibler divergence is then used as the loss function, as shown in Equation (8):
(8) $\mathcal{L} = \mathrm{KL}\left(y^{s} \,\middle\|\, \hat{y}\right) = \sum_{i=1}^{m} y^{s}_{i} \log \frac{y^{s}_{i}}{\hat{y}_{i}}$
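The MLS steps of Equations (6)–(8) can be sketched as follows. The values, dimensions, and the stand-in prediction are illustrative; in the real model the label matrix is trained jointly with the rest of the network rather than drawn at random.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
m, d = 4, 6
v = rng.normal(size=d)             # text representation of the question
E_label = rng.normal(size=(m, d))  # trainable label matrix
y = np.array([1., 0., 1., 0.])     # multi-hot ground-truth knowledge points
alpha = 4.0                        # smoothing hyperparameter

# Equation (6): confusion distribution from text-label similarity
c = softmax(E_label @ v)

# Equation (7): simulated label distribution replacing the multi-hot target
y_sim = softmax(alpha * y + c)

# Equation (8): KL divergence between the simulated target and the prediction
y_hat = softmax(rng.normal(size=m))  # stand-in for the classifier's output
loss = float(np.sum(y_sim * np.log(y_sim / y_hat)))
```

Because the hyperparameter alpha weights the ground truth against the confusion term, true labels still dominate the simulated distribution while correlated labels receive small nonzero mass instead of hard zeros.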
3.2.4 Evaluation metrics
In this paper, Precision@k, Recall@k, and F1@k are used to measure the agreement between the predicted and actual labels. Let $Y$ be the set of actual knowledge points of a question and $\mathrm{top}_k(\hat{y})$ the $k$ labels with the highest predicted scores. Precision@k quantifies the proportion of correct labels among the first $k$ predicted labels; it takes values in $[0, 1]$, the larger the better.
(9) $\mathrm{Precision@}k = \frac{|\mathrm{top}_k(\hat{y}) \cap Y|}{k}$
Recall@k gives the proportion of the actual labels that appear among the first $k$ predictions; it takes values in $[0, 1]$, the larger the better.
(10) $\mathrm{Recall@}k = \frac{|\mathrm{top}_k(\hat{y}) \cap Y|}{|Y|}$
F1@k is defined as the harmonic mean of Precision@k and Recall@k; it takes values in $[0, 1]$, the larger the better.
(11) $\mathrm{F1@}k = \frac{2 \cdot \mathrm{Precision@}k \cdot \mathrm{Recall@}k}{\mathrm{Precision@}k + \mathrm{Recall@}k}$
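The three metrics can be computed directly from a score vector; a small self-contained sketch follows (the example labels and scores are made up):

```python
import numpy as np

def precision_at_k(y_true, y_score, k):
    """Fraction of the top-k predicted labels that are actually correct."""
    topk = np.argsort(y_score)[::-1][:k]
    return y_true[topk].sum() / k

def recall_at_k(y_true, y_score, k):
    """Fraction of the actual labels recovered within the top-k predictions."""
    topk = np.argsort(y_score)[::-1][:k]
    total = y_true.sum()
    return y_true[topk].sum() / total if total else 0.0

def f1_at_k(y_true, y_score, k):
    """Harmonic mean of Precision@k and Recall@k."""
    p, r = precision_at_k(y_true, y_score, k), recall_at_k(y_true, y_score, k)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Toy example: two correct labels, one of which is ranked in the top 2
y_true = np.array([1, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.1, 0.2])
p2 = precision_at_k(y_true, y_score, 2)  # 1 hit of 2 predictions -> 0.5
r2 = recall_at_k(y_true, y_score, 2)     # 1 hit of 2 true labels -> 0.5
f2 = f1_at_k(y_true, y_score, 2)         # harmonic mean -> 0.5
```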
4 Experimental Setting
4.1 Data
Since the problem solved by the LABS model is novel, no suitable public benchmark dataset is currently available. We therefore establish a real-world Chinese dataset for this research, which is open source (https://anonymous.4open.science/r/mathdataD26B). The dataset, named DA20k, was collected from an online question bank (http://tiku.zujuan.com/). It consists of high school math practice problems, including textual information and mathematical expressions, and their corresponding knowledge points. The statistical details of DA20k are shown in Table 1.
Questions  Labels  Avg. Chars  Avg. Words  Avg. Formulas  Avg. Labels
22498      427     68.32       47.19       6.50           1.89
4.2 Models and parameters
Four models are tested on the DA20k dataset:

Basic, including only the Basic module, which serves as the control group.

LAB, including the Basic and LA modules, which is used to study the role of label-semantic attention in math text classification.

LBS, including the Basic and LS modules, which is used to study the impact of multi-label smoothing combining textual features.

LABS, including the Basic, LA, and LS modules, which is used to study the combined effect of the two factors.
All four models are tested under the same conditions except for the modules used; the parameters of the experiment are shown in Table 2.
Parameter  Value 

Number of tokens  72904 
Number of labels  427 
Max sequence length  120 
Vector dimension  300 
Hidden layer size  512 
Batch size  512 
Optimizer  Adam [22] 
Learning rate  0.001 
Hyperparameter (α)  4 [20] 
5 Experimental Results
5.1 Analysis of overall experiments
The experimental results of the four models are shown in Table 3 and Figure 3, where the optimal values are rendered in bold.
Evaluation  Basic  LAB  LBS  LABS 

Precision@1  52.14%  54.34%  57.84%  61.61% 
Precision@2  40.82%  42.35%  45.14%  48.05% 
Precision@3  33.33%  34.21%  36.58%  38.53% 
Recall@1  32.18%  33.85%  36.36%  38.89% 
Recall@2  47.89%  49.86%  53.70%  57.10% 
Recall@3  57.19%  59.01%  63.26%  66.55% 
F1@1  39.65%  41.57%  44.51%  47.56% 
F1@2  43.95%  45.67%  48.92%  52.07% 
F1@3  42.01%  43.20%  46.26%  48.71% 
In terms of Precision@k, the four models rank LABS > LBS > LAB > Basic for the same value of $k$, showing that prediction accuracy improves once label-semantic attention and multi-label smoothing are introduced. Taking Precision@1 as an example, LAB improves over Basic by 4.22% and LBS by 10.93%, so the effect of multi-label smoothing is more significant. Furthermore, the LABS model performs best, improving over Basic by 18.16%, over LAB by 13.38%, and over LBS by 6.52%. As $k$ increases, the accuracy of all four models decreases, indicating that the model has the greatest probability of correctly tagging one knowledge point; the number of correct predictions is unlikely to exceed the average number of knowledge points per question, i.e., 1.89.
In terms of Recall@k, the four models rank LABS > LBS > LAB > Basic for the same value of $k$, showing that the models introducing label-semantic attention or multi-label smoothing achieve higher recall in the same setting. Taking Recall@3 as an example, LAB improves over Basic by 3.18% and LBS by 10.61%, so the effect of multi-label smoothing is more significant. Furthermore, the LABS model, which uses both, performs best, improving over Basic by 16.37%, over LAB by 12.78%, and over LBS by 5.20%. As $k$ increases, the recall of all four models increases, indicating that the more knowledge points the model predicts, the more hits it makes.
In terms of F1@k, the four models rank LABS > LBS > LAB > Basic for the same value of $k$. Taking F1@2 as an example, LAB improves over Basic by 3.91% and LBS by 11.31%, so the effect of multi-label smoothing is more significant. Furthermore, the LABS model, which uses both, performs best, improving over Basic by 18.48%, over LAB by 14.01%, and over LBS by 6.44%. Since the F1-score is the harmonic mean of precision and recall, and higher is better, LABS is the best model, followed by LBS and LAB, with Basic the worst. The combined effect of label-semantic attention and multi-label smoothing combining textual features thus has the greatest impact on the automatic tagging of knowledge points.
5.2 Distribution of labelsemantic attention
To better analyze the role of label-semantic attention in the automatic tagging of math problems, we visualize the attention weights over the original math text with heatmaps. The results are shown in Figure 4, where sample One comes from the dataset and sample Two is the test example shown in Figure 1.
As the attention distributions show, for every knowledge point the model predicts, the related words or formulas receive higher attention weights. For sample One, the LAB model focuses on words and formulas such as “hyperbola” and “equation”, while the LABS model focuses on “hyperbola” and the equation of the hyperbola. For sample Two, the LAB model focuses on “geometric sequence” and “general term”, while the LABS model focuses on “sequence”. The attention mechanism thus plays its role: the network can use the semantic information of the knowledge points to focus on parts of the math text, which helps improve classification accuracy. Comparing the LAB and LABS models, LAB captures more keywords while LABS attends to relatively fewer, which indicates that the attention weights are affected by the introduction of multi-label smoothing.
5.3 Analysis of multilabel smoothing combining textual features
To analyze the impact of multi-label smoothing on math text classification, we first use the weights of the label simulation layer to calculate the simulated label distribution, and then compare it with the actual and predicted values. The comparisons are shown in Figure 5, where samples One and Two correspond to samples One and Two in Figure 4.
As Figure 5 shows, compared with the actual label distribution, the values of the simulated label distribution are no longer strictly 0 or 1. The label-text confusion distribution is obtained by measuring the similarity between text and labels, and is then used to adjust the actual label distribution, yielding the simulated distribution used for label smoothing. Figure 5 also shows that the predicted label distributions of both LBS and LABS are close to the actual ones, reflecting the strong predictive power of these models. Both LABS and LBS perform better than LAB or Basic, which again indicates that the role of multi-label smoothing is significant.
5.4 Comparison of model convergence speed
In this paper, the learning curves of the four models are studied, as shown in Figure 6. The number of iterations before the early stopping of the four models is shown in Figure 7.
The learning curve of the Basic model is flat, reflecting low learning efficiency and a tendency to overfit. The learning curves of the other three models are steep, reflecting high learning efficiency; these models avoid overfitting more effectively and more easily achieve strong generalization. The comparison of iteration counts shows that the Basic model converges slowly, requiring far more training iterations than the other three models. Introducing label-semantic attention or multi-label smoothing therefore accelerates convergence, improves learning efficiency, and effectively prevents overfitting.
5.5 Comparison of different preprocessing methods of mathematical formulas
In addition to formula parsing and embedding (FormulaE), we consider two other preprocessing methods: treating formulas as text (FormulaT) and dropping formulas (FormulaD). The effects of these three treatments are studied on the four models, which are evaluated by F1@2 as shown in Figure 8.
According to the results, the F1-scores of Basic and LAB both improve with FormulaT compared with FormulaE: Basic increases by 2.87% and LAB by 4.36%. However, FormulaT has little influence on LBS and LABS. With FormulaD, the F1-score decreases substantially in all four models: by 10.42% in Basic, 4.38% in LAB, 4.62% in LBS, and 6.55% in LABS.
The three preprocessing methods manifest directly in the average length of the input sequence. As shown in Figure 9, the average input sequence is longer after FormulaT, while the length distributions after FormulaD and FormulaE are similar. Detailed statistics are shown in Table 4: FormulaE yields about 6 more tokens than FormulaD, while FormulaT yields about 27 more.
Type  FormulaE  FormulaT  FormulaD 

Avg. words  47.19  74.59  47.46 
Avg. formulas  6.50  0  0 
Avg. length  53.69  74.59  47.46 
Finally, if formulas are ignored and dropped, the classification performance of the model deteriorates, so the preprocessing of special elements such as formulas is an indispensable part of math text representation. In addition, although FormulaT increases the input sequence length and brings some performance gains to the Basic and LAB models, it has no comparable effect on the LABS model proposed in this paper.
6 Conclusions and Future Work
In this paper, we proposed the LABS model, which builds on a BiLSTM by introducing label-semantic attention and multi-label smoothing combining textual features. Experimental results show that the LABS model achieves the best precision, recall, and F1-score. However, we do not quantitatively measure the relationships between knowledge points, and we only conducted experiments on a single dataset. Besides, math texts have few features and are easy to overfit. In the future, we will further study the hierarchical relationships of knowledge points, link mathematical entities, and expand to other datasets.
References
 (1) Somnath Banerjee, Krishnan Ramanathan, and Ajay Gupta. Clustering short texts using Wikipedia. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 787–788, 2007.
 (2) Xia Hu, Nan Sun, Chao Zhang, and Tat-Seng Chua. Exploiting internal and external semantics for the clustering of short texts using world knowledge. In Proceedings of the 18th ACM International Conference on Information and Knowledge Management, pages 919–928, 2009.
 (3) Zitao Liu, Wenchao Yu, Wei Chen, Shuran Wang, and Fengyi Wu. Short text feature selection for microblog mining. In Proceedings of the 2010 International Conference on Computational Intelligence and Software Engineering, pages 1–4, 2010.
 (4) Jindong Chen, Yizhou Hu, Jingping Liu, Yanghua Xiao, and Haiyun Jiang. Deep short text classification with knowledge powered attention. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pages 6252–6259, 2019.
 (5) Xinwei Zhang and Bin Wu. Short text classification based on feature extension using the n-gram model. In Proceedings of the 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages 710–716, 2015.
 (6) Xin Meng and Wanli Zuo. Short text expansion and classification based on word embedding. Journal of Chinese Computer Systems, 38(8):1712–1717, 2017.
 (7) Qian Zhang, Zhangmin Gao, and Jiayong Liu. Research of weibo short text classification based on word2vec. Netinfo Security, 17(1):57–62, 2017.
 (8) Xianming Li, Zongxi Li, Haoran Xie, and Qing Li. Merging statistical feature via adaptive gate for improved text classification. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, pages 13288–13296, 2021.
 (9) Lin Xiao, Xin Huang, Boli Chen, and Liping Jing. Label-specific document representation for multi-label text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 466–475, 2019.
 (10) Matthew R Boutell, Jiebo Luo, Xipeng Shen, and Christopher M Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.
 (11) Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, 2011.
 (12) Johannes Fürnkranz, Eyke Hüllermeier, Eneldo Loza Mencía, and Klaus Brinker. Multi-label classification via calibrated label ranking. Machine Learning, 73(2):133–153, 2008.
 (13) Grigorios Tsoumakas and Ioannis Vlahavas. Random k-labelsets: An ensemble method for multi-label classification. In Proceedings of the 18th European Conference on Machine Learning, pages 406–417, 2007.
 (14) Min-Ling Zhang and Zhi-Hua Zhou. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, 2007.
 (15) Amanda Clare and Ross D King. Knowledge discovery in multi-label phenotype data. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, pages 42–53, 2001.
 (16) Andre Elisseeff and Jason Weston. A kernel method for multi-labelled classification. In Proceedings of the 2001 Conference on Neural Information Processing Systems (NIPS), pages 681–687, 2001.
 (17) Andrew McCallum and Nadia Ghamrawi. Collective multi-label text classification. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management, pages 195–200, 2004.
 (18) Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang. SGM: Sequence generation model for multi-label classification. arXiv preprint arXiv:1806.04822, 2018.
 (19) Che-Ping Tsai and Hung-Yi Lee. Order-free learning alleviating exposure bias in multi-label classification. arXiv preprint arXiv:1909.03434, 2019.
 (20) Biyang Guo, Songqiao Han, Xiao Han, Hailiang Huang, and Ting Lu. Label confusion learning to enhance text classification models. arXiv preprint arXiv:2012.04987, 2020.
 (21) Behrooz Mansouri, Shaurya Rohatgi, Douglas W. Oard, Jian Wu, C. Lee Giles, and Richard Zanibbi. TangentCFT: An embedding model for mathematical formulas. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, pages 11–18, 2019.
 (22) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.