
Automatic tagging of knowledge points for K12 math problems

Automatic tagging of knowledge points for practice problems is the basis for managing question banks and improving the automation and intelligence of education, so it is of great practical significance to study automatic tagging technology for practice problems. However, there are few studies on the automatic tagging of knowledge points for math problems. Math texts have more complex structures and semantics than general texts because they contain unique elements such as symbols and formulas, so directly applying general-domain text classification techniques cannot meet the accuracy requirement of knowledge point prediction. In this paper, taking K12 math problems as the research object, we propose the LABS model, based on label-semantic attention and multi-label smoothing combining textual features, to improve the automatic tagging of knowledge points for math problems. The model combines general-domain text classification techniques with the unique features of math texts. The results show that the models using label-semantic attention or multi-label smoothing outperform the traditional BiLSTM model on precision, recall, and F1-score, while the LABS model using both performs best. This shows that label information can guide the neural network to extract meaningful information from the problem text, which improves the text classification performance of the model. Moreover, multi-label smoothing combining textual features can fully exploit the relationship between text and labels, improving the model's ability to generalize to new data and its classification accuracy.


1 Introduction

In recent years, with the integration of education and information technology, online education has boomed, and the number of online practice problems has surged. Efficiently organizing and managing these test resources and realizing intelligent processes such as question recommendation, rapid test paper assembly, and adaptive testing is becoming increasingly critical in this field. The automatic tagging of knowledge points of practice problems is the basis for managing question banks and improving the automation and intelligence of education. First, automatic tagging of knowledge points can assist or completely replace manual tagging, effectively reducing teachers’ workload and improving tagging efficiency. Second, automatic tagging can reduce the individual bias caused by subjective factors and improve tagging accuracy. Therefore, it is of great practical importance to study the automatic tagging of knowledge points.

As one of the primary subjects in K12 education, mathematics deserves particular research attention. Math tests assess students’ mastery of knowledge, and a math problem usually covers several different categories of knowledge points. As shown in Figure 1, the different knowledge points of a question form a hierarchical relationship. In addition, math texts are usually concise and contain various mathematical symbols with implicit logic and correlations. Moreover, mathematical language is rigorous: changing a single word may describe the opposite result. It is therefore difficult to apply traditional text classification methods directly to math texts while meeting the accuracy requirement. We thus need to solve the following issues:

  • Classification of texts containing mathematical formulas

  • Classification of multi-label texts whose labels have correlations

To address these issues, the LABS model, based on label-semantic attention and multi-label smoothing combining textual features, is proposed to improve the automatic tagging of knowledge points for math problems. The main contributions of this paper are as follows:

  • Mathematical objects such as formulas are treated as whole units, and their features are extracted with a neural network to enrich the semantic features of the text.

  • The novel LABS model is better at predicting the knowledge points of math problems compared with the traditional BiLSTM model.

  • A real-world open source dataset is established for the research. The dataset consists of high school math questions and corresponding knowledge points. The questions contain textual information and mathematical expressions.

Figure 1: (a) A math question and its knowledge points. (b) The hierarchical relationship of these knowledge points.

2 Related Work

The automatic tagging of knowledge points for math problems is essentially the multi-label classification of math texts. The length of a math text is generally within 200 characters, so math texts are short compared to general texts. The key to short text classification is overcoming the data sparsity that prevents traditional models from extracting enough semantic features. Some studies extend the original text with the help of external knowledge bases [banerjee2007clustering; hu2009exploiting; liu2010short; chen2019deep], while others directly use the original features of words, such as n-grams [zhang2015short] and word embeddings [Meng2017short; Zhang2017research], to extend the short text. Recently, Li et al. [li2021merging] investigated the effect of fusing statistical information of text with semantic features on short text classification. In addition, Xiao et al. [xiao2019label] used the relationship between labels and texts to build a label-specific document representation, which improved the classification performance of the model. However, these methods target texts in general domains and cannot be directly applied to classifying math texts.

There are two main approaches to multi-label classification: one converts multi-label classification tasks into binary or multi-class classification tasks [boutell2004learning; read2011classifier; furnkranz2008multilabel; tsoumakas2007random], and the other handles multi-label classification by extending specific classification algorithms [zhang2007ml; clare2001knowledge; elisseeff2001kernel; mccallum2004collective]. However, there are correlations among labels, and the difficulty of multi-label classification lies in handling such correlations. Some recent work [yang2018sgm; tsai2019order] converts multi-label classification into a sequence generation problem using the Seq2Seq model, which uses neural networks to learn label sequences from text sequences and thus learn the correlations among labels. However, this method requires prior knowledge of the label ordering and suffers from exposure bias, which hinders practical application. In addition, the label distribution itself implies the relationships among labels; Guo et al. [guo2020label] let the model learn a simulated label distribution to replace the one-hot representation of labels, which achieved good results on multi-class classification. However, there has been no corresponding study on multi-label text classification.

3 LABS Model

3.1 Problem

In this paper, the automatic tagging of knowledge points for math problems is formulated as a multi-label classification problem for text containing mathematical formulas, defined explicitly as follows.

Definition 1.

Let $X = \{x_1, x_2, \dots, x_n\}$ denote the sequence of a math question, where $n$ is the length of the sequence and each element $x_i$ is either a word or a mathematical expression. For each input sequence there is a corresponding knowledge point vector $Y = (y_1, y_2, \dots, y_L)$, $y_i \in \{0, 1\}$, as output, where $L$ is the total number of knowledge points. The problem is then described as learning a mapping $f: X \rightarrow Y$, i.e., given the questions $X$ and their corresponding knowledge points $Y$, the goal is to train a classifier $f$ that assigns the most relevant knowledge points to upcoming new questions.
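To make the formulation concrete, one training instance can be pictured as a (question, multi-hot label vector) pair. A minimal illustration in Python (the field names and the tiny label space are ours, for exposition only):

```python
# One DA-20k-style instance: a question mixing words and formulas,
# tagged with 2 of L knowledge points (here L = 5 for brevity).
instance = {
    "question": r"In the arithmetic sequence $\{a_n\}$, $a_1 + a_5 = 10$; find $a_3$.",
    # Multi-hot knowledge point vector y over a hypothetical label set, e.g.
    # [General Term, Properties of Arithmetic Sequence, Hyperbola, Sum of Terms, Probability]
    "labels": [1, 1, 0, 0, 0],
}
```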

3.2 Solution

To solve the problem above, we propose a model named LABS (Label Attention - Basic - Label Smoothing) in this paper. The architecture of the model is shown in Figure 2.

Figure 2: The architecture of the LABS model

The LABS model is designed to study the effects of label-semantic attention and multi-label smoothing combining textual features on knowledge point tagging. It consists of three main components:

  • Text representation of math questions (Basic module), described in detail in subsection 3.2.1.

  • Label-semantic attention (LA module), described in detail in subsection 3.2.2.

  • Multi-label smoothing combining textual features (LS module), described in detail in subsection 3.2.3.

3.2.1 Text representation of math questions

The Basic module provides the vector representation of a math question. For example, the math question shown in Figure 1 includes both general textual content and mathematical objects such as “$\{a_n\}$”. If the latter is treated like the former, the mathematical objects will be split like ordinary text, yielding sequences of tokens such as “{”, “an”, “}”. Such a representation destroys the implicit logic and relationships in mathematical formulas, and no useful information can be extracted from them. If the information in mathematical formulas cannot be fully utilized, it is even more difficult to effectively tag math problems, which are brief and concise. For this reason, the LABS model treats mathematical objects such as formulas as whole units, parses and embeds mathematical formulas based on the TangentCFT method [10.1145/3341981.3344235], and uses neural networks to extract features of the mathematical objects to improve the effectiveness of the classifier.
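As an illustration of “treating formulas as a whole”, one simple way to keep formulas intact during tokenization is to protect formula spans before word segmentation. The sketch below assumes $-delimited LaTeX formulas and whitespace word splitting; the paper itself goes further and embeds the parsed formulas with TangentCFT, which we do not reproduce here:

```python
import re

FORMULA = re.compile(r"\$[^$]+\$")  # assumption: formulas are $...$-delimited

def tokenize_math_text(text: str) -> list[str]:
    """Split a question into tokens, keeping each formula as a single token."""
    tokens, cursor = [], 0
    for match in FORMULA.finditer(text):
        tokens.extend(text[cursor:match.start()].split())  # ordinary words
        tokens.append(match.group())  # the whole formula is one token
        cursor = match.end()
    tokens.extend(text[cursor:].split())
    return tokens

print(tokenize_math_text(r"Given the sequence $\{a_n\}$, find $S_n$."))
# ['Given', 'the', 'sequence', '$\\{a_n\\}$', ',', 'find', '$S_n$', '.']
```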

The model uses a BiLSTM network to encode the context of the text in both the forward and backward directions, fusing the contextual semantics of both directions into the text representation. For the input sequence of the math question, the hidden state at time step $t$ is determined by both the input at that step and the hidden state at the previous step, as shown in Equation (1):

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(x_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(x_t, \overleftarrow{h}_{t+1}) \tag{1}$$

where $x_t$ is the word or formula vector of the input text at time step $t$, and $\overrightarrow{h}_t, \overleftarrow{h}_t \in \mathbb{R}^{u}$ are the forward and backward hidden vectors respectively ($u$ is the dimension of the hidden layer).

The final text encoding produced by the BiLSTM is shown in Equation (2), where $n$ is the length of the sequence:

$$\overrightarrow{H} = [\overrightarrow{h}_1, \dots, \overrightarrow{h}_n], \qquad \overleftarrow{H} = [\overleftarrow{h}_1, \dots, \overleftarrow{h}_n] \tag{2}$$
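A minimal PyTorch sketch of this encoder (dimensions follow Table 2; the module and variable names are ours, not the authors’ released code):

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Encode a tokenized math question with a BiLSTM (Equations 1-2)."""

    def __init__(self, num_tokens=72904, embed_dim=300, hidden_dim=512):
        super().__init__()
        # One embedding table for word tokens and whole-formula tokens alike.
        self.embedding = nn.Embedding(num_tokens, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> embedded: (batch, seq_len, embed_dim)
        embedded = self.embedding(token_ids)
        # outputs: (batch, seq_len, 2 * hidden_dim), both directions concatenated
        outputs, _ = self.bilstm(embedded)
        # Split back into the forward and backward halves of Eq. (2).
        h_fwd, h_bwd = outputs.chunk(2, dim=-1)
        return h_fwd, h_bwd

encoder = QuestionEncoder()
h_fwd, h_bwd = encoder(torch.randint(0, 72904, (2, 120)))  # batch of 2, max length 120
```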

3.2.2 Label-semantic attention

The LA module is responsible for learning the importance weights of words and formulas in the math text using label-semantic attention. The knowledge points of the problem have specific semantics and correspond to the math text, while conventional attention mechanisms rarely use label information to guide the classification. Therefore, label-semantic attention is proposed, which uses the semantic information of the knowledge points to learn the importance weights of words and formulas in the math text and guide the neural network model to extract meaningful information from the math text, thus enhancing text representation.

The label matrix $C \in \mathbb{R}^{L \times u}$ ($L$ is the number of labels) is treated as a trainable matrix whose weights are dynamically updated during training. In this paper, the similarity between labels and text is calculated by the dot product of their vectors and then passed through the sigmoid function, as shown in Equation (3):

$$A^{(f)} = \sigma\!\left(\overrightarrow{H} C^{\top}\right), \qquad A^{(b)} = \sigma\!\left(\overleftarrow{H} C^{\top}\right) \tag{3}$$

where $A^{(f)}, A^{(b)} \in \mathbb{R}^{n \times L}$ are the similarities between the forward text and the labels, and between the backward text and the labels, respectively.

Eventually, the text representation enhanced by label-semantic attention is obtained as shown in Equation (4), $V \in \mathbb{R}^{L \times 2u}$:

$$V = \left[A^{(f)\top} \overrightarrow{H};\; A^{(b)\top} \overleftarrow{H}\right] \tag{4}$$
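A PyTorch sketch of this attention step, continuing the encoder above (the weighted-sum form of Eq. (4) follows the prose description and LSAN-style label attention; treat it as our reading rather than the authors’ exact code):

```python
class LabelSemanticAttention(nn.Module):
    """Weight words/formulas by their similarity to label embeddings (Eqs. 3-4)."""

    def __init__(self, num_labels=427, hidden_dim=512):
        super().__init__()
        # Trainable label matrix C, updated along with the rest of the network.
        self.label_matrix = nn.Parameter(torch.randn(num_labels, hidden_dim))

    def forward(self, h_fwd, h_bwd):
        # Eq. (3): dot-product similarity between every token and every label.
        a_fwd = torch.sigmoid(h_fwd @ self.label_matrix.T)  # (batch, n, L)
        a_bwd = torch.sigmoid(h_bwd @ self.label_matrix.T)
        # Eq. (4): label-specific text representation via attention-weighted sums.
        v_fwd = a_fwd.transpose(1, 2) @ h_fwd  # (batch, L, hidden_dim)
        v_bwd = a_bwd.transpose(1, 2) @ h_bwd
        return torch.cat([v_fwd, v_bwd], dim=-1)  # (batch, L, 2 * hidden_dim)
```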

3.2.3 Multi-label smoothing combining textual features

The LS module is responsible for constructing soft labels and optimizing the classifier using the relationship between the math question and knowledge points. In this paper, the knowledge points of the question are regarded as a trainable matrix. Textual features are integrated into the label distribution using the relationship between math text and knowledge points. Then multi-label smoothing is applied to the classifier, and the loss function is modified to improve the classification effect of the model.

In this paper, the sigmoid activation function is used in the last layer to obtain the multi-label prediction for the input text, as shown in Equation (5):

$$y^{(p)} = \sigma\!\left(W v + b\right) \tag{5}$$

where $v$ is the text vector, $W$ and $b$ are the weight matrix and bias of the output layer, and $y^{(p)}$ is the predicted label distribution.

Usually, to avoid the overfitting and overconfidence caused by multi-hot encoded label vectors, label smoothing (LS) is used as a regularization technique. However, the traditional LS method only adds random noise in each dimension without considering the correlation between labels, so its improvement of the model is limited. For this reason, based on the LCM (Label Confusion Model) proposed by Guo et al. [guo2020label], we incorporate textual features into the label representation and learn, in real time, a better label distribution than the multi-hot one for multi-label smoothing (MLS), further improving the classification of the model.

MLS treats the label matrix $C$ as a trainable matrix and calculates a confusion distribution reflecting the similarity between the text and the labels according to Equation (6):

$$y^{(c)} = \mathrm{softmax}\!\left(C v\right) \tag{6}$$

This confusion distribution is then used to adjust the original multi-hot encoding representation, as shown in Equation (7):

$$y^{(s)} = \mathrm{softmax}\!\left(\alpha\, y^{(t)} + y^{(c)}\right) \tag{7}$$

where $y^{(t)}$ is the multi-hot encoding, which is combined with $y^{(c)}$ and then normalized, with the hyperparameter $\alpha$ controlling the mixture, to obtain the final simulated label distribution $y^{(s)}$.

The Kullback–Leibler divergence is then used to calculate the loss function, as shown in Equation (8):

$$\mathcal{L} = \mathrm{KL}\!\left(y^{(s)} \,\Big\|\, y^{(p)}\right) = \sum_{i=1}^{L} y^{(s)}_i \log \frac{y^{(s)}_i}{y^{(p)}_i} \tag{8}$$
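Continuing in the same style, a sketch of the LS module as a loss term (it transcribes Eqs. (6)–(8), mirroring the LCM formulation of Guo et al. that the paper builds on; the small epsilon and applying the KL term to the sigmoid outputs directly are our implementation choices):

```python
import torch.nn.functional as F

class MultiLabelSmoothingLoss(nn.Module):
    """Build the simulated label distribution (Eqs. 6-7) and the KL loss (Eq. 8)."""

    def __init__(self, label_matrix, alpha=4.0):
        super().__init__()
        self.label_matrix = label_matrix  # shared with the attention module
        self.alpha = alpha

    def forward(self, text_vec, y_pred, y_true):
        # Eq. (6): label confusion distribution from text-label similarity.
        y_conf = F.softmax(text_vec @ self.label_matrix.T, dim=-1)
        # Eq. (7): mix the multi-hot targets with the confusion distribution.
        y_sim = F.softmax(self.alpha * y_true + y_conf, dim=-1)
        # Eq. (8): KL divergence between simulated and predicted distributions.
        return F.kl_div(torch.log(y_pred + 1e-12), y_sim, reduction="batchmean")
```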

3.2.4 Evaluation metrics

In this paper, Precision@k, Recall@k and F1@k are used to measure the correlation between the predicted and actual values. Precision@k quantifies the proportion of correct labels among the first k predicted labels; it takes values in [0, 1], the larger the better:

$$\mathrm{Precision@}k = \frac{\left|\hat{Y}_k \cap Y\right|}{k} \tag{9}$$

Recall@k gives the proportion of the actually correct labels that appear among the first k predicted labels; it takes values in [0, 1], the larger the better:

$$\mathrm{Recall@}k = \frac{\left|\hat{Y}_k \cap Y\right|}{|Y|} \tag{10}$$

F1@k is defined as the harmonic mean of Precision@k and Recall@k; it takes values in [0, 1], the larger the better:

$$\mathrm{F1@}k = \frac{2 \cdot \mathrm{Precision@}k \cdot \mathrm{Recall@}k}{\mathrm{Precision@}k + \mathrm{Recall@}k} \tag{11}$$

where $\hat{Y}_k$ denotes the set of the top-k predicted labels and $Y$ the set of actual labels.
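These metrics are easy to compute directly from ranked prediction scores; a self-contained sketch:

```python
def precision_recall_f1_at_k(scores, true_labels, k):
    """Compute Precision@k, Recall@k, F1@k (Eqs. 9-11) for one question.

    scores: per-label prediction scores; true_labels: set of correct label ids.
    """
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    hits = len(set(top_k) & true_labels)
    precision = hits / k
    recall = hits / len(true_labels)
    f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g., a question with true knowledge points {0, 3} out of 5 labels:
print(precision_recall_f1_at_k([0.9, 0.1, 0.2, 0.8, 0.3], {0, 3}, k=2))
# (1.0, 1.0, 1.0)
```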

4 Experimental Setting

4.1 Data

Since the problem solved by the LABS model is novel, no suitable public benchmark dataset is currently available. We therefore establish a real-world Chinese dataset for this research, which is open source (https://anonymous.4open.science/r/mathdata-D26B). The dataset, named DA-20k, was collected from an online question bank (http://tiku.zujuan.com/). It consists of high school math practice problems, including text and mathematical expressions, together with the corresponding knowledge points. The statistical details of DA-20k are shown in Table 1.

Questions Labels Avg. Chars Avg. Words Avg. Formulas Avg. Labels
22498 427 68.32 47.19 6.50 1.89
Table 1: Details of the experimental dataset

4.2 Models and parameters

Four models are tested on the dataset DA-20k:

  1. Basic, including only the Basic module, which serves as the control group.

  2. LAB, including the Basic and LA modules, used to study the role of label-semantic attention in math text classification.

  3. LBS, including the Basic and LS modules, used to study the impact of multi-label smoothing combining textual features.

  4. LABS, including the Basic, LA, and LS modules, used to study the combined effect of the two factors.

The four models are all tested under the same conditions except for the different modules used, and the parameters used in the experiment are shown in Table 2.

Parameter Value
Number of tokens 72904
Number of labels 427
Max sequence length 120
Vector dimension 300
Hidden layer size 512
Batch size 512
Optimizer Adam [kingma2014adam]
Learning rate 0.001
Hyperparameter (α) 4 [guo2020label]
Table 2: Experimental parameters
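For reference, the sketched modules can be wired together with the Table 2 settings roughly as follows (illustrative only; in particular, the paper does not spell out how the text vector for the LS module is pooled, so the mean pooling below is an assumption):

```python
encoder = QuestionEncoder(num_tokens=72904, embed_dim=300, hidden_dim=512)
attention = LabelSemanticAttention(num_labels=427, hidden_dim=512)
classifier = nn.Linear(2 * 512, 1)  # per-label score, Eq. (5)
mls_loss = MultiLabelSmoothingLoss(attention.label_matrix, alpha=4.0)

params = (list(encoder.parameters()) + list(attention.parameters())
          + list(classifier.parameters()))
optimizer = torch.optim.Adam(params, lr=0.001)

def train_step(token_ids, y_true):
    optimizer.zero_grad()
    h_fwd, h_bwd = encoder(token_ids)                            # Eqs. (1)-(2)
    label_repr = attention(h_fwd, h_bwd)                         # Eqs. (3)-(4)
    y_pred = torch.sigmoid(classifier(label_repr)).squeeze(-1)   # Eq. (5)
    # Pooled text vector v for the LS module; mean pooling is our assumption.
    text_vec = h_fwd.mean(dim=1)
    loss = mls_loss(text_vec, y_pred, y_true.float())            # Eqs. (6)-(8)
    loss.backward()
    optimizer.step()
    return loss.item()
```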

5 Experimental Results

5.1 Analysis of overall experiments

The experimental results of the four models are shown in Table 3 and Figure 3, where the optimal values are rendered in bold.

Evaluation Basic LAB LBS LABS
Precision@1 52.14% 54.34% 57.84% 61.61%
Precision@2 40.82% 42.35% 45.14% 48.05%
Precision@3 33.33% 34.21% 36.58% 38.53%
Recall@1 32.18% 33.85% 36.36% 38.89%
Recall@2 47.89% 49.86% 53.70% 57.10%
Recall@3 57.19% 59.01% 63.26% 66.55%
F1@1 39.65% 41.57% 44.51% 47.56%
F1@2 43.95% 45.67% 48.92% 52.07%
F1@3 42.01% 43.20% 46.26% 48.71%
Table 3: Comparison of the four models in terms of Precision@k, Recall@k and F1@k (k = 1,2,3).
Figure 3: Statistical charts of the four models in terms of Precision@k, Recall@k and F1@k (k = 1,2,3).

In terms of Precision@k, the precision ranking of the four models is LABS > LBS > LAB > Basic for the same k. This shows that the precision of the model's predictions improves after label-semantic attention and multi-label smoothing are introduced. Taking Precision@1 as an example, LAB improves by 4.22% over Basic, while LBS improves by 10.93%, so the effect of multi-label smoothing is more significant. Furthermore, the LABS model displays the highest performance, improving by 18.16% over Basic, 13.38% over LAB, and 6.52% over LBS. As the value of k increases, the precision of all four models decreases, indicating that the model is most likely to be correct on its first predicted knowledge point; the number of correct labels rarely exceeds the average number of knowledge points per question, i.e., 1.89.

In terms of Recall@k, the recall ranking of the four models is LABS > LBS > LAB > Basic for the same k. This shows that the models introducing label-semantic attention or multi-label smoothing achieve higher recall under the same conditions. Taking Recall@3 as an example, LAB improves by 3.18% over Basic and LBS by 10.61%, so the effect of multi-label smoothing is more significant. Furthermore, the LABS model using both achieves the highest performance, improving by 16.37% over Basic, 12.78% over LAB, and 5.20% over LBS. As the value of k increases, the recall of all four models increases, which indicates that the more knowledge points the model predicts, the more hits it makes.

In terms of F1@k, the F1-score ranking of the four models is LABS > LBS > LAB > Basic for the same k. Taking F1@2 as an example, LAB increases by 3.91% over Basic and LBS by 11.31%, so the effect of multi-label smoothing is more significant. Furthermore, the LABS model using both achieves the highest performance, improving by 18.48% over Basic, 14.01% over LAB, and 6.44% over LBS. The F1-score is the harmonic mean of precision and recall; the higher its value, the better the model performs. Thus, LABS is the best model, followed by LBS and LAB, with Basic the worst. It can be seen that the combined effect of label-semantic attention and multi-label smoothing combining textual features has the greatest impact on the automatic tagging of knowledge points.

5.2 Distribution of label-semantic attention

In order to better analyze the role of label-semantic attention on the automatic tagging of math problems, we visualize the attention weights on the original math text using heatmaps. The experimental results are shown in Figure 4 where sample One comes from the dataset, and sample Two is test data shown in Figure 1.

Figure 4: Label attention visualization of math practice problems. (a) Sample One has labels of Simple Properties of Hyperbolas and Standard Equation for Hyperbola. (b) Sample Two has labels of Properties of Arithmetic Sequence, Properties of Geometric Sequence, General Term of an Arithmetic Sequence and Sum of the First n Terms of an Arithmetic Sequence.

As can be seen from the attention distribution, for every knowledge point predicted by the model, the related words or formulas are given a higher attention weight. For sample One, the LAB model focuses on words and formulas such as “hyperbola” and “equation”, while the LABS model focuses on “hyperbola” and the equation of a hyperbola. For sample Two, the LAB model focuses on “geometric sequence” and “general term”, while the LABS model focuses on “sequence”. It can be seen that the attention mechanism works as intended: the network can use the semantic information of the knowledge points to focus on particular parts of the math text, which helps improve the model’s classification accuracy. Comparing the LAB and LABS models, the LAB model captures more keywords while the LABS model attends to fewer, which indicates that the attention weights are affected by the introduction of multi-label smoothing.

5.3 Analysis of multi-label smoothing combining textual features

To analyze the impact of multi-label smoothing on math text classification, we first use the weights of the label simulation layer to calculate the simulated label distribution and then compare it with the actual and predicted values. The comparisons are shown in Figure 5; samples One and Two correspond to samples One and Two in Figure 4.

As can be seen from Figure 5, compared with the actual label distribution, the values of the simulated label distribution are no longer restricted to 0 or 1. The label–text confusion distribution is obtained by measuring the similarity between text and labels, and is then used to adjust the actual label distribution, yielding the simulated distribution used for label smoothing. Figure 5 also shows that the predicted label distributions of both LBS and LABS are close to the actual ones, reflecting the strong predictive power of the models. Consistent with this analysis, both LABS and LBS perform better than either LAB or Basic, which indicates that the role of multi-label smoothing is more significant.

Figure 5: The simulated label distribution (SLD) and predicted label distribution (PLD) of two samples on the LBS and LABS models; the vertical dotted line stands for the actual label distribution.

5.4 Comparison of model convergence speed

In this paper, the learning curves of the four models are studied, as shown in Figure 6. The number of iterations before the early stopping of the four models is shown in Figure 7.

Figure 6: Learning curve of models
Figure 7: The result of iteration times of models

From the learning curves, we can see that the learning curve of the Basic model is flat, reflecting low learning efficiency and a tendency to overfit. The learning curves of the other three models are steep, reflecting high learning efficiency; these models can effectively avoid overfitting and more easily achieve strong generalization ability. According to the comparison of iterations, the Basic model converges slowly, and the number of iterations required for training is much higher than that of the other three models. It can be seen that introducing label-semantic attention or multi-label smoothing can accelerate model convergence, improve learning efficiency, and effectively prevent model overfitting.

5.5 Comparison of different preprocessing methods of mathematical formulas

In addition to formula parsing and embedding (Formula-E), we also consider two other preprocessing methods: treating formulas as text (Formula-T) and dropping formulas (Formula-D). The effects of these three treatments are studied on the four models, which are evaluated with F1@2 as shown in Figure 8.

According to the results, the F1-scores of Basic and LAB both improve under Formula-T compared with Formula-E: Basic increases by 2.87% and LAB by 4.36%. However, Formula-T has little influence on LBS and LABS. With Formula-D, the F1-score decreases substantially in all four models: by 10.42% for Basic, 4.38% for LAB, 4.62% for LBS, and 6.55% for LABS.

Figure 8: The results of three formula preprocessing methods on four models in terms of F1@2
Figure 9: The distribution of input sequence length with three formula preprocessing methods

The three formula preprocessing methods manifest directly in the average length of the input sequence. As shown in Figure 9, the average input sequence is longer under Formula-T, while the length distributions under Formula-D and Formula-E are similar. Detailed statistics are shown in Table 4: Formula-E yields about 6 more tokens on average than Formula-D, while Formula-T yields about 27 more.

Type Formula-E Formula-T Formula-D
Avg. words 47.19 74.59 47.46
Avg. formulas 6.50 0 0
Avg. length 53.69 74.59 47.46
Table 4: The details of dataset with three formula preprocessing methods

Finally, it can be seen that if formulas are ignored and dropped, the classification performance of the model degrades, so the preprocessing of special elements such as formulas is an indispensable part of math text representation. In addition, although Formula-T increases the input sequence length and brings some performance gains to the Basic and LAB models, it has no corresponding effect on the LABS model proposed in this paper.

6 Conclusions and Future Work

In this paper, we proposed the LABS model, which extends the BiLSTM model with label-semantic attention and multi-label smoothing combining textual features. Experimental results show that the LABS model achieves the best precision, recall, and F1-score. However, we do not quantitatively measure the relationships between knowledge points, and we conducted experiments on only a single dataset. Besides, math texts have few features, so models are prone to overfitting. In the future, we will further study the hierarchical relationships of knowledge points, link mathematical entities, and expand to other datasets.

References

  • (1) Somnath Banerjee, Krishnan Ramanathan, and Ajay Gupta. Clustering short texts using wikipedia. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 787–788, 2007.
  • (2) Xia Hu, Nan Sun, Chao Zhang, and Tat-Seng Chua. Exploiting internal and external semantics for the clustering of short texts using world knowledge. In Proceedings of the 18th ACM International Conference on Information and Knowledge Management, pages 919–928, 2009.
  • (3) Zitao Liu, Wenchao Yu, Wei Chen, Shuran Wang, and Fengyi Wu. Short text feature selection for micro-blog mining. In Proceedings of the 2010 International Conference on Computational Intelligence and Software Engineering, pages 1–4, 2010.
  • (4) Jindong Chen, Yizhou Hu, Jingping Liu, Yanghua Xiao, and Haiyun Jiang. Deep short text classification with knowledge powered attention. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pages 6252–6259, 2019.
  • (5) Xinwei Zhang and Bin Wu. Short text classification based on feature extension using the n-gram model. In Proceedings of the 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages 710–716, 2015.
  • (6) Xin Meng and Wanli Zuo. Short text expansion and classification based on word embedding. Journal of Chinese Computer Systems, 38(8):1712–1717, 2017.
  • (7) Qian Zhang, Zhangmin Gao, and Jiayong Liu. Research of weibo short text classification based on word2vec. Netinfo Security, 17(1):57–62, 2017.
  • (8) Xianming Li, Zongxi Li, Haoran Xie, and Qing Li. Merging statistical feature via adaptive gate for improved text classification. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, pages 13288–13296, 2021.
  • (9) Lin Xiao, Xin Huang, Boli Chen, and Liping Jing. Label-specific document representation for multi-label text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 466–475, 2019.
  • (10) Matthew R Boutell, Jiebo Luo, Xipeng Shen, and Christopher M Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.
  • (11) Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, 2011.
  • (12) Johannes Fürnkranz, Eyke Hüllermeier, Eneldo Loza Mencía, and Klaus Brinker. Multilabel classification via calibrated label ranking. Machine Learning, 73(2):133–153, 2008.
  • (13) Grigorios Tsoumakas and Ioannis Vlahavas. Random k-labelsets: An ensemble method for multilabel classification. In Proceedings of the 18th European Conference on Machine Learning, pages 406–417, 2007.
  • (14) Min-Ling Zhang and Zhi-Hua Zhou. Ml-knn: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, 2007.
  • (15) Amanda Clare and Ross D King. Knowledge discovery in multi-label phenotype data. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, pages 42–53, 2001.
  • (16) Andre Elisseeff and Jason Weston. A kernel method for multi-labelled classification. In Proceedings of the 2001 Conference on Neural Information Processing Systems (NIPS), pages 681–687, 2001.
  • (17) Andrew McCallum and Nadia Ghamrawi. Collective multi-label text classification. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management, pages 195–200, 2004.
  • (18) Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang. Sgm: Sequence generation model for multi-label classification. arXiv preprint arXiv:1806.04822, 2018.
  • (19) Che-Ping Tsai and Hung-Yi Lee. Order-free learning alleviating exposure bias in multi-label classification. arXiv preprint arXiv:1909.03434, 2019.
  • (20) Biyang Guo, Songqiao Han, Xiao Han, Hailiang Huang, and Ting Lu. Label confusion learning to enhance text classification models. arXiv preprint arXiv:2012.04987, 2020.
  • (21) Behrooz Mansouri, Shaurya Rohatgi, Douglas W. Oard, Jian Wu, C. Lee Giles, and Richard Zanibbi. Tangent-cft: An embedding model for mathematical formulas. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, pages 11–18, 2019.
  • (22) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.