Natural language processing has gained continuous attention in recent years, not only for academic research purposes but also for real-world use cases in various industrial sectors. Advanced neural architectures achieve significant improvements on difficult language understanding problems, thus enabling various applications such as named-entity recognition, semantic role labeling, sentiment analysis or opinion mining [17, 15], machine translation, etc.
Thanks to recent advances in language understanding driven by deep learning, a large number of machine learning projects have turned from academic research outcomes into industrial products. For instance, neural machine translation systems now deliver very high-quality translations approaching human-level accuracy; well-trained neural models have been offered to business users as prediction services on the cloud to perform difficult tasks such as topic extraction and sentiment analysis. These technological advances have created demand for automatic language analysis in marketing, financial institutions, and other sectors.
However, beyond typical applications that can share a trained model to perform universal tasks such as speech-to-text or translation, most applications require training a custom model with company-owned data and rely strongly on domain-specific knowledge. For example, an insurance company might be interested in investigating customer comments on topics related to the processing time of insurance claims, while for an e-commerce website, reviews about the quality of goods are more worth diving into. It would be challenging to analyse these problems without capturing the specificities of the company's data and training a custom model on them.
In addition, most advanced neural network architectures require a considerable amount of labeled data for training, and labeling is often tedious and time-consuming. The challenge of labeling data can significantly slow down the development of machine-learning-enabled projects for companies.
In this paper, we address customer review understanding problems within a deep learning framework, with a particular focus on two methods that can accelerate training: adopting a pretrained model and active learning (in Sec. 2). The preliminary results (in Sec. 3) show that the iterative process of training robust neural network models can be significantly shortened, thus saving a large amount of cost. We conclude and outline further directions at the end.
2 Model architecture
2.1 Recurrent neural network with pretrained embedding
Recurrent neural networks have been widely used for sequence-formed data, owing to their capability of taking inputs of various lengths and learning the dependencies among elements at different positions. The applications of recurrent neural networks range from time series analysis and speech recognition to natural language processing, and they continue to gain attention in different research domains.
In the field of natural language processing, one key component is the embedding, or so-called language representation. The objective is to map each word to a fixed-length vector, which can later be processed and fed into classifiers. A straightforward idea is one-hot encoding, which represents each word as a sparse vector over the dictionary, with a one indicating its position among all possible words. The distribution of all the words inside a sentence can be directly used to represent that sentence; such an approach is often referred to as bag-of-words. This architecture has been improved by training a matrix projection over the words. Recent successes focus on embeddings from pretrained language models, which allow a semantic representation of the words or tokens in a sentence; the aggregation of the embedded vectors often leads to better results than the one-hot word distribution.
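The bag-of-words idea above can be sketched in a few lines of Python; the vocabulary and example sentence below are hypothetical illustrations, not data from this paper:

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Represent a sentence as word counts over a fixed vocabulary.

    Each word is conceptually a one-hot vector over the vocabulary;
    summing those vectors yields the sentence's word-count distribution.
    """
    counts = Counter(sentence.lower().split())
    return [counts.get(word, 0) for word in vocabulary]

# Hypothetical example: a tiny vocabulary and one customer-review sentence.
vocab = ["the", "claim", "was", "processed", "slowly", "quickly"]
vec = bag_of_words("The claim was processed slowly", vocab)
# vec -> [1, 1, 1, 1, 1, 0]
```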
Pretrained models such as BERT and ELMo have shown very promising results on various tasks. With access to the virtually unlimited text written on the Internet, these models can capture the semantic meaning of the language without being dedicated to specific tasks. The outputs of a pretrained model can later be used as preliminary inputs for various tasks, including classification, named entity recognition, question answering, and language common-sense inference.
We adopt BERT as our pretrained embedding module. The outputs of the BERT embedding on the tokens (words) are then fed into a recurrent neural network; we use long short-term memory (LSTM) cells here to take into account the long-range dependencies among words. The output of the last LSTM cell is coupled with fully connected layers, where the final layer uses a sigmoid activation function that allows multi-label outputs. Unlike a softmax layer, which normalizes the output into a probability distribution over classes (commonly used for single-label classification problems), the sigmoid-activated layer outputs an unnormalized probability for each class, with output values close to 1 indicating a high probability that the underlying sentence belongs to those classes. The multi-label loss function of our recurrent network is computed as the sum of the binary cross-entropy between the prediction and the true label for each class:

$$\mathcal{L} = -\sum_{c=1}^{C} \left[ y_c \log \hat{y}_c + (1 - y_c) \log (1 - \hat{y}_c) \right] \qquad (1)$$
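As a concrete illustration, the multi-label binary cross-entropy loss described above can be computed with a minimal pure-Python sketch (the label and prediction vectors below are hypothetical; this is not the paper's TensorFlow implementation):

```python
import math

def multilabel_bce(y_true, y_pred, eps=1e-7):
    """Sum over classes of the binary cross-entropy between a binary
    label vector and the sigmoid outputs of the network."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip for numerical stability
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total

# Hypothetical 3-class example: the sentence carries classes 1 and 3.
loss = multilabel_bce([1, 0, 1], [0.9, 0.2, 0.8])
```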
2.2 Active learning strategy
In the supervised learning framework, a large amount of labeled data is often required to achieve excellent performance, and this is especially true when training a neural network. However, labeling data can be time-consuming and increases the cost of machine learning projects.
In the conventional data collection process (as illustrated in Fig. 2), human labeling tasks are conducted in a random fashion. The experts use their domain knowledge to label data samples selected at random from the database and provide the labeled data for training. Further improving performance often requires a more considerable amount of labeled data, typically obtained through additional batches of randomly selected data to label. Such an iterative process is highly inefficient, as the continuous learning process is cut into two disconnected groups of subtasks with no communication between them.
In contrast to the conventional data selection method, active learning offers an alternative strategy for collecting supervised training data, as illustrated in Fig. 2. Active learning strategies choose the samples that need to be labeled, with the aim of maximizing the machine learning algorithm's performance with respect to each incremental labeled dataset. These strategies include Least Confidence, Bayesian Active Learning by Disagreement, core-set selection, etc., where all proposed strategies can be defined within a common framework: train the model with the existing labeled data, use the trained model to select (under the proposed measurement) candidates from a pool of unlabeled data, label the selected candidate data points, and train a new model with the augmented training dataset, as illustrated in Fig. 2.
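The common framework just described can be sketched as a generic pool-based loop. Here `train`, `select`, and `label` are hypothetical stand-ins for the model trainer, the acquisition strategy, and the human annotator, respectively:

```python
def active_learning_loop(train, select, label, labeled, unlabeled,
                         rounds, batch_size):
    """Generic pool-based active learning loop:
    train -> select candidates -> label them -> retrain."""
    model = train(labeled)
    for _ in range(rounds):
        # Ask the strategy for the most informative unlabeled points.
        candidates = select(model, unlabeled, batch_size)
        for x in candidates:
            labeled.append((x, label(x)))  # human expert provides the label
            unlabeled.remove(x)
        model = train(labeled)  # retrain on the augmented dataset
    return model, labeled
```

Any acquisition strategy (least confidence, BALD, core-set, ...) slots in as the `select` argument without changing the loop itself.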
In this paper, in order to show the effectiveness of adopting the active learning framework, we use one straightforward uncertainty-based strategy for the multi-label classification case. The uncertainty score is measured by:

$$x^{*} = \operatorname*{arg\,min}_{x \in \mathcal{U}} \, \max_{c} \hat{p}_c(x) \qquad (2)$$
where we choose the unlabeled data with the lowest predicted probabilities among all classes. Intuitively speaking, the model can improve itself by seeing more diverse samples that are not semantically similar to the training set or whose labels it cannot confidently predict. In such a setting, the model selects the data instances that are difficult to assign to any class (high uncertainty) for interactive labeling by human experts.
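A minimal sketch of this uncertainty-based selection, assuming a hypothetical `predict` function that returns the per-class sigmoid outputs for a data point:

```python
def least_confident(predict, unlabeled, batch_size):
    """Select the unlabeled points whose highest per-class sigmoid
    output is lowest, i.e. the points the model cannot confidently
    assign to any class."""
    scored = [(max(predict(x)), x) for x in unlabeled]
    scored.sort(key=lambda pair: pair[0])  # most uncertain first
    return [x for _, x in scored[:batch_size]]
```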
3 Experiments and discussions
We evaluate the proposed method by using a real-world customer review dataset, with 6929 instances for training and 1456 instances for evaluation. The evaluations are conducted on two separate multi-label classification tasks: aspects categorisation (13 classes) and sentiment analysis (2 classes), as shown in Tab. 1.
All three settings are connected to a recurrent neural network (two layers of LSTM) and a fully connected layer with a sigmoid activation function for multi-label outputs, implemented with the TensorFlow library in Python.
We report micro f1 scores under different training sizes in Fig. 3. The micro f1 score is calculated in the multi-label situation using the micro precision and micro recall summed over all classes, shown as follows:

$$P_{\text{micro}} = \frac{\sum_{c} TP_c}{\sum_{c} (TP_c + FP_c)}, \quad R_{\text{micro}} = \frac{\sum_{c} TP_c}{\sum_{c} (TP_c + FN_c)}, \quad F1_{\text{micro}} = \frac{2\, P_{\text{micro}}\, R_{\text{micro}}}{P_{\text{micro}} + R_{\text{micro}}}$$
Such a choice of measures follows the standard in the literature for multi-label classification, especially in the case of highly unbalanced classes. We report the comparison in Fig. 3 using the average micro f1 score over three independent experiments.
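For reference, the micro-averaged f1 computation can be sketched in plain Python, pooling true positives, false positives, and false negatives over all classes before computing precision and recall (binary indicator matrices assumed; this is an illustration, not the paper's evaluation code):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over a batch of multi-label predictions.

    y_true, y_pred: lists of equal-length 0/1 indicator rows,
    one row per instance, one column per class.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += t and p          # predicted 1, truly 1
            fp += (not t) and p    # predicted 1, truly 0
            fn += t and (not p)    # predicted 0, truly 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```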
As we can see in Fig. 3, in the case of sentiment classification, the network with CNN embedding yields the worst performance across all training sample sizes, due to the lack of effective language representations when training the embedding with largely insufficient data.
Such a gap between the self-trained embedding and the pretrained BERT embedding is even larger on the more challenging multi-label aspects categorisation task.
When comparing the data selection frameworks, we can see in both tasks that active learning selection performs better. In other words, to achieve the same performance, an active learning framework needs much less labeled data.
Please note that the reported f1 scores with active learning are based on one straightforward selection strategy, as in Eq. (2). Judging from the empirical results in the active learning literature, we firmly believe that the reported results can be further improved by using more sophisticated selection strategies.
In this paper, we introduce two strategies for boosting the performance of real-world natural language processing applications: 1. a pretrained language model that allows extracting essential features of texts without the extra effort of collecting a large amount of training data; 2. an active learning strategy that can smartly select the samples that need to be labeled. By comparing the performance of basic recurrent neural networks with ones combined with a pretrained embedding model and an active learning framework, we observe a significant improvement. Such a combined approach can achieve the same accuracy using a significantly smaller amount of labeled data, thus providing cost-effective solutions for companies' self-promoted natural language processing projects. In the future, we would like to investigate more sophisticated active learning strategies, in order to further improve the results with a constant training size.
The authors would like to thank Isabelle DUPUIS, Nathalie CHANSON, Axelle LETERTRE, and Isabelle ROMANO from “Voice of the Customer” at GMF ASSURANCES for their expertise in customer review analysis and for providing the labeled data used in this paper.
-  (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §3.
-  (2018) Modelling customer online behaviours with neural networks: applications to conversion prediction and advertising retargeting. arXiv preprint arXiv:1804.07669. Cited by: §2.1, §3.
-  (2005) Reducing labeling effort for structured prediction tasks. In AAAI, Vol. 5, pp. 746–751. Cited by: §2.2.
-  (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.1, §2.1, §2.1.
-  (2016) A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, pp. 1019–1027. Cited by: §2.2.
-  (2017) Deep semantic role labeling: what works and what’s next. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 473–483. Cited by: §1.
-  (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.1.
-  (2016) Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759. Cited by: §2.1.
-  (2016) Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360. Cited by: §1.
-  (2019) LSTM based similarity measurement with spectral clustering for speaker diarization. arXiv preprint arXiv:1907.10393. Cited by: §2.1.
-  (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §2.1.
-  (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §2.1.
-  (2016) Semeval-2016 task 5: aspect based sentiment analysis. In Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016), pp. 19–30. Cited by: §3.
-  (2017) Active learning for convolutional neural networks: a core-set approach. arXiv preprint arXiv:1708.00489. Cited by: §2.2, §3.
-  (2016) Attention-based lstm for aspect-level sentiment classification. In Proceedings of the 2016 conference on empirical methods in natural language processing, pp. 606–615. Cited by: §1, §2.1.
-  (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §1, §1.
-  (2018) Deep learning for sentiment analysis: a survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8 (4), pp. e1253. Cited by: §1.
-  (2015) Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657. Cited by: §3.