An Efficient Model for Sentiment Analysis of Electronic Product Reviews in Vietnamese

10/29/2019 ∙ by Suong N. Hoang, et al.

In the past few years, the growth of e-commerce and digital marketing in Vietnam has generated a huge volume of opinionated data. Analyzing those data would provide enterprises with insight for better business decisions. In this work, as part of the Advosights project, we study sentiment analysis of product reviews in Vietnamese. The final solution is based on Self-attention neural networks, a flexible architecture for text classification task with about 90.16




1 Introduction

Sentiment analysis aims to analyze human opinions, attitudes, and emotions. It has been applied in various fields of business. For instance, in our current project, Advosights, it is used to measure the impact of new products and advertising campaigns through consumers' responses.

In the past few years, together with the rapid growth of e-commerce and digital marketing in Vietnam, a huge volume of written opinionated data in digital form has been created. As a result, sentiment analysis plays a more critical role in social listening than ever before. So far, human effort has been the most common solution for sentiment analysis problems. However, this approach generally delivers neither the desired quality nor the desired speed: manual checking and labeling are time-consuming and error-prone. Therefore, developing a system that automatically classifies human sentiment is essential.

While a lot of sentiment analysis research is available for English, there are only a few works for Vietnamese. Vietnamese is a unique language that differs from English in a number of ways, so applying the techniques that work for English directly to Vietnamese would yield inaccurate results. This has motivated our systematic study of sentiment analysis for Vietnamese. Since our project, Advosights, initially served a well-known electronics brand, we decided to focus our study on electronic product reviews. Broader scopes will be studied in future work.

Our initial approach was to build a sentiment lexicon dictionary. Its first version used statistical methods [4, 5, 6] to estimate a sentiment score for each word in a list collected manually from Vietnamese dictionaries. This approach did not work well because the dataset came from casual reviews, which were practically spoken language with a lot of slang words and acronyms. This made it almost impossible to build a dictionary that covers all of those words. We then tried to use a simple neural network to learn sentiment lexicons from the corpus automatically [12]. This also did not work well, because some Vietnamese words share the same form but have different meanings in different contexts. For example, the word “đã” has different meanings in the two sentences “nhìn đã quá” and “đã quá cũ”, yet a dictionary assigns it the same sentiment score in both.

Some machine learning-based approaches were also studied. For example, CountVectorizer and Term Frequency–Inverse Document Frequency (Tf-idf) were used for word representations, and Support Vector Machine (SVM) and Naive Bayes were used as classifiers. However, the results were not very encouraging.

We also investigated various types of recurrent neural networks (RNNs), such as long short-term memory (LSTM) [1], bidirectional LSTM (biLSTM) [2] and gated recurrent unit (GRU) [9]. Although some of them achieved good accuracy, the models were heavy and had very long inference times. Our final model is based on the Transformer [16], a Self-attention neural network architecture that is a well-known state-of-the-art technique in machine translation. It provided the top accuracy and very fast inference when running on real data.

The paper is organized as follows. In section 2, some description of self-attention is provided for motivation. In section 3, our architecture is presented. The experiments are described in section 4. Finally, conclusions and remarks are included in section 5.

2 Background

Inspired by the human sight mechanism, Attention was used in the field of visual imaging about 20 years ago [3]. In 2014, a group from Google DeepMind applied Attention to RNNs for image classification tasks [7]. After that, Bahdanau et al. [8] applied this mechanism to encoder-decoder architectures for machine translation, which became the first work to apply the Attention mechanism to Natural Language Processing (NLP). Since then, Attention has become more and more common as an improvement for various NLP tasks based on neural networks such as RNNs and CNNs [10, 11, 14, 15, 19].

In 2017, Vaswani et al. first introduced the Self-attention Neural Network [16]. The proposed architecture, the Transformer, did not follow the well-known idea of recurrent networks. This paper paved the way, and Self-attention has become a hot topic in NLP over the last few years. In this section, we describe their approach in detail.

2.1 Attention

Attention was first described for Neural Machine Translation in [8] as a way to compute, for each state of the decoder, a context vector $c_i$ that is a weighted average of all the encoder states $h_j$. The alignment weights $\alpha_{ij}$ are determined by how well each encoder state $h_j$ matches the previous decoder hidden state $s_{i-1}$, and the resulting context vector is used to predict the next state of the decoder. This can be summarized by the following equations:

$$e_{ij} = a(s_{i-1}, h_j), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l} \exp(e_{il})}, \qquad c_i = \sum_{j} \alpha_{ij} h_j$$

where $a$ is a function that computes the compatibility score between $s_{i-1}$ and $h_j$.

2.1.1 Scaled Dot-Product Attention:

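Let us consider $s_{i-1}$ as a query vector $q$. The encoder state $h$ is duplicated: one copy is the key vector $k$ and the other is the value vector $v$ (in current NLP work the key and value vectors are frequently the same, therefore $h$ can be considered as either $k$ or $v$). The equations outlined above then take the general form:

$$e_{ij} = a(q_i, k_j), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l} \exp(e_{il})}, \qquad c_i = \sum_{j} \alpha_{ij} v_j$$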


In [16], Vaswani et al. use the scaled dot-product function as the compatibility score function:

$$a(q, k) = \frac{q k^{\top}}{\sqrt{d_k}}$$

where $d_k$ is the dimension of the input vectors, i.e. of the $k$ vector ($q$, $k$ and $v$ have the same dimension as the input embedding vector).
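As a minimal sketch of attention for a single query with the scaled dot-product score, the following dependency-free Python computes the alignment weights and the context vector. The function names and the toy vectors are illustrative assumptions, not taken from the paper's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_score(q, k):
    """Compatibility score: dot product of q and k divided by sqrt(dimension)."""
    return sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(k))

def attend(q, keys, values):
    """Return (alignment weights, context vector) for one query vector."""
    weights = softmax([scaled_dot_score(q, k) for k in keys])
    dim = len(values[0])
    context = [sum(w * v[t] for w, v in zip(weights, values)) for t in range(dim)]
    return weights, context

# Toy example (assumed data): keys and values are the same vectors.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([1.0, 0.0], states, states)
```

The first and third states have the same dot product with the query, so they receive equal weights, and both weigh more than the second state, which is orthogonal to the query.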

2.1.2 Self-attention:

Self-attention applies Scaled Dot-Product Attention from every token of the sentence to all the others. That is, for each token, the process computes a context output that incorporates information about the token itself and about how it relates to the other tokens in the sentence.

A linear feed-forward layer is used as a transformation to create three vectors (query, key, value) for every token in the sentence, and the attention mechanism outlined above is then applied to obtain the context matrix. Doing this token by token is slow. So, instead of creating the vectors individually, we let $Q$ be the matrix containing all the query vectors $q_i$, $K$ the matrix containing all the keys $k_i$, and $V$ the matrix containing all the values $v_i$. As a result, the whole process can be done in parallel [16]:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
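The matrix form of Self-attention can be sketched as follows, using plain Python lists to stay dependency-free. The projection matrices `w_q`, `w_k`, `w_v` and the identity weights in the toy example are assumptions made for illustration; a real model would learn them.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row of a matrix."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(x, w_q, w_k, w_v):
    """Compute softmax(Q K^T / sqrt(d_k)) V for all tokens at once."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)
    return matmul(weights, v), weights

# Toy example (assumed data): 3 tokens with embedding size 2, identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
context, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` is the attention distribution of one token over all tokens, and each row of `context` is the corresponding context output.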


2.1.3 Multi-head Attention

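Instead of performing Self-attention a single time with $(Q, K, V)$ of dimension $d_{model}$, Multi-head Attention performs attention $h$ times with $(Q, K, V)$ matrices of dimension $d_{model}/h$; each application of Attention is called a head. For each head, the $(Q, K, V)$ matrices are projected by their own parameter matrices to dimensions $d_q$, $d_k$ and $d_v$ (each equal to $d_{model}/h$), and the self-attention mechanism is then performed to yield an output of the same dimension [16]. Finally, the outputs of the $h$ heads are concatenated and a linear projection layer is applied once more. This process can be summarized by the following equations:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

where the projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{model} \times d_q}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$.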
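A minimal sketch of Multi-head Attention in plain Python is given below. The sizes (`d_model = 8`, two heads, five tokens) and the random weights are assumptions chosen only to make the shapes visible; they are not the paper's settings.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention over whole matrices."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    weights = []
    for row in matmul(q, k_t):
        scaled = [s / math.sqrt(d_k) for s in row]
        mx = max(scaled)
        exps = [math.exp(s - mx) for s in scaled]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, v)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads, w_o):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the head outputs token by token, then project with W_o.
    concat = [sum((h[t] for h in head_outputs), []) for t in range(len(x))]
    return matmul(concat, w_o)

# Assumed toy sizes: 5 tokens, d_model = 8, h = 2 heads of size 4 each.
rng = random.Random(0)
d_model, h = 8, 2
d_head = d_model // h
x = rand_matrix(5, d_model, rng)
heads = [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(h)]
w_o = rand_matrix(h * d_head, d_model, rng)
out = multi_head_attention(x, heads, w_o)
```

Each head works in a space of size `d_model / h`, so the concatenated result has size `d_model` again before the final projection.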

2.2 Positional Information Embedding Representation

Self-attention provides a context matrix containing information about how each token relates to the others. However, this attention mechanism still has a limitation: it loses positional information. It does not take the order of tokens into account, so permuting the input tokens only permutes the outputs in the same way. To make it work, neural networks need to incorporate positional information into the inputs. The Sinusoidal Positional Encoding technique is commonly used to solve this problem.

2.2.1 Sinusoidal Position Encoding:

This technique was proposed by Vaswani et al. [16]. Its main idea is to build a Position Encoding ($PE$) from sine and cosine functions of the position. The encoding can be written as:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where $pos$ is the position of the token in the sentence and $i$ indexes the $d_{model}$ dimensions of the encoding. In other words, each dimension of the positional encoding corresponds to a different sinusoid.

The advantage of this technique is that it can add positional information to sentences longer than those in the training dataset.
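The sinusoidal encoding of [16] can be sketched in a few lines. Counting positions from 0 and the sizes used in the example are assumptions of this sketch.

```python
import math

def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings.

    Even dimensions use sine and odd dimensions use cosine, with the
    wavelength growing geometrically with the dimension index.
    """
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        for even in range(0, d_model, 2):  # `even` plays the role of 2i
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(angle)
    return pe

# Assumed toy sizes: 50 positions, 4 dimensions.
pe = positional_encoding(50, 4)
```

Because the table is a closed-form function of the position, it can be extended to any length at inference time without retraining.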

3 Our Approach

3.1 Model architecture

We propose a simple model using a single modified multi-head Self-attention block (see Fig. 2), described below.

The original Sinusoidal Position Encoding [16] uses an “adding” operation to incorporate positional information into the input. This means that, while performing Self-attention, the representation information (Word Embeddings) and the positional information (Positional Embeddings) have the same weights (the two kinds of information are treated equally).

For Vietnamese, we assumed that positional information contributes more to contextual semantics than representation information does. Therefore, we used a “concatenate” operation to incorporate positional information. This allows the representation information and the positional information to receive different weights during the transformation process.
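The difference between the two operations can be illustrated with a small sketch. The vectors and the weights of the single linear unit below are made-up values for illustration only.

```python
def combine_by_addition(word_emb, pos_emb):
    """Element-wise sum: both inputs must have the same size."""
    if len(word_emb) != len(pos_emb):
        raise ValueError("addition requires embeddings of equal size")
    return [w + p for w, p in zip(word_emb, pos_emb)]

def combine_by_concatenation(word_emb, pos_emb):
    """Concatenation: sizes may differ and the two parts stay separate."""
    return list(word_emb) + list(pos_emb)

def linear(vec, weights):
    """A single linear unit (no bias) applied to a vector."""
    return sum(v * w for v, w in zip(vec, weights))

# Assumed toy values.
word_emb = [0.5, -1.0]
pos_emb = [0.0, 1.0]

added = combine_by_addition(word_emb, pos_emb)
concatenated = combine_by_concatenation(word_emb, pos_emb)

# After addition, one weight per dimension is shared by both sources.
shared = linear(added, [2.0, 3.0])
# After concatenation, word and position features get their own weights.
separate = linear(concatenated, [2.0, 3.0, 0.1, 0.2])
```

With concatenation, the following projection layers can scale the word part and the positional part independently, which is the property the text relies on.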


We added a block inspired by “Squeeze-and-Excitation Networks” by Hu et al. [18] to provide an average attention mechanism and a gating mechanism. It stacks a GlobalAveragePooling1D layer followed by a bottleneck of two fully-connected layers (see Fig. 1 and Fig. 2). The first is a dimensionality-reduction layer with reduction ratio $r$ (fixed to a default value in our experiments) and a non-linear activation. The second is a dimensionality-increasing layer that returns the result to the original dimension, with a sigmoid activation function that scales each feature value into the range $(0, 1)$. In other words, this layer computes how much each feature contributes to the contextual semantics. We call this technique Embedding Feature Attention.

$$Y = X \odot \sigma\big(W_2\, \delta(W_1\, \mathrm{GlobalAveragePooling1D}(X))\big)$$

where $X$ is the input of the block, $Y$ is the output of the block, $\delta$ is a non-linear activation function, $\sigma$ is the sigmoid activation function, and $W_1$, $W_2$ are trainable matrices.
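The squeeze-and-excitation style gating described above can be sketched as follows. ReLU as the first activation, the reduction ratio of 2 and all weight values are assumptions of this sketch; the text only specifies a non-linear activation followed by a sigmoid.

```python
import math

def embedding_feature_attention(x, w1, w2):
    """Gate each embedding feature by a value in (0, 1).

    x  : tokens x d matrix (list of rows)
    w1 : (d // r) x d reduction weights
    w2 : d x (d // r) expansion weights
    """
    n, d = len(x), len(x[0])
    # Squeeze: average every feature over all tokens.
    z = [sum(tok[j] for tok in x) / n for j in range(d)]
    # Bottleneck: reduce with ReLU (assumed), then expand with sigmoid.
    hidden = [max(0.0, sum(w * zj for w, zj in zip(row, z))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * hj for w, hj in zip(row, hidden))))
        for row in w2
    ]
    # Excite: rescale every token feature-wise by its gate.
    scaled = [[g * v for g, v in zip(gates, tok)] for tok in x]
    return scaled, gates

# Assumed toy input: 2 tokens, d = 4, reduction ratio r = 2.
x = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]]
w1 = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
w2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
scaled, gates = embedding_feature_attention(x, w1, w2)
```

The gates depend only on the token-averaged features, so the same per-feature scaling is applied to every token of the sentence.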

Figure 1: Squeeze-Excitation architecture.
Figure 2: Self-attention Neural Network architecture for the sentiment classification task.

4 Experiments

We implemented from scratch the layers needed for this work, such as Scaled Dot-Product Attention, Multi-head Attention and the Feed-Forward Network, and re-trained word embeddings for Vietnamese spoken language.

All experiments were run on a machine with 26GB RAM, an Intel Xeon Processor E31220L v2 CPU and a Tesla K80 GPU. For comparison, all models were trained for 20 epochs with a batch size of 64, and all neural network models used focal-crossentropy as the training loss.
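The paper does not spell out the loss formula. As a hedged sketch, the standard binary focal loss can be written as below; the default values `gamma=2.0` and `alpha=0.25` are assumptions taken from common practice, not from this work.

```python
import math

def focal_loss(p_positive, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example.

    p_positive : predicted probability of the positive class
    label      : 1 for positive, 0 for negative
    gamma      : focusing parameter (assumed default)
    alpha      : class-balancing weight for the positive class (assumed default)
    """
    p_t = p_positive if label == 1 else 1.0 - p_positive
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-7), 1.0 - 1e-7)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)   # confident and correct: strongly down-weighted
hard = focal_loss(0.6, 1)   # less confident: contributes more to the loss
```

With `gamma=0` the expression reduces to a class-weighted cross-entropy; larger values of `gamma` shrink the contribution of well-classified examples, which is useful on imbalanced data.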

4.1 Datasets

There is no public dataset of electronics product reviews in Vietnamese. We had to crawl user reviews from several e-commerce websites, such as Tiki, Lazada, Shopee, Sendo, Adayroi, Dienmayxanh, Thegioididong, fptshop and vatgia. Based on our purposes, we chose some data fields to collect and store. Some data samples are presented in Tab. 1 below.

| username | product name | category | review | rating |
|---|---|---|---|---|
| user 1 | Samsung Galaxy A8+ | điện thoại | Ytt5ya 5t55 | 1/5 |
| user 2 | Philips E181 | điện thoại | đang chơi liên quân tự nhiên bị đơ đơ. rồi tự nhảy lung tung. Bị như vậy là do game hay do máy v mọi người. | 1/5 |
| user 3 | Philips E181 | điện thoại | Đặt màu vàng đồng mà giao màu bạc | 2/5 |
| user 4 | Oppo f7 | điện thoại | Oppo f7 đang có chương trình trả trước 0% và trả góp 0% đúng không ạ? | 2/5 |
| user 5 | Philips E181 | điện thoại | Giá đó mà không có camera kép, Vivo V9 đẹp hơn. | 2/5 |
| user 6 | Samsung Note 7 | điện thoại | Cho em hỏi máy m5c của em hay bị tắt nguồn là do sao ạ? | 4/5 |
| user 7 | Nokia 230 Dual SIM | điện thoại | điện thoại vs Máy dùng tốt | 4/5 |
| user 8 | Oppo f7 | điện thoại | cho em hỏi giá oppo F7 hiện tại bên mình là bao nhiêu ạ? | 5/5 |
| user 9 | Samsung Note 7 | điện thoại | Có màu đen ko vậy? | 5/5 |

Table 1: Examples of crawled data from e-commerce websites.

After analyzing and visualizing the data, we found that the dataset was very imbalanced (see the statistics below) and noisy. Some reviews were meaningless (user 1 in Tab. 1). Some did not express any sentiment (user 4, user 8 and user 9 in Tab. 1). Sometimes the rating did not reflect the sentiment of the review (see user 6 in Tab. 1). Therefore, a manual inspection step was applied to clean and label the data. We also built a labeling tool to make this process smoother and faster (see Fig. 3).

- The corpus has only 2 labels (positive and negative).

- Total 32,953 documents in labeled corpus:

Positives: 22,335 documents.

Negatives: 10,618 documents.

Figure 3: Sentiment checking tool interface.

Next, to make the dataset balanced, we duplicated some short negative documents and segmented the longer ones. In final result we have over documents in corpus with positives and negatives.

To train the models, we split the corpus into 3 sets as follows: training set: , validation set: , test set: .

4.2 Preprocessing

For automatic preprocessing, we mainly relied on existing tools. We applied a sentence tokenizer [21] to each document. All links, phone numbers and email addresses were replaced by “urlObj”, “phonenumObj” and “mailObj”, respectively. The word tokenizer from Underthesea [20] for Vietnamese was also applied.
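The placeholder substitution step can be sketched with regular expressions. The three patterns below are simplified assumptions (for instance, the phone pattern only covers numbers starting with `0` or `+84`); the paper does not give the exact rules it used.

```python
import re

# Simplified, assumed patterns; real-world data needs broader rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"(?<!\w)(?:\+84|0)\d{8,10}(?!\w)")

def normalize(text):
    """Replace emails, links and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("mailObj", text)
    text = URL_RE.sub("urlObj", text)
    text = PHONE_RE.sub("phonenumObj", text)
    return text

sample = "Liên hệ 0912345678 hoặc shop@example.com, xem https://example.com/sp"
cleaned = normalize(sample)
# cleaned == "Liên hệ phonenumObj hoặc mailObj, xem urlObj"
```

Emails are replaced first so that a later pattern cannot split an address into pieces. Sentence and word tokenization would follow this step.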

4.3 Embeddings

We used the fastText [13] model for word embeddings. In many cases, users may mistype a word accidentally or intentionally. fastText deals with this problem very well by encoding words at the character level. When users type misspelled, very rare or out-of-vocabulary words, fastText can still represent them with an embedding vector close to that of the most similar words seen in the training sentences. This made fastText the best candidate for representing user inputs.
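To illustrate why character-level encoding is robust to typos, the sketch below extracts the character n-grams of a word in the fastText style (word wrapped in `<` and `>`, n from 3 to 6) and measures how many n-grams two words share. It is a simplification: real fastText also hashes the n-grams and sums their learned vectors. The example words are assumptions.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of a word wrapped in < and >."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }

def ngram_overlap(a, b):
    """Jaccard similarity between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)

typo_similarity = ngram_overlap("samsung", "samsumg")   # misspelling
other_similarity = ngram_overlap("samsung", "nokia")    # unrelated word
```

A misspelled word still shares many n-grams with the intended word, so its embedding, built from those n-grams, stays close to the correct one.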

No pre-trained fastText model existed for Vietnamese spoken language. Therefore, we trained a fastText model to provide pre-trained embedding weights for the Vietnamese vocabulary, using an unlabeled corpus of over documents of multi-product reviews crawled from the e-commerce sites mentioned in subsection 4.1. Rare words that occur fewer than times in the vocabulary were removed. Embedding size was . After training, we had vocabularies in total.

4.4 Evaluation results

We used the same word embeddings as mentioned above for all models and evaluated all models on the test set, which has documents. To demonstrate the significance of our model, we compare it with baseline RNN models, namely Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), bidirectional LSTM, bidirectional GRU, stacked bidirectional LSTM and stacked bidirectional GRU, with the following configurations.

- Vanilla LSTM and GRU: 1 layer with 1,024 units.

- Bidirectional model of LSTM and GRU: 1 layer with 1,024 units in forward and 1,024 units in backward.

- Stacked bidirectional model of LSTM and GRU: 2 stacked layers with 1,024 units in forward and 1,024 units in backward for each layer.

Table 2 shows that our model had the fastest inference time together with the highest macro-f1 score on the test set. Moreover, when run in production, this model gave better predictions than the best baseline model, the stacked bidirectional LSTM, especially on complex sentences such as “giá cao như này thì t mua con ss gala S7 cho r”, “quảng cáo lm lố vl”, “với tôi thì trong tầm giá nv vẫn có thể chấp nhận đk” or “Nhưng vì đây là dòng điện thoại giá rẻ, nên cũng k thể kì vọng hơn đc.” (see Fig. 4 and Fig. 5).

| Methods | Avg. inference time (s) | Macro-f1 (%) |
|---|---|---|
| LSTM | 0.4748 | 48.9(23) |
| bi-LSTM | 0.9373 | 90.0(05) |
| stacked bi-LSTM | 1.7967 | 90.1(32) |
| GRU | 0.3738 | 48.9(23) |
| bi-GRU | 0.5863 | 88.9(25) |
| stacked bi-GRU | 1.4830 | 89.9(72) |
| Self-attention | 0.0124 | 90.1(64) |

Table 2: Inference times and macro-f1 scores
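For reference, macro-f1 is the unweighted mean of the per-class F1 scores. The sketch below computes it for the two sentiment labels; the label names and the toy predictions are assumptions for illustration.

```python
def macro_f1(y_true, y_pred, labels=("negative", "positive")):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Assumed toy data: a model that always predicts "positive".
y_true = ["positive", "positive", "positive", "negative"]
y_pred = ["positive", "positive", "positive", "positive"]
score = macro_f1(y_true, y_pred)
```

Because every class counts equally, a model that always predicts the majority class is penalised heavily even when its plain accuracy looks acceptable.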
Figure 4: Stacked bidirectional Long-Short term memory for Sentiments Analysis in Vietnamese examples
Figure 5: Self-attention Neural Network for Sentiments Analysis in Vietnamese examples

5 Conclusion

In this paper we demonstrated that a Self-attention Neural Network is faster than the previous state-of-the-art techniques while giving the best result on the test set, and that it achieved very good results when run in production (its predictions on unlabeled data make sense to humans, with very fast inference time).

For future work, we plan to extend the model to stacked multi-head self-attention architectures. We are also interested in seeing the behaviour of the models explored in this work on much larger datasets (beyond electronics product reviews) and with more classes.


We thank our teammates, Tran A. Sang, Cao T. Thanh, and Ha H. Huy, for helpful discussions and support.