Unlike conventional sentiment analysis, ABSA is a more difficult task: it is concerned with identifying the aspect terms mentioned in a document as well as the sentiment expressed toward each aspect.
As demonstrated by (b0), sentiment analysis can be studied at three levels. At the document level, the task is to identify the sentiment polarity (positive, neutral, or negative) expressed throughout the entire document; the sentence level is concerned with classifying the sentiment of a single sentence. However, a document contains many sentences, and each sentence may contain multiple aspects with different sentiments, so document- and sentence-level sentiment analysis may be inaccurate; a finer-grained form of analysis is needed, called ABSA.
ABSA was first launched at SemEval-2014 (b1) with the introduction of datasets containing annotated restaurant and laptop reviews. The ABSA task was continued at SemEval over the next two years (b2; b3), extending into various domains, languages, and challenges. SemEval-2016 provided 39 datasets in 7 domains and 8 languages for the ABSA task; additionally, the datasets were released with a Support Vector Machine (SVM) baseline as the evaluation reference.
There are three primary tasks common to ABSA, as described in (b3): aspect category identification (Task 1), aspect opinion target extraction (Task 2), and aspect polarity detection (Task 3). In this paper, we concentrate only on Task 3.
Neural network (NN) variants have been applied to many Arabic NLP applications, such as sentiment analysis (b35), machine translation (b36; b37), and speech recognition (b38), obtaining the best results to date.
Using word embeddings (distributed representations) enhances neural network efficiency and improves the performance of DL models; therefore, they have been applied as a preliminary layer in various DL models.
Two types of word embeddings are available: contextualized and non-contextualized. Most of the available research on Arabic ABSA is based on non-contextualized word embeddings such as word2vec and fastText. The main drawback of non-contextualized embeddings is that they assign each word a single static vector that does not account for the various contexts in which the word may appear.
In contrast, pre-trained language models based on transformers, such as BERT, provide dynamic embedding vectors that change with the context of the words in a sentence. This has made BERT attractive for many tasks via fine-tuning on the downstream dataset related to the task.
Despite the fact that Arabic has a significant number of speakers (estimated at around 422 million (b4)) and is a morphologically rich language, the number of studies on Arabic aspect-based sentiment analysis is still limited.
The following are the paper’s key contributions:
This paper examines the modeling competence of contextual embedding from pre-trained language models such as BERT with sentence pair input on Arabic aspect sentiment classification task.
A simple BERT-based model with a linear classification layer was proposed to solve the aspect sentiment polarity classification task. Experiments conducted on the Arabic hotel reviews dataset demonstrated that, despite its simplicity, our model surpassed the state-of-the-art works.
The rest of the paper is organized as follows. Section 2 addresses the related work; Section 3 illustrates the proposed model; Section 4 explains the dataset and the obtained results; finally, Section 5 concludes the paper.
2 Related Work
ABSA is an area of SA with research methodologies divided into two approaches: standard machine learning techniques and DL-based techniques.
ABSA’s earliest efforts depended mainly on machine learning methods that focus on handcrafted features, such as lexicons, to train sentiment classifiers (b12; b13). While these methods are effective, they rely heavily on the quality of the handcrafted features. Subsequently, a set of neural methods, consisting of a word embedding layer followed by a neural architecture, was developed for the ABSA task, achieving promising results (b14; b15; b25).
Several attention-based models have been applied to the ABSA task for their ability to focus on the parts of the sentence relevant to a given aspect (b16; b17). However, a single attention network may not be enough to capture the key dependencies between context and targets, especially when the sentence is long, so multiple-attention networks were proposed to solve this problem (b20; b21).
To address the question-answering problem, the authors of (b42) developed the memory network concept (MemNN), which was later adopted for several NLP challenges including ABSA (b22; b23; b24).
Pre-trained language models have recently played a vital role in many NLP applications, as they can take advantage of vast volumes of unlabeled data to learn general language representations; ELMo (b27), GPT (b28), and BERT (b29) are among the most well-known examples. The authors of (b41) studied the use of BERT embeddings with several neural models, such as a linear layer, a GRU, a self-attention network, and a conditional random field, to deal with the ABSA task. In (b40), the authors demonstrated that treating ABSA as a sentence-pair classification task, by building an auxiliary sentence as input to BERT, significantly improved the results. The strength of BERT embeddings in ABSA was further investigated by (b30; b31; b32; b33).
In general, Arabic ABSA research has developed more slowly than English. (b8) combined an aspect embedding with each word embedding and fed the mixture to a CNN model to address the aspect polarity and category identification tasks. Subsequently, (b10) took advantage of modeling the internal information of the hierarchical review structure in solving the ABSA task.
Two supervised machine-learning techniques, an RNN and an SVM supplemented with a set of handcrafted features, were suggested by (b34); excellent results were achieved with the SVM, while the RNN was faster in terms of training time. (b7) combined an aspect embedding with each word embedding to encourage learning the connections between context words and targets, fed the mixture to an LSTM, and then applied an attention mechanism to focus on the context words related to a specific aspect. (Abdelgwad u. a., 2021) proposed applying an IAN network supported by a Bi-GRU to extract better target and context representations. (b9) proposed a deep memory network based on a stack of IndyLSTMs supplemented with a recurrent attention network to solve the aspect sentiment classification task.
In this paper, a simple BERT-based model with sentence pair input and a linear classification layer was proposed to solve the Arabic ABSA task and state-of-the-art results were achieved on the Arabic hotel reviews dataset.
3 Proposed Model
BERT (Bidirectional Encoder Representations from Transformers) is a deep learning technique for natural language processing (NLP) that uses unsupervised language representation and deep bidirectional models built on Transformers, an architecture in which every output element is connected to every input element and the weightings between them are computed dynamically by the attention mechanism. BERT is pre-trained on two separate but related NLP tasks that exploit this bidirectionality: masked language modeling and next sentence prediction. BERT effectively handles ambiguity, the most difficult aspect of natural language understanding, and can reach near-human accuracy in analyzing language. We first use the BERT component, with L transformer layers, to compute contextualized representations for the T-length input token sequence. These contextual representations are then fed to a task-specific layer that predicts the sentiment polarity labels. The overall architecture of the proposed model is depicted in Figure 1.
3.1 Auxiliary Sentence
BERT accepts either a single sentence or a pair of sentences as input, and it is particularly effective at sentence-pair classification tasks thanks to its pre-training on both masked language modeling and next sentence prediction. The ABSA task can therefore be transformed into a sentence-pair classification task: the first sentence contains the words that express sentiment, and the second sentence (the auxiliary sentence) carries the information related to the aspect. In other words, the model receives the review sentence as the first sentence and the aspect term as the auxiliary sentence, and the task is to determine the sentiment toward each aspect.
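As a concrete illustration, the pairing described above can be sketched as follows. This is a minimal, hypothetical helper (not the authors' code); it shows the token layout BERT expects for a sentence pair, with segment ids distinguishing the review (sentence A) from the auxiliary aspect sentence (sentence B).

```python
def build_sentence_pair(review_tokens, aspect_tokens):
    """Pack a review and an aspect term into a BERT-style sentence pair.

    Sentence A carries the opinionated review text; sentence B (the
    auxiliary sentence) carries the aspect whose polarity is sought.
    """
    tokens = ["[CLS]"] + review_tokens + ["[SEP]"] + aspect_tokens + ["[SEP]"]
    # Segment ids: 0 for [CLS] + sentence A + its [SEP]; 1 for sentence B + final [SEP]
    segments = [0] * (len(review_tokens) + 2) + [1] * (len(aspect_tokens) + 1)
    return tokens, segments

# Example (English tokens used purely for readability)
tokens, segments = build_sentence_pair(
    ["the", "room", "was", "clean"], ["room", "quality"])
```

In practice a library tokenizer (e.g. the Hugging Face `BertTokenizer` with two text arguments) performs this packing, but the resulting layout is the same.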
3.2 BERT as Embedding layer
In comparison to a conventional word2vec embedding layer, which offers static context-independent word vectors, the BERT layers offer dynamic context-dependent word embeddings: they take the entire sentence as input and compute each word's representation by extracting information from the whole sentence.
BERT processes its inputs in a particular way. Sentences are tokenized first, as in every model; additionally, special tokens are inserted at the start ([CLS]) and end ([SEP]) of the tokenized sentence. Then, because the self-attention mechanism lets BERT process tokens in parallel, and to handle the next sentence prediction task, extra embeddings must be added to encode all the necessary information.
The tokenized sentence with its [CLS] and [SEP] tokens is first fed into the embedding layer, which produces token embeddings. These token embeddings carry no position information, which is added by means of position embeddings. Finally, it must be indicated whether each token belongs to sentence A or sentence B; this is done with a learned vector known as a segment embedding.
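The three-way sum described above can be sketched in a few lines of NumPy. The lookup tables here are random stand-ins for the learned parameters; shapes and names are illustrative, not BERT's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, n_segments, dim = 100, 16, 2, 8

# Illustrative lookup tables (learned parameters in the real model)
token_emb = rng.normal(size=(vocab_size, dim))
position_emb = rng.normal(size=(max_len, dim))
segment_emb = rng.normal(size=(n_segments, dim))

def embed(token_ids, segment_ids):
    """Input embedding = token + position + segment, as in BERT."""
    positions = np.arange(len(token_ids))
    return (token_emb[token_ids]
            + position_emb[positions]
            + segment_emb[segment_ids])

x = embed([5, 17, 3], [0, 0, 1])   # shape: (3, dim)
```

Each row of `x` is the per-token mixture that is then passed to the transformer layers.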
The token, segment, and position embeddings of each token are then summed, and the mixture is fed into the transformer layers to refine the token-level features. The output representation of each transformer layer is fed as input into the next layer. The representation at the l-th layer is calculated as follows:

$H^{l} = \mathrm{Transformer}_{l}(H^{l-1}), \quad l \in [1, L],$

where $H^{l} \in \mathbb{R}^{T \times d}$ and $T$ refers to the number of input tokens. Outputs from the last transformer layer are taken as the full contextual representations of the input tokens and are used as input to the task-specific layer.
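The layer-by-layer recursion can be sketched as below. The "layer" here is a deliberately simplified stand-in (a linear map plus nonlinearity) rather than a real multi-head self-attention block; the point is only the shape-preserving chain from h^0 to h^L.

```python
import numpy as np

def toy_transformer_layer(h, W):
    """Stand-in for one transformer layer: linear map + nonlinearity.
    (The real layer is multi-head self-attention + feed-forward.)"""
    return np.tanh(h @ W)

rng = np.random.default_rng(1)
L, T, dim = 4, 6, 8                 # layers, tokens, hidden size
weights = [rng.normal(size=(dim, dim)) * 0.1 for _ in range(L)]

h = rng.normal(size=(T, dim))       # h^0: the summed input embeddings
for l in range(L):                  # h^l = Transformer_l(h^{l-1})
    h = toy_transformer_layer(h, weights[l])
# h now holds last-layer contextual representations, one row per token
```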
3.3 Design of Downstream Model
To identify sentiment polarities toward aspects (i.e., Task 3), the word embeddings extracted from the BERT model are fed into a task-specific layer, a simple linear layer in our case, which multiplies the input by a learnable weight matrix and adds a bias term, transforming the incoming features to output features in a linear manner. The softmax function is then used to compute the probability of each polarity class, $p = \mathrm{softmax}(Wh + b)$.
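A minimal sketch of this downstream head, in NumPy under the assumption of three polarity classes (the shapes and initialization are illustrative only):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(h_cls, W, b):
    """Linear task layer: logits = W h + b, then softmax over polarities."""
    return softmax(W @ h_cls + b)

rng = np.random.default_rng(2)
dim, n_classes = 8, 3             # e.g. positive / negative / neutral
W = rng.normal(size=(n_classes, dim))   # learnable weights
b = np.zeros(n_classes)                 # learnable bias
probs = classify(rng.normal(size=dim), W, b)
```

During fine-tuning, the cross-entropy loss over these probabilities is backpropagated through both the linear layer and the BERT layers.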
4 Experiments and Results
4.1 Data and Baseline Research
We conducted our experiments on the Arabic Hotel Reviews Dataset, which was presented in SemEval-2016 in support of ABSA’s multilingual task involving work in 8 languages and 7 domains (b3). There are 19,226 training tuples and 4802 testing tuples in the dataset. The XML schema was used to annotate the dataset. The dataset consists of a set of reviews, each review contains a number of sentences, with each sentence having three tuples: aspect-category, OTE, and aspect polarity. Figure 2 depicts an XML snapshot that corresponds to an annotated review.
The dataset supports both text-level annotations (2291 reviews) and sentence-level annotations (6029 sentences). This study focused only on sentence-level tasks. The dataset's scale and distribution are shown in Table 1. Also, an SVM classifier supported with N-gram features was applied to the Arabic hotel reviews dataset for various ABSA tasks and is considered the baseline we compare against.
| Task | Reviews (train) | Sentences (train) | Tuples (train) | Reviews (test) | Sentences (test) | Tuples (test) |
| T1: Sentence-level ABSA | 1839 | 4802 | 10,509 | 452 | 1227 | 2604 |
| T2: Text-level ABSA | 1839 | 4802 | 8757 | 452 | 1227 | 2158 |
4.2 Evaluation method
To determine the effectiveness of the proposed model, the accuracy metric was adopted, defined as follows:

$\mathrm{Accuracy} = \frac{N_{\mathrm{correct}}}{N_{\mathrm{all}}}$

Accuracy measures the ratio of correctly classified samples to all samples; higher accuracy indicates better performance.
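The metric amounts to a few lines of code (a generic sketch, not tied to any particular evaluation library):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    assert len(y_true) == len(y_pred)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# 3 of 4 predictions are correct -> accuracy 0.75
acc = accuracy(["pos", "neg", "neu", "pos"],
               ["pos", "neg", "pos", "pos"])
```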
4.3 Hyperparameters Setting
The pretrained "Arabic BERT" model (b39) was used, which was previously trained on about 8.2 billion words of MSA and dialectal Arabic. Specifically, the BERT-Base variant, consisting of 12 hidden layers, 12 attention heads, and a hidden size of 768, was used. The Adam optimizer was used to fine-tune the model on the downstream task with a learning rate of 1e-5, a dropout rate of 0.1, a batch size of 24, and 10 epochs.
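Collected in one place, the reported fine-tuning settings look like this (the `pretrained_model` string is a placeholder, not necessarily the exact published checkpoint id):

```python
# Fine-tuning configuration as reported in the paper
config = {
    "pretrained_model": "arabic-bert-base",  # placeholder id
    "hidden_layers": 12,
    "attention_heads": 12,
    "hidden_size": 768,
    "optimizer": "Adam",
    "learning_rate": 1e-5,
    "dropout": 0.1,
    "batch_size": 24,
    "epochs": 10,
}
```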
4.4 Comparison Models
Baseline trained an SVM classifier enhanced with N-gram features (b3).
INSIGHT-1 combined aspect embedding with each word embedding and fed the resulting mixture to CNN for Aspect sentiment analysis (b8).
HB-LSTM developed a hierarchical bidirectional LSTM for ABSA that exploits the hierarchical structure of the review to improve performance (b10).
AB-LSTM-PC combined an aspect embedding with each word embedding to encourage learning the connections between context words and targets, then applied an attention mechanism to focus on the context words related to a specific aspect (b7).
(Abdelgwad u. a., 2021) used a Bi-GRU to extract hidden representations of the targets and context, then applied two associated attention networks to those representations to model targets and their context interactively.
MBRA made use of an external memory network containing a stack of three bidirectional IndyLSTM layers and a recurrent attention mechanism to deal with complex sentence structures (b9).
4.5 Main Results
Table 2 shows that simply adding a basic linear layer on top of BERT outperformed the baseline and achieved better results than many previous Arabic DL models. This is evidence of the effectiveness of BERT's contextual representations at encoding the associations between aspect terms and context words. Moreover, the auxiliary sentence further improved the BERT model's results, as is apparent from the higher scores of the BERT-pair model compared with BERT-single, achieving state-of-the-art results.
4.6 Overfitting Issue
Despite using the BERT-base model, the smallest pre-trained version of BERT, the number of parameters (110M) seemed large for this task, which raised the question: is our model overfitting the downstream data? We therefore trained the BERT-linear model on the Arabic hotel reviews dataset for 10 epochs and recorded the accuracy on the development set after each epoch. As indicated in Figure 3, the development-set accuracy is relatively stable and does not decrease significantly as training progresses, which suggests that the BERT model is quite robust to overfitting.
4.7 Fine-tuning or Not
We investigated the effect of fine-tuning on the final results by keeping the parameters of the BERT component fixed during the training phase. Figure 4 compares the performance of the model with fine-tuning against the model with fixed parameters. The general-purpose BERT representation is clearly far from sufficient for the downstream task, and task-specific fine-tuning is necessary to exploit BERT's capabilities and increase performance.
5 Conclusion
In this paper, we explored the modeling capabilities of contextual embeddings from the pre-trained BERT model, combined with sentence-pair input, on the Arabic ABSA task. Specifically, we paired the BERT embedding component with a simple linear classification layer and performed extensive experiments on the Arabic hotel reviews dataset. The experimental results show that, despite its simplicity, our model surpasses the state-of-the-art works and is robust to overfitting.