Log In Sign Up

Natural Language Processing Advancements By Deep Learning: A Survey

by   Amirsina Torfi, et al.
University of Minnesota
Virginia Polytechnic Institute and State University

Natural Language Processing (NLP) helps empower intelligent machines by enhancing a better understanding of the human language for linguistic-based human-computer communication. Recent developments in computational power and the advent of large amounts of linguistic data have heightened the need and demand for automating semantic analysis using data-driven approaches. The utilization of data-driven strategies is pervasive now due to the significant improvements demonstrated through the usage of deep learning methods in areas such as Computer Vision, Automatic Speech Recognition, and in particular, NLP. This survey categorizes and addresses the different aspects and applications of NLP that have benefited from deep learning. It covers core NLP tasks and applications and describes how deep learning methods and models advance these areas. We further analyze and compare different approaches and state-of-the-art models.


page 3

page 14

page 16


Recent Trends in Deep Learning Based Natural Language Processing

Deep learning methods employ multiple processing layers to learn hierarc...

Towards Improved Model Design for Authorship Identification: A Survey on Writing Style Understanding

Authorship identification tasks, which rely heavily on linguistic styles...

Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better

Deep Learning has revolutionized the fields of computer vision, natural ...

Deep Lake: a Lakehouse for Deep Learning

Traditional data lakes provide critical data infrastructure for analytic...

Fine-tuning BERT-based models for Plant Health Bulletin Classification

In the era of digitization, different actors in agriculture produce nume...

Deep Learning for Political Science

Political science, and social science in general, have traditionally bee...

I Introduction

Natural Language Processing (NLP) is a sub-discipline of computer science providing a bridge between natural languages and computers. It helps empower machines to understand, process, and analyze human language [manning1999foundations]. NLP’s significance as a tool aiding comprehension of human-generated data is a logical consequence of the context-dependency of data. Data becomes more meaningful through a deeper understanding of its context, which in turn facilitates text analysis and mining. NLP enables this with the communication structures and patterns of humans.

Development of NLP methods is increasingly reliant on data-driven approaches which help with building more powerful and robust models [zhang2015character, cho2014learning]. Recent advances in computational power, as well as greater availability of big data, enable deep learning, one of the most appealing approaches in the NLP domain [collobert2008unified, zhang2015character, cho2014learning], especially given that deep learning has already demonstrated superior performance in adjoining fields like Computer Vision [karpathy2014large, oquab2014learning, shrivastava2017learning] and Speech Recognition [graves2014towards, amodei2016deep]. These developments led to a paradigm shift from traditional to novel data-driven approaches aimed at advancing NLP. The reason behind this shift was simple: new approaches are more promising regarding results, and are easier to engineer.

As a sequitur to remarkable progress achieved in adjacent disciplines utilizing deep learning methods, deep neural networks have been applied to various NLP tasks, including part-of-speech tagging 

[santos2014learning, plank2016multilingual, manning2011part]

named entity recognition 

[lample2016neural, chiu2015named, lample2016neural], and semantic role labeling [zhou2015end, marcheggiani2017simple, he2017deep]. Most of the research efforts in deep learning associated with NLP applications involve either supervised learning111Learning from training data to predict the type of new unseen test examples by mapping them to known pre-defined labels. or unsupervised learning222Making sense of data without sticking to specific tasks and supervisory signals..

This survey covers the emerging role of deep learning in the area of NLP, across a broad range of categories. The research presented in [young2018recent] is primarily focused on architectures, with little discussion of applications. On the other hand, this paper describes the challenges, opportunities, and evaluations of the impact of applying deep learning to NLP problems.

This survey has six sections, including this introduction. Section 2 lays out the theoretical dimensions of NLP and artificial intelligence, and looks at deep learning as an approach to solving real-world problems. It motivates this study by addressing the question: Why use deep learning in NLP? The third section

discusses fundamental concepts necessary to understand NLP, covering exemplary issues in representation, frameworks, and machine learning. The

fourth section summarizes benchmark datasets employed in the NLP domain. Section 5 focuses on some of the NLP applications where deep learning has demonstrated significant benefit. Finally, Section 6 provides a conclusion, also addressing some open problems and promising areas for improvement.

Ii Background

NLP has long been viewed as one aspect of artificial intelligence (AI), since understanding and generating natural language are high-level indications of intelligence. Deep learning is an effective AI tool, so we next situate deep learning in the AI world. After that we explain motivations for applying deep learning to NLP.

Ii-a Artificial Intelligence and Deep Learning

There have been “islands of success” where big data are processed via AI capabilities to produce information to achieve critical operational goals (e.g., fraud detection). Accordingly, scientists and consumers anticipate enhancement across a variety of applications. However, achieving this requires understanding of AI and its mechanisms and means (e.g., algorithms). Ted Greenwald, explaining AI to those who are not AI experts, comments: ”Generally AI is anything a computer can do that formerly was considered a job for a human” [breenwald2018wsj].

An AI goal is to extend the capabilities of information technology (IT) from those to (1) generate, communicate, and store data, to also (2) process data into the knowledge that decision makers and others need [sivarajah2017critical]. One reason is that the available data volume is increasing so rapidly that it is now impossible for people to process all available data. This leaves two choices: (1) much or even most existing data must be ignored or (2) AI must be developed to process the vast volumes of available data into the essential pieces of information that decision-makers and others can comprehend. Deep learning is a bridge between the massive amounts of data and AI.

Ii-A1 Definitions

Deep learning refers to applying deep neural networks to massive amounts of data to learn a procedure aimed at handling a task. The task can range from simple classification to complex reasoning. In other words, deep learning is a set of mechanisms ideally capable of deriving an optimum solution to any problem given a sufficiently extensive and relevant input dataset. Loosely speaking, deep learning is detecting and analyzing important structures/features in the data aimed at formulating a solution to a given problem. Here, AI and deep learning meet. One version of the goal or ambition behind AI is enabling a machine to outperform what the human brain does. Deep learning is a means to this end.

Ii-A2 Deep Learning Architectures

Numerous deep learning architectures have been developed in different research areas, e.g., in NLP applications employing recurrent neural networks (RNNs) 


, convolutional neural networks (CNNs) 

[kim2014convolutional], and more recently, recursive neural networks [socher2011parsing]. We focus our discussion on a review of the essential models, explained in relevant seminal publications.

Multi Layer Perceptron:

A multilayer perceptron

(MLP) has at least three layers (input, hidden, and output layers). A layer is simply a collection of neurons operating to transform information from the previous layer to the next layer. In the MLP architecture, the neurons in a layer do not communicate with each other. An MLP employs nonlinear activation functions. Every node in a layer connects to all nodes in the next layer, creating a fully connected network (Fig. 

1). MLPs are the simplest type of Feed-Forward Neural Networks (FNNs). FNNs represent a general category of neural networks in which the connections between the nodes do not create any cycle, i.e., in a FNN there is no cycle of information flow.

Fig. 1: The general architecture of a MLP.

Convolutional Neural Networks: Convolutional neural networks (CNNs), whose architecture is inspired by the human visual cortex, are a subclass of feed-forward neural networks. CNNs are named after the underlying mathematical operation, convolution, which yields a measure of the interoperability of its input functions. Convolutional neural networks are usually employed in situations where data is or needs to be represented with a 2D or 3D data map. In the data map representation, the proximity of data points usually corresponds to their information correlation.

In convolutional neural networks where the input is an image, the data map indicates that image pixels are highly correlated to their neighboring pixels. Consequently, the convolutional layers have 3 dimensions: width, height, and depth. That assumption possibly explains why the majority of research efforts dedicated to CNNs are conducted in the Computer Vision field [krizhevsky2012imagenet].

A CNN takes an image represented as an array of numeric values. After performing specific mathematical operations, it represents the image in a new output space. This operation is also called feature extraction, and helps to capture and represent key image content. The extracted features can be used for further analysis, for different tasks. One example is image classification, which aims to categorize images according to some predefined classes. Other examples include determining which objects are present in an image and where they are located. See Fig. 


In the case of utilizing CNNs for NLP, the inputs are sentences or documents represented as matrices. Each row of the matrix is associated with a language element such as a word or a character. The majority of CNN architectures learn word or sentence representations in their training phase. A variety of CNN architectures were used in various classification tasks such as Sentiment Analysis and Topic Categorization

[kim2014convolutional, dos2014deep, johnson2014effective, johnson2015semi]. CNNs were employed for Relation Extraction and Relation Classification as well [zeng2014relation, nguyen2015relation].

Fig. 2: A typical CNN architecture for object detection. The network provides a feature representation with attention to the specific region of an image (example shown on the left) that contains the object of interest. Out of the multiple regions represented (see an ordering of the image blocks, giving image pixel intensity, on the right) by the network, the one with the highest score will be selected as the main candidate.
Fig. 3: Recurrent Neural Network (RNN), summarized on the left, expanded on the right, for timesteps, with indicating input, hidden layer, and output

Recurrent Neural Network:

If we line up a sequence of FNNs and feed the output of each FNN as an input to the next one, a recurrent neural network (RNN) will be constructed. Like FNNs, layers in an RNN can be categorized into input, hidden, and output layers. In discrete time frames, sequences of input vectors are fed as the input, one vector at a time, e.g., after inputting each batch of vectors, conducting some operations and updating the network weights, the next input batch will be fed to the network. Thus, as shown in Fig. 

3, at each time step we make predictions and use parameters of the current hidden layer as input to the next time step.

Hidden layers in recurrent neural networks can carry information from the past, in other words, memory. This characteristic makes them specifically useful for applications that deal with a sequence of inputs such as language modeling [mikolov2010recurrent], i.e., representing language in a way that the machine understands. This concept will be described later in detail.

RNNs can carry rich information from the past. Consider the sentence: “Michael Jackson was a singer; some people consider him King of Pop.” It’s easy for a human to identify him as referring to Michael Jackson. The pronoun him happens seven words after Michael Jackson;

capturing this dependency is one of the benefits of RNNs, where the hidden layers in an RNN act as memory units. Long Short Term Memory Network (LSTM) 

[hochreiter1997long] is one of the most widely used classes of RNNs. LSTMs try to capture even long time dependencies between inputs from different time steps. Modern Machine Translation and Speech Recognition often rely on LSTMs.

Fig. 4:

Schematic of an Autoencoder

Autoencoders: Autoencoders implement unsupervised methods in deep learning. They are widely used in dimensionality reduction333Dimensionality reduction is an unsupervised learning approach which is the process of reducing the number of variables that were used to represent the data by identifying the most crucial information. or NLP applications which consist of sequence to sequence modeling (see Section III-B [mikolov2010recurrent]. Fig. 4 illustrates the schematic of an Autoencoder. Since autoencoders are unsupervised, there is no label corresponding to each input. They aim to learn a code representation for each input. The encoder is like a feed-forward neural network in which the input gets encoded into a vector (code). The decoder operates similarly to the encoder, but in reverse, i.e., constructing an output based on the encoded input. In data compression applications, we want the created output to be as close as possible to the original input. Autoencoders are lossy, meaning the output is an approximate reconstruction of the input.

Fig. 5: Generative Adversarial Networks

Generative Adversarial Networks: Goodfellow [goodfellow2014generative] introduced Generative Adversarial Networks (GANs). As shown in Fig. 5, a GAN is a combination of two neural networks, a discriminator and a generator. The whole network is trained in an iterative process. First, the generator network generates a fake sample. Then the discriminator network tries to determine whether this sample (ex.: an input image) is real or fake, i.e., whether it came from the real training data (data used for building the model) or not. The goal of the generator is to fool the discriminator in a way that the discriminator believes the artificial (i.e., generated) samples synthesized by the generator are real.

This iterative process continues until the generator produces samples that are indistinguishable by the discriminator. In other words, the probability of classifying a sample as fake or real becomes like flipping a fair coin for the discriminator. The goal of the generative model is to capture the distribution of real data while the discriminator tries to identify the fake data. One of the interesting features of GANs (regarding being generative) is: once the training phase is finished, there is no need for the discrimination network, so we solely can work with the generation network. In other words, having access to the trained generative model is sufficient.

Different forms of GANs has been introduced, e.g., Sim GAN [shrivastava2017learning], Wasserstein GAN [arjovsky2017wasserstein], info GAN [chen2016infogan], and DC GAN [radford2015unsupervised]. In one of the most elegant GAN implementations [karras2017progressive]

, entirely artificial, yet almost perfect, celebrity faces are generated; the pictures are not real, but fake photos produced by the network. In the NLP domain, GANs often are used for text generation 

[yu2017seqgan, li2017adversarial].

Ii-B Motivation for Deep Learning in NLP

Deep learning applications are predicated on the choices of (1) feature representation and (2) deep learning algorithm alongside architecture. These are associated with data representation and learning structure, respectively. For data representation, surprisingly, there usually is a disjunction between what information is thought to be important for the task at hand, versus what representation actually yields good results. For instance, in sentiment analysis, lexicon semantics, syntactic structure, and context are assumed by some linguists to be of primary significance. Nevertheless, previous studies based on the bag-of-words (BoW) model demonstrated acceptable performance 

[pang2002thumbs]. The bag-of-words model [harris1954distributional], often viewed as the vector space model, involves a representation which accounts only for the words and their frequency of occurrence. BoW ignores the order and interaction of words, and treats each word as a unique feature. BoW disregards syntactic structure, yet provides decent results for what some would consider syntax-dependent applications. This observation suggests that simple representations, when coupled with large amounts of data, may work as well or better than more complex representations. These findings corroborate the argument in favor of the importance of deep learning algorithms and architectures.

Often the progress of NLP is bound to effective language modeling. A goal of statistical language modeling is the probabilistic representation of word sequences in language, which is a complicated task due to the curse of dimensionality. The research presented in 

[bengio2003neural] was a breakthrough for language modeling with neural networks aimed at overcoming the curse of dimensionality by (1)

learning a distributed representation of words and

(2) providing a probability function for sequences.

A key challenge in NLP research, compared to other domains such as Computer Vision, seems to be the complexity of achieving an in-depth representation of language using statistical models. A primary task in NLP applications is to provide a representation of texts, such as documents. This involves feature learning, i.e., extracting meaningful information to enable further processing and analysis of the raw data.

Traditional methods begin with time-consuming hand-crafting of features, through careful human analysis of a specific application, and are followed by development of algorithms to extract and utilize instances of those features. On the other hand, deep supervised feature learning methods are highly data-driven and can be used in more general efforts aimed at providing a robust data representation.

Due to the vast amounts of unlabeled data, unsupervised feature learning is considered to be a crucial task in NLP. Unsupervised feature learning is, in essence, learning the features from unlabeled data to provide a low-dimensional representation of a high-dimensional data space. Several approaches such as K-means clustering and principal component analysis have been proposed and successfully implemented to this end. With the advent of deep learning and abundance of unlabeled data, unsupervised feature learning becomes a crucial task for representation learning, a precursor in NLP applications. Currently, most of the NLP tasks rely on annotated data, while a preponderance of unannotated data further motivates research in leveraging deep data-driven unsupervised methods.

Given the potential superiority of deep learning approaches in NLP applications, it seems crucial to perform a comprehensive analysis of various deep learning methods and architectures with particular attention to NLP applications.

Iii Core Concepts in NLP

Iii-a Feature Representation

Distributed representations are a series of compact, low dimensional representations of data, each representing some distinct informative property. For NLP systems, due to issues related to the atomic representation of the symbols, it is imperative to learn word representations.

At first, let’s concentrate on how the features are represented, and then we focus on different approaches for learning word representations. The encoded input features can be characters, words [socher2011parsing], sentences [le2014distributed], or other linguistic elements. Generally, it is more desirable to provide a compact representation of the words than a sparse one.

Fig. 6: Considering a given sequence, the skip-thought model generates the surrounding sequences using the trained encoder. The assumption is that the surrounding sentences are closely related, contextually.

How to select the structure and level of text representation used to be an unresolved question. After proposing the word2vec approach [mikolov2013distributed], subsequently, doc2vec was proposed in [le2014distributed] as an unsupervised algorithm and was called Paragraph Vector (PV). The goal behind PV is to learn fixed-length representations from variable-length text parts such as sentences and documents. One of the main objectives of doc2vec is to overcome the drawbacks of models such as BoW and to provide promising results for applications such as text classification and sentiment analysis. A more recent approach is the skip-thought model which applies word2vec at the sentence-level [kiros2015skip]. By utilizing an encoder-decoder architecture, this model generates the surrounding sentences using the given sentence (Fig. 6). Next, let’s investigate different kinds of feature representation.

Iii-A1 One-Hot Representation

In one-hot encoding, each unique element that needs to be represented has its dimension which results in a very high dimensional, very sparse representation. Assume the words are represented with the one-hot encoding method. Regarding representation structure, there is no meaningful connection between different words in the feature space. For example, highly correlated words such as ‘ocean’ and ‘water’ will not be closer to each other (in the representation space) compared to less correlated pairs such as ‘ocean’ and ‘fire.’ Nevertheless, some research efforts present promising results using one-hot encoding 


Iii-A2 Continuous Bag of Words

Continuous Bag-of-Words model (CBOW) has frequently been used in NLP applications. CBOW tries to predict a word given its surrounding context, which usually consists of a few nearby words [mikolov2013efficient]. CBOW is neither dependent on the sequential order of words nor necessarily on probabilistic characteristics. So it is not generally used for language modeling. This model is typically trained to be utilized as a pre-trained model for more sophisticated tasks. An alternative to CBOW is the weighted CBOW (WCBOW) [lebanon2015riemannian] in which different vectors get different weights reflective of relative importance in context. The simplest example can be document categorization where features are words and weights are TF-IDF scores [leskovec2014mining] of the associated words.

Iii-A3 Word-Level Embedding

Word embedding is a learned representation for context elements in which, ideally, words with related semantics become highly correlated in the representation space. One of the main incentives behind word embedding representations is the high generalization power as opposed to sparse, higher dimensional representations [goldberg2017neural]. Unlike the traditional bag-of-words model in which different words have entirely different representations regardless of their usage or collocations, learning a distributed representation takes advantage of word usage in context to provide similar representations for semantically correlated words. There are different approaches to create word embeddings. Several research efforts, including [mikolov2013efficient, mikolov2013distributed], used random initialization by uniformly sampling random numbers with the objective of training an efficient representation of the model on a large dataset. This setup is intuitively acceptable for initialization of the embedding for common features such as part-of-speech tags. However, this may not be the optimum method for representation of less frequent features such as individual words. For the latter, pre-trained models, trained in a supervised or unsupervised manner, are usually leveraged for increasing the performance.

Iii-A4 Character-Level Embedding

The methods mentioned earlier are mostly at higher levels of representation. Lower-level representations such as character-level representation require special attention as well, due to their simplicity of representation and the potential for correction of unusual character combinations such as misspellings [zhang2015character]. For generating character-level embeddings, CNNs have successfully been utilized [santos2014learning].

Character-level embeddings have been used in different NLP applications [wehrmann2017character]. One of the main advantages is the ability to use small model sizes and represent words with lower-level language elements [santos2014learning]. Here word embeddings are models utilizing CNNs over the characters. Another motivation for employing character-level embeddings is the out-of-vocabulary word (OOV) issue which is usually encountered when, for the given word, there is no equivalent vector in the word embedding. The character-level approach may significantly alleviate this problem. Nevertheless, this approach suffers from a weak correlation between characters and semantic and syntactic parts of the language. So, considering the aforementioned pros and cons of utilizing character-level embeddings, several research efforts tried to propose and implement higher-level approaches such as using sub-words [bojanowski2016enriching] to create word embeddings for OOV instances as well as creating a semantic bridge between the correlated words [botha2014compositional].

Iii-B Seq2Seq Framework

Most underlying frameworks in NLP applications rely on sequence-to-sequence (seq2seq) models in which not only the input but also the output is represented as a sequence. These models are common in various applications including machine translation444The input is a sequence of words from one language (e.g., English) and the output is the translation to another language (e.g., French).

, text summarization

555The input is a complete document (sequence of words) and the output is a summary of it (sequence of words)., speech-to-text, and text-to-speech applications666The input is an audio recording of a speech (sequence of audible elements) and the output is the speech text (sequence of words)..

The most common seq2seq framework is comprised of an encoder and a decoder. The encoder ingests the sequence of input data and generates a mid-level output which is subsequently consumed by the decoder to produce the series of final outputs. The encoder and decoder are usually implemented via a series of Recurrent Neural Networks or LSTM [hochreiter1997long] cells.

The encoder takes a sequence of length , , where is the representation of a single input coming from the vocabulary , and then generates the output state . Subsequently, the decoder takes the last state from the encoder, i.e., , and starts generating an output of size , , based on its current state, , and the ground-truth output . In different applications, the decoder could take advantage of more information such as a context vector [see2017get] or intra-attention vectors [paulus2017deep] to generate better outputs.

One of the most widely training approaches for seq2seq models is called Teacher Forcing [bengio2015scheduled]. Let us define as the ground-truth output sequence correspondent to a given input sequence . The model training based on the maximum-likelihood criterion employs the following cross-entropy (CE) loss minimization:


where is the parameters of the model optimized during the training.

Once the model is optimized using the cross-entropy loss, it can generate an entire sequence as follows. Let denote the output generated by the model at time . Then, the next output is generated by:


In NLP applications, one can improve the output by using beam search to find a reasonably good output sequence [cho2014learning]. During beam search, rather than using for selecting the best output, we choose the top outputs at each step, generate different paths for the output sequence, and finally choose the one that provides better performance as the final output. Although, there has been some recent studies [goyal2018continuous, kool2019stochastic] on improving the beam search by incorporating a similar mechanism during training of them model, studying this is outside the scope of this paper.

Given a series of the ground-truth output and the generated model output , the model performance is evaluated using a task-specific measures such as ROUGE [lin2004rouge], BLEU [papineni2002bleu], and METEOR [banerjee2005meteor]. As an example,

, which is an evaluation metric in NLP tasks, uses the largest common sub-string between ground-truth

and model output to evaluate the generated output.

Iii-C Reinforcement Learning in NLP

Although the seq2seq models explained in Section III-B achieve great successes w.r.t. traditional methods, there are some issues with how these models are trained. Generally speaking, seq2seq models like the ones used in NLP applications face two issues: (1) exposure bias and (2) inconsistency between training time and test time measurements [keneshloo2018deep].

Most of the popular seq2seq models are minimizing cross-entropy loss as their optimization objective via Teacher Forcing (Section III-B). In teacher forcing, during the training of the model, the decoder utilizes two inputs, the former decoder output state and the ground-truth input , to determine its current output state . Moreover, it employs them to create the next token, i.e., . However, at test time, the decoder fully relies on the previously created token from the model distribution. As the ground-truth data is not available, such a step is necessary to predict the next action. Henceforth, in training, the decoder input is coming from the ground truth, while, in the test phase, it relies on the previous prediction.

This exposure bias [ranzato2015sequence] induces error growth through output creation at the test phase. One approach to remedy this problem is to remove the ground-truth dependency in training by solely relying on model distribution to minimize the cross-entropy loss. Scheduled sampling [bengio2015scheduled] is one popular method to handle this setback. During scheduled sampling, we first pre-train the model using cross-entropy loss and then slowly replace the ground-truth with samples the model generates.

The second obstacle with seq2seq models is that, when training is finished using the cross-entropy loss, it is typically evaluated using non-differentiable measures such as ROUGE or METEOR

. This will form an inconsistency between the training objective and the test evaluation metric. Recently, it has been demonstrated that both of these problems can be tackled by utilizing techniques from reinforcement learning 


Among most of the well-known models in reinforcement learning, policy gradient techniques [zaremba2015reinforcement] such as the REINFORCE algorithm [williams1992simple] and actor-critic based models such as value-based iteration [sutton2018reinforcement], and Q-learning [watkins1992q], are among the most common techniques used in deep learning in NLP.

Using the model predictions (versus the ground-truth) for the sequence to sequence modeling and generation, at training time, was initially introduced by Daume et al. [daume2009search]. According to their approach, SEARN, the structured prediction can be characterized as one of the reinforcement learning cases as follows: The model employs its predictions to produce a sequence of actions (words sequences). Then, at each time step, a greedy search algorithm is employed to learn the optimal action, and the policy will be trained to predict that particular action.

Fig. 7: A simple Actor-Critic framework.

In Actor-Critic training, the actor is usually the same neural network used to generate the output, while the critic is a regression model that estimates how the actor performed on the input data. The actor later receives the feedback from the critic and improves its actions. Fig 

7 shows this framework. It is worth noting that action in most of the NLP-related applications is like selecting the next output token while the state is the decoder output state at each stage of decoding. These models have mostly been used for robotic [levine2016end] and Atari games [mnih2014recurrent] due to the small action space in these applications. However, when we use them in NLP applications, they face multiple challenges. The action space in most of the NLP applications could be defined as the number of tokens in the vocabulary (usually between 50K to 150K tokens). Comparing this to the action space in a simple Atari game, which on average has less than 20 actions [mnih2014recurrent], shows why these Actor-Critic models face difficulties when applied to NLP applications. A major challenge is the massive action space in NLP applications, which not only causes difficulty for the right action selection, but also will make the training process very slow. This makes the process of finding the best Actor-Critic model very complicated and model convergence usually requires a lot of tweaks to the models.

Iv Datasets

Many different researchers for different tasks use benchmark datasets, such as those discussed below. Benchmarking in machine learning refers to the assessment of methods and algorithms, comparing those regarding their capability to learn specific patterns. Benchmarking aids validation of a new approach or practice, relative to other existing methods.

Benchmark datasets typically take one of three forms.

  1. The first is real-world data, obtained from various real-world experiments.

  2. The second is synthetic data, artificially generated to mimic real-world patterns. Synthetic data is generated for use instead of real data. Such datasets are of special interest in applications where the amount of data required is much larger than that which is available, or where privacy considerations are crucial and strict, such as in the healthcare domain.

  3. The third type are toy datasets, used for demonstration and visualization purposes. Typically they are artificially generated; often there is no need to represent real-world data patterns.

The foundation of Deep Learning utilization is the availability of data to teach the system about pattern identification. The effectiveness of the model depends on the quality of the data. Despite the successful implementation of universal language modeling techniques such as BERT [sun2019utilizing], however, such models can be used solely for pre-training the models. Afterward, the model needs to be trained on the data associated with the desired task. Henceforth, based on the everyday demands in different machine domains such as NLP, creating new datasets is crucial.

On the other hand, creating new datasets is not usually an easy matter. Informally speaking, the newly created dataset should be: the right data to train on, sufficient for the evaluation, and accurate to work on. Answering the questions of “what is the meaning of right and accurate data” is highly application-based. Basically, the data should have sufficient information, which depends on the quality and quantity of the data.

To create a dataset, the first step is always asking “what are we trying to do and what problem do we need to solve?” and “what kind of data do we need and how much of it is required?” The next step is to create training and testing portions. The training data set is used to train a model to know how to find the connections between the inputs and the associated outputs. The test data set is used to assess the intelligence of the machine, i.e., how well the trained model can operate on the unseen test samples. Next, we must conduct data preparation to make sure the data and its format is simple and understandable for human experts. After that, the issue of data accessibility and ownership may arise. Distribution of data may need to have specific authorizations, especially if we are dealing with sensitive or private data.

Given the aforementioned roadmap, creating proper datasets is complicated and of great importance. That’s why few datasets are frequently chosen by the researchers and developers for benchmarking. A summary of widely used benchmark datasets is provided in Table I.

Task Dataset Link
Machine Translation
WMT 2014 EN-DE
WMT 2014 EN-FR
Text Summarization
Reading Comprehension
Question Answering
Question Generation
Story Cloze Test
Semantic Parsing
AMR parsing
ATIS (SQL Parsing)
WikiSQL (SQL Parsing)
Sentiment Analysis
IMDB Reviews
Yelp Reviews
Subjectivity Dataset
Text Classification
AG News
20 NewsGroup
Natural Language Inference
SNLI Corpus
Semantic Role Labeling
Proposition Bank
TABLE I: Benchmark datasets.

V Deep Learning for NLP Tasks

This section describes NLP applications using deep learning. Fig. 8 shows representative NLP tasks (and the categories they belong to). A fundamental question is: ”How can we evaluate an NLP algorithm, model, or system?” In [resnik2010evaluation], some of the most common evaluation metrics have been described. This reference explains the fundamental principles of evaluating NLP systems.

Fig. 8: NLP tasks investigated in this study.

V-a Basic Tasks

V-A1 Part-Of-Speech Tagging

Part-of-Speech tagging is one of the basic tasks in Natural Language Processing. It is the process of labeling words with their part of speech categories. Part of speech is leveraged for many crucial tasks such as named entity recognition. One commonly used dataset for Part-of-Speech tagging is the WSJ corpus777Penn Treebank Wall Street Journal (WSJ-PTB).. This dataset contains over a million tokens and has been utilized widely as a benchmark dataset for the performance assessment of POS tagging systems. Traditional methods are still performing very well for this task [manning2011part]. However, neural network based methods have been proposed for Part-of-Speech tagging [kumar2016ask].

For example, the deep neural network architecture named CharWNN has been developed to join word-level and character-level representations using convolutional neural networks for POS tagging [santos2014learning]. The emphasis in [santos2014learning] is the importance of character-level feature extraction as their experimental results show the necessity of employing hand-crafted features in the absence of character-level features for achieving the state-of-the-art. In [huang2015bidirectional], a wide variety of neural network based models have been proposed for sequence tagging tasks, e.g., LSTM networks, bidirectional LSTM networks, LSTM networks with a CRF888Conditional Random Field. layer, etc. Sequence tagging itself includes part of speech tagging, chunking, and named entity recognition. Likewise, a globally normalized transition-based neural network architecture has been proposed for POS-tagging [andor2016globally]. State-of-the-art results are summarized in Table  II.

Model Accuracy

Character-aware neural language models [liu2018empower]
Transfer Learning + GRU[yang2017transfer] 97.55
Bi-directional LSTM + CNNs + CRF[ma2016end] 97.55
Adversarial Training + Bi-LSTM [yasunaga2017robust] 97.59
Character Composition + Bi-LSTM[ling2015finding] 97.78
String Embedding + LSTM[akbik2018contextual] 97.85
Meta-BiLSTM [bohnet2018morphosyntactic] 97.96

TABLE II: POS tagging state-of-the-art models evaluated on the WSJ-PTB dataset.

V-A2 Parsing

Parsing is assigning a structure to a recognized string. There are different types of parsing. Constituency Parsing refers in particular to assigning a syntactic structure to a sentence. A greedy parser has been introduced in [legrand2014joint] which performs a syntactic and semantic summary of content using vector representations. To enhance the results achieved by [legrand2014joint], the approach proposed in [legrand2016deep] focuses on learning morphological embeddings. Recently, deep neural network models outperformed traditional algorithms. State-of-the-art results are summarized in Table III.

Model Accuracy

Recurrent neural network grammars (RNNG) [kuncoro2016recurrent]
In-order traversal over syntactic trees + LSTM [liu2017order] 94.2
Model Combination and Reranking [fried2017improving] 94.6
Self-Attentive Encoder [kitaev2018constituency] 95.1

TABLE III: Constituency parsing state-of-the-art models evaluated on the WSJ-PTB dataset.

Another type of parsing is called Dependency Parsing. Dependency structure shows the structural relationships between the words in a targeted sentence. In dependency parsing, phrasal elements and phrase-structure rules do not contribute to the process. Rather, the syntactic structure of the sentence is expressed only in terms of the words in the sentence and the associated relations between the words.

Neural networks have shown their superiority regarding generalizability and reducing the feature computation cost. In [chen2014fast], a novel neural network-based approach was proposed for a transition-based dependency parser. Neural network based models that operate on task-specific transition systems have also been utilized for dependency parsing [andor2016globally]. A regularized parser with bi-affine classifiers has been proposed for the prediction of arcs and labels [dozat2016deep]. Bidirectional-LSTMs have been used in dependency parsers for feature representation [kiperwasser2016simple]. A new control structure has been introduced for sequence-to-sequence neural networks based on the stack LSTM and has been used in transition-based parsing [dyer2015transition].

V-A3 Semantic Role Labeling

Semantic Role Labeling (SRL) is the process of identification and classification of text arguments. It is aimed at the characterization of elements to determine “who” did “what” to “whom” as well as “how,” “where,” and “when.” It identifies the predicate-argument structure of a sentence. The predicate, in essence, refers to “what,” while the arguments consist of the associated participants and properties in the text. The goal of SRL is to extract the semantic relations between the predicate and the related arguments.

Most of the previously-reported research efforts are based on explicit representations of semantic roles. Recently, deep learning approaches have achieved the SRL state-of-the-art without taking the explicit syntax representation into consideration [tan2017deep]. On the other hand, it is argued that the utilization of syntactic information can be leveraged to improve the performance of syntactic-agnostic999Note that being syntactic-agnostic does not imply discarding syntactic information. It means they are not explicitly employed. models [marcheggiani2017encoding]. A linguistically-informed self-attention (LISA) model has been proposed to leverage both multi-task learning and self-attention for effective utilization of the syntactic information for SRL [strubell2018linguistically]. Current state-of-the-art methods employ joint prediction of predicates and arguments [he2018jointly], novel word representation approaches [peters2018deep]

, and self-attention models 

[tan2018deep]; see Table IV.

Model Accuracy ()

Self-Attention + RNN [tan2018deep]
Contextualized Word Representations [peters2018deep] 84.6
Argumented Representations + BiLSTM [he2018jointly] 85.3

TABLE IV: Semantic Role Labeling current state-of-the-art models evaluated on the OntoNotes dataset [pradhan2013towards]. The accuracy metric is score.

V-B Text Classification

The primary objective of text classification is to assign predefined categories to text parts (which could be a word, sentence, or whole document) for preliminary classification purposes and further organization and analysis. A simple example is the categorization of given documents as to political or non-political news articles.

The use of CNNs for sentence classification, in which training the model on top of pretrained word-vectors through fine-tuning, has resulted in considerable improvements in learning task-specific vectors [kim2014convolutional]

. Later, a Dynamic Convolutional Neural Network (DCNN) architecture – essentially a CNN with a dynamic k-max pooling method – was applied to capture the semantic modeling of sentences 

[kalchbrenner2014convolutional]. In addition to CNNs, RNNs have been used for text classification. An LSTM-RNN architecture has been utilized in [palangi2016deep] for sentence embedding with particular superiority in a defined web search task. A Hierarchical Attention Network (HAN) has been utilized to capture the hierarchical structure of text, with a word-level and sentence-level attention mechanism [yang2016hierarchical].

Some models used the combination of both RNNs and CNNs for text classification such as [lai2015recurrent]. This is a recurrent architecture in addition to max-pooling with an effective word representation method, and demonstrates superiority compared to simple window-based neural network approaches. Another unified architecture is the C-LSTM proposed in [zhou2015c] for sentence and document modeling in classification. Current state-of-the-art methods are summarized in Table V.

Model Accuracy

CNN [conneau2017very]
Deep Pyramid CNN [johnson2017deep] 93.13
CNN [johnson2016supervised] 93.43
Universal Language Model Fine-tuning (ULMFiT) [howard2018universal] 94.99

TABLE V: The classification accuracy of state-of-the-art methods, evaluated on the AG News Corpus dataset [zhang2015character].

V-C Information Extraction

Information extraction identifies structured information from “unstructured” data such as social media posts and online news. Deep learning has been utilized for information extraction regarding subtasks such as Named Entity Recognition, Relation Extraction, Coreference Resolution, and Event Extraction.

V-C1 Named Entity Recognition

Named Entity Recognition (NER) aims to locate and categorize named entities in context into pre-defined categories such as the names of people and places. The application of deep neural networks in NER has been investigated by the employment of CNN [collobert2011natural] and RNN architectures [mesnil2013investigation], as well as hybrid bidirectional LSTM and CNN architectures [chiu2015named]. NeuroNER [2017neuroner], a named-entity recognition tool, operates based on artificial neural networks. State-of-the-art models are reported in Table VI.

Model Accuracy

Semi-supervised Sequence Modeling [clark2018semi]
Google BERT [devlin2018bert] 92.8
Contextual String Embeddings [akbik2018contextual] 93.09

TABLE VI: State of the art models regarding Name Entity Recognition. Evaluation is performed on the CoNLL-2003 Shared Task dataset [tjong2003introduction]. The evaluation metric is score.

V-C2 Relation Extraction

Relation Extraction aims to find the semantic relationships between entity pairs. The recursive neural network (RNN) model has been proposed for semantic relationship classification by learning compositional vector representations [socher2012semantic]. For relation classification, CNN architectures have been employed as well, by extracting lexical and sentence level features [zeng2014relation].

V-C3 Coreference Resolution

Coreference resolution includes identification of the mentions in a context that refer to the same entity. For instance, the mentions “car,” “Camry,” and “it” could all refer to the same entity. For the first time in [clark2016deep], Reinforcement Learning (RL) was applied to coreference resolution. Current state-of-the-art methods leverage an attention mechanism [lee2018higher].

V-C4 Event Extraction

A specific type of extracted information from text is an event. Such extraction may involve recognizing trigger words related to an event and assigning labels to entity mentions that represent event triggers. Convolutional neural networks have been utilized for event detection; they handle problems with feature-based approaches including exhaustive feature engineering and error propagation phenomena for feature generation [chen2015event]. In 2018, Nguyen and Grishman applied graph-CNN (GCCN) where the convolutional operations are applied to syntactically dependent words as well as consecutive words [nguyen2018graph]; their adding entity information reflected the state-of-the-art using CNN models.

V-D Sentiment analysis

The primary goal in sentiment analysis is the extraction of subjective information from text by contextual mining. Sentiment analysis is considered high-level reasoning based on source data. Sentiment analysis is sometimes called opinion mining, as its primary goal is to analyze human opinion, sentiments, and even emotions regarding products, problems, and varied subjects. Important works on sentiment analysis or opinion mining include [nasukawa2003sentiment, dave2003mining]. Since 2000, much attention has been given to sentiment analysis, due to its relation to a wide variety of applications, its associations with new research challenges, and the availability of abundant data.

A critical aspect of research in sentiment analysis is content granularity. Considering this criterion, sentiment analysis is generally divided into three categories/levels: document level, sentence level, and aspect level.

V-D1 Document-level Sentiment Analysis

At the document level, the task is to determine whether the whole document reflects a positive or negative sentiment about exactly one entity. This differs from opinion mining regarding multiple entries. The Gated Recurrent Neural Network architecture has been utilized successfully for effectively encoding the sentences’ relations in the semantic structure of the document [tang2015document]. Domain adaptation has been investigated as well, to deploy the trained model on unseen new sources [glorot2011domain].

V-D2 Sentence-level Sentiment Analysis

At the sentence-level, sentiment analysis determines the positivity, negativity, or neutrality regarding an opinion expressed in a sentence. One general assumption for sentence-level sentiment classification is the existence of only one opinion from a single opinion holder in an expressed sentence. Recursive autoencoders have been employed for sentence-level sentiment label prediction by learning the vector space representations for phrases [socher2011semi]. Long Short-Term Memory (LSTM) recurrent models have also been utilized for tweet sentiment prediction [wang2015predicting]

. The Sentiment Treebank and Recursive Neural Tensor Networks

[socher2013recursive] have shown promise for predicting fine-grained sentiment labels.

V-D3 Aspect-level Sentiment Analysis

Document-level and sentence-level sentiment analysis usually focus on the sentiment itself, not the target of the sentiment, e.g., a product. Aspect-level sentiment analysis directly targets an opinion, with the assumption of the existence of the sentiment and its target. A document or sentence may not have a generally positive or negative sentiment, but may have multiple subparts with different targets, each with a positive or negative sentiment. This can make aspect-level analysis even more challenging than other types of sentiment categorization.

Aspect-level sentiment analysis usually involves Aspect Sentiment Classification and Aspect Extraction. The former determines opinions on different aspects (positive, neutral, or negative) while the latter identifies the target aspect for evaluation in context. As an example consider the following sentence: “This car is old. It must be repaired and sold!”. “This car” is what is subject to evaluation and must be extracted first. Here, the opinion about this aspect is negative.

For aspect-level sentiment classification, attention-based LSTMs are proposed to connect the aspect and sentence content for sentiment classification [wang2016attention]. For aspect extraction, deep learning has successfully been proposed in opinion mining [poria2016aspect]. State-of-the-art methods rely on converting aspect-based sentiment analysis to sentence-pair classification tasks [sun2019utilizing], post-training approaches [xu2019bert] on the popular language model BERT [devlin2018bert], and employment of pre-trained embeddings [xu2018double].

V-E Machine Translation

Machine Translation (MT) is one of the areas of NLP that has been profoundly affected by the advances in deep learning. The first subsection below explains methods used in the pre-deep learning period, as explained in reference NLP textbooks such as “Speech and Language Processing” [jurafsky2008speech]. The remainder of this section is dedicated to delving into recent innovations in MT which are based on neural networks, started by [kalchbrenner2013recurrent].

V-E1 Traditional Machine Translation

One of the first demonstrations of machine translation happened in 1954 [dostert1955georgetown] in which the authors tried to translate from Russian to English. This translation system was based on six simple rules, but had a very limited vocabulary. It was not until the 1990s that successful statistical implementations of machine translation emerged as more bilingual corpora became available [jurafsky2008speech]. In [papineni2002bleu] the BLEU score was introduced as a new evaluation metric, allowing more rapid improvement than when the only approach involved using human labor for evaluation.

V-E2 Neural Machine Translation

It was after the success of the neural network in image classification tasks that researchers started to use neural networks in machine translation (NMT). Around 2013, research groups started to achieve breakthrough results in NMT. Unlike traditional statistical machine translation, NMT is based on an end-to-end neural network [bahdanau2014neural]. This implies that there is no need for extensive preprocessing and word alignments. Instead, the focus shifted toward network structure.

Fig. 11 shows an example of an end-to-end recurrent neural network for machine translation. A sequence of input tokens is fed into the network. Once it reaches an end-of-sentence (EOS) token, it starts generating the output sequence. The output sequence is generated in the same recurrent manner as the input sequence until it reaches an end-of-sentence token. One major advantage of this approach is that there is no need to specify the length of the sequence; the network takes it into account automatically. In other words, the end-of-sentence token determines the length of the sequence. Networks implicitly learn that longer input sentences usually lead to longer output sentences with varying length, and that ordering can change. For instance, the second example in Fig. 9 shows that adjectives generally come before nouns in English but after nouns in Spanish. There is no need to explicitly specify this since the network can capture such properties. Moreover, the amount of memory that is used by NMT is just a fraction of the memory that is used in traditional statistical machine translation [cho2014properties].

Fig. 9: Alignment in Machine Translation


was one of the early works that incorporated recurrent neural networks for machine translation. They were able to achieve a perplexity (a measure where lower values indicate better models) that was 43% less than the state-of-the-art alignment based translation models. Their recurrent continuous translation model (RCTM) is able to capture word ordering, syntax, and meaning of the source sentence explicitly. It maps a source sentence into a probability distribution over sentences in the target language. RCTM estimates the probability

of translating a sentence in the source language to target language sentence . RCTM estimates by considering source sentence as well as the preceding words in the target language :


The representation generated by RCTM acts on n-grams in the lower layers, and acts more on the whole sentence as one moves to the upper layers. This hierarchical representation is performed by applying different layers of convolution. First a continuous representation of each word is generated; i.e., if the sentence is

, the representation of the word will be . This will result in sentence matrix in which . This matrix representation of the sentence will be fed into a series of convolution layers in order to generate the final representation e for the recurrent neural network. The approach is illustrated in Fig. 10. Equations for the pipeline are as follows.


In order to take into account the sentence length, the authors introduced RCTM II which estimates the length of the target sentence. RCTM II was able to achieve better perplexity on WMT datasets (see top portion of Table I) than other existing machine translation systems.

Fig. 10: Recurrent Continuous Translation Models (RCTM) [kalchbrenner2013recurrent].

In another line of work,  [sutskever2014sequence] presented an end-to-end sequence learning approach without heavy assumptions on the structure of the sequence. Their approach consists of two LSTMs, one for mapping the input to a vector of fixed dimension and another LSTM for decoding the output sequence from the vector. Their model was able to handle long sentences as well as sentence representations that are sensitive to word order. As shown in Fig. 11, the model reads ”ABC” as an input sequence and produces ”WXYZ” as output sequence. The token indicates the end of prediction. The network was trained by maximizing the log probability of the translation () given the input sequence (). In other words, the objective function is:

Fig. 11: Sequence to sequence learning with LSTM.

is the training set and is its size. One of the novelties of their approach was reversing word order of the source sentence. This helps the LSTM to learn long term dependencies. Having a fixed-length vector in the decoder phase is one of the bottlenecks of the encoder-decoder approach.  [bahdanau2014neural] argues that a network will have a hard time compressing all the information from the input sentence into a fixed-size vector. They address this by allowing the network to search segments of the source sentence that are useful for predicting the translation. Instead of representing the input sentence as a fixed-size vector, in [bahdanau2014neural] the input sentence is encoded to a sequence of vectors and a subset of them is chosen by using a method called attention mechanism as shown in Fig. 12.

In their approach , in which . While previously was the same for all time steps, here takes a different value, , at each time step. This accounts for the attention mechasim (context vector) around that specific time step. is computed according to the following:


Here is the alignment model that is represented by a feed forward neural network. Also , which is a way to include information both about preceding and following words in . The model was able to outperform the simple encoder-decoder approach regardless of input sentence length.

Improved machine translation models continue to emerge, driven in part by the growth in people’s interest and need to understand other languages Most of them are variants of the end-to-end decoder-encoder approach. For example,  [wu2016google]

tries to deal with the problem of rare words. Their LSTM network consists of encoder and decoder layers using residual layers along with the attention mechanism. Their system was able to decrease training time, speed up inference, and handle translation of rare words. Comparisons between some of the state-of-the-art neural machine translation models are summarized in Table 


Fig. 12: Attention Mechasim for Neural Machine Translation [bahdanau2014neural].
Model Accuracy

Convolutional Seq-to-Seq  [gehring2017convolutional]
Attention Is All You Need [vaswani2017attention] 28.4
Weighted Transformer [ahmed2017weighted] 28.9
Self Attention [shaw2018self] 29.2
DeepL Translation Machine 101010 33.3
Back-translation [edunov2018understanding] 35.0

TABLE VII: The machine translation state-of-the-art models evaluated on the English-German dataset of ACL 2014 Ninth Workshop on Statistical Machine TRranslation. The evaluation metric is score.

V-F Question Answering

Question answering (QA) is a fine-grained version of Information Retrieval (IR). In IR a desired set of information has to be retrieved from a set of documents. The desired information could be a specific document, text, image, etc. On the other hand, in QA specific answers are sought, typically ones that can be inferred from available documents. Other areas of NLP such as reading comprehension and dialogue systems intersect with question answering.

Research in computerized question answering has proceeded since the 1960s. In this section, we present a general overview of question answering system history, and focus on the breakthroughs in the field. Like all other fields in NLP, question answering was also impacted by the advancement of deep learning  [bordes2014question], so we provide an overview of QA in deep learning contexts. We briefly visit visual question answering as well.

V-F1 Rule-based Question Answering

Baseball  [green1961baseball] is one of the early works (1961) on QA where an effort was made to answer questions related to baseball games by using a game database. The baseball system consists of (1) question read-in, (2) dictionary lookup for words in the question, (3) syntactic (POS) analysis of the words in question, (4) content analysis for extracting the input question, and (5) estimating relevance regarding answering the input question.

IBM’s  [ittycheriah2000ibm] statistical question answering system consisted of four major components:

  1. Question/Answer Type Classification

  2. Query Expansion/Information Retrieval

  3. Name Entity Making

  4. Answer Selection

Some QA systems fail when semantically equivalent relationships are phrased differently.  [cui2005question]

addressed this by proposing fuzzy relation matching based on mutual information and expectation maximization.

V-F2 Question answering in the era of deep learning

Smartphones (Siri, Ok Google, Alexa, etc.) and virtual personal assistants are common examples of QA systems with which many interact on a daily basis. While earlier such systems employed rule-based methods, today their core algorithm is based on deep learning. Table VIII presents some questions and answers provided by Siri on an iPhone.

Question Answer
Who invented polio vaccine? The answer I found is Jonas Salk
Who wrote Harry Potter? J.K.Rowling wrote Harry Potter in 1997
When was Einstein born? Albert Einstein was born March 14, 1879
TABLE VIII: Typical Question Answering performance based on deep learning.
Fig. 13: Fixed length vector sentence representation for input Questions and Answers [qiu2015convolutional].

[ng2000machine] was one of the first machine learning based papers that reported results on QA for a reading comprehension test. The system tries to pick a sentence in the database that has an answer to a question, and a feature vector represents each question-sentence pair. The main contribution of [ng2000machine] is proposing a feature vector representation framework which is aimed to provide information for learning the model. There are five classifiers (location, date, etc.), one for each type of question. They were able to achieve accuracy competitive with previous approaches.

As illustrated in Fig. 13, [qiu2015convolutional] uses convolutional neural networks in order to encode Question-Answer sentence pairs in the form of fixed length vectors regardless of the length of the input sentence. Instead of using distance measures like cosine correlation, they incorporate a non-linear tensor layer to match the relevance between question and answer. Equation 9 calculates the matching degree between question and its corresponding answer .


f is the standard element-wise non-linearity function, is a tensor, , , .

Fig. 14: Example of Dynamic Memory Network (DMN) input-question-answer triplet

The model tries to capture the interaction between question and answer. Inspired by findings in neuroscience, [kumar2016ask] incorporated episodic memory111111A kind of long-term memory that includes conscious recall of previous activities together with their meaning. in their Dynamic Memory Network (DMN). By processing input sequences and questions, DMN forms episodic memories to answer relevant questions. As illustrated in Fig. 14, their system is trained based on raw Input-Question-Answer triplets.

DMN consists of four modules that communicate with each other as shown in Fig. 15. The input module encodes raw input text into a distributed vector representation; likewise the question module encodes a question into its distributed vector representation. The episodic memory module uses the attention mechanism in order to focus on a specific part of the input module. Through an iterative process, this module produces a memory vector representation that considers the question as well as previous memory. The answer module uses the final memory vector to generate an answer. The model improved upon state-of-the-art results on tasks such as the ones shown in Fig. 14. DMN is one of the architectures that could potentially be used for a variety of NLP applications such as classification, question answering, and sequence modeling.

Fig. 15: Interaction between four modules of Dynamic Memory Network [mnih2014recurrent].

[xiong2016dynamic] introduced a Dynamic Coattention Network (DCN) in order to address local maxima corresponding to incorrect answers; it is considered to be one of the best approaches to question answering.

V-F3 Visual Question Answering

Given an input image, Visual Question Answering (VQA) tries to answer a natural language question about the image [antol2015vqa]. VQN addresses multiple problems such as object detection, image segmentation, sentiment analysis, etc. [antol2015vqa] introduced the task of VQA by providing a dataset containing over 250K images, 760K questions, and around 10M answers. [malinowski2015ask] proposed a neural-based approach to answer the questions regarding the input images. As illustrated in Fig. 16, Neural-Image-QA is a deep network consisting of CNN and LSTM. Since the questions can have multiple answers, the problem is decomposed into predicting a set of answer words from a finite vocabulary set where represents the count of answer words regarding a given question.

Fig. 16: Neural Image Question Answering [malinowski2015ask].
Fig. 17: Spatial Memory Network for VQA. Bright Areas are regions the model is attending [xu2016ask].

Do humans and computers look at the same regions to answer questions about an image? [das2017human] tries to answer this question by conducting large-scale studies on human attention in VQA. Their findings show that VQAs do not seem to be looking at the same regions as humans. Finally, [xu2016ask] incorporates a spatial memory network for VQA. Fig. 17 shows the inference process of their model. As illustrated in the figure, the specific attention mechanism in their system can highlight areas of interest in the input image.

V-G Document Summarization

Document summarization refers to a set of problems involving generation of summary sentences given one or multiple documents as input.

Generally, text summarization fits into two categories:

  1. Extractive Summarization, where the goal is to identify the most salient sentences in the document and return them as the summary.

  2. Abstractive Summarization, where the goal is to generate summary sentences from scratch; they may contain novel words that do not appear in the original document.

Each of these methods has its own advantages and disadvantages. Extractive summarization is prone to generate long and sometimes overlapping summary sentences; however, the result reflects the author’s mode of expression. Abstractive methods generate a shorter summary but they are hard to train.

There is a vast amount of research on the topic of text summarization using extractive and abstractive methods. As one of the earliest works on using neural networks for extractive summarization,  [nallapati2017summarunner] proposed a framework that used a ranking technique to extract the most salient sentences in the input. This model was improved by [narayan2018ranking] which used a document-level encoder to represent sentences, and a classifier to rank these sentences. On the other hand, in abstractive summarization, it was  [rush2015neural] which, for the first time, used attention over a sequence-to-sequence (seq2seq) model for the problem of headline generation. However, since simple attention models perform worse than extractive models, therefore more effective attention models such as graph-based attention [tan2017abstractive] and transformers [vaswani2017attention] have been proposed for this task. To further improve abstractive text summarization models,  [nallapati2016abstractive] proposed the first pointer-generator model and applied it to the DeepMind QA dataset [hermann2015teaching]. As a result of this work, the CNN/Daily Mail dataset emerged which is now one of the widely used datasets for the summarization task. A copy mechanism was also adopted by [gu2016incorporating] for similar tasks. But their analysis reveals a key problem with attention-based encoder-decoder models: they often generate unusual summaries consisting of repeated phrases. Recently,  [see2017get] reached state-of-the-art results on the abstractive text summarization using a similar framework. They alleviated the unnatural summaries by avoiding generating unknown tokens and replacing these words with tokens from the input article. Later, researchers moved their focus to methods that use sentence-embedding to first select the most salient sentence in the document and then change them to make them more abstractive [chen2018fast, zhou2018neural]. In these models, salient sentences are extracted first and then a paraphrasing model is used to make them abstractive. The extraction employs a sentence classifier or ranker while the abstractor tries to remove the extra information in a sentence and present it as a shorter summary. Fast-RL [chen2018fast] is the first framework in this family of works. In Fast-RL, the extractor is pre-trained to select salient sentences and the abstractor is pre-trained using a pointer-generator model to generate paraphrases. Finally, to merge these two non-differentiable components, they propose using Actor-Critic Q-learning methods in which the actor receives a single document and generates the output while the critic evaluates the output based on comparison with the ground-truth summary.

Though the standard way to evaluate the performance of summarization models is with ROUGE [lin2004rouge] and BLEU [papineni2002bleu], there are major problems with such measures. For instance, the ROUGE measure focuses on the number of shared n-grams between two sentences. Such a method incorrectly assigns a low score to an abstractive summary that uses different words yet provides an excellent paraphrase that humans would rate highly. Clearly, better automated evaluation methods are needed in such cases.

There are additional problems with current summarization models. Shi et al. [shi2018neural] provides a comprehensive survey on text summarization.

V-H Dialogue Systems

Dialogue Systems are quickly becoming a principal instrument in human-computer interaction, due in part to their promising potential and commercial value. One application is automated customer service, supporting both online and bricks-and-mortar businesses. Customers expect an ever-increasing level of speed, accuracy, and respect while dealing with companies and their services. Due to the high cost of knowledgeable human resources, companies frequently turn to intelligent conversational machines. Note that the phrases conversational machines and dialogue machines are often used interchangeably.

Dialogue systems are usually task-based or non-task-based (Fig. 18). Though there might be Automatic Speech Recognition (ASR) and Language-to-Speech (L2S) components in a dialogue system, the discussion of this section is solely about the linguistic components of dialogue systems; concepts associated with speech technology are ignored.

Fig. 18: The framework of a dialogue system. A dialogue system can be task oriented or used for natural language generation based on the user input which is also known as a chat bot.

Despite useful statistical models employed in the backend of dialogue systems (especially in language understanding modules), most deployed dialogue systems rely on expensive hand-crafted and manual features for operation. Furthermore, the generalizability of these manually engineered systems to other domains and functionalities is problematic. Hence, recent attention has focused on deep learning for the enhancement of performance, generalizability, and robustness. Deep learning facilitates the creation of end-to-end task-oriented dialogue systems, which enriches the framework to generalize conversations beyond annotated task-specific dialogue resources.

V-H1 Task-based Systems

The structure of a task-based dialogue system usually consists of the following elements:

  • Natural Language Understanding (NLU): This component deals with understanding and interpreting user’s spoken context by assigning a constituent structure to the spoken utterance (e.g., a sentence) and captures its syntactic representation and semantic interpretation, to allow the back-end operation/task. NLU is usually leveraged regardless of the dialogue context.

  • Dialogue Manager (DM): The generated representation by NLU would be handled by the dialogue manager, which investigates the context and returns a reasonable semantic-related response.

  • Natural Language Generation (NLG): The natural language generation (NLG) component produces an utterance based on the response provided by the DM component.

The general pipeline is as follows: NLU module (i.e., semantic decoder) transforms the output of the speech recognition module to some dialogue elements. Then the DM processes these dialogue elements and provides a suitable response which is fed to the NLG for response generation. The main pipeline in NLU is to classify the user query domain and user intent, and fill a set of slots to create a semantic frame. It is usually customary to perform the intent prediction and the slot filling simultaneously [hakkani2016multi]. Most of the task-oriented dialogue systems employ slot-filling approaches to classify user intent in the specific domain of the conversation. For this aim, having predefined tasks is required; this depends on manually crafted states with different associated slots. Henceforth, a designed dialogue system would be of limited or no use for other tasks.

Recent task-oriented dialogue systems have been designed based on deep reinforcement learning, which provided promising results regarding performance [toxtli2018understanding], domain adaptation [ilievski2018goal], and dialogue generation [li2016deep]. This was due to a shift towards end-to-end trainable frameworks to design and deploy task-oriented dialogue systems. Instead of the traditionally utilized pipeline, an end-to-end framework incorporates and uses a single module that deals with external databases. Despite the tractability of end-to-end dialogue systems (i.e., easy to train and simple to engineer), due to their need for interoperability with external databases via queries, they are not well-suited for task-oriented settings. Some approaches to this challenge include converting the user input into internal representations [wen2016network], combining supervised and reinforced learning [williams2016end], and extending the memory network approach [sukhbaatar2015end] for question-answering to a dialog system [bordes2016learning].

V-H2 Non-task-based Systems

As opposed to task-based dialogue systems, the goal behind designing and deploying non-task-based dialogue systems is to empower a machine with the ability to have a natural conversation with humans [ritter2011data]. Typically, chatbots are of one of the following types: retrieval-based methods and generative methods. Retrieval-based models have access to information resources and can provide more concise, fluent, and accurate responses. However, they are limited regarding the variety of responses they can provide due to their dependency on backend data resources. Generative models, on the other hand, have the advantage of being able to produce suitable responses when such responses are not in the corpus. However, as opposed to retrieval-based models, they are more prone to grammatical and conceptual mistakes arising from their generative models.

Retrieval-based methods select an appropriate response from the candidate responses. Therefore, the key element is the query-response operation. In general, this problem has been formulated as a search problem and uses IR techniques for task completion [ji2014information]. Retrieval-based methods usually employ either Single-turn Response Matching or Multi-turn Response Matching. In the first type, the current query (message) is solely used to select a suitable response [hu2014convolutional]. The latter type takes the current message and previous utterances as the system input and retrieves a response based on the instant and temporal information. The model tries to choose a response which considers the whole context to guarantee conversation consistency. An LSTM-based model has been proposed [lowe2015ubuntu] for context and response vectors creation. In [yan2016learning], various features and multiple data inputs have been incorporated to be ingested using a deep learning framework. Current base models regarding retrieval-based chatbots rely on multi-turn response selection augmented by an attention mechanism and sequence matching [zhou2018multi].

Generative models don’t assume the availability of pre-defined responses. New responses are produced from scratch and are based on the trained model. Generative models are typically based on sequence to sequence models and map an input query to a target element as the response. In general, designing and implementing a dialogue agent to be able to converse at the human level is very challenging. The typical approach usually consists of learning and imitating human conversation. For this goal, the machine is generally trained on large corpora of conversations. However, this does not directly remedy the issue of encountering out-of-corpus conversation. The question is: How can an agent be taught to generate proper responses to conversations that it never has seen? It must handle content that is not exactly available in the data corpus that the machine has been trained on, due to the lack of content matching between the query and the corresponding response, resulting from the wide range of plausible queries that humans can provide.

To tackle the aforementioned general problem, some fundamental questions must be answered: (1) What are the core characteristics of a natural conversation? (2) How can these characteristics be measured? (3) How can we incorporate this knowledge in a machine, i.e., the dialogue system? Effective integration of these three elements determines the intelligence of a machine. A qualitative criterion is to observe if the generated utterances can be distinguished from natural human dialogues. For quantitative evaluation, adversarial evaluation was initially used for quality assessment of sentence generation [bowman2015generating] and employed for quality evaluation of dialogue systems [kannan2017adversarial]. Recent advancements in sequence to sequence modeling encouraged many research efforts regarding natural language generation [vinyals2015neural]. Furthermore, deep reinforcement learning yields promising performance in natural language generation [li2016deep].

V-H3 Final note on dialogue systems

Despite remarkable advancements in AI and much attention dedicated to dialogue systems, in reality, successful commercial tools, such as Apple’s Siri and Amazon’s Alexa, still heavily rely on handcrafted features. It still is very challenging to design and train data-driven dialogue machines given the complexity of the natural language, the difficulties in framework design, and the complex nature of available data sources.

Vi Conclusion

In this article, we presented a comprehensive survey of the most distinguished works in Natural Language Processing using deep learning. We provided a categorized context for introducing different NLP core concepts, aspects, and applications, and emphasized the most significant conducted research efforts in each associated category. Deep learning and NLP are two of the most rapidly developing research topics nowadays. Due to this rapid progress, it is hoped that soon, new effective models will supersede the current state-of-the-art approaches. This may cause some of the references provided in the survey to become dated, but those are likely to be cited by new publications that describe improved methods

Neverthless, one of the essential characteristics of this survey is its educational aspect, which provides a precise understanding of the critical elements of this field and explains the most notable research works. Hopefully, this survey will guide students and researchers with essential resources, both to learn what is necessary to know, and to advance further the integration of NLP with deep learning.