Application of the Multi-label Residual Convolutional Neural Network text classifier using Content-Based Routing process

10/19/2021
by   Tounsi Achraf, et al.

In this article, we present an NLP application for text classification using a content-based router. The ultimate goal throughout this article is to predict the event described by a legal ad from the plain text of the ad. This is purely a supervised problem that involves NLP techniques and conventional modeling methodologies, through the use of a Multi-label Residual Convolutional Neural Network for text classification. We explain the approach put in place to solve the problem of classifying legal ads, the difficulties encountered and the experimental results.



1 I-Introduction

The use of neural networks in language-processing applications is still an open research problem that needs to be addressed in more depth, especially regarding task-specific issues (which architecture corresponds to which task), more global optimizations, and transfer-learning techniques, so that already-trained networks can easily be reused on new problems. One of these subjects is the combination of neural language models with routing processes. In Infolegale’s production processes, the use of data science methods to process the text of documents in natural language is a necessity: it automatically extracts the exploitable data and therefore meets the needs of the COMPLIFY projects. The project is part of a global intensification of the fight against money laundering, corruption and fraud. The company wishes to support its clients in mapping their risks, including knowledge of the client and identification of the person(s) in control of the company (beneficial owner). Scanning of legal announcements, prediction of events and extraction of named entities were put in place to answer the first part of the project. The goal is to make the process of acquiring legal announcements ever faster. The process is nowadays complicated and relies on 724 newspapers to be collected and about 1,200,000 advertisements per year to dematerialize, codify, record in a database and then sirenize (assign SIREN identifiers).

2 II-Context of the project

The company in question is revolutionizing its acquisition process for daily articles and newspapers by applying state-of-the-art NLP models in order to automate the whole process. To achieve this goal, the mission was defined during my end-of-studies project, where I worked on setting up a set of models that determine the events contained in legal announcements according to the event types standardized by the company, using Natural Language Processing techniques, and on optimizing the process by applying a Content-Based approach in order to route the events to their corresponding sphere.

3 III-Methodology

3.1 1) Understanding the target output

Once the legal ads have been extracted from the logs and PDF articles, the second process begins: identifying the events described by the ad. The objective of this part is to set up a set of models that will determine the events contained in the legal announcements according to the event types standardized by Infolegale. To accomplish this, we will use NLP models with a Content-Based Routing process in order to classify each of the 204 events that could be found in an ad.

3.2 2) Convolutional neural network

Convolutional neural networks, or CNNs [6], can be defined as a class of deep feed-forward artificial neural networks. Layers are stacked so that the input information is progressively refined, layer by layer, in a single forward direction until the desired output is produced.[9]

CNN architectures use the following layer types[6] (a minimal code sketch follows the list):

  • Convolutional layers: used to preserve the spatial structure of the features in an image.

  • Pooling layers: used to downsample an image (reduce it).

  • Fully connected layers: used to flatten the convolutional layer output into a single-column matrix and compress it progressively into a smaller but meaningful output.

  • Softmax output: used to present the output generated by the fully connected layers as class probabilities.
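To make the roles of these layers concrete, here is a minimal sketch of a small convolutional text classifier stacking the four layer types above. PyTorch, the vocabulary size, the embedding dimension and the filter sizes are assumptions for illustration, not the configuration used in the project.

```python
# Minimal sketch of the four layer types listed above, applied to text.
# PyTorch and all hyper-parameters are illustrative assumptions, not the
# configuration used by the authors.
import torch
import torch.nn as nn

class TinyTextCNN(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, num_classes=161):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Convolutional layer: preserves the local structure of the token sequence.
        self.conv = nn.Conv1d(embed_dim, 64, kernel_size=3, padding=1)
        # Pooling layer: down-samples the sequence.
        self.pool = nn.AdaptiveMaxPool1d(1)
        # Fully connected layer: flattens the pooled features into class scores.
        self.fc = nn.Linear(64, num_classes)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))
        x = self.pool(x).squeeze(-1)                   # (batch, 64)
        logits = self.fc(x)
        # Softmax output: presents the fully connected output as probabilities.
        return torch.softmax(logits, dim=-1)
```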

In the next section, we define a subtype of convolutional neural network: the residual convolutional neural network.

3.3 3) Residual Convolutional neural network

A residual convolutional neural network [1] is a type of artificial neural network (ANN) inspired by known constructs of the pyramidal cells of the cerebral cortex. To achieve this, residual neural networks use skip connections, or shortcuts, to jump over certain layers. [3]

Figure 1: Canonical form of a residual neural network

Figure 1 illustrates the step of passing the activation of one layer forward and adding it back two layers later. Typical residual blocks are implemented with double- or triple-layer skips containing non-linearities and batch normalization in between.

Figure 2: Structure of a residual neural network

Figure 2 describes the structure of a residual neural network. The main idea behind this network is the residual block.[1] It allows the development of extremely deep neural networks, which can contain 100 or more layers.

The equations that describe this structure are:

Simple network:

$z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]}$   (1)
$a^{[l+1]} = g^{[l+1]}(z^{[l+1]})$   (2)
$z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}$   (3)
$a^{[l+2]} = g^{[l+2]}(z^{[l+2]})$   (4)

Residual network:

$z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]}$   (5)
$a^{[l+1]} = g^{[l+1]}(z^{[l+1]})$   (6)
$z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}$   (7)
$a^{[l+2]} = g^{[l+2]}(z^{[l+2]} + a^{[l]})$   (8)

where $a^{[l]}$ denotes the activations (outputs) of the neurons in layer $l$, $g^{[l]}$ the activation function of layer $l$, and $W^{[l]}$, $b^{[l]}$ the weight matrix and bias of layer $l$.

When no explicit projection matrix is applied to the shortcut (identity shortcuts, as in standard ResNets), forward propagation through the activation function simplifies to equation (8).

With this extra connection, gradients can travel backwards more easily. [3] The residual block thus becomes a flexible unit that can increase the capacity of the network or simply collapse into an identity function that does not affect training.[3]
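
Equation (8) above can be sketched in code as follows; PyTorch, the 1-D convolutions and the batch-normalization placement are assumptions for illustration, not the authors' exact block.

```python
# Generic residual block with a two-layer skip, non-linearity and batch
# normalization, as described above. Shapes and layer choices are assumptions.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.bn2 = nn.BatchNorm1d(channels)

    def forward(self, a_l):
        # z[l+1] = W[l+1] a[l] + b[l+1] ;  a[l+1] = g(z[l+1])
        a1 = torch.relu(self.bn1(self.conv1(a_l)))
        # z[l+2] = W[l+2] a[l+1] + b[l+2]
        z2 = self.bn2(self.conv2(a1))
        # a[l+2] = g(z[l+2] + a[l]) -- the shortcut lets gradients flow backwards easily
        return torch.relu(z2 + a_l)
```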

3.4 4) Content-Based Routing

Consider, as a motivating example, a system for order processing. When an incoming order arrives, we first validate it and then check that the ordered item is available in the warehouse; the inventory system performs this second task. This series of processing stages is a perfect candidate for the Pipes and Filters design.[7] We create two filters, one for the validation phase and one for the inventory system, and route incoming messages through both filters[7]. Nevertheless, in many business integration scenarios there is more than one inventory system, and each system can manage only specific items[7].

Figure 3: Example of an inventory check and order routing process

The Content-Based Router[7] examines the content of a message and routes it to another channel based on the information the message contains. The routing can be based on a number of criteria, such as the presence of a field, specific field values, and so on. Great care must be taken when installing a content-based router to keep the routing function easy to maintain, as the router can become a point of regular maintenance. In more complex integration scenarios, the Content-Based Router can take the form of a configurable rules engine that determines the destination channel based on a set of configurable rules[7].
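
As an illustration of the pattern (and not Infolegale's actual routing code), a content-based router can be as simple as a function that inspects the message and returns a destination channel; the channel names, fields and rules below are hypothetical.

```python
# Toy content-based router: inspects the message content and picks a channel.
# Channel names, fields and rules are hypothetical, for illustration only.
def route_order(message: dict) -> str:
    item = message.get("item", "")
    if item.startswith("WGT"):          # widgets handled by inventory system A
        return "widget_inventory_channel"
    if item.startswith("GDG"):          # gadgets handled by inventory system B
        return "gadget_inventory_channel"
    return "manual_review_channel"      # fallback when no rule matches

channels = {
    "widget_inventory_channel": [],
    "gadget_inventory_channel": [],
    "manual_review_channel": [],
}

for msg in [{"item": "WGT-001"}, {"item": "GDG-042"}, {"item": "???"}]:
    channels[route_order(msg)].append(msg)
```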

3.5 5) Input Data Analysis

The data used throughout our study is a set of random legal ads taken from the company's local database. It consists of raw French text with a virtually unlimited vocabulary and no clear formatting pattern. Each ad contains one or more event types that need to be detected. At Infolegale, there are 161 types of events grouped into 7 families. The volume and variety of legal-notice events prompted us to put in place a decision tree through which an ad has to pass in order to determine the events in question. To implement the decision tree, we use the content-based routing process described in the previous paragraph.

Figure 4: Example of a legal ad

During this module, we faced many difficulties, namely:

  • Database problems: the existence of duplicate ads with different identifiers, results polluted by input errors…

  • The analysis of false positives: this step required the expertise of the quality team; since it was not always available, it was difficult to find a time slot to perform the analysis.

  • Grouping events: setting up the decision tree took us a long time. Indeed, grouping ads is not easy and requires a lot of business expertise and performance testing for each attempt.

3.6 6) Multi-label Residual Convolutional Neural Network text classifier using Content-Based Routing implementation process

3.6.1 Events grouping

In order to set up our decision tree, the different event types contained in the ads were analyzed so as to regroup them by contextual meaning. Even though the data presented a huge variety in content, a primary decision tree was put in place. Each of the 7 event families regroups one or more sub-levels of groups and sub-groups, in order to ensure the best possible classification and to predict the corresponding event type with the highest accuracy, precision and recall. The tree consists of 43 nodes over 5 levels, with the 161 events grouped by their meaning and text content. [7] The tree was then implemented in Python using nested dictionaries and sub-classes; a sketch is given below. Each node consists of a group name, a cutoff, a group description and, most importantly, its children. As a matter of fact, the group children are responsible for redirecting the original ad to the corresponding event. Because of the critical nature of this part of the project, a high cutoff value (greater than or equal to 0.999) is assigned to each event.
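
The nested-dictionary implementation could look like the following sketch; only the node fields (name, cutoff, description, children) follow the description above, while the group names, codes and the prediction callback are invented placeholders.

```python
# Sketch of one branch of the decision tree as nested dictionaries.
# Group names, descriptions and event codes are placeholders; only the node
# fields (name, cutoff, description, children) follow the text above.
event_tree = {
    "name": "root",
    "cutoff": 0.999,          # high cutoff, as required for this critical step
    "description": "All legal-ad event families",
    "children": {
        "family_5": {
            "name": "family_5",
            "cutoff": 0.999,
            "description": "One of the 7 event families",
            "children": {
                "5xxx_leaf": {
                    "name": "5xxx_leaf",
                    "cutoff": 0.999,
                    "description": "Leaf node mapping to one event code",
                    "children": {},   # leaf: the ad is routed to its final event code
                },
            },
        },
    },
}

def route(node, predict):
    """Walk the tree: at each node, the classifier (wrapped by `predict`)
    names the child to follow; we descend only if its probability exceeds
    the node's cutoff, otherwise we stop at the current group."""
    while node["children"]:
        label, prob = predict(node)
        if prob < node["cutoff"]:
            break
        node = node["children"][label]
    return node["name"]
```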

3.6.2 Model Building

In order to optimize our Multi-label Residual Convolutional Neural Network, we use Stochastic Gradient Descent (SGD) to minimize the sum of squared residuals[9]. The algorithm is particularly useful in cases where the optimal points cannot be found by setting the slope of the function to 0.

Another point: the data set we use contains events and subgroups of very different sizes, which means that our deep-learning model is likely to quickly over-fit a training set with few examples, resulting in poor performance when the model is evaluated on new data.[2] To overcome this problem we use dropout, which makes the training process noisy, forcing nodes within a layer to probabilistically take on greater or lesser responsibility for the inputs. This conceptualization suggests that dropout breaks up situations where network layers co-adapt to correct mistakes from prior layers, in turn making the model more robust. [2]

Dropout simulates a sparse activation from a given layer, which in turn encourages the network to actually learn a sparse representation as a side effect.[4] As such, it can be used as an alternative to activity regularization for encouraging sparse representations in auto-encoder models.[1] A hyper-parameter is introduced that specifies the probability at which outputs of the layer are dropped out, or inversely, the probability at which outputs of the layer are retained. We use a dropout value of 0.35 for the output of each node in a hidden layer.
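
A minimal sketch of how the 0.35 dropout value and the SGD optimizer might be wired together is given below; PyTorch, the layer sizes, the learning rate and the stand-in classifier head are assumptions, not the exact configuration of our model.

```python
# Sketch: attaching a 0.35 dropout and an SGD optimizer to a classifier head.
# The stand-in layers, learning rate and loss choice are assumptions.
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(256, 128),        # stand-in for the residual feature extractor
    nn.ReLU(),
    nn.Dropout(p=0.35),         # dropout value used in the text
    nn.Linear(128, 161),        # one output per event type
)
optimizer = optim.SGD(model.parameters(), lr=0.01)   # stochastic gradient descent
criterion = nn.MSELoss(reduction="sum")              # sum of squared residuals, as in the text
```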

3.6.3 Model Training

A set of 2.3 million lines of ads has been used to train the model, covering most of the ad types. The groups and subgroups vary in size, ranging from 30 to 400,000 examples. The presented model is designed to be effective in this particular case: it shuffles the data set, divides it into multiple pre-sized batches and iterates over the whole data set. The batch sizes range from 512 to 1024, grown with a 1.001 compounding factor. Once the model has been trained over all batches, it repeats the same process 100 times so that the losses converge toward the minimum. These values are chosen to use our GPU machine with maximum efficiency[8].
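
The training schedule can be sketched as follows; apart from the stated batch-size range (512 to 1024), the 1.001 compounding factor and the 100 repetitions, everything here (in particular the model_step callback) is an assumption for illustration.

```python
# Sketch of the training schedule: shuffled data, batch size growing from 512
# toward 1024 by a 1.001 compounding factor, repeated for 100 passes.
# The model_step callback (one gradient update on a batch) is an assumption.
import random

def train(model_step, dataset, epochs=100, start=512, cap=1024, factor=1.001):
    batch_size = float(start)
    for epoch in range(epochs):
        random.shuffle(dataset)                          # reshuffle each pass
        i = 0
        while i < len(dataset):
            size = int(min(batch_size, cap))
            batch = dataset[i:i + size]
            model_step(batch)                            # one gradient update
            i += size
            batch_size = min(batch_size * factor, cap)   # compound the batch size
```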

3.7 7) Experimental Results and Performance Analysis

After the first model training and implementation, the results were quite impressive for the three metrics (accuracy, precision and recall). The first results per family are as follows:

Cutoff 0.9 0.99 0.999
event_code precision recall precision recall precision recall Support
’1100’ 0.745 0.988 0.757 0.918 0.780 0.707 18611
’2xxx’ 0.556 0.988 0.561 0.963 0.569 0.88 19502
’3xxx’ 0.474 0.995 0.502 0.988 0.546 0.963 5852
’4xxx’ 0.304 0.992 0.314 0.976 0.324 0.913 1716
’5xxx’ 0.775 0.999 0.769 0.999 0.77 0.992 9663
’6xxx’ 0.624 0.996 0.614 0.992 0.635 0.981 1311
’7xxx’ 0.261 0.951 0.248 0.923 0.263 0.878 1844

As we can see, the results are quite promising, and further implementations were made for the rest of the events. What follows are the results for the subfamilies of the common procedures, which are the most sensitive legal ads to decode:

Cutoff 0.9 0.99 0.999
event_code Precision Recall Precision Recall Precision Recall Support
’51xx’ 1 0.979 1 0.912 1 0.67 50000
’52xx’ 1 0.994 1 0.972 1 0.774 50000
’53xx’ 1 0.991 1 0.97 1 0.913 50000
’54xx’ 1 0.99 1 0.937 1 0.841 50000
’55xx’ 1 0.982 1 0.939 1 0.825 50000
’59xx’ 1 0.997 1 0.992 1 0.979 50000

The results of the model are detailed in what follows. We have randomly selected 60 of the events in question due to data protection legislation.

Cutoff 0.9 0.99 0.999
Event Acc Pre Rec Acc Pre Rec Acc Pre Rec Support
’2110’, 0.997 0.984 0.969 0.981 0.987 0.895 0.990 0.992 0.865 7091
’212x’, 0.996 0.999 0.994 0.982 0.997 0.940 0.985 1.000 0.971 52858
’2120’, 0.997 0.999 0.994 0.983 0.997 0.943 0.985 1.000 0.972 52846
’2130’, 0.999 0.994 0.978 0.992 0.998 0.937 0.995 0.997 0.893 4205
’214x’, 0.997 0.999 0.986 0.982 0.999 0.935 0.989 0.999 0.935 17131
’2140’, 0.989 0.968 0.943 0.967 0.962 0.834 0.972 0.978 0.799 12641
’2141’, 0.992 0.949 0.858 0.970 0.979 0.727 0.982 0.976 0.617 4453
’2142’, 0.999 0.818 0.529 0.994 0.979 0.214 0.999 0.917 0.122 90
’215x’, 0.991 0.995 0.974 0.961 0.997 0.908 0.965 0.999 0.884 29460
’2150’, 0.992 0.994 0.974 0.970 0.992 0.907 0.966 0.997 0.877 26796
’2151’, 1.000 0.857 0.825 0.998 0.994 0.761 0.999 0.842 0.352 91
’2152’, 0.998 0.976 0.943 0.986 0.980 0.866 0.994 0.987 0.795 2627
’3100’, 0.995 0.998 0.989 0.992 0.998 0.979 0.978 0.998 0.937 33711
’3101’, 0.995 1.000 0.992 0.989 1.000 0.983 0.973 1.000 0.958 63432
’3102’, 0.997 0.954 0.925 0.995 0.977 0.819 0.990 0.988 0.655 2713
’3210’, 0.999 1.000 0.419 0.999 1.000 0.130 0.999 1.000 0.036 84
’411x’, 0.996 0.994 0.979 0.991 0.995 0.946 0.979 0.996 0.872 6546
’4110’, 0.996 0.992 0.976 0.990 0.994 0.921 0.976 0.994 0.796 4748
’4115’, 0.995 0.950 0.933 0.992 0.963 0.845 0.986 0.969 0.700 1749
’4210’, 0.998 1.000 0.773 0.998 1.000 0.722 0.996 1.000 0.488 295
’43xx’, 0.998 0.999 0.997 0.995 0.999 0.991 0.975 0.999 0.953 21761
’4330’, 0.995 0.997 0.994 0.985 0.998 0.974 0.912 0.998 0.830 21399
’4334’, 0.997 0.963 0.669 0.994 0.938 0.239 0.993 0.955 0.067 314
’511[0,1]’, 0.996 0.999 0.994 0.966 1.000 0.936 0.777 1.000 0.587 1835
’5110’, 0.996 0.998 0.994 0.965 0.999 0.936 0.775 1.000 0.584 1833
’5125’, 0.986 0.990 0.962 0.964 0.993 0.887 0.905 0.994 0.693 1032
’5130’, 0.990 1.000 0.934 0.953 1.000 0.684 0.917 1.000 0.436 500
’5299’, 0.996 1.000 0.248 0.996 1.000 0.133 0.995 0.000 0.000 505
’521[0-2]’, 0.998 0.998 0.994 0.987 0.999 0.948 0.923 0.999 0.680 23961
’5210’, 0.998 0.998 0.991 0.977 0.999 0.904 0.889 0.999 0.533 23745
’5212’, 1.000 0.972 0.972 1.000 0.974 0.887 0.999 0.986 0.630 216
’5227’, 0.999 0.947 0.168 0.999 1.000 0.018 0.999 0.000 0.000 115
’5238’, 0.999 0.971 0.925 0.997 0.979 0.776 0.995 0.986 0.520 1112
’5213’, 0.999 0.983 0.750 0.998 0.945 0.223 0.998 0.941 0.066 242
’5330’, 0.999 0.998 1.000 0.998 0.998 0.999 0.996 0.998 0.994 12965
’5380’, 1.000 1.000 0.995 0.996 1.000 0.927 0.986 1.000 0.771 1652
’5381’, 0.999 0.989 0.993 0.999 0.991 0.980 0.994 0.994 0.880 1371
’5382’, 0.999 0.994 0.977 0.997 0.995 0.946 0.990 0.997 0.775 1207
’5350’, 0.999 0.985 0.957 0.998 0.988 0.903 0.994 0.993 0.650 466
’539[0,1]’, 0.998 1.000 0.993 0.986 1.000 0.952 0.949 1.000 0.829 8241
’5391’, 0.998 0.985 0.973 0.991 0.990 0.852 0.978 0.991 0.608 1529
’5390’, 0.996 0.999 0.986 0.981 0.999 0.920 0.930 0.999 0.708 6671
’5320’, 0.999 0.986 0.863 0.996 0.993 0.582 0.995 0.990 0.414 249
’5310’, 0.996 1.000 0.676 0.993 1.000 0.404 0.990 1.000 0.103 312
’5360’, 1.000 1.000 0.979 0.997 1.000 0.870 0.991 1.000 0.598 629
’5410’, 0.995 0.995 0.980 0.979 0.997 0.904 0.947 0.997 0.748 1930
’5420’, 0.992 1.000 0.290 0.989 1.000 0.047 0.988 0.000 0.000 107
’5520’, 0.993 0.997 0.986 0.981 0.998 0.954 0.940 0.999 0.850 3682
’6111’, 1.000 0.971 0.971 0.999 0.965 0.799 0.999 0.912 0.223 139
’6220’, 0.996 1.000 0.996 0.991 1.000 0.991 1.000 1.000 0.981 30880
’6240’, 1.000 1.000 0.981 0.999 1.000 0.853 0.999 1.000 0.724 312
’721[0,1]’, 0.990 0.998 0.975 0.986 0.999 0.963 0.981 0.999 0.950 16103
’7210’, 0.989 0.994 0.976 0.985 0.995 0.964 0.979 0.996 0.947 15898
’7999_2199’, 0.967 0.998 0.564 0.951 0.997 0.355 0.939 0.997 0.197 3254
’7999’, 0.989 0.983 0.113 0.989 0.971 0.065 0.988 1.000 0.011 522
’71xx’, 0.998 0.999 0.996 0.995 1.000 0.989 0.985 1.000 0.966 18265
’712[5-6]’, 0.991 0.977 0.949 0.980 0.991 0.845 0.941 0.995 0.519 5240
’7125’, 0.991 0.973 0.951 0.979 0.988 0.840 0.939 0.992 0.501 5203
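
As a note on methodology, the precision and recall values at each cutoff reported in the tables above can be reproduced roughly as in the sketch below; scikit-learn and the toy arrays are assumptions, since the paper does not state its evaluation tooling.

```python
# Sketch: per-class precision/recall at a probability cutoff.
# scikit-learn and the example arrays are assumptions for illustration.
import numpy as np
from sklearn.metrics import precision_score, recall_score

def metrics_at_cutoff(y_true, y_prob, cutoff):
    """y_true, y_prob: arrays of shape (n_samples,) for one event code."""
    y_pred = (y_prob >= cutoff).astype(int)
    return (precision_score(y_true, y_pred, zero_division=0),
            recall_score(y_true, y_pred, zero_division=0))

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.95, 0.40, 0.999, 0.80, 0.999])
for c in (0.9, 0.99, 0.999):
    print(c, metrics_at_cutoff(y_true, y_prob, c))
```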

3.8 8) Discussion

We propose a multi-label Residual Convolutional Neural Network text classifier using a Content-Based Routing process, which trains the model using the content-based routing method described above. The main benefit of our approach lies in reducing the number of models that need to be trained later on. With this method, we have succeeded in developing a highly accurate multi-class classification model that classifies all of the 161 different events based on the text. One of the greatest difficulties was handling the variety and veracity of the available data, whose classes came in very different sizes. To overcome this problem, we built a larger dataset containing as many event types as possible.

4 Conclusion

During this module, we developed a multi-label Residual Convolutional Neural Network text classifier using a Content-Based Routing process to predict the final event code for each ad. We developed a decision tree to predict the code with the fewest possible false positives and the most accurate result possible. The tree was implemented using a content-based routing process, which allowed us to validate the model.

5 References

[1] Bodén, Mikael. A Guide to Recurrent Neural Networks and Backpropagation. School of Information Science, Computer and Electrical Engineering, Halmstad University, November 13, 2001.

[2] Graves, Alex; Mohamed, Abdel-rahman; Hinton, Geoffrey. Speech Recognition with Deep Recurrent Neural Networks, 2013.

[3] He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian. Deep Residual Learning for Image Recognition, December 10, 2015.

[4] Khan, Salman; Rahmani, Hossein; Shah, Syed Afaq Ali; Bennamoun, Mohammed. A Guide to Convolutional Neural Networks for Computer Vision. Morgan and Claypool Publishers, 2018.

[5] Lafferty, John; McCallum, Andrew; Pereira, Fernando. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In International Conference on Machine Learning, 2001.

[6] Nielsen, Michael. Using Neural Nets to Recognize Handwritten Digits. In Neural Networks and Deep Learning, June 2019.

[7] Hohpe, Gregor; Woolf, Bobby. Enterprise Integration Patterns. Addison-Wesley, October 10, 2003.

[8] Serra, Joan; Karatzoglou, Alexandros. Getting Deep Recommenders Fit: Bloom Embeddings for Sparse Binary Input/Output Networks, June 13, 2017.

[9] Sherstinsky, Alex. From RNN to Vanilla LSTM Network. In Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network, chapter 5, page 17, Directly Software Incorporated, August 9, 2018.