So What's the Plan? Mining Strategic Planning Documents

07/01/2020
by   Ekaterina Artemova, et al.
ABBYY

In this paper we present RuREBus, a corpus of Russian strategic planning documents. The project is grounded in both language technology and e-government perspectives: we develop not only new language resources and tools, but also their applications to e-government research. We demonstrate the pipeline for creating a text corpus from scratch. First, the annotation schema is designed. Next, texts are marked up using a human-in-the-loop strategy, in which preliminary annotations are derived from a machine learning model and then manually corrected. The amount of annotated text is large enough to showcase what insights can be gained from RuREBus.


1 Introduction

Each Russian federal and municipal subject publishes several strategic planning documents per year, related to various directions of region development such as medicine, education, ecology, etc. The tasks, presented in the strategic planning documents, should meet several criteria, such as to fit into the budget and be economically sensible, be aligned with the state and local policy and satisfy the population’s expectations. Modern language technology helps to assess strategic planning documents and to gain insights into the general directions of development.

In this project, we intend to demonstrate the potential and the value of text-driven analysis when applied to strategic planning. Our approach exploits a common NLP pipeline, which involves annotating a large number of texts and training machine learning models. The most important steps of the pipeline are shown in Figure 1. This pipeline is justified by recent success in other applications, such as processing of clinical stories [holderness2019proceedings] or statements of claim [dale2019law]. We not only annotate a large number of texts according to an in-house annotation scheme, but also use a number of supervised techniques that extract high-quality information from the documents and convey a more specific and comprehensive picture of strategic planning and its significance.

Figure 1: Common NLP pipeline.

Government documents have not been widely subjected to processing and analysis. We therefore need to develop the whole domain-specific pipeline of annotation, information extraction and pre-training of language models. To showcase the capabilities of language technology, we present an annotation schema for marking up named entities and relations, exploit active learning to annotate hundreds of documents, and use state-of-the-art methods for named entity recognition and relation extraction to facilitate manual annotation.

The contributions of the project are threefold. First, a new dataset is being developed, which can be used by both the computer science and economics communities for further studies. Second, a number of tools for processing documents in Russian will be released. Third, the dataset will help to conduct a detailed analysis of strategic documents and to compare federal subjects and administrative districts in terms of their goals and budget requirements.

2 Why we build the corpus

The annotation schema we have developed for this project provides a powerful tool for strategic document analysis. Indeed, listing its possible applications in the domain is one of the main contributions of this paper (see Sections 5-6 for details). However, what is arguably even more important from a pure natural language processing perspective is that our dataset can serve as a more fitting case study for structuring unstructured information than other existing datasets. This is a rather bold claim that we intend to argue for in this section.

Structuring unstructured information, or more specifically converting data from text form into a database-friendly (i.e. table) form, is one of the most popular NLP business applications. The standard techniques used to solve this task are named entity recognition (NER) and relation extraction (RE). Both NER and RE are well studied, and there exist popular academic benchmarks for both tasks (CoNLL-2003 [tjong-kim-sang-de-meulder-2003-introduction], ACE-2004 and ACE-2005 [doddington-etal-2004-automatic, ace04, ace05], SemEval-2010 Task 8 [hendrickx-etal-2010-semeval], FactRuEval-2016 [f4aa6d516fa74e5992076aa5d3b2a5f1]). There are, however, several important differences between any of the aforementioned benchmarks and a typical business case dataset, the most important one being as follows.

Business case texts are usually domain-specific (e. g. legal) texts that can contain less than perfect language or other irregularities (ponderous sentences with complicated syntactic structure, slang etc.). Academic baselines, on the other hand, typically consist of well-written news or biography texts without any irregularities of this kind.

To sum up, the popular perception that NER (and, to a lesser extent, RE) is basically a solved task may to a large extent be a product of existing academic benchmarks. While recent years have brought several major breakthroughs in these tasks, the results one can obtain on real-world client corpora are often much more modest than those reported by scholars [lin-etal-2019-reliability].

Given these considerations, we decided to create a corpus that is closer to industrial NER and RE applications than existing academic ones. Our corpus consists of unadapted domain-specific texts with many of the irregularities found in practical applications. Hopefully, it can be a better benchmark for checking the suitability of a particular NER or RE model for business scenarios.

While this particular use of our corpus is not the main focus of this paper, we felt it important to state one of our key motivations for creating it.

3 Related work

To the best of our knowledge, there are few NLP applications to the e-Government domain in general and to strategic planning in particular. [alekseychuk2019processing] present the only project on unsupervised analysis of strategic planning documents. Other e-Government applications include processing country statements, governmental websites, e-petitions and social media sources.

3.1 Processing e-government documents

NLP methods make it possible to extract and structure information about governmental activity. Baturo and Dasandi [UN] used topic modeling to analyze the agenda-setting process of the United Nations based on the UN General Debate corpus [dataset], consisting of over 7300 country statements from 1970 to 2014. In [China] Shen et al. explored Web data and government websites in Beijing, Shanghai, Wuhan, Guangzhou and Chengdu to conduct a comparative analysis of e-government development in the five metropolises. Albarghothi et al. [arabic] introduced an Automatic Extraction Dataset System (AEDS) tool that constructs an ontology-based Semantic Web from Arabic web pages related to Dubai’s e-government services. The system automatically extracts textual data from the website, detects keywords, and finally maps the page to an ontology via the Protégé tool.

3.2 Processing petitions

NLP methods are widely used to aggregate and summarize public opinion expressed in the form of electronic petitions. The concept of e-democracy implies open communication between government and citizens, which in most cases involves the processing of a large amount of unstructured textual information [India]. Rao and Dey describe a scheme of citizens’ and stakeholders’ participation in Indian e-governance that allows the government to collect feedback from citizens and adjust policies and acts accordingly.

Evangelopoulos and Visinescu [US_people] analyze appeals to the U.S. government, in particular SMS messages from Africans sent during Barack Obama’s visit to Ghana in July 2009 and data from the SAVE Award, an initiative aiming to make the U.S. government more effective and efficient at spending taxpayers’ money. For each corpus, the authors extracted key topics with Latent Semantic Analysis (LSA) to explore trends in public opinion.

Suh et al. [Korea] applied keyword extraction algorithms and k-means clustering to detect and track petition groups on the Korean e-People petition portal. To forecast future petition trends, radial basis function neural networks were used.

3.3 Processing Russian documents

Metsker et al. [metsker2019natural] process almost 30 million Russian court decisions to estimate the effectiveness of legislative change and to identify regional features of law enforcement. Alekseychuk et al. [alekseychuk2019processing] use unsupervised techniques such as topic modelling and word embeddings to induce a taxonomy of regional strategic goals extracted from strategic documents. This study motivated us to delve deeper into the corpus of strategic documents. Our work extends that project: we annotate the corpus with a detailed relation scheme, which allows for more detailed analysis.

4 Corpus annotation pipeline

4.1 Annotation Guidelines

Figure 2: Annotation interface for assigning entities and relations.

We develop guidelines for entity and relation identification in order to maintain uniformity of annotation in our corpus [rurebus]. Figure 2 presents the annotation interface for assigning entities and relations.

We define eight types of entities:

  1. MET (metric)

  2. ECO (economics)

  3. BIN (binary)

  4. CMP (compare)

  5. QUA (qualitative)

  6. ACT (activity)

  7. INST (institutions)

  8. SOC (social)

These entities are associated with 11 semantic relations of 5 types:

  1. Current situation:
       (a) Negative: NNG (Now NeGative)
       (b) Neutral: NNT (Now NeuTral)
       (c) Positive: NPS (Now PoSitive)

  2. Implemented changes/results (about the past):
       (a) Negative: PNG (Past NeGative)
       (b) Neutral: PNT (Past NeuTral)
       (c) Positive: PPS (Past PoSitive)

  3. Forecasts:
       (a) Negative: FNG (Future NeGative)
       (b) Neutral: FNT (Future NeuTral)
       (c) Positive: FPS (Future PoSitive)

  4. GOL (abstract goals)

  5. TSK (specific tasks)

All annotations were obtained using the brat rapid annotation tool (BRAT) [stenetorp2012brat]. Each document in the corpus was annotated by two annotators independently, and disagreements were resolved by a moderator. All annotation instructions are available at the GitHub repository (https://github.com/dialogue-evaluation/RuREBus). Below we describe the entity types.

4.2 Entity Descriptions

BIN is a one-time action or binary characteristic. These entities represent one-time events such as construction, development, stimulation, formation, implementation, acquisition, involvement, absence, diversification, modernization, etc.

MET is a numerical indicator or an object on which a comparison operation is defined. In particular, these entities often describe labor productivity, planned and actual values of indicators, seismicity of a territory, the probability of a violation, economic growth, the degree of deterioration of a building, etc.

QUA represents a quality characteristic. Annotators were asked to identify spans of texts such as high, ineffective, limited, big, weak, safe as QUA entities.

CMP represents a comparative characteristic. Annotators were asked to identify spans of texts as CMP entities, associated with increasing, saturation, increasing decrease, activation, exceeding indicators, positive dynamics, improvement, expansion.

SOC is an entity related to social rights or social amenities. SOC entities often describe country population, housing quality, social protection, leisure activities, historical heritage, folk art, etc.

INST entities represent various institutions, structures and organizations. In particular, annotators were asked to mark cultural and leisure facilities, family and child support organizations, cultural center as INST.

ECO is defined as an economic entity or infrastructure object. Entities of this type are associated with biological resources, innovative potential, domestic market, regional economy, energy balance, budget financing, fishing fleet, roads, library and museum funds, etc.

ACT is an event or specific activity. These entities are often combined with BIN, e.g., launched an educational project, where launched is marked as BIN and an educational project as ACT. Entities of this type are associated with events like drug addiction prevention, orphan prevention, educational projects, psychological assistance.

4.3 Relations Descriptions

GOL represents the aims and goals of a program. It is used to describe the changes and objectives that are expected to be achieved as a result of the actions proposed by the program.

TSK denotes concrete actions planned by the program. The main difference between TSK and GOL is that the latter describes “what” the program aims to achieve, while the former states “how” it will be done.

The other nine relations describe perceptions of the past, present and future state of affairs. Past relations (PPS, PNG, PNT) describe the previous situation. Present relations (NPS, NNG, NNT) describe the current situation. The last triplet (FPS, FNG, FNT) covers predicted trends, metrics or consequences of the program in a long-term perspective.

Table 1 presents examples of annotated relations.

Relation  Example (ENG)                        Example (RU)
GOL       CMP improving SOC public health      CMP укрепление SOC здоровья населения
TSK       BIN halting ECO drug trafficking     BIN пресечение ECO нарко трафика
FPS       CMP reduction of MET mortality rate  CMP снижение MET уровня смертности

Table 1: Examples of annotated relations.
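For readers who want to process the annotations programmatically, the sketch below parses the standard BRAT standoff layout, in which entity lines start with T and relation lines start with R. The sample text, offsets and identifiers are hand-written to mirror the GOL row of Table 1; that the released files follow exactly this layout is an assumption, and the helper names (Entity, Relation, parse_ann) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    id: str
    type: str
    start: int
    end: int
    text: str


@dataclass
class Relation:
    id: str
    type: str
    arg1: str
    arg2: str


def parse_ann(ann_text):
    """Parse entity (T) and binary relation (R) lines of a BRAT standoff file."""
    entities, relations = {}, []
    for line in ann_text.splitlines():
        fields = line.split("\t")
        if line.startswith("T") and len(fields) == 3:
            # e.g. "T1<TAB>CMP 0 10<TAB>укрепление"; for a discontinuous span
            # only the first start and the last end offset are kept.
            parts = fields[1].split()
            entities[fields[0]] = Entity(
                fields[0], parts[0], int(parts[1]), int(parts[-1]), fields[2]
            )
        elif line.startswith("R") and len(fields) >= 2:
            # e.g. "R1<TAB>GOL Arg1:T1 Arg2:T2"
            rel_type, arg1, arg2 = fields[1].split()
            relations.append(
                Relation(fields[0], rel_type, arg1.split(":")[1], arg2.split(":")[1])
            )
    return entities, relations


# Illustrative sample (not copied from the corpus files).
SAMPLE_TEXT = "укрепление здоровья населения"
SAMPLE_ANN = (
    "T1\tCMP 0 10\tукрепление\n"
    "T2\tSOC 11 29\tздоровья населения\n"
    "R1\tGOL Arg1:T1 Arg2:T2\n"
)

entities, relations = parse_ann(SAMPLE_ANN)
```

Each relation can then be resolved to its two entity spans through the entities dictionary, which is the form used in the relation-level analyses of Section 6.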

4.4 Active learning

We also employ an active learning technique [DBLP:journals/corr/ShenYLKA17]. We first obtained a subset of the corpus manually annotated with this set of named entities and relations. We then trained a NER model and used it to mark up unlabeled documents. These pre-annotated documents were edited by annotators and verified by moderators. Each newly corrected part of the corpus was added to the training set, and the model was retrained.
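The annotate, correct and retrain cycle described above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the project's implementation: the function names (human_in_the_loop, toy_train, toy_correct) are hypothetical, the toy "model" simply memorises words seen as entities, and no uncertainty-based sample selection is shown because the text does not describe one.

```python
def human_in_the_loop(seed_labeled, unlabeled_batches, train, correct):
    """Grow a labeled corpus batch by batch.

    train(labeled)       -> model, a callable mapping a document to a draft annotation
    correct(doc, draft)  -> final annotation (stands in for annotators and moderator)
    """
    labeled = list(seed_labeled)
    model = train(labeled)
    for batch in unlabeled_batches:
        # 1. The current model pre-annotates the new documents.
        drafts = [(doc, model(doc)) for doc in batch]
        # 2. Humans fix the drafts.
        corrected = [(doc, correct(doc, draft)) for doc, draft in drafts]
        # 3. The corrected documents join the training set and the model is retrained.
        labeled.extend(corrected)
        model = train(labeled)
    return labeled, model


# Toy stand-ins (assumptions for illustration only).
def toy_train(labeled):
    known = {word for _, tags in labeled for word in tags}
    return lambda doc: {word for word in doc.split() if word in known}


def toy_correct(doc, draft):
    # Pretend the gold entities are exactly the capitalised words.
    return {word for word in doc.split() if word[0].isupper()}


seed = [("Build Roads", {"Build", "Roads"})]
batches = [["Build Schools now"], ["Repair Schools"]]
final_labeled, final_model = human_in_the_loop(seed, batches, toy_train, toy_correct)
```

In a real setting, train would fit the NER model on the labeled documents and correct would be the manual editing and moderation step in BRAT.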

In this work we employ a char-CNN-BiLSTM-CRF NER model (proposed by Lample et al. [lample-etal-2016] and further developed by Ma and Hovy [ma2016endtoend]). This architecture is widely used as a robust baseline in sequence tagging tasks. We use FastText [bojanowski2017enriching] embeddings trained by RusVectores [KutuzovKuzmenko2017]. For relation extraction we also employ morphological, syntactic and semantic features obtained from Compreno [ZuyevK2013StatiSticalMT, anisimovich2012], as well as some hand-crafted features such as capitalization templates and the dependency tree distance between relation members.

5 How to utilize named entities

In this and the following sections we provide an in-depth analysis of the annotated corpus and showcase applications of textual analysis based on the proposed annotation schema. We start with a description of the annotated entities and provide insights into strategic planning based on entity-level analysis. Then we take the analysis to the next level and explore relations between entities and the way these relations help to structure information from strategic documents.

5.1 Basic statistics

Entity  Total  Mean len (std)
BIN     14236  1.05 (0.28)
MET     6377   4.23 (3.50)
QUA     3611   1.14 (0.52)
CMP     4149   1.16 (0.78)
SOC     5037   2.77 (2.31)
INST    3756   3.69 (2.81)
ECO     11422  2.78 (2.19)
ACT     5800   4.74 (4.57)
Table 2: Statistics of annotated entities.

In this section we provide basic statistics based on annotated entities. There are 188 annotated documents in the training set; the average number of named entities per document is 289, and the mean document length is 1787 tokens. All token-based statistics were obtained using the razdel tokenizer (https://github.com/natasha/razdel).

Named entity types are highly imbalanced, which may lead to significant problems when training a classifier. However, we believe that the proposed corpus design represents real-life situations well. These difficulties should inspire researchers to invent more sophisticated solutions, rather than prevent them from approaching the task.
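As a quick consistency check, the per-document average and the degree of imbalance can be recomputed from the totals in Table 2. The snippet below only uses the numbers reported above; the variable names are illustrative.

```python
# Entity totals copied from Table 2; 188 is the number of training documents.
ENTITY_TOTALS = {
    "BIN": 14236, "MET": 6377, "QUA": 3611, "CMP": 4149,
    "SOC": 5037, "INST": 3756, "ECO": 11422, "ACT": 5800,
}
N_DOCUMENTS = 188

total_entities = sum(ENTITY_TOTALS.values())
mean_per_document = total_entities / N_DOCUMENTS

# Share of each type in the whole set of annotated entities.
shares = {name: count / total_entities for name, count in ENTITY_TOTALS.items()}

most_common = max(ENTITY_TOTALS, key=ENTITY_TOTALS.get)
least_common = min(ENTITY_TOTALS, key=ENTITY_TOTALS.get)
imbalance_ratio = ENTITY_TOTALS[most_common] / ENTITY_TOTALS[least_common]
```

The totals add up to 54388 entities, which gives roughly 289 per document, in line with the figure quoted in the text; the most frequent type (BIN) occurs almost four times as often as the least frequent one (QUA).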

5.2 Named entity clustering

The main part of RuREBus is the annotation of named entities. The entity types (such as ‘activity’ or ‘institution’) are too broad for strategic analysis and planning. Clustering entities into fine-grained subsets that each represent some concept could be more useful in specific practical applications. Therefore, in this section we show how to use a simple technique to find semantically related subgroups of named entities. For instance, a cluster of entities may represent specific socially oriented measures (such as ‘prevention of drug usage in youth’). We demonstrate how to use modern natural language processing methods to find semantic clusters of annotations.

The clustering procedure consists of the following steps. First, we preprocess texts: all textual representations of entities are lowercased and duplicates are removed. Then, we represent each entity with a vector, or embedding. Finally, we apply the k-means algorithm to find clusters.

The steps above are applied to each entity type separately. In the second step, each named entity is represented by a 1024-dimensional vector (embedding). These vectors were calculated using a combination of the FastText model pre-trained on Russian Wikipedia and a bidirectional LSTM model (both models are implemented in the Flair library [akbik2018coling]).

In the last step we use the embeddings to find clusters with the k-means clustering algorithm. The number of clusters k is a very important parameter; it varies dramatically from one entity type to another. For example, entities of the QUA and CMP types have fewer subgroup variants, so a reasonable number of clusters for these types is smaller than for ECO or MET. To select the number of clusters we use classical silhouette analysis [rousseeuw1987silhouettes]; k-means clustering was performed with the open-source scikit-learn library.
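The selection of the number of clusters by silhouette analysis can be illustrated with a small self-contained sketch. It is not the code used in the project: it replaces scikit-learn with a plain-Python k-means using a deterministic farthest-point initialisation (a simplification; scikit-learn's KMeans defaults to k-means++), and it runs on made-up 2-D points instead of 1024-dimensional entity embeddings. With scikit-learn, KMeans and silhouette_score would play the roles of the two helper functions.

```python
import math


def kmeans(points, k, n_iter=100):
    """Plain k-means with farthest-point initialisation; returns one label per point."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; points in singleton clusters score 0."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items() if key != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Toy stand-in for entity embeddings: three well-separated groups of 2-D points.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (0, 10), (0, 11), (1, 10), (1, 11),
]
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
```

On this toy data the silhouette score peaks at three clusters, matching the three groups. On real embeddings the curve is usually flatter, so the chosen k should be treated as a starting point for manual inspection.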

In the rest of the section we briefly discuss the results of clustering. To present the results, we visualise a small fraction of the clusters (only 6 clusters are shown in Figure 3). One can see that entities have been clustered into different groups.

Figure 3: Projection of 6 clusters obtained for the SOC type (each color represents a cluster of named entities).

We analyzed in detail the clusters derived within the SOC type. These clusters have complex structures, e.g. they contain closely related entities (‘youth’, ‘health of youth’, ‘patriotic education of youth’), hierarchies of entities (‘family’, ‘youth family’, ‘single parent youth family’) as well as subgroups of opposite entities (‘unity of the Russian nation’ and ‘separatism’). This gives an interesting insight into how regions view the concept of youth and related entities in their strategic programs. Moreover, one can use the clusters to separate different subtypes of relations, e.g. goals (GOL) related to the ‘Family’ cluster in a specific region of Russia. The following list represents an example of top entities in a cluster related to ‘Youth’:

  • “patriotic education of children and youth” (патриотического воспитания детей и молодежи);

  • “self-realization of children and youth” (самореализации детей и молодежи);

  • “education of the younger generation and youth” (воспитания подрастающего поколения и молодежи);

  • “self-realization of youth” (самореализации молодежи).

Similar structures can be found in other typical clusters, which makes the whole corpus a very interesting resource for social, economic and geographic studies.

Finally, we list the types of named entities along with the clusters that were found (Table 3). The clusters clearly correspond to the intended meaning of the entity types and enable a detailed analysis of the annotated documents.

Entity  Cluster names
BIN     Done, Impossible, Negotiation, Creation, Improvement, Change, Necessity, …
ACT     Programs and Events, Management, Organization, Support, Repair, …
MET     KPIs, Quantity, Effectiveness, Extent, Level, …
QUA     Positive, Negative, Insufficient, Significant, Redundant, …
CMP     Increasing, Decreasing, More than / Less than, Negative dynamics, …
SOC     Demographic trends, Education, Science, Culture, Sport, Family, …
INST    Enterprises, Departments, Regions, Executive authorities, …
ECO     Industries, Innovations, Budgets, Taxes, Infrastructure, Energy, …
Table 3: Typical clusters for entity types.

5.3 What actions are being taken?

ACT is an entity that describes what actions should be taken in order to complete the tasks and to reach the goals. We can think of two scenarios for action-based analysis. First, we presume that all actions are planned more or less in the same fashion by different regions, as there are a lot of common goals. However, there might be some unique actions, such as “creation of Cossack youth centers” (создание казачьих молодежных центров), which reveal either some specific needs of the region or unreasonable expenses. Second, we can estimate the cost of actions based on data from previous years. This will enable on-the-fly evaluation of the strategic program budget.

6 How to utilize relations

6.1 Basic statistics

Table 4 shows the number of relation occurrences in the training set. The average number of relations per document is 67. We also calculated the mean number of tokens between the named entity spans participating in a relation; the result is 0 for almost all relation types.

RE   Total  RE   Total  RE   Total
GOL  3563   TSK  4613
NPS  755    NNG  844    NNT  534
PPS  528    PNG  84     PNT  190
FPS  1167   FNG  229    FNT  141

Table 4: Relation statistics.
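The relation codes encode a time frame (first letter) and a polarity (last two letters), so the counts in Table 4 can be regrouped along either axis. The sketch below does this with the reported totals; the grouping dictionaries are an illustrative reading of the naming scheme from Section 4.1.

```python
# Relation totals copied from Table 4; 188 is the number of training documents.
RELATION_TOTALS = {
    "GOL": 3563, "TSK": 4613,
    "NPS": 755, "NNG": 844, "NNT": 534,
    "PPS": 528, "PNG": 84, "PNT": 190,
    "FPS": 1167, "FNG": 229, "FNT": 141,
}
N_DOCUMENTS = 188

TIME = {"N": "present", "P": "past", "F": "future"}
POLARITY = {"PS": "positive", "NG": "negative", "NT": "neutral"}

by_time, by_polarity = {}, {}
for code, count in RELATION_TOTALS.items():
    if code in ("GOL", "TSK"):
        continue  # goals and tasks carry no time frame or polarity
    time_frame, polarity = TIME[code[0]], POLARITY[code[1:]]
    by_time[time_frame] = by_time.get(time_frame, 0) + count
    by_polarity[polarity] = by_polarity.get(polarity, 0) + count

total_relations = sum(RELATION_TOTALS.values())
mean_per_document = total_relations / N_DOCUMENTS
```

The totals sum to 12648 relations, or about 67 per document, which agrees with the average stated above. Goals and tasks together account for roughly two thirds of all relations, and positive assessments outnumber negative ones.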

6.2 Is change always good?

Some types of relations allow us to evaluate ongoing changes. Positive assessments of changes are expressed by NPS and PPS relations. For example, NPS(CMP, MET): “decrease” (снижение) – “gas prices” (цен на бензин) or PPS(CMP, MET): “increased” (увеличивались) – “cash income of the population” (денежные доходы населения).

Negative assessments are expressed by NNG and PNG relations. For example, NNG(MET, QUA): “the housing cost” (стоимость жилья) – “high” (высокая) or PNG(MET, CMP): “the population size” (численность населения) – “decreased” (сократилась).

Neutral assessments of changes are expressed by NNT and PNT relations. For example, NNT(SOC, QUA): “quality of life” (качество жизни) – “fair” (удовлетворительное) or PNT(BIN, ECO): “the investment project” (инвестиционный проект) – “developed” (разработан).

The texts of the collection contain assessments of qualitative changes (i.e. a situation is compared between the past and the present) or assessments of the current state of affairs without comparison with the past. Therefore, entities involved in these relations are usually of CMP or QUA type.

An analysis of such relations could be useful for strategic planning of social and economic development of the country’s regions. Such relations make it possible to judge how the implemented changes actually affect the life of society. However, it should be noted that assessments are subjective, and do not always coincide with the conventional wisdom.

6.3 Do the tasks meet the goals?

Goals and the tasks necessary to achieve them are expressed as relations between entities. We consider binary relations only, i.e. relations between two entities. For example, a goal can be expressed as a relation between a CMP entity and a MET entity: “improvement” (повышение) – “accessibility of transport” (доступность транспорта). A task can be expressed as a relation between a BIN entity and an ECO entity: “commissioning” (ввод) – “new metro lines” (новые линии метро).

The presence of goals and tasks, expressed as fragments of text, allows us to measure the similarity between them. Different similarity types can be considered:

  1. co-occurrence frequency: if a goal and a task are frequently used in the same documents, there is a strong association between them.

  2. semantic similarity: if a goal and a task consist of words that share similar meaning, such as “transport” (транспорт) and “metro lines” (линии метро), there is a semantic association between them.

  3. topic similarity: if a goal and a task belong to the same topic, there is a topical association between them; for example, the goal “road development” (развитие дорожной сети) and the task “reduction in the number of road accidents” (снижение числа дорожно-транспортных происшествий) both belong to the topic of “transport”.
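The first two similarity types can be made concrete with a small sketch. The choices here are assumptions made for illustration, since the text does not fix the exact measures: co-occurrence is scored as the Jaccard overlap of the document sets mentioning the goal and the task, and semantic similarity as the cosine between embedding vectors. The documents and the two-dimensional vectors are invented toy data.

```python
import math


def cooccurrence(goal, task, documents):
    """Jaccard overlap between documents mentioning the goal and those mentioning the task."""
    goal_docs = {i for i, doc in enumerate(documents) if goal in doc["goals"]}
    task_docs = {i for i, doc in enumerate(documents) if task in doc["tasks"]}
    union = goal_docs | task_docs
    return len(goal_docs & task_docs) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy corpus: goals and tasks extracted per document (hypothetical).
documents = [
    {"goals": {"accessibility of transport"}, "tasks": {"new metro lines", "road repair"}},
    {"goals": {"accessibility of transport"}, "tasks": {"new metro lines"}},
    {"goals": {"public health"}, "tasks": {"road repair"}},
]

# Toy embeddings (hypothetical values, two dimensions for readability).
embeddings = {
    "transport": (0.9, 0.1),
    "metro lines": (0.8, 0.3),
    "folk art": (0.1, 0.9),
}

metro_score = cooccurrence("accessibility of transport", "new metro lines", documents)
repair_score = cooccurrence("accessibility of transport", "road repair", documents)
related = cosine(embeddings["transport"], embeddings["metro lines"])
unrelated = cosine(embeddings["transport"], embeddings["folk art"])
```

In this toy corpus the goal always appears together with "new metro lines" but only sometimes with "road repair", so the first pair gets the higher co-occurrence score. Topic similarity would additionally require a topic model and is not sketched here.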

Measuring similarity helps to reveal whether a goal was split into tasks reasonably. If there are no tasks similar to the stated goal, achieving this goal in practice becomes unlikely. The opposite might be the case, too: the absence of similar tasks may reflect an unrealistic goal.

At the same time, being able to extract all tasks may help to group them according to similarity measures, to find similar or even overlapping tasks, and then to order them according to their complexity or urgency.

We can employ the similarity measures mentioned above to align goals and tasks declared by federal and municipal subjects. A goal declared by a federal subject may be supported by smaller goals declared by its subdivisions, namely, municipal subjects. Although it is not necessary for all municipal subjects to share goals, the absence of common goals can reveal potential managerial and administrative weaknesses.

At the same time, goals declared by municipal subjects should follow the main development direction. If the average similarity between the goals declared on different levels is low, it means that the region lacks coherent coordination of planning authorities.

To conclude, the analysis of goal and task relations can be used in several ways. It can be applied both to a single strategic document and to multiple strategic documents prepared in a region. In the first case, the relation analysis can help to structure goal setting along with goal decomposition and task prioritization. In the second case, the relation analysis makes it possible to discover coherence problems between different levels of subdivisions.

6.4 Temporal analysis of past and present relations

The time component of extracted relations (current state / implemented changes / forecasts) allows us to measure the proportion of the work done to the planned work, in other words, understand whether a document contains a report on the work done rather than a plan for future work. Such a simple metric can monitor whether there is any success in reaching goals within a region over years, or documents contain only plans.

Republic, Region                                 Done work to plans ratio
Komi Republic, Pechora Municipal District        0.596
Ryazan Region, Mikhailovsky Municipal District   0.523
Voronezh Region, Khlebenskoe                     0.06
Moscow Region, Dmitrovsky                        0.02
Table 5: Ratio of done work to planned work.
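One possible way to compute such a ratio from relation counts is sketched below. The exact formula behind Table 5 is not given in the text, so the definition used here is an assumption: "done" is taken to be the number of past relations (PPS, PNG, PNT) and "planned" the number of forecasts, goals and tasks (FPS, FNG, FNT, GOL, TSK). The function name and the example counts are hypothetical.

```python
# Assumed grouping of relation types; not confirmed by the source.
PAST = ("PPS", "PNG", "PNT")
PLANNED = ("FPS", "FNG", "FNT", "GOL", "TSK")


def done_to_plans_ratio(relation_counts):
    """Ratio of past-oriented to plan-oriented relations in one document (or region).

    Returns None when the document contains no plan-oriented relations.
    """
    done = sum(relation_counts.get(code, 0) for code in PAST)
    planned = sum(relation_counts.get(code, 0) for code in PLANNED)
    return done / planned if planned else None


# Hypothetical counts for two documents.
report_like = {"PPS": 3, "PNT": 1, "GOL": 4, "TSK": 4}
plan_only = {"GOL": 10, "TSK": 15, "FPS": 5}

report_ratio = done_to_plans_ratio(report_like)
plan_ratio = done_to_plans_ratio(plan_only)
```

Under this definition a value near zero indicates a document that contains almost only plans, while larger values indicate that a substantial part of the text reports on completed work.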

Furthermore, we can match forecasts from previous years’ plans with the descriptions of implemented changes of current year to check whether the plans were implemented or not.

Altogether, we can automatically calculate the percentage of goals stated in previous years and achieved this year. In addition, we could observe the tendencies in some key metrics over time and regions from current state relations, for example, how the youth crime level has changed in some region during the last 5 years.

7 Conclusion

The exponential growth in the volume of information overwhelms many domains of human activity, including state regulation and planning. As government documents exist in the form of written text, the role of language technology is increasingly important. In this paper we showcase two well-known tasks, named entity recognition (NER) and relation extraction (RE), formulated for the strategic planning domain.

In this ongoing project we intend to carry out the full cycle of language technology development: we start from raw texts, elaborate an annotation schema, annotate hundreds of documents with the help of a human-in-the-loop approach, and train domain-specific models for the tasks under consideration. Finally, we design a few analytical applications, which demonstrate the relevance and validity of the designed language resource. We have not only implemented the whole NLP pipeline for a novel application, but also shown that governmental documents can be subjected to computational analysis. Future research on strategic planning and e-Government will be able to benefit from the developed methods and tools, as we release all results and code in open access. These can be used to extract knowledge and gain insights from strategic planning documents or can be applied to other domains.

The future work directions reflect the rapid development of language technology. A large language model may be trained on the strategic documents to enhance the quality of the downstream tasks, NER and RE. The project may also benefit from recent cross-lingual methods, as this would make it possible to compare strategic planning in different countries.

Acknowledgements Work on corpus annotation and manuscript was carried out by Ekaterina Artemova, Elena Tutubalina, and Veronika Sarkisyan and was funded by the framework of the HSE University Basic Research Program and Russian Academic Excellence Project ‘‘5-100’’. Work on annotation of the part of the corpus was carried out by Tatiana Batura and was funded by RFBR according to the research project N 19-07-01134.

References