Robotic Process Automation (RPA) aims to provide a supporting framework to automate the long tail of processes which involve routine tasks, structured data and deterministic outcomes. RPA supports end-users in the automation of existing processes without requiring knowledge of a programming language.
Even though there has been an increase in investments in the area of RPA, it still relies on the explicit encoding of rules and configurations for process generation, with limited use of AI methods. Agostinelli et al. provide an analysis of several RPA tools; none of the studied tools has self-learning abilities, and none is able to automatically understand which actions belong to which process (intra-routine learning) or which processes are available for automation (inter-routine learning).
The goal of Intelligent Process Automation (IPA) is to generalise RPA, providing the tools to create complex workflows with minimal user interference. The IEEE Standards Association defines Intelligent Process Automation as “a preconfigured software instance that combines business rules, experience based context determination logic, and decision criteria to initiate and execute multiple interrelated human and automated processes in a dynamic context.”
Consider, for example, a user that works in the IT department of a company and is responsible for redirecting tickets from the request system to the correct department. RPA would allow the user to automate this process by manually generating a set of rules capable of redirecting each ticket to the appropriate department; however, if the process is complex, with several variables, the implementation might be costly and slow. IPA systems, on the other hand, would observe the user’s actions and detect the patterns between different requests and the redirected departments. With sufficient examples and the clarification of the user intents, it would automate the process, minimising human intervention.
So far, the field of IPA has been largely driven by systems and use cases, lacking a more formal definition of the task and its systematic evaluation. This paper aims to address this gap by focusing on the following contributions:
Providing a formalisation of IPA.
Proposing specific metrics to support the empirical evaluation of IPA methods and systems.
Introducing a new benchmark for the evaluation of IPA.
Comparing and contrasting IPA against related tasks such as end-user programming and program synthesis.
This paper is organised as follows: Section 2 presents the formalisation of IPA and related concepts. Section 3 introduces the modalities and tasks that are part of IPA. Section 4 analyses research areas that are similar to IPA. Section 5 then uses the formalism defined previously to define metrics for the different IPA tasks. Section 6 presents the methodology used for constructing a benchmark for evaluating IPA tasks. Finally, Section 7 concludes this work.
At the center of IPA is the capture and formalisation of a workflow (the interaction between end-user actions, software and data artefacts within an end-user interface environment) in a formal language which can be used to re-enact the workflow.
IPA formal languages have particular properties which aim at eliciting the end-user actions embedded within a workflow. The formalisation of these properties is a pre-requisite for the definition of an evaluation methodology. It is through this language that we may understand and evaluate a given interactor, such as the Sikuli GUI automation tool (http://sikulix.com/). Operating directly with the interactor allows us to evaluate the interaction of the end processes with the same interfaces a human would encounter, rather than the extremely varied computations that underlie the systems with which it interacts.
To this end, we propose an abstract interaction language with which to describe the kind of interface-centred activity to be mimicked by IPA. We first define it purely syntactically, and then describe its intended instantiation in terms of GUIs and GUI interactions.
An interface is a finite set of symbols which will be referred to as interface elements.
An environment is defined by a 4-tuple $E = (I, V, F, S)$, where $I$ is a finite set of interfaces, $V$ is a countably infinite set of variables or value symbols, $F$ is a set of finitary function symbols whose arguments may be any interface element or a value, and $S$ will be referred to as the state space.
The following example serves to introduce the intended concrete realisations of the described language, after which we turn to the definitions relevant to its interpretation.
An end-user may work on a single desktop configuration with multiple windows, each of which can be identified as a separate interface. We may assume the interface elements are represented by unique IDs.
In Figure 1, let $I_1 = \{e_1, e_2\}$, where $e_1$ and $e_2$ identify the form input box and the submit button respectively. The state symbol $s \in S$ will represent the computational state under the hood, while the function set $F$ is determined by the valid actions relevant to the interfaces: here, for example, one can imagine these to include a click action $click(e, s)$ and a text input action $type(e, v, s)$.
The set $V$ is that of all possible user inputs to the text box. To simplify, suppose that in this case it is $\Sigma^*$ for some finite alphabet $\Sigma$ (where $^*$ is the Kleene star).
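As a purely illustrative sketch, the single-window example above can be encoded in Python as follows; the names `Environment`, `click` and `type_text` are our own invention and not part of any existing tool:

```python
from dataclasses import dataclass, field

# A hypothetical encoding of the abstract environment: interfaces are
# finite sets of element IDs, values are arbitrary strings, and action
# functions update the "under the hood" state S.
@dataclass
class Environment:
    interfaces: dict                               # interface name -> set of element IDs
    state: dict = field(default_factory=dict)      # the computational state S

    def click(self, element):
        # A click action: record that the element was activated.
        self.state[element] = "clicked"
        return self.state

    def type_text(self, element, value):
        # A text-input action: store the typed value in the element.
        self.state[element] = value
        return self.state

# The example: one window containing a form input box and a submit button.
env = Environment(interfaces={"form_window": {"input_box", "submit_button"}})
env.type_text("input_box", "hello")
env.click("submit_button")
```

This is only a sketch of the abstract definitions under our own naming assumptions, not a definitive implementation of the formalism.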
A model or instantiation of an environment is defined with respect to a working computer desktop with finitely many graphical user interfaces, an interactor (which may either be automatic, such as a Sikuli program instance, or a human end-user) which has its own language and an interpretation function which will be described comprehensively in the next definition.
Given an environment $E = (I, V, F, S)$ and a fixed interactor language, we define the following interpretation function $\mu$:
$\mu(S)$ is abstractly defined to be the computational state space of the machine on which we are hypothetically operating.
For each interface element symbol $e$, $\mu(e)$ maps to a unique meaningful segment of the interface, such as a button, a list item, a form, a paragraph of text or a scrollbar.
$\mu(I_j)$ is the set $\{\mu(e) : e \in I_j\}$, so that $\mu(I_j)$ refers to a complete interface or window.
$\mu(V)$ is the set of all values which are possible interactor inputs: in the case of a computerised interactor, this would be the set of all possible data artifacts, such as strings, booleans, integers or images, that would be recognised by the interactor’s own program language as program arguments.
$\mu(F)$ is the set of all valid action functions in the interactor language, so that each function symbol $f \in F$ is mapped to a valid action function of matching arity, with arguments interpreted pointwise according to the interpretation function.
For notational simplicity, we will henceforth conflate environmental elements with their interpretations (for example, writing $e$ for $\mu(e)$) and suppress the state space argument to action functions.
An interface element $e$ is positionally realised if there exists a coordinate pair $((x_1, y_1), (x_2, y_2))$ defining the bounding box associated with the instantiation of $e$ in the graphical user interface.
A soft typing system may be implemented if one introduces an environmental vocabulary, by which descriptive types may be assigned to both interface elements and input values.
An environment vocabulary $D$ is a set of descriptive strings. Define $d$ to be a descriptor function for interface elements and values which may serve as action function arguments.
In Figure 1, one may choose to define an environment vocabulary such that $d(e_2) =$ ‘button’.
A descriptive vocabulary (or soft typing system) is not strictly necessary, but may serve for greater interpretability and potentially be used to introduce constraints on function arguments. If the system designer wishes, one may similarly introduce an action vocabulary to which action functions may be mapped.
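A minimal sketch of such a descriptor function and a soft argument constraint, with an invented vocabulary and element IDs (none of these names come from a concrete system):

```python
# Hypothetical environment vocabulary: descriptive "soft types" for
# interface elements and values. All entries are illustrative.
vocabulary = {"button", "text field", "string"}

descriptors = {
    "submit_button": "button",
    "input_box": "text field",
    "user_input": "string",
}

def describe(symbol):
    """Return the descriptive type of an interface element or value,
    or None when the symbol has no descriptor."""
    return descriptors.get(symbol)

def check_argument(expected_type, symbol):
    """A soft constraint on action function arguments: accept the
    argument only when its descriptor matches the expected type."""
    return describe(symbol) == expected_type
```

As the text notes, such a vocabulary is optional; here it simply makes argument mismatches (e.g. clicking a value rather than a button) detectable.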
The purpose of the formalisation in this section has been to formally define a process in the context of IPA. The following definitions complete this goal and allow us to define the core tasks of IPA and to suggest evaluation metrics.
A process (with respect to an interpreted environment $E$) is an ordered sequence of action functions $p = (f_1(\cdot), f_2(\cdot), \dots, f_n(\cdot))$, with valid action functions and arguments interpreted (and valid) in the specified environment.
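Under this definition a process is just an ordered list of action applications, which makes re-enactment a simple traversal. A hypothetical executable encoding (the action names are stand-ins for real interactor actions such as Sikuli's click() and type()):

```python
# A process as an ordered sequence of (action, arguments) pairs.
log = []

def click(element):
    # Stand-in for a GUI click action; records its invocation.
    log.append(("click", element))

def type_text(element, value):
    # Stand-in for a GUI text-input action; records its invocation.
    log.append(("type", element, value))

# A two-step process: fill a (hypothetical) grade cell, then save.
process = [
    (type_text, ("grade_cell", "9.5")),
    (click, ("save_button",)),
]

def run(process):
    """Re-enact a process by applying each action function in order."""
    for action, args in process:
        action(*args)

run(process)
```

The ordering matters: the sequence-based metrics defined later (LCS/MPO) are motivated exactly by this list structure.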
In the case that an environment is instantiated (or realised) with respect to a computational interactor, its programming language is referred to as an IPA Realisation Language.
An IPA program is a process $p$ in an environment $E$ which is instantiated with respect to an IPA Realisation Language.
Although the definition of a process is purposefully environment agnostic, for the purpose of the tasks described in the next section we specifically assume process to mean a process which is an IPA program. In particular, the tasks named demo2process and text2process require the output to be an IPA process with respect to some IPA realisation language.
IPA Modalities & Tasks
Intelligent Process Automation can be defined as a combination of four different tasks: demo2text, demo2process, text2process and process2text, as shown in Figure 3. In this section we describe each of the tasks, using the motivating example presented in Figure 4.
With respect to our formalisation, these tasks may be understood as the exercise of defining mappings between three different environment interpretations which refer to the same desktop configuration, interpreting the interface set $I$ but with action functions and arguments instantiated with respect to three different interactor “languages”: the recorded activity of a human (which constitutes a demo), an IPA Realisation Language (which defines the target process) and human natural language (instructional text).
At the center of IPA is the task of transforming a demonstration to a program that is capable of re-executing the same process demonstrated by the user. We assume that all tasks are result-oriented, i.e., the user wants to achieve a specific result by executing the task. The results can be specific objectives, such as creating documents with specific values, or a set of actions, such as sending an e-mail to a client inquiring about the payment of a service.
Returning to the example in Figure 4, the resulting process is implemented using the Sikuli GUI Automation Tool. Running the Sikuli program, we obtain the same final spreadsheet as in the demonstration.
The demo2text task requires the conversion from a demonstration of a user executing an activity to a natural language description of the executed activity. In the example, the demonstration is a screen-recording of a user’s desktop, divided into different time segments, where each segment has a corresponding natural language description of the actions. The natural language description should be a set of imperative sentences, where each sentence describes a single action that should be taken in order to replicate the demonstration.
The generation of natural language descriptions allows users to better understand the process being demonstrated, especially users who are not executing the action, allowing new users to quickly learn how to execute the process themselves without external assistance.
In Figure 4, we present the conversion from a video demonstrating a user manipulating a spreadsheet with students’ grades to the corresponding natural language description of the task.
Demo2text is similar to the task of automated video captioning. However, video captioning describes general videos using natural language, while demo2text focuses on the description of desktop actions (usually materialised in video form) with imperative sentences.
text2process (T2P) and process2text (P2T)
Instead of starting from an end-user workflow demonstration, it is possible to target a text description of the workflow. The text2process task is particularly useful when the user wants to generate the process from a description of the activity. The task is similar to end-user programming and semantic parsing, given that users can convert from natural language to a program that implements a process, and to program synthesis, since the final goal is generating a program. Given a set of imperative sentences, we want to generate a program that executes the described actions.
Similarly, the user might also want to generate the NL description of a process that is already implemented, especially in cases where the process becomes too complex to be analysed by humans.
Intelligent process automation aims to replicate and improve, iteratively, activities carried out by humans. Comparing the performance of different IPA techniques requires well-defined tasks and metrics that reflect the relevant aspects when automating a process. In this section, we present the benchmarks available in similar research areas.
Even though IPA applications, such as software intelligence and RPA, have particular research questions, there are only a few published datasets designed to help address this demand.
One conspicuously related dataset is the Mini World of Bits (MiniWoB), a benchmark of 100 reinforcement learning environments containing many of the characteristics of live web tasks, created in a controlled context. Each MiniWoB environment is an HTML page with a resolution of 210x160 pixels and a natural language task description, such as “Click on the ‘Next’ button.” The environment provides a precise evaluation metric, rewarding simulated behaviour, with rewards ranging from -1.0 (failure) to 1.0 (success) according to the result of each action, i.e., whether the action shifts the current state closer to the environment goal or not.
The same work also proposes FormWoB, which consists of four web tasks based on real flight-booking websites, and QAWoB, which approaches web tasks as question answering, soliciting questions from crowd workers. Even though MiniWoB provides an essential baseline for IPA, it is still a synthetic dataset and does not reflect real user applications. It has a specific screen size and a limited number of applications and user actions, creating a controlled and closed environment.
An example of a real (non-synthetic) dataset for software intelligence is the PhotoShop Operation Video (PSOV) Dataset , containing videos and dense command annotations for Photoshop Software, with 74 hours of videos and 29,204 labelled commands. Despite PSOV presenting real-world use cases, it is still limited to one specific software, i.e., it is not generalisable to other applications.
Comparable benchmarks can be found in the research area of video captioning. The MPII Movie Description dataset (MPII-MD) contains transcribed and aligned audio descriptions and script data sentences for a set of 55 movies of diverse genres. Another dataset includes 84.6 hours of paired video/sentences from 92 DVDs, with high-quality natural language phrases describing the visual content in a given segment of time.
A process can be seen as a formal representation of a task; therefore, there is an evident alignment between IPA and semantic parsing (SP). Unlike IPA, several benchmarks are available for SP, from which we can obtain insights.
Most of these benchmarks focus on evaluating the conversion from a natural language specification to a program written in a specific programming language. WikiSQL is a collection of questions, corresponding SQL queries, and SQL tables. NL2Bash is a corpus of frequently used Bash commands with their respective natural language descriptions. Other applications include converting from a visual object to code [10, 4] and from source code to pseudo-code in different languages.
Process mining is another closely related research area; it aims to discover, monitor and improve real processes by extracting knowledge from event logs. Unlike IPA, its main focus is not automation but finding answers to domain-specific questions, such as analysing patient treatment procedures and discovering the roles of the people involved in the various stages of a specific process.
D2P & T2P
This section concentrates on a weighted quantification of the correctness and completeness of the generated IPA program output. Correctness and completeness are defined against a gold reference IPA program which was produced by one or more human programmers.
Given a set of generated IPA programs $P = \{p_1, \dots, p_k\}$, where each $p_i$ has a corresponding gold standard $p_i^{gs}$, we define a set of metrics of varying granularity that can be seen as approximate measures of program correctness.
We start with the following error functions:
Predicate / Argument Sensitive Error:
For an image argument, the intersection over union (IoU) is used to define the argument error. Given the bounding box of the corresponding interface element ($B$) and the gold standard bounding box ($B^{gs}$):

$$IoU(B, B^{gs}) = \frac{area(B \cap B^{gs})}{area(B \cup B^{gs})}$$

where the argument error is $0$ if $IoU(B, B^{gs})$ exceeds a chosen threshold and $1$ otherwise.
IoU assumes that the image argument can be registered within the gold standard reference screenshot.
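The IoU computation over axis-aligned bounding boxes can be sketched as follows (the `(x1, y1, x2, y2)` coordinate convention is our assumption):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes, each given
    as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection rectangle (0 if disjoint).
    ix = max(0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

For example, two unit-offset 2x2 boxes overlap in a 1x1 region, giving an IoU of 1/7.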
An alternative way to define the argument error is by directly comparing the two argument images using mean squared error (MSE) or the structural similarity index (SSIM).
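A pixel-wise MSE between two equally sized greyscale images (represented here, for illustration, as nested lists of intensities) can be sketched as below; SSIM is considerably more involved and is omitted:

```python
def mse(img_a, img_b):
    """Mean squared error between two images of identical shape,
    given as nested lists (rows) of pixel intensities."""
    total, count = 0.0, 0
    for row_a, row_b in zip(img_a, img_b):
        for pa, pb in zip(row_a, row_b):
            total += (pa - pb) ** 2
            count += 1
    return total / count
```

Identical images yield an MSE of 0; larger values indicate greater pixel-wise disagreement.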
The previous measures do not take into account the order of the program statements. In order to capture the sequential nature of an IPA program we include a sequence-based metric which is based on the longest common subsequence (LCS) function. Given two sequences $X = (x_1, \dots, x_m)$ and $Y = (y_1, \dots, y_n)$, and given that the prefixes of $X$ are $X_0, \dots, X_m$ and the prefixes of $Y$ are $Y_0, \dots, Y_n$, LCS is defined as:

$$LCS(X_i, Y_j) = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ LCS(X_{i-1}, Y_{j-1}) + 1 & \text{if } x_i = y_j \\ \max(LCS(X_i, Y_{j-1}), LCS(X_{i-1}, Y_j)) & \text{otherwise} \end{cases}$$
In order to compute the maximal program fragment generated we encode the programs $p$ and $p^{gs}$ as sequences of unique symbols from a hash table derived from the statements of $p^{gs}$. The maximum program overlap (MPO) is defined as:

$$MPO(p, p^{gs}) = \frac{LCS(p, p^{gs})}{|p^{gs}|}$$
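The LCS recurrence and the MPO ratio can be sketched as follows (normalising by the length of the gold program is our reading of the definition, and programs are represented simply as lists of statement strings):

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of sequences x and y,
    via the standard dynamic-programming recurrence."""
    m, n = len(x), len(y)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i][j - 1], table[i - 1][j])
    return table[m][n]

def mpo(program, gold):
    """Maximum program overlap: LCS of the two statement sequences,
    normalised by the length of the gold standard program."""
    return lcs_length(program, gold) / len(gold)
```

A perfectly reproduced program scores 1.0; out-of-order or missing statements lower the LCS and therefore the MPO.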
P2T & D2T
Evaluating generated text is a requirement in areas related to Natural Language Generation, such as Machine Translation, Automatic Summarisation, Image/Video Captioning, and Document Similarity. BLEU has been widely applied and accepted as an evaluation metric for the tasks in these areas.
Similarly, we apply BLEU to evaluate text generated from demonstrations and processes. Our goal is to automatically evaluate, for a demonstration or a process, how well a candidate generated sentence matches the set of demonstration/process descriptions. BLEU computes the n-gram overlap between the generated text description and the reference description. The BLEU score depends on two factors: modified n-gram precision and a brevity penalty.
The modified precision score computes the fraction of n-grams matched between candidate descriptions and reference descriptions over the entire test corpus. Unlike standard precision, it clips the count of each n-gram at its maximum count in the reference, avoiding high-precision results obtained merely by repeating correct words. Modified n-gram precision is computed as follows:

$$p_n = \frac{\sum_{C \in \{Candidates\}} \sum_{ngram \in C} Count_{clip}(ngram)}{\sum_{C' \in \{Candidates\}} \sum_{ngram' \in C'} Count(ngram')}$$
The brevity penalty factor penalises candidate descriptions of length $c$ shorter than the reference descriptions of effective length $r$, computed as follows:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \leq r \end{cases}$$
Finally, the BLEU score is calculated as

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where $w_n$ are positive weights summing to one and $N$ is the maximum size of the n-grams.
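The computation above can be sketched as a self-contained sentence-level BLEU with uniform weights (practical evaluations typically use corpus-level BLEU, e.g. NLTK's implementation; this toy version returns 0 whenever any precision is 0 rather than smoothing):

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    # Count all n-grams of the token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: candidate counts are capped by the
    maximum count of each n-gram over all references."""
    cand = ngram_counts(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

def bleu(candidate, references, max_n=2):
    """Sentence-level BLEU with uniform weights w_n = 1/max_n."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    c = len(candidate)
    # Effective reference length: the one closest to the candidate length.
    r = min((len(ref) for ref in references), key=lambda length: abs(length - c))
    brevity = 1.0 if c > r else math.exp(1 - r / c)
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A candidate identical to its reference scores 1.0; a candidate sharing no n-grams scores 0.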
Real World of Bits Benchmark
To the best of our knowledge, there are no datasets available that can be used as a benchmark for all the tasks defined in this work. Therefore, we created a benchmark for the evaluation of IPA approaches. Inspired by MiniWoB, we name our dataset Real World of Bits (RealWoB). In this section we describe how we generated our benchmark.
RealWoB contains 100 different entries, where each entry is related to one specific task (e.g., searching for a flight) and it contains:
A screen recording (video) of a user performing the task;
A natural language summarisation of the task;
A natural language step-by-step description of the task;
A program that is capable of re-executing the task, written in the Sikuli GUI Automation Tool and TagUI (https://github.com/kelaberetiv/TagUI).
Each video has an average length of 56 seconds and is divided into time intervals, where each interval has a corresponding natural language description.
There are two types of natural language description for each task: a more general description of the task being performed, i.e., a summarisation, and a detailed step-by-step description. The summarised text is used as a guide for the annotator in order to generate the video, description and program. The detailed description is written using imperative sentences, such as “Click on the button ‘Send’”.
RealWoB also contains a program for every task. The programs were implemented based on the videos and the natural language step-by-step descriptions. Every sentence in natural language corresponds to one or more commands in the program/process. Figure 5 presents an example pair of a natural language step-by-step description and its process implementation in Sikuli.
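The sentence-to-command correspondence can be illustrated with a toy pattern matcher; the command names loosely mirror Sikuli's click()/type() style, but the patterns and the mapping itself are our own invention, not part of the benchmark:

```python
import re

# Hypothetical patterns pairing imperative sentences with commands.
PATTERNS = [
    (re.compile(r"^Click on the button '(?P<target>.+)'"), "click"),
    (re.compile(r"^Type '(?P<value>.+)' in the field '(?P<target>.+)'"), "type"),
]

def sentence_to_command(sentence):
    """Map one step-by-step sentence to a (command, arguments) pair,
    or None when no known pattern applies."""
    for pattern, command in PATTERNS:
        match = pattern.match(sentence)
        if match:
            return command, match.groupdict()
    return None
```

In the benchmark itself this mapping is produced by human annotators, and one sentence may expand to several commands; the sketch only shows the one-to-one case.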
The 100 tasks are divided into 10 categories:
Spreadsheet use: 10 tasks
Spreadsheet and browser use - simple: 10 tasks
Spreadsheet and browser use - elaborate: 10 tasks
Webmail suite use: 10 tasks
Spreadsheet and Webmail suite use: 10 tasks
Webmail suite and browser use: 10 tasks
Browser, spreadsheet and Webmail suite use: 10 tasks
Browser (social media) use: 10 tasks
Browser (social media) and spreadsheet use: 10 tasks
Random selection of previous tasks executed in a different operating system: 10 tasks
The selected group of tasks and computer applications is common for office workers, with a high number of potential end users. We identified the tasks as suitable for automation, requiring little or no human input.
The automation of tasks using webmail suites and web browsing is particularly hard due to the dynamic state of the web. For tasks involving receiving/sending e-mails, we consider that all e-mails follow a pre-defined template. For the ones that involve browsing the web, we cache a version of the used webpage and assume it as the current state of the page.
For illustration purposes, Table 1 presents one example of each task.
|Category|Task Example (Summary)|
|Spreadsheet + browser||
|Spreadsheet + browser||
|Spreadsheet + Webmail||
|Webmail + browser||
|Browser + spreadsheet + webmail||
|Browser (social media)||
|Browser (social media) + spreadsheet||
|Random task - Different OS|Any of the above tasks executed using the OS Ubuntu.|
RealWoB was annotated by four different annotators, A, B, C and D, where C is a specialist in IPA. It was constructed with the following steps:
We defined 90 different tasks based on investigating everyday office worker tasks.
Annotator A recorded 90 videos, running the tasks on Windows 10.
Annotator B wrote a summary of the task being performed, a step-by-step description (for every video segment) and a program capable of executing the task automatically.
Annotator C verified and corrected the videos and the annotations.
From the 90 different tasks, we chose ten random tasks.
Annotator B recorded the ten tasks using Ubuntu 16.04.
Annotator D wrote a summary of the task being performed, a step-by-step description (for every video segment) and a program capable of executing the task automatically.
Annotator C verified and corrected the videos and the annotations.
In this work, we present a formalisation of IPA using a bottom-up approach, defining the building blocks of an IPA program. We also define the modalities and tasks that are part of IPA (demo2text, demo2process, text2process and process2text) and show how these tasks relate to similar research areas.
With the formalisation, it was also possible to define appropriate metrics for each one of the tasks, evaluating the resulting process or text. We also describe how we built a dataset that can be used as a benchmark for the IPA tasks.
We envision that the research area of IPA will see significant progress in the next few years, following the increased use of RPA tools. The formalisation presented in this work is a starting step towards building complex IPA systems.
While we have provided the formalisation, evaluation techniques and methodology for designing a benchmark, there is an important next step in this research: joining together these contributions and creating a baseline for each one of the IPA tasks.
The authors would like to thank Jacques Cali and the Blue Prism team for their support of this project. We also thank the anonymous reviewers for the valuable feedback.
-  (2018) Robotic process automation. Business & Information Systems Engineering: The International Journal of WIRTSCHAFTSINFORMATIK 60 (4), pp. 269–272. Cited by: Related Work.
-  (2019) Research challenges for intelligent robotic process automation. In Workshop on Artificial Intelligence for Business Process Management (AI4BPM’19), held in conjunction with the 17th Int. Conference on Business Process Management (BPM’19), Vienna, Austria, 1–6 September 2019. Cited by: Introduction.
-  (2017) Automation of a business process using robotic process automation (rpa): a case study. In Applied Computer Sciences in Engineering, J. C. Figueroa-García, E. R. López-Santana, J. L. Villa-Ramírez, and R. Ferro-Escobar (Eds.), Cham, pp. 65–71. External Links: Cited by: Introduction.
-  (2018) Pix2code: generating code from a graphical user interface screenshot. In Proceedings of the ACM SIGCHI Symposium on Engineering Interactive Computing Systems, pp. 3. Cited by: Related Work.
-  (2011) Analysis of patient treatment procedures.. In Business Process Management Workshops (1), Vol. 99, pp. 165–166. Cited by: Related Work.
-  Learning from photoshop operation videos: the PSOV dataset. In Asian Conference on Computer Vision, pp. 223–239. Cited by: Related Work.
-  (2018) Survey of the state of the art in natural language generation: core tasks, applications and evaluation. Journal of Artificial Intelligence Research 61, pp. 65–170. Cited by: P2T & D2T.
-  (2017-Sep.) IEEE guide for terms and concepts in intelligent process automation. IEEE Std 2755-2017 (), pp. 1–16. External Links: Cited by: Introduction.
-  (2018) NL2Bash: a corpus and semantic parser for natural language interface to the linux operating system. arXiv preprint arXiv:1802.08979. Cited by: Related Work.
-  (2016) Latent predictor networks for code generation. arXiv preprint arXiv:1603.06744. Cited by: Related Work.
-  (2015) Learning to generate pseudo-code from source code using statistical machine translation (t). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 574–584. Cited by: Related Work.
-  Video captioning with transferred semantic attributes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6504–6512. Cited by: demo2text (D2T).
-  (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: P2T & D2T.
-  (2019) A study of robotic process automation among artificial intelligence. International Journal of Scientific and Research Publications (IJSRP) 9 (2) (ISSN: 2250-3153). External Links: Cited by: Introduction.
-  (2015) A dataset for movie description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3202–3212. Cited by: Related Work.
-  (2017) Semeval-2017 task 11: end-user development using natural language. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 556–564. Cited by: text2process (T2P) and process2text (P2T).
-  World of bits: an open-domain platform for web-based agents. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3135–3144. Cited by: Related Work, Real World of Bits Benchmark.
-  (2015) Using descriptive video services to create a large data source for video annotation research. arXiv preprint arXiv:1503.01070. Cited by: Related Work.
-  (2011) Process mining manifesto. In International Conference on Business Process Management, pp. 169–194. Cited by: Related Work.
-  (2015) Bpi challenge 2015. In 11th International Workshop on Business Process Intelligence (BPI 2015), Cited by: Related Work.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: D2P & T2P.
-  (2017) Seq2sql: generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103. Cited by: Related Work.