Managing Machine Learning Workflow Components

by Marcio Moreno, et al.

Machine Learning Workflows (MLWfs) have become an essential and disruptive approach to problem-solving across several industries. However, developing MLWfs can be complicated, time-consuming, and error-prone. To address this problem, we introduce machine learning workflow management (MLWfM) as a technique to aid the development and reuse of MLWfs and their components through three aspects: representation, execution, and creation. More precisely, we discuss our approach to structuring MLWf components and their metadata to aid the retrieval and reuse of components in new MLWfs. We also consider the execution of these components within a tool. A hybrid knowledge representation, called Hyperknowledge, frames our methodology and supports the three aspects of MLWfM. To validate our approach, we show a practical use case in the Oil & Gas industry.




I Introduction

The recent advances in machine learning (ML), especially in neural networks [lecun2015], extend problem-solving capabilities across industries, ranging for instance from Smart Cities and Public Security [lourenco2018, zortea2018] to Agriculture [nery2018facing, moreno2018smart] and Oil & Gas (O&G) [civitarese2019]. In general, addressing tasks in these domains requires creating complex ML Workflows (MLWf). In this paper, we consider an MLWf to be composed of a set of components, i.e., the steps necessary for learning and inference (e.g., data processing, feature extraction, model training, and validation), and their relationships. The development process of an MLWf usually produces a vast number of components and different MLWf, which are commonly task-specific, for instance, detecting objects in an image or predicting possible links within a graph. This development process may be complicated, time-consuming, and error-prone [morenopatent2019]. Furthermore, the unstructured growth of MLWf limits the reuse of components, since there is no well-defined common vocabulary to structure them. To overcome these issues, a key aspect of MLWf development, yet one commonly overlooked, is the management of the components and MLWf, which we call machine learning workflow management.

In this paper, we define Machine Learning Workflow Management (MLWfM) as a technique that supports the symbolic representation of MLWf grounded in one or multiple ontologies, providing the means for executing existing MLWf and automatically creating new ones. The ontology-oriented structuring provides a common vocabulary. It allows the interoperability of the MLWf’s components as well as the search for components by their characteristics. MLWf execution refers to the process of running a component or a set of components of an MLWf, i.e., at any granularity level of the MLWf. The creation of a new MLWf relies on the ML concepts described in the ontology and the available characteristics to reuse existing components and produce new MLWf in different contexts.

We argue that MLWfM can contribute to the development and reuse of MLWf through the aspects of structuring, execution, and creation. Despite the novelty of the term machine learning workflow management, some of the issues discussed in this paper have been tackled by existing works in the literature. However, most of them focus on execution or limited creation of MLWf, a few propose a common vocabulary to structure the MLWf, and none provides all three aspects that we consider necessary to achieve MLWfM. Therefore, in this paper, we present MLWfM as a technique that tackles the issues of MLWf development. We propose the use of hybrid knowledge representations to structure broad tasks by putting together the best set of elements (specific tasks) to automate this process.

The main contributions of this paper include (i) a new knowledge-oriented representation for MLWf; and (ii) a framework to support MLWfM and its execution and creation features. To illustrate our work, we explore an industrial use case in O&G exploration, since it relies on multiple workflows of machine learning and data processing. This use case presents the knowledge representation of an MLWf and shows how this representation and the components’ semantic metadata support reuse. We also show the benefits of MLWfM in the development process of MLWf through the execution of existing components in the use case and the creation of new workflows by reusing existing components. Finally, we validate our approach by showing scenarios in which we address the requirements using the stated MLWfM.

The remainder of this paper is organized as follows: Section II presents the main related work; Sections III and IV describe the knowledge-oriented representation of an MLWf and the MLWfM, respectively; Section V provides the validation of our approach and further discussion. Finally, Section VI presents the final remarks and future directions.

II Related Work

We discuss the related work from two perspectives: (i) the representation of MLWf and (ii) the usage and reuse of MLWf’s components.

To address the gap between the execution provenance of a machine learning workflow and its representation for the reproducibility of executions, Esteves et al. [esteves2015] and Publio et al. [publio2018] propose the use of a vocabulary and an ontology schema, respectively. The MEX vocabulary [esteves2015] provides a standard schema based on a machine-readable terminology that aids the reproducibility of executions in various frameworks and workflow systems. However, it lacks details of the machine learning process itself, focusing mainly on the general machine learning workflow. To overcome this issue, the W3C ML-Schema [publio2018] extends the MEX vocabulary, improving the representation of the machine learning process. Both approaches structure the workflows within a common vocabulary, which is a fundamental aspect of MLWfM, but they were designed to provide reproducibility of executions and therefore lack adequate support for the reuse of the MLWf’s components.

MLflow [zaharia2018] and IBM Watson Studio are commercial solutions for designing and deploying machine learning workflows. The advantage of these systems lies in their support for the creation and modification of machine learning workflows in a stable environment. Their main drawback is the lack of a knowledge representation, which could take advantage of a common vocabulary and support the structuring of MLWf components’ metadata.

Jannach et al. [jannach2016] propose a recommendation system plugin for RapidMiner, which supports the development of machine learning workflows through adaptive recommendations based on a predictive model trained over existing machine learning workflows. Wang et al. [wang2018] propose a unified system architecture called Rafiki, which allows users of machine learning models to train and predict through built-in services without handling the specifics of building models, tuning hyper-parameters, optimization, and so on. These approaches provide frameworks that ease the development of machine learning models, but they do not present an adequate representation, which limits the reuse of components.

Carvalho et al. [carvalho2018] propose a semantic software catalog to aid scientists in managing the exploration and evolution of their computational experiment workflows. The so-called OntoSoft-VFF is built upon a novel ontology, which describes software functionalities and evolution through their semantic metadata, making it possible to query the represented components and their metadata. They have also illustrated their method on a machine learning workflow. Compared with the other aforementioned works, Carvalho et al.’s approach is more aligned with the proposed MLWfM. Nonetheless, it focuses on the representation of general software and lacks details of the MLWf. It also does not support the creation and execution of workflows.

III Representation of Machine Learning Workflows

In this section, we describe the knowledge model elements that support our MLWfM method. The core component is the ML Schema proposed by W3C as a core vocabulary for the machine learning domain.

III-A Hyperknowledge

Hyperknowledge is a knowledge representation model that supports the representation of high-level semantic concepts, multimodal information, unstructured data, and the relationships between them in a unified way [moreno2016ncm, moreno2017extending, moreno2019]. Its conceptual model is expressive enough to relate multimedia content (e.g., image, audio, text, video) to abstract concepts (e.g., labels, classes) within the same framework. Besides, formal descriptions present in ontologies, linked data, machine learning models, executable content, and source code can also be represented in Hyperknowledge. By providing a flexible descriptive framework, Hyperknowledge helps to fill the semantic gap between hypermedia content and knowledge engineering content, allowing reasoning over cross-modal types of information.

The Hyperknowledge model is composed of nodes, links, contexts, and supporting constructs. Nodes represent resources and can be decorated with properties and anchors. Properties represent node characteristics with literal values. Anchors denote fragments of the resource denoted by the node. For example, an anchor on a node representing an image might denote a region of that image. All nodes have a lambda anchor denoting the whole resource. Links can associate two or more nodes. This feature differentiates Hyperknowledge from traditional graph-based representations, allowing the representation of n-ary relations without reification. Links connect to nodes exclusively through anchors. Like nodes, links can also have properties.
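As a rough illustration of the constructs just described, the following Python sketch models nodes with properties and anchors, the lambda anchor, and an n-ary link that attaches to nodes only through anchors. It is not the Hyperknowledge implementation; the class names, the `LAMBDA` constant, the role labels, and the example resources are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

LAMBDA = "lambda"  # assumed name for the anchor denoting the whole resource


@dataclass
class Node:
    """A resource with literal-valued properties and named anchors."""
    name: str
    properties: Dict[str, Any] = field(default_factory=dict)
    anchors: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every node carries a lambda anchor denoting the whole resource.
        self.anchors.setdefault(LAMBDA, None)


@dataclass
class Link:
    """An n-ary relation; each endpoint is a (role, node, anchor name) triple."""
    relation: str
    endpoints: List[Tuple[str, Node, str]]
    properties: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if len(self.endpoints) < 2:
            raise ValueError("a link associates two or more nodes")
        for _role, node, anchor in self.endpoints:
            # Links attach to nodes only through anchors the node actually has.
            if anchor not in node.anchors:
                raise ValueError(f"{node.name} has no anchor {anchor!r}")


# Hypothetical resources: an image with one region anchor, a label, an annotator.
image = Node("SeismicImage", anchors={"region1": (10, 20, 64, 64)})
label = Node("Horizon")
annotator = Node("Interpreter", properties={"experience_years": 5})

# One ternary link, expressed directly without an intermediate (reified) node.
annotation = Link(
    "annotates",
    [
        ("subject", annotator, LAMBDA),
        ("target", image, "region1"),
        ("label", label, LAMBDA),
    ],
)
```

In a plain binary graph the same statement would need an extra "annotation" node with three outgoing edges; here the three participants sit in a single link.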

Furthermore, Hyperknowledge graphs are organized into contexts. All nodes and links are in a context; if a context is not specified, then the default context is assumed. Contexts are composite nodes, so they can be linked themselves to other nodes and also have parent contexts, effectively allowing the representation of context hierarchies.
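The context rules above (a default context, contexts as composite nodes, and parent contexts) can be sketched in a few lines of Python. This is only an illustration under assumed names: `Context`, `KnowledgeBase`, and the context names used in the example are hypothetical and not part of any Hyperknowledge API.

```python
from typing import Dict, List, Optional


class Context:
    """A composite node: it holds member names and may have a parent context."""

    def __init__(self, name: str, parent: Optional["Context"] = None) -> None:
        self.name = name
        self.parent = parent
        self.members: List[str] = []

    def ancestors(self) -> List[str]:
        """Names of the enclosing contexts, nearest first."""
        chain: List[str] = []
        current = self.parent
        while current is not None:
            chain.append(current.name)
            current = current.parent
        return chain


class KnowledgeBase:
    def __init__(self) -> None:
        self.default = Context("default")
        self.contexts: Dict[str, Context] = {"default": self.default}

    def add_context(self, name: str, parent: Optional[Context] = None) -> Context:
        # A context is itself a node, so it becomes a member of its parent
        # (or of the default context when no parent is given).
        owner = parent if parent is not None else self.default
        ctx = Context(name, owner)
        owner.members.append(name)
        self.contexts[name] = ctx
        return ctx

    def add_node(self, name: str, context: Optional[Context] = None) -> None:
        # Nodes without an explicit context fall into the default context.
        owner = context if context is not None else self.default
        owner.members.append(name)


kb = KnowledgeBase()
ml_schema = kb.add_context("MLSchema")
use_case = kb.add_context("HorizonPickingUseCase", parent=ml_schema)
kb.add_node("Run", ml_schema)
kb.add_node("Scratch")  # no context given, so it lands in the default context
```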

III-B ML Schema

Figure 1: Hyperknowledge model of ML Schema vocabulary [publio2018] (in gray) and some additional concepts. Concepts in yellow and green denote, respectively, specific ML concepts we specified to be used in the examples below; and domain concepts for use case in Section V-A.

ML Schema defines constructs that allow one to describe ML algorithms, tasks, implementations, and executions (Figure 1). It can be used as a basis for the specification of ontologies, databases, and APIs for machine learning. We translated the OWL implementation of ML Schema to Hyperknowledge to allow the instantiation and querying of ML models. Concepts and relations in the ML Schema OWL model were translated to Hyperknowledge nodes and links. We specified classes as nodes and object properties as binary links between concepts. Datatype properties were specified as properties on concept nodes. All nodes and links of the ML Schema ontology were added to a single context of their own, while extensions and instantiations were added to separate contexts. This allows for a more organized knowledge model, with a separation of concerns for each context.
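The translation rules stated above (classes to nodes, object properties to binary links, datatype properties to node properties, everything in one context) can be sketched as follows. The input is a hand-written dictionary standing in for a parsed OWL file, not the real ML Schema ontology; in particular, the `accuracy` datatype property and the dictionary layout are assumptions for illustration.

```python
from typing import Any, Dict, List

# Hypothetical fragment standing in for a parsed OWL model.
owl_fragment: Dict[str, List[Any]] = {
    "classes": ["Run", "Task", "Algorithm", "Model"],
    "object_properties": [
        ("Run", "achieves", "Task"),
        ("Run", "realizes", "Algorithm"),
        ("Run", "hasOutput", "Model"),
    ],
    "datatype_properties": [("Model", "accuracy", "float")],
}


def translate(owl: Dict[str, List[Any]], context: str) -> Dict[str, Any]:
    """Map an OWL-like fragment to node and link records in one context."""
    # Classes become nodes.
    nodes = {
        name: {"properties": {}, "context": context} for name in owl["classes"]
    }
    # Object properties become binary links between concept nodes.
    links = [
        {"relation": prop, "source": domain, "target": range_, "context": context}
        for domain, prop, range_ in owl["object_properties"]
    ]
    # Datatype properties become properties on the concept node.
    for domain, prop, datatype in owl["datatype_properties"]:
        nodes[domain]["properties"][prop] = datatype
    return {"nodes": nodes, "links": links}


hk = translate(owl_fragment, context="MLSchema")
```

Extensions such as domain concepts would go through the same function with a different `context` argument, which keeps them separate from the base vocabulary.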

IV Machine Learning Workflow Management

This section discusses the concepts regarding machine learning workflow management and how Hyperknowledge supports these concepts, namely: (i) MLWfs’ components retrieval, (ii) MLWf creation, and (iii) MLWf execution.

One of the advantages of the MLWfM is the knowledge framing described in Section III-B. Hyperknowledge structures the MLWf within a knowledge base, which enables them to be searched and have their components retrieved. In such a way, one could search for components individually or entire workflows regarding their functionalities and metadata. Furthermore, since the MLWfM relies on Hyperknowledge representation, it is possible to develop queries using the Hyperknowledge Query Language (HyQL).

As an example, one possible query could be to search for all models developed to achieve classification. The structure of this query with HyQL would be:

      Run achieves Classification AND
      Run hasOutput Model

However, such a query could return a broad set of models. As discussed in Section III-A, entities can have anchors that represent part of their content. For instance, in our example, the desired models could have a convolution layer. In addition, to filter the results further, the models’ metadata could also be part of the query. We therefore propose an enhanced query:

      Run achieves Classification AND
      Run realizes Algorithm AND
      Algorithm#ConvolutionLayer AND
      Run hasOutput Model AND
      Model.accuracy > 0.9

In both cases, the retrieved entities are components from already represented workflows. This approach yields the possibility of creating new MLWf from the retrieval of existing components, another advantage of using MLWfM.
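To make the retrieval semantics of the two queries more concrete, the sketch below evaluates conjunctive patterns of the form "Concept relation Concept" over a toy in-memory knowledge base. It is not a HyQL engine: the matching strategy, the data layout, and every instance name are assumptions, and the anchor test and accuracy filter are applied as ordinary Python predicates.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Pattern = Tuple[str, str, str]

# Toy knowledge base; all instance names and values are made up.
types: Dict[str, str] = {
    "run1": "Run", "run2": "Run",
    "cnn": "Model", "svm": "Model",
    "alg_cnn": "Algorithm", "alg_svm": "Algorithm",
    "cls": "Classification",
}
facts: List[Tuple[str, str, str]] = [
    ("run1", "achieves", "cls"), ("run1", "realizes", "alg_cnn"),
    ("run1", "hasOutput", "cnn"),
    ("run2", "achieves", "cls"), ("run2", "realizes", "alg_svm"),
    ("run2", "hasOutput", "svm"),
]
props: Dict[str, Dict[str, float]] = {
    "cnn": {"accuracy": 0.93}, "svm": {"accuracy": 0.81},
}
anchors: Dict[str, Set[str]] = {"alg_cnn": {"ConvolutionLayer"}, "alg_svm": set()}


def match(
    patterns: List[Pattern], bindings: Optional[Dict[str, str]] = None
) -> Iterator[Dict[str, str]]:
    """Yield {concept: instance} bindings that satisfy every pattern."""
    bindings = bindings or {}
    if not patterns:
        yield dict(bindings)
        return
    (s_concept, relation, o_concept), rest = patterns[0], patterns[1:]
    for s, r, o in facts:
        if r != relation or types[s] != s_concept or types[o] != o_concept:
            continue
        # A concept name mentioned twice must refer to the same instance.
        if bindings.get(s_concept, s) != s or bindings.get(o_concept, o) != o:
            continue
        yield from match(rest, {**bindings, s_concept: s, o_concept: o})


# First query: every model produced by a run that achieves classification.
basic = [
    b["Model"]
    for b in match([("Run", "achieves", "Classification"),
                    ("Run", "hasOutput", "Model")])
]

# Enhanced query: additionally require a ConvolutionLayer anchor on the
# algorithm and an accuracy above 0.9 on the model.
enhanced = [
    b["Model"]
    for b in match([("Run", "achieves", "Classification"),
                    ("Run", "realizes", "Algorithm"),
                    ("Run", "hasOutput", "Model")])
    if "ConvolutionLayer" in anchors[b["Algorithm"]]
    and props[b["Model"]]["accuracy"] > 0.9
]
```

With this toy data the first query returns both models, while the enhanced query keeps only the convolutional one.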

Figure 2 depicts KES (Knowledge Explorer System) [moreno2018, moreno2018a], the Hyperknowledge visualization tool. Through KES, the user can visualize the symbolic representation of an MLWf stored in a Hyperknowledge base as well as curate it by adding, updating, or deleting MLWfs and their components. KES also supports the third advantage of the MLWfM technique: the execution of MLWfs’ components. After selecting a component retrieved by a query (Figure 2.a), the user can execute it through KES, and the result of this execution is then stored and represented within the MLWf’s context (Figures 2.b and 2.c).

Figure 2: Example of execution of a machine learning workflow’s component represented in Hyperknowledge framework on KES tool.
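The execute-then-store behaviour described above can be pictured with a very small Python sketch: a selected component is run, and a record of the run is appended to a structure standing in for the MLWf's context. This does not reflect how KES is implemented; the registry, the `execute` helper, and the `Normalize` component are hypothetical.

```python
from typing import Any, Callable, Dict, List


def normalize(values: List[float]) -> List[float]:
    """Stand-in component: rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical registry of executable components retrieved by a query.
components: Dict[str, Callable[..., Any]] = {"Normalize": normalize}

# Stand-in for the MLWf's context, where execution results are recorded.
workflow_context: List[Dict[str, Any]] = []


def execute(name: str, *args: Any) -> Any:
    """Run a registered component and record the run in the context."""
    result = components[name](*args)
    workflow_context.append({"component": name, "inputs": args, "output": result})
    return result


out = execute("Normalize", [2.0, 4.0, 6.0])
```

Keeping the inputs alongside the output in the stored record is one simple way to make a later run reproducible or queryable.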

V Use Case and Validation

V-A Use Case

As mentioned before, the proposed Machine Learning Workflow Management gives users the ability to query, create, execute, and share standardized workflows at the problem level of abstraction. To illustrate these properties, we present a common task in the O&G exploration workflow: horizon picking for seismic data. Horizon picking is one of the seismic interpretation activities, which plays an essential role in O&G exploration software tools [herron2011first]. Seismic data captured from a specific region of the subsurface (e.g., a basin, a block, or a field) provides a picture of the general organization of the surfaces delimiting underground strata. Horizon picking consists of segmenting one or more surfaces from a seismic data volume [o2004towards]. Combined with other techniques, this process aims to identify geological layers and other structures that may lead to potential hydrocarbon deposits [146848].

We further extended the ML Schema model with a simple domain ontology to support our use case in horizon picking (green concepts in Figure 1). The two ontologies are connected by characterizing the Seismic concept as a kind of Data and the Horizon Picking concept as a kind of Task. Finally, we defined more specific ML concepts that implement horizon picking on seismic data (yellow concepts in Figure 1). Based on that, we can carry out MLWfM in the domain, as described in the following.

V-B Validation

To validate the process of MLWfM, we explore its ability to answer a series of investigations over the original use case and further modifications. The investigations were developed with respect to the use case and domain knowledge represented in Figure 1. For the following examples, assume we have a database of seismic images from different regions, to which machine learning tasks have already been applied, so that previous components can be reused in new workflows that answer the investigations. The investigations were modeled as queries in HyQL; for each one, we present its description and the results obtained by evaluating it.

Investigation 1 - Which trained machine learning models are able to perform the horizon picking task on seismic images similar to SeismicA, using a similarity factor of at least 0.9?

This investigation intends to identify a machine learning model that can pick horizons from an unseen seismic image, i.e., a new seismic image (called SeismicA in the query). In terms of MLWfM, we are interested in showing the identification and retrieval of the model component of an MLWf based on a specific task that uses an input similar to the queried seismic image. The HyQL query used to answer this investigation can be formulated as follows:

LET x = {
           Run hasOutput Model AND
           Run achieves HorizonPicking AND
           Run hasInput Seismic AND
           similarSeismic(SeismicA, Seismic) > 0.9 }
      Run hasOutput Model AND Run hasInput x

This query retrieves any available machine learning model from an MLWf that was developed to perform the Horizon Picking task and that consumes seismic images similar to the unseen seismic image. Here, the query uses a predefined similarity function that evaluates the similarity between two seismic images, filtering its results by a similarity factor greater than 0.9.
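The filtering logic of Investigation 1 can be sketched in Python as below. The paper does not define the predefined similarity function, so cosine similarity over per-image feature vectors is used here purely as a placeholder; the feature values, run records, model names, and the `FaultDetection` task are likewise invented for illustration.

```python
import math
from typing import Dict, List

# Hypothetical feature vectors describing each seismic image.
features: Dict[str, List[float]] = {
    "SeismicA": [0.90, 0.10, 0.40],
    "SeismicB": [0.88, 0.12, 0.41],
    "SeismicC": [0.10, 0.90, 0.20],
}

# Hypothetical records of previously executed runs.
runs: List[Dict[str, str]] = [
    {"task": "HorizonPicking", "input": "SeismicB", "output": "model_b"},
    {"task": "HorizonPicking", "input": "SeismicC", "output": "model_c"},
    {"task": "FaultDetection", "input": "SeismicB", "output": "model_f"},
]


def similar_seismic(a: str, b: str) -> float:
    """Placeholder similarity: cosine similarity of the feature vectors."""
    va, vb = features[a], features[b]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm


def models_for_similar_seismic(query: str, threshold: float = 0.9) -> List[str]:
    """Models from horizon-picking runs whose input resembles the query image."""
    return [
        run["output"]
        for run in runs
        if run["task"] == "HorizonPicking"
        and similar_seismic(query, run["input"]) > threshold
    ]


candidates = models_for_similar_seismic("SeismicA")
```

Only the horizon-picking run trained on the image close to SeismicA survives both conditions; the run on the dissimilar image and the run for a different task are filtered out.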

Investigation 2 - Which trained machine learning models are able to perform the horizon picking task on seismic images from the same basin as the seismic image SeismicA?

This investigation is analogous to the first one: it identifies a machine learning model able to perform horizon picking on an unseen seismic image (again, called SeismicA in the query). It relies on the assumption that models trained on seismic images from the same basin as SeismicA should be useful for processing SeismicA. In terms of MLWfM, we show the use of domain knowledge to restrict the query’s output. The query uses the domain knowledge to specify the relationship between the queried seismic image (SeismicA) and other seismic images used in the context of the model’s MLWf. The HyQL query used to answer this investigation can be formulated as follows:

LET x = {
           Run hasOutput Model AND
           Run achieves HorizonPicking AND
           Run hasInput Seismic }
      Run hasOutput Model AND
      Run hasInput x AND
      x hasBasin Basin AND
      SeismicA hasBasin Basin

This query retrieves any available machine learning model from an MLWf that was developed to perform the Horizon Picking task and that consumes seismic images from the same basin as the unseen seismic image. The presented query uses the domain knowledge to relate two seismic images.
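In Investigation 2 the link between the query image and the training images comes from a shared domain concept rather than from a numeric score. A minimal Python sketch of that join follows; the basin assignments, run tuples, and model names are invented data, and the function is an illustration of the idea rather than the HyQL evaluation.

```python
from typing import Dict, List, Tuple

# Hypothetical domain knowledge: which basin each seismic image comes from.
has_basin: Dict[str, str] = {
    "SeismicA": "Santos",
    "SeismicB": "Santos",
    "SeismicC": "Campos",
}

# Hypothetical runs as (task, input seismic, output model).
runs: List[Tuple[str, str, str]] = [
    ("HorizonPicking", "SeismicB", "model_b"),
    ("HorizonPicking", "SeismicC", "model_c"),
]


def models_from_same_basin(query: str) -> List[str]:
    """Models whose horizon-picking run consumed an image from the query's basin."""
    basin = has_basin[query]
    # Images related to the query through the shared Basin concept.
    related = {img for img, b in has_basin.items() if b == basin and img != query}
    return [
        model
        for task, seismic, model in runs
        if task == "HorizonPicking" and seismic in related
    ]


same_basin_models = models_from_same_basin("SeismicA")
```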

Investigation 3 - Which trained machine learning models are able to retrieve horizons with an accuracy of at least 0.85 from seismic images of the Santos Basin?

This investigation aims to retrieve a machine learning model that was trained to identify horizons on seismic images from the Santos Basin. From the MLWfM perspective, the investigation uses metadata from components to achieve its goal. Thus, the following query retrieves the set of machine learning models whose overall accuracy is above 0.85 and that are able to identify horizons in seismic images of the Santos Basin. The HyQL query formulated to answer this investigation is:

      Run hasInput Dataset AND
      Dataset hasBasin SantosBasin AND
      Run hasOutput Model AND
      Model hasQuality ModelCharacteristic AND
      ModelCharacteristic.output=Horizon AND
      Run hasOutput ModelEvaluation AND
      ModelEvaluation.accuracy > 0.85

This third query retrieves any available machine learning model from an MLWf developed to retrieve horizons from seismic images of the Santos Basin. Here, the query filters the retrieved results by considering only models with accuracy greater than 0.85.
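The metadata-based filter of Investigation 3 can be mirrored in Python by flattening each run into one record that carries the values the query inspects. The record layout and all sample values below are assumptions; the comparison is strict (`>`), following the query as written.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunRecord:
    """Flattened view of one run; field names are chosen for this sketch."""
    dataset_basin: str   # basin of the input Dataset
    model: str           # Model produced by the run
    model_output: str    # value of ModelCharacteristic.output
    accuracy: float      # value of ModelEvaluation.accuracy


# Invented sample data.
records: List[RunRecord] = [
    RunRecord("SantosBasin", "model_1", "Horizon", 0.91),
    RunRecord("SantosBasin", "model_2", "Horizon", 0.80),
    RunRecord("SantosBasin", "model_3", "Fault", 0.95),
    RunRecord("CamposBasin", "model_4", "Horizon", 0.97),
]


def horizon_models(
    rows: List[RunRecord], basin: str = "SantosBasin", min_accuracy: float = 0.85
) -> List[str]:
    """Models that output horizons, trained on the basin, above the accuracy bar."""
    return [
        r.model
        for r in rows
        if r.dataset_basin == basin
        and r.model_output == "Horizon"
        and r.accuracy > min_accuracy
    ]


selected = horizon_models(records)
```

Each of the three conditions removes exactly one of the sample records, which makes it easy to see what every clause of the query contributes.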

VI Conclusion and Future Work

In this work, we define and address machine learning workflow management as a technique for the symbolic representation, execution, and creation of machine learning workflows. We introduce the symbolic representation in the context of the Hyperknowledge framework. We propose the retrieval of MLWfs’ components based on the Hyperknowledge Query Language and the composition of these retrieved components to create a new MLWf. We show KES as a tool to support the execution of components. Finally, we validate our approach by demonstrating scenarios in the Oil & Gas industry that exemplify the capabilities of our technique. Potential future work includes the composition of new machine learning models using fragments from existing models [morenopatent2019].