Industrial Federated Learning – Requirements and System Design

05/14/2020 ∙ by Thomas Hiessl, et al. ∙ Siemens AG

Federated Learning (FL) is a very promising approach for improving decentralized Machine Learning (ML) models by exchanging knowledge between participating clients without revealing private data. Nevertheless, FL is still not tailored to the industrial context, as strong data similarity is assumed for all FL tasks. This is rarely the case in industrial machine data, which varies with machine type and with operational and environmental conditions. Therefore, we introduce an Industrial Federated Learning (IFL) system supporting knowledge exchange in continuously evaluated and updated FL cohorts of learning tasks with sufficient data similarity. This enables optimal collaboration of business partners on common ML problems, prevents negative knowledge transfer, and ensures resource optimization of involved edge devices.




1 Introduction

Industrial manufacturing systems often consist of various operating machines and automation systems. High availability and fast reconfiguration of each operating machine are key to frictionless production, resulting in competitive product pricing [15]. To ensure high availability of each machine, condition monitoring is often realized based on ML models deployed to edge devices, e.g., indicating anomalies in production [5]. The performance of these ML models clearly depends on the available training data, which is often only available to a limited degree for individual machines. Training data might be increased by sharing data within the company or with an external industry partner [3]. The latter approach is often critical, as sensitive business or private information might be contained.

The recently emerged Federated Learning (FL) method enables training an ML model on multiple local datasets contained in local edge devices without exchanging data samples [13]. In this privacy-preserving approach, a server typically receives parameters (e.g., gradients or weights of neural networks) from local models trained on decentralized edge devices and averages these parameters to build a global model [11]. After that, the averaged global model parameters are forwarded to edge devices to update local models. This process is repeated until the global model converges or a defined termination condition is met.
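The averaging step described above can be illustrated with a minimal sketch. It assumes that each client reports its parameters together with its number of local training samples and that the server weights clients by that sample count; the function name `federated_average` and the flat parameter layout are illustrative assumptions, not part of any cited system.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def federated_average(updates: List[Tuple[Params, int]]) -> Params:
    """Average client parameters, weighting each client by its sample count."""
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("at least one client must contribute samples")
    first_params = updates[0][0]
    averaged: Params = {}
    for name, values in first_params.items():
        averaged[name] = [
            sum(params[name][i] * n for params, n in updates) / total
            for i in range(len(values))
        ]
    return averaged


# Two edge devices report weights for the same model after local training.
global_model = federated_average([
    ({"w": [1.0, 2.0]}, 10),   # client with 10 local samples
    ({"w": [3.0, 4.0]}, 30),   # client with 30 local samples
])
print(global_model)
```

The resulting global parameters would then be sent back to the edge devices, and the round would be repeated until the termination condition is met.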

However, to solve the discussed challenges of successfully applying ML models in industrial domains, FL needs to be adapted. Therefore, the integration of operating machines and their digital representations, named assets, needs to be considered, as depicted in Figure 1.

Figure 1: FL with industrial assets; Assets generate data that are used in learning tasks for ML models executed on edge devices; Learning tasks for ML models based on the same asset type are part of an FL population; Learning tasks for ML models with similar data are part of an FL population subset named FL cohort; Knowledge transfer in continuously evaluated and updated FL cohorts ensures optimal collaboration with respect to model performance and business partner criteria

Assets generate data on the shop floor during operation. Edge devices record this data to enable training of ML models, e.g., in the field of anomaly detection, which aims to identify abnormal behavior of machines in production. To improve model quality, FL is applied by aggregating model parameters centrally in a global model, e.g., in the cloud, and sending out updates to other edge devices. Typically, all models of local learning tasks corresponding to the same ML problem are updated. This set of tasks is called an FL population. In the depicted industry scenario, an FL population corresponds to all learning tasks for models trained on asset data with the same data scheme, which is typically ensured if assets are of the same asset type. For example, the learning tasks of models M2.1 (E2), M2.2 (E2), and M2.2 (E3) belong to FL population 2, since they are based on assets of Asset Type T2. In contrast, the learning tasks of models M1 (E1) and M1 (E3) belong to FL population 1. However, even assets of the same asset type could face heterogeneous environmental and operating conditions, which affect the recorded data. Due to these potential dissimilarities in asset data, model updates can cause negative knowledge transfer, which decreases model performance [14]. For this reason, industrial FL systems need to consider FL cohorts as subsets of an FL population. This enables knowledge sharing only within, e.g., FL cohort 2, which includes the M2.2 models using similar asset data.

For this, we propose to establish FL system support for knowledge exchange in FL cohorts involving ML models based on asset data from industry. Furthermore, such a system needs to support continuous adaptation of FL cohorts as ML models evolve over time. To additionally support efficient FL with high quality of asset data, we aim for resource optimization of involved edge devices and appropriate consideration of Quality of Information (QoI) metrics [8]. Hence, our contribution comprises requirements and a system design for Industrial Federated Learning (IFL), which we introduce in this paper. IFL aims to improve collaboration on training and evaluating ML models in industrial environments. For this, we consider current FL systems and approaches [1, 2, 10, 11, 13] and incorporate industry concepts as well as experience from industrial projects. The design of the IFL system is presented with respect to supported workflows, domain model, and architecture.

In Section 2 we refer to the basic notation of FL. We review related work in Section 3 and subsequently present requirements of IFL in Section 4. The design of the IFL system is presented in Section 5 with respect to supported workflows, domain model, and architecture. We conclude in Section 6 and provide an outlook on future work.

2 IFL Notation

To introduce the basic notation of an IFL system, we extend the FL notation by Bonawitz et al. [1], who define device, FL server, FL task, FL population, and FL plan. Devices are hardware platforms, e.g., industrial edge devices or mobile phones, running FL clients to execute the computation necessary for training and evaluating ML models. To use FL, an FL client communicates with the FL server to run FL tasks for a given FL population. The latter is a globally unique name that identifies a learning problem which multiple FL tasks have in common. The FL server aggregates results (i.e., model updates), persists the global model, and provides it to FL clients of a given FL population. An FL plan corresponds to an FL task and represents its federated execution instructions for the FL server and involved FL clients. It consists of sequences of ML steps, e.g., data pre-processing, training, and evaluation, to be executed by FL clients, and instructions for aggregating ML models on the FL server. Furthermore, we define FL cohorts that group multiple FL tasks within the same FL population and with similarities in their underlying asset data.

3 Related Work

3.1 FL Systems

Most of the current FL studies focus on federated algorithm design and efficiency improvement [9]. Besides that, Bonawitz et al. [1] built a scalable production system for FL aiming to facilitate learning tasks on mobile devices using TensorFlow. Furthermore, NVIDIA Clara provides an SDK to integrate custom ML models in an FL environment. This system has been evaluated with data from the medical domain, considering a scenario with decentralized image datasets located in hospitals. However, dynamically changing data patterns in learning tasks of FL cohorts have not been considered in the literature so far.

3.2 Client Selection

Nishio et al. [12] optimize model training duration in FL by selecting only a subset of FL clients. Since FL clients face heterogeneous conditions and are provisioned with diverse resource capabilities, not all of them will manage to deliver results in reasonable time. For this reason, only those that deliver before a deadline are selected in the current training round. To achieve the best accuracy for the global model, the FL server may select FL clients based on their model evaluation results on held-out validation data [1]. This allows optimizing the configuration of FL tasks, such as centrally setting hyperparameters for model training or defining the optimal number of involved FL clients. Although these client selection approaches need to be considered in IFL, the IFL system further selects FL clients based on collaboration criteria with respect to potential FL business partners.
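As a minimal sketch of how deadline-based selection could be combined with collaboration criteria, the following example filters clients by delivery time and by a list of excluded organizations. The `ClientReport` structure, the function `select_clients`, and the exclusion list are hypothetical names introduced for illustration; they are not taken from [12] or [1].

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class ClientReport:
    """Hypothetical per-round report of an FL client."""
    client_id: str
    organization: str
    seconds_to_deliver: float


def select_clients(
    reports: List[ClientReport],
    deadline_s: float,
    excluded_organizations: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Keep clients that met the deadline and belong to an accepted organization."""
    return [
        r.client_id
        for r in reports
        if r.seconds_to_deliver <= deadline_s
        and r.organization not in excluded_organizations
    ]


reports = [
    ClientReport("E1", "OrgA", 12.0),
    ClientReport("E2", "OrgB", 45.0),   # misses the deadline
    ClientReport("E3", "OrgC", 30.0),
    ClientReport("E4", "OrgD", 10.0),   # fast, but excluded as a partner
]
selected = select_clients(reports, deadline_s=30.0,
                          excluded_organizations=frozenset({"OrgD"}))
print(selected)
```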

3.3 Continuous Federated Learning

Liu et al. [10] propose a cloud-based FL system for reinforcement learning tasks of robots navigating around obstacles. Since some robots train frequently and therefore update ML models continuously, the authors identify the need for sharing these updates with other federated robots. These updates are asynchronously incorporated in the global model to eventually enhance the navigation skills of all involved robots. Based on that, in IFL the continuous updates are used to re-evaluate the data similarity that is needed to ensure high model quality within an FL cohort organization.

4 Requirements

In this section, we present requirements that should be covered by an IFL system. Based on the FL system features discussed in [9], we add requirements with respect to industrial data processing and continuous adaptation of the system.

4.1 Industrial Metadata Management

To support collaboration of FL clients, we identify the requirement of publishing metadata describing the organization and its devices. Based on this, FL clients can provide criteria for collaborating with other selected FL clients. Although actual raw data is not shared in FL, this metadata makes it possible to adhere to company policies for interacting with potential partners. Asset models, as provided by Siemens MindSphere, describe the data scheme for industrial IoT data. Since industrial FL clients aim to improve ML models using asset data, metadata describing the assets forms the basis for collaborating in suitable FL populations.

4.2 FL Cohorts

As discussed in Section 3.2, FL client selection plays a role in FL to reduce the duration of, e.g., training or evaluation [12]. Furthermore, client selection based on evaluation using held-out validation data can improve the accuracy of the global model [1]. In our experience, these approaches do not sufficiently address data generated by industrial assets and processed by FL clients. Therefore, our approach considers asset data characteristics to achieve optimal accuracy and performance for all individual client models. To this end, we identify the requirement of evaluating models with regard to similarities of asset data influenced by operating and environmental conditions. This is the basis for building FL cohorts of FL tasks using asset data with similar characteristics. FL cohorts ensure that FL clients only share updates within a subset of FL clients whose submitted FL tasks belong to the same FL cohort. These updates are likely to improve their individual model accuracy more than if updates were shared between FL clients that face very heterogeneous data due to, e.g., different environmental or operating conditions of involved assets. In manufacturing industries, there are situations where assets are placed in sites with similar conditions, e.g., production machines placed on shop floors with similar temperature, noise, and other features considered in the model prediction. In such cases, the IFL system needs to build FL cohorts.

4.3 Quality of Information

Since each FL client trains and evaluates on its local data set, aggregated global models result from data sets with diverse QoI. Furthermore, due to different agents operating in industry, e.g., fully autonomous control systems as well as semi-autonomous ones with human interaction [6], different data recording approaches can influence the QoI of asset data sets. Lee et al. [8] discuss different dimensions of QoI, e.g., free-of-error, relevancy, reputation, appropriate amount, believability, consistent representation, and security. Based on that, we derive the need to evaluate QoI on FL clients and to use the resulting metrics on the FL server to decide on the extent to which an individual FL client contributes to the parameter aggregation process. Storing QoI metrics next to existing industrial metadata of participating organizations further supports building and updating suitable FL cohorts.

4.4 Continuous Learning

Artificial Intelligence (AI) increasingly enables the operation of industrial processes to realize flexibility, efficiency, and sustainability. For this, domain experts often have to repeatedly understand new data with respect to its physical behavior and the meaning of parameters of the underlying process [7]. Moreover, continuously involving domain experts and data scientists in updating ML models, e.g., by providing labels for recently recorded time series data, is a resource-intensive process that can be facilitated by continuously collaborating in FL. Based on that, we identify the need to support continuously re-starting FL processes and reorganizing cohorts over time in order to account for major changes in asset time series data.

4.5 Scheduling and Optimization

Executing FL plans can cause heavy loads on edge devices, e.g., when training ML models on large data sets [1]. Bonawitz et al. [1] identified the need for device scheduling. This means, e.g., that multiple FL plans are not executed in parallel on single devices with little capacity, or that repeated training on older data sets is avoided while training on FL clients with new data is promoted. For industry purposes, optimization of cohort communication is additionally needed. This means that FL tasks linked to an FL cohort can be transferred to other cohorts if this improves communication between involved FL clients with respect to, e.g., latency minimization [4]. We believe that this can decrease model quality, since communication metrics are preferred over model quality metrics. Therefore, IFL systems need to consider this trade-off in an optimization problem and solve it to maximize overall utility. Furthermore, collaboration restrictions of FL clients need to be considered in the optimization problem. This ensures that no organization joins FL cohorts with other organizations that it does not want to collaborate with.

5 System Design

Figure 2: Domain Model

5.1 Domain Model

To establish a domain model for IFL, we consider FL terminology [1] as well as concepts from industrial asset models as discussed in Section 4.1. For this, Figure 2 depicts FL Population, FL Server, FL Client, FL Task, and FL Plan as discussed in Section 2. Herein, we consider deploying and running the FL server either in the Cloud or on an Edge Device. The FL client is hosted on an industrial edge device, i.e., a hardware device at a given location. To support scheduling and optimization decisions of the FL server, the edge device contains resource usage metrics and hardware specifications (hwConfig). An FL task refers to an ML Model that needs to be trained with an algorithm on a given Dataset consisting of time series values. The scheme of the Dataset is defined by an Aspect Type, which contains a set of Variables. Each Variable has name, unit, dataType, defaultValue, and length attributes to define the content of the corresponding time series values. The qualityCode indicates whether a variable supports OPC Quality Codes. This enables recording and evaluating QoI metrics on the FL client as discussed in Section 4.3. Since industrial ML tasks typically consider data from industrial assets, we define an Asset (e.g., a concrete engine) operating at a given location and facing environmental conditions (envDescription). The asset is an instance of an Asset Type (e.g., an engine) that collects multiple aspects (e.g., surface vibrations) of corresponding aspect types (e.g., vibration), which in turn collect variables (e.g., vibrations in x, y, z dimensions). The asset is connected to an edge device that records data for it. To express the complexity of industrial organizations, hierarchical asset structures can be built, as depicted by the recursive associations of assets and related asset types, considering nesting of, e.g., overall shop floors, their assembly lines, involved machines, and their parts. Finally, we introduce FL cohorts as groups of FL tasks. An FL cohort is built with respect to similarities of the assets considered in the attached ML models. Thus, FL tasks are typically created to solve ML problems based on asset data, where the aspect type referenced in the Dataset of the ML model is used in the linked asset.
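The entities of the domain model can be sketched as plain data classes. The following is only one possible reading of the description above: attribute names follow the text but are written in snake_case, while the concrete field types, the `children` list for nested assets, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Variable:
    name: str
    unit: str
    data_type: str                      # dataType in the domain model
    default_value: Optional[float] = None
    length: Optional[int] = None
    quality_code: bool = False          # True if OPC Quality Codes are supported


@dataclass
class AspectType:
    name: str
    variables: List[Variable]


@dataclass
class AssetType:
    name: str
    aspect_types: List[AspectType]


@dataclass
class EdgeDevice:
    device_id: str
    location: str
    hw_config: str                      # hwConfig; free-text here (assumption)


@dataclass
class Asset:
    name: str
    asset_type: AssetType
    location: str
    env_description: str                # envDescription
    edge_device: EdgeDevice
    children: List["Asset"] = field(default_factory=list)  # nested assets


@dataclass
class FLTask:
    task_id: str
    ml_model: str
    aspect_type: AspectType             # scheme of the task's Dataset
    asset: Asset


@dataclass
class FLCohort:
    name: str
    tasks: List[FLTask] = field(default_factory=list)


# Example: an engine with a vibration aspect, recorded by one edge device.
vibration = AspectType("vibration", [
    Variable(axis, unit="mm/s", data_type="DOUBLE", quality_code=True)
    for axis in ("x", "y", "z")
])
engine_type = AssetType("engine", [vibration])
edge = EdgeDevice("E2", location="Plant 1", hw_config="4 cores, 8 GB RAM")
engine = Asset("engine-01", engine_type, "Plant 1", "indoor, 20 degrees C", edge)
task = FLTask("task-1", "M2.2", vibration, engine)
cohort = FLCohort("FL cohort 2", [task])

# The aspect type of the task's dataset is used in the linked asset.
print(task.aspect_type in task.asset.asset_type.aspect_types)
```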

5.2 Workflows

To address the requirements of Section 4, we propose several workflows to be supported by the IFL system.

5.2.1 FL Client Registration

Assuming the FL server is in place, the FL client starts participating in the IFL system by registering itself. For this, the FL client has to submit a request including organization and edge device information. Furthermore, aspect types are submitted, describing the data scheme based on which the organization is willing to collaborate in FL processes with other organizations. Additionally, the assets enabled for FL are posted to the FL server to provide an overview to other organizations and to ensure that IFL can build FL cohorts based on the respective environmental conditions.

5.2.2 Cohort Search Criteria Posting

After FL client registration, other FL clients can request a catalog of edge devices, organizations, and connected assets. Based on this, cohort search criteria can be created, potentially including organizations, industries, asset types, and aspect types. This makes it possible to match submitted FL tasks to FL cohorts based on client restrictions for collaboration and their ML models.

5.2.3 Submit and Run FL Tasks

The FL client creates an FL task including references to the ML model, without revealing the actual data set, and submits it to the FL server. If FL tasks target the same problem, i.e., they refer to the same aspect types and corresponding ML model, the provided FL task is attached to an existing FL population; otherwise, a new FL population is created. IFL then builds FL cohorts of FL tasks based on metadata provided during registration and posted cohort search criteria. If no cohort search criteria are provided by the FL client, the submitted FL tasks are initially placed in the default FL cohort of the given FL population. To actually start FL, an FL plan is created including server and client instructions to realize, e.g., Federated Averaging [11] on the server and training of ML models on every involved FL client. The configuration of FL tasks allows defining parameters for the supported algorithms of IFL, e.g., setting termination conditions for FL or defining the number of repeated executions over time. Since FL tasks are realized either as training or as evaluation plans, the data exchanged between FL client and FL server differ. While training plans typically include the sharing of model parameters, e.g., gradients or weights of neural networks, evaluation plan execution results in metrics that are stored by IFL to further enable FL cohort reconfiguration and optimization.
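The mapping of submitted tasks to populations and cohorts can be sketched as follows. This is a simplified illustration under stated assumptions: a population is keyed by the set of aspect types and the model name, and cohort search criteria are reduced to a single label. The class `FLTaskRegistry` and its method `submit` are hypothetical and not part of the described system.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple

PopulationKey = Tuple[FrozenSet[str], str]


class FLTaskRegistry:
    """Hypothetical server-side registry of FL populations and cohorts."""

    DEFAULT_COHORT = "default"

    def __init__(self) -> None:
        # population key -> cohort name -> list of task ids
        self.populations: Dict[PopulationKey, Dict[str, List[str]]] = {}

    def submit(
        self,
        task_id: str,
        aspect_types: Iterable[str],
        ml_model: str,
        cohort_criteria: Optional[str] = None,
    ) -> Tuple[PopulationKey, str]:
        # Same aspect types and same ML model -> same FL population.
        key: PopulationKey = (frozenset(aspect_types), ml_model)
        cohorts = self.populations.setdefault(key, {})
        # Without search criteria, the task joins the default cohort.
        cohort = cohort_criteria or self.DEFAULT_COHORT
        cohorts.setdefault(cohort, []).append(task_id)
        return key, cohort


registry = FLTaskRegistry()
pop_a, cohort_a = registry.submit("t1", ["vibration"], "anomaly-detector")
pop_b, cohort_b = registry.submit("t2", ["vibration"], "anomaly-detector")
pop_c, cohort_c = registry.submit("t3", ["temperature"], "anomaly-detector",
                                  cohort_criteria="automotive")
print(len(registry.populations), registry.populations[pop_a])
```

In a fuller design, the cohort would be derived from registration metadata and structured search criteria rather than from a single label.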

5.2.4 Update FL Cohorts

Metrics collected in the FL process enable updating FL cohorts by splitting and merging them. Furthermore, IFL considers moving FL tasks between cohorts. The respective metrics include information such as environmental changes of assets and model accuracy. In addition, similarity measures of ML models are computed based on server-provided data, where available. If such evaluation data is present, one strategy for updating FL cohorts is to put FL tasks in the same FL cohort if their ML models predict, ideally, the same output for the provided input samples.
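One way to illustrate the prediction-based strategy is a greedy grouping over model outputs on shared evaluation samples. The sketch below assumes that each task's model has already produced predictions for the same server-provided inputs; the mean absolute difference as similarity measure, the threshold, and the function names are assumptions chosen for illustration, not the method prescribed by the paper.

```python
from typing import Dict, List


def mean_abs_diff(a: List[float], b: List[float]) -> float:
    """Mean absolute difference between two prediction vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def regroup_cohorts(
    predictions: Dict[str, List[float]], threshold: float
) -> List[List[str]]:
    """Greedily group tasks whose models predict similar outputs.

    A task joins the first cohort whose first member (used as the
    representative) lies within the threshold; otherwise it starts a
    new cohort.
    """
    cohorts: List[List[str]] = []
    for task_id, preds in predictions.items():
        for cohort in cohorts:
            representative = predictions[cohort[0]]
            if mean_abs_diff(preds, representative) <= threshold:
                cohort.append(task_id)
                break
        else:
            cohorts.append([task_id])
    return cohorts


# Predictions of three models on the same three server-provided samples.
predictions = {
    "M2.1 (E2)": [0.90, 0.10, 0.80],
    "M2.2 (E2)": [0.20, 0.70, 0.10],
    "M2.2 (E3)": [0.25, 0.65, 0.15],
}
cohorts = regroup_cohorts(predictions, threshold=0.1)
print(cohorts)
```

Here the two M2.2 models end up in one cohort, while M2.1 (E2) forms its own. The greedy pass depends on input order; a clustering method could replace it where that matters.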

5.2.5 Evaluate QoI

The QoI of raw data used by each FL client is computed on edge devices and mapped to OPC Quality Codes as defined in Section 5.1. Besides using submitted QoI for, e.g., updating FL cohorts, IFL considers QoI in the contribution weights of FL clients in the weighted averaging of model parameters as defined in [11].
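A minimal sketch of QoI-aware aggregation is given below. It assumes that each client submits a QoI score normalized to the range [0, 1] and that the server multiplies this score with the client's sample count to obtain its contribution weight. Both the normalization and the multiplicative combination are assumptions for illustration; the text does not fix how QoI enters the weights.

```python
from typing import List, Tuple

# (model weights, number of local samples, QoI score in [0, 1])
ClientUpdate = Tuple[List[float], int, float]


def qoi_weighted_average(updates: List[ClientUpdate]) -> List[float]:
    """Average model weights with contribution = sample count * QoI score."""
    contributions = [n * qoi for _, n, qoi in updates]
    total = sum(contributions)
    if total <= 0:
        raise ValueError("no client contributes with positive weight")
    dim = len(updates[0][0])
    return [
        sum(w[i] * c for (w, _, _), c in zip(updates, contributions)) / total
        for i in range(dim)
    ]


# Two clients with equal sample counts but different data quality.
aggregated = qoi_weighted_average([
    ([1.0, 1.0], 100, 1.0),   # high-quality data
    ([3.0, 5.0], 100, 0.5),   # lower-quality data counts half as much
])
print(aggregated)
```

With equal QoI scores, this reduces to averaging weighted by sample count only.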

5.2.6 Continuous Learning

After time series data has been updated and, if needed, properly labelled, FL tasks are submitted. For this, either synchronous [11] or asynchronous [2] FL processes are triggered. In the asynchronous case, IFL determines the timing for notifying FL clients to update ML models according to recent improvements of one FL client.

5.2.7 Optimize Computation and Communication

First, the FL server loads resource usage from edge devices to determine the load caused by executed processes. Second, network statistics (e.g., latency) are identified as recorded for model update sharing between FL clients and the FL server. Third, statistics of past FL plan executions, e.g., processing duration, are loaded to be incorporated in an optimization model. Finally, this model optimizes future FL plan executions considering Quality of Service (QoS) criteria [4] such as processing cost, network latency, and cohort reconfiguration cost.
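The final step can be illustrated with a deliberately simple cost model. The sketch assumes that each candidate cohort assignment for an FL task comes with an estimated processing cost, network latency, and reconfiguration cost, and that the optimizer minimizes a weighted sum of these. The weights, the linear form, and all names are hypothetical; a real optimization model would also include model quality and collaboration restrictions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CohortOption:
    """Hypothetical candidate assignment of one FL task to a cohort."""
    cohort: str
    processing_cost: float
    latency_ms: float
    reconfiguration_cost: float   # 0 if the task stays in its cohort


def total_cost(option: CohortOption,
               w_processing: float = 1.0,
               w_latency: float = 0.1,
               w_reconfiguration: float = 1.0) -> float:
    """Weighted sum of the three QoS criteria (weights are assumptions)."""
    return (w_processing * option.processing_cost
            + w_latency * option.latency_ms
            + w_reconfiguration * option.reconfiguration_cost)


def choose_cohort(options: List[CohortOption]) -> CohortOption:
    """Pick the option with the lowest weighted cost."""
    return min(options, key=total_cost)


options = [
    CohortOption("A", processing_cost=2.0, latency_ms=80.0,
                 reconfiguration_cost=0.0),    # current cohort
    CohortOption("B", processing_cost=2.0, latency_ms=20.0,
                 reconfiguration_cost=10.0),   # faster, but requires a move
]
best = choose_cohort(options)
print(best.cohort)
```

In this example the lower latency of cohort B does not outweigh the cost of reconfiguration, so the task stays in cohort A.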

5.3 Architecture

To realize the workflows presented in the previous section, we propose the IFL architecture depicted in Figure 3.

Figure 3: FL Client and Server Architecture

Considering the two types of parties involved in IFL, we present the FL Application and the FL Server. The former is a container for a Client Application, which is a domain-dependent consumer of IFL. Furthermore, the FL Application contains the FL Client, which interacts with the FL Server.

We now discuss the main components of the IFL system and their responsibilities. First, the FL client registration workflow involves the Device Manager of the FL client. It provides an API to the client application to register for FL. The client application provides a list of participating edge devices and general information about the organization. Forwarding this to the Client Registry allows persisting it in the Device & Asset Metadata Catalog stored on the FL server. Cohort search criteria posting is supported by the device manager and client registry as well, which additionally expose an interface to the FL Cohort Manager to provide the device & asset metadata catalog and the FL cohort search criteria for creating FL cohorts.

Submitting new FL tasks is initiated by invoking the FL Task Manager, which is in charge of enriching the information provided by the FL task with information about the associated ML model and targeted asset. After the FL task is forwarded to the server-side FL Scheduler, it is mapped to the corresponding FL population and persisted. Furthermore, the FL Scheduler attaches scheduling information to trigger timely execution of all FL tasks of an FL population. To actually run an FL task, the FL Scheduler hands it over to the FL Plan Processor. It translates the FL task into an FL plan and corresponding instructions as defined in the Federated Computation Specifications. Subsequently, it creates the corresponding global ML Model and starts the FL process for a given FL cohort by connecting to all FL clients that have FL tasks in the same FL cohort. This information is provided by the FL cohort manager. Analogously to FL plans, there is a client counterpart of the FL plan processor as well. It invokes the client instructions specified in the FL plan to, e.g., train or evaluate ML models on local edge devices. Metrics resulting from evaluation plans are provided by the FL plan processor to the FL cohort manager to update cohorts continuously. Further metrics from, e.g., continuous learning approaches or QoI evaluations are stored and used directly by the FL plan processor, e.g., during model aggregation. The FL Resource Optimizer connects to the metrics storage to incorporate parameters in optimization models. After solving the optimization task, the solution is returned to the FL scheduler to trigger, e.g., cohort reorganization and to update the schedule of FL plan executions.

6 Conclusions and Future Work

In this work, we identified the need for IFL and provided a structured collection of requirements and workflows covered in an IFL architecture. Due to the diverse conditions of assets operating in industry, FL clients are not advised to exchange ML model parameters with the global set of FL participants. Therefore, we proposed grouping FL tasks in FL cohorts that share knowledge resulting from similar environmental and operating conditions of involved assets. Furthermore, we highlighted that FL can decrease the amount of resource-intensive work for domain experts, since fewer continuous updates of datasets and less labelling need to be done. Additionally, metrics resulting from QoI and ML model evaluations can be used for FL cohort reorganization and for weighting in the FL process.

As future work, we consider evaluation of a pilot implementation of the IFL system in industrial labs. Furthermore, the incorporation of open-source FL frameworks such as PySyft, TensorFlow Federated (TFF), and FATE needs to be evaluated with respect to production readiness and support for the concurrent communication and computation needed for FL cohorts. Additionally, efficient asynchronous and decentralized FL for industrial edge devices without involving a server is an interesting future research direction. Finally, forecasting of potentially negative knowledge transfer that decreases model quality could complement the idea of dynamically reorganizing FL cohorts.