I Introduction
The performance of a machine learning algorithm is highly sensitive to the choice of its hyperparameters. Hyperparameter selection is therefore a crucial task in the optimization of knowledge-aggregation algorithms. Federated Learning (FL) is a recent machine learning approach in which devices (henceforth clients) aggregate machine learning model parameters without sharing their data; the aggregation is coordinated by a server. Industrial Federated Learning (IFL) is a modified FL approach for the industrial context [1]. In an FL setting, hyperparameter optimization poses new challenges and is a major open research area [2]. In this work, we investigate the impact of different hyperparameter optimization approaches in an IFL system. We believe that the data distribution influences the choice of the best hyperparameter configuration and suggest that the best hyperparameter configuration for one client might differ from that of another client based on individual data properties. We therefore investigate a local hyperparameter optimization approach that, in contrast to a global hyperparameter optimization approach, allows every client to have its own hyperparameter configuration. The local approach allows us to optimize hyperparameters prior to the federation process, reducing communication costs.
Communication is considered a critical bottleneck in FL [3]. Clients are usually limited in terms of communication bandwidth, which increases the importance of reducing the number of communication rounds or of using compressed communication schemes for the model updates sent to the central server [3]. Dai et al. [4] introduced Federated Bayesian Optimization (FBO), extending Bayesian optimization to the FL setting. However, until now, there has been no research on the impact of global versus local hyperparameter optimization in FL. Therefore, we compare a local hyperparameter optimization approach to a global hyperparameter optimization approach, which optimizes hyperparameters within the federation process.
The aim of this work is to i) analyze challenges and formal requirements in FL, and in particular in IFL, ii) evaluate the performance of an Internet of Things (IoT) sensor based classification task in an IFL system, iii) investigate a communication-efficient hyperparameter optimization approach, and iv) compare different hyperparameter optimization algorithms. To this end, we want to answer the following questions.

Does FL work for an IoT sensor based anomaly classification task on industrial assets with non-identically distributed data in an IFL system with a cohort strategy?

Can we assume that the global and local hyperparameter optimization approaches deliver the same hyperparameter configuration in an i.i.d. FL setting?

Can we reduce communication costs in the hyperparameter optimization of a non-i.i.d. classification task in the context of FL by optimizing a hyperparameter locally prior to the federation process?

Does Bayesian optimization outperform grid search, both in the global and the local approach, for a non-i.i.d. IoT sensor based classification task?
II Algorithmic Challenges and Formal Requirements for Industrial Assets
In FL, new algorithmic challenges arise that differentiate the corresponding optimization problem from a distributed optimization problem. In distributed learning settings, major assumptions regarding the training data are made which usually fail to hold in an FL setting [5]. Moreover, non-i.i.d. data, limited communication, and limited and unreliable client availability pose further challenges for optimization problems in FL [2]. Kairouz et al. [2] considered the need for addressing these challenges a major difference to distributed optimization problems. The optimization problem in FL is therefore referred to as federated optimization, emphasizing the difference to distributed optimization [5]. In an IFL setting, additional challenges regarding industrial aspects arise [1]. In this section, we formulate the federated optimization problem and discuss the algorithmic challenges of FL in general, and of IFL in particular.
II-A Problem Formulation
We consider a supervised learning task with features $x$ in a sample space $\mathcal{X}$ and labels $y$ in a label space $\mathcal{Y}$. We assume that we have $K$ available clients, $k = 1, \dots, K$, with $D_k$ denoting the data set of client $k$ and $n_k = |D_k|$ denoting the cardinality of the client's data set. Let $\mathcal{P}$ denote the distribution over all clients, and let $\mathcal{D}_k$ denote the data distribution of client $k$. We can then access a specific data point by first sampling a client $k \sim \mathcal{P}$ and then sampling a data point $(x, y) \sim \mathcal{D}_k$ [2]. Then, the local objective function is

$$F_k(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}_k}\big[\ell(w; x, y)\big] \qquad (1)$$

where $w$ represents the parameters of the machine learning model and $\ell(w; x, y)$ represents the loss of the prediction on sample $(x, y)$ for the given parameters $w$. Typically, we wish to minimize

$$f(w) = \mathbb{E}_{k \sim \mathcal{P}}\big[F_k(w)\big]. \qquad (2)$$
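To make the two objectives concrete, the following sketch evaluates the local objective $F_k$ and the client-weighted global objective $f$ on a toy one-dimensional least-squares problem. The setup, data, and function names are illustrative assumptions, not the paper's:

```python
import numpy as np

# Toy setup (illustrative): four clients, each holding 1-D samples;
# the loss is the squared error of a single shared parameter w.
rng = np.random.default_rng(0)
data = [rng.normal(loc=k, size=int(n)) for k, n in enumerate(rng.integers(50, 100, size=4))]

def local_objective(w, samples):
    """F_k(w): average loss of parameter w on one client's data."""
    return float(np.mean((samples - w) ** 2))

def global_objective(w, data):
    """f(w): local objectives weighted by each client's share of the data."""
    n_total = sum(len(s) for s in data)
    return sum(len(s) / n_total * local_objective(w, s) for s in data)

# For squared error, the minimizer of f is the pooled mean of all client data.
w_star = float(np.concatenate(data).mean())
```

Note how each client's mean `loc=k` differs: the minimizer of a single $F_k$ can sit far from the minimizer of $f$, which is exactly the heterogeneity problem discussed below.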
II-B Federated Learning
One of the major challenges concerns data heterogeneity. In general, we cannot assume that the data is identically distributed over the clients, that is, that $\mathcal{D}_k = \mathcal{D}_l$ for all clients $k$ and $l$. Therefore, $F_k(w)$ might be an arbitrarily bad approximation of $f(w)$ [5].
In the following, we analyze different non-identically distributed settings as demonstrated by Hsieh et al. [6], assuming an IoT sensor based anomaly classification task in an industrial context. Given the distribution $\mathcal{D}_k$, let $P_k(x, y)$ denote the bivariate probability function, and let $P_k(x)$ and $P_k(y)$ denote the respective marginal probability functions. Using the conditional probability functions $P_k(y \mid x)$ and $P_k(x \mid y)$, we can rewrite the bivariate probability function as

$$P_k(x, y) = P_k(y \mid x)\,P_k(x) = P_k(x \mid y)\,P_k(y) \qquad (3)$$

for $k = 1, \dots, K$. This allows us to characterize different settings of non-identically distributed data:

Feature distribution skew: We assume that $P_k(y \mid x) = P_l(y \mid x)$ for all $k, l$, but $P_k(x) \neq P_l(x)$ for some $k, l$. Clients that have the same anomaly classes might still have differences in the measurements due to variations in sensor and machine type.

Label distribution skew: We assume that $P_k(x \mid y) = P_l(x \mid y)$ for all $k, l$, but $P_k(y) \neq P_l(y)$ for some $k, l$. The distribution of labels might vary across clients as clients might experience different anomaly classes.

Same label, different features: We assume that $P_k(y) = P_l(y)$ for all $k, l$, but $P_k(x \mid y) \neq P_l(x \mid y)$ for some $k, l$. The same anomaly class can have significantly different features for different clients due to variations in machine type, operational and environmental conditions.

Same features, different label: We assume that $P_k(x) = P_l(x)$ for all $k, l$, but $P_k(y \mid x) \neq P_l(y \mid x)$ for some $k, l$. The same features can have different labels due to operational and environmental conditions, variation in manufacturing, maintenance, et cetera.

Quantity skew: We cannot assume that different clients hold the same amount of data, that is, that $n_k = n_l$ for all $k, l$. Some clients will generate more data than others.
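As a concrete illustration, the two skew types that are easiest to simulate, label distribution skew and quantity skew, can be generated synthetically as follows. The function names and parameters are our own illustrative choices, not taken from Hsieh et al. [6]:

```python
import numpy as np

def label_skew_partition(labels, num_clients, classes_per_client, rng):
    """Label distribution skew: each client only sees a random subset of classes."""
    classes = np.unique(labels)
    return [
        np.flatnonzero(np.isin(labels, rng.choice(classes, classes_per_client, replace=False)))
        for _ in range(num_clients)
    ]

def quantity_skew_partition(num_examples, num_clients, rng, alpha=0.5):
    """Quantity skew: client data-set sizes drawn from a Dirichlet distribution."""
    sizes = (rng.dirichlet(np.full(num_clients, alpha)) * num_examples).astype(int)
    idx = rng.permutation(num_examples)
    return np.split(idx, np.cumsum(sizes)[:-1])

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=6000)  # toy labels, 10 classes
parts = label_skew_partition(labels, num_clients=5, classes_per_client=2, rng=rng)
```

In real-world industrial data, these skews co-occur, so such synthetic partitions are useful mainly for controlled benchmarking.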
In real-world problems, we expect to find a mixture of these non-identically distributed settings. In FL, heterogeneity does not exclusively refer to a non-identical data distribution, but also addresses violations of independence assumptions on the distribution [2]. Due to limited, slow, and unreliable communication on a client, the availability of a client is not guaranteed for all communication rounds. Communication is considered a critical bottleneck in FL [3]. In each communication round, the participating clients send a full model update back to the central server for aggregation. In a typical FL setting, however, the clients are usually limited in terms of communication bandwidth [3]. Consequently, it is crucial to minimize the communication costs.
II-C Industrial Federated Learning
In an industrial setting, FL faces challenges that are specific to the industrial context. Industrial assets have access to a wealth of data suitable for machine learning models; however, the data on an individual asset is typically limited and private in nature. The model can be shared not only within the company but also with an external industry partner [1]. FL leaves possibly critical business information distributed on the individual client (or within the company). However, Zhao et al. [7] proved that heterogeneity, in particular a highly skewed label distribution, significantly reduces the accuracy of the aggregated model in FL. In an industrial context, we expect to find heterogeneous clients due to varying environmental and operational conditions on different assets. Therefore, Hiessl et al. [1] introduced a modified approach of FL in an industrial context and termed it Industrial Federated Learning (IFL). IFL does not allow arbitrary knowledge exchange between clients. Instead, the knowledge exchange only takes place between clients that have sufficiently similar data. Hiessl et al. [1] refer to this set of clients as a cohort. We expect the federated learning approach in a cohort to approximate the corresponding central learning approach.
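The cohort idea can be sketched as a greedy grouping of clients by data similarity. Hiessl et al. [1] do not prescribe this particular criterion; the label-histogram representation, total variation distance, and threshold below are illustrative assumptions:

```python
import numpy as np

def label_histogram(labels, num_classes):
    """Normalized label distribution of one client."""
    return np.bincount(labels, minlength=num_classes) / len(labels)

def build_cohorts(client_labels, num_classes, threshold):
    """Greedily group clients whose label distributions are close
    (total variation distance below a threshold)."""
    hists = [label_histogram(y, num_classes) for y in client_labels]
    cohorts = []
    for k, h in enumerate(hists):
        for cohort in cohorts:
            rep = hists[cohort[0]]  # first member acts as cohort representative
            if 0.5 * np.abs(h - rep).sum() < threshold:
                cohort.append(k)
                break
        else:
            cohorts.append([k])  # no similar cohort found: start a new one
    return cohorts
```

Knowledge exchange (e.g., FedAvg aggregation) would then run separately inside each returned cohort.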
III Hyperparameter Optimization Approaches in an IFL System
In an FL setting, hyperparameter optimization poses new challenges and is a major open research area [2]. The performance of a machine learning model is linked to the amount of communication [8]. In an effort to reduce communication costs, a critical bottleneck in FL [3], we investigate a communication-efficient local hyperparameter optimization approach that allows us to optimize hyperparameters prior to the federation process. Kairouz et al. [2] introduced the idea of optimizing hyperparameters separately and suggest a different hyperparameter choice for dealing with non-i.i.d. data.
Dai et al. [4] investigated a communication-efficient local hyperparameter optimization approach and introduced Federated Bayesian Optimization (FBO), extending Bayesian optimization to the FL setting. In FBO, every client locally uses Bayesian optimization to find the optimal hyperparameter configuration. Additionally, each client is allowed to request information from other clients. Dai et al. [4] proved a convergence guarantee for this algorithm and its robustness against heterogeneity. However, until now, there has been no research on the impact of global versus local hyperparameter optimization.
In LocalHPO (Algorithm 1), we perform local hyperparameter optimization: we optimize the hyperparameter configuration for each client individually. In GlobalHPO (Algorithm 2), we perform global hyperparameter optimization: we optimize the hyperparameter configuration within the federation process. The LocalOptimization method in LocalHPO and the GlobalOptimization method in GlobalHPO can be based on any hyperparameter optimization algorithm.
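A minimal sketch of the two loops, with stub training and validation callables standing in for the IFL system. All names here are illustrative, not the paper's pseudocode:

```python
def local_hpo(clients, candidates, train_client, validate_client):
    """LocalHPO sketch: each client picks its own hyperparameter before federation."""
    best = {}
    for k in clients:
        scores = {h: validate_client(k, train_client(k, h)) for h in candidates}
        best[k] = max(scores, key=scores.get)
    return best  # one configuration per client

def global_hpo(clients, candidates, run_federation, validate_client):
    """GlobalHPO sketch: one shared hyperparameter, evaluated inside the federation."""
    def avg_accuracy(h):
        model = run_federation(clients, h)  # e.g., FedAvg with h as extra argument
        return sum(validate_client(k, model) for k in clients) / len(clients)
    return max(candidates, key=avg_accuracy)  # one configuration for all clients
```

The key cost difference is visible in the structure: `local_hpo` never calls the federation, while every candidate evaluated by `global_hpo` triggers a full federated training run.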
We differentiate between a global hyperparameter $\lambda$, whose value is constant for all clients, and a local hyperparameter $\lambda_k$, whose value depends on the client $k$. Here, $\lambda_k$ denotes the hyperparameter on client $k$. We note that this differentiation is only relevant for settings with non-i.i.d. data. In an i.i.d. setting, we assume that a hyperparameter configuration that works for one client also works for another client. In our experiments, we verified this assumption on a proxy data set.
IV Data, Algorithms and Experiments
In this section, we make our benchmark design explicit and present our experimental setup. We present the machine learning tasks including the data partition of the training data, the machine learning models, the optimization algorithms, and our experiments. We considered an image classification task on the MNIST data set of handwritten digits, and an IoT sensor based anomaly classification task on industrial assets.
IV-A Data
In order to test the IFL system on the MNIST data set, we still need to specify how to distribute the data over artificially designed clients. To systematically evaluate the effectiveness of the IFL system, we simulated an i.i.d. data distribution: we shuffled the data and partitioned it into clients, each receiving 6 000 examples. Following the approach of McMahan et al. [5], we applied a convolutional neural network with the following settings: convolutional layers with ReLU activation functions, each followed by a max-pooling layer; a dense layer with a ReLU activation function; and a dense output layer with a softmax activation function.

The industrial task concerns IoT sensor based anomaly classification on industrial assets. The data was acquired with the SITRANS multi-sensor, specifically developed for industrial applications and their requirements [9]. We considered multiple centrifugal pumps with sensors placed at different positions and in different directions to record three-axis vibrational data at a fixed sampling frequency. We operated the pumps under varying conditions, including healthy and anomalous states. A client is either assigned the data of one asset in a measurement, or the data of several assets in a measurement, ensuring that each client sees all operating conditions. However, since the assets were completely dismantled and rebuilt in the process of measurement, we consider the data to be non-i.i.d. regarding its feature distribution. We applied an artificial neural network with the following settings: a dense layer with a ReLU activation function, a dropout layer, a second dense layer with a ReLU activation function, a second dropout layer, and a softmax output layer. For preprocessing, we remapped the features using the Kabsch algorithm [11], applied a sliding window, extracted the Mel-frequency cepstral coefficients, applied the synthetic minority over-sampling technique (SMOTE) [10], and normalized the resulting features.

IV-B Algorithms
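The i.i.d. MNIST partition described above can be sketched as a shuffle-and-deal of example indices. The figure of 6 000 examples per client is from the text; the total of 60 000 (and hence 10 clients) is an assumption based on the size of the MNIST training set:

```python
import numpy as np

def iid_partition(num_examples, num_clients, rng):
    """Shuffle all example indices and deal them into equal shards, one per client."""
    return np.array_split(rng.permutation(num_examples), num_clients)

# Assumed sizes: 60,000 MNIST training examples over 10 clients.
shards = iid_partition(60_000, 10, np.random.default_rng(0))
```

Because every shard is a uniform random sample of the full data set, each client's empirical distribution approximates the global one, which is what makes this an i.i.d. baseline.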
Our evaluations include the Federated Averaging (FedAvg) algorithm according to McMahan et al. [5], and the hyperparameter optimization approaches LocalHPO (Algorithm 1) and GlobalHPO (Algorithm 2). We implemented both approaches based on grid search and on Bayesian optimization. In this section, we give their pseudocode. We searched for the learning rate $\eta$ with a fixed fraction of participating clients, a fixed number of communication rounds, a fixed number of local epochs, and a fixed minibatch size.

In Algorithm 3, we give the pseudocode of the LocalOptimization method in LocalHPO based on the grid search algorithm with a fixed grid. We iterate through the grid, train the model on the training data of client $k$ using the ClientUpdate method of the FedAvg algorithm [5] with the learning rate $\eta$ as an additional argument, and validate the performance of the model on the validation data of client $k$. Finally, the learning rate that yields the highest accuracy on the validation data of client $k$ is selected.
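Algorithm 3 can be sketched as follows; `client_update` and `validate` are illustrative stand-ins for the ClientUpdate method of [5] and the client's validation step:

```python
def local_grid_search(grid, client_update, validate):
    """Sketch of Algorithm 3: grid search over learning rates on one client."""
    best_lr, best_acc = None, float("-inf")
    for lr in grid:
        model = client_update(lr)  # local training with this learning rate
        acc = validate(model)      # accuracy on the client's validation data
        if acc > best_acc:
            best_lr, best_acc = lr, acc
    return best_lr
```

The global variant (Algorithm 4) has the same loop shape but replaces `client_update` with a full FedAvg run and `validate` with the average validation accuracy over all clients.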
In Algorithm 4, we give the pseudocode of the GlobalOptimization method in GlobalHPO based on the grid search algorithm with a fixed grid. We iterate through the grid, perform the FedAvg algorithm [5] with the learning rate as an additional argument, validate the performance of the model on the validation data of all clients, and compute the average accuracy over all clients. Finally, the learning rate that yields the highest average accuracy is selected.
In Algorithm 5, we give the pseudocode of the LocalOptimization method in LocalHPO based on Bayesian optimization. The objective function takes the learning rate as an argument, trains the model on the training data of client $k$ using the ClientUpdate method of the FedAvg algorithm [5] with the learning rate as an additional argument, validates the performance of the model on the validation data of client $k$, and returns the resulting accuracy. We initialize a Gaussian process for the objective function with a number of initial sample points. Then, we find the next sample point by maximizing the acquisition function, evaluate the objective at this point, and update the Gaussian process. We repeat this for a fixed number of iterations. Finally, we select the learning rate that yields the highest accuracy.
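A self-contained sketch of this procedure, with a squared exponential kernel and an upper confidence bound (UCB) acquisition function over a fixed set of candidate learning rates. The kernel length scale, UCB weight, iteration counts, and the stub objective (standing in for the client's validation accuracy) are illustrative assumptions:

```python
import numpy as np

def sqexp(a, b, length=0.1):
    """Squared exponential (RBF) kernel on 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def bayes_opt(objective, candidates, num_init, num_iter, rng, beta=2.0, jitter=1e-5):
    """Gaussian-process Bayesian optimization with a UCB acquisition function."""
    x = list(rng.choice(candidates, size=num_init, replace=False))
    y = [objective(v) for v in x]
    for _ in range(num_iter):
        xs, ys = np.array(x), np.array(y)
        K_inv = np.linalg.inv(sqexp(xs, xs) + jitter * np.eye(len(xs)))
        k_star = sqexp(candidates, xs)
        mu = k_star @ K_inv @ ys  # GP posterior mean at each candidate
        var = np.clip(1.0 - np.sum(k_star @ K_inv * k_star, axis=1), 0.0, None)
        x_next = candidates[int(np.argmax(mu + beta * np.sqrt(var)))]  # UCB
        x.append(x_next)
        y.append(objective(x_next))
    return x[int(np.argmax(y))]  # best observed point
```

In the local approach this loop runs once per client with that client's validation accuracy as the objective; in the global approach (Algorithm 6) the objective instead performs a FedAvg run and returns the average accuracy over all clients.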
In Algorithm 6, we give the pseudocode of the GlobalOptimization method in GlobalHPO based on Bayesian optimization. The objective function takes the learning rate as an argument, performs the FedAvg algorithm [5] with the learning rate as an additional argument, validates the performance of the model on the validation data of all clients, computes the average accuracy over all clients, and returns the resulting accuracy. We initialize a Gaussian process for the objective function with a number of initial sample points. Then, we find the next sample point by maximizing the acquisition function, evaluate the objective at this point, and update the Gaussian process. We repeat this for a fixed number of iterations. Finally, we select the learning rate that yields the highest average accuracy.
IV-C Experiments
In order to systematically investigate the impact of global and local hyperparameter optimization, we compared the global and local hyperparameter optimization approaches in an i.i.d. setting, the MNIST machine learning task, as well as in a non-i.i.d. setting, the industrial task. To this end, we implemented the global and local optimization approaches based on grid search with a fixed grid, and based on Bayesian optimization with the widely used squared exponential kernel and the upper confidence bound acquisition function. We searched for the learning rate with the remaining hyperparameters fixed.
In order to evaluate the global and local optimization approaches in a direct comparison, we chose the number of epochs in the local optimization approach as $E_{\mathrm{local}} = E \cdot T$, where $E$ is the number of local epochs in the global optimization approach and $T$ is the number of communication rounds. The remaining hyperparameter values were fixed separately for the MNIST data and for the industrial data, in both the global and the local optimization task. For the evaluation of the global hyperparameter optimization approach, we optimized the learning rate using the global approach, trained the federated model with a global learning rate, and tested the resulting federated model on the cohort test data. Then, we optimized the learning rate using the local approach, trained the federated model with individual local learning rates for each client in the cohort, and tested the resulting federated model on the cohort test data.
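This matched budget can be illustrated with a small calculation: with the local epoch count set to $E \cdot T$, one local trial costs as much client computation as one federated trial, but only the global search spends communication rounds during the search. The concrete numbers below are placeholders, not the paper's settings:

```python
# Placeholder values, not the paper's settings.
grid_size = 10  # candidate learning rates tried by the search
E, T = 1, 50    # local epochs per round and communication rounds

epochs_per_local_trial = E * T        # local HPO trains E*T epochs on one client
epochs_per_global_trial = E * T       # global HPO: E epochs in each of T rounds
rounds_local_search = 0               # local HPO runs before any federation
rounds_global_search = grid_size * T  # every global trial is a full federation
```

Under these assumptions the computation per trial is identical, while the communication spent during the search differs by a factor of `grid_size * T`.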
V Experimental Results
Following the approach of Hiessl et al. [1], we demonstrated the effectiveness of the IFL System for the industrial task and showed that the IFL approach performs better than the individual learning approach and approximates the central learning approach. Fig. 1 shows the test accuracy on the central cohort test data for each client, for i) a model trained on the individual training data of the client, ii) a central model trained on the collected training data of all clients in the cohort, and iii) the federated model trained in the cohort.
Fig. 2 a) shows the results for the MNIST data, with the optimization approaches based on the grid search algorithm. For the training after the optimization, we fixed the remaining hyperparameters of the IFL system. The color indicates the optimized learning rate on the corresponding client. Since the MNIST data is i.i.d., there is only one cohort and all clients share the same federated model and thus the same test accuracy. Our results show that the local grid search selected the same learning rate on every client. In line with our expectation, the global optimization approach yielded the same learning rate.
For the industrial task, we evaluated the global and local optimization approaches based on grid search and on Bayesian optimization. For the training after the optimization, we fixed the remaining hyperparameters of the IFL system. Fig. 2 b) shows the results for the industrial data with the optimization approaches based on the grid search algorithm. The results show that, in all cohorts, the global approach yielded an equal or higher accuracy than the local approach.
Fig. 2 c) shows the results for the industrial data with the optimization approaches based on Bayesian optimization. Note that the plotted range does not cover the full search space of the learning rate used in the optimization. The results show that the global approach yielded a higher accuracy than the local approach in two of the cohorts.
The local Bayesian approach yielded different learning rates (see Fig. 2 c)) on clients with no difference in data, that is, the same number of samples, the same class distribution, and the same measurement protocol. However, the local grid search approach yielded the same learning rate as the global grid search approach (see Fig. 2 b)). We therefore suggest that the reason lies in the implementation of the Bayesian optimization approach and in a number of iterations not sufficiently large to guarantee convergence.
In order to compare the optimization approaches for the industrial task, we performed a paired t-test on the test accuracy to determine statistical significance, see Table I. We observe that the global optimization approach is significantly better than the local approach, both for the grid search approach and for the Bayesian approach. Furthermore, the results show that the grid search approach is significantly better than the Bayesian approach, both for the global approach and for the local approach. Note that we considered one cohort an outlier and excluded it from our calculations. This cohort consists only of a single client whose data was not generated according to the standard measurement protocol. Without outlier removal, the global grid search approach is still significantly better than the local grid search approach, and the local grid search approach is significantly better than the local Bayesian approach. However, there is no significant difference between the global Bayesian approach and the local Bayesian approach, nor between the global grid search approach and the global Bayesian approach.

VI Conclusion and Future Work
The results show that the federated learning approach approximates the central learning approach while outperforming individual learning of the clients. In this work, we investigated the impact of global and local optimization approaches in an IFL system based on a proxy data set and a real-world problem. In our experiments on the industrial data, local optimization yielded different learning rates on different clients in a cohort. However, the results show that a globally optimized learning rate, and thus a global learning rate for all clients in a cohort, improves the performance of the resulting federated model. We therefore conclude that the global optimization approach outperforms the local optimization approach, resulting in a communication-performance trade-off in hyperparameter optimization for FL. In our experiments on the proxy data set, however, the local approach achieved the same performance as the global approach.
TABLE I

client | global grid | local grid | global Bayesian | local Bayesian
-------|-------------|------------|-----------------|---------------
       | 0.7756      | 0.7720     | 0.7659          | 0.6897
       | 0.7756      | 0.7720     | 0.7659          | 0.6897
       | 0.7756      | 0.7720     | 0.7659          | 0.6897
       | 0.7756      | 0.7720     | 0.7659          | 0.6897
       | 0.8230      | 0.7921     | 0.7882          | 0.7889
       | 0.8230      | 0.7921     | 0.7882          | 0.7889
       | 0.8230      | 0.7921     | 0.7882          | 0.7889
       | 0.9740      | 0.9749     | 0.3867          | 0.9736
       | 0.7756      | 0.7720     | 0.7659          | 0.6897
A limitation of our study is that we only considered one hyperparameter in our optimization task. Hence, it would be interesting to explore whether these observations can be confirmed for a configuration of multiple hyperparameters. The results show that the grid search approaches outperform the Bayesian approaches, both globally and locally. However, we suggest performing a convergence analysis for the Bayesian approach.
References
[1] T. Hiessl, S. Rezapour Lakani, J. Kemnitz, D. Schall, and S. Schulte, "Cohort-based federated learning services for industrial collaboration on the edge," TechRxiv, preprint, https://doi.org/10.36227/techrxiv.14852361.v1, 2021.

[2] P. Kairouz, H. B. McMahan, et al., "Advances and open problems in federated learning," Foundations and Trends in Machine Learning, vol. 14, no. 1, 2021.

[3] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," NIPS Workshop on Private Multi-Party Machine Learning, 2016.

[4] Z. Dai, B. K. H. Low, and P. Jaillet, "Federated Bayesian optimization via Thompson sampling," Advances in Neural Information Processing Systems 33, 2020.

[5] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 54, pp. 1273–1282, 2017.

[6] K. Hsieh, A. Phanishayee, O. Mutlu, and P. B. Gibbons, "The non-IID data quagmire of decentralized machine learning," International Conference on Machine Learning (ICML), pp. 4387–4398, 2020.

[7] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, "Federated learning with non-IID data," arXiv:1806.00582, 2018.

[8] A. Nilsson, S. Smith, G. Ulm, E. Gustavsson, and M. Jirstrand, "A performance evaluation of federated learning algorithms," Proceedings of the Second Workshop on Distributed Infrastructures for Deep Learning (DIDL), pp. 1–8, 2018.

[9] T. Bierweiler, H. Grieb, S. von Dosky, and M. Hartl, "Smart sensing environment – use cases and system for plant specific monitoring and optimization," Automation 2019, pp. 155–158, 2019.

[10] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002.

[11] F. L. Markley, "Attitude determination using vector observations: A fast optimal matrix algorithm," J. Astronaut. Sci., vol. 41, no. 2, pp. 261–280, 1993.