Data Collaboration Analysis applied to Compound Datasets and the Introduction of Projection data to Non-IID settings

08/01/2023
by Akihiro Mizoguchi, et al.

Given the time and expense associated with bringing a drug to market, numerous studies have used machine learning to predict the properties of compounds from their structure. Federated learning has been applied to compound datasets to increase prediction accuracy while safeguarding potentially proprietary information. However, federated learning suffers from low accuracy in non-independent and identically distributed (non-IID) settings, i.e., when the data are partitioned with a large label bias, and is therefore considered unsuitable for compound datasets, which tend to have a large label bias. To address this limitation, we applied an alternative distributed machine learning method, data collaboration analysis (DC), to chemical compound data from open sources. We also proposed data collaboration analysis using projection data (DCPd), an improved method that utilizes auxiliary PubChem data as projection data. The projection data improve the quality of the individual user-side data transformations that create the intermediate representations. The classification performance of federated averaging (FedAvg), DC, and DCPd, measured as the area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (PR-AUC), was compared on five compound datasets. In non-IID settings, performance ranked from highest to lowest as DCPd, DC, and FedAvg, whereas the three methods performed almost identically in independent and identically distributed (IID) settings. Moreover, in experiments with different degrees of label bias, DCPd showed a negligible decline in classification performance compared to the other methods. Thus, DCPd can address low performance in non-IID settings, which is one of the challenges of federated learning.
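
To make the notions of label bias and ROC-AUC in the abstract concrete, the following is a minimal Python sketch. It is not the authors' partitioning protocol or evaluation code: the function names `split_with_label_bias`, `positive_rate`, and `roc_auc`, the two-client setup, and the `bias` parameter are all assumptions chosen for illustration, and the labels are synthetic.

```python
import random


def split_with_label_bias(labels, bias, seed=0):
    """Split sample indices between two hypothetical clients.

    Client 0 receives a fraction `bias` of the positive samples and a
    fraction (1 - bias) of the negative samples; client 1 gets the rest.
    bias=0.5 approximates an IID split, bias close to 1.0 is strongly
    label-skewed (non-IID). This scheme is an illustrative assumption.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    n_pos0 = round(bias * len(pos))
    n_neg0 = round((1 - bias) * len(neg))
    client0 = pos[:n_pos0] + neg[:n_neg0]
    client1 = pos[n_pos0:] + neg[n_neg0:]
    return client0, client1


def positive_rate(labels, indices):
    """Fraction of positive labels among the given sample indices."""
    return sum(labels[i] for i in indices) / len(indices)


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a randomly chosen positive is
    scored above a randomly chosen negative (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic labels: 20 positives and 80 negatives.
labels = [1] * 20 + [0] * 80
iid0, iid1 = split_with_label_bias(labels, bias=0.5)
skew0, skew1 = split_with_label_bias(labels, bias=0.9)

print("IID-like positive rates:", positive_rate(labels, iid0), positive_rate(labels, iid1))
print("Skewed positive rates:  ", positive_rate(labels, skew0), positive_rate(labels, skew1))
print("ROC-AUC on a toy example:", roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

With `bias=0.5` both clients see the same positive rate as the pooled data, while with `bias=0.9` one client holds mostly positives and the other almost none, which is the kind of label-biased partition the abstract refers to as non-IID.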


