PHOTON – A Python API for Rapid Machine Learning Model Development

02/13/2020
by   Ramona Leenings, et al.

This article describes the implementation and use of PHOTON, a high-level Python API designed to simplify and accelerate the process of machine learning model development. It enables designing both basic and advanced machine learning pipeline architectures and automatizes the repetitive training, optimization and evaluation workflow. PHOTON offers easy access to established machine learning toolboxes as well as the possibility to integrate custom algorithms and solutions for any part of the model construction and evaluation process. By adding a layer of abstraction incorporating current best practices it offers an easy-to-use, flexible approach to implementing fast, reproducible, and unbiased machine learning solutions.


1 Introduction

The interest in machine learning (ML) has grown exponentially in recent years and a rapidly increasing number of applications in all areas of science and industry showcase its potential. While the field progresses with breathtaking speed, the basic workflow to construct, optimize and evaluate a machine learning model has remained virtually unchanged. Data science experts systematically search for the best combination of data processing steps, learning algorithms and their respective hyperparameters based on unbiased performance estimates.

Diverse software packages and toolboxes arose supporting nearly any aspect of the model construction and evaluation procedure, providing users with a wide array of data processing methods, learning algorithms, hyperparameter optimization strategies, cross-validation schemes, performance metrics and other helpful tools. While in general this diversity of high-quality implementations enables the construction of use-case-tailored, sophisticated and capable solutions, in practice, it also leads to users facing numerous, recurring obstacles. First, algorithms and tools required for a particular use case are regularly spread across different toolboxes, requiring users to manually integrate code, learn different toolbox-specific syntaxes and consecutively re-adjust the data to the respective data representation needed in a particular processing step. Second, the typical machine learning workflow including hyperparameter optimization and model performance evaluation - an essential aspect of ML model development - must be properly implemented. Especially when integrating code stemming from different coding paradigms and toolboxes, this is a time-consuming, error-prone task. Third, more advanced ML pipeline features such as handling multiple data streams or addressing class imbalance are either not supported or cannot be used flexibly across toolboxes. Finally, model sharing and thus external model validation are hampered by the lack of a standardized format for saving, loading and predicting from pre-trained and optimized pipeline architectures comprising elements from different toolboxes.

As a result, users still spend a significant amount of their time (re-)implementing boilerplate code which runs typical, recurring ML analysis workflows and integrates the necessary algorithms from the plethora of diverse toolboxes. This is especially cumbersome in contexts in which rapid design iterations and evaluation of novel analysis pipelines are the norm rather than the exception. Moreover, the problem is aggravated by a lack of coding experience and uncertainties regarding ML best practices, as is often the case in applied data science contexts (e.g. in many of the life sciences).

Against this backdrop, we introduce PHOTON, a high-level Python Application Programming Interface (API), which offers an easy-to-use and flexible approach to implementing fast, reproducible, and unbiased machine learning models in accordance with current best practices. PHOTON is based on three conceptual ideas:

Design.

We conceptualize the entire Machine Learning pipeline as a series of building blocks, i.e. processing steps or learning algorithms, which can be selected and combined from a variety of choices. This enables the user to focus on crafting processing and learning sequences optimized for the use case at hand. PHOTON builds on an object structure that is optimized for easy and fast declaration of processing steps and data streams. Additionally, PHOTON introduces flow control elements encapsulating one or more processing steps or learning algorithms in parallel data streams, joined by either an AND or an OR operation. Furthermore, PHOTON adds advanced pipeline functionality to existing toolboxes: For one, the PHOTON pipeline is capable of handling dynamic changes in the training set structure and quantity on both the feature matrix and the target vector as needed e.g. for the application of over- and undersampling or data augmentation techniques. In addition, the PHOTON pipeline is capable of streaming supplementary data such as covariates or additional group labels through the sequence of building blocks. This data, which is useful e.g. for sample stratification, clustering or confounder removal, is accessible at any point in the pipeline.

Automation.

PHOTON automates the repetitive workflow of supervised machine learning comprising model training and evaluation, hyperparameter optimization, and (nested cross-) validation. Cross-validation schemes, hyperparameter optimization strategies, and performance metrics can be selected from a range of pre-existing options or custom-built. Importantly, PHOTON introduces a standardized format for saving, loading and distributing optimized and trained pipeline architectures. PHOTON thus supports convenient model sharing and enables external model validation even for non-expert users. Finally, results of training, optimization, and evaluation are accessible via an interactive, browser-based graphical interface called PHOTON Investigator.

Integration.

PHOTON has a modular design, so that it can integrate established solutions from existing ML toolboxes and is easily extendable to meet user-specific requirements. It offers simple access to a variety of established machine learning toolboxes, enabling the user to rapidly design, train, and evaluate pipelines based on a wide range of state-of-the-art algorithm implementations. These pre-registered data processing methods, learning algorithms, hyperparameter optimization strategies, performance metrics, etc. can be used without knowing the syntax of the underlying toolboxes. In addition, custom solutions can be integrated anywhere in the workflow, from model construction to model evaluation. PHOTON is capable of accommodating any custom-tailored data processing or learning algorithm at any position within the PHOTON pipeline, provided it complies with the scikit-learn interface for data processing methods and learning algorithms [1]. In addition, interfaces for training, optimization and evaluation allow the user to integrate custom hyperparameter optimization strategies, performance metrics or cross-validation schemes.

In the following, we will consider PHOTON in the field of existing machine learning libraries, outline the structure of the framework and highlight current functionalities, provide an example-based introduction to PHOTON’s usage and capabilities, and finally discuss current challenges and future developments.

2 Comparison With Existing Software

With Python as the programming language of choice in the domain of Machine Learning today, several open-source toolboxes exist which implement ML algorithms and utility functions. As one of the most prominent, scikit-learn [2] covers a very broad range of algorithms from regression and classification to clustering and preprocessing methods as well as model selection helper functions. It has established the de-facto standard interface for data-processing and learning algorithms. In addition, it introduces the concept of building analyses using pipelines, which successively apply a list of processing methods (referred to as transformers) and a final learning algorithm (called estimator) to the data. Scikit-learn offers a nested cross-validated grid search as well as a random grid search function for hyperparameter optimization and performance evaluation. With regard to hyperparameter optimization, scikit-optimize implements a Bayesian hyperparameter optimization strategy which can be combined with algorithms from scikit-learn. Other libraries such as imbalanced-learn [3] offer functionality to handle imbalanced datasets by providing numerous over- or undersampling methods.

In the domain of (Deep) Neural Networks, Google’s open-source library Tensorflow [4] allows users to design and train neural networks from scratch. It offers implementations for a wide range of elements, including several node types, layers, optimizers, helper functions and pretrained network architectures. Keras [5], as the official high-level Tensorflow API, offers an expressive syntax for building complex network architectures. Other popular libraries such as Theano, PyTorch, and Caffe [6, 7, 8] are also implemented in Python or provide Python APIs.

In the field of auto-ml - which seeks to automatically find the best model architecture and hyperparameter settings for a given dataset - libraries such as auto-sklearn, TPOT, AutoWeka and others optimize a specific set of data-preprocessing methods and learning algorithms. While auto-sklearn employs a Bayesian approach called Sequential Model-Based Optimization for General Algorithm Configuration (SMAC) [9], auto-keras [10] and Google’s AutoML design neural network architectures and optimize hyperparameters for a given task using the Neural Architecture Search algorithm [11, 12]. While very intriguing, these libraries aim at full automation - neglecting the need for custom-tailored pipelines and automatization approaches, thereby foregoing the opportunity to incorporate both modality specific algorithms and high-level domain knowledge in the model architecture search.

In contrast, the aim of the PHOTON framework is to provide a high-level API which simplifies model design and automates model training, hyperparameter optimization, and evaluation while building on existing as well as custom solutions in each step of the workflow. To this end, PHOTON establishes transparent interfaces which allow it to integrate algorithms, hyperparameter optimization strategies, and other utility functions within a unified workflow. In addition, PHOTON provides custom-built high-level pipeline functionalities to manipulate the dataflow as well as a multitude of convenience functions. Table 1 lists PHOTON’s core features and their origin. In the following, we will describe each feature and its implementation in more detail.

Feature                                       Origin

Machine Learning Algorithms
  Supervised Learning Algorithms              scikit-learn [2]
  Neural Networks                             tensorflow, keras [4, 5]
  Custom Learning Algorithms                  theano, pytorch, caffe, cognitive toolkit, etc. [6, 7, 8, 13]

Hyperparameter Optimization Strategies
  Grid Search                                 PHOTON code
  Random Grid Search                          PHOTON code
  Bayesian Optimization                       scikit-optimize, SMAC [14, 9]
  Custom strategies                           integration via adapted ask-tell interface

Pipeline Functionality
  Dataflow in a sequence of algorithms        inspired by scikit-learn [2]
  Parallel data streams                       PHOTON code
  And-Elements, Or-Elements and Subpipelines  PHOTON code
  Streaming of additional data                PHOTON code
  Target Vector Transformation                PHOTON code
  Callbacks                                   PHOTON code

Pipeline Functionality
  Parallelization                             dask [15]
  Performance Constraints                     inspired by auto-sklearn [16]
  Caching                                     PHOTON code
  Handling class imbalance                    imbalanced-learn [3]
  Data Augmentation via sample pairing        PHOTON code
  Standardized Format for Model Distribution  PHOTON code
  Significance Testing                        PHOTON code
  Extensive logging and result visualization  PHOTON code


Table 1: Overview of PHOTON features and machine learning software packages included.

3 Methods

The PHOTON framework provides functionality to structure, simplify and automate the ML workflow. To this end, we implemented several classes for designing, training, optimizing, and evaluating ML analysis pipelines. There are two main components that allow the user to design a machine learning pipeline: The Hyperpipe class - as the administrative core unit, and the PipelineElement class - representing a particular algorithm embedded in the sequence of processing steps.

In addition, we added control elements that enable more sophisticated pipeline architectures such as parallel sub-streams (PHOTON Branches) which can be joined by either an AND- (PHOTON Stack) or an OR-operation (PHOTON Switch). Finally, the PHOTON pipeline handles the data stream in the sequence of processing steps, offers caching and is utilized by the Hyperpipe to train the pipeline and request predictions for performance evaluation. Next, we will describe each element in more detail before illustrating usage and syntax of each class in the Examples section below.

3.1 Basic Elements

The Hyperpipe.

The Hyperpipe class is the core of the ML workflow in PHOTON. The Hyperpipe - short for hyperparameter optimization pipeline - provides the scaffolding for adding and arranging the sequence of processing steps and learning algorithms, monitors and controls the training and test procedure, communicates with the hyperparameter optimization strategy, guides the hyperparameter optimization process, evaluates performance, and coordinates the logging of all results (see the workflow listing in the appendix). In addition, the Hyperpipe generates a final model, serialized in a standardized format, consisting of the custom pipeline trained with the best performing hyperparameter configuration.

The PipelineElement.

As constituents of the Hyperpipe, PipelineElements determine the specific algorithms applied to the input data. This can either be a data processing algorithm, in reference to the scikit-learn interface also called transformer, or a learning algorithm, also referred to as estimator. By selecting and arranging PipelineElements, the user designs the ML pipeline. To facilitate this process, the PipelineElement class implements several helpful features. First, it enables easy access to various pre-implemented methods and algorithms: With an internal registration system that instantiates class objects from a string literal, it avoids the need to manually import and access the different algorithms from the respective toolboxes. The user can access a specific transformer or estimator by a string-encoded key which is internally mapped to an algorithm imported from the respective toolbox (e.g. scikit-learn, Tensorflow/Keras etc.) or any custom element registered by the user. If, for example, the user would like to utilize scikit-learn’s implementation of the support vector machine (SVC) [17] for classification, they can access it by simply adding a PipelineElement with the name SVC to the Hyperpipe. Internally the object is automatically imported and instantiated. Secondly, the PipelineElement also provides an expressive syntax for the specification of the algorithms’ hyperparameters and their dimensions (see Hyperparameter Optimization below). Note that we use the term hyperparameters in PHOTON to denote parameters which control the behavior of any given pipeline element - not only the learning algorithm’s hyperparameters as is usually done. Representing pipeline elements together with all their (potential) hyperparameter settings greatly simplifies the automatization of the pipeline training and optimization process. Most importantly, it enables seamless switching between different hyperparameter optimization strategies, ranging from simple (random) grid search to more advanced approaches. Thirdly, each PipelineElement possesses an on- and off-switch (a parameter called test_disabled) which allows the hyperparameter optimization to disable the element completely if doing so improves model performance.
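The string-keyed registration idea can be illustrated with a minimal sketch. This is not PHOTON’s registry implementation: the names REGISTRY, register and instantiate are hypothetical, and standard-library classes stand in for ML algorithms so that the example runs without any toolbox installed.

```python
import importlib

# Hypothetical registry: string key -> (module path, class name).
# Standard-library classes stand in for transformers and estimators.
REGISTRY = {
    "Counter": ("collections", "Counter"),
    "Fraction": ("fractions", "Fraction"),
}


def register(name, module_path, class_name):
    """Add a custom element under a string key."""
    REGISTRY[name] = (module_path, class_name)


def instantiate(name, *args, **kwargs):
    """Import the class behind `name` on demand and return an instance."""
    if name not in REGISTRY:
        raise KeyError(f"No element registered under {name!r}")
    module_path, class_name = REGISTRY[name]
    module = importlib.import_module(module_path)
    cls = getattr(module, class_name)
    return cls(*args, **kwargs)


# The caller only needs the key, not the import path.
element = instantiate("Fraction", 3, 4)

# Custom elements are registered once and are then accessed the same way.
register("OrderedDict", "collections", "OrderedDict")
custom = instantiate("OrderedDict")
```

The import happens only when a key is requested, so a registry of this kind can list many toolboxes without requiring all of them to be installed.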

3.2 Flow Control Elements

PHOTON implements classes dedicated to flow control, i.e. designed to handle the processing of parallel data streams and supporting more complex ML pipeline architectures. When creating more than one data stream, we need to specify how to direct and join the dataflow. Additionally, specific use-cases might require the application of more than one transformation in parallel as well as a sequence of several transformations in parallel sub-pipelines. PHOTON’s control elements are designed to conveniently manage these tasks.

PHOTON Switch

Building ML pipelines involves comparing different pipelines with each other. While in most state-of-the-art ML toolboxes the user has to define and benchmark each pipeline manually, in PHOTON it is possible to evaluate several possibilities at once. With data processing steps, learning algorithms and their hyperparameters intimately entangled, we should consider algorithm selection as part of the hyperparameter optimization process. The PHOTON Switch object represents and interchanges several pipeline elements, comparing and testing them within the same hyperparameter optimization cycle, thus effectively implementing an OR element for the pipeline. It streams the data only to the currently active item (see figure 2), which is selected by the hyperparameter optimization strategy. Note that not only PipelineElements, but also Branches and Stacks (see below) can be added to a Switch. See listing 2 for an example.
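The OR semantics can be sketched in a few lines of plain Python. The classes below (AddConstant, Multiply, ToySwitch) and the parameter name current are illustrative assumptions, not PHOTON’s Switch class; the point is only that the choice of element is exposed as one more hyperparameter.

```python
class AddConstant:
    """Toy transformer standing in for a real processing step."""

    def __init__(self, value=0):
        self.value = value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[v + self.value for v in row] for row in X]


class Multiply:
    """Second toy transformer competing with AddConstant."""

    def __init__(self, factor=1):
        self.factor = factor

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[v * self.factor for v in row] for row in X]


class ToySwitch:
    """OR element: holds several candidates, routes data to the active one."""

    def __init__(self, elements):
        self.elements = dict(elements)
        self.current = next(iter(self.elements))

    def set_params(self, current):
        # The optimizer picks the active element like any other hyperparameter.
        if current not in self.elements:
            raise ValueError(f"Unknown element {current!r}")
        self.current = current
        return self

    def fit(self, X, y=None):
        self.elements[self.current].fit(X, y)
        return self

    def transform(self, X):
        return self.elements[self.current].transform(X)


X = [[1, 2], [3, 4]]
switch = ToySwitch({"add": AddConstant(value=10), "multiply": Multiply(factor=2)})

# Stand-in for the optimization loop: try each candidate in turn.
results = {}
for name in switch.elements:
    results[name] = switch.set_params(current=name).fit(X).transform(X)
```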

PHOTON Stack

Complementing the Switch (OR-operation), an AND-operation is available via the Stack class. In a Stack, data is delivered to all elements within the Stack and the respective outputs are horizontally concatenated (see figure 3). Thus, a Stack allows users to create new features by processing the input to the Stack in different ways and concatenating the results. Likewise, it allows training several estimators (including their hyperparameter configurations) with the same data in an ensemble-like fashion by concatenating their predictions. Those can then be further processed by applying e.g. a voting strategy or training another (meta-)estimator. Note that not only PipelineElements, but also Switches (see above) and Branches (see below) can be added to a Stack. Listing 3 shows the usage of a Stack.
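A minimal sketch of the AND semantics, assuming list-of-rows data and toy transformers. ToyStack, Square and Negate are hypothetical names and this is not PHOTON’s Stack class; it only shows the horizontal concatenation of the outputs.

```python
class Square:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[v * v for v in row] for row in X]


class Negate:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[-v for v in row] for row in X]


class ToyStack:
    """AND element: every child sees the same input, outputs are joined."""

    def __init__(self, elements):
        self.elements = list(elements)

    def fit(self, X, y=None):
        for element in self.elements:
            element.fit(X, y)
        return self

    def transform(self, X):
        outputs = [element.transform(X) for element in self.elements]
        # Concatenate the i-th row of every output into one longer row.
        return [sum((out[i] for out in outputs), []) for i in range(len(X))]


X = [[1, 2], [3, 4]]
stacked = ToyStack([Square(), Negate()]).fit(X).transform(X)
```

Each input row with two features becomes an output row with four features, two from each child element.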

PHOTON Branch

A Branch constitutes a sub-pipeline containing a sequence of PipelineElements (see figure 4). It can be used in combination with the Switch and Stack elements, enabling the creation of complex pipeline architectures that integrate parallel sub-pipelines in the data flow with minimal syntax. Note that not only PipelineElements, but also Switches and Stacks can be added to a Branch. An example usage of PHOTON Branches is demonstrated in code listing 4.

3.3 Custom Pipeline Elements

PHOTON aims to provide a flexible and expressive interface to the ML model development workflow. Therefore, the ability to build and integrate both custom and third-party algorithms into the pipeline is crucial. Specifically, users can use any (third-party or custom) algorithm if it adheres to the fit-predict-transform interface introduced by scikit-learn.

The scikit-learn interface requires every object to implement a fit() method which trains or adjusts the implemented algorithm based on the data (usually the training set). In case of a transformer, a transform() method is expected which processes the data with the previously fitted algorithm and returns the output. In case of an estimator, a predict() method is expected which uses the fitted model to generate and return predictions for the (test) data. To integrate with PHOTON’s automated hyperparameter optimization, a set_params() method is required which can e.g. be inherited from the BaseEstimator base class in scikit-learn (transformers commonly also inherit from TransformerMixin). This method takes (hyper-)parameter names mapped to specific (hyper-)parameter values and synchronizes the parameters of the given object with those values.

If a third-party algorithm does not adhere to the scikit-learn interface by default, a simple wrapper class can be implemented that calls the third-party algorithm according to the required interface. Registrations for both third-party and custom data processing or learning algorithms can be managed (i.e. added or deleted) via the PhotonRegistry class. Once registered, third-party elements are equivalent to all other PipelineElements and fully integrate into all PHOTON functionalities; they are thus compatible with hyperparameter optimization, nested cross-validation and model persistence.
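As an illustration of the interface described above, the following sketch implements a custom transformer in plain Python. MeanCenterer and its shrink parameter are invented for this example; in practice, set_params() and get_params() would usually be inherited from scikit-learn’s BaseEstimator rather than written by hand.

```python
class MeanCenterer:
    """Toy transformer following the fit/transform/set_params convention."""

    def __init__(self, shrink=1.0):
        self.shrink = shrink  # hypothetical hyperparameter
        self.means_ = None    # learned during fit()

    def get_params(self, deep=True):
        return {"shrink": self.shrink}

    def set_params(self, **params):
        # Synchronize the object's hyperparameters with the given values.
        for name, value in params.items():
            if name not in self.get_params():
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        # Column means are estimated on the training data only.
        n_samples = len(X)
        self.means_ = [sum(column) / n_samples for column in zip(*X)]
        return self

    def transform(self, X):
        return [
            [(value - mean) * self.shrink for value, mean in zip(row, self.means_)]
            for row in X
        ]


X_train = [[1.0, 10.0], [3.0, 30.0]]
X_test = [[2.0, 25.0]]

centerer = MeanCenterer().fit(X_train)
centered = centerer.transform(X_test)

# An optimizer can change hyperparameters without knowing the class.
shrunk = centerer.set_params(shrink=0.5).transform(X_test)
```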

3.4 Pipeline

In order to accommodate advanced machine learning concepts and enable maximum flexibility, we implemented the PHOTON pipeline class. It is conceptually related to the scikit-learn Pipeline but extends it with regard to core features.

First, the pipeline implementation allows for a dynamic training sample transformation on both the feature matrix and the target vector. This requires an adaptation of the scikit-learn interface as well as an adjusted communication with the processing algorithm. PHOTON reacts to a flag set by the algorithm and adjusts its expectations regarding the interface implementation. It is capable of accepting both a transformed feature matrix and a transformed target vector as return values when calling the transform method. Additionally, it requires the pipeline to stream the new target vector information to the subsequent PipelineElements. As it is crucial to only apply these transformations during training (but not when testing), PHOTON automatically skips all target-transformation steps when transforming or predicting new data (test samples). Common use-cases for this scenario include data augmentation approaches - in which the number of training samples is increased by applying transformations (e.g. rotations to an image) - or strategies for imbalanced datasets, in which the number of samples per class is equalized via e.g. under- or oversampling. Concretely, this addition to pipeline functionality enables the integration of imbalanced data strategies directly from the imbalanced-learn package [3] and data augmentation e.g. via sample-pairing.
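The behaviour described here can be sketched with a toy pipeline. The flag name needs_y and the classes DuplicateMinority, MajorityClassifier and ToyPipeline are assumptions made for illustration; they do not reproduce PHOTON’s actual flag or pipeline code.

```python
from collections import Counter


class DuplicateMinority:
    """Toy oversampler: repeats minority samples until classes are balanced."""

    needs_y = True  # hypothetical flag: transform() returns (X, y)

    def fit(self, X, y):
        return self

    def transform(self, X, y):
        counts = Counter(y)
        target = max(counts.values())
        X_out, y_out = list(X), list(y)
        for label, count in counts.items():
            members = [x for x, lab in zip(X, y) if lab == label]
            for i in range(target - count):
                X_out.append(members[i % len(members)])
                y_out.append(label)
        return X_out, y_out


class MajorityClassifier:
    """Toy estimator that records how many training samples it received."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        self.n_seen_ = len(y)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ToyPipeline:
    def __init__(self, steps, estimator):
        self.steps = steps
        self.estimator = estimator

    def fit(self, X, y):
        for step in self.steps:
            step.fit(X, y)
            if getattr(step, "needs_y", False):
                X, y = step.transform(X, y)  # sample count may change
            else:
                X = step.transform(X)
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        for step in self.steps:
            if getattr(step, "needs_y", False):
                continue  # never resample test data
            X = step.transform(X)
        return self.estimator.predict(X)


X_train, y_train = [[0], [1], [2], [3]], [0, 0, 0, 1]
pipeline = ToyPipeline([DuplicateMinority()], MajorityClassifier())
pipeline.fit(X_train, y_train)
predictions = pipeline.predict([[5], [6]])
```

The estimator is trained on six samples (three per class), while the two test samples pass through unchanged.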

Second, numerous use-cases rely on the availability of additional information (i.e. data not contained in the feature matrix) at runtime. This includes all cases in which further information is used to adjust the applied transformation (e.g. when aiming to control for the effect of covariates) or in case a different processing is applied to subgroups of the data (e.g. males and females). In the PHOTON pipeline implementation, the additional data streamed through the pipeline is accessible for all steps at any point in the pipeline. To deliver supplementary data, Python’s keyword arguments (kwargs) are utilized. The supplementary data may also dynamically change at runtime. Therefore, the pipeline accepts an updated kwargs dictionary as additional return value and assures that the information is delivered to all subsequent steps designed to accept them. Statically available additional data, that is, additional data available independent of any transformation output, may be introduced to the data stream by supplying it to the Hyperpipe’s fit() method. Conveniently, PHOTON automatically ensures proper handling of the data during cross-validation so that at any point in time, a particular element receives the supplementary data matched to the indices of the training or test data currently in use. This additional data stream bridges the gap between a) high standardization of inputs needed for automated training and testing and b) flexibility necessary to accommodate custom solutions. Thereby, developers can rely on the infrastructure while being free to adapt the system for complex algorithm interactions.
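The index matching of supplementary data during cross-validation can be pictured as follows. The helper subset_kwargs and the covariate names age and site are hypothetical; the sketch assumes that every supplementary array has one entry per sample.

```python
def subset_kwargs(kwargs, indices):
    """Return the supplementary data restricted to the given sample indices."""
    return {
        name: [values[i] for i in indices]
        for name, values in kwargs.items()
    }


# Supplementary data handed over together with X and y.
covariates = {
    "age": [30, 40, 50, 60],
    "site": ["A", "B", "A", "B"],
}

train_indices, test_indices = [0, 2], [1, 3]

# Each element receives only the covariates of the samples it is given.
train_kwargs = subset_kwargs(covariates, train_indices)
test_kwargs = subset_kwargs(covariates, test_indices)
```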

Moreover, we implemented so-called Callbacks which allow users to access (and inspect) data flowing through the pipeline at runtime. Acting as a PipelineElement, Callbacks can be inserted at any point within the pipeline. They must define a function delegate which is called with the exact same data that the next pipeline step will receive. Thereby, a developer may inspect e.g. the shape and values after a sequence of transformations has been applied. Return values from the delegate functions are ignored, so that after returning from the delegate call, the original data is directly passed to the next processing step.
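A callback of this kind can be sketched as a pass-through element. ToyCallback and the delegate signature (X, y, **kwargs) are assumptions for illustration, not the signature of PHOTON’s callback class.

```python
class ToyCallback:
    """Pass-through element that hands the data to a user-defined delegate."""

    def __init__(self, delegate):
        self.delegate = delegate

    def fit(self, X, y=None, **kwargs):
        return self

    def transform(self, X, y=None, **kwargs):
        # The delegate sees the data; whatever it returns is discarded.
        self.delegate(X, y, **kwargs)
        return X


seen_shapes = []


def record_shape(X, y=None, **kwargs):
    seen_shapes.append((len(X), len(X[0])))
    return "this return value is ignored"


X = [[1, 2], [3, 4]]
callback = ToyCallback(record_shape)
passed_on = callback.fit(X).transform(X)
```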

Finally, in order to enable flexible pipeline architectures, we allow the positioning of learning algorithms at an arbitrary position within the pipeline. In case PHOTON identifies a PipelineElement that a) provides no transform() method and b) yet is followed by one or more other PipelineElements, it automatically calls predict() and delivers the output to the subsequent pipeline elements. Thereby, learning algorithms can be joined to ensembles, used within subpipelines or be part of other custom pipeline architectures without interrupting the data stream.
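The dispatch rule can be sketched as a small helper. The function forward and the toy classes are hypothetical, and reshaping the predictions into a single-column feature matrix is an assumption of this sketch rather than a documented PHOTON behaviour.

```python
class ThresholdClassifier:
    """Toy estimator without a transform() method."""

    def __init__(self, threshold=5):
        self.threshold = threshold

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return [1 if sum(row) > self.threshold else 0 for row in X]


class AddConstant:
    def __init__(self, value=0):
        self.value = value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[v + self.value for v in row] for row in X]


def forward(element, X):
    """Pass data through an element that sits in the middle of a pipeline."""
    if hasattr(element, "transform"):
        return element.transform(X)
    # Estimator followed by further steps: use predictions as new features.
    return [[prediction] for prediction in element.predict(X)]


X = [[1, 2], [4, 4]]
steps = [ThresholdClassifier(threshold=5), AddConstant(value=10)]
for step in steps:
    step.fit(X)
    X = forward(step, X)
```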

3.5 Hyperparameter Optimization

Hyperparameters directly control the behavior of ML algorithms and may have substantial impact on model performance. Importantly, preprocessing steps and training of the learning algorithm are intimately entangled as any transformation during preprocessing might alter the data structure. Therefore, unlike classic hyperparameter optimization, PHOTON’s hyperparameter optimization encompasses not only the search for the estimator’s hyperparameters but optimizes both the choice of PipelineElements as well as hyperparameters of every PipelineElement (transformers and estimators). Searching a potentially vast hyperparameter space - which may grow rapidly even for simple analyses due to the combinatorial explosion when combining PipelineElements and their hyperparameters - quickly becomes infeasible using full grid search (i.e. evaluating all combinations of PipelineElements and hyperparameter combinations specified in the Hyperpipe). Thus, PHOTON offers a growing number of hyperparameter optimization strategies including Random Grid Search and Bayesian Hyperparameter Optimization using scikit-optimize [14].
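A short calculation illustrates how quickly a full grid grows. The element names, value lists and fold counts below are invented for the example and are not taken from the paper.

```python
import itertools
import math

# Hypothetical search space for a two-step pipeline.
grid = {
    "PCA__n_components": [5, 10, 20, 50, 100],
    "SVC__C": [0.01, 0.1, 1, 10, 100, 1000],
    "SVC__kernel": ["linear", "rbf", "poly"],
}

# Every combination of values is one hyperparameter configuration.
configurations = list(itertools.product(*grid.values()))
n_configurations = math.prod(len(values) for values in grid.values())

# In nested cross-validation each configuration is fitted once per
# inner fold, and the whole search is repeated in every outer fold.
inner_folds, outer_folds = 5, 3
n_fits = n_configurations * inner_folds * outer_folds
```

Three modest value lists already produce 90 configurations and 1350 pipeline fits; adding a fourth hyperparameter with ten values would multiply both numbers by ten.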

Custom implementations of hyperparameter optimization strategies can be seamlessly integrated. Specifically, we use an extended ask-and-tell-interface structure consisting of three functions (prepare, ask and tell), which interacts with the respective hyperparameter optimization strategy. The hyperparameter space is initialized using the prepare() method. The Hyperpipe delivers the list of PipelineElements and their respective hyperparameters to the hyperparameter optimization strategy. A hyperparameter can be defined as a categorical finite list of options or as either a range of floating-point numbers or integers. Additionally, the user can incorporate prior expectations about the distributions of hyperparameter values. Within the training and test workflow, the hyperparameter optimization strategy is asked for a set of values for each hyperparameter of the pipeline, creating a new hyperparameter configuration that is then tested in all folds of the inner cross validation loop. The resulting performance is averaged across all inner folds and via the tell() function delivered back to the hyperparameter optimization strategy. Drawing on the performance of previous hyperparameter combinations, the hyperparameter optimization strategy then again provides a new set of hyperparameters to be tested via the ask() method. This cycle is repeated until the hyperparameter optimization strategy decides that the optimization process is finished. Using a user-defined performance metric that rates the best performance, PHOTON selects the best hyperparameter configuration in order to train a final model.
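The prepare, ask and tell cycle can be sketched with a simple random search. The class name, the search-space encoding as tuples and the toy scoring function are assumptions of this sketch; PHOTON’s own base class and hyperparameter objects may differ in their signatures.

```python
import random


class RandomSearchOptimizer:
    """Toy optimizer exposing prepare(), ask() and tell()."""

    def __init__(self, n_configurations=20, seed=0):
        self.n_configurations = n_configurations
        self.rng = random.Random(seed)
        self.search_space = {}
        self.history = []

    def prepare(self, search_space):
        # Receive the hyperparameter space before the search starts.
        self.search_space = search_space
        self.history = []

    def _sample(self, spec):
        kind = spec[0]
        if kind == "categorical":
            return self.rng.choice(spec[1])
        if kind == "float":
            return self.rng.uniform(spec[1], spec[2])
        if kind == "int":
            return self.rng.randint(spec[1], spec[2])
        raise ValueError(f"Unknown hyperparameter type {kind!r}")

    def ask(self):
        # Yield configurations until the strategy decides to stop.
        for _ in range(self.n_configurations):
            yield {
                name: self._sample(spec)
                for name, spec in self.search_space.items()
            }

    def tell(self, config, performance):
        # A model-based strategy would update its surrogate model here.
        self.history.append((config, performance))

    def best(self):
        return max(self.history, key=lambda item: item[1])


def toy_fold_score(config, fold):
    """Stand-in for training and scoring a pipeline on one inner fold."""
    penalty = 0.0 if config["kernel"] == "rbf" else 0.5
    return -abs(config["C"] - 1.0) - penalty - 0.01 * fold


space = {
    "C": ("float", 0.1, 10.0),
    "kernel": ("categorical", ["linear", "rbf"]),
}

optimizer = RandomSearchOptimizer(n_configurations=20, seed=0)
optimizer.prepare(space)
for config in optimizer.ask():
    fold_scores = [toy_fold_score(config, fold) for fold in range(3)]
    optimizer.tell(config, sum(fold_scores) / len(fold_scores))

best_config, best_score = optimizer.best()
```

Because the calling loop only relies on the three methods, a grid search or a Bayesian strategy could be swapped in without changing the loop.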

3.6 Model Distribution

After identifying the optimal hyperparameter configuration, the pipeline is trained with the best configuration on all available data. The resulting model including all transformers and estimators is persisted as a single file in a standardized format, suffixed with “.photon”. It can be reloaded to make predictions on new, unseen data. The .photon format enables the integration of algorithms across toolboxes and software packages as well as custom code in order to facilitate model distribution and external model validation. As model sharing is a prerequisite for external model validation and thus at the heart of ML best practice, we also created a dedicated online model repository to which users can upload their models to make them publicly available. If the model is in the .photon format, others can download the file and make predictions without extensive system setups or the need to share data.

3.7 Performance Boosting

ML analyses in general and hyperparameter optimization in particular may require substantial computational resources. Therefore, PHOTON implements several features aiming to alleviate computational cost and accelerate analyses.

First, the Preprocessing class allows users to define a sequence of PipelineElements which are executed prior to the hyperparameter optimization procedure and outside of the training and testing cycle. Importantly, users must ensure that no optimization occurs during preprocessing. As a general rule, all operations which could be performed on a single sample with the same result are legal, while any operation drawing on samples which might later be assigned to different training or test sets will lead to overestimated performance. For example, label encoding may be used during preprocessing whereas applying a standard scaler as part of the preprocessing would violate best practice.

Second, PHOTON allows users to specify so-called PerformanceConstraints which define the minimum performance expectation a hyperparameter configuration has to achieve in order to be evaluated further. Inspired by auto-sklearn [16], a configuration is skipped in further inner-cross-validation folds if it performs worse than a user-defined static or dynamic threshold, thereby accelerating hyperparameter search.
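The early-skipping idea can be sketched as follows, assuming a static threshold that is compared against the mean score of the folds evaluated so far. The function name and the exact comparison rule are assumptions of this sketch, not PHOTON’s PerformanceConstraints implementation.

```python
def evaluate_configuration(score_fold, n_folds, threshold):
    """Score inner folds one by one, stopping once the mean drops too low."""
    scores = []
    for fold in range(n_folds):
        scores.append(score_fold(fold))
        running_mean = sum(scores) / len(scores)
        if running_mean < threshold:
            break  # skip the remaining folds for this configuration
    return scores


# Pre-computed toy scores stand in for actual training runs.
promising = [0.80, 0.82, 0.79]
unpromising = [0.40, 0.90, 0.90]

kept = evaluate_configuration(lambda fold: promising[fold], 3, threshold=0.6)
skipped = evaluate_configuration(lambda fold: unpromising[fold], 3, threshold=0.6)
```

The second configuration is abandoned after a single fold, saving two of three training runs. This comes at the risk of discarding a configuration that would have recovered in later folds.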

In addition, PHOTON integrates parallelization that is built on top of the dask library [15], which is designed to scale Python applications in data science. By using dask, computational resources across cluster nodes can easily be accessed to accelerate the ML model development process. Specifically, PHOTON supports parallel and distributed computation of the outer cross-validation folds, directing the fold-specific data accordingly and providing a separate instance of the hyperparameter optimization strategy to each parallel process.

Finally, during hyperparameter search, a large number of pipelines are computed that contain similar sequences of elements with at least partly identical hyperparameter configurations. To avoid recomputing data already available from another hyperparameter configuration, PHOTON implements caching. The caching index is specifically adapted to handle the varying datasets resulting from the cross-validation data splits as well as hyperparameter configurations that may only partially overlap. Specifically, the PHOTON pipeline checks whether, for a particular pipeline element and a given outer and inner cross-validation fold, the data has been processed before in the exact same pipeline with the same hyperparameter configuration. As changes in subsequent parts of the pipeline do not affect the transformations, only hyperparameter values of the preceding steps are considered. The PHOTON pipeline keeps querying the caching index until it finds the first element for which no cached data is available. Only then is the data loaded and delivered to the PipelineElement to be processed.

Furthermore, the PHOTON pipeline is able to cache data per single item in order to speed up transformations that are applied item-wise rather than group-wise. For example, resource-intensive transformations of high-dimensional images can be cached image-wise so that the same transformation is not applied to the same image twice. The caching functionality of the pipeline reacts to a specific flag set by an algorithm and adapts the way it stores the computed data. It switches to indexing existing transformations by an item-wise key and is capable of collecting, loading and saving transformations across overlapping subject groups. Thereby, a particular cost-intensive transformation must be computed only once.
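A minimal sketch of such a caching index is given below. It keeps the cache in memory and keys each intermediate result on the fold identifiers plus the names and hyperparameters of all steps up to that point; the on-disk storage, the lazy loading and the item-wise mode of the actual implementation are left out, and all function names are hypothetical.

```python
import hashlib
import json

cache = {}
computations = []  # records which transformations were actually executed


def cache_key(outer_fold, inner_fold, steps_up_to_here):
    """Key on the folds and on the steps executed so far, not on later steps."""
    payload = json.dumps([outer_fold, inner_fold, steps_up_to_here], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_pipeline(data, steps, outer_fold, inner_fold):
    """Run `steps`, a list of (name, params, function), reusing cached results."""
    for position, (name, params, function) in enumerate(steps):
        prefix = [(n, p) for n, p, _ in steps[:position + 1]]
        key = cache_key(outer_fold, inner_fold, prefix)
        if key not in cache:
            computations.append(name)
            cache[key] = function(data, **params)
        data = cache[key]
    return data


def scale(data, factor):
    return [x * factor for x in data]


def shift(data, offset):
    return [x + offset for x in data]


# Two configurations share the first step and differ only in the second one.
config_a = [("scale", {"factor": 2}, scale), ("shift", {"offset": 1}, shift)]
config_b = [("scale", {"factor": 2}, scale), ("shift", {"offset": 5}, shift)]

out_a = run_pipeline([1, 2, 3], config_a, outer_fold=0, inner_fold=0)
out_b = run_pipeline([1, 2, 3], config_b, outer_fold=0, inner_fold=0)
```

The scaling step runs only once, because the second configuration finds the shared prefix in the cache and recomputes only the step whose hyperparameters changed.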

3.8 Logging

PHOTON provides extensive result logging including both performances and metadata generated through the hyperparameter optimization process. Each hyperparameter configuration tested is archived including all performance metrics and complementary information such as computation time and the training, validation, and test indices. To further support the interpretation of the performance metrics, PHOTON automatically establishes a baseline performance for each analysis by applying a simple rule-based prediction strategy. Using the DummyClassifier and DummyRegressor classes implemented in scikit-learn, either the mean value (in the case of regression) or the most frequent class (in the case of classification) is predicted. In this way, the performance can be evaluated against best guessing. This is particularly useful in cases of imbalanced class distributions, for obtaining a comparative value in regression tasks, and in general to ensure that the learning algorithm is learning a meaningful mapping rather than a trivial rule.
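The two baseline rules are simple enough to write out directly. The following sketch reimplements them in plain Python for illustration instead of calling scikit-learn; the function names are hypothetical.

```python
from collections import Counter
from statistics import mean


def baseline_regression(y_train, n_test):
    """Predict the mean of the training targets for every test sample."""
    return [mean(y_train)] * n_test


def baseline_classification(y_train, n_test):
    """Predict the most frequent training class for every test sample."""
    most_frequent = Counter(y_train).most_common(1)[0][0]
    return [most_frequent] * n_test


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Imbalanced toy data: 90% of the samples belong to class 0.
y_train = [0] * 18 + [1] * 2
y_test = [0] * 9 + [1]
dummy_accuracy = accuracy(y_test, baseline_classification(y_train, len(y_test)))

mean_prediction = baseline_regression([2.0, 4.0, 6.0], n_test=2)
```

On this toy data the trivial rule already reaches an accuracy of 0.9, so a trained model reporting 0.9 would not have learned anything beyond the class distribution.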

3.9 Interactive, Browser-based Results Visualization

Results of the hyperparameter optimization process are conveniently accessible via the PHOTON Investigator: an interactive, browser-based graphical interface (see Figure ). It provides a visualization of the pipeline architecture, analysis design and performance metrics. Confusion matrices (for classification problems) and scatter plots (for regression analyses) with interactive per-fold visualization of true and predicted values are shown. All evaluated hyperparameter configurations can be inspected for each outer fold respectively. In addition, performance curves are visualized as an indicator for the course of the hyperparameter optimization strategy.

3.10 Significance Testing

Within the scientific community, significance testing is an important aspect of model evaluation. In cases involving nested cross-validation and hyperparameter optimization, where the independence assumption of classical statistical inference is violated, permutation-based methods of significance testing have been established. However, significance tests are often not implemented in ML software tools as they are less frequently used in ML applications. In PHOTON, we have implemented a permutation-based significance test that repeats the entire modelling process with permuted sample labels. This way, the permutation test comprises hyperparameter optimization and cross-validation, which is crucial for unbiased statistical inference.
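The principle of such a test can be sketched in a few lines. Here a leave-one-out nearest-class-mean classifier stands in for the full modelling process, and the p-value uses the common (count + 1) / (permutations + 1) convention; both choices are assumptions made for this illustration and are not taken from the PHOTON implementation.

```python
import random


def cross_validated_accuracy(features, labels):
    """Stand-in for the full modelling process (hypothetical).

    Leave-one-out evaluation of a nearest-class-mean classifier on one feature.
    In a real analysis this step would include hyperparameter optimization and
    nested cross-validation.
    """
    correct = 0
    for i in range(len(features)):
        train = [(x, y) for j, (x, y) in enumerate(zip(features, labels)) if j != i]
        means = {}
        for cls in set(y for _, y in train):
            members = [x for x, y in train if y == cls]
            means[cls] = sum(members) / len(members)
        predicted = min(means, key=lambda cls: abs(features[i] - means[cls]))
        correct += predicted == labels[i]
    return correct / len(features)


def permutation_test(features, labels, n_permutations, seed=0):
    """Repeat the whole process on shuffled labels; return (score, p-value)."""
    rng = random.Random(seed)
    true_score = cross_validated_accuracy(features, labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if cross_validated_accuracy(features, shuffled) >= true_score:
            at_least_as_good += 1
    # The +1 terms count the unpermuted analysis as one of the permutations.
    return true_score, (at_least_as_good + 1) / (n_permutations + 1)


features = [0.1, 0.3, 0.2, 0.4, 0.0, 2.1, 2.3, 1.9, 2.2, 2.0]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
score, p_value = permutation_test(features, labels, n_permutations=200)
```

The essential point is that the shuffled labels pass through the same evaluation routine as the true labels, so any optimism introduced by model selection is present in the null distribution as well.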

3.11 Additional Information

Please note that we here describe PHOTON 1.0. For changes and updates, please refer to . Up-to-date documentation can be found at the website. The complete code is available on GitHub under the GNU General Public License v3.0. Code quality is ensured by unit testing. Test coverage is  X. In addition, PHOTON is registered in the Python Package Index (PyPI) under the name photonai.

PHOTON is implemented in Python 3 and adheres to the PEP8 style guidelines [18]. It uses standard data science toolboxes, including numpy, pandas, and scikit-learn [19, 20, 2]. Parallelization in PHOTON is based on the dask library [15], which specializes in scaling Python applications. For the serialization of hyperparameter optimization results, we support persistence into the document-based open-source database MongoDB [21], licensed under the Server Side Public License (SSPL) v1. For this task, we rely on the pymodm package for object-document mapping. The full list of requirements can be found on GitHub.

4 Usage and Examples

In the following, some example use cases are given, showcasing the PHOTON syntax for designing, training, optimizing, and evaluating ML analyses.

4.1 Installation

PHOTON can be installed via the command line.

pip install photonai

4.2 Simple Pipeline

Listing 1 shows a simple analysis pipeline comprising missing-value imputation and data normalization using the scikit-learn SimpleImputer and StandardScaler, a Principal Component Analysis (PCA), and a Support Vector Machine. The PHOTON syntax enables an efficient representation of all hyperparameter configurations. The code optimizes the number of principal components within a range of 10 to 50, the regularization parameter of the support vector machine (C) within a range of 1 to 6, and chooses between a linear and an rbf kernel. By statically defining the value for gamma, the support vector machine’s gamma parameter will be set to scale across all hyperparameter configurations. Note that setting the test_disabled parameter of a PipelineElement to True will, in addition to all other configurations of this element, evaluate the pipeline’s performance when this element is ignored. A full list of pre-installed transformers and estimators available in PHOTON can be obtained via the PhotonRegister.list_available_elements() command in PHOTON.

Figure 1: A simple pipeline architecture that imputes missing data and normalizes the output before dimensionality reduction is applied and the data is finally processed by the learning algorithm. The code example is given in listing 1

In the Hyperpipe object, we define options for nested cross-validation. Here, we employ 5-fold inner cross-validation to identify the optimal hyperparameter configuration among all the ones listed above and 3-fold outer cross-validation to evaluate the optimal configurations. In order to select the hyperparameter optimization strategy, we set the optimizer parameter to sk_opt, thereby choosing Bayesian hyperparameter optimization [14], and via the optimizer_params parameter set the number of configurations to evaluate to 25. Performance metrics can be chosen by name. In addition, we specify the performance metric according to which the best performing hyperparameter configuration is selected. Finally, to open the interactive, browser-based graphical interface, we call the PHOTON Investigator with the fitted Hyperpipe object to visualize the obtained results.

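# Import paths are assumed to follow the PHOTON 1.0 package layout
# and may differ in later releases.
from photonai.base import Hyperpipe, PipelineElement
from photonai.optimization import IntegerRange, FloatRange, Categorical
from photonai.investigator import Investigator
from sklearn.model_selection import KFold

# X (feature matrix) and y (targets) are assumed to be loaded beforehand.
my_pipe = Hyperpipe('basic_svm_pipe',
                    inner_cv=KFold(n_splits=5),
                    outer_cv=KFold(n_splits=3),
                    optimizer='sk_opt',
                    optimizer_params={'n_configurations': 25},
                    metrics=['accuracy', 'recall', 'balanced_accuracy'],
                    best_config_metric='accuracy')

my_pipe += PipelineElement('SimpleImputer')
my_pipe += PipelineElement('StandardScaler')

# test_disabled=True additionally evaluates the pipeline without the PCA.
my_pipe += PipelineElement('PCA',
                           hyperparameters={'n_components': IntegerRange(10, 30)},
                           test_disabled=True)

# gamma is fixed for all configurations; kernel and C are optimized.
my_pipe += PipelineElement('SVC',
                           hyperparameters={'kernel': Categorical(['rbf', 'linear']),
                                            'C': FloatRange(1, 6)},
                           gamma='scale')

my_pipe.fit(X, y)

Investigator.show(my_pipe)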
Listing 1: PHOTON code to implement a simple pipeline as described in section 4.2 and depicted in figure 1
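The nested cross-validation scheme configured in listing 1 can be sketched in plain Python to show which samples each loop sees. This covers the splitting only and uses contiguous, unshuffled folds; k_fold and nested_splits are hypothetical helpers and not part of PHOTON.

```python
def k_fold(indices, n_splits):
    """Split `indices` into `n_splits` contiguous folds; yield (train, test) pairs."""
    remainder = len(indices) % n_splits
    fold_sizes = [len(indices) // n_splits + (1 if i < remainder else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def nested_splits(n_samples, n_outer, n_inner):
    """Yield (outer_train, outer_test, inner_splits) for nested cross-validation."""
    for outer_train, outer_test in k_fold(list(range(n_samples)), n_outer):
        # Inner folds only ever see the outer training set.
        inner = list(k_fold(outer_train, n_inner))
        yield outer_train, outer_test, inner


splits = list(nested_splits(n_samples=30, n_outer=3, n_inner=5))
```

Hyperparameters would be compared on the inner validation folds, and only the selected configuration would be scored on the outer test fold, which never appears in any inner split.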

4.3 Using the PHOTON Switch

In order to decide between alternative transformers or estimators, an OR-element, a so-called Switch, is required. We can add any number of PipelineElements with their hyperparameters to a Switch. The PipelineElements within the Switch are evaluated independently. Listing 2 shows a pipeline that first normalizes the data, second applies a strategy to balance the class distribution in the training set, and finally utilizes a Switch to find a suitable learning algorithm, i.e. either a Decision Tree (with three hyperparameters to optimize) or a Support Vector Machine (with a linear or an rbf kernel).

Figure 2: A pipeline applying normalization, handling class imbalance and evaluating two learning algorithms using a Switch. See PHOTON code in listing 2.
# Import paths are assumed to follow the PHOTON 1.0 package layout
# and may differ in later releases.
from photonai.base import Hyperpipe, PipelineElement, Switch
from photonai.optimization import IntegerRange
from sklearn.model_selection import KFold

# X (feature matrix) and y (targets) are assumed to be loaded beforehand.
my_pipe = Hyperpipe('basic_switch_pipe',
                    optimizer='random_grid_search',
                    optimizer_params={'n_configurations': 15},
                    metrics=['accuracy', 'precision', 'recall'],
                    best_config_metric='accuracy',
                    outer_cv=KFold(n_splits=3),
                    inner_cv=KFold(n_splits=5))

my_pipe += PipelineElement('StandardScaler')
my_pipe += PipelineElement('ImbalancedDataTransform',
                           hyperparameters={'method_name': ['RandomUnderSampler',
                                                            'SMOTE']})

# The Switch evaluates its elements as alternatives.
est_switch = Switch('EstimatorSwitch')
est_switch += PipelineElement('SVC',
                              hyperparameters={'kernel': ['rbf', 'linear']})
est_switch += PipelineElement('DecisionTreeClassifier',
                              hyperparameters={'min_samples_split': IntegerRange(2, 5),
                                               'min_samples_leaf': IntegerRange(1, 5),
                                               'criterion': ['gini', 'entropy']})

my_pipe += est_switch

my_pipe.fit(X, y)
Listing 2: PHOTON code that deploys a strategy to correct class imbalance in the training set and a PHOTON Switch in order to compare different learning algorithms (see section 3.2 and figure 2)
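The effect of an OR-element on the search space can be illustrated with a short sketch: the candidate configurations of the alternatives are placed side by side instead of being combined with each other. The code below is a plain-Python illustration with hypothetical helper names and small discrete grids; it is not how the Switch is implemented.

```python
from itertools import product


def expand_grid(name, grid):
    """All configurations of one element as (element name, hyperparameters) pairs."""
    keys = sorted(grid)
    return [(name, dict(zip(keys, values)))
            for values in product(*(grid[k] for k in keys))]


def switch_configurations(elements):
    """OR-combination: configurations of the alternatives are listed side by side."""
    configurations = []
    for name, grid in elements.items():
        configurations.extend(expand_grid(name, grid))
    return configurations


estimator_switch = {
    "SVC": {"kernel": ["rbf", "linear"]},
    "DecisionTreeClassifier": {"criterion": ["gini", "entropy"],
                               "min_samples_split": [2, 3]},
}
candidates = switch_configurations(estimator_switch)
```

With two configurations for the first alternative and four for the second, the search space holds six candidates rather than the eight a Cartesian product across both elements would produce.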
Figure 3: In this pipeline three learning algorithms are applied in parallel using a Stack and the output is further processed using a Decision Tree. See PHOTON code in listing 3.

4.4 Using the PHOTON Stack

The Stack introduces an AND-operation in which data is first delivered to all PipelineElements contained in the Stack and then processed by the respective PipelineElements. In listing 3, output from three different estimators (a Gaussian Process Classifier, an AdaBoost Classifier and a Random Forest) is horizontally concatenated (see Figure 1b) to obtain new features which can then be transformed further or (as in this example) be fed to a Decision Tree algorithm. In addition, a cross-validation strategy is applied that splits the data according to associated group labels.

# Import paths are assumed to follow the PHOTON 1.0 package layout
# and may differ in later releases.
from photonai.base import Hyperpipe, PipelineElement, Stack
from photonai.optimization import IntegerRange, FloatRange
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# X, y and groups (one group label per sample) are assumed to be loaded beforehand.
my_pipe = Hyperpipe('group_analysis',
                    optimizer='sk_opt',
                    metrics=['accuracy', 'precision', 'recall'],
                    best_config_metric='accuracy',
                    outer_cv=GroupKFold(n_splits=4),
                    inner_cv=GroupShuffleSplit(n_splits=10))

my_pipe += PipelineElement('SimpleImputer')
my_pipe += PipelineElement('StandardScaler')
my_pipe += PipelineElement('PCA',
                           hyperparameters={'n_components': FloatRange(0.5, 0.8, step=0.1)},
                           test_disabled=True)

# All estimators in the Stack receive the same data; their outputs are concatenated.
stack = Stack('estimator_stack')
stack += PipelineElement('GaussianProcessClassifier')
stack += PipelineElement('AdaBoostClassifier',
                         hyperparameters={'n_estimators': IntegerRange(50, 200),
                                          'learning_rate': FloatRange(0.01, 2)})
stack += PipelineElement('RandomForestClassifier',
                         hyperparameters={'min_samples_split': IntegerRange(2, 10)})

my_pipe += stack

my_pipe += PipelineElement('DecisionTreeClassifier')
my_pipe.fit(X, y, groups=groups)
Listing 3: PHOTON code for simultaneously applying different learning algorithms (see section 3.2 and figure 3)
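The horizontal concatenation performed by an AND-element can be sketched as follows. The three toy estimators are hypothetical stand-ins for fitted models, and stack_outputs is not part of the PHOTON API.

```python
def stack_outputs(estimators, samples):
    """AND-combination: every estimator sees the same samples; outputs become columns."""
    columns = [estimator(samples) for estimator in estimators]
    # Transpose so that each row holds one sample's predictions from all estimators.
    return [list(row) for row in zip(*columns)]


# Three toy estimators, each returning one prediction per sample.
def above_zero(samples):
    return [int(sum(s) > 0) for s in samples]


def first_feature_positive(samples):
    return [int(s[0] > 0) for s in samples]


def any_large(samples):
    return [int(max(s) > 1) for s in samples]


samples = [[1.5, -0.5], [-2.0, 0.5], [0.2, 0.1]]
meta_features = stack_outputs([above_zero, first_feature_positive, any_large], samples)
```

Each row of meta_features now has one column per estimator and can be passed to a subsequent element, such as the Decision Tree in listing 3.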

4.5 Using PHOTON Branches

In the same vein, we can combine Stacks and Branches to differentially process data streams and integrate them afterwards. In listing 4, we process different subsets of features of the popular Breast Cancer dataset [22] as provided by scikit-learn. We divide the features according to their prefix into three different Branches, concatenate the outputs and apply a fully connected neural net defined in Keras.

# Import paths are assumed to follow the PHOTON 1.0 package layout
# and may differ in later releases.
import numpy as np
from photonai.base import Hyperpipe, PipelineElement, Stack, Branch, DataFilter
from photonai.optimization import FloatRange, Categorical
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold

X, y = load_breast_cancer(return_X_y=True)

my_pipe = Hyperpipe('data_integration',
                    optimizer='random_grid_search',
                    optimizer_params={'n_configurations': 2},
                    metrics=['accuracy', 'precision', 'recall'],
                    best_config_metric='f1_score',
                    outer_cv=KFold(n_splits=3),
                    inner_cv=KFold(n_splits=3))

my_pipe += PipelineElement('StandardScaler', {}, with_mean=True)

# Use only 'mean' features: [mean_radius, mean_texture, etc.]
mean_branch = Branch('MeanFeature')
mean_branch += DataFilter(indices=np.arange(10))
mean_branch += PipelineElement('PCA')

# Use only 'error' features
error_branch = Branch('ErrorFeature')
error_branch += DataFilter(indices=np.arange(10, 20))
error_branch += PipelineElement('PCA')

# Use only 'worst' features: [worst_radius, worst_texture, etc.]
worst_branch = Branch('WorstFeature')
worst_branch += DataFilter(indices=np.arange(20, 30))
worst_branch += PipelineElement('PCA')

# Join the three branches; their outputs are concatenated.
my_pipe += Stack('SourceSplit', [mean_branch, error_branch, worst_branch])

# Create custom neural net model
my_pipe += PipelineElement('KerasDnnClassifier',
                           hyperparameters={'hidden_layer_sizes': Categorical([[8, 4, 2],
                                                                               [3, 5]]),
                                            'dropout_rate': FloatRange(0.2, 0.7)},
                           activations='relu',
                           batch_size=32,
                           multi_class=False,
                           verbosity=1)
my_pipe.fit(X, y)
Listing 4: PHOTON code for creating three subpipelines and using a Keras Neural Net (see section 3.2 and figure 4)
Figure 4: Using a Stack we join three subpipelines represented as Branches. The feature matrix is divided and the respective feature subsets are processed by different subpipelines. The output is given to a fully connected net defined in Keras. See PHOTON code in listing 4.
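The data flow through parallel Branches can be sketched in plain Python: each branch filters its own feature columns, transforms them, and the branch outputs are joined column-wise. The helper names and the toy transformation are hypothetical; in listing 4 the filtering is done by DataFilter and the transformation by PCA.

```python
def select_columns(rows, indices):
    """Counterpart of a column filter: keep only the features at `indices`."""
    return [[row[i] for i in indices] for row in rows]


def row_mean(rows):
    """Toy branch transformation: reduce each row to its mean."""
    return [[sum(row) / len(row)] for row in rows]


def run_branches(rows, branches):
    """Process each feature subset in its own branch and join outputs column-wise."""
    outputs = [transform(select_columns(rows, indices))
               for indices, transform in branches]
    return [sum(parts, []) for parts in zip(*outputs)]


# Six features per sample, split into three branches of two features each.
data = [[1.0, 3.0, 10.0, 20.0, 0.0, 4.0],
        [2.0, 2.0, 30.0, 50.0, 6.0, 8.0]]
branches = [(range(0, 2), row_mean), (range(2, 4), row_mean), (range(4, 6), row_mean)]
integrated = run_branches(data, branches)
```

Every sample ends up with one output column per branch, which is the joined representation a final estimator would receive.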

5 Future Developments and Conclusion

We make PHOTON available in the hope that it will simplify and accelerate the ML workflow. In the future, we intend to extend both functionality and usability. First, we will incorporate additional hyperparameter optimization strategies. While this area has seen tremendous progress in recent years, these algorithms are often not readily available to data scientists and studies systematically comparing them are extremely scarce. Second, we seek to develop our pipeline setup into a more comprehensive Automatic Machine Learning (AutoML) system which allows users to automate any or all parts of the entire pipeline from raw data processing to ML model deployment. Notably, in contrast to existing tools, the user keeps full control over the automatization process and is able to customize and adapt the process to specific user-requirements as is crucial for use in an applied science context.

In addition to these core functionalities, we aim to establish an ecosystem of add-on modules which simplify and accelerate ML analyses for different data types and modalities. For example, we will add PHOTON Neuro as a means to directly use multimodal Magnetic Resonance Imaging (MRI) data in ML analyses. In addition, PHOTON Graph will pool existing graph analysis functions and provide specialized ML approaches for graph data. Likewise, modules integrating additional data modalities such as omics data would be of great value. More generally, PHOTON would benefit from modules which make more advanced ensemble learning capabilities and novel approaches to model interpretation (i.e. Explainability) available.

In summary, PHOTON aims to simplify and accelerate the ML workflow enabling rapid, reproducible, and unbiased analyses. It is especially well-suited in contexts which require iterative evaluation of novel approaches such as applied ML research in medicine and the Life Sciences. In the future, we hope to attract more developers and users to establish a thriving, open-source community.

Acknowledgments

This work was supported by grants from the Interdisciplinary Center for Clinical Research (IZKF) of the medical faculty of Münster (grant MzH 3/020/20 to TH and grant Dan3/012/17 to UD) and the German Research Foundation (DFG grants HA7070/2-2, HA7070/3, HA7070/4 to TH).

References

  • [1] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake Vanderplas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. API Design For Machine Learning Software: Experiences From the scikit-learn Project. sep 2013.
  • [2] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, and Others. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
  • [3] Guillaume Lemaitre, Fernando Nogueira, and Christos K Aridas. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of Machine Learning Research, 18(17):1–5, 2017.
  • [4] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015.
  • [5] François Chollet and Others. Keras, 2015.
  • [6] The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, and Others. Theano: A Python Framework for Fast Computation of Mathematical Expressions. arXiv preprint arXiv:1605.02688, 2016.
  • [7] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, and Others. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
  • [8] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional Architecture For Fast Feature Embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678, 2014.
  • [9] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential Model-Based Optimization for General Algorithm Configuration. In International Conference on Learning and Intelligent Optimization, pages 507–523. Springer, 2011.
  • [10] Haifeng Jin, Qingquan Song, and Xia Hu. Auto-Keras: An Efficient Neural Architecture Search System. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining - KDD ’19, pages 1946–1956, New York, New York, USA, 2019. ACM Press.
  • [11] Barret Zoph and Quoc V. Le. Neural Architecture Search with Reinforcement Learning. nov 2016.
  • [12] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient Neural Architecture Search via Parameter Sharing. feb 2018.
  • [13] Frank Seide. Keynote: The Computer Science Behind The Microsoft Cognitive Toolkit: an Open Source Large-Scale Deep Learning Toolkit for Windows and Linux. In 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages xi–xi. IEEE, 2017.
  • [14] Tim Head, MechCoder, Gilles Louppe, Iaroslav Shcherbatyi, Fcharras, Zé Vinícius, Cmmalone, Christopher Schröder, Nel215, Nuno Campos, Todd Young, Stefano Cereda, Thomas Fan, Rene-rex, Kejia (KJ) Shi, Justus Schwabedal, Carlosdanielcsantos, Hvass-Labs, Mikhail Pak, SoManyUsernamesTaken, Fred Callaway, Loïc Estève, Lilian Besson, Mehdi Cherti, Karlson Pfannschmidt, Fabian Linzberger, Christophe Cauet, Anna Gut, Andreas Mueller, and Alexander Fabisch. Scikit-optimize, 2018.
  • [15] Matthew Rocklin. Dask: Parallel computation with blocked algorithms and task scheduling. In Proceedings of the 14th python in science conference, number 130-136. Citeseer, 2015.
  • [16] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. Auto-sklearn: Efficient and Robust Automated Machine Learning. pages 113–134. 2019.
  • [17] Vladimir Vapnik, Steven E Golowich, and Alex J Smola. Support Vector Method for Function Approximation, Regression Estimation and Signal Processing. In Advances in Neural Information Processing Systems, pages 281–287, 1997.
  • [18] Guido Van Rossum, Barry Warsaw, and Nick Coghlan. PEP 8: Style Guide for Python Code. Python. org, 1565, 2001.
  • [19] Travis Oliphant. Guide To NumPy. USA: Trelgol Publishing, 2006.
  • [20] Wes McKinney. Data Structures for Statistical Computing in Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 51–56, 2010.
  • [21] Dwight Merriman, Eliot Horowitz, and Kevin Ryan. MongoDB. 2007.
  • [22] W Wolberg and O Mangasarian. Multisurface Method of Pattern Separation for Medical Diagnosis Applied to Breast Cytology. In Proceedings of the National Academy of Sciences, pages 9193–9196, 1990.