Ludwig: a type-based declarative deep learning toolbox

09/17/2019 ∙ by Piero Molino, et al. ∙ Uber

In this work we present Ludwig, a flexible, extensible and easy-to-use toolbox that allows users to train deep learning models and use them to obtain predictions without writing code. Ludwig implements a novel approach to deep learning model building based on two main abstractions: data types and declarative configuration files. The data type abstraction allows for easier code and sub-model reuse, and the standardized interfaces imposed by this abstraction allow for encapsulation and make the code easy to extend. Declarative model definition configuration files enable inexperienced users to obtain effective models and increase the productivity of expert users. Alongside these two innovations, Ludwig introduces a general modularized deep learning architecture called Encoder-Combiner-Decoder that can be instantiated to perform a vast number of machine learning tasks. These innovations make it possible for engineers, scientists from other fields and, in general, a much broader audience to adopt deep learning models for their tasks, concretely helping in its democratization.






1 Introduction and Motivation

Over the course of the last ten years, deep learning models have proven highly effective in almost every machine learning task in different domains including (but not limited to) computer vision, natural language, speech, and recommendation. Their wide adoption in both research and industry has been greatly facilitated by increasingly sophisticated software libraries like Theano Theano Development Team (2016), TensorFlow Abadi et al. (2015), Keras Chollet and others (2015), PyTorch Paszke et al. (2017), Caffe Jia et al. (2014), Chainer Tokui et al. (2015), CNTK Seide and Agarwal (2016) and MXNet Chen et al. (2015). Their main value has been to provide tensor algebra primitives with efficient implementations which, together with the massively parallel computation available on GPUs, enabled researchers to scale training to bigger datasets. Those packages, moreover, provided standardized implementations of automatic differentiation, which greatly simplified model implementation. Researchers, no longer having to spend time re-implementing these basic building blocks from scratch and now having fast and reliable implementations of them, were able to focus on models and architectures, which led to the explosion of new *Net model architectures of the last five years.

With artificial neural network architectures being applied to a wide variety of tasks, common practices regarding how to handle certain types of input information emerged. When faced with a computer vision problem, a practitioner pre-processes data using the same pipeline that resizes images, augments them with some transformation and maps them into 3D tensors. Something similar happens for text data, where text is tokenized into a list of words, characters or word pieces, a vocabulary with associated numerical IDs is collected and sentences are transformed into vectors of integers. Specific architectures are adopted to encode different types of data into latent representations: convolutional neural networks are used to encode images and recurrent neural networks are adopted for sequential data and text (more recently self-attention architectures are replacing them). Most practitioners working on a multi-class classification task would project latent representations into vectors whose size equals the number of classes to obtain logits and apply a softmax operation to obtain probabilities for each class, while for regression tasks, they would map latent representations into a single dimension by a linear layer, and the single score is the predicted value.
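The two output conventions described above can be sketched in a few lines of plain Python. This is a minimal illustration and not Ludwig code: the helper names `softmax` and `linear`, the toy latent vector and the weights are all assumptions made for the example.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn a vector of logits into class probabilities."""
    # Subtracting the maximum keeps exp() from overflowing without changing the result.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(hidden: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer: one output value per row of `weights`."""
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weights, bias)]


hidden = [0.5, -1.0, 2.0]  # toy latent representation (hypothetical values)

# Classification head: project to as many logits as classes, then apply softmax.
class_logits = linear(hidden, [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]], [0.0, 0.0])
class_probs = softmax(class_logits)

# Regression head: project to a single dimension; that score is the prediction.
predicted_value = linear(hidden, [[2.0, 1.0, 0.5]], [0.1])[0]
```

The only difference between the two heads is the size of the projection and whether a softmax follows it, which is the kind of regularity the type-based abstraction builds on.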

Figure 1: Examples of declarative model definitions. The first two show two models for image classification using two different encoders, while the third shows a multi-label text classification system. Note that, apart from the names of the input and output features, which are just identifiers, all that needs to be changed to use a different image encoder is the name of the encoder, while for changing tasks all that needs to be changed is the types of the inputs and outputs.

Observing these emerging patterns led us to define abstract functions that identify classes of equivalence of model architectures. For instance, most of the different architectures for encoding images can be seen as different implementations of the abstract encoding function $T'_{h' \times w' \times c'} = e_\theta(T_{h \times w \times c})$, where $T_{dims}$ denotes a tensor with dimensions $dims$ and $e_\theta$ is an encoding function parametrized by parameters $\theta$ that maps from tensor to tensor. In tasks like image classification, $T'$ is pooled and flattened (i.e., a reduce function is applied spatially and the output tensor is reshaped as a vector) before being provided to, again, an abstract function that computes $T_c = d_\theta(T_h)$, where $T_h$ is a one-dimensional tensor of hidden size $h$, $T_c$ is a one-dimensional tensor of size $c$ equal to the number of classes, and $d_\theta$ is a decoding function parametrized by parameters $\theta$ that maps a hidden representation into logits and is usually implemented as a stack of fully connected layers. Similar abstract encoding and decoding functions that generalize many different architectures can be defined for different types of input data and different types of expected output predictions (which in turn define different types of tasks).
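As a rough illustration of the encode, pool-and-flatten, decode chain described above, the sketch below uses nested Python lists as tensors. Everything in it is hypothetical: `toy_encoder` stands in for a real image encoder (it only rescales values with a single parameter), and the weights are made up. It is meant to show the shapes involved, not how Ludwig implements them.

```python
from typing import List

Tensor3D = List[List[List[float]]]  # indexed as [height][width][channel]


def toy_encoder(t: Tensor3D, theta: float) -> Tensor3D:
    """Stand-in for an image encoder: tensor in, tensor out, one parameter."""
    return [[[theta * v for v in pixel] for pixel in row] for row in t]


def global_average_pool(t: Tensor3D) -> List[float]:
    """Reduce spatially (over height and width) and return a vector of size c."""
    h, w, c = len(t), len(t[0]), len(t[0][0])
    return [sum(t[i][j][k] for i in range(h) for j in range(w)) / (h * w) for k in range(c)]


def decoder(hidden: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Map a hidden vector of size h to logits, one per class."""
    return [sum(w * x for w, x in zip(row, hidden)) + b for row, b in zip(weights, bias)]


image = [[[0.5, 0.0], [1.5, 1.0]], [[2.5, 2.0], [3.5, 3.0]]]  # 2 x 2 x 2 tensor
encoded = toy_encoder(image, theta=2.0)                        # still 2 x 2 x 2
hidden = global_average_pool(encoded)                          # vector of size 2
logits = decoder(hidden, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])  # 3 classes
```

Any function with the same input and output ranks could replace `toy_encoder` without touching the pooling or decoding steps.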

We introduce Ludwig, a deep learning toolbox based on the above-mentioned level of abstraction, with the aim to encapsulate best practices and take advantage of inheritance and code modularity. Ludwig makes it much easier for practitioners to compose their deep learning models by just declaring their data and task and to make code reusable, extensible and favor best practices. These classes of equivalence are named after the data type of the inputs encoded by the encoding functions (image, text, series, category, etc.) and the data type of the outputs predicted by the decoding functions. This type-based abstraction allows for a higher level interface than what is available in current deep learning frameworks, which abstract at the level of single tensor operation or at the layer level. This is achieved by defining abstract interfaces for each data type, which allows for extensibility as any new implementation of the interface is a drop-in replacement for all the other implementations already available.

Concretely, this allows for defining, for instance, a model that includes an image encoder and a category decoder and being able to swap in and out VGG Simonyan and Zisserman (2015), ResNet He et al. (2016) or DenseNet Huang et al. (2017) as different interchangeable representations of an image encoder. The natural consequence of this level of abstraction is associating a name to each encoder for a specific type and enabling the user to declare what model to employ rather than requiring them to implement them imperatively, and at the same time, letting the user add new and custom encoders. The same also applies to data types other than images and decoders.

With such type-based interfaces in place and implementations of such interfaces readily available, it becomes possible to construct a deep learning model simply by specifying the type of the features in the data and selecting the implementation to employ for each data type involved. Consequently, Ludwig has been designed around the idea of a declarative specification of the model to allow a much wider audience (including people who do not code) to adopt deep learning models, effectively democratizing them. Three such model definitions are shown in Figure 1.

The main contribution of this work is that, thanks to this higher level of abstraction and its declarative nature, Ludwig allows inexperienced users to easily build deep learning models, while allowing experts to decide the specific modules to employ with their hyper-parameters and to add additional custom modules. Ludwig’s other main contribution is the general modular architecture defined through the type-based abstraction that allows for code reuse, flexibility, and the performance of a wide array of machine learning tasks under a cohesive framework.

The remainder of this work describes Ludwig’s architecture in detail, explains its implementation, compares Ludwig with other deep learning frameworks and discusses its advantages and disadvantages.

2 Architecture

The notation used in this section is defined as follows. Let $x \sim D$ be a data point sampled from a dataset $D$. Each data point is a tuple of typed values called features. They are divided into two sets: $x^I$ is the set of input features and $x^O$ is the set of output features. $x_i$ will refer to a specific input feature, while $x_o$ will refer to a specific output feature. Model predictions given input features $x^I$ are denoted as $\hat{x}^O$, so that there will be a specific prediction $\hat{x}_o$ for each output feature $x_o \in x^O$. The types of the features can be either atomic (scalar) types like binary, numerical or category, or complex ones like discrete sequences or sets. Each data type is associated with abstract function types, as is explained in the following section, to perform type-specific operations on features and tensors. Tensors are a generalization of scalars, vectors, and matrices with ranks of different dimensions. Tensors are referred to as $T_{dims}$, where $dims$ indicates the dimensions for each rank, like for instance $T_{l \times m \times n}$ for a rank 3 tensor of dimensions $l$, $m$ and $n$ respectively for each rank.

2.1 Type-based Abstraction

Figure 2: Data type functions flow.

Type-based abstraction is one of the main concepts that define Ludwig’s architecture. Currently, Ludwig supports the following types: binary, numerical (floating point values), category (unique strings), set of categorical elements, bag of categorical elements, sequence of categorical elements, time series (sequence of numerical elements), text, image, audio (which doubles as speech when using different pre-processing parameters), date, H3 Brodsky and others (2018) (a geo-spatial indexing system), and vector (one dimensional tensor of numerical values). The type-based abstraction makes it easy to add more types.

The motivation behind this abstraction stems from the observation of recurring patterns in deep learning projects: pre-processing code is more or less the same given certain types of inputs and specific tasks, as is the code implementing models and training loops. Small differences make models hard to compare and their code difficult to reuse. By modularizing it on a data type basis, our aim is to improve code reuse, adoption of best practices and extensibility.

Figure 3: Encoder-Combiner-Decoder Architecture

Each data type has five abstract function types associated with it and there could be multiple implementations of each of them:

  • Pre-processor: a pre-processing function $T_{dims} = \mathrm{pre}(x_i)$ maps a raw data point input feature $x_i$ into a tensor $T$ with dimensions $dims$. Different data types may have different pre-processing functions and different dimensions of $T$. A specific type may, moreover, have different implementations of $\mathrm{pre}$. A concrete example is text: in this case $x_i$ is a string of text, there could be different tokenizers that implement $\mathrm{pre}$ by splitting on space or using byte-pair encoding and mapping tokens to integers, and $dims$ is $s$, the length of the sequence of tokens.

  • Encoder: an encoding function $T'_{dims'} = e_\theta(T_{dims})$ maps an input tensor $T$ into an output tensor $T'$ using parameters $\theta$. The dimensions $dims$ and $dims'$ may be different from each other and depend on the specific data type. The input tensor is the output of a $\mathrm{pre}$ function. Concretely, encoding functions for text, for instance, take as input $T_s$ and produce $T_h$, where $h$ is a hidden dimension, if the output is required to be pooled, or $T_{s \times h}$ if the output is not pooled. Examples of possible implementations of $e_\theta$ are CNNs, bidirectional LSTMs or Transformers.

  • Decoder: a decoding function $\hat{T}_{dims''} = d_\theta(T'_{dims'})$ maps an input tensor $T'$ into an output tensor $\hat{T}$ using parameters $\theta$. The dimensions $dims'$ and $dims''$ may be different from each other and depend on the specific data type. $T'$ is the output of an encoding function or of a combiner (explained in the next section). Concretely, a decoder function for the category type would map an input tensor $T_h$ into a tensor $T_c$, where $c$ is the number of classes.

  • Post-processor: a post-processing function $\hat{x}_o = \mathrm{post}(\hat{T}_{dims''})$ maps a tensor $\hat{T}$ with dimensions $dims''$ into a raw data point prediction $\hat{x}_o$. $\hat{T}$ is the output of a decoding function. Different data types may have different post-processing functions and different dimensions of $\hat{T}$. A specific type may, moreover, have different implementations of $\mathrm{post}$. A concrete example is text: in this case $\hat{x}_o$ is a string of text, and there could be different functions that implement $\mathrm{post}$ by first mapping integer predictions into tokens and then concatenating on space or using byte-pair concatenation to obtain a single string of text.

  • Metrics: a metric function $S = m(x_o, \hat{x}_o)$ produces a score $S$ given a ground truth output feature $x_o$ and a predicted output $\hat{x}_o$ of the same dimension. $\hat{x}_o$ is the output of a post-processing function. In this context, for simplicity, loss functions are considered to belong to the metric class of functions. Many different metrics may be associated with the same data type. Concrete examples of metrics for the category data type are accuracy, precision, recall, F1, and cross entropy loss, while for the numerical data type they could be mean squared error, mean absolute error, and R2.

A depiction of how the functions associated with a data type are connected to each other is provided in Figure 2.
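To make the chain of five function types concrete, here is a toy end-to-end sketch for a text input and a category output. It is not Ludwig's implementation: the function names (`pre`, `encode`, `decode`, `post`, `accuracy`), the whitespace tokenizer, the mean-of-embeddings encoder and the hand-picked weights are all assumptions chosen to keep the example small.

```python
from typing import Dict, List


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Collect a token-to-integer mapping, reserving index 0 for unknown tokens."""
    vocab = {"<UNK>": 0}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def pre(text: str, vocab: Dict[str, int]) -> List[int]:
    """Pre-processor: raw string to a rank-1 integer tensor of length s."""
    return [vocab.get(token, vocab["<UNK>"]) for token in text.lower().split()]


def encode(token_ids: List[int], embeddings: List[List[float]]) -> List[float]:
    """Encoder (pooled): average the token embeddings into a vector of size h."""
    vectors = [embeddings[i] for i in token_ids]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def decode(hidden: List[float], weights: List[List[float]]) -> List[float]:
    """Decoder: hidden vector of size h to one logit per class."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def post(logits: List[float], idx2label: List[str]) -> str:
    """Post-processor: logits back to a raw category value."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return idx2label[best]


def accuracy(truths: List[str], predictions: List[str]) -> float:
    """Metric: fraction of predictions that match the ground truth."""
    return sum(t == p for t, p in zip(truths, predictions)) / len(truths)


texts, labels = ["good movie", "bad movie"], ["positive", "negative"]
vocab = build_vocab(texts)  # {"<UNK>": 0, "good": 1, "movie": 2, "bad": 3}
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]  # hypothetical parameters
weights = [[1.0, 0.0], [0.0, 1.0]]                             # hypothetical parameters
idx2label = ["positive", "negative"]

predictions = [post(decode(encode(pre(t, vocab), embeddings), weights), idx2label) for t in texts]
score = accuracy(labels, predictions)
```

Swapping the tokenizer, the encoder or the metric only requires replacing one function, as long as the tensor ranks at the boundaries stay the same.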

2.2 Encoders-Combiner-Decoders

Figure 4: Different instantiations of the ECD architecture for different machine learning tasks

In Ludwig, every model is defined in terms of encoders that encode different features of an input data point, a combiner which combines information coming from the different encoders, and decoders that decode the information from the combiner into one or more output features. This generic architecture is referred to as Encoders-Combiner-Decoders (ECD). A depiction is provided in Figure 3.

This architecture is introduced because it maps naturally onto most deep learning model architectures and allows for modular composition. This characteristic, enabled by the data type abstraction, allows for defining models by just declaring the data types of the input and output features involved in the task and assembling standard sub-modules accordingly rather than writing a full model from scratch.

A specific instantiation of an ECD architecture can have multiple input features of different or same type, and the same is true for output features. For each feature in the input part, pre-processing and encoding functions are computed depending on the type of the feature, while for each feature in the output part, decoding, metrics and post-processing functions are computed, again depending on the type of each output feature.

When multiple input features are provided, a combiner function $\{T''\} = c_\theta(\{T'\})$ that maps a set of input tensors $\{T'\}$ into a set of output tensors $\{T''\}$ is computed. $c_\theta$ has an abstract interface and many different functions can implement it. One concrete example is what in Ludwig is called the concat combiner: it flattens all the tensors in the input set, concatenates them and passes them to a stack of fully connected layers, the output of which is provided as output, a set of only one tensor. Note that a possible implementation of a combiner function can be the identity function.

This definition of a combiner function allows for implementations where subsets of inputs are provided to different sub-modules which return subsets of the output tensors, or even for a recursive definition where the combiner function is an ECD model itself, albeit without pre-processors and post-processors, since inputs and outputs are already tensors and do not need to be pre-processed and post-processed. Although the combiner definition in the ECD architecture is theoretically flexible, the current implementations of combiner functions in Ludwig are monolithic (without sub-modules), non-recursive, and return a single tensor as output instead of a set of tensors. However, more elaborate combiners can be added easily.
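The concat combiner described above can be approximated with the following sketch, which works on nested Python lists. It is an illustration under assumptions: the names `flatten` and `concat_combiner` are made up, a single ReLU layer stands in for the stack of fully connected layers, and the weights are arbitrary.

```python
from typing import List, Union

Tensor = Union[float, List["Tensor"]]


def flatten(t: Tensor) -> List[float]:
    """Reshape a tensor of any rank into a flat vector."""
    if isinstance(t, (int, float)):
        return [float(t)]
    flat: List[float] = []
    for item in t:
        flat.extend(flatten(item))
    return flat


def concat_combiner(
    tensors: List[Tensor], weights: List[List[float]], bias: List[float]
) -> List[List[float]]:
    """Flatten every input, concatenate, apply one fully connected ReLU layer.

    Returns a set (here a list) containing a single output tensor.
    """
    concatenated = [value for t in tensors for value in flatten(t)]
    output = [
        max(0.0, sum(w * x for w, x in zip(row, concatenated)) + b)
        for row, b in zip(weights, bias)
    ]
    return [output]


encoder_outputs = [[1.0, 2.0], [[3.0], [4.0]]]  # a rank-1 and a rank-2 tensor
combined = concat_combiner(
    encoder_outputs,
    weights=[[1.0, 1.0, 1.0, 1.0], [-1.0, 0.0, 0.0, 0.0]],
    bias=[0.0, 0.0],
)
```

An identity combiner under the same interface would simply return `tensors` unchanged.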

The ECD architecture allows for many instantiations by combining different input features of different data types with different output features of different data types, as depicted in Figure 4. An ECD with an input text feature and an output categorical feature can be trained to perform text classification or sentiment analysis, and an ECD with an input image feature and a text output feature can be trained to perform image captioning, while an ECD with categorical, binary and numerical input features and a numerical output feature can be trained to perform regression tasks like predicting house pricing, and an ECD with numerical, binary and categorical input features and a binary output feature can be trained to perform tasks like fraud detection. This architecture is therefore highly flexible and is limited only by the availability of data types and the implementations of their functions.

Figure 5: Different instantiations of the ECD architecture for different machine learning tasks

An additional advantage of this architecture is its ability to perform multi-task learning Caruana (1993). If more than one output feature is specified, an ECD architecture can be trained to minimize the weighted sum of the losses of each output feature in an end-to-end fashion. This approach has been shown to be highly effective in both vision and natural language tasks, achieving state-of-the-art performance Ratner et al. (2019b). Moreover, multiple outputs can be correlated or have logical or statistical dependencies on each other. For example, if the task is to predict both parts of speech and named entity tags from a sentence, the named entity tagger will most likely achieve higher performance if it is provided with the predicted parts of speech (assuming the predictions are better than chance, and there is correlation between part of speech and named entity tag). In Ludwig, dependencies between outputs can be specified in the model definition, a directed acyclic graph among them is constructed at model building time, and either the last hidden representation or the predictions of the origin output feature are provided as inputs to the decoder of the destination output feature. This process is depicted in Figure 5. When non-differentiable operations are performed to obtain the predictions, such as argmax in the case of category features performing multi-class classification, the logits or the probabilities are provided instead, keeping the multi-task training process end-to-end differentiable.
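The two mechanisms in this paragraph, ordering output features along a dependency graph and summing weighted losses, can be sketched as follows. This is a minimal illustration, not Ludwig's code: the `dependencies` key, the helper names and the loss values are assumptions for the example, and it relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, List


def decoding_order(output_features: List[dict]) -> List[str]:
    """Order output features so each one comes after the features it depends on.

    Raises graphlib.CycleError if the declared dependencies are not acyclic.
    """
    graph = {f["name"]: set(f.get("dependencies", [])) for f in output_features}
    return list(TopologicalSorter(graph).static_order())


def combined_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-output losses; a missing weight defaults to 1.0."""
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())


output_features = [
    {"name": "ner", "type": "sequence", "dependencies": ["pos"]},
    {"name": "pos", "type": "sequence"},
]

order = decoding_order(output_features)           # "pos" is decoded before "ner"
total = combined_loss({"pos": 2.0, "ner": 1.0}, {"pos": 0.5})
```

In this sketch the part-of-speech decoder runs first, so its hidden representation or probabilities can be handed to the named entity decoder.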

This generic formulation of multi-task learning as a directed acyclic graph of task dependencies is related to the hierarchical multi-task learning in Snorkel MeTaL proposed by Ratner et al. (2018) and its adoption for improving training from weak supervision by exploiting task agreements and disagreements of different labeling functions Ratner et al. (2019a). The main difference is that Ludwig can automatically handle heterogeneous tasks, i.e. tasks to predict different data types with support for different decoders, while in Snorkel MeTaL each task head is a linear layer. On the other hand, Snorkel MeTaL’s focus on weak supervision is currently absent in Ludwig. An interesting avenue of further research to close the gap between the two approaches could be to infer dependencies and loss weights automatically given fully supervised multi-task data and combine weak supervision with heterogeneous tasks.

3 Implementation

3.1 Declarative Model Definition

Ludwig adopts a declarative model definition schema that allows users to define an instantiation of the ECD architecture to train on their data.

The higher level of abstraction provided by the type-based ECD architecture allows for a separation between what a model is expected to learn to do and how it actually does it. This convinced us to provide a declarative way of defining the models in Ludwig, as the number of potential users who can define a model by declaring the inputs they are providing and the predictions they are expecting, without specifying the implementation of how the predictions are obtained, is substantially bigger than the number of developers who can code a full deep learning model on their own. An additional motivation for the adoption of declarative model definitions stems from the separation of interests between the authors of the implementations of the models and the final users, analogous to the separation of interests between the authors of query planning and indexing strategies of a database and those users who query the database, which allows the former to provide improved strategies without impacting the way the latter interact with the system.

Figure 6: On the left side, a minimal model definition for text classification. On the right side, a more complex model definition including input and output features and more model and training hyper-parameters.

The model definition is divided into five sections:

  • Input Features: in this section of the model definition, a list of input features is specified. The minimum amount of information that needs to be provided for each feature is the name of the feature that corresponds to the name of a column in the tabular data provided by the user, and the type of such feature. Some features have multiple encoders, but if one is not specified, the default one is used. Each encoder can have its own hyper-parameters, and if they are not specified, the default hyper-parameters of the specified encoder are used.

  • Combiner: in this section of the model definition, the type of combiner can be specified; if none is specified, the default concat is used. Each combiner can have its own hyper-parameters, but if they are not specified, the default ones of the specified combiner are used.

  • Output Features: in this section of the model definition, a list of output features is specified. The minimum amount of information that needs to be provided for each feature is the name of the feature that corresponds to the name of a column in the tabular data provided by the user, and the type of such feature. The data in the column is the ground truth the model is trained to predict. Some features have multiple decoders that calculate the predictions, but if one is not specified, the default one is used. Each decoder can have its own hyper-parameters, and if they are not specified, the default hyper-parameters of the specified decoder are used. Moreover, each decoder can have different losses with different parameters to compare the ground truth values and the values predicted by the decoder and, also in this case, if they are not specified, defaults are used.

  • Pre-processing: pre-processing and post-processing functions of each data type can have parameters that change their behavior. They can be specified in this section of the model definition and are applied to all input and output features of a specified type, and if they are not provided, defaults are used. Note that for some use cases it would be useful to have different processing parameters for different features of the same type. Consider a news classifier where the title and the body of a piece of news are provided as two input text features. In this case, the user may be inclined to set a smaller value for the maximum length in words and the maximum size of the vocabulary for the title input feature. Ludwig allows users to specify processing parameters on a per-feature basis by providing them inside each input and output feature definition. If both type-level parameters and single-feature-level parameters are provided, the single-feature-level ones override the type-level ones.

  • Training: the training process itself has parameters that can be changed, like the number of epochs, the batch size, the learning rate and its scheduling, and so on. Those parameters can be provided by the user, but if they are not provided, defaults are used.
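The precedence rule for processing parameters stated in the Pre-processing item (defaults, then type-level values, then per-feature values) can be sketched as a simple dictionary merge. The function name `resolve_preprocessing` and the parameter names `max_length` and `vocab_size` are illustrative placeholders, not Ludwig's actual parameter names.

```python
from typing import Dict


def resolve_preprocessing(feature: dict, type_level: Dict[str, dict], defaults: Dict[str, dict]) -> dict:
    """Merge processing parameters; later sources override earlier ones."""
    params = dict(defaults.get(feature["type"], {}))      # 1. built-in defaults
    params.update(type_level.get(feature["type"], {}))    # 2. type-level section
    params.update(feature.get("preprocessing", {}))       # 3. per-feature overrides
    return params


defaults = {"text": {"max_length": 256, "vocab_size": 10000, "lowercase": True}}
type_level = {"text": {"vocab_size": 20000}}

title = {"name": "title", "type": "text", "preprocessing": {"max_length": 20, "vocab_size": 5000}}
body = {"name": "body", "type": "text"}

title_params = resolve_preprocessing(title, type_level, defaults)
body_params = resolve_preprocessing(body, type_level, defaults)
```

In the news classifier example, the title ends up with its own shorter limits while the body falls back to the type-level and default values.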

The wide adoption of defaults allows for really concise model definitions, like the one shown on the left side of Figure 6, as well as a high degree of control on both the architecture of the model and training parameters, as shown on the right side of Figure 6.

Ludwig uses YAML for model definitions because of its human readability, but as long as the nested structure is representable, other similar formats could be adopted.
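Since a model definition is just a nested structure, it can equally be written as a Python dictionary. The sketch below shows a minimal definition and one possible way defaults could be filled in. The section names follow the text above, but the helper `with_defaults` and every default value in the three tables (including the encoder and decoder names) are placeholders invented for this example, not Ludwig's real defaults.

```python
import copy

# Placeholder defaults, assumed for illustration only.
DEFAULT_ENCODERS = {"text": "default_text_encoder", "image": "default_image_encoder"}
DEFAULT_DECODERS = {"category": "default_category_decoder"}
DEFAULT_TRAINING = {"epochs": 100, "batch_size": 128, "learning_rate": 0.001}

# Minimal definition: only feature names (column names) and types are given.
minimal_definition = {
    "input_features": [{"name": "review", "type": "text"}],
    "output_features": [{"name": "sentiment", "type": "category"}],
}


def with_defaults(model_definition: dict) -> dict:
    """Return a copy of the definition with every unspecified choice filled in."""
    full = copy.deepcopy(model_definition)
    for feature in full["input_features"]:
        feature.setdefault("encoder", DEFAULT_ENCODERS[feature["type"]])
    for feature in full["output_features"]:
        feature.setdefault("decoder", DEFAULT_DECODERS[feature["type"]])
    full.setdefault("combiner", {"type": "concat"})
    full["training"] = {**DEFAULT_TRAINING, **full.get("training", {})}
    return full


full_definition = with_defaults(minimal_definition)
custom = with_defaults({**minimal_definition, "training": {"epochs": 10}})
```

The user-facing definition stays short, while the expanded one records every choice that was actually used.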

For the ever-growing list of available encoders, combiners, and decoders, their hyper-parameters, and the pre-processing and training parameters available, please consult Ludwig’s user guide. For additional examples, refer to the examples section.

In order to allow for flexibility and ease of extensibility, two well-known design patterns are adopted in Ludwig: the strategy pattern Gamma et al. (1994) and the registry pattern. The strategy pattern is adopted at different levels to allow different behaviors to be performed by different instantiations of the same abstract components. It is used both to make the different data types interchangeable from the point of view of model building, training, and inference, and to make different encoders and decoders for the same type interchangeable. The registry pattern, on the other hand, is implemented in Ludwig by assigning names to code constructs (variables, functions, objects, or modules) and storing them in a dictionary. They can be referenced by their name, allowing for straightforward extensibility; adding an additional behavior is as simple as adding a new entry in the registry.

In Ludwig, the combination of these two patterns allows users to add new behaviors by simply implementing the abstract function interface of the encoder of a specific type and adding that function implementation to the registry of available implementations. The same applies to adding new decoders, new combiners, and additional data types. The problem with this approach is that different implementations of the same abstract functions have to conform to the same interface, but in our case some parameters of the function may be different. As a concrete example, consider two text encoders: a recurrent neural network (RNN) and a convolutional neural network (CNN). Although they both conform to the same abstract encoding function in terms of the rank of the input and output tensors, their hyper-parameters are different, with the RNN requiring a boolean parameter indicating whether it is bidirectional, and the CNN requiring the size of the filters. Ludwig solves this problem by exploiting **kwargs, a Python feature that allows passing additional parameters to functions by specifying their names and collecting them into a dictionary. This allows different functions implementing the same abstract interface to have the same signature and then retrieve the specific additional parameters from the dictionary using their names. This also greatly simplifies the implementation of default parameters, because if the dictionary does not contain the keyword of a required parameter, the default value for that parameter is used automatically.
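A compact way to see how the registry pattern and `**kwargs` fit together is the sketch below. It is not taken from Ludwig's code base: `ENCODER_REGISTRY`, `register_encoder`, `run_encoder` and the two toy encoders (which only report their settings instead of computing anything) are hypothetical names used to show the mechanism.

```python
from typing import Callable, Dict, Tuple

# Registry: maps (data type, encoder name) to the function implementing it.
ENCODER_REGISTRY: Dict[Tuple[str, str], Callable] = {}


def register_encoder(feature_type: str, name: str) -> Callable:
    """Decorator that stores an encoder in the registry under a name."""
    def decorator(fn: Callable) -> Callable:
        ENCODER_REGISTRY[(feature_type, name)] = fn
        return fn
    return decorator


@register_encoder("text", "rnn")
def rnn_encoder(inputs, **kwargs):
    # Encoder-specific hyper-parameter, with its default if the user gave none.
    bidirectional = kwargs.get("bidirectional", False)
    return {"encoder": "rnn", "bidirectional": bidirectional, "length": len(inputs)}


@register_encoder("text", "cnn")
def cnn_encoder(inputs, **kwargs):
    filter_size = kwargs.get("filter_size", 3)
    return {"encoder": "cnn", "filter_size": filter_size, "length": len(inputs)}


def run_encoder(feature: dict, inputs):
    """Look the encoder up by name and forward all remaining keys as kwargs."""
    params = {k: v for k, v in feature.items() if k not in ("name", "type", "encoder")}
    encoder = ENCODER_REGISTRY[(feature["type"], feature["encoder"])]
    return encoder(inputs, **params)


rnn_result = run_encoder({"name": "title", "type": "text", "encoder": "rnn", "bidirectional": True}, [4, 8, 15])
cnn_result = run_encoder({"name": "title", "type": "text", "encoder": "cnn"}, [4, 8, 15])
```

Both encoders share the signature `(inputs, **kwargs)`, so the caller never needs to know which hyper-parameters a given encoder understands, and adding a third encoder is a matter of decorating one more function.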

3.2 Training Pipeline

Figure 7: A depiction of the training and prediction pipeline.

Given a model definition, Ludwig builds a training pipeline as shown in the top of Figure 7. The process is not particularly different from many other machine learning tools and consists of a metadata collection phase, a data pre-processing phase, and a model training phase. The metadata mappings in particular are needed in order to apply exactly the same pre-processing to input data at prediction time, while model weights and hyper-parameters are saved in order to load the exact same model obtained during training. The main notable innovation is the fact that every single component, from the pre-processing to the model to the training loop, is dynamically built depending on the declarative model definition.

One of the main use cases of Ludwig is the quick exploration of different model alternatives for the same data, so, after pre-processing, the pre-processed data is optionally cached into an HDF5 file. The next time the same data is accessed, the HDF5 file will be used instead, saving the time needed to pre-process it.

3.3 Prediction Pipeline

The prediction pipeline is depicted in the bottom of Figure 7. It uses the metadata obtained during the training phase to pre-process the new input data, loads the model reading its hyper-parameters and weights, and uses it to obtain predictions that are mapped back in data space by a post-processor that uses the same mappings obtained at training time.
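The role of the metadata in both pipelines can be illustrated with a small sketch for a category feature. The function names and the use of JSON for persistence are assumptions made for the example; the text above does not specify how Ludwig stores its metadata.

```python
import json
from typing import List


def collect_metadata(labels: List[str]) -> dict:
    """Training time: build the mappings between raw labels and integer ids."""
    idx2str = sorted(set(labels))
    return {"idx2str": idx2str, "str2idx": {label: i for i, label in enumerate(idx2str)}}


def preprocess_label(label: str, metadata: dict) -> int:
    """Map a raw label to the integer id the model is trained on."""
    return metadata["str2idx"][label]


def postprocess_prediction(index: int, metadata: dict) -> str:
    """Prediction time: map a predicted id back into data space."""
    return metadata["idx2str"][index]


# Training phase: collect the metadata and persist it next to the model.
train_metadata = collect_metadata(["spam", "ham", "spam"])
saved = json.dumps(train_metadata)

# Prediction phase: reload the same mappings instead of recomputing them.
loaded_metadata = json.loads(saved)
encoded = preprocess_label("spam", loaded_metadata)
decoded = postprocess_prediction(0, loaded_metadata)
```

Recomputing the mappings from the prediction data instead of reloading them could assign different ids to the same labels, which is why the training-time metadata is reused.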

4 Evaluation

One of the positive effects of the ECD architecture and its implementation in Ludwig is the ability to specify a potentially complex model architecture with a concise declarative model definition. To analyze how much of an impact this has on the amount of code needed to implement a model (including pre-processing, the model itself, the training loop, and the metrics calculation), the number of lines of code required to implement four reference architectures using different libraries is compared: WordCNN Kim (2014) and Bi-LSTM Tai et al. (2015), both models for text classification and sentiment analysis; Tagger Lample et al. (2016), a sequence tagging model with an RNN encoder and a per-token classification; and ResNet He et al. (2016), an image classification model. Although this evaluation is imprecise in nature (the same model can be implemented in a more or less concise way and writing a parameter in a configuration file is substantially simpler than writing a line of code), it can provide intuition about the amount of effort needed to implement a model with different tools. To calculate the mean for different libraries, openly available implementations on GitHub are collected and the number of lines of code of each of them is counted (the list of repositories is available in the appendix). For Ludwig, the number of lines in the configuration file needed to obtain the same models is reported both in the case where no hyper-parameter is specified and in the case where all hyper-parameters are specified.

The results in Table 1 show that even when specifying all its hyper-parameters, a Ludwig declarative model configuration is between roughly 3 and 15 times shorter than the most concise alternative, and between roughly 25 and 100 times shorter when no hyper-parameter is specified. This supports the claim that Ludwig can be useful as a tool to reduce the effort needed for training and using deep learning models.

Model     TensorFlow (mean)   Keras (mean)   PyTorch (mean)   Ludwig (w/o)   Ludwig (w)
WordCNN   406.17              201.50         458.75           8              66
Bi-LSTM   416.75              439.75         323.40           10             68
Tagger    1067.00             1039.25        1968.00          10             68
ResNet    1252.75             779.60         479.43           9              61

Table 1: Number of lines of code for implementing different models. The mean columns report the mean number of lines of code needed to write a program from scratch for the task. The w and w/o Ludwig columns report the number of lines needed to write a model definition with every model hyper-parameter and pre-processing parameter specified, and without specifying any hyper-parameter, respectively.
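The size of the reduction can be read directly off Table 1 by dividing the shortest library implementation of each model by the length of the corresponding Ludwig definition. The short script below does that arithmetic; the numbers are copied from the table, while the dictionary layout and function name are just one convenient way to organize them.

```python
TABLE_1 = {
    "WordCNN": {"TensorFlow": 406.17, "Keras": 201.50, "PyTorch": 458.75, "w/o": 8, "w": 66},
    "Bi-LSTM": {"TensorFlow": 416.75, "Keras": 439.75, "PyTorch": 323.40, "w/o": 10, "w": 68},
    "Tagger": {"TensorFlow": 1067.00, "Keras": 1039.25, "PyTorch": 1968.00, "w/o": 10, "w": 68},
    "ResNet": {"TensorFlow": 1252.75, "Keras": 779.60, "PyTorch": 479.43, "w/o": 9, "w": 61},
}
LIBRARIES = ("TensorFlow", "Keras", "PyTorch")


def reduction_factors(row: dict) -> tuple:
    """Ratio of the most concise library mean to each Ludwig definition length."""
    shortest = min(row[library] for library in LIBRARIES)
    return shortest / row["w/o"], shortest / row["w"]


factors = {model: reduction_factors(row) for model, row in TABLE_1.items()}

for model, (without_params, with_params) in factors.items():
    print(f"{model}: {without_params:.1f}x shorter without hyper-parameters, "
          f"{with_params:.1f}x shorter with all of them")
```

With all hyper-parameters specified, the ratio ranges from about 3 (WordCNN against Keras) to about 15 (Tagger against Keras); with defaults only, it ranges from about 25 to about 104.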

5 Limitations and future work

Although Ludwig’s ECD architecture is particularly well-suited for supervised and self-supervised tasks, how suitable it is for other machine learning tasks is not immediately evident.

One notable example of such tasks is Generative Adversarial Networks (GANs) Goodfellow et al. (2014): their architecture contains two models that learn to generate data and discriminate synthetic from real data and are trained with inverted losses. In order to replicate a GAN within the boundaries of the ECD architecture, the inputs to both models would have to be defined at the encoder level, the discriminator output would have to be defined as a decoder, and the remaining parts of both models would have to be defined as one big combiner, which is inelegant; for instance, changing just the generator would result in an entirely new implementation. An elegant solution would allow for disentangling the two models and changing them independently. The recursive graph extension of the combiner described in section 2.2 allows a more elegant solution by providing a mechanism for defining the generator and discriminator as two independent sub-graphs, improving modularity and extensibility. An extension of the toolbox in this direction is planned in the future.

Another example is reinforcement learning. Although ECD can be used to build the vast majority of deep architectures currently adopted in reinforcement learning, some of the techniques they employ are relatively hard to represent, such as, for instance, double inference with fixed weights in Deep Q-Networks Mnih et al. (2015), which can currently be implemented only with a highly customized and inelegant combiner. Moreover, supporting the dynamic interaction with an environment for data collection, and more clever ways to collect it like Go-Explore’s Ecoffet et al. (2019) archive or prioritized experience replay Schaul et al. (2016), is currently out of the scope of the toolbox: a user would have to build these capabilities on their own and call Ludwig functions only inside the inner loop of the interaction with the environment. Extending the toolbox to allow for easy adoption in reinforcement learning scenarios, for example by allowing training through policy gradient methods like REINFORCE Williams (1992) or off-policy methods, is a potential direction of improvement.
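
The double inference with fixed weights mentioned above can be illustrated with a toy example. The sketch assumes a linear Q-function over scalar states; `LinearQ`, `td_error` and `sync` are hypothetical names chosen for this illustration and are not Ludwig functions.

```python
class LinearQ:
    """Toy Q-function over scalar states: Q(s, a) = weights[a] * s."""

    def __init__(self, weights):
        self.weights = list(weights)

    def q_values(self, state):
        return [w * state for w in self.weights]


def td_target(target_net, reward, next_state, gamma, done):
    """Bootstrap target computed with the frozen copy of the network."""
    if done:
        return reward
    return reward + gamma * max(target_net.q_values(next_state))


def td_error(online_net, target_net, state, action, reward, next_state,
             gamma=0.99, done=False):
    """Temporal-difference error for one transition."""
    # First inference pass: trainable weights, current state.
    prediction = online_net.q_values(state)[action]
    # Second inference pass: fixed weights, next state.
    target = td_target(target_net, reward, next_state, gamma, done)
    return target - prediction


def sync(online_net, target_net):
    """Copy the trainable weights into the frozen network."""
    target_net.weights = list(online_net.weights)


if __name__ == "__main__":
    online = LinearQ([1.0, 2.0])
    target = LinearQ([0.5, 0.5])
    print(td_error(online, target, state=1.0, action=1, reward=1.0,
                   next_state=2.0, gamma=0.5))
```

The same architecture is evaluated twice per training step with two different sets of weights, and only one set receives gradient updates; the frozen set is refreshed only when `sync` is called.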

Although these two cases highlight current limitations of Ludwig, it is worth noting that most of the current industrial applications of machine learning are based on supervised learning, and that is where the proposed architecture fits best and the toolbox provides most of its value.

Although the declarative nature of Ludwig’s model definition allows for easier model development, as the number of encoders and their hyper-parameters increase, the need for automatic hyper-parameter optimization arises. In Ludwig, however, different encoders and decoders, i.e., sub-modules of the whole architecture, are themselves hyper-parameters. For this reason, Ludwig is well-suited for performing both hyper-parameter search and architecture search, and blurs the line between the two.

A future addition to the model definition file will be a hyper-parameter search section that will allow users to choose which of the available strategies to adopt for the optimization; if the optimization process itself has parameters, users will be able to provide them in this section as well. Currently, an approach based on Bayesian optimization over combinatorial structures Baptista and Poloczek (2018) is in development, but more can be added.
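
The idea that the choice of encoder is itself a hyper-parameter can be sketched as follows. The search space below is hypothetical, since the paper does not specify the format of the planned section, and plain exhaustive grid enumeration stands in for the Bayesian strategy mentioned above. Feature names and parameter values are illustrative assumptions.

```python
import itertools

# Hypothetical search space: the keys and values are illustrative only.
# Note that the encoder type sits next to ordinary numeric hyper-parameters.
SEARCH_SPACE = {
    "encoder": ["parallel_cnn", "rnn"],
    "embedding_size": [128, 256],
    "learning_rate": [0.001, 0.0001],
}


def candidate_definitions(space):
    """Yield one model definition (as a dict) per point of the grid."""
    keys = sorted(space)
    for values in itertools.product(*(space[key] for key in keys)):
        choice = dict(zip(keys, values))
        yield {
            "input_features": [{
                "name": "text",
                "type": "text",
                "encoder": choice["encoder"],
                "embedding_size": choice["embedding_size"],
            }],
            "output_features": [{"name": "class", "type": "category"}],
            "training": {"learning_rate": choice["learning_rate"]},
        }


if __name__ == "__main__":
    for definition in candidate_definitions(SEARCH_SPACE):
        print(definition)
```

Since every candidate is just a different declarative definition, swapping a sub-module and changing a learning rate are handled by the same mechanism.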

Finally, more feature types will be added in the future, in particular videos and graphs, together with a number of pre-trained encoders and decoders, which will allow training of a full model in a few iterations.

6 Related Work

TensorFlow Abadi et al. (2015), Caffe Jia et al. (2014), Theano Theano Development Team (2016) and other similar libraries are tensor computation frameworks that allow for automatic differentiation and declarative model definition through a computation graph. They all provide similar underlying primitives and support computation graph optimizations that allow for training of large-scale deep neural networks. PyTorch Paszke et al. (2017), on the other hand, provides the same level of abstraction, but allows users to define models imperatively: this has the advantage of making a PyTorch program easier to debug and to inspect. By adding eager execution, TensorFlow 2.0 allows for both declarative and imperative programming styles. In contrast, Ludwig, which is built on top of TensorFlow, provides a higher level of abstraction for the user. Users can declare full model architectures rather than underlying tensor operations, which allows for more concise model definitions, while flexibility is ensured by allowing users to change each parameter of each component of the architecture if they wish to.
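
The difference between the two programming styles can be illustrated with a toy example in plain Python. The `Node` class below is a deliberately minimal stand-in for a computation graph and does not reproduce the API of any of the libraries discussed.

```python
import operator


class Node:
    """A node of a deferred computation graph."""

    def __init__(self, fn=None, inputs=(), name=None):
        self.fn = fn          # operation to apply, or None for a placeholder
        self.inputs = inputs  # upstream nodes
        self.name = name      # placeholder name, used to look up fed values

    def __add__(self, other):
        return Node(operator.add, (self, other))

    def __mul__(self, other):
        return Node(operator.mul, (self, other))

    def run(self, feed):
        """Evaluate the graph, reading placeholder values from `feed`."""
        if self.fn is None:
            return feed[self.name]
        return self.fn(*(node.run(feed) for node in self.inputs))


def placeholder(name):
    return Node(name=name)


# Declarative style: first describe the computation, then execute it.
x, w, b = placeholder("x"), placeholder("w"), placeholder("b")
graph = x * w + b  # nothing is computed yet
declarative_result = graph.run({"x": 2.0, "w": 3.0, "b": 1.0})


# Imperative style: every statement computes a value immediately.
def forward(x, w, b):
    return x * w + b


imperative_result = forward(2.0, 3.0, 1.0)

if __name__ == "__main__":
    print(declarative_result, imperative_result)
```

In the declarative version the whole graph exists as a data structure before execution, which is what makes graph-level optimization possible; in the imperative version intermediate values can be inspected at any statement, which is what makes debugging easier.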

Sonnet Reynolds et al. (2017), Keras Chollet and others (2015), and AllenNLP Gardner et al. (2017) are similar to Ludwig in the sense that they provide a higher level of abstraction over TensorFlow and PyTorch primitives. However, while they provide modules which can be used to build a desired network architecture, what distinguishes Ludwig from them is its declarative nature and being built around data type abstraction. This allows for the flexible ECD architecture that can cover many use cases beyond the natural language processing covered by AllenNLP, and also does not require writing code for both model implementation and pre-processing as in Sonnet and Keras.

Scikit-learn Buitinck et al. (2013), Weka Hall et al. (2009), and MLLib Meng et al. (2016) are popular machine learning libraries among researchers and industry practitioners. They contain implementations of several different traditional machine learning algorithms and provide common interfaces for them, so that algorithms become in most cases interchangeable and users can easily compare them. Ludwig follows this API design philosophy in its programmatic interface, but focuses on deep learning models that are not available in those tools.

7 Conclusions

This work presented Ludwig, a deep learning toolbox built around type-based abstraction and a flexible ECD architecture that allows model definition through a declarative language.

The proposed tool has many advantages in terms of flexibility, extensibility, and ease of use. These allow both experts and novices to train deep learning models, employ them for obtaining predictions, and experiment with different architectures without the need to write code, while still allowing users to easily add custom sub-modules.

In conclusion, Ludwig’s general and flexible architecture and its ease of use make it a good option for democratizing deep learning by making it more accessible, streamlining and speeding up experimentation, and unlocking many new applications.


  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from External Links: Link Cited by: §1, §6.
  • R. Baptista and M. Poloczek (2018) Bayesian optimization of combinatorial structures. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 471–480. External Links: Link Cited by: §5.
  • I. Brodsky et al. (2018) H3. GitHub. Note: Cited by: §2.1.
  • L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux (2013) API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122. Cited by: §6.
  • R. Caruana (1993) Multitask learning: A knowledge-based source of inductive bias. In Machine Learning, Proceedings of the Tenth International Conference, University of Massachusetts, Amherst, MA, USA, June 27-29, 1993, P. E. Utgoff (Ed.), pp. 41–48. External Links: Link, Document Cited by: §2.2.
  • T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang (2015) Mxnet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274. Cited by: §1.
  • F. Chollet et al. (2015) Keras. Note: Cited by: §1, §6.
  • A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune (2019) Go-explore: a new approach for hard-exploration problems. CoRR abs/1901.10995. External Links: Link, 1901.10995 Cited by: §5.
  • E. Gamma, R. Helm, R. Johnson, and J. Vlissides (1994) Design patterns: elements of reusable object-oriented software. Addison-Wesley Professional Computing Series, Pearson Education. External Links: ISBN 9780321700698, Link Cited by: §3.1.
  • M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. S. Zettlemoyer (2017) AllenNLP: a deep semantic natural language processing platform. External Links: arXiv:1803.07640 Cited by: §6.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §5.
  • M. A. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten (2009) The WEKA data mining software: an update. SIGKDD Explorations 11 (1), pp. 10–18. External Links: Link, Document Cited by: §6.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §4.
  • G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell (2014) Caffe: convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 675–678. Cited by: §1, §6.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, A. Moschitti, B. Pang, and W. Daelemans (Eds.), pp. 1746–1751. External Links: Link Cited by: §4.
  • G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, K. Knight, A. Nenkova, and O. Rambow (Eds.), pp. 260–270. External Links: Link Cited by: §4.
  • X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, et al. (2016) Mllib: machine learning in apache spark. The Journal of Machine Learning Research 17 (1), pp. 1235–1241. Cited by: §6.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. External Links: Link, Document Cited by: §5.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §1, §6.
  • A. Ratner, B. Hancock, J. Dunnmon, R. E. Goldman, and C. Ré (2018) Snorkel metal: weak supervision for multi-task learning. In Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, DEEM@SIGMOD 2018, Houston, TX, USA, June 15, 2018, S. Schelter, S. Seufert, and A. Kumar (Eds.), pp. 3:1–3:4. External Links: Link, Document Cited by: §2.2.
  • A. Ratner, B. Hancock, J. Dunnmon, F. Sala, S. Pandey, and C. Ré (2019a) Training complex models with multi-task weak supervision. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 4763–4771. External Links: Link Cited by: §2.2.
  • A. J. Ratner, B. Hancock, and C. Ré (2019b) The role of massively multi-task and weak supervision in software 2.0. In CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings, External Links: Link Cited by: §2.2.
  • M. Reynolds, G. Barth-Maron, F. Besse, D. de Las Casas, A. Fidjeland, T. Green, A. Puigdomènech, S. Racanière, J. Rae, and F. Viola (2017) Open sourcing Sonnet - a new library for constructing neural networks. Note: Cited by: §6.
  • T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2016) Prioritized experience replay. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §5.
  • F. Seide and A. Agarwal (2016) CNTK: microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2135–2135. Cited by: §1.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §1.
  • K. S. Tai, R. Socher, and C. D. Manning (2015) Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pp. 1556–1566. External Links: Link Cited by: §4.
  • Theano Development Team (2016) Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints abs/1605.02688. External Links: Link Cited by: §1, §6.
  • S. Tokui, K. Oono, S. Hido, and J. Clayton (2015) Chainer: a next-generation open source framework for deep learning. In Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual conference on neural information processing systems (NIPS), Vol. 5, pp. 1–6. Cited by: §1.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, pp. 229–256. External Links: Link, Document Cited by: §5.

Appendix A Full list of GitHub repositories

repository   loc    model     notes
             308    WordCNN
             621    WordCNN
             284    WordCNN
             335    WordCNN   + files in utils directory
             405    WordCNN   + +
             484    WordCNN   all files minus the rnn related ones
             305    Bi-LSTM
             271    Bi-LSTM
             397    Bi-LSTM   + files utils directory
             459    Bi-LSTM   + +
             506    Bi-LSTM   all files minus the cnn related ones
             2243   ResNet
             635    ResNet
             472    ResNet
             1661   ResNet
             959    Tagger
             1877   Tagger
             365    Tagger
Table 2: List of TensorFlow repositories used for the evaluation.
repository   loc    model     notes
             228    WordCNN
             117    WordCNN
             258    WordCNN
             295    WordCNN
             122    WordCNN
             189    WordCNN
             425    Bi-LSTM
             678    Bi-LSTM
             547    Bi-LSTM
             109    Bi-LSTM
             292    ResNet
             297    ResNet    Only model, no preprocessing
             2285   ResNet
             560    ResNet
             464    ResNet    Only model, no preprocessing
             2057   Tagger
             150    Tagger
             501    Tagger
             1449   Tagger
Table 3: List of Keras repositories used for the evaluation.
repository   loc    model     notes
             311    WordCNN
             247    WordCNN
             778    WordCNN   ignored
             499    WordCNN
             414    Bi-LSTM
             421    Bi-LSTM
             324    Bi-LSTM
             188    Bi-LSTM
             270    Bi-LSTM
             447    ResNet
             286    ResNet
             535    ResNet
             1095   ResNet
             199    ResNet
             450    ResNet
             344    ResNet
             1184   Tagger
             840    Tagger
             3243   Tagger
             2605   Tagger
Table 4: List of PyTorch repositories used for the evaluation.
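
As a worked example of how the per-repository counts relate to the summary figures, the sketch below recomputes the Keras column of Table 1 from the line counts listed in Table 3. The counts are copied from Table 3; the variable and function names are illustrative only.

```python
from statistics import mean

# Lines of code per repository, copied from Table 3 (Keras).
KERAS_LOC = {
    "WordCNN": [228, 117, 258, 295, 122, 189],
    "Bi-LSTM": [425, 678, 547, 109],
    "ResNet": [292, 297, 2285, 560, 464],
    "Tagger": [2057, 150, 501, 1449],
}


def mean_loc(rows):
    """Return the mean line count per model, rounded to two decimals."""
    return {model: round(mean(counts), 2) for model, counts in rows.items()}


if __name__ == "__main__":
    for model, value in mean_loc(KERAS_LOC).items():
        print(f"{model}: {value:.2f}")
```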