ModelHub: Towards Unified Data and Lifecycle Management for Deep Learning

by   Hui Miao, et al.

Deep learning has improved state-of-the-art results in many important fields, and has been the subject of much research in recent years, leading to the development of several systems for facilitating deep learning. Current systems, however, mainly focus on model building and training phases, while the issues of data management, model sharing, and lifecycle management are largely ignored. The deep learning modeling lifecycle generates a rich set of data artifacts, such as learned parameters and training logs, and comprises several frequently conducted tasks, e.g., to understand the model behaviors and to try out new models. Dealing with such artifacts and tasks is cumbersome and largely left to the users. This paper describes our vision and implementation of a data and lifecycle management system for deep learning. First, we generalize model exploration and model enumeration queries from commonly conducted tasks by deep learning modelers, and propose a high-level domain specific language (DSL), inspired by SQL, to raise the abstraction level and accelerate the modeling process. Second, to manage the data artifacts, especially the large amount of checkpointed float parameters, we design a novel model versioning system (dlv), and a read-optimized parameter archival storage system (PAS) that minimizes storage footprint and accelerates query workloads without losing accuracy. PAS archives versioned models using deltas in a multi-resolution fashion by separately storing the less significant bits, and features a novel progressive query (inference) evaluation algorithm. Third, we show that archiving versioned models using deltas poses a new dataset versioning problem and we develop efficient algorithms for solving it. We conduct extensive experiments over several real datasets from the computer vision domain to show the efficiency of the proposed techniques.




I Introduction


  • Deep learning is an important method in analytics.

  • Deep learning anatomy: a DAG structured mapping function. node: templated_neuron, edge: connection. a DNN program specifies the DAG, input data, and a set of hyper-parameters.

  • Deep learning is end-to-end learning, features are learned directly from data, leading to unique human-in-the-loop life-cycle.

  • Lifecycle properties:

    1. Analysis (description + comparison) of the trained models.

    2. Heuristic-driven enumerations of the models.

    3. A rich set of artifacts with lineages.

  • Lifecycle disadvantages under current systems:

    1. repetitive, time-consuming human effort spent on description and enumeration via imperative programs;

    2. heavy turnover and large footprint: long training times, infrequent use of older models; a trade-off the user needs to decide: save more models for reuse, or delete and retrain later.

    3. redundancy across large models: similar models exist, and not all models are needed with the same frequency and accuracy.

    4. complicated and incoherent setups: hard to share, reuse, reproduce others’ models.

  • So we propose the ModelHub system to tackle these disadvantages:

    1. Increase abstraction level and lead to optimization:

      1. we propose a VCS-based system that uses friendly query templates to capture the processes in the modeling lifecycle.

      2. we propose a DQL language to help the user enumerate models and encode model selection criteria.

    2. Treat weight matrices as first class data type:

      1. we identify a matrix comparison operator: alignment-based matrix reordering, which is useful for comparison queries and model storage.

      2. we explore the problem of storing many models which have both highly structured information and high entropy float matrices with low precision tolerance.

      3. we develop model queries that run directly on compressed storage, with a guarantee of no errors and faster query processing. This eases the trade-off between model storage and usage: we can compress all models into less storage, and uncompress in a pay-as-you-go style to ensure no errors.

    3. the technique proposed extends naturally to a collaborative format and environment which has rich lineages and enables sharing, reusing, reproducing models.

  • Contributions:

    1. we are the first to address model lifecycle management issues in deep learning.

    2. we are the first to propose full-lifecycle declarative constructs for deep learning modeling, and we show their implementations.

    3. we treat deep learning weight matrices as a first-class data type; we formulate the storage problem for a large set of float parameter matrices under deep-learning-lifecycle-specific query workloads, and explore the trade-off between low precision tolerance and accuracy when storing this data type.

    4. using bytewise compression, we present a storage framework and a model evaluation query acceleration technique with guarantees of no errors.

    5. on the weight matrix data type, we identify a new alignment-based matrix reordering problem, which appears in deep learning model comparison and storage. We show the problem is NP-hard and propose efficient greedy algorithms.

  • Our results show the techniques we propose are useful on real-life models, and perform well on synthetic models.

  • Outline: Sec2: Preliminary, Sec3: System Overview, Sec4: Optimizations for Weight Matrices, Sec. 4.1 Alignment Operator. Sec 4.2: Storage. Sec 5. Exp. Sec 6. Related work. Sec 7. Conclusion.

Deep learning models, also called deep neural networks (DNNs), have dramatically improved the state-of-the-art results for many important reasoning and learning tasks including speech recognition, object recognition, and natural language processing in recent years.


Learned using massive amounts of training data, DNN models have superior generalization capabilities, and the intermediate layers in many deep learning models have been proven useful in providing effective semantic features that can be used with other learning techniques and are applicable to other problems. However, there are many critical large-scale data management issues in learning, storing, sharing, and using deep learning models, which are largely ignored by researchers today, but are coming to the forefront with the increased use of deep learning in a variety of domains. In this paper, we discuss some of those challenges in the context of the modeling lifecycle, and propose a comprehensive system to address them. Given the large scale of data involved (both training data and the learned models themselves) and the increasing need for high-level declarative abstractions, we argue that database researchers should play a much larger role in this area. Although this paper primarily focuses on deep neural networks, similar data management challenges are seen in lifecycle management of other types of ML models like logistic regression, matrix factorization, etc.

DNN Modeling Lifecycle and Challenges: Compared with the traditional approach of feature engineering followed by model training [2], deep learning is an end-to-end learning approach, i.e., the features are not given by a human but are learned in an automatic manner from the input data. Moreover, the features are complex and have a hierarchy along with the network representation. This requires less domain expertise and experience from the modeler, but understanding and explaining the learned models is difficult; why even well-studied models work so well is still poorly understood in theory and under active research. Thus, when developing new models, changing the learned model (especially its network structure and hyper-parameters) becomes an empirical search task.

Fig. 1: Deep Learning Modeling Lifecycle

In Fig. 1, we show a typical deep learning modeling lifecycle (we present an overview of deep neural networks in the next section). Given a prediction task, a modeler often starts from well-known models that have been successful in similar task domains; she then specifies input training data and output loss functions, and repeatedly adjusts the network operators and connections like Lego bricks, tunes model hyper-parameters, trains and evaluates the model, and repeats this loop until prediction accuracy does not improve. Due to a lack of understanding about why models work, the adjustments and tuning inside the loop are driven by heuristics, e.g., adjusting hyper-parameters that appear to have a significant impact on the learned weights, applying novel layers or tricks seen in recent empirical studies, and so on. Thus, many similar models are trained and compared, and a series of model variants needs to be explored and developed. Due to the expensive learning/training phase, each iteration of the modeling loop takes a long period of time and produces many (checkpointed) snapshots of the model. As we noted above, this is a common workflow across many other ML models as well.
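The loop just described can be caricatured in a few lines of Python; `build_model` and `train` below are hypothetical stand-ins for calls into a training system such as Caffe or TensorFlow, not ModelHub APIs:

```python
# Hypothetical sketch of the manual update-train-evaluate loop; `train`
# stands in for an expensive training run in an external system.

def build_model(base_config, adjustment):
    cfg = dict(base_config)
    cfg.update(adjustment)           # e.g. tweak a layer or a hyper-parameter
    return cfg

def train(cfg):
    # placeholder objective: pretend smaller learning rates do better here
    return 1.0 - cfg["learning_rate"]

def modeling_loop(base_config, adjustments):
    best_cfg, best_acc, history = None, -1.0, []
    for adj in adjustments:          # heuristic-driven enumeration
        cfg = build_model(base_config, adj)
        acc = train(cfg)             # long-running; snapshots are checkpointed
        history.append((cfg, acc))   # artifacts the modeler must track by hand
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc, history

best, acc, hist = modeling_loop(
    {"learning_rate": 0.1},
    [{"learning_rate": 0.1}, {"learning_rate": 0.01}],
)
```

Every tuple appended to `history` is exactly the kind of artifact that, without system support, ends up in ad hoc spreadsheets and folders.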

Current systems (Caffe [3], Theano, Torch, TensorFlow [4], etc.) mainly focus on model building and training phases, while the issues of data management, model sharing, and lifecycle management are largely ignored. Modelers are required to write external imperative scripts, edit configurations by hand, and manually maintain a manifest of model variations that have been tried out; not only are these tasks irrelevant to the modeling objective, but they are also challenging and nontrivial due to the complexity of the model as well as the large footprints of the learned models. More specifically, the tasks and data artifacts in the modeling lifecycle expose several systems and data management challenges, which include:

  • Understanding & Comparing Models: It is difficult to keep track of the many models developed and/or understand the differences amongst them. Differences among both the metadata about the model (training sample, hyperparameters, network structure, etc.), as well as the actual learned parameters, are of interest. It is common to see a modeler write all model configurations in an experiment spreadsheet to keep track of temporary folders of input data, setup scripts, snapshots, and logs, which is not only a cumbersome but also an error-prone process. Though it is easy to view the measurements, understanding the differences between models is less principled and requires a lot more work.

  • Repetitive Adjusting of Models: The development lifecycle itself has time-consuming repetitive sub-steps, such as adding a layer at different places to adjust a model, searching through a set of hyper-parameters for the different variations, reusing learned weights to train models, etc., which currently have to be performed manually.

  • Model Versioning: Similar models are possibly trained and run multiple times, reusing others’ weights as initialization, either because of a changed input or discovery of an error. There is thus a need to keep track of multiple model versions and their relationships over time, although the utilities of different models are very different.

  • Parameter Archiving: The storage footprint of deep learning models tends to be very large. Recent top-ranked models in the ImageNet task have billions of floating-point parameters and require hundreds of MBs to store one snapshot during training. Due to resource constraints, the modeler has to limit the number of snapshots, or even drop all snapshots of a model at the cost of retraining when needed.

  • Reasoning about Model Results: Another key data artifact that often needs to be reasoned about is the results of running a learned model on the training or testing dataset. By comparing the results across different models, a modeler can get insights into difficult training examples or understand correlations between specific adjustments and the performance.

In addition, although not a focus of this paper, sharing and reusing models is not easy, especially because of the large model sizes, the specialized tools used for learning, and the modeler-generated scripts in the lifecycle.

ModelHub: In this paper, we propose the ModelHub system to address these challenges. The ModelHub system is not meant to replace popular training-focused deep learning systems, but rather is designed to be used with them to accelerate modeling tasks and manage the rich set of lifecycle artifacts. It consists of three key components: (a) a model versioning system (dlv) to store, query and aid in understanding the models and their versions, (b) a model network adjustment and hyper-parameter tuning domain specific language (DQL) to serve as an abstraction layer to help modelers focus on the creation of the models instead of repetitive steps in the lifecycle, and (c) a hosted deep learning model sharing system (the ModelHub service) to exchange dlv repositories and enable publishing, discovering and reusing models from others.

The key features and innovative design highlights of ModelHub are: (a) We use a git-like VCS interface as a familiar user interface to let the modeler manage and explore the created models in a repository, and an SQL-like model enumeration DSL to aid modelers in making and examining multiple model adjustments easily. (b) Because model comparison is less principled today, we propose two new model understanding and comparison schemes. (c) Behind the declarative constructs, ModelHub manages different artifacts in a split back-end storage: structured data, such as network structure, training logs of a model, lineages of different model versions, and output results, are stored in a relational database, while learned floating-point parameters of a model are viewed as a set of float matrices and managed in a read-optimized archival storage (PAS). (d) Parameters dominate the storage footprint, and floats are well known to be difficult to compress. We study the PAS implementation thoroughly under the context of the ModelHub query workload and advocate a segmented approach to store the learned parameters, where the low-order bytes are stored independently of the high-order bytes. We also develop novel model evaluation schemes that use the high-order bytes alone and progressively uncompress the less significant chunks if needed to ensure the correctness of an inference query. (e) Due to the different utility of developed models, archiving versioned models using parameter matrix deltas exhibits a new type of dataset versioning problem which not only optimizes the storage/access tradeoff but also has model-level constraints. (f) Finally, the VCS model repository design extends naturally to a collaborative format and online system which contain rich model lineages and enable sharing, reusing, and reproducing models that are compatible across training systems.
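The segmented, bytewise storage idea described above can be illustrated with a small numpy sketch (our own illustration, not PAS code): float32 parameters are split into four byte planes, so the high-order bytes can be read without touching the low-order ones.

```python
import numpy as np

def split_bytes(weights):
    """Split a float32 array into per-byte planes, most significant first."""
    raw = weights.astype('<f4').view(np.uint8).reshape(-1, 4)
    # little-endian layout: byte 3 carries the sign and exponent bits
    return [raw[:, i].copy() for i in (3, 2, 1, 0)]

def merge_bytes(planes):
    """Reassemble the original float32 array from the byte planes."""
    raw = np.stack([planes[3], planes[2], planes[1], planes[0]], axis=1)
    return raw.reshape(-1).view('<f4')

w = np.array([0.5, -1.25, 3.0], dtype=np.float32)
planes = split_bytes(w)        # four compressible uint8 streams
restored = merge_bytes(planes)  # lossless round trip
```

Because the high-order planes have much lower entropy than the low-order mantissa bytes, they compress far better and can serve the bulk of query traffic on their own.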

Contributions: Our key research contributions are:

  • We propose the first comprehensive DNN lifecycle management system, study its design requirements, and propose declarative constructs (dlv and DQL) to provide high-level abstractions.

  • We propose two new model understanding and comparison schemes, and study a new matrix reordering problem (for matrix alignment). We analyze its complexity and propose greedy algorithms for solving it.

  • We develop PAS, a read-optimized archival storage system for dealing with a large collection of versioned float matrices.

  • We formulate a new dataset versioning problem with co-usage constraints, analyze its complexity, and design efficient algorithms for solving it.

  • We develop a progressive, approximate query evaluation scheme that avoids reading low-order bytes of the parameter matrices unless necessary.

  • We present a comprehensive evaluation of ModelHub that shows the proposed techniques are useful on real-life models, and scale well on synthetic models.
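The progressive evaluation contribution can be made concrete with a small sketch (entirely illustrative: the weights and error bound are made up, and the actual decompression logic is more involved): classify with truncated-precision weights first, and fall back to full precision only when the winning margin could be overturned by the truncation error.

```python
import numpy as np

def truncate(weights, keep_bytes=2):
    """Zero out the (4 - keep_bytes) low-order bytes of float32 weights."""
    raw = weights.astype('<f4').view(np.uint8).reshape(-1, 4).copy()
    raw[:, :4 - keep_bytes] = 0            # little-endian: low bytes come first
    return raw.reshape(-1).view('<f4').reshape(weights.shape)

def progressive_argmax(W, x, err_bound):
    """Return (label, refined): refined is True iff full precision was read."""
    scores = truncate(W) @ x
    top2 = np.sort(scores)[-2:]
    if top2[1] - top2[0] > 2 * err_bound:  # margin exceeds worst-case error
        return int(np.argmax(scores)), False   # low-order bytes never touched
    return int(np.argmax(W @ x)), True         # uncompress the rest and redo

W = np.array([[0.9, 0.1], [0.1, 0.8]], dtype=np.float32)
x = np.array([1.0, 0.0], dtype=np.float32)
label, refined = progressive_argmax(W, x, err_bound=0.01)
```

Here the margin between the two class scores is large, so the query is answered from the high-order bytes alone.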

Outline: In Section II, we provide background on related topics in the DNN modeling lifecycle. In Section III, we present an overview of ModelHub, and discuss the declarative interfaces. We present our model comparison techniques in Section IV, describe the parameter archival store (PAS) in Section V, present an experimental evaluation in Section VI, and discuss closely related work in Section VII.

(a) Anatomy of a DNN Model (LeNet)
(b) Relationships of Model Versions and Weight Snapshots
Fig. 2: Illustration of Weight Parameter Artifacts in a Modeling Process
Fig. 3: Anatomy of a DNN Model (LeNet)

II Background on DNN Lifecycle

To support our design decisions, we present an overview of the artifacts and common task practices in the DNN modeling lifecycle. We also examine the dataset versioning problem from recent database community research and point out its inefficiencies for DNN lifecycle management.

Deep Neural Networks: We begin with a brief, simplified overview. A deep learning model is a deep neural network (DNN) consisting of many layers having nonlinear activation functions that are capable of representing complex transformations between input data and desired outputs. Let $\mathcal{D}$ denote a data domain and $\mathcal{L}$ denote a prediction label domain (e.g., $\mathcal{D}$ may be a set of images; $\mathcal{L}$ may be the names of the set of objects we wish to recognize, i.e., labels). As with any prediction model, a DNN is essentially a mapping function $f: \mathcal{D} \rightarrow \mathcal{L}$ that minimizes a certain loss function $\ell$, and is of the following form:

$$f(x) = s_n(W_n \cdot s_{n-1}(\cdots s_1(W_1 \cdot x + b_1) \cdots) + b_n)$$

Here $n$ denotes the layer number, $(W_i, b_i)$ are learnable weight and bias parameters in layer $i$, and $s_i$ is an activation function that non-linearly transforms the result of the previous layer (common activation functions include sigmoid, ReLU, etc.). Given a learned model and an input $x \in \mathcal{D}$, applying the layer functions in order gives us the prediction label for that input data. In the training phase, the model parameters are learned by minimizing $\ell$, typically done through iterative methods, such as stochastic gradient descent.
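The nested form above can be sketched as a short forward pass; the layer weights, biases, and the ReLU activation below are illustrative values, not from any trained model.

```python
import numpy as np

def relu(x):
    """A common activation function: s(x) = max(x, 0)."""
    return np.maximum(x, 0.0)

def forward(x, layers):
    """Apply layers in order; each layer is a (W, b, activation) triple."""
    for W, b, s in layers:
        x = s(W @ x + b)    # s_i(W_i . x + b_i), as in the formula above
    return x

layers = [
    (np.array([[1.0, -1.0], [0.5, 0.5]]), np.zeros(2), relu),
    (np.array([[0.0, 2.0]]),              np.zeros(1), relu),
]
y = forward(np.array([1.0, 2.0]), layers)
```

In a real DNN the `W` matrices are exactly the large float artifacts whose storage and versioning this paper addresses.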

Fig. 3 shows a classic convolutional DNN, LeNet. LeNet was proposed to solve a prediction task from handwritten images to digit labels [6]. In the figure, a cube represents an intermediate tensor, while the dotted lines are unit transformations between tensors. More formally, a layer, $f_i(x_{i-1}; P_i, H_i)$, is a function which defines a data transformation from tensor $x_{i-1}$ to tensor $x_i$. $P_i$ are the parameters which are learned from the data, and $H_i$ are the hyperparameters which are given beforehand. A layer is non-parametric if $P_i = \emptyset$.

In the computer vision community, the layers defining transformations are considered building blocks of a DNN model, and are referred to using conventional names, such as full layer, convolution layer, pool layer, normalization layer, etc. The chain of layers is often called the network architecture. The LeNet architecture has two convolution layers, each followed by a pool layer, and two full layers, shown with layer shapes and hyperparameters in Fig. 3. Moreover, winning models in recent ILSVRC (ImageNet Large Scale Visual Recognition Challenge) competitions are shown in Table I, with their architectures described by a composition of common layers in regular expression syntax to illustrate the similarities (note that the activation functions and detailed connections are omitted). As one can see, common layers are the building blocks of DNN models.

DNN models are learned from massive data based on some architecture, and modern successful computer vision DNN architectures consist of a large number of float weight parameters, as shown in Table I, resulting in large storage footprints (GBs) and long training times (often weeks). Furthermore, the training process is often checkpointed and variations of models need to be explored, leading to many model copies.

Modeling Data Artifacts: Unlike many other prediction methods, DNN modeling results in a very large number of weight parameters, a rich set of hyperparameters, and learning measurements, which are used in unique ways in practice, resulting in a mixture of structured data, files, and binary floating-point artifacts:

  • Non-convexity & Hyperparameters: A DNN model is typically non-convex, and a learned model is a local optimum of the underlying loss-minimization problem. The optimization procedure employs many tricks to reach a solution quickly [5]. The set of hyperparameters (e.g., learning rate, momentum) w.r.t. the optimization algorithm needs to be maintained.

  • Iterations & Measurements: Models are trained iteratively and checkpointed periodically due to the long running times. A set of learning measurements are collected in various logs, including objective loss values and accuracy scores.

  • Fine-tuning & Snapshots: Well-known models are often learned from massive real-world data (ImageNet), and require large amounts of resources to train; when prediction tasks do not vary much (e.g., animal recognition vs dog recognition), the model parameters are reused as initializations and adjusted using new data; this is often referred to as fine-tuning. On the other hand, not all snapshots can be simply deleted, as the convergence is not monotonic.

  • Provenance & Arbitrary Files: Alternate ways to construct architectures or to set hyperparameters lead to human-in-the-loop model adjustments. Initialization, preprocessing schemes, and hand-crafted scripts are crucial provenance information to explore models and reproduce results.

Network Architecture (in regular expression) (flops)
LeNet [6]
AlexNet [7]
VGG [8]
ResNet [9]
TABLE I: Popular CNN Models for Object Recognition

Model Adjustment: In a modeling lifecycle for a prediction task, the update-train-evaluate loop is repeated in daily work, and many model variations are adjusted and trained. In general, once data and loss are determined, model adjustment can be done in two orthogonal steps: a) network architecture adjustments where layers are dropped or added and layer function templates are varied, and b) hyperparameter selections, which affect the behavior of the optimization algorithms. There is much work on search strategies to enumerate and explore both.
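The two orthogonal adjustment steps can be sketched as a simple cross-product enumeration; the architecture names and hyper-parameter grid below are illustrative, not from the paper.

```python
from itertools import product

# Architecture variants crossed with a hyper-parameter grid; in practice a
# search strategy would prune this space rather than enumerate it fully.
architectures = ["base", "base+extra_conv", "base-dropout"]
grid = {
    "learning_rate": [0.1, 0.01],
    "momentum": [0.9, 0.99],
}

def enumerate_variants(architectures, grid):
    keys = sorted(grid)
    for arch in architectures:                        # axis a) network structure
        for values in product(*(grid[k] for k in keys)):  # axis b) hyper-params
            yield {"architecture": arch, **dict(zip(keys, values))}

variants = list(enumerate_variants(architectures, grid))
```

Even this toy grid yields a dozen candidate models, each of which would be trained and checkpointed, which is why the enumeration deserves declarative support.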

Model Sharing: Due to their good generalizability, long training times, and the verbose hyperparameters required, there is a need to share trained large DNN models. Jia et al. [3] built an online venue (Caffe Model Zoo) to share models. Briefly, Model Zoo is part of a github repository with a markdown file edited collaboratively. To publish models, modelers add an entry with links to download trained parameters in caffe format. Apart from the caffe community, similar initiatives are in place for other training systems.

III ModelHub System Overview

We show the ModelHub architecture, including the key components and their interactions, in Fig. 4. At a high level, the ModelHub functionality is divided among a local component and a remote component. The local functionality includes integration with popular DNN systems, such as caffe, torch, and tensorflow, on a local machine or a cluster. The remote functionality includes sharing of models, and their versions, among different groups of users. We primarily focus on the local functionality in this paper.

On the local system side, dlv is a version control system (VCS) implemented as a command-line tool, which serves as an interface to interact with the rest of the local and remote components. Use of a specialized VCS instead of a general-purpose VCS such as git or svn allows us to better portray and query the internal structure of the artifacts generated in a modeling lifecycle, such as network definitions, training logs, binary weights, and relationships between models. The key utilities of dlv are listed in Table II, grouped by their purpose; we explain these in further detail in Sec. III-B. DQL is a DSL we propose to assist modelers in deriving new models; the DQL query parser and optimizer components in the figure are used to support this language. The model learning module interacts with the external deep learning tools that the modeler uses for training and testing. They are essentially wrappers on specific DNN systems that extract and reproduce modeling artifacts. Finally, the ModelHub service is a hosted toolkit to support publishing, discovering and reusing models, and serves a similar role for DNN models as github for software development or DataHub for data science.

Fig. 4: ModelHub System Architecture

III-A Data Model

ModelHub works with two data models: a conceptual DNN model, and a data model for the versions in a dlv repository.

DNN Model: A DNN model can be understood in different ways, as one can tell from the different model creation APIs in popular deep learning systems. Though a DNN model essentially is a parametric nested mapping function learned from massive examples, in practice it is represented in different ways. Popular training systems often use a graph construction API at different levels to present the nested function, where a graph node is referred to as an operator, layer, gate, etc. To illustrate the difference, if we view a function (a layer) as a node and a dependency relationship as an edge, the model becomes a directed acyclic graph (DAG) over layers, while if a graph node is defined on basic tensor arithmetic operators (add, multiply), it is a DAG over operators. Modelers think and communicate at a high level (e.g., convolution layer, full layer), while they may program in an API dialect of low-level arithmetic operators. In ModelHub, we consider a DNN model node as a composition of unit operators (layers), a view often adopted by computer vision models. The main reason for this decision is that we focus on productivity improvement in the lifecycle, rather than implementation efficiencies of training and testing.

VCS Data Model: When managing DNN models in the VCS repository, a model version represents the contents of a single version. It consists of a network definition, a collection of weights (each of which is a value assignment for the weight parameters), a set of extracted metadata (such as hyper-parameters, accuracy and loss generated in the training phase), and a collection of files used together with the model instance (e.g., scripts, datasets). In addition, we enforce that a model version must be associated with a human-readable name for better utility, which reflects the logical groups of a series of improvement efforts over a DNN model in practice.

In the implementation, model versions can be viewed as a relation model_version(name, id, N, W, M, F), where (name, id) is the primary key of model versions and id is auto-generated to distinguish model versions with the same name. In brief, N, W, M and F are the network definition, weight values, extracted metadata and associated files respectively. The DAG, N, is stored as two tables: Node(id, node, A), where A is a list of attributes such as layer name, and Edge(from, to). W is managed in our learned parameter storage (PAS, Sec. V). M, the metadata, captures the provenance information of training and testing a particular model; it is extracted from training logs by the wrapper module, and includes the hyperparameters used when training a model, the loss and accuracy measures at some iterations, as well as dynamic parameters in the optimization process, such as the learning rate at some iterations. Finally, F is the file list marked to be associated with a model version, including data files, scripts, initial configurations, etc. Besides the set of model versions, the lineage of the model versions is captured using a separate parent(base, derived, commit) relation. All of these relations are maintained/updated in a relational backend when the modeler runs the different dlv commands that update the repository.
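As a sketch (assuming illustrative column names and types, not ModelHub's actual schema), the relations above might be declared as follows; SQL reserves `from` and `commit`, so those columns are renamed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE model_version (
    name TEXT NOT NULL,        -- human-readable model name
    id   INTEGER NOT NULL,     -- auto-generated per-name version id
    N    TEXT,                 -- network definition (normalized into DAG below)
    M    TEXT,                 -- extracted metadata (hyper-params, loss, ...)
    F    TEXT,                 -- associated file list
    PRIMARY KEY (name, id)     -- W is kept in the parameter store (PAS)
);
CREATE TABLE node   (id INTEGER PRIMARY KEY, node TEXT, attrs TEXT);
CREATE TABLE edge   (src INTEGER, dst INTEGER);          -- "from"/"to" renamed
CREATE TABLE parent (base TEXT, derived TEXT, commit_msg TEXT);
""")
conn.execute("INSERT INTO model_version(name, id) VALUES ('alexnet', 1)")
rows = conn.execute("SELECT name, id FROM model_version").fetchall()
```

Keeping W out of the relational backend mirrors the split-storage design: structured artifacts live in tables, while the float matrices go to PAS.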

Type Command Description
model version management init Initialize a  repository.
add Add model files to be committed.
commit Commit the added files.
copy Scaffold model from an old one.
archive Archive models in the repository.
model exploration list List models and related lineages.
desc Describe a particular model.
diff Compare multiple models.
eval Evaluate a model with given data.
model enumeration query Run a DQL clause.
remote interaction publish Publish a model to ModelHub.
search Search models in ModelHub.
pull Download from ModelHub.
TABLE II: A list of key dlv utilities.

III-B Query Facilities

Once the DNN models and their relationships are managed in dlv, the modeler can interact with them easily. The query facilities we provide can be categorized into two types: a) model exploration queries and b) model enumeration queries.

III-B1 Model Exploration Queries

Model exploration queries interact with the models in a repository, and are used to understand a particular model, to query lineages of the models, and to compare several models. For usability, we design them as query templates via dlv sub-commands with options, similar to other VCS tools.

List Models & Related Lineages: By default, the list query lists all versions of all models, including their commit descriptions and parent versions; it also takes options, such as showing results for a particular model, or limiting the number of versions to be listed.
dlv list [--model_name] [--commit_msg] [--last]

Describe Model: desc shows the extracted metadata from a model version, such as the network definition, learnable parameters, execution footprint (memory and runtime), activations of convolutional DNNs, weight matrices, and evaluation results across iterations. Note that an activation is the intermediate output of a DNN model in computer vision, and is often used as an important tool to understand the model. The current output formats are a result of discussions with computer vision modelers to deliver tools that fit their needs. In addition to printing to console, the query supports HTML output for displaying the images and visualizing the weight distribution.
dlv desc [--model_name --version] [--output]

Compare Models: diff takes a list of model names or version ids and allows the modeler to compare the DNN models. Most of the desc components are aligned and returned in the query result side by side. We discuss it in Sec. IV.
dlv diff [--model_names --versions] [--output]

Evaluate Model: eval runs the test phase of the managed models, with an optional config specifying different data or changes to the current hyper-parameters. The main usages of this exploration query are two-fold: 1) for users to get familiar with a new model, and 2) for users who want to test known models on different data or settings. The query returns the accuracy and, optionally, the activations. It is worth pointing out that complex evaluations can be done via model enumeration queries in DQL.
dlv eval [--model_name --versions] [--config]

III-B2 Model Enumeration Queries

Model enumeration queries are used to explore variations of currently available models in a repository by changing network structures or tuning hyper-parameters. There are several operations that need to be done in order to derive new models: 1) Select models from the repository to improve; 2) Slice particular models to get reusable components; 3) Construct new models by mutating the existing ones; 4) Try the new models on different hyper-parameters and pick good ones to save and work with. When enumerating models, we also want to stop exploration of bad models early.

To support this rich set of requirements, we propose the DQL domain specific language, whose statements can be executed using the dlv query sub-command. The challenges in designing the language are: a) the data model is a mix of the relational and graph data models, and b) the enumeration includes hyper-parameter tuning as well as network structure mutations, which are very different operations. We omit a thorough explanation of the language due to space constraints, and instead show its key operators and constructs along with a set of examples (Queries 1–4) illustrating how the requirements are met.

select m1
where m1.name like "alexnet_%" and
      m1.creation_time > "2015-11-22" and
      m1["conv[1,3,5]"].next has POOL("MAX")
Listing 1: DQL select query to pick models.
slice m2 from m1
where m1.name like "alexnet-origin%"
mutate m2.input = m1["conv1"] and
       m2.output = m1["fc7"]
Listing 2: DQL slice query to get a sub-network.
construct m2 from m1
where m1.name like "alexnet-avgv1%" and
      m1["conv*($1)"].next has POOL("AVG")
mutate m1["conv*($1)"].insert = RELU("relu$1")
Listing 3: DQL construct query to derive new models from existing ones.
evaluate m
from query3
with config = "path to config"
vary config.base_lr in [0.1, 0.01, 0.001] and
     m["conv*"].lr auto and
     config.input_data in ["path1", "path2"]
keep top(5, m["loss"], 100)
Listing 4: DQL evaluate query to enumerate models with different network architectures, search hyper-parameters, and eliminate models.

Key Operators: We adopt standard SQL syntax to interact with the repository. DQL views the repository as a single model version table. As mentioned in Sec. III-A, a model version instance is a DAG, which can be viewed as an object type in modern SQL conventions. In DQL, DAG-level attributes can be referenced using attribute names (e.g., m1.creation_time, m2.input, m2.output), while to navigate the internal structure of the DAG, i.e., the Node and Edge EDB, we provide a regexp-style selector operator on a model version to access individual nodes; e.g., m1["conv[1,3,5]"] in Query 1 filters the nodes in m1. Once the selector operator returns a set of nodes, the prev and next attributes of a node allow 1-hop traversal in the DAG. Note that POOL("MAX") is one of the standard built-in node templates for condition clauses. Using SPJ operators with object-type attribute access and the selector operator, we allow relational queries to be mixed with graph traversal conditions.
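To make the selector and 1-hop traversal semantics concrete, the following Python sketch models a toy version of a model-version DAG with a regexp-style selector. The Node/ModelVersion classes, the regex translation, and the layer names are our illustrative assumptions, not ModelHub's implementation.

```python
import re

# Toy model of a model-version DAG: nodes keyed by name, with prev/next
# adjacency lists. All names and structure here are illustrative only.
class Node:
    def __init__(self, name, kind, **attrs):
        self.name, self.kind, self.attrs = name, kind, attrs
        self.prev, self.next = [], []

class ModelVersion:
    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}

    def __getitem__(self, pattern):
        # Regexp-style selector: m["conv[1,3,5]"] -> matching nodes.
        # Translate the bracket list into a regex alternation.
        rx = re.sub(r'\[([\d,]+)\]',
                    lambda g: '(' + '|'.join(g.group(1).split(',')) + ')',
                    pattern).replace('*', r'\d+')
        return [n for n in self.nodes.values() if re.fullmatch(rx, n.name)]

def connect(a, b):
    a.next.append(b); b.prev.append(a)

# Build a fragment: conv1 -> pool1(MAX) -> conv3 -> pool3(AVG)
conv1, pool1 = Node('conv1', 'CONV'), Node('pool1', 'POOL', mode='MAX')
conv3, pool3 = Node('conv3', 'CONV'), Node('pool3', 'POOL', mode='AVG')
for a, b in [(conv1, pool1), (pool1, conv3), (conv3, pool3)]:
    connect(a, b)
m1 = ModelVersion([conv1, pool1, conv3, pool3])

# The condition m1["conv[1,3,5]"].next has POOL("MAX") in the toy model:
hits = [n for n in m1['conv[1,3,5]']
        if any(s.kind == 'POOL' and s.attrs.get('mode') == 'MAX'
               for s in n.next)]
```

Here only conv1 satisfies the condition, since its successor is a MAX pool while conv3's successor is an AVG pool.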

To retrieve reusable components in a DAG, and mutate it to get new models, we provide the slice, construct and mutate operators. Slice originates in program analysis research; given a start and an end node, it returns a subgraph including all paths from the start to the end, along with the connections needed to produce the output. Construct can be found in graph query languages such as SPARQL to create new graphs. In our context, the DQL DAG only has nodes with multiple attributes, which simplifies the language. We allow construct to derive new DAGs by using selected nodes to insert nodes (by splitting an outgoing edge) or to delete an outgoing edge connecting to another node. Mutate limits the places where insert and delete can occur. For example, Queries 2 and 3 work on the DAG structure and generate reusable subgraphs and new graphs: Query 2 slices a sub-network from matching models between convolution layer ‘conv1’ and fully connected layer ‘fc7’, while Query 3 derives new models by appending a ReLU layer after each convolution layer that is followed by an average pool. All queries can be nested in the from clause.

Finally, evaluate can be used to try out new models, with the potential for early termination if expectations are not met. We separate the network enumeration component from the hyper-parameter tuning component; while network enumeration can be done via select or construct and nested in the from clause, we introduce a with operator to take an instance of a tuning config template, and a vary operator to express the combination of activated multi-dimensional hyper-parameters and search strategies. auto is a keyword implemented using default search strategies (currently grid search). To stop early and let the user control the stopping logic, we introduce a keep operator that takes a rule consisting of stopping condition templates, such as top-k of the evaluated models or an accuracy threshold. Query 4 evaluates the constructed models, tries combinations of at least three different hyper-parameters, and keeps the top 5 models w.r.t. the loss after 100 iterations.
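As an illustration of the vary/keep semantics in Query 4, the following Python sketch enumerates a hyper-parameter grid and keeps the top-5 configurations by loss. train_and_eval is a hypothetical stand-in for the training backend, and its surrogate loss values are fabricated for the example.

```python
import itertools

# Hypothetical stand-in for running `iters` training iterations and
# reporting the loss; a real system would invoke the training backend.
def train_and_eval(base_lr, input_data, iters=100):
    # Toy surrogate: smaller lr on dataset "path1" gives lower loss.
    return base_lr * (1.0 if input_data == 'path1' else 1.5)

# vary config.base_lr in [...] and config.input_data in [...]
grid = itertools.product([0.1, 0.01, 0.001], ['path1', 'path2'])
runs = [((lr, data), train_and_eval(lr, data)) for lr, data in grid]

# keep top(5, m["loss"], 100): retain the 5 configs with lowest loss
kept = sorted(runs, key=lambda r: r[1])[:5]
```

A grid-search auto strategy simply expands the cross product of activated dimensions, as above; the keep rule then prunes the evaluated models.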

Beyond the query facilities described so far, there is a collection of features we lack space to describe in detail, such as extraction of and interaction with model metadata, weight management, and provenance queries over the models.

III-C Model Publishing & Sharing

As the model repository is standalone, we host the repositories as a whole in a ModelHub service. The modeler can use publish to push a repository for archiving, collaboration, or sharing, and use search and pull to discover and reuse remote models. We envision that such a form of collaboration can facilitate a learning environment, as all versions in the lifecycle are accessible and understandable with ease.

III-D ModelHub Implementation

On the local side, the current implementation of ModelHub maintains the data model in multiple back-ends and utilizes git to manage arbitrary file diffs. Queries are decomposed, sent to the different back-ends, and chained accordingly. On the remote side, the repositories are hosted as a whole in the ModelHub service described above.

IV Model Comparison

The subtle differences between model versions are hard to grasp; comparing models is time consuming and requires heavy scripting under current training systems. As accuracy is the goal of the whole process, modelers often judge models by simple performance measures (e.g., loss, accuracy). Shared scripts, such as those plotting architectures and optimization/training logs, can be found in the user communities of specific training systems.

dlv diff supports a set of comparison schemes for the different artifact types, shown in Table III. The mixture of data types explains the different flavors of the diff operations. Besides common practice, the newly proposed schemes are highlighted in italics in the table.

Artifact Type Scheme
Network Architecture Graph Plot, Graph Edit Distance
Learned Parameters Tensor Plot, Matrix Reordering
Prediction Result Relational Set Diff
Hyperparameters Key-Value Set Diff
Optimization Routine Time Series Plot
TABLE III: diff  Comparison Schemes

Plotting is an important tool for understanding a network architecture, but comparing two network architectures or two tensors is not easy. By viewing a network architecture as a DAG in dlv, the comparison can be done via a graph edit distance minimization routine. For tensor comparison, we propose a new matrix reordering scheme.

IV-A Align Network Architectures

IV-B Align Learned Parameters

For ease of illustration, we focus on 2D matrices, as tensors in DNNs can be lowered from higher dimensions to connections between input and output neurons. Informally, the basic comparison idea is: given two matrices A and B, we permute B’s rows and columns in order to find the matrix most similar to A w.r.t. a cost function, e.g., Euclidean distance or best compression bits. As an example, the direct and best delta matrices are shown in Fig. 5. In Fig. 5(a), we show the weight matrices of the LeNet conv1 layer trained with the same initialization but flipped images, for the scenario of weight reuse. In Fig. 5(b), LeNet is trained with random initialization but the same image order. As we can see, alignment not only shows the connections between two matrices, but also derives more zeros in the delta. As shown later in the evaluation, together with segmented float matrices, the alignment operator contributes to the overall storage performance.

(a) Fixed Initialization
(b) Random Initialization
Fig. 5: Example of Direct and Best Aligned Delta. LeNet conv1: (LeNet A, B, direct delta A−B, best delta A−B′)

Next, we present the matrix alignment problem formally. As the matrices to be aligned do not necessarily have the same dimensions, we first define a permutation matrix with the capability to adapt dimensions.

[Permutation Matrix] Let π ∈ Πn be a permutation of {1, …, n}, where Πn is the set of all such permutations. Given two positive integers n ≤ m, the permutation matrix of π is the n × m matrix Pπ with Pπ[i, j] = 1 if j = π(i), and 0 otherwise.

Using the permutation matrix, a permutation can be applied via matrix multiplication to reorder a matrix by rows or columns accordingly.

A row permutation of B to B′: B′ = PπB.

A column permutation of B to B′: B′ = BPσᵀ.

With the permutation matrix, given two matrices and a cost function, we formulate the matrix alignment problem as follows:

[Matrix Alignment] Given two real matrices A and B, we want to find two permutations π and σ that reorder B to B′ = PπBPσᵀ, such that f(A − B′) is minimized, where f is a cost function. We denote the best delta matrix by Δ = A − B′.

As an example, let f be the L1 norm; given two concrete matrices A and B, the best permutations π and σ are those that minimize f(A − PπBPσᵀ).
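The definition can be made concrete with a brute-force sketch in Python: enumerate all row and column permutations of B and pick the pair minimizing the L1 cost. This illustrates the problem statement only; the factorial search space makes it feasible solely for tiny matrices, and the example matrices are our own.

```python
import itertools
import numpy as np

def best_alignment(A, B):
    """Exhaustively search row/column permutations of B minimizing
    the L1 cost ||A - B'||_1 (feasible only for tiny matrices)."""
    n, m = B.shape
    best = (None, None, float('inf'))
    for rp in itertools.permutations(range(n)):
        Br = B[list(rp), :]                       # apply row permutation
        for cp in itertools.permutations(range(m)):
            cost = np.abs(A - Br[:, list(cp)]).sum()
            if cost < best[2]:
                best = (rp, cp, cost)
    return best

A = np.array([[1, 2], [3, 4]])
B = np.array([[4, 3], [2, 1]])   # A with rows and columns reversed
rp, cp, cost = best_alignment(A, B)
```

For this pair, reversing both the rows and the columns of B recovers A exactly, so the best delta is the zero matrix; in general the search costs O(n!·m!) evaluations, motivating the hill climbing algorithm below.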

Complexity Analysis: To the best of our knowledge, the matrix alignment problem has not been studied in the literature. We show its NP-completeness via a reduction from the graph edit distance (GED) problem.

The Matrix Alignment Problem is NP-complete when f is additive.

Fig. 6: Weight Artifacts in a DLV Repository
Fig. 7: One Snapshot Example (LeNet)

IV-B1 Greedy Hill Climbing Algorithm

0:  Input: matrices A, B; number of random initial points R; max iterations T
0:  Output: a local optimum (π, σ) of the matrix alignment
1:  r = 0
2:  result = (∅, ∅); cost = ∞
3:  while r < R do
4:     σ ← a random permutation
5:     for t = 0; t < T; t++ do
6:        RAB ← Construct_RAB(A, B, σ)
7:        M ← Maximum_Weighted_Matching(RAB)
8:        π ← Derived_Row_Permutation(M)
9:        CAB ← Construct_CAB(A, B, π)
10:        M ← Maximum_Weighted_Matching(CAB)
11:        σ ← Derived_Column_Permutation(M)
12:     end for
13:     if f(A − PπBPσᵀ) < cost then
14:        result = (π, σ); cost = f(A − PπBPσᵀ)
15:     end if
16:     r++
17:  end while
18:  return result
Algorithm 1 Randomized Hill Climbing Algorithm

We propose a randomized hill climbing approach to the matrix alignment problem. The key observation is that if we fix one permutation, minimizing the cost over the other permutation is a matching problem on a biclique; we can therefore iteratively solve a series of maximum weighted bipartite matching problems and reach a local optimum. With randomized initial permutations, we run the algorithm multiple times and choose the best among the local optima found.

Now we describe the algorithm in detail. First, fixing one permutation, e.g., the column permutation σ, the matrix alignment problem reduces to finding the best row permutation π minimizing f(A − PπBPσᵀ).

Next, we define the row alignment biclique (RAB), and show the connections between its maximum weighted bipartite matching solution and the best row alignment in detail.

[Row Alignment Biclique] Given two real matrices A and B′ = BPσᵀ (the column-permuted B), a row alignment biclique is a complete bipartite graph RAB = (U ∪ V, E), where U = {u1, …, un} is a set of vertices representing the row vectors of A, with ui corresponding to the i-th row ai of A, while V = {v1, …, vn} represents the row vectors of B′, with vj corresponding to the j-th row b′j. Each edge eij ∈ E connects the row pair (ui, vj), and its weight w(eij) = −f(ai − b′j) is the negated cost of aligning row j of B′ with row i of A.

Fig. 8: Example of Row Alignment Biclique

Given A and B′, a maximal bipartite matching M of the RAB is a set of edges sharing no vertices, with |M| = n. Given such an M, we obtain a derived row permutation πM defined by πM(i) = j iff eij ∈ M.

The maximum weighted bipartite matching M* is the maximal bipartite matching with the largest total weight w(M) = Σe∈M w(e) among all maximal bipartite matchings.

In Fig. 8, we use the two matrices A, B from Exp. 5 to show their row alignment biclique; B′ is the resized matrix of B obtained by applying the two permutation matrices, and the weight of each edge is shown in the figure.

The maximal matching with the largest sum of weights is the maximum weighted bipartite matching, and it determines the derived permutation πM.

The derived row permutation of the maximum weighted bipartite matching is also the solution of the matrix alignment problem with the column permutation σ fixed: since M* has the largest total weight, the derived permutation attains the minimum cost.

Due to the symmetric definition of the matrix alignment problem, we can define the column alignment biclique (CAB) and its maximum weighted bipartite matching similarly; the derived column permutation is the solution of the matrix alignment problem with the row permutation π fixed.

Iterative maximum weighted bipartite matching in Alg. 1 converges to a local optimum solution.

The full algorithm is described in Alg. 1, where T is the maximum number of iterations and R the number of random initial points. By applying the bipartite matching iteratively, i.e., fixing the row or column permutation from the previous iteration, the cost is monotonically decreasing and we find a local optimum of the matrix alignment problem; in other words, it is a hill climbing algorithm. With random initial permutations, we compute multiple local optima and choose the best one.
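The alternating-matching idea of Alg. 1 can be sketched in Python using the Hungarian algorithm (scipy's linear_sum_assignment) for each bipartite matching step; minimizing the assignment cost is equivalent to the maximum weighted matching with negated weights. The L1 cost, the restart/iteration counts, and the test matrices are illustrative choices, not the paper's exact configuration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian matching

def align(A, B, restarts=5, iters=10, seed=0):
    """Randomized hill climbing for the matrix alignment problem under
    the L1 cost: alternate row and column bipartite matchings, restart
    from several random initial column permutations, keep the best."""
    rng = np.random.default_rng(seed)
    n, m = B.shape
    best_cost, best = np.inf, None
    for _ in range(restarts):
        cp = rng.permutation(m)               # random initial column perm
        rp = np.arange(n)
        for _ in range(iters):
            Bc = B[:, cp]
            # RAB costs: W[i, j] = cost of aligning row j of Bc to row i of A
            W = np.abs(A[:, None, :] - Bc[None, :, :]).sum(axis=2)
            _, rp = linear_sum_assignment(W)  # best row matching
            Br = B[rp, :]
            # CAB costs for columns, symmetrically
            W = np.abs(A.T[:, None, :] - Br.T[None, :, :]).sum(axis=2)
            _, cp = linear_sum_assignment(W)
        cost = np.abs(A - B[rp][:, cp]).sum()
        if cost < best_cost:
            best_cost, best = cost, (rp.copy(), cp.copy())
    return best, best_cost

A = np.arange(12.0).reshape(3, 4)
rng = np.random.default_rng(1)
B = A[rng.permutation(3)][:, rng.permutation(4)]  # scrambled copy of A
(_, _), cost = align(A, B)
```

Each Hungarian step costs O(n³), so one restart runs in O(T·n³) rather than the factorial time of brute force; here the scrambled copy of A is recovered exactly.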

V Parameter Archival Storage (PAS)

Fig. 9: Relationships of Model Versions and Snapshots

The modeling lifecycle for DNNs, and machine learning models in general, is centered around the learned parameters, whose storage footprint can be very large. The goal of PAS is to maintain a large number of learned models as compactly as possible, without compromising query performance. Before introducing our design, we first discuss the queries of interest and some key properties of the model artifacts. We then describe different options to store a single float matrix and to construct deltas (differences) between two matrices. We then formulate the optimal version graph storage problem, discuss how it differs from prior work, and present algorithms for solving it. Finally, we develop a novel approximate model evaluation technique, suitable for the segmented storage technique that PAS uses.

V-A Weight Parameters & Query Types of Interest

We illustrate the key weight parameter artifacts and the relationships among them in Fig. 9, and also explain some of the notation used in this section. At a high level, the predecessor-successor relationships between all the developed models are captured as a version graph. These relationships are user-specified and conceptual in nature, and their interpretation is left to the user (i.e., an edge indicates that one model was an updated version that the user checked in after another, but the nature of this update is irrelevant for storage purposes). A model version itself consists of a series of snapshots, which represent checkpoints during the training process (most systems will take such snapshots due to the long running times of the training iterations). We refer to the last or the best checkpointed snapshot as the latest snapshot of the version.

One snapshot, in turn, consists of intermediate data and trained parameters (e.g., in Fig. 7, the LeNet model has parameters for its convolutional and fully connected layers, and intermediate data whose dimensions depend on the minibatch size). Since the intermediate data is useful only if training needs to be resumed, only the parameters are stored in PAS. Outside of a few rare exceptions, the parameters can always be viewed as a collection of float matrices, which encode the weights on the edges from the outputs of the neurons in one layer to the inputs of the neurons in the next layer. Thus, we treat a float matrix as a first-class data type in PAS. (We do not make a distinction for the bias weights; the typical linear transformation is handled by folding the bias into the weight matrix.)

The retrieval queries of interest are dictated by the operations performed on these stored models, which include: (a) testing a model, (b) reusing weights to fine-tune other models, (c) comparing parameters of different models, (d) comparing the results of different models on a dataset, and (e) model exploration queries (Sec. III-B). Most of these operations require execution of group retrieval queries, where all the weight matrices in a specific snapshot need to be retrieved together. This is different from the range queries seen in array databases (e.g., SciDB), and also has unique characteristics that influence the storage and retrieval algorithms.

  • Similarity among Fine-tuned Models: Although non-convexity of the training algorithm and differences in network architectures across models lead to non-correlated parameters, the widely-used fine-tuning practices (Sec. II) generate model versions with similar parameters, resulting in efficient delta encoding schemes.

  • Co-usage constraints: Prior work on versioning and retrieval [11] has focused on retrieving a single artifact stored in its entirety. However, we would like to store the different matrices in a snapshot independently of each other, but we must retrieve them together. These co-usage constraints make the prior algorithms inapplicable as we discuss later.

  • Low Precision Tolerance: DNNs are well-known for their tolerance to using low-precision floating point numbers (Sec. VII), both during training and evaluation. Further, many types of queries (e.g., visualization and comparisons) do not require retrieving the full-precision weights.

  • Unbalanced Access Frequencies: Not all snapshots are used frequently. The latest snapshots with the best testing accuracy are used in most cases, while the checkpointed snapshots have limited usages, including debugging and comparisons. PAS provides a set of storage schemes to let the user trade off between storage and lossyness.

V-B Parameters as Segmented Float Matrices

Float Data Type Schemes: Although binary (1/-1) or ternary (1/0/-1) matrices are sometimes used in DNNs, in general PAS handles real-valued weights. Due to the different usages of snapshots, PAS offers a handful of float representations to let the user trade off storage efficiency against lossyness.

  • Float Point: DNNs are typically trained with single-precision (32-bit) or, less commonly, double-precision (64-bit) floats. This scheme uses the standard IEEE 754 floating point encoding to store the weights with sign, exponent, and mantissa bits. The IEEE half-precision format (16 bits) and TensorFlow's truncated 16-bit format [4] are supported as well and can be used if desired.

  • Fixed Point: Compared with floating point encoding, where each float carries its own exponent bits, fixed point encoding has a single global exponent per matrix, and each entry uses all of its bits for sign and mantissa. This is a lossy scheme, as tail positions are dropped and only a bounded number of distinct values can be expressed. The entropy of the matrix also drops considerably, aiding compression.

  • Quantization: Similarly, PAS supports quantization using k bits, where 2^k possible values are allowed. The quantization can be done in a random or uniform manner by analyzing the distribution, and a coding table is used to maintain the quantization information (with only the integer codes stored in the matrices). This is most useful for snapshots whose weights are primarily used for fine-tuning or initialization.

The float representation schemes presented here are not new, and are used in DNN systems in practice [12, 13, 14], with a focus on their implications for the training/testing phases. As a lifecycle management tool, PAS lets experienced users select schemes rather than deleting snapshots due to resource constraints. Our evaluation shows the storage/accuracy trade-offs of these schemes.
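As a concrete illustration of the fixed point and quantization schemes, the following Python sketch encodes a small weight matrix both ways. The bit widths, the rounding policy, and the uniform level placement are our assumptions for the example, not PAS's exact encoding.

```python
import numpy as np

def fixed_point_encode(W, bits=8):
    """Lossy fixed-point scheme: one global exponent per matrix,
    each entry stored as a signed `bits`-bit integer mantissa."""
    scale = np.abs(W).max()
    exp = np.ceil(np.log2(scale)) if scale > 0 else 0
    step = 2.0 ** (exp - (bits - 1))          # value of one mantissa unit
    codes = np.clip(np.round(W / step), -(2**(bits-1)), 2**(bits-1) - 1)
    return codes.astype(np.int32), step

def fixed_point_decode(codes, step):
    return codes * step

def quantize_uniform(W, k=2):
    """k-bit uniform quantization: 2^k levels over [min, max], with a
    coding table mapping the integer codes back to float values."""
    lo, hi = W.min(), W.max()
    levels = np.linspace(lo, hi, 2 ** k)      # the coding table
    codes = np.abs(W[..., None] - levels).argmin(axis=-1)
    return codes.astype(np.uint8), levels

W = np.array([[0.5, -0.25], [0.125, 1.0]])
codes, step = fixed_point_encode(W, bits=8)
W_fx = fixed_point_decode(codes, step)
qcodes, table = quantize_uniform(W, k=2)
W_q = table[qcodes]                           # decode via the coding table
```

Fixed point bounds the absolute error by one mantissa step, while k-bit quantization stores only small integer codes plus the 2^k-entry table, matching the trade-offs in Table IV.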

Scheme Param. Bits Compress Lossyness Usage
Float Point 64/32/16 Fair Lossless latest
Fixed Point 32/16/8 Good Good latest
Quantization 8/k Excellent Poor other
TABLE IV: Float Representation Scheme Trade-offs

Bytewise Segmentation for Float Matrices: One challenge for PAS is the high entropy of floats in their arithmetic representation, which makes them very hard to compress. Compression ratios reported in related work for scientific floating point datasets, e.g., simulations, are very low, and state-of-the-art compression schemes do not work well for DNN parameters either (Sec. VII).

Method LeNet AlexNet VGG16
fpzip
TABLE V: Compression Ratio for DNN Parameters

By exploiting the low-precision tolerance of DNNs, we adopt bytewise decomposition from prior work [15, 16] and extend it to our context to store the float matrices. The basic idea is to separate the high-order and low-order mantissa bits, so that a float matrix is stored in multiple chunks: the first chunk consists of the 8 high-order bits of each float, and the rest are segmented one byte per chunk. One major advantage is that the high-order bits have low entropy, so standard compression schemes (e.g., zlib) are effective on them.

Apart from its simplicity, the key benefits of the segmented approach are two-fold: (a) it allows offloading low-order bytes to remote storage, and (b) queries can read the high-order bytes only, in exchange for tolerating small errors. Comparison and exploration queries (desc, diff) can easily tolerate such errors and, as we show in this paper, eval queries can also be made tolerant to them.
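The bytewise segmentation idea can be sketched as follows: split a float32 matrix into four byte planes, compress each independently, and optionally reassemble from only the high-order planes. The big-endian plane layout below is an illustrative assumption.

```python
import zlib
import numpy as np

def segment_bytes(W):
    """Split a float32 matrix into 4 byte planes: plane 0 holds the
    high-order byte (sign + exponent bits) of every entry, planes
    1-3 the remaining mantissa bytes."""
    raw = W.astype('>f4').tobytes()          # big-endian: byte 0 is high-order
    return [raw[k::4] for k in range(4)]

def reassemble(planes, shape, keep=4):
    """Recreate the matrix from the first `keep` planes, zero-filling
    the dropped low-order bytes (the lossy, query-time fast path)."""
    n = len(planes[0])
    buf = bytearray(n * 4)
    for k in range(keep):
        buf[k::4] = planes[k]
    return np.frombuffer(bytes(buf), dtype='>f4').reshape(shape)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(64, 64)).astype(np.float32)
planes = segment_bytes(W)
# High-order bytes have low entropy and compress well; low-order
# mantissa bytes are near-random and barely compress.
hi_ratio = len(zlib.compress(planes[0])) / len(planes[0])
lo_ratio = len(zlib.compress(planes[3])) / len(planes[3])
W2 = reassemble(planes, W.shape, keep=2)     # read only 2 of 4 byte planes
err = np.abs(W - W2).max()
```

Reading two of four planes halves the bytes touched while keeping the full exponent and the top mantissa bits, so the absolute error stays tiny relative to the weight magnitudes.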

Read (bits) Float (64) Float (32) Fix (k/n) Fix (32/n)
16
TABLE VI: Maximum Absolute Error With High-Order Bytes

Delta Encoding Across Snapshots: We observe that, due to the non-convexity of training, even re-training the same model with slightly different initializations results in very different parameters. However, the parameters from checkpointed snapshots of the same or similar models tend to be close to each other. Furthermore, across model versions, fine-tuned models generated using fixed initializations from another model often have similar parameters. These observations naturally suggest delta encoding between checkpointed snapshots within one model version, and between the latest snapshots across model versions; i.e., instead of storing all matrices in their entirety, we can store some in their entirety and others as differences from those. Two possible delta functions are arithmetic subtraction and bitwise XOR. (Delta functions for matrices with different dimensions are discussed in the long version of the paper; the techniques in Sec. V work with minor modification.) We find that the compression footprints when applying the diff in different directions are similar. We study the delta operators on real models in Sec. VI.
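A small experiment sketches why delta encoding pays off when snapshots are similar: if only part of the parameters changed, both the subtraction and XOR deltas are mostly zeros and compress far better than the matrix itself. The toy similarity pattern below is fabricated for illustration.

```python
import zlib
import numpy as np

def delta_bytes(A, B, op='sub'):
    """Delta-encode B against A with arithmetic subtraction or bitwise
    XOR on the raw float32 representation, then compress the delta."""
    if op == 'sub':
        d = (A - B).tobytes()
    else:
        d = (A.view(np.uint32) ^ B.view(np.uint32)).tobytes()
    return zlib.compress(d)

rng = np.random.default_rng(0)
A = rng.normal(size=(128, 128)).astype(np.float32)        # a snapshot
B = A.copy()
# Fabricated fine-tuning pattern: only the first quarter of the rows
# receive small updates; the rest of the parameters are unchanged.
B[:32] += rng.normal(scale=1e-2, size=(32, 128)).astype(np.float32)

direct = len(zlib.compress(B.tobytes()))  # storing B in its entirety
sub = len(delta_bytes(A, B, 'sub'))       # subtraction delta vs. A
xor = len(delta_bytes(A, B, 'xor'))       # XOR delta vs. A
```

The unchanged entries produce long zero runs in both deltas, so either delta function stores the snapshot in a fraction of the direct footprint; with unrelated matrices the advantage disappears, matching the non-convexity observation above.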

(a) Matrix Storage Graph
(b) Optimal Plan without Constraints
(c) Optimal Plan with Constraints
Fig. 10: Illustrating Matrix Storage Graph & Plan using a Toy Example

V-C Optimal Parameter Archival Storage

Given the above background, we next address the question of how best to store a collection of model versions, so that the total storage footprint occupied by the large segmented float matrices is minimized while retrieval performance is not compromised. This recreation/storage trade-off sits at the core of any version control system. In recent work [11], the authors study six variants of this problem and show the NP-hardness of most of them. However, their techniques cannot be directly applied in PAS, primarily because their approach cannot handle the group retrieval (co-usage) constraints.

We first introduce the necessary notation, discuss the differences from prior work, and present the new techniques we developed for PAS. In Fig. 9, a model version consists of time-ordered checkpointed snapshots, each of which consists of a named list of float matrices representing the learned parameters. All matrices in a repository are the parameter artifacts to archive. Each matrix is either stored directly, or recovered through another matrix via a delta operator, where the delta is computed using one of the techniques discussed above; in the latter case, the delta is stored instead of the matrix itself. To unify the two cases, we introduce an empty matrix v0, and view a directly stored matrix as a delta from v0.

[Matrix Storage Graph] Given a repository of model versions, let v0 be an empty matrix and V the set of all parameter matrices. We denote by E the set of available deltas between pairs of matrices; abusing notation somewhat, we also treat E as the set of edges of a graph whose vertices are V ∪ {v0}. Finally, let G = (V ∪ {v0}, E) denote the matrix storage graph of the repository, where each edge carries two weights: its storage cost and its recreation cost, respectively.

[Matrix Storage Plan] Any connected subgraph of G that spans V ∪ {v0} is called a matrix storage plan for the repository, and is denoted by P = (V ∪ {v0}, EP), where EP ⊆ E.

In Fig. 10(a), we show a matrix storage graph for a repository with two snapshots. The weights associated with an edge from v0 reflect the cost of materializing a matrix and retrieving it directly; for an edge between two matrices, the weights denote the storage cost of the corresponding delta and the recreation cost of applying that delta. Figs. 10(b) and 10(c) show two matrix storage plans.

For a matrix storage plan P, PAS stores all its edges and can recreate any matrix by following a path starting from v0. The total storage cost of P is simply the sum of the storage costs of its edges. Computing the average snapshot recreation cost is more involved and depends on the retrieval scheme used:

  • Independent scheme recreates each matrix one by one by following the shortest path to it from v0; the recreation cost is the sum of the recreation costs of all edges along that shortest path.

  • Parallel scheme accesses all matrices of a snapshot in parallel (using multiple threads); the longest shortest path from v0 defines the recreation cost for the snapshot.

  • Reusable scheme considers caching deltas along the way, i.e., if the paths from v0 to two different matrices overlap, the shared computation is done only once. In this case, we need to construct the lowest-cost Steiner tree involving v0 and the matrices in the snapshot. However, because multiple large matrices need to be kept in memory simultaneously, the memory consumption of this scheme can be large.
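The independent and parallel recreation costs can be computed from a storage plan with a single-source shortest path computation from v0; a minimal sketch (with an invented toy plan) follows.

```python
import heapq

def shortest_paths(edges, source):
    """Dijkstra over a matrix storage plan given as an adjacency map
    {node: [(neighbor, recreation_cost), ...]}."""
    dist = {source: 0.0}
    pq = [(0.0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float('inf')):
            continue                         # stale queue entry
        for v, w in edges.get(u, []):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

# Toy storage plan: v0 is the empty matrix; weights are recreation costs.
plan = {
    'v0': [('m1', 2.0), ('m2', 5.0)],
    'm1': [('m3', 1.0)],
    'm2': [('m4', 2.0)],
}
dist = shortest_paths(plan, 'v0')
snapshot = ['m3', 'm4']                       # matrices co-used by one snapshot
independent = sum(dist[m] for m in snapshot)  # retrieve one by one
parallel = max(dist[m] for m in snapshot)     # longest shortest path
```

On this toy plan the snapshot costs 10 units under the independent scheme but only 7 under the parallel scheme, since the two retrieval paths overlap in time rather than in work.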

Retrieval Scheme Recreation Solution of Prob.1
Independent () Spanning tree
Parallel () Spanning tree
Reusable () Subgraph
TABLE VII: Recreation Cost of a Snapshot in a plan

PAS can be configured to use any of these options during actual query execution. However, solving the storage optimization problem under the reusable scheme is nearly impossible: since the Steiner tree problem is NP-hard, just computing the cost of a solution is intractable, which makes it hard even to compare two different storage solutions. Hence, during the storage optimization process, PAS only supports the independent and parallel schemes.

In the example above, the edges are shown as undirected, indicating that the deltas are symmetric. In general, we allow directed deltas to handle asymmetric delta functions, and also multiple directed edges between the same two matrices. The latter can capture different options for storing a delta; e.g., one edge may correspond to a remote storage option, where the storage cost is lower and the recreation cost higher, whereas another edge between the same two matrices may correspond to a local SSD option, where the storage cost is the highest and the recreation cost the lowest. Our algorithms can thus automatically choose the appropriate storage option for different deltas.

Similarly, PAS can make decisions at the level of byte segments of float matrices, by treating them as separate matrices that need to be retrieved together in some cases and not in others. This, combined with the ability to incorporate different storage options, is a powerful generalization that allows PAS to make decisions at a very fine granularity.

Given this notation, we can now state the problem formally. Since there are multiple optimization metrics, we assume that constraints on the retrieval costs are provided and ask to minimize the storage.

[Optimal Parameter Archival Storage Problem] Given a matrix storage graph G, let θi be the snapshot recreation cost budget for each snapshot si. Under a retrieval scheme, find a matrix storage plan P that minimizes the total storage cost while satisfying all the recreation constraints, i.e., the recreation cost of each snapshot si in P is at most θi.

In Fig. 10(b), without any recreation constraints, the best storage plan is the minimum spanning tree of the matrix storage graph with respect to storage cost. In Fig. 10(c), after adding two recreation constraints, we show an optimal storage plan satisfying all constraints; the total storage cost increases, while the snapshot recreation costs now meet their budgets.

Although this problem variation might look similar to the ones considered in recent work [11], none of the variations studied there can handle the co-usage constraints (i.e., constraints on simultaneously retrieving a group of versioned data artifacts). One way to enforce such constraints is to treat the entire snapshot as a single data artifact stored together; however, that may force an overall suboptimal solution, because we would not be able to choose the most appropriate delta at the level of individual matrices. Another option is to subdivide the retrieval budget for a snapshot into constraints on the individual matrices in the snapshot; as our experiments show, that can lead to significantly higher storage utilization. Thus the formulation above is a strict generalization of the formulations considered in that prior work.

The Optimal Parameter Archival Storage Problem is NP-hard for all retrieval schemes in Table VII. We reduce Prob. 5 in [11] to the independent scheme and Prob. 6 in [11] to the parallel scheme, by mapping each dataset to a vertex in the storage graph and introducing a snapshot holding all matrices with the corresponding recreation bound. The reusable scheme is at least as hard as the weighted set cover problem: reduce each set to an edge whose storage cost is its weight, each item to a vertex, and set the recreation budget accordingly.

The optimal solution to the problem is a spanning tree when the retrieval scheme is independent or parallel. Suppose we have a non-tree solution that satisfies the constraints and minimizes the objective. The parallel and independent schemes are based on shortest paths from v0 to each matrix, so the union of these shortest paths forms a shortest path tree; removing the edges not in this tree yields a plan with lower storage cost that still satisfies all recreation constraints, a contradiction. This lemma shows the solution is a spanning tree and connects our problem to a class of constrained minimum spanning tree problems. The lemma does not hold for the reusable scheme: snapshot Steiner trees satisfying different recreation constraints may share intermediate nodes, resulting in a subgraph solution.

Constrained Spanning Tree Problem: In our problem, minimizing storage cost while ignoring the recreation constraints leads to a minimum spanning tree (MST) of the storage graph, whereas the snapshot recreation constraints are best satisfied by a shortest path tree (SPT). Such problems are often referred to as constrained spanning tree problems [17] or shallow-light tree constructions [18], and have been studied in areas other than dataset versioning, such as VLSI design. Khuller et al. [19] propose an algorithm called LAST to construct such a “balanced” spanning tree in an undirected graph. LAST starts with an MST of the provided graph, traverses it in a DFS manner, and adjusts the tree by changing parents to ensure that the path length in the constructed solution is within (1+ε) times the shortest path, while the total storage cost is within (1+ε) times that of the MST. In our problem, the co-usage constraints of the matrices in each snapshot form hyperedges over the graph, making the problem more difficult.

In the rest of the discussion, we adapt meta-heuristics for constrained MST problems to develop two algorithms: the first (MT) is based on an iterative refinement scheme, where we start from an MST and then adjust it to satisfy the constraints, similar to the LMT algorithm proposed in earlier work [11]; the second (PT) is a priority-based tree construction algorithm, which adds nodes one by one and encodes the heuristic in its priority function. Both algorithms target the parallel and independent recreation schemes, and thus also yield feasible solutions for the reusable scheme. Due to the large memory footprints of intermediate matrices, we leave improving solutions for the reusable scheme to future work.

0:  Input: storage graph G, snapshots, recreation cost budgets.
0:  Output: a spanning tree T satisfying the constraints
1:  let T := MST of G
2:  while there are unsatisfied constraints do
3:     for each edge (u, v) not in T do
4:        calculate the gain of swapping in (u, v) with Eq. 1 (Eq. 2 for the parallel scheme)
5:     end for
6:     find the edge with the maximum positive gain
7:     break if no such edge exists
8:     swap the chosen edge into T (reattach the child endpoint)
9:  end while
10:  return T unless some constraint remains unsatisfied
Algorithm 2 MT

MT: The algorithm starts with T as the MST of G, and iteratively adjusts T to satisfy the broken snapshot recreation constraints by swapping one edge at a time. We denote by p(v) the parent of v in T, and by D(v) the set of successors (descendants) of v in T. A swap operation on v to an edge (u, v) changes the parent of v to u in T. The swap changes the storage cost of T by the difference of the two edge weights, and changes the recreation costs of v and all of its successors by the difference between the root-path costs of u and of p(v).

The proof can be derived from the definition of the swap operation by inspection. When selecting edges to swap, we choose the one with the largest marginal gain for the unsatisfied constraints:


The actual formula used is somewhat more complex and handles negative denominators. Eq. 1 sums the gains in recreation cost among all matrices in the same snapshot (for the independent scheme), while Eq. 2 uses the maximum change instead (for the parallel scheme).

The algorithm iteratively swaps edges and stops when all recreation constraints are satisfied or no edge yields a positive gain. A single step examines all non-tree edges against the unsatisfied constraints, and the number of swap steps is bounded, so the overall complexity is polynomial in the size of the graph and the number of snapshots.
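A minimal sketch of the iterative-refinement idea follows, under simplifying assumptions: a single symmetric edge-weight map serves as both storage and recreation cost, and the swap choice simply minimizes the violating node's recreation cost rather than evaluating the full gain ratio of Eq. 1. All names are illustrative.

```python
def root_path_cost(parent, w, v):
    """Recreation cost of v: total delta cost along its root path."""
    c = 0.0
    while parent[v] is not None:
        c += w[(parent[v], v)]
        v = parent[v]
    return c

def in_subtree(parent, anc, node):
    """True if node lies in the subtree rooted at anc (swap would cycle)."""
    while node is not None:
        if node == anc:
            return True
        node = parent[node]
    return False

def refine(parent, w, budget, edges):
    """Greedy MT-style refinement: while a node's recreation cost exceeds
    its budget, reattach it to the neighbor giving the cheapest root path."""
    changed = True
    while changed:
        changed = False
        for v, b in budget.items():
            best_cost = root_path_cost(parent, w, v)
            if best_cost <= b:
                continue
            best = None
            for (u, x) in edges:
                if x != v or u == parent[v] or in_subtree(parent, v, u):
                    continue
                cand = root_path_cost(parent, w, u) + w[(u, v)]
                if cand < best_cost:
                    best, best_cost = u, cand
            if best is not None:
                parent[v] = best          # the swap operation
                changed = True
    return parent

# Toy run: start from the MST (2 attached under 1) with a budget on vertex 2.
w = {(0, 1): 2.0, (1, 0): 2.0, (1, 2): 1.0,
     (2, 1): 1.0, (0, 2): 2.0, (2, 0): 2.0}
parent = {0: None, 1: 0, 2: 1}
refine(parent, w, {2: 2.0}, list(w.keys()))
```

Each accepted swap strictly decreases the violating node's recreation cost, so the loop terminates; in the toy run, vertex 2 is reattached directly under the root.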

0:  Input: storage graph G, snapshots, recreation cost budgets.
0:  Output: a spanning tree T satisfying the constraints
1:  let T := empty and Q be a priority queue of edges keyed by storage cost
2:  push the edges incident to the root into Q
3:  while Q is not empty do
4:     pop (u, v) from Q, with u in T and v outside T
5:     let the constraint satisfaction flag f be true
6:     for each snapshot constraint affected by v do
7:        estimate the recreation cost
8:        set f to false and break if the budget is violated
9:     end for
10:     if f is false, then defer (u, v) and goto line 3
11:     attach v to T; pop the inner edges of v from Q
12:     push the outer edges of v to Q
13:     for each vertex already in T whose reattachment to v improves storage cost and does not worsen recreation cost do
14:        swap its parent to v
15:     end for
16:  end while
17:  if T is not a spanning tree then
18:     append each unattached vertex to T
19:     adjust T using the MT heuristic
20:  end if
21:  return T if it is a matrix storage plan
Algorithm 3 PT

PT: This algorithm constructs a solution by "growing" a tree T, starting from an empty tree. The algorithm examines the edges in increasing order of storage cost; a priority queue Q maintains the candidate edges and is initially populated with the edges incident to the root. At any point, the edges in Q are the ones connecting a vertex inside T to a vertex outside T. Using an edge (u, v) popped from Q (with u in T and v outside), the algorithm tries to add v to T with the minimum storage increment. Before adding v, it examines whether the constraints of the affected snapshots are satisfied, using actual recreation costs for vertices already in T and estimates for vertices outside: for a vertex in T the actual recreation cost along its tree path is used, while for a vertex outside a lower bound, namely its shortest-path cost, serves as the estimate.

Once an edge is added to T, the inner edges of the newly added vertex v (those connecting v to vertices already in T) are dequeued from Q, while its outer edges are enqueued. If the storage cost of an existing vertex in T can be improved by reattaching it to v, and its recreation cost is not made worse, then its parent in T is replaced with v via the swap operation, which clearly decreases the storage cost without hurting the affected snapshots' recreation costs.

The algorithm stops when Q is empty. If T is then a spanning tree, we are done; otherwise, an adjustment operation is performed on T that trades storage cost for satisfying the snapshot recreation constraints. Each unattached vertex is appended to T; then, in each unsatisfied snapshot it belongs to, we would ideally choose a set of vertices whose parents to change in T such that the increase in storage cost is minimized while the recreation constraint becomes satisfied. This optimal adjustment can itself be viewed as a knapsack problem with an extra acyclicity constraint on T, and is NP-hard. Instead, we use the same heuristic as in Eq. 1 to adjust vertices one by one, replacing parents until the snapshot constraint is satisfied.
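The growth phase of PT can be sketched as follows (simplified: each edge weight doubles as storage and recreation cost, budget-violating edges are deferred and retried after each successful attachment, and the final adjustment phase is omitted; all names are ours):

```python
import heapq

def grow_tree(adj, budget, root=0):
    """PT-style growth (simplified): repeatedly attach the outside vertex
    reachable by the cheapest-storage frontier edge whose resulting
    recreation cost stays within that vertex's budget."""
    parent = {root: None}
    cost = {root: 0.0}                     # recreation cost inside the tree
    pq = [(wt, root, v) for v, wt in adj[root]]
    heapq.heapify(pq)
    deferred = []
    while pq:
        wt, u, v = heapq.heappop(pq)
        if v in parent:                    # inner edge; discard
            continue
        if cost[u] + wt > budget.get(v, float("inf")):
            deferred.append((wt, u, v))    # would break v's constraint
            continue
        parent[v], cost[v] = u, cost[u] + wt
        for x, wx in adj[v]:               # enqueue v's outer edges
            if x not in parent:
                heapq.heappush(pq, (wx, v, x))
        for e in deferred:                 # retry edges deferred earlier
            heapq.heappush(pq, e)
        deferred = []
    return parent, cost

# Toy run: vertex 2 has a tight budget, so the cheap edge 1-2 is deferred
# and 2 is attached directly under the root instead.
adj = {
    0: [(1, 2.0), (2, 2.0)],
    1: [(0, 2.0), (2, 1.0)],
    2: [(0, 2.0), (1, 1.0)],
}
parent, cost = grow_tree(adj, {2: 2.0})
```

If the queue empties while deferred edges remain unattachable, the returned tree is not spanning; the full algorithm would then fall back to the MT-style adjustment described above.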

Here the gain is normalized by the number of successors of the vertex in T. In other words, we choose a neighbor to become the new parent that has the maximum marginal gain in recreation cost for the unsatisfied snapshots relative to the storage increment. As before, the parallel scheme differs from the independent case only in the adjustment operator, which uses Eq. 2 instead. The complexity of the algorithm is polynomial in the number of edges and snapshots.

V-D Model Evaluation Scheme in PAS

Model evaluation, i.e., applying a model's forward pass to a data point to obtain the prediction result, is a common task for exploring, debugging, and understanding models. Given a PAS storage plan, an eval query requires uncompressing and applying deltas along the path to the model. We develop a novel model evaluation scheme that exploits the segmented design: it progressively accesses the low-order segments only when necessary, and guarantees no errors for arbitrary data points.

The basic intuition is that when retrieving segmented parameters, we know the minimum and maximum possible values of the parameters (since the higher-order bytes are retrieved first). If the prediction result is the same over the entire range of those values, then we do not need to access the lower-order bytes. However, given the high dimensionality of the parameters, the non-linearity of the model, and the fact that the full-precision values are unknown when the query is issued, it is not immediately clear that this is feasible.

We define the problem formally, and present the determinism condition used to develop our algorithm. Our technique is inspired by stability analysis in numerical analysis. We keep the formulation general so that it applies to other prediction functions. The basic assumption is that the prediction function returns a vector of relative strengths of the classification labels, and the dimension index with the maximum value is used as the predicted label.

[Parameter Perturbation Error Determination] Given a prediction function F(x; W), where x is the data and W are the learned weights, the prediction result is the dimension index with the highest value in the output y = F(x; W). When W is uncertain, i.e., each weight w in W is only known to lie in a range [w_lo, w_hi], determine whether the prediction result can be ascertained without error.

When W is uncertain, the output y is uncertain as well. However, if we can bound the individual entries of y, then the following condition is an applicable sufficient condition for determining that the prediction is error-free:

Let each output entry yi vary in a range [yi_lo, yi_hi]. If there exists a dimension d such that yd_lo >= yi_hi for all i != d, then the prediction result is d.
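This determinism condition is straightforward to check once per-entry bounds are available; a minimal sketch (function and variable names are ours):

```python
def certain_argmax(lo, hi):
    """Return the predicted class index if it is determined despite the
    per-entry uncertainty intervals [lo[i], hi[i]], else None.

    It suffices to test the entry with the largest lower bound: if its
    lower bound dominates every other entry's upper bound, the argmax of
    the true (unknown) output is fixed."""
    d = max(range(len(lo)), key=lambda i: lo[i])
    if all(lo[d] >= hi[i] for i in range(len(lo)) if i != d):
        return d
    return None

# The intervals [0.1, 0.2] and [0.7, 0.9] cannot overlap in outcome:
print(certain_argmax([0.1, 0.7], [0.2, 0.9]))   # class 1 is certain
print(certain_argmax([0.1, 0.3], [0.4, 0.5]))   # overlap: undetermined
```

When the check returns None, the progressive scheme fetches the next lower-order segment, tightens the intervals, and retries.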

Next we illustrate a query procedure that, given data x, evaluates the model with weight perturbations and determines the output perturbation on the fly. Recall that a model is a nested function (Sec. II), where the input data is transformed layer by layer, and each layer applies an affine transformation followed by a non-linearity.

We derive the output perturbations when evaluating a model, propagating the perturbation intervals step by step through each layer.

Next, the activation function is applied. Most common activation functions are monotonic (e.g., sigmoid, ReLU), while pooling layers apply max and avg functions over several dimensions; in either case, the perturbation interval of the activation output is easy to derive from the interval of its input. During the evaluation query, instead of carrying the 1-D actual output, we carry 2-D perturbation bounds, since the actual parameter values are not available. The non-linearity may shrink or widen the perturbation range. The output perturbation at the next affine layer can be calculated similarly, except that now both the inputs and the weights are uncertain.
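Under interval-arithmetic assumptions, the propagation through one affine layer with uncertain weights and uncertain inputs, followed by a monotone activation, can be sketched as follows (illustrative names; the paper's exact derivation may differ):

```python
def interval_affine(W_lo, W_hi, b, x_lo, x_hi):
    """Interval forward pass through y = W @ x + b when both the weight
    matrix and the input vector are only known up to entrywise intervals.
    Each product term contributes the min/max over the four endpoint
    combinations, which is exact for interval multiplication."""
    y_lo, y_hi = [], []
    for row_lo, row_hi, bi in zip(W_lo, W_hi, b):
        lo = hi = bi
        for wl, wh, xl, xh in zip(row_lo, row_hi, x_lo, x_hi):
            prods = (wl * xl, wl * xh, wh * xl, wh * xh)
            lo += min(prods)
            hi += max(prods)
        y_lo.append(lo)
        y_hi.append(hi)
    return y_lo, y_hi

def interval_relu(y_lo, y_hi):
    """Monotone activations map interval endpoints to interval endpoints."""
    return ([max(0.0, v) for v in y_lo], [max(0.0, v) for v in y_hi])

# One uncertain weight w in [0.9, 1.1] applied to an exact input x = 2.0:
y_lo, y_hi = interval_affine([[0.9]], [[1.1]], [0.0], [2.0], [2.0])
```

Chaining such layers and feeding the final bounds to the argmax-determinism check yields the progressive evaluation loop: fetch more low-order bytes only when the intervals are still too wide to decide.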