I Introduction

Deep learning is an important method in analytics.

Deep learning anatomy: a DAG structured mapping function. node: templated_neuron, edge: connection. a DNN program specifies the DAG, input data, and a set of hyperparameters.

Deep learning is end-to-end learning: features are learned directly from data, leading to a unique human-in-the-loop lifecycle.

Lifecycle properties:

Analysis (describe + compare) of the trained models.

Heuristic-driven enumerations of the models.

A rich set of artifacts with lineages.


Lifecycle disadvantages under current systems:

repetitive, time-consuming human effort to describe and enumerate models via imperative programs;

heavy turnover and footprint: long training times and infrequent use of older models; the user has to decide on a tradeoff: save more models for reuse, or delete and retrain.

redundancy in large models; similar models exist; not all models are needed with the same frequency and accuracy.

complicated and incoherent setups: hard to share, reuse, and reproduce others’ models.


So we propose a system to tackle these disadvantages:

Increase the abstraction level, which opens room for optimization:

we propose a VCS-based system that uses friendly query templates to capture the processes in the modeling lifecycle.

we propose DQL, a language that helps the user enumerate models and encode model selection criteria.


Treat weight matrices as a first-class data type:

we identify a matrix comparison operator, alignment-based matrix reordering, which is useful for comparison queries and model storage.

we explore the problem of storing many models which have both highly structured information and high-entropy float matrices with low precision tolerance.

we develop model queries that run directly on compressed storage, with a guarantee of no errors and, at the same time, faster query processing. This eases the tradeoff between model storage and usage: we can compress all models to use less storage, and uncompress in a pay-as-you-go style to ensure no errors.


the proposed techniques extend naturally to a collaborative format and environment which has rich lineages and enables sharing, reusing, and reproducing models.


Contributions:

we are the first to address model lifecycle management issues in deep learning.

we are the first to propose full-lifecycle declarative constructs for deep learning modeling and to show their implementations.

we treat deep learning weight matrices as a first-class data type; we formulate the storage problem for a large set of float parameter matrices under deep learning lifecycle specific query workloads, and explore the tradeoff between low precision tolerance and accuracy when storing this data type.

by using bytewise compression, we present a storage framework and a model evaluation query acceleration technique with guarantees of no errors.

on the weight matrix data type, we identify a new alignment-based matrix reordering problem, which appears in deep learning model comparison and storage. We show the problem is NP-hard and propose efficient greedy algorithms.


Our results show that the techniques we propose are useful on real-life models, and perform well on synthetic models.

Outline: Sec. 2: Preliminaries; Sec. 3: System Overview; Sec. 4: Optimizations for Weight Matrices (Sec. 4.1: Alignment Operator; Sec. 4.2: Storage); Sec. 5: Experiments; Sec. 6: Related Work; Sec. 7: Conclusion.
Deep learning models, also called deep neural networks (DNN), have dramatically improved the state-of-the-art results for many important reasoning and learning tasks, including speech recognition, object recognition, and natural language processing, in recent years [1]. Learned using massive amounts of training data, DNN models have superior generalization capabilities, and the intermediate layers in many deep learning models have proven useful in providing effective semantic features that can be used with other learning techniques and are applicable to other problems. However, there are many critical large-scale data management issues in learning, storing, sharing, and using deep learning models, which are largely ignored by researchers today, but are coming to the forefront with the increased use of deep learning in a variety of domains. In this paper, we discuss some of those challenges in the context of the DNN modeling lifecycle, and propose a comprehensive system to address them. Given the large scale of data involved (both training data and the learned models themselves) and the increasing need for high-level declarative abstractions, we argue that database researchers should play a much larger role in this area. Although this paper primarily focuses on deep neural networks, similar data management challenges are seen in lifecycle management of other types of ML models, such as logistic regression and matrix factorization.
DNN Modeling Lifecycle and Challenges: Compared with the traditional approach of feature engineering followed by model training [2], deep learning is an end-to-end learning approach, i.e., the features are not given by a human but are learned in an automatic manner from the input data. Moreover, the features are complex and have a hierarchy along with the network representation. This requires less domain expertise and experience from the modeler, but understanding and explaining the learned models is difficult; why even well-studied models work so well is still unknown in theory and under active research. Thus, when developing new models, changing the learned model (especially its network structure and hyperparameters) becomes an empirical search task.
In Fig. 1, we show a typical deep learning modeling lifecycle (we present an overview of deep neural networks in the next section). Given a prediction task, a modeler often starts from well-known models that have been successful in similar task domains; she then specifies input training data and output loss functions, and repeatedly adjusts the network's operators and connections like Lego bricks, tunes model hyperparameters, trains and evaluates the model, and repeats this loop until prediction accuracy does not improve. Due to a lack of understanding about why models work, the adjustments and tuning inside the loop are driven by heuristics, e.g., adjusting hyperparameters that appear to have a significant impact on the learned weights, applying novel layers or tricks seen in recent empirical studies, and so on. Thus, many similar models are trained and compared, and a series of model variants needs to be explored and developed. Due to the expensive learning/training phase, each iteration of the modeling loop takes a long period of time and produces many (checkpointed) snapshots of the model. As we noted above, this is a common workflow across many other ML models as well.
Current systems (Caffe [3], Theano, Torch, TensorFlow [4], etc.) mainly focus on the model building and training phases, while the issues of data management, model sharing, and lifecycle management are largely ignored. Modelers are required to write external imperative scripts, edit configurations by hand, and manually maintain a manifest of the model variations that have been tried out; not only are these tasks irrelevant to the modeling objective, but they are also challenging and nontrivial due to the complexity of the model as well as the large footprints of the learned models. More specifically, the tasks and data artifacts in the modeling lifecycle expose several systems and data management challenges, which include:
Understanding & Comparing Models: It is difficult to keep track of the many models developed and/or understand the differences amongst them. Differences among both the metadata about the model (training sample, hyperparameters, network structure, etc.), as well as the actual learned parameters, are of interest. It is common to see a modeler write all model configurations in an experiment spreadsheet to keep track of temporary folders of input data, setup scripts, snapshots and logs, which is not only a cumbersome but also an error-prone process. Though viewing the measurements is straightforward, understanding the differences between models is less principled and requires a lot more work.

Repetitive Adjusting of Models: The development lifecycle itself has time-consuming repetitive sub-steps, such as adding a layer at different places to adjust a model, searching through a set of hyperparameters for the different variations, reusing learned weights to train models, etc., which currently have to be performed manually.

Model Versioning: Similar models are possibly trained and run multiple times, reusing others’ weights as initialization, either because of a changed input or discovery of an error. There is thus a need to keep track of multiple model versions and their relationships over time, although the utilities of different models are very different.

Parameter Archiving: The storage footprint of deep learning models tends to be very large. Recent top-ranked models in the ImageNet task have billions of floating-point parameters and require hundreds of MBs to store one snapshot during training. Due to resource constraints, the modeler has to limit the number of snapshots, or even drop all snapshots of a model at the cost of retraining when needed.

Reasoning about Model Results: Another key data artifact that often needs to be reasoned about is the results of running a learned model on the training or testing dataset. By comparing the results across different models, a modeler can get insights into difficult training examples or understand correlations between specific adjustments and the performance.
In addition, although not a focus of this paper, sharing and reusing models is not easy, especially because of the large model sizes, the specialized tools used for learning, and the modeler-generated scripts in the lifecycle.
ModelHub: In this paper, we propose the ModelHub system to address these challenges. The ModelHub system is not meant to replace popular training-focused systems, but rather is designed to be used with them to accelerate modeling tasks and manage the rich set of lifecycle artifacts. It consists of three key components: (a) a model versioning system (DLV) to store, query and aid in understanding the models and their versions, (b) a model network adjustment and hyperparameter tuning domain specific language (DQL) to serve as an abstraction layer to help modelers focus on the creation of the models instead of repetitive steps in the lifecycle, and (c) a hosted deep learning model sharing system (ModelHub) to exchange DLV repositories and enable publishing, discovering and reusing models from others.
The key features and innovative design highlights of ModelHub are: (a) We use a git-like VCS interface as a familiar user interface to let the modeler manage and explore the created models in a repository, and an SQL-like model enumeration DSL to aid modelers in making and examining multiple model adjustments easily. (b) Because model comparison is less principled today, we propose two new model understanding and comparison schemes. (c) Behind the declarative constructs, ModelHub manages different artifacts in a split back-end storage: structured data, such as the network structure, training logs of a model, lineages of different model versions, and output results, are stored in a relational database, while the learned floating-point parameters of a model are viewed as a set of float matrices and managed in a read-optimized archival storage. (d) Parameters dominate the storage footprint, and floats are well known to be difficult to compress. We study the storage implementation thoroughly in the context of the query workload and advocate a segmented approach to store the learned parameters, where the low-order bytes are stored independently of the high-order bytes. We also develop novel model evaluation schemes that use the high-order bytes alone and progressively uncompress less-significant chunks only if needed to ensure the correctness of an inference query. (e) Due to the different utility of developed models, archiving versioned models using parameter matrix deltas exhibits a new type of dataset versioning problem, which not only optimizes the tradeoff between storage and access but also has model-level constraints. (f) Finally, the VCS model repository design extends naturally to a collaborative format and online system which contains rich model lineages and enables sharing, reusing, and reproducing models which are compatible across training systems.
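The segmented storage and progressive evaluation idea described above can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the system's actual layout: the 2+2 byte split of a big-endian float32, the helper names (`split_float32`, `bounds_from_high`, `interval_dot`), and the use of plain interval arithmetic to decide whether the low-order bytes are needed. Weights are assumed finite.

```python
import struct

def split_float32(x):
    """Split x (rounded to float32, big-endian) into high and low byte pairs."""
    raw = struct.pack(">f", x)
    return raw[:2], raw[2:]

def join_float32(high, low):
    """Rebuild the float32 value from its two byte segments."""
    return struct.unpack(">f", high + low)[0]

def bounds_from_high(high):
    """Interval containing every finite float32 whose high-order bytes are `high`."""
    a = join_float32(high, b"\x00\x00")
    b = join_float32(high, b"\xff\xff")
    return min(a, b), max(a, b)

def interval_dot(highs, xs):
    """Bounds on a dot product using only the high-order bytes of the weights."""
    lo = hi = 0.0
    for h, x in zip(highs, xs):
        a, b = bounds_from_high(h)
        lo += min(a * x, b * x)
        hi += max(a * x, b * x)
    return lo, hi

def exact_dot(segments, xs):
    """Dot product after reading both segments of every weight."""
    total = 0.0
    for (h, l), x in zip(segments, xs):
        total += join_float32(h, l) * x
    return total

weights = [0.731, -0.255, 0.049]
xs = [1.0, 2.0, 3.0]
segments = [split_float32(w) for w in weights]
lo, hi = interval_dot([h for h, _ in segments], xs)

# If the interval excludes zero, the sign of the result is already certain
# and the low-order bytes never need to be read for this decision.
sign_known = lo > 0 or hi < 0
exact = exact_dot(segments, xs)
```

In this toy case the interval computed from the high-order bytes already excludes zero, so a sign-based decision needs no further reads; when it does not, the low-order segment would be fetched and `exact_dot` used instead.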
Contributions: Our key research contributions are:

We propose the first comprehensive lifecycle management system, study its design requirements, and propose declarative constructs (DLV and DQL) to provide high-level abstractions.

We propose two new model understanding and comparison schemes, and study a new matrix reordering problem (for matrix alignment). We analyze its complexity and propose greedy algorithms for solving it.

We develop a read-optimized archival storage system for dealing with a large collection of versioned float matrices.

We formulate a new dataset versioning problem with co-usage constraints, analyze its complexity, and design efficient algorithms for solving it.

We develop a progressive, approximate query evaluation scheme that avoids reading low-order bytes of the parameter matrices unless necessary.

We present a comprehensive evaluation of ModelHub that shows the proposed techniques are useful on real-life models, and scale well on synthetic models.
Outline: In Section II, we provide background on related topics in the modeling lifecycle. In Section III, we present an overview of ModelHub, and discuss the declarative interfaces. We describe the parameter archival store in Section V, present an experimental evaluation in Section VI, and discuss closely related work in Section VII.
II Background on Lifecycle
To support our design decisions, we give an overview of the artifacts and common task practices in the modeling lifecycle. We also examine the dataset versioning problem from recent database community research and point out its inefficiencies for lifecycle management.
Deep Neural Networks: We begin with a brief, simplified overview. A deep learning model is a deep neural network (DNN) consisting of many layers having nonlinear activation functions that are capable of representing complex transformations between input data and desired output. Let $\mathcal{D}$ denote a data domain and $\mathcal{O}$ denote a prediction label domain (e.g., $\mathcal{D}$ may be a set of images; $\mathcal{O}$ may be the names of the set of objects we wish to recognize, i.e., labels). As with any prediction model, a DNN is essentially a mapping function $f: \mathcal{D} \rightarrow \mathcal{O}$ that minimizes a certain loss function $\mathcal{L}$, and is of the following form:

$$f_0 = \sigma_0(W_0 d + b_0), \quad d \in \mathcal{D}$$
$$f_i = \sigma_i(W_i f_{i-1} + b_i), \quad 0 < i \le n$$
$$\mathcal{L}(f_n, l_d), \quad l_d \in \mathcal{O}$$

Here $i$ denotes the layer number, $(W_i, b_i)$ are learnable weights and bias parameters in layer $i$, and $\sigma_i$ is an activation function that nonlinearly transforms the result of the previous layer (common activation functions include sigmoid, ReLU, etc.). Given a learned model and an input $d$, applying $f_0, f_1, \dots, f_n$ in order gives us the prediction label for that input data. In the training phase, the model parameters are learned by minimizing $\mathcal{L}(f_n, l_d)$, typically done through iterative methods, such as stochastic gradient descent.

Fig. 7 shows a classic convolutional DNN, LeNet. LeNet was proposed to solve a prediction task from handwritten images to digit labels $\{0, \dots, 9\}$. In the figure, a cube represents an intermediate tensor, while the dotted lines are unit transformations between tensors. More formally, a layer, $L_i: (W, H, X) \rightarrow Y$, is a function which defines data transformations from tensor $X$ to tensor $Y$. $W$ are the parameters which are learned from the data, and $H$ are the hyperparameters which are given beforehand. A layer is non-parametric if $W = \emptyset$.

In the computer vision community, the layers defining transformations are considered building blocks of a model, and are referred to using a conventional name, such as full layer, convolution layer, pool layer, normalization layer, etc. The chain of layers is often called the network architecture. The LeNet architecture has two convolution layers, each followed by a pool layer, and two full layers, shown with layer shapes and hyperparameters in Fig. 7. Moreover, winning models in recent ILSVRC (ImageNet Large Scale Visual Recognition Challenge) competitions are shown in Table I, with their architectures described by a composition of common layers in regular expression syntax to illustrate the similarities (note that the activation functions and detailed connections are omitted). As one can see, common layers are the building blocks of models.
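To make the layer-by-layer mapping concrete, the following is a minimal sketch of applying the layers of a learned model in order to one input. It assumes each layer applies an activation to a weighted sum plus a bias, as in the overview above; the tiny two-layer network, its weights, and the helper names (`affine`, `forward`) are made up for illustration and use plain Python lists rather than any training system's tensors.

```python
import math

def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, x) for x in v]

def sigmoid(v):
    """Element-wise sigmoid activation."""
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def affine(W, f, b):
    """Compute W f + b for a row-major weight matrix W and bias vector b."""
    return [sum(w * x for w, x in zip(row, f)) + bias
            for row, bias in zip(W, b)]

def forward(layers, d):
    """Apply the layers in order; each layer is (weights, bias, activation)."""
    f = d
    for W, b, sigma in layers:
        f = sigma(affine(W, f, b))
    return f

# A hypothetical two-layer model with hand-picked parameters.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),  # first layer
    ([[1.0, 1.0]], [0.0], sigmoid),                 # second layer
]
out = forward(layers, [2.0, 1.0])
```

Training would adjust the weights and biases to minimize the loss over labeled examples; the sketch covers only the inference direction.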
DNN models are learned from massive data based on some architecture, and modern successful computer vision DNN architectures consist of a large number of float weight parameters (flops), shown in Table I, resulting in large storage footprints (GBs) and long training times (often weeks). Furthermore, the training process is often checkpointed and variations of models need to be explored, leading to many model copies.
Modeling Data Artifacts: Unlike many other prediction methods, DNN modeling results in a very large number of weight parameters, a rich set of hyperparameters, and learning measurements, which are used in unique ways in practice, resulting in a mixture of structured data, files and binary floating-point number artifacts:

Non-convexity & Hyperparameters: The loss-minimization problem underlying a DNN model is typically non-convex, and the learned model is a local optimum of it. The optimization procedure employs many tricks to reach a solution quickly [5]. The set of hyperparameters (e.g., learning rate, momentum) of the optimization algorithm needs to be maintained.

Iterations & Measurements: Models are trained iteratively and checkpointed periodically due to the long running times. A set of learning measurements is collected in various logs, including objective loss values and accuracy scores.

Fine-tuning & Snapshots: Well-known models are often learned from massive real-world data (ImageNet), and require large amounts of resources to train; when prediction tasks do not vary much (e.g., animal recognition vs. dog recognition), the model parameters are reused as initializations and adjusted using new data; this is often referred to as fine-tuning. On the other hand, not all snapshots can be simply deleted, as the convergence is not monotonic.

Provenance & Arbitrary Files: Alternate ways to construct architectures or to set hyperparameters lead to human-in-the-loop model adjustments. Initialization, preprocessing schemes, and hand-crafted scripts are crucial provenance information to explore models and reproduce results.
| Network | Architecture (in regular expression) | Parameters (flops) |
| --- | --- | --- |
| LeNet [6] | | |
| AlexNet [7] | | |
| VGG [8] | | |
| ResNet [9] | | |
Model Adjustment: In a modeling lifecycle for a prediction task, the update-train-evaluate loop is repeated in daily work, and many model variations are adjusted and trained. In general, once data and loss are determined, model adjustment can be done in two orthogonal steps: a) network architecture adjustments, where layers are dropped or added and layer function templates are varied, and b) hyperparameter selections, which affect the behavior of the optimization algorithms. There is much work on search strategies to enumerate and explore both.
Model Sharing: Due to the good generalizability, long training times, and verbose hyperparameters required for large models, there is a need to share the trained models. Jia et al. [3] built an online venue (Caffe Model Zoo) to share models. Briefly, Model Zoo is part of a GitHub repository (Caffe Model Zoo: https://github.com/BVLC/caffe/wiki/ModelZoo) with a markdown file edited collaboratively. To publish models, modelers add an entry with links to download trained parameters in caffe format. Apart from the caffe community, similar initiatives are in place for other training systems.
III ModelHub System Overview
We show the ModelHub architecture, including the key components and their interactions, in Fig. 4. As a modeling lifecycle management tool, ModelHub first has to run on a local machine and integrate with popular DNN systems; second, it needs an online module that serves as a cloud host to save and exchange model versions. At a high level, the functionality is therefore divided between a local component and a remote component. The local functionality includes the integration with popular DNN systems such as caffe, torch, tensorflow, etc., on a local machine or a cluster. The remote functionality includes sharing of models, and their versions, among different groups of users. We primarily focus on the local functionality in this paper.
On the local system side, DLV is a version control system (VCS) implemented as a command-line tool (dlv) that serves as an interface to interact with the rest of the local and remote components. Use of a specialized VCS instead of a general-purpose VCS such as git or svn allows us to better portray and query the internal structure of the artifacts generated in a modeling lifecycle, such as network definitions, training logs, binary weights, and relationships between models. The key utilities of dlv are listed in Table II, grouped by their purpose; we explain these in further detail in Sec. III-B. DQL is a DSL we propose to assist modelers in deriving new models; the query parser and optimizer components in the figure are used to support this language. The model learning module interacts with external deep learning tools that the modeler uses for training and testing. They are essentially wrappers on specific DNN systems that extract and reproduce modeling artifacts. Finally, the ModelHub service is a hosted toolkit to support publishing, discovering and reusing models, and serves a similar role for models as GitHub does for software development or DataHub for data science [10].
III-A Data Model
ModelHub works with two data models: a conceptual model, and a data model for the model versions in a repository.
Model: A model can be understood in different ways, as one can tell from the different model creation APIs in popular deep learning systems. Although a model is essentially a parametric nested mapping function learned from massive examples, in practice it is represented in different ways. Popular training systems often use a graph construction API at different levels to present the nested function, where a graph node is referred to as an operator, layer, gate, etc. In the formulation given in Sec. II, if we view a function (a layer) as a node and a dependency relationship as an edge, the model becomes a directed acyclic graph (DAG) over layers; if a graph node is instead defined at the level of basic tensor arithmetic operators (add, multiply), it is a DAG over operators. Modelers think and communicate at a high level (e.g., convolution layer, full layer), while they may program in an API dialect of low-level arithmetic operators. In ModelHub, we consider a model node as a composition of unit operators (layers), as often adopted by computer vision models. The main reason for this decision is that we focus on productivity improvement in the lifecycle, rather than implementation efficiencies of training and testing.
VCS Data Model: When managing models in the VCS repository, a model version represents the contents in a single version. It consists of a network definition, a collection of weights (each of which is a value assignment for the weight parameters), a set of extracted metadata (such as hyperparameter, accuracy and loss generated in the training phase), and a collection of files used together with the model instance (e.g., scripts, datasets). In addition, we enforce that a model version must be associated with a human readable name for better utility, which reflects the logical groups of a series of improvement efforts over a model in practice.
In the implementation, model versions can be viewed as a relation model_version(name, id, N, W, M, F), where id is part of the primary key of model versions and is auto-generated to distinguish model versions with the same name. In brief, N, W, M, F are the network definition, weight values, extracted metadata and associated files, respectively. The DAG, N, is stored as two tables: Node(id, node, A), where A is a list of attributes such as the layer name, and Edge(from, to). W is managed in our learned parameter storage (Sec. V). M, the metadata, captures the provenance information of training and testing a particular model; it is extracted from training logs by the wrapper module, and includes the hyperparameters used when training a model, the loss and accuracy measures at some iterations, as well as dynamic parameters in the optimization process, such as the learning rate at some iterations. Finally, F is the list of files marked to be associated with a model version, including data files, scripts, initial configurations, etc. Besides a set of model versions, the lineage of the model versions is captured using a separate parent(base, derived, commit) relation. All of these relations are maintained and updated in a relational backend when the modeler runs the different commands that update the repository.
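As an illustration of how these relations could be laid out in a relational backend, the sketch below creates them in an in-memory SQLite database and runs one lineage lookup. Only the relation signatures come from the text; the column types, the JSON encoding of M and F, the string handle used for W, the renamed columns (`from_node`, `to_node`, `commit_msg`, chosen to avoid SQL keywords) and the sample rows are all assumptions.

```python
import json
import sqlite3

# Hypothetical physical schema for the relations described above.
SCHEMA = """
CREATE TABLE model_version (
    name TEXT NOT NULL,      -- human-readable model name
    id   INTEGER NOT NULL,   -- distinguishes versions sharing a name
    W    TEXT,               -- handle into the parameter store (assumed)
    M    TEXT,               -- extracted metadata as JSON (assumed encoding)
    F    TEXT,               -- associated file list as JSON (assumed encoding)
    PRIMARY KEY (name, id)
);
CREATE TABLE node (          -- N, part 1: Node(id, node, A) per model version
    model_name TEXT, model_id INTEGER,
    id INTEGER, node TEXT, A TEXT
);
CREATE TABLE edge (          -- N, part 2: Edge(from, to) per model version
    model_name TEXT, model_id INTEGER,
    from_node INTEGER, to_node INTEGER
);
CREATE TABLE parent (        -- lineage: parent(base, derived, commit)
    base_name TEXT, base_id INTEGER,
    derived_name TEXT, derived_id INTEGER,
    commit_msg TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

meta = json.dumps({"learning_rate": 0.01})
conn.executemany(
    "INSERT INTO model_version VALUES (?, ?, ?, ?, ?)",
    [("lenet", 1, "w/lenet-1", meta, "[]"),
     ("lenet", 2, "w/lenet-2", meta, "[]")],
)
conn.execute(
    "INSERT INTO parent VALUES (?, ?, ?, ?, ?)",
    ("lenet", 1, "lenet", 2, "fine-tune on new data"),
)

# Which versions were derived from version 1 of 'lenet'?
derived = conn.execute(
    "SELECT derived_name, derived_id FROM parent "
    "WHERE base_name = ? AND base_id = ?",
    ("lenet", 1),
).fetchall()
```

Keeping the weights out of the relational tables and storing only a handle mirrors the split between structured data and float matrices described in the text.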
| Type | Command | Description |
| --- | --- | --- |
| model version management | init | Initialize a repository. |
| | add | Add model files to be committed. |
| | commit | Commit the added files. |
| | copy | Scaffold model from an old one. |
| | archive | Archive models in the repository. |
| model exploration | list | List models and related lineages. |
| | desc | Describe a particular model. |
| | diff | Compare multiple models. |
| | eval | Evaluate a model with given data. |
| model enumeration | query | Run DQL clause. |
| remote interaction | publish | Publish a model to ModelHub. |
| | search | Search models in ModelHub. |
| | pull | Download from ModelHub. |
III-B Query Facilities
Once the models and their relationships are managed in dlv, the modeler can interact with them easily. The query facilities we provide can be categorized into two types: a) model exploration queries and b) model enumeration queries.
III-B1 Model Exploration Queries
Model exploration queries interact with the models in a repository, and are used to understand a particular model, to query lineages of the models, and to compare several models. For usability, we design them as query templates exposed via subcommands with options, similar to other VCSs.
List Models & Related Lineages: By default, the query lists all versions of all models, including their commit descriptions and parent versions; it also takes options, such as showing results for a particular model, or limiting the number of versions to be listed.
dlv list [model_name] [commit_msg] [last]
Describe Model: desc shows the extracted metadata from a model version, such as the network definition, learnable parameters, execution footprint (memory and runtime), activations of convolutional DNNs, weight matrices, and evaluation results across iterations. Note that the activation is the intermediate output of a model in computer vision and is often used as an important tool to understand the model. The current output formats are a result of discussions with computer vision modelers to deliver tools that fit their needs. In addition to printing to the console, the query supports HTML output for displaying the images and visualizing the weight distribution.
dlv desc [model_name | version] [output]
Compare Models: diff takes a list of model names or version ids and allows the modeler to compare the models. Most of the desc components are aligned and returned side by side in the query result. We discuss it in Sec. IV.
dlv diff [model_names | versions] [output]
Evaluate Model: eval runs the test phase of the managed models with an optional config specifying different data or changes in the current hyperparameters. The main uses of this exploration query are twofold: 1) for users to get familiar with a new model, and 2) for users who want to test known models on different data or settings. The query returns the accuracy and optionally the activations. It is worth pointing out that complex evaluations can be done via model enumeration queries in DQL.
dlv eval [model_name | versions] [config]
III-B2 Model Enumeration Queries
Model enumeration queries are used to explore variations of currently available models in a repository by changing network structures or tuning hyperparameters. There are several operations that need to be done in order to derive new models: 1) Select models from the repository to improve; 2) Slice particular models to get reusable components; 3) Construct new models by mutating the existing ones; 4) Try the new models on different hyperparameters and pick good ones to save and work with. When enumerating models, we also want to stop exploration of bad models early.
To support this rich set of requirements, we propose the DQL domain specific language, which can be executed using “dlv query”. The challenges of designing the language are: a) the data model is a mix of the relational and the graph data models, and b) the enumeration includes hyperparameter tuning as well as network structure mutations, which are very different operations. We omit a thorough explanation of the language due to space constraints, and instead show the key operators and constructs of the language along with a set of examples (Query 1-4) to show how the requirements are met.
Key Operators: We adopt the standard SQL syntax to interact with the repository. DQL views the repository as a single model version table. As mentioned in Sec. III-A, a model version instance is a DAG, which can be viewed as an object type in modern SQL conventions. In DQL, DAG-level attributes can be referenced using attribute names (e.g., m1.name, m1.creation_time, m2.input, m2.output), while for navigating the internal structures of the DAG, i.e., the Node and Edge EDB, we provide a regexp-style selector operator on a model version to access individual nodes. For example, m1["conv[1,3,5]"] in Query 1 filters the nodes in m1. Once the selector operator returns a set of nodes, the prev and next attributes of a node allow 1-hop traversal in the DAG. Note that POOL("MAX") is one of the standard built-in node templates for condition clauses. Using SPJ operators with object type attribute access and the selector operator, we allow relational queries to be mixed with graph traversal conditions.
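The selector and 1-hop traversal described above can be mimicked outside the query language with a small sketch. It assumes the selector string is interpreted as an ordinary regular expression matched against the full node name; the `Node` class, the `chain` and `select` helpers, and the example network are hypothetical and are not the query language's implementation.

```python
import re

class Node:
    """A DAG node with a name, a layer template, and 1-hop neighbor lists."""
    def __init__(self, name, template):
        self.name = name
        self.template = template
        self.prev = []
        self.next = []

def chain(nodes):
    """Link the nodes into a simple linear DAG and return them."""
    for a, b in zip(nodes, nodes[1:]):
        a.next.append(b)
        b.prev.append(a)
    return nodes

def select(nodes, pattern):
    """Return the nodes whose name fully matches the regexp pattern."""
    rx = re.compile(pattern)
    return [n for n in nodes if rx.fullmatch(n.name)]

m1 = chain([
    Node("conv1", "CONV"), Node("pool1", 'POOL("MAX")'),
    Node("conv2", "CONV"), Node("pool2", 'POOL("AVG")'),
    Node("conv3", "CONV"), Node("fc1", "FULL"),
])

# Analogue of m1["conv[1,3,5]"]: picks conv1 and conv3 (there is no conv5).
selected = select(m1, "conv[1,3,5]")

# Analogue of a condition on the next attribute: keep nodes followed by a max pool.
followed_by_max_pool = [
    n.name for n in selected
    if any(s.template == 'POOL("MAX")' for s in n.next)
]
```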
To retrieve reusable components in a DAG, and mutate it to get new models, we provide slice, construct and mutate operators. Slice originates in program analysis research; given a start and an end node, it returns a subgraph including all paths from the start to the end and the connections which are needed to produce the output. Construct can be found in graph query languages such as SPARQL to create new graphs. In our context, the DAG only has nodes with multiple attributes, which simplifies the language. We allow construct to derive new DAGs by using selected nodes to insert nodes by splitting an outgoing edge or to delete an outgoing edge connecting to another node. Mutate limits the places where insert and delete can occur. For example, Query 2 and 3 show queries which work on the DAG structure and generate reusable subgraphs and new graphs. Query 2 slices a sub-network from matching models between convolution layer ‘conv1’ and full layer ‘fc7’, while Query 3 derives new models by appending a ReLU layer after all convolution layers followed by an average pool. All queries can be nested in the from clause.
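The path-collecting part of slice can be sketched as a reachability computation: a node lies on some path from the start to the end exactly when it is reachable from the start and can itself reach the end. The code below is a simplified illustration under that reading; it does not handle the extra connections needed to produce the output, and the edge-list representation, function names, and the example network are assumptions.

```python
def reachable(adj, src):
    """All nodes reachable from src (including src) in adjacency map adj."""
    seen, stack = set(), [src]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(adj.get(u, ()))
    return seen

def slice_dag(edges, start, end):
    """Edges of the subgraph made of all paths from start to end."""
    fwd, bwd = {}, {}
    for u, v in edges:
        fwd.setdefault(u, []).append(v)
        bwd.setdefault(v, []).append(u)
    # Nodes that are both downstream of start and upstream of end.
    keep = reachable(fwd, start) & reachable(bwd, end)
    return [(u, v) for u, v in edges if u in keep and v in keep]

# Hypothetical network: 'aux' branches off conv1 but never reaches fc7.
edges = [
    ("data", "conv1"), ("conv1", "pool1"), ("pool1", "fc7"),
    ("conv1", "aux"), ("fc7", "prob"),
]
sub = slice_dag(edges, "conv1", "fc7")
```

Here `sub` keeps only the `conv1 -> pool1 -> fc7` chain; the `aux` branch and the layers outside the start/end range are dropped.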
Finally, evaluate can be used to try out new models, with the potential for an early out if expectations are not reached. We separate the network enumeration component from the hyperparameter tuning component; while network enumeration can be done via select or construct and nested in the from clause, we introduce a with operator to take an instance of a tuning config template, and a vary operator to express the combination of activated multi-dimensional hyperparameters and search strategies. auto is a keyword implemented using default search strategies (currently grid search). To stop early and let the user control the stopping logic, we introduce a keep operator to take a rule consisting of stopping condition templates, such as the top-k of the evaluated models, or an accuracy threshold. Query 4 evaluates the models constructed and tries combinations of at least three different hyperparameters, and keeps the top 5 models w.r.t. the loss after 100 iterations.
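The combination of a grid-search vary step with a top-k keep rule can be sketched as follows. This is only an analogy to the operators described above: the function names, the search space, and the toy loss (a stand-in for training a model for a fixed number of iterations and reading its loss) are all assumptions.

```python
from itertools import product

def grid(space):
    """Yield every hyperparameter combination in the space (grid search)."""
    names = sorted(space)
    for values in product(*(space[n] for n in names)):
        yield dict(zip(names, values))

def enumerate_and_keep(space, evaluate, top_k):
    """Evaluate each configuration and keep the top_k with the lowest loss."""
    scored = [(evaluate(cfg), cfg) for cfg in grid(space)]
    scored.sort(key=lambda pair: pair[0])
    return scored[:top_k]

def toy_loss(cfg):
    """Stand-in for the loss after a short training run (hypothetical)."""
    return (cfg["lr"] - 0.01) ** 2 + (cfg["momentum"] - 0.9) ** 2

space = {"lr": [0.1, 0.01, 0.001], "momentum": [0.8, 0.9]}
best = enumerate_and_keep(space, toy_loss, top_k=2)
```

A threshold-based keep rule would replace the final slice with a filter on the loss, and a real early-stopping variant would abandon a configuration partway through training rather than after a full evaluation.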
Besides the query facilities described so far, there is a collection of features that we lack the space to describe in detail, such as extraction of and interaction with model metadata, weight management, and provenance queries over the models.
IIIC Model Publishing & Sharing
As the model repository is standalone, we host the repositories as a whole in a service. The modeler can use publish to push the repository for archiving, collaborating or sharing, and use search and pull to discover and reuse remote models. We envision that such a form of collaboration can facilitate a learning environment, as all versions in the lifecycle are accessible and easy to understand.
IIID ModelHub Implementation
On the local side, the current implementation of ModelHub maintains the data model in multiple backends and utilizes git to manage the arbitrary file diffs. Various queries are decomposed and sent to different backends and chained accordingly. On the remote side, the standalone repositories are hosted as a whole in the service described in the previous subsection.
IV Model Comparison
The subtle differences between model versions are hard to grasp, and comparing models is time consuming and requires heavy scripting under current training systems. As accuracy is the goal of the whole process, modelers often judge models by simple performance measures (e.g., loss, accuracy). Shared scripts for tasks such as plotting architectures and optimization training logs can be found in the user communities of specific training systems.
The system supports a set of comparison schemes for different artifacts, shown in Table III. The mixture of data types calls for different flavors of diff operations. Besides common practice, newly proposed schemes are highlighted in italic font.
Artifact  Type  Scheme 

Network Architecture  Graph  Plot, Graph Edit Distance 
Learned Parameters  Tensor  Plot, Matrix Reordering 
Prediction Result  Relational  Set Diff 
Hyperparameters  KeyValue  Set Diff 
Optimization Routine  Time Series  Plot 
Plotting is an important tool to understand a network architecture, but comparing two network architectures or two tensors is not easy. By viewing a network architecture as a DAG, the comparison can be done via a routine that minimizes graph edit distance. For tensor comparison, we propose a new matrix reordering scheme.
IVA Align Network Architectures
IVB Align Learned Parameters
For ease of illustration, we focus on 2D matrices, as higher-dimensional tensors can be lowered to connections between input and output neurons. Informally, the basic comparison idea is given two matrices and , we permute ’s rows and columns accordingly in order to find the most similar w.r.t. a cost function, e.g. Euclidean distance, best compression bits. As an example, the direct and best delta matrices are shown in Fig. 5. In Fig. 5(a), we show the weight matrices of the LeNet conv1 layer trained with the same initialization but flipped images, for the scenario of weight reuse. In Fig. 5(b), LeNet is trained with random initialization and the same image order. As we can see, alignment not only shows the connections between two matrices, but also yields more zeros. As shown later in the evaluation, together with segmented float matrices, the alignment operator contributes to the overall storage performance. Next, we present the matrix alignment problem formally. As the matrices to be aligned do not necessarily have the same dimensions, we first define a permutation matrix with the capability to adapt dimensions.
[Permutation Matrix]
Let a permutation
, is all possible permutations. Given two positive integer , a permutation matrix of is a matrix , where
Using the permutation matrix, a permutation can be used in matrix multiplication to reorder matrix by row or column accordingly.
A row permutation of to :
A column permutation of to :
With the permutation matrix, given two matrices and a cost function, we formulate the matrix alignment problem as follows:
[Matrix Alignment] Given two real matrices, , , we want to find two permutations to reorder to , and , such that:
where is a cost function. We denote the best delta matrix as .
Let be norm, given two following matrices:
,
the permutations
minimize the cost function:
=
Complexity Analysis: To the best of our knowledge, the matrix alignment problem has not been studied in the literature. We show its NP-completeness using the graph edit distance problem.
The Matrix Alignment Problem is NP-complete when the cost function is additive.
IVB1 Greedy Hill Climbing Algorithm
We propose a randomized hill climbing approach to address the matrix alignment problem. As an overview, by noticing that if we fix one permutation at a time, minimizing the cost by varying the other permutation is the same as a graph matching problem in a biclique, we can iteratively solve a series of maximum weighted bipartite matching problems and find a local optimum. With randomized initial permutation, we run the algorithm multiple times, and choose the best solution among the found local optima.
Now we illustrate the algorithm in detail. First, fixing one permutation at a time, e.g. the column permutation , the matrix alignment problem is finding the best row permutation such that:
Next, we define the row alignment biclique (RAB), and show the connections between its maximum weighted bipartite matching solution and the best row alignment in detail.
[Row Alignment Biclique] Given two real matrices, , , a row alignment biclique is a bipartite graph , where
is a set of vertices representing all row vectors of
, , and is the ith row of ; while represents the row vectors of the matrix:and is the jth row of . Each connects a row pair between and , and is the weight, i.e. .
Given and , a maximal bipartite matching of the RAB is a set of edges sharing no vertices and . Given an , we can get a derived row permutation as follows:
The maximum weighted bipartite matching is a maximal bipartite matching with the biggest among all maximal bipartite matchings .
In Fig. 8, we use the two matrices , from Exp. 5 to show its row alignment biclique. is the resized matrix of by applying two permutation matrices and . The weight of each edge is shown in the figure. For instance, .
The maximal matching with the sum of weight is the maximum weighted bipartite matching. The derived permutation is .
The derived row permutation of the maximum weighted bipartite matching is also the solution of the matrix alignment problem with fixed column permutation . As has the largest , the derived permutation also has the minimum .
Due to the symmetric definition of the matrix alignment problem, we can define column alignment biclique similarly, as well as its maximum weighted bipartite matching. The derived permutation is also the solution for the matrix alignment problem with fixed row permutation .
Iterative maximum weighted bipartite matching in Alg. 1 converges to a local optimum solution.
The full algorithm is described in Alg. 1, where is the maximum number of iterations and is the number of random initial points. By applying the bipartite matching iteratively, i.e. fixing a row or column permutation using the previous iteration's result, the cost is monotonically decreasing and we can find a local optimum of the matrix alignment problem. In other words, it is a hill climbing algorithm. With random initial permutations, we can compute multiple local optima and choose the best one.
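The alternating structure of the algorithm can be sketched as follows. This is a simplified illustration under several assumptions that are not in the text: both matrices have the same dimensions, the cost is the squared Euclidean distance, the cost is minimized directly rather than phrased as a maximum weighted matching, and each assignment step is solved by brute force over permutations (viable only for tiny matrices) instead of a bipartite matching solver. All function names are hypothetical.

```python
import itertools
import random


def cost(a, b, rows, cols):
    """Squared Euclidean distance between a and b after permuting b."""
    return sum((a[i][j] - b[rows[i]][cols[j]]) ** 2
               for i in range(len(a)) for j in range(len(a[0])))


def best_rows(a, b, cols):
    """Best row permutation for a fixed column permutation (brute force)."""
    return min(itertools.permutations(range(len(a))),
               key=lambda rows: cost(a, b, rows, cols))


def best_cols(a, b, rows):
    """Best column permutation for a fixed row permutation (brute force)."""
    return min(itertools.permutations(range(len(a[0]))),
               key=lambda cols: cost(a, b, rows, cols))


def align(a, b, restarts=5, max_iter=20, seed=0):
    """Alternate row and column steps from several starting points."""
    rng = random.Random(seed)
    best = None
    for r in range(restarts):
        cols = list(range(len(a[0])))
        if r > 0:                      # first start is the identity
            rng.shuffle(cols)
        rows = best_rows(a, b, cols)
        current = cost(a, b, rows, cols)
        for _ in range(max_iter):
            cols = best_cols(a, b, rows)
            rows = best_rows(a, b, cols)
            updated = cost(a, b, rows, cols)
            if updated >= current:     # no further improvement: local optimum
                break
            current = updated
        if best is None or current < best[0]:
            best = (current, tuple(rows), tuple(cols))
    return best


a = [[1, 2], [3, 4]]
b = [[4, 3], [2, 1]]                   # a with rows and columns reversed
best_cost, row_perm, col_perm = align(a, b)
```

Because each step optimizes over a set that contains the current permutation, the cost never increases between iterations, which is why the loop can stop at the first non-improving round.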
V Parameter Archival Storage
Modeling lifecycle for DNNs, and machine learning models in general, is centered around the learned parameters, whose storage footprint can be very large. The goal of the parameter archival storage is to maintain a large number of learned models as compactly as possible, without compromising on the query performance. Before introducing our design, we first discuss the queries of interest, and some key properties of the model artifacts. We then describe different options to store a single float matrix, and to construct deltas (differences) between two matrices. We then formulate the optimal version graph storage problem, discuss how it differs from the prior work, and present algorithms for solving it. Finally, we develop a novel approximate model evaluation technique, suitable for the segmented storage technique that the system uses.
VA Weight Parameters & Query Types of Interest
We illustrate the key weight parameter artifacts and the relationships among them in Fig. 9, and also explain some of the notations used in this section. At a high level, the predecessor-successor relationships between all the developed models are captured as a version graph. These relationships are user-specified and conceptual in nature, and the interpretation is left to the user (i.e., an edge indicates that was an updated version of the model that the user checked in after , but the nature of this update is irrelevant for storage purposes). A model version itself consists of a series of snapshots, , which represent checkpoints during the training process (most systems will take such snapshots due to the long running times of the iterations). We refer to the last or the best checkpointed snapshot as the latest snapshot of , and denote it by .
One snapshot, in turn, consists of intermediate data and trained parameters (e.g., in Fig. 7, the model has parameters for , and dimensions for , where is the minibatch size). Since is useful only if training needs to be resumed, only is stored in . Outside of a few rare exceptions, can always be viewed as a collection of float matrices, , which encode the weights on the edges from outputs of the neurons in one layer to the inputs of the neurons in the next layer. Thus, we treat a float matrix as a first class data type. (Footnote 2: We do not make a distinction about the bias weight; the typical linear transformation is treated as .)
The retrieval queries of interest are dictated by the operations that are done on these stored models, which include: (a) testing a model, (b) reusing weights to fine-tune other models, (c) comparing parameters of different models, (d) comparing the results of different models on a dataset, and (e) model exploration queries (Sec. IIIB). Most of these operations require execution of group retrieval queries, where all the weight matrices in a specific snapshot need to be retrieved. This is different from range queries seen in array databases (e.g., SciDB), and also has unique characteristics that influence the storage and retrieval algorithms.

Similarity among Fine-tuned Models: Although non-convexity of the training algorithm and differences in network architectures across models lead to non-correlated parameters, the widely-used fine-tuning practices (Sec. II) generate model versions with similar parameters, resulting in efficient delta encoding schemes.

Co-usage constraints: Prior work on versioning and retrieval [11] has focused on retrieving a single artifact stored in its entirety. However, we would like to store the different matrices in a snapshot independently of each other, but we must retrieve them together. These co-usage constraints make the prior algorithms inapplicable, as we discuss later.

Low Precision Tolerance: DNNs are well-known for their tolerance to using low-precision floating point numbers (Sec. VII), both during training and evaluation. Further, many types of queries (e.g., visualization and comparisons) do not require retrieving the full-precision weights.

Unbalanced Access Frequencies: Not all snapshots are used frequently. The latest snapshots with the best testing accuracy are used in most of the cases. The checkpointed snapshots have limited usages, including debugging and comparisons. The system provides a set of storage schemes to let the user trade off between storage and lossyness.
VB Parameters As Segmented Float Matrices
Float Data Type Schemes: Although binary (-1/1) or ternary (-1/0/1) matrices are sometimes used in DNNs, in general the system handles real number weights. Due to different usages of snapshots, it offers a handful of float representations to let the user trade off storage efficiency with lossyness.

Float Point: DNNs are typically trained with single precision (32 bit) or, less likely, double precision (64 bit) floats. This scheme uses the standard IEEE 754 floating point encoding to store the weights with sign, exponent, and mantissa bits. The IEEE half-precision proposal (16 bits) and the TensorFlow truncated 16-bit format [4] are supported as well and can be used if desired.

Fixed Point: Compared with float point encoding, where each float has its own exponent bits, fixed point encoding has a global exponent per matrix, and each float number only has sign and mantissa using all bits. This scheme is a lossy scheme as tail positions are dropped, and a maximum of different values can be expressed. The entropy of the matrix also drops considerably, aiding in compression.

Quantization: Similarly, supports quantization using bits, , where possible values are allowed. The quantization can be done in random manner or uniform manner by analyzing the distribution, and a coding table is used to maintain the quantization information (with only the integer codes stored in the matrices in ). This is most useful for snapshots whose weights are primarily used for finetuning or initialization.
The float point schemes presented here are not new, and are used in DNN systems in practice [12, 13, 14], with a focus on their implications for the training/testing phases. As a lifecycle management tool, the system lets experienced users select schemes rather than deleting snapshots due to resource constraints. Our evaluation shows storage/accuracy tradeoffs of these schemes.
Scheme  Param. Bits  Compress  Lossyness  Usage 

Float Point  64/32/16  Fair  Lossless  latest 
Fixed Point  32/16/8  Good  Good  latest 
Quantization  8/k  Excellent  Poor  other 
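To make the two lossy schemes concrete, the sketch below shows one possible fixed point encoding with a shared exponent and one possible uniform quantization with a coding table. The function names, the choice of rounding, and the exact bit layout are assumptions for illustration; the actual encodings used by the system may differ.

```python
import math


def to_fixed_point(weights, bits):
    """Share one exponent per matrix; keep a signed integer per weight."""
    max_abs = max(abs(w) for w in weights)
    # frexp gives max_abs = m * 2**exponent with 0.5 <= m < 1.
    exponent = math.frexp(max_abs)[1] if max_abs else 0
    scale = 2.0 ** (bits - 1 - exponent)
    limit = 2 ** (bits - 1) - 1
    ints = [max(-limit, min(limit, round(w * scale))) for w in weights]
    return ints, exponent


def from_fixed_point(ints, exponent, bits):
    scale = 2.0 ** (bits - 1 - exponent)
    return [i / scale for i in ints]


def quantize_uniform(weights, k):
    """Map floats to k-bit codes plus a coding table of 2**k values."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** k
    step = (hi - lo) / (levels - 1) if hi > lo else 1.0
    table = [lo + i * step for i in range(levels)]
    codes = [round((w - lo) / step) for w in weights]
    return codes, table


def dequantize(codes, table):
    return [table[c] for c in codes]


ints, exponent = to_fixed_point([1.0, -0.4, 0.1], bits=8)
restored = from_fixed_point(ints, exponent, bits=8)
codes, table = quantize_uniform([-1.0, -0.4, 0.1, 0.3, 1.0], k=2)
approx = dequantize(codes, table)
```

In both cases only small integers are stored per weight, which is what lowers the entropy of the matrix and helps the downstream compressor.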
Byte-wise Segmentation for Float Matrices: One challenge for the parameter archival storage is the high entropy of float numbers in the float arithmetic representations, which makes them very hard to compress. The compression ratios reported in related work for scientific float point datasets, e.g., simulations, are very low. The state-of-the-art compression schemes do not work well for parameters either (Sec. VII).
Method  LeNet  AlexNet  VGG16 

fpzip  to add  
SWavlet  
ISOBAR 
By exploiting low-precision tolerance, we adopt byte-wise decomposition from prior work [15, 16] and extend it to our context to store the float matrices. The basic idea is to separate the high-order and low-order mantissa bits, so that a float matrix is stored in multiple chunks; the first chunk consists of 8 high-order bits, and the rest are segmented one byte per chunk. One major advantage is that the high-order bits have low entropy, and standard compression schemes (e.g., zlib) are effective for them.
Apart from the simplicity of the approach, the key benefits of the segmented approach are two-fold: (a) it allows offloading low-order bytes to remote storage, (b) queries can read high-order bytes only, in exchange for tolerating small errors. Comparison and exploration queries (desc, diff) can easily tolerate such errors and, as we show in this paper, eval queries can also be made tolerant to these errors.
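A minimal sketch of the segmentation idea for 32-bit floats is given below. It assumes a big-endian IEEE 754 layout split into four one-byte planes, each compressed separately with zlib; the function names and the choice to read missing low-order bytes as zero are assumptions for illustration.

```python
import struct
import zlib


def segment(weights):
    """Split float32 weights into 4 byte planes, high-order byte first."""
    # Big-endian packing puts sign and exponent bits in the first byte.
    raw = struct.pack(">%df" % len(weights), *weights)
    planes = [raw[i::4] for i in range(4)]
    return [zlib.compress(p) for p in planes]


def read_prefix(chunks, n_planes, count):
    """Rebuild weights from the first n_planes planes; the rest read as 0."""
    planes = [zlib.decompress(c) for c in chunks[:n_planes]]
    planes += [bytes(count)] * (4 - n_planes)
    raw = bytes(b for quad in zip(*planes) for b in quad)
    return list(struct.unpack(">%df" % count, raw))


chunks = segment([0.1, -2.5, 1.0])
full = read_prefix(chunks, 4, 3)      # all four planes: exact float32 values
coarse = read_prefix(chunks, 2, 3)    # two high-order planes only
```

Reading only the two high-order planes keeps the sign, the exponent and seven mantissa bits, so each value is truncated toward zero with a relative error below 2^-7, while the low-order planes can stay on remote storage until they are needed.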
Read (bits)  Float (64)  Float (32)  Fix (k/n)  Fix (32/n) 

16  todo  
24  
32 
Delta Encoding Across Snapshots: We observed that, due to the non-convexity in training, even retraining the same model with slightly different initializations results in very different parameters. However, the parameters from checkpoint snapshots for the same or similar models tend to be close to each other. Furthermore, across model versions, fine-tuned models generated using fixed initializations from another model often have similar parameters. The observations naturally suggest use of delta encoding between checkpointed snapshots in one model version and latest snapshots across multiple model versions; i.e., instead of storing all matrices in entirety, we can store some in their entirety and others as differences from those. Two possible delta functions (denoted ) are arithmetic subtraction and bitwise XOR. (Footnote 3: Delta functions for matrices with different dimensions are discussed in the long version of the paper; techniques in Sec. V work with minor modification.) We find the compression footprints when applying the diff in different directions are similar. We study the delta operators on real models in Sec. VI.
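The two delta functions can be sketched as follows for 32-bit floats. The helper names are hypothetical, and the matrices are flattened to lists of equal length for brevity. Note that XOR on the bit patterns is exactly invertible, whereas recovering a value through floating point subtraction and addition may incur rounding in general.

```python
import struct


def f32_bits(x):
    """Bit pattern of x as a 32-bit unsigned integer (IEEE 754 single)."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_f32(b):
    return struct.unpack(">f", struct.pack(">I", b))[0]


def xor_delta(base, target):
    return [f32_bits(a) ^ f32_bits(b) for a, b in zip(base, target)]


def apply_xor(base, delta):
    return [bits_f32(f32_bits(a) ^ d) for a, d in zip(base, delta)]


def sub_delta(base, target):
    return [b - a for a, b in zip(base, target)]


def apply_sub(base, delta):
    return [a + d for a, d in zip(base, delta)]


base = [0.5, -1.25, 3.0]
target = [0.5, -1.25, 3.5]            # e.g. a later checkpoint
d_xor = xor_delta(base, target)
d_sub = sub_delta(base, target)
```

When two snapshots are close, most entries of either delta are zero or share their high-order bits, which is what makes the delta cheaper to compress than the full matrix.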
VC Optimal Parameter Archival Storage
Given the above background, we next address the question of how to best store a collection of model versions, so that the total storage footprint occupied by the large segmented float matrices is minimized while the retrieval performance is not compromised. This recreation/storage tradeoff sits at the core of any version control system. In recent work [11], the authors study six variants of this problem, and show the NP-hardness of most of those variations. However, their techniques cannot be directly applied here, primarily because their approach is not able to handle the group retrieval (co-usage) constraints.
We first introduce the necessary notation, then discuss the differences from prior work, and present the new techniques we developed. In Fig. 9, a model version consists of time-ordered checkpointed snapshots, . Each snapshot, consists of a named list of float matrices representing the learned parameters. All matrices in a repository, , are the parameter artifacts to archive. Each matrix is either stored directly, or is recovered through another matrix via a delta operator , i.e. , where is the delta computed using one of the techniques discussed above. In the latter case, the matrix is stored instead of . To unify the two cases, we introduce an empty matrix , and define .
[Matrix Storage Graph] Given a repository of model versions , let be an empty matrix, and be the set of all parameter matrices. We denote by the available deltas between all pairs of matrices. Abusing notation somewhat, we also treat as the set of all edges in a graph where are the vertices. Finally, let denote the matrix storage graph of , where edge weights are storage cost and recreation cost of an edge respectively.
[Matrix Storage Plan] Any connected subgraph of is called a matrix storage plan for , and denoted by , where and .
In Fig. 10(a), we show a matrix storage graph for a repository with two snapshots, and . The weights associated with an edge reflect the cost of materializing the matrix and retrieving it directly. On the other hand, for an edge between two matrices, e.g., , the weights denote the storage cost of the corresponding delta and the recreation cost of applying that delta. In Fig. 10(b) and 10(c), two matrix storage plans are shown.
For a matrix storage plan , stores all its edges and is able to recreate any matrix following a path starting from . The total storage cost of , denoted as , is simply the sum of edge storage costs, i.e. . Computation of the average snapshot recreation cost is more involved and depends on the retrieval scheme used:

Independent scheme recreates each matrix one by one by following the shortest path () to from . In that case, the recreation cost is simply computed by summing the recreation costs for all the edges along the shortest path.

Parallel scheme accesses all matrices of a snapshot in parallel (using multiple threads); the longest shortest path from defines the recreation cost for the snapshot.

Reusable scheme considers caching deltas on the way, i.e., if paths from to two different matrices overlap, then the shared computation is only done once. In that case, we need to construct the lowest-cost Steiner tree () involving and the matrices in the snapshot. However, because multiple large matrices need to be kept in memory simultaneously, the memory consumption of this scheme can be large.
Retrieval Scheme  Recreation  Solution of Prob.1 

Independent ()  Spanning tree  
Parallel ()  Spanning tree  
Reusable ()  Subgraph 
The system can be configured to use any of these options during the actual query execution. However, solving the storage optimization problem with the Reusable scheme is nearly impossible; since the Steiner tree problem is NP-hard, just computing the cost of a solution becomes intractable, making it hard to even compare two different storage solutions. Hence, during the storage optimization process, only the Independent or Parallel schemes are supported.
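For a tree-shaped storage plan, the storage cost and the snapshot recreation costs under the Independent and Parallel schemes can be computed as in the sketch below. The plan encoding (each matrix maps to its parent, the storage cost and the recreation cost of the connecting edge), the root name `v0`, and all numbers are assumptions for illustration; they are not the values from Fig. 10.

```python
def path_cost(plan, node, root="v0"):
    """Sum of edge recreation costs on the path from the root to node."""
    total = 0
    while node != root:
        parent, _storage, recreation = plan[node]
        total += recreation
        node = parent
    return total


def storage_cost(plan):
    """Total storage cost: every edge in the plan is stored once."""
    return sum(storage for _parent, storage, _recreation in plan.values())


def snapshot_cost(plan, snapshot, scheme):
    """Recreation cost of one snapshot under the given retrieval scheme."""
    costs = [path_cost(plan, m) for m in snapshot]
    if scheme == "independent":       # matrices recreated one by one
        return sum(costs)
    if scheme == "parallel":          # slowest matrix dominates
        return max(costs)
    raise ValueError("unsupported scheme: %s" % scheme)


# matrix -> (parent, storage cost, recreation cost); hypothetical numbers.
plan = {
    "m1": ("v0", 10, 2),              # materialized
    "m2": ("m1", 3, 1),               # stored as a delta from m1
    "m3": ("v0", 8, 2),
    "m4": ("m3", 2, 1),
}
total_storage = storage_cost(plan)
s2_independent = snapshot_cost(plan, ["m2", "m4"], "independent")
s2_parallel = snapshot_cost(plan, ["m2", "m4"], "parallel")
```

Both costs are cheap to evaluate on a tree, which is what makes these two schemes usable inside an optimization loop, in contrast to the Reusable scheme.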
In the example above, the edges are shown as being undirected indicating that the deltas are symmetric. In general, we allow for directed deltas to handle asymmetric delta functions, and also for multiple directed edges between the same two matrices. The latter can be used to capture different options for storing the delta; e.g., we may have one edge corresponding to a remote storage option, where the storage cost is lower and the recreation cost is higher; whereas another edge (between the same two matrices) may correspond to a local SSD storage option, where the storage cost is the highest and the recreation cost is the lowest. Our algorithms can thus automatically choose the appropriate storage option for different deltas.
Similarly, the system is able to make decisions at the level of byte segments of float matrices, by treating them as separate matrices that need to be retrieved together in some cases, and not in other cases. This, combined with the ability to incorporate different storage options, is a powerful generalization that allows decisions to be made at a very fine granularity.
Given this notation, we can now state the problem formally. Since there are multiple optimization metrics, we assume that constraints on the retrieval costs are provided and ask to minimize the storage.
[Optimal Parameter Archival Storage Problem]
Given a matrix storage graph , let be the snapshot recreation cost budget for each . Under a retrieval scheme , find a matrix storage plan that minimizes the total storage cost, while satisfying recreation constraints, i.e.:
In Fig. 10(b), without any recreation constraints, we show the best storage plan, which is the minimum spanning tree based on of the matrix storage graph, . Under independent scheme , and . In Fig. 10(c), after adding two constraints and , we show an optimal storage plan satisfying all constraints. The storage cost increases, , while and .
Although this problem variation might look similar to the ones considered in recent work [11], none of the variations studied there can handle the co-usage constraints (i.e., the constraints on simultaneously retrieving a group of versioned data artifacts). One way to enforce such constraints is to treat the entire snapshot as a single data artifact that is stored together; however, that may force us to use an overall suboptimal solution because we would not be able to choose the most appropriate delta at the level of individual matrices. Another option would be to subdivide the retrieval budget for a snapshot into constraints on individual matrices in the snapshot. As our experiments show, that can lead to significantly higher storage utilization. Thus the formulation above is a strict generalization of the formulations considered in that prior work.
The Optimal Parameter Archival Storage Problem is NP-hard for all retrieval schemes in Table VII. We reduce Prob. 5 in [11] to the independent scheme , and Prob. 6 in [11] to the parallel scheme , by mapping each dataset to a vertex in the storage graph, and introducing a snapshot holding all matrices with recreation bound . For the reusable scheme , it is at least as hard as the weighted set cover problem, by reducing a set to an edge with storage cost as weight, an item to a vertex in , and setting the recreation budget .
The optimal solution for Problem VII is a spanning tree when the retrieval scheme is independent or parallel. Suppose we have a non-tree solution satisfying the constraints that also minimizes the objective. Note that the parallel and independent schemes are based on the shortest path in from to each matrix , so the union of the shortest paths forms a shortest path tree. If we remove the edges which are not in the shortest path tree from the plan to , it results in a lower objective , while still satisfying all recreation constraints, which leads to a contradiction. Lemma VC shows is a spanning tree and connects our problem to a class of constrained minimum spanning tree problems. The above lemma is not true for the reusable scheme; snapshot Steiner trees satisfying different recreation constraints may share intermediate nodes, resulting in a subgraph solution.
Constrained Spanning Tree Problem: In Problem VII, storage cost minimization while ignoring the recreation constraints leads to a minimum spanning tree (MST) of the matrix storage graph; whereas the snapshot recreation constraints are best satisfied by using a shortest path tree (SPT). These problems are often referred to as constrained spanning tree problems [17] or shallow-light tree constructions [18], which have been studied in areas other than dataset versioning, such as VLSI designs. Khuller et al. [19] propose an algorithm called LAST to construct such a “balanced” spanning tree in an undirected graph . LAST starts with a minimum spanning tree of the provided graph, traverses it in a DFS manner, and adjusts the tree by changing parents to ensure the path length in constructed solution is within (1+) times of shortest path in , i.e. , while total storage cost is within (1+) times of MST. In our problem, the co-usage constraints of matrices in each snapshot form hyperedges over the graph, making the problem more difficult.
In the rest of the discussion, we adapt metaheuristics for constrained MST problems to develop two algorithms: the first one (MT) is based on an iterative refinement scheme, where we start from an MST and then adjust it to satisfy constraints, similar to the LMT algorithm proposed in earlier work [11]; the second one is a priority-based tree construction algorithm (PT), which adds nodes one by one and encodes the heuristic in the priority function. Both algorithms aim to solve the parallel and independent recreation schemes, and thus can also find a feasible solution for the reusable scheme. Due to the large memory footprints of intermediate matrices, we leave improving reusable scheme solutions for future work.
MT: The algorithm starts with as the MST of , and iteratively adjusts to satisfy the broken snapshot recreation constraints, , by swapping one edge at a time. We denote as the parent of , and , and successors of in as . A swap operation on to edge changes parent of to in . A swap operation on changes storage cost of by , and changes recreation costs of and its successors by: .
The proof can be derived from definition of and by inspection. When selecting edges in , we choose the one which has the largest marginal gain for unsatisfied constraints:
(1)  
(2) 
The actual formula used is somewhat more complex, and handles negative denominators. Eq. 1 sums the gain of recreation cost changes among all matrices in the same snapshot (for the independent scheme), while Eq. 2 uses the max change instead (for the parallel scheme).
The algorithm iteratively swaps edges and stops if all recreation constraints are satisfied or no edge returns a positive gain. A single step examines edges and unsatisfied constraints, and there are at most steps. Thus the complexity is bounded by .
PT: This algorithm constructs a solution by “growing” a tree starting with an empty tree. The algorithm examines the edges in in the increasing order by the storage cost ; a priority queue is used to maintain all the candidate edges and is populated with all the edges from in the beginning. At any point, the edges in are the ones that connect a vertex , to a vertex outside . Using an edge (s.t., ) popped from , the algorithm tries to add to with minimum storage increment . Before adding , it examines whether the constraints of affected groups (s.t., ) are satisfied using actual and estimated recreation costs for vertices in and respectively; if , actual recreation cost is used, otherwise the lower bound of it, i.e. is used as an estimation. We refer the estimation for as .
Once an edge is added to , the inner edges of newly added are dequeued from , while the outer edges are enqueued. If the storage cost of existing vertices in can be improved (i.e., ), and recreation cost is not more (i.e. ), then the parent of in T is replaced by via the swap operation, which decreases the storage cost and affected group recreation cost.
The algorithm stops if is empty and is a spanning tree. In the case when is empty but , an adjustment operation on to increase storage cost and satisfy the group recreation constraints is performed. For each , we append it to , then in each unsatisfied group that belongs to, optimally, we want to choose a set of to change their parents in , such that the increase in storage cost is minimized while recreation cost is satisfied. The optimal adjustment itself can be viewed as a knapsack problem with extra non-cyclic constraint of , which is NP-hard. Instead, we use the same heuristic in Eq. 1 to adjust one by one by replacing its parent to until the group constraint in is satisfied.
where is number of successors of in . In other words, we choose a neighbor to replace , having the maximum marginal gain of recreation in the unsatisfied groups w.r.t. the storage increment. As before, the parallel scheme differs from independent case in the adjustment operator using Eq. 2. The complexity of this algorithm is .
VD Model Evaluation Scheme
Model evaluation, i.e., applying a forward pass on a data point to get the prediction result, is a common task to explore, debug and understand models. Given a storage plan, an eval query requires uncompressing and applying deltas along the path to the model. We develop a novel model evaluation scheme utilizing the segmented design, that progressively accesses the low-order segments only when necessary, and guarantees no errors for arbitrary data points.
The basic intuition is as follows: when retrieving segmented parameters, we know the minimum and maximum values of the parameters (since higher order bytes are retrieved first). If the prediction result is the same for the entire range of those values, then we do not need to access the lower order bytes. However, considering the high dimensionality of the parameters, the non-linearity of the model, and the unknown full precision values when issuing the query, it is not clear if this is feasible.
We define the problem formally, and illustrate the determinism condition that we use to develop our algorithm. Our technique is inspired by theoretical stability analysis in numerical analysis. We make the formulation general to be applicable to other prediction functions. The basic assumption is that the prediction function returns a vector showing relative strengths of the classification labels, and the dimension index with the maximum value is used as the predicted label.
[Parameter Perturbation Error Determination] Given a prediction function , where is the data and are the learned weights, the prediction result is the dimension index with the highest value in the output . When value is uncertain, i.e., each is known to be in the range , determine whether can be ascertained without error.
When is uncertain, the output is uncertain as well. However, if we can bound the individual entries in , then the following is an applicable sufficient condition for determining the prediction without error:
Let vary in range . If such that , then prediction result is .
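The condition above can be checked mechanically once lower and upper bounds on the outputs are available. The sketch below propagates intervals through one linear layer followed by ReLU and then tests whether some label's lower bound exceeds every other label's upper bound. It is a plain interval arithmetic illustration under assumed names (`linear_bounds`, `relu_bounds`, `certain_label`) and an exactly known bias; it is not the authors' derivation of the perturbation bounds.

```python
def linear_bounds(x_lo, x_hi, w_lo, w_hi, bias):
    """Bounds of W x + b when each W[i][j] and x[j] lies in an interval."""
    out_lo, out_hi = [], []
    for row_lo, row_hi, b in zip(w_lo, w_hi, bias):
        lo = hi = b
        for xl, xh, wl, wh in zip(x_lo, x_hi, row_lo, row_hi):
            # The extreme values of a product of two intervals are
            # attained at the interval endpoints.
            products = (wl * xl, wl * xh, wh * xl, wh * xh)
            lo += min(products)
            hi += max(products)
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi


def relu_bounds(lo, hi):
    """ReLU is monotonic, so it can be applied to both bounds directly."""
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]


def certain_label(lo, hi):
    """Return k if lo[k] beats every other hi[i]; otherwise None."""
    for k in range(len(lo)):
        if all(lo[k] > hi[i] for i in range(len(hi)) if i != k):
            return k
    return None


x = [1.0, 2.0]                                    # exactly known input
tight = linear_bounds(x, x, [[0.9, 0.4], [-0.1, 0.0]],
                      [[1.0, 0.5], [0.0, 0.1]], [0.0, 0.0])
loose = linear_bounds(x, x, [[0.0, 0.0], [0.0, 0.0]],
                      [[1.0, 1.0], [1.0, 1.0]], [0.0, 0.0])
label_tight = certain_label(*relu_bounds(*tight))
label_loose = certain_label(*relu_bounds(*loose))
```

When the check returns None, the query would fetch the next lower-order byte segment to tighten the weight intervals and try again, which matches the progressive access described in this section.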
Next we illustrate a query procedure, that given data , evaluates a with weight perturbations and determines the output perturbation on the fly. Recall that is a nested function (Sec. II), where the input data is transformed layer by layer, and each layer is of the form:
we derive the output perturbations when evaluating a model while preserving perturbations step by step:
Next, activation function is applied. Most of the common activation functions are monotonic functions , (e.g. sigmoid, ReLU), while pool layer functions are , , avg functions over several dimensions. It is easy to derive the perturbation of output of the activation function, . During the evaluation query, instead of 1D actual output, we carry 2D perturbations, as the actual parameter value is not available. Non-linearity decreases or increases the perturbation range. Now the output perturbation at can be calculated similarly, except now both and are uncertain: