Building industrial machine learning pipelines is an iterative cycle of gathering and curating training data, training and deploying models, and monitoring the model once it is in production. When errors or undesirable behavior are found, the cycle repeats. Without tools to manage this process, production models become hard to maintain and difficult to reproduce. In 2017, a subset of the authors built the first industrial feature store (michaelangelo), a system designed to standardize and manage model features and workflows. The feature store both reduced engineering effort and improved model quality.

Feature stores, however, have yet to adapt to a growing trend in model development: incorporating pretrained embeddings. Pretrained embeddings are becoming standard inputs to modern machine learning pipelines. These embeddings are typically trained in a self-supervised fashion over massive datasets and encode knowledge about words, entities, graphs, and images. Once trained, they can provide lift in numerous downstream tasks such as recommendation systems (naumov2019deep), information retrieval (khattab2020colbert), and data integration (mudgal2018deep). A subset of the authors saw first-hand, from their experience building and deploying an industrial self-supervised entity disambiguation system, that pretrained embeddings are shifting industrial pipelines towards “hands-free” models that require limited hand-curated data and model engineering (overton2020re; karpathy20; molino2019ludwig). There is therefore an increasing need for the next generation of feature store systems to help manage and monitor the embedding training data, the pretrained embeddings, and the downstream systems that consume the embeddings.
The goal of this tutorial is to expose the interplay between data management and modern, self-supervised embedding ecosystems. We will first introduce feature store systems and the challenges they address. We will then introduce self-supervised pretrained embeddings and explain the new challenges associated with managing these embedding pipelines. We then explore how data management can help build, monitor, and maintain these self-supervised embedding ecosystems. Lastly, we will discuss future data management challenges and research directions.
This tutorial focuses on modern feature stores and the challenges of supporting embeddings as first-class citizens in feature stores. We highlight that managing these self-supervised embedding systems is a fundamental data management problem. This tutorial is intended for researchers with some familiarity with deep learning and pretrained embeddings who are interested in the interaction of data management and deep learning pipelines.
This tutorial is split into three parts over 1.5 hours. (This tutorial has not been presented at any prior venue and is most similar to recent SIGMOD and VLDB tutorials on data integration and cleaning with machine learning.)
Feature Stores. We will give an overview of the modern machine learning pipeline and feature store systems. We will describe the core challenges these systems solve and give an overview of the technical contributions.
Embedding Ecosystems. We will introduce pretrained embeddings and discuss the new challenges feature stores face in treating embeddings as first-class citizens. We then discuss recent solutions to some of these challenges.
Future Directions and Challenges. We will conclude with a discussion of the future directions and challenges.
2. ML Feature Stores
In 2017, a subset of the authors built the first industrial feature store (michaelangelo). Using this first-hand experience and the lessons learned, we introduce the modern ML pipeline and describe the challenges faced in maintaining and deploying models. We then introduce feature store systems and the technical innovations that help solve the aforementioned challenges. We focus the first part of this tutorial on traditional tabular feature data (i.e., not embeddings).
2.1. Machine Learning Pipelines
In industrial machine learning (ML) pipelines, as shown in Figure 1, engineers need to quickly ingest training data for feature curation, train and deploy models, and monitor and maintain the model once deployed. We describe each step below.
Training Data. Data is scraped, mined, or retrieved from a variety of different sources. The data needs to be cleaned, checked, and featurized for downstream models. Challenges: Engineers author custom features that, if not shared and managed, can result in repeated work and lack of definitional consistency. Feature definitions can become stale if not kept up-to-date as data changes over time.
Model Training and Deployment. Using a set of features, engineers need to train and deploy models. Challenges: As data changes over time and updates occur at different intervals, models can become stale if not given the most up-to-date features. Further, model reproducibility becomes a challenge as engineers try to keep up with changing data and model parameters.
Model Maintenance and Monitoring. Once deployed, models need to be monitored and maintained. Challenges: Models can struggle in the face of distribution shift and out-of-domain inputs (schelter2018challenges). Further, once model errors are detected, engineers are often lacking guidance as to what features need to be corrected.
2.2. Feature Stores
Feature stores (FSs) arose to address these challenges by providing a centralized repository of reusable features across the ML pipeline and automating the management of this pipeline (featurestore; michaelangelo; hopsworks; feast). Below, we dive into how feature stores address the three challenges above.
2.2.1. Training Data
Structured data can come in the form of raw tables as well as streams that users access when curating features. To facilitate sharing features across an organization and keeping them up to date, feature stores allow for feature authoring and publishing (alkowaileet2018end). Users provide simple definitional metadata, e.g., the feature update cadence and a definition SQL query, and upload the definition to the FS. When the underlying data changes, the FS orchestrates updates to the features based on the user-defined cadence. For streaming features, users provide aggregation functions that are applied to the raw streaming features. The aggregated features are persisted to the online store and logged to the offline store.
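As a concrete sketch, feature authoring and publishing might look like the following. The `FeatureDefinition` and `FeatureRegistry` names, fields, and the example query are illustrative assumptions, not the API of any particular feature store:

```python
from dataclasses import dataclass

@dataclass
class FeatureDefinition:
    # Mirrors the definitional metadata described above; the class and
    # registry below are a toy stand-in, not a real FS API.
    name: str
    sql: str                 # definition query run against the raw tables
    update_cadence: str      # e.g., "hourly", "daily"; drives orchestration
    owner: str

class FeatureRegistry:
    """Toy in-memory stand-in for a feature store's definition catalog."""
    def __init__(self):
        self._defs = {}

    def publish(self, feature: FeatureDefinition):
        # Publishing makes the definition discoverable org-wide, so other
        # teams reuse it instead of re-deriving the same feature.
        self._defs[feature.name] = feature

    def get(self, name: str) -> FeatureDefinition:
        return self._defs[name]

registry = FeatureRegistry()
registry.publish(FeatureDefinition(
    name="rider_trips_7d",
    sql="SELECT rider_id, COUNT(*) FROM trips "
        "WHERE ts > now() - INTERVAL 7 DAY GROUP BY rider_id",
    update_cadence="daily",
    owner="growth-team",
))
```

A real FS would additionally schedule the query per the cadence and materialize results to the online and offline stores.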
2.2.2. Model Training and Deployment
Once features are curated, users need the ability to construct feature sets on the most recent data to train and deploy models. FSs support this workflow by partitioning features on date and providing APIs that allow for time-based joins. Further, FSs must provide feature quality metrics to support the detection and mitigation of feature errors. For example, FSs measure feature freshness, null counts, and mutual information across features.
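The time-based join can be sketched in a few lines: for each labeled example, pick the latest feature value at or before the label's timestamp, so training never sees future data. The helper below is a hypothetical illustration, not a real FS API:

```python
from bisect import bisect_right

def point_in_time_join(label_rows, feature_rows):
    """For each (key, ts, label) row, attach the latest feature value with
    feature ts <= label ts, avoiding future-data leakage."""
    # Index feature values per key, sorted by timestamp.
    by_key = {}
    for key, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        by_key.setdefault(key, ([], []))
        by_key[key][0].append(ts)
        by_key[key][1].append(value)

    joined = []
    for key, ts, label in label_rows:
        times, values = by_key.get(key, ([], []))
        i = bisect_right(times, ts) - 1  # latest feature at or before ts
        joined.append((key, ts, label, values[i] if i >= 0 else None))
    return joined

features = [(1, 1, 3), (1, 10, 8), (2, 2, 5)]   # (rider_id, day, trips_7d)
labels   = [(1, 5, 0), (1, 20, 1), (2, 12, 0)]  # (rider_id, day, label)
rows = point_in_time_join(labels, features)
# rider 1 on day 5 gets the day-1 value (3), not the later day-10 value (8)
```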
Once a model is trained, relevant parameters and artifacts need to be stored for provenance and reproducibility. Although model storage is not traditionally part of a FS, some FSs (michaelangelo; hopsworks) do support model management by integrating a separate model store (vartak2016modeldb; gharibi2019modelkb).
Online Feature Serving. Once a model is deployed, features need to be continuously provided to the deployed model even as the feature data is updated over time. To provide low-latency feature serving, FSs typically use a dual datastore: one for offline training (e.g., a SQL warehouse) and one for online serving (e.g., an in-memory DBMS).
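A minimal sketch of this dual-store layout, with toy in-process data structures standing in for the warehouse and the in-memory store (names are illustrative):

```python
class DualStore:
    """Toy sketch of a feature store's dual storage layout: an append-only
    offline log for training and a latest-value online map for serving."""
    def __init__(self):
        self.offline = []  # stand-in for a SQL warehouse (full history)
        self.online = {}   # stand-in for an in-memory KV store (latest only)

    def write(self, entity_id, feature, value, ts):
        # Every update is logged offline so training sets are reproducible...
        self.offline.append((entity_id, feature, value, ts))
        # ...and upserted online so a serving read is a single lookup.
        self.online[(entity_id, feature)] = value

    def serve(self, entity_id, feature):
        return self.online.get((entity_id, feature))

store = DualStore()
store.write(42, "trips_7d", 3, ts=1)
store.write(42, "trips_7d", 8, ts=2)
# serving returns only the latest value; the offline log retains both
```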
2.2.3. Model Monitoring and Maintenance
FSs must additionally support model quality metrics, alongside the feature quality metrics described above (vartak2016modeldb). For example, FSs support critical model metrics such as training-deployment data skew and near real-time outlier and input drift detection. These metrics allow users to be informed of potential ‘gremlins’ in the system. Once an error is discovered, engineers can use the FS metrics to identify the offending set of features and select a better feature set for serving (or retraining).
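One common way to flag training-deployment skew is the population stability index (PSI) over a feature's value distribution. A self-contained sketch; the 0.2 alert threshold is a common rule of thumb, not a universal constant, and should be tuned per feature:

```python
from math import log

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a training-time sample and a
    serving-time sample of one feature. Rule of thumb (an assumption):
    PSI > 0.2 suggests meaningful drift worth alerting on."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def dist(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # eps smoothing avoids log(0) for empty bins
        return [c / len(xs) + eps for c in counts]

    p, q = dist(expected), dist(actual)
    return sum((pi - qi) * log(pi / qi) for pi, qi in zip(p, q))

train_sample = [0.1 * i for i in range(100)]         # uniform on [0, 10)
serving_same = [0.1 * i for i in range(100)]         # no drift
serving_shifted = [0.1 * i + 5 for i in range(100)]  # distribution moved
```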
3. Data Management and Embedding Ecosystems
Drawing on our first-hand experiences developing entity embeddings across numerous downstream products at a large technology company, we now introduce self-supervised pretrained embeddings and the ecosystems around them. We then discuss new challenges and solutions in managing these ecosystems. We believe treating embeddings as first-class citizens is the next evolution of feature stores.
3.1. Embedding Ecosystem
We define the embedding ecosystem as the embedding training data, the embeddings, and the downstream systems that consume them. As shown in Figure 1, the embedding ecosystem pipeline is similar to that of the feature store. However, FSs are unable to support end-to-end embedding management: because embeddings are derived data, standard metrics and tools for managing tabular features are no longer adequate. For example, embeddings are often compared by dot-product similarity, and existing FS metrics such as null value counts do not capture drifts or changes in embeddings with respect to this metric. We now highlight the additional challenges associated with each step of the ML pipeline when incorporating embeddings (the same challenges from FSs still apply to embedding ecosystems).
Training Data: In an embedding ecosystem, the raw training data is used to pre-train the embeddings. As this data is self-supervised, it is not hand-labeled or curated. Challenges: Embeddings will encode any inherent biases in the self-supervised training data; e.g., embeddings often represent rare things poorly (orr2021bootleg; schick2020rare).
Model Training and Deployment: Once embeddings are trained, they need to be stored and served to downstream systems that use them for training and deployment. Challenges: As embeddings get retrained and updated, just like features, downstream models can become stale and out-of-date. Users need to understand and monitor embedding changes, search over possible embeddings, and select the best ones for their task. Standard tabular metrics, which suffice for features, are inadequate for embeddings.
Model Maintenance and Monitoring: As with FSs, deployed models need to be monitored and maintained, especially with respect to the embedding inputs. Challenges: Any inherent embedding quality issue will impact all downstream models using those embeddings. Users need to be able to understand and isolate downstream quality issues in the underlying embeddings. Once found, users need methods for correcting errors in downstream products.
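To make the drift point concrete: one embedding-aware signal compares two embedding versions by whether each entity keeps its nearest neighbors under cosine similarity, a change that per-dimension statistics like null counts cannot see. A toy sketch with fabricated entities and vectors:

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def neighbor_overlap(emb_v1, emb_v2, k=1):
    """Fraction of entities whose top-k cosine-similarity neighbors are
    preserved between two embedding versions; 1.0 means the local
    geometry downstream models rely on is unchanged."""
    def top_k(emb, key):
        others = [o for o in emb if o != key]
        others.sort(key=lambda o: cosine(emb[key], emb[o]), reverse=True)
        return set(others[:k])

    keep = [len(top_k(emb_v1, e) & top_k(emb_v2, e)) / k for e in emb_v1]
    return sum(keep) / len(keep)

v1 = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
# v2: every raw value changed (axes flipped), yet neighborhoods survive
v2 = {"cat": [0.0, 1.0], "dog": [0.1, 0.9], "car": [1.0, 0.0]}
# v3: "dog" genuinely moved next to "car", so some neighborhoods break
v3 = {"cat": [1.0, 0.0], "dog": [0.1, 0.9], "car": [0.0, 1.0]}
```

Note that v1 vs. v2 scores a perfect 1.0 despite every stored float changing, while v3's real semantic move is flagged; this is exactly the behavior value-level tabular metrics miss.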
3.1.1. Self-Supervised Training Data
Unlike feature curation data, which is tabular and pre-labeled, self-supervised training data is often unstructured and not hand-curated. This lack of manual curation results in data that may not accurately represent the data seen upon deployment (bernstein2012direct; googlequeries; koh2020wilds) and that is often biased toward popular things (orr2021bootleg). Embeddings trained on this data can inherit these biases. To improve embedding quality for rare things through training data management, recent work (orr2021bootleg) explored incorporating structured data into entity embedding pretraining through named entity disambiguation, the task of mapping strings to entities in a knowledge base. The authors showed that by adding structured data about an entity's type and its knowledge graph relations, they could boost performance on rare entities by 40 F1 points. We believe merging structured and unstructured data is a promising management technique for improving training data quality.
3.1.2. Embedding Management
Both traditional features and pretrained embeddings are served to downstream models. What makes an embedding ecosystem unique is that users need to understand differences in embedding quality as embeddings are updated over time and need guidance on which embeddings to use for their task. To measure quality, wendlandt2018factors and hellrich2016bad analyze word embeddings with respect to an embedding's nearest neighbors. The work of leszczynski2020understanding instead examines the quality of an embedding with respect to a downstream task, defining downstream instability, the number of predictions that change when the embeddings change, as a measure of how embedding updates destabilize deployed models. There is little available work on finding the right embedding to use, especially under compute or memory constraints. The work of may2019downstream takes a first step by proposing a variant of the eigenspace overlap score as a predictor of downstream performance. However, that work focuses on non-contextualized word embeddings.
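The downstream instability measure itself is simple to state: the fraction of test predictions that flip when the embeddings feeding an otherwise fixed downstream model are swapped. A sketch with fabricated predictions (the two lists stand in for the outputs of the same model retrained on embedding versions v1 and v2):

```python
def downstream_instability(preds_v1, preds_v2):
    """Fraction of predictions that flip between two embedding versions,
    in the spirit of the downstream instability metric described above."""
    assert len(preds_v1) == len(preds_v2)
    return sum(a != b for a, b in zip(preds_v1, preds_v2)) / len(preds_v1)

# Hypothetical test-set predictions from a fixed downstream classifier
# retrained on embedding versions v1 and v2.
preds_v1 = [1, 0, 1, 1, 0, 0, 1, 0]
preds_v2 = [1, 0, 0, 1, 0, 1, 1, 0]
instability = downstream_instability(preds_v1, preds_v2)  # 2 of 8 flip
```

Even when aggregate accuracy is unchanged, a high instability means users see different predictions after an embedding refresh, which is why this metric matters for deciding when to propagate an update.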
3.1.3. Fine-Grained Monitoring and Patching
Downstream models in traditional FSs and embedding ecosystems need to be monitored and maintained. In an embedding ecosystem, however, the challenge is in giving users the tools to find meaningful subpopulations of errors and to connect those downstream errors to embedding quality issues. These errors then need to be corrected through the underlying embedding. For monitoring downstream models, recent works provide toolkits for measuring language model performance at a semantic, fine-grained level (goel2021robustness; ribeiro2020beyond). goel2021robustness in particular allows users to define custom sub-population functions to explore performance across different models. Once an error is discovered, the challenge is how to correct it in the underlying embedding. Correcting the error in the embedding patches all downstream systems that use those embeddings, which maintains product consistency. The work in orr2021bootleg gives a proof-of-concept that data management techniques such as augmentation (chepurko2020arda), weak supervision (ratner2017snorkel), and slice-based learning (chen2019slice) can correct underperforming sub-populations of data.
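A minimal sketch of sub-population (slice) error analysis in this spirit; the slice names, predicates, and examples are fabricated for illustration and this is not Robustness Gym's actual API:

```python
def slice_report(examples, predictions, slices):
    """Per-slice accuracy: `slices` maps a slice name to a predicate over
    an example. Aggregate accuracy can look healthy while a slice
    (e.g., rare entities) fails, which is the error signal to surface."""
    report = {}
    for name, predicate in slices.items():
        idx = [i for i, ex in enumerate(examples) if predicate(ex)]
        correct = sum(predictions[i] == examples[i]["label"] for i in idx)
        report[name] = correct / len(idx) if idx else None
    return report

# Fabricated disambiguation examples: label 1 = popular sense is correct.
examples = [
    {"mention": "Lincoln", "rare_entity": False, "label": 1},
    {"mention": "Paris",   "rare_entity": False, "label": 1},
    {"mention": "Apple",   "rare_entity": False, "label": 1},
    {"mention": "Lincoln, Kansas", "rare_entity": True, "label": 0},
]
predictions = [1, 1, 1, 1]  # model always guesses the popular sense

report = slice_report(examples, predictions, {
    "all": lambda ex: True,
    "rare_entities": lambda ex: ex["rare_entity"],
})
# aggregate accuracy is 0.75, but the rare-entity slice fails entirely
```

Once such a failing slice is isolated, techniques like augmentation or weak supervision can target it through the underlying embedding.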
4. Future Directions
We end with a discussion of future directions.
Embedding-Enhanced Feature Stores. We believe the next evolution of the feature store is one with native support for embeddings. While we discussed some challenges and potential solutions, this is just the beginning. Users need tools for searching and querying embeddings as well as support for versioning, provenance, and downstream quality metrics. For example, if an embedding gets updated but a model that uses it does not, the dot product of the embedding with the model parameters can lose meaning, leading to incorrect model predictions. Further, performing these operations at industrial scale will be non-trivial as the sizes of embeddings and their associated models continue to increase.
End-to-End Model Patching Through Data. An open area of research is automatically correcting errors discovered in downstream model error analysis through the underlying embedding. While prior work showed that errors can be patched through methods like data augmentation and slice finding (orr2021bootleg; goel2020model), challenges remain in automating and managing this process. How can one predict whether an augmentation strategy will have the desired result? If an embedding gets patched, what is the optimal way to propagate that patch downstream?
5. Biographical Sketches
Laurel Orr is a postdoc in the Computer Science Department at Stanford University advised by Christopher Ré. She graduated from the University of Washington in the Database group and is a lead on the Bootleg project, a self-supervised system for entity disambiguation. Bootleg is in production at Apple and used by academic research groups. She was awarded the NSF GRFP as a graduate student and is a current IC Postdoc Fellow.

Atindriyo Sanyal is a technical lead on the Michelangelo team at Uber AI. He leads various feature engineering efforts across Uber. Prior to that, he worked at LinkedIn and Apple, where he was a Senior Software Engineer on Siri and the founding engineer behind SiriKit (the Siri API). He completed his Master's at UCLA, where he worked at the Networks Research Lab building routing algorithms for pedestrians with skin conditions on open-source navigation systems. He is a winner of Microsoft's Imagine Cup, an IEEE Presidential Award nominee, and a Math Olympiad winner, and has won many university hackathons.

Xiao Ling is a machine learning engineer at Apple, where his work spans from information extraction for knowledge base construction to open-domain question answering. He earned his PhD in Computer Science and Engineering from the University of Washington in 2015. He was an early engineer at Lattice Data Inc., which was acquired by Apple in 2017.

Megan Leszczynski is a PhD student in the Computer Science Department at Stanford University advised by Christopher Ré. She is one of the original developers of Bootleg, a self-supervised system for named entity disambiguation, which has since been deployed in industry. She has also given an invited lecture in Stanford's CS224N (NLP with Deep Learning) course led by Christopher Manning. Her research has been recognized with an NSF GRFP.

Karan Goel is a 3rd-year CS PhD student at the Stanford AI Lab. He leads the Robustness Gym (RG) project, whose goal is to facilitate fine-grained evaluation and maintenance of ML models. RG is actively deployed at Salesforce with users in academia and industry. He wrote one of the first papers on “model patching”, and his work has been recognized with a Siebel Scholarship (2018) and a Salesforce Research Grant (2020).

Acknowledgements. We acknowledge the support of the IC Postdoctoral Research Fellowship Program and the NSF Graduate Research Fellowship under No. DGE-1656518.