With growing adoption of machine learning, personalization is proving essential to competitive user experience D’Arcy (2021). To support users with different interests and preferences, one needs good default tactics, user feedback, and prioritization of delivered content and available actions Molino and Ré (2021). When managing limited resources, e.g., for video serving, similar logic applies to network bandwidth, response latency, and video quality Mao et al. (2020); Feng et al. (2020). This paper explores the use of ML for personalized decision-making in software products using what we call smart strategies. Building smart strategies and offering them to software engineers brings unique challenges Agarwal et al. (2016); Molino and Ré (2021). Below we outline these challenges and how they can be addressed.
Data-centric ML development is an increasingly popular concept of shifting the focus of ML development from models to data Miranda (2021). It is especially relevant to software personalization using off-the-shelf models, where collecting the right (tabular) data and selecting the appropriate class of models become primary differentiators Molino and Ré (2021). Aside from traditional data management concerns, ML systems for personalization struggle to handle the noise inherent in user feedback signals and product impact metrics Letham et al. (2019). It is challenging to select features relevant to the task from a sea of available features with different computational cost profiles. Compared to developing and training ML models, data adequacy is often overlooked Sambasivan et al. (2021), and product development platforms must diligently address these omissions, e.g., by automation. Per Andrew Ng, “everyone jokes that ML is 80% data preparation, but no one seems to care” Sagar (2021). Data and model quality aside, product decisions are driven by product goals, and impact on numerous users is measured via A/B tests Bakshy et al. (2014); Xu et al. (2015); Letham and Bakshy (2019). Scaling, productionizing and fully measuring the impact of smart strategies calls for software-centric ML integration with APIs for data collection and decision-making, rather than application code directly dealing with models and data sets.
Vertical ML platforms lower barriers to entry and support the entire lifecycle of ML models (Figure 1), whereas horizontal ML platforms like TensorFlow Abadi et al. (2016) and PyTorch Li et al. (2020) focus on modeling for generic ML tasks, support hardware accelerators, and act as toolboxes for application development Gauci et al. (2018); Molino et al. (2019). Vertical platforms foster the reuse of not only ML components, but also workflows. Specialized end-to-end vertical platforms drive flagship product functionalities, such as recommendations at large internet firms (Google, Facebook, LinkedIn, Netflix). They have also been applied to software development, code quality checks, and even to optimize algorithms such as sorting and searching Carbune et al. (2018). Supporting smart strategies requires general-purpose vertical platforms, which build on top of horizontal platforms to offer end-to-end ML lifecycle management. General-purpose vertical ML platforms can be internal to a company, such as Apple’s Overton Ré et al. (2019) and Uber’s Michelangelo Hermann and Del Balso (2017), or broadly available to cloud customers, such as Google’s Vertex, Microsoft’s Azure Personalizer Agarwal et al. (2016), and Amazon Personalize. A common theme is to help engineers “build and deploy deep-learning applications without writing code” via high-level, declarative abstractions Molino and Ré (2021). Improving user experience and system performance with ML remains challenging Paleyes et al. (2020) as correlations in data found by ML models might not lead to causal improvements. Little is known about optimizing for product goals Molino and Ré (2021); Wu et al. (2021).
In this work, we develop support for data-driven real-time smart strategies via a general-purpose vertical end-to-end ML platform called Looper, an internal ML platform at Meta for rapid, low-effort deployment of models of moderate complexity. Looper is a declarative ML system Hermann and Del Balso (2017); Molino et al. (2019); Ré et al. (2019); Molino and Ré (2021) that supports coding-free management of the full lifecycle of smart strategies via a GUI.
Our technical contributions include:
A full-stack ML platform (Section 3) with causal product-impact evaluation/optimization and handling of heterogeneous treatment effects (Sections 3.4 and 4.2) via an experiment optimization system and meta-learners.
A generic framework for targeting long-term outcomes by parameterized policies using plug-in supervised learning and Bayesian optimization (Sections 3.3, 3.4, 4.2).
The strategy blueprint abstraction (Section 3.3) to optimize not only models, but the entire ML stack (Figure 1).
Capturing decision inputs and observations online via the succinct Looper API for product code; not only predicting what’s logged by the API, but also optimizing black-box product objectives (Section 3.2).
Analysis of performance bottlenecks (Section 4.4).
Broad deployment and substantial impact on product metrics (Section 4).
Qualitative analysis of our platform via a survey of adopters (Section 5).
Specialized vertical ML platforms limit application diversity, whereas Looper hosts hundreds of production use cases thanks to its general-purpose architecture. Many vertical platforms Agarwal et al. (2016); Carbune et al. (2018); Ré et al. (2019); Molino et al. (2019) don’t solve as wide a selection of ML tasks as Looper does (classification, estimation, value and sequence prediction, ranking, planning) using supervised or reinforcement learning. Unlike platforms with asynchronous batch-mode inference Hermann and Del Balso (2017); Gupta et al. (2020), Looper runs in real time. To balance model quality, size and inference time, Looper AutoML selects models and hyperparameters, and such tuning extends to vertical optimizations via blueprints.
In the remainder of the paper, Section 2 explores ML-driven smart strategies and relevant platform needs. Section 3 covers our philosophy for the Looper platform, introduces the architecture, the API, and specializations. Section 4 summarizes product impact at Meta, including comparisons to baselines and adoption statistics. Section 5 covers the adoption of smart strategies. Section 6 reviews how Looper helps improve SW systems and products.
2 ML for smart strategies
In this paper, we target smart strategies at key decision points in existing software products, for example:
application settings and preferences: selecting between defaults and user-specified preferences
adaptive interfaces: certain options are shown only to users who are likely to pursue them
controlling the frequency of ads, user notifications, etc.
prefetching or precomputation to reduce latency
content ranking and prioritizing available actions
Individual user preferences and contextual information complicate decision-making. Reducing the cognitive load of a UI menu can turn a product failure into success, but menu preferences vary among users. Prefetching content to a mobile device may improve user experience, but doing this well requires predicting the environment and user behavior.
While human-crafted heuristic strategies often suffice as an initial solution, ML-based smart strategies tend to outperform heuristics upon sufficient engineering investment Kraska et al. (2017); Carbune et al. (2018). The Looper platform aims to lower this crossover point to broaden the adoption of smart strategies and deliver product impact over diverse applications. In this section, we discuss some of the modeling approaches to enable smart strategies and cover the priorities in building such an effective platform.
2.1 Modeling approaches for smart strategies
Smart strategies may be implemented through various forms of supervised learning, contextual bandits, and reinforcement learning supported on our platform (Figure 2). Aside from the model choice, most product decision problems require (1) ML optimization objectives to approximate the product goal(s) and (2) a decision policy to convert objective predictions into a single decision.
Approximating product goals with predictable outcomes (alternatively referred to as proxy or surrogate objectives) is a major difference between industry practice and research driven by existing ML models with abstract optimization objectives Stein (2019). Good proxy objectives should be readily measurable and reasonably predictable. In recommendation systems, the “surrogate learning problem has an outsized importance on performance in A/B testing but is difficult to measure with offline experiments” Covington et al. (2016). A delicate tradeoff exists between objectives which are directly connected to the decision and easier to measure versus more complex objectives; a good example in advertising is modeling clicks vs. conversions. Furthermore, product goals may implicitly have different weighting functions than the ML objective. (For example, the feedback provided by the most prolific product users does not always represent other users Beutel et al. (2017).) Objectives can be modeled directly using supervised learning; alternatively, models used by contextual bandits (CBs) enable modeling of uncertainty in predictions across one or more objectives, which may then be used for exploring the set of optimal actions, such as in Thompson sampling Agarwal et al. (2009); Li et al. (2010); Agarwal et al. (2016); Daulton et al. (2019). The use of reinforcement learning (RL) further enables the optimization of long-term, cumulative objectives, which benefits use cases with sequential dependencies Li et al. (2010); Gauci et al. (2018); Apostolopoulos et al. (2021). To evaluate any one of these types of models and decision rules, true effects of the ML-based smart strategies can be estimated via A/B tests.
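To make the exploration idea concrete, the following is a minimal sketch of Thompson sampling, assuming binary feedback and a Beta posterior per action. It is context-free for brevity, whereas a contextual bandit would condition the posterior on features. The function names and the simulated success rates are illustrative assumptions, not part of the platform.

```python
import random


def thompson_select(successes, failures, rng):
    """Pick the action whose sampled success rate is highest.

    Each action keeps a Beta(successes + 1, failures + 1) posterior over its
    unknown success probability (a uniform prior is assumed).
    """
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def simulate(true_rates, rounds, seed=0):
    """Run Thompson sampling against simulated Bernoulli feedback."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(rounds):
        action = thompson_select(successes, failures, rng)
        # Simulated user feedback: success with the action's true rate.
        if rng.random() < true_rates[action]:
            successes[action] += 1
        else:
            failures[action] += 1
    # Number of times each action was chosen.
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical example: the second action is much better than the first.
pulls = simulate(true_rates=[0.1, 0.9], rounds=2000)
```

Because uncertain actions occasionally produce high samples, they keep being tried until the posterior concentrates, after which most traffic goes to the better action.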
Decision policies postprocess the raw model outputs into a final product decision or action. For single-objective tasks in supervised learning this may be as simple as making a binary decision if the objective prediction exceeds a threshold, e.g., turning the probability of a click into a binary prefetch decision (Section 4.1). For tasks with multiple objectives and more complex action spaces, the template for a decision policy is to assign a scalar value or score to all possible actions in the decision space, which can then be ranked through sorting. In recommendation systems, a standard approach is to use a combination function (usually a weighted product of objective predictions) to generate a score for each candidate Zhao et al. (2019). When using reinforcement learning, reward shaping Laud (2004) weighs task scores in the reward function to optimize for the true long-term objective. Optimizing this weighting for multi-objective tasks is explored in Section 3.3. More sophisticated policies also use randomization to explore the action space, e.g., Thompson sampling in contextual bandits Daulton et al. (2019), or epsilon-greedy approaches for exploration in ranking Agarwal et al. (2009).
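The three policy templates described here (thresholding, a weighted-product combination function, and randomized exploration) can be sketched as follows. This is an illustration under stated assumptions rather than the platform's policy language: all function names, the candidate predictions, and the weights are hypothetical.

```python
import math
import random


def threshold_policy(click_probability, threshold=0.5):
    """Binary decision: act only when the prediction exceeds the threshold."""
    return click_probability > threshold


def weighted_product_score(predictions, weights):
    """Combine objective predictions p_k with weights w_k as prod(p_k ** w_k)."""
    return math.prod(predictions[name] ** w for name, w in weights.items())


def rank_candidates(candidates, weights):
    """Sort candidate IDs by descending combined score."""
    return sorted(
        candidates,
        key=lambda c: weighted_product_score(candidates[c], weights),
        reverse=True,
    )


def epsilon_greedy_rank(candidates, weights, epsilon, rng):
    """With probability epsilon, promote a random candidate to the top slot."""
    ranked = rank_candidates(candidates, weights)
    if ranked and rng.random() < epsilon:
        explored = rng.choice(ranked)
        ranked.remove(explored)
        ranked.insert(0, explored)
    return ranked


# Hypothetical per-candidate objective predictions and policy weights.
candidates = {
    "a": {"click": 0.9, "rating": 0.2},
    "b": {"click": 0.5, "rating": 0.8},
    "c": {"click": 0.1, "rating": 0.9},
}
weights = {"click": 1.0, "rating": 2.0}
ranking = rank_candidates(candidates, weights)
```

With these weights the rating objective dominates, so candidate "b" outranks the most clickable candidate "a"; changing the weights changes the ranking without retraining any model.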
Choosing appropriate ML models requires trading off performance metrics with resource usage, latency and data properties. SVM packages lack sparse feature support and struggle with heterogeneous data, especially in large amounts. DNNs scale to 1B+ data rows but use more memory. Gradient-Boosted Decision Trees (GBDTs) are compact, handle heterogeneous data and scale to 100M rows. They are often good enough and use moderate resources (Figure 2).
2.2 Extending end-to-end ML for smart strategies
Traditional end-to-end ML systems go as far as to cover model publishing and serving Hermann and Del Balso (2017); Molino et al. (2019); Ré et al. (2019); Molino and Ré (2021), but to our knowledge rarely track how the model is used in the software stack. Assessing and optimizing the impact of smart strategies, especially with respect to product goals, requires experimentation on all aspects of the modeling framework – all the way from metric selection to policy optimization. To streamline this experimentation, smart-strategies platforms must extend the common definition of end-to-end into the software layer.
Software-centric ML integration Agarwal et al. (2016); Carbune et al. (2018) – where data collection and decision-making are fully managed through platform APIs – enables both high-quality data collection and holistic experimentation. Notably, the platform can now keep track of all decision points and support A/B tests between different configurations. Well-defined APIs improve adoption among product engineers with limited ML background, and ML configuration can be abstracted via declarative programming or GUI without requiring coding Molino and Ré (2021).
End-to-end AutoML. It is common to use automation for hyperparameter tuning (AutoML), typically via black-box optimization Balandat et al. (2020). However, in our extended end-to-end regime, model architecture and feature selection parameters can be optimized in a multi-objective tradeoff between model quality and computational resources Daulton et al. (2021). Decision policy weights can be tuned for long-term product goals. AutoML for the entire pipeline becomes possible with declarative strategy blueprints and an adaptive experimentation framework aware of online product metrics Bakshy et al. (2018), as explored in Section 3.3.
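The quality-versus-resources tradeoff can be illustrated with a small sketch. Note the substitution: instead of the multi-objective Bayesian optimization cited above, this example simply filters an already evaluated set of configurations down to its Pareto front and picks the best one within a cost budget. The configuration names and all numbers are made up for illustration.

```python
def pareto_front(configs):
    """Return configs that no other config beats on both quality and cost."""
    front = []
    for c in configs:
        dominated = any(
            o["quality"] >= c["quality"]
            and o["cost"] <= c["cost"]
            and (o["quality"] > c["quality"] or o["cost"] < c["cost"])
            for o in configs
        )
        if not dominated:
            front.append(c)
    return front


def pick_within_budget(configs, max_cost):
    """Choose the highest-quality Pareto-optimal config that fits the budget."""
    feasible = [c for c in pareto_front(configs) if c["cost"] <= max_cost]
    return max(feasible, key=lambda c: c["quality"]) if feasible else None


# Hypothetical evaluated configurations (quality: higher is better,
# cost: relative inference cost, lower is better).
evaluated = [
    {"name": "small_gbdt", "quality": 0.80, "cost": 1.0},
    {"name": "large_gbdt", "quality": 0.84, "cost": 3.0},
    {"name": "wide_dnn", "quality": 0.83, "cost": 5.0},
    {"name": "deep_dnn", "quality": 0.86, "cost": 9.0},
]
front_names = [c["name"] for c in pareto_front(evaluated)]
choice = pick_within_budget(evaluated, max_cost=4.0)
```

Here "wide_dnn" is dropped because "large_gbdt" is both better and cheaper. A Bayesian optimizer would aim to reach a similar front with far fewer evaluations.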
2.3 Additional requirements for smart strategies
Metadata features, i.e., product-specific properties (e.g., account type, time spent online, interactions with other accounts), are a distinctive requirement for learning smart strategies in comparison to the traditional content features (images, text, video) commonly associated with ML platforms. Unlike image pixels, metadata features are diverse, require non-uniform preprocessing, and often need to be joined from different sources. Patterns in metadata change quickly, necessitating regular retraining of ML models on fresh data. Interactions between metadata features are often simpler than for image or text features, so dense numerical metadata can be handled by GBDTs or shallow neural nets. Sparse and categorical features need adequate representations Rodríguez et al. (2018) and special provisions if used by neural network architectures Naumov et al. (2019).
Non-stationary environment is typical for product deployments but not for research demonstrations and SOTA results.
Logging and performance monitoring are important capabilities for a production system. Dashboards monitor system health and help understand model performance in terms of statistics, distributions and trends of features and predictions, automatically triggering alerts for anomalies Amershi et al. (2019); Breck et al. (2017). Our platform integrates with Meta’s online experimentation framework, and production models can be withdrawn quickly if needed.
Monitoring and optimizing resource usage flags inefficiencies across training and inference. Our monitoring tools attribute resource usage to components of the training and inference pipeline (Section 3.2), and help trade ML performance for resources and latency. Less important features are found and reaped with engineers’ approval (Section 4.4).
3 The Looper Platform
Smart strategies are supported by vertical ML platforms (Figure 1) and need operational structure — established processes and protocols for model revision and deployment, initial evaluation and continual tracking of product impact, as well as overall maintenance. We now introduce design principles and an architecture for a vertical smart strategies platform that address the needs outlined in Section 2.
3.1 Platform philosophy
In contrast to heavy-weight ML models for vision, speech and NLP that favor offline inference (with batch processing) and motivate applications built around them, we address the demand for real-time smart strategies within software applications and products. These smart strategies operate on metadata — a mix of categorical, sparse, and dense features, often at different scales. Respective ML models are lightweight: they can be re-trained and deployed quickly on shared infrastructure in large numbers. Downside risks are reduced via (i) simpler data stewardship, (ii) tracking product impact, and (iii) failsafe mechanisms to withdraw poorly performing models. Smart strategies have a good operational safety record and easily improve naive default behaviors.
The human labeling process common for CV and NLP fails for metadata because relevant decisions and predictions (a) only make sense in an application context, (b) in cases like data prefetch only make sense to engineers, (c) may change seasonally or even faster. Instead of human labeling, our platform interprets user-interaction and system-interaction metadata as either labels for supervised learning or rewards for reinforcement learning. To improve operational safety and training efficiency, we rely on batch-mode (offline) training, even for reinforcement learning.
Our platform philosophy pursues fast onboarding, robust deployment and low-effort maintenance of multiple smart strategies where positive impacts are measured and optimized directly in application terms (Section 5). To this end, we separate application code from platform code, and leverage existing horizontal ML platforms with interchangeable models for ML tasks (Figure 1). Intended for company engineers, our platform benefits from high-quality data and engineered features in the company-wide feature store Orr et al. (2021). To simplify onboarding for product teams and keep developers productive, we automate and support
Workflows avoided by engineers Sambasivan et al. (2021), e.g., feature selection and preprocessing, and tuning ML models for metadata.
Workflows that are difficult to reason about, e.g., tuning ML models to product metrics.
We first introduce several concepts for platform design.
The decision space captures the shape of decisions within an application which can be made by a smart strategy. With reinforcement learning, the decision space matches well with the concept of action space. More broadly, it can be as simple as a binary value to show a notification or not, or a continuous value for time-to-live (TTL) of a cache entry, or a data structure with configuration values for a SW system, such as a live-video stream encoder.
Application context captures necessary key information provided by a software system at inference time to make a choice in the decision space. The application context may be directly used as features or it may contain ID keys to extract the remaining features from the feature store (Section 3.3).
Product metrics evaluate the performance of an application and smart strategies. When specific decisions can be judged by product metrics, one can generate labels for supervised learning, unlike for metrics that track long-term objectives.
A proxy ML task casts product goals in mathematical terms to enable (i) reusable ML models that optimize formal objectives and (ii) decision rules that map ML predictions into decisions (Section 2.1). Setting proxy tasks draws on domain expertise, but our platform simplifies this process.
Evaluation of effects on live data verifies that solving the proxy task indeed improves product metrics. Access to Meta’s monitoring infrastructure helps detect unforeseen side effects. As in medical trials, (1) we need evidence of a positive effect, (2) side-effects should be tolerable, and (3) we should not overlook evidence of side-effects. On our platform, product developers define the decision space, allowing the platform to automatically select model type and hyperparameter settings. The models are trained and evaluated on live data without user impact, and improved until they can be deployed. Newly trained models are canaried (deployed on shadow traffic) before product use – such models are evaluated on a sampled subset of logged features and observations, and offline quality metrics (e.g., MSE for regression tasks) are computed. This helps avoid degrading model quality when deploying newer models.
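The canarying step can be sketched as a comparison of offline quality metrics on the same sample of logged observations. This is a minimal illustration, assuming a regression task scored by MSE; the promotion rule, function names, and numbers are assumptions rather than the platform's actual criteria.

```python
def mean_squared_error(predictions, observations):
    """Average squared difference between predictions and logged outcomes."""
    if not predictions or len(predictions) != len(observations):
        raise ValueError("need equally sized, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(predictions, observations)) / len(predictions)


def should_promote(canary_preds, production_preds, observations, min_improvement=0.0):
    """Promote the canary only if it lowers MSE on the same logged sample."""
    canary_mse = mean_squared_error(canary_preds, observations)
    production_mse = mean_squared_error(production_preds, observations)
    return production_mse - canary_mse > min_improvement


# Hypothetical sample of logged observations and shadow-traffic predictions.
observations = [1.0, 0.0, 1.0, 0.0]
production_preds = [0.5, 0.5, 0.5, 0.5]
canary_preds = [0.8, 0.1, 0.9, 0.3]
promote = should_promote(canary_preds, production_preds, observations)
```

Both models are scored on identical rows, so the comparison is not confounded by differences in traffic.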
3.2 Platform architecture: the core
Traditional ML pipelines build training data offline, but our platform uses a live feature store and differs in two ways:
Software-centric vs. data-centric interfaces. Rather than being passed via files or databases, training data are logged from product surfaces as Looper APIs intercept decision points in product software. Product engineers delegate concerns about the quality of training data (missing or delayed labels, etc.) to the platform.
An online-first approach. Looper API logs live features and labels at the decision and feedback points, then joins and filters them via real-time stream processing. Data hygiene issues Agarwal et al. (2016) and storage overhead are avoided by immediate materialization which (i) keeps training and inference consistent, and (ii) limits label leakage by separating features and labels in time. Looper’s complete chain of custody for data helps prevent engineering mistakes.
The Looper RPC API relies on two core methods:
I. The decision call returns a value from the decision space, e.g., a Boolean for binary choices or a floating-point score for ranking. Unlike in the 3-call APIs in Agarwal et al. (2016); Carbune et al. (2018), a decision is returned even before a model is available. A user-defined decision ID links individual decisions with observations logged later (II); it may be randomly generated for clients to propagate. The application context (Section 3.1) is passed as a dictionary, e.g., with the user ID (used to retrieve additional user features), current date/time, etc.
II. The observation call logs labels for training proxy ML task(s), where the decision ID must match a prior decision call (I). Observations capture users’ interactions, responses to a decision (e.g., clicks or navigation actions), or environmental factors such as compute costs.
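The two-call pattern can be illustrated with a small in-memory stand-in. This is a sketch, not the Looper API: the class name `StrategyClient`, the method names `get_decision` and `log_observations`, and the example context keys are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional


class StrategyClient:
    """Hypothetical in-memory stand-in for a two-call decision API."""

    def __init__(self, default_decision: Any,
                 model: Optional[Callable[[Dict[str, Any]], Any]] = None):
        self.default_decision = default_decision
        self.model = model
        self.decision_log: Dict[str, tuple] = {}
        self.training_rows: List[Dict[str, Any]] = []

    def get_decision(self, decision_id: str, context: Dict[str, Any]) -> Any:
        """Call I: return a decision, falling back to the default without a model."""
        if self.model is not None:
            decision = self.model(context)
        else:
            decision = self.default_decision
        # The context is logged either way, so training data accrues
        # even before the first model exists.
        self.decision_log[decision_id] = (dict(context), decision)
        return decision

    def log_observations(self, decision_id: str, observations: Dict[str, Any]) -> None:
        """Call II: attach labels to a prior decision via its ID."""
        if decision_id not in self.decision_log:
            raise KeyError(f"no prior decision with ID {decision_id!r}")
        context, decision = self.decision_log[decision_id]
        self.training_rows.append(
            {"features": context, "decision": decision, "labels": dict(observations)}
        )


# Before any model is trained, product code still gets a usable default.
client = StrategyClient(default_decision=False)
prefetch = client.get_decision("d-1", {"user_id": 42, "hour": 9})
client.log_observations("d-1", {"content_used": True})

# Once a model exists, the same product code path uses it.
smart = StrategyClient(default_decision=False, model=lambda ctx: ctx["hour"] < 12)
smart_prefetch = smart.get_decision("d-2", {"user_id": 42, "hour": 9})
```

Product code never touches models or datasets directly, so the platform can swap models and configurations behind the same two calls.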
Though deceptively simple in product code, this design fully supports the MLOps needs of the platform. We separately walk through the online (inference) and offline (training) steps of the pipeline in Figure 3. Product code initializes the Looper client API with one of the known strategies registered in the UI. The decision call (I) is then invoked with the decision ID and the application context. The Looper client API retrieves a versioned configuration (the “strategy blueprint”, Section 3.3) for the strategy to determine the features, the model instance, etc. The exact version used may be controlled through an external experimentation system. The client API passes the application context to the Meta feature store (Section 3.3), which returns a complete feature vector. The client API passes the feature vector and production model ID to a distributed model predictor system (cf. Soifer et al. (2019)), which returns proxy task predictions to the client. Then, the client API uses a decision policy (Section 2.1) to make the final decision based on the proxy predictions. Decision policies are configured in a domain-specific language (DSL) using logic and formulas. Asynchronously, the anonymized feature vector and predictions are logged to a distributed online joining system (cf. Ananthanarayanan et al. (2013)), keyed by the decision ID and marked with a configurable and relatively short TTL (time-to-live). The observation call (II), possibly issued from multiple request contexts, also sends logs to this system. Complete “rows” with matching features and observations are logged to a training table, with retention time set according to data retention policies. The remaining steps are performed offline and asynchronously.
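The online joining step can be sketched as a keyed buffer with a time-to-live. This is a single-process illustration under simplifying assumptions, whereas the system described above is distributed. The class and method names are hypothetical, and timestamps are passed explicitly to keep the example deterministic.

```python
class OnlineJoiner:
    """Joins logged features with later observations, keyed by decision ID."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.pending = {}         # decision_id -> buffered entry
        self.training_table = []  # completed rows

    def log_features(self, decision_id, features, now):
        """Buffer the feature vector until its TTL elapses."""
        self.pending[decision_id] = {
            "expires_at": now + self.ttl_seconds,
            "features": dict(features),
            "labels": {},
        }

    def log_observations(self, decision_id, observations, now):
        """Attach observations to a buffered entry; reject late arrivals."""
        self.flush(now)
        entry = self.pending.get(decision_id)
        if entry is None:
            return False  # unknown or expired decision ID
        entry["labels"].update(observations)
        return True

    def flush(self, now):
        """Emit expired entries that received labels; drop the rest."""
        expired = [k for k, e in self.pending.items() if e["expires_at"] <= now]
        for decision_id in expired:
            entry = self.pending.pop(decision_id)
            if entry["labels"]:
                self.training_table.append(
                    {
                        "decision_id": decision_id,
                        "features": entry["features"],
                        "labels": entry["labels"],
                    }
                )


joiner = OnlineJoiner(ttl_seconds=60)
joiner.log_features("d-1", {"hour": 9}, now=0)
joiner.log_features("d-2", {"hour": 10}, now=5)
joiner.log_observations("d-1", {"clicked": True}, now=20)
joiner.log_observations("d-1", {"dwell_seconds": 12}, now=30)
late = joiner.log_observations("d-2", {"clicked": False}, now=90)
```

The short TTL bounds buffer size. Observations that arrive after expiry would need an offline join instead.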
Delayed and long-term observations are logged in a table and then joined offline via Extract, Transform, and Load (ETL) pipelines Anonymous (2021). These pipelines perform complex data operations such as creating MDP sequences for reinforcement learning. The logged features, predictions, and observations are sent for logging and real-time monitoring as per Section 2.3. An offline training system Dunn (2016) retrains new models nightly, addressing concerns from Section 3.1. Trained models are published to the distributed predictor for online inference. Models are then registered for canarying (Section 3.1). A canary model that outperforms the prior model is promoted to production and added to the loop configuration.
3.3 Platform architecture: The strategy blueprint
The end-to-end nature of the Looper platform brings its own set of challenges regarding data and configuration management in the system. Existing ML management solutions Vartak and Madden (2018) primarily focus on managing or versioning of data and models, which is insufficient in covering the full lifecycle of smart strategies. In this section we introduce the concept of a strategy blueprint, a version-controlled configuration that describes how to construct and evaluate a smart strategy. Blueprints are immutable, and modifications (typically through a GUI) create new versions that can be compared in production through an online experimentation platform, allowing for easy rollback if needed. The strategy blueprint (Figure 4) controls four aspects of the ML model lifecycle and captures their cross-product:
Feature configuration. Modern ML models can use thousands of features and computed variants, which motivates a unified repository, termed a feature store, usable across both model training and real-time inference Hazelwood et al. (2018); Orr et al. (2021). Feature stores typically support feature groups, which describe how to compute features associated with pieces of application context (e.g., a website page identifier). Feature variants can be produced by feature transforms, e.g., pre-trained or SIF Arora et al. (2017) text embeddings. The Looper blueprint leverages feature stores for feature management and contains (i) a computational graph describing the use of feature groups, as well as (ii) downstream feature transforms. In practice, the most common blueprint modifications tend to involve experimentation with new features with the hope of improving model quality.
Label configuration controls how customers describe ML objectives (Section 2.1), or “labels”. Labels (clicks, ratings, etc.) are often chosen as proxies of the true target product metric. The relation between product metrics and their proxies is often difficult to measure precisely Stein (2019), so product teams may experiment with different label sets.
Model configuration helps product teams explore model architecture tradeoffs (neural networks, GBDTs, reinforcement learning). The blueprint only specifies high-level architecture parameters, while lower-level hyperparameters (e.g., learning rates) are delegated to AutoML techniques invoked by the training system (Section 2.2).
Policy configuration. As described in Section 2.1, decision policies translate raw objective predictions into decisions. The policy configuration contains a lightweight domain specific language (DSL) to convert raw model outputs into a final decision; Figure 4 illustrates a ranking decision, where the click and rating objectives are weighted in a combination function to generate a single score per candidate. Optimizing the weights embedded in decision policies is a frequent requirement for smart strategies.
Blueprints help capture compatibility between versions, e.g., the training pipeline for version A may use data from version B if features and labels in A are subsets of those in B. Tagging each training row with the originating blueprint version enables data sharing between versions.
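The blueprint's four configuration aspects, its immutability, and the subset-based compatibility rule can be sketched with a frozen dataclass. The field names, feature names, and helper functions below are illustrative assumptions and do not reflect the platform's actual schema.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Blueprint:
    """Immutable, versioned description of a smart strategy (hypothetical schema)."""
    version: int
    features: FrozenSet[str]                        # feature configuration
    labels: FrozenSet[str]                          # label configuration
    model_type: str                                 # model configuration
    policy_weights: Tuple[Tuple[str, float], ...]   # policy configuration


def new_version(blueprint, **changes):
    """Modifications never mutate a blueprint; they create the next version."""
    return replace(blueprint, version=blueprint.version + 1, **changes)


def can_train_on(consumer, producer):
    """Rows logged under `producer` suffice for `consumer` if they contain
    every feature and label that `consumer` needs."""
    return consumer.features <= producer.features and consumer.labels <= producer.labels


v1 = Blueprint(
    version=1,
    features=frozenset({"user_age_days", "hour"}),
    labels=frozenset({"click"}),
    model_type="gbdt",
    policy_weights=(("click", 1.0),),
)
v2 = new_version(
    v1,
    features=v1.features | {"device_type"},
    labels=v1.labels | {"rating"},
    policy_weights=(("click", 1.0), ("rating", 2.0)),
)
```

Here `can_train_on(v1, v2)` holds because rows logged under `v2` contain everything `v1` needs, while the reverse does not: rows logged under `v1` lack the `device_type` feature and the `rating` label.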
Figure 4 illustrates the lifecycle of a blueprint. From left to right: An experimentation system enables different blueprint versions to be served across the user population to facilitate A/B testing (optionally, in concert with a “blueprint optimizer”, described later below). The client API uses the blueprint feature configuration to obtain a complete feature vector from the feature store. Completed training examples are logged to training tables, tagged with the originating blueprint version. The training system filters data by compatible version and executes the pipeline according to the blueprint’s feature, label, and model configurations. The policy configuration may be needed as well for more sophisticated model types (reinforcement learning). Trained models are published in the blueprint version. For inference, the client API uses only models explicitly linked to its served blueprint version. To generate the final product-facing decision, the client also uses the policy configuration.
The blueprint abstraction enables holistic optimization by capturing dependencies between, e.g., feature configurations and model configurations. An adaptive experimentation platform Bakshy et al. (2018) can tune parameters in a blueprint to optimize product metrics. A common example is the tuning of weights in the blueprint’s “policy configuration” (i.e., for recommendation scores or reward shaping), where different weight configurations may significantly affect the final product outcomes.
3.4 Platform architecture: specializations
Looper for ranking. While the two-call API (Section 3.2) is general enough to implement simple recommendation systems, advanced systems need finer support. Higher-ranked items are more often chosen by users, and this positional bias can be handled (in the API) by including the displayed position as a special input during training Craswell et al. (2008). To derive a final priority score for each item, the multiple proxy task predictions are often combined through a weighted combination function Zhao et al. (2019). Recommender systems learn from user feedback, but such exploration requires including lesser-explored items among top results once in a while (the explore/exploit tradeoff Yankov et al. (2015)). A specialized Looper ranking system abstracts these considerations under a higher-level API, which allows the ordering of an entire list of application contexts, and also allows recording of display-time observations such as the relative screen position of each item.
Integrated experiment optimizations. Even when a product metric can be approximated well in an ML model, the correlations captured by the model might not lead to causal product improvements. Hence, A/B testing estimates the average treatment effect (ATE) of the change across product users. Shared repositories of product metrics are common Kohavi et al. (2009); Bakshy et al. (2014); Xu et al. (2015), and product variants are systematically explored by running many concurrent experiments Bakshy et al. (2018). Besides dealing with non-stationary measurements, balancing competing objectives, and supporting the design of sequential experiments Bakshy et al. (2018), a common challenge with A/B tests is to find subpopulations where treatment effects differ from the global ATE, i.e., heterogeneous treatment effects (HTE). Common neglect of HTEs in A/B testing likely delivers suboptimal treatments and leaves room for improvement Bakshy et al. (2014); Beutel et al. (2017); Wager and Athey (2018). The Looper platform and its support for A/B testing dramatically simplify HTE modeling on the Meta online experimentation platform, and help deploy treatment assignments based on HTE estimates.
In an initial training phase, Looper’s API acts as a drop-in replacement for the standard A/B testing API, and falls through to a standard randomized assignment while still logging features for each experiment participant. Then, metrics from the standard A/B testing repertoire help derive the treatment outcome (observations) for each participant, and the Looper platform trains specialized HTE models (meta-learners such as T-, X-, and S-learners Künzel et al. (2019)). In a final step, the HTE model predictions can be used in a decision policy to help make intelligent treatment assignments and measurably improve outcomes compared to any individual treatment alone. In this scenario, the best HTE estimate for a given user selects the actual treatment group. Our integration links Looper to an established experiment optimization system Bakshy et al. (2018) and creates synergies discussed in Section 4.2. A further extension relaxes the standard A/B testing contract to support fully dynamic assignments and enables reinforcement learning Apostolopoulos et al. (2021).
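The T-learner mentioned above fits one outcome model per experiment arm and estimates the treatment effect as their difference. The sketch below uses a deliberately trivial base learner (per-segment mean outcomes) in place of a real ML model; the segment names and the experiment log are fabricated for illustration.

```python
from collections import defaultdict


def fit_segment_means(rows):
    """Base learner: mean outcome per segment, with a global-mean fallback."""
    if not rows:
        raise ValueError("each experiment arm needs at least one row")
    totals = defaultdict(lambda: [0.0, 0])
    for segment, outcome in rows:
        totals[segment][0] += outcome
        totals[segment][1] += 1
    global_mean = sum(t[0] for t in totals.values()) / sum(t[1] for t in totals.values())
    means = {seg: s / n for seg, (s, n) in totals.items()}
    return lambda segment: means.get(segment, global_mean)


def fit_t_learner(log):
    """T-learner: separate outcome models for treated and control rows.

    `log` holds (segment, treated, outcome) tuples from a randomized test.
    Returns a function estimating the treatment effect for a segment.
    """
    mu1 = fit_segment_means([(s, y) for s, t, y in log if t])
    mu0 = fit_segment_means([(s, y) for s, t, y in log if not t])
    return lambda segment: mu1(segment) - mu0(segment)


def assign_treatment(effect, segment):
    """Decision policy: treat only where the estimated effect is positive."""
    return effect(segment) > 0.0


# Hypothetical experiment log: (segment, treated, outcome).
log = [
    ("new_user", True, 1.0), ("new_user", True, 1.0),
    ("new_user", False, 0.0), ("new_user", False, 1.0),
    ("power_user", True, 0.0), ("power_user", True, 1.0),
    ("power_user", False, 1.0), ("power_user", False, 1.0),
]
effect = fit_t_learner(log)
```

In this toy log the average treatment effect is zero (both arms average 0.75), yet the estimated effect is +0.5 for new users and -0.5 for power users, so a per-segment assignment beats either treatment applied uniformly.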
4 Impact of smart strategies
Looper supports real-time inference with moderate-sized models to improve various aspects of software systems. These models are deployed quickly and maintained without model-specific infrastructure, while our two-call RPC API decouples platform code from application code. Looper impact takes the form of substantial monetary savings and increased user engagement. We illustrate the diversity of applications with two types: data prefetching and causal product-metric optimization. Then we summarize the overall product adoption and impact of our platform.
4.1 Application deep dive – prefetching
Optimized resource prefetching via user history modeling in online applications may help to decrease the latency of user interactions by proactively loading application data. Modern ML methods can accurately predict the likelihood of data usage, minimizing unused prefetches. Our Looper platform supports prefetching strategies for many systems within Meta, often deeply integrated into the product infrastructure stack. For example, Meta’s GraphQL Byron (2015) data fetching subsystem uses our platform to decide which prefetch requests it should service, saving both client bandwidth and server-side resources. This technique yields around 2% compute savings at peak server load. As another example, Meta’s application client for lower-end devices (which employs a “thin-client” server-side rendering architecture Roy (2016)) also uses our platform to predictively render entire application screens. Our automated end-to-end system helps deploy both models and threshold-based decision policies and then tune them for individual GraphQL queries or application screens, with minimal engineering effort. Based on numerous deployed prefetch models, we have also developed large-scale modeling of prefetching. User-history models have already proven to be helpful for this task Wang et al. (2019); taking this idea one step further, we created application-independent vector embeddings based on users’ surface-level activity across all Meta surfaces. To accomplish this, we train a multi-task, auto-regressive neural network model to predict the length of time that a user will spend on a selection of the most frequently accessed application surfaces in the future (e.g., news feed, search, notifications), based on a sequence of (application surface, duration) events from the user’s historical activities. As is common practice in CV and NLP, intermediate layer outputs of this DNN serve as effective predictors of prefetch accesses and make specialized features unnecessary.
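A threshold-based decision policy of the kind mentioned above can be sketched as follows. This is an illustrative sketch under our own assumptions, not the platform's tuning procedure: the model is assumed to output a usage probability per request, the tuning target is assumed to be a minimum precision (the fraction of prefetches that end up used), and the function names and scored examples are hypothetical.

```python
def should_prefetch(p_use, threshold):
    """Decision policy: prefetch when the predicted usage probability clears the threshold."""
    return p_use >= threshold


def precision_recall(scored, threshold):
    """Precision and recall of the policy on (predicted_probability, was_used) pairs."""
    prefetched = [used for p, used in scored if p >= threshold]
    total_used = sum(1 for _, used in scored if used)
    hits = sum(1 for used in prefetched if used)
    precision = hits / len(prefetched) if prefetched else 1.0
    recall = hits / total_used if total_used else 0.0
    return precision, recall


def tune_threshold(scored, min_precision):
    """Return the lowest candidate threshold whose precision meets the target.

    Recall never increases as the threshold rises, so the lowest qualifying
    threshold also keeps the most useful prefetches.
    """
    for threshold in sorted({p for p, _ in scored}):
        precision, _ = precision_recall(scored, threshold)
        if precision >= min_precision:
            return threshold
    return None


# Hypothetical validation data for one query or screen.
scored = [
    (0.95, True), (0.80, True), (0.65, False),
    (0.60, True), (0.30, False), (0.10, False),
]
threshold = tune_threshold(scored, min_precision=0.75)
```

Running the same tuning per GraphQL query or per application screen would give each one its own threshold, which matches the per-query tuning described above in spirit.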
Optimized prefetching illustrates how secondary, domain-specific platforms are enabled by the core Looper platform; infrastructure teams only need to wire up the prediction and labeling integration points while Looper provides full ML support.
4.2 Application deep dive – personalized experiments
While focusing on our platform architecture, Section 3.4 briefly outlined the integration of an experiment optimization system based on causal reasoning and HTE as a platform specialization. In practice, this capability has an outsized impact on the adoption and utility of our platform, due to the accessibility of the experimentation APIs. Given that many companies developed A/B testing APIs Bakshy et al. (2014); Xu et al. (2015), it is beneficial to expose smart strategies through such APIs:
Simpler learning curve and client code via embedding the decision API in the standard A/B testing API.
Dataset preparation and modeling flow can be automated for the task of optimizing metric responses based on users exposed to each treatment. Metric responses can be automatically sourced from the experimentation measurement framework without manual labeling.
The impact of a smart strategy can be directly compared to baseline treatments by embedding the smart strategy in the experimentation framework.
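The first point above, embedding the decision API in the standard A/B testing API, can be sketched as a small wrapper. This is a hypothetical illustration and not Looper's actual interface: the class name `SmartExperiment`, its methods, and the hash-based randomization are assumptions made for the example. Before a policy is trained, the call falls through to deterministic randomized assignment while logging features; once a policy is attached, the same call site returns the policy's choice.

```python
import hashlib


class SmartExperiment:
    """Hypothetical A/B testing wrapper with an optional learned policy."""

    def __init__(self, name, arms, policy=None):
        self.name = name
        self.arms = arms
        self.policy = policy  # callable(features) -> arm, attached after training
        self.log = []         # stand-in for the platform's feature logging

    def _randomized_arm(self, user_id):
        # Deterministic hash-based bucketing: a user always lands in the same arm.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return self.arms[int(digest, 16) % len(self.arms)]

    def get_treatment(self, user_id, features):
        """Same call site in both phases: randomized first, policy-driven later."""
        if self.policy is None:
            arm = self._randomized_arm(user_id)
        else:
            arm = self.policy(features)
        self.log.append({"user_id": user_id, "features": features, "arm": arm})
        return arm


# Training phase: behaves like a plain A/B test but records features.
exp = SmartExperiment("new_feature_prompt", ["control", "treatment"])
arm = exp.get_treatment("user-42", {"sessions_7d": 3})
```

Because the client code does not change between phases, the smart strategy can be evaluated against the baseline treatments inside the same experimentation framework.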
Such experiment optimization previously needed dedicated engineering resources. The tight integration of the Looper platform with the experimentation framework now allows product engineers to quickly evaluate a smart strategy and optimize its product impact in several weeks. With automatic MOO, engineers find tradeoffs appropriate to a given product context. For example, during a server capacity crunch, one team traded a slight deterioration in a product metric for 50% resource savings. Per month, three to four adaptive product experiments launched via Looper use integrated experiment optimization for smart A/B testing and parameter optimization. Predicating product deployment on such experiments creates safeguards against ML models that generalize poorly to live data.
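One common way to frame such tradeoffs, shown here only as a sketch and not as the platform's MOO procedure, is to keep the Pareto-optimal candidates and then pick the one with the largest resource savings whose metric loss stays within a tolerance. The candidate names and all numbers below are hypothetical.

```python
def pareto_front(candidates):
    """Keep candidates not dominated on (metric_delta, savings); higher is better for both."""
    front = []
    for name, metric, savings in candidates:
        dominated = any(
            (m2 >= metric and s2 >= savings) and (m2 > metric or s2 > savings)
            for _, m2, s2 in candidates
        )
        if not dominated:
            front.append((name, metric, savings))
    return front


def pick_tradeoff(candidates, max_metric_loss):
    """Among Pareto-optimal candidates, maximize savings within the allowed metric loss."""
    feasible = [c for c in pareto_front(candidates) if c[1] >= -max_metric_loss]
    return max(feasible, key=lambda c: c[2]) if feasible else None


# Hypothetical candidates: (name, product-metric change in %, resource savings in %).
candidates = [
    ("baseline", 0.0, 0.0),
    ("A", -0.5, 35.0),
    ("B", -0.2, 15.0),
    ("C", -2.0, 45.0),
    ("D", -1.0, 10.0),  # dominated by A: worse metric and smaller savings
]
choice = pick_tradeoff(candidates, max_metric_loss=0.5)
```

Tightening or loosening `max_metric_loss` moves the choice along the front, which is how the same set of candidates can serve different product contexts, such as a temporary capacity crunch.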
4.3 Adoption and impact
Several internal vertical platforms at Meta Hazelwood et al. (2018) compete for a rich and diverse set of applications. Product teams sometimes relocate their ML models to a platform with greater advantages, while a few high-value applications are run by dedicated infrastructure teams. Looper was chosen and is currently used by 90+ product teams at Meta. In toto, these teams deploy 700 models that make 6 million decisions per second (averaged over the course of a typical day in November 2021). Application use cases fall into five categories (Figure 5), in decreasing order of usage:
Personalized Experience is tailored based on the user’s engagement history. For example, we display a new feature prominently only to those likely to use it.
Ranking orders items to improve user utility, e.g., to personalize a feed of candidate items for the viewer.
Prefetching/precomputing data/resources based on predicted likelihood of usage (Section 4.1).
Notifications/prompts can be gated on a per-user basis, and sent only to users who find them helpful.
Value estimation handles regression tasks, e.g., predicting the latency or memory usage of a data query.
The impact of ML performance on product metrics varies by application. For a binary classifier, increasing ROC AUC from 90% to 95% might not yield large product gains when such decisions contribute little to product metrics, e.g., when bottlenecks lie elsewhere. At the other extreme, an ROC AUC change from 55% to 60% may be significant when each percent translates into tangible resource or monetary savings, as illustrated by online payment processing.
Looper use cases have made substantial contributions to top-line company reporting metrics. While many product teams at Facebook and Instagram adopted Looper without additional staffing, several of them report improvements of 20-40% to their product goal metrics (user engagement, server utilization, monetary cost savings, etc.) due to Looper.
4.4 Impact on resource utilization
Smart strategies tend to provide significant benefits but may require serious computational resources (see resource utilization of various ML models in https://openai.com/blog/ai-and-compute/), so good resource management can distinguish success from failure. Looper is deployed in numerous and diverse applications at Meta, some of which optimize performance of other systems and some enhance functionality. This makes it difficult to report overall trends for optimizing resource utilization, but enables economies-of-scale infrastructure reuse and load-balancing. Figure 5 shows that different use cases exhibit different model-lifecycle bottlenecks, with feature extraction drawing the largest share of resources. This trend for relatively lightweight models with diverse metadata may not hold for advanced deep learning models with homogeneous image pixels, word embeddings, etc. Compared to standalone models, our platform offers savings from shared engineering infrastructure and optimizations; for example, the “reaping” of unimportant features has been widely deployed with over 11% resource cost savings and no negative product impact. In some use cases, savings reach 30% without adverse impacts.
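The idea behind feature reaping can be sketched as follows. This section does not state how Looper scores feature importance or measures cost, so the sketch assumes both are given; the feature names, the scores, and the cutoff are hypothetical. In practice one would retrain without the dropped features and confirm that product metrics do not regress before adopting the change.

```python
def reap_features(features, min_importance):
    """Drop features below an importance cutoff and report the fraction of cost saved.

    `features` maps a feature name to (importance, extraction_cost); both
    values are assumed to be supplied by some upstream analysis.
    """
    kept = {name: v for name, v in features.items() if v[0] >= min_importance}
    total_cost = sum(cost for _, cost in features.values())
    kept_cost = sum(cost for _, cost in kept.values())
    savings = 1 - kept_cost / total_cost if total_cost else 0.0
    return sorted(kept), savings


# Hypothetical feature set: name -> (importance, extraction cost).
features = {
    "recent_clicks": (0.40, 5.0),
    "session_count": (0.30, 2.0),
    "device_type": (0.25, 1.0),
    "rare_flag": (0.01, 2.0),
}
kept, savings = reap_features(features, min_importance=0.05)
```

Because feature extraction draws the largest share of resources for these models, removing even one low-importance but costly feature can yield a noticeable saving.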
5 Helpers and barriers to adoption
The article “Why Machine Learning Strategies Fail” Dickson (2021) lists common barriers to entry: (a) lacking a business case, (b) lacking data, (c) lacking ML talent, (d) lacking sufficient in-house ML expertise for outsourcing, (e) failing to evaluate an ML strategy. It is no less important to know why ML strategies succeed. To clarify the adoption process of smart strategies, we interviewed several product teams at Meta that adopted our platform and saw product impacts. All the teams had tried heuristic approaches but with poor results, hence their focus on ML. Simple heuristics proved insufficient for user bases spanning multiple countries with distinct demographic and usage patterns. The following challenges were highlighted: manually optimizing parameters in large search spaces, figuring out the correct rules to make heuristics effective, trading off multiple objectives, and updating heuristic logic quickly, especially in on-device code.
The spectrum of ML expertise varied across product teams from beginners to experienced ML engineers, and only 15% of teams using our platform include ML engineers. For teams without production ML experience, an easy-to-use ML platform is often the deciding factor for ML adoption, and ML investment continues upon evidence of utility. An engineer mentioned that a lower-level ML system had a confusing development flow and unwieldy debugging. They were also unable to set up recurring model training and publishing. Our platform hides concerns about SW upgrades, logging, monitoring, etc. behind high-level services and unlocks substantial productivity gains.
For experienced ML engineers, a smart-strategies platform improves productivity by automating repetitive time-consuming work: writing database queries, implementing data pipelines, setting up monitoring and alerts. Compared to narrow-focus systems, it helps product developers launch more ML use cases. An engineer shared prior experience writing custom queries for features and labels, and manually setting up pipelines for recurring training and model publishing without an easy way to monitor model performance and issue emergency alerts. Some prospective clients who evaluated our platform chose other ML platforms within our company or stayed with their custom-designed infrastructure. They missed batched offline prediction with mega-sized data and needed exceptional performance possible only with custom ML models. These issues can be addressed with additional platform development efforts.
Successful platform adopters configured ML models in two days and started collecting training data. Training the model using product feedback and revising it over 1-2 weeks enabled online product experiments that take 2-4 weeks. Product launch can take 1-3 months after initial data collection. Among platform adopters, experienced engineers aware of ML-related technical debt and risks Sculley et al. (2015); Agarwal et al. (2016); Paleyes et al. (2020); Dickson (2021); Sambasivan et al. (2021) appreciated the built-in support for recurring training, model publishing, and data visualization, as well as monitoring of label and feature distributions over time and alerting engineers to data drifts. Also noted was the canarying mechanism for new models (Section 3.2).
We outline opportunities to embed self-optimizing smart strategies for product decisions into software systems, so as to enhance user experience, optimize resource utilization, and support new functionalities. Our paper describes the deployment of smart strategies through software-centric ML integration where decision points are intercepted and data is collected through APIs Agarwal et al. (2016). This process requires infrastructure and automation to reduce mistakes in routine operations and maintain ML development velocity.
Our ML platform Looper addresses the complexities of product-driven end-to-end ML systems and facilitates at-scale deployment of smart strategies. As an important simplification, inference input processing matches that for training. Looper offers immediate, tangible benefits in terms of data availability, easy configuration, judicious use of available resources, reduced engineering effort, and ensuring product impact. It makes smart strategies easily accessible to software engineers Agarwal et al. (2016); Carbune et al. (2018) and enables product teams to build, deploy and improve ML-driven capabilities in a self-serve fashion without ML expertise. To this end, we observed product developers launch smart strategies within their products in one month. The lower barriers to entry and faster deployment lead to more pervasive use of ML to optimize user experience, including retrofitting of systems not designed with ML in mind as well as new application domains. The overall product impact is substantial both in terms of product metrics and in monetary terms. Long-term benefits include effort and module reuse, consistent reporting, reliable maintenance, etc. Looper adopters with positive experience often launch new, more sophisticated smart strategies. This virtuous cycle leads to a “new normal,” where data-driven smart strategies are built into SW systems by design to enhance user experience and adaptation to the environment. The Looper platform treats end-to-end ML-driven development more broadly than prior work Molino and Ré (2021); Wu et al. (2021), providing extensive support for product impact evaluation via causal inference and measurements of resource overhead. Platform specializations — for ranking, prefetching and personalized A/B testing — have been in demand, whereas end-to-end management enables holistic resource accounting and optimization Wu et al. (2021).
- Abadi et al. (2016) Martín Abadi et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In OSDI. 265–283.
- Agarwal et al. (2016) Alekh Agarwal et al. 2016. Making contextual decisions with low technical debt. arXiv:1606.03966 (2016).
- Agarwal et al. (2009) Deepak Agarwal, Bee-Chung Chen, and Pradheep Elango. 2009. Explore/exploit schemes for web content optimization. In ICDM 2009. 1–10.
- Amershi et al. (2019) Saleema Amershi et al. 2019. Software engineering for machine learning: A case study. In ICSE-SEIP. 291–300.
- Ananthanarayanan et al. (2013) Rajagopal Ananthanarayanan et al. 2013. Photon: Fault-Tolerant and Scalable Joining of Continuous Data Streams. In ACM SIGMOD Int’l Conf. on Management of Data (SIGMOD ’13). ACM, 577–588. https://doi.org/10.1145/2463676.2465272
- Anonymous (2021) Anonymous. 2021. ETL vs ELT: Must Know Differences. https://www.guru99.com/etl-vs-elt.html
- Apostolopoulos et al. (2021) Pavlos Athanasios Apostolopoulos et al. 2021. Personalization for Web-based Services using Offline Reinforcement Learning. arXiv:2102.05612 (2021).
- Arora et al. (2017) Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. ICLR (2017).
- Bakshy et al. (2018) Eytan Bakshy et al. 2018. AE: A domain-agnostic platform for adaptive experimentation. NeurIPS 2018 Systems for ML Workshop.
- Bakshy et al. (2014) Eytan Bakshy, Dean Eckles, and Michael S Bernstein. 2014. Designing and deploying online field experiments. In WWW ’14. 283–292.
- Balandat et al. (2020) Maximilian Balandat, Brian Karrer, Daniel R. Jiang, Samuel Daulton, Benjamin Letham, Andrew Gordon Wilson, and Eytan Bakshy. 2020. BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization. In NeurIPS 33.
- Beutel et al. (2017) Alex Beutel, Ed H Chi, Zhiyuan Cheng, Hubert Pham, and John Anderson. 2017. Beyond globally optimal: Focused learning for improved recommendations. In WWW ’17. 203–212.
- Breck et al. (2017) Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, and D. Sculley. 2017. The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. In IEEE Big Data.
- Byron (2015) Lee Byron. 2015. GraphQL: A data query language. https://engineering.fb.com/2015/09/14/core-data/graphql-a-data-query-language
- Carbune et al. (2018) Victor Carbune, Thierry Coppey, Alexander Daryin, Thomas Deselaers, Nikhil Sarda, and Jay Yagnik. 2018. SmartChoices: Hybridizing programming and machine learning. arXiv:1810.00619 (2018).
- Covington et al. (2016) P. Covington, J. Adams, and E. Sargin. 2016. Deep neural networks for Youtube recommendations. In RecSys.
- Craswell et al. (2008) Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. 2008. An experimental comparison of click position-bias models. In WSDM 2008. 87–94.
- D’Arcy (2021) Marc D’Arcy. 2021. Opinion: The 3 Post-COVID Trends Empowering People and Shaping the Future. https://adage.com/article/opinion/opinion-3-post-covid-trends-empowering-people-and-shaping-future/2342861
- Daulton et al. (2019) Samuel Daulton et al. 2019. Thompson sampling for contextual bandit problems with auxiliary safety constraints. arXiv:1911.00638 (2019).
- Daulton et al. (2021) Samuel Daulton, Maximilian Balandat, and Eytan Bakshy. 2021. Parallel Bayesian Optimization of Multiple Noisy Objectives with Expected Hypervolume Improvement. In NeurIPS 34. arXiv:2006.05078.
- Dickson (2021) Ben Dickson. 2021. Why machine learning strategies fail. https://venturebeat.com/2021/02/25/why-machine-learning-strategies-fail/
- Dunn (2016) Jeffrey Dunn. 2016. Introducing FBLearner Flow: Facebook’s AI backbone. https://engineering.fb.com/2016/05/09/core-data/introducing-fblearner-flow-facebook-s-ai-backbone
- Feng et al. (2020) Qing Feng, Benjamin Letham, Hongzi Mao, and Eytan Bakshy. 2020. High-dimensional contextual policy search with unknown context rewards using Bayesian optimization. NeurIPS 33 (2020).
- Gauci et al. (2018) Jason Gauci et al. 2018. Horizon: Facebook’s Open Source Applied Reinforcement Learning Platform. arXiv:1811.00260 (2018).
- Gupta et al. (2020) Udit Gupta et al. 2020. The Architectural Implications of Facebook’s DNN-based Personalized Recommendation. HPCA (2020), 488–501. arXiv:1906.03109
- Hazelwood et al. (2018) Kim Hazelwood et al. 2018. Applied machine learning at Facebook: A datacenter infrastructure perspective. In HPCA 2018. IEEE, 620–629.
- Hermann and Del Balso (2017) Jeremy Hermann and Mike Del Balso. 2017. Meet Michelangelo: Uber’s Machine Learning Platform. https://eng.uber.com/michelangelo-machine-learning-platform/
- Kohavi et al. (2009) Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M Henne. 2009. Controlled experiments on the Web: survey and practical guide. Data mining and knowledge discovery 18, 1 (2009), 140–181.
- Kraska et al. (2017) Tim Kraska et al. 2017. The Case for Learned Index Structures. CoRR (2017). arXiv:1712.01208
- Künzel et al. (2019) Sören R. Künzel et al. 2019. Metalearners for estimating heterogeneous treatment effects using machine learning. PNAS 116, 10 (Feb 2019), 4156–4165.
- Laud (2004) Adam Daniel Laud. 2004. Theory and application of reward shaping in reinforcement learning. UIUC.
- Letham et al. (2019) Benjamin Letham et al. 2019. Constrained Bayesian optimization with noisy experiments. Bayesian Analysis 14, 2 (2019), 495–519.
- Letham and Bakshy (2019) Benjamin Letham and Eytan Bakshy. 2019. Bayesian Optimization for Policy Search via Online-Offline Experimentation. J. ML Research 20, 145 (2019), 1–30.
- Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In WWW. 661–670.
- Li et al. (2020) Shen Li et al. 2020. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. In VLDB, Vol. 13(12).
- Mao et al. (2020) Hongzi Mao et al. 2020. Real-world video adaptation with reinforcement learning. arXiv:2008.12858 (2020).
- Miranda (2021) Lester James Miranda. 2021. Towards data-centric machine learning: a short review. (2021). https://ljvmiranda921.github.io/notebook/2021/07/30/data-centric-ml/
- Molino et al. (2019) P. Molino, Y. Dudin, and S. S. Miryala. 2019. Ludwig: a type-based declarative deep learning toolbox. arXiv:1909.07930 (2019).
- Molino and Ré (2021) P. Molino and C. Ré. 2021. Declarative Machine Learning Systems. ACM Queue 19 (2021). Issue 3.
- Naumov et al. (2019) Maxim Naumov et al. 2019. Deep Learning Recommendation Model for Personalization and Recommendation Systems. CoRR abs/1906.00091 (2019).
- Orr et al. (2021) Laurel J. Orr et al. 2021. Managing ML Pipelines: Feature Stores and the Coming Wave of Embedding Ecosystems. CoRR (2021). arXiv:2108.05053
- Paleyes et al. (2020) Andrei Paleyes, Raoul-Gabriel Urma, and Neil D Lawrence. 2020. Challenges in deploying machine learning: a survey of case studies. arXiv:2011.09926 (2020).
- Ré et al. (2019) Christopher Ré et al. 2019. Overton: A data system for monitoring and improving machine-learned products. arXiv:1909.05372 (2019).
- Rodríguez et al. (2018) Pau Rodríguez, Miguel A Bautista, Jordi Gonzàlez, and Sergio Escalera. 2018. Beyond One-hot Encoding: lower dimensional target embedding. arXiv:1806.10805 (2018).
- Roy (2016) Gautam Roy. 2016. How we built Facebook Lite for every Android phone and network. https://engineering.fb.com/2016/03/09/android/how-we-built-facebook-lite-for-every-android-phone-and-network
- Sagar (2021) Ram Sagar. 2021. Andrew Ng Urges ML Community To Be More Data-Centric. https://analyticsindiamag.com/big-data-to-good-data-andrew-ng-urges-ml-community-to-be-more-data-centric-and-less-model-centric/
- Sambasivan et al. (2021) Nithya Sambasivan et al. 2021. “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. SIGCHI, ACM (2021).
- Sculley et al. (2015) David Sculley et al. 2015. Hidden technical debt in machine learning systems. NIPS 28 (2015), 2503–2511.
- Soifer et al. (2019) Jonathan Soifer et al. 2019. Deep Learning Inference Service at Microsoft. In USENIX Conf. Operational Machine Learning (OpML). USENIX, 15–17. https://www.usenix.org/conference/opml19/presentation/soifer
- Stein (2019) Gregory J. Stein. 2019. Proxy metrics are everywhere in machine learning. http://cachestocaches.com/2019/1/proxy-metrics-are-everywhere-machine-lea
- Vartak and Madden (2018) M. Vartak and S. Madden. 2018. MODELDB: Opportunities and Challenges in Managing Machine Learning Models. IEEE Data Eng. Bull. 41, 4 (2018), 16–25.
- Wager and Athey (2018) S. Wager and S. Athey. 2018. Estimation and inference of heterogeneous treatment effects using random forests. J. Amer. Stat. Assoc. 113, 523 (2018), 1228–1242.
- Wang et al. (2019) Hanson Wang, Zehui Wang, and Yuanyuan Ma. 2019. Predictive Precompute with Recurrent Neural Networks. arXiv:1912.06779 (2019).
- Wu et al. (2021) Carole-Jean Wu et al. 2021. Sustainable AI: Environmental Implications, Challenges and Opportunities. arXiv:cs.LG/2111.00364
- Xu et al. (2015) Ya Xu et al. 2015. From infrastructure to culture: A/B testing challenges in large scale social networks. In KDD. 2227–2236.
- Yankov et al. (2015) Dragomir Yankov, Pavel Berkhin, and Lihong Li. 2015. Evaluation of explore-exploit policies in multi-result ranking systems. arXiv:1504.07662 (2015).
- Zhao et al. (2019) Zhe Zhao et al. 2019. Recommending what video to watch next: a multitask ranking system. In RecSys ’19. 43–51.