
XRBench: An Extended Reality (XR) Machine Learning Benchmark Suite for the Metaverse

by Hyoukjun Kwon, et al.

Real-time multi-model multi-task (MMMT) workloads, a new form of deep learning inference workloads, are emerging for application areas like extended reality (XR) to support metaverse use cases. These workloads combine user interactivity with computationally complex machine learning (ML) activities. Compared to standard ML applications, these ML workloads present unique difficulties and constraints. Real-time MMMT workloads impose heterogeneity and concurrency requirements on future ML systems and devices, necessitating the development of new capabilities. This paper begins with a discussion of the various characteristics of these real-time MMMT ML workloads and presents an ontology for evaluating the performance of future ML hardware for XR systems. Next, we present XRBench, a collection of MMMT ML tasks, models, and usage scenarios that execute these models in three representative ways: cascaded, concurrent, and cascaded-concurrent, for XR use cases. Finally, we emphasize the need for new metrics that capture the requirements properly. We hope that our work will stimulate research and lead to the development of a new generation of ML systems for XR use cases.





1 Introduction

Applications based on machine learning (ML) are becoming prevalent. The number of ML models that must be supported on the edge, mobile devices, and data centers is growing. The success of ML across tasks in vision and speech recognition is furthering the development of increasingly sophisticated use cases. For instance, the metaverse Meta (2022c) combines multiple unit use cases (e.g., image classification, segmentation, speech recognition, etc.) to create more sophisticated use cases (e.g., real-time interactivity via virtual reality). Such sophisticated use cases demand more functionality, for which application engineers are increasingly relying on composability; rather than developing different large models for different use cases, they are mixing multiple smaller, specialized ML models to compose task functionality Barham et al. (2022).

In this paper, we focus on this new class of ML workloads referred to as multi-model multi-task (MMMT) ML workloads, specifically in the context of extended reality (XR) for metaverse use cases. A real-time MMMT application for extended reality is illustrated by Figure 1. The figure depicts how several MMMT models can be cascaded and operated concurrently, sometimes dynamically subject to certain conditions, to provide complex application-level functionality. The center section of the figure demonstrates that processing throughput requirements can vary depending on the usage scenario. The right side of the figure shows how there can be a variety of interleaved execution patterns for each of the concurrent jobs. MMMT workloads exhibit model heterogeneity, expanded computation scheduling spaces Kwon et al. (2021), and usage-dependent real-time constraints, which makes them challenging to support compared to today’s single-model single-task (SMST) workloads.

Figure 1: An example real-time multi-model multi-task (MMMT) ML workload and an example execution timeline.

We identify three key issues that arise with MMMT workloads and present interesting system-level design challenges. The first is scenario-driven behavior. Each ML pipeline operates at a set frames-per-second (FPS) processing rate that is determined by a particular use case (e.g., virtual reality gaming, augmented reality social interaction, and outdoor activity recognition). A scenario may sometimes even demand zero FPS (i.e., deactivating a model) for models not required by the scenario. This fluctuating, context-driven FPS determines system resource utilization, which presents a hurdle when designing the underlying DNN accelerator: the heterogeneous workload makes it difficult to employ traditional DNN specialization.

Second, MMMT workloads exhibit complex dependencies. XR use cases display substantial data dependency (e.g., eye segmentation to tracking) and control dependency (e.g., hand detection to tracking) across models. These severe model-dependency limitations have ramifications for the underlying hardware and software scheduling space. In particular, the control flow dependencies make workload tasks dynamic, creating complexities for runtime scheduling.

Third, XR workloads have stringent user quality of experience (QoE) requirements. A key distinguishing factor of MMMT workloads from single-model single-task ML workloads is the importance of understanding how to quantify the aggregated QoE metric across all of the concurrent ML tasks at a system level. The resulting user quality of experience extends beyond the computational performance (latency or throughput) of a single model, which motivates the need for new metrics. Simple metrics like latency and/or FPS do not capture the complex interactions of all these models across diverse scenarios. For example, the latency of each inference run cannot be the absolute metric, since improving latency beyond the deadline set by the target processing rate may not improve the overall processing rate (e.g. the processing rate may be bound by the sensor input stream rather than inference time). Therefore, we need a new scoring metric that can capture the aggregate performance of the different MMMT workloads under different usage scenarios. The scoring metric must collectively consider all system aspects (model accuracy, achieved processing rate compared to the target processing rate, energy, etc.).

Collectively, not only do these three unique characteristics present system design challenges for XR, but they also make it challenging to benchmark and systematically characterize the performance of XR systems. Unfortunately, many of the characteristics and system-level concerns associated with MMMT workloads are not fully understood. This is largely due to the lack of public knowledge regarding the realistic characteristics of MMMT workloads, derived from industry use cases. Consequently, the ML system design area for these workloads has yet to be explored. Furthermore, there is no benchmark suite of MMMT workloads that reflects industrial use cases. Many industry and academic benchmark suites that exist today focus almost exclusively on SMST or MMMT without cascaded models Reddi et al. (2020).

To address these deficiencies, we develop XRBench, a real-time multi-model ML benchmark with new metrics tailored for real-time MMMT workloads such as from the metaverse. XRBench includes proxy workloads based on real-world industrial use cases taken from production scenarios. These proxy workloads encapsulate the end-to-end properties of MMMT workloads at both the ML kernel and system levels, enabling the study of a vast design space.

XRBench includes scenario-based FPS requirements for ML use cases, which reflect the complex dependencies found in applications driving system-design research in a large organization invested in augmented and virtual reality. It also presents representative QoE requirements for making informed system decisions. XRBench consists of many different usage scenarios of a metaverse end-user device that combine various unit-level ML models with different target processing rates to reflect the dynamicity and real-time features of MMMT workloads. Furthermore, to enable comprehensive evaluations of ML systems using XRBench, we also propose and evaluate new scoring metrics that encompass four distinct requirements for the QoE of real-time MMMT applications: (1) the degree of deadline violations, (2) frame drop rate, (3) system energy consumption, and (4) model performance (e.g., accuracy).

In summary, we make the following contributions:

  • We provide a taxonomy of MMMT-based workloads to articulate the unique features and challenges of real-time workloads for metaverse use cases.

  • We present XRBench, an ML benchmark suite for real-time XR workloads. We provide open-source reference implementations for each of the models to enable widespread adoption and usage.

  • We establish new scoring metrics for XRBench that capture key requirements for real-time MMMT applications and conduct quantitative evaluations.

2 MMMT Workload Characteristics

To assist XR systems research on real-time MMMT workloads, we define a benchmark suite based on industrial metaverse MMMT use cases. Before discussing the benchmark suite in Section 3, we first define the MMMT classification and the characteristics of a realistic MMMT workload, cascaded and concurrent MMMT.

2.1 Multi-model Machine Learning Workloads

Unlike single-model single-task (SMST) workloads, MMMT workloads include many models, which leads to multiple model-organization choices when constructing a workload instance. Based on the organization style, we define three major classes:

  • Cascaded MMMT (cas-MMMT): Run multiple models back-to-back to enable one complex functionality (e.g., audio pipeline in Figure 1).

  • Concurrent MMMT (con-MMMT): Run multiple models independently at the same time to enable multiple unit functionalities (e.g., run Mask R-CNN He et al. (2017) and PointNet Qi et al. (2018) to perform 2D and 3D object detection during mapping and localization).

  • Cascaded and concurrent MMMT (cascon-MMMT): Hybrid of cas- and con-MMMT; connect multiple models back-to-back (cas-MMMT style) to implement a complex ML pipeline and deploy multiple models (con-MMMT style) for the other functionalities. (e.g., the VR gaming usage scenario in Figure 1).

Static vs. Dynamic: In addition to the model organization style, the model execution graph can be static or dynamic depending on the unit pipelines defined for a workload. For example, as shown in Figure 1, hand tracking can be deactivated if the hand detection model detects no hand.
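As a concrete illustration, the dynamic hand pipeline above can be sketched as a tiny cascade in which the downstream tracking model is skipped when the detector finds no hand. This is our own minimal sketch; the function names and return types are illustrative, not part of XRBench's API:

```python
# Illustrative sketch of a dynamic cascaded (cas-MMMT) pipeline:
# hand tracking runs only if hand detection succeeds (a control dependency).

def hand_detection(frame):
    # Placeholder detector: pretend a hand is present when the frame dict
    # carries a "hand" flag. A real model would return bounding boxes.
    return frame.get("hand", False)

def hand_tracking(frame):
    # Placeholder tracker, invoked only when a hand was detected.
    return {"keypoints": 21}

def run_hand_pipeline(frame):
    if not hand_detection(frame):
        return None  # downstream model deactivated dynamically
    return hand_tracking(frame)
```

Because the branch depends on the detector's output at runtime, a scheduler cannot know statically whether the tracking inference will be requested, which is exactly what makes dynamic cascon-MMMT scheduling hard.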

In recent applications that encompass extended reality, we can observe dynamic and real-time cascon-MMMT style workloads Kwon et al. (2021), which represent some of the most complicated ML inference workloads today. Although such workloads are emerging, we lack a benchmark suite for dynamic cascon-MMMT, and consequently there is no deep understanding of its features and challenges. Next, we focus on those features and challenges.

2.2 Dynamic Cascon-MMMT Features and Challenges

Cascaded and concurrent MMMT workloads are an emerging class of ML inference tasks. They have unique features and issues that do not exist in conventional ML workloads. We outline such aspects and analyze the issues of cascon-MMMT workloads for metaverse (XR) applications.

2.2.1 Scenario-driven Workloads

Metaverse workloads come from various usage scenarios. A usage scenario refers to a specific user experience while utilizing a device or service. Gaming (e.g., VR gaming) and social (e.g., AR messaging) are example usage scenarios. Usage scenarios can be generated by combining several unit tasks, such as hand tracking or keyword detection. Thus, metaverse workloads must take the usage scenario into account to determine which unit tasks should be included, which is one of their distinctive elements compared to workloads in benchmarks such as MLPerf Reddi et al. (2020) and ILLIXR Huzaifa et al. (2021).

2.2.2 Real-Time Requirements

Many existing ML-based applications apply a single model inference to an input (e.g., an image or text). In contrast, metaverse devices are frequently required to continually execute inferences of a set of models in order to provide continuous user experiences (e.g., a user plays a VR game for an hour). As inference runs contribute to user experiences, a strong quality of experience (QoE) is required. In the context of multi-model inference, QoE can be represented by processing rate (i.e., inferences per second, such as FPS for models with frame-based inputs) or processing deadlines, hence introducing real-time processing requirements. Consequently, just as ML benchmarks must satisfy a certain level of accuracy for the quality of results Reddi et al. (2020), XR benchmarks must also provide target processing rates.

2.2.3 Dynamic Cascading of Models

Metaverse applications commonly utilize numerous models. For example, hand-based interaction capabilities are enabled by hand detection and hand tracking models. Such models are often cascaded (i.e., run sequentially in a back-to-back manner), and such cascaded models are characterized as a pipeline of models (or, an ML pipeline).

Figure 1 presents three ML pipeline examples. Such pipelines need to be transformed into data dependencies across models, which need to be considered while scheduling computations  Kwon et al. (2021). MMMT ML pipelines may deactivate one or more downstream models based on the upstream model’s results. For instance, when no hand is detected, the hand tracking pipeline does not initiate the downstream hand tracking model. Such a dynamic aspect presents another problem for scheduling computation. In addition, it indicates that metaverse benchmarks must include different usage scenarios that reflect the dynamic nature of metaverse workloads.

| Category | Task | Model | Dataset | Accuracy Requirement |
| --- | --- | --- | --- | --- |
| Interaction | Hand Tracking (HT) | Hand Shape/Pose Ge et al. (2019) | Stereo Hand Pose Zhang et al. (2017) | AUC PCK, GT 0.948 |
| Interaction | Eye Segmentation (ES) | RITNet Chaudhary et al. (2019) | OpenEDS 2019 Garbin et al. (2019) | mIoU, GT 90.54 |
| Interaction | Gaze Estimation (GE) | Eyecod You et al. (2022) | OpenEDS 2020 Palmero et al. (2021) | Angular Error, LT 3.39 |
| Interaction | Keyword Detection (KD) | Key-Res-15 Tang and Lin (2018) | Google Speech Cmd Google (2017) | Accuracy, GT 85.60 |
| Interaction | Speech Recognition (SR) | Emformer Shi et al. (2021) | LibriSpeech Panayotov et al. (2015) | WER (others), LT 8.79 |
| Context Understanding | Semantic Segmentation (SS) | HRViT Gu et al. (2022) | Cityscape Cordts et al. (2016) | mIoU, GT 77.54 |
| Context Understanding | Object Detection (OD) | D2Go Meta (2022a) | COCO Lin et al. (2014) | boxAP, GT 21.84 |
| Context Understanding | Action Segmentation (AS) | TCN Lea et al. (2017) | GTEA Fathi et al. (2011) | Accuracy, GT 60.8 |
| Context Understanding | Keyword Detection (KD) | Key-Res-15 Tang and Lin (2018) | Google Speech Cmd Google (2017) | Accuracy, GT 85.60 |
| Context Understanding | Speech Recognition (SR) | Emformer Shi et al. (2021) | LibriSpeech Panayotov et al. (2015) | WER (others), LT 8.79 |
| World Locking | Depth Estimation (DE) | MiDaS Ranftl et al. (2020) | KITTI Geiger et al. (2012) | LT 22.9 |
| World Locking | Depth Refinement (DR) | Sparse-to-Dense Ma and Karaman (2018) | KITTI Geiger et al. (2012) | GT 85.5 (100 samples) |
| World Locking | Plane Detection (PD) | PlaneRCNN Liu et al. (2019) | KITTI Geiger et al. (2012) | GT 0.37 |
Table 1: XRBench unit tasks and proxy unit models. The Model column lists the specific model and the source that introduced the model architectures. The Dataset column indicates the dataset to be used for each model. Note that some models (KD and SR) are used for multiple task categories. Accuracy requirements are 95% of reported accuracy (or, 105% of reported error) in original papers, which opens the benchmark to various optimization techniques (e.g. quantization, structured sparsity), while ensuring reasonable prediction correctness. LT and GT refer to less than and greater than. For some models, we down-scale dataset resolution to adjust to the context of wearable/mobile devices. Details on this and specific model instances for XRBench are in appendix A.
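The 95%/105% rule described in the Table 1 caption can be expressed directly. The following helper is our own sketch of that arithmetic (the function name is ours, not XRBench's):

```python
def accuracy_requirement(reported, lower_is_better=False):
    """Derive a benchmark requirement from a paper's reported score:
    95% of the reported value for higher-is-better (GT) metrics such as
    accuracy or mIoU, and 105% of the reported value for lower-is-better
    (LT) metrics such as WER or angular error."""
    return reported * (1.05 if lower_is_better else 0.95)
```

For example, a model reporting 80.0 accuracy yields a GT-76.0 requirement, and a model reporting 10.0 WER yields an LT-10.5 requirement, leaving headroom for optimizations such as quantization.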

2.2.4 Battery Life and Device Form Factor

The wearable form factor of metaverse devices makes thermal tolerances and battery life first-order priorities for user experience. For example, if heat dissipation is excessively high, it may lead to skin discomfort or burns. Long battery life is critical since wearable devices are intended to be used continuously throughout the day, but the form factor places further constraints on battery size even compared to other edge devices. For example, a recent metaverse device has an 800 mAh battery, a fifth of the size of the battery in a modern mobile phone (e.g., 4000 mAh in the Samsung Galaxy S20). Energy consumption must therefore be a primary optimization priority for metaverse end-user devices. All of the requirements (i.e., scenario-driven tasks, real-time requirements, dynamic cascading, battery life, and form factor) translate into energy constraints, so the benchmark needs to contain energy goals to ensure a device provides a good user experience.

3 XRBench

Real-time cascon-MMMT workloads for the metaverse are distinctive due to the discussed characteristics and obstacles. As a result, this domain necessitates a new method of defining benchmarks in comparison to traditional model-level benchmarks alone. In this section, we outline what we consider to be the most important characteristics of an MMMT benchmark. Then, we describe XRBench, the first benchmark of its kind for extended reality applications.

3.1 Benchmark Principles

To systematically guide the design of MMMT benchmarks, we focus on the key requirements for such a benchmark:

  • Usage Scenarios: A set of real-world usage situations based on production use cases and a list of models to be run for each usage scenario must be defined.

  • Model Dependency: As certain ML models are cascaded, model dependencies across the task must be specified to study resource allocation and scheduling effects.

  • Target Processing Rates: Provide meaningful and applicable real-time requirements and processing rates for each model in each usage scenario to establish application behavior and system performance expectations.

  • Variants of a Usage Scenario: To reflect the dynamic nature of model execution and enable apples-to-apples comparisons, the benchmark must provide multiple scenario variants with distinct active time windows for each model.

Based on these requirements, we define XRBench. We first discuss the unit models and usage scenarios used in XRBench, then describe its evaluation infrastructure and scoring techniques. Later, in Section 4, we show why each of these principles is important by conducting architectural analysis using XRBench.

3.2 Unit-level ML Models

Based on our experience in the metaverse (XR) domain, we define the first dynamic cascon-MMMT benchmark that reflects metaverse use cases. There are three main task categories in XRBench, listed in Table 1: real-time user interaction, understanding user context, and world locking (AR object rendering on the scene). These categories are based on real-world industrial use cases for the metaverse. For each unit task, we choose a representative reference model from the public domain. When selecting models, we evaluate their efficacy and efficiency, which is comprised of (1) model performance, (2) the number of FLOPs, and (3) the number of parameters. Additionally, we list datasets and accuracy requirements for each unit task. More information for each unit task, including specific open-source model instances and dataset input scaling, can be found in appendix A.

| Usage Scenario | Target Processing Rates (inferences/s) and Dependencies | Example Usage Scenario Description |
| --- | --- | --- |
| Social Interaction A | 30; 60; 60, ES(D); 30 | AR messaging with AR object rendering |
| Social Interaction B | 60; 60, ES(D); 30 | In-person interaction with AR glasses |
| Outdoor Activity A | 3; 3, KD(C); 10; 30 | Hiking with smart photo capture |
| Outdoor Activity B | 3; 3, KD(C); 30 | Rest during hike |
| AR Assistant | 3; 3, KD(C); 10; 10; 30; 30 | Urban walk with informative AR objects |
| AR Gaming | 45; 30; 30 | Gaming with AR object |
| VR Gaming | 45; 60; 60, ES(D) | Highly-interactive immersive VR gaming |
Table 2: Target processing rates. Model name next to processing rate indicates dependency (C: control dependency, D: data dependency).

3.2.1 Interaction

Real-time user interaction tasks enable users to control metaverse devices using various input methods, including hand movements, eye gaze, and voice inputs. Therefore, we include corresponding ML model pipelines: hand interaction pipeline (end-to-end model that performs hand detection and tracking), eye-interaction pipeline (ES and GE), and voice interaction (KD and SR).

3.2.2 Context Understanding

Context understanding tasks use multi- (e.g., VIO) or single-modal (e.g., audio) inputs to detect the context surrounding users so that a metaverse device can provide the appropriate user services. When a metaverse device detects that a user has entered a hiking trail, for example, it can provide the user with meteorological information. Context understanding models include scene understanding (SS, OD, and AS) and audio context understanding (KD and SR).

3.2.3 World Locking

A metaverse device must comprehend distances to real-world surfaces and occlusions in order to depict an augmented reality (AR) object on the display. These tasks are handled by models in the world-locking category, which includes a depth estimate model, a depth refinement model, and a plane detection model. The depth model is used to calculate the correct size of augmented reality (AR) objects, while the plane detection model identifies real-world surfaces that can be used to depict metaverse objects.

3.3 Usage Scenarios and Target Processing Rates

The models in Table 1 are not always active, but are selectively active with varying target processing rates depending on specific usage scenarios, as explained in Subsection 2.2.1. For example, the user experience of an augmented reality (AR) game based on intensive hand interaction requires high hand-tracking processing speeds. The speech pipeline may be completely stopped if the game does not use speech input. During outdoor activities like hiking, however, an AR-enabled metaverse device may not require hand-tracking functionality but must be prepared for user speech input.

To reflect the different usage scenarios and target processing rate characteristics, we chose five realistic metaverse scenarios: (1) social interaction (AR messaging with AR object rendering), (2) outdoor activity (smart photo capture during hiking), (3) AR assistant (AR information display based on user contexts), (4) AR gaming, and (5) VR gaming.

Even within the same usage scenario, the active models can differ because of the dynamic nature of cascon-MMMT workloads (explained in Subsection 2.2.3). For example, in the outdoor activity (hiking) usage scenario, when a user takes a break and uses an AR metaverse device (e.g., for navigation or photo capture), the hand tracking model will be engaged, unlike in the earlier hiking example. Considering such variability within usage scenarios, we provide two versions (A and B) of the social interaction and outdoor activity scenarios. Table 2 describes the usage scenario variants. In addition, we specify a target processing rate for each model at three levels: high (60 Hz or 45 Hz), medium (30 Hz), and low (10 Hz). SR has a processing rate of 3 Hz, which models the 320 ms left context size used in the SR model's original paper Shi et al. (2021). We assign target processing rates to each usage scenario based on practical metaverse use cases. We assume that a metaverse device identifies the active models and their processing rates (i.e., the usage scenario) based on the specific application launched by the user.

3.4 Input Sources and Load Generation

Figure 2: An overview of the benchmark harness, XRBench.

Metaverse devices utilize multiple sensors with varying modalities. To model the sensors, we use the settings listed in Table 3 for the unit models in Table 1. The camera is the input source of images used by computer vision models. The lidar sensor provides a sparse depth map to the depth refinement model using RGBd data. The microphone receives audio inputs for speech models (KD and SR).

| Input Source | Input Type | Streaming Rate | Jitter |
| --- | --- | --- | --- |
| Camera | Images | 60 FPS | 0.05 |
| Lidar | Sparse Depth Points | 60 FPS | 0.05 |
| Microphone | Audio | 3 FPS | 0.1 |
Table 3: Three main input sources to a metaverse device. We align all the input streaming rates to be 60 FPS for multi-modal models (e.g., DR in Table 1). We also model jitters for each data frame.

In an actual system, the arrival time of input data can deviate slightly from the time projected by the streaming rate, depending on multiple circumstances (e.g., system bus congestion). Jitter is frequently disregarded in research analyses, but in real production use it can cause sporadic frame drops, which degrade QoE. To represent such effects, we apply a jitter to each data frame as shown in Table 3 and alter the injection time of inference requests accordingly.
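A minimal load-generator sketch of this jitter model follows. We assume the jitter value in Table 3 is a fraction of one frame period and draw a uniform offset around each nominal arrival time; XRBench's actual jitter distribution may differ:

```python
import random

def arrival_times(rate_fps, duration_s, jitter_frac, seed=0):
    """Nominal arrival times i/rate, each perturbed by a uniform jitter of
    +/- jitter_frac of one frame period (illustrative model). Times are
    clamped at zero so the first frame cannot arrive before t=0."""
    rng = random.Random(seed)
    period = 1.0 / rate_fps
    times = []
    for i in range(int(duration_s * rate_fps)):
        jitter = rng.uniform(-jitter_frac, jitter_frac) * period
        times.append(max(0.0, i * period + jitter))
    return times
```

For a 60 FPS camera over one second this yields 60 timestamps roughly 16.7 ms apart, each shifted by at most 5% of a frame period.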

3.5 Benchmark Harness

A harness orchestrates the execution of the models, respecting the dependencies. We illustrate the structure of the benchmark harness in  Figure 2. The harness takes workload and system information as input, and generates reports that contain not only the scores (overall score and its break-downs; to be discussed in Subsection 3.7) but also detailed performance statistics such as amount of delay over deadline, frame drop, execution timeline, and so on. We include this detailed information in the reports to help users use XRBench to guide their system designs.

The harness consists of a runtime, logger, and scoring module. The runtime contains a load generator which intermittently generates jittered inference requests. The inference dispatcher/scheduler is the core component of the runtime, which (1) selects the next inference requests to be dispatched when a hardware entity (e.g., accelerator) becomes available, (2) tracks the model and frame dependencies, and (3) dispatches inferences to the machine learning system to be evaluated (which may be a real physical system, analytical cost model, or a simulator). The runtime components include an event detector, score tracker, and various data structures (request queue, active inference table, dependency table, etc.) that assist the dispatcher and scheduler.

XRBench requires users to finish a number of inference runs equal to the target processing rate within a set duration (default: one second) to ensure the real-time requirement is satisfied. XRBench provides a simple latency-greedy (for cost model or simulator-based runs) or a round-robin style scheduler (for real systems) for models within each usage scenario. Users can replace the scheduler and other components highlighted in yellow boxes in Figure 2 to model their runtime or system software’s behavior. Much like in a traditional ML system or ML benchmark Reddi et al. (2020), optimizing the software stack is crucial to the hardware’s success, and XRBench encourages such optimizations.
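A toy dispatcher in the spirit of the default round-robin scheduler can look as follows. All names and the budget-based loop are ours, purely for illustration; the real harness additionally tracks dependencies, deadlines, and frame drops as described above:

```python
from collections import deque

def round_robin_dispatch(ready_queues, budget):
    """Cycle over per-model ready queues, dispatching one pending
    inference request per model per turn until `budget` dispatch slots
    are used or all queues are empty. Mutates the input queues and
    returns the dispatch order as (model, request) pairs."""
    order = []
    queues = deque(ready_queues.items())
    while budget > 0 and any(q for _, q in queues):
        model, q = queues[0]
        queues.rotate(-1)  # move to the next model for the next turn
        if q:
            order.append((model, q.pop(0)))
            budget -= 1
    return order
```

For example, with two pending ES requests and one HT request, the dispatcher interleaves the models (ES, HT, ES) rather than draining one queue first, which is the fairness property round-robin provides.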

3.6 Deep Dive Example

To clarify the roles of each piece in XRBench, Figure 3 provides an example execution timeline for the “Social Interaction A” usage scenario in Table 2. The execution graph on the left shows the active models, their processing rates, and the model dependencies. The right side corresponds to a sample execution timeline for the scenario. The top-most section represents the input streaming from the relevant input sources listed in Table 3. Each input source can have a different initial delay and jitter.

A compute engine (such as an accelerator) can only execute model inferences if it has access to the input data. In this example, we model a simple scheduler assuming that inferences can only begin if the input data is ready. Consequently, Eye Segmentation (ES) and Gaze Estimation (GE) for frame 0 begin once the input data retrieval for image frame 0 concludes. Additionally, GE runs after ES to satisfy their dependency. The multi-modal Depth Refinement (DR) model executes after image and depth point inputs are received. As DR’s target processing rate is 30 FPS while depth point input streams at 60 FPS, only every other input is used. As Hand Tracking (HT) also operates at 30 FPS, it skips every other image frame. Certain output destinations need synchronization (e.g., VSYNC for display). The DR output is used for display-targeted AR object rendering. Therefore, DR results must be delivered by a certain time, which in this example is a 30 FPS deadline.

Figure 3: An example execution timeline based on the Social Interaction A usage scenario in Table 2.

In the execution timeline, the usage scenario is effectively supported and the desired processing rates are attained. However, ET and HT results are delivered past their desired deadlines, which suggests that HT and ET latency must be reduced further in this example. Latency itself should not always be the optimization target, though, as reducing latency beyond the deadline may not improve the user experience. Even zero-latency inferences cannot increase the effective processing rate of a task beyond the input data streaming rate, because future data cannot be processed before it arrives. This raises the question: how should we quantify the performance of a real-time MMMT system while taking into account the actual quality of the results for users? In the next subsection, we discuss new metrics that encompass the requirements and characteristics of XR tasks.

3.7 Scoring Metric

Using the example in Figure 3, Subsection 3.6 showed that evaluating a system for real-time MMMT workloads is not trivial: lower inference latency does not always improve user experience if the processing rate of each task is bound by the input data streaming rate. We need to capture such aspects when evaluating a system for real-time XR workloads. Therefore, we define a new scoring metric considering all the aspects we discussed and propose it as the overall performance metric in XRBench.

Based on the unique features of XR workloads, we list the following score requirements for the benchmark:

  • [Real-time] The score should include a penalty if the latency exceeds the usage scenario’s required performance constraints (i.e., missed deadlines).

  • [Low-energy] The score should prioritize low power designs as metaverse devices are energy-constrained.

  • [Model quality] The score should capture the output quality delivered to a user from running all the models.

  • [QoE requirement] The score should include a penalty if the FPS drops below the target FPS to maintain QoE.

We define four unit score components: real-time (RT) score, energy score, accuracy score, and QoE score. Each score is constrained to be in the [0, 1] range for easy analysis of score break-downs. We multiply unit scores to reflect all of their aspects while keeping the results in the [0,1] range. We utilize averages to summarize scores for multiple inference runs (i.e., inferences on different frames for a model), multiple models within a usage scenario, etc. We provide detailed formal definitions for each of the scores in appendix B and here we focus on the rationale and intuition.

To model real-time requirements, we consider the following observations: (1) too much optimization of inference latency beyond the deadline does not lead to higher processing rates; (2) reduced latency can still be helpful for scheduling other models; (3) violated deadlines gradually disrupt the user experience (e.g., achieving 59 FPS for an eye-tracking model targeting 60 FPS will not significantly affect the user experience). Based on these observations, we search for a function that (1) gradually rewards reduced latency and penalizes increased latency near a deadline (e.g., within 0.5 ms of a 10 ms deadline) and (2) outputs 0 and 1 if the latency is well beyond or well within the deadline, respectively. We find such a function by modifying the sigmoid function, which is widely used in ML models.
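The shape described above can be sketched as a sigmoid over the normalized slack to the deadline. This is an illustrative sketch, not the paper's exact formulation (which is given in Appendix B); the `sharpness` parameter and the slack normalization are assumptions chosen to satisfy the two stated properties.

```python
import math

def realtime_score(latency_ms: float, deadline_ms: float,
                   sharpness: float = 10.0) -> float:
    """Hypothetical sigmoid-based real-time score in [0, 1].

    ~1 when the latency is well within the deadline, ~0 when it is well
    beyond, with a smooth transition near the deadline. `sharpness` is an
    assumed knob controlling how quickly the score falls off.
    """
    # Normalized slack: positive when within the deadline, negative when late.
    slack = (deadline_ms - latency_ms) / deadline_ms
    return 1.0 / (1.0 + math.exp(-sharpness * slack))
```

With this shape, finishing exactly at the deadline yields 0.5, and the reward saturates rather than growing without bound as latency shrinks, matching the observation that over-optimizing latency beyond the deadline does not raise the processing rate.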

For energy, a lower-is-better metric, a naive way to compute an energy score is to take the inverse of the energy consumption (example unit: 1/mJ). However, the range of this naive metric is unbounded, which makes component-wise analysis hard when it is combined with other scores bounded to the [0, 1] range. Therefore, to bound the energy score within [0, 1] as well, we utilize a large energy limit to define the top end of the score.
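One plausible realization of such a bounded, lower-is-better energy score is a linear mapping against the energy limit. This is a sketch under the assumption of a linear form; the paper's exact definition is in Appendix B.

```python
def energy_score(energy_mj: float, energy_limit_mj: float) -> float:
    """Sketch: map energy (lower is better) into [0, 1].

    A large `energy_limit_mj` defines the bottom end (score 0); zero
    energy maps to the top end (score 1). The linear shape is an
    assumption for illustration.
    """
    if energy_mj >= energy_limit_mj:
        return 0.0
    return 1.0 - energy_mj / energy_limit_mj
```

Unlike the unbounded 1/mJ metric, this form composes cleanly with the other unit scores when multiplied, since the product of [0, 1] quantities stays in [0, 1].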

For the accuracy score, we quantify how much the output correctness differs from the desired level using model-specific performance metrics (e.g., accuracy for classification, mIoU for segmentation, and PCK AUC for hand tracking). Although many of these metrics are not literally accuracy, we use the term accuracy score for simplicity.

Finally, we construct the overall benchmark score in a hierarchical manner. Figure 4 illustrates how we combine scores along stages (unit, per-inference, per-model, and per-usage scenario) to generate the overall benchmark score. We first compute the per-inference score by multiplying the real-time, energy, and accuracy scores. The QoE score is not applied here because the frame drop rate can only be defined at the usage-scenario level, since the FPS requirements change depending on the usage scenario. Using the per-inference score, we construct the per-model score by computing the average across all processed frames. We do not include dropped frames since they are accounted for in the QoE score. To compute the per-usage-scenario score, we compute the average of the product of the per-model score and the QoE score across all the models within a usage scenario. For full details and formulations of each piece of the scoring framework, see Appendix B.
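The hierarchical combination described above can be sketched in a few lines. This is a structural illustration of the aggregation order (multiply unit scores per inference, average over processed frames per model, average model score times QoE per scenario); the function and variable names are ours, not the paper's.

```python
from statistics import mean

def per_inference_score(rt: float, energy: float, accuracy: float) -> float:
    # Multiply the unit scores; the product of [0, 1] values stays in [0, 1].
    return rt * energy * accuracy

def per_model_score(inference_scores: list[float]) -> float:
    # Average over processed frames only; dropped frames are
    # accounted for by the QoE score instead.
    return mean(inference_scores) if inference_scores else 0.0

def usage_scenario_score(model_scores: dict[str, float],
                         qoe_scores: dict[str, float]) -> float:
    # Average of (per-model score x QoE score) across the scenario's models.
    return mean(model_scores[m] * qoe_scores[m] for m in model_scores)
```

For example, two models with per-model scores 0.8 and 0.6 and QoE scores 1.0 and 0.5 yield a scenario score of (0.8 + 0.3) / 2 = 0.55.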

Figure 4: A high-level overview of how we define benchmark scores at inference run, model, and usage scenario granularity using unit scores (real-time, energy, accuracy, and QoE scores).

4 Evaluation

In this section, we focus on three key questions to ascertain the value of XRBench: (1) why the comprehensive overall score is necessary for the proper evaluation of XR tasks, (2) why it is important to study the different usage scenarios that are included in XRBench and (3) what are the hardware implications of the MMMT characteristics found in XR.

4.1 Methodology

Metaverse applications run on wearable devices, and the compute requirements of the workloads are heavy (tens of FPS required across multiple models). Therefore, considering the capabilities of state-of-the-art mobile SoCs (e.g., 26 TOPS on the Qualcomm Snapdragon 888 Qualcomm (2022)), we model wearable devices with DNN inference accelerators that employ 4K and 8K PEs with 256 GB/s on-chip bandwidth and 8 MiB of on-chip shared memory running at a 1 GHz clock, similar to Herald Kwon et al. (2021).

Simulated hardware styles: Table 4 shows the accelerator styles: FDA (fixed-dataflow accelerator), scaled-out multi-FDA (SFDA; multiple accelerator instances with the same dataflow style, motivated by Baek et al. (2020)), and HDA (heterogeneous dataflow accelerator) Kwon et al. (2021). Depending on the style, we partition the 4K and 8K PEs into two or four accelerator instances. The WS (weight-stationary) dataflow is inspired by NVDLA NVIDIA (2017) and parallelizes the output and input channels with input columns. OS (output-stationary) is a hand-optimized dataflow that parallelizes output rows and columns with a 16-way adder tree reducing input-channel-wise partial sums. The RS (row-stationary) dataflow is inspired by Eyeriss Chen et al. (2016) and parallelizes output channels, output rows, and kernel rows.

Simulation methodology: We implement the framework illustrated in Figure 2 and plug in MAESTRO Kwon et al. (2019) as the analytical cost model to perform the different case studies. All the models are the same across hardware platforms (8bit-quantized without other optimizations) and satisfy the accuracy goals (i.e., accuracy score = 1).

Dynamic cascading modeling methodology:

To model the dynamic cascading between keyword detection and speech recognition, we apply pre-defined probabilities of user keyword utterances to the corresponding usage scenarios (Outdoor A, Outdoor B, and AR Assistant). For the outdoor activity scenarios, we apply 0.2, as interaction is expected to be infrequent in those scenarios. For AR Assistant, we apply 0.5, as speech is the standard interaction method for that use case. For the eye segmentation and gaze estimation pipeline, we first apply a probability of 1.0 to model a pure data dependency and then sweep the probability for a separate deep dive (Figure 7).
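The probability-gated cascading above can be sketched as a per-frame coin flip deciding whether the downstream model is triggered. The function name and the fixed seed (for reproducible simulation runs) are our assumptions; the probabilities 0.2, 0.5, and 1.0 come from the methodology above.

```python
import random

def cascaded_requests(n_frames: int, trigger_prob: float,
                      seed: int = 0) -> list[bool]:
    """Sketch: per-frame trigger decisions for a dynamically cascaded model.

    E.g., whether speech recognition runs after keyword detection on each
    frame. A seeded RNG keeps simulated schedules reproducible.
    """
    rng = random.Random(seed)
    return [rng.random() < trigger_prob for _ in range(n_frames)]
```

A probability of 1.0 degenerates to a pure data dependency (the downstream model always runs), matching how the eye segmentation to gaze estimation pipeline is modeled before the sweep.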

Acc. ID Acc. Style Dataflow
A FDA WS
B  OS
C  RS
D SFDA Baek et al. (2020) WS + WS (1:1 partitioning)
E  OS + OS (1:1 partitioning)
F  RS + RS (1:1 partitioning)
G  WS + WS + WS + WS (1:1:1:1 partitioning)
H  OS + OS + OS + OS (1:1:1:1 partitioning)
I  RS + RS + RS + RS (1:1:1:1 partitioning)
J HDA WS + OS (1:1 partitioning)
K  WS + OS (3:1 partitioning)
L  WS + OS (1:3 partitioning)
M  WS + OS + WS + OS (1:1:1:1 partitioning)
Table 4: Accelerator styles. Partitioning indicates how the PEs are divided across accelerator instances for SFDA and HDA.
Figure 5: The scores computed for each style of an accelerator system with 4K and 8K PEs. (a-g) the score break-downs for each usage scenario. (h) the average across scenarios.
Figure 6: Execution timeline of AR gaming scenario on 4k and 8k PE versions of WS and OS HDA accelerator (accelerator J).

4.2 Why the Overall Score is a Necessary Metric

The intent of this section is to show that the overall scoring metric we present (Section 3.7) is necessary for systematically evaluating XR systems. We present our evaluation results in Figure 5, which shows score break-downs for each accelerator style running each usage scenario.

4.2.1 Overall Score Enables Comprehensive Analysis

The real-time score quantifies the degree of deadline violation. Higher is better for the real-time score; however, a high real-time score by itself does not guarantee ideal system performance. For example, accelerator A with 8K PEs running Outdoor Activity B (Figure 5 (d)) has a real-time score of 1.0, which indicates that most of the deadlines are met within a small margin. However, accelerator A misses 10.0% of the frames (not shown) and has high energy consumption, 34.1% greater than that of the most energy-efficient design (accelerator C). Our scoring metric incorporates all aspects, including the QoE score for frame drops and the energy score for energy consumption, and it reports an overall score of 0.49, which is 42.9% lower than that of the best accelerator (I).

As another example, for the AR Gaming scenario (Figure 5 (g)) on a 4K-PE accelerator system, accelerator G achieved the greatest QoE score of 0.91 and a strong energy score of 0.76. However, its real-time score is zero due to heavily missed deadlines. That is, while the frame rate is overall close to the target (as captured in the QoE score), a user will experience heavy output lag, which degrades the real-time experience. The real-time score captures this and drives the overall score for this accelerator to zero.

4.2.2 Hardware Utilization is the Wrong Metric

Hardware utilization is often used as a key metric for accelerator workloads, since it can be directly translated to accelerator performance by multiplying utilization by the peak performance of the accelerator. However, we do not consider hardware utilization to be the right metric for real-time MMMT applications, and as such we do not include it in the overall scoring metric (Section 3.7).

Utilization does not consider frame drops or periodic workload injection. For example, Figure 6 shows the execution timelines for the 4K and 8K PE versions of accelerator J. The 8K PE timeline (Figure 6, (b)) has more gaps than the 4K PE timeline, which means the overall accelerator utilization is lower than that of the 4K PE accelerator (Figure 6, (a)), making it seem as though the 4K PE accelerator is a better choice. However, the 4K-PE accelerator drops 47.1% of the frames and completely fails to run the PD model, whereas the 8K-PE accelerator drops only 2.3% of frames.

In contrast to utilization's failure to reflect overall performance, our score metric properly captures the real-time aspects. The frame drop rates of the 4K- and 8K-PE accelerators are translated into QoE scores of 0.53 and 0.97, respectively. In addition, the large number of deadline violations due to the 100% frame drop rate on the PD model in the 4K-PE accelerator is reflected in its real-time score of 0. The overall scores for the two accelerators, 0 and 0.51, give a much better sense of their end-to-end performance than the utilization values imply.
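Consistent with the numbers above (QoE scores of 0.53 and 0.97 for drop rates of 47.1% and 2.3%), the QoE score can be read as the fraction of frames that were not dropped. This is a simplified sketch of that reading; the paper's exact QoE definition is in Appendix B.

```python
def qoe_score(dropped_frames: int, total_frames: int) -> float:
    """Sketch: QoE as the fraction of frames delivered (not dropped).

    Matches the reported examples to two decimal places: a 47.1% drop
    rate gives ~0.53, and a 2.3% drop rate gives ~0.97.
    """
    if total_frames == 0:
        return 0.0
    return 1.0 - dropped_frames / total_frames
```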

Figure 7: Evaluated scores on accelerators B and J with 4K PEs running VR gaming scenario. We vary the probability of triggering GE after ES, modeling the dynamic cascading.


Table 5: List of existing benchmarks related to ML and XR workloads, with a comparison of workload characteristics and score metrics. A half-filled marker means the property is partially supported by the benchmark. Workload characteristics compared: Cascon-MMMT, dynamic workload, real-time scenarios, focus on ML, and device scope. Score metrics compared: latency, energy, accuracy, and QoE.

Benchmark (device scope):

  • ML: MLPerf Inference (a; server), MLPerf Tiny (b; edge), MLPerf Mobile (c; mobile), DeepBench (d; server/edge), AI Benchmark (e; mobile), EEMBC MLMark (f; edge), AIBench (g; server), AIoTBench (h; mobile/edge), VRMark (j; PC)

  • ML + XR: XRBench (edge)

References: (a) Reddi et al. (2020); (b) Banbury et al. (2021); (c) Reddi et al. (2022); (d) DeepBench [8]; (e) Ignatov et al. (2018); (f) EEMBC MLMark [10]; (g) Gao et al. (2019); (h) Luo et al. (2018); (i) Huzaifa et al. (2021); (j) VRMark [48]

4.3 Why It is Important to Dive into Usage Scenarios

Even though all usage scenarios in XRBench reflect the metaverse domain, the individual workload characteristics are diverse and tend to vary during execution, resulting in different system performance. Each usage scenario prefers a different accelerator type, as shown in Figure 5. For example, in the 4K-PE configuration, the Social Interaction A scenario (Figure 5 (a)) prefers the FDA-style accelerator with the WS dataflow (accelerator A), whereas Outdoor Activity A (Figure 5 (c)) prefers the SFDA style with four sub-accelerators using the OS dataflow (accelerator H).

Moreover, dynamically cascaded models (Section 2.2) require a deep-dive into corresponding usage scenarios. To understand the impact of dynamically cascaded models, we vary the probability of triggering the GE model after ES (GE is triggered only if ES results have sufficiently large segmented eyes). We highlight low- and high-score cases (accelerators B and J) in Figure 7. Overall, a high-score design (accelerator J) maintains its high scores over the probability change. The low-score design (accelerator B) shows score degradation when the probability is high. For the low-score design, we observe small score fluctuation between 50% and 75% probabilities. The current greedy scheduler trades off deadline violations with frame drop rate, which results in a better overall score. Such a case motivates further studies in the system scheduler, which can be facilitated by the XRBench benchmark harness. Overall, the results suggest that usage scenarios should be carefully examined under a variety of dynamic instances to design more robust systems.

4.4 What the Hardware Implications of XRBench Are

Comparing accelerator styles in the 4K-PE setting, we find that the highest-scoring accelerator style differs for each workload. For example, accelerator A (FDA style, WS dataflow, single-accelerator system) is the best style for the Social Interaction A scenario (Figure 5 (a)). However, accelerator F (SFDA style, OS dataflow, four-accelerator system) performed the best for the Outdoor Activity B scenario (Figure 5 (d)). We also observe that the best accelerator style changes with the accelerator size (i.e., the number of PEs). For example, style H (SFDA style, RS dataflow, four-accelerator system) performs best for the AR Assistant scenario (Figure 5 (e)) with 4K PEs, but when the total number of PEs changes to 8K, style M (HDA style, WS and OS dataflows, four-accelerator system) performs best. These results imply that the design space for XR applications is complex, shaped by the distinctive features of real-time MMMT workloads, which motivates follow-up studies using XRBench.

We also find that the number of models in an MMMT workload affects the preference for multi-accelerator systems (e.g., SFDA and HDA). The AR Assistant (Figure 5 (e)) and VR Gaming (Figure 5 (f)) scenarios include the most (six) and fewest (three) models, respectively. For AR Assistant, we observe that the multi-accelerator styles (SFDA and HDA) outperform the single-accelerator style. For the VR Gaming scenario, in contrast, the FDA style (accelerator A) outperforms most of the other accelerators. In particular, when the sub-accelerator size is sufficiently large (8K PEs), a quad-accelerator system (HDA accelerator M) performs best on the many-model scenario (AR Assistant), but the same system underperforms on the fewer-model scenario (VR Gaming). These data show the efficacy of parallel model execution using sub-accelerators, which motivates exploring scale-out designs for many-model MMMT workloads like AR Assistant.

5 Related Work

Based on the characteristics we describe in Section 3, we present the limitations of existing ML and XR benchmarks in Table 5. XRBench is unique in that it is the only workload suite that captures complex workload dependencies, is ML-focused, presents several real-world usage scenarios distilled from industry practice, and establishes a robust scoring metric (even compared to benchmarks like ILLIXR or VRMark). Due to space limitations, we defer detailed discussions of the benchmarks to Appendix C. In summary, XRBench is the first suite to include several ML workloads tailored for XR applications.

6 Conclusion

Metaverse use cases necessitate complex ML benchmark workloads that are essential for fair and useful analyses of existing and future system performance, but such workloads exceed the capabilities of existing benchmark suites. The XR benchmark we present, which is based on industry experience, captures the diverse and complex characteristics of these emerging ML-based MMMT workloads. We believe it will foster new ML systems research focused on XR.


  • [1] (2018) AIoTBench, benchcouncil. Note: Cited by: Appendix C.
  • E. Baek, D. Kwon, and J. Kim (2020)

    A multi-neural network acceleration architecture

    In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 940–953. Cited by: §4.1, footnote 1.
  • C. Banbury, V. J. Reddi, P. Torelli, J. Holleman, N. Jeffries, C. Kiraly, P. Montino, D. Kanter, S. Ahmed, D. Pau, et al. (2021) MLPerf tiny benchmark. arXiv preprint arXiv:2106.07597. Cited by: Appendix C, Appendix C, item b.
  • P. Barham, A. Chowdhery, J. Dean, S. Ghemawat, S. Hand, D. Hurt, M. Isard, H. Lim, R. Pang, S. Roy, B. Saeta, P. Schuh, R. Sepassi, E. L. Shafey, A. C. Thekkath, and Y. Wu (2022) Pathways: asynchronous distributed dataflow for ml. Proceedings of Machine Learning and Systems 4, pp. 430–449. Cited by: §1.
  • A. K. Chaudhary, R. Kothari, M. Acharya, S. Dangi, N. Nair, R. Bailey, C. Kanan, G. Diaz, and J. B. Pelz (2019) Ritnet: real-time semantic segmentation of the eye for gaze tracking. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 3698–3702. Cited by: Table 6, Table 1.
  • Y. Chen, J. Emer, and V. Sze (2016)

    Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks

    In International Symposium on Computer Architecture (ISCA), Cited by: §4.1.
  • M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016)

    The cityscapes dataset for semantic urban scene understanding


    Proceedings of the IEEE conference on computer vision and pattern recognition

    pp. 3213–3223. Cited by: Table 1.
  • [8] (2016) Deepbench. Note: Cited by: Appendix C, Appendix C, item d.
  • [9] (2016) ED-tcn. Note: Cited by: Table 6.
  • [10] (2020) EEMBC mlmark v2.0. Note: Cited by: Appendix C, Appendix C, item f.
  • S. Farrell, M. Emani, J. Balma, L. Drescher, A. Drozd, A. Fink, G. Fox, D. Kanter, T. Kurth, P. Mattson, et al. (2021) MLPerf™ hpc: a holistic benchmark suite for scientific machine learning on hpc systems. In 2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC), pp. 33–45. Cited by: Appendix C, Appendix C.
  • A. Fathi, X. Ren, and J. M. Rehg (2011) Learning to recognize objects in egocentric activities. In CVPR 2011, pp. 3281–3288. Cited by: Table 1.
  • [13] (2019) Galaxy s20 specifications. Note: Cited by: §2.2.4.
  • W. Gao, F. Tang, L. Wang, J. Zhan, C. Lan, C. Luo, Y. Huang, C. Zheng, J. Dai, Z. Cao, D. Zheng, H. Tang, K. Zhan, B. Wang, D. Kong, T. Wu, M. Yu, C. Tan, H. Li, X. Tian, Y. Li, J. Shao, Z. Wang, X. Wang, and H. Ye (2019) AIBench: an industry standard internet service ai benchmark suite. ArXiv abs/1908.08998. Cited by: Appendix C, Appendix C, item g.
  • S. J. Garbin, Y. Shen, I. Schuetz, R. Cavin, G. Hughes, and S. S. Talathi (2019) Openeds: open eye dataset. arXiv preprint arXiv:1905.03702. Cited by: Appendix A, Table 1.
  • L. Ge, Z. Ren, Y. Li, Z. Xue, Y. Wang, J. Cai, and J. Yuan (2019)

    3d hand shape and pose estimation from a single rgb image

    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 6, Table 1.
  • A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition, pp. 3354–3361. Cited by: Appendix A, Table 1.
  • [18] (2019) Glass enterprise edition 2 specifications. Note: Cited by: §2.2.4.
  • Google (2017) Google speech commands. Note: Cited by: Table 1.
  • J. Gu, H. Kwon, D. Wang, W. Ye, M. Li, Y. Chen, L. Lai, V. Chandra, and D. Z. Pan (2022) Multi-scale high-resolution vision transformer for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12094–12103. Cited by: Table 6, Table 1.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: 2nd item.
  • [22] (2022) HRViT-b1. Note: Cited by: Table 6.
  • M. Huzaifa, R. Desai, S. Grayson, X. Jiang, Y. Jing, J. Lee, F. Lu, Y. Pang, J. Ravichandran, F. Sinclair, B. Tian, H. Yuan, J. Zhang, and V. S. Adve (2021) ILLIXR: enabling end-to-end extended reality research. Cited by: Appendix C, Appendix C, §2.2.1, item i.
  • A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool (2018) Ai benchmark: running deep neural networks on android smartphones. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0. Cited by: Appendix C, Appendix C, item e.
  • H. Kwon, P. Chatarasi, M. Pellauer, A. Parashar, V. Sarkar, and T. Krishna (2019) Understanding reuse, performance, and hardware cost of dnn dataflow: a data-centric approach. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 754–768. Cited by: §4.1.
  • H. Kwon, L. Lai, M. Pellauer, T. Krishna, Y. Chen, and V. Chandra (2021) Heterogeneous dataflow accelerators for multi-dnn workloads. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 71–83. Cited by: §1, §2.1, §2.2.3, §4.1, §4.1.
  • C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager (2017) Temporal convolutional networks for action segmentation and detection. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165. Cited by: Table 6, Table 1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: Table 1.
  • C. Liu, K. Kim, J. Gu, Y. Furukawa, and J. Kautz (2019) Planercnn: 3d plane detection and reconstruction from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4450–4459. Cited by: Table 6, Table 1.
  • C. Luo, F. Zhang, C. Huang, X. Xiong, J. Chen, L. Wang, W. Gao, H. Ye, T. Wu, R. Zhou, and J. Zhan (2018) AIoT bench: towards comprehensive benchmarking mobile and embedded device intelligence. In Bench, Cited by: Appendix C, Appendix C, item h.
  • F. Ma and S. Karaman (2018) Sparse-to-dense: depth prediction from sparse depth samples and a single image. In 2018 IEEE international conference on robotics and automation (ICRA), pp. 4796–4803. Cited by: Table 6, Table 1.
  • Meta (2019) FBNet-c. Note: Cited by: Table 6.
  • Meta (2022a) D2GO. Note: Cited by: Table 1.
  • Meta (2022b) Faster-rcnn-fbnetv3a. Note: Cited by: Table 6.
  • Meta (2022c) What is the metaverse?. Note: Cited by: §1.
  • [36] (2020) Midas_v21_small. Note: Cited by: Table 6.
  • NVIDIA (2017) NVDLA deep learning accelerator. Note: Cited by: §4.1.
  • C. Palmero, A. Sharma, K. Behrendt, K. Krishnakumar, O. V. Komogortsev, and S. S. Talathi (2021) OpenEDS2020 challenge on gaze tracking for vr: dataset and results. Sensors 21 (14), pp. 4769. Cited by: Appendix A, Table 1.
  • V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5206–5210. Cited by: Table 1.
  • C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 918–927. Cited by: 2nd item.
  • Qualcomm (2022) Snapdragon 888 5g mobile platform. Note: Cited by: §4.1.
  • R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun (2020) Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence. Cited by: Table 6, Table 1.
  • V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C. Wu, B. Anderson, M. Breughe, M. Charlebois, W. Chou, et al. (2020) Mlperf inference benchmark. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 446–459. Cited by: Appendix C, Appendix C, §1, §2.2.1, §2.2.2, §3.5, item a.
  • V. Reddi, D. Kanter, P. Mattson, J. Duke, T. Nguyen, R. Chukka, K. Shiring, K. Tan, M. Charlebois, W. Chou, M. El-Khamy, J. Hong, T. St John, C. Trinh, M. Buch, M. Mazumder, R. Markovic, T. Atta, F. Cakir, M. Charkhabi, X. Chen, C. Chiang, D. Dexter, T. Heo, G. Schmuelling, M. Shabani, and D. Zika (2022) MLPerf mobile inference benchmark: an industry-standard open-source machine learning benchmark for on-device ai. In Proceedings of Machine Learning and Systems, D. Marculescu, Y. Chi, and C. Wu (Eds.), Vol. 4, pp. 352–369. External Links: Link Cited by: Appendix C, item c.
  • [45] (2019) RITNet. Note: Cited by: Table 6.
  • Y. Shi, Y. Wang, C. Wu, C. Yeh, J. Chan, F. Zhang, D. Le, and M. Seltzer (2021) Emformer: efficient memory transformer based acoustic model for low latency streaming speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6783–6787. Cited by: Table 6, Table 1, §3.3.
  • R. Tang and J. Lin (2018) Deep residual learning for small-footprint keyword spotting. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5484–5488. Cited by: Table 6, Table 1.
  • [48] (2020) VRMark. Note: Cited by: Appendix C, Appendix C, item j.
  • H. You, C. Wan, Y. Zhao, Z. Yu, Y. Fu, J. Yuan, S. Wu, S. Zhang, Y. Zhang, C. Li, et al. (2022) EyeCoD: eye tracking system acceleration via flatcam-based algorithm & accelerator co-design. arXiv preprint arXiv:2206.00877. Cited by: Table 6, Table 1.
  • J. Zhang, J. Jiao, M. Chen, L. Qu, X. Xu, and Q. Yang (2017) A hand pose tracking benchmark from stereo matching. In 2017 IEEE International Conference on Image Processing (ICIP), Vol. , pp. 982–986. External Links: Document Cited by: Appendix A, Table 1.

Appendix A Benchmark Model Instances

Task Model Reference Model Instance Model Type Major Operators
 HT Hand Graph-CNN Ge et al. (2019) Hand Shape/Pose Ge et al. (2019) CNN CONV2D, Maxpool, FC
 ES RITNet Chaudhary et al. (2019) RITNet [45] CNN CONV2D, AvgPool, Skip Connection
 GE EyeCoD You et al. (2022) FBNet-C Meta (2019) CNN CONV2D, DWCONV, Skip Connection
 KD Key-Res-15 Tang and Lin (2018) res8-narrow Tang and Lin (2018) CNN CONV2D, Avgpool, Skip Connection
 SR Emformer Shi et al. (2021) EM-24L Shi et al. (2021) Transformer Self-attention, Layernorm
 SS HRViT Gu et al. (2022) HRViT-b1 [22] Transformer Self-attention, Layernorm, DWCONV
 OD D2Go Meta (2022b) Faster-RCNN-FBNetV3A Meta (2022b) R-CNN CONV2D, DWCONV, Skip Connection
 AS TCN Lea et al. (2017) ED-TCN [9] CNN CONV2D, Maxpool, Upsample
 DE MiDaS Ranftl et al. (2020) midas_v21_small [36] CNN CONV2D, DWCONV, Skip Connection
 DR Sparse-to-Dense Ma and Karaman (2018) RGBd-200 Ma and Karaman (2018) CNN CONV2D, DeCONV, DWCONV
 PD PlaneRCNN Liu et al. (2019) PlaneRCNN Liu et al. (2019) R-CNN CONV2D w/ FPN, RoIAlign
Table 6: Specific model instances for the XRBench unit models listed in Table 1, along with each model's type and major operators.

As an extension to Table 1, this section provides more details on the models included in XRBench. Table 6 specifies which model variation (Model Instance) is adopted from the representative model (Model Reference), along with the baseline or backbone structure (Model Type) and the major operators that compose the model (Major Operators). The model instances are chosen based on their size, considering the edge use case. In addition, we also down-scale the dataset resolution of certain tasks to match the context of edge devices: Stereo Hand Pose Zhang et al. (2017) is scaled down for Hand Tracking (HT), OpenEDS 2019 Garbin et al. (2019) and OpenEDS 2020 Palmero et al. (2021) are scaled down for Eye Segmentation (ES) and Gaze Estimation (GE), respectively, and KITTI Geiger et al. (2012) is scaled down for Plane Detection (PD).

As shown in the table, there is a variety of model types and operators included in the XRBench workloads, representative of the diverse computing requirements of an XR system. Such heterogeneity emphasizes the need for innovative solutions to realize XR device capabilities.

Appendix B Problem Formulation

Symbol Definition
  A unit model
  The input source for the model
  Streaming rate of an input src (FPS)
  Initialization latency (ms) of an input stream
  Max jitter in ms
  A dataset
  Model performance metric (e.g., accuracy) ID
  Target model performance metric
  Dependency of model
  The i-th frame for the input of model
  Inference request on a model and frame
  An inference request instance
  A distribution to model jitters on frame i
  The request time for an inference request r
  The deadline for an inference request r
  The time window between a request time and deadline
  The latency for an inference request r on a HW
  The real-time score on an inference request r
  The target frame ID for benchmarking on a model
Table 7: Symbols used in the formulation.
Definition 1.

Input Data Stream
The input data stream is defined as follows:

Definition 1 formulates the input stream description in Table 3: an input stream consists of the input source identifier, the streaming rate (FPS), the initial latency of the stream, and the maximum jitter in milliseconds.

Definition 2.

Unit models
The set of unit models is defined as follows:

Definition 2 defines the set of unit models: a unit model is an element of the set and consists of the model ID, the input source ID from Definition 1, the dataset, and a tuple of the target metric ID (a string) and the target value of that metric (a floating-point number). Based on the definition of a unit model, we define the usage scenario as follows:

Definition 3.

Usage Scenario
Given a set of unit models to be included to usage scenarios, a usage scenario is defined as follows:

where is an integer.

In Definition 3, the model-granularity dependency lists the set of models each model depends on. When we have multiple pipelines for the same model (e.g., two eye-tracking model runs, one for each eye), we include multiple instances of the same model within the usage scenario, with different input sources if necessary. With Definition 3, we can define the benchmark suite as follows.

Definition 4.

Benchmark Suite
Given a set of usage scenarios , a real-time MMMT benchmark suite is defined as follows:

Definition 4 shows that a real-time MMMT benchmark is a collection of usage scenarios, as described in Table 2. Having formulated the benchmark suite, we now define some additional concepts needed for the scoring metric.

Definition 5.

Inference Request
The inference request for a model on the i-th input frame is defined as follows:

Using Definition 5, we define the inference request time and deadline as follows.

Definition 6.

Inference Request time
For an inference request , we define the inference request time over an input stream as follows:

where the first term indicates the setup latency of the input stream from the input source, the second term represents the time until the i-th frame arrives under the streaming rate (FPS) of the source, and the third term accounts for the impact of jitter on the arrival time, modeled by a maximum jitter value, a jitter distribution, and a deterministic random function. Note that we keep the choice of the jitter distribution flexible for various scenarios; by default, it is a Gaussian distribution.


Using a similar formulation, we define the inference deadline as follows.

Definition 7.

Inference Deadline
For an inference request and , we define the inference deadline over an input stream as follows:

using the same notation as Definition 6.

Note that, unlike the request time, the deadline does not include the jitter term, which indicates that the system expects a certain processing rate even if there is jitter in the input data arrival times. Using the concepts defined so far, we next define the inference-wise benchmark score.
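The request-time and deadline formulations described in Definitions 6 and 7 can be sketched as follows. The function names are ours, the jitter sample is passed in directly rather than drawn from a distribution, and the choice of the (i+1)-th frame arrival as the deadline for frame i is an assumption for illustration; only the structure (setup latency plus frame arrival time plus jitter, with the jitter term dropped from the deadline) comes from the definitions above.

```python
def request_time_ms(i: int, init_latency_ms: float, fps: float,
                    jitter_ms: float = 0.0) -> float:
    """Sketch of Definition 6: arrival time of the i-th inference request.

    Stream setup latency + time until the i-th frame at the streaming
    rate + a sampled jitter term (supplied by the caller here).
    """
    return init_latency_ms + (i / fps) * 1000.0 + jitter_ms

def deadline_ms(i: int, init_latency_ms: float, fps: float) -> float:
    # Same formulation without the jitter term: the system expects a
    # steady processing rate even when arrival times jitter.
    # (Assumes the deadline for frame i is the nominal arrival of frame i+1.)
    return init_latency_ms + ((i + 1) / fps) * 1000.0
```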

b.1 Inference-wise Benchmark Score

Because of the real-time processing nature of metaverse workloads, it is challenging to compare metaverse device systems running real-time MMMT workloads using traditional metrics such as latency and energy. Latency measures the end-to-end execution time of each inference, which can be used to check whether each model's deadlines are satisfied. However, achieving latency below the deadline offers no additional benefit, so latency is not an absolute minimization goal, unlike in other ML systems targeting non-real-time MMMT workloads. In addition, we can adjust energy to meet the deadlines or optimize using the slack to the deadline (e.g., via DVFS), which makes energy a knob rather than an absolute minimization target.

Therefore, we explore a new metric for XRBench that considers all the unique aspects of real-time MMMT workloads we discussed: (1) the task-specific deadlines based on usage scenarios, (2) end-to-end latency (i.e., how much latency does an ML system incur relative to the deadline?), (3) overall energy consumption, and (4) the quality of experience delivered.

To facilitate an intuitive comparison of ML systems across these many pillars, we propose a single score metric that captures all of the above. The single-score approach also helps motivate industry to submit evaluation results, because raw latency and energy can be confidential data that cannot be shared directly with the public. Instead, industry can report the single score, which captures overall performance on real-time multi-model DL workloads, to demonstrate the capabilities of their accelerator systems.

To construct such a metric, we consider each aspect of real-time MMMT workloads together with model performance (e.g., accuracy, mIoU, and boxAP): the real-time requirement, energy consumption, quality of experience, and model performance. We define a score for each pillar and combine them into a single comprehensive metric. We define each score function to be in the [0, 1] range to facilitate component-wise analysis as well.

To model real-time requirements, we consider the following observations: (1) Optimizing latency far beyond the deadline does not lead to higher processing rates; even if a system finishes an inference in a single cycle, it still must wait for the next input frame. (2) Reduced latency can still be helpful for scheduling other models. (3) A violated deadline gradually degrades the user experience (e.g., achieving 59 Hz for an eye-tracking model targeting 60 Hz would not significantly affect the user experience).

Based on those observations, we search for a function that (1) gradually rewards reduced latency and penalizes increased latency near the deadline (e.g., within 0.5 ms of a 10 ms deadline) and (2) outputs 1 when the latency is well within the deadline and 0 when it is well beyond it. We present such a function in Definition 8.

Definition 8.

Real-time requirement score
For an inference request $r_{i,j}$ on the $i$-th input frame, with the corresponding inference request time $T_{req}(r_{i,j})$ and deadline $T_{DL}(r_{i,j})$,

$S_{RT}(r_{i,j}) = \frac{1}{1 + e^{k \left( L(r_{i,j}) - (T_{DL}(r_{i,j}) - T_{req}(r_{i,j})) \right)}}$

where $L(r_{i,j})$ represents the latency for the inference request $r_{i,j}$ and $k \in [0, \infty]$.

We define the real-time requirement score $S_{RT}$ in Definition 8, where $L(r_{i,j})$ indicates the latency of the corresponding inference after the request. The score function can be viewed as a Sigmoid function with shifting and scaling. A benefit of this form is that we can tune the constant $k$ depending on the deadline sensitivity. Figure 8 shows how $S_{RT}$ changes with the value of $k$; larger values make the function more sensitive around the deadline. We define $k$ values based on the usage scenario and model in the benchmark suite.

Figure 8: An example real-time score function over different values of the parameter $k$, whose range is $[0, \infty]$. We assume the time window between the inference request time and deadline to be 1 ms in this example for simplicity. If $k$ is 0, the score is independent of the deadline (i.e., no deadline sensitivity). If $k$ is $\infty$, the score function becomes a piece-wise function that flips from 1 to 0 at the deadline.
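A minimal sketch of such a shifted and scaled sigmoid is shown below; the function name and the overflow guard are our own, and the score is expressed in terms of the latency and the request-to-deadline window.

```python
import math

def realtime_score(latency, window, k):
    """Real-time requirement score (Definition 8, sketch): a shifted and
    scaled sigmoid over the latency of an inference request.

    latency: measured latency of the inference request
    window:  time between request and deadline, T_DL - T_req
    k:       deadline-sensitivity constant in [0, inf); k = 0 ignores the
             deadline entirely (constant 0.5), while large k approaches a
             step function that flips from 1 to 0 at the deadline
    """
    z = k * (latency - window)
    z = max(-700.0, min(700.0, z))  # guard math.exp against overflow
    return 1.0 / (1.0 + math.exp(z))
```

For a 1 ms window, a latency of 0.5 ms scores near 1 and a latency of 1.5 ms scores near 0 when $k$ is large, with a smooth transition for moderate $k$.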

Using the real-time requirement score, we define the overall score for an inference request as follows.

Definition 9.

Inference-wise score
For an inference request $r_{i,j}$ on the $j$-th model $m_j$ in a usage scenario, on the input frame $i$, we define a score for the inference request as follows:

$S_{inf}(r_{i,j}) = S_{RT}(r_{i,j}) \times \left( 1 - \frac{E(r_{i,j})}{E_{max}} \right) \times S_{model}(m_j)$

where $E_{max}$ refers to a large energy number an ML system should not reach, used as an upper bound on the energy consumption, $E(r_{i,j})$ refers to the energy consumption for $r_{i,j}$ on the hardware (e.g., CPU, accelerator, or GPU), and $S_{model}(m_j)$ is the model performance score. Note that each of the real-time, energy, and model performance scores is within the range [0, 1], so their product is also within [0, 1].
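The three-factor product can be sketched as below; the function signature and the normalized energy term $1 - E/E_{max}$ are our reading of the definition, not an official implementation.

```python
def inference_score(rt_score, energy, e_max, model_perf):
    """Inference-wise score (Definition 9, sketch): the product of the
    real-time score, a normalized energy score 1 - E/E_max, and the
    model-performance score (e.g., normalized accuracy).

    e_max is an assumed per-inference energy upper bound that a
    reasonable system should never reach, keeping each factor in [0, 1].
    """
    energy_score = 1.0 - energy / e_max
    return rt_score * energy_score * model_perf
```

Because every factor lies in [0, 1], a failure on any one pillar (a missed deadline, excessive energy, or poor accuracy) pulls the whole inference score toward 0.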

We extend the inference-wise score function to a benchmark score.

Definition 10.

Usage-scenario Score
Given $N_{frame}$ frames for each model $m_j$ (i.e., the number of frames to be processed for model $j$), we define the usage-scenario score as follows:

$S_{US} = 100 \times \frac{\sum_{j=1}^{N_{model}} \sum_{i=1}^{N_{frame}} S_{inf}(r_{i,j})}{N_{frame} \times N_{model}} \times \frac{N_{DL}}{N_{total}}$

where $N_{total}$ and $N_{DL}$ refer to the number of total inference requests and the number of inference requests that satisfied their deadlines, respectively.

Note that the latency information used in the usage-scenario score reflects the delay due to previous inference requests. That is, if an inference request cannot launch before the next inference request on the same model arrives because of other requests, the inference request should be dropped. To measure the impact of frame drops, the last term quantifies the fraction of inferences that satisfied their deadlines, which serves as a measure of the quality of user experience.

We divide the product of the sum of inference-wise scores and the user experience score by the total number of frames to be processed ($N_{frame}$) and the number of models within the usage scenario to keep the overall score within [0, 1]. Finally, we multiply by 100 to convert the score range to [0, 100].

Using the usage-scenario scores, we define the overall score of XRBench as the geometric mean of the scores of all usage scenarios in XRBench.
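The final aggregation is a plain geometric mean, sketched below.

```python
import math

def xrbench_score(scenario_scores):
    """Overall XRBench score: the geometric mean of the usage-scenario
    scores, so that no single scenario can dominate the aggregate."""
    return math.prod(scenario_scores) ** (1.0 / len(scenario_scores))
```

Unlike an arithmetic mean, the geometric mean heavily penalizes a system that performs well on most scenarios but collapses on one.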

B.2 Schedule

We do not propose a specific scheduler, as the scheduler is part of the ML system software to be evaluated. However, we require a valid schedule to satisfy the following conditions:

(1) For every model dependency $m_a \rightarrow m_b$, the inference of $m_a$ must complete before the inference of $m_b$ starts.

(2) A unit hardware module runs at most one model at any point in time.

Condition (1) indicates that the dependency order must be maintained in a schedule. Condition (2) illustrates that a piece of hardware (e.g., a systolic-array-based accelerator) cannot run two models simultaneously. That is, if a hardware module can run multiple models at the same time, it should be treated as multiple smaller hardware modules. This prevents potential confusion around the definition of a unit hardware module for running models.
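The two validity conditions can be checked mechanically; the sketch below uses a hypothetical tuple encoding of a schedule (one inference per model, for simplicity) that is our own, not part of the benchmark.

```python
def is_valid_schedule(schedule, deps):
    """Check the two schedule validity conditions (sketch).

    schedule: list of (model, hw_unit, start, end) tuples
    deps:     list of (producer, consumer) model pairs
    """
    start = {m: s for m, _, s, _ in schedule}
    end = {m: e for m, _, _, e in schedule}
    # Condition 1: dependency order must be maintained.
    for a, b in deps:
        if end[a] > start[b]:
            return False
    # Condition 2: a unit hardware module never runs two models at once.
    by_hw = {}
    for m, hw, s, e in schedule:
        by_hw.setdefault(hw, []).append((s, e))
    for intervals in by_hw.values():
        intervals.sort()
        for (_, e1), (s2, _) in zip(intervals, intervals[1:]):
            if s2 < e1:  # overlapping execution on the same unit
                return False
    return True
```

A scheduler under evaluation would be expected to emit only schedules for which such a check passes.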

Appendix C Detailed Related Work Comparison

In this section, we expand on Section 5 and Table 5 by providing detailed discussions on prior benchmarks.

General ML Workload Benchmarks. MLPerf Inference Reddi et al. (2020) is a set of industry-standard, single-kernel ML benchmarks that span the ML landscape, from high-performance computers Farrell et al. (2021) to tiny embedded systems Banbury et al. (2021). It also provides a rich set of inference scenarios based on realistic industry use cases: single-stream (single inference), multistream (repeated inference with a time interval), server (random inference requests modeled via a Poisson distribution), and offline (batch processing). Extensions to embedded systems (MLPerf Tiny Banbury et al. (2021)) and mobile devices such as smartphones (MLPerf Mobile Reddi et al. (2022)) have also been developed, drawing closer to the XR form factor. However, the MLPerf suite workloads do not deploy any models in a concurrent or cascaded manner, and the scoring metrics lack the QoE considerations that are essential for XR workloads.

DeepBench [8] focuses on benchmarking the kernel operations that underlie ML performance. Although such microbenchmarks provide insights into operator-level optimizations, they cannot be used to understand the end-to-end performance of even a single model, let alone MMMT workloads. AI Benchmark Ignatov et al. (2018) targets the ML inference performance of smartphones with 14 different tasks, and EEMBC MLMark [10] measures the performance of neural networks on embedded devices. Still, none of them cover MMMT performance or consider real-time processing scenarios, and their scoring metrics are not diverse enough to handle complex XR workloads.

AIBench Gao et al. (2019) from BenchCouncil is another industry-standard AI benchmark for Internet services, and one of the first to include application scenarios for end-to-end performance evaluation. These scenarios model MMMT workloads of E-commerce search-intelligence use cases with heterogeneous per-model latency, and provide rich scoring metric components for evaluation. Although AIBench reflects the key components of real-time MMMT workloads reasonably well, the benchmark is tailored to server-scale Internet services and has little to do with edge applications. In addition, its static execution graphs make extension to XR use cases difficult, since these require dynamic execution of models based on their control dependencies.

AIoT Luo et al. (2018) is an AIBench extension that focuses on mobile and embedded AI. Though these platforms come closer to the XR platform, the benchmark does not model real-time, MMMT-based scenarios and therefore falls short of serving as an XR benchmark.

XR Benchmarks. ILLIXR Huzaifa et al. (2021) is a benchmark suite tailored to XR systems. ILLIXR models concurrent and cascaded execution pipelines in XR use cases and considers the real-time requirements of XR devices. Although ILLIXR provides a solid benchmark in the XR domain, its focus is mainly on non-ML pipelines, unlike the ML workload focus of XRBench. ILLIXR includes only one ML model (RITNet for eye tracking); its other parts are based on traditional computer vision and audio algorithms (e.g., QR decomposition and Gauss-Newton refinement) and signal processing (e.g., FFT). It also provides insufficient accuracy measurement for the audio pipeline, which is a critical requirement of XR workloads. Lastly, its consideration of the dynamic aspects of real-time MMMT workloads is limited: while models may have different latency based on input data, the benchmark models only one static execution graph.

VRMark [48] is a benchmark that evaluates the performance of VR experiences on PCs. It does not target ML performance assessment but rather focuses on graphics rendering. Moreover, it lacks usage scenarios that reflect real-world user characteristics and the variety of score metrics needed for systematic analysis.

In summary, even when existing XR-related or scenario-based benchmarks support real-time MMMT scenarios or QoE metrics, they still lack several components, such as sufficient ML algorithm coverage, dynamic model execution graphs, and a focus on edge devices. XRBench satisfies all of these characteristics, and we expect it to contribute significantly to the XR research community and industry.