Croesus: Multi-Stage Processing and Transactions for Video-Analytics in Edge-Cloud Systems

by Samaa Gazzaz, et al.

Emerging edge applications require both a fast response latency and complex processing. This is infeasible without expensive hardware that can process complex operations – such as object detection – within a short time. Many approach this problem by addressing the complexity of the models – via model compression, pruning and quantization – or compressing the input. In this paper, we propose a different perspective when addressing the performance challenges. Croesus is a multi-stage approach to edge-cloud systems that provides the ability to find the balance between accuracy and performance. Croesus consists of two stages (that can be generalized to multiple stages): an initial and a final stage. The initial stage performs the computation in real-time using approximate/best-effort computation at the edge. The final stage performs the full computation at the cloud, and uses the results to correct any errors made at the initial stage. In this paper, we demonstrate the implications of such an approach on a video analytics use-case and show how multi-stage processing yields a better balance between accuracy and performance. Moreover, we study the safety of multi-stage transactions via two proposals: multi-stage serializability (MS-SR) and multi-stage invariant confluence with Apologies (MS-IA).



1 Introduction

Modern object detection models are based on complex Convolutional Neural Networks (CNN) that require GPU clusters costing tens of thousands of dollars to perform object detection in real-time [noscope, wu2019fbnet, he2018amc, cai2019once]. This is infeasible for edge applications that require real-time processing but cannot afford to place expensive hardware at the edge. Furthermore, many of these applications require response in the scale of milliseconds (such as V/AR [lincoln2016motion] and smart city Vehicle-to-Everything [chen2017vehicle]). This prohibits the use of faraway cloud resources.

There is a large body of research in the machine learning community that aims at addressing the trade-off between accuracy and performance in deep learning (DL) models by utilizing compression, pruning and quantization techniques [wu2019fbnet, he2018amc, han2019deep, cai2019once, kim2019efficient, luo2017thinet, ullrich2017soft, chen2018constraint, xu2018deep, Choi2020universal, dubey2018coreset]. In these approaches, the accuracy of a compressed model is typically lower than that of the full model, while performance improves dramatically. For example, in [wu2019fbnet], the compressed model improves latency from 23.1 ms to 2.9 ms at the cost of lower accuracy. Other work in the field of image compression reduces the amount of time needed to process data [8305033, Liusanfran, 8476610]. Other researchers specialize DL models for certain use cases to improve performance [julian2019deep, 8354254, lawhern2018eegnet, guo2021compact].

An important aspect that is overlooked in many video analytics solutions is that they are not integrated with the system’s data processing and management. Video analytics generates insights from videos that would typically be used in a data management application. For example, detecting objects in V/AR might feed into a mobile game, immersive social network, or other application. We propose Croesus, a multi-stage edge-cloud video processing framework that aims to manage the performance-accuracy trade-off in DL models. The framework consists of an edge-cloud video analytics component and a transaction processing component. Each component may exist in isolation from the other and benefit other use cases; however, they are co-designed to achieve the goals of data management for video analytics applications. This proposal separates computation into two stages: an initial stage that depends on best-effort computations at the edge (using a fast but less accurate DL model), and a final stage at the cloud to correct any errors incurred in the initial stage (using the accurate but slower DL model). For example, for object detection in applications such as V/AR, instead of depending solely on the full CNN model, a more compact model is used at the edge to respond immediately to users. If needed, some frames are sent to the full CNN model on the cloud to detect any errors in the immediate responses sent by the initial stage. If an error is detected, then a correction process is performed in the final stage. The mechanism to correct errors is an application-specific task and our method allows flexibility in how errors are corrected. The advantage of this model is that users have the illusion of both fast and accurate object detection. The downside is the possibility of short-term errors. This multi-stage pattern is useful for applications that require a fast response but where the full model cannot be used within the desired time restrictions.

We formalize and analyze the transactions in Croesus using a formal multi-stage transaction model (a transaction is a group of database read/write operations that represents a task or a program). Our model divides transactions into two sections: an initial section and a final section (we also show how this model can be extended to multiple sections). The initial section is responsible for updating the system using the results of the initial object detection stage, and the final section is responsible for finalizing/correcting state using the results of the final (object detection) stage. The multi-stage transaction model can be generalized to more than two stages. However, in our analysis, the general design added overhead without providing a significant benefit for edge-cloud video analytics. The reason is that the asymmetry in edge-cloud systems is two-fold: the edge (low-capability, real-time requirement) and the cloud (high-capability, less stringent latency requirement).

The multi-stage transaction model leads to challenges when reasoning about the correctness guarantees that should be provided to users. This is because the multi-stage transaction model breaks a fundamental assumption in existing transaction models, which is the assumption that a transaction is a single program or block of code. Therefore, there are challenges in defining an abstraction of initial and final sections and how they interact. Also, there is a need to specify what makes an execution of initial and final sections correct in the presence of concurrent transactions. We cannot reuse existing correctness criteria, such as serializability [bernstein1987concurrency], as they would not apply to the multi-stage transaction model.

For those reasons, we propose a multi-stage transaction processing protocol and study the safety-performance trade-offs in multi-stage transactions. We investigate two safety guarantees: (1) Multi-stage Serializability (MS-SR), which mimics the safety principles of serializability [bernstein1987concurrency] by requiring that each transaction would be isolated from all other transactions. (2) Multi-stage Invariant Confluence with Apologies (MS-IA), which adapts invariant confluence [bailis2014coordination] and apologies [DBLP:conf/cidr/HellandC09] to the multi-stage transaction model and enjoys better performance characteristics and flexibility compared to MS-SR. The multi-stage transaction pattern of Croesus invites a natural method of adapting invariant confluence and apologies. In particular, the final section is—by design—intended to fix any errors caused by the initial stage. This can be viewed as the final stage “correcting any invariant violations” and issuing “apologies” for any erroneous work generated by the initial section.

In the rest of this paper, we present background in Section 2, followed by the design of Croesus (Section 3) and multi-stage transactions (Section 4). Experiments and related work are presented in Sections 5 and 6, respectively. The paper concludes in Section 7.

Figure 1: Croesus’ execution pattern

2 Background

In this section, we present background on the multi-stage system model and object detection.

2.1 System and Programming Model

Edge-Cloud Model. Our proposed system model consists of edge nodes and a cloud node (see Figure 1). Each edge node maintains the state of a partition, and database transactions are performed on the partition copy at the edge. For ease of exposition, we focus on a single edge node and partition in this paper. In edge applications, interactions between users tend to have spatial locality and are therefore typically homed in the same edge node and partition.

Application Model. The applications we consider are video-driven, meaning that the input and triggers to operations on data arrive via a video interface. For example, a gesture or object detected on a V/AR headset triggers a database transaction. This translates to the following processing steps for each frame: (1) The frame is processed using the small model on the edge node to generate labels (labels are the detected objects and/or actions). We call these the edge labels. (2) The edge labels are used to trigger transactions that take the labels as input. The initial sections of these transactions are processed to return an immediate response to users and potentially write to the database on the edge node. (3) Concurrently, the frame is also processed by the original, more accurate object detection model on the cloud. Once the cloud model generates its labels, which we call the cloud labels, they are sent to the edge node. (4) When the labels from the cloud are received, they are used to trigger two types of events. The first is to trigger the final sections of the transactions that started for the frame. The input to these sections is the correct label(s) of the object(s) that triggered the transaction. The second is to trigger new transactions that should have been triggered by the frame but whose labels were missing from the edge labels. We focus on the first pattern as the second pattern can be viewed as a subset of the first.

Example Application. Consider a smart campus Augmented Reality (AR) application with two basic functionalities: (1) Task 1: continuously, an object detection CNN model detects buildings in the campus. If a building is detected, the database is queried and information about the building—such as available study rooms—is augmented onto the headset view. (2) Task 2: if the user clicks on an auxiliary device, a study room is reserved in the currently detected building.

Execution Pattern. The execution pattern of this application is the following (shown in Figure 1): The headset captures images continuously and sends them to the nearby edge node. The edge node performs the initial stage of computation by running the captured frame on the small (fast but inaccurate) DL model (step 1). The labels extracted from the model are used to trigger the initial sections of the corresponding transactions (step 2). For example, if the engineering building is detected, then the transaction’s initial section reads information about the building. The outcome of this transaction is sent back to the headset to be rendered and augmented onto the display. During this time, the frame is forwarded to the cloud node, which runs the full (slow but accurate) CNN model (step 3). The labels extracted from the cloud model are sent back to the edge node. Once the edge node receives the correct labels, it performs the final stage of the triggered transactions (step 4). The final stage takes as input both the labels originally detected in the initial stage and the new, correct labels.

Programming Interface. The programming model exposes an interface to write both the initial and final sections of the transaction. In our application, for example, there are two transactions, one for each task. For task 1 (display information about detected buildings), the initial section is triggered for each frame with a label in the class “building” and it takes as input the detected edge labels. For each detected label, the initial section reads the information about that key from the database and returns it to the headset to be rendered. The final section is triggered after the correct labels are sent from the cloud node. It checks if the labels are the same; if they are, the transaction terminates. (The decision to terminate is specific to this example transaction; other applications might use the final section to perform some final actions even if the labels were correctly detected in the initial stage.) If they are not, then the transaction reads the data of the correctly detected building and sends it to the headset to render the correct information and an apology. (In a real application, the corrected information would also influence the small model, via retraining and heuristics such as smoothing, so that the error would not be incurred in the following frames.)


For task 2 (reserve a study room), the initial section is triggered when the auxiliary device is clicked by the user. The initial section takes as input the most recently detected labels and their coordinates. If there is more than one label, the initial section picks the label that is closest to the center of the frame. Then, the initial section reserves a study room if one exists. The final section, triggered after receiving the correct labels, checks if the center-most label matches the building where the study room was reserved. If so, the transaction terminates. Otherwise, the original reservation is removed from the database and, if available, a new reservation in the right building is made. The results are sent back to the AR headset to be rendered with an apology.
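
The two sections of the reservation transaction (task 2) can be sketched as follows. This is only a minimal illustration under stated assumptions: the database is an in-memory dictionary, and the names `db`, `center_most`, `reserve_room`, `reserve_initial`, and `reserve_final` are hypothetical and are not part of Croesus. The initial section records the label it acted on so that the final section can tell what needs to be repaired.

```python
# Hypothetical in-memory database used only for this sketch.
db = {
    "rooms": {"Engineering": ["E-101"], "Library": ["L-201"]},  # free rooms
    "reservations": {},  # user -> (building, room)
    "txn_state": {},     # user -> building label used by the initial section
}

def center_most(labels, frame_center=(0.5, 0.5)):
    """Pick the label whose box center is closest to the frame center."""
    def dist2(label):
        cx, cy = label["center"]
        return (cx - frame_center[0]) ** 2 + (cy - frame_center[1]) ** 2
    return min(labels, key=dist2) if labels else None

def reserve_room(user, building):
    """Reserve the first free room in the building, if any."""
    free = db["rooms"].get(building, [])
    if not free:
        return None
    room = free.pop(0)
    db["reservations"][user] = (building, room)
    return room

def reserve_initial(user, edge_labels):
    """Initial section: act on the edge labels and record what was used."""
    label = center_most(edge_labels)
    building = label["name"] if label else None
    db["txn_state"][user] = building
    room = reserve_room(user, building) if building else None
    return {"building": building, "room": room}

def reserve_final(user, cloud_labels):
    """Final section: compare with the cloud labels and repair if needed."""
    label = center_most(cloud_labels)
    correct = label["name"] if label else None
    used = db["txn_state"].pop(user, None)
    if correct == used:
        return {"apology": False}
    # Undo the erroneous reservation, then retry in the correct building.
    old = db["reservations"].pop(user, None)
    if old is not None:
        old_building, old_room = old
        db["rooms"][old_building].append(old_room)
    room = reserve_room(user, correct) if correct else None
    return {"apology": True, "building": correct, "room": room}
```

In this sketch, a user who is wrongly placed in the library by the edge model first receives a library room, and the final section later releases it and reserves a room in the engineering building.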

2.2 Accuracy-Performance Trade-off in Object Detection

Convolutional Neural Networks (CNNs).

A CNN is designed and trained to detect labels of objects in an input frame. Different CNN models have different structures and variations, and we refer the interested reader to these surveys [khan2020survey, sindagi2018survey]. Our work applies to a wide range of CNN models as we use them as a black box.

Accuracy-Performance Trade-off.

The complex processing of CNNs results in higher inference time. It is estimated that running a state-of-the-art CNN model in real-time requires a cluster of GPUs that costs tens of thousands of dollars [noscope]. This means that running a CNN model on commodity hardware, such as what is used in edge devices, would lead to prohibitively high latency. This led to exploring the accuracy-performance trade-off in CNN models. Specifically, there have been efforts to produce smaller CNN models that run faster on commodity hardware [yolo9000, lawhern2018eegnet, wen2019memristor, zhang2020compact, racki2018compact, xu2019accurate]. The downside of these solutions is that they are less accurate than full CNN models. In this work, we aim to utilize both small and full CNN models by using small models for fast inference and original models to correct any errors.

Derivative Models. The interest in the accuracy-performance trade-off in CNNs led to efforts that enable deriving smaller—faster—models using existing original CNN models. One approach is to use a smaller model that handles the same scope of labels of the original model but with less accuracy [yolo9000]. Another approach is to create smaller—specialized—models that narrow the scope of labels to enable faster inference while retaining accuracy for the select labels [noscope]. In our work, we consider both variations. For smaller, less accurate models, the Croesus pipeline helps correct errors due to inaccuracy and for specialized models, the Croesus pipeline helps correct errors due to the narrower scope of labels.

3 Croesus Design

In this section, we present the design of Croesus and an optimization that controls the accuracy-performance trade-off.

3.1 Overview

System Model. The system model of Croesus (Section 2) consists of an edge node and a cloud node. The edge node hosts a small CNN model that is used to perform initial processing. The edge node also hosts the main copy of its partition’s data. The edge node processes both the initial section and the final section. The initial section of a transaction is triggered by the labels of the model on the edge, and the final section is triggered by the labels of the model on the cloud. The execution pattern of requests is shown in Figure 1 and described in Section 2.1.

Workflow. The workflow of requests in Croesus is the following: a frame is sent from the client to the edge node. The edge node processes the frame using the edge model. The edge labels produced by the model are used to trigger the corresponding transactions (the programmer defines what transactions should be triggered for each class of labels). The initial sections of these transactions are processed on the edge node. At this time, the responses from the initial sections are sent to the client. This marks the initial commit stage. In the meantime, the frame is sent to the cloud node. Once the cloud node receives it, the cloud model is used to process the frame. The corresponding cloud labels are then sent to the edge node. When the edge node receives the cloud labels, the final sections of the triggered transactions are executed. The responses and apologies from these final sections are sent to the client. This marks the final commit stage.
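
The ordering of the two commit points in this workflow can be sketched as below. This is an illustrative sketch rather than the Croesus implementation: the function `process_frame` and its parameters are hypothetical, the models and transaction sections are passed in as plain callables, and a thread pool stands in for the network hop to the cloud.

```python
from concurrent.futures import ThreadPoolExecutor

def process_frame(frame, edge_model, cloud_model, initial, final, send, pool):
    """Run one frame through the two stages; `send` delivers responses."""
    # The cloud model runs concurrently with the edge-side processing.
    cloud_future = pool.submit(cloud_model, frame)

    # Initial stage: fast edge model, then the initial section.
    edge_labels = edge_model(frame)
    send(("initial", initial(edge_labels)))  # initial commit

    # Final stage: wait for the cloud labels, then the final section.
    cloud_labels = cloud_future.result()
    send(("final", final(edge_labels, cloud_labels)))  # final commit
```

The client always sees the initial response first; the final response (possibly carrying a correction) follows once the cloud labels arrive.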

Bandwidth Thresholding. The pattern of edge-cloud stages introduces a bandwidth overhead due to the need to send all frames from the edge to the cloud. This can be problematic due to the high overhead on the edge device and the monetary cost of communicating data to the cloud (e.g., some public cloud providers charge for data communicated between the data center and the Internet). To this end, we tackle the problem of limiting edge-to-cloud communication. We use the confidence of the labels that are generated by the edge model to decide whether we need to send the frame to the cloud or not. Specifically, if the edge model’s confidence is high enough, this is an indication that the detected labels are more reliable than detections with lower confidence. Later in this section, we develop a bandwidth thresholding mechanism that selectively sends frames to the cloud based on the edge model’s confidence.

3.2 Initial-Final Section Interaction

A unique property of multi-stage processing is that there are two stages of processing, where the first stage is fast and less accurate and the second is slow and accurate. This property leads to the need to understand how they interact and what guarantees should be associated with each stage. In the rest of this section, we provide such properties that are useful to programmers in the multi-stage model. In the initial stage, the initial section of a transaction uses the input from the edge model to generate a response to the user. This response represents an initial-stage commit. The initial-stage commit, when received by a client, represents the following: (1) the response is a preliminary and/or best-effort result of the transaction; (2) any errors in this initial processing will be corrected by the logic specified by the programmer in the corresponding final section. This second property is critical because it requires the following guarantee: if the initial section of a transaction returns a response to the client (an initial-stage commit), then the underlying system must guarantee that the corresponding final section commits as well. This is trivial for a transaction running by itself; however, when transactions are running concurrently, this leads to complications. (In Section 4, we present the concurrency control mechanisms for multi-stage transactions where we encounter these complications.)

When the final section of the transaction starts, it is anticipated for the final section to observe what the input labels were to the initial section—to know whether the input was erroneous—and what the initial section did—to know what to fix if an error was detected. To avoid adding complexity to the system model and description, we consider that these two tasks are performed by the programmer using database reads and writes. Specifically, the initial section communicates to the final section via writing its input and state to the database.

3.3 Algorithms

Now, we provide the detailed algorithms of Croesus. Parts of the algorithms use a concurrency control component that we present and design in Section 4. We denote this concurrency control component as CC, and a transaction block is either CC.initial{ } for an initial section or the corresponding block for a final section. Both transaction blocks get the detected labels as input, but we omit them for brevity.

3.3.1 Client Interface

The client captures frames, gets user input (from auxiliary devices), and displays responses. For example, in a V/AR application, the client captures a frame from the headset camera and sends it to the edge node. Likewise, if there are any associated auxiliary or wearable devices, the client sends the input/commands that correspond to these devices. This process of sending frames and input is continuous—there is no blocking to get the response from the edge node. When a response is received from the edge node, that response is rendered and augmented in the user’s view.

3.3.2 Edge Node Algorithms

The edge node is responsible for the initial stage of processing (using the small model), transaction processing, and storage. There are two main components in the edge node: the input processing component and the transaction processing component. The following is a description of the main tasks that are handled by the edge node.

Initialization and Setup. Starting an edge node includes setting up a small model, a data store, and a transactions bank. The small model is the one that will be used to process incoming frames. The transactions bank is a data structure that maintains the application transactions and what triggers each transaction. For example, an application may have a transaction that reads the information about a building that is detected in a frame. The transaction takes as input the label that is associated with a building. The transactions bank helps the edge node know which transactions should be triggered in response to a label. For example, if one label has the name “Engineering Building” and another label has the name “University Shuttle 42”, the building-information transaction should be triggered in response to the first label but not the second.

The transactions bank supports this decision by maintaining a table, where each row corresponds to a class of labels and the transactions that would be triggered by that class of labels. For example, a row in that table can have a class of labels called “Buildings” that contains all the labels that correspond to a building. That row would also list the building-information transaction and any other transactions that should be triggered in response to the “Buildings” class. A row in the transactions bank may also have other associated triggers. For example, a transaction that is used to reserve a study room in a building would be triggered only if both a building label is detected in the frame and the auxiliary device input is received.
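
A transactions bank of this shape could be represented as a small table of rows, as in the sketch below. The layout is an assumption made for illustration: `BankRow`, `BANK`, `triggered`, and the transaction names `show_building_info` and `reserve_study_room` are hypothetical and do not come from the Croesus code base.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BankRow:
    label_class: str                 # e.g. "Buildings"
    labels: frozenset                # label names that belong to the class
    transaction: str                 # name of the transaction to trigger
    aux_input: Optional[str] = None  # extra required trigger, e.g. "click"

BUILDINGS = frozenset({"Engineering Building", "Library"})

BANK = [
    BankRow("Buildings", BUILDINGS, "show_building_info"),
    BankRow("Buildings", BUILDINGS, "reserve_study_room", aux_input="click"),
]

def triggered(label_names, aux_inputs=()):
    """Return (transaction, label) pairs triggered by labels and inputs."""
    out = []
    for row in BANK:
        # Rows with an auxiliary trigger fire only if that input arrived.
        if row.aux_input is not None and row.aux_input not in aux_inputs:
            continue
        for name in label_names:
            if name in row.labels:
                out.append((row.transaction, name))
    return out
```

With this table, a frame containing “Engineering Building” and “University Shuttle 42” triggers only the building-information transaction, and the reservation transaction additionally requires a click.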

Input and Initial Stage Processing. The initial stage processing represents the input processing using the small model in response to a received frame or user input. When a frame is received by the edge node, it is supplied to the small model. The model returns a set of edge labels. Each label consists of the name of the label, the confidence of the label, and the coordinates of the label. The input processing component removes any labels from the set that have low confidence (the threshold for low confidence is a configuration parameter). Finally, the input processing component gets the information of all the transactions that correspond to the detected labels by reading from the transactions bank. The set of triggered transactions is sent to the transaction processing component.

Similar to how frames trigger transactions, when a different input is received by the input processing component, such as a click on the auxiliary device, the input processing component generates the set of transactions that corresponds to the input. An auxiliary input might lead to an action that is independent of the captured frame. For example, a click on the menu button may display the menu and general user information. In this case, the entry in the transactions bank is only specified by the input type. Alternatively, the input might be coupled with a specific label class to trigger a transaction. For example, a click would display a captured building’s information using a dedicated transaction. In such a case, that transaction would only be triggered if both the click and a building label are detected. To facilitate such actions, the input processing component matches a received auxiliary input with the most recently detected labels.

After the triggered transactions are sent to the transaction processing component (TPC), the frame is sent to the cloud node to be processed using the cloud model. This concludes the tasks performed for input processing.

Initial Transaction Section. When the input processing component generates the set of triggered transactions for a frame, these transactions are sent to the TPC. The TPC then triggers the initial section of these transactions. The read and write operations to the database are managed by the concurrency control component by wrapping them in the CC.initial{ } block. (The implementation details of the concurrency control component are presented in Section 4.) The initial section of a transaction either commits or aborts, based on the decision of the concurrency controller. If the initial section aborts, then the abort decision is sent to the client. Otherwise, the response from the initial section is sent to the client, which represents the initial commit point for the transaction. The TPC records the decision for the initial section together with the edge labels and waits until the corresponding labels are received from the cloud model.

Final Transaction Section. After processing the initial section, the TPC waits for the correct labels from the cloud node. Once received, the following is performed for each edge label. The edge label is matched with a cloud label. The matching is performed by checking whether the bounding box (represented by the x-y coordinates) of a cloud label overlaps with the bounding box of the edge label. The overlap does not need to be exact: if the labels overlap by more than X%, where X is a configuration parameter, then the two labels are considered overlapping. If more than one cloud label overlaps with the edge label, then the one with the largest overlap is chosen. There are the following cases when matching an edge label to a cloud label: (1) If an overlapping label cannot be found among the cloud labels, then the edge label is considered erroneous and the final section of the corresponding transaction is called with an empty label. (2) If there is a cloud label that overlaps with the edge label and the label names are the same, then the edge label is considered correct and the final section of the corresponding transaction is called with the same label. (3) If there is a cloud label that overlaps with the edge label and the label names are different, then the edge label is considered erroneous and the final section of the corresponding transaction is called with the overlapping cloud label.

Once this matching process is complete, the TPC checks if there are any cloud labels that were not matched. For each of these labels, the TPC triggers an initial section and a final section with that cloud label.
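
The matching procedure can be sketched as follows. The text does not specify how the overlap percentage is measured, so this sketch assumes intersection-over-union (IoU) between axis-aligned boxes; the functions `iou` and `match_labels`, the `(name, box)` label format, and the default threshold of 0.5 are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_labels(edge_labels, cloud_labels, min_overlap=0.5):
    """Classify each edge label and list the cloud labels the edge missed.

    Labels are (name, box) tuples. Returns (results, unmatched_cloud), where
    results holds (edge_label, status, input_for_final_section) triples.
    """
    results, used = [], set()
    for e_name, e_box in edge_labels:
        best, best_score = None, min_overlap
        for i, (_, c_box) in enumerate(cloud_labels):
            score = iou(e_box, c_box)
            if score > best_score:  # keep the largest overlap above X
                best, best_score = i, score
        if best is None:
            # Case 1: no overlapping cloud label, final section gets no label.
            results.append(((e_name, e_box), "erroneous", None))
        else:
            used.add(best)
            c_name = cloud_labels[best][0]
            # Case 2 (same name) or case 3 (different name).
            status = "correct" if c_name == e_name else "erroneous"
            results.append(((e_name, e_box), status, c_name))
    # Cloud labels that matched nothing trigger new transactions.
    unmatched = [c for i, c in enumerate(cloud_labels) if i not in used]
    return results, unmatched
```

Each triple in `results` carries the label that the final section of the corresponding transaction would receive, and `unmatched` lists the cloud labels for which both an initial and a final section still need to run.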

3.3.3 Cloud Node Algorithms

The cloud node has a single task of processing frames using the cloud model. When a frame is received from an edge node, the cloud labels are derived using the cloud model and then sent back to the edge node.

3.4 Bandwidth Thresholding

A major problem faced by video-analytics applications in the edge-cloud paradigm is the high edge-cloud bandwidth consumption due to the large size of videos. Sending all frames from the edge to the cloud poses a performance challenge due to the communication overhead as well as a monetary overhead due to the cost of transferring data between the edge and the cloud (most public cloud providers charge applications for data communication between the cloud and the Internet). We extend our solution to reduce the reliance on cloud nodes with the goal of overcoming the performance overhead and monetary costs of edge-cloud communication.

The observation we utilize to reduce edge-cloud communication is that we can use the confidence of edge computation to decide whether verifying with the cloud node is necessary. (Confidence here represents the statistical confidence generated by CNN models, which is a typical feature of such models.) Specifically, if the confidence of the detections produced by the edge model is high, it is likely that the edge model produced correct labels. Therefore, it would not be necessary to send the frame to the cloud. Likewise, if the detections have extremely low confidence, then it is likely that these are erroneous, false detections, and thus sending the frame to the cloud node would be unnecessary as they can be discarded immediately. What is left are detections with confidence values that are neither too high nor too low. These detections likely indicate the presence of an object of interest, but its label might be incorrect.

More formally, we use a lower and an upper confidence threshold, where the lower threshold does not exceed the upper one. Generally, an object with confidence lower than the lower threshold is discarded as being likely a false-positive (this is called the discard interval). An object with confidence higher than the upper threshold is assumed to be correct and is not sent to the cloud node (this is called the keep interval). Objects with a confidence between the two thresholds are sent to the cloud for validation (this is called the validate interval). However, there is a challenge in adopting this model as it is not clear how to derive these confidence thresholds to preserve the integrity of the underlying models. Specifically, a performance-accuracy trade-off controls this decision. A large validate interval would lead to better accuracy, since more frames are sent to the cloud for validation and correction. Likewise, a small validate interval would lead to worse accuracy but better performance in terms of average latency and edge-cloud bandwidth utilization. This is complicated further because the size of the validate interval is not the only factor controlling this trade-off. The validate interval size may lead to different performance-accuracy trade-offs based on where it is located in the threshold space from 0–100%.
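
The three intervals can be expressed as a short classification function. This is a minimal sketch: the names `classify` and `should_send_to_cloud` are hypothetical, confidences are assumed to lie in [0, 1], and the rule that a frame is sent to the cloud when any of its detections falls in the validate interval is an assumption of the sketch, not a statement of the Croesus policy.

```python
def classify(confidence, lower, upper):
    """Map a detection confidence to 'discard', 'validate', or 'keep'."""
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower <= upper <= 1")
    if confidence < lower:
        return "discard"   # likely a false positive
    if confidence > upper:
        return "keep"      # assumed correct, not sent to the cloud
    return "validate"      # sent to the cloud for validation

def should_send_to_cloud(confidences, lower, upper):
    """Assumed rule: send the frame if any detection needs validation."""
    return any(classify(c, lower, upper) == "validate" for c in confidences)
```

For example, with thresholds 0.3 and 0.8, a frame whose detections have confidences 0.1 and 0.95 stays at the edge, while a frame with a detection at 0.5 is forwarded.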

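Optimization Formulation. The input to the optimization problem is a set of video frames $F = \{f_1, \dots, f_n\}$ and an object query $O$ (e.g., bus), which needs to be detected in the frames. Let $k_i$ be the number of instances of object $O$ detected in frame $f_i$ (by the NN in the edge node) with confidences $c_i^1, \dots, c_i^{k_i}$, where $c_i^j$ is the confidence corresponding to the $j$-th instance of object $O$, for $1 \le j \le k_i$. We denote this as edge-confidence.

Let $n_c$ be the number of frames that are sent to the cloud. We define the ratio $\beta = n_c / n$ (where $n$ is the number of frames in $F$) and have the corresponding F-score

$$F_1 = \frac{2PR}{P + R},$$

where $P$ is precision and $R$ is recall. We want to find the lower and upper confidence thresholds $(\theta_L, \theta_U)$ such that $\beta$ is minimized and the corresponding $F_1$ is at least a desired minimum accuracy $\mu$. We have:

$$\min_{\theta_L,\, \theta_U} \beta \quad \text{subject to} \quad F_1 \ge \mu, \quad 0 \le \theta_L \le \theta_U \le 1.$$

This formulation produces the thresholds $(\theta_L, \theta_U)$ given $\mu$.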
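One simple way to approach such a problem is an exhaustive search over a grid of threshold pairs. The sketch below is an illustration under stated assumptions, not the paper's optimizer: `brute_force_thresholds`, the `evaluate` callback (which is assumed to replay a sample of frames and return the fraction sent to the cloud together with the resulting F-score), and the grid resolution are all hypothetical.

```python
from itertools import product


def brute_force_thresholds(evaluate, min_f_score, steps=10):
    """Return the (lower, upper) pair with the smallest cloud ratio whose
    F-score is at least min_f_score, or None if no pair qualifies.

    evaluate(lower, upper) -> (ratio_sent_to_cloud, f_score) is supplied by
    the caller (hypothetical interface).
    """
    grid = [k / steps for k in range(steps + 1)]
    best = None  # (ratio, lower, upper)
    for lower, upper in product(grid, grid):
        if lower > upper:
            continue  # thresholds must satisfy lower <= upper
        ratio, score = evaluate(lower, upper)
        if score < min_f_score:
            continue  # accuracy constraint violated
        if best is None or ratio < best[0]:
            best = (ratio, lower, upper)
    return None if best is None else (best[1], best[2])
```

The search costs one evaluation per grid pair, which is why a cheaper heuristic (such as a gradient-style step) may be preferable when evaluation is expensive.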

3.5 Generalizing Multi-Stage Processing

In this section, we have focused on models with two stages. This is because the application domain we consider has a two-tier asymmetry that invites the use of two sections, one that represents the edge and another that represents the cloud. However, the multi-stage processing model can be used for other use cases where the asymmetry has more than two levels. Our designs and treatments can be extended to these cases as we describe in the rest of this section.

Model. In a general multi-stage model, there are $n$ stages, $s_1, \dots, s_n$. The first stage, $s_1$, represents the initial stage of processing and the last stage, $s_n$, represents the final stage of processing. All other stages are intermediate stages. The data storage is maintained by the node handling stage $s_1$. Each stage contains a video/image detection model, where typically the model at stage $s_j$ (denoted $M_j$) has better detection than model $M_k$, where $j > k$. A transaction $t$ consists of $n$ sections, each one ($t^j$) corresponding to a stage ($s_j$).

Processing. When a frame $f$ is received, it is first sent to the initial stage, $s_1$. The initial stage processes $f$ using $M_1$ and takes the outcome of the model to process the first section of the transaction, $t^1$. Then, the frame is processed at the next stage, $s_2$, using $M_2$, and the outcome is used to trigger transaction section $t^2$. This continues until the final stage. If bandwidth thresholding is performed at any stage, then the sequence from initial to final stages might be broken. For example, if at stage $s_j$ the bandwidth thresholding algorithm (as presented earlier in the section) decides that the frame does not need to be forwarded to the next stage, then the sequence stops and the remaining transaction sections are performed.
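The processing loop described above can be sketched as follows. This is a minimal illustration: `run_multi_stage` and its callback signatures are hypothetical, and feeding the last available detections to the remaining sections after an early stop is an assumption, since the text only states that the remaining sections are performed.

```python
def run_multi_stage(frame, stages, should_forward):
    """Process a frame through an ordered list of stages.

    stages: list of (model, section) pairs from the initial to the final stage;
            model(frame) -> detections, section(detections) -> None.
    should_forward(stage_index, detections) -> bool is the bandwidth
    thresholding decision taken after each non-final stage.
    """
    executed = 0
    detections = None
    for index, (model, section) in enumerate(stages):
        detections = model(frame)   # run this stage's detection model
        section(detections)         # trigger the matching transaction section
        executed += 1
        is_last = index == len(stages) - 1
        if not is_last and not should_forward(index, detections):
            break                   # thresholding stops the stage sequence
    # After an early stop, the remaining sections still run (assumption: they
    # receive the most recent detections).
    for _, section in stages[executed:]:
        section(detections)
    return detections
```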

4 Multi-Stage Transactions

4.1 Multi-Stage Transaction Model

We consider a new multi-stage transaction model where every transaction comprises two distinct sections: the initial section and the final section. Each section $t_k^s$ of a transaction $t_k$ consists of read ($r_k^s$) and write ($w_k^s$) operations in addition to control operations to begin ($b_k^s$) and commit ($c_k^s$) the section. For example, consider a multi-stage transaction $t_1$. The execution of the transaction would look like the following: $b_1^i\, r_1^i(x)\, c_1^i\, b_1^f\, w_1^f(x)\, c_1^f$, where $i$ stands for the initial section and $f$ stands for the final section.

If the initial section of a transaction commits (called initial commit), then the final section must begin and commit (called final commit) as well. When we say that a transaction in our model has committed, we mean that both of its sections have committed. Furthermore, the final section of a transaction cannot begin before the initial section. Conflicts between transactions also demand special consideration. In our model, we say that two transactions conflict if there is at least one conflicting operation in either of the sections. The seemingly simple abstraction of splitting every transaction into two sections complicates the basic notions of the general transaction model. In the following, we take a look at safety and describe two notions of consistency in our model.

4.2 Safety

In the absence of concurrent activity, safety is straight-forward; the initial section is followed by the final section and both are processed as the programmer expects. When concurrency—which is important for performance—is introduced, it challenges the programmer’s notion of the sequentiality of running transactions and multi-stage sections (other conflicting transactions may run within and between a transaction’s sections.)

For example, consider an application where there are two transactions, $t_1$ and $t_2$, each of which increments the value of a data object $x$ by one. Suppose that, for each transaction, the initial section consists of reading the value of $x$; the value is increased, and the new value is written in the final section. Therefore, if the two transactions execute concurrently and both $t_1$ and $t_2$ read the same value of $x$, then the final value of $x$ would only increase by one. This is an anomaly because two transactions incremented the value of $x$, so the value of $x$ should have increased by two.
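The anomaly can be reproduced with a small single-threaded replay. This sketch assumes each transaction's initial section reads the object and its final section writes the read value plus one; `run_history` and the history encoding are hypothetical helpers for illustration.

```python
def run_history(order):
    """Replay (transaction, section) steps against a store holding x = 0.

    An "initial" step reads x; a "final" step writes the value read plus one.
    """
    store = {"x": 0}
    read_values = {}
    for txn, section in order:
        if section == "initial":
            read_values[txn] = store["x"]
        else:
            store["x"] = read_values[txn] + 1
    return store["x"]


serial = [("t1", "initial"), ("t1", "final"), ("t2", "initial"), ("t2", "final")]
interleaved = [("t1", "initial"), ("t2", "initial"), ("t1", "final"), ("t2", "final")]

print(run_history(serial))       # 2: both increments take effect
print(run_history(interleaved))  # 1: one increment is lost
```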

Safety in multi-stage transactions is different because it concerns actions between sections, not only actions within a transaction. It also differs from typical concurrency anomalies: it is not about conflicting copies to be merged, but about a wrong trigger or wrong input. Evidently, multi-stage consistency adds to the complexity involved in traditional consistency guarantees such as serializability in two ways: (1) Multi-stage transactions consist of two separate stages. This means that in addition to the concern of concurrent transactions interleaving operations within each section, there is a need to consider whether sections of transactions running between the sections of other transactions should be permitted. (2) In multi-stage transactions, inconsistency is not only due to concurrent activity, but also due to erroneous transactions that have an incorrect trigger or input (e.g., an erroneously detected building in the edge stage of processing leads to triggering the wrong transaction and/or supplying it with the wrong input).

Due to these differences, we revisit transactional consistency in light of multi-stage transactions. We present and discuss two variants of multi-stage transaction consistency. In both variants, we assume that traditional concurrency control mechanisms are used to ensure that each section is serializable relative to other transactions’ sections. (This means that each section is atomic and isolated from other sections and that there is a total order on sections.) This leaves the novel challenge to safety that is introduced in our work, which is how these sections can be reordered relative to each other.

4.3 Multi-Stage Serializability (MS-SR)

In MS-SR, we mimic the safety principles of serializability, which is, informally, a guarantee that all transactions execute with the illusion of some serial order of all transactions [bernstein1987concurrency]. Projected onto multi-stage transactions, this translates to the requirement that all transactions are processed serially, where the final section of a transaction appears immediately after the initial section. This guarantee can be reduced to serializability by considering the initial and final sections to be part of the same serializable transaction. The main difference is that when the initial section commits, it is a guarantee that the final section will eventually commit; it cannot abort due to unresolved conflicts. As we will see in the rest of this section, this requirement complicates the processing of the initial section.

In order to specify MS-SR formally, we introduce some notation and state our assumptions. We denote the initial and final sections of a transaction $t_k$ by $t_k^i$ and $t_k^f$, and we denote with $\prec$ the ordering relation on the execution history of transaction sections. This relation represents the ordering relative to the commitment rather than the beginning of the section. For example, $t_1^i \prec t_2^i$ denotes that the left-hand side is ordered before the right-hand side, i.e., section $t_1^i$ is ordered before section $t_2^i$.

Consider two conflicting transactions $t_1$ and $t_2$ (i.e., they have at least one conflicting operation in either section), where $t_1$ has initially committed before $t_2$ initially committed. MS-SR guarantees the following: (1) the final section of the first transaction, $t_1^f$, must commit after $t_1^i$. This is the guarantee of multi-stage transactions to commit the initial section before the final section of the transaction. (2) $t_1^f$ must commit before $t_2^f$. This is due to the MS-SR guarantee that the two sections of the transaction must be ordered next to each other relative to other conflicting transactions. (3) $t_1^f$ must be ordered before $t_2^i$ if there is a conflict between $t_1^f$ and $t_2^i$. This is also due to the need to serialize the sections of two conflicting transactions. The condition on the conflict between $t_1^f$ and $t_2^i$ captures that if the two sections do not conflict, then they can be reordered in the serializable history. These conditions are represented by the following formulation, where (a) captures both conditions (1) and (2), and (b) captures condition (3):

(a) $t_1^i \prec t_2^i \;\Rightarrow\; t_1^i \prec t_1^f \prec t_2^f$
(b) $t_1^i \prec t_2^i$ and $t_1^f$ conflicts with $t_2^i \;\Rightarrow\; t_1^f \prec t_2^i$

We elaborate on the example from Section 4.2 to demonstrate the need for MS-SR(a) and MS-SR(b). As an example of MS-SR, consider the two transactions:

$t_1 = r_1^i(x)\, w_1^f(x)$ and $t_2 = r_2^i(x)\, w_2^f(x)$.

Further assume that $t_1^i \prec t_2^i$. Condition MS-SR(a) above guarantees that $t_1^f$ is committed after $t_1^i$ and before $t_2^f$, i.e., we have $t_1^i \prec t_1^f \prec t_2^f$. With MS-SR(a) alone, the following is permitted: $t_1^i \prec t_2^i \prec t_1^f \prec t_2^f$. However, because $t_1^f$ conflicts with $t_2^i$, the two sections must be ordered according to MS-SR(b) and the following ordering relations must be met: $t_1^i \prec t_1^f \prec t_2^i \prec t_2^f$. This ordering avoids the anomaly of both transactions reading the same value of $x$, with one overwriting the value written by the other.
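The ordering conditions can be checked mechanically on a commit-ordered history of sections. The sketch below is a hypothetical helper (`satisfies_ms_sr`) that reads the conditions as follows: for two conflicting transactions where the first initially commits first, the first transaction's final section must follow its own initial section and precede the second's final section, and it must also precede the second's initial section when those two sections conflict.

```python
def satisfies_ms_sr(history, first, second, final_conflicts_with_initial):
    """Check the MS-SR ordering conditions for a pair of conflicting transactions.

    history: list of (transaction, section) entries in commit order, with
             section being "initial" or "final".
    first must initially commit before second.
    final_conflicts_with_initial: whether first's final section conflicts
             with second's initial section.
    """
    position = {entry: index for index, entry in enumerate(history)}
    i1, f1 = position[(first, "initial")], position[(first, "final")]
    i2, f2 = position[(second, "initial")], position[(second, "final")]
    if not i1 < i2:
        raise ValueError("first must initially commit before second")
    # (a): initial before final, and the first final before the second final.
    cond_a = i1 < f1 < f2 and i2 < f2
    # (b): applies only when the two sections conflict.
    cond_b = f1 < i2 if final_conflicts_with_initial else True
    return cond_a and cond_b
```

Applied to the increment example, the interleaved history (both initial sections before both final sections) passes condition (a) but fails condition (b), while the serial history passes both.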

Now, we present a protocol that guarantees MS-SR.

Two Stage 2PL (TSPL): The Two Stage 2PL is the two-phase locking protocol [BernsteinHG87] modified for our multi-stage transactional model (see Algorithm 1). Let $t_k$ be a multi-stage transaction comprising $t_k^i$ and $t_k^f$. First, the initial section starts executing, locking each accessed data item before reading or writing it. After the initial section finishes processing, the initial commitment cannot be performed immediately. This is because we need to guarantee that the final section can execute and commit as well, due to the requirement of multi-stage transactions. Therefore, the locks of all items that are accessed (or potentially accessed) by the final section must be acquired first. Then, the transaction enters the initial commit phase. Once all the needed input is available for the final section (e.g., the corrected labels from the cloud model), the final section executes, and the transaction enters the final commit phase. Finally, all the locks are released.

items ← get_rwsets(initial section)
if acquirelocks(items) then
        process initial section
        items ← get_rwsets(final section)
        if acquirelocks(items) then
               Initial Commit
               process final section
               Final Commit
        end if
        release all locks
end if
Algorithm 1 Two Stage 2PL
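A minimal Python sketch of the locking order described for TSPL is given below. It is an illustration under assumptions, not the prototype's code: `LockTable`, `run_tspl`, and the dictionary layout of a transaction (`initial_keys`, `final_keys`, `initial`, `cloud`, `final`) are hypothetical, and locks are taken in sorted key order as one simple way to keep each batch of acquisitions consistent.

```python
import threading


class LockTable:
    """One lock per key, created on first use (hypothetical helper)."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def acquire_all(self, keys):
        held = []
        for key in sorted(keys):  # fixed order within a batch
            lock = self._lock_for(key)
            lock.acquire()
            held.append(lock)
        return held


def run_tspl(locks, store, txn, log):
    """Run one multi-stage transaction; all locks are held until final commit."""
    held = locks.acquire_all(txn["initial_keys"])
    try:
        guess = txn["initial"](store)  # process the initial section
        # Lock the final section's items before the initial commit.
        extra = set(txn["final_keys"]) - set(txn["initial_keys"])
        held += locks.acquire_all(extra)
        log.append("initial commit")
        corrected = txn["cloud"](guess)  # wait for the cloud model's labels
        txn["final"](store, corrected)   # process the final section
        log.append("final commit")
    finally:
        for lock in reversed(held):
            lock.release()
```

Because every lock is held across the call to `txn["cloud"]`, any conflicting transaction blocks for the full cloud round trip, which is the contention cost of this approach.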
Theorem 1.

The TSPL protocol satisfies MS-SR.


Proof. Consider a pair of conflicting transactions $t_1$ and $t_2$, where $t_1^i \prec t_2^i$. Following Algorithm 1, each section is serialized relative to each other section because locks are held before execution. Now, we show that the three conditions of MS-SR for ordering sections relative to each other are met. The first guarantee is ordering the initial section before the final section. The algorithm executes the initial section before the final section, which guarantees their ordering. The second guarantee is that $t_1^f$ is ordered before $t_2^f$. There is at least one data object $x$ that both $t_1$ and $t_2$ access. Because the final section is only executed after all locks are held for the transaction (including the lock for $x$), $t_1^f$ would be processed before $t_2^f$. The third guarantee is that if $t_1^f$ conflicts with $t_2^i$, then $t_1^f \prec t_2^i$. Assume that the conflict is on data object $y$. Assume to the contrary that $t_2^i \prec t_1^f$. If that is the case, then $t_2$ acquired the lock on $y$ before $t_1$ and before the point of initial commitment (because initial commitment only happens after acquiring all locks, including the locks for the final section). Because the locks (including the one on $y$) are not released until $t_2$ finishes, this means that before the lock on $y$ is released, $t_2$ has initially committed. However, $t_1$ initially commits only after acquiring the lock on $y$, which means that $t_2^i \prec t_1^i$, which contradicts our starting assumption that $t_1^i \prec t_2^i$. ∎

Discussion. Although MS-SR is an easy-to-use consistency guarantee, it leads to complications and undesirable performance characteristics. The main complication is due to the need to guarantee that committing the initial section leads to committing the final section. Combined with the stringent requirement that the two sections are serialized so that they appear back-to-back in the serialization order, this means the locks for the final section must be guaranteed to be acquirable. The design consequence, as we see in the TSPL algorithm, is that the initial section cannot commit before acquiring the locks of the final section. This leads to one of two consequences: (1) the system can infer what data will be accessed (or potentially accessed) in the final section so that the locks can be acquired and the initial commit happens before having to wait for the cloud model to finish processing, or (2) the transaction cannot initially commit until the cloud model returns the correct labels, so that it is known what data items are going to be accessed. The first option may require complex analysis or input from the programmer, and the second option is prohibitive because the initial section has to wait for a potentially long time, which invalidates the goals of multi-stage transactions. Another complication is that the locks for the initial section must be held until the final section finishes processing, which leads to higher contention.

4.4 Multi-Stage Invariant Confluence with Apologies (MS-IA)

Now, we propose a multi-stage safety criterion that is inspired from invariant confluence [bailis2014coordination] and apologies [DBLP:conf/cidr/HellandC09]. The initial-final pattern of multi-stage transactions invites the utilization of these concepts as we discuss next.

Guesses and Apologies. The concept of guesses and apologies [DBLP:conf/cidr/HellandC09] was introduced to describe a pattern of programming that balances local action versus global action (for example, a local action on a replica versus global action on the state of all replicas in the context of eventual consistency). In this pattern, a guess is performed with local information and, then, guesses are reconciled with the global state which would lead to detecting inconsistencies in the local guesses. Such errors lead to apologies via undoing actions, administrator intervention, and/or notifications to affected users.

This pattern of guesses and apologies fits our multi-stage edge-cloud transaction model. The initial section represents the guess and the final section represents the apology. To illustrate, consider an example of a multi-player AR game with four players: $A$ with 50 tokens, $B$ with 10 tokens, and $C$ and $D$ with no tokens. The application has a token transfer function transfer(from, to, amount). The initial section performs the transfer, and the final section reconciles any mistakes. Now, assume that the initial section of a transfer $t_1$ from $A$ to $B$ for 50 tokens took place. Then, the initial section of a transfer $t_2$ from $B$ to $C$ for 10 tokens took place, followed by another transfer $t_3$ from $B$ to $C$ for 50 tokens. Due to concurrency, assume that the final sections of both $t_2$ and $t_3$ were performed and that their triggers and inputs were correct. In this case, the final section terminates for both transactions. Then, the final section of $t_1$ starts. However, the correct input to $t_1$ turns out to be $D$ instead of $B$ (for example, because the edge CNN model detected player $B$ when it is actually player $D$, as detected by the cloud CNN model). An apology procedure in the final section could retract the effects of $t_1$ and any other transactions that depended on it, which are $t_2$ and $t_3$.

Using guesses and apologies allows us to process the initial sections of transactions fast while providing a mechanism to overcome the mistakes of the edge best-effort computation. However, it may lead to a cascade of retractions. To overcome this, we propose combining the concept of apologies with invariant confluence as we show next.

Invariant Confluence. In invariant confluence, preserving the application-level invariants is what constitutes a safe execution. In its original form, invariant confluence is intended to reason about transactions mutating the state of different copies of data [bailis2014coordination]. Our edge-cloud model is different, involving mutating the state of one (edge) copy. However, an inconsistency might be introduced by the initial section of a transaction with an erroneous trigger/input. Our insight is that we can use the final (apology) section to act as the merge function that attempts to reconcile application-level invariants instead of all potential inconsistencies. In a way, we are flipping the model of invariant confluence systems from a pattern of check-then-apply (check if the operation can merge, and decide whether coordination is necessary before doing the operation) to a pattern of apply-then-check (do the operation, then check whether it can merge, and if it cannot, perform an apology procedure and retract the initial section’s effects).

MS-IA programming pattern. This pattern, when combined with apologies, can reduce the negative consequences of erroneous triggers/inputs. Consider the multi-player AR game application introduced above (when discussing apologies). Assume that the initial sections of $t_1$, $t_2$, and $t_3$ were processed as well as the final sections of $t_2$ and $t_3$. At this stage, $A$, $B$, and $D$ have no tokens and $C$ has 60 tokens. When the error is discovered, it triggers the final section of $t_1$. A programmer, equipped with the notions of invariant confluence and apologies, writes the final section to attempt to perform two tasks: (1) retract the minimum amount of erroneous actions and their effects using apologies, and (2) retain as much state as possible using invariant-preserving merge functions. The specifics of this pattern depend on the application invariants. For example, the final section of the transfer tasks could have the invariant that no player should have fewer than 0 tokens. The final section of $t_1$ would retract the 50 tokens that were initially sent to $B$ and send them to the rightful recipient, player $D$. This means that $B$ could not have sent a combined 60 tokens to $C$. The merge function can then decide to retain the 10 tokens sent from $B$ to $C$, since they are not affected by the error, but it retracts the 50 tokens. This retraction is accompanied by an apology that depends on the application (e.g., a message is sent to both $B$ and $C$, with a free game item).
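The token-transfer scenario can be sketched as a small apology-and-merge routine. Everything here is illustrative: the player labels A to D are chosen for this example, `final_section_fix_recipient` is a hypothetical name, and the policy of retracting the wrongly credited player's later transfers newest-first until the "no negative balance" invariant holds is an assumption. Cascades to further players are not handled.

```python
def transfer(balances, src, dst, amount):
    """Initial-section logic: move tokens without checking invariants."""
    balances[src] -= amount
    balances[dst] += amount


def final_section_fix_recipient(balances, log, index, correct_dst):
    """Apology for log[index], whose recipient was wrong.

    log is a list of (src, dst, amount) transfers in execution order.
    Returns the later transfers that had to be retracted.
    """
    src, wrong_dst, amount = log[index]
    transfer(balances, wrong_dst, correct_dst, amount)  # redirect the tokens
    log[index] = (src, correct_dst, amount)
    apologies = []
    # Merge step: retract later transfers by wrong_dst, newest first, only
    # until the invariant balance >= 0 is restored.
    for later in range(len(log) - 1, index, -1):
        if balances[wrong_dst] >= 0:
            break
        entry = log[later]
        if entry is not None and entry[0] == wrong_dst:
            _, receiver, sent = entry
            transfer(balances, receiver, wrong_dst, sent)
            apologies.append(entry)
            log[later] = None  # mark as retracted
    return apologies
```

Starting from balances A=50, B=10, C=0, D=0 and the transfers A→B (50), B→C (10), B→C (50), correcting the first transfer's recipient to D retracts only the 50-token transfer and retains the 10-token one.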

In terms of the concurrency control guarantee that is needed for MS-IA, the initial section of a transaction must be ordered before its corresponding final section (in addition to our earlier assumption that each section is serialized relative to other transactions’ sections). Formally, for an initial section $t_k^i$ and its corresponding final section $t_k^f$, the following is true:

$$t_k^i \prec t_k^f$$
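items ← get_rwsets(initial section)
if acquirelocks(items) then
        process initial section
end if
Initial Commit
release locks
items ← get_rwsets(final section)
if acquirelocks(items) then
        process final section
end if
Final Commit
release locks
Algorithm 2 MS-IA Algorithm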

Concurrency control. The concurrency control algorithm starts by acquiring all the locks for the initial section, then processing the initial section. When the processing of the initial section is done, the locks are released. Then, when the final section is ready to start, the corresponding locks are acquired before processing the final section. Finally, the locks for the final section are released. Note here that unlike the algorithm for MS-SR, we did not hold the locks for the initial section until the end of the final section and we reach the point of initial-commit immediately after processing the initial section without having to wait to lock or coordinate the final section. The reason for this is that the logic for invariant checking and apologies is embedded in the final section and that we do not need to ensure that the initial and final sections of one transaction are serialized next to each other.
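The difference from the MS-SR protocol can be seen in a minimal sketch where each section takes and releases its own locks. The names `locked` and `run_ms_ia` and the transaction dictionary layout are hypothetical; recording the commit just before the section's locks are released is also an assumption about the exact ordering.

```python
import threading
from contextlib import contextmanager


@contextmanager
def locked(lock_table, keys):
    """Hold the locks for the given keys for the duration of one section."""
    held = [lock_table[key] for key in sorted(keys)]
    for lock in held:
        lock.acquire()
    try:
        yield
    finally:
        for lock in reversed(held):
            lock.release()


def run_ms_ia(lock_table, store, txn, log):
    """Run the initial and final sections as two separately locked steps."""
    with locked(lock_table, txn["initial_keys"]):
        guess = txn["initial"](store)      # guess based on the edge model
        log.append("initial commit")
    # No locks are held here, so other transactions can run while the cloud
    # model processes the frame.
    corrected = txn["cloud"](guess)
    with locked(lock_table, txn["final_keys"]):
        txn["final"](store, guess, corrected)  # invariant check and apology
        log.append("final commit")
```

Since nothing is locked during the cloud round trip, the initial commit is reached immediately after the initial section, at the cost of having to write the reconciliation logic into `txn["final"]`.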

Discussion. To have better performance characteristics, MS-IA presents a more complex programming abstraction than MS-SR because it places the burden of coordination (invariant checking, reconciliation, and apologies) on the programmer. In MS-IA, transactions are written as guesses (in the initial section) and apologies (in the final section). Furthermore, apologies are merge functions that aim to reconcile the inconsistencies caused by incorrect triggers or inputs. Given our apply-then-check pattern, it is possible that some operations cannot be merged. In such cases the final section would undo the effects of the initial section and of any transactions dependent on it. We envision that this pattern of multi-stage guesses and apologies can incorporate advances in merge operators that would minimize the need for undoing transactions. For example, programmers may use merge-able operations in the initial sections and delay other operations to the final section. This can benefit from, and help empower, the literature on conflict-free and compositional data types. These can be adapted to the initial-final pattern by placing merge-able parts in the initial section and enabling other types of operations in the final section.

In validation-based (optimistic) protocols, which operate in the context of a single transaction, the outcome of the transaction is not returned to the client and is not exposed to other transactions before validation. Applying validation-based protocols as they are in the edge-cloud setting would be prohibitive because a transaction would not commit until the validation step, which would happen after cloud processing, is ready. The MS-IA pattern, on the other hand, divides the transaction logic into two sections, each acting as an independent transaction, where the first one commits before the second section starts. This allows returning responses to clients and exposing the outcome to other transactions even before the final section and without having to wait for the processing at the cloud.

4.5 Multi-Partition Operations

The transaction processing protocols presented in this section focus on transactions that are local to a partition. In the case of distributed transactions (spanning multiple partitions), the presented algorithms need to be extended. In particular, in the multi-partition case, the data objects that are accessed by a transaction (whether in the initial or final sections) can be in multiple partitions. Locking data objects in remote partitions will be performed by sending the lock requests to the remote edge node that is responsible for the partition. The second difference is that after the transaction finishes, the partitions engage in a two-phase commit protocol to ensure that the distributed commit is performed in an atomic way. This atomic commitment step is performed in the following cases: (1) for MS-SR, it is performed at the end of the final section, (2) for MS-IA, it is performed at the end of both the initial and final sections. The reason for not performing this step at the end of the initial section in MS-SR is that the locks are not released until the end of the corresponding final section.

Figure 2: Croesus vs. state of the art baselines: Latency and F-score of running Croesus over four videos. Some values are minute and are hard to show on the figure.

5 Evaluation

In this section, we show how Croesus manages the trade-off between performance and accuracy of two models with different characteristics: (1) YOLOv3 [yolo9000, yolov3] as the cloud model, which is reported to achieve 45 FPS on high-end hardware and achieves high accuracy. (2) Tiny YOLOv3 [yolov3, yolo9000]—which is a compact version of YOLOv3—for the edge model. Tiny YOLOv3 is faster but less accurate than YOLOv3 [yolov3].

We compare Croesus with two baselines:
• State-of-the-art edge baseline: this baseline represents performance-centric video analytics applications, where a compact model (Tiny YOLOv3) is deployed on the edge machine for lower latency.
• State-of-the-art cloud baseline: this baseline represents accuracy-centric video analytics applications, where a computationally expensive model (YOLOv3) is deployed on a resourceful cloud machine for better accuracy.

Figure 3: Croesus latency vs. accuracy for different pairs of thresholds
Figure 4: Latency in different setups for the optimal case that was dynamically configured by Croesus.
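Video | Accuracy (Croesus) | Accuracy (Edge) | Accuracy (Cloud) | Latency ms (Croesus) | Latency ms (Edge) | Latency ms (Cloud)
v1 | 0.81x | 0.5x | 1 | | 210.74 | 1452.5
v2 | 0.8x | 0.45x | 1 | | 207.97 | 1427.69
v3 | 0.83x | 0.86x | 1 | | 211.19 | 1455.66
v4 | 0.85x | 0.41x | 1 | | 214.65 | 1638.89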
Table 1: Comparison between state-of-the-art edge and cloud and optimal threshold Croesus

5.1 Experimental setup

Our evaluations are performed on Amazon’s AWS EC2 services. Edge machines are implemented on either t3a.xlarge instances (for the default setups) or t3a.small instances (for experiments with limited resources). t3a.small machines have 2 virtual CPUs and 2GiB of memory and t3a.xlarge machines have 4 virtual CPUs and 16GiB of memory. Machine locations are either in California or Virginia. The default setup is an edge machine in California and a cloud machine in Virginia. We implement a prototype of Croesus in Python. In addition to model detection, the edge node maintains a data store and processes transactions according to the MS-IA algorithm. Transactions are constructed by randomly selecting keys to read from or write to the database in response to detected labels.

We evaluate accuracy and performance as follows: Accuracy is measured as the F-score. Performance is measured in two ways: (1) Latency, which we define as the time required to commit transactions in the system. (2) Edge-Cloud Bandwidth Utilization (BU), which we define as the ratio of frames being sent to the cloud relative to all processed frames. This metric is proportional to the number of corrections that need to be made in the final transaction. We consider the YOLOv3 output to be the ground truth and we use it to compare Croesus’ results and calculate the F-score. When the overlap between the truth boundaries and the predicted boundaries is more than 10%, we consider the prediction correct. The calculation of the F-score does not depend on the percentage of frames that are sent or not sent to the cloud, but rather on the accuracy of the detection from the perspective of the client (i.e., the accuracy of the detection and apologies, if any). There is, however, a correlation between sending more frames to the cloud and accuracy, as it means that more errors are corrected by the more accurate cloud model.
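As an illustration of this accuracy metric, the sketch below scores one frame's predicted boxes against ground-truth boxes. It rests on assumptions: the text does not say how overlap is measured, so intersection-over-union (IoU) is used here, matching is greedy and one-to-one, and the names `iou` and `frame_f_score` are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def frame_f_score(predicted, truth, min_overlap=0.10):
    """F-score of predicted boxes against truth boxes for a single frame."""
    unmatched = list(truth)
    true_pos = 0
    for box in predicted:
        # Greedy matching: take the first unmatched truth box that overlaps enough.
        match = next((t for t in unmatched if iou(box, t) > min_overlap), None)
        if match is not None:
            unmatched.remove(match)
            true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / (true_pos + len(unmatched))
    return 2 * precision * recall / (precision + recall)
```

For example, one correct box plus one spurious box against a single ground-truth box gives precision 0.5 and recall 1, hence an F-score of 2/3.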

Experiments run on a subset of five types of videos: street traffic (vehicles), street traffic (pedestrians), and mall surveillance (all three querying for ‘person’), airport runway querying for ‘airplane’, and a home video of a pet in the park querying for ‘dog’. Each detection in each frame triggers a transaction that has 6 operations: half of these mutate the state of the database by inserting data items, and the other half read previously added items. This mimics a write-heavy workload of YCSB (Workload A) [Cooper2010]. Unless we mention otherwise, we use MS-IA as the consistency guarantee.

5.2 Experimental results

5.2.1 Performance vs. accuracy trade-off

Figure 2 shows the trade-off between the latency and accuracy as BU varies on four videos: park video (v1), street traffic (v2), airport runway (v3) and mall surveillance (v4). For each video, we compare different BU configurations with the state-of-the-art edge and cloud solutions. In the figure, the stacked bars represent the latency breakdown for each experiment. Edge latency and cloud latency represent the average time needed to send a frame to the edge and to the cloud, respectively. The edge detection latency and cloud detection latency are defined as the average time it takes the tiny YOLOv3 and YOLOv3 models, respectively, to produce the detected objects list in a frame. The initial transaction and final transaction latency are very minute and hard to show in the figure, but they represent the time it takes to commit a transaction after detection is done. The F-score metric is shown as a marked line.

As shown in Figure 2, Croesus processes transaction updates in the initial phase (measured by edge latency and edge detection latency), up to faster than the case with full BU while maintaining high accuracy (F-score up to in the case of "airport runway") by utilizing the cloud corrections and final transaction. The client observes two latencies: the first is the real-time initial processing at the edge which corresponds to edge latency, edge detection latency, and initial transactions latency. The second is for the final processing after corrections, if any, from the cloud, which corresponds to all the latency types shown in the figure. As BU increases, the amount of frames sent to the cloud, and consequently the average cloud-related latencies, increases. When BU is 100%, the total cloud latency for Croesus becomes even higher than state-of-the-art cloud because it incurs all the overheads of the state-of-the-art cloud in addition to the overhead of Croesus methods.

The trend of increasing Croesus cloud latency as BU increases is observed in videos 1, 2, and 4. However, a unique trend appears for video 3 (querying for ‘airplane’ on the airport runway video). In this video, the state-of-the-art edge baseline produces high accuracy due to the nature of the video (an object that is detected by the edge model with high confidence). This underlines the need for dynamic optimization of the detection thresholds for different applications in order to address workload differences. Croesus’ dynamic optimization finds the best balance of the trade-off between accuracy and latency depending on the needs of each application.

Figure 3 demonstrates the effect of choosing different thresholds on the latency in Croesus. We demonstrate the results using the street traffic video querying for vehicles. It shows the total Croesus cloud latency and the BU percentage as the threshold pairs for detections are varied. For example, a threshold pair (0.5, 0.6) means that only detections whose confidence values in the edge model fall within these two values are sent to the cloud for verification. Detections with lower confidence values are discarded and ones with higher confidence values are assumed correct by the edge node and are not verified (however, erroneous detections are still accounted for in the F-score).

When the thresholds are set to (0.5, 0.5) the resulting BU is since no frames will be sent to the cloud for validation. The resulting accuracy is comparable to the edge only baseline at . For a threshold pair of (0.5, 0.6), the latency increases due to more results being validated in the cloud. The resulting BU is while the F-score increases by . When the BU reaches , the accuracy reaches . For thresholds (0.6,0.7), the BU is only lower than the BU of the thresholds (0.5, 0.6). However, the F-score decreases by more than . This shows that although two pairs may have similar BU values, their corresponding F-score can be significantly different. It indicates the importance of dynamically optimizing for an optimal pair of thresholds that balance the trade-off between the latency and accuracy while prioritizing thresholds that yield higher accuracy.

cloud model | thresholds | F-score | BU | latency (sec)
YOLOv3-320 | (0.2, 0.3) | 0.84 | 0.61 | 0.70
YOLOv3-416 | (0.4, 0.5) | 0.86 | 0.44 | 1.12
YOLOv3-608 | (0.4, 0.6) | 0.83 | 0.58 | 2.34
Table 2: The effect of the cloud model size.

Another observation from Figure 3 is that the rate at which the bandwidth utilization increases is faster than the rate of F-score increase over different threshold pairs. This is an indicator that increasing dependence on the cloud does not necessarily improve accuracy dramatically.

Figure 5: Croesus bandwidth utilization vs. accuracy based on the threshold pair choice. a) traffic video querying "person" () and b) mall surveillance querying "person" (). For all pairs of lower threshold and upper threshold . Dynamically chosen pair: yellow star using brute force, red star using gradient step.

The effect of changing the cloud model size in Croesus is demonstrated in Table 2. In this experiment, we set and compare the performance of Croesus while using three different cloud model sizes: YOLOv3-320, YOLOv3-416, YOLOv3-608, where the number at the end of each model’s name represents the width and height used in the neural network model. Therefore, a larger number indicates a larger model. As the cloud model size gets larger, the detection latency gets larger as well. This is the main impact of utilizing different model sizes. The different models have different accuracy characteristics as well. However, using them in the Croesus framework does not demonstrate such differences in the resulting F-score and BU. This is because the optimal thresholds are set based on the used cloud model to achieve the desired minimum accuracy, .

5.2.2 Optimal threshold performance on different setups

Figure 4 shows the accuracy and performance results of Croesus for different videos when using the optimal threshold. These experiments run across four different setups: (a) Small edge, different locations: Edge machines are of type t3a.small while cloud machines are of type t3a.xlarge. Edge machines are located in California and cloud machines are in Virginia. (b) Small edge, same location: Edge machines are of type t3a.small while cloud machines are of type t3a.xlarge. Edge and cloud machines are physically located in the same location. (c) Regular edge, different locations: Edge and cloud machines are both of type t3a.xlarge. Edge machines are located in California and cloud machines are in Virginia. (d) Regular edge, same location: Edge and cloud machines are physically located in the same location and are both of type t3a.xlarge.

This figure demonstrates the improvement in latency that the optimal thresholds provide compared with the performance shown in Figure 2. (For a clearer presentation, we show the comparison numbers in Table 1, where the number inside the parentheses for Croesus is the latency of the initial transaction.) It also shows the effect of resource allocation and geographical location on performance, and the importance of dynamic threshold optimization to address the differences in applications.

In the case of applying the optimal thresholds, we see improvement in the final latency over the state-of-the-art cloud implementation by up to (but as low as for the case of v4). In addition, the latency of committing the initial transaction is always comparable to that of the state-of-the-art edge solutions. Even though the final transaction in Croesus can take up to more than the edge-only implementation, the accuracy improvement is significant and can justify the slight delay after the initial transaction.

In addition, the F-score of optimal Croesus is 2.1x higher than the F-score of edge-only in video v4. In the case of video v3, the accuracy is comparable to the state-of-the-art accuracy because the optimal thresholds represent a near bandwidth utilization. This is possible in applications where objects are expected to be easier to detect in each frame. The figure also shows that as the geographical distance between the edge and the cloud decreases (when placed in the same location), Croesus performance improves. In addition, the performance improves when edge resources are maximized.

Figure 6: (a) Comparing lock contention of MS-SR and MS-IA measured as the average latency of holding locks. (b) Abort rate of MS-SR transactions. (c) Hybrid system techniques.

5.2.3 Dynamic preprocessing optimization

Figure 5 shows the bandwidth utilization and accuracy as we vary the optimization thresholds (the lower threshold and upper threshold ). The heatmaps illustrate the gradual shift in the balance between bandwidth utilization and accuracy.

Bandwidth utilization/accuracy trade-off. Figure 5(a1) for BU and Figure 5(a2) for F1-Score show the trend where increasing the lower threshold and the gap between the two optimization thresholds results in a higher throughput. For example, when the optimization pair is (0.2, 0.4), the F-score is since this pair of thresholds results in a high BU at. However, when the optimization pair is (0.3, 0.4), the bandwidth utilization drops to while the F-score remains relatively high . We are able to reduce the edge-cloud communication by more than 35.9% while maintaining relatively high accuracy.

Figures 5(b1) for BU and 5(b2) for F1-Score show the same trends as the previous set of heatmaps. However, we notice a sudden jump in bandwidth utilization and F-score results. This is due to the quality of this second video, where objects are smaller and not as clear as in the first video. In this case, utilizing edge-cloud communication increases the quality of detections dramatically compared to edge detections. For example, for the optimization pair (0.4, 0.5), 81% of frames are sent to the cloud and the F-score is 92%. However, when the optimization pair is (0.4, 0.4), no frames are sent to the cloud and the F-score decreases to 45%.

Dynamically finding the optimal solution. We implemented two approaches to acquire the optimized pair of thresholds. The first is a brute-force method that evaluates the whole space of threshold pairs and obtains the optimal pair for balancing the trade-off (shown as a yellow star). The second approach uses a gradient step with our optimization formulation (shown as a red star) and is 2.2x faster. In both cases, bandwidth utilization is , accuracy is at least higher than an edge model.
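The contrast between exhaustive search and a step-based search can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_cost` is a made-up penalized objective (BU plus a penalty for falling below a minimum F-score), not the optimization formulation used by Croesus, and `step_search` is a greedy neighbor step on a discrete grid that stands in for the gradient step. Thresholds are expressed in tenths and BU/F-score in integer percent to keep the arithmetic exact.

```python
from typing import Callable, Dict, Tuple

Pair = Tuple[int, int]  # (lower, upper) in tenths, e.g. (3, 7) means (0.3, 0.7)


def toy_cost(pair: Pair, min_f: int = 80, penalty: int = 10) -> int:
    """Hypothetical objective: BU (%) plus a penalty for missing the F-score target."""
    lower, upper = pair
    gap = upper - lower
    bu = min(100, 20 * gap)                               # wider band, more frames sent
    fscore = 50 + min(40, 8 * gap) - 5 * abs(lower - 3)   # made-up accuracy model
    return bu + penalty * max(0, min_f - fscore)


def valid(pair: Pair) -> bool:
    return 0 <= pair[0] <= pair[1] <= 10


def brute_force(cost: Callable[[Pair], int]) -> Tuple[Pair, int]:
    """Evaluate every valid pair; returns (best pair, number of evaluations)."""
    pairs = [(lo, hi) for lo in range(11) for hi in range(lo, 11)]
    return min(pairs, key=cost), len(pairs)


def step_search(cost: Callable[[Pair], int], start: Pair) -> Tuple[Pair, int]:
    """Greedy local steps; returns (local optimum, number of distinct evaluations)."""
    cache: Dict[Pair, int] = {}

    def cached(p: Pair) -> int:
        if p not in cache:
            cache[p] = cost(p)
        return cache[p]

    current = start
    while True:
        lo, hi = current
        neighbours = [p for p in [(lo - 1, hi), (lo + 1, hi), (lo, hi - 1), (lo, hi + 1)]
                      if valid(p)]
        best = min(neighbours, key=cached)
        if cached(best) >= cached(current):   # no neighbour improves: stop
            return current, len(cache)
        current = best


best_pair, n_brute = brute_force(toy_cost)
step_pair, n_step = step_search(toy_cost, (5, 5))
```

On this toy objective both searches reach the same pair while the step search evaluates far fewer pairs. A greedy step can stop at a local optimum on less well-behaved objectives, which is the usual price of avoiding the exhaustive scan.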

5.2.4 Comparing MS-SR and MS-IA

In the next set of experiments, we measure the performance differences between the two proposed consistency levels: MS-SR and MS-IA. (In this set of experiments we use video v4 with the query “person”.) The main difference between the two consistency levels is that the locks in the initial section of MS-SR are held until the end of the whole transaction, whereas in MS-IA, the locks are released after the initial section. This increases the lock contention in MS-SR. Figure 6(a) shows the difference in contention by measuring the average time locks are held in MS-SR and MS-IA (denoted average latency in the figure). While the average latency of MS-IA is on the order of milliseconds, the average latency of initial sections in MS-SR is on the order of hundreds of milliseconds. This is because the locks are not released until the final section is performed, which means that the locks are held while the frame is being processed using the cloud model, which takes a significant amount of time.
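The difference in lock hold time follows directly from when the initial-section locks are released. The sketch below makes that arithmetic explicit; the `StageTimes` type, the function names, and the example durations are hypothetical and are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class StageTimes:
    initial_ms: int     # running the initial section at the edge
    cloud_wait_ms: int  # waiting for the cloud model to process the frame
    final_ms: int       # running the final section


def hold_time_ms_sr(t: StageTimes) -> int:
    """MS-SR: initial-section locks are held until the whole transaction ends."""
    return t.initial_ms + t.cloud_wait_ms + t.final_ms


def hold_time_ms_ia(t: StageTimes) -> int:
    """MS-IA: initial-section locks are released when the initial section ends."""
    return t.initial_ms


# Hypothetical durations, chosen only to show the order-of-magnitude gap.
example = StageTimes(initial_ms=2, cloud_wait_ms=300, final_ms=3)
sr_hold = hold_time_ms_sr(example)
ia_hold = hold_time_ms_ia(example)
```

Because the cloud detection time sits inside the MS-SR hold interval, any growth in cloud latency translates directly into longer lock holds under MS-SR but not under MS-IA.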

The contention difference leads to a high likelihood of aborts in MS-SR. Figure 6(b) shows the abort rate of transactions in MS-SR while emulating a high contention scenario of hot spots with different sizes. The x-axis (key range) is the key range of the hot spot that the transactions are trying to access. In this model, transactions are executed in batches of 50 transactions per batch where each transaction has 5 update operations. The figure shows that the abort rate can be significant when the hot spot has a size that is less than 10K keys. This demonstrates the benefit of using MS-IA to overcome the hot spot contention problems that arise with MS-SR. The figure does not show the abort rate of MS-IA transactions as the rate is 0% for all cases. This is because our implementation uses a single-threaded sequencer to order transactions in batches so that conflicting transactions do not overlap. This is possible as the transactions do not have to hold locks for prolonged durations.
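One way to picture a sequencer that keeps conflicting transactions from overlapping is to place each transaction of a batch into the earliest "wave" that follows every earlier transaction touching the same keys. The sketch below is an assumption about how such ordering could work, not the authors' implementation; the function name `schedule` and the wave abstraction are hypothetical.

```python
from typing import Dict, List, Set


def schedule(batch: List[Set[int]]) -> List[List[int]]:
    """Group transaction indices into waves with pairwise-disjoint key sets.

    Waves run one after another; transactions inside a wave touch disjoint
    keys, so they never conflict. Conflicting transactions keep batch order.
    """
    last_wave: Dict[int, int] = {}   # key -> latest wave that touched it
    waves: List[List[int]] = []
    for txn_id, keys in enumerate(batch):
        # Earliest wave strictly after every earlier conflicting transaction.
        wave = 1 + max((last_wave[k] for k in keys if k in last_wave), default=-1)
        if wave == len(waves):
            waves.append([])
        waves[wave].append(txn_id)
        for k in keys:
            last_wave[k] = wave
    return waves


# Five hypothetical transactions, each described by the keys it updates.
batch = [{1, 2}, {3}, {2, 4}, {4, 5}, {6}]
waves = schedule(batch)
```

Here transactions 0, 1, and 4 share no keys and land in the first wave, while transactions 2 and 3 are pushed to later waves because each conflicts with an earlier one. Such ordering is only practical when each transaction holds its keys briefly, which is the property the text attributes to MS-IA.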

5.2.5 Hybrid edge-cloud techniques

Hybrid edge-cloud techniques have been proposed to process object detection models [glimpse, collaborativeedge, noscope, neurosurgeon]. These techniques generally work by performing some pre-processing steps at the edge node before sending the frame to be detected at the cloud. We compare with two such techniques that were utilized in various forms in prior work [collaborativeedge, noscope]: (1) compression, in which the frame is compressed before sending it to reduce the communication bandwidth and latency, and (2) difference communication, in which only the difference between the current frame and a reference frame is sent to the cloud. These techniques, if implemented in isolation, would achieve only a small improvement over the performance of the state-of-the-art cloud baseline that we compared with, as they would still require sending all frames for detection in the cloud. We show this in the evaluations on the park video v1 with the larger cloud model (YOLOv3-608) in Figure 6(c) under cloud+compression and cloud+compression+difference. Applying the hybrid techniques improves the latency because less data needs to be sent. However, this is a small improvement because the latency is dominated by the detection latency at the cloud.
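The two pre-processing techniques can be sketched on raw byte buffers. This is an illustration under stated assumptions: frames are treated as equal-length byte strings, lossless `zlib` compression stands in for whatever image or video codec a real deployment would use, and the function names are hypothetical.

```python
import zlib


def compress_frame(frame: bytes) -> bytes:
    """(1) Compression: shrink the frame before sending it to the cloud."""
    return zlib.compress(frame)


def encode_difference(frame: bytes, reference: bytes) -> bytes:
    """(2) Difference communication: send only the per-byte delta to a reference."""
    if len(frame) != len(reference):
        raise ValueError("frame and reference must have the same length")
    delta = bytes((a - b) % 256 for a, b in zip(frame, reference))
    return zlib.compress(delta)   # a mostly-zero delta compresses very well


def decode_difference(payload: bytes, reference: bytes) -> bytes:
    """Cloud side: rebuild the frame from the delta and the shared reference."""
    delta = zlib.decompress(payload)
    return bytes((d + b) % 256 for d, b in zip(delta, reference))


# Hypothetical 4 KiB "frames" that differ in a single byte.
reference = bytes(range(256)) * 16
changed = bytearray(reference)
changed[100] = 7
frame = bytes(changed)

payload = encode_difference(frame, reference)
restored = decode_difference(payload, reference)
```

Both techniques reduce the bytes on the wire but leave the cloud detection step untouched, which is consistent with the observation that the overall improvement is small when detection latency dominates.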

An alternative view of these techniques is as methods that can augment edge-cloud Croesus. Figure 6(c) also shows how adding compression can improve the final commit latency in Croesus (under Croesus+compression and Croesus+compression+difference). The improvement is small because the model detection latency in the cloud is the dominant latency (as we show in previous evaluations).

6 Related Work

The requirement of real-time processing has been tackled by real-time databases (RTDB) [son1996improving] that aim to process data in a predictably short time. Our method differs by allowing the trade-off between performance and accuracy to be managed and by providing the illusion of processing that is both fast and accurate. Hybrid edge-cloud models (and similar caching-based models) have recently been used [glimpse, collaborativeedge, noscope, neurosurgeon] to take advantage of cloud computing to process data on neural networks, as well as to leverage resources at the edge. Our work extends these efforts by providing a multi-stage transactional model that enables programmers to reason about this hybrid edge-cloud model. In particular, these hybrid edge-cloud models can be augmented with the edge-cloud model of Croesus to improve the edge-to-cloud latency. However, when hybrid edge-cloud models are used in isolation, they incur the high costs of edge-to-cloud communication for all frames since they require performing the detection in the cloud.

The multi-stage transaction model differs from existing abstractions in that each transaction is split into two asymmetrical sections. This makes traditional consistency models [bernstein1987concurrency] unsuitable for multi-stage transactions. The pattern of initial-final sections resembles work on eventual consistency [bailis2013eventual] and Transaction Chains [zhang2013transaction] but differs in one main way: the inconsistencies in the multi-stage model are external to the database. They are caused by erroneous inputs or triggers. In eventual consistency and Transaction Chains, inconsistency is caused by concurrent operations across different copies. This leads to both similarities and differences, which led us to adapt prior relevant literature. Multi-stage transactions resemble work on long-lived transactions (LLTs) as well, such as Sagas [sagas]. Multi-stage transactions can be viewed as a special case of LLTs (a transaction and a follow-up correction/compensation transaction), which enables simpler and more efficient solutions.

We view Croesus as a data layer solution that builds on top of asymmetric environments which, like edge-cloud, may include the lambda architecture [7364082] with both batch processing (slower but more accurate) and speed/real-time processing (faster but less accurate). The contributions of Croesus can be applied to the lambda environment [warren2015big] by using multi-stage transactions (where the initial section is processed after real-time processing and the final section is processed after batch processing), and thus provide Croesus benefits to lambda programmers.

7 Conclusion

We presented Croesus, a multi-stage processing system for video analytics and a multi-stage transaction model which optimizes the trade-off between performance and accuracy. We presented two variants of transactional consistency for multi-stage transactions: multi-stage serializability and multi-stage invariant confluence with apologies. Our evaluation demonstrates that multi-stage processing is capable of managing the accuracy-performance trade-off and that this model provides both immediate real-time responses and high accuracy.

Although we have presented the concept of multi-stage processing and transactions in the context of edge-cloud video analytics and processing [nawab2021wedgechain, nawab2018dpaxos, nawab2018nomadic, gazzaz2019collaborative, mittal2021coolsm], these concepts are relevant to many problems that share the pattern of needing immediate response and complex processing. Our future work explores these applications. One area of future work is to apply this pattern of multi-stage processing to blockchain systems with off-chain components [abadi2020anylog, alaslani2019blockchain, nawab2019blockplane]. In such a case, the first stage is performed in the off-chain component while the final stage is performed after validation from the blockchain. Another area we plan to explore is to integrate the multi-stage processing structure with global-scale edge placement and reconfiguration [zakhary2018global, zakhary2016db]. This will allow utilizing multi-stage processing more efficiently by controlling where the stages are performed and what edge/cloud datacenters to utilize.

8 Acknowledgement

This research is supported in part by the NSF under grant CNS-1815212.