A Scalable Framework for Distributed Object Tracking across a Many-camera Network

Aakash Khochare et al., Indian Institute of Science. 02/14/2019.

Advances in deep neural networks (DNN) and computer vision (CV) algorithms have made it feasible to extract meaningful insights from large-scale deployments of urban cameras. Tracking an object of interest across the camera network in near real-time is a canonical problem. However, current tracking frameworks have two key limitations: 1) They are monolithic, proprietary, and lack the ability to rapidly incorporate sophisticated tracking models; and 2) They are less responsive to dynamism across wide-area computing resources that include edge, fog and cloud abstractions. We address these gaps using Anveshak, a runtime platform for composing and coordinating distributed tracking applications. It provides a domain-specific dataflow programming model to intuitively compose a tracking application, supporting contemporary CV advances like query fusion and re-identification, and enabling dynamic scoping of the camera-network's search space to avoid wasted computation. We also offer tunable batching and data-dropping strategies for dataflow blocks deployed on distributed resources to respond to network and compute variability. These balance the tracking accuracy, its real-time performance and the active camera-set size. We illustrate the concise expressiveness of the programming model for 4 tracking applications. Our detailed experiments for a network of 1000 camera-feeds on modest resources exhibit the tunable scalability, performance and quality trade-offs enabled by our dynamic tracking, batching and dropping strategies.


1 Introduction

The push for smarter and safer cities has led to the proliferation of video cameras in public spaces. Cities and regions like London, New York, Singapore and China [1, 2] have deployed camera networks with thousands of feeds to help with urban safety, e.g., to detect abandoned objects [3], track miscreants [4], and for behavioral analysis [5]. They are also used for citizen services, e.g., to identify open parking spots or count the traffic flow [6]. Such many-camera networks, when coupled with sophisticated Computer Vision (CV) algorithms and Deep Learning (DL) models [7], can also serve as meta-sensors to replace other physical sensors for IoT applications and to complement on-board cameras for self-driving cars [8].

One canonical application domain to consume such ubiquitous video feeds is tracking, where suspicious activities in public spaces need to be detected and followed by law enforcement to ensure safety [9, 10, 11]. Here, the goal is to identify a target (e.g., a vehicle or a person of interest), based on a given sample image or feature vector, in video streams arriving from cameras distributed across the city, and to track that target’s movements across the many-camera network. This requires online video analytics across space and time, and commonly has three stages: object detection, object tracking, and re-identification [12]. The first filters out objects that do not belong to the same class as the target, while the second follows the motion of objects within a single camera’s frame [13, 14]. Re-identification (or re-id) matches the objects in a camera with the given target object [15, 16]. Recently, a fourth stage called fusion has emerged, which enhances the original target query with features from the matched images and uses the enhanced query for subsequent tracking, giving better accuracy [17]. Each of these individual problems is well-researched. But these stages have to be composed as part of an overall platform, coupled with a distributed tracking logic that operates across the camera network and over time.

Challenge 1.  Contemporary many-camera surveillance platforms are monolithic, proprietary and bespoke [18, 19, 20]. They offer limited composability and reusability of models, increasing the time and effort to incorporate the rapid advances in CV/DL. Also, as the number of cameras and applications that consume them grows, executing the entire CV pipeline over all the camera feeds is costly. E.g., just doing object detection on a 1000-camera network using a contemporary fast neural network requires 128 Titan XP GPUs, along with the bandwidth to move the streams to the compute [21]. Instead, these platforms need to incorporate smarter tracking strategies that limit video processing to an active set of cameras where a target is likely to be present [22, 23]. Fig. 1 illustrates a target person being tracked across a network of 5 video cameras on a road network using a smart spotlight algorithm. A blue circle indicates the Field of View (FOV) of each camera, and the blue dashed arrow indicates the path taken by the target over time. Cameras do not generate and process video feeds unless activated. Initially, the target is within the FOV of a single camera, and only this camera is active. At a later time, it has moved out of this FOV and is not visible to any other camera. Now, we activate the cameras that fall within a spotlight neighborhood of the camera where the target was last seen, shown by the yellow circle. If the target is still not found, this spotlight grows in size at the next time step and activates additional cameras. When the target reappears in the FOV of one of the active cameras, the spotlight shrinks so that only this single camera remains active. If the target is subsequently lost again, the spotlight once more grows around the last-seen camera and activates its neighbors. Using such a smart tracking logic to scope the region of interest can significantly reduce the number of active video streams we process. This helps conserve resources without losing the tracking accuracy. Existing platforms do not offer such configurable tracking strategies.

Figure 1: Spotlight strategy for camera activation while tracking.

Challenge 2.  Smart cities are seeing edge and fog computing resources being deployed on Metropolitan Area Networks (MAN) to complement cloud resources. This brings processing closer to the data source and conserves network resources [24, 25, 26]. This is important for video tracking, given its low latency, high bandwidth and high compute needs [27, 1]. Hence, tracking platforms must make effective use of such heterogeneous, wide-area compute resources. For scalability, the platform must be able to balance the latency for tracking against the throughput supported for the active camera feeds. Also, given the dynamism in the network behavior, compute performance and data rates, it must help trade-off the accuracy of tracking with the application performance, and weakly-scale with the number of cameras. Current platforms do not offer such tunable adaptivity and scaling [11, 28].

We make the following specific contributions in this paper:

  1. We propose a novel domain-specific dataflow model with functional operators to plug-in different analytics, for current and emerging tracking applications. Uniquely, it has first-class support for distributed tracking strategies to dynamically decide the active cameras (§ 2).

  2. We design domain-sensitive heuristics for frame drops and batching, which enable users to tune accuracy, latency and scalability under dynamism (§ 3). We implement the dataflow model and heuristics in our Anveshak platform to execute across distributed resources (§ 4).

  3. We illustrate the flexibility of the dataflow model using four tracking applications, and offer detailed empirical results to validate the scalability and tunability of our platform across accuracy, latency and camera-set size (§ 5).

We complement these with a review of related work in § 6 and offer our conclusions in § 7.

2 A Dataflow Model for Tracking

We first discuss the features of a generic many-camera infrastructure, and then propose a domain-specific dataflow programming model to compose tracking applications.

2.1 System Model

A many-camera infrastructure consists of a set of cameras that are statically placed at specific locations, and each can generate a stream of video observations within an FOV [11]. The cameras are connected to a metropolitan area network (MAN), directly or through an edge computing device [27]. (Accelerated) Fog devices may be co-located with the cameras or within a few network hops of them, while cloud resources are accessible at data centers over the wide area network (WAN). These resources have heterogeneous capacities. The MAN, WAN and resource performance can vary over time. While the edge and fog are typically captive city resources, cloud resources are available on-demand for a price.

Cameras allow access to their video data streams over the network and expose control endpoints to change parameters such as the frame rate, resolution or FOV [29]. Traditionally, one or more fog servers in the city’s control center would acquire the streams for visualization, real-time analytics and archival. However, moving such video data to the compute has high bandwidth and latency costs. Instead, we propose to move the compute logic to the data by using edge and fog devices close to the cameras, complemented by on-demand cloud resources. Hence, a tracking framework must operate on heterogeneous and dynamic compute and network resources.

2.2 Domain-specific Programming Model

We propose a domain-specific model for tracking applications as a streaming dataflow with pre-defined modules, as shown in Fig. 2. The user provides functional logic for these modules to compose an application. We specify input and output interfaces for each module that the user-logic must use to consume and produce streams of events (e.g., video frames, detections). Instances of a module can naturally execute different input events in a data-parallel manner.

The structure of the tracking dataflow is fixed. This is similar to abstractions like MapReduce or Pregel [30, 31] where the user specifies the Map, Reduce and Compute logic, but the dataflow and execution structure is fixed by the platform to support a particular application domain. We offer a domain-specific pattern for designing tracking applications, rather than a general-purpose dataflow [32], while providing the benefits of automatic parallelization and performance management by the platform. Users can also rapidly incorporate advances in DL/CV models into the functional logic for each module.

Next, we describe the interfaces of these modules, the dataflow pattern, and the execution model. These combine contemporary and emerging advances in video analytics and, uniquely, allow users to coordinate the tracking evolution.

2.2.1 Filter Controls (FC)

This module is the entry point for video frames from a camera into the dataflow. It is usually co-located with the camera or on an edge device connected to it. Each camera has a single FC instance along with its local state. When a video frame enters the input stream of FC, the user-logic decides if the frame should be forwarded on its output stream to the Video Analytics (VA) module, or be ignored. FC uses its local state or even the frame content to decide this. If a frame is accepted, a key-value event is sent on the output stream, with the camera ID as the key and the frame content as the value.

Importantly, the FC state for a camera can be updated by control events from the Tracking Logic (TL), as described later. This allows tunable activation of video streams that will enter the dataflow, on a per-camera basis. E.g., TL can have FC deactivate a camera feed if the target will not be present in its FOV, or reduce/increase the video rate based on the target’s speed. This balances the dataflow’s performance and accuracy, as shown in the “tuning triangle” in Fig. 2 (inset). The FC logic should be simple enough to run on edge devices.
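To make the FC contract concrete, the sketch below shows one way such user-logic could be written in Python; the class shape, method names and state fields are our illustrative assumptions, not Anveshak's actual interfaces.

# Illustrative FC user-logic; the class, method names and state fields are
# assumptions for exposition, not Anveshak's actual API.
class SpotlightFilterControl:
    def __init__(self, camera_id):
        self.camera_id = camera_id
        self.state = {"active": True, "frame_skip": 1, "count": 0}

    def on_control(self, event):
        # Control events from TL update the local state, e.g., (de)activate
        # this camera or change its effective frame rate.
        self.state.update(event)

    def on_frame(self, frame):
        # Return a (key, value) event to forward the frame, or None to ignore it.
        if not self.state["active"]:
            return None
        self.state["count"] += 1
        if self.state["count"] % self.state["frame_skip"] != 0:
            return None
        return (self.camera_id, frame)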

Figure 2: Domain-specific dataflow and modules for tracking. (Inset) Tunable performance and scalability choices.

2.2.2 Video Analytics (VA)

This module receives input event streams from one or more upstream FC modules, and performs video analytics on a single camera’s stream at a time. The user-logic for this module can be complex, and invoke external tools like TensorFlow, PyTorch or OpenCV. The input API for the logic is an iterator of events, grouped by the camera ID. This is similar to the Reduce function in MapReduce. Grouping by camera ID gives the logic access to a batch of frames from the same camera for temporal analytics. This also amortizes the invocation overheads to call the external models. We discuss specific batching strategies in § 3.3.

Exemplar VA logic includes object detection and tracking using DL models like YOLOv2 [21], or classic CV techniques like Histogram of Gradients (HoG). VA can access the user’s target query and maintain local state across executions, to be passed to the external model. The output of the logic is an iterator of key-value pairs, which may be, e.g., bounding boxes for potential target objects in a frame, with confidence scores. Depending on the compute needs of the logic, this may run on an edge, fog or cloud resource. There can be a many-to-many relationship between the input and output events for this module. However, we allow users to link an output event with an input event to trace its latency, and this provenance enables the drop strategies we propose in § 3.2 to meet QoS goals.

Like FC, the local state of this module can be updated by the Query Fusion (QF) task. This allows dynamic updates to the target query by the emerging class of fusion algorithms [17]. These enhance a query’s feature vector with information derived from ongoing detections of the target in the frames. The VA can also update its model based on such signals.
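As an illustration of the VA contract, the sketch below consumes a batch of frames grouped by camera ID and emits key-value detections; detect_fn and match_fn stand in for external CV/DL models (e.g., an HoG detector or a TensorFlow re-id network) and are not specific library APIs.

def video_analytics(camera_id, frames, query, detect_fn, match_fn, threshold=0.7):
    """Illustrative VA logic: detect candidate objects per frame and score them
    against the target query. detect_fn and match_fn are placeholders for
    external models; the signatures here are assumptions."""
    outputs = []
    for frame in frames:                       # batch of frames from one camera
        for bbox in detect_fn(frame):          # candidate bounding boxes
            score = match_fn(query, frame, bbox)
            if score >= threshold:
                # Key-value output: camera ID as key, detection as value.
                outputs.append((camera_id, {"frame": frame,
                                            "bbox": bbox,
                                            "score": score}))
    return outputs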

2.2.3 Contention Resolution (CR)

This module receives a stream of key-value events from one or more VA instances, grouped by key. The values typically contain detections or annotated frames. This logic is used to analyze results from multiple cameras, say, to resolve conflicting detections from different cameras. It can use more advanced re-id logic or deeper neural networks for a higher match confidence. Sometimes, CR is triggered only on a conflict or a low confidence detection by a VA, and hence executes less often than a VA, but requires more compute power. CR may even degenerate to a human-in-the-loop deciding on borderline detections. This makes it better suited for running on fog or cloud resources. Like VA, this module can receive updates from QF as well.

The output stream from executing CR primarily contains metadata – much smaller than the video input – and this is forked three ways, to TL, QF and UV modules.
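A much-simplified CR could, for instance, re-score the candidate detections from multiple cameras with a stronger model and keep only the best match; the sketch below assumes a strong_match callable as a stand-in for such a heavier re-id DNN, and the 0.9 threshold is arbitrary.

def contention_resolution(detections, query, strong_match, threshold=0.9):
    """Illustrative CR logic: re-score detections from multiple cameras with a
    stronger (slower) re-id model and keep the single best match, if any."""
    best = None
    for camera_id, det in detections:          # detections grouped across cameras
        score = strong_match(query, det["frame"], det["bbox"])
        if score >= threshold and (best is None or score > best[1]["score"]):
            best = (camera_id, {**det, "score": score})
    # Metadata-only output, forked to TL, QF and UV by the platform.
    return [best] if best else []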

2.2.4 Tracking Logic (TL)

This is a novel module that we propose to capture the core logic for distributed tracking across the camera-network [23, 33, 9]. This module receives a metadata stream of detections from the CR, which it can aggregate to take a global view of the entire camera-network. It can be hosted on cloud resources to allow sophisticated target tracking algorithms to be executed using prior domain knowledge. It can also devise strategies to (de)activate the cameras to optimize the quality and performance of tracking. E.g., it can use spatial information on the city road networks to dynamically decide the camera search space like a spotlight, change the FOV for cameras to focus on an approaching or receding target, or change the frame rate based on the target speed. This module separates the core video analytics logic from the task of interpreting their results for distributed target tracking and camera controls, across the camera network.

2.2.5 Query Fusion (QF)

This module uses information on the detections to enhance the target query’s features. High-confidence target detections in the input video can be fused with the existing target query to generate a new target query that offers better matches, or even use negative matches to enhance the query [9, 17]. The output of this module updates the VA and CR modules for their future input streams.
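The fusion model in [17] is an RNN; purely to show QF's input/output contract, the sketch below uses a simple weighted average of feature vectors as a hedged stand-in, with feature_fn and the thresholds being illustrative assumptions.

import numpy as np

def query_fusion(query_vec, detections, feature_fn, conf_threshold=0.9, alpha=0.8):
    """Illustrative QF logic: blend features of high-confidence detections into
    the target query vector. A weighted average is used purely for exposition;
    App 2 in Table 1 would use an RNN-based fusion model [17] instead."""
    fused = np.asarray(query_vec, dtype=float)
    for _, det in detections:
        if det["score"] >= conf_threshold:
            feat = np.asarray(feature_fn(det["frame"], det["bbox"]), dtype=float)
            fused = alpha * fused + (1.0 - alpha) * feat
    return fused    # broadcast by the platform to the VA and CR instances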

2.2.6 User Visualization (UV)

This is a user-facing module that can be used to display the current state of the tracking and detections. This can be to a central portal running on the cloud where authorized personnel can view the progress.

2.3 Sample Application

We illustrate an application designed using our domain-specific dataflow and modules to track a person of interest (POI) across a road network. It takes an image of the POI as input, and returns detections of the POI in the camera network to the UV module. Simplified pseudo-code for the modules is given below.

Initially, all FCs have their state set to let their input streams pass through. VA first uses an OpenCV HoG feature-based pedestrian detector [34] to put bounding boxes (bbs) around persons in an image. It then uses a TensorFlow DNN [35] to match the POI against each bounding box. If the match score is higher than a threshold, VA outputs the frame as a possible match. CR acts as a “strong teacher” and uses a high-quality TensorFlow DNN [36] to check if the POI is indeed present in the VA’s output frames.

 

procedure FC(frame, state)
    return state.active    ▷ forward the frame only if TL has marked this camera as active
end procedure

procedure VA(camId, frames, query, θ_VA)
    for frame in frames do
        bbs ← HoGDetect(frame)    ▷ pedestrian bounding boxes [34]
        scores ← MatchDNN(query, frame, bbs)    ▷ re-id score per bounding box [35]
        best ← max(scores)
        if best ≥ θ_VA then
            emit(camId, ⟨frame, bbs, best⟩)    ▷ possible match
        end if
    end for
end procedure

procedure CR(camId, detections, query, θ_CR)
    for d in detections do
        for bb in d.bbs do
            d.score[bb] ← StrongMatchDNN(query, d.frame, bb)    ▷ high-quality re-id check [36]
        end for
        if max(d.score) ≥ θ_CR then emit(camId, ⟨d.frame, max(d.score)⟩)
        end if
    end for
end procedure

procedure TL(detections, state)
    found ← (detections contain a positive match)
    if not found then
        state.radius ← state.radius + δ    ▷ grow the spotlight
        spotlight ← BFS(roadNetwork, state.lastSeen, state.radius)
        activate(FCs of cameras in spotlight); deactivate(all others)
    else
        state.lastSeen ← camera of the match; state.radius ← 0; activate(only that FC)
    end if
end procedure

procedure QF(detections, query, θ_QF)
    hits ← ∅
    for d in detections do
        if d.score ≥ θ_QF then
            hits ← hits ∪ {d}    ▷ high-confidence detections
        end if
    end for
    query ← FuseRNN(query, hits)    ▷ updated query is sent to VA and CR [17]
end procedure

TL has access to the road network. When the POI is not in any camera’s feed, it starts a Breadth First Search (BFS) on the road network from the last known position of the POI, and activates the FC of cameras in this spotlight. QF uses an RNN [17] to enhance the POI query using high-quality hits.
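A minimal Python sketch of such a spotlight TL is shown below. It assumes the road network is available as a networkx graph with per-edge "length" attributes and that a camera-to-vertex mapping exists; these are illustrative choices, not details fixed by the paper.

import networkx as nx  # assumed graph library for the road network

def spotlight_tracking_logic(road_graph, camera_at_vertex, last_seen_vertex,
                             seconds_lost, walk_speed_mps):
    """Illustrative TL: activate cameras within the distance the target could
    have walked since it was last seen (a length-weighted spotlight search, in
    the spirit of TL-WBFS). Returns the camera IDs whose FCs should be active."""
    radius = walk_speed_mps * seconds_lost          # how far the target may have moved
    # Dijkstra over road lengths gives all vertices within the spotlight radius.
    reachable = nx.single_source_dijkstra_path_length(
        road_graph, last_seen_vertex, cutoff=radius, weight="length")
    return {camera_at_vertex[v] for v in reachable if v in camera_at_vertex}

When CR reports a positive match, TL would instead reset last_seen_vertex and seconds_lost, and shrink the active set to just the matched camera.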

Table 1 illustrates several other tracking applications that can be composed. Notably, DNNs can be used for object detection in VA (Apps 2 and 3) and for person or car re-id in CR. The target may also be a vehicle (App 3), matched on its image rather than its license plate [37]. TL can also be more complex, with awareness of the road lengths for a weighted BFS (App 2) or even the target’s speed (App 3). We can also use a Naïve Bayes model to predict the likelihood of the paths that will be taken by the target, to decide the cameras to activate (App 4).

#  | FC         | VA                        | CR                        | TL             | QF
1  | Active?    | HoG [34]                  | Person Re-id [35]         | UW-BFS         | –
2  | Active?    | DNN [38]                  | Person Re-id [39]         | W-BFS          | RNN [17]
3  | Frame Rate | YOLO for Cars [21]        | Car Re-id [37]            | W-BFS w/ speed | –
4  | Active?    | Person Re-id (Small) [40] | Person Re-id (Large) [13] | Probabilistic  | –
Table 1: Module mappings for illustrative tracking apps

3 Latency & Quality-aware Runtime Strategies

The tracking platform operates in a dynamic environment, and needs runtime tunability. Our novel design offers a Tuning Triangle, Fig. 2 (inset), where users can achieve two of these three properties: performance, accuracy and scalability for tracking, which are shown on the corners, while sacrificing the third by controlling the knob on the side opposite to that corner. We have already seen that TL can manage the active camera set’s size, and achieve low latency and high accuracy.

In this section, we propose two more knobs, data drops and batching, that can achieve the other two property-pairs. For timely processing of the video feeds, the latency between a frame being generated at a camera and its processed response reaching the UV must fall within a maximum tolerable latency, $\gamma$, given by the user. This can be exploited to enhance the processing throughput by batching events passed to the external DL/CV models to amortize their static invocation overheads, while ensuring that the processing latency per event stays within permissible limits. We propose an adaptive batching strategy that tunes the supported latency, while supporting a scalable active camera set and high accuracy.

We have a captive set of edge, fog and cloud resources with dynamism in the MAN/WAN latency and bandwidth, and variable compute load due to an evolving active set decided by TL. So it is possible that the transient application load exceeds the compute or network capacity. Instead, we can gracefully degrade by dropping events that cannot be processed within $\gamma$. While this helps meet the latency goal and sustain a larger active-set size, this knob affects the accuracy of tracking when frames containing the target are dropped. This can delay locating the target and cause the active set to grow, exacerbating the problem. Also, it is preferable to drop potentially stale events early in the pipeline rather than later, to avoid wasting resources. We propose smart dropping heuristics that tune the accuracy, while offering a low latency and a larger active set of cameras.

3.1 Preliminaries

For modeling latency, we decompose the dataflow graph into a set of sequential task pipelines with a 1:1 task selectivity, i.e., the ratio of input to output events for a task is one. We propose strategies for an individual pipeline, which can then be generalized. When an event arrives at a task from upstream, it is placed in a FIFO queue. Events are taken from the front of the queue to form a batch, and once a batch is full, it is triggered for execution by the user-logic. The execution returns a corresponding batch of output events that is passed to a partitioner, which routes each event based on its key to a downstream task instance. This execution model is shown in Fig. 3.

The ID of an input event at the source task is propagated on its causal downstream events. Let $a_i(e)$ indicate the arrival time of an event $e$ at a task $t_i$ from its upstream task $t_{i-1}$. Let the function $E_i(b)$ give the execution time for a batch of $b$ events by the user-logic for $t_i$. We assume the execution time monotonically increases with the batch size, i.e., $E_i(b+1) \ge E_i(b)$. When $b = 1$, this becomes streaming execution with no batching delay. Let $p_i(e)$ denote the processing time between an event arriving at $t_i$’s input and its resulting output event being placed on its output stream.

We define the observed upstream arrival time at task $t_i$ for event $e$ as $u_i(e) = a_i(e) - a_1(e)$, using the timestamps of the event observed at the source task and at this task on their local device clocks. We assume that the clocks for the resources hosting the source and sink tasks of the dataflow are time-synchronized using, say, NTP [41]. But our model is otherwise resilient to clock-skews present in, say, the edge and fog resources on the WAN hosting the intermediate tasks [42].

3.2 Strategies to Drop Events

Figure 3: Processing events at a task, with batching and drops

The platform should drop any event that cannot reach the last task within $\gamma$ of its creation, as it is stale. So it is safe for a task $t_i$ to drop an event $e$ arriving at it if $u_i(e) \ge \gamma$. While simple, this does not prevent resource wastage at tasks prior to the one where the event is dropped. E.g., if the cumulative upstream and processing time exceeds $\gamma$ only at the last task of the pipeline, then every event will be processed through all the earlier tasks and yet be dropped at the last task, assuming the compute ($E_i$) and network performance stay constant. Ideally, the first task should reject an incoming event if it will be rejected downstream.

We achieve this optimization by introducing a reject signal that flows in reverse, from a task that drops a stale event back to the source task, through the intermediate tasks. The signal also carries the duration by which the event exceeded the deadline at the dropping task. Each task maintains a recent list of events processed by it, along with their queue waiting times and the sizes of the execution batches that processed them. When a task $t_i$ receives a reject signal for an event $e$ from its downstream task, it estimates its own contribution to the delay as the queuing time plus the batching delay for $e$ at $t_i$. It decrements the staleness duration by this value and, if the residual is positive, propagates the reject signal upstream. This apportions the cause of the staleness to the closest upstream tasks from the drop point, and is an advanced form of back-pressure [43] that is sensitive to $\gamma$.

In situations with no resource variability, any task receiving the reject signal can never fully process an arriving event with an eventual end-to-end latency within $\gamma$, i.e., the mapping of tasks to resources is unstable. However, an initial stable deployment can become unstable due to application dynamism, e.g., when the active set grows and the increased input stream rate causes the processing time to increase. In such cases, it may be possible to recover from a reject signal if the processing time at upstream tasks improves. So, we reframe the behavior on the receipt of a reject signal for event $e$ at task $t_i$ as: drop a future event $e'$ if $u_i(e') + \hat{p}_i(e') \ge u_i(e) + p_i(e)$, where $\hat{p}_i(e')$ is the expected processing time of $e'$ at this task. This considers both the upstream time of an event that arrives at a task, and the processing time at that task, to decide on the drop. So, a future event may pass through if the processing time at this or at previous tasks improves. Hence, our drop strategy is intelligent enough to respond to both deteriorating and improving conditions, both upstream and downstream. Transient network performance variability is reflected in the upstream time, so both rejects and recovery are responsive to network dynamism as well.

We define an upstream and processing budget for a task $t_i$ receiving a reject signal for event $e$ as $\beta_i = u_i(e) + p_i(e)$, i.e., the time from the event’s creation to the event completing its processing at this task. We use this budget to decide whether to drop an input event, at three different drop points within a task, as shown in Fig. 3.

Drop Point 1.  The first drop point is before an event is placed in the input queue at a task. The DropBeforeQueuing function is passed the arrival time of the event at the source task (present in the event header) and its arrival time at this task. We drop an event if its observed upstream time plus the expected time to process the event at this task exceeds the budget. Here, we do not know the future queuing delay or the size of the batch in which this event will be executed. So we are liberal, and assume a zero queuing time and that the event will stream through for execution ($b = 1$). Some events that pass this drop test may hence still fail the next one, based on how long they spent in the queue and what the actual execution time was.

procedure DropBeforeQueuing(e)
    û ← a_i(e) − a_1(e)    ▷ observed upstream time, from the event header and the local clock
    if û + E_i(1) ≤ β_i then  return true    ▷ Enqueue this event
    else  return false    ▷ Drop this event
    end if
end procedure

 

Drop Point 2.  The second drop point is right before executing a batch. At this time, we know how long each event has spent in the queue and the size $b$ of the batch it is part of, which gives us the expected execution time $E_i(b)$. The DropBeforeExec function is passed the entire batch and returns an updated batch without the events that should be dropped.

procedure DropBeforeExec(B)
    for e in B do
        if u_i(e) + w_i(e) + E_i(|B|) > β_i then  B ← B \ {e}    ▷ w_i(e) is e’s queue waiting time
        end if
    end for
    return B    ▷ Events that should be executed
end procedure

 

Drop Point 3.  It is possible that the actual execution time was longer than expected, i.e., the function $E_i$ is not accurate. If so, the third drop point is just before an output event is put on the output stream to the next task, where we check if the budget has been exceeded. This drop point is also important if the dataflow has branches, as we discuss next.

procedure DropBeforeTransmit(e)
    if u_i(e) + p_i(e) ≤ β_i then  return true    ▷ Transmit this event downstream
    else  return false    ▷ Drop this event
    end if
end procedure

 

By providing these three light-weight drop points, we achieve fine-grained control in avoiding wasted network or compute resources, and yet perform event drops just-in-time when they are guaranteed to exceed the budget. This balances application accuracy and performance. As a further quality optimization, we allow the user-logic to flag an event as do not drop, e.g., if it has a positive match, and we avoid dropping such events. This improves the accuracy and also benefits performance by reducing the active set size, say, by a spotlight-like TL logic.

Updating the Budget.  When a task $t_i$ receives a reject signal from a downstream task for event $e$, we update its local upstream and processing budget as $\beta_i \leftarrow \min(\beta_i,\ u_i(e) + p_i(e))$, i.e., the smaller of the previous budget, or the sum of the upstream time of the rejected event at this task and the time it spent processing it. This means that we maintain a list of events that this task processed in the past, along with their upstream and processing times, for a certain timeout duration.

However, this budget is monotonically decreasing, and drops can only keep getting more aggressive. We need to support cases where the network or processing performance improves, and pass a similar control signal back. So, we let every $k^{th}$ event that would have been dropped at the source task pass through downstream as a “tracer”, where $k$ is user-defined. If this event gets accepted by the downstream tasks, the last task sends an accept control signal back to the upstream tasks. On receipt of an accept signal for a tracer event $e$, a task updates its budget as $\beta_i \leftarrow \max(\beta_i,\ u_i(e) + p_i(e))$.
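The per-task bookkeeping implied by these reject and accept signals can be sketched as follows; the class and field names are ours, not the platform's, and a real implementation would also expire old history entries.

class BudgetTracker:
    """Illustrative per-task budget maintenance for the drop heuristics."""
    def __init__(self):
        self.budget = float("inf")      # upstream + processing budget (beta)
        self.history = {}               # event_id -> (upstream_time, processing_time)

    def record(self, event_id, upstream_time, processing_time):
        self.history[event_id] = (upstream_time, processing_time)

    def on_reject(self, event_id):
        # Shrink the budget to the rejected event's observed upstream + processing time.
        if event_id in self.history:
            u, p = self.history[event_id]
            self.budget = min(self.budget, u + p)

    def on_accept(self, event_id):
        # A tracer event made it through: relax the budget again.
        if event_id in self.history:
            u, p = self.history[event_id]
            self.budget = max(self.budget, u + p)

    def should_drop(self, upstream_time, expected_processing_time):
        return upstream_time + expected_processing_time > self.budget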

Non-linear Pipelines.  While the drop logic has been defined for a simple linear pipeline, a module instance in our dataflow deployed on a specific device can send events to multiple downstream modules on different devices, based on the partitioning function. Because of the partitioning, it is possible to predict the exact downstream module instance that an event will go to. However, this information is available only after the partitioner is run on the event, and is thus available only at drop point 3. Also, the entire downstream path cannot be estimated based on a single module’s partitioner, and the downstream paths may vary significantly in the time needed to reach the last task on the path. If there existed an oracle that provided the path for each event, it would be possible to accurately drop or forward a message. Since we do not assume one, we approximate this by maintaining one budget per downstream module.

3.3 Strategies to Batch Events for Execution

Some events may arrive early at a task, or the application may have a relaxed $\gamma$. In such cases, we can batch events in the FIFO queue before they are executed by the user-logic. This improves the throughput of the external analytics [44]. We allow users to optionally set a fixed batch size for each module as a tunable knob. However, it helps to let the batch size for a task vary over time. We use two factors to determine the dynamic batch size: the arrival times of events in the queue, and the execution time for the logic.

An event $e$ arriving at a task $t_i$ has a time budget of $\beta_i - u_i(e)$ to complete its processing, to avoid being dropped downstream. Using this, we define the absolute event deadline for $e$ as the timestamp on that task by which this event must complete processing, given by $d(e) = a_i(e) + \beta_i - u_i(e)$. Similarly, the absolute batch deadline for completing execution of a batch $B$ is the earliest deadline among the events in the batch, $D(B) = \min_{e \in B} d(e)$. The execution time for this batch is $E_i(|B|)$.

Given these, we can intuitively decide at time $\tau$ if an event $e$ in the queue should be added to an existing batch $B$ of size $b$ by checking if $\tau + E_i(b+1) \le \min(D(B), d(e))$, i.e., will adding this event cause the batch’s execution to finish beyond the new batch deadline. We test each event in the FIFO queue to see if it can successfully be added to the current batch. If so, we add it and update the batch deadline; if not, we submit the current batch for execution, and add the event to a new, empty batch. The drop test at point 2 is done on the batch submitted for execution. Even if events are not arriving in the queue, a batch is automatically submitted for execution when the current time reaches $D(B) - E_i(b)$. As a result of this dynamic batching strategy, we can achieve a higher throughput while not exceeding the required latency and avoiding event drops.
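The batching decision can be sketched as below, given per-event absolute deadlines and an estimate exec_time(b) of the execution time for a batch of size b; these names are stand-ins for the quantities defined above rather than Anveshak's actual code.

def form_batch(queue, now, exec_time):
    """Illustrative dynamic batching: greedily add queued events to the current
    batch while the (growing) batch can still finish before the earliest event
    deadline. queue is a list of (event, absolute_deadline) pairs in FIFO order;
    exec_time(b) estimates the execution time for a batch of b events."""
    batch, batch_deadline = [], float("inf")
    while queue:
        event, deadline = queue[0]
        new_deadline = min(batch_deadline, deadline)
        if now + exec_time(len(batch) + 1) <= new_deadline:
            batch.append(event)                 # event fits: extend the batch
            batch_deadline = new_deadline
            queue.pop(0)
        else:
            break                               # submit current batch, start a new one
    return batch, batch_deadline

A timer would additionally submit the pending batch once the current time reaches batch_deadline minus exec_time(len(batch)), even if no new events arrive.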

3.4 Formal Bounds on Batch Size, Drop Rate and Latency

Basic Scenario.  While our batching is not based on a fixed batch size but rather adapts to the events that arrive, we can formally bound the batch size under certain assumptions. Later, we relax some of these assumptions.

Say the dataflow has a constant input rate $r$, and the execution time and network time functions are static. Given the 1:1 selectivity assumption, the input rate at every task will also be $r$. This also means that the event’s observed upstream time at a task $t_i$ is a constant, i.e., $u_i(e) = u_i$ for all events $e$.

Then, the batch size $b$ for a task $t_i$ is the largest natural number such that:

$$u_i + \frac{b - 1}{r} + E_i(b) \le \gamma \qquad \text{and} \qquad E_i(b) \le \frac{b}{r}$$

The first inequality captures the intuition that the time to accumulate the batch and then process it must not exceed the deadline. The second ensures the stability of the system, by ensuring that while the batch is being processed, a larger batch cannot accumulate. Also, under these stable conditions, we have $D(B) = d(e_1)$, i.e., the batch deadline is bound by its first event.

Since $E_i(\cdot)$ and $r$ are unconstrained, $b$ may not have a valid solution, which means that we cannot batch and process events on this task within the deadline, and the input rate is unsustainable. In such cases, events will be dropped.

We find the drop rate of events by finding the largest stable input rate that we can support, $\hat{r} \le r$, and the associated batch size $\hat{b}$. The optimization goal is to maximize $\hat{r}$ and minimize $\hat{b}$ such that

$$u_i + \frac{\hat{b} - 1}{\hat{r}} + E_i(\hat{b}) \le \gamma \qquad \text{and} \qquad E_i(\hat{b}) \le \frac{\hat{b}}{\hat{r}}$$

and the drop rate is then $1 - \hat{r}/r$. Since batching adds latency to the overall event processing time while increasing the throughput, the increased average latency due to batching is the mean time an event waits for its batch to fill, $\frac{\hat{b} - 1}{2\hat{r}}$. As a result, the throughput of the dataflow is $\hat{r}$.
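As a small numeric check of the constraints above, the script below searches for the largest feasible batch size under an assumed linear execution-time model; the rate, upstream time and deadline values are arbitrary examples.

def max_stable_batch(rate, upstream, exec_time, gamma, b_max=1000):
    """Find the largest batch size b satisfying the basic-scenario constraints:
    accumulation plus execution fits in the deadline, and execution keeps up
    with arrivals. Returns None if no batch size is feasible at this rate."""
    best = None
    for b in range(1, b_max + 1):
        fits_deadline = upstream + (b - 1) / rate + exec_time(b) <= gamma
        keeps_up = exec_time(b) <= b / rate
        if fits_deadline and keeps_up:
            best = b
    return best

# Example with an assumed execution-time model: 0.5 s fixed overhead + 50 ms/event.
if __name__ == "__main__":
    print(max_stable_batch(rate=10.0, upstream=1.0,
                           exec_time=lambda b: 0.5 + 0.05 * b, gamma=5.0))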

Variable upstream time.  We extend this to the case when the upstream time for the pipeline is variable. The variability in the upstream time can be due to network performance variability, an increase in the data rate, or even compute degradation upstream; its effect on the module, however, is visible only as variability in the observed upstream time. Say the observed upstream time for the $k^{th}$ event is $u_i^k$, with the variability beginning at some event $k_0$. The waiting time in the queue for the events of the $n^{th}$ batch depends on their inter-arrival gaps, approximated as $\frac{b_n - 1}{r_n}$, where $r_n$ is the input rate at the start of the $n^{th}$ batch and $b_n$ its size. Hence, the batch size $b_n$ can be solved from:

$$u_i^k + \frac{b_n - 1}{r_n} + E_i(b_n) \le \gamma \qquad (1)$$
$$E_i(b_n) \le \frac{b_n}{r_n} \qquad (2)$$

The two equations still capture the same intuitions: that the batch can be formed and processed within the deadline, and that while the old batch is being processed, a new batch cannot be formed.

Variable clock skew.  The clocks on the WAN resources hosting the intermediate tasks may also drift. Suppose that one unit of time on the local clock of task $t_i$ corresponds to a skew factor of $s$ units of true time, with the skew beginning at some message $k_0$. The observed upstream times and queue waiting times at $t_i$ are then measured on the skewed clock, and the constraints (1) and (2) are modified into (3) and (4) by scaling each term by this skew factor, since in this case even the RHS has to be scaled as per the clock skew.

4 Implementation

We implement Anveshak, a Python-based distributed runtime engine that allows users to easily define their tracking application using our proposed domain-specific dataflow model. Anveshak is much more light-weight than traditional Big Data platforms like Spark Streaming or Flink, and designed to operate on a WAN rather than a LAN. This allows it to be deployed on edge, fog or cloud resources.

Each distributed resource available for deploying the tracking dataflow runs a Worker process. The Worker is responsible for managing module instances on a resource, and transferring data between instances of the dataflow running on different devices using ZeroMQ [45]. A Master process runs in the cloud at a well-known endpoint and manages application deployment. Application developers can implement their custom logic for the Interfaces corresponding to each of the dataflow modules. E.g., they may implement VA or CR as a call to a TensorFlow model using a local gRPC service. When the application is ready to be deployed, the dataflow definition containing the module descriptions along with parameters for the modules is passed to the Master. E.g., the image of the target used by VA and CR, or the expected speed of the target used by TL can be passed. The number of instances of each module is specified as well. The Master calls a Scheduler logic that maps the module instances to the resources running the Workers, and initializes them with the user-specified input parameters. The scheduling logic is pluggable, and advanced scheduling of instances to resources is outside the scope of this paper.

A Worker process may have multiple modules deployed on it. Each module instance runs in a separate Executor process. We use a Router process to pass data between the Worker and the Executor, to ensure that the appropriate context for executing the module logic is set. These processes use Sys V Inter-Process Communication (IPC) [46]. A Worker process can also fetch the source camera feeds from an external endpoint; we natively use Apache Kafka [47] to route the initial camera video streams to FC. We also offer basic distributed error and debug logging at runtime from the Workers to the Master using Rsyslog. This logger is available to the module developers as well, and can also be used for collecting performance metrics.
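To make the deployment flow concrete, a dataflow definition submitted to the Master might look roughly like the sketch below; since the paper does not specify Anveshak's configuration format, every key, module name and parameter value here is an assumption.

# Purely illustrative dataflow definition; Anveshak's actual format is not
# specified in the paper, so the keys, module names and parameters are assumptions.
tracking_dataflow = {
    "modules": {
        "FC": {"logic": "SpotlightFilterControl", "instances": 1000},
        "VA": {"logic": "HoGPlusReId", "instances": 10,
               "params": {"target_image": "poi.jpg", "threshold": 0.7}},
        "CR": {"logic": "StrongReId", "instances": 20,
               "params": {"target_image": "poi.jpg"}},
        "TL": {"logic": "WeightedBFSSpotlight", "instances": 1,
               "params": {"road_network": "roads.graphml", "walk_speed_mps": 1.5}},
        "UV": {"logic": "Dashboard", "instances": 1},
    },
    "max_tolerable_latency_s": 10,
}
# The Master's pluggable Scheduler would map these instances onto the Workers
# running on edge, fog and cloud resources.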

5 Experiments

We perform targeted and detailed experiments to evaluate the benefits of the domain-sensitive Tuning Triangle knobs that we offer, viz., (1) a smarter way to define the tracking logic, (2) multi-stage dropping strategies, and (3) our batching capability. We empirically demonstrate our proposition that these knobs help achieve two of the three qualitative metrics, viz., (1) low end-to-end latency within a defined threshold, (2) a large size of the active set of cameras, and (3) the accuracy of the tracking.

System Setup.  We have ten Azure D8v3 VMs acting as the compute nodes, each with an 8-core, 2.40 GHz Intel Xeon E5-2673 v3 CPU and 32 GB RAM, and one Azure D16v3 VM serving as the head node, with 16 cores of the same CPU and 64 GB RAM. The compute nodes run Anveshak’s Workers while the head node runs the Master, and a Kafka v2.11.0 pub-sub broker to route the camera feeds. The VMs run CentOS v7.5, Python v3, TensorFlow v1.9 and Java v1.8.

Application.  We implement the tracking application described by the pseudo-code in Sec. 2.3 (except for the QF module) in our experiments. Further, we implement three variants of the TL for our evaluations. TL-Base is a naïve baseline, and keeps all the cameras in the network active all the time. TL-BFS has access to the underlying road network, but assumes a fixed road length when performing the spotlight BFS strategy. TL-WBFS is similar, but is aware of the exact length of each road segment. Both TL-BFS and TL-WBFS are given an expected walking speed for the tracked target.

Workload.  We simulate video feeds that mimic the movement of the target through a road network. The simulator takes as input the road network with the road lengths, the speed of walk of the target, its starting vertex in the network, and a base image dataset which has labeled true positive and negative images for the target. A given number of cameras are “placed” on vertices surrounding the starting vertex. We simulate the movement of the target from the source vertex as a random walk at the given speed. Each camera generates a timestamped feed of images at a fixed frame rate using the true negative images, but uses the true positive images for the time intervals when the target is within its FOV during the walk.

For the road network, we extracted a circular region from Open Street Maps [48], centered at the Indian Institute of Science campus, along with its vertices, edges and road lengths. We use the CUHK-03 image dataset [49], which is compatible with our application, with labeled RGB JPG images that form true positives or negatives for unique targets. The target walks at a fixed speed and each camera generates a timestamped feed at a fixed frame rate.

5.1 Benefits of the Programming Model and TL Module

The flexibility and expressivity of our domain-specific programming model was illustrated through the exemplar applications in Table 1. Of these, the TL module is unique, and helps us tune the active set size while lowering the end-to-end latency and improving the accuracy.

5.1.1 Benefits of a Smart TL

Having a smart TL algorithm helps us manage the growth in the active camera set size. This helps with scaling to a larger total set of cameras, while also providing lower latency and higher accuracy. We evaluate three TLs with different degrees of smartness: TL-Base, TL-BFS, and TL-WBFS. TL-BFS and TL-WBFS are provided with an accurate expected walking speed that matches the underlying simulated walk. We run this for 40, 100, 400 and 1000 cameras. Data drops and dynamic batching are disabled; the batch size is fixed at 10, and the tolerable latency $\gamma$ is fixed. The number of FC instances equals the total number of cameras, and these are equally distributed over all 10 VMs. We also have 10 VA, 20 CR, 1 TL and 1 UV instances placed across the VMs. TL-Base alone has just 10 CR instances, to improve its network performance.

(a) Active Set Distribution
(b) Latency Distribution
Figure 4: Performance of different TLs as camera count grows

Fig. 4(a) shows the impact of the TL logic on the distribution of active set sizes across time, while Fig. 4(b) shows their latencies, for the various camera counts. All the experiments reported are stable. TL-Base always has all cameras active, and its active camera count is marked with a single triangle. It is stable for 40 (active) cameras, but is unstable for larger camera counts, which are not shown. Its simple TL logic that keeps all cameras active quickly overloads the 10 VM resources. This is reflected in its latency as well, which is larger than the others. While its median latency falls below $\gamma$, some frames overshoot it; it is otherwise able to maintain stability for 40 cameras, but no more.

For TL-BFS, the median active set size stays flat even as the total number of cameras increases to 1000. However, its lack of use of the road distances causes it to occasionally expand the active set to as many as 36 cameras, seen from the long tails. While this causes its latencies to occasionally jump up as well, in general its median latency stays well under $\gamma$.

TL-WBFS is the most sophisticated of the strategies we compare, and this is reflected in its much tighter and smaller active set size, which stays within 8 cameras for the entire experiment run. Its latency is well-bounded as well, without as many spikes as seen for the others, though its median latency is marginally higher than TL-BFS’s.

5.1.2 Mismatch between TL and Actual Walk

The more accurately a TL is able to predict the behavior of the target, the better it will be able to manage the active set size. In this experiment, we intentionally over-estimate the expected walking speed of the target given to the TL-BFS and TL-WBFS algorithms, compared to the simulated walk in the input video streams. We run this on a 1000-camera setup, with increasing levels of over-estimate relative to the actual speed. We retain a fixed batch size of 10, with data drops disabled.

(a) Active Set Distribution
(b) Latency Distribution
Figure 5: Perf. of TLs with variable expected speeds of walk

As before, we plot the active set size and latency distributions in Figs. 5(a) and 5(b) for the different expected walking speeds. When there is no over-estimate in the expected speed, both TL-BFS and TL-WBFS are stable. Their median active set size is 5 cameras, and their median latencies are 6.4 and 8.5 secs respectively. This configuration is similar to the previous experiment with 1000 cameras, though for a different run, and hence the marginally different values.

When we over-estimate the speed, TL-BFS becomes unstable even at the smaller over-estimate, due to its length-unaware nature. Its active set grows rapidly, causing more frames to be generated, which stresses the system. This causes latencies to grow exponentially, and future positive matches of the target, which could shrink the active camera set, never get to be processed. However, TL-WBFS is able to keep the active set size within 40 cameras even with a 150% over-estimate of the target speed, with a small median set size. None of the frames are dropped and the latency is also well-controlled, mostly staying within $\gamma$. It is only at an even higher over-estimate of the walking speed that it becomes unstable (not shown). Hence, while it may seem attractive to over-estimate the walking speed to avoid missing the POI, it can cause the system to get overloaded and unstable.

5.2 Benefits of Data Drops

Data drops help sustain a larger active set size and lower latency, while sacrificing the accuracy of matches. They pro-actively remove events that cannot meet the threshold latency to conserve resource usage, thus supporting more active cameras online while meeting the latency for the events that reach the destination. In this experiment, we enable data drops for the 1000-camera setup with an over-estimated walking speed using TL-WBFS, and compare it with the same over-estimated speed without drops, which is unstable. We also report the no-drop scenario without a speed over-estimate. The upper limit of the batch size is set at 10 events.

(a) Data Drops Enabled
(b) Effect of Batch Sizes
Figure 6: Benefits of Data Drops and Batching

Fig. 6(a) shows the active set size distribution (right Y axis) and latency distribution (left Y axis), along with the percentage of events dropped as a bar, when drops are enabled. We see that the latency distribution with drops is similar to the no-drop, no-over-estimate baseline, indicating that we meet the threshold latency. We also see that with drops, we are able to support a larger active set size of close to 100 cameras, while about 51% of frames are dropped. These drops translate to a corresponding reduction in the accuracy of detecting frames that contain the target. The no-drop scenario with the over-estimated speed that is shown is unstable; as we see, its long-tail latencies show the system degenerating, causing the active set size to grow even further, and so on.

5.3 Benefits of Batching

The third knob of batching allows us to trade off latency in return for higher accuracy and a larger active set size. Here, we perform controlled experiments with fixed batch sizes, and vary the batch size across runs, for a 1000-camera setup operating over video feeds with the target walking at its actual speed. We then tune the estimated walking speed passed to TL-WBFS to increase the active set size, and examine the behavior of the active set size and latency distributions. Drops are disabled.

As expected, we see that increasing the batch size allows us to increase the supported active set size, which ranges from about 8 cameras for a batch size of 1 (i.e., streaming each event) all the way to 20 cameras for a batch size of 10. At the same time, we also see the latency for processing the events shrink as the batch size reduces, with the median latency dropping as the batch size goes from 10 down to 1.

6 Related Work

Video Surveillance Systems.  Intelligent video surveillance systems allow automated analytics over camera feeds [50]. They enable wide-area surveillance from multiple static or PTZ cameras, with distributed or master-slave controls [11]. ADVISOR [19] supports tracking, crowd counting, and behavioral analysis over camera feeds from train stations to support human operators. However, these are pre-defined applications, run centrally on a private data center, and process all camera feeds all the time. IBM’s Smart Surveillance System (S3) [18] is a proprietary platform for video data management, analysis and real-time alerts. While it offers limited composability using different modules, it too executes the applications centrally and does not consider performance optimizations. Early works examine edge computing for basic pre-processing [51]. But the edge logic is statically defined, with the rest of the analytics done centrally and over dedicated networks.

The Ella middleware [52] supports the definition of distributed analytics on the edge and over a WAN. They use a publish-subscribe model with hierarchical brokers to route video and event streams between analytics deployed on edge devices. They illustrate a multi-camera single person tracking application, similar to us. However, their platform design resembles a general-purpose event-driven middleware, without any specific model support or runtime optimizations for video analytics. We offer programming support for tracking applications and batching/dropping strategies that are tunable to dynamism. Others exclusively focus on offline analysis over video feeds from a many-camera network along with other data sources for spatio-temporal association studies [53].

Vigil [54] is specifically for video surveillance using wireless networks, which have limited bandwidth. They assume an Edge computing node (ECN) is co-located with the cameras and is used to reduce redundant data from being sent to the Cloud. The authors assign a utility score to each frame to ascertain its importance, similar to our do not drop flag. Our model and platform offer more active controls over the logic running on the ECN, and the runtime tuning.

The EdgeEye framework [55] is designed to efficiently deploy DNN models on the edge. It provides a JavaScript API for users to specify the parameters of the DNNs to be used, with the actual DNN implementation abstracted from the end user. While it caters to a wider class of analytics applications, it lacks composability and domain-specific patterns for tracking applications. It offers performance optimizations for the DNN model, but does not consider distributed systems issues, such as batching, dropping and the variability of network and compute that we emphasize. Also, not all tracking applications use DNNs, and classic CV algorithms are still relevant [56].

Video Storm [57] is a video analytics system designed with the goals of approximation and delay tolerance. It schedules a mix of video analytics query workloads on a cluster of machines, where each query has a deadline and priority. Video Storm is capable of tuning knobs in the query, such as the resolution or the frame rate, in order to support fluctuating workloads, at the cost of quality. Video Edge [29] extends this to support scheduling on a hierarchy of edge, fog and cloud resources. Both provide tuning knobs which, at a high level, are similar to our Tuning Triangle. However, the key distinction is that they offer many degrees of freedom but also require the specification of an objective function to define the impact of the knobs on the metrics. This makes them challenging to use out of the box if these interactions are not well-defined. Our domain-sensitive Tuning Triangle takes a more prescriptive approach. It intuitively captures the impact of the three well-defined knobs we offer on the three metrics that have the most impact on tracking applications.

In general, this reflects our design philosophy. Today’s video surveillance landscape offers either a high degree of flexibility [52, 55], which increases the effort required to build applications, or is too restrictive, limiting flexibility [18, 19]. Our domain-specific approach targets tracking applications and offers developers an intuitive but customizable pattern for defining their tracking application, while our runtime is sensitive to the performance challenges of distributed computing that affect this application class.

Big Data platforms and DSL.  Generic stream processing platforms like Apache Storm, Flink and Spark Streaming [32, 43, 58] offer flexible dataflow composition. But defining a dataflow pattern for tracking applications, like we do, offers users a frame of reference for designing distributed video analytics applications, with modular user-defined tasks.

Google’s TensorFlow [59] is a domain-specific programming model for defining DNNs and CV algorithms. It provides TensorFlow Serving to deploy trained models for inference. However, TensorFlow is not meant for composing arbitrary modules together. Its tasks take a Tensor as input and give a Tensor as output, and there are no native patterns such as Map and Reduce that big data frameworks like MapReduce and Spark offer. Such pre-defined APIs allow users to better reason about the operations being performed on the data, and map to well-defined implementations of these APIs that save users effort. We make a similar effort for tracking analytics.

Yahoo’s TensorFlow on Spark [60] gives more flexibility than TensorFlow by allowing Spark’s Executors to feed RDD data into TensorFlow. Thus, users can couple Spark’s operations with TensorFlow’s neural networks. We are at a level of abstraction higher, allowing for the rapid development of tracking applications in fewer lines of code. Also, Spark is not designed to distribute computation over wide-area networks and edge/fog devices, which we address in the Anveshak runtime.

Streaming Performance Management.  There are several performance optimization approaches adopted by stream processing systems, which we extend. Apache Flink [58] and Storm [43] support back-pressure, where a slow task signals its upstream task to reduce its input rate. This may eventually lead to data drops, but the data being dropped are the new events generated upstream rather than the stale ones at the downstream tasks, which sacrifices freshness in favor of fairness. Our drops prioritize recent events over stale events.

Google’s Millwheel [61] uses the concept of low watermarks to determine the progress of the system, defined as the timestamp of the oldest unprocessed event in the system. It guarantees that no event older than the watermark may enter the system. Watermarks can thus be used to safely trigger computations such as window operations. While our batching and drop strategies are similar, watermarks cannot determine the time left for a message in the pipeline, and have no notion of a user-defined latency.

Aurora [62] introduced the concept of load shedding, which is conceptually the same as data drops. They define QoS as a multidimensional function, including attributes such as response time, similar to our maximum tolerable latency, and tuple drops. Given this function, the objective is to maximize the QoS. Borealis [63] extended this to a distributed setup. Anveshak uses multiple drop points even within a task, which offers it fine-grained control and robustness. Features like “do not drop” and resilience to clock skews found in WAN resources are other domain and system specific optimizations.

7 Conclusions

In this paper, we have proposed an intuitive domain-specific dataflow model for composing distributed object tracking applications over a many-camera network. Besides offering an expressive and concise pattern, we surface the Tracking Logic module as a powerful abstraction that can perform intelligent tracking and manage the active cameras. This enhances the scalability of the application and makes efficient use of resources. Further, we offer tunable runtime strategies for dropping and batching that help users easily balance between the goals of performance, accuracy and scalability. Our design is sensitive to the unique needs of the many-camera tracking domain and to distributed edge, fog and cloud resources on wide-area networks. Our experiments validate these for a realistic tracking application on feeds from up to 1000 cameras.

As future work, we plan to explore intelligent scheduling of the module instances on edge, fog and cloud resources; allow modules to be dynamically replaced for better accuracy or performance; handle mobile camera platforms such as drones; and perform experiments on a wide-area resource deployment.

8 Acknowledgments

We thank Prof. A. Chakraborthy, Visual Computing Lab, IISc, for discussions on the tracking problem and video analytics modules. We also thank fellow students, Swapnil Gandhi and Anubhav Guleria, for their valuable insights. This work was supported by grants from Huawei as part of the Innovation Lab for Intelligent Data Analytics and Cloud, and resources provided by Microsoft and NVIDIA.

References

  • [1] G. Ananthanarayanan, P. Bahl, P. Bodík, K. Chintalapudi, M. Philipose, L. Ravindranath, and S. Sinha, “Real-time video analytics: The killer app for edge computing,” IEEE Computer, vol. 50, no. 10, pp. 58–67, 2017.
  • [2] “How many CCTV cameras are there in London?” https://www.cctv.co.uk/how-many-cctv-cameras-are-there-in-london/, accessed: 2018/07/28.
  • [3] F. Porikli, Y. Ivanov, and T. Haga, “Robust abandoned object detection using dual foregrounds,” EURASIP Journal on Advances in Signal Processing, vol. 2008, no. 1, p. 197875, Oct 2007. [Online]. Available: https://doi.org/10.1155/2008/197875
  • [4] R. Arroyo, J. J. Yebes, L. M. Bergasa, I. G. Daza, and J. Almazán, “Expert video-surveillance system for real-time detection of suspicious behaviors in shopping malls,” Expert systems with Applications, vol. 42, no. 21, pp. 7991–8005, 2015.
  • [5] T. Ko, “A survey on behavior analysis in video surveillance for homeland security applications,” in 2008 37th IEEE Applied Imagery Pattern Recognition Workshop, Oct 2008, pp. 1–8.
  • [6] N. S. Nafi and J. Y. Khan, “A vanet based intelligent road traffic signalling system,” in Australasian Telecommunication Networks and Applications Conference (ATNAC) 2012, Nov 2012, pp. 1–6.
  • [7] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, p. 436, May 2015. [Online]. Available: https://doi.org/10.1038/nature14539
  • [8] A. Khochare, P. Ravindra, S. P. Reddy, and Y. Simmhan, “Distributed video analytics across edge and cloud using echo,” in International Conference on Service-Oriented Computing (ICSOC) Demo, 2017.
  • [9] K. A. Shiva Kumar, K. R. Ramakrishnan, and G. N. Rathna, “Distributed person of interest tracking in camera networks,” in Proceedings of the 11th International Conference on Distributed Smart Cameras, ser. ICDSC 2017.   New York, NY, USA: ACM, 2017, pp. 131–137. [Online]. Available: http://doi.acm.org/10.1145/3131885.3131921
  • [10] Q. Cai and J. K. Aggarwal, “Tracking human motion in structured environments using a distributed-camera system,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 11, pp. 1241–1247, 1999.
  • [11] P. Natarajan, P. K. Atrey, and M. Kankanhalli, “Multi-camera coordination and control in surveillance systems: A survey,” ACM Trans. Multimedia Comput. Commun. Appl., vol. 11, no. 4, Jun. 2015.
  • [12] A. Bedagkar-Gala and S. K. Shah, “A survey of approaches and trends in person re-identification,” Image and Vision Computing, vol. 32, no. 4, pp. 270–286, 2014.
  • [13] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, California, USA, February 4–9, 2017, pp. 4278–4284. [Online]. Available: http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14806
  • [14] X. Xie, M. Jones, and G. Tam, “Recognition, tracking, and optimisation,” International Journal of Computer Vision, vol. 122, no. 3, pp. 409–410, 2017.
  • [15] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Deep metric learning for person re-identification,” in 2014 22nd International Conference on Pattern Recognition, Aug 2014, pp. 34–39.
  • [16] X. Liu, W. Liu, H. Ma, and H. Fu, “Large-scale vehicle re-identification in urban surveillance videos,” in Multimedia and Expo (ICME), 2016 IEEE International Conference on.   IEEE, 2016, pp. 1–6.
  • [17] N. Murthy, R. K. Sarvadevabhatla, R. V. Babu, and A. Chakraborty, “Deep sequential multi-camera feature fusion for person re-identification,” arXiv preprint arXiv:1807.07295, 2018.
  • [18] C.-F. Shu, A. Hampapur, M. Lu, L. Brown, J. Connell, A. Senior, and Y. Tian, “Ibm smart surveillance system (s3): a open and extensible framework for event based surveillance,” in IEEE Conference on Advanced Video and Signal Based Surveillance, 2005., Sep. 2005, pp. 318–323.
  • [19] N. T. Siebel and S. Maybank, “The advisor visual surveillance system,” in ECCV 2004 workshop applications of computer vision (ACV), vol. 1, 2004.
  • [20] M. K. Lim, S. Tang, and C. S. Chan, “isurveillance: Intelligent framework for multiple events detection in surveillance videos,” Expert Systems with Applications, vol. 41, no. 10, pp. 4704–4715, 2014.
  • [21] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” arXiv preprint, 2017.
  • [22] L. Esterle, P. R. Lewis, M. Bogdanski, B. Rinner, and X. Yao, “A socio-economic approach to online vision graph generation and handover in distributed smart camera networks,” in 2011 Fifth ACM/IEEE International Conference on Distributed Smart Cameras, Aug 2011, pp. 1–6.
  • [23] F. Z. Qureshi and D. Terzopoulos, “Planning ahead for ptz camera assignment and handoff,” in 2009 Third ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC), Aug 2009, pp. 1–8.
  • [24] Y. Simmhan, Big Data and Fog Computing.   Cham: Springer International Publishing, 2018, pp. 1–10. [Online]. Available: https://doi.org/10.1007/978-3-319-63962-8_41-1
  • [25] F. Bonomi, “Cloud and fog computing: trade-offs and applications,” in Intl. Symp. Comp. Architecture (ISCA), 2011.
  • [26] P. G. Lopez, A. Montresor, D. Epema, A. Datta, T. Higashino, A. Iamnitchi, M. Barcellos, P. Felber, and E. Riviere, “Edge-centric computing: Vision and challenges,” SIGCOMM Computer Communication Reviews, vol. 45, no. 5, October 2015. [Online]. Available: http://doi.org/10.1145/2831347.2831354
  • [27] M. Satyanarayanan, P. Simoens, Y. Xiao, P. Pillai, Z. Chen, K. Ha, W. Hu, and B. Amos, “Edge analytics in the internet of things,” IEEE Pervasive Computing, vol. 14, no. 2, pp. 24–31, 2015.
  • [28] L. Esterle, P. R. Lewis, R. McBride, and X. Yao, “The future of camera networks: Staying smart in a chaotic world,” in Proceedings of the 11th International Conference on Distributed Smart Cameras, ser. ICDSC 2017.   New York, NY, USA: ACM, 2017, pp. 163–168. [Online]. Available: http://doi.acm.org/10.1145/3131885.3131931
  • [29] C.-C. Hung, G. Ananthanarayanan, P. Bodik, L. Golubchik, M. Yu, P. Bahl, and M. Philipose, “Videoedge: Processing camera streams using hierarchical clusters,” in 2018 IEEE/ACM Symposium on Edge Computing (SEC).   IEEE, 2018, pp. 115–131.
  • [30] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
  • [31] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.   ACM, 2010, pp. 135–146.
  • [32] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets.” HotCloud, vol. 10, no. 10-10, p. 95, 2010.
  • [33] C. Kyrkou, S. Timotheou, T. Theocharides, C. Panayiotou, and M. Polycarpou, “Optimizing multi-target detection in stochastic environments with active smart camera networks,” in Proceedings of the 11th International Conference on Distributed Smart Cameras, 2017.
  • [34] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, June 2005, pp. 886–893 vol. 1.
  • [35] E. Ahmed, M. Jones, and T. K. Marks, “An improved deep learning architecture for person re-identification,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3908–3916.
  • [36] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang, “Joint detection and identification feature learning for person search,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE, 2017, pp. 3376–3385.
  • [37] J. Sochor, J. Špaňhel, and A. Herout, “Boxcars: Improving fine-grained recognition of vehicles using 3-d bounding boxes in traffic surveillance,” IEEE Transactions on Intelligent Transportation Systems, 2018.
  • [38] R. Girshick, “Fast r-cnn,” in The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [39] W. Li, X. Zhu, and S. Gong, “Harmonious attention network for person re-identification,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [40] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [41] Y. Geng, S. Liu, Z. Yin, A. Naik, B. Prabhakar, M. Rosenblum, and A. Vahdat, “Exploiting a natural network effect for scalable, fine-grained clock synchronization,” in 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), 2018, pp. 81–94.
  • [42] F. Buchholz and B. Tjaden, “A brief study of time,” digital investigation, vol. 4, pp. 31–42, 2007.
  • [43] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham et al., “Storm@ twitter,” in Proceedings of the 2014 ACM SIGMOD international conference on Management of data.   ACM, 2014, pp. 147–156.
  • [44] A. Canziani, A. Paszke, and E. Culurciello, “An analysis of deep neural network models for practical applications,” CoRR, vol. abs/1605.07678, 2016. [Online]. Available: http://arxiv.org/abs/1605.07678
  • [45] F. Akgul, ZeroMQ.   Packt Publishing, 2013.
  • [46] D. P. Bovet and M. Cesati, Understanding the Linux Kernel: from I/O ports to process management.   O’Reilly Media, Inc., 2005.
  • [47] J. Kreps, N. Narkhede, J. Rao et al., “Kafka: A distributed messaging system for log processing,” in Proceedings of the NetDB, 2011, pp. 1–7.
  • [48] OpenStreetMap contributors, “Planet dump retrieved from https://planet.osm.org,” https://www.openstreetmap.org, 2017.
  • [49] W. Li, R. Zhao, T. Xiao, and X. Wang, “Deepreid: Deep filter pairing neural network for person re-identification,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [50] M. Al Najjar, M. Ghantous, and M. Bayoumi, Video surveillance for sensor platforms: Algorithms and Architectures.   Springer, 2014.
  • [51] A. Kornecki, “Middleware for distributed video surveillance,” IEEE Distributed Systems Online, vol. 9, no. 2, 2008.
  • [52] B. Dieber, J. Simonjan, L. Esterle, B. Rinner, G. Nebehay, R. Pflugfelder, and G. J. Fernandez, “Ella: Middleware for multi-camera surveillance in heterogeneous visual sensor networks,” in Distributed Smart Cameras (ICDSC), 2013 Seventh International Conference on.   IEEE, 2013, pp. 1–6.
  • [53] Z. Shao, J. Cai, and Z. Wang, “Smart monitoring cameras driven intelligent processing to big surveillance video data,” IEEE Transactions on Big Data, vol. 4, no. 1, pp. 105–116, 2018.
  • [54] T. Zhang, A. Chowdhery, P. V. Bahl, K. Jamieson, and S. Banerjee, “The design and implementation of a wireless video surveillance system,” in Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, ser. MobiCom ’15.   New York, NY, USA: ACM, 2015, pp. 426–438. [Online]. Available: http://doi.acm.org/10.1145/2789168.2790123
  • [55] P. Liu, B. Qi, and S. Banerjee, “Edgeeye: An edge service framework for real-time intelligent video analytics,” in Proceedings of the 1st International Workshop on Edge Systems, Analytics and Networking, ser. EdgeSys’18.   New York, NY, USA: ACM, 2018, pp. 1–6. [Online]. Available: http://doi.acm.org/10.1145/3213344.3213345
  • [56] A. Kumar, S. Goyal, and M. Varma, “Resource-efficient machine learning in 2 KB RAM for the internet of things,” in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 1935–1944.
  • [57] H. Zhang, G. Ananthanarayanan, P. Bodik, M. Philipose, P. Bahl, and M. J. Freedman, “Live video analytics at scale with approximation and delay-tolerance,” in 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17).   Boston, MA: USENIX Association, 2017, pp. 377–392. [Online]. Available: https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/zhang
  • [58] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas, “Apache flink: Stream and batch processing in a single engine,” Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, vol. 36, no. 4, 2015.
  • [59] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16).   Savannah, GA: USENIX Association, 2016, pp. 265–283. [Online]. Available: https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi
  • [60] “Tensorflow on Spark,” https://github.com/yahoo/TensorFlowOnSpark/wiki, accessed: 2018/06/16.
  • [61] T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle, “Millwheel: Fault-tolerant stream processing at internet scale,” Proc. VLDB Endow., vol. 6, no. 11, pp. 1033–1044, Aug. 2013. [Online]. Available: http://dx.doi.org/10.14778/2536222.2536229
  • [62] D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik, “Aurora: A new model and architecture for data stream management,” The VLDB Journal, vol. 12, no. 2, pp. 120–139, Aug. 2003. [Online]. Available: http://dx.doi.org/10.1007/s00778-003-0095-z
  • [63] D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina et al., “The design of the borealis stream processing engine.” in Conference on Innovative Data Systems Research (CIDR), vol. 5, no. 2005, 2005, pp. 277–289.