As the “Internet of Smart Things” continues to have tremendous societal impact, human-machine interfaces are also evolving in modality, accuracy, and energy efficiency. Beyond the traditional keyboard and mouse, smart devices enable advanced user interfaces such as voice command and control, camera- and GPS-based sensors and interfaces, and touch screens and displays. However, these interfaces are mostly active, in the sense that they require significant power to receive user inputs and subsequently process them. Hence, in an “always-on” environment, where these user interfaces need to be perpetually on, the design of the sensor front-ends and their power management presents significant challenges. The power cost of continuously capturing and analyzing video is so high that most systems require physical input from the user before accepting commands. To address this issue, a “wake up” camera front-end allows a sensor node to continuously acquire video and monitor for a trigger that wakes up the back-end when necessary, thus enabling new usage models. A promising “wake up” modality for “always on” cameras is hand gestures, which is the approach presented in this paper. Traditional gesture recognition systems are power inefficient and run on batteries or even AC power supplies [1, 2]. However, with rapid advances in energy harvesting, it is enticing to consider a camera front-end powered by photovoltaic (PV) cells, paving the way for light-powered, smart, “always on” cameras.
Figure 1 illustrates the landscape of self-powered sensor nodes. It shows the power requirements of various electronic devices and the amount of power that can be harvested from sources such as solar, thermal, and mechanical energy. Image and video processing and classification, in particular, need high computational power. CPUs, GPUs and FPGAs are typically used to perform gesture recognition and object classification on video data [1, 3]. However, for “always on” front-ends, where the objective is trigger identification and not continuous gesture recognition, high-performance (and hence high-power) processors are not optimal. Instead, vision-specific MCUs and DSPs are more attractive for self-powered devices, since they exhibit: (1) power dissipation on the order of hundreds of mW (a 10X reduction compared to CPUs), (2) compact size and low thermal requirements, and (3) sufficient computational ability for “always on” applications, as will be demonstrated here. Our system features an Analog Devices Blackfin processor.
To enable “always on” and self-powered operation, we take advantage of recent advances in compressed domain (CD) data processing, which allows trigger detection with significantly lower power and computational requirements. This is in contrast with existing algorithms, which work directly in the pixel domain. Given the objective of our camera front-end, the computational complexity can be greatly reduced (as demonstrated here) relative to existing algorithms that target continuous gesture recognition [4, 5]. As a command to wake up the system, only a few gesture classes are needed. When the gesture is structured and contains significant motion (for example, writing a big “Z” in front of the camera), it can be readily captured by images with high compression ratios. Beyond using low-resolution images, we construct each measurement as a random linear combination of pixels in a manner compatible with compressed domain signal processing. Recent developments in compressed sensing and target recognition in the compressed domain [6, 7] further improve the accuracy and energy efficiency of the overall process of data acquisition, feature extraction and recognition. We demonstrate that we can take random linear combinations of the pixel values and characterize the gesture motion directly from a few compressed measurements. Separately, energy harvested from the environment has been used in sensor networks [8, 9] with loads that demand very low power. Here we demonstrate that algorithm-hardware co-design enables smart camera front-ends with “always on” gesture detection.
The gesture motion is captured by a sequence of difference images between consecutive frames. Each difference image passes through two layers of compression, which reduce its resolution and transfer it to the compressed domain. The parameters of the motion are extracted directly in the compressed domain. Section II describes the hardware system architecture, and Section III details the algorithm. Sections IV and V present the hardware implementation and the measurement results, respectively. Conclusions are drawn in Section VI.
2 Hardware System Architecture
Before describing the proposed algorithm, we briefly describe the hardware system architecture. The proposed system consists of four main components: a PV cell array, a DC-DC converter with output voltage regulation, an MCU, and an image sensor. The block diagram of our system is shown in Fig. 2a. The PV cell converts solar energy to electrical energy. The Norton equivalent output current (Fig. 2b) of the PV cell is given by:
$$I = I_{ph} - I_0\left[\exp\left(\frac{V + I R_s}{n N_s V_t}\right) - 1\right] - \frac{V + I R_s}{R_{sh}} \qquad (1)$$
where $I$ and $V$ are the PV cell’s output current and voltage, respectively; $R_s$ and $R_{sh}$ are the series and shunt resistances; $I_0$, $V_t$, $n$, and $N_s$ are the dark saturation current, thermal voltage, diode ideality factor, and number of cells connected in series, respectively; and $I_{ph}$ is the generated current, whose magnitude depends on irradiation and temperature.
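Because the output current appears on both sides of a single-diode PV model, it has to be solved numerically at each operating voltage. The following is a minimal sketch of one way to do that, assuming the standard single-diode form with conventional symbol names. The function names and every parameter value below are hypothetical and chosen only for illustration; they are not fitted to the cells used in this work.

```python
import math


def pv_current(v, i_ph, i_0, r_s, r_sh, n, n_s, v_t=0.02585, iters=100):
    """Solve the implicit single-diode equation for the current I at voltage V >= 0.

    Assumed model:
    I = I_ph - I_0*(exp((V + I*R_s)/(n*N_s*V_t)) - 1) - (V + I*R_s)/R_sh
    """
    def residual(i):
        v_d = v + i * r_s  # voltage across the diode and shunt branch
        return i_ph - i_0 * math.expm1(v_d / (n * n_s * v_t)) - v_d / r_sh - i

    if r_s == 0.0:
        return residual(0.0)  # the equation is explicit without series resistance
    # residual(i) is strictly decreasing in i, so bisection on a sign change is safe.
    lo, hi = -v / r_s, i_ph  # residual(lo) > 0 and residual(hi) <= 0 for v >= 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def max_power_point(params, v_max, steps=500):
    """Scan 0..v_max and return (voltage, power) at the largest output power."""
    best_v, best_p = 0.0, 0.0
    for k in range(steps + 1):
        v = v_max * k / steps
        p = v * pv_current(v, **params)
        if p > best_p:
            best_v, best_p = v, p
    return best_v, best_p


# Hypothetical cell parameters (illustration only).
params = dict(i_ph=0.05, i_0=1e-9, r_s=0.5, r_sh=1000.0, n=1.5, n_s=8)
v_mpp, p_mpp = max_power_point(params, v_max=6.0)
```

Sweeping the voltage in this way yields I-V and P-V curves of the kind compared against measurement later in the paper, and the location of the power maximum relative to the open-circuit voltage is what a fractional open-circuit MPPT scheme relies on.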
As the MCU and the image sensor both require a regulated supply voltage, the DC voltage generated by the PV cells must be regulated by a DC-DC converter. For the current design, we select TI’s BQ25570EVM, a two-stage DC-DC converter with Maximum Power Point Tracking (MPPT) that harvests the solar energy and provides a regulated output supply. The block diagram of the energy harvesting system and gesture recognition flow is shown in Fig. 3.
The input image is captured by Omnivision’s OV7672 sensor with a native resolution of . We extract only the gray-scale component of the image, which reduces the computational load without any impact on performance. The output of the pixel array is passed to an on-board ADSP BF707 MCU over an I2C interface. Once the image is received by the BF707 processor, we perform the following operations: block averaging, frame differencing, random linear measurement, and motion center extraction in the compressed domain, followed by gesture recognition. For block compression we extract one out of every 16 pixel values in each row and column. Therefore the block compression factor is 256 (16 along each row and each column). The matrices for block averaging, frame differencing and dynamic time warping are stored in L1 cache (requiring less than 128KB). The motion center extraction coefficients are stored in external SDRAM (requiring more than 1.2MB). L1 accesses run on the core clock at 500MHz, and SDRAM accesses run on the system clock at 250MHz. The hardware is further optimized by (1) using short-integer arithmetic and (2) optimizing memory usage, which reduces total power consumption without loss of performance.
3 Gesture recognition algorithm
Our real-time gesture recognition algorithm is based on motion parameters extracted directly from the compressed domain. The starting point of our algorithm is the difference image. When a user’s hand is the only significant moving object in front of the camera, the hand region is well represented by the difference image, which then passes through two layers of compression. In the first layer, the resolution is reduced by dividing the whole image into several blocks and taking the average of each block. In the second layer, we transfer this low-resolution image to the compressed domain by taking random linear combinations of its pixels. We estimate the center of the motion directly in the compressed domain without recovering the image sequence. These motion centers are passed to a nearest neighbor (NN) classifier that uses the dynamic time warping (DTW) distance for gesture recognition. The block diagram of our system is shown in Fig. 4.
3.1 Two layers of compression
In the first compression layer, the difference image is divided evenly into blocks of size . The average of the pixel values in each block is taken, resulting in a block compressed difference image of size
We vectorize this low-resolution difference image and denote it as
. In the second layer of compression, we construct a random matrix of size . Each entry of is uniformly chosen from . The projection of the vectorized low-resolution difference image in the compressed domain is calculated as:
Each entry in is a random linear combination of all the entries in . We can write both compression layers into one linear equation:
Where is the vectorized original difference image . is the block averaging matrix of size by . Its product with forms a structured random matrix .
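The two compression layers described in this subsection can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the implementation used on the MCU: the names `B`, `phi`, `A`, `x` and `y`, the toy image and block sizes, and the choice of random ±1 entries for the random matrix are all assumptions made here.

```python
import numpy as np


def block_average_matrix(height, width, block):
    """Matrix B (N_b x N) that averages non-overlapping block x block tiles
    of a row-major vectorized height x width image."""
    assert height % block == 0 and width % block == 0
    hb, wb = height // block, width // block
    n = height * width
    rows, cols = np.divmod(np.arange(n), width)
    tile = (rows // block) * wb + (cols // block)  # tile index of every pixel
    B = np.zeros((hb * wb, n))
    B[tile, np.arange(n)] = 1.0 / block**2
    return B


def random_sign_matrix(m, n_b, rng):
    """Random matrix with entries drawn uniformly from {-1, +1} (an assumption)."""
    return rng.choice([-1.0, 1.0], size=(m, n_b))


rng = np.random.default_rng(0)
height, width, block, m = 64, 48, 16, 5  # toy sizes, not the sensor's

# Difference image between two consecutive (synthetic) frames, vectorized.
prev_frame = rng.random((height, width))
curr_frame = rng.random((height, width))
x = (curr_frame - prev_frame).ravel()

B = block_average_matrix(height, width, block)  # first layer
phi = random_sign_matrix(m, B.shape[0], rng)    # second layer
A = phi @ B                                     # both layers as one matrix
y = A @ x                                       # m compressed measurements
```

Because both layers are linear, the combined matrix `A` can be computed once offline, so each frame costs a single matrix-vector product.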
3.2 Motion center extraction in the compressed domain
In the uncompressed low-resolution domain, the hand region in the difference image can be captured by a template shown in Figure 4(c). The template (of size by ) has uniform non-zero values within the small rectangular region and is zero elsewhere. To locate the hand region, we construct a set of vectorized templates , where represents the coordinates of the center of the small rectangle, and represents different rectangle sizes. The variation in sizes is to adapt to the change of the hand size seen by the camera when users are at different locations. The center of the hand motion is extracted by solving
with high probability for some. By choosing the norm as and normalizing all the templates to have the same energy, we can further write motion center estimation as solving:
The above process represents a smashed filter operation which is akin to matched filters but in the compressed domain. Since we are not interested in reconstructing the image from compressed measurements, a compressed domain smashed filter reduces the computation by ().
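A minimal sketch of motion center extraction with a smashed filter is given below. It assumes unit-energy rectangular templates, a random ±1 measurement matrix, and a single fixed rectangle size; the function names, the sizes, and the synthetic difference image are hypothetical. The estimate is the center of the template whose compressed version has the largest inner product with the compressed measurements.

```python
import numpy as np


def rectangle_templates(hb, wb, rect_h, rect_w):
    """Vectorized unit-energy rectangle templates, one per admissible position,
    together with the (row, column) center of each rectangle."""
    templates, centers = [], []
    for top in range(hb - rect_h + 1):
        for left in range(wb - rect_w + 1):
            t = np.zeros((hb, wb))
            t[top:top + rect_h, left:left + rect_w] = 1.0
            templates.append(t.ravel() / np.linalg.norm(t))
            centers.append((top + (rect_h - 1) / 2, left + (rect_w - 1) / 2))
    return np.array(templates), np.array(centers)


def smashed_filter_center(y, compressed_templates, centers):
    """Return the center of the template that best matches the measurements y."""
    scores = compressed_templates @ y  # one inner product per template
    return centers[int(np.argmax(scores))]


rng = np.random.default_rng(1)
hb, wb, m = 12, 12, 60  # toy low-resolution size and measurement count
templates, centers = rectangle_templates(hb, wb, rect_h=4, rect_w=4)

phi = rng.choice([-1.0, 1.0], size=(m, hb * wb))  # assumed +/-1 entries
compressed_templates = templates @ phi.T          # computed once, offline

# Synthetic low-resolution difference image with motion in one rectangle.
low_res = np.zeros((hb, wb))
low_res[3:7, 5:9] = 1.0
y = phi @ low_res.ravel()

estimate = smashed_filter_center(y, compressed_templates, centers)
```

With the compressed templates precomputed, the per-frame work is one inner product of length M per template, and the image itself is never reconstructed. How small M can be before the estimate degrades is the empirical question studied in Section IV.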
3.3 Gesture Recognition
The extracted motion centers are stored in a FIFO buffer with length . We then measure the distances between the latest data in the buffer and the training samples. As gestures can be performed at different speeds, the sequences of the gestures’ motion centers have different lengths. We implement the DTW algorithm [11], which automatically extracts the best-matching segments of two sequences, aligns them to the same length, and then computes the distance between them. Once the smallest distance passes a threshold, we assign the gesture to the class that has its nearest sample. Algorithm performance on the proposed hardware will be described in Section V.
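The classification step can be sketched as a nearest-neighbor search under a DTW distance. This is a simplified illustration, not the deployed code: it uses the classic full-sequence DTW recurrence rather than a variant that extracts best-matching segments, it assumes a gesture is accepted when the smallest distance falls below the threshold, and the function names, labels and sample trajectories are hypothetical.

```python
import numpy as np


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two sequences of 2-D points."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local point-to-point cost
            cost[i, j] = d + min(cost[i - 1, j],      # repeat b[j-1]
                                 cost[i, j - 1],      # repeat a[i-1]
                                 cost[i - 1, j - 1])  # advance both
    return cost[n, m]


def classify(query, training, threshold):
    """Return the label of the nearest training sequence, or None if the
    smallest DTW distance is above the acceptance threshold (assumed rule)."""
    best_label, best_dist = None, np.inf
    for label, sequence in training:
        dist = dtw_distance(query, sequence)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= threshold else None


# Hypothetical training trajectories of motion centers.
training = [
    ("diagonal", [(0, 0), (1, 1), (2, 2)]),
    ("horizontal", [(0, 0), (1, 0), (2, 0)]),
]
# The same diagonal stroke performed more slowly (repeated centers).
slow_diagonal = [(0, 0), (0, 0), (1, 1), (2, 2), (2, 2)]
label = classify(slow_diagonal, training, threshold=1.0)
```

The warping step is what makes the distance insensitive to gesture speed: the slowed-down stroke above aligns with its training sample even though the two sequences have different lengths.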
4 Experimental setup
4.1 Power Management Design
The overall platform is built from COTS components, and we explain the design choices here. We chose the Omnivision OV7672 image sensor, which has a frame size of pixels. The image sensor is connected to an ADSP BF707 processor over an I2C interface. Measurements reveal a maximum current consumption of at fixed power supply.
The solar cell (AM5907) produces an output voltage of 5V at the point of maximum power transfer. The I-V and P-V characteristics of each cell are shown in Fig. 5(a) and Fig. 5(b) alongside simulation results. We see a close match between the experimental results and the empirically fitted Eqn. 1. We note that for an irradiance of , the maximum power . In the current setup, we use 6 PV cells in parallel to generate the power that the load demands. Also, from Figure 5(b), we observe that the operating voltage at the maximum power point is approximately of the open circuit voltage (). Hence Maximum Power Point Tracking (MPPT) is achieved by regulating the output at of .
As shown in Fig. 3, the MPPT block samples the open circuit voltage every 16 seconds with on and off. This sampled voltage is sent to the boost controller to modulate the phase and frequency of the boost converter so that the PV cell operates at the maximum power point, of . The sampling process is shown in the oscilloscope captures in Fig. 6(a). It is observed that the open circuit voltage is sampled and the PV cell’s operating voltage changes accordingly. The energy is stored in a super-capacitor between the two converter stages. The super-capacitor benefits camera-based applications, whose power requirement fluctuates significantly. The output voltage is sensed and fed back to the buck controller to regulate the output voltage. The output voltage is hardware programmable through external resistors on the board. Fig. 6(b) shows how varies with varying irradiance and load current conditions. The measured oscilloscope capture also reveals that is well regulated under such dynamic conditions. The complete experimental setup along with the PV cells and the MCU is shown in Fig. 8.
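The periodic open-circuit sampling described in this subsection is a form of fractional open-circuit-voltage MPPT. A behavioral sketch is shown below, purely to illustrate the control idea. The class and method names are hypothetical, the `ratio` argument stands in for the fraction of the open circuit voltage at which the cell is regulated, and the numeric values in the usage lines are placeholders rather than measured settings. On the actual board this function is performed by the converter hardware, not by software.

```python
class FractionalVocMppt:
    """Hold the PV operating-voltage reference at a fixed fraction of the most
    recently sampled open-circuit voltage, resampling at a fixed period."""

    def __init__(self, ratio, sample_period_s=16.0):
        if not 0.0 < ratio < 1.0:
            raise ValueError("ratio must lie strictly between 0 and 1")
        self.ratio = ratio
        self.sample_period_s = sample_period_s
        self.v_ref = None
        self._last_sample_t = None

    def update(self, t, read_open_circuit_voltage):
        """Return the reference voltage at time t (seconds).

        read_open_circuit_voltage is a callable that briefly unloads the cell
        and returns its open-circuit voltage; it is only invoked when a new
        sample is due.
        """
        due = (self._last_sample_t is None
               or t - self._last_sample_t >= self.sample_period_s)
        if due:
            self.v_ref = self.ratio * read_open_circuit_voltage()
            self._last_sample_t = t
        return self.v_ref


# Placeholder ratio and voltages, for illustration only.
tracker = FractionalVocMppt(ratio=0.75)
v_ref_start = tracker.update(0.0, lambda: 6.0)   # first sample
v_ref_held = tracker.update(5.0, lambda: 8.0)    # too early, reference is held
v_ref_next = tracker.update(16.0, lambda: 8.0)   # next sample is due
```

Between samples the reference is held, which is why a change in irradiance only shows up in the operating voltage after the next sampling instant.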
4.2 Mapping Proposed Gesture Recognition Algorithm on Low-Power MCU
The image sensor output at full resolution is captured by the MCU. The MCU performs compression on each difference image. In the block compression layer, we choose blocks of size 16 × 16 (Section II); and hence the vectorized low-resolution difference image . The compression rate of this layer is thus 256. In the random projection layer, the number of compressed measurements M is a design variable. To gain better insight into the choice of M, we explore its relationship with the accuracy of motion center extraction. For a typical gesture "Z", the extraction algorithm is shown in Figure 9a. The motion centers are extracted from the block-averaged difference images by solving equation (5). As we can see in Figure 9b, the three segments of the gesture are clearly distinguished on the path of the motion centers. With , the motion centers are extracted in the compressed domain by solving equation (7), and are plotted in Figure 9c. The similarity between this plot and Figure 9b demonstrates the effectiveness of the theory. For each value of M we calculate the average motion center error per frame in the compressed domain. The “L” shape of the curve indicates that is the threshold for nearly error-free motion parameter estimation, granting us another factor of 5 in compression rate. This “threshold” behavior is consistent with the classic results from compressed sensing presented in [10, 6, 7]. Accurate motion center extraction in the compressed domain provides the foundation for preserving high recognition accuracy.
To reduce memory usage and power consumption, we fix the size of the smashed filter templates (Figure 4(c)) to . In other words, we construct with fixed to and being every possible location in the block-averaged difference image. Using the same , we transfer all the templates into the compressed domain by calculating .
As a proof of concept, we tested the system with a variety of key gestures. In the rest of the paper, we discuss an implementation that recognizes three gesture classes: "X", "+", and "Z". For the usage model where the key gestures are used for “wake up”, a small number of gesture classes suffices. In each class, we provide training examples. In each training example, the gesture is performed at a different location with respect to the camera, and the motion centers are extracted in the uncompressed domain by solving equation (5). For low-power operation and to enable a completely self-powered system, the image sensor is operated at a maximum of frames/second and the buffer length is set to .
5 Measurement Results
5.1 M vs. Recognition Rate and Power Consumption
For different numbers of compressed measurements, we measure the energy consumption per frame and the recognition rate. We evaluate 20 gestures of each class, and the recognition rate is calculated from the total number of correctly recognized gestures. The total time per gesture is kept at 0.1 s. In a typical instance, the motion centers for a gesture “Z” as extracted from the hardware are shown in Figure 10. Comparison with Figure 9 reveals a close match between simulation and measurement. Figure 10(a) shows the measured design space exploration. We measure the recognition accuracy as a function of M, which reveals an accuracy rate of for , closely matching the simulation results described in the previous section. Figure 10(b) shows the dependence of the power consumed by the MCU and the corresponding recognition accuracy of the proposed system on the frame rate. We note that a minimum frame rate of 5fps is required to maintain a desired recognition accuracy of 84%. As the frame rate increases, the power consumption also increases, showing a graceful trade-off between accuracy and power consumed. Figure 10(c) illustrates the efficiency of the power management system as the irradiance of the incident light is varied. The corresponding power consumed and the maximum frame rate that can be supported are also shown. It can be noted that for an irradiance of (typical for outdoor sensors) a frame rate of 10fps and a recognition accuracy of are achieved.
As the environmental conditions and irradiance levels change, the proposed system can scale the frame rate accordingly, which gracefully trades off recognition accuracy. Figure 12 illustrates the recognition accuracy for different gestures as a function of irradiance.
5.2 Multi-class Recognition Accuracy
At , the recognition accuracies of the 3 gesture classes are shown in Table 1. It can be seen that the recognition accuracy depends on the complexity of the gesture. For a simple gesture, e.g., “+”, a peak accuracy of is measured in a system powered entirely by harvested solar energy. A comparison of the proposed system with competing hardware-based motion and gesture detection [2, 3, 12, 13] is shown in Table 2. The proposed system demonstrates more than improvement in energy/frame over reported works for detecting “wake up” gestures. This enables a fully self-powered “always on” camera front-end.
6 Conclusions
This paper presents a solar-powered, “always on” gesture recognition system that provides a trigger for system “wake up”. The major power savings in our system come from the two layers of compression, which reduce the resolution of the image sensor by a factor of more than [ by block averaging and by random compressive measurements]. The block compression layer preserves the geometric information of the gesture, and the random projection layer preserves the motion parameters. These two properties are the keys to maintaining a high recognition rate in the compressed domain. Further, hardware-algorithm co-design allows energy-efficient mapping of the recognition algorithm onto a low-power MCU that is powered by a solar-fed DC-DC converter and regulator with MPPT. The system demonstrates an average recognition accuracy of while consuming less than .
This work was funded in part by Intel Corp. and the NSF CRII Award 1464353.
- [1] W. Ran et al., “Real-time visual static hand gesture recognition system and its FPGA based hardware implementation,” ICSP, no. 1, 434-439, 2014.
- [2] D. Lee and Y. Park, “Vision-based remote control system by motion detection and open finger counting,” IEEE Trans. on Consumer Electronics, no. 1, 2308-2313, 2009.
- [3] Chao-Tang Li and Wen-Hui Chen, “A novel FPGA-based hand gesture recognition system,” no. 1, 221-229, 2012.
- [4] Rautaray et al., “Vision based hand gesture recognition for human computer interaction: a survey,” Artificial Intelligence Review, vol. 43, no. 1, 1-54, 2015.
- [5] Pavlovic et al., “Visual interpretation of hand gestures for human-computer interaction: A review,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, 677-695, 1997.
- [6] M. A. Davenport et al., “The smashed filter for compressive classification and target recognition,” in Electronic Imaging, 64980-64994, ISOP, 2007.
- [7] Mantzel et al., “Compressive matched-field processing,” The Journal of the Acoustical Society of America, vol. 132, no. 1, 90-102, 2012.
- [8] Y. Zhang et al., “A batteryless 19 µW MICS/ISM-band energy harvesting body sensor node SoC for ExG applications,” IEEE JSSC, vol. 48, no. 1, 199-213, 2013.
- [9] X. Liu et al., “A highly efficient ultralow photovoltaic power harvesting system with MPPT for internet of things smart nodes,” IEEE TVLSI, vol. 23, no. 12, 3065-3075, 2015.
- [10] R. G. Baraniuk et al., “Random projections of smooth manifolds,” Foundations of Computational Mathematics, vol. 9, no. 1, 51-77, 2009.
- [11] M. Muller, “Dynamic time warping,” Information Retrieval for Music and Motion, 69-84, 2007.
- [12] Y. Shi and T. Tsui, “An FPGA-based smart camera for gesture recognition in HCI applications,” Asian Conference on Computer Vision, no. 1, 718-727, 2007.
- [13] S. J. Desai et al., “An ultra-low power, always-on camera front-end for posture detection in body worn cameras using restricted Boltzman machines,” IEEE TMSCS, no. 4, 187-194, 2015.