Over the past decades, ornithologists have sought efficient ways to count the birds in the open sky over a particular region. Accurate bird counts help scientists estimate bird populations for conservation purposes and determine the magnitude of seasonal migration. However, counting is a daunting task because the number of birds in a single image can be overwhelming. In addition, public samples with accurate bird-count labels are scarce. To cope with the scarcity of training data, in this work we design an efficient method of training on synthetic datasets and use the trained model to count birds in real settings. We also apply model compression techniques to make the trained model more space efficient and faster at inference, since the target computation device is a mobile phone.
The system would be most useful for ornithologists who count large bird flocks. The particular setup we aim to tackle is high-resolution images of birds in the open sky. The camera is set up on the ground, facing the open sky. This setting keeps the camera relatively far from the bird flocks, so the number of birds in an image can be large (on the order of thousands). The images are taken under a variety of weather conditions. Since bird counting is often done in the field, the internet connection can be weak and computational resources are scarce. Therefore, we perform model compression to reduce computational, power, and storage costs.
This setup provides a reasonable standardization to create a bird counting system. In this paper, we describe the key pipelines to build an efficient bird counting system that can achieve real time performance with high accuracy. We perform various experiments to show the accuracy achieved by our model and various system metrics including storage cost and inference speed.
Our main contributions are: 1) we design a pipeline for training on a synthetic dataset for the counting problem; 2) we create several bird counting datasets and manually label a real-world dataset; 3) we apply U-Net to perform density map estimation on the synthetic dataset; 4) we apply model compression techniques to the counting network (U-Net architecture) to significantly reduce storage cost; 5) we design flexible U-Net parameters to adjust the model parameter size.
2 Synthetic Dataset Generation
For the open-sky bird counting task, the number of birds can be overwhelming, since thousands of birds might be present in a high-resolution image. Manually labeling enough images to form a sufficient training dataset with high labeling accuracy is costly. Inspired by work on training crowd counting networks on a dataset created from GTA5 (Yuan, 2019), we design a system that generates bird flock images from selected canonical bird crops. The current bird crops are taken from various images on the web, including bird shapes from cartoons.
Random sample generation from canonical shapes. For the training dataset, we select 50 canonical bird shapes from a variety of sources, representing numerous bird shapes. The crops are then binarized with a threshold so that the background of the bird images can be eliminated. We set the alpha channel of the background to 0 so that birds can overlap cleanly when they are pasted onto the canvas. The purpose is to simulate what we would observe in real-world samples. The resolution of the training samples is set to 224x224, and all canonical bird samples are resized to 25x25.
The generation of a single image depends on the following randomly sampled parameters: 1) the number of birds, which we sample from the range 0 to 50 in our implementation; 2) the scale of each bird, which we select to be roughly 0.5 to 2 times the size of the canonical bird samples; 3) the rotation of each bird; 4) the noise level of the image.
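As a rough illustration of how these sampled parameters could drive the generation of one image, the following sketch draws a bird count, a scale, a rotation, and a location for each bird and pastes binary sprites onto a 224x224 canvas. The function names, the uniform sampling distributions, the white-on-black polarity (bird pixels set to 255), and the nearest-neighbour resampling are assumptions made for illustration, not details of the actual implementation.

```python
import numpy as np

def transform_sprite(mask, scale, angle_deg):
    """Scale and rotate a boolean sprite mask with nearest-neighbour sampling."""
    h, w = mask.shape
    # Output side large enough to hold the scaled sprite at any rotation.
    out = max(1, int(round(max(h, w) * scale * np.sqrt(2))))
    c = (out - 1) / 2.0
    ys, xs = np.mgrid[0:out, 0:out]
    t = np.deg2rad(angle_deg)
    dx, dy = xs - c, ys - c
    # Inverse mapping: for each output pixel, find the source pixel.
    sx = (np.cos(t) * dx + np.sin(t) * dy) / scale + (w - 1) / 2.0
    sy = (-np.sin(t) * dx + np.cos(t) * dy) / scale + (h - 1) / 2.0
    sxi, syi = np.rint(sx).astype(int), np.rint(sy).astype(int)
    valid = (sxi >= 0) & (sxi < w) & (syi >= 0) & (syi < h)
    result = np.zeros((out, out), dtype=bool)
    result[valid] = mask[syi[valid], sxi[valid]]
    return result

def paste(canvas, sprite, cx, cy):
    """Paste a boolean sprite centred at (cx, cy), clipping at the canvas border."""
    sh, sw = sprite.shape
    y0, x0 = cy - sh // 2, cx - sw // 2
    ya, xa = max(y0, 0), max(x0, 0)
    yb, xb = min(y0 + sh, canvas.shape[0]), min(x0 + sw, canvas.shape[1])
    if ya >= yb or xa >= xb:
        return
    sub = sprite[ya - y0:yb - y0, xa - x0:xb - x0]
    # Background sprite pixels are transparent, so overlapping birds simply merge.
    canvas[ya:yb, xa:xb][sub] = 255

def generate_sample(sprites, rng, size=224, max_birds=50, scale_range=(0.5, 2.0)):
    """Return one synthetic image and the list of bird centres (x, y)."""
    n = int(rng.integers(0, max_birds + 1))
    canvas = np.zeros((size, size), dtype=np.uint8)
    centers = []
    for _ in range(n):
        sprite = sprites[int(rng.integers(len(sprites)))]
        scale = rng.uniform(scale_range[0], scale_range[1])
        angle = rng.uniform(0.0, 360.0)
        cx, cy = (int(v) for v in rng.integers(0, size, size=2))
        paste(canvas, transform_sprite(sprite, scale, angle), cx, cy)
        centers.append((cx, cy))
    return canvas, centers
```

Because the bird centres are returned alongside the image, the exact count and the bird positions are available as labels at no extra cost, which is the main appeal of synthetic generation.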
With this generation pipeline, the location of each bird is randomly sampled from the 224x224 background and the bird is pasted to the canvas. We generate 1600 samples for the training dataset and 400 for the validation dataset. Note that with the synthetic
Injection of noise to samples.
In real-world settings, the weather can be unstable, introducing different levels of noise. When the high-resolution image is converted into a binary image for prediction, a large amount of noise can be introduced. The dark corners caused by the camera lens can also degrade sample quality. To make the model more resilient to noise, we inject salt-and-pepper noise into the training samples. Salt-and-pepper noise is used because its effect looks similar to what we observe in the real dataset. The noise is injected as follows: for each pixel, with probability p the pixel is set to 255; otherwise, the pixel is left unchanged.
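The per-pixel rule described above can be sketched in a few lines of NumPy. The function name `inject_salt_noise` and the use of a NumPy random generator are assumptions for illustration.

```python
import numpy as np

def inject_salt_noise(image, p, rng):
    """Set each pixel to 255 independently with probability p; leave the rest unchanged."""
    noisy = image.copy()
    noisy[rng.random(image.shape) < p] = 255
    return noisy
```

Sampling a different value of p for each training image would expose the model to a range of noise levels.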
Density map prediction. Many crowd counting networks use a density map as the training label, since it contains more information than the count value alone. MCNN (Zhang, 2016), trained with density maps, achieves state-of-the-art performance on crowd counting compared with other architectures. In our system, we also use density maps to train the model. With synthetic data generation, each bird position is known, so it is straightforward to obtain an accurate density map. The density map is created such that if there is a bird at pixel $x_i$, we use a delta function $\delta(x - x_i)$ to represent it. The density function with $N$ birds can then be represented by $H(x) = \sum_{i=1}^{N} \delta(x - x_i)$. To convert this into a continuous density function, we convolve it with a Gaussian kernel $G_\sigma$, similar to (Zhang, 2016). The continuous density map is then $F(x) = H(x) * G_\sigma(x)$. With this function applied to the image, we can simply integrate over the density map to obtain the number of birds in the image. Besides providing spatial information, another advantage of the density map is that it handles birds positioned on a boundary: the density value is directly proportional to the part of the bird that lies inside the image.
In our implementation, we set a value of 100 at the center of each bird location after scaling and rotation. We then convolve this sparse matrix with a Gaussian kernel to generate the density map. The standard deviation used is 27 for each bird. Note that because we use a value larger than 1 to avoid numerical imprecision, we need to re-normalize the density map at inference time.
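A minimal NumPy sketch of this construction is given below. It places the value 100 at each bird centre, blurs the sparse matrix with a normalized Gaussian (standard deviation 27), and recovers the count by summing the map and dividing by 100. The separable convolution, the zero padding at the image border, the kernel truncation at four standard deviations, and the function names are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel_1d(sigma, truncate=4.0):
    """Normalized 1-D Gaussian kernel truncated at `truncate` standard deviations."""
    radius = int(truncate * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def density_map(centers, size=224, sigma=27.0, peak=100.0):
    """Build a density map from bird centres given as (x, y) pixel coordinates."""
    sparse = np.zeros((size, size), dtype=np.float64)
    for cx, cy in centers:
        sparse[cy, cx] += peak
    k = gaussian_kernel_1d(sigma)
    radius = len(k) // 2

    def blur(line):
        # Full convolution cropped back to the input length (zero padding).
        return np.convolve(line, k, mode="full")[radius:radius + size]

    rows = np.apply_along_axis(blur, 1, sparse)
    return np.apply_along_axis(blur, 0, rows)

def count_from_density(density, peak=100.0):
    """Re-normalize the density map and integrate it to obtain the bird count."""
    return float(density.sum() / peak)
```

With zero padding, a bird whose centre lies near the border contributes less than one to the integrated count, because part of its Gaussian mass falls outside the image.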
3 Bird Counting System Pipeline
3.1 Image Preprocessing
For the input high-resolution image (7952×5304 pixels), the pre-processing pipeline involves the following steps: 1) Binarize the image with a predefined threshold; this is done in OpenCV with a threshold pixel value of 127. 2) Run a noise reduction method based on connected components, which removes isolated pixel-level noise in the original image. 3) Divide the image into patches of size 224x224 using a non-overlapping sliding window. For the given input resolution, 864 patches are created. The trained model then performs counting on each patch and outputs the total count.
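The thresholding and patching steps can be sketched as follows. The sketch uses NumPy in place of OpenCV so that it is self-contained, skips the connected-component noise reduction, and assumes that partial patches at the right and bottom edges are zero-padded; the text does not specify how edge patches are handled, so the padding and the function names are illustrative assumptions.

```python
import numpy as np

def binarize(gray, threshold=127):
    """Pixels strictly above the threshold become 255, all others 0."""
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

def patch_grid(height, width, patch=224):
    """Number of patch rows and columns, rounding up to cover the whole image."""
    return -(-height // patch), -(-width // patch)

def split_into_patches(binary, patch=224):
    """Split a 2-D image into non-overlapping patch x patch tiles (zero-padded)."""
    h, w = binary.shape
    rows, cols = patch_grid(h, w, patch)
    padded = np.zeros((rows * patch, cols * patch), dtype=binary.dtype)
    padded[:h, :w] = binary
    tiles = padded.reshape(rows, patch, cols, patch).swapaxes(1, 2)
    return tiles.reshape(-1, patch, patch)

# A 7952 x 5304 image gives a 24 x 36 grid, i.e. 864 patches.
rows, cols = patch_grid(5304, 7952)
print(rows, cols, rows * cols)
```

Rounding up in both dimensions reproduces the 864 patches quoted above, since 7952 / 224 and 5304 / 224 round up to 36 and 24.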
3.2 U-Net For Density Map Estimation
Model Selection. We considered multiple network architectures for the task of bird counting. The main considerations besides validation accuracy are 1) an efficient training process, 2) the generalization ability of the model, 3) the size of the trained model, and 4) the average inference time of the model. We therefore experimented with a variety of network architectures, including 1) ResNet18, 2) DenseNet161, 3) MobileNet, 4) ShuffleNet, 5) SqueezeNet, 6) FCRN, and 7) U-Net. For architectures 1) through 5), the problem is formulated as a regression on the count. For 6) and 7), we use the density map to train and test. We found that U-Net achieves the best accuracy in the current setup. Although efficient model architectures such as MobileNetV2 and ShuffleNet are fast at inference and cheap to store, these compact models do not produce accurate predictions on the test set and fail to generalize to the real dataset.
U-Net. The U-Net architecture was proposed in (O. Ronneberger, ). It consists of a contracting path that captures context and a symmetric expanding path that enables precise localization. U-Net can be trained end-to-end from very few images and achieves competitive results on a variety of tasks. We adopt U-Net for density map estimation because of the natural symmetry of the problem. We further optimize the architecture in the compression section.
Model Training. The bird counting problem is formulated as a regression task with respect to the density map. Given image samples $X = \{x_i\}_{i=1}^{n}$ and density maps $Y = \{y_i\}_{i=1}^{n}$, we build a model $f_\theta$ such that the loss function $L(f_\theta(X), Y)$ is minimized. We use the MSE loss to estimate the deviation of the output from the ground-truth label. The MSE loss is defined as $L = \frac{1}{nP} \sum_{i=1}^{n} \lVert f_\theta(x_i) - y_i \rVert_2^2$, where $P$ is the number of pixels in a density map. The optimizer is Adam with a learning rate of 0.0001. We notice that a smaller-than-usual learning rate is important when training for density map estimation. We hypothesize that since the density map contains 50,176 (224×224) entries and most entries are tiny due to the convolution with the Gaussian kernel, a smaller learning rate is required for stable training and convergence. We also schedule a learning rate decay after a certain number of epochs, and we noticed that the decreasing learning rate aids the training process. We train the model on the synthetic dataset of 2000 samples for 100 epochs. The best accuracy and training loss are summarized in the experiments section.
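To make the training objective concrete, the sketch below computes a mean squared error between predicted and ground-truth density maps and converts a predicted map back into a count. Averaging over all samples and pixels (the default reduction of PyTorch's `MSELoss`) and the peak value of 100 used for re-normalization are assumptions here; the helper names are hypothetical.

```python
import numpy as np

def density_mse(pred, target):
    """Mean squared error averaged over all samples and all pixels."""
    return float(np.mean((pred - target) ** 2))

def predicted_count(density, peak=100.0):
    """Integrate one predicted density map and undo the peak scaling."""
    return float(density.sum() / peak)

def count_mse(pred_maps, true_counts, peak=100.0):
    """Mean squared error between predicted and true per-image counts."""
    pred = np.array([predicted_count(d, peak) for d in pred_maps])
    return float(np.mean((pred - np.asarray(true_counts, dtype=float)) ** 2))
```

The per-pixel loss drives training, while the count-level error is the quantity of practical interest when evaluating a counting system.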
Negative Samples Training. In real-world settings, most of the area of a high-resolution image is sky, so many patches contain zero birds. We want the model to be more resilient to noise in such patches. Therefore, we additionally train the model with negative samples. We noticed that this optimization slightly improves the accuracy of the model on the test dataset. On the real dataset, however, the gain in accuracy from this method is minimal.
4 Model Compression
The setup we consider is one in which ornithologists set up the camera in the field and continuously predict the bird count. Our system must therefore cope with limited computational resources and memory. The two main system considerations are 1) model storage cost and 2) inference speed. Because the input image is high resolution and is divided into patches on the fly, our model has to achieve high throughput by reducing the inference time required. We also need to consider the storage and computational constraints of mobile devices. Model compression techniques are therefore particularly useful. We adopt the deep compression framework from (S. Han, 2015) and perform weight pruning, quantization, weight sharing, and Huffman coding on the current system.
Besides model compression techniques, we have also experimented with efficient model architectures such as MobileNet and ShuffleNet. Although these two models train and run inference quickly, their training fails to converge to high accuracy. However, in the same spirit of boosting inference speed, we also experiment with modifying the U-Net architecture to achieve significant speedup and a reduction in storage cost.
4.1 Weight Pruning and Quantization
Neural networks are known to be over-parameterized. Weight pruning removes redundant entries by setting a threshold such that weights below the threshold are removed. This reduces storage cost and lets the model take advantage of sparse matrix structure. For the trained U-Net model, we apply pruning to remove weights that are below a certain threshold. Note that the same threshold is applied to all layers. Other methods suggest pruning the weights separately for each module of the network and pruning only a set percentage of the parameters. Pruning is implemented in PyTorch by adding a mask, which lets us assess the performance of the pruned model. We continue training the model after it is pruned. The accuracy of the pruned model is reported in the experiment section.
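The mask-based pruning described above can be illustrated with plain NumPy arrays standing in for the layer weights. Comparing the absolute value of each weight against one global threshold is an assumption (it is the usual magnitude-pruning convention), and the dictionary layout and function names are hypothetical.

```python
import numpy as np

def prune_by_threshold(weights, threshold):
    """Zero out weights whose magnitude is below one global threshold.

    `weights` maps layer names to arrays. Returns the pruned weights and the
    boolean masks, which can be re-applied after every update during retraining
    so that pruned weights stay at zero.
    """
    masks = {name: np.abs(w) >= threshold for name, w in weights.items()}
    pruned = {name: w * masks[name] for name, w in weights.items()}
    return pruned, masks

def sparsity(masks):
    """Fraction of weights removed across all layers."""
    total = sum(m.size for m in masks.values())
    kept = sum(int(m.sum()) for m in masks.values())
    return 1.0 - kept / total
```

Keeping the mask separate from the weights makes it easy to measure accuracy after pruning, but it does not by itself shrink the stored tensors or speed up inference; that requires a sparse storage format or sparse kernels.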
4.2 Weight Sharing
We use weight sharing to reduce the number of distinct weight values. This technique is from (Dally, 2015). We use 5 bits of precision, so that the number of distinct weights is effectively 32. Each weight center can be stored as a 2-byte value (float16), so the codebook of shared weights occupies a constant 64 bytes. In this setup, the index matrix can be stored with the int8 data type, which reduces the original weight matrix size by approximately 4 times. We can further leverage sparse matrix formats such as CSC and CSR to reduce the storage cost. The weight centers are selected with the K-means algorithm so as to minimize the MSE distance between the original weight matrix and the weight matrix after weight sharing. We use the K-means clustering implementation from scikit-learn. After weight sharing, we continue training the model with RMSProp for only a few iterations to recover some of the lost accuracy.
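The storage arithmetic above can be checked with a small sketch. To stay self-contained it uses a simple one-dimensional K-means loop written in NumPy as a stand-in for the scikit-learn implementation; the linear initialization of the centers, the iteration count, and the function names are assumptions for illustration.

```python
import numpy as np

def share_weights(w, bits=5, iters=20):
    """Cluster the weights of one layer into 2**bits shared values.

    Returns a float16 codebook and an int8 index array with the shape of `w`.
    """
    k = 2 ** bits
    flat = w.ravel().astype(np.float64)
    centers = np.linspace(flat.min(), flat.max(), k)  # linear initialization
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            members = flat[idx == j]
            if members.size:
                centers[j] = members.mean()
    idx = np.abs(flat[:, None] - centers[None, :]).argmin(axis=1)
    return centers.astype(np.float16), idx.astype(np.int8).reshape(w.shape)

def reconstruct(codebook, idx):
    """Look up each index in the codebook to rebuild a float32 weight array."""
    return codebook[idx].astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
codebook, idx = share_weights(w)
print(codebook.nbytes)           # 32 centers x 2 bytes
print(w.nbytes // idx.nbytes)    # float32 weights vs int8 indices
```

The 4x figure comes from replacing each 4-byte float32 weight with a 1-byte index; packing the 5-bit indices more tightly than one byte each would reduce the size further.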
4.3 Huffman Coding
Huffman coding is a lossless data compression algorithm that assigns a variable-length code to each input symbol. The code length depends on how frequently the symbol occurs: the most frequent symbols receive the shortest codes and the least frequent symbols receive the longest. We apply Huffman coding after weight pruning to reduce the storage cost of the model. However, with the current implementation, Huffman coding does not produce a significant reduction in storage cost, since the SciPy sparse CSC and CSR matrix types require the weight shape to be two-dimensional.
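To illustrate why skewed symbol frequencies reduce storage, the sketch below computes Huffman code lengths for a sequence of symbols (for instance, shared-weight indices) using only the Python standard library. It computes code lengths and the total encoded size rather than the bit strings themselves; the function names are hypothetical.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return a dict mapping each distinct symbol to its Huffman code length."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (frequency, unique tie-breaker, {symbol: depth so far}).
    heap = [(count, i, {sym: 0}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

def encoded_bits(symbols):
    """Total number of bits needed to store `symbols` with a Huffman code."""
    lengths = huffman_code_lengths(symbols)
    freq = Counter(symbols)
    return sum(freq[s] * lengths[s] for s in freq)

indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
print(encoded_bits(indices))   # 28 bits, versus 32 bits with a fixed 2-bit code
```

The saving grows with the skew of the distribution, which is why Huffman coding is most effective after pruning, when a single symbol (the zero weight) dominates.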
4.4 Unet Fast
Following the strategy of (Howard, 2017), which introduces hyper-parameters that trade off latency against accuracy, we also seek to reduce computational and storage cost by modifying the underlying U-Net architecture. We experiment with adjusting the original U-Net architecture for this task to gain speedup. This method is easy to implement, and the actual speedup is easy to benchmark, since it only involves adding two model-shrinking parameters to the original U-Net architecture.
Training with down-sampled images. The computational bottleneck of our model lies in the series of 3x3 convolutions in the input layers, so reducing the input size helps achieve faster inference. One optimization is to sub-sample the input to the model. Instead of feeding the 224x224 input directly to the double convolution layers, we first sub-sample the original image with a max-pooling layer by a given factor and compute on the reduced sample size. In the output layer, we use an upsampling layer with bilinear interpolation and the same factor to restore the original input size. We experiment with different sub-sampling factors, and the results are recorded in the experiment section. We notice a significant speedup from this optimization.
Uniformly reducing U-Net filter counts. The current implementation of U-Net uses 64, 128, 256, 512, and 1024 filters along the contracting path and then reverses the same path with upsampling and transposed convolutional layers. We use a shrink factor to uniformly scale these filter counts, multiplying each of 64, 128, 256, 512, and 1024 by the same factor. This optimization significantly reduces the parameter count and could also lead to a 10x speedup. We experiment with adjusting the two factors (the sub-sampling factor and the shrink factor) and document the resulting speedup and storage saving.
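A back-of-the-envelope calculation shows how strongly the two factors act. The sketch below counts the parameters of the contracting path only, assuming each stage is a pair of 3x3 convolutions with bias and a single-channel input; these layer details, and the function names, are assumptions for illustration rather than the exact architecture used here.

```python
def conv3x3_params(c_in, c_out):
    """Weights plus biases of one 3x3 convolution."""
    return 3 * 3 * c_in * c_out + c_out

def double_conv_params(c_in, c_out):
    """Two stacked 3x3 convolutions, as in a typical U-Net stage."""
    return conv3x3_params(c_in, c_out) + conv3x3_params(c_out, c_out)

def encoder_params(shrink=1.0, in_channels=1, base=(64, 128, 256, 512, 1024)):
    """Parameter count of the contracting path for a given shrink factor."""
    widths = [max(1, int(c * shrink)) for c in base]
    total, c_in = 0, in_channels
    for c_out in widths:
        total += double_conv_params(c_in, c_out)
        c_in = c_out
    return total

def relative_conv_cost(subsample):
    """Convolution cost relative to full resolution when each side shrinks by `subsample`."""
    return 1.0 / (subsample ** 2)

for shrink in (1.0, 0.5, 0.25):
    print(shrink, encoder_params(shrink))
```

Because the number of weights in a convolution is proportional to the product of its input and output channel counts, halving every filter count cuts the parameters by roughly a factor of four, and sub-sampling the input by a factor of two cuts the convolution cost by a similar factor.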
5 Related Works
The works most directly linked to our bird counting system are crowd counting methods and, in terms of system optimizations, model compression techniques.
Crowd Counting. Crowd counting refers to the task of estimating the number of people in an image. The task can be challenging since crowds appear at various scales and people can overlap. There are several major approaches to this problem. Early methods (P. VIOLA and SNOW, ) use detection-based approaches to perform crowd counting. This strategy is limited in the face of dense crowds and does not generalize well when the input image has to be divided into patches. (Zhang, 2016) and (R. Bahmanyar, ) formulate crowd counting as a density map estimation problem. (Zhang, 2016) obtains state-of-the-art results on several of the most widely used datasets. To cope with the lack of accurate labels, (Yuan, 2019) uses a synthetic crowd dataset from GTA5 to build a model and then fine-tunes the model on real data to achieve impressive results.
Model Compression. Model compression refers to techniques that reduce model size by pruning connections, using low precision, sharing weights, and so on. (N and Gopalakrishnan, ) shows that it is possible to train models with 8-bit precision without compromising accuracy. The deep compression method (S. Han, 2015), which is composed of pruning, quantization, weight sharing, and Huffman coding, achieves compression and speedup on CPU without losing accuracy. In another direction, efficient model structures such as MobileNets (Howard, 2017) and ShuffleNets (Zhang, ) have been proposed. They aim to reduce parameter sizes and speed up inference by cutting down computationally intensive operations such as convolutional layers, so that the trained model can fit on mobile devices.
Synthetic test dataset. The first dataset we consider is the generated test dataset. Since its birds are generated from the same distribution as the training data, we expect high accuracy in terms of MSE in this case.
Synthetic test set with different generation samples. We also follow the same pipeline to create a synthetic dataset based on different canonical bird images. The crops are taken from real bird images, so the bird shapes are based on natural observations.
Patches from high resolution images. From real-world high-resolution bird images, we take bird flock crops and preprocess them to form the high-resolution set. There are 864 images in this set, and we manually labeled each patch with the number of birds. Note that all birds shown in the images are labelled, which might cause some problems for density map estimation. The goal of this dataset is to evaluate the generalization ability of our bird counting system. The difficulties with high-resolution images of bird flocks are that birds can overlap and that different scenes introduce noise. All images are processed with the same manually selected threshold parameter.
All data pre-processing is implemented in Python with support from OpenCV. We implemented the model and training in PyTorch. The U-Net architecture is implemented with the up-sampling and transposed convolution method. We perform experiments to benchmark the accuracy achieved on various datasets as well as model size and inference speed.
7.1 Accuracy achieved by a variety of methods
We evaluate the accuracy of different model architectures and different counting methods. We experimented with DenseNet161, ResNet18, MobileNetV2, ShuffleNet, and SqueezeNet, but we only show the results from ResNet18 and MobileNetV2 as representatives of these networks. These models are trained in a direct regression setting, in which the model directly estimates the output count. We use the models directly from the torchvision package. All models are trained with the Adam optimizer and a small learning rate. We found that these methods do not generalize well to the real-world dataset. Training on the direct count also cannot solve the problem of double counting at patch boundaries. Therefore, we use the density map estimation method trained with U-Net, which generalizes much better. The MSE achieved by the baseline U-Net model is 12.4 on the real-world dataset.
7.2 Inference Time and Storage Cost
We measure the time each network takes to complete a single inference task. The input consists of images of size 224x224. Since the input is batched, we measure the inference time for predicting a single batch. We also measure the time it takes to predict a single patch (batch size = 1). Each input patch has a size of 0.6MB.
Inference Speed. Inference speed is important for achieving a real-time bird counting system. We first provide an estimate for the system pipeline in our setup. Given the typical data transmission speeds of USB-1 and USB-2, even with USB-1 an image can be received by the computational node in 4 seconds. Data pre-processing takes 5 seconds for an entire high-resolution input image, which can be reduced significantly with parallelism and software optimization. Therefore, the computational bottleneck is the model prediction.
To closely simulate the performance on a mobile phone or edge device, we conduct our experiments on the CPU instead of the GPU to estimate the time required. The time to predict is about 0.5 seconds for the vanilla U-Net model on a batch size of 32 patches. It takes approximately 610 seconds to finish counting all 864 patches. With ShuffleNet, in contrast, it takes 0.08 seconds to predict on one batch and 19 seconds to finish processing one whole high-resolution image.
However, with the current implementation of deep compression in PyTorch, it is hard to measure the actual speedup, since 1) PyTorch does not currently support quantization, and half-tensor operations are not supported for operations such as Conv2d; 2) model pruning and weight sharing are implemented with an additional data mask and index map to be more compatible with the current workflow. As a result, even though accuracy can be measured, the actual speedup is hard to observe. The experiments reported for the deep compression method show that the speedup on CPU after pruning can be 2x-5x for various models, so we believe a similar improvement can be achieved in our setup. In Table 4, we provide the benchmarked speedup obtained by changing the U-Net architecture to the unet-fast architecture.
Pruning, Weight Sharing and Quantization. We apply pruning, weight sharing, and quantization to the trained U-Net model. Weight sharing is implemented with K-means on the current weight values. The maximum precision of the weights is set to 16-bit floating point. We estimate the model size after pruning and quantization. After applying the above steps, the model parameter size is reduced from 55MB to 13MB. We then apply Huffman coding to further reduce the model size to 11.7MB. The result is summarized in the table. Finally, we perform several epochs of training with the updated model to recover some of the lost accuracy.
8 Remarks and Future Directions
In the experiment section, we see that unet-fast offers a convenient way to achieve inference speedup and storage saving. It is easy to implement and can be tested thoroughly. In comparison, even though model compression techniques can reduce storage cost significantly, they do not eliminate the FLOPs of the costly convolutional operations. Even in a low-precision setting, an efficient implementation remains difficult, since it involves changing the underlying model update in the PyTorch package. Thus, model compression techniques may not be the optimal solution for our bird counting system. From a practical standpoint, in our final bird counting system, we would adopt the strategy of using a fine-tuned unet-fast model to perform real-time inference.
In this paper, we propose an efficient bird counting system for ornithologists on mobile devices that leverages synthetic dataset training and model compression techniques to achieve fast inference and storage cost saving. We describe the steps for generating a high-quality synthetic dataset and use density map estimation to achieve good generalization ability. We reach an MSE of 12.4 on the real dataset and show that training a bird counting system on a synthetic dataset is achievable. We also manually label a real-world bird dataset consisting of 864 pictures with 9000 birds in total to test our results. We perform model compression on the U-Net model to reduce storage cost and increase inference speed. Lastly, we design hyperparameters that efficiently reduce the storage and computational cost of the original U-Net architecture, and we observe significant speedup and storage saving. This method can be further explored to build a real-time bird counting system.
- Compression of neural machine translation models via pruning. Computer Science Department, Stanford University, Stanford, CA 94305.
- Learning both weights and connections for efficient neural networks. NIPS 2015.
- Deep learning on mobile devices – a review. SPIE Defense + Commercial Sensing, Invited Paper. April 2019, Baltimore, MD.
- MobileNets: efficient convolutional neural networks for mobile vision applications. Computer Vision and Pattern Recognition.
- AI benchmark: running deep neural networks on android smartphones. Shanghaitech University, Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV).
- W-net: reinforced u-net for density map estimation. Computer Vision and Pattern Recognition.
- Pruning convolutional neural networks for resource efficient inference. ICLR 2017.
- Training deep neural networks with 8-bit floating point numbers. IBM T. J. Watson Research Center.
- U-net: convolutional networks for biomedical image segmentation. Mitsubishi Electric Research Laboratories.
- Detecting pedestrians using patterns of motion and appearance. Computer Vision and Pattern Recognition.
- MRCNet: crowd counting and density map estimation in aerial and ground imagery. Computer Vision and Pattern Recognition.
- Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. ICLR 2016.
- Learning from synthetic data for crowd counting in the wild. Computer Vision and Pattern Recognition.
- ShuffleNet: an extremely efficient convolutional neural network for mobile devices. Computer Vision and Pattern Recognition.
- Single-image crowd counting via multi-column convolutional neural network. CVPR, Shanghaitech University.