Counting objects in digital images is a process that should be replaced by machines. This tedious task is time consuming and prone to errors due to fatigue of human annotators. The goal is to have a system that takes as input an image and returns a count of the objects inside and a justification for the prediction in the form of object localization. We repose a problem, originally posed by Lempitsky and Zisserman, to instead predict a count map which contains redundant counts based on the receptive field of a smaller regression network. The regression network predicts a count of the objects that exist inside this frame. By processing the image in a fully convolutional way, each pixel is accounted for some number of times: the number of windows which include it, which is the size of each window (i.e., 32x32 = 1024). To recover the true count we take the average over the redundant predictions. Our contribution is redundant counting instead of predicting a density map in order to average over errors. We also propose a novel deep neural network architecture adapted from the Inception family of networks called the Count-ception network. Together our approach results in a 20% relative improvement over the state-of-the-art method by Xie, Noble, and Zisserman in 2016.
Counting objects in digital images is a process that is time consuming and prone to errors due to fatigue of human annotators. The goal of this research area is to have a system that takes as input an image and returns a count of the objects inside and justification for the prediction in the form of object localization.
The classical approach to counting involves fine-tuning edge detectors to segment objects from the background and counting each one. A large challenge here is dealing with overlapping objects, which requires methods such as the watershed transformation. These approaches have many hyperparameters specific to each task and are complicated to build.
The core of modern approaches was described by Lempitsky and Zisserman in 2010. Given labels with point annotations of each object, they construct a density map of the image. Here, each object predicted takes up a density of 1, so a sum of the density map will reveal the total number of objects in the image. This method naturally accounts for overlapping objects. We extend this idea and focus on two main areas:
We propose redundant counting instead of a density map approach in order to average over errors.
We propose a novel construction of networks and training that can apply to counting tasks with very complicated objects.
We repose the problem of predicting a density map to instead predict a count map which contains redundant counts based on the receptive field of a smaller regression network. The regression network predicts a count of the objects that exist inside this frame, as shown in Figure 1. By processing the image in a fully convolutional way, each pixel is accounted for some number of times: the number of windows which include it, which is the size of each window (i.e., 32x32 = 1024). To recover the true count we can take the average of all these predictions. Figure 2 illustrates how this change in kernel makes more sense with respect to the receptive field of the network that must make predictions. Using the Gaussian density map forces the model to predict specific values based on how far the cell is from the center of the receptive field. This is a harder task than just predicting the existence of the cell in the receptive field. A comparison of these two types of count maps is shown in Figure 3.
To perform this prediction we focus on a method using deep learning, as Xie and Arteta have. They utilized networks similar to FCN-8
which form bottlenecks at the core of the network to capture complex relationships in different parts of the image. Instead, we pad the borders of the input image so that the receptive field of the regression network will redundantly count the correct number of times. This way we do not bottleneck the representation in any way.
The idea of counting with a density map began with Lempitsky and Zisserman in 2010 
where they used dense SIFT features from the image as input to a linear regression to predict a density map. We predict redundant counts instead of a density map. Although a summation over the output of the model is taken in both cases, our method is explicitly designed to tolerate errors when predictions are made.
However, the density map approach does count objects multiple times, albeit indirectly. The network needs to properly predict a density map which is generated from a small Gaussian with its mean at the point annotation. The values it needs to predict therefore vary: some are at the mean and some are not. The density map also does not take the receptive field into account, so an object may be in view while the network has to suppress its prediction.
Many approaches were introduced to predict a better density map. Fiaschi 2012  used a regression forest instead of a linear model to make the density prediction based on BoW-SIFT features. Arteta 2014  proposed an interactive counting algorithm which would extend this algorithm to more dynamically learn to count various concepts in the image. Xie 2016  introduced deep neural networks to this problem. Their method built a network which would convolve a region to a density map. Once this network was trained it can be run in a fully convolutional way similar to our method. However, these approaches focus on predicting a density map which differentiates them from our work.
Arteta 2016 discusses new approaches past the density model. Their focus is different from our work: they tackle the problem of incorporating multiple point annotations from noisy crowd-sourced data. They also utilize segmentation of the background to filter out erroneous predictions that may happen there.
In Segui, the method takes the entire image as input and outputs a single count value, using fully connected layers to break the spatial relationship. They discover that a network can learn to count and, while doing so, learns features for identifying the objects, such as MNIST digits. We use this idea in that our regression network learns to count the objects in its frame. But we expect it to produce errors, so we perform the task redundantly.
Xie in 2015  presented an interesting idea similar to the direction we are going in. Their goal is to predict a proximity map which consists of cone shaped distributions over each cell which smooths each cell prediction using surrounding detections. This cone extended only 5 pixels from the point annotation which was the average size of the cell. However, this approach is more in line with a density map than a count map.
| Symbol | Meaning |
|---|---|
| T | target image, constructed from F |
| F | image of point annotations |
| r | width / length of receptive field |
| B(x, y) | receptive field associated with pixel (x, y) |
| P | map of predicted counts for the input image |
| n | number of training / validation images |
We would like to obtain the count of objects in an input image being given only a few training examples with point annotations of each object. The objects to count are often very small, and the overall image very large. Because counting is labor-intensive, there are often few labeled images in practice.
Motivation: We want to merge the idea of networks that count everything in their receptive field by Segui  with the density map of objects by Lempitsky and Zisserman  using fully convolutional processing like Xie  and Arteta .
Technique: Instead of using a CNN that takes the entire image as input and produces a single prediction for the number of objects, we use a smaller network that is run over the image to produce an intermediate count map. This smaller network is trained to count the number of objects in its receptive field. More formally, we process the image with this network in a fully convolutional way to produce a matrix that represents the counts of objects for a specific receptive field of a sub-network that performs the counting. A high-level overview:
Pre-process image by padding
Process image in a fully convolutional way
Combine all counts together into total count for image
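As a concrete illustration, the three steps above can be sketched in a few lines of NumPy. This is not the paper's code: a hypothetical oracle counter (a plain window sum over the annotation map) stands in for the trained regression network, and the helper name `redundant_count_pipeline` is our own.

```python
import numpy as np

def redundant_count_pipeline(points, image_size, r=4):
    """Sketch of the counting pipeline with an oracle counter
    standing in for the trained regression network.
    points: list of (y, x) point annotations; r: receptive field size."""
    F = np.zeros(image_size)
    for (y, x) in points:
        F[y, x] = 1
    # 1. Pre-process by padding, so border objects fall in the
    #    full number of windows.
    pad = r - 1
    Fp = np.pad(F, pad)
    # 2. "Fully convolutional" pass: each output pixel is the count
    #    of objects inside one r x r receptive field.
    H, W = Fp.shape
    count_map = np.zeros((H - r + 1, W - r + 1))
    for y in range(H - r + 1):
        for x in range(W - r + 1):
            count_map[y, x] = Fp[y:y + r, x:x + r].sum()
    # 3. Combine: every object appears in exactly r*r windows, so
    #    dividing the summed count map by r*r recovers the true count.
    return count_map.sum() / (r * r)

print(redundant_count_pipeline([(2, 3), (7, 7)], (10, 10), r=4))  # -> 2.0
```

Note that an object at the very corner, (0, 0), is still recovered exactly once, which is what the padding buys us.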
The fully convolutional network processes an image by applying a network with a small receptive field over the entire image. This has two effects which reduce overfitting. First, by being small, the fully convolutional network has far fewer parameters than a network trained on the entire image. Second, by splitting up an image, the fully convolutional network has much more training data to fit parameters on.
The following discussion will consider a 32x32 receptive field for simplicity and in order to have concrete examples. This method can be used with any receptive field size. An overview of the process is shown in Figure 5.
We want to count target objects in an input image I. This image contains multiple target objects, each labelled with a single point annotation; together these annotations form the map F.
Because the counting network only reduces the dimensions according to its receptive field, the input must be padded in order to deal with objects that appear on the border. Objects on the border of the image will at most be in the receptive field of a window with only one column or row overlapping the input image. A point annotation can then be at most 15 pixels from the border of the padded image.
The padding is meant to align the network's predictions with the target T: it is important that these be aligned such that the receptive field of the network lines up with the proper regression target.
The target image T can be constructed from a point-annotated map F, the same size as the input image I, where each object is annotated by a single pixel. This is desirable because labeling with dots is much easier than drawing the boundaries for segmentation.
Let B(x, y) be the set of pixel locations in the receptive field corresponding to output pixel (x, y). Then we can construct the target image T as:

T(x, y) = Σ_{(i, j) ∈ B(x, y)} F(i, j)

Here T(x, y) is the sum of cells contained in a region the size of the receptive field. This becomes the regression target for that region of the image.
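To make the construction concrete, here is a sketch of building the target T from the annotation map F using an integral image; the box sum is mathematically equivalent to summing F over each receptive field, though the paper itself does not prescribe this particular implementation (the helper name `make_target` is our own).

```python
import numpy as np

def make_target(F, r=4):
    """Build the regression target T from point-annotation map F.
    T[y, x] = sum of F over the r x r receptive field at output (y, x)."""
    Fp = np.pad(F, r - 1)  # pad so border objects count fully
    # Integral image with a leading row/column of zeros.
    S = np.pad(Fp, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    H, W = Fp.shape
    # Box sums for every r x r window via inclusion-exclusion:
    # sum(Fp[y:y+r, x:x+r]) = S[y+r,x+r] - S[y,x+r] - S[y+r,x] + S[y,x]
    T = (S[r:H + 1, r:W + 1] - S[:H - r + 1, r:W + 1]
         - S[r:H + 1, :W - r + 1] + S[:H - r + 1, :W - r + 1])
    return T
```

A single annotated point then contributes exactly r*r to the summed target, one count per window that contains it.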
We use fully convolutional networks with a 32x32 receptive field. The output of the fully convolutional network on the entire padded image is larger than the original input. Each pixel in the output represents the count of targets in the corresponding receptive field.
At the core of the model, Inception units are used to perform 1x1 (pad 0) and 3x3 (pad 1) convolutions at multiple layers without reducing the size of the tensor. After every convolution a Leaky ReLU activation is applied. We notice an improvement in the regression predictions with the Leaky ReLU during training because the output can be pushed to zero and then recover to predict the correct count.
Our modifications are in the down sampling layers. We removed the max pooling and stride=2 convolutions. They are replaced by large convolutions. This makes it easier to calculate the receptive field of the network because strides add a modulus to the calculation of the count map size.
We perform this down sampling in two locations using large filters to greatly reduce the size of the tensor. A necessity in allowing the model to train is utilizing Batch Normalization layers  after every convolution.
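A hedged sketch of the kind of building blocks described above, in PyTorch: parallel 1x1 (pad 0) and 3x3 (pad 1) convolutions concatenated, each followed by Batch Normalization and Leaky ReLU, with down-sampling done by a large valid convolution rather than pooling or strides. The class name, filter counts, and kernel sizes here are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SimpleInception(nn.Module):
    """Illustrative Inception-style unit: 1x1 and 3x3 branches,
    BatchNorm and Leaky ReLU after each convolution, outputs
    concatenated along the channel axis."""
    def __init__(self, in_ch, ch1, ch3):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, ch1, 1, padding=0),
                                nn.BatchNorm2d(ch1), nn.LeakyReLU(0.01))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, ch3, 3, padding=1),
                                nn.BatchNorm2d(ch3), nn.LeakyReLU(0.01))

    def forward(self, x):
        # Both branches preserve spatial size, so they concatenate cleanly.
        return torch.cat([self.b1(x), self.b3(x)], dim=1)

# Down-sampling via a large valid convolution instead of pooling or
# stride-2 convolutions, so the receptive-field and count-map size
# arithmetic stays a simple subtraction (kernel size illustrative).
down = nn.Conv2d(8, 16, kernel_size=14)
```

Because no layer uses a stride, the output size shrinks by kernel_size − 1 at each valid convolution and nowhere else, which is what makes the receptive field easy to track.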
We tried many combinations of loss functions and found the L1 loss to perform the best.
Xie found that the L2 penalty was too harsh on the network during training. We reached the same conclusion for our configuration and chose an L1 loss instead. We also tried to combine this basic pixel-wise loss with a loss based on the overall prediction for the entire image. We found this caused over-fitting and provided no assistance in training: the network would simply learn artifacts in each image in order to correctly predict the overall counts.
The above loss is a surrogate objective to the real count that we want. We intentionally count each cell multiple times in order to average over possible errors. With a stride of 1, each target is counted once for each pixel in its receptive field. As the stride increases, the number of redundant counts decreases.
In order to recover the true count we divide the sum of all pixels by the number of redundant counts.
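The recovery step is a single division; a minimal sketch (the function name is ours, and the stride handling assumes the receptive field size r is divisible by the stride):

```python
import numpy as np

def recover_count(count_map, r, stride=1):
    """Recover the true object count from the redundant count map.
    With stride 1 each object is counted r*r times; larger strides
    reduce the redundancy to (r/stride)^2."""
    redundancy = (r // stride) ** 2
    return count_map.sum() / redundancy

# One object seen by every window of a 32x32 receptive field
# contributes 32*32 = 1024 ones to the count map:
P = np.zeros((40, 40))
P[:32, :32] = 1.0
print(recover_count(P, r=32))  # -> 1.0
```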
There are many benefits to using redundant counts. If the pixel label is not exactly at the center of the cell, or even outside the cell, the network can still learn because on average the cell will appear in the receptive field.
With this approach we sacrifice the ability to localize each cell exactly with coordinates. Viewing the predicted count map can localize where the detection came from (shown in Figure 7) but not to a specific coordinate. For many applications accurate counting is more important than exact localization. Another issue with this approach is that a correct overall count may not come from correctly identifying cells and could be the network adapting to the average prediction for each regression. One common example is if the training data contains many images without cells the network may predict 0 in order to minimize the loss. A solution similar to Curriculum Learning  is to first train on a more balanced set of examples and then take well performing networks and train them on more sparse datasets.
VGG Cells: To compare with the state of the art we first use the standard benchmark dataset introduced by Lempitsky and Zisserman in 2010. There are 200 images with a 256x256 resolution that contain simulated bacterial cells from fluorescence-light microscopy. Each image contains 174 ± 64 cells which overlap and are at various focal distances, simulating real-life imaging with a microscope.
MBM Cells: We also use a real dataset based on the BM dataset introduced by Kainz et al. in 2015, which consists of eleven 1200x1200 resolution images of bone marrow from eight healthy individuals. The standard staining procedure used depicts the nuclei of the various cell types present in blue, whereas the other cell constituents appear in various shades of pink and red. We modified this dataset in two ways to create the MBM dataset (Modified BM). First, the images were cropped to 600x600 in order to process the images in memory on the GPU and also to smooth out evaluation errors during training for a better comparison. This yields a total of 44 images containing 126 ± 33 cells (identified nuclei). In addition, the ground truth annotations were updated after visual inspection to capture a number of unlabeled nuclei with the help of domain experts.
Adipocyte Cells: Our final dataset is a human subcutaneous adipose tissue dataset obtained from the Genotype Tissue Expression Consortium (GTEx). 200 Regions Of Interest (ROI) representing adipocyte cells were sampled from high-resolution histology slides using a sliding window of 1700x1700. Images were then down-sampled to 150x150, representing a suitable scale at which cells could be counted using a 32x32 receptive field. The average cell count across all images is 165 ± 44.2. Adipocytes can vary in size dramatically (20-200 µm) and, given they are densely packed adjoining cells with few gaps, they represent a difficult test case for automated cell counting procedures.
VGG Cells (200 Images Total)
|Predict Average Count|
|Lempitsky and Zisserman (2010)|
|Fiaschi et al. (2012)|
|Arteta et al. (2014)|
|FCRN-A, Xie (2016)||*|
|*Reported in their work as .|
MBM Cells (44 Images Total)
|Predict Average Count|
|Cell Profiler -single|
|Cell Profiler -multiple**|
|FCRN-A, Xie (2016)|
|**Cell Profiler results were obtained using a single pipeline (single) and using three different pipelines (multiple) to account for color differences in two of the eleven images.|
Adipocyte Cells (200 Images Total)
|Predict Average Count|
for the validation set, and a fixed size is used for the testing set. At least 10 runs using different random splits and different network initializations are used to calculate the mean and standard deviation.
First, we compare the overall performance of our proposed model to existing approaches in Table 2 for each dataset. For each dataset we follow the evaluation protocol used by Lempitsky and Zisserman in 2010 that has been used by all subsequent papers. In this evaluation protocol, training, validation, and testing subsets are used. The held-out testing set size is fixed for all experiments while training and validation sizes are varied to simulate lower or higher numbers of labeled examples. The algorithm trains on the training set only, while being able to early stop by evaluating its performance on the validation set. The sizes of the training and validation sets are varied together for simplicity.
The results of the algorithm using at least 10 random splits are computed and we present the mean and standard deviation. The testing set size remains constant in order to provide constant evaluation. If the testing set were chosen to be all remaining examples (Testing = Total) instead of a fixed size then smaller values would be less impacted by difficult examples in the test set because examples are not sampled with replacement.
As a practitioner baseline comparison we compare our results to Cell Profiler’s  which uses segmentation to perform object identification and counting. This is representative of how cells are typically counted in biology laboratories. To do so, we designed two main different pipelines and evaluated the error on 10 splits of 100 randomly chosen images for the synthetic dataset (VGG Cells) and on 10 splits of 10 images for the bone marrow dataset (MBM Cells) to mimic the experimental setup in place since Cell Profiler does not use a training set. For the MBM Cells, we report the performance using the same pipeline (single) for all images and using three slightly modified versions of the pipeline (multiple) where a parameter was adjusted to account for color differences seen in 8 of the 44 images.
Among other methods we compare with Xie's FCRN-A network. Only Xie's and our method (Count-ception) are neural-network-based approaches. Our network is substantially deeper than Xie's FCRN-A network, and with that representational power together with our redundant counting we are able to perform significantly better. We show in §5.2 that the performance of our model matches that of Xie's when redundant counting is disabled by increasing the stride.
In order to train the network we used the Adam optimization technique 
with a learning rate of 0.005 and a batch size of 4 images. The training runs for 1000 epochs and the best model based on the validation set error is evaluated on the test set. The weights of the network were initialized using the Glorot initialization method adjusted for ReLU gain.
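The optimization setup described here can be sketched as follows. Only the optimizer, learning rate, batch size, loss, and early-stopping scheme come from the text; the model is a stand-in placeholder, not the Count-ception architecture.

```python
import torch

model = torch.nn.Conv2d(3, 1, 3, padding=1)           # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=0.005)  # Adam, lr 0.005
loss_fn = torch.nn.L1Loss()                           # pixel-wise L1 loss

x = torch.randn(4, 3, 64, 64)                         # batch size of 4
target = torch.zeros(4, 1, 64, 64)                    # redundant count map target
for epoch in range(2):                                # paper trains 1000 epochs
    opt.zero_grad()
    loss = loss_fn(model(x), target)
    loss.backward()
    opt.step()
# The best model by validation-set error is then evaluated on the test set.
```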
We claim redundant counting is significant to the success of the method. By increasing the stride we can reduce double counting until there is none. Table 3 indicates that a stride of 1, meaning the maximum amount of redundant counting, is the optimal choice. As we increase the stride to equal the patch size, where no redundant counting occurs, the accuracy is reduced.
The power of this algorithm is in the redundant counting. However, increasing the redundant count is complicated. The receptive field could be increased but this will add more parameters which cause the network to overfit the training data. We explored a receptive field of 64x64 and found that it did not perform better. Another approach could be to use dilated convolutions  which would be equivalent to scaling up the input image resolution.
|Train & Test| 2.4 ± 0.4 | 3.5 ± 0.1 | 4.0 ± 0.2 | 5.2 ± 0.4 |
The run-time of this algorithm is not trivial. We explored models with fewer parameters and found they could not achieve the same performance. Shorter models (fewer layers) or narrower models (fewer filters per layer) tended not to have enough representational power to count correctly. Making the network wider would cause the model to overfit. The complexity of the Inception modules was significant to the performance of the model.
The network was implemented in Lasagne (version 0.2.dev1) and Theano (version 0.9.0rc2.dev) using the libgpuarray backend. The source code and data will be made available online: https://github.com/ieee8023/countception.
In this work we rethink the density map method by Lempitsky and Zisserman  and instead predict counts in a redundant fashion in order to average over errors and reduce overfitting. This redundant counting approach merges ideas by Segui  of networks that count everything in their receptive field with ideas by Lempitsky and Zisserman of using the density map of objects together with ideas by Xie  and Arteta  of using fully convolutional processing.
We call our new approach Count-ception because our approach utilizes a counting network internally to perform the redundant counting. We demonstrate that this approach outperforms existing approaches and can also perform well with very complicated cell structure even where the cell walls adjoin other cells. This approach is promising for tasks with different sizes of objects which have complicated structure. However, the method has some limitations. Although the count map can be used for localization it cannot easily provide locations of objects.
This work is partially funded by a grant from the U.S. National Science Foundation Graduate Research Fellowship Program (grant number: DGE-1356104) and the Institut de valorisation des données (IVADO). This work utilized the supercomputing facilities managed by the Montreal Institute for Learning Algorithms, NSERC, Compute Canada, and Calcul Québec. We also thank NVIDIA for donating a DGX-1 computer used in this work.