I Introduction
Assessment of passenger counts is of paramount importance for public transport agencies to plan, manage and evaluate their transit service. Over the past three decades, automatic passenger counting (APC) systems have played an increasingly important role in determining the number of passengers in public transport. They are used in the daily monitoring of operations, in long-term demand planning, as well as in revenue sharing within transport associations around the world, which amounts to billions annually [vdv:PresseinformationBilanz2018, vdv:PresseinformationBilanz2019, vdv:PresseinformationBilanz2020, wiki:Verkehrsverbund_BerlinBrandenburg], is funded by both ticket revenues and public funds [wiki:Farebox_recovery_ratio], and directly affects services of general interest. Systematic over- or undercounts, so-called biases, can thus lead to large unfairness and should be as small as possible. The VDV 457 is the prevalent standard that defines how APC systems are officially validated [SiebertEllenberger2019, vdv457_v2_1].
Modern APC systems are based on depth-sensing cameras mounted above the doors of buses, trains, trams, and other vehicles of public transport (see Figure 1). The recorded 3D videos start with a door opening; passengers board and alight, sometimes pushing or carrying along objects like bicycles and baby buggies; at the end of the video, the door closes again. This sequence of events is called a video of a door opening phase, or video for short in the following.
The authors of [9665722] introduce and evaluate the Neural Automated Passenger Counter (NAPC), a deep neural network based on a long short-term memory (LSTM) architecture [hochreiter1997long, 9665722]. It is capable of learning automatic passenger counting tabula rasa from videos and total manual counts, e.g. of boarding and alighting passengers at the end of each video. To illustrate this, the information "there have been 2 boarding and 3 alighting passengers" is already a label for an entire video; timestamps are not required. This property drastically lowers the labeling effort compared to approaches like object detection [Redmon_2016_CVPR, bochkovskiy2020yolov4] or even segmentation [long2015fully]. Furthermore, the computational effort of the NAPC is low compared to the aforementioned approaches. Training can thus be executed on consumer GPUs and inference even on legacy hardware systems, see Section III-E.

We tried to improve the original NAPC architecture by changing popular hyperparameters like the learning rate, batch size, and LSTM layer count and depth. What can be considered a success is the introduction of learning rate decay (LRD): before, we used a selection set to determine a suitable epoch; with LRD, we can simply pick the last epoch, making the process easier to handle. However, besides LRD, our results stayed inconclusive for months, so there was no improvement. Therefore, we switched our exploration to more fundamental parameters like various random seeds and training data selection. It is known from other fields like operations research [SIEBERT20132251], and from recent findings for neural networks [Mehrer2020] as well, that such seemingly minor choices can have even more impact than algorithms or other design decisions. Another challenge was the lack of a suitable metric, so over time, we created a new one with a universal approach: the test success chance. Finally, we evaluated how stationary the NAPC is, determined a limit, and found yet another very universal approach to remove it.

We organize the paper as follows: in Section II, we introduce the methods we have used. These comprise efficient labeling, LRD, grid search, quantization, ensemble building, and simulative metrics. Then, we present challenges and our results in Section III. Finally, we draw a conclusion in Section IV.
I-A Previous Work
Over recent years, together with our research partners, we have developed multiple passenger counting algorithms, e.g. [mdeArticle], which quality-wise was comparable to non-neuronal methods of other manufacturers. The authors of [9665722] introduced the NAPC to predict the boarding and alighting passengers on the BerlinAPC dataset [11303_13421]. It consists of roughly 13,000 videos of door opening phases and their total boarding and alighting passenger counts at the video end. A sequence-to-sequence LSTM model predicts the current absolute count for every frame (ranging from 56 to 3275 frames per sequence at 10 frames per second), which means that the prediction on the last frame is trained to match the label and is thus solely used to calculate the accuracy metric. Since the labels are counts, they are all integral, and the NAPC predictions are therefore rounded, as they typically have decimals. A special bounding box loss function covers all intermediate predictions. The maximum number of passengers in a class (here: boarding or alighting adults) is 67. However, the dataset contains mostly labels with low counts, and videos with high counts are scarce (for details, see Table I, "Number of Events", in [9665722]).

The NAPC architecture consists of an input and an output fully connected (FC) layer with an LSTM core in between (for details, see Figure 2 in [9665722]). It therefore does not explicitly use spatial information and interprets each video frame as a flattened 500-dimensional vector. The output is a regression of the cumulative passenger count up to the corresponding frame. The authors of [9665722] report a maximum overall accuracy of ~97%, dropping to 55.17% with their best model in more difficult situations like "more than 20 boarding passengers". The global relative bias and an equivalence test for a 95% confidence interval are used as additional metrics, fitting the passenger counting task. This is the same methodology as in the recent VDV 457 v2.1 [vdv457_v2_1], which, however, requires 99% for high precision systems.

I-B Related Work
The NAPC as introduced in [9665722] can be considered a stationary method operating on non-stationary data, since its counts are bounded (see Section III) but the labels can basically be arbitrarily large. For time series, two popular methods to turn non-stationary into stationary data can be applied: the forward difference operator, which is defined by [suwan2018monotonicity] as

$(\Delta f)(t) = f(t+1) - f(t),$

and the backward difference operator as

$(\nabla f)(t) = f(t) - f(t-1),$

analogously for the integer time scale. According to [suwan2018monotonicity], it was first proposed by [miller1989fractional] in 1989. The cumulative sum over a sequence, the approach we use, can be considered the inverse operation of the backward difference operator. The differentiation and integration approach is applied, e.g., to make stock-market predictions [hillmer1982arima]. In economics, autoregressive integrated moving average (ARIMA) models are used to make time-series forecasts on non-stationary data (for details, see Chapter 4 of [box2015time]). The integration step is a cumulative sum over all previous predictions (i.e., an integration over a stationary process). Autoregressive (AR) and moving average (MA) properties help the model predict future timesteps based on previous predictions (including the errors of those predictions). Models with these features but no integration are categorized as ARMA. The integration step removes the trend of the time series.
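The inverse relationship between the backward difference and the cumulative sum can be illustrated with a few lines of NumPy (the count values are illustrative):

```python
import numpy as np

# Cumulative per-frame counts, as the NAPC predicts them (illustrative values).
cumulative = np.array([0, 0, 1, 1, 2, 3, 3])

# The backward difference recovers the stationary per-frame increments ...
increments = np.diff(cumulative, prepend=0)

# ... and the cumulative sum is its inverse, restoring the original series.
assert np.array_equal(np.cumsum(increments), cumulative)
```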
Another topic is the use of legacy computational platforms, which we use for inference: since some have limited floating-point capabilities, we have developed a universal post-training Monte Carlo quantization method which can, in particular, be applied to LSTMs. The term quantization refers to the process of reducing the resolution from floating-point to fixed-point numbers while retaining an overall similarity of the values. The authors of [dundar1995effects] investigated the effects of multilayer neural network quantization and concluded that 10 bits are necessary for their experiments. However, they used only weight quantization on linear-only connections. More than two decades later, the authors of [alvarez2016efficient] proposed a scheme to quantize recurrent neural networks down to 8-bit resolution. They leverage quantization-aware training procedures, allowing for reduced quantization errors post-training. They successfully report the quality of their method with wider LSTM layers (compared to [9665722]). A more recent solution to LSTM quantization, with and without quantization-aware training, was proposed in [li2021quantization]. The authors reduce the resolution for weights and activations to 8-bit fixed-point arithmetic and maintain 16-bit fixed-point where necessary (e.g. cell states). Quantization is not only relevant to legacy platforms; it can also be used to fully utilize specialized neural network ASICs (Application-Specific Integrated Circuits) like the Google Edge TPU [gholami2021survey].

Another aspect we cover is ensembles: to improve equivalence test success chances by reducing bias, we have combined multiple networks into NAPC ensembles. Ensemble methods improve the prediction quality of a single expert model through the combination of multiple experts. The members of an ensemble then vote, or their predictions are combined such that the resulting prediction is a collective solution. Multiple experiments [krizhevsky2012imagenet, qiu2014ensemble] show that ensemble methods can improve predictions. Keeping the ensemble size small is another objective: since each member processes the input, the computations grow with each additional member. The authors of [zhou2002ensembling] mention that using all available instances (neural networks in our case) can also result in worse ensemble performance than choosing a subset of them.
Multiple learners can be created by, e.g., Bagging, Boosting, and Stacking, which have different properties and limitations. Bagging (bootstrap aggregating) averages over all predictions in regression, or uses plurality voting for classification tasks [breiman1996bagging]. All members are trained independently, so training can be parallelized. Boosting, on the other hand, optimizes recursively and cannot be parallelized. The Adaptive Boosting (AdaBoost) algorithm emphasizes the wrongly predicted samples of one model to train another model on those. The collection of these repeatedly optimized models is weighted and then combined [freund1997decision]. Wolpert introduced stacking (or stacked generalization) in 1992 as a method to combine the outputs of one or more generalizers [wolpert1992stacked]: level 0 algorithms are optimized on the input data, and level 1 algorithms combine level 0 outputs to create their own outputs. More levels can be defined analogously.
Another aspect of ensemble building is the pruning method used to limit the number of learners. Such methods can be categorized into ranking, clustering, optimization, and others [tsoumakas2009ensemble]. Ranking-based methods use measurable quality criteria to rank all available algorithms and take the best performing ones to build an ensemble of size $n$. Clustering-based methods do not need a labeled pruning set as the other methods do, as they cluster the models based on a distance measure (possibly on artificial data) and prune the clusters until $n$ members are left. Optimization-based ensemble pruning methods like Centralized Objection Maximization for Ensemble Pruning (COMEP) [bian2019ensemble] start with a single ensemble member and iteratively optimize an objective for each new member. Other methods like genetic algorithms [zhou2003selective] can be used as optimization-based ensemble pruners [tsoumakas2009ensemble] as well.

During our work, another issue emerged: non-determinism. To conduct research on, and potentially exploit, the impact of neural network initialization or the training order for the NAPC as in [SIEBERT20132251], determinism is required. This has been an issue with GPU-accelerated training, which relies on modern research in the area. As stated by the authors of [morin2020non], even with fixed seeds, robust reproducibility of experiments used to be unachievable [morin2020non]
. The non-determinism they measure (variance after training with the same seeds) is only traceable to the GPU floating-point operations, as the initialization of the ResNet [he2016deep] weights and the order of samples (minibatches) are fixed. After training 50 models with the same and different seeds, they measure the proportion of variance in the metrics introduced by the GPUs. For the standard deviation of the model's accuracy, GPU non-determinism contributed around 74% of the overall variance on the test set, and for the loss, it is more than 87% of the standard deviation's variance on the test set. However, we could utilize more recent developments that enable deterministic GPU-accelerated training [Riach2019].

I-C Contributions
This work extends the original NAPC with several engineering and research optimizations:

- efficient labeling
- post-training quantization, including LSTMs
- introducing LRD to the training
- cumulative summation in training and inference
- test success chance as a metric obtained by simulation
- a grid search to identify important parameters
- increased accuracy and reduced bias through ensemble building
II Methods
We use the network architecture from [9665722] as our baseline predictor, with 5 LSTM layers of 50 LSTM units each. The methods presented in the following subsections are partially combined during the experiments. We did not use the network optimizations from Section II-B within the grid search (Section II-C3). We leverage mixed-precision training as supplied by the TensorFlow framework [micikevicius2017mixed, abadi2016tensorflow].

II-A Labeling
To efficiently gather the large amounts of labels required to train NAPCs, we have developed VisualCount, a specialized tool that utilizes game controllers. The main advantage is the analog buttons, which allow navigating through a video in both directions at the most appropriate speed: slow in crowds, fast in idle situations, and backward to quickly review complex scenes. It also allows enabling and disabling a 3D height map overlay, which makes it possible to distinguish between children and adults using a certain height threshold (e.g. 1.20 m), see Figure 2. At the time of writing, around half a million manual video counts have been made with VisualCount.
II-B Network Optimizations
The NAPC performs weaker in more challenging scenarios (see Figure 6 in [9665722]), in particular for high boarding and alighting counts. Predicting non-stationary data (without trend estimation or elimination) is often avoided in regression (compare ARMA to ARIMA in Section I-B) by converting it to stationary data. However, this is not possible in the case of the NAPC, since labels always refer to a range of video images in which the event (e.g. a boarding passenger) happens. Therefore, instead of modifying the data, we modify the neural network and add a cumulative sum at the end, the so-called cumsum layer: before, for each video image, the LSTM needed to report the total count up to that image. With the cumsum layer, since frame-wise outputs are summed up, the LSTM only needs to report a frame-to-frame difference. The cumsum thus also preserves the temporal correlation of single events over multiple frames, and the neural part of the network no longer needs to understand and perform the concept of summation. Instead of changing the data, we have thus improved the NAPC to predict on non-stationary data, compare Figure 3.

II-C Training Optimizations
II-C1 Data Storage
We have stored all our video data in a raw NumPy [Harris2020] array with an additional file that contains the names and offsets of the individual videos, and accessed it via a simple Python class. The advantage of this bare-metal approach is that no notable data parsing delays occur once the NumPy file is created, which can be significant for other file formats. We also made use of memory mapping, which reduces the data load time basically to zero, so code changes can be evaluated interactively. Since the NumPy files can become hundreds of gigabytes large, we created them by reading the video files individually and appending the pixel data to the file with npy-append-array (https://github.com/xor2k/npyappendarray). Together with memory mapping, this technically very basic approach allows using data sources larger than the main memory of the training machine.
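The storage layout can be sketched in plain NumPy as follows; the names, shapes, and index structure are illustrative stand-ins (in practice, the file is built incrementally with npy-append-array rather than concatenated in memory):

```python
import os
import tempfile

import numpy as np

# Illustrative stand-in layout: one flat frame array on disk plus a
# name -> (start, stop) index; video names and shapes are hypothetical.
videos = {
    "door_0001": np.zeros((120, 25, 20), dtype=np.uint16),  # 120 frames
    "door_0002": np.ones((80, 25, 20), dtype=np.uint16),    # 80 frames
}
path = os.path.join(tempfile.mkdtemp(), "frames.npy")
np.save(path, np.concatenate(list(videos.values())))

offsets, pos = {}, 0
for name, video in videos.items():
    offsets[name] = (pos, pos + len(video))
    pos += len(video)

# Memory mapping pages data in lazily, so "opening" the file is basically free
# and the array may be larger than main memory.
frames = np.load(path, mmap_mode="r")
start, stop = offsets["door_0002"]
assert frames[start:stop].shape == (80, 25, 20)
```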
II-C2 LRD
Due to the problem of diverging and collapsing networks during training, we introduce LRD. It decreases the gradient step size over time, stabilizes the learning procedure, and allows us to pick the last epoch of the training run. This eliminates the need to pick an epoch using a selection set; therefore, more videos are available for the training and test sets compared to the previous solution. As already stated in [bengio2012practical, p. 20], the learning rate is one of the most essential hyperparameters.
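A minimal sketch of an exponential LRD schedule; the base rate and decay factor are illustrative, not the values used in our training:

```python
# Exponential learning rate decay: the step size shrinks each epoch so the
# training settles down and the last epoch can be picked safely.
def lr_schedule(epoch, base_lr=1e-3, decay=0.9):
    """Return the learning rate for a given epoch (illustrative values)."""
    return base_lr * decay ** epoch

# The schedule is strictly decreasing over the epochs.
rates = [lr_schedule(e) for e in range(50)]
assert all(a > b for a, b in zip(rates, rates[1:]))
```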
II-C3 Grid Search
We wanted to know to what extent similar effects as in [SIEBERT20132251] can occur for deep neural networks. In [Mehrer2020], the authors focus on neural network initialization, while we evaluate multiple randomness aspects at the same time and their impact on the quality of the NAPC. We split the data into mutually exclusive groups to measure the impact of the training data on the results. Our grid search covers the product of the following ranges:

- group selection: 4 options
  - uniformly shuffling 8000 videos
  - dividing them into 4 mutually exclusive groups (with index $i$) of 2000 videos each
- data amount: 2 options
  - 2000 training videos, selected by group index $i$ (smaller training set)
  - 6000 training videos, everything but group index $i$ (larger training set)
- weights random seed: 10 options
  - affects only the initialization of the weights
- training random seed: 4 options
  - affects the order of training videos in the minibatches (and the concatenation of the videos)
  - affects dropout regularization [srivastava2014dropout] (i.e., the random masking of values)
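The grid above spans 4 × 2 × 10 × 4 = 320 training configurations, which can be enumerated as follows:

```python
from itertools import product

# Enumerate the grid described above (group, data amount, weight seed,
# training seed); the concrete values are placeholders for the options.
groups = range(4)              # 4 mutually exclusive data groups
data_amounts = (2000, 6000)    # smaller vs. larger training set
weight_seeds = range(10)       # weight initialization seeds
training_seeds = range(4)      # shuffling/dropout seeds

grid = list(product(groups, data_amounts, weight_seeds, training_seeds))
assert len(grid) == 4 * 2 * 10 * 4 == 320
```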
II-D Inference Optimizations

II-D1 LSTM Quantization
The authors of [zhu2020towards] pointed out that the weights of artificial neural networks are trained as floating-point numbers, since these allow for higher precision and more stable learning. Previous work on post-training quantized convolutional neural networks revealed the small impact of lower-precision weights on prediction quality. Major frameworks such as TensorFlow [abadi2016tensorflow] and PyTorch [paszke2019pytorch] have built-in tools to reduce the machine precision of the trained weights and quantize them to fixed-point numbers. However, LSTMs are not well supported or only in a beta state. As they are the core element of the NAPC, custom quantization is needed to perform faster inference. Our custom framework approach is a greedy Monte Carlo random search for input and weight scales that processes each layer in sequential order. This reduces the computational effort and time required compared to optimizing all layers at once. The quantizer uses several hand-picked calibration videos (less than a dozen) as a reference, and any quantized model is required to predict within an error margin for every timestep. No labels are required for this process. We have also kept the reference video set intentionally small. A benefit is not only a fast quantization process but also an increase in variability: we only require the quantized models to be within an error margin for a few videos, so they are only relatively loosely related to the non-quantized model. We treat a quantized model as if it were entirely unrelated to its non-quantized counterpart, so it undergoes full validation just like any other NAPC. This approach allows us to increase our training pool size without training more models.

From a structural point of view, apart from the activation functions, the network consists of two FC layers and an LSTM core. In the following, variables with a hat (e.g. $\hat{X}$) are quantized and therefore scaled versions, and $\odot$ denotes the element-wise product. Scalars are denoted with lower case letters, while vectors and matrices are denoted with upper case letters.

FC layers are stateless, and thus the layer indexing does not consider timesteps but only placement within the network. Such a layer with index $l$ receives the scaled output $\hat{X}_l$ of the previous layer as input, as well as its normalization factor $s_{l-1}$, such that

$\hat{X}_l = \operatorname{round}(s_{l-1} X_l),$

i.e. $\hat{X}_l$ is the scaled and quantized version of $X_l$, which can carry rounding errors. When searching for appropriate input and weight scales ($a_l$ and $b_l$, respectively), we have to scale the input $X_l$, weights $W_l$ and biases $B_l$ for layer $l$ accordingly. Therefore, we calculate the output $\hat{Y}_l$ and its scale $s_l$ as follows:

$\hat{W}_l = \operatorname{round}(b_l W_l), \quad \hat{B}_l = \operatorname{round}(a_l b_l B_l),$

$\hat{Y}_l = \hat{W}_l \operatorname{round}(a_l \hat{X}_l / s_{l-1}) + \hat{B}_l, \quad s_l = a_l b_l.$
We extend the previous notation to the stateful case of LSTMs with a timestep superscript. Again, we have the output $\hat{X}^t$ of the same timestep but the previous layer. Additionally, for LSTMs there are the hidden state $H^{t-1}$ and cell state $C^{t-1}$ of the same layer from the previous timestep, where the former is scaled with $s_h$ and the latter is always unscaled (i.e., floating-point). We define $G^t$ as the non-activated floating-point LSTM gate block matrix, which is calculated with the additional recurrent kernels $R$ as well as the weights $W$ and biases $B$, which are block matrices themselves with the same indices. Furthermore, $\sigma$ is the logistic sigmoid activation function. Those definitions lead to the following calculations:

$G^t = [G_i^t, G_f^t, G_c^t, G_o^t] = \frac{\hat{W} \hat{X}^t}{a\,b} + \frac{\hat{R} \hat{H}^{t-1}}{s_h\,b_r} + B,$

$C^t = \sigma(G_f^t) \odot C^{t-1} + \sigma(G_i^t) \odot \tanh(G_c^t),$

$H^t = \sigma(G_o^t) \odot \tanh(C^t),$

with $a$ the input scale, $b$ the kernel weight scale, and $b_r$ the recurrent kernel scale. As $H^t$ is also the input for the next layer and the next timestep, it is multiplied again with $s_h$.
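As a sanity check of the FC-layer scaling described above, a small NumPy sketch (sizes and scales are illustrative; in practice, the scales are found by the Monte Carlo search):

```python
import numpy as np

rng = np.random.default_rng(0)

# Floating-point FC layer (sizes are illustrative).
X = rng.normal(size=20)
W = rng.normal(size=(10, 20))
B = rng.normal(size=10)
Y = W @ X + B

a, b = 1024.0, 1024.0          # candidate input and weight scales
Xq = np.round(a * X)           # scaled, integer-valued input
Wq = np.round(b * W)           # scaled, integer-valued weights
Bq = np.round(a * b * B)       # the bias carries both scales

# Integer-only forward pass; dividing by the output scale a*b recovers Y
# up to a small rounding error.
Yq = Wq @ Xq + Bq
assert np.max(np.abs(Yq / (a * b) - Y)) < 0.05
```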
II-D2 Ensemble Building
Ensemble methods are commonly used in machine learning to overcome the weaknesses of a single predictor. We opt for ranking-based pruning [tsoumakas2009ensemble], where we sort the independently trained networks according to their probability of passing the VDV 457 v2.1 and select the top $n$ networks. The final prediction of the ensemble is determined via the $q$-th quantile (or percentile) for each class, which is particularly well suited to compensate for a bias and can be tuned via a calibration set. This process is a stacking ensemble method, as we optimize a meta-model (quantile regression) based on the specific ensemble. In the following, unless we combine ensembles specifically for different training (sub)datasets, the only differences between most ensemble members are their random seeds. Also note that, from a mathematical point of view, the seemingly familiar and straightforward median-like functions such as the quantile are highly non-linear, non-differentiable, and thus not trainable by gradient descent. Therefore, they structurally provide added value to typical neural network training.

II-E Simulative Metrics
Deep learning uses gradient descent and variations of it, which require a differentiable loss function. Accuracy, while not differentiable, is constantly measured and reported during the training process and is a key metric. However, for automatic passenger counting, accuracy is not well suited, since it is largely decoupled from the bias: an NAPC instance can have an extremely high accuracy (e.g. 98%), but the bias might vary from 0% to multiple percent, because the few videos where the model's prediction does not match the label (e.g. the remaining 2%) typically contain difficult situations with crowds, on which the neural network makes large errors. On the other hand, to pass the VDV 457 v2.1 or the equivalence test, the standard deviation is important as well. Our initial idea was to use some linear combination of the absolute bias and the standard deviation. However, this approach is quite arbitrary and no less complex than doing an entire equivalence test in the first place. The drawback of the latter is that the sample size cannot be controlled, and the possible outcomes are only whether the test was passed or failed, with nothing in between. Labeling tens of thousands of videos to run multiple equivalence tests and average over them would be affordable thanks to our efficient labeling software mentioned in Section II-A, but it still scales badly. Therefore, as in the test success simulation of [EllenbergerSiebert2021], we use another approach: we assume that the error made on our validation set generalizes well enough to be used as an empirical error distribution to sample from [vaart_1998]. We then draw a sample of the desired size (e.g. 3600 videos), repeat the process (e.g. 10000 times), and determine the test success ratio or test success chance, a so-called bootstrapping method [osti_5421967]. Another advantage of this approach is that no mathematical error distribution model with many incomprehensible parameters to tweak is required, and multiple categories like boarding and alighting passengers can be meaningfully combined into one continuous metric.
III Results

III-A Network Upper Bounds
Investigations revealed a limitation of the architecture from [9665722] (see Figure 3): the original NAPC (without cumsum) cannot count beyond a certain threshold, which can easily be observed by looping sequences multiple times, e.g. a sequence with one boarding passenger 200 times. Training models with cumsum leads to unaffected predictions during inference, regardless of the number of passengers. However, this is not the only issue: networks without cumsum also suffer from random predictions of the opposite class for sequences with more than 70 passengers.

Even though LSTMs are capable of learning non-linearities in general and should therefore be able to predict non-stationary data, our experiments have shown that there are limits. When applying cumsum to our model, the LSTM is not required to learn and perform the arithmetic logic of counting (summation) and thus no longer needs to compensate for the stochastic drift induced by rising passenger counts. Cumsum layers, however, as can be seen in Figure 7, come at a cost: the average model quality gets worse. Therefore, training more neural networks (e.g. with different random seeds) is required to obtain an NAPC pool with comparable success chance.
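The effect of the cumsum layer can be illustrated as follows; the per-frame outputs are hypothetical stand-ins for the LSTM head's activations:

```python
import numpy as np

# Stand-in per-frame outputs: with a cumsum layer, the network only emits
# frame-to-frame differences (hypothetical, idealized values).
frame_deltas = np.zeros(200 * 60)
frame_deltas[::60] = 1.0  # one boarding passenger every 60 frames, looped 200x

# The cumsum layer integrates the stationary differences into the
# non-stationary cumulative count; the last frame is compared to the label.
cumulative = np.cumsum(frame_deltas)
assert round(float(cumulative[-1])) == 200  # no saturation at a count ceiling
```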
III-B Fixed Training
Initially, we could not train deterministically, even after setting all seeds. Determinism within the framework, drivers, and hardware is required to, e.g., research the influence of random seeds, and it fortunately became available during our research [Riach2019, nvidia2021determinism]. As it turns out, the framework we used (TensorFlow) did not support deterministic training out-of-the-box (without patching) before v2.3 [nvidia2021determinism]. From that version on, it could be activated programmatically. Further, an issue in one of the GPU libraries (cuDNN) affected determinism in LSTMs, which was fixed (for details, see [cudnn_v7_6_1_documentation]) around the time we were about to start the experiment. There are other factors limiting the reproducibility of deep learning in general [alahmari2020challenges]:

- CPU architecture
  - e.g. whether it is x86 or ARM
- GPU manufacturer and GPU hardware architecture
  - e.g. whether an Nvidia Turing (RTX 20XX) or Ampere (RTX 30XX) was used
- any software and driver version & build
- the operating system version & build
It turned out that, on top of fixed hardware, having fixed software versions may suffice for determinism, but not for reproducibility: using the same framework version, but from different sources (pip, Conda, compiled from source), resulted in different randomness and thus different, incompatible training results. Therefore, we eventually used a TensorFlow Docker image, which can be fixed by its hash, and used only one machine equipped with four identical GPUs.
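The controllable part of this randomness reduces to seeding, as the following sketch illustrates (using Python's standard random module as a stand-in for framework seeding; GPU kernels and library builds must be pinned separately, e.g. via a Docker image fixed by its hash):

```python
import random

# With a fixed seed, the controllable randomness (e.g. weight initialization,
# shuffling, dropout masks) is bit-identical across runs.
def init_weights(seed, n=5):
    """Illustrative weight initialization from a seeded RNG."""
    rng = random.Random(seed)
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]

assert init_weights(0) == init_weights(0)  # same seed: identical weights
assert init_weights(0) != init_weights(1)  # different seed: different weights
```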
III-C Grid Search
Since the variation among trained NAPC models was generally very high, and we were thus unable to reliably judge whether a modification to the architecture or hyperparameters improved or deteriorated quality, we investigated a more fundamental aspect of training: randomness. Unlike [frankle2018lottery, Dick2014HowMR], we decided not to fix random seeds but to iterate over multiple ones. Such experiments can pinpoint a random variable, like the initialization of the neural network or the training data order, and its effect on the overall training. From other domains like operations research, it is known that randomness can have a far more significant influence than other popular design aspects [SIEBERT20132251], and this has recently been shown for deep learning as well [picard2021torchmanualseed3407]. Therefore, to achieve a minimal level of conclusiveness, iterating over multiple seeds is necessary and can be considered essential for research in general [hutson2018artificial, boulesteix2020replication].

There are several kinds of randomness during the training process: firstly, the selection of the subset of data used for training (the authors of [9665722] have already done this reproducibly and independently of the framework); secondly, the random initialization of the weights; and thirdly, the randomness during training, which spans from data preprocessing and augmentation to regularization methods. For a visual overview, compare Figure 4 and Figure 5. The most obvious observation is that more training data (6000 vs. 2000 videos) increases the test success chance. The qualitative difference among the models can be very high, which needs to be accounted for when making decisions about deployment or further development, even without the non-determinism introduced by GPUs. Building (unranked) ensembles increases the success chances if a quantile > 50% is chosen, as it compensates undercounting, to which the NAPC method seems to be more prone than to overcounting. Even when all 8000 possible training videos are available to the ensemble via its members, the count quality is still higher if every ensemble member receives more training videos, so generalization is favorable over specialization. There does not seem to be a big difference in prediction quality introduced by the randomness of weight initialization and training (dropout layers and composition of batches). However, some configurations lead to a terrible count quality even with all training videos being available to a bias-corrected ensemble (see the reddish highlighted subplot in Figure 5). This is an interesting and unexpected case: when combining a certain neural network initialization seed with a certain training order seed, the training is almost certain to fail, no matter the data. The mentioned seed combination would be a starting point for further investigation and could lead to a better architectural understanding of the neural network.
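The unranked quantile-ensemble combination can be sketched as follows; the member predictions are illustrative:

```python
import numpy as np

# Predictions of 16 ensemble members for one video and one class, with a
# hypothetical true count of 3: the members tend to undercount.
members = np.array([2.3, 2.4, 2.2, 2.6, 2.4, 2.5, 2.3, 2.7,
                    2.4, 2.5, 2.6, 2.2, 2.8, 2.4, 2.3, 2.6])

# The median still rounds to an undercount, while a quantile above 50%
# compensates the bias before rounding.
assert round(float(np.quantile(members, 0.5))) == 2    # undercounts
assert round(float(np.quantile(members, 0.75))) == 3   # hits the true count
```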
III-D Inference

III-D1 Quantization
We could quantize our network almost entirely, except for the non-linear LSTM activation functions. Related work [choi2018pact, alizadeh2020gradient, lin2016fixed] leveraged restricted activations, regularization during training, or quantized training (fine-tuning) to improve quantizability, while the NAPC uses none of these. Compared to feed-forward neural networks, RNNs maintain an internal state. In our case, videos can range from a few dozen to multiple thousands of frames with a large variety of counts and situations. The temporal dimension of varying length makes it more difficult to quantize networks analytically, as is done for other architectures [banner2018post]. We quantize all weights, biases, inputs, and outputs to integers (i.e. fixed-point). We define $\varepsilon$ as the absolute error margin for each frame and class during quantization. It is set to at most a small absolute deviation from the floating-point NAPC's output, as the final predictions are on average within less than that margin of the closest integer. The idea is to exploit the quantization potential as much as possible, since the quantized model needs to pass the same validation as its non-quantized template model. With this approach, we discourage divergent counts in most cases, resulting in less degraded prediction quality. If the quantized version deviates by more than $\varepsilon$ during the process, the scale is discarded, and new scales for this layer are drawn randomly. One could choose $\varepsilon$ to be even smaller, as this would result in tighter inference quality, but it also increases the risk of violations during the quantization process, which could therefore take more time or not even finish for some layers. Input and weight scales are optimized concurrently as the algorithm works through each layer. We confirm that the empirical results of this approach are sufficient to leverage quantization for real-world usage of the NAPC on existing edge devices. Both 16- and 32-bit fixed-point quantization are feasible with the greedy approach. Lower precisions were only possible for particular layers; however, those increased the quantization noise for subsequent layers, thus providing no advantage.
With 32-bit fixed-point quantization, we achieve results comparable to unquantized models, with a maximum absolute deviation of 4 from the ground truth (the floating-point prediction), sampled from 13 randomly chosen quantized models and checked against all 4966 holdout sequences (unseen during training and calibration). In other experiments, and to our great surprise, a model quantized with this method is not necessarily inferior or equal in quality to its non-quantized template but can even be superior, to the extent that a mediocre model can turn into one of the best models after quantization. Therefore, quantizing all models can be a convenient way to enlarge the search space and increase quality. Nevertheless, the average quality of our quantized models is below that of the models without quantization; compare Figure 7.
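The acceptance criterion against the floating-point template can be expressed as a simple check; the helper name and tolerance interface are illustrative, only the bound of 4 is taken from the text:

```python
import numpy as np

def passes_template_validation(float_preds, quant_preds, tol=4.0):
    """Accept a quantized model only if its per-frame, per-class output
    never deviates by more than `tol` from the floating-point template
    on the holdout sequences (illustrative helper, not the authors' code)."""
    deviation = np.max(np.abs(np.asarray(float_preds) - np.asarray(quant_preds)))
    return bool(deviation <= tol)

# toy holdout outputs: the quantized model strays by at most 3.5
float_preds = np.array([[0.0, 1.0], [2.0, 2.0], [5.0, 3.0]])
quant_preds = np.array([[0.5, 1.0], [2.0, 5.5], [4.0, 3.0]])
print(passes_template_validation(float_preds, quant_preds))  # -> True
```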
III-D2 Ensemble building
For the NAPC, we can confirm studies about random-seed ensembles suggesting that they can improve quality [hansen1990neural, fort2020deep]. As mentioned previously, we take a fixed quantile (a choice also known as the median) of the ensemble for each class as the final prediction. This is due to the property of NAPC instances undercounting on average, which is even worse when using the cumsum; compare the lower success rates in Figure 7
. Also, this approach filters outliers and centers the error distribution of the ensemble closer to 0, which increases the test success chance; compare Figure 4 and Figure 5.

III-E Performance
To compare the performance of a quantized and a floating-point NAPC instance, we developed a custom inference framework written in C/C++ [kernighan2006c, stroustrup2000c++]. To increase performance, we compile the weights into the binary and do not use dynamic memory via malloc or new. This allows us to infer multiple sensors concurrently on various platforms (x86, ARMv7, and ARMv5). For the benchmark, we use a video with a length of approx. 204 seconds, which can be stored locally even on a legacy machine with only 32 MB of free storage. We measure the user time required to infer these sequences and calculate how many NAPC instances the system could run in real time. In Figure 8, we show the performance of different CPUs. Notably, the already deployed SSU110 and APCR reach around 20 (SSU110) or approx. 6 (APCR) NAPC instances in real time, while x86 machines like the Intel i7 can run thousands, and even a Raspberry Pi up to around 200, with multi-core systems using only a single core. The APCR profits most from quantization, as it would only reach approx. 2 times real time with floating-point models.
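The real-time factor in this benchmark is simply the video duration divided by the user time spent on inference; the user times below are made-up placeholders, not the paper's measurements:

```python
# duration of the benchmark video from the text
VIDEO_SECONDS = 204.0

def realtime_instances(user_time_seconds):
    """Number of NAPC instances a CPU could sustain in real time:
    benchmark video length divided by the measured user time."""
    return VIDEO_SECONDS / user_time_seconds

# a hypothetical device needing 10.2 s of user time for the 204 s video
# could run about 20 instances in real time
print(round(realtime_instances(10.2)))  # -> 20
```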
IV Conclusion
We proposed several improvements to the training and inference of the existing NAPC implementation, which considerably enhance prediction quality and usability in real-world scenarios.
For a long time, it was unclear whether the new VDV 457 v2.1 could ever be passed with 99% count accuracy (an accuracy of 99% refers to the equivalence test for bias, i.e. the absolute bias staying within the allowed bound, including confidence intervals). Our results prove that it can be done; thanks to quantization, this holds both in theory and in practice, even on legacy hardware already installed in vehicles. Not only can the test be passed, but the validation costs of this APC system are also around 50% lower out of the box due to lower sample size requirements (3600 videos or 3000 stop door events instead of the 6147 mentioned in the VDV 457 v2.1, which we proposed back in 2018 based on low-precision legacy systems to give APC manufacturers a fair chance to pass the test).
We have created and applied a variety of methods for quality control and identified that the number of videos in the training process matters greatly, and that additional measures like quantile ensembles are required to compensate for the undercounting that seems to be intrinsic to the NAPC. Further research is needed to determine the limits of the architecture, e.g. a maximal number of training videos beyond which quality no longer improves. With the test success chance, we have introduced a simulative metric based on an empirical distribution. Our approach is very generic and can probably be applied to many design problems outside of automatic passenger counting.
We have also found that our LSTMs have an upper bound up to which they can count. Our cumsum layer thus not only helps create better counts but can also be seen as an improvement to LSTMs in general for regression tasks where calculating the differences of the time series (trend elimination) is not applicable. To our knowledge, we are the first to apply this approach to LSTM predictions.
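The cumsum idea in miniature, with toy per-frame increments standing in for the LSTM's actual outputs:

```python
import numpy as np

def counts_from_increments(increments):
    """The network only has to predict small per-frame count *changes*;
    the cumulative sum turns them into running totals, so the recurrent
    state never needs to represent large absolute counts."""
    return np.cumsum(increments, axis=0)

# toy per-frame boarding increments for one short video
increments = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])
running = counts_from_increments(increments)
print(running[-1])  # -> 3.0, the total boarding count at door close
```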
Finally, we hope that our contribution to the NAPC will raise fairness in global revenue sharing and distribution to a new level and that other areas will profit from our results as well.
V Acknowledgements
This research is financially supported by the European Regional Development Fund under Grant 10166961.