Using Spatial Pooler of Hierarchical Temporal Memory to classify noisy videos with predefined complexity

by   Maciej Wielgosz, et al.

This paper examines the performance of a Spatial Pooler (SP) of a Hierarchical Temporal Memory (HTM) in the task of noisy object recognition. To address this challenge, a dedicated custom-designed system based on the SP, a histogram calculation module and an SVM classifier was implemented. In addition to implementing their own version of HTM, the authors also designed a profiler which is capable of tracing all of the key parameters of the system. This was necessary, since analysis and monitoring of the system performance turned out to be extremely difficult using conventional testing and debugging tools. The system was initially trained on artificially prepared videos without noise and then tested with a set of noisy video streams. This approach was intended to mimic a real-life scenario where an agent or a system trained to deal with ideal objects faces the task of classifying distorted and noisy ones in its regular working conditions. The authors conducted a series of experiments for various macro parameters of HTM SP, as well as for different levels of video reduction ratios. The experiments allowed them to evaluate the performance of two different system setups (i.e. 'Multiple HTMs' and 'Single HTM') under various noise conditions with 32-frame video files. Results of all the tests were compared to an SVM baseline setup. It was determined that the system featuring SP is capable of achieving approximately 12 times the noise reduction for a video signal with distorted bits accounting for 13% of the total. Furthermore, the system featuring SP also performed better in the experiments without a noise component and achieved a maximum F1 score of 0.96. The experiments also revealed that increasing the number of columns and synapses of SP has a substantial impact on the performance of the system. Consequently, the highest F1 score values were obtained for 256 and 4096 synapses and columns respectively.



1 Introduction

The world consists of objects that interact dynamically with each other to form a network of complex relationships. It should be noted that virtually all processes in the real world develop in time; even the ones that appear to be fixed, after careful analysis in the micro–scale, turn out to be dynamic.

The eyes are the basic and primary tool for examining the world, with which nature equipped virtually all living creatures capable of moving by themselves. Closer examination of the evolution of species on Earth shows that a huge leap forward occurred around the time when the first primitive eyes developed, about 540 million years ago wiki:Eye .

Stationary images, even ones of a very high resolution and with a great number of low-level details, provide limited information regarding an analyzed scene and the objects within it Han2015Object ; Cheng2015Effective ; Du2014Discriminative ; Zhang2015Saliency . They lack the temporal component that is essential to follow how the objects change in time, which in turn provides complete information about their nature. Furthermore, it is worth noting that the human brain is well adapted for processing dynamic object interactions. The important role that the dimension of time plays in analyzing visual objects is also reflected in the tissue structure, 80% of which involves temporal connections Mountcastle . Consequently, when analyzing static scenes our eyes make rapid, step–like rotations, after which they remain stationary. These series of step movements are known as saccades or saccadic eye movements.

The real world is highly noisy and imperfect, thus the brain was designed to analyze such distorted streams of images. It is able to learn to distinguish between heavily malformed and noisy objects in the space we live in. This is possible due to a long and complicated process of training, during which the brain is exposed to streams of data in large quantities and learns to extract the perfect shapes of objects from the noisy ones. The process may be perceived as discovering latent features which are unique and invariant for a given object or set of objects. The idea behind this approach resembles to some extent the methods used for training the latest deep learning algorithms. A sufficiently large number of stationary images is used to compensate for the lack of a temporal component in exploring latent features of the processed objects. This phenomenon contributed to the recent breakthrough in Very Deep Learning architecture training, which is mostly driven by the availability of a huge amount of data and GPU computational power Krizhevsky ; Schmidhuber .

The brain, in an effort to classify distorted images, compares the ideal version of the objects encoded in Sparse Distributed Representation (SDR) with the noisy ones. The authors decided to investigate to what extent this mechanism is effectively implemented in Hierarchical Temporal Memory (HTM) wielgosz2016using , a bio-inspired algorithm which models a selected section of the neocortex Numenta using state–of–the–art knowledge in neurobiology regarding the human brain.

The hypothesis which the paper attempts to confirm states that incorporating the SP of HTM in the video processing flow increases the overall noise robustness of the system.

The human brain as a whole has not been completely explored yet, making its artificial implementation and verification a very hard task. However, there are initiatives humanbrainproject which have taken up the challenge of simulating and modeling the brain as we know it today. Rather than model the brain, the authors of this paper have adopted a slightly different approach of gradually introducing selected components of Hierarchical Temporal Memory (HTM) to the video processing system with the intention of enhancing its performance. By doing so we aim to develop a complete system online:custom_htm working on the principles of the human brain as they were presented in Mountcastle ; Numenta with our modification making the algorithm suitable for hardware implementation. Running HTM on a CPU is very slow, and due to its strongly parallel structure the algorithm is a good candidate for General–Purpose Graphics Processing Unit (GPGPU) and Field–Programmable Gate Array (FPGA) acceleration pietron2016parallel ; pietron2016formal . Therefore, the computationally demanding overlap and inhibition sections of SP were implemented on GPU.

The paper contains the following four main contributions:

  • analysis of noise reduction properties of SP in video processing using object detection as a case study setup,

  • experimental confirmation of SP noise reduction properties,

  • experimental study of the impact of key parameter changes on the performance of the SP-based system,

  • development of a custom-designed prototype application of an HTM solution meant for doing video experiments, along with a data set. The software and the video data is available online at online:custom_htm ; online:datasets .

The rest of the paper is organized as follows. Sections 1.1 and 1.2 provide the background and related work of object classification in video streams and Hierarchical Temporal Memory, respectively. Section 2 contains mathematical analysis of the noise reduction properties of SP. Architecture of the custom–designed system used for the experiments is described in Section 3, with the data flow presented in Section 4. Section 5 provides the results of the experiments. Finally, the conclusions of our research are presented in Section 6.

1.1 Object classification in video streams

Most state–of–the–art information extraction systems consist of the following sections: preprocessing, feature extraction, dimensionality reduction and classifier or ensemble of classifiers (Fig. 1). Their construction requires expert knowledge as well as familiarity with the data that will be processed Haibo ; Peng .

Figure 1: Architecture of a video processing system

Usually, systems for object classification in video streams are also designed according to this scheme. Consequently, the proper choice of the operations which constitute all the mentioned stages of the system is important and determines the classification result Lu ; Hota ; Islam . One of the most challenging stages is feature extraction, which substantially affects the overall performance of the system.

There are also systems which take advantage of the spatial–temporal Numenta profile of the data Castrill ; Devarakota ; Khan ; Bengio . They are closer to the concept of the solution presented in this paper, which may be considered a hybrid approach since it features components of both schemes.

1.2 Hierarchical Temporal Memory

Hierarchical Temporal Memory (HTM) replicates the structural and algorithmic properties of the neocortex. It can be regarded as a memory system which is not programmed, but trained by exposing it to a data flow. The process of training is similar to the way humans learn which, in its essence, is about finding latent causes in the acquired content. At the beginning, the HTM has no knowledge of the causes behind the data stream it examines, but through a learning process it explores the causes and captures them in its structure. The training is considered complete when all the latent causes of the data are captured and stable. A detailed presentation of HTM is provided in Numenta ; Chen ; Rachkovskij .

HTM constitutes a hierarchy of nodes, where each node performs the same algorithm. The most basic elements (raw and unprocessed data) enter at the bottom of the hierarchy. Each node learns the spatio–temporal pattern of its input and associates it with a given concept. Consequently, each node, no matter where it is in the hierarchy, discovers the causes of its input. In an HTM, beliefs exist at all levels in the hierarchy and are internal states of each node. They represent probabilities that a cause is active. Each node in an HTM has a fixed number of concepts and a fixed number of output variables. The training process of an HTM starts with a fixed number of possible causes, and in a training process, assigns a meaning to them.

Consequently, the nodes do not increase the number of concepts they cover; instead, over the course of the training, the meaning of the outputs gradually changes. This happens at all levels in the hierarchy simultaneously. Thus the top level of the hierarchy remains with little or no meaning till nodes at the bottom are trained to recognize the basic patterns.

HTM is composed of two main parts, namely Spatial and Temporal Pooler (TP). This paper focuses on Spatial Pooler (SP), aka Pattern Memory, which is employed in the processing flow of the system. It contains columns with synapses connected to the input data Numenta . The main role of SP in HTM is finding spatial patterns in the input data. It may be decomposed into three stages:

  • Overlap calculation (Alg. 1),

  • Inhibition (Alg. 2),

  • Learning.

  for all col ∈ sp.columns do
     col.overlap ← 0
     for all syn ∈ col.connected_synapses() do
        col.overlap ← col.overlap + syn.get_input()
     end for
     if col.overlap < min_overlap then
        col.overlap ← 0
     else
        col.overlap ← col.overlap * col.boost
     end if
  end for
Algorithm 1 Overlap

  for all col ∈ sp.columns do
     max_column ← max(n_max_overlap(col, n), 1)
     col.active ← col.overlap ≥ max_column
  end for
Algorithm 2 Inhibition

The first two stages are very computationally demanding but can be parallelized. Therefore the authors decided to implement them on GPU in OpenCL wielgosz2016opencl . The learning stage, the detailed description of which is provided in the Numenta whitepaper Numenta , is implemented on CPU.

The overlap section (Alg. 1) computes the overlap for every column in the SP structure, i.e. the number of active and connected synapses. If the number reaches min_overlap, it is boosted and passed on to the inhibition section (Alg. 2).

The inhibition stage (Alg. 2) implements a winner–takes–all procedure in which, for each column, a decision is made as to whether it belongs to the n columns with the highest overlap values. The n_max_overlap function performs the comparison.
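The two stages can be sketched in a few lines of Python. This is a minimal illustration and not the authors' OpenCL implementation: the `Column` class, the function names and the sample data are hypothetical, and inhibition is assumed to be global (every column is compared against all others) rather than limited to an inhibition radius.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Column:
    # Indices of input bits reached by this column's connected synapses.
    connected_inputs: List[int]
    boost: float = 1.0
    overlap: float = 0.0
    active: bool = False


def compute_overlap(columns: List[Column], input_bits: List[int], min_overlap: int) -> None:
    """Overlap stage: count active inputs seen through connected synapses."""
    for col in columns:
        count = sum(input_bits[i] for i in col.connected_inputs)
        # Columns below the threshold are silenced; the rest are boosted.
        col.overlap = 0.0 if count < min_overlap else count * col.boost


def inhibit(columns: List[Column], n: int) -> None:
    """Inhibition stage, assuming a global neighbourhood and n >= 1."""
    ranked = sorted((c.overlap for c in columns), reverse=True)
    # n-th highest overlap, floored at 1 so that zero-overlap columns never win.
    nth_highest = ranked[min(n, len(ranked)) - 1]
    threshold = max(nth_highest, 1)
    for col in columns:
        col.active = col.overlap >= threshold


# Hypothetical toy data: three columns watching an 8-bit input frame.
columns = [Column([0, 1, 2]), Column([2, 3, 4]), Column([5, 6, 7])]
frame = [1, 1, 1, 1, 0, 0, 0, 0]
compute_overlap(columns, frame, min_overlap=2)
inhibit(columns, n=1)
```

With these toy values the first column sees three active bits, the second two and the third none, so only the first column survives inhibition for n = 1.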

2 SP processing

Figure 2: Spatial Pooler architecture

This section covers properties and important features of the SDR (Sparse Distributed Representation) vector space as well as presents SP processing flow as a set of steps. Those steps may be described as mappings between spaces which are progressively narrowed to extract features to be used in the further processing stages that come after SP. It is important to note that all the operations presented in this section consider SP after training is accomplished, which means that learning is not taken into account.

2.1 Notation

SDR (Sparse Distributed Representation)
distance between two vectors in SDR
input vector size
number of ’1’ in the input vector
number of noise bits included in input vector
number of ’1’ in the noise
reduction ratio
number of columns
number of synapses
for synapse associated with column input
Table 1: Notation

We use a single notation scheme for math formalism in this section, shown in Tab. 1. The size of all the vectors is expressed in bits. The notation covers all the parameters which are essential for a complete description of the SP architecture.

Input vector size along with the number of positive bits are considered to be key parameters, since they define the sparsity of the SP input vector Ahmad ; Numenta . The number of columns and the number of synapses directly affect the shape of the SP architecture and determine its processing capabilities Cui , such as noise elimination. The noise content is quantitatively described by the number of noise bits and the number of ’1’ in the noise (Tab. 1), which shape the statistical distribution of the information content carried by the input vectors. It is worth noting that any two vectors may be compared in terms of the Hamming measure by applying the distance operator from Tab. 1.
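As an illustration of the Hamming measure and of input sparsity, the following minimal Python sketch compares a clean binary vector with a noisy copy. The function names and the sample vectors are hypothetical and are not taken from the authors' software.

```python
from typing import Sequence


def hamming_distance(x: Sequence[int], y: Sequence[int]) -> int:
    """Number of bit positions at which two equal-length binary vectors differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same size")
    return sum(a != b for a, b in zip(x, y))


def sparsity(x: Sequence[int]) -> float:
    """Fraction of '1' bits in a binary vector."""
    return sum(x) / len(x)


clean = [1, 0, 0, 1, 0, 0, 0, 0]
noisy = [1, 0, 1, 1, 0, 0, 0, 1]  # two bits flipped by noise
distance = hamming_distance(clean, noisy)
```

Here the two vectors differ at two positions, and a quarter of the bits of the clean vector are set.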

In addition to the basic hyper-parameters of the SP architecture and the restrictions defining the input vector, Tab. 1 also contains more subtle but equally important constants, such as min_overlap and the perm_value threshold. These constants govern the processing flow of the SP architecture. Min_overlap sets a threshold for an input data pattern to be regarded as a match; consequently, it affects the selectivity of the network. The perm_value threshold, on the other hand, may be perceived as a sensitivity-shaping parameter.

2.2 Space definition

All the operations in the SP processing chain are performed in a sparse distributed space (Eq. 1), which is characterized by the SDR vector space and its dimension.

In video processing the vectors in SDR (Eq. 1) may be interpreted as points in a space whose dimension equals the input vector size. The points are brought to SDR space by an encoding operation which, in this case, is binarization.


2.3 SP mapping

The Spatial Pooler processing chain may be decomposed as a series of mapping operations. This allows for analysis of data structure and noise contribution at each stage separately. Regardless of the size of the input data and noise volume, the operations have the same structure.

2.3.1 Synapse connectivity mapping

In the first step every column maps the input space to its own subspace. The operation is described by Eq. 7.


There is a fixed number of choices every column makes, which is determined by its synapse number as expressed by Eq. 8.


2.3.2 Perm_value mapping

Once a subset of input data is selected by each column, the mapping operation brings a vector from the synapse subspace to the perm_value subspace, as given by Eq. 9. This may reduce to some extent the size of the data vector to be processed.


The perm_value associated with every synapse of each column is compared against the perm_value threshold. When an element of the input vector interacts with a connected synapse, i.e. one whose perm_value passes the threshold, it is included in the perm_value subspace, as expressed by Eq. 10.


2.3.3 Overlap mapping

In this step all the bits which passed through the synapse connectivity and perm_value mappings are counted to obtain an overlap value. This may again be viewed as a mapping operation, given by Eq. 11.


Once the overlap value is calculated according to Eq. 12, it is compared against min_overlap as given by Eq. 14. Based on the comparison, the column is either included in the active columns set or remains excluded from further processing.


2.3.4 Boost

The conducted experiments show that boost barely affects the activity pattern of SP columns. Therefore, the authors decided not to include it in the formal analysis. This makes the formalism simpler and easier to follow, and directs the focus to the sections of the algorithm that contribute most.

2.3.5 Inhibition

Each column is selected as either active or inactive depending on its overlap value in the context of all the neighboring columns. The comparison range is determined by the inhibition range Numenta ; How_neurons .

According to the authors’ experimental observations (Fig. 7), the inhibition radius does not change after the SP has been trained. Consequently, a small amount of noise in the input will change the output to some extent, but most of the winners will remain unchanged.

2.4 Signal propagation

The choice of the winning column set depends ultimately on the number of ’1’ in the input vector. Therefore we decided to analyze the impact of this number and SP parameters on the selectivity of SP.

2.4.1 Synapse connectivity mapping

The number of ’1’ contained within a random vector in the synapse connectivity subspace may be expressed in terms of a random variable with binomial distribution, given by Eq. 15 together with its expected value.
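The binomial model can be written out as follows. This is a sketch under stated assumptions and not a reproduction of the original equation: the symbols n (input vector size), w (number of ’1’ in the input vector), s (number of synapses per column) and X (number of ’1’ seen by one column) are chosen here for illustration, and each synapse is assumed to select an input bit independently and uniformly at random.

```latex
X \sim \mathcal{B}\!\left(s, \tfrac{w}{n}\right), \qquad
P(X = k) = \binom{s}{k}\left(\tfrac{w}{n}\right)^{k}\left(1 - \tfrac{w}{n}\right)^{s-k}, \qquad
\mathbb{E}[X] = s\,\frac{w}{n}
```

For example, with s = 64 synapses, an input of n = 60 × 32 = 1920 bits and a hypothetical w = 192 active bits, a column would see 64 · 192 / 1920 = 6.4 active bits on average.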

2.4.2 Perm_value mapping

The number of ’1’ contained within a random vector in the perm_value subspace may be expressed in terms of a random variable with binomial distribution, given by Eq. 16 together with its expected value. The reduction ratio parameter (Tab. 1) can be expressed as in Eq. 17.


2.4.3 Overlap mapping

The main goal of using SP in data processing systems, regardless of the target application, is to improve feature extraction quality by increasing pattern matching accuracy. This ability to filter out the right candidates for column activation is crucial. It may be described in terms of the probability of a match between a random input vector and the synapse connection pattern, given by Eq. 18.

A detailed description of how Eq. 18 was derived is provided in How_neurons .

An increase of the parameters in Eq. 18 makes the denominator grow exponentially faster than the numerator. This, in turn, results in a drop of the false positives ratio.
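The effect can be illustrated numerically with a standard hypergeometric tail, which follows the style of analysis used in the cited reference. The exact form of Eq. 18 is not reproduced here, so the formula, the function name and the parameter names (n, s, a, theta) in this Python sketch are assumptions made for illustration only.

```python
from math import comb


def false_match_probability(n: int, s: int, a: int, theta: int) -> float:
    """Chance that a random vector with `a` ones out of `n` bits overlaps a
    fixed set of `s` connected synapses in at least `theta` positions."""
    # Numerator: number of vectors sharing b >= theta positions with the synapses.
    matches = sum(comb(s, b) * comb(n - s, a - b)
                  for b in range(theta, min(s, a) + 1))
    # Denominator: number of all vectors with `a` ones.
    return matches / comb(n, a)


# Hypothetical settings: a larger input space makes accidental matches rarer.
p_small = false_match_probability(n=1024, s=64, a=40, theta=8)
p_large = false_match_probability(n=4096, s=64, a=40, theta=8)
```

In this sketch the denominator `comb(n, a)` grows much faster with n than the numerator does, so `p_large` is smaller than `p_small`.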

2.5 Noise propagation

Noise introduced to the input vector is propagated through all the mapping steps contributing to signal distortion.

2.5.1 Synapse connectivity mapping

The number of noise bits contained within the synapse connectivity subspace may be expressed in terms of a random variable with binomial distribution, given by Eq. 19 together with its expected value.

2.5.2 Perm_value mapping

The number of noise bits contained within a random vector in the perm_value subspace may be expressed in terms of a random variable with binomial distribution given by Eq. 20. This random variable takes the number of noise bits for each column after the perm_value mapping.

2.5.3 Overlap mapping

Noise introduction affects the number of ’1’ in the mapped data vector, which may be expressed in terms of the expected value of a random variable with binomial distribution given by Eq. 21.

The first part of Eq. 21 accounts for the noise contribution, whereas the second part covers all the ’1’ included in the remaining part of the signal, which was not affected by the noise.
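The two-part structure described above can be sketched numerically. The model below is an assumption made for illustration and is not the original Eq. 21: noise is taken to overwrite a region of `n_noise` bits containing `w_noise` ones, the untouched region keeps the original density of ones, and `connected_ratio` stands in for the fraction of synapses that pass the perm_value mapping. All names are hypothetical.

```python
def expected_overlap_with_noise(n: int, w: int, n_noise: int, w_noise: int,
                                synapses: int, connected_ratio: float = 1.0) -> float:
    """Expected number of '1' bits reaching one column when part of the input is noise."""
    # First part: ones contributed by the noise region.
    noise_part = w_noise / n
    # Second part: ones of the original signal outside the noise region.
    signal_part = (w / n) * (n - n_noise) / n
    return synapses * connected_ratio * (noise_part + signal_part)


# Hypothetical numbers: a 1000-bit input with 100 active bits and 50 synapses per column.
clean = expected_overlap_with_noise(1000, 100, 0, 0, synapses=50)
noisy = expected_overlap_with_noise(1000, 100, 200, 100, synapses=50)
```

Under these assumptions a column sees 5 active bits on average without noise and 9 when a 200-bit noise region with 100 ones is added, which shows how noise inflates the overlap values.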


2.6 Noise impact

The noise impact on the false positive ratio is given by Eq. 23.


The equation presents the degree of ambiguity of pattern matching results caused by a detrimental noise contribution.

It is worth noting that false negative cases are also possible but this effect may be neglected with relatively low values. The detailed study of this issue is available in How_neurons .

3 System description

The custom-designed system online:custom_htm is highly configurable, with numerous parameters responsible for the core HTM’s structure, the encoder behavior, statistics rendering, etc. In addition to the core module, a set of supporting modules has been developed. Most of them are used for feeding video data to the core module, and receiving and analyzing the results.

The HTM itself is the ’core’ module. It is complemented by modules necessary for the system to function (responsible for data reading and encoding, as well as results interpretation) and by modules created for debugging and statistics gathering purposes. The overall system architecture is depicted in Fig. 3. The most relevant modules are described in detail below.

Figure 3: Architecture of the implemented system

3.1 Outer Structure

The outermost level of the system is CLI (Command Line Interface). Depending on the provided command line options, it invokes a particular setup – either ’Single HTM’ or ’Multiple HTMs’. In the ’Single HTM’ setup, data from all categories are fed into a single HTM instance. ’Multiple HTMs’ refers to creating HTM instances on a per–category basis, resulting in an ensemble of one–vs–all detectors.

In both setups the same wrappers encapsulating the actual processing units can be used. A wrapper is created for a particular HTM use – it is responsible for creating relevant data readers, encoders, decoders and output writers, and for passing them to the iterator – a part of the core that manages HTM cycles.

After data is processed by the wrapper, the result reaches the CLI, which is responsible for further analysis and data presentation – combining wrapper outputs, gathering statistics, training the classifier used to provide the final results, rendering data visualizations etc. The HTM results are post-processed using a LinearSVM classifier.

3.1.1 HTM Wrapper

As mentioned above, a wrapper is created for a specific use – one designed to work with videos will differ from one tailored for texts. Assembling a wrapper from predefined or newly created modules is the main task of the experiment setup.

The wrapper used in the present system setup creates a reader able to get data from video files and an encoder that converts raw frame data to the required format. The HTM output is neither modified (a pass-through decoder module) nor stored for future reference (a pass-through writer module).

Preparing the processing units to work is not the wrapper’s only responsibility – it also controls the number of executed iterations. The minimum (and default) number of cycles equals a single pass of the learning set; however, setups specifying a maximum number and/or metrics measuring whether the HTM still needs learning are also possible.

The wrapper module also coordinates statistics gathering and visualization on a per-instance basis.

3.1.2 Adaptive Video Encoder

During the encoding process the original video frame is converted to a binary image. Depending on the configuration, the original image can be first reduced in size to trim down the amount of data. After reduction, the color image is converted to a grayscale one, which is later binarized using adaptive thresholding.

Adaptive thresholding uses a potentially different threshold value for each small image region. It gives better results than using a single threshold value for images with varying illumination. In this encoder the ’ADAPTIVE_THRESH_GAUSSIAN_C’ algorithm from the OpenCV library opencv_library is used: the threshold value is the weighted sum of neighboring values, where the weights form a Gaussian window.
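The principle can be shown with a small pure-Python sketch. It is an illustration only and not the encoder used in the system: the function names, the default window size, the `sigma` value and the border handling (nearest-pixel replication) are assumptions, and a production encoder would normally call OpenCV's `adaptiveThreshold` instead.

```python
import math
from typing import List

Image = List[List[int]]


def gaussian_kernel(size: int, sigma: float) -> List[List[float]]:
    """Normalised 2-D Gaussian window of odd side length `size`."""
    half = size // 2
    raw = [[math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))
            for dx in range(-half, half + 1)]
           for dy in range(-half, half + 1)]
    total = sum(map(sum, raw))
    return [[v / total for v in row] for row in raw]


def adaptive_binarize(gray: Image, block_size: int = 3,
                      sigma: float = 1.0, c: float = 0.0) -> Image:
    """Set a pixel to 1 when it exceeds the Gaussian-weighted local mean minus c."""
    height, width = len(gray), len(gray[0])
    kernel = gaussian_kernel(block_size, sigma)
    half = block_size // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            local = 0.0
            for ky in range(block_size):
                for kx in range(block_size):
                    # Clamp coordinates so border pixels reuse their nearest neighbour.
                    yy = min(max(y + ky - half, 0), height - 1)
                    xx = min(max(x + kx - half, 0), width - 1)
                    local += kernel[ky][kx] * gray[yy][xx]
            out[y][x] = 1 if gray[y][x] > local - c else 0
    return out


# Hypothetical 5x5 grayscale frame: one bright pixel on a dark background.
frame = [[0] * 5 for _ in range(5)]
frame[2][2] = 255
binary = adaptive_binarize(frame)
```

In this toy frame only the bright centre pixel exceeds its local weighted mean, so it is the only bit set in the binary output.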

3.2 HTM Core

All implemented readers, encoders, decoders and writers provide pre-defined interfaces. Such a solution allows us to separate data acquisition and output storage from the actual processing. The loop consisting of data retrieval, processing and outputting is executed by the iterator object of the core module.
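The separation of concerns described above can be sketched as a small loop over pluggable components. This is a hypothetical outline and not the authors' code: the class name `CycleIterator`, the callable-based interfaces and the sample data are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List

Bits = List[int]


class CycleIterator:
    """Hypothetical stand-in for the core iterator: read, encode, process, decode, write."""

    def __init__(self,
                 reader: Iterable[List[int]],
                 encoder: Callable[[List[int]], Bits],
                 processor: Callable[[Bits], Bits],
                 decoder: Callable[[Bits], Bits],
                 writer: Callable[[Bits], None]) -> None:
        self.reader = reader
        self.encoder = encoder
        self.processor = processor
        self.decoder = decoder
        self.writer = writer

    def run(self) -> int:
        """Execute one cycle per item supplied by the reader; return the cycle count."""
        cycles = 0
        for raw in self.reader:
            encoded = self.encoder(raw)
            processed = self.processor(encoded)
            self.writer(self.decoder(processed))
            cycles += 1
        return cycles


def threshold_encoder(raw: List[int]) -> Bits:
    """Toy encoder: binarize pixel values with a fixed threshold."""
    return [1 if value >= 128 else 0 for value in raw]


def pass_through(bits: Bits) -> Bits:
    """Module that forwards its input unchanged."""
    return bits


stored: List[Bits] = []
pipeline = CycleIterator(
    reader=[[10, 200], [250, 3]],   # two tiny hypothetical frames
    encoder=threshold_encoder,
    processor=pass_through,         # identity processor in place of the HTM
    decoder=pass_through,
    writer=stored.append,
)
cycles = pipeline.run()
```

Because each stage is only known to the loop through its call signature, any of them can be swapped (for example a video reader for a text reader) without touching the rest of the pipeline.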

An HTM object itself consists of a configurable number of layers, a Spatial Pooler and a Temporal Pooler object. Upon each iteration, each layer state is updated by the SP and (depending on the configuration) the TP, based on the data it receives. In the case of the lowest layer the input is obtained from the encoder, and for the higher ones – from the previous level. Setting the layer number to zero effectively turns off the HTM, causing the whole module’s output to be equal to that of the encoder. This feature was used when comparing performance of ’SVM’ only with the ’SP + SVM’ ensemble.

Layers consist of columns, which are composed of connectors (containing synapses used in the spatial pooling process) and cells (used in temporal pooling). Cells themselves are built from segments, with each segment containing synapses connecting it to the other cells. This hierarchical structure closely mirrors the one described in the algorithm section.

Every object encapsulates its functionality, making introduction of changes and enhancements trivial, while at the same time providing a clear reference point for modifications. The object-oriented structure also enhances the visibility of a very important HTM feature – its potential for massive parallelization. One example of that can be a spatial pooling process. The initial system setup used a sequential version of SP. After some tests, a decision to replace it with a concurrent implementation running on a GPU (and an FPGA in the future) was made. The replacement spatial pooler, taking advantage of OpenCL capabilities, was written and plugged into the system without changes to the rest of the architecture. The best OpenCL kernel speed-up (running on GPU vs. CPU) of 632x and 207x was reached for 256 synapses and 1024 columns when compared to basic configuration for R4 dataset (see Tab. 2). Overall acceleration including data transfer time between devices amounted to 6.5x and 3.2x respectively. Details of experiments can be found in wielgosz2016opencl .

4 Data flow

Figure 4: Block diagram of the proposed approach

The data is fed into the system in a frame–by–frame manner. In the first step, the original frame is turned into a binary image (see 3.1.2). This conversion constitutes the encoding which allows the generation of input data for the SP processing stage.

Thereafter, the encoded data is fed into the SP. The processing done by the SP effectively maps input to Sparse Distributed Representation (SDR), which then may be passed on to the TP. We do not use TP in this particular application, but the system in general has such a capability. Instead, we substitute TP with histograms to serve a similar purpose.

Histograms of consecutive frames are built from SP output on a per–video basis. When an HTM module is disabled, encoder output is used to create the histograms (see 3.2). They are used as the input data for the SVM classifier which comes next. The classifier maps the results from SDR to the result space (output categories).
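One plausible reading of the histogram step is sketched below in Python. The exact binning used by the authors is not specified in the text, so this sketch assumes that the histogram counts, for every SP output bit, the number of frames of a video in which that bit was active; the function name and the sample data are hypothetical.

```python
from typing import List, Sequence


def video_histogram(sp_outputs: Sequence[Sequence[int]]) -> List[int]:
    """Sum binary SP outputs over all frames of one video, column by column."""
    if not sp_outputs:
        raise ValueError("a video needs at least one frame")
    return [sum(column_bits) for column_bits in zip(*sp_outputs)]


# Hypothetical SP outputs for a 3-frame video with 4 columns.
frames = [
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
]
features = video_histogram(frames)
```

The resulting vector has one entry per column regardless of the number of frames, which makes it a fixed-length feature vector suitable as input for a linear classifier.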

The complete processing flow of the system is presented in Fig. 4.

5 Experiments and the discussion

The main goal of the experiments was to examine the ability of the presented model to classify objects in noisy video streams. The authors decided to use artificially generated video sequences to validate the presented method. This approach allowed them to focus on critical aspects of the model architecture, because they could observe the relationships between certain parameters and the quality and complexity of the input data stream. It is worth noting that the Spatial Pooler is unsupervised, so the training process is mostly governed by a proper choice of the model structure. Consequently, during a series of experiments the most important properties of the system were determined.

In datasets based on real-life scenes it is hard to find and select videos with well-defined distortions such as insufficient light, a certain noise level or a very distant object seen from an unusual angle. Even if such videos exist, tagging them and quantitatively measuring the noise or distortion level is a tedious task. The artificial dataset with complex scene modification capabilities used in the experiments poses no such problems. Furthermore, efficient object classification in complex real-life video requires an attention mechanism shikhar2015Action to be able to find and track the right spots in scenes. Therefore, the authors decided to locate figures centrally in a scene.

5.1 Experiments setup

Parameter                 Subset     Value
No. of frames per video              32
Object classes                       cone, cube, cylinder, monkey, sphere, torus
No. of classes                       6
Total no. of videos       all        6000
                          training   4800
                          testing    1200
Videos per class          all        1000
                          training   800
                          testing    200
Videos per trial          all        100
                          training   80
                          testing    20
Table 2: Experiment parameters
Parameter                      R16      R8       R4
Size of a single video frame   60x32    120x66   240x134
No. of columns                 2048
No. of synapses per column     64       64       128
Perm value
Perm value
Min overlap                    8
Winners set size               40
Initial perm value             0.21
Initial inhibition
Table 3: SP parameters

A series of experiments (details of which are provided in Tab. 2 and Tab. 3) was conducted to validate the hypothesis stated in the introduction of the paper. They allow a comparison of the performance of the system featuring Spatial Pooler in the processing flow with the one lacking it (denoted as ’SVM’). Two approaches to introducing SP into the system were tested: ’Single HTM’ mode, denoted as ’SPS + SVM’ and ’Multiple HTMs’ mode, denoted as ’SPM + SVM’.

5.2 Debugging

Figure 5: Sample visualizer plots depicting percentage of active bits in an input frame: (a) learning mode, (b) testing mode.
Figure 6: Sample visualizer plots depicting percentage of active HTM outputs: (a) learning mode, (b) testing mode.
Figure 7: Sample visualizer plots depicting changes in inhibition radius value: (a) learning mode, (b) testing mode.
Figure 8: Sample visualizer plots depicting class histograms for videos belonging to the ’sphere’ class.

Debugging a conventional system is a relatively simple and well-established practice. Challenges arise when addressing architectures other than von Neumann ones, such as HTM. In those cases, it is necessary to use unconventional methods and tools, often specifically designed to debug and analyze the particular application. The authors faced the same challenges during the early stages of the system development. The initially designed tool that allowed them to analyze individual columns and synapses did not allow for effective tracing of the module behavior, because the system as a whole behaves statistically. Therefore, the authors decided to develop a tool that can globally profile the whole HTM system. It consists of two modules: an analyzer and a visualizer. The first one stores system operating parameters (e.g. active inputs, inhibition radius), while the second is responsible for data presentation in a form easily understandable to the designer of the HTM system. The debugging methods and tools are part of the system available at online:custom_htm and may be used in experiments. They allow the user to specify a set of parameters which are to be included in a final report. Sample visualizations are shown in Figs. 5 to 8.

It is worth emphasizing that the tool contributed greatly to the development of the presented system; in the authors’ opinion, it would have been hard to bring the module to its final shape without it. Adjusting the number of columns and synapses is an example of a challenge the authors addressed with the tool. During development, the system yielded wrong results, classifying nearly all videos into a single class. The cause was hard to determine, but close examination showed that input data coverage was insufficient, which in turn was caused by the low number of synapses per column. The authors found the cause by scrutinizing a plot of the overlap value generated by the visualizer: for certain input data, the overlap dropped to zero even though those data items had been presented to the network during training. The number of synapses was then increased, which alleviated the problem by reducing the number of wrongly classified items. This verification and reasoning flow relied substantially on the debugging tool.
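The zero-overlap symptom described above can be illustrated with a minimal sketch. It assumes a simplified model in which a column's overlap is the number of active input bits sampled by its connected synapses. The names `overlap`, `uncovered_inputs` and the toy data are hypothetical and are not taken from the authors' tool.

```python
def overlap(input_bits, connected_synapses):
    """Count the active input bits sampled by one column's connected synapses."""
    return sum(input_bits[i] for i in connected_synapses)


def uncovered_inputs(inputs, columns, min_overlap=1):
    """Return indices of inputs for which no column reaches min_overlap."""
    missing = []
    for idx, bits in enumerate(inputs):
        best = max(overlap(bits, synapses) for synapses in columns)
        if best < min_overlap:
            missing.append(idx)
    return missing


# Toy data (assumed): two 8-bit binary inputs and two columns.
inputs = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1],
]
sparse_columns = [[0, 1], [2, 3]]        # few synapses: bits 4..7 are never sampled
dense_columns = [[0, 1, 6], [2, 3, 7]]   # more synapses per column

print(uncovered_inputs(inputs, sparse_columns))  # [1]
print(uncovered_inputs(inputs, dense_columns))   # []
```

With too few synapses per column, the second input is invisible to every column, which is the kind of coverage gap that a per-input overlap plot makes apparent.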

It is worth noting that although substantial sections of the source code of the system were ported to GPU for better performance, in debugging mode they are run on the CPU. This increases the simulation time significantly and should be taken into account.

When the system or HTM Core online:custom_htm does not work properly, it is recommended to use the debugging tool. In the authors’ opinion, the best approach is to start with the most critical parameters of the network, especially when the designer has no clue what may be wrong. Gradually, by examining the plots generated for the input, column and output activity, as well as overlap changes, one can draw conclusions about the condition of the system. Further steps may involve examining other parameters, such as the inhibition radius or boost, when it turns out that they may play a substantial role in the system behavior.

5.3 Datasets

Figure 9: Sample (cropped) frames of different shapes and noise levels.
Figure 10: The beginning, middle and end of sample videos.

The challenging part was the generation of sample videos for testing. The videos had to meet a series of requirements concerning object location, camera location and object-camera distance. Consequently, a dedicated application, Blender blender , was used to generate them. Blender provides a Python API, which allowed the authors to automate and randomize the video generation. A series of videos of 960x540 pixels each was generated. Every video contains a single, centered, stationary object with the camera moving around it in all directions and at varying distance. Sample frames are presented in Fig. 9 and 10.

For the experiments, three datasets (available online online:datasets ) based on the rendered videos were created: R16, R8 and R4, with frame sizes of 60x32, 120x66 and 240x134 respectively. Subsequently, Gaussian noise was introduced to the testing videos. Since adding noise at runtime proved to be very time-consuming, a separate script was created that introduces noise to a large set of generated videos.
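An offline noise-injection step of this kind might look like the sketch below. It is not the authors' script: the frame layout (rows of grey levels in 0..255), the `sigma` parameter, the binarization threshold and all function names are assumptions made for illustration. The helper `flipped_fraction` shows one way to express noise as a fraction of distorted bits after binarization.

```python
import random


def add_gaussian_noise(frame, sigma, rng):
    """Return a copy of a grayscale frame with zero-mean Gaussian noise added.

    frame: list of rows, each a list of ints in 0..255 (assumed layout).
    sigma: standard deviation of the noise in grey levels (hypothetical knob).
    """
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in frame
    ]


def binarize(frame, threshold=128):
    """Map each pixel to 1 if it reaches the (assumed) threshold, else 0."""
    return [[1 if p >= threshold else 0 for p in row] for row in frame]


def flipped_fraction(clean_bits, noisy_bits):
    """Fraction of bits that differ between two binarized frames."""
    total = sum(len(row) for row in clean_bits)
    diff = sum(
        a != b
        for row_a, row_b in zip(clean_bits, noisy_bits)
        for a, b in zip(row_a, row_b)
    )
    return diff / total


rng = random.Random(0)  # fixed seed so the noisy dataset is reproducible
clean = [[64] * 8 + [192] * 8 for _ in range(4)]
noisy = add_gaussian_noise(clean, sigma=40.0, rng=rng)
print(flipped_fraction(binarize(clean), binarize(noisy)))
```

Running such a step once and storing the result avoids regenerating the noise on every experiment, which matches the motivation given above.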

5.4 Quality evaluation measure

The F1 score is used as the quality measure for the experimental results presented in this paper. The precision and recall for corresponding clusters are calculated as follows:

$$P(i, j) = \frac{n_{ij}}{n_j}, \qquad R(i, j) = \frac{n_{ij}}{n_i}$$

where $n_{ij}$ is the number of items of class i that are classified as members of cluster j, while $n_j$ and $n_i$ are the numbers of items in cluster j and class i, respectively. The cluster’s F1 score is given by the following formula:

$$F(i, j) = \frac{2 \, P(i, j) \, R(i, j)}{P(i, j) + R(i, j)}$$

The overall quality of the classification can be obtained by taking the weighted average of the F1 scores over all classes. It is given by the equation:

$$F = \sum_{i} \frac{n_i}{n} \max_{j} F(i, j)$$

where the maximum is taken over all clusters and n is the number of all objects. The F1 score value ranges from 0 to 1, with a higher value indicating higher clustering quality.
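As a minimal sketch, the measure described above (per-class best F1 over all clusters, weighted by class size) can be computed as follows. The function name `clustering_f1` and the toy labels are illustrative assumptions, not part of the authors' code.

```python
from collections import Counter


def clustering_f1(classes, clusters):
    """Weighted F1: for each class take the best F1 over clusters, weight by class size."""
    n = len(classes)
    n_i = Counter(classes)                 # items per class
    n_j = Counter(clusters)                # items per cluster
    n_ij = Counter(zip(classes, clusters)) # items of class i placed in cluster j
    total = 0.0
    for i, size_i in n_i.items():
        best = 0.0
        for j, size_j in n_j.items():
            hits = n_ij[(i, j)]
            if hits == 0:
                continue  # precision and recall are both zero for this pair
            precision = hits / size_j
            recall = hits / size_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += size_i / n * best
    return total


true_labels = ["cube", "cube", "cone", "cone"]
predicted = ["cube", "cube", "cone", "cube"]
print(round(clustering_f1(true_labels, predicted), 3))  # 0.733
```

In the toy example the ’cube’ class reaches F1 = 0.8 and the ’cone’ class 2/3, each weighted by 1/2, which gives approximately 0.733.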

5.5 Results

Noise level    F1 score
               SVM     SPS + SVM    SPM + SVM
0              0.78    0.76         0.87
4.25           0.78    0.74         0.87
8.5            0.77    0.74         0.86
Table 4: Experiments results for reduction rate = 4

Parameter      SPS + SVM            SPM + SVM
               Value   F1 score     Value   F1 score
Columns        4096    0.81         4096    0.89
Synapses       256     0.92         256     0.94
Min overlap    4       0.95         6       0.96
set size       12      0.81         28      0.89
Table 5: Best results obtained for reduction rate = 4 and

Tab. 4 shows performance results of SVM and both the single and modular SP versions for various noise levels, with the reduction rate fixed to 4. The remaining parameters of the setup used for the tests were set to the standard values presented in Tab. 3. It should be noted that the ’Multiple HTMs’ setup outperforms the ’Single HTM’ one. The best results achieved for different sets of parameters are included in Tab. 5 in order to show the maximum capabilities of the examined setup.

In some cases, the results of both setups are almost equal, although ’Multiple HTMs’ is slightly superior. Analyzing the results shown in Tab. 4, one easily notices that the highest F1 scores are reached for 256 synapses and 4096 columns. Value of also affects the quality of the results and in the case of this setup achieves the best performance. Both Tab. 4 and Fig. 13 show that has negligible effect on the overall F1 score.

Class       Cosine similarity ratio [x]
            of noise bits    of noise bits
cone        11.56            3.54
cube        18.14            4.27
cylinder    13.49            4.86
monkey      8.19             3.16
sphere      11.69            3.54
torus       11.85            2.77
all         12.64            3.73
Table 6: Histograms cosine similarity ratio

Tab. 6 shows how efficient the SP module is in noise reduction. The values in the columns of the table represent the cosine similarity ratio of the signal before and after SP processing for all the video categories and their average value. Cosine similarities were calculated between video histograms yielded by the system for clean video () and the ones with Gaussian noise introduced. Average values were calculated for each class separately as well as for all the videos combined, discarding samples for which cosine similarity could not be calculated. It should be noted that with increasing levels of noise, SP filtration efficiency decreases, which is expected. In order to increase or maintain its previous efficiency of noise reduction, some of the macro parameters of the system should be tuned to increase selectivity of the module and boost noise elimination.
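The averaging step, including the discarding of undefined samples, can be sketched as follows. This is an illustration under stated assumptions: cosine similarity is treated as undefined when either histogram is all zeros, and the function names and toy histograms are hypothetical. How the ratio reported in Tab. 6 is formed from these similarities is not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two histograms, or None if undefined."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return None  # an all-zero histogram has no direction
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mean_similarity(pairs):
    """Average similarity over (clean, noisy) pairs, skipping undefined ones."""
    values = [
        s for s in (cosine_similarity(a, b) for a, b in pairs) if s is not None
    ]
    return sum(values) / len(values) if values else None


# Toy (clean, noisy) histogram pairs; the last pair is discarded as undefined.
pairs = [
    ([1, 0, 2], [1, 0, 2]),
    ([1, 0, 0], [0, 1, 0]),
    ([0, 0, 0], [1, 1, 1]),
]
print(mean_similarity(pairs))
```

Returning `None` instead of raising keeps the averaging loop simple and makes the discarded samples explicit.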

Figure 11: F1 score for different noise levels and reduction ratio: (a) reduction rate = 16, (b) reduction rate = 8, (c) reduction rate = 4, (d) average.

Fig. 11 presents performance in terms of F1 score for different noise levels. It is worth noting that the performance of the system depends on both the noise level and the reduction rate. As expected, a larger reduction ratio degrades the performance of SP + SVM because less data is available. Furthermore, image preprocessing operations, especially binarization, introduce distortions which are proportional to the noise level. The SP + SVM setup is particularly sensitive to relocation distortions, i.e. changes in the spatial positions of pixels in the data fed into the module.

Figure 12: F1 score as a function of different reduction rate ratio: (a) SVM, (b) SPS + SVM, (c) SPM + SVM, (d) average.

Fig. 12 shows the superiority of the SPM + SVM setup for videos of lower reduction rate (i.e. containing more pixels). This results from the SP property of generating invariant representations, which becomes more apparent when frames containing more data are provided as an input stream. It is worth emphasizing that increasing the number of pixels hinders the performance of the pure SVM setup. It may be expected that further growth of the input data size, along with the application of a dedicated SP encoder Purdy , may increase the superiority of the SP + SVM setup.

Figure 13: F1 score as a function of different configuration parameters (using R4 dataset). Italicized labels denote reference lines calculated for basic configuration.

Fig. 13 presents SP performance variation as a response to key parameter changes. The impact of the number of synapses and columns is the most significant, and it may be noticed that a rise of these parameters leads to better performance of the SP-based system. It turns out that value of contributes significantly to the performance, as opposed to which has very little or no impact on F1 score of the system.

As depicted in Fig. 11-13, two different variants of SP introduction were examined, namely SPS and SPM. It is worth noting that ’Multiple HTMs’ (SPM) tends to outperform ’Single HTM’ (SPS) in terms of both F1 score and the pace of plateau convergence. However, this comes at the expense of the larger computational cost of computing multiple instances of SP in ’Multiple HTMs’ (SPM).

To the authors’ knowledge, it is hard to find papers which directly correspond to the research conducted in this work (i.e. video classification in noisy video streams). Nevertheless, we examined the following papers: Yue ; KarpathyCVPR14 ; Zha ; simonyan2014two-stream , which present results of video classification using the UCF-101 dataset. The best systems presented in those papers are based on various architectures of Convolutional Neural Networks (CNNs) and achieve accuracy of 80% or more. It is worth emphasizing that despite similar performance in terms of result quality, the authors’ test setup differs mostly in the dataset used. Preliminary comparative experiments have been conducted with Temporal Stream ConvNet simonyan2014two-stream . For the UCF-11 dataset our solution achieves an F1 score of 0.6194, while Temporal Stream ConvNet (trained on the same amount of data) reached 0.1982; for the shapes dataset, the best F1 score was 0.9544, while Temporal Stream ConvNet yielded 0.2908. It is worth emphasizing that both datasets used for the comparison experiments were relatively small. This highlights the ability of the authors’ solution to perform well despite being trained on small datasets. At the present development stage of the system, due to the computational requirements, it would be hard for the authors to conduct tests with large datasets such as UCF-101.

6 Conclusions and future work

This paper presents the experimental results of using an HTM-based system for object classification in noisy video streams. The authors showed that using SP in the video processing flow improves the object classification ratio by more than 10% and achieves approximately 12 times the noise reduction for a video signal with 13% distorted bits. It was determined that increasing the number of columns and synapses of the SP has a substantial impact on the performance of the system, and the best results were obtained for 256 synapses and 4096 columns.

In future research, the authors plan to expand the work to test the system with more advanced benchmarks featuring real-life video streams. Such experiments were not done for this paper because we needed to verify the operation of the video classification module in ideal conditions and adjust the basic parameters of the system. The authors are also going to replace the binarization operation with a custom-designed encoder and enhance the SVM classifier with a dedicated decoder. Experiments with several stacked SP layers extended with TP are also scheduled as future improvements of the system. However, in order to examine the performance of such configurations, the remaining computationally exhaustive routines will have to be ported to OpenCL and the system will have to be deployed on platforms equipped with GPU or FPGA. This will make it possible to conduct experiments with video of a lower image reduction ratio.


I would like to thank my wife Urszula Wielgosz for her huge contribution to the preparation of the paper.