fastbinning
Non-uniform quantization with linear average-case computation time
A new method is proposed for binning a set of n data values into a set of m bins when the bins are of different sizes. The method avoids running a binary search across all the bins for every data value. It is proven that the method exhibits linear average-case computation time. Experimental results show a speedup factor of over four compared to binning by binary search alone for data values with unknown distributions. This result is consistent with the analysis of the method.
Binning data is about mapping a large set of input data values (such as reals) onto a smaller set of output values or bins. Binning is also often referred to as quantization [1]. Bins are defined by bin boundary values; the size of each bin is the difference between a pair of neighboring bin boundaries. Binning maps a data value into a bin as an integer in the range $[1, m]$ for $m$ bins. For example, time in hours is binned into 12 (or 24) bins, each of equal size of 60 minutes; this is uniform quantization. Binning the age of people into categories such as infant, child, teenager, etc. requires non-uniform bins, that is, bins of different sizes. Quantization is used everywhere in engineering; it is common in analog-to-digital conversion with uniform bins [2], while non-uniform quantization is used for read-voltage level operations in multi-level NAND flash memory [3]. Non-uniform quantization is used in audio coding standards, in the most often used video encoding format and in streaming video internet sources [1], where modern embedded and mobile computer systems are ubiquitous. Non-uniform bins also appear in methods to analyse the performance of embedded systems, e.g. centered bin distribution [4]. In data science, binning is frequently performed on continuous attributes. Here bin boundaries are adjusted by supervised methods on unknown data distributions, leading to non-uniform bins [5], a method also used in cosmology research [6].
Binning data values into uniform bins is straightforward and takes little more than a simple division operation per data value. Binning into non-uniform bins requires more work. A linear search can be used across all the bin boundaries with a computational time cost of $O(m)$ per data value or, more efficiently, a binary search can be performed across the bin boundaries with a computational time cost of $O(\log m)$ per data value [7]. Recent research looks at ways to define the sizes of the non-uniform bins for a particular problem [6, 8] rather than at how to compute the quantization value across a given set of non-uniform bins; the latter is the problem this paper addresses. Binary search is the method used in functions for binning data in numeric computational packages such as Matlab, Python and R.
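As a minimal sketch of the baseline approaches described above (not code from the paper), the following Python functions bin a value into uniform bins with one division, and into non-uniform bins with a linear scan or a binary search. The function names and the age boundaries are hypothetical; the conventions that bin indices start at 1 and that intervals include their left boundary are assumed.

```python
from bisect import bisect_right

def bin_uniform(x, lowest, width):
    """Uniform bins: one subtraction and one division; 1-based bin index."""
    return int((x - lowest) // width) + 1

def bin_linear(x, boundaries):
    """Non-uniform bins by linear scan: O(m) comparisons per value."""
    k = 0
    for b in boundaries:
        if x < b:
            break
        k += 1
    return k

def bin_binary(x, boundaries):
    """Non-uniform bins by binary search: O(log m) comparisons per value."""
    # bisect_right counts how many boundaries are <= x
    return bisect_right(boundaries, x)

# Hypothetical age boundaries in years (illustrative values only)
ages = [0, 2, 13, 20, 65, 120]
print(bin_uniform(135, 0, 60))                       # minute 135, 60-minute bins
print(bin_linear(15, ages), bin_binary(15, ages))    # both searches agree
```

Both search functions return 0 for a value below the first boundary and one more than the number of bins for a value at or above the last boundary.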
This work proposes a method for binning a data value $x$ into non-uniform bins, $B$, which is much faster than binning by binary search. The method performs a one-off pre-computation at the outset. First, it forms new uniform bins, $U$, and then computes a histogram of the bin boundaries of $B$ within $U$. Each data value $x$ is binned in $U$ (taking constant computational time) and this integer result is combined with the pre-computed histogram to complete the calculation of the binning of $x$ in $B$. It is proven that this extra step has linear average time complexity, producing a significant speed-up of the binning process.
Bins are defined as monotonically increasing boundary values $b_i$ such that $b_{i-1} < b_i$ for $1 \le i \le m$ with $m \ge 1$. This defines $m$ bins. Binning a data value $x$ results in an integer value $k$, such that when $b_{k-1} \le x < b_k$, then $B(x) = k$. Note that bin indices start from 1. The binning process outputs integers in $[1, m]$ for data values within the bin boundaries; thus $B(x) = 0$ means $x < b_0$ and $B(x) = m + 1$ means $x \ge b_m$, since bin intervals include the left boundary value. Therefore, the experiments in this paper use data values in the range $[b_0, b_m)$. For non-uniform bins the difference $b_i - b_{i-1}$ is not constant for some $i$, and this set of non-uniform bins is referred to as $B$. In the uniform case the difference is constant for all bins, and the bins are referred to collectively as $U$. Binning $x$ in $U$ is denoted as $U(x)$. This work applies the binning process to a data set $x_j$, for $1 \le j \le n$, with $n \gg m$, where $B$ stays fixed while the data set might be constantly changing (e.g. streaming applications).
Consider $m = 7$ bins with bin boundaries $b_0, \dots, b_7$ shown as dots on the upper line in Fig. 1. Note that the seven bin intervals are not all of the same size. Any data value $x$ within these bins is mapped to an integer in the range $[1, 7]$; this is shown by the number in between the boundary values (over the line in the figure).
Step 1 - set up : Create uniform bin intervals with size ; these are shown as on the lower line, also in Fig. 1. The bin boundaries are shown at small vertical ticks. Note, the extreme boundaries of and are the same.
Step 2 - histogram of in : Next compute the histogram of bin boundaries as using as the bins for the histogram. These are the numbers inside a square at the bottom of the figure. For instance, the bin that goes from to has in it, so a count of while the bin from to has in it, so a count of . Note that histogram is of length ; that is and that are excluded (since extreme values of and are the same) and so only boundaries of go into the histogram so that .
Step 3 - cumulative histogram: We then compute for with (since , we do not bin in but assumed included in the histogram); this is the prefix sum of as or cumulative histogram of length ; since we started with and . The values are the numbers in the circle at the bottom of the figure; they align with . For instance, .
The set up of , and is performed once as a pre-computation before any data is binned. The binning of a data value starts by binning as and then the method works out the bin value from and .
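The three set-up steps can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the names `precompute`, `w`, `H` and `C` are chosen here, uniform bins are indexed from 1, the two extreme boundaries are left out of the histogram, and the cumulative histogram starts at 1. The boundary values are the eight values 2, 11, 19, 20, 21, 27, 29 and 30 used in the worked example.

```python
import math

def precompute(b):
    """Return (w, H, C) for boundaries b[0] < b[1] < ... < b[m]."""
    m = len(b) - 1
    w = (b[m] - b[0]) / m            # Step 1: width of each uniform bin
    H = [0] * (m + 1)                # Step 2: H[1..m]; H[0] is unused padding
    for bj in b[1:m]:                # interior boundaries only
        i = min(m, math.floor((bj - b[0]) / w) + 1)
        H[i] += 1
    C = [1] * (m + 1)                # Step 3: C[0] = 1 accounts for b[0]
    for i in range(1, m + 1):
        C[i] = C[i - 1] + H[i]
    return w, H, C

b = [2, 11, 19, 20, 21, 27, 29, 30]
w, H, C = precompute(b)
print(w)        # 4.0
print(H[1:])    # [0, 0, 1, 0, 3, 0, 2]
print(C)        # [1, 1, 1, 2, 2, 5, 5, 7]
```

The histogram sums to the number of interior boundaries (six here) and the last cumulative value equals the number of bins (seven here).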
Suppose the boundary values $b_0, \dots, b_7$ of $B$ in Fig. 1 are 2, 11, 19, 20, 21, 27, 29 and 30. Then $U$ is set up with boundaries 2, 6, 10, 14, 18, 22, 26 and 30, that is, with bin intervals of size 4 since $(30 - 2)/7 = 4$. The following examples illustrate binning $x$ to the correct $B(x)$.
Consider binning ; and . Look in to find ; then we make the comparison . In this instance and we calculate as ; this is verified by Fig. 1. Note that for , is also 3 but is not true and then is calculated as . With either comparison outcome this case mapping is simple and fast.
When binning an value such that for , proceed similarly to discover that two comparisons are needed to calculate . For instance when in the above example. This case requires constant time.
This case occurs when ; in the example. In this case, we extract the bin boundaries from to form a smaller set of bins ( bins); a binary search within this smaller set of bins gives the information required to determine . For instance, if , , . We obtain . A binary search, returns 1 and so as expected from looking at Fig. 1. The surprising result of this paper is that for the average case and so this case again is performed in constant time on average.
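Processing data with the proposed method follows straightforwardly from the discussion above. We assume $B$ is given as boundary values $b_0, \dots, b_m$ as previously defined. We are binning input values $x_j$ for $1 \le j \le n$, in any order and with any statistics. The output is $B(x_j)$.
Step 1: Set up the bins $U$ as boundaries $u_i = u_{i-1} + w$ for $1 \le i \le m$, with $u_0 = b_0$ and $w = (b_m - b_0)/m$. Note $u_m = b_m$.
Step 2: Compute the histogram $H$, with $H_i$ the count of boundaries $b_k$, $1 \le k \le m - 1$, for which $U(b_k) = i$.
Step 3: Compute the prefix sum of $H$ as $C$, of length $m + 1$, with $C_0 = 1$ and $C_i = C_{i-1} + H_i$ for $1 \le i \le m$.
Step 4: For each $x_j$ obtain $i = U(x_j)$, $h = H_i$ and $c = C_{i-1}$, then do one of:
Case $h = 0$: $B(x_j) = c$.
Case $h = 1$: if $x_j < b_c$ then $B(x_j) = c$ else $B(x_j) = c + 1$.
Case $h = 2$: if $x_j < b_c$ then $B(x_j) = c$
else if $x_j < b_{c+1}$ then $B(x_j) = c + 1$
else $B(x_j) = c + 2$.
Case $h > 2$: $B(x_j) = c + s$, where $s$ is the offset returned by a binary search of $x_j$ in $B'$;
$B'$ is the subset of $B$ where $U(b_k) = i$.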
Steps 1, 2 and 3 are performed once, and so their computational time cost per data value diminishes quickly for large $n$. The subsets $B'$ can be pre-computed as part of Step 2 and are necessary only for those instances where the histogram count is greater than two. This is left to the programming implementation of the method. The cases with histogram counts of 0, 1 and 2 require 0, 1 and 2 comparisons respectively and thus take constant computation time per data item. The remaining case, with a histogram count greater than two, is considered in the next section.
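The four steps and four cases above can be sketched as a small Python class. This is a hedged illustration rather than the authors' implementation: the class name `FastBinner`, its attribute names and the 1-based indexing of the uniform bins are assumptions made here, and instead of materialising a separate subset of boundaries the last case runs the standard library binary search on the relevant slice of the boundary list.

```python
import math
from bisect import bisect_right

class FastBinner:
    """Bins x into non-uniform bins b[0] < ... < b[m], for b[0] <= x < b[m]."""

    def __init__(self, boundaries):
        b = list(boundaries)
        m = len(b) - 1
        self.b, self.m = b, m
        self.w = (b[m] - b[0]) / m                 # Step 1: uniform bin width
        self.H = [0] * (m + 1)                     # Step 2: H[1..m]
        for bj in b[1:m]:                          # interior boundaries only
            self.H[self._u(bj)] += 1
        self.C = [1] * (m + 1)                     # Step 3: prefix sum, C[0] = 1
        for i in range(1, m + 1):
            self.C[i] = self.C[i - 1] + self.H[i]

    def _u(self, x):
        """Constant-time uniform binning, 1-based and clamped to m."""
        return min(self.m, math.floor((x - self.b[0]) / self.w) + 1)

    def bin(self, x):
        b = self.b
        i = self._u(x)
        c, h = self.C[i - 1], self.H[i]            # Step 4
        if h == 0:                                 # no boundary in this U bin
            return c
        if h == 1:                                 # one comparison
            return c if x < b[c] else c + 1
        if h == 2:                                 # two comparisons
            if x < b[c]:
                return c
            if x < b[c + 1]:
                return c + 1
            return c + 2
        # more than two boundaries: binary search restricted to b[c : c + h]
        return bisect_right(b, x, c, c + h)

binner = FastBinner([2, 11, 19, 20, 21, 27, 29, 30])
print([binner.bin(x) for x in (7, 10, 12, 19.5, 28)])   # [1, 1, 2, 3, 6]
```

Using the same `_u` function for the boundaries and for the data values keeps the method consistent under floating-point rounding, since a boundary and a data value that land in different uniform bins are then always ordered correctly.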
As boundary values of are defined such that then has intervals greater than zero since . The range of values can be divided into intervals of equal size with boundary values with and for . These are the uniform bins of with the span of . Binning a value in , , takes constant time by mapping into integer as .
Note and with being the smaller value that is mapped to since . This implies that for may map into any integer in the range but as there are bins in and we are only mapping boundaries values then, the pigeon hole principle conditions are not met ( pigeons, holes) to guarantee that at least one boundary value of will be found within a interval (see Fig. 1 where for example). Thus, defining as the count of cases where for we can guarantee that ; this is computed and stored over .
We then compute the cumulative histogram as for with (to include ) so that . Note that are annotated aligned with boundary values. We reason about in these terms. For a value such that we look at the value and that means that the value is greater or equal to the first boundary values of , that is for . Note that also means that lies within the uniform interval that goes from to boundary values of , and so to obtain the actual binning we have to resolve where actually lies within the possible boundary values that also mapped as ; information already available in .
From above, for a value $x$ such that $U(x) = i$ we look at the value $H_i$. When $H_i = 0$, the interval $i$ of $U$ does not contain any boundary value of $B$ and as such $B(x) = C_{i-1}$ by definition. No extra comparison is needed. When $H_i = 1$, the interval of $U$ has one boundary value of $B$ within it. This boundary value is $b_{C_{i-1}}$ and one comparison is needed to determine $B(x)$ as either $C_{i-1}$ or $C_{i-1} + 1$. When $H_i = 2$, the interval of $U$ has two boundary values of $B$ within it. These are the boundary values $b_{C_{i-1}}$ and $b_{C_{i-1}+1}$, so two comparisons are needed to determine $B(x)$ as one of $C_{i-1}$, $C_{i-1} + 1$ or $C_{i-1} + 2$. When $H_i > 2$, we do a binary search of $x$ within the subset of boundary values $b_{C_{i-1}}, \dots, b_{C_{i-1} + H_i - 1}$ to find the offset to be added to $C_{i-1}$ to determine $B(x)$.
The binning problem reduces to how we count all the possible valid solutions of distributing boundaries () across () bins. Define the mappings of non-uniform boundaries to uniform bins by a 2D grid of values . The 2D grid amounts to the global count of bins with valid values. For the whole grid, the amount of bins having a value of is proportional to .
As a simple example consider . All possible solutions where 3 bins add up to 3 are: 300, 210, 201, 120, 102, 111, 030, 021, 012, and 003. All possible solutions can be expressed in a grid, ; 10 rows of 3 slots each. The 30 slots in the grid satisfy ; there are 12 slots with a count of 0, 9 slots with a count of 1, 6 slots with a count of 2 and 3 slots with a count of 3. The counts are the 4, 3, 2, 1 values above where . General expressions are given below.
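The slot counts quoted in this small example can be checked by brute-force enumeration. The sketch below is only a verification aid with hypothetical names (`grid`, `slot_counts`); it lists every way in which three slots can add up to three and tallies how often each value occurs.

```python
from collections import Counter
from itertools import product

def grid(items, bins):
    """All rows of `bins` non-negative integers that add up to `items`."""
    return [row for row in product(range(items + 1), repeat=bins)
            if sum(row) == items]

rows = grid(3, 3)
slot_counts = Counter(value for row in rows for value in row)

print(len(rows))                               # 10 rows
print(sorted(slot_counts.items()))             # [(0, 12), (1, 9), (2, 6), (3, 3)]
print(slot_counts[3] / (len(rows) * 3))        # fraction of slots above 2: 0.1
```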
Each row of the grid satisfies the equation then by the stars and bars approach we see that there are rows to the grid [12].
When the value of a slot in a row on the grid is then slots add up to which implies there are such slots for any value of in then .
Equation,
(1) |
follows by a counting rule on the grid.
Also note that the sum of all possible values for the solutions is,
(2) |
Dividing Eq. (2) by Eq. (1) gives the average value of a slot in the grid as
(3) |
Defining allows writing this relation as,
We can derive a recurrence,
(5) | |||
where .
From Eq. (4), average
(6) |
with . We know and also . Therefore,
(7) |
As then it follows,
(8) |
When this implies for large . Also as .
Back to the small example with , from Eq. (5) and hence . So, P = 0.1 implies that 10% of the 30 slots in the grid are slots with counts greater than 2, that is 3 in this case as seen previously. Using the recurrence from Eq. (5), and for we have that,
(9) |
This is a hint that bins with counts greater than 2 are of around 12.5% () and from Eq. (8) the count is capped to 4 on average.
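The 12.5% figure and the average of 4 can be checked numerically with a stars-and-bars count. The sketch below is an assumption-laden illustration, not the paper's derivation: it takes the fraction of slots holding exactly `k` to be the number of rows with one slot fixed at `k` divided by the total number of rows, and the function names are hypothetical.

```python
from math import comb

def slot_probability(k, items, bins):
    """Fraction of grid slots holding exactly k (requires bins >= 2)."""
    rows = comb(items + bins - 1, bins - 1)
    rows_with_slot_fixed = comb(items - k + bins - 2, bins - 2)
    return rows_with_slot_fixed / rows

def tail_summary(items, bins):
    """Probability of a count above 2, and the mean count given it is above 2."""
    p = [slot_probability(k, items, bins) for k in range(items + 1)]
    p_gt2 = sum(p[3:])
    mean_gt2 = sum(k * pk for k, pk in enumerate(p) if k > 2) / p_gt2
    return p_gt2, mean_gt2

print(tail_summary(3, 3))         # the small example: probability 0.1, mean 3
print(tail_summary(1000, 1000))   # close to 0.125 and 4 for large sizes
```

For large sizes the probabilities of counts 0, 1 and 2 approach 1/2, 1/4 and 1/8, which leaves about 1/8 for counts above 2.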
Let be the maximum number of from mapped to a single bin in . The complexity of the method is where is the count of items processed under the four cases determining . Gather terms together and simplify by putting where and . This makes use of the fact that is a constant for a particular run of data. As there are ways to distribute the computations the average workload is given by:
(10) |
We note that and since it does not matter how the data items are arranged with the bins associated with each of the aggregated terms. Thus,
(11) |
which implies . And since we conclude the method has time average performance.
Two main experiments were conducted on an x86-64 microprocessor. In the first, data were binned within a random set of bins, $B$, with . Twenty million values were used with different random distributions. Five thousand runs were performed to compute $B(x)$, both with a binary search and with the proposed binning method, and the average was then taken. Fig. 2 shows a faster computation with the proposed method, over binary search, as the number of bins increases (). The speedup factor is well over three for large $m$. This is because binary search follows an $O(n \log m)$ time, as expected, while the proposed method exhibits $O(n)$ time on average.
In a second experiment, the data set is kept fixed. Binning by binary search is performed for number of bins; these are typical values for binning when exploring data. Binning with the proposed method is then performed on the same set using bins when setting up ; parameter is varied across the range over thousands of runs. The results, shown in Fig. 3, show that an extra speedup factor is gained for and reaches around a factor of 1.25 when . This simple extra arrangement results in an overall combined speedup factor of over four by essentially having with double the size of bins of .
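The effect of using more uniform bins than non-uniform bins can be illustrated on the eight boundary values 2, 11, 19, 20, 21, 27, 29 and 30 of the earlier worked example. In this sketch the name `boundary_histogram` and the integer `scale` parameter (the number of uniform bins per non-uniform bin) are assumptions made for illustration; the histogram is 0-indexed here for brevity.

```python
import math

def boundary_histogram(b, scale=1):
    """Count interior boundaries of b in scale * m equal-width bins."""
    m = len(b) - 1
    bins = scale * m
    w = (b[m] - b[0]) / bins
    H = [0] * bins
    for bj in b[1:m]:
        H[min(bins - 1, math.floor((bj - b[0]) / w))] += 1
    return H

b = [2, 11, 19, 20, 21, 27, 29, 30]
for scale in (1, 2):
    H = boundary_histogram(b, scale)
    # largest count, and how many uniform bins would need a binary search
    print(scale, max(H), sum(h > 2 for h in H))
```

With one uniform bin per non-uniform bin, one uniform bin holds three boundaries and needs a binary search; with twice as many uniform bins no bin holds more than two boundaries, so every value in this example is binned with at most two comparisons.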
The proposed method for binning a value into non-uniform bins decomposes the process into four base cases, three of which take constant computation time. The fourth case resorts to binary search, but only over bins with ; on average as shown. Other methods that improve over binary search, such as interpolation search, which has an $O(\log \log m)$ time per item, only benefit uniformly distributed data [9]. We did not find a substantial benefit of interpolation search over binary search in the two experiments conducted here. The proposed method is expected to improve processing techniques that use binary search, such as fractional cascading [10], which searches over multiple sequences, or finding interval intersections in gene sequencing [11].
The analysis shows that for large values of $m$ the probability of a bin count greater than 2 is 12.5%, and the probability of a bin count of 0, 1 or 2 is 87.5%. A closed formula for the average of any bin having a count greater than 2 is , for any (see section 4.1.4). By doubling the bins in $U$ compared to $B$, the probability of a bin count being 0, 1 or 2 increases to over 95%. Indeed, the probabilities when doubling the bins in $U$ ( and for large values) are for a bin count of 0, 1 and 2 respectively, with a bin count average of when the method resorts to binary search. With binary search, processing $n$ data items requires $O(n \log m)$ time. With the proposed method, it requires which reduces to , giving a theoretical speedup of . This translates to a speedup of around 5 for , which is consistent with the speedup shown in Fig. 2 and with the analysis in Section 4. By doubling the bins in $U$, the speedup increases by a factor of around 1.26, also consistent with the results.
It was proven that, on average, the binning method presented here for non-uniform quantization runs in linear time, and so it is faster at binning the same data values than binary search. The method gets better as $m$ increases. Empirically, we find a speedup factor of over three for large $m$, a result explained by the theoretical analysis. Extra acceleration is achieved by adding a parameter (essentially doubling the number of bins in the internal mechanism used in the method), for which we report an extra speedup factor of 1.25, which is also consistent with the analysis. This parameter increases the probability of binning a value within non-uniform bins in constant computational time. The method applies equally to real or integer values and it is directly applicable to computing histograms. Results show that binning data with the proposed method performs four times faster than existing methods in streaming applications using standard microprocessors.