1 Introduction
In 1948, Claude Shannon published his seminal paper “A Mathematical Theory of Communication” [18], which is widely recognized as the foundation of the field of Information Theory. This work is still considered a key reference in the area, as it describes the concept of information and how to measure it; it is arguably more relevant today than ever, given the widespread development of network communications. As information is susceptible to corruption by external factors (e.g., noise), Coding Theory can be leveraged to detect and correct errors [14]. It is worth mentioning that Coding Theory goes beyond error correction, as it can be used for many other purposes, namely: quantum computing [6], biological systems [1, 15], data compression [13, 2], cryptography [16, 11], network coding [12], or secret sharing [17, 7], among others.
For practical reasons, the most commonly employed codes are linear codes, i.e., vector subspaces $C$ of dimension $k$ within a vector space of dimension $n$. In addition to $n$ and $k$, a crucial parameter to be considered is the Hamming minimum distance $d$ of the vector subspace $C$, since it strongly determines the error detection and error correction capabilities. In other words, if a corrupted word is received, the most likely codeword that was sent is the one in $C$ closest to the received word, according to the mentioned metric. If the minimum distance of a linear code is $d$, up to $d-1$ errors can be detected and up to $\lfloor (d-1)/2 \rfloor$ errors can be corrected. The knowledge of $d$ is crucial not only to detect and correct errors, but also in the other applications mentioned above.
The scientific literature contains fast algorithms for computing (or bounding) the minimum distance of linear codes with a complex structure, such as the Reed-Solomon codes or the BCH codes [14]. In contrast, computing the minimum distance of random linear codes is an NP-hard problem. As of today, the fastest algorithm for computing this minimum distance is the so-called Brouwer-Zimmermann algorithm [20], which was described and updated by Grassl [8]. The commercial software MAGMA [5] contains an implementation of this algorithm over any field. The public-domain software GAP [9, 3] contains an implementation of this algorithm over $\mathbb{F}_2$ and $\mathbb{F}_3$. A family of implementations of this algorithm over $\mathbb{F}_2$ with much higher performance has been recently developed [10]. These implementations achieved higher performance on both serial computers (unicore processors) and shared-memory architectures with multiple/multicore processors by exposing and efficiently exploiting thread-level and data-level parallelism by means of vector instructions.
However, the cost of computing the minimum distance of large random linear codes is extremely high even when employing shared-memory architectures with several cores. As the number of cores in these architectures is limited, a new approach is presented in this paper. We introduce several new efficient implementations that can be employed in distributed-memory architectures with hundreds (or even thousands) of cores. Experiments show that the new implementations are scalable and can compute the minimum distance of random linear codes much faster than current optimized shared-memory implementations. In fact, the computation of the minimum distance of a random linear code with the public-domain GAP (Guava) software took 5 days, whereas the computation of the same distance took only 5 minutes with our new implementations.
This article is organized as follows: Section 2 introduces the necessary background in order to make this article self-contained. Section 3 describes several new algorithms and implementations for distributed-memory architectures. Section 4 presents the performance of the new algorithms, and compares the results with current software. Section 5 contains the conclusions.
2 Background
The objective of this section is twofold: first, to provide the necessary mathematical tools employed in the rest of the manuscript; second, to review the new fast implementations for computing the minimum distance of a random linear code introduced in [10].
2.1 Mathematical Background
Let $p$ be a prime number and $q$ a power of it. We denote by $\mathbb{F}_q$ the finite field with $q$ elements. By definition, a linear code $C$ is a vector subspace of $\mathbb{F}_q^n$. The dimension of $C$ as a vector subspace is denoted by $k$ and is referred to as the dimension of the linear code. The encoding is done via the generator matrix, i.e., a $k \times n$ matrix denoted by $G$, whose rows form a basis of $C$. After elementary row operations and column permutations, any generator matrix can be written in the systematic form $G = (I_k \mid A)$, where $I_k$ is the identity matrix of dimension $k$, and $A$ is a $k \times (n-k)$ matrix. To encode the information, it is only needed to multiply the message vector by $G$, which introduces $n-k$ new symbols that will eventually help to detect and correct errors. In the decoding process, if a corrupted word is received, it is replaced by the closest codeword, in case it is unique. Therefore, a metric to measure the closeness is needed. In coding theory, the most common metric is the Hamming distance. Given two vectors $x, y \in \mathbb{F}_q^n$, the Hamming distance between $x$ and $y$ is:
$$d(x, y) = |\{\, i : x_i \neq y_i \,\}|.$$
Then, the minimum distance of a linear code $C$ is:
$$d(C) = \min \{\, d(x, y) : x, y \in C,\ x \neq y \,\}.$$
Since the code is linear, the minimum distance coincides with the minimum weight:
$$\mathrm{wt}(C) = \min \{\, \mathrm{wt}(c) : c \in C,\ c \neq 0 \,\},$$
where $\mathrm{wt}(c) = d(c, 0)$.
For the sake of simplicity, in the following we will use $d$ instead of $d(C)$, and $d$ will refer to either the minimum distance or the minimum weight. A linear code with minimum distance $d$ can detect up to $d-1$ errors and can correct up to $\lfloor (d-1)/2 \rfloor$ errors. Computing the parameter $d$ for a random linear code is an NP-hard problem [19], and the corresponding decision problem is NP-complete.
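As an illustration of these definitions, the following Python sketch enumerates every codeword of a small binary code by brute force and checks that the minimum distance and the minimum weight coincide. The generator matrix (a systematic form of the [7,4] binary Hamming code) and all function names are assumptions chosen for illustration; they are not taken from the implementations discussed in this article.

```python
from itertools import product

def hamming_distance(x, y):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def weight(x):
    """Hamming weight, i.e., the distance to the zero vector."""
    return sum(a != 0 for a in x)

def codewords(G):
    """Yield all codewords of the binary code spanned by the rows of G."""
    k, n = len(G), len(G[0])
    for msg in product((0, 1), repeat=k):
        yield tuple(sum(m * G[i][j] for i, m in enumerate(msg)) % 2
                    for j in range(n))

# Illustrative generator matrix in systematic form (I_4 | A).
G = [
    (1, 0, 0, 0, 1, 1, 0),
    (0, 1, 0, 0, 1, 0, 1),
    (0, 0, 1, 0, 0, 1, 1),
    (0, 0, 0, 1, 1, 1, 1),
]

words = list(codewords(G))
min_weight = min(weight(c) for c in words if any(c))
min_dist = min(hamming_distance(x, y)
               for x in words for y in words if x != y)
print(min_weight, min_dist)
```

For this small code both values are 3, so up to 2 errors can be detected and 1 error can be corrected. The brute-force enumeration visits all $q^k$ codewords, which is only feasible for very small dimensions.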
To this day, the fastest algorithm for computing the minimum distance of a random linear code is the so-called Brouwer-Zimmermann algorithm [20], which was described and slightly modified in [8]. This method is outlined in Algorithm 1. The key components in this algorithm are the information sets, i.e., subsets of $k$ indices such that the corresponding columns of the generator matrix are linearly independent. There are several approaches for finding the maximum number of disjoint information sets. Despite its crucial role, the computational cost of this step is negligible compared with the rest of the algorithm. Assume that one has found disjoint information sets , . For each disjoint information set , a matrix in systematic form can be obtained. In addition to that, there are columns in that do not form an information set, i.e., the corresponding columns have rank strictly less than $k$, say . After elementary row operations and column permutations, the following matrix can be obtained:
Once the matrices have been computed, the process of enumerating codewords can proceed. This process is as follows: A lower bound is initialized to one, and an upper bound is initialized to . First, all the linear combinations of the form , where are generated. For each linear combination, if its weight is smaller than , the upper bound is updated to the new weight. After processing all the vectors of weight one, the lower bound is increased in units. Then, if , the minimum weight is and the algorithm stops. Otherwise, this same process is repeated for all vectors such that , then , and so on, until , in which case is the minimum distance.
2.2 Optimized algorithms
Hernando et al. [10] introduced several new algorithms and implementations that were faster than both current commercial software and public-domain software, including accelerated versions of both the brute-force algorithm and the Brouwer-Zimmermann algorithm. These algorithms were designed for both unicore systems and shared-memory architectures (multicore and multiprocessors). Since the new parallelizations for distributed-memory computers are based on and make heavy use of these algorithms, we proceed first with a brief description of these serial versions. The reader can refer to [10] for an in-depth description and detailed analysis.
The focus of all the algorithms is the generation of all the linear combinations since it is the most compute-intensive part. To simplify the notation, the next descriptions and algorithms do not show the updates of the lower and upper bounds ( and , respectively), nor the termination condition. All the algorithms stop as soon as the lower bound is equal to or larger than the upper bound after processing a matrix.
A common first step is the computation of the matrices out of the generator matrix . As the cost of this part is usually much lower than that of the rest of the algorithm, and its results can greatly affect the overall computational cost of the algorithm, several random permutations are applied to the original generator in order to find one permutation with both the largest number of full-rank matrices and the largest rank in the last matrix.
Once the matrices have been computed, the basic goal of the Brouwer-Zimmermann algorithm is simple: For every matrix, the additions of all the combinations of its rows taken one at a time must be computed, and then the minimum of the weights of those additions must be computed. This process is then repeated taking successively two rows at a time, then three rows at a time, etc. After processing each matrix in each of these stages, the lower and upper bounds are checked, and this iterative process finishes as soon as the lower bound is equal to or larger than the upper bound.
2.2.1 Serial algorithms
Basic algorithm.
This is a straightforward implementation of the Brouwer-Zimmermann algorithm. Let us say that a matrix has rows; then the basic algorithm generates all the combinations of the rows taken with an increasing number of rows. For every generated combination, the rows of this combination are added, and the overall minimum weight is updated. This method is outlined in Algorithm 2. The Get_first_combination() function returns true and the row indices of the first combination, such as . In this algorithm and the following ones, a combination is represented as a sequence of row indices, where the first row index is zero. The Get_next_combination( ) function receives a combination and returns both true and the next one if there is one. The Process_combination( , ) function computes the weight of the addition of the rows of with indices in . Besides, it updates the lower and upper bounds if needed. Though in this algorithm the order in which the combinations are generated is not important, the lexicographical order was employed to reduce the number of cache misses.
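The basic scheme can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of [10]: the code is binary, each row is packed into one Python integer so that the addition of rows is a bitwise XOR, the names `ROWS`, `popcount`, and `basic_min_weight` are hypothetical, and the bound updates and the termination test are omitted.

```python
from itertools import combinations

def popcount(x):
    """Hamming weight of a binary vector packed into an integer."""
    return bin(x).count("1")

def basic_min_weight(rows, g):
    """Minimum weight over the additions of all combinations of g rows."""
    best = None
    # itertools.combinations yields index tuples in lexicographical order.
    for comb in combinations(range(len(rows)), g):
        acc = 0
        for i in comb:              # add the g rows from scratch
            acc ^= rows[i]
        w = popcount(acc)
        if best is None or w < best:
            best = w
    return best

# Illustrative 4 x 7 binary matrix in systematic form, one integer per row.
ROWS = [0b1000110, 0b0100101, 0b0010011, 0b0001111]

print([basic_min_weight(ROWS, g) for g in range(1, 5)])
```

Every combination is added from scratch here, which is precisely the redundant work that the following variants try to avoid.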
Optimized algorithm.
In the lexicographical order, the only difference between each combination and the next one (or the previous one) is usually the last element; hence, this algorithm reduces the number of additions by saving and reusing the addition of the first rows. The outline of this algorithm is very similar to the previous one. The main difference is that combinations are generated with rows instead of rows. Therefore, the new Process_combination( , ) function must perform the following two tasks: First, it adds and saves the combination with the rows with indices in . Then, it builds all the combinations with rows by adding the corresponding rows to the previous addition, thus saving a considerable number of row additions (compute operations) and row accesses (memory operations).
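A minimal Python sketch of this idea follows, assuming a binary code with rows packed into integers (so that adding rows is a bitwise XOR). The symbol `g` for the number of rows in a combination and the function names are assumptions made for illustration.

```python
from itertools import combinations

def popcount(x):
    """Hamming weight of a binary vector packed into an integer."""
    return bin(x).count("1")

def optimized_min_weight(rows, g):
    """Minimum weight over all additions of g rows, reusing partial sums."""
    k = len(rows)
    best = None
    # Generate combinations with g - 1 rows only.
    for head in combinations(range(k), g - 1):
        partial = 0
        for i in head:              # computed once per head, then reused
            partial ^= rows[i]
        start = head[-1] + 1 if head else 0
        # One extra addition per completed combination.
        for last in range(start, k):
            w = popcount(partial ^ rows[last])
            if best is None or w < best:
                best = w
    return best

ROWS = [0b1000110, 0b0100101, 0b0010011, 0b0001111]
print([optimized_min_weight(ROWS, g) for g in range(1, 5)])
```

Heads whose last index is the final row produce no completion, so they are skipped implicitly by the empty inner loop.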
Stack-based algorithm.
The goal of this algorithm is to further reduce the number of additions needed to compute the addition of the rows, performed in each iteration of the while loop, by using a stack with vectors of dimension and the lexicographical order. The stack, which only requires a few KB, stores data progressively. When the combination , where is a row index, is being processed, the stack contains the following elements (incremental additions): row , the addition of rows , the addition of rows , …, and finally the addition of rows . The main savings of this algorithm are obtained in the computation of the addition of the rows because the top of the stack contains that information. Then, when computing the next combination of rows, the contents of the stack must be rebuilt from the leftmost index that has changed between the current combination and the next one.
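The stack mechanism can be sketched as follows in Python, again assuming a binary code with rows packed into integers. The names are hypothetical and the sketch assumes that `g` is between 1 and the number of rows; `stack[j]` holds the addition of the first `j + 1` rows of the current combination, and only the entries from the leftmost changed index onwards are rebuilt.

```python
def popcount(x):
    """Hamming weight of a binary vector packed into an integer."""
    return bin(x).count("1")

def stack_min_weight(rows, g):
    """Minimum weight over all additions of g rows, using a stack of sums."""
    k = len(rows)
    comb = list(range(g))           # first combination in lexicographical order
    stack = [0] * g                 # stack[j] = rows[comb[0]] ^ ... ^ rows[comb[j]]

    def rebuild(start):
        # Recompute the incremental additions from position `start` onwards.
        for j in range(start, g):
            below = stack[j - 1] if j > 0 else 0
            stack[j] = below ^ rows[comb[j]]

    rebuild(0)
    best = popcount(stack[-1])
    while True:
        # Find the rightmost index that can still be incremented.
        j = g - 1
        while j >= 0 and comb[j] == k - g + j:
            j -= 1
        if j < 0:
            return best
        comb[j] += 1
        for t in range(j + 1, g):
            comb[t] = comb[t - 1] + 1
        rebuild(j)                  # entries to the left of j are reused
        best = min(best, popcount(stack[-1]))

ROWS = [0b1000110, 0b0100101, 0b0010011, 0b0001111]
print([stack_min_weight(ROWS, g) for g in range(1, 5)])
```

In most steps only the last index changes, so a single addition on top of the stack is enough.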
Algorithm with saved additions.
The key to this algorithm is the efficient composition of combinations with up to elements to build combinations with larger number of elements. If with numbers and such that , the addition of the rows of the combination with indices can be computed as the addition of the rows of the combination (called left combination) and the combination (called right combination). In this way, with just one addition the desired result can be obtained if the additions of the combinations with up to at least rows have been previously saved. Therefore, if , to obtain the combinations of rows taken at a time, the combinations of rows taken at a time (left combinations) and the combinations of rows taken at a time (right combinations) must be composed. However, not all those combinations have to be processed since there are some restrictions. These restrictions to the combinations must be applied efficiently to accelerate this algorithm since otherwise an important part of the performance gains could be lost.
The outline of the method is very different from the previous ones since it can be implemented as a recursive algorithm (see [10] for further details). The data structure that stores the saved additions of the combinations of the rows of every matrix must be built efficiently. Otherwise, the algorithm could underperform for matrices that finish after only a few generators. For every matrix, this data structure contains several levels (), where level contains all the combinations of the rows of the matrix taken at a time. An efficient way to build it is to use the previous levels of the data structure to build the current level.
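The composition of saved additions can be sketched as follows. This is a simplified, non-recursive Python illustration under several assumptions: the code is binary with rows packed into integers, `s` denotes the largest saved level, only one composition step is performed (so `g` may not exceed `2 * s`), and the restriction applied is that every index of the right combination must be larger than the last index of the left combination. The names and the dictionary-based storage are hypothetical, and the filtering here is not done efficiently.

```python
def popcount(x):
    """Hamming weight of a binary vector packed into an integer."""
    return bin(x).count("1")

def build_saved(rows, s):
    """saved[level] maps each combination of `level` rows to its addition."""
    k = len(rows)
    saved = {1: {(i,): rows[i] for i in range(k)}}
    for level in range(2, s + 1):
        # Each level is built from the previous one with a single addition.
        saved[level] = {
            comb + (i,): acc ^ rows[i]
            for comb, acc in saved[level - 1].items()
            for i in range(comb[-1] + 1, k)
        }
    return saved

def saved_min_weight(rows, g, s):
    """Minimum weight over all additions of g rows using saved additions."""
    saved = build_saved(rows, s)
    if g <= s:
        return min(popcount(v) for v in saved[g].values())
    if g > 2 * s:
        raise ValueError("this sketch composes only one left and one right part")
    best = None
    for left, lsum in saved[s].items():
        for right, rsum in saved[g - s].items():
            if right[0] > left[-1]:         # keep indices strictly increasing
                w = popcount(lsum ^ rsum)   # a single addition per combination
                if best is None or w < best:
                    best = w
    return best

ROWS = [0b1000110, 0b0100101, 0b0010011, 0b0001111]
print([saved_min_weight(ROWS, g, 2) for g in range(1, 5)])
```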
Algorithm with saved additions and unrolling.
This algorithm reduces the number of memory accesses (not additions) by processing several left combinations at the same time (called unrolling). To do that, right combinations will be reused when brought from main memory. For instance, by processing two left combinations at the same time, the number of data being accessed can be nearly halved since each accessed right combination is used twice (one time for every one of the two left combinations), thus doubling the ratio of vector additions to vector accesses. However, this technique is more effective when the two left combinations must be composed with the same subsets. To achieve that, the rightmost element of the two left combinations must be the same. As in the lexicographical order the rightmost index always changes, a variant was employed.
Vectorization and other implementation details
The main advantage of hardware vector instructions is to be able to process many elements simultaneously by using data stored in large vector registers. Although the length is usually smaller than a few hundred (very small compared with the size of modern vector registers), an efficient vectorization was achieved. Four-byte integers were employed to store data, thus packing 32 elements into each integer and hence leveraging vector instructions to boost performance on modern and legacy computing architectures.
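The packing of 32 binary elements into each four-byte integer can be illustrated with the scalar Python sketch below. It only shows the data layout and the word-wise operations (XOR for the addition, bit counting for the weight); it does not use hardware vector instructions, and the function names and the bit order inside each word are assumptions.

```python
WORD_BITS = 32

def pack(bits):
    """Pack a list of 0/1 values into a list of 32-bit words."""
    words = []
    for start in range(0, len(bits), WORD_BITS):
        word = 0
        for offset, bit in enumerate(bits[start:start + WORD_BITS]):
            word |= (bit & 1) << offset
        words.append(word)
    return words

def add_packed(a, b):
    """Addition over the binary field: one XOR per 32 elements."""
    return [x ^ y for x, y in zip(a, b)]

def weight_packed(a):
    """Hamming weight: sum of the bit counts of all the words."""
    return sum(bin(w).count("1") for w in a)

x = pack([1, 0] * 20)       # 40 elements -> 2 words
y = pack([1] * 40)
print(weight_packed(add_packed(x, y)))
```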
2.2.2 Parallel algorithms for shared-memory architectures
The parallelization of the basic, optimized, and stack-based algorithms does not render good results because of the restrictions of the loop sizes and the size of the critical regions in comparison with the amount of work that can be simultaneously executed. See Hernando et al. [10] for more details.
In contrast, the parallelization of the two algorithms with saved additions is much easier and more effective. However, to create a large-grain parallelism, it must be parallelized only for the first level of the recursion. Though we used a small critical region for updating the overall minimum weight (a reduction operation), the impact of this critical region was minimized because of the small computational cost of the operation and by making every thread work with local variables throughout its execution, and by updating the global variables just once at the end. In the parallelized codes, OpenMP [4] was employed.
3 New algorithms and implementations for distributed-memory architectures
This section describes several new algorithms for distributed-memory architectures. Two families of algorithms have been developed: The first family employs a dynamic distribution of tasks, whereas the second family employs a static distribution of tasks. Each family comprises four different algorithms.
3.1 Distributed-memory algorithms outline
The main outline of all the distributed algorithms is common, and can be found in Algorithm 3. Note how the structure is similar to the Brouwer-Zimmermann algorithm (see Algorithm 1) and the serial algorithm (see Algorithm 2); however, there are subtle differences that need to be described. The main difference is that the distributed algorithm requires an additional parameter called prefix size. The while loop of the distributed algorithm generates and processes all the combinations of rows taken at a time, instead of at a time. Each combination with elements is the atomic task of the distributed algorithms, where each one can be assigned to a different process in a distributed implementation. A task or combination with elements is also called a prefix, as the combination will be employed to generate all the combinations with elements starting with , assuming . The Get_first_combination_with_… function and the Get_next_combination_with_… function are similar to those with similar names in previous algorithms, but in this case they process combinations with elements.
The second difference lies in the method employed to process combinations, called Process_prefix. This method computes the addition of all the combinations with rows of starting with the received combination with rows. To improve performance, the addition of the rows in the received combination (prefix) must be computed first. Then, the proper additions with rows must be computed and added to the previous addition. Obviously, when , no parallel work is generated since the prefix size is larger than the number of elements in a combination. In this case, no prefix distribution is performed, and prefix replication is done instead.
For example, if , , the lexicographical order is employed, and the combination is assigned; this method must compute all the combinations with 5 elements starting with , that is, , , , , , , , etc. To save work, first the addition of the rows must be computed, and then the addition of all the combinations with rows starting at least with the index row 3 (the rightmost element of plus one) must be computed.
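The work done for one prefix can be sketched in Python as follows. This is only an illustration under stated assumptions: a binary code with rows packed into integers, a hypothetical function name `process_prefix`, `g` standing for the number of rows in a full combination, and a plain enumeration in place of the node engines described later.

```python
from itertools import combinations

def popcount(x):
    """Hamming weight of a binary vector packed into an integer."""
    return bin(x).count("1")

def process_prefix(rows, prefix, g):
    """Minimum weight over all combinations of g rows starting with prefix."""
    k = len(rows)
    base = 0
    for i in prefix:                # addition of the prefix rows, done once
        base ^= rows[i]
    best = None
    # Remaining indices start at the rightmost prefix element plus one.
    for tail in combinations(range(prefix[-1] + 1, k), g - len(prefix)):
        acc = base
        for i in tail:
            acc ^= rows[i]
        w = popcount(acc)
        if best is None or w < best:
            best = w
    return best

ROWS = [0b1000110, 0b0100101, 0b0010011, 0b0001111]
print(process_prefix(ROWS, (0, 1), 3), process_prefix(ROWS, (1, 2), 3))
```

With this toy matrix, the prefix `(0, 1)` is completed with two different rows whereas the prefix `(1, 2)` admits a single completion, which already hints at the uneven cost of tasks.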
Although the rightmost element in a combination of rows can be up to , the rightmost element of a prefix with rows () must be smaller than or equal to , because no valid combination of elements taken at a time can be formed with a value larger than in the position . For instance, if , , , the rightmost element of valid prefixes must be smaller than or equal to 4, since obviously no valid combination of 6 () elements taken 4 () at a time can be formed starting with a prefix such as .
It is important to study in detail the effect of the prefix size on the number of tasks and on the computational cost of tasks.
3.1.1 Impact of the prefix size on the number of tasks
First, the effect of the prefix size on the number of tasks is explored. The number of tasks is for every value of being processed (every iteration of the For loop). In fact, when working with usual values of and ( is usually smaller than a few hundreds and is usually smaller than 20), the smaller the prefix size, the fewer tasks will be generated. With those values, just by increasing the prefix size by one, the number of prefixes (and tasks) to be processed is increased by nearly one order of magnitude.
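The growth in the number of tasks can be checked numerically with a short Python sketch. It assumes that valid prefixes are exactly the combinations of `s` indices whose rightmost element respects the restriction described above, i.e., combinations taken from the first `k - (g - s)` rows; the function name and the sample values of `k`, `g`, and `s` are illustrative assumptions, not values from the experiments.

```python
from math import comb

def num_prefixes(k, g, s):
    """Number of valid prefixes of size s for combinations of g out of k rows."""
    if s >= g:
        return comb(k, g)           # degenerate case: the prefix is the combination
    return comb(k - (g - s), s)

# Effect of increasing the prefix size by one for illustrative values.
for s in (2, 3, 4, 5):
    print(s, num_prefixes(50, 10, s))
```

For these sample values, going from `s = 3` to `s = 4` multiplies the number of prefixes by about 11.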
3.1.2 Impact of the prefix size on the computational cost of tasks
A prefix size requires that in every task only the proper combinations of elements are composed with the prefix. Therefore, the smaller the prefix size, the larger the computational cost of tasks will be, and vice versa.
It is interesting to note that the cost of processing a prefix is very heterogeneous, and it strongly depends on the rightmost element in the prefix, since it determines the number of combinations with elements to be generated starting with the prefix (or combination with elements). Obviously, the smaller the rightmost element of the prefix, the more combinations with elements must be generated and processed, and the larger the rightmost element of the prefix, the fewer combinations with elements must be generated and processed. For instance, if , , and , the cost of processing the prefix is much larger than the cost of processing the prefix , since the first prefix requires a lot of combinations with 5 elements to be generated and processed, whereas the second prefix only requires one combination with 5 elements to be generated and processed: .
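The dependence of the cost on the rightmost element can be made concrete with a small Python sketch that counts how many full combinations each prefix generates. The function name and the values `k = 6`, `g = 4`, `s = 3` are assumptions chosen for illustration.

```python
from itertools import combinations
from math import comb

def completions(k, g, prefix):
    """Number of combinations of g rows (out of k) that start with prefix."""
    remaining_rows = k - 1 - prefix[-1]     # rows to the right of the prefix
    return comb(remaining_rows, g - len(prefix))

k, g, s = 6, 4, 3
costs = {p: completions(k, g, p)
         for p in combinations(range(k - (g - s)), s)}
for prefix, cost in costs.items():
    print(prefix, cost)
```

In this toy case the prefix `(0, 1, 2)` generates three combinations and `(2, 3, 4)` only one, and the costs of all the prefixes add up to the total number of combinations of 4 rows out of 6.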
The advantage of distributing prefixes is that every prefix can be processed in parallel, but its main disadvantage is the extremely wide range of the computational cost of processing prefixes. Some prefixes require a lot of time, whereas other prefixes are almost instantaneous.
3.1.3 Orderings in combination generation
The distributed algorithm can employ any order to generate the combinations with rows. Nevertheless, in our implementation we have employed two orders: the lexicographical order and the left-lexicographical order, a variant of the first one:

In the lexicographical order, the rightmost element is the one that changes most. For instance, if , , , the prefixes are generated in the following order: , , , , , , , , , and .
With this ordering, the computational cost of the prefixes is usually (although not always) decreasing, as it depends on the rightmost element of the prefix. For instance, in the previous example the prefix appears before the prefix , but the cost of processing the latter is higher than the cost of processing the former one.

In the left-lexicographical order, the leftmost element is the one that changes most. For instance, if , , , the prefixes are generated in the following order: , , , , , , , , , and .
As can be seen, in this ordering the rightmost element of each prefix is always the same as or larger than the previous one. Therefore, the first prefixes are always much more expensive than the last ones. This might be an advantage when scheduling prefixes, since it prevents expensive prefixes from arising in the final stages of the algorithm.
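Both orderings can be generated with a few lines of Python. The sketch interprets the left-lexicographical order as the order in which the rightmost element changes most slowly, an assumption that is consistent with the property stated above; the function names and the values `k = 6`, `g = 4`, `s = 3` are illustrative.

```python
from itertools import combinations

def lex_prefixes(k, g, s):
    """Valid prefixes in lexicographical order (rightmost changes most)."""
    return list(combinations(range(k - (g - s)), s))

def left_lex_prefixes(k, g, s):
    """Valid prefixes ordered so that the leftmost element changes most."""
    # Sorting by the reversed tuple makes the rightmost element the slowest.
    return sorted(lex_prefixes(k, g, s), key=lambda p: p[::-1])

print(lex_prefixes(6, 4, 3))
print(left_lex_prefixes(6, 4, 3))
```

In the second list the rightmost element never decreases, so the prefixes with the most completions come first.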
The processing of a prefix within the Process_prefix method must be performed by only one process, and therefore it can be performed by using any of the previous serial or shared-memory algorithms, thus making the code more modular. In the case of the algorithms with saved additions special care must be applied since combinations must be saved up to size . The code that is executed inside the Process_prefix method to process a prefix is called the node engine because it is executed by only one process, and thus it is executed inside one node of a distributed-memory machine. The order in which the combinations with rows are generated inside the Process_prefix method is not important to the distributed algorithm, and it is determined by the node engine (the serial or the shared-memory algorithm).
3.2 Dynamic algorithms
The family of algorithms with dynamic scheduling of tasks comprises four algorithms. All of them work in a similar way: They apply the one-master-workers model. In this family of algorithms, one process in the application is the coordinator process, and the remaining processes are workers.
The coordinator process generates the prefixes (combinations with elements) in a certain order and assigns them to the worker processes upon request. When one worker has finished the processing of its current prefix, it sends the minimum distance computed for that prefix, and waits for the next prefix. The sending of the result tells the coordinator process that the worker has finished, and therefore that it is ready to accept another prefix. When the coordinator receives a result from a worker, it updates its global minimum distance, and then sends the next prefix to that worker. When there are no more prefixes, it sends a special message (a poisonous task) to indicate the finishing condition.
When all the prefixes with elements for the combinations with elements are done, the current iteration of the For loop (line 3 of Algorithm 3) finishes, and the next iteration starts if the values of the lower and upper bounds allow it.
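The coordinator-worker protocol for one iteration can be mimicked with the following Python sketch. It is only a simulation under several assumptions: threads and in-memory queues stand in for processes and message passing, the code is binary with rows packed into integers, prefixes are generated in lexicographical order, and all names (`coordinator`, `worker`, `process_prefix`) are hypothetical.

```python
import queue
import threading
from itertools import combinations

def process_prefix(rows, prefix, g):
    """Minimum weight over all combinations of g rows starting with prefix."""
    base = 0
    for i in prefix:
        base ^= rows[i]
    best = None
    for tail in combinations(range(prefix[-1] + 1, len(rows)), g - len(prefix)):
        acc = base
        for i in tail:
            acc ^= rows[i]
        w = bin(acc).count("1")
        best = w if best is None else min(best, w)
    return best

def worker(wid, rows, g, requests, inbox):
    result = None                       # the first message is a plain request
    while True:
        requests.put((wid, result))     # sending a result also asks for work
        prefix = inbox.get()
        if prefix is None:              # special message: no more prefixes
            return
        result = process_prefix(rows, prefix, g)

def coordinator(rows, g, s, n_workers):
    requests = queue.Queue()
    inboxes = [queue.Queue() for _ in range(n_workers)]
    threads = [threading.Thread(target=worker,
                                args=(w, rows, g, requests, inboxes[w]))
               for w in range(n_workers)]
    for t in threads:
        t.start()
    prefixes = combinations(range(len(rows) - (g - s)), s)
    best, active = None, n_workers
    while active:
        wid, result = requests.get()
        if result is not None:          # update the global minimum
            best = result if best is None else min(best, result)
        nxt = next(prefixes, None)
        inboxes[wid].put(nxt)           # next prefix, or the finishing message
        if nxt is None:
            active -= 1
    for t in threads:
        t.join()
    return best

ROWS = [0b1000110, 0b0100101, 0b0010011, 0b0001111]
print(coordinator(ROWS, 3, 2, 3))
```

Each prefix costs two messages here (one assignment and one result), which is the communication overhead that grouping two prefixes per message aims to reduce.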
We propose four different algorithms in this family:

DLex: Distributed algorithm with a dynamic scheduling of tasks, in which the prefixes are generated in the lexicographical order.

DLex2cm: Same as the previous one, but two prefixes are assigned at the same time (within the same message). This technique reduces the communication cost (latencies) of the previous algorithm, which could be very effective in networks with high latencies. Targeting load balancing, one prefix is picked up from the beginning of the ordering and the other one is picked up from the end of the ordering, since the first prefixes are usually more computationally expensive.

DLle: Distributed algorithm with a dynamic scheduling of tasks, in which the prefixes are generated in the left-lexicographical order.

DLle2cm: Same as the previous one, but two prefixes are assigned at the same time (within the same message). The rationale has just been described above.
3.3 Static algorithms
The family of algorithms with static scheduling of tasks also comprises four algorithms, and all of them work in a similar way. In this family, all the processes in the application are peers, and therefore no coordination role is necessary. The distribution of the tasks is made in a static way. The cyclic distribution (or a similar variant) has been employed because it provides a better load balancing than the block distribution since the most expensive prefixes are usually the first ones.
One important difference between this family and the previous one is the handling of the prefix size. The dynamic family employs an absolute prefix size, whereas the static family employs a so-called relative prefix size since the actual prefix size is: , where is the given prefix size. To achieve this, the only change needed in the previous algorithm is to modify the Get_first_combination_with_… function and the Get_next_combination_with_… function to process combinations with elements, instead of elements. The reason to use a relative prefix in the static algorithms is that the static scheduling requires a larger number of tasks to achieve a good load balancing of the workload across the processes. Since there must be many tasks and they cannot be too small, the actual prefix size should increase when increases (in every iteration of the For loop). The use of relative prefix sizes achieves this, so that some values of the relative prefix size work well on a wide range of values, and on a wide range of linear codes.
Next, the similarities and differences among the four algorithms of this family are described:

SLex: Distributed algorithm with static scheduling of tasks that employs the cyclic distribution of tasks generated using the lexicographical order.
For example, the following table shows the distribution of prefixes for , , , and three processes. Recall that the rightmost element in those prefixes must be at most 9, because no valid combination of 11 () elements taken 4 () at a time can be formed with the value 10 in the third position.
Process 0 Process 1 Process 2 
SLexSnc: Same as the previous one, but a variant of the cyclic distribution, called snake cyclic, is employed. In the usual cyclic distribution, the rightmost elements of the prefixes assigned to the $i$-th process are very often (but not always) smaller than the rightmost elements of the prefixes assigned to the $(i+1)$-th process, which can unbalance the load by assigning more work to the first processes. This can be seen by comparing the rightmost elements of every two consecutive columns in the above example. The snake variant of the cyclic distribution tries to break this pattern by using the usual cyclic distribution in half of the cases (the odd rows of the table), and then reversing the usual cyclic distribution in the other half (the even rows of the table). In this way, the rightmost elements of the prefixes assigned to a process are not so often smaller than those assigned to the next process.
For example, the following table shows the distribution of prefixes for , , , and three processes.
Process 0 Process 1 Process 2 
SLle: Distributed algorithm with static scheduling of tasks that employs the cyclic distribution of tasks generated using the left-lexicographical order.
The combination of the cyclic distribution and this new ordering achieves the following two goals: First, it ensures that the most expensive prefixes are processed at the beginning. Second, it ensures that the rightmost elements of the prefixes assigned to a process are very similar (usually the same) to those assigned to the next process.
For example, the following table shows the distribution of prefixes for , , , and three processes.
Process 0 Process 1 Process 2 
SLleSnc: Same as the previous one, but the snake cyclic distribution of tasks is employed. Although in the previous algorithm the rightmost elements in the prefixes assigned to a process are often the same as those assigned to the next process, in the few remaining cases the rightmost elements assigned to a process are smaller (by one) than those assigned to the next process. The snake cyclic distribution tries to avoid that fact by reversing the ordering in half of the cases (the even rows of the table). Thus, the rightmost elements assigned to a process will be often the same as those assigned to the next process, and in the few remaining cases the rightmost elements assigned to a process will be smaller (by one) or larger (by one) than those assigned to the next process.
For example, the following table shows the distribution of prefixes for , , , and three processes.
Process 0 Process 1 Process 2
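The two static distributions can be sketched in Python as follows. The function names are hypothetical, the tasks are taken in whatever order they were generated (lexicographical or left-lexicographical), and the sample prefixes are illustrative; they are not meant to reproduce the tables above.

```python
from itertools import combinations

def cyclic(tasks, p):
    """Usual cyclic distribution: task i goes to process i mod p."""
    return [tasks[r::p] for r in range(p)]

def snake_cyclic(tasks, p):
    """Cyclic distribution whose direction is reversed in every other row."""
    assigned = [[] for _ in range(p)]
    for i, task in enumerate(tasks):
        row, col = divmod(i, p)
        # Rows 0, 2, ... (odd rows of the table) keep the usual direction.
        proc = col if row % 2 == 0 else p - 1 - col
        assigned[proc].append(task)
    return assigned

prefixes = list(combinations(range(5), 3))      # illustrative prefixes
for name, dist in (("cyclic", cyclic(prefixes, 3)),
                   ("snake", snake_cyclic(prefixes, 3))):
    for proc, tasks in enumerate(dist):
        print(name, "process", proc, tasks)
```

Since every process can compute its own list locally, no message is needed to assign the tasks; only the final minimum has to be combined.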
3.4 Comparison of the dynamic algorithms and the static algorithms
3.4.1 Communication cost
In the dynamic algorithms, the coordinator must send every prefix to a worker upon request. Then, after processing the prefix, the worker must send the result back to the coordinator. Although the amount of data is not very large since the prefix comprises only indices (integer values) and the result is just one value (an integer), the number of point-to-point communications is considerable. The communication cost of the dynamic algorithms depends on the number of tasks, which strongly depends on the prefix size (). In fact, the communication cost is point-to-point operations for every value of being processed (every iteration of the For loop). The dynamic algorithms with suffix 2cm reduce this communication cost by sending two combinations per message. Obviously, a large value of would greatly increase the communication cost by creating many tasks, and therefore many point-to-point communications. In fact, when working with usual values of and (such as those in the experimental section), just by increasing the prefix size by one, the number of prefixes is increased by about one order of magnitude, and therefore the communication cost increases by the same order.
In contrast, the communication cost of the static algorithms is much smaller, since they do not send each prefix and do not receive each result. The assignment of tasks is static and requires no communication at all, and the computation of the global result requires one collective reduction operation (no pointtopoint communications at all) after processing all the prefixes for every value of being processed (every iteration of the For loop). The goal of the collective reduction operation is to compute the minimum distance of the distances computed by all the processes. Actually, the number of global reduction operations per iteration of the For loop is two (instead of one) to avoid uninitialized distances. Nevertheless, the cost is much smaller than that of the dynamic algorithms. Note that the communication cost of the static algorithms does not depend on , , nor , and it only depends on the number of processes logarithmically and the total number of iterations of the For loop.
3.4.2 Number of tasks
Now the effect of the number of tasks on both distributed families is explored. In the dynamic family, as said, a large number of tasks increases the communication cost. Therefore, a large prefix size will greatly increase the number of tasks and thus the communication cost. Nevertheless, despite the dynamic nature of the scheduling, too few tasks could unbalance the load. To guarantee good load balancing across all the processes, the number of tasks should be at least several times larger than the number of processes being employed. If the number of tasks is too small (and therefore the variability in task cost is very large), several processes could still be processing a prefix with a large computational cost while others have already finished all their work. Therefore, a balance must be found between communication cost and load balancing, since reducing the communication cost requires few tasks, whereas better load balancing requires many tasks.
In contrast, in the static family, since the communication cost does not depend on the number of tasks, a large number of tasks can yield better performance because many tasks with smaller computational costs can be distributed more evenly among the processes.
The dynamic algorithms employ one process, the coordinator process, to assign the work to be done and to gather the results. This can be a disadvantage when employing a low number of processes because the coordinator process does not process any combinations itself. In contrast, the static algorithms employ all the processes to work on combinations.
Table 1 summarizes these findings by linking the prefix size with the communication cost and the load balancing.
| Prefix size | No. of tasks | Task size | Dynamic algs. | Static algs. |
|---|---|---|---|---|
| Small | Few | Large | Low comm. cost; bad load balancing | Fixed comm. cost; bad load balancing |
| Medium | Medium | Medium | Medium comm. cost; good load balancing | Fixed comm. cost; bad load balancing |
| Large | Many | Small | High comm. cost; good load balancing | Fixed comm. cost; good load balancing |
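The load-balancing side of Table 1 can be explored with a toy model. The sketch below is only an illustration under stated assumptions: the task costs are hypothetical, the dynamic family is modeled as greedy assignment of the next task to the first idle worker, and both the communication cost and the dedicated coordinator process are ignored.

```python
import heapq


def cyclic_makespan(costs, workers):
    """Finish time when task i is statically assigned to worker i mod P."""
    loads = [0.0] * workers
    for index, cost in enumerate(costs):
        loads[index % workers] += cost
    return max(loads)


def dynamic_makespan(costs, workers):
    """Finish time when each task goes to the worker that becomes idle first."""
    finish_times = [0.0] * workers  # a list of equal values is a valid heap
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


if __name__ == "__main__":
    few_large = [40, 30, 20, 10]  # hypothetical costs, 100 units of work
    many_small = [c / 10 for c in few_large for _ in range(10)]  # same work
    for name, costs in (("few large", few_large), ("many small", many_small)):
        print(name, cyclic_makespan(costs, 3), dynamic_makespan(costs, 3))
```

With the same total work, splitting it into more and smaller tasks brings both schedules closer to the ideal finish time, which is the effect summarized in the last row of the table.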
4 Performance analysis
4.1 Experimental setup
The experiments reported in this section were performed on a cluster of HP servers, which we will call ua. Each node of the cluster contained two Intel Xeon® CPU X5560 processors at 2.8 GHz, with 12 cores and 48 GiB of RAM in total.
The nodes were connected with an InfiniBand 4X QDR network. This network supports a 40 Gb/s signaling rate, with a peak data rate of 32 Gb/s in each direction. Unless otherwise stated, the InfiniBand network was employed in the experiments. However, in some experiments a Gigabit Ethernet network (Procurve E2510-48) with a peak speed of 1 Gb/s was employed to assess the effect of the network on performance.
The OS of each node was GNU/Linux (kernel version 3.10.0-514.21.1.el7.x86_64). Open MPI 1.4.3 was employed to compile (the mpicc compiler) and to deploy the implementations on the cluster (the mpirun tool).
In this experimental study we have employed the following two linear codes with parameters $[n, k, d]$: the first one had parameters [150,77,17] and was called mat015; the second one had parameters [232,51,61] and was called mat023. These two linear codes were chosen because their computational costs, their dimensions $k$, and their lengths $n$ were very different. First, the computational cost of computing the minimum distance of the second linear code is about one order of magnitude larger than that of the first one. Second, the dimension of the first linear code is larger than that of the second linear code, which is very important because the number of parallel tasks depends on this value. Third, the length of the second linear code is much larger than that of the first one, which can affect the vectorization and other aspects of the different implementations. In the figures, the left plot usually shows the results for the mat015 linear code, whereas the right plot shows the results for the mat023 linear code.
As previously described, we have assessed the following distributed algorithms:

DLex: Distributed algorithm with a dynamic scheduling of tasks generated using the lexicographical order.

DLex2cm: Same as the previous one, but two tasks are assigned at the same time.

DLle: Distributed algorithm with a dynamic scheduling of tasks generated using the left-lexicographical order.

DLle2cm: Same as the previous one, but two tasks are assigned at the same time.

SLex: Distributed algorithm with a static cyclic scheduling of tasks generated using the lexicographical order.

SLexSnc: Same as the previous one, but the snake cyclic distribution of tasks is employed.

SLle: Distributed algorithm with a static cyclic scheduling of tasks generated using the left-lexicographical order.

SLleSnc: Same as the previous one, but the snake cyclic distribution of tasks is employed.
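The difference between the plain cyclic and the snake cyclic distributions of the static algorithms can be sketched as follows. This sketch assumes the usual meaning of a snake (boustrophedon) distribution, in which the order of the processes is reversed on every other round; the function names and the decreasing task costs are hypothetical and are not taken from the implementations assessed here.

```python
def cyclic_owner(task: int, procs: int) -> int:
    """Plain cyclic distribution: 0, 1, ..., P-1, 0, 1, ..., P-1, ..."""
    return task % procs


def snake_owner(task: int, procs: int) -> int:
    """Snake cyclic distribution: 0, 1, ..., P-1, P-1, ..., 1, 0, 0, 1, ..."""
    round_index, position = divmod(task, procs)
    if round_index % 2 == 0:
        return position            # even rounds run left to right
    return procs - 1 - position    # odd rounds run right to left


def loads(costs, procs, owner):
    """Total cost assigned to each process under a given ownership rule."""
    totals = [0] * procs
    for task, cost in enumerate(costs):
        totals[owner(task, procs)] += cost
    return totals


if __name__ == "__main__":
    costs = [8, 7, 6, 5, 4, 3, 2, 1]  # hypothetical, steadily decreasing costs
    print("cyclic:", loads(costs, 2, cyclic_owner))
    print("snake: ", loads(costs, 2, snake_owner))
```

When task costs decrease steadily, the plain cyclic rule always hands the most expensive task of each round to the same process, whereas the snake rule alternates which process receives it.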
4.2 Impact of node engine
The first task is to determine the best node engine to be employed inside the distributed algorithms. This experiment includes two versions of the node engines previously described: node engines with scalar (non-vectorized) codes, and node engines with vectorized codes. Figure 1 reports the times spent by the different node engines to compute the minimum distance of both linear codes when using the DLex algorithm with prefix size 3 and 1 thread per process on 10 nodes (120 cores). Results for other configurations (distributed algorithms, prefix sizes, number of threads per process, etc.) were similar. In each plot the first five bars show the results for scalar algorithms (Sca), whereas the last two bars show the results for vectorized algorithms (Vec). To better show the time differences, each bar in the plots shows the time in seconds on top of it.
The plots in Figure 1 clearly show that the node engine employed inside the distributed algorithm can affect performance dramatically. When comparing the vectorized codes of the saved variants (Vec Saved and Vec Saved Unrolled) with the scalar codes of the saved variants (Sca Saved and Sca Saved Unrolled), the vectorized codes are about 1.4 times as fast as the scalar codes for the mat015 linear code, and about 2.5 times as fast for the mat023 linear code. The main reason behind the larger impact on the mat023 linear code might be its larger length ($n = 232$ versus $n = 150$). On the other hand, the unrolling only seems effective for the mat023 linear code, which may be due to the larger computational cost of this linear code.
Unless otherwise stated, from now on the node engine with saved additions and vectorization will be employed in the remaining experiments.
4.3 Impact of the prefix size
Figure 2 reports the times spent by several distributed implementations to compute the minimum distance versus the prefix size when one thread per process is employed. Each plot shows four lines: two different distributed algorithms (DLex Vec and SLex Vec) and two different node configurations (5 and 15 nodes, that is, 60 and 180 cores). The plots show results for several prefix sizes, which are absolute prefix sizes for the dynamic algorithms and relative prefix sizes for the static algorithms. The name DLex Vec denotes the distributed algorithm DLex with the vectorized Saved node engine. Analogously, the name SLex Vec denotes the distributed algorithm SLex with the vectorized Saved node engine. Similar results were obtained for the other distributed algorithms.
As can be seen in Figure 2, for the dynamic algorithm DLex the prefix size with the best performance is 3 for the mat015 linear code, and 4 for the mat023 linear code. For this dynamic algorithm, performance drops very quickly as the prefix size increases. This is because, for the dynamic algorithms, the number of parallel tasks generated, assigned, and then collected among the processes is , where is the number of rows in the combinations, and is the prefix size. Therefore, when the prefix size increases by one, the number of tasks to be processed increases by nearly one order of magnitude, which correspondingly increases the communication cost. This large increase in the communication cost is what makes performance drop so sharply as the prefix size increases.
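As a hedged illustration of this growth, suppose that the number of prefixes equals the number of ways of choosing $s$ rows out of the $k$ rows of the generator matrix. This binomial form and the symbols $k$ and $s$ are assumptions made here for illustration, not a formula taken from the text above. Increasing the prefix size by one would then multiply the number of tasks by

$$\frac{\binom{k}{s+1}}{\binom{k}{s}} = \frac{k-s}{s+1}.$$

For the two codes used in the experiments, with dimensions 77 and 51 and best dynamic prefix sizes 3 and 4, this factor would be $74/4 = 18.5$ and $47/5 = 9.4$, respectively, which is consistent with a growth of about one order of magnitude.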
As can be seen in Figure 2, for the static algorithm SLex the relative prefix size with the best performance is 4 for the mat015 linear code, and 6 for the mat023 linear code. However, it is interesting to note that the performance of this algorithm is much less affected by the prefix size, and the range of near-optimal prefix sizes is much larger. The reason is that the communication cost of this algorithm is much smaller, and therefore having a larger number of tasks does not usually harm performance so much.
4.4 Number of threads per process
Figure 3 reports the times spent by the distributed implementations to compute the minimum distance versus the number of threads per process. The prefix sizes have been obtained from the previous experiment: The dynamic algorithms employ 3 for the mat015 linear code, and 4 for the mat023 linear code, whereas the static algorithms employ 4 for the mat015 linear code, and 6 for the mat023 linear code. Each plot shows four lines: Two different distributed algorithms (DLex Vec and SLex Vec), and two different node configurations (5 and 15 nodes, that is, 60 and 180 cores). Similar results were obtained for the other distributed algorithms.
To efficiently employ all the cores in every node, when increasing the number of threads per process, a proportional reduction in the total number of processes must be applied. Since each node has 12 cores, the number of processes must equal the total number of cores (12 times the number of nodes being used) divided by the number of threads deployed by each process. When the number of threads per process is increased, the communication cost is usually reduced since there are fewer processes communicating among themselves. At the same time, the computing power of each process is increased, since each process has several threads and therefore several cores to process tasks. This larger computational power per process requires larger tasks, which can only be achieved by generating fewer tasks, which in turn can unbalance the load. Therefore, a balance must be found between the communication cost and the number of tasks.
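The relation between nodes, threads per process, and processes described above can be written as a short helper. This is a minimal sketch: the function name is hypothetical, and the requirement that the number of threads divide the number of cores per node is an added assumption, made so that no process spans two nodes.

```python
CORES_PER_NODE = 12  # value stated in the text for the ua cluster


def processes_for(nodes: int, threads_per_process: int,
                  cores_per_node: int = CORES_PER_NODE) -> int:
    """Number of processes that keeps every core busy with exactly one thread."""
    if threads_per_process < 1 or cores_per_node % threads_per_process != 0:
        raise ValueError("threads per process must divide the cores per node")
    processes_per_node = cores_per_node // threads_per_process
    return nodes * processes_per_node


if __name__ == "__main__":
    for nodes in (5, 15):
        for threads in (1, 2, 3, 4, 6, 12):
            print(nodes, threads, processes_for(nodes, threads))
```

For example, 15 nodes with 3 threads per process give the same number of processes as 5 nodes with 1 thread per process, which is one way to read the coordinator-load argument made for the dynamic algorithms.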
As can be observed in Figure 3, for the dynamic algorithm DLex the optimal number of threads per process is 1 when using 5 nodes, and about 2 or 3 when using 15 nodes. In the first case (5 nodes) the total number of processes is not so high, and thus the coordinator process can keep up with the requests. However, in the second case (15 nodes) the total number of processes is much higher, and thus the burden on the coordinator process can reduce performance. In this case, 2 or 3 threads per process are optimal and achieve a good balance between the communication cost and the task size. On the other hand, for the static algorithm SLex the optimal number of threads per process is 1, since the communication cost of this type of algorithm is very small and it requires many tasks to balance the load effectively.
4.5 Distributed-memory algorithms: performance comparison
Figure 4 compares the distributed implementations described in this document. The prefix sizes and the numbers of threads per process employed in these experiments are the optimal values obtained in the above experiments. For the dynamic algorithms, the prefix size is 3 for the mat015 linear code and 4 for the mat023 linear code, and the number of threads per process is 2 in both cases. For the static algorithms, the prefix size is 4 for the mat015 linear code and 6 for the mat023 linear code, and the number of threads per process is 1 in both cases. To assess the effect of the interconnection network on the performance of the different algorithms, the top row shows results for the fast InfiniBand network (netf), and the bottom row shows results for the slower Gigabit Ethernet network (nete). Each plot contains two blocks of bars: one for 5 nodes and the other for 15 nodes. Each block shows the performance of the eight distributed algorithms. In all cases, the vectorized Saved node engine has been employed.
As may be seen in Figure 4, when comparing the dynamic algorithms with the static algorithms, the static algorithms clearly outperform the dynamic algorithms on the mat015 linear code. This improvement is larger when the number of nodes is smaller. In contrast, on the mat023 linear code the dynamic algorithms are slightly faster. The reason for this difference between the two linear codes might be that computing the minimum distance of mat015 requires the generation of a much larger number of tasks than computing the minimum distance of mat023. Recall that the number of tasks depends on the dimension $k$, which is 77 in mat015 but only 51 in mat023. A large number of tasks (mat015) allows the static algorithms to balance the load more evenly, while simultaneously taking advantage of their lower communication cost. Moreover, if the number of cores is not so large (such as in the 5-node configuration), the performance of the static algorithms increases because their load balancing improves with fewer processes and because they employ one more process to perform computations. In contrast, a smaller number of tasks (mat023) allows the dynamic algorithms to balance the load more evenly than the static algorithms while simultaneously reducing the communication cost. Therefore, the static algorithms seem to require a large number of tasks to balance the load evenly on all the cores, which is somewhat difficult when the dimension is small.
When comparing the four dynamic algorithms, performance is similar. On the fast network, for the mat015 linear code (the one with the smallest computational cost) and the 5-node configuration, DLex performs best. In the other cases, performance is very similar. However, on the slow network, the 2cm variants perform slightly better because of their lower communication cost, except for mat015 on 5 nodes.
When comparing the four static algorithms, the performance of the two SLex variants (lexicographical order) is slightly better than that of the two SLle variants (left-lexicographical order).
When comparing the performances on the two networks, performances slightly decrease on the slower network. This decrease might be so small because of the efficiency of the distributed algorithms and because both networks assessed in our experiments had a theoretical speed larger than or equal to 1 Gb/s.
Figure 5 compares the distributed implementations described in this document on cheaper linear codes. The prefix sizes and the numbers of threads per process employed in these experiments are the optimal values obtained in the above experiments. The only exception is that the number of threads per process is one in all cases, since these experiments are much cheaper and therefore the tasks are much shorter. The figure shows the times in seconds for several distributed algorithms and node counts on two shorter linear codes: the left plot shows results for the linear code with parameters (called mat016); the right plot shows results for the linear code with parameters (called mat023). On unicore processors, the cost to compute the distance of the second linear code is about one order of magnitude smaller than that of mat015, whereas the cost to compute the distance of the first linear code is about half an order of magnitude smaller than that of the second one. Both plots only show results on the Gigabit Ethernet network (nete).
As can be seen, of the four cases (two node configurations and two families of algorithms), the new left-lexicographical order is faster in three, the only exception being the dynamic algorithms on 5 nodes.
4.6 Scalability and comparative performance analysis
To measure the scalability of our implementations, Figure 6 shows the speedups obtained by several configurations to compute the minimum distance of both linear codes. The prefix sizes and the numbers of threads per process employed in these experiments are the optimal values obtained in the above experiments. Recall that the speedup is the number of times that the parallel algorithm is as fast as the serial (one-core) algorithm. Obviously, all the core counts assessed in this experiment were multiples of 12 (the number of cores per node).
As may be observed in Figure 6, the static algorithms are faster on the mat015 linear code, whereas the dynamic algorithms are faster on the mat023 linear code. As was commented, this might be related to the number of parallel tasks generated: the static algorithms require many tasks to balance the load, and the number of tasks greatly depends on the dimension $k$. Note that the speedups on the mat015 linear code are smaller than those on the mat023 linear code. The reason is that the computational cost of the mat015 linear code is about one order of magnitude smaller than the computational cost of the mat023 linear code. Note that for the mat015 linear code the total time on 240 cores is about 44 seconds, which is a very short run for such a large number of cores. The speedups for the scalar codes (top row) are slightly larger than the speedups of the vectorized codes (bottom row), since the vectorized codes are much more efficient on one core. Note that the speedups achieved are remarkable and can be up to about 200 when employing 240 cores.
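The speedup recalled above, together with the parallel efficiency it implies, can be computed as in the following minimal sketch. The function names are hypothetical, and the times in the example are made-up values chosen only to reproduce a speedup of 200 on 240 cores; they are not measurements from this paper.

```python
def speedup(serial_time: float, parallel_time: float) -> float:
    """How many times as fast the parallel run is compared with the serial run."""
    return serial_time / parallel_time


def efficiency(serial_time: float, parallel_time: float, cores: int) -> float:
    """Fraction of the ideal (linear) speedup that is actually achieved."""
    return speedup(serial_time, parallel_time) / cores


if __name__ == "__main__":
    # Illustrative values only: 1000 s on one core versus 5 s on 240 cores.
    print(speedup(1000.0, 5.0))
    print(round(efficiency(1000.0, 5.0, 240), 3))
```

A speedup of about 200 on 240 cores therefore corresponds to a parallel efficiency of roughly 83%.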
Now we compare the performance of the new distributed algorithms on the ua cluster with the performance of both commercial and public-domain software reported by Hernando et al. [10]. In that paper the times were obtained on the cplex server, a computer based on AMD processors. It contained an AMD Opteron™ Processor 6128 (2.0 GHz) with 8 cores (though only 6 were used, to let other users work). Its OS was GNU/Linux (kernel version 3.13.0-68-generic). The GCC compiler (version 4.8.4) was employed.
Although the computational power of the cores in the cplex server is not exactly the same as that of the cores in the ua cluster, the processors were released on similar dates: the processor in the cplex server was launched in the first quarter of 2010, whereas the processor in the nodes of the ua cluster was launched in the first quarter of 2009. Therefore, the processors are of similar generations. We could not assess Magma on the same machine, the ua cluster, because it is commercial software and we do not have a license for it.
Two of the most common implementations currently available were assessed:

Magma [5]: It is a commercial software package focused on computations in algebra, algebraic geometry, algebraic combinatorics, etc. Version 2.22-3 was employed in those experiments. On the cplex server, vectorization could not be employed since Magma only implements this feature on modern processors with AVX support. Magma was assessed on one core as well as on 6 cores, since this software is parallelized.

Guava [9, 3]: GAP (Groups, Algorithms, Programming) is a public-domain software environment for working on computational group theory and computational discrete algebra. It contains a package named Guava that can compute the minimum distance of linear codes. Guava Version 3.12 within GAP Version 4.7.8 was employed in those experiments. Guava does not implement any vectorization, and it only works on one core since the software is not parallelized.
In contrast, our implementations can use hardware vector instructions both on old processors (SSE) and modern processors (AVX), both from Intel and AMD. Furthermore, our implementations can employ any number of cores inside a node.
Table 2 compares the times required by the commercial software Magma, the times required by the public-domain software Guava, and the times required by the new algorithms for distributed-memory architectures to compute the minimum distance of both linear codes. The vectorized DLex algorithm with prefix size 4 and two threads per process was employed on 24 nodes (288 cores). As the table shows, the new distributed algorithms drastically reduce the time by making many computers (24 computers with 12 cores each) cooperate to compute the minimum distance. Thus, the long processing times of commercial and public-domain software can be significantly reduced. For instance, computing the distance of the mat023 linear code with the public-domain software Guava required about 5 days and 7 hours, whereas our new software required a bit less than 5 minutes.
| Code | Magma, 1 core (cplex) | Guava, 1 core (cplex) | Magma, 6 cores (cplex) | New alg., 288 cores (ua) |
|---|---|---|---|---|
| mat015 | 53,052.9 | 40,804.3 | 9,562.8 | 38.9 |
| mat023 | 503,984.2 | 456,413.2 | 85,341.1 | 282.6 |
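The figures in Table 2 can be turned into durations and time ratios with a few lines of code. The sketch below assumes that the table reports times in seconds, which is what the durations quoted in the text suggest; the dictionary layout and function names are hypothetical. The ratios compare different machines and different core counts, so they are reductions in wall-clock time rather than speedups in the sense used earlier.

```python
NEW_ALG = "New alg., 288 cores (ua)"

# Values copied from Table 2, assumed to be in seconds.
TIMES = {
    "mat015": {"Magma, 1 core (cplex)": 53052.9,
               "Guava, 1 core (cplex)": 40804.3,
               "Magma, 6 cores (cplex)": 9562.8,
               NEW_ALG: 38.9},
    "mat023": {"Magma, 1 core (cplex)": 503984.2,
               "Guava, 1 core (cplex)": 456413.2,
               "Magma, 6 cores (cplex)": 85341.1,
               NEW_ALG: 282.6},
}


def split_duration(seconds: float) -> tuple:
    """Return (days, hours, minutes), discarding the leftover seconds."""
    days, rest = divmod(int(seconds), 86400)
    hours, rest = divmod(rest, 3600)
    return days, hours, rest // 60


def time_ratio(code: str, baseline: str) -> float:
    """Ratio between a baseline time and the time of the new algorithm."""
    return TIMES[code][baseline] / TIMES[code][NEW_ALG]


if __name__ == "__main__":
    for code, row in TIMES.items():
        for label, seconds in row.items():
            print(code, label, split_duration(seconds),
                  round(time_ratio(code, label), 1))
```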
5 Conclusions
In this paper, we have introduced several new implementations of the Brouwer-Zimmermann algorithm for computing the minimum distance of a random linear code over $\mathbb{F}_2$ on distributed-memory architectures. Both state-of-the-art commercial and public-domain software can only be employed on either unicore architectures or shared-memory architectures, which are strongly limited in the number of cores/processors that can be employed in the computation. In contrast, our family of implementations focuses on distributed-memory architectures, which are well known for their scalability and can comprise hundreds or even thousands of cores. The experimental results show that our implementations are much faster, by up to several orders of magnitude, than the implementations widely used nowadays, because of their ability to exploit these scalable architectures. For a particular linear code, the time to compute the minimum distance has dropped from about 11 hours and 2.5 hours (in public-domain and commercial software, respectively) to about half a minute with our code on a distributed-memory machine with 288 cores. For another particular linear code, the time to compute the minimum distance has dropped from about 5 days and 1 day (in public-domain and commercial software, respectively) to about five minutes with our code on a distributed-memory machine with 288 cores.
Future work in this area will investigate the development of specific new algorithms and implementations for new architectures such as GPGPUs (General-Purpose Graphics Processing Units).
Acknowledgements
The authors would like to thank the University of Alicante for granting access to the ua cluster. They also want to thank Javier Navarrete for his assistance and support when working on this machine.
Quintana-Ortí was supported by the Spanish Ministry of Science, Innovation and Universities under Grant RTI2018-098156-B-C54 co-financed by FEDER funds.
Hernando was supported by the Spanish Ministry of Science, Innovation and Universities under Grants PGC2018-096446-B-C21 and PGC2018-096446-B-C22, and by University Jaume I under Grant PB11B201810.
Igual was supported by the EU (FEDER), the Spanish MINECO (TIN2015-65277-R, RTI2018BI00) and the Spanish CM (S2018/TCS-4423).
References
[1] (2004) Information theory in molecular biology. Physics of Life Reviews 1 (1), pp. 3–22. ISSN 1571-0645.
[2] (1976-07) Syndrome-source-coding and its universal generalization. IEEE Transactions on Information Theory 22 (4), pp. 432–436. ISSN 0018-9448.
[3] (2012) (Website).
[4] (2008) (Website).
[5] (1997) The Magma algebra system. I. The user language. J. Symbolic Comput. 24 (3-4), pp. 235–265. Computational algebra and number theory (London, 1993). ISSN 0747-7171.
[6] (2009-06) Quantum error correction via codes over GF(2). In 2009 IEEE International Symposium on Information Theory, pp. 789–793.
[7] (2014-10) Relative generalized Hamming weights of one-point algebraic geometric codes. IEEE Transactions on Information Theory 60 (10), pp. 5938–5949.
[8] (2006) Searching for linear codes with large minimum distance. In Discovering mathematics with Magma, Algorithms Comput. Math., Vol. 19, pp. 287–313.
[9] (2015) (Website).
[10] (2019-06) Algorithm 994: fast implementations of the Brouwer-Zimmermann algorithm for the computation of the minimum distance of a random linear code. ACM Trans. Math. Softw. 45 (2), pp. 23:1–23:28. ISSN 0098-3500.
[11] (1978-05) A public-key cryptosystem based on algebraic coding theory. 44.
[12] (2003) Linear network coding. IEEE Trans. Inform. Theory 49 (2), pp. 371–381. ISSN 0018-9448.
[13] (2002-10) Compression of binary sources with side information at the decoder using LDPC codes. IEEE Communications Letters 6 (10), pp. 440–442. ISSN 1089-7798.
[14] (1977) The theory of error-correcting codes. North-Holland Mathematical Library, North-Holland Pub. Co. ISBN 0444850090, 9780444850096, 0444850104, 9780444850102, 0444851933, 9780444851932.
[15] (2002) A coding theory framework for genetic sequence analysis.
[16] (1986) Knapsack-type cryptosystems and algebraic coding theory. Problems Control Inform. Theory/Problemy Upravlen. Teor. Inform. 15 (2), pp. 159–166. ISSN 0370-2529.
[17] (1979) How to share a secret. Comm. ACM 22 (11), pp. 612–613. ISSN 0001-0782.
[18] (1948-07) A mathematical theory of communication. The Bell System Technical Journal 27 (3), pp. 379–423.
[19] (1997) The intractability of computing the minimum distance of a code. IEEE Trans. Inform. Theory 43 (6), pp. 1757–1766. ISSN 0018-9448.
[20] (1996) Integral Hecke modules, integral generalized Reed-Muller codes, and linear codes. Berichte des Forschungsschwerpunktes Informations- und Kommunikationstechnik, Techn. Univ. Hamburg-Harburg.